GPT / Transformer Inference Architecture
KV cache, FlashAttention, quantization, and batching at scale
Key Insight
LLM inference is memory-bandwidth bound, not compute-bound: moving weights from GPU HBM to on-chip registers is the bottleneck, not the matrix multiplications themselves.
Request Journey
How It Works
① Input prompt is tokenized via BPE into token IDs and passed through the embedding layer (token + rotary position embeddings)
② Prefill phase: all prompt tokens are processed in parallel through the transformer stack — FlashAttention computes Q·K^T in fused SRAM tiles, 3x less HBM IO than standard attention
③ KV cache stores the computed Key and Value tensors for every layer so they are not recomputed during generation
④ Decode phase: model generates one token at a time autoregressively — each step reads the full KV cache but only computes attention for the new token
⑤ Tensor parallelism shards weight matrices across GPUs; each GPU computes a slice of every layer, synchronized via AllReduce over NVLink after each layer
⑥ Continuous batching inserts new requests into the batch mid-generation, maximizing GPU utilization
⑦ LM Head projects hidden states to vocabulary logits; top-p/temperature sampling selects the next token
⑧ Generated token is fed back into the embedding layer for the next decode step until EOS or max length
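The prefill-then-decode loop above can be sketched end to end in a few lines. This is a toy, single-layer, single-head model with greedy sampling; all names (`embed` table `W_e`, fused `W_qkv` projection, `W_lm` head) and shapes are illustrative, not any real model's layout.

```python
import numpy as np

d, vocab = 8, 16
rng = np.random.default_rng(0)
W_e = rng.normal(size=(vocab, d))     # embedding table
W_qkv = rng.normal(size=(d, 3 * d))   # fused Q/K/V projection
W_lm = rng.normal(size=(d, vocab))    # LM head

def attn(q, K, V):
    # scaled dot-product attention of one new query against cached K/V
    s = K @ q / np.sqrt(d)
    p = np.exp(s - s.max()); p /= p.sum()
    return p @ V

def step(tok, K_cache, V_cache):
    h = W_e[tok]                          # embedding lookup
    q, k, v = np.split(h @ W_qkv, 3)
    K_cache.append(k); V_cache.append(v)  # grow the KV cache
    h = attn(q, np.array(K_cache), np.array(V_cache))
    return int(np.argmax(h @ W_lm))       # greedy "sampling"

prompt = [1, 4, 2]
K, V = [], []
for t in prompt:          # prefill (steps 1-3; sequential here for clarity,
    nxt = step(t, K, V)   # real engines process all prompt tokens in parallel)
out = []
for _ in range(5):        # decode (step 4): one token per forward pass
    nxt = step(nxt, K, V)
    out.append(nxt)
```

Note that the KV cache (`K`, `V`) holds one entry per processed token, so the decode loop only ever computes attention for the newest token.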
⚠The Problem
A transformer model with 70B parameters has weights totaling ~140GB in FP16, more than a single 80GB A100 can hold, so even fitting the model requires two GPUs or quantization. Generating each output token requires streaming all of those weights through GPU memory bandwidth, and a naive deployment produces only 20-30 tokens/second per sequence, far too slow and expensive to serve millions of users.
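The bandwidth bound can be checked with back-of-envelope arithmetic. Assumed numbers: 70B parameters at 2 bytes each (FP16) and ~2.0 TB/s of HBM bandwidth (roughly an 80GB A100); real throughput is further reduced by KV-cache reads and kernel overheads.

```python
params = 70e9
bytes_per_param = 2                              # FP16
weight_bytes = params * bytes_per_param          # ~140 GB
hbm_bandwidth = 2.0e12                           # bytes/s, assumed A100-class

# Each decode step must stream every weight from HBM once,
# so the bandwidth-limited ceiling per GPU is:
seconds_per_token = weight_bytes / hbm_bandwidth
tokens_per_second = 1 / seconds_per_token
print(f"{weight_bytes/1e9:.0f} GB weights -> {tokens_per_second:.0f} tok/s ceiling")
# -> 140 GB weights -> 14 tok/s ceiling
```

Sharding across two GPUs roughly doubles the aggregate bandwidth, which is where the 20-30 tokens/second figure comes from.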
✓The Solution
LLM inference engineering attacks the memory-bandwidth bottleneck through KV caching (skip recomputing prompt attention), FlashAttention (IO-aware attention reducing memory traffic 3x), quantization (INT8/INT4 reducing weights 2-4x), and continuous batching (process many user requests in one GPU forward pass). Together, these optimizations achieve 10-50x throughput over a naive baseline.
📊Scale at a Glance
~140GB FP16
70B Model Memory
3x vs. naive
FlashAttn Speedup
4x smaller
INT4 Compression
2-4x throughput
Continuous Batch Gain
🔬Deep Dive
The KV Cache: Avoiding Redundant Computation
During autoregressive generation, each new token attends to all previous tokens. Without caching, this requires recomputing attention over the entire prompt for every output token — quadratic work per sequence. The KV cache stores the key and value tensors for all previously computed tokens. Only the new token needs to run through the transformer; previous tokens' K/V tensors are loaded from cache. This reduces generation from O(n^2) to O(n) per token.
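The cache's memory cost follows directly from the model shape: two tensors (K and V) per layer per token. The dimensions below are assumptions in the style of a Llama-2-70B-class model (80 layers, grouped-query attention with 8 KV heads of dimension 128); the exact total depends on head layout and cache precision, but it lands in the tens of gigabytes at long context.

```python
# KV-cache size = 2 (K and V) x layers x kv_dim x bytes x tokens
layers, kv_heads, head_dim = 80, 8, 128   # assumed GQA layout
bytes_fp16 = 2
per_token = 2 * layers * kv_heads * head_dim * bytes_fp16   # bytes per token
gb_100k = per_token * 100_000 / 1e9
print(f"{per_token/1e6:.2f} MB/token, {gb_100k:.0f} GB at 100K tokens")
```

Without grouped-query attention (64 full KV heads instead of 8), the same arithmetic gives 8x more, which is why GQA and FP8 KV caches matter so much for long-context serving.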
FlashAttention: IO-Aware Attention
Standard attention materializes the full N x N attention matrix in GPU HBM (high-bandwidth memory), then reads it back for softmax and value aggregation. For a 2048-token sequence, this matrix is 2048 x 2048 x 2 bytes = 8MB for each attention head in each layer — reading and writing it dominates runtime. FlashAttention fuses all attention operations into a single kernel using SRAM tiling, never materializing the full matrix. Result: 3x faster attention, 10x less memory.
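The numerical trick that makes tiling possible is the "online softmax": process K/V in blocks while carrying a running max and running denominator, so no full score row ever exists at once. A NumPy sketch of the numerics (the real kernel does this per SRAM tile, on-chip, in CUDA):

```python
import numpy as np

def tiled_attention(q, K, V, block=64):
    # Online-softmax attention of one query against K/V, one tile at a time.
    d = q.shape[0]
    m = -np.inf          # running max of scores seen so far
    l = 0.0              # running softmax denominator
    acc = np.zeros(d)    # running weighted sum of values
    for i in range(0, K.shape[0], block):
        s = K[i:i + block] @ q / np.sqrt(d)   # scores for this tile only
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)             # rescale earlier accumulators
        p = np.exp(s - m_new)
        l = l * scale + p.sum()
        acc = acc * scale + p @ V[i:i + block]
        m = m_new
    return acc / l

# check against the naive "materialize everything" version
rng = np.random.default_rng(0)
q = rng.normal(size=8)
K, V = rng.normal(size=(200, 8)), rng.normal(size=(200, 8))
s = K @ q / np.sqrt(8)
ref = (np.exp(s - s.max()) / np.exp(s - s.max()).sum()) @ V
assert np.allclose(tiled_attention(q, K, V), ref)
```

The rescaling by `exp(m - m_new)` is what lets earlier tiles be folded in exactly, despite the softmax normalizer not being known until the end.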
Quantization: Shrinking the Weights
INT8 quantization represents model weights with 8-bit integers instead of 16-bit floats — halving memory and bandwidth requirements. INT4 (used in GPTQ, AWQ) goes further to 4 bits — 4x compression. The challenge is minimizing accuracy loss: weight distribution is non-uniform, so naive rounding causes significant degradation. GPTQ uses second-order gradient information to find optimal quantization points. INT4 models typically show less than 1% quality loss on benchmarks.
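The baseline scheme is simple enough to show directly: symmetric per-channel absmax quantization, where each output channel gets one FP scale. This is the naive rounding the paragraph above contrasts with; GPTQ and AWQ improve on it considerably.

```python
import numpy as np

def quantize_int8(W):
    # one scale per output channel (row), symmetric around zero
    scale = np.abs(W).max(axis=1, keepdims=True) / 127.0
    q = np.clip(np.round(W / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 256)).astype(np.float32)
q, s = quantize_int8(W)

err = np.abs(dequantize(q, s) - W).max()   # worst-case rounding error
# storage drops from 2 bytes/weight (FP16) to 1 byte/weight
assert q.nbytes == W.size
```

The weakness of absmax is visible in the formula: a single outlier weight in a row stretches the scale and wastes precision on every other weight in that row, which is exactly the non-uniformity problem GPTQ's second-order approach addresses.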
Continuous Batching: Maximizing GPU Utilization
Static batching waits for a batch of N requests to start together and finishes them together — GPU sits idle waiting for the slowest sequence. Continuous batching (pioneered by Orca and vLLM) dynamically adds new requests to the batch as slots free up from completed sequences. A GPU always has a full batch of tokens to process. This increases throughput by 2-4x for typical production traffic distributions.
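A toy scheduler makes the difference concrete. Here each "step" stands in for one batched decode pass; the key behavior is that a finished sequence frees its slot immediately and a waiting request backfills it mid-generation. This is a counting sketch only, with no real memory management.

```python
from collections import deque

def serve(requests, max_batch):
    # requests: list of (req_id, tokens_to_generate)
    waiting = deque(requests)
    running = {}                  # req_id -> tokens still to generate
    steps = 0
    while waiting or running:
        # backfill free slots mid-generation (the "continuous" part)
        while waiting and len(running) < max_batch:
            rid, n = waiting.popleft()
            running[rid] = n
        steps += 1                # one batched forward pass over all running seqs
        for rid in list(running):
            running[rid] -= 1
            if running[rid] == 0:
                del running[rid]  # slot freed this step
    return steps

# 4 slots, one long request (8 tokens) and five short ones (2 tokens each):
print(serve([("a", 8), ("b", 2), ("c", 2), ("d", 2), ("e", 2), ("f", 2)],
            max_batch=4))        # -> 8 steps
```

Static batching on the same workload would run two full batches, 8 + 2 = 10 steps, because the short sequences in batch one hold their slots idle while the 8-token request finishes.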
Speculative Decoding: Parallel Token Generation
Standard LLM generation is serial: one token per forward pass. Speculative decoding uses a small draft model (e.g., 7B) to predict K tokens ahead, then verifies all K with the target model in a single forward pass. Draft tokens are accepted up to the first position where the target model disagrees; in the best case all K are accepted — K tokens generated in the time of 1. For highly predictable completions (code, structured data), 3-5x speedups are achievable.
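The acceptance rule for the greedy variant fits in a few lines. This sketch assumes both models decode greedily; the general sampling version uses an accept/reject probability ratio instead, but the prefix-acceptance shape is the same.

```python
def speculative_step(draft_tokens, target_tokens):
    # draft_tokens:  k tokens proposed by the small draft model
    # target_tokens: the target model's own greedy choice at each of
    #                those k positions (all from ONE batched forward pass)
    accepted = []
    for d, t in zip(draft_tokens, target_tokens):
        if d != t:
            accepted.append(t)   # take the target's correction, then stop
            break
        accepted.append(d)       # agreement: keep the draft token for free
    return accepted

# draft gets 3 of 4 right: 4 tokens emitted for one target-model pass
print(speculative_step([5, 9, 2, 7], [5, 9, 2, 4]))   # -> [5, 9, 2, 4]
```

Output is guaranteed identical to what the target model alone would have produced; the speedup depends entirely on how often the draft agrees, which is why predictable text helps and open-ended prose can erase the gains.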
⬡Architecture Diagram
GPT / Transformer Inference Architecture — simplified architecture overview
✦Core Concepts
KV Cache
FlashAttention
Quantization (INT8/INT4)
Continuous Batching
Tensor Parallelism
Speculative Decoding
⚖Tradeoffs & Design Decisions
Every architectural decision is a tradeoff. Here's what you gain and what you give up.
✓ Strengths
- ✓KV cache eliminates quadratic attention recomputation, making long-context generation practical
- ✓INT4 quantization achieves 4x memory reduction with less than 1% quality loss on most tasks
- ✓Continuous batching increases GPU utilization from ~30% to 70-90% on production traffic
- ✓FlashAttention makes 100K+ token context windows tractable on modern GPUs
✗ Weaknesses
- ✗KV cache grows linearly with sequence length — a 100K token context requires ~20GB of KV cache for a 70B model
- ✗INT4 quantization quality degradation is task-dependent and may be unacceptable for reasoning-heavy applications
- ✗Speculative decoding requires running two models and adds complexity; gains vary significantly by output type
- ✗Continuous batching scheduler complexity: different requests at different sequence lengths require careful memory management
🎯FAANG Interview Questions
💡 These are real system design interview questions asked at Google, Meta, Amazon, Apple, Netflix, and Microsoft. Study the architecture above first, and focus on tradeoffs, not just what the system does.
- Q1
Explain why LLM inference is memory-bandwidth bound rather than compute-bound. What does this mean for optimization strategy?
- Q2
How does the KV cache work? What are its memory implications for long-context models?
- Q3
Design a serving system for a 70B LLM that must handle 1,000 concurrent users with p95 latency under 3 seconds.
- Q4
Explain speculative decoding. Under what conditions does it provide the biggest speedup, and when does it fail?
- Q5
Compare INT8 and INT4 quantization. What are the tradeoffs, and how does GPTQ minimize quality loss?