Prompt Caching & KV Cache Architecture
PagedAttention, prefix caching, and speculative decoding
Key Insight
PagedAttention solved the fragmentation problem: KV cache blocks are allocated like memory pages, eliminating internal fragmentation and enabling a 2-4x throughput improvement.
Request Journey
How It Works
1. Incoming request with system prompt + user message arrives at the inference server
2. Radix tree-based prefix matcher checks whether the system prompt (or any prefix) has been seen before and its KV cache blocks are still in GPU memory
3. On a cache hit, the cached KV blocks are reused: prefill computation is skipped for the matched prefix, saving significant GPU compute
4. On a cache miss, full prefill runs: all prompt tokens are processed through the transformer stack to compute K and V tensors for every layer
5. PagedAttention allocates KV cache in fixed-size blocks (like OS virtual memory pages) via a block allocator; nothing is pre-allocated to the maximum sequence length, eliminating internal fragmentation
6. A block table maps logical KV positions to physical GPU HBM addresses, enabling non-contiguous memory storage
7. During autoregressive decoding, each new token's KV pair is appended to the next available block; new blocks are allocated on demand
8. Speculative decoding uses a small draft model to predict N tokens ahead, then the main model verifies all N in a single forward pass
9. When GPU memory is full, LRU eviction removes the least recently used prefix cache blocks, prioritizing frequently reused system prompts
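Steps 1-4 of the journey can be sketched as a toy prefix cache. All names here are hypothetical, block allocation stands in for actual prefill compute, and only full 16-token blocks are cached (as vLLM's automatic prefix caching does):

```python
BLOCK = 16  # tokens per KV block (a typical vLLM default)

prefix_cache = {}               # tuple(prompt prefix) -> list of block ids
free_blocks = list(range(1024))  # toy pool of physical block ids

def get_kv_blocks(prompt):
    """Reuse cached blocks for the longest block-aligned prefix; 'prefill'
    (modeled here as plain block allocation) runs only on the unmatched tail."""
    matched, cached = 0, []
    # scan from the longest block-aligned prefix down to one block
    for end in range(len(prompt) - len(prompt) % BLOCK, 0, -BLOCK):
        key = tuple(prompt[:end])
        if key in prefix_cache:
            matched, cached = end, prefix_cache[key]
            break
    tail = len(prompt) - matched
    new = [free_blocks.pop() for _ in range(-(-tail // BLOCK))]  # ceil division
    blocks = cached + new
    # register every full-block prefix of this prompt for future reuse
    for end in range(BLOCK, len(prompt) + 1, BLOCK):
        prefix_cache[tuple(prompt[:end])] = blocks[:end // BLOCK]
    return blocks, matched

system = list(range(64))                       # shared 64-token system prompt
blocks1, hit1 = get_kv_blocks(system + [100, 101])
blocks2, hit2 = get_kv_blocks(system + [200, 201])
print(hit1, hit2)  # 0 64 -- second request reuses the system prompt's blocks
```

The second request skips "prefill" for all 64 system-prompt tokens and allocates only one fresh block for its two-token tail, which is exactly the TTFT saving the journey describes.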
The Problem
Naive LLM serving pre-allocates KV cache memory for the maximum sequence length per request. With a 2048-token limit, a typical 100-token output wastes 95% of its allocation. When serving thousands of concurrent requests, this fragmentation limits a GPU to 10-20 simultaneous requests despite having capacity for 100+. Shared system prompts (1,000-5,000 tokens) are redundantly recomputed for every request, wasting prefill GPU cycles.
The Solution
PagedAttention from vLLM manages the KV cache using a virtual memory paging metaphor: the cache is divided into fixed-size blocks (pages) that are allocated on demand and freed when a request completes. Blocks are mapped to physical GPU memory via a block table, just like OS virtual memory. This eliminates internal and external fragmentation, enabling the GPU to run 2-4x more concurrent requests with the same memory.
Scale at a Glance
- PagedAttention throughput gain: 2-4x vs. naive serving
- Prefix cache TTFT savings: 40-60%
- Speculative decoding gain: 3-5x throughput
- GPU utilization (vLLM): 60-80%
Deep Dive
The KV Cache Memory Problem
Without PagedAttention, each request pre-allocates a KV cache sized to the maximum sequence length at request start. A request limited to 2048 tokens allocates memory for 2048 tokens even if it only uses 100: 95% waste. When many requests are in flight simultaneously, this fragmentation means the GPU can serve only 10-20 concurrent requests despite having enough memory for 100. The result: poor GPU utilization and high queuing latency.
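The waste is easy to quantify. The model dimensions below are illustrative (roughly a 13B-class dense model in fp16), not taken from any specific model card:

```python
# Per-token KV cache footprint: 2 (K and V) * layers * KV heads * head_dim * bytes/elem
# Dimensions are illustrative assumptions, not exact figures for a real model.
layers, kv_heads, head_dim, dtype_bytes = 40, 40, 128, 2
per_token = 2 * layers * kv_heads * head_dim * dtype_bytes
print(per_token // 1024, "KiB of KV cache per token")  # 800 KiB

max_len, used = 2048, 100
reserved = max_len * per_token          # pre-allocated at request start
wasted = 1 - used / max_len
print(f"{reserved / 2**20:.0f} MiB reserved, {wasted:.0%} never used")  # 1600 MiB, 95%
```

At 1.6 GiB of reserved cache per request, a GPU with ~30 GiB of free HBM after weights admits only ~18 concurrent requests, matching the 10-20 range above.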
PagedAttention: Virtual Memory for KV Cache
PagedAttention divides the KV cache into fixed-size blocks (typically 16 tokens each). A block table maps each request's logical blocks to physical memory blocks, just like OS virtual memory page tables. Blocks are allocated on demand as the sequence grows and freed immediately upon completion. This eliminates pre-allocation waste: a 100-token response uses only 7 blocks instead of 128. GPU utilization jumps from 20-30% to 60-80%.
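A minimal sketch of the block-table mechanism (hypothetical class names, simplified far below vLLM's real block manager) shows where the 7-vs-128 figure comes from:

```python
import math

BLOCK = 16  # tokens per KV block

class BlockTable:
    """Maps a sequence's logical block index -> physical block id (toy sketch)."""
    def __init__(self, allocator):
        self.allocator = allocator
        self.physical = []      # logical index i -> physical block id
        self.num_tokens = 0

    def append_token(self):
        # allocate a new physical block only when the current one is full
        if self.num_tokens % BLOCK == 0:
            self.physical.append(self.allocator.pop())
        self.num_tokens += 1

free = list(range(10_000))      # pool of physical blocks in GPU HBM
table = BlockTable(free)
for _ in range(100):            # a 100-token sequence
    table.append_token()

print(len(table.physical))          # 7 blocks actually allocated
print(math.ceil(2048 / BLOCK))      # 128 blocks under max-length pre-allocation
```

Because the physical ids need not be contiguous, freed blocks from completed requests are reusable immediately, which is what removes external fragmentation.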
Prefix Caching: Sharing System Prompts
Many applications prepend the same system prompt (1,000-5,000 tokens) to every user message. Without prefix caching, each request recomputes the KV cache for this shared prefix, wasting computation and memory. Prefix caching stores the KV blocks for shared prefixes and reuses them across requests. For a 4,096-token system prompt, this saves ~60% of the TTFT (time-to-first-token) for every request. Anthropic exposes this as a pricing feature: cached input tokens cost 90% less.
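A back-of-envelope estimate shows where the latency and cost savings come from. The prompt lengths, linear-prefill assumption, and prices here are all illustrative assumptions, not quoted figures:

```python
# Assume prefill time scales roughly linearly with prompt length
# (a simplification; attention cost actually grows superlinearly).
system_tokens, user_tokens = 4096, 2000
prefill_skipped = system_tokens / (system_tokens + user_tokens)
print(f"{prefill_skipped:.0%} of prefill tokens skipped on a cache hit")  # 67%

# Pricing effect, assuming cached input tokens are billed at 10% of base
base_price = 3.00        # $/Mtok, illustrative
cached_discount = 0.90
cost_uncached = (system_tokens + user_tokens) / 1e6 * base_price
cost_cached = (system_tokens * (1 - cached_discount) + user_tokens) / 1e6 * base_price
print(f"${cost_uncached:.5f} -> ${cost_cached:.5f} per request")
```

Note the flip side named under Weaknesses below: the cache key is the exact token sequence, so inserting even one token at the front of the system prompt (e.g. a timestamp) invalidates every cached block after it.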
Speculative Decoding: Draft + Verify
LLM decoding is serial: one token per forward pass of the full model. Speculative decoding breaks this bottleneck: a small draft model (e.g., a 1B-parameter model) cheaply generates K candidate tokens, then the large target model verifies all K in a single forward pass. Tokens are accepted up to the first position where the target disagrees with the draft; even on a full rejection, the target's own prediction still yields one token, so output quality matches the target model exactly. For highly predictable outputs (code completion, structured responses), speculative decoding achieves a 3-5x throughput improvement.
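The accept-up-to-first-disagreement rule can be shown with toy models. This sketch uses the simplified greedy-matching variant (production systems accept via rejection sampling over the two models' distributions), and all function names are hypothetical:

```python
def speculative_step(draft_next, target_next, context, k=4):
    """One draft-and-verify round (greedy variant): the draft proposes k tokens,
    the target checks them, and tokens are kept up to the first disagreement,
    plus one token from the target itself."""
    proposal, ctx = [], list(context)
    for _ in range(k):                 # k cheap draft forward passes
        t = draft_next(ctx)
        proposal.append(t)
        ctx.append(t)
    # In a real system one target forward pass scores all k positions at once;
    # here we simulate that verification position by position.
    accepted, ctx = [], list(context)
    for t in proposal:
        if target_next(ctx) == t:
            accepted.append(t)
            ctx.append(t)
        else:
            break                      # first disagreement truncates the draft
    accepted.append(target_next(ctx))  # the target's own token is always kept
    return accepted

# Toy models: the target continues 0, 1, 2, ...; the draft agrees until token 3
target = lambda ctx: len(ctx)
draft = lambda ctx: len(ctx) if len(ctx) < 3 else 99
print(speculative_step(draft, target, [], k=4))  # [0, 1, 2, 3]
```

Four tokens emerge from one (simulated) target pass instead of four, which is the whole speedup; when the draft rarely agrees, each round still costs a full target pass but yields only one token, which is when speculation fails to help.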
Continuous Batching and Scheduling
vLLM's scheduler manages the block table and continuous batching together. When a new request arrives, the scheduler checks if there are free blocks for the prompt plus estimated completion. If GPU memory is tight, it can preempt (swap to CPU) low-priority requests. The iteration-level scheduler (run after every token) can add new requests to the batch whenever a sequence completes, maximizing GPU utilization without sacrificing latency for in-progress requests.
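Iteration-level scheduling can be contrasted with static batching in a few lines. This is a toy model (hypothetical names, one token per step, no preemption or memory accounting):

```python
from collections import deque

def continuous_batch(requests, max_batch=2):
    """Iteration-level scheduling sketch: after every decode step, finished
    sequences free their slot and waiting requests join immediately."""
    waiting = deque(requests)          # (request_id, tokens_to_generate)
    running, steps, done = [], 0, []
    while waiting or running:
        while waiting and len(running) < max_batch:   # admit new work mid-flight
            running.append(list(waiting.popleft()))
        steps += 1                     # one fused forward pass for the whole batch
        for seq in running:
            seq[1] -= 1                # each running sequence emits one token
        done += [seq[0] for seq in running if seq[1] == 0]
        running = [seq for seq in running if seq[1] > 0]
    return steps, done

steps, order = continuous_batch([("a", 2), ("b", 5), ("c", 2)])
print(steps, order)  # 5 ['a', 'c', 'b']
```

Static batching with the same batch size of 2 would run ("a", "b") for 5 steps (padded to the longest sequence) and then "c" for 2 more, 7 steps total; continuous batching finishes in 5 because "c" takes over "a"'s slot the moment it frees.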
Architecture Diagram
Prompt Caching & KV Cache Architecture: simplified architecture overview
Core Concepts
KV Cache
PagedAttention
Prefix Caching
Speculative Decoding
Continuous Batching
GPU Memory Management
Tradeoffs & Design Decisions
Every architectural decision is a tradeoff. Here's what you gain and what you give up.
Strengths
- PagedAttention eliminates KV cache fragmentation, doubling or tripling GPU request concurrency
- Prefix caching reduces TTFT by 40-60% for applications with shared system prompts
- Speculative decoding provides 3-5x throughput improvement for predictable outputs
- Continuous batching maximizes GPU utilization across variable-length requests
Weaknesses
- Block table management adds CPU overhead per generated token, which becomes problematic at very high throughput
- Speculative decoding requires running two models simultaneously, consuming additional GPU memory and compute
- Prefix caching only helps when prompts are identical; slight variations break cache hits
- Preemption (swapping KV cache to CPU) adds significant latency to preempted requests
FAANG Interview Questions
Interview Prep: These questions appear in FAANG system design rounds. Focus on tradeoffs, not just what the system does.
These are real system design interview questions asked at Google, Meta, Amazon, Apple, Netflix, and Microsoft. Study the architecture above before attempting.
- Q1: Explain PagedAttention. How does it solve the KV cache memory fragmentation problem?
- Q2: Design a KV cache management system for a serving cluster with 10 A100 GPUs running a 70B model.
- Q3: How does prefix caching work? What applications benefit most, and how would you implement it?
- Q4: Explain speculative decoding. What types of outputs benefit most, and when does it fail to provide speedup?
- Q5: Compare vLLM continuous batching to static batching. Why is continuous batching strictly better for production LLM serving?
Research Papers & Further Reading
Kwon, W., et al. "Efficient Memory Management for Large Language Model Serving with PagedAttention." SOSP 2023 (the vLLM paper).