Prompt Caching & KV Cache Architecture
PagedAttention, prefix caching, and speculative decoding
Key Insight
PagedAttention solved the fragmentation problem: KV cache blocks are allocated like memory pages, eliminating internal fragmentation and enabling a 2-4x throughput improvement.
Request Journey
How It Works
1. Incoming request with system prompt + user message arrives at the inference server
2. Radix tree-based prefix matcher checks whether the system prompt (or any prefix) has been seen before and its KV cache blocks are still in GPU memory
3. On a cache hit, the cached KV blocks are reused: prefill computation is skipped for the matched prefix, saving significant GPU compute
4. On a cache miss, full prefill runs: all prompt tokens are processed through the transformer stack to compute K and V tensors for every layer
5. PagedAttention allocates KV cache in fixed-size blocks (like OS virtual memory pages) via a block allocator; nothing is pre-allocated to the maximum sequence length, eliminating internal fragmentation
6. A block table maps logical KV positions to physical GPU HBM addresses, enabling non-contiguous memory storage
7. During autoregressive decoding, each new token's KV pair is appended to the next available block; new blocks are allocated on demand
8. Speculative decoding uses a small draft model to predict N tokens ahead, then the main model verifies all N in a single forward pass
9. When GPU memory is full, LRU eviction removes the least recently used prefix cache blocks, prioritizing frequently reused system prompts
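Steps 1-4 of the journey can be sketched as a toy prefix cache. All names here are hypothetical, block allocation stands in for actual prefill compute, and only full 16-token blocks are cached (as vLLM's automatic prefix caching does):

```python
BLOCK = 16  # tokens per KV block (a typical vLLM default)

prefix_cache = {}               # tuple(prompt prefix) -> list of block ids
free_blocks = list(range(1024))  # toy pool of physical block ids

def get_kv_blocks(prompt):
    """Reuse cached blocks for the longest block-aligned prefix; 'prefill'
    (modeled here as plain block allocation) runs only on the unmatched tail."""
    matched, cached = 0, []
    # scan from the longest block-aligned prefix down to one block
    for end in range(len(prompt) - len(prompt) % BLOCK, 0, -BLOCK):
        key = tuple(prompt[:end])
        if key in prefix_cache:
            matched, cached = end, prefix_cache[key]
            break
    tail = len(prompt) - matched
    new = [free_blocks.pop() for _ in range(-(-tail // BLOCK))]  # ceil division
    blocks = cached + new
    # register every full-block prefix of this prompt for future reuse
    for end in range(BLOCK, len(prompt) + 1, BLOCK):
        prefix_cache[tuple(prompt[:end])] = blocks[:end // BLOCK]
    return blocks, matched

system = list(range(64))                       # shared 64-token system prompt
blocks1, hit1 = get_kv_blocks(system + [100, 101])
blocks2, hit2 = get_kv_blocks(system + [200, 201])
print(hit1, hit2)  # 0 64 -- second request reuses the system prompt's blocks
```

The second request skips "prefill" for all 64 system-prompt tokens and allocates only one fresh block for its two-token tail, which is exactly the TTFT saving the journey describes.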
The Problem
Naive LLM serving pre-allocates KV cache memory for the maximum sequence length per request. With a 2048-token limit, a typical 100-token output wastes 95% of its allocation. When serving thousands of concurrent requests, this fragmentation limits a GPU to 10-20 simultaneous requests despite having capacity for 100+. Shared system prompts (1,000-5,000 tokens) are redundantly recomputed for every request, wasting prefill GPU cycles.
The Solution
PagedAttention from vLLM manages the KV cache using a virtual memory paging metaphor: the cache is divided into fixed-size blocks (pages) that are allocated on demand and freed when a request completes. Blocks are mapped to physical GPU memory via a block table, just like OS virtual memory. This eliminates internal and external fragmentation, enabling the GPU to run 2-4x more concurrent requests with the same memory.
Scale at a Glance
- PagedAttention throughput gain: 2-4x vs. naive serving
- Prefix cache TTFT savings: 40-60%
- Speculative decoding gain: 3-5x throughput
- GPU utilization (vLLM): 60-80%
Deep Dive
The KV Cache Memory Problem
Without PagedAttention, each request pre-allocates a KV cache sized to the maximum sequence length at request start. A request limited to 2048 tokens allocates memory for 2048 tokens even if it only uses 100: 95% waste. When many requests are in flight simultaneously, this fragmentation means the GPU can serve only 10-20 concurrent requests despite having enough memory for 100. The result: poor GPU utilization and high queuing latency.
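The waste is easy to quantify. The model dimensions below are illustrative (roughly a 13B-class dense model in fp16), not taken from any specific model card:

```python
# Per-token KV cache footprint: 2 (K and V) * layers * KV heads * head_dim * bytes/elem
# Dimensions are illustrative assumptions, not exact figures for a real model.
layers, kv_heads, head_dim, dtype_bytes = 40, 40, 128, 2
per_token = 2 * layers * kv_heads * head_dim * dtype_bytes
print(per_token // 1024, "KiB of KV cache per token")  # 800 KiB

max_len, used = 2048, 100
reserved = max_len * per_token          # pre-allocated at request start
wasted = 1 - used / max_len
print(f"{reserved / 2**20:.0f} MiB reserved, {wasted:.0%} never used")  # 1600 MiB, 95%
```

At 1.6 GiB of reserved cache per request, a GPU with ~30 GiB of free HBM after weights admits only ~18 concurrent requests, matching the 10-20 range above.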
PagedAttention: Virtual Memory for KV Cache
PagedAttention divides the KV cache into fixed-size blocks (typically 16 tokens each). A block table maps each request's logical blocks to physical memory blocks, just like OS virtual memory page tables. Blocks are allocated on demand as the sequence grows and freed immediately upon completion. This eliminates pre-allocation waste: a 100-token response uses only 7 blocks instead of 128. GPU utilization jumps from 20-30% to 60-80%.
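A minimal sketch of the block-table mechanism (hypothetical class names, simplified far below vLLM's real block manager) shows where the 7-vs-128 figure comes from:

```python
import math

BLOCK = 16  # tokens per KV block

class BlockTable:
    """Maps a sequence's logical block index -> physical block id (toy sketch)."""
    def __init__(self, allocator):
        self.allocator = allocator
        self.physical = []      # logical index i -> physical block id
        self.num_tokens = 0

    def append_token(self):
        # allocate a new physical block only when the current one is full
        if self.num_tokens % BLOCK == 0:
            self.physical.append(self.allocator.pop())
        self.num_tokens += 1

free = list(range(10_000))      # pool of physical blocks in GPU HBM
table = BlockTable(free)
for _ in range(100):            # a 100-token sequence
    table.append_token()

print(len(table.physical))          # 7 blocks actually allocated
print(math.ceil(2048 / BLOCK))      # 128 blocks under max-length pre-allocation
```

Because the physical ids need not be contiguous, freed blocks from completed requests are reusable immediately, which is what removes external fragmentation.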
Prefix Caching: Sharing System Prompts
Many applications prepend the same system prompt (1,000-5,000 tokens) to every user message. Without prefix caching, each request recomputes the KV cache for this shared prefix, wasting computation and memory. Prefix caching stores the KV blocks for shared prefixes and reuses them across requests. For a 4,096-token system prompt, this saves ~60% of the TTFT (time-to-first-token) for every request. Anthropic exposes this as a pricing feature: cached input tokens cost 90% less.
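A back-of-envelope estimate shows where the latency and cost savings come from. The prompt lengths, linear-prefill assumption, and prices here are all illustrative assumptions, not quoted figures:

```python
# Assume prefill time scales roughly linearly with prompt length
# (a simplification; attention cost actually grows superlinearly).
system_tokens, user_tokens = 4096, 2000
prefill_skipped = system_tokens / (system_tokens + user_tokens)
print(f"{prefill_skipped:.0%} of prefill tokens skipped on a cache hit")  # 67%

# Pricing effect, assuming cached input tokens are billed at 10% of base
base_price = 3.00        # $/Mtok, illustrative
cached_discount = 0.90
cost_uncached = (system_tokens + user_tokens) / 1e6 * base_price
cost_cached = (system_tokens * (1 - cached_discount) + user_tokens) / 1e6 * base_price
print(f"${cost_uncached:.5f} -> ${cost_cached:.5f} per request")
```

Note the flip side named under Weaknesses below: the cache key is the exact token sequence, so inserting even one token at the front of the system prompt (e.g. a timestamp) invalidates every cached block after it.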
Speculative Decoding: Draft + Verify
LLM decoding is serial: one token per forward pass of the full model. Speculative decoding breaks this bottleneck: a small draft model (e.g., a 1B-parameter model) cheaply generates K candidate tokens, then the large target model verifies all K in a single forward pass. Tokens are accepted up to the first position where the target disagrees with the draft; even on a full rejection, the target's own prediction still yields one token, so output quality matches the target model exactly. For highly predictable outputs (code completion, structured responses), speculative decoding achieves a 3-5x throughput improvement.
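The accept-up-to-first-disagreement rule can be shown with toy models. This sketch uses the simplified greedy-matching variant (production systems accept via rejection sampling over the two models' distributions), and all function names are hypothetical:

```python
def speculative_step(draft_next, target_next, context, k=4):
    """One draft-and-verify round (greedy variant): the draft proposes k tokens,
    the target checks them, and tokens are kept up to the first disagreement,
    plus one token from the target itself."""
    proposal, ctx = [], list(context)
    for _ in range(k):                 # k cheap draft forward passes
        t = draft_next(ctx)
        proposal.append(t)
        ctx.append(t)
    # In a real system one target forward pass scores all k positions at once;
    # here we simulate that verification position by position.
    accepted, ctx = [], list(context)
    for t in proposal:
        if target_next(ctx) == t:
            accepted.append(t)
            ctx.append(t)
        else:
            break                      # first disagreement truncates the draft
    accepted.append(target_next(ctx))  # the target's own token is always kept
    return accepted

# Toy models: the target continues 0, 1, 2, ...; the draft agrees until token 3
target = lambda ctx: len(ctx)
draft = lambda ctx: len(ctx) if len(ctx) < 3 else 99
print(speculative_step(draft, target, [], k=4))  # [0, 1, 2, 3]
```

Four tokens emerge from one (simulated) target pass instead of four, which is the whole speedup; when the draft rarely agrees, each round still costs a full target pass but yields only one token, which is when speculation fails to help.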
Continuous Batching and Scheduling
vLLM's scheduler manages the block table and continuous batching together. When a new request arrives, the scheduler checks if there are free blocks for the prompt plus estimated completion. If GPU memory is tight, it can preempt (swap to CPU) low-priority requests. The iteration-level scheduler (run after every token) can add new requests to the batch whenever a sequence completes, maximizing GPU utilization without sacrificing latency for in-progress requests.
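Iteration-level scheduling can be contrasted with static batching in a few lines. This is a toy model (hypothetical names, one token per step, no preemption or memory accounting):

```python
from collections import deque

def continuous_batch(requests, max_batch=2):
    """Iteration-level scheduling sketch: after every decode step, finished
    sequences free their slot and waiting requests join immediately."""
    waiting = deque(requests)          # (request_id, tokens_to_generate)
    running, steps, done = [], 0, []
    while waiting or running:
        while waiting and len(running) < max_batch:   # admit new work mid-flight
            running.append(list(waiting.popleft()))
        steps += 1                     # one fused forward pass for the whole batch
        for seq in running:
            seq[1] -= 1                # each running sequence emits one token
        done += [seq[0] for seq in running if seq[1] == 0]
        running = [seq for seq in running if seq[1] > 0]
    return steps, done

steps, order = continuous_batch([("a", 2), ("b", 5), ("c", 2)])
print(steps, order)  # 5 ['a', 'c', 'b']
```

Static batching with the same batch size of 2 would run ("a", "b") for 5 steps (padded to the longest sequence) and then "c" for 2 more, 7 steps total; continuous batching finishes in 5 because "c" takes over "a"'s slot the moment it frees.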
Architecture Diagram
Prompt Caching & KV Cache Architecture: simplified architecture overview
Core Concepts
KV Cache
PagedAttention
Prefix Caching
Speculative Decoding
Continuous Batching
GPU Memory Management
Tradeoffs & Design Decisions
Every architectural decision is a tradeoff. Here's what you gain and what you give up.
Strengths
- PagedAttention eliminates KV cache fragmentation, doubling or tripling GPU request concurrency
- Prefix caching reduces TTFT by 40-60% for applications with shared system prompts
- Speculative decoding provides 3-5x throughput improvement for predictable outputs
- Continuous batching maximizes GPU utilization across variable-length requests
Weaknesses
- Block table management adds CPU overhead per generated token, which becomes problematic at very high throughput
- Speculative decoding requires running two models simultaneously, consuming additional GPU memory and compute
- Prefix caching only helps when prompts are identical; slight variations break cache hits
- Preemption (swapping KV cache to CPU) adds significant latency to preempted requests
FAANG Interview Questions
Interview Prep: These questions appear in FAANG system design rounds. Focus on tradeoffs, not just what the system does.
These are real system design interview questions asked at Google, Meta, Amazon, Apple, Netflix, and Microsoft. Study the architecture above before attempting.
- Q1: Explain PagedAttention. How does it solve the KV cache memory fragmentation problem?
- Q2: Design a KV cache management system for a serving cluster with 10 A100 GPUs running a 70B model.
- Q3: How does prefix caching work? What applications benefit most, and how would you implement it?
- Q4: Explain speculative decoding. What types of outputs benefit most, and when does it fail to provide speedup?
- Q5: Compare vLLM continuous batching to static batching. Why is continuous batching strictly better for production LLM serving?
Research Papers & Further Reading
Kwon, W., et al. "Efficient Memory Management for Large Language Model Serving with PagedAttention." SOSP 2023 (the vLLM paper).