
GPT / Transformer Inference Architecture

KV cache, FlashAttention, quantization, and batching at scale

OpenAI · Anthropic · Google DeepMind

Key Insight

LLM inference is memory-bandwidth bound, not compute-bound: moving weights from GPU HBM to on-chip registers is the bottleneck, not the matrix multiplications themselves.


How It Works

1. Input prompt is tokenized via BPE into token IDs and passed through the embedding layer (token + rotary position embeddings).

2. Prefill phase: all prompt tokens are processed in parallel through the transformer stack — FlashAttention computes Q·K^T in fused SRAM tiles with roughly 3x less HBM I/O than standard attention.

3. KV cache stores the computed Key and Value tensors for every layer so they are not recomputed during generation.

4. Decode phase: the model generates one token at a time autoregressively — each step reads the full KV cache but only computes attention for the new token.

5. Tensor parallelism shards the weight matrices across GPUs; each GPU computes a slice of every layer, synchronized via AllReduce over NVLink after each layer.

6. Continuous batching inserts new requests into the batch mid-generation, maximizing GPU utilization.

7. The LM head projects hidden states to vocabulary logits; top-p/temperature sampling selects the next token.

8. The generated token is fed back into the embedding layer for the next decode step, until EOS or the maximum length is reached (a schematic version of this loop is sketched below).
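These eight steps compress into a short generation loop. The sketch below is schematic: `model` and `tokenizer` are hypothetical stand-ins for a real engine's API, not any particular library; only the nucleus-sampling helper is fully implemented.

```python
import numpy as np

def sample_top_p(logits, p=0.9, temperature=0.8):
    """Nucleus sampling: draw from the smallest token set with mass >= p."""
    z = logits / temperature
    probs = np.exp(z - z.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]                  # tokens by descending prob
    cutoff = np.searchsorted(np.cumsum(probs[order]), p) + 1
    keep = order[:cutoff]
    return int(np.random.choice(keep, p=probs[keep] / probs[keep].sum()))

# Schematic generation loop; `model.forward` and `tokenizer` are
# hypothetical stand-ins for a real inference engine's API.
def generate(model, tokenizer, prompt, max_new_tokens, eos_id):
    token_ids = tokenizer.encode(prompt)             # step 1: BPE tokenize

    # Prefill (steps 2-3): one parallel pass over all prompt tokens,
    # populating the per-layer KV cache as a side effect.
    logits, kv_cache = model.forward(token_ids, kv_cache=None)

    out = []
    for _ in range(max_new_tokens):                  # step 4: autoregressive decode
        next_id = sample_top_p(logits[-1])           # step 7: LM head + sampling
        if next_id == eos_id:                        # step 8: stop on EOS
            break
        out.append(next_id)
        # Feed back only the new token; attention reads cached K/V for the
        # whole history instead of recomputing it.
        logits, kv_cache = model.forward([next_id], kv_cache=kv_cache)
    return tokenizer.decode(out)
```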

The Problem

A transformer model with 70B parameters has weights totaling ~140GB in FP16 — more than a single 80GB A100 can even hold. Generating each output token requires streaming all of those weights through GPU memory bandwidth, so even sharded across several A100s an unoptimized deployment produces only 20-30 tokens/second per sequence, far too slow and expensive to serve millions of users.
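A back-of-the-envelope roofline check makes this concrete. The sketch below uses public A100 spec-sheet numbers and pretends the whole model sits on one device; real deployments shard across GPUs, which multiplies aggregate bandwidth and explains the 20-30 tokens/second figure above.

```python
# Back-of-the-envelope roofline: every decode step streams all weights
# from HBM at least once, so bandwidth sets the ceiling on tokens/second.
weights_gb = 70e9 * 2 / 1e9        # 70B params x 2 bytes (FP16) = 140 GB
hbm_bandwidth_gbs = 2039           # A100 80GB spec sheet: ~2 TB/s

ceiling = hbm_bandwidth_gbs / weights_gb
print(f"single-A100 decode ceiling: {ceiling:.1f} tokens/s")   # ~14.6

# Compute, by contrast, is nowhere near saturated: a decode step costs
# roughly 2 FLOPs per parameter, which an A100 (~312 TFLOPS FP16)
# finishes in well under a millisecond.
compute_ms = (2 * 70e9) / 312e12 * 1e3
print(f"matmul time per token: {compute_ms:.2f} ms")           # ~0.45 ms
```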

The Solution

LLM inference engineering attacks the memory-bandwidth bottleneck through KV caching (skipping recomputation of prompt attention), FlashAttention (IO-aware attention that cuts memory traffic ~3x), quantization (INT8/INT4 shrinking weights 2-4x), and continuous batching (processing many user requests in one GPU forward pass). Together, these optimizations achieve 10-50x throughput over a naive baseline.

📊 Scale at a Glance

  • 70B model weight memory: ~140GB in FP16
  • FlashAttention speedup: ~3x vs. naive attention
  • INT4 compression: 4x smaller weights
  • Continuous batching gain: 2-4x throughput

🔬 Deep Dive

1. The KV Cache: Avoiding Redundant Computation

During autoregressive generation, each new token attends to all previous tokens. Without caching, this requires recomputing attention over the entire prompt for every output token — quadratic work per sequence. The KV cache stores the key and value tensors for all previously computed tokens. Only the new token needs to run through the transformer; previous tokens' K/V tensors are loaded from cache. This reduces generation from O(n^2) to O(n) per token.
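A minimal single-head sketch of the mechanism in NumPy. Production engines keep one cache per layer and per head, preallocate paged blocks (vLLM's PagedAttention) rather than concatenating, and fuse all of this into custom kernels.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def decode_step(x_new, Wq, Wk, Wv, k_cache, v_cache):
    """One decode step for a single attention head.

    x_new:            (1, d) hidden state of the newly generated token
    k_cache, v_cache: (t, d) keys/values of all t previous tokens
    Only the new token's q/k/v are computed; history comes from the cache,
    so the step is O(t) instead of O(t^2).
    """
    q, k, v = x_new @ Wq, x_new @ Wk, x_new @ Wv
    k_cache = np.concatenate([k_cache, k])       # append, never recompute
    v_cache = np.concatenate([v_cache, v])
    d = q.shape[-1]
    attn = softmax(q @ k_cache.T / np.sqrt(d))   # (1, t+1): new token vs. all
    return attn @ v_cache, k_cache, v_cache
```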

2. FlashAttention: IO-Aware Attention

Standard attention materializes the full N x N attention matrix in GPU HBM (high-bandwidth memory), then reads it back for softmax and value aggregation. For a 2048-token sequence, this matrix is 2048 x 2048 x 2 bytes = 8MB per attention head, per layer — reading and writing it dominates runtime. FlashAttention fuses all attention operations into a single kernel using SRAM tiling, never materializing the full matrix. Result: roughly 3x faster attention and ~10x less memory traffic.
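The enabling trick is the online softmax: K/V are processed in tiles while a running maximum and normalizer are maintained, so the full matrix never exists. The NumPy sketch below shows the math for a single query vector; it is not the fused CUDA kernel, just a demonstration that tiling computes exact attention.

```python
import numpy as np

def flash_attention(q, K, V, tile=128):
    """Online-softmax attention for one query vector q against K, V in tiles.

    Mathematically identical to softmax(q K^T / sqrt(d)) V, but touches
    only `tile` rows of K/V at a time: the full row of attention scores
    is never stored.
    """
    d = q.shape[-1]
    m, l = -np.inf, 0.0                       # running max and normalizer
    acc = np.zeros(V.shape[-1])               # running weighted sum of V rows

    for s0 in range(0, len(K), tile):
        k_t, v_t = K[s0:s0 + tile], V[s0:s0 + tile]
        s = k_t @ q / np.sqrt(d)              # this tile's logits
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)             # rescale earlier partial sums
        p = np.exp(s - m_new)
        l = l * scale + p.sum()
        acc = acc * scale + p @ v_t
        m = m_new
    return acc / l

# Sanity check: tiled result matches naive attention exactly.
rng = np.random.default_rng(0)
q, K, V = rng.normal(size=64), rng.normal(size=(1000, 64)), rng.normal(size=(1000, 64))
s = K @ q / np.sqrt(64)
p = np.exp(s - s.max())
assert np.allclose(flash_attention(q, K, V), (p / p.sum()) @ V)
```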

3. Quantization: Shrinking the Weights

INT8 quantization represents model weights with 8-bit integers instead of 16-bit floats — halving memory and bandwidth requirements. INT4 (used in GPTQ, AWQ) goes further to 4 bits — 4x compression. The challenge is minimizing accuracy loss: weight distributions are non-uniform, so naive rounding causes significant degradation. GPTQ uses approximate second-order (Hessian) information to choose quantization points that minimize layer output error. INT4 models typically show less than 1% quality loss on benchmarks.
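For intuition, here is a minimal sketch of naive symmetric round-to-nearest INT8 quantization with per-row scales. This is the baseline that GPTQ's Hessian-aware error compensation improves on, not GPTQ itself.

```python
import numpy as np

def quantize_int8(W):
    """Symmetric round-to-nearest INT8 quantization with per-row scales.

    Per-row (per-output-channel) scales stop one outlier channel from
    crushing the resolution of every other channel.
    """
    scale = np.abs(W).max(axis=1, keepdims=True).astype(np.float32) / 127.0
    q = np.clip(np.round(W / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
W = rng.normal(scale=0.02, size=(4096, 4096)).astype(np.float16)
q, scale = quantize_int8(W)
rel_err = np.abs(dequantize(q, scale) - W).mean() / np.abs(W).mean()
print(f"{W.nbytes >> 20} MiB -> {q.nbytes >> 20} MiB, mean rel. error {rel_err:.4f}")
```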

4. Continuous Batching: Maximizing GPU Utilization

Static batching waits for a batch of N requests to start together and finishes them together — GPU sits idle waiting for the slowest sequence. Continuous batching (pioneered by Orca and vLLM) dynamically adds new requests to the batch as slots free up from completed sequences. A GPU always has a full batch of tokens to process. This increases throughput by 2-4x for typical production traffic distributions.
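A toy event loop conveys the scheduling policy. The names here (`Request`, `forward_step`, `serve`) are illustrative, not Orca's or vLLM's API; the one load-bearing detail is that free slots are refilled on every decode step.

```python
from collections import deque

class Request:
    """Toy request: generation finishes after `length` decode steps."""
    def __init__(self, length):
        self.remaining, self.done = length, False

def forward_step(batch):
    # Stand-in for one GPU decode step over all active sequences.
    for r in batch:
        r.remaining -= 1
        r.done = r.remaining == 0

def serve(incoming, max_batch=8):
    queue, active, steps = deque(incoming), [], 0
    while queue or active:
        # Refill free slots on *every* step: the difference from static
        # batching, which drains a whole batch before admitting new work.
        while queue and len(active) < max_batch:
            active.append(queue.popleft())
        forward_step(active)
        active = [r for r in active if not r.done]   # finished -> slot freed
        steps += 1
    return steps

print(serve([Request(n) for n in (3, 10, 2, 7, 5)], max_batch=2))
```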

5. Speculative Decoding: Parallel Token Generation

Standard LLM generation is serial: one token per forward pass. Speculative decoding uses a small draft model (e.g., 7B) to predict K tokens ahead, then verifies all K with the target model in a single forward pass. Draft tokens are accepted left to right until the first disagreement with the target model; in the best case all K are accepted — K tokens generated in the time of one target pass. For highly predictable completions (code, structured data), 3-5x speedups are achievable.
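Below is a sketch of the greedy-acceptance variant. The published method uses rejection sampling over the draft and target distributions to preserve the target model's exact output distribution; the simpler argmax check here conveys the control flow. Both model callables are hypothetical stand-ins.

```python
def speculative_step(target_argmax_all, draft_argmax, prefix, k=4):
    """One round of greedy speculative decoding.

    draft_argmax(seq) -> next-token id from the cheap draft model (k calls).
    target_argmax_all(seq) -> the target model's argmax continuation at
    every position, so ONE expensive pass verifies all k draft tokens.
    Returns the accepted continuation: between 1 and k + 1 tokens.
    """
    # 1. Draft k tokens autoregressively with the small model.
    seq, draft = list(prefix), []
    for _ in range(k):
        t = draft_argmax(seq)
        draft.append(t)
        seq.append(t)

    # 2. Single target pass over prefix + draft. preds[i] is the target's
    #    choice after seeing prefix + draft[:i], for i = 0 .. k.
    preds = target_argmax_all(seq)

    # 3. Accept left to right until the first disagreement; substitute the
    #    target's own token there, so every round yields at least one token.
    accepted = []
    for i, t in enumerate(draft):
        if t != preds[i]:
            accepted.append(preds[i])
            return accepted
        accepted.append(t)
    accepted.append(preds[k])     # all k matched: a bonus (k+1)-th token
    return accepted
```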

Architecture Diagram

[Diagram: GPT / Transformer inference architecture — simplified overview]

Core Concepts

  • KV Cache
  • FlashAttention
  • Quantization (INT8/INT4)
  • Continuous Batching
  • Tensor Parallelism
  • Speculative Decoding

Tradeoffs & Design Decisions

Every architectural decision is a tradeoff. Here's what you gain and what you give up.

✓ Strengths

  • KV cache eliminates quadratic attention recomputation, making long-context generation practical
  • INT4 quantization achieves 4x memory reduction with less than 1% quality loss on most tasks
  • Continuous batching increases GPU utilization from ~30% to 70-90% on production traffic
  • FlashAttention makes 100K+ token context windows tractable on modern GPUs

✗ Weaknesses

  • KV cache grows linearly with sequence length — a 100K token context requires ~20GB of KV cache for a 70B model (the arithmetic is sketched after this list)
  • INT4 quantization quality degradation is task-dependent and may be unacceptable for reasoning-heavy applications
  • Speculative decoding requires running two models and adds complexity; gains vary significantly by output type
  • Continuous batching scheduler complexity: different requests at different sequence lengths require careful memory management
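The KV-cache figure in the first weakness follows from a simple formula: bytes per token = 2 (K and V) x layers x KV heads x head dim x bytes per element. The sketch below assumes a Llama-70B-style grouped-query-attention configuration; quoted totals vary with the attention layout and cache precision, which is why published figures for "70B" models range from the high teens to tens of GB.

```python
def kv_cache_bytes(tokens, layers=80, kv_heads=8, head_dim=128, dtype_bytes=2):
    # 2x for K and V. Defaults assume a Llama-70B-style GQA config;
    # they are illustrative, not a statement about any specific model.
    return 2 * layers * kv_heads * head_dim * dtype_bytes * tokens

print(f"{kv_cache_bytes(100_000) / 1e9:.0f} GB")   # ~33 GB at FP16
# An FP8/INT8 cache halves this (~16 GB); full multi-head attention
# (64 KV heads instead of 8) would need 8x more.
```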

🎯 FAANG Interview Questions

💡 These are real system design interview questions asked at Google, Meta, Amazon, Apple, Netflix, and Microsoft. Focus on tradeoffs, not just what the system does, and study the architecture above before attempting.

  1. Explain why LLM inference is memory-bandwidth bound rather than compute-bound. What does this mean for optimization strategy?

  2. How does the KV cache work? What are its memory implications for long-context models?

  3. Design a serving system for a 70B LLM that must handle 1,000 concurrent users with p95 latency under 3 seconds.

  4. Explain speculative decoding. Under what conditions does it provide the biggest speedup, and when does it fail?

  5. Compare INT8 and INT4 quantization. What are the tradeoffs, and how does GPTQ minimize quality loss?

Research Papers & Further Reading

  • Vaswani, A. et al. (2017). Attention Is All You Need.
  • Dao, T. et al. (2022). FlashAttention: Fast and Memory-Efficient Exact Attention.

All architecture articles are free · No account needed