LLM Serving Infrastructure
vLLM, tensor parallelism, Triton Inference Server, and autoscaling
Key Insight
The key to LLM serving efficiency is maximizing Memory Bandwidth Utilization (MBU): packing as many tokens/sec through the GPU's HBM as possible.
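As a back-of-envelope check, MBU can be estimated from the model's weight footprint and measured decode throughput, since every decode step must stream the full weight set from HBM. The figures below (a 4-bit 70B model at ~35 GB, A100-class HBM at ~2 TB/s) are assumed purely for illustration:

```python
# Back-of-envelope MBU estimate for memory-bound LLM decoding.
# Each decode step streams the full weight set from HBM, so
# achieved bandwidth ~= bytes_per_token * tokens_per_sec.

def mbu(model_bytes: float, tokens_per_sec: float, peak_bw_bytes: float) -> float:
    """Fraction of peak HBM bandwidth consumed by weight traffic."""
    achieved = model_bytes * tokens_per_sec
    return achieved / peak_bw_bytes

# Assumed: 70B params quantized to 4-bit (~35 GB), A100 HBM peak ~2.0 TB/s.
model_bytes = 35e9
peak_bw = 2.0e12
print(f"MBU at 40 tok/s: {mbu(model_bytes, 40, peak_bw):.0%}")  # 70%
```

Per-request decode speed is capped near peak_bw / model_bytes, which is why quantization (smaller model_bytes) translates almost directly into tokens/sec.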
Request Journey
How It Works
1. Client request hits the L7 load balancer, which routes based on model name and available capacity
2. Triton Inference Server's continuous batcher collects incoming requests and inserts them into the active batch at iteration granularity; when a request finishes (EOS), a new request immediately takes its slot with zero GPU idle time
3. Request scheduler manages a priority queue, balancing batch size against available GPU memory (each active request's KV cache grows with sequence length)
4. Model weights (quantized to 4-bit via GPTQ or AWQ for a 75% memory reduction) are loaded from HBM; tensor parallelism shards each layer's weight matrices across GPUs within a node, synchronized via AllReduce over NVLink
5. Pipeline parallelism splits the model by layer groups across nodes: GPUs 0-1 handle early layers via TP, GPUs 2+ handle later layers via PP, with activation tensors passed between stages over InfiniBand
6. PagedAttention manages the KV cache as dynamically allocated memory pages, eliminating pre-allocation waste and enabling 2–4× more concurrent requests per GPU
7. Generated tokens stream back to the client via SSE or gRPC streaming for low time-to-first-token
8. Autoscaler monitors GPU utilization, request queue depth, and tokens/sec throughput; Kubernetes HPA scales GPU pods, with pre-warm pools keeping standby replicas loaded to avoid 30–120s cold start latency
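The memory pressure in step 3 is easy to quantify: KV cache size per request scales linearly with sequence length. A minimal sketch, using assumed 70B-class hyperparameters (80 layers, 8 KV heads under grouped-query attention, head dimension 128, FP16 cache) chosen only for illustration:

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, dtype_bytes: int = 2) -> int:
    """Per-request KV cache size: keys AND values (factor of 2),
    stored for every layer, every KV head, every token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * dtype_bytes

# Assumed 70B-class config: 80 layers, 8 KV heads (GQA), head_dim 128.
per_req = kv_cache_bytes(80, 8, 128, seq_len=4096)
print(f"{per_req / 1e9:.2f} GB per 4K-token request")  # 1.34 GB
```

At roughly 1.3 GB per 4K-token request, a single 80 GB A100 holds only a few dozen concurrent requests after weights are loaded, which is exactly the budget the scheduler must manage.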
The Problem
Serving large language models at production scale faces extreme resource constraints: a 70B parameter model requires 140GB of GPU memory in FP16, far exceeding a single GPU's capacity. Naive request handling wastes GPU compute: sequential processing leaves the GPU idle during memory-bound decoding, and fixed batch sizes either underutilize hardware or add unacceptable latency. Autoscaling GPU infrastructure is 10–100× more expensive than CPU, making cost optimization critical.
The Solution
Modern LLM serving combines model parallelism (tensor parallel across GPUs within a node, pipeline parallel across nodes), continuous batching to maximize GPU utilization by dynamically adding/removing requests mid-generation, quantization (GPTQ/AWQ for 4-bit weights) to reduce memory footprint by 4×, and PagedAttention for efficient KV cache management. NVIDIA Triton Inference Server provides multi-model serving with dynamic batching, while Kubernetes HPA with custom GPU metrics enables autoscaling.
Scale at a Glance
1,000+ tokens/sec
Throughput (vLLM, 70B model)
75% reduction
Memory savings (4-bit quantization)
80–95% MBU
GPU utilization (continuous batching)
30–120 seconds
Cold start time (70B model load)
Deep Dive
Tensor Parallelism vs Pipeline Parallelism: Splitting Giant Models
A 70B parameter model in FP16 requires ~140GB of memory, far exceeding a single A100's 80GB capacity. Tensor parallelism (TP) splits individual layers across GPUs: each GPU holds a slice of every weight matrix and computes a portion of each layer's output, and the partial results are combined via all-reduce communication after each layer. TP requires high-bandwidth interconnect (NVLink at 900GB/s) because every layer involves a synchronization point. Pipeline parallelism (PP) splits the model by layers: GPU 0 gets layers 0–19, GPU 1 gets layers 20–39, and so on. PP has less communication overhead (only activation tensors pass between stages) but introduces pipeline bubbles where GPUs wait for earlier stages. In practice, production systems use TP within a node (2–8 GPUs connected via NVLink) and PP across nodes (connected via InfiniBand). vLLM and TensorRT-LLM support both strategies with automatic configuration based on model size and available hardware.
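The two TP sharding schemes can be simulated on CPU with NumPy: column-parallel shards produce output slices that are concatenated (an all-gather), while row-parallel shards produce partial sums that are added (an all-reduce). This is a toy sketch of the math, not how vLLM or Megatron implement it:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 64))   # activations: batch x hidden
W = rng.standard_normal((64, 64))  # one layer's weight matrix
n_gpus = 4

# Column parallelism: each "GPU" holds a vertical slice of W and computes
# a slice of the output; slices are concatenated (all-gather).
col_shards = np.split(W, n_gpus, axis=1)
y_col = np.concatenate([x @ shard for shard in col_shards], axis=1)

# Row parallelism: each "GPU" holds a horizontal slice of W and the matching
# slice of x; partial outputs are summed (all-reduce).
row_shards = np.split(W, n_gpus, axis=0)
x_shards = np.split(x, n_gpus, axis=1)
y_row = sum(xs @ ws for xs, ws in zip(x_shards, row_shards))

# Both schemes reproduce the unsharded matmul exactly.
assert np.allclose(y_col, x @ W) and np.allclose(y_row, x @ W)
```

Transformer MLP blocks typically chain the two (column-parallel up-projection, row-parallel down-projection) so only one all-reduce is needed per block.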
Continuous Batching: Eliminating GPU Idle Time
Traditional static batching waits for a batch of requests to accumulate, processes them together, and returns all results simultaneously. This creates two problems: short requests must wait for the longest request in the batch to finish (head-of-line blocking), and the GPU sits idle between batches. Continuous batching (also called iteration-level scheduling, pioneered by Orca) processes requests at the granularity of individual decoding steps. When a request finishes generating (hits EOS or max length), a new request immediately takes its slot in the batch, so no GPU cycles are wasted. vLLM's scheduler maintains a priority queue of pending requests, dynamically adjusting the batch size every iteration. This increases throughput by 2–4× compared to static batching at the same latency target. The key challenge is memory management: each active request maintains a KV cache that grows with sequence length, so the scheduler must balance batch size against available GPU memory.
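The slot-replacement idea can be shown with a toy scheduler that ignores real GPU work and just counts decoding iterations. This illustrates iteration-level scheduling only; it is not vLLM's actual scheduler, and the FIFO queue stands in for its priority queue:

```python
from collections import deque

def continuous_batching(requests, max_batch=4):
    """Toy iteration-level scheduler. Each request is (id, tokens_to_generate);
    returns the order in which requests finish."""
    queue = deque(requests)
    active = {}        # request id -> tokens remaining
    finished = []
    while queue or active:
        # Fill free slots immediately -- no waiting for the batch to drain.
        while queue and len(active) < max_batch:
            rid, n = queue.popleft()
            active[rid] = n
        # One decoding iteration: every active request emits one token.
        for rid in list(active):
            active[rid] -= 1
            if active[rid] == 0:   # hit EOS/max length: slot frees up
                del active[rid]
                finished.append(rid)
    return finished

print(continuous_batching([("a", 2), ("b", 5), ("c", 1), ("d", 3), ("e", 1)]))
# ['c', 'a', 'e', 'd', 'b']
```

Note that "e" enters the batch the moment "c" finishes, mid-generation of the others; under static batching it would have waited for the entire first batch to drain.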
Model Quantization: Trading Precision for Throughput
Quantization reduces model weights from 16-bit floating point to lower precision (8-bit, 4-bit, or even 2-bit), dramatically reducing memory footprint and enabling faster inference. GPTQ (post-training quantization) uses calibration data to find optimal quantized weights that minimize output error: it processes one layer at a time, solving a layer-wise reconstruction problem. AWQ (Activation-Aware Weight Quantization) observes that only about 1% of weights are critical for model quality (those corresponding to large-magnitude activations) and preserves these at higher precision. GPTQ and AWQ at 4-bit reduce memory by 75% with minimal quality degradation (typically <1% on benchmarks). FP8 quantization is emerging on H100 GPUs with native hardware support, offering 2× memory reduction with virtually no quality loss. The inference speed improvement from quantization is primarily memory-bandwidth-bound: LLM decoding is limited by how fast weights can be loaded from HBM to compute cores, so smaller weights mean proportionally faster inference.
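A stripped-down illustration of 4-bit group quantization: one scale per group of weights, symmetric rounding into the int4 range. Real GPTQ/AWQ add calibration data and error compensation on top of this, and the group size of 128 is a common but assumed choice:

```python
import numpy as np

def quantize_4bit(w, group_size=128):
    """Symmetric 4-bit quantization with one FP scale per group
    (a simplified version of the format GPTQ/AWQ kernels consume;
    no calibration or error correction)."""
    w = w.reshape(-1, group_size)
    # int4 range is -8..7; use +/-7 so the scale maps symmetrically.
    scale = np.abs(w).max(axis=1, keepdims=True) / 7
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return (q * scale).reshape(-1)

rng = np.random.default_rng(0)
w = rng.standard_normal(4096).astype(np.float32)
q, scale = quantize_4bit(w)
err = np.abs(dequantize(q, scale) - w).mean()
print(f"mean abs error: {err:.4f}")  # small relative to unit-variance weights
```

The per-group scale is what keeps error low: a single outlier weight only inflates the quantization step for its own group of 128, not the whole tensor.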
NVIDIA Triton Inference Server: Multi-Model Production Serving
Triton Inference Server is NVIDIA's production inference platform that handles the operational complexity of serving multiple models simultaneously. It supports all major frameworks (PyTorch, TensorFlow, TensorRT, ONNX, vLLM) behind a unified gRPC/HTTP API. Key features: dynamic batching aggregates individual requests into GPU-efficient batches with a configurable maximum latency, model ensembles chain multiple models in a single request (e.g., tokenizer → LLM → post-processor), model versioning enables A/B testing and canary deployments, and concurrent model execution shares a single GPU across multiple models using CUDA MPS (Multi-Process Service). For LLM serving specifically, Triton integrates with TensorRT-LLM for optimized inference kernels and supports in-flight batching (continuous batching). Resource management is critical: Triton's rate limiter prevents GPU OOM by tracking memory usage per model instance and queuing requests when capacity is exceeded.
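The dispatch rule behind dynamic batching (flush when the batch is full, or when the oldest queued request has waited out the latency budget) can be sketched in a few lines of Python. This mimics the policy only, not Triton's implementation:

```python
import queue
import time

def dynamic_batcher(requests: "queue.Queue", handle_batch,
                    max_batch_size=4, max_delay_s=0.01):
    """Aggregate single requests into batches: dispatch when the batch is
    full OR the oldest queued request has waited max_delay_s."""
    while True:
        first = requests.get()
        if first is None:              # shutdown sentinel
            return
        batch = [first]
        deadline = time.monotonic() + max_delay_s
        while len(batch) < max_batch_size:
            timeout = deadline - time.monotonic()
            if timeout <= 0:
                break                  # latency budget spent: flush partial batch
            try:
                item = requests.get(timeout=timeout)
            except queue.Empty:
                break
            if item is None:
                handle_batch(batch)
                return
            batch.append(item)
        handle_batch(batch)

# Usage: ten pre-queued requests get grouped into batches of up to 4.
q = queue.Queue()
for i in range(10):
    q.put(i)
q.put(None)
batches = []
dynamic_batcher(q, batches.append)
print(batches)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

The max_delay_s knob is the throughput/latency dial: a larger budget fills batches more fully (better GPU efficiency) at the cost of added queueing latency for the first request in each batch.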
GPU Autoscaling: Cost Optimization at Scale
GPU instances cost $1–$30/hour depending on type (A10G, A100, H100), making autoscaling essential for cost management. Kubernetes Horizontal Pod Autoscaler (HPA) scales LLM serving pods based on custom metrics: GPU utilization, request queue depth, tokens-per-second throughput, or p99 latency. The fundamental challenge is cold start time: loading a 70B model takes 30–120 seconds, during which the new pod cannot serve requests. Mitigation strategies include pre-warming pools (keeping standby pods with models loaded), model caching on NVMe (loading from local SSD is 5–10× faster than network storage), and predictive scaling (using historical traffic patterns to scale proactively). Spot/preemptible GPU instances reduce costs by 60–70% but require graceful handling of interruptions: request draining, checkpoint saving, and automatic migration to on-demand instances. Multi-tier serving routes simple queries to smaller/cheaper models (7B on A10G at $1/hour) and complex queries to larger models (70B on A100 at $10/hour), optimizing the cost-quality tradeoff per request.
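Kubernetes HPA's core scaling rule is desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), applied with a tolerance band (0.1 by default) to prevent flapping. A sketch of that decision applied to a queue-depth-per-pod metric; the min/max bounds are assumed values:

```python
import math

def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=2, max_replicas=32, tolerance=0.1):
    """HPA-style scaling rule: desired = ceil(current * metric / target).
    The metric could be queue depth per pod or tokens/sec per pod."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas        # within tolerance band: no change
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))

# Queue depth per pod is 25 against a target of 10: scale 4 pods to 10.
print(desired_replicas(4, current_metric=25, target_metric=10))  # 10
```

Because of the 30–120s model load time, the new pods lag the decision by a minute or more, which is why this reactive rule is usually paired with pre-warm pools or predictive scaling.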
Architecture Diagram
LLM Serving Infrastructure: simplified architecture overview
Core Concepts
Tensor Parallelism
Pipeline Parallelism
Continuous Batching
GPTQ/AWQ Quantization
Triton Server
GPU Autoscaling
Tradeoffs & Design Decisions
Every architectural decision is a tradeoff. Here's what you gain and what you give up.
Strengths
- Continuous batching increases GPU throughput by 2–4× compared to static batching at equivalent latency
- 4-bit quantization (GPTQ/AWQ) reduces GPU memory by 75% with typically less than 1% quality degradation
- Tensor parallelism enables serving models larger than a single GPU's memory with high utilization via NVLink
- Multi-tier serving routes simple queries to cheap models, reducing average cost per request by 50–70%
Weaknesses
- Tensor parallelism requires expensive NVLink interconnect; performance degrades severely over PCIe or network links
- Cold start times of 30–120 seconds make reactive autoscaling too slow for sudden traffic spikes
- Quantization quality loss is non-uniform across tasks; some downstream tasks degrade more than benchmarks suggest
- GPU infrastructure complexity (CUDA drivers, NCCL, model parallelism) creates a steep operational learning curve
FAANG Interview Questions
Interview Prep: These questions appear in FAANG system design rounds. Focus on tradeoffs, not just what the system does.
These are real system design interview questions asked at Google, Meta, Amazon, Apple, Netflix, and Microsoft. Study the architecture above before attempting.
- Q1
Design an LLM serving system that handles 1,000 requests per second for a 70B parameter model. What hardware and software architecture would you use?
- Q2
Explain the difference between tensor parallelism and pipeline parallelism. When would you use each, and what are the communication bottlenecks?
- Q3
Your LLM serving costs are $500K/month. Walk through strategies to reduce costs by 50% without degrading user-perceived quality.
- Q4
A 70B model takes 90 seconds to load. How would you design autoscaling to handle traffic spikes without users experiencing timeouts?
- Q5
Compare vLLM, TensorRT-LLM, and Hugging Face TGI for production LLM serving. What are the key architectural differences?
Research Papers & Further Reading
Yu, G.-I., et al. "Orca: A Distributed Serving System for Transformer-Based Generative Models." OSDI 2022.