🤖 LLM & AI · Expert · Week 15

LLM Serving Infrastructure

vLLM, tensor parallelism, Triton Inference Server, and autoscaling

NVIDIA · vLLM · Hugging Face TGI · AWS SageMaker

Key Insight

The key to LLM serving efficiency is maximizing GPU memory bandwidth utilization (MBU): packing as many tokens per second through the GPU's HBM as possible.
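This memory-bandwidth bound can be sanity-checked with back-of-the-envelope arithmetic. The sketch below uses illustrative figures (4-bit weights at ~0.5 bytes/param, ~2 TB/s of HBM bandwidth), not measured numbers:

```python
# Decode is memory-bound: every generated token must stream all model
# weights from HBM, so per-sequence throughput <= bandwidth / weight bytes.

def decode_tokens_per_sec(param_count: float, bytes_per_param: float,
                          hbm_bandwidth_gbps: float, mbu: float) -> float:
    """Upper bound on single-sequence decode speed at a given MBU."""
    weight_bytes = param_count * bytes_per_param
    return mbu * hbm_bandwidth_gbps * 1e9 / weight_bytes

# 70B params, 4-bit weights (~0.5 bytes/param), ~2 TB/s HBM, 80% MBU
print(decode_tokens_per_sec(70e9, 0.5, 2000, 0.8))  # ≈ 45.7 tokens/sec per sequence
```

Batching is what makes serving economical: the same weight traffic is amortized across every sequence in the batch, so aggregate tokens/sec scales with batch size until compute or KV-cache memory becomes the limit.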


How It Works

1. Client request hits the L7 load balancer, which routes based on model name and available capacity.

2. Triton Inference Server's continuous batcher collects incoming requests and inserts them into the active batch at iteration granularity: when a request finishes (EOS), a new request immediately takes its slot with zero GPU idle time.

3. The request scheduler manages a priority queue, balancing batch size against available GPU memory (each active request's KV cache grows with sequence length).

4. Model weights (quantized to 4-bit via GPTQ or AWQ for a 75% memory reduction) are loaded from HBM; tensor parallelism shards each layer's weight matrices across the GPUs within a node, synchronized via AllReduce over NVLink.

5. Pipeline parallelism splits the model by layer groups across nodes: the first node's GPUs handle early layers (sharded via TP), later nodes handle later layers, with activation tensors passed between stages over InfiniBand.

6. PagedAttention manages the KV cache as dynamically allocated memory pages, eliminating pre-allocation waste and enabling 2–4× more concurrent requests per GPU.

7. Generated tokens stream back to the client via SSE or gRPC streaming for low time-to-first-token.

8. The autoscaler monitors GPU utilization, request queue depth, and tokens/sec throughput; Kubernetes HPA scales GPU pods, with pre-warm pools keeping standby replicas loaded to avoid 30–120 second cold-start latency.

⚠ The Problem

Serving large language models at production scale faces extreme resource constraints: a 70B-parameter model requires 140 GB of GPU memory in FP16, far exceeding a single GPU's capacity. Naive request handling wastes GPU compute: sequential processing leaves the GPU idle during memory-bound decoding, and fixed batch sizes either underutilize hardware or add unacceptable latency. GPU infrastructure is 10–100× more expensive than CPU capacity, making autoscaling and cost optimization critical.

✓ The Solution

Modern LLM serving combines model parallelism (tensor parallel across GPUs within a node, pipeline parallel across nodes), continuous batching to maximize GPU utilization by dynamically adding and removing requests mid-generation, quantization (GPTQ/AWQ for 4-bit weights) to reduce the memory footprint by 4×, and PagedAttention for efficient KV cache management. NVIDIA Triton Inference Server provides multi-model serving with dynamic batching, while Kubernetes HPA with custom GPU metrics enables autoscaling.

📊 Scale at a Glance

  • Throughput (vLLM, 70B model): 1,000+ tokens/sec
  • Memory savings (4-bit quantization): 75% reduction
  • GPU utilization (continuous batching): 80–95% MBU
  • Cold start time (70B model load): 30–120 seconds

🔬 Deep Dive

1. Tensor Parallelism vs Pipeline Parallelism: Splitting Giant Models

A 70B-parameter model in FP16 requires ~140 GB of memory, far exceeding a single A100's 80 GB capacity. Tensor parallelism (TP) splits individual layers across GPUs: each GPU holds a slice of every weight matrix and computes a portion of each layer's output, and the partial results are combined via all-reduce communication after each layer. TP requires a high-bandwidth interconnect (NVLink at 900 GB/s) because every layer involves a synchronization point. Pipeline parallelism (PP) splits the model by layers: GPU 0 gets layers 0–19, GPU 1 gets layers 20–39, and so on. PP has less communication overhead (only activation tensors between stages) but introduces pipeline bubbles where GPUs wait for earlier stages. In practice, production systems use TP within a node (2–8 GPUs connected via NVLink) and PP across nodes (connected via InfiniBand). vLLM and TensorRT-LLM support both strategies with automatic configuration based on model size and available hardware.
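Column-parallel TP can be illustrated with a toy two-"GPU" NumPy simulation (real systems use NCCL collectives; the concatenate below stands in for an all-gather):

```python
import numpy as np

# Simulate 2-way tensor parallelism for a linear layer y = x @ W.
# Column-parallel: each "GPU" holds half of W's columns and computes a
# slice of y; a gather across the TP group reassembles the full output.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))          # batch of activations
W = rng.standard_normal((8, 6))          # full weight matrix

W0, W1 = np.split(W, 2, axis=1)          # shard columns across 2 devices
y0 = x @ W0                              # computed on "GPU 0"
y1 = x @ W1                              # computed on "GPU 1"
y = np.concatenate([y0, y1], axis=1)     # stands in for an all-gather

assert np.allclose(y, x @ W)             # matches the unsharded result
```

Row-parallel sharding splits W's rows instead, producing partial sums that are combined with an all-reduce; Megatron-style transformers pair the two layouts so each MLP or attention block needs only one all-reduce in the forward pass.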

2. Continuous Batching: Eliminating GPU Idle Time

Traditional static batching waits for a batch of requests to accumulate, processes them together, and returns all results simultaneously. This creates two problems: short requests must wait for the longest request in the batch to finish (head-of-line blocking), and the GPU sits idle between batches. Continuous batching (also called iteration-level scheduling, pioneered by Orca) processes requests at the granularity of individual decoding steps. When a request finishes generating (hits EOS or max length), a new request immediately takes its slot in the batch, so no GPU cycles are wasted. vLLM's scheduler maintains a priority queue of pending requests, dynamically adjusting the batch size every iteration. This increases throughput by 2–4× compared to static batching at the same latency target. The key challenge is memory management: each active request maintains a KV cache that grows with sequence length, so the scheduler must balance batch size against available GPU memory.
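The scheduling idea can be sketched as a toy simulation (request lengths and batch size are made-up numbers, and real schedulers like vLLM's also track KV-cache memory, which this omits):

```python
from collections import deque

def continuous_batching(pending: list[int], max_batch: int) -> int:
    """Simulate iteration-level scheduling.

    `pending` holds each request's output length in tokens. Every
    iteration decodes one token for every active request; a finished
    request's slot is refilled from the queue immediately. Returns the
    number of iterations (GPU forward passes) needed to drain the queue.
    """
    queue = deque(pending)
    active: list[int] = []                     # remaining tokens per in-flight request
    iterations = 0
    while queue or active:
        while queue and len(active) < max_batch:
            active.append(queue.popleft())     # refill free slots immediately
        active = [r - 1 for r in active]       # one decode step for the whole batch
        active = [r for r in active if r > 0]  # retire finished requests (EOS)
        iterations += 1
    return iterations

# Static batching of [100, 5, 5, 5] at batch size 2 needs 100 + 5 = 105 steps,
# because the short request's slot idles until the long one finishes.
print(continuous_batching([100, 5, 5, 5], max_batch=2))  # → 100 iterations
```

The gap widens as length variance grows, which is why iteration-level scheduling delivers its 2–4× throughput gain on real traffic with highly skewed output lengths.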

3. Model Quantization: Trading Precision for Throughput

Quantization reduces model weights from 16-bit floating point to lower precision (8-bit, 4-bit, or even 2-bit), dramatically reducing memory footprint and enabling faster inference. GPTQ (post-training quantization) uses calibration data to find quantized weights that minimize output error; it processes one layer at a time, solving a layer-wise reconstruction problem. AWQ (Activation-Aware Weight Quantization) observes that roughly 1% of weights are disproportionately important for model quality (those corresponding to large-magnitude activations) and protects them by rescaling the affected channels before quantization. GPTQ and AWQ at 4-bit reduce memory by 75% with minimal quality degradation (typically <1% on benchmarks). FP8 quantization is emerging on H100 GPUs with native hardware support, offering a 2× memory reduction with virtually no quality loss. The inference speedup from quantization is primarily a memory-bandwidth effect: LLM decoding is limited by how fast weights can be loaded from HBM to the compute cores, so smaller weights mean proportionally faster inference.
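The storage side of 4-bit quantization can be sketched with group-wise absmax rounding (a simplified toy: GPTQ additionally optimizes the rounding layer by layer, and AWQ rescales salient channels first):

```python
import numpy as np

def quantize_4bit(w: np.ndarray, group_size: int = 128):
    """Group-wise absmax quantization to 4-bit integers.

    Each group of `group_size` weights shares one fp scale, so storage is
    ~4 bits/weight plus one scale per group, vs 16 bits/weight in FP16.
    """
    groups = w.reshape(-1, group_size)
    scale = np.abs(groups).max(axis=1, keepdims=True) / 7   # int4 range [-7, 7]
    q = np.clip(np.round(groups / scale), -7, 7).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Reconstruct approximate FP weights from int4 codes and group scales."""
    return (q * scale).reshape(-1)

rng = np.random.default_rng(0)
w = rng.standard_normal(1024).astype(np.float32)
q, scale = quantize_4bit(w)
err = np.abs(dequantize(q, scale) - w).max()
# Worst-case error per group is ~scale/2; memory drops ~75% vs FP16.
```

Grouping matters: a single outlier only inflates the scale (and thus the rounding error) of its own 128-weight group rather than the whole tensor, which is the same intuition AWQ pushes further by rescaling outlier channels explicitly.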

4. NVIDIA Triton Inference Server: Multi-Model Production Serving

Triton Inference Server is NVIDIA's production inference platform that handles the operational complexity of serving multiple models simultaneously. It supports all major frameworks (PyTorch, TensorFlow, TensorRT, ONNX, vLLM) behind a unified gRPC/HTTP API. Key features: dynamic batching aggregates individual requests into GPU-efficient batches with a configurable maximum queue delay, model ensembles chain multiple models in a single request (e.g., tokenizer → LLM → post-processor), model versioning enables A/B testing and canary deployments, and concurrent model execution shares a single GPU across multiple models using CUDA MPS (Multi-Process Service). For LLM serving specifically, Triton integrates with TensorRT-LLM for optimized inference kernels and supports in-flight batching (continuous batching). Resource management is critical: Triton's rate limiter prevents GPU OOM by tracking memory usage per model instance and queuing requests when capacity is exceeded.
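Dynamic batching is enabled declaratively in a model's config.pbtxt. A minimal sketch for a generic backend (model name, batch sizes, and delay are placeholder values; the TensorRT-LLM backend configures its in-flight batcher through backend-specific parameters instead):

```
name: "my_model"
backend: "onnxruntime"
max_batch_size: 64

dynamic_batching {
  preferred_batch_size: [ 16, 32, 64 ]
  max_queue_delay_microseconds: 500   # wait up to 0.5 ms to fill a batch
}

instance_group [
  { count: 1, kind: KIND_GPU }
]
```

The queue delay is the knob behind the latency/throughput tradeoff: a longer delay fills larger, more GPU-efficient batches at the cost of added tail latency per request.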

5. GPU Autoscaling: Cost Optimization at Scale

GPU instances cost $1–$30/hour depending on type (A10G, A100, H100), making autoscaling essential for cost management. Kubernetes Horizontal Pod Autoscaler (HPA) scales LLM serving pods based on custom metrics: GPU utilization, request queue depth, tokens-per-second throughput, or p99 latency. The fundamental challenge is cold start time: loading a 70B model takes 30–120 seconds, during which the new pod cannot serve requests. Mitigation strategies include pre-warming pools (keeping standby pods with models loaded), model caching on NVMe (loading from local SSD is 5–10× faster than from network storage), and predictive scaling (using historical traffic patterns to scale proactively). Spot/preemptible GPU instances reduce costs by 60–70% but require graceful handling of interruptions: request draining, checkpoint saving, and automatic migration to on-demand instances. Multi-tier serving routes simple queries to smaller, cheaper models (7B on A10G at ~$1/hour) and complex queries to larger models (70B on A100 at ~$10/hour), optimizing the cost-quality tradeoff per request.
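Queue-depth-driven scaling can be expressed with a standard autoscaling/v2 HPA manifest. Names and thresholds below are hypothetical, and the custom metric assumes a metrics adapter (e.g. prometheus-adapter) is installed in the cluster:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-server                      # hypothetical deployment name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-server
  minReplicas: 2                        # warm floor absorbs spikes during cold starts
  maxReplicas: 16
  metrics:
  - type: Pods
    pods:
      metric:
        name: request_queue_depth       # custom metric exposed via a metrics adapter
      target:
        type: AverageValue
        averageValue: "8"               # scale out above ~8 queued requests per pod
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300   # avoid thrashing on bursty traffic
```

Queue depth is usually a better signal than GPU utilization for LLM serving, since continuous batching keeps utilization high even when the system is far from saturated.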

⬡ Architecture Diagram

[Diagram: LLM Serving Infrastructure, simplified architecture overview]

✦ Core Concepts

  • Tensor Parallelism
  • Pipeline Parallelism
  • Continuous Batching
  • GPTQ/AWQ Quantization
  • Triton Server
  • GPU Autoscaling

⚖ Tradeoffs & Design Decisions

Every architectural decision is a tradeoff. Here's what you gain and what you give up.

✓ Strengths

  • Continuous batching increases GPU throughput by 2–4× compared to static batching at equivalent latency
  • 4-bit quantization (GPTQ/AWQ) reduces GPU memory by 75% with typically less than 1% quality degradation
  • Tensor parallelism enables serving models larger than a single GPU's memory with high utilization via NVLink
  • Multi-tier serving routes simple queries to cheap models, reducing average cost per request by 50–70%

✗ Weaknesses

  • Tensor parallelism requires expensive NVLink interconnect; performance degrades severely over PCIe or network links
  • Cold start times of 30–120 seconds make reactive autoscaling too slow for sudden traffic spikes
  • Quantization quality loss is non-uniform across tasks; some downstream tasks degrade more than benchmarks suggest
  • GPU infrastructure complexity (CUDA drivers, NCCL, model parallelism) creates a steep operational learning curve

🎯 FAANG Interview Questions

These system design questions appear in interview rounds at Google, Meta, Amazon, Apple, Netflix, and Microsoft. Study the architecture above before attempting, and focus on tradeoffs, not just what the system does.

  1. Design an LLM serving system that handles 1,000 requests per second for a 70B parameter model. What hardware and software architecture would you use?

  2. Explain the difference between tensor parallelism and pipeline parallelism. When would you use each, and what are the communication bottlenecks?

  3. Your LLM serving costs are $500K/month. Walk through strategies to reduce costs by 50% without degrading user-perceived quality.

  4. A 70B model takes 90 seconds to load. How would you design autoscaling to handle traffic spikes without users experiencing timeouts?

  5. Compare vLLM, TensorRT-LLM, and Hugging Face TGI for production LLM serving. What are the key architectural differences?

Research Papers & Further Reading

Yu, G.-I. et al. (2022). Orca: A Distributed Serving System for Transformer-Based Generative Models. OSDI 2022.
