LLM Fine-Tuning Pipeline
LoRA, QLoRA, DPO, and the production training infrastructure
Key Insight
LoRA exploits the 'intrinsic dimensionality' hypothesis: fine-tuning changes exist in a low-dimensional subspace of the full weight space.
Request Journey
How It Works
1. Raw data (instruction-response pairs, domain documents, human feedback) is curated: filtering low-quality examples, deduplicating near-duplicates, and balancing categories
2. Data is formatted into the model's chat template (e.g., ChatML) with system/user/assistant roles
3. Tokenizer encodes text into token IDs with padding/truncation to the maximum sequence length
4. Base model weights are frozen; LoRA injects small trainable adapter matrices (rank 16-64) into attention layers, reducing trainable parameters by up to 10,000x
5. The QLoRA variant quantizes the base model to 4-bit NormalFloat while training adapters in BF16, enabling 65B fine-tuning on a single 48GB GPU
6. Training loop runs with gradient accumulation across micro-batches; DeepSpeed ZeRO shards optimizer states across GPUs for multi-node training
7. DPO (Direct Preference Optimization) or RLHF aligns the model with human preferences using chosen/rejected response pairs
8. Evaluation harness benchmarks the fine-tuned model on standard tasks (MMLU, HumanEval, domain-specific evals)
9. LoRA adapters are merged back into the base model weights
10. Merged model is quantized (GPTQ/AWQ to 4-bit) and deployed to the serving infrastructure
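The chat-template step above can be sketched in a few lines. The `<|im_start|>`/`<|im_end|>` delimiters follow the ChatML convention; `to_chatml` is an illustrative helper, not a library API. In practice you would use the tokenizer's built-in chat template so the special tokens match the base model exactly.

```python
# Sketch of the formatting step: rendering (system, user, assistant)
# turns with ChatML delimiters before tokenization.

def to_chatml(messages):
    """Render a list of {role, content} dicts as a single ChatML string."""
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>"
             for m in messages]
    return "\n".join(parts)

sample = [
    {"role": "system", "content": "You are a concise legal assistant."},
    {"role": "user", "content": "Summarize this clause."},
    {"role": "assistant", "content": "The clause caps liability at fees paid."},
]
formatted = to_chatml(sample)
```

During supervised fine-tuning, the loss is typically masked so only the assistant spans contribute to the gradient; the system and user spans are context, not targets.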
The Problem
Training a new LLM from scratch costs $10M-$100M in compute and takes months. Yet organizations need models specialized to their domain (legal, medical, code) or aligned with their specific style and tone. General-purpose models like GPT-4 are capable but not optimized for specialized tasks and cannot be customized.
The Solution
Fine-tuning adapts a pre-trained model's behavior by continuing training on a smaller domain-specific dataset. LoRA (Low-Rank Adaptation) makes this practical: instead of updating 70 billion parameters, it freezes the base model and adds tiny adapter matrices (millions of parameters) that capture domain-specific changes. QLoRA further enables fine-tuning a 65B model on a single 48GB GPU through 4-bit quantization.
Scale at a Glance
- LoRA parameter reduction: 100-10,000x
- QLoRA GPU savings: 4x less VRAM
- Minimum quality dataset: 1K-10K examples
- Fine-tune time (7B model): 2-8 hours on an A100
Deep Dive
LoRA: Low-Rank Weight Adaptation
Full fine-tuning updates all model weights: a 7B model has 7 billion floating-point parameters to update, store, and serve. LoRA's insight is that weight updates during fine-tuning have low intrinsic rank. Instead of updating W directly, LoRA adds ΔW = BA, where B is (d × r) and A is (r × d) with rank r much smaller than d. For rank 16 and d = 4096, this is 131,072 trainable parameters instead of 16.7 million, a 128x reduction. Multiple LoRA adapters can be merged into the base model or swapped per-request for multi-task deployment.
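The arithmetic above can be made concrete. This is a minimal sketch using plain Python lists so the update rule is explicit; real implementations (e.g., the PEFT library) do this with fused tensor ops.

```python
# LoRA update rule: y = W x + (alpha / r) * B (A x); only A and B train.
# In the standard scheme A is Gaussian-initialized and B starts at zero,
# so training begins exactly at the base model's behavior.

def matvec(M, v):
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

def lora_forward(x, W, A, B, alpha, r):
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))  # compute B(Ax); never materialize BA
    return [b + (alpha / r) * d for b, d in zip(base, delta)]

# Parameter accounting for one d x d projection at rank 16:
d, r = 4096, 16
lora_params = 2 * r * d                  # 131,072 trainable
full_params = d * d                      # 16,777,216
reduction = full_params // lora_params   # 128

# With B zero-initialized, the adapter contributes nothing at step 0:
y = lora_forward([3.0, 4.0], [[1.0, 0.0], [0.0, 1.0]],
                 [[0.5, 0.5]], [[0.0], [0.0]], alpha=32, r=16)
```

Note that the forward pass applies A then B as two thin matmuls; forming the dense d × d product BA is only needed when merging the adapter back into W for serving.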
QLoRA: Fine-Tuning on Consumer Hardware
LoRA still requires the base model in memory: a 65B FP16 model needs 130GB of VRAM, far exceeding a single GPU. QLoRA quantizes the base model to 4-bit using the NF4 (NormalFloat4) data type plus double quantization, reducing weight memory by 4x. The base model stays frozen in 4-bit; LoRA adapters are trained in 16-bit bfloat16 with gradient checkpointing. This enables fine-tuning a 65B model on a single 48GB GPU, a job that previously required multiple high-memory GPUs.
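The memory claim is easy to verify with a back-of-envelope calculation. This is an approximation for base-model weights alone: it ignores activations, LoRA adapters, optimizer state, and the small overhead of quantization constants.

```python
# Rough VRAM needed to hold N billion parameters at a given bit width.

def weight_gib(n_params_billion, bits_per_weight):
    return n_params_billion * 1e9 * bits_per_weight / 8 / 2**30

fp16_gib = weight_gib(65, 16)  # ~121 GiB: needs multiple GPUs just for weights
nf4_gib = weight_gib(65, 4)    # ~30 GiB: fits a single 48GB card, leaving
                               # headroom for BF16 adapters and activations
```

The 4x ratio falls directly out of the bit widths (16 / 4); double quantization shaves a further fraction of a GiB by quantizing the per-block scale constants themselves.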
Instruction Fine-Tuning and Dataset Curation
The most important factor in fine-tuning quality is dataset quality. Instruction fine-tuning trains on (instruction, response) pairs to teach the model to follow directions. Even 1,000-10,000 high-quality examples can dramatically improve task performance. Dataset curation involves: deduplication (near-duplicate examples reduce diversity), quality filtering (remove low-quality completions), format consistency, and mixing domain-specific data with general data to preserve broad capabilities.
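One of those curation passes can be sketched with the standard library alone: exact deduplication after light normalization (lowercase, collapse punctuation and whitespace). This is a simplification; production pipelines typically layer fuzzy matching such as MinHash/LSH on top to catch paraphrases.

```python
# Near-duplicate filtering via normalized-text hashing.
import hashlib
import re

def normalize(text):
    return re.sub(r"\W+", " ", text.lower()).strip()

def dedupe(examples):
    seen, kept = set(), []
    for ex in examples:
        key = hashlib.sha256(normalize(ex["response"]).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(ex)
    return kept

pairs = [
    {"instruction": "Greet.", "response": "Hello, there!"},
    {"instruction": "Greet.", "response": "hello there"},  # near-duplicate
    {"instruction": "Part.", "response": "Goodbye."},
]
curated = dedupe(pairs)  # the casing/punctuation variant collapses away
```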
DPO: Replacing RLHF Complexity
RLHF (Reinforcement Learning from Human Feedback) trains a separate reward model from human preference data, then uses PPO to optimize the LLM against this reward model, a complex, unstable training procedure. DPO (Direct Preference Optimization) reformulates the same objective without a reward model: given pairs of (preferred, rejected) completions, directly fine-tune the LLM to assign higher probability to preferred completions. DPO is more stable, faster, and achieves comparable alignment quality to RLHF.
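The DPO objective for a single preference pair is compact enough to write out with plain floats. The inputs are summed log-probabilities of the chosen and rejected completions under the trained policy (`pi_*`) and the frozen reference model (`ref_*`); `beta` controls how far the policy may drift from the reference.

```python
# DPO loss: -log sigmoid(beta * [(pi_w - ref_w) - (pi_l - ref_l)])
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# At initialization the policy equals the reference, the margin is 0,
# and the loss is log(2):
start = dpo_loss(-1.0, -2.0, -1.0, -2.0)

# Once the policy favors the chosen completion more than the reference
# does, the margin is positive and the loss drops below log(2):
improved = dpo_loss(-0.5, -2.5, -1.0, -2.0)
```

The key property is that the "reward" is implicit in the log-probability ratios, so no separate reward model or PPO rollout loop is needed; training is an ordinary supervised pass over preference pairs.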
Evaluation and Preventing Catastrophic Forgetting
Fine-tuned models suffer catastrophic forgetting: optimizing heavily for a new task can degrade performance on the original capabilities. Evaluation requires both domain benchmarks (did the model learn the target task?) and general benchmarks (MMLU, HellaSwag, HumanEval: did it forget general reasoning?). Mixing domain fine-tuning data with a small fraction of general instruction data (5-10%) typically prevents forgetting while maintaining domain specialization.
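The mitigation itself is mechanically simple. This sketch blends a general instruction set into the domain set at a target fraction; the 10% default is a common starting point from the text above, not a universal constant, and should be tuned against your eval suite.

```python
# Blend general data into a domain dataset to guard against forgetting.
import random

def mix(domain, general, general_frac=0.10, seed=0):
    rng = random.Random(seed)
    # Choose n so general examples make up general_frac of the final mix.
    n_general = round(len(domain) * general_frac / (1 - general_frac))
    sample = rng.sample(general, min(n_general, len(general)))
    mixed = domain + sample
    rng.shuffle(mixed)
    return mixed

# 900 domain examples + 100 sampled general examples = 10% general.
blended = mix(list(range(900)), list(range(1000, 2000)))
```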
Architecture Diagram
LLM Fine-Tuning Pipeline: simplified architecture overview
Core Concepts
LoRA Adapters
QLoRA (4-bit)
Supervised Fine-Tuning
DPO vs RLHF
Evaluation Harness
PEFT
Tradeoffs & Design Decisions
Every architectural decision is a tradeoff. Here's what you gain and what you give up.
Strengths
- LoRA reduces trainable parameters by 100-10,000x vs. full fine-tuning
- QLoRA enables fine-tuning 65B models on a single GPU, democratizing LLM customization
- Fine-tuned models outperform prompting on consistent formatting, domain vocabulary, and style
- DPO alignment is stable, simple, and achieves results comparable to RLHF
Weaknesses
- Fine-tuned models degrade on tasks not in the training data; catastrophic forgetting requires mitigation strategies
- Dataset curation is manual and expensive; 1,000 high-quality examples may take weeks to create
- LoRA adapters must be merged or hot-swapped for inference, which adds serving infrastructure complexity
- Evaluating fine-tuning quality requires domain-specific benchmarks that often do not exist
FAANG Interview Questions
Interview Prep: These questions appear in system design rounds at Google, Meta, Amazon, Apple, Netflix, and Microsoft. Focus on tradeoffs, not just what the system does, and study the architecture above before attempting.
Q1. Explain LoRA. How does low-rank decomposition work, and why does it reduce parameters so dramatically?
Q2. Design a fine-tuning pipeline for a medical document summarization model. What data would you collect, and how would you evaluate quality?
Q3. Compare DPO and RLHF for model alignment. What are the practical advantages of DPO?
Q4. How do you prevent catastrophic forgetting when fine-tuning on a narrow domain dataset?
Q5. When is fine-tuning better than prompting, and when is prompting better? Give concrete examples.