LLM Fine-Tuning Pipeline
LoRA, QLoRA, DPO, and the production training infrastructure
Key Insight
LoRA exploits the 'intrinsic dimensionality' hypothesis: fine-tuning changes exist in a low-dimensional subspace of the full weight space.
Request Journey
How It Works
1. Raw data (instruction-response pairs, domain documents, human feedback) is curated: filtering low-quality examples, deduplicating near-duplicates, and balancing categories
2. Data is formatted into the model's chat template (e.g., ChatML) with system/user/assistant roles
3. Tokenizer encodes text into token IDs with padding/truncation to the maximum sequence length
4. Base model weights are frozen; LoRA injects small trainable adapter matrices (rank 16-64) into attention layers, reducing trainable parameters by up to 10,000x
5. The QLoRA variant quantizes the base model to 4-bit NormalFloat while training adapters in BF16, enabling 65B fine-tuning on a single 48GB GPU
6. Training loop runs with gradient accumulation across micro-batches; DeepSpeed ZeRO shards optimizer states across GPUs for multi-node training
7. DPO (Direct Preference Optimization) or RLHF aligns the model with human preferences using chosen/rejected response pairs
8. Evaluation harness benchmarks the fine-tuned model on standard tasks (MMLU, HumanEval, domain-specific evals)
9. LoRA adapters are merged back into the base model weights
10. Merged model is quantized (GPTQ/AWQ to 4-bit) and deployed to the serving infrastructure
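The chat-template step above can be sketched in a few lines. The `<|im_start|>`/`<|im_end|>` delimiters follow the ChatML convention; `to_chatml` is an illustrative helper, not a library API. In practice you would use the tokenizer's built-in chat template so the special tokens match the base model exactly.

```python
# Sketch of the formatting step: rendering (system, user, assistant)
# turns with ChatML delimiters before tokenization.

def to_chatml(messages):
    """Render a list of {role, content} dicts as a single ChatML string."""
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>"
             for m in messages]
    return "\n".join(parts)

sample = [
    {"role": "system", "content": "You are a concise legal assistant."},
    {"role": "user", "content": "Summarize this clause."},
    {"role": "assistant", "content": "The clause caps liability at fees paid."},
]
formatted = to_chatml(sample)
```

During supervised fine-tuning, the loss is typically masked so only the assistant spans contribute to the gradient; the system and user spans are context, not targets.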
The Problem
Training a new LLM from scratch costs $10M-$100M in compute and takes months. Yet organizations need models specialized to their domain (legal, medical, code) or aligned with their specific style and tone. General-purpose models like GPT-4 are capable but not optimized for specialized tasks and cannot be customized.
The Solution
Fine-tuning adapts a pre-trained model's behavior by continuing training on a smaller domain-specific dataset. LoRA (Low-Rank Adaptation) makes this practical: instead of updating 70 billion parameters, it freezes the base model and adds tiny adapter matrices (millions of parameters) that capture domain-specific changes. QLoRA further enables fine-tuning a 65B model on a single 48GB GPU through 4-bit quantization.
Scale at a Glance
- LoRA parameter reduction: 100-10,000x
- QLoRA GPU savings: 4x less VRAM
- Minimum quality dataset: 1K-10K examples
- Fine-tune time (7B model): 2-8 hours on an A100
Deep Dive
LoRA: Low-Rank Weight Adaptation
Full fine-tuning updates all model weights: a 7B model has 7 billion floating-point parameters to update, store, and serve. LoRA's insight is that weight updates during fine-tuning have low intrinsic rank. Instead of updating W directly, LoRA adds ΔW = BA, where B is (d × r) and A is (r × d) with rank r much smaller than d. For rank 16 and d = 4096, this is 131,072 trainable parameters instead of 16.7 million, a 128x reduction. Multiple LoRA adapters can be merged into the base model or swapped per-request for multi-task deployment.
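The arithmetic above can be made concrete. This is a minimal sketch using plain Python lists so the update rule is explicit; real implementations (e.g., the PEFT library) do this with fused tensor ops.

```python
# LoRA update rule: y = W x + (alpha / r) * B (A x); only A and B train.
# In the standard scheme A is Gaussian-initialized and B starts at zero,
# so training begins exactly at the base model's behavior.

def matvec(M, v):
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

def lora_forward(x, W, A, B, alpha, r):
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))  # compute B(Ax); never materialize BA
    return [b + (alpha / r) * d for b, d in zip(base, delta)]

# Parameter accounting for one d x d projection at rank 16:
d, r = 4096, 16
lora_params = 2 * r * d                  # 131,072 trainable
full_params = d * d                      # 16,777,216
reduction = full_params // lora_params   # 128

# With B zero-initialized, the adapter contributes nothing at step 0:
y = lora_forward([3.0, 4.0], [[1.0, 0.0], [0.0, 1.0]],
                 [[0.5, 0.5]], [[0.0], [0.0]], alpha=32, r=16)
```

Note that the forward pass applies A then B as two thin matmuls; forming the dense d × d product BA is only needed when merging the adapter back into W for serving.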
QLoRA: Fine-Tuning on Consumer Hardware
LoRA still requires the base model in memory: a 65B FP16 model needs 130GB of VRAM, far exceeding a single GPU. QLoRA quantizes the base model to 4-bit using the NF4 (NormalFloat4) data type plus double quantization, reducing weight memory by 4x. The base model stays frozen in 4-bit; LoRA adapters are trained in 16-bit bfloat16 with gradient checkpointing. This enables fine-tuning a 65B model on a single 48GB GPU, a job that previously required multiple high-memory GPUs.
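The memory claim is easy to verify with a back-of-envelope calculation. This is an approximation for base-model weights alone: it ignores activations, LoRA adapters, optimizer state, and the small overhead of quantization constants.

```python
# Rough VRAM needed to hold N billion parameters at a given bit width.

def weight_gib(n_params_billion, bits_per_weight):
    return n_params_billion * 1e9 * bits_per_weight / 8 / 2**30

fp16_gib = weight_gib(65, 16)  # ~121 GiB: needs multiple GPUs just for weights
nf4_gib = weight_gib(65, 4)    # ~30 GiB: fits a single 48GB card, leaving
                               # headroom for BF16 adapters and activations
```

The 4x ratio falls directly out of the bit widths (16 / 4); double quantization shaves a further fraction of a GiB by quantizing the per-block scale constants themselves.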
Instruction Fine-Tuning and Dataset Curation
The most important factor in fine-tuning quality is dataset quality. Instruction fine-tuning trains on (instruction, response) pairs to teach the model to follow directions. Even 1,000-10,000 high-quality examples can dramatically improve task performance. Dataset curation involves: deduplication (near-duplicate examples reduce diversity), quality filtering (remove low-quality completions), format consistency, and mixing domain-specific data with general data to preserve broad capabilities.
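One of those curation passes can be sketched with the standard library alone: exact deduplication after light normalization (lowercase, collapse punctuation and whitespace). This is a simplification; production pipelines typically layer fuzzy matching such as MinHash/LSH on top to catch paraphrases.

```python
# Near-duplicate filtering via normalized-text hashing.
import hashlib
import re

def normalize(text):
    return re.sub(r"\W+", " ", text.lower()).strip()

def dedupe(examples):
    seen, kept = set(), []
    for ex in examples:
        key = hashlib.sha256(normalize(ex["response"]).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(ex)
    return kept

pairs = [
    {"instruction": "Greet.", "response": "Hello, there!"},
    {"instruction": "Greet.", "response": "hello there"},  # near-duplicate
    {"instruction": "Part.", "response": "Goodbye."},
]
curated = dedupe(pairs)  # the casing/punctuation variant collapses away
```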
DPO: Replacing RLHF Complexity
RLHF (Reinforcement Learning from Human Feedback) trains a separate reward model from human preference data, then uses PPO to optimize the LLM against this reward model, a complex, unstable training procedure. DPO (Direct Preference Optimization) reformulates the same objective without a reward model: given pairs of (preferred, rejected) completions, directly fine-tune the LLM to assign higher probability to preferred completions. DPO is more stable, faster, and achieves comparable alignment quality to RLHF.
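The DPO objective for a single preference pair is compact enough to write out with plain floats. The inputs are summed log-probabilities of the chosen and rejected completions under the trained policy (`pi_*`) and the frozen reference model (`ref_*`); `beta` controls how far the policy may drift from the reference.

```python
# DPO loss: -log sigmoid(beta * [(pi_w - ref_w) - (pi_l - ref_l)])
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# At initialization the policy equals the reference, the margin is 0,
# and the loss is log(2):
start = dpo_loss(-1.0, -2.0, -1.0, -2.0)

# Once the policy favors the chosen completion more than the reference
# does, the margin is positive and the loss drops below log(2):
improved = dpo_loss(-0.5, -2.5, -1.0, -2.0)
```

The key property is that the "reward" is implicit in the log-probability ratios, so no separate reward model or PPO rollout loop is needed; training is an ordinary supervised pass over preference pairs.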
Evaluation and Preventing Catastrophic Forgetting
Fine-tuned models suffer catastrophic forgetting: optimizing heavily for a new task can degrade performance on the original capabilities. Evaluation requires both domain benchmarks (did the model learn the target task?) and general benchmarks (MMLU, HellaSwag, HumanEval: did it forget general reasoning?). Mixing domain fine-tuning data with a small fraction of general instruction data (5-10%) typically prevents forgetting while maintaining domain specialization.
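The mitigation itself is mechanically simple. This sketch blends a general instruction set into the domain set at a target fraction; the 10% default is a common starting point from the text above, not a universal constant, and should be tuned against your eval suite.

```python
# Blend general data into a domain dataset to guard against forgetting.
import random

def mix(domain, general, general_frac=0.10, seed=0):
    rng = random.Random(seed)
    # Choose n so general examples make up general_frac of the final mix.
    n_general = round(len(domain) * general_frac / (1 - general_frac))
    sample = rng.sample(general, min(n_general, len(general)))
    mixed = domain + sample
    rng.shuffle(mixed)
    return mixed

# 900 domain examples + 100 sampled general examples = 10% general.
blended = mix(list(range(900)), list(range(1000, 2000)))
```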
Architecture Diagram
LLM Fine-Tuning Pipeline: simplified architecture overview
Core Concepts
LoRA Adapters
QLoRA (4-bit)
Supervised Fine-Tuning
DPO vs RLHF
Evaluation Harness
PEFT
Tradeoffs & Design Decisions
Every architectural decision is a tradeoff. Here's what you gain and what you give up.
Strengths
- LoRA reduces trainable parameters by 100-10,000x vs. full fine-tuning
- QLoRA enables fine-tuning 65B models on a single GPU, democratizing LLM customization
- Fine-tuned models outperform prompting on consistent formatting, domain vocabulary, and style
- DPO alignment is stable, simple, and achieves results comparable to RLHF
Weaknesses
- Fine-tuned models degrade on tasks not in the training data; catastrophic forgetting requires mitigation strategies
- Dataset curation is manual and expensive; 1,000 high-quality examples may take weeks to create
- LoRA adapters must be merged or hot-swapped for inference, which adds serving infrastructure complexity
- Evaluating fine-tuning quality requires domain-specific benchmarks that often do not exist
FAANG Interview Questions
Interview Prep: These questions appear in system design rounds at Google, Meta, Amazon, Apple, Netflix, and Microsoft. Focus on tradeoffs, not just what the system does, and study the architecture above before attempting.
Q1. Explain LoRA. How does low-rank decomposition work, and why does it reduce parameters so dramatically?
Q2. Design a fine-tuning pipeline for a medical document summarization model. What data would you collect, and how would you evaluate quality?
Q3. Compare DPO and RLHF for model alignment. What are the practical advantages of DPO?
Q4. How do you prevent catastrophic forgetting when fine-tuning on a narrow domain dataset?
Q5. When is fine-tuning better than prompting, and when is prompting better? Give concrete examples.