RAG Pipeline Architecture
Retrieval-Augmented Generation from PDF to production
Key Insight
Chunk size is RAG's most critical hyperparameter: too small loses context, too large dilutes the relevance signal.
Request Journey
How It Works
1. Documents (PDF, HTML, Markdown) are parsed and cleaned by a document processor (e.g., Unstructured)
2. A recursive text splitter chunks documents into 256-512 token segments with overlap to preserve context at boundaries
3. Each chunk is passed through an embedding model (e.g., text-embedding-3-small, 1536-dim) producing a dense vector
4. Vectors are inserted into a vector database with an HNSW index for sub-millisecond ANN search
5. At query time, the user query is encoded with the same embedding model
6. ANN search retrieves top-K candidate chunks (typically K=20-50) based on cosine similarity
7. A cross-encoder reranker scores each (query, chunk) pair with full token-level attention, reordering to the top-N most relevant
8. A prompt builder assembles system instructions + retrieved context + user query into a structured prompt
9. The LLM generates an answer grounded in the retrieved context, with inline citations
10. A hallucination evaluator checks each claim against source chunks using NLI, flagging unsupported assertions
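Step 8, prompt assembly, reduces to a plain template function. A minimal sketch; the section labels and `[n]` citation convention here are illustrative assumptions, not a fixed format:

```python
def build_prompt(system: str, chunks: list[str], query: str) -> str:
    """Assemble system instructions, retrieved context, and the user query
    into one structured prompt, numbering chunks so the LLM can cite them."""
    context = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (
        f"{system}\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\n"
        f"Answer using only the context above, citing sources like [1]."
    )
```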
The Problem
Large language models have knowledge cutoffs and hallucinate facts confidently. Retraining a model costs millions of dollars and takes months, making it impractical to keep models updated with new documents, proprietary knowledge bases, or real-time information.
The Solution
Retrieval-Augmented Generation (RAG) keeps model weights frozen and instead injects relevant context at inference time. When a query arrives, a retrieval system searches a vector database of document embeddings, finds the K most relevant chunks, and appends them to the LLM prompt. The LLM generates its response grounded in retrieved facts, dramatically reducing hallucination.
Scale at a Glance
- Typical chunk size: 200-500 tokens
- Retrieval latency: 5-20ms
- Re-ranking precision gain: +20-30%
- Hallucination reduction: ~50-70%
Deep Dive
Document Ingestion and Chunking Strategy
Raw documents (PDFs, HTML, DOCX) are parsed into text, then chunked into segments of 200-1000 tokens each. Chunking strategy dramatically affects retrieval quality: too-small chunks lose context, too-large chunks dilute relevance. Recursive character splitting (split at paragraphs, then sentences, then words) produces more natural chunks than fixed-size windows. Sliding window chunking with overlap ensures concepts spanning chunk boundaries are retrievable.
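Recursive character splitting can be sketched in a few lines. This version measures size in characters rather than tokens and omits the sliding-window overlap, both simplifications for brevity:

```python
def recursive_split(text: str, max_len: int = 500,
                    separators: tuple[str, ...] = ("\n\n", "\n", ". ", " ")) -> list[str]:
    """Split on the coarsest separator present (paragraphs, then sentences,
    then words), recursing to finer separators only for oversized pieces."""
    if len(text) <= max_len:
        return [text]
    for sep in separators:
        if sep in text:
            chunks, buf = [], ""
            for piece in text.split(sep):
                candidate = f"{buf}{sep}{piece}" if buf else piece
                if len(candidate) <= max_len:
                    buf = candidate          # greedily pack pieces together
                elif len(piece) > max_len:
                    if buf:
                        chunks.append(buf)
                    chunks.extend(recursive_split(piece, max_len, separators))
                    buf = ""
                else:
                    if buf:
                        chunks.append(buf)
                    buf = piece
            if buf:
                chunks.append(buf)
            return chunks
    # No separator at all: fall back to a hard fixed-size cut.
    return [text[i:i + max_len] for i in range(0, len(text), max_len)]
```

Splitting at paragraph boundaries first is what keeps chunks semantically coherent; the hard cut is only a last resort for separator-free text.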
Embedding Models and Vector Stores
Text chunks are converted to dense vectors using an embedding model (OpenAI text-embedding-3-small, Cohere embed-v3, E5-large). A 1536-dimensional embedding represents the semantic meaning of a chunk. These vectors are indexed in a vector database (Pinecone, Weaviate, pgvector) using HNSW for approximate nearest neighbor search. Retrieval time is typically 5-20ms for billion-scale indexes.
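Under the hood, retrieval is nearest-neighbor search over normalized vectors. A brute-force NumPy sketch of the exact ranking that an HNSW index approximates in sub-linear time (the embedding matrix here is a placeholder for real model output):

```python
import numpy as np

def top_k(query_vec: np.ndarray, index: np.ndarray, k: int = 5) -> np.ndarray:
    """Exact cosine-similarity search over an (n_chunks, dim) embedding matrix.
    Normalizing both sides turns cosine similarity into a dot product."""
    q = query_vec / np.linalg.norm(query_vec)
    m = index / np.linalg.norm(index, axis=1, keepdims=True)
    return np.argsort(-(m @ q))[:k]  # indices of the k most similar chunks
```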
Hybrid Retrieval: Dense + Sparse
Pure vector search misses exact keyword matches (product codes, names, error messages). Production RAG systems combine dense vector search with sparse BM25 retrieval in parallel, then fuse results using Reciprocal Rank Fusion (RRF). RRF takes the rank position from each retriever and combines them: score = sum of 1/(k + rank_i). No score normalization is needed; ranks are comparable across retrievers.
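The RRF formula above fits in a few lines; k=60 is the constant used in the original RRF paper:

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: score(d) = sum over retrievers of 1/(k + rank_d),
    where rank_d is d's 1-based position in each retriever's result list."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Example: "b" ranks high in both lists, so fusion promotes it to the top.
dense = ["a", "b", "c"]   # vector-search ranking
sparse = ["b", "c", "d"]  # BM25 ranking
fused = rrf([dense, sparse])
```

Documents appearing in only one list still receive a score, so neither retriever can be silently ignored.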
Re-ranking for Precision
The top-K retrieved chunks go through a cross-encoder re-ranker that scores each chunk against the query with full bidirectional attention (unlike the bi-encoder used for initial retrieval). Cross-encoders are 10x slower but 20-30% more accurate. Re-ranking with a Cohere or Jina re-ranker on the top-20 retrieved results, returning the top-5 to the LLM, is a common production pattern.
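The re-ranking step itself is just score-and-sort. In this sketch, `score_fn` is a stand-in for a real cross-encoder's scoring call; the word-overlap toy scorer below is only for illustration:

```python
from typing import Callable

def rerank(query: str, chunks: list[str],
           score_fn: Callable[[str, str], float], top_n: int = 5) -> list[str]:
    """Score every (query, chunk) pair and keep the top-N.
    Each score is one cross-encoder forward pass, hence the latency cost."""
    return sorted(chunks, key=lambda c: score_fn(query, c), reverse=True)[:top_n]

# Toy scorer: shared-word count. A real cross-encoder attends over the full
# concatenated (query, chunk) token sequence instead.
def overlap(query: str, chunk: str) -> float:
    return len(set(query.lower().split()) & set(chunk.lower().split()))
```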
Evaluation and Quality Measurement
RAG quality has two components: retrieval quality (did we retrieve the right chunks?) and generation quality (did the LLM correctly use the retrieved context?). RAGAS is a popular evaluation framework measuring faithfulness (is the answer grounded in the context?), answer relevancy, and context precision/recall. A/B testing chunk sizes, embedding models, and top-K values against RAGAS scores drives systematic improvement.
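When a labeled set of relevant chunks exists, context precision and recall can be computed directly. RAGAS estimates these with an LLM judge instead of labels, but the underlying definitions are the same:

```python
def context_precision(retrieved: list[str], relevant: set[str]) -> float:
    """Fraction of retrieved chunks that are actually relevant."""
    if not retrieved:
        return 0.0
    return sum(1 for c in retrieved if c in relevant) / len(retrieved)

def context_recall(retrieved: list[str], relevant: set[str]) -> float:
    """Fraction of the relevant chunks that made it into the retrieved set."""
    if not relevant:
        return 1.0
    return sum(1 for c in relevant if c in set(retrieved)) / len(relevant)
```

Tracking both matters: raising top-K tends to trade precision for recall, which is exactly the kind of tradeoff A/B testing against these scores surfaces.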
Architecture Diagram
RAG Pipeline Architecture (simplified architecture overview)
Core Concepts
- Document Chunking
- Text Embeddings
- Vector Search (ANN)
- BM25 Sparse Retrieval
- Reciprocal Rank Fusion
- Re-ranking
Tradeoffs & Design Decisions
Every architectural decision is a tradeoff. Here's what you gain and what you give up.
Strengths
- Zero retraining cost: update the knowledge base without touching model weights
- Provable grounding: answers can cite source documents, reducing hallucination
- Works with any LLM as a black box; retrieval is model-agnostic
- Incremental updates: add new documents in seconds without reprocessing the full corpus
Weaknesses
- Retrieval failures propagate: if the wrong chunks are retrieved, the LLM generates a confident wrong answer
- Chunking is fragile: table data, code, and multi-document reasoning are poorly served by simple chunking
- Context window limits constrain how many chunks fit: 5-10 chunks at 500 tokens each consumes a significant portion of the context window
- Latency overhead: retrieval plus re-ranking adds 50-200ms to every query
FAANG Interview Questions
These questions appear in system design rounds at Google, Meta, Amazon, Apple, Netflix, and Microsoft. Focus on tradeoffs, not just what the system does, and study the architecture above before attempting.
- Q1: Design a RAG system for a 10,000-document legal knowledge base. Walk through every component from ingestion to generation.
- Q2: What is the most important hyperparameter in a RAG system, and why? How would you tune it?
- Q3: Explain hybrid retrieval. Why does combining BM25 and vector search outperform either alone?
- Q4: A RAG system is producing hallucinated answers despite retrieving relevant chunks. What could be causing this?
- Q5: How would you evaluate RAG quality systematically? What metrics would you track in production?
Research Papers & Further Reading
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
Lewis, P. et al., 2020 (Facebook AI Research)