RAG Pipeline Architecture
Retrieval-Augmented Generation from PDF to production
Key Insight
Chunk size is RAG's most critical hyperparameter: too small loses context, too large dilutes the relevance signal.
Request Journey
How It Works
1. Documents (PDF, HTML, Markdown) are parsed and cleaned by a document processor (e.g., Unstructured)
2. A recursive text splitter chunks documents into 256-512 token segments with overlap to preserve context at boundaries
3. Each chunk is passed through an embedding model (e.g., text-embedding-3-small, 1536-dim) producing a dense vector
4. Vectors are inserted into a vector database with an HNSW index for sub-millisecond ANN search
5. At query time, the user query is encoded with the same embedding model
6. ANN search retrieves top-K candidate chunks (typically K=20-50) based on cosine similarity
7. A cross-encoder reranker scores each (query, chunk) pair with full token-level attention, reordering to the top-N most relevant
8. A prompt builder assembles system instructions + retrieved context + user query into a structured prompt
9. The LLM generates an answer grounded in the retrieved context, with inline citations
10. A hallucination evaluator checks each claim against source chunks using NLI, flagging unsupported assertions
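Step 8, prompt assembly, reduces to a plain template function. A minimal sketch; the section labels and `[n]` citation convention here are illustrative assumptions, not a fixed format:

```python
def build_prompt(system: str, chunks: list[str], query: str) -> str:
    """Assemble system instructions, retrieved context, and the user query
    into one structured prompt, numbering chunks so the LLM can cite them."""
    context = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (
        f"{system}\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\n"
        f"Answer using only the context above, citing sources like [1]."
    )
```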
The Problem
Large language models have knowledge cutoffs and hallucinate facts confidently. Retraining a model costs millions of dollars and takes months, making it impractical to keep models updated with new documents, proprietary knowledge bases, or real-time information.
The Solution
Retrieval-Augmented Generation (RAG) keeps model weights frozen and instead injects relevant context at inference time. When a query arrives, a retrieval system searches a vector database of document embeddings, finds the K most relevant chunks, and appends them to the LLM prompt. The LLM generates its response grounded in retrieved facts, dramatically reducing hallucination.
Scale at a Glance
- Typical chunk size: 200-500 tokens
- Retrieval latency: 5-20ms
- Re-ranking precision gain: +20-30%
- Hallucination reduction: ~50-70%
Deep Dive
Document Ingestion and Chunking Strategy
Raw documents (PDFs, HTML, DOCX) are parsed into text, then chunked into segments of 200-1000 tokens each. Chunking strategy dramatically affects retrieval quality: too-small chunks lose context, too-large chunks dilute relevance. Recursive character splitting (split at paragraphs, then sentences, then words) produces more natural chunks than fixed-size windows. Sliding window chunking with overlap ensures concepts spanning chunk boundaries are retrievable.
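Recursive character splitting can be sketched in a few lines. This version measures size in characters rather than tokens and omits the sliding-window overlap, both simplifications for brevity:

```python
def recursive_split(text: str, max_len: int = 500,
                    separators: tuple[str, ...] = ("\n\n", "\n", ". ", " ")) -> list[str]:
    """Split on the coarsest separator present (paragraphs, then sentences,
    then words), recursing to finer separators only for oversized pieces."""
    if len(text) <= max_len:
        return [text]
    for sep in separators:
        if sep in text:
            chunks, buf = [], ""
            for piece in text.split(sep):
                candidate = f"{buf}{sep}{piece}" if buf else piece
                if len(candidate) <= max_len:
                    buf = candidate          # greedily pack pieces together
                elif len(piece) > max_len:
                    if buf:
                        chunks.append(buf)
                    chunks.extend(recursive_split(piece, max_len, separators))
                    buf = ""
                else:
                    if buf:
                        chunks.append(buf)
                    buf = piece
            if buf:
                chunks.append(buf)
            return chunks
    # No separator at all: fall back to a hard fixed-size cut.
    return [text[i:i + max_len] for i in range(0, len(text), max_len)]
```

Splitting at paragraph boundaries first is what keeps chunks semantically coherent; the hard cut is only a last resort for separator-free text.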
Embedding Models and Vector Stores
Text chunks are converted to dense vectors using an embedding model (OpenAI text-embedding-3-small, Cohere embed-v3, E5-large). A 1536-dimensional embedding represents the semantic meaning of a chunk. These vectors are indexed in a vector database (Pinecone, Weaviate, pgvector) using HNSW for approximate nearest neighbor search. Retrieval time is typically 5-20ms for billion-scale indexes.
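Under the hood, retrieval is nearest-neighbor search over normalized vectors. A brute-force NumPy sketch of the exact ranking that an HNSW index approximates in sub-linear time (the embedding matrix here is a placeholder for real model output):

```python
import numpy as np

def top_k(query_vec: np.ndarray, index: np.ndarray, k: int = 5) -> np.ndarray:
    """Exact cosine-similarity search over an (n_chunks, dim) embedding matrix.
    Normalizing both sides turns cosine similarity into a dot product."""
    q = query_vec / np.linalg.norm(query_vec)
    m = index / np.linalg.norm(index, axis=1, keepdims=True)
    return np.argsort(-(m @ q))[:k]  # indices of the k most similar chunks
```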
Hybrid Retrieval: Dense + Sparse
Pure vector search misses exact keyword matches (product codes, names, error messages). Production RAG systems combine dense vector search with sparse BM25 retrieval in parallel, then fuse results using Reciprocal Rank Fusion (RRF). RRF takes the rank position from each retriever and combines them: score = sum of 1/(k + rank_i). No score normalization is needed; ranks are comparable across retrievers.
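The RRF formula above fits in a few lines; k=60 is the constant used in the original RRF paper:

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: score(d) = sum over retrievers of 1/(k + rank_d),
    where rank_d is d's 1-based position in each retriever's result list."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Example: "b" ranks high in both lists, so fusion promotes it to the top.
dense = ["a", "b", "c"]   # vector-search ranking
sparse = ["b", "c", "d"]  # BM25 ranking
fused = rrf([dense, sparse])
```

Documents appearing in only one list still receive a score, so neither retriever can be silently ignored.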
Re-ranking for Precision
The top-K retrieved chunks go through a cross-encoder re-ranker that scores each chunk against the query with full bidirectional attention (unlike the bi-encoder used for initial retrieval). Cross-encoders are 10x slower but 20-30% more accurate. Re-ranking with a Cohere or Jina re-ranker on the top-20 retrieved results, returning the top-5 to the LLM, is a common production pattern.
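The re-ranking step itself is just score-and-sort. In this sketch, `score_fn` is a stand-in for a real cross-encoder's scoring call; the word-overlap toy scorer below is only for illustration:

```python
from typing import Callable

def rerank(query: str, chunks: list[str],
           score_fn: Callable[[str, str], float], top_n: int = 5) -> list[str]:
    """Score every (query, chunk) pair and keep the top-N.
    Each score is one cross-encoder forward pass, hence the latency cost."""
    return sorted(chunks, key=lambda c: score_fn(query, c), reverse=True)[:top_n]

# Toy scorer: shared-word count. A real cross-encoder attends over the full
# concatenated (query, chunk) token sequence instead.
def overlap(query: str, chunk: str) -> float:
    return len(set(query.lower().split()) & set(chunk.lower().split()))
```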
Evaluation and Quality Measurement
RAG quality has two components: retrieval quality (did we retrieve the right chunks?) and generation quality (did the LLM correctly use the retrieved context?). RAGAS is a popular evaluation framework measuring faithfulness (is the answer grounded in the context?), answer relevancy, and context precision/recall. A/B testing chunk sizes, embedding models, and top-K values against RAGAS scores drives systematic improvement.
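When a labeled set of relevant chunks exists, context precision and recall can be computed directly. RAGAS estimates these with an LLM judge instead of labels, but the underlying definitions are the same:

```python
def context_precision(retrieved: list[str], relevant: set[str]) -> float:
    """Fraction of retrieved chunks that are actually relevant."""
    if not retrieved:
        return 0.0
    return sum(1 for c in retrieved if c in relevant) / len(retrieved)

def context_recall(retrieved: list[str], relevant: set[str]) -> float:
    """Fraction of the relevant chunks that made it into the retrieved set."""
    if not relevant:
        return 1.0
    return sum(1 for c in relevant if c in set(retrieved)) / len(relevant)
```

Tracking both matters: raising top-K tends to trade precision for recall, which is exactly the kind of tradeoff A/B testing against these scores surfaces.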
Architecture Diagram
RAG Pipeline Architecture (simplified architecture overview)
Core Concepts
- Document Chunking
- Text Embeddings
- Vector Search (ANN)
- BM25 Sparse Retrieval
- Reciprocal Rank Fusion
- Re-ranking
Tradeoffs & Design Decisions
Every architectural decision is a tradeoff. Here's what you gain and what you give up.
Strengths
- Zero retraining cost: update the knowledge base without touching model weights
- Provable grounding: answers can cite source documents, reducing hallucination
- Works with any LLM as a black box; retrieval is model-agnostic
- Incremental updates: add new documents in seconds without reprocessing the full corpus
Weaknesses
- Retrieval failures propagate: if the wrong chunks are retrieved, the LLM generates a confident wrong answer
- Chunking is fragile: table data, code, and multi-document reasoning are poorly served by simple chunking
- Context window limits constrain how many chunks fit: 5-10 chunks at 500 tokens each consumes a significant portion of the context window
- Latency overhead: retrieval plus re-ranking adds 50-200ms to every query
FAANG Interview Questions
These questions appear in system design rounds at Google, Meta, Amazon, Apple, Netflix, and Microsoft. Focus on tradeoffs, not just what the system does, and study the architecture above before attempting.
- Q1: Design a RAG system for a 10,000-document legal knowledge base. Walk through every component from ingestion to generation.
- Q2: What is the most important hyperparameter in a RAG system, and why? How would you tune it?
- Q3: Explain hybrid retrieval. Why does combining BM25 and vector search outperform either alone?
- Q4: A RAG system is producing hallucinated answers despite retrieving relevant chunks. What could be causing this?
- Q5: How would you evaluate RAG quality systematically? What metrics would you track in production?
Research Papers & Further Reading
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
Lewis, P. et al., 2020 (Facebook AI Research)