Hybrid Search Architecture for LLMs
BM25 + dense vectors, RRF reranking, and query expansion
Key Insight
Reciprocal Rank Fusion is surprisingly effective: fusing rankings from BM25 and vector search with RRF outperforms either alone with no training required.
Request Journey
How It Works
1. User query arrives at the hybrid search pipeline.
2. (Optional) HyDE step: an LLM generates a hypothetical document that would answer the query; this synthetic text is embedded alongside the original query for better retrieval.
3. BM25 path: query terms are looked up in the inverted index, scoring documents by term frequency, inverse document frequency, and length normalization.
4. Dense path: the query is encoded by a bi-encoder embedding model, then HNSW ANN search retrieves the top-K nearest vectors from the dense index.
5. Both ranked lists feed into Reciprocal Rank Fusion: RRF_score(d) = sum over retrievers of 1/(k + rank), with k=60, combining rankings without needing score normalization.
6. The top-50 fused candidates are passed to a cross-encoder reranker, which processes each (query, document) pair through a single transformer for deep token-level interaction scoring.
7. The cross-encoder reorders candidates by relevance, producing the final top-10 results with significantly higher NDCG than either retriever alone.
8. Final results are injected as context into the LLM prompt for grounded answer generation.
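The retrieval-and-fusion core of the steps above can be sketched in a few lines of Python. This is a minimal sketch, not a production implementation: `bm25`, `dense`, and the reranker are hypothetical toy stand-ins injected as callables, and the optional HyDE step is omitted for brevity.

```python
def hybrid_search(query, bm25_search, dense_search, score_pair,
                  k=60, fuse_top=50, final_top=10):
    """Run both retrievers, fuse their rankings with RRF, rerank the fused head."""
    fused = {}
    for ranked in (bm25_search(query), dense_search(query)):
        for rank, doc_id in enumerate(ranked, start=1):  # ranks are 1-based
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    candidates = sorted(fused, key=fused.get, reverse=True)[:fuse_top]
    # Second stage: joint (query, doc) scoring stands in for a cross-encoder.
    reranked = sorted(candidates, key=lambda d: score_pair(query, d), reverse=True)
    return reranked[:final_top]

# Hypothetical toy stand-ins, deterministic so the flow is visible.
bm25 = lambda q: ["d1", "d2", "d3"]
dense = lambda q: ["d3", "d4", "d1"]
scorer = lambda q, d: {"d1": 0.9, "d2": 0.1, "d3": 0.8, "d4": 0.5}[d]

print(hybrid_search("example query", bm25, dense, scorer))
```

In a real pipeline the two retrievers run in parallel; the sequential calls here keep the sketch short.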
The Problem
Pure dense vector search misses exact keyword matches (product IDs, error codes, acronyms), while BM25 keyword search misses semantic paraphrases and intent. In production RAG pipelines, using either approach alone yields 15–30% lower recall than users expect. The challenge is fusing heterogeneous ranking signals (sparse lexical scores and dense cosine similarities operate on entirely different scales) without requiring expensive labeled training data.
The Solution
Hybrid search runs BM25 and dense retrieval in parallel, then fuses results using Reciprocal Rank Fusion (RRF), which combines rankings without needing score normalization. A cross-encoder re-ranker then scores the top-K fused results for final ordering. Query expansion via HyDE (Hypothetical Document Embeddings) generates a synthetic answer with the LLM and embeds it alongside the original query, dramatically improving recall for ambiguous or short queries.
Scale at a Glance
- Recall@10 improvement: +15–30% vs a single retriever
- Re-ranking latency (cross-encoder): 30–80 ms for top-50
- BM25 index throughput: 10,000+ QPS
- Dense retrieval (HNSW): <5 ms p99 at 10M docs
Deep Dive
BM25: The Unbeatable Sparse Baseline
BM25 (Best Matching 25) is a probabilistic ranking function that scores documents based on term frequency, inverse document frequency, and document length normalization. Despite being decades old, BM25 remains surprisingly competitive with neural retrievers for keyword-heavy queries. It excels at exact match scenarios: product SKUs, error codes, proper nouns, and technical jargon that embedding models often tokenize poorly. BM25 operates on inverted indices, enabling sub-millisecond retrieval even over billions of documents. Modern implementations like Elasticsearch's BM25 use skip lists and block-max WAND optimization to prune the search space aggressively. The key parameters, k1 (term frequency saturation) and b (length normalization), can be tuned per-field for optimal performance on domain-specific corpora.
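The scoring described above can be sketched in plain Python. This is one common BM25 variant (with the smoothed Robertson-style IDF); real engines differ in IDF details and add WAND-style pruning over inverted indices rather than scoring every document.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized doc against the query with classic BM25."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N          # average document length
    tfs = [Counter(d) for d in docs]               # per-doc term frequencies
    df = Counter(t for tf in tfs for t in tf)      # document frequency per term
    scores = []
    for d, tf in zip(docs, tfs):
        s = 0.0
        for t in query_terms:
            if t not in tf:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            # k1 saturates term frequency; b controls length normalization
            norm = tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
            s += idf * norm
        scores.append(s)
    return scores

docs = [["error", "code", "E404", "not", "found"],
        ["plumbing", "repair", "guide"],
        ["error", "handling", "best", "practices"]]
print(bm25_scores(["error", "E404"], docs))
```

Note how the rare exact token "E404" dominates the score of the first document, which is exactly the behavior dense retrieval struggles to reproduce.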
Dense Vector Retrieval: Semantic Understanding at Scale
Dense retrieval encodes queries and documents into fixed-dimensional vectors (typically 768 or 1536 dimensions) using bi-encoder models like E5, BGE, or OpenAI's text-embedding-3-large. Approximate nearest neighbor (ANN) search via HNSW or IVF indices finds semantically similar documents regardless of lexical overlap. The key advantage is understanding paraphrases, synonyms, and intent: 'how to fix a broken pipe' matches documents about 'plumbing repair' even without shared keywords. The major weakness is that embedding models compress all semantic information into a single vector, losing fine-grained token-level matching. Production systems must handle index staleness (re-encoding when documents change), dimension reduction (Matryoshka embeddings for storage efficiency), and metadata filtering (pre-filter vs post-filter tradeoffs that affect recall).
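A minimal sketch of the dense path, using exact brute-force cosine similarity in place of an ANN index; HNSW or IVF approximate exactly this computation at scale. The 3-d toy vectors are hypothetical stand-ins for real 768- or 1536-d model embeddings.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def dense_topk(query_vec, doc_vecs, k=2):
    """Exact nearest-neighbor search; HNSW/IVF trade exactness for speed here."""
    ranked = sorted(range(len(doc_vecs)),
                    key=lambda i: cosine(query_vec, doc_vecs[i]),
                    reverse=True)
    return ranked[:k]

# Toy embeddings: imagine these came from a bi-encoder.
doc_vecs = [[0.9, 0.1, 0.0],    # "plumbing repair"
            [0.0, 0.2, 0.95],   # "tax filing"
            [0.85, 0.2, 0.1]]   # "fix a broken pipe"
print(dense_topk([0.9, 0.15, 0.05], doc_vecs))
```

The "plumbing" and "broken pipe" vectors land near each other despite sharing no tokens, which is the semantic matching BM25 cannot provide.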
Reciprocal Rank Fusion: Score-Free Rank Combination
Reciprocal Rank Fusion (RRF) solves the fundamental problem of combining rankings from heterogeneous retrieval systems whose scores are not comparable. BM25 returns unbounded term-weighting scores while dense retrieval returns cosine similarities; normalizing these onto the same scale is fragile and domain-dependent. RRF sidesteps this entirely by using only rank positions: RRF_score(d) = Σ_i 1/(k + rank_i(d)), where k is a constant (typically 60) and rank_i(d) is the rank of document d in retriever i. Documents ranked highly by multiple retrievers get boosted scores. RRF requires no training data, no score calibration, and works out of the box across any combination of retrievers. Research shows RRF consistently outperforms trained linear score combinations, likely because rank-based fusion is more robust to score distribution shifts across queries.
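The formula translates directly to code. The two toy rankings below are illustrative only; any number of ranked lists can be fused the same way.

```python
def rrf_fuse(ranked_lists, k=60):
    """RRF_score(d) = sum over retrievers i of 1 / (k + rank_i(d)); ranks are 1-based."""
    scores = {}
    for ranked in ranked_lists:
        for rank, doc in enumerate(ranked, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["sku-123", "guide-a", "guide-b"]    # exact keyword hits first
dense_ranking = ["guide-a", "guide-c", "sku-123"]   # semantic matches first
print(rrf_fuse([bm25_ranking, dense_ranking]))
```

"guide-a" wins because it appears near the top of both lists, even though neither retriever ranked it higher than "sku-123" did in the BM25 list; that consensus effect is the whole point of RRF.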
Cross-Encoder Re-ranking: The Precision Layer
Cross-encoders process the query and document together through a single transformer pass, enabling deep token-level interaction that bi-encoders cannot achieve. While bi-encoders encode query and document independently (enabling fast ANN search), cross-encoders attend across both simultaneously, capturing fine-grained relevance signals. This makes them too slow for first-stage retrieval (scoring every document would take hours) but ideal for re-ranking the top 20–100 candidates from the hybrid retrieval stage. Modern cross-encoder models like ms-marco-MiniLM or Cohere Rerank v3 achieve significantly higher NDCG@10 than bi-encoders alone. The latency cost is roughly 1–2 ms per document pair on GPU, so re-ranking 50 candidates adds 50–100 ms. Production systems often distill large cross-encoders into smaller models for latency-sensitive applications.
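A sketch of the re-ranking stage with the cross-encoder abstracted behind a scoring callable. `toy_score` is a hypothetical stand-in (simple term overlap); a real system would replace it with a model forward pass, e.g. via an ms-marco-MiniLM cross-encoder.

```python
def rerank(query, candidates, score_pair, top_n=10):
    """Second-stage precision layer: score each (query, doc) pair jointly, keep top_n."""
    scored = sorted(candidates, key=lambda doc: score_pair(query, doc), reverse=True)
    return scored[:top_n]

def toy_score(query, doc):
    # Hypothetical stand-in: fraction of query terms present in the doc.
    # A real cross-encoder attends over the concatenated (query, doc) tokens.
    q = set(query.lower().split())
    return len(q & set(doc.lower().split())) / max(len(q), 1)

docs = ["raft leader election and log replication",
        "gardening tips for spring",
        "raft consensus explained"]
print(rerank("raft consensus", docs, toy_score, top_n=2))
```

Because the scorer sees the full pair, it can be arbitrarily expensive per document; the budget is controlled by how many fused candidates reach this stage.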
HyDE: Query Expansion via Hypothetical Documents
Hypothetical Document Embeddings (HyDE) addresses a fundamental asymmetry in retrieval: queries are short and underspecified while documents are long and detailed. HyDE prompts an LLM to generate a hypothetical answer to the query, then embeds this synthetic document for retrieval instead of (or alongside) the original query. The intuition is that a hypothetical answer, even if factually imperfect, is closer in embedding space to relevant documents than the short query. For example, the query 'RAFT consensus' might generate a paragraph about leader election and log replication, which embeds much closer to actual Raft papers. HyDE improves recall by 10–20% on average for ambiguous queries, at the cost of one additional LLM call per query (typically 200–500 ms). Multi-HyDE generates multiple hypothetical documents and averages their embeddings for even better coverage of diverse intents behind a single query.
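The Multi-HyDE variant reduces to generating n hypotheses, embedding each, and averaging the vectors. In this sketch, `generate_hypothesis` and `embed` are hypothetical deterministic stubs standing in for an LLM call and a bi-encoder, so the flow runs without a model.

```python
def multi_hyde_vector(query, generate_hypothesis, embed, n=3):
    """Embed n hypothetical answers and average them componentwise."""
    vecs = [embed(generate_hypothesis(query)) for _ in range(n)]
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

# Hypothetical stubs: a real system would sample an LLM (with temperature > 0,
# so the n hypotheses differ) and call an embedding model.
def generate_hypothesis(query):
    return f"A detailed passage answering: {query}"

def embed(text):
    return [len(text) % 7, text.count("a"), 1.0]

print(multi_hyde_vector("RAFT consensus", generate_hypothesis, embed))
```

The averaged vector is then sent down the dense retrieval path in place of (or alongside) the raw query embedding.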
Architecture Diagram
[Figure: Hybrid Search Architecture for LLMs, simplified architecture overview]
Core Concepts
BM25 (TF-IDF)
Dense Vector Search
Reciprocal Rank Fusion
Cross-Encoder Re-ranking
HyDE
Query Expansion
Tradeoffs & Design Decisions
Every architectural decision is a tradeoff. Here's what you gain and what you give up.
Strengths
- RRF requires no labeled training data and works out of the box across any retriever combination
- The hybrid approach captures both exact keyword matches and semantic paraphrases, improving recall by 15–30%
- Cross-encoder re-ranking adds a high-precision layer that significantly boosts NDCG@10 over retrieval alone
- The architecture is modular: each component (BM25, dense, re-ranker) can be upgraded independently
Weaknesses
- Running two parallel retrieval pipelines doubles index storage and maintenance overhead
- Cross-encoder re-ranking adds 30–80 ms of latency, which may be unacceptable for real-time autocomplete scenarios
- HyDE query expansion requires an LLM call per query, adding 200–500 ms of latency and inference cost
- Tuning the RRF constant k and the number of candidates per retriever requires empirical experimentation per domain
FAANG Interview Questions
Interview Prep: questions like these appear in system design rounds at companies such as Google, Meta, Amazon, Apple, Netflix, and Microsoft. Focus on tradeoffs, not just what the system does, and study the architecture above before attempting them.
Q1. Design a search system that handles both exact keyword matches (product IDs, error codes) and semantic queries. How would you combine BM25 and vector search?
Q2. Your hybrid search recall is good but latency is too high. Walk through optimization strategies at each stage of the pipeline.
Q3. Explain the tradeoff between bi-encoder and cross-encoder models for retrieval. Why not use cross-encoders everywhere?
Q4. A user query is ambiguous: 'Python' could mean the language or the snake. How would you handle query disambiguation in a RAG pipeline?
Q5. How would you evaluate and monitor a hybrid search system in production? What metrics matter beyond simple accuracy?
Research Papers & Further Reading
Gao, L., et al. "Precise Zero-Shot Dense Retrieval without Relevance Labels" (HyDE)