Hybrid Search Architecture for LLMs
BM25 + dense vectors, RRF reranking, and query expansion
Key Insight
Reciprocal Rank Fusion is surprisingly effective: fusing rankings from BM25 and vector search with RRF outperforms either alone with no training required.
Request Journey
How It Works
1. User query arrives at the hybrid search pipeline.
2. (Optional) HyDE step: an LLM generates a hypothetical document that would answer the query; this synthetic text is embedded alongside the original query for better retrieval.
3. BM25 path: query terms are looked up in the inverted index, scoring documents by term frequency, inverse document frequency, and length normalization.
4. Dense path: the query is encoded by a bi-encoder embedding model, then HNSW ANN search retrieves the top-K nearest vectors from the dense index.
5. Both ranked lists feed into Reciprocal Rank Fusion: RRF_score(d) = sum over retrievers of 1/(k + rank), with k=60, combining rankings without needing score normalization.
6. The top-50 fused candidates are passed to a cross-encoder reranker, which processes each (query, document) pair through a single transformer for deep token-level interaction scoring.
7. The cross-encoder reorders candidates by relevance, producing the final top-10 results with significantly higher NDCG than either retriever alone.
8. Final results are injected as context into the LLM prompt for grounded answer generation.
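The retrieval-and-fusion core of the steps above can be sketched in a few lines of Python. This is a minimal sketch, not a production implementation: `bm25`, `dense`, and the reranker are hypothetical toy stand-ins injected as callables, and the optional HyDE step is omitted for brevity.

```python
def hybrid_search(query, bm25_search, dense_search, score_pair,
                  k=60, fuse_top=50, final_top=10):
    """Run both retrievers, fuse their rankings with RRF, rerank the fused head."""
    fused = {}
    for ranked in (bm25_search(query), dense_search(query)):
        for rank, doc_id in enumerate(ranked, start=1):  # ranks are 1-based
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    candidates = sorted(fused, key=fused.get, reverse=True)[:fuse_top]
    # Second stage: joint (query, doc) scoring stands in for a cross-encoder.
    reranked = sorted(candidates, key=lambda d: score_pair(query, d), reverse=True)
    return reranked[:final_top]

# Hypothetical toy stand-ins, deterministic so the flow is visible.
bm25 = lambda q: ["d1", "d2", "d3"]
dense = lambda q: ["d3", "d4", "d1"]
scorer = lambda q, d: {"d1": 0.9, "d2": 0.1, "d3": 0.8, "d4": 0.5}[d]

print(hybrid_search("example query", bm25, dense, scorer))
```

In a real pipeline the two retrievers run in parallel; the sequential calls here keep the sketch short.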
The Problem
Pure dense vector search misses exact keyword matches (product IDs, error codes, acronyms), while BM25 keyword search misses semantic paraphrases and intent. In production RAG pipelines, using either approach alone yields 15–30% lower recall than users expect. The challenge is fusing heterogeneous ranking signals (sparse lexical scores and dense cosine similarities operate on entirely different scales) without requiring expensive labeled training data.
The Solution
Hybrid search runs BM25 and dense retrieval in parallel, then fuses results using Reciprocal Rank Fusion (RRF), which combines rankings without needing score normalization. A cross-encoder re-ranker then scores the top-K fused results for final ordering. Query expansion via HyDE (Hypothetical Document Embeddings) generates a synthetic answer with the LLM and embeds it alongside the original query, dramatically improving recall for ambiguous or short queries.
Scale at a Glance
- Recall@10 improvement: +15–30% vs a single retriever
- Re-ranking latency (cross-encoder): 30–80 ms for top-50
- BM25 index throughput: 10,000+ QPS
- Dense retrieval (HNSW): <5 ms p99 at 10M docs
Deep Dive
BM25: The Unbeatable Sparse Baseline
BM25 (Best Matching 25) is a probabilistic ranking function that scores documents based on term frequency, inverse document frequency, and document length normalization. Despite being decades old, BM25 remains surprisingly competitive with neural retrievers for keyword-heavy queries. It excels at exact match scenarios: product SKUs, error codes, proper nouns, and technical jargon that embedding models often tokenize poorly. BM25 operates on inverted indices, enabling sub-millisecond retrieval even over billions of documents. Modern implementations like Elasticsearch's BM25 use skip lists and block-max WAND optimization to prune the search space aggressively. The key parameters, k1 (term frequency saturation) and b (length normalization), can be tuned per-field for optimal performance on domain-specific corpora.
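The scoring described above can be sketched in plain Python. This is one common BM25 variant (with the smoothed Robertson-style IDF); real engines differ in IDF details and add WAND-style pruning over inverted indices rather than scoring every document.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized doc against the query with classic BM25."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N          # average document length
    tfs = [Counter(d) for d in docs]               # per-doc term frequencies
    df = Counter(t for tf in tfs for t in tf)      # document frequency per term
    scores = []
    for d, tf in zip(docs, tfs):
        s = 0.0
        for t in query_terms:
            if t not in tf:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            # k1 saturates term frequency; b controls length normalization
            norm = tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
            s += idf * norm
        scores.append(s)
    return scores

docs = [["error", "code", "E404", "not", "found"],
        ["plumbing", "repair", "guide"],
        ["error", "handling", "best", "practices"]]
print(bm25_scores(["error", "E404"], docs))
```

Note how the rare exact token "E404" dominates the score of the first document, which is exactly the behavior dense retrieval struggles to reproduce.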
Dense Vector Retrieval: Semantic Understanding at Scale
Dense retrieval encodes queries and documents into fixed-dimensional vectors (typically 768 or 1536 dimensions) using bi-encoder models like E5, BGE, or OpenAI's text-embedding-3-large. Approximate nearest neighbor (ANN) search via HNSW or IVF indices finds semantically similar documents regardless of lexical overlap. The key advantage is understanding paraphrases, synonyms, and intent: 'how to fix a broken pipe' matches documents about 'plumbing repair' even without shared keywords. The major weakness is that embedding models compress all semantic information into a single vector, losing fine-grained token-level matching. Production systems must handle index staleness (re-encoding when documents change), dimension reduction (Matryoshka embeddings for storage efficiency), and metadata filtering (pre-filter vs post-filter tradeoffs that affect recall).
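A minimal sketch of the dense path, using exact brute-force cosine similarity in place of an ANN index; HNSW or IVF approximate exactly this computation at scale. The 3-d toy vectors are hypothetical stand-ins for real 768- or 1536-d model embeddings.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def dense_topk(query_vec, doc_vecs, k=2):
    """Exact nearest-neighbor search; HNSW/IVF trade exactness for speed here."""
    ranked = sorted(range(len(doc_vecs)),
                    key=lambda i: cosine(query_vec, doc_vecs[i]),
                    reverse=True)
    return ranked[:k]

# Toy embeddings: imagine these came from a bi-encoder.
doc_vecs = [[0.9, 0.1, 0.0],    # "plumbing repair"
            [0.0, 0.2, 0.95],   # "tax filing"
            [0.85, 0.2, 0.1]]   # "fix a broken pipe"
print(dense_topk([0.9, 0.15, 0.05], doc_vecs))
```

The "plumbing" and "broken pipe" vectors land near each other despite sharing no tokens, which is the semantic matching BM25 cannot provide.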
Reciprocal Rank Fusion: Score-Free Rank Combination
Reciprocal Rank Fusion (RRF) solves the fundamental problem of combining rankings from heterogeneous retrieval systems whose scores are not comparable. BM25 returns unbounded term-weighting scores while dense retrieval returns cosine similarities; normalizing these onto the same scale is fragile and domain-dependent. RRF sidesteps this entirely by using only rank positions: RRF_score(d) = Σ_i 1/(k + rank_i(d)), where k is a constant (typically 60) and rank_i(d) is the rank of document d in retriever i. Documents ranked highly by multiple retrievers get boosted scores. RRF requires no training data, no score calibration, and works out of the box across any combination of retrievers. Research shows RRF consistently outperforms trained linear score combinations, likely because rank-based fusion is more robust to score distribution shifts across queries.
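The formula translates directly to code. The two toy rankings below are illustrative only; any number of ranked lists can be fused the same way.

```python
def rrf_fuse(ranked_lists, k=60):
    """RRF_score(d) = sum over retrievers i of 1 / (k + rank_i(d)); ranks are 1-based."""
    scores = {}
    for ranked in ranked_lists:
        for rank, doc in enumerate(ranked, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["sku-123", "guide-a", "guide-b"]    # exact keyword hits first
dense_ranking = ["guide-a", "guide-c", "sku-123"]   # semantic matches first
print(rrf_fuse([bm25_ranking, dense_ranking]))
```

"guide-a" wins because it appears near the top of both lists, even though neither retriever ranked it higher than "sku-123" did in the BM25 list; that consensus effect is the whole point of RRF.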
Cross-Encoder Re-ranking: The Precision Layer
Cross-encoders process the query and document together through a single transformer pass, enabling deep token-level interaction that bi-encoders cannot achieve. While bi-encoders encode query and document independently (enabling fast ANN search), cross-encoders attend across both simultaneously, capturing fine-grained relevance signals. This makes them too slow for first-stage retrieval (scoring every document would take hours) but ideal for re-ranking the top 20–100 candidates from the hybrid retrieval stage. Modern cross-encoder models like ms-marco-MiniLM or Cohere Rerank v3 achieve significantly higher NDCG@10 than bi-encoders alone. The latency cost is roughly 1–2 ms per document pair on GPU, so re-ranking 50 candidates adds 50–100 ms. Production systems often distill large cross-encoders into smaller models for latency-sensitive applications.
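A sketch of the re-ranking stage with the cross-encoder abstracted behind a scoring callable. `toy_score` is a hypothetical stand-in (simple term overlap); a real system would replace it with a model forward pass, e.g. via an ms-marco-MiniLM cross-encoder.

```python
def rerank(query, candidates, score_pair, top_n=10):
    """Second-stage precision layer: score each (query, doc) pair jointly, keep top_n."""
    scored = sorted(candidates, key=lambda doc: score_pair(query, doc), reverse=True)
    return scored[:top_n]

def toy_score(query, doc):
    # Hypothetical stand-in: fraction of query terms present in the doc.
    # A real cross-encoder attends over the concatenated (query, doc) tokens.
    q = set(query.lower().split())
    return len(q & set(doc.lower().split())) / max(len(q), 1)

docs = ["raft leader election and log replication",
        "gardening tips for spring",
        "raft consensus explained"]
print(rerank("raft consensus", docs, toy_score, top_n=2))
```

Because the scorer sees the full pair, it can be arbitrarily expensive per document; the budget is controlled by how many fused candidates reach this stage.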
HyDE: Query Expansion via Hypothetical Documents
Hypothetical Document Embeddings (HyDE) addresses a fundamental asymmetry in retrieval: queries are short and underspecified while documents are long and detailed. HyDE prompts an LLM to generate a hypothetical answer to the query, then embeds this synthetic document for retrieval instead of (or alongside) the original query. The intuition is that a hypothetical answer, even if factually imperfect, is closer in embedding space to relevant documents than the short query. For example, the query 'RAFT consensus' might generate a paragraph about leader election and log replication, which embeds much closer to actual Raft papers. HyDE improves recall by 10–20% on average for ambiguous queries, at the cost of one additional LLM call per query (typically 200–500 ms). Multi-HyDE generates multiple hypothetical documents and averages their embeddings for even better coverage of diverse intents behind a single query.
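The Multi-HyDE variant reduces to generating n hypotheses, embedding each, and averaging the vectors. In this sketch, `generate_hypothesis` and `embed` are hypothetical deterministic stubs standing in for an LLM call and a bi-encoder, so the flow runs without a model.

```python
def multi_hyde_vector(query, generate_hypothesis, embed, n=3):
    """Embed n hypothetical answers and average them componentwise."""
    vecs = [embed(generate_hypothesis(query)) for _ in range(n)]
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

# Hypothetical stubs: a real system would sample an LLM (with temperature > 0,
# so the n hypotheses differ) and call an embedding model.
def generate_hypothesis(query):
    return f"A detailed passage answering: {query}"

def embed(text):
    return [len(text) % 7, text.count("a"), 1.0]

print(multi_hyde_vector("RAFT consensus", generate_hypothesis, embed))
```

The averaged vector is then sent down the dense retrieval path in place of (or alongside) the raw query embedding.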
Architecture Diagram
[Figure: Hybrid Search Architecture for LLMs, simplified architecture overview]
Core Concepts
BM25 (TF-IDF)
Dense Vector Search
Reciprocal Rank Fusion
Cross-Encoder Re-ranking
HyDE
Query Expansion
Tradeoffs & Design Decisions
Every architectural decision is a tradeoff. Here's what you gain and what you give up.
Strengths
- RRF requires no labeled training data and works out of the box across any retriever combination
- The hybrid approach captures both exact keyword matches and semantic paraphrases, improving recall by 15–30%
- Cross-encoder re-ranking adds a high-precision layer that significantly boosts NDCG@10 over retrieval alone
- The architecture is modular: each component (BM25, dense, re-ranker) can be upgraded independently
Weaknesses
- Running two parallel retrieval pipelines doubles index storage and maintenance overhead
- Cross-encoder re-ranking adds 30–80 ms of latency, which may be unacceptable for real-time autocomplete scenarios
- HyDE query expansion requires an LLM call per query, adding 200–500 ms of latency and inference cost
- Tuning the RRF constant k and the number of candidates per retriever requires empirical experimentation per domain
FAANG Interview Questions
Interview Prep: questions like these appear in system design rounds at companies such as Google, Meta, Amazon, Apple, Netflix, and Microsoft. Focus on tradeoffs, not just what the system does, and study the architecture above before attempting them.
Q1. Design a search system that handles both exact keyword matches (product IDs, error codes) and semantic queries. How would you combine BM25 and vector search?
Q2. Your hybrid search recall is good but latency is too high. Walk through optimization strategies at each stage of the pipeline.
Q3. Explain the tradeoff between bi-encoder and cross-encoder models for retrieval. Why not use cross-encoders everywhere?
Q4. A user query is ambiguous: 'Python' could mean the language or the snake. How would you handle query disambiguation in a RAG pipeline?
Q5. How would you evaluate and monitor a hybrid search system in production? What metrics matter beyond simple accuracy?
Research Papers & Further Reading
Gao, L., et al. "Precise Zero-Shot Dense Retrieval without Relevance Labels" (HyDE)