Vector Database Internals
HNSW, IVF, and ANN search at billion scale
Key Insight
HNSW's 'small world' property, inspired by social-network theory: any vector is reachable from any other vector in O(log N) hops.
Request Journey
How It Works
1. A high-dimensional vector (e.g., 1536-dim from OpenAI) arrives for insertion
2. The vector is optionally quantized via Product Quantization (PQ) or Scalar Quantization (SQ) to reduce its memory footprint
3. The HNSW graph builder assigns the vector a random max layer (drawn from an exponential distribution) and inserts it by connecting to nearest neighbors at each layer via greedy search
4. At query time, search enters the HNSW graph at the top layer's entry point
5. Greedy traversal at each layer finds the closest neighbor, then descends to the next layer using that neighbor as the new entry point
6. At layer 0 (the densest), the algorithm expands the search beam (the ef_search parameter) to collect the top-100 ANN candidates
7. Metadata filters (category, date, tenant) are applied: pre-filtering prunes before the ANN search, post-filtering removes candidates after it
8. Candidates are rescored against the original full-precision vectors for exact distances
9. The final top-K results are returned, with recall@10 typically 95-99% depending on HNSW parameters (M, ef_construction)
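Steps 4-5 above (greedy descent through the layers) can be sketched on a toy index. This is a minimal illustration, not a real HNSW implementation: the "graph" is just points 0..63 on a number line, where layer L links every 2^L-th point, standing in for HNSW's sparse upper layers.

```python
def hnsw_descend(num_layers, neighbors, dist, entry, query):
    """Greedy descent: at each layer, keep moving to any closer neighbor;
    when no neighbor improves, drop down a layer and repeat."""
    cur = entry
    for layer in range(num_layers - 1, -1, -1):
        improved = True
        while improved:
            improved = False
            for nbr in neighbors(layer, cur):
                if dist(nbr, query) < dist(cur, query):
                    cur, improved = nbr, True
    return cur

# Toy index over points 0..63 on a line; layer L links every 2**L-th point,
# so the top layer takes long hops and layer 0 takes single steps.
def neighbors(layer, node):
    step = 2 ** layer
    return [n for n in (node - step, node + step) if 0 <= n <= 63]

dist = lambda a, b: abs(a - b)
found = hnsw_descend(num_layers=4, neighbors=neighbors, dist=dist,
                     entry=0, query=37)
# Long hops at the top layer get close fast; layer 0 finishes the search.
```

The same descent on a real HNSW graph uses vector distances instead of `abs`, and at layer 0 it widens into a beam of ef_search candidates rather than tracking a single point.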
The Problem
Machine learning models generate high-dimensional vector embeddings (1536-dim text, 512-dim images, 768-dim audio) that encode semantic meaning. Finding the most similar vectors to a query vector in a dataset of 1 billion embeddings by brute-force cosine similarity means computing 1 billion dot products per query, roughly 1.5 trillion multiply-adds at 1536 dimensions: far too slow for real-time search.
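The brute-force baseline being replaced looks like this, sketched with numpy on a small dataset (the point is the cost model: every query touches all N vectors):

```python
import numpy as np

def brute_force_top_k(query, vectors, k=10):
    """Exact cosine-similarity search: O(N * d) multiply-adds per query."""
    # Normalize so a dot product equals cosine similarity.
    q = query / np.linalg.norm(query)
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = v @ q                   # N dot products of d dimensions each
    top = np.argsort(-sims)[:k]   # exact, but scans every vector
    return top, sims[top]

rng = np.random.default_rng(0)
db = rng.standard_normal((10_000, 64)).astype(np.float32)
ids, scores = brute_force_top_k(db[42], db, k=3)
# A vector is always its own nearest neighbor under cosine similarity.
```

At 10,000 vectors this is instant; at 1 billion 1536-dim vectors the same scan is the 1.5-trillion-multiply-add query that ANN indexes exist to avoid.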
The Solution
Vector databases use Approximate Nearest Neighbor (ANN) algorithms, primarily HNSW (Hierarchical Navigable Small World graphs) and IVF (Inverted File Index), to answer nearest-neighbor queries in O(log N) time with configurable recall. A 95% recall@10 search on 1 billion vectors typically takes under 10ms, enabling real-time semantic search at billion scale.
Scale at a Glance
< 10ms at 1B
HNSW Search Latency
95-99%
Typical Recall@10
100-400x
PQ Compression
~40-80 bytes
Memory/Vector (HNSW)
Deep Dive
HNSW: The Small-World Graph Index
HNSW builds a multi-layer graph where each node is a vector. The top layer is a sparse long-range graph (few connections, fast traversal); lower layers are progressively denser. A query starts at the top layer's entry point, greedily traverses to the nearest neighbor, descends to the next layer, and repeats. This multi-scale structure gives O(log N) search time. The ef_construction and ef_search parameters control the recall-latency tradeoff at build time and query time respectively.
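The "progressively denser" layering comes from how each node's top layer is drawn at insert time. A minimal sketch of that draw, using the standard level-assignment formula floor(-ln(U) * mL) with the common choice mL = 1/ln(M):

```python
import math, random

def assign_max_layer(m_l=1 / math.log(16), rng=random.random):
    """Draw a node's top layer from an exponential distribution:
    floor(-ln(U) * mL). With mL = 1/ln(M), a fraction 1/M of nodes
    reaches layer 1, 1/M^2 reaches layer 2, and so on."""
    return int(-math.log(rng()) * m_l)

random.seed(0)
levels = [assign_max_layer() for _ in range(100_000)]
frac_layer0 = levels.count(0) / len(levels)
# With M=16, about 15/16 of nodes live only on layer 0; each higher
# layer is ~16x sparser, which is what makes the top layers fast to cross.
```

This geometric thinning is the source of the O(log N) layer count: a billion-vector index with M=16 has only about log16(10^9), roughly 7-8, layers.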
IVF: Clustering-Based Search
IVF pre-clusters the dataset into K clusters (e.g., K=1000 using k-means). At query time, the search only examines the nprobe nearest cluster centroids and their members. With nprobe=10, you search only 1% of the data, an enormous speedup. IVF is more memory-efficient than HNSW (centroids in RAM, vectors on disk with IVF-Flat) but has lower recall for the same latency. IVF-PQ combines IVF with Product Quantization for billion-scale deployment.
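The IVF query path is simple enough to sketch end to end with numpy. This toy skips the k-means training step (centroids are hand-placed) and just shows the two-stage search: rank centroids, then scan only the probed inverted lists.

```python
import numpy as np

def ivf_search(query, centroids, inverted_lists, vectors, nprobe=2, k=5):
    """IVF query: scan only the members of the nprobe nearest clusters."""
    # Stage 1: rank all K centroids by distance to the query (cheap: K << N).
    d2c = np.linalg.norm(centroids - query, axis=1)
    probe = np.argsort(d2c)[:nprobe]
    # Stage 2: exact distances, but only over the probed clusters' members.
    cand = np.concatenate([inverted_lists[c] for c in probe])
    d = np.linalg.norm(vectors[cand] - query, axis=1)
    order = np.argsort(d)[:k]
    return cand[order], d[order]

# Toy dataset: 4 well-separated clusters of 50 points each in 2-D.
rng = np.random.default_rng(1)
centroids = np.array([[0, 0], [10, 0], [0, 10], [10, 10]], dtype=float)
points = np.vstack([c + rng.standard_normal((50, 2)) for c in centroids])
assign = np.repeat(np.arange(4), 50)
inverted_lists = [np.where(assign == c)[0] for c in range(4)]

ids, dists = ivf_search(np.array([9.5, 0.5]), centroids,
                        inverted_lists, points, nprobe=1, k=3)
# With nprobe=1, only the cluster around (10, 0) is ever scanned.
```

The recall risk is also visible here: if the true nearest neighbor sits just across a cluster boundary, nprobe=1 misses it, which is why production systems tune nprobe upward until recall targets are met.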
Product Quantization: Compressing Embeddings
A 1536-dim float32 vector takes 6KB (6,144 bytes). Product Quantization (PQ) splits the vector into M sub-vectors and quantizes each sub-vector to one of 256 centroids (1 byte). A 1536-dim vector thus compresses to M bytes: 384x compression for M=16. Distance computation uses pre-computed lookup tables between query sub-vectors and all centroids. PQ reduces memory from TBs to GBs but costs roughly 10-15% recall.
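Both halves of PQ, the encoding and the lookup-table (asymmetric) distance, fit in a short sketch. The codebooks here are random stand-ins; real systems train them with k-means per sub-space:

```python
import numpy as np

rng = np.random.default_rng(0)
D, M = 1536, 16                  # 16 sub-vectors of 96 dims each
dsub = D // M
# Toy codebooks: 256 centroids per sub-space (real ones come from k-means).
codebooks = rng.standard_normal((M, 256, dsub)).astype(np.float32)

def pq_encode(v):
    """Compress D float32s (6144 bytes) into M uint8 codes (16 bytes here)."""
    subs = v.reshape(M, dsub)
    return np.array([np.argmin(((codebooks[m] - subs[m]) ** 2).sum(1))
                     for m in range(M)], dtype=np.uint8)

def pq_asymmetric_distance(query, codes):
    """ADC: precompute query-to-centroid distance tables once per query,
    then each database vector's distance is just M table lookups."""
    subs = query.reshape(M, dsub)
    tables = ((codebooks - subs[:, None]) ** 2).sum(-1)   # shape (M, 256)
    return sum(tables[m, codes[m]] for m in range(M))

v = rng.standard_normal(D).astype(np.float32)
codes = pq_encode(v)
ratio = v.nbytes / codes.nbytes   # 6144 / 16 = 384x compression
approx_d = pq_asymmetric_distance(v, codes)
```

The tables are built once per query (M x 256 distances) and amortized over millions of candidates, which is why PQ distance evaluation is so much cheaper than full-precision math.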
Filtered Search: The Hard Problem
Production vector search almost always includes metadata filters: find the 10 most similar products that are in stock and cost under $50. The naive approach is vector search then filter. If only 0.1% of vectors match the filter, this requires 1000x more candidates. Pre-filtering (apply metadata filter first, then search the subset) is accurate but slow for large datasets. Most production vector databases implement graph-based filtered search that integrates filters into HNSW traversal.
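The post-filtering failure mode described above can be made concrete with a toy sketch. `ann_search` and `predicate` here are hypothetical stand-ins: the index returns candidates in nearness order, and the filter keeps only every 100th id (1% selectivity), forcing the candidate pool to widen repeatedly.

```python
def post_filter(ann_search, predicate, k, max_ef=4096):
    """Post-filtering: widen the ANN candidate pool until k survivors pass
    the metadata predicate (or a cap is hit). Selective filters force huge
    pools -- the amplification described above."""
    ef, hits = k, []
    while ef <= max_ef:
        hits = [h for h in ann_search(ef) if predicate(h)]
        if len(hits) >= k:
            return hits[:k]
        ef *= 2          # 1% selectivity => ~100x more candidates needed
    return hits          # cap reached: may return fewer than k results

# Toy stand-ins: ids 0..9999 ordered by nearness; filter keeps 1% of them.
ann_search = lambda ef: list(range(min(ef, 10_000)))
predicate = lambda doc_id: doc_id % 100 == 0

result = post_filter(ann_search, predicate, k=5)
```

Counting the loop iterations shows the cost: to produce 5 filtered hits, the pool grows from 5 to 640 candidates, and at 0.1% selectivity it would grow another 10x, which is exactly why integrated (graph-aware) filtering wins for selective predicates.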
Sharding and Distributed Search
A billion-vector index does not fit on one machine: HNSW requires keeping the entire graph in RAM (typically 40-80 bytes per vector). Distributed vector search partitions vectors across shards, runs parallel searches on each, and merges the top-K results from each shard. The merge step is a simple K-way merge of sorted lists. With M shards, each shard must return its full local top-K to guarantee the global top-K, so K*M candidates cross the network per query; this amplifies network traffic for high-K searches.
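The scatter-gather merge itself is a few lines with the standard library. Each shard returns its local top-K as (distance, id) pairs sorted ascending, and the coordinator K-way merges them:

```python
import heapq

def merge_shard_results(shard_results, k):
    """K-way merge of M sorted per-shard top-K lists into the global top-K.
    heapq.merge streams the M sorted lists; nsmallest stops after k items."""
    return heapq.nsmallest(k, heapq.merge(*shard_results))

# Three shards, each returning its local top-3 (distance, id), sorted.
shard_a = [(0.10, "a1"), (0.40, "a2"), (0.90, "a3")]
shard_b = [(0.05, "b1"), (0.50, "b2"), (0.70, "b3")]
shard_c = [(0.20, "c1"), (0.30, "c2"), (0.60, "c3")]

top3 = merge_shard_results([shard_a, shard_b, shard_c], k=3)
```

Note that every shard still did a full top-K search even though only one of its results may survive the merge; that wasted work, plus the K*M candidates on the wire, is the price of partitioning.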
Architecture Diagram
Vector Database Internals: simplified architecture overview
Core Concepts
HNSW Index
IVF Index
ANN Search
Recall vs Latency
Filtered Search
Quantization (PQ/SQ)
Tradeoffs & Design Decisions
Every architectural decision is a tradeoff. Here's what you gain and what you give up.
Strengths
- HNSW achieves 95%+ recall at sub-10ms latency on billion-scale datasets
- Semantic search finds conceptually similar items without exact keyword matches
- Product Quantization enables billion-scale indexes that fit in under 100GB of RAM
- Metadata filtering enables complex queries combining semantic similarity with structured filters
Weaknesses
- HNSW index build time is O(N log N) and can take hours for billion-scale datasets
- HNSW requires the entire graph in RAM; the graph costs 40-80 bytes per vector regardless of vector dimensionality
- The recall-latency tradeoff means perfect recall requires brute force; ANN always sacrifices some correctness
- Filtered search quality degrades dramatically for highly selective filters and requires careful index design
FAANG Interview Questions
Interview Prep: these questions appear in FAANG system design rounds. Focus on tradeoffs, not just what the system does.
These are real system design interview questions asked at Google, Meta, Amazon, Apple, Netflix, and Microsoft. Study the architecture above before attempting.
Q1. Explain HNSW. How does its multi-layer structure enable sub-linear search time?
Q2. Design a vector database that supports 1 billion product embeddings with metadata filtering. How would you handle sharding?
Q3. Compare HNSW and IVF. For what dataset sizes and query patterns would you choose each?
Q4. What is Product Quantization? How does it achieve 100x compression, and what accuracy tradeoff does it make?
Q5. A vector search query returns irrelevant results despite the documents seeming related. What could be wrong, and how do you debug it?
Research Papers & Further Reading
Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs
Malkov, Y.A. & Yashunin, D.A.