Vector Database Internals
HNSW, IVF, and ANN search at billion scale
Key Insight
HNSW's 'small world' property, inspired by social-network theory: any vector is reachable from any other vector in O(log N) hops.
Request Journey
How It Works
1. A high-dimensional vector (e.g., 1536-dim from OpenAI) arrives for insertion
2. The vector is optionally quantized via Product Quantization (PQ) or Scalar Quantization (SQ) to reduce its memory footprint
3. The HNSW graph builder assigns the vector a random max layer (drawn from an exponential distribution) and inserts it by connecting to nearest neighbors at each layer via greedy search
4. At query time, search enters the HNSW graph at the top layer's entry point
5. Greedy traversal at each layer finds the closest neighbor, then descends to the next layer using that neighbor as the new entry point
6. At layer 0 (the densest), the algorithm expands the search beam (the ef_search parameter) to collect the top-100 ANN candidates
7. Metadata filters (category, date, tenant) are applied: pre-filtering prunes before the ANN search, post-filtering removes candidates after it
8. Candidates are rescored against the original full-precision vectors for exact distances
9. The final top-K results are returned, with recall@10 typically 95-99% depending on HNSW parameters (M, ef_construction)
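Steps 4-5 above (greedy descent through the layers) can be sketched on a toy index. This is a minimal illustration, not a real HNSW implementation: the "graph" is just points 0..63 on a number line, where layer L links every 2^L-th point, standing in for HNSW's sparse upper layers.

```python
def hnsw_descend(num_layers, neighbors, dist, entry, query):
    """Greedy descent: at each layer, keep moving to any closer neighbor;
    when no neighbor improves, drop down a layer and repeat."""
    cur = entry
    for layer in range(num_layers - 1, -1, -1):
        improved = True
        while improved:
            improved = False
            for nbr in neighbors(layer, cur):
                if dist(nbr, query) < dist(cur, query):
                    cur, improved = nbr, True
    return cur

# Toy index over points 0..63 on a line; layer L links every 2**L-th point,
# so the top layer takes long hops and layer 0 takes single steps.
def neighbors(layer, node):
    step = 2 ** layer
    return [n for n in (node - step, node + step) if 0 <= n <= 63]

dist = lambda a, b: abs(a - b)
found = hnsw_descend(num_layers=4, neighbors=neighbors, dist=dist,
                     entry=0, query=37)
# Long hops at the top layer get close fast; layer 0 finishes the search.
```

The same descent on a real HNSW graph uses vector distances instead of `abs`, and at layer 0 it widens into a beam of ef_search candidates rather than tracking a single point.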
The Problem
Machine learning models generate high-dimensional vector embeddings (1536-dim text, 512-dim images, 768-dim audio) that encode semantic meaning. Finding the most similar vectors to a query vector in a dataset of 1 billion embeddings by brute-force cosine similarity means computing 1 billion dot products per query, roughly 1.5 trillion multiply-adds at 1536 dimensions: far too slow for real-time search.
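The brute-force baseline being replaced looks like this, sketched with numpy on a small dataset (the point is the cost model: every query touches all N vectors):

```python
import numpy as np

def brute_force_top_k(query, vectors, k=10):
    """Exact cosine-similarity search: O(N * d) multiply-adds per query."""
    # Normalize so a dot product equals cosine similarity.
    q = query / np.linalg.norm(query)
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = v @ q                   # N dot products of d dimensions each
    top = np.argsort(-sims)[:k]   # exact, but scans every vector
    return top, sims[top]

rng = np.random.default_rng(0)
db = rng.standard_normal((10_000, 64)).astype(np.float32)
ids, scores = brute_force_top_k(db[42], db, k=3)
# A vector is always its own nearest neighbor under cosine similarity.
```

At 10,000 vectors this is instant; at 1 billion 1536-dim vectors the same scan is the 1.5-trillion-multiply-add query that ANN indexes exist to avoid.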
The Solution
Vector databases use Approximate Nearest Neighbor (ANN) algorithms, primarily HNSW (Hierarchical Navigable Small World graphs) and IVF (Inverted File Index), to answer nearest-neighbor queries in O(log N) time with configurable recall. A 95% recall@10 search on 1 billion vectors typically takes under 10ms, enabling real-time semantic search at billion scale.
Scale at a Glance
< 10ms at 1B
HNSW Search Latency
95-99%
Typical Recall@10
100-400x
PQ Compression
~40-80 bytes
Memory/Vector (HNSW)
Deep Dive
HNSW: The Small-World Graph Index
HNSW builds a multi-layer graph where each node is a vector. The top layer is a sparse long-range graph (few connections, fast traversal); lower layers are progressively denser. A query starts at the top layer's entry point, greedily traverses to the nearest neighbor, descends to the next layer, and repeats. This multi-scale structure gives O(log N) search time. The ef_construction and ef_search parameters control the recall-latency tradeoff at build time and query time respectively.
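The "progressively denser" layering comes from how each node's top layer is drawn at insert time. A minimal sketch of that draw, using the standard level-assignment formula floor(-ln(U) * mL) with the common choice mL = 1/ln(M):

```python
import math, random

def assign_max_layer(m_l=1 / math.log(16), rng=random.random):
    """Draw a node's top layer from an exponential distribution:
    floor(-ln(U) * mL). With mL = 1/ln(M), a fraction 1/M of nodes
    reaches layer 1, 1/M^2 reaches layer 2, and so on."""
    return int(-math.log(rng()) * m_l)

random.seed(0)
levels = [assign_max_layer() for _ in range(100_000)]
frac_layer0 = levels.count(0) / len(levels)
# With M=16, about 15/16 of nodes live only on layer 0; each higher
# layer is ~16x sparser, which is what makes the top layers fast to cross.
```

This geometric thinning is the source of the O(log N) layer count: a billion-vector index with M=16 has only about log16(10^9), roughly 7-8, layers.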
IVF: Clustering-Based Search
IVF pre-clusters the dataset into K clusters (e.g., K=1000 using k-means). At query time, the search only examines the nprobe nearest cluster centroids and their members. With nprobe=10, you search only 1% of the data, an enormous speedup. IVF is more memory-efficient than HNSW (centroids in RAM, vectors on disk with IVF-Flat) but has lower recall for the same latency. IVF-PQ combines IVF with Product Quantization for billion-scale deployment.
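The IVF query path is simple enough to sketch end to end with numpy. This toy skips the k-means training step (centroids are hand-placed) and just shows the two-stage search: rank centroids, then scan only the probed inverted lists.

```python
import numpy as np

def ivf_search(query, centroids, inverted_lists, vectors, nprobe=2, k=5):
    """IVF query: scan only the members of the nprobe nearest clusters."""
    # Stage 1: rank all K centroids by distance to the query (cheap: K << N).
    d2c = np.linalg.norm(centroids - query, axis=1)
    probe = np.argsort(d2c)[:nprobe]
    # Stage 2: exact distances, but only over the probed clusters' members.
    cand = np.concatenate([inverted_lists[c] for c in probe])
    d = np.linalg.norm(vectors[cand] - query, axis=1)
    order = np.argsort(d)[:k]
    return cand[order], d[order]

# Toy dataset: 4 well-separated clusters of 50 points each in 2-D.
rng = np.random.default_rng(1)
centroids = np.array([[0, 0], [10, 0], [0, 10], [10, 10]], dtype=float)
points = np.vstack([c + rng.standard_normal((50, 2)) for c in centroids])
assign = np.repeat(np.arange(4), 50)
inverted_lists = [np.where(assign == c)[0] for c in range(4)]

ids, dists = ivf_search(np.array([9.5, 0.5]), centroids,
                        inverted_lists, points, nprobe=1, k=3)
# With nprobe=1, only the cluster around (10, 0) is ever scanned.
```

The recall risk is also visible here: if the true nearest neighbor sits just across a cluster boundary, nprobe=1 misses it, which is why production systems tune nprobe upward until recall targets are met.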
Product Quantization: Compressing Embeddings
A 1536-dim float32 vector takes 6KB (6,144 bytes). Product Quantization (PQ) splits the vector into M sub-vectors and quantizes each sub-vector to one of 256 centroids (1 byte). A 1536-dim vector thus compresses to M bytes: 384x compression for M=16. Distance computation uses pre-computed lookup tables between query sub-vectors and all centroids. PQ reduces memory from TBs to GBs but costs roughly 10-15% recall.
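Both halves of PQ, the encoding and the lookup-table (asymmetric) distance, fit in a short sketch. The codebooks here are random stand-ins; real systems train them with k-means per sub-space:

```python
import numpy as np

rng = np.random.default_rng(0)
D, M = 1536, 16                  # 16 sub-vectors of 96 dims each
dsub = D // M
# Toy codebooks: 256 centroids per sub-space (real ones come from k-means).
codebooks = rng.standard_normal((M, 256, dsub)).astype(np.float32)

def pq_encode(v):
    """Compress D float32s (6144 bytes) into M uint8 codes (16 bytes here)."""
    subs = v.reshape(M, dsub)
    return np.array([np.argmin(((codebooks[m] - subs[m]) ** 2).sum(1))
                     for m in range(M)], dtype=np.uint8)

def pq_asymmetric_distance(query, codes):
    """ADC: precompute query-to-centroid distance tables once per query,
    then each database vector's distance is just M table lookups."""
    subs = query.reshape(M, dsub)
    tables = ((codebooks - subs[:, None]) ** 2).sum(-1)   # shape (M, 256)
    return sum(tables[m, codes[m]] for m in range(M))

v = rng.standard_normal(D).astype(np.float32)
codes = pq_encode(v)
ratio = v.nbytes / codes.nbytes   # 6144 / 16 = 384x compression
approx_d = pq_asymmetric_distance(v, codes)
```

The tables are built once per query (M x 256 distances) and amortized over millions of candidates, which is why PQ distance evaluation is so much cheaper than full-precision math.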
Filtered Search: The Hard Problem
Production vector search almost always includes metadata filters: find the 10 most similar products that are in stock and cost under $50. The naive approach is vector search then filter. If only 0.1% of vectors match the filter, this requires 1000x more candidates. Pre-filtering (apply metadata filter first, then search the subset) is accurate but slow for large datasets. Most production vector databases implement graph-based filtered search that integrates filters into HNSW traversal.
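The post-filtering failure mode described above can be made concrete with a toy sketch. `ann_search` and `predicate` here are hypothetical stand-ins: the index returns candidates in nearness order, and the filter keeps only every 100th id (1% selectivity), forcing the candidate pool to widen repeatedly.

```python
def post_filter(ann_search, predicate, k, max_ef=4096):
    """Post-filtering: widen the ANN candidate pool until k survivors pass
    the metadata predicate (or a cap is hit). Selective filters force huge
    pools -- the amplification described above."""
    ef, hits = k, []
    while ef <= max_ef:
        hits = [h for h in ann_search(ef) if predicate(h)]
        if len(hits) >= k:
            return hits[:k]
        ef *= 2          # 1% selectivity => ~100x more candidates needed
    return hits          # cap reached: may return fewer than k results

# Toy stand-ins: ids 0..9999 ordered by nearness; filter keeps 1% of them.
ann_search = lambda ef: list(range(min(ef, 10_000)))
predicate = lambda doc_id: doc_id % 100 == 0

result = post_filter(ann_search, predicate, k=5)
```

Counting the loop iterations shows the cost: to produce 5 filtered hits, the pool grows from 5 to 640 candidates, and at 0.1% selectivity it would grow another 10x, which is exactly why integrated (graph-aware) filtering wins for selective predicates.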
Sharding and Distributed Search
A billion-vector index does not fit on one machine: HNSW requires keeping the entire graph in RAM (typically 40-80 bytes per vector). Distributed vector search partitions vectors across shards, runs parallel searches on each, and merges the top-K results from each shard. The merge step is a simple K-way merge of sorted lists. With M shards, each shard must return its full local top-K to guarantee the global top-K, so K*M candidates cross the network per query; this amplifies network traffic for high-K searches.
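The scatter-gather merge itself is a few lines with the standard library. Each shard returns its local top-K as (distance, id) pairs sorted ascending, and the coordinator K-way merges them:

```python
import heapq

def merge_shard_results(shard_results, k):
    """K-way merge of M sorted per-shard top-K lists into the global top-K.
    heapq.merge streams the M sorted lists; nsmallest stops after k items."""
    return heapq.nsmallest(k, heapq.merge(*shard_results))

# Three shards, each returning its local top-3 (distance, id), sorted.
shard_a = [(0.10, "a1"), (0.40, "a2"), (0.90, "a3")]
shard_b = [(0.05, "b1"), (0.50, "b2"), (0.70, "b3")]
shard_c = [(0.20, "c1"), (0.30, "c2"), (0.60, "c3")]

top3 = merge_shard_results([shard_a, shard_b, shard_c], k=3)
```

Note that every shard still did a full top-K search even though only one of its results may survive the merge; that wasted work, plus the K*M candidates on the wire, is the price of partitioning.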
Architecture Diagram
Vector Database Internals: simplified architecture overview
Core Concepts
HNSW Index
IVF Index
ANN Search
Recall vs Latency
Filtered Search
Quantization (PQ/SQ)
Tradeoffs & Design Decisions
Every architectural decision is a tradeoff. Here's what you gain and what you give up.
Strengths
- HNSW achieves 95%+ recall at sub-10ms latency on billion-scale datasets
- Semantic search finds conceptually similar items without exact keyword matches
- Product Quantization enables billion-scale indexes that fit in under 100GB of RAM
- Metadata filtering enables complex queries combining semantic similarity with structured filters
Weaknesses
- HNSW index build time is O(N log N) and can take hours for billion-scale datasets
- HNSW requires the entire graph in RAM; the graph costs 40-80 bytes per vector regardless of vector dimensionality
- The recall-latency tradeoff means perfect recall requires brute force; ANN always sacrifices some correctness
- Filtered search quality degrades dramatically for highly selective filters and requires careful index design
FAANG Interview Questions
Interview Prep: these questions appear in FAANG system design rounds. Focus on tradeoffs, not just what the system does.
These are real system design interview questions asked at Google, Meta, Amazon, Apple, Netflix, and Microsoft. Study the architecture above before attempting.
Q1. Explain HNSW. How does its multi-layer structure enable sub-linear search time?
Q2. Design a vector database that supports 1 billion product embeddings with metadata filtering. How would you handle sharding?
Q3. Compare HNSW and IVF. For what dataset sizes and query patterns would you choose each?
Q4. What is Product Quantization? How does it achieve 100x compression, and what accuracy tradeoff does it make?
Q5. A vector search query returns irrelevant results despite the documents seeming related. What could be wrong, and how do you debug it?
Research Papers & Further Reading
Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs
Malkov, Y.A. & Yashunin, D.A.