PostgreSQL MVCC & Write-Ahead Log
How Postgres achieves ACID transactions with multiversion concurrency
Key Insight
MVCC's trade-off: reads are blazing fast (no locks), but VACUUM must periodically reclaim dead row versions invisible maintenance cost.
Request Journey
How It Works
โ Client begins a transaction, receiving a unique transaction ID (xid) and a snapshot of all currently active transactions
โก For reads, the executor checks each tuple's xmin (creating xid) and xmax (deleting xid) against the snapshot โ only tuples visible to this transaction's snapshot are returned (MVCC)
โข For writes, PostgreSQL creates a NEW tuple version with the current xid as xmin; the old tuple's xmax is set to the current xid (old version remains for concurrent readers)
โฃ Before modifying any data page, the change is first written to the WAL buffer as a redo record
โค On COMMIT, WAL buffer is flushed (fsync) to WAL segment files on disk โ this is the durability guarantee
โฅ Dirty data pages in the shared buffer pool are lazily written to disk by the background writer and checkpointer
โฆ autovacuum periodically scans heap pages, identifies tuples where xmax is older than the oldest active snapshot (dead tuples), and reclaims their space
โง WAL segments are streamed to standby replicas for hot standby and archived for point-in-time recovery (PITR)
โ The Problem
Databases need to handle concurrent reads and writes from thousands of connections without readers blocking writers or vice versa. Traditional lock-based concurrency causes contention at scale: a long-running read blocks all writes to the same rows, creating hotspots.
โThe Solution
PostgreSQL implements MVCC (Multi-Version Concurrency Control): every write creates a new row version rather than updating in place. Readers see a consistent snapshot from their transaction start time without taking any locks. The Write-Ahead Log (WAL) ensures durability โ every change is written to an append-only log before modifying data pages, enabling crash recovery and replication.
๐Scale at a Glance
Unlimited (PB+)
Max DB Size
32 TB
Max Table Size
< 100ms
Replication Lag
~200 (native)
Max Connections
๐ฌDeep Dive
MVCC: How Snapshots Work
Each PostgreSQL row has hidden system columns: xmin (transaction ID that created it) and xmax (transaction ID that deleted it). A query's snapshot sees rows where xmin is at most the snapshot transaction ID and xmax is null or greater than the snapshot ID. This means old row versions coexist with new ones on the same heap page. Readers never block writers; writers never block readers. The trade-off is storage bloat from multiple row versions.
The Write-Ahead Log
Before any data page is modified, the change is recorded in WAL โ a sequential append-only file. On crash, PostgreSQL replays WAL from the last checkpoint to reconstruct any in-flight transactions. Sequential WAL writes are orders of magnitude faster than random data page writes, so most of PostgreSQL's write performance comes from WAL efficiency. WAL segments (16MB by default) are archived for point-in-time recovery.
VACUUM: Reclaiming Dead Versions
MVCC's cost is dead row versions โ old versions that no transaction can see anymore. VACUUM reclaims this space by marking dead tuples as reusable and updating the visibility map. AUTOVACUUM runs VACUUM automatically based on a dead-tuple threshold (default: 20% of rows). Failure to vacuum causes table bloat, index bloat, and eventually transaction ID wraparound โ a critical failure mode that freezes the database.
Streaming Replication
PostgreSQL primary servers stream WAL records to replica servers in near-real-time. Replicas apply WAL records to stay in sync, typically with less than 100ms lag. Replicas can serve read-only queries, distributing read load. Failover promotes a replica to primary โ with synchronous_standby_names, you can guarantee zero data loss promotion. Tools like Patroni automate this failover with etcd-based leader election.
Transaction Isolation Levels
PostgreSQL supports four isolation levels: Read Uncommitted (same as Read Committed in Postgres), Read Committed (default โ each statement sees a fresh snapshot), Repeatable Read (snapshot taken at transaction start, prevents non-repeatable reads), and Serializable (SSI algorithm detects and aborts serialization anomalies). Serializable Snapshot Isolation in Postgres provides true serializability without locking, unique among major databases.
โฌกArchitecture Diagram
PostgreSQL MVCC & Write-Ahead Log โ simplified architecture overview
โฆCore Concepts
MVCC
Write-Ahead Log
VACUUM
Transaction Isolation Levels
Streaming Replication
Logical Replication
โTradeoffs & Design Decisions
Every architectural decision is a tradeoff. Here's what you gain and what you give up.
โ Strengths
- โMVCC enables readers and writers to never block each other, maximizing concurrency
- โWAL provides crash safety and enables streaming replication with near-zero data loss
- โRich feature set: JSON, full-text search, PostGIS, partitioning, parallel queries
- โSerializable Snapshot Isolation provides true ACID serializability without lock contention
โ Weaknesses
- โVACUUM overhead: autovacuum can cause I/O spikes; table bloat degrades performance if neglected
- โWrite amplification: WAL plus heap page plus index updates equals 3-5x write I/O per logical write
- โConnection overhead: each connection spawns a backend process (~5MB); PgBouncer is required for hundreds of connections
- โMVCC transaction ID wraparound is a catastrophic failure mode if VACUUM is delayed too long
๐ฏFAANG Interview Questions
Interview Prep๐ก These questions appear in FAANG system design rounds. Focus on tradeoffs, not just what the system does.
These are real system design interview questions asked at Google, Meta, Amazon, Apple, Netflix, and Microsoft. Study the architecture above before attempting.
- Q1
Explain MVCC in PostgreSQL. How does it allow concurrent reads and writes without locking?
- Q2
What is VACUUM and why is it critical? What happens if autovacuum falls behind?
- Q3
Design a read-replica setup for a PostgreSQL database receiving 50,000 reads/sec and 5,000 writes/sec.
- Q4
Compare PostgreSQL MVCC with MySQL InnoDB's approach to concurrency control.
- Q5
A PostgreSQL query that used to take 10ms now takes 5 seconds. Walk me through your diagnostic process.
Listen to the Podcast Episode
Alex & Sam break it down
Listen to a conversational deep-dive on this architecture โ real trade-offs, production context, and student-friendly explanations. Free, no login required.
Listen to EpisodeFree ยท No account required ยท Listen in browser
More Data & Infrastructure
View allSpotify Music Recommendation System
Collaborative filtering, Discover Weekly, and the AudioEmbeddings pipeline
Spotify ยท Apple Music ยท YouTube Music
GitHub Pull Request & CI/CD Pipeline
Git internals, check suites, and the webhook fanout that powers DevOps
GitHub ยท GitLab ยท Bitbucket
LinkedIn Feed Ranking Architecture
Heavyweight ML scoring with online/offline feature pipelines
LinkedIn ยท Facebook ยท Twitter
Listen to more architecture deep-dives
30 free podcast episodes โ Alex & Sam break down every architecture in this library. Listen in your browser, no account needed.
All architecture articles are free ยท No account needed