Home ArchitecturesPostgreSQL MVCC & Write-Ahead Log

🗄️ Data & InfrastructureAdvancedWeek 10

PostgreSQL MVCC & Write-Ahead Log

How Postgres achieves ACID transactions with multiversion concurrency

PostgreSQLCockroachDBSupabase

Key Insight

MVCC's trade-off: reads are blazing fast (no locks), but VACUUM must periodically reclaim dead row versions invisible maintenance cost.

Request Journey

Client begins a transaction, receiving a unique transaction ID (xid) and a snapshot of all currently active transactions→

For reads, the executor checks each tuple's xmin (creating xid) and xmax (deleting xid) against the snapshot — only tuples visible to this transaction's snapshot are returned (MVCC)→

For writes, PostgreSQL creates a NEW tuple version with the current xid as xmin; the old tuple's xmax is set to the current xid (old version remains for concurrent readers)→

Before modifying any data page, the change is first written to the WAL buffer as a redo record→

On COMMIT, WAL buffer is flushed (fsync) to WAL segment files on disk — this is the durability guarantee

+3 more steps

How It Works

① Client begins a transaction, receiving a unique transaction ID (xid) and a snapshot of all currently active transactions

② For reads, the executor checks each tuple's xmin (creating xid) and xmax (deleting xid) against the snapshot — only tuples visible to this transaction's snapshot are returned (MVCC)

③ For writes, PostgreSQL creates a NEW tuple version with the current xid as xmin; the old tuple's xmax is set to the current xid (old version remains for concurrent readers)

④ Before modifying any data page, the change is first written to the WAL buffer as a redo record

⑤ On COMMIT, WAL buffer is flushed (fsync) to WAL segment files on disk — this is the durability guarantee

⑥ Dirty data pages in the shared buffer pool are lazily written to disk by the background writer and checkpointer

⑦ autovacuum periodically scans heap pages, identifies tuples where xmax is older than the oldest active snapshot (dead tuples), and reclaims their space

⑧ WAL segments are streamed to standby replicas for hot standby and archived for point-in-time recovery (PITR)

⚠The Problem

Databases need to handle concurrent reads and writes from thousands of connections without readers blocking writers or vice versa. Traditional lock-based concurrency causes contention at scale: a long-running read blocks all writes to the same rows, creating hotspots.

✓The Solution

PostgreSQL implements MVCC (Multi-Version Concurrency Control): every write creates a new row version rather than updating in place. Readers see a consistent snapshot from their transaction start time without taking any locks. The Write-Ahead Log (WAL) ensures durability — every change is written to an append-only log before modifying data pages, enabling crash recovery and replication.

📊Scale at a Glance

Unlimited (PB+)

Max DB Size

32 TB

Max Table Size

< 100ms

Replication Lag

~200 (native)

Max Connections

🔬Deep Dive

MVCC: How Snapshots Work

Each PostgreSQL row has hidden system columns: xmin (transaction ID that created it) and xmax (transaction ID that deleted it). A query's snapshot sees rows where xmin is at most the snapshot transaction ID and xmax is null or greater than the snapshot ID. This means old row versions coexist with new ones on the same heap page. Readers never block writers; writers never block readers. The trade-off is storage bloat from multiple row versions.

The Write-Ahead Log

Before any data page is modified, the change is recorded in WAL — a sequential append-only file. On crash, PostgreSQL replays WAL from the last checkpoint to reconstruct any in-flight transactions. Sequential WAL writes are orders of magnitude faster than random data page writes, so most of PostgreSQL's write performance comes from WAL efficiency. WAL segments (16MB by default) are archived for point-in-time recovery.

VACUUM: Reclaiming Dead Versions

MVCC's cost is dead row versions — old versions that no transaction can see anymore. VACUUM reclaims this space by marking dead tuples as reusable and updating the visibility map. AUTOVACUUM runs VACUUM automatically based on a dead-tuple threshold (default: 20% of rows). Failure to vacuum causes table bloat, index bloat, and eventually transaction ID wraparound — a critical failure mode that freezes the database.

Streaming Replication

PostgreSQL primary servers stream WAL records to replica servers in near-real-time. Replicas apply WAL records to stay in sync, typically with less than 100ms lag. Replicas can serve read-only queries, distributing read load. Failover promotes a replica to primary — with synchronous_standby_names, you can guarantee zero data loss promotion. Tools like Patroni automate this failover with etcd-based leader election.

Transaction Isolation Levels

PostgreSQL supports four isolation levels: Read Uncommitted (same as Read Committed in Postgres), Read Committed (default — each statement sees a fresh snapshot), Repeatable Read (snapshot taken at transaction start, prevents non-repeatable reads), and Serializable (SSI algorithm detects and aborts serialization anomalies). Serializable Snapshot Isolation in Postgres provides true serializability without locking, unique among major databases.

⬡Architecture Diagram

PostgreSQL MVCC & Write-Ahead Log — simplified architecture overview

✦Core Concepts

⚙️

MVCC

⚙️

Write-Ahead Log

⚙️

VACUUM

⚙️

Transaction Isolation Levels

📨

Streaming Replication

🔁

Logical Replication

⚖Tradeoffs & Design Decisions

Every architectural decision is a tradeoff. Here's what you gain and what you give up.

✓ Strengths

✓MVCC enables readers and writers to never block each other, maximizing concurrency
✓WAL provides crash safety and enables streaming replication with near-zero data loss
✓Rich feature set: JSON, full-text search, PostGIS, partitioning, parallel queries
✓Serializable Snapshot Isolation provides true ACID serializability without lock contention

✗ Weaknesses

✗VACUUM overhead: autovacuum can cause I/O spikes; table bloat degrades performance if neglected
✗Write amplification: WAL plus heap page plus index updates equals 3-5x write I/O per logical write
✗Connection overhead: each connection spawns a backend process (~5MB); PgBouncer is required for hundreds of connections
✗MVCC transaction ID wraparound is a catastrophic failure mode if VACUUM is delayed too long

🎯FAANG Interview Questions

Interview Prep

💡 These questions appear in FAANG system design rounds. Focus on tradeoffs, not just what the system does.

These are real system design interview questions asked at Google, Meta, Amazon, Apple, Netflix, and Microsoft. Study the architecture above before attempting.

Q1
Explain MVCC in PostgreSQL. How does it allow concurrent reads and writes without locking?
Q2
What is VACUUM and why is it critical? What happens if autovacuum falls behind?
Q3
Design a read-replica setup for a PostgreSQL database receiving 50,000 reads/sec and 5,000 writes/sec.
Q4
Compare PostgreSQL MVCC with MySQL InnoDB's approach to concurrency control.
Q5
A PostgreSQL query that used to take 10ms now takes 5 seconds. Walk me through your diagnostic process.

Listen to the Podcast Episode

🎙️ Free Podcast

Alex & Sam break it down

Listen to a conversational deep-dive on this architecture — real trade-offs, production context, and student-friendly explanations. Free, no login required.

Listen to Episode

Free · No account required · Listen in browser