WhatsApp Messaging at 100B Messages/Day
How 50 engineers built a system bigger than Twitter
Key Insight
Erlang was designed for telecom-grade fault tolerance (often cited as nine-nines reliability), which makes it a strong fit for messaging.
Request Journey
How It Works
① Sender encrypts with Signal Protocol
② Client sends the ciphertext over TCP to an Erlang server
③ Server checks recipient online status
④ Online: deliver immediately with receipt
⑤ Offline: store in Mnesia with retry backoff
⑥ Recipient decrypts message locally
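The server-side part of this journey (online check, immediate delivery, offline storage with retry backoff) can be sketched in a few lines. This is a minimal illustration, not WhatsApp's implementation: the `Router` class, its in-memory dictionaries, and the backoff parameters (1 second base, 60 second cap) are all assumptions chosen for readability.

```python
from dataclasses import dataclass, field


@dataclass
class Router:
    """Hypothetical in-memory stand-in for the server-side routing step."""
    online: set = field(default_factory=set)            # recipients with a live connection
    offline_store: dict = field(default_factory=dict)   # recipient -> list of ciphertexts

    def route(self, recipient: str, ciphertext: bytes) -> str:
        # The server only ever handles ciphertext; it routes without decrypting.
        if recipient in self.online:
            return "delivered"
        self.offline_store.setdefault(recipient, []).append(ciphertext)
        return "stored"


def backoff_delays(base: float = 1.0, cap: float = 60.0, attempts: int = 6) -> list:
    """Capped exponential backoff schedule in seconds (values are assumptions)."""
    return [min(cap, base * 2 ** n) for n in range(attempts)]


router = Router(online={"bob"})
print(router.route("bob", b"ciphertext-1"))    # delivered
print(router.route("carol", b"ciphertext-2"))  # stored
print(backoff_delays())                        # [1.0, 2.0, 4.0, 8.0, 16.0, 32.0]
```

Capping the delay keeps retries for long-offline recipients from growing without bound while still reducing load compared with a fixed short interval.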
The Problem
WhatsApp needed to support 2 billion users exchanging 100 billion messages per day with extreme reliability, yet its engineering team was famously small, at roughly 50 people. A traditional Java or Python web stack would likely have required thousands of servers, complex orchestration, and a large ops team. Messages must be delivered reliably even when recipients are offline for days, and end-to-end encryption means the server must never be able to read message content.
The Solution
WhatsApp chose Erlang/OTP, a platform designed for telecom systems requiring nine-nines reliability (99.9999999% uptime). Each user connection maps to a lightweight Erlang process consuming only ~2KB of memory, enabling a single server to hold 2M+ concurrent connections. Messages follow a store-and-forward pattern with delivery acknowledgment chains. FreeBSD kernel tuning pushed per-server connection limits far beyond typical Linux defaults.
Scale at a Glance
- 100B+ messages/day
- 2M+ connections/server
- ~50-person engineering team
- 2B+ monthly active users
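A quick back-of-the-envelope calculation helps put these headline figures in perspective. The sketch below uses only the numbers quoted in this article (100B messages/day, ~2KB per process, 2M connections per server, 2B users); the function names are illustrative, and the server count is a loose upper bound that assumes every monthly active user is connected at once, which is not the case in practice.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def average_rate(messages_per_day: int) -> float:
    """Average messages per second (peak traffic will be higher)."""
    return messages_per_day / SECONDS_PER_DAY


def process_memory_gib(connections: int, bytes_per_process: int = 2 * 1024) -> float:
    """Baseline memory for one lightweight process per connection, in GiB."""
    return connections * bytes_per_process / 2 ** 30


def servers_for(users: int, connections_per_server: int) -> int:
    """Ceiling division: servers needed if every user were connected at once."""
    return -(-users // connections_per_server)


print(round(average_rate(100_000_000_000)))        # 1157407 messages/second on average
print(round(process_memory_gib(2_000_000), 2))     # 3.81 GiB of per-process baseline memory
print(servers_for(2_000_000_000, 2_000_000))       # 1000 servers as a loose upper bound
```

The per-process figure covers only the baseline process footprint; socket buffers, message payloads, and routing state add to it.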
Deep Dive
Erlang/BEAM: The Telecom Secret Weapon
Erlang's BEAM virtual machine was originally built by Ericsson for telephone switches that could never go down. Each user connection maps to a lightweight Erlang process (not an OS thread) consuming only ~2KB of memory. The BEAM VM runs millions of these processes concurrently with preemptive scheduling and per-process garbage collection, so there are no stop-the-world pauses that would freeze all connections. Hot code reloading allows WhatsApp to deploy new code to production servers without disconnecting a single user session.
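The process-per-connection model can be approximated, loosely, with Python's `asyncio`: one lightweight task per connection, each draining its own mailbox. This is only an analogy. Unlike BEAM processes, asyncio tasks are cooperatively scheduled, share one heap, and have no per-task garbage collection, so the sketch shows the programming model rather than the runtime guarantees. The `connection` and `main` names and the `None` shutdown sentinel are assumptions for this example.

```python
import asyncio


async def connection(mailbox: asyncio.Queue, delivered: list) -> None:
    """One lightweight task per user connection, draining its own mailbox."""
    while True:
        msg = await mailbox.get()
        if msg is None:            # sentinel: the connection is closing
            return
        delivered.append(msg)


async def main(n_connections: int = 10_000) -> int:
    mailboxes = [asyncio.Queue() for _ in range(n_connections)]
    delivered = [[] for _ in range(n_connections)]
    tasks = [
        asyncio.create_task(connection(mb, out))
        for mb, out in zip(mailboxes, delivered)
    ]
    # Send one message to every connection, then ask each task to stop.
    for i, mb in enumerate(mailboxes):
        mb.put_nowait(f"hello {i}")
        mb.put_nowait(None)
    await asyncio.gather(*tasks)
    return sum(len(out) for out in delivered)


if __name__ == "__main__":
    print(asyncio.run(main()))  # number of messages delivered across all tasks
```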
Mnesia and Custom Message Storage
WhatsApp uses Mnesia, Erlang's built-in distributed database, for user session state and routing tables. Mnesia runs inside the same BEAM VM as the application, eliminating network round-trips for metadata lookups. It supports both in-memory and on-disk tables with transparent replication across nodes. For actual message storage, WhatsApp uses a custom append-only store optimized for the write-once-read-once access pattern: messages are written sequentially when received and read exactly once when delivered, making sequential I/O the dominant pattern.
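The write-once-read-once pattern can be illustrated with a tiny append-only log of length-prefixed records. The details of WhatsApp's custom store are not described here, so everything in this sketch is an assumption: the `AppendOnlyLog` class, the 4-byte big-endian length header, and the single read cursor. It works on any seekable binary stream, such as a file opened in `"r+b"` mode or an `io.BytesIO`.

```python
import io
import struct


class AppendOnlyLog:
    """Length-prefixed records appended sequentially to a binary stream."""

    HEADER = struct.Struct(">I")   # 4-byte big-endian payload length (assumed format)

    def __init__(self, stream):
        self.stream = stream
        self.read_offset = 0       # position of the next undelivered record

    def append(self, payload: bytes) -> None:
        # Writes always go to the end, so disk access stays sequential.
        self.stream.seek(0, io.SEEK_END)
        self.stream.write(self.HEADER.pack(len(payload)))
        self.stream.write(payload)

    def read_next(self):
        # Each record is read exactly once, in the order it was written.
        self.stream.seek(self.read_offset)
        header = self.stream.read(self.HEADER.size)
        if len(header) < self.HEADER.size:
            return None            # nothing left to deliver
        (length,) = self.HEADER.unpack(header)
        payload = self.stream.read(length)
        self.read_offset = self.stream.tell()
        return payload


log = AppendOnlyLog(io.BytesIO())
log.append(b"first")
log.append(b"second")
print(log.read_next(), log.read_next(), log.read_next())  # b'first' b'second' None
```

Because nothing is updated in place, a store shaped like this never needs random writes; space can be reclaimed by discarding whole segments once every record in them has been delivered.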
Store-and-Forward with ACK Chains
When Alice sends a message to Bob, the server stores it in a per-recipient queue. If Bob is online, the message is pushed immediately via his persistent connection. If Bob is offline, the message waits in the queue until he reconnects, at which point all queued messages are delivered in order. Bob's client sends an ACK back to the server, which deletes the message from the queue and forwards a delivery receipt to Alice. This three-way ACK chain (sent → delivered → read) provides the familiar checkmark UX and guarantees at-least-once delivery.
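At-least-once delivery has a consequence worth making explicit: if an ACK is lost, the server redelivers, so the client must tolerate duplicates. The sketch below models that with a per-recipient queue that deletes only on ACK and a client that deduplicates by message ID. The `RecipientQueue` and `Client` classes and the string message IDs are hypothetical; the real wire protocol is not shown.

```python
from collections import OrderedDict


class RecipientQueue:
    """Per-recipient queue: messages stay stored until the client ACKs them."""

    def __init__(self):
        self.pending = OrderedDict()   # message_id -> ciphertext, in arrival order

    def store(self, message_id: str, ciphertext: bytes) -> None:
        self.pending[message_id] = ciphertext

    def deliverable(self) -> list:
        # Everything not yet ACKed is sent (or resent) on reconnect, in order.
        return list(self.pending.items())

    def ack(self, message_id: str) -> bool:
        # True means: delete succeeded, forward a delivery receipt to the sender.
        return self.pending.pop(message_id, None) is not None


class Client:
    """Recipient side: dedupe by message ID because delivery is at-least-once."""

    def __init__(self):
        self.seen = set()
        self.inbox = []

    def receive(self, message_id: str, ciphertext: bytes) -> str:
        if message_id not in self.seen:
            self.seen.add(message_id)
            self.inbox.append(ciphertext)
        return message_id              # always ACK, even for duplicates


queue, bob = RecipientQueue(), Client()
queue.store("m1", b"hi")
for mid, blob in queue.deliverable():
    bob.receive(mid, blob)             # delivered, but suppose the ACK is lost
for mid, blob in queue.deliverable():  # Bob reconnects: the server redelivers m1
    queue.ack(bob.receive(mid, blob))
print(len(bob.inbox), queue.deliverable())  # 1 []
```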
FreeBSD Kernel Tuning for Millions of Connections
WhatsApp runs on FreeBSD rather than Linux because its network stack handles massive numbers of concurrent long-lived connections more efficiently. Engineers tuned kernel parameters extensively: file descriptor limits raised to 2M+, socket buffer sizes optimized for small message payloads, and TCP keepalive intervals tuned for mobile networks with variable connectivity. A single WhatsApp server handles 2 million simultaneous connections, each backed by a supervised Erlang process with its own isolated mailbox and automatic crash recovery via OTP supervision trees.
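Since every TCP socket consumes a file descriptor, the descriptor limit is one of the first ceilings a high-connection server runs into. The sketch below uses Python's standard `resource` module (Unix only) to compare the current process limit with a connection target. The `fd_headroom` helper and the reserve of 1024 descriptors for logs, files, and listening sockets are assumptions; the specific kernel settings WhatsApp used are not reproduced here.

```python
import resource


def fd_headroom(target_connections: int, reserve: int = 1024) -> dict:
    """Compare this process's file descriptor limit with a connection target."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    needed = target_connections + reserve   # one descriptor per socket, plus overhead
    unlimited = soft == resource.RLIM_INFINITY
    return {
        "soft": soft,
        "hard": hard,
        "needed": needed,
        "fits": unlimited or soft >= needed,
    }


if __name__ == "__main__":
    # On a default desktop configuration this typically reports fits=False
    # for a 2M target, which is why the limits have to be raised explicitly.
    print(fd_headroom(2_000_000))
```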
Signal Protocol: End-to-End Encryption at Scale
WhatsApp implements the Signal Protocol for end-to-end encryption across all messages. Each device generates a unique Curve25519 identity key pair, and message keys are ratcheted forward after every message using the Double Ratchet Algorithm, providing forward secrecy. Key exchange uses X3DH (Extended Triple Diffie-Hellman) with prekey bundles uploaded to the server, enabling encrypted session establishment even when the recipient is offline. The server handles only encrypted blobs: it can route but never read message content.
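The "ratcheted forward after every message" idea can be shown with the symmetric-key half of the Double Ratchet alone. In this simplified sketch each step derives a one-time message key and the next chain key using HMAC-SHA256 with the single-byte constants 0x01 and 0x02, following the recommendation in the public Double Ratchet specification. It deliberately omits the Diffie-Hellman ratchet and X3DH, and the all-zero initial chain key is a placeholder; in a real session that key comes from the key agreement. It is not suitable for production use.

```python
import hashlib
import hmac


def kdf_chain_step(chain_key: bytes):
    """Advance the chain: derive a one-time message key and the next chain key."""
    message_key = hmac.new(chain_key, b"\x01", hashlib.sha256).digest()
    next_chain_key = hmac.new(chain_key, b"\x02", hashlib.sha256).digest()
    return message_key, next_chain_key


def message_keys(initial_chain_key: bytes, count: int) -> list:
    """Derive `count` successive message keys from one starting chain key."""
    keys, chain_key = [], initial_chain_key
    for _ in range(count):
        message_key, chain_key = kdf_chain_step(chain_key)
        keys.append(message_key)
    return keys


placeholder_chain_key = bytes(32)   # assumption: stands in for a key agreed via X3DH
sender = message_keys(placeholder_chain_key, 3)
receiver = message_keys(placeholder_chain_key, 3)
print(sender == receiver, len(set(sender)))  # True 3
```

Because HMAC is one-way, someone who captures a later chain key cannot work backwards to earlier message keys, which is the forward secrecy property described above.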
Architecture Diagram
WhatsApp Messaging at 100B Messages/Day: simplified architecture overview
Core Concepts
Erlang/BEAM
Mnesia DB
XMPP Protocol
ACK Chains
Store-and-Forward
FreeBSD Tuning
Tradeoffs & Design Decisions
Every architectural decision is a tradeoff. Here's what you gain and what you give up.
Strengths
- Erlang processes use ~2KB each, enabling 2M+ concurrent connections per server
- Hot code reloading allows zero-downtime deployments without dropping connections
- Store-and-forward with ACK chains guarantees delivery even for long-offline recipients
- A ~50-engineer team suggests how operationally simple the Erlang/FreeBSD stack can be
Weaknesses
- Erlang's ecosystem is small, so hiring experienced Erlang/OTP developers is difficult
- Mnesia has known scalability limitations for very large clusters beyond ~50 nodes
- End-to-end encryption prevents server-side spam filtering or moderation that depends on message content
- FreeBSD operational expertise is rare, further limiting the potential engineering talent pool
FAANG Interview Questions
Interview Prep: These questions appear in FAANG system design rounds. Focus on tradeoffs, not just what the system does.
These are real system design interview questions asked at Google, Meta, Amazon, Apple, Netflix, and Microsoft. Study the architecture above before attempting.
- Q1: Design a messaging system that guarantees message delivery even when recipients are offline for days. What storage and acknowledgment model would you use?
- Q2: WhatsApp handles 2M connections per server with Erlang. How would you achieve similar concurrency in Java or Go?
- Q3: Explain the delivery receipt flow: sent → delivered → read. What happens if the delivery ACK packet is lost in transit?
- Q4: How does end-to-end encryption work when both sender and recipient are offline at message send time? Explain prekey bundles.
- Q5: WhatsApp had ~50 engineers serving 2B users. What architectural decisions enable such an extreme user-to-engineer ratio?