Distributed Systems · Expert · Week 2

WhatsApp Messaging at 100B Messages/Day

How 50 engineers built a system bigger than Twitter

WhatsApp · Telegram · Signal

Key Insight

Erlang was designed for telecom fault tolerance, with a target of nine-nines reliability, which makes it well suited to messaging.

How It Works

  1. Sender encrypts with Signal Protocol
  2. TCP sends to Erlang server
  3. Server checks recipient online status
  4. Online: deliver immediately with receipt
  5. Offline: store in Mnesia with retry backoff
  6. Recipient decrypts message locally
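
Step 5 mentions retry backoff without giving a schedule. As a minimal sketch, capped exponential backoff could look like the following. The base delay, growth factor, and cap are illustrative assumptions, not values published by WhatsApp, and `backoff_delays` is a hypothetical helper name.

```python
def backoff_delays(attempts, base=1.0, factor=2.0, cap=300.0):
    """Return the wait in seconds before each retry attempt.

    base, factor, and cap are illustrative assumptions, not WhatsApp's values.
    """
    delays = []
    for attempt in range(attempts):
        # Grow the delay geometrically, but never wait longer than the cap.
        delays.append(min(cap, base * factor ** attempt))
    return delays


if __name__ == "__main__":
    print(backoff_delays(6))
```

Production systems usually add random jitter to each delay so that many clients reconnecting at once do not retry in lockstep.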

The Problem

WhatsApp needed to support a user base that has since grown to 2 billion users exchanging 100 billion messages per day with extreme reliability, yet the engineering team famously numbered only ~50 people while the service was already serving hundreds of millions of users. Traditional Java/Python web stacks would require thousands of servers, complex orchestration, and large ops teams. Messages must be delivered reliably even when recipients are offline for days, and end-to-end encryption means the server must never be able to read message content.

The Solution

WhatsApp chose Erlang/OTP, a platform designed for telecom systems requiring nine-nines reliability (99.9999999% uptime). Each user connection maps to a lightweight Erlang process consuming only ~2KB of memory, enabling a single server to hold 2M+ concurrent connections. Messages follow a store-and-forward pattern with delivery acknowledgment chains. FreeBSD kernel tuning pushed per-server connection limits far beyond typical Linux defaults.
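
To put the figures above in perspective, the arithmetic can be sketched as follows. It assumes the ~2KB figure covers only the bare process footprint (per-connection buffers and session state would add to it) and uses a 365.25-day year. Both function names are hypothetical.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60


def allowed_downtime_seconds(nines):
    """Yearly downtime permitted by an availability with `nines` nines."""
    unavailability = 10.0 ** (-nines)
    return SECONDS_PER_YEAR * unavailability


def process_memory_gib(connections, bytes_per_process=2 * 1024):
    """Memory needed for one bare process per connection, in GiB."""
    return connections * bytes_per_process / 2**30


if __name__ == "__main__":
    # Nine nines leaves roughly 32 milliseconds of downtime per year.
    print(f"{allowed_downtime_seconds(9) * 1000:.1f} ms/year")
    # 2 million processes at ~2KB each need a bit under 4 GiB.
    print(f"{process_memory_gib(2_000_000):.2f} GiB")
```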

Scale at a Glance

  • 100B+ Messages/Day
  • 2M+ Connections/Server
  • ~50 Engineering Team
  • 2B+ Monthly Active Users

Deep Dive

1. Erlang/BEAM: The Telecom Secret Weapon

Erlang's BEAM virtual machine was originally built by Ericsson for telephone switches that could never go down. Each user connection maps to a lightweight Erlang process (not an OS thread) consuming only ~2KB of memory. The BEAM VM runs millions of these processes concurrently with preemptive scheduling and per-process garbage collection, so there are no stop-the-world pauses that would freeze all connections. Hot code reloading allows WhatsApp to deploy new code to production servers without disconnecting a single user session.
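
The process-per-connection model can be approximated in other runtimes. The sketch below uses Python's `asyncio` as a loose analogy: one lightweight task per connection, each draining its own private mailbox. It is not how BEAM works internally; asyncio tasks are cooperatively scheduled and share one heap, whereas Erlang processes are preemptively scheduled with isolated heaps. `connection_process` and `run_demo` are hypothetical names.

```python
import asyncio


async def connection_process(user_id, mailbox, delivered):
    """One lightweight task per connection, draining its own private mailbox."""
    while True:
        message = await mailbox.get()
        if message is None:  # sentinel: the connection was closed
            return
        delivered.append((user_id, message))


async def run_demo(num_connections=10_000):
    delivered = []
    mailboxes = {uid: asyncio.Queue() for uid in range(num_connections)}
    tasks = [
        asyncio.create_task(connection_process(uid, box, delivered))
        for uid, box in mailboxes.items()
    ]
    # Send one message to a single user; only that user's mailbox is touched.
    await mailboxes[42].put("hello")
    # Close every connection so all tasks finish.
    for box in mailboxes.values():
        await box.put(None)
    await asyncio.gather(*tasks)
    return delivered


if __name__ == "__main__":
    print(asyncio.run(run_demo()))
```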

2. Mnesia and Custom Message Storage

WhatsApp uses Mnesia, Erlang's built-in distributed database, for user session state and routing tables. Mnesia runs inside the same BEAM VM as the application, eliminating network round-trips for metadata lookups. It supports both in-memory and on-disk tables with transparent replication across nodes. For actual message storage, WhatsApp uses a custom append-only store optimized for the write-once-read-once access pattern: messages are written sequentially when received and read exactly once when delivered, making sequential I/O the dominant pattern.
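
The write-once-read-once pattern can be illustrated with a toy append-only log. This is a sketch under stated assumptions, not WhatsApp's storage format: the record layout (a 4-byte length prefix followed by the payload), the in-memory offset index, and the `AppendOnlyLog` class are all hypothetical.

```python
import os
import struct
import tempfile


class AppendOnlyLog:
    """Toy append-only message log: sequential writes, one read per record."""

    def __init__(self, path):
        self._file = open(path, "a+b")
        self._offsets = {}  # message_id -> byte offset of its record

    def append(self, message_id, payload):
        self._file.seek(0, os.SEEK_END)
        offset = self._file.tell()
        # Assumed record layout: 4-byte big-endian length, then the payload.
        self._file.write(struct.pack(">I", len(payload)) + payload)
        self._file.flush()
        self._offsets[message_id] = offset

    def read_once(self, message_id):
        # pop() forgets the offset, so a second read raises KeyError.
        offset = self._offsets.pop(message_id)
        self._file.seek(offset)
        (length,) = struct.unpack(">I", self._file.read(4))
        return self._file.read(length)

    def close(self):
        self._file.close()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        log = AppendOnlyLog(os.path.join(tmp, "messages.log"))
        log.append("m1", b"encrypted-blob-1")
        log.append("m2", b"encrypted-blob-2")
        print(log.read_once("m2"))
        log.close()
```

A real store would also need crash recovery for the index and a way to reclaim space from delivered records, both omitted here.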

3. Store-and-Forward with ACK Chains

When Alice sends a message to Bob, the server stores it in a per-recipient queue. If Bob is online, the message is pushed immediately via his persistent connection. If Bob is offline, the message waits in the queue until he reconnects, at which point all queued messages are delivered in order. Bob's client sends an ACK back to the server, which deletes the message from the queue and forwards a delivery receipt to Alice. This three-way ACK chain (sent → delivered → read) provides the familiar checkmark UX and guarantees at-least-once delivery.
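
The flow described above can be sketched in memory as follows. The class and method names are hypothetical, and the sketch assumes every message carries a unique ID. Because at-least-once delivery means a message whose ACK was lost will be pushed again on reconnect, the client deduplicates by message ID.

```python
from collections import OrderedDict, defaultdict


class StoreAndForwardServer:
    def __init__(self):
        self.queues = defaultdict(OrderedDict)  # recipient -> {msg_id: (sender, blob)}
        self.online = {}    # recipient -> callable that delivers to the client
        self.receipts = []  # (sender, msg_id, status)

    def send(self, sender, recipient, msg_id, blob):
        # Always store first, so a crash or disconnect cannot lose the message.
        self.queues[recipient][msg_id] = (sender, blob)
        self.receipts.append((sender, msg_id, "sent"))
        if recipient in self.online:
            self.online[recipient](msg_id, blob)

    def connect(self, recipient, deliver):
        self.online[recipient] = deliver
        # Flush everything still un-ACKed, in arrival order.
        for msg_id, (_sender, blob) in list(self.queues[recipient].items()):
            deliver(msg_id, blob)

    def disconnect(self, recipient):
        self.online.pop(recipient, None)

    def ack(self, recipient, msg_id):
        entry = self.queues[recipient].pop(msg_id, None)
        if entry is not None:
            self.receipts.append((entry[0], msg_id, "delivered"))


class Client:
    def __init__(self):
        self.seen = set()
        self.inbox = []

    def receive(self, msg_id, blob):
        if msg_id not in self.seen:  # drop redelivered duplicates
            self.seen.add(msg_id)
            self.inbox.append(blob)


if __name__ == "__main__":
    server, bob = StoreAndForwardServer(), Client()
    server.send("alice", "bob", "m1", b"blob-1")  # Bob is offline: queued
    server.connect("bob", bob.receive)            # delivered, but ACK is lost
    server.disconnect("bob")
    server.connect("bob", bob.receive)            # redelivered, deduplicated
    server.ack("bob", "m1")
    print(bob.inbox, server.receipts)
```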

4. FreeBSD Kernel Tuning for Millions of Connections

WhatsApp built its servers on FreeBSD rather than Linux, reportedly because its network stack handled massive numbers of concurrent long-lived connections more efficiently. Engineers tuned kernel parameters extensively: file descriptor limits raised to 2M+, socket buffer sizes optimized for small message payloads, and TCP keepalive intervals tuned for mobile networks with variable connectivity. A single WhatsApp server handles 2 million simultaneous connections, each backed by a supervised Erlang process with its own isolated mailbox and automatic crash recovery via OTP supervision trees.
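
The crash-recovery idea behind supervision can be sketched as a toy supervisor that restarts a failed worker up to a limit. This is only an illustration in Python, not OTP itself: `Supervisor` and `make_flaky_worker` are hypothetical names, and `max_restarts` loosely mirrors the restart intensity that OTP supervisors enforce.

```python
class Supervisor:
    """Toy one-for-one supervisor: restart a crashed worker, up to a limit."""

    def __init__(self, max_restarts=3):
        self.max_restarts = max_restarts
        self.restarts = 0

    def run(self, worker):
        while True:
            try:
                return worker()
            except Exception:
                if self.restarts >= self.max_restarts:
                    raise  # give up and let the failure propagate upward
                self.restarts += 1


def make_flaky_worker(failures):
    """Build a worker that crashes `failures` times and then succeeds."""
    state = {"remaining": failures}

    def worker():
        if state["remaining"] > 0:
            state["remaining"] -= 1
            raise RuntimeError("connection process crashed")
        return "ok"

    return worker


if __name__ == "__main__":
    supervisor = Supervisor(max_restarts=3)
    print(supervisor.run(make_flaky_worker(2)), supervisor.restarts)
```

Because each worker is isolated, a crash affects only the one connection it serves while the supervisor brings it back.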

5. Signal Protocol: End-to-End Encryption at Scale

WhatsApp implements the Signal Protocol for end-to-end encryption across all messages. Each device generates a unique Curve25519 identity key pair, and message keys are ratcheted forward after every message using the Double Ratchet Algorithm, providing forward secrecy. Key exchange uses X3DH (Extended Triple Diffie-Hellman) with prekey bundles uploaded to the server, enabling encrypted session establishment even when the recipient is offline. The server handles only encrypted blobs: it can route but never read message content.
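
The "ratcheted forward after every message" step can be sketched with the symmetric-key chain alone. The example derives a fresh message key and the next chain key with HMAC-SHA256 over the single-byte constants 0x01 and 0x02, following the recommendation in the Double Ratchet specification. It deliberately omits the Diffie-Hellman ratchet, X3DH, and message encryption, and the all-zero starting key is a placeholder, so treat it as a teaching sketch rather than a usable implementation. The function names are hypothetical.

```python
import hashlib
import hmac


def ratchet_step(chain_key):
    """Derive (next_chain_key, message_key) from the current chain key."""
    message_key = hmac.new(chain_key, b"\x01", hashlib.sha256).digest()
    next_chain_key = hmac.new(chain_key, b"\x02", hashlib.sha256).digest()
    return next_chain_key, message_key


def derive_message_keys(initial_chain_key, count):
    """Advance the chain `count` times, collecting one key per message."""
    keys = []
    chain_key = initial_chain_key
    for _ in range(count):
        # The old chain key is overwritten, so earlier message keys
        # cannot be recomputed from the current state.
        chain_key, message_key = ratchet_step(chain_key)
        keys.append(message_key)
    return keys


if __name__ == "__main__":
    placeholder_key = bytes(32)  # a real chain key comes from the key exchange
    for key in derive_message_keys(placeholder_key, 3):
        print(key.hex())
```

Since HMAC is one-way, compromising the current chain key does not reveal the keys used for earlier messages, which is the forward secrecy property described above.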

Architecture Diagram

[Figure: WhatsApp Messaging at 100B Messages/Day, simplified architecture overview]

Core Concepts

  • Erlang/BEAM
  • Mnesia DB
  • XMPP Protocol
  • ACK Chains
  • Store-and-Forward
  • FreeBSD Tuning

Tradeoffs & Design Decisions

Every architectural decision is a tradeoff. Here's what you gain and what you give up.

Strengths

  • Erlang processes use ~2KB each, enabling 2M+ concurrent connections per server
  • Hot code reloading allows zero-downtime deployments without dropping connections
  • Store-and-forward with ACK chains guarantees delivery even for long-offline recipients
  • A ~50-engineer team suggests how operationally simple the Erlang/FreeBSD stack is

Weaknesses

  • Erlang's ecosystem is tiny, so hiring experienced Erlang/OTP developers is extremely difficult
  • Mnesia has known scalability limitations for very large clusters beyond ~50 nodes
  • End-to-end encryption prevents server-side spam filtering or content moderation that depends on reading message content
  • FreeBSD operational expertise is rare, further limiting the potential engineering talent pool

FAANG Interview Questions

Interview Prep

These questions appear in FAANG system design rounds. Focus on tradeoffs, not just what the system does.

These are real system design interview questions asked at Google, Meta, Amazon, Apple, Netflix, and Microsoft. Study the architecture above before attempting.

  1. Design a messaging system that guarantees message delivery even when recipients are offline for days. What storage and acknowledgment model would you use?

  2. WhatsApp handles 2M connections per server with Erlang. How would you achieve similar concurrency in Java or Go?

  3. Explain the delivery receipt flow: sent → delivered → read. What happens if the delivery ACK packet is lost in transit?

  4. How does end-to-end encryption work when both sender and recipient are offline at message send time? Explain prekey bundles.

  5. WhatsApp had ~50 engineers serving 2B users. What architectural decisions enable such an extreme user-to-engineer ratio?
