๐Ÿ—„๏ธ Data & InfrastructureAdvancedWeek 9

Apache Kafka Event Streaming Architecture

Partitions, consumer groups, log compaction, and exactly-once semantics

LinkedIn · Confluent · Uber · Airbnb

Key Insight

Kafka's breakthrough: treating a message queue as an immutable log makes it replayable, auditable, and orders of magnitude more scalable.


How It Works

1. Producer sends an event record with a message key to the Kafka client
2. The partitioner applies a murmur2 hash of the key, modulo the partition count, to select the target partition
3. The record is sent to the leader broker for that partition, which appends it to the commit log (immutable, append-only segment files on disk)
4. Follower replicas in the ISR (In-Sync Replica set) pull the record from the leader and write it to their own logs
5. The leader waits for min.insync.replicas acknowledgments before confirming the write to the producer (with acks=all)
6. The consumer group coordinator assigns partitions to consumers via a rebalance protocol (range or cooperative-sticky assignor)
7. Each consumer reads sequentially from its assigned partitions, tracking its position via offsets
8. Consumers commit offsets to the internal __consumer_offsets topic; combined with the transactional API, this enables exactly-once processing
9. The KRaft controller quorum (replacing ZooKeeper) manages broker metadata, partition leadership elections, and cluster configuration
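Steps 3 and 7 hinge on the same structure: an append-only log addressed by offset. A minimal Python sketch (a toy model, not broker code) shows why replay falls out for free:

```python
class PartitionLog:
    """Toy model of one Kafka partition: an append-only list of records.

    Offsets are just positions in the list. Reads never delete anything,
    so any consumer can re-read ("replay") from any offset it likes.
    """

    def __init__(self):
        self._records = []

    def append(self, key, value):
        """Append a record and return its offset, as the leader broker does."""
        self._records.append((key, value))
        return len(self._records) - 1

    def read(self, offset, max_records=10):
        """Sequential fetch starting at `offset`, like a consumer poll."""
        return self._records[offset:offset + max_records]


log = PartitionLog()
for i in range(5):
    log.append(f"k{i}", f"v{i}")

first_pass = log.read(offset=0, max_records=3)   # one consumer's first fetch
replay = log.read(offset=0, max_records=3)       # a second group replays the same data
```

Because consumption is just offset arithmetic, two independent consumer groups reading the same partition never interfere with each other.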

⚠ The Problem

LinkedIn, Uber, and hundreds of other companies need to process millions of events per second (user clicks, GPS pings, transactions) with zero data loss. Traditional message queues (RabbitMQ, SQS) delete messages once consumed, which makes replay, auditing, and stream processing over history impossible.

✓ The Solution

Kafka models the message queue as an immutable, distributed, append-only log. Events are retained for configurable periods (days, weeks, or forever), so consumers can replay history, multiple independent consumer groups can process the same stream, and stream processing frameworks (Flink, Spark) can compute aggregations in real time.

📊 Scale at a Glance

  • Throughput per broker: 1M+ msg/sec
  • Retention: unlimited (configurable)
  • Latency (p99): < 10 ms
  • LinkedIn peak: 7T msgs/day

🔬 Deep Dive

1. Partitions: The Unit of Parallelism

A Kafka topic is divided into N partitions, each an ordered, immutable sequence of records stored on disk. Partitions enable horizontal scaling: each partition is owned by one broker and consumed by at most one consumer within a group. More partitions mean more parallelism, but also more metadata overhead in ZooKeeper/KRaft. As a sizing rule of thumb, keep per-partition throughput in the tens of MB/s range.
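The key-to-partition mapping can be made concrete. Below is a from-scratch Python port of the murmur2 routine used by the Java client's default partitioner; treat it as an illustration rather than a drop-in replacement, since the real client also handles null keys and sticky partitioning:

```python
def murmur2(data: bytes) -> int:
    """Pure-Python port of the 32-bit murmur2 hash Kafka's partitioner uses."""
    length = len(data)
    m = 0x5BD1E995
    h = (0x9747B28C ^ length) & 0xFFFFFFFF   # seed XOR length

    # Mix four bytes at a time (little-endian, like the Java client).
    for i in range(length // 4):
        k = int.from_bytes(data[i * 4:i * 4 + 4], "little")
        k = (k * m) & 0xFFFFFFFF
        k ^= k >> 24
        k = (k * m) & 0xFFFFFFFF
        h = (h * m) & 0xFFFFFFFF
        h ^= k

    # Fold in the trailing 1-3 bytes.
    tail, rem = length & ~3, length % 4
    if rem >= 3:
        h ^= data[tail + 2] << 16
    if rem >= 2:
        h ^= data[tail + 1] << 8
    if rem >= 1:
        h ^= data[tail]
        h = (h * m) & 0xFFFFFFFF

    h ^= h >> 13
    h = (h * m) & 0xFFFFFFFF
    h ^= h >> 15
    return h


def partition_for(key: bytes, num_partitions: int) -> int:
    # Kafka masks off the sign bit ("toPositive") before taking the modulus.
    return (murmur2(key) & 0x7FFFFFFF) % num_partitions
```

Because the mapping depends only on the key bytes and the partition count, all records for a key land on one partition and stay ordered; but adding partitions later changes the mapping, which is one reason resizing topics is painful.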

2. Replication and Leader Election

Each partition has one leader and N-1 follower replicas. All reads and writes go through the leader; followers pull from the leader to stay in sync. The In-Sync Replicas (ISR) list tracks which replicas are caught up. If the leader fails, Kafka elects a new leader from the ISR, typically within seconds. Setting min.insync.replicas=2 (with acks=all) preserves durability even if one broker fails.
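The durability rule is easy to state in code. A hedged sketch follows: the real broker tracks ISR membership per partition from replica fetch lag, while here it is just a set of broker names:

```python
def can_commit(isr: set, min_insync_replicas: int) -> bool:
    """With acks=all, a produce request succeeds only while at least
    min.insync.replicas members (leader included) remain in the ISR."""
    return len(isr) >= min_insync_replicas


# replication.factor=3, min.insync.replicas=2
isr = {"broker-1", "broker-2", "broker-3"}

isr.discard("broker-3")          # one broker fails and drops out of the ISR
one_down = can_commit(isr, 2)    # writes still succeed

isr.discard("broker-2")          # second failure: the leader alone is not enough
two_down = can_commit(isr, 2)    # producers now get NotEnoughReplicas errors
```

This is the tradeoff knob: min.insync.replicas=2 with a replication factor of 3 tolerates one broker loss without losing acknowledged writes, at the cost of rejecting writes during a double failure.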

3. Consumer Groups and Offset Management

Consumer groups enable parallel consumption: each partition is assigned to exactly one consumer in a group. Consumers commit their offsets (the position of the last processed message) back to Kafka's __consumer_offsets topic. If a consumer crashes, its partitions are rebalanced to other group members. This design keeps Kafka consumers largely stateless: any consumer can pick up where another left off.
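The rebalance step can be illustrated with the range assignor's arithmetic. A simplified single-topic sketch, matching the description above (contiguous ranges, with the first consumers absorbing the remainder):

```python
def range_assign(partitions: list, consumers: list) -> dict:
    """Single-topic sketch of Kafka's range assignor: sort the consumers,
    then hand each one a contiguous slice of the partition list."""
    consumers = sorted(consumers)
    per, extra = divmod(len(partitions), len(consumers))
    assignment, start = {}, 0
    for i, consumer in enumerate(consumers):
        n = per + (1 if i < extra else 0)   # first `extra` consumers get one more
        assignment[consumer] = partitions[start:start + n]
        start += n
    return assignment


before = range_assign(list(range(10)), ["c1", "c2", "c3"])
# c1 crashes: the group coordinator triggers a rebalance over the survivors
after = range_assign(list(range(10)), ["c2", "c3"])
```

Note that this recompute moves partitions that never needed to move; the cooperative-sticky assignor exists precisely to minimize that churn during a rebalance.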

4. Log Compaction: Kafka as a Database

Log compaction retains only the latest value for each message key. This transforms Kafka into a key-value store: the compacted log represents the current state of all keys. Kafka Streams and ksqlDB use compacted topics as materialized views, joining streams with state without an external database. Changelog topics (used by Kafka Streams) rely entirely on log compaction for state recovery.
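The compaction rule ("latest value per key wins, null deletes") fits in a few lines of Python. A toy sketch; the real log cleaner works segment by segment and preserves record offsets, which this ignores:

```python
def compact(records):
    """Keep only the newest value per key; a None value is a tombstone
    that deletes the key entirely, as in a compacted Kafka topic."""
    latest = {}
    for key, value in records:
        latest[key] = value                 # later records overwrite earlier ones
    return [(k, v) for k, v in latest.items() if v is not None]


changelog = [
    ("user-1", "alice@old.example"),
    ("user-2", "bob@example.com"),
    ("user-1", "alice@new.example"),
    ("user-2", None),                       # tombstone: forget user-2
]
state = compact(changelog)                  # [("user-1", "alice@new.example")]
```

Replaying a compacted topic from offset 0 rebuilds exactly this latest-value state, which is how Kafka Streams restores its local stores after a crash.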

5. Exactly-Once Semantics

Kafka achieves exactly-once processing via two mechanisms: idempotent producers (each batch carries a per-producer sequence number, so brokers deduplicate retries) and transactions (atomic writes across multiple partitions). The transactional API supports consume-process-produce loops with exactly-once guarantees, which is critical for financial systems. Exactly-once costs roughly 10% throughput versus at-least-once.
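The producer-side half can be sketched as broker-side sequence checking. This is deliberately simplified: a real broker tracks recent batches per producer and rejects sequence gaps with an OutOfOrderSequenceException, while this toy version only drops replays:

```python
class Broker:
    """Toy idempotent-producer dedup: remember the highest sequence number
    accepted per (producer_id, partition) and silently drop retried batches."""

    def __init__(self):
        self.last_seq = {}   # (producer_id, partition) -> last accepted seq
        self.log = []

    def append(self, producer_id, partition, seq, record):
        key = (producer_id, partition)
        if seq <= self.last_seq.get(key, -1):
            return False     # duplicate retry: re-acknowledge, do not re-append
        self.last_seq[key] = seq
        self.log.append(record)
        return True


broker = Broker()
broker.append("producer-1", 0, 0, "order-created")
broker.append("producer-1", 0, 0, "order-created")  # ack timed out, client retried
broker.append("producer-1", 0, 1, "order-paid")
```

Sequence numbers alone give idempotence within a single partition; making a consume-process-produce loop atomic across partitions is what the transactional API adds on top.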

⬡ Architecture Diagram

Apache Kafka Event Streaming Architecture: simplified architecture overview

✦ Core Concepts

  • Partitions & Offsets
  • Consumer Groups
  • Log Compaction
  • Exactly-Once Semantics
  • KRaft
  • Stream Processing

⚖ Tradeoffs & Design Decisions

Every architectural decision is a tradeoff. Here's what you gain and what you give up.

✓ Strengths

  • Decouples producers from consumers, so either side can scale independently
  • The persistent log enables replay, auditing, and stream processing over historical data
  • Extremely high throughput: 1M+ messages/sec per broker thanks to sequential disk I/O
  • Consumer groups let the same topic power multiple independent pipelines

✗ Weaknesses

  • Operational complexity: broker configuration, partition rebalancing, and offset management require deep expertise
  • A latency floor of roughly 5 ms end-to-end makes Kafka unsuitable for ultra-low-latency (<1 ms) use cases
  • Partition count is hard to change safely after topic creation (adding partitions remaps keys), so it requires careful upfront capacity planning
  • Consumer lag monitoring is critical: silent lag buildup can surface as processing delays hours later

🎯 FAANG Interview Questions

💡 These questions appear in FAANG system design rounds. Focus on tradeoffs, not just what the system does.

These are real system design interview questions asked at Google, Meta, Amazon, Apple, Netflix, and Microsoft. Study the architecture above before attempting.

  1. Design a system to process 1 million IoT sensor events per second. How would you size Kafka partitions and consumer groups?
  2. Explain how Kafka achieves exactly-once semantics. What are the producer- and consumer-side mechanisms?
  3. A Kafka consumer group is processing 10 partitions but one consumer is consistently slower, causing lag. How do you diagnose and fix this?
  4. Compare Kafka to RabbitMQ. For what use cases would you choose each, and what are the key architectural differences?
  5. How does log compaction work in Kafka? Give a concrete example of when you would use a compacted topic over a regular topic.

Research Papers & Further Reading

Kreps, J. et al. (LinkedIn). "Kafka: A Distributed Messaging System for Log Processing" (2011).
