๐Ÿ—„๏ธ Data & InfrastructureAdvancedWeek 9

Apache Kafka Event Streaming Architecture

Partitions, consumer groups, log compaction, and exactly-once semantics

LinkedIn · Confluent · Uber · Airbnb

Key Insight

Kafka's breakthrough: treating a message queue as an immutable log makes it replayable, auditable, and orders of magnitude more scalable.


How It Works

1. Producer sends an event record with a message key to the Kafka client
2. The partitioner applies a murmur2 hash of the key, modulo the partition count, to select the target partition
3. The record is sent to the leader broker for that partition, which appends it to the commit log (immutable, append-only segment files on disk)
4. Follower replicas in the ISR (In-Sync Replica set) pull the record from the leader and write it to their own logs
5. The leader waits for min.insync.replicas acknowledgments before confirming the write to the producer (with acks=all)
6. The consumer group coordinator assigns partitions to consumers via a rebalance protocol (range or cooperative-sticky assignor)
7. Each consumer reads sequentially from its assigned partitions, tracking its position via offsets
8. Consumers commit offsets to the internal __consumer_offsets topic; combined with the transactional API, this enables exactly-once processing
9. The KRaft controller quorum (replacing ZooKeeper) manages broker metadata, partition leadership elections, and cluster configuration
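Steps 3 and 7 hinge on the same structure: an append-only log addressed by offset. A minimal Python sketch (a toy model, not broker code) shows why replay falls out for free:

```python
class PartitionLog:
    """Toy model of one Kafka partition: an append-only list of records.

    Offsets are just positions in the list. Reads never delete anything,
    so any consumer can re-read ("replay") from any offset it likes.
    """

    def __init__(self):
        self._records = []

    def append(self, key, value):
        """Append a record and return its offset, as the leader broker does."""
        self._records.append((key, value))
        return len(self._records) - 1

    def read(self, offset, max_records=10):
        """Sequential fetch starting at `offset`, like a consumer poll."""
        return self._records[offset:offset + max_records]


log = PartitionLog()
for i in range(5):
    log.append(f"k{i}", f"v{i}")

first_pass = log.read(offset=0, max_records=3)   # one consumer's first fetch
replay = log.read(offset=0, max_records=3)       # a second group replays the same data
```

Because consumption is just offset arithmetic, two independent consumer groups reading the same partition never interfere with each other.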

⚠ The Problem

LinkedIn, Uber, and hundreds of other companies need to process millions of events per second (user clicks, GPS pings, transactions) with zero data loss. Traditional message queues (RabbitMQ, SQS) delete messages once consumed, which makes replay, auditing, and stream processing over history impossible.

✓ The Solution

Kafka models the message queue as an immutable, distributed, append-only log. Events are retained for configurable periods (days, weeks, or forever), so consumers can replay history, multiple independent consumer groups can process the same stream, and stream processing frameworks (Flink, Spark) can compute aggregations in real time.

📊 Scale at a Glance

  • Throughput per broker: 1M+ msg/sec
  • Retention: unlimited (configurable)
  • Latency (p99): < 10 ms
  • LinkedIn peak: 7T msgs/day

🔬 Deep Dive

1. Partitions: The Unit of Parallelism

A Kafka topic is divided into N partitions, each an ordered, immutable sequence of records stored on disk. Partitions enable horizontal scaling: each partition is owned by one broker and consumed by at most one consumer within a group. More partitions mean more parallelism, but also more metadata overhead in ZooKeeper/KRaft. As a sizing rule of thumb, keep per-partition throughput in the tens of MB/s range.
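The key-to-partition mapping can be made concrete. Below is a from-scratch Python port of the murmur2 routine used by the Java client's default partitioner; treat it as an illustration rather than a drop-in replacement, since the real client also handles null keys and sticky partitioning:

```python
def murmur2(data: bytes) -> int:
    """Pure-Python port of the 32-bit murmur2 hash Kafka's partitioner uses."""
    length = len(data)
    m = 0x5BD1E995
    h = (0x9747B28C ^ length) & 0xFFFFFFFF   # seed XOR length

    # Mix four bytes at a time (little-endian, like the Java client).
    for i in range(length // 4):
        k = int.from_bytes(data[i * 4:i * 4 + 4], "little")
        k = (k * m) & 0xFFFFFFFF
        k ^= k >> 24
        k = (k * m) & 0xFFFFFFFF
        h = (h * m) & 0xFFFFFFFF
        h ^= k

    # Fold in the trailing 1-3 bytes.
    tail, rem = length & ~3, length % 4
    if rem >= 3:
        h ^= data[tail + 2] << 16
    if rem >= 2:
        h ^= data[tail + 1] << 8
    if rem >= 1:
        h ^= data[tail]
        h = (h * m) & 0xFFFFFFFF

    h ^= h >> 13
    h = (h * m) & 0xFFFFFFFF
    h ^= h >> 15
    return h


def partition_for(key: bytes, num_partitions: int) -> int:
    # Kafka masks off the sign bit ("toPositive") before taking the modulus.
    return (murmur2(key) & 0x7FFFFFFF) % num_partitions
```

Because the mapping depends only on the key bytes and the partition count, all records for a key land on one partition and stay ordered; but adding partitions later changes the mapping, which is one reason resizing topics is painful.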

2. Replication and Leader Election

Each partition has one leader and N-1 follower replicas. All reads and writes go through the leader; followers pull from the leader to stay in sync. The In-Sync Replicas (ISR) list tracks which replicas are caught up. If the leader fails, Kafka elects a new leader from the ISR, typically within seconds. Setting min.insync.replicas=2 (with acks=all) preserves durability even if one broker fails.
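The durability rule is easy to state in code. A hedged sketch follows: the real broker tracks ISR membership per partition from replica fetch lag, while here it is just a set of broker names:

```python
def can_commit(isr: set, min_insync_replicas: int) -> bool:
    """With acks=all, a produce request succeeds only while at least
    min.insync.replicas members (leader included) remain in the ISR."""
    return len(isr) >= min_insync_replicas


# replication.factor=3, min.insync.replicas=2
isr = {"broker-1", "broker-2", "broker-3"}

isr.discard("broker-3")          # one broker fails and drops out of the ISR
one_down = can_commit(isr, 2)    # writes still succeed

isr.discard("broker-2")          # second failure: the leader alone is not enough
two_down = can_commit(isr, 2)    # producers now get NotEnoughReplicas errors
```

This is the tradeoff knob: min.insync.replicas=2 with a replication factor of 3 tolerates one broker loss without losing acknowledged writes, at the cost of rejecting writes during a double failure.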

3. Consumer Groups and Offset Management

Consumer groups enable parallel consumption: each partition is assigned to exactly one consumer in a group. Consumers commit their offsets (the position of the last processed message) back to Kafka's __consumer_offsets topic. If a consumer crashes, its partitions are rebalanced to other group members. This design keeps Kafka consumers largely stateless: any consumer can pick up where another left off.
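The rebalance step can be illustrated with the range assignor's arithmetic. A simplified single-topic sketch, matching the description above (contiguous ranges, with the first consumers absorbing the remainder):

```python
def range_assign(partitions: list, consumers: list) -> dict:
    """Single-topic sketch of Kafka's range assignor: sort the consumers,
    then hand each one a contiguous slice of the partition list."""
    consumers = sorted(consumers)
    per, extra = divmod(len(partitions), len(consumers))
    assignment, start = {}, 0
    for i, consumer in enumerate(consumers):
        n = per + (1 if i < extra else 0)   # first `extra` consumers get one more
        assignment[consumer] = partitions[start:start + n]
        start += n
    return assignment


before = range_assign(list(range(10)), ["c1", "c2", "c3"])
# c1 crashes: the group coordinator triggers a rebalance over the survivors
after = range_assign(list(range(10)), ["c2", "c3"])
```

Note that this recompute moves partitions that never needed to move; the cooperative-sticky assignor exists precisely to minimize that churn during a rebalance.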

4. Log Compaction: Kafka as a Database

Log compaction retains only the latest value for each message key. This transforms Kafka into a key-value store: the compacted log represents the current state of all keys. Kafka Streams and ksqlDB use compacted topics as materialized views, joining streams with state without an external database. Changelog topics (used by Kafka Streams) rely entirely on log compaction for state recovery.
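The compaction rule ("latest value per key wins, null deletes") fits in a few lines of Python. A toy sketch; the real log cleaner works segment by segment and preserves record offsets, which this ignores:

```python
def compact(records):
    """Keep only the newest value per key; a None value is a tombstone
    that deletes the key entirely, as in a compacted Kafka topic."""
    latest = {}
    for key, value in records:
        latest[key] = value                 # later records overwrite earlier ones
    return [(k, v) for k, v in latest.items() if v is not None]


changelog = [
    ("user-1", "alice@old.example"),
    ("user-2", "bob@example.com"),
    ("user-1", "alice@new.example"),
    ("user-2", None),                       # tombstone: forget user-2
]
state = compact(changelog)                  # [("user-1", "alice@new.example")]
```

Replaying a compacted topic from offset 0 rebuilds exactly this latest-value state, which is how Kafka Streams restores its local stores after a crash.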

5. Exactly-Once Semantics

Kafka achieves exactly-once processing via two mechanisms: idempotent producers (each batch carries a per-producer sequence number, so brokers deduplicate retries) and transactions (atomic writes across multiple partitions). The transactional API supports consume-process-produce loops with exactly-once guarantees, which is critical for financial systems. Exactly-once costs roughly 10% throughput versus at-least-once.
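The producer-side half can be sketched as broker-side sequence checking. This is deliberately simplified: a real broker tracks recent batches per producer and rejects sequence gaps with an OutOfOrderSequenceException, while this toy version only drops replays:

```python
class Broker:
    """Toy idempotent-producer dedup: remember the highest sequence number
    accepted per (producer_id, partition) and silently drop retried batches."""

    def __init__(self):
        self.last_seq = {}   # (producer_id, partition) -> last accepted seq
        self.log = []

    def append(self, producer_id, partition, seq, record):
        key = (producer_id, partition)
        if seq <= self.last_seq.get(key, -1):
            return False     # duplicate retry: re-acknowledge, do not re-append
        self.last_seq[key] = seq
        self.log.append(record)
        return True


broker = Broker()
broker.append("producer-1", 0, 0, "order-created")
broker.append("producer-1", 0, 0, "order-created")  # ack timed out, client retried
broker.append("producer-1", 0, 1, "order-paid")
```

Sequence numbers alone give idempotence within a single partition; making a consume-process-produce loop atomic across partitions is what the transactional API adds on top.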

⬡ Architecture Diagram

Apache Kafka Event Streaming Architecture: simplified architecture overview

✦ Core Concepts

  • Partitions & Offsets
  • Consumer Groups
  • Log Compaction
  • Exactly-Once Semantics
  • KRaft
  • Stream Processing

⚖ Tradeoffs & Design Decisions

Every architectural decision is a tradeoff. Here's what you gain and what you give up.

✓ Strengths

  • Decouples producers from consumers, so either side can scale independently
  • The persistent log enables replay, auditing, and stream processing over historical data
  • Extremely high throughput: 1M+ messages/sec per broker thanks to sequential disk I/O
  • Consumer groups let the same topic power multiple independent pipelines

✗ Weaknesses

  • Operational complexity: broker configuration, partition rebalancing, and offset management require deep expertise
  • A latency floor of roughly 5 ms end-to-end makes Kafka unsuitable for ultra-low-latency (<1 ms) use cases
  • Partition count is hard to change safely after topic creation (adding partitions remaps keys), so it requires careful upfront capacity planning
  • Consumer lag monitoring is critical: silent lag buildup can surface as processing delays hours later

🎯 FAANG Interview Questions

💡 These questions appear in FAANG system design rounds. Focus on tradeoffs, not just what the system does.

These are real system design interview questions asked at Google, Meta, Amazon, Apple, Netflix, and Microsoft. Study the architecture above before attempting.

  1. Design a system to process 1 million IoT sensor events per second. How would you size Kafka partitions and consumer groups?
  2. Explain how Kafka achieves exactly-once semantics. What are the producer- and consumer-side mechanisms?
  3. A Kafka consumer group is processing 10 partitions but one consumer is consistently slower, causing lag. How do you diagnose and fix this?
  4. Compare Kafka to RabbitMQ. For what use cases would you choose each, and what are the key architectural differences?
  5. How does log compaction work in Kafka? Give a concrete example of when you would use a compacted topic over a regular topic.

Research Papers & Further Reading

Kreps, J. et al. (LinkedIn). "Kafka: A Distributed Messaging System for Log Processing" (2011).
