Apache Kafka Event Streaming Architecture
Partitions, consumer groups, log compaction, and exactly-once semantics
Key Insight
Kafka's breakthrough: treating a message queue as an immutable log makes it replayable, auditable, and orders of magnitude more scalable.
How It Works
1. The producer sends an event record with a message key to the Kafka client.
2. The partitioner applies a murmur2 hash of the key, modulo the partition count, to select the target partition.
3. The record is sent to the leader broker for that partition, which appends it to the commit log (immutable, append-only segment files on disk).
4. Follower replicas in the ISR (In-Sync Replica set) pull the record from the leader and write it to their own logs.
5. The leader waits for min.insync.replicas acknowledgments before confirming the write to the producer (acks=all).
6. The consumer group coordinator assigns partitions to consumers via a rebalance protocol (range or cooperative-sticky assignor).
7. Each consumer reads sequentially from its assigned partitions, tracking its position via offsets.
8. Consumers commit offsets to the internal __consumer_offsets topic; combined with the transactional APIs, this enables exactly-once semantics.
9. The KRaft controller quorum (replacing ZooKeeper) manages broker metadata, partition leadership elections, and cluster configuration.
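The key-to-partition step (step 2 above) can be sketched in a few lines. This is a toy stand-in, not the real client: Kafka's default partitioner uses a murmur2 hash, while this sketch substitutes CRC-32 to illustrate the stable-hash-modulo-partition-count idea.

```python
import zlib

def choose_partition(key: bytes, num_partitions: int) -> int:
    """Pick a partition for a keyed record, mimicking the default partitioner.

    zlib.crc32 stands in for Kafka's murmur2 hash; both yield a stable
    32-bit hash, so the same key always maps to the same partition.
    """
    return (zlib.crc32(key) & 0x7FFFFFFF) % num_partitions

# Per-key ordering holds because "user-42" always lands on the same partition.
assert choose_partition(b"user-42", 6) == choose_partition(b"user-42", 6)
```

Note that the mapping depends on the partition count, which is one reason adding partitions to a keyed topic breaks existing key-to-partition ordering.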
The Problem
LinkedIn, Uber, and hundreds of other companies need to process millions of events per second (user clicks, GPS pings, transactions) with zero data loss. Traditional message queues (RabbitMQ, SQS) delete messages after consumption, making replay, auditing, and stream processing over history impossible.
The Solution
Kafka models the message queue as an immutable, distributed, append-only log. Events are retained for configurable periods (days, weeks, forever), enabling consumers to replay history, multiple independent consumer groups to process the same stream, and stream processing frameworks (Flink, Spark) to compute aggregations in real time.
Scale at a Glance
- Throughput/Broker: 1M+ msg/sec
- Retention: Unlimited (configurable)
- Latency (p99): < 10ms
- LinkedIn Peak: 7T msgs/day
Deep Dive
Partitions: The Unit of Parallelism
A Kafka topic is divided into N partitions, each an ordered, immutable sequence of records stored on disk. Partitions enable horizontal scaling: each partition is owned by one broker and consumed by at most one consumer within a group. More partitions mean more parallelism, but also more metadata overhead in ZooKeeper/KRaft. A rough rule of thumb: size the topic so each partition carries on the order of tens of MB/s.
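As a back-of-envelope illustration of that sizing rule (the 10 MB/s per-partition default below is an assumed conservative target, not a Kafka constant):

```python
import math

def partitions_needed(topic_mb_per_sec: float,
                      per_partition_mb_per_sec: float = 10.0) -> int:
    """Estimate a partition count from a topic's peak throughput target."""
    return math.ceil(topic_mb_per_sec / per_partition_mb_per_sec)

# A 500 MB/s topic at ~10 MB/s per partition needs about 50 partitions.
print(partitions_needed(500))  # 50
```

In practice you would also round up for headroom, since decreasing the partition count later is not possible.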
Replication and Leader Election
Each partition has one leader and N-1 follower replicas. All reads and writes go through the leader; followers pull from the leader to stay in sync. The In-Sync Replicas (ISR) list tracks which replicas are caught up. If the leader fails, Kafka elects a new leader from the ISR, typically within a few seconds. Setting min.insync.replicas=2 with a replication factor of 3 ensures durability even if one broker fails.
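A durability-focused setup might look like the following, shown as the kind of config dicts the confluent-kafka Python client accepts. The broker addresses are placeholders, but the config keys are real Kafka settings.

```python
# Producer-side settings for durable writes.
producer_config = {
    "bootstrap.servers": "broker1:9092,broker2:9092",  # placeholder addresses
    "acks": "all",                # wait for all in-sync replicas to acknowledge
    "enable.idempotence": True,   # broker deduplicates retried batches
}

# Topic-side settings: the replication factor is fixed at creation time;
# min.insync.replicas is a topic-level (or broker-wide) config.
topic_settings = {
    "replication_factor": 3,      # one leader + two followers
    "min.insync.replicas": "2",   # writes fail if fewer than 2 replicas are in sync
}
```

With replication factor 3 and min.insync.replicas=2, one broker can fail without either losing acknowledged writes or blocking producers.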
Consumer Groups and Offset Management
Consumer groups enable parallel consumption: each partition is assigned to exactly one consumer in a group. Consumers commit their offsets (the position of the last processed message) back to Kafka's __consumer_offsets topic. If a consumer crashes, its partitions are rebalanced to the remaining group members. This design keeps Kafka consumers effectively stateless: any consumer can pick up where another left off.
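Those mechanics can be modeled in a few lines. This toy model uses a simple round-robin assignor and a plain dict standing in for the __consumer_offsets topic; real Kafka assignors and the group coordinator protocol are considerably more involved.

```python
committed_offsets = {}  # (group, partition) -> next offset to read

def assign(partitions, consumers):
    """Round-robin partition assignment across live group members."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

partitions = list(range(4))
print(assign(partitions, ["c1", "c2"]))  # {'c1': [0, 2], 'c2': [1, 3]}

# c1 processes partition 0 up to offset 100 and commits.
committed_offsets[("g1", 0)] = 100

# c1 crashes; a rebalance hands everything to c2, which resumes at offset 100.
print(assign(partitions, ["c2"]))        # {'c2': [0, 1, 2, 3]}
resume_at = committed_offsets.get(("g1", 0), 0)
assert resume_at == 100
```

The committed offset, not any consumer-local state, is what lets the surviving consumer continue seamlessly.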
Log Compaction: Kafka as a Database
Log compaction retains only the latest value for each message key. This turns Kafka into a key-value store: the compacted log represents the current state of all keys. Kafka Streams and ksqlDB use compacted topics as materialized views, joining streams with state without an external database. Changelog topics (used by Kafka Streams) rely on log compaction to keep state recovery bounded and fast.
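A minimal sketch of the compaction rule, including tombstones (records with a null value, which delete a key from the compacted log):

```python
def compact(log):
    """Compact a list of (key, value) records: last writer wins per key."""
    latest = {}
    for key, value in log:
        if value is None:
            latest.pop(key, None)  # a None value is a tombstone: delete the key
        else:
            latest[key] = value
    return latest

log = [
    ("user-1", "alice"),
    ("user-2", "bob"),
    ("user-1", "alicia"),  # update: overwrites the earlier value for user-1
    ("user-2", None),      # tombstone: user-2 is removed entirely
]
print(compact(log))  # {'user-1': 'alicia'}
```

The result is exactly the "current state of all keys" described above, which is why a compacted topic can rebuild a key-value store or Streams state store from scratch.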
Exactly-Once Semantics
Kafka achieves exactly-once semantics via two mechanisms: idempotent producers (each message batch carries a sequence number, and brokers deduplicate retries) and transactions (atomic writes across multiple partitions). The transactional API supports consume-process-produce loops with exactly-once guarantees, which is critical for financial systems. Exactly-once costs roughly 10% throughput versus at-least-once.
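The idempotent-producer half can be modeled with a toy broker that tracks the highest sequence number seen per producer and silently drops retried duplicates. Real brokers track sequences per (producer id, partition) and per batch; this sketch compresses that to the essential idea.

```python
class ToyBroker:
    """Deduplicates producer retries the way idempotent producers work."""

    def __init__(self):
        self.log = []
        self.last_seq = {}  # producer_id -> highest sequence number appended

    def append(self, producer_id, seq, record):
        if self.last_seq.get(producer_id, -1) >= seq:
            return  # duplicate retry: already appended, just re-acknowledge
        self.log.append(record)
        self.last_seq[producer_id] = seq

b = ToyBroker()
b.append("p1", 0, "order-created")
b.append("p1", 0, "order-created")  # network retry of the same batch
b.append("p1", 1, "order-paid")
print(b.log)  # ['order-created', 'order-paid']
```

Without the sequence check, the retried batch would be appended twice, which is exactly the duplicate that at-least-once delivery permits.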
Architecture Diagram
Apache Kafka Event Streaming Architecture (simplified overview)
Core Concepts
- Partitions & Offsets
- Consumer Groups
- Log Compaction
- Exactly-Once Semantics
- KRaft
- Stream Processing
Tradeoffs & Design Decisions
Every architectural decision is a tradeoff. Here's what you gain and what you give up.
Strengths
- Decouples producers from consumers: either side can scale independently
- Persistent log enables replay, auditing, and stream processing on historical data
- Extremely high throughput: 1M+ messages/sec per broker with sequential disk I/O
- Consumer groups let the same topic power multiple independent pipelines
Weaknesses
- Operational complexity: broker configuration, partition rebalancing, and offset management require deep expertise
- Latency floor of ~5ms end-to-end makes Kafka unsuitable for ultra-low-latency (<1ms) use cases
- Partition count can be increased but never decreased, and adding partitions changes key-to-partition mapping, so capacity needs careful upfront planning
- Consumer lag monitoring is critical; silent lag buildup can cause processing delays hours later
FAANG Interview Questions
These questions appear in real system design rounds at companies such as Google, Meta, Amazon, Apple, Netflix, and Microsoft. Study the architecture above first, and focus on tradeoffs, not just what the system does.
- Q1: Design a system to process 1 million IoT sensor events per second. How would you size Kafka partitions and consumer groups?
- Q2: Explain how Kafka achieves exactly-once semantics. What are the producer- and consumer-side mechanisms?
- Q3: A Kafka consumer group is processing 10 partitions, but one consumer is consistently slower, causing lag. How do you diagnose and fix this?
- Q4: Compare Kafka to RabbitMQ. For what use cases would you choose each, and what are the key architectural differences?
- Q5: How does log compaction work in Kafka? Give a concrete example of when you would use a compacted topic over a regular topic.
Research Papers & Further Reading
- Kreps, J., Narkhede, N., & Rao, J. (LinkedIn). "Kafka: a Distributed Messaging System for Log Processing." NetDB 2011.