Kubernetes Container Orchestration Architecture
etcd, the scheduler, controller loops, and service mesh
Key Insight
Kubernetes' key abstraction: controllers watch for "desired state ≠ actual state" and take action to close the gap. This declarative model makes the system self-healing.
Request Journey
How It Works
1. User submits a Pod spec via kubectl, which sends it to the API Server as a REST request
2. API Server validates the spec through admission controllers (mutating + validating webhooks) and persists it to etcd
3. Scheduler watches for unscheduled Pods, filters candidate nodes using predicates (resource fit, affinity, taints), scores the survivors using priorities (bin-packing, spread), then binds the Pod to the best node
4. kubelet on the selected node detects the new binding via an API Server watch, pulls container images, and calls the Container Runtime Interface (CRI) to start containers
5. containerd creates the container with Linux namespaces and cgroups for isolation
6. kube-proxy programs iptables/IPVS rules so Service ClusterIPs route to healthy Pod endpoints
7. Controller Manager runs reconciliation loops (Deployment, ReplicaSet, StatefulSet controllers), continuously comparing desired state in etcd against actual state
8. When drift is detected (Pod crash, node failure), controllers create or delete Pods to converge actual state back to desired state
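The journey above can be sketched as a toy simulation. This is illustrative pseudocode under simplified assumptions, not real Kubernetes APIs: every function name, field, and node is hypothetical, and image pulls, admission webhooks, and CRI calls are elided.

```python
# Toy end-to-end flow: API server persists specs, scheduler binds
# unscheduled pods, kubelet starts containers for pods bound to it.
store = {}  # stands in for etcd: the single source of truth

def api_server_apply(pod):
    """Validate and persist a Pod spec (steps 1-2)."""
    assert "name" in pod and "image" in pod, "admission would reject this"
    store[pod["name"]] = {**pod, "node": None, "phase": "Pending"}

def scheduler_tick(nodes):
    """Bind every unscheduled Pod to a node that fits (step 3)."""
    for pod in store.values():
        if pod["node"] is None:
            fitting = [n for n in nodes if nodes[n] >= pod["cpu"]]
            if fitting:
                # bin-packing heuristic: pick the tightest fit
                pod["node"] = min(fitting, key=lambda n: nodes[n])
                nodes[pod["node"]] -= pod["cpu"]

def kubelet_tick(node):
    """Start containers for pods bound to this node (steps 4-5)."""
    for pod in store.values():
        if pod["node"] == node and pod["phase"] == "Pending":
            pod["phase"] = "Running"  # image pull + CRI call elided

nodes = {"node-a": 4, "node-b": 2}  # free CPU cores per node
api_server_apply({"name": "web-1", "image": "nginx:1.27", "cpu": 2})
scheduler_tick(nodes)
kubelet_tick(store["web-1"]["node"])
print(store["web-1"]["phase"])  # Running
```

Note how every component communicates only through the shared store, mirroring the real design where all coordination flows through the API server and etcd.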
The Problem
Running thousands of containerized services in production requires solving bin-packing (which container runs on which server?), health monitoring, rolling deployments, service discovery, and network routing, all simultaneously. Manual orchestration does not scale past a few dozen services.
The Solution
Kubernetes provides a declarative API: you describe desired state (e.g., 3 replicas of image X, each with 2 GB RAM), and a set of controllers continuously reconcile actual state to match. This self-healing loop handles node failures, rolling deployments, and autoscaling without human intervention.
Scale at a Glance
- Max nodes per cluster: 5,000
- Max pods per cluster: 150,000
- Scheduling latency: < 100 ms
- Recommended etcd cluster size: 3 or 5 nodes
Deep Dive
The Control Plane: API Server + etcd
All Kubernetes state lives in etcd, a distributed key-value store with Raft consensus. The API server is the only component that reads/writes etcd directly; all other components talk to the API server. This architecture means etcd becomes the single source of truth, and any component can be restarted without losing state. etcd is typically run with 3 or 5 nodes for quorum.
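The "3 or 5 nodes" recommendation falls out of Raft's majority requirement. A write commits once a quorum of ⌊n/2⌋ + 1 members acknowledges it, so an n-node cluster tolerates n minus quorum failures, and even cluster sizes buy no extra fault tolerance:

```python
# Raft quorum arithmetic: nodes, quorum, failures tolerated.
def quorum(n):
    return n // 2 + 1

for n in (1, 2, 3, 4, 5):
    print(n, quorum(n), n - quorum(n))
# 3 nodes tolerate 1 failure; 4 nodes still tolerate only 1; 5 tolerate 2.
```

This is why production etcd clusters use odd sizes: a 4th node adds write latency (one more member to replicate to) without surviving any additional failures.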
Reconciliation Loops: The Heart of Kubernetes
Every Kubernetes controller runs a reconciliation loop: watch for resource changes, compare desired state to actual state, take action to converge. The Deployment controller watches Deployments and manages ReplicaSets. The ReplicaSet controller watches ReplicaSets and manages Pods. This composable design means adding a new resource type (CRD) just requires writing a new controller.
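The loop's shape can be shown in a few lines. This is a minimal sketch of the reconcile pattern, not client-go or controller-runtime code; pod names and the synchronous structure are simplifications (real controllers react to watch events and act asynchronously):

```python
# One reconcile pass: compare desired replica count to actual pods,
# then create or delete pods until actual converges to desired.
def reconcile(desired_replicas, actual_pods):
    pods = list(actual_pods)
    while len(pods) < desired_replicas:
        pods.append(f"pod-{len(pods)}")  # create missing pods
    while len(pods) > desired_replicas:
        pods.pop()                        # delete surplus pods
    return pods

print(reconcile(3, ["pod-0"]))                    # scale up
print(reconcile(1, ["pod-0", "pod-1", "pod-2"]))  # scale down
```

Because the loop is level-triggered (it compares states, not events), a missed event is harmless: the next pass still converges.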
The Scheduler: Bin-Packing with Constraints
The Kubernetes scheduler is a two-phase process: filtering (which nodes can run this pod given CPU/memory/taints/affinities?) and scoring (which filtered node is the best fit?). Default scoring prefers spreading pods across nodes for HA and packing to maximize utilization. Custom schedulers can be plugged in for specialized workloads like GPU-intensive ML training jobs.
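The two phases can be sketched as follows. This is an illustrative simplification, not the kube-scheduler plugin API: it filters on free CPU and a taint set, then scores by headroom (a spread-style heuristic), with all names hypothetical:

```python
# Two-phase scheduling: filter infeasible nodes, score the survivors.
def schedule(pod_cpu, nodes, tainted):
    # Filtering: enough free CPU and no taint the pod doesn't tolerate.
    feasible = {n: free for n, free in nodes.items()
                if free >= pod_cpu and n not in tainted}
    if not feasible:
        return None  # pod stays Pending
    # Scoring: prefer the node with the most headroom (spread for HA).
    return max(feasible, key=feasible.get)

nodes = {"node-a": 1, "node-b": 8, "node-c": 6}
print(schedule(2, nodes, tainted={"node-b"}))  # node-c
```

Swapping `max` for `min` flips the policy from spreading to bin-packing, which is exactly the kind of knob the real scoring plugins expose.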
Networking: Services and kube-proxy
Kubernetes Services provide stable virtual IPs (ClusterIP) for groups of pods. kube-proxy programs iptables or IPVS rules to DNAT traffic from the VIP to one of the backing pod IPs. A Service with 10 pods gets a chain of probability-weighted iptables rules that picks an endpoint at random: simple but effective. For production, a service mesh like Istio replaces this with a sidecar proxy per pod for mTLS, circuit breaking, and traffic shaping.
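The random selection is worth unpacking: iptables rules match sequentially, so kube-proxy sets rule i's match probability to 1/(n−i), which yields a uniform pick overall. A sketch of the math (not actual netfilter code; the endpoint IPs are made up):

```python
# How a cascade of iptables "statistic --mode random" rules spreads
# traffic uniformly: rule i fires with probability 1/(n-i), and the
# last rule always matches.
import random

def pick_endpoint(endpoints, rng):
    n = len(endpoints)
    for i, ep in enumerate(endpoints):
        if rng.random() < 1 / (n - i):  # rule i's match probability
            return ep
    return endpoints[-1]  # unreachable: the last probability is 1.0

rng = random.Random(0)
counts = {ep: 0 for ep in ("10.0.0.1", "10.0.0.2", "10.0.0.3")}
for _ in range(30000):
    counts[pick_endpoint(list(counts), rng)] += 1
# Each endpoint receives roughly a third of the traffic.
```

This per-rule arithmetic is also why large Services get slow in iptables mode (O(n) rule traversal per packet), and why IPVS mode, which uses an in-kernel hash table, scales better.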
Custom Resource Definitions: Extending Kubernetes
CRDs let you define new resource types (e.g., PostgresCluster, KafkaTopic) with custom schemas. A controller watches for these resources and takes domain-specific action. This operator pattern is how databases (Strimzi Kafka, CloudNativePG) are managed on Kubernetes โ users interact with high-level abstractions while the operator handles upgrade sequencing, backup scheduling, and failover.
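An operator is just the same reconcile shape with domain-specific actions. Below is a toy sketch for a hypothetical PostgresCluster resource; every field and action string is invented for illustration and does not match any real operator's API:

```python
# Toy operator reconcile: diff the custom resource's spec against
# observed cluster state and emit domain-specific actions.
def reconcile_postgres(cr, cluster_state):
    actions = []
    if cluster_state.get("version") != cr["spec"]["version"]:
        actions.append("upgrade replicas one at a time, primary last")
    if not cluster_state.get("backup_cron"):
        actions.append(f"schedule backups: {cr['spec']['backupSchedule']}")
    while cluster_state.get("replicas", 0) < cr["spec"]["replicas"]:
        cluster_state["replicas"] = cluster_state.get("replicas", 0) + 1
        actions.append("create replica")
    return actions

cr = {"spec": {"version": "16.3", "replicas": 3, "backupSchedule": "0 2 * * *"}}
print(reconcile_postgres(cr, {"version": "16.3", "replicas": 1}))
```

The value of the pattern is that operational knowledge (upgrade ordering, backup cadence, failover) lives in the controller, while users only ever edit the high-level spec.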
Architecture Diagram
Kubernetes Container Orchestration Architecture (simplified architecture overview)
Core Concepts
etcd
Reconciliation Loops
Pod Scheduling
kube-proxy
Service Mesh (Istio)
CRDs
Tradeoffs & Design Decisions
Every architectural decision is a tradeoff. Here's what you gain and what you give up.
Strengths
- Self-healing: automatic pod restart, node replacement, and rescheduling on failure
- Declarative config enables GitOps: entire cluster state in version control
- Massive ecosystem: Helm charts, operators, and service meshes for every use case
- Horizontal Pod Autoscaler scales deployments automatically based on CPU or custom metrics
Weaknesses
- Steep learning curve: YAML complexity, RBAC, and networking concepts take months to master
- etcd is the control plane's critical dependency: losing quorum halts all cluster changes, so it requires careful backup and sizing
- Networking model is complex: understanding CNI plugins, Service types, and ingress controllers is non-trivial
- Resource overhead: a highly available cluster needs 3 control-plane nodes plus worker nodes before running any workload
FAANG Interview Questions
Interview Prep: These questions appear in FAANG system design rounds. Focus on tradeoffs, not just what the system does.
These are real system design interview questions asked at Google, Meta, Amazon, Apple, Netflix, and Microsoft. Study the architecture above before attempting.
Q1. Walk me through what happens when you run kubectl apply -f deployment.yaml. Name every component involved.
Q2. A pod is stuck in CrashLoopBackOff. Walk me through your debugging process, naming the kubectl commands you would run.
Q3. Design a deployment strategy for a stateful database on Kubernetes. How do you handle persistent storage and rolling upgrades?
Q4. Explain Kubernetes networking: how does a request from Pod A reach Pod B on a different node?
Q5. When would you NOT use Kubernetes? What are the operational tradeoffs versus simpler alternatives like ECS or plain VMs?