Kubernetes Container Orchestration Architecture
etcd, the scheduler, controller loops, and service mesh
Key Insight
Kubernetes' key abstraction: controllers watch for "desired state ≠ actual state" and take action to close the gap. This declarative model makes the system self-healing.
Request Journey
How It Works
1. User submits a Pod spec via kubectl, which sends it to the API Server as a REST request
2. API Server validates the spec through admission controllers (mutating + validating webhooks) and persists it to etcd
3. Scheduler watches for unscheduled Pods, filters candidate nodes using predicates (resource fit, affinity, taints), scores the survivors using priorities (bin-packing, spread), then binds the Pod to the best node
4. kubelet on the selected node detects the new binding via an API Server watch, pulls container images, and calls the Container Runtime Interface (CRI) to start containers
5. containerd creates the container with Linux namespaces and cgroups for isolation
6. kube-proxy programs iptables/IPVS rules so Service ClusterIPs route to healthy Pod endpoints
7. Controller Manager runs reconciliation loops (Deployment, ReplicaSet, StatefulSet controllers), continuously comparing desired state in etcd against actual state
8. When drift is detected (Pod crash, node failure), controllers create or delete Pods to converge actual state back to desired state
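The journey above can be sketched as a toy simulation. This is illustrative pseudocode under simplified assumptions, not real Kubernetes APIs: every function name, field, and node is hypothetical, and image pulls, admission webhooks, and CRI calls are elided.

```python
# Toy end-to-end flow: API server persists specs, scheduler binds
# unscheduled pods, kubelet starts containers for pods bound to it.
store = {}  # stands in for etcd: the single source of truth

def api_server_apply(pod):
    """Validate and persist a Pod spec (steps 1-2)."""
    assert "name" in pod and "image" in pod, "admission would reject this"
    store[pod["name"]] = {**pod, "node": None, "phase": "Pending"}

def scheduler_tick(nodes):
    """Bind every unscheduled Pod to a node that fits (step 3)."""
    for pod in store.values():
        if pod["node"] is None:
            fitting = [n for n in nodes if nodes[n] >= pod["cpu"]]
            if fitting:
                # bin-packing heuristic: pick the tightest fit
                pod["node"] = min(fitting, key=lambda n: nodes[n])
                nodes[pod["node"]] -= pod["cpu"]

def kubelet_tick(node):
    """Start containers for pods bound to this node (steps 4-5)."""
    for pod in store.values():
        if pod["node"] == node and pod["phase"] == "Pending":
            pod["phase"] = "Running"  # image pull + CRI call elided

nodes = {"node-a": 4, "node-b": 2}  # free CPU cores per node
api_server_apply({"name": "web-1", "image": "nginx:1.27", "cpu": 2})
scheduler_tick(nodes)
kubelet_tick(store["web-1"]["node"])
print(store["web-1"]["phase"])  # Running
```

Note how every component communicates only through the shared store, mirroring the real design where all coordination flows through the API server and etcd.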
The Problem
Running thousands of containerized services in production requires solving bin-packing (which container runs on which server?), health monitoring, rolling deployments, service discovery, and network routing, all simultaneously. Manual orchestration does not scale past a few dozen services.
The Solution
Kubernetes provides a declarative API: you describe desired state (e.g., 3 replicas of image X, each with 2 GB RAM), and a set of controllers continuously reconcile actual state to match. This self-healing loop handles node failures, rolling deployments, and autoscaling without human intervention.
Scale at a Glance
- Max nodes per cluster: 5,000
- Max pods per cluster: 150,000
- Scheduling latency: < 100 ms
- Recommended etcd cluster size: 3 or 5 nodes
Deep Dive
The Control Plane: API Server + etcd
All Kubernetes state lives in etcd, a distributed key-value store with Raft consensus. The API server is the only component that reads/writes etcd directly; all other components talk to the API server. This architecture means etcd becomes the single source of truth, and any component can be restarted without losing state. etcd is typically run with 3 or 5 nodes for quorum.
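The "3 or 5 nodes" recommendation falls out of Raft's majority requirement. A write commits once a quorum of ⌊n/2⌋ + 1 members acknowledges it, so an n-node cluster tolerates n minus quorum failures, and even cluster sizes buy no extra fault tolerance:

```python
# Raft quorum arithmetic: nodes, quorum, failures tolerated.
def quorum(n):
    return n // 2 + 1

for n in (1, 2, 3, 4, 5):
    print(n, quorum(n), n - quorum(n))
# 3 nodes tolerate 1 failure; 4 nodes still tolerate only 1; 5 tolerate 2.
```

This is why production etcd clusters use odd sizes: a 4th node adds write latency (one more member to replicate to) without surviving any additional failures.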
Reconciliation Loops: The Heart of Kubernetes
Every Kubernetes controller runs a reconciliation loop: watch for resource changes, compare desired state to actual state, take action to converge. The Deployment controller watches Deployments and manages ReplicaSets. The ReplicaSet controller watches ReplicaSets and manages Pods. This composable design means adding a new resource type (CRD) just requires writing a new controller.
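The loop's shape can be shown in a few lines. This is a minimal sketch of the reconcile pattern, not client-go or controller-runtime code; pod names and the synchronous structure are simplifications (real controllers react to watch events and act asynchronously):

```python
# One reconcile pass: compare desired replica count to actual pods,
# then create or delete pods until actual converges to desired.
def reconcile(desired_replicas, actual_pods):
    pods = list(actual_pods)
    while len(pods) < desired_replicas:
        pods.append(f"pod-{len(pods)}")  # create missing pods
    while len(pods) > desired_replicas:
        pods.pop()                        # delete surplus pods
    return pods

print(reconcile(3, ["pod-0"]))                    # scale up
print(reconcile(1, ["pod-0", "pod-1", "pod-2"]))  # scale down
```

Because the loop is level-triggered (it compares states, not events), a missed event is harmless: the next pass still converges.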
The Scheduler: Bin-Packing with Constraints
The Kubernetes scheduler is a two-phase process: filtering (which nodes can run this pod given CPU/memory/taints/affinities?) and scoring (which filtered node is the best fit?). Default scoring prefers spreading pods across nodes for HA and packing to maximize utilization. Custom schedulers can be plugged in for specialized workloads like GPU-intensive ML training jobs.
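The two phases can be sketched as follows. This is an illustrative simplification, not the kube-scheduler plugin API: it filters on free CPU and a taint set, then scores by headroom (a spread-style heuristic), with all names hypothetical:

```python
# Two-phase scheduling: filter infeasible nodes, score the survivors.
def schedule(pod_cpu, nodes, tainted):
    # Filtering: enough free CPU and no taint the pod doesn't tolerate.
    feasible = {n: free for n, free in nodes.items()
                if free >= pod_cpu and n not in tainted}
    if not feasible:
        return None  # pod stays Pending
    # Scoring: prefer the node with the most headroom (spread for HA).
    return max(feasible, key=feasible.get)

nodes = {"node-a": 1, "node-b": 8, "node-c": 6}
print(schedule(2, nodes, tainted={"node-b"}))  # node-c
```

Swapping `max` for `min` flips the policy from spreading to bin-packing, which is exactly the kind of knob the real scoring plugins expose.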
Networking: Services and kube-proxy
Kubernetes Services provide stable virtual IPs (ClusterIP) for groups of pods. kube-proxy programs iptables or IPVS rules to DNAT traffic from the VIP to one of the backing pod IPs. A Service with 10 pods gets a chain of probability-weighted iptables rules that picks an endpoint at random: simple but effective. For production, a service mesh like Istio replaces this with a sidecar proxy per pod for mTLS, circuit breaking, and traffic shaping.
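The random selection is worth unpacking: iptables rules match sequentially, so kube-proxy sets rule i's match probability to 1/(n−i), which yields a uniform pick overall. A sketch of the math (not actual netfilter code; the endpoint IPs are made up):

```python
# How a cascade of iptables "statistic --mode random" rules spreads
# traffic uniformly: rule i fires with probability 1/(n-i), and the
# last rule always matches.
import random

def pick_endpoint(endpoints, rng):
    n = len(endpoints)
    for i, ep in enumerate(endpoints):
        if rng.random() < 1 / (n - i):  # rule i's match probability
            return ep
    return endpoints[-1]  # unreachable: the last probability is 1.0

rng = random.Random(0)
counts = {ep: 0 for ep in ("10.0.0.1", "10.0.0.2", "10.0.0.3")}
for _ in range(30000):
    counts[pick_endpoint(list(counts), rng)] += 1
# Each endpoint receives roughly a third of the traffic.
```

This per-rule arithmetic is also why large Services get slow in iptables mode (O(n) rule traversal per packet), and why IPVS mode, which uses an in-kernel hash table, scales better.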
Custom Resource Definitions: Extending Kubernetes
CRDs let you define new resource types (e.g., PostgresCluster, KafkaTopic) with custom schemas. A controller watches for these resources and takes domain-specific action. This operator pattern is how databases (Strimzi Kafka, CloudNativePG) are managed on Kubernetes โ users interact with high-level abstractions while the operator handles upgrade sequencing, backup scheduling, and failover.
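An operator is just the same reconcile shape with domain-specific actions. Below is a toy sketch for a hypothetical PostgresCluster resource; every field and action string is invented for illustration and does not match any real operator's API:

```python
# Toy operator reconcile: diff the custom resource's spec against
# observed cluster state and emit domain-specific actions.
def reconcile_postgres(cr, cluster_state):
    actions = []
    if cluster_state.get("version") != cr["spec"]["version"]:
        actions.append("upgrade replicas one at a time, primary last")
    if not cluster_state.get("backup_cron"):
        actions.append(f"schedule backups: {cr['spec']['backupSchedule']}")
    while cluster_state.get("replicas", 0) < cr["spec"]["replicas"]:
        cluster_state["replicas"] = cluster_state.get("replicas", 0) + 1
        actions.append("create replica")
    return actions

cr = {"spec": {"version": "16.3", "replicas": 3, "backupSchedule": "0 2 * * *"}}
print(reconcile_postgres(cr, {"version": "16.3", "replicas": 1}))
```

The value of the pattern is that operational knowledge (upgrade ordering, backup cadence, failover) lives in the controller, while users only ever edit the high-level spec.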
Architecture Diagram
Kubernetes Container Orchestration Architecture (simplified architecture overview)
Core Concepts
etcd
Reconciliation Loops
Pod Scheduling
kube-proxy
Service Mesh (Istio)
CRDs
Tradeoffs & Design Decisions
Every architectural decision is a tradeoff. Here's what you gain and what you give up.
Strengths
- Self-healing: automatic pod restart, node replacement, and rescheduling on failure
- Declarative config enables GitOps: entire cluster state in version control
- Massive ecosystem: Helm charts, operators, and service meshes for every use case
- Horizontal Pod Autoscaler scales deployments automatically based on CPU or custom metrics
Weaknesses
- Steep learning curve: YAML complexity, RBAC, and networking concepts take months to master
- etcd is the control plane's critical dependency: losing quorum halts all cluster changes, so it requires careful backup and sizing
- Networking model is complex: understanding CNI plugins, Service types, and ingress controllers is non-trivial
- Resource overhead: a highly available cluster needs 3 control-plane nodes plus worker nodes before running any workload
FAANG Interview Questions
Interview Prep: These questions appear in FAANG system design rounds. Focus on tradeoffs, not just what the system does.
These are real system design interview questions asked at Google, Meta, Amazon, Apple, Netflix, and Microsoft. Study the architecture above before attempting.
Q1. Walk me through what happens when you run kubectl apply -f deployment.yaml. Name every component involved.
Q2. A pod is stuck in CrashLoopBackOff. Walk me through your debugging process, naming the kubectl commands you would run.
Q3. Design a deployment strategy for a stateful database on Kubernetes. How do you handle persistent storage and rolling upgrades?
Q4. Explain Kubernetes networking: how does a request from Pod A reach Pod B on a different node?
Q5. When would you NOT use Kubernetes? What are the operational tradeoffs versus simpler alternatives like ECS or plain VMs?