
Kubernetes — The Real Shape of the Thing

Himanshu

Most people learn Kubernetes by memorizing resource types. Pods, Deployments, Services, Ingress — they collect the vocabulary without ever building the mental model. Then something breaks at 2am and they realize they have no idea what the system is actually doing.

This is an attempt to give you the model first.

The Problem Kubernetes Solves

It’s 2013. You’re running a large distributed service — Google Search, Gmail. Thousands of machines. On each machine, you want to run multiple workloads simultaneously to improve utilization, because idle CPU is waste. But immediately you hit four problems:

  • Placement. Who decides which workload runs on which machine? Manual placement doesn’t scale. Humans can’t reason about thousands of nodes and thousands of services simultaneously.
  • Failure. What happens when a machine dies? Someone has to notice, decide where to reschedule, and act fast. At Google scale, machines die constantly.
  • Discovery. How do services find each other? If a service moves to a different machine, anything talking to it by IP is now broken.
  • Deploys. How do you roll out a new version without downtime? You can’t kill everything and restart — you need a rolling strategy, automated.

Before Kubernetes, people used one of three approaches, all terrible in different ways:

  • Bash scripts + cron jobs — worked until two things went wrong at the same time
  • Mesos + Marathon — more powerful, but a different complexity tax: hard to learn, hard to operate, ultimately lost the war
  • Hand-rolled orchestrators — expensive, bespoke, non-transferable knowledge

Google had been running an internal system called Borg since ~2004 that solved this for them. Kubernetes is the public reconstruction of Borg’s ideas, stripped of Google-internal assumptions.

The cost of the “before” world: you needed either a Google-sized infrastructure team, or you accepted ~15% utilization, manual deployments, and fragile resilience.

The Core Idea

Describe the desired state of your system, not the steps to get there — and let a control loop close the gap continuously.

That’s it. That’s Kubernetes.

Everything else — pods, deployments, services, ingress, operators — is either a vocabulary for expressing desired state, or a control loop that enforces it.

This shift from imperative (“do this, then this, then this”) to declarative (“this is what I want the world to look like”) is the conceptual leap. It’s not a syntax preference. It’s a fundamentally different model of what a computer system is responsible for.
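That declarative statement has a concrete shape. A minimal Deployment manifest (name and image are illustrative) asserts what should exist, not how to get there:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web                 # illustrative name
spec:
  replicas: 3               # "three copies should exist", not "start three copies"
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: nginx:1.25 # illustrative image and tag
```

Nothing in this file is an instruction. It is a fact to be made true, and kept true.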

What Kubernetes Trades Away

Kubernetes optimizes for resilience, portability, and large heterogeneous workloads across unreliable infrastructure — without operator intervention at each failure.

What it sacrifices:

  • Simplicity. A startup with 3 services is better served by Railway, Render, or Fly.io. Kubernetes is a good answer to a problem most teams don’t have yet.
  • Operational transparency. Because the system is always reconciling, it’s not always obvious why something happened. A pod restarted — was it OOM killed? Liveness probe failed? Node evicted? The system made a decision on your behalf and left breadcrumbs, not explanations.
  • Fast feedback loops. Applying a manifest and watching Kubernetes eventually converge is slow compared to docker run. The reconciliation loop adds latency to debugging.
  • Stateful workloads. Stateless apps were the assumed model. Databases, message brokers, anything with durable local state — Kubernetes will technically run them, but you’ll earn every feature. StatefulSets, PVCs, PodDisruptionBudgets, backup operators — it’s a different game entirely.

The ugly truth: Kubernetes is optimized for Google’s problem in 2004. Most teams aren’t Google. Many pay the full complexity tax to get 20% of the benefit.

The Layer Below

Kubernetes sits on top of — and hides — several things you’ll need to understand the moment it breaks.

Linux primitives:

  • cgroups — how Kubernetes enforces CPU and memory limits. When a container gets OOM killed, a cgroup limit was hit. Kubernetes didn’t kill it — the kernel did.
  • namespaces — network, PID, mount, UTS namespaces are what make containers isolated. Kubernetes isn’t doing the isolation; it’s orchestrating Linux’s isolation.
  • iptables / eBPF — kube-proxy rewrites iptables rules to implement Services. When a Service isn’t routing, you’re debugging iptables chains (or Cilium’s eBPF programs on a modern CNI).
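The cgroup point is visible from inside any container: on cgroup v2, the memory limit a pod's `resources.limits.memory` sets shows up as a file the kernel reads. A small sketch (pure Python, interpreting the file's contents rather than assuming a live cgroup filesystem):

```python
def parse_cgroup_memory_max(raw: str):
    """Interpret the contents of a cgroup v2 memory.max file.

    The kernel writes either the literal string "max" (no limit)
    or a byte count. When a container exceeds that byte count, the
    kernel's OOM killer, not Kubernetes, terminates it.
    """
    raw = raw.strip()
    if raw == "max":
        return None  # unlimited
    return int(raw)

# A 512Mi limit set on a pod shows up in the cgroup file as bytes:
limit = parse_cgroup_memory_max("536870912\n")
print(limit)  # 536870912 == 512 * 1024 * 1024
```

On a real node the file lives under `/sys/fs/cgroup/`; the point is that the enforcement mechanism is a kernel file, not a Kubernetes component.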

etcd: The entire cluster state lives in etcd. Kubernetes is, in a meaningful sense, a distributed state machine wrapped around etcd. If etcd goes down or gets corrupted, your cluster is either read-only or catastrophically broken. Most managed Kubernetes services hide etcd from you entirely. This is comfortable until a disaster recovery scenario reveals you’ve never thought about etcd backup strategy.

The network: Kubernetes defines a networking model (every pod gets an IP, pods can talk directly) but does not implement it. The CNI plugin does — Flannel, Calico, Cilium, Azure CNI. When network policies aren’t working, you’re debugging the CNI, not Kubernetes.

The Layer Above

Kubernetes leaves a lot unsolved. The industry has built entire ecosystems on top of it:

  • Helm — because Kubernetes manifests are verbose YAML with no templating, no versioning, no dependency management. Helm is itself a hack: a template engine bolted onto YAML with its own broken type system. It’s the RPM of the cloud-native world — universally used, universally complained about.
  • ArgoCD / Flux — because Kubernetes has no built-in concept of “deploy this app from this Git repo.”
  • Vault / External Secrets Operator — because Kubernetes Secrets are base64-encoded, not encrypted at rest by default, and not rotated. Kubernetes treats secrets like glorified ConfigMaps.
  • Prometheus / Grafana / OpenTelemetry — because Kubernetes gives you logs and basic events. It does not give you metrics, tracing, or dashboards.
  • Operators — for complex stateful workloads (Kafka, Postgres, Elasticsearch), someone has to encode the operational knowledge a human DBA would carry. This is both Kubernetes’s greatest extensibility success and an admission that the base platform isn’t enough.
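The Secrets point above is easy to demonstrate: base64 is an encoding, not encryption. Anyone who can read the Secret object can read the value:

```python
import base64

# What a Kubernetes Secret actually stores: the value, base64-encoded.
stored = base64.b64encode(b"s3cr3t-password").decode()
print(stored)  # this is what `kubectl get secret -o yaml` shows you

# Recovering the plaintext takes one function call:
print(base64.b64decode(stored).decode())  # s3cr3t-password
```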

The Failure Modes

This is where you learn the real shape of the system.

CrashLoopBackOff. A container exits non-zero, Kubernetes restarts it with exponential backoff. The failure is almost always in the application, not Kubernetes — but people blame Kubernetes first. The real lesson: your container must handle its own startup failures gracefully, because Kubernetes will keep trying indefinitely.
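The back-off schedule is worth internalizing. A sketch approximating the kubelet's behavior (start around 10 seconds, double per crash, cap at 5 minutes; the exact mechanics are kubelet implementation detail, not API contract):

```python
def crashloop_delays(restarts: int, base: float = 10.0, cap: float = 300.0):
    """Approximate the kubelet's restart back-off for a crashing container:
    start at ~10s, double after each crash, cap at 5 minutes. The kubelet
    resets the counter after the container runs cleanly for a while."""
    delays = []
    delay = base
    for _ in range(restarts):
        delays.append(min(delay, cap))
        delay *= 2
    return delays

print(crashloop_delays(7))  # [10.0, 20.0, 40.0, 80.0, 160.0, 300.0, 300.0]
```

Note the cap: a container crashing on a permanent misconfiguration will be retried every 5 minutes, forever, until a human intervenes.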

Resource starvation cascades. You don’t set resource requests correctly. The scheduler places too many pods on a node. Memory pressure triggers the OOM killer. The node starts evicting pods. Evicted pods land on other nodes, which start going over their limits. Rolling cascade. Kubernetes was trying to help — incorrect requests and limits were the actual root cause.
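The knob at the root of this cascade is the container's `resources` block (values here are illustrative):

```yaml
# Requests drive scheduling decisions; limits drive cgroup enforcement.
# Requests far below real usage are how a node ends up overcommitted
# and under memory pressure.
resources:
  requests:
    cpu: "250m"      # the scheduler reserves a quarter of a core
    memory: "256Mi"  # the scheduler reserves 256Mi on the node
  limits:
    memory: "512Mi"  # cgroup limit; exceeding it means an OOM kill
```

If requests say 256Mi but the process routinely uses 1Gi, the scheduler's placement math is wrong by construction, and the cascade above is the eventual result.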

etcd latency. As cluster size grows, etcd gets slow. The API server starts timing out. Controllers fall behind. Reconciliation slows. The cluster doesn’t fail hard — it fails soft, which is worse. You’ll see mysterious delays, events not firing, deployments that hang. Almost always: etcd disk I/O latency. etcd needs dedicated SSDs and IOPS on production clusters.

Networking black holes. kube-proxy iptables rules get out of sync. Or a CNI upgrade goes wrong. Packets are dropped silently. Pods can’t reach Services. There’s no Kubernetes-level error — from Kubernetes’s perspective, the Service exists and endpoints are healthy. You’re now debugging iptables at the kernel level. This is where most engineers hit a wall.

Node/control plane split-brain. The node loses contact with the API server. Kubernetes marks it NotReady after a timeout (default: 5 minutes), evicts the pods, and reschedules them elsewhere. But the node isn’t dead — it’s just partitioned. You now have the same workloads running in two places. For stateless apps, fine. For stateful apps with exclusive write access, this is a disaster scenario. The --pod-eviction-timeout knob exists precisely because someone learned this the hard way at 3am.

Privilege escalation via misconfiguration. RBAC is fine-grained but complex. A pod with automountServiceAccountToken: true (the default) gets a token. If that service account is bound to cluster-admin — it happens — a compromised container owns the cluster. The default Kubernetes security posture is not conservative.
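Opting out of the default token mount is one line in the pod spec (pod name and image here are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: no-token-pod            # illustrative
spec:
  automountServiceAccountToken: false  # no API token lands in the container
  containers:
    - name: app
      image: myapp:1.0          # illustrative
```

The same field can be set on the ServiceAccount itself, which is the saner default for workloads that never talk to the API server.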

The Mental Model: Kubernetes Is a Thermostat

A thermostat doesn’t execute steps. You don’t tell it “turn on the heat for 20 minutes, then check the temperature.” You tell it “I want 21°C” and it continuously measures reality against that target, making small adjustments. It doesn’t know or care how it got to 21°C. It only cares whether it’s there.

When you kubectl apply a Deployment, you’re setting the thermostat. Kubernetes doesn’t execute your YAML top-to-bottom. A controller wakes up, compares what exists in the cluster to what your YAML describes, and makes the minimum set of changes to close the gap. Then it checks again. And again. Forever.

This is why:

  • Applying the same manifest twice is safe — idempotency falls out naturally
  • Deleting a pod manually doesn’t delete the workload — the thermostat just creates a new one
  • The system self-heals — not because it’s “smart,” but because the reconciliation loop keeps running
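The thermostat can be sketched as a toy control loop. This is pure Python with illustrative names, not the Kubernetes API, but the shape is the same:

```python
def reconcile(desired: int, observed: dict) -> None:
    """One pass of a thermostat-style control loop: compare the desired
    replica count to what exists, then make the minimal change."""
    running = len(observed["pods"])
    if running < desired:
        for i in range(desired - running):
            observed["pods"].append(f"pod-{running + i}")  # "create" a pod
    elif running > desired:
        del observed["pods"][desired:]                     # "delete" extras

# The loop never executes a script; it closes a gap, repeatedly.
world = {"pods": []}
for _ in range(3):                 # in Kubernetes this runs forever
    reconcile(desired=3, observed=world)
print(world["pods"])               # ['pod-0', 'pod-1', 'pod-2']

# Re-running with the same desired state is a no-op: idempotency for free.
# Deleting a pod by hand just re-opens the gap:
world["pods"].pop()
reconcile(desired=3, observed=world)
print(len(world["pods"]))          # 3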

Hold this model and Kubernetes behavior becomes predictable without memorizing resource types.

The Real Insight: Kubernetes Is a Database with Actuators

A casual user thinks of Kubernetes as something that runs containers. They interact with it through kubectl apply and watch their pods come up.

Someone who has mastered it sees it differently: etcd is the source of truth; everything else is a controller that reads from it and drives the world toward what etcd says it should be. The API server is a validated interface to etcd. kubectl is a client to the API server. Every “feature” of Kubernetes — Deployments, Services, HPA, Ingress — is a watch loop on specific keys in etcd, plus logic to make reality match those keys.

The consequence of this insight:

  • You extend Kubernetes by writing controllers, not by modifying Kubernetes. The Operator pattern works because you’re just adding another watch loop.
  • Debugging is always about finding the gap between what etcd says and what the real world looks like — and which controller is failing to close it.
  • Performance problems trace to etcd or the API server, not pod scheduling logic, because everything upstream depends on the health of that database.
  • The “eventual” in eventual consistency is real. Between when you write desired state and when reality matches it, there is a window. For most workloads, milliseconds. Under load or during failures, seconds or minutes. Systems that assume instant consistency — health checks, startup probes, dependent deployments — break in this window.
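The "database with actuators" view can also be sketched. Here a plain dict stands in for etcd, and a controller is just a read-compare-actuate loop on one key (all names are illustrative):

```python
class Store:
    """A toy stand-in for etcd: the single source of truth."""
    def __init__(self):
        self._data = {}
    def put(self, key, value):
        self._data[key] = value
    def get(self, key):
        return self._data.get(key)

class ReplicaController:
    """A watch loop on one key, plus logic to make reality match it."""
    def __init__(self, store):
        self.store = store
        self.world = []             # the "real" pods this controller manages
    def sync(self):
        desired = self.store.get("deployments/web/replicas") or 0
        while len(self.world) < desired:
            self.world.append("pod")  # actuate: create
        while len(self.world) > desired:
            self.world.pop()          # actuate: delete

store = Store()
ctrl = ReplicaController(store)

store.put("deployments/web/replicas", 3)  # kubectl apply == a validated write
ctrl.sync()                               # the controller closes the gap
print(len(ctrl.world))                    # 3

store.put("deployments/web/replicas", 1)  # scaling down is just another write
ctrl.sync()
print(len(ctrl.world))                    # 1
```

Between the `put` and the `sync` is exactly the eventual-consistency window the last bullet describes: the database already says one thing while the world still says another.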

You don’t just use Kubernetes differently once you see this. You design differently. You write operators instead of scripts. You treat manifests as facts to assert, not instructions to execute. You design for reconciliation loops, not deployment pipelines.

That’s the gap between the person who runs kubectl apply and the person who understands what’s happening when they do.
