Cloud · 12 min read

Kubernetes at Scale: Lessons from Production Deployments

By Osman Kuzucu · Published on 2025-02-10

Kubernetes has become the default platform for container orchestration, but running it well in production is a fundamentally different challenge from getting a cluster up and running. The gap between a working demo and a production-grade deployment is where most teams encounter painful, expensive lessons. After managing Kubernetes clusters across multiple industries and scale profiles, certain patterns emerge consistently. These are not theoretical best practices — they are operational lessons learned through incident post-mortems, performance tuning sessions, and late-night debugging of mysterious pod evictions.

Resource Requests and Limits: Get Them Right or Pay the Price

The single most common production issue we encounter is misconfigured resource requests and limits. When CPU requests are set too low, the scheduler packs too many pods onto a node, leading to CPU throttling that manifests as mysterious latency spikes. When memory requests are set too high, you waste expensive compute on capacity that is reserved but never used. When requests and limits are absent entirely, a single runaway pod can OOM-kill its neighbors. The correct approach is to start with profiling: run your workloads under realistic load, observe actual CPU and memory consumption using metrics-server or Prometheus, then set requests to the P95 usage and limits to 1.5-2x that value. For JVM-based workloads, remember that container memory limits must account for heap, metaspace, thread stacks, and off-heap buffers — not just the -Xmx value.
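As a sketch of where that profiling data ends up, the fragment below shows a Deployment whose requests sit near observed P95 usage and whose limits leave roughly 2x headroom. The service name, image, and all numbers are placeholders, not recommendations; derive your own from measurement.

```yaml
# Hypothetical Deployment fragment: values should come from your own profiling,
# e.g. requests near observed P95 usage, limits at roughly 1.5-2x the requests.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments-api                 # placeholder name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: payments-api
  template:
    metadata:
      labels:
        app: payments-api
    spec:
      containers:
        - name: app
          image: registry.example.com/payments-api:1.4.2   # placeholder image
          resources:
            requests:
              cpu: "500m"            # ~P95 CPU observed under realistic load
              memory: "768Mi"        # ~P95 memory observed under realistic load
            limits:
              cpu: "1"               # roughly 2x the request
              memory: "1Gi"          # headroom for heap + metaspace + stacks + off-heap
          env:
            - name: JAVA_TOOL_OPTIONS
              value: "-Xmx512m -XX:MaxMetaspaceSize=128m"   # heap well below the container limit
```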

Pod Disruption Budgets and Graceful Rollouts

Pod Disruption Budgets (PDBs) are one of the most overlooked Kubernetes resources, yet they are critical for maintaining availability during node drains, cluster upgrades, and spot instance reclamation. A PDB specifies the minimum number of pods that must remain available (or the maximum number that can be unavailable) during voluntary disruptions. Without PDBs, a cluster autoscaler draining a node or a rolling node-pool upgrade can take down all replicas of a service simultaneously. We recommend that every production Deployment with more than one replica have a PDB with maxUnavailable set to 1 or minAvailable set to N-1. Combine this with proper readiness probes, preStop hooks with a sleep delay to allow load balancers to deregister, and rolling update strategies with maxSurge and maxUnavailable tuned for your traffic patterns.
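A minimal sketch of that combination for a hypothetical three-replica service follows. The sleep duration, probe endpoint, grace period, and surge settings are assumptions to be tuned against your load balancer's deregistration time and your traffic profile.

```yaml
# PodDisruptionBudget: at most one replica may be voluntarily disrupted at a time.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: payments-api-pdb
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: payments-api
---
# Abridged Deployment fragments: readiness probe, preStop sleep so load balancers
# can deregister the pod, and a conservative rolling update strategy.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments-api
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  selector:
    matchLabels:
      app: payments-api
  template:
    metadata:
      labels:
        app: payments-api
    spec:
      terminationGracePeriodSeconds: 45   # must exceed the preStop sleep plus shutdown time
      containers:
        - name: app
          image: registry.example.com/payments-api:1.4.2
          readinessProbe:
            httpGet:
              path: /healthz              # placeholder endpoint
              port: 8080
            periodSeconds: 5
          lifecycle:
            preStop:
              exec:
                command: ["sleep", "15"]  # give the load balancer time to stop sending traffic
```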

Horizontal Pod Autoscaling: Beyond CPU Metrics

Default HPA configurations based solely on CPU utilization are a starting point, not a solution. CPU-based scaling often reacts too slowly for bursty workloads and scales too aggressively for CPU-intensive but low-throughput services. Production-grade autoscaling requires custom metrics. For HTTP services, scale on request rate or request latency percentiles via the Prometheus adapter. For queue consumers, scale on queue depth. Set stabilization windows to prevent flapping — a 3-minute scale-down delay prevents the common pattern where HPA scales down, load spikes again, and HPA scales back up in a tight oscillation loop. Also consider KEDA (Kubernetes Event-Driven Autoscaling) for workloads that need to scale to zero or scale based on external event sources like Kafka topic lag or cloud queue depth.
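The sketch below shows what an autoscaling/v2 HPA with a per-pod request-rate metric and a scale-down stabilization window might look like, assuming a Prometheus adapter configured to expose a metric named http_requests_per_second. The metric name and target values are illustrative assumptions, not defaults.

```yaml
# Hypothetical HPA scaling on a custom request-rate metric exposed via the
# Prometheus adapter, with a 3-minute scale-down stabilization window.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: payments-api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payments-api
  minReplicas: 3
  maxReplicas: 30
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second   # assumed adapter-exposed metric
        target:
          type: AverageValue
          averageValue: "100"              # target ~100 req/s per pod (illustrative)
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 180      # 3-minute delay to prevent flapping
      policies:
        - type: Pods
          value: 2
          periodSeconds: 60                # remove at most 2 pods per minute
    scaleUp:
      stabilizationWindowSeconds: 0        # react quickly to bursts
```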

Observability: The Non-Negotiable Foundation

A production Kubernetes deployment demands a comprehensive observability stack built on three pillars:

  • Metrics — Prometheus with Grafana dashboards for cluster health, resource utilization, application-level SLIs, and alerting. Use recording rules to pre-compute expensive queries and keep dashboard load times fast (a minimal sketch follows this list).
  • Logs — Centralized logging with Loki, Elasticsearch, or a managed service. Ensure structured JSON logging from all services, include trace IDs in every log line, and set retention policies that balance cost with debugging needs.
  • Traces — Distributed tracing with OpenTelemetry, Jaeger, or Tempo. In a microservices environment, traces are the only way to understand request flow across service boundaries and identify where latency is introduced.
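To make the recording-rules point from the metrics bullet concrete, here is a minimal sketch using the Prometheus Operator's PrometheusRule resource. The label selector and the http_request_duration_seconds histogram are conventional assumptions; adapt them to your own operator configuration and metric names.

```yaml
# Recording-rule sketch: pre-computes per-service P95 latency so dashboards query
# the cheap recorded series instead of re-running histogram_quantile on every load.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: service-latency-recording-rules
  labels:
    release: kube-prometheus-stack          # assumed label your operator selects on
spec:
  groups:
    - name: service-latency.rules
      interval: 30s
      rules:
        - record: service:http_request_duration_seconds:p95_5m
          expr: |
            histogram_quantile(
              0.95,
              sum by (service, le) (
                rate(http_request_duration_seconds_bucket[5m])
              )
            )
```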

Running Kubernetes in production is a journey, not a destination. The platform evolves rapidly, and the patterns that work at 10 nodes may break at 100. The most successful teams we work with treat their Kubernetes platform as a product — with dedicated ownership, clear SLOs, regular reviews, and continuous investment. At OKINT Digital, we partner with engineering teams to design, deploy, and operate production Kubernetes environments that are resilient, observable, and cost-efficient at any scale.

kubernetes · container orchestration · devops · cloud infrastructure

Want to discuss these topics in depth?

Our engineering team is available for architecture reviews, technical assessments, and strategy sessions.

Schedule a consultation