
Lessons From 5 Years of Kubernetes in Production – Cluster Crashes, Ditching Self-Managed, Cost Cuts, and the Tooling That Actually Works

K8s · AWS

Five years of Kubernetes in production. Two cluster crashes that took down everything. A migration from self-managed kops to EKS that should have happened sooner. An observability stack we’ve rebuilt three times. Helm charts rewritten more times than I care to admit.

This is what I’ve learned running Kubernetes across fintech, protocol infrastructure, and IoT – the failures, the wins, and the things I’d do differently.

The Journey: Self-Managed to EKS

Why We Started Self-Managed

Early on, EKS wasn’t mature. Or we didn’t trust it. Or we thought running our own control plane would give us “more control.” All of these were wrong, but hindsight is 20/20.

We ran kops on AWS. It worked. Until it didn’t.

The appeal was understandable: full control over the control plane, etcd configuration, API server flags, certificate management. The reality was constant maintenance, upgrade anxiety, and a bus factor of one (me, usually at 2am).

The First Cluster Crash: Certificate Expiry

Our first major outage happened because certificates expired.

The Root CA certificate, the etcd certificate, and the API server certificate all had the same expiry date. When they expired, the cluster didn’t gracefully degrade – it stopped. Completely. No API server means no kubectl. No etcd means no state. No state means you’re rebuilding everything from scratch.

What went wrong:

  1. kops generated certificates with a default expiry we didn’t verify
  2. No monitoring on certificate expiration dates
  3. No runbook for certificate rotation
  4. Backup strategy was “we have the manifests in git” – which helps, but doesn’t help when you can’t apply them

The recovery: Rebuild the entire cluster. Redeploy everything. Restore databases from RDS snapshots (thank god for managed databases). Two days of downtime. Career-questioning levels of stress.

Lesson: If you’re running self-managed Kubernetes, certificate expiry monitoring is non-negotiable. Better yet, don’t run self-managed Kubernetes.
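That monitoring can be as simple as one Prometheus alert. A sketch, assuming you probe the API server endpoint with the blackbox exporter (which exposes `probe_ssl_earliest_cert_expiry`); the 21-day threshold is illustrative:

```yaml
groups:
- name: certificate-expiry
  rules:
  - alert: CertificateExpiringSoon
    # blackbox exporter reports the earliest expiry timestamp in the probed chain
    expr: probe_ssl_earliest_cert_expiry - time() < 21 * 24 * 3600
    for: 1h
    labels:
      severity: critical
    annotations:
      summary: "TLS certificate on {{ $labels.instance }} expires in under 21 days"
```

Probe every endpoint that terminates TLS - API server, etcd peers, webhooks - not just the public ones.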

The Second Cluster Crash: The Same Bug, One Year Later

You’d think we learned our lesson. We did – partially.

When we rebuilt the cluster after crash #1, we set certificate expiry to two years. Plenty of margin, right?

Except the version of kops we used had a bug. It didn’t respect the expiry configuration for etcd certificates – it defaulted to one year regardless of what you specified.

Exactly one year after the first crash, the etcd certificate expired. Another cluster crash. Another weekend from hell.

This one was easier to recover from – we didn’t have to rebuild everything – but it reinforced what I already suspected: self-managed Kubernetes is a full-time job, and we didn’t have a full-time job’s worth of people to do it.

The Migration to EKS

When EKS matured, we migrated. Should have done it sooner.

The evaluation was straightforward:

| Factor | kops | EKS |
| --- | --- | --- |
| Control plane management | Us | AWS |
| Certificate rotation | Us | AWS |
| Etcd backups | Us | AWS |
| API server availability | Us | AWS |
| Upgrade anxiety | High | Medium |
| Cost | EC2 for masters | $0.10/hour/cluster |

The EKS control plane cost is trivial compared to engineer time. The upgrade process is still work, but it’s documented, supported, and doesn’t require understanding etcd internals.

Migration approach:

  1. Stood up EKS cluster in parallel
  2. Migrated workloads namespace by namespace
  3. Used external-dns to manage DNS cutover
  4. Kept old cluster running for two weeks as fallback
  5. Decommissioned kops cluster

The whole migration took three weeks. Should have done it two years earlier.
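The DNS cutover leaned on external-dns watching Service annotations in both clusters. Roughly, per service (hostname and ports illustrative):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: api
  annotations:
    # external-dns creates and updates the record for this hostname;
    # shifting the annotation between clusters shifts the traffic
    external-dns.alpha.kubernetes.io/hostname: api.example.com
spec:
  type: LoadBalancer
  selector:
    app: api
  ports:
  - port: 443
    targetPort: 8443
```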

Cutting Cluster Costs

The Problem: We Were Haemorrhaging Money

Kubernetes makes it easy to waste compute. Developers request “safe” resource limits, pods get scheduled, nodes get added, nobody looks at actual utilisation.

We were running at 25% CPU utilisation across the cluster. That’s 75% waste.

Karpenter Changed Everything

Cluster Autoscaler was fine. Karpenter is better.

Before Karpenter:

  • Predefined node groups (t3.large, t3.xlarge, etc.)
  • Cluster Autoscaler picks from node groups
  • Over-provisioned because node groups don’t match workload shapes
  • Spot instance handling was bolted-on and flaky

After Karpenter:

  • Karpenter provisions exactly the instance type your pods need
  • Right-sizes nodes dynamically
  • Native spot instance support with automatic fallback
  • Consolidation actually consolidates (removes underutilised nodes)

Results: 40% cost reduction. Same workloads. Same reliability.

The configuration is more complex – NodePools, EC2NodeClasses, weight-based selection – but the payoff is substantial.

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
  limits:
    cpu: 1000
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m

Spot Instances: 70% Savings, Actually Reliable

The conventional wisdom is “spot instances are unreliable.” The reality is more nuanced.

Workloads that work well on spot:

  • Stateless services with multiple replicas
  • Batch jobs that can be retried
  • Development/staging environments
  • CI/CD runners

Workloads that don’t:

  • Databases (obviously)
  • Single-replica services (don’t do this anyway)
  • Long-running jobs that can’t checkpoint

Our spot strategy:

  • Multiple instance types (c5, c6i, m5, m6i) – diversification reduces interruption risk
  • Multiple availability zones
  • Karpenter handles fallback to on-demand automatically
  • Pod Disruption Budgets ensure graceful draining
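The Pod Disruption Budget piece from that list can be sketched as follows, assuming a stateless service named `api` with at least three replicas:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  minAvailable: 2        # never drain below two ready pods
  selector:
    matchLabels:
      app: api
```

When Karpenter consolidates or a spot node is reclaimed, eviction respects this budget, so the drain is graceful rather than simultaneous.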

We run 80% of compute on spot. Interruption rate is under 5%. The 70% cost savings are real.

Right-Sizing: The Boring Work That Matters

Karpenter optimises node selection. You still need to optimise pod requests.

The pattern we see constantly:

resources:
  requests:
    cpu: "1"
    memory: "2Gi"
  limits:
    cpu: "2"
    memory: "4Gi"

Actual usage: 100m CPU, 256Mi memory.
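A right-sized version of that block, based on the observed usage plus headroom (values illustrative; we tend to drop the CPU limit entirely to avoid throttling):

```yaml
resources:
  requests:
    cpu: "100m"       # observed steady-state usage
    memory: "256Mi"
  limits:
    memory: "512Mi"   # headroom above observed usage; no CPU limit set
```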

Tools that help:

  • Prometheus + Grafana dashboards showing request vs actual usage
  • Vertical Pod Autoscaler (VPA) in recommend mode
  • Goldilocks (VPA recommendations visualised)

The process:

  1. Deploy VPA in recommend mode
  2. Let it observe for a week
  3. Review recommendations
  4. Update requests/limits
  5. Repeat quarterly

It’s boring. It saves 30%+ on compute.

Autoscaling: HPA, VPA, and KEDA

HPA: The Baseline

Horizontal Pod Autoscaler based on CPU/memory is table stakes. If you’re not using HPA, you’re either over-provisioned or under-provisioned.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 3
  maxReplicas: 50
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70

KEDA: Event-Driven Scaling

HPA scales on CPU. What if you want to scale on SQS queue depth? Kafka lag? Prometheus metrics?

KEDA (Kubernetes Event-driven Autoscaling) fills this gap.

Use cases we run with KEDA:

  • Scale workers based on SQS queue depth
  • Scale API pods based on requests per second (Prometheus metric)
  • Scale to zero for batch processors when queues are empty

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: worker
spec:
  scaleTargetRef:
    name: worker
  minReplicaCount: 0
  maxReplicaCount: 100
  triggers:
  - type: aws-sqs-queue
    metadata:
      queueURL: https://sqs.eu-west-1.amazonaws.com/123456789/work-queue
      queueLength: "5"
      awsRegion: eu-west-1

Scale to zero is the killer feature. Workers that only exist when there’s work to do. Massive cost savings for bursty workloads.

VPA: Use Recommend Mode Only

Vertical Pod Autoscaler can automatically adjust resource requests. In theory.

In practice, VPA in “Auto” mode restarts pods to apply changes. For stateless services with fast startup, this is fine. For anything else, it’s disruptive.

Our approach: VPA in recommend mode only. It observes and suggests. Humans review and apply. No surprise restarts.
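A minimal recommend-only VPA looks like this (deployment name illustrative):

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  updatePolicy:
    updateMode: "Off"   # recommendations only; nothing is applied automatically
```

Recommendations then show up in the VPA status (and in Goldilocks), ready for human review.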

Observability: The Stack We Rebuilt Three Times

Attempt 1: Loggly + FluentD

Started here because it was easy. FluentD as a DaemonSet, ship logs to Loggly.

Why we left: Cost scaled linearly with log volume. When you’re logging at scale, SaaS log aggregators become your biggest bill.

Attempt 2: ELK Stack (Self-Hosted)

Elasticsearch, Logstash (later Fluentd), Kibana. Self-hosted on Kubernetes.

The good: Cost predictable. Powerful querying. Kibana is genuinely good.

The bad: Elasticsearch is operationally complex. JVM tuning. Shard management. Index lifecycle policies. Cluster health that goes red at 2am.

Why we left: Operational overhead was significant. Elasticsearch expertise became a requirement for the team.

Attempt 3: Prometheus + Grafana + Loki

Where we landed. Where we’re staying.

Prometheus for metrics:

  • ServiceMonitors for autodiscovery
  • Prometheus Operator for lifecycle management
  • Thanos for long-term storage and multi-cluster aggregation

Loki for logs:

  • LogQL is similar enough to PromQL that the learning curve is minimal
  • Doesn’t index log content (just labels) – dramatically cheaper to operate than Elasticsearch
  • Pairs naturally with Grafana

Grafana for dashboards:

  • Unified view of metrics and logs
  • Alerting that works
  • Community dashboards for common services

The stack:

┌─────────────────┐     ┌─────────────────┐
│   Prometheus    │     │      Loki       │
│   (metrics)     │     │     (logs)      │
└────────┬────────┘     └────────┬────────┘
         │                       │
         └───────────┬───────────┘
                     │
              ┌──────┴──────┐
              │   Grafana   │
              │ (dashboards)│
              └─────────────┘

Cost comparison:

| Stack | Monthly cost (our scale) |
| --- | --- |
| Datadog | £15,000+ |
| Self-hosted ELK | £3,000 (compute) + ops time |
| Prometheus/Loki/Grafana | £2,000 (compute) + less ops time |

The Prometheus stack isn’t free – you’re running databases – but the operational model is simpler than Elasticsearch and the cost is dramatically lower than Datadog.

OpenTelemetry: Do It From Day One

We instrumented applications with vendor-specific SDKs. Now we’re stuck migrating.

OpenTelemetry provides vendor-neutral instrumentation. Switch backends without touching application code.

For new services: OpenTelemetry SDK from day one. For existing services: gradual migration. The tracing is production-ready. Metrics are catching up.
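The vendor-neutral part lives in the OpenTelemetry Collector: applications emit OTLP, and the backend is a config change. A minimal Collector config sketch (backend endpoint illustrative):

```yaml
receivers:
  otlp:
    protocols:
      grpc: {}
      http: {}
processors:
  batch: {}
exporters:
  otlphttp:
    endpoint: http://otel-backend:4318   # swap backends here, not in application code
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp]
```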

Alerting: Slack Channels, Eventually

Alerting evolution:

  1. PagerDuty for everything – alert fatigue, ignored alerts
  2. Two-tier alerting – critical pages, non-critical emails
  3. Slack channels – alerts routed to team channels, acknowledged inline

The final state: critical alerts page on-call. Everything else goes to Slack channels with team ownership. Weekly review meetings to tune thresholds.
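That routing can be sketched in Alertmanager config (receiver names, channel, and keys illustrative):

```yaml
route:
  receiver: slack-team            # default: everything lands in the team channel
  routes:
  - matchers:
    - severity = "critical"
    receiver: pagerduty-oncall    # only critical alerts page
receivers:
- name: pagerduty-oncall
  pagerduty_configs:
  - routing_key: <pagerduty-integration-key>
- name: slack-team
  slack_configs:
  - channel: "#team-alerts"
    api_url: <slack-webhook-url>
```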

Deployments and Tooling

Helm: Rewritten More Times Than I’d Like

We’ve used Helm since v2 (Tiller days – dark times). The charts have been rewritten multiple times as our stack evolved.

Current state:

  • Shared library chart for common patterns
  • Service-specific charts that inherit from library
  • Values files per environment
  • Helm charts stored in ECR (OCI format)

Lesson: Invest in your Helm chart structure early. The cost of “we’ll clean it up later” is charts that nobody understands.
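The inheritance pattern looks roughly like this in a service chart's Chart.yaml (names and registry path illustrative):

```yaml
apiVersion: v2
name: api
version: 1.0.0
dependencies:
- name: common-library   # shared library chart providing common templates
  version: 0.5.0
  repository: oci://<aws-account-id>.dkr.ecr.eu-west-1.amazonaws.com/charts
```

The library chart itself declares `type: library` and exports named templates; service charts stay thin.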

GitOps with Flux

All deployments are GitOps. Flux watches git, applies changes, reports drift.

The good:

  • Single source of truth
  • Audit trail via git history
  • Self-healing (drift correction)

The investment:

  • Tooling to answer “where is my commit?”
  • Flux observability is weak out of the box
  • Debugging “why didn’t this deploy?” requires understanding Flux internals
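A typical Flux Kustomization, sketched (repo name and path illustrative):

```yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps
  namespace: flux-system
spec:
  interval: 5m              # how often drift is detected and corrected
  sourceRef:
    kind: GitRepository
    name: platform
  path: ./clusters/production
  prune: true               # resources removed from git are deleted from the cluster
```

Answering "where is my commit?" usually starts with `flux get kustomizations` and the revision each one reports.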

k9s: The Terminal UI You Need

If you’re running kubectl get pods repeatedly, stop. Use k9s.

Terminal UI for Kubernetes. Navigate resources, view logs, exec into pods, delete stuck resources. Faster than kubectl for interactive work.

Container Security: Don’t Overthink It

Start with the basics:

  1. Read-only root filesystem – prevents most runtime attacks
  2. Non-root user – principle of least privilege
  3. Drop all capabilities – add back only what’s needed
  4. Disable service account token auto-mount – most pods don’t need it
  5. Network policies – default deny, explicit allow

Tooling: We use AWS Inspector for container scanning. Catches obvious CVEs. Not trying to find the perfect tool – just something that runs automatically.

securityContext:
  runAsNonRoot: true
  runAsUser: 1000
  readOnlyRootFilesystem: true
  allowPrivilegeEscalation: false
  capabilities:
    drop:
      - ALL
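The network-policy item from the list above is a per-namespace default deny; explicit allows are layered on per service:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny
spec:
  podSelector: {}   # empty selector matches every pod in the namespace
  policyTypes:
  - Ingress
  - Egress
```

Remember an explicit egress allow for DNS, or nothing in the namespace resolves.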

The Honest Truths

Kubernetes Is Complex

You need dedicated engineers. Not everyone-does-a-bit-of-K8s. Actual dedicated time from people who understand the internals.

The workload varies – some weeks are quiet, others are cluster upgrades and incident response. But you can’t rotate Kubernetes responsibility across the team like you can code review. The learning curve is too steep for “jump in and out.”

What works: One or two “go-to” engineers who own the platform, with everyone trained on basic operations (deploy, debug, read logs).

The Things That Keep Breaking

After five years, the failure patterns are predictable:

  1. DNS – CoreDNS scaling, ndots:5 performance, DNS caching
  2. Resource limits – OOMKilled, CPU throttling, eviction
  3. Networking – CNI issues, NetworkPolicy conflicts, load balancer health checks
  4. Certificates – expiry, rotation, trust chains
  5. Storage – PVC binding, EBS attach limits, storage class misconfiguration

Build monitoring and runbooks for these. They will happen.
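For the ndots issue specifically, the usual mitigation is lowering `ndots` on pods that mostly resolve external names, so every lookup doesn't first walk the cluster search domains (trade-off: cluster-local short names then need their fully qualified form):

```yaml
spec:
  dnsConfig:
    options:
    - name: ndots
      value: "2"   # default of 5 causes several failed search-domain lookups per query
```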

Was It Worth It?

Yes.

The first two years were painful. Self-managed clusters, immature tooling, constant firefighting.

The last three years have been different. EKS removed the control plane burden. Karpenter solved cost optimisation. The Prometheus stack matured. GitOps made deployments predictable.

Kubernetes is complex, but it solves real problems:

  • Consistent deployment model across services
  • Self-healing that actually works
  • Scaling that responds to demand
  • Resource isolation between teams and services
  • Ecosystem of tools that integrate cleanly

The complexity is the cost of those benefits. Whether it’s worth it depends on your scale and team. For us, it was.

The Lessons, Summarised

Infrastructure:

  • Don’t run self-managed Kubernetes unless you have full-time SREs for it
  • EKS control plane cost is trivial compared to operational burden
  • Karpenter > Cluster Autoscaler, no contest
  • Spot instances work for most workloads with proper diversification

Cost:

  • Right-size pods (VPA recommend mode)
  • Right-size nodes (Karpenter)
  • Scale to zero where possible (KEDA)
  • Review costs monthly, not quarterly

Observability:

  • Prometheus + Loki + Grafana is the sweet spot for most teams
  • OpenTelemetry from day one for new services
  • Two-tier alerting: pages for critical, Slack for everything else

Operations:

  • GitOps or regret it
  • Invest in Helm chart structure early
  • k9s for interactive work
  • Dedicated Kubernetes engineers, not shared responsibility

Security:

  • Read-only root filesystem, non-root user, drop capabilities
  • Network policies with default deny
  • Container scanning in CI – any tool is better than no tool

The meta-lesson: Kubernetes rewards investment in automation and tooling. Every hour spent on operational improvements pays dividends. Every shortcut creates debt that compounds.


More war stories from production at CoderCo. Connect on LinkedIn for infrastructure patterns and debugging deep-dives.
