
Lessons From 5 Years of Kubernetes in Production – Cluster Crashes, Ditching Self-Managed, Cost Cuts, and the Tooling That Actually Works

K8s · AWS

Five years of Kubernetes in production. Two cluster crashes that took down everything. A migration from self-managed kops to EKS that should have happened sooner. An observability stack we’ve rebuilt three times. Helm charts rewritten more times than I care to admit.

This is what I’ve learned running Kubernetes across fintech, protocol infrastructure, and IoT – the failures, the wins, and the things I’d do differently.

The Journey: Self-Managed to EKS

Why We Started Self-Managed

Early on, EKS wasn’t mature. Or we didn’t trust it. Or we thought running our own control plane would give us “more control.” All of these were wrong, but hindsight is 20/20.

We ran kops on AWS. It worked. Until it didn’t.

The appeal was understandable: full control over the control plane, etcd configuration, API server flags, certificate management. The reality was constant maintenance, upgrade anxiety, and a bus factor of one (me, usually at 2am).

The First Cluster Crash: Certificate Expiry

Our first major outage happened because certificates expired.

The Root CA certificate, the etcd certificate, and the API server certificate all had the same expiry date. When they expired, the cluster didn’t gracefully degrade – it stopped. Completely. No API server means no kubectl. No etcd means no state. No state means you’re rebuilding everything from scratch.

What went wrong:

  1. kops generated certificates with a default expiry we didn’t verify
  2. No monitoring on certificate expiration dates
  3. No runbook for certificate rotation
  4. Backup strategy was “we have the manifests in git” – which helps, but doesn’t help when you can’t apply them

The recovery: Rebuild the entire cluster. Redeploy everything. Restore databases from RDS snapshots (thank god for managed databases). Two days of downtime. Career-questioning levels of stress.

Lesson: If you’re running self-managed Kubernetes, certificate expiry monitoring is non-negotiable. Better yet, don’t run self-managed Kubernetes.
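That monitoring can be as simple as one Prometheus alert. A sketch, assuming you probe the API server endpoint with the blackbox exporter (which exposes `probe_ssl_earliest_cert_expiry`); the 21-day threshold is illustrative:

```yaml
groups:
- name: certificate-expiry
  rules:
  - alert: CertificateExpiringSoon
    # blackbox exporter reports the earliest expiry timestamp in the probed chain
    expr: probe_ssl_earliest_cert_expiry - time() < 21 * 24 * 3600
    for: 1h
    labels:
      severity: critical
    annotations:
      summary: "TLS certificate on {{ $labels.instance }} expires in under 21 days"
```

Probe every endpoint that terminates TLS - API server, etcd peers, webhooks - not just the public ones.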

The Second Cluster Crash: The Same Bug, One Year Later

You’d think we learned our lesson. We did – partially.

When we rebuilt the cluster after crash #1, we set certificate expiry to two years. Plenty of margin, right?

Except the version of kops we used had a bug. It didn’t respect the expiry configuration for etcd certificates – it defaulted to one year regardless of what you specified.

Exactly one year after the first crash, the etcd certificate expired. Another cluster crash. Another weekend from hell.

This one was easier to recover from – we didn’t have to rebuild everything – but it reinforced what I already suspected: self-managed Kubernetes is a full-time job, and we didn’t have a full-time job’s worth of people to do it.

The Migration to EKS

When EKS matured, we migrated. Should have done it sooner.

The evaluation was straightforward:

| Factor | kops | EKS |
| --- | --- | --- |
| Control plane management | Us | AWS |
| Certificate rotation | Us | AWS |
| Etcd backups | Us | AWS |
| API server availability | Us | AWS |
| Upgrade anxiety | High | Medium |
| Cost | EC2 for masters | $0.10/hour/cluster |

The EKS control plane cost is trivial compared to engineer time. The upgrade process is still work, but it’s documented, supported, and doesn’t require understanding etcd internals.

Migration approach:

  1. Stood up EKS cluster in parallel
  2. Migrated workloads namespace by namespace
  3. Used external-dns to manage DNS cutover
  4. Kept old cluster running for two weeks as fallback
  5. Decommissioned kops cluster

The whole migration took three weeks. Should have done it two years earlier.
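The DNS cutover leaned on external-dns watching Service annotations in both clusters. Roughly, per service (hostname and ports illustrative):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: api
  annotations:
    # external-dns creates and updates the record for this hostname;
    # shifting the annotation between clusters shifts the traffic
    external-dns.alpha.kubernetes.io/hostname: api.example.com
spec:
  type: LoadBalancer
  selector:
    app: api
  ports:
  - port: 443
    targetPort: 8443
```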

Cutting Cluster Costs

The Problem: We Were Haemorrhaging Money

Kubernetes makes it easy to waste compute. Developers request “safe” resource limits, pods get scheduled, nodes get added, nobody looks at actual utilisation.

We were running at 25% CPU utilisation across the cluster. That’s 75% waste.

Karpenter Changed Everything

Cluster Autoscaler was fine. Karpenter is better.

Before Karpenter:

  • Predefined node groups (t3.large, t3.xlarge, etc.)
  • Cluster Autoscaler picks from node groups
  • Over-provisioned because node groups don’t match workload shapes
  • Spot instance handling was bolted-on and flaky

After Karpenter:

  • Karpenter provisions exactly the instance type your pods need
  • Right-sizes nodes dynamically
  • Native spot instance support with automatic fallback
  • Consolidation actually consolidates (removes underutilised nodes)

Results: 40% cost reduction. Same workloads. Same reliability.

The configuration is more complex – NodePools, EC2NodeClasses, weight-based selection – but the payoff is substantial.

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
  limits:
    cpu: 1000
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m

Spot Instances: 70% Savings, Actually Reliable

The conventional wisdom is “spot instances are unreliable.” The reality is more nuanced.

Workloads that work well on spot:

  • Stateless services with multiple replicas
  • Batch jobs that can be retried
  • Development/staging environments
  • CI/CD runners

Workloads that don’t:

  • Databases (obviously)
  • Single-replica services (don’t do this anyway)
  • Long-running jobs that can’t checkpoint

Our spot strategy:

  • Multiple instance types (c5, c6i, m5, m6i) – diversification reduces interruption risk
  • Multiple availability zones
  • Karpenter handles fallback to on-demand automatically
  • Pod Disruption Budgets ensure graceful draining
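The Pod Disruption Budget piece from that list can be sketched as follows, assuming a stateless service named `api` with at least three replicas:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  minAvailable: 2        # never drain below two ready pods
  selector:
    matchLabels:
      app: api
```

When Karpenter consolidates or a spot node is reclaimed, eviction respects this budget, so the drain is graceful rather than simultaneous.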

We run 80% of compute on spot. Interruption rate is under 5%. The 70% cost savings are real.

Right-Sizing: The Boring Work That Matters

Karpenter optimises node selection. You still need to optimise pod requests.

The pattern we see constantly:

resources:
  requests:
    cpu: "1"
    memory: "2Gi"
  limits:
    cpu: "2"
    memory: "4Gi"

Actual usage: 100m CPU, 256Mi memory.
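A right-sized version of that block, based on the observed usage plus headroom (values illustrative; we tend to drop the CPU limit entirely to avoid throttling):

```yaml
resources:
  requests:
    cpu: "100m"       # observed steady-state usage
    memory: "256Mi"
  limits:
    memory: "512Mi"   # headroom above observed usage; no CPU limit set
```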

Tools that help:

  • Prometheus + Grafana dashboards showing request vs actual usage
  • Vertical Pod Autoscaler (VPA) in recommend mode
  • Goldilocks (VPA recommendations visualised)

The process:

  1. Deploy VPA in recommend mode
  2. Let it observe for a week
  3. Review recommendations
  4. Update requests/limits
  5. Repeat quarterly

It’s boring. It saves 30%+ on compute.

Autoscaling: HPA, VPA, and KEDA

HPA: The Baseline

Horizontal Pod Autoscaler based on CPU/memory is table stakes. If you’re not using HPA, you’re either over-provisioned or under-provisioned.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 3
  maxReplicas: 50
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70

KEDA: Event-Driven Scaling

HPA scales on CPU. What if you want to scale on SQS queue depth? Kafka lag? Prometheus metrics?

KEDA (Kubernetes Event-driven Autoscaling) fills this gap.

Use cases we run with KEDA:

  • Scale workers based on SQS queue depth
  • Scale API pods based on requests per second (Prometheus metric)
  • Scale to zero for batch processors when queues are empty

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: worker
spec:
  scaleTargetRef:
    name: worker
  minReplicaCount: 0
  maxReplicaCount: 100
  triggers:
  - type: aws-sqs-queue
    metadata:
      queueURL: https://sqs.eu-west-1.amazonaws.com/123456789/work-queue
      queueLength: "5"
      awsRegion: eu-west-1

Scale to zero is the killer feature. Workers that only exist when there’s work to do. Massive cost savings for bursty workloads.

VPA: Use Recommend Mode Only

Vertical Pod Autoscaler can automatically adjust resource requests. In theory.

In practice, VPA in “Auto” mode restarts pods to apply changes. For stateless services with fast startup, this is fine. For anything else, it’s disruptive.

Our approach: VPA in recommend mode only. It observes and suggests. Humans review and apply. No surprise restarts.
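A minimal recommend-only VPA looks like this (deployment name illustrative):

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  updatePolicy:
    updateMode: "Off"   # recommendations only; nothing is applied automatically
```

Recommendations then show up in the VPA status (and in Goldilocks), ready for human review.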

Observability: The Stack We Rebuilt Three Times

Attempt 1: Loggly + FluentD

Started here because it was easy. FluentD as a DaemonSet, ship logs to Loggly.

Why we left: Cost scaled linearly with log volume. When you’re logging at scale, SaaS log aggregators become your biggest bill.

Attempt 2: ELK Stack (Self-Hosted)

Elasticsearch, Logstash (later Fluentd), Kibana. Self-hosted on Kubernetes.

The good: Cost predictable. Powerful querying. Kibana is genuinely good.

The bad: Elasticsearch is operationally complex. JVM tuning. Shard management. Index lifecycle policies. Cluster health that goes red at 2am.

Why we left: Operational overhead was significant. Elasticsearch expertise became a requirement for the team.

Attempt 3: Prometheus + Grafana + Loki

Where we landed. Where we’re staying.

Prometheus for metrics:

  • ServiceMonitors for autodiscovery
  • Prometheus Operator for lifecycle management
  • Thanos for long-term storage and multi-cluster aggregation

Loki for logs:

  • LogQL is similar enough to PromQL that the learning curve is minimal
  • Doesn’t index log content (just labels) – dramatically cheaper to operate than Elasticsearch
  • Pairs naturally with Grafana

Grafana for dashboards:

  • Unified view of metrics and logs
  • Alerting that works
  • Community dashboards for common services

The stack:

┌─────────────────┐     ┌─────────────────┐
│   Prometheus    │     │      Loki       │
│   (metrics)     │     │     (logs)      │
└────────┬────────┘     └────────┬────────┘
         │                       │
         └───────────┬───────────┘
                     │
              ┌──────┴──────┐
              │   Grafana   │
              │ (dashboards)│
              └─────────────┘

Cost comparison:

| Stack | Monthly cost (our scale) |
| --- | --- |
| Datadog | £15,000+ |
| Self-hosted ELK | £3,000 (compute) + ops time |
| Prometheus/Loki/Grafana | £2,000 (compute) + less ops time |

The Prometheus stack isn’t free – you’re running databases – but the operational model is simpler than Elasticsearch and the cost is dramatically lower than Datadog.

OpenTelemetry: Do It From Day One

We instrumented applications with vendor-specific SDKs. Now we’re stuck migrating.

OpenTelemetry provides vendor-neutral instrumentation. Switch backends without touching application code.

For new services: OpenTelemetry SDK from day one. For existing services: gradual migration. The tracing is production-ready. Metrics are catching up.
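The vendor-neutral part lives in the OpenTelemetry Collector: applications emit OTLP, and the backend is a config change. A minimal Collector config sketch (backend endpoint illustrative):

```yaml
receivers:
  otlp:
    protocols:
      grpc: {}
      http: {}
processors:
  batch: {}
exporters:
  otlphttp:
    endpoint: http://otel-backend:4318   # swap backends here, not in application code
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp]
```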

Alerting: Slack Channels, Eventually

Alerting evolution:

  1. PagerDuty for everything – alert fatigue, ignored alerts
  2. Two-tier alerting – critical pages, non-critical emails
  3. Slack channels – alerts routed to team channels, acknowledged inline

The final state: critical alerts page on-call. Everything else goes to Slack channels with team ownership. Weekly review meetings to tune thresholds.
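That routing can be sketched in Alertmanager config (receiver names, channel, and keys illustrative):

```yaml
route:
  receiver: slack-team            # default: everything lands in the team channel
  routes:
  - matchers:
    - severity = "critical"
    receiver: pagerduty-oncall    # only critical alerts page
receivers:
- name: pagerduty-oncall
  pagerduty_configs:
  - routing_key: <pagerduty-integration-key>
- name: slack-team
  slack_configs:
  - channel: "#team-alerts"
    api_url: <slack-webhook-url>
```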

Deployments and Tooling

Helm: Rewritten More Times Than I’d Like

We’ve used Helm since v2 (Tiller days – dark times). The charts have been rewritten multiple times as our stack evolved.

Current state:

  • Shared library chart for common patterns
  • Service-specific charts that inherit from library
  • Values files per environment
  • Helm charts stored in ECR (OCI format)

Lesson: Invest in your Helm chart structure early. The cost of “we’ll clean it up later” is charts that nobody understands.
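The inheritance pattern looks roughly like this in a service chart's Chart.yaml (names and registry path illustrative):

```yaml
apiVersion: v2
name: api
version: 1.0.0
dependencies:
- name: common-library   # shared library chart providing common templates
  version: 0.5.0
  repository: oci://<aws-account-id>.dkr.ecr.eu-west-1.amazonaws.com/charts
```

The library chart itself declares `type: library` and exports named templates; service charts stay thin.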

GitOps with Flux

All deployments are GitOps. Flux watches git, applies changes, reports drift.

The good:

  • Single source of truth
  • Audit trail via git history
  • Self-healing (drift correction)

The investment:

  • Tooling to answer “where is my commit?”
  • Flux observability is weak out of the box
  • Debugging “why didn’t this deploy?” requires understanding Flux internals
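A typical Flux Kustomization, sketched (repo name and path illustrative):

```yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps
  namespace: flux-system
spec:
  interval: 5m              # how often drift is detected and corrected
  sourceRef:
    kind: GitRepository
    name: platform
  path: ./clusters/production
  prune: true               # resources removed from git are deleted from the cluster
```

Answering "where is my commit?" usually starts with `flux get kustomizations` and the revision each one reports.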

k9s: The Terminal UI You Need

If you’re running kubectl get pods repeatedly, stop. Use k9s.

Terminal UI for Kubernetes. Navigate resources, view logs, exec into pods, delete stuck resources. Faster than kubectl for interactive work.

Container Security: Don’t Overthink It

Start with the basics:

  1. Read-only root filesystem – prevents most runtime attacks
  2. Non-root user – principle of least privilege
  3. Drop all capabilities – add back only what’s needed
  4. Disable service account token auto-mount – most pods don’t need it
  5. Network policies – default deny, explicit allow

Tooling: We use AWS Inspector for container scanning. Catches obvious CVEs. Not trying to find the perfect tool – just something that runs automatically.

securityContext:
  runAsNonRoot: true
  runAsUser: 1000
  readOnlyRootFilesystem: true
  allowPrivilegeEscalation: false
  capabilities:
    drop:
      - ALL
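The network-policy item from the list above is a per-namespace default deny; explicit allows are layered on per service:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny
spec:
  podSelector: {}   # empty selector matches every pod in the namespace
  policyTypes:
  - Ingress
  - Egress
```

Remember an explicit egress allow for DNS, or nothing in the namespace resolves.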

The Honest Truths

Kubernetes Is Complex

You need dedicated engineers. Not everyone-does-a-bit-of-K8s. Actual dedicated time from people who understand the internals.

The workload varies – some weeks are quiet, others are cluster upgrades and incident response. But you can’t rotate Kubernetes responsibility across the team like you can code review. The learning curve is too steep for “jump in and out.”

What works: One or two “go-to” engineers who own the platform, with everyone trained on basic operations (deploy, debug, read logs).

The Things That Keep Breaking

After five years, the failure patterns are predictable:

  1. DNS – CoreDNS scaling, ndots:5 performance, DNS caching
  2. Resource limits – OOMKilled, CPU throttling, eviction
  3. Networking – CNI issues, NetworkPolicy conflicts, load balancer health checks
  4. Certificates – expiry, rotation, trust chains
  5. Storage – PVC binding, EBS attach limits, storage class misconfiguration

Build monitoring and runbooks for these. They will happen.
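For the ndots issue specifically, the usual mitigation is lowering `ndots` on pods that mostly resolve external names, so every lookup doesn't first walk the cluster search domains (trade-off: cluster-local short names then need their fully qualified form):

```yaml
spec:
  dnsConfig:
    options:
    - name: ndots
      value: "2"   # default of 5 causes several failed search-domain lookups per query
```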

Was It Worth It?

Yes.

The first two years were painful. Self-managed clusters, immature tooling, constant firefighting.

The last three years have been different. EKS removed the control plane burden. Karpenter solved cost optimisation. The Prometheus stack matured. GitOps made deployments predictable.

Kubernetes is complex, but it solves real problems:

  • Consistent deployment model across services
  • Self-healing that actually works
  • Scaling that responds to demand
  • Resource isolation between teams and services
  • Ecosystem of tools that integrate cleanly

The complexity is the cost of those benefits. Whether it’s worth it depends on your scale and team. For us, it was.

The Lessons, Summarised

Infrastructure:

  • Don’t run self-managed Kubernetes unless you have full-time SREs for it
  • EKS control plane cost is trivial compared to operational burden
  • Karpenter > Cluster Autoscaler, no contest
  • Spot instances work for most workloads with proper diversification

Cost:

  • Right-size pods (VPA recommend mode)
  • Right-size nodes (Karpenter)
  • Scale to zero where possible (KEDA)
  • Review costs monthly, not quarterly

Observability:

  • Prometheus + Loki + Grafana is the sweet spot for most teams
  • OpenTelemetry from day one for new services
  • Two-tier alerting: pages for critical, Slack for everything else

Operations:

  • GitOps or regret it
  • Invest in Helm chart structure early
  • k9s for interactive work
  • Dedicated Kubernetes engineers, not shared responsibility

Security:

  • Read-only root filesystem, non-root user, drop capabilities
  • Network policies with default deny
  • Container scanning in CI – any tool is better than no tool

The meta-lesson: Kubernetes rewards investment in automation and tooling. Every hour spent on operational improvements pays dividends. Every shortcut creates debt that compounds.


More war stories from production at CoderCo. Connect on LinkedIn for infrastructure patterns and debugging deep-dives.
