Five years of Kubernetes in production. Two cluster crashes that took down everything. A migration from self-managed kops to EKS that should have happened sooner. An observability stack we’ve rebuilt three times. Helm charts rewritten more times than I care to admit.
This is what I’ve learned running Kubernetes across fintech, protocol infrastructure, and IoT – the failures, the wins, and the things I’d do differently.
The Journey: Self-Managed to EKS
Why We Started Self-Managed
Early on, EKS wasn’t mature. Or we didn’t trust it. Or we thought running our own control plane would give us “more control.” All of these were wrong, but hindsight is 20/20.
We ran kops on AWS. It worked. Until it didn’t.
The appeal was understandable: full control over the control plane, etcd configuration, API server flags, certificate management. The reality was constant maintenance, upgrade anxiety, and a bus factor of one (me, usually at 2am).
The First Cluster Crash: Certificate Expiry
Our first major outage happened because certificates expired.
The Root CA certificate, the etcd certificate, and the API server certificate all had the same expiry date. When they expired, the cluster didn’t gracefully degrade – it stopped. Completely. No API server means no kubectl. No etcd means no state. No state means you’re rebuilding everything from scratch.
What went wrong:
- kops generated certificates with a default expiry we didn’t verify
- No monitoring on certificate expiration dates
- No runbook for certificate rotation
- Backup strategy was “we have the manifests in git” – which helps, but doesn’t help when you can’t apply them
The recovery: Rebuild the entire cluster. Redeploy everything. Restore databases from RDS snapshots (thank god for managed databases). Two days of downtime. Career-questioning levels of stress.
Lesson: If you’re running self-managed Kubernetes, certificate expiry monitoring is non-negotiable. Better yet, don’t run self-managed Kubernetes.
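For anyone who has to stay self-managed a while longer, a Prometheus alert on expiry is cheap insurance. A sketch, assuming the blackbox exporter is probing the API server and etcd TLS endpoints (`probe_ssl_earliest_cert_expiry` is the metric its TLS probes expose):

```yaml
groups:
  - name: certificates
    rules:
      - alert: CertificateExpiringSoon
        # days until the earliest certificate in the probed chain expires
        expr: (probe_ssl_earliest_cert_expiry - time()) / 86400 < 30
        for: 1h
        labels:
          severity: critical
        annotations:
          summary: "TLS certificate for {{ $labels.instance }} expires in under 30 days"
```

Thirty days of warning turns a cluster-ending outage into a routine ticket.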
The Second Cluster Crash: The Same Bug, One Year Later
You’d think we learned our lesson. We did – partially.
When we rebuilt the cluster after crash #1, we set certificate expiry to two years. Plenty of margin, right?
Except the version of kops we used had a bug. It didn’t respect the expiry configuration for etcd certificates – it defaulted to one year regardless of what you specified.
Exactly one year after the first crash, the etcd certificate expired. Another cluster crash. Another weekend from hell.
This one was easier to recover from – we didn’t have to rebuild everything – but it reinforced what I already suspected: self-managed Kubernetes is a full-time job, and we didn’t have a full-time job’s worth of people to do it.
The Migration to EKS
When EKS matured, we migrated. Should have done it sooner.
The evaluation was straightforward:
| Factor | kops | EKS |
|---|---|---|
| Control plane management | Us | AWS |
| Certificate rotation | Us | AWS |
| Etcd backups | Us | AWS |
| API server availability | Us | AWS |
| Upgrade anxiety | High | Medium |
| Cost | EC2 for masters | $0.10/hour/cluster |
The EKS control plane cost is trivial compared to engineer time. The upgrade process is still work, but it’s documented, supported, and doesn’t require understanding etcd internals.
Migration approach:
- Stood up EKS cluster in parallel
- Migrated workloads namespace by namespace
- Used external-dns to manage DNS cutover
- Kept old cluster running for two weeks as fallback
- Decommissioned kops cluster
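The external-dns piece of the cutover is worth showing. A sketch with a hypothetical hostname: the same annotation on the Service in both clusters means flipping DNS is a matter of which cluster's external-dns owns the record (ownership can be scoped per cluster with external-dns's `--txt-owner-id` flag).

```yaml
apiVersion: v1
kind: Service
metadata:
  name: api
  annotations:
    # external-dns creates/updates the DNS record for this name
    external-dns.alpha.kubernetes.io/hostname: api.example.com
spec:
  type: LoadBalancer
  selector:
    app: api
  ports:
    - port: 443
      targetPort: 8443
```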
The whole migration took three weeks. Should have done it two years earlier.
Cutting Cluster Costs
The Problem: We Were Haemorrhaging Money
Kubernetes makes it easy to waste compute. Developers request “safe” resource limits, pods get scheduled, nodes get added, nobody looks at actual utilisation.
We were running at 25% CPU utilisation across the cluster. That’s 75% waste.
Karpenter Changed Everything
Cluster Autoscaler was fine. Karpenter is better.
Before Karpenter:
- Predefined node groups (t3.large, t3.xlarge, etc.)
- Cluster Autoscaler picks from node groups
- Over-provisioned because node groups don’t match workload shapes
- Spot instance handling was bolted-on and flaky
After Karpenter:
- Karpenter provisions exactly the instance type your pods need
- Right-sizes nodes dynamically
- Native spot instance support with automatic fallback
- Consolidation actually consolidates (removes underutilised nodes)
Results: 40% cost reduction. Same workloads. Same reliability.
The configuration is more complex – NodePools, EC2NodeClasses, weight-based selection – but the payoff is substantial.
```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
  limits:
    cpu: 1000
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m
```
Spot Instances: 70% Savings, Actually Reliable
The conventional wisdom is “spot instances are unreliable.” The reality is more nuanced.
Workloads that work well on spot:
- Stateless services with multiple replicas
- Batch jobs that can be retried
- Development/staging environments
- CI/CD runners
Workloads that don’t:
- Databases (obviously)
- Single-replica services (don’t do this anyway)
- Long-running jobs that can’t checkpoint
Our spot strategy:
- Multiple instance types (c5, c6i, m5, m6i) – diversification reduces interruption risk
- Multiple availability zones
- Karpenter handles fallback to on-demand automatically
- Pod Disruption Budgets ensure graceful draining
We run 80% of compute on spot. Interruption rate is under 5%. The 70% cost savings are real.
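The Pod Disruption Budget piece deserves a concrete example. A minimal sketch, assuming a stateless `api` Deployment with three or more replicas — node drains (Karpenter consolidation, spot interruption handling) respect this budget:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api
spec:
  minAvailable: 2          # never drain below two ready replicas
  selector:
    matchLabels:
      app: api
```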
Right-Sizing: The Boring Work That Matters
Karpenter optimises node selection. You still need to optimise pod requests.
The pattern we see constantly:
```yaml
resources:
  requests:
    cpu: "1"
    memory: "2Gi"
  limits:
    cpu: "2"
    memory: "4Gi"
```
Actual usage: 100m CPU, 256Mi memory.
Tools that help:
- Prometheus + Grafana dashboards showing request vs actual usage
- Vertical Pod Autoscaler (VPA) in recommend mode
- Goldilocks (VPA recommendations visualised)
The process:
- Deploy VPA in recommend mode
- Let it observe for a week
- Review recommendations
- Update requests/limits
- Repeat quarterly
It’s boring. It saves 30%+ on compute.
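Recommend mode in VPA terms means `updateMode: "Off"` — it computes recommendations without evicting anything. A minimal sketch for a hypothetical `api` Deployment:

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  updatePolicy:
    updateMode: "Off"   # recommendations only, no automatic pod restarts
```

Read the recommendations with `kubectl describe vpa api` and apply them by hand.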
Autoscaling: HPA, VPA, and KEDA
HPA: The Baseline
Horizontal Pod Autoscaler based on CPU/memory is table stakes. If you’re not using HPA, you’re either over-provisioned or under-provisioned.
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 3
  maxReplicas: 50
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```
KEDA: Event-Driven Scaling
HPA scales on CPU and memory out of the box. What if you want to scale on SQS queue depth? Kafka lag? Prometheus metrics?
KEDA (Kubernetes Event-driven Autoscaling) fills this gap.
Use cases we run with KEDA:
- Scale workers based on SQS queue depth
- Scale API pods based on requests per second (Prometheus metric)
- Scale to zero for batch processors when queues are empty
```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: worker
spec:
  scaleTargetRef:
    name: worker
  minReplicaCount: 0
  maxReplicaCount: 100
  triggers:
    - type: aws-sqs-queue
      metadata:
        queueURL: https://sqs.eu-west-1.amazonaws.com/123456789/work-queue
        queueLength: "5"
        awsRegion: eu-west-1
```
Scale to zero is the killer feature. Workers that only exist when there’s work to do. Massive cost savings for bursty workloads.
VPA: Use Recommend Mode Only
Vertical Pod Autoscaler can automatically adjust resource requests. In theory.
In practice, VPA in “Auto” mode restarts pods to apply changes. For stateless services with fast startup, this is fine. For anything else, it’s disruptive.
Our approach: VPA in recommend mode only. It observes and suggests. Humans review and apply. No surprise restarts.
Observability: The Stack We Rebuilt Three Times
Attempt 1: Loggly + Fluentd
Started here because it was easy. Fluentd as a DaemonSet, ship logs to Loggly.
Why we left: Cost scaled linearly with log volume. When you’re logging at scale, SaaS log aggregators become your biggest bill.
Attempt 2: ELK Stack (Self-Hosted)
Elasticsearch, Logstash (later Fluentd), Kibana. Self-hosted on Kubernetes.
The good: Cost predictable. Powerful querying. Kibana is genuinely good.
The bad: Elasticsearch is operationally complex. JVM tuning. Shard management. Index lifecycle policies. Cluster health that goes red at 2am.
Why we left: Operational overhead was significant. Elasticsearch expertise became a requirement for the team.
Attempt 3: Prometheus + Grafana + Loki
Where we landed. Where we’re staying.
Prometheus for metrics:
- ServiceMonitors for autodiscovery
- Prometheus Operator for lifecycle management
- Thanos for long-term storage and multi-cluster aggregation
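Autodiscovery via ServiceMonitors looks like this — a sketch assuming the kube-prometheus-stack convention, where the Operator selects monitors by a `release` label:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: api
  labels:
    release: prometheus   # must match the Operator's serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: api            # matches the Service's labels, not the pods directly
  endpoints:
    - port: metrics       # a named port on the Service
      interval: 30s
```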
Loki for logs:
- LogQL is similar enough to PromQL that the learning curve is minimal
- Doesn’t index log content (just labels) – dramatically cheaper to operate than Elasticsearch
- Pairs naturally with Grafana
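LogQL's resemblance to PromQL is real. A representative query, with hypothetical label names, counting error log lines per pod over the last five minutes:

```logql
sum by (pod) (rate({app="api"} |= "error" [5m]))
```

Anyone who can read PromQL can read this on day one.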
Grafana for dashboards:
- Unified view of metrics and logs
- Alerting that works
- Community dashboards for common services
The stack:
```
┌─────────────────┐     ┌─────────────────┐
│   Prometheus    │     │      Loki       │
│    (metrics)    │     │     (logs)      │
└────────┬────────┘     └────────┬────────┘
         │                       │
         └───────────┬───────────┘
                     │
              ┌──────┴──────┐
              │   Grafana   │
              │ (dashboards)│
              └─────────────┘
```
Cost comparison:
| Stack | Monthly cost (our scale) |
|---|---|
| Datadog | £15,000+ |
| Self-hosted ELK | £3,000 (compute) + ops time |
| Prometheus/Loki/Grafana | £2,000 (compute) + less ops time |
The Prometheus stack isn’t free – you’re running databases – but the operational model is simpler than Elasticsearch and the cost is dramatically lower than Datadog.
OpenTelemetry: Do It From Day One
We instrumented applications with vendor-specific SDKs. Now we’re stuck migrating.
OpenTelemetry provides vendor-neutral instrumentation. Switch backends without touching application code.
For new services: OpenTelemetry SDK from day one. For existing services: gradual migration. The tracing is production-ready. Metrics are catching up.
Alerting: Slack Channels, Eventually
Alerting evolution:
- PagerDuty for everything – alert fatigue, ignored alerts
- Two-tier alerting – critical pages, non-critical emails
- Slack channels – alerts routed to team channels, acknowledged inline
The final state: critical alerts page on-call. Everything else goes to Slack channels with team ownership. Weekly review meetings to tune thresholds.
Deployments and Tooling
Helm: Rewritten More Times Than I’d Like
We’ve used Helm since v2 (Tiller days – dark times). The charts have been rewritten multiple times as our stack evolved.
Current state:
- Shared library chart for common patterns
- Service-specific charts that inherit from library
- Values files per environment
- Helm charts stored in ECR (OCI format)
Lesson: Invest in your Helm chart structure early. The cost of “we’ll clean it up later” is charts that nobody understands.
GitOps with Flux
All deployments are GitOps. Flux watches git, applies changes, reports drift.
The good:
- Single source of truth
- Audit trail via git history
- Self-healing (drift correction)
The investment:
- Tooling to answer “where is my commit?”
- Flux observability is weak out of the box
- Debugging “why didn’t this deploy?” requires understanding Flux internals
k9s: The Terminal UI You Need
If you’re running kubectl get pods repeatedly, stop. Use k9s.
Terminal UI for Kubernetes. Navigate resources, view logs, exec into pods, delete stuck resources. Faster than kubectl for interactive work.
Container Security: Don’t Overthink It
Start with the basics:
- Read-only root filesystem – prevents most runtime attacks
- Non-root user – principle of least privilege
- Drop all capabilities – add back only what’s needed
- Disable service account token auto-mount – most pods don’t need it
- Network policies – default deny, explicit allow
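Default deny is one small manifest per namespace. An empty `podSelector` matches every pod; with no ingress or egress rules listed, all traffic is blocked until you add explicit allows:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
spec:
  podSelector: {}        # every pod in the namespace
  policyTypes:
    - Ingress
    - Egress
```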
Tooling: We use AWS Inspector for container scanning. Catches obvious CVEs. Not trying to find the perfect tool – just something that runs automatically.
```yaml
securityContext:
  runAsNonRoot: true
  runAsUser: 1000
  readOnlyRootFilesystem: true
  allowPrivilegeEscalation: false
  capabilities:
    drop:
      - ALL
```
The Honest Truths
Kubernetes Is Complex
You need dedicated engineers. Not everyone-does-a-bit-of-K8s. Actual dedicated time from people who understand the internals.
The workload varies – some weeks are quiet, others are cluster upgrades and incident response. But you can’t rotate Kubernetes responsibility across the team like you can code review. The learning curve is too steep for “jump in and out.”
What works: One or two “go-to” engineers who own the platform, with everyone trained on basic operations (deploy, debug, read logs).
The Things That Keep Breaking
After five years, the failure patterns are predictable:
- DNS – CoreDNS scaling, ndots:5 performance, DNS caching
- Resource limits – OOMKilled, CPU throttling, eviction
- Networking – CNI issues, NetworkPolicy conflicts, load balancer health checks
- Certificates – expiry, rotation, trust chains
- Storage – PVC binding, EBS attach limits, storage class misconfiguration
Build monitoring and runbooks for these. They will happen.
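The ndots issue in particular has a cheap mitigation: lower ndots in the pod spec so single-label external lookups don't walk every search domain first. A sketch — the value `"2"` is a common choice, not a universal one, and it belongs in the pod template of a Deployment:

```yaml
spec:
  dnsConfig:
    options:
      - name: ndots
        value: "2"   # default is 5; lower means fewer wasted search-domain lookups
```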
Was It Worth It?
Yes.
The first two years were painful. Self-managed clusters, immature tooling, constant firefighting.
The last three years have been different. EKS removed the control plane burden. Karpenter solved cost optimisation. The Prometheus stack matured. GitOps made deployments predictable.
Kubernetes is complex, but it solves real problems:
- Consistent deployment model across services
- Self-healing that actually works
- Scaling that responds to demand
- Resource isolation between teams and services
- Ecosystem of tools that integrate cleanly
The complexity is the cost of those benefits. Whether it’s worth it depends on your scale and team. For us, it was.
The Lessons, Summarised
Infrastructure:
- Don’t run self-managed Kubernetes unless you have full-time SREs for it
- EKS control plane cost is trivial compared to operational burden
- Karpenter > Cluster Autoscaler, no contest
- Spot instances work for most workloads with proper diversification
Cost:
- Right-size pods (VPA recommend mode)
- Right-size nodes (Karpenter)
- Scale to zero where possible (KEDA)
- Review costs monthly, not quarterly
Observability:
- Prometheus + Loki + Grafana is the sweet spot for most teams
- OpenTelemetry from day one for new services
- Two-tier alerting: pages for critical, Slack for everything else
Operations:
- GitOps or regret it
- Invest in Helm chart structure early
- k9s for interactive work
- Dedicated Kubernetes engineers, not shared responsibility
Security:
- Read-only root filesystem, non-root user, drop capabilities
- Network policies with default deny
- Container scanning in CI – any tool is better than no tool
The meta-lesson: Kubernetes rewards investment in automation and tooling. Every hour spent on operational improvements pays dividends. Every shortcut creates debt that compounds.
More war stories from production at CoderCo. Connect on LinkedIn for infrastructure patterns and debugging deep-dives.