I ran a query last month on a client’s production cluster. 847 pods. Average CPU request: 500m. Average CPU usage: 73m.
They were paying for seven times more compute than they needed.
This isn’t unusual. I’ve audited dozens of Kubernetes clusters, and the pattern is depressingly consistent. Teams request “enough” resources (meaning way too much), and nobody pushes back because under-provisioning causes outages. So the waste accumulates, finance complains about the cloud bill, and everyone shrugs.
Here’s how to fix it.
How We Got Here
The waste pattern follows a predictable path:
Step 1: Initial deployment. A developer needs to deploy a service. They’ve never profiled it, so they guess. “1 CPU and 1Gi memory should be fine.” These are nice round numbers that probably came from a template or Stack Overflow.
Step 2: The OOMKill. Production goes live. The app gets OOMKilled under load because it actually needs 600Mi, but occasionally spikes to 700Mi. Developer’s fix: double the memory to 2Gi. Problem solved, they think.
Step 3: Scale out. Traffic grows. The service scales from 3 replicas to 10. Each replica still has the inflated 2Gi memory request.
Step 4: Repeat. This happens across every service. After a year, you’re running a 100-node cluster that could fit on 30.
The worst part? Nobody even knows it’s happening. The app works. Alerts are green. The waste is invisible until someone looks at the bill.
Understanding Requests vs Limits
Before we fix anything, let’s be precise about what we’re dealing with.
Requests are scheduling guarantees. When you set requests.memory: 512Mi, the scheduler will only place the pod on a node with 512Mi of unreserved allocatable memory. That 512Mi counts against the node's capacity whether the pod uses it or not: even if the pod only ever touches 100Mi, no other pod can be scheduled against the remaining 412Mi.
Limits are caps. limits.memory: 1Gi means the pod can use up to 1Gi, but no more. Exceed it, and you get OOMKilled.
The relationship between requests and limits matters:
```yaml
resources:
  requests:
    cpu: 250m
    memory: 512Mi
  limits:
    cpu: 1000m
    memory: 1Gi
```
In this example, the pod is guaranteed 250m CPU and 512Mi memory, but can burst to 1 CPU and 1Gi memory if capacity is available.
The waste happens in requests, not limits. If you request 512Mi but use 100Mi, you've wasted 412Mi: the scheduler treats it as spoken for, so nothing else can be placed against it, even though it's sitting idle.
Finding the Worst Offenders
Don’t try to right-size everything at once. Find the services wasting the most resources and fix those first.
If you’re running Prometheus (and you should be), this query shows overprovisioned pods:
```promql
# Memory waste by pod (requested - used)
sum by (namespace, pod) (
  kube_pod_container_resource_requests{resource="memory"}
)
-
sum by (namespace, pod) (
  container_memory_working_set_bytes
)
```
Sort descending and you’ll find your worst offenders.
For a quick CLI check without Prometheus:
```shell
kubectl top pods -A --sort-by=memory | head -20
```
Compare that with requested resources:
```shell
kubectl get pods -A -o custom-columns=\
"NAMESPACE:.metadata.namespace,\
NAME:.metadata.name,\
MEM_REQUEST:.spec.containers[*].resources.requests.memory" | head -20
```
A pod requesting 2Gi but using 200Mi is a 10x overprovisioning. Fix that one first.
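The triage itself can be scripted. Here's a minimal sketch that ranks pods by requested-minus-used memory; the pod numbers are invented for illustration, and in practice you'd feed it the output of the two kubectl commands above:

```python
# Rank pods by memory waste (requested - used), worst first.
# The pod data below is illustrative; feed it from Prometheus
# or kubectl in a real cluster.

MI = 1024 * 1024

pods = [
    {"pod": "api-server", "requested": 2048 * MI, "used": 200 * MI},
    {"pod": "worker",     "requested": 512 * MI,  "used": 450 * MI},
    {"pod": "cache",      "requested": 1024 * MI, "used": 100 * MI},
]

def waste(p):
    """Bytes requested but not actually used."""
    return p["requested"] - p["used"]

worst_first = sorted(pods, key=waste, reverse=True)

for p in worst_first:
    ratio = p["requested"] / p["used"]
    print(f"{p['pod']}: wasting {waste(p) // MI}Mi ({ratio:.1f}x overprovisioned)")
```

The sort key is absolute waste rather than the ratio, because an 8Gi pod at 2x overprovisioning costs you more than a 64Mi pod at 10x.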
Measuring Actual Usage
Gut feelings don’t cut it. You need data, ideally over at least a week to capture traffic patterns.
Here’s how to measure P95 CPU usage over 7 days:
```promql
quantile_over_time(0.95,
  rate(container_cpu_usage_seconds_total{
    namespace="production",
    container="my-app"
  }[5m])[7d:]
)
```
And P99 memory:
```promql
quantile_over_time(0.99,
  container_memory_working_set_bytes{
    namespace="production",
    container="my-app"
  }[7d:]
)
```
Why P95 for CPU and P99 for memory? CPU is compressible - if you hit your limit, you get throttled but don’t crash. A brief spike to P99 is annoying but survivable.
Memory is not compressible. Exceed your limit and you die. You want more headroom.
The Right-Sizing Formula
Here’s my approach after doing this dozens of times:
For CPU requests: Set to P95 usage. This covers normal operation with a small buffer.
For CPU limits: Set to 2-4x requests, or remove entirely. Yes, remove. I’ll explain why.
For memory requests: Set to P99 usage + 20% headroom.
For memory limits: Set equal to requests, or slightly higher (1.2x).
Example Calculation
My app shows these metrics over 7 days:
- CPU: P50 = 50m, P95 = 120m, P99 = 180m
- Memory: P50 = 300Mi, P95 = 450Mi, P99 = 520Mi
Right-sized configuration:
```yaml
resources:
  requests:
    cpu: 120m      # P95 CPU
    memory: 624Mi  # P99 + 20% = 520Mi * 1.2
  limits:
    cpu: 500m      # ~4x requests (or omit entirely)
    memory: 750Mi  # Small buffer above the request
```
The previous config was probably 1 CPU and 2Gi. We just cut the CPU reservation by 88% and the memory reservation by roughly 70%.
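The arithmetic is easy to mechanise. Here's a sketch of the formula above in Python (the function name and rounding choices are mine, not a standard):

```python
# Right-size requests from observed percentiles, per the formula above:
# CPU request = P95, memory request = P99 + 20% headroom.
# Units: millicores (m) and Mi.

def right_size(cpu_p95_m: int, mem_p99_mi: int,
               cpu_limit_factor: float = 4.0,
               mem_headroom: float = 1.2) -> dict:
    mem_request = round(mem_p99_mi * mem_headroom)
    return {
        "requests": {"cpu_m": cpu_p95_m, "memory_mi": mem_request},
        "limits": {
            # CPU limit is optional; ~4x the request if you keep one.
            "cpu_m": round(cpu_p95_m * cpu_limit_factor),
            # Memory limit: equal to the request or slightly higher.
            "memory_mi": round(mem_request * 1.2),
        },
    }

sizing = right_size(cpu_p95_m=120, mem_p99_mi=520)
# requests come out at 120m CPU / 624Mi memory, matching the example
```

Feed it the percentiles from the Prometheus queries above and it reproduces the worked example.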
The Case Against CPU Limits
This is controversial, but hear me out.
CPU throttling is unpredictable and hard to debug. When a container hits its CPU limit, the kernel forces it to wait. This adds latency in ways that don’t show up in obvious metrics. Your app just gets… slower. Randomly.
I’ve debugged latency issues for hours, only to discover that CPU throttling was the culprit. Remove the limits, latency stabilises.
The downside of removing CPU limits: a runaway process can starve other workloads on the same node. But if you’re using requests properly, every pod has guaranteed CPU. The noisy neighbor can only use what’s left over.
My recommendation: for most stateless services, remove CPU limits. Keep memory limits tight.
```yaml
resources:
  requests:
    cpu: 120m
    memory: 624Mi
  limits:
    # cpu: removed intentionally
    memory: 750Mi
```
If you need CPU limits for specific workloads (batch jobs, anything running untrusted code), keep them. But don’t apply them blindly everywhere.
Automating with VPA
Manual right-sizing doesn’t scale. The Vertical Pod Autoscaler can automate it.
VPA has three modes:
- Off: Only recommends, doesn’t change anything
- Initial: Sets resources when pods are created, but doesn’t update running pods
- Auto: Evicts and recreates pods with new resource values
Start with Off to build confidence, then move to Auto for stateless workloads.
Let’s set up VPA. First install it:
```shell
git clone https://github.com/kubernetes/autoscaler.git
cd autoscaler/vertical-pod-autoscaler
./hack/vpa-up.sh
```
Now create a VPA for your deployment:
```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  updatePolicy:
    updateMode: Auto
  resourcePolicy:
    containerPolicies:
    - containerName: my-app
      minAllowed:
        cpu: 50m
        memory: 128Mi
      maxAllowed:
        cpu: 2
        memory: 4Gi
      controlledResources: ["cpu", "memory"]
      controlledValues: RequestsAndLimits
```
Critical: always set minAllowed and maxAllowed. Without bounds, VPA might scale a pod to 32Gi memory based on a temporary spike, or scale it so small it can’t start.
Check recommendations:
```shell
kubectl get vpa my-app-vpa -o yaml
```
Look for the recommendation section:
```yaml
recommendation:
  containerRecommendations:
  - containerName: my-app
    lowerBound:
      cpu: 25m
      memory: 262144k
    target:
      cpu: 50m
      memory: 524288k
    upperBound:
      cpu: 200m
      memory: 1048576k
```
- target: What VPA recommends (by default roughly the P90 of observed usage, plus a safety margin)
- lowerBound: Minimum viable (roughly P50)
- upperBound: Safe ceiling (roughly P95)
For cost optimisation, target is usually sufficient. For stability, go closer to upperBound.
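Those memory values are raw Kubernetes quantity strings: 262144k means 262,144 decimal kilobytes, which is exactly 250Mi. A small helper (my own, not part of the VPA tooling) to convert the common suffixes into readable Mi:

```python
# Convert Kubernetes quantity strings, as VPA reports them, into Mi.
# Handles only the common suffixes; not a full quantity parser.

SUFFIXES = {
    "k": 1000, "M": 1000**2, "G": 1000**3,     # decimal
    "Ki": 1024, "Mi": 1024**2, "Gi": 1024**3,  # binary
}

def to_mi(quantity: str) -> float:
    """'262144k' -> 250.0 (Mi)."""
    # Check two-letter suffixes ('Mi') before one-letter ('M').
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if quantity.endswith(suffix):
            number = float(quantity[: -len(suffix)])
            return number * SUFFIXES[suffix] / (1024**2)
    return float(quantity) / (1024**2)  # no suffix: plain bytes

for raw in ("262144k", "524288k", "1048576k"):
    print(f"{raw} = {to_mi(raw):.0f}Mi")
```

So the recommendation above reads as lowerBound 250Mi, target 500Mi, upperBound 1000Mi.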
VPA + HPA Together
Here’s a gotcha that trips people up: VPA and HPA can conflict.
If HPA scales on CPU utilisation and VPA adjusts CPU requests, they chase each other: VPA lowers requests → utilisation (usage divided by request) jumps → HPA adds replicas → per-pod usage drops → VPA lowers requests again. Chaos.
The solution: split responsibilities.
```yaml
# VPA controls memory only
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  resourcePolicy:
    containerPolicies:
    - containerName: my-app
      controlledResources: ["memory"]  # Only memory
---
# HPA controls replicas based on CPU
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
```
VPA handles memory, HPA handles scale-out based on CPU. No conflicts.
Namespace Guardrails
Right-sizing existing workloads is half the battle. You also need to prevent future waste.
ResourceQuotas limit total resource usage per namespace:
```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: team-a
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 40Gi
    limits.cpu: "40"
    limits.memory: 80Gi
    pods: "100"
```
This forces teams to right-size. If they want to run more pods, they need to reduce requests on existing ones.
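To see why the quota forces right-sizing, here's an illustrative calculation (not how the admission controller is actually implemented) of how many identical pods fit under the quota above at inflated versus right-sized requests:

```python
# How many identical pods fit under the team-a quota above?
# Units: millicores and Mi. Illustrative arithmetic only.

QUOTA = {"cpu_m": 20_000, "memory_mi": 40 * 1024, "pods": 100}

inflated = {"cpu_m": 500, "memory_mi": 2048}    # the "2Gi to be safe" habit
right_sized = {"cpu_m": 120, "memory_mi": 624}  # from the earlier example

def max_pods(pod: dict) -> int:
    """The binding constraint wins: CPU, memory, or the pod count cap."""
    return min(QUOTA["cpu_m"] // pod["cpu_m"],
               QUOTA["memory_mi"] // pod["memory_mi"],
               QUOTA["pods"])

print(max_pods(inflated))     # memory-bound: the quota caps the team early
print(max_pods(right_sized))  # over 3x more pods in the same budget
```

With inflated 2Gi requests the team hits the memory quota at 20 pods; right-sized, the same budget fits 65.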
LimitRanges set defaults and constraints per pod:
```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: team-a
spec:
  limits:
  - type: Container
    default:
      cpu: 200m
      memory: 256Mi
    defaultRequest:
      cpu: 100m
      memory: 128Mi
    max:
      cpu: 2
      memory: 4Gi
    min:
      cpu: 50m
      memory: 64Mi
```
Now deployments without explicit resources get sensible defaults, and nobody can request more than 4Gi per container.
Goldilocks - Easy Recommendations
Don’t want to set up VPA in recommend mode for every deployment manually? Goldilocks does it for you.
```shell
helm repo add fairwinds-stable https://charts.fairwinds.com/stable
helm install goldilocks fairwinds-stable/goldilocks --namespace goldilocks --create-namespace
```
Enable it on a namespace:
```shell
kubectl label namespace production goldilocks.fairwinds.com/enabled=true
```
Goldilocks creates VPA objects in recommend mode for every deployment and provides a dashboard showing what each deployment should use.
Port forward and check it out:
```shell
kubectl port-forward -n goldilocks svc/goldilocks-dashboard 8080:80
```
You’ll see every deployment with current requests vs recommended requests. Export to CSV, prioritise by waste, and work through the list.
Measuring Success
Track these metrics to prove you’re making progress:
Cluster efficiency:
```promql
# CPU utilisation vs total capacity
sum(rate(container_cpu_usage_seconds_total{container!=""}[5m]))
/
sum(kube_node_status_allocatable{resource="cpu"})
```
Target: above 50% for CPU, above 60% for memory. Below that, you’re overpaying.
Over-provisioning ratio:
```promql
# How much is requested vs actually used
sum(kube_pod_container_resource_requests{resource="memory"})
/
sum(container_memory_working_set_bytes{container!=""})
```
Target: below 1.5x. If you’re at 3x, you’re wasting two-thirds of your spend.
Cost per request (if you’re tracking node costs):
```promql
sum(node_hourly_cost) / sum(rate(http_requests_total[1h]))
```
This is your efficiency metric. As you right-size, cost per request should drop.
Quick Wins
Need to show results this week? Here’s your playbook:
1. Find the top 10 worst offenders. Sort pods by (requests - usage). Fix those first.
2. Enable VPA in recommend mode cluster-wide. Zero risk, immediate visibility.
3. Kill zombie deployments. That staging environment from six months ago? The test namespace nobody remembers? Delete them.
4. Remove CPU limits on latency-sensitive services. This often improves performance AND reduces wasted reserved capacity.
5. Set namespace quotas. Prevent future waste by giving teams a budget.
The biggest gains come from fixing a handful of egregiously wasteful deployments, not micro-optimising everything. One deployment requesting 8Gi but using 500Mi is worth more than tuning 50 others.
The Process
1. Measure: deploy Prometheus and Goldilocks, collect two weeks of data.
2. Analyse: find the worst offenders, calculate right-sized values.
3. Test: apply in staging, load test, verify no degradation.
4. Apply: roll out to production gradually, one service at a time.
5. Automate: enable VPA in Auto mode for stateless workloads.
6. Maintain: review quarterly; traffic patterns change, so right-sizing is ongoing.
Right-sizing isn’t a one-time project. It’s a practice. Build it into your quarterly reviews, track the metrics, and keep pushing efficiency up.
Your finance team will thank you.