I ran a query last month on a client’s production cluster. 847 pods. Average CPU request: 500m. Average CPU usage: 73m.
They were paying for seven times more compute than they needed.
This isn’t unusual. I’ve audited dozens of Kubernetes clusters, and the pattern is depressingly consistent. Teams request “enough” resources (meaning way too much), and nobody pushes back because under-provisioning causes outages. So the waste accumulates, finance complains about the cloud bill, and everyone shrugs.
Here’s how to fix it.
How We Got Here
The waste pattern follows a predictable path:
Step 1: Initial deployment. A developer needs to deploy a service. They’ve never profiled it, so they guess. “1 CPU and 1Gi memory should be fine.” These are nice round numbers that probably came from a template or Stack Overflow.
Step 2: The OOMKill. Production goes live. The app gets OOMKilled under load because it actually needs 600Mi, but occasionally spikes to 700Mi. Developer’s fix: double the memory to 2Gi. Problem solved, they think.
Step 3: Scale out. Traffic grows. The service scales from 3 replicas to 10. Each replica still has the inflated 2Gi memory request.
Step 4: Repeat. This happens across every service. After a year, you’re running a 100-node cluster that could fit on 30.
The worst part? Nobody even knows it’s happening. The app works. Alerts are green. The waste is invisible until someone looks at the bill.
Understanding Requests vs Limits
Before we fix anything, let’s be precise about what we’re dealing with.
Requests are scheduling guarantees. When you set requests.memory: 512Mi, the scheduler will only place the pod on a node with 512Mi of unreserved allocatable memory. That 512Mi counts against the node's capacity whether the pod uses it or not: even if the pod only ever touches 100Mi, no other pod can be scheduled against the remaining 412Mi.
Limits are caps. limits.memory: 1Gi means the pod can use up to 1Gi, but no more. Exceed it, and you get OOMKilled.
The relationship between requests and limits matters:
```yaml
resources:
  requests:
    cpu: 250m
    memory: 512Mi
  limits:
    cpu: 1000m
    memory: 1Gi
```
In this example, the pod is guaranteed 250m CPU and 512Mi memory, but can burst to 1 CPU and 1Gi memory if capacity is available.
The waste happens in requests, not limits. If you request 512Mi but use 100Mi, you've wasted 412Mi: the scheduler treats it as spoken for, so nothing else can be placed against it, even though it's sitting idle.
Finding the Worst Offenders
Don’t try to right-size everything at once. Find the services wasting the most resources and fix those first.
If you’re running Prometheus (and you should be), this query shows overprovisioned pods:
```promql
# Memory waste by pod (requested - used)
sum by (namespace, pod) (
  kube_pod_container_resource_requests{resource="memory"}
)
-
sum by (namespace, pod) (
  container_memory_working_set_bytes
)
```
Sort descending and you’ll find your worst offenders.
For a quick CLI check without Prometheus:
```shell
kubectl top pods -A --sort-by=memory | head -20
```
Compare that with requested resources:
```shell
kubectl get pods -A -o custom-columns=\
"NAMESPACE:.metadata.namespace,\
NAME:.metadata.name,\
MEM_REQUEST:.spec.containers[*].resources.requests.memory" | head -20
```
A pod requesting 2Gi but using 200Mi is a 10x overprovisioning. Fix that one first.
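The triage itself can be scripted. Here's a minimal sketch that ranks pods by requested-minus-used memory; the pod numbers are invented for illustration, and in practice you'd feed it the output of the two kubectl commands above:

```python
# Rank pods by memory waste (requested - used), worst first.
# The pod data below is illustrative; feed it from Prometheus
# or kubectl in a real cluster.

MI = 1024 * 1024

pods = [
    {"pod": "api-server", "requested": 2048 * MI, "used": 200 * MI},
    {"pod": "worker",     "requested": 512 * MI,  "used": 450 * MI},
    {"pod": "cache",      "requested": 1024 * MI, "used": 100 * MI},
]

def waste(p):
    """Bytes requested but not actually used."""
    return p["requested"] - p["used"]

worst_first = sorted(pods, key=waste, reverse=True)

for p in worst_first:
    ratio = p["requested"] / p["used"]
    print(f"{p['pod']}: wasting {waste(p) // MI}Mi ({ratio:.1f}x overprovisioned)")
```

The sort key is absolute waste rather than the ratio, because an 8Gi pod at 2x overprovisioning costs you more than a 64Mi pod at 10x.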
Measuring Actual Usage
Gut feelings don’t cut it. You need data, ideally over at least a week to capture traffic patterns.
Here’s how to measure P95 CPU usage over 7 days:
```promql
quantile_over_time(0.95,
  rate(container_cpu_usage_seconds_total{
    namespace="production",
    container="my-app"
  }[5m])[7d:]
)
```
And P99 memory:
```promql
quantile_over_time(0.99,
  container_memory_working_set_bytes{
    namespace="production",
    container="my-app"
  }[7d:]
)
```
Why P95 for CPU and P99 for memory? CPU is compressible - if you hit your limit, you get throttled but don’t crash. A brief spike to P99 is annoying but survivable.
Memory is not compressible. Exceed your limit and you die. You want more headroom.
The Right-Sizing Formula
Here’s my approach after doing this dozens of times:
For CPU requests: Set to P95 usage. This covers normal operation with a small buffer.
For CPU limits: Set to 2-4x requests, or remove entirely. Yes, remove. I’ll explain why.
For memory requests: Set to P99 usage + 20% headroom.
For memory limits: Set equal to requests, or slightly higher (1.2x).
Example Calculation
My app shows these metrics over 7 days:
- CPU: P50 = 50m, P95 = 120m, P99 = 180m
- Memory: P50 = 300Mi, P95 = 450Mi, P99 = 520Mi
Right-sized configuration:
```yaml
resources:
  requests:
    cpu: 120m      # P95 CPU
    memory: 624Mi  # P99 + 20% = 520Mi * 1.2
  limits:
    cpu: 500m      # ~4x requests (or omit entirely)
    memory: 750Mi  # Small buffer above the request
```
The previous config was probably 1 CPU and 2Gi. We just cut the CPU reservation by 88% and the memory reservation by roughly 70%.
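The arithmetic is easy to mechanise. Here's a sketch of the formula above in Python (the function name and rounding choices are mine, not a standard):

```python
# Right-size requests from observed percentiles, per the formula above:
# CPU request = P95, memory request = P99 + 20% headroom.
# Units: millicores (m) and Mi.

def right_size(cpu_p95_m: int, mem_p99_mi: int,
               cpu_limit_factor: float = 4.0,
               mem_headroom: float = 1.2) -> dict:
    mem_request = round(mem_p99_mi * mem_headroom)
    return {
        "requests": {"cpu_m": cpu_p95_m, "memory_mi": mem_request},
        "limits": {
            # CPU limit is optional; ~4x the request if you keep one.
            "cpu_m": round(cpu_p95_m * cpu_limit_factor),
            # Memory limit: equal to the request or slightly higher.
            "memory_mi": round(mem_request * 1.2),
        },
    }

sizing = right_size(cpu_p95_m=120, mem_p99_mi=520)
# requests come out at 120m CPU / 624Mi memory, matching the example
```

Feed it the percentiles from the Prometheus queries above and it reproduces the worked example.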
The Case Against CPU Limits
This is controversial, but hear me out.
CPU throttling is unpredictable and hard to debug. When a container hits its CPU limit, the kernel forces it to wait. This adds latency in ways that don’t show up in obvious metrics. Your app just gets… slower. Randomly.
I’ve debugged latency issues for hours, only to discover that CPU throttling was the culprit. Remove the limits, latency stabilises.
The downside of removing CPU limits: a runaway process can starve other workloads on the same node. But if you’re using requests properly, every pod has guaranteed CPU. The noisy neighbor can only use what’s left over.
My recommendation: for most stateless services, remove CPU limits. Keep memory limits tight.
```yaml
resources:
  requests:
    cpu: 120m
    memory: 624Mi
  limits:
    # cpu: removed intentionally
    memory: 750Mi
```
If you need CPU limits for specific workloads (batch jobs, anything running untrusted code), keep them. But don’t apply them blindly everywhere.
Automating with VPA
Manual right-sizing doesn’t scale. The Vertical Pod Autoscaler can automate it.
VPA has three modes:
- Off: Only recommends, doesn’t change anything
- Initial: Sets resources when pods are created, but doesn’t update running pods
- Auto: Evicts and recreates pods with new resource values
Start with Off to build confidence, then move to Auto for stateless workloads.
Let’s set up VPA. First install it:
```shell
git clone https://github.com/kubernetes/autoscaler.git
cd autoscaler/vertical-pod-autoscaler
./hack/vpa-up.sh
```
Now create a VPA for your deployment:
```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  updatePolicy:
    updateMode: Auto
  resourcePolicy:
    containerPolicies:
    - containerName: my-app
      minAllowed:
        cpu: 50m
        memory: 128Mi
      maxAllowed:
        cpu: 2
        memory: 4Gi
      controlledResources: ["cpu", "memory"]
      controlledValues: RequestsAndLimits
```
Critical: always set minAllowed and maxAllowed. Without bounds, VPA might scale a pod to 32Gi memory based on a temporary spike, or scale it so small it can’t start.
Check recommendations:
```shell
kubectl get vpa my-app-vpa -o yaml
```
Look for the recommendation section:
```yaml
recommendation:
  containerRecommendations:
  - containerName: my-app
    lowerBound:
      cpu: 25m
      memory: 262144k
    target:
      cpu: 50m
      memory: 524288k
    upperBound:
      cpu: 200m
      memory: 1048576k
```
- target: What VPA recommends (by default roughly the P90 of observed usage, plus a safety margin)
- lowerBound: Minimum viable (roughly P50)
- upperBound: Safe ceiling (roughly P95)
For cost optimisation, target is usually sufficient. For stability, go closer to upperBound.
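Those memory values are raw Kubernetes quantity strings: 262144k means 262,144 decimal kilobytes, which is exactly 250Mi. A small helper (my own, not part of the VPA tooling) to convert the common suffixes into readable Mi:

```python
# Convert Kubernetes quantity strings, as VPA reports them, into Mi.
# Handles only the common suffixes; not a full quantity parser.

SUFFIXES = {
    "k": 1000, "M": 1000**2, "G": 1000**3,     # decimal
    "Ki": 1024, "Mi": 1024**2, "Gi": 1024**3,  # binary
}

def to_mi(quantity: str) -> float:
    """'262144k' -> 250.0 (Mi)."""
    # Check two-letter suffixes ('Mi') before one-letter ('M').
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if quantity.endswith(suffix):
            number = float(quantity[: -len(suffix)])
            return number * SUFFIXES[suffix] / (1024**2)
    return float(quantity) / (1024**2)  # no suffix: plain bytes

for raw in ("262144k", "524288k", "1048576k"):
    print(f"{raw} = {to_mi(raw):.0f}Mi")
```

So the recommendation above reads as lowerBound 250Mi, target 500Mi, upperBound 1000Mi.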
VPA + HPA Together
Here’s a gotcha that trips people up: VPA and HPA can conflict.
If HPA scales on CPU utilisation and VPA adjusts CPU requests, they chase each other: VPA lowers requests → utilisation (usage divided by request) jumps → HPA adds replicas → per-pod usage drops → VPA lowers requests again. Chaos.
The solution: split responsibilities.
```yaml
# VPA controls memory only
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  resourcePolicy:
    containerPolicies:
    - containerName: my-app
      controlledResources: ["memory"]  # Only memory
---
# HPA controls replicas based on CPU
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
```
VPA handles memory, HPA handles scale-out based on CPU. No conflicts.
Namespace Guardrails
Right-sizing existing workloads is half the battle. You also need to prevent future waste.
ResourceQuotas limit total resource usage per namespace:
```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: team-a
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 40Gi
    limits.cpu: "40"
    limits.memory: 80Gi
    pods: "100"
```
This forces teams to right-size. If they want to run more pods, they need to reduce requests on existing ones.
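To see why the quota forces right-sizing, here's an illustrative calculation (not how the admission controller is actually implemented) of how many identical pods fit under the quota above at inflated versus right-sized requests:

```python
# How many identical pods fit under the team-a quota above?
# Units: millicores and Mi. Illustrative arithmetic only.

QUOTA = {"cpu_m": 20_000, "memory_mi": 40 * 1024, "pods": 100}

inflated = {"cpu_m": 500, "memory_mi": 2048}    # the "2Gi to be safe" habit
right_sized = {"cpu_m": 120, "memory_mi": 624}  # from the earlier example

def max_pods(pod: dict) -> int:
    """The binding constraint wins: CPU, memory, or the pod count cap."""
    return min(QUOTA["cpu_m"] // pod["cpu_m"],
               QUOTA["memory_mi"] // pod["memory_mi"],
               QUOTA["pods"])

print(max_pods(inflated))     # memory-bound: the quota caps the team early
print(max_pods(right_sized))  # over 3x more pods in the same budget
```

With inflated 2Gi requests the team hits the memory quota at 20 pods; right-sized, the same budget fits 65.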
LimitRanges set defaults and constraints per pod:
```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: team-a
spec:
  limits:
  - type: Container
    default:
      cpu: 200m
      memory: 256Mi
    defaultRequest:
      cpu: 100m
      memory: 128Mi
    max:
      cpu: 2
      memory: 4Gi
    min:
      cpu: 50m
      memory: 64Mi
```
Now deployments without explicit resources get sensible defaults, and nobody can request more than 4Gi per container.
Goldilocks - Easy Recommendations
Don’t want to set up VPA in recommend mode for every deployment manually? Goldilocks does it for you.
```shell
helm repo add fairwinds-stable https://charts.fairwinds.com/stable
helm install goldilocks fairwinds-stable/goldilocks --namespace goldilocks --create-namespace
```
Enable it on a namespace:
```shell
kubectl label namespace production goldilocks.fairwinds.com/enabled=true
```
Goldilocks creates VPA objects in recommend mode for every deployment and provides a dashboard showing what each deployment should use.
Port forward and check it out:
```shell
kubectl port-forward -n goldilocks svc/goldilocks-dashboard 8080:80
```
You’ll see every deployment with current requests vs recommended requests. Export to CSV, prioritise by waste, and work through the list.
Measuring Success
Track these metrics to prove you’re making progress:
Cluster efficiency:
```promql
# CPU utilisation vs total capacity
sum(rate(container_cpu_usage_seconds_total{container!=""}[5m]))
/
sum(kube_node_status_allocatable{resource="cpu"})
```
Target: above 50% for CPU, above 60% for memory. Below that, you’re overpaying.
Over-provisioning ratio:
```promql
# How much is requested vs actually used
sum(kube_pod_container_resource_requests{resource="memory"})
/
sum(container_memory_working_set_bytes{container!=""})
```
Target: below 1.5x. If you’re at 3x, you’re wasting two-thirds of your spend.
Cost per request (if you’re tracking node costs):
```promql
sum(node_hourly_cost) / sum(rate(http_requests_total[1h]))
```
This is your efficiency metric. As you right-size, cost per request should drop.
Quick Wins
Need to show results this week? Here’s your playbook:
1. Find the top 10 worst offenders. Sort pods by (requests - usage). Fix those first.
2. Enable VPA in recommend mode cluster-wide. Zero risk, immediate visibility.
3. Kill zombie deployments. That staging environment from six months ago? The test namespace nobody remembers? Delete them.
4. Remove CPU limits on latency-sensitive services. This often improves performance AND reduces wasted reserved capacity.
5. Set namespace quotas. Prevent future waste by giving teams a budget.
The biggest gains come from fixing a handful of egregiously wasteful deployments, not micro-optimising everything. One deployment requesting 8Gi but using 500Mi is worth more than tuning 50 others.
The Process
1. Measure: deploy Prometheus and Goldilocks, collect two weeks of data.
2. Analyse: find the worst offenders, calculate right-sized values.
3. Test: apply in staging, load test, verify no degradation.
4. Apply: roll out to production gradually, one service at a time.
5. Automate: enable VPA in Auto mode for stateless workloads.
6. Maintain: review quarterly; traffic patterns change, so right-sizing is ongoing.
Right-sizing isn’t a one-time project. It’s a practice. Build it into your quarterly reviews, track the metrics, and keep pushing efficiency up.
Your finance team will thank you.