Pod Topology Spread Constraints - Distributing Workloads Intelligently
You have 6 replicas. 3 nodes. Kubernetes puts all 6 pods on node-1 because it has the most resources. Then node-1 dies. You’re down.
Pod anti-affinity rules help, but they're blunt instruments. They say "don't put these pods on the same node" but don't guarantee even distribution.
Topology Spread Constraints give you precise control over pod distribution across zones, nodes, or any topology you define. They’re the difference between “hopefully spread out” and “guaranteed spread.”
TL;DR
- Topology Spread Constraints control pod distribution across failure domains
- Use `maxSkew` to define acceptable imbalance
- Choose `whenUnsatisfiable`: `DoNotSchedule` (strict) or `ScheduleAnyway` (soft)
- Spread across zones for availability, nodes for resource efficiency
- Combine with pod anti-affinity for complete scheduling control
Code Repository: All code from this post is available at github.com/moabukar/blog-code/pod-topology-spread-constraints
The Problem
Default scheduling optimizes for resource efficiency, not availability:
# 6 replicas, 3 nodes
# Default scheduling might produce:
Node-1 (lots of resources): pod-1, pod-2, pod-3, pod-4
Node-2 (some resources): pod-5
Node-3 (some resources): pod-6
# If Node-1 fails: 4 of 6 pods gone = degraded service
What we want:
Node-1: pod-1, pod-2
Node-2: pod-3, pod-4
Node-3: pod-5, pod-6
# If any node fails: only 2 of 6 pods affected = service continues
Basic Syntax
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 6
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: web
      containers:
        - name: web
          image: nginx
Key Fields
| Field | Description |
|---|---|
| `maxSkew` | Maximum allowed difference in pod counts between topology domains |
| `topologyKey` | Node label to group by (hostname, zone, region) |
| `whenUnsatisfiable` | What to do if the constraint can't be met |
| `labelSelector` | Which pods to count for distribution |
Understanding maxSkew
`maxSkew: 1` means the difference between the most and least populated topology domains can be at most 1.
# 6 pods, 3 nodes, maxSkew: 1
# Example distributions:
Node-1: 2 pods
Node-2: 2 pods
Node-3: 2 pods
# Skew = 2 - 2 = 0 ✓
Node-1: 3 pods
Node-2: 2 pods
Node-3: 1 pod
# Skew = 3 - 1 = 2 ✗ (exceeds maxSkew: 1)
Node-1: 2 pods
Node-2: 2 pods
Node-3: 1 pod
# Skew = 2 - 1 = 1 ✓ (a valid intermediate state while the 6th pod is placed)
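To make the arithmetic concrete, here's a small shell sketch that computes the skew from a pod-to-node listing. The hard-coded `pod_nodes` variable stands in for real cluster output; the `kubectl` command in the comment is the kind of query shown in the Debugging section below.

```shell
# Skew = (pods in the most-populated domain) - (pods in the least-populated one).
# pod_nodes stands in for the output of:
#   kubectl get pods -l app=web -o jsonpath='{range .items[*]}{.spec.nodeName}{"\n"}{end}'
pod_nodes='node-1
node-1
node-2
node-2
node-3'

echo "$pod_nodes" | sort | uniq -c | awk '
  NR == 1 { max = $1; min = $1 }              # initialise with the first count
  { if ($1 > max) max = $1; if ($1 < min) min = $1 }
  END { print "skew =", max - min }'
# skew = 1, so this placement satisfies maxSkew: 1
```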
maxSkew Values
- `maxSkew: 1` - Strictly even distribution (recommended for HA)
- `maxSkew: 2` - Some imbalance allowed
- `maxSkew: N` - More flexibility, less guarantee
whenUnsatisfiable Modes
DoNotSchedule (Strict)
whenUnsatisfiable: DoNotSchedule
If placing a pod would violate the constraint, don’t schedule it. Pod stays Pending.
Use when: Availability is critical. Better to have fewer pods than violate the spread.
ScheduleAnyway (Soft)
whenUnsatisfiable: ScheduleAnyway
Try to satisfy the constraint, but schedule anyway if impossible. Scheduler still prefers compliant placements.
Use when: You want best-effort spreading but can’t guarantee topology (e.g., autoscaling might create uneven node counts).
Common Topology Keys
By Node
topologyKey: kubernetes.io/hostname
Spread across individual nodes. Good for node failure tolerance.
By Zone
topologyKey: topology.kubernetes.io/zone
Spread across availability zones. Essential for zone failure tolerance.
By Region
topologyKey: topology.kubernetes.io/region
Spread across regions. For multi-region deployments.
Custom Labels
# Nodes labeled with: rack=rack-1, rack=rack-2, etc.
topologyKey: rack
Spread across custom failure domains like racks, power zones, etc.
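As a sketch, assuming every node has been given a custom `rack` label (e.g. `kubectl label node node-1 rack=rack-1`) and using an illustrative `app: myapp` workload, the constraint looks the same as the zone examples, just with the custom key:

```yaml
# Hypothetical rack-spreading constraint; assumes all nodes carry a rack label
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: rack              # custom node label, not a built-in key
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: myapp
```

Nodes missing the `rack` label don't count as a topology domain for this constraint, so label every node before relying on it.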
Real-World Examples
High Availability Web Service
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-api
spec:
  replicas: 6
  selector:
    matchLabels:
      app: web-api
  template:
    metadata:
      labels:
        app: web-api
    spec:
      topologySpreadConstraints:
        # Spread across zones (primary)
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: web-api
        # Also spread across nodes within zones (secondary)
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: web-api
      containers:
        - name: web-api
          image: myapp:latest
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
Result:
- Strictly spread across zones (DoNotSchedule)
- Best-effort spread across nodes within zones (ScheduleAnyway)
- Zone failure takes out at most ~33% of pods
Database Replicas
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
spec:
  serviceName: postgres  # headless Service; required for StatefulSets
  replicas: 3
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: postgres
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: postgres
              topologyKey: kubernetes.io/hostname
      containers:
        - name: postgres
          image: postgres:15
Result:
- One replica per zone (topologySpreadConstraints)
- No two replicas on the same node (podAntiAffinity)
- Maximum fault tolerance for stateful workload
Mixed Criticality
# Critical pods: strict spreading
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-service
spec:
  replicas: 4
  selector:
    matchLabels:
      app: payment-service
  template:
    metadata:
      labels:
        app: payment-service
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: payment-service
---
# Less critical: soft spreading
apiVersion: apps/v1
kind: Deployment
metadata:
  name: metrics-collector
spec:
  replicas: 4
  selector:
    matchLabels:
      app: metrics-collector
  template:
    metadata:
      labels:
        app: metrics-collector
    spec:
      topologySpreadConstraints:
        - maxSkew: 2
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: metrics-collector
Combining with Pod Anti-Affinity
Topology Spread Constraints and Pod Anti-Affinity serve different purposes:
| Feature | Purpose |
|---|---|
| Topology Spread | Even distribution across domains |
| Pod Anti-Affinity | Keep specific pods apart |
When to Use Each
Topology Spread alone:
# 6 replicas across 3 zones
# Allows: zone-a: 2, zone-b: 2, zone-c: 2
# Allows: zone-a: 2, zone-b: 3, zone-c: 1 (if maxSkew: 2)
Anti-Affinity alone:
# No two pods on same node
# Could result in: zone-a: 4, zone-b: 1, zone-c: 1
Both together:
# Spread across zones AND no two on same node
# Best of both worlds
Complete Example
spec:
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          app: myapp
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchLabels:
                app: myapp
            topologyKey: kubernetes.io/hostname
Debugging
Check Current Distribution
# See which nodes pods are on
kubectl get pods -l app=web-api -o wide
# Count pods per node
kubectl get pods -l app=web-api -o jsonpath='{range .items[*]}{.spec.nodeName}{"\n"}{end}' | sort | uniq -c
# Count pods per zone
kubectl get pods -l app=web-api -o jsonpath='{range .items[*]}{.spec.nodeName}{"\n"}{end}' | \
xargs -I{} kubectl get node {} -o jsonpath='{.metadata.labels.topology\.kubernetes\.io/zone}{"\n"}' | \
sort | uniq -c
Why Is My Pod Pending?
kubectl describe pod <pod-name>
# Look for:
# Warning FailedScheduling default-scheduler
# 0/6 nodes are available: 2 node(s) didn't match pod topology spread constraints
Common Issues
1. No matching nodes
0/3 nodes are available: 3 node(s) didn't match pod topology spread constraints
- maxSkew too strict for current topology
- Solution: Increase maxSkew or add nodes
2. Label selector mismatch
# Constraint counts pods with app=web
labelSelector:
  matchLabels:
    app: web
# But the deployment's pods are labeled app=web-api
# Constraint sees 0 matching pods and is trivially satisfied — no spreading happens
3. Node not labeled
# Check node labels
kubectl get nodes --show-labels | grep topology.kubernetes.io/zone
# Add missing labels
kubectl label node node-1 topology.kubernetes.io/zone=zone-a
Cluster-Level Defaults
Set default constraints for all pods:
# kube-scheduler-config.yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: default-scheduler
    pluginConfig:
      - name: PodTopologySpread
        args:
          defaultConstraints:
            - maxSkew: 1
              topologyKey: topology.kubernetes.io/zone
              whenUnsatisfiable: ScheduleAnyway
          defaultingType: List
Best Practices
1. Start with Zones
# Zone spreading is usually more important than node spreading
topologyKey: topology.kubernetes.io/zone
2. Use Strict for Critical Services
# Payment service can't afford zone imbalance
whenUnsatisfiable: DoNotSchedule
3. Use Soft for Best-Effort
# Logging can handle some imbalance
whenUnsatisfiable: ScheduleAnyway
maxSkew: 2
4. Match Label Selectors Carefully
# Must match the pods you want to spread
labelSelector:
  matchLabels:
    app: web-api
# Don't include a version label if you want all versions spread together
5. Consider Scale-Down Behavior
When scaling down, Kubernetes doesn't rebalance: topology spread constraints are evaluated only when a pod is scheduled, and pods are removed without regard to spread. You may end up with:
# After scaling 6 → 3 pods
zone-a: 2 pods
zone-b: 1 pod
zone-c: 0 pods
Use the Descheduler to rebalance.
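The Descheduler's `RemovePodsViolatingTopologySpreadConstraint` plugin evicts pods that violate spread constraints so the scheduler can place replacements in better positions. A minimal policy sketch, assuming the v1alpha2 policy API (field layout may differ between Descheduler releases, so check the docs for your version):

```yaml
# descheduler-policy.yaml — sketch only
apiVersion: "descheduler/v1alpha2"
kind: "DeschedulerPolicy"
profiles:
  - name: rebalance
    plugins:
      balance:
        enabled:
          - "RemovePodsViolatingTopologySpreadConstraint"
    pluginConfig:
      - name: "RemovePodsViolatingTopologySpreadConstraint"
        args:
          constraints:
            - DoNotSchedule  # only act on hard constraints
```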
Quick Reference
topologySpreadConstraints:
  # Spread across zones
  - maxSkew: 1                                  # Max imbalance
    topologyKey: topology.kubernetes.io/zone    # Group by zone
    whenUnsatisfiable: DoNotSchedule            # Strict
    labelSelector:
      matchLabels:
        app: myapp
  # Also spread across nodes (soft)
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: ScheduleAnyway           # Best-effort
    labelSelector:
      matchLabels:
        app: myapp
Conclusion
Topology Spread Constraints give you precise control over workload distribution:
- Zone spreading - Survive zone failures
- Node spreading - Survive node failures
- Custom topologies - Match your infrastructure
Don’t rely on luck for availability. Define your spreading requirements explicitly, and Kubernetes will enforce them.