Pod Topology Spread Constraints - Distributing Workloads Intelligently
You have 6 replicas. 3 nodes. Kubernetes puts all 6 pods on node-1 because it has the most resources. Then node-1 dies. You’re down.
Pod anti-affinity rules help, but they're blunt instruments. They say "don't put these pods on the same node" but don't guarantee even distribution.
Topology Spread Constraints give you precise control over pod distribution across zones, nodes, or any topology you define. They’re the difference between “hopefully spread out” and “guaranteed spread.”
TL;DR
- Topology Spread Constraints control pod distribution across failure domains
- Use `maxSkew` to define acceptable imbalance
- Choose `whenUnsatisfiable`: `DoNotSchedule` (strict) or `ScheduleAnyway` (soft)
- Spread across zones for availability, nodes for resource efficiency
- Combine with pod anti-affinity for complete scheduling control
Code Repository: All code from this post is available at github.com/moabukar/blog-code/pod-topology-spread-constraints
The Problem
Default scheduling optimizes for resource efficiency, not availability:
# 6 replicas, 3 nodes
# Default scheduling might produce:
Node-1 (lots of resources): pod-1, pod-2, pod-3, pod-4
Node-2 (some resources): pod-5
Node-3 (some resources): pod-6
# If Node-1 fails: 4 of 6 pods gone = degraded service
What we want:
Node-1: pod-1, pod-2
Node-2: pod-3, pod-4
Node-3: pod-5, pod-6
# If any node fails: only 2 of 6 pods affected = service continues
Basic Syntax
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 6
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: web
      containers:
        - name: web
          image: nginx
Key Fields
| Field | Description |
|---|---|
| `maxSkew` | Maximum allowed difference in pod counts between topology domains |
| `topologyKey` | Node label to group by (hostname, zone, region) |
| `whenUnsatisfiable` | What to do if the constraint can't be met |
| `labelSelector` | Which pods to count for distribution |
Understanding maxSkew
`maxSkew: 1` means the difference between the most and least populated topology domains can be at most 1.
# 6 pods, 3 nodes, maxSkew: 1
# Example distributions:
Node-1: 2 pods
Node-2: 2 pods
Node-3: 2 pods
# Skew = 2 - 2 = 0 ✓
Node-1: 3 pods
Node-2: 2 pods
Node-3: 1 pod
# Skew = 3 - 1 = 2 ✗ (exceeds maxSkew: 1)
Node-1: 2 pods
Node-2: 2 pods
Node-3: 1 pod
# Skew = 2 - 1 = 1 ✓ (a valid intermediate state while the 6th pod is placed)
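To make the arithmetic concrete, here's a small shell sketch that computes the skew from a pod-to-node listing. The hard-coded `pod_nodes` variable stands in for real cluster output; the `kubectl` command in the comment is the kind of query shown in the Debugging section below.

```shell
# Skew = (pods in the most-populated domain) - (pods in the least-populated one).
# pod_nodes stands in for the output of:
#   kubectl get pods -l app=web -o jsonpath='{range .items[*]}{.spec.nodeName}{"\n"}{end}'
pod_nodes='node-1
node-1
node-2
node-2
node-3'

echo "$pod_nodes" | sort | uniq -c | awk '
  NR == 1 { max = $1; min = $1 }              # initialise with the first count
  { if ($1 > max) max = $1; if ($1 < min) min = $1 }
  END { print "skew =", max - min }'
# skew = 1, so this placement satisfies maxSkew: 1
```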
maxSkew Values
- `maxSkew: 1` - Strictly even distribution (recommended for HA)
- `maxSkew: 2` - Some imbalance allowed
- `maxSkew: N` - More flexibility, less guarantee
whenUnsatisfiable Modes
DoNotSchedule (Strict)
whenUnsatisfiable: DoNotSchedule
If placing a pod would violate the constraint, don’t schedule it. Pod stays Pending.
Use when: Availability is critical. Better to have fewer pods than violate the spread.
ScheduleAnyway (Soft)
whenUnsatisfiable: ScheduleAnyway
Try to satisfy the constraint, but schedule anyway if impossible. Scheduler still prefers compliant placements.
Use when: You want best-effort spreading but can’t guarantee topology (e.g., autoscaling might create uneven node counts).
Common Topology Keys
By Node
topologyKey: kubernetes.io/hostname
Spread across individual nodes. Good for node failure tolerance.
By Zone
topologyKey: topology.kubernetes.io/zone
Spread across availability zones. Essential for zone failure tolerance.
By Region
topologyKey: topology.kubernetes.io/region
Spread across regions. For multi-region deployments.
Custom Labels
# Nodes labeled with: rack=rack-1, rack=rack-2, etc.
topologyKey: rack
Spread across custom failure domains like racks, power zones, etc.
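As a sketch, assuming every node has been given a custom `rack` label (e.g. `kubectl label node node-1 rack=rack-1`) and using an illustrative `app: myapp` workload, the constraint looks the same as the zone examples, just with the custom key:

```yaml
# Hypothetical rack-spreading constraint; assumes all nodes carry a rack label
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: rack              # custom node label, not a built-in key
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: myapp
```

Nodes missing the `rack` label don't count as a topology domain for this constraint, so label every node before relying on it.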
Real-World Examples
High Availability Web Service
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-api
spec:
  replicas: 6
  selector:
    matchLabels:
      app: web-api
  template:
    metadata:
      labels:
        app: web-api
    spec:
      topologySpreadConstraints:
        # Spread across zones (primary)
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: web-api
        # Also spread across nodes within zones (secondary)
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: web-api
      containers:
        - name: web-api
          image: myapp:latest
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
Result:
- Strictly spread across zones (DoNotSchedule)
- Best-effort spread across nodes within zones (ScheduleAnyway)
- Zone failure takes out at most ~33% of pods
Database Replicas
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
spec:
  serviceName: postgres  # headless Service; required for StatefulSets
  replicas: 3
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: postgres
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: postgres
              topologyKey: kubernetes.io/hostname
      containers:
        - name: postgres
          image: postgres:15
Result:
- One replica per zone (topologySpreadConstraints)
- No two replicas on the same node (podAntiAffinity)
- Maximum fault tolerance for stateful workload
Mixed Criticality
# Critical pods: strict spreading
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-service
spec:
  replicas: 4
  selector:
    matchLabels:
      app: payment-service
  template:
    metadata:
      labels:
        app: payment-service
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: payment-service
---
# Less critical: soft spreading
apiVersion: apps/v1
kind: Deployment
metadata:
  name: metrics-collector
spec:
  replicas: 4
  selector:
    matchLabels:
      app: metrics-collector
  template:
    metadata:
      labels:
        app: metrics-collector
    spec:
      topologySpreadConstraints:
        - maxSkew: 2
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: metrics-collector
Combining with Pod Anti-Affinity
Topology Spread Constraints and Pod Anti-Affinity serve different purposes:
| Feature | Purpose |
|---|---|
| Topology Spread | Even distribution across domains |
| Pod Anti-Affinity | Keep specific pods apart |
When to Use Each
Topology Spread alone:
# 6 replicas across 3 zones
# Allows: zone-a: 2, zone-b: 2, zone-c: 2
# Allows: zone-a: 2, zone-b: 3, zone-c: 1 (if maxSkew: 2)
Anti-Affinity alone:
# No two pods on same node
# Could result in: zone-a: 4, zone-b: 1, zone-c: 1
Both together:
# Spread across zones AND no two on same node
# Best of both worlds
Complete Example
spec:
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          app: myapp
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchLabels:
                app: myapp
            topologyKey: kubernetes.io/hostname
Debugging
Check Current Distribution
# See which nodes pods are on
kubectl get pods -l app=web-api -o wide
# Count pods per node
kubectl get pods -l app=web-api -o jsonpath='{range .items[*]}{.spec.nodeName}{"\n"}{end}' | sort | uniq -c
# Count pods per zone
kubectl get pods -l app=web-api -o jsonpath='{range .items[*]}{.spec.nodeName}{"\n"}{end}' | \
xargs -I{} kubectl get node {} -o jsonpath='{.metadata.labels.topology\.kubernetes\.io/zone}{"\n"}' | \
sort | uniq -c
Why Is My Pod Pending?
kubectl describe pod <pod-name>
# Look for:
# Warning FailedScheduling default-scheduler
# 0/6 nodes are available: 2 node(s) didn't match pod topology spread constraints
Common Issues
1. No matching nodes
0/3 nodes are available: 3 node(s) didn't match pod topology spread constraints
- maxSkew too strict for current topology
- Solution: Increase maxSkew or add nodes
2. Label selector mismatch
# Constraint counts pods with app=web
labelSelector:
  matchLabels:
    app: web
# But the deployment's pods are labeled app=web-api
# Constraint sees 0 matching pods and is trivially satisfied — no spreading happens
3. Node not labeled
# Check node labels
kubectl get nodes --show-labels | grep topology.kubernetes.io/zone
# Add missing labels
kubectl label node node-1 topology.kubernetes.io/zone=zone-a
Cluster-Level Defaults
Set default constraints for all pods:
# kube-scheduler-config.yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: default-scheduler
    pluginConfig:
      - name: PodTopologySpread
        args:
          defaultConstraints:
            - maxSkew: 1
              topologyKey: topology.kubernetes.io/zone
              whenUnsatisfiable: ScheduleAnyway
          defaultingType: List
Best Practices
1. Start with Zones
# Zone spreading is usually more important than node spreading
topologyKey: topology.kubernetes.io/zone
2. Use Strict for Critical Services
# Payment service can't afford zone imbalance
whenUnsatisfiable: DoNotSchedule
3. Use Soft for Best-Effort
# Logging can handle some imbalance
whenUnsatisfiable: ScheduleAnyway
maxSkew: 2
4. Match Label Selectors Carefully
# Must match the pods you want to spread
labelSelector:
  matchLabels:
    app: web-api
# Don't include a version label if you want all versions spread together
5. Consider Scale-Down Behavior
When scaling down, Kubernetes doesn't rebalance: topology spread constraints are evaluated only when a pod is scheduled, and pods are removed without regard to spread. You may end up with:
# After scaling 6 → 3 pods
zone-a: 2 pods
zone-b: 1 pod
zone-c: 0 pods
Use the Descheduler to rebalance.
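The Descheduler's `RemovePodsViolatingTopologySpreadConstraint` plugin evicts pods that violate spread constraints so the scheduler can place replacements in better positions. A minimal policy sketch, assuming the v1alpha2 policy API (field layout may differ between Descheduler releases, so check the docs for your version):

```yaml
# descheduler-policy.yaml — sketch only
apiVersion: "descheduler/v1alpha2"
kind: "DeschedulerPolicy"
profiles:
  - name: rebalance
    plugins:
      balance:
        enabled:
          - "RemovePodsViolatingTopologySpreadConstraint"
    pluginConfig:
      - name: "RemovePodsViolatingTopologySpreadConstraint"
        args:
          constraints:
            - DoNotSchedule  # only act on hard constraints
```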
Quick Reference
topologySpreadConstraints:
  # Spread across zones
  - maxSkew: 1                                  # Max imbalance
    topologyKey: topology.kubernetes.io/zone    # Group by zone
    whenUnsatisfiable: DoNotSchedule            # Strict
    labelSelector:
      matchLabels:
        app: myapp
  # Also spread across nodes (soft)
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: ScheduleAnyway           # Best-effort
    labelSelector:
      matchLabels:
        app: myapp
Conclusion
Topology Spread Constraints give you precise control over workload distribution:
- Zone spreading - Survive zone failures
- Node spreading - Survive node failures
- Custom topologies - Match your infrastructure
Don’t rely on luck for availability. Define your spreading requirements explicitly, and Kubernetes will enforce them.