Spot Instance Patterns: Graceful Handling and Cost Savings
Spot Instances offer 60-90% savings over On-Demand pricing, but can be interrupted with only a two-minute warning. This guide covers production patterns for handling interruptions gracefully.
TL;DR
- Spot = unused EC2 capacity at steep discounts
- 2-minute interruption warning
- Diversify instance types/AZs
- Handle SIGTERM gracefully
- Mix spot + on-demand for reliability
Spot Basics
             PRICING              RELIABILITY
             =======              ===========
On-Demand:   $0.10/hr             99.99%
Spot:        $0.02/hr (80% off)   ~95-98% (varies)
Interruption causes:
- Price exceeds your max (if set)
- Capacity needed for on-demand
- Pool depleted
Kubernetes Integration
Node Termination Handler
helm repo add eks https://aws.github.io/eks-charts
helm upgrade --install aws-node-termination-handler eks/aws-node-termination-handler \
  --namespace kube-system \
  --set enableSpotInterruptionDraining=true \
  --set enableScheduledEventDraining=true \
  --set enableRebalanceMonitoring=true \
  --set enableRebalanceDraining=true
Karpenter Spot Configuration
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: spot-diversified
spec:
  template:
    spec:
      requirements:
        # Wide variety of instance types for diversification
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r"]
        - key: karpenter.k8s.aws/instance-size
          operator: In
          values: ["large", "xlarge", "2xlarge"]
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64", "arm64"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
        - key: topology.kubernetes.io/zone
          operator: In
          values: ["eu-west-2a", "eu-west-2b", "eu-west-2c"]
      nodeClassRef:
        name: default
  disruption:
    consolidationPolicy: WhenUnderutilized
    consolidateAfter: 30s
Mixed Instance Groups (EKS)
# Terraform
resource "aws_eks_node_group" "mixed" {
  cluster_name    = aws_eks_cluster.main.name
  node_group_name = "mixed-spot-ondemand"
  node_role_arn   = aws_iam_role.node.arn
  subnet_ids      = aws_subnet.private[*].id

  capacity_type = "SPOT"
  instance_types = [
    "m5.large", "m5a.large", "m5n.large",
    "m6i.large", "m6a.large",
    "c5.large", "c5a.large", "c6i.large",
    "r5.large", "r5a.large", "r6i.large"
  ]

  scaling_config {
    desired_size = 5
    max_size     = 20
    min_size     = 2
  }

  labels = {
    "node-type" = "spot"
  }

  taint {
    key    = "spot"
    value  = "true"
    effect = "NO_SCHEDULE"
  }
}

# On-demand baseline
resource "aws_eks_node_group" "ondemand" {
  cluster_name    = aws_eks_cluster.main.name
  node_group_name = "ondemand-baseline"
  node_role_arn   = aws_iam_role.node.arn
  subnet_ids      = aws_subnet.private[*].id

  capacity_type  = "ON_DEMAND"
  instance_types = ["m6i.large"]

  scaling_config {
    desired_size = 2
    max_size     = 5
    min_size     = 2
  }

  labels = {
    "node-type" = "ondemand"
  }
}
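With the illustrative prices from the table above ($0.10/hr on-demand, $0.02/hr spot) and the desired sizes in these node groups (5 spot + 2 on-demand), the blended rate of the fleet works out as follows. A quick sketch; the prices are placeholders, not real quotes:

```go
package main

import "fmt"

func main() {
	const (
		spotRate     = 0.02 // $/hr, illustrative spot price from the table above
		onDemandRate = 0.10 // $/hr, illustrative on-demand price
		spotNodes    = 5    // desired_size of the spot node group
		odNodes      = 2    // desired_size of the on-demand baseline
	)
	total := spotNodes*spotRate + odNodes*onDemandRate
	blended := total / (spotNodes + odNodes)
	savings := 1 - blended/onDemandRate
	fmt.Printf("fleet: $%.2f/hr, blended: $%.4f/hr, savings vs all on-demand: %.0f%%\n",
		total, blended, savings*100)
}
```

At these numbers the blended rate is about $0.043/hr, roughly 57% cheaper than an all on-demand fleet even with a guaranteed on-demand baseline.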
Graceful Shutdown
Application Side
package main

import (
	"context"
	"errors"
	"log"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func cleanup() {
	// Flush buffers, close DB connections, deregister from service discovery.
}

func main() {
	srv := &http.Server{Addr: ":8080"}

	go func() {
		if err := srv.ListenAndServe(); !errors.Is(err, http.ErrServerClosed) {
			log.Fatal(err)
		}
	}()

	// Wait for SIGTERM (sent when the node drains on spot interruption)
	quit := make(chan os.Signal, 1)
	signal.Notify(quit, syscall.SIGTERM, syscall.SIGINT)
	<-quit

	log.Println("Shutting down gracefully...")

	// 25 seconds to drain: fits inside terminationGracePeriodSeconds (30s below),
	// well within the 2-minute deadline
	ctx, cancel := context.WithTimeout(context.Background(), 25*time.Second)
	defer cancel()

	// Stop accepting new requests and wait for in-flight ones to finish
	if err := srv.Shutdown(ctx); err != nil {
		log.Printf("Shutdown error: %v", err)
	}

	// Cleanup: flush buffers, close connections
	cleanup()
	log.Println("Shutdown complete")
}
Pod Disruption Budget
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-server-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: api-server
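A PDB gates voluntary evictions, including the Node Termination Handler's drain: an eviction only proceeds while it would leave at least minAvailable healthy pods behind. A toy sketch of that rule; the real accounting lives in the Kubernetes disruption controller, which also supports maxUnavailable:

```go
package main

import "fmt"

// evictionAllowed mirrors the minAvailable rule: evicting one pod must
// leave at least minAvailable healthy pods behind.
func evictionAllowed(healthyPods, minAvailable int) bool {
	return healthyPods-1 >= minAvailable
}

func main() {
	// With minAvailable: 2, a drain proceeds only while 3+ pods are healthy.
	fmt.Println(evictionAllowed(3, 2)) // true
	fmt.Println(evictionAllowed(2, 2)) // false: drain blocks until a replacement is Ready
}
```

This is also why minAvailable should not be set too close to the replica count: a blocked drain stalls until the 2-minute deadline, at which point the pods are killed hard anyway.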
Preemption Settings
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 30
      containers:
        - name: api
          lifecycle:
            preStop:
              exec:
                command:
                  - /bin/sh
                  - -c
                  - "sleep 5 && /app/drain.sh"
Workload Patterns
Pattern 1: Spot-Tolerant Stateless
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  replicas: 6
  template:
    spec:
      # Tolerate spot taint
      tolerations:
        - key: spot
          operator: Equal
          value: "true"
          effect: NoSchedule
      # Spread across zones
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: api-server
      # Prefer spot, fall back to on-demand
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              preference:
                matchExpressions:
                  - key: node-type
                    operator: In
                    values: ["spot"]
Pattern 2: Critical on On-Demand
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-service
spec:
  replicas: 3
  template:
    spec:
      # Force on-demand
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: node-type
                    operator: In
                    values: ["ondemand"]
      # Do NOT tolerate the spot taint
      tolerations: []
Pattern 3: Hybrid
# 2 replicas on on-demand (baseline)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server-ondemand
spec:
  replicas: 2
  template:
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: node-type
                    operator: In
                    values: ["ondemand"]
---
# Remaining on spot
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server-spot
spec:
  replicas: 4
  template:
    spec:
      tolerations:
        - key: spot
          operator: Equal
          value: "true"
          effect: NoSchedule
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: node-type
                    operator: In
                    values: ["spot"]
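The hybrid split follows a simple rule: size the on-demand baseline to the minimum capacity the service needs if every spot node disappears at once, and put the rest on spot. A sketch of that arithmetic; the numbers match the manifests above:

```go
package main

import "fmt"

// splitReplicas returns the on-demand/spot split: the on-demand baseline
// covers the minimum capacity required to survive a total spot loss.
func splitReplicas(total, minDuringSpotLoss int) (onDemand, spot int) {
	onDemand = minDuringSpotLoss
	if onDemand > total {
		onDemand = total
	}
	return onDemand, total - onDemand
}

func main() {
	od, spot := splitReplicas(6, 2) // 6 total replicas, must keep 2 alive
	fmt.Printf("on-demand: %d, spot: %d\n", od, spot) // on-demand: 2, spot: 4
}
```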
Monitoring Spot
# Prometheus alerts
groups:
  - name: spot-instances
    rules:
      - alert: SpotInterruptionWarning
        expr: increase(aws_node_termination_handler_actions_total{action="drain"}[5m]) > 0
        labels:
          severity: warning
        annotations:
          summary: Spot instance interruption detected
      - alert: HighSpotInterruptionRate
        expr: rate(aws_node_termination_handler_actions_total{action="drain"}[1h]) > 2
        labels:
          severity: warning
        annotations:
          summary: High rate of spot interruptions
References
- Spot Best Practices: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/spot-best-practices.html
- Node Termination Handler: https://github.com/aws/aws-node-termination-handler
- Karpenter Spot: https://karpenter.sh/docs/concepts/scheduling/#capacity-type