
Spot Instance Patterns: Graceful Handling and Cost Savings

AWS · K8s


Spot Instances offer 60-90% savings over On-Demand pricing, but AWS can reclaim them with only a two-minute warning. This guide covers production patterns for handling interruptions gracefully.

TL;DR

  • Spot = unused EC2 capacity at steep discounts
  • 2-minute interruption warning
  • Diversify instance types/AZs
  • Handle SIGTERM gracefully
  • Mix spot + on-demand for reliability

Spot Basics

PRICING                     RELIABILITY
=======                     ===========
On-Demand: $0.10/hr         99.99%
Spot: $0.02/hr (80% off)    ~95-98% (varies)
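Using the illustrative prices from the table above, the blended cost of a mixed fleet is a straight weighted average. A quick sketch (the 70/30 spot/on-demand split is a hypothetical ratio, not a recommendation):

```go
package main

import "fmt"

// blendedHourly returns the average hourly cost per instance for a fleet
// running spotFraction of its capacity on spot and the rest on-demand.
func blendedHourly(onDemand, spot, spotFraction float64) float64 {
	return spotFraction*spot + (1-spotFraction)*onDemand
}

func main() {
	onDemand, spotPrice := 0.10, 0.02 // illustrative prices from the table above

	cost := blendedHourly(onDemand, spotPrice, 0.7)
	savings := (1 - cost/onDemand) * 100

	fmt.Printf("blended: $%.3f/hr (%.0f%% cheaper than all on-demand)\n", cost, savings)
	// blended: $0.044/hr (56% cheaper than all on-demand)
}
```

Even with a conservative on-demand baseline, the blended rate stays far below pure on-demand pricing.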

Interruption causes:

  • Price exceeds your max (if set)
  • Capacity needed for on-demand
  • Pool depleted

Kubernetes Integration

Node Termination Handler

helm repo add eks https://aws.github.io/eks-charts
helm upgrade --install aws-node-termination-handler eks/aws-node-termination-handler \
  --namespace kube-system \
  --set enableSpotInterruptionDraining=true \
  --set enableScheduledEventDraining=true \
  --set enableRebalanceMonitoring=true \
  --set enableRebalanceDraining=true

Karpenter Spot Configuration

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: spot-diversified
spec:
  template:
    spec:
      requirements:
        # Wide variety of instance types for diversification
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r"]
        - key: karpenter.k8s.aws/instance-size
          operator: In
          values: ["large", "xlarge", "2xlarge"]
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64", "arm64"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
        - key: topology.kubernetes.io/zone
          operator: In
          values: ["eu-west-2a", "eu-west-2b", "eu-west-2c"]
      
      nodeClassRef:
        name: default
  
  disruption:
    consolidationPolicy: WhenUnderutilized
    consolidateAfter: 30s
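The NodePool above references a nodeClassRef named default. A matching EC2NodeClass might look like the following sketch — the AMI family, role name, and discovery tags are placeholders for your environment:

```yaml
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiFamily: AL2
  role: "KarpenterNodeRole-my-cluster"      # placeholder role name
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster  # placeholder discovery tag
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
```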

Mixed Instance Groups (EKS)

# Terraform
resource "aws_eks_node_group" "mixed" {
  cluster_name    = aws_eks_cluster.main.name
  node_group_name = "mixed-spot-ondemand"
  node_role_arn   = aws_iam_role.node.arn
  subnet_ids      = aws_subnet.private[*].id

  capacity_type = "SPOT"

  instance_types = [
    "m5.large", "m5a.large", "m5n.large",
    "m6i.large", "m6a.large",
    "c5.large", "c5a.large", "c6i.large",
    "r5.large", "r5a.large", "r6i.large"
  ]

  scaling_config {
    desired_size = 5
    max_size     = 20
    min_size     = 2
  }

  labels = {
    "node-type" = "spot"
  }

  taint {
    key    = "spot"
    value  = "true"
    effect = "NO_SCHEDULE"
  }
}

# On-demand baseline
resource "aws_eks_node_group" "ondemand" {
  cluster_name    = aws_eks_cluster.main.name
  node_group_name = "ondemand-baseline"
  node_role_arn   = aws_iam_role.node.arn
  subnet_ids      = aws_subnet.private[*].id

  capacity_type = "ON_DEMAND"
  instance_types = ["m6i.large"]

  scaling_config {
    desired_size = 2
    max_size     = 5
    min_size     = 2
  }

  labels = {
    "node-type" = "ondemand"
  }
}

Graceful Shutdown

Application Side

package main

import (
    "context"
    "errors"
    "log"
    "net/http"
    "os"
    "os/signal"
    "syscall"
    "time"
)

func main() {
    srv := &http.Server{Addr: ":8080"}

    go func() {
        if err := srv.ListenAndServe(); !errors.Is(err, http.ErrServerClosed) {
            log.Fatal(err)
        }
    }()

    // Wait for SIGTERM (sent by the kubelet when the spot node is drained)
    quit := make(chan os.Signal, 1)
    signal.Notify(quit, syscall.SIGTERM, syscall.SIGINT)
    <-quit

    log.Println("Shutting down gracefully...")

    // 25 seconds to drain: leaves headroom inside the pod's 30s
    // terminationGracePeriodSeconds, itself well inside the 2-minute warning
    ctx, cancel := context.WithTimeout(context.Background(), 25*time.Second)
    defer cancel()

    // Stop accepting new requests and wait for in-flight ones to finish
    if err := srv.Shutdown(ctx); err != nil {
        log.Printf("Shutdown error: %v", err)
    }

    // Cleanup: flush buffers, close connections
    cleanup()

    log.Println("Shutdown complete")
}

// cleanup releases application resources; the body is app-specific
// (flush metric buffers, close database pools, deregister from discovery).
func cleanup() {}

Pod Disruption Budget

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-server-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: api-server

Preemption Settings

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 30
      
      containers:
        - name: api
          lifecycle:
            preStop:
              exec:
                command:
                  - /bin/sh
                  - -c
                  - "sleep 5 && /app/drain.sh"
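These numbers have to fit together: the kubelet runs the preStop hook first, then sends SIGTERM, and the terminationGracePeriodSeconds countdown covers both phases before SIGKILL. A quick budget check using the values in this post (and assuming the drain script itself is near-instant — a hypothetical simplification):

```go
package main

import "fmt"

// graceBudget reports the headroom (in seconds) before SIGKILL, given the
// preStop duration, the app's shutdown timeout, and the pod's grace period.
func graceBudget(preStopSec, shutdownSec, gracePeriodSec int) int {
	return gracePeriodSec - (preStopSec + shutdownSec)
}

func main() {
	// 5s preStop sleep + 25s app drain timeout vs a 30s grace period.
	fmt.Printf("headroom before SIGKILL: %ds\n", graceBudget(5, 25, 30))
	// headroom before SIGKILL: 0s
}
```

Zero headroom is cutting it close; raising terminationGracePeriodSeconds to around 40 (still comfortably inside the 2-minute spot warning) gives the drain script real time to run.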

Workload Patterns

Pattern 1: Spot-Tolerant Stateless

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  replicas: 6
  template:
    spec:
      # Tolerate spot taint
      tolerations:
        - key: spot
          operator: Equal
          value: "true"
          effect: NoSchedule
      
      # Spread across zones
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: api-server
      
      # Prefer spot, fallback to on-demand
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              preference:
                matchExpressions:
                  - key: node-type
                    operator: In
                    values: ["spot"]

Pattern 2: Critical on On-Demand

apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-service
spec:
  replicas: 3
  template:
    spec:
      # Force on-demand
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: node-type
                    operator: In
                    values: ["ondemand"]
      
      # Do NOT tolerate spot taint
      tolerations: []

Pattern 3: Hybrid

# 2 replicas on on-demand (baseline)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server-ondemand
spec:
  replicas: 2
  template:
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: node-type
                    operator: In
                    values: ["ondemand"]

---
# Remaining on spot
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server-spot
spec:
  replicas: 4
  template:
    spec:
      tolerations:
        - key: spot
          operator: Equal
          value: "true"
          effect: NoSchedule
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: node-type
                    operator: In
                    values: ["spot"]

Monitoring Spot

# Prometheus alerts
groups:
  - name: spot-instances
    rules:
      - alert: SpotInterruptionWarning
        expr: increase(aws_node_termination_handler_actions_total{action="drain"}[5m]) > 0
        labels:
          severity: warning
        annotations:
          summary: Spot instance interruption detected

      - alert: HighSpotInterruptionRate
        expr: rate(aws_node_termination_handler_actions_total{action="drain"}[1h]) > 2
        labels:
          severity: warning
        annotations:
          summary: High rate of spot interruptions
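These rules can be unit-tested offline with promtool test rules. A sketch, assuming the rules above are saved as spot-alerts.yaml (filenames and the sample series values are illustrative):

```yaml
# spot-alerts-test.yaml -- run with: promtool test rules spot-alerts-test.yaml
rule_files:
  - spot-alerts.yaml
evaluation_interval: 1m

tests:
  - interval: 1m
    # One drain action appears between t=1m and t=2m.
    input_series:
      - series: 'aws_node_termination_handler_actions_total{action="drain"}'
        values: '0 0 1 1 1'
    alert_rule_test:
      - eval_time: 3m
        alertname: SpotInterruptionWarning
        exp_alerts:
          - exp_labels:
              severity: warning
              action: drain
            exp_annotations:
              summary: Spot instance interruption detected
```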
