Chaos Engineering with Litmus: Controlled Failure Injection
Hope is not a strategy. Chaos engineering proves that your system can handle failures before a production incident forces the issue. LitmusChaos is a Kubernetes-native platform for defining, running, and validating chaos experiments as custom resources.
TL;DR
- LitmusChaos = K8s-native chaos experiments
- ChaosHub = library of pre-built experiments
- Pod, network, node, and application-level chaos
- Integrates with CI/CD for automated resilience testing
- Full examples with GameDay patterns
Install LitmusChaos
```bash
helm repo add litmuschaos https://litmuschaos.github.io/litmus-helm/
helm upgrade --install litmus litmuschaos/litmus \
  --namespace litmus --create-namespace \
  --set portal.frontend.service.type=ClusterIP
```
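The ChaosEngine manifests in this guide reference a `chaosServiceAccount` named `litmus-admin`, which must exist in the target namespace with enough RBAC to create and observe experiment pods. The rules below are a minimal sketch, not the official manifest — the exact resources and verbs vary per experiment, so prefer the RBAC files shipped alongside each ChaosHub experiment:

```yaml
# Hypothetical minimal RBAC for pod-level experiments in "production".
# Treat this as a starting point; each ChaosHub experiment documents its own needs.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: litmus-admin
  namespace: production
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: litmus-admin
  namespace: production
rules:
  - apiGroups: [""]
    resources: ["pods", "events", "pods/log", "pods/exec"]
    verbs: ["create", "get", "list", "patch", "update", "delete", "deletecollection"]
  - apiGroups: ["apps"]
    resources: ["deployments", "replicasets"]
    verbs: ["get", "list"]
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["create", "get", "list", "delete", "deletecollection"]
  - apiGroups: ["litmuschaos.io"]
    resources: ["chaosengines", "chaosexperiments", "chaosresults"]
    verbs: ["create", "get", "list", "patch", "update", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: litmus-admin
  namespace: production
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: litmus-admin
subjects:
  - kind: ServiceAccount
    name: litmus-admin
    namespace: production
```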
Pod Chaos Experiments
Pod Delete
```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: pod-delete-chaos
  namespace: production
spec:
  appinfo:
    appns: production
    applabel: app=api-server
    appkind: deployment
  engineState: active
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION   # total experiment runtime, seconds
              value: "60"
            - name: CHAOS_INTERVAL         # seconds between successive deletions
              value: "10"
            - name: FORCE                  # "false" = graceful pod deletion
              value: "false"
            - name: PODS_AFFECTED_PERC     # percentage of matching pods to target
              value: "50"
```
Container Kill
```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: container-kill-chaos
  namespace: production
spec:
  appinfo:
    appns: production
    applabel: app=api-server
    appkind: deployment
  engineState: active
  chaosServiceAccount: litmus-admin
  experiments:
    - name: container-kill
      spec:
        components:
          env:
            - name: TARGET_CONTAINER
              value: api
            - name: TOTAL_CHAOS_DURATION
              value: "30"
            - name: CHAOS_INTERVAL
              value: "10"
```
Network Chaos
Network Latency
```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: network-latency-chaos
  namespace: production
spec:
  appinfo:
    appns: production
    applabel: app=api-server
    appkind: deployment
  engineState: active
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-network-latency
      spec:
        components:
          env:
            - name: NETWORK_INTERFACE
              value: eth0
            - name: NETWORK_LATENCY        # injected latency, milliseconds
              value: "200"
            - name: TOTAL_CHAOS_DURATION
              value: "60"
            - name: TARGET_PODS
              value: "1"
            - name: DESTINATION_IPS        # empty = no IP-based filter
              value: ""
            - name: DESTINATION_HOSTS      # only traffic to these hosts is delayed
              value: "postgres.production.svc.cluster.local"
```
Network Loss
```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: network-loss-chaos
  namespace: production
spec:
  appinfo:
    appns: production
    applabel: app=api-server
    appkind: deployment
  engineState: active
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-network-loss
      spec:
        components:
          env:
            - name: NETWORK_INTERFACE
              value: eth0
            - name: NETWORK_PACKET_LOSS_PERCENTAGE
              value: "30"
            - name: TOTAL_CHAOS_DURATION
              value: "60"
```
DNS Chaos
```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: dns-chaos
  namespace: production
spec:
  appinfo:
    appns: production
    applabel: app=api-server
    appkind: deployment
  engineState: active
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-dns-error
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "60"
            - name: TARGET_HOSTNAMES
              value: "api.external.com,database.internal"
```
Resource Stress
CPU Stress
```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: cpu-stress-chaos
  namespace: production
spec:
  appinfo:
    appns: production
    applabel: app=api-server
    appkind: deployment
  engineState: active
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-cpu-hog
      spec:
        components:
          env:
            - name: CPU_CORES
              value: "2"
            - name: TOTAL_CHAOS_DURATION
              value: "60"
            - name: CPU_LOAD
              value: "80"
```
Memory Stress
```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: memory-stress-chaos
  namespace: production
spec:
  appinfo:
    appns: production
    applabel: app=api-server
    appkind: deployment
  engineState: active
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-memory-hog
      spec:
        components:
          env:
            - name: MEMORY_CONSUMPTION    # memory to consume, in MB
              value: "500"
            - name: TOTAL_CHAOS_DURATION
              value: "60"
```
Disk Fill
```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: disk-fill-chaos
  namespace: production
spec:
  appinfo:
    appns: production
    applabel: app=api-server
    appkind: deployment
  engineState: active
  chaosServiceAccount: litmus-admin
  experiments:
    - name: disk-fill
      spec:
        components:
          env:
            - name: FILL_PERCENTAGE
              value: "80"
            - name: TOTAL_CHAOS_DURATION
              value: "60"
            - name: CONTAINER_PATH
              value: "/data"
```
CI/CD Integration
GitHub Actions
```yaml
name: Chaos Tests
on:
  schedule:
    - cron: '0 3 * * *'  # daily at 03:00 UTC
  workflow_dispatch:
jobs:
  chaos-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Setup kubectl
        uses: azure/setup-kubectl@v3
      - name: Configure kubeconfig
        run: |
          echo "${{ secrets.KUBECONFIG }}" | base64 -d > kubeconfig
          echo "KUBECONFIG=$PWD/kubeconfig" >> "$GITHUB_ENV"
      - name: Run Chaos Experiment
        run: |
          kubectl apply -f chaos/pod-delete.yaml
          # Poll until the ChaosResult verdict is no longer Awaited (max ~5 min)
          for i in $(seq 1 30); do
            VERDICT=$(kubectl get chaosresult pod-delete-chaos-pod-delete \
              -n production -o jsonpath='{.status.experimentStatus.verdict}' 2>/dev/null)
            [ -n "$VERDICT" ] && [ "$VERDICT" != "Awaited" ] && break
            sleep 10
          done
      - name: Check Result
        run: |
          RESULT=$(kubectl get chaosresult pod-delete-chaos-pod-delete \
            -n production -o jsonpath='{.status.experimentStatus.verdict}')
          if [ "$RESULT" != "Pass" ]; then
            echo "Chaos experiment failed with verdict: $RESULT"
            exit 1
          fi
      - name: Cleanup
        if: always()
        run: kubectl delete chaosengine pod-delete-chaos -n production
```
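The verdict check in the job above can be factored into a small reusable helper. ChaosResult verdicts include `Pass`, `Fail`, and `Awaited` (while the run is still in progress); this sketch treats anything other than `Pass` as a failure:

```shell
#!/bin/sh
# check_verdict VERDICT — exit 0 only when a ChaosResult verdict is "Pass".
# Litmus reports "Awaited" while an experiment is still running, so an empty
# or Awaited verdict also counts as not-passed here.
check_verdict() {
  verdict="$1"
  if [ "$verdict" = "Pass" ]; then
    echo "chaos experiment passed"
    return 0
  fi
  echo "chaos experiment not passed (verdict: ${verdict:-none})"
  return 1
}

# Usage: fetch the verdict exactly as in the workflow step, then gate on it.
# VERDICT=$(kubectl get chaosresult pod-delete-chaos-pod-delete -n production \
#   -o jsonpath='{.status.experimentStatus.verdict}')
# check_verdict "$VERDICT" || exit 1
```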
GameDay Workflow
```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosWorkflow
metadata:
  name: gameday-workflow
  namespace: litmus
spec:
  steps:
    - name: verify-baseline
      template: verify-baseline
    - name: pod-failure
      template: pod-failure
      dependencies: [verify-baseline]
    - name: verify-recovery
      template: verify-baseline   # reuse the baseline health check
      dependencies: [pod-failure]
    - name: network-chaos
      template: network-chaos
      dependencies: [verify-recovery]
    - name: final-verification
      template: verify-baseline
      dependencies: [network-chaos]
  templates:
    - name: verify-baseline
      container:
        image: curlimages/curl
        command: ["/bin/sh", "-c"]
        args:
          - |
            for i in $(seq 1 10); do
              STATUS=$(curl -s -o /dev/null -w "%{http_code}" http://api.production.svc:8080/health)
              if [ "$STATUS" != "200" ]; then
                echo "Health check failed: $STATUS"
                exit 1
              fi
              sleep 2
            done
            echo "Baseline verified"
    - name: pod-failure
      chaosEngine:
        engineSpec:
          appinfo:
            appns: production
            applabel: app=api-server
            appkind: deployment
          experiments:
            - name: pod-delete
              spec:
                components:
                  env:
                    - name: TOTAL_CHAOS_DURATION
                      value: "30"
                    - name: PODS_AFFECTED_PERC
                      value: "100"
    - name: network-chaos
      chaosEngine:
        engineSpec:
          appinfo:
            appns: production
            applabel: app=api-server
            appkind: deployment
          experiments:
            - name: pod-network-latency
              spec:
                components:
                  env:
                    - name: NETWORK_LATENCY
                      value: "500"
                    - name: TOTAL_CHAOS_DURATION
                      value: "60"
```
Hypothesis-Driven Testing
```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: hypothesis-test
  namespace: production
  annotations:
    hypothesis: "System maintains 99.9% availability when 50% of pods are killed"
    success-criteria: "Error rate < 0.1% during chaos"
spec:
  appinfo:
    appns: production
    applabel: app=api-server
    appkind: deployment
  engineState: active
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        # Probes validate the hypothesis while chaos runs
        probe:
          - name: availability-check
            type: httpProbe
            httpProbe/inputs:
              url: http://api.production.svc:8080/health
              insecureSkipVerify: false
              method:
                get:
                  criteria: ==
                  responseCode: "200"
            mode: Continuous
            runProperties:
              probeTimeout: 5
              interval: 2
              retry: 3
              probePollingInterval: 1
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "120"
            - name: PODS_AFFECTED_PERC
              value: "50"
```
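httpProbe is one of several probe types. When no HTTP endpoint captures the hypothesis, a cmdProbe can assert on arbitrary command output instead. The fragment below is an illustrative sketch, not taken from ChaosHub — the command, comparator values, and `source: inline` are assumptions to verify against the probe schema of your Litmus version:

```yaml
# Illustrative cmdProbe: fail the experiment if the deployment reports fewer
# than 2 ready replicas at the end of chaos (EOT mode).
- name: min-replicas-check
  type: cmdProbe
  cmdProbe/inputs:
    command: kubectl get deploy api-server -n production -o jsonpath='{.status.readyReplicas}'
    source: inline
    comparator:
      type: int
      criteria: ">="
      value: "2"
  mode: EOT
  runProperties:
    probeTimeout: 10
    interval: 5
    retry: 1
```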
Observability Integration
```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: observable-chaos
  namespace: production
spec:
  monitoring: true          # export chaos metrics for Prometheus scraping
  jobCleanUpPolicy: retain  # keep experiment pods for post-mortem inspection
  appinfo:
    appns: production
    applabel: app=api-server
    appkind: deployment
  engineState: active
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        probe:
          - name: prometheus-check
            type: promProbe
            promProbe/inputs:
              endpoint: http://prometheus.monitoring:9090
              query: sum(rate(http_requests_total{status=~"5.."}[1m]))
              comparator:
                type: float
                criteria: "<="
                value: "0.01"
            mode: Edge
            runProperties:
              probeTimeout: 5
              interval: 5
              retry: 2
```
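When debugging a failing promProbe, it helps to reproduce the check by hand: run the same query against Prometheus and apply the same `<=` comparison. The helper below does the float comparison; the commented curl/jq lines mirror the probe's endpoint and query and assume Prometheus is reachable from wherever you run them:

```shell
#!/bin/sh
# error_rate_ok RATE THRESHOLD — exit 0 iff RATE <= THRESHOLD (float compare),
# mirroring the promProbe comparator in the engine above.
error_rate_ok() {
  awk -v r="$1" -v t="$2" 'BEGIN { exit !(r <= t) }'
}

# Manual check against the same endpoint/query as the probe:
# RATE=$(curl -s 'http://prometheus.monitoring:9090/api/v1/query' \
#   --data-urlencode 'query=sum(rate(http_requests_total{status=~"5.."}[1m]))' \
#   | jq -r '.data.result[0].value[1] // "0"')
# error_rate_ok "$RATE" 0.01 && echo "error rate within threshold"
```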
References
- LitmusChaos: https://litmuschaos.io
- ChaosHub: https://hub.litmuschaos.io
- Principles: https://principlesofchaos.org