Chaos Engineering with Litmus: Controlled Failure Injection
Hope is not a strategy. Chaos engineering proves that your system can handle failures before a production incident forces the issue. LitmusChaos is a Kubernetes-native platform for defining, running, and validating chaos experiments as custom resources.
TL;DR
- LitmusChaos = K8s-native chaos experiments
- ChaosHub = library of pre-built experiments
- Pod, network, node, and application-level chaos
- Integrates with CI/CD for automated resilience testing
- Full examples with GameDay patterns
Install LitmusChaos
```bash
helm repo add litmuschaos https://litmuschaos.github.io/litmus-helm/
helm upgrade --install litmus litmuschaos/litmus \
  --namespace litmus --create-namespace \
  --set portal.frontend.service.type=ClusterIP
```
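The ChaosEngine manifests in this guide reference a `chaosServiceAccount` named `litmus-admin`, which must exist in the target namespace with enough RBAC to create and observe experiment pods. The rules below are a minimal sketch, not the official manifest — the exact resources and verbs vary per experiment, so prefer the RBAC files shipped alongside each ChaosHub experiment:

```yaml
# Hypothetical minimal RBAC for pod-level experiments in "production".
# Treat this as a starting point; each ChaosHub experiment documents its own needs.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: litmus-admin
  namespace: production
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: litmus-admin
  namespace: production
rules:
  - apiGroups: [""]
    resources: ["pods", "events", "pods/log", "pods/exec"]
    verbs: ["create", "get", "list", "patch", "update", "delete", "deletecollection"]
  - apiGroups: ["apps"]
    resources: ["deployments", "replicasets"]
    verbs: ["get", "list"]
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["create", "get", "list", "delete", "deletecollection"]
  - apiGroups: ["litmuschaos.io"]
    resources: ["chaosengines", "chaosexperiments", "chaosresults"]
    verbs: ["create", "get", "list", "patch", "update", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: litmus-admin
  namespace: production
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: litmus-admin
subjects:
  - kind: ServiceAccount
    name: litmus-admin
    namespace: production
```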
Pod Chaos Experiments
Pod Delete
```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: pod-delete-chaos
  namespace: production
spec:
  appinfo:
    appns: production
    applabel: app=api-server
    appkind: deployment
  engineState: active
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION   # total experiment runtime, seconds
              value: "60"
            - name: CHAOS_INTERVAL         # seconds between successive deletions
              value: "10"
            - name: FORCE                  # "false" = graceful pod deletion
              value: "false"
            - name: PODS_AFFECTED_PERC     # percentage of matching pods to target
              value: "50"
```
Container Kill
```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: container-kill-chaos
  namespace: production
spec:
  appinfo:
    appns: production
    applabel: app=api-server
    appkind: deployment
  engineState: active
  chaosServiceAccount: litmus-admin
  experiments:
    - name: container-kill
      spec:
        components:
          env:
            - name: TARGET_CONTAINER
              value: api
            - name: TOTAL_CHAOS_DURATION
              value: "30"
            - name: CHAOS_INTERVAL
              value: "10"
```
Network Chaos
Network Latency
```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: network-latency-chaos
  namespace: production
spec:
  appinfo:
    appns: production
    applabel: app=api-server
    appkind: deployment
  engineState: active
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-network-latency
      spec:
        components:
          env:
            - name: NETWORK_INTERFACE
              value: eth0
            - name: NETWORK_LATENCY        # injected latency, milliseconds
              value: "200"
            - name: TOTAL_CHAOS_DURATION
              value: "60"
            - name: TARGET_PODS
              value: "1"
            - name: DESTINATION_IPS        # empty = no IP-based filter
              value: ""
            - name: DESTINATION_HOSTS      # only traffic to these hosts is delayed
              value: "postgres.production.svc.cluster.local"
```
Network Loss
```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: network-loss-chaos
  namespace: production
spec:
  appinfo:
    appns: production
    applabel: app=api-server
    appkind: deployment
  engineState: active
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-network-loss
      spec:
        components:
          env:
            - name: NETWORK_INTERFACE
              value: eth0
            - name: NETWORK_PACKET_LOSS_PERCENTAGE
              value: "30"
            - name: TOTAL_CHAOS_DURATION
              value: "60"
```
DNS Chaos
```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: dns-chaos
  namespace: production
spec:
  appinfo:
    appns: production
    applabel: app=api-server
    appkind: deployment
  engineState: active
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-dns-error
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "60"
            - name: TARGET_HOSTNAMES
              value: "api.external.com,database.internal"
```
Resource Stress
CPU Stress
```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: cpu-stress-chaos
  namespace: production
spec:
  appinfo:
    appns: production
    applabel: app=api-server
    appkind: deployment
  engineState: active
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-cpu-hog
      spec:
        components:
          env:
            - name: CPU_CORES
              value: "2"
            - name: TOTAL_CHAOS_DURATION
              value: "60"
            - name: CPU_LOAD
              value: "80"
```
Memory Stress
```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: memory-stress-chaos
  namespace: production
spec:
  appinfo:
    appns: production
    applabel: app=api-server
    appkind: deployment
  engineState: active
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-memory-hog
      spec:
        components:
          env:
            - name: MEMORY_CONSUMPTION    # memory to consume, in MB
              value: "500"
            - name: TOTAL_CHAOS_DURATION
              value: "60"
```
Disk Fill
```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: disk-fill-chaos
  namespace: production
spec:
  appinfo:
    appns: production
    applabel: app=api-server
    appkind: deployment
  engineState: active
  chaosServiceAccount: litmus-admin
  experiments:
    - name: disk-fill
      spec:
        components:
          env:
            - name: FILL_PERCENTAGE
              value: "80"
            - name: TOTAL_CHAOS_DURATION
              value: "60"
            - name: CONTAINER_PATH
              value: "/data"
```
CI/CD Integration
GitHub Actions
```yaml
name: Chaos Tests
on:
  schedule:
    - cron: '0 3 * * *'  # daily at 03:00 UTC
  workflow_dispatch:
jobs:
  chaos-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Setup kubectl
        uses: azure/setup-kubectl@v3
      - name: Configure kubeconfig
        run: |
          echo "${{ secrets.KUBECONFIG }}" | base64 -d > kubeconfig
          echo "KUBECONFIG=$PWD/kubeconfig" >> "$GITHUB_ENV"
      - name: Run Chaos Experiment
        run: |
          kubectl apply -f chaos/pod-delete.yaml
          # Poll until the ChaosResult verdict is no longer Awaited (max ~5 min)
          for i in $(seq 1 30); do
            VERDICT=$(kubectl get chaosresult pod-delete-chaos-pod-delete \
              -n production -o jsonpath='{.status.experimentStatus.verdict}' 2>/dev/null)
            [ -n "$VERDICT" ] && [ "$VERDICT" != "Awaited" ] && break
            sleep 10
          done
      - name: Check Result
        run: |
          RESULT=$(kubectl get chaosresult pod-delete-chaos-pod-delete \
            -n production -o jsonpath='{.status.experimentStatus.verdict}')
          if [ "$RESULT" != "Pass" ]; then
            echo "Chaos experiment failed with verdict: $RESULT"
            exit 1
          fi
      - name: Cleanup
        if: always()
        run: kubectl delete chaosengine pod-delete-chaos -n production
```
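The verdict check in the job above can be factored into a small reusable helper. ChaosResult verdicts include `Pass`, `Fail`, and `Awaited` (while the run is still in progress); this sketch treats anything other than `Pass` as a failure:

```shell
#!/bin/sh
# check_verdict VERDICT — exit 0 only when a ChaosResult verdict is "Pass".
# Litmus reports "Awaited" while an experiment is still running, so an empty
# or Awaited verdict also counts as not-passed here.
check_verdict() {
  verdict="$1"
  if [ "$verdict" = "Pass" ]; then
    echo "chaos experiment passed"
    return 0
  fi
  echo "chaos experiment not passed (verdict: ${verdict:-none})"
  return 1
}

# Usage: fetch the verdict exactly as in the workflow step, then gate on it.
# VERDICT=$(kubectl get chaosresult pod-delete-chaos-pod-delete -n production \
#   -o jsonpath='{.status.experimentStatus.verdict}')
# check_verdict "$VERDICT" || exit 1
```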
GameDay Workflow
```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosWorkflow
metadata:
  name: gameday-workflow
  namespace: litmus
spec:
  steps:
    - name: verify-baseline
      template: verify-baseline
    - name: pod-failure
      template: pod-failure
      dependencies: [verify-baseline]
    - name: verify-recovery
      template: verify-baseline   # reuse the baseline health check
      dependencies: [pod-failure]
    - name: network-chaos
      template: network-chaos
      dependencies: [verify-recovery]
    - name: final-verification
      template: verify-baseline
      dependencies: [network-chaos]
  templates:
    - name: verify-baseline
      container:
        image: curlimages/curl
        command: ["/bin/sh", "-c"]
        args:
          - |
            for i in $(seq 1 10); do
              STATUS=$(curl -s -o /dev/null -w "%{http_code}" http://api.production.svc:8080/health)
              if [ "$STATUS" != "200" ]; then
                echo "Health check failed: $STATUS"
                exit 1
              fi
              sleep 2
            done
            echo "Baseline verified"
    - name: pod-failure
      chaosEngine:
        engineSpec:
          appinfo:
            appns: production
            applabel: app=api-server
            appkind: deployment
          experiments:
            - name: pod-delete
              spec:
                components:
                  env:
                    - name: TOTAL_CHAOS_DURATION
                      value: "30"
                    - name: PODS_AFFECTED_PERC
                      value: "100"
    - name: network-chaos
      chaosEngine:
        engineSpec:
          appinfo:
            appns: production
            applabel: app=api-server
            appkind: deployment
          experiments:
            - name: pod-network-latency
              spec:
                components:
                  env:
                    - name: NETWORK_LATENCY
                      value: "500"
                    - name: TOTAL_CHAOS_DURATION
                      value: "60"
```
Hypothesis-Driven Testing
```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: hypothesis-test
  namespace: production
  annotations:
    hypothesis: "System maintains 99.9% availability when 50% of pods are killed"
    success-criteria: "Error rate < 0.1% during chaos"
spec:
  appinfo:
    appns: production
    applabel: app=api-server
    appkind: deployment
  engineState: active
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        # Probes validate the hypothesis while chaos runs
        probe:
          - name: availability-check
            type: httpProbe
            httpProbe/inputs:
              url: http://api.production.svc:8080/health
              insecureSkipVerify: false
              method:
                get:
                  criteria: ==
                  responseCode: "200"
            mode: Continuous
            runProperties:
              probeTimeout: 5
              interval: 2
              retry: 3
              probePollingInterval: 1
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "120"
            - name: PODS_AFFECTED_PERC
              value: "50"
```
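httpProbe is one of several probe types. When no HTTP endpoint captures the hypothesis, a cmdProbe can assert on arbitrary command output instead. The fragment below is an illustrative sketch, not taken from ChaosHub — the command, comparator values, and `source: inline` are assumptions to verify against the probe schema of your Litmus version:

```yaml
# Illustrative cmdProbe: fail the experiment if the deployment reports fewer
# than 2 ready replicas at the end of chaos (EOT mode).
- name: min-replicas-check
  type: cmdProbe
  cmdProbe/inputs:
    command: kubectl get deploy api-server -n production -o jsonpath='{.status.readyReplicas}'
    source: inline
    comparator:
      type: int
      criteria: ">="
      value: "2"
  mode: EOT
  runProperties:
    probeTimeout: 10
    interval: 5
    retry: 1
```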
Observability Integration
```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: observable-chaos
  namespace: production
spec:
  monitoring: true          # export chaos metrics for Prometheus scraping
  jobCleanUpPolicy: retain  # keep experiment pods for post-mortem inspection
  appinfo:
    appns: production
    applabel: app=api-server
    appkind: deployment
  engineState: active
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        probe:
          - name: prometheus-check
            type: promProbe
            promProbe/inputs:
              endpoint: http://prometheus.monitoring:9090
              query: sum(rate(http_requests_total{status=~"5.."}[1m]))
              comparator:
                type: float
                criteria: "<="
                value: "0.01"
            mode: Edge
            runProperties:
              probeTimeout: 5
              interval: 5
              retry: 2
```
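When debugging a failing promProbe, it helps to reproduce the check by hand: run the same query against Prometheus and apply the same `<=` comparison. The helper below does the float comparison; the commented curl/jq lines mirror the probe's endpoint and query and assume Prometheus is reachable from wherever you run them:

```shell
#!/bin/sh
# error_rate_ok RATE THRESHOLD — exit 0 iff RATE <= THRESHOLD (float compare),
# mirroring the promProbe comparator in the engine above.
error_rate_ok() {
  awk -v r="$1" -v t="$2" 'BEGIN { exit !(r <= t) }'
}

# Manual check against the same endpoint/query as the probe:
# RATE=$(curl -s 'http://prometheus.monitoring:9090/api/v1/query' \
#   --data-urlencode 'query=sum(rate(http_requests_total{status=~"5.."}[1m]))' \
#   | jq -r '.data.result[0].value[1] // "0"')
# error_rate_ok "$RATE" 0.01 && echo "error rate within threshold"
```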
References
- LitmusChaos: https://litmuschaos.io
- ChaosHub: https://hub.litmuschaos.io
- Principles: https://principlesofchaos.org