litmus-chaos
Configures LitmusChaos for Kubernetes-native chaos engineering: installs via Helm, picks ChaosExperiments from the ChaosHub (`pod-delete`, `pod-network-latency`, `node-cpu-hog`, etc.), authors a ChaosEngine CR that scopes the experiment and its steady-state probes, runs the experiment in-cluster, and exports Prometheus metrics to support the verdict. Use when the platform is Kubernetes (LitmusChaos is CNCF-hosted and cloud-native). Prefer over chaos-mesh when the team wants the ChaosCenter web UI for workflow scheduling and ChaosHub catalog browsing; use chaos-mesh for fine-grained network-fault policies via its own CRD family.
Overview
Per litmus-home:
"LitmusChaos is a CNCF-hosted, open-source Chaos Engineering platform that helps teams identify infrastructure weaknesses through safe, controlled chaos tests."
"Kubernetes developers & SREs use Litmus to manage chaos in a declarative manner." (litmus-home)
The architecture: Litmus runs as a Kubernetes operator, experiments are defined as custom resources (CRDs), and results are exported as Prometheus metrics.
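One way to see the operator-and-CRD model on a live cluster is to check that the Litmus custom resource definitions are registered. A minimal sketch, assuming the three core CRDs in the `litmuschaos.io` API group (ChaosEngine, ChaosExperiment, ChaosResult); `missing_crds` is a hypothetical helper name, not part of Litmus:

```shell
# Core custom resources assumed to be registered by the Litmus operator.
LITMUS_CRDS="chaosengines.litmuschaos.io chaosexperiments.litmuschaos.io chaosresults.litmuschaos.io"

# Read installed CRD names from stdin (one per line) and print every
# expected Litmus CRD that is absent. Prints nothing when all are present.
missing_crds() {
  installed=$(cat)
  for crd in $LITMUS_CRDS; do
    printf '%s\n' "$installed" | grep -qxF "$crd" || printf '%s\n' "$crd"
  done
}

# Usage against a cluster (needs kubectl access):
#   kubectl get crds -o name | sed 's|.*/||' | missing_crds
```

An empty result suggests the operator's resource types are available; any name printed is a CRD that still needs to be installed.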
When to use
Step 1 - Install
Per litmus-home:
```shell
helm repo add litmuschaos https://litmuschaos.github.io/litmus-helm/
helm install litmuschaos litmuschaos/litmus -n litmus --create-namespace
```
This deploys the Litmus operator and ChaosCenter (the web UI).
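After the install, a quick sanity check is whether the pods in the `litmus` namespace came up. A minimal sketch, assuming the default `kubectl get pods --no-headers` column layout (NAME, READY, STATUS, ...); `pods_healthy` is an illustrative helper, not a Litmus or kubectl command:

```shell
# Read `kubectl get pods --no-headers` output on stdin and succeed only if
# there is at least one pod and every pod's STATUS column (3rd field) is
# Running or Completed.
pods_healthy() {
  awk '$3 != "Running" && $3 != "Completed" { bad = 1 } END { exit (bad || NR == 0) }'
}

# Usage against a cluster:
#   helm status litmuschaos -n litmus
#   kubectl get pods -n litmus --no-headers | pods_healthy && echo "Litmus pods are up"
```

Treating an empty pod list as unhealthy is a deliberate choice here: a namespace with no pods usually means the release did not deploy.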
Step 2 - Pick a ChaosExperiment from the Hub
Per litmus-home, the ChaosHub is "a repository hosting most of the chaos experiments that are needed for a quick start in Chaos Engineering." Common experiments:
| ChaosExperiment | Effect |
|---|---|
| `pod-delete` | Kill random pods |
| `pod-network-latency` | Inject network latency on the pod |
| `pod-network-loss` | Drop a percentage of packets |
| `pod-cpu-hog` | Spike CPU on the pod |
| `pod-memory-hog` | Spike memory on the pod |
| `node-cpu-hog` | Spike CPU on the node |
| `node-drain` | Drain a node |
| `disk-fill` | Fill the pod's writable disk |
| `kubelet-service-kill` | Kill kubelet on a node |
Install per-experiment:
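```shell
# Install into the namespace that will hold the ChaosEngine (app in Step 3); quote the URL because it contains "?"
kubectl apply -n app -f "https://hub.litmuschaos.io/api/chaos/2.14.0?file=charts/generic/pod-delete/experiment.yaml"
```
Step 3 - Author a ChaosEngine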
The ChaosEngine CR runs an experiment against a target:
```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: checkout-pod-delete
  namespace: app
spec:
  appinfo:
    appns: app
    applabel: 'app=checkout'
    appkind: deployment
  chaosServiceAccount: pod-delete-sa
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: '60' # seconds
            - name: CHAOS_INTERVAL
              value: '20' # seconds
            - name: PODS_AFFECTED_PERCENTAGE
              value: '50'
        probe:
          - name: 'check-checkout-availability'
            type: httpProbe
            httpProbe/inputs:
              url: 'http://checkout.app.svc:8080/health'
              insecureSkipVerify: false
              method:
                get:
                  criteria: '=='
                  responseCode: '200'
            mode: 'Continuous'
            runProperties:
              probeTimeout: 5
              interval: 2
              retry: 3
              probePollingInterval: 1
```
Per litmus-home, probes "create complete chaos scenarios close to the real application experience upon failure." The probe is the steady-state check per the chaos principles.
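The `chaosServiceAccount` named in the ChaosEngine has to exist and carry enough RBAC for the experiment to act on the target. The sketch below prints a ServiceAccount, Role, and RoleBinding for it. The rule set is an assumption that approximates what a pod-delete run typically touches, so compare it with the RBAC manifest published for the experiment before relying on it. `render_chaos_rbac` is a hypothetical helper name.

```shell
# Print RBAC manifests for the chaos service account.
# $1 = namespace (default: app), $2 = service account name (default: pod-delete-sa)
render_chaos_rbac() {
  ns=${1:-app}
  sa=${2:-pod-delete-sa}
  cat <<EOF
apiVersion: v1
kind: ServiceAccount
metadata:
  name: $sa
  namespace: $ns
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: $sa
  namespace: $ns
rules:
  # Assumed permission set; verify against the experiment's own RBAC manifest.
  - apiGroups: [""]
    resources: ["pods", "events"]
    verbs: ["create", "delete", "get", "list", "patch", "update", "deletecollection"]
  - apiGroups: [""]
    resources: ["pods/log"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["apps"]
    resources: ["deployments", "replicasets"]
    verbs: ["get", "list"]
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["create", "delete", "get", "list", "deletecollection"]
  - apiGroups: ["litmuschaos.io"]
    resources: ["chaosengines", "chaosexperiments", "chaosresults"]
    verbs: ["create", "delete", "get", "list", "patch", "update"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: $sa
  namespace: $ns
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: $sa
subjects:
  - kind: ServiceAccount
    name: $sa
    namespace: $ns
EOF
}

# Usage against a cluster:
#   render_chaos_rbac app pod-delete-sa | kubectl apply -f -
```

A namespaced Role (rather than a ClusterRole) keeps the experiment's permissions inside the target namespace, which fits the blast-radius guidance in the anti-patterns table.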
Step 4 - Run
```shell
kubectl apply -f checkout-pod-delete.yaml
```
Litmus runs the experiment for `TOTAL_CHAOS_DURATION` seconds, checking the probe continuously. The verdict (Pass / Fail) lands in the ChaosResult resource under `status.experimentStatus.verdict`; the ChaosEngine mirrors it in `status.experiments[].verdict`.
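Because the run takes at least `TOTAL_CHAOS_DURATION` seconds, anything that reads the verdict should wait for the engine to finish first. A minimal polling sketch, assuming the ChaosEngine reports `completed` in `.status.engineStatus` when the run ends; `wait_for_completed` and `engine_status` are hypothetical helper names:

```shell
# Poll a status command until it prints "completed" or attempts run out.
# $1 = max attempts, $2 = seconds between attempts, remaining args = command to run
wait_for_completed() {
  attempts=$1
  delay=$2
  shift 2
  i=0
  while [ "$i" -lt "$attempts" ]; do
    [ "$("$@")" = "completed" ] && return 0
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Status reader for the engine from Step 3 (needs kubectl access).
engine_status() {
  kubectl get chaosengine checkout-pod-delete -n app -o jsonpath='{.status.engineStatus}'
}

# Usage: poll every 10 seconds for up to 10 minutes.
#   wait_for_completed 60 10 engine_status || echo "chaos run did not complete in time"
```

Passing the status command as an argument keeps the loop independent of kubectl, so it can be exercised without a cluster.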
Step 5 - Read the verdict
```shell
# The ChaosResult is named <engine-name>-<experiment-name>
kubectl get chaosresult checkout-pod-delete-pod-delete -n app -o jsonpath='{.status.experimentStatus.verdict}'
# Output: Pass | Fail
```
Step 6 - Probe types
| Probe type | Use |
|---|---|
| `httpProbe` | HTTP endpoint health + status code |
| `cmdProbe` | Run a shell command; check exit code |
| `k8sProbe` | Check Kubernetes resource state |
| `promProbe` | Query Prometheus metric; assert threshold |
Probes can run in different modes: SOT (start of test), EOT (end of test), Edge (both), Continuous (every N seconds during the experiment).
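For comparison with the `httpProbe` in Step 3, here is what a `promProbe` entry in `Edge` mode might look like. This is a sketch: the field names under `promProbe/inputs` follow the v1alpha1 probe schema as I understand it, so check them against the probe documentation for your Litmus version. The probe name `check-error-rate` and the helper `render_prom_probe` are hypothetical.

```shell
# Print a promProbe entry to paste under `probe:` in the ChaosEngine experiment spec.
# $1 = Prometheus endpoint, $2 = PromQL query, $3 = comparison operator, $4 = threshold
# Note: arguments are wrapped in single quotes in the YAML, so they must not
# contain single quotes themselves.
render_prom_probe() {
  cat <<EOF
- name: 'check-error-rate'
  type: promProbe
  promProbe/inputs:
    endpoint: '$1'
    query: '$2'
    comparator:
      criteria: '$3'
      value: '$4'
  mode: 'Edge'
  runProperties:
    probeTimeout: 5
    interval: 2
    retry: 3
EOF
}

# Usage (endpoint and query are placeholders for illustration):
#   render_prom_probe 'http://prometheus.monitoring.svc:9090' \
#     'sum(rate(http_requests_total{code=~"5.."}[1m]))' '<' '1'
```

`Edge` mode evaluates the query before and after the chaos window, which suits metrics that should return to a baseline; `Continuous` would instead assert the threshold throughout the run.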
Step 7 - Observability
Per litmus-home: "chaos observability by exporting Prometheus metrics that highlight and quantify the impact of chaos on the applications or infrastructure in real time."
Key metrics:
Wire to Grafana for dashboards.
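Before building dashboards, it can help to confirm that Litmus metrics are actually being exposed. A minimal sketch, assuming the exporter's metric names start with `litmuschaos_` and that it is reachable as a `chaos-exporter` service on port 8080 in the `litmus` namespace (both assumptions; adjust to your install). `litmus_metrics` is a hypothetical helper name.

```shell
# Keep only Litmus sample lines from Prometheus text-format input on stdin,
# dropping "# HELP" / "# TYPE" comments and metrics from other components.
litmus_metrics() {
  grep '^litmuschaos_'
}

# Usage against a cluster (service name and port are assumptions):
#   kubectl port-forward -n litmus svc/chaos-exporter 8080:8080 &
#   curl -s http://localhost:8080/metrics | litmus_metrics
```

If nothing is printed, the exporter is either not deployed or has not observed a chaos run yet.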
Step 8 - CI integration
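```yaml
- name: Run chaos experiment
  run: |
    kubectl apply -f experiments/checkout-pod-delete.yaml
    # ChaosEngine reports progress in .status.engineStatus rather than in conditions, so poll for it
    timeout 10m sh -c 'until [ "$(kubectl get chaosengine checkout-pod-delete -n app -o jsonpath="{.status.engineStatus}")" = "completed" ]; do sleep 10; done'
    # The verdict is read from the ChaosResult, named <engine-name>-<experiment-name>
    VERDICT=$(kubectl get chaosresult checkout-pod-delete-pod-delete -n app -o jsonpath='{.status.experimentStatus.verdict}')
    echo "Verdict: $VERDICT"
    [ "$VERDICT" = "Pass" ]
```
Anti-patterns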
| Anti-pattern | Why it fails | Fix |
|---|---|---|
| Running ChaosEngine without probe | No steady-state check; verdict meaningless. | Always include `httpProbe` / `promProbe` (Step 3). |
| `PODS_AFFECTED_PERCENTAGE: 100` | Kills all pods; service down. | Start at 25-50%; increase per blast-radius principle. |
| Running in default namespace | Could affect cluster components. | Dedicated app namespace target. |
| One-shot experiment; never re-run | Per chaos principle 4: automate continuously. | Schedule via CronJob (Step 8 in cron form). |
| Skipping `chaosServiceAccount` | RBAC blocks experiment execution. | Define ServiceAccount with appropriate permissions. |
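For the "never re-run" row, the scheduled form can be a script that a CronJob container or a scheduled CI job invokes. The sketch below rests on several assumptions: the engine manifest lives at `experiments/checkout-pod-delete.yaml`, the ChaosResult is named `<engine-name>-<experiment-name>`, and the old ChaosEngine is deleted first because re-applying an unchanged manifest would typically not start a new run. `verdict_ok` and `run_chaos_once` are hypothetical helper names.

```shell
ENGINE=checkout-pod-delete
EXPERIMENT=pod-delete
NAMESPACE=app

# Verdict gate: only "Pass" counts as success; any other value, or an empty one, fails.
verdict_ok() {
  [ "$1" = "Pass" ]
}

# One full chaos run: reset, apply, wait for completion, read the verdict, gate on it.
run_chaos_once() {
  # Remove the previous engine so the apply below triggers a fresh run.
  kubectl delete chaosengine "$ENGINE" -n "$NAMESPACE" --ignore-not-found
  kubectl apply -f "experiments/$ENGINE.yaml"

  # Poll for up to 10 minutes (60 attempts, 10 seconds apart).
  i=0
  status=""
  while [ "$i" -lt 60 ]; do
    status=$(kubectl get chaosengine "$ENGINE" -n "$NAMESPACE" -o jsonpath='{.status.engineStatus}')
    [ "$status" = "completed" ] && break
    i=$((i + 1))
    sleep 10
  done
  [ "$status" = "completed" ] || return 1

  verdict=$(kubectl get chaosresult "$ENGINE-$EXPERIMENT" -n "$NAMESPACE" -o jsonpath='{.status.experimentStatus.verdict}')
  echo "Verdict: $verdict"
  verdict_ok "$verdict"
}

# Usage from a scheduled job:
#   run_chaos_once || echo "chaos run failed or timed out"
```

Keeping the gate in its own function makes the pass/fail rule easy to tighten later, for example to also require a minimum probe success percentage.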