chaos-mesh
Configures Chaos Mesh for Kubernetes-native chaos engineering - picks fault types (PodChaos, NetworkChaos, StressChaos, IOChaos, TimeChaos, DNSChaos, KernelChaos, HTTPChaos, JVMChaos), targets via label selectors, controls blast radius via namespace whitelists + selector filters, schedules via the cron-style Schedule CRD, observes via dashboard. Distinct from Litmus by architecture (Chaos Mesh has its own dashboard + workflow orchestration; Litmus uses ChaosCenter UI).
Overview
Per chaos-mesh-home:
"Chaos Mesh is a platform that 'brings various types of fault simulation to Kubernetes and has an enormous capability to orchestrate fault scenarios.'"
Per chaos-mesh-home, Chaos Mesh leverages "Kubernetes CustomResourceDefinitions (CRDs) for seamless integration with the Kubernetes ecosystem."
When to use
If LitmusChaos is already deployed, evaluate stack-fit before adding Chaos Mesh - both serve similar use cases with different ergonomics.
Step 1 - Install
curl -sSL https://mirrors.chaos-mesh.org/v2.6.3/install.sh | bash
# Or via Helm:
helm repo add chaos-mesh https://charts.chaos-mesh.org
helm install chaos-mesh chaos-mesh/chaos-mesh -n chaos-mesh --create-namespace
Per chaos-mesh-home, "no special dependencies required - Chaos Mesh deploys directly on Kubernetes clusters, including minikube and kind."
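On containerd-based clusters such as kind, the chaos daemon usually needs to be told which container runtime and socket to use. A minimal sketch of a Helm values file for that case follows; the file name values-kind.yaml is hypothetical, and the socket path is an assumption that should be checked against the node image in use.

```yaml
# values-kind.yaml (hypothetical file name)
# Assumes a containerd node runtime, as kind uses by default.
chaosDaemon:
  runtime: containerd
  socketPath: /run/containerd/containerd.sock
```

It would be passed to the Helm install above with -f values-kind.yaml.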
Step 2 - Fault types
Per chaos-mesh-home:
| CRD | Effect |
|---|---|
| PodChaos | Pod kill, container kill, pod failure |
| NetworkChaos | Latency, packet loss, partition, bandwidth, corruption |
| StressChaos | CPU stress, memory stress |
| IOChaos | Disk read/write delay, errors |
| TimeChaos | Clock skew |
| DNSChaos | DNS lookup failures |
| KernelChaos | Kernel-level fault injection |
| HTTPChaos | HTTP request fault injection |
| JVMChaos | JVM-level (exception, GC pause, method delay) |
Plus Schedule for cron-style + Workflow for orchestration.
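To illustrate the cron-style Schedule resource, here is a minimal sketch that wraps a NetworkChaos delay in a nightly schedule. The resource name, cron expression, and history/concurrency settings are illustrative assumptions, not values from chaos-mesh-home.

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: checkout-latency-nightly   # hypothetical name
  namespace: app
spec:
  schedule: '0 2 * * *'            # assumed: every day at 02:00
  historyLimit: 5                  # keep the last 5 runs
  concurrencyPolicy: Forbid        # skip a run if the previous one is still active
  type: NetworkChaos
  networkChaos:
    action: delay
    mode: one
    selector:
      namespaces:
        - app
      labelSelectors:
        app: checkout
    delay:
      latency: '500ms'
    duration: '5m'
```

Forbid is chosen here so that overlapping runs cannot stack latency on the same pods.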
Step 3 - Author a NetworkChaos
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: checkout-network-latency
  namespace: app
spec:
  action: delay
  mode: one # or 'all', 'fixed', 'fixed-percent', 'random-max-percent'
  selector:
    namespaces:
      - app
    labelSelectors:
      app: checkout
  delay:
    latency: '500ms'
    correlation: '50'
    jitter: '50ms'
  duration: '5m'
Per chaos-mesh-home, Chaos Mesh provides "selector-based filtering using labels, annotations, and namespace whitelists to control 'blast radius' and target specific resources."
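One way to apply the namespace whitelist is to opt namespaces in by annotation. The sketch below assumes the controller was installed with the Helm value controllerManager.enableFilterNamespace=true, in which case only namespaces carrying this annotation accept chaos injection; confirm the option against the chart version in use.

```yaml
# Assumes controllerManager.enableFilterNamespace=true was set at install time.
apiVersion: v1
kind: Namespace
metadata:
  name: app
  annotations:
    chaos-mesh.org/inject: enabled   # opt this namespace in to chaos experiments
```

Namespaces without the annotation, including chaos-mesh itself, would then be out of scope regardless of what a selector matches.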
Step 4 - Author a PodChaos
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: checkout-pod-kill
  namespace: app
spec:
  action: pod-kill
  mode: fixed-percent
  value: '50'
  selector:
    namespaces:
      - app
    labelSelectors:
      app: checkout
  duration: '60s'
mode: fixed-percent with value: '50' kills 50% of the matching pods.
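For a smaller blast radius, mode: fixed with an absolute count is an alternative to a percentage. The sketch below is a variant under stated assumptions: it swaps in the pod-failure action (which keeps the selected pod unavailable for the whole duration) and uses a hypothetical resource name.

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: checkout-pod-failure   # hypothetical name
  namespace: app
spec:
  action: pod-failure          # pod stays unavailable until duration elapses
  mode: fixed
  value: '1'                   # exactly one matching pod
  selector:
    namespaces:
      - app
    labelSelectors:
      app: checkout
  duration: '60s'
```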
Step 5 - Workflow orchestration
Per chaos-mesh-home: "Workflow Orchestration: Users can combine serial and parallel experiments to simulate complex, realistic failure scenarios matching actual system architecture."
apiVersion: chaos-mesh.org/v1alpha1
kind: Workflow
metadata:
  name: checkout-resilience-test
  namespace: app
spec:
  entry: combined-chaos
  templates:
    - name: combined-chaos
      templateType: Serial
      deadline: 30m
      children:
        - network-latency-step
        - then-pod-kill-step
        - then-stress-step
    - name: network-latency-step
      templateType: NetworkChaos
      deadline: 5m # in a Workflow, bound the step with the template's deadline, not duration in the chaos spec
      networkChaos:
        action: delay
        mode: all
        selector: { ... }
        delay: { latency: 200ms }
    - name: then-pod-kill-step
      templateType: PodChaos
      podChaos:
        action: pod-kill
        mode: one
        selector: { ... }
    - name: then-stress-step
      templateType: StressChaos
      deadline: 3m
      stressChaos:
        mode: all
        selector: { ... }
        stressors:
          cpu: { workers: 4, load: 80 }
Step 6 - Dashboard
Per chaos-mesh-home, Chaos Mesh ships a dashboard with RBAC.
kubectl port-forward -n chaos-mesh svc/chaos-dashboard 2333:2333
# Open http://localhost:2333
The dashboard provides authoring (visual experiment construction), running, observability, and replay.
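With RBAC enabled, the dashboard asks for a service-account token scoped to what the user may see or change. A minimal sketch of a read-only account for the app namespace follows; it uses standard Kubernetes RBAC objects, and all names (chaos-viewer and so on) are hypothetical.

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: chaos-viewer            # hypothetical name
  namespace: app
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: chaos-viewer-role       # hypothetical name
  namespace: app
rules:
  - apiGroups: ['']
    resources: ['pods', 'namespaces']
    verbs: ['get', 'list', 'watch']
  - apiGroups: ['chaos-mesh.org']
    resources: ['*']
    verbs: ['get', 'list', 'watch']   # read-only: no create/update/delete
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: chaos-viewer-binding    # hypothetical name
  namespace: app
subjects:
  - kind: ServiceAccount
    name: chaos-viewer
    namespace: app
roleRef:
  kind: Role
  name: chaos-viewer-role
  apiGroup: rbac.authorization.k8s.io
```

On recent Kubernetes versions a token for this account can typically be issued with kubectl create token chaos-viewer -n app; granting create, update, and delete on the chaos-mesh.org group would turn it into an operator role.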
Step 7 - Run + verdict
kubectl apply -f checkout-network-latency.yaml
# Watch state
kubectl get networkchaos checkout-network-latency -w
# Status / events
kubectl describe networkchaos checkout-network-latency
The chaos resource lifecycle is Created → Running → Stopped. Pair with external monitoring (Datadog / Prometheus / Grafana) to verify the steady-state hypothesis held.
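Alongside external monitoring, a steady-state probe can also run inside a Workflow. The fragment below is a sketch only: it assumes a Chaos Mesh release that ships the StatusCheck template type, the health URL is hypothetical, and the field names should be verified against the installed CRD (for example with kubectl explain) before use.

```yaml
# Fragment of spec.templates in a Workflow; assumes StatusCheck is available.
- name: checkout-health-check       # hypothetical name
  templateType: StatusCheck
  deadline: 5m
  abortWithStatusCheck: true        # abort the workflow if the check fails
  statusCheck:
    mode: Continuous                # keep probing while sibling steps run
    type: HTTP
    intervalSeconds: 5
    timeoutSeconds: 2
    http:
      url: http://checkout.app.svc:8080/healthz   # hypothetical endpoint
      method: GET
      criteria:
        statusCode: '200'
```

Run as a child of a Parallel template next to the fault steps, this gives a pass/fail signal tied to the experiment itself, while the dashboards in Datadog or Grafana remain the source for the fuller verdict.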
Step 8 - Physical machine support
Per chaos-mesh-home: "Physical Machine Support: Chaosd (experimental) extends chaos testing to non-Kubernetes environments through PhysicalMachineChaos resources."
For VM / bare-metal targets:
apiVersion: chaos-mesh.org/v1alpha1
kind: PhysicalMachineChaos
metadata:
  name: vm-cpu-stress
spec:
  action: stress-cpu
  address:
    - 'http://10.0.0.5:31767'
  duration: 5m
  stress-cpu:
    load: 80
    workers: 4
The Chaosd agent runs on the target VM; the K8s CRD remotely triggers it.
Step 9 - CI integration
- name: Trigger chaos experiment
  run: |
    kubectl apply -f experiments/checkout-network-latency.yaml
    sleep 320 # 5min duration + buffer
    kubectl delete -f experiments/checkout-network-latency.yaml
- name: Check steady-state from Datadog
  run: ./scripts/datadog-verdict.sh
Anti-patterns
| Anti-pattern | Why it fails | Fix |
|---|---|---|
| mode: all without scope | All matching pods affected; blast radius too wide. | Start with mode: one, or mode: fixed-percent with value: '25'. |
| No duration | Chaos persists until manual cleanup; risky. | Always set duration (Step 3 example). |
| Targeting chaos-mesh namespace | Crashes the chaos infrastructure itself. | Whitelist app namespace; deny-list chaos-mesh. |
| Disabling RBAC on the dashboard | Anyone with cluster access can trigger chaos. | Per chaos-mesh-home: RBAC is on by default - keep it on. |
| Skipping observability integration | Chaos runs but the verdict is invisible. | Wire dashboard + external monitoring. |