chaos-experiment-author
Build-an-X workflow for a chaos experiment per the Principles of Chaos Engineering - defines steady-state hypothesis, picks the variables (real-world events: network latency, node failure, region outage), sets the blast radius (which percentage / namespace / user cohort), automates execution, and emits the verdict (steady-state held / didn't hold). Use to scope a chaos experiment before running it via Litmus / Chaos Mesh / Gremlin / Toxiproxy.
Overview
Per chaos-principles:
"Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system's capability to withstand turbulent conditions in production."
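The 5 advanced principles (chaos-principles):
1. Build a hypothesis around steady-state behavior.
2. Vary real-world events.
3. Run experiments in production.
4. Automate experiments to run continuously.
5. Minimize blast radius.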
This skill walks the team through authoring an experiment that honors all five.
When to use
Step 1 - Define the steady-state hypothesis
Per chaos-principles principle 1: focus on measurable output. The hypothesis must be a number, not a feeling:
```yaml
# experiments/checkout-network-latency.yaml
hypothesis:
  steady_state:
    metric: checkout_completion_rate
    threshold: ">= 95%"
    measured_over: "5 minutes"
    source: "datadog dashboard 'checkout-success'"
```

Bad hypotheses:
Good hypotheses:
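A good hypothesis is one a script can check. As a minimal sketch, assuming the threshold is written as a comparison operator followed by a number (as in the YAML above) and that the metric samples have already been fetched from the monitoring source, the evaluation could look like this. The function names and the choice to compare the mean of the window are assumptions of this sketch, not part of any chaos tool.

```python
import operator
import re

# Comparison operators a threshold string may start with.
_OPS = {">=": operator.ge, "<=": operator.le, ">": operator.gt, "<": operator.lt}


def parse_threshold(text):
    """Split a string such as '>= 95%' into (comparison function, number)."""
    match = re.fullmatch(r"\s*(>=|<=|>|<)\s*([0-9]+(?:\.[0-9]+)?)\s*%?\s*", text)
    if match is None:
        raise ValueError(f"unrecognised threshold: {text!r}")
    return _OPS[match.group(1)], float(match.group(2))


def steady_state_holds(samples, threshold):
    """Return True when the mean of the samples satisfies the threshold."""
    if not samples:
        raise ValueError("no samples collected for the measurement window")
    compare, limit = parse_threshold(threshold)
    return compare(sum(samples) / len(samples), limit)


# Hypothetical per-minute checkout_completion_rate readings over 5 minutes.
window = [97.1, 96.4, 95.9, 96.8, 97.0]
print(steady_state_holds(window, ">= 95%"))        # True
print(steady_state_holds([93.0, 94.5], ">= 95%"))  # False
```

A hypothesis that cannot be expressed as input to a function like `steady_state_holds` is probably still a feeling rather than a number.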
Step 2 - Pick a real-world event to inject
Per chaos-principles principle 2: vary real-world events. Don't inject "anything"; inject what could plausibly happen.
| Event class | Examples |
|---|---|
| Network | Latency 500ms, packet loss 5%, DNS failure, connection reset |
| Compute | Pod kill, CPU throttle, OOM kill, node drain |
| Storage | Disk full, slow disk, read failure |
| Time | Clock skew, leap second |
| Region / zone | Single AZ outage, multi-AZ outage |
| Dependency | Third-party API 500s, rate limit, timeout |
| Configuration | Bad config push, secret rotation failure |
Pick events the team has already seen (real incidents) or realistically expects.
Step 3 - Set the blast radius
Per chaos-principles principle 5: minimize blast radius.
```yaml
blast_radius:
  scope: "1% of pods in the staging namespace"
  duration: "5 minutes"
  abort_conditions:
    - "Sentry error rate exceeds 2%"
    - "PagerDuty incident raised"
    - "Manual abort signal"
```

Start small; expand as confidence grows.
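To make the scope and abort conditions concrete, here is a small sketch of how a runner might turn a percentage into a pod count and check the three abort conditions on each polling cycle. The `signals` dictionary and its keys are hypothetical stand-ins for whatever the team's error tracker, paging tool, and abort switch expose, and rounding up to at least one pod is an assumption of this sketch rather than documented tool behavior.

```python
import math


def pods_in_scope(total_pods, percent):
    """Number of pods a percentage scope selects, never fewer than one."""
    if total_pods <= 0:
        return 0
    return max(1, math.ceil(total_pods * percent / 100))


def first_abort_reason(signals, max_error_rate=2.0):
    """Return the first tripped abort condition, or None to keep running."""
    if signals["error_rate_percent"] > max_error_rate:
        return f"error rate {signals['error_rate_percent']}% exceeds {max_error_rate}%"
    if signals["incident_open"]:
        return "incident raised"
    if signals["manual_abort"]:
        return "manual abort signal"
    return None


print(pods_in_scope(200, 1))  # 2
print(pods_in_scope(40, 1))   # 1

healthy = {"error_rate_percent": 0.3, "incident_open": False, "manual_abort": False}
print(first_abort_reason(healthy))  # None
```

Note that 1% of a 40-pod namespace still means one whole pod, so the effective blast radius on small deployments is larger than the percentage suggests.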
Step 4 - Pick the chaos tool
Default: chaos-mesh for Kubernetes stacks - a CNCF-hosted project with a broad fault catalog (network / pod / IO / time / stress) and declarative CRDs that compose with the experiment YAML in Step 1. Use litmus-chaos when the team already runs Litmus workflows, gremlin-chaos for commercial multi-platform support outside Kubernetes, and toxiproxy-chaos when the failure surface is purely TCP-level.
The tool's syntax (CRD, attack config, etc.) goes alongside the experiment YAML.
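As an illustration of what the tool-specific part might contain for the default choice, the sketch below builds a Chaos Mesh `NetworkChaos` object from the experiment parameters and prints it as JSON, which `kubectl apply -f -` accepts. The field names follow the `chaos-mesh.org/v1alpha1` schema and should be checked against the installed CRD version; the `app: checkout` label and the helper name are hypothetical.

```python
import json


def network_delay_manifest(name, namespace, app_label, percent, latency, duration):
    """Build a Chaos Mesh NetworkChaos object that delays traffic to some pods."""
    return {
        "apiVersion": "chaos-mesh.org/v1alpha1",
        "kind": "NetworkChaos",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "action": "delay",
            "mode": "fixed-percent",  # target a percentage of the matching pods
            "value": str(percent),    # the percentage is passed as a string
            "selector": {
                "namespaces": [namespace],
                "labelSelectors": {"app": app_label},  # hypothetical label
            },
            "delay": {"latency": latency},
            "duration": duration,
        },
    }


manifest = network_delay_manifest(
    name="checkout-network-latency",
    namespace="staging",
    app_label="checkout",
    percent=1,
    latency="300ms",
    duration="5m",
)
print(json.dumps(manifest, indent=2))
```

Generating the manifest from the same values as the blast-radius block keeps the declared scope and the injected fault from drifting apart.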
Step 5 - Automate
Per chaos-principles principle 4: automate continuously.
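```yaml
# .github/workflows/chaos-monthly.yml
on:
  schedule:
    - cron: '0 4 1 * *'  # 1st of every month, 4am UTC
jobs:
  chaos:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4  # the experiment manifest lives in the repository
      # Assumes kubectl on the runner is already authenticated against the cluster.
      - run: |
          kubectl apply -f experiments/checkout-network-latency.yaml
          # Wait for completion. ChaosEngine is a Litmus resource that reports
          # completion in .status.engineStatus; adapt the kind and fields for other tools.
          kubectl wait --for=jsonpath='{.status.engineStatus}'=completed \
            chaosengine/checkout-network-latency --timeout=10m
          # Check verdict; fail the job unless the experiment passed
          verdict=$(kubectl get chaosengine/checkout-network-latency \
            -o jsonpath='{.status.experiments[0].verdict}')
          echo "verdict: ${verdict}"
          test "${verdict}" = "Pass"
```

Schedule per the team's appetite: monthly for new experiments, weekly for established ones, on-demand for incident reproduction.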
Step 6 - Run in production?
Per chaos-principles principle 3: experiments in production are the gold standard. But:
| Stage | Use |
|---|---|
| Pre-prod (staging) | Initial experiment runs; confidence-building. |
| Canary (5% traffic) | Once steady-state holds in staging. |
| Production (full) | Mature experiments; team has playbook for abort. |
Most teams should start in staging. Move to production after the team has confidence and abort procedures.
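One way to make the promotion rule explicit is a small gate that only advances a stage after a run of held verdicts, and only enters full production once an abort playbook exists. This is a sketch under stated assumptions: the three-run requirement, the `"HELD"` verdict string, and the function name are all hypothetical and should be replaced with whatever the team agrees on.

```python
STAGES = ["staging", "canary", "production"]


def next_stage(current, recent_verdicts, required_holds=3, abort_playbook_ready=False):
    """Return the stage the next run should use."""
    index = STAGES.index(current)
    if index == len(STAGES) - 1:
        return current  # already at full production
    latest = recent_verdicts[-required_holds:]
    earned = len(latest) == required_holds and all(v == "HELD" for v in latest)
    if not earned:
        return current  # not enough consecutive held runs yet
    candidate = STAGES[index + 1]
    if candidate == "production" and not abort_playbook_ready:
        return current  # production requires a rehearsed abort procedure
    return candidate


print(next_stage("staging", ["HELD", "HELD", "HELD"]))  # canary
print(next_stage("canary", ["HELD", "HELD", "HELD"]))   # canary
print(next_stage("canary", ["HELD", "HELD", "HELD"], abort_playbook_ready=True))  # production
```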
Step 7 - Verdict + report
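The verdict can be computed rather than judged by eye. The sketch below assumes a "metric >= minimum" hypothesis expressed in percent and takes the reading observed during the experiment as the deciding value; the function name and the `DID NOT HOLD` wording are assumptions of this sketch. It renders the hypothesis line, the verdict, and one table row in the same layout as the report template that follows.

```python
def render_verdict(metric, minimum, pre, during, post):
    """Return markdown lines: hypothesis, verdict, and one metric table row."""
    verdict = "HELD" if during >= minimum else "DID NOT HOLD"
    return "\n".join([
        f"**Steady-state hypothesis:** {metric} >= {minimum:g}%",
        f"**Verdict:** {verdict}",
        "",
        "| Metric | Pre-experiment | During experiment | Post |",
        "|---|---|---|---|",
        f"| {metric} | {pre}% | {during}% | {post}% |",
    ])


print(render_verdict("checkout_completion_rate", 95, 97.2, 96.8, 97.5))
```

Observations, action items, and the next iteration still need a human author; only the verdict line and the metric table are mechanical.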
```markdown
## Chaos experiment verdict - `checkout-network-latency`

**Date:** YYYY-MM-DD  **Duration:** 5 minutes
**Steady-state hypothesis:** checkout_completion_rate >= 95%
**Verdict:** ✅ HELD

| Metric                   | Pre-experiment | During experiment | Post  |
|--------------------------|----------------|-------------------|-------|
| checkout_completion_rate | 97.2%          | 96.8%             | 97.5% |
| p95 latency              | 245ms          | 380ms             | 240ms |

### Observations
- Latency increased as expected (300ms injected).
- Retry logic worked: ~200 retries observed; no user-visible failures.

### Action items
- (none; system behaved as expected)

### Next iteration
- Increase blast radius from 1% to 5% in next month's run.
- Add a longer-duration variant (15 min) to test fatigue.
```

Anti-patterns
| Anti-pattern | Why it fails | Fix |
|---|---|---|
| Hypothesis as feeling ("system feels stable") | Unmeasurable; can't tell if held. | Numeric metric (Step 1). |
| Inject anything; see what breaks | Wastes effort; misses real failure modes. | Pick real-world events (Step 2). |
| Production-first experiment without staging | Risks user-visible incident on first run. | Staging → canary → production (Step 6). |
| No abort conditions | Experiment runs past safety threshold; real incident. | Define abort + manual abort signal (Step 3). |
| Manual experiment runs only | Per chaos-principles: "labor-intensive and ultimately unsustainable." | Automate (Step 5). |
| One-off experiment then forget | Confidence decays; same incident recurs. | Schedule + repeat (Step 5 cron). |
| Skipping the verdict report | Lessons not captured; next iteration arbitrary. | Step 7 report. |