gremlin-chaos
Configures Gremlin (commercial) for cross-platform chaos engineering: installs the Gremlin agent on Linux, Windows, or Kubernetes; picks attack types (resource, network, state, request); creates Scenarios that chain attacks; and integrates with the Reliability Score for forward-looking metrics. Use when the platform spans multiple environments (bare metal + cloud + serverless) and the team needs a commercially supported solution with Gremlin's multi-platform coverage.
gremlin-chaos
Overview
Per gremlin-home:
"Gremlin is an Enterprise Reliability Management & Resilience Testing platform that helps organizations move from reactive incident metrics to proactive reliability measurement."
Per gremlin-home, the platform provides "forward-looking reliability scores - so your teams can see where systems will fail, fix them first, and prove the results."
Multi-platform: per gremlin-home, Gremlin works across "bare metal, on-prem, multi-cloud, and serverless."
When to use
If the team is Kubernetes-only and prefers open source, see litmus-chaos or chaos-mesh instead.
Step 1 - Install Gremlin agent
Linux:
# Assumes Gremlin's apt repository is already configured; the package is not in the default Debian/Ubuntu repositories.
sudo apt install -y gremlin
sudo gremlin auth login --org-id <org-id> --user-id <user-id> --api-token <token>
Kubernetes:
helm repo add gremlin https://helm.gremlin.com
helm install gremlin gremlin/gremlin \
  --namespace gremlin --create-namespace \
  --set gremlin.secret.managed=true \
  --set gremlin.secret.type=secret \
  --set gremlin.secret.teamID=<team-id> \
  --set gremlin.secret.clusterID=<cluster-id> \
  --set gremlin.secret.teamSecret=<secret>
The agent connects to the Gremlin Control Plane (cloud); attacks are triggered from the web UI or the API.
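The install commands above rely on several placeholder values (team ID, cluster ID, team secret). When the install is scripted, a preflight check can stop early if any of them is missing. The sketch below is a minimal example, not part of Gremlin's tooling; the function name `require_vars` and the environment variable names in the usage line are assumptions chosen for illustration.

```shell
# Preflight check: fail fast if any named environment variable is unset or empty.
# Variable names are passed by the script author; none are mandated by Gremlin.
require_vars() {
  _rv_missing=0
  for _rv_name in "$@"; do
    # Indirect lookup: expands to the value of the variable whose name is in _rv_name.
    eval "_rv_value=\${$_rv_name:-}"
    if [ -z "$_rv_value" ]; then
      echo "missing required variable: $_rv_name" >&2
      _rv_missing=1
    fi
  done
  return "$_rv_missing"
}

# Usage (hypothetical variable names):
#   require_vars GREMLIN_TEAM_ID GREMLIN_CLUSTER_ID GREMLIN_TEAM_SECRET || exit 1
```

Reporting every missing name before returning, rather than stopping at the first, saves a round trip when several values are absent.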
Step 2 - Attack types
Per gremlin-home and the broader Gremlin docs:
| Class | Attack | Effect |
|---|---|---|
| Resource | CPU | Spike CPU usage |
| Resource | Memory | Spike memory |
| Resource | Disk I/O | Spike disk I/O |
| Resource | Disk space | Fill disk |
| Network | Latency | Inject latency |
| Network | Packet loss | Drop packets |
| Network | DNS | DNS resolution failure |
| Network | Blackhole | Drop all packets to/from a target |
| State | Shutdown | Shut down or reboot the host |
| State | Process killer | Kill a specific process |
| State | Time travel | Skew the system clock |
| Request | Request injection | Modify HTTP requests in flight |
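
In automation it can be useful to validate an attack name before any request is sent. The sketch below encodes the table above as a lookup; the lowercase, hyphenated names are labels invented for this example and are not claimed to match Gremlin's API identifiers.

```shell
# Map an attack name to its class, mirroring the attack table.
# Names are illustrative labels for this sketch, not Gremlin API identifiers.
attack_class() {
  case "$1" in
    cpu|memory|disk-io|disk-space)       echo "Resource" ;;
    latency|packet-loss|dns|blackhole)   echo "Network" ;;
    shutdown|process-killer|time-travel) echo "State" ;;
    request-injection)                   echo "Request" ;;
    *) echo "unknown attack: $1" >&2; return 1 ;;
  esac
}

# Usage:
#   class=$(attack_class latency) || exit 1
```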
Step 3 - Run an attack via UI
Web UI workflow:
The UI provides safety controls: blast-radius scoping, an abort button, and notifications.
Step 4 - Author a Scenario
A Scenario chains multiple attacks:
# Pseudo-Scenario config (Gremlin's UI exports JSON; this only approximates the structure)
scenario:
  name: "Checkout resilience test"
  attacks:
    - type: latency
      target: { service: checkout }
      length: 5min
      latency: 500ms
    - type: packet-loss
      target: { service: payment }
      length: 5min
      loss-percent: 10
      delay-after-previous: 1min
  abort_conditions:
    - "Sentry error rate > 2%"
    - "Manual abort"
Scenarios follow the chaos-experiment-author "vary real-world events" principle: combinations of attacks approximate real incidents.
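An abort condition such as "error rate > 2%" reduces to a numeric comparison that a script can evaluate while a Scenario runs. The sketch below shows only that comparison; how the current error rate is obtained (from Sentry or elsewhere) is outside its scope, and the function name `should_abort` is hypothetical.

```shell
# Return 0 (meaning "abort") when the observed error rate is strictly above the limit.
# Both arguments are percentages, e.g. 2.5 and 2. awk performs the decimal
# comparison, which POSIX shell integer arithmetic cannot do.
should_abort() {
  awk -v rate="$1" -v limit="$2" 'BEGIN { exit !(rate + 0 > limit + 0) }'
}

# Usage (current_error_rate is assumed to be set by a monitoring query):
#   if should_abort "$current_error_rate" 2; then echo "halt the scenario"; fi
```

The comparison is strict, matching the "> 2%" wording: a rate of exactly 2 does not trigger an abort.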
Step 5 - Reliability score
Per gremlin-home, Gremlin's differentiator is the "Reliability Score" - "individual services" get scores "based on dependency mapping, risk detection, and failure testing."
Score components (per Gremlin docs):
A service moving from "untested" to "score 80" via passing attacks creates an objective improvement signal.
Step 6 - API + automation
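The attack request used in this step has a fixed shape with a few values that change per run. A minimal sketch of building that body from parameters follows, assuming the same field layout as the request in this step and keeping only the `-l` (length in seconds) and `-m` (delay in milliseconds) arguments for brevity. The function name `build_latency_payload` is hypothetical, and the layout should be verified against Gremlin's API reference before use.

```shell
# Build a latency-attack request body.
# Arguments: length in seconds, delay in milliseconds, value of the "app" container label.
# No JSON escaping is performed, so arguments must not contain quotes or backslashes.
build_latency_payload() {
  printf '{"command":{"type":"latency","args":["-l","%s","-m","%s"]},"target":{"type":"Random","containers":{"labels":{"app":"%s"}}}}\n' "$1" "$2" "$3"
}

# Usage: pass the result to curl with -d
#   payload=$(build_latency_payload 300 500 checkout)
```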
curl -X POST "https://api.gremlin.com/v1/attacks/new" \
  -H "Authorization: Key $GREMLIN_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "command": {
      "type": "latency",
      "args": ["-l", "300", "-m", "500", "-c", "5", "-h", "^api\\.example\\.com$"]
    },
    "target": {
      "type": "Random",
      "containers": { "labels": { "app": "checkout" } }
    }
  }'
The API enables CI integration:
- name: Trigger Gremlin scenario
  run: |
    # --fail makes the step exit non-zero when the API returns an HTTP error
    curl --fail -X POST "https://api.gremlin.com/v1/scenarios/${{ vars.SCENARIO_ID }}/runs" \
      -H "Authorization: Key ${{ secrets.GREMLIN_API_KEY }}"
- name: Wait + verdict
  run: sleep 600 && ./scripts/datadog-verdict.sh
Step 7 - Compliance + audit
Gremlin's enterprise tier (per gremlin-home's positioning) provides:
This matters for regulated industries where audit is non-negotiable.
Anti-patterns
| Anti-pattern | Why it fails | Fix |
|---|---|---|
| Manual UI-only attacks | Doesn't scale; chaos principle 4 calls for automating experiments. | API-driven scenarios (Step 6). |
| Skipping abort conditions | Attack runs past safety threshold. | Define abort signals (Step 4). |
| Treating Reliability Score as the only signal | Score is service-level; per-attack verdicts matter too. | Track both: Score (trend) and per-attack verdicts (detail). |
| One-shot installation; team forgets | License paid; not used. | Schedule attacks; build into release process. |
| Production attacks without playbook | Real incident if attack escalates. | Per chaos-experiment-author: blast radius + abort. |
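
For the "Skipping abort conditions" and "Production attacks without playbook" rows, one option is a small watchdog that polls a health check for the length of the attack and runs an abort command on the first failure. The sketch below is generic shell, not a Gremlin feature; the function name `watch_and_abort` and both commands passed to it are hypothetical and supplied by the caller.

```shell
# Poll a health check while an attack runs; run an abort command on the first failure.
# Arguments: total seconds to watch, seconds between polls (must be > 0),
#            health-check command, abort command.
watch_and_abort() {
  _wa_remaining=$1
  _wa_interval=$2
  _wa_check=$3
  _wa_abort=$4
  while [ "$_wa_remaining" -gt 0 ]; do
    if ! sh -c "$_wa_check"; then
      sh -c "$_wa_abort"
      return 1
    fi
    sleep "$_wa_interval"
    _wa_remaining=$((_wa_remaining - _wa_interval))
  done
  return 0
}

# Usage (both scripts are placeholders):
#   watch_and_abort 600 15 ./scripts/health-check.sh ./scripts/halt-attack.sh
```

A non-zero return tells the calling pipeline that the run was aborted, so the release step can fail instead of reporting a verdict on a partial experiment.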