
api-chaos-runner

Builds a workflow that runs the project's existing API tests under injected network chaos - latency, timeouts, dropped connections, bandwidth caps, packet loss - using Toxiproxy as the proxy layer (with notes on alternatives Pumba / Gremlin / LitmusChaos). Defines a chaos matrix per test scenario, runs each, and reports which assertions break under which conditions. Use when the API surface needs to verify resilience patterns (retry, circuit-breaker, timeout, fallback) actually work.

Overview

Most API tests run against a near-perfect network: <1ms latency, no packet loss, ample bandwidth, deterministic ordering. Production networks are not like that. Network chaos testing runs the existing tests under controlled network impairment, so the team learns which retry / circuit-breaker / timeout patterns actually hold up before real customers find out.

The canonical open-source primitive is Toxiproxy - Shopify's "TCP proxy to simulate network and system conditions for chaos and resiliency testing" (toxiproxy-readme). The pattern: sit Toxiproxy between client and upstream; manipulate toxics (latency, timeout, bandwidth, etc.) during test execution.

This skill is build-an-X - the workflow chains the team's existing API tests (Postman / Karate / RestAssured / Tavern / Schemathesis) through a Toxiproxy-managed connection and orchestrates a per-scenario chaos matrix.

When to use

  • The API has documented resilience requirements (retry on 5xx, circuit-break on 3 consecutive failures, timeout at 5s, etc.) and the team needs to verify them, not just document them.
  • A new external dependency just got added; the team wants to pressure-test fallback behavior.
  • An incident postmortem identified a "we should have caught this in testing" item; chaos coverage is the prevention mechanism for that class.
  • The team has integration tests already and wants to multiply their signal value via fault injection.

If the team is just starting API testing and has no resilience patterns to verify, this skill is overkill - start with happy-path coverage via postman-collections or the language-native equivalents first.

Step 1 - Pick the chaos primitive

| Tool | Layer | Best for |
|------|-------|----------|
| Toxiproxy | TCP proxy | Per-connection latency / bandwidth / timeout / drop. Most precise. |
| Pumba | Docker container | Container-level chaos (kill, pause, network). |
| Gremlin (commercial) | Multi-platform | Production-grade chaos with audit / approval flow. |
| LitmusChaos | Kubernetes operator | Cloud-native; experiments declared as CRDs. |
| tc qdisc (Linux native) | Network interface | Lowest level; most setup; CI-friendly only with --cap-add NET_ADMIN. |

Default recommendation: Toxiproxy for per-API chaos in CI. The others fit when the team is already in those ecosystems (Docker-Compose-heavy projects, Kubernetes-first projects).

Step 2 - Define the chaos matrix

For each existing API test scenario, define a matrix of conditions to run it under:

| Scenario | Toxic | Expected behavior |
|----------|-------|-------------------|
| Order create (POST /orders) | None (control) | 201 in <500ms |
| Order create | latency=1000ms | 201 in <2s (within timeout budget) |
| Order create | latency=10000ms | 504 with retry-after, OR client gives up |
| Order create | bandwidth rate=10 (10 KB/s) | 201 (eventual) OR 408 timeout |
| Order create | reset_peer | 502 with retry attempted |
| Order create | timeout | 504; circuit-breaker opens after 3rd consecutive failure |

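Per toxiproxy-readme, the toxic types are latency, bandwidth, slow_close, timeout, reset_peer, slicer, and limit_data. The README also lists down, but it is not implemented as a toxic: a proxy is taken down by disabling it (setting its enabled field to false through the control API), after which clients can no longer connect through it.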

The matrix is the load-bearing artifact: what the team expects under each condition is what differentiates resilience verification from "did the test pass?" The Expected column drives the assertions.
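
One way to keep the matrix reviewable is to store it as data next to the tests. The sketch below assumes a pipe-delimited layout (label, toxic type, attribute, expected behavior); that layout and the function names are hypothetical conventions for illustration, not a Toxiproxy format.

```shell
# Hypothetical chaos matrix: label|toxic type|toxic attribute|expected behavior.
# The column layout is an assumption for this sketch, not a Toxiproxy format.
chaos_matrix() {
  cat <<'EOF'
control|||201 in <500ms
latency-1s|latency|latency=1000|201 in <2s
latency-10s|latency|latency=10000|504 or client gives up
bandwidth-10KBps|bandwidth|rate=10|201 (eventual) or 408
reset-peer|reset_peer|timeout=0|502 with retry attempted
timeout|timeout|timeout=5000|504; breaker opens after 3rd failure
EOF
}

# Print the expected behavior for one row, looked up by label.
expected_for() {
  chaos_matrix | awk -F'|' -v label="$1" '$1 == label { print $4 }'
}

# Print "type attribute" for one row; prints nothing for the control row.
toxic_for() {
  chaos_matrix | awk -F'|' -v label="$1" '$1 == label && $2 != "" { print $2, $3 }'
}
```

Keeping the Expected column in the same file as the toxic settings means a reviewer can see, in one diff, both what is injected and what the assertions should check.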

Step 3 - Wire Toxiproxy into the test environment

Setup (Docker example)

# docker-compose.test.yml
services:
  toxiproxy:
    image: ghcr.io/shopify/toxiproxy:latest
    ports:
      - 8474:8474   # control API
      - 5432:5432   # proxied DB
      - 8080:8080   # proxied API
  app:
    build: .
    environment:
      DATABASE_URL: 'postgres://user:pass@toxiproxy:5432/db'
      EXTERNAL_API_URL: 'http://toxiproxy:8080'

Per toxiproxy-readme, the application points at Toxiproxy's listen ports rather than at the upstream. With no toxics configured, Toxiproxy forwards traffic to the real upstream unmodified.
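
Because the containers start concurrently, the test job may reach Toxiproxy before its control API is listening. A minimal sketch of a wait loop follows; wait_for is a hypothetical helper, and the example URL assumes the control port is published on localhost:8474 as in the compose file above.

```shell
# Retry a command until it succeeds or the attempt budget runs out.
# Usage: wait_for <attempts> <delay-seconds> <command...>
wait_for() {
  attempts="$1"; delay="$2"; shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Example (assumes the control port is published on localhost:8474):
# wait_for 30 1 curl -sf http://localhost:8474/version
```

The /version endpoint is part of Toxiproxy's HTTP API, so a successful response is a reasonable readiness signal before proxies are registered.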

Define proxies via the control API

# Register the upstream
curl -d '{"name":"orders-api","listen":"0.0.0.0:8080","upstream":"orders-api-real:8080"}' \
  http://toxiproxy:8474/proxies
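
Posting to /proxies a second time for a name that already exists is rejected, which can make re-run CI jobs fail. A hedged alternative is the control API's /populate endpoint, which creates or replaces a list of proxies. In this sketch, proxy_json, populate_proxy, and the TOXIPROXY_URL variable are hypothetical names.

```shell
# Build the JSON body for one proxy definition.
# $1 = proxy name, $2 = listen address, $3 = upstream address
proxy_json() {
  printf '{"name":"%s","listen":"%s","upstream":"%s","enabled":true}' "$1" "$2" "$3"
}

# Create-or-replace a proxy through /populate, which takes a JSON array.
# TOXIPROXY_URL is a hypothetical variable for this sketch.
populate_proxy() {
  curl -sf -X POST -d "[$(proxy_json "$@")]" \
    "${TOXIPROXY_URL:-http://localhost:8474}/populate"
}

# Example:
# populate_proxy orders-api 0.0.0.0:8080 orders-api-real:8080
```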

Add a toxic during a test

Per toxiproxy-readme:

# 1000ms latency on every request through this proxy
toxiproxy-cli toxic add -t latency -a latency=1000 orders-api

# Bandwidth cap at 10 KB/s
toxiproxy-cli toxic add -t bandwidth -a rate=10 orders-api

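# Timeout toxic: stop all data, then close the connection after 5000ms
toxiproxy-cli toxic add -t timeout -a timeout=5000 orders-api

# Remove one toxic by name (default names are <type>_<stream>, e.g. latency_downstream)
toxiproxy-cli toxic remove -n <toxic-name> orders-api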

For a scoped add-test-remove cycle inside the test code itself, the language-native client libraries listed in toxiproxy-readme (toxiproxy-python, toxiproxy-node-client, toxiproxy-ruby, toxiproxy-go) wrap the HTTP API.
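
Where no client library is available, the same add-test-remove cycle can be sketched directly against the HTTP API. The helper names and the TOXIPROXY_URL variable below are hypothetical; the endpoints used are POST /proxies/{proxy}/toxics and DELETE /proxies/{proxy}/toxics/{toxic}.

```shell
TOXIPROXY_URL="${TOXIPROXY_URL:-http://localhost:8474}"   # hypothetical variable

# JSON body for a downstream toxic with a single numeric attribute.
# $1 = toxic name, $2 = type, $3 = attribute name, $4 = attribute value
toxic_json() {
  printf '{"name":"%s","type":"%s","stream":"downstream","toxicity":1.0,"attributes":{"%s":%s}}' \
    "$1" "$2" "$3" "$4"
}

# Add a toxic, run a command, then remove the toxic again.
# Returns the command's exit status (2 if the toxic could not be added).
with_toxic() {
  proxy="$1"; name="$2"; type="$3"; attr="$4"; value="$5"; shift 5
  curl -sf -X POST -d "$(toxic_json "$name" "$type" "$attr" "$value")" \
    "$TOXIPROXY_URL/proxies/$proxy/toxics" >/dev/null || return 2
  "$@"; status=$?
  curl -sf -X DELETE "$TOXIPROXY_URL/proxies/$proxy/toxics/$name" >/dev/null
  return "$status"
}

# Example:
# with_toxic orders-api slow latency latency 1000 npx newman run collections/orders.postman_collection.json
```

Naming the toxic explicitly makes the cleanup call deterministic, since the DELETE does not have to guess a generated name.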

Step 4 - Run the matrix

A minimal runner shell script:

#!/usr/bin/env bash
# scripts/chaos-matrix.sh
# Runs the Newman collection once per chaos condition, writing one JUnit file per run.
set -e

PROXY=orders-api

run_with_toxic() {
  local label="$1" type="$2"
  shift 2   # remaining arguments are passed to `toxic add` as attribute flags
  echo "=== $label ==="
  # Clear leftovers from the previous row. toxiproxy-cli names a toxic
  # <type>_<stream> by default, e.g. latency_downstream.
  local t
  for t in latency bandwidth timeout reset_peer; do
    toxiproxy-cli toxic remove -n "${t}_downstream" "$PROXY" 2>/dev/null || true
  done
  if [ -n "$type" ]; then
    toxiproxy-cli toxic add -t "$type" "$@" "$PROXY"
  fi
  # Don't bail on test failures; we want the whole matrix.
  npx newman run collections/orders.postman_collection.json \
    -e environments/chaos.json \
    -r cli,junit --reporter-junit-export "results-$label.xml" || true
}

run_with_toxic 'control'    ''
run_with_toxic 'latency-1s' latency    -a latency=1000
run_with_toxic 'bandwidth'  bandwidth  -a rate=10
run_with_toxic 'timeout'    timeout    -a timeout=5000
run_with_toxic 'reset-peer' reset_peer -a timeout=0

The matrix produces one JUnit XML per scenario. Aggregate them in the report stage.
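
The aggregation step can start as a few lines of text processing. The sketch below assumes the results-<label>.xml naming used by the runner and sums the failures and errors attributes of each testsuite element; it is attribute matching, not a real XML parser, and the function names are hypothetical.

```shell
# Sum the failures="N" and errors="N" attributes of every <testsuite> element
# in one JUnit XML file. Rough text matching, not a real XML parser.
junit_failures() {
  grep -o -E '<testsuite [^>]*>' "$1" \
    | grep -o -E '(failures|errors)="[0-9]+"' \
    | cut -d'"' -f2 \
    | awk '{ total += $1 } END { print total + 0 }'
}

# Print "label failures" for every results-<label>.xml in a directory.
summarize_results() {
  for f in "$1"/results-*.xml; do
    [ -e "$f" ] || continue
    label=$(basename "$f" .xml)
    printf '%s %s\n' "${label#results-}" "$(junit_failures "$f")"
  done
}
```

Matching only testsuite elements avoids double counting when a reporter repeats totals on the enclosing testsuites element.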

Step 5 - Report what broke under what

A successful chaos run produces a resilience matrix report:

## API Chaos Matrix — verdict: REVIEW

| Scenario         | Control | Latency 1s | Bandwidth 10k | Timeout 5s | Reset peer |
|------------------|:-------:|:----------:|:-------------:|:----------:|:----------:|
| POST /orders     |    ✅   |     ✅     |       ✅      |     ✅     |     ❌     |
| GET /orders/:id  |    ✅   |     ✅     |       ❌      |     ✅     |     ✅     |
| DELETE /orders/:id |  ✅   |     ✅     |       ✅      |     ✅     |     ✅     |

### Failures

| Test | Toxic       | Expected                                          | Actual |
|------|-------------|---------------------------------------------------|--------|
| POST /orders | reset_peer | 502 + retry attempted; second attempt succeeds | 502; no retry observed in client logs |
| GET /orders/:id | bandwidth=10k | 200 in <30s | 408 timeout at 10s |

A green matrix isn't the goal - finding where resilience is missing is the goal. A failure under a chaos scenario is a feature request, not a bug in the test.
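
A report row like the ones above can be rendered from per-condition failure counts. This sketch assumes input lines of the form "label failures" on stdin; render_row and the PASS verdict label are hypothetical additions (only REVIEW appears in the report above).

```shell
# Read "label failures" lines on stdin; print one markdown table row for the
# given scenario, followed by an overall verdict line.
render_row() {
  awk -v scenario="$1" '
    { cells = cells " | " ($2 == 0 ? "✅" : "❌"); if ($2 != 0) bad++ }
    END {
      printf "| %s%s |\n", scenario, cells
      printf "verdict: %s\n", (bad ? "REVIEW" : "PASS")
    }'
}

# Example:
# printf 'control 0\nlatency-1s 0\nreset-peer 1\n' | render_row 'POST /orders'
```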

Choosing what to inject

Match toxics to documented resilience requirements:

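| Resilience pattern documented | Toxic to inject |
|-------------------------------|-----------------|
| Retry on 5xx | down (proxy disabled; the 5xx comes from whichever HTTP layer sits in front of the unreachable upstream) or reset_peer |
| Timeout after Nms | latency=N+500 (force the timeout) |
| Circuit-breaker after 3 failures | down for ≥3 requests |
| Fallback to cache when upstream unreachable | down indefinitely |
| Bulkhead under load | bandwidth=very-low |
| Slow-loris client | slow_close on response |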

Run only the toxics that map to a documented expectation; running every toxic against every endpoint is noise.
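
The mapping can be encoded so the runner derives toxic settings from the documented requirement instead of hand-copied numbers. In this sketch the requirement keywords (timeout, retry, bulkhead) and the function name are hypothetical; only three rows of the mapping are covered.

```shell
# Map a documented resilience requirement to toxiproxy-cli flags.
# The requirement keywords are hypothetical labels for this sketch.
toxic_args_for() {
  case "$1" in
    timeout)   # $2 = client timeout in ms; overshoot it by 500ms
      printf '%s' "-t latency -a latency=$(( $2 + 500 ))" ;;
    retry)     # reset the connection immediately
      printf '%s' "-t reset_peer -a timeout=0" ;;
    bulkhead)  # $2 = bandwidth cap in KB/s
      printf '%s' "-t bandwidth -a rate=$2" ;;
    *)
      echo "no toxic mapped for requirement: $1" >&2
      return 1 ;;
  esac
}

# Example: a documented 5s client timeout becomes 5500ms of injected latency.
# toxiproxy-cli toxic add $(toxic_args_for timeout 5000) orders-api
```

Failing on an unmapped requirement keeps the matrix tied to documented expectations rather than growing into every-toxic-everywhere noise.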

Anti-patterns

| Anti-pattern | Why it fails | Fix |
|--------------|--------------|-----|
| Chaos in production | Real users observe; oncall pages. | CI / staging only. Production chaos requires the team's full chaos engineering practice (Gremlin / Litmus + approval flow). |
| Per-PR chaos matrix | Adds 10+ minutes; team disables. | Nightly chaos runs; PR runs only the control row. |
| Asserting "chaos must not break anything" | Every system has a breaking point; the test trivially fails. | Assert specific resilience behavior under specific conditions; document the breaking point as accepted. |
| Using down for everything | down only models a fully unreachable upstream; it doesn't model real-world latency / bandwidth. | Mix latency, bandwidth, timeout, reset_peer for realistic mixes. |
| Skipping the control row | Without control, the matrix can't distinguish chaos failures from test bugs. | Always run a no-toxic scenario as the baseline. |
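
Two of these fixes (PR runs only the control row; always check the baseline) are easy to enforce in the runner. A minimal sketch, assuming a hypothetical run-mode argument and the row labels used in the runner script:

```shell
# Pick matrix rows by run mode: PR builds run only the control row,
# nightly builds run the full matrix. The mode names are hypothetical.
rows_for_mode() {
  case "${1:-pr}" in
    nightly) printf '%s\n' control latency-1s bandwidth timeout reset-peer ;;
    *)       printf '%s\n' control ;;
  esac
}

# Succeeds only when the control row had zero failures; chaos results
# should not be interpreted when the baseline itself is broken.
# $1 = failure count of the control row
baseline_ok() {
  [ "$1" -eq 0 ]
}
```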

Limitations

  • Toxiproxy is TCP-level. UDP flaws (DNS resolver weirdness, QUIC) need different tooling.
  • Doesn't model partial failures within a single connection. Toxiproxy treats each connection uniformly; for "retry on the 3rd request after 2 successes" patterns, layer a counting middleware.
  • Per-test-suite isolation. When tests run in parallel against the same Toxiproxy instance, they fight over the same toxic state; serialize chaos scenarios or use one Toxiproxy instance per worker.
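
For the one-instance-per-worker option, each worker needs its own control and listen ports. The sketch below derives them from a worker index; the base ports match the compose example, while the stride of 100 and the function name are assumptions.

```shell
# Derive a private block of ports for each parallel test worker.
# Base ports follow the compose example; the stride of 100 is an assumption.
# $1 = zero-based worker index
worker_ports() {
  offset=$(( $1 * 100 ))
  printf 'control=%s api=%s\n' "$(( 8474 + offset ))" "$(( 8080 + offset ))"
}

# Example: worker 2 would talk to a Toxiproxy control API on port 8674.
# worker_ports 2
```

Each worker would then start its own toxiproxy-server on the derived control port and register its proxy on the derived listen port, so toxic state is never shared between workers.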

References