qa-chaos

Chaos engineering + fault injection: 8 skills (chaos-experiment-author, chaos-mesh, chaos-results-reporter, failure-injection-test-author, gremlin-chaos, litmus-chaos, steady-state-hypothesis-validator, toxiproxy-chaos) and 1 agent (chaos-drill-orchestrator).

Install this plugin

/plugin install qa-chaos@testland-qa

Part of role bundle: qa-role-performance

Chaos engineering + fault injection per the Principles of Chaos Engineering. Litmus / Chaos Mesh (Kubernetes-native), Gremlin (commercial multi-platform), Toxiproxy (TCP-level), structured chaos experiment authoring, and combined HTTP+TCP fault injection scenarios.

Components

Type	Name	Description
Skill	chaos-experiment-author	Build-an-X workflow for a chaos experiment per the Principles of Chaos Engineering - defines steady-state hypothesis, picks the variables (real-world events: network latency, node failure, region outage), sets the blast radius (which percentage / namespace / user cohort), automates execution, and emits the verdict (steady-state held / didn't hold). Use to scope a chaos experiment before running it via Litmus / Chaos Mesh / Gremlin / Toxiproxy.
Skill	litmus-chaos	Configures LitmusChaos for Kubernetes-native chaos engineering - installs via Helm, picks ChaosExperiments from the ChaosHub (`pod-delete`, `network-latency`, `node-cpu-hog`, etc.), authors a ChaosEngine CR scoping the experiment + steady-state probes, runs as part of the cluster, exports Prometheus metrics for the verdict. Use when the platform is Kubernetes (CNCF-hosted; cloud-native).
Skill	chaos-mesh	Configures Chaos Mesh for Kubernetes-native chaos engineering - picks fault types (PodChaos, NetworkChaos, StressChaos, IOChaos, TimeChaos, DNSChaos, KernelChaos, HTTPChaos), targets via label selectors, controls blast radius via namespace whitelists + selector filters, schedules via CronJobs, observes via dashboard. Distinct from Litmus by architecture (Chaos Mesh has its own dashboard + workflow orchestration; Litmus uses ChaosCenter UI).
Skill	gremlin-chaos	Configures Gremlin (commercial) for cross-platform chaos engineering - installs the Gremlin agent on Linux / Windows / Kubernetes, picks attack types (resource, network, state, request), creates Scenarios chaining attacks, integrates with the Reliability Score for forward-looking metrics. Use when the platform spans multiple environments (bare metal + cloud + serverless) and the team needs a commercial-supported solution per Gremlin's multi-platform support.
Skill	toxiproxy-chaos	Configures Toxiproxy for TCP-level fault injection - runs as a sidecar / proxy between client and upstream, applies toxics (latency, bandwidth, slow_close, timeout, slicer, limit_data, reset_peer) via control API. Sister to api-chaos-runner (qa-api-testing) but focused on the proxy itself + non-test usage (chaos in dev environments, integration tests, pre-prod simulation). Use when the team needs TCP-precise fault injection in development / integration environments without K8s or commercial tooling.
Skill	failure-injection-test-author	Build-an-X workflow that combines WireMock fault stubs (HTTP-level fault: 500s, malformed JSON, slow responses) with Toxiproxy (TCP-level: latency, packet loss, reset) into one orchestrated test scenario - the test starts both, applies fault per scenario, runs the SUT against the impaired endpoints, verifies the SUT's resilience patterns. Use when neither pure HTTP fault stubs nor pure TCP chaos covers the actual production failure modes - most real failures span both layers.
Agent	chaos-drill-orchestrator	Action-taking orchestrator that runs a full chaos drill end-to-end - pre-flight checks → experiment injection (via chaos-experiment-author + chosen runner: Chaos Mesh / Litmus / Gremlin / Toxiproxy) → blast-radius monitoring → automatic abort if blast radius exceeds bounds → recovery validation. Distinct from `qa-chaos/chaos-experiment-author` (authors ONE experiment file). This agent orchestrates the four-stage drill workflow, not a single experiment. Use when running a planned chaos drill against a non-prod environment and the team wants the full pre-flight → inject → monitor → recover loop executed as one workflow.
Skill	chaos-results-reporter	Aggregate chaos-drill verdicts over time into a resilience trend report.
Skill	steady-state-hypothesis-validator	Pre-flight validate a chaos experiment's steady-state hypothesis (measurable, baselined, meaningful).

Install

/plugin marketplace add testland/qa
/plugin install qa-chaos@testland-qa

Skills

chaos-experiment-author

Build-an-X workflow for a chaos experiment per the Principles of Chaos Engineering - defines steady-state hypothesis, picks the variables (real-world events: network latency, node failure, region outage), sets the blast radius (which percentage / namespace / user cohort), automates execution, and emits the verdict (steady-state held / didn't hold). Use to scope a chaos experiment before running it via Litmus / Chaos Mesh / Gremlin / Toxiproxy.
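
As a rough illustration of the pieces listed above, the sketch below models an experiment as a small Python object. It is not the skill's actual output format: the `ChaosExperiment` class, its field names, and the simulated metric are all hypothetical, and the fault is simulated with an in-memory flag rather than a real runner.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ChaosExperiment:
    """Hypothetical container for the parts of an experiment named above."""
    name: str
    hypothesis: str                  # human-readable steady-state statement
    metric: Callable[[], float]      # returns the current steady-state metric
    lower_bound: float               # hypothesis holds while metric >= this value
    variable: str                    # real-world event to inject
    blast_radius: str                # e.g. namespace, percentage, user cohort
    inject: Callable[[], None]       # starts the fault
    rollback: Callable[[], None]     # removes the fault

    def steady_state_holds(self) -> bool:
        return self.metric() >= self.lower_bound

    def run(self) -> str:
        # Refuse to inject if the system is unhealthy before the experiment.
        if not self.steady_state_holds():
            return "aborted: steady state not met before injection"
        self.inject()
        try:
            held = self.steady_state_holds()
        finally:
            self.rollback()          # always remove the fault
        return "steady-state held" if held else "steady-state didn't hold"


# Simulated system: the success rate drops to 0.91 while the fault is active.
state = {"fault": False}
experiment = ChaosExperiment(
    name="checkout-latency",
    hypothesis="checkout success rate stays at or above 99%",
    metric=lambda: 0.91 if state["fault"] else 0.999,
    lower_bound=0.99,
    variable="network latency",
    blast_radius="namespace=staging, 10% of pods",
    inject=lambda: state.update(fault=True),
    rollback=lambda: state.update(fault=False),
)
verdict = experiment.run()
print(verdict)
```

Keeping the rollback in a `finally` block is one way to make sure the fault is removed even if the measurement itself raises.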

chaos-mesh

Configures Chaos Mesh for Kubernetes-native chaos engineering - picks fault types (PodChaos, NetworkChaos, StressChaos, IOChaos, TimeChaos, DNSChaos, KernelChaos, HTTPChaos), targets via label selectors, controls blast radius via namespace whitelists + selector filters, schedules via CronJobs, observes via dashboard. Distinct from Litmus by architecture (Chaos Mesh has its own dashboard + workflow orchestration; Litmus uses ChaosCenter UI).
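
For orientation, here is a minimal sketch of what a NetworkChaos resource scoped by namespace and label selector might look like, built as a Python dict and printed as JSON (which `kubectl apply -f` accepts alongside YAML). The resource name, namespace, and labels are placeholders, and the field layout follows the `chaos-mesh.org/v1alpha1` schema as commonly documented; check it against the Chaos Mesh version in use.

```python
import json


def network_delay_manifest(name, namespace, labels, latency="200ms", duration="60s"):
    """Build a NetworkChaos manifest as a plain dict (names are placeholders)."""
    return {
        "apiVersion": "chaos-mesh.org/v1alpha1",
        "kind": "NetworkChaos",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "action": "delay",
            "mode": "all",                        # apply to every selected pod
            "selector": {
                "namespaces": [namespace],        # blast radius: one namespace
                "labelSelectors": labels,         # blast radius: matching pods only
            },
            "delay": {"latency": latency},
            "duration": duration,                 # fault is removed after this time
        },
    }


manifest = network_delay_manifest("checkout-delay", "staging", {"app": "checkout"})
print(json.dumps(manifest, indent=2))
```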

chaos-results-reporter

Aggregates chaos drill verdicts over time into a resilience trend report - per-experiment hypothesis-held / blast-radius / time-to-detect / time-to-recover, degradation trends across runs, action items, and a stakeholder summary. Use when a team has completed one or more chaos drills and needs a structured trend report showing whether resilience is improving, degrading, or stable across iterations.
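
One possible way to derive the improving / degrading / stable label is sketched below. The `DrillResult` record, the half-versus-half comparison, and the 10% tolerance are illustrative assumptions, not the skill's documented method.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class DrillResult:
    """Hypothetical record of one drill run."""
    experiment: str
    hypothesis_held: bool
    time_to_detect_s: float
    time_to_recover_s: float


def recovery_trend(results, tolerance=0.10):
    """Compare mean time-to-recover of the later half of runs with the earlier half."""
    if len(results) < 2:
        return "insufficient data"
    mid = len(results) // 2
    earlier = mean(r.time_to_recover_s for r in results[:mid])
    later = mean(r.time_to_recover_s for r in results[mid:])
    if later < earlier * (1 - tolerance):
        return "improving"
    if later > earlier * (1 + tolerance):
        return "degrading"
    return "stable"


def summarize(results):
    """Return a small per-experiment summary dict for a run history."""
    held = sum(1 for r in results if r.hypothesis_held)
    return {
        "runs": len(results),
        "hypothesis_held_rate": held / len(results),
        "mean_time_to_detect_s": mean(r.time_to_detect_s for r in results),
        "recovery_trend": recovery_trend(results),
    }


# Made-up history for one experiment, oldest run first.
history = [
    DrillResult("pod-delete", False, 40, 120),
    DrillResult("pod-delete", True, 30, 90),
    DrillResult("pod-delete", True, 20, 60),
    DrillResult("pod-delete", True, 10, 45),
]
report = summarize(history)
print(report)
```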

failure-injection-test-author

Build-an-X workflow that combines WireMock fault stubs (HTTP-level fault: 500s, malformed JSON, slow responses) with Toxiproxy (TCP-level: latency, packet loss, reset) into one orchestrated test scenario - the test starts both, applies fault per scenario, runs the SUT against the impaired endpoints, verifies the SUT's resilience patterns. Use when neither pure HTTP fault stubs nor pure TCP chaos covers the actual production failure modes - most real failures span both layers.
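
To make the "both layers" idea concrete, the sketch below maps a few scenarios to the admin calls a test might issue: a WireMock stub mapping (`POST /__admin/mappings`) and, where needed, a Toxiproxy toxic (`POST /proxies/{name}/toxics`). The scenario names, URL path, and proxy name are hypothetical; the code only builds the request bodies and sends nothing.

```python
# Each scenario pairs an HTTP-level WireMock response with an optional TCP-level toxic.
SCENARIOS = {
    "http-500": {"wiremock": {"status": 500}, "toxic": None},
    "malformed-body": {"wiremock": {"fault": "MALFORMED_RESPONSE_CHUNK"}, "toxic": None},
    "slow-and-laggy": {
        "wiremock": {"status": 200, "fixedDelayMilliseconds": 2000},
        "toxic": {"type": "latency", "attributes": {"latency": 500}},
    },
    "connection-reset": {
        "wiremock": {"status": 200},
        "toxic": {"type": "reset_peer", "attributes": {"timeout": 0}},
    },
}


def build_plan(scenario, url_path, proxy):
    """Return the (method, path, body) admin calls needed to set up one scenario."""
    spec = SCENARIOS[scenario]
    calls = [(
        "POST",
        "/__admin/mappings",                      # WireMock admin API
        {"request": {"method": "GET", "urlPath": url_path}, "response": spec["wiremock"]},
    )]
    if spec["toxic"] is not None:
        calls.append((
            "POST",
            f"/proxies/{proxy}/toxics",           # Toxiproxy control API
            {"stream": "downstream", **spec["toxic"]},
        ))
    return calls


plan = build_plan("slow-and-laggy", "/inventory", "inventory-proxy")
for method, path, body in plan:
    print(method, path, body)
```

A test would send the first call to the WireMock server and the second to the Toxiproxy control port, then point the SUT at the proxy's listen address.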

gremlin-chaos

Configures Gremlin (commercial) for cross-platform chaos engineering - installs the Gremlin agent on Linux / Windows / Kubernetes, picks attack types (resource, network, state, request), creates Scenarios chaining attacks, integrates with the Reliability Score for forward-looking metrics. Use when the platform spans multiple environments (bare metal + cloud + serverless) and the team needs a commercial-supported solution per Gremlin's multi-platform support.

litmus-chaos

Configures LitmusChaos for Kubernetes-native chaos engineering - installs via Helm, picks ChaosExperiments from the ChaosHub (`pod-delete`, `network-latency`, `node-cpu-hog`, etc.), authors a ChaosEngine CR scoping the experiment + steady-state probes, runs as part of the cluster, exports Prometheus metrics for the verdict. Use when the platform is Kubernetes (CNCF-hosted; cloud-native). Prefer over chaos-mesh when the team wants a ChaosCenter web UI for workflow scheduling and ChaosHub catalog browsing; use chaos-mesh for fine-grained network-fault policies via its own CRD family.
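
A minimal sketch of a ChaosEngine scoping a `pod-delete` experiment is shown below, again as a Python dict printed as JSON. The application labels, namespace, and service account name are placeholders; steady-state probes are omitted because their schema differs between Litmus releases, and the `appinfo` block should be checked against the version installed.

```python
import json


def pod_delete_engine(name, namespace, app_label, service_account, duration_s=30):
    """Build a ChaosEngine manifest for pod-delete (all names are placeholders)."""
    return {
        "apiVersion": "litmuschaos.io/v1alpha1",
        "kind": "ChaosEngine",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "engineState": "active",
            "appinfo": {                          # scopes the experiment to one workload
                "appns": namespace,
                "applabel": app_label,
                "appkind": "deployment",
            },
            "chaosServiceAccount": service_account,
            "experiments": [{
                "name": "pod-delete",
                "spec": {"components": {"env": [
                    # Experiment tunables are passed as string-valued env vars.
                    {"name": "TOTAL_CHAOS_DURATION", "value": str(duration_s)},
                    {"name": "CHAOS_INTERVAL", "value": "10"},
                ]}},
            }],
        },
    }


engine = pod_delete_engine("checkout-chaos", "staging", "app=checkout", "pod-delete-sa")
print(json.dumps(engine, indent=2))
```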

steady-state-hypothesis-validator

Validates a chaos experiment's steady-state hypothesis before execution: checks that each probe metric is measurable and observable, that a recent baseline exists, that tolerances are numerically meaningful and SLI-backed, that the measurement window is defined, and that the chosen metrics would actually move under the target failure mode. Use when a chaos experiment has been authored (via chaos-experiment-author) and the team needs a pre-flight verdict before running the drill in any environment.
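
The checks listed above could be approximated as follows. This is only a sketch: the `Probe` fields, the seven-day baseline age limit, and the tolerance range are assumptions chosen for illustration, and the "would the metric actually move" check is left out because it needs domain knowledge rather than a rule.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Probe:
    """Hypothetical description of one steady-state probe."""
    metric: str
    query: Optional[str]               # how the metric is measured; None if unknown
    baseline_at: Optional[datetime]    # when the baseline was last captured
    tolerance: Optional[float]         # allowed deviation as a fraction of baseline
    window_s: Optional[int]            # measurement window in seconds


def validate(probe, now, max_baseline_age=timedelta(days=7)):
    """Return a list of problems; an empty list means the probe passes pre-flight."""
    problems = []
    if not probe.query:
        problems.append("metric is not measurable: no query defined")
    if probe.baseline_at is None or now - probe.baseline_at > max_baseline_age:
        problems.append("baseline is missing or stale")
    if probe.tolerance is None or not (0 < probe.tolerance < 1):
        problems.append("tolerance is missing or not numerically meaningful")
    if not probe.window_s or probe.window_s <= 0:
        problems.append("measurement window is not defined")
    return problems


now = datetime(2024, 1, 15, 12, 0)
good = Probe("checkout_success_rate", "rate(checkout_ok[5m])", datetime(2024, 1, 14), 0.01, 300)
bad = Probe("vibes", None, datetime(2023, 11, 1), 5.0, None)
print(validate(good, now))
print(validate(bad, now))
```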

toxiproxy-chaos

Configures Toxiproxy for TCP-level fault injection - runs as a sidecar / proxy between client and upstream, applies toxics (latency, bandwidth, slow_close, timeout, slicer, limit_data, reset_peer) via control API. Sister to api-chaos-runner (qa-api-testing) but focused on the proxy itself + non-test usage (chaos in dev environments, integration tests, pre-prod simulation). Use when the team needs TCP-precise fault injection in development / integration environments without K8s or commercial tooling.
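
The control API mentioned above is plain HTTP and JSON. The sketch below builds the two requests typically needed, one to create a proxy and one to attach a `latency` toxic, using only the standard library. The proxy name and the listen / upstream addresses are placeholders, `localhost:8474` assumes Toxiproxy's default control port, and the actual send is left commented out so the snippet runs without a server.

```python
import json
from urllib.request import Request, urlopen

TOXIPROXY_API = "http://localhost:8474"   # assumed host; 8474 is the default control port


def _post(path, body):
    """Build (but do not send) a JSON POST request to the control API."""
    return Request(
        TOXIPROXY_API + path,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def create_proxy_request(name, listen, upstream):
    return _post("/proxies", {
        "name": name, "listen": listen, "upstream": upstream, "enabled": True,
    })


def add_latency_request(proxy, latency_ms, jitter_ms=0):
    return _post(f"/proxies/{proxy}/toxics", {
        "name": f"{proxy}_latency",
        "type": "latency",
        "stream": "downstream",            # affect data flowing back to the client
        "toxicity": 1.0,                   # apply to every connection
        "attributes": {"latency": latency_ms, "jitter": jitter_ms},
    })


pending = [
    create_proxy_request("redis", "127.0.0.1:26379", "127.0.0.1:6379"),
    add_latency_request("redis", 1000, 100),
]
for req in pending:
    print(req.get_method(), req.full_url, req.data.decode("utf-8"))
    # urlopen(req)  # uncomment when a Toxiproxy server is running
```

The client under test then connects to the proxy's listen address instead of the real upstream.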

Agents

chaos-drill-orchestrator

Action-taking orchestrator that runs a full chaos drill end-to-end - pre-flight checks → experiment injection (via chaos-experiment-author + chosen runner: Chaos Mesh / Litmus / Gremlin / Toxiproxy) → blast-radius monitoring → automatic abort if blast radius exceeds bounds → recovery validation. Distinct from `qa-chaos/chaos-experiment-author` (authors ONE experiment file). This agent orchestrates the four-stage drill workflow, not a single experiment. Use when running a planned chaos drill against a non-prod environment and the team wants the full pre-flight → inject → monitor → recover loop executed as one workflow.
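
The pre-flight → inject → monitor → recover loop, with its automatic abort, can be sketched as a single function. Everything here is hypothetical: the callables stand in for real pre-flight checks and runners, and blast radius is reduced to a sequence of error-rate samples compared against one bound.

```python
def run_drill(preflight, inject, blast_radius_samples, max_blast_radius, rollback, recovered):
    """Run one drill and return a verdict string."""
    # Stage 1: pre-flight. Do not inject into an unhealthy system.
    if not preflight():
        return "skipped: pre-flight failed"

    # Stage 2: inject the fault through the chosen runner.
    inject()
    aborted = False
    try:
        # Stage 3: monitor; stop as soon as the blast radius exceeds its bound.
        for sample in blast_radius_samples:
            if sample > max_blast_radius:
                aborted = True
                break
    finally:
        rollback()                       # remove the fault on every exit path

    # Stage 4: validate recovery after the fault is gone.
    if not recovered():
        return "failed: system did not recover"
    return "aborted: blast radius exceeded" if aborted else "completed"


# Simulated drill: the third error-rate sample breaches the 10% bound.
events = []
result = run_drill(
    preflight=lambda: True,
    inject=lambda: events.append("inject"),
    blast_radius_samples=[0.02, 0.04, 0.12, 0.03],
    max_blast_radius=0.10,
    rollback=lambda: events.append("rollback"),
    recovered=lambda: True,
)
print(result, events)
```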