qa-resilience-drills

Resilience drills: 6 skills (backup-verification-author, dr-drill-runner, error-budget-tests, mttr-mtbf-tracker, restore-time-tests, slo-negotiation-prep) and 2 agents (dr-drill-orchestrator, reliability-review-agent). Production-grade DR drills + backup verification + RTO + error-budget gating + incident metrics.

Part of role bundle: qa-role-performance


Production-grade resilience discipline - DR drills, backup verification, restore-time SLAs, error budgets, MTTR/MTBF tracking. Distinct from qa-chaos (experiment-authoring) - this plugin covers measured, scheduled drills + the metrics they feed.

Components

Type	Name	Description
Skill	dr-drill-runner	Per-tier RTO + RPO; pre-drill checklist; drill workflow (announce → fail-over → verify → fail-back → cleanup); post-drill report; cold/warm/hot tier-specific patterns; cadence (monthly/quarterly/annual)
Skill	backup-verification-author	Per-backup-type integrity (SHA-256 + signature); restore-to-test-env spot check; partial-restore; cross-region replication SLA; retention-policy verification; encryption + key recovery
Skill	restore-time-tests	TTF segments; baseline timed restore; parallel-restore optimization; PITR latency; partial object-store restore; trend tracking; cold-start latency
Skill	error-budget-tests	SLI calculation; budget consumption; multi-window multi-burn-rate alerting; freeze-trigger when budget exhausted; rolling-window reset; weekly stakeholder reporting
Skill	mttr-mtbf-tracker	Per-incident schema (detected/acknowledged/mitigated/resolved); MTTD / MTTA / MTTR / MTBF formulae; ITIL alignment; postmortem integration; mitigation vs resolution distinction
Skill	slo-negotiation-prep	Build-an-X prep pack for the QA - SRE - Product SLO conversation: current error-budget consumption + MTTR/MTBF trend + framed decision question + 3-5 option matrix (impact / reversibility / stakeholder cost) + recommended posture with cited alternatives.
Agent	dr-drill-orchestrator	Executes a planned DR drill end to end: pre-drill checklist, failover, RTO/RPO monitor, fail-back, post-drill report.
Agent	reliability-review-agent	Composes error-budget burn + MTTR/MTBF into a weekly manager-facing reliability review narrative.

Install

/plugin marketplace add testland/qa
/plugin install qa-resilience-drills@testland-qa

Skills

backup-verification-author

Author a backup-verification harness: per-backup-type integrity (SHA-256 / encrypted-payload signature), restore-to-test-env spot-check cadence, partial-restore (single-table / single-object) verification, cross-region replication validation, and retention-policy assertions. "An untested backup is not a backup."
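As a rough illustration of the SHA-256 integrity step, the sketch below streams a backup artifact through `hashlib` and compares the digest with a recorded value. The function names `sha256_of` and `verify_backup` are hypothetical and are not part of the skill; signature checks on encrypted payloads are out of scope here.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so large backups need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        # iter() with a sentinel stops when read() returns empty bytes (EOF).
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_backup(path: Path, expected_sha256: str) -> bool:
    """Return True when the artifact's digest matches the recorded one.

    Hypothetical helper: the recorded digest is assumed to be a hex string
    stored alongside the backup (for example in a manifest).
    """
    return sha256_of(path) == expected_sha256.strip().lower()
```

A matching digest only shows the bytes are unchanged since the digest was recorded; it does not show the backup can be restored, which is why the restore-to-test-env spot check is listed separately.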

dr-drill-runner

Author and execute a single DR drill for one service: the runbook (per-tier RTO + RPO), a pre-drill checklist (data sync state, alert silencing, customer comms), a timestamped drill workflow (announce, fail-over, verify standby, fail-back), and an auditor-ready post-drill report. Follows the Google Cloud DR planning guide and covers cold / warm / hot standby tier-specific patterns. For coordinating drills across multiple services or teams, use dr-drill-orchestrator.
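One way to picture the timestamped workflow is a small log that enforces phase order and compares measured recovery time with the RTO. This is a sketch only: `DrillLog` and its methods are hypothetical names, and measuring recovery from fail-over start to verified standby is an assumption, not a definition taken from the skill.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta

# Phase order as listed in the skill description.
PHASES = ("announce", "fail-over", "verify", "fail-back")


@dataclass
class DrillLog:
    rto: timedelta
    stamps: dict[str, datetime] = field(default_factory=dict)

    def mark(self, phase: str, at: datetime) -> None:
        """Record a phase timestamp, rejecting out-of-order or extra phases."""
        done = len(self.stamps)
        if done >= len(PHASES) or PHASES[done] != phase:
            raise ValueError(f"unexpected phase {phase!r}")
        self.stamps[phase] = at

    def recovery_time(self) -> timedelta:
        # Assumption: the clock runs from fail-over start to verified standby.
        return self.stamps["verify"] - self.stamps["fail-over"]

    def within_rto(self) -> bool:
        return self.recovery_time() <= self.rto
```

Usage: create `DrillLog(rto=timedelta(minutes=15))`, call `mark()` as each phase starts, and include `recovery_time()` and `within_rto()` in the post-drill report.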

error-budget-tests

Build error-budget gate tests - SLO + error-budget calculation per Google SRE workbook ("difference between target uptime and actual uptime"); burn-rate alerting; monthly-budget exhaustion test; freeze-trigger when budget consumed. Per sre.google embracing-risk reference.
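The arithmetic behind such gates can be sketched as follows. All function names are hypothetical, the inputs are assumed to be good/bad event counts, and the default paging threshold of 14.4 is an assumption: it is the burn rate that spends 2% of a 30-day budget in one hour (0.02 × 720 h).

```python
def error_budget(slo: float) -> float:
    """Allowed failure fraction, e.g. about 0.001 for a 99.9% SLO."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    return 1.0 - slo


def burn_rate(bad: int, total: int, slo: float) -> float:
    """Observed error rate divided by the allowed error rate.

    A value of 1.0 spends the budget exactly over the SLO window.
    """
    if total == 0:
        return 0.0
    return (bad / total) / error_budget(slo)


def budget_remaining(bad: int, total: int, slo: float) -> float:
    """Unspent fraction of the budget when counts cover the whole SLO window.

    Negative when the budget is overspent.
    """
    return 1.0 - burn_rate(bad, total, slo)


def should_page(long_rate: float, short_rate: float, threshold: float = 14.4) -> bool:
    """Multi-window check: both the long and short window must exceed the threshold."""
    return long_rate >= threshold and short_rate >= threshold


def should_freeze(bad: int, total: int, slo: float) -> bool:
    """Freeze trigger: no budget left for the window."""
    return budget_remaining(bad, total, slo) <= 0.0
```

Requiring both windows to exceed the threshold keeps a short spike that has already ended from paging, while the long window alone would keep alerting after recovery.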

mttr-mtbf-tracker

Reference for tracking MTTR (Mean Time To Recovery) / MTBF (Mean Time Between Failures) / MTTD (Mean Time To Detection) / MTTA (Mean Time To Acknowledge) - incident-record schema, calculation formulae, dashboards-as-code, target-vs-actual alerting. Aligns with ITIL incident management + ISO 20000 + Google SRE incident response chapter.
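A minimal sketch of the four calculations, under stated assumptions: the `started` field is added here so that MTTD has a start point (the listed schema covers detected / acknowledged / mitigated / resolved), MTTR is measured from start to mitigation rather than resolution, and MTBF is uptime in the observation window divided by the number of incidents. Definitions vary between teams, so treat these as one possible convention.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass(frozen=True)
class Incident:
    started: datetime        # assumption: not in the listed schema
    detected: datetime
    acknowledged: datetime
    mitigated: datetime      # user impact ended
    resolved: datetime       # root cause fixed; tracked separately


def _mean_delta(deltas: list[timedelta]) -> timedelta:
    return timedelta(seconds=mean(d.total_seconds() for d in deltas))


def mttd(incidents: list[Incident]) -> timedelta:
    return _mean_delta([i.detected - i.started for i in incidents])


def mtta(incidents: list[Incident]) -> timedelta:
    return _mean_delta([i.acknowledged - i.detected for i in incidents])


def mttr(incidents: list[Incident]) -> timedelta:
    # Assumption: recovery means mitigation, not full resolution.
    return _mean_delta([i.mitigated - i.started for i in incidents])


def mtbf(incidents: list[Incident], window: timedelta) -> timedelta:
    # Uptime in the window divided by the number of failures.
    downtime = sum((i.mitigated - i.started for i in incidents), timedelta())
    return (window - downtime) / len(incidents)
```

Each function raises on an empty incident list (`StatisticsError` or `ZeroDivisionError`), which is deliberate in this sketch: a period with no incidents has no meaningful mean.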

restore-time-tests

Build restore-time SLA tests - per-database + per-object-store baseline measurement, RTO objective verification, parallel-restore optimization tests, point-in-time-recovery (PITR) latency. Bound `time-to-functional` (TTF) ≤ documented RTO; flag silent regressions when restore time grows over months.
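The two checks named above, the TTF bound and the slow-drift flag, might look like this in outline. The function names, the use of the median of earlier runs as the baseline, and the 25% growth tolerance are all assumptions chosen for illustration.

```python
from statistics import median


def within_rto(ttf_seconds: float, rto_seconds: float) -> bool:
    """The bound from the skill description: TTF must not exceed the documented RTO."""
    return ttf_seconds <= rto_seconds


def silent_regression(history: list[float], max_growth: float = 0.25) -> bool:
    """Flag drift when the latest restore is much slower than earlier runs.

    `history` holds TTF measurements in seconds, oldest first. The latest
    value is compared with the median of all earlier values, so a restore
    can still pass the RTO check and be flagged here.
    """
    if len(history) < 2:
        return False  # no baseline yet
    *earlier, latest = history
    return latest > median(earlier) * (1 + max_growth)
```

For example, with an RTO of 900 seconds and a history of `[600, 610, 620, 800]`, the latest restore is within RTO but exceeds the baseline of 610 seconds by more than 25%, so it is flagged as a regression.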

slo-negotiation-prep

Build-an-X workflow that produces the manager's prep pack for the QA - SRE - Product SLO conversation - current error-budget consumption + MTTR/MTBF trend + a single framed decision question + an explicit 3-5 option matrix with reversibility / stakeholder cost / impact scoring + recommended posture with cited alternatives. Distinct from `error-budget-tests` (which computes the SLI / SLO / budget math; this skill consumes it) and from `mttr-mtbf-tracker` (pure-reference incident schema; this skill consumes per-incident metrics). Use when budget is burning or a proposed change will stress the SLO - the output is the evidence pack the manager carries into the meeting, not a recommendation about which option to pick.

Agents

dr-drill-orchestrator

Action-taking orchestrator that executes a planned disaster-recovery drill end to end: pre-drill checklist (backup integrity, replication lag, alert silencing) → fail-over execution → RTO/RPO monitor → fail-back → post-drill report with action items. Composes dr-drill-runner (multi-stage runbook), backup-verification-author (integrity + cross-region replication), and restore-time-tests (TTF measurement vs RTO budget). Distinct from chaos-drill-orchestrator in qa-chaos (which injects unrehearsed failures): this agent exercises the documented DR runbook along the rehearsed path. Use when an SRE or QA lead wants the full pre-drill → fail-over → RTO/RPO monitor → fail-back loop executed as one supervised workflow against a non-prod DR environment.

reliability-review-agent

Read-only reporter that composes error-budget burn data (from the error-budget-tests skill) and MTTR/MTBF incident records (from the mttr-mtbf-tracker skill) into a manager-facing weekly reliability narrative covering trend, budget status, top incidents, and recommended actions. Distinct from error-budget-tests (authors gate tests, not prose reports) and from mttr-mtbf-tracker (defines schema and formulae, not narrative synthesis). Use when a QA or SRE manager needs a ready-to-present weekly reliability summary drawn from live incident and SLO data.