e2e-suite-budget
Build-an-X workflow that caps the E2E suite size by computing flakiness ROI per test - for each E2E test, computes (regressions caught × value) ÷ (runtime × flake rate × maintenance cost), ranks all tests by ROI, identifies the bottom decile (low ROI = high cost / low signal), and recommends specific tests to retire / move to lower layer / fix flake. Use quarterly to keep E2E count from growing past the team's maintenance capacity.
Overview
E2E tests are expensive: per the test pyramid (Cohn), they are "brittle, expensive to write, and time consuming to run." Without active management, the E2E suite only grows: a new feature adds 5 tests per sprint, the flake rate creeps up, CI time balloons, the team disables flaky tests, and the remaining coverage becomes illusory.
This skill computes per-test ROI and recommends which tests to retire / move to a lower layer / fix.
When to use
Step 1 - Inputs
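For each E2E test, the skill needs:
- Runtime in minutes (`runtime_min`) and flake rate as a fraction (`flake_rate`), both taken from CI history.
- Regressions caught: the number of real bugs the test has caught, tracked over time.
- Value tier: the importance of the covered flow on a 1-5 scale.
- Maintenance count: the number of PRs that touched the test.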
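As an illustration of the shape these inputs could take, the sketch below builds the four mappings in Python and runs a basic sanity check. The keys `runtime_min` and `flake_rate` match the Step 3 script; the `validate_stats` helper and the PR counts are hypothetical, and the test IDs, runtimes, and flake rates are borrowed from the sample report in Step 4.

```python
# Illustrative inputs for the scoring script. Test IDs, runtimes, and flake rates
# echo the Step 4 sample report; the PR counts are made up for illustration.
stats = {
    "archive-flow.spec.ts > old-orders": {"runtime_min": 2.1, "flake_rate": 0.18},
    "e2e-utils.spec.ts > date-formatting": {"runtime_min": 0.8, "flake_rate": 0.01},
}
regressions = {tid: 0 for tid in stats}   # neither test caught a regression
value_tiers = {tid: 2 for tid in stats}   # both are tier 2 in the sample report
maintenance = {                           # hypothetical PR-touch counts
    "archive-flow.spec.ts > old-orders": 6,
    "e2e-utils.spec.ts > date-formatting": 1,
}


def validate_stats(stats):
    """Return a list of problems found in the per-test stats (hypothetical helper)."""
    problems = []
    for tid, s in stats.items():
        if s.get("runtime_min", 0) <= 0:
            problems.append(f"{tid}: runtime_min must be positive")
        if not 0 <= s.get("flake_rate", -1) <= 1:
            problems.append(f"{tid}: flake_rate must be a fraction in [0, 1]")
    return problems


print(validate_stats(stats))  # prints [] for the inputs above
```

Validating up front keeps a zero runtime or a percentage-style flake rate (18 instead of 0.18) from silently skewing the ranking.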
Step 2 - ROI formula
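ROI = (regressions_caught × value_tier) / (runtime_min × (1 + flake_rate) × (1 + maintenance_count_norm))
Where:
- `regressions_caught`: number of real regressions the test caught (0 if none recorded).
- `value_tier`: value of the covered flow on the 1-5 scale; higher is more valuable, and the scoring script defaults to 3 when a test has no tier.
- `runtime_min`: test runtime in minutes.
- `flake_rate`: flake rate as a fraction (0.03 = 3%).
- `maintenance_count_norm`: number of PRs that touched the test, divided by the suite's median PR count.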
Higher ROI = more value per cost.
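A worked example may make the formula more concrete. The sketch below is a direct transcription of the formula into a function; the function name `roi` and the input values are hypothetical, chosen only to show how the flake rate moves the score.

```python
def roi(regressions_caught, value_tier, runtime_min, flake_rate, maintenance_count_norm):
    """ROI = (regressions x value) / (runtime x (1 + flake) x (1 + maintenance))."""
    cost = runtime_min * (1 + flake_rate) * (1 + maintenance_count_norm)
    if cost <= 0:
        raise ValueError("runtime_min must be positive")
    return (regressions_caught * value_tier) / cost


# Same hypothetical test (3 regressions caught, tier 4, 2 min, median maintenance),
# once with a 2% flake rate and once with a 20% flake rate.
steady = roi(3, 4, 2.0, 0.02, 1.0)   # 12 / (2.0 * 1.02 * 2.0) = 12 / 4.08
flaky = roi(3, 4, 2.0, 0.20, 1.0)    # 12 / (2.0 * 1.20 * 2.0) = 12 / 4.80
never_fired = roi(0, 4, 2.0, 0.02, 1.0)

print(round(steady, 2), round(flaky, 2), never_fired)  # prints 2.94 2.5 0.0
```

Two properties follow from the formula as written: a test that has caught no regressions scores exactly 0 regardless of its cost, and the flake rate enters only through the factor (1 + flake_rate), so a tenfold jump in flakiness here lowers the score by about 15%.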
Step 3 - Per-test scoring
# scripts/e2e-budget.py
# Args: <stats.json> <regressions.json> <value_tiers.json> <maintenance.json>
import json
import sys


def load(path):
    with open(path) as f:
        return json.load(f)


# Load per-test stats from CI history
stats = load(sys.argv[1])        # {test_id: {runtime_min, flake_rate, ...}}
regressions = load(sys.argv[2])  # {test_id: count}
value_tiers = load(sys.argv[3])  # {test_id: tier}
maintenance = load(sys.argv[4])  # {test_id: pr_count}

# Median PR-touch count normalizes maintenance; fall back to 1 if empty or zero
pr_counts = sorted(maintenance.values())
median_pr_count = (pr_counts[len(pr_counts) // 2] if pr_counts else 0) or 1

scores = {}
for tid, s in stats.items():
    rc = regressions.get(tid, 0)   # regressions caught
    vt = value_tiers.get(tid, 3)   # default to the middle tier
    rt = s['runtime_min']
    fr = s['flake_rate']
    mn = maintenance.get(tid, 0) / median_pr_count
    if rt <= 0:
        continue                   # skip tests with no recorded runtime
    scores[tid] = (rc * vt) / (rt * (1 + fr) * (1 + mn))

# Sort ascending: lowest ROI first
ranked = sorted(scores.items(), key=lambda x: x[1])
print(json.dumps(ranked))
Step 4 - Output: bottom decile
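Picking the bottom decile out of the ranked list can be sketched as follows. This is a minimal illustration rather than part of the original script: `bottom_decile` is a hypothetical helper, and it assumes `ranked` is the ascending list of `(test_id, roi)` pairs printed in Step 3.

```python
def bottom_decile(ranked, buckets=10):
    """Return the lowest-ROI tenth of an ascending [(test_id, roi), ...] list."""
    if not ranked:
        return []
    # Integer division avoids float rounding; always flag at least one test.
    n = max(1, len(ranked) // buckets)
    return ranked[:n]


# Hypothetical suite of 142 tests with made-up, already-sorted ROI scores.
ranked = [(f"test-{i}", i * 0.1) for i in range(142)]
flagged = bottom_decile(ranked)
print(len(flagged))  # prints 14
```

With 142 tests this flags 14, matching the count in the sample report. Tests tied at the same ROI keep their input order, so a team may want a secondary sort key (for example runtime) before slicing.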
## E2E suite budget — `<repo>` — Q2 2026
**Total E2E tests:** 142
**Total runtime:** 38 min (per CI run)
**Median flake rate:** 4.2%
**Bottom-decile (14 tests) recommended for action:**
| Test | ROI | Runtime | Flake | Regressions caught | Value tier | Recommendation |
|---------------------------------------------------|----:|--------:|------:|-------------------:|-----------:|----------------|
| `archive-flow.spec.ts > old-orders` | 0.0 | 2.1m | 18% | 0 | 2 | Retire - high flake, no signal in 6mo. |
| `legacy-checkout.spec.ts > deprecated-promo` | 0.0 | 3.2m | 8% | 0 | 1 | Retire - feature deprecated. |
| `cart.spec.ts > add 1000 items` | 0.0 | 4.5m | 2% | 0 | 2 | Move to perf suite - not E2E concern. |
| `e2e-utils.spec.ts > date-formatting` | 0.0 | 0.8m | 1% | 0 | 2 | Move to unit layer. |
| ... (10 more) | | | | | | |
### Estimated impact of acting on all 14
- Suite size: 142 → 128 (-14)
- Runtime: 38 min → 28 min (-10 min per CI run, ~26% reduction)
- Flake-related reruns: estimated -50%
- Maintenance load: -20% (these 14 had the highest PR-touch count)
Step 5 - Categorize recommendations
| Class | Action |
|---|---|
| retire | Delete; covered by other tests OR feature deprecated. |
| lower-layer | Rewrite at unit / integration; cheaper. |
| fix-flake | Test catches bugs but flakes; investigate per flaky-test-quarantine. |
| consolidate | Merge with sibling test that overlaps. |
| keep-but-monitor | Low ROI but catches important regressions; tag for next-quarter review. |
The team picks the appropriate class per test; the skill recommends.
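One way the skill could arrive at a first-pass suggestion is a small rule cascade over the per-test inputs. The sketch below is an assumption-heavy illustration: the function `suggest_class`, the `feature_deprecated` flag, and both thresholds are hypothetical and would need tuning per team. It never returns `consolidate`, because detecting overlap with a sibling test needs information the Step 1 inputs do not carry.

```python
def suggest_class(regressions_caught, flake_rate, value_tier,
                  feature_deprecated=False,
                  flake_threshold=0.05,   # hypothetical cut-off for "flaky"
                  high_value_tier=4):     # hypothetical cut-off for "important"
    """Suggest a recommendation class for a low-ROI test; the team makes the call."""
    if feature_deprecated:
        return "retire"
    if regressions_caught > 0 and flake_rate > flake_threshold:
        return "fix-flake"            # real signal, but unreliable
    if regressions_caught > 0 or value_tier >= high_value_tier:
        return "keep-but-monitor"     # valuable despite the low score
    if flake_rate > flake_threshold:
        return "retire"               # flaky and no signal
    return "lower-layer"              # stable, no signal: likely cheaper elsewhere


print(suggest_class(0, 0.18, 2))   # prints retire
print(suggest_class(0, 0.01, 2))   # prints lower-layer
print(suggest_class(2, 0.10, 3))   # prints fix-flake
```

The first two calls use the flake rates and tiers of `archive-flow.spec.ts > old-orders` and `e2e-utils.spec.ts > date-formatting` from the sample report and reproduce the recommendations shown there. The output is a suggestion only; the per-test decision stays with the team.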
Step 6 - Cap discipline
Set an absolute budget:
# e2e-budget.yml
budget:
  max_tests: 100
  max_runtime_min: 30
  max_flake_rate: 0.03 # 3%
When the suite exceeds the budget, every new E2E test in the next sprint must be paired with retiring or moving an existing one. Force the trade-off.
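A budget check along these lines could run in CI. The sketch below is illustrative only: it hard-codes the budget values as a dict instead of parsing `e2e-budget.yml`, the function name `budget_violations` is hypothetical, and it assumes `max_flake_rate` is compared against the suite's median flake rate, which the budget file itself does not specify.

```python
# Budget values copied from e2e-budget.yml (hard-coded to stay dependency-free).
BUDGET = {"max_tests": 100, "max_runtime_min": 30, "max_flake_rate": 0.03}


def budget_violations(test_count, total_runtime_min, median_flake_rate, budget=BUDGET):
    """Return a human-readable line for each budget limit the suite exceeds."""
    violations = []
    if test_count > budget["max_tests"]:
        violations.append(f"tests: {test_count} > {budget['max_tests']}")
    if total_runtime_min > budget["max_runtime_min"]:
        violations.append(f"runtime: {total_runtime_min} min > {budget['max_runtime_min']} min")
    if median_flake_rate > budget["max_flake_rate"]:
        violations.append(f"flake rate: {median_flake_rate:.1%} > {budget['max_flake_rate']:.1%}")
    return violations


# Suite totals from the sample report: 142 tests, 38 min, 4.2% median flake rate.
for line in budget_violations(142, 38, 0.042):
    print(line)
```

Run against the sample report's totals, all three limits are exceeded. A CI job could fail, or require a linked retirement, whenever the returned list is non-empty.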
Step 7 - Cadence
| Cadence | Trigger |
|---|---|
| Quarterly | Scheduled review. |
| Per-major-feature | New tests added; verify suite stays under budget. |
| After flake spike | Reactive review; flake source likely a low-ROI test. |
Anti-patterns
| Anti-pattern | Why it fails | Fix |
|---|---|---|
| ROI formula without "regressions caught" data | All tests look equal; no basis for prioritization. | Track real-bug catches over time (Step 1). |
| Treating value_tier as binary (critical / not) | Misses tier-2 / tier-3 nuance. | 1-5 scale (Step 1). |
| Auto-retiring bottom-decile without review | False positives - important tests retired. | Recommendation only; team confirms (Step 5). |
| Adding E2E tests without budget enforcement | Suite grows; no constraint forcing trade-offs. | Enforce the budget cap (Step 6). |
| Recommending "retire" without alternative | Tests that catch important bugs may have low ROI due to runtime; deleting them outright is regretted later. | Per-test categorization (Step 5). |
| Cherry-picking tests to retire (favorites stay) | Bias. | Apply ranking uniformly; document overrides. |