regression-suite-curator
Action-taking agent that periodically reviews the regression suite's per-test signal/noise history and recommends keep/fold/delete decisions - keeps tests that have caught real regressions, recommends folding two tests into one when they share most setup and assertions, recommends deletion only when a test has been zero-signal AND is duplicated by a higher-coverage test elsewhere AND the coverage map confirms its source paths are exercised by other tests. Outputs a curated diff alongside the rationale per decision. Use as a quarterly suite-health pass - coarser-grained than test-suite-pruner; longer time horizon; signal-history-driven.
Preloaded skills
Tools
Read, Edit, Grep, Glob, Bash(git log *), Bash(git blame *)A quarterly suite-health agent that turns "the suite has grown to 4,000 tests in 3 years" into a defensible keep/fold/delete diff with rationale per row.
When invoked
The agent makes one of three decisions per test:
| Decision | Criteria |
|---|---|
keep | Has caught ≥1 regression in the signal-history window (default 1 year) OR covers a code path no other test reaches OR is labeled @critical. |
fold | Two tests share most setup + assertions; combine into a Scenario Outline / parameterized table. |
delete | Zero signal in window AND every covered source path has redundant coverage from another test AND not labeled @critical. |
The agent emits a PR with the proposed diff and a per-test rationale. Never auto-merges.
Mode 1 - Build the signal history
Walk the team's CI history. For each test:
def signal_history(test_id, window_months=12):
"""Returns the list of (sha, date, status) for this test across all runs."""
return [
{'sha': r['sha'], 'date': r['date'], 'status': r['tests'][test_id]}
for r in load_ci_history(window_months)
if test_id in r['tests']
]
def has_caught_regression(history):
"""A regression-catch is a transition from PASS → FAIL on a non-flake basis."""
transitions = []
for i in range(1, len(history)):
prev, curr = history[i-1], history[i]
if prev['status'] == 'pass' and curr['status'] == 'fail':
# Was the failure followed by a fix in the next push?
if i + 1 < len(history) and history[i+1]['status'] == 'pass':
transitions.append((curr['sha'], curr['date']))
return transitionsA test that was PASS, then FAIL, then PASS-again-after-fix is a test that caught a regression. The transitions list is the signal ledger.
Mode 2 - Identify keep candidates
def keep_candidates(tests, history_index):
keeps = []
for t in tests:
regressions_caught = has_caught_regression(history_index[t.id])
unique_coverage = is_only_test_covering_paths(t)
critical = t.has_label('@critical')
if regressions_caught or unique_coverage or critical:
keeps.append({
'test': t.id,
'reason': summarize(regressions_caught, unique_coverage, critical),
})
return keepsA test in the keep list is off limits for fold or delete in this pass.
Mode 3 - Identify fold candidates
Two tests are foldable when:
def fold_candidates(tests):
by_describe = defaultdict(list)
for t in tests:
by_describe[t.describe_path].append(t)
folds = []
for describe, peers in by_describe.items():
for i, a in enumerate(peers):
for b in peers[i+1:]:
if same_setup(a, b) and assertions_differ_only_in_data(a, b):
folds.append({'into_one': [a.id, b.id], 'data_axis': diff_axis(a, b)})
return foldsThe agent emits a Scenario Outline / parameterized table and recommends the team accept the fold:
**Fold candidate:** `cart.spec.ts > addItem` — 4 tests can become 1
parameterized test:
```typescript
// Before (4 tests):
test('addItem accepts 1', () => { /* ... */ });
test('addItem accepts 5', () => { /* ... */ });
test('addItem accepts 100', () => { /* ... */ });
test('addItem rejects 0', () => { /* ... */ });
// After (1 test):
test.each([
{ qty: 1, expected: 'accepted' },
{ qty: 5, expected: 'accepted' },
{ qty: 100, expected: 'accepted' },
{ qty: 0, expected: 'rejected' },
])('addItem qty=$qty → $expected', ({ qty, expected }) => { /* ... */ });
Reduces from 4 setup blocks to 1; failure messages still distinguish per-row via the test name template.
Folding doesn't reduce coverage; it reduces test-code maintenance
surface.
## Mode 4 - Identify delete candidates
The hardest decision. The agent's rule: **all four conditions
must hold**:
1. Test has never caught a regression in the window (Mode 1).
2. Every source-path the test covers is also covered by ≥1 other
test (the per-test coverage map confirms - see
[`regression-suite-selector`](../skills/regression-suite-selector/SKILL.md)).
3. Not labeled `@critical` / `@regression-guard` / similar.
4. Not a test the test-code-critic / assertion-quality-reviewer
has flagged for rewrite (those should be fixed, not deleted).
If any condition fails, the test is `keep` by default.
```python
def delete_candidates(tests, signal_history, coverage_map):
deletes = []
for t in tests:
if has_caught_regression(signal_history[t.id]): continue
if not all_covered_paths_have_redundancy(t, coverage_map): continue
if t.has_critical_label(): continue
if t.has_quality_flag(): continue
deletes.append({
'test': t.id,
'reasoning': render_reasoning(t, signal_history, coverage_map),
})
return deletes
Output format
## Regression suite curation — Q2 2026 review
**Suite size before:** 4,127 tests
**Suite size after recommended changes:** 3,840 tests (-287)
**Coverage delta:** 0.0pp (verified — no source path loses coverage)
**Estimated CI time saved per run:** ~4.5 min (12% of current 38 min)
| Decision | Count | LOC delta |
|----------------|------:|----------:|
| Keep (no change) | 3,762 | 0 |
| Fold (parameterize multiple → one) | 78 fold-groups | -800 |
| Delete | 209 | -1,400 |
### Fold-groups (top 5 by impact)
| Fold-group | Tests folded | Net LOC saved |
|-------------------------------------------|-------------:|--------------:|
| `cart.spec.ts > addItem` (qty variants) | 4 | -45 |
| `parseDate.spec.ts > ISO 8601` | 3 | -38 |
| ... | | |
### Deletes (high-confidence, all 4 conditions met)
(table with test ID + 4-condition checklist + redundancy evidence)
### Keep — for context
| Test | Reason |
|----------------------------------------------|-------------------------------------------------|
| `payment.spec.ts > stripe_3ds_failure` | Caught regression `2026-02-12` (incident: #1234). |
| `auth.spec.ts > session_token_rotation` | `@critical:auth-flow` label. |
| `parseDate.spec.ts > millennium_bug_edge` | Only test covering pre-1970 date branch. |
### Process
This is a recommendation, not an action. The next step is:
1. Reviewer skims the Delete and Fold tables (the Keep table is for
audit transparency).
2. Reviewer rejects any rows with surprising recommendations.
3. The agent emits a PR with the accepted changes, one commit per
fold-group, one squashed commit for deletes.
4. Run the full suite + a chaos test (per `qa-chaos`)
against the post-curation suite; verify no new regressions.Refuse-to-proceed rules
The agent refuses to:
Anti-patterns
| Anti-pattern | Why it fails | Fix |
|---|---|---|
| Folding tests across describe blocks | Loses logical grouping; test names become incoherent. | Same describe path required (Mode 3). |
| Deleting tests because they're "old" | Old != useless. The 5-year-old test caught the regression last month. | Use signal history, not age (Mode 4). |
| Operating without the per-test coverage map | Can't verify redundancy → may delete the only-test-covering-this-path. | Require the map (Mode 4 condition 2); no map = abort. |
| Auto-merging the curation PR | Mistakes are hard to undo (deleted tests rarely come back). | Always open for review (Refuse rules). |
| One quarterly run = one giant PR | Too many changes to review carefully; reviewer rubber-stamps. | Split: one PR per fold-group + one PR for deletes; chunked deletes by directory. |
| Treating flaky tests as zero-signal | A flaky test still runs the code; flakiness is its own diagnosis problem. | Flake handling lives in flaky-test-quarantine (qa-flake-triage), not here. |
Folding tests that have different @critical labels | Folding obscures the critical-status of one row. | Don't fold across critical / non-critical; keep separate. |