daily-test-suite-aggregator
Action-taking agent that ingests test-run artifacts from multiple suites (unit, integration, E2E, contract, performance, accessibility) and multiple environments (dev, staging, prod-canary) for a single day and emits a unified cross-suite, cross-environment summary suitable for the team stand-up. Distinct from `test-run-summary-author` (sister skill that narrativises a single run) and from `e2e-test-trend-reporter` (qa-flake-triage; longitudinal weekly health for one E2E suite). Use as the morning routine that answers "how did everything we ran yesterday actually go?" in one report.
Preloaded skills
Tools
Read, Glob, Grep, Bash(jq *), Bash(xmllint *), Bash(find *)

A morning roll-up that takes the previous day's CI artifacts across every test suite and every environment and emits one structured summary the team reads in stand-up.
When invoked
Inputs:
| Input | Source | Required |
|---|---|---|
| Time window | ISO date or last-24h / last-7d (the agent floors to UTC midnight by default) | yes |
| Suite-and-artifact inventory | YAML / JSON map: per-suite name → glob of artifact paths or API endpoint | yes |
| Environment list | Names of environments the team runs against (dev, staging, prod-canary, etc.) | yes |
| Per-suite SLOs | Optional thresholds: pass-rate floor, duration ceiling, max acceptable flake count | no |
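The time-window input can be resolved into concrete UTC bounds along these lines. This is a minimal sketch, not the agent's actual code: `resolve_window` is a hypothetical helper, and it assumes that "floors to UTC midnight" means a relative window ends at the most recent UTC midnight, while an ISO date covers that whole UTC day.

```python
from datetime import date, datetime, time, timedelta, timezone


def resolve_window(spec: str, now: datetime) -> tuple[datetime, datetime]:
    """Return a half-open [start, end) window in UTC for a window spec.

    `now` must be timezone-aware.
    """
    spans = {"last-24h": timedelta(hours=24), "last-7d": timedelta(days=7)}
    if spec in spans:
        # Assumption: the window ends at the most recent UTC midnight
        # at or before `now`, and reaches back by the named span.
        today_utc = now.astimezone(timezone.utc).date()
        end = datetime.combine(today_utc, time.min, tzinfo=timezone.utc)
        return end - spans[spec], end
    # Otherwise treat the spec as an ISO date (raises ValueError if malformed).
    day = date.fromisoformat(spec)
    start = datetime.combine(day, time.min, tzinfo=timezone.utc)
    return start, start + timedelta(days=1)
```

Under this reading, a `last-24h` report generated on the morning of 2026-05-10 covers the same interval as the ISO date `2026-05-09`.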
Example inventory file (.testland-qa/aggregator.yml):
window: last-24h
environments: [dev, staging, prod-canary]
suites:
  unit-js: { glob: "ci-artifacts/unit-js/**/junit.xml", kind: junit-xml }
  unit-python: { glob: "ci-artifacts/unit-py/**/results.xml", kind: junit-xml }
  contract: { glob: "ci-artifacts/contract/**/pact-results.xml", kind: junit-xml }
  e2e-playwright: { glob: "ci-artifacts/e2e/**/test-results/", kind: allure }
  perf-k6: { glob: "ci-artifacts/perf/**/summary.json", kind: k6-summary }
  a11y-axe: { glob: "ci-artifacts/a11y/**/report.json", kind: axe-json }
slos:
  unit-js: { pass_rate: 1.00, max_duration_min: 10 }
  e2e-playwright: { pass_rate: 0.98, max_duration_min: 90, max_new_flakes: 2 }

Step 1 - Discover the day's runs
Walk each suite's configured glob and ingest the artifacts that fall inside the window, deduplicating collisions by run id. Normalise each artifact with the parser for its kind:
- JUnit XML and Allure: via the preloaded skills.
- k6 summary JSON: the end-of-test summary fields (`metrics.http_req_duration` `p(95)` / `p(99)`, `iterations`, `vus`, `checks`, `root_group`, threshold-breach booleans).
- axe-core JSON: the violation list (`impact` taxonomy minor / moderate / serious / critical, `id`, `tags`, `nodes[]`).

Suites with no run in the window are not dropped; they appear as `not-run`, because a missing daily run is itself a signal.
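The discovery step can be sketched as below. It rests on two assumptions that the text above does not specify: an artifact's file modification time stands in for the run timestamp, and the run id is the name of the artifact's parent directory. `discover_runs` is a hypothetical name.

```python
from datetime import datetime, timezone
from glob import glob
from pathlib import Path


def discover_runs(
    suites: dict[str, str], start: datetime, end: datetime
) -> dict[str, dict[str, Path]]:
    """Map every configured suite to {run_id: artifact path} within [start, end).

    A suite with no artifact in the window keeps its key with an empty dict,
    so the caller can report it as not-run instead of dropping it.
    """
    found: dict[str, dict[str, Path]] = {}
    for suite, pattern in suites.items():
        runs: dict[str, Path] = {}
        for match in sorted(glob(pattern, recursive=True)):
            path = Path(match)
            # Assumption: mtime approximates when the run produced the artifact.
            modified = datetime.fromtimestamp(path.stat().st_mtime, tz=timezone.utc)
            if not (start <= modified < end):
                continue
            # Assumption: the parent directory name is the run id.
            # setdefault keeps the first artifact seen for a colliding run id.
            runs.setdefault(path.parent.name, path)
        found[suite] = runs
    return found
```

The suites to report as `not-run` are then `[name for name, runs in found.items() if not runs]`.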
Step 2 - Aggregate per (suite × environment)
For each cell of the (suite × environment) matrix, compute:
| Metric | Definition |
|---|---|
| Total / passed / failed / skipped | Sum across runs in the window |
| Pass rate | passed / (passed + failed); skipped tests are excluded from the denominator |
| Run count | Number of distinct runs in the window |
| Duration (sum) | Wall-clock minutes consumed by this cell |
| New failures vs. yesterday | Tests that passed yesterday in the same cell and failed today |
| Top-3 failures | Three highest-impact failures (longest-failing, most-recently-regressed) |
| SLO verdict | PASS / WARN / FAIL based on pass-rate, duration, new-flake count vs. configured SLOs |
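The pass-rate and SLO-verdict rows can be illustrated with a short sketch. `CellTotals` and `slo_verdict` are hypothetical names, and the split between WARN and FAIL is an assumption (a pass-rate breach is FAIL; a duration or new-flake breach alone is WARN), because the table above names the inputs to the verdict but not the exact rule.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CellTotals:
    """Summed counters for one (suite x environment) cell."""
    passed: int = 0
    failed: int = 0
    skipped: int = 0
    runs: int = 0
    duration_min: float = 0.0
    new_flakes: int = 0


def pass_rate(cell: CellTotals) -> Optional[float]:
    """passed / (passed + failed); None when nothing was executed."""
    executed = cell.passed + cell.failed
    return cell.passed / executed if executed else None


def slo_verdict(cell: CellTotals, slo: dict[str, float]) -> str:
    """Return PASS / WARN / FAIL, or not-run for a cell with zero runs."""
    if cell.runs == 0:
        return "not-run"  # never report a pass rate on zero runs
    rate = pass_rate(cell)
    # Assumed rule: breaching the pass-rate floor is a FAIL.
    if rate is not None and rate < slo.get("pass_rate", 0.0):
        return "FAIL"
    # Assumed rule: breaching only duration or new-flake limits is a WARN.
    over_duration = cell.duration_min > slo.get("max_duration_min", float("inf"))
    over_flakes = cell.new_flakes > slo.get("max_new_flakes", float("inf"))
    return "WARN" if over_duration or over_flakes else "PASS"
```

With the `e2e-playwright` SLOs from the example inventory, a cell with 401 passed and 11 failed has a pass rate of about 97.3%, below the 0.98 floor, so this sketch returns `FAIL`.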
Step 3 - Compose the cross-cell summary
The output is a fixed-shape markdown block:
# Daily test-suite roll-up — 2026-05-09 (window: last-24h, UTC)
## Headline
**8 of 18 (suite × environment) cells PASS.** 3 WARN, 1 FAIL, 2 not-run, 4 not configured. 6,110 tests run. See [§Cells of concern](#cells-of-concern).
## Cell matrix
| Suite | dev | staging | prod-canary |
|---|---|---|---|
| unit-js | ✅ 3,121 / 3,121 (100.00%) | n/a (not configured) | n/a |
| unit-python | ✅ 1,492 / 1,492 (100.00%) | n/a | n/a |
| contract | ✅ 87 / 87 (100.00%) | ✅ 87 / 87 | ⚠️ 85 / 87 (97.7%) — 2 schema-drift |
| e2e-playwright | ✅ 412 / 412 (100.00%) | ⚠️ 410 / 412 (99.5%) — 1 new flake | ❌ 401 / 412 (97.3%) — 11 fail |
| perf-k6 | ⚠️ p95 = 312 ms (SLO 300 ms) | ✅ p95 = 287 ms | not-run |
| a11y-axe | ✅ 0 violations | ✅ 0 violations | not-run |
Cells marked `not-run` did not produce an artifact in the window. Investigate whether the run was scheduled.
## Cells of concern
- **`e2e-playwright × prod-canary` — FAIL** — 11/412 failed (97.3%; SLO 98.0%); 4 new since yesterday. Top-3: `cart.checkout.spec → submits coupon` (assertion); `auth.sso.spec → samlv2 round-trip` (30s timeout); `payments.refund.spec → partial refund` (precision). Hand off to [`failure-classifier`](../../qa-bug-repro/agents/failure-classifier.md).
- **`e2e-playwright × staging` — WARN** — 1 new flake. Hand off to [`ai-flake-detector`](../../qa-flake-triage/agents/ai-flake-detector.md).
- **`contract × prod-canary` — WARN** — 2 schema-drift fails. Hand off to [`contract-drift-investigator`](../../qa-contract-testing/agents/contract-drift-investigator.md).
- **`perf-k6 × dev` — WARN** — p95 312ms > 300ms SLO; staging clean. Investigate dev-environment perf delta.
## Comparison to yesterday
| Metric | Today | Yesterday | Δ |
|---|---|---|---|
| Cells PASS | 8 | 14 | -6 |
| Cells WARN | 3 | 3 | 0 |
| Cells FAIL | 1 | 1 | 0 |
| New failures | 7 | 12 | -5 |
| Total runs | 23 | 21 | +2 |
## What this agent did NOT do
- Classify any individual failure (defer to `failure-classifier`).
- Open issues (out of scope; this agent produces the report, the team triages).
- Drop / dismiss any `not-run` cell — they appear in the output to be investigated.

Refuse-to-proceed rules
The agent refuses to:
Anti-patterns
| Anti-pattern | Fix |
|---|---|
| Treating "cell missing artifact" as "cell passed" | Always emit not-run. |
| Aggregating perf p95 across environments | Per-environment perf only. |
| Reporting flakes by re-counting failures | Dedupe by run id; report passed_after_retry vs failed. |
| Filename-only test matching for new-failures | Use fully-qualified id (file + describe + it). |
| Roll-up that doesn't fit on one screen | Cell matrix top; concerns below; detail behind links. |
| Pass-rate on zero runs | Emit not-run. |
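The fully-qualified-id fix for new-failure matching can be sketched as follows. `qualified_id` and `new_failures` are hypothetical names, the `" > "` separator is an arbitrary choice, and the status strings `"passed"` and `"failed"` are assumed values for the normalised results of one cell.

```python
def qualified_id(file: str, describe: str, it: str) -> str:
    """Build a test id from file + describe + it, so same-named tests in
    different files or describe blocks never collide."""
    return " > ".join((file, describe, it))


def new_failures(yesterday: dict[str, str], today: dict[str, str]) -> list[str]:
    """Ids that passed yesterday and failed today within the same cell.

    Both arguments map qualified test id -> status for one cell. Tests that
    did not exist yesterday, or were already failing, are not counted as new.
    """
    return sorted(
        test_id
        for test_id, status in today.items()
        if status == "failed" and yesterday.get(test_id) == "passed"
    )
```

Keying on the full id means two specs that both contain a test named `submits coupon` are tracked separately, which filename-only or title-only matching would not guarantee.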