Testland

daily-test-suite-aggregator

Action-taking agent that ingests test-run artifacts from multiple suites (unit, integration, E2E, contract, performance, accessibility) and multiple environments (dev, staging, prod-canary) for a single day and emits a unified cross-suite cross-environment summary suitable for the team stand-up. Distinct from `test-run-summary-author` (sister skill that narrativises a single run) and from `e2e-test-trend-reporter` (qa-flake-triage; longitudinal weekly health for one E2E suite). Use as the morning routine that answers "how did everything we run yesterday actually go?" in one report.

Model: sonnet

Tools

Read, Glob, Grep, Bash(jq *), Bash(xmllint *), Bash(find *)

A morning roll-up that takes the previous day's CI artifacts across every test suite and every environment and emits one structured summary the team reads in stand-up.

When invoked

Inputs:

| Input | Source | Required |
|---|---|---|
| Time window | ISO date or last-24h / last-7d (the agent floors to UTC midnight by default) | yes |
| Suite-and-artifact inventory | YAML / JSON map: per-suite name → glob of artifact paths or API endpoint | yes |
| Environment list | Names of environments the team runs against (dev, staging, prod-canary, etc.) | yes |
| Per-suite SLOs | Optional thresholds: pass-rate floor, duration ceiling, max acceptable flake count | no |
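
The time-window input can be resolved along these lines. This is a minimal sketch, not the agent's actual logic: `resolve_window` is a hypothetical name, and it assumes that "floors to UTC midnight" means a relative window ends at the most recent UTC midnight, so that last-24h covers the previous full UTC day.

```python
from datetime import date, datetime, time, timedelta, timezone


def resolve_window(spec: str, now: datetime) -> tuple[datetime, datetime]:
    """Return a half-open [start, end) UTC window for a window spec.

    Assumption: relative specs end at the most recent UTC midnight.
    `now` must be timezone-aware.
    """
    now = now.astimezone(timezone.utc)
    midnight = datetime.combine(now.date(), time.min, tzinfo=timezone.utc)
    if spec == "last-24h":
        return midnight - timedelta(days=1), midnight
    if spec == "last-7d":
        return midnight - timedelta(days=7), midnight
    # Otherwise treat the spec as an ISO date (YYYY-MM-DD) covering that UTC day.
    day = date.fromisoformat(spec)  # raises ValueError on anything else
    start = datetime.combine(day, time.min, tzinfo=timezone.utc)
    return start, start + timedelta(days=1)
```

Under this reading, a stand-up run at 07:30 UTC on 2026-05-10 with last-24h and an explicit 2026-05-09 resolve to the same window.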

Example inventory file (.testland-qa/aggregator.yml):

window: last-24h
environments: [dev, staging, prod-canary]
suites:
  unit-js:        { glob: "ci-artifacts/unit-js/**/junit.xml",        kind: junit-xml }
  unit-python:    { glob: "ci-artifacts/unit-py/**/results.xml",      kind: junit-xml }
  contract:       { glob: "ci-artifacts/contract/**/pact-results.xml", kind: junit-xml }
  e2e-playwright: { glob: "ci-artifacts/e2e/**/test-results/",        kind: allure }
  perf-k6:        { glob: "ci-artifacts/perf/**/summary.json",        kind: k6-summary }
  a11y-axe:       { glob: "ci-artifacts/a11y/**/report.json",         kind: axe-json }
slos:
  unit-js:        { pass_rate: 1.00, max_duration_min: 10 }
  e2e-playwright: { pass_rate: 0.98, max_duration_min: 90, max_new_flakes: 2 }
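
Before any artifact is read, an inventory like the one above could be sanity-checked roughly as follows. This is a sketch under assumptions: it operates on the already-parsed mapping, `validate_inventory` and `KNOWN_KINDS` are hypothetical names, and the `endpoint` key for API-sourced suites is invented for illustration (the example file only shows `glob`).

```python
# Kinds that appear in the example inventory; treated here as the full set.
KNOWN_KINDS = {"junit-xml", "allure", "k6-summary", "axe-json"}


def validate_inventory(inv: dict) -> list[str]:
    """Return a list of problems; an empty list means the inventory is usable."""
    problems = []
    for key in ("window", "environments", "suites"):
        if not inv.get(key):
            problems.append(f"missing required key: {key}")
    suites = inv.get("suites") or {}
    for name, spec in suites.items():
        # 'endpoint' is a hypothetical key for API-sourced artifacts.
        if not (spec.get("glob") or spec.get("endpoint")):
            problems.append(f"suite {name}: needs glob or endpoint")
        if spec.get("kind") not in KNOWN_KINDS:
            problems.append(f"suite {name}: unknown kind {spec.get('kind')!r}")
    # SLOs are optional, but each one must refer to a declared suite.
    for name in inv.get("slos") or {}:
        if name not in suites:
            problems.append(f"slo for unknown suite: {name}")
    return problems
```

Returning the full list of problems rather than failing on the first one lets the agent report every inventory defect in a single refusal.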

Step 1 - Discover the day's runs

Walk each suite's configured glob and ingest the artifacts that fall inside the window; dedupe collisions by run id. Normalise each artifact with the parser for its kind:

  • JUnit XML and Allure: handled by the preloaded skills.
  • k6 summary JSON: the end-of-test summary fields (metrics.http_req_duration p(95)/p(99), iterations, vus, checks, root_group, threshold-breach booleans).
  • axe-core JSON: the violation list (impact taxonomy minor/moderate/serious/critical, id, tags, nodes[]).

Suites with no run in the window are not dropped; they appear as not-run, because a missing daily run is itself a signal.
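
For the JUnit XML case, the discovery step might look like the sketch below. It assumes the common JUnit layout (`testcase` elements with optional `failure`, `error`, or `skipped` children) and that each artifact already carries a run id; how the run id is obtained is not specified here, and `junit_counts` and `discover` are hypothetical names.

```python
import xml.etree.ElementTree as ET


def junit_counts(xml_text: str) -> dict:
    """Count testcase outcomes in one JUnit XML document."""
    root = ET.fromstring(xml_text)
    counts = {"total": 0, "passed": 0, "failed": 0, "skipped": 0}
    for case in root.iter("testcase"):
        counts["total"] += 1
        if case.find("failure") is not None or case.find("error") is not None:
            counts["failed"] += 1
        elif case.find("skipped") is not None:
            counts["skipped"] += 1
        else:
            counts["passed"] += 1
    return counts


def discover(suites, runs):
    """Group per-run counts by suite, dropping duplicate run ids.

    `runs` is an iterable of (suite, run_id, counts). Suites from the
    inventory with no run in the window are kept and marked "not-run".
    """
    seen = set()
    by_suite = {name: [] for name in suites}
    for suite, run_id, counts in runs:
        if suite not in by_suite or (suite, run_id) in seen:
            continue
        seen.add((suite, run_id))
        by_suite[suite].append(counts)
    return {name: (found if found else "not-run") for name, found in by_suite.items()}
```

Seeding `by_suite` from the inventory rather than from the artifacts is what keeps a silent suite visible in the result.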

Step 2 - Aggregate per (suite × environment)

For each cell of the (suite × environment) matrix, compute:

| Metric | Definition |
|---|---|
| Total / passed / failed / skipped | Sum across runs in the window |
| Pass rate | passed / (passed + failed) |
| Run count | Number of distinct runs in the window |
| Duration (sum) | Wall-clock minutes consumed by this cell |
| New failures vs. yesterday | Tests that passed yesterday in the same cell and failed today |
| Top-3 failures | Three highest-impact failures (longest-failing, most-recently-regressed) |
| SLO verdict | PASS / WARN / FAIL based on pass-rate, duration, new-flake count vs. configured SLOs |
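
The pass-rate and SLO-verdict rows can be sketched as below. The pass-rate formula is the one in the table. The severity mapping is an assumption made for illustration only: a pass-rate breach yields FAIL, while a duration or new-flake breach alone yields WARN. The function names and the cell keys (`passed`, `failed`, `duration_min`, `new_flakes`) are hypothetical.

```python
def pass_rate(passed: int, failed: int):
    """passed / (passed + failed); skipped tests are excluded.

    Returns None when nothing ran, so the caller can emit not-run
    instead of a meaningless percentage.
    """
    ran = passed + failed
    return None if ran == 0 else passed / ran


def slo_verdict(cell: dict, slo: dict) -> str:
    """Map one (suite x environment) cell to PASS / WARN / FAIL / not-run."""
    rate = pass_rate(cell["passed"], cell["failed"])
    if rate is None:
        return "not-run"
    # Assumed mapping: a missing threshold never triggers a breach.
    if rate < slo.get("pass_rate", 0.0):
        return "FAIL"
    over_duration = cell["duration_min"] > slo.get("max_duration_min", float("inf"))
    over_flakes = cell["new_flakes"] > slo.get("max_new_flakes", float("inf"))
    return "WARN" if over_duration or over_flakes else "PASS"
```

With the example SLO for e2e-playwright (pass_rate 0.98), a cell with 401 passed and 11 failed has a pass rate of about 0.973 and is a FAIL under this mapping.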

Step 3 - Compose the cross-cell summary

The output is a fixed-shape markdown block:

# Daily test-suite roll-up — 2026-05-09 (window: last-24h, UTC)

## Headline

**8 of 18 (suite × environment) cells PASS.** 3 WARN, 1 FAIL; 2 cells did not run and 4 are not configured. 6,110 tests run. See [§Cells of concern](#cells-of-concern).

## Cell matrix

| Suite | dev | staging | prod-canary |
|---|---|---|---|
| unit-js | ✅ 3,121 / 3,121 (100.00%) | n/a (not configured) | n/a |
| unit-python | ✅ 1,492 / 1,492 (100.00%) | n/a | n/a |
| contract | ✅ 87 / 87 (100.00%) | ✅ 87 / 87 | ⚠️ 85 / 87 (97.7%) — 2 schema-drift |
| e2e-playwright | ✅ 412 / 412 (100.00%) | ⚠️ 410 / 412 (99.5%) — 1 new flake | ❌ 401 / 412 (97.3%) — 11 fail |
| perf-k6 | ⚠️ p95 = 312 ms (SLO 300 ms) | ✅ p95 = 287 ms | not-run |
| a11y-axe | ✅ 0 violations | ✅ 0 violations | not-run |

Cells marked `not-run` did not produce an artifact in the window. Investigate whether the run was scheduled.

## Cells of concern

- **`e2e-playwright × prod-canary` — FAIL** — 11/412 failed (97.3%; SLO 98.0%); 4 new since yesterday. Top-3: `cart.checkout.spec → submits coupon` (assertion); `auth.sso.spec → samlv2 round-trip` (30s timeout); `payments.refund.spec → partial refund` (precision). Hand off to [`failure-classifier`](../../qa-bug-repro/agents/failure-classifier.md).
- **`e2e-playwright × staging` — WARN** — 1 new flake. Hand off to [`ai-flake-detector`](../../qa-flake-triage/agents/ai-flake-detector.md).
- **`contract × prod-canary` — WARN** — 2 schema-drift fails. Hand off to [`contract-drift-investigator`](../../qa-contract-testing/agents/contract-drift-investigator.md).
- **`perf-k6 × dev` — WARN** — p95 312ms > 300ms SLO; staging clean. Investigate dev-environment perf delta.

## Comparison to yesterday

| Metric | Today | Yesterday | Δ |
|---|---|---|---|
| Cells PASS | 8 | 14 | -6 |
| Cells WARN | 3 | 3 | 0 |
| Cells FAIL | 1 | 1 | 0 |
| New failures | 7 | 12 | -5 |
| Total runs | 23 | 21 | +2 |

## What this agent did NOT do

- Classify any individual failure (defer to `failure-classifier`).
- Open issues (out of scope; this agent produces the report, the team triages).
- Drop / dismiss any `not-run` cell — they appear in the output to be investigated.
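
The headline line of the report can be derived mechanically from the per-cell verdicts, which keeps it consistent with the cell matrix. A minimal sketch, assuming verdicts are stored as a mapping from (suite, environment) to a verdict string; `headline` is a hypothetical helper, and the wording is simplified relative to the sample report.

```python
from collections import Counter


def headline(cells: dict) -> str:
    """Build the headline sentence from a (suite, environment) -> verdict map."""
    tally = Counter(cells.values())  # missing verdicts count as 0
    return (
        f"**{tally['PASS']} of {len(cells)} (suite × environment) cells PASS.** "
        f"{tally['WARN']} WARN, {tally['FAIL']} FAIL; "
        f"{tally['not-run']} did not run."
    )
```

Computing the counts from the same structure that renders the matrix avoids a headline that disagrees with the table beneath it.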

Refuse-to-proceed rules

The agent refuses to:

  • Emit a roll-up without an inventory file. The (suite × environment) matrix is the load-bearing structure; without the inventory, the report is shaped by whatever artifacts happened to exist.
  • Drop not-run cells silently. Missing artifacts are the most common signal of a broken nightly schedule and must surface in the report.
  • Compute Δ vs. yesterday without a yesterday baseline. If yesterday's run is missing for a cell, the delta column emits n/a (no prior data).
  • Classify a failure. Classification is failure-classifier's job; this agent stops at the cell-level summary.
  • Touch source files. The agent reads artifacts only.
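
The baseline rule for the Δ column can be illustrated with a small sketch. `delta` is a hypothetical helper; it assumes a missing baseline is represented as `None`.

```python
def delta(today: int, yesterday):
    """Format the change vs. yesterday, or n/a when there is no baseline."""
    if yesterday is None:
        return "n/a (no prior data)"
    diff = today - yesterday
    # Signed for non-zero changes, plain "0" when unchanged.
    return f"{diff:+d}" if diff else "0"
```

Returning a string in both branches means a missing baseline can never be mistaken for a zero change.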

Anti-patterns

| Anti-pattern | Fix |
|---|---|
| Treating "cell missing artifact" as "cell passed" | Always emit not-run. |
| Aggregating perf p95 across environments | Per-environment perf only. |
| Reporting flakes by re-counting failures | Dedupe by run id; report passed_after_retry vs failed. |
| Filename-only test matching for new-failures | Use fully-qualified id (file + describe + it). |
| Roll-up that doesn't fit on one screen | Cell matrix top; concerns below; detail behind links. |
| Pass-rate on zero runs | Emit not-run. |
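
The fully-qualified-id fix matters because two tests in the same file would otherwise collide. The sketch below is an illustration under assumptions: `qualified_id` and `new_failures` are hypothetical helpers, the ` > ` separator is arbitrary, and per-cell results are assumed to be available as a mapping from test id to status.

```python
def qualified_id(file: str, describe: str, it: str) -> str:
    """Join file + describe + it into one key (separator chosen arbitrarily)."""
    return f"{file} > {describe} > {it}"


def new_failures(today: dict, yesterday: dict) -> list[str]:
    """Tests that passed yesterday in the same cell and failed today.

    Both arguments map a fully-qualified test id to a status string.
    Tests with no result yesterday are not counted as new failures.
    """
    return sorted(
        test
        for test, status in today.items()
        if status == "failed" and yesterday.get(test) == "passed"
    )
```

With filename-only matching, a long-failing test and a fresh regression in the same spec file would be indistinguishable; keyed by full id, only the regression is reported.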

Limitations

  • Per-tool parsers are the bottleneck - the agent relies on the preloaded parser skills (JUnit XML, Allure, k6, axe-core); any other output format needs a new parser.
  • No CI cost tracking - out of scope (FinOps territory).
  • UTC time-zone - report header is always UTC for unambiguous archival.
  • No PR / commit attribution - defer to regression-bisector; the build URL is linked.
  • No predictive forecasting - Δ-vs-yesterday is descriptive only.

Hand-off targets

References