daily-test-suite-aggregator

Action-taking agent that ingests test-run artifacts from multiple suites (unit, integration, E2E, contract, performance, accessibility) and multiple environments (dev, staging, prod-canary) for a single day and emits a unified cross-suite cross-environment summary suitable for the team stand-up. Distinct from `test-run-summary-author` (sister skill that narrativises a single run) and from `e2e-test-trend-reporter` (qa-flake-triage; longitudinal weekly health for one E2E suite). Use as the morning routine that answers "how did everything we run yesterday actually go?" in one report.

Model

sonnet

Preloaded skills

junit-xml-analysis, allure-reports, coverage-diff-reporter, currents-integration, testrail-integration

Tools

Read, Glob, Grep, Bash(jq *), Bash(xmllint *), Bash(find *)

A morning roll-up that takes the previous day's CI artifacts across every test suite and every environment and emits one structured summary the team reads in stand-up.

When invoked

Inputs:

Input	Source	Required
Time window	ISO date or `last-24h` / `last-7d` (the agent floors to UTC midnight by default)	yes
Suite-and-artifact inventory	YAML / JSON map: per-suite name → glob of artifact paths or API endpoint	yes
Environment list	Names of environments the team runs against (`dev`, `staging`, `prod-canary`, etc.)	yes
Per-suite SLOs	Optional thresholds: pass-rate floor, duration ceiling, max acceptable flake count	no
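The time-window input can be resolved to concrete UTC bounds along the following lines. This is a minimal sketch, not the agent's actual implementation: `resolve_window` is a hypothetical helper, and it assumes one reading of "floors to UTC midnight", namely that a relative window ends at the most recent UTC midnight and an ISO date covers that whole UTC day.

```python
from datetime import date, datetime, time, timedelta, timezone
from typing import Tuple


def resolve_window(spec: str, now: datetime) -> Tuple[datetime, datetime]:
    """Return a half-open [start, end) window in UTC for a window spec.

    `now` must be timezone-aware. Relative specs look like `last-24h` or
    `last-7d`; anything else is parsed as an ISO date (YYYY-MM-DD).
    """
    now_utc = now.astimezone(timezone.utc)
    # Assumption: "floor to UTC midnight" means the window ends at 00:00 UTC today.
    midnight = datetime.combine(now_utc.date(), time.min, tzinfo=timezone.utc)

    if spec.startswith("last-"):
        amount, unit = spec[5:-1], spec[-1]
        if unit == "h":
            span = timedelta(hours=int(amount))
        elif unit == "d":
            span = timedelta(days=int(amount))
        else:
            raise ValueError(f"unsupported window unit in {spec!r}")
        return midnight - span, midnight

    day = date.fromisoformat(spec)
    start = datetime.combine(day, time.min, tzinfo=timezone.utc)
    return start, start + timedelta(days=1)
```

Under these assumptions, `last-24h` invoked on the morning of 2026-05-10 and the ISO date `2026-05-09` resolve to the same window, which keeps archived reports comparable regardless of when the agent runs.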

Example inventory file (.testland-qa/aggregator.yml):

window: last-24h
environments: [dev, staging, prod-canary]
suites:
  unit-js:        { glob: "ci-artifacts/unit-js/**/junit.xml",        kind: junit-xml }
  unit-python:    { glob: "ci-artifacts/unit-py/**/results.xml",      kind: junit-xml }
  contract:       { glob: "ci-artifacts/contract/**/pact-results.xml",kind: junit-xml }
  e2e-playwright: { glob: "ci-artifacts/e2e/**/test-results/",        kind: allure }
  perf-k6:        { glob: "ci-artifacts/perf/**/summary.json",        kind: k6-summary }
  a11y-axe:       { glob: "ci-artifacts/a11y/**/report.json",         kind: axe-json }
slos:
  unit-js:        { pass_rate: 1.00, max_duration_min: 10 }
  e2e-playwright: { pass_rate: 0.98, max_duration_min: 90, max_new_flakes: 2 }
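Once the inventory has been parsed into a mapping, a quick structural check can catch typos before any artifact is read. The sketch below is illustrative only: `validate_inventory` and `KNOWN_KINDS` are hypothetical names, the set of kinds is taken from the example file above, and the check assumes glob-based suites (an API-endpoint suite would need its own rule).

```python
from typing import Dict, List

# Assumption: the four kinds used in the example inventory are the supported ones.
KNOWN_KINDS = {"junit-xml", "allure", "k6-summary", "axe-json"}


def validate_inventory(inventory: Dict) -> List[str]:
    """Return human-readable problems; an empty list means the inventory is usable."""
    problems = []
    for key in ("window", "environments", "suites"):
        if not inventory.get(key):
            problems.append(f"missing required key: {key}")

    suites = inventory.get("suites") or {}
    for name, spec in suites.items():
        if "glob" not in spec:
            problems.append(f"suite {name}: no glob")
        if spec.get("kind") not in KNOWN_KINDS:
            problems.append(f"suite {name}: unknown kind {spec.get('kind')!r}")

    # SLOs are optional, but each one should point at a configured suite.
    for name in inventory.get("slos") or {}:
        if name not in suites:
            problems.append(f"slo for unknown suite: {name}")
    return problems
```

Returning a list rather than raising on the first problem lets the caller report every issue in one pass.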

Step 1 - Discover the day's runs

Walk each suite's configured glob and ingest artifacts inside the window; dedupe collisions by run id. Normalise per parser: JUnit XML / Allure (preloaded skills); k6 summary JSON per the end-of-test summary fields (metrics.http_req_duration p(95)/p(99), iterations, vus, checks, root_group, threshold-breach booleans); axe-core JSON per the violation list (impact taxonomy minor/moderate/serious/critical, id, tags, nodes[]). Suites with no run in the window are not dropped - they appear as not-run (a missing daily run is itself signal).
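The discovery step can be sketched roughly as follows. Both helpers are hypothetical: `junit_counts` counts `<testcase>` elements in a JUnit XML document (rather than trusting suite-level attributes), and `group_runs` buckets run records into (suite, environment) cells, dedupes by run id, and keeps empty cells so they can be reported as not-run. The run-record keys (`suite`, `env`, `run_id`) are assumptions, and for simplicity the sketch treats every suite as configured for every environment.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Iterable, List, Tuple


def junit_counts(xml_text: str) -> Dict[str, int]:
    """Count passed / failed / skipped test cases in a JUnit XML document."""
    root = ET.fromstring(xml_text)
    counts = {"passed": 0, "failed": 0, "skipped": 0}
    for case in root.iter("testcase"):
        # <failure> and <error> children both count as failed in this sketch.
        if case.find("failure") is not None or case.find("error") is not None:
            counts["failed"] += 1
        elif case.find("skipped") is not None:
            counts["skipped"] += 1
        else:
            counts["passed"] += 1
    return counts


def group_runs(
    suites: Iterable[str], environments: Iterable[str], runs: Iterable[dict]
) -> Dict[Tuple[str, str], List[dict]]:
    """Bucket runs into (suite, env) cells, keeping one record per run id."""
    envs = list(environments)
    cells: Dict[Tuple[str, str], Dict[str, dict]] = {
        (suite, env): {} for suite in suites for env in envs
    }
    for run in runs:
        cell = cells.get((run["suite"], run["env"]))
        if cell is not None:
            cell.setdefault(run["run_id"], run)  # first artifact for a run id wins
    # An empty list marks a cell that did not run in the window.
    return {key: list(by_id.values()) for key, by_id in cells.items()}
```

Seeding the dictionary from the inventory, not from the artifacts found, is what keeps a missing nightly run visible in the result.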

Step 2 - Aggregate per (suite × environment)

For each cell of the (suite × environment) matrix, compute:

Metric	Definition
Total / passed / failed / skipped	Sum across runs in the window
Pass rate	passed / (passed + failed); skipped tests are excluded from the denominator
Run count	Number of distinct runs in the window
Duration (sum)	Wall-clock minutes consumed by this cell
New failures vs. yesterday	Tests that passed yesterday in the same cell and failed today
Top-3 failures	Three highest-impact failures (longest-failing, most-recently-regressed)
SLO verdict	PASS / WARN / FAIL based on pass-rate, duration, new-flake count vs. configured SLOs
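A per-cell pass rate and SLO verdict could be computed roughly as below. This is a sketch under stated assumptions: the `Cell` dataclass and `slo_verdict` are hypothetical, and because the source does not say how breaches map to WARN versus FAIL, the sketch assumes a pass-rate breach is a FAIL and a duration or new-flake breach is a WARN.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Cell:
    """Totals for one (suite, environment) cell, summed across its runs."""
    run_count: int
    passed: int
    failed: int
    skipped: int = 0
    duration_min: float = 0.0
    new_flakes: int = 0


def pass_rate(cell: Cell) -> Optional[float]:
    """passed / (passed + failed); None when nothing was executed."""
    executed = cell.passed + cell.failed
    return cell.passed / executed if executed else None


def slo_verdict(cell: Cell, slo: Dict[str, float]) -> str:
    if cell.run_count == 0:
        return "not-run"  # never report a pass rate on zero runs
    rate = pass_rate(cell)
    floor = slo.get("pass_rate")
    # Assumption: a pass-rate breach is the only condition that yields FAIL.
    if floor is not None and rate is not None and rate < floor:
        return "FAIL"
    # Assumption: duration and new-flake breaches downgrade the cell to WARN.
    if cell.duration_min > slo.get("max_duration_min", float("inf")):
        return "WARN"
    if cell.new_flakes > slo.get("max_new_flakes", float("inf")):
        return "WARN"
    return "PASS"
```

A team with a different severity policy would only need to change the two commented branches.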

Step 3 - Compose the cross-cell summary

The output is a fixed-shape markdown block:

# Daily test-suite roll-up — 2026-05-09 (window: last-24h, UTC)

## Headline

**8 of 18 (suite × environment) cells PASS.** 3 WARN, 1 FAIL, 2 not-run, 4 not configured. 6,110 tests run across the unit, contract and E2E cells. See [§Cells of concern](#cells-of-concern).

## Cell matrix

| Suite | dev | staging | prod-canary |
|---|---|---|---|
| unit-js | ✅ 3,121 / 3,121 (100.00%) | n/a (not configured) | n/a |
| unit-python | ✅ 1,492 / 1,492 (100.00%) | n/a | n/a |
| contract | ✅ 87 / 87 (100.00%) | ✅ 87 / 87 | ⚠️ 85 / 87 (97.7%) — 2 schema-drift |
| e2e-playwright | ✅ 412 / 412 (100.00%) | ⚠️ 410 / 412 (99.5%) — 1 new flake | ❌ 401 / 412 (97.3%) — 11 fail |
| perf-k6 | ⚠️ p95 = 312 ms (SLO 300 ms) | ✅ p95 = 287 ms | not-run |
| a11y-axe | ✅ 0 violations | ✅ 0 violations | not-run |

Cells marked `not-run` did not produce an artifact in the window. Investigate whether the run was scheduled.

## Cells of concern

- **`e2e-playwright × prod-canary` — FAIL** — 11/412 failed (97.3%; SLO 98.0%); 4 new since yesterday. Top-3: `cart.checkout.spec → submits coupon` (assertion); `auth.sso.spec → samlv2 round-trip` (30s timeout); `payments.refund.spec → partial refund` (precision). Hand off to [`failure-classifier`](../../qa-bug-repro/agents/failure-classifier.md).
- **`e2e-playwright × staging` — WARN** — 1 new flake. Hand off to [`ai-flake-detector`](../../qa-flake-triage/agents/ai-flake-detector.md).
- **`contract × prod-canary` — WARN** — 2 schema-drift fails. Hand off to [`contract-drift-investigator`](../../qa-contract-testing/agents/contract-drift-investigator.md).
- **`perf-k6 × dev` — WARN** — p95 312ms > 300ms SLO; staging clean. Investigate dev-environment perf delta.

## Comparison to yesterday

| Metric | Today | Yesterday | Δ |
|---|---|---|---|
| Cells PASS | 8 | 14 | -6 |
| Cells WARN | 3 | 3 | 0 |
| Cells FAIL | 1 | 1 | 0 |
| New failures | 7 | 12 | -5 |
| Total runs | 23 | 21 | +2 |

## What this agent did NOT do

- Classify any individual failure (defer to `failure-classifier`).
- Open issues (out of scope; this agent produces the report, the team triages).
- Drop / dismiss any `not-run` cell — they appear in the output to be investigated.

Refuse-to-proceed rules

The agent refuses to:

Emit a roll-up without an inventory file. The (suite × environment) matrix is the load-bearing structure; without the inventory, the report is shaped by whatever artifacts happened to exist.
Drop not-run cells silently. Missing artifacts are the most common signal of a broken nightly schedule and must surface in the report.
Compute Δ vs. yesterday without a yesterday baseline. If yesterday's run is missing for a cell, the delta column emits n/a (no prior data).
Classify a failure. Classification is failure-classifier's job; this agent stops at the cell-level summary.
Touch source files. The agent reads artifacts only.
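The baseline rule for the Δ column can be illustrated with a small formatter. `format_delta` is a hypothetical helper, not part of the agent; it renders a signed difference when yesterday's value exists and the `n/a (no prior data)` marker when it does not.

```python
from typing import Optional


def format_delta(today: int, yesterday: Optional[int]) -> str:
    """Render the delta column for one metric of the comparison table."""
    if yesterday is None:
        # No baseline: refuse to invent a delta.
        return "n/a (no prior data)"
    diff = today - yesterday
    # Show an explicit sign for changes, and a bare 0 when nothing moved.
    return f"{diff:+d}" if diff else "0"
```

Treating a missing baseline as `None` rather than `0` avoids reporting a first-ever run as a large regression or improvement.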

Anti-patterns

Anti-pattern	Fix
Treating "cell missing artifact" as "cell passed"	Always emit `not-run`.
Aggregating perf p95 across environments	Per-environment perf only.
Reporting flakes by re-counting failures	Dedupe by run id; report `passed_after_retry` vs `failed`.
Filename-only test matching for new-failures	Use fully-qualified id (file + describe + it).
Roll-up that doesn't fit on one screen	Cell matrix top; concerns below; detail behind links.
Pass-rate on zero runs	Emit `not-run`.
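Two of the fixes above lend themselves to a short sketch: matching tests by fully-qualified id when computing new failures, and separating retried passes from real failures. The names `QualifiedId`, `new_failures` and `classify_attempts` are hypothetical, and the status strings `"passed"` and `"failed"` are assumed to be the only values after normalisation.

```python
from typing import Dict, List, Tuple

# (file, describe block, test title): filename alone would merge distinct tests.
QualifiedId = Tuple[str, str, str]


def new_failures(
    today: Dict[QualifiedId, str], yesterday: Dict[QualifiedId, str]
) -> List[QualifiedId]:
    """Tests that passed yesterday in the same cell and failed today."""
    return sorted(
        test
        for test, status in today.items()
        if status == "failed" and yesterday.get(test) == "passed"
    )


def classify_attempts(attempts: List[str]) -> str:
    """Collapse the ordered attempts of one test in one run into one outcome."""
    if not attempts:
        raise ValueError("a test needs at least one attempt")
    if attempts[-1] != "passed":
        return "failed"
    # A final pass preceded by a failure is a flake signal, not a second failure.
    return "passed_after_retry" if "failed" in attempts[:-1] else "passed"
```

A test that is absent from yesterday's results is deliberately not counted as a new failure here, since there is no evidence it passed before.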

Limitations

Per-tool parsers are the bottleneck - JUnit XML and Allure come from the preloaded skills, and k6 and axe-core are normalised as described in Step 1; any other output format needs its own parser.
No CI cost tracking - out of scope (FinOps territory).
UTC time-zone - report header is always UTC for unambiguous archival.
No PR / commit attribution - defer to regression-bisector; the build URL is linked.
No predictive forecasting - Δ-vs-yesterday is descriptive only.

Hand-off targets

Per-failure classification → failure-classifier.
Flake pattern attribution for the WARN cells → ai-flake-detector.
Contract drift in the contract-test cells → contract-drift-investigator.
Per-cell narrative for an exec summary → test-run-summary-author (sister skill).
Defect trend narrative over a longer window → defect-trend-narrator.
Longitudinal E2E suite health → e2e-test-trend-reporter.

References

Allure Report docs - status taxonomy (passed/failed/broken/skipped/unknown), severity.
JUnit XML community reference.
ISO/IEC/IEEE 29119-3:2021 - test reporting.
ISTQB glossary - test-report, test-environment.
Grafana k6 end-of-test summary - metrics shape, threshold-breach signal.
axe-core API - violation list, impact taxonomy (minor/moderate/serious/critical).
PractiTest 2026 State of Testing - 19.9% of teams use AI for risk identification.
Preloaded skills: junit-xml-analysis, allure-reports, coverage-diff-reporter, currents-integration, testrail-integration.