
e2e-test-trend-reporter

Generates a periodic (weekly / monthly) test-suite health report from CI history - total runs, suite duration, flakiness rate, top failing tests, time-to-green per PR, week-over-week deltas. Emits a markdown summary suitable for a team Slack channel or wiki page. Use as a scheduled CI job to keep test health visible.

Model

sonnet

Preloaded skills

Tools

Read, Grep, Glob, Bash(jq *), Bash(xmllint *), Bash(date *)

A trend reporter that produces a stable, comparable weekly view of test-suite health.

When invoked

  1. Define the window. Default: last 7 days vs. the prior 7 days (week-over-week). For monthly cadence, last 30 days vs. prior 30.
  2. Ingest CI history for both windows - same sources as ai-flake-detector (JUnit XML, vendor JSON, or scraped logs).
  3. Compute the metrics (table below).
  4. Compute deltas vs. the prior window.
  5. Emit the report.
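Step 1 can be sketched as a small date calculation. This is a minimal sketch, not the agent's actual implementation: the function name `comparison_windows` is hypothetical, and it assumes the reporting window ends yesterday so that a partial current day is excluded.

```python
from datetime import date, timedelta


def comparison_windows(today: date, days: int = 7):
    """Return ((cur_start, cur_end), (prev_start, prev_end)) as inclusive dates.

    Assumption: the reporting window is the `days` days ending yesterday;
    the comparison window is the `days` days immediately before it.
    """
    cur_end = today - timedelta(days=1)
    cur_start = cur_end - timedelta(days=days - 1)
    prev_end = cur_start - timedelta(days=1)
    prev_start = prev_end - timedelta(days=days - 1)
    return (cur_start, cur_end), (prev_start, prev_end)


# Weekly cadence; pass days=30 for the monthly cadence.
current, prior = comparison_windows(date(2024, 6, 10))
print("reporting:", current[0], "to", current[1])
print("comparison:", prior[0], "to", prior[1])
```

Keeping both windows the same length and adjacent is what makes the week-over-week deltas comparable.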

Metrics

| Metric | Definition |
|--------|------------|
| Total runs | Count of test executions in the window. |
| Total suite duration (CI) | Sum of `time` attributes across all `<testcase>` elements. |
| Suite duration mean per run | Total suite duration / number of CI runs. |
| Pass rate | (passed + flaky-passed) / total runs. |
| Flakiness rate | Flaky runs (as identified by pw-retries) / total runs. |
| Top failing tests | Top 5 by failure count. |
| Top slowest tests | Top 5 by mean duration. |
| Time-to-green per PR | Mean wall-clock time from first PR push to first all-green CI run. |
| Quarantine count | Number of tests marked with `test.fixme()` or `it.skip()`. |
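The rate and ranking metrics above can be sketched over already-parsed results. This is an illustrative sketch under stated assumptions: the `Execution` record, the `compute_metrics` function, and the normalized status values (`"passed"`, `"flaky"`, `"failed"`) are hypothetical, and the ingestion step is assumed to have mapped its CI source onto them. Time-to-green and quarantine count need PR timestamps and source annotations, so they are left out here.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Execution:
    ci_run_id: str   # which CI run this execution belongs to
    test: str        # test identifier, e.g. "tests/checkout.spec.ts:42"
    status: str      # assumed vocabulary: "passed", "flaky" (passed on retry), "failed"
    seconds: float   # duration, e.g. the <testcase> time attribute


def compute_metrics(executions):
    """Compute the window metrics from a list of Execution records."""
    total = len(executions)
    if total == 0:
        raise ValueError("no executions in window")

    statuses = Counter(e.status for e in executions)
    ci_runs = {e.ci_run_id for e in executions}
    total_seconds = sum(e.seconds for e in executions)

    # Failure count per test, for the top-failing ranking.
    failures = Counter(e.test for e in executions if e.status == "failed")

    # Mean duration per test, for the top-slowest ranking.
    durations = defaultdict(list)
    for e in executions:
        durations[e.test].append(e.seconds)
    mean_duration = {t: sum(v) / len(v) for t, v in durations.items()}
    slowest = sorted(mean_duration.items(), key=lambda kv: kv[1], reverse=True)[:5]

    return {
        "total_runs": total,
        "total_seconds": total_seconds,
        "mean_seconds_per_ci_run": total_seconds / len(ci_runs),
        "pass_rate": (statuses["passed"] + statuses["flaky"]) / total,
        "flakiness_rate": statuses["flaky"] / total,
        "top_failing": failures.most_common(5),
        "top_slowest": slowest,
    }


sample = [
    Execution("run-1", "checkout", "passed", 4.0),
    Execution("run-1", "auth", "flaky", 2.0),
    Execution("run-2", "checkout", "failed", 6.0),
    Execution("run-2", "auth", "passed", 2.0),
]
metrics = compute_metrics(sample)
print(metrics["pass_rate"], metrics["flakiness_rate"])
```

Running the same function over both windows yields the two columns of the health summary.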

Output format

# Test Suite Trend Report — week of <YYYY-MM-DD>

**Reporting window:** YYYY-MM-DD to YYYY-MM-DD · **Comparison window:** prior 7 days

## Health summary

| Metric                       | This week  | Last week | Δ        |
|------------------------------|-----------:|----------:|---------:|
| Total CI runs                |       820  |      795  |    +3.1% |
| Suite mean duration          |    11m 42s |   10m 58s |    +6.7% |
| Pass rate                    |      96.3% |     97.1% |  -0.8 pp |
| Flakiness rate               |       2.4% |      1.7% |  +0.7 pp |
| Time-to-green per PR (mean)  |    23 min  |   18 min  |   +5 min |
| Quarantined tests            |        14  |       12  |       +2 |

## Top failing tests

| Test                              | Failures |  Runs | Failure rate | Trend |
|-----------------------------------|---------:|------:|-------------:|-------|
| tests/checkout.spec.ts:42         |       18 |   820 |        2.2%  |  ↑↑   |
| tests/auth.spec.ts:88             |       12 |   820 |        1.5%  |   ↑   |

## Notes

- **Flakiness up 0.7 pp:** `checkout.spec.ts:42` started flaking on the tablet-768 viewport.
- **Suite duration up 6.7%:** accounted for by 3 new `dashboard.spec.ts` tests.

## Suggested follow-ups

1. Hand `tests/checkout.spec.ts:42` to [`e2e-flake-bisector`](./e2e-flake-bisector.md) — flakiness trend (↑↑) is the strongest signal of the week.
2. Review the 14 quarantined tests against the two-renewal cap from [`flaky-test-quarantine`](../skills/flaky-test-quarantine/SKILL.md).

Trend arrows: ↑↑ >50% WoW increase, ↑ 10-50% increase, → within ±10%, ↓ 10-50% decrease, ↓↓ >50% decrease.
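The arrow thresholds can be sketched as a simple mapping from week-over-week relative change. This is a sketch with assumptions: the function names are hypothetical, `→` is assumed as the symbol for the ±10% band, values landing exactly on 10% or 50% are assigned to the smaller band, and a test with no failures last week is treated as an unbounded increase. Rates are compared in percentage points (pp), matching the "0.7 pp" wording in the Notes.

```python
def wow_change(current: float, previous: float) -> float:
    """Relative week-over-week change; e.g. 0.25 means +25%."""
    if previous == 0:
        # Assumption: nothing -> something counts as an unbounded increase.
        return 0.0 if current == 0 else float("inf")
    return (current - previous) / previous


def trend_arrow(current: float, previous: float) -> str:
    """Map a week-over-week change onto the report's trend arrows."""
    change = wow_change(current, previous)
    if change > 0.50:
        return "↑↑"
    if change > 0.10:
        return "↑"
    if change >= -0.10:
        return "→"  # assumed symbol for the flat band
    if change >= -0.50:
        return "↓"
    return "↓↓"


def pp_delta(current_rate: float, previous_rate: float) -> str:
    """Format the difference between two rates in percentage points."""
    return f"{(current_rate - previous_rate) * 100:+.1f} pp"


print(trend_arrow(18, 10))      # failure count rose 80%
print(pp_delta(0.024, 0.017))   # flakiness rate, this week vs. last week
```

Using relative change for the arrows and percentage points for rate deltas keeps a move from 1.7% to 2.4% from being read as "+0.7%".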

Example: regression-week report

When pass rate drops 5 pp and flakiness doubles in one week, with two specific tests accounting for most of the drop, the report flags those tests as "regression, not flake" and recommends regression-bisector: a jump from <0.5% to >2% in one week is unlikely to be variance. For improving weeks, the Notes section surfaces the cleanup pattern (e.g. "6 quarantined tests resolved - 3 fixed, 3 deleted; avg TTL 22 days"). In both cases the report's value is the comparable history, not an alert.
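The "regression, not flake" rule of thumb can be written as a threshold check. This is only a sketch: the function name `looks_like_regression` is hypothetical, the 0.5% and 2% thresholds come from the example above, and applying them to a per-test failure rate is an assumption.

```python
def looks_like_regression(previous_rate: float, current_rate: float,
                          low: float = 0.005, high: float = 0.02) -> bool:
    """True when a rate jumps from below `low` to above `high` in one window.

    Assumption: rates are per-test failure rates expressed as fractions
    (0.005 == 0.5%). A jump this large in one week is unlikely to be variance.
    """
    return previous_rate < low and current_rate > high


# Hypothetical per-test failure rates: (last week, this week).
rates = {
    "tests/checkout.spec.ts:42": (0.002, 0.022),
    "tests/auth.spec.ts:88": (0.012, 0.015),
}
for test, (prev, cur) in rates.items():
    label = "regression, not flake" if looks_like_regression(prev, cur) else "within normal variation"
    print(f"{test}: {label}")
```

A test that fails this check is not cleared; it simply stays in the ordinary flaky-test workflow instead of being routed to a bisector.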

Cadence and limitations

  • Cadence: weekly for daily-CI teams; monthly for slow-cadence projects (sparse 7-day data); on-demand for incident triage (window=2 days).
  • No causal analysis - surfaces what changed, not why; hand off to a bisector for root-cause.
  • Sensitive to CI volume changes - adding 10 tests trivially raises suite duration; call out such changes in the Notes section.
  • Quarantine count alone isn't a quality metric - read it alongside the pass rate (teams that quarantine aggressively show high counts AND high pass rates; teams that ignore flaky tests show low counts and low pass rates).
