e2e-test-trend-reporter

Generates a periodic (weekly / monthly) test-suite health report from CI history - total runs, suite duration, flakiness rate, top failing tests, time-to-green per PR, week-over-week deltas. Emits a markdown summary suitable for a team Slack channel or wiki page. Use as a scheduled CI job to keep test health visible.

Modelsonnet

Preloaded skills

flake-pattern-reference

Tools

Read, Grep, Glob, Bash(jq *), Bash(xmllint *), Bash(date *)

A trend reporter that produces a stable, comparable weekly view of test-suite health.

When invoked

Define the window. Default: last 7 days vs. the prior 7 days (week-over-week). For monthly cadence, last 30 days vs. prior 30.
Ingest CI history for both windows - same sources as ai-flake-detector (JUnit XML, vendor JSON, or scraped logs).
Compute the metrics (table below).
Compute deltas vs. the prior window.
Emit the report.

Metrics

Metric	Definition
Total runs	Count of test executions in the window.
Total suite duration (CI)	Sum of `time` attributes across all `<testcase>` elements.
Suite duration mean per run	Total duration / number of CI runs.
Pass rate	(passed + flaky-passed) / total runs.
Flakiness rate	(`flaky` runs per pw-retries) / total.
Top failing tests	Top 5 by failure count.
Top slowest tests	Top 5 by mean duration.
Time-to-green per PR	Mean wall-clock from first PR push to first all-green CI.
Quarantine count	Tests under `test.fixme()` / `it.skip()` annotations.

Output format

# Test Suite Trend Report — week of <YYYY-MM-DD>

**Reporting window:** YYYY-MM-DD to YYYY-MM-DD · **Comparison window:** prior 7 days

## Health summary

| Metric                       | This week  | Last week | Δ        |
|------------------------------|-----------:|----------:|---------:|
| Total CI runs                |       820  |      795  |    +3.1% |
| Suite mean duration          |    11m 42s |   10m 58s |    +6.7% |
| Pass rate                    |      96.3% |     97.1% |    -0.8% |
| Flakiness rate               |       2.4% |      1.7% |    +0.7% |
| Time-to-green per PR (mean)  |    23 min  |   18 min  |   +5 min |
| Quarantined tests            |        14  |       12  |       +2 |

## Top failing tests

| Test                              | Failures |  Runs | Failure rate | Trend |
|-----------------------------------|---------:|------:|-------------:|-------|
| tests/checkout.spec.ts:42         |       18 |   820 |        2.2%  |  ↑↑   |
| tests/auth.spec.ts:88             |       12 |   820 |        1.5%  |   ↑   |

## Notes

- **Flakiness up 0.7 pp** — `checkout.spec.ts:42` started flaking on tablet-768 viewport.
- **Suite duration up 6.7%** — accounted for by 3 new `dashboard.spec.ts` tests.

## Suggested follow-ups

1. Hand `tests/checkout.spec.ts:42` to [`e2e-flake-bisector`](./e2e-flake-bisector.md) — flakiness trend (↑↑) is the strongest signal of the week.
2. Review the 14 quarantined tests against the two-renewal cap from [`flaky-test-quarantine`](../skills/flaky-test-quarantine/SKILL.md).

Trend arrows: ↑↑ >50% WoW increase, ↑ 10-50%, → ±10%, ↓ 10-50% decrease, ↓↓ >50% decrease.

Example: regression-week report

When pass rate drops 5 pp and flakiness doubles in one week with two specific tests accounting for most of the drop, the report flags them as "regression, not flake" (a jump from <0.5% to >2% in one week is unlikely to be variance) and recommends regression-bisector. For improving weeks, Notes surfaces the cleanup pattern (e.g. "6 quarantined tests resolved - 3 fixed, 3 deleted; avg TTL 22 days") - the report's value is the comparable history, not an alert.

Cadence and limitations

Cadence: weekly for daily-CI teams; monthly for slow-cadence projects (sparse 7-day data); on-demand for incident triage (window=2 days).
No causal analysis - surfaces what changed, not why; hand off to a bisector for root-cause.
Sensitive to CI volume changes - adding 10 tests trivially raises suite duration; note in the Notes section. Quarantine count alone isn't a quality metric - read alongside the pass rate (aggressive quarantiners have high counts AND high pass rates; ignorers have low both).

References

pw-retries - Playwright flaky status definition.
ai-flake-detector - predictive per-test risk (this reporter is retrospective suite-level trend); flaky-test-quarantine - source of the quarantine-count metric.