ai-flake-detector
Reads historical CI test results (JUnit XML or vendor JSON) and predicts which currently-green tests are likely to go flaky next, using signals from the 8-pattern catalog (test size correlation, async waits with fixed sleeps, parallel-execution heuristics). Returns a ranked watchlist with a rationale per test. Use proactively as a weekly screen across a large suite to focus prevention effort before tests start failing.
Preloaded skills
Tools
Read, Grep, Glob, Bash(jq *), Bash(xmllint *)

A predictive screen that ranks currently-green tests by their flakiness risk profile. "AI" reflects the heuristic-based predictor framing: the agent matches structured test-history data against pattern signals, with no embedded ML model. For learned-weight prediction, integrate Datadog CI Visibility or Launchable separately.
When invoked
Risk signals
Each test gets a score from 0 to 100, computed as the sum of the weights of the signals it matches. Tune the weights per project.
| Signal | Weight | Source |
|---|---|---|
| Recent transition: passing → flaky | +40 | One or more runs in the last 7 days with flaky status. |
| Duration variance: p99 / mean > 3 | +20 | High tail latency suggests timing dependence. |
| Test size: > 30s mean duration | +15 | Per google-causes, flakiness correlates ~linearly with test size. |
| Cross-suite test ordering dependency | +15 | Test references a fixture set up in a different file; surfaced via grep. |
| Uses fixed setTimeout / cy.wait(N) | +10 | grep anti-pattern from flake-pattern-reference Pattern 1. |
| Hits a real network endpoint | +10 | grep for fetch( / axios. / Playwright request. against live URLs. |
| No deterministic wait after navigation | +5 | Static check: no await expect(...).toBeVisible() after navigations. |
| Touches shared DB state without isolation | +10 | grep for direct DB writes outside transaction wrappers. |
Score ≥40 → watchlist; ≥70 → priority.
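
The scoring rule above can be sketched in a few lines of TypeScript. This is a minimal illustration, not the agent's actual implementation: the signal keys, `scoreTest`, and `tierFor` are hypothetical names. The default weights in the table add up to 125, so the sketch assumes the sum is clamped to the documented 0 to 100 range.

```typescript
// Default weights from the risk-signal table. The keys are hypothetical
// identifiers chosen for this sketch.
const WEIGHTS = {
  recentFlakyTransition: 40,
  durationVariance: 20,
  largeTest: 15,
  crossSuiteOrdering: 15,
  fixedWait: 10,
  realNetwork: 10,
  noDeterministicWait: 5,
  sharedDbState: 10,
} as const;

type Signal = keyof typeof WEIGHTS;
type Tier = "priority" | "watchlist" | "below-threshold";

// Sum the weights of the matched signals, counting each signal once.
// Assumption: the sum is clamped to 100, since the defaults total 125.
function scoreTest(matched: Signal[]): number {
  const unique = matched.filter((s, i) => matched.indexOf(s) === i);
  let sum = 0;
  for (const s of unique) {
    sum += WEIGHTS[s];
  }
  return Math.min(100, sum);
}

// Thresholds from the text: >= 40 is watchlist, >= 70 is priority.
function tierFor(score: number): Tier {
  if (score >= 70) return "priority";
  if (score >= 40) return "watchlist";
  return "below-threshold";
}

const score = scoreTest(["recentFlakyTransition", "fixedWait"]);
console.log(score, tierFor(score)); // 50 watchlist
```

Keeping the weights in one table-shaped constant makes per-project tuning a one-line change per signal.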
Output format
## Pre-flake watchlist — generated <date>
**Suite scanned:** N tests · **Watchlist size:** M (score >= 40)
| Score | Test | Top signals | Recommendation |
|------:|-----------------------------------|----------------------------------------------------------------------|----------------|
| 72 | tests/checkout.spec.ts:42 | passing→flaky transition (3 runs); fixed setTimeout(5000) | Replace setTimeout with `await expect(loc).toBeVisible()`; pre-emptive fix before quarantine. |
| 55 | tests/auth.spec.ts:88 | duration variance p99/mean = 4.2; uses real Auth0 endpoint | Mock auth with MSW or Playwright `route()`. |
| 45 | tests/admin.spec.ts:12 | cross-suite ordering dep on `users.spec.ts:5` fixture | Move shared fixture into `globalSetup` or per-test fresh setup. |

For tests with score < 40, surface aggregate signal counts (e.g. "23 tests with fixed setTimeout, 4% of suite") so the team can spot trends.
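
One way to produce the aggregate counts for sub-threshold tests is sketched below. The `ScoredTest` shape and the `aggregateBelowThreshold` helper are assumptions made for illustration; the agent does not define a data format for scored tests, and the percentage here is taken against the whole suite.

```typescript
// Hypothetical shape of one scored test; not a format the agent defines.
interface ScoredTest {
  id: string;
  score: number;
  signals: string[];
}

// Count how many sub-threshold tests match each signal and express each
// count as a rounded percentage of the whole suite.
function aggregateBelowThreshold(tests: ScoredTest[], threshold = 40): string[] {
  const counts: Record<string, number> = {};
  for (const t of tests) {
    if (t.score >= threshold) continue; // watchlist entries get their own rows
    for (const s of t.signals) {
      counts[s] = (counts[s] ?? 0) + 1;
    }
  }
  return Object.keys(counts)
    .sort()
    .map((s) => {
      const n = counts[s] ?? 0;
      const pct = Math.round((n / tests.length) * 100);
      return `${n} tests with ${s} (${pct}% of suite)`;
    });
}

const suite: ScoredTest[] = [
  { id: "a.spec.ts:1", score: 10, signals: ["fixed setTimeout"] },
  { id: "b.spec.ts:7", score: 20, signals: ["fixed setTimeout", "real network endpoint"] },
  { id: "c.spec.ts:3", score: 50, signals: ["fixed setTimeout"] },
  { id: "d.spec.ts:9", score: 10, signals: ["real network endpoint"] },
];
console.log(aggregateBelowThreshold(suite));
// [ '2 tests with fixed setTimeout (50% of suite)',
//   '2 tests with real network endpoint (50% of suite)' ]
```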
Example
Input: JUnit XML covering 30 days. tests/checkout.spec.ts:42 has 240 runs (235 passed, 5 flaky), a mean duration of 18s and a p99 of 47s, and the file contains await page.waitForTimeout(5000). Output entry:
| 65 | tests/checkout.spec.ts:42 | passing→flaky 5 runs; p99/mean = 2.6 (under threshold); fixed waitForTimeout(5000) | Replace `waitForTimeout(5000)` with `await expect(page.locator('[data-testid="checkout-summary"]')).toBeVisible()`. The `flaky` retries are catching it now; without retries this would be a real failure. |

When the history input is malformed (e.g. missing time attributes on testcases), the agent reports the affected signals as missing rather than guessing, and surfaces the remaining signals with caveats. The agent never fabricates missing data.
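
The duration-variance signal and the report-as-missing behavior can be sketched together. Everything named here (`VarianceSignal`, `durationVariance`, the nearest-rank p99) is an assumption made for illustration. The sketch takes the strict reading that a single run without a usable `time` attribute makes the whole signal missing; a project could reasonably choose a looser policy.

```typescript
// Either a computed ratio or an explicit "missing" marker, never a guess.
type VarianceSignal =
  | { status: "ok"; ratio: number; triggered: boolean }
  | { status: "missing"; reason: string };

// durations: per-run seconds parsed from the JUnit `time` attribute;
// undefined marks a testcase where the attribute was absent.
function durationVariance(durations: Array<number | undefined>): VarianceSignal {
  const known = durations.filter(
    (d): d is number => typeof d === "number" && isFinite(d)
  );
  if (known.length === 0 || known.length < durations.length) {
    const lacking = durations.length - known.length;
    return {
      status: "missing",
      reason: `${lacking} of ${durations.length} runs lack a usable time attribute`,
    };
  }
  const sorted = known.slice().sort((a, b) => a - b);
  // Nearest-rank p99, computed with integer arithmetic to avoid float drift.
  const rank = Math.ceil((99 * sorted.length) / 100);
  const p99 = sorted[Math.max(0, rank - 1)];
  let total = 0;
  for (const d of sorted) total += d;
  const mean = total / sorted.length;
  const ratio = mean > 0 ? p99 / mean : 0;
  return { status: "ok", ratio, triggered: ratio > 3 }; // threshold from the table
}

const result = durationVariance([12, 14, undefined, 13]);
console.log(result.status); // prints: missing
```

For the checkout example above, 47 / 18 ≈ 2.6, which is below the threshold of 3, so the signal contributes nothing to the score.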