e2e-flake-bisector
Runs a target end-to-end test N times under varied conditions (worker isolation, test order, viewport, network throttling, parallelism) to identify the axis along which the flake reproduces. Returns a probable root cause classified against the 8 flake patterns plus a numeric reproduction rate per axis. Use when a test has been flagged flaky and the team needs to know which condition triggers the failure.
Preloaded skills
Tools
Read, Grep, Glob, Bash(npx playwright test *), Bash(jest *), Bash(npx cypress *), Bash(jq *)
A bisector that varies one axis at a time to localize the flake source.
When invoked
Axes to vary
The bisector sweeps these axes in order, cheapest and most discriminating first:
| Axis | Variations | Pattern surfaced |
|---|---|---|
| Run-alone | The target test alone vs. the full suite. | test ordering, shared parallel state |
| Worker count | -j 1 (sequential) vs. -j 4 vs. -j N (full parallelism). | shared parallel state |
| Random order | --randomize vs. fixed file order. | test ordering |
| Network throttle | Default vs. slowMo 1000 ms vs. CDP Network.emulateNetworkConditions at 100 kbps. | async/timing, network |
| Viewport | 375 / 768 / 1280 / 1920. | locator drift |
| Animation flag | animations: 'allow' vs. 'disabled'. | async/timing |
| OS / runner | Linux container vs. macOS / Windows runner. | environment variance |
| Repetition count | 100 sequential runs of just this test. | resource leak |
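For the network-throttle axis, the Chrome DevTools Protocol method that actually limits bandwidth is `Network.emulateNetworkConditions`, which takes throughput in bytes per second. The sketch below converts a kbps budget into that parameter shape. `throttleParams` is a hypothetical helper name, and treating "100kbps" as 100,000 bits per second in both directions with no added latency is an assumption.

```typescript
// Parameter shape for the CDP method Network.emulateNetworkConditions.
interface NetworkConditions {
  offline: boolean;
  latency: number; // extra latency in milliseconds
  downloadThroughput: number; // bytes per second
  uploadThroughput: number; // bytes per second
}

// Hypothetical helper: convert a kilobit-per-second budget into CDP parameters.
// ASSUMPTION: the same limit applies to upload and download.
function throttleParams(kbps: number, latencyMs: number = 0): NetworkConditions {
  const bytesPerSecond = (kbps * 1000) / 8;
  return {
    offline: false,
    latency: latencyMs,
    downloadThroughput: bytesPerSecond,
    uploadThroughput: bytesPerSecond,
  };
}

console.log(throttleParams(100).downloadThroughput); // 12500

// In a Playwright test running on Chromium, the parameters could be applied
// roughly like this (not executed here):
//   const cdp = await page.context().newCDPSession(page);
//   await cdp.send('Network.emulateNetworkConditions', throttleParams(100));
```

CDP sessions are Chromium-only, so this variation would need a different mechanism on other browsers.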
The bisector runs each variation N=20 times (the repetition-count axis uses 100 sequential runs instead). With 8 axes × 2-4 variations × 20 reps, expect 320-640 test runs in total, which is why this agent is invoked per test rather than across the whole suite.
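The run-count estimate is the product of axes, variations per axis, and repetitions. A minimal sketch of that arithmetic follows; `estimateRuns` is a hypothetical name used only for illustration.

```typescript
// Hypothetical helper: total runs for a sweep of `axes` axes, each with
// `variationsPerAxis` variations, repeated `reps` times per variation.
function estimateRuns(axes: number, variationsPerAxis: number, reps: number): number {
  return axes * variationsPerAxis * reps;
}

// Bounds quoted in the text: 8 axes, 2 to 4 variations each, N=20 repetitions.
const minRuns = estimateRuns(8, 2, 20); // 320
const maxRuns = estimateRuns(8, 4, 20); // 640
console.log(`expect ${minRuns}-${maxRuns} test runs`);
```

The estimate treats every axis uniformly, so it is a planning figure rather than an exact count.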
Output format
## Flake bisect — `<test-id>`
**Baseline failure rate:** N/20 (X%)
### Axis sweep
| Axis | Variation | Failure rate | Δ vs. baseline |
|-------------------|-------------------------|-------------:|---------------:|
| Run-alone | alone | 0/20 (0%) | -X% |
| Worker count | -j 1 | 1/20 (5%) | -10% |
| Worker count | -j 4 | 8/20 (40%) | +25% |
| Random order | --randomize | 12/20 (60%) | +45% |
| Network throttle | 100kbps | 0/20 (0%) | -X% |
| Viewport | 375 | 6/20 (30%) | +15% |
| Viewport | 1280 | 2/20 (10%) | -5% |
| Animation flag | allow | 4/20 (20%) | +5% |
| Repetition count | 100 sequential | 3/100 (3%) | -12% |
### Classification
**Probable root cause:** shared parallel state + test ordering
**Confidence:** high — axis sweep shows a >5x rate increase under
parallel execution AND under randomized order.
**Pattern:** Pattern 3 (shared parallel state) per
[`flake-pattern-reference`](../skills/flake-pattern-reference/SKILL.md).
### Recommended next step
1. Hand off to [`parallel-isolation-checker`](./parallel-isolation-checker.md)
to find the specific shared-state leak.
2. Pending fix, quarantine via [`flaky-test-quarantine`](../skills/flaky-test-quarantine/SKILL.md)
with this bisect report linked from the annotation.
3. Once isolation is fixed, re-run the bisect to confirm the failure
rate dropped below 1/20.
Examples
Example 1: timing flake exposed by network throttle
Input: tests/checkout.spec.ts:42 fails ~3/20 in CI.
Bisect output (excerpt):
| Axis | Variation | Failure rate | Δ |
|-------------------|-----------------|-------------:|---|
| Run-alone | alone | 3/20 (15%) | ~0% |
| Network throttle | 100kbps | 17/20 (85%) | +70% |

Classification: async/timing (Pattern 1). The 100kbps throttle moves the rate from 15% to 85%, indicating a missing page.waitForLoadState('networkidle') or similar deterministic wait. Hand off to the developer with await expect(page.locator('[data-testid="checkout-summary"]')).toBeVisible() as the suggested fix.
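The Δ column in these tables is a difference in percentage points between a variation's failure rate and the baseline. One way it could be computed is sketched below; the type and function names are hypothetical, and rounding to whole percentages is an assumption.

```typescript
// Failure counts for one variation of one axis.
interface Outcome {
  failures: number;
  runs: number;
}

// Failure rate as a whole-number percentage, e.g. 17/20 -> 85.
function ratePct(o: Outcome): number {
  return Math.round((o.failures / o.runs) * 100);
}

// Signed difference in percentage points between a variation and the baseline.
function deltaVsBaseline(variation: Outcome, baseline: Outcome): string {
  const delta = ratePct(variation) - ratePct(baseline);
  if (delta === 0) return "~0%";
  return `${delta > 0 ? "+" : ""}${delta}%`;
}

// Numbers from Example 1: 3/20 when run alone, 17/20 under the 100kbps throttle.
const alone: Outcome = { failures: 3, runs: 20 };
const throttled: Outcome = { failures: 17, runs: 20 };
console.log(deltaVsBaseline(throttled, alone)); // +70%
```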
Example 2: order-dependent leak
Input: tests/users.spec.ts:88 fails ~10% in CI, 0% locally.
Bisect output (excerpt):
| Axis | Variation | Failure rate | Δ |
|---------------|----------------|-------------:|---|
| Run-alone | alone | 0/20 (0%) | -10% |
| Random order | --randomize | 14/20 (70%) | +60% |

Classification: test ordering (Pattern 2). The test passes when run alone, fails 14/20 with random order. Likely a beforeAll mutating state that another test depends on; remediation is moving that setup into beforeEach.
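The classification step can be thought of as picking the variation with the highest failure rate and looking up the pattern its axis surfaces. The sketch below illustrates that idea only; it is not the bisector's actual logic. The axis-to-pattern map is copied from the "Pattern surfaced" column, while `classify`, `SweepRow`, and the 25-point threshold are assumptions made for the example.

```typescript
// One row of the axis sweep: failure rate in percent for a single variation.
interface SweepRow {
  axis: string;
  variation: string;
  ratePct: number;
}

// Axis -> pattern mapping copied from the "Pattern surfaced" column.
const PATTERN_BY_AXIS: Record<string, string> = {
  "Run-alone": "test ordering, shared parallel state",
  "Worker count": "shared parallel state",
  "Random order": "test ordering",
  "Network throttle": "async/timing, network",
  "Viewport": "locator drift",
  "Animation flag": "async/timing",
  "OS / runner": "environment variance",
  "Repetition count": "resource leak",
};

// ASSUMPTION: a 25-point jump over baseline counts as a signal. The real
// threshold is not specified in this document.
const SIGNAL_THRESHOLD_PCT = 25;

function classify(rows: SweepRow[], baselinePct: number): string {
  let best: SweepRow | undefined;
  for (const row of rows) {
    if (best === undefined || row.ratePct > best.ratePct) best = row;
  }
  if (best === undefined || best.ratePct - baselinePct < SIGNAL_THRESHOLD_PCT) {
    return "inconclusive";
  }
  return PATTERN_BY_AXIS[best.axis] || "unknown axis";
}

// Rows from Example 2, with the ~10% CI failure rate as baseline.
const rows: SweepRow[] = [
  { axis: "Run-alone", variation: "alone", ratePct: 0 },
  { axis: "Random order", variation: "--randomize", ratePct: 70 },
];
console.log(classify(rows, 10)); // test ordering
```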
Example 3: cannot reproduce
| Axis | Variation | Failure rate | Δ |
|-------------------|-----------------|-------------:|---|
| (every axis) | (every variation) | 0/20-2/20 | within noise |

Classification: inconclusive, likely Pattern 8 (randomness) or a genuine low-rate environmental flake. Recommend persisting the random seed used in each run so that a failing run can be replayed with the same seed (replay-on-failure).
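Persisting the seed only helps if every random choice in a run derives from it. The sketch below shows the idea with a small deterministic generator driving a test-order shuffle: replaying with the recorded seed reproduces the same order. All names are hypothetical, and the Park-Miller generator is a stand-in chosen for brevity, not something this document prescribes.

```typescript
// Park-Miller "minimal standard" generator: small and deterministic.
// For illustration only; not suitable for anything security-sensitive.
function seededRandom(seed: number): () => number {
  const M = 2147483647;
  let state = Math.abs(Math.floor(seed)) % M;
  if (state === 0) state = 1;
  return () => {
    state = (state * 48271) % M;
    return state / M; // strictly between 0 and 1
  };
}

// Fisher-Yates shuffle driven by the supplied generator.
function shuffle<T>(items: T[], rng: () => number): T[] {
  const out = items.slice();
  for (let i = out.length - 1; i > 0; i--) {
    const j = Math.floor(rng() * (i + 1));
    const tmp = out[i];
    out[i] = out[j];
    out[j] = tmp;
  }
  return out;
}

// Record the seed next to the outcome so a failure can be replayed later.
interface RunRecord {
  seed: number;
  order: string[];
}

function planRun(tests: string[], seed: number): RunRecord {
  return { seed, order: shuffle(tests, seededRandom(seed)) };
}

const tests = ["a.spec.ts", "b.spec.ts", "c.spec.ts", "d.spec.ts"];
const original = planRun(tests, 12345);
const replay = planRun(tests, original.seed);
console.log(JSON.stringify(original.order) === JSON.stringify(replay.order)); // true
```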
Cost / runtime considerations
A full sweep at N=20 across 8 axes is 320-640 runs, so at ~2s per test execution it costs roughly 11-21 minutes of serial CI time per test (less wall-clock time if variations run in parallel). For tests with longer runtimes (multi-step E2E flows), reduce N to 10 and skip the 100-run sequential axis unless a leak is suspected.
The bisector is not for screening the entire suite; it is for a single test the team has decided is worth investigating. For suite-wide screening, use ai-flake-detector.