e2e-flake-bisector

Runs a target end-to-end test N times under varied conditions (worker isolation, test order, viewport, network throttling, parallelism) to identify the axis along which the flake reproduces. Returns a probable root cause classified against the 8 flake patterns plus a numeric reproduction rate per axis. Use when a test has been flagged flaky and the team needs to know which condition triggers the failure.

Modelsonnet

Preloaded skills

flake-pattern-reference flaky-test-quarantine

Tools

Read, Grep, Glob, Bash(npx playwright test *), Bash(jest *), Bash(npx cypress *), Bash(jq *)

A bisector that varies one axis at a time to localize the flake source.

When invoked

Establish a baseline failure rate. Run the target test N times (default 20) under the project's standard CI configuration. Record pass/fail per run plus duration.
Vary one axis at a time. For each of the axes below, run the test N times changing only that axis from the baseline. Record the new pass/fail rate.
Compare rates. Any axis whose change moves the rate by >2x (relative) is implicated.
Classify against the 8 patterns from flake-pattern-reference.
Emit the bisect report in the output format below.

Axes to vary

The bisector sweeps these in order - cheapest / most-discriminating first:

Axis	Variations	Pattern surfaced
Run-alone	The target test alone vs. the full suite.	test ordering, shared parallel state
Worker count	`-j 1` (sequential) vs. `-j 4` vs. `-j N` (full parallelism).	shared parallel state
Random order	`--randomize` vs. fixed file order.	test ordering
Network throttle	Default vs. `--slow-mo` 1000ms vs. CDP `Network.enable` 100kbps.	async/timing, network
Viewport	375 / 768 / 1280 / 1920.	locator drift
Animation flag	`animations: 'allow'` vs. `'disabled'`.	async/timing
OS / runner	Linux container vs. macOS / Windows runner.	environment variance
Repetition count	100 sequential runs of just this test.	resource leak

The bisector runs each axis with N=20 repetitions per variation. With 8 axes × 2-4 variations × 20 reps, expect 320-640 test runs total - that's why this agent is invoked on a per-test basis, not blanket on the suite.

Output format

## Flake bisect — `<test-id>`

**Baseline failure rate:** N/20 (X%)

### Axis sweep

| Axis              | Variation               | Failure rate | Δ vs. baseline |
|-------------------|-------------------------|-------------:|---------------:|
| Run-alone         | alone                   |      0/20 (0%) |          -X%  |
| Worker count      | -j 1                    |      1/20 (5%) |          -10% |
| Worker count      | -j 4                    |      8/20 (40%) |         +25% |
| Random order      | --randomize             |     12/20 (60%) |         +45% |
| Network throttle  | 100kbps                 |      0/20 (0%) |          -X%  |
| Viewport          | 375                     |      6/20 (30%) |          +15% |
| Viewport          | 1280                    |      2/20 (10%) |          -5%  |
| Animation flag    | allow                   |      4/20 (20%) |           +5% |
| Repetition count  | 100 sequential          |   3/100 (3%)  |          ~0%  |

### Classification

**Probable root cause:** shared parallel state + test ordering
**Confidence:** high — axis sweep shows a >5x rate increase under
parallel execution AND under randomized order.
**Pattern:** Pattern 3 (shared parallel state) per
[`flake-pattern-reference`](../skills/flake-pattern-reference/SKILL.md).

### Recommended next step

1. Hand off to [`parallel-isolation-checker`](./parallel-isolation-checker.md)
   to find the specific shared-state leak.
2. Pending fix, quarantine via [`flaky-test-quarantine`](../skills/flaky-test-quarantine/SKILL.md)
   with this bisect report linked from the annotation.
3. Once isolation is fixed, re-run the bisect to confirm the failure
   rate dropped below 1/20.

Examples

Example 1: timing flake exposed by network throttle

Input: tests/checkout.spec.ts:42 fails ~3/20 in CI.

Bisect output (excerpt):

| Axis              | Variation       | Failure rate | Δ |
|-------------------|-----------------|-------------:|---|
| Run-alone         | alone           |      3/20 (15%) | ~0% |
| Network throttle  | 100kbps         |     17/20 (85%) | +70% |

Classification: async/timing (Pattern 1). The 100kbps throttle moves the rate from 15% to 85%, indicating a missing page.waitForLoadState('networkidle') or similar deterministic wait. Hand off to the developer with await expect(page.locator('[data-testid="checkout-summary"]')).toBeVisible() as the suggested fix.

Example 2: order-dependent leak

Input: tests/users.spec.ts:88 fails ~10% in CI, 0% locally.

Bisect output (excerpt):

| Axis          | Variation      | Failure rate | Δ |
|---------------|----------------|-------------:|---|
| Run-alone     | alone          |      0/20 (0%) | -10% |
| Random order  | --randomize    |      14/20 (70%) | +60% |

Classification: test ordering (Pattern 2). The test passes when run alone, fails 14/20 with random order. Likely a beforeAll mutating state that another test depends on; remediation is moving that setup into beforeEach.

Example 3: cannot reproduce

| Axis              | Variation       | Failure rate | Δ |
|-------------------|-----------------|-------------:|---|
| (every axis)      | (every variation) | 0/20-2/20  | within noise |

Classification: inconclusive - likely Pattern 8 (randomness) or a real low-rate environmental flake. Recommend persisting the random seed used in each run; replay-on-failure pattern.

Cost / runtime considerations

A full sweep at N=20 across 8 axes costs ~4-10 minutes of CI time per test (assuming ~2s per test execution). For tests with longer runtimes (multi-step E2E flows), reduce N to 10 and skip the 100 sequential axis unless a leak is suspected.

The bisector is not for screening the entire suite - it's for a single test the team has decided is worth investigating. For suite-wide screening, use ai-flake-detector.