peeking-problem-reference
Pure-reference catalog of the peeking problem in online A/B testing. Defines the problem (repeatedly looking at experiment results inflates the false-positive rate above the declared alpha because each look is a separate test), the canonical mitigations (fixed-horizon test with pre-declared sample size; sequential testing with alpha-spending functions e.g., O'Brien-Fleming, Pocock; always-valid inference / mSPRT per Johari et al.), and the policy choices (data-peek schedule, stop-early thresholds, decision-time guard rails). Use when designing an experimentation platform's stop-early policy or auditing why a result was declared significant. Composes guardrail-metrics-reference.
Overview
In classical (fixed-horizon) hypothesis testing, the test is run once, on a pre-declared sample size, at a pre-declared alpha (typically 0.05). Checking the data before the pre-declared end and stopping as soon as significance is reached inflates the false-positive rate well above alpha, sometimes to 30% or more at a nominal alpha of 0.05 when there are many looks.
This is the peeking problem. Per Kohavi et al. Trustworthy Online Controlled Experiments (ISBN 978-1108724265): "Repeated significance testing is one of the most common mistakes in practical A/B testing."
This skill is a pure reference consumed by the AB-test validity checklist and the SRM detector agent.
When to use
Why naive peeking inflates false positives
At alpha=0.05, the test is calibrated to give a 5% false-positive rate if you look once. If you look every day for 30 days and ship at the first significant result, every look gives the test another chance to cross the threshold by chance alone. The looks share data and are therefore correlated, so the chances do not simply add up, but the overall false-positive rate still grows with every additional look.
Per Microsoft Experimentation Platform research (microsoft.com/en-us/research/group/experimentation-platform-exp/), common patterns that lead to peeking include dashboards that update hourly, "early-stop" buttons in experimentation UIs, and managers asking "where are we now?" mid-experiment.
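The inflation can be seen in a small simulation. The sketch below is illustrative only: the function name `peeking_false_positive_rate` is hypothetical, and it assumes an A/A test (no true effect) with equal traffic per day, modelling each day's data as one standardized normal increment rather than individual users.

```python
import math
import random
from statistics import NormalDist


def peeking_false_positive_rate(n_looks, n_sims=2000, alpha=0.05, seed=0):
    """Fraction of simulated A/A experiments declared significant at ANY look.

    Assumption: each look adds one equally sized batch of null data, so the
    z-statistic on all data so far is (sum of batch effects) / sqrt(#batches).
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # naive two-sided threshold
    false_positives = 0
    for _ in range(n_sims):
        cumulative = 0.0
        for look in range(1, n_looks + 1):
            cumulative += rng.gauss(0.0, 1.0)  # one batch of null data
            z = cumulative / math.sqrt(look)   # z-statistic on all data so far
            if abs(z) > z_crit:                # "significant": stop and ship
                false_positives += 1
                break
    return false_positives / n_sims


print(peeking_false_positive_rate(n_looks=1))   # close to the declared 0.05
print(peeking_false_positive_rate(n_looks=30))  # well above 0.05
```

With a single look the simulated rate stays near alpha; with 30 looks and stop-at-first-significance it is several times larger, even though every individual look uses the same 0.05 threshold.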
Three corrections
1. Fixed-horizon test (pre-declared)
Decide N in advance via power analysis; collect N samples; do one test; ship or not. No peeking, no early stop.
Pros: standard p-value interpretation, full alpha budget on the declared test.
Cons: must wait for N. Cannot stop early on obvious winners (opportunity cost) or obvious losers (continuing risk).
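The power analysis that fixes N can be sketched with the standard normal-approximation formula for comparing two proportions. This is a minimal sketch, not a platform implementation: the function name `sample_size_per_arm` and the example conversion rates are assumptions chosen for illustration.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(p_base, p_target, alpha=0.05, power=0.80):
    """Approximate users per arm for a two-sided two-proportion z-test.

    n = (z_{1-alpha/2} + z_{power})^2 * (p1(1-p1) + p2(1-p2)) / (p2 - p1)^2
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance_sum = p_base * (1 - p_base) + p_target * (1 - p_target)
    effect = p_target - p_base
    return math.ceil((z_alpha + z_power) ** 2 * variance_sum / effect ** 2)


# Hypothetical example: detect a lift from 10% to 11% conversion.
print(sample_size_per_arm(0.10, 0.11))
```

Halving the detectable lift roughly quadruples N, which is why the minimum effect of interest has to be declared before the experiment starts.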
2. Sequential testing with alpha-spending
Pre-commit to multiple looks, each with a fraction of the alpha budget. Two canonical schedules:
| Schedule | Pattern |
|---|---|
| Pocock | Same nominal significance threshold at every look; easier to stop early, stricter at the final look |
| O'Brien-Fleming | Tiny alpha early, most of the alpha at the final look; conservative early-stop |
Implementation: declare K looks in advance; at each look k, the rejection threshold is computed from the cumulative alpha spent so far (per the schedule). If the test statistic exceeds the threshold, stop.
Math: Σ alpha_k = alpha_total, summed over looks k = 1..K, where alpha_k is the incremental alpha spent at look k.
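The budget identity can be illustrated with the Lan-DeMets spending functions, which are commonly used continuous approximations of the two schedules (an assumption here; the original Pocock and O'Brien-Fleming designs are defined through critical values rather than spending functions). The helper names are hypothetical. The sketch only computes how much alpha is spent at each look; turning those increments into z-thresholds requires numerical integration over the joint distribution of the looks and is not shown.

```python
import math
from statistics import NormalDist

_NORMAL = NormalDist()


def obf_spent(t, alpha=0.05):
    """O'Brien-Fleming-type cumulative alpha spent at information fraction t."""
    z = _NORMAL.inv_cdf(1 - alpha / 2)
    return 2.0 * (1.0 - _NORMAL.cdf(z / math.sqrt(t)))


def pocock_spent(t, alpha=0.05):
    """Pocock-type cumulative alpha spent at information fraction t."""
    return alpha * math.log(1.0 + (math.e - 1.0) * t)


def incremental_alpha(spend_fn, n_looks, alpha=0.05):
    """alpha_k for K equally spaced looks; the increments sum to alpha."""
    cumulative = [spend_fn(k / n_looks, alpha) for k in range(1, n_looks + 1)]
    previous = [0.0] + cumulative[:-1]
    return [c - p for c, p in zip(cumulative, previous)]


for name, fn in (("O'Brien-Fleming-type", obf_spent), ("Pocock-type", pocock_spent)):
    increments = incremental_alpha(fn, n_looks=5)
    print(name, [f"{a:.5f}" for a in increments], f"sum={sum(increments):.4f}")
```

The O'Brien-Fleming-type increments start near zero and grow towards the final look, while the Pocock-type increments are spread far more evenly; both sum to the declared alpha_total.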
3. Always-valid inference / mSPRT
Per Johari, Pekelis, Walsh "Always Valid Inference" (arXiv:1512.04922) and related work, the mixture sequential probability ratio test (mSPRT) lets you peek arbitrarily often without inflating alpha. The trade-off: at any given sample size it has less power than a fixed-horizon test run at that sample size.
This is the foundation of "valid sequential" experimentation in Optimizely / Statsig / similar - they expose p-values that are always valid under continuous monitoring.
Per Optimizely's sequential-testing docs (the method is a derivative of mSPRT), the user may look at any time and the p-value remains valid.
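A minimal sketch of an always-valid p-value, under simplifying assumptions that are not taken from any vendor's implementation: a single stream of observations (for example, per-pair treatment-minus-control differences) that are normal with known standard deviation `sigma`, a null mean `theta0`, and a normal mixing distribution N(theta0, tau^2) over the alternative. The function name `msprt_p_values` is hypothetical.

```python
import math


def msprt_p_values(observations, sigma, tau, theta0=0.0):
    """Always-valid p-values from a normal-mixture SPRT.

    Mixture likelihood ratio after n observations with running mean m:
      L_n = sqrt(s2 / (s2 + n*t2)) * exp(n^2 * t2 * (m - theta0)^2
                                         / (2 * s2 * (s2 + n*t2)))
    with s2 = sigma^2 and t2 = tau^2. The p-value is the running minimum
    of 1 / L_n, so it can only stay flat or decrease as data arrives.
    """
    s2, t2 = sigma ** 2, tau ** 2
    p_value, total, p_values = 1.0, 0.0, []
    for n, x in enumerate(observations, start=1):
        total += x
        mean = total / n
        log_lr = 0.5 * math.log(s2 / (s2 + n * t2)) + (
            n ** 2 * t2 * (mean - theta0) ** 2 / (2 * s2 * (s2 + n * t2))
        )
        p_value = min(p_value, math.exp(-log_lr))
        p_values.append(p_value)
    return p_values


# Hypothetical streams: a clear effect versus data that hovers around zero.
print(msprt_p_values([1.0] * 20, sigma=1.0, tau=1.0)[-1])
print(msprt_p_values([1.0, -1.0] * 10, sigma=1.0, tau=1.0)[-1])
```

Because the reported p-value is a running minimum, a decision rule of "stop when p <= alpha" can be checked after every observation; the choice of `tau` affects power but not validity.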
Visual intuition
| Approach | Early look (e.g. day 1) | Later looks | Final / overall guarantee |
|---|---|---|---|
| Plain fixed-horizon | Don't look | Don't look | Look once at day 30, alpha=0.05 |
| Fixed-horizon + early-stop = WRONG | "Hmm 0.04, ship!" | n/a | False-positive risk inflated |
| Pocock, 5 looks | Nominal alpha ≈ 0.016 | Nominal alpha ≈ 0.016 | Overall false-positive rate ≤ 0.05 (the per-look values are not simply summed, because looks are correlated) |
| mSPRT / always-valid | Look any time; p-value valid | Look any time | Same alpha guarantee |
Decision boundary in tests
Tests for an experimentation platform must verify:
| Behaviour | Test |
|---|---|
| Peeking does not inflate the platform's false-positive rate | Run many synthetic A/A tests; look 100× in each; the platform declares significance in ≤5% of them |
| Sequential adjustment correctly enforced | At look k, the threshold matches the declared schedule |
| Stop-early threshold consistent with declared method | O'Brien-Fleming thresholds are stricter than Pocock at early looks and looser at the final look |
| Always-valid p-value keeps its guarantee under continuous monitoring | Simulate A/A runs; the fraction whose p-value ever drops to ≤ alpha must not exceed alpha |
| Ship-decision gate enforces the peek-protected p-value | Mock a low naive p-value alongside a non-significant protected p-value; the gate must refuse to ship |
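The last row of the table can be sketched as a unit test. Everything here is hypothetical: `LookResult`, `ship_gate`, and the test name are illustrative stand-ins for whatever the platform under test actually exposes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LookResult:
    """Hypothetical snapshot of one metric at one look."""
    naive_p: float      # unadjusted p-value, as a dashboard might show it
    protected_p: float  # sequential or always-valid p-value


def ship_gate(result: LookResult, alpha: float = 0.05) -> bool:
    """Hypothetical gate: only the peek-protected p-value may trigger a ship."""
    return result.protected_p <= alpha


def test_gate_ignores_naive_p_value():
    # Tempting mid-experiment state: naive p looks significant, protected p does not.
    tempting = LookResult(naive_p=0.01, protected_p=0.20)
    assert ship_gate(tempting) is False
    # Genuine signal: the protected p-value itself is below alpha.
    genuine = LookResult(naive_p=0.001, protected_p=0.03)
    assert ship_gate(genuine) is True


test_gate_ignores_naive_p_value()
```

The point of the test is that the naive p-value is never an input to the decision, so no dashboard value can cause a ship on its own.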
Combining with guardrails
Per guardrail-metrics-reference: the guardrail-correction (Bonferroni / FDR) stacks with the peeking correction. Don't apply only one if both are needed.
For an experiment with one OEC, 10 guardrails, and 5 looks:
| Correction applied | Alpha per test |
|---|---|
| None (naive alpha per look per metric) | 0.05 |
| 5 looks, alpha split equally across looks | 0.01 per look (0.05 / 5) |
| Plus Bonferroni for 11 metrics at each look | ≈ 0.0009 per (look, metric) (0.01 / 11) |
The strict math is rarely applied this thoroughly; in practice, most platforms apply sequential testing with a per-metric alpha but no formal multiple-comparison correction across guardrails.
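The stacking arithmetic can be sketched as follows, using an equal (Bonferroni-style) split across looks as a conservative stand-in for a real alpha-spending schedule. The function name `per_test_alpha` is hypothetical.

```python
def per_test_alpha(alpha_total, n_looks, n_metrics):
    """Split one alpha budget across looks, then across metrics at each look.

    Assumption: equal split at both levels; a real spending schedule would
    allocate unequal amounts per look.
    """
    per_look = alpha_total / n_looks
    per_look_metric = per_look / n_metrics
    return per_look, per_look_metric


# One OEC + 10 guardrails = 11 metrics, 5 looks.
per_look, per_look_metric = per_test_alpha(0.05, n_looks=5, n_metrics=11)
print(f"per look: {per_look:.4f}, per (look, metric): {per_look_metric:.5f}")
```

Multiplying the per-(look, metric) value back by 5 × 11 = 55 tests recovers the original 0.05 budget, which is what makes the combined correction valid but also very strict.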
Anti-patterns
| Anti-pattern | Why it fails | Fix |
|---|---|---|
| Peek + early-stop on naive p-value | False positive rate explodes | Use sequential / always-valid |
| Dashboards refresh hourly, treated as "data" | Implicit peeking; humans see + react | Lock decisions to pre-declared look schedule |
| Stop-loss without symmetric stop-win | One-sided peeking still inflates the error rate | Use symmetric boundaries, or pre-commit the one-sided boundary |
| "We'll just look once at midpoint" | One unscheduled look = one inflation event | Either fixed-horizon OR sequential - not "fixed + one peek" |
| Different metrics use different schedules | Coordination mismatch; inconsistent alpha | One schedule per experiment |
| Re-running an experiment after p=0.06 to "find significance" | Garden of forking paths | Pre-commit; accept null result |
| Stop-early on a guardrail alone | Guardrails should be assessed at horizon | Stop-early only on OEC (with sequential math) |
| Treating "p=0.04 mid-experiment" as significant | Naive interpretation | Use the sequential / always-valid p-value |