peeking-problem-reference

Pure-reference catalog of the peeking problem in online A/B testing. Defines the problem (repeatedly looking at experiment results inflates the false-positive rate above the declared alpha because each look is a separate test), the canonical mitigations (fixed-horizon test with pre-declared sample size; sequential testing with alpha-spending functions e.g., O'Brien-Fleming, Pocock; always-valid inference / mSPRT per Johari et al.), and the policy choices (data-peek schedule, stop-early thresholds, decision-time guard rails). Use when designing an experimentation platform's stop-early policy or auditing why a result was declared significant. Composes guardrail-metrics-reference.

Overview

In classical (fixed-horizon) hypothesis testing, the test is run once, on a pre-declared sample size, at a pre-declared alpha (typically 0.05). Checking the data before the pre-declared end and stopping as soon as significance is reached inflates the false-positive rate well above alpha: with enough looks at a nominal 0.05 threshold, it can exceed 30%.

This is the peeking problem. Per Kohavi et al. Trustworthy Online Controlled Experiments (ISBN 978-1108724265): "Repeated significance testing is one of the most common mistakes in practical A/B testing."

This skill is a pure reference consumed by the AB-test validity checklist and the SRM detector agent.

When to use

  • Designing the stop-early policy for an experiment platform.
  • Auditing an "early ship" decision - was the math valid?
  • PR review of a new experiment dashboard / analysis flow.
  • Investigating "we shipped, then the effect disappeared."

Why naive peeking inflates false positives

At alpha=0.05, the test is calibrated to give a 5% false-positive rate if you look once. If you look every day for 30 days and ship at the first significant result, each look gives the test a fresh chance to cross the threshold by chance. The looks are correlated, so the error does not simply add up, but the overall false-positive rate still grows with every additional look.
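The inflation can be checked with a small A/A simulation. The sketch below is illustrative only: it assumes unit-variance per-user differences, equally sized batches between looks, and a two-sided z-test, and the function name `naive_peeking_fpr` is hypothetical.

```python
import math
import random
from statistics import NormalDist


def naive_peeking_fpr(n_looks, n_per_look=100, n_sims=2000, alpha=0.05, seed=0):
    """Estimate the false-positive rate of 'ship at first significance' on A/A data."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    false_positives = 0
    for _ in range(n_sims):
        diff_sum = 0.0  # running sum of (treatment - control) differences
        n = 0
        for _ in range(n_looks):
            # The sum of n_per_look unit-variance differences is N(0, n_per_look),
            # so each batch is drawn in one step. The true effect is zero (A/A).
            diff_sum += rng.gauss(0.0, math.sqrt(n_per_look))
            n += n_per_look
            z = diff_sum / math.sqrt(n)  # z-statistic on all data so far
            if abs(z) > z_crit:  # naive test repeated at every look
                false_positives += 1
                break
    return false_positives / n_sims


if __name__ == "__main__":
    for looks in (1, 5, 30):
        print(looks, naive_peeking_fpr(looks))
```

With a single look the estimate should sit near the declared 0.05; with more looks it climbs well above it.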

Per Microsoft Experimentation Platform research (microsoft.com/en-us/research/group/experimentation-platform-exp/), common patterns that invite peeking include dashboards that update hourly, "early-stop" buttons in experimentation UIs, and managers asking "where are we now?" mid-experiment.

Three corrections

1. Fixed-horizon test (pre-declared)

Decide N in advance via power analysis; collect N samples; do one test; ship or not. No peeking, no early stop.

Pros: standard p-value interpretation, full alpha budget on the declared test.

Cons: must wait for N. Cannot stop early on obvious winners (opportunity cost) or obvious losers (continuing risk).
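Deciding N in advance usually means a power calculation. A minimal sketch, assuming a conversion-rate metric, a two-sided two-proportion z-test, equal allocation, and an absolute minimum detectable lift (the function and parameter names are hypothetical):

```python
import math
from statistics import NormalDist


def sample_size_per_arm(baseline_rate, min_detectable_lift, alpha=0.05, power=0.8):
    """Approximate per-arm N for a two-sided two-proportion z-test."""
    p1 = baseline_rate
    p2 = baseline_rate + min_detectable_lift  # absolute lift, not relative
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for alpha
    z_beta = NormalDist().inv_cdf(power)           # quantile for desired power
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance_sum / (p2 - p1) ** 2
    return math.ceil(n)


if __name__ == "__main__":
    # Example: 10% baseline conversion, detect a 1 percentage point lift.
    print(sample_size_per_arm(0.10, 0.01))
```

Halving the detectable lift roughly quadruples N, which is why the waiting cost listed under Cons can be substantial.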

2. Sequential testing with alpha-spending

Pre-commit to multiple looks, each with a fraction of the alpha budget. Two canonical schedules:

| Schedule | Pattern |
| --- | --- |
| Pocock | Equal alpha at each look; symmetric |
| O'Brien-Fleming | Tiny alpha early, large alpha late; conservative early-stop |

Implementation: declare K looks in advance; at each look k, the rejection threshold is computed from the cumulative alpha spent (per the schedule). If the test stat exceeds the threshold, stop.

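Math: Σ_k alpha_k = alpha_total, where alpha_k is the type I error newly spent at look k (not the nominal per-look p-value threshold, which is generally larger).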
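One common way to produce such a budget is a Lan-DeMets spending function, which approximates the two schedules above as a function of the information fraction t (the share of the planned sample seen so far). The sketch below only computes how much alpha is spent at each look, assuming equally spaced looks; turning those increments into rejection thresholds requires numerical integration over the joint distribution of the test statistics, which is not shown. Function names are hypothetical.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def obf_type_spend(t, alpha=0.05):
    """O'Brien-Fleming-type cumulative alpha spent at information fraction t."""
    z = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    return 2 * (1 - _STD_NORMAL.cdf(z / math.sqrt(t)))


def pocock_type_spend(t, alpha=0.05):
    """Pocock-type cumulative alpha spent at information fraction t."""
    return alpha * math.log(1 + (math.e - 1) * t)


def alpha_increments(spend_fn, n_looks, alpha=0.05):
    """Alpha newly spent at each of n_looks equally spaced looks."""
    cumulative = [spend_fn(k / n_looks, alpha) for k in range(1, n_looks + 1)]
    previous = [0.0] + cumulative[:-1]
    return [c - p for c, p in zip(cumulative, previous)]


if __name__ == "__main__":
    for name, fn in (("OBF-type", obf_type_spend), ("Pocock-type", pocock_type_spend)):
        increments = alpha_increments(fn, n_looks=5)
        print(name, [round(a, 5) for a in increments], "sum:", round(sum(increments), 5))
```

Both schedules spend exactly alpha_total by the final look; they differ in how early that budget is released.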

3. Always-valid inference / mSPRT

Per Johari, Pekelis, Walsh "Always Valid Inference" (arXiv:1512.04922) and related work, the mixture sequential probability ratio test (mSPRT) lets you peek arbitrarily often without inflating alpha. The trade-off: it is less powerful per sample than a fixed-horizon test.

This is the foundation of "valid sequential" experimentation in Optimizely / Statsig / similar - they expose p-values that are always valid under continuous monitoring.

Per Optimizely's sequential-testing docs, which describe a derivative of mSPRT, the platform allows the user to look at any time and the p-value remains valid.
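To make the idea concrete, here is a minimal mSPRT sketch for the simplest textbook case. It assumes a stream of per-unit treatment-minus-control differences that are normal with known standard deviation `sigma`, and a N(0, tau^2) mixing distribution over the effect. `tau` is a tuning choice. This is not any vendor's implementation, and the function name is hypothetical.

```python
import math


def msprt_p_values(observations, sigma, tau, theta0=0.0):
    """Always-valid p-values for H0: mean == theta0, normal data with known sigma."""
    p_values = []
    p = 1.0       # p_0 = 1
    total = 0.0   # running sum of (x - theta0)
    for n, x in enumerate(observations, start=1):
        total += x - theta0
        denom = sigma**2 + n * tau**2
        # Log of the mixture likelihood ratio Lambda_n for a normal mixing prior.
        log_lr = 0.5 * math.log(sigma**2 / denom) + (
            tau**2 * total**2 / (2 * sigma**2 * denom)
        )
        # p_n = min(p_{n-1}, 1 / Lambda_n): the p-value can only go down.
        p = min(p, math.exp(-log_lr))
        p_values.append(p)
    return p_values


if __name__ == "__main__":
    import random

    rng = random.Random(0)
    stream = [rng.gauss(0.3, 1.0) for _ in range(500)]  # true effect of 0.3
    ps = msprt_p_values(stream, sigma=1.0, tau=1.0)
    print("p after 50:", ps[49], "p after 500:", ps[-1])
```

Because the p-value sequence is non-increasing and valid at every n, stopping the first time it falls below alpha keeps the false-positive rate at or below alpha regardless of how often it is checked.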

Visual intuition

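| Approach | Look 1 (day 1) | Look 30 (day 30) | Final |
| --- | --- | --- | --- |
| Classical fixed-horizon | Don't look | Don't look | Look once at day 30, alpha=0.05 |
| Fixed-horizon + early-stop = WRONG | "Hmm 0.04, ship!" | n/a | False positive risk inflated |
| Pocock 5 looks | Nominal alpha ≈ 0.016 at each look | Nominal alpha ≈ 0.016 at each look | Overall alpha = 0.05 |
| mSPRT / always-valid | Look any time; p-value valid | Look any time | Same alpha guarantee |

For Pocock, the per-look figure is the nominal threshold that keeps the overall error at 0.05. It is not 0.05 divided by the number of looks, and the nominal thresholds themselves sum to more than 0.05 because the looks are correlated.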

Decision boundary in tests

Tests for an experimentation platform must verify:

| Behaviour | Test |
| --- | --- |
| Naive p-value not auto-significant on peek | Run synthetic A/A tests; look 100× per run; the platform declares significance in ≤5% of runs |
| Sequential adjustment correctly enforced | At look N, threshold matches the declared schedule |
| Stop-early threshold consistent with declared method | Pocock thresholds are constant across looks; O'Brien-Fleming thresholds are strict early and looser late |
| Always-valid p-value respects the declared alpha under continuous monitoring | Simulate A/A runs; the p-value ever drops below alpha in at most an alpha fraction of runs |
| Ship-decision gate enforces the peek-protected p-value | Mock a low naive p-value, observe gate rejection |
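The last row can be expressed as a unit test. The sketch below is purely illustrative: `LookResult` and `ship_gate` are hypothetical stand-ins for whatever the platform under test actually exposes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LookResult:
    naive_p: float      # unadjusted p-value at this look
    protected_p: float  # sequential or always-valid p-value at this look


def ship_gate(result: LookResult, alpha: float = 0.05) -> bool:
    """Allow a ship decision only when the peek-protected p-value clears alpha."""
    return result.protected_p <= alpha


def test_gate_ignores_naive_p():
    # Naive p looks significant mid-experiment; the protected p does not.
    peek = LookResult(naive_p=0.01, protected_p=0.20)
    assert ship_gate(peek) is False


def test_gate_ships_on_protected_p():
    # Once the protected p clears alpha, the gate opens.
    final = LookResult(naive_p=0.001, protected_p=0.03)
    assert ship_gate(final) is True
```

The point of the first test is that the gate never reads `naive_p` at all, so a low naive value cannot open it.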

Combining with guardrails

Per guardrail-metrics-reference: the guardrail-correction (Bonferroni / FDR) stacks with the peeking correction. Don't apply only one if both are needed.

For an experiment with one OEC, 10 guardrails, and 5 looks:

| Correction applied | Nominal alpha |
| --- | --- |
| Naive alpha per look per metric | 0.05 |
| With 5 looks alpha-spending | 0.011 per look |
| With Bonferroni for 11 metrics at each look | 0.001 per (look, metric) |

The strict math is rarely applied this thoroughly; pragmatically most platforms apply sequential + per-metric alpha but not formal multi-comparison correction across guardrails.
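The stacking in the table above is plain division. A short sketch, taking the 0.011 per-look figure as given (the exact value depends on the chosen spending schedule) and using a hypothetical helper name:

```python
def per_comparison_alpha(alpha_per_look: float, n_metrics: int) -> float:
    """Bonferroni split of one look's alpha across n_metrics."""
    return alpha_per_look / n_metrics


n_metrics = 1 + 10      # one OEC plus ten guardrails
alpha_per_look = 0.011  # per-look figure from the table, assumed here
print(round(per_comparison_alpha(alpha_per_look, n_metrics), 4))  # 0.001
```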

Anti-patterns

| Anti-pattern | Why it fails | Fix |
| --- | --- | --- |
| Peek + early-stop on naive p-value | False positive rate explodes | Use sequential / always-valid |
| Dashboards refresh hourly, treated as "data" | Implicit peeking; humans see + react | Lock decisions to pre-declared look schedule |
| Stop-loss without symmetric stop-win | One-sided peeking still inflates | Symmetric or pre-committed |
| "We'll just look once at midpoint" | One unscheduled look = one inflation event | Either fixed-horizon OR sequential - not "fixed + one peek" |
| Different metrics use different schedules | Coordination mismatch; inconsistent alpha | One schedule per experiment |
| Re-running an experiment after p=0.06 to "find significance" | Garden of forking paths | Pre-commit; accept null result |
| Stop-early on a guardrail alone | Guardrails should be assessed at horizon | Stop-early only on OEC (with sequential math) |
| Treating "p=0.04 mid-experiment" as significant | Naive interpretation | Use the sequential / always-valid p-value |

Limitations

  • Always-valid inference is less powerful. The same effect size requires more samples than a fixed-horizon test. You trade sample efficiency for the freedom to look at any time.
  • Sequential methods require pre-declared schedules. The alpha-spending isn't "fluid"; the schedule is fixed in advance.
  • Multiple-testing correction across many metrics is brutal. Per-metric alpha after Bonferroni × 20 metrics = 0.0025.
  • Operator behaviour is the real bottleneck. The corrected math is robust to peeking; humans are not. Education + UI gating matter.
  • Doesn't help with novelty / primacy effects. Statistical validity doesn't fix "users react to change, then revert."

References