guardrail-metrics-reference
Pure-reference catalog of guardrail-metric methodology for online controlled experiments. Defines guardrail metrics (metrics that must NOT degrade for an experiment to ship, even if the primary metric improves), the standard guardrail set (latency / errors / engagement / opt-out), the relationship to OEC (Overall Evaluation Criterion) per Kohavi et al., and the trustworthy-experiments framework (Microsoft Experimentation Platform). Use when designing the metric set for a new experiment, auditing existing experiment configs, or reviewing experiment results before ship-decisions. Composes peeking-problem-reference + ab-test-validity-checklist.
guardrail-metrics-reference
Overview
A guardrail metric is a measure that must not significantly degrade for an experiment to ship, even if the primary metric (the OEC - Overall Evaluation Criterion) improves. The guardrail prevents "we shipped 5% revenue improvement, but latency 30% worse and we discovered too late." Per Kohavi et al. Trustworthy Online Controlled Experiments (Cambridge Univ. Press, ISBN 978-1108724265), this is "the most important class of metrics after the OEC."
This skill is a pure reference consumed by the AB-test validity checklist and the SDK-test skills.
When to use
The guardrail taxonomy
Four classes:
| Class | Examples | Why |
|---|---|---|
| Quality / engineering | API p95 latency, error rate, crash rate, time-to-first-byte | A degraded experience is bad even with metric wins |
| Engagement | DAU, MAU, sessions per user, time on site | Engagement loss is a strategic loss |
| Revenue | Gross revenue, conversion rate, ARPU | Direct business impact |
| Trust | Opt-out rate, unsubscribe rate, complaint rate | Long-term churn signal |
Per Microsoft Experimentation Platform research (microsoft.com/en-us/research/group/experimentation-platform-exp/): "Be vigilant when running A/B tests" because "tiny SRMs" (per peeking-problem-reference sibling concept) and degraded guardrails are the canonical ship-and-regret sources.
The OEC vs guardrail relationship
Per Kohavi et al.: the OEC is one metric (or a weighted combination), declared in advance, with a power calculation. The guardrails are the rest of the dashboard - short-term loss is acceptable if within bounds, but a significant degradation blocks ship.
Setting guardrail thresholds
A guardrail typically has two levels:
| Threshold | What |
|---|---|
| Alert | A statistically significant degradation; investigate before ship |
| Block | A degradation past a pre-declared limit; ship-decision flips to "no" |
Example for API latency p95:
| Level | Threshold |
|---|---|
| Alert | Any statistically significant increase |
| Block | > 10% increase OR > 50ms absolute increase, whichever is greater |
The "whichever is greater" handles fast endpoints where 10% is trivially small in absolute terms.
Standard guardrails - the canonical set
| Domain | Guardrail | Direction |
|---|---|---|
| Web app | TTFB, LCP, INP (Core Web Vitals) | Should not increase |
| API | p95 / p99 latency, error rate, 5xx rate | Should not increase |
| Mobile | Crash rate, ANR rate, app start time | Should not increase |
| Engagement | DAU, sessions / user, retention day 7 | Should not decrease |
| Revenue | Gross revenue, average order value, conversion | Should not decrease |
| Trust | Opt-out rate, complaint rate, refund rate | Should not increase |
Per Kohavi et al.: always include a quality guardrail (latency / error) - the most-missed category in real experiments.
Pre-commitment vs post-hoc
Guardrails must be declared before the experiment starts. Per Kohavi et al.: post-hoc guardrails are p-hacking - if you look at 50 metrics, some will spuriously fail.
Document declarations in the experiment config:
experiment: feed-ranking-v3
oec: ctr_per_session
power:
primary_metric: ctr_per_session
expected_effect: +1.5%
alpha: 0.05
beta: 0.20
guardrails:
- metric: api_p95_latency
direction: not-increase
block_threshold: +10% or +50ms
- metric: dau
direction: not-decrease
block_threshold: -1%
- metric: error_rate
direction: not-increase
block_threshold: +0.1pp absoluteMultiple-comparison correction
With one OEC + N guardrails (typically 10-20), a fixed-alpha significance test means you'll see NĂ—0.05 false positives on average. Per Kohavi et al., apply Bonferroni or Benjamini- Hochberg correction:
| Method | When |
|---|---|
| Bonferroni | Strict; alpha / N. Use when missing a true regression is catastrophic |
| Benjamini-Hochberg (FDR) | False discovery rate; less strict |
Anti-patterns
| Anti-pattern | Why it fails | Fix |
|---|---|---|
| OEC + zero guardrails | Cargo-cult "ship the metric improvement" | Always include latency + error |
| Guardrails added after seeing results | p-hacking variant; non-causal | Pre-commit guardrails |
| Same alpha across OEC + 50 guardrails | Inflated false-positive rate | Bonferroni / FDR correction |
| Guardrail thresholds invented post-hoc | Move the goalposts | Pre-commit thresholds |
| Single block-threshold (no alert level) | Pass / fail; no surface for "investigate" | Two-tier: alert + block |
| Guardrail in % only on fast endpoint | 10% of 10ms = nothing; ship a 9ms regression | Use max(% , absolute) |
| No mobile-specific guardrails on a mobile experiment | Web-shaped metrics miss crash / ANR | Per-surface guardrails |
| Re-using last experiment's guardrails verbatim | New experiment, new failure modes | Per-experiment review |