qa-experimentation

Experimentation harness testing: SDK-specific testing for Statsig, Optimizely, VWO, Amplitude Experiment, and Split.io (Harness FME); sample-ratio-mismatch (SRM) detection; an A/B-test validity checklist; and guardrail-metrics and peeking-problem references. Distinct from qa-shift-right/feature-flag-experiment-validator, which validates experiment results; this plugin tests the experimentation harness itself (SDK behaviour, assignment integrity, statistical-validity gates).

Part of role bundle: qa-role-backend

Components

Type	Name	Description
Skill	ab-test-validity-checklist	Workflow-driven skill that builds an A/B test validity checklist from an experiment proposal.
Skill	amplitude-experiment-test	Wraps Amplitude Experiment SDK testing patterns: client initialization with API key (or local-flags JSON), the fetch / variant API, expos...
Skill	experiment-results-interpreter	Pure-reference catalog for interpreting the results of an online controlled experiment after harness validity is confirmed.
Skill	guardrail-metrics-reference	Pure-reference catalog of guardrail-metric methodology for online controlled experiments.
Skill	optimizely-test	Wraps Optimizely Feature Experimentation SDK testing patterns: client initialization with a datafile (offline-friendly), the decide / dec...
Skill	peeking-problem-reference	Pure-reference catalog of the peeking problem in online A/B testing.
Skill	split-io-test	Wraps Split.io (Harness FME) SDK testing patterns: hermetic localhost/offline mode with an in-memory features map (JavaScript/browser) or...
Skill	statsig-test	Wraps Statsig SDK testing patterns: server-side initialization (statsig.initialize with API key), gate / experiment / dynamic-config eval...
Skill	vwo-test	Wraps VWO (Visual Website Optimizer) SDK testing patterns: SDK initialization with the settings file (offline-capable), `getFeatureVariab...
Agent	sample-ratio-mismatch-detector	Read-only specialist that detects Sample Ratio Mismatch (SRM) in an A/B test by running a chi-square test against the observed-vs-expecte...

Install

/plugin marketplace add testland/qa
/plugin install qa-experimentation@testland-qa

Skills

ab-test-validity-checklist

Workflow-driven skill that builds an A/B test validity checklist from an experiment proposal. Walks through the canonical validity gates (pre-registration of OEC + power calc + guardrails, randomization unit + SRM check, assignment integrity, telemetry correctness, peeking discipline per peeking-problem-reference, novelty/primacy assessment, post-experiment SRM re-check, results-interpretation guardrails per Kohavi et al.) and emits a per-experiment checklist + a sign-off form. Use when launching a new experiment, auditing an existing one, or building experimentation governance. Composes guardrail-metrics-reference + peeking-problem-reference.
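The power-calculation gate in the list above can be illustrated with a minimal sketch. It assumes a conversion-rate metric and a two-sided two-proportion z-test; the function name and parameters are hypothetical and are not part of the skill itself.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(baseline_rate, min_detectable_lift, alpha=0.05, power=0.80):
    """Approximate users needed per arm for a two-sided two-proportion z-test.

    min_detectable_lift is an absolute difference (0.01 = one percentage point).
    This is the textbook normal approximation, not any platform's exact formula.
    """
    p1 = baseline_rate
    p2 = baseline_rate + min_detectable_lift
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_beta = NormalDist().inv_cdf(power)           # quantile for the target power
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance_sum / (min_detectable_lift ** 2)
    return math.ceil(n)


if __name__ == "__main__":
    # Hypothetical proposal: 10% baseline, detect a one-point absolute lift.
    print(sample_size_per_arm(0.10, 0.01))
```

Pre-registering this number alongside the OEC gives the later peeking and SRM gates a fixed horizon to check against.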

amplitude-experiment-test

Wraps Amplitude Experiment SDK testing patterns: client initialization with API key (or a bootstrapped local flag config for offline tests), the fetch / variant API, exposure-event suppression in tests, and assignment-integrity tests. Use when writing tests for code that uses Amplitude Experiment for A/B testing or flag management. Composes guardrail-metrics-reference + peeking-problem-reference + ab-test-validity-checklist.

experiment-results-interpreter

Interprets the results of a valid online controlled experiment, one whose harness, SRM, and telemetry have already been confirmed. Covers the distinction between practical and statistical significance, reading confidence intervals instead of binary p-values, novelty and primacy week-over-week decay that causes post-ship reversion, interaction effects from concurrent experiments, Simpson's paradox in segmented results, and the ordered guardrail-check sequence required before a ship decision. Use when a data scientist or PM is ready to draw conclusions from an experiment whose telemetry and randomisation have already passed the ab-test-validity-checklist. Distinct from ab-test-validity-checklist (harness setup and SRM detection) and from interaction-effect overlap auditing during experiment design.
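The distinction between practical and statistical significance can be sketched as follows. This is a minimal illustration assuming a conversion-rate metric and a normal-approximation (Wald) interval; the function names, the labels, and the practical threshold are hypothetical.

```python
import math

Z_95 = 1.959964  # two-sided critical value for a 95% interval


def lift_confidence_interval(conv_control, n_control, conv_treatment, n_treatment):
    """Return (difference, lower, upper) for treatment minus control rates."""
    p_c = conv_control / n_control
    p_t = conv_treatment / n_treatment
    diff = p_t - p_c
    se = math.sqrt(p_c * (1 - p_c) / n_control + p_t * (1 - p_t) / n_treatment)
    return diff, diff - Z_95 * se, diff + Z_95 * se


def interpret(lower, upper, practical_threshold):
    """Classify an interval against zero and a minimum effect worth shipping."""
    if lower <= 0.0 <= upper:
        return "inconclusive: interval includes zero"
    if upper < 0.0:
        return "statistically significant harm"
    if lower >= practical_threshold:
        return "statistically and practically significant"
    if upper < practical_threshold:
        return "statistically significant, practically negligible"
    return "statistically significant, practical relevance uncertain"


if __name__ == "__main__":
    # Hypothetical data: a very large test with a tiny absolute lift.
    _, lo, hi = lift_confidence_interval(100_000, 1_000_000, 101_000, 1_000_000)
    print(interpret(lo, hi, practical_threshold=0.005))
```

Reading the whole interval against a pre-declared threshold avoids treating a small p-value on a very large sample as a reason to ship.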

guardrail-metrics-reference

Pure-reference catalog of guardrail-metric methodology for online controlled experiments. Defines guardrail metrics (metrics that must NOT degrade for an experiment to ship, even if the primary metric improves), the standard guardrail set (latency / errors / engagement / opt-out), the relationship to OEC (Overall Evaluation Criterion) per Kohavi et al., and the trustworthy-experiments framework (Microsoft Experimentation Platform). Use when designing the metric set for a new experiment, auditing existing experiment configs, or reviewing experiment results before ship-decisions. Composes peeking-problem-reference + ab-test-validity-checklist.

optimizely-test

Wraps Optimizely Feature Experimentation SDK testing patterns: client initialization with a datafile (offline-friendly), the decide / decideAll API (the v5 API), forced decisions for per-test arm pinning, OptimizelyUserContext + activate / track events, and assignment-integrity tests. Use when writing tests for Optimizely-instrumented application code. Composes guardrail-metrics-reference + peeking-problem-reference + ab-test-validity-checklist.

peeking-problem-reference

Pure-reference catalog of the peeking problem in online A/B testing. Defines the problem (repeatedly looking at experiment results, and stopping at the first significant one, inflates the false-positive rate above the declared alpha because each look is another chance to cross the threshold), the canonical mitigations (fixed-horizon test with pre-declared sample size; sequential testing with alpha-spending functions, e.g., O'Brien-Fleming, Pocock; always-valid inference / mSPRT per Johari et al.), and the policy choices (data-peek schedule, stop-early thresholds, decision-time guardrails). Use when designing an experimentation platform's stop-early policy or auditing why a result was declared significant. Composes guardrail-metrics-reference.
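The inflation described above can be seen in a small A/A simulation. This is a sketch under simplifying assumptions (normally distributed data with known variance, equally spaced looks, no true effect); the function name and defaults are hypothetical.

```python
import math
import random

Z_CRIT = 1.959964  # two-sided critical value for alpha = 0.05


def false_positive_rates(n_sims=2000, n_looks=10, seed=0):
    """Simulate A/A tests and compare two stopping policies.

    Returns (peeking_rate, fixed_horizon_rate):
      peeking_rate       - share of runs significant at ANY interim look
      fixed_horizon_rate - share of runs significant at the final look only
    """
    rng = random.Random(seed)
    peek_hits = 0
    fixed_hits = 0
    for _ in range(n_sims):
        cumulative = 0.0
        crossed = False
        z = 0.0
        for look in range(1, n_looks + 1):
            # Each batch contributes a standardised N(0, 1) increment,
            # so the running sum after `look` batches has variance `look`.
            cumulative += rng.gauss(0.0, 1.0)
            z = cumulative / math.sqrt(look)
            if abs(z) > Z_CRIT:
                crossed = True
        peek_hits += crossed
        fixed_hits += abs(z) > Z_CRIT
    return peek_hits / n_sims, fixed_hits / n_sims


if __name__ == "__main__":
    peeking, fixed = false_positive_rates()
    print(f"peeking: {peeking:.3f}  fixed horizon: {fixed:.3f}")
```

The fixed-horizon rate stays near the declared alpha, while the stop-at-first-significance rate is several times higher; sequential methods exist to bring the latter back under control.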

split-io-test

Wraps Split.io (Harness FME) SDK testing patterns: hermetic localhost/offline mode with an in-memory features map (JavaScript/browser) or a YAML fixture file (Node.js server-side), getTreatment and getTreatmentWithConfig evaluation, the SDK_READY event and whenReady() promise, impression listener verification, sync.impressionsMode configuration, and CI setup. Use when writing tests for application code instrumented with the Split.io or Harness Feature Management & Experimentation SDK.

statsig-test

Wraps Statsig SDK testing patterns: server-side initialization (statsig.initialize with API key), gate / experiment / dynamic-config evaluation (checkGate, getExperiment, getConfig), local-evaluation mode for offline tests, override patterns for forcing a specific user into a specific arm (statsig.overrideGate, overrideConfig), and assignment-integrity tests. Use when writing tests for Statsig-instrumented application code. Composes guardrail-metrics-reference + peeking-problem-reference + ab-test-validity-checklist.

vwo-test

Wraps VWO (Visual Website Optimizer) SDK testing patterns: SDK initialization with the settings file (offline-capable), `getFeatureVariableValue` and `activate` API, force-bucketing for per-test assignment, and assignment-integrity tests against the bucketing algorithm. Use when writing tests for VWO-instrumented application code. Composes guardrail-metrics-reference + peeking-problem-reference + ab-test-validity-checklist.
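An assignment-integrity test usually checks two properties: the same user always lands in the same arm, and the arms receive roughly the configured share of traffic. The sketch below shows the idea against a hypothetical SHA-256 bucketing function. It is not VWO's bucketing algorithm or any vendor's API; in a real test, assign_variant would be replaced by a call into the SDK under test.

```python
import hashlib


def assign_variant(user_id, experiment_key, variants=("control", "treatment")):
    """Hypothetical deterministic bucketing: hash the key pair into 10,000 buckets."""
    digest = hashlib.sha256(f"{experiment_key}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return variants[bucket * len(variants) // 10_000]


def check_assignment_integrity(experiment_key, n_users=10_000, tolerance=0.03):
    """Return (deterministic, balanced, treatment_share) for synthetic users."""
    user_ids = [f"user-{i}" for i in range(n_users)]
    first_pass = [assign_variant(u, experiment_key) for u in user_ids]
    second_pass = [assign_variant(u, experiment_key) for u in user_ids]
    deterministic = first_pass == second_pass  # same user, same arm
    treatment_share = first_pass.count("treatment") / n_users
    balanced = abs(treatment_share - 0.5) <= tolerance  # 50/50 split assumed
    return deterministic, balanced, treatment_share


if __name__ == "__main__":
    print(check_assignment_integrity("checkout-redesign"))
```

The tolerance is an assumption chosen for illustration; a stricter check would apply a chi-square test to the observed counts.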

Agents

sample-ratio-mismatch-detector

Read-only specialist that detects Sample Ratio Mismatch (SRM) in an A/B test by running a chi-square test against the observed-vs-expected allocation. Returns a verdict (clean / SRM detected) and, if SRM detected, a taxonomy of likely root causes per the Microsoft Research KDD 2019 paper 'Diagnosing Sample Ratio Mismatch' (logging bugs, bot filtering, redirects, telemetry drops, randomization bugs). Use proactively at experiment-end before any ship decision, or when investigating surprising results. Preloads guardrail-metrics-reference + peeking-problem-reference.
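The chi-square check at the core of SRM detection can be sketched for the two-arm case. This is a minimal illustration, not the agent's implementation: the function name is hypothetical, and the p < 0.001 threshold is an assumed convention rather than a value stated in this document.

```python
import math


def srm_check(control_count, treatment_count, expected_control_share=0.5,
              p_threshold=0.001):
    """Chi-square goodness-of-fit test for a two-arm allocation.

    Returns (chi2, p_value, verdict). With two arms there is one degree of
    freedom, so the p-value is erfc(sqrt(chi2 / 2)) and needs no SciPy.
    """
    total = control_count + treatment_count
    expected_control = total * expected_control_share
    expected_treatment = total - expected_control
    chi2 = ((control_count - expected_control) ** 2 / expected_control
            + (treatment_count - expected_treatment) ** 2 / expected_treatment)
    p_value = math.erfc(math.sqrt(chi2 / 2))
    verdict = "SRM detected" if p_value < p_threshold else "clean"
    return chi2, p_value, verdict


if __name__ == "__main__":
    # Hypothetical counts for an experiment configured as a 50/50 split.
    print(srm_check(50_000, 48_000))
    print(srm_check(5_050, 4_950))
```

A strict threshold is typical for this check because a flagged SRM invalidates the whole analysis, so false alarms are costly; experiments with more than two arms need a general chi-square distribution function instead of the one-degree-of-freedom shortcut used here.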