qa-experimentation
Experimentation harness testing: SDK-specific testing for Statsig, Optimizely, VWO, Amplitude Experiment; sample-ratio-mismatch (SRM) detection; AB-test validity checklist; guardrail-metrics + peeking-problem references. Distinct from qa-shift-right/feature-flag-experiment-validator (validates experiment results); this plugin tests the experimentation harness itself (SDK behaviour, assignment integrity, statistical-validity gates).
Install this plugin
/plugin install qa-experimentation@testland-qa
Part of role bundle: qa-role-backend
Components
| Type | Name | Description |
|---|---|---|
| Skill | ab-test-validity-checklist | Workflow-driven skill that builds an A/B test validity checklist from an experiment proposal. |
| Skill | amplitude-experiment-test | Wraps Amplitude Experiment SDK testing patterns: client initialization with API key (or local-flags JSON), the fetch / variant API, expos... |
| Skill | experiment-results-interpreter | Pure-reference catalog for interpreting the results of an online controlled experiment after harness validity is confirmed. |
| Skill | guardrail-metrics-reference | Pure-reference catalog of guardrail-metric methodology for online controlled experiments. |
| Skill | optimizely-test | Wraps Optimizely Feature Experimentation SDK testing patterns: client initialization with a datafile (offline-friendly), the decide / dec... |
| Skill | peeking-problem-reference | Pure-reference catalog of the peeking problem in online A/B testing. |
| Skill | split-io-test | Wraps Split.io (Harness FME) SDK testing patterns: hermetic localhost/offline mode with an in-memory features map (JavaScript/browser) or... |
| Skill | statsig-test | Wraps Statsig SDK testing patterns: server-side initialization (statsig.initialize with API key), gate / experiment / dynamic-config eval... |
| Skill | vwo-test | Wraps VWO (Visual Website Optimizer) SDK testing patterns: SDK initialization with the settings file (offline-capable), `getFeatureVariab... |
| Agent | sample-ratio-mismatch-detector | Read-only specialist that detects Sample Ratio Mismatch (SRM) in an A/B test by running a chi-square test against the observed-vs-expecte... |
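The SRM check named in the sample-ratio-mismatch-detector row, a chi-square test of observed against expected arm counts, can be sketched roughly as below for a two-arm experiment. This is a minimal illustration, not the agent's actual implementation: the function names, the 50/50 default split, and the `SRM_ALPHA = 0.001` threshold are assumptions introduced here.

```python
import math

# Assumed alert threshold; SRM checks often use a strict alpha because the
# check runs on every experiment. Not taken from the plugin itself.
SRM_ALPHA = 0.001


def srm_p_value(control_n: int, treatment_n: int,
                expected_control_share: float = 0.5) -> float:
    """Chi-square goodness-of-fit p-value for a two-arm split (1 degree of freedom)."""
    total = control_n + treatment_n
    expected = (total * expected_control_share,
                total * (1.0 - expected_control_share))
    observed = (control_n, treatment_n)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # With 1 degree of freedom, the chi-square survival function
    # reduces to erfc(sqrt(x / 2)), so no stats library is needed.
    return math.erfc(math.sqrt(chi2 / 2.0))


def has_srm(control_n: int, treatment_n: int,
            expected_control_share: float = 0.5) -> bool:
    """Flag the experiment when the observed split is implausible under the design."""
    return srm_p_value(control_n, treatment_n, expected_control_share) < SRM_ALPHA
```

Under these assumptions, a 5,200 / 4,800 split on a designed 50/50 experiment is flagged, while a 5,050 / 4,950 split is not. Experiments with more than two arms need the general chi-square distribution with k - 1 degrees of freedom, which this sketch does not cover.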
Install
/plugin marketplace add testland/qa
/plugin install qa-experimentation@testland-qa
Skills
ab-test-validity-checklist
Workflow-driven skill that builds an A/B test validity checklist from an experiment proposal. Walks through the canonical validity gates (pre-registration of OEC + power calc + guardrails, randomization unit + SRM check, assignment integrity, telemetry correctness, peeking discipline per peeking-problem-reference, novelty/primacy assessment, post-experiment SRM re-check, results-interpretation guardrails per Kohavi et al.) and emits a per-experiment checklist + a sign-off form. Use when launching a new experiment, auditing an existing one, or building experimentation governance. Composes guardrail-metrics-reference + peeking-problem-reference.
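To make the assignment-integrity gate concrete, here is a minimal sketch of the two properties such a test usually asserts: the same user always lands in the same arm, and the observed split stays close to the designed one. The `assign_arm` function is a hypothetical hash-based bucketer written for this example; it is not the bucketing algorithm of any SDK covered by this plugin.

```python
import hashlib

ARMS = ("control", "treatment")


def assign_arm(user_id: str, experiment_key: str) -> str:
    """Hypothetical bucketer: hash (experiment, user) into 10,000 buckets, split 50/50."""
    digest = hashlib.sha256(f"{experiment_key}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return ARMS[0] if bucket < 5_000 else ARMS[1]


def check_assignment_integrity(experiment_key: str, n_users: int = 20_000):
    """Return (stable, control_share) for a synthetic population of user IDs."""
    users = [f"user-{i}" for i in range(n_users)]
    first_pass = [assign_arm(u, experiment_key) for u in users]
    second_pass = [assign_arm(u, experiment_key) for u in users]
    # Stability: re-evaluating the same users must not move anyone between arms.
    stable = first_pass == second_pass
    # Balance: the share of users in control should be near the designed 50%.
    control_share = first_pass.count("control") / n_users
    return stable, control_share
```

In a real suite the call to `assign_arm` would be replaced by the SDK's own evaluation call running in its offline mode, and the balance check would typically feed an SRM test rather than a fixed tolerance.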
amplitude-experiment-test
Wraps Amplitude Experiment SDK testing patterns: client initialization with API key (or a bootstrapped local flag config for offline tests), the fetch / variant API, exposure-event suppression in tests, and assignment-integrity tests. Use when writing tests for code that uses Amplitude Experiment for A/B testing or flag management. Composes guardrail-metrics-reference + peeking-problem-reference + ab-test-validity-checklist.
experiment-results-interpreter
Interprets the results of a valid online controlled experiment, one whose harness, SRM, and telemetry have already been confirmed. Covers the distinction between practical and statistical significance, reading confidence intervals instead of binary p-values, novelty and primacy week-over-week decay that causes post-ship reversion, interaction effects from concurrent experiments, Simpson's paradox in segmented results, and the ordered guardrail-check sequence required before a ship decision. Use when a data scientist or PM is ready to draw conclusions from an experiment whose telemetry and randomisation have already passed the ab-test-validity-checklist. Distinct from ab-test-validity-checklist (harness setup and SRM detection) and from interaction-effect overlap auditing during experiment design.
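As an illustration of reading a confidence interval against a practical-significance threshold instead of a binary p-value, the sketch below computes a normal-approximation interval for a difference in conversion rates. The function names, the labels returned by `interpret`, and the `min_practical_effect` parameter are assumptions made for this example, not part of the skill.

```python
import math


def diff_in_proportions_ci(control_conv: int, control_n: int,
                           treatment_conv: int, treatment_n: int,
                           z: float = 1.96):
    """Approximate 95% CI for (treatment rate - control rate), normal approximation."""
    p_c = control_conv / control_n
    p_t = treatment_conv / treatment_n
    diff = p_t - p_c
    se = math.sqrt(p_c * (1 - p_c) / control_n + p_t * (1 - p_t) / treatment_n)
    return diff - z * se, diff + z * se


def interpret(ci, min_practical_effect: float) -> str:
    """Classify an interval relative to zero and to the smallest effect worth shipping."""
    low, high = ci
    if low >= min_practical_effect:
        return "statistically and practically significant"
    if low > 0:
        return "statistically significant, practical value uncertain"
    if high < 0:
        return "statistically significant harm"
    return "inconclusive"
```

For example, under these assumptions 1,000 / 10,000 conversions in control against 1,080 / 10,000 in treatment gives an interval that straddles zero, so `interpret` returns "inconclusive" even though the point estimate is positive.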
guardrail-metrics-reference
Pure-reference catalog of guardrail-metric methodology for online controlled experiments. Defines guardrail metrics (metrics that must NOT degrade for an experiment to ship, even if the primary metric improves), the standard guardrail set (latency / errors / engagement / opt-out), the relationship to OEC (Overall Evaluation Criterion) per Kohavi et al., and the trustworthy-experiments framework (Microsoft Experimentation Platform). Use when designing the metric set for a new experiment, auditing existing experiment configs, or reviewing experiment results before ship-decisions. Composes peeking-problem-reference + ab-test-validity-checklist.
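The rule that a guardrail must not degrade even when the primary metric improves can be expressed as a simple ship gate. The sketch below is illustrative only: `GuardrailResult`, its fields, and the tolerance values are hypothetical, and it compares point estimates, whereas a production gate would more likely use confidence intervals or a non-inferiority test.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class GuardrailResult:
    name: str
    relative_change: float   # treatment vs control; positive means the metric went up
    higher_is_worse: bool    # True for latency or error rate, False for engagement
    max_degradation: float   # tolerated relative degradation, e.g. 0.01 for 1%

    def degraded(self) -> bool:
        # Normalise direction so that "harm" is always a positive number.
        harm = self.relative_change if self.higher_is_worse else -self.relative_change
        return harm > self.max_degradation


def ship_decision(primary_improved: bool,
                  guardrails: List[GuardrailResult]) -> Tuple[bool, List[str]]:
    """Ship only if the primary metric improved AND no guardrail degraded."""
    violated = [g.name for g in guardrails if g.degraded()]
    return primary_improved and not violated, violated
```

With these assumptions, an experiment whose primary metric improved but whose p95 latency rose 3% against a 1% tolerance is blocked, and the returned list names the violated guardrail.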
optimizely-test
Wraps Optimizely Feature Experimentation SDK testing patterns: client initialization with a datafile (offline-friendly), the decide / decideAll API (Optimizely Feature Experimentation, the v5 API), forced-decisions for per-test arm pinning, OptimizelyUserContext + activate / track events, and assignment-integrity tests. Use when writing tests for Optimizely-instrumented application code. Composes guardrail-metrics-reference + peeking-problem-reference + ab-test-validity-checklist.
peeking-problem-reference
Pure-reference catalog of the peeking problem in online A/B testing. Defines the problem (repeatedly looking at experiment results inflates the false-positive rate above the declared alpha, because each look is another chance to cross the significance threshold by chance), the canonical mitigations (fixed-horizon test with pre-declared sample size; sequential testing with alpha-spending functions, e.g., O'Brien-Fleming, Pocock; always-valid inference / mSPRT per Johari et al.), and the policy choices (data-peek schedule, stop-early thresholds, decision-time guardrails). Use when designing an experimentation platform's stop-early policy or auditing why a result was declared significant. Composes guardrail-metrics-reference.
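The inflation described above can be seen in a small A/A simulation: both arms draw from the same distribution, so every "significant" result is a false positive. The sketch below compares stopping at the first look that crosses |z| > 1.96 with evaluating only once at the pre-declared horizon. The function name, the number of looks, the batch size, and the seed are all assumptions chosen for illustration.

```python
import math
import random


def false_positive_rates(n_sims: int = 2000, looks: int = 10, batch: int = 100,
                         z_crit: float = 1.96, seed: int = 0):
    """Return (peeking_rate, fixed_horizon_rate) for simulated A/A tests."""
    rng = random.Random(seed)
    peeking_hits = 0
    fixed_hits = 0
    for _sim in range(n_sims):
        diff_sum = 0.0   # running sum of (treatment - control) observations
        n = 0            # observations per arm so far
        crossed = False
        z = 0.0
        for _look in range(looks):
            # Each arm adds `batch` N(0, 1) draws; the difference of the two
            # batch sums is N(0, 2 * batch), so it is sampled directly.
            diff_sum += rng.gauss(0.0, math.sqrt(2.0 * batch))
            n += batch
            # z-statistic for the difference in means with known unit variance.
            z = (diff_sum / n) / math.sqrt(2.0 / n)
            if abs(z) > z_crit:
                crossed = True   # a peeker would have stopped and declared a win
        peeking_hits += crossed
        fixed_hits += abs(z) > z_crit   # only the final look counts
    return peeking_hits / n_sims, fixed_hits / n_sims
```

Under these assumptions the fixed-horizon rate stays near the nominal 5%, while the rate with ten uncorrected looks is several times higher. The alpha-spending and always-valid methods listed above exist to bring that second number back under the declared alpha.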
split-io-test
Wraps Split.io (Harness FME) SDK testing patterns: hermetic localhost/offline mode with an in-memory features map (JavaScript/browser) or a YAML fixture file (Node.js server-side), getTreatment and getTreatmentWithConfig evaluation, the SDK_READY event and whenReady() promise, impression listener verification, sync.impressionsMode configuration, and CI setup. Use when writing tests for application code instrumented with the Split.io or Harness Feature Management & Experimentation SDK.
statsig-test
Wraps Statsig SDK testing patterns: server-side initialization (statsig.initialize with API key), gate / experiment / dynamic-config evaluation (checkGate, getExperiment, getConfig), local-evaluation mode for offline tests, override patterns for forcing a specific user into a specific arm (statsig.overrideGate, overrideConfig), and assignment-integrity tests. Use when writing tests for Statsig-instrumented application code. Composes guardrail-metrics-reference + peeking-problem-reference + ab-test-validity-checklist.
vwo-test
Wraps VWO (Visual Website Optimizer) SDK testing patterns: SDK initialization with the settings file (offline-capable), `getFeatureVariableValue` and `activate` API, force-bucketing for per-test assignment, and assignment-integrity tests against the bucketing algorithm. Use when writing tests for VWO-instrumented application code. Composes guardrail-metrics-reference + peeking-problem-reference + ab-test-validity-checklist.