feature-flag-test-harness
Builds a test harness that runs the same suite under every relevant flag combination - picks the minimum cover (single flags + pairwise interactions where the team marks them, not the full 2^N cartesian product), wires an OpenFeature in-memory provider so the suite never hits the production flag service, runs each combination as its own labeled CI matrix shard, and emits a per-combination result matrix. Use when a feature behind a flag must be verified on AND off (release toggles + experiment toggles per Hodgson) and the team wants those runs deterministic and parallel.
Overview
A test that hits the production flag service is non-deterministic by definition - the answer depends on whoever toggled the flag last. And answering "did we test the feature with the flag off?" requires both runs side by side.
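This skill builds a harness that picks a minimum cover of flag combinations (single flags plus the pairwise interactions the team marks), wires an OpenFeature in-memory provider so the suite never hits the production flag service, runs each combination as its own labeled CI matrix shard, and emits a per-combination result matrix.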
The skill's reference architecture targets OpenFeature because it standardizes the SDK across LaunchDarkly, Flagsmith, ConfigCat, self-hosted, etc. - the harness works identically against any provider (openfeature-overview).
"OpenFeature provides a shared, standardized feature flagging client - an SDK - which can be plugged into various 3rd-party feature flagging providers." (openfeature-overview)
When to use
If the team has only one or two flags and a flat "always on for test" config works, this skill is overkill - set the test environment's flag values once in setup and stop there.
Step 1 - Classify each flag (Hodgson taxonomy)
Per feature-toggles, flags fall into four categories with different test needs:
| Category | Lifespan | Dynamism | Test combinations needed |
|---|---|---|---|
| Release toggle | Days - weeks | Static at deploy | OFF (current) and ON (new behavior). 2 runs. |
| Experiment toggle | Days - weeks | Per-request dynamic | One run per variant (A / B / control). |
| Ops toggle | Long-lived | Runtime reconfiguration | ON (normal) and OFF (degraded / kill). |
| Permissioning toggle | Years | Per-request dynamic | One run per relevant user cohort. |
Per feature-toggles: "Each user of the system is placed into a cohort and at runtime the Toggle Router will consistently send a given user down one codepath or the other." For experiment + permissioning toggles, the test harness simulates the cohort by seeding the EvaluationContext.
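To make "consistently send a given user down one codepath" concrete, here is a minimal sketch of deterministic cohort assignment, assuming a hash-based router. The helper `cohort_for` and its `salt` parameter are hypothetical and are not part of OpenFeature or any provider; real providers apply their own bucketing rules. The point for the harness is only that a seeded targeting key always resolves to the same variant.

```python
import hashlib


def cohort_for(targeting_key: str, variants: list[str], salt: str = "ranking_experiment") -> str:
    """Deterministically map a user key to one variant (hypothetical helper)."""
    if not variants:
        raise ValueError("at least one variant is required")
    # Hash salt + key so the same user can land in different cohorts per experiment.
    digest = hashlib.sha256(f"{salt}:{targeting_key}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % len(variants)
    return variants[bucket]


variants = ["control", "treatment_a", "treatment_b"]
first = cohort_for("user-123", variants)
second = cohort_for("user-123", variants)
print(first == second)  # True: the same key always resolves to the same cohort
```

In a test, the targeting key would be placed in the EvaluationContext so each cohort's code path can be exercised on demand.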
Don't run all 2^N combinations. Instead, the author marks the interactions worth testing:
# tests/flag-matrix.yaml
flags:
  new_checkout: { kind: release, test: [off, on] }
  promo_codes: { kind: release, test: [off, on] }
  ranking_experiment: { kind: experiment, variants: [control, treatment_a, treatment_b] }
  payment_kill_switch: { kind: ops, test: [on, off] } # off = degraded
interactions:
  # The author asserts these flag pairs interact; run their combinations explicitly.
  - [new_checkout, promo_codes]
  # Ranking experiment doesn't interact with checkout; don't bloat the matrix.

The harness enumerates every flag's variants individually plus the listed interaction tuples. Single flags = 2 + 2 + 3 + 2 = 9 runs. The one declared interaction (new_checkout × promo_codes) adds 4 runs. Total: 13 runs, not 24 (2 × 2 × 3 × 2).
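The arithmetic can be checked mechanically. The following is a small sketch, with the variant lists copied from the example matrix; `count_runs` is a hypothetical helper, not part of the harness.

```python
from math import prod

# Variant lists copied from the example tests/flag-matrix.yaml.
flags = {
    "new_checkout": ["off", "on"],
    "promo_codes": ["off", "on"],
    "ranking_experiment": ["control", "treatment_a", "treatment_b"],
    "payment_kill_switch": ["on", "off"],
}
interactions = [["new_checkout", "promo_codes"]]


def count_runs(flags, interactions):
    """Return (single-flag runs, interaction runs, total, full cartesian product)."""
    single = sum(len(v) for v in flags.values())
    paired = sum(prod(len(flags[f]) for f in group) for group in interactions)
    full = prod(len(v) for v in flags.values())
    return single, paired, single + paired, full


print(count_runs(flags, interactions))  # (9, 4, 13, 24)
```

The gap widens quickly: single-flag runs grow with the sum of the variant counts, while the full product grows with their product.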
Step 2 - Wire the OpenFeature in-memory provider
Per openfeature-providers, "Providers are responsible for performing flag evaluations" - the in-memory test provider returns the flag values the test wants.
Node / TypeScript
// tests/harness/flag-harness.ts
import { OpenFeature, InMemoryProvider } from '@openfeature/server-sdk';

// Pin every flag to a single variant so evaluation is deterministic.
export function withFlags(flags: Record<string, unknown>) {
  const provider = new InMemoryProvider(
    Object.fromEntries(
      Object.entries(flags).map(([k, v]) => [k, {
        defaultVariant: 'configured',
        variants: { configured: v },
        disabled: false,
      }]),
    ),
  );
  // Resolves once the provider is ready, so tests never race initialization.
  return OpenFeature.setProviderAndWait(provider);
}

Then in the test setup:
import { withFlags } from './harness/flag-harness';

beforeAll(async () => {
  await withFlags(JSON.parse(process.env.FLAGS_JSON || '{}'));
});

Python
# tests/harness/flag_harness.py
from openfeature.api import set_provider
from openfeature.provider.in_memory_provider import InMemoryProvider, InMemoryFlag


def with_flags(flags: dict):
    # Pin each flag to a single 'configured' variant holding the requested value.
    set_provider(InMemoryProvider({
        k: InMemoryFlag(default_variant='configured',
                        variants={'configured': v})
        for k, v in flags.items()
    }))

Java
import java.util.Map;

import dev.openfeature.sdk.OpenFeatureAPI;
import dev.openfeature.sdk.providers.memory.Flag;
import dev.openfeature.sdk.providers.memory.InMemoryProvider;

@BeforeAll
static void wireFlags() {
    // parseEnv is your own helper: it parses FLAGS_JSON into a Map<String, Flag<?>>.
    Map<String, Flag<?>> flags = parseEnv(System.getenv("FLAGS_JSON"));
    OpenFeatureAPI.getInstance().setProvider(new InMemoryProvider(flags));
}

The application code calls the standard OpenFeature evaluation API (openfeature-eval):
const client = OpenFeature.getClient();
const enabled = await client.getBooleanValue('new_checkout', false);

Per openfeature-eval: "the default value must also be specified ... In the case of any error during flag evaluation, the default value will be returned, so give consideration to your default values!" The harness controls the value the in-memory provider returns; the application's hard-coded default is what runs when flag evaluation fails in production.
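Because a flag that is missing from the in-memory provider falls back to the application's default without failing the test, it can help to validate the shard's flag set before any test runs. This is a minimal sketch, assuming the `FLAGS_JSON` variable used in the setup above. `flags_from_env` and its `expected` parameter are hypothetical names, and unlike the TypeScript setup it deliberately rejects an empty value instead of treating it as `{}`.

```python
import json


def flags_from_env(raw, expected):
    """Parse FLAGS_JSON and fail loudly if the shard is misconfigured."""
    if not raw:
        raise ValueError("FLAGS_JSON is empty; refusing to run on application defaults")
    flags = json.loads(raw)
    if not isinstance(flags, dict):
        raise ValueError("FLAGS_JSON must be a JSON object of flag name -> value")
    # Any expected flag that is absent would silently evaluate to the app default.
    missing = sorted(set(expected) - set(flags))
    if missing:
        raise ValueError(f"flags missing from FLAGS_JSON: {missing}")
    return flags


flags = flags_from_env(
    '{"new_checkout": true, "promo_codes": false}',
    expected=["new_checkout", "promo_codes"],
)
print(flags)  # {'new_checkout': True, 'promo_codes': False}
```

In a pytest suite, the returned dict could be passed to `with_flags` from a session-scoped fixture, reading the raw string from `os.environ.get("FLAGS_JSON")`.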
Step 3 - Generate the combination matrix
A small generator script enumerates the combinations from flag-matrix.yaml:
# scripts/gen-flag-matrix.py
import json
import sys
from itertools import product

import yaml  # PyYAML


def default_for(spec):
    # The first listed variant is the baseline.
    if 'test' in spec:
        return spec['test'][0]
    return spec['variants'][0]


def variants_of(spec):
    return spec.get('test', spec.get('variants', []))


with open(sys.argv[1]) as fh:
    cfg = yaml.safe_load(fh)

# PyYAML follows YAML 1.1: unquoted on/off load as True/False,
# so those names render as flag=True / flag=False.
base = {f: default_for(s) for f, s in cfg['flags'].items()}
combos = []

# Single-flag variants: vary one flag, hold the rest at baseline.
for flag, spec in cfg['flags'].items():
    for variant in variants_of(spec):
        combo = dict(base)
        combo[flag] = variant
        combos.append({'name': f'{flag}={variant}', 'flags': combo})

# Declared interactions: full product over the listed flags only.
for tuple_flags in cfg.get('interactions', []):
    spaces = [variants_of(cfg['flags'][f]) for f in tuple_flags]
    for combo_values in product(*spaces):
        combo = dict(base)
        combo.update(zip(tuple_flags, combo_values))
        combos.append({
            'name': '+'.join(f'{f}={v}' for f, v in zip(tuple_flags, combo_values)),
            'flags': combo,
        })

print(json.dumps(combos, indent=2))

Output: a JSON array of {name, flags} objects, one per CI shard.
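The generator does not emit a dedicated baseline entry, and each flag's first listed variant reproduces the baseline configuration, so several shards run identical flag sets. One possible post-processing step is sketched below, assuming the `{name, flags}` objects the script prints. `with_baseline_deduped` is a hypothetical helper that prepends a baseline and keeps one shard per distinct configuration.

```python
import json


def with_baseline_deduped(combos, baseline_flags):
    """Put the baseline first and keep one shard per distinct flag configuration."""
    ordered = [{"name": "baseline", "flags": dict(baseline_flags)}] + list(combos)
    seen = set()
    unique = []
    for combo in ordered:
        # Canonical JSON makes dicts with the same content compare equal.
        key = json.dumps(combo["flags"], sort_keys=True)
        if key in seen:
            continue
        seen.add(key)
        unique.append(combo)
    return unique


baseline = {"new_checkout": False, "promo_codes": False}
combos = [
    {"name": "new_checkout=False", "flags": {"new_checkout": False, "promo_codes": False}},
    {"name": "new_checkout=True", "flags": {"new_checkout": True, "promo_codes": False}},
    {"name": "promo_codes=False", "flags": {"new_checkout": False, "promo_codes": False}},
    {"name": "promo_codes=True", "flags": {"new_checkout": False, "promo_codes": True}},
]
print([c["name"] for c in with_baseline_deduped(combos, baseline)])
# ['baseline', 'new_checkout=True', 'promo_codes=True']
```

Applied to the example matrix, this would reduce the 13 generated runs to 7 distinct configurations. Whether to deduplicate is a trade-off: fewer shards, but the per-flag row names for baseline-equivalent runs disappear from the result matrix.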
Step 4 - Wire the CI matrix
# .github/workflows/flag-harness.yml
name: flag-harness
on:
  pull_request:
    paths:
      - 'tests/flag-matrix.yaml'
      - 'src/**'
jobs:
  generate:
    runs-on: ubuntu-latest
    outputs:
      combos: ${{ steps.gen.outputs.combos }}
    steps:
      - uses: actions/checkout@v5
      - uses: actions/setup-python@v5
        with: { python-version: '3.12' }
      - run: pip install pyyaml
      - id: gen
        run: |
          # Compact to one line: a multi-line value breaks the key=value output format.
          combos=$(python scripts/gen-flag-matrix.py tests/flag-matrix.yaml | jq -c .)
          echo "combos=$combos" >> "$GITHUB_OUTPUT"
  test:
    needs: generate
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      max-parallel: 8
      matrix:
        combo: ${{ fromJSON(needs.generate.outputs.combos) }}
    name: test (${{ matrix.combo.name }})
    steps:
      - uses: actions/checkout@v5
      - uses: actions/setup-node@v4
        with: { node-version: '20' }
      - run: npm ci
      - run: npm test
        env:
          FLAGS_JSON: ${{ toJSON(matrix.combo.flags) }}

fail-fast: false is load-bearing - when one combination fails, the matrix continues so the team sees every failing combination at once, not just the first.
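To reproduce a single failing shard outside CI, the same environment variable can be set locally. This is a minimal sketch, assuming one `{name, flags}` object from the generator output. `run_combo` is a hypothetical helper, and the command is whatever runs your suite (for example `["npm", "test"]`); a tiny Python one-liner stands in for it here.

```python
import json
import os
import subprocess
import sys


def run_combo(combo, command):
    """Run `command` with FLAGS_JSON set to one combination; return the exit code."""
    env = dict(os.environ)
    env["FLAGS_JSON"] = json.dumps(combo["flags"])
    return subprocess.run(command, env=env).returncode


combo = {"name": "new_checkout=True", "flags": {"new_checkout": True}}
# Stand-in for the real test command: exits 0 only if the flag arrived as expected.
check = [
    sys.executable, "-c",
    "import json, os, sys; "
    "sys.exit(0 if json.loads(os.environ['FLAGS_JSON'])['new_checkout'] else 1)",
]
print(run_combo(combo, check))  # 0
```

Because the shard is fully described by its flag JSON, a failure seen in CI should reproduce locally with the same value.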
Step 5 - Aggregate the result matrix
After the matrix runs, build a single artifact that shows pass/fail per combination:
## Flag harness results — `<sha>`
| Combination | Result | Failures |
|------------------------------------------------------|:------:|-----------------------|
| baseline (all flags = baseline) | ✅ | |
| new_checkout=on | ✅ | |
| new_checkout=off | ✅ | |
| promo_codes=on | ❌ | `checkout.spec.ts:42` |
| ranking_experiment=treatment_a | ✅ | |
| ranking_experiment=treatment_b | ❌ | `cart.spec.ts:18` |
| new_checkout=on + promo_codes=on | ❌ | `checkout.spec.ts:42`, `promo.spec.ts:7` |
| payment_kill_switch=off | ✅ | |

The aggregator reads each shard's JUnit XML, groups by combo name, and emits the table. The Failures column links to the failing test files for quick triage.
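A minimal sketch of such an aggregator follows, assuming each shard uploads one JUnit XML report and that reports are keyed by combo name. `summarize` and `render_table` are hypothetical names. The sketch lists `classname.name` for each failing test case rather than a file path, since not every JUnit writer records one.

```python
import xml.etree.ElementTree as ET


def summarize(junit_xml):
    """Return 'classname.name' identifiers of failed or errored test cases."""
    root = ET.fromstring(junit_xml)
    failed = []
    # iter() finds <testcase> at any depth, covering <testsuites> and <testsuite> roots.
    for case in root.iter("testcase"):
        if case.find("failure") is not None or case.find("error") is not None:
            failed.append(f"{case.get('classname', '')}.{case.get('name', '')}")
    return failed


def render_table(reports):
    """reports maps combo name -> JUnit XML string; returns a Markdown table."""
    lines = ["| Combination | Result | Failures |", "|---|:---:|---|"]
    for name, xml_text in reports.items():
        failed = summarize(xml_text)
        result = "❌" if failed else "✅"
        cells = ", ".join(f"`{f}`" for f in failed)
        lines.append(f"| {name} | {result} | {cells} |")
    return "\n".join(lines)


reports = {
    "baseline": '<testsuite><testcase classname="cart" name="adds item"/></testsuite>',
    "promo_codes=on": (
        '<testsuite><testcase classname="checkout" name="applies promo">'
        '<failure message="expected 90, got 100"/></testcase></testsuite>'
    ),
}
print(render_table(reports))
```

In CI, the `reports` mapping would be built from the downloaded shard artifacts, and the rendered table written to the job summary or posted as a PR comment.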
Step 6 - Pre-merge / nightly cadence
This split keeps PR runtime bounded while still gaining full coverage every 24h.
Anti-patterns
| Anti-pattern | Why it fails | Fix |
|---|---|---|
| Running the full 2^N cartesian product | N=10 flags = 1024 shards; CI bill explodes; most combinations are irrelevant. | Single-flag variants + author-declared interaction tuples (Step 1). |
| Hitting the production OpenFeature provider from tests | Non-deterministic; flaky; depends on whoever toggled last. | InMemoryProvider per openfeature-providers. |
| Hard-coding flag values in the test instead of the harness | Each test re-implements the harness; drift; one test forgets to set a flag. | Centralize in flag-harness.ts/.py/.java; tests just assert behavior. |
| Asserting the flag value in the test (`expect(client.getBooleanValue('new_checkout', false)).toBe(true)`) | Tests the SDK, not the feature. The harness already pinned the value. | Assert the observable behavior the flag controls (DOM state, response shape, log line). |
| `fail-fast: true` on the matrix | First failure cancels all other combos; team has to re-run to see the rest. | `fail-fast: false`. |
| Missing the baseline (all-flags-default) row | Can't tell whether a failure is flag-specific or a regression on default state. | Always emit a baseline combination as combo #1. |
| Treating ranking_experiment variants as a binary on/off | Misses variant-specific bugs (e.g., treatment_b breaks but treatment_a passes). | Enumerate every variant per feature-toggles cohort logic. |