flake-dashboard-author
Builds a persistent flakiness infrastructure dashboard from JUnit XML or JSON CI run history: defines the flake-rate metric (failures per test over a configurable window), authors the data model, generates a Grafana time-series panel JSON or configures a Datadog CI Visibility view, derives the quarantine-candidate query, and wires trend alerts. Use when a team needs a long-lived observability surface for test reliability that outlasts any single weekly report.
Terminology note: "flaky test" is a practitioner-emergent term used in the industry engineering tradition (Google Testing Blog, google-flaky). "Defect," "failure," and "test run" follow ISTQB Glossary v4.7.1 definitions. "Test suite" and "test case" are used per ISTQB as well.
The e2e-test-trend-reporter agent produces a comparable weekly markdown snapshot. This skill builds the persistent infrastructure layer: a live dashboard that accumulates run history and surfaces the flake-rate metric continuously, rather than on demand.
Step 1 - Define the flake-rate metric
The canonical flake-rate formula for a single test T over a window of N runs is:
flake_rate(T, window) = (failed_runs(T, window) + retried_passed_runs(T, window)) / total_runs(T, window)
A run counts as "retried-passed" when the test framework reports it as flaky (passed only after at least one retry). Playwright exposes this in reporter.onTestEnd as test.outcome() === 'flaky'; result.status describes a single attempt and never takes the value 'flaky' (Playwright reporter API). JUnit XML uses a <flakyFailure> (or <flakyError>) element inside <testcase> for a test that passed on a rerun, while <rerunFailure> records a rerun of a test that ultimately failed (Surefire / junit-xml convention; the Jenkins JUnit plugin extension uses the same <flakyFailure> element).
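As a quick arithmetic check of the formula, here is a minimal Python sketch. It assumes each run of a test has already been reduced to one of the status strings used in Step 2 ('passed', 'failed', 'flaky', 'skipped'); the function name is illustrative, and counting skipped runs in the denominator is an assumption that matches the COUNT(*) used in the SQL later on.

```python
from collections import Counter


def flake_rate(statuses):
    """Return (failed + retried-passed) / total for one test's runs in a window.

    `statuses` is an iterable of status strings. Returns None when the window
    holds no runs, because the rate is undefined in that case.
    """
    counts = Counter(statuses)
    total = sum(counts.values())
    if total == 0:
        return None
    return (counts["failed"] + counts["flaky"]) / total


# 17 clean passes, 2 hard failures, 1 retried pass: 3 of 20 runs count as flaky.
window = ["passed"] * 17 + ["failed"] * 2 + ["flaky"]
print(flake_rate(window))  # 0.15
```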
Choose your window at ingestion time. Recommended defaults:
| Team cadence | Window | Minimum runs before showing rate |
|---|---|---|
| Multiple deploys per day | 7 days | 20 |
| One deploy per day | 14 days | 10 |
| Weekly releases | 30 days | 5 |
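The defaults in the table can be applied as a filter plus a guard. The sketch below is one possible shape, not a prescribed implementation: the cadence keys and the function name are hypothetical labels, and runs are assumed to arrive as (started_at, status) pairs with timezone-aware timestamps.

```python
from datetime import datetime, timedelta, timezone

# (window in days, minimum runs before showing a rate), copied from the table.
# The dictionary keys are hypothetical labels for the three cadences.
WINDOW_DEFAULTS = {
    "multiple_deploys_per_day": (7, 20),
    "one_deploy_per_day": (14, 10),
    "weekly_releases": (30, 5),
}


def windowed_flake_rate(runs, cadence, now):
    """Return the flake rate inside the cadence's window, or None if too few runs.

    `runs` is an iterable of (started_at, status) pairs.
    """
    days, min_runs = WINDOW_DEFAULTS[cadence]
    cutoff = now - timedelta(days=days)
    in_window = [status for started_at, status in runs if started_at >= cutoff]
    if len(in_window) < min_runs:
        return None  # not enough samples to show a rate
    bad = sum(1 for status in in_window if status in ("failed", "flaky"))
    return bad / len(in_window)


now = datetime(2024, 1, 31, tzinfo=timezone.utc)
recent = [(now - timedelta(days=d), "failed" if d == 3 else "passed") for d in range(1, 11)]
print(windowed_flake_rate(recent, "one_deploy_per_day", now))  # 0.1
```

Returning None rather than 0 below the minimum keeps "not enough data" distinguishable from "never flaky" on the dashboard.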
Step 2 - Build the data model
Persist one row per test-case execution. Minimum schema:
CREATE TABLE test_runs (
    run_id        TEXT        NOT NULL,
    suite_name    TEXT        NOT NULL,
    test_name     TEXT        NOT NULL,
    status        TEXT        NOT NULL,  -- 'passed' | 'failed' | 'flaky' | 'skipped'
    duration_ms   INTEGER     NOT NULL,
    branch        TEXT,
    commit_sha    TEXT,
    worker_index  INTEGER,
    started_at    TIMESTAMPTZ NOT NULL,
    PRIMARY KEY (run_id, suite_name, test_name)
);
CREATE INDEX idx_test_runs_name_time ON test_runs (test_name, started_at);
status = 'flaky' is the retried-passed value emitted by Playwright retries (Playwright retries) and by the Jenkins JUnit plugin's <flakyFailure> extension. For raw JUnit XML without retry markup, derive flaky during ingestion: when the same run_id + suite_name + test_name appears more than once within one CI run, first as failed and later as passed, collapse the attempts into a single row with status 'flaky'. The collapse has to happen before the insert, because the primary key allows only one row per test per run.
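A minimal sketch of that derivation, done in memory before anything is written to test_runs. It assumes the attempts of one CI run are available as dictionaries in execution order, carrying at least the run_id, suite_name, test_name, and status keys; the function name is hypothetical.

```python
def collapse_attempts(attempts):
    """Collapse repeated attempts into one row per (run_id, suite_name, test_name).

    The last attempt decides the row. If it passed after an earlier failed
    attempt of the same test in the same run, the row is marked 'flaky'.
    """
    grouped = {}
    for attempt in attempts:
        key = (attempt["run_id"], attempt["suite_name"], attempt["test_name"])
        grouped.setdefault(key, []).append(attempt)

    rows = []
    for group in grouped.values():
        final = dict(group[-1])  # copy so the input is left untouched
        earlier = [a["status"] for a in group[:-1]]
        if final["status"] == "passed" and "failed" in earlier:
            final["status"] = "flaky"
        rows.append(final)
    return rows


attempts = [
    {"run_id": "r1", "suite_name": "cart", "test_name": "adds item", "status": "failed"},
    {"run_id": "r1", "suite_name": "cart", "test_name": "adds item", "status": "passed"},
    {"run_id": "r1", "suite_name": "cart", "test_name": "removes item", "status": "passed"},
]
print([row["status"] for row in collapse_attempts(attempts)])  # ['flaky', 'passed']
```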
Populate from JUnit XML using xmllint --xpath:
# Extract per-testcase rows from a JUnit XML report
xmllint --xpath '//testcase' report.xml \
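  | python3 scripts/parse_junit.py --output jsonl >> test_runs.jsonl
Populate from Playwright JSON reporter output (--reporter=json) by iterating suites[].specs[].tests[].results[], recursing into nested suites[] (the test tree sits under the report's top-level suites key, and describe blocks appear as child suites).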
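A sketch of the Playwright traversal, under stated assumptions about the JSON report shape: the test tree sits under a top-level suites key, suites can nest, each tests[] entry carries an overall status of 'expected', 'unexpected', 'flaky', or 'skipped', and each results[] entry carries duration, workerIndex, and startTime. The function name and the status mapping are illustrative, and summing durations across attempts is a choice made for this sketch.

```python
# Map Playwright's per-test outcome onto the status values of the test_runs table.
STATUS_MAP = {
    "expected": "passed",
    "unexpected": "failed",
    "flaky": "flaky",
    "skipped": "skipped",
}


def iter_rows(report, run_id):
    """Yield one test_runs-shaped dict per test found in a Playwright JSON report."""

    def walk(suite, path):
        titles = path + [suite.get("title", "")]
        for spec in suite.get("specs", []):
            for test in spec.get("tests", []):
                results = test.get("results", [])
                last = results[-1] if results else {}
                yield {
                    "run_id": run_id,
                    "suite_name": " > ".join(t for t in titles if t),
                    "test_name": spec.get("title", ""),
                    "status": STATUS_MAP[test["status"]],
                    # Total time spent on the test, retries included.
                    "duration_ms": sum(r.get("duration", 0) for r in results),
                    "worker_index": last.get("workerIndex"),
                    "started_at": results[0].get("startTime") if results else None,
                }
        for child in suite.get("suites", []):
            yield from walk(child, titles)

    for top in report.get("suites", []):
        yield from walk(top, [])
```

When several Playwright projects run the same spec, each project contributes its own tests[] entry, so a real ingester would likely need to fold the project name into suite_name or test_name to keep the primary key unique.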
Step 3 - Author the Grafana panel JSON
The following panel JSON renders a time-series of per-test flake rate over a 14-day rolling window. Paste it into Dashboard JSON model (toolbar Export > Copy JSON) or POST it to the Grafana Dashboard HTTP API (Grafana Dashboard API).
{
  "type": "timeseries",
  "title": "Flake rate per test (14-day rolling)",
  "datasource": { "type": "postgres", "uid": "${DS_POSTGRES}" },
  "gridPos": { "h": 8, "w": 24, "x": 0, "y": 0 },
  "id": 1,
  "targets": [
    {
      "refId": "A",
      "datasource": { "type": "postgres", "uid": "${DS_POSTGRES}" },
      "rawSql": "SELECT date_trunc('day', started_at) AS time, test_name, ROUND(100.0 * SUM(CASE WHEN status IN ('failed','flaky') THEN 1 ELSE 0 END) / COUNT(*), 2) AS flake_rate FROM test_runs WHERE started_at >= NOW() - INTERVAL '14 days' GROUP BY 1, 2 ORDER BY 1",
      "format": "time_series"
    }
  ],
  "fieldConfig": {
    "defaults": {
      "color": { "mode": "palette-classic" },
      "unit": "percent",
      "thresholds": {
        "mode": "absolute",
        "steps": [
          { "color": "green", "value": null },
          { "color": "yellow", "value": 2 },
          { "color": "red", "value": 5 }
        ]
      },
      "custom": {
        "lineWidth": 2,
        "fillOpacity": 10,
        "pointSize": 5,
        "showPoints": "auto",
        "spanNulls": false
      }
    },
    "overrides": []
  },
  "options": {
    "legend": { "displayMode": "table", "placement": "bottom", "calcs": ["lastNotNull", "max"] },
    "tooltip": { "mode": "multi", "sort": "desc" }
  }
}
Key fields per the Grafana time-series panel docs:
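To POST the panel to the Dashboard HTTP API, it has to sit inside a dashboard object. The sketch below only builds the request body and makes no network call. It assumes the POST /api/dashboards/db body shape (a dashboard object plus an overwrite flag) and that the ${DS_POSTGRES} placeholder must be replaced with a real datasource UID when the JSON does not go through the import flow. The function name and the 14-day time range are illustrative choices.

```python
import json


def dashboard_payload(panel, title, datasource_uid, overwrite=False):
    """Wrap a single panel in a request body for POST /api/dashboards/db.

    Every occurrence of the ${DS_POSTGRES} placeholder is replaced with
    `datasource_uid`; the input panel is not modified.
    """
    resolved = json.loads(json.dumps(panel).replace("${DS_POSTGRES}", datasource_uid))
    dashboard = {
        "id": None,   # None asks Grafana to create a new dashboard
        "uid": None,  # None lets Grafana generate a UID
        "title": title,
        "time": {"from": "now-14d", "to": "now"},
        "panels": [resolved],
    }
    return {"dashboard": dashboard, "overwrite": overwrite}


panel = {
    "type": "timeseries",
    "datasource": {"type": "postgres", "uid": "${DS_POSTGRES}"},
}
body = dashboard_payload(panel, "Flakiness overview", "my-postgres-uid")
print(body["dashboard"]["panels"][0]["datasource"]["uid"])  # my-postgres-uid
```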
Step 4 - Configure Datadog CI Visibility (alternative)
If your team uses Datadog, CI Visibility ingests test results natively via the datadog-ci CLI or SDK reporters. The built-in CI Visibility - Tests dashboard tracks Total Flaky Tests (updated every 30 minutes per Datadog flaky test docs).
Datadog applies three tags automatically (Datadog flaky test docs):
Quarantine query in CI Visibility Explorer:
@test.status:fail @test.is_flaky:true
Flakiness rate formula in a Datadog Timeboard widget using the Metrics query editor (CI Visibility emits ci.test.flaky as a count metric):
(count:ci.test.flaky{*} by {test.name}.as_count() /
 count:ci.test.run{*} by {test.name}.as_count()) * 100
Trend alert using a Datadog Monitor:
Step 5 - Quarantine-candidate query
A test becomes a quarantine candidate when its flake rate exceeds the team threshold over the window AND it has enough samples to be statistically meaningful. Recommended SQL query for the data model in Step 2:
SELECT
    test_name,
    suite_name,
    COUNT(*) AS total_runs,
    SUM(CASE WHEN status IN ('failed', 'flaky') THEN 1 ELSE 0 END) AS flaky_runs,
    ROUND(
        100.0 * SUM(CASE WHEN status IN ('failed', 'flaky') THEN 1 ELSE 0 END)
        / NULLIF(COUNT(*), 0), 2
    ) AS flake_rate_pct,
    MAX(started_at) AS last_seen
FROM test_runs
WHERE started_at >= NOW() - INTERVAL '14 days'
GROUP BY test_name, suite_name
HAVING
    COUNT(*) >= 10
    AND ROUND(
        100.0 * SUM(CASE WHEN status IN ('failed', 'flaky') THEN 1 ELSE 0 END)
        / NULLIF(COUNT(*), 0), 2
    ) >= 5
ORDER BY flake_rate_pct DESC;
The HAVING COUNT(*) >= 10 guard prevents a test with 1 run and 1 failure from appearing as 100% flaky. Adjust the minimum run count per your window size using the table in Step 1.
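To see the guard in action without a Postgres instance, the query logic can be exercised against an in-memory SQLite database. This is an adaptation for local experimentation, not the production query: SQLite's datetime('now', '-14 days') stands in for NOW() - INTERVAL '14 days', the table is reduced to the columns the query reads, and the function names and sample data are made up for the demonstration.

```python
import sqlite3

# SQLite adaptation of the quarantine-candidate query (window, minimum runs,
# and threshold are bound as parameters).
QUERY = """
SELECT test_name, suite_name,
       COUNT(*) AS total_runs,
       SUM(CASE WHEN status IN ('failed', 'flaky') THEN 1 ELSE 0 END) AS flaky_runs,
       ROUND(100.0 * SUM(CASE WHEN status IN ('failed', 'flaky') THEN 1 ELSE 0 END)
             / COUNT(*), 2) AS flake_rate_pct
FROM test_runs
WHERE started_at >= datetime('now', ?)
GROUP BY test_name, suite_name
HAVING COUNT(*) >= ?
   AND 100.0 * SUM(CASE WHEN status IN ('failed', 'flaky') THEN 1 ELSE 0 END) / COUNT(*) >= ?
ORDER BY flake_rate_pct DESC
"""


def quarantine_candidates(conn, window_days=14, min_runs=10, threshold_pct=5.0):
    """Return candidate rows from an open SQLite connection."""
    params = (f"-{window_days} days", min_runs, threshold_pct)
    return conn.execute(QUERY, params).fetchall()


def demo_connection():
    """Build an in-memory table with one flaky, one under-sampled, one stable test."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE test_runs (run_id TEXT, suite_name TEXT, test_name TEXT,"
        " status TEXT, started_at TEXT)"
    )
    rows = [(f"r{i}", "checkout", "pays with card", "failed" if i < 2 else "passed")
            for i in range(20)]
    rows.append(("r0", "checkout", "brand new test", "failed"))  # 1 run, 1 failure
    rows += [(f"r{i}", "checkout", "stable test", "passed") for i in range(20)]
    conn.executemany(
        "INSERT INTO test_runs VALUES (?, ?, ?, ?, datetime('now', '-1 days'))", rows
    )
    return conn


print(quarantine_candidates(demo_connection()))
# [('pays with card', 'checkout', 20, 2, 10.0)]
```

The under-sampled test fails 100% of its runs but stays out of the result because it has fewer than 10 runs in the window.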
Hand quarantine candidates to the flaky-test-quarantine skill, which enforces the two-week TTL and renewal cap.
Step 6 - Wire trend alerting in Grafana
Grafana managed alert rules evaluate expressions against your datasource on a configurable schedule (Grafana alert rules).
Steps to create a flake-rate spike alert:
The fieldConfig.defaults.thresholds.steps bands in the panel JSON (green/yellow/red at null/2/5) visually mirror the alert thresholds so on-call engineers see the same boundary lines in the chart that trigger the alert (Grafana time-series thresholds).
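How the bands resolve for a given value can be sketched in a few lines. The function below is an illustration of absolute-mode thresholds as this document uses them, not Grafana's own code: a value takes the color of the last step whose value it meets or exceeds, and the null (None) step acts as the base band. It assumes the steps are listed in ascending order, as they are in the panel JSON.

```python
# The three bands from the panel JSON; None plays the role of JSON null.
STEPS = [
    {"color": "green", "value": None},
    {"color": "yellow", "value": 2},
    {"color": "red", "value": 5},
]


def threshold_color(flake_rate_pct, steps=STEPS):
    """Return the color of the band that a flake rate (in percent) falls into."""
    color = steps[0]["color"]
    for step in steps:
        if step["value"] is None or flake_rate_pct >= step["value"]:
            color = step["color"]
    return color


for rate in (0.5, 2.0, 4.99, 5.0):
    print(rate, threshold_color(rate))
# 0.5 green
# 2.0 yellow
# 4.99 yellow
# 5.0 red
```

A 5% flake rate lands in the red band, which is the same boundary the quarantine-candidate query in Step 5 uses.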
Worked example: bootstrap from a Playwright JSON report
Given pw-results.json (Playwright --reporter=json output):
# 1. Parse into the test_runs table
node scripts/ingest_playwright_json.js pw-results.json \
  --db postgres://localhost/qa_metrics \
  --branch "$CI_BRANCH" \
  --commit "$CI_COMMIT_SHA" \
  --run-id "$CI_RUN_ID"

# 2. Run the quarantine-candidate query and emit a CSV
psql postgres://localhost/qa_metrics \
  -f scripts/quarantine_candidates.sql \
  --csv > "candidates-$(date +%F).csv"

# 3. Import the Grafana dashboard JSON
curl -s -X POST http://grafana:3000/api/dashboards/import \
  -H 'Content-Type: application/json' \
  -u "$GRAFANA_USER:$GRAFANA_PASS" \
  -d @dashboards/flakiness-overview.json
After the first ingestion, the Grafana panel populates immediately for the last 14 days of history that was just loaded. The trend alert begins evaluating on the next 1-minute evaluation cycle.