junit-xml-analysis
Parses JUnit-format XML reports (the de-facto interchange format every CI ingests - Jenkins, GitHub Actions, GitLab, Buildkite, CircleCI) into structured, machine-readable per-suite and per-case metrics tables (passed / failed / errored / skipped, time, classname, message, stack), groups failures by classname for trend analysis, and distinguishes "new failures vs flakes" by cross-referencing rerun elements (`<flakyFailure>`, `<rerunFailure>`). Use when the downstream consumer is a dashboard, script, or aggregator - not when the goal is a human-readable prose summary (use test-run-summary-author for that). Single-run, in-XML aggregation only; for cross-run cross-environment roll-ups, use daily-test-suite-aggregator.
Overview
The "JUnit XML" format is the de-facto schema every CI consumes. It originated with Apache Ant's JUnit task, was widened by Jenkins, and is now emitted by virtually every test runner - pytest, Jest, Mocha, Vitest, Go test (gotestsum), Maven Surefire, Gradle, Newman, Cypress, Playwright (via reporter), RSpec (via formatter), and the rest.
Per llg-junit (the most-cited community schema reference, used by Jenkins's parser):
"Root element: `<testsuites>` (optional if only one suite exists; `<testsuite>` can be the root instead)."
The hierarchy is testsuites → testsuite → testcase, with result child elements (`<failure>`, `<error>`, `<skipped>`) hanging off each testcase.
This skill covers parsing the format, building per-suite and per-case metrics, and the flaky-vs-new distinction via the modern `<rerunFailure>` / `<flakyFailure>` extensions (llg-junit).
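When to use
Use this skill when the downstream consumer is a dashboard, script, or aggregator that needs structured metrics from a single run. For a human-readable prose summary, use test-run-summary-author; for cross-run, cross-environment roll-ups, use daily-test-suite-aggregator.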
Step 1 - Schema overview
Per llg-junit:
| Level | Required attributes | Common attributes |
|---|---|---|
| testsuites | (none required at root) | tests, failures, errors, disabled, time, name |
| testsuite | name, tests | failures, errors, skipped, time, timestamp, hostname, id, package |
| testcase | name, classname | time, assertions, status |
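Each `<testcase>` contains at most one of `<failure>`, `<error>`, or `<skipped>`; a testcase with none of them passed.
Plus optional `<system-out>` and `<system-err>` children holding captured output.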
Critical distinction: per llg-junit, `<failure>` is an assertion failure (the test made a claim that came back false). `<error>` is an exception or crash before the assertion ran. Group them differently in dashboards: errors are usually environment / infra; failures are usually code or fixture drift.
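To make the failure-versus-error split concrete, here is a minimal sketch that parses an inline report and tallies the two result types separately per classname. The sample XML and the function name `tally_by_classname` are illustrative assumptions, not output from any particular runner.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample report, for illustration only.
SAMPLE = """
<testsuites>
  <testsuite name="checkout" tests="3">
    <testcase classname="checkout.PromoTest" name="applyPromo" time="0.12">
      <failure message="expected 10 but was 0">AssertionError</failure>
    </testcase>
    <testcase classname="checkout.DbTest" name="connects" time="0.50">
      <error message="connection refused">ConnectionError</error>
    </testcase>
    <testcase classname="checkout.DbTest" name="reads" time="0.03"/>
  </testsuite>
</testsuites>
"""


def tally_by_classname(xml_text):
    """Count <failure> and <error> results separately for each classname."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    # iter() walks the whole tree, so either root shape works.
    for case in root.iter('testcase'):
        for kind in ('failure', 'error'):
            if case.find(kind) is not None:
                counts[(case.get('classname'), kind)] += 1
    return counts


tally = tally_by_classname(SAMPLE)
```

Keying the counter on `(classname, kind)` keeps the two result types in separate dashboard series, so an infra outage in `checkout.DbTest` does not inflate the assertion-failure trend.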
Step 2 - Parse safely
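For large files (multi-thousand-test suites are common), prefer a streaming parser such as `xml.etree.ElementTree.iterparse`. The example below uses `ET.parse` for brevity, which loads the whole tree into memory. Python:

```python
# scripts/parse_junit.py
import json
import sys
import xml.etree.ElementTree as ET


def parse_junit(path):
    """Yield one metrics dict per <testcase> in the report at `path`."""
    root = ET.parse(path).getroot()
    # The root may be <testsuites> or a single <testsuite>.
    suites = root.findall('testsuite') if root.tag == 'testsuites' else [root]
    for suite in suites:
        for case in suite.findall('testcase'):
            # Compare with None explicitly: an Element without children is falsy,
            # so `find('failure') or find('error')` would skip <failure message="..."/>.
            result = case.find('failure')
            if result is None:
                result = case.find('error')
            yield {
                'suite': suite.get('name'),
                'classname': case.get('classname'),
                'name': case.get('name'),
                'time': float(case.get('time') or 0),
                'status': classify(case),
                'failure_message': result.get('message') if result is not None else None,
            }


def classify(case):
    """Map a <testcase> element to its final status."""
    if case.find('failure') is not None:
        return 'failure'
    if case.find('error') is not None:
        return 'error'
    if case.find('skipped') is not None:
        return 'skipped'
    return 'pass'


if __name__ == '__main__':
    json.dump(list(parse_junit(sys.argv[1])), sys.stdout, indent=2)
```

Node: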
```js
import { XMLParser } from 'fast-xml-parser';
import { readFileSync } from 'node:fs';

// Normalize "absent", "single object", and "array" into an array.
const toArray = (value) => (value == null ? [] : Array.isArray(value) ? value : [value]);

const path = process.argv[2]; // report path from the command line
const parser = new XMLParser({ ignoreAttributes: false, attributeNamePrefix: '@_' });
const xml = parser.parse(readFileSync(path, 'utf8'));

// The root may be <testsuites> or a single <testsuite>.
const suites = xml.testsuites ? toArray(xml.testsuites.testsuite) : toArray(xml.testsuite);
for (const suite of suites) {
  const cases = toArray(suite.testcase);
  // ...
}
```

Always handle both shapes: the root may be `<testsuites>` or `<testsuite>` per llg-junit. Single-element collapsing (one testsuite/testcase = bare object, multiple = array) is also common in JS XML libs, and a suite with no testcases yields no `testcase` key at all.
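For reports too large to hold in memory, the same per-case records can be produced with the standard library's `iterparse`. The sketch below is one possible shape, not a drop-in replacement: the name `iter_cases` is hypothetical, and it assumes suites are not nested inside one another.

```python
import io
import xml.etree.ElementTree as ET

RESULT_TAGS = ('failure', 'error', 'skipped')


def iter_cases(source):
    """Stream one dict per <testcase> from a path or binary file-like object."""
    suite_name = None
    for event, elem in ET.iterparse(source, events=('start', 'end')):
        if event == 'start' and elem.tag == 'testsuite':
            # Attributes are already populated when the start event fires.
            suite_name = elem.get('name')
        elif event == 'end' and elem.tag == 'testcase':
            # First matching result child wins; no match means the test passed.
            status = next((t for t in RESULT_TAGS if elem.find(t) is not None), 'pass')
            yield {
                'suite': suite_name,
                'classname': elem.get('classname'),
                'name': elem.get('name'),
                'time': float(elem.get('time') or 0),
                'status': status,
            }
            # Release the case's children (messages, stack traces) once emitted.
            elem.clear()


# Usage with an in-memory report; a file path works the same way.
report = io.BytesIO(
    b'<testsuite name="unit" tests="2">'
    b'<testcase classname="a.B" name="ok" time="0.01"/>'
    b'<testcase classname="a.B" name="bad" time="0.20"><failure message="boom"/></testcase>'
    b'</testsuite>'
)
cases = list(iter_cases(report))
```

The emptied `<testcase>` elements stay attached to their parent, so memory still grows slowly with the number of cases; the large payloads (messages and stack traces) are what `clear()` frees.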
Step 3 - Distinguish new failures from flakes
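Per llg-junit, the schema "supports modern variants including `<flakyFailure>`, `<flakyError>`, `<rerunFailure>`, and `<rerunError>` elements for additional test run metadata."
When the runner does automatic retries (Maven Surefire's `rerunFailingTestsCount`, pytest-rerunfailures, etc.), a test that failed and then passed on a retry carries `<flakyFailure>` / `<flakyError>` children with no final `<failure>` or `<error>`, while a test that failed every attempt carries a final `<failure>` / `<error>` plus `<rerunFailure>` / `<rerunError>` children for the failed retries.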
Classification:
```python
def reliability(case):
    has_flaky = case.find('flakyFailure') is not None or case.find('flakyError') is not None
    has_rerun = case.find('rerunFailure') is not None or case.find('rerunError') is not None
    has_final = case.find('failure') is not None or case.find('error') is not None
    if has_flaky and not has_final:
        return 'flaky'  # passed on retry
    if has_rerun and has_final:
        return 'consistently_failing'
    if has_final:
        return 'newly_failed'
    return 'pass'
```

Surface flaky tests in a separate report - they're noise to the PR author but signal to the test-suite owner.
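A separate flaky report can be as small as a sorted list of the tests that passed only on retry. The sketch below assumes a report loosely modeled on what a rerun-enabled runner writes; the sample XML and the name `flaky_report` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical report with one flaky test, one consistent failure, one pass.
SAMPLE = """
<testsuite name="checkout" tests="3">
  <testcase classname="checkout.PromoTest" name="applyPromo" time="0.4">
    <flakyFailure message="timeout"/>
  </testcase>
  <testcase classname="checkout.CartTest" name="addItem" time="0.2">
    <failure message="stock was -1"/>
    <rerunFailure message="stock was -1"/>
    <rerunFailure message="stock was -1"/>
  </testcase>
  <testcase classname="checkout.CartTest" name="removeItem" time="0.1"/>
</testsuite>
"""

FLAKY_TAGS = ('flakyFailure', 'flakyError')
FINAL_TAGS = ('failure', 'error')


def flaky_report(xml_text):
    """Return (classname, name, failed_attempts) for tests that passed only on retry."""
    root = ET.fromstring(xml_text)
    rows = []
    for case in root.iter('testcase'):
        failed_attempts = sum(len(case.findall(tag)) for tag in FLAKY_TAGS)
        has_final = any(case.find(tag) is not None for tag in FINAL_TAGS)
        if failed_attempts and not has_final:
            rows.append((case.get('classname'), case.get('name'), failed_attempts))
    # Most-retried tests first.
    return sorted(rows, key=lambda row: row[2], reverse=True)


flaky = flaky_report(SAMPLE)
```

Only `applyPromo` appears in `flaky`; `addItem` has a final `<failure>` and belongs in the regular failure list instead.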
Step 4 - Aggregate per-suite metrics
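```python
from collections import defaultdict


def per_suite(cases):
    """Sum status counts and total time for each suite."""
    agg = defaultdict(lambda: {'pass': 0, 'failure': 0, 'error': 0, 'skipped': 0, 'flaky': 0, 'time': 0.0})
    for c in cases:
        agg[c['suite']][c['status']] += 1
        agg[c['suite']]['time'] += c['time']
        # 'reliability' is optional: set it from reliability() in Step 3 to count flakes.
        if c.get('reliability') == 'flaky':
            agg[c['suite']]['flaky'] += 1
    return agg
```

Step 5 - Trend analysis (cross-run)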
To detect "is this a new failure or has this test been failing for a week?", store every run's parsed metrics in a per-suite history file:

```json
{"sha":"abc123","ts":"2026-05-05T14:00:00Z","suite":"checkout","failure":2,"flaky":1,"time":12.4}
{"sha":"def456","ts":"2026-05-05T14:30:00Z","suite":"checkout","failure":2,"flaky":0,"time":12.1}
```

Compare by suite + classname:
| classname | name | last 5 runs result | first failed sha |
|---|---|---|---|
| cart.CartTest | addItem_validatesStock | F F F F F | abc123 (5 days ago) |
| checkout.PromoTest | applyPromo_caseInsensitive | P P P P F | this PR (suspected regression) |
The first row is a stale failure; the second is a probable regression.
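The stale-versus-regression call in the table can be derived mechanically from a test's recent results. The sketch below assumes per-case history is available as `(sha, status)` pairs ordered oldest first; that record shape, the function names, the placeholder SHAs, and the `stale_after` threshold are all hypothetical.

```python
FAILING = ('failure', 'error')


def failure_streak(runs):
    """Return (streak_length, first_failed_sha) for the current run of failures.

    `runs` is a list of (sha, status) pairs, oldest first.
    """
    streak = 0
    first_sha = None
    # Walk back from the newest run until a non-failing result appears.
    for sha, status in reversed(runs):
        if status not in FAILING:
            break
        streak += 1
        first_sha = sha
    return streak, first_sha


def label(runs, stale_after=2):
    """Label a test from its streak; `stale_after` is an assumed threshold."""
    streak, _ = failure_streak(runs)
    if streak == 0:
        return 'passing'
    return 'stale failure' if streak >= stale_after else 'probable regression'


# The two rows of the table above, with placeholder SHAs.
cart = [('abc123', 'failure'), ('s2', 'failure'), ('s3', 'failure'),
        ('s4', 'failure'), ('s5', 'failure')]
promo = [('abc123', 'pass'), ('s2', 'pass'), ('s3', 'pass'),
         ('s4', 'pass'), ('pr-head', 'failure')]
```

Here `failure_streak(cart)` points back to `abc123` as the first failing commit, while `promo` has a streak of one that starts at the PR head, which is what marks it as a suspected regression.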
Step 6 - Per-case slow-test list
Sort testcases by time descending. The top 1% is the fast feedback target - moving any one of them from 30s → 3s saves more than refactoring a hundred tests that already run in <100ms.
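A minimal sketch of that list, assuming case dicts with a `time` key like those produced in Step 2. The name `slowest_cases` and the `fraction` parameter are hypothetical; cases with `time == 0` are dropped because some runners report zero for anything below their timer granularity.

```python
import math


def slowest_cases(cases, fraction=0.01):
    """Return the slowest `fraction` of measured cases, slowest first."""
    # Treat time == 0 as "not measured" rather than "instant".
    measured = sorted((c for c in cases if c['time'] > 0),
                      key=lambda c: c['time'], reverse=True)
    if not measured:
        return []
    # Keep at least one case so small suites still produce a list.
    keep = max(1, math.ceil(len(measured) * fraction))
    return measured[:keep]


sample = [
    {'name': 'checkout_e2e', 'time': 30.0},
    {'name': 'parse_ok', 'time': 0.05},
    {'name': 'unmeasured', 'time': 0.0},
    {'name': 'db_migrate', 'time': 3.0},
    {'name': 'render', 'time': 0.08},
]
top_default = slowest_cases(sample)
top_half = slowest_cases(sample, fraction=0.5)
```

With only four measured cases the default 1% rounds up to a single entry; raising `fraction` widens the list.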
Step 7 - CI integration
```yaml
# .github/workflows/test-analytics.yml (steps excerpt)
- name: Run tests (any framework, JUnit XML reporter enabled)
  run: npm test -- --reporters=default --reporters=jest-junit
  env:
    JEST_JUNIT_OUTPUT_FILE: junit.xml
- name: Analyze JUnit XML
  if: always()
  run: python scripts/parse_junit.py junit.xml > analytics.json
- name: Upload analytics
  if: always()
  uses: actions/upload-artifact@v4
  with:
    name: junit-analytics
    path: |
      junit.xml
      analytics.json
```

`if: always()` is critical: by default a failed test step causes later steps to be skipped, and JUnit XML matters most on failed runs.
Anti-patterns
| Anti-pattern | Why it fails | Fix |
|---|---|---|
| Treating `<error>` and `<failure>` as the same | Errors are usually infra (DB connection lost), failures are usually code. Conflating hides root-cause patterns. | Group separately per llg-junit. |
| Dropping `<flakyFailure>` reports from the dashboard | Hidden flake budget; quality erodes silently. | Surface flaky tests on a separate panel; assign owner. |
| Loading multi-MB XML with `xml.dom.minidom.parseString` | Whole-tree-in-memory. OOM on large suites. | `xml.etree.ElementTree.iterparse` for streaming. |
| Failing the build on any `<skipped>` count > 0 | Many runners legitimately skip (platform-gated, conditional). | Skip is informational; only fail on failure / error. |
| Hardcoding `<testsuites>` as the root | Some runners emit a single `<testsuite>` as the root. | Detect both shapes (Step 2). |
| Trusting `time` for sub-millisecond tests | Some runners emit 0 for any test under their granularity; sort breaks. | Treat `time = 0` as "not measured"; don't include in slow-test list. |
| Cross-suite aggregation by name alone | Two suites can each have an `it('renders')`; merging them false-flags both. | Always group by (classname, name) tuple. |
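As an illustration of the last row, the sketch below groups statuses by the `(classname, name)` tuple so that two tests sharing a name stay separate. The case dicts and the name `group_results` are hypothetical.

```python
from collections import defaultdict


def group_results(cases):
    """Group statuses by (classname, name) so same-named tests stay separate."""
    grouped = defaultdict(list)
    for c in cases:
        grouped[(c['classname'], c['name'])].append(c['status'])
    return dict(grouped)


cases = [
    {'classname': 'Header', 'name': 'renders', 'status': 'pass'},
    {'classname': 'Footer', 'name': 'renders', 'status': 'failure'},
    {'classname': 'Footer', 'name': 'renders', 'status': 'failure'},
]
grouped = group_results(cases)
```

Grouping by `name` alone would have merged all three results under `renders` and flagged the passing `Header` test along with the failing `Footer` one.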