test-pyramid-balancer
Build-an-X workflow that analyzes a repo's test mix (unit / integration / E2E counts + runtimes) and recommends rebalancing toward Cohn's pyramid ratios per the change-set shape - pure-logic-heavy repo wants ~80/15/5; UI-heavy repo wants ~60/25/15. Detects "ice-cream cone" (E2E-heavy) and "hourglass" (integration-thin) anti-patterns. Use quarterly to keep the test mix calibrated to the codebase reality.
Overview
Per test-pyramid:
"you should have many more low-level UnitTests than high level BroadStackTests running through a GUI."
The pyramid model has three layers (unit / service / UI); the ratio between them depends on the codebase. A repo with mostly pure-logic changes wants more unit tests; a repo with mostly UI changes legitimately needs more E2E.
The anti-patterns ("ice-cream cone" - heavy E2E + thin unit, or "hourglass" - heavy unit + heavy E2E + thin integration) accumulate silently. This skill makes them visible.
When to use
Step 1 - Inventory current test mix
Per-language adapters classify each test file by path first, then by content hints:
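# scripts/test-mix-inventory.py
import glob
import os
import re

EXTENSIONS = ('.js', '.ts', '.py', '.kt', '.java', '.rb', '.go')
TEST_FILE = re.compile(r'(test|spec)\.|test_|_test\.')
# Word boundaries keep calls such as submit( or wait( from counting as it(.
TEST_CASE = re.compile(r'\b(?:test|it)\(|\bdef test_')

def classify(path, content):
    # Normalise so a top-level 'e2e/login.spec.ts' also matches '/e2e/'.
    path = '/' + path.replace(os.sep, '/')
    if any(s in path for s in ['/playwright/', '/cypress/', '/selenium/', '/e2e/']):
        return 'e2e'
    if any(s in content for s in ['playwright', 'cypress', 'selenium-webdriver']):
        return 'e2e'
    if any(s in path for s in ['/integration/', '/it/']):
        return 'integration'
    if 'testcontainers' in content or 'WebApplicationFactory' in content:
        return 'integration'
    return 'unit'  # default

mix = {'unit': 0, 'integration': 0, 'e2e': 0}
# glob does not expand {js,ts,...} braces, so filter by extension instead.
for path in glob.glob('**/*', recursive=True):
    if not (os.path.isfile(path) and path.endswith(EXTENSIONS)):
        continue
    if not TEST_FILE.search(path):
        continue
    with open(path, encoding='utf-8', errors='replace') as f:
        content = f.read()
    layer = classify(path, content)
    mix[layer] += len(TEST_CASE.findall(content))
print(mix)
Step 2 - Inventory current runtime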
# Per-layer runtime (one-off measurement)
time npm test # unit + integration via Jest
time npx playwright test # E2E
# OR via JUnit XML aggregation per [`junit-xml-analysis`](../../qa-test-reporting/skills/junit-xml-analysis/SKILL.md)
Step 3 - Compare to ideal ratios
Per test-pyramid: the right ratio depends on the codebase. Defaults:
| Predominant change shape | Recommended (unit / int / e2e) | Notes |
|---|---|---|
| Pure-logic-heavy | 80 / 15 / 5 | Algorithms, data transforms, calculations. |
| Service-layer-heavy | 70 / 25 / 5 | APIs, microservices, repos. |
| UI-heavy | 60 / 25 / 15 | SPAs, mobile apps; UI is the product. |
| Data-heavy | 60 / 30 / 10 | + dedicated data quality suite. |
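The comparison itself is simple arithmetic. A minimal sketch, assuming the Step 1 counts are available as a dict: the names `TARGETS`, `shares`, and `gaps`, and the shape keys used as dictionary labels, are illustrative and not part of the skill's scripts. Only the ratios are taken from the table above.

```python
# Target ratios copied from the table above (unit / integration / e2e, in %).
# The shape keys are hypothetical labels chosen for this sketch.
TARGETS = {
    "pure-logic": {"unit": 80, "integration": 15, "e2e": 5},
    "service-layer": {"unit": 70, "integration": 25, "e2e": 5},
    "ui": {"unit": 60, "integration": 25, "e2e": 15},
    "data": {"unit": 60, "integration": 30, "e2e": 10},
}


def shares(mix):
    """Convert raw test counts into percentages of the total."""
    total = sum(mix.values())
    if total == 0:
        return {layer: 0.0 for layer in mix}
    return {layer: 100.0 * n / total for layer, n in mix.items()}


def gaps(mix, shape):
    """Percentage-point gap per layer; positive means over target."""
    current = shares(mix)
    target = TARGETS[shape]
    return {layer: round(current[layer] - target[layer], 1) for layer in target}


if __name__ == "__main__":
    # Example counts; replace with the dict printed by the Step 1 inventory.
    print(gaps({"unit": 30, "integration": 10, "e2e": 60}, "service-layer"))
```

A large positive gap on `e2e` or a large negative gap on `integration` is the signal that Step 4 turns into a named verdict.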
To detect change shape: walk the last 90 days of git log and classify each PR's primary impact via path heuristic (per test-architect Mode 1 Step 2).
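One way to sketch this detection, assuming `git` is on the PATH and that individual commits stand in for PRs. The path patterns in `SHAPE_PATTERNS` are hypothetical placeholders: the actual heuristic lives in test-architect Mode 1 Step 2 and is not reproduced here. The function names are likewise illustrative.

```python
import re
import subprocess
from collections import Counter

# Hypothetical path patterns; adapt them to the repo's layout.
SHAPE_PATTERNS = [
    ("ui", re.compile(r"(^|/)(components|views|pages|ui)/|\.(tsx|jsx|vue|css)$")),
    ("data", re.compile(r"(^|/)(migrations|etl|pipelines)/|\.sql$")),
    ("service-layer", re.compile(r"(^|/)(api|controllers|services|handlers)/")),
]


def shape_of(path):
    """Return the first matching shape, defaulting to pure-logic."""
    for shape, pattern in SHAPE_PATTERNS:
        if pattern.search(path):
            return shape
    return "pure-logic"


def primary_shape(paths):
    """Classify one change by the shape most of its files fall into."""
    counts = Counter(shape_of(p) for p in paths)
    return counts.most_common(1)[0][0] if counts else None


def change_shapes(since="90 days ago"):
    """Tally the primary shape of each commit in the window."""
    # %x00 makes git emit a NUL byte per commit, used here as a separator.
    log = subprocess.run(
        ["git", "log", f"--since={since}", "--name-only", "--pretty=format:%x00"],
        capture_output=True, text=True, check=True,
    ).stdout
    tally = Counter()
    for chunk in log.split("\x00"):
        paths = [line for line in chunk.splitlines() if line.strip()]
        shape = primary_shape(paths)
        if shape:  # commits that list no files (e.g. merges) are skipped
            tally[shape] += 1
    return tally
```

Calling `change_shapes()` inside a repository returns a `Counter` of shapes; the most common entry picks the row of the table to compare against.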
Step 4 - Detect anti-patterns
Ice-cream cone (E2E-heavy)
Current: 30 unit / 10 integration / 60 E2E
Recommended: 70 / 25 / 5
Verdict: ICE-CREAM CONE (60% E2E vs target 5%)
Symptoms:
Fix: identify E2E tests that test pure logic; rewrite at the unit layer.
Hourglass (integration-thin)
Current: 200 unit / 8 integration / 30 E2E
Recommended: 75 / 20 / 5
Verdict: HOURGLASS (3% integration vs target 20%)
Symptoms:
Fix: add integration tests covering the cross-module seams that unit tests can't reach and E2E tests catch too late.
Inverted pyramid
Current: 50 unit / 100 integration / 80 E2E
Recommended: 75 / 20 / 5
Verdict: INVERTED PYRAMID (unit is the thinnest layer at 22%; integration and E2E together make up 78%)
Symptoms:
Fix: aggressive layer-down - move tests to lower layers where they catch the same bugs faster.
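The three verdicts above can be approximated with a small rule set. This is a sketch under stated assumptions: the function name `detect_anti_pattern` and the rules themselves (E2E is the largest layer; unit is the smallest layer; integration is below E2E and under half its target) are illustrative choices tuned to reproduce the three examples, not thresholds defined by this skill.

```python
def shares(mix):
    """Convert raw test counts into percentages of the total."""
    total = sum(mix.values())
    if total == 0:
        return {layer: 0.0 for layer in mix}
    return {layer: 100.0 * n / total for layer, n in mix.items()}


def detect_anti_pattern(mix, target):
    """Return a verdict string, or None when no rule fires.

    mix:    raw counts, e.g. {"unit": 200, "integration": 8, "e2e": 30}
    target: target percentages, e.g. {"unit": 75, "integration": 20, "e2e": 5}
    """
    s = shares(mix)
    unit, integ, e2e = s["unit"], s["integration"], s["e2e"]
    # Assumed rule: E2E is the single largest layer.
    if e2e > unit and e2e > integ:
        return "ICE-CREAM CONE"
    # Assumed rule: unit is the single smallest layer.
    if unit < integ and unit < e2e:
        return "INVERTED PYRAMID"
    # Assumed rule: integration is thinner than E2E and under half its target.
    if integ < e2e and integ < target["integration"] / 2:
        return "HOURGLASS"
    return None


if __name__ == "__main__":
    target = {"unit": 75, "integration": 20, "e2e": 5}
    for mix in (
        {"unit": 30, "integration": 10, "e2e": 60},
        {"unit": 200, "integration": 8, "e2e": 30},
        {"unit": 50, "integration": 100, "e2e": 80},
    ):
        print(mix, "->", detect_anti_pattern(mix, target))
```

The rule order matters: an E2E-dominated suite is reported as an ice-cream cone even when unit is also the smallest layer.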
Step 5 - Recommend specific changes
Output a stack-ranked list of layer-changes:
## Test pyramid analysis — `<repo>`
**Date:** YYYY-MM-DD **Last 90 days commits classified:** 142
### Current
| Layer | Tests | % of total | Avg runtime | Cost factor |
|-------------|------:|-----------:|------------:|------------:|
| Unit | 840 | 59% | 12 ms | 1× |
| Integration | 98 | 7% | 1.4 s | 3× |
| E2E | 485 | 34% | 8.2 s | 10× |
**Verdict:** ICE-CREAM CONE — E2E % (34) far exceeds target (5).
### Recommended (per change shape: 70% service-layer, 30% pure-logic)
| Layer | Target % | Target tests | Δ |
|--------------|---------:|-------------:|----------|
| Unit | 75% | ~1100 | +260 |
| Integration | 20% | ~290 | +192 |
| E2E | 5% | ~75 | -410 (!) |
### Top recommendations
1. **Identify E2E tests testing pure logic** (`grep -rlE "expect.*\b(calculate|format|parse)" e2e/`).
Likely candidates: 80-120 tests. Move to unit layer.
2. **Identify E2E tests testing service-layer integration**. Move to
integration layer with [`testcontainers`](../../qa-test-environment/skills/testcontainers/SKILL.md).
3. **Review the remaining 75 E2E tests** for hero-flow coverage. If they
cover 5-10 distinct critical journeys, the suite is healthy.
### Estimated impact
- CI time: ~38 min → ~17 min (cost units 5,984 → 2,720 per the cost-factor column).
- Flake rate: typical reduction 50-70% (E2E dominates flake).
- Per-PR feedback: <5 min for unit + integration (vs current 15 min).
Step 6 - Cadence
| Cadence | Trigger |
|---|---|
| Quarterly | Scheduled review. |
| After major refactor | Re-inventory; ratios may have shifted. |
| New team owner | Inherit the test-mix; understand it. |
| Sprint with E2E-heavy ship | Spot-check; don't tilt the pyramid. |
Anti-patterns
| Anti-pattern | Why it fails | Fix |
|---|---|---|
| One-size-fits-all pyramid recommendation | Per test-pyramid: ratios depend on the codebase. | Tune per change shape (Step 3). |
| Path-based-only classification | A "unit" test that hits a real DB is actually integration; misclassified. | Read content for hint signals (Step 1 example). |
| Recommending E2E count = 0 | Some critical journeys need E2E; removing it entirely leaves them untested. | Hero-flow E2E (5-15) is the floor (Step 5 example). |
| One-shot recommendation with no migration plan | Team doesn't know where to start. | Stack-ranked list with specific candidates (Step 5). |
| Quarterly review without follow-up | Recommendations don't ship. | Track action items in tracker; review next quarter. |
| Treating runtime as the only signal | A 0.5-sec E2E test is fine; a 0.5-sec unit test that ran 1000× isn't. | Layer ratio + runtime + flake rate together. |
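To illustrate the last row, the sketch below puts count share next to serial runtime share for each layer, using the counts and average runtimes from the Step 5 example table. The function name `layer_report` is hypothetical, and flake rate is left out because no per-layer flake data appears in this document.

```python
def layer_report(layers):
    """layers maps layer name -> (test count, average runtime in seconds)."""
    total_tests = sum(n for n, _ in layers.values())
    total_time = sum(n * avg for n, avg in layers.values())
    if total_tests == 0 or total_time == 0:
        return {}
    report = {}
    for name, (n, avg) in layers.items():
        report[name] = {
            "count_pct": round(100 * n / total_tests, 1),
            "runtime_pct": round(100 * n * avg / total_time, 1),
        }
    return report


# Counts and average runtimes from the Step 5 example table.
example = {
    "unit": (840, 0.012),
    "integration": (98, 1.4),
    "e2e": (485, 8.2),
}

if __name__ == "__main__":
    for name, row in layer_report(example).items():
        print(name, row)
```

With these inputs, E2E holds about a third of the tests but well over 90% of the serial runtime, which is why neither number should be read alone.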