visual-diff-summarizer
Builds a per-PR visual-diff summary that separates intentional from incidental changes across snapshots emitted by Percy, Chromatic, Playwright `toHaveScreenshot`, Storybook test-runner, and other visual testing tools. It groups diffs by component or route, separates "intent-aligned with PR scope" from "cascade / regression suspect", surfaces baseline-update recommendations, and emits a single PR comment that points the reviewer at the screenshots that need human review. Use when a PR has 20+ visual diffs and the reviewer needs help deciding which ones to open.
## Overview
A 50-diff visual review is a recipe for diff blindness. The reviewer either rubber-stamps everything or starts skipping. Both fail.
This skill turns "50 diffs" into "3 components changed as the PR intended; 1 component changed unexpectedly - focus here".
The intent-vs-diff classification mirrors the logic used by `golden-file-manager` and `visual-diff-classifier`: a wrong-but-consistent visual baseline is worse than no baseline at all.
## When to use
If the PR has 1-2 diffs in scoped files, this skill is overkill; the reviewer can open them directly. The value compounds at scale.
## Step 1 - Pick the upstream tool's output
Each tool exposes per-snapshot diff data via API or local artifact:
| Tool | Where the diff data lives |
|---|---|
| Percy (BrowserStack) | Build API: `GET /api/v1/builds/<id>/snapshots`; per-snapshot `diff_ratio`. |
| Chromatic | `chromatic --dry-run` JSON; `--exit-zero-on-changes` build summary. |
| Playwright `toHaveScreenshot` | Test reporter output; failed expectations include attached image diffs. |
| Storybook test-runner | Per-story coverage diff via `@storybook/test-runner`'s snapshot mode. |
| Loki / BackstopJS | JSON report with per-scenario `misMatchPercentage`. |
The upstream tool wrappers in qa-visual-regression (percy-visual-regression-testing, chromatic-visual-regression-testing, playwright-snapshots, storybook-visual-regression-testing) cover the per-tool integration. This skill is downstream - it consumes their output.
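As a rough illustration of consuming one of these artifacts, the sketch below reads a BackstopJS-style JSON report and pulls out the failed scenarios with their mismatch values. The `BackstopReport` shape is a trimmed assumption, and `RawDiffRow` and `readBackstopReport` are hypothetical names; check the report your tool version actually writes before relying on it.

```typescript
// Assumed, trimmed shape of a BackstopJS-style report (verify against your tool's output).
interface BackstopReport {
  tests: Array<{
    status: 'pass' | 'fail';
    pair: {
      label: string;                 // scenario label, e.g. "Button/with-icon"
      viewportLabel?: string;        // e.g. "phone"
      diff?: { misMatchPercentage: string | number };
    };
  }>;
}

// Hypothetical intermediate row; Step 2 defines the full normalized shape.
interface RawDiffRow {
  storyOrPage: string;
  variant: string | undefined;
  diffRatio: number;                 // 0..1
}

function readBackstopReport(report: BackstopReport): RawDiffRow[] {
  const rows: RawDiffRow[] = [];
  for (const t of report.tests) {
    if (t.status !== 'fail') continue; // only changed scenarios matter here
    const pct = Number(t.pair.diff?.misMatchPercentage ?? NaN);
    rows.push({
      storyOrPage: t.pair.label,
      variant: t.pair.viewportLabel,
      // A failed scenario without a usable percentage is treated as fully
      // changed so it is not silently dropped from the summary.
      diffRatio: Number.isFinite(pct) ? pct / 100 : 1,
    });
  }
  return rows;
}
```

Converting the percentage to a 0..1 ratio at the boundary keeps every downstream step on a single unit.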
## Step 2 - Normalize to per-snapshot rows
```typescript
interface SnapshotDiff {
  tool: 'percy' | 'chromatic' | 'playwright' | 'storybook' | 'loki';
  storyOrPage: string;        // e.g. "Button/with-icon" or "/checkout"
  componentOrRoute: string;   // derived: "Button" or "checkout"
  variant?: string;           // e.g. viewport, theme, hover state
  beforeImageUrl?: string;
  afterImageUrl?: string;
  diffImageUrl?: string;
  diffRatio: number;          // 0..1, fraction of pixels that changed
  baselineSnapshotId?: string;
}
```
Cluster keys:
```typescript
function clusterKey(d: SnapshotDiff): string {
  return d.componentOrRoute; // "Button", "checkout", etc.
}
```
## Step 3 - Read PR intent
The PR's title, description, and labels are the stated intent signal. Pull them via `gh pr view`:
```bash
gh pr view --json title,body,labels,files,headRefOid
```
Extract structural intent:
```typescript
interface Intent {
  title: string;
  body: string;
  labels: string[];
  changedFiles: string[];          // src paths
  // Derived:
  scopeKeywords: string[];         // e.g. ["Button", "checkout", "design-tokens"]
  changedComponents: Set<string>;  // mapped from changedFiles
}
```
`scopeKeywords` extraction:
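One possible heuristic, sketched under stated assumptions: match a list of known component or scope names against the title, body, and labels as whole words, and map changed file paths to components by directory. The `GhPrView` shape is a trimmed assumption about the `gh pr view --json` output, and `knownComponents`, `componentFromPath`, and the `components/<Name>/` path convention are hypothetical; adapt them to the repository layout.

```typescript
// Assumed, trimmed shape of `gh pr view --json title,body,labels,files`.
interface GhPrView {
  title: string;
  body: string;
  labels: Array<{ name: string }>;
  files: Array<{ path: string }>;
}

interface Intent {
  title: string;
  body: string;
  labels: string[];
  changedFiles: string[];
  scopeKeywords: string[];
  changedComponents: Set<string>;
}

function escapeRegExp(s: string): string {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical convention: components live at .../components/<Name>/...
function componentFromPath(path: string): string | undefined {
  const m = /(?:^|\/)components\/([A-Z][A-Za-z0-9]*)(?:\/|\.)/.exec(path);
  return m?.[1];
}

function extractIntent(pr: GhPrView, knownComponents: string[]): Intent {
  const labels = pr.labels.map((l) => l.name);
  const changedFiles = pr.files.map((f) => f.path);

  const changedComponents = new Set<string>();
  for (const p of changedFiles) {
    const c = componentFromPath(p);
    if (c) changedComponents.add(c);
  }

  // Whole-word, case-insensitive match so "Button" does not match "ButtonGroup".
  const haystack = `${pr.title}\n${pr.body}\n${labels.join(' ')}`.toLowerCase();
  const scopeKeywords = knownComponents.filter((name) =>
    new RegExp(`\\b${escapeRegExp(name.toLowerCase())}\\b`).test(haystack),
  );

  return { title: pr.title, body: pr.body, labels, changedFiles, scopeKeywords, changedComponents };
}
```

Matching against a known-name list rather than free-text tokenizing keeps stray words in the PR body from becoming scope keywords.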
## Step 4 - Classify each cluster against intent
```typescript
type Classification = 'aligned' | 'adjacent' | 'unrelated';
function classify(cluster: string, diffs: SnapshotDiff[], intent: Intent): Classification {
  if (intent.scopeKeywords.includes(cluster) ||
      intent.changedComponents.has(cluster)) {
    return 'aligned';
  }
  // Adjacent: cluster is a child / parent of a changed component (e.g. PR touches
  // Modal; Button-inside-Modal also diffs).
  if (intent.changedComponents.has(parentOf(cluster)) ||
      intent.changedComponents.has(childOf(cluster))) {
    return 'adjacent';
  }
  return 'unrelated';
}
```
`parentOf` / `childOf` use the team's component-graph manifest (from Storybook or a hand-maintained JSON) to identify hierarchy.
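A minimal sketch of the hierarchy lookup, assuming a hand-maintained manifest that maps each parent component to its direct children. The `ComponentGraph` shape and the `buildParentIndex` and `isAdjacent` names are hypothetical. Because a component can have several parents and several children, this variant checks all direct neighbours rather than a single `parentOf` / `childOf` value.

```typescript
// Hypothetical manifest shape: parent component -> direct children.
type ComponentGraph = Record<string, string[]>;

// Invert the manifest so each child can look up its parents.
function buildParentIndex(graph: ComponentGraph): Map<string, string[]> {
  const parents = new Map<string, string[]>();
  for (const [parent, children] of Object.entries(graph)) {
    for (const child of children) {
      const list = parents.get(child) ?? [];
      list.push(parent);
      parents.set(child, list);
    }
  }
  return parents;
}

// True when any direct parent or direct child of `cluster` was changed by the PR.
function isAdjacent(cluster: string, changed: Set<string>, graph: ComponentGraph): boolean {
  const parents = buildParentIndex(graph).get(cluster) ?? [];
  const children = graph[cluster] ?? [];
  return [...parents, ...children].some((name) => changed.has(name));
}
```

Only direct neighbours count here; walking the graph transitively would mark most of a page as adjacent and dilute the bucket.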
## Step 5 - Render the report
````markdown
## Visual diff summary - `<sha>`
**Total snapshots:** 87 (12 changed, 75 unchanged)
**Verdict:** REVIEW (1 unrelated cluster is a suspected regression)
### ✅ Aligned with PR intent (3 clusters, 7 diffs)
The PR title says **"Refactor Button to use new design tokens"** -
these clusters match.
| Cluster | Diffs | Max diff% | Recommendation |
|---------|------:|----------:|----------------|
| Button | 4 | 8.2% | Update baselines (`golden-file-manager update`) |
| ButtonGroup | 2 | 3.1% | Update baselines |
| IconButton | 1 | 1.5% | Update baseline |
### ⚠ Adjacent (1 cluster, 3 diffs) - confirm intent
| Cluster | Diffs | Max diff% | Recommendation |
|---------|------:|----------:|----------------|
| Modal | 3 | 2.8% | Modal contains Button; check that the Button color change inside Modal is intended (it should be, but eyeball one). |
### ❌ Unrelated (1 cluster, 2 diffs) - DO NOT update without investigation
| Cluster | Diffs | Max diff% | Recommendation |
|---------|------:|----------:|----------------|
| Footer | 2 | 12.0% | The Footer component isn't mentioned in the PR. Suspected unintended cascade. **Open the diffs:** [link][f1] [link][f2]. Recommend running `regression-bisector` if no obvious cause. |
### Quick actions
```bash
# Flag names below are illustrative; confirm them against your installed CLI version.
# Update aligned baselines after eyeballing 1 sample per cluster:
chromatic --auto-accept-changes --only-changed --components Button,ButtonGroup,IconButton
# OR for Percy:
percy approve <build-id> --snapshots Button,ButtonGroup,IconButton
# Refused - Footer cluster needs investigation; do NOT auto-approve.
```
````
## Step 6 - Cluster sort order
The reviewer scans top-down. Order:
1. **Unrelated** (action required: investigate).
2. **Adjacent** (action required: confirm).
3. **Aligned** (action: bulk-approve).
Inside each group, sort by max diff ratio descending - biggest
visual change first.
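
The ordering above can be sketched as a single comparator. `ClusterSummary` and `sortClusters` are hypothetical names, and the final tie-break on name is an added assumption to keep the output stable between runs.

```typescript
type Classification = 'aligned' | 'adjacent' | 'unrelated';

interface ClusterSummary {
  name: string;
  classification: Classification;
  maxDiffRatio: number; // 0..1
}

// Lower rank renders first: unrelated, then adjacent, then aligned.
const GROUP_RANK: Record<Classification, number> = {
  unrelated: 0,
  adjacent: 1,
  aligned: 2,
};

function sortClusters(clusters: ClusterSummary[]): ClusterSummary[] {
  // Copy first so the caller's array is not mutated.
  return [...clusters].sort(
    (a, b) =>
      GROUP_RANK[a.classification] - GROUP_RANK[b.classification] ||
      b.maxDiffRatio - a.maxDiffRatio || // biggest visual change first
      a.name.localeCompare(b.name),      // assumed tie-break for stable output
  );
}
```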
## Step 7 - Auto-update only the aligned cluster
The summary report includes the safe-to-run command for the aligned
cluster only. Adjacent and unrelated clusters never get an
auto-update suggestion - those need eyes.
This matches the `visual-diff-classifier` agent's adversarial logic
in `qa-visual-regression`: aligned diffs go through; unrelated diffs
are refused with a recommendation to escalate to
`regression-bisector` in `qa-flake-triage`.
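
A minimal sketch of that gate, assuming clusters have already been classified. `UpdatePlan` and `planBaselineUpdates` are hypothetical names; the function only decides which cluster names may appear in an update command and leaves the tool-specific command string to the caller.

```typescript
type Classification = 'aligned' | 'adjacent' | 'unrelated';

interface ClassifiedCluster {
  name: string;
  classification: Classification;
}

interface UpdatePlan {
  autoUpdate: string[];                             // safe to include in an update command
  refused: Array<{ name: string; reason: string }>; // never auto-updated
}

function planBaselineUpdates(clusters: ClassifiedCluster[]): UpdatePlan {
  const plan: UpdatePlan = { autoUpdate: [], refused: [] };
  for (const c of clusters) {
    if (c.classification === 'aligned') {
      plan.autoUpdate.push(c.name);
    } else {
      plan.refused.push({
        name: c.name,
        reason:
          c.classification === 'adjacent'
            ? 'confirm intent before updating'
            : 'investigate; escalate to regression-bisector if no obvious cause',
      });
    }
  }
  return plan;
}
```

Keeping the refusal reasons in the plan lets the report explain why a cluster was left out instead of silently omitting it.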
## Step 8 - CI integration (sticky comment)
```yaml
- name: Fetch tool diff data
  run: |
    npx chromatic --dry-run --json > visual-diffs.json
- name: Generate summary
  run: python scripts/visual_summary.py visual-diffs.json --pr ${{ github.event.pull_request.number }} > summary.md
- name: Post sticky comment
  uses: marocchino/sticky-pull-request-comment@v2
  with:
    header: visual-diff-summary
    path: summary.md
```
## Anti-patterns
| Anti-pattern | Why it fails | Fix |
|---|---|---|
| Posting one comment per snapshot | PR conversation is buried; reviewer can't see the forest. | One sticky summary (Step 8); per-snapshot detail in linked tool UI. |
| No intent classification - just listing diffs | Reviewer has to figure out what's expected vs surprise; same problem as no summary. | Aligned / adjacent / unrelated buckets (Step 4). |
| Auto-approving every diff | Regressions silently become baselines. The point of visual tests is defeated. | Auto-approve only aligned cluster (Step 7); refuse unrelated. |
| Sorting by alphabetical name | High-impact diffs sit below low-impact ones that happen to sort earlier alphabetically. | Sort by max diff ratio within each classification group (Step 6). |
| Cross-tool deduplication missing | Same snapshot appears in 3 reports (Percy + Playwright + Chromatic dual-instrumentation). | Deduplicate by the (`componentOrRoute`, `variant`) pair before clustering. |
| Reporting unchanged snapshots | Hides the few that did change; reviewer scrolls past. | Report only changed; unchanged count in the summary header. |
| Treating "diff% = 0.1" as a real diff | Anti-aliasing / font rendering jitter; not a visual change. | Threshold (typically diff% > 0.5, i.e. `diffRatio` > 0.005, or pixel count > 10) before counting. |
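
The last two fixes in the table, noise thresholding and cross-tool deduplication, can be sketched together as a pre-clustering pass. `Row`, `MIN_DIFF_RATIO`, and `dropNoiseAndDedupe` are hypothetical names, and the 0.005 default simply restates the 0.5% figure from the table. The key follows the (`componentOrRoute`, `variant`) pair named above; if several stories map to one component, adding `storyOrPage` to the key avoids collapsing them.

```typescript
interface Row {
  tool: string;
  componentOrRoute: string;
  variant?: string;
  diffRatio: number; // 0..1
}

const MIN_DIFF_RATIO = 0.005; // 0.5%; below this is assumed to be rendering jitter

function dropNoiseAndDedupe(rows: Row[], minRatio: number = MIN_DIFF_RATIO): Row[] {
  const best = new Map<string, Row>();
  for (const row of rows) {
    if (row.diffRatio <= minRatio) continue; // anti-aliasing / font jitter
    const key = `${row.componentOrRoute}\u0000${row.variant ?? ''}`;
    const seen = best.get(key);
    // When several tools report the same snapshot, keep the largest diff.
    if (!seen || row.diffRatio > seen.diffRatio) best.set(key, row);
  }
  return Array.from(best.values());
}
```

Keeping the largest ratio per key is a conservative choice: a regression that only one tool catches still surfaces at its full size.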