visual-baseline-conventions

Reference catalog for visual regression coverage decisions - which Storybook stories or pages get baselines, how to choose breakpoints, when to mask vs adjust threshold, when to add or remove a baseline, and a decision matrix for picking among Percy / Chromatic / Playwright / Storybook test-runner. Use when designing visual coverage for a new project or auditing an existing baseline set.

Reference catalog for how to design visual coverage. Pairs with the engine-specific skills in this plugin (percy-visual-regression-testing, chromatic-visual-regression-testing, playwright-snapshots, storybook-visual-regression-testing) - those tell you the how of running baselines; this tells you which baselines and where.

Engine selection

Pick before authoring any baseline. Mixing engines is fine in a large project - see responsive-breakpoint-runner and visual-baseline-gate for the mechanics of running and gating multiple engines together.

| Use case | Preferred engine | Why |
|---|---|---|
| Design system / component library; story-driven coverage | Chromatic (or Percy + @percy/storybook) | Story → snapshot is automatic; per-story granularity. |
| Application UI, page-driven full-flow coverage | Playwright snapshots or Percy + Playwright SDK | Page-level pixel diffs; full-page scroll capture. |
| Already on BrowserStack; want hosted UI + cross-browser | Percy | First-party BrowserStack integration; AI noise-filtering review mode. |
| Free / no SaaS dependency; OK with diff-image artifacts in CI | Playwright snapshots | Self-hosted; baselines committed to repo. |
| Storybook + free / self-hosted | @storybook/test-runner postVisit + Playwright snapshots | Per-story coverage with the Playwright snapshot mechanics. |

Anti-pattern: running both Percy and Chromatic on the same project "to compare" - duplicate snapshot quota cost without a useful signal. Pick one hosted engine; if you need self-hosted depth, add Playwright snapshots alongside.

Story / page selection - what gets a baseline?

Author baselines for states the user actually sees and skip internal-only states.

| Layer | Coverage |
|---|---|
| Atoms (Button, Input, Badge) | Each variant × each size × disabled & loading states. |
| Molecules (Card, Modal, Toast) | Default state plus the single most common variant. |
| Templates (page-level layouts) | Empty state, populated state, loading state, error state. |
| Pages (routes) | Logged-out home, logged-in home, one representative app page. |
| Marketing pages | Each above-the-fold section; below-the-fold only if it has interactive elements. |

Skip:

  • Pure-prose long pages (Terms, Privacy) unless visual layout is a business concern.
  • Internal admin tooling not seen by external users.
  • Stories that exist only to satisfy @storybook/addon-controls combinatorics - the per-prop combination matrix is exponential and catches very few real bugs.

Breakpoint selection - what widths?

The starter set most teams converge on:

| Breakpoint | Width | Rationale |
|---|---|---|
| Mobile | 375 px | iPhone SE-class baseline (smallest popular). |
| Tablet | 768 px | iPad portrait (rarely the bottleneck but cheap to add). |
| Desktop | 1280 px | Most common (modal) mid-range desktop / laptop width. |
| Wide-desktop | 1920 px | Full-HD; catches large-screen layout bugs. |

Add a 1024 px breakpoint when iPad-landscape is a heavy-traffic device. Add 320 px (older small phones such as the first-generation iPhone SE) only if analytics show meaningful traffic at that width.
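
For the Playwright engine, the starter set can be expressed as one project per breakpoint. This is a minimal config sketch, not a prescribed setup: the project names and the viewport heights are illustrative assumptions (this catalog only fixes the widths).

```typescript
// playwright.config.ts - sketch of the four starter breakpoints as projects.
import { defineConfig } from '@playwright/test';

export default defineConfig({
  projects: [
    // Heights are assumed values; only the widths come from the table above.
    { name: 'mobile-375', use: { viewport: { width: 375, height: 667 } } },
    { name: 'tablet-768', use: { viewport: { width: 768, height: 1024 } } },
    { name: 'desktop-1280', use: { viewport: { width: 1280, height: 800 } } },
    { name: 'wide-1920', use: { viewport: { width: 1920, height: 1080 } } },
  ],
});
```

Running a single breakpoint then becomes `npx playwright test --project=mobile-375`.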

Anti-pattern: snapshotting at 12+ breakpoints "for safety". The snapshot count is stories × breakpoints × browsers, so every extra breakpoint multiplies the whole matrix; quota cost rises faster than bug discovery.
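
To make the blow-up concrete, here is a small worked calculation. The helper name and the figure of 200 stories are hypothetical; only the multiplication itself comes from the text above.

```typescript
// Hypothetical helper: snapshots taken per build for a given coverage matrix.
interface CoverageMatrix {
  stories: number;     // stories or pages that have a baseline
  breakpoints: number; // viewport widths captured per story
  browsers: number;    // browsers captured per breakpoint
}

function snapshotsPerBuild(m: CoverageMatrix): number {
  return m.stories * m.breakpoints * m.browsers;
}

// Starter set: 4 breakpoints in a single browser.
const starter = snapshotsPerBuild({ stories: 200, breakpoints: 4, browsers: 1 });

// "For safety" set: 12 breakpoints in 3 browsers.
const inflated = snapshotsPerBuild({ stories: 200, breakpoints: 12, browsers: 3 });

console.log(starter);  // 800
console.log(inflated); // 7200
```

Under these assumed numbers the inflated matrix costs nine times the quota of the starter set for the same 200 stories.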

Masking, threshold, or wait?

A snapshot can become noisy in several ways, and the fix differs by source:

| Source of noise | Right tool |
|---|---|
| Animated GIF / SVG / video | freezeAnimatedImage (Percy) or Playwright animations: 'disabled'. |
| Caret blink, focus rings, hover states | caret: 'hide' (Playwright); avoid :hover in story render. |
| Live data (timestamps, counters, A/B variants) | Mask the element (mask / ignoreRegionSelectors). |
| Anti-aliasing, sub-pixel font rendering | Threshold - bump maxDiffPixels (50 - 200) or threshold (0.2 default → 0.3 max). |
| Async content that hasn't loaded | Wait before snapshot (page.waitForSelector, await expect(loc).toBeVisible()). |

Order of preference: wait → mask → threshold. Reaching for threshold first hides real regressions; waiting / masking surgically removes the known noise without inflating tolerance.

Anti-pattern: maxDiffPixels: 5000 "to make the build green". A five-thousand-pixel tolerance hides whole component regressions; the team eventually disables visual testing.
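
One way to keep tolerance from creeping upward is a small guard that a lint step or review script could run over snapshot options. The sketch below is an assumption about how a team might enforce this, not part of any engine: the function name is hypothetical, and the limits (threshold above 0.3, maxDiffPixels above 500) are the values this catalog treats as anti-patterns.

```typescript
// Limits taken from this catalog; adjust if your team settles on others.
const MAX_THRESHOLD = 0.3;
const MAX_DIFF_PIXELS = 500;

// Shaped like the tolerance fields of Playwright's screenshot options.
interface Tolerance {
  threshold?: number;     // per-pixel color tolerance, 0 to 1
  maxDiffPixels?: number; // absolute number of differing pixels allowed
}

// Hypothetical guard: returns one message per tolerance setting over the limit.
function toleranceProblems(t: Tolerance): string[] {
  const problems: string[] = [];
  if (t.threshold !== undefined && t.threshold > MAX_THRESHOLD) {
    problems.push(`threshold ${t.threshold} exceeds ${MAX_THRESHOLD}; wait or mask instead`);
  }
  if (t.maxDiffPixels !== undefined && t.maxDiffPixels > MAX_DIFF_PIXELS) {
    problems.push(`maxDiffPixels ${t.maxDiffPixels} exceeds ${MAX_DIFF_PIXELS}; wait or mask instead`);
  }
  return problems;
}

const ok = toleranceProblems({ threshold: 0.2, maxDiffPixels: 100 });
const tooLoose = toleranceProblems({ maxDiffPixels: 5000 });
```

Here `ok` is empty, while `tooLoose` contains a single message about `maxDiffPixels`.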

When to add a baseline

  • A new component / page is shipping.
  • A regression escaped to production (add a baseline for that specific state to prevent re-occurrence).
  • A redesign locked the new design system (refresh all baselines once in a single PR).

When to remove a baseline

  • The story / page was deleted and the baseline is now orphaned.
  • The component is purely behavioral (e.g. a hidden context provider) and the baseline never had visual content.
  • The baseline catches the same regression as another sibling baseline - keep the smaller-blast-radius one.

When to update a baseline

  • Always after intentional UI changes, in the same PR that ships the change. The PR review surface should include both the code diff and the snapshot diff; reviewers approve them together.
  • Never as a separate "snapshot refresh" PR detached from the code change - the diff is uninterpretable without the corresponding code.

Severity tiering

Most projects need only two tiers:

| Tier | Behavior | Use for |
|---|---|---|
| Block | Fail CI; require explicit acceptance | Production-shipped pages and components. |
| Warn | Surface in the report; do not block | Unstable areas under active redesign; new baselines during ramp-up (first 2 weeks). |

Promote warn-tier baselines to block-tier after they've been stable for ~2 weeks of CI runs.
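
The promotion rule can be sketched as a pure function. Everything here is hypothetical scaffolding: the record shape, the field names, and the reading of "stable" as "no failure or re-approval for 14 days" are assumptions, since this catalog does not define how stability is tracked.

```typescript
type Tier = 'warn' | 'block';

// Hypothetical record a team might keep per baseline.
interface BaselineRecord {
  name: string;
  tier: Tier;
  lastChangedAt: Date; // last time the baseline failed or was re-approved
}

const STABLE_DAYS = 14; // "~2 weeks" from the rule above
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Returns the tier the baseline should have as of `now`.
function nextTier(b: BaselineRecord, now: Date): Tier {
  if (b.tier === 'block') return 'block'; // never demote automatically
  const stableDays = (now.getTime() - b.lastChangedAt.getTime()) / MS_PER_DAY;
  return stableDays >= STABLE_DAYS ? 'block' : 'warn';
}

const record: BaselineRecord = {
  name: 'Atoms/Button/Primary-Disabled-375',
  tier: 'warn',
  lastChangedAt: new Date('2024-01-01T00:00:00Z'),
};

const afterNineDays = nextTier(record, new Date('2024-01-10T00:00:00Z'));
const afterTwoWeeks = nextTier(record, new Date('2024-01-15T00:00:00Z'));
```

With these dates, `afterNineDays` stays `'warn'` and `afterTwoWeeks` becomes `'block'`.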

Naming conventions

  • Story-driven baselines: <atomic-level>/<component>/<variant> (e.g. Atoms/Button/Primary-Disabled).
  • Page-driven baselines: route path with hyphens (-) replacing slashes; e.g. /dashboard/billing → dashboard-billing.
  • Per-breakpoint suffix: -375, -768, -1280, -1920.

A baseline name that doesn't tell the reviewer what they're looking at is the most common cause of "rubber-stamp" approvals. Self-documenting names + the engine's diff UI make approvals fast.
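
The conventions above are mechanical enough to generate rather than type by hand. A minimal sketch follows; both function names are hypothetical, and naming the root route `home` is an assumption, since the convention does not say what `/` should map to.

```typescript
// Page-driven: route path with hyphens replacing slashes, plus a width suffix.
function pageBaselineName(route: string, width: number): string {
  // Trim leading and trailing slashes, then turn inner slashes into hyphens.
  const slug = route.replace(/^\/+|\/+$/g, '').replace(/\/+/g, '-');
  // Assumption: the root route "/" has no segments, so call it "home".
  return `${slug || 'home'}-${width}`;
}

// Story-driven: <atomic-level>/<component>/<variant>, plus a width suffix.
function storyBaselineName(
  level: string,
  component: string,
  variant: string,
  width: number,
): string {
  return `${level}/${component}/${variant}-${width}`;
}

console.log(pageBaselineName('/dashboard/billing', 1280));
// dashboard-billing-1280
console.log(storyBaselineName('Atoms', 'Button', 'Primary-Disabled', 375));
// Atoms/Button/Primary-Disabled-375
```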

Common anti-patterns

| Anti-pattern | Why it fails | Fix |
|---|---|---|
| Baselines for every Storybook control combination | Combinatorial blow-up; minimal new-bug signal | One baseline per business-relevant variant; skip auto-generated combos. |
| Threshold cranked above 0.3 / maxDiffPixels > 500 | Hides whole-component regressions | Mask or wait instead. |
| Snapshots committed from developer laptops | OS / font drift causes false positives in CI | Run baseline updates only in CI, or use the official Playwright Docker image locally. |
| One baseline per breakpoint per browser per locale | Quota cost dominates; review fatigue | Cover most breakpoints in one browser; cross-browser only on top-traffic pages. |
| Updating snapshots in a "snapshot refresh" PR | Reviewers can't tell intentional from regression | Always update baselines in the same PR as the UI code change. |
| --auto-accept-changes on PR branches | Eliminates the entire point of visual review | Use --auto-accept-changes only on main (post-merge); never on PRs. |
| Mixing Percy and Chromatic on the same coverage | Two builds, two review UIs, duplicate quota | Pick one hosted engine; pair with Playwright for self-hosted depth. |

When to retire visual coverage entirely

A few projects do not benefit from visual regression testing:

  • Backend services with no UI. Obvious skip.
  • CLI tools. Snapshot the rendered output as text instead.
  • Data-heavy admin panels with chronic dynamic data. The masking surface area exceeds the asserted area; consider keyboard / accessibility tests as the visual proxy.

If you find yourself retiring more than ~20 % of your baselines as "chronically flaky", the project is probably in this category. Switch strategies rather than fight the tool.
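
As a rough illustration of the ~20 % rule of thumb, the check is a single ratio. The function name and the sample counts are hypothetical; how a team counts "retired as flaky" is left to its own bookkeeping.

```typescript
// Hypothetical check: has the share of baselines retired as flaky passed the limit?
function shouldSwitchStrategy(
  totalBaselines: number,
  retiredAsFlaky: number,
  limit = 0.2, // the ~20 % rule of thumb from the text above
): boolean {
  if (totalBaselines <= 0) return false; // nothing to judge yet
  return retiredAsFlaky / totalBaselines > limit;
}

const healthy = shouldSwitchStrategy(120, 12);  // 10 % retired
const chronic = shouldSwitchStrategy(120, 30);  // 25 % retired
```

With these sample counts, `healthy` is `false` and `chronic` is `true`.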