How to Organize Regression Testing for Web Apps
Testland, May 1, 2026
Build a regression suite that scales for web apps: scope by change impact, tier by speed, split fast PR runs from nightly soaks, quarantine flaky tests.

A regression suite that takes 45 minutes on every pull request is a regression suite teams stop trusting. Engineers route around CI, mark tests as xfail, and ship straight to staging. The fix is not "write more tests" - it's organization. This guide covers how to scope a regression suite by change impact, place tests at the right pyramid tier, split fast PR runs from full nightly soaks, and quarantine flakiness before it eats trust. Aimed at QA leads and platform engineers at small-to-mid teams who own the test pipeline.
Define what regression testing actually covers
Scope first. Regression testing verifies that recent changes haven't broken behavior that already worked. It is not the place for new-feature acceptance tests, exploratory checks, or load testing.
Three categories belong in a regression suite:
Three categories don't:
Place tests on the pyramid
A regression suite is not one thing - it's a stack. Push as much regression coverage as possible to the base.
| Tier | Speed per test | What to regress |
|---|---|---|
| Unit | <1 ms | Pure logic, validators, formatters, hooks |
| Integration | <100 ms | API routes, DB queries, service contracts |
| End-to-end | seconds | Critical user flows (login, checkout, payment) |
Tooling at each tier: Jest, Vitest, or pytest at the unit level; SuperTest, requests, or Pact for integration; Playwright or Cypress for end-to-end.
Ham Vocke's "The Practical Test Pyramid", published on Martin Fowler's site, puts it directly: "Write lots of small and fast unit tests. Write some more coarse-grained tests and very few high-level tests that test your application from end to end." When a higher-level test catches a regression that no lower-level test caught, write the lower-level test before merging the fix. The expensive E2E test catches the bug once; the cheap unit test catches the next 50 variants of it.
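As a rough illustration of pushing a regression down the pyramid, suppose an end-to-end checkout test caught a rounding bug in a discount calculation. The function `apply_discount`, its rounding rule, and the test cases below are all hypothetical; the point is only that one cheap, table-driven unit test can pin the original case and its neighbours.

```python
import unittest


def apply_discount(total_cents, percent):
    """Hypothetical function under test.

    Subtracts a percentage discount from a total in cents; the discount
    amount is rounded down to whole cents.
    """
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return total_cents - (total_cents * percent) // 100


class ApplyDiscountRegressionTest(unittest.TestCase):
    def test_known_and_neighbouring_cases(self):
        # (total_cents, percent, expected_cents); values are illustrative.
        cases = [
            (1000, 0, 1000),  # no discount
            (1000, 100, 0),   # full discount
            (999, 10, 900),   # discount of 99.9 cents rounds down to 99
            (1, 50, 1),       # discount of 0.5 cents rounds down to 0
        ]
        for total, percent, expected in cases:
            with self.subTest(total=total, percent=percent):
                self.assertEqual(apply_discount(total, percent), expected)

    def test_rejects_out_of_range_percent(self):
        with self.assertRaises(ValueError):
            apply_discount(1000, 101)
```

Saved in a test module, this runs with `python -m unittest`. Adding the next variant is one more row in `cases` rather than one more browser session.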
Split into a fast PR suite and a full nightly suite
Running every test on every pull request is the most common mistake. Two suites with two SLAs solve it:
PR suite - target under 10 minutes
Full nightly suite - target under 90 minutes
Stripe goes further with selective test execution - running around 5% of tests on average per change. The model maps source files to the tests that exercise them via a code-impact graph. For most teams, simpler heuristics work fine: tag tests by feature area, run that area's tests when files in that area change, and let the rest fall to the nightly run.
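The simpler heuristic can be sketched in a few lines. This is not Stripe's code-impact graph, only a minimal path-to-tag lookup; the directory layout, the area names, and the choice to run every area when a changed file is unmapped are assumptions made for illustration.

```python
from fnmatch import fnmatch

# Hypothetical mapping from source path patterns to feature-area tags.
AREA_PATTERNS = {
    "checkout": ["src/checkout/*", "src/payments/*"],
    "auth": ["src/auth/*"],
    "search": ["src/search/*"],
}

# Tags that run on every PR regardless of what changed.
ALWAYS_RUN = {"smoke"}


def select_tags(changed_files):
    """Return the sorted list of test tags to run for the changed files."""
    tags = set(ALWAYS_RUN)
    for path in changed_files:
        matched = False
        for area, patterns in AREA_PATTERNS.items():
            # fnmatch's "*" also matches "/", so nested files are covered.
            if any(fnmatch(path, pattern) for pattern in patterns):
                tags.add(area)
                matched = True
        if not matched:
            # Unmapped file: be conservative and run every area.
            tags.update(AREA_PATTERNS)
    return sorted(tags)
```

The resulting tags can be joined into a Playwright `--grep` pattern or a pytest `-m` expression. Falling back to every area for unmapped files trades speed for safety; a team could instead maintain an explicit ignore list for docs and similar paths.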
Quarantine flaky tests instead of ignoring them
A test that fails 1 in 20 runs costs more than a missing test. Engineers stop trusting failures, then they stop reading them. Microsoft's flaky test management system (used by more than 100 product teams) has identified 49,000 flaky tests and rescued 160,000 build sessions that would have failed otherwise.
A four-step quarantine process works at any scale:
For the underlying causes and per-category fixes (timing, shared state, environment, order dependency), see Fixing flaky tests: a systematic approach.
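Before a test can be quarantined it has to be identified. One possible detection rule, sketched below, flags a test as flaky when it has both passed and failed on the same commit, since the code did not change between those runs. The input shape (tuples of test id, commit SHA, and pass/fail) is an assumption about what a CI system can export, not a description of Microsoft's system.

```python
from collections import defaultdict


def find_flaky_tests(runs):
    """Return {test_id: failure_rate} for tests with mixed results on one commit.

    `runs` is an iterable of (test_id, commit_sha, passed) tuples.
    """
    outcomes = defaultdict(lambda: defaultdict(set))
    totals = defaultdict(lambda: [0, 0])  # test_id -> [failures, run count]

    for test_id, commit_sha, passed in runs:
        outcomes[test_id][commit_sha].add(passed)
        totals[test_id][0] += 0 if passed else 1
        totals[test_id][1] += 1

    flaky = {}
    for test_id, by_commit in outcomes.items():
        # Both True and False recorded for one commit: same code, different result.
        if any(len(results) == 2 for results in by_commit.values()):
            failures, count = totals[test_id]
            flaky[test_id] = failures / count
    return flaky
```

A test that fails on one commit and passes on the next is not flagged, because that pattern is also what a genuine fix looks like.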
Wire it into CI
The split lives or dies in the CI config. Three triggers, two suite shapes, one parallelism strategy.
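Sharding is non-negotiable. Playwright's native sharding splits a suite across N machines: each machine runs one slice with --shard=1/4, --shard=2/4, and so on. pytest-xdist covers a different part of the problem: pytest -n auto spreads tests across the CPU cores of a single machine, so it parallelizes within a runner rather than across runners. Both are 5-minute setups, and both cut wall-clock time roughly linearly with worker count until fixed setup cost and the slowest worker start to dominate.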
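Why the speedup is only roughly linear is easier to see with numbers. The sketch below assigns tests to shards greedily, longest first, using hypothetical per-test durations. It is an illustration of the arithmetic, not a description of how Playwright or pytest-xdist distribute tests internally.

```python
def plan_shards(durations, shard_count):
    """Greedy longest-first assignment of tests to shards.

    `durations` maps test name to seconds. Returns a list of
    (total_seconds, [test names]) tuples, one per shard.
    """
    shards = [[0.0, []] for _ in range(shard_count)]
    # Longest tests first; ties broken by name for a deterministic plan.
    ordered = sorted(durations.items(), key=lambda item: (-item[1], item[0]))
    for name, seconds in ordered:
        lightest = min(shards, key=lambda shard: shard[0])
        lightest[0] += seconds
        lightest[1].append(name)
    return [(total, names) for total, names in shards]


def wall_clock(plan):
    """The run finishes when the slowest shard finishes."""
    return max(total for total, _ in plan)
```

With durations of 60, 30, 20, and 10 seconds, two shards finish in 60 seconds instead of 120. Four shards still take 60 seconds, because the single longest test sets the floor.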
A working CI example
```yaml
# .github/workflows/regression.yml
name: Regression

on:
  pull_request:
  # Pushes to main trigger the workflow, but neither job below is gated on
  # 'push'; add a job or widen an `if:` if merges should run a suite.
  push:
    branches: [main]
  schedule:
    - cron: '0 6 * * *' # nightly 06:00 UTC

jobs:
  pr-suite:
    if: github.event_name == 'pull_request'
    runs-on: ubuntu-latest
    strategy:
      matrix:
        shard: [1, 2, 3, 4]
    steps:
      - uses: actions/checkout@v5
      - uses: actions/setup-node@v5
        with:
          node-version: '22'
      - run: npm ci
      - run: npx playwright install --with-deps chromium
      # Only Chromium is installed above, so pin the project to match.
      - run: npx playwright test --project=chromium --grep "@smoke|@critical" --shard=${{ matrix.shard }}/4

  nightly-full:
    if: github.event_name == 'schedule'
    runs-on: ubuntu-latest
    strategy:
      matrix:
        shard: [1, 2, 3, 4, 5, 6]
        browser: [chromium, firefox, webkit]
    steps:
      - uses: actions/checkout@v5
      - uses: actions/setup-node@v5
        with:
          node-version: '22'
      - run: npm ci
      - run: npx playwright install --with-deps
      # Assumes playwright.config defines projects named chromium, firefox, and webkit.
      - run: npx playwright test --project=${{ matrix.browser }} --shard=${{ matrix.shard }}/6
```

The PR job runs only @smoke and @critical tagged tests on Chromium across 4 shards: the fast loop. The nightly job runs everything across 3 browsers and 6 shards (18 parallel jobs): the safety net.
Anti-patterns to avoid
Running everything on every commit. Shopify's CI took 45 minutes because it ran more than 170,000 tests for every change. Selective execution and time-budgeted prioritization brought that down to 18 minutes.
Snapshot tests as a regression substitute. Snapshot tests catch serialization changes, not behavior. A button can rerender with the same DOM and a broken click handler - the snapshot still matches. Reserve snapshots for stable serialized data (JSON config, GraphQL responses), not for asserting "the page works."
No test ownership. When tests don't have owners, flakes accumulate quietly. Tag tests with the team that owns the corresponding feature. Auto-route flake reports to that team's queue.
Hardcoded waits. A fixed time.sleep(5) is a leading source of flakes in UI suites: it fails when the app is slower than the guess and wastes time when the app is faster. Use condition-based waits instead - Playwright's auto-waiting, Cypress's .should(), or pytest fixtures that poll until a condition is true.
An 8-hour suite as a status symbol. The goal is fast, reliable feedback - not test count. A 30-minute suite with 100% trust beats a 4-hour suite engineers route around.
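For the hardcoded-wait anti-pattern above, a condition-based wait can be as small as the helper below. It is a minimal sketch using only the standard library; the name `wait_until` and its default timeout and interval are assumptions, and a framework's built-in waiting is preferable wherever one is available.

```python
import time


def wait_until(condition, timeout=5.0, interval=0.1):
    """Poll `condition` until it returns a truthy value or `timeout` elapses.

    Returns the truthy value, or raises TimeoutError if the deadline passes.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)
```

Unlike a fixed sleep, this returns as soon as the condition holds and fails with a clear error when it never does, so a slow environment produces a timeout message rather than a confusing assertion failure further down the test.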
What to read next
Once the regression suite is organized, the work shifts to keeping it healthy. Two posts that cover the daily mechanics:
For the testability primitives that make change-based selection possible, the Stripe selective test execution post is the best deep-dive on file-to-test impact mapping.