How to Organize Regression Testing for Web Apps

Testland, May 1, 2026

Build a regression suite that scales for web apps: scope by change impact, tier by speed, split fast PR runs from nightly soaks, quarantine flaky tests.

Figure: Shopify's monolith CI p95 fell from 45 to 18 minutes after selective execution and time-budgeted prioritization. The dashed line marks a recommended PR-suite target of 10 minutes.

A regression suite that takes 45 minutes on every pull request is a regression suite teams stop trusting. Engineers route around CI, mark tests as xfail, and ship straight to staging. The fix is not "write more tests" - it's organization. This guide covers how to scope a regression suite by change impact, place tests at the right pyramid tier, split fast PR runs from full nightly soaks, and quarantine flaky tests before they erode trust. It is aimed at QA leads and platform engineers on small-to-mid-sized teams who own the test pipeline.

Key facts

Shopify reduced its core monolith CI p95 from 45 minutes to 18 minutes by combining selective execution, sharding, and time-budgeted prioritization (Shopify Engineering)
Stripe runs only about 5% of its test suite on average per change via selective test execution across a 50M-line monorepo (Stripe Engineering)
Microsoft's flaky-test management platform (used by more than 100 product teams) has identified 49,000 flaky tests and rescued 160,000 build sessions that would have otherwise failed (Microsoft Engineering)
Failure-rate-based test prioritization catches 80% of failures after running only 60% of the suite (Shopify test budget)
The classic test pyramid still applies: many fast unit tests, fewer integration tests, even fewer end-to-end tests (Martin Fowler)

Define what regression testing actually covers

Scope first. Regression testing verifies that recent changes haven't broken behavior that already worked. It is not the place for new-feature acceptance tests, exploratory checks, or load testing.

Three categories belong in a regression suite:

Tests for fixed bugs. Every bug fix earns a regression test that fails without the fix.
Tests for shipped, stable user workflows. Login, signup, search, checkout - the paths where breakage costs revenue.
Tests for documented integrations. Third-party APIs, webhooks, and data exports your app depends on.

Three categories don't:

New-feature tests not yet in production. Run them on the feature branch and promote them to the regression suite once the feature ships.
A/B variant validation. Use feature flags and production monitoring instead.
Pure visual micro-interactions (animations, hover transitions). These are slow and brittle as automated checks; reserve them for manual QA or visual-diff tools.

Place tests on the pyramid

A regression suite is not one thing - it's a stack. Push as much regression coverage as possible to the base.

Tier	Speed per test	What to regress
Unit	<1 ms	Pure logic, validators, formatters, hooks
Integration	<100 ms	API routes, DB queries, service contracts
End-to-end	seconds	Critical user flows (login, checkout, payment)

Tooling at each tier: Jest, Vitest, or pytest at the unit level; SuperTest, requests, or Pact for integration; Playwright or Cypress for end-to-end.

Ham Vocke's The Practical Test Pyramid, published on Martin Fowler's site, puts it directly: "Write lots of small and fast unit tests. Write some more coarse-grained tests and very few high-level tests that test your application from end to end." When a higher-level test catches a regression that no lower-level test caught, write the lower-level test before merging the fix. The expensive E2E test catches the bug once; the cheap unit test catches the next 50 variants of it.
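As a sketch of that rule, suppose an E2E checkout run caught an order total rendered as "$10.5" instead of "$10.50". The function format_price and the bug itself are hypothetical, invented for illustration; the point is that the unit test pins the fix down in microseconds and makes cheap variants easy to add.

```python
from decimal import Decimal, ROUND_HALF_UP


def format_price(amount: Decimal) -> str:
    """Format a dollar amount with exactly two decimal places."""
    # The (hypothetical) buggy version returned f"${amount}", which dropped
    # trailing zeros: Decimal("10.5") rendered as "$10.5".
    cents = amount.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
    return f"${cents}"


def test_price_keeps_trailing_zero():
    # Regression test for the bug first seen in the checkout E2E run.
    assert format_price(Decimal("10.5")) == "$10.50"


def test_price_rounds_half_up():
    # A cheap variant the E2E test would never have covered.
    assert format_price(Decimal("2.675")) == "$2.68"
```

The E2E test stays in the suite as a smoke check; the unit tests carry the regression coverage.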

Split into a fast PR suite and a full nightly suite

Running every test on every pull request is the most common mistake. Two suites with two SLAs solve it:

PR suite - target under 10 minutes

All unit tests (always cheap)
Integration tests that exercise the changed code
Smoke E2E for critical paths only (login, checkout, the one workflow that pays the rent)

Full nightly suite - target under 90 minutes

Everything in the PR suite
All E2E tests including long-tail flows
Cross-browser matrix (Chromium + Firefox + WebKit)
Visual regression checks

Stripe goes further with selective test execution - running around 5% of tests on average per change. The model maps source files to the tests that exercise them via a code-impact graph. For most teams, simpler heuristics work fine: tag tests by feature area, run that area's tests when files in that area change, and let the rest fall to the nightly run.
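The tag-by-feature-area heuristic can be sketched in a few lines. This is a minimal illustration, not Stripe's model: the path patterns, tag names, and the select_tags helper are all assumptions. Note the fallback, which runs the full suite whenever a changed file is not mapped to any area.

```python
from fnmatch import fnmatchcase
from typing import Optional

# Hypothetical mapping from path patterns to feature-area tags.
AREA_PATTERNS = {
    "src/checkout/*": "checkout",
    "src/auth/*": "auth",
    "src/search/*": "search",
}

# Tags that run on every PR, whatever changed.
ALWAYS_RUN = {"smoke"}


def select_tags(changed_files: list[str]) -> Optional[set[str]]:
    """Return the tags to run for a change, or None to run everything."""
    tags = set(ALWAYS_RUN)
    for path in changed_files:
        matched = [
            tag for pattern, tag in AREA_PATTERNS.items()
            if fnmatchcase(path, pattern)
        ]
        if not matched:
            # Unmapped file (shared util, config, lockfile): the impact is
            # unknown, so fall back to the full suite.
            return None
        tags.update(matched)
    return tags
```

The resulting set can be joined into a tag filter for the runner, for example a --grep pattern for Playwright or a -m expression for pytest.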

Quarantine flaky tests instead of ignoring them

A test that fails 1 in 20 runs costs more than a missing test. Engineers stop trusting failures, then they stop reading them. Microsoft's flaky test management system (used by more than 100 product teams) has identified 49,000 flaky tests and rescued 160,000 build sessions that would have failed otherwise.

A four-step quarantine process works at any scale:

Detect. Re-run failed tests once. If the re-run passes, log the test to a flake tracker with the failure trace.
Quarantine. After three confirmed flakes, move the test to a separate suite that runs but does not block CI.
Assign. File a bug to the test owner. Set a 2-week SLA: if the test is not fixed by then, delete it.
Measure. Track the flaky rate per team. Target: under 1% of all runs.
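The detect, quarantine, and measure steps can be sketched as a small tracker. This is an illustration under assumptions, not any team's real system: the FlakeTracker class and its method names are hypothetical, and a real tracker would persist counts between CI runs and store the failure trace.

```python
from collections import Counter
from dataclasses import dataclass, field

QUARANTINE_THRESHOLD = 3  # confirmed flakes before a test stops blocking CI


@dataclass
class FlakeTracker:
    flakes: Counter = field(default_factory=Counter)
    quarantined: set = field(default_factory=set)
    total_runs: int = 0
    flaky_runs: int = 0

    def record(self, test_id: str, first_passed: bool, rerun_passed: bool = False) -> str:
        """Classify one test execution and update the counters."""
        self.total_runs += 1
        if first_passed:
            return "pass"
        if not rerun_passed:
            return "fail"  # failed twice in a row: treat as a real failure
        # Failed, then passed on the single re-run: a confirmed flake.
        self.flaky_runs += 1
        self.flakes[test_id] += 1
        if self.flakes[test_id] >= QUARANTINE_THRESHOLD:
            self.quarantined.add(test_id)
        return "flake"

    def is_blocking(self, test_id: str) -> bool:
        """Quarantined tests still run, but their result no longer gates CI."""
        return test_id not in self.quarantined

    def flaky_rate(self) -> float:
        """Share of all recorded runs that were flakes (target: under 0.01)."""
        return self.flaky_runs / self.total_runs if self.total_runs else 0.0
```

Keeping one tracker per team gives the per-team flaky rate from the measure step directly.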

For the underlying causes and per-category fixes (timing, shared state, environment, order dependency), see Fixing flaky tests: a systematic approach.

Wire it into CI

The split lives or dies in the CI config. Three triggers, two suite shapes, one parallelism strategy.

pull_request: PR suite, sharded across 4 jobs.
push to main: same suite, also sharded - catches anything the PR check missed.
schedule (cron: '0 6 * * *'): full nightly soak with cross-browser matrix.

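Sharding is non-negotiable. Playwright's native sharding splits a suite across N machines with --shard=1/4, --shard=2/4, and so on. pytest-xdist parallelizes differently: pytest -n auto spreads tests across the CPU cores of a single machine. Both are 5-minute setups, and wall-clock time falls roughly in proportion to worker count until fixed setup costs and the slowest single test start to dominate.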
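To see why shard balance matters, here is a sketch of duration-aware shard assignment. It is not how Playwright or pytest-xdist distribute tests internally; assign_shards is a hypothetical greedy balancer, and the durations in the usage note are made up.

```python
import heapq


def assign_shards(durations: dict[str, float], shard_count: int) -> list[list[str]]:
    """Greedy longest-first assignment: each test goes to the least-loaded shard."""
    # Heap entries are (total_seconds, shard_index); the lightest shard pops first.
    heap = [(0.0, i) for i in range(shard_count)]
    heapq.heapify(heap)
    shards: list[list[str]] = [[] for _ in range(shard_count)]
    # Sort by duration descending, then by name, so the result is deterministic.
    for name, seconds in sorted(durations.items(), key=lambda kv: (-kv[1], kv[0])):
        load, index = heapq.heappop(heap)
        shards[index].append(name)
        heapq.heappush(heap, (load + seconds, index))
    return shards


def wall_clock(durations: dict[str, float], shards: list[list[str]]) -> float:
    """Wall-clock time is set by the slowest shard, not the average."""
    return max(sum(durations[t] for t in shard) for shard in shards)
```

With durations of 40, 20, 20, 10, and 10 seconds split over two shards, both shards finish in 50 seconds. Add a single 90-second test and no amount of sharding gets the run under 90 seconds, so long tests are worth splitting before adding machines.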

A working CI example

# .github/workflows/regression.yml
name: Regression
on:
  pull_request:
  push:
    branches: [main]
  schedule:
    - cron: '0 6 * * *'  # nightly 06:00 UTC

jobs:
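  # Fast loop: runs on pull requests and on pushes to main.
  pr-suite:
    if: github.event_name == 'pull_request' || github.event_name == 'push'
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false  # let every shard report, even if one fails
      matrix:
        shard: [1, 2, 3, 4]
    steps:
      - uses: actions/checkout@v5
      - uses: actions/setup-node@v5
        with:
          node-version: '22'
      - run: npm ci
      - run: npx playwright install --with-deps chromium
      # Assumes playwright.config defines a project named "chromium".
      - run: npx playwright test --project=chromium --grep "@smoke|@critical" --shard=${{ matrix.shard }}/4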

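  # Safety net: full suite on every browser, scheduled runs only.
  nightly-full:
    if: github.event_name == 'schedule'
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false  # one failing shard should not cancel the other 17
      matrix:
        shard: [1, 2, 3, 4, 5, 6]
        browser: [chromium, firefox, webkit]
    steps:
      - uses: actions/checkout@v5
      - uses: actions/setup-node@v5
        with:
          node-version: '22'
      - run: npm ci
      # Install only the browser this matrix job needs.
      - run: npx playwright install --with-deps ${{ matrix.browser }}
      - run: npx playwright test --project=${{ matrix.browser }} --shard=${{ matrix.shard }}/6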

The PR job runs only @smoke and @critical tagged tests on Chromium across 4 shards: the fast loop. The nightly job runs everything across 3 browsers and 6 shards (18 parallel jobs): the safety net.

Anti-patterns to avoid

Running everything on every commit. Shopify's CI p95 sat at 45 minutes because it ran more than 170,000 tests for every change. Selective execution and time-budgeted prioritization brought it down to 18.
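Time-budgeted prioritization can be sketched as ranking tests by how likely they are to fail per second of runtime, then taking what fits. This is not Shopify's implementation: the HistoryEntry fields, the scoring rule, and the choice to run never-seen tests first are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HistoryEntry:
    name: str
    runs: int        # historical executions
    failures: int    # historical failures
    seconds: float   # typical duration

    @property
    def failure_rate(self) -> float:
        # Tests with no history get the top rate so they always run first.
        return self.failures / self.runs if self.runs else 1.0


def pick_within_budget(tests: list[HistoryEntry], budget_seconds: float) -> list[str]:
    """Rank by failure rate per second of runtime and take what fits the budget."""
    ranked = sorted(
        tests,
        key=lambda t: (-t.failure_rate / max(t.seconds, 1e-9), t.name),
    )
    chosen: list[str] = []
    spent = 0.0
    for test in ranked:
        if spent + test.seconds <= budget_seconds:
            chosen.append(test.name)
            spent += test.seconds
    return chosen
```

Anything that does not fit the budget is not dropped; it falls to the nightly run.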

Snapshot tests as a regression substitute. Snapshot tests catch serialization changes, not behavior. A button can rerender with the same DOM and a broken click handler - the snapshot still matches. Reserve snapshots for stable serialized data (JSON config, GraphQL responses), not for asserting "the page works."

No test ownership. When tests don't have owners, flakes accumulate quietly. Tag tests with the team that owns the corresponding feature. Auto-route flake reports to that team's queue.

Hardcoded waits. time.sleep(5) is one of the most common sources of flakes in a UI suite: too short on a slow CI runner, wasted time on a fast one. Use condition-based waits - Playwright's auto-waiting, Cypress's .should(), pytest fixtures that poll until a condition is true.
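Where the framework offers no built-in waiting, a polling helper is a small amount of code. The wait_until function below is a hypothetical helper, not part of pytest or Playwright; it returns as soon as the condition holds and fails loudly when the timeout expires.

```python
import time
from typing import Callable


def wait_until(
    condition: Callable[[], bool],
    timeout: float = 5.0,
    interval: float = 0.05,
) -> None:
    """Poll `condition` until it returns True, or raise after `timeout` seconds."""
    # time.monotonic() is immune to system clock adjustments mid-test.
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout}s")
        time.sleep(interval)
```

A call such as wait_until(lambda: job.status == "done", timeout=10), where job is whatever object the test is watching, waits only as long as it has to, while time.sleep(5) always costs five seconds and still fails when the job takes six.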

An 8-hour suite as a status symbol. The goal is fast, reliable feedback - not test count. A 30-minute suite with 100% trust beats an 8-hour suite engineers route around.