llm-regression-suite-author
Builds a versioned golden-dataset LLM regression suite for tracking quality across model upgrades: structures a versioned JSONL/CSV golden dataset, configures deterministic eval runs (temperature 0, seed), wires assertion layers (exact, semantic similarity, LLM-as-judge, rubric), enforces a pass-rate threshold with diff reporting vs the baseline model, and gates CI on regression. Use when upgrading an LLM provider model and needing a repeatable before/after quality gate, or when a prompt regression suite must track output quality across model versions over time.
A regression suite for LLM outputs pairs a versioned golden dataset (stable inputs + expected outputs) with deterministic eval settings and a pass-rate gate so that promoting a model from v1 to v2 surfaces quality drops before they reach production. The per-tool skills (promptfoo-evaluation, deepeval-evaluation) handle tool mechanics; this skill handles the versioning workflow that neither covers.
When to use
Step 1 - Structure the golden dataset
A golden dataset is a versioned file of (input, expected output) pairs. Use JSONL for line-by-line diffability or CSV for spreadsheet collaboration.
JSONL format (one object per line, per de-datasets):
{"input": "What is the return policy?", "expected_output": "30-day full refund"}
{"input": "How do I cancel?", "expected_output": "Visit account settings > Cancel"}
CSV format with promptfoo assertion columns (per pf-testcases):
input,__expected,__description
"What is the return policy?","contains: 30-day","return-policy-basic"
"Summarize the refund terms","llm-rubric: Mentions 30 days and no extra costs","refund-summary"
"Return JSON of policy","is-json","structured-output"
Per pf-testcases: __expected values follow TYPE: VALUE syntax; values without a prefix default to equals. Multiple assertions use numbered columns (__expected1, __expected2).
Versioning convention:
datasets/
  golden-v1.0.0.jsonl                    # baseline: current production model
  golden-v1.1.0.jsonl                    # new cases added for feature coverage
  golden-current -> golden-v1.1.0.jsonl  # symlink or CI var
Tag the dataset file together with the model snapshot it was authored against. Treat dataset changes the same as schema changes: they need a version bump, a changelog entry, and a corresponding CI run that captures the new baseline scores before the dataset becomes golden-current.
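One way to tag a dataset together with its model snapshot is a small manifest record stored next to the file. The sketch below is an assumption rather than a promptfoo or DeepEval feature: the `dataset_manifest` helper, the manifest keys, and the filename pattern are hypothetical names chosen to match the golden-vX.Y.Z convention above.

```python
import hashlib
import json
import re

# Hypothetical pattern matching the golden-vX.Y.Z.jsonl / .csv naming used above.
VERSION_RE = re.compile(r"golden-v(\d+\.\d+\.\d+)\.(?:jsonl|csv)$")


def dataset_manifest(filename: str, content: bytes, model_snapshot: str) -> dict:
    """Build a record tying one dataset version to the model it was authored against."""
    match = VERSION_RE.search(filename)
    if match is None:
        raise ValueError(f"{filename!r} does not follow the golden-vX.Y.Z naming convention")
    return {
        "dataset": filename,
        "version": match.group(1),
        # The content hash changes whenever the dataset is edited.
        "sha256": hashlib.sha256(content).hexdigest(),
        "authored_against": model_snapshot,
    }


content = b'{"input": "How do I cancel?", "expected_output": "Visit account settings > Cancel"}\n'
manifest = dataset_manifest("golden-v1.1.0.jsonl", content, "gpt-4.1-2025-04-14")
print(json.dumps(manifest, indent=2))
```

Comparing the stored hash against the file in CI would flag a dataset that was edited without a version bump.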
Step 2 - Configure deterministic eval
Non-determinism is the primary enemy of regression suites. Set temperature: 0 and seed at the provider level so that identical inputs produce the same outputs on re-run, as far as the provider allows.
Promptfoo (per pf-openai):
# promptfooconfig-regression.yaml
providers:
  - id: openai:chat:gpt-4.1-2025-04-14  # baseline model (pinned snapshot)
    config:
      temperature: 0
      seed: 42
  - id: openai:chat:gpt-5.4-mini  # candidate model
    config:
      temperature: 0
      seed: 42
tests: file://datasets/golden-current.csv
evaluateOptions:
  repeat: 1    # per [pf-ref]: "Number of times to run each test case"
  cache: true  # per [pf-ref]: reuses disk-cached responses for unchanged inputs
Per pf-openai: seed is "used for deterministic output" and belongs in the provider's config block. Note that OpenAI's API offers only best-effort determinism with seed; treat repeated runs as a sanity check, not a hard guarantee.
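Because seeded determinism is best-effort, it can help to measure how stable the outputs actually are before trusting a pass-rate diff. The sketch below assumes each repeated run has already been reduced to a plain mapping of input to output text; the `unstable_inputs` helper and that data shape are illustrative assumptions, not part of promptfoo.

```python
def unstable_inputs(runs: list[dict[str, str]]) -> list[str]:
    """Return the inputs whose output differs in at least one repeated run."""
    if len(runs) < 2:
        raise ValueError("need at least two runs to compare")
    first, *rest = runs
    return sorted(
        prompt
        for prompt, output in first.items()
        if any(run.get(prompt) != output for run in rest)
    )


run_a = {
    "What is the return policy?": "30-day full refund",
    "How do I cancel?": "Visit account settings > Cancel",
}
run_b = {
    "What is the return policy?": "30-day full refund",
    "How do I cancel?": "Go to account settings and cancel",
}
print(unstable_inputs([run_a, run_b]))  # ['How do I cancel?']
```

Cases that show up here are candidates for semantic or rubric assertions rather than exact matching.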
DeepEval (per de-start): pin the judge model version in every metric constructor to prevent the judge drifting independently of the model under test:
from deepeval.metrics import GEval
from deepeval.test_case import SingleTurnParams
correctness = GEval(
    name="Correctness",
    criteria="Actual output is factually correct given the expected output.",
    evaluation_params=[SingleTurnParams.ACTUAL_OUTPUT, SingleTurnParams.EXPECTED_OUTPUT],
    model="gpt-4.1-2025-04-14",  # pinned judge snapshot
    threshold=0.7,
)
Per de-start: threshold determines pass/fail; "a metric is only successful if the evaluation score is equal to or greater than threshold."
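Pinning is easy to forget when metrics are defined in several files, so a small lint over the configured model names can catch unpinned aliases. This is a minimal sketch under one assumption: that pinned snapshots end in a -YYYY-MM-DD suffix, as gpt-4.1-2025-04-14 does. Providers with other naming schemes would need a different rule, and `unpinned_models` is a hypothetical helper name.

```python
import re

# Assumption: dated snapshots end in -YYYY-MM-DD, as in gpt-4.1-2025-04-14.
DATED_SUFFIX = re.compile(r"-\d{4}-\d{2}-\d{2}$")


def unpinned_models(models: list[str]) -> list[str]:
    """Return the model names that do not carry a dated snapshot suffix."""
    return [model for model in models if not DATED_SUFFIX.search(model)]


print(unpinned_models(["gpt-4.1-2025-04-14", "gpt-4.1"]))  # ['gpt-4.1']
```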
Step 3 - Layer the assertion types
A robust regression suite combines four assertion layers. Each layer catches a different failure mode.
Layer 1: Exact / structural (deterministic)
Per pf-asserts: equals, contains, regex, is-json, starts-with. Use when the expected output is fully predictable.
assert:
  - type: equals
    value: "30-day full refund at no extra cost"
  - type: is-json  # for structured-output test cases
Layer 2: Semantic similarity
Per pf-asserts: similar uses embeddings + cosine similarity. Use when paraphrase equivalence is acceptable:
assert:
  - type: similar
    value: "Customers receive a full refund within 30 days"
    threshold: 0.82
    provider: openai:embeddings:text-embedding-3-small
Layer 3: LLM-as-judge
Per pf-asserts: llm-rubric grades output against a free-form rubric; factuality checks output against reference facts.
assert:
  - type: llm-rubric
    value: "Response is polite, concise, and mentions the 30-day window"
  - type: factuality
    value: "Return window is 30 days. No extra cost."
Layer 4: Rubric (DeepEval GEval)
Per de-gh: GEval with explicit evaluation_steps produces more consistent scores than criteria-only mode because it avoids regenerating chain-of-thought steps on each call.
policy_rubric = GEval(
    name="PolicyAccuracy",
    evaluation_steps=[
        "Check if the response mentions a 30-day return window.",
        "Check if the response states there is no extra cost.",
        "Penalise responses that add false conditions.",
    ],
    evaluation_params=[SingleTurnParams.ACTUAL_OUTPUT, SingleTurnParams.EXPECTED_OUTPUT],
    model="gpt-4.1-2025-04-14",
    threshold=0.75,
)
Step 4 - Set pass-rate threshold and capture baseline
Run the suite against the baseline model first and record the aggregate pass rate. This becomes the floor the candidate model must meet or beat.
Promptfoo uses a test-level threshold (per pf-ref): "Test will fail if the combined score of assertions is less than this number." Group related assertions with assert-set to set a partial-pass floor:
defaultTest:
  assert:
    - type: assert-set
      threshold: 0.8  # combined score of the grouped assertions must be >= 0.8
      assert:
        - type: similar
          value: "{{expected}}"
          threshold: 0.8
        - type: llm-rubric
          value: "Accurate and helpful"
Capture the JSON output of the baseline run for diff comparison (per pf-cli: -o flag supports json, csv, junit.xml):
# Baseline run (production model only)
promptfoo eval --config promptfooconfig-regression.yaml \
  --filter-providers "gpt-4.1" \
  -o results/baseline-gpt-4.1-$(date +%Y%m%d).json
DeepEval baseline capture:
from pathlib import Path
from deepeval import evaluate
from deepeval.dataset import EvaluationDataset
dataset = EvaluationDataset()
# Resolve to an absolute path, which add_goldens_from_jsonl_file expects (per de-datasets)
dataset.add_goldens_from_jsonl_file(
    file_path=str(Path("datasets/golden-current.jsonl").resolve())
)
# Generate actual_output from baseline model, attach to test cases, then:
results = evaluate(
    test_cases=dataset.test_cases,
    metrics=[correctness, policy_rubric],
)
# Inspect results.test_results for per-case scores; store to JSON for diff
Per de-datasets: add_goldens_from_jsonl_file requires file_path as the absolute path; each line maps to a Golden.
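Storing per-case scores to JSON for later diffing can be sketched as below. The example deliberately works on plain dictionaries, because the exact attributes of DeepEval's result objects are not shown in this document; the `snapshot_scores` helper and the `input`, `success`, and `scores` keys are assumptions for illustration. Mapping results.test_results into that shape is left out.

```python
import json


def snapshot_scores(model: str, cases: list[dict]) -> str:
    """Serialize per-case results in a stable order so two snapshots diff cleanly."""
    passed = sum(1 for case in cases if case["success"])
    payload = {
        "model": model,
        "pass_rate": passed / len(cases) if cases else 0.0,
        # Sorting by input keeps the file order independent of evaluation order.
        "cases": sorted(cases, key=lambda case: case["input"]),
    }
    return json.dumps(payload, indent=2, sort_keys=True)


cases = [
    {"input": "What is the return policy?", "success": True, "scores": {"Correctness": 0.9}},
    {"input": "How do I cancel?", "success": False, "scores": {"Correctness": 0.4}},
]
snapshot = snapshot_scores("gpt-4.1-2025-04-14", cases)
print(snapshot)
```

Sorted keys and a stable case order mean a text diff between two snapshots shows only real score changes.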
Step 5 - Diff candidate vs baseline
After the candidate model run, compare aggregate pass rates:
# Candidate run (both models share one config; filter to the candidate provider)
promptfoo eval --config promptfooconfig-regression.yaml \
  --filter-providers "gpt-5.4-mini" \
  -o results/candidate-gpt-5.4-mini-$(date +%Y%m%d).json
# Simple diff: extract pass rates from JSON output
python3 scripts/regression_diff.py \
  results/baseline-gpt-4.1-20260604.json \
  results/candidate-gpt-5.4-mini-20260604.json \
  --threshold=0.95  # candidate must retain >=95% of baseline pass rate
The regression_diff.py script pattern:
import argparse
import json
import sys
def pass_rate(path):
    """Return the fraction of passing results in a promptfoo JSON output file."""
    with open(path) as f:
        data = json.load(f)
    results = data["results"]["results"]
    if not results:
        raise ValueError(f"{path} contains no test results")
    passed = sum(1 for r in results if r["success"])
    return passed / len(results)
parser = argparse.ArgumentParser(description="Compare candidate and baseline pass rates")
parser.add_argument("baseline")
parser.add_argument("candidate")
# argparse accepts both "--threshold 0.95" and "--threshold=0.95"
parser.add_argument("--threshold", type=float, default=0.95)
args = parser.parse_args()
baseline_rate = pass_rate(args.baseline)
candidate_rate = pass_rate(args.candidate)
ratio = candidate_rate / baseline_rate if baseline_rate > 0 else 0
print(f"Baseline: {baseline_rate:.2%} Candidate: {candidate_rate:.2%} Ratio: {ratio:.3f}")
if ratio < args.threshold:
    print(f"REGRESSION: candidate retains only {ratio:.1%} of baseline pass rate")
    sys.exit(1)
print("PASS: no regression detected")
Per pf-cli: --filter-failing <path> re-runs only cases that failed in a prior eval result file, useful for targeted investigation after a regression is detected.
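An aggregate ratio can hide offsetting changes: one case regressing while another is fixed leaves the pass rate unchanged. A per-case comparison makes those flips visible. The sketch below assumes each run has already been reduced to a mapping from a stable case identifier to pass/fail. Using the __description values as identifiers is an assumption, and `flipped_cases` is a hypothetical helper.

```python
def flipped_cases(baseline: dict[str, bool], candidate: dict[str, bool]) -> dict[str, list[str]]:
    """Classify cases by how their pass/fail status changed from baseline to candidate."""
    return {
        "regressed": sorted(k for k, ok in baseline.items() if ok and candidate.get(k) is False),
        "fixed": sorted(k for k, ok in baseline.items() if not ok and candidate.get(k) is True),
        "missing": sorted(k for k in baseline if k not in candidate),
    }


baseline = {"return-policy-basic": True, "refund-summary": True, "structured-output": False}
candidate = {"return-policy-basic": True, "refund-summary": False, "structured-output": True}
report = flipped_cases(baseline, candidate)
print(report)
```

In this example both runs pass two of three cases, so the ratio check reports no regression even though refund-summary now fails.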
Step 6 - CI gating
Store the baseline JSON as a CI artifact or in version control alongside the dataset version tag. The CI job fails if the regression ratio drops below the threshold.
GitHub Actions pattern (per pf-gha and de-gh):
name: llm-regression
on:
  pull_request:
    paths:
      - 'prompts/**'
      - 'datasets/**'
      - '.github/workflows/llm-regression.yml'
  push:
    branches: [main]  # lets the baseline upload steps below run on main
jobs:
  regression:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Note: download-artifact@v4 reads only the current run's artifacts unless
      # run-id and github-token point it at the run that uploaded the baseline.
      - name: Restore baseline artifact
        uses: actions/download-artifact@v4
        with:
          name: llm-baseline
          path: results/
      - name: Run candidate eval (promptfoo)
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          npx promptfoo@latest eval \
            --config promptfooconfig-regression.yaml \
            -o results/candidate.json
      - name: Check regression threshold
        run: python3 scripts/regression_diff.py results/baseline.json results/candidate.json --threshold=0.95
      # On main: promote the passing candidate run to the new baseline
      - name: Promote candidate to baseline
        if: github.ref == 'refs/heads/main' && success()
        run: cp results/candidate.json results/baseline.json
      - name: Upload new baseline
        if: github.ref == 'refs/heads/main' && success()
        uses: actions/upload-artifact@v4
        with:
          name: llm-baseline
          path: results/baseline.json
For DeepEval: deepeval test run tests/llm_regression/ --workers 4 returns a non-zero exit code when any assert_test() call fails (per de-gh), which CI treats as a pipeline failure.
Anti-patterns
| Anti-pattern | Why it fails | Fix |
|---|---|---|
| Dataset mutated without a version bump | Baseline scores become meaningless | Bump the version and tag the dataset file with the model snapshot it was authored against (Step 1) |
| No temperature: 0 on candidate model | Non-deterministic outputs make pass rate noisy | Set temperature: 0 + seed at provider level (Step 2) |
| Only LLM-as-judge assertions | Judge cost scales linearly; judge model drifts too | Layer deterministic assertions first; use judge only where needed (Step 3) |
| Pass-rate threshold of 1.0 | Any new case added to the dataset will fail until outputs are updated | Use a ratio vs baseline (e.g., 0.95) rather than an absolute floor |
| Baseline JSON not stored in version control | No reproducible diff when investigating a regression | Commit baseline JSON alongside the dataset version tag |
| Pinning "latest" model in CI | Provider silently updates the model; baseline shifts without notice | Pin explicit model snapshots (e.g., gpt-4.1-2025-04-14) |
| Single assertion type across all cases | Structural failures masked by rubric leniency (or vice versa) | Use all four assertion layers proportionate to case type (Step 3) |
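The "deterministic first, judge only where needed" fix from the table can be read as a cheapest-first cascade in custom grading code. The sketch below is one possible interpretation, not promptfoo or DeepEval behavior (promptfoo evaluates every assertion listed for a test): `grade` and `fake_judge` are hypothetical names, and the judge is a stand-in function rather than a real model call.

```python
from typing import Callable


def grade(output: str, expected: str, judge: Callable[[str, str], bool]) -> tuple[bool, str]:
    """Run the cheap deterministic check first; call the judge only if it fails."""
    if expected in output:
        return True, "deterministic"
    return judge(output, expected), "judge"


judge_calls: list[str] = []


def fake_judge(output: str, expected: str) -> bool:
    # Stand-in for an LLM-as-judge call; it only records that it was invoked.
    judge_calls.append(output)
    return False


first = grade("We offer a 30-day full refund.", "30-day", fake_judge)
second = grade("Refunds are available for a month.", "30-day", fake_judge)
print(first, second, len(judge_calls))
```

Only the second case reaches the judge, so judge cost scales with the number of cases that deterministic checks cannot settle rather than with the size of the dataset.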