pdf-snapshot-tester
Test PDF outputs by converting per-page to images (`pdftocairo` / pdf2image / Poppler) and running pixel-diff (pixelmatch / Resemble.js / Pillow `ImageChops`) against approved baselines. Per-page-range targeting, threshold tuning, font-substitution warnings, byte-stable PDF metadata stripping (CreationDate, /ID).
PDFs are binary documents with embedded fonts, embedded images, and CreationDate/ID metadata. Direct binary diff is useless. The canonical approach: render per-page to image, then pixel-diff against approved baselines.
When to use
Step 1 - Install Poppler + pdf2image
    # Linux
    apt-get install -y poppler-utils
    # macOS
    brew install poppler
    # Python wrapper
    pip install pdf2image pillow

Poppler ships pdftocairo and pdftoppm, the workhorses for PDF → image conversion.
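Before running the suite, it can help to fail fast when the Poppler binaries are not on PATH. The following is a minimal sketch using only the standard library; `missing_tools` is a hypothetical helper name, not part of pdf2image or Poppler.

```python
import shutil


def missing_tools(tools=("pdftocairo", "pdftoppm")):
    """Return the names of required executables that are not found on PATH."""
    return [name for name in tools if shutil.which(name) is None]
```

A test session could call `missing_tools()` once at startup and abort with a clear message if the returned list is not empty.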
Step 2 - Render PDF pages to images
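    from pathlib import Path

    from pdf2image import convert_from_path

    out_dir = Path("rendered")
    out_dir.mkdir(exist_ok=True)  # the output folder must exist before rendering

    pages = convert_from_path(
        "out.pdf",
        dpi=150,
        fmt="png",
        output_folder=str(out_dir),
        paths_only=True,  # return file paths instead of PIL images
    )

dpi=150 balances diff sensitivity against file size. Increase to 300 for high-stakes documents such as regulatory filings.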
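To make the dpi trade-off concrete, the arithmetic below estimates the rendered size of a page. It is a sketch under the assumption of a US Letter page (8.5 × 11 inches); `page_pixels` is a hypothetical helper, and the exact size Poppler emits can differ by a pixel because of rounding.

```python
def page_pixels(width_in, height_in, dpi):
    """Pixel dimensions of a page of the given size in inches rendered at dpi."""
    return round(width_in * dpi), round(height_in * dpi)


# US Letter is 8.5 x 11 inches.
w150, h150 = page_pixels(8.5, 11, 150)  # (1275, 1650)
w300, h300 = page_pixels(8.5, 11, 300)  # (2550, 3300)

# Doubling the dpi quadruples the number of pixels to store and compare.
ratio = (w300 * h300) / (w150 * h150)  # 4.0
```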
CLI alternative:
    pdftocairo -png -r 150 out.pdf rendered/page
    # produces rendered/page-1.png, rendered/page-2.png, ...
    # (page numbers are zero-padded when the PDF has 10 or more pages)

Step 3 - Pixel-diff against baseline
    from PIL import Image, ImageChops

    def pixel_diff(actual_path, baseline_path, channel_tolerance=5):
        """Return the fraction of pixels that differ by more than channel_tolerance in any channel."""
        a = Image.open(actual_path).convert("RGB")
        b = Image.open(baseline_path).convert("RGB")
        if a.size != b.size:
            return 1.0  # full mismatch on dimension change
        diff = ImageChops.difference(a, b)
        if not diff.getbbox():
            return 0.0  # images are identical
        # Ignore tiny per-channel differences caused by anti-aliasing
        diff_pixels = sum(1 for px in diff.getdata() if any(c > channel_tolerance for c in px))
        total = a.size[0] * a.size[1]
        return diff_pixels / total

Or use pixelmatch (Node) for a maintained reference implementation.
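The ratio returned by pixel_diff is easier to tune once it is translated into a pixel count. The sketch below is plain arithmetic: `pixel_budget` is a hypothetical helper, the 1275 × 1650 page size assumes US Letter at 150 dpi, and 0.005 is the threshold used in the Step 4 assertion.

```python
import math


def pixel_budget(width_px, height_px, threshold):
    """Largest number of differing pixels that still satisfies diff_ratio < threshold."""
    if threshold <= 0:
        raise ValueError("threshold must be positive")
    total = width_px * height_px
    # Strict inequality: diff_pixels / total < threshold
    return math.ceil(total * threshold) - 1


budget = pixel_budget(1275, 1650, 0.005)  # 10518
```

A budget of roughly ten thousand pixels is far larger than a few changed glyphs, so a page-wide ratio may let a single altered digit pass. For fields where that matters, a tighter threshold or a diff over a cropped region is likely more appropriate.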
Step 4 - Per-page assertion
    def test_invoice_pdf_matches_baseline(tmp_path):
        actual_pdf = tmp_path / "invoice.pdf"
        # generate_invoice is the application code under test
        generate_invoice(invoice_id="inv_001", out=actual_pdf)
        pages = convert_from_path(actual_pdf, dpi=150)
        for i, page_img in enumerate(pages, start=1):
            # Save each rendered page so pixel_diff can compare files
            actual = tmp_path / f"actual-{i}.png"
            page_img.save(actual, "PNG")
            baseline = Path(f"tests/pdf-baselines/inv_001-{i}.png")
            diff_ratio = pixel_diff(actual, baseline)
            assert diff_ratio < 0.005, f"Page {i} diff ratio {diff_ratio:.4f}"

Step 5 - Page-range targeting
For long PDFs (statements, prospectuses), test only changed pages:
    pages = convert_from_path(
        "out.pdf",
        dpi=150,
        first_page=2,
        last_page=5,
    )

CLI:

    pdftocairo -png -r 150 -f 2 -l 5 out.pdf rendered/page

Step 6 - Strip non-deterministic PDF metadata
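PDFs include /CreationDate, /ID, and sometimes /ModDate. These change on every run and break byte diffs. qpdf can normalize the file structure and make /ID reproducible:

    qpdf --linearize \
      --object-streams=disable \
      --stream-data=uncompress \
      --deterministic-id \
      --remove-attachments \
      out.pdf normalized.pdf

These flags alone leave /CreationDate and /ModDate in place, so pin the dates in the generator or remove them in a separate step before comparing bytes. --deterministic-id derives /ID from the output content, so it only becomes stable once the dates are.

Alternative: rely on the image diff (Steps 2-3), which is metadata-free by construction.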
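If a byte-level comparison is still wanted, one option is to blank the volatile values before hashing. The sketch below is an illustration under strong assumptions, not a PDF parser: it only works when the Info dictionary and trailer are stored uncompressed and unencrypted, it ignores dates inside XMP metadata, and the masked bytes are meant for comparison only (they are no longer a valid PDF). `masked_digest` is a hypothetical helper name.

```python
import hashlib
import re

# Matches /CreationDate (D:...) and /ModDate (D:...) literal strings.
_DATE_RE = re.compile(rb"/(CreationDate|ModDate)\s*\(D:[^)]*\)")
# Matches a trailer /ID array made of two hex strings.
_ID_RE = re.compile(rb"/ID\s*\[\s*<[0-9A-Fa-f]*>\s*<[0-9A-Fa-f]*>\s*\]")


def masked_digest(pdf_bytes: bytes) -> str:
    """SHA-256 of the PDF bytes with date and /ID values blanked out."""
    masked = _DATE_RE.sub(lambda m: b"/" + m.group(1) + b" ()", pdf_bytes)
    masked = _ID_RE.sub(b"/ID [<> <>]", masked)
    return hashlib.sha256(masked).hexdigest()
```

Two runs of the same generator should then produce equal digests, while any change outside the masked fields still changes the digest.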
Step 7 - Font-substitution detection
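Fonts that are not embedded in the PDF are substituted by whatever the rendering host has installed, which produces visually different output. List the fonts with pdffonts (part of poppler-utils) and fail when any of them is not embedded:

    import subprocess

    result = subprocess.run(
        ["pdffonts", "out.pdf"],
        capture_output=True, text=True, check=True,
    )
    # Skip the two header lines. Counted from the right, the columns are
    # emb, sub, uni, and the two-number object ID.
    for line in result.stdout.splitlines()[2:]:
        fields = line.split()
        if len(fields) >= 5 and fields[-5] == "no":
            raise RuntimeError(f"Font {fields[0]} is not embedded; baseline invalid")

For CI, install the production font pack via the package manager or check fonts into the repo for deterministic builds.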
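When fonts are checked into the repo, a fingerprint of the font directory gives CI a cheap way to notice drift. This is a minimal sketch using only the standard library; `fonts_fingerprint` and the default file patterns are assumptions for illustration.

```python
import hashlib
from pathlib import Path


def fonts_fingerprint(font_dir, patterns=("*.ttf", "*.otf")):
    """Return one SHA-256 over the names and contents of all font files in font_dir."""
    digest = hashlib.sha256()
    # Sort so the result does not depend on directory listing order.
    files = sorted(p for pattern in patterns for p in Path(font_dir).rglob(pattern))
    for path in files:
        digest.update(path.name.encode("utf-8"))
        digest.update(path.read_bytes())
    return digest.hexdigest()
```

Storing the fingerprint next to the baselines lets a test fail with a clear "fonts changed" message before any pixel diff runs.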
Step 8 - Update-baseline workflow
Add an opt-in update mode (analogous to Jest snapshots):
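    import os
    import tempfile
    from pathlib import Path
    from pdf2image import convert_from_path

    def assert_pdf_matches(actual_pdf, baseline_dir, threshold=0.005):
        update = os.environ.get("UPDATE_PDF_BASELINES") == "1"
        baseline_dir = Path(baseline_dir)
        baseline_dir.mkdir(parents=True, exist_ok=True)
        pages = convert_from_path(actual_pdf, dpi=150)
        with tempfile.TemporaryDirectory() as tmp:
            for i, page_img in enumerate(pages, start=1):
                baseline = baseline_dir / f"page-{i}.png"
                if update or not baseline.exists():
                    page_img.save(baseline, "PNG")  # write or refresh the baseline
                    continue
                # pixel_diff (Step 3) compares files, so save the rendered page first
                actual = Path(tmp) / f"page-{i}.png"
                page_img.save(actual, "PNG")
                diff = pixel_diff(actual, baseline)
                assert diff < threshold, f"Page {i} diff {diff:.4f}"

Run UPDATE_PDF_BASELINES=1 pytest tests/pdf/ after intentional changes, then commit the new baseline images.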
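An update run only writes baselines for pages that still exist, so baselines for removed pages can linger when a PDF gets shorter. The sketch below lists them, assuming the `page-{i}.png` naming used in this step; `stale_baselines` is a hypothetical helper.

```python
import re
from pathlib import Path

_PAGE_RE = re.compile(r"^page-(\d+)\.png$")


def stale_baselines(baseline_dir, page_count):
    """Return baseline files whose page number exceeds the current page count."""
    stale = []
    for path in sorted(Path(baseline_dir).glob("page-*.png")):
        match = _PAGE_RE.match(path.name)
        if match and int(match.group(1)) > page_count:
            stale.append(path)
    return stale
```

A test could assert that `stale_baselines(baseline_dir, len(pages))` is empty, which also catches an unintended drop in page count.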
Anti-patterns
| Anti-pattern | Why it fails | Fix |
|---|---|---|
| Binary diff PDFs directly | CreationDate / ID change per run | Render to image (Step 2) |
| Rendering at dpi=72 | Small rendering changes fall below one pixel and go undetected | dpi=150 minimum (Step 2) |
| Threshold = 0 | Anti-aliasing flake | threshold ≈ 0.005 (Step 4) |
| Skip font-pack pinning in CI | OS upgrade swaps fonts; baselines invalidate | Check fonts into repo or pin OS image (Step 7) |
| Snapshot every page of 500-page PDF | CI time + storage explodes | Page-range targeting (Step 5) |
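For the page-range targeting in the last row, a list of pages to check can be collapsed into contiguous ranges so that each range needs one render call. This is a sketch; `page_ranges` is a hypothetical helper, and how the list of changed pages is obtained is left to the caller.

```python
def page_ranges(pages):
    """Collapse page numbers into sorted, inclusive (first_page, last_page) ranges."""
    ranges = []
    for page in sorted(set(pages)):
        if ranges and page == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], page)  # extend the current run
        else:
            ranges.append((page, page))  # start a new run
    return ranges


example = page_ranges([5, 2, 3, 4, 9, 10, 2])  # [(2, 5), (9, 10)]
```

Each tuple maps directly onto the `first_page` and `last_page` arguments shown in Step 5.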