pdf-accessibility-checker
Test PDF accessibility (PDF/UA conformance) - tagged-PDF structure (StructTreeRoot), alternative text on images (Alt), reading-order, language metadata (Lang), document title, heading hierarchy. Use veraPDF / PAC (PDF Accessibility Checker) / pdfix / Adobe Acrobat Pro headless; map each finding back to WCAG 2.1 PDF Techniques (PDF1 - PDF23).
Per the WCAG 2.1 spec, the checks here cover three requirements: text alternatives (SC 1.1.1), structured information that can be programmatically determined (SC 1.3.1), and language identification (SC 3.1.1).
When to use
Step 1 - Pick a checker
| Tool | Strength |
|---|---|
| veraPDF (open source) | PDF/UA-1 + PDF/A validation; CI-friendly CLI |
| PAC (PDF Accessibility Checker) by axes4 | UI-first; comprehensive Matterhorn Protocol coverage |
| pdfix | Commercial; auto-tagging fixes |
| Adobe Acrobat Pro Accessibility Check | Industry default; manual-review-friendly |
For CI, veraPDF is the default open-source choice.
Step 2 - Install veraPDF
```bash
# Download installer
curl -L -o verapdf.zip https://software.verapdf.org/releases/1.27/verapdf-greenfield-1.27.0-installer.zip
unzip verapdf.zip
verapdf-greenfield/verapdf --version
```

(Verify the current release at https://verapdf.org for the URL above.)
Step 3 - Run conformance check
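```bash
verapdf --flavour ua1 --format json out.pdf > vera-report.json
```

Exit codes: 0 = conformant, 1 = non-conformant. Other non-zero codes signal a run error, such as bad parameters or a file that could not be parsed; check the veraPDF CLI documentation for the exact values in your version.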
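A minimal sketch of turning the exit status into a test-friendly outcome. It assumes only that 0 means conformant and 1 means non-conformant, and treats every other code as a tool error rather than an accessibility verdict. The names `classify_verapdf_exit` and `run_verapdf` are illustrative and not part of veraPDF.

```python
import subprocess


def classify_verapdf_exit(code: int) -> str:
    """Map a veraPDF process exit code to a coarse outcome label."""
    if code == 0:
        return "conformant"
    if code == 1:
        return "non-conformant"
    # Anything else is treated as a run error, not a verdict on the PDF.
    return "error"


def run_verapdf(pdf_path: str, report_path: str = "vera-report.json") -> str:
    """Run the Step 3 command, save the JSON report, return the outcome label."""
    result = subprocess.run(
        ["verapdf", "--flavour", "ua1", "--format", "json", pdf_path],
        capture_output=True, text=True,
    )
    with open(report_path, "w", encoding="utf-8") as fh:
        fh.write(result.stdout)
    return classify_verapdf_exit(result.returncode)
```

Keeping "error" separate from "non-conformant" stops a broken install or a bad path from being reported as an accessibility failure.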
Step 4 - Parse and assert
```python
import json
import subprocess

def test_pdf_passes_pdf_ua_1():
    result = subprocess.run(
        ["verapdf", "--flavour", "ua1", "--format", "json", "out.pdf"],
        capture_output=True, text=True,
    )
    report = json.loads(result.stdout)
    # Read the summary from jobs[0].validationResult
    job = report["report"]["jobs"][0]
    failed = job["validationResult"]["details"]["failedRules"]
    assert failed == 0, f"PDF/UA validation failed: {job['validationResult']['details']}"
```

Step 5 - Verify tagged PDF (StructTreeRoot)
A PDF is "tagged" if its catalog has a StructTreeRoot and MarkInfo /Marked true. Without these, screen readers cannot read the document semantically.
```python
import pikepdf

def test_pdf_is_tagged():
    pdf = pikepdf.open("out.pdf")
    catalog = pdf.Root
    assert "/StructTreeRoot" in catalog, "PDF lacks StructTreeRoot (untagged)"
    mark_info = catalog.get("/MarkInfo", {})
    # Use bool(), not `is True`: depending on the pikepdf version, the value
    # may come back as a pikepdf.Object wrapper rather than the Python
    # singleton True, in which case `is True` would be False even when
    # /Marked is set (per [pikepdf objects]).
    assert bool(mark_info.get("/Marked", False)), "PDF MarkInfo /Marked is false"
```

Step 6 - Verify document title + language metadata
Per the WCAG 2.1 spec, language identification + descriptive title are required.
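As a rough illustration of what a "valid" language value can mean in a test, the sketch below checks that a string has the basic shape of a language tag such as `en` or `en-US`. The regular expression is a deliberate simplification and an assumption of this example, not a full BCP 47 validator, and `looks_like_language_tag` is a hypothetical helper name.

```python
import re

# Simplified shape: a 2-3 letter primary subtag, then optional subtags of
# 2-8 alphanumerics separated by hyphens (e.g. "en", "en-US", "zh-Hant-TW").
# This approximates the tag syntax; it does not check subtags against a registry.
_LANG_TAG = re.compile(r"^[A-Za-z]{2,3}(-[A-Za-z0-9]{2,8})*$")


def looks_like_language_tag(value) -> bool:
    """Return True when value has the rough shape of a language tag."""
    if value is None:
        return False
    return bool(_LANG_TAG.match(str(value).strip()))
```

A shape check like this accepts any plausibly formed tag, whereas an explicit allow-list also catches a document tagged with the wrong language; which one fits depends on how many locales you publish.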
```python
import pikepdf

def test_pdf_has_title_and_lang():
    pdf = pikepdf.open("out.pdf")
    info = pdf.docinfo
    title = info.get("/Title")
    assert title and str(title).strip(), "PDF has no /Info /Title"
    lang = pdf.Root.get("/Lang")
    # Allow-list of expected languages; extend it for your own documents.
    assert lang and str(lang) in ("en", "en-US", "en-GB", "de", "fr"), f"PDF /Lang invalid: {lang}"
```

Step 7 - Verify image Alt text presence
```python
import pikepdf

def test_all_images_have_alt():
    pdf = pikepdf.open("out.pdf")
    structure_root = pdf.Root["/StructTreeRoot"]
    # The structure tree is NESTED: each element's /K can hold child
    # elements to arbitrary depth, so /Figure tags live below /Document,
    # /Sect, /P, etc. A top-level-only scan misses nested figures and
    # passes vacuously. Walk every /K recursively. Each /Figure should
    # carry a non-empty /Alt per [WCAG PDF1].
    untagged_images = []

    def walk(node):
        if isinstance(node, pikepdf.Array):
            for child in node:
                walk(child)
            return
        if not isinstance(node, pikepdf.Dictionary):
            return  # marked-content id (int) or other leaf
        if node.get("/S") == "/Figure":
            alt = node.get("/Alt")
            if not alt or not str(alt).strip():
                untagged_images.append(node)
        kids = node.get("/K")
        if kids is not None:
            walk(kids)

    walk(structure_root.get("/K"))
    assert untagged_images == [], f"Images without /Alt: {len(untagged_images)}"
```

Step 8 - Reading order verification
Reading order is determined by a depth-first traversal of the structure tree. Verify that the generated structure puts content in the order a human reader would expect:

```python
def test_reading_order_matches_visual_order():
    # TEMPLATE: the title and section labels below are placeholders for an
    # invoice layout. Replace them with the expected reading-order landmarks
    # for your own document before using this test.
    # extract_text_with_structure_order is a helper you must supply; it is
    # assumed to return one string per page, in structure-tree order.
    text_per_page = extract_text_with_structure_order("out.pdf")
    # First page should start with the document's leading landmark.
    assert text_per_page[0].startswith("Invoice #")  # replace per document
    # Each expected landmark must appear after the previous one.
    expected = ["Bill To", "Items", "Total", "Footer"]  # replace per document
    full_text = "\n".join(text_per_page)
    position = 0
    for section in expected:
        found = full_text.find(section, position)
        assert found != -1, f"'{section}' missing or out of reading order"
        position = found + len(section)
```

This is best done via PAC or visual inspection for high-stakes documents; automating reading-order verification is hard.
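To make the depth-first reading order concrete, here is a small sketch over a toy model of a structure tree. The model is an assumption for illustration: each element is a plain dict with a tag `S` and a list of kids `K`, and a kid is either another element or an integer marked-content id. Real structure trees are PDF objects, and `/K` may also hold a single item instead of a list.

```python
from typing import Iterator, Union

# Toy model (assumption): dict elements with "S" and "K"; int leaves are
# marked-content ids.
Node = Union[dict, int]


def reading_order(node: Node) -> Iterator[int]:
    """Yield marked-content ids in depth-first, left-to-right order."""
    if isinstance(node, int):
        yield node
        return
    for kid in node.get("K", []):
        yield from reading_order(kid)


tree = {
    "S": "Document",
    "K": [
        {"S": "H1", "K": [0]},
        {"S": "Sect", "K": [{"S": "P", "K": [1, 2]}, {"S": "Figure", "K": [3]}]},
        {"S": "P", "K": [4]},
    ],
}
print(list(reading_order(tree)))  # [0, 1, 2, 3, 4]
```

If a generator emits the elements in a different order than the page layout suggests, the traversal yields the ids out of sequence, which is the failure a reading-order test is trying to catch.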
Step 9 - Map findings to WCAG 2.1 Techniques
Per the WCAG 2.1 spec, PDF Techniques document specific patterns:
| Technique | What it covers |
|---|---|
| PDF1 | Applying alt text to images |
| PDF2 | Creating bookmarks |
| PDF3 | Ensuring correct tab order |
| PDF4 | Hiding decorative images via Artifact tagging |
| PDF6 | Using table elements (<Table> / <TR> / <TH> / <TD>) |
| PDF9 | Heading structure (H1 / H2 / H3) |
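| PDF12 | Name, role, and value information for form fields |
| PDF15 | Submit buttons with the submit-form action in PDF forms |
| PDF16 | Default language via /Lang in the document catalog |
| PDF18 | Document title in DocInfo |
| PDF19 | Language of a passage or phrase via a per-element Lang entry |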
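The table can be kept next to the tests as a small lookup so that failure messages cite techniques consistently. This is a sketch: the check names used as keys and the `failure_message` helper are hypothetical, and only a few rows of the table are included.

```python
# Keys are hypothetical check names; values are technique ids from the table.
TECHNIQUE_FOR_CHECK = {
    "image_alt": "PDF1",
    "bookmarks": "PDF2",
    "tab_order": "PDF3",
    "decorative_artifact": "PDF4",
    "table_markup": "PDF6",
    "headings": "PDF9",
    "document_title": "PDF18",
}


def failure_message(check: str, detail: str) -> str:
    """Prefix a failure detail with the WCAG technique id for the check."""
    technique = TECHNIQUE_FOR_CHECK.get(check, "unmapped")
    return f"{technique}: {detail} (WCAG 2.1)"
```

For example, `failure_message("document_title", "Document title required")` produces `PDF18: Document title required (WCAG 2.1)`.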
When a check fails, cite the technique in the failure message:
```python
assert title, "PDF18: Document title required (WCAG 2.1)"
```

Step 10 - CI gate
```yaml
- name: PDF/UA conformance
  run: |
    for pdf in out/*.pdf; do
      # jq -e sets the exit status from the result: false/null gives exit 1,
      # so the gate fails on violations. Plain jq exits 0 regardless of the
      # printed boolean (per [jq manual]), so `|| exit 1` would never fire.
      verapdf --flavour ua1 --format json "$pdf" \
        | jq -e '.report.jobs[0].validationResult.details.failedRules == 0' > /dev/null || exit 1
    done
```

Anti-patterns
| Anti-pattern | Why it fails | Fix |
|---|---|---|
| Generate PDF with printBackground: true and skip tag check | Visually fine; screen-reader-broken | veraPDF flavour ua1 (Step 3) |
| Convert via tools that strip tags (some HTML→PDF engines) | Pre-tag stripped during convert | Use tagged-PDF-capable engines (WeasyPrint with tagged-pdf option) |
| Skip /Lang | Screen readers use wrong pronunciation | Step 6 |
| Auto-generate Alt = filename | "logo.png" doesn't help screen reader | Manual Alt review for high-stakes |
| Only run in pre-prod, not CI | Inaccessible PDFs ship | Block PR (Step 10) |
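The "Alt = filename" anti-pattern in the table above can be partly caught with a heuristic. The sketch below flags alt text that ends in a common image file extension. The extension list and the `alt_looks_like_filename` name are assumptions of this example, and a check like this complements manual Alt review rather than replacing it.

```python
import re

# Common image extensions (an assumption; extend to match your asset pipeline).
_FILE_EXT = re.compile(r"\.(png|jpe?g|gif|svg|bmp|tiff?|webp)$", re.IGNORECASE)


def alt_looks_like_filename(alt: str) -> bool:
    """Return True when alt text ends with an image file extension."""
    return bool(_FILE_EXT.search(alt.strip()))
```

Applied to each `/Alt` value collected while walking the structure tree, this catches values such as `logo.png` that are present but say nothing about the image.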