corpus-management-reference

Pure-reference catalog of fuzz-corpus management practices. Defines what a corpus is (seed corpus + evolved corpus saved by the fuzzer), corpus directory layout per libFuzzer / AFL++ / Go native / cargo-fuzz / OSS-Fuzz, the canonical crash-artefact naming (crash-<sha1> / leak-<sha1> / timeout-<sha1>), seed corpus construction strategies (sample-from-prod, sample-from-test-fixtures, from-spec-keywords), corpus minimisation, dictionary files, and the OSS-Fuzz integration corpus sync. Use as the corpus-discipline reference when building a fuzz target or maintaining a long-running fuzz campaign.

Overview

Pure-reference catalog of corpus-management practices across libFuzzer, AFL++, native Go fuzz, cargo-fuzz, Atheris, Jazzer, and OSS-Fuzz. Consumed by the per-language fuzzer skills and the fuzz-target authoring agent.

When to use

  • Bootstrapping a new fuzz target - what should the seed corpus contain?
  • Running a long-running fuzz campaign - when to minimise, where to back up, how to share across CI runs.
  • Debugging a crash - locating the crash artefact, reproducing it.
  • Migrating between fuzzers (libFuzzer ↔ AFL++) - corpus format compatibility.

Corpus components

A fuzzing corpus has three roles:

| Role | What it is | Where it lives |
| --- | --- | --- |
| Seed corpus | Hand-curated initial inputs covering known interesting paths | Versioned in repo (fuzz/seeds/) |
| Evolved corpus | Inputs the fuzzer added because they hit new coverage | Output directory (fuzz/corpus/) - typically not committed |
| Crash artefacts | Inputs that triggered a sanitiser / crash / timeout | Output directory + bug-report attachments |

Directory layout per fuzzer

libFuzzer

Per llvm.org/docs/LibFuzzer.html:

fuzz/
  fuzz_target.cc       # the harness
  seeds/               # initial inputs (read on startup)
  corpus/              # evolved corpus (read + written)
  fuzz_target          # compiled binary

Invocation:

./fuzz_target -max_total_time=3600 corpus/ seeds/

The first directory listed is the output corpus (writable); subsequent directories are read-only seeds.

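Crash artefacts are saved to the current directory as crash-<sha1>, leak-<sha1>, or timeout-<sha1> (per the LLVM docs), where <sha1> is the SHA-1 of the input bytes. Pass -artifact_prefix=<dir>/ to write them elsewhere.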
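The naming scheme can be sketched in a few lines. This assumes the suffix is the hex SHA-1 of the raw input bytes; the helper names artefact_name and matches_name are hypothetical and not part of libFuzzer.

```python
import hashlib


def artefact_name(data: bytes, kind: str = "crash") -> str:
    """Return the libFuzzer-style artefact filename for an input."""
    if kind not in {"crash", "leak", "timeout"}:
        raise ValueError(f"unexpected artefact kind: {kind}")
    return f"{kind}-{hashlib.sha1(data).hexdigest()}"


def matches_name(file_name: str, data: bytes) -> bool:
    """Check that the part after the first '-' is the SHA-1 of the contents."""
    _, _, digest = file_name.partition("-")
    return digest == hashlib.sha1(data).hexdigest()
```

A check like matches_name is a cheap way to spot an artefact that was edited or truncated after the fuzzer wrote it.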

AFL++

fuzz/
  inputs/              # seed corpus
  output/              # AFL++ output (queue/, crashes/, hangs/)
  fuzz_target          # AFL-instrumented binary (afl-clang-fast)

Invocation:

afl-fuzz -i inputs/ -o output/ -- ./fuzz_target @@

Crash format: output/default/crashes/id:<num>,sig:<signal>,src:<id>,op:<mutator>,....

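AFL++ and libFuzzer both store corpus entries as raw input bytes, so the files themselves are interchangeable when both harnesses consume the same bytes. What differs is the layout: AFL++ keeps its queue under output/default/queue/ with metadata-rich filenames, while libFuzzer is normally pointed at a flat directory. To migrate, copy the queue files into a libFuzzer corpus directory, optionally running afl-cmin first to drop redundant entries.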
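Moving an AFL++ queue into a libFuzzer corpus directory can be sketched as below. It assumes the queue entries are the raw test-case bytes and that only filenames and directory layout need to change; import_afl_queue is a hypothetical helper, not a tool shipped with either fuzzer.

```python
import hashlib
import shutil
from pathlib import Path


def import_afl_queue(queue_dir: str, corpus_dir: str) -> int:
    """Copy AFL++ queue entries into a flat corpus directory.

    Files are renamed to the SHA-1 of their contents, which also drops
    byte-identical duplicates. Returns the number of files written.
    """
    out = Path(corpus_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = 0
    for entry in sorted(Path(queue_dir).iterdir()):
        # Skip AFL++ bookkeeping such as the .state/ directory.
        if not entry.is_file() or entry.name.startswith("."):
            continue
        digest = hashlib.sha1(entry.read_bytes()).hexdigest()
        target = out / digest
        if not target.exists():
            shutil.copyfile(entry, target)
            written += 1
    return written
```

Typical use would be import_afl_queue("output/default/queue", "corpus"), followed by a normal libFuzzer run against corpus/.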

Go native (go test -fuzz)

Per go.dev/doc/security/fuzz:

package/
  fuzz_test.go         # contains FuzzXxx functions
  testdata/
    fuzz/
      FuzzXxx/
        seedfile1      # seed inputs (hand-curated)
        seedfile2

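Failures are written automatically to testdata/fuzz/FuzzXxx/ with a generated filename; they are meant to be committed as regression cases. The evolved corpus is kept separately in the Go build cache ($GOCACHE/fuzz), not in the package directory.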
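Files under testdata/fuzz/FuzzXxx/ are not raw bytes: they use a small text encoding with a version header and one typed value per line. The sketch below converts a raw input into that form, assuming the fuzz target takes a single []byte argument; to_go_corpus_entry is a hypothetical helper. The Go project also provides a file2fuzz command for the same job, which is likely the better choice in practice.

```python
def to_go_corpus_entry(data: bytes) -> str:
    """Encode raw bytes as the body of a Go native-fuzzing corpus file.

    Assumes a fuzz target with a single []byte parameter. Every byte is
    written as a \\xNN escape so no quoting rules need special handling.
    """
    escaped = "".join(f"\\x{b:02x}" for b in data)
    return f'go test fuzz v1\n[]byte("{escaped}")\n'
```

Writing the returned string to testdata/fuzz/FuzzXxx/<any-name> makes the input part of the seed corpus for that target.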

cargo-fuzz

fuzz/
  Cargo.toml
  fuzz_targets/
    fuzz_target_1.rs   # the harness (one per fuzz target)
  corpus/
    fuzz_target_1/     # per-target corpus
  artifacts/
    fuzz_target_1/
      crash-<sha1>

Invocation:

cargo fuzz run fuzz_target_1

Atheris (Python)

fuzz/
  fuzz_target.py       # uses atheris.Setup + atheris.Fuzz
  corpus/

Invocation (Atheris uses libFuzzer's CLI under the hood):

python fuzz_target.py corpus/

Jazzer (JVM)

fuzz/
  FuzzTarget.java      # defines fuzzerTestOneInput (@FuzzTest applies under JUnit 5)
  corpus/

Invocation:

jazzer --cp=target/test-classes \
       --target_class=com.example.FuzzTarget \
       corpus/

OSS-Fuzz

OSS-Fuzz aggregates corpora across Google's infrastructure:

oss-fuzz/
  projects/<project>/
    Dockerfile         # build the fuzzer
    build.sh           # produces $OUT/fuzz_target_1, $OUT/fuzz_target_1_seed_corpus.zip

Per google.github.io/oss-fuzz.

The corpus syncs to gs://<project>-corpus.clusterfuzz-external.appspot.com/.

Seed corpus construction strategies

| Strategy | When | Example |
| --- | --- | --- |
| From spec keywords | New target, no inputs exist | Extract JSON keywords from a JSON parser spec, write each as a tiny file |
| From test fixtures | Existing unit-test inputs cover paths | Copy fixtures from tests/fixtures/*.json to seeds/ |
| From production data | Mature target, prod logs available | Sample 1000 prod requests, strip PII per pii-categories-reference, seed |
| From corpus minimisation | Reduce a large corpus to its coverage-equivalent core | Run afl-cmin or libFuzzer -merge=1 |
| From OSS-Fuzz cousin | Same format, different target | Reuse <format>_seed_corpus.zip from a related OSS-Fuzz project |

A seed corpus of 5-50 hand-curated diverse inputs is typically enough to bootstrap. Bigger isn't always better - the fuzzer finds new paths via mutation.
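The "from test fixtures" strategy can be sketched as a small copy step. The size cap, the glob pattern, and the function name seed_from_fixtures are assumptions for illustration; the idea is only to keep seeds small and distinct.

```python
import hashlib
import shutil
from pathlib import Path


def seed_from_fixtures(fixture_dir: str, seed_dir: str,
                       pattern: str = "*.json",
                       max_bytes: int = 4096) -> list[str]:
    """Copy small, distinct fixtures into the seed directory.

    Returns the names of the fixtures copied, in sorted order.
    """
    out = Path(seed_dir)
    out.mkdir(parents=True, exist_ok=True)
    seen: set[str] = set()
    copied: list[str] = []
    for fixture in sorted(Path(fixture_dir).glob(pattern)):
        data = fixture.read_bytes()
        digest = hashlib.sha1(data).hexdigest()
        # Skip oversized inputs and byte-identical duplicates.
        if len(data) > max_bytes or digest in seen:
            continue
        seen.add(digest)
        shutil.copyfile(fixture, out / fixture.name)
        copied.append(fixture.name)
    return copied
```

For example, seed_from_fixtures("tests/fixtures", "fuzz/seeds") would populate the seed directory from existing JSON fixtures.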

Dictionary files

A dictionary lists tokens the fuzzer prefers when mutating. For a JSON parser:

# fuzz.dict
"{"
"}"
"["
"]"
"true"
"false"
"null"
"\":\""

Invocation: ./fuzz_target -dict=fuzz.dict (libFuzzer) or afl-fuzz -x fuzz.dict (AFL++).

Dictionaries help most on structured formats (JSON, XML, protobuf, SQL), where multi-byte keywords are unlikely to be produced by byte-level mutation alone.
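Dictionary files are easy to get wrong by hand because quotes and backslashes must be escaped. A minimal generator might look like the following; it assumes the quoted-string syntax shown above, with \" and \\ escapes and \xNN for non-printable bytes. The names dict_entry and render_dict are hypothetical.

```python
def dict_entry(token: bytes) -> str:
    """Format one token as a quoted dictionary line."""
    parts = []
    for b in token:
        ch = chr(b)
        if ch in ('"', "\\"):
            parts.append("\\" + ch)      # escape quote and backslash
        elif 0x20 <= b < 0x7F:
            parts.append(ch)             # printable ASCII as-is
        else:
            parts.append(f"\\x{b:02x}")  # everything else as hex
    return '"' + "".join(parts) + '"'


def render_dict(tokens: list[bytes]) -> str:
    """Render a de-duplicated dictionary body, one token per line."""
    return "\n".join(dict_entry(t) for t in sorted(set(tokens))) + "\n"
```

Calling dict_entry(b'":"') reproduces the last entry of the JSON example.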

Corpus minimisation

A corpus that grows over time becomes redundant - many inputs hit the same coverage. Minimise periodically:

# libFuzzer merge mode
mkdir minimised/
./fuzz_target -merge=1 minimised/ corpus/

# AFL++ minimisation
afl-cmin -i corpus/ -o minimised/ -- ./fuzz_target @@

# Per-input minimisation (shrink one input while keeping the same behaviour, here the crash)
afl-tmin -i crash-input -o min-crash-input -- ./fuzz_target @@

Minimisation shortens each fuzz cycle and keeps the set of inputs needed to reproduce existing coverage small.
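Conceptually, corpus minimisation is a greedy cover: visit inputs from smallest to largest and keep only those that add coverage not seen yet. The sketch below illustrates that idea on a hypothetical coverage map (input name to set of edge IDs). It is not presented as the exact algorithm of afl-cmin or libFuzzer's merge mode.

```python
def minimise(coverage: dict[str, set[int]],
             sizes: dict[str, int]) -> list[str]:
    """Keep the inputs that add coverage, visiting smallest inputs first."""
    kept: list[str] = []
    covered: set[int] = set()
    # Ties on size are broken by name so the result is deterministic.
    for name in sorted(coverage, key=lambda n: (sizes[n], n)):
        new_edges = coverage[name] - covered
        if new_edges:
            kept.append(name)
            covered |= new_edges
    return kept
```

Preferring small inputs matters because every kept input is re-executed and mutated on later runs.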

Crash artefact handling

When the fuzzer finds a crash:

  1. Reproduce locally:
    ./fuzz_target crash-<sha1>
    # Triggers the same sanitiser report
    
  2. Minimise the crash input (see above) so the bug report is small.
  3. File the bug via bug-report-from-failure with the minimised crash as an attachment.
  4. Add the crash input to the seed corpus as a regression test (the minimised one, plus the original if the two take different paths) - re-runs will catch reintroduction.
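Step 1 can be automated once several artefacts have accumulated. The sketch below replays each artefact against the target and reports which ones still fail. It assumes the target exits non-zero when an input reproduces a failure and accepts the input path as its last argument; replay_artefacts and ARTEFACT_PREFIXES are hypothetical names.

```python
import subprocess
from pathlib import Path

ARTEFACT_PREFIXES = ("crash-", "leak-", "timeout-")


def replay_artefacts(target_cmd: list[str], artefact_dir: str,
                     timeout_s: float = 60.0) -> dict[str, bool]:
    """Run the target once per artefact; True means it still fails."""
    results: dict[str, bool] = {}
    for path in sorted(Path(artefact_dir).iterdir()):
        if not path.name.startswith(ARTEFACT_PREFIXES):
            continue
        try:
            proc = subprocess.run(target_cmd + [str(path)],
                                  capture_output=True, timeout=timeout_s)
            results[path.name] = proc.returncode != 0
        except subprocess.TimeoutExpired:
            results[path.name] = True  # a hang still counts as a failure
    return results
```

After a fix lands, replay_artefacts(["./fuzz_target"], ".") should report False for the artefact that motivated it.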

CI integration

Long-running fuzz campaigns run continuously; CI runs short "smoke fuzz" campaigns (~5 min):

- name: Smoke fuzz
  run: ./fuzz_target -max_total_time=300 corpus/ seeds/

Persist evolved corpus to a CI cache so coverage accumulates across runs:

- uses: actions/cache@v4
  with:
    path: corpus/
    key: fuzz-corpus-${{ github.sha }}
    restore-keys: fuzz-corpus-

Anti-patterns

| Anti-pattern | Why it fails | Fix |
| --- | --- | --- |
| No seed corpus | Fuzzer wanders randomly; takes hours to find shallow bugs | Always provide 5-50 hand-curated seeds |
| Committing evolved corpus to repo | Repo bloats; PR diffs hide signal | Persist via CI cache or object storage |
| Mixing seed and evolved corpus in one directory | Loses provenance | Separate directories; seeds read-only |
| No dictionary for structured formats | Fuzzer spends cycles re-discovering keywords | Always provide a dict for JSON / XML / protobuf / SQL |
| Never minimising | Cycle time grows; coverage redundant | Minimise weekly or per major change |
| Crash artefact deleted after fix | Lose regression coverage | Add minimised crash to seed corpus |
| Single fuzzer assumed | Different fuzzers find different bugs | Run libFuzzer + AFL++ on same target periodically |
| Sharing prod-sourced corpus without PII review | GDPR / HIPAA leak | Pass corpus through PII detection before sharing |

Limitations

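  • Corpus format differences. libFuzzer, AFL++, and cargo-fuzz all store raw input bytes, so files move between them when the harnesses consume the same bytes, though directory layout and filenames differ. Go native fuzzing uses its own typed text encoding and needs conversion.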
  • Corpus rot. An evolved corpus reflects the coverage of the binary that produced it; after the target changes, some inputs may no longer reach the paths they were kept for.
  • Dictionary maintenance. Dictionaries should evolve with the target's grammar - manual upkeep.
  • Corpus storage. Long campaigns produce GB-scale corpora, which need a deliberate storage strategy (S3 / GCS / CI artifacts).
  • No semantic seeding for opaque formats. Custom binary formats may need a generator to bootstrap coverage.

References