openai-evals
Authors and runs OpenAI Evals - Python framework + registry for evaluating LLMs and LLM-backed systems with `oaieval <model> <eval-name>` CLI; supports template-based evals (Match / Includes / FuzzyMatch / ModelBasedClassify) defined in `evals/registry/evals/*.yaml` against JSONL data files in `evals/registry/data/`, plus custom Python eval classes implementing the Eval interface. Use when the user works with the openai/evals repo, needs the OpenAI-curated eval registry, or contributes new evals via PR to the registry.
openai-evals
Overview
Per oa-gh, a registry of YAML eval-specs lives under evals/registry/evals/, each pointing to a JSONL data file under evals/registry/data/ (Git-LFS managed). The oaieval CLI runs an eval against any completion-function-protocol model - either one of OpenAI's curated evals or a custom one registered by the team.
When to use
For new projects without a registry-contribution motive, evaluate promptfoo-evaluation or deepeval-evaluation first - both have lower friction for non-OpenAI workflows.
Step 1 - Install
For running existing evals (per oa-gh):
pip install evalsFor contributing new evals (clone first, then editable install):
git clone https://github.com/openai/evals.git
cd evals
pip install -e .The editable install is required to register new evals and access the full registry source.
Step 2 - Run an eval
Per github.com/openai/evals/blob/main/docs/run-evals.md:
oaieval gpt-3.5-turbo test-matchPattern: oaieval <model> <eval-name>. Per oa-run:
"Any implementation of the CompletionFn protocol can be run against oaieval."
Eval names are "specified in the YAML files under evals/registry/evals" (oa-run); implementations live in evals/elsuite.
Step 3 - Logging
Per oa-run:
"logging locally or to Snowflake will write to tmp/evallogs"
Override with --record_path /custom/path/. Logs are JSONL events "which can be inspected using a text editor or analyzed programmatically" (oa-run).
Common flags (oa-run):
Step 4 - Eval templates (YAML-defined)
Eval templates avoid Python authoring for common evaluation patterns. The four built-in templates per oa-gh (referenced in eval-templates.md):
A registered eval YAML lives at evals/registry/evals/<name>.yaml and references a JSONL file at evals/registry/data/<name>/samples.jsonl. Each JSONL row contains the input prompt + the ideal field used by the template.
Step 5 - Custom Python evals
For grading logic beyond templates, subclass the Eval interface. The full pattern lives in docs/custom-eval.md and docs/build-eval.md in the oa-gh repository - author per the doc when authoring, then register in the YAML registry.
Step 6 - CI integration
OpenAI Evals does not ship a first-party CI action. Pattern:
oaieval gpt-4 my-eval --record_path ./evallogs
# parse JSONL evallog for pass-rate; fail CI if below threshold
jq -s '[.[] | select(.spec) | .]' ./evallogs/<run>.jsonl # extract spec + outcomesFor PR-comment integration, parse the events JSONL into a summary and post via gh CLI (no built-in action).
Anti-patterns
| Anti-pattern | Why it fails | Fix |
|---|---|---|
Pick Match template for open-ended generation | Exact-match always fails on creative outputs | Use ModelBasedClassify (Step 4) |
Skip --record_path in CI | Logs land in /tmp and disappear between steps | Always pass --record_path |
| Custom Python eval without registry YAML | oaieval can't find it | Register the YAML alongside the Python class (Step 5) |
Run on gpt-3.5-turbo only | Model-version drift; results not reproducible | Pin specific snapshot (e.g., gpt-4-0613) |