data-contract-extractor
Reads a data-product spec (data PRD, dataset README, lineage doc) and emits a structured data contract - schema (columns + types + nullability + PII flags), freshness SLA, volume bounds, distribution invariants, and ownership. The contract is consumable by the qa-data-quality plugin's dbt-testing / great-expectations / soda-checks skills as their assertion baseline. Use when scoping a new data product or formalizing assertions on an existing one.
data-contract-extractor
Overview
A data contract is the agreement between a data producer and its consumers about what the producer will deliver - schema, freshness, volume, semantic invariants - and what the consumer is allowed to depend on. Practitioner-emergent terminology popularized by Andrew Jones / Chad Sanderson and now standard in the data-engineering literature.
Terminology note: "data contract" is practitioner-emergent, not ISTQB-canonical. ISTQB defines
interface testingandcontract testing(HTTP API) but does not have a dedicated data-contract entry. This skill cites industry-engineering sources.
This skill formalizes the prose of a data PRD into a contract that the QA test suite can mechanically assert against - typically via dbt-tests, great-expectations, or soda-checks (see qa-data-quality plugin).
When to use
What goes in a data contract
The five required sections, in order of authoring priority:
1. Schema
For each column:
| Field | Required | Notes |
|---|---|---|
| Name | yes | snake_case; matches the warehouse table. |
| Type | yes | Warehouse-native type (VARCHAR, BIGINT, TIMESTAMP, etc.). |
| Nullable | yes | true / false - the test suite's not_null decision. |
| PK | yes | true for primary-key column(s). |
| Unique | yes | true for candidate keys (separate from PK). |
| FK | optional | If foreign key, the <table>.<column> reference. |
| PII | required | Tag values: none, direct (email/name), indirect (zip, dob alone), sensitive (SSN, payment). |
| Constraint | optional | Range / enum / regex (per the qa-data-quality data-quality-conventions skill). |
| Description | yes | One-sentence semantic meaning; not just the type. |
2. Freshness SLA
| Field | Notes |
|---|---|
| Update cadence | daily / hourly / continuous / weekly. |
| Max staleness | The point past which downstream consumers should treat the data as broken (per data-quality-conventions: typically 2× cadence). |
| Source-of-truth column | The timestamp column the freshness check reads (e.g. updated_at, loaded_at, event_time). |
3. Volume bounds
| Field | Notes |
|---|---|
| Expected min/max | Row-count range under normal operation. |
| Volatility | stable / monotonic-growth / cyclical / event-driven. |
| Recovery action | If volume falls outside bounds, what does the consumer do? |
4. Distribution invariants
For each business-meaningful column, what must hold:
| Field | Example |
|---|---|
| Categorical: accepted values | status ∈ {placed, shipped, completed, returned}. |
| Numeric: range | discount_pct ∈ [0, 100]. |
| Frequency / rate | cancellation_rate ≤ 5% (rolling 7-day window). |
5. Ownership and governance
| Field | Notes |
|---|---|
| Producer team | Who runs the pipeline. |
| Owner handle | Routing handle (Slack / email) for breaks. |
| Consumers | Known downstream models / dashboards / services. |
| Versioning | How breaking changes are communicated. |
Output format
Emit as YAML for direct consumption by dbt / GX / Soda:
# data-contracts/<dataset-slug>.yml
contract_version: 1
dataset:
name: orders
description: One row per customer order; updated within 1h of placement.
schema:
- name: order_id
type: BIGINT
nullable: false
pk: true
unique: true
pii: none
description: Surrogate key.
- name: customer_id
type: BIGINT
nullable: false
fk: customers.customer_id
pii: indirect
description: Customer who placed the order.
- name: email
type: VARCHAR
nullable: false
pii: direct
constraint: { format: email }
description: Customer email at time of order (immutable snapshot).
- name: status
type: VARCHAR
nullable: false
constraint:
accepted_values: [placed, shipped, completed, returned]
description: Current fulfillment status.
- name: discount_pct
type: DECIMAL(5,2)
nullable: true
constraint: { range: [0, 100] }
description: Promotion discount applied; null = no promo.
- name: updated_at
type: TIMESTAMP
nullable: false
description: Last-modified timestamp; freshness source-of-truth.
freshness:
cadence: hourly
max_staleness: 2h
source_column: updated_at
volume:
min_per_day: 100
max_per_day: 1000000
volatility: cyclical
recovery_action: |
Investigate ingestion pipeline at <runbook-url> if volume falls outside bounds for >2 cycles.
distribution:
- column: status
rule: accepted_values
values: [placed, shipped, completed, returned]
- column: discount_pct
rule: range
range: [0, 100]
- column: cancellation_rate
rule: rolling_window
window: 7d
max: 0.05
ownership:
producer_team: data-platform
owner: '@data-platform-oncall'
consumers:
- dbt: marts/orders_summary
- dashboard: 'Order Health'
- service: shipping-microservice
versioning: |
Breaking changes coordinated via #data-contracts Slack 14d in advance.Step-by-step extraction
When reading a data PRD:
Gap flagging
The agent never fabricates contract fields. Every gap becomes an explicit question:
## Contract gaps (HUMAN INPUT REQUIRED)
| Section | Field | Question |
|---------------|------------------------|--------------------------------------------------------------------------|
| schema | `customer_phone.pii` | Direct PII (yes — phone numbers are PII per most jurisdictions). Confirm. |
| freshness | `cadence` | PRD says "fresh data" — daily? hourly? real-time? Each implies a different gate. |
| volume | `max_per_day` | Not specified; required to set a `row_count between` assertion. |
| distribution | `payment_method` | What's the accepted-value set? PRD lists four; are there more (e.g. "apple_pay" added recently)? |Until the gaps are filled, the contract is incomplete and the test suite cannot be generated.
Examples
Example 1: PRD → contract
Input (data PRD excerpt):
"We're building an
orderstable. One row per order. Updated within 1 hour of order placement. Statuses: placed, shipped, delivered, cancelled. Discount up to 100%. Payment methods: card, bank, paypal. Owned by the data platform team."
Output: see the YAML format above (the example mirrors this PRD).
Example 2: minimal source
Input: "We need a users table. Email and name."
Output:
## Contract gaps (HUMAN INPUT REQUIRED)
| Section | Field | Question |
|---------------|----------------------|------------------------------------------------|
| schema | types | `email VARCHAR(?)` `name VARCHAR(?)`? |
| schema | pii | Direct (email, name)? Confirm. |
| schema | uniqueness | Email unique per user? Or shared (e.g. household)? |
| schema | PK | `users.user_id` implied but not stated. |
| freshness | cadence | Real-time on signup? Daily? |
| volume | range | Expected user count? |
| ownership | team | Who owns this? |
A 2-line PRD generates a 7-line gap list. Fill the gaps before
authoring the contract.Example 3: refactor of existing dataset
Input: an undocumented events table, used by 5 dashboards. Discovered via git grep of the warehouse client code.
Output:
The agent reads the SQL queries hitting events and infers the observed contract - column references, status values used in WHERE clauses, recency expectations from INTERVAL filters. Then it produces a contract draft with a header note:
**Status:** observed-from-usage (NOT producer-attested).
The producer team must review and amend; this contract is the
consumers' best-effort reverse-engineering.Until the producer attests, the contract is one-sided - useful for the consumer's tests, NOT a binding agreement.
Anti-patterns
| Anti-pattern | Why it fails | Fix |
|---|---|---|
Auto-tagging every column pii: none | Misses real PII; downstream ungoverned. | Force the author to make the call; default to "needs review" rather than "none". |
| Generic freshness "real-time" | Ambiguous; "real-time" varies from <100ms to <1min by team. | Quantify: cadence: continuous, max_staleness: 60s. |
| Distribution rules without a window | "Cancellation rate ≤5%" - over what window? Lifetime? | Always specify a window: rolling_window: 7d. |
| Skipping Ownership | A contract without an owner is a wishlist; nobody's on the hook. | Require ownership before declaring the contract complete. |