soda-checks
Authors and runs SodaCL (Soda Checks Language) checks against SQL warehouses (Snowflake, BigQuery, Postgres, Redshift, etc.) via `soda scan`, configures scan profiles in configuration.yml, and gates CI on scan exit code. Use when the user works with Soda Core / Soda Cloud or needs YAML-driven warehouse data quality.
Overview
SodaCL (Soda Checks Language) is a YAML-based domain-specific language for data reliability. A typical Soda project ships two YAML files - configuration.yml (data source connection) and checks.yml (assertions) - and runs them through the soda scan CLI (sodacl-overview, sodacl-quickstart).
This skill covers SodaCL authoring, scan configuration, running, and CI integration. Use it when you want YAML-only assertions on warehouse tables without writing Python (in contrast to Great Expectations, which is Python-first) or compiling a dbt project.
When to use
Authoring checks
File shape
checks.yml is a list of `checks for <dataset>:` blocks. Each block lists assertions in SodaCL syntax (sodacl-overview):
# checks.yml
checks for orders:
  - row_count between 1 and 10000000
  - missing_count(order_id) = 0
  - duplicate_count(order_id) = 0
  - invalid_percent(email) < 1 %:
      valid format: email
  - freshness(updated_at) < 1d
checks for customers:
  - row_count > 0
  - missing_count(customer_id) = 0
  - schema:
      warn:
        when required column missing: [created_at]
      fail:
        when forbidden column present: [pii_ssn, pii_dob]
Common check types
The check vocabulary in SodaCL (sodacl-overview):
| Check | Example |
|---|---|
| `row_count` | `row_count between 10 and 1000` |
| `missing_count(c)` | `missing_count(birth_date) = 0` |
| `missing_percent(c)` | `missing_percent(email) < 1 %` |
| `duplicate_count(c)` | `duplicate_count(phone) = 0` |
| `invalid_count(c)` | `invalid_count(phone) = 0` (with `valid format`) |
| `freshness(col)` | `freshness(start_date) < 1d` |
| Aggregates | `avg(safety_stock_level) > 50` |
| `schema` | required / forbidden columns (warn/fail blocks) |
Thresholds and severity
Thresholds use comparison operators inline with the check (< 1%, between 10 and 1000, = 0). The schema check supports explicit warn: and fail: alert configurations to differentiate severity (sodacl-overview):
checks for transactions:
  - schema:
      warn:
        when required column missing: [updated_at]
      fail:
        when forbidden column present: [pii*]
For non-schema checks, alert configurations are warn: and fail: keys nested under the check, each with its own "when" threshold. See Soda's optional-config docs at https://docs.soda.io/soda-v3/sodacl-reference/optional-config.md for the full grammar.
Configuration
configuration.yml defines one or more data sources and (optionally) Soda Cloud credentials. Use environment-variable substitution for secrets - Soda expands ${ ENV_VAR } at scan time (quick-start-dev):
# configuration.yml
data_source warehouse:
  type: snowflake
  username: ${ SNOWFLAKE_USER }
  password: ${ SNOWFLAKE_PASS }
  account: ${ SNOWFLAKE_ACCOUNT }
  database: analytics
  warehouse: compute_wh
  role: analyst
  schema: public
soda_cloud:
  host: cloud.us.soda.io
  api_key_id: ${ SODA_CLOUD_API_KEY }
  api_key_secret: ${ SODA_CLOUD_API_SECRET }
The soda_cloud: block is optional - omit it for a fully local scan.
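Because the secrets are resolved from the environment at scan time, it can help to confirm they are present before the scan starts. The following is a minimal sketch, not part of Soda: `require_env` is a hypothetical helper name, and the variable names are the ones used in the configuration.yml example.

```shell
# Hypothetical helper (not a Soda command): fail fast when a variable
# referenced in configuration.yml is unset or empty.
require_env() {
  missing=0
  for name in "$@"; do
    # Indirect lookup that works in POSIX sh: read the variable named by $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing environment variable: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage:
#   require_env SNOWFLAKE_USER SNOWFLAKE_PASS SNOWFLAKE_ACCOUNT \
#     && soda scan -d warehouse -c configuration.yml checks.yml
```

The helper only checks that each variable is non-empty; it does not validate the credentials themselves.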
Running
Canonical CLI invocation (sodacl-quickstart):
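soda scan -d warehouse -c configuration.yml checks.yml
Where -d names the data source defined in configuration.yml (here, warehouse), -c gives the path to the configuration file, and the trailing argument is the checks file to run.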
Multiple checks files run in one scan:
soda scan -d warehouse -c configuration.yml \
  checks/orders.yml checks/customers.yml checks/transactions.yml
Each scan prints a per-check pass/fail summary to stdout and (when soda_cloud: is configured) reports the results to Soda Cloud.
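When the checks directory grows, listing every file by hand becomes error-prone. The following sketch collects the files instead; `list_checks_files` is a hypothetical helper, and the `checks` directory layout is assumed from the example above.

```shell
# Hypothetical helper: print every .yml file under a directory, one per
# line, in sorted order so each scan sees the files in the same sequence.
list_checks_files() {
  find "$1" -type f -name '*.yml' | sort
}

# Usage (assumes file paths contain no whitespace, because the unquoted
# command substitution relies on word splitting):
#   soda scan -d warehouse -c configuration.yml $(list_checks_files checks)
```

Only the `.yml` extension is matched; widen the `-name` pattern if your project also uses `.yaml`.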
Parsing scan results
A failing scan exits non-zero and prints a per-check pass/fail line. For machine consumption, the Soda GitHub Action converts scan results to a markdown table and posts the findings as a PR comment (quick-start-dev). For other CI systems, parse stdout with the filtering snippet below.
# Capture and filter the scan output.
# pipefail keeps soda's exit status; without it the pipeline reports tee's.
set -o pipefail
soda scan -d warehouse -c configuration.yml checks.yml | tee scan.log
# Surface failing checks for the PR comment / job summary.
# The exact line format varies by Soda version; adjust the pattern to your output.
grep 'FAIL' scan.log || echo "No failing checks found in scan.log"
Note: if your team needs structured (JSON) scan output for richer downstream reporting, check the current soda scan flag set on https://docs.soda.io/ - flag availability evolves between major Soda Library / Soda Core versions.
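The log filter can also produce an exit status, so a later CI step can fail on it. The following is a minimal sketch: `report_failures` is a hypothetical helper, and the default marker `FAIL` is an assumption about the log format, not a documented Soda contract.

```shell
# Hypothetical helper: print log lines that contain a failure marker and
# return non-zero when at least one is found.
#   $1 = log file
#   $2 = marker (assumption: defaults to FAIL; adjust to your Soda output)
report_failures() {
  log="$1"
  marker="${2:-FAIL}"
  if grep -e "$marker" "$log"; then
    return 1
  fi
  echo "no lines matching '$marker' in $log"
  return 0
}

# Usage:
#   report_failures scan.log || echo "failing checks found" >&2
```

This treats any matching line as a failure, so a marker that is too loose will produce false positives; the scan's own exit code remains the authoritative signal.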
CI integration
The minimal pattern is: install the Soda library + warehouse adapter, run soda scan, exit non-zero on failure, upload the scan log as an artifact.
# .github/workflows/soda-scan.yml
name: soda
on:
  pull_request:
  push:
    branches: [main]
jobs:
  soda-scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v5
      - uses: actions/setup-python@v5
        with:
          python-version: '3.12'
      - name: Install Soda + adapter
        # Pin versions in real use. Soda Core is the free OSS package; Soda Library is the commercial one.
        run: pip install 'soda-core' 'soda-core-snowflake'
      - name: Run scan
        env:
          SNOWFLAKE_USER: ${{ secrets.SNOWFLAKE_USER }}
          SNOWFLAKE_PASS: ${{ secrets.SNOWFLAKE_PASS }}
          SNOWFLAKE_ACCOUNT: ${{ secrets.SNOWFLAKE_ACCOUNT }}
        run: |
          # Without pipefail the step would report tee's exit status, not soda's.
          set -o pipefail
          soda scan -d warehouse -c configuration.yml checks.yml | tee scan.log
      - name: Upload scan log
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: soda-scan-log
          path: scan.log
          retention-days: 14
The if: always() on the artifact upload is required to capture the log on a failed scan - when you most need it for triage. For a managed GitHub-Action wrapper that auto-comments on PRs, see the Soda GitHub Action.
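The "exit non-zero on failure" part of the pattern depends on the scan's exit status surviving the log capture. When output is piped through tee, a shell running without pipefail reports tee's status instead of the scan's. The following sketch for other CI systems avoids the pipe entirely; `run_and_log` is a hypothetical wrapper, not a Soda command.

```shell
# Hypothetical wrapper: run a command, save its combined stdout/stderr to
# a log file, echo the log, and return the command's own exit status.
# Writing to the file first (instead of piping through tee) means the
# status cannot be masked by the last command of a pipeline.
run_and_log() {
  log="$1"
  shift
  status=0
  "$@" > "$log" 2>&1 || status=$?
  cat "$log"
  return "$status"
}

# Usage:
#   run_and_log scan.log soda scan -d warehouse -c configuration.yml checks.yml
```

The trade-off is that output appears only after the command finishes rather than streaming live, which is usually acceptable for a short scan.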