backup-verification-author
Author a backup-verification harness: per-backup-type integrity (SHA-256 / encrypted-payload signature), restore-to-test-env spot-check cadence, partial-restore (single-table / single-object) verification, cross-region replication validation, and retention-policy assertions. "An untested backup is not a backup."
Backups silently fail in many ways - wrong encryption key, missing volume, corrupted file, expired credential. Per the Google Cloud DR planning guide, DR success requires "end-to-end recovery design addressing backup, restoration, and cleanup procedures." This skill authors the verification harness.
When to use
Step 1 - Catalog backup types
## Backup Catalog — `<service>`
| Type | Source | Frequency | Retention | Tool |
|---|---|---|---|---|
| Full DB dump | postgres prod | Daily 02:00 UTC | 30 days | pg_dump + S3 |
| Logical schema | postgres prod | Hourly | 24 hours | logical replication slot |
| File store | S3 prod bucket | Continuous | 90 days | S3 versioning + cross-region |
| Audit log | append-only S3 | Continuous | 7 years | S3 + Glacier |
| Secrets / KMS keys | Vault prod | Daily | 7 days | Vault snapshot + encrypted S3 |

Each row needs its own verification step (Step 3).
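One way to keep the "each row needs its own verification step" rule honest is to hold the catalog as data and assert that every entry has a verifier registered. The sketch below assumes a hypothetical `VERIFIERS` registry and test names invented for illustration; only the catalog rows come from the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupType:
    name: str
    source: str
    retention_days: int


# Rows copied from the catalog table above.
CATALOG = [
    BackupType("Full DB dump", "postgres prod", 30),
    BackupType("Logical schema", "postgres prod", 1),
    BackupType("File store", "S3 prod bucket", 90),
    BackupType("Audit log", "append-only S3", 7 * 365),
    BackupType("Secrets / KMS keys", "Vault prod", 7),
]

# Hypothetical registry: backup type name -> name of the test that verifies it.
VERIFIERS = {
    "Full DB dump": "test_restore_full_dump",
    "File store": "test_restore_single_object",
}


def unverified_types(catalog, verifiers):
    """Return the names of catalog entries with no verification step registered."""
    return [entry.name for entry in catalog if entry.name not in verifiers]
```

A harness test could then assert that `unverified_types(CATALOG, VERIFIERS)` is empty, so adding a catalog row without a verifier fails CI.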
Step 2 - Integrity checks at backup time
#!/usr/bin/env bash
# -o pipefail matters here: with plain `set -e`, a failed pg_dump is masked
# by a successful gzip at the end of the pipeline.
set -euo pipefail
BACKUP_DIR="/backups"
BACKUP_FILE="postgres-prod-$(date -u +%Y%m%d).sql.gz"
BACKUP_PATH="$BACKUP_DIR/$BACKUP_FILE"
# Take backup
pg_dump -h prod-db -U replica db_name | gzip > "$BACKUP_PATH"
# Generate SHA-256 + sign
# Hash from inside the directory so the sidecar records a relative file name;
# `sha256sum -c` can then verify a copy downloaded to any location.
(cd "$BACKUP_DIR" && sha256sum "$BACKUP_FILE" > "$BACKUP_FILE.sha256")
gpg --detach-sign --armor "$BACKUP_PATH"
# Upload to backup destination
aws s3 cp "$BACKUP_PATH" "s3://backup/postgres/$BACKUP_FILE"
aws s3 cp "$BACKUP_PATH.sha256" "s3://backup/postgres/$BACKUP_FILE.sha256"
aws s3 cp "$BACKUP_PATH.asc" "s3://backup/postgres/$BACKUP_FILE.asc"
# Tag with metadata
aws s3api put-object-tagging \
  --bucket backup --key "postgres/$BACKUP_FILE" \
  --tagging "TagSet=[{Key=integrity_verified,Value=true},{Key=created,Value=$(date -u -Iseconds)}]"
Tests assert:
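One assertion a test could make at this point is that a downloaded backup still matches its `.sha256` sidecar. This is a minimal Python sketch, assuming the sidecar is in `sha256sum` output format (hex digest first, then the file name); the function names are illustrative, not part of any existing harness.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Stream a file through SHA-256 so large dumps never load fully into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_sidecar(backup_path, sidecar_path):
    """Compare a file's SHA-256 with the first field of a sha256sum-style sidecar."""
    fields = Path(sidecar_path).read_text().split()
    if not fields:
        return False  # an empty sidecar can never vouch for a backup
    return sha256_of(backup_path) == fields[0].lower()
```

This checks the digest only; it does not replace verifying the detached GPG signature, which is what ties the digest to a trusted signer.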
Step 3 - Restore-to-test-env spot check
A restore that has never been done is not a backup. Schedule:
# CI cron: weekly random sample
- cron: "0 4 * * 1" # Monday 04:00 UTC
  job:
    - name: Pick random backup
      run: |
        DAYS=(1 7 14 30)
        DAYS_AGO=${DAYS[$RANDOM % ${#DAYS[@]}]}
        BACKUP=$(date -u -d "$DAYS_AGO days ago" +%Y%m%d)
        echo "BACKUP=postgres-prod-$BACKUP.sql.gz" >> "$GITHUB_ENV"
        echo "DAYS_AGO=$DAYS_AGO" >> "$GITHUB_ENV"
    - name: Verify integrity
      run: |
        aws s3 cp "s3://backup/postgres/$BACKUP.sha256" .
        aws s3 cp "s3://backup/postgres/$BACKUP" .
        sha256sum -c "$BACKUP.sha256"
    - name: Restore to test DB
      run: |
        gunzip "$BACKUP"
        # ON_ERROR_STOP makes psql exit non-zero on the first failed statement
        psql -v ON_ERROR_STOP=1 -h test-db -U test -f "${BACKUP%.gz}" db_test
    - name: Spot check
      run: |
        # The window is relative to the backup's age: a 30-day-old dump has no rows from the last day
        COUNT=$(psql -h test-db -U test db_test -At -c "SELECT COUNT(*) FROM orders WHERE created_at > NOW() - INTERVAL '$((DAYS_AGO + 1)) days'")
        # Verify count > 0 (or whatever invariant fits)
        [ "$COUNT" -gt 0 ]
Step 4 - Partial-restore test
Real DR scenarios often need single-table or single-object restore (not full DB):
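# Single-table extract + restore
# pg_restore reads archive-format dumps only (e.g. pg_dump -Fc); it cannot read the plain-SQL .sql.gz from Step 2
pg_restore --table=orders --data-only \
  -h test-db -U test -d db_test \
  postgres-prod-backup.dump
Test: extract one table; assert that its row count and checksum match the snapshot recorded from production at backup time.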
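The row-count-plus-checksum comparison can be sketched as a small order-independent fingerprint over fetched rows. This is an illustration under assumptions: `table_fingerprint` is a hypothetical helper, rows arrive as tuples from whatever database driver the harness uses, and both sides must fetch values as the same Python types (a `Decimal` on one side and a `float` on the other would produce different fingerprints).

```python
import hashlib


def table_fingerprint(rows):
    """Return (row_count, sha256_hex) for an iterable of row tuples, independent of row order."""
    # Sorting the encoded rows removes any dependence on the order the database returns them in.
    encoded = sorted(repr(tuple(row)).encode("utf-8") for row in rows)
    digest = hashlib.sha256()
    for item in encoded:
        # A length prefix keeps adjacent rows from blurring into one another.
        digest.update(len(item).to_bytes(8, "big"))
        digest.update(item)
    return len(encoded), digest.hexdigest()
```

The harness would store the fingerprint at backup time and compare it with the fingerprint of the restored table; for very large tables, an in-database aggregate may be a better fit than pulling every row to the client.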
For S3 single-object:
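# `aws s3 cp` has no --version-id option; fetching a specific version goes through s3api
aws s3api get-object \
  --bucket backup-versioned --key object-key \
  --version-id "VERSION_ID_AT_DESIRED_TIME" \
  ./restored-object
Step 5 - Cross-region replication test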
import time
import pytest

def test_backup_replicated_to_dr_region():
    # take_backup_to() and exists_in() are harness helpers defined elsewhere
    # Take a backup in primary region
    backup_path_primary = take_backup_to(region="us-east-1")
    # Wait for replication SLA
    deadline = time.monotonic() + 300  # 5 min SLA
    while time.monotonic() < deadline:
        if exists_in(region="us-west-2", path=backup_path_primary):
            return
        time.sleep(10)
    pytest.fail("Cross-region replication exceeded 5min SLA")
Per the Google Cloud DR planning guide, "security synchronization" also matters: the DR region must have the same KMS keys, IAM, and secrets, not just the data.
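The security-synchronization point can be checked with a plain set difference once the harness has an inventory of each region. The sketch below assumes the inventories are already collected as dictionaries of names (how they are gathered is left to the harness); `missing_in_dr` and the resource-kind keys are hypothetical.

```python
def missing_in_dr(primary, dr):
    """Return, per resource kind, the names present in primary but absent from DR."""
    gaps = {}
    for kind, names in primary.items():
        absent = sorted(set(names) - set(dr.get(kind, ())))
        if absent:
            gaps[kind] = absent
    return gaps


# Illustrative inventories; every name here is made up for the example.
primary_inventory = {
    "kms_aliases": ["alias/backup", "alias/app"],
    "secrets": ["db-password"],
}
dr_inventory = {
    "kms_aliases": ["alias/app"],
    "secrets": ["db-password"],
}
gaps = missing_in_dr(primary_inventory, dr_inventory)
```

A test would assert that `gaps` is empty; in this example it reports the missing backup key alias, which is exactly the case where replicated data cannot be decrypted in the DR region.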
Step 6 - Retention-policy verification
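import pytest
from datetime import datetime, timedelta
from botocore.exceptions import ClientError

def test_old_backups_purged_per_retention_policy():
    # 30-day retention; 100-day-old backup should not exist
    target = (datetime.utcnow() - timedelta(days=100)).strftime("%Y%m%d")
    obj_key = f"postgres/postgres-prod-{target}.sql.gz"
    try:
        s3.head_object(Bucket="backup", Key=obj_key)
    except ClientError as err:
        assert err.response["Error"]["Code"] == "404"  # missing object = pass
        return
    pytest.fail(f"Backup {obj_key} still exists past 30-day retention")
A missing object (404) is the passing case; any other error, such as an access-denied response, still fails the test.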
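Probing one key at a time works, but the retention checks in this step can also be derived from a single bucket listing. This is a sketch, assuming the key naming from Step 2 (`postgres/postgres-prod-YYYYMMDD.sql.gz`) and a list of keys the harness has already fetched; `audit_retention` is a hypothetical helper.

```python
import re
from datetime import date, timedelta

# Matches only the dump objects, not their .sha256 / .asc sidecars.
KEY_PATTERN = re.compile(r"^postgres/postgres-prod-(\d{8})\.sql\.gz$")


def audit_retention(keys, today, retention_days=30):
    """Split listed keys into (expired_keys, missing_days) for a daily backup."""
    cutoff = today - timedelta(days=retention_days)
    seen = set()
    expired = []
    for key in keys:
        match = KEY_PATTERN.match(key)
        if not match:
            continue  # sidecar or unrelated object
        stamp = match.group(1)
        day = date(int(stamp[:4]), int(stamp[4:6]), int(stamp[6:]))
        seen.add(day)
        if day < cutoff:
            expired.append(key)  # should have been purged already
    expected = (today - timedelta(days=n) for n in range(retention_days))
    missing = [d for d in expected if d not in seen]
    return expired, missing
```

A test would assert that both returned lists are empty. Whether a backup exactly `retention_days` old counts as expired is a policy decision; this sketch treats it as still within retention.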
def test_recent_backups_present():
    # Last 30 days should have at least one daily backup each
    for d in range(30):
        date = (datetime.utcnow() - timedelta(days=d)).strftime("%Y%m%d")
        key = f"postgres/postgres-prod-{date}.sql.gz"
        s3.head_object(Bucket="backup", Key=key)  # raises if missing
Step 7 - Encryption verification
For encrypted backups, verify both:
def test_backup_encrypted_with_correct_key():
    # key: S3 key of the backup under test, supplied by the harness
    obj = s3.head_object(Bucket="backup", Key=key)
    assert obj["ServerSideEncryption"] == "aws:kms"
    assert obj["SSEKMSKeyId"] == EXPECTED_KMS_KEY_ARN
Step 8 - Customer-induced backup test (compliance)
Some compliance frameworks (HIPAA, SOC 2) require a demonstrated ability to restore on demand. Author the workflow:
## Customer-Induced Backup Restore Test
1. Customer requests demo restore via support ticket.
2. SRE picks a random recent backup; restores to clean isolated env.
3. Customer verifies their data via read-only SQL or UI.
4. Cleanup: tear down env, sanitize logs.
5. Document: ticket + timestamps + verification artifacts → audit log.
Anti-patterns
| Anti-pattern | Why it fails | Fix |
|---|---|---|
| Verify backup file exists; not contents | Corrupt files pass | SHA + restore (Steps 2-3) |
| Test restore once, never again | Bit rot, key rotation, schema drift surface later | Weekly cadence (Step 3) |
| Skip partial-restore test | Real DR usually wants partial; full restore takes too long | Step 4 |
| Skip key recovery | Backup encrypted with key not in DR region; useless | Step 7 |
| Trust replication "succeeded" status | Async replication can claim success then fail | Step 5 explicit verification |