restore-time-tests
Build restore-time SLA tests - per-database + per-object-store baseline measurement, RTO objective verification, parallel-restore optimization tests, point-in-time-recovery (PITR) latency. Bound `time-to-functional` (TTF) ≤ documented RTO; flag silent regressions when restore time grows over months.
Per the Google Cloud DR planning guide, RTO is "the maximum acceptable length of time that your application can be offline." Restore-time tests measure the actual time-to-functional (TTF) for each backup type and gate it on the RTO budget.
When to use
Step 1 - Define TTF segments
Time-to-functional = sum of:
| Segment | Definition |
|---|---|
| Detection | Time from incident to "something's wrong" |
| Decision | Time from detection to "initiate DR" |
| Provisioning | Time to spin up DR environment (IaC apply) |
| Restore | Time to apply the latest backup |
| Verification | Time to run smoke tests + accept traffic |
| Cutover | DNS / load balancer switch + propagation |
Each segment has its own budget, and the segment budgets together must fit within the RTO.
This skill focuses on the Restore and Verification segments.
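Because the segments are additive, an over-allocated plan can be caught before any restore is timed. The sketch below is one way to check that; the per-segment values, `SEGMENT_BUDGETS`, `RTO_SECONDS`, and `check_segment_budgets` are all hypothetical and only illustrate the arithmetic.

```python
# Hypothetical per-segment budgets in seconds; replace with your own values.
SEGMENT_BUDGETS = {
    "detection": 10 * 60,
    "decision": 15 * 60,
    "provisioning": 45 * 60,
    "restore": 2 * 3600,
    "verification": 5 * 60,
    "cutover": 15 * 60,
}
RTO_SECONDS = 4 * 3600  # assumed 4-hour RTO


def check_segment_budgets(budgets, rto_seconds):
    """Return (total, headroom) in seconds; raise if the budgets exceed the RTO."""
    total = sum(budgets.values())
    headroom = rto_seconds - total
    if headroom < 0:
        raise ValueError(
            f"Segment budgets total {total}s, {-headroom}s over the RTO of {rto_seconds}s"
        )
    return total, headroom


total, headroom = check_segment_budgets(SEGMENT_BUDGETS, RTO_SECONDS)
print(f"planned TTF {total}s, headroom {headroom}s")
```

Keeping some headroom rather than allocating the full RTO leaves room for the restore-time growth discussed in Step 6.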
Step 2 - Baseline: timed restore

    import subprocess
    import time

    import pytest


    @pytest.mark.benchmark
    def test_postgres_restore_time_under_rto():
        # Setup: clean target DB. check=True aborts the test if setup fails.
        for sql in ("DROP DATABASE IF EXISTS db_test", "CREATE DATABASE db_test"):
            subprocess.run(["psql", "-h", "test-db", "-c", sql], check=True)

        backup = "postgres-prod-latest.sql.gz"
        # pipefail surfaces a gunzip failure; ON_ERROR_STOP makes psql exit
        # non-zero on the first SQL error instead of continuing silently.
        restore_cmd = (
            f"set -o pipefail; gunzip -c {backup} "
            "| psql -v ON_ERROR_STOP=1 -h test-db -d db_test"
        )
        start = time.monotonic()  # monotonic clock: immune to wall-clock jumps
        subprocess.run(["bash", "-c", restore_cmd], check=True)
        elapsed = time.monotonic() - start

        RTO_BUDGET_SECONDS = 4 * 3600  # 4 hours
        # The 0.5 split (restore gets half the RTO, the rest goes to provision +
        # verify + cutover) is a planning choice, NOT a standard. Set the fraction
        # from your own per-segment RTO budget (Step 1).
        RESTORE_SEGMENT_FRACTION = 0.5
        budget = RTO_BUDGET_SECONDS * RESTORE_SEGMENT_FRACTION
        assert elapsed < budget, f"Restore took {elapsed:.0f}s; budget {budget:.0f}s"

Run this test weekly in CI and track the trend (Step 6).
Step 3 - Parallel-restore optimization
Many backup tools support parallelization. Test:

    # pg_restore parallel (-j requires a custom- or directory-format archive)
    pg_restore -j 8 -d db_test backup.dump  # 8 parallel workers

    # pgBackRest parallel restore
    pgbackrest --stanza=prod --process-max=8 restore

Then assert that the parallel run is actually faster:

    def test_parallel_restore_faster_than_serial():
        # run_restore is a project-specific helper, assumed to return elapsed seconds.
        serial = run_restore(parallel_jobs=1)
        parallel = run_restore(parallel_jobs=8)
        speedup = serial / parallel
        # 3.0x is an illustrative target; real speedup depends on I/O saturation,
        # CPU count, and backup format. Set the expected ratio from your own
        # measured serial-vs-parallel baseline rather than this placeholder.
        assert speedup > 3.0, f"Parallel restore only {speedup:.1f}x faster"

Find the sweet spot (often 4-8 jobs); past that point, contention diminishes the returns.
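One way to pick that sweet spot from measured runs is to take the smallest job count whose restore time is close to the fastest one observed. This is a minimal sketch, not a tuning algorithm from any backup tool: `pick_sweet_spot`, the 10% tolerance, and the sample timings are assumptions for illustration.

```python
def pick_sweet_spot(timings, tolerance=0.10):
    """Return the smallest job count within `tolerance` of the fastest time.

    timings: dict mapping parallel job count -> elapsed restore seconds.
    """
    if not timings:
        raise ValueError("timings must not be empty")
    best = min(timings.values())
    cutoff = best * (1 + tolerance)
    # Prefer fewer workers when the extra ones buy almost nothing.
    return min(jobs for jobs, seconds in timings.items() if seconds <= cutoff)


# Hypothetical measurements: job count -> seconds.
measured = {1: 3600, 2: 1900, 4: 1050, 8: 1000, 16: 1020}
print(pick_sweet_spot(measured))
```

Choosing the smallest count inside the tolerance band, rather than the absolute fastest, leaves CPU and I/O capacity for whatever else the DR host is doing during the restore.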
Step 4 - Point-in-time-recovery (PITR) latency
PITR restores the database to an arbitrary point in the past (within retention) by recovering a pre-existing base backup forward to a target time through archived WAL replay. The measured latency is base-restore time plus WAL-replay time. Two things must already exist before the restore: a base backup taken earlier and retained, and a continuous WAL archive covering the window up to the target. Do NOT call pg_basebackup at restore time: a backup taken "now" captures the present, leaving nothing earlier to recover to. The test below follows the recovery procedure in the PostgreSQL PITR docs:

    import subprocess
    import time
    from datetime import datetime, timedelta, timezone
    from pathlib import Path


    def test_pitr_to_5min_ago_under_30min():
        # Timezone-aware UTC, so PostgreSQL does not reinterpret the target
        # in the server's local time zone.
        target_time = datetime.now(timezone.utc) - timedelta(minutes=5)

        # Start the clock before the base restore so the measurement covers
        # base restore + WAL replay.
        start = time.monotonic()

        # 1. Lay down the PRE-EXISTING base backup into a clean data dir
        #    (untar the retained base backup; do not take a fresh one here).
        restore_retained_base_backup(dest="/restore")

        # 2. PG12+ recovery config: restore_command pulls archived WAL,
        #    recovery_target_time is the stop point, and recovery.signal triggers
        #    targeted recovery (recovery.conf was removed in PG12).
        write_conf("/restore/postgresql.auto.conf", {
            "restore_command": "cp /wal_archive/%f %p",
            "recovery_target_time": f"'{target_time.isoformat()}'",
            "recovery_target_action": "promote",
        })
        Path("/restore/recovery.signal").touch()

        # 3. Start the server; it replays WAL up to the target, then promotes.
        subprocess.run(["pg_ctl", "start", "-D", "/restore", "-w"], check=True)
        wait_for_recovery_complete(timeout=1800)
        elapsed = time.monotonic() - start

        # 1800s is illustrative; set the budget from your service's RTO segment SLA.
        assert elapsed < 1800, f"PITR took {elapsed:.0f}s; budget 30min"

PITR latency = base restore + WAL replay. The timer spans both segments.
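The test relies on a `wait_for_recovery_complete` helper that the text does not define. A minimal sketch of one possible implementation follows: it polls `SELECT pg_is_in_recovery()` through `psql` until the server reports that recovery has ended. The host, port, and database defaults, the `in_recovery_via_psql` name, and the injectable `in_recovery` and `sleep` parameters are assumptions made here for illustration and testability.

```python
import subprocess
import time


def in_recovery_via_psql(host="localhost", port=5432):
    """Return True while the server reports it is still in recovery."""
    result = subprocess.run(
        ["psql", "-h", host, "-p", str(port), "-d", "postgres",
         "-Atc", "SELECT pg_is_in_recovery()"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.strip() == "t"  # psql prints booleans as t / f


def wait_for_recovery_complete(timeout, poll_interval=5.0,
                               in_recovery=in_recovery_via_psql,
                               sleep=time.sleep):
    """Block until recovery has finished; raise TimeoutError after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            if not in_recovery():
                return
        except subprocess.CalledProcessError:
            pass  # server not accepting connections yet; keep polling
        sleep(poll_interval)
    raise TimeoutError(f"recovery still in progress after {timeout}s")
```

With `recovery_target_action` set to `promote`, the server leaves recovery once the target is reached, which is the moment the poll observes.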
Step 5 - Object-store partial restore
For S3 / GCS / Azure Blob restores, time the partial restore (not whole-bucket):

    import time

    import boto3

    s3 = boto3.client("s3")


    def test_partial_object_restore_under_5_min():
        keys_to_restore = sample_500_keys_from_inventory()
        start = time.monotonic()
        for key in keys_to_restore:
            # Server-side copy of a specific object version into the restore bucket.
            s3.copy_object(
                Bucket="restore-target",
                Key=key,
                CopySource={"Bucket": "backup-versioned", "Key": key, "VersionId": ...},
            )
        elapsed = time.monotonic() - start
        assert elapsed < 300, f"500-object restore took {elapsed:.0f}s"

The 500-object count and the 300s budget are illustrative; size both from your own per-account object inventory and restore SLA.
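The key-sampling helper is left undefined above. One plausible shape for it is a seeded random sample, so that week-to-week timings compare like with like. In this sketch, `sample_keys_from_inventory`, its parameters, and the example key names are hypothetical; how the inventory list is obtained (for example, from a bucket inventory report) is left to your environment.

```python
import random


def sample_keys_from_inventory(inventory_keys, sample_size=500, seed=0):
    """Return a reproducible random sample of object keys.

    inventory_keys: iterable of object key strings.
    Returns every key (sorted, de-duplicated) if the inventory is smaller
    than sample_size.
    """
    keys = sorted(set(inventory_keys))  # stable order so the seed is meaningful
    if len(keys) <= sample_size:
        return keys
    return random.Random(seed).sample(keys, sample_size)


# Hypothetical inventory of 2000 keys.
inventory = [f"data/{i:05d}.parquet" for i in range(2000)]
keys_to_restore = sample_keys_from_inventory(inventory)
print(len(keys_to_restore))
```

A fixed seed keeps the sample stable between runs; rotating the seed occasionally widens coverage at the cost of noisier trend data.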
Step 6 - Track restore-time trend
Backup grows over time → restore time grows. Track:

    def emit_restore_time_metric(elapsed_seconds, backup_size_bytes):
        metrics_client.gauge("dr.restore_time_seconds", elapsed_seconds)
        metrics_client.gauge("dr.backup_size_bytes", backup_size_bytes)
        metrics_client.gauge(
            "dr.restore_throughput_bytes_per_sec",
            backup_size_bytes / elapsed_seconds,
        )

Alert when restore time grows beyond a threshold you choose (e.g. 20% over 90 days; tune to your data-growth profile). Sustained growth indicates a need for backup compaction, more parallelism, or RTO renegotiation.
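The growth check itself can be sketched as a comparison between the oldest and newest runs in the window. The example below assumes weekly samples, oldest first, and compares medians of the first and last four runs to dampen one-off outliers; `restore_time_growth`, `should_alert`, the window size, and the sample data are all illustrative assumptions, not part of any metrics product.

```python
from statistics import median


def restore_time_growth(samples, window=4):
    """Fractional growth from the median of the first `window` samples
    to the median of the last `window` samples.

    samples: restore times in seconds, oldest first.
    """
    if len(samples) < 2 * window:
        raise ValueError(f"need at least {2 * window} samples, got {len(samples)}")
    baseline = median(samples[:window])
    recent = median(samples[-window:])
    return (recent - baseline) / baseline


def should_alert(samples, threshold=0.20, window=4):
    """True when growth over the sampled period exceeds the threshold."""
    return restore_time_growth(samples, window) > threshold


# Hypothetical 13 weekly restore times (about 90 days), in seconds.
weekly = [1000, 1010, 990, 1000, 1050, 1100, 1150,
          1200, 1240, 1260, 1250, 1270, 1300]
print(f"growth {restore_time_growth(weekly):.1%}, alert={should_alert(weekly)}")
```

In practice this comparison usually lives in the monitoring system's alert rule; the Python form is only meant to make the threshold logic explicit.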
Step 7 - Verification time
Restore success ≠ functional. Verification adds time:

    def test_post_restore_smoke_under_5_min():
        do_restore()
        start = time.time()
        run_smoke_suite("dr-environment")
        elapsed = time.time() - start
        assert elapsed < 300, f"Smoke tests took {elapsed:.0f}s; budget 5min"

Smoke suite scope: critical paths only. Full regression is too slow for the RTO window.
Step 8 - Cold-start vs warm-cache
After restore, applications hit cold caches → first requests slow. Test that the cold-start latency is within service SLA:

    import math
    import time

    import requests


    def test_cold_start_latency_within_sla():
        # Restore complete; app started; first requests
        latencies = []
        for _ in range(100):
            start = time.monotonic()
            requests.get("https://dr-env.svc/api/products", timeout=10)
            latencies.append(time.monotonic() - start)
        # Nearest-rank p99: the 99th of 100 sorted samples (index 98).
        rank = math.ceil(len(latencies) * 99 / 100)
        p99_cold = sorted(latencies)[rank - 1]
        # 2.0s is a placeholder; set the cold-start bound from your service's SLO.
        assert p99_cold < 2.0, f"Cold-start p99 {p99_cold:.2f}s exceeds 2s SLA"

A cache-warming step may be needed in the DR runbook (for example, running common queries before declaring "functional").
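A cache-warming step could look like the sketch below: it replays a list of common request paths within a fixed time budget, so that warming cannot eat into the RTO unbounded. `warm_cache`, the `fetch` callable, the example paths, and the 60-second default are hypothetical; in a real runbook `fetch` might wrap an HTTP GET against the DR environment.

```python
import time


def warm_cache(fetch, paths, max_seconds=60.0):
    """Call fetch(path) for each path until done or the time budget runs out.

    Returns the list of paths that were warmed successfully.
    """
    deadline = time.monotonic() + max_seconds
    warmed = []
    for path in paths:
        if time.monotonic() >= deadline:
            break  # out of budget: stop warming, do not delay cutover
        try:
            fetch(path)
        except Exception:
            continue  # a failed warm-up request should not block the runbook
        warmed.append(path)
    return warmed


# Usage with a stand-in fetch function that only records the calls.
seen = []
print(warm_cache(seen.append, ["/api/products", "/api/categories"]))
```

The time spent warming counts toward the Verification segment, so its budget should be included when sizing that segment in Step 1.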
Anti-patterns
| Anti-pattern | Why it fails | Fix |
|---|---|---|
| Test on yesterday's backup, claim "RTO met" | A real DR restore uses the latest backup, often only minutes old | Weekly cadence with realistic data freshness |
| Skip parallel test; use single thread | Serial restore time breaches the RTO budget as data grows | Step 3 sweet-spot tuning |
| Skip verification time | Restore "complete"; users still 5xx | Step 7 must be timed |
| No trend tracking | Silent regression months in | Step 6 metric + alert |
| RTO measured on the DB only, app ignored | App may take longer to become functional than the DB | Step 8 cold-start test |