pii-masking-pipeline-builder
Build-an-X workflow that produces a PII masking pipeline spec from a source-data inventory. Walks the author through (1) classifying each field against pii-categories-reference, (2) picking a masking operator from data-masking-techniques-reference, (3) deciding pseudonymisation (reversible, in GDPR scope) vs anonymisation (irreversible, out of scope), (4) ordering the pipeline (detect → operator → audit), and (5) emitting a deployable config for Presidio + Faker + Synthea wrappers. Output is a YAML pipeline spec plus a per-field rationale table. Use after classifying a dataset's PII risk; this is the workflow that translates classification into runnable masking config.
Overview
Authoring a masking pipeline requires three classifications per field (regulatory regime, operator, reversibility) and one global decision (pipeline ordering + audit hooks). This workflow produces a deployable YAML spec that downstream tools execute.
When to use
Step 1 - Inventory the source
Enumerate every column / field in the source dataset. For each, record:
| Column | Type | Sample value | Cardinality | Cross-table join? |
|---|---|---|---|---|
| users.email | string | alice@acme.com | high | yes (joins events) |
| users.ssn | string | 123-45-6789 | high | no |
| users.dob | date | 1985-03-14 | medium | no |
| users.zip | string | 02139 | low | no |
| users.country | string | US | very low | no |
A schema introspector can produce the first columns; cardinality and join graph need a quick analytical pass.
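The analytical pass for cardinality can be approximated on a row sample. The following is a minimal sketch, assuming rows arrive as dictionaries; the function name `profile_cardinality` and the `high` / `low` ratio thresholds are illustrative assumptions, not values defined by this workflow.

```python
from collections import defaultdict


def profile_cardinality(rows, high=0.9, low=0.1):
    """Bucket each column by its distinct-value ratio (distinct / non-null rows).

    The `high` and `low` thresholds are hypothetical defaults chosen for
    illustration; tune them to the dataset at hand.
    """
    distinct = defaultdict(set)
    counts = defaultdict(int)
    for row in rows:
        for col, val in row.items():
            if val is None:
                continue  # nulls do not count toward cardinality
            distinct[col].add(val)
            counts[col] += 1

    buckets = {}
    for col, n in counts.items():
        ratio = len(distinct[col]) / n
        if ratio >= high:
            buckets[col] = "high"
        elif ratio <= low:
            buckets[col] = "low"
        else:
            buckets[col] = "medium"
    return buckets


# Synthetic sample: unique emails, 50 birth years, 4 ZIP codes, 1 country.
rows = [
    {
        "email": f"user{i}@example.com",
        "dob": f"19{i % 50 + 40}-01-01",
        "zip": f"0213{i % 4}",
        "country": "US",
    }
    for i in range(100)
]
cardinality = profile_cardinality(rows)
```

A ratio-based bucket is only a first approximation; the join graph still has to come from foreign-key metadata or query logs.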
Step 2 - Classify each field
Look up each column in pii-categories-reference and record which regulatory regime(s) apply. Include linkable fields explicitly (NIST 800-122 §2.2).
| Column | GDPR | CPRA SPI | NIST | HIPAA | Risk |
|---|---|---|---|---|---|
| users.email | ✓ | - | ✓ | ✓ #6 | direct |
| users.ssn | ✓ | ✓ | ✓ | ✓ #7 | direct, high-sensitivity |
| users.dob | linkable | - | linkable | ✓ #3 | linkable |
| users.zip | linkable | - | linkable | ✓ #2 (sub-state) | linkable |
| users.country | - | - | - | - | non-PII |
Any field marked direct OR linkable enters the masking scope. A field marked only "linkable" still gets masked because it identifies in combination with others (Sweeney 87% rule, see pii-categories-reference).
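The scope rule above is simple enough to express directly. This is a minimal sketch, assuming the Risk column is stored as a comma-separated string per field; the `masking_scope` helper is a hypothetical name used only for illustration.

```python
def masking_scope(classification):
    """Return the columns whose risk label includes 'direct' or 'linkable'."""
    in_scope = []
    for column, risk in classification.items():
        # A risk cell may carry several labels, e.g. "direct, high-sensitivity".
        labels = {part.strip() for part in risk.split(",")}
        if labels & {"direct", "linkable"}:
            in_scope.append(column)
    return in_scope


# Risk column from the Step 2 table.
classification = {
    "users.email": "direct",
    "users.ssn": "direct, high-sensitivity",
    "users.dob": "linkable",
    "users.zip": "linkable",
    "users.country": "non-PII",
}
scope = masking_scope(classification)
```

Here `users.country` is the only field left out of scope, matching the table.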
Step 3 - Pick an operator per field
Match each field to a technique in data-masking-techniques-reference. Decision tree:
| Column | Operator | Rationale | Reversible? |
|---|---|---|---|
| users.email | Faker substitution (deterministic via hash-seed) | Joins across tables; need referential integrity | Yes (via salt vault) |
| users.ssn | Tokenisation (vault) | Strict regulator scope; round-trip needed for auth | Yes (via vault) |
| users.dob | Generalisation to year | Analytics needs age bracket, not exact DOB | No |
| users.zip | Truncation to first 3 digits | HIPAA Safe Harbor #2: 3-digit prefix allowed only where that ZIP3 area has >20k population; otherwise 000 | No |
| users.country | Pass-through | Not PII | n/a |
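To make three of the operators in the table concrete, here is a minimal stdlib-only sketch. It is an approximation, not the Faker-based operator itself: the deterministic substitution derives a placeholder address from an HMAC digest rather than seeding Faker, and the function names and the `populous_zip3` lookup set are assumptions introduced for illustration.

```python
import hashlib
import hmac
from datetime import date


def deterministic_email(value: str, salt: bytes) -> str:
    """Map an email to a stable placeholder: same (salt, value) gives same output.

    Stand-in for hash-seeded Faker substitution. Stability across tables is
    what keeps joins on the masked column intact.
    """
    normalised = value.strip().lower().encode("utf-8")
    digest = hmac.new(salt, normalised, hashlib.sha256).hexdigest()
    return f"user_{digest[:12]}@example.com"


def generalise_to_year(dob: date) -> int:
    """Generalisation: keep only the birth year."""
    return dob.year


def truncate_zip(zip_code: str, populous_zip3: set) -> str:
    """Truncation: keep the first 3 digits only for ZIP3 areas known to hold
    more than 20,000 people; otherwise emit '000'.

    `populous_zip3` is a hypothetical lookup the caller must supply from
    census data.
    """
    prefix = zip_code[:3]
    return prefix if prefix in populous_zip3 else "000"


salt = b"example-salt-not-for-production"
masked_a = deterministic_email("alice@acme.com", salt)
masked_b = deterministic_email(" Alice@Acme.com", salt)  # same person, messy input
year = generalise_to_year(date(1985, 3, 14))
zip3 = truncate_zip("02139", populous_zip3={"021"})
```

Anyone holding the salt can recompute the mapping for a known email, which is why the table marks this operator as reversible.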
Step 4 - Pseudonymisation vs anonymisation gate
For each masked field, mark whether the result remains personal data under GDPR Art. 4(5):
Document the gate decision per dataset:
```yaml
output_classification: pseudonymised   # GDPR scope retained
gdpr_lawful_basis: Article 6(1)(f) legitimate interests
retention: 90 days
access_control: only-dev-environment-team
```

vs.

```yaml
output_classification: anonymised
gdpr_lawful_basis: out-of-scope per Recital 26
retention: indefinite
access_control: open
```

The author cannot claim "anonymised" if any reversible technique is in the pipeline.
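The gate can be checked mechanically from the operator list. The following is a minimal sketch; the `REVERSIBLE` map is an assumption assembled from the Step 3 table and the worked example (encryption with a retained key counts as reversible), and `output_classification` is a hypothetical helper name.

```python
# True = reversible, False = irreversible, None = not a masking operator.
REVERSIBLE = {
    "deterministic_substitution": True,
    "tokenisation": True,
    "encryption": True,
    "generalisation": False,
    "truncation": False,
    "passthrough": None,
}


def output_classification(operators):
    """Return 'pseudonymised' if any operator is reversible, else 'anonymised'."""
    unknown = [op for op in operators if op not in REVERSIBLE]
    if unknown:
        # Refuse to guess: an unclassified operator must be reviewed by hand.
        raise ValueError(f"unclassified operators: {unknown}")
    if any(REVERSIBLE[op] for op in operators):
        return "pseudonymised"
    return "anonymised"


step3_ops = [
    "deterministic_substitution",
    "tokenisation",
    "generalisation",
    "truncation",
    "passthrough",
]
gate = output_classification(step3_ops)
```

Passing this check is a necessary condition only. A pipeline with no reversible operator can still leave records re-identifiable through linkable fields, so an "anonymised" result here should be treated as a candidate for review rather than a conclusion.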
Step 5 - Compose the pipeline
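A standard order is detect → operator → audit: locate the PII first, apply the per-field operator, then sample the output and audit it for leaks.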
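The detect → operator → audit ordering named in the description can be sketched for a free-text column as follows. This is an illustration under stated assumptions: a single email regex stands in for the Presidio detector, and the `audit` function stands in for the pii-leak-critic step; none of the function names come from either tool.

```python
import re

# Stand-in recogniser; a real pipeline would use Presidio's analyzer instead.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def detect(text):
    """Stage 1: return (start, end) spans of email-like strings."""
    return [m.span() for m in EMAIL_RE.finditer(text)]


def apply_operator(text, spans, placeholder="<EMAIL>"):
    """Stage 2: replace spans right-to-left so earlier offsets stay valid."""
    for start, end in sorted(spans, reverse=True):
        text = text[:start] + placeholder + text[end:]
    return text


def audit(rows, sample_rows=100):
    """Stage 3: re-run detection on a sample of the output; return leaking rows."""
    return [row for row in rows[:sample_rows] if detect(row)]


def run_pipeline(rows):
    masked = [apply_operator(row, detect(row)) for row in rows]
    leaks = audit(masked)
    if leaks:
        # Mirrors the intent of fail_on_critic_block: stop rather than ship leaks.
        raise RuntimeError(f"{len(leaks)} sampled rows still contain PII")
    return masked


notes = ["Contact alice@acme.com or bob@acme.com", "No PII here"]
masked_notes = run_pipeline(notes)
```

Running the audit with the same recogniser as the detector only catches operator failures; an independent critic is needed to catch recogniser blind spots.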
Step 6 - Emit the YAML spec
Recommended shape - consumable by a generic pipeline runner:
```yaml
pipeline:
  name: users-staging-refresh
  source:
    type: postgres
    connection: $PROD_RO_DSN
    schema: public
    table: users
  classification:
    output: pseudonymised
    regimes: [gdpr, cpra, hipaa]
  fields:
    - column: email
      operator: deterministic_substitution
      provider: faker
      provider_method: internet.email
      seed_strategy: hash(salt + value)
      salt_ref: vault://masking/users.email
    - column: ssn
      operator: tokenisation
      vault: vault://masking/users.ssn
    - column: dob
      operator: generalisation
      params:
        granularity: year
    - column: zip
      operator: truncation
      params:
        keep_chars: 3
        from: start
    - column: country
      operator: passthrough
  free_text_columns:
    - notes
    - support_message
  free_text_detector:
    type: presidio
    language: en
    score_threshold: 0.45
    entities: [PERSON, EMAIL_ADDRESS, PHONE_NUMBER, US_SSN, CREDIT_CARD, IP_ADDRESS]
    on_detect: replace
  audit:
    sample_rows: 100
    fail_on_critic_block: true
  output:
    type: postgres
    connection: $STAGING_RW_DSN
    schema: public
    table: users
  manifest:
    write_to: s3://masking-manifests/${run_id}.json
```

Step 7 - Worked example
A SaaS app refreshes its staging environment from prod nightly. The source table holds 4M users and has 22 columns, 3 of which are free-text. Synthesised spec:
```yaml
pipeline:
  name: prod-to-staging-nightly
  source: { type: postgres, table: users }
  classification: { output: pseudonymised, regimes: [gdpr, cpra] }
  fields:
    - { column: user_id, operator: passthrough }   # internal opaque ID
    - { column: email, operator: deterministic_substitution,
        provider: faker, provider_method: internet.email,
        seed_strategy: hash(salt + value), salt_ref: vault://prod/email }
    - { column: full_name, operator: substitution,
        provider: faker, provider_method: name }
    - { column: phone, operator: substitution,
        provider: faker, provider_method: phone_number }
    - { column: address_line1, operator: substitution,
        provider: faker, provider_method: address }
    - { column: country, operator: passthrough }
    - { column: language, operator: passthrough }
    - { column: created_at, operator: passthrough }
    - { column: last_login_at, operator: passthrough }
    - { column: signup_ip, operator: encryption,
        params: { algo: fpe-ff1 }, key_ref: vault://prod/ip-fpe }
    - { column: notes, operator: free_text_mask }
  free_text_detector:
    type: presidio
    language: en
    score_threshold: 0.5
    on_detect: replace
  audit: { sample_rows: 100, fail_on_critic_block: true }
```

Pipeline classification: pseudonymised (email is deterministic, IP is FPE-encrypted with key retained). The user explicitly accepts that this output remains in GDPR scope.
Anti-patterns
| Anti-pattern | Why it fails | Fix |
|---|---|---|
| Per-column operator without referential check | Joins break after masking | Group columns that share keys; apply deterministic operators consistently |
| Free-text columns skipped | Embedded PII (user-typed emails) leaks | Always run Presidio on any string column > ~50 chars |
| Claiming "anonymised" when any reversible op is in the pipeline | False GDPR compliance claim | Audit the pipeline; pseudonymised if any operator is reversible |
| No audit step | Operator failure or recogniser drift goes unnoticed | Always sample output and run pii-leak-critic |
| Salt vault key shared across pipelines | Salt-rotation breaks every downstream pipeline at once | Per-pipeline salt; rotate independently |
| No manifest | Cannot reproduce a past run; auditors can't trace lineage | Always emit manifest with version IDs |
| Pipeline runs on prod-write connection | Risk of writing masked data back over prod | Strict source = read-only DSN; output = staging-write DSN |
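Several of these anti-patterns can be caught by linting the spec before it runs. The following is a minimal sketch, assuming the YAML has already been parsed into a dictionary with the shape shown in Step 6; `lint_spec` and its messages are hypothetical and cover only four rows of the table.

```python
def lint_spec(spec):
    """Return a list of anti-pattern findings for a parsed pipeline spec."""
    pipeline = spec.get("pipeline", {})
    problems = []

    # Source and output must not point at the same connection.
    src = pipeline.get("source", {}).get("connection")
    out = pipeline.get("output", {}).get("connection")
    if src is not None and src == out:
        problems.append("source and output share a connection")

    # Free-text columns need a detector configured.
    if pipeline.get("free_text_columns") and "free_text_detector" not in pipeline:
        problems.append("free-text columns declared without a detector")

    if "audit" not in pipeline:
        problems.append("no audit step")
    if "manifest" not in pipeline:
        problems.append("no manifest")
    return problems


good_spec = {
    "pipeline": {
        "source": {"connection": "$PROD_RO_DSN"},
        "output": {"connection": "$STAGING_RW_DSN"},
        "free_text_columns": ["notes"],
        "free_text_detector": {"type": "presidio"},
        "audit": {"sample_rows": 100},
        "manifest": {"write_to": "s3://masking-manifests/${run_id}.json"},
    }
}
bad_spec = {
    "pipeline": {
        "source": {"connection": "$PROD_RW_DSN"},
        "output": {"connection": "$PROD_RW_DSN"},
        "free_text_columns": ["notes"],
    }
}
good_findings = lint_spec(good_spec)
bad_findings = lint_spec(bad_spec)
```

Comparing connection strings only detects the identical-DSN case; whether the source credential is truly read-only has to be verified against database grants.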