data-masking-techniques-reference
Pure-reference catalog of data-masking techniques and de-identification privacy models. Enumerates the seven canonical masking operators (substitution, shuffling, number/date variance, encryption, hashing, nulling, masking-out / character-scrambling) plus tokenisation, redaction, format-preserving encryption, and Microsoft Presidio's six built-in operators. Distinguishes reversible techniques (pseudonymisation candidates per GDPR Art. 4(5)) from irreversible techniques (anonymisation candidates). Maps techniques to NIST SP 800-188 privacy models - k-anonymity, l-diversity, t-closeness, differential privacy. Cites ISO/IEC 20889:2018 for the standard taxonomy. Use to pick the right masking operator per field type and risk level.
Overview
Masking is the act of transforming a real value into a substitute that breaks the link to the original subject while preserving testable properties (format, distribution, referential integrity). Which technique is correct depends on three things: whether the result must be reversible, whether the field is referentially shared across tables, and what privacy model the dataset must satisfy.
This skill is the pure reference that the pipeline builder (pii-masking-pipeline-builder) and the leak critic (pii-leak-critic) draw from to choose operators per field.
When to use
The seven canonical masking techniques
Drawing from the Wikipedia data-masking taxonomy (en.wikipedia.org/wiki/Data_masking) and ISO/IEC 20889:2018 (cite by stable ID; standard text behind paywall):
1. Substitution
Replace the real value with an authentic-looking value from a lookup table - "John Smith" → "Maria Garcia."
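A minimal sketch of the deterministic variant of substitution, assuming a small hypothetical lookup table and a secret key that the caller supplies. The names `REPLACEMENT_NAMES` and `substitute_name` are illustrative and not part of any library.

```python
import hashlib
import hmac

# Hypothetical lookup table of authentic-looking replacement values.
REPLACEMENT_NAMES = ["Maria Garcia", "Wei Chen", "Amara Okafor", "Lukas Meyer"]


def substitute_name(real_name: str, secret_key: bytes) -> str:
    """Pick a replacement from the lookup table, keyed on the real value.

    The same (real_name, secret_key) pair always yields the same replacement,
    so joins across tables still line up.
    """
    digest = hmac.new(secret_key, real_name.encode("utf-8"), hashlib.sha256).digest()
    index = int.from_bytes(digest[:8], "big") % len(REPLACEMENT_NAMES)
    return REPLACEMENT_NAMES[index]
```

With a table this small, many real names collapse onto the same replacement. A real pipeline would likely need a much larger table, or collision handling, if distinct inputs must stay distinct.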
2. Shuffling
Randomly rearrange values within a column: when a salary column is shuffled, each row still holds a real salary, but no longer the right person's.
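Shuffling can be sketched as follows, assuming rows are held as a list of dictionaries. The function name `shuffle_column` and the optional `seed` parameter are illustrative choices, added only to make the example repeatable.

```python
import random
from typing import Optional


def shuffle_column(rows: list, column: str, seed: Optional[int] = None) -> list:
    """Return new rows in which the values of `column` are randomly permuted.

    Other columns are untouched, and the input rows are not modified.
    """
    rng = random.Random(seed)
    values = [row[column] for row in rows]
    rng.shuffle(values)
    return [{**row, column: value} for row, value in zip(rows, values)]
```

The column keeps exactly the same multiset of values, so aggregates such as the mean or the maximum are preserved.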
3. Number / date variance
Apply a bounded random offset: salary ± 10 %, dates ± 120 days (Wikipedia data-masking page).
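A short sketch of bounded variance for a number and a date, using the ± 10 % and ± 120 day bounds quoted above. The helper names are hypothetical, and the uniform distribution is an assumption; the source does not prescribe one.

```python
import random
from datetime import date, timedelta


def vary_number(value: float, max_fraction: float, rng: random.Random) -> float:
    """Scale `value` by a random factor in [1 - max_fraction, 1 + max_fraction]."""
    return value * (1 + rng.uniform(-max_fraction, max_fraction))


def vary_date(value: date, max_days: int, rng: random.Random) -> date:
    """Shift `value` by a random whole number of days in [-max_days, +max_days]."""
    return value + timedelta(days=rng.randint(-max_days, max_days))


rng = random.Random()
masked_salary = vary_number(50_000, 0.10, rng)
masked_date = vary_date(date(2020, 1, 1), 120, rng)
```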
4. Encryption
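Apply a cryptographic algorithm with a key. Two sub-variants: general encryption, whose ciphertext does not keep the original format, and format-preserving encryption (FPE), whose output keeps the length and character set of the input.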
5. Hashing
Apply a one-way hash (SHA-256 / SHA-512) with optional salt.
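A minimal sketch of salted SHA-256 hashing with the standard library. The function name `salted_hash` and the choice to prepend the salt are assumptions made for illustration; other constructions (for example a keyed HMAC) are equally possible.

```python
import hashlib
import secrets


def salted_hash(value: str, salt: bytes) -> str:
    """Return the hex SHA-256 digest of salt || value."""
    return hashlib.sha256(salt + value.encode("utf-8")).hexdigest()


# Generate one random salt and store it separately from the masked data.
salt = secrets.token_bytes(16)
masked = salted_hash("123-45-6789", salt)
```

The same salt must be reused for every value in a column if the hashed values are meant to join across tables.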
6. Nulling out / deletion
Replace the value with NULL or remove the column entirely.
7. Masking-out / character scrambling
Reveal only part of the value and replace the rest with a masking character: credit card "**** **** **** 1234", email "j***@example.com".
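The two examples above can be sketched as follows. Both helpers are hypothetical; keeping separators unmasked and revealing only the first character of the email local part are design assumptions, not rules from the source.

```python
def mask_keep_last(value: str, visible: int = 4, masking_char: str = "*") -> str:
    """Mask every letter or digit except the last `visible` ones.

    Separators such as spaces and dashes are kept so the format stays readable.
    """
    remaining = visible
    masked_reversed = []
    for ch in reversed(value):
        if not ch.isalnum():
            masked_reversed.append(ch)
        elif remaining > 0:
            masked_reversed.append(ch)
            remaining -= 1
        else:
            masked_reversed.append(masking_char)
    return "".join(reversed(masked_reversed))


def mask_email(address: str, masking_char: str = "*") -> str:
    """Keep the first character of the local part and the whole domain."""
    local, sep, domain = address.partition("@")
    if not sep or not local:
        # Not a recognisable address: mask everything.
        return masking_char * len(address)
    return local[0] + masking_char * (len(local) - 1) + "@" + domain


print(mask_keep_last("4111 1111 1111 1234"))  # **** **** **** 1234
print(mask_email("jane@example.com"))         # j***@example.com
```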
Additional techniques
Tokenisation
Replace the real value with a token (random opaque string) and store the real-value → token map in a separate, access-controlled vault.
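A toy in-memory sketch of the token vault idea. The class `TokenVault` is hypothetical, and a real vault would be a separate, access-controlled and audited service rather than a Python dictionary.

```python
import secrets


class TokenVault:
    """Maps real values to random opaque tokens and back (illustration only)."""

    def __init__(self) -> None:
        self._token_by_value = {}
        self._value_by_token = {}

    def tokenise(self, value: str) -> str:
        """Return the existing token for `value`, or mint a new random one."""
        if value in self._token_by_value:
            return self._token_by_value[value]
        token = secrets.token_urlsafe(16)
        self._token_by_value[value] = token
        self._value_by_token[token] = value
        return token

    def detokenise(self, token: str) -> str:
        """Recover the real value; raises KeyError for an unknown token."""
        return self._value_by_token[token]
```

Because the token is random rather than derived from the value, it reveals nothing without the vault, which is what makes destroying the vault an irreversible step.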
Redaction
Remove the value entirely (no placeholder, no length signal).
Synthetic substitution
Replace with a synthetically generated value preserving distribution / format (faker-synthetic-data; synthea-healthcare-data for health records).
Microsoft Presidio anonymizer operators
Per microsoft.github.io/presidio/anonymizer, the Presidio Anonymizer engine supports six built-in operators:
| Operator | Parameters | Reversible | Maps to canonical technique |
|---|---|---|---|
| replace | new_value (defaults to <entity_type>) | No (random) / Yes (deterministic substitution) | #1 Substitution |
| redact | - | No | Redaction |
| mask | chars_to_mask, masking_char, from_end | No | #7 Masking-out |
| hash | hash_type (sha256 / sha512), salt | No (one-way) | #5 Hashing |
| encrypt | key | Yes (with key) | #4 Encryption |
| custom | lambda | Depends on lambda | (caller-defined) |
Invocation: engine.anonymize(text=, analyzer_results=, operators={"PERSON": OperatorConfig("replace", {"new_value": "BIP"})}).
OperatorConfig constructor signature: OperatorConfig(operator_name, params={}) (Presidio docs).
Reversible vs irreversible - pseudonymisation vs anonymisation
GDPR Art. 4(5) defines pseudonymisation as "processing of personal data in such a manner that the personal data can no longer be attributed to a specific data subject without the use of additional information, provided that such additional information is kept separately" (gdpr-info.eu/art-4-gdpr/).
| Technique | Pseudonymisation? | Anonymisation? |
|---|---|---|
| Deterministic substitution (same input → same output) | ✓ | - |
| Random substitution | - | ✓ |
| Shuffling | - | ✓ (when distribution-only) |
| Number / date variance | - | ✓ if variance ≥ identifying granularity |
| General encryption (key kept) | ✓ | - |
| FPE (key kept) | ✓ | - |
| Salted hashing (salt kept separately) | ✓ | - |
| Unsalted hashing of low-entropy field | ✗ (re-identifiable by enumeration) | ✗ |
| Nulling | - | ✓ |
| Masking-out (partial) | depends on revealed chars | depends |
| Tokenisation (vault kept) | ✓ | - |
| Tokenisation + vault destroyed | - | ✓ |
| Redaction | - | ✓ |
| Synthetic substitution | - | ✓ |
Implication: A "masking pipeline" output that uses reversible techniques is still personal data under GDPR - it remains in scope. Only fully irreversible output is out of GDPR scope per Recital 26.
Privacy models - NIST SP 800-188
NIST SP 800-188:2023 ("De-Identifying Government Datasets", csrc.nist.gov/pubs/sp/800/188/final) formalises four privacy models layered on top of the techniques above:
k-anonymity
A dataset is k-anonymous if every record is indistinguishable from at least k − 1 other records when projected on the quasi-identifiers (Sweeney 2002, cited in NIST 800-188).
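The definition can be checked directly by grouping records on their quasi-identifier values and taking the smallest group size. This is a minimal sketch; the function name and the list-of-dicts row format are assumptions.

```python
from collections import Counter


def k_anonymity(rows: list, quasi_identifiers: list) -> int:
    """Return the largest k for which `rows` is k-anonymous (0 if empty).

    k is the size of the smallest equivalence class, i.e. the smallest group
    of records sharing the same quasi-identifier values.
    """
    class_sizes = Counter(
        tuple(row[q] for q in quasi_identifiers) for row in rows
    )
    return min(class_sizes.values()) if class_sizes else 0
```

For example, if three records share ("021**", "30-39") and two share ("021**", "40-49"), the function returns 2.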
l-diversity
Strengthens k-anonymity by requiring at least l well-represented values of the sensitive attribute within each equivalence class (Machanavajjhala et al. 2007).
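A sketch of the simplest reading of "well-represented", usually called distinct l-diversity: count the distinct sensitive values in each equivalence class and take the minimum. Stricter variants (entropy or recursive l-diversity) are not covered here, and the function name is hypothetical.

```python
from collections import defaultdict


def distinct_l_diversity(rows: list, quasi_identifiers: list, sensitive: str) -> int:
    """Return the smallest number of distinct sensitive values in any class."""
    values_by_class = defaultdict(set)
    for row in rows:
        key = tuple(row[q] for q in quasi_identifiers)
        values_by_class[key].add(row[sensitive])
    if not values_by_class:
        return 0
    return min(len(values) for values in values_by_class.values())
```

A result of 1 means at least one equivalence class is homogeneous, so membership in that class alone discloses the sensitive value.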
t-closeness
Strengthens l-diversity by requiring the distribution of the sensitive attribute in each equivalence class be close (within t, by Earth Mover's Distance) to the distribution in the overall dataset (Li et al. 2007).
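A hedged sketch for a categorical sensitive attribute, assuming every pair of distinct categories is equally far apart. Under that assumption the Earth Mover's Distance reduces to half the sum of absolute differences between the two distributions. Numerical or hierarchical attributes need a different ground distance and are not handled. All names are illustrative.

```python
from collections import Counter, defaultdict


def _distribution(values) -> dict:
    """Relative frequency of each value."""
    counts = Counter(values)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


def max_class_distance(rows: list, quasi_identifiers: list, sensitive: str) -> float:
    """Largest distance between any class distribution and the overall one.

    The dataset satisfies t-closeness (under the equal-distance assumption)
    when the returned value is at most t.
    """
    if not rows:
        return 0.0
    overall = _distribution(row[sensitive] for row in rows)
    values_by_class = defaultdict(list)
    for row in rows:
        key = tuple(row[q] for q in quasi_identifiers)
        values_by_class[key].append(row[sensitive])
    worst = 0.0
    for values in values_by_class.values():
        local = _distribution(values)
        distance = 0.5 * sum(
            abs(local.get(category, 0.0) - share)
            for category, share in overall.items()
        )
        worst = max(worst, distance)
    return worst
```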
Differential privacy
A formal mathematical guarantee: the probability of any output changes by at most a multiplicative factor (e^ε) when a single record is added/removed. ε (epsilon) is the privacy budget - lower ε = stronger privacy.
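One common way to meet this guarantee for a counting query is the Laplace mechanism: add Laplace noise with scale sensitivity / ε, where a count has sensitivity 1. The sketch below is an illustration only; the function names are hypothetical, and a floating-point implementation like this one is not hardened against known precision attacks.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponentials.

    expovariate takes a rate, so rate 1/scale gives mean `scale`.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(true_count: int, epsilon: float, rng: random.Random) -> float:
    """Release a count with epsilon-differential privacy (sensitivity 1)."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return true_count + laplace_noise(1.0 / epsilon, rng)


rng = random.Random()
noisy = private_count(true_count=100, epsilon=1.0, rng=rng)
```

Halving ε doubles the noise scale, which is the concrete sense in which a lower ε buys stronger privacy at the cost of accuracy.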
Picking a technique per field
| Field characteristic | Recommended technique | Privacy model layer |
|---|---|---|
| Must round-trip for authorised consumer (payment processing) | Tokenisation (vault) or FPE | none (reversible) |
| Must join across tables, opaque value OK | Deterministic substitution / salted hashing | k-anonymity on quasi-identifiers |
| Free-text PII inside a log line | Redaction or replace-with-<TYPE> (Presidio analyzer + anonymizer) | - |
| Continuous numeric for analytics | Number variance | t-closeness if sensitive attribute |
| Categorical demographic (race, etc.) for analytics | Generalisation + l-diversity | l-diversity |
| Statistical query release | Differential privacy mechanism | DP |
| Demo / training, no analytics utility needed | Synthetic substitution (Faker / Synthea) | n/a (no real data) |
Anti-patterns
| Anti-pattern | Why it fails | Fix |
|---|---|---|
| Unsalted hashing of SSN | The SSN space is enumerable (~10⁹ values), so an attacker rebuilds the mapping table in minutes. | Secret salt or key per tenant, stored separately from the data; or tokenise via vault. |
| FPE for an analytics dataset | Format preservation enables a join attack against another dataset that can recover identity. | Use random substitution for analytics datasets that don't need format round-trip. |
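| "GDPR-compliant" pseudonymisation claim | Pseudonymised data is still personal data under GDPR: Art. 4(5) defines it as attributable with additional information, and Recital 26 treats it as information on an identifiable person. | Either mark output pseudonymised (in scope) or fully anonymise (out of scope). |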
| k = 2 anonymity | An attacker who places a subject in a 2-record equivalence class re-identifies them with probability up to 50 %. | k ≥ 5 typical; k = 10+ for high-risk datasets. |
| Shuffling a rare-value column | Outliers identify themselves regardless of position. | Combine shuffling with generalisation or suppression of outliers. |
| Number variance ± 1 % on salaries | The offset is smaller than the precision an attacker needs to match a record, so the value is effectively unmasked. | Variance must exceed the identifying granularity - ± 10 % minimum for salary. |
| Tokenisation without vault access controls | The vault becomes the single point of failure. | Strict access control + audit logging + separate key custody. |
| Differential privacy with ε = 100 | The bound e^100 is so large that the guarantee is vacuous. | ε ≤ 1 typical for strong privacy; ε ≤ 10 for relaxed cases. |