qa-test-data-privacy
PII detection, masking, and synthetic data generation for test environments: 8 skills (data-masking-techniques-reference, faker-synthetic-data, k-anonymity-verifier, pii-categories-reference, pii-masking-pipeline-builder, presidio-pii-detection, synthea-healthcare-data, test-data-governance-reference) and 1 agent (pii-leak-critic).
Install this plugin
/plugin install qa-test-data-privacy@testland-qa
Part of role bundle: qa-role-security
PII detection, masking, and synthetic data generation for test environments: 7 skills (pii-categories-reference, data-masking-techniques-reference, presidio-pii-detection, faker-synthetic-data, synthea-healthcare-data, k-anonymity-verifier, test-data-governance-reference), 1 build skill (pii-masking-pipeline-builder), and 1 agent (pii-leak-critic).
Components
| Type | Name | Description |
|---|---|---|
| skill | pii-categories-reference | Catalog of PII categories across GDPR, CPRA, NIST SP 800-122, HIPAA Safe Harbor |
| skill | data-masking-techniques-reference | Masking operators + NIST 800-188 privacy models (k-anonymity, l-diversity, t-closeness, DP) |
| skill | presidio-pii-detection | Microsoft Presidio analyzer + anonymizer for PII scanning + masking |
| skill | faker-synthetic-data | Faker libraries (Python, JavaScript, Java, .NET) for synthetic substitution |
| skill | synthea-healthcare-data | MITRE Synthea synthetic-patient simulator (FHIR / C-CDA / CSV output) |
| skill | pii-masking-pipeline-builder | Build a deployable masking pipeline spec from a source-data inventory |
| agent | pii-leak-critic | Audits masked output for leaks; classifies findings by regime; emits block/pass verdict |
| skill | k-anonymity-verifier | Verify k-anonymity / l-diversity / t-closeness on masked datasets (ARX, pycanon) |
| skill | test-data-governance-reference | Pure reference: test-data lifecycle governance (retention, cross-env promotion, deletion) |
Differentiation
This plugin scopes detection + masking + synthetic-substitution of existing data. Sibling neighbours:
Install
/plugin marketplace add testland/qa
/plugin install qa-test-data-privacy@testland-qa
Skills
data-masking-techniques-reference
Pure-reference catalog of data-masking techniques and de-identification privacy models. Enumerates the seven canonical masking operators (substitution, shuffling, number/date variance, encryption, hashing, nulling, masking-out / character-scrambling) plus tokenisation, redaction, format-preserving encryption, and Microsoft Presidio's six built-in operators. Distinguishes reversible techniques (pseudonymisation candidates per GDPR Art. 4(5)) from irreversible techniques (anonymisation candidates). Maps techniques to NIST SP 800-188 privacy models - k-anonymity, l-diversity, t-closeness, differential privacy. Cites ISO/IEC 20889:2018 for the standard taxonomy. Use to pick the right masking operator per field type and risk level.
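To make a few of the operators named above concrete, here is a minimal standard-library sketch of masking-out, hashing, nulling, and shuffling. The function names, the `keep_last` default, and the key handling are illustrative assumptions, not part of the skill itself.

```python
import hashlib
import hmac
import random


def mask_out(value: str, keep_last: int = 4, mask_char: str = "*") -> str:
    """Masking-out: hide every character except the last `keep_last`."""
    # Values that are too short are masked entirely so nothing leaks.
    if keep_last <= 0 or len(value) <= keep_last:
        return mask_char * len(value)
    return mask_char * (len(value) - keep_last) + value[-keep_last:]


def keyed_hash(value: str, key: bytes) -> str:
    """Hashing: HMAC-SHA-256, so equal inputs map to equal digests."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def null_out(_value):
    """Nulling: drop the value altogether."""
    return None


def shuffle_column(values, seed: int):
    """Shuffling: permute a column so values detach from their rows."""
    rng = random.Random(seed)  # seeded so test runs are reproducible
    shuffled = list(values)
    rng.shuffle(shuffled)
    return shuffled


masked_card = mask_out("4111111111111111")
digest = keyed_hash("jane@example.com", key=b"hypothetical-test-key")
shuffled_ages = shuffle_column([31, 45, 27, 62], seed=7)
```

Which of these counts as reversible or irreversible for a given dataset depends on who holds the key and what other data is available, so the sketch does not try to label them.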
faker-synthetic-data
Author and run Faker libraries (Python `Faker`, JavaScript `@faker-js/faker`, Java `JavaFaker`, .NET `Bogus`) for generating synthetic substitute data when masking pipelines remove real PII. Covers locale-aware generators, deterministic seeding for test reproducibility, the common provider methods (name / email / address / phone / SSN / credit card / IBAN / date / UUID / text), pytest fixture integration, and the trade-off between random vs deterministic substitution for referential integrity. Use after a PII detector flags fields that need synthetic replacement (distinct from synthetic-pii-generator which assembles fixtures from scratch - this is the underlying library skill those build skills compose).
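The referential-integrity trade-off can be illustrated without any third-party dependency. The sketch below does not use Faker: it stands in for a Faker provider with a small hard-coded pool (`FAKE_NAMES`, hypothetical) and picks a substitute by keyed hash, so the same real value always maps to the same fake value across tables. In Python Faker, reproducibility of the generated sequence is normally achieved by seeding, for example `Faker.seed(...)`, which addresses a related but different need.

```python
import hashlib
import hmac

# Hypothetical stand-in for values a Faker provider would generate.
FAKE_NAMES = ["Alex Morgan", "Sam Rivera", "Jordan Lee", "Casey Kim", "Riley Shah"]


def deterministic_substitute(real_value: str, key: bytes, pool=FAKE_NAMES) -> str:
    """Map a real value to a pool entry; equal inputs give equal outputs."""
    digest = hmac.new(key, real_value.encode("utf-8"), hashlib.sha256).digest()
    index = int.from_bytes(digest[:8], "big") % len(pool)
    return pool[index]


KEY = b"hypothetical-per-environment-key"

customers = [{"id": 1, "name": "Jane Doe"}, {"id": 2, "name": "John Roe"}]
orders = [{"order": 10, "customer_name": "Jane Doe"}]

masked_customers = [
    {**c, "name": deterministic_substitute(c["name"], KEY)} for c in customers
]
masked_orders = [
    {**o, "customer_name": deterministic_substitute(o["customer_name"], KEY)}
    for o in orders
]
```

After masking, the order row still joins to the first customer row by name. With a pool this small, distinct real values can collide on the same substitute; a real pipeline would need a much larger value space or an explicit mapping table.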
k-anonymity-verifier
Verifies that a masked dataset satisfies k-anonymity, l-diversity, and t-closeness by computing equivalence classes over chosen quasi-identifiers and reporting re-identification risk. Covers quasi-identifier selection heuristics, threshold guidance, pycanon API (k_anonymity / l_diversity / t_closeness / report), ARX Java API and GUI workflow, SmartNoise for differential-privacy comparison, and CI-gate integration. Distinct from data-masking-techniques-reference (which catalogs masking operators but defers k-anonymity measurement to dedicated tooling) and from presidio-pii-detection (which detects PII spans but offers no equivalence-class analysis). Use when you need to confirm whether a masked dataset meets a stated k, l, or t threshold before promoting it to a non-production environment.
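The equivalence-class computation behind k-anonymity and l-diversity is small enough to sketch in plain Python. This is not the pycanon or ARX API; the column names, sample rows, and `K_THRESHOLD` are assumptions chosen for illustration, and the l-diversity shown is the simplest "distinct values" variant.

```python
from collections import defaultdict


def equivalence_classes(rows, quasi_identifiers):
    """Group rows that share the same quasi-identifier values."""
    classes = defaultdict(list)
    for row in rows:
        key = tuple(row[q] for q in quasi_identifiers)
        classes[key].append(row)
    return classes


def k_anonymity(rows, quasi_identifiers):
    """k is the size of the smallest equivalence class."""
    classes = equivalence_classes(rows, quasi_identifiers)
    return min(len(members) for members in classes.values())


def l_diversity(rows, quasi_identifiers, sensitive):
    """Distinct l-diversity: fewest distinct sensitive values in any class."""
    classes = equivalence_classes(rows, quasi_identifiers)
    return min(
        len({row[sensitive] for row in members}) for members in classes.values()
    )


rows = [
    {"age_band": "30-39", "zip3": "021", "diagnosis": "flu"},
    {"age_band": "30-39", "zip3": "021", "diagnosis": "asthma"},
    {"age_band": "30-39", "zip3": "021", "diagnosis": "flu"},
    {"age_band": "40-49", "zip3": "100", "diagnosis": "diabetes"},
    {"age_band": "40-49", "zip3": "100", "diagnosis": "diabetes"},
]
QUASI_IDENTIFIERS = ["age_band", "zip3"]
K_THRESHOLD = 2  # hypothetical gate value

k = k_anonymity(rows, QUASI_IDENTIFIERS)
l = l_diversity(rows, QUASI_IDENTIFIERS, "diagnosis")
gate_passes = k >= K_THRESHOLD
print(f"k={k} l={l} gate_passes={gate_passes}")
```

Here the dataset is 2-anonymous but only 1-diverse: every row in the second class carries the same diagnosis, so membership in that class reveals the sensitive value even though k is met. Both functions raise `ValueError` on an empty dataset.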
pii-categories-reference
Pure-reference catalog of personally identifiable information (PII) categories across GDPR, CCPA/CPRA, NIST SP 800-122, and HIPAA. Defines what counts as personal data under each regime, enumerates the explicit identifiers each regulator lists (GDPR Art. 4(1) and Art. 9 special categories; CPRA sensitive personal information; NIST direct-identifier vs linkable distinction; HIPAA Safe Harbor 18 identifiers), and maps overlapping fields across jurisdictions so a masking pipeline knows which regulator's rules apply. Use as the authoritative source when authoring or reviewing masking rules, classifying a dataset's risk level, or scoping which fields a PII detector must catch.
pii-masking-pipeline-builder
Build-an-X workflow that produces a PII masking pipeline spec from a source-data inventory. Walks the author through (1) classifying each field against pii-categories-reference, (2) picking a masking operator from data-masking-techniques-reference, (3) deciding pseudonymisation (reversible, in GDPR scope) vs anonymisation (irreversible, out of scope), (4) ordering the pipeline (detect → operator → audit), and (5) emitting a deployable config for Presidio + Faker + Synthea wrappers. Output is a YAML pipeline spec plus a per-field rationale table. Use after classifying a dataset's PII risk; this is the workflow that translates classification into runnable masking config.
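The five steps above can be sketched as a small function that turns an inventory into a spec. Everything here is an illustrative assumption: the category names, the default operator per category, the `REVERSIBLE` set, and the spec layout are hypothetical, and the output is JSON rather than YAML only to stay within the standard library.

```python
import json

# Hypothetical defaults: which operator to apply per field category.
DEFAULT_OPERATORS = {
    "direct_identifier": "substitution",
    "quasi_identifier": "variance",
    "free_text": "redaction",
    "non_pii": "keep",
}
# Operators treated as reversible (pseudonymisation) in this sketch.
REVERSIBLE = {"encryption", "tokenisation", "format-preserving-encryption"}


def build_pipeline_spec(inventory):
    """Turn a classified field inventory into an ordered pipeline spec."""
    fields = []
    for item in inventory:
        # An explicit operator in the inventory overrides the default.
        operator = item.get("operator") or DEFAULT_OPERATORS[item["category"]]
        if operator == "keep":
            mode = "not-applicable"
        elif operator in REVERSIBLE:
            mode = "pseudonymisation"
        else:
            mode = "anonymisation-candidate"
        fields.append(
            {
                "name": item["name"],
                "category": item["category"],
                "operator": operator,
                "mode": mode,
            }
        )
    return {"stages": ["detect", "operator", "audit"], "fields": fields}


inventory = [
    {"name": "email", "category": "direct_identifier"},
    {"name": "birth_date", "category": "quasi_identifier"},
    {"name": "customer_id", "category": "direct_identifier", "operator": "tokenisation"},
    {"name": "order_total", "category": "non_pii"},
]
spec = build_pipeline_spec(inventory)
print(json.dumps(spec, indent=2))
```

Keeping the `mode` next to each operator makes the pseudonymisation-versus-anonymisation decision visible in the per-field rationale instead of leaving it implicit in the operator choice.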
presidio-pii-detection
Author and run Microsoft Presidio PII detection - wraps presidio-analyzer (PII detector) + presidio-anonymizer (replace/redact/mask/hash/encrypt operators) for scanning datasets, log streams, and free-text fields. Covers AnalyzerEngine + AnonymizerEngine setup, built-in recognizers (PERSON, EMAIL_ADDRESS, CREDIT_CARD, US_SSN, IBAN_CODE, country-specific IDs across US/UK/Spain/Italy/Poland/Singapore/Australia/India and more), custom PatternRecognizer authoring, score thresholds, and CI gating. Use when scanning *existing* data for PII (vs synthesising fresh fixtures with synthetic-pii-generator).
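The detect-then-anonymise flow with a score threshold can be illustrated with a dependency-free stand-in. This sketch does not call Presidio: the regex patterns, the scores, and the `analyze` / `anonymize` helper names are hypothetical, and it borrows only the entity labels mentioned above.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    entity_type: str
    start: int
    end: int
    score: float


# (entity type, pattern, confidence score); patterns and scores are illustrative.
RECOGNIZERS = [
    ("EMAIL_ADDRESS", re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"), 0.9),
    ("US_SSN", re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), 0.6),
]


def analyze(text: str, score_threshold: float = 0.5):
    """Return findings whose recognizer score meets the threshold."""
    findings = []
    for entity_type, pattern, score in RECOGNIZERS:
        if score < score_threshold:
            continue
        for match in pattern.finditer(text):
            findings.append(Finding(entity_type, match.start(), match.end(), score))
    return sorted(findings, key=lambda f: f.start)


def anonymize(text: str, findings) -> str:
    """Replace each finding with a placeholder naming its entity type."""
    # Work right to left so earlier offsets stay valid after each replacement.
    for f in sorted(findings, key=lambda f: f.start, reverse=True):
        text = text[: f.start] + f"<{f.entity_type}>" + text[f.end :]
    return text


sample = "Contact jane@example.com, SSN 123-45-6789."
all_findings = analyze(sample)
strict_findings = analyze(sample, score_threshold=0.8)
masked = anonymize(sample, all_findings)
ci_gate_blocks = len(all_findings) > 0  # a CI job could fail the build on this
```

Raising the threshold to 0.8 drops the lower-scored SSN pattern, which is the precision-versus-recall trade-off a CI gate has to settle. Overlapping findings are not resolved here.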
synthea-healthcare-data
Author and run Synthea (MITRE's open-source synthetic patient population simulator) to produce HIPAA-safe synthetic medical records for testing health IT systems. Covers Gradle build, population-size and state-specific generation, FHIR R4 / STU3 / DSTU2 / C-CDA / CSV / CPCDS output formats, disease-module customisation, and the lifecycle-simulation approach (birth-through-death patient journeys with realistic demographics). Use when testing FHIR servers, EHR integrations, claims processing, or any health IT system that needs realistic patient records without HIPAA exposure (distinct from faker-synthetic-data which is generic; this is health-domain-specific).
test-data-governance-reference
Pure-reference catalog of test-data lifecycle governance: retention schedules for test datasets, cross-environment data-sharing agreements, deletion of test data containing real PII, refresh cadence, access controls, and the legal basis for each policy under GDPR Art. 5 storage limitation and NIST SP 800-122. Use when defining a data-steward role for test environments, authoring a retention policy for a test database, scoping a data-sharing agreement before promoting a dataset from production to staging, or determining the deletion timeline for any test fixture that contains live personal data.
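Determining a deletion timeline reduces to date arithmetic once a retention schedule exists. The sketch below assumes a hypothetical schedule (`RETENTION_DAYS`) and hypothetical classification labels; the actual periods are a policy decision that this reference leaves to the data steward.

```python
from datetime import date, timedelta

# Hypothetical retention schedule, in days, per dataset classification.
RETENTION_DAYS = {
    "contains_real_pii": 30,
    "synthetic_only": 365,
}


def deletion_deadline(created_on: date, classification: str) -> date:
    """Date by which the dataset must be deleted under the schedule."""
    return created_on + timedelta(days=RETENTION_DAYS[classification])


def is_overdue(created_on: date, classification: str, today: date) -> bool:
    """True when the dataset has outlived its retention period."""
    return today > deletion_deadline(created_on, classification)


deadline = deletion_deadline(date(2024, 1, 1), "contains_real_pii")
overdue = is_overdue(date(2024, 1, 1), "contains_real_pii", today=date(2024, 2, 1))
```

Passing `today` explicitly instead of calling `date.today()` inside the function keeps the check deterministic in tests.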