qa-ml-models

ML model testing: 6 skills (alibi-explainability, deepchecks-tests, evidently-monitoring, fairlearn-fairness, giskard-tests, model-performance-regression-gate) and 2 agents (data-drift-incident-responder, model-fairness-reviewer). Covers vulnerability scanning, drift monitoring, group fairness, and per-prediction explainability.

Install this plugin

/plugin install qa-ml-models@testland-qa

Part of role bundle: qa-role-ai

ML model testing: vulnerability scanning, data validation, drift monitoring, group fairness, and per-prediction explainability. Five skills cover Giskard (scan() + test catalog), Deepchecks (suites for data integrity / train-test / model evaluation), Evidently (drift monitoring + 100+ metrics), Fairlearn (MetricFrame + Reductions mitigation), and Alibi Explain (Anchors / SHAP / Integrated Gradients / Counterfactuals); a sixth skill (model-performance-regression-gate) blocks regressing retrains in CI. Two agents complete the plugin: a reviewer (model-fairness-reviewer) that gates promotion based on the model's risk class, and an incident responder (data-drift-incident-responder) that triages live drift alerts.

Components

Type	Name	Description
Skill	giskard-tests	`scan()` for performance bias / data leakage / robustness / ethical issues; auto-generates test suites
Skill	deepchecks-tests	Data integrity, train-test validation, model evaluation suites - same checks across research / CI / production
Skill	evidently-monitoring	Reference-vs-current drift detection; PSI / KS / Wasserstein stat tests; production scheduling
Skill	fairlearn-fairness	`MetricFrame` group-disaggregated metrics; `ExponentiatedGradient` + `ThresholdOptimizer` mitigation
Skill	alibi-explainability	Anchors / SHAP / Integrated Gradients / Counterfactuals; per-prediction explanation logging for high-risk systems
Skill	model-performance-regression-gate	CI gate that blocks a retrained model regressing on held-out metrics vs production
Agent	model-fairness-reviewer	Adversarial reviewer that gates promotion on risk-class-appropriate evidence; refuses ✅ when sensitive features missing or intersectional analysis absent
Agent	data-drift-incident-responder	Triages a live Evidently drift alert into ranked root-cause hypotheses (schema change, pipeline bug, skew, seasonality, population shift) plus a remediation checklist; decides rollback, retrain, quarantine, or alert re-tune

Install

/plugin marketplace add testland/qa
/plugin install qa-ml-models@testland-qa

Skills

alibi-explainability

Use Alibi Explain to generate model explanations - Anchors, Integrated Gradients, Kernel/Tree SHAP, ALE, Counterfactual Instances. Wires explainer.fit + explainer.explain into model-evaluation pipelines so that every flagged prediction ships with a "why" record auditors can reason about.

deepchecks-tests

Run Deepchecks suites (data integrity, train-test validation, model evaluation) on tabular / NLP / vision data + models. Pass `result.passed_conditions()` to CI to gate on regressions; the same checks run during research, CI, and production monitoring per the Deepchecks lifecycle posture.

evidently-monitoring

Use Evidently OSS (100+ evaluation metrics, declarative testing API) to detect data drift, target drift, and model-performance regression. Wire it into CI as a gate (a Report run with include_tests) and into production monitoring as a continuous check; reports are emitted as HTML + JSON for both human review and pipeline assertions. Use when you need a drift or quality gate, or a scheduled monitoring job, for a tabular ML model. This skill is built on the Evidently API specifically; for Deepchecks-based validation suites use deepchecks-tests instead.
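
To make one of the drift statistics named in the Components table concrete, the following is a minimal standard-library sketch of the Population Stability Index (PSI) between a reference sample and a current sample. It is not Evidently's implementation: the function name `psi`, the equal-width binning over the reference range, and the `eps` floor for empty bins are all assumptions chosen for illustration.

```python
import math


def psi(reference, current, bins=10, eps=1e-6):
    """Population Stability Index between two numeric samples (illustrative)."""
    lo, hi = min(reference), max(reference)
    # Equal-width bins over the reference range; fall back to 1.0 if it is constant.
    width = (hi - lo) / bins or 1.0

    def proportions(values):
        counts = [0] * bins
        for v in values:
            # Clamp out-of-range values into the first or last bin.
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        # Floor each proportion at eps so the logarithm below is always defined.
        return [max(c / len(values), eps) for c in counts]

    ref_p, cur_p = proportions(reference), proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_p, cur_p))


reference = [i / 100 for i in range(100)]        # roughly uniform on [0, 1)
current = [0.5 + i / 200 for i in range(100)]    # shifted into the upper half
drift_score = psi(reference, current)
```

A gate would compare `drift_score` against a threshold agreed for the feature. Identical samples score zero, and the shifted sample above scores well above it.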

fairlearn-fairness

Compute group fairness metrics (selection rate, demographic parity, equalized odds) per sensitive feature with `MetricFrame`, then mitigate disparities using Reductions algorithms (`ExponentiatedGradient` with `constraints` set to `DemographicParity` or `EqualizedOdds`). Wire group-disaggregated assertions into the model-evaluation gate.
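
The arithmetic behind a group-disaggregated selection rate and the demographic parity difference can be sketched in plain Python. This is a hand-rolled illustration rather than Fairlearn's API: the helper names and the toy data are assumptions, and a real pipeline would obtain the same numbers from `MetricFrame`.

```python
from collections import defaultdict


def selection_rates(y_pred, sensitive):
    """Fraction of positive predictions per sensitive-feature group."""
    totals, positives = defaultdict(int), defaultdict(int)
    for pred, group in zip(y_pred, sensitive):
        totals[group] += 1
        positives[group] += int(pred == 1)
    return {g: positives[g] / totals[g] for g in totals}


def demographic_parity_difference(y_pred, sensitive):
    """Gap between the highest and lowest group selection rates."""
    rates = selection_rates(y_pred, sensitive).values()
    return max(rates) - min(rates)


# Toy data: group "a" is selected 3 times out of 4, group "b" once out of 4.
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
sensitive = ["a", "a", "a", "a", "b", "b", "b", "b"]

rates = selection_rates(y_pred, sensitive)
gap = demographic_parity_difference(y_pred, sensitive)
```

An evaluation gate could then assert that `gap` stays below a limit chosen for the model's risk class.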

giskard-tests

Test ML models with Giskard's scan() vulnerability detector + test catalog (performance, robustness, fairness, data leakage, ethical issues) for tabular and NLP models. Wrap a prediction function in giskard.Model + a DataFrame in giskard.Dataset; emit test suites that pass/fail in CI.

model-performance-regression-gate

Computes held-out metrics (accuracy, F1, AUC, RMSE) for a retrained model and compares them against the current production model, failing promotion when any metric regresses beyond a configured tolerance. Adds per-segment checks via Deepchecks WeakSegmentsPerformance so a model that improves globally but regresses on a key slice is still blocked. Use when a retrained model is a candidate for promotion and the CI pipeline must enforce a per-metric pass/fail gate before the artifact is pushed to the model registry.
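
The per-metric comparison described above might look like the following sketch. The metric values, the tolerances, and the `HIGHER_IS_BETTER` registry are hypothetical; the per-segment Deepchecks check is omitted, and only the global pass/fail logic is shown.

```python
# Hypothetical registry: True means a higher value is better for that metric.
HIGHER_IS_BETTER = {"accuracy": True, "f1": True, "auc": True, "rmse": False}


def regression_failures(production, candidate, tolerance):
    """Return the metrics on which the candidate regresses beyond tolerance."""
    failures = {}
    for name, prod_value in production.items():
        cand_value = candidate[name]
        # Express the change so that a positive drop always means "worse".
        if HIGHER_IS_BETTER[name]:
            drop = prod_value - cand_value
        else:
            drop = cand_value - prod_value
        if drop > tolerance.get(name, 0.0):
            failures[name] = drop
    return failures


# Illustrative numbers only; real values would come from the held-out evaluation.
production = {"accuracy": 0.91, "f1": 0.88, "auc": 0.95, "rmse": 1.20}
candidate = {"accuracy": 0.92, "f1": 0.85, "auc": 0.95, "rmse": 1.21}
tolerance = {"accuracy": 0.01, "f1": 0.01, "auc": 0.005, "rmse": 0.05}

failures = regression_failures(production, candidate, tolerance)
```

Here the candidate improves accuracy but loses about 0.03 F1, so `failures` contains only `f1` and the gate would block promotion. A CI wrapper could exit non-zero whenever `failures` is non-empty.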

Agents

data-drift-incident-responder

Receives a live Evidently drift alert (HTML or JSON report) and produces a ranked root-cause hypothesis list plus a remediation checklist. Distinguishes upstream schema change, seasonality, training-serving skew, pipeline bug, and genuine population shift; recommends rollback, retrain, quarantine, feature investigation, or alert re-tuning as appropriate. Use when a DataDriftPreset or TestColumnDrift alert fires in production monitoring and the on-call engineer needs a structured triage before acting.

model-fairness-reviewer

Adversarial reviewer of ML model fairness + explainability evidence before promotion. Validates that fairness metrics (Fairlearn MetricFrame), drift detectors (Evidently/Deepchecks), vulnerability scans (Giskard), and per-prediction explanations (Alibi) collectively cover the model's risk class. Refuses to ✅ when sensitive features are missing, when intersectional analysis is absent, or when a high-risk model lacks per-prediction explanation logging.
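
The three refusal conditions can be read as a simple checklist. The sketch below encodes them over a hypothetical `evidence` dictionary; the key names and the `"high"` risk label are assumptions made for illustration, not the agent's actual input format.

```python
def review_blockers(evidence):
    """List the reasons a reviewer would refuse to approve promotion."""
    blockers = []
    if not evidence.get("sensitive_features"):
        blockers.append("sensitive features missing")
    if not evidence.get("intersectional_analysis"):
        blockers.append("intersectional analysis absent")
    # Explanation logging is only mandatory for high-risk models.
    if evidence.get("risk_class") == "high" and not evidence.get("explanation_logging"):
        blockers.append("high-risk model lacks per-prediction explanation logging")
    return blockers


# Hypothetical evidence bundle for a high-risk model without explanation logs.
evidence = {
    "risk_class": "high",
    "sensitive_features": ["sex", "age_band"],
    "intersectional_analysis": True,
    "explanation_logging": False,
}
blockers = review_blockers(evidence)
```

An empty list corresponds to the ✅ outcome; any entry is a reason to refuse.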