qa-ml-models
ML model testing: 6 skills (alibi-explainability, deepchecks-tests, evidently-monitoring, fairlearn-fairness, giskard-tests, model-performance-regression-gate) and 2 agents (data-drift-incident-responder, model-fairness-reviewer). Covers vulnerability scanning, drift monitoring, group fairness, and per-prediction explainability.
Install this plugin
/plugin install qa-ml-models@testland-qa
Part of role bundle: qa-role-ai
qa-ml-models
ML model testing: vulnerability scanning, data validation, drift monitoring, group fairness, and per-prediction explainability. Five skills covering Giskard (scan() + test catalog), Deepchecks (suites for data integrity / train-test / model evaluation), Evidently (drift monitoring + 100+ metrics), Fairlearn (MetricFrame
Components
| Type | Name | Description |
|---|---|---|
| Skill | giskard-tests | scan() for performance bias / data leakage / robustness / ethical issues; auto-generates test suites |
| Skill | deepchecks-tests | Data integrity, train-test validation, model evaluation suites - same checks across research / CI / production |
| Skill | evidently-monitoring | Reference-vs-current drift detection; PSI / KS / Wasserstein stat tests; production scheduling |
| Skill | fairlearn-fairness | MetricFrame group-disaggregated metrics; ExponentiatedGradient + ThresholdOptimizer mitigation |
| Skill | alibi-explainability | Anchors / SHAP / Integrated Gradients / Counterfactuals; per-prediction explanation logging for high-risk systems |
| Skill | model-performance-regression-gate | CI gate that blocks a retrained model regressing on held-out metrics vs production |
| Agent | model-fairness-reviewer | Adversarial reviewer that gates promotion on risk-class-appropriate evidence; refuses ✅ when sensitive features missing or intersectional analysis absent |
| Agent | data-drift-incident-responder | Triages a live Evidently drift alert into ranked root-cause hypotheses (schema change, pipeline bug, skew, seasonality, population shift) plus a remediation checklist; decides rollback, retrain, quarantine, or alert re-tune |
Install
/plugin marketplace add testland/qa
/plugin install qa-ml-models@testland-qa
Skills
alibi-explainability
Use Alibi Explain to generate model explanations - Anchors, Integrated Gradients, Kernel/Tree SHAP, ALE, Counterfactual Instances. Wires explainer.fit + explainer.explain into model-evaluation pipelines so that every flagged prediction ships with a "why" record auditors can reason about.
deepchecks-tests
Run Deepchecks suites (data integrity, train-test validation, model evaluation) on tabular / NLP / vision data + models. Gate CI on regressions with the boolean outcome: `passed_conditions()` on a single check result, or `passed()` on a suite result. The same checks run during research, CI, and production monitoring per the Deepchecks lifecycle posture.
evidently-monitoring
Use Evidently OSS (100+ evaluation metrics, declarative testing API) to detect data drift, target drift, and model-performance regression. Wired into CI as a gate (a Report run with include_tests) and into production monitoring as a continuous check; reports are emitted as HTML + JSON for both human review and pipeline assertions. Use when you need a drift or quality gate, or a scheduled monitoring job, for a tabular ML model. Built on the Evidently API specifically: for Deepchecks-based validation suites use deepchecks-tests instead.
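To make the drift statistics concrete, here is a minimal sketch of one of them, the Population Stability Index (PSI), written by hand in plain Python. It is not Evidently's implementation or API: the function name `psi`, the fixed bin edges, and the `eps` clamp are assumptions chosen for illustration.

```python
import math
from typing import List, Sequence


def psi(reference: Sequence[float], current: Sequence[float],
        edges: Sequence[float], eps: float = 1e-6) -> float:
    """Population Stability Index of `current` vs `reference` over shared, sorted bin edges."""
    if not reference or not current:
        raise ValueError("both samples must be non-empty")

    def proportions(values: Sequence[float]) -> List[float]:
        counts = [0] * (len(edges) + 1)
        for v in values:
            # Bin index = number of edges at or below v.
            counts[sum(1 for e in edges if v >= e)] += 1
        # Clamp to eps so an empty bin cannot cause log(0) or division by zero.
        return [max(c / len(values), eps) for c in counts]

    ref_p = proportions(reference)
    cur_p = proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_p, cur_p))


# Hypothetical feature values: the current batch has collapsed onto one bin.
reference = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
current = [0.8] * 10
edges = [0.25, 0.5, 0.75]

print(psi(reference, reference, edges))  # identical samples give 0.0
print(psi(reference, current, edges) > 0.25)  # a collapsed distribution scores high
```

The alerting threshold is a judgment call per feature; any cut-off used in a gate should be treated as a tunable assumption rather than a property of the statistic.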
fairlearn-fairness
Compute group fairness metrics (selection rate, demographic parity, equalized odds) per sensitive feature with `MetricFrame`, then mitigate disparities using Reductions algorithms (`ExponentiatedGradient` with `constraints` set to `DemographicParity()` or `EqualizedOdds()`). Wire group-disaggregated assertions into the model-evaluation gate.
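The idea behind a group-disaggregated assertion can be sketched without the library. The snippet below computes selection rate per group and the demographic parity difference by hand; it is a plain-Python illustration of what the metric measures, not Fairlearn's `MetricFrame` API, and the function names, sample data, and 0.1 limit are assumptions.

```python
from collections import defaultdict
from typing import Dict, Hashable, Sequence


def selection_rate_by_group(y_pred: Sequence[int],
                            sensitive: Sequence[Hashable]) -> Dict[Hashable, float]:
    """Fraction of positive (1) predictions within each sensitive-feature group."""
    if len(y_pred) != len(sensitive):
        raise ValueError("y_pred and sensitive must have the same length")
    positives: Dict[Hashable, int] = defaultdict(int)
    totals: Dict[Hashable, int] = defaultdict(int)
    for pred, group in zip(y_pred, sensitive):
        totals[group] += 1
        positives[group] += int(pred == 1)
    return {group: positives[group] / totals[group] for group in totals}


def demographic_parity_difference(rates: Dict[Hashable, float]) -> float:
    """Gap between the highest and lowest group selection rates."""
    return max(rates.values()) - min(rates.values())


# Hypothetical predictions for two groups of four people each.
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
sensitive = ["a", "a", "a", "a", "b", "b", "b", "b"]

rates = selection_rate_by_group(y_pred, sensitive)
gap = demographic_parity_difference(rates)
print(rates)  # {'a': 0.75, 'b': 0.25}
print(gap)    # 0.5

# An evaluation gate would fail here: the gap exceeds the assumed 0.1 limit.
gate_passed = gap <= 0.1
```

In a real pipeline the same assertion would run once per sensitive feature, and again on intersections of features, so that a disparity hidden in a subgroup is not averaged away.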
giskard-tests
Test ML models with Giskard's scan() vulnerability detector + test catalog (performance, robustness, fairness, data leakage, ethical issues) for tabular and NLP models. Wrap a prediction function in giskard.Model + a DataFrame in giskard.Dataset; emit test suites that pass/fail in CI.
model-performance-regression-gate
Computes held-out metrics (accuracy, F1, AUC, RMSE) for a retrained model and compares them against the current production model, failing promotion when any metric regresses beyond a configured tolerance. Adds per-segment checks via Deepchecks WeakSegmentsPerformance so a model that improves globally but regresses on a key slice is still blocked. Use when a retrained model is a candidate for promotion and the CI pipeline must enforce a per-metric pass/fail gate before the artifact is pushed to the model registry.
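The per-metric comparison described above might look like the following sketch. It assumes metrics arrive as plain dictionaries; the names `find_regressions` and `LOWER_IS_BETTER` and all the numbers are hypothetical, not the skill's actual interface.

```python
from typing import Dict, List

# Metrics where a smaller value is better; all others are treated as higher-is-better.
LOWER_IS_BETTER = {"rmse"}


def find_regressions(production: Dict[str, float],
                     candidate: Dict[str, float],
                     tolerance: Dict[str, float]) -> List[str]:
    """Return one message per metric that regressed beyond its tolerance."""
    failures: List[str] = []
    for name, prod_value in production.items():
        if name not in candidate:
            failures.append(f"{name}: missing from candidate metrics")
            continue
        allowed = tolerance.get(name, 0.0)  # default: no regression allowed
        if name in LOWER_IS_BETTER:
            regression = candidate[name] - prod_value
        else:
            regression = prod_value - candidate[name]
        if regression > allowed:
            failures.append(
                f"{name}: regressed by {regression:.4f} (tolerance {allowed:.4f})"
            )
    return failures


# Hypothetical held-out metrics for the production and retrained models.
production = {"accuracy": 0.90, "f1": 0.85, "rmse": 1.20}
candidate = {"accuracy": 0.91, "f1": 0.80, "rmse": 1.21}
tolerance = {"accuracy": 0.01, "f1": 0.01, "rmse": 0.05}

failures = find_regressions(production, candidate, tolerance)
for message in failures:
    print(message)
# A CI step would exit non-zero when `failures` is non-empty.
```

Here accuracy improves and RMSE worsens within tolerance, but F1 drops by about 0.05, so promotion is blocked. The same function could be called once per segment, with per-segment metric dictionaries, to catch a model that improves globally but regresses on a key slice.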
Agents
data-drift-incident-responder
Receives a live Evidently drift alert (HTML or JSON report) and produces a ranked root-cause hypothesis list plus a remediation checklist. Distinguishes upstream schema change, seasonality, training-serving skew, pipeline bug, and genuine population shift; recommends rollback, retrain, quarantine, feature investigation, or alert re-tuning as appropriate. Use when a DataDriftPreset or TestColumnDrift alert fires in production monitoring and the on-call engineer needs a structured triage before acting.
model-fairness-reviewer
Adversarial reviewer of ML model fairness + explainability evidence before promotion. Validates that fairness metrics (Fairlearn MetricFrame), drift detectors (Evidently/Deepchecks), vulnerability scans (Giskard), and per-prediction explanations (Alibi) collectively cover the model's risk class. Refuses to ✅ when sensitive features are missing, when intersectional analysis is absent, or when a high-risk model lacks per-prediction explanation logging.
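The three refusal conditions can be written down as a small checklist function. This is only a sketch of the rules as stated: the function name, its parameters, and the `"high"` risk-class label are hypothetical and do not describe how the agent is actually implemented.

```python
from typing import List, Sequence


def review_blockers(risk_class: str,
                    sensitive_features: Sequence[str],
                    has_intersectional_analysis: bool,
                    has_explanation_logging: bool) -> List[str]:
    """Return the reasons to refuse promotion; an empty list means no blocker was found."""
    blockers: List[str] = []
    if not sensitive_features:
        blockers.append("sensitive features are missing")
    if not has_intersectional_analysis:
        blockers.append("intersectional analysis is absent")
    # Explanation logging is only mandatory for the (assumed) "high" risk class.
    if risk_class == "high" and not has_explanation_logging:
        blockers.append("high-risk model lacks per-prediction explanation logging")
    return blockers


# Hypothetical evidence bundle for a high-risk model without explanation logging.
blockers = review_blockers(
    risk_class="high",
    sensitive_features=["sex", "age_band"],
    has_intersectional_analysis=True,
    has_explanation_logging=False,
)
print(blockers)
```

Returning every blocker, not just the first, lets the model owner fix all gaps in one pass before asking for a second review.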