Testland

qa-ml-models

ML model testing: 6 skills (alibi-explainability, deepchecks-tests, evidently-monitoring, fairlearn-fairness, giskard-tests, model-performance-regression-gate) and 2 agents (data-drift-incident-responder, model-fairness-reviewer). Covers vulnerability scanning, drift monitoring, group fairness, and per-prediction explainability.

Install this plugin

/plugin install qa-ml-models@testland-qa

Part of role bundle: qa-role-ai

qa-ml-models

ML model testing: vulnerability scanning, data validation, drift monitoring, group fairness, and per-prediction explainability. Five skills cover Giskard (scan() + test catalog), Deepchecks (suites for data integrity / train-test / model evaluation), Evidently (drift monitoring + 100+ metrics), Fairlearn (MetricFrame + Reductions mitigation), and Alibi Explain (Anchors / SHAP / Integrated Gradients / Counterfactuals); a sixth skill (model-performance-regression-gate) blocks regressing retrained models in CI. Two agents complete the plugin: a reviewer (model-fairness-reviewer) that gates promotion based on the model's risk class, and an incident responder (data-drift-incident-responder) that triages live drift alerts.

Components

| Type | Name | Description |
| --- | --- | --- |
| Skill | giskard-tests | scan() for performance bias / data leakage / robustness / ethical issues; auto-generates test suites |
| Skill | deepchecks-tests | Data integrity, train-test validation, model evaluation suites - same checks across research / CI / production |
| Skill | evidently-monitoring | Reference-vs-current drift detection; PSI / KS / Wasserstein stat tests; production scheduling |
| Skill | fairlearn-fairness | MetricFrame group-disaggregated metrics; ExponentiatedGradient + ThresholdOptimizer mitigation |
| Skill | alibi-explainability | Anchors / SHAP / Integrated Gradients / Counterfactuals; per-prediction explanation logging for high-risk systems |
| Agent | model-fairness-reviewer | Adversarial reviewer that gates promotion on risk-class-appropriate evidence; refuses ✅ when sensitive features missing or intersectional analysis absent |
| Agent | data-drift-incident-responder | Triages a live Evidently drift alert into ranked root-cause hypotheses (schema change, pipeline bug, skew, seasonality, population shift) plus a remediation checklist; decides rollback, retrain, quarantine, or alert re-tune |
| Skill | model-performance-regression-gate | CI gate that blocks a retrained model regressing on held-out metrics vs production. |

Install

/plugin marketplace add testland/qa
/plugin install qa-ml-models@testland-qa

Skills

alibi-explainability

Use Alibi Explain to generate model explanations - Anchors, Integrated Gradients, Kernel/Tree SHAP, ALE, Counterfactual Instances. Wires explainer.fit + explainer.explain into model-evaluation pipelines so that every flagged prediction ships with a "why" record auditors can reason about.

deepchecks-tests

Run Deepchecks suites (data integrity, train-test validation, model evaluation) on tabular / NLP / vision data + models. Gate CI on the suite outcome (`result.passed()` on a `SuiteResult`; `passed_conditions()` applies to an individual `CheckResult`) so that regressions fail the build. The same checks run during research, CI, and production monitoring per the Deepchecks lifecycle posture.

evidently-monitoring

Use Evidently OSS (100+ evaluation metrics, declarative testing API) to detect data drift, target drift, and model-performance regression. Wire it into CI as a gate (a Report run with include_tests) and into production monitoring as a continuous check; reports render as HTML and JSON for both human review and pipeline assertions. Use when you need a drift or quality gate, or a scheduled monitoring job, for a tabular ML model. This skill is built on the Evidently API specifically; for Deepchecks-based validation suites, use deepchecks-tests instead.
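To make the reference-vs-current comparison concrete, here is a minimal sketch of the Population Stability Index (PSI), one of the drift statistics this plugin lists for Evidently, computed by hand in plain Python. It does not call Evidently and is not how the skill itself is implemented; the function name `psi`, the equal-width binning, the bin count, and the 0.2 alert threshold are all assumptions chosen for illustration.

```python
import math


def psi(reference, current, bins=4, eps=1e-6):
    """Population Stability Index of `current` against `reference`.

    Bins are equal-width over the reference range (an assumption; quantile
    bins are another common choice). Empty bins are floored at `eps` so the
    logarithm stays defined.
    """
    lo, hi = min(reference), max(reference)
    if hi == lo:
        raise ValueError("reference column is constant; PSI is undefined")
    width = (hi - lo) / bins

    def shares(values):
        counts = [0] * bins
        for v in values:
            # Values outside the reference range fall into the edge bins.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        return [max(c / len(values), eps) for c in counts]

    ref, cur = shares(reference), shares(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


# Synthetic data, made up for this example.
reference = [i / 100 for i in range(100)]        # spread over 0.00 .. 0.99
shifted = [0.5 + i / 200 for i in range(100)]    # squeezed into 0.50 .. 0.995

PSI_THRESHOLD = 0.2  # hypothetical alert threshold, tune per feature
drifted = psi(reference, shifted) > PSI_THRESHOLD
print("PSI:", round(psi(reference, shifted), 3), "drift flagged:", drifted)
```

In a real pipeline the statistic and its threshold would come from the monitoring library's configuration rather than a hand-rolled function; the sketch only shows what the number measures.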

fairlearn-fairness

Compute group fairness metrics (selection rate, demographic parity, equalized odds) per sensitive feature with `MetricFrame`, then mitigate disparities using Reductions algorithms (`ExponentiatedGradient` with `constraints=` set to `DemographicParity` or `EqualizedOdds`). Wire group-disaggregated assertions into the model-evaluation gate.
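As a rough illustration of what a group-disaggregated assertion checks, the sketch below computes per-group selection rates and the demographic parity difference (largest minus smallest group selection rate) by hand. It deliberately avoids importing Fairlearn so it runs anywhere; the helper names, the toy predictions, and the `MAX_GAP` threshold are hypothetical.

```python
from collections import defaultdict


def selection_rate_by_group(y_pred, sensitive):
    """Fraction of positive predictions (label 1) within each group."""
    positives = defaultdict(int)
    totals = defaultdict(int)
    for pred, group in zip(y_pred, sensitive):
        totals[group] += 1
        positives[group] += int(pred == 1)
    return {group: positives[group] / totals[group] for group in totals}


def demographic_parity_difference(y_pred, sensitive):
    """Gap between the highest and lowest group selection rate."""
    rates = selection_rate_by_group(y_pred, sensitive)
    return max(rates.values()) - min(rates.values())


# Toy predictions, made up for this example: group "a" is selected far more often.
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
sensitive = ["a", "a", "a", "a", "b", "b", "b", "b"]

rates = selection_rate_by_group(y_pred, sensitive)
gap = demographic_parity_difference(y_pred, sensitive)

MAX_GAP = 0.2  # hypothetical tolerance for the evaluation gate
gate_passed = gap <= MAX_GAP
print(rates, "gap:", gap, "gate passed:", gate_passed)
```

With Fairlearn installed, `MetricFrame` would produce the same per-group table from `y_true`, `y_pred`, and `sensitive_features`, and would cover several metrics and intersectional groups at once.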

giskard-tests

Test ML models with Giskard's scan() vulnerability detector + test catalog (performance, robustness, fairness, data leakage, ethical issues) for tabular and NLP models. Wrap a prediction function in giskard.Model + a DataFrame in giskard.Dataset; emit test suites that pass/fail in CI.

model-performance-regression-gate

Computes held-out metrics (accuracy, F1, AUC, RMSE) for a retrained model and compares them against the current production model, failing promotion when any metric regresses beyond a configured tolerance. Adds per-segment checks via Deepchecks WeakSegmentsPerformance so a model that improves globally but regresses on a key slice is still blocked. Use when a retrained model is a candidate for promotion and the CI pipeline must enforce a per-metric pass/fail gate before the artifact is pushed to the model registry.
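The per-metric comparison described above can be sketched in a few lines of plain Python. This is an illustrative outline under stated assumptions, not the skill's actual implementation: the `find_regressions` helper, the `Regression` record, the metric values, and the tolerances are all hypothetical, and the per-segment Deepchecks check is not shown.

```python
from dataclasses import dataclass

# Direction of each metric: RMSE improves as it falls, the others as they rise.
HIGHER_IS_BETTER = {"accuracy": True, "f1": True, "auc": True, "rmse": False}


@dataclass(frozen=True)
class Regression:
    metric: str
    production: float
    candidate: float
    tolerance: float


def find_regressions(production, candidate, tolerances):
    """Return every metric where the candidate is worse than production
    by more than that metric's tolerance (default tolerance: 0.0)."""
    regressions = []
    for metric, prod_value in production.items():
        cand_value = candidate[metric]
        tolerance = tolerances.get(metric, 0.0)
        if HIGHER_IS_BETTER[metric]:
            degradation = prod_value - cand_value
        else:
            degradation = cand_value - prod_value
        if degradation > tolerance:
            regressions.append(Regression(metric, prod_value, cand_value, tolerance))
    return regressions


# Made-up held-out metrics for the production model and the retrained candidate.
production = {"accuracy": 0.91, "f1": 0.88, "auc": 0.95, "rmse": 0.40}
candidate = {"accuracy": 0.92, "f1": 0.85, "auc": 0.95, "rmse": 0.41}
tolerances = {"accuracy": 0.01, "f1": 0.01, "auc": 0.005, "rmse": 0.02}

regressions = find_regressions(production, candidate, tolerances)
for r in regressions:
    print(f"BLOCKED on {r.metric}: {r.production} -> {r.candidate} (tolerance {r.tolerance})")

gate_passed = not regressions
```

In a CI job, the script would exit with a non-zero status when `gate_passed` is false so that the promotion step never runs. Here the candidate improves accuracy but drops F1 by more than its tolerance, so the gate blocks it.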