qa-llm-evaluation
LLM and prompt evaluation: 7 skills (deepeval-evaluation, giskard-llm, langfuse-tracing, llm-regression-suite-author, openai-evals, promptfoo-evaluation, ragas-evaluation) and 2 agents (llm-red-team-planner, prompt-eval-reviewer). Covers the mainstream OSS LLM-eval ecosystem: Promptfoo + OpenAI Evals + DeepEval + Ragas for functional eval, Giskard for adversarial scan, Langfuse for production observability.
Install this plugin
/plugin install qa-llm-evaluation@testland-qaPart of role bundle: qa-role-ai
qa-llm-evaluation
LLM and prompt evaluation. Six per-tool skill wrappers covering the mainstream OSS LLM-eval ecosystem (Promptfoo + OpenAI Evals + DeepEval + Ragas for functional eval; Giskard for adversarial scan; Langfuse for production observability + offline-eval feedback loop) plus an adversarial reviewer agent that flags 8 anti-patterns across any of these frameworks.
Components
| Type | Name | Description |
|---|---|---|
| Skill | promptfoo-evaluation | YAML-driven multi-provider evals with full assertion catalog (deterministic + model-graded + semantic + perf) |
| Skill | openai-evals | OpenAI's framework + registry; oaieval CLI; template + custom-Python evals |
| Skill | deepeval-evaluation | pytest-native; 11+ metrics including G-Eval / Faithfulness / Contextual-* / Hallucination / Bias / Toxicity / JSON-Correctness |
| Skill | ragas-evaluation | Deepest RAG metric variety: Faithfulness, Context Precision/Recall, Noise Sensitivity, Agents/Tool-Use, NL Comparison, SQL, Aspect Critic |
| Skill | giskard-llm | Adversarial scan with 7 vulnerability categories (hallucination, harmful_content, prompt_injection, sensitive_information_disclosure, stereotypes, robustness, basic_sycophancy) |
| Skill | langfuse-tracing | Production observability with @observe decorator, score API, datasets for offline eval |
| Agent | prompt-eval-reviewer | Adversarial reviewer flagging 8 anti-patterns across all 6 sister tools; preloads all 6 |
| Agent | llm-red-team-planner | Plans an LLM red-team campaign across an attack taxonomy, composing Giskard scans + promptfoo red-team configs. |
| Skill | llm-regression-suite-author | Versioned golden-dataset LLM regression suite across model upgrades with CI gating. |
Install
/plugin marketplace add testland/qa
/plugin install qa-llm-evaluation@testland-qaSkills
deepeval-evaluation
Authors and runs DeepEval - pytest-native LLM eval framework with `LLMTestCase` (input + actual_output + expected_output + retrieval_context) and ~11 built-in metrics (G-Eval, Answer-Relevancy, Faithfulness, Contextual-Recall / Precision / Relevancy, Hallucination, Bias, Toxicity, Summarization, JSON-Correctness); runs via `deepeval test run <file.py>` with `assert_test()` per test or `evaluate()` for batch; integrates Confident-AI dashboard. Use when the user prefers pytest workflow, works with RAG and needs faithfulness/contextual metrics out-of-the-box, or wants a managed dashboard.
giskard-llm
Authors and runs Giskard LLM scans - adversarial test-case generation for LLM applications via `giskard.scan(model)` covering 7 vulnerability categories (hallucination, harmful_content, prompt_injection, sensitive_information_disclosure, stereotypes, robustness, basic_sycophancy); wraps any callable model behind `giskard.Model(model_predict, model_type="text_generation", ...)`; emits HTML report. Use when the user needs adversarial / red-team coverage on top of functional eval suites.
langfuse-tracing
Wires Langfuse tracing into LLM apps for production observability and offline eval - instruments via `@observe` (Python) / `startActiveObservation` (TS) decorators that auto-capture inputs / outputs / timings / errors per generation; exposes `langfuse.update_current_span()` for metadata + cost / latency annotation; supports trace-bound scoring for eval datasets and prompt-as-code management. Use when the user needs production LLM observability beyond pre-deploy eval, or wants to ship traces from production to an eval dataset for offline regression testing.
llm-regression-suite-author
Builds a versioned golden-dataset LLM regression suite for tracking quality across model upgrades: structures a versioned JSONL/CSV golden dataset, configures deterministic eval runs (temperature 0, seed), wires assertion layers (exact, semantic similarity, LLM-as-judge, rubric), enforces a pass-rate threshold with diff reporting vs the baseline model, and gates CI on regression. Use when upgrading an LLM provider model and needing a repeatable before/after quality gate, or when a prompt regression suite must track output quality across model versions over time.
openai-evals
Authors and runs OpenAI Evals - Python framework + registry for evaluating LLMs and LLM-backed systems with `oaieval <model> <eval-name>` CLI; supports template-based evals (Match / Includes / FuzzyMatch / ModelBasedClassify) defined in `evals/registry/evals/*.yaml` against JSONL data files in `evals/registry/data/`, plus custom Python eval classes implementing the Eval interface. Use when the user works with the openai/evals repo, needs the OpenAI-curated eval registry, or contributes new evals via PR to the registry.
promptfoo-evaluation
Authors and runs Promptfoo evals for LLM prompts and RAG pipelines - wires `promptfooconfig.yaml` providers + prompts + tests + assertions (deterministic `equals` / `contains` / `is-json` / `regex`, semantic `similar`, model-graded `llm-rubric` / `factuality` / `g-eval`, performance `latency` / `cost`, custom `javascript` / `python`), runs `npx promptfoo eval`, views HTML report via `promptfoo view`, and integrates CI for regression gating. Use when the user runs Promptfoo, asks about prompt regression suites, or needs an eval-driven workflow for LLM-backed features.
ragas-evaluation
Authors and runs Ragas - RAG-pipeline evaluation framework with metrics organized into RAG (Faithfulness, Response Relevancy, Context Precision/Recall, Context Entities Recall, Noise Sensitivity), Natural Language Comparison (Factual Correctness, Semantic Similarity, BLEU/ROUGE/CHRF/Exact Match), Agents/Tool-Use (Topic Adherence, Tool Call Accuracy/F1, Agent Goal Accuracy), General Purpose (Aspect Critic, Rubrics-based Scoring), Nvidia (Answer Accuracy, Context Relevance, Response Groundedness), and Summarization. Use when the user evaluates a RAG pipeline (retriever + generator) and needs the deepest metric variety in the OSS LLM-eval space.
Agents
llm-red-team-planner
Action-taking orchestrator that plans and scaffolds a multi-class LLM adversarial probe campaign beyond canned scanners - enumerates an attack taxonomy (jailbreaks, indirect prompt injection chains, data exfiltration, harmful-content bypass, OWASP LLM Top 10 classes), maps each class to a Giskard detector or a promptfoo red-team plugin, sequences the campaign into phases, and writes the resulting scan scripts and promptfoo redteam YAML configs. Distinct from `prompt-eval-reviewer` (read-only anti-pattern reviewer) and `giskard-llm` / `promptfoo-evaluation` skills (single-tool wrappers). Use when a senior AI-safety or security engineer needs a bespoke red-team campaign plan that goes beyond running default scanner presets.
prompt-eval-reviewer
Adversarial reviewer for an LLM eval suite (Promptfoo, OpenAI Evals, DeepEval, Ragas, Giskard, Langfuse-driven, or custom). Flags 8 anti-patterns: too-few test cases (<10), single-provider lock-in, missing model-graded for creative output, missing semantic-similarity for paraphrase-tolerant output, no baseline diff in CI, no cost/latency cap, hard-coded model versions absent, no adversarial coverage (Giskard or equivalent). Returns Critical / Warning / Info findings table. Use proactively after any LLM eval suite is added or modified.