Testland
Browse all skills & agents

qa-llm-evaluation

LLM and prompt evaluation: 7 skills (deepeval-evaluation, giskard-llm, langfuse-tracing, llm-regression-suite-author, openai-evals, promptfoo-evaluation, ragas-evaluation) and 2 agents (llm-red-team-planner, prompt-eval-reviewer). Covers the mainstream OSS LLM-eval ecosystem: Promptfoo + OpenAI Evals + DeepEval + Ragas for functional eval, Giskard for adversarial scan, Langfuse for production observability.

Install this plugin

/plugin install qa-llm-evaluation@testland-qa

Part of role bundle: qa-role-ai

qa-llm-evaluation

LLM and prompt evaluation. Six per-tool skill wrappers covering the mainstream OSS LLM-eval ecosystem (Promptfoo + OpenAI Evals + DeepEval + Ragas for functional eval; Giskard for adversarial scan; Langfuse for production observability + offline-eval feedback loop) plus an adversarial reviewer agent that flags 8 anti-patterns across any of these frameworks.

Components

TypeNameDescription
Skillpromptfoo-evaluationYAML-driven multi-provider evals with full assertion catalog (deterministic + model-graded + semantic + perf)
Skillopenai-evalsOpenAI's framework + registry; oaieval CLI; template + custom-Python evals
Skilldeepeval-evaluationpytest-native; 11+ metrics including G-Eval / Faithfulness / Contextual-* / Hallucination / Bias / Toxicity / JSON-Correctness
Skillragas-evaluationDeepest RAG metric variety: Faithfulness, Context Precision/Recall, Noise Sensitivity, Agents/Tool-Use, NL Comparison, SQL, Aspect Critic
Skillgiskard-llmAdversarial scan with 7 vulnerability categories (hallucination, harmful_content, prompt_injection, sensitive_information_disclosure, stereotypes, robustness, basic_sycophancy)
Skilllangfuse-tracingProduction observability with @observe decorator, score API, datasets for offline eval
Agentprompt-eval-reviewerAdversarial reviewer flagging 8 anti-patterns across all 6 sister tools; preloads all 6
Agentllm-red-team-plannerPlans an LLM red-team campaign across an attack taxonomy, composing Giskard scans + promptfoo red-team configs.
Skillllm-regression-suite-authorVersioned golden-dataset LLM regression suite across model upgrades with CI gating.

Install

/plugin marketplace add testland/qa
/plugin install qa-llm-evaluation@testland-qa

Skills

deepeval-evaluation

Authors and runs DeepEval - pytest-native LLM eval framework with `LLMTestCase` (input + actual_output + expected_output + retrieval_context) and ~11 built-in metrics (G-Eval, Answer-Relevancy, Faithfulness, Contextual-Recall / Precision / Relevancy, Hallucination, Bias, Toxicity, Summarization, JSON-Correctness); runs via `deepeval test run <file.py>` with `assert_test()` per test or `evaluate()` for batch; integrates Confident-AI dashboard. Use when the user prefers pytest workflow, works with RAG and needs faithfulness/contextual metrics out-of-the-box, or wants a managed dashboard.

giskard-llm

Authors and runs Giskard LLM scans - adversarial test-case generation for LLM applications via `giskard.scan(model)` covering 7 vulnerability categories (hallucination, harmful_content, prompt_injection, sensitive_information_disclosure, stereotypes, robustness, basic_sycophancy); wraps any callable model behind `giskard.Model(model_predict, model_type="text_generation", ...)`; emits HTML report. Use when the user needs adversarial / red-team coverage on top of functional eval suites.

langfuse-tracing

Wires Langfuse tracing into LLM apps for production observability and offline eval - instruments via `@observe` (Python) / `startActiveObservation` (TS) decorators that auto-capture inputs / outputs / timings / errors per generation; exposes `langfuse.update_current_span()` for metadata + cost / latency annotation; supports trace-bound scoring for eval datasets and prompt-as-code management. Use when the user needs production LLM observability beyond pre-deploy eval, or wants to ship traces from production to an eval dataset for offline regression testing.

llm-regression-suite-author

Builds a versioned golden-dataset LLM regression suite for tracking quality across model upgrades: structures a versioned JSONL/CSV golden dataset, configures deterministic eval runs (temperature 0, seed), wires assertion layers (exact, semantic similarity, LLM-as-judge, rubric), enforces a pass-rate threshold with diff reporting vs the baseline model, and gates CI on regression. Use when upgrading an LLM provider model and needing a repeatable before/after quality gate, or when a prompt regression suite must track output quality across model versions over time.

openai-evals

Authors and runs OpenAI Evals - Python framework + registry for evaluating LLMs and LLM-backed systems with `oaieval <model> <eval-name>` CLI; supports template-based evals (Match / Includes / FuzzyMatch / ModelBasedClassify) defined in `evals/registry/evals/*.yaml` against JSONL data files in `evals/registry/data/`, plus custom Python eval classes implementing the Eval interface. Use when the user works with the openai/evals repo, needs the OpenAI-curated eval registry, or contributes new evals via PR to the registry.

promptfoo-evaluation

Authors and runs Promptfoo evals for LLM prompts and RAG pipelines - wires `promptfooconfig.yaml` providers + prompts + tests + assertions (deterministic `equals` / `contains` / `is-json` / `regex`, semantic `similar`, model-graded `llm-rubric` / `factuality` / `g-eval`, performance `latency` / `cost`, custom `javascript` / `python`), runs `npx promptfoo eval`, views HTML report via `promptfoo view`, and integrates CI for regression gating. Use when the user runs Promptfoo, asks about prompt regression suites, or needs an eval-driven workflow for LLM-backed features.

ragas-evaluation

Authors and runs Ragas - RAG-pipeline evaluation framework with metrics organized into RAG (Faithfulness, Response Relevancy, Context Precision/Recall, Context Entities Recall, Noise Sensitivity), Natural Language Comparison (Factual Correctness, Semantic Similarity, BLEU/ROUGE/CHRF/Exact Match), Agents/Tool-Use (Topic Adherence, Tool Call Accuracy/F1, Agent Goal Accuracy), General Purpose (Aspect Critic, Rubrics-based Scoring), Nvidia (Answer Accuracy, Context Relevance, Response Groundedness), and Summarization. Use when the user evaluates a RAG pipeline (retriever + generator) and needs the deepest metric variety in the OSS LLM-eval space.