deepeval-evaluation

Authors and runs DeepEval - pytest-native LLM eval framework with `LLMTestCase` (input + actual_output + expected_output + retrieval_context) and ~11 built-in metrics (G-Eval, Answer-Relevancy, Faithfulness, Contextual-Recall / Precision / Relevancy, Hallucination, Bias, Toxicity, Summarization, JSON-Correctness); runs via `deepeval test run <file.py>` with `assert_test()` per test or `evaluate()` for batch; integrates Confident-AI dashboard. Use when the user prefers pytest workflow, works with RAG and needs faithfulness/contextual metrics out-of-the-box, or wants a managed dashboard.

Overview

Per de-start, DeepEval is "an open-source LLM eval package" enabling "evaluation of LLM applications locally through test cases and metrics." Each test constructs an LLMTestCase, applies one or more metric instances, and either asserts on a single case (assert_test) or evaluates a batch (evaluate). Because the tests are ordinary pytest functions, pytest discovery and reporting work unchanged.

When to use

  • The repo already uses pytest; LLM tests should live alongside unit tests.
  • The user works with RAG and needs Faithfulness / Contextual-* metrics without writing them from scratch.
  • The team wants a managed dashboard (Confident-AI) for regression tracking + prompt-vs-prompt comparison.
  • Programmatic test-case authoring (data-driven from a CSV/JSONL) is needed and pytest fixtures fit better than YAML config.

Step 1 - Install

Per de-gh and de-start:

pip install -U deepeval

Optional Confident-AI login (for dashboard):

deepeval login

Per de-start: after login "Confident AI will generate testing reports and automate regression testing whenever you run a test run."

Step 2 - First test

Per de-gh (verbatim quickstart):

import pytest
from deepeval import assert_test
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, SingleTurnParams

def test_case():
    correctness_metric = GEval(
        name="Correctness",
        criteria="Determine if the 'actual output' is correct based on the 'expected output'.",
        evaluation_params=[SingleTurnParams.ACTUAL_OUTPUT, SingleTurnParams.EXPECTED_OUTPUT],
        threshold=0.5
    )
    test_case = LLMTestCase(
        input="What if these shoes don't fit?",
        actual_output="You have 30 days to get a full refund at no extra cost.",
        expected_output="We offer a 30-day full refund at no extra costs.",
        retrieval_context=["All customers are eligible for a 30 day full refund at no extra costs."]
    )
    assert_test(test_case, [correctness_metric])

Run it:

deepeval test run test_chatbot.py

(Per de-gh.)

Step 3 - LLMTestCase fields

Per de-start, LLMTestCase fields:

| Field | Required | Notes |
| --- | --- | --- |
| input | yes | The user prompt / query |
| actual_output | yes | What the LLM produced |
| expected_output | optional | Reference answer (used by metrics that compare) |
| retrieval_context | optional | List of retrieved chunks for RAG metrics |
| context | optional | Ground-truth context for hallucination metric |
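When test cases are built from raw records (for example rows of a CSV or JSONL file), the required/optional split above can be enforced before anything reaches DeepEval. The following is a minimal sketch, not part of DeepEval itself: `build_case_kwargs` and the record layout are hypothetical names chosen for illustration, and only the field names come from the table.

```python
# Hypothetical helper: validates a raw record against the field table above
# and returns keyword arguments suitable for LLMTestCase(**kwargs).
REQUIRED_FIELDS = ("input", "actual_output")
OPTIONAL_FIELDS = ("expected_output", "retrieval_context", "context")


def build_case_kwargs(record: dict) -> dict:
    """Return only the known fields, failing fast on missing required ones."""
    missing = [name for name in REQUIRED_FIELDS if not record.get(name)]
    if missing:
        raise ValueError(f"missing required field(s): {', '.join(missing)}")

    kwargs = {name: record[name] for name in REQUIRED_FIELDS}
    for name in OPTIONAL_FIELDS:
        value = record.get(name)
        if value is not None:  # drop absent optional fields instead of passing None
            kwargs[name] = value
    return kwargs
```

Assuming DeepEval is installed, the result could then be passed on as `LLMTestCase(**build_case_kwargs(row))`.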

Step 4 - Metric catalog

Per de-gh the available metrics include:

| Metric | Use |
| --- | --- |
| GEval | Custom rubric-based scoring (LLM-as-judge with chain-of-thought) |
| AnswerRelevancyMetric | Does actual_output answer input? |
| FaithfulnessMetric | Does actual_output only state facts in retrieval_context? |
| ContextualRecallMetric | Does retrieval_context contain enough info to produce expected_output? |
| ContextualPrecisionMetric | Are relevant chunks ranked higher in retrieval_context? |
| ContextualRelevancyMetric | Are chunks in retrieval_context relevant to input? |
| HallucinationMetric | Does actual_output contradict context? |
| BiasMetric | Bias detection in actual_output |
| ToxicityMetric | Toxic-content detection |
| SummarizationMetric | Summary quality vs source |
| JsonCorrectnessMetric | Valid + schema-conformant JSON output |

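Each metric takes a threshold parameter. For most metrics the test passes if score ≥ threshold; for HallucinationMetric, BiasMetric, and ToxicityMetric, where a lower score is better, the threshold is a maximum and the test passes if score ≤ threshold.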
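For a RAG pipeline, the Faithfulness and Contextual-* rows of the catalog might be bundled as follows. This is a sketch under assumptions: the class names come from the table, the import path mirrors the `deepeval.metrics` import used for GEval in Step 2, and the 0.7 default is a placeholder rather than a recommended value. `build_rag_metrics` is a hypothetical helper name.

```python
def build_rag_metrics(threshold: float = 0.7) -> list:
    """Return one instance of each RAG-oriented metric from the catalog.

    The import is done lazily so this module can be imported even where
    DeepEval is not installed; calling the function requires DeepEval and,
    typically, credentials for the judge model.
    """
    from deepeval.metrics import (
        ContextualPrecisionMetric,
        ContextualRecallMetric,
        ContextualRelevancyMetric,
        FaithfulnessMetric,
    )

    # The same placeholder threshold is applied to every metric here;
    # in practice each metric usually deserves its own tuned value.
    return [
        FaithfulnessMetric(threshold=threshold),
        ContextualRecallMetric(threshold=threshold),
        ContextualPrecisionMetric(threshold=threshold),
        ContextualRelevancyMetric(threshold=threshold),
    ]
```

The returned list can be handed to `assert_test(test_case, metrics)` or `evaluate(...)`, provided the test case carries `retrieval_context` (and `expected_output` for the metrics that compare against a reference).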

Step 5 - Custom GEval pattern

GEval is the universal escape hatch when no built-in metric fits:

from deepeval.metrics import GEval
from deepeval.test_case import SingleTurnParams

professionalism = GEval(
    name="Professionalism",
    criteria="Determine if the response uses professional language without slang or contractions.",
    evaluation_params=[SingleTurnParams.ACTUAL_OUTPUT],
    threshold=0.7,
)

The criteria string is the rubric: the judge model evaluates the listed evaluation_params against it and returns a score from 0 to 1 with its reasoning attached.
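Because a vague rubric tends to give inconsistent judge scores, one option is to assemble the criteria string from a rule plus concrete examples. The helper below is a hypothetical illustration (it is not a DeepEval API); it only builds a string that could be passed as `criteria=` in the GEval call above.

```python
def build_criteria(rule: str, passing: list, failing: list) -> str:
    """Combine a rubric rule with concrete passing and failing examples."""
    lines = [rule.strip(), "", "Examples that should score high:"]
    lines += [f"- {example}" for example in passing]
    lines += ["", "Examples that should score low:"]
    lines += [f"- {example}" for example in failing]
    return "\n".join(lines)


# Illustrative inputs only; tune the examples to your own product voice.
professionalism_criteria = build_criteria(
    "Determine if the response uses professional language without slang or contractions.",
    passing=["We will process your refund within five business days."],
    failing=["No worries, we'll sort it out ASAP!"],
)
```

Keeping the examples in data rather than inline prose also makes it easier to review rubric changes in a diff.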

Step 6 - Batch evaluation (no pytest)

For dataset-driven runs without pytest:

from deepeval import evaluate

evaluate(test_cases=[case1, case2, case3], metrics=[g_eval, faithfulness])

Returns scores per metric per case; useful for regression sweeps across a CSV/JSONL of historical inputs.
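A regression sweep over a JSONL file might look like the sketch below. The file layout (one JSON object per line with `input` and `actual_output` keys) and the function names `load_records` and `run_batch` are assumptions made for illustration; the `evaluate(test_cases=..., metrics=...)` call follows the snippet above, and AnswerRelevancyMetric with a 0.7 threshold is just a placeholder choice.

```python
import json
from typing import Iterable, List


def load_records(lines: Iterable[str]) -> List[dict]:
    """Parse JSONL lines into dicts, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


def run_batch(path: str):
    """Build one LLMTestCase per record and evaluate them in a single call."""
    # Imported lazily so load_records stays usable without DeepEval installed.
    from deepeval import evaluate
    from deepeval.metrics import AnswerRelevancyMetric
    from deepeval.test_case import LLMTestCase

    with open(path, encoding="utf-8") as handle:
        records = load_records(handle)

    cases = [
        LLMTestCase(input=record["input"], actual_output=record["actual_output"])
        for record in records
    ]
    return evaluate(test_cases=cases, metrics=[AnswerRelevancyMetric(threshold=0.7)])
```

Separating parsing from evaluation keeps the loader unit-testable without a judge model or API key.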

Step 7 - CI integration

deepeval test run tests/llm/ -n 4
# -n sets the number of parallel processes
# pytest exit code propagates: nonzero if any assert_test fails

Combine with Confident-AI for a dashboard view of run history; Confident-AI is the company behind DeepEval (per de-gh).

Anti-patterns

| Anti-pattern | Why it fails | Fix |
| --- | --- | --- |
| threshold=0.0 on every metric | Tests never fail; eval theater | Pick real thresholds (0.5 to 0.8 typical) |
| Hallucination metric without context | Metric has nothing to compare against | Always pass context (Step 3) |
| Faithfulness metric without retrieval_context | Same problem | Pass retrieval chunks (Step 3) |
| Custom GEval criteria too vague | Judge produces inconsistent scores | Concrete criteria with examples (Step 5) |
| Skipping parallel runs in CI | Sequential runs are slow and costly | Parallelize with -n 4 (Step 7) |
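The first three rows of the table can be caught mechanically before a run. The following pre-flight check is a hypothetical sketch, independent of DeepEval: `lint_eval_setup` and its arguments are invented for illustration, and the metric-to-field mapping is taken only from the rows above.

```python
from typing import Dict, List

# Extra field each metric needs, per the anti-pattern table above.
NEEDS_FIELD: Dict[str, str] = {
    "HallucinationMetric": "context",
    "FaithfulnessMetric": "retrieval_context",
}


def lint_eval_setup(metric_name: str, threshold: float, case_fields: dict) -> List[str]:
    """Return human-readable problems for one metric/test-case pairing."""
    problems = []
    if threshold == 0.0:
        problems.append(f"{metric_name}: threshold=0.0 is not a meaningful gate")

    needed = NEEDS_FIELD.get(metric_name)
    if needed and not case_fields.get(needed):
        # Covers both a missing key and an empty list of chunks.
        problems.append(f"{metric_name}: test case has no {needed}")
    return problems
```

A CI job could run this over every metric/case pair and fail early when the returned list is non-empty, before spending any judge-model tokens.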

Limitations

  • LLM-as-judge metrics depend on judge-model quality + cost; pin judge model version in CI.
  • Confident-AI is the managed dashboard; without it, regression tracking is manual (parse pytest output).
  • Test cases live in Python files - not as discoverable as YAML configs for non-Python teammates (vs Promptfoo).
  • Faithfulness / Contextual-* metrics need RAG-shaped data; for pure prompt evals, promptfoo-evaluation is lower-friction.

References