ragas-evaluation

Authors and runs Ragas, a RAG-pipeline evaluation framework whose metrics are organized into RAG (Faithfulness, Response Relevancy, Context Precision/Recall, Context Entities Recall, Noise Sensitivity), Natural Language Comparison (Factual Correctness, Semantic Similarity, BLEU/ROUGE/CHRF/Exact Match), Agents/Tool-Use (Topic Adherence, Tool Call Accuracy/F1, Agent Goal Accuracy), General Purpose (Aspect Critic, Rubrics-based Scoring), Nvidia (Answer Accuracy, Context Relevance, Response Groundedness), and Summarization. Use when the user evaluates a RAG pipeline (retriever + generator) and needs the broadest metric coverage among open-source LLM-eval libraries.

The model: assemble a dataset (question + answer + retrieval contexts + ground truth), import the metrics relevant to the evaluation goal, run evaluate(), and inspect per-metric per-row scores (per rg-gh).
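The last step of that loop, inspecting per-metric per-row scores, can be sketched in plain Python. The example assumes the scores have already been exported as a list of per-row dictionaries; the `rows` data and the `worst_rows` helper are hypothetical and not part of Ragas.

```python
from typing import Dict, List


def worst_rows(rows: List[Dict], metric: str, n: int = 2) -> List[Dict]:
    """Return the n rows with the lowest score for one metric."""
    scored = [r for r in rows if r.get(metric) is not None]
    return sorted(scored, key=lambda r: r[metric])[:n]


# Hypothetical per-row scores; the numbers are made up for illustration.
rows = [
    {"question": "q1", "faithfulness": 0.95, "answer_relevancy": 0.90},
    {"question": "q2", "faithfulness": 0.40, "answer_relevancy": 0.88},
    {"question": "q3", "faithfulness": 0.75, "answer_relevancy": 0.35},
]

for metric in ("faithfulness", "answer_relevancy"):
    lowest = worst_rows(rows, metric, n=1)[0]
    print(f"{metric}: lowest on {lowest['question']} ({lowest[metric]:.2f})")
```

Sorting per metric rather than on an overall average keeps retrieval failures and generation failures visible separately.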

When to use

  • The repo uses LangChain / LlamaIndex / Haystack / direct retriever→LLM RAG pipelines.
  • The user needs RAG-specific metrics: faithfulness (hallucination on retrieved context), context precision/recall (retrieval quality), context entities recall, noise sensitivity.
  • The team needs broad metric coverage from an actively maintained OSS RAG-eval library (Ragas and DeepEval are the leaders here).
  • Agents-style eval (tool-call accuracy, agent goal accuracy, topic adherence) is in scope.

For non-RAG prompt evals, prefer promptfoo-evaluation. For pytest-native LLM evals with a managed dashboard, prefer deepeval-evaluation.

Step 1 - Install

Per rg-gh:

pip install ragas

Or from source:

pip install git+https://github.com/explodinggradients/ragas

Step 2 - Custom metric quickstart

Per rg-gh (verbatim):

import asyncio
from openai import AsyncOpenAI
from ragas.metrics import DiscreteMetric
from ragas.llms import llm_factory

# Setup your LLM
client = AsyncOpenAI()
llm = llm_factory("gpt-4o", client=client)

# Create a custom aspect evaluator
metric = DiscreteMetric(
    name="summary_accuracy",
    allowed_values=["accurate", "inaccurate"],
    prompt="""Evaluate if the summary is accurate and captures
key information.
Response: {response}
Answer with only 'accurate' or 'inaccurate'."""
)

# Score your application's output
async def main():
    score = await metric.ascore(
        llm=llm,
        response="The summary of the text is..."
    )
    print(f"Score: {score.value}")
    print(f"Reason: {score.reason}")

if __name__ == "__main__":
    asyncio.run(main())

DiscreteMetric is the pattern for custom rubric-based scoring; the built-in metrics in Step 3 follow a similar shape but are preconfigured.
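Because a discrete metric returns a label rather than a number, a run over many responses usually needs an aggregation step. A minimal sketch, assuming the labels have been collected from `score.value` into a list; the `pass_rate` helper and the sample verdicts are hypothetical.

```python
from collections import Counter
from typing import Iterable


def pass_rate(verdicts: Iterable[str], passing: str = "accurate") -> float:
    """Fraction of verdicts equal to the passing label (0.0 if empty)."""
    counts = Counter(verdicts)
    total = sum(counts.values())
    return counts[passing] / total if total else 0.0


# Hypothetical verdicts, as if gathered from score.value over four responses.
verdicts = ["accurate", "inaccurate", "accurate", "accurate"]
print(f"pass rate: {pass_rate(verdicts):.2f}")
```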

Step 3 - Built-in metric catalog

Per docs.ragas.io/en/stable/concepts/metrics/available_metrics/:

Retrieval Augmented Generation:

| Metric | Use |
| --- | --- |
| Context Precision | Are the relevant chunks ranked high in the retrieved context? |
| Context Recall | Does the retrieved context contain ground-truth info? |
| Context Entities Recall | Entity-level recall vs ground truth |
| Noise Sensitivity | Does irrelevant context degrade output quality? |
| Response Relevancy | Does the response address the question? |
| Faithfulness | Are the response's claims grounded in retrieved context? |
| Multimodal Faithfulness | Faithfulness for text+image RAG |
| Multimodal Relevance | Relevance for text+image RAG |
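To make "ranked high" concrete for Context Precision, here is a minimal sketch that assumes the score is an average-precision-style mean of precision@k over binary relevance verdicts. In Ragas the verdicts come from a judge or a comparison against the reference; this sketch takes them as given, and the `context_precision` helper is hypothetical, not the library's implementation.

```python
from typing import Sequence


def context_precision(relevant: Sequence[bool]) -> float:
    """Mean of precision@k taken at each rank k that holds a relevant chunk."""
    hits = 0
    total = 0.0
    for k, is_relevant in enumerate(relevant, start=1):
        if is_relevant:
            hits += 1
            total += hits / k  # precision@k at a relevant position
    return total / hits if hits else 0.0


# The same two relevant chunks score higher when they are ranked earlier.
print(context_precision([True, True, False]))
print(context_precision([False, True, True]))
```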

Nvidia Metrics (per rg-metrics):

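| Metric | Use |
| --- | --- |
| Answer Accuracy | Agreement between the response and a reference answer, judged by an LLM |
| Context Relevance | Whether the retrieved contexts are pertinent to the question |
| Response Groundedness | How well the response is supported by the retrieved context |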

Agents/Tool Use:

| Metric | Use |
| --- | --- |
| Topic Adherence | Does the agent stay on topic? |
| Tool Call Accuracy | Did it call the right tool? |
| Tool Call F1 | F1 score for tool selection |
| Agent Goal Accuracy | Did the agent achieve the user's goal? |
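The idea behind an F1 score for tool selection can be sketched as set overlap between predicted and reference calls. This assumes order-insensitive, exact matching of calls, which may differ from how Ragas compares them; the `tool_call_f1` helper and the call encoding are hypothetical.

```python
from typing import Hashable, Iterable


def tool_call_f1(predicted: Iterable[Hashable], reference: Iterable[Hashable]) -> float:
    """F1 over two sets of tool calls; each call must be hashable."""
    pred, ref = set(predicted), set(reference)
    true_positives = len(pred & ref)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(ref)
    return 2 * precision * recall / (precision + recall)


# Hypothetical calls, encoded as (tool name, sorted argument pairs).
predicted = [("search", (("query", "ragas"),)), ("calculator", (("expr", "1+1"),))]
reference = [("search", (("query", "ragas"),)), ("weather", (("city", "Paris"),))]
print(tool_call_f1(predicted, reference))
```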

Natural Language Comparison:

| Metric | Use |
| --- | --- |
| Factual Correctness | Compares response facts vs ground truth |
| Semantic Similarity | Embedding-based similarity to reference |
| Non LLM String Similarity | String-distance metrics (no LLM call) |
| BLEU Score / ROUGE Score / CHRF Score | Classical NLP metrics |
| String Presence | Token presence check |
| Exact Match | Strict equality |
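The two cheapest comparison metrics need no LLM at all. A minimal sketch of what they compute, assuming Exact Match is strict string equality and String Presence is substring containment; these helpers are illustrations, not the Ragas classes.

```python
def exact_match(response: str, reference: str) -> float:
    """1.0 only when the response equals the reference character for character."""
    return float(response == reference)


def string_presence(response: str, reference: str) -> float:
    """1.0 when the reference text appears anywhere inside the response."""
    return float(reference in response)


print(exact_match("Paris.", "Paris"))
print(string_presence("The capital is Paris.", "Paris"))
```

Strict equality is brittle against punctuation and casing, so it tends to suit short, normalized outputs such as IDs or labels.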

SQL:

| Metric | Use |
| --- | --- |
| Execution-based Datacompy Score | Run query, compare result-sets |
| SQL Query Equivalence | Semantic equivalence (different SQL, same result) |

General Purpose:

| Metric | Use |
| --- | --- |
| Aspect Critic | Yes/no LLM-judge on a custom aspect |
| Simple Criteria Scoring | Numeric scoring against a rubric |
| Rubrics-based Scoring | Multi-criterion rubric scoring |
| Instance-specific Rubrics Scoring | Per-row rubric variation |

Other:

| Metric | Use |
| --- | --- |
| Summarization | Summary quality scoring |

Step 4 - Dataset shape

Ragas accepts a Hugging Face Dataset or pandas.DataFrame with columns matching the metrics being run:

| Column | Required by |
| --- | --- |
| question | All RAG metrics |
| answer | Response Relevancy, Faithfulness, NL Comparison |
| contexts (list of strings) | Context Precision/Recall, Faithfulness |
| ground_truth | Context Recall, Factual Correctness, Answer Accuracy |
| reference_contexts | Context-comparison metrics |

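See the per-metric pages on docs.ragas.io for exact required-column lists. Column names depend on the release: Ragas 0.2 and later name these fields user_input, response, retrieved_contexts, and reference.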
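A missing column is cheaper to catch before any judge call is made. The sketch below checks rows against a requirements map transcribed from the table above; the `REQUIRED` keys and the `missing_columns` helper are hypothetical, and the per-metric docs remain the authority on what each metric needs.

```python
from typing import Dict, Iterable, List, Set

# Required columns per metric, transcribed from the table above (assumption:
# the per-metric docs may list more or differently named fields).
REQUIRED: Dict[str, Set[str]] = {
    "faithfulness": {"question", "answer", "contexts"},
    "response_relevancy": {"question", "answer"},
    "context_precision": {"question", "contexts"},
    "context_recall": {"question", "contexts", "ground_truth"},
}


def missing_columns(rows: List[dict], metrics: Iterable[str]) -> Dict[str, Set[str]]:
    """Map each metric to required columns that are absent or empty in any row."""
    problems: Dict[str, Set[str]] = {}
    for metric in metrics:
        missing = {
            col for col in REQUIRED[metric]
            if any(not row.get(col) for row in rows)
        }
        if missing:
            problems[metric] = missing
    return problems


rows = [{
    "question": "What is Ragas?",
    "answer": "An evaluation library.",
    "contexts": ["Ragas is an evaluation library for RAG pipelines."],
}]
print(missing_columns(rows, ["faithfulness", "context_recall"]))
```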

Step 5 - Integration with retrieval frameworks

Ragas integrates with LangChain and LlamaIndex retrieval pipelines: the integration code captures contexts from the retriever and answer from the generator into the evaluation dataset automatically. Consult the per-framework integration docs on docs.ragas.io when wiring; the APIs evolve faster than this skill body, and the canonical docs are the source of truth.
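For a direct retriever→LLM pipeline with no framework integration, the same capture can be done by hand. A minimal sketch, assuming the retriever and generator are plain callables; `build_rows`, `fake_retrieve`, and `fake_generate` are hypothetical stand-ins, not part of Ragas or any framework.

```python
from typing import Callable, Dict, Iterable, List


def build_rows(
    questions: Iterable[str],
    retrieve: Callable[[str], List[str]],
    generate: Callable[[str, List[str]], str],
) -> List[Dict]:
    """Run each question through the pipeline and record what it saw and said."""
    rows = []
    for question in questions:
        contexts = retrieve(question)
        answer = generate(question, contexts)
        rows.append({"question": question, "contexts": contexts, "answer": answer})
    return rows


# Stand-ins for a real retriever and generator.
def fake_retrieve(question: str) -> List[str]:
    return ["Ragas is an evaluation framework for RAG pipelines."]


def fake_generate(question: str, contexts: List[str]) -> str:
    return contexts[0]


rows = build_rows(["What is Ragas?"], fake_retrieve, fake_generate)
print(rows[0])
```

Recording the exact contexts passed to the generator matters: re-running the retriever at evaluation time can return different chunks than the ones the answer was based on.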

Step 6 - CI integration

Ragas does not ship a first-party CI action. Pattern: run evaluate() in a pytest fixture or a CLI script, compare per-metric scores against thresholds, fail CI on regression.

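from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

# `dataset` is the evaluation dataset assembled in Step 4.
result = evaluate(dataset, metrics=[faithfulness, answer_relevancy])

# Average the per-row scores explicitly; depending on the Ragas version,
# result["faithfulness"] may hold per-row values rather than a single mean.
scores = result.to_pandas()
assert scores["faithfulness"].mean() >= 0.85
assert scores["answer_relevancy"].mean() >= 0.80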
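A bare assert stops at the first failing metric and passes silently over NaN comparisons in some setups. The sketch below is a small gate that reports every regression at once, assuming the per-metric means have already been computed; `check_thresholds` and the sample numbers are hypothetical.

```python
import math
from typing import Dict, List


def check_thresholds(means: Dict[str, float], thresholds: Dict[str, float]) -> List[str]:
    """Return one message per metric that is missing, NaN, or below its minimum."""
    failures = []
    for metric, minimum in thresholds.items():
        value = means.get(metric)
        if value is None or math.isnan(value):
            failures.append(f"{metric}: no score (missing or NaN)")
        elif value < minimum:
            failures.append(f"{metric}: {value:.2f} < {minimum:.2f}")
    return failures


# Hypothetical means and thresholds for illustration.
means = {"faithfulness": 0.91, "answer_relevancy": 0.72}
thresholds = {"faithfulness": 0.85, "answer_relevancy": 0.80}
failures = check_thresholds(means, thresholds)
print(failures)
```

In a CLI script, a non-empty `failures` list would typically be printed and followed by a non-zero exit code so the CI job fails.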

Anti-patterns

| Anti-pattern | Why it fails | Fix |
| --- | --- | --- |
| Run all 30+ metrics on every PR | Cost + latency explode | Pick 3 to 5 metrics per pipeline (Step 3) |
| Faithfulness without contexts column | Metric returns NaN / errors | Pass contexts per dataset spec (Step 4) |
| Pin nothing | Ragas + judge-model versions both drift | Pin both in requirements + CI env |
| Skip Aspect Critic for product-specific concerns | Built-in metrics miss the requirement | Custom Aspect Critic + rubric (Step 3) |

Limitations

  • Many metrics require a judge LLM → cost scales with metric count × dataset size.
  • Multimodal metrics need the multimodal extras (pip install ragas[multimodal]); check the per-metric doc on docs.ragas.io.
  • API surface evolves - pin versions in requirements; the canonical doc is the source of truth (this skill body curates the mainstream patterns but does not re-litigate per-method signatures).
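The cost scaling in the first bullet can be put into rough numbers before a run. This is back-of-envelope arithmetic only: the calls per metric, tokens per call, and price are placeholder assumptions, not Ragas or provider figures, and both helpers are hypothetical.

```python
def estimate_judge_calls(n_rows: int, n_llm_metrics: int, calls_per_metric: int = 1) -> int:
    """Judge calls grow with dataset size times the number of LLM-based metrics."""
    return n_rows * n_llm_metrics * calls_per_metric


def estimate_cost_usd(n_calls: int, tokens_per_call: int, usd_per_1k_tokens: float) -> float:
    """Rough spend for a run, given an assumed token count and price."""
    return n_calls * tokens_per_call / 1000 * usd_per_1k_tokens


# Placeholder inputs: 200 rows, 4 LLM-judged metrics, 2 judge calls per metric.
calls = estimate_judge_calls(n_rows=200, n_llm_metrics=4, calls_per_metric=2)
cost = estimate_cost_usd(calls, tokens_per_call=1500, usd_per_1k_tokens=0.005)
print(f"{calls} judge calls, about ${cost:.2f}")
```

Halving the metric list halves the estimate, which is the arithmetic behind picking a few metrics per pipeline instead of running the whole catalog.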

References