ragas-evaluation
Authors and runs Ragas evaluations. Ragas is a RAG-pipeline evaluation framework with metrics organized into RAG (Faithfulness, Response Relevancy, Context Precision/Recall, Context Entities Recall, Noise Sensitivity), Natural Language Comparison (Factual Correctness, Semantic Similarity, BLEU/ROUGE/CHRF/Exact Match), Agents/Tool-Use (Topic Adherence, Tool Call Accuracy/F1, Agent Goal Accuracy), General Purpose (Aspect Critic, Rubrics-based Scoring), Nvidia (Answer Accuracy, Context Relevance, Response Groundedness), and Summarization. Use when the user evaluates a RAG pipeline (retriever + generator) and needs the deepest metric variety in the OSS LLM-eval space.
The model: assemble a dataset (question + answer + retrieval contexts + ground truth), import the metrics relevant to the evaluation goal, run evaluate(), and inspect per-metric per-row scores (per rg-gh).
When to use
For non-RAG prompt evals, prefer promptfoo-evaluation. For pytest-native LLM evals with a managed dashboard, prefer deepeval-evaluation.
Step 1 - Install
Per rg-gh:
pip install ragas
Or from source:
pip install git+https://github.com/explodinggradients/ragas
Step 2 - Custom metric quickstart
Per rg-gh (verbatim):
import asyncio
from openai import AsyncOpenAI
from ragas.metrics import DiscreteMetric
from ragas.llms import llm_factory
# Setup your LLM
client = AsyncOpenAI()
llm = llm_factory("gpt-4o", client=client)
# Create a custom aspect evaluator
metric = DiscreteMetric(
    name="summary_accuracy",
    allowed_values=["accurate", "inaccurate"],
    prompt="""Evaluate if the summary is accurate and captures
key information.
Response: {response}
Answer with only 'accurate' or 'inaccurate'."""
)
# Score your application's output
async def main():
    score = await metric.ascore(
        llm=llm,
        response="The summary of the text is..."
    )
    print(f"Score: {score.value}")
    print(f"Reason: {score.reason}")
if __name__ == "__main__":
    asyncio.run(main())
DiscreteMetric is the pattern for custom rubric-based scoring; the built-in metrics in Step 3 follow a similar shape but are preconfigured.
Step 3 - Built-in metric catalog
Per docs.ragas.io/en/stable/concepts/metrics/available_metrics/:
Retrieval Augmented Generation:
| Metric | Use |
|---|---|
| Context Precision | Are the relevant chunks ranked high in the retrieved context? |
| Context Recall | Does the retrieved context contain ground-truth info? |
| Context Entities Recall | Entity-level recall vs ground truth |
| Noise Sensitivity | Does irrelevant context degrade output quality? |
| Response Relevancy | Does the response address the question? |
| Faithfulness | Are the response's claims grounded in retrieved context? |
| Multimodal Faithfulness | Faithfulness for text+image RAG |
| Multimodal Relevance | Relevance for text+image RAG |
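To make the ranking intuition behind Context Precision concrete, here is a minimal sketch that averages precision@k over the ranks holding a relevant chunk. It assumes the binary relevance verdicts (1 = relevant, 0 = not) have already been produced, for example by an LLM judge. The function name `context_precision_at_k` is hypothetical, and this illustrates the idea rather than reproducing Ragas's implementation.

```python
from typing import Sequence


def context_precision_at_k(verdicts: Sequence[int]) -> float:
    """Average precision@k over the positions that hold a relevant chunk.

    verdicts[i] is 1 if the chunk at rank i (0-based) is relevant, else 0.
    Hypothetical helper for illustration only.
    """
    relevant_seen = 0
    weighted_sum = 0.0
    for rank, verdict in enumerate(verdicts, start=1):
        if verdict:
            relevant_seen += 1
            # precision@rank, counted only at relevant positions
            weighted_sum += relevant_seen / rank
    # No relevant chunk retrieved: score the ranking as 0
    return weighted_sum / relevant_seen if relevant_seen else 0.0


# The same two relevant chunks, ranked differently
top_heavy = context_precision_at_k([1, 1, 0])
bottom_heavy = context_precision_at_k([0, 1, 1])
print(top_heavy, bottom_heavy)
```

The first ranking scores higher than the second even though both retrieve two relevant chunks out of three, which is the behaviour the "ranked high" wording in the table describes.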
Nvidia Metrics (per rg-metrics):
| Metric | Use |
|---|---|
| Answer Accuracy | Agreement between the response and a reference answer |
| Context Relevance | Whether the retrieved contexts are pertinent to the user input |
| Response Groundedness | Groundedness in retrieved context |
Agents/Tool Use:
| Metric | Use |
|---|---|
| Topic Adherence | Does the agent stay on topic? |
| Tool Call Accuracy | Did it call the right tool? |
| Tool Call F1 | F1 score for tool selection |
| Agent Goal Accuracy | Did the agent achieve the user's goal? |
Natural Language Comparison:
| Metric | Use |
|---|---|
| Factual Correctness | Compares response facts vs ground truth |
| Semantic Similarity | Embedding-based similarity to reference |
| Non LLM String Similarity | String-distance metrics (no LLM call) |
| BLEU Score / ROUGE Score / CHRF Score | Classical NLP metrics |
| String Presence | Checks whether the response contains the reference string |
| Exact Match | Strict equality |
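The two string checks at the bottom of this table need no LLM call, so they are cheap to run on every row. As a rough plain-Python illustration of what they check (hypothetical helpers, not the Ragas classes, and ignoring any normalization options the library may apply):

```python
def exact_match(response: str, reference: str) -> float:
    """Return 1.0 only when the two strings are identical, else 0.0."""
    return float(response == reference)


def string_presence(response: str, reference: str) -> float:
    """Return 1.0 when the reference text occurs anywhere in the response."""
    return float(reference in response)


response = "The capital of France is Paris."
print(exact_match(response, "Paris"))      # strict equality fails
print(string_presence(response, "Paris"))  # substring check passes
```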
SQL:
| Metric | Use |
|---|---|
| Execution-based Datacompy Score | Run query, compare result-sets |
| SQL Query Equivalence | Semantic equivalence (different SQL, same result) |
General Purpose:
| Metric | Use |
|---|---|
| Aspect Critic | Yes/no LLM-judge on a custom aspect |
| Simple Criteria Scoring | Numeric scoring against a rubric |
| Rubrics-based Scoring | Multi-criterion rubric scoring |
| Instance-specific Rubrics Scoring | Per-row rubric variation |
Other:
| Metric | Use |
|---|---|
| Summarization | Summary quality scoring |
Step 4 - Dataset shape
Ragas accepts a Hugging Face Dataset or pandas.DataFrame with columns matching the metrics being run:
| Column | Required by |
|---|---|
| question | All RAG metrics |
| answer | Response Relevancy, Faithfulness, NL Comparison |
| contexts (list of strings) | Context Precision/Recall, Faithfulness |
| ground_truth | Context Recall, Factual Correctness, Answer Accuracy |
| reference_contexts | Context-comparison metrics |
See the per-metric pages on docs.ragas.io for exact required-column lists.
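A missing column tends to surface only once an evaluation run is already under way, so a cheap pre-flight check can help. The sketch below assumes the column requirements from the table above. The `REQUIRED_COLUMNS` mapping and the `missing_columns` helper are hypothetical, cover only three metrics, and should be checked against the per-metric docs for the installed Ragas version.

```python
from typing import Dict, Iterable, List, Set

# Assumption: a subset of the requirements in the table above, keyed by metric name.
REQUIRED_COLUMNS: Dict[str, Set[str]] = {
    "faithfulness": {"question", "answer", "contexts"},
    "response_relevancy": {"question", "answer"},
    "context_recall": {"question", "contexts", "ground_truth"},
}


def missing_columns(rows: List[dict], metrics: Iterable[str]) -> Dict[str, List[str]]:
    """Return, per metric, the required columns absent from at least one row."""
    problems: Dict[str, List[str]] = {}
    for metric in metrics:
        required = REQUIRED_COLUMNS[metric]
        absent = sorted(
            col for col in required
            if any(col not in row for row in rows)
        )
        if absent:
            problems[metric] = absent
    return problems


# One illustrative row that lacks a ground_truth column
rows = [
    {
        "question": "What does Ragas evaluate?",
        "answer": "RAG pipelines.",
        "contexts": ["Ragas is an evaluation framework for RAG pipelines."],
    }
]
print(missing_columns(rows, ["faithfulness", "context_recall"]))
```

Here Faithfulness has everything it needs, while Context Recall is reported as missing `ground_truth`.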
Step 5 - Integration with retrieval frameworks
Ragas integrates with LangChain + LlamaIndex retrieval pipelines - the integration code captures contexts from the retriever and answer from the generator into the evaluation dataset automatically. Consult the per-framework integration docs on docs.ragas.io when wiring; APIs evolve faster than this skill body and the canonical doc is the source of truth.
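Conceptually, those integrations do the bookkeeping sketched below: call the retriever, call the generator, and store both outputs next to the question. The `retrieve` and `generate` callables are hypothetical stand-ins, not LangChain, LlamaIndex, or Ragas APIs, and the row keys follow the column names from Step 4.

```python
from typing import Callable, Dict, List, Optional

Retriever = Callable[[str], List[str]]
Generator = Callable[[str, List[str]], str]


def build_rows(
    questions: List[str],
    retrieve: Retriever,
    generate: Generator,
    ground_truths: Optional[List[str]] = None,
) -> List[Dict[str, object]]:
    """Run the pipeline once per question and record what each stage produced."""
    rows: List[Dict[str, object]] = []
    for i, question in enumerate(questions):
        contexts = retrieve(question)
        row: Dict[str, object] = {
            "question": question,
            "contexts": contexts,
            "answer": generate(question, contexts),
        }
        if ground_truths is not None:
            row["ground_truth"] = ground_truths[i]
        rows.append(row)
    return rows


# Stub pipeline stages so the sketch runs without any external service.
def fake_retrieve(question: str) -> List[str]:
    return [f"doc about: {question}"]


def fake_generate(question: str, contexts: List[str]) -> str:
    return f"answer based on {len(contexts)} chunk(s)"


rows = build_rows(["What is RAG?"], fake_retrieve, fake_generate, ["reference answer"])
print(rows[0])
```

Keeping the capture step separate from scoring also makes it easy to save the rows and re-score them later with a different metric set.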
Step 6 - CI integration
Ragas does not ship a first-party CI action. Pattern: run evaluate() in a pytest fixture or a CLI script, compare per-metric scores against thresholds, fail CI on regression.
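from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy
# dataset: the evaluation dataset assembled in Step 4
result = evaluate(dataset, metrics=[faithfulness, answer_relevancy])
# Average the per-row scores; to_pandas() yields one column per metric
scores = result.to_pandas()
assert scores["faithfulness"].mean() >= 0.85
assert scores["answer_relevancy"].mean() >= 0.80
Anti-patterns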
| Anti-pattern | Why it fails | Fix |
|---|---|---|
| Run all 30+ metrics on every PR | Cost + latency explode | Pick 3-5 metrics per pipeline (Step 3) |
| Faithfulness without contexts column | Metric returns NaN / errors | Pass contexts per dataset spec (Step 4) |
| Pin nothing | Ragas + judge-model versions both drift | Pin both in requirements + CI env |
| Skip Aspect Critic for product-specific concerns | Built-in metrics miss the requirement | Custom Aspect Critic + rubric (Step 3) |
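Two rows in this table (NaN scores from a missing column, and the cost of running too many metrics) point toward gating on a small, explicit set of thresholds and reporting NaN as its own failure with a readable message. A minimal sketch, assuming the mean score per metric has already been extracted into a plain dict; `check_thresholds` and the example numbers are hypothetical:

```python
import math
from typing import Dict, List


def check_thresholds(scores: Dict[str, float], thresholds: Dict[str, float]) -> List[str]:
    """Return one human-readable failure message per violated threshold."""
    failures = []
    for metric, minimum in thresholds.items():
        score = scores.get(metric)
        if score is None or math.isnan(score):
            # A metric that produced no usable score counts as a failure
            failures.append(f"{metric}: no valid score (missing or NaN)")
        elif score < minimum:
            failures.append(f"{metric}: {score:.2f} < {minimum:.2f}")
    return failures


# Made-up numbers for illustration only
thresholds = {"faithfulness": 0.85, "answer_relevancy": 0.80}
scores = {"faithfulness": 0.91, "answer_relevancy": float("nan")}

failures = check_thresholds(scores, thresholds)
print(failures)
```

In a CI script, a non-empty `failures` list would typically be printed and followed by a non-zero exit so the job fails.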