Why Evaluate RAG Systems?

Building a RAG system is one thing—knowing whether it actually works well is another. Traditional evaluation methods for question-answering systems require:
  • Manual annotation of correct answers
  • Human judges to rate response quality
  • Large labeled test sets
  • Significant time and cost
This doesn’t scale well, especially when iterating on system design.
The RAGAS Solution
RAGAS (Retrieval-Augmented Generation Assessment) provides automated, LLM-based evaluation metrics that assess RAG system quality without requiring manual annotations. It evaluates both retrieval quality and generation quality, using an LLM as the judge.

The RAGAS Framework

RAGAS evaluates RAG systems across four fundamental metrics that together provide a comprehensive picture of system quality:

Faithfulness

Is the answer grounded in the retrieved context?

Answer Relevancy

Does the answer address the question?

Context Precision

Are the retrieved documents relevant?

Context Recall

Was all necessary information retrieved?

Metric 1: Faithfulness

What It Measures

Faithfulness evaluates whether the generated answer is factually grounded in the retrieved context. This is critical for preventing hallucinations—a common problem with LLMs.

Evaluation Process

  1. Extract Claims: Break down the generated answer into atomic factual claims
  2. Verify Each Claim: Check if each claim can be supported by the retrieved context
  3. Calculate Score: faithfulness = (supported_claims) / (total_claims)
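A minimal sketch of the scoring step is shown below. In RAGAS, claim extraction and per-claim verification are LLM calls; here the verdicts are supplied directly, so the snippet only illustrates the arithmetic.
# Minimal sketch of the faithfulness arithmetic (not RAGAS's internal code).
def faithfulness_score(claim_verdicts: list[bool]) -> float:
    """faithfulness = supported_claims / total_claims"""
    if not claim_verdicts:
        return 0.0
    return sum(claim_verdicts) / len(claim_verdicts)

# Worked example below: both extracted claims are supported by the context.
print(faithfulness_score([True, True]))  # 1.0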

Example

Question: ¿Cuál es la cantidad ideal de controles prenatales?
Retrieved Context:
“Se recomienda un programa de diez citas para primigestantes. Para una mujer multípara con un embarazo de curso normal se recomienda un programa de siete citas.”
Generated Answer:
“Se recomienda un programa de diez citas para embarazos primarios y siete citas para multíparas con embarazos de curso normal.”
Claims Extraction:
  1. “Se recomienda un programa de diez citas para embarazos primarios” ✓
  2. “Siete citas para multíparas con embarazos de curso normal” ✓
Faithfulness Score: 2/2 = 1.0 (perfect)

Score Interpretation

  • 1.0: All claims are supported by context (ideal)
  • 0.8-0.99: Mostly faithful with minor unsupported details
  • 0.6-0.79: Some hallucination present
  • < 0.6: Significant hallucination issues

Why Faithfulness Matters

In medical domains, hallucinations are dangerous. An answer that sounds authoritative but contains fabricated information could lead to harmful decisions. High faithfulness ensures that answers are strictly grounded in verified medical documentation.
# Low faithfulness example (hallucination)
Answer: "Se recomienda 12 controles prenatales según la OMS"
# The context never mentioned "12" or "OMS" → faithfulness = 0.0

Metric 2: Answer Relevancy

What It Measures

Answer Relevancy evaluates whether the generated answer actually addresses the user’s question. An answer can be factually correct but still irrelevant if it doesn’t answer what was asked.

Evaluation Process

  1. Generate Reverse Questions: Use an LLM to generate questions that the answer would address
  2. Compare Similarity: Measure semantic similarity between original question and generated questions
  3. Calculate Score: Average similarity across generated questions

Example

Original Question: ¿Cuál es la cantidad ideal de controles prenatales?
Generated Answer: “Se recomienda un programa de diez citas para primigestantes y siete citas para multíparas.”
Reverse Questions (generated by LLM from the answer):
  1. “¿Cuántas citas prenatales se recomiendan para primigestantes?”
  2. “¿Cuál es el número de controles recomendados durante el embarazo?”
  3. “¿Cuántos controles debe tener una mujer embarazada?”
Similarity Scores:
  • Original vs Q1: 0.92
  • Original vs Q2: 0.95
  • Original vs Q3: 0.88
Answer Relevancy Score: (0.92 + 0.95 + 0.88) / 3 = 0.92

Score Interpretation

  • 0.9-1.0: Highly relevant, directly answers the question
  • 0.7-0.89: Relevant but may include tangential information
  • 0.5-0.69: Partially relevant, misses key aspects
  • < 0.5: Off-topic or irrelevant

Why Answer Relevancy Matters

A RAG system can retrieve perfect documents but still generate answers that:
  • Include tangential information
  • Answer a different question
  • Are too general or too specific
Answer Relevancy ensures the system stays focused on the user’s actual information need.
# Low relevancy example
Question: "¿Cuándo iniciar controles prenatales?"
Answer: "Los controles prenatales incluyen exámenes de sangre, ultrasonidos..."
# Describes what happens in prenatal care, not when to start → relevancy = 0.4

Metric 3: Context Precision

What It Measures

Context Precision evaluates the relevance of retrieved documents. It measures what proportion of the retrieved context is actually useful for answering the question.

Evaluation Process

  1. Rank Retrieved Contexts: Contexts are ordered by retrieval score
  2. Classify Relevance: Use LLM to determine if each context is relevant to the question
  3. Calculate Precision: Measure the proportion of relevant contexts, weighted by position

Formula

precision@k = (relevant_in_top_k) / k

context_precision = Σ(precision@k × relevance_k) / total_relevant_contexts
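A minimal sketch of this rank-weighted formula is shown below. The per-context relevance verdicts would come from the judge LLM; here they are passed in as booleans.
def context_precision(relevance: list[bool]) -> float:
    """Sum precision@k at each relevant rank k, divided by total relevant contexts."""
    total_relevant = sum(relevance)
    if total_relevant == 0:
        return 0.0
    score, hits = 0.0, 0
    for k, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            score += hits / k  # precision@k, counted only at relevant ranks
    return score / total_relevant

# Worked example in the next section: contexts 1, 2 and 4 are relevant.
print(round(context_precision([True, True, False, True, False]), 2))  # 0.92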

Example

Question: ¿Cuándo realizar la valoración de riesgo psicosocial?
Retrieved Contexts (ordered by retrieval score):
  1. Context 1: “Se recomienda que las gestantes de bajo riesgo reciban en el momento de la inscripción al control prenatal, y luego en cada trimestre, una valoración de riesgo psicosocial” → Relevant
  2. Context 2: “La valoración debe usar la escala de Herrera & Hurtado” → Relevant
  3. Context 3: “Los controles prenatales incluyen medición de peso y presión arterial” → Not Relevant
  4. Context 4: “Se recomienda evaluar el riesgo biológico y psicosocial a todas las gestantes” → Relevant
  5. Context 5: “El parto debe ocurrir en un centro médico adecuado” → Not Relevant
Precision Calculation:
  • Relevant contexts: 3 out of 5
  • Precision@1: 1/1 = 1.0 ✓
  • Precision@2: 2/2 = 1.0 ✓
  • Precision@3: 2/3 = 0.67 ✗
  • Precision@4: 3/4 = 0.75 ✓
  • Precision@5: 3/5 = 0.60 ✗
Context Precision Score: (1.0 + 1.0 + 0.75) / 3 ≈ 0.92 (only relevant positions contribute, so top-ranked hits carry the most weight)

Score Interpretation

  • 0.9-1.0: Nearly all retrieved contexts are relevant
  • 0.7-0.89: Good precision, some noise present
  • 0.5-0.69: Significant irrelevant context
  • < 0.5: Poor retrieval, mostly irrelevant documents

Why Context Precision Matters

High context precision means:
  • Less noise for the LLM to process
  • Lower token costs (no wasted context)
  • Better answer quality (signal-to-noise ratio)
  • Efficient retrieval (the system finds what matters)
Low precision forces the LLM to sift through irrelevant information, which can:
  • Confuse the model
  • Lead to off-topic answers
  • Waste tokens and money

Metric 4: Context Recall

What It Measures

Context Recall evaluates whether the retrieval system found all the necessary information needed to answer the question completely. It measures completeness.

Evaluation Process

  1. Identify Required Facts: Extract facts from the ground truth answer
  2. Check Coverage: Determine which facts are present in retrieved context
  3. Calculate Score: recall = (facts_in_context) / (total_facts_in_ground_truth)
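The arithmetic, in a minimal sketch: fact extraction and coverage checks are LLM calls in RAGAS, so here each ground-truth fact is simply marked as found or not found in the retrieved context.
def context_recall(fact_found: list[bool]) -> float:
    """recall = facts found in the retrieved context / facts in the ground truth"""
    if not fact_found:
        return 0.0
    return sum(fact_found) / len(fact_found)

# Worked example below: all five ground-truth facts are covered.
print(context_recall([True] * 5))                        # 1.0
# If only the first three facts had been retrieved:
print(context_recall([True, True, True, False, False]))  # 0.6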

Example

Question: ¿Cuáles son las metas de ganancia de peso en las mujeres gestantes?
Ground Truth Answer:
“Se recomienda registrar el IMC en la semana 10 y establecer metas según:
  • IMC < 20: ganancia entre 12 a 18 Kg
  • IMC 20-24.9: ganancia entre 10 a 13 Kg
  • IMC 25-29.9: ganancia entre 7 a 10 Kg
  • IMC > 30: ganancia entre 6 a 7 Kg”
Required Facts (extracted from ground truth):
  1. IMC should be registered around week 10
  2. IMC < 20 → 12-18 Kg gain
  3. IMC 20-24.9 → 10-13 Kg gain
  4. IMC 25-29.9 → 7-10 Kg gain
  5. IMC > 30 → 6-7 Kg gain
Retrieved Context Coverage:
  • Context 1 mentions facts 1, 2, 3 ✓
  • Context 2 mentions facts 4, 5 ✓
Context Recall Score: 5/5 = 1.0 (complete)

Incomplete Retrieval Example

If the retrieval system only found:
  • Context 1: Facts 1, 2, 3
Then recall would be: 3/5 = 0.6 (incomplete). The answer could only cover the underweight and normal BMI ranges, missing the overweight and obese guidance.

Score Interpretation

  • 1.0: All necessary information retrieved (complete)
  • 0.8-0.99: Most information retrieved, minor gaps
  • 0.6-0.79: Significant information missing
  • < 0.6: Incomplete retrieval, major gaps

Why Context Recall Matters

High context recall ensures:
  • Complete answers that cover all aspects of the question
  • No missing information that could lead to incomplete guidance
  • Comprehensive coverage of the topic
In medical Q&A, incomplete information can be as dangerous as incorrect information. Missing a critical detail (like BMI-specific guidelines) could lead to inappropriate recommendations.

How RAGAS Evaluates the Obstetrics RAG Benchmark

Evaluation Dataset

The benchmark uses 10 carefully crafted questions about pregnancy and prenatal care, each with ground truth answers derived from clinical practice guidelines:
DATA_GT = [
    {
        "question": "¿Cuál es la cantidad ideal de controles prenatales?",
        "ground_truth": "Se recomienda un programa de diez citas. Para una mujer multípara con un embarazo de curso normal se recomienda un programa de siete citas"
    },
    # ... 9 more questions
]
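For reference, a dataset in this shape can also be scored with the ragas library directly. The sketch below assumes a ragas 0.1-style API and a configured judge LLM; the `answers` and `contexts` placeholders stand in for the outputs of the RAG pipeline under test, and the benchmark itself wraps this logic in `RAGASEvaluator`.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (answer_relevancy, context_precision,
                           context_recall, faithfulness)

# Placeholders: in practice these come from running each DATA_GT question
# through the RAG architecture being evaluated.
answers = ["..."] * len(DATA_GT)
contexts = [["..."]] * len(DATA_GT)

dataset = Dataset.from_dict({
    "question":     [item["question"] for item in DATA_GT],
    "ground_truth": [item["ground_truth"] for item in DATA_GT],
    "answer":       answers,
    "contexts":     contexts,
})

# evaluate() calls the configured judge LLM for each metric and question.
result = evaluate(dataset, metrics=[faithfulness, answer_relevancy,
                                    context_precision, context_recall])
print(result)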

Evaluation Pipeline

from src.evaluation.ragas_evaluator import RAGASEvaluator

# Initialize evaluator for a specific RAG architecture
evaluator = RAGASEvaluator(rag_type="hybrid")

# Run evaluation
results = evaluator.run_evaluation()

# Results contain:
# - faithfulness: 0.85
# - answer_relevancy: 0.78  
# - context_precision: 0.92
# - context_recall: 0.76

Evaluation Process

  1. Query Processing: Each test question is processed through the selected RAG architecture to generate an answer and retrieve contexts.
  2. Metric Computation: RAGAS computes all four metrics for each question using LLM-based evaluation.
  3. Aggregation: Results are aggregated across all questions to produce overall scores for the RAG architecture (see the sketch after this list).
  4. Comparison: Scores are compared across different RAG architectures and LLM models to identify the best performers.
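A minimal sketch of the aggregation step is shown below; the field names and numbers are illustrative, not actual benchmark results.
# Average per-question metric scores into one overall score per metric.
per_question = [
    {"faithfulness": 1.0, "answer_relevancy": 0.92,
     "context_precision": 0.92, "context_recall": 1.0},
    {"faithfulness": 0.8, "answer_relevancy": 0.75,
     "context_precision": 0.70, "context_recall": 0.6},
    # ... one dict per test question
]
overall = {metric: sum(q[metric] for q in per_question) / len(per_question)
           for metric in per_question[0]}
print(overall)  # e.g. {'faithfulness': 0.9, 'answer_relevancy': 0.835, ...}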

Interpreting RAGAS Scores

What Makes a Good RAG System?

There’s no single “passing score,” but here are general benchmarks:

Excellent Performance

  • Faithfulness: ≥ 0.9
  • Answer Relevancy: ≥ 0.85
  • Context Precision: ≥ 0.85
  • Context Recall: ≥ 0.8

Good Performance

  • Faithfulness: ≥ 0.8
  • Answer Relevancy: ≥ 0.75
  • Context Precision: ≥ 0.75
  • Context Recall: ≥ 0.7

Needs Improvement

  • Faithfulness: < 0.7
  • Answer Relevancy: < 0.65
  • Context Precision: < 0.65
  • Context Recall: < 0.6

Poor Performance

  • Faithfulness: < 0.5
  • Answer Relevancy: < 0.5
  • Context Precision: < 0.5
  • Context Recall: < 0.5
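These bands can be applied mechanically. Below is a hypothetical helper that checks a result dict against the "Good Performance" floors above; the thresholds are this guide's rough benchmarks, not RAGAS constants.
GOOD_FLOORS = {"faithfulness": 0.8, "answer_relevancy": 0.75,
               "context_precision": 0.75, "context_recall": 0.7}

def meets_good(scores: dict[str, float]) -> dict[str, bool]:
    """Flag which metrics clear the 'Good Performance' floor."""
    return {metric: scores[metric] >= floor for metric, floor in GOOD_FLOORS.items()}

# Example scores from the evaluation pipeline above: every metric clears the bar.
print(meets_good({"faithfulness": 0.85, "answer_relevancy": 0.78,
                  "context_precision": 0.92, "context_recall": 0.76}))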

Diagnostic Patterns

Pattern 1: High Precision, Low Recall
  • Symptoms: Precision > 0.8, Recall < 0.6
  • Problem: Retrieval is finding relevant docs but missing important information
  • Solution: Increase k (retrieve more documents), try hybrid search
Pattern 2: High Recall, Low Precision
  • Symptoms: Recall > 0.8, Precision < 0.6
  • Problem: Retrieval is finding everything but including too much noise
  • Solution: Improve query processing, add reranking, use HyDE
Pattern 3: High Faithfulness, Low Relevancy
  • Symptoms: Faithfulness > 0.9, Relevancy < 0.6
  • Problem: LLM is correctly using context but not answering the question
  • Solution: Improve answer generation prompt, tune LLM temperature
Pattern 4: Low Faithfulness, High Relevancy
  • Symptoms: Faithfulness < 0.7, Relevancy > 0.8
  • Problem: LLM is answering the right question but hallucinating facts
  • Solution: Strengthen prompt instructions, use more reliable LLM
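A hypothetical helper that flags these patterns from an aggregated result dict is sketched below; the thresholds simply mirror the symptoms listed above.
def diagnose(scores: dict[str, float]) -> str:
    """Map a RAGAS result dict to one of the failure patterns above."""
    p, r = scores["context_precision"], scores["context_recall"]
    f, a = scores["faithfulness"], scores["answer_relevancy"]
    if p > 0.8 and r < 0.6:
        return "High precision, low recall: increase k or try hybrid search"
    if r > 0.8 and p < 0.6:
        return "High recall, low precision: improve query processing, add reranking, or use HyDE"
    if f > 0.9 and a < 0.6:
        return "High faithfulness, low relevancy: improve the answer-generation prompt"
    if f < 0.7 and a > 0.8:
        return "Low faithfulness, high relevancy: strengthen grounding instructions or switch LLMs"
    return "No single dominant failure pattern"

print(diagnose({"faithfulness": 0.85, "answer_relevancy": 0.78,
                "context_precision": 0.92, "context_recall": 0.55}))
# High precision, low recall: increase k or try hybrid search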

Automated Evaluation Benefits

  • Evaluate 10 questions across 6 RAG architectures in minutes instead of hours of manual review.
  • LLM-based evaluation applies consistent criteria across all questions, eliminating human rater variability.
  • Add more questions, architectures, or models without linearly scaling human effort.
  • Results can be reproduced by anyone with the same dataset and evaluation code.
  • LLM-based evaluation costs cents per question vs. dollars for human annotation.

Limitations and Considerations

While RAGAS is powerful, it’s important to understand its limitations:
LLM Judge Biases
RAGAS uses LLMs to evaluate LLM outputs, which can introduce biases:
  • May favor certain answer styles
  • Can miss subtle medical errors
  • Evaluation quality depends on the judge LLM’s capabilities
Complement with Human Review
For production medical systems, RAGAS should complement, not replace, expert human review, especially for:
  • Clinical accuracy validation
  • Safety-critical applications
  • Regulatory compliance
  • Edge cases and rare scenarios

Best Practices

  1. Use RAGAS for rapid iteration during development
  2. Validate findings with human expert review on a sample
  3. Monitor trends rather than absolute scores
  4. Compare architectures relative to each other
  5. Track improvements over time as you tune the system

Next Steps

Run Evaluations

Learn how to evaluate your RAG architectures with RAGAS

Analyze Results

Interpret evaluation results and identify improvements

Data Pipeline

Understand how documents are processed for RAG

RAG Architectures

Review the different retrieval strategies being evaluated