Why Evaluate RAG Systems?

Building a RAG system is one thing—knowing whether it actually works well is another. Traditional evaluation methods for question-answering systems require:
  • Manual annotation of correct answers
  • Human judges to rate response quality
  • Large labeled test sets
  • Significant time and cost
This doesn’t scale well, especially when iterating on system design.
The RAGAS Solution
RAGAS (Retrieval-Augmented Generation Assessment) provides automated, LLM-based evaluation metrics that assess RAG system quality without requiring manual annotations. It evaluates both retrieval quality and generation quality, using an LLM as the judge.

The RAGAS Framework

RAGAS evaluates RAG systems across four fundamental metrics that together provide a comprehensive picture of system quality:

Faithfulness

Is the answer grounded in the retrieved context?

Answer Relevancy

Does the answer address the question?

Context Precision

Are the retrieved documents relevant?

Context Recall

Was all necessary information retrieved?

Metric 1: Faithfulness

What It Measures

Faithfulness evaluates whether the generated answer is factually grounded in the retrieved context. This is critical for preventing hallucinations—a common problem with LLMs.

Evaluation Process

  1. Extract Claims: Break down the generated answer into atomic factual claims
  2. Verify Each Claim: Check if each claim can be supported by the retrieved context
  3. Calculate Score: faithfulness = (supported_claims) / (total_claims)
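A minimal sketch of the scoring step is shown below. In RAGAS, claim extraction and per-claim verification are LLM calls; here the verdicts are supplied directly, so the snippet only illustrates the arithmetic.
# Minimal sketch of the faithfulness arithmetic (not RAGAS's internal code).
def faithfulness_score(claim_verdicts: list[bool]) -> float:
    """faithfulness = supported_claims / total_claims"""
    if not claim_verdicts:
        return 0.0
    return sum(claim_verdicts) / len(claim_verdicts)

# Worked example below: both extracted claims are supported by the context.
print(faithfulness_score([True, True]))  # 1.0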

Example

Question: ¿Cuál es la cantidad ideal de controles prenatales?
Retrieved Context:
“Se recomienda un programa de diez citas para primigestantes. Para una mujer multípara con un embarazo de curso normal se recomienda un programa de siete citas.”
Generated Answer:
“Se recomienda un programa de diez citas para embarazos primarios y siete citas para multíparas con embarazos de curso normal.”
Claims Extraction:
  1. “Se recomienda un programa de diez citas para embarazos primarios” ✓
  2. “Siete citas para multíparas con embarazos de curso normal” ✓
Faithfulness Score: 2/2 = 1.0 (perfect)

Score Interpretation

  • 1.0: All claims are supported by context (ideal)
  • 0.8-0.99: Mostly faithful with minor unsupported details
  • 0.6-0.79: Some hallucination present
  • < 0.6: Significant hallucination issues

Why Faithfulness Matters

In medical domains, hallucinations are dangerous. An answer that sounds authoritative but contains fabricated information could lead to harmful decisions. High faithfulness ensures that answers are strictly grounded in verified medical documentation.
# Low faithfulness example (hallucination)
Answer: "Se recomienda 12 controles prenatales según la OMS"
# The context never mentioned "12" or "OMS" → faithfulness = 0.0

Metric 2: Answer Relevancy

What It Measures

Answer Relevancy evaluates whether the generated answer actually addresses the user’s question. An answer can be factually correct but still irrelevant if it doesn’t answer what was asked.

Evaluation Process

  1. Generate Reverse Questions: Use an LLM to generate questions that the answer would address
  2. Compare Similarity: Measure semantic similarity between original question and generated questions
  3. Calculate Score: Average similarity across generated questions

Example

Original Question: ¿Cuál es la cantidad ideal de controles prenatales?
Generated Answer: “Se recomienda un programa de diez citas para primigestantes y siete citas para multíparas.”
Reverse Questions (generated by LLM from the answer):
  1. “¿Cuántas citas prenatales se recomiendan para primigestantes?”
  2. “¿Cuál es el número de controles recomendados durante el embarazo?”
  3. “¿Cuántos controles debe tener una mujer embarazada?”
Similarity Scores:
  • Original vs Q1: 0.92
  • Original vs Q2: 0.95
  • Original vs Q3: 0.88
Answer Relevancy Score: (0.92 + 0.95 + 0.88) / 3 = 0.92

Score Interpretation

  • 0.9-1.0: Highly relevant, directly answers the question
  • 0.7-0.89: Relevant but may include tangential information
  • 0.5-0.69: Partially relevant, misses key aspects
  • < 0.5: Off-topic or irrelevant

Why Answer Relevancy Matters

A RAG system can retrieve perfect documents but still generate answers that:
  • Include tangential information
  • Answer a different question
  • Are too general or too specific
Answer Relevancy ensures the system stays focused on the user’s actual information need.
# Low relevancy example
Question: "¿Cuándo iniciar controles prenatales?"
Answer: "Los controles prenatales incluyen exámenes de sangre, ultrasonidos..."
# Describes what happens in prenatal care, not when to start → relevancy = 0.4

Metric 3: Context Precision

What It Measures

Context Precision evaluates the relevance of retrieved documents. It measures what proportion of the retrieved context is actually useful for answering the question.

Evaluation Process

  1. Rank Retrieved Contexts: Contexts are ordered by retrieval score
  2. Classify Relevance: Use LLM to determine if each context is relevant to the question
  3. Calculate Precision: Measure the proportion of relevant contexts, weighted by position

Formula

precision@k = (relevant_in_top_k) / k

context_precision = Σ(precision@k × relevance_k) / total_relevant_contexts
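A minimal sketch of this rank-weighted formula is shown below. The per-context relevance verdicts would come from the judge LLM; here they are passed in as booleans.
def context_precision(relevance: list[bool]) -> float:
    """Sum precision@k at each relevant rank k, divided by total relevant contexts."""
    total_relevant = sum(relevance)
    if total_relevant == 0:
        return 0.0
    score, hits = 0.0, 0
    for k, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            score += hits / k  # precision@k, counted only at relevant ranks
    return score / total_relevant

# Worked example in the next section: contexts 1, 2 and 4 are relevant.
print(round(context_precision([True, True, False, True, False]), 2))  # 0.92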

Example

Question: ¿Cuándo realizar la valoración de riesgo psicosocial?
Retrieved Contexts (ordered by retrieval score):
  1. Context 1: “Se recomienda que las gestantes de bajo riesgo reciban en el momento de la inscripción al control prenatal, y luego en cada trimestre, una valoración de riesgo psicosocial” → Relevant
  2. Context 2: “La valoración debe usar la escala de Herrera & Hurtado” → Relevant
  3. Context 3: “Los controles prenatales incluyen medición de peso y presión arterial” → Not Relevant
  4. Context 4: “Se recomienda evaluar el riesgo biológico y psicosocial a todas las gestantes” → Relevant
  5. Context 5: “El parto debe ocurrir en un centro médico adecuado” → Not Relevant
Precision Calculation:
  • Relevant contexts: 3 out of 5
  • Precision@1: 1/1 = 1.0 ✓
  • Precision@2: 2/2 = 1.0 ✓
  • Precision@3: 2/3 = 0.67 ✗
  • Precision@4: 3/4 = 0.75 ✓
  • Precision@5: 3/5 = 0.60 ✗
Context Precision Score: (1.0 + 1.0 + 0.75) / 3 ≈ 0.92 (only relevant positions contribute, so top-ranked hits carry the most weight)

Score Interpretation

  • 0.9-1.0: Nearly all retrieved contexts are relevant
  • 0.7-0.89: Good precision, some noise present
  • 0.5-0.69: Significant irrelevant context
  • < 0.5: Poor retrieval, mostly irrelevant documents

Why Context Precision Matters

High context precision means:
  • Less noise for the LLM to process
  • Lower token costs (no wasted context)
  • Better answer quality (signal-to-noise ratio)
  • Efficient retrieval (the system finds what matters)
Low precision forces the LLM to sift through irrelevant information, which can:
  • Confuse the model
  • Lead to off-topic answers
  • Waste tokens and money

Metric 4: Context Recall

What It Measures

Context Recall evaluates whether the retrieval system found all the necessary information needed to answer the question completely. It measures completeness.

Evaluation Process

  1. Identify Required Facts: Extract facts from the ground truth answer
  2. Check Coverage: Determine which facts are present in retrieved context
  3. Calculate Score: recall = (facts_in_context) / (total_facts_in_ground_truth)
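The arithmetic, in a minimal sketch: fact extraction and coverage checks are LLM calls in RAGAS, so here each ground-truth fact is simply marked as found or not found in the retrieved context.
def context_recall(fact_found: list[bool]) -> float:
    """recall = facts found in the retrieved context / facts in the ground truth"""
    if not fact_found:
        return 0.0
    return sum(fact_found) / len(fact_found)

# Worked example below: all five ground-truth facts are covered.
print(context_recall([True] * 5))                        # 1.0
# If only the first three facts had been retrieved:
print(context_recall([True, True, True, False, False]))  # 0.6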

Example

Question: ¿Cuáles son las metas de ganancia de peso en las mujeres gestantes?
Ground Truth Answer:
“Se recomienda registrar el IMC en la semana 10 y establecer metas según:
  • IMC < 20: ganancia entre 12 a 18 Kg
  • IMC 20-24.9: ganancia entre 10 a 13 Kg
  • IMC 25-29.9: ganancia entre 7 a 10 Kg
  • IMC > 30: ganancia entre 6 a 7 Kg”
Required Facts (extracted from ground truth):
  1. IMC should be registered around week 10
  2. IMC < 20 → 12-18 Kg gain
  3. IMC 20-24.9 → 10-13 Kg gain
  4. IMC 25-29.9 → 7-10 Kg gain
  5. IMC > 30 → 6-7 Kg gain
Retrieved Context Coverage:
  • Context 1 mentions facts 1, 2, 3 ✓
  • Context 2 mentions facts 4, 5 ✓
Context Recall Score: 5/5 = 1.0 (complete)

Incomplete Retrieval Example

If the retrieval system only found:
  • Context 1: Facts 1, 2, 3
Then recall would be: 3/5 = 0.6 (incomplete). The answer could only cover the underweight and normal BMI ranges, missing the overweight and obese guidance.

Score Interpretation

  • 1.0: All necessary information retrieved (complete)
  • 0.8-0.99: Most information retrieved, minor gaps
  • 0.6-0.79: Significant information missing
  • < 0.6: Incomplete retrieval, major gaps

Why Context Recall Matters

High context recall ensures:
  • Complete answers that cover all aspects of the question
  • No missing information that could lead to incomplete guidance
  • Comprehensive coverage of the topic
In medical Q&A, incomplete information can be as dangerous as incorrect information. Missing a critical detail (like BMI-specific guidelines) could lead to inappropriate recommendations.

How RAGAS Evaluates the Obstetrics RAG Benchmark

Evaluation Dataset

The benchmark uses 10 carefully crafted questions about pregnancy and prenatal care, each with ground truth answers derived from clinical practice guidelines:
DATA_GT = [
    {
        "question": "¿Cuál es la cantidad ideal de controles prenatales?",
        "ground_truth": "Se recomienda un programa de diez citas. Para una mujer multípara con un embarazo de curso normal se recomienda un programa de siete citas"
    },
    # ... 9 more questions
]
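For reference, a dataset in this shape can also be scored with the ragas library directly. The sketch below assumes a ragas 0.1-style API and a configured judge LLM; the `answers` and `contexts` placeholders stand in for the outputs of the RAG pipeline under test, and the benchmark itself wraps this logic in `RAGASEvaluator`.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (answer_relevancy, context_precision,
                           context_recall, faithfulness)

# Placeholders: in practice these come from running each DATA_GT question
# through the RAG architecture being evaluated.
answers = ["..."] * len(DATA_GT)
contexts = [["..."]] * len(DATA_GT)

dataset = Dataset.from_dict({
    "question":     [item["question"] for item in DATA_GT],
    "ground_truth": [item["ground_truth"] for item in DATA_GT],
    "answer":       answers,
    "contexts":     contexts,
})

# evaluate() calls the configured judge LLM for each metric and question.
result = evaluate(dataset, metrics=[faithfulness, answer_relevancy,
                                    context_precision, context_recall])
print(result)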

Evaluation Pipeline

from src.evaluation.ragas_evaluator import RAGASEvaluator

# Initialize evaluator for a specific RAG architecture
evaluator = RAGASEvaluator(rag_type="hybrid")

# Run evaluation
results = evaluator.run_evaluation()

# Results contain:
# - faithfulness: 0.85
# - answer_relevancy: 0.78  
# - context_precision: 0.92
# - context_recall: 0.76

Evaluation Process

  1. Query Processing: Each test question is processed through the selected RAG architecture to generate an answer and retrieve contexts.
  2. Metric Computation: RAGAS computes all four metrics for each question using LLM-based evaluation.
  3. Aggregation: Results are aggregated across all questions to produce overall scores for the RAG architecture (see the sketch after this list).
  4. Comparison: Scores are compared across different RAG architectures and LLM models to identify the best performers.
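A minimal sketch of the aggregation step is shown below; the field names and numbers are illustrative, not actual benchmark results.
# Average per-question metric scores into one overall score per metric.
per_question = [
    {"faithfulness": 1.0, "answer_relevancy": 0.92,
     "context_precision": 0.92, "context_recall": 1.0},
    {"faithfulness": 0.8, "answer_relevancy": 0.75,
     "context_precision": 0.70, "context_recall": 0.6},
    # ... one dict per test question
]
overall = {metric: sum(q[metric] for q in per_question) / len(per_question)
           for metric in per_question[0]}
print(overall)  # e.g. {'faithfulness': 0.9, 'answer_relevancy': 0.835, ...}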

Interpreting RAGAS Scores

What Makes a Good RAG System?

There’s no single “passing score,” but here are general benchmarks:

Excellent Performance

  • Faithfulness: ≥ 0.9
  • Answer Relevancy: ≥ 0.85
  • Context Precision: ≥ 0.85
  • Context Recall: ≥ 0.8

Good Performance

  • Faithfulness: ≥ 0.8
  • Answer Relevancy: ≥ 0.75
  • Context Precision: ≥ 0.75
  • Context Recall: ≥ 0.7

Needs Improvement

  • Faithfulness: < 0.7
  • Answer Relevancy: < 0.65
  • Context Precision: < 0.65
  • Context Recall: < 0.6

Poor Performance

  • Faithfulness: < 0.5
  • Answer Relevancy: < 0.5
  • Context Precision: < 0.5
  • Context Recall: < 0.5
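These bands can be applied mechanically. Below is a hypothetical helper that checks a result dict against the "Good Performance" floors above; the thresholds are this guide's rough benchmarks, not RAGAS constants.
GOOD_FLOORS = {"faithfulness": 0.8, "answer_relevancy": 0.75,
               "context_precision": 0.75, "context_recall": 0.7}

def meets_good(scores: dict[str, float]) -> dict[str, bool]:
    """Flag which metrics clear the 'Good Performance' floor."""
    return {metric: scores[metric] >= floor for metric, floor in GOOD_FLOORS.items()}

# Example scores from the evaluation pipeline above: every metric clears the bar.
print(meets_good({"faithfulness": 0.85, "answer_relevancy": 0.78,
                  "context_precision": 0.92, "context_recall": 0.76}))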

Diagnostic Patterns

Pattern 1: High Precision, Low Recall
  • Symptoms: Precision > 0.8, Recall < 0.6
  • Problem: Retrieval is finding relevant docs but missing important information
  • Solution: Increase k (retrieve more documents), try hybrid search
Pattern 2: High Recall, Low Precision
  • Symptoms: Recall > 0.8, Precision < 0.6
  • Problem: Retrieval is finding everything but including too much noise
  • Solution: Improve query processing, add reranking, use HyDE
Pattern 3: High Faithfulness, Low Relevancy
  • Symptoms: Faithfulness > 0.9, Relevancy < 0.6
  • Problem: LLM is correctly using context but not answering the question
  • Solution: Improve answer generation prompt, tune LLM temperature
Pattern 4: Low Faithfulness, High Relevancy
  • Symptoms: Faithfulness < 0.7, Relevancy > 0.8
  • Problem: LLM is answering the right question but hallucinating facts
  • Solution: Strengthen prompt instructions, use more reliable LLM
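A hypothetical helper that flags these patterns from an aggregated result dict is sketched below; the thresholds simply mirror the symptoms listed above.
def diagnose(scores: dict[str, float]) -> str:
    """Map a RAGAS result dict to one of the failure patterns above."""
    p, r = scores["context_precision"], scores["context_recall"]
    f, a = scores["faithfulness"], scores["answer_relevancy"]
    if p > 0.8 and r < 0.6:
        return "High precision, low recall: increase k or try hybrid search"
    if r > 0.8 and p < 0.6:
        return "High recall, low precision: improve query processing, add reranking, or use HyDE"
    if f > 0.9 and a < 0.6:
        return "High faithfulness, low relevancy: improve the answer-generation prompt"
    if f < 0.7 and a > 0.8:
        return "Low faithfulness, high relevancy: strengthen grounding instructions or switch LLMs"
    return "No single dominant failure pattern"

print(diagnose({"faithfulness": 0.85, "answer_relevancy": 0.78,
                "context_precision": 0.92, "context_recall": 0.55}))
# High precision, low recall: increase k or try hybrid search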

Automated Evaluation Benefits

  • Evaluate 10 questions across 6 RAG architectures in minutes instead of hours of manual review.
  • LLM-based evaluation applies consistent criteria across all questions, eliminating human rater variability.
  • Add more questions, architectures, or models without linearly scaling human effort.
  • Results can be reproduced by anyone with the same dataset and evaluation code.
  • LLM-based evaluation costs cents per question vs. dollars for human annotation.

Limitations and Considerations

While RAGAS is powerful, it’s important to understand its limitations:
LLM Judge Biases
RAGAS uses LLMs to evaluate LLM outputs, which can introduce biases:
  • May favor certain answer styles
  • Can miss subtle medical errors
  • Evaluation quality depends on the judge LLM’s capabilities
Complement with Human Review
For production medical systems, RAGAS should complement, not replace, expert human review, especially for:
  • Clinical accuracy validation
  • Safety-critical applications
  • Regulatory compliance
  • Edge cases and rare scenarios

Best Practices

  1. Use RAGAS for rapid iteration during development
  2. Validate findings with human expert review on a sample
  3. Monitor trends rather than absolute scores
  4. Compare architectures relative to each other
  5. Track improvements over time as you tune the system

Next Steps

Run Evaluations

Learn how to evaluate your RAG architectures with RAGAS

Analyze Results

Interpret evaluation results and identify improvements

Data Pipeline

Understand how documents are processed for RAG

RAG Architectures

Review the different retrieval strategies being evaluated