Documentation Index
Fetch the complete documentation index at: https://mintlify.com/JhonHander/obstetrics-rag-benchmark/llms.txt
Use this file to discover all available pages before exploring further.
Why Evaluate RAG Systems?
Building a RAG system is one thing—knowing whether it actually works well is another. Traditional evaluation methods for question-answering systems require:
- Manual annotation of correct answers
- Human judges to rate response quality
- Large labeled test sets
- Significant time and cost
The RAGAS Solution
RAGAS (Retrieval-Augmented Generation Assessment) provides automated, LLM-based evaluation metrics that assess RAG system quality without requiring manual annotations. It evaluates both retrieval quality and generation quality using an LLM as the judge.
The RAGAS Framework
RAGAS evaluates RAG systems across four fundamental metrics that together provide a comprehensive picture of system quality:
- Faithfulness: Is the answer grounded in the retrieved context?
- Answer Relevancy: Does the answer address the question?
- Context Precision: Are the retrieved documents relevant?
- Context Recall: Was all necessary information retrieved?
Metric 1: Faithfulness
What It Measures
Faithfulness evaluates whether the generated answer is factually grounded in the retrieved context. This is critical for preventing hallucinations—a common problem with LLMs.
How Faithfulness Works
Evaluation Process
- Extract Claims: Break down the generated answer into atomic factual claims
- Verify Each Claim: Check if each claim can be supported by the retrieved context
- Calculate Score:
faithfulness = (supported_claims) / (total_claims)
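To make the computation concrete, here is a minimal Python sketch of the claim-based scoring. The helpers `extract_claims` and `is_supported` are illustrative stand-ins, not part of any specific library; in RAGAS both steps are performed by an LLM judge.

```python
def extract_claims(answer: str) -> list[str]:
    # Stand-in: RAGAS uses an LLM to split the answer into atomic claims.
    # Here we simply treat each sentence as one claim.
    return [s.strip() for s in answer.split(".") if s.strip()]

def is_supported(claim: str, context: str) -> bool:
    # Stand-in: RAGAS asks an LLM whether the context entails the claim.
    # Here we approximate with crude word overlap.
    claim_words = set(claim.lower().split())
    overlap = sum(w in context.lower() for w in claim_words)
    return overlap / max(len(claim_words), 1) >= 0.7

def faithfulness_score(answer: str, context: str) -> float:
    claims = extract_claims(answer)
    if not claims:
        return 0.0
    supported = sum(is_supported(c, context) for c in claims)
    return supported / len(claims)  # supported_claims / total_claims
```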
Example
Question: ¿Cuál es la cantidad ideal de controles prenatales? (“What is the ideal number of prenatal check-ups?”)
Retrieved Context: “Se recomienda un programa de diez citas para primigestantes. Para una mujer multípara con un embarazo de curso normal se recomienda un programa de siete citas.” (“A ten-visit programme is recommended for first-time mothers. For a multiparous woman with a normal-course pregnancy, a seven-visit programme is recommended.”)
Generated Answer: “Se recomienda un programa de diez citas para embarazos primarios y siete citas para multíparas con embarazos de curso normal.” (“A ten-visit programme is recommended for first pregnancies and seven visits for multiparous women with normal-course pregnancies.”)
Claims Extraction:
- “Se recomienda un programa de diez citas para embarazos primarios” ✓
- “Siete citas para multíparas con embarazos de curso normal” ✓
Faithfulness = 2 supported claims / 2 total claims = 1.0
Score Interpretation
- 1.0: All claims are supported by context (ideal)
- 0.8-0.99: Mostly faithful with minor unsupported details
- 0.6-0.79: Some hallucination present
- < 0.6: Significant hallucination issues
Why Faithfulness Matters
In medical domains, hallucinations are dangerous. An answer that sounds authoritative but contains fabricated information could lead to harmful decisions. High faithfulness ensures that answers are strictly grounded in verified medical documentation.
Metric 2: Answer Relevancy
What It Measures
Answer Relevancy evaluates whether the generated answer actually addresses the user’s question. An answer can be factually correct but still irrelevant if it doesn’t answer what was asked.
How Answer Relevancy Works
Evaluation Process
- Generate Reverse Questions: Use an LLM to generate questions that the answer would address
- Compare Similarity: Measure semantic similarity between original question and generated questions
- Calculate Score: Average similarity across generated questions
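A minimal sketch of steps 2 and 3, assuming the LLM-generated reverse questions are already available. The `embed` argument is any text-embedding function (OpenAI, sentence-transformers, etc.) passed in by the caller rather than tied to a particular provider.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def answer_relevancy_score(question: str, reverse_questions: list[str], embed) -> float:
    # embed: any callable mapping text to a vector.
    q_vec = np.asarray(embed(question))
    sims = [cosine(q_vec, np.asarray(embed(rq))) for rq in reverse_questions]
    return float(np.mean(sims))  # average similarity across reverse questions
```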
Example
Original Question: ¿Cuál es la cantidad ideal de controles prenatales? (“What is the ideal number of prenatal check-ups?”)
Generated Answer: “Se recomienda un programa de diez citas para primigestantes y siete citas para multíparas.” (“A ten-visit programme is recommended for first-time mothers and seven visits for multiparous women.”)
Reverse Questions (generated by the LLM from the answer):
- “¿Cuántas citas prenatales se recomiendan para primigestantes?” (“How many prenatal visits are recommended for first-time mothers?”)
- “¿Cuál es el número de controles recomendados durante el embarazo?” (“What is the recommended number of check-ups during pregnancy?”)
- “¿Cuántos controles debe tener una mujer embarazada?” (“How many check-ups should a pregnant woman have?”)
Semantic similarity to the original question:
- Original vs Q1: 0.92
- Original vs Q2: 0.95
- Original vs Q3: 0.88
Answer Relevancy = (0.92 + 0.95 + 0.88) / 3 ≈ 0.92
Score Interpretation
- 0.9-1.0: Highly relevant, directly answers the question
- 0.7-0.89: Relevant but may include tangential information
- 0.5-0.69: Partially relevant, misses key aspects
- < 0.5: Off-topic or irrelevant
Why Answer Relevancy Matters
A RAG system can retrieve perfect documents but still generate answers that:
- Include tangential information
- Answer a different question
- Are too general or too specific
Metric 3: Context Precision
What It Measures
Context Precision evaluates the relevance of retrieved documents. It measures what proportion of the retrieved context is actually useful for answering the question.
How Context Precision Works
Evaluation Process
- Rank Retrieved Contexts: Contexts are ordered by retrieval score
- Classify Relevance: Use LLM to determine if each context is relevant to the question
- Calculate Precision: Measure the proportion of relevant contexts, weighted by position
Formula
context_precision@K = sum(precision@k × v_k for k = 1..K) / (number of relevant contexts in top K)
where precision@k is the proportion of relevant contexts within the top k results, and v_k is 1 if the context at rank k is relevant, otherwise 0.
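A small Python sketch of this position-weighted calculation, assuming the per-context relevance verdicts have already been produced by the LLM judge; the printed value matches the worked example below.

```python
def context_precision_score(relevance: list[bool]) -> float:
    # relevance[i] is the LLM's verdict for the context retrieved at rank i + 1.
    total_relevant = sum(relevance)
    if total_relevant == 0:
        return 0.0
    score = 0.0
    hits = 0
    for k, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            score += hits / k  # precision@k, counted only at relevant ranks
    return score / total_relevant

# Contexts 1, 2 and 4 relevant (as in the example below): (1.0 + 1.0 + 0.75) / 3
print(round(context_precision_score([True, True, False, True, False]), 2))  # 0.92
```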
Example
Question: ¿Cuándo realizar la valoración de riesgo psicosocial? (“When should the psychosocial risk assessment be performed?”)
Retrieved Contexts (ordered by retrieval score):
- Context 1: “Se recomienda que las gestantes de bajo riesgo reciban en el momento de la inscripción al control prenatal, y luego en cada trimestre, una valoración de riesgo psicosocial” (“It is recommended that low-risk pregnant women receive a psychosocial risk assessment at enrolment in prenatal care and then every trimester”) → Relevant ✓
- Context 2: “La valoración debe usar la escala de Herrera & Hurtado” (“The assessment should use the Herrera & Hurtado scale”) → Relevant ✓
- Context 3: “Los controles prenatales incluyen medición de peso y presión arterial” (“Prenatal check-ups include measuring weight and blood pressure”) → Not Relevant ✗
- Context 4: “Se recomienda evaluar el riesgo biológico y psicosocial a todas las gestantes” (“Biological and psychosocial risk assessment is recommended for all pregnant women”) → Relevant ✓
- Context 5: “El parto debe ocurrir en un centro médico adecuado” (“Delivery should take place in an adequate medical facility”) → Not Relevant ✗
Precision Calculation:
- Relevant contexts: 3 out of 5
- Precision@1: 1/1 = 1.0 ✓
- Precision@2: 2/2 = 1.0 ✓
- Precision@3: 2/3 = 0.67 ✗
- Precision@4: 3/4 = 0.75 ✓
- Precision@5: 3/5 = 0.60 ✗
(✓ marks ranks holding a relevant context; only those ranks count toward the score.)
Context Precision = (1.0 + 1.0 + 0.75) / 3 ≈ 0.92
Score Interpretation
- 0.9-1.0: Nearly all retrieved contexts are relevant
- 0.7-0.89: Good precision, some noise present
- 0.5-0.69: Significant irrelevant context
- < 0.5: Poor retrieval, mostly irrelevant documents
Why Context Precision Matters
High context precision means:
- Less noise for the LLM to process
- Lower token costs (no wasted context)
- Better answer quality (higher signal-to-noise ratio)
- Efficient retrieval (the system finds what matters)
Conversely, irrelevant retrieved contexts can:
- Confuse the model
- Lead to off-topic answers
- Waste tokens and money
Metric 4: Context Recall
What It Measures
Context Recall evaluates whether the retrieval system found all the necessary information needed to answer the question completely. It measures completeness.
How Context Recall Works
Evaluation Process
- Identify Required Facts: Extract facts from the ground truth answer
- Check Coverage: Determine which facts are present in retrieved context
- Calculate Score:
recall = (facts_in_context) / (total_facts_in_ground_truth)
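A minimal sketch of this calculation, with a naive stand-in for the LLM step that decides whether each ground-truth fact is attributable to the retrieved text:

```python
def context_recall_score(ground_truth_facts: list[str], contexts: list[str]) -> float:
    joined = " ".join(contexts).lower()

    def found(fact: str) -> bool:
        # Stand-in: RAGAS asks an LLM whether the fact can be attributed to the
        # retrieved contexts. Here we approximate with crude word overlap.
        words = fact.lower().split()
        return sum(w in joined for w in words) / max(len(words), 1) >= 0.7

    if not ground_truth_facts:
        return 0.0
    return sum(found(f) for f in ground_truth_facts) / len(ground_truth_facts)
```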
Example
Question: ¿Cuáles son las metas de ganancia de peso en las mujeres gestantes? (“What are the weight-gain targets for pregnant women?”)
Ground Truth Answer: “Se recomienda registrar el IMC en la semana 10 y establecer metas según:
- IMC < 20: ganancia entre 12 a 18 Kg
- IMC 20-24.9: ganancia entre 10 a 13 Kg
- IMC 25-29.9: ganancia entre 7 a 10 Kg
- IMC > 30: ganancia entre 6 a 7 Kg”
(“BMI should be recorded at week 10 and weight-gain targets set according to the BMI ranges above.”)
Required Facts (extracted from ground truth):
1. IMC (BMI) should be registered around week 10
2. IMC < 20 → 12-18 Kg gain
3. IMC 20-24.9 → 10-13 Kg gain
4. IMC 25-29.9 → 7-10 Kg gain
5. IMC > 30 → 6-7 Kg gain
Coverage Check:
- Context 1 mentions facts 1, 2, 3 ✓
- Context 2 mentions facts 4, 5 ✓
Context Recall = 5 facts found / 5 total facts = 1.0
Incomplete Retrieval Example
If the retrieval system had only found:
- Context 1: facts 1, 2, 3
then Context Recall = 3/5 = 0.6, and the generated answer would be missing the targets for IMC 25-29.9 and IMC > 30.
Score Interpretation
- 1.0: All necessary information retrieved (complete)
- 0.8-0.99: Most information retrieved, minor gaps
- 0.6-0.79: Significant information missing
- < 0.6: Incomplete retrieval, major gaps
Why Context Recall Matters
High context recall ensures:
- Complete answers that cover all aspects of the question
- No missing information that could lead to incomplete guidance
- Comprehensive coverage of the topic
How RAGAS Evaluates the Obstetrics RAG Benchmark
Evaluation Dataset
The benchmark uses 10 carefully crafted questions about pregnancy and prenatal care, each with ground truth answers derived from clinical practice guidelines.
Evaluation Pipeline
Evaluation Process
- Query Processing: Each test question is processed through the selected RAG architecture to generate an answer and retrieve contexts.
- RAGAS Scoring: The question, generated answer, retrieved contexts, and ground truth are scored against the four metrics described above.
- Aggregation: Results are aggregated across all questions to produce overall scores for the RAG architecture.
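As a rough sketch of what this pipeline looks like in code, the snippet below uses the ragas Python package (0.1-style API; column names and defaults may differ across versions) on a single benchmark-style row. By default ragas judges with an OpenAI model, so an API key must be configured, or you can pass your own llm and embeddings to evaluate().

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)

# One benchmark-style row; the real evaluation set has 10 such questions.
data = {
    "question": ["¿Cuál es la cantidad ideal de controles prenatales?"],
    "answer": ["Se recomienda un programa de diez citas para primigestantes y siete para multíparas."],
    "contexts": [[
        "Se recomienda un programa de diez citas para primigestantes. "
        "Para una mujer multípara con un embarazo de curso normal se recomienda un programa de siete citas."
    ]],
    "ground_truth": ["Diez citas para primigestantes y siete citas para multíparas con embarazo de curso normal."],
}

result = evaluate(
    Dataset.from_dict(data),
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(result)  # per-metric averages across all rows
```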
Interpreting RAGAS Scores
What Makes a Good RAG System?
There’s no single “passing score,” but here are general benchmarks:
Excellent Performance
- Faithfulness: ≥ 0.9
- Answer Relevancy: ≥ 0.85
- Context Precision: ≥ 0.85
- Context Recall: ≥ 0.8
Good Performance
- Faithfulness: ≥ 0.8
- Answer Relevancy: ≥ 0.75
- Context Precision: ≥ 0.75
- Context Recall: ≥ 0.7
Needs Improvement
- Faithfulness: < 0.7
- Answer Relevancy: < 0.65
- Context Precision: < 0.65
- Context Recall: < 0.6
Poor Performance
- Faithfulness: < 0.5
- Answer Relevancy: < 0.5
- Context Precision: < 0.5
- Context Recall: < 0.5
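If you want to map raw scores onto these qualitative bands programmatically, a small helper like the following works. The thresholds are the ones listed above (specific to this guide), not values defined by RAGAS itself.

```python
# Lower bounds for "excellent" and "good" per metric; anything below 0.5 is "poor".
BANDS = {
    "faithfulness":      (0.9, 0.8),
    "answer_relevancy":  (0.85, 0.75),
    "context_precision": (0.85, 0.75),
    "context_recall":    (0.8, 0.7),
}

def grade(metric: str, score: float) -> str:
    excellent, good = BANDS[metric]
    if score >= excellent:
        return "excellent"
    if score >= good:
        return "good"
    if score >= 0.5:
        return "needs improvement"
    return "poor"

print(grade("faithfulness", 0.93))  # excellent
```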
Diagnostic Patterns
Pattern 1: High Precision, Low Recall
- Symptoms: Precision > 0.8, Recall < 0.6
- Problem: Retrieval is finding relevant docs but missing important information
- Solution: Increase k (retrieve more documents), try hybrid search
Pattern 2: High Recall, Low Precision
- Symptoms: Recall > 0.8, Precision < 0.6
- Problem: Retrieval is finding everything but including too much noise
- Solution: Improve query processing, add reranking, use HyDE
Pattern 3: High Faithfulness, Low Relevancy
- Symptoms: Faithfulness > 0.9, Relevancy < 0.6
- Problem: The LLM is correctly using the context but not answering the question
- Solution: Improve the answer-generation prompt, tune LLM temperature
Pattern 4: Low Faithfulness, High Relevancy
- Symptoms: Faithfulness < 0.7, Relevancy > 0.8
- Problem: The LLM is answering the right question but hallucinating facts
- Solution: Strengthen prompt instructions, use a more reliable LLM
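This pattern matching can be automated. An illustrative sketch that maps a dictionary of RAGAS scores onto the four patterns above, with cut-offs taken from the symptom descriptions:

```python
def diagnose(scores: dict[str, float]) -> list[str]:
    findings = []
    if scores["context_precision"] > 0.8 and scores["context_recall"] < 0.6:
        findings.append("High precision, low recall: increase k or try hybrid search.")
    if scores["context_recall"] > 0.8 and scores["context_precision"] < 0.6:
        findings.append("High recall, low precision: improve query processing, add reranking, or use HyDE.")
    if scores["faithfulness"] > 0.9 and scores["answer_relevancy"] < 0.6:
        findings.append("Faithful but off-target: improve the generation prompt or tune temperature.")
    if scores["faithfulness"] < 0.7 and scores["answer_relevancy"] > 0.8:
        findings.append("Relevant but hallucinating: strengthen grounding instructions or use a stronger LLM.")
    return findings

print(diagnose({"faithfulness": 0.95, "answer_relevancy": 0.9,
                "context_precision": 0.85, "context_recall": 0.55}))
```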
Automated Evaluation Benefits
Speed
Evaluate 10 questions across 6 RAG architectures in minutes instead of hours of manual review.
Consistency
LLM-based evaluation applies consistent criteria across all questions, eliminating human rater variability.
Scalability
Add more questions, architectures, or models without linearly scaling human effort.
Reproducibility
Results can be reproduced by anyone with the same dataset and evaluation code.
Cost-Effectiveness
LLM-based evaluation costs cents per question vs. dollars for human annotation.
Limitations and Considerations
While RAGAS is powerful, it’s important to understand its limitations.
Complement with Human Review
For production medical systems, RAGAS should complement, not replace, expert human review, especially for:
- Clinical accuracy validation
- Safety-critical applications
- Regulatory compliance
- Edge cases and rare scenarios
Best Practices
- Use RAGAS for rapid iteration during development
- Validate findings with human expert review on a sample
- Monitor trends rather than absolute scores
- Compare architectures relative to each other
- Track improvements over time as you tune the system
Next Steps
Run Evaluations
Learn how to evaluate your RAG architectures with RAGAS
Analyze Results
Interpret evaluation results and identify improvements
Data Pipeline
Understand how documents are processed for RAG
RAG Architectures
Review the different retrieval strategies being evaluated
