RAGAS Metrics
This project uses RAGAS (Retrieval-Augmented Generation Assessment) as the evaluation framework. RAGAS provides automated, LLM-based metrics that assess both retrieval quality and generation quality without requiring manual annotations.
Core Metrics
The evaluation system tracks four fundamental RAGAS metrics:
Faithfulness
What it measures: How much of the generated answer is grounded in the retrieved context.
Purpose: Reduces hallucination by ensuring the LLM only uses information from the retrieved documents.
Score range: 0.0 to 1.0 (higher is better)
Interpretation (a calculation sketch follows this scale):
- 0.8 - 1.0: Excellent - Answer is highly faithful to the context
- 0.6 - 0.8: Good - Most claims are supported by the context
- 0.4 - 0.6: Fair - Some hallucination present
- < 0.4: Poor - Significant hallucination or unsupported claims
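Conceptually, the score is the share of claims in the answer that the retrieved context supports. A minimal sketch of that ratio, with the claim verdicts (normally produced by RAGAS's LLM judge) supplied as hypothetical inputs:

```python
def faithfulness_score(claim_verdicts: list[bool]) -> float:
    """Faithfulness = supported claims / total claims extracted from the answer.

    Each entry in `claim_verdicts` is True if that claim can be inferred
    from the retrieved context. RAGAS derives these verdicts with an LLM;
    here they are supplied directly for illustration.
    """
    if not claim_verdicts:
        return 0.0
    return sum(claim_verdicts) / len(claim_verdicts)

# 4 of 5 claims supported by the context -> 0.8, the "Excellent" band
print(faithfulness_score([True, True, True, True, False]))
```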
Answer Relevancy
What it measures: Whether the generated answer directly addresses the input question.
Purpose: Ensures the response is on-topic and answers what the user actually asked.
Score range: 0.0 to 1.0 (higher is better)
Interpretation (a calculation sketch follows this scale):
- 0.8 - 1.0: Excellent - Answer directly addresses the question
- 0.6 - 0.8: Good - Answer is mostly relevant
- 0.4 - 0.6: Fair - Answer is partially relevant
- < 0.4: Poor - Answer doesn’t address the question
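RAGAS scores this by generating candidate questions from the answer and averaging the cosine similarity between their embeddings and the original question's embedding. A simplified sketch, with toy vectors standing in for real embedding-model output:

```python
import numpy as np

def answer_relevancy_score(question_vec: np.ndarray, generated_q_vecs: list[np.ndarray]) -> float:
    """Mean cosine similarity between the embedded original question and
    questions an LLM generated from the answer. The vectors here are
    placeholders for real embedding-model output."""
    def cosine(a: np.ndarray, b: np.ndarray) -> float:
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return float(np.mean([cosine(question_vec, g) for g in generated_q_vecs]))

# An off-topic answer yields generated questions far from the original
# question, dragging the average similarity (and the score) down.
q = np.array([1.0, 0.0, 0.0])
gen = [np.array([0.9, 0.1, 0.0]), np.array([0.2, 0.9, 0.1])]
print(round(answer_relevancy_score(q, gen), 3))
```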
Context Precision
What it measures: The proportion of retrieved context that is relevant to the question.
Purpose: Evaluates retrieval quality by measuring how much of the retrieved information is actually useful.
Score range: 0.0 to 1.0 (higher is better)
Interpretation (a calculation sketch follows this scale):
- 0.8 - 1.0: Excellent - Very high precision, minimal irrelevant context
- 0.6 - 0.8: Good - Most retrieved chunks are relevant
- 0.4 - 0.6: Fair - Significant irrelevant context retrieved
- < 0.4: Poor - Retrieval is pulling mostly irrelevant information
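RAGAS computes this in a rank-aware way: it averages Precision@k over the positions of relevant chunks, so placing relevant chunks near the top of the ranking scores higher. A sketch with per-chunk relevance verdicts (judged by an LLM inside RAGAS) supplied as hypothetical inputs:

```python
def context_precision_score(relevance: list[bool]) -> float:
    """Mean Precision@k over the positions of relevant chunks, in rank order.

    `relevance[k]` is True if the k-th retrieved chunk is relevant to the
    question. RAGAS judges this with an LLM; supplied directly here.
    """
    total_relevant = sum(relevance)
    if total_relevant == 0:
        return 0.0
    score, hits = 0.0, 0
    for k, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            score += hits / k  # Precision@k at each relevant position
    return score / total_relevant

# Relevant chunks at ranks 1 and 3 of 4 retrieved: (1/1 + 2/3) / 2 ~= 0.833
print(round(context_precision_score([True, False, True, False]), 3))
```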
Context Recall
What it measures: The completeness of retrieved relevant information from the knowledge base.
Purpose: Ensures the retrieval system finds all the necessary information to answer the question.
Score range: 0.0 to 1.0 (higher is better)
Interpretation (a calculation sketch follows this scale):
- 0.8 - 1.0: Excellent - Retrieved all necessary information
- 0.6 - 0.8: Good - Retrieved most necessary information
- 0.4 - 0.6: Fair - Missing some important context
- < 0.4: Poor - Missing critical information
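RAGAS estimates recall by checking, sentence by sentence, whether the ground-truth answer can be attributed to the retrieved context. A minimal sketch of the resulting ratio, with attribution verdicts supplied as hypothetical inputs:

```python
def context_recall_score(attributed: list[bool]) -> float:
    """Context recall = ground-truth sentences attributable to the retrieved
    context / total ground-truth sentences. RAGAS obtains the per-sentence
    verdicts with an LLM; they are passed in directly here."""
    if not attributed:
        return 0.0
    return sum(attributed) / len(attributed)

# 3 of 5 ground-truth sentences are covered by the retrieved chunks -> 0.6
print(context_recall_score([True, True, True, False, False]))
```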
Metric Calculation
RAGAS metrics are calculated using the following data points from your RAG system:
- question: the user's input question
- answer: the response generated by the LLM
- contexts: the chunks retrieved from the knowledge base
- ground_truth: the reference answer against which retrieval completeness is judged
Evaluation Process
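A minimal end-to-end sketch using the RAGAS Python API. The column names follow the ragas 0.1.x convention (newer releases rename some of them, e.g. user_input/response, so check your installed version), RAGAS needs an LLM and embedding model configured (OpenAI via OPENAI_API_KEY by default), and the obstetrics sample below is a placeholder:

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,
    context_precision,
    context_recall,
    faithfulness,
)

# One placeholder sample; in practice, load your benchmark questions here.
data = {
    "question": ["What is the loading dose of magnesium sulfate for eclampsia?"],
    "answer": ["A 4 g IV loading dose, followed by a 1 g/h maintenance infusion."],
    "contexts": [[
        "Magnesium sulfate: 4 g IV loading dose over 5-15 minutes, then a "
        "1 g/h maintenance infusion for 24 hours."
    ]],
    "ground_truth": ["4 g IV loading dose, then 1 g/h maintenance for 24 hours."],
}

result = evaluate(
    Dataset.from_dict(data),
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(result)  # per-metric scores; result.to_pandas() gives a row per sample
```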
Overall Performance Assessment
The system calculates an overall average score across all four metrics (a calculation sketch follows this scale):
- 0.8 - 1.0: Excellent - Production-ready RAG system
- 0.6 - 0.8: Good - Minor improvements may help
- 0.4 - 0.6: Needs improvement - Review retrieval and generation
- < 0.4: Significant improvements needed - Major issues present
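As a sketch, mapping the average onto these bands (the function and threshold layout are illustrative, not the project's actual code):

```python
def overall_rating(scores: dict[str, float]) -> tuple[float, str]:
    """Average the four metric scores and map the result onto the bands above."""
    avg = sum(scores.values()) / len(scores)
    if avg >= 0.8:
        band = "Excellent - Production-ready RAG system"
    elif avg >= 0.6:
        band = "Good - Minor improvements may help"
    elif avg >= 0.4:
        band = "Needs improvement - Review retrieval and generation"
    else:
        band = "Significant improvements needed - Major issues present"
    return avg, band

# Scores from the example results in the next subsection:
scores = {
    "faithfulness": 0.850,
    "answer_relevancy": 0.265,
    "context_precision": 0.779,
    "context_recall": 0.600,
}
print(overall_rating(scores))  # (0.6235, 'Good - Minor improvements may help')
```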
Example Metric Results
From actual evaluation results (results/ragas_evaluation_simple_20260311_093843.json):
- Faithfulness (0.850): Excellent - answers are well-grounded in context
- Answer Relevancy (0.265): Poor - answers may not directly address questions
- Context Precision (0.779): Good - retrieval is mostly accurate
- Context Recall (0.600): Fair - some relevant information may be missing
- Overall (0.623): Good - system performs adequately but has room for improvement
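To re-derive these numbers from the results file, a hedged sketch follows; the flat metric-name-to-score JSON schema is an assumption, so check the actual file and adjust the key access accordingly:

```python
import json

# Hypothetical loader for the results file cited above; the schema
# (a flat metric-name -> score mapping) is an assumption.
with open("results/ragas_evaluation_simple_20260311_093843.json") as f:
    results = json.load(f)

for metric in ("faithfulness", "answer_relevancy", "context_precision", "context_recall"):
    score = results.get(metric)
    if score is not None:
        print(f"{metric}: {score:.3f}")
```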
Best Practices
Next Steps
- Running Evaluations: Learn how to run RAGAS evaluations on your RAG system
- Interpreting Results: Understand how to analyze evaluation results
