Interpreting Results
This guide helps you understand the JSON output from RAGAS evaluations and extract actionable insights.
Result File Structure
All evaluation results are saved as JSON files in the `results/` directory with this structure:
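The exact contents depend on the evaluation type, but the top-level layout looks roughly like the sketch below. The key names are inferred from the sections described on this page (the per-question list in particular may be named differently) and may not match your files exactly:

```json
{
  "metadata": { "...": "information about the evaluation run" },
  "pricing": { "...": "cost configuration per model" },
  "summary": { "...": "aggregated metrics and performance statistics" },
  "results": [ { "...": "one entry per evaluated question" } ]
}
```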
Metadata Section
Provides context about the evaluation run:
- `timestamp` - When the evaluation was run (ISO 8601 format)
- `evaluation_type` - Type of evaluation (single RAG, multi-model, comprehensive)
- `dataset_size` - Number of questions evaluated
- `rags_evaluated` - List of RAG architectures tested
- `model_used` - LLM model used for generation (single RAG only)
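For example, the metadata for a single-RAG run might look like this (all values are illustrative):

```json
{
  "timestamp": "2024-06-01T14:32:10Z",
  "evaluation_type": "single_rag",
  "dataset_size": 25,
  "rags_evaluated": ["simple"],
  "model_used": "gpt-4o-mini"
}
```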
Pricing Configuration
Tracks the cost configuration for different models:
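A sketch of what this section might contain; the model name and per-token rates below are placeholders, not the project's actual values:

```json
{
  "gpt-4o-mini": {
    "input_per_1k_tokens": 0.00015,
    "output_per_1k_tokens": 0.0006
  }
}
```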
Pricing data is automatically captured from `src/common/pricing.py` and helps you understand the cost implications of different model choices.
Summary Section
Contains aggregated metrics and performance statistics:
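A sketch of what the summary might hold, using the metric values discussed below; the key spellings and the performance fields are assumptions, so check your own files:

```json
{
  "faithfulness": 0.850,
  "answer_relevancy": 0.265,
  "context_precision": 0.779,
  "context_recall": 0.600,
  "total_time_seconds": 8.2,
  "total_cost_usd": 0.021
}
```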
Metrics Analysis
- Faithfulness: 0.850 (Excellent) - Answers are well-grounded in retrieved context with minimal hallucination
- Answer Relevancy: 0.265 (Poor) - Answers may not directly address the questions asked
- Context Precision: 0.779 (Good) - Retrieval is mostly pulling relevant information
- Context Recall: 0.600 (Fair) - Some relevant information may be missing from retrieval
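If you want to attach these qualitative labels to scores programmatically, a small helper along the following lines works. The cut-off values are illustrative assumptions, not thresholds defined by the project:

```python
def interpret_score(score: float) -> str:
    """Map a 0-1 RAGAS metric score to a rough qualitative label.

    The cut-offs are illustrative assumptions, not project-defined thresholds.
    """
    if score >= 0.85:
        return "Excellent"
    if score >= 0.70:
        return "Good"
    if score >= 0.50:
        return "Fair"
    return "Poor"


# Example: labels for the scores discussed above
for name, score in [
    ("faithfulness", 0.850),
    ("answer_relevancy", 0.265),
    ("context_precision", 0.779),
    ("context_recall", 0.600),
]:
    print(f"{name}: {score:.3f} -> {interpret_score(score)}")
```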
Performance Analysis
Question-by-Question Breakdown
The most detailed section showing individual question performance:
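Each entry pairs the question with the generated answer, the retrieved contexts, and per-question metric scores. The sketch below uses the same values analyzed in the Example Analysis further down; the field names (apart from `contexts_count`, which appears later on this page) are assumptions:

```json
{
  "question": "<question text>",
  "answer": "<generated answer>",
  "contexts": ["<retrieved chunk>"],
  "contexts_count": 1,
  "faithfulness": 0.9,
  "answer_relevancy": 0.0,
  "context_precision": 0.75,
  "context_recall": 0.0
}
```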
Analyzing Individual Questions
Examine the Answer
Read the generated answer to understand what went wrong:
- Is it hallucinating?
- Is it off-topic?
- Is it incomplete?
Check Retrieved Context
The `contexts_count` field shows how many chunks were retrieved:
- Too few contexts (1-2): May be missing information → Low context recall
- Too many contexts (5+): May include noise → Low context precision
Example Analysis: Question #1
- The system retrieved relevant context (precision 0.75)
- But it didn’t retrieve all necessary context (recall 0.0)
- The answer is faithful to what was retrieved (faithfulness 0.9)
- However, the answer doesn’t actually answer the question (relevancy 0.0)
- Only 1 context chunk was retrieved (`contexts_count: 1`)
- The retrieved chunk didn’t contain the specific answer
- The LLM couldn’t answer based on incomplete information

Recommendations:
- Increase the retrieval count (k) from the default to a higher value
- Improve embedding quality for this type of question
- Consider using a more advanced RAG architecture (hybrid, rewriter)
Comparing Across RAG Architectures
For multi-RAG evaluations, compare performance side-by-side:
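One straightforward way to do this is to load each architecture's summary and print the metrics next to each other. A rough sketch, where the file names and key spellings are assumptions to adapt to your output:

```python
import json

# Hypothetical result files; adjust paths and keys to match your actual output.
files = {
    "simple": "results/simple_rag_results.json",
    "hybrid": "results/hybrid_rag_results.json",
}
metrics = ["faithfulness", "answer_relevancy", "context_precision", "context_recall"]

summaries = {}
for name, path in files.items():
    with open(path) as f:
        summaries[name] = json.load(f)["summary"]

# Print a simple side-by-side comparison.
print(f"{'metric':<20}" + "".join(f"{name:>10}" for name in summaries))
for metric in metrics:
    row = "".join(f"{summaries[name].get(metric, float('nan')):>10.3f}" for name in summaries)
    print(f"{metric:<20}" + row)
```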
Analysis
Hybrid RAG significantly outperforms Simple RAG:
- Better retrieval (precision 0.88 vs 0.75, recall 0.90 vs 0.0)
- More relevant answers (relevancy 0.85 vs 0.0)
- Maintains faithfulness (0.95 vs 0.9)
Best Performers Analysis
For comprehensive evaluations, the results include a `best_performers` section:
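A sketch of what that section might contain, using the winners called out in the trade-offs below; the field names are assumptions:

```json
{
  "best_performers": {
    "best_quality": ["hybrid_rrf", "rewriter"],
    "fastest": { "rag": "simple_semantic", "time_seconds": 8.2 },
    "cheapest": { "rag": "simple_semantic", "cost_usd": 0.021 }
  }
}
```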
Trade-offs Analysis
Quality
Best: Hybrid-RRF & Rewriter
- Highest accuracy metrics
- Best retrieval quality
- Most relevant answers
Performance
Best: Simple Semantic
- Fastest execution (8.2s)
- Lowest cost ($0.021)
- Simplest architecture
Actionable Insights
Low Faithfulness
Symptoms
- Generated answers contain information not in the retrieved context
- LLM is hallucinating or using prior knowledge
Solutions
- Improve prompts: Add explicit instructions to only use retrieved context
- Use better models: Some models are better at staying grounded
- Add citations: Require the LLM to cite which chunks support each claim
- Post-processing: Filter out unsupported claims
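To illustrate the first and third suggestions, a generation prompt can spell the grounding rules out explicitly. This is a generic sketch, not the prompt used by this project:

```python
# Illustrative grounding prompt; not the prompt shipped with this project.
GROUNDED_ANSWER_PROMPT = """You are answering questions about obstetrics.

Rules:
- Use ONLY the information in the context below; do not rely on prior knowledge.
- After each claim, cite the supporting chunk number, e.g. [1].
- If the context does not contain the answer, say so instead of guessing.

Context:
{context}

Question: {question}

Answer:"""
```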
Low Answer Relevancy
Symptoms
- Answers don’t directly address the question
- Responses are too general or off-topic
Solutions
- Refine prompts: Make answer format more specific
- Better retrieval: Ensure retrieved context is relevant to the question
- Query preprocessing: Rephrase questions for better retrieval
- Use query-focused RAG: Try Rewriter or HyDE architectures
Low Context Precision
Symptoms
- Retrieved chunks contain irrelevant information
- Too much noise in the context
Solutions
- Increase similarity threshold: Only retrieve highly relevant chunks
- Add reranking: Use a reranker to filter retrieved chunks
- Better embeddings: Use domain-specific embedding models
- Try hybrid search: Combine BM25 + semantic search
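As an example of the last suggestion, rankings from BM25 and semantic search can be merged with reciprocal rank fusion (RRF). This minimal sketch assumes you already have two ranked lists of chunk IDs from separate BM25 and vector searches:

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of chunk IDs into one ranking using RRF.

    Each chunk scores 1 / (k + rank) per list; higher fused scores rank first.
    """
    scores: defaultdict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical ranked results from a BM25 search and a vector search.
bm25_hits = ["chunk_12", "chunk_03", "chunk_45"]
semantic_hits = ["chunk_03", "chunk_27", "chunk_12"]
print(reciprocal_rank_fusion([bm25_hits, semantic_hits]))
# chunk_03 and chunk_12 rank highest because both retrievers agree on them.
```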
Low Context Recall
Symptoms
- Important information is missing from retrieved context
- Answers are incomplete
Solutions
- Retrieve more chunks: Increase k (e.g., from 5 to 10)
- Better chunking: Ensure chunks contain complete information
- Multi-query retrieval: Try Rewriter RAG for diverse retrieval
- Check embeddings: Ensure embedding model captures domain semantics
Visualization Examples
While the project saves results as JSON, you can create visualizations:
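For example, a quick bar chart of the summary metrics can be drawn with matplotlib; the file path and key names below are assumptions, so adapt them to your results:

```python
import json
import matplotlib.pyplot as plt

# Hypothetical result file; adjust the path and keys to your output.
with open("results/hybrid_rag_results.json") as f:
    summary = json.load(f)["summary"]

metrics = ["faithfulness", "answer_relevancy", "context_precision", "context_recall"]
scores = [summary[m] for m in metrics]

plt.figure(figsize=(8, 4))
plt.bar(metrics, scores, color="steelblue")
plt.ylim(0, 1)
plt.ylabel("Score")
plt.title("RAGAS metrics")
plt.tight_layout()
plt.savefig("results/metrics_bar_chart.png")
```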
Next Steps
RAGAS Metrics
Learn more about what each metric measures
Benchmarking
Best practices for comprehensive RAG benchmarking
