## Overview
The `run_evaluation.py` script is the official CLI entrypoint for running RAGAS (Retrieval Augmented Generation Assessment) evaluations on RAG systems. It supports evaluating single RAG systems, comparing multiple RAG architectures, testing different LLM models, and generating comprehensive evaluation reports.
## Location
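Presumably at the repository root (the wrapper sets up `sys.path` so that `src/` imports resolve; see Script Architecture below):

```
run_evaluation.py
```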
## Usage
### Basic Command Structure
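A plausible general form, inferred from the commands and options documented below (the exact argument names are assumptions):

```bash
python run_evaluation.py <command> [rag_type ...] [--export] [--debug]
```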
### Available Commands
### Single RAG Evaluation
Evaluate a specific RAG system.
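A hypothetical invocation (the `single` subcommand name is an assumption; RAG types are listed under RAG Types below):

```bash
python run_evaluation.py single simple
```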
### Multiple RAG Evaluation
Evaluate multiple RAG systems in sequence.
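A hypothetical invocation (subcommand name assumed):

```bash
python run_evaluation.py multiple simple hybrid hyde
```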
### Multi-Model Evaluation
Evaluate a single RAG type across multiple LLM models.
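A hypothetical invocation (subcommand name assumed):

```bash
python run_evaluation.py models hybrid-rrf
```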
### Comprehensive Evaluation
Evaluate ALL RAG types with ALL models.
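A hypothetical invocation (subcommand name assumed):

```bash
python run_evaluation.py all
```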
## Options
| Option | Short | Description |
|---|---|---|
| `--export` | `-e` | Export detailed analysis to CSV/Excel |
| `--debug` | `-d` | Enable debug output for troubleshooting |
## Examples
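Some hypothetical end-to-end invocations combining the commands and options above (subcommand names are assumptions; the flags come from the options table):

```bash
# Evaluate one RAG system and export the detailed analysis
python run_evaluation.py single pageindex --export

# Compare one RAG type across all registered models, with debug output
python run_evaluation.py models hybrid --debug

# Run everything: all RAG types x all models
python run_evaluation.py all --export
```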
## RAG Types
The following RAG architectures are supported:

| Type | Description |
|---|---|
| `simple` | Simple Semantic RAG using vector similarity |
| `hybrid` | Hybrid RAG combining BM25 and semantic search |
| `hybrid-rrf` | Hybrid RAG with Reciprocal Rank Fusion |
| `hyde` | HyDE RAG using hypothetical document embeddings |
| `rewriter` | Multi-Query Rewriter RAG |
| `pageindex` | PageIndex RAG with page-level context |
## Evaluation Metrics
The script evaluates RAG systems using the following RAGAS metrics:

- Faithfulness: Measures how faithful the answer is to the retrieved context
- Answer Relevancy: Measures how relevant the answer is to the question
- Context Precision: Measures the precision of the retrieved contexts
- Context Recall: Measures how much of the necessary information from the ground truth is present in the retrieved contexts
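For reference, this is roughly what an evaluation over these four metrics looks like with the ragas 0.1-style API (a minimal sketch; the project's evaluator wraps the equivalent logic, and the sample row here is purely illustrative):

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,
    context_precision,
    context_recall,
    faithfulness,
)

# One illustrative row; the real script uses 10 obstetric questions
data = {
    "question": ["What daily dose of folic acid is recommended in early pregnancy?"],
    "answer": ["400 micrograms daily, ideally starting before conception."],
    "contexts": [["Guidelines recommend 400 mcg of folic acid daily before conception and through the first trimester."]],
    "ground_truth": ["400 mcg of folic acid daily."],
}

result = evaluate(
    Dataset.from_dict(data),
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(result)  # aggregated score per metric
```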
## What the Script Does
Depending on the command, the script performs a different evaluation workflow:

### Single RAG Evaluation
- Initialize Evaluator: Creates a `RAGASEvaluator` instance for the specified RAG type
- Load Test Dataset: Loads the obstetric test queries (10 questions with ground truth)
- Process Queries: Runs each query through the RAG system
- Generate Answers: Collects answers, contexts, and performance metadata
- Run RAGAS Metrics: Evaluates using faithfulness, relevancy, precision, and recall
- Display Results: Shows aggregated scores and performance analysis
- Save Results: Exports a JSON report to the `results/` directory
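Sketched in code, that workflow looks roughly like this (the `RAGASEvaluator` class name and the `results/` directory come from this page; the import path and method names are assumptions):

```python
from src.evaluation.ragas_evaluator import RAGASEvaluator  # import path assumed

evaluator = RAGASEvaluator(rag_type="simple")        # 1. initialize evaluator
dataset = evaluator.load_test_dataset()              # 2. 10 questions + ground truth
answers = evaluator.process_queries(dataset)         # 3-4. run RAG, collect answers/contexts
scores = evaluator.run_ragas_metrics(answers)        # 5. faithfulness, relevancy, precision, recall
evaluator.display_results(scores)                    # 6. aggregated scores + performance
evaluator.save_results(scores, out_dir="results/")   # 7. JSON report
```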
### Multi-Model Evaluation
- Select RAG Type: Initializes the evaluator for the specified RAG type
- Iterate Models: Tests each model from the model registry
- Run Evaluations: Executes full evaluation for each model
- Collect Results: Aggregates metrics and performance data
- Generate Report: Creates consolidated multi-model comparison JSON
### Comprehensive Evaluation
- Iterate RAG Types: Loops through all 6 RAG architectures
- Iterate Models: Tests each RAG with all available models
- Track Progress: Displays progress indicators and success/failure status
- Build Report: Creates comprehensive comparison matrix
- Save Results: Exports complete evaluation to timestamped JSON file
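The comprehensive workflow is essentially two nested loops. A minimal sketch (the `RAG_TYPES` list matches the table above and `MODELS_REGISTRY` is named under Troubleshooting; the import path and the `run_single_evaluation` helper are hypothetical):

```python
from src.common.model_provider import MODELS_REGISTRY  # import path assumed

RAG_TYPES = ["simple", "hybrid", "hybrid-rrf", "hyde", "rewriter", "pageindex"]

matrix = {}
for rag_type in RAG_TYPES:
    for model_name in MODELS_REGISTRY:
        try:
            # run_single_evaluation is a stand-in for the real per-combination run
            matrix[(rag_type, model_name)] = run_single_evaluation(rag_type, model_name)
            print(f"[ok]     {rag_type} x {model_name}")
        except Exception as exc:
            print(f"[failed] {rag_type} x {model_name}: {exc}")
```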
## Output Files
The script generates JSON files in the `results/` directory:
### Single RAG Evaluation
- Metadata (timestamp, RAG type, dataset size)
- Aggregated metrics (overall scores)
- Question-by-question results
- Performance statistics (tokens, cost, execution time)
### Multi-Model Evaluation
- Metadata (models evaluated, RAG type)
- Summary (per-model metrics and performance)
- Question-by-question comparison across models
### Comprehensive Evaluation
- Complete matrix of all RAGs × all models
- Comparative metrics and performance data
- Question-level details for every combination
## Example Output
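An illustrative skeleton of a single-RAG report (key names are inferred from the report contents described above; all values are placeholders, not real output):

```json
{
  "metadata": { "timestamp": "...", "rag_type": "simple", "dataset_size": 10 },
  "aggregated_metrics": {
    "faithfulness": 0.0,
    "answer_relevancy": 0.0,
    "context_precision": 0.0,
    "context_recall": 0.0
  },
  "questions": [ { "question": "...", "answer": "...", "scores": { } } ],
  "performance": { "total_tokens": 0, "total_cost_usd": 0.0, "total_time_seconds": 0.0 }
}
```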
## Configuration
The script uses configuration from:

- Model Registry: `src/common/model_provider.py` defines the available LLM models
- Pricing Config: `src/common/pricing.py` provides cost calculations
- Test Dataset: Embedded in `ragas_evaluator.py` (10 obstetric questions)
## Environment Requirements
### API Keys
Required environment variables in `.env`:
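Only `OPENAI_API_KEY` is confirmed elsewhere on this page; the other keys shown here are assumptions based on the Claude and Gemini model support listed below:

```bash
# .env
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=...   # assumed, for Claude models
GOOGLE_API_KEY=...      # assumed, for Gemini models
```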
### Dependencies
- `ragas`: For evaluation metrics
- `datasets`: For dataset handling
- `langchain`: For RAG system integration
- `pandas`: For result processing
- `numpy`: For numerical operations
## Implementation Details
### Script Architecture
The `run_evaluation.py` script is a thin CLI wrapper that:

- Imports the `main()` function from `src/evaluation/ragas_evaluator.py`
- Sets up the Python path to ensure imports work correctly
- Delegates all functionality to the evaluator module
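A minimal sketch of what such a wrapper looks like (the import target and path setup follow the description above; the exact code may differ):

```python
#!/usr/bin/env python3
"""Thin CLI wrapper around the RAGAS evaluator."""
import sys
from pathlib import Path

# Make the project root importable so `src.*` packages resolve
sys.path.insert(0, str(Path(__file__).resolve().parent))

from src.evaluation.ragas_evaluator import main

if __name__ == "__main__":
    main()
```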
### Execution Flow
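At a high level (reconstructed from the sections above):

```text
run_evaluation.py (CLI)
  └─ src/evaluation/ragas_evaluator.py : main()
       ├─ parse command, RAG type(s), model(s), options
       ├─ build the evaluator(s) and load the test dataset
       ├─ run RAGAS metrics (async, with synchronous fallback)
       └─ write JSON report(s) to results/
```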
## Advanced Features
### Synchronous Fallback
The evaluator includes a synchronous fallback mode for environments where async metric evaluation fails (e.g., Python 3.14+ async context issues).

### Performance Tracking
Tracks and reports:

- Execution time per query
- Input/output token counts
- Cost per query and total cost
- Token usage statistics
### Custom Model Support
Supports custom LLM models through the model registry:

- GPT-4o, GPT-4o-mini
- Claude models
- Google Gemini models
- Any LangChain-compatible model
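A hypothetical registry entry (the `MODELS_REGISTRY` name appears under Troubleshooting; the actual schema in `src/common/model_provider.py` may differ):

```python
from langchain_openai import ChatOpenAI

# Maps a model name to a factory for a LangChain-compatible chat model
MODELS_REGISTRY = {
    "gpt-4o": lambda: ChatOpenAI(model="gpt-4o", temperature=0),
    "gpt-4o-mini": lambda: ChatOpenAI(model="gpt-4o-mini", temperature=0),
    # Claude, Gemini, or any other LangChain chat model can be added the same way
}
```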
## Troubleshooting
### Common Issues
“Model not found in registry”

- Ensure the model is defined in `MODELS_REGISTRY`
- Check the spelling of the model name

API key errors

- Verify `OPENAI_API_KEY` is set in `.env`
- Check that the API key has the necessary permissions

Async metric evaluation fails

- The script automatically falls back to synchronous evaluation
- Use the `--debug` flag to see detailed error information
### Debug Mode
Enable debug mode for verbose output, which includes:

- Detailed error traces
- Sample query processing output
- Metric calculation details
- Performance profiling information
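To enable it, pass the `--debug` flag. A hypothetical invocation (the subcommand name is an assumption; the flag comes from the options table):

```bash
python run_evaluation.py single simple --debug
```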
## Related
- RAGAS Metrics: Understanding evaluation metrics
- RAG Systems: Available RAG architectures
- Model Configuration: Configuring LLM models
