Get your first RAG evaluation running in minutes. This guide walks you through installation, setup, and running your first benchmark.
Prerequisites
Before you begin, ensure you have:

- Python 3.8 or higher installed on your system
- An OpenAI API key for embeddings and LLM access
- Git for cloning the repository
- 10-15 minutes for the initial setup
The initial embedding generation takes 2-3 minutes. Subsequent evaluations run much faster (30 seconds to 2 minutes, depending on the RAG architecture).
Quick Setup
Install Dependencies
Install all required Python packages. This installs LangChain, ChromaDB, RAGAS, and all other necessary dependencies.
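A minimal sketch of the install step, assuming the repository provides a standard `requirements.txt` at its root (this page does not show the exact command):

```bash
# From the project root, install all Python dependencies
# (requirements.txt is an assumption; check the repository for the actual file)
pip install -r requirements.txt
```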
Configure API Key
Create a `.env` file in the project root with your OpenAI API key, then replace `your_api_key_here` with your actual OpenAI API key from platform.openai.com.
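A sketch of the expected `.env` contents; the `OPENAI_API_KEY` variable name matches the troubleshooting note at the bottom of this page:

```bash
# .env (project root); no quotes around the key value
OPENAI_API_KEY=your_api_key_here
```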
Generate Embeddings

Create vector embeddings from the medical text corpus. This step loads text chunks from `data/chunks/chunks_final.json`, generates embeddings using OpenAI's `text-embedding-3-small` model, and stores them in ChromaDB at `data/embeddings/chroma_db/`.
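A sketch of the embedding command. The script name `scripts/create_embeddings.py` is an assumption (the exact command is not shown on this page); check the repository's scripts directory for the real entry point:

```bash
# Hypothetical script name; verify against the repository
python scripts/create_embeddings.py
```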
Run Your First Evaluation

Execute a RAG evaluation using the Hybrid architecture; the command is shown after the list below. This will:
- Load 10 medical questions from the evaluation dataset
- Run the Hybrid RAG (BM25 + Semantic) on each question
- Generate answers using GPT-4o
- Evaluate with RAGAS metrics
- Display results and save them to `results/`
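The command, as listed in the Available Commands reference below:

```bash
# Evaluate the Hybrid (BM25 + Semantic) RAG on the evaluation dataset
python scripts/run_evaluation.py hybrid
```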
Try Different RAG Architectures
Now that you have the system running, try evaluating different RAG strategies; two examples are shown below, and the full list appears under Available Commands.
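For example (both commands appear in the Available Commands reference below):

```bash
# Simple Semantic RAG
python scripts/run_evaluation.py simple

# HyDE RAG
python scripts/run_evaluation.py hyde
```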
Understanding the Output

Each evaluation provides four key metrics:

| Metric | Range | What It Measures |
|---|---|---|
| Faithfulness | 0.0 - 1.0 | How well the answer is grounded in retrieved context (lower = more hallucination) |
| Answer Relevancy | 0.0 - 1.0 | How directly the answer addresses the question |
| Context Precision | 0.0 - 1.0 | Proportion of retrieved context that is relevant |
| Context Recall | 0.0 - 1.0 | Completeness of retrieval (did we get all relevant info?) |
Next Steps
Compare Architectures
Learn about the 6 different RAG strategies and when to use each
Run Benchmarks
Compare all RAG architectures across multiple models
Extend the Research
Add your own RAG architectures or models
Available Commands
Here's a quick reference of evaluation commands:

| Command | Description |
|---|---|
| `python scripts/run_evaluation.py simple` | Evaluate Simple Semantic RAG |
| `python scripts/run_evaluation.py hybrid` | Evaluate Hybrid RAG (BM25 + Semantic) |
| `python scripts/run_evaluation.py hyde` | Evaluate HyDE RAG |
| `python scripts/run_evaluation.py rewriter` | Evaluate Query Rewriter RAG |
| `python scripts/run_evaluation.py multi-model hybrid` | Run Hybrid RAG across all available models |
| `python scripts/run_evaluation.py all-models-all-rags` | Comprehensive benchmark (all RAGs × all models) |
Troubleshooting
OpenAI API Key Error
If you see `OPENAI_API_KEY not found`, ensure:

- Your `.env` file exists in the project root
- The key is formatted as `OPENAI_API_KEY=sk-...`
- There are no quotes around the key value
ChromaDB Not Found
If embeddings aren't found, run the embedding creation step again.
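For example, using the same hypothetical script name as in Quick Setup (verify the actual name in the repository):

```bash
# Regenerate the ChromaDB vector store (hypothetical script name)
python scripts/create_embeddings.py
```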
Import Errors
If you get import errors, reinstall the dependencies.
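For example, assuming a `requirements.txt` at the project root:

```bash
# Reinstall all dependencies
pip install -r requirements.txt
```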
What You’ve Accomplished
You've successfully:

- ✅ Installed the Obstetrics RAG Benchmark
- ✅ Generated vector embeddings for medical text
- ✅ Run your first RAG evaluation
- ✅ Viewed RAGAS metrics for retrieval and generation quality
