Documentation Index
Fetch the complete documentation index at: https://mintlify.com/JhonHander/obstetrics-rag-benchmark/llms.txt
Use this file to discover all available pages before exploring further.
What is Retrieval-Augmented Generation?
Retrieval-Augmented Generation (RAG) is a powerful technique that enhances Large Language Models (LLMs) by combining them with external knowledge retrieval. Instead of relying solely on a model’s pre-trained knowledge, RAG systems:
- Retrieve relevant information from a knowledge base
- Augment the model’s context with this information
- Generate accurate, grounded responses
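The three steps above can be sketched in a few lines of Python. This is a toy illustration, not the benchmark’s code: word overlap stands in for embedding-based retrieval, and `generate` is a placeholder for a real LLM call.

```python
# Toy RAG loop: retrieve -> augment -> generate.
# Word-overlap scoring stands in for a real vector-store lookup.
CORPUS = [
    "A programme of ten antenatal appointments is recommended for nulliparous women.",
    "Folic acid supplementation is advised before conception and in early pregnancy.",
    "Gestational diabetes screening is typically offered between 24 and 28 weeks.",
]

def retrieve(question: str, corpus: list[str], k: int = 2) -> list[str]:
    """Rank documents by word overlap with the question (toy retriever)."""
    q_words = set(question.lower().split())
    scored = sorted(corpus, key=lambda doc: -len(q_words & set(doc.lower().split())))
    return scored[:k]

def augment(question: str, docs: list[str]) -> str:
    """Build a grounded prompt: retrieved context followed by the question."""
    context = "\n".join(f"- {d}" for d in docs)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

def generate(prompt: str) -> str:
    """Placeholder for an LLM call (e.g. an OpenAI chat completion)."""
    return f"[LLM answer grounded in a prompt of {len(prompt)} chars]"

question = "How many antenatal appointments are recommended?"
prompt = augment(question, retrieve(question, CORPUS))
answer = generate(prompt)
```

In a real system, `retrieve` would query a vector store and `generate` would call the model; the retrieve/augment/generate shape stays the same.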
Why RAG for Medical Q&A?
Medical information requires high accuracy and up-to-date knowledge. RAG systems ground LLM responses in verified medical documentation, reducing hallucinations and ensuring answers are based on authoritative sources.
The Obstetrics RAG Benchmark
This project systematically evaluates different RAG architectures for answering questions about pregnancy, prenatal care, and childbirth. The benchmark uses real medical guidance documents to create a comprehensive test of RAG effectiveness in the healthcare domain.
Research Objectives
Architecture Comparison
Evaluate 6 distinct RAG retrieval strategies from simple semantic search to advanced query reformulation
Model Performance
Assess how different LLMs (GPT-4o default, plus GPT-5, MediPhi, MedGemma) perform with each RAG architecture
Retrieval Quality
Measure precision and recall of different retrieval strategies for medical content
Best Practices
Identify optimal configurations for medical question-answering systems
System Architecture
The Obstetrics RAG Benchmark implements a complete evaluation pipeline:
Key Components
Data Pipeline
Raw medical documents are processed, chunked, and embedded into a ChromaDB vector store. This creates a searchable knowledge base optimized for semantic retrieval.
Processing Steps:
- Document extraction from PDFs
- Text chunking with overlap
- Embedding generation (OpenAI text-embedding-3-small)
- Vector store indexing
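The chunking step can be sketched as a simple sliding window. The sizes below are illustrative, not the benchmark’s actual parameters; each resulting chunk would then be embedded (e.g. with text-embedding-3-small) and indexed in ChromaDB.

```python
# Fixed-size chunking with overlap: consecutive chunks share `overlap`
# characters so sentences split at a boundary still appear whole somewhere.
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into chunk_size-character windows, stepping by
    chunk_size - overlap so adjacent chunks share context."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]
```

For example, `chunk_text("abcdef", chunk_size=4, overlap=2)` yields `["abcd", "cdef"]`: the two middle characters appear in both chunks.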
RAG Architectures
Six different retrieval strategies are implemented, each representing a distinct approach to finding relevant medical information:
- Simple Semantic Search (baseline)
- Hybrid Search (BM25 + Semantic)
- Hybrid with RRF (Reciprocal Rank Fusion)
- HyDE (Hypothetical Document Embeddings)
- Query Rewriter (Multi-Query)
- PageIndex (External API)
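As one concrete example, Reciprocal Rank Fusion (the third strategy) merges the rankings produced by BM25 and semantic search. The sketch below uses `k = 60`, the constant from the original RRF paper; the benchmark’s own parameters may differ.

```python
# Reciprocal Rank Fusion: each document's fused score is the sum of
# 1 / (k + rank) over every ranking it appears in. Documents ranked
# highly by several retrievers rise to the top.
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse multiple ranked lists of document IDs into one ranking."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["doc3", "doc1", "doc4"]
semantic_ranking = ["doc1", "doc2", "doc3"]
fused = rrf_fuse([bm25_ranking, semantic_ranking])
# doc1 and doc3 appear in both lists, so they outrank doc2 and doc4.
```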
Language Models
The system supports evaluation across multiple language models:
Default models (used by RAG implementations):
- GPT-4o: Default high-performance model (temperature=0)
- GPT-3.5-turbo: Used for HyDE hypothetical document generation
- gpt-5, gpt-5.2: Next-generation OpenAI models
- microsoft/MediPhi-Instruct (mediphi): Medical-specialized HuggingFace model
- google/medgemma-1.5-4b-it (medgemma): Medical-focused compact model
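A hypothetical model registry, mirroring the list above, shows one way model selection could be wired up. The dictionary keys, field names, and `resolve_model` helper are assumptions for illustration, not the benchmark’s actual configuration.

```python
# Hypothetical registry mapping short model names (as used in the list
# above) to provider settings. Structure is illustrative only.
MODEL_REGISTRY = {
    "gpt-4o":        {"provider": "openai", "temperature": 0},
    "gpt-3.5-turbo": {"provider": "openai", "temperature": 0},
    "gpt-5":         {"provider": "openai", "temperature": 0},
    "mediphi":       {"provider": "huggingface", "repo": "microsoft/MediPhi-Instruct"},
    "medgemma":      {"provider": "huggingface", "repo": "google/medgemma-1.5-4b-it"},
}

def resolve_model(name: str) -> dict:
    """Look up a model's provider settings, failing loudly on unknown names."""
    if name not in MODEL_REGISTRY:
        raise KeyError(f"unknown model: {name}")
    return MODEL_REGISTRY[name]
```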
Evaluation Framework
RAGAS (Retrieval-Augmented Generation Assessment) provides automated, LLM-based evaluation metrics that assess both retrieval quality and answer quality without manual annotation.
How RAG Components Work Together
A typical RAG query flows through multiple stages:
1. Query Processing
The user’s question is analyzed and potentially transformed depending on the RAG architecture.
2. Document Retrieval
Relevant documents are retrieved from the vector store.
3. Context Construction
Retrieved documents are formatted as context for the LLM.
4. Answer Generation
The LLM generates a response grounded in the retrieved context.
5. Evaluation
The answer is evaluated against ground truth using RAGAS metrics:
- Faithfulness: Is the answer grounded in the context?
- Answer Relevancy: Does it address the question?
- Context Precision: Were relevant documents retrieved?
- Context Recall: Was all necessary information retrieved?
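A sketch of the record shape RAGAS scores against: the field names below follow RAGAS around v0.1 and may differ across versions, and running the evaluation itself requires an LLM API key, so the `evaluate` call is shown only as a comment.

```python
# One evaluation record: the question, the system's answer, the retrieved
# contexts, and the ground-truth answer. RAGAS computes faithfulness and
# answer relevancy from (question, answer, contexts), and context
# precision/recall from (question, contexts, ground_truth).
record = {
    "question": "What is the ideal number of antenatal check-ups?",
    "answer": "A programme of ten appointments is recommended.",
    "contexts": ["Clinical guideline excerpt on antenatal appointment schedules."],
    "ground_truth": "A programme of ten appointments is recommended; "
                    "seven for a multiparous woman with a normal pregnancy.",
}

# With an API key configured, evaluation would look roughly like:
# from datasets import Dataset
# from ragas import evaluate
# from ragas.metrics import (faithfulness, answer_relevancy,
#                            context_precision, context_recall)
# scores = evaluate(Dataset.from_list([record]),
#                   metrics=[faithfulness, answer_relevancy,
#                            context_precision, context_recall])
```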
Research Dataset
The benchmark uses a curated dataset of 10 questions about pregnancy and prenatal care, each with ground truth answers derived from clinical practice guidelines.
Example Question
Question: ¿Cuál es la cantidad ideal de controles prenatales? (What is the ideal number of antenatal check-ups?)
Ground Truth: Se recomienda un programa de diez citas. Para una mujer multípara con un embarazo de curso normal se recomienda un programa de siete citas. (A programme of ten appointments is recommended. For a multiparous woman with a normal-course pregnancy, a programme of seven appointments is recommended.)
Evaluation: The system retrieves relevant guidelines and generates an answer, which is then scored on faithfulness, relevancy, precision, and recall.
Key Research Questions
This benchmark addresses critical questions for medical RAG systems:
- Which retrieval strategy produces the most accurate medical answers?
  - Does semantic search alone suffice, or do hybrid approaches improve results?
- How much does model choice matter?
  - How do next-gen models (GPT-5) compare to GPT-4o for medical RAG?
  - Can specialized medical models (MediPhi, MedGemma) outperform general-purpose LLMs?
- What are the cost-performance tradeoffs?
  - Can cheaper models achieve similar quality with better retrieval?
- Do advanced techniques justify their complexity?
  - Are HyDE and query rewriting worth the additional LLM calls?
Next Steps
RAG Architectures
Explore the 6 different retrieval strategies in detail
Evaluation Framework
Learn how RAGAS measures RAG system quality
Data Pipeline
Understand document processing and vector store creation
Getting Started
Run your first evaluation
