This guide explains how to integrate new LLMs and specialized medical models into the Obstetrics RAG Benchmark.
Model Registry Architecture
The benchmark uses a centralized model registry in src/common/model_provider.py that supports:
- OpenAI Models: Direct API integration (GPT-4, GPT-3.5, etc.)
- HuggingFace Models: Via OpenAI-compatible endpoints (TGI, Inference Endpoints)
- Specialized Medical Models: Domain-specific models like MedPhi, MedGemma
Model Configuration Structure
Each model is registered using the ModelConfig dataclass:
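The exact fields live in src/common/model_provider.py; the sketch below shows what such a dataclass typically contains. The field names here are illustrative assumptions, not the repository's actual definition:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ModelConfig:
    """Illustrative shape of a registry entry (the real fields may differ)."""
    name: str                               # key used to select the model, e.g. "gpt-4o"
    provider: str                           # "openai" or "huggingface"
    model_id: str                           # provider-side model identifier
    endpoint_env_var: Optional[str] = None  # env var holding the endpoint URL (HF/TGI)
    api_key_env_var: Optional[str] = None   # env var holding the API token
    max_tokens: int = 1024                  # generation cap to avoid runaway output
    temperature: float = 0.0                # deterministic generation for benchmarking
```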
Adding OpenAI Models
Register in Model Registry
Add your model configuration to MODELS_REGISTRY in src/common/model_provider.py:181-208:
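A hypothetical entry, reusing the illustrative ModelConfig fields sketched above (align it with the actual dataclass and registry layout in the repository):

```python
# src/common/model_provider.py (illustrative entry; adapt to the real MODELS_REGISTRY)
MODELS_REGISTRY = {
    "gpt-4o": ModelConfig(
        name="gpt-4o",
        provider="openai",
        model_id="gpt-4o",
        api_key_env_var="OPENAI_API_KEY",
    ),
    # ... existing entries ...
}
```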
Configure API Access
Ensure your OpenAI API key is set in .env:
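Assuming the standard OPENAI_API_KEY variable name:

```bash
# .env
OPENAI_API_KEY=sk-your-key-here
```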
The model will be automatically instantiated using the OpenAI API.
Adding HuggingFace Models
For models deployed via TGI (Text Generation Inference) or HuggingFace Inference Endpoints:
Deploy Your Model
Deploy your model to an OpenAI-compatible endpoint:
- HuggingFace Inference Endpoints: Automatic OpenAI-compatible API
- Text Generation Inference (TGI): Self-hosted with OpenAI API compatibility
- vLLM: High-performance serving with OpenAI compatibility
The endpoint must support:
- POST /v1/chat/completions (OpenAI format)
- Authentication via API token
Configure Environment Variables
Add endpoint URL and authentication to .env:
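For example, using the variable names referenced later in this guide (HF_TOKEN and a model-specific endpoint variable such as MODEL_ENDPOINT_URL):

```bash
# .env
MODEL_ENDPOINT_URL=https://your-endpoint.huggingface.cloud   # the /v1 suffix is optional
HF_TOKEN=hf_your-token-here
```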
The endpoint URL can include or omit /v1 - the framework normalizes it automatically.
Model Provider Implementation Details
The create_llm() function handles model instantiation:
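The actual implementation lives in src/common/model_provider.py; the following is a simplified sketch of the dispatch it typically performs, assuming langchain_openai's ChatOpenAI as the OpenAI-compatible client:

```python
import os
from langchain_openai import ChatOpenAI

def create_llm(model_key: str):
    """Simplified sketch: look up the registry entry and build an OpenAI-compatible client."""
    config = MODELS_REGISTRY[model_key]  # module-level registry dict

    if config.provider == "openai":
        # Uses OPENAI_API_KEY from the environment.
        return ChatOpenAI(model=config.model_id, temperature=config.temperature)

    # HuggingFace / TGI / vLLM endpoints speak the OpenAI protocol.
    base_url = os.environ[config.endpoint_env_var].rstrip("/")
    if not base_url.endswith("/v1"):
        base_url += "/v1"  # normalize the /v1 suffix
    return ChatOpenAI(
        model=config.model_id,
        base_url=base_url,
        api_key=os.environ[config.api_key_env_var],
        max_tokens=config.max_tokens,   # cap generation to avoid infinite output
        temperature=config.temperature,
    )
```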
Integrating Medical-Specialized Models
Example: Adding a Medical LLM
Let’s integrate a hypothetical medical model “ClinicalLLaMA”:
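Using the same illustrative field names as above (the real ModelConfig may differ), a registry entry for this hypothetical model could look like:

```python
# Hypothetical entry for a medical model served from a HuggingFace endpoint
MODELS_REGISTRY["clinical-llama"] = ModelConfig(
    name="clinical-llama",
    provider="huggingface",
    model_id="your-org/ClinicalLLaMA-7B",          # hypothetical HF repo id
    endpoint_env_var="CLINICAL_LLAMA_ENDPOINT_URL",
    api_key_env_var="HF_TOKEN",
    max_tokens=1024,
)
```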
Model Identity and Cost Tracking
The framework automatically resolves model identity for cost tracking:
- Consistent cost calculation across providers
- Metadata tracking in evaluation results
- Provider-specific optimizations
Pricing Configuration
Add pricing information for cost tracking in src/common/pricing.py:
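The structure below is illustrative (check the actual format in src/common/pricing.py), and the values are placeholders rather than current provider rates:

```python
# src/common/pricing.py (illustrative; placeholder USD prices per 1M tokens)
MODEL_PRICING = {
    "gpt-4o": {"input": 2.50, "output": 10.00},
    "clinical-llama": {"input": 0.00, "output": 0.00},  # self-hosted: no per-token cost
}
```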
Testing New Models
Unit Testing
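A minimal sanity check, assuming a pytest setup and the create_llm()/MODELS_REGISTRY interfaces described above (test names, env variables, and assertions are illustrative):

```python
# tests/test_model_provider.py (illustrative)
from src.common.model_provider import MODELS_REGISTRY, create_llm

def test_model_is_registered():
    assert "clinical-llama" in MODELS_REGISTRY

def test_create_llm_returns_client(monkeypatch):
    # Provide dummy credentials so instantiation does not depend on a live endpoint.
    monkeypatch.setenv("CLINICAL_LLAMA_ENDPOINT_URL", "https://example.com/v1")
    monkeypatch.setenv("HF_TOKEN", "hf_dummy")
    llm = create_llm("clinical-llama")
    assert llm is not None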
Integration Testing
Test with a simple RAG query:
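The RAG pipeline entry point depends on the repository's architecture modules; as a first sanity check you can call the new model directly through create_llm() (a direct model call, not a full RAG run):

```python
from src.common.model_provider import create_llm

llm = create_llm("clinical-llama")
response = llm.invoke("What are the common risk factors for preeclampsia?")
print(response.content)
```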
Evaluation Testing
Run a full evaluation:
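The exact entry point depends on the repository's evaluation scripts; a hypothetical invocation might look like:

```bash
# Hypothetical command; substitute the repository's actual evaluation script and flags
python -m src.evaluation.run_evaluation --model clinical-llama --rag-architecture naive
```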
Multi-Model Evaluation Workflow
The benchmark supports comparing multiple models systematically, collecting for each model (a comparison sketch follows this list):
- RAGAS metrics (faithfulness, relevancy, precision, recall)
- Performance metrics (execution time, token usage)
- Cost metrics (total cost, cost per question)
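A hedged sketch of a comparison loop, reusing the hypothetical evaluation command from above:

```bash
# Hypothetical loop; substitute the repository's actual evaluation script and flags
for MODEL in gpt-4o gpt-3.5-turbo clinical-llama; do
  python -m src.evaluation.run_evaluation --model "$MODEL" --rag-architecture naive
done
```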
Common Issues and Solutions
ValueError: HF_TOKEN environment variable not set
Solution: Add your HuggingFace token to .env:
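For example:

```bash
# .env
HF_TOKEN=hf_your-token-here
```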
ValueError: Environment variable 'MODEL_ENDPOINT_URL' not set
Solution: Add your endpoint URL to .env:
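For example:

```bash
# .env
MODEL_ENDPOINT_URL=https://your-endpoint.huggingface.cloud
```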
Model generates infinite output
Solution: The max_tokens parameter is set for HuggingFace models. If issues persist, adjust it in create_llm():
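For instance, lowering the cap in the earlier sketch (the actual code in src/common/model_provider.py may differ):

```python
# Inside create_llm(), when building the HuggingFace client
return ChatOpenAI(
    model=config.model_id,
    base_url=base_url,
    api_key=os.environ[config.api_key_env_var],
    max_tokens=512,  # tighten the generation cap if the model runs away
)
```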
Connection timeout to endpoint
Solution:
- Verify the endpoint is running: curl https://your-endpoint.com/health
- Check firewall/network settings
- Increase timeout in model configuration
Model not found in registry during evaluation
Solution: Ensure model key matches exactly:
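As a quick diagnostic, you can print the registered keys and check for typos (assuming MODELS_REGISTRY is importable as shown earlier):

```python
from src.common.model_provider import MODELS_REGISTRY

print(sorted(MODELS_REGISTRY.keys()))  # the model key you pass must match one of these exactly
```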
Best Practices
Model Selection
- General Purpose: Start with GPT-4o for baseline performance
- Cost Optimization: Use GPT-3.5-turbo for development/testing
- Domain Expertise: Evaluate medical-specialized models for clinical accuracy
- Latency Sensitive: Consider smaller models with faster inference
Evaluation Strategy
- Single Model First: Test new models individually before batch evaluation
- Compare to Baseline: Always compare against GPT-4o baseline
- Multiple RAG Architectures: Test across all RAG strategies to find optimal pairing
- Statistical Significance: Run multiple trials for robust comparisons
Cost Management
- Use smaller models for development and testing
- Cache results to avoid re-running expensive evaluations
- Monitor token usage in evaluation metadata
- Set API rate limits to control costs
Next Steps
- Customizing Metrics: Extend evaluation with custom metrics
- Adding RAG Architectures: Implement new retrieval strategies
- API Reference: Complete API documentation
- Results Analysis: Understand evaluation outputs
