

This guide explains how to integrate new LLMs and specialized medical models into the Obstetrics RAG Benchmark.

Model Registry Architecture

The benchmark uses a centralized model registry in src/common/model_provider.py that supports:
  • OpenAI Models: Direct API integration (GPT-4, GPT-3.5, etc.)
  • HuggingFace Models: Via OpenAI-compatible endpoints (TGI, Inference Endpoints)
  • Specialized Medical Models: Domain-specific models like MediPhi, MedGemma
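To see what is already registered, you can iterate the registry directly (a quick sketch using the import path shown later in this guide):

from src.common.model_provider import MODELS_REGISTRY

# Print each registered key with its provider and full model identifier
for key, cfg in MODELS_REGISTRY.items():
    print(f"{key}: provider={cfg.provider}, model_id={cfg.model_id}")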

Model Configuration Structure

Each model is registered using the ModelConfig dataclass:
from dataclasses import dataclass
from typing import Literal, Optional

@dataclass
class ModelConfig:
    """Configuration for a language model."""
    
    name: str
    """Short name/key for the model (e.g., 'gpt-4.1', 'mediphi')."""
    
    model_id: str
    """Full model identifier (e.g., 'gpt-4.1', 'microsoft/MediPhi')."""
    
    provider: Literal["openai", "huggingface"]
    """Model provider type."""
    
    endpoint_url_env: Optional[str] = None
    """Environment variable for HF endpoint URL (e.g., 'MEDIPHI_ENDPOINT_URL')."""
    
    temperature: float = 0.0
    """Sampling temperature for the model."""

Adding OpenAI Models

1. Register in Model Registry

Add your model configuration to MODELS_REGISTRY in src/common/model_provider.py:181-208:
MODELS_REGISTRY: Dict[str, ModelConfig] = {
    # Existing models...
    "gpt-5": ModelConfig(
        name="gpt-5",
        model_id="gpt-5",
        provider="openai",
        temperature=0.0
    ),
    # Add your new model:
    "gpt-4-turbo": ModelConfig(
        name="gpt-4-turbo",
        model_id="gpt-4-turbo-2024-04-09",
        provider="openai",
        temperature=0.0
    ),
}
2. Configure API Access

Ensure your OpenAI API key is set in .env:
OPENAI_API_KEY=sk-...
The model will be automatically instantiated using the OpenAI API.
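As a quick sanity check that the key is visible to the framework (a minimal sketch; it assumes python-dotenv, which the repo's load_dotenv_if_needed() helper suggests is available):

import os
from dotenv import load_dotenv

load_dotenv()  # Read variables from .env into the process environment
assert os.getenv("OPENAI_API_KEY"), "OPENAI_API_KEY is not set; check your .env file"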
3. Test Model Access

Verify the model works:
from src.common.model_provider import MODELS_REGISTRY, create_llm

# Get model config
config = MODELS_REGISTRY["gpt-4-turbo"]

# Create LLM instance
llm = create_llm(config)

# Test generation
response = llm.invoke("Hello, how are you?")
print(response.content)
4. Run Evaluation

Evaluate with your new model:
# Single RAG architecture
python scripts/run_evaluation.py multi-model hybrid

# Or all architectures
python scripts/run_evaluation.py all-models-all-rags

Adding HuggingFace Models

For models deployed via TGI (Text Generation Inference) or HuggingFace Inference Endpoints:
1. Deploy Your Model

Deploy your model to an OpenAI-compatible endpoint:
  • HuggingFace Inference Endpoints: Automatic OpenAI-compatible API
  • Text Generation Inference (TGI): Self-hosted with OpenAI API compatibility
  • vLLM: High-performance serving with OpenAI compatibility
Your endpoint should expose:
  • POST /v1/chat/completions (OpenAI format)
  • Authentication via API token
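Before registering the model, you can hit the endpoint directly to confirm it speaks the OpenAI chat format. A minimal sketch using requests; the endpoint URL and model ID below are placeholders, not values from the benchmark:

import os
import requests

endpoint = "https://your-endpoint.com"  # Placeholder: your deployed endpoint
payload = {
    "model": "organization/model-name",  # Placeholder: your model ID
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 32,
}
headers = {"Authorization": f"Bearer {os.environ['HF_TOKEN']}"}

resp = requests.post(f"{endpoint}/v1/chat/completions", json=payload, headers=headers, timeout=30)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])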
2. Configure Environment Variables

Add endpoint URL and authentication to .env:
# HuggingFace Token (required for all HF models)
HF_TOKEN=hf_...

# Model-specific endpoint URLs
MEDIPHI_ENDPOINT_URL=https://your-endpoint.aws.endpoints.huggingface.cloud
MEDGEMMA_ENDPOINT_URL=https://your-endpoint.aws.endpoints.huggingface.cloud
YOUR_MODEL_ENDPOINT_URL=https://your-endpoint.com
The endpoint URL can include or omit /v1 - the framework normalizes it automatically.
3. Register Model in Registry

Add your model to MODELS_REGISTRY in src/common/model_provider.py:
MODELS_REGISTRY: Dict[str, ModelConfig] = {
    # Existing models...
    "your-model": ModelConfig(
        name="your-model",
        model_id="organization/model-name",  # Full HF model ID
        provider="huggingface",
        endpoint_url_env="YOUR_MODEL_ENDPOINT_URL",  # Env var name
        temperature=0.0
    ),
}
4. Verify Endpoint Compatibility

Test the endpoint setup:
from src.common.model_provider import create_llm, MODELS_REGISTRY

config = MODELS_REGISTRY["your-model"]
llm = create_llm(config)

# Test simple generation
response = llm.invoke("Test message")
print(f"Response: {response.content}")
If you get errors, check:
  • Endpoint URL is correct and accessible
  • HF_TOKEN is valid and has access to the model
  • Endpoint is running and accepting requests
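A quick way to rule out missing configuration before digging deeper (a small sketch; substitute the environment variable names used by your registry entry):

import os

# Variables the HuggingFace provider path reads (see create_llm() below)
for var in ("HF_TOKEN", "YOUR_MODEL_ENDPOINT_URL"):
    print(f"{var}: {'set' if os.getenv(var) else 'MISSING'}")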

Model Provider Implementation Details

The create_llm() function handles model instantiation:
def create_llm(config: ModelConfig) -> BaseChatModel:
    """Factory function to create a language model."""
    load_dotenv_if_needed()
    
    if config.provider == "openai":
        # Standard OpenAI model
        return ChatOpenAI(
            model_name=config.model_id,
            temperature=config.temperature
        )
    
    elif config.provider == "huggingface":
        # HuggingFace model via OpenAI-compatible endpoint
        endpoint_url = os.getenv(config.endpoint_url_env)
        if not endpoint_url:
            raise ValueError(
                f"Environment variable '{config.endpoint_url_env}' not set."
            )
        
        hf_token = os.getenv("HF_TOKEN")
        if not hf_token:
            raise ValueError("HF_TOKEN environment variable not set.")
        
        # Normalize endpoint URL to include /v1
        base_url = _normalize_openai_compatible_base_url(endpoint_url)
        
        return ChatOpenAI(
            base_url=base_url,
            api_key=hf_token,
            model_name=config.model_id,
            temperature=config.temperature,
            max_tokens=512,  # Prevents infinite generation
        )
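The _normalize_openai_compatible_base_url() helper is what allows the endpoint URL in .env to include or omit /v1 (see step 2 above). Its exact implementation is not reproduced here; a behavior-equivalent sketch might look like this:

def _normalize_openai_compatible_base_url(endpoint_url: str) -> str:
    """Sketch of the assumed behavior: ensure the base URL ends with /v1."""
    url = endpoint_url.rstrip("/")
    return url if url.endswith("/v1") else f"{url}/v1"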

Integrating Medical-Specialized Models

Example: Adding a Medical LLM

Let’s integrate a hypothetical medical model “ClinicalLLaMA”:
1. Deploy the Model

# Using HuggingFace Inference Endpoints
# Or deploy via TGI:
docker run -p 8080:80 \
  -v $PWD/data:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id clinicalllm/clinical-llama-13b \
  --max-input-length 2048 \
  --max-total-tokens 4096
2. Configure Environment

# .env
HF_TOKEN=hf_your_token_here
CLINICAL_LLAMA_ENDPOINT_URL=http://localhost:8080
3. Register Model

# In src/common/model_provider.py
"clinical-llama": ModelConfig(
    name="clinical-llama",
    model_id="clinicalllm/clinical-llama-13b",
    provider="huggingface",
    endpoint_url_env="CLINICAL_LLAMA_ENDPOINT_URL",
    temperature=0.0
),
4. Evaluate Against Benchmarks

# Multi-model evaluation on the hybrid RAG architecture
python scripts/run_evaluation.py multi-model hybrid

# Or comprehensive evaluation
python scripts/run_evaluation.py all-models-all-rags

Model Identity and Cost Tracking

The framework automatically resolves model identity for cost tracking:
from src.common.model_provider import get_model_identity

model_identity = get_model_identity(llm=llm_instance)
# Returns:
# {
#     "provider": "openai" | "huggingface" | "unknown",
#     "model_id": "full-model-identifier",
#     "model_name": "short-name"
# }
This enables:
  • Consistent cost calculation across providers
  • Metadata tracking in evaluation results
  • Provider-specific optimizations

Pricing Configuration

Add pricing information for cost tracking in src/common/pricing.py:
PRICING_TABLE = {
    "your-model": {
        "provider": "huggingface",
        "input_cost_per_1k": 0.0001,   # USD per 1K input tokens
        "output_cost_per_1k": 0.0002,  # USD per 1K output tokens
    },
}
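Pricing entries combine with token counts to yield per-call costs. As a rough sketch of the arithmetic (compute_cost is a hypothetical helper for illustration, not part of the framework):

def compute_cost(model_key: str, input_tokens: int, output_tokens: int) -> float:
    """Hypothetical helper: USD cost from the pricing table and token counts."""
    pricing = PRICING_TABLE[model_key]
    return (
        input_tokens / 1000 * pricing["input_cost_per_1k"]
        + output_tokens / 1000 * pricing["output_cost_per_1k"]
    )

# Example: 1,200 input tokens and 300 output tokens for "your-model"
print(compute_cost("your-model", 1200, 300))  # 0.00012 + 0.00006 = 0.00018 USD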

Testing New Models

Unit Testing

import pytest
from src.common.model_provider import create_llm, MODELS_REGISTRY

def test_model_creation():
    """Test model can be created"""
    config = MODELS_REGISTRY["your-model"]
    llm = create_llm(config)
    assert llm is not None

def test_model_generation():
    """Test model can generate responses"""
    config = MODELS_REGISTRY["your-model"]
    llm = create_llm(config)
    response = llm.invoke("Test question")
    assert response.content
    assert len(response.content) > 0

Integration Testing

Test with a simple RAG query:
from src.rag.simple import query_for_evaluation

result = query_for_evaluation(
    question="¿Cuál es la cantidad ideal de controles prenatales?",
    llm_model="your-model"
)

print(f"Answer: {result['answer']}")
print(f"Contexts: {len(result['contexts'])}")
print(f"Cost: ${result['metadata']['total_cost']:.6f}")

Evaluation Testing

Run a full evaluation:
# With debug output to see detailed progress
python scripts/run_evaluation.py simple --debug

Multi-Model Evaluation Workflow

The benchmark supports comparing multiple models systematically:
# In your evaluation script
from src.evaluation.ragas_evaluator import RAGASEvaluator

evaluator = RAGASEvaluator(rag_type="hybrid")

# Test specific models
models_to_test = ["gpt-4o", "gpt-5", "clinical-llama", "mediphi"]
evaluator.run_multi_model_evaluation(models_to_test=models_to_test)
This generates a consolidated report comparing:
  • RAGAS metrics (faithfulness, relevancy, precision, recall)
  • Performance metrics (execution time, token usage)
  • Cost metrics (total cost, cost per question)

Common Issues and Solutions

"HF_TOKEN environment variable not set" error
Solution: Add your HuggingFace token to .env:
HF_TOKEN=hf_your_token_here

Endpoint URL environment variable not set
Solution: Add your endpoint URL to .env:
YOUR_MODEL_ENDPOINT_URL=https://your-endpoint.com

Responses are truncated or generation runs away
Solution: The max_tokens parameter is already set for HuggingFace models. If issues persist, adjust it in create_llm():
max_tokens=512,  # Increase or decrease as needed

Endpoint connection or timeout errors
Solution:
  • Verify the endpoint is running: curl https://your-endpoint.com/health
  • Check firewall/network settings
  • Increase the timeout in the model configuration

Model key not found during evaluation
Solution: Ensure the model key matches exactly:
# Registry key
"my-model": ModelConfig(...)

# Must match in evaluation
evaluator.run_multi_model_evaluation(models_to_test=["my-model"])

Best Practices

Model Selection

  • General Purpose: Start with GPT-4o for baseline performance
  • Cost Optimization: Use GPT-3.5-turbo for development/testing
  • Domain Expertise: Evaluate medical-specialized models for clinical accuracy
  • Latency Sensitive: Consider smaller models with faster inference

Evaluation Strategy

  1. Single Model First: Test new models individually before batch evaluation
  2. Compare to Baseline: Always compare against GPT-4o baseline
  3. Multiple RAG Architectures: Test across all RAG strategies to find optimal pairing
  4. Statistical Significance: Run multiple trials for robust comparisons
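For that last point, a minimal sketch of aggregating scores across trials (the score values below are illustrative, not benchmark results):

from statistics import mean, stdev

# Faithfulness scores collected from several independent evaluation runs (illustrative)
trial_scores = [0.82, 0.79, 0.85, 0.81, 0.83]

print(f"mean={mean(trial_scores):.3f}, stdev={stdev(trial_scores):.3f}, n={len(trial_scores)}")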

Cost Management

  • Use smaller models for development and testing
  • Cache results to avoid re-running expensive evaluations (see the sketch after this list)
  • Monitor token usage in evaluation metadata
  • Set API rate limits to control costs
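For the caching recommendation above, a minimal sketch that memoizes query_for_evaluation() results to disk (the cache location and key scheme are arbitrary choices, and it assumes the returned dict is JSON-serializable):

import hashlib
import json
from pathlib import Path

from src.rag.simple import query_for_evaluation

CACHE_DIR = Path("cache/rag_results")  # Arbitrary location for cached results
CACHE_DIR.mkdir(parents=True, exist_ok=True)

def cached_query(question: str, llm_model: str) -> dict:
    """Return a cached result if this (model, question) pair was already evaluated."""
    key = hashlib.sha256(f"{llm_model}|{question}".encode()).hexdigest()
    path = CACHE_DIR / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text())
    result = query_for_evaluation(question=question, llm_model=llm_model)
    path.write_text(json.dumps(result, ensure_ascii=False))
    return result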

Next Steps

  • Customizing Metrics: Extend evaluation with custom metrics
  • Adding RAG Architectures: Implement new retrieval strategies
  • API Reference: Complete API documentation
  • Results Analysis: Understand evaluation outputs