
Benchmarking

This guide covers best practices for running comprehensive benchmarks to compare RAG architectures and identify optimal configurations.

Benchmarking Goals

A comprehensive benchmark should answer four questions:
  • Quality: which RAG architecture produces the highest-quality answers?
  • Performance: which architecture is fastest and most cost-effective?
  • Robustness: which architecture handles diverse questions best?
  • Scalability: which architecture scales best to production?

Benchmark Types

1. Single RAG Benchmark

Evaluate one RAG architecture in depth.
python scripts/run_evaluation.py hybrid
Use when:
  • Testing a new RAG implementation
  • Debugging a specific architecture
  • Quick quality check
Output: ragas_evaluation_[rag_type]_[timestamp].json

2. Multi-Model Benchmark

Compare how different LLMs perform with the same RAG architecture.
python scripts/run_evaluation.py multi-model hybrid
Use when:
  • Selecting the best LLM for your use case
  • Understanding model-specific strengths
  • Cost-benefit analysis across models
Output: ragas_multimodel_[rag_type]_[timestamp].json

3. Comprehensive Benchmark

Test all RAG architectures with all available models.
python scripts/run_evaluation.py all-models-all-rags
Use when:
  • Conducting research
  • Selecting production configuration
  • Publishing results
Output: ragas_comprehensive_all_rags_all_models_[timestamp].json

Running a Comprehensive Benchmark

Step 1: Prepare Environment

Ensure stable conditions for fair comparison:
# Clean environment
rm -rf data/embeddings/chroma_db/

# Recreate embeddings
python scripts/create_embeddings.py

# Verify API keys
grep OPENAI_API_KEY .env
Step 2: Run Comprehensive Evaluation

Start the full benchmark:
python scripts/run_evaluation.py all-models-all-rags > benchmark_log.txt 2>&1 &
This runs in the background and logs all output.
Step 3: Monitor Progress

Watch the log file:
tail -f benchmark_log.txt
You’ll see progress through RAG types:
========================= SIMPLE SEMANTIC RAG =========================
Starting RAGAS evaluation
...
============================= HYDE RAG =============================
Starting RAGAS evaluation
...
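For a quick progress count instead of eyeballing the tail, a minimal sketch (it assumes one "Starting RAGAS evaluation" line per RAG type, as in the excerpt above; adjust if your log emits one per model):

# Hypothetical helper, not part of the repo
EXPECTED = 6  # RAG types in a comprehensive run
with open('benchmark_log.txt') as f:
    started = sum('Starting RAGAS evaluation' in line for line in f)
print(f'{started}/{EXPECTED} RAG evaluations started')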
Step 4: Wait for Completion

Typical duration:
  • 6 RAG types × 4 models × 10 questions = 240 evaluations
  • ~5-10 seconds per question for answer generation, plus the additional LLM calls RAGAS makes to score each answer
  • Total: roughly 2-4 hours
Do not interrupt the benchmark: results are only saved at the end.
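As a back-of-envelope check on that estimate (the ~30-60 seconds per evaluation below is an assumption that folds RAGAS scoring into the per-question time, not a measured figure):

rags, models, questions = 6, 4, 10
evaluations = rags * models * questions  # 240
# assume ~30-60 s end to end per evaluation (generation + RAGAS scoring)
low_h, high_h = evaluations * 30 / 3600, evaluations * 60 / 3600
print(f'{evaluations} evaluations, roughly {low_h:.0f}-{high_h:.0f} hours')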

Understanding Benchmark Results

The comprehensive benchmark produces a detailed JSON file:
{
  "metadata": {
    "timestamp": "2026-03-11T11:15:57.123456",
    "evaluation_type": "comprehensive_rag_comparison",
    "dataset_size": 10,
    "rags_evaluated": [
      "simple", "hybrid", "hybrid-rrf", "hyde", "rewriter", "pageindex"
    ],
    "models_evaluated": [
      "gpt-4o", "gpt-5", "gpt-5.2", "google/medgemma-1.5-4b-it"
    ]
  },
  "summary": { /* Metrics for each RAG × Model combination */ },
  "best_performers": { /* Top performers for each metric */ },
  "question_by_question": [ /* Detailed results */ ]
}
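Before digging in, it is worth confirming what a finished run actually covered. A minimal sketch (the file name is illustrative):

import json

with open('results/ragas_comprehensive_all_rags_all_models_20260311_111557.json') as f:
    metadata = json.load(f)['metadata']

print(metadata['rags_evaluated'])    # six RAG types
print(metadata['models_evaluated'])  # four models
print(metadata['dataset_size'], 'questions')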

Summary Section

Compare all RAG architectures:
{
  "summary": {
    "simple": {
      "rag_name": "Simple Semantic RAG",
      "metrics": {
        "faithfulness": 0.850,
        "answer_relevancy": 0.265,
        "context_precision": 0.779,
        "context_recall": 0.600
      },
      "performance": {
        "average_execution_time": 8.234,
        "total_cost": 0.021433
      }
    },
    "hybrid": {
      "rag_name": "Hybrid RAG (BM25 + Semantic)",
      "metrics": {
        "faithfulness": 0.912,
        "answer_relevancy": 0.783,
        "context_precision": 0.891,
        "context_recall": 0.845
      },
      "performance": {
        "average_execution_time": 12.567,
        "total_cost": 0.038921
      }
    }
    // ... more RAGs
  }
}
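The average scores quoted later in this guide are simply the mean of the four quality metrics. A minimal sketch computing them from this summary (same illustrative file name as above):

import json

with open('results/ragas_comprehensive_all_rags_all_models_20260311_111557.json') as f:
    summary = json.load(f)['summary']

# Mean of the four RAGAS metrics per RAG: the "Avg Score" used below
for rag_type, rag in summary.items():
    metrics = rag['metrics']
    print(f"{rag['rag_name']}: {sum(metrics.values()) / len(metrics):.3f}")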

Best Performers

Identify winners for each metric:
{
  "best_performers": {
    "faithfulness": {
      "rag_type": "hybrid-rrf",
      "rag_name": "Hybrid RAG + RRF (BM25 + Semantic)",
      "score": 0.924
    },
    "answer_relevancy": {
      "rag_type": "rewriter",
      "rag_name": "Rewriter RAG (Multi-Query)",
      "score": 0.887
    },
    "context_precision": {
      "rag_type": "hybrid-rrf",
      "rag_name": "Hybrid RAG + RRF (BM25 + Semantic)",
      "score": 0.901
    },
    "context_recall": {
      "rag_type": "rewriter",
      "rag_name": "Rewriter RAG (Multi-Query)",
      "score": 0.894
    },
    "best_avg_execution_time": {
      "rag_type": "simple",
      "rag_name": "Simple Semantic RAG",
      "score": 8.234
    },
    "best_total_cost": {
      "rag_type": "simple",
      "rag_name": "Simple Semantic RAG",
      "score": 0.021
    }
  }
}
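The quality entries in best_performers can be re-derived from the summary section. A sketch, reusing the summary dict loaded above (higher is better for all four quality metrics):

QUALITY_METRICS = ['faithfulness', 'answer_relevancy',
                   'context_precision', 'context_recall']
for metric in QUALITY_METRICS:
    best_type, best = max(summary.items(), key=lambda kv: kv[1]['metrics'][metric])
    print(f"{metric}: {best['rag_name']} ({best['metrics'][metric]:.3f})")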

Comparing RAG Architectures

Quality Comparison

Rank by overall average score:
Rank  RAG Architecture  Avg Score  Faithfulness  Answer Rel.  Ctx Prec.  Ctx Recall
1     Hybrid-RRF        0.891      0.924         0.875        0.901      0.864
2     Rewriter          0.872      0.901         0.887        0.834      0.894
3     Hybrid            0.858      0.912         0.783        0.891      0.845
4     HyDE              0.812      0.867         0.745        0.823      0.812
5     PageIndex         0.781      0.834         0.698        0.801      0.791
6     Simple            0.623      0.850         0.265        0.779      0.600
Key insights:
  • Hybrid-RRF offers the best overall quality
  • Rewriter excels at recall (finding all relevant info)
  • Simple is fast but scores poorly on answer relevancy (0.265)

Performance Comparison

Rank by speed and cost:
Rank  RAG Architecture  Avg Time (s)  Total Cost  Cost per Q
1     Simple            8.2           $0.021      $0.0021
2     Hybrid            12.6          $0.039      $0.0039
3     PageIndex         13.1          $0.041      $0.0041
4     Hybrid-RRF        14.8          $0.048      $0.0048
5     HyDE              18.3          $0.067      $0.0067
6     Rewriter          21.7          $0.089      $0.0089
Key insights:
  • Simple is 2.6× faster than Rewriter
  • Simple costs 4.2× less than Rewriter
  • Advanced RAGs trade cost/speed for quality

Trade-off Analysis

Best Overall Quality

Hybrid-RRF
  • Average score: 0.891
  • Excels in all metrics
  • Cost: $0.048 per 10 questions
Use for: Production systems where quality matters most

Best Balance

Hybrid RAG
  • Average score: 0.858 (only 3.7% below Hybrid-RRF)
  • 15% faster than Hybrid-RRF
  • 19% cheaper than Hybrid-RRF
Use for: Most production use cases

Best Performance

Simple Semantic
  • Fastest: 8.2s average
  • Cheapest: $0.021 total
  • Score: 0.623 (acceptable only if the weak answer relevancy is tolerable)
Use for: High-volume, cost-sensitive applications

Best Recall

Rewriter RAG
  • Context recall: 0.894
  • Answer relevancy: 0.887
  • Most thorough retrieval
Use for: Critical applications requiring completeness

Cross-Model Analysis

For multi-model benchmarks, analyze how models perform across RAGs:
{
  "question_by_question": [
    {
      "question_id": 1,
      "question": "¿En qué momento y quien debe reevaluar...",
      "rag_results": {
        "gpt-4o": { /* metrics */ },
        "gpt-5": { /* metrics */ },
        "gpt-5.2": { /* metrics */ },
        "google/medgemma-1.5-4b-it": { /* metrics */ }
      }
    }
  ]
}
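To turn that per-question structure into a per-model table, one approach is to flatten it into a DataFrame and average over questions (field names follow the layout above; the file name is illustrative):

import json
import pandas as pd

with open('results/ragas_multimodel_hybrid_20260310_153022.json') as f:
    questions = json.load(f)['question_by_question']

# One row per (question, model), then mean metrics per model
rows = [
    {'model': model, **metrics}
    for q in questions
    for model, metrics in q['rag_results'].items()
]
per_model = pd.DataFrame(rows).groupby('model').mean(numeric_only=True)
print(per_model.round(3))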

Model Performance Matrix

Model            Avg Score  Faithfulness  Answer Rel.  Cost
gpt-5.2          0.892      0.934         0.901        $0.045
gpt-5            0.876      0.912         0.887        $0.038
gpt-4o           0.854      0.889         0.834        $0.041
medgemma-1.5-4b  0.623      0.850         0.265        $0.021
Insights:
  • GPT-5.2 offers best quality but at higher cost
  • GPT-5 provides best value (quality/cost ratio)
  • Medical-specialized models (medgemma) need more tuning

Best Practices

Fair Comparison

All RAGs should be evaluated on the exact same questions:
# From ragas_evaluator.py:49-90
DATA_GT = [  # Same 10 questions for all evaluations
    {"question": "...", "ground_truth": "..."},
    # ...
]
Don’t change embeddings between RAG evaluations:
# All RAGs use: OpenAI text-embedding-3-small
# Stored in: data/embeddings/chroma_db/
Keep k (number of chunks) consistent:
# Default k=5 for all RAGs
results = vectorstore.similarity_search(query, k=5)
  • Run evaluations on the same hardware
  • Use the same API tier (avoid rate limits)
  • Don’t run benchmarks in parallel (parallel runs skew timing)
A quick comparability check between result files is sketched below.
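A minimal comparability check between two result files, assuming single-run outputs carry the same metadata block shown earlier (paths follow the storage layout in the next section):

import json

def run_metadata(path):
    with open(path) as f:
        return json.load(f)['metadata']

a = run_metadata('results/single_runs/ragas_evaluation_simple_20260311_093843.json')
b = run_metadata('results/single_runs/ragas_evaluation_hybrid_20260311_095023.json')
assert a['dataset_size'] == b['dataset_size'], 'runs used different dataset sizes'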

Result Storage Organization

Organize results for easy comparison:
results/
├── benchmarks/
│   ├── 2026-03-11_comprehensive/
│   │   ├── ragas_comprehensive_all_rags_all_models_20260311_111557.json
│   │   ├── analysis.ipynb
│   │   └── visualizations/
│   │       ├── metrics_comparison.png
│   │       ├── cost_analysis.png
│   │       └── performance_heatmap.png
│   └── 2026-03-10_hybrid_only/
│       └── ragas_multimodel_hybrid_20260310_153022.json
└── single_runs/
    ├── ragas_evaluation_simple_20260311_093843.json
    └── ragas_evaluation_hybrid_20260311_095023.json
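A small sketch for creating a dated run folder in this layout:

from datetime import date
from pathlib import Path

# e.g. results/benchmarks/2026-03-11_comprehensive/visualizations/
run_dir = Path('results/benchmarks') / f'{date.today().isoformat()}_comprehensive'
(run_dir / 'visualizations').mkdir(parents=True, exist_ok=True)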

Documentation

Document your benchmark methodology:
# Benchmark Report: 2026-03-11

## Setup
- Date: March 11, 2026
- Environment: Python 3.11, MacBook Pro M2
- Models tested: gpt-4o, gpt-5, gpt-5.2, medgemma-1.5-4b
- RAGs tested: simple, hybrid, hybrid-rrf, hyde, rewriter, pageindex
- Dataset: 10 obstetric questions (Spanish)
- Embeddings: OpenAI text-embedding-3-small

## Key Findings
1. Hybrid-RRF achieved highest quality (0.891 avg)
2. Simple RAG was fastest (8.2s avg) and cheapest ($0.021)
3. GPT-5 offered best quality/cost ratio
4. Medical models need further tuning

## Recommendations
- Use Hybrid RAG for production (good balance)
- Consider Hybrid-RRF for critical applications
- Use Simple RAG for high-volume, cost-sensitive use cases

Analyzing Results Programmatically

Load and Compare

import json
import pandas as pd

# Load comprehensive results
with open('results/ragas_comprehensive_all_rags_all_models_20260311_111557.json') as f:
    data = json.load(f)

# Extract summary data
rows = []
for rag_type, rag_data in data['summary'].items():
    row = {
        'rag_type': rag_type,
        'rag_name': rag_data['rag_name'],
        **rag_data['metrics'],
        **rag_data['performance']
    }
    rows.append(row)

df = pd.DataFrame(rows)

# Calculate overall average
df['avg_score'] = df[['faithfulness', 'answer_relevancy', 'context_precision', 'context_recall']].mean(axis=1)

# Sort by quality
df_by_quality = df.sort_values('avg_score', ascending=False)
print("\nRAGs ranked by quality:")
print(df_by_quality[['rag_name', 'avg_score', 'total_cost', 'average_execution_time']])

# Sort by cost
df_by_cost = df.sort_values('total_cost')
print("\nRAGs ranked by cost:")
print(df_by_cost[['rag_name', 'total_cost', 'avg_score']])

Visualize Results

import matplotlib.pyplot as plt
import seaborn as sns

# Metrics comparison (reuses the `df` built in "Load and Compare" above)
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
metrics = ['faithfulness', 'answer_relevancy', 'context_precision', 'context_recall']

for idx, metric in enumerate(metrics):
    ax = axes[idx // 2, idx % 2]
    df_sorted = df.sort_values(metric, ascending=False)
    ax.barh(df_sorted['rag_type'], df_sorted[metric])
    ax.set_xlabel('Score')
    ax.set_title(metric.replace('_', ' ').title())
    ax.set_xlim(0, 1)

plt.tight_layout()
plt.savefig('metrics_comparison.png', dpi=300)

# Cost vs Quality scatter
plt.figure(figsize=(10, 6))
plt.scatter(df['total_cost'], df['avg_score'], s=100)
for idx, row in df.iterrows():
    plt.annotate(row['rag_type'], (row['total_cost'], row['avg_score']), 
                xytext=(5, 5), textcoords='offset points')
plt.xlabel('Total Cost ($)')
plt.ylabel('Average Quality Score')
plt.title('Cost vs Quality Trade-off')
plt.grid(True, alpha=0.3)
plt.savefig('cost_vs_quality.png', dpi=300)
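The storage layout above also lists a performance_heatmap.png. A minimal sketch, reusing df and metrics from the snippets above:

# RAG × metric heatmap (all scores in [0, 1])
plt.figure(figsize=(8, 5))
sns.heatmap(df.set_index('rag_type')[metrics], annot=True, fmt='.3f',
            vmin=0, vmax=1, cmap='viridis')
plt.title('Quality Metrics by RAG Architecture')
plt.tight_layout()
plt.savefig('performance_heatmap.png', dpi=300)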

Publishing Results

For research or internal documentation:

LaTeX Table

\begin{table}[h]
\centering
\caption{RAGAS Evaluation Results Across RAG Architectures}
\begin{tabular}{lcccccc}
\hline
RAG Type & Faith. & Ans.Rel. & Ctx.Prec. & Ctx.Rec. & Avg & Cost \\
\hline
Hybrid-RRF & 0.924 & 0.875 & 0.901 & 0.864 & 0.891 & \$0.048 \\
Rewriter & 0.901 & 0.887 & 0.834 & 0.894 & 0.872 & \$0.089 \\
Hybrid & 0.912 & 0.783 & 0.891 & 0.845 & 0.858 & \$0.039 \\
HyDE & 0.867 & 0.745 & 0.823 & 0.812 & 0.812 & \$0.067 \\
PageIndex & 0.834 & 0.698 & 0.801 & 0.791 & 0.781 & \$0.041 \\
Simple & 0.850 & 0.265 & 0.779 & 0.600 & 0.623 & \$0.021 \\
\hline
\end{tabular}
\end{table}
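Rather than maintaining the rows by hand, pandas can emit them from the summary DataFrame built earlier (columns as named in that snippet):

cols = ['rag_name', 'faithfulness', 'answer_relevancy',
        'context_precision', 'context_recall', 'avg_score', 'total_cost']
print(df.sort_values('avg_score', ascending=False)[cols]
        .to_latex(index=False, float_format='%.3f'))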

Next Steps

  • RAGAS Metrics: understand what each metric measures
  • Interpreting Results: a detailed guide to analyzing evaluation results