Overview
The pricing module provides centralized cost calculation for both OpenAI token-based pricing and HuggingFace endpoint hourly pricing. It loads pricing configuration from JSON and resolves total costs using provider-reported costs when available, falling back to estimated calculations.
resolve_total_cost()
Resolve total cost, preferring the provider-reported cost and falling back to provider-specific estimates.
Signature
def resolve_total_cost(
*,
provider: str,
model_name: str,
model_id: str,
input_tokens: int,
output_tokens: int,
provider_reported_cost: Optional[float],
provider_cost_source: str,
execution_time_seconds: Optional[float] = None,
) -> Dict[str, Any]
Parameters
- provider: Provider type ("openai" or "huggingface")
- model_name: Short model name/key (e.g., "gpt-5", "mediphi")
- model_id: Full model identifier (e.g., "gpt-5", "microsoft/MediPhi-Instruct")
- input_tokens: Number of input/prompt tokens consumed
- output_tokens: Number of output/completion tokens generated
- provider_reported_cost: Cost reported directly by the provider API (if available)
- provider_cost_source: Source of the provider-reported cost, for traceability
- execution_time_seconds: Execution time in seconds (used for HuggingFace endpoint pricing)
Returns
Dictionary containing:
total_cost (float): Total cost in USD, rounded to 10 decimal places
cost_source (str): Source of cost calculation
pricing_context (dict): Detailed pricing metadata
Cost Resolution Priority
- Provider-reported cost: if provider_reported_cost is provided and > 0, it is used directly
- OpenAI token-based estimation: for the "openai" provider, calculated from token counts and rates
- HuggingFace endpoint estimation: for the "huggingface" provider, calculated from execution time and hourly rates
- Missing: returns a cost of 0.0 if no pricing information is available
Cost Sources
provider_reported: Direct cost from provider API
estimated_openai_token_pricing: Calculated from OpenAI token rates
estimated_hf_endpoint_pricing: Calculated from HuggingFace endpoint hourly rates
missing: No pricing information available
Example
from src.common.pricing import resolve_total_cost
from src.common.usage_metrics import extract_usage_from_ai_message, extract_cost_from_ai_message
# After getting a response from LLM
usage = extract_usage_from_ai_message(message)
cost_info = extract_cost_from_ai_message(message)
# Resolve total cost
result = resolve_total_cost(
provider="openai",
model_name="gpt-5",
model_id="gpt-5",
input_tokens=usage["input_tokens"],
output_tokens=usage["output_tokens"],
provider_reported_cost=cost_info["total_cost"],
provider_cost_source=cost_info["cost_source"],
)
print(f"Total cost: ${result['total_cost']:.6f}")
print(f"Cost source: {result['cost_source']}")
print(f"Pricing method: {result['pricing_context']['pricing_method']}")
OpenAI Token-Based Pricing
input_cost = (input_tokens / 1_000_000) * input_rate_per_1m
output_cost = (output_tokens / 1_000_000) * output_rate_per_1m
total_cost = input_cost + output_cost
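As a worked example, applying these formulas to the gpt-5 rates from the sample configuration ($2.50 per 1M input tokens, $10.00 per 1M output tokens) for a call that consumed 1,000 input and 500 output tokens:

```python
# Worked example of the token-based formulas with the sample gpt-5 rates.
input_tokens, output_tokens = 1_000, 500
input_cost = (input_tokens / 1_000_000) * 2.50     # 0.0025 USD
output_cost = (output_tokens / 1_000_000) * 10.00  # 0.0050 USD
total_cost = input_cost + output_cost
print(f"${total_cost:.6f}")  # $0.007500
```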
Pricing Context (Token-Based)
- pricing_method: always "token_based" for OpenAI models
- input_rate_per_1m: input token rate per 1 million tokens (USD)
- output_rate_per_1m: output token rate per 1 million tokens (USD)
- input_tokens: number of input tokens consumed
- output_tokens: number of output tokens generated
- Cost of input tokens (USD, rounded to 10 decimals)
- Cost of output tokens (USD, rounded to 10 decimals)
Example Configuration (pricing.json)
{
"openai_token_pricing_per_1m": {
"gpt-5": {
"input": 2.50,
"output": 10.00,
"pricing_source_url": "https://openai.com/api/pricing/",
"pricing_updated_at": "2024-01-15"
},
"gpt-5.2": {
"input": 1.25,
"output": 5.00,
"pricing_source_url": "https://openai.com/api/pricing/",
"pricing_updated_at": "2024-01-15"
}
}
}
HuggingFace Endpoint Pricing
Allocation Modes
Two allocation modes are supported:
- runtime_proportional (default): Cost based on actual execution time
- amortized_window: Cost amortized over a time window and query count
Runtime Proportional Calculation
total_cost = hourly_rate * replicas * (execution_time_seconds / 3600)
Amortized Window Calculation
total_cost = (hourly_rate * replicas * active_hours_window) / processed_queries_window
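Plugging the sample endpoint configurations into these two formulas: a 12-second call on the mediphi endpoint ($7.09/hour, 1 replica, runtime_proportional), and one medgemma query amortized over a 24-hour window of 1,000 queries ($1.21/hour, 1 replica):

```python
# Runtime-proportional: mediphi endpoint ($7.09/hour, 1 replica), 12-second call.
runtime_cost = 7.09 * 1 * (12 / 3600)
print(f"${runtime_cost:.6f}")  # $0.023633

# Amortized window: medgemma endpoint ($1.21/hour, 1 replica),
# 24 active hours spread over 1,000 processed queries.
amortized_cost = (1.21 * 1 * 24.0) / 1000
print(f"${amortized_cost:.6f}")  # $0.029040
```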
Pricing Context (Endpoint Hourly)
- pricing_method: always "endpoint_hourly" for HuggingFace models
- cloud_provider: cloud provider (e.g., "aws", "gcp", "azure")
- instance_family: instance family (e.g., "p4d", "g5")
- instance_size: instance size (e.g., "24xlarge", "12xlarge")
- accelerator: GPU accelerator type (e.g., "A100", "A10G")
- gpu_count: number of GPUs per instance
- hourly_rate_usd_per_replica: hourly rate in USD per replica
- replicas: number of endpoint replicas
- allocation_mode: either "runtime_proportional" or "amortized_window"
- execution_time_seconds: execution time in seconds (runtime_proportional mode only)
- active_hours_window: total active hours in window (amortized_window mode only)
- processed_queries_window: total queries processed in window (amortized_window mode only)
- pricing_source_url: URL to pricing documentation
- pricing_updated_at: date when pricing was last updated
Example Configuration (pricing.json)
{
"huggingface_endpoints": {
"mediphi": {
"cloud_provider": "aws",
"instance_family": "g5",
"instance_size": "12xlarge",
"accelerator": "A10G",
"gpu_count": 4,
"vram_gb": 24,
"hourly_rate_usd": 7.09,
"replicas": 1,
"allocation_mode": "runtime_proportional",
"pricing_source_url": "https://aws.amazon.com/ec2/instance-types/g5/",
"pricing_updated_at": "2024-01-15"
},
"medgemma": {
"cloud_provider": "aws",
"instance_family": "g5",
"instance_size": "2xlarge",
"accelerator": "A10G",
"gpu_count": 1,
"vram_gb": 24,
"hourly_rate_usd": 1.21,
"replicas": 1,
"allocation_mode": "amortized_window",
"active_hours_window": 24.0,
"processed_queries_window": 1000,
"pricing_source_url": "https://aws.amazon.com/ec2/instance-types/g5/",
"pricing_updated_at": "2024-01-15"
}
}
}
get_pricing_config_summary()
Generate a summary of pricing configuration for documentation and traceability.
Signature
def get_pricing_config_summary() -> Dict[str, Any]
Returns
Dictionary containing:
openai_models: Dict of OpenAI model pricing configurations
huggingface_endpoints: Dict of HuggingFace endpoint configurations
Example
from src.common.pricing import get_pricing_config_summary
import json
summary = get_pricing_config_summary()
print(json.dumps(summary, indent=2))
# Output:
# {
# "openai_models": {
# "gpt-5": {
# "input_rate_per_1m": 2.5,
# "output_rate_per_1m": 10.0,
# "pricing_source_url": "https://openai.com/api/pricing/",
# "pricing_updated_at": "2024-01-15"
# }
# },
# "huggingface_endpoints": {
# "mediphi": {
# "cloud_provider": "aws",
# "instance_family": "g5",
# "hourly_rate_usd": 7.09,
# ...
# }
# }
# }
load_pricing_config()
Load pricing configuration from JSON file with safe defaults.
Signature
def load_pricing_config() -> Dict[str, Any]
Returns
Pricing configuration dictionary, or empty dict if file not found or invalid
Configuration File Location
- If the PRICING_CONFIG_PATH environment variable is set, that path is used
- Otherwise the default is used: {PROJECT_ROOT}/config/pricing.json
Error Handling
- Returns an empty dict {} if the file doesn't exist
- Returns an empty dict {} if the file contains invalid JSON
- Never raises exceptions; always returns a valid dict
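The safe-loading rules above (environment-variable override, default path, never raising) can be sketched as follows. This is an illustrative sketch, not the module's actual source; the function name and the explicit project_root parameter are assumptions for the example.

```python
import json
import os
from pathlib import Path
from typing import Any, Dict

def load_pricing_config_sketch(project_root: Path) -> Dict[str, Any]:
    """Sketch of the safe-loading rules: env override, default path, no exceptions."""
    # PRICING_CONFIG_PATH overrides the default {PROJECT_ROOT}/config/pricing.json.
    path = Path(os.environ.get("PRICING_CONFIG_PATH") or project_root / "config" / "pricing.json")
    try:
        with open(path, "r", encoding="utf-8") as f:
            config = json.load(f)
        # Guard against a JSON file whose top level is not an object.
        return config if isinstance(config, dict) else {}
    except (OSError, json.JSONDecodeError):
        # Missing file or invalid JSON: fall back to an empty dict, never raise.
        return {}
```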
Configuration File Structure
Complete Example
{
"openai_token_pricing_per_1m": {
"gpt-5": {
"input": 2.50,
"output": 10.00,
"pricing_source_url": "https://openai.com/api/pricing/",
"pricing_updated_at": "2024-01-15"
},
"gpt-5.2": {
"input": 1.25,
"output": 5.00,
"pricing_source_url": "https://openai.com/api/pricing/",
"pricing_updated_at": "2024-01-15"
}
},
"huggingface_endpoints": {
"mediphi": {
"cloud_provider": "aws",
"instance_family": "g5",
"instance_size": "12xlarge",
"accelerator": "A10G",
"gpu_count": 4,
"vram_gb": 24,
"hourly_rate_usd": 7.09,
"replicas": 1,
"allocation_mode": "runtime_proportional",
"pricing_source_url": "https://aws.amazon.com/ec2/instance-types/g5/",
"pricing_updated_at": "2024-01-15"
},
"medgemma": {
"cloud_provider": "aws",
"instance_family": "g5",
"instance_size": "2xlarge",
"accelerator": "A10G",
"gpu_count": 1,
"vram_gb": 24,
"hourly_rate_usd": 1.21,
"replicas": 1,
"allocation_mode": "amortized_window",
"active_hours_window": 24.0,
"processed_queries_window": 1000,
"pricing_source_url": "https://aws.amazon.com/ec2/instance-types/g5/",
"pricing_updated_at": "2024-01-15"
}
}
}
Usage Example
Complete Cost Tracking Workflow
from src.common.model_provider import create_llm, get_model_identity, MODELS_REGISTRY
from src.common.usage_metrics import extract_usage_from_ai_message, extract_cost_from_ai_message
from src.common.pricing import resolve_total_cost
import time
# Create LLM
config = MODELS_REGISTRY["gpt-5"]
llm = create_llm(config)
# Get model identity
identity = get_model_identity(model_name="gpt-5", llm=llm)
# Make inference call
start_time = time.time()
message = llm.invoke("Explain the pathophysiology of preeclampsia.")
execution_time = time.time() - start_time
# Extract usage and cost
usage = extract_usage_from_ai_message(message)
cost_info = extract_cost_from_ai_message(message)
# Resolve total cost
cost_result = resolve_total_cost(
provider=identity["provider"],
model_name=identity["model_name"],
model_id=identity["model_id"],
input_tokens=usage["input_tokens"],
output_tokens=usage["output_tokens"],
provider_reported_cost=cost_info["total_cost"],
provider_cost_source=cost_info["cost_source"],
execution_time_seconds=execution_time,
)
print(f"Model: {identity['model_name']}")
print(f"Tokens: {usage['input_tokens']} in / {usage['output_tokens']} out")
print(f"Cost: ${cost_result['total_cost']:.6f}")
print(f"Cost source: {cost_result['cost_source']}")
print(f"Execution time: {execution_time:.2f}s")