Model routing can reduce your LLM costs by up to 85% while maintaining 95% of GPT-5.2 quality. Instead of sending every request to your most expensive model, a router intelligently selects the right model for each task. This guide covers everything from basic concepts to production implementation.
Model Routing Quick Facts (January 2026)
- Cost Reduction: Up to 85% on MT-Bench, 45% on MMLU benchmarks
- Quality Retention: 95% of the strong model's quality with only 26% strong-model calls (RouteLLM, measured with GPT-4)
- Router Overhead: 11µs with modern LLM gateways (Bifrost)
- Best Framework: RouteLLM (open-source, ICLR 2025 published)
What Is LLM Model Routing?
Model routing is the practice of dynamically selecting which LLM to use for each request based on task complexity, cost constraints, and quality requirements. Instead of sending every query to GPT-5.2 ($1.75/$14 per 1M tokens), a router can send simple queries to GPT-5-mini ($0.25/$2) or GPT-5-nano ($0.05/$0.40).
The core insight: most queries don't need your most expensive model. Research from LMSYS shows that a well-trained router can retain 95% of the strong model's quality while sending it only 26% of requests (the published experiments used GPT-4 as the strong model).
Why Does Model Routing Matter?
- Cost: 85% reduction possible on certain benchmarks
- Latency: Smaller models respond faster
- Scale: Handle more requests within budget
- Quality: Match model capability to task complexity
The Strong vs Weak Model Paradigm
Most routing systems use a two-model setup: a strong (expensive) model and a weak (cheap) model. The router decides which to use for each query.
| Role | Model Examples | Cost (per 1M tokens) | Use Cases |
|---|---|---|---|
| Strong | GPT-5.2, Claude Opus 4.5 | $1.75-$5 / $14-$25 | Complex reasoning, creative writing, coding |
| Weak | GPT-5-mini, Claude Haiku 4.5 | $0.25-$1 / $2-$5 | Classification, extraction, simple Q&A |
| Ultra-Cheap | GPT-5-nano, Mistral Small | $0.05-$0.15 / $0.40-$0.60 | Formatting, basic summarization |
The math is compelling: if 70% of your queries can be handled by GPT-5-mini instead of GPT-5.2, you save 70% × (1 - 0.25/1.75) = 60% on input costs alone.
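As a sanity check, that savings formula is easy to verify in a few lines of Python (input prices per 1M tokens, taken from the table above):

```python
strong_in = 1.75   # GPT-5.2 input price, $ per 1M tokens
weak_in = 0.25     # GPT-5-mini input price, $ per 1M tokens
weak_share = 0.70  # fraction of traffic the weak model can absorb

# Savings on input spend vs. sending everything to the strong model
savings = weak_share * (1 - weak_in / strong_in)
print(f"{savings:.0%}")  # -> 60%
```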
Types of Model Routers
1. Rule-Based Routers
The simplest approach: define explicit rules for routing.
```python
# Simple rule-based router
def route_query(query: str, task_type: str) -> str:
    # Classification tasks → cheap model
    if task_type in ["classification", "extraction", "formatting"]:
        return "gpt-5-nano"
    # Moderate complexity → mid-tier model
    if task_type in ["summarization", "translation", "simple_qa"]:
        return "gpt-5-mini"
    # Complex tasks → expensive model
    if task_type in ["reasoning", "coding", "creative"]:
        return "gpt-5-2"
    # Default to mid-tier
    return "gpt-5-mini"
```
Pros: Simple, predictable, no overhead
Cons: Requires manual classification, misses nuance
2. ML-Based Routers (RouteLLM)
RouteLLM, developed by LMSYS (creators of Chatbot Arena), uses machine learning to predict which model will perform better for each query.
Four router architectures are available:
- Matrix Factorization: Learns scoring function for model-query pairs (best performance)
- Similarity-Weighted (SW) Ranking: Uses embedding similarity to training examples
- BERT Classifier: Fine-tuned BERT predicts optimal model
- Causal LLM Router: Uses Llama-3-8B as classifier
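If you want to experiment across architectures, the controller accepts several routers at once, and you pick one per request via the model string. This sketch assumes the router identifiers published in the RouteLLM repo ("mf", "sw_ranking", "bert", "causal_llm"):

```python
from routellm.controller import Controller

# One controller, several router architectures; "mf" (matrix factorization)
# is the best performer in the published benchmarks.
client = Controller(
    routers=["mf", "sw_ranking", "bert", "causal_llm"],
    strong_model="gpt-5-2",
    weak_model="gpt-5-mini",
)

# Select the architecture (and threshold) per request, e.g. "router-bert-0.4"
```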
3. LLM-Based Routers
Use a small LLM to classify queries before routing. NVIDIA's LLM Router Blueprint uses Qwen 1.75B for intent classification.
```python
# LLM-based router (simplified); small_llm is a placeholder client for
# whatever small classifier model you deploy
def llm_router(query: str) -> str:
    classification_prompt = f"""Classify this query's complexity:
Query: {query}
Options:
- SIMPLE: Basic facts, formatting, extraction
- MODERATE: Summarization, translation, simple analysis
- COMPLEX: Reasoning, coding, creative writing
Return only the classification."""
    complexity = small_llm.generate(classification_prompt)
    routing_map = {
        "SIMPLE": "gpt-5-nano",
        "MODERATE": "gpt-5-mini",
        "COMPLEX": "gpt-5-2",
    }
    # Fall back to the mid tier on unexpected output
    return routing_map.get(complexity.strip(), "gpt-5-mini")
```
Implementing RouteLLM
RouteLLM provides pre-trained routers that achieve 85% cost reduction on MT-Bench while maintaining 95% quality. Here's how to implement it:
Installation
```bash
pip install routellm
```
Basic Usage
```python
import os
from routellm.controller import Controller

# Set API keys
os.environ["OPENAI_API_KEY"] = "sk-..."

# Initialize router with matrix factorization (best performing)
client = Controller(
    routers=["mf"],  # matrix factorization router
    strong_model="gpt-5-2",
    weak_model="gpt-5-mini",
)

# Make a routed request
response = client.chat.completions.create(
    model="router-mf-0.5",  # 0.5 = cost threshold
    messages=[
        {"role": "user", "content": "What is the capital of France?"}
    ],
)
# This simple query routes to gpt-5-mini automatically
```
Adjusting the Cost Threshold
The threshold (0.0-1.0) is compared against the router's predicted win rate for the strong model: queries that score above it are routed to the strong model, everything else goes to the weak model. In practice:
- 0.0: every query routes to the strong model (highest quality, highest cost)
- 1.0: every query routes to the weak model (cheapest, lowest quality)
- In between: raise the threshold for more savings, lower it for more quality

Calibrated values are router- and workload-specific, so treat these as illustrative:

```python
# Higher threshold → more weak-model usage (more aggressive cost savings)
model="router-mf-0.7"

# Lower threshold → more strong-model usage (higher quality priority)
model="router-mf-0.3"
```
Benchmark Results: How Much Can You Save?
RouteLLM published results at ICLR 2025 showing significant cost reductions:
| Benchmark | Cost Reduction | Quality Retained | Strong Model Calls |
|---|---|---|---|
| MT-Bench | 85% | 95% | 14% |
| MMLU | 45% | 95% | ~50% |
| GSM8K | 35% | 95% | ~60% |
The matrix factorization router achieved 95% of GPT-4 performance using only 26% GPT-4 calls, which is approximately 48% cheaper than a random baseline.
With data augmentation from an LLM judge, the same router achieved 95% quality with only 14% strong-model calls, a 75% cost reduction relative to the random-routing baseline.
Task Classification: What Goes Where?
Route to Cheap Models (GPT-5-nano, Haiku 4.5)
- Text classification and sentiment analysis
- Entity extraction and NER
- Format conversion (JSON, XML, Markdown)
- Simple keyword extraction
- Basic text reformatting
Route to Mid-Tier Models (GPT-5-mini, Sonnet 4.5)
- Summarization of documents
- Translation between languages
- Simple question answering
- Content moderation
- Basic code explanation
Route to Expensive Models (GPT-5.2, Opus 4.5)
- Complex multi-step reasoning
- Code generation and debugging
- Creative writing and content creation
- Mathematical problem solving
- Strategic analysis and planning
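If you don't have explicit task labels, a cheap heuristic pre-filter can approximate these buckets before any model is called. The keywords and length cutoff below are illustrative assumptions, not benchmark-derived; the tier names match the router examples in this guide:

```python
import re

def heuristic_tier(query: str) -> str:
    # Code fences, stack traces, proofs, or multi-step phrasing suggest hard tasks
    if re.search(r"```|traceback|step[- ]by[- ]step|prove|debug", query, re.IGNORECASE):
        return "expensive"
    # Long inputs usually indicate summarization/translation-style work
    if len(query.split()) > 200:
        return "mid"
    # Short factual or formatting requests default to the cheap tier
    return "cheap"
```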
LLM Gateways for Production Routing
For production deployments, LLM gateways provide routing, observability, and cost controls in one package.
Top LLM Gateways (2026)
| Gateway | Routing Overhead | Key Features |
|---|---|---|
| Bifrost (Maxim AI) | 11µs | Zero-config, enterprise features |
| OpenRouter | ~50ms | Multi-provider, unified API |
| LiteLLM | ~10ms | Open-source, 100+ providers |
LLM gateways can cut token spend by 30-50% through automatic routing, policy enforcement (e.g., capping GPT-5.2 calls at 20% of traffic), and real-time spend tracking.
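Gateways enforce such policies internally, but the core idea fits in a few lines. Here is a hypothetical traffic-share guard (class and method names are ours, not from any gateway SDK) that demotes requests to the weak model once strong-model usage hits a cap, mirroring the 20% example above:

```python
class StrongModelCap:
    """Demote requests to the weak model once strong-model traffic share hits a cap."""

    def __init__(self, cap: float = 0.20):
        self.cap = cap      # max fraction of traffic allowed on the strong model
        self.total = 0
        self.strong = 0

    def choose(self, wants_strong: bool) -> str:
        self.total += 1
        if wants_strong and self.strong / self.total < self.cap:
            self.strong += 1
            return "gpt-5-2"
        return "gpt-5-mini"

guard = StrongModelCap(cap=0.20)
print(guard.choose(wants_strong=True))  # "gpt-5-2" until the 20% cap is reached
```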
Provider Pricing for Routing Decisions
Understanding pricing tiers helps optimize routing decisions:
| Provider | Strong Model | Weak Model | Price Ratio |
|---|---|---|---|
| OpenAI | GPT-5.2 ($1.75/$14) | GPT-5-mini ($0.25/$2) | 7x cheaper |
| Anthropic | Opus 4.5 ($5/$25) | Haiku 4.5 ($1/$5) | 5x cheaper |
| Google | Gemini 2.0 Pro ($1.25/$5) | Gemini 2.0 Flash ($0.075/$0.30) | 16x cheaper |
Google's Gemini offers the largest price differential (16x) between strong and weak models, making it ideal for aggressive routing strategies.
Complete Implementation Example
Here's a production-ready routing implementation:
```python
from typing import Literal

from openai import OpenAI

class LLMRouter:
    def __init__(self):
        self.openai = OpenAI()
        # Define model tiers
        self.models = {
            "cheap": "gpt-5-nano",
            "mid": "gpt-5-mini",
            "expensive": "gpt-5-2",
        }

    def classify_complexity(self, query: str) -> Literal["cheap", "mid", "expensive"]:
        """Use a cheap model to classify query complexity."""
        response = self.openai.chat.completions.create(
            model="gpt-5-nano",
            messages=[{
                "role": "system",
                "content": """Classify query complexity:
- cheap: classification, extraction, formatting
- mid: summarization, translation, simple QA
- expensive: reasoning, coding, creative
Return only: cheap, mid, or expensive"""
            }, {
                "role": "user",
                "content": query
            }],
            max_tokens=10,
        )
        label = response.choices[0].message.content.strip().lower()
        # Fall back to the mid tier if the classifier returns anything unexpected
        return label if label in self.models else "mid"

    def route(self, query: str, messages: list | None = None):
        """Route query to the appropriate model."""
        complexity = self.classify_complexity(query)
        model = self.models[complexity]
        if messages is None:
            messages = [{"role": "user", "content": query}]
        response = self.openai.chat.completions.create(
            model=model,
            messages=messages,
        )
        return {
            "response": response.choices[0].message.content,
            "model_used": model,
            "complexity": complexity,
            "input_tokens": response.usage.prompt_tokens,
            "output_tokens": response.usage.completion_tokens,
        }

# Usage
router = LLMRouter()

# Simple query → routes to gpt-5-nano
result = router.route("What is 2+2?")
print(f"Model: {result['model_used']}")  # gpt-5-nano

# Complex query → routes to gpt-5-2
result = router.route(
    "Write a recursive algorithm to solve the Tower of Hanoi problem "
    "and explain its time complexity."
)
print(f"Model: {result['model_used']}")  # gpt-5-2
```
Monitoring and Optimization
Effective routing requires ongoing monitoring:
Key Metrics to Track
- Routing distribution: % of queries to each model tier
- Quality scores: User feedback, task success rates
- Cost per query: Average spend across tiers
- Router accuracy: Did cheap models succeed on routed queries?
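None of these require heavy tooling to get started; an in-process tracker along these lines (class and method names are ours, purely illustrative) covers the first three metrics:

```python
from collections import Counter, defaultdict

class RoutingMetrics:
    def __init__(self):
        self.calls = Counter()           # request count per model
        self.spend = defaultdict(float)  # dollars per model

    def record(self, model: str, cost_usd: float):
        self.calls[model] += 1
        self.spend[model] += cost_usd

    def report(self):
        total = sum(self.calls.values())
        for model, n in self.calls.most_common():
            print(f"{model}: {n / total:.0%} of traffic, "
                  f"${self.spend[model] / n:.5f} avg cost/query")

metrics = RoutingMetrics()
metrics.record("gpt-5-mini", 0.00125)
metrics.record("gpt-5-2", 0.00875)
metrics.report()
```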
Optimization Strategies
- Start conservative: Route 20% to cheap models, measure quality
- Expand gradually: Increase cheap routing as confidence grows
- Monitor failures: Track when cheap models fail tasks
- A/B test thresholds: Compare different routing thresholds
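For the last point, an A/B test can be as simple as hashing a stable user ID into arms; the bucket split and threshold values below are illustrative:

```python
import hashlib

def threshold_for(user_id: str) -> float:
    # Deterministic 50/50 split on a stable identifier
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 2
    return 0.3 if bucket == 0 else 0.7

model = f"router-mf-{threshold_for('user-42')}"  # e.g. "router-mf-0.7"
```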
Common Mistakes to Avoid
1. Over-Routing to Cheap Models
Aggressive cost cutting can hurt user experience. Start with a calibrated, balanced threshold (for example, one targeting roughly 50% strong-model calls) and tighten it only as quality metrics hold up.
2. Ignoring Router Overhead
If your router adds 100ms latency, ensure the cost savings justify it. Modern gateways like Bifrost add only 11µs.
3. Not Monitoring Quality
Cost reduction is meaningless if quality suffers. Track user satisfaction alongside spend.
4. Static Rules for Dynamic Content
Rule-based routers miss nuance. A "simple" question about quantum physics needs a strong model.
Real Cost Calculation
Let's calculate savings for an app with 1M requests/month, average 1K input + 500 output tokens per request:
| Scenario | Model Mix | Monthly Cost |
|---|---|---|
| All GPT-5.2 | 100% strong | $8,750 |
| Basic Routing | 30% strong, 70% mini | $3,500 |
| Aggressive Routing | 15% strong, 50% mini, 35% nano | $2,025 |
Savings with aggressive routing: $6,725/month (77%)
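The table falls out of a few lines of arithmetic (per-1M-token prices from the pricing section, 1K input and 500 output tokens per request):

```python
# Dollars per request for a tier, given $/1M-token input and output prices
def per_request(in_price: float, out_price: float) -> float:
    return 1_000 * in_price / 1e6 + 500 * out_price / 1e6

strong = per_request(1.75, 14.00)  # GPT-5.2    -> $0.00875
mini = per_request(0.25, 2.00)     # GPT-5-mini -> $0.00125
nano = per_request(0.05, 0.40)     # GPT-5-nano -> $0.00025

n = 1_000_000  # requests per month
print(n * strong)                                       # 8750.0 (all GPT-5.2)
print(n * (0.30 * strong + 0.70 * mini))                # 3500.0 (basic routing)
print(n * (0.15 * strong + 0.50 * mini + 0.35 * nano))  # 2025.0 (aggressive)
```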
Track Your Routing Performance with Burnwise
Monitor model distribution, cost per query, and quality metrics across all providers. See which queries could be routed cheaper.
Next Steps
- Audit your queries: What % are simple vs complex?
- Start with RouteLLM: Use pre-trained routers for quick wins
- Monitor quality: Ensure routing doesn't hurt user experience
- Iterate thresholds: Adjust based on real data
- Consider gateways: For production, use Bifrost or LiteLLM
For related optimization strategies, see our Prompt Caching Guide (50-90% savings) and Complete LLM Cost Optimization Guide.
Questions? Check our SDK documentation or use our LLM Cost Calculator to estimate your savings.