LLM Model Routing: Cut Costs 85% with Smart Model Selection

January 12, 2026

Model routing can cut LLM costs by up to 85% while retaining about 95% of the strong model's quality, according to the RouteLLM benchmarks (which used GPT-4 as the strong model). Instead of sending every request to your most expensive model, such as GPT-5.2, a router selects the cheapest model likely to handle each task well. This guide covers everything from basic concepts to production implementation.

Model Routing Quick Facts (January 2026)

  • Cost Reduction: Up to 85% on MT-Bench, 45% on MMLU benchmarks
  • Quality Retention: 95% of strong-model (GPT-4) performance with only 26% of calls sent to the strong model
  • Router Overhead: 11µs with modern LLM gateways (Bifrost)
  • Best Framework: RouteLLM (open-source, ICLR 2025 published)

What Is LLM Model Routing?

Model routing is the practice of dynamically selecting which LLM to use for each request based on task complexity, cost constraints, and quality requirements. Instead of sending every query to GPT-5.2 ($1.75 input / $14 output per 1M tokens), a router can send simple queries to GPT-5-mini ($0.25/$2) or GPT-5-nano ($0.05/$0.40).

The core insight: most queries don't need your most expensive model. Research from LMSYS shows that a well-trained router can achieve 95% of the strong model's quality (GPT-4 in their experiments) while using it for only 26% of requests.

Why Does Model Routing Matter?

  • Cost: 85% reduction possible on certain benchmarks
  • Latency: Smaller models respond faster
  • Scale: Handle more requests within budget
  • Quality: Match model capability to task complexity

The Strong vs Weak Model Paradigm

Most routing systems use a two-model setup: a strong (expensive) model and a weak (cheap) model. The router decides which to use for each query.

Role | Model Examples | Cost per 1M tokens (input / output) | Use Cases
Strong | GPT-5.2, Claude Opus 4.5 | $1.75-$5 / $14-$25 | Complex reasoning, creative writing, coding
Weak | GPT-5-mini, Claude Haiku 4.5 | $0.25-$1 / $2-$5 | Classification, extraction, simple Q&A
Ultra-Cheap | GPT-5-nano, Mistral Small | $0.05-$0.15 / $0.40-$0.60 | Formatting, basic summarization

The math is compelling: if 70% of your queries can be handled by GPT-5-mini instead of GPT-5.2, you save 70% × (1 - 0.25/1.75) = 60% on input costs alone.
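The same arithmetic as a small helper, in case you want to plug in your own traffic split and prices. The function name and parameters are illustrative, not part of any library.

```python
def input_cost_savings(share_to_weak: float, weak_price: float, strong_price: float) -> float:
    """Fraction of input spend saved versus sending every query to the strong model."""
    if not 0.0 <= share_to_weak <= 1.0:
        raise ValueError("share_to_weak must be between 0 and 1")
    # Each query moved to the weak model saves (1 - weak_price / strong_price) of its input cost
    return share_to_weak * (1.0 - weak_price / strong_price)


# 70% of queries moved from GPT-5.2 ($1.75) to GPT-5-mini ($0.25) input pricing
savings = input_cost_savings(0.70, weak_price=0.25, strong_price=1.75)
print(f"{savings:.0%}")  # 60%
```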

Types of Model Routers

1. Rule-Based Routers

The simplest approach: define explicit rules for routing.

# Simple rule-based router
def route_query(query: str, task_type: str) -> str:
    # Classification tasks → cheap model
    if task_type in ["classification", "extraction", "formatting"]:
        return "gpt-5-nano"

    # Moderate complexity → mid-tier model
    if task_type in ["summarization", "translation", "simple_qa"]:
        return "gpt-5-mini"

    # Complex tasks → expensive model
    if task_type in ["reasoning", "coding", "creative"]:
        return "gpt-5-2"

    # Default to mid-tier
    return "gpt-5-mini"

Pros: Simple, predictable, no overhead
Cons: Requires manual classification, misses nuance

2. ML-Based Routers (RouteLLM)

RouteLLM, developed by LMSYS (creators of Chatbot Arena), uses machine learning to predict which model will perform better for each query.

Four router architectures are available:

  • Matrix Factorization: Learns scoring function for model-query pairs (best performance)
  • Similarity-Weighted (SW) Ranking: Uses embedding similarity to training examples
  • BERT Classifier: Fine-tuned BERT predicts optimal model
  • Causal LLM Router: Uses Llama-3-8B as classifier
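
To make the similarity idea concrete, here is a toy sketch. It is not RouteLLM's implementation: it simply takes a similarity-weighted vote over labeled examples to estimate how often the strong model would win on a new query. The embeddings, the example data, and the function names are all assumptions for illustration.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def estimate_strong_win_rate(query_vec: list[float], examples: list[tuple[list[float], bool]]) -> float:
    """Similarity-weighted share of past examples where the strong model won."""
    weights = [max(cosine(query_vec, emb), 0.0) for emb, _ in examples]
    total = sum(weights)
    if total == 0:
        return 0.5  # no similar examples, so no evidence either way
    won = sum(w for w, (_, strong_won) in zip(weights, examples) if strong_won)
    return won / total


# Hypothetical 2-D "embeddings": first axis ~ simple lookups, second ~ hard reasoning
examples = [([1.0, 0.0], False), ([0.9, 0.1], False), ([0.0, 1.0], True)]
print(estimate_strong_win_rate([1.0, 0.0], examples))  # 0.0
print(estimate_strong_win_rate([0.0, 1.0], examples) > 0.8)  # True
```

A real router would use learned embeddings and far more preference data; the point is only that nearby training examples dominate the estimate.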

3. LLM-Based Routers

Use a small LLM to classify queries before routing. NVIDIA's LLM Router Blueprint uses Qwen 1.75B for intent classification.

# LLM-based router (simplified)
def llm_router(query: str) -> str:
    classification_prompt = f"""Classify this query's complexity:
Query: {query}

Options:
- SIMPLE: Basic facts, formatting, extraction
- MODERATE: Summarization, translation, simple analysis
- COMPLEX: Reasoning, coding, creative writing

Return only the classification."""

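    # small_llm is a placeholder for any client whose generate(prompt) returns a string
    complexity = small_llm.generate(classification_prompt)

    routing_map = {
        "SIMPLE": "gpt-5-nano",
        "MODERATE": "gpt-5-mini",
        "COMPLEX": "gpt-5-2"
    }
    # Normalize case and whitespace; unexpected labels fall back to the mid-tier model
    return routing_map.get(complexity.strip().upper(), "gpt-5-mini")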

Implementing RouteLLM

RouteLLM provides pre-trained routers that, in the published benchmarks, cut costs by up to 85% on MT-Bench while maintaining 95% of GPT-4 quality. Results with other model pairs may differ. Here's how to implement it:

Installation

pip install routellm

Basic Usage

import os
from routellm.controller import Controller

# Set API keys
os.environ["OPENAI_API_KEY"] = "sk-..."

# Initialize router with matrix factorization (best performing)
client = Controller(
    routers=["mf"],  # Matrix factorization router
    strong_model="gpt-5-2",
    weak_model="gpt-5-mini",
)

# Make a routed request
response = client.chat.completions.create(
    model="router-mf-0.5",  # 0.5 = cost threshold
    messages=[
        {"role": "user", "content": "What is the capital of France?"}
    ]
)

# A simple query like this will typically be routed to gpt-5-mini

Adjusting the Cost Threshold

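The number at the end of the model name is the cost threshold (0.0-1.0), which controls the quality-cost trade-off. RouteLLM sends a query to the strong model when the router's predicted strong-model win rate meets or exceeds the threshold, so a higher threshold means fewer strong-model calls:

  • 0.0: Always use strong model (expensive, highest quality)
  • 0.5: Use the strong model only when it is predicted to win at least half the time
  • 1.0: Effectively always use weak model (cheapest, lowest quality)

The useful range depends on the router and your traffic, so calibrate the threshold on a sample of real queries rather than assuming 0.5 is balanced.

# More aggressive cost savings (more weak model usage)
model="router-mf-0.7"

# Higher quality priority (more strong model usage)
model="router-mf-0.3"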
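
A minimal sketch of how a threshold can be applied and calibrated. It assumes a router that scores each query with a predicted strong-model win rate and uses the strong model when that score meets the threshold. The function names and sample scores are hypothetical, and this is not RouteLLM's own calibration code.

```python
import math


def route_by_threshold(strong_win_rate: float, threshold: float) -> str:
    """Assumed rule: use the strong model when the predicted win rate meets the threshold."""
    return "strong" if strong_win_rate >= threshold else "weak"


def calibrate_threshold(sample_win_rates: list[float], target_strong_share: float) -> float:
    """Pick a threshold that sends roughly target_strong_share of the sample to the strong model."""
    if not sample_win_rates:
        raise ValueError("need at least one sample score")
    ranked = sorted(sample_win_rates, reverse=True)
    n_strong = min(round(target_strong_share * len(ranked)), len(ranked))
    if n_strong <= 0:
        return math.inf  # no score can reach this, so everything goes to the weak model
    # The n-th highest score becomes the cut-off (ties can push the share slightly higher)
    return ranked[n_strong - 1]


# Hypothetical router scores for five logged queries
scores = [0.05, 0.10, 0.20, 0.40, 0.80]
threshold = calibrate_threshold(scores, target_strong_share=0.4)
routes = [route_by_threshold(s, threshold) for s in scores]
print(threshold, routes.count("strong") / len(routes))  # 0.4 0.4
```

Under this rule, raising the threshold sends fewer queries to the strong model, which is why calibrating against real traffic matters more than the absolute number.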

Benchmark Results: How Much Can You Save?

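RouteLLM published results at ICLR 2025 showing significant cost reductions. These figures were measured with GPT-4 as the strong model and Mixtral 8x7B as the weak model:

Benchmark | Cost Reduction | Quality Retained | Strong Model Calls
MT-Bench | 85% | 95% | 14%
MMLU | 45% | 95% | ~50%
GSM8K | 35% | 95% | ~60%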

The matrix factorization router achieved 95% of GPT-4 performance using only 26% GPT-4 calls, which is approximately 48% cheaper than a random baseline.

With data augmentation from an LLM judge, the same router achieved 95% quality with only 14% strong model calls, about 75% cheaper than the random baseline.

Task Classification: What Goes Where?

Route to Cheap Models (GPT-5-nano, Haiku 4.5)

  • Text classification and sentiment analysis
  • Entity extraction and NER
  • Format conversion (JSON, XML, Markdown)
  • Simple keyword extraction
  • Basic text reformatting

Route to Mid-Tier Models (GPT-5-mini, Sonnet 4.5)

  • Summarization of documents
  • Translation between languages
  • Simple question answering
  • Content moderation
  • Basic code explanation

Route to Expensive Models (GPT-5.2, Opus 4.5)

  • Complex multi-step reasoning
  • Code generation and debugging
  • Creative writing and content creation
  • Mathematical problem solving
  • Strategic analysis and planning

LLM Gateways for Production Routing

For production deployments, LLM gateways provide routing, observability, and cost controls in one package.

Top LLM Gateways (2026)

Gateway | Routing Overhead | Key Features
Bifrost (Maxim AI) | 11µs | Zero-config, enterprise features
OpenRouter | ~50ms | Multi-provider, unified API
LiteLLM | ~10ms | Open-source, 100+ providers

LLM gateways can cut token spend by 30-50% through automatic routing, policy enforcement (e.g., capping GPT-5.2 at 20% of calls), and real-time spend tracking.
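
As an illustration of what a cap policy involves, here is a minimal in-process sketch. The class and its behavior are assumptions for this example and do not reflect how any particular gateway implements caps.

```python
class StrongModelCap:
    """Downgrade strong-model requests once their share of traffic would exceed a cap."""

    def __init__(self, max_strong_share: float = 0.20):
        self.max_strong_share = max_strong_share
        self.total = 0
        self.strong = 0

    def choose(self, requested_tier: str) -> str:
        """Return the tier to actually use for a request asking for 'strong' or 'weak'."""
        self.total += 1
        if requested_tier != "strong":
            return requested_tier
        # Allow the strong call only if the running share stays within the cap
        if (self.strong + 1) / self.total <= self.max_strong_share:
            self.strong += 1
            return "strong"
        return "weak"


cap = StrongModelCap(max_strong_share=0.20)
decisions = [cap.choose("strong") for _ in range(10)]
print(decisions.count("strong"))  # 2
```

A production policy would usually track the share over a sliding time window rather than since startup, so that early traffic does not permanently skew the ratio.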

Provider Pricing for Routing Decisions

Understanding pricing tiers helps optimize routing decisions:

Provider | Strong Model (input/output per 1M tokens) | Weak Model (input/output per 1M tokens) | Price Ratio
OpenAI | GPT-5.2 ($1.75/$14) | GPT-5-mini ($0.25/$2) | 7x cheaper
Anthropic | Opus 4.5 ($5/$25) | Haiku 4.5 ($1/$5) | 5x cheaper
Google | Gemini 2.0 Pro ($1.25/$5) | Gemini 2.0 Flash ($0.075/$0.30) | 16x cheaper

Google's Gemini offers the largest price differential (16x) between strong and weak models, making it ideal for aggressive routing strategies.
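
The ratios can be recomputed from the listed prices with a few lines. The dictionary layout and function name are illustrative, and the prices are the ones quoted in this article.

```python
# (input, output) USD per 1M tokens, as quoted in this article
PROVIDER_PRICES = {
    "OpenAI": {"strong": (1.75, 14.00), "weak": (0.25, 2.00)},
    "Anthropic": {"strong": (5.00, 25.00), "weak": (1.00, 5.00)},
    "Google": {"strong": (1.25, 5.00), "weak": (0.075, 0.30)},
}


def price_ratios(strong: tuple[float, float], weak: tuple[float, float]) -> tuple[float, float]:
    """Return (input ratio, output ratio) of strong-model to weak-model prices."""
    return strong[0] / weak[0], strong[1] / weak[1]


for provider, tiers in PROVIDER_PRICES.items():
    in_ratio, out_ratio = price_ratios(tiers["strong"], tiers["weak"])
    print(f"{provider}: {in_ratio:.1f}x input, {out_ratio:.1f}x output")
```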

Complete Implementation Example

Here's a compact end-to-end routing implementation you can adapt for production:

from typing import Literal
from openai import OpenAI

class LLMRouter:
    def __init__(self):
        # OpenAI() reads OPENAI_API_KEY from the environment
        self.openai = OpenAI()

        # Define model tiers
        self.models = {
            "cheap": "gpt-5-nano",
            "mid": "gpt-5-mini",
            "expensive": "gpt-5-2"
        }

    def classify_complexity(self, query: str) -> Literal["cheap", "mid", "expensive"]:
        """Use a cheap model to classify query complexity."""
        response = self.openai.chat.completions.create(
            model="gpt-5-nano",
            messages=[{
                "role": "system",
                "content": """Classify query complexity:
- cheap: classification, extraction, formatting
- mid: summarization, translation, simple QA
- expensive: reasoning, coding, creative
Return only: cheap, mid, or expensive"""
            }, {
                "role": "user",
                "content": query
            }],
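            max_tokens=10  # some newer models expect max_completion_tokens instead
        )
        label = (response.choices[0].message.content or "").strip().lower()
        # Fall back to the mid tier if the classifier returns an unexpected label
        return label if label in self.models else "mid"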

    def route(self, query: str, messages: list = None):
        """Route query to appropriate model."""
        complexity = self.classify_complexity(query)
        model = self.models[complexity]

        if messages is None:
            messages = [{"role": "user", "content": query}]

        response = self.openai.chat.completions.create(
            model=model,
            messages=messages
        )

        return {
            "response": response.choices[0].message.content,
            "model_used": model,
            "complexity": complexity,
            "input_tokens": response.usage.prompt_tokens,
            "output_tokens": response.usage.completion_tokens
        }

# Usage
router = LLMRouter()

# Simple query → should route to a cheaper tier (exact tier depends on the classifier)
result = router.route("What is 2+2?")
print(f"Model: {result['model_used']}")  # e.g. gpt-5-nano

# Complex query → should route to gpt-5-2
result = router.route("Write a recursive algorithm to solve the Tower of Hanoi problem and explain its time complexity.")
print(f"Model: {result['model_used']}")  # e.g. gpt-5-2

Monitoring and Optimization

Effective routing requires ongoing monitoring:

Key Metrics to Track

  • Routing distribution: % of queries to each model tier
  • Quality scores: User feedback, task success rates
  • Cost per query: Average spend across tiers
  • Router accuracy: Did cheap models succeed on routed queries?
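
These metrics can be computed from request logs with very little code. The sketch below assumes a hypothetical log schema in which each entry has `tier`, `cost_usd`, and `success` fields; adapt the field names to whatever your logging actually records.

```python
from collections import Counter


def routing_metrics(log: list[dict]) -> dict:
    """Summarize routing logs with 'tier', 'cost_usd', and 'success' keys (assumed schema)."""
    if not log:
        raise ValueError("log is empty")
    total = len(log)
    counts = Counter(entry["tier"] for entry in log)
    cheap = [entry for entry in log if entry["tier"] == "cheap"]
    return {
        # Share of queries sent to each tier
        "distribution": {tier: n / total for tier, n in counts.items()},
        # Average spend per query across all tiers
        "cost_per_query": sum(entry["cost_usd"] for entry in log) / total,
        # Router accuracy proxy: how often cheap-tier queries succeeded
        "cheap_success_rate": (
            sum(1 for entry in cheap if entry["success"]) / len(cheap) if cheap else None
        ),
    }


sample_log = [
    {"tier": "cheap", "cost_usd": 0.001, "success": True},
    {"tier": "cheap", "cost_usd": 0.001, "success": False},
    {"tier": "mid", "cost_usd": 0.002, "success": True},
    {"tier": "expensive", "cost_usd": 0.012, "success": True},
]
metrics = routing_metrics(sample_log)
print(metrics["distribution"]["cheap"], metrics["cheap_success_rate"])  # 0.5 0.5
```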

Optimization Strategies

  1. Start conservative: Route 20% to cheap models, measure quality
  2. Expand gradually: Increase cheap routing as confidence grows
  3. Monitor failures: Track when cheap models fail tasks
  4. A/B test thresholds: Compare different routing thresholds
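
For the A/B testing step, one common approach is to assign each user to a threshold variant deterministically, so the same user always sees the same routing behavior. This is a sketch under that assumption; the function name and user ID format are hypothetical.

```python
import hashlib


def assign_threshold(user_id: str, thresholds: list[float]) -> float:
    """Deterministically map a user to one of the threshold variants."""
    if not thresholds:
        raise ValueError("need at least one threshold variant")
    # Hash the ID so assignment is stable across processes and restarts
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % len(thresholds)
    return thresholds[bucket]


variants = [0.3, 0.5, 0.7]
print(assign_threshold("user-42", variants) == assign_threshold("user-42", variants))  # True
```

Comparing quality and cost metrics per variant then shows which threshold gives the better trade-off on your own traffic.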

Common Mistakes to Avoid

1. Over-Routing to Cheap Models

Aggressive cost cutting can hurt user experience. Start with a conservative threshold that keeps most traffic on the strong model, then shift more queries to cheaper models as your quality metrics allow.

2. Ignoring Router Overhead

If your router adds 100ms latency, ensure the cost savings justify it. Modern gateways like Bifrost add only 11µs.
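
Before trusting any overhead figure, measure your own router. A minimal timing sketch, assuming your router is a plain callable that takes a query string (the helper name is hypothetical):

```python
import statistics
import time


def median_overhead_us(route_fn, queries: list[str], repeats: int = 100) -> float:
    """Median wall-clock time of route_fn per call, in microseconds."""
    samples = []
    for _ in range(repeats):
        for query in queries:
            start = time.perf_counter()
            route_fn(query)
            samples.append((time.perf_counter() - start) * 1e6)
    return statistics.median(samples)


# Stand-in router that always picks the same model; swap in your real routing function
overhead = median_overhead_us(lambda q: "gpt-5-mini", ["What is 2+2?", "Summarize this memo."])
print(f"{overhead:.2f} µs per routing decision")
```

This measures only the routing decision itself. If the router calls an LLM or an embedding API, time that path separately, since network latency will dominate.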

3. Not Monitoring Quality

Cost reduction is meaningless if quality suffers. Track user satisfaction alongside spend.

4. Static Rules for Dynamic Content

Rule-based routers miss nuance. A question that looks "simple" but is about quantum physics may still need a strong model.

Real Cost Calculation

Let's calculate savings for an app with 1M requests/month, average 1K input + 500 output tokens per request:

Scenario | Model Mix | Monthly Cost
All GPT-5.2 | 100% strong | $8,750
Basic Routing | 30% strong, 70% mini | $3,500
Aggressive Routing | 15% strong, 50% mini, 35% nano | $2,025

Savings with aggressive routing: $6,725/month (77%)
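
To run the same calculation for your own traffic, here is a small calculator. It uses the per-token prices quoted earlier in this article; the function name, the price table keys, and the defaults are illustrative.

```python
# (input, output) USD per 1M tokens, as quoted in this article
MODEL_PRICES = {
    "gpt-5.2": (1.75, 14.00),
    "gpt-5-mini": (0.25, 2.00),
    "gpt-5-nano": (0.05, 0.40),
}


def monthly_cost(mix: dict[str, float], requests: int = 1_000_000,
                 input_tokens: int = 1_000, output_tokens: int = 500) -> float:
    """Monthly spend in USD for a traffic mix given as {model: share of requests}."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("mix shares must sum to 1")
    total = 0.0
    for model, share in mix.items():
        in_price, out_price = MODEL_PRICES[model]
        # Cost of one request on this model
        per_request = (input_tokens * in_price + output_tokens * out_price) / 1_000_000
        total += share * requests * per_request
    return total


scenarios = {
    "All GPT-5.2": {"gpt-5.2": 1.0},
    "Basic Routing": {"gpt-5.2": 0.30, "gpt-5-mini": 0.70},
    "Aggressive Routing": {"gpt-5.2": 0.15, "gpt-5-mini": 0.50, "gpt-5-nano": 0.35},
}
for name, mix in scenarios.items():
    print(f"{name}: ${monthly_cost(mix):,.0f}")
```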

Next Steps

  1. Audit your queries: What % are simple vs complex?
  2. Start with RouteLLM: Use pre-trained routers for quick wins
  3. Monitor quality: Ensure routing doesn't hurt user experience
  4. Iterate thresholds: Adjust based on real data
  5. Consider gateways: For production, use Bifrost or LiteLLM

For related optimization strategies, see our Prompt Caching Guide (50-90% savings) and Complete LLM Cost Optimization Guide.

Questions? Check our SDK documentation or use our LLM Cost Calculator to estimate your savings.