Model routing can reduce your LLM costs by up to 85% while maintaining 95% of GPT-5.2 quality. Instead of sending every request to your most expensive model, a router intelligently selects the right model for each task. This guide covers everything from basic concepts to production implementation.
Model Routing Quick Facts (January 2026)
- Cost Reduction: Up to 85% on MT-Bench, 45% on MMLU benchmarks
- Quality Retention: 95% of the strong model's quality with only 26% strong-model calls (RouteLLM, measured with GPT-4)
- Router Overhead: 11µs with modern LLM gateways (Bifrost)
- Best Framework: RouteLLM (open-source, ICLR 2025 published)
What Is LLM Model Routing?
Model routing is the practice of dynamically selecting which LLM to use for each request based on task complexity, cost constraints, and quality requirements. Instead of sending every query to GPT-5.2 ($1.75/$14 per 1M tokens), a router can send simple queries to GPT-5-mini ($0.25/$2) or GPT-5-nano ($0.05/$0.40).
The core insight: most queries don't need your most expensive model. Research from LMSYS shows that a well-trained router can retain 95% of the strong model's quality while sending it only 26% of requests (the published experiments used GPT-4 as the strong model).
Why Does Model Routing Matter?
- Cost: 85% reduction possible on certain benchmarks
- Latency: Smaller models respond faster
- Scale: Handle more requests within budget
- Quality: Match model capability to task complexity
The Strong vs Weak Model Paradigm
Most routing systems use a two-model setup: a strong (expensive) model and a weak (cheap) model. The router decides which to use for each query.
| Role | Model Examples | Cost (per 1M tokens) | Use Cases |
|---|---|---|---|
| Strong | GPT-5.2, Claude Opus 4.5 | $1.75-$5 / $14-$25 | Complex reasoning, creative writing, coding |
| Weak | GPT-5-mini, Claude Haiku 4.5 | $0.25-$1 / $2-$5 | Classification, extraction, simple Q&A |
| Ultra-Cheap | GPT-5-nano, Mistral Small | $0.05-$0.15 / $0.40-$0.60 | Formatting, basic summarization |
The math is compelling: if 70% of your queries can be handled by GPT-5-mini instead of GPT-5.2, you save 70% × (1 - 0.25/1.75) = 60% on input costs alone.
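As a sanity check, that savings formula is easy to verify in a few lines of Python (input prices per 1M tokens, taken from the table above):

```python
strong_in = 1.75   # GPT-5.2 input price, $ per 1M tokens
weak_in = 0.25     # GPT-5-mini input price, $ per 1M tokens
weak_share = 0.70  # fraction of traffic the weak model can absorb

# Savings on input spend vs. sending everything to the strong model
savings = weak_share * (1 - weak_in / strong_in)
print(f"{savings:.0%}")  # -> 60%
```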
Types of Model Routers
1. Rule-Based Routers
The simplest approach: define explicit rules for routing.
```python
# Simple rule-based router
def route_query(query: str, task_type: str) -> str:
    # Classification tasks → cheap model
    if task_type in ["classification", "extraction", "formatting"]:
        return "gpt-5-nano"
    # Moderate complexity → mid-tier model
    if task_type in ["summarization", "translation", "simple_qa"]:
        return "gpt-5-mini"
    # Complex tasks → expensive model
    if task_type in ["reasoning", "coding", "creative"]:
        return "gpt-5-2"
    # Default to mid-tier
    return "gpt-5-mini"
```
Pros: Simple, predictable, no overhead
Cons: Requires manual classification, misses nuance
2. ML-Based Routers (RouteLLM)
RouteLLM, developed by LMSYS (creators of Chatbot Arena), uses machine learning to predict which model will perform better for each query.
Four router architectures are available:
- Matrix Factorization: Learns scoring function for model-query pairs (best performance)
- Similarity-Weighted (SW) Ranking: Uses embedding similarity to training examples
- BERT Classifier: Fine-tuned BERT predicts optimal model
- Causal LLM Router: Uses Llama-3-8B as classifier
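If you want to experiment across architectures, the controller accepts several routers at once, and you pick one per request via the model string. This sketch assumes the router identifiers published in the RouteLLM repo ("mf", "sw_ranking", "bert", "causal_llm"):

```python
from routellm.controller import Controller

# One controller, several router architectures; "mf" (matrix factorization)
# is the best performer in the published benchmarks.
client = Controller(
    routers=["mf", "sw_ranking", "bert", "causal_llm"],
    strong_model="gpt-5-2",
    weak_model="gpt-5-mini",
)

# Select the architecture (and threshold) per request, e.g. "router-bert-0.4"
```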
3. LLM-Based Routers
Use a small LLM to classify queries before routing. NVIDIA's LLM Router Blueprint uses Qwen 1.75B for intent classification.
```python
# LLM-based router (simplified); small_llm is a placeholder client for
# whatever small classifier model you deploy
def llm_router(query: str) -> str:
    classification_prompt = f"""Classify this query's complexity:
Query: {query}
Options:
- SIMPLE: Basic facts, formatting, extraction
- MODERATE: Summarization, translation, simple analysis
- COMPLEX: Reasoning, coding, creative writing
Return only the classification."""
    complexity = small_llm.generate(classification_prompt)
    routing_map = {
        "SIMPLE": "gpt-5-nano",
        "MODERATE": "gpt-5-mini",
        "COMPLEX": "gpt-5-2",
    }
    # Fall back to the mid tier on unexpected output
    return routing_map.get(complexity.strip(), "gpt-5-mini")
```
Implementing RouteLLM
RouteLLM provides pre-trained routers that achieve 85% cost reduction on MT-Bench while maintaining 95% quality. Here's how to implement it:
Installation
```bash
pip install routellm
```
Basic Usage
```python
import os
from routellm.controller import Controller

# Set API keys
os.environ["OPENAI_API_KEY"] = "sk-..."

# Initialize router with matrix factorization (best performing)
client = Controller(
    routers=["mf"],  # matrix factorization router
    strong_model="gpt-5-2",
    weak_model="gpt-5-mini",
)

# Make a routed request
response = client.chat.completions.create(
    model="router-mf-0.5",  # 0.5 = cost threshold
    messages=[
        {"role": "user", "content": "What is the capital of France?"}
    ],
)
# This simple query routes to gpt-5-mini automatically
```
Adjusting the Cost Threshold
The threshold (0.0-1.0) is compared against the router's predicted win rate for the strong model: queries that score above it are routed to the strong model, everything else goes to the weak model. In practice:
- 0.0: every query routes to the strong model (highest quality, highest cost)
- 1.0: every query routes to the weak model (cheapest, lowest quality)
- In between: raise the threshold for more savings, lower it for more quality

Calibrated values are router- and workload-specific, so treat these as illustrative:

```python
# Higher threshold → more weak-model usage (more aggressive cost savings)
model="router-mf-0.7"

# Lower threshold → more strong-model usage (higher quality priority)
model="router-mf-0.3"
```
Benchmark Results: How Much Can You Save?
RouteLLM published results at ICLR 2025 showing significant cost reductions:
| Benchmark | Cost Reduction | Quality Retained | Strong Model Calls |
|---|---|---|---|
| MT-Bench | 85% | 95% | 14% |
| MMLU | 45% | 95% | ~50% |
| GSM8K | 35% | 95% | ~60% |
The matrix factorization router achieved 95% of GPT-4 performance using only 26% GPT-4 calls, which is approximately 48% cheaper than a random baseline.
With data augmentation from an LLM judge, the same router achieved 95% quality with only 14% strong-model calls, a 75% cost reduction relative to the random-routing baseline.
Task Classification: What Goes Where?
Route to Cheap Models (GPT-5-nano, Haiku 4.5)
- Text classification and sentiment analysis
- Entity extraction and NER
- Format conversion (JSON, XML, Markdown)
- Simple keyword extraction
- Basic text reformatting
Route to Mid-Tier Models (GPT-5-mini, Sonnet 4.5)
- Summarization of documents
- Translation between languages
- Simple question answering
- Content moderation
- Basic code explanation
Route to Expensive Models (GPT-5.2, Opus 4.5)
- Complex multi-step reasoning
- Code generation and debugging
- Creative writing and content creation
- Mathematical problem solving
- Strategic analysis and planning
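If you don't have explicit task labels, a cheap heuristic pre-filter can approximate these buckets before any model is called. The keywords and length cutoff below are illustrative assumptions, not benchmark-derived; the tier names match the router examples in this guide:

```python
import re

def heuristic_tier(query: str) -> str:
    # Code fences, stack traces, proofs, or multi-step phrasing suggest hard tasks
    if re.search(r"```|traceback|step[- ]by[- ]step|prove|debug", query, re.IGNORECASE):
        return "expensive"
    # Long inputs usually indicate summarization/translation-style work
    if len(query.split()) > 200:
        return "mid"
    # Short factual or formatting requests default to the cheap tier
    return "cheap"
```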
LLM Gateways for Production Routing
For production deployments, LLM gateways provide routing, observability, and cost controls in one package.
Top LLM Gateways (2026)
| Gateway | Routing Overhead | Key Features |
|---|---|---|
| Bifrost (Maxim AI) | 11µs | Zero-config, enterprise features |
| OpenRouter | ~50ms | Multi-provider, unified API |
| LiteLLM | ~10ms | Open-source, 100+ providers |
LLM gateways can cut token spend by 30-50% through automatic routing, policy enforcement (e.g., capping GPT-5.2 calls at 20% of traffic), and real-time spend tracking.
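Gateways enforce such policies internally, but the core idea fits in a few lines. Here is a hypothetical traffic-share guard (class and method names are ours, not from any gateway SDK) that demotes requests to the weak model once strong-model usage hits a cap, mirroring the 20% example above:

```python
class StrongModelCap:
    """Demote requests to the weak model once strong-model traffic share hits a cap."""

    def __init__(self, cap: float = 0.20):
        self.cap = cap      # max fraction of traffic allowed on the strong model
        self.total = 0
        self.strong = 0

    def choose(self, wants_strong: bool) -> str:
        self.total += 1
        if wants_strong and self.strong / self.total < self.cap:
            self.strong += 1
            return "gpt-5-2"
        return "gpt-5-mini"

guard = StrongModelCap(cap=0.20)
print(guard.choose(wants_strong=True))  # "gpt-5-2" until the 20% cap is reached
```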
Provider Pricing for Routing Decisions
Understanding pricing tiers helps optimize routing decisions:
| Provider | Strong Model | Weak Model | Price Ratio |
|---|---|---|---|
| OpenAI | GPT-5.2 ($1.75/$14) | GPT-5-mini ($0.25/$2) | 7x cheaper |
| Anthropic | Opus 4.5 ($5/$25) | Haiku 4.5 ($1/$5) | 5x cheaper |
| Google | Gemini 2.0 Pro ($1.25/$5) | Gemini 2.0 Flash ($0.075/$0.30) | 16x cheaper |
Google's Gemini offers the largest price differential (16x) between strong and weak models, making it ideal for aggressive routing strategies.
Complete Implementation Example
Here's a production-ready routing implementation:
```python
from typing import Literal

from openai import OpenAI

class LLMRouter:
    def __init__(self):
        self.openai = OpenAI()
        # Define model tiers
        self.models = {
            "cheap": "gpt-5-nano",
            "mid": "gpt-5-mini",
            "expensive": "gpt-5-2",
        }

    def classify_complexity(self, query: str) -> Literal["cheap", "mid", "expensive"]:
        """Use a cheap model to classify query complexity."""
        response = self.openai.chat.completions.create(
            model="gpt-5-nano",
            messages=[{
                "role": "system",
                "content": """Classify query complexity:
- cheap: classification, extraction, formatting
- mid: summarization, translation, simple QA
- expensive: reasoning, coding, creative
Return only: cheap, mid, or expensive"""
            }, {
                "role": "user",
                "content": query
            }],
            max_tokens=10,
        )
        label = response.choices[0].message.content.strip().lower()
        # Fall back to the mid tier if the classifier returns anything unexpected
        return label if label in self.models else "mid"

    def route(self, query: str, messages: list | None = None):
        """Route query to the appropriate model."""
        complexity = self.classify_complexity(query)
        model = self.models[complexity]
        if messages is None:
            messages = [{"role": "user", "content": query}]
        response = self.openai.chat.completions.create(
            model=model,
            messages=messages,
        )
        return {
            "response": response.choices[0].message.content,
            "model_used": model,
            "complexity": complexity,
            "input_tokens": response.usage.prompt_tokens,
            "output_tokens": response.usage.completion_tokens,
        }

# Usage
router = LLMRouter()

# Simple query → routes to gpt-5-nano
result = router.route("What is 2+2?")
print(f"Model: {result['model_used']}")  # gpt-5-nano

# Complex query → routes to gpt-5-2
result = router.route(
    "Write a recursive algorithm to solve the Tower of Hanoi problem "
    "and explain its time complexity."
)
print(f"Model: {result['model_used']}")  # gpt-5-2
```
Monitoring and Optimization
Effective routing requires ongoing monitoring:
Key Metrics to Track
- Routing distribution: % of queries to each model tier
- Quality scores: User feedback, task success rates
- Cost per query: Average spend across tiers
- Router accuracy: Did cheap models succeed on routed queries?
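None of these require heavy tooling to get started; an in-process tracker along these lines (class and method names are ours, purely illustrative) covers the first three metrics:

```python
from collections import Counter, defaultdict

class RoutingMetrics:
    def __init__(self):
        self.calls = Counter()           # request count per model
        self.spend = defaultdict(float)  # dollars per model

    def record(self, model: str, cost_usd: float):
        self.calls[model] += 1
        self.spend[model] += cost_usd

    def report(self):
        total = sum(self.calls.values())
        for model, n in self.calls.most_common():
            print(f"{model}: {n / total:.0%} of traffic, "
                  f"${self.spend[model] / n:.5f} avg cost/query")

metrics = RoutingMetrics()
metrics.record("gpt-5-mini", 0.00125)
metrics.record("gpt-5-2", 0.00875)
metrics.report()
```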
Optimization Strategies
- Start conservative: Route 20% to cheap models, measure quality
- Expand gradually: Increase cheap routing as confidence grows
- Monitor failures: Track when cheap models fail tasks
- A/B test thresholds: Compare different routing thresholds
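For the last point, an A/B test can be as simple as hashing a stable user ID into arms; the bucket split and threshold values below are illustrative:

```python
import hashlib

def threshold_for(user_id: str) -> float:
    # Deterministic 50/50 split on a stable identifier
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 2
    return 0.3 if bucket == 0 else 0.7

model = f"router-mf-{threshold_for('user-42')}"  # e.g. "router-mf-0.7"
```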
Common Mistakes to Avoid
1. Over-Routing to Cheap Models
Aggressive cost cutting can hurt user experience. Start with a calibrated, balanced threshold (for example, one targeting roughly 50% strong-model calls) and tighten it only as quality metrics hold up.
2. Ignoring Router Overhead
If your router adds 100ms latency, ensure the cost savings justify it. Modern gateways like Bifrost add only 11µs.
3. Not Monitoring Quality
Cost reduction is meaningless if quality suffers. Track user satisfaction alongside spend.
4. Static Rules for Dynamic Content
Rule-based routers miss nuance. A "simple" question about quantum physics needs a strong model.
Real Cost Calculation
Let's calculate savings for an app with 1M requests/month, average 1K input + 500 output tokens per request:
| Scenario | Model Mix | Monthly Cost |
|---|---|---|
| All GPT-5.2 | 100% strong | $8,750 |
| Basic Routing | 30% strong, 70% mini | $3,500 |
| Aggressive Routing | 15% strong, 50% mini, 35% nano | $2,025 |
Savings with aggressive routing: $6,725/month (77%)
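The table falls out of a few lines of arithmetic (per-1M-token prices from the pricing section, 1K input and 500 output tokens per request):

```python
# Dollars per request for a tier, given $/1M-token input and output prices
def per_request(in_price: float, out_price: float) -> float:
    return 1_000 * in_price / 1e6 + 500 * out_price / 1e6

strong = per_request(1.75, 14.00)  # GPT-5.2    -> $0.00875
mini = per_request(0.25, 2.00)     # GPT-5-mini -> $0.00125
nano = per_request(0.05, 0.40)     # GPT-5-nano -> $0.00025

n = 1_000_000  # requests per month
print(n * strong)                                       # 8750.0 (all GPT-5.2)
print(n * (0.30 * strong + 0.70 * mini))                # 3500.0 (basic routing)
print(n * (0.15 * strong + 0.50 * mini + 0.35 * nano))  # 2025.0 (aggressive)
```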
Track Your Routing Performance with Burnwise
Monitor model distribution, cost per query, and quality metrics across all providers. See which queries could be routed cheaper.
Next Steps
- Audit your queries: What % are simple vs complex?
- Start with RouteLLM: Use pre-trained routers for quick wins
- Monitor quality: Ensure routing doesn't hurt user experience
- Iterate thresholds: Adjust based on real data
- Consider gateways: For production, use Bifrost or LiteLLM
For related optimization strategies, see our Prompt Caching Guide (50-90% savings) and Complete LLM Cost Optimization Guide.
Questions? Check our SDK documentation or use our LLM Cost Calculator to estimate your savings.