Token Optimization: Reduce LLM Input & Output Costs by 60%

January 12, 2026
14 min read

Every token costs money. Input tokens, output tokens—they all add up. The good news: you can reduce token usage by 60% or more with proven optimization techniques. This guide covers everything from basic prompt engineering to advanced compression with LLMLingua.

Token Optimization Quick Facts (January 2026)

  • Prompt Optimization: 35% reduction with concise prompts
  • Prompt Compression: Up to 20x compression with LLMLingua
  • Batching: 30% input token savings on repeated context
  • Output Control: 20-40% savings with max_tokens and format control

Why Do Tokens Matter for LLM Costs?

Tokens are the pricing unit for LLM APIs. Every character you send (input) and receive (output) is converted to tokens, and you pay per token. Understanding tokens is the first step to optimizing costs.

What Is a Token?

A token is a chunk of text that the model processes. Roughly:

  • English: 1 token ≈ 4 characters or ¾ of a word
  • Code: Varies significantly (symbols often = 1 token each)
  • Other languages: Often more tokens per word

For example, "Hello, world!" is 4 tokens in GPT models.

Input vs Output Token Pricing

Output tokens cost 3-8x more than input tokens across major providers:

Model             Input /1M tokens   Output /1M tokens   Output multiple
GPT-5.2           $1.75              $14.00              8x
Claude Opus 4.5   $5.00              $25.00              5x
Gemini 3.0 Pro    $2.00              $12.00              6x
GPT-5-mini        $0.30              $1.00               3.3x

Key Insight: Reducing output tokens has 3-8x more impact on costs than reducing input tokens. Always optimize output first.
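
To make the multiplier concrete, here is a rough per-request cost calculation at the GPT-5-mini rates from the table above (a sketch; swap in the rates for your model):

# Cost of a single request at GPT-5-mini rates ($0.30 input / $1.00 output per 1M tokens)
INPUT_PRICE_PER_M = 0.30
OUTPUT_PRICE_PER_M = 1.00

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request given token counts and per-million prices."""
    return (input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000

# Same total tokens, very different cost: the output-heavy split is ~44% more expensive
print(request_cost(1_000, 500))  # 0.0008
print(request_cost(500, 1_000))  # 0.00115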

How to Count Tokens Before Sending Requests

Counting tokens before API calls helps you estimate costs and avoid hitting context limits.

OpenAI: tiktoken

tiktoken is OpenAI's official tokenizer, 3-6x faster than alternatives:

import tiktoken

# Get encoding for GPT-5 models
encoding = tiktoken.encoding_for_model("gpt-5")

# Count tokens
text = "Hello, how can I help you today?"
tokens = encoding.encode(text)
print(f"Token count: {len(tokens)}")  # Output: 8

# For chat messages
def count_chat_tokens(messages, model="gpt-5"):
    """Approximate chat token count; per-message overhead varies slightly by model."""
    encoding = tiktoken.encoding_for_model(model)
    num_tokens = 0
    for message in messages:
        num_tokens += 4  # Every message has overhead
        for key, value in message.items():
            num_tokens += len(encoding.encode(value))
    num_tokens += 2  # Reply priming
    return num_tokens

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is machine learning?"}
]
print(f"Total tokens: {count_chat_tokens(messages)}")

Anthropic Claude

Claude uses a different tokenizer. Anthropic provides a simple heuristic:

  • Rough estimate: 1 token ≈ 3.5 English characters
  • More accurate: Use Anthropic's legacy tokenizer (1-2% error rate)

# Simple Claude token estimation
def estimate_claude_tokens(text: str) -> int:
    """Estimate tokens using Anthropic's ~3.5 characters-per-token heuristic."""
    return int(len(text) / 3.5)

# For production, use the API's token counting
import anthropic
client = anthropic.Anthropic()
result = client.messages.count_tokens(
    model="claude-sonnet-4-5-20250929",
    messages=[{"role": "user", "content": "Hello!"}]
)
print(f"Tokens: {result.input_tokens}")

Google Gemini

import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # or set the GOOGLE_API_KEY environment variable

model = genai.GenerativeModel('gemini-2.0-flash')
result = model.count_tokens("What is the meaning of life?")
print(f"Token count: {result.total_tokens}")

Input Token Optimization

1. Write Concise Prompts

The simplest optimization: use fewer words. Every character counts.

❌ Verbose (45 tokens):
"I would like you to please help me by providing a comprehensive
and detailed explanation of what machine learning is, including
all the important concepts and how it works."

✅ Concise (12 tokens):
"Explain machine learning in 2-3 sentences."

Savings: 73%

2. Batch Multiple Inputs

When processing multiple items with the same instructions, batch them together:

❌ Separate calls (3000 tokens total):
Call 1: [System prompt: 800 tokens] + [Doc 1: 200 tokens]
Call 2: [System prompt: 800 tokens] + [Doc 2: 200 tokens]
Call 3: [System prompt: 800 tokens] + [Doc 3: 200 tokens]

✅ Batched (1400 tokens):
Single call: [System prompt: 800 tokens] + [Doc 1 + Doc 2 + Doc 3: 600 tokens]

Savings: 53%
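
In code, the batched version might look like the following (a sketch assuming the OpenAI Python SDK; the documents and the summarization instruction are placeholders):

from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = "Summarize each document in one sentence."  # the shared ~800-token instructions in practice
docs = ["Doc 1 text...", "Doc 2 text...", "Doc 3 text..."]

# One call carries the shared instructions once instead of three times
user_content = "\n\n".join(f"Document {i + 1}:\n{doc}" for i, doc in enumerate(docs))

response = client.chat.completions.create(
    model="gpt-5-mini",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_content},
    ],
)
print(response.choices[0].message.content)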

3. Use Prompt Compression (LLMLingua)

LLMLingua, developed by Microsoft Research, automatically removes redundant tokens while preserving meaning. It achieves up to 20x compression with only 1.5% performance loss.

from llmlingua import PromptCompressor

# Initialize compressor with a small LLMLingua-2 model
compressor = PromptCompressor(
    model_name="microsoft/llmlingua-2-bert-base-multilingual-cased-meetingbank",
    use_llmlingua2=True,  # required for LLMLingua-2 checkpoints
    device_map="cpu"
)

# Original prompt (1000 tokens)
original_prompt = """
[Very long context with background information,
examples, and detailed instructions...]
"""

# Compress to ~200 tokens (5x compression)
compressed = compressor.compress_prompt(
    original_prompt,
    rate=0.2,  # Keep 20% of tokens
    force_tokens=["important", "keywords"]  # Preserve specific tokens
)

print(f"Original: {compressed['origin_tokens']} tokens")
print(f"Compressed: {compressed['compressed_tokens']} tokens")
print(f"Compression ratio: {compressed['ratio']:.1f}x")

LLMLingua Compression Ratios by Content Type

Content Type        Recommended Compression   Performance Loss
Instructions        10-20%                    Minimal
Few-shot examples   60-80%                    Low
Context documents   50-70%                    Low-Medium
Questions/queries   0-10%                     None

4. Optimize Few-Shot Examples

Few-shot examples consume significant tokens. Optimize them (a trimming sketch follows this list):

  • Use 1-2 examples instead of 5 — Often sufficient for good results
  • Make examples concise — Trim unnecessary details
  • Use prompt caching — Cache examples to pay only once
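
A small sketch of the first two points: keep only one or two concise example pairs and verify the token savings with tiktoken. The example pairs and labels below are placeholders.

import tiktoken

encoding = tiktoken.encoding_for_model("gpt-5")

# Hypothetical few-shot pairs; in practice these come from your prompt library
examples = [
    ("Review: 'Great battery life.'", "positive"),
    ("Review: 'Screen cracked after a week.'", "negative"),
    ("Review: 'Does what it says.'", "positive"),
    ("Review: 'Support never replied.'", "negative"),
    ("Review: 'Fast shipping, solid build.'", "positive"),
]

def build_few_shot(pairs, max_examples=2):
    """Format only the first max_examples pairs as a compact few-shot block."""
    lines = [f"{inp}\nLabel: {out}" for inp, out in pairs[:max_examples]]
    return "\n\n".join(lines)

full = build_few_shot(examples, max_examples=len(examples))
trimmed = build_few_shot(examples, max_examples=2)
print(len(encoding.encode(full)), "->", len(encoding.encode(trimmed)), "tokens")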

5. Context Window Management

For chat applications, sending full history is expensive. Strategies:

  • Sliding window: Keep only last N messages
  • Summarization: Summarize old messages periodically
  • Relevance filtering: Include only messages relevant to current query

import tiktoken

def optimize_chat_history(messages: list, max_tokens: int = 4000) -> list:
    """Keep the system message plus the most recent messages within a token budget."""
    encoding = tiktoken.encoding_for_model("gpt-5")

    # Always keep the system message if present
    optimized = [messages[0]] if messages and messages[0]["role"] == "system" else []
    total_tokens = sum(len(encoding.encode(m["content"])) for m in optimized)

    # Walk backwards from the most recent message, stop when the budget is exceeded
    recent = []
    for msg in reversed(messages[len(optimized):]):
        msg_tokens = len(encoding.encode(msg["content"]))
        if total_tokens + msg_tokens > max_tokens:
            break
        recent.append(msg)
        total_tokens += msg_tokens

    # Restore chronological order for the kept messages
    return optimized + list(reversed(recent))

Output Token Optimization

Output tokens cost 3-8x more than input. Controlling output length has massive impact.

1. Set max_tokens Appropriately

Never leave max_tokens unlimited. Set it based on expected response length:

# Task-specific max_tokens settings
MAX_TOKENS_BY_TASK = {
    "classification": 10,       # Just the label
    "yes_no": 5,               # "Yes" or "No"
    "extraction": 100,         # Structured data
    "summary": 200,            # Brief summary
    "explanation": 500,        # Detailed answer
    "code_generation": 1000,   # Code with comments
}

response = client.chat.completions.create(
    model="gpt-5-mini",
    messages=messages,
    max_tokens=MAX_TOKENS_BY_TASK["classification"]
)

2. Request Concise Responses

Explicitly ask for brevity in your prompt:

❌ Without guidance:
"What are the benefits of exercise?"
→ Output: 500+ tokens (verbose essay)

✅ With length control:
"List 3 benefits of exercise. One sentence each."
→ Output: ~50 tokens (concise list)
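
The same idea as an API call, combining the length instruction with a max_tokens cap as a safety net (a sketch reusing the OpenAI client from the earlier examples):

response = client.chat.completions.create(
    model="gpt-5-mini",
    messages=[
        {"role": "system", "content": "Answer in at most 3 short bullet points."},
        {"role": "user", "content": "What are the benefits of exercise?"}
    ],
    max_tokens=80  # hard cap behind the prompt-level length instruction
)
print(response.choices[0].message.content)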

3. Use Structured Output (JSON)

JSON mode forces concise, structured responses:

response = client.chat.completions.create(
    model="gpt-5-mini",
    messages=[{
        "role": "user",
        "content": "Extract the person's name and age: 'John Smith is 32 years old.'"
    }],
    response_format={"type": "json_object"},
    max_tokens=50
)

# Output: {"name": "John Smith", "age": 32}
# Much shorter than prose explanation

4. Use Stop Sequences

Stop generation early when you have what you need:

response = client.chat.completions.create(
    model="gpt-5-mini",
    messages=[{
        "role": "user",
        "content": "Generate a product name for a fitness app:"
    }],
    stop=["\n", "."],  # Stop at first newline or period
    max_tokens=20
)

Token Optimization for RAG Systems

Retrieval-Augmented Generation (RAG) systems often include multiple documents in prompts, making token optimization critical.

1. Retrieve Less, Retrieve Better

  • Limit retrieved chunks: Top 3-5 instead of top 10
  • Use reranking: Rerank results to get the most relevant first (see the sketch below)
  • Smaller chunk sizes: 256-512 tokens per chunk instead of 1000+
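
A minimal sketch of the first two points, assuming each retrieved chunk already carries a relevance score from your retriever or reranker (the chunk structure here is a placeholder):

def select_context(chunks: list, top_k: int = 4) -> list:
    """Sort retrieved chunks by relevance score and keep only the top_k best."""
    ranked = sorted(chunks, key=lambda c: c["score"], reverse=True)
    return ranked[:top_k]

# Example: the retriever returned 10 chunks, only the best 4 go into the prompt
chunks = [{"text": f"chunk {i}", "score": 1.0 - i * 0.1} for i in range(10)]
context = select_context(chunks, top_k=4)
print(len(context), "chunks kept")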

2. Apply Relevance Filtering

Only include chunks above a similarity threshold:

def filter_relevant_chunks(chunks: list, threshold: float = 0.7) -> list:
    """Keep only chunks with similarity above threshold."""
    return [c for c in chunks if c["similarity"] >= threshold]

# Example: 10 chunks retrieved, 4 pass threshold
# Token savings: ~60%

3. Compress Retrieved Context

Apply LLMLingua to retrieved documents before including in prompt:

# LongLLMLingua for RAG (handles 'lost in the middle' issue)
from llmlingua import PromptCompressor

compressor = PromptCompressor()

# Compress multiple retrieved documents
# (retrieved_docs is a list of retrieved chunk strings; user_query is the user's question)
compressed_context = compressor.compress_prompt(
    context=retrieved_docs,
    instruction="Answer the user's question based on the context.",
    question=user_query,
    rate=0.25  # Keep 25% of tokens
)

# Research shows: 21.4% better RAG performance using 1/4 of tokens

Measuring Token Optimization Impact

Key Metrics to Track

  • Tokens per request: Average input + output tokens
  • Compression ratio: Original tokens / Optimized tokens
  • Quality score: Ensure optimization doesn't hurt output quality
  • Cost per task: Dollar cost for each task type (see the sketch below)
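
A sketch of how these metrics could be aggregated from logged usage records (the record format and the GPT-5-mini prices are assumptions; substitute your own logging and rates):

from collections import defaultdict

INPUT_PRICE_PER_M, OUTPUT_PRICE_PER_M = 0.30, 1.00  # GPT-5-mini rates from the pricing table

def summarize_usage(records: list) -> dict:
    """Compute average tokens per request and average dollar cost per task type."""
    by_task = defaultdict(list)
    for r in records:
        cost = (r["input_tokens"] * INPUT_PRICE_PER_M +
                r["output_tokens"] * OUTPUT_PRICE_PER_M) / 1_000_000
        by_task[r["task"]].append((r["input_tokens"] + r["output_tokens"], cost))
    return {
        task: {
            "avg_tokens": sum(tokens for tokens, _ in rows) / len(rows),
            "avg_cost": sum(cost for _, cost in rows) / len(rows),
        }
        for task, rows in by_task.items()
    }

# Hypothetical records logged per request
records = [
    {"task": "summary", "input_tokens": 1200, "output_tokens": 180},
    {"task": "summary", "input_tokens": 900, "output_tokens": 160},
    {"task": "classification", "input_tokens": 300, "output_tokens": 5},
]
print(summarize_usage(records))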

A/B Testing Token Optimization

def ab_test_prompt_versions(original: str, optimized: str, test_inputs: list):
    """Compare token usage and quality between prompt versions.

    call_llm() and evaluate_response() are application-specific helpers:
    one sends a prompt to your model, the other scores response quality.
    """
    results = {"original": [], "optimized": []}

    for input_text in test_inputs:
        # Test original
        resp_orig = call_llm(original + input_text)
        results["original"].append({
            "tokens": resp_orig.usage.total_tokens,
            "quality": evaluate_response(resp_orig.content)
        })

        # Test optimized
        resp_opt = call_llm(optimized + input_text)
        results["optimized"].append({
            "tokens": resp_opt.usage.total_tokens,
            "quality": evaluate_response(resp_opt.content)
        })

    # Calculate savings and quality impact
    avg_original = sum(r["tokens"] for r in results["original"]) / len(test_inputs)
    avg_optimized = sum(r["tokens"] for r in results["optimized"]) / len(test_inputs)

    print(f"Token reduction: {(1 - avg_optimized/avg_original)*100:.1f}%")

Real Cost Savings Example

Let's calculate savings for a chatbot with 100K daily messages:

Optimization         Before                 After                Reduction
Concise prompts      500 input tokens       325 input tokens     35%
Output control       300 output tokens      180 output tokens    40%
Context management   2,000 history tokens   800 history tokens   60%

Monthly savings with GPT-5-mini (100K messages/day, 30-day month):

  • Before: 250M input + 30M output tokens per day = $75.00 + $30.00 = $105.00/day
  • After: 112.5M input + 18M output tokens per day = $33.75 + $18.00 = $51.75/day
  • Daily savings: $53.25 (51%)
  • Monthly savings: ~$1,600
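
The same calculation in a few lines of Python (a sketch using the per-message token counts from the table and the GPT-5-mini rates from the pricing section):

MESSAGES_PER_DAY = 100_000
INPUT_PRICE_PER_M, OUTPUT_PRICE_PER_M = 0.30, 1.00  # GPT-5-mini, $ per 1M tokens

def daily_cost(input_per_msg: int, output_per_msg: int) -> float:
    """Daily spend for 100K messages at the given per-message token counts."""
    daily_input = MESSAGES_PER_DAY * input_per_msg
    daily_output = MESSAGES_PER_DAY * output_per_msg
    return (daily_input * INPUT_PRICE_PER_M + daily_output * OUTPUT_PRICE_PER_M) / 1_000_000

before = daily_cost(500 + 2_000, 300)  # verbose prompt + full history, uncontrolled output
after = daily_cost(325 + 800, 180)     # concise prompt, trimmed history, capped output
print(before, after, (before - after) * 30)  # 105.0  51.75  1597.5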

Common Token Optimization Mistakes

1. Over-Compressing Instructions

Instructions need clarity. Compress context, not instructions.

2. Ignoring Output Tokens

Output costs 3-8x more. Always set max_tokens and request concise responses.

3. Not Measuring Quality

Aggressive optimization can hurt results. Always A/B test quality.

4. One-Size-Fits-All Compression

Different content types need different compression ratios. Instructions ≠ examples.

5. Forgetting to Cache

Combine token optimization with prompt caching for maximum savings.

Combining All Techniques

Maximum savings come from stacking optimizations:

  1. Write concise prompts (-35% input)
  2. Apply prompt compression (-50% on context)
  3. Enable prompt caching (-50% on cached tokens)
  4. Control output length (-40% output)
  5. Use model routing (cheap models for simple tasks)
  6. Batch when possible (-50% via Batch API)

Combined, these techniques can achieve 70-90% cost reduction for many applications.

Track Your Token Usage with Burnwise

See exactly how many tokens each feature consumes. Get AI-powered recommendations for optimization. One-line SDK integration.

Start Free Trial

Next Steps

  1. Audit current usage: Count tokens per request type
  2. Start with output control: Set max_tokens, request concise responses
  3. Optimize high-volume prompts: Focus on most-used prompts first
  4. Test prompt compression: Try LLMLingua on long contexts
  5. Measure and iterate: Track savings and quality impact

For more cost optimization, see our Prompt Caching Guide, Model Routing Guide, and Batch Processing Guide.

Use our LLM Cost Calculator to estimate savings or compare prices on the AI Pricing page.

Tags: token optimization, cost optimization, tiktoken, prompt compression, llm, llmlingua

Put These Insights Into Practice

Burnwise tracks your LLM costs automatically and shows you exactly where to optimize.

Start Free Trial