Token Optimization: Reduce LLM Input & Output Costs by 60%

January 12, 2026
14 min read

Every token costs money. Input tokens, output tokens—they all add up. The good news: you can reduce token usage by 60% or more with proven optimization techniques. This guide covers everything from basic prompt engineering to advanced compression with LLMLingua.

Token Optimization Quick Facts (January 2026)

  • Prompt Optimization: 35% reduction with concise prompts
  • Prompt Compression: Up to 20x compression with LLMLingua
  • Batching: 30% input token savings on repeated context
  • Output Control: 20-40% savings with max_tokens and format control

Why Do Tokens Matter for LLM Costs?

Tokens are the pricing unit for LLM APIs. Every character you send (input) and receive (output) is converted to tokens, and you pay per token. Understanding tokens is the first step to optimizing costs.

What Is a Token?

A token is a chunk of text that the model processes. Roughly:

  • English: 1 token ≈ 4 characters or ¾ of a word
  • Code: Varies significantly (symbols often = 1 token each)
  • Other languages: Often more tokens per word

For example, "Hello, world!" is 4 tokens in GPT models.

Input vs Output Token Pricing

Output tokens cost 3-8x more than input tokens across major providers:

Model             Input /1M tokens   Output /1M tokens   Output multiple
GPT-5.2           $1.75              $14.00              8x
Claude Opus 4.5   $5.00              $25.00              5x
Gemini 3.0 Pro    $2.00              $12.00              6x
GPT-5-mini        $0.30              $1.00               3.3x

Key Insight: Reducing output tokens has 3-8x more impact on costs than reducing input tokens. Always optimize output first.
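
To make the multiplier concrete, here is a rough per-request cost calculation at the GPT-5-mini rates from the table above (a sketch; swap in the rates for your model):

# Cost of a single request at GPT-5-mini rates ($0.30 input / $1.00 output per 1M tokens)
INPUT_PRICE_PER_M = 0.30
OUTPUT_PRICE_PER_M = 1.00

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request given token counts and per-million prices."""
    return (input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000

# Same total tokens, very different cost: the output-heavy split is ~44% more expensive
print(request_cost(1_000, 500))  # 0.0008
print(request_cost(500, 1_000))  # 0.00115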

How to Count Tokens Before Sending Requests

Counting tokens before API calls helps you estimate costs and avoid hitting context limits.

OpenAI: tiktoken

tiktoken is OpenAI's official tokenizer, 3-6x faster than alternatives:

import tiktoken

# Get encoding for GPT-5 models
encoding = tiktoken.encoding_for_model("gpt-5")

# Count tokens
text = "Hello, how can I help you today?"
tokens = encoding.encode(text)
print(f"Token count: {len(tokens)}")  # Output: 8

# For chat messages
def count_chat_tokens(messages, model="gpt-5"):
    """Approximate chat token count; per-message overhead varies slightly by model."""
    encoding = tiktoken.encoding_for_model(model)
    num_tokens = 0
    for message in messages:
        num_tokens += 4  # Every message has overhead
        for key, value in message.items():
            num_tokens += len(encoding.encode(value))
    num_tokens += 2  # Reply priming
    return num_tokens

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is machine learning?"}
]
print(f"Total tokens: {count_chat_tokens(messages)}")

Anthropic Claude

Claude uses a different tokenizer. Anthropic provides a simple heuristic:

  • Rough estimate: 1 token ≈ 3.5 English characters
  • More accurate: Use Anthropic's legacy tokenizer (1-2% error rate)

# Simple Claude token estimation
def estimate_claude_tokens(text: str) -> int:
    """Estimate tokens using Anthropic's ~3.5 characters-per-token heuristic."""
    return int(len(text) / 3.5)

# For production, use the API's token counting
import anthropic
client = anthropic.Anthropic()
result = client.messages.count_tokens(
    model="claude-sonnet-4-5-20250929",
    messages=[{"role": "user", "content": "Hello!"}]
)
print(f"Tokens: {result.input_tokens}")

Google Gemini

import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # or set the GOOGLE_API_KEY environment variable

model = genai.GenerativeModel('gemini-2.0-flash')
result = model.count_tokens("What is the meaning of life?")
print(f"Token count: {result.total_tokens}")

Input Token Optimization

1. Write Concise Prompts

The simplest optimization: use fewer words. Every character counts.

❌ Verbose (45 tokens):
"I would like you to please help me by providing a comprehensive
and detailed explanation of what machine learning is, including
all the important concepts and how it works."

✅ Concise (12 tokens):
"Explain machine learning in 2-3 sentences."

Savings: 73%

2. Batch Multiple Inputs

When processing multiple items with the same instructions, batch them together:

❌ Separate calls (3000 tokens total):
Call 1: [System prompt: 800 tokens] + [Doc 1: 200 tokens]
Call 2: [System prompt: 800 tokens] + [Doc 2: 200 tokens]
Call 3: [System prompt: 800 tokens] + [Doc 3: 200 tokens]

✅ Batched (1400 tokens):
Single call: [System prompt: 800 tokens] + [Doc 1 + Doc 2 + Doc 3: 600 tokens]

Savings: 53%
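
In code, the batched version might look like the following (a sketch assuming the OpenAI Python SDK; the documents and the summarization instruction are placeholders):

from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = "Summarize each document in one sentence."  # the shared ~800-token instructions in practice
docs = ["Doc 1 text...", "Doc 2 text...", "Doc 3 text..."]

# One call carries the shared instructions once instead of three times
user_content = "\n\n".join(f"Document {i + 1}:\n{doc}" for i, doc in enumerate(docs))

response = client.chat.completions.create(
    model="gpt-5-mini",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_content},
    ],
)
print(response.choices[0].message.content)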

3. Use Prompt Compression (LLMLingua)

LLMLingua, developed by Microsoft Research, automatically removes redundant tokens while preserving meaning. It achieves up to 20x compression with only 1.5% performance loss.

from llmlingua import PromptCompressor

# Initialize compressor with a small LLMLingua-2 model
compressor = PromptCompressor(
    model_name="microsoft/llmlingua-2-bert-base-multilingual-cased-meetingbank",
    use_llmlingua2=True,  # required for LLMLingua-2 checkpoints
    device_map="cpu"
)

# Original prompt (1000 tokens)
original_prompt = """
[Very long context with background information,
examples, and detailed instructions...]
"""

# Compress to ~200 tokens (5x compression)
compressed = compressor.compress_prompt(
    original_prompt,
    rate=0.2,  # Keep 20% of tokens
    force_tokens=["important", "keywords"]  # Preserve specific tokens
)

print(f"Original: {compressed['origin_tokens']} tokens")
print(f"Compressed: {compressed['compressed_tokens']} tokens")
print(f"Compression ratio: {compressed['ratio']:.1f}x")

LLMLingua Compression Ratios by Content Type

Content Type        Recommended Compression   Performance Loss
Instructions        10-20%                    Minimal
Few-shot examples   60-80%                    Low
Context documents   50-70%                    Low-Medium
Questions/queries   0-10%                     None

4. Optimize Few-Shot Examples

Few-shot examples consume significant tokens. Optimize them (a trimming sketch follows this list):

  • Use 1-2 examples instead of 5 — Often sufficient for good results
  • Make examples concise — Trim unnecessary details
  • Use prompt caching — Cache examples to pay only once
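
A small sketch of the first two points: keep only one or two concise example pairs and verify the token savings with tiktoken. The example pairs and labels below are placeholders.

import tiktoken

encoding = tiktoken.encoding_for_model("gpt-5")

# Hypothetical few-shot pairs; in practice these come from your prompt library
examples = [
    ("Review: 'Great battery life.'", "positive"),
    ("Review: 'Screen cracked after a week.'", "negative"),
    ("Review: 'Does what it says.'", "positive"),
    ("Review: 'Support never replied.'", "negative"),
    ("Review: 'Fast shipping, solid build.'", "positive"),
]

def build_few_shot(pairs, max_examples=2):
    """Format only the first max_examples pairs as a compact few-shot block."""
    lines = [f"{inp}\nLabel: {out}" for inp, out in pairs[:max_examples]]
    return "\n\n".join(lines)

full = build_few_shot(examples, max_examples=len(examples))
trimmed = build_few_shot(examples, max_examples=2)
print(len(encoding.encode(full)), "->", len(encoding.encode(trimmed)), "tokens")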

5. Context Window Management

For chat applications, sending full history is expensive. Strategies:

  • Sliding window: Keep only last N messages
  • Summarization: Summarize old messages periodically
  • Relevance filtering: Include only messages relevant to current query

import tiktoken

def optimize_chat_history(messages: list, max_tokens: int = 4000) -> list:
    """Keep the system message plus the most recent messages within a token budget."""
    encoding = tiktoken.encoding_for_model("gpt-5")

    # Always keep the system message if present
    optimized = [messages[0]] if messages and messages[0]["role"] == "system" else []
    total_tokens = sum(len(encoding.encode(m["content"])) for m in optimized)

    # Walk backwards from the most recent message, stop when the budget is exceeded
    recent = []
    for msg in reversed(messages[len(optimized):]):
        msg_tokens = len(encoding.encode(msg["content"]))
        if total_tokens + msg_tokens > max_tokens:
            break
        recent.append(msg)
        total_tokens += msg_tokens

    # Restore chronological order for the kept messages
    return optimized + list(reversed(recent))

Output Token Optimization

Output tokens cost 3-8x more than input. Controlling output length has massive impact.

1. Set max_tokens Appropriately

Never leave max_tokens unlimited. Set it based on expected response length:

# Task-specific max_tokens settings
MAX_TOKENS_BY_TASK = {
    "classification": 10,       # Just the label
    "yes_no": 5,               # "Yes" or "No"
    "extraction": 100,         # Structured data
    "summary": 200,            # Brief summary
    "explanation": 500,        # Detailed answer
    "code_generation": 1000,   # Code with comments
}

response = client.chat.completions.create(
    model="gpt-5-mini",
    messages=messages,
    max_tokens=MAX_TOKENS_BY_TASK["classification"]
)

2. Request Concise Responses

Explicitly ask for brevity in your prompt:

❌ Without guidance:
"What are the benefits of exercise?"
→ Output: 500+ tokens (verbose essay)

✅ With length control:
"List 3 benefits of exercise. One sentence each."
→ Output: ~50 tokens (concise list)
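
The same idea as an API call, combining the length instruction with a max_tokens cap as a safety net (a sketch reusing the OpenAI client from the earlier examples):

response = client.chat.completions.create(
    model="gpt-5-mini",
    messages=[
        {"role": "system", "content": "Answer in at most 3 short bullet points."},
        {"role": "user", "content": "What are the benefits of exercise?"}
    ],
    max_tokens=80  # hard cap behind the prompt-level length instruction
)
print(response.choices[0].message.content)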

3. Use Structured Output (JSON)

JSON mode forces concise, structured responses:

response = client.chat.completions.create(
    model="gpt-5-mini",
    messages=[{
        "role": "user",
        "content": "Extract the person's name and age: 'John Smith is 32 years old.'"
    }],
    response_format={"type": "json_object"},
    max_tokens=50
)

# Output: {"name": "John Smith", "age": 32}
# Much shorter than prose explanation

4. Use Stop Sequences

Stop generation early when you have what you need:

response = client.chat.completions.create(
    model="gpt-5-mini",
    messages=[{
        "role": "user",
        "content": "Generate a product name for a fitness app:"
    }],
    stop=["\n", "."],  # Stop at first newline or period
    max_tokens=20
)

Token Optimization for RAG Systems

Retrieval-Augmented Generation (RAG) systems often include multiple documents in prompts, making token optimization critical.

1. Retrieve Less, Retrieve Better

  • Limit retrieved chunks: Top 3-5 instead of top 10
  • Use reranking: Rerank results to get the most relevant first (see the sketch below)
  • Smaller chunk sizes: 256-512 tokens per chunk instead of 1000+
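
A minimal sketch of the first two points, assuming each retrieved chunk already carries a relevance score from your retriever or reranker (the chunk structure here is a placeholder):

def select_context(chunks: list, top_k: int = 4) -> list:
    """Sort retrieved chunks by relevance score and keep only the top_k best."""
    ranked = sorted(chunks, key=lambda c: c["score"], reverse=True)
    return ranked[:top_k]

# Example: the retriever returned 10 chunks, only the best 4 go into the prompt
chunks = [{"text": f"chunk {i}", "score": 1.0 - i * 0.1} for i in range(10)]
context = select_context(chunks, top_k=4)
print(len(context), "chunks kept")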

2. Apply Relevance Filtering

Only include chunks above a similarity threshold:

def filter_relevant_chunks(chunks: list, threshold: float = 0.7) -> list:
    """Keep only chunks with similarity above threshold."""
    return [c for c in chunks if c["similarity"] >= threshold]

# Example: 10 chunks retrieved, 4 pass threshold
# Token savings: ~60%

3. Compress Retrieved Context

Apply LLMLingua to retrieved documents before including in prompt:

# LongLLMLingua for RAG (handles 'lost in the middle' issue)
from llmlingua import PromptCompressor

compressor = PromptCompressor()

# Compress multiple retrieved documents
# (retrieved_docs is a list of retrieved chunk strings; user_query is the user's question)
compressed_context = compressor.compress_prompt(
    context=retrieved_docs,
    instruction="Answer the user's question based on the context.",
    question=user_query,
    rate=0.25  # Keep 25% of tokens
)

# Research shows: 21.4% better RAG performance using 1/4 of tokens

Measuring Token Optimization Impact

Key Metrics to Track

  • Tokens per request: Average input + output tokens
  • Compression ratio: Original tokens / Optimized tokens
  • Quality score: Ensure optimization doesn't hurt output quality
  • Cost per task: Dollar cost for each task type (see the sketch below)
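
A sketch of how these metrics could be aggregated from logged usage records (the record format and the GPT-5-mini prices are assumptions; substitute your own logging and rates):

from collections import defaultdict

INPUT_PRICE_PER_M, OUTPUT_PRICE_PER_M = 0.30, 1.00  # GPT-5-mini rates from the pricing table

def summarize_usage(records: list) -> dict:
    """Compute average tokens per request and average dollar cost per task type."""
    by_task = defaultdict(list)
    for r in records:
        cost = (r["input_tokens"] * INPUT_PRICE_PER_M +
                r["output_tokens"] * OUTPUT_PRICE_PER_M) / 1_000_000
        by_task[r["task"]].append((r["input_tokens"] + r["output_tokens"], cost))
    return {
        task: {
            "avg_tokens": sum(tokens for tokens, _ in rows) / len(rows),
            "avg_cost": sum(cost for _, cost in rows) / len(rows),
        }
        for task, rows in by_task.items()
    }

# Hypothetical records logged per request
records = [
    {"task": "summary", "input_tokens": 1200, "output_tokens": 180},
    {"task": "summary", "input_tokens": 900, "output_tokens": 160},
    {"task": "classification", "input_tokens": 300, "output_tokens": 5},
]
print(summarize_usage(records))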

A/B Testing Token Optimization

def ab_test_prompt_versions(original: str, optimized: str, test_inputs: list):
    """Compare token usage and quality between prompt versions.

    call_llm() and evaluate_response() are application-specific helpers:
    one sends a prompt to your model, the other scores response quality.
    """
    results = {"original": [], "optimized": []}

    for input_text in test_inputs:
        # Test original
        resp_orig = call_llm(original + input_text)
        results["original"].append({
            "tokens": resp_orig.usage.total_tokens,
            "quality": evaluate_response(resp_orig.content)
        })

        # Test optimized
        resp_opt = call_llm(optimized + input_text)
        results["optimized"].append({
            "tokens": resp_opt.usage.total_tokens,
            "quality": evaluate_response(resp_opt.content)
        })

    # Calculate savings and quality impact
    avg_original = sum(r["tokens"] for r in results["original"]) / len(test_inputs)
    avg_optimized = sum(r["tokens"] for r in results["optimized"]) / len(test_inputs)

    print(f"Token reduction: {(1 - avg_optimized/avg_original)*100:.1f}%")

Real Cost Savings Example

Let's calculate savings for a chatbot with 100K daily messages:

Optimization         Before                 After                Reduction
Concise prompts      500 input tokens       325 input tokens     35%
Output control       300 output tokens      180 output tokens    40%
Context management   2,000 history tokens   800 history tokens   60%

Monthly savings with GPT-5-mini (100K messages/day, 30-day month):

  • Before: 250M input + 30M output tokens per day = $75.00 + $30.00 = $105.00/day
  • After: 112.5M input + 18M output tokens per day = $33.75 + $18.00 = $51.75/day
  • Daily savings: $53.25 (51%)
  • Monthly savings: ~$1,600
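
The same calculation in a few lines of Python (a sketch using the per-message token counts from the table and the GPT-5-mini rates from the pricing section):

MESSAGES_PER_DAY = 100_000
INPUT_PRICE_PER_M, OUTPUT_PRICE_PER_M = 0.30, 1.00  # GPT-5-mini, $ per 1M tokens

def daily_cost(input_per_msg: int, output_per_msg: int) -> float:
    """Daily spend for 100K messages at the given per-message token counts."""
    daily_input = MESSAGES_PER_DAY * input_per_msg
    daily_output = MESSAGES_PER_DAY * output_per_msg
    return (daily_input * INPUT_PRICE_PER_M + daily_output * OUTPUT_PRICE_PER_M) / 1_000_000

before = daily_cost(500 + 2_000, 300)  # verbose prompt + full history, uncontrolled output
after = daily_cost(325 + 800, 180)     # concise prompt, trimmed history, capped output
print(before, after, (before - after) * 30)  # 105.0  51.75  1597.5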

Common Token Optimization Mistakes

1. Over-Compressing Instructions

Instructions need clarity. Compress context, not instructions.

2. Ignoring Output Tokens

Output costs 3-8x more. Always set max_tokens and request concise responses.

3. Not Measuring Quality

Aggressive optimization can hurt results. Always A/B test quality.

4. One-Size-Fits-All Compression

Different content types need different compression ratios. Instructions ≠ examples.

5. Forgetting to Cache

Combine token optimization with prompt caching for maximum savings.

Combining All Techniques

Maximum savings come from stacking optimizations:

  1. Write concise prompts (-35% input)
  2. Apply prompt compression (-50% on context)
  3. Enable prompt caching (-50% on cached tokens)
  4. Control output length (-40% output)
  5. Use model routing (cheap models for simple tasks)
  6. Batch when possible (-50% via Batch API)

Combined, these techniques can achieve 70-90% cost reduction for many applications.

Track Your Token Usage with Burnwise

See exactly how many tokens each feature consumes. Get AI-powered recommendations for optimization. One-line SDK integration.

Start Free Trial

Next Steps

  1. Audit current usage: Count tokens per request type
  2. Start with output control: Set max_tokens, request concise responses
  3. Optimize high-volume prompts: Focus on most-used prompts first
  4. Test prompt compression: Try LLMLingua on long contexts
  5. Measure and iterate: Track savings and quality impact

For more cost optimization, see our Prompt Caching Guide, Model Routing Guide, and Batch Processing Guide.

Use our LLM Cost Calculator to estimate savings or compare prices on the AI Pricing page.

Tags: token optimization, cost optimization, tiktoken, prompt compression, llm, llmlingua

Put These Insights Into Practice

Burnwise tracks your LLM costs automatically and shows you exactly where to optimize.

Start Free Trial