Prompt caching can reduce your LLM API costs by 50-90% and latency by up to 85%. Yet most teams don't use it—or use it incorrectly. This guide covers everything you need to implement prompt caching with OpenAI, Anthropic, and Google.
Prompt Caching Quick Facts (January 2026)
- Cost Savings: 50% (OpenAI) to 90% (Anthropic) on cached tokens
- Latency Reduction: Up to 85% faster for long prompts
- Minimum Tokens: 1,024 tokens for cache eligibility
- Cache TTL: 5 minutes (Anthropic, refreshed on each hit), 5-10 minutes (OpenAI, 24 hours for GPT-5.1), ~10 minutes (Gemini)
What Is Prompt Caching?
Prompt caching is a feature that stores the computed key-value (KV) tensors from your prompt's attention layers. When you send a similar prompt, the provider retrieves the cached computation instead of reprocessing it from scratch.
Think of it like this: if you send the same system prompt with 100 different user questions, the model only processes the system prompt once. The remaining 99 requests reuse the cached result.
Why Does This Matter?
- Cost: Cached tokens cost 50-90% less than fresh tokens
- Speed: Skip reprocessing = faster time-to-first-token
- Scale: High-volume apps save thousands per month
Published research suggests that roughly 31% of LLM queries are semantically similar to earlier ones, which means a large share of prompt processing is redundant when nothing is cached. Applications with knowledge bases or lengthy instructions typically see 60-80% cost reductions from prompt caching.
How Does Prompt Caching Work?
When you send a prompt to an LLM, the model computes attention scores for every token. This computation is expensive, especially for long prompts.
Prompt caching works by:
- Processing your prompt and generating KV tensors for the attention layers
- Storing these tensors in GPU memory or fast storage
- Comparing new prompts to check if the prefix matches a cached entry
- Reusing cached tensors when a match is found, only processing new tokens
The key insight: only the prompt prefix can be cached. This means you must structure your prompts with static content (system prompt, context documents, examples) at the beginning, followed by dynamic content (user messages).
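To make the prefix rule concrete, here is a toy sketch: a dictionary keyed by hashes of prompt prefixes stands in for the cached KV tensors. Real providers store KV state in accelerator memory at fixed block sizes, so treat this purely as an illustration of why only a shared prefix can be reused.

```python
import hashlib

# Toy prefix cache: hash of a token prefix -> stand-in for precomputed KV state.
kv_cache = {}

def prefix_key(tokens, cut):
    return hashlib.sha256("\x1f".join(tokens[:cut]).encode()).hexdigest()

def process(tokens):
    # Find the longest prefix that has already been computed.
    reused = 0
    for cut in range(len(tokens), 0, -1):
        if prefix_key(tokens, cut) in kv_cache:
            reused = cut
            break
    print(f"reused {reused} cached segments, computed {len(tokens) - reused} fresh")
    # Remember every prefix of this prompt for future requests.
    for cut in range(1, len(tokens) + 1):
        kv_cache[prefix_key(tokens, cut)] = "kv-state"

static = ["<system prompt>", "<context docs>"]
process(static + ["What is machine learning?"])  # reused 0, computed 3
process(static + ["Explain neural networks"])    # reused 2, computed 1
```

Swapping the order (user question first) would change the very first segment, so nothing after it could ever match.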
Provider Comparison: OpenAI vs Anthropic vs Google
| Feature | OpenAI | Anthropic | Google Gemini |
|---|---|---|---|
| Caching Type | Automatic | Manual (cache_control) | Automatic |
| Cost Savings | 50% | 90% | 50% |
| Min Tokens | 1,024 | 1,024 per checkpoint | 1,024 |
| Cache TTL | 5-10 min (24h for GPT-5.1) | 5 minutes | ~10 minutes |
| Max Breakpoints | N/A (automatic) | 4 per prompt | N/A (automatic) |
| Code Changes | None required | Required | None required |
OpenAI Prompt Caching (Automatic)
OpenAI's approach is the most seamless: caching happens automatically with no code changes required.
How It Works
- Automatic caching is enabled for prompts exceeding 1,024 tokens
- Cache hits occur in increments of 128 tokens
- Matching prefixes get 50% cost reduction and up to 80% latency reduction
- You cannot manually clear the cache
OpenAI Caching Example
```python
# No special code needed - caching is automatic
from openai import OpenAI

client = OpenAI()

# This system prompt will be cached after the first request
system_prompt = """You are a helpful assistant specialized in...
[Long system prompt with 1000+ tokens]
"""

# First request - full processing
response1 = client.chat.completions.create(
    model="gpt-5-mini",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "What is machine learning?"},
    ],
)

# Second request - system prompt cached, 50% cheaper
response2 = client.chat.completions.create(
    model="gpt-5-mini",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "Explain neural networks"},
    ],
)
```
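To confirm the cache is actually being hit, inspect the usage details on the second response. Recent Chat Completions responses report cached prompt tokens under `usage.prompt_tokens_details.cached_tokens`; if your SDK version predates that field, the attribute may be missing, so the check below guards for it.

```python
# Cached prompt tokens appear in the usage details of the second response.
details = getattr(response2.usage, "prompt_tokens_details", None)
cached = getattr(details, "cached_tokens", 0) if details else 0
print(f"Prompt tokens: {response2.usage.prompt_tokens}, served from cache: {cached}")
```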
Pro tip: OpenAI recently rolled out 24-hour cache retention for GPT-5.1 series and GPT-4.1. This is ideal for applications with consistent system prompts.
Anthropic Prompt Caching (Manual Control)
Anthropic gives you explicit control over what gets cached using the cache_control parameter. This requires code changes but offers 90% cost savings.
Key Features
- Add the header: `anthropic-beta: prompt-caching-2024-07-31`
- Use the `cache_control` parameter to mark cacheable sections
- Up to 4 cache breakpoints per prompt
- Minimum 1,024 tokens per checkpoint
- 5-minute TTL that resets on each cache hit
Anthropic Caching Example
```python
import anthropic

client = anthropic.Anthropic()

# Define a cacheable system prompt
system_content = """You are a legal expert assistant...
[Long system prompt with 1000+ tokens]
"""

response = client.messages.create(
    model="claude-sonnet-4-5-20250929",
    max_tokens=1024,
    extra_headers={
        "anthropic-beta": "prompt-caching-2024-07-31"
    },
    system=[
        {
            "type": "text",
            "text": system_content,
            "cache_control": {"type": "ephemeral"},  # Mark for caching
        }
    ],
    messages=[
        {"role": "user", "content": "What are the key points of contract law?"}
    ],
)

# Check cache performance in the response
print(f"Cache read tokens: {response.usage.cache_read_input_tokens}")
print(f"Cache creation tokens: {response.usage.cache_creation_input_tokens}")
```
Anthropic Pricing
| Token Type | Claude Sonnet 4.5 | Claude Haiku 4.5 |
|---|---|---|
| Fresh input | $3.00/1M | $1.00/1M |
| Cache write | $3.75/1M | $1.25/1M |
| Cache read | $0.30/1M | $0.10/1M |
Cache reads are 90% cheaper than fresh tokens!
Google Gemini Prompt Caching
Google Gemini offers automatic caching similar to OpenAI, with particularly strong performance for its massive context windows (up to 2M tokens).
Gemini Caching Example
```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-2.0-flash")

# Long context will be cached automatically
context = """[Your 1000+ token context document]"""

chat = model.start_chat(history=[
    {"role": "user", "parts": [context]},
    {"role": "model", "parts": ["I understand the context. How can I help?"]},
])

# Subsequent messages reuse the cached context
response = chat.send_message("Summarize the key points")
```
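If you want explicit control over how long Gemini keeps a large context, the google-generativeai library also exposes a context-caching API. The sketch below is illustrative rather than definitive: the model version, display name, and TTL are assumptions, explicit caching requires a cache-capable versioned model, and cached content is billed for storage while it lives.

```python
import datetime
from google.generativeai import caching

# Create an explicit cache for the long context with a 30-minute TTL (assumed values).
cache = caching.CachedContent.create(
    model="models/gemini-1.5-flash-001",   # versioned, cache-capable model
    display_name="kb-context",
    system_instruction="Answer questions using the attached knowledge base.",
    contents=[context],                     # the long document from the example above
    ttl=datetime.timedelta(minutes=30),
)

cached_model = genai.GenerativeModel.from_cached_content(cached_content=cache)
response = cached_model.generate_content("Summarize the key points")
print(response.usage_metadata.cached_content_token_count)
```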
Best Practices for Prompt Caching
1. Structure Your Prompts Correctly
The most important rule: put static content at the beginning.
✅ Correct structure:
[System prompt] → [Context documents] → [Few-shot examples] → [User message]
❌ Wrong structure:
[User message] → [System prompt] → [Context]
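One way to enforce this ordering is a small request builder. The helper below is a generic sketch (its name and arguments are illustrative, not a specific SDK's API): everything static goes into a fixed prefix, and only the final user message changes per request.

```python
def build_messages(system_prompt, context_docs, few_shot_turns, user_message):
    """Assemble messages so the static, cacheable prefix always comes first."""
    static_block = system_prompt + "\n\n" + "\n\n".join(context_docs)
    messages = [{"role": "system", "content": static_block}]
    messages.extend(few_shot_turns)  # fixed example turns, identical on every request
    messages.append({"role": "user", "content": user_message})  # dynamic content last
    return messages
```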
2. Meet Minimum Token Requirements
- Ensure cacheable sections have 1,024+ tokens (a quick length check is sketched after this list)
- Shorter prompts won't benefit from caching
- Pad system prompts if needed (but don't add noise)
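A quick way to sanity-check length is to count tokens locally. The sketch below uses tiktoken's `o200k_base` encoding and the `system_prompt` from the OpenAI example above; Anthropic and Gemini use different tokenizers, so treat the count as a ballpark figure.

```python
import tiktoken

# Approximate token count for the cacheable section.
enc = tiktoken.get_encoding("o200k_base")
n_tokens = len(enc.encode(system_prompt))
if n_tokens < 1024:
    print(f"Only {n_tokens} tokens - below the caching minimum, this prompt won't be cached")
else:
    print(f"{n_tokens} tokens - eligible for caching")
```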
3. Maintain Cache Consistency
- Exact match required: even whitespace differences break cache
- Use consistent formatting across requests
- Avoid dynamic timestamps or IDs in cached sections (see the example after this list)
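For example, injecting a timestamp into the system prompt changes the prefix on every request, while keeping the system prompt byte-identical and passing per-request values in the user message preserves the cached prefix (reusing `system_prompt` from the earlier example):

```python
from datetime import datetime, timezone

# ❌ Breaks the cache: the cached prefix changes on every request.
bad_system = f"Current time: {datetime.now(timezone.utc).isoformat()}\n" + system_prompt

# ✅ Keeps the prefix stable: static system prompt first, per-request values
#    go into the (uncached) user message instead.
good_messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": f"(asked at {datetime.now(timezone.utc).isoformat()}) What changed today?"},
]
```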
4. Monitor Cache Hit Rates
- Track `cache_read_input_tokens` in responses (a hit-rate tracker is sketched after this list)
- Target a 70%+ cache hit rate for high-volume apps
- Investigate if hit rates drop unexpectedly
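A minimal tracker might look like the sketch below, built on the usage fields shown earlier (`cache_read_input_tokens` / `cache_creation_input_tokens` for Anthropic, `prompt_tokens_details.cached_tokens` for OpenAI). It reports a token-level hit rate; adapt the field access to your SDK versions.

```python
class CacheStats:
    """Accumulates cached vs. total input tokens across requests."""

    def __init__(self):
        self.cached_tokens = 0
        self.total_input_tokens = 0

    def record_anthropic(self, usage):
        read = usage.cache_read_input_tokens or 0
        created = usage.cache_creation_input_tokens or 0
        # input_tokens excludes cached tokens, so add all three for the true total.
        self.cached_tokens += read
        self.total_input_tokens += usage.input_tokens + read + created

    def record_openai(self, usage):
        details = getattr(usage, "prompt_tokens_details", None)
        self.cached_tokens += getattr(details, "cached_tokens", 0) or 0
        self.total_input_tokens += usage.prompt_tokens

    @property
    def hit_rate(self):
        if not self.total_input_tokens:
            return 0.0
        return self.cached_tokens / self.total_input_tokens
```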
When Should You Use Prompt Caching?
Ideal Use Cases
- Chatbots with consistent system prompts
- RAG applications with repeated document context
- Code assistants with codebase context
- Customer support with product documentation
- Data extraction pipelines with schema definitions
Less Effective For
- Short prompts (<1,024 tokens)
- Highly dynamic prompts with no consistent prefix
- One-off requests with unique context
Real Cost Calculation Example
Let's calculate savings for a chatbot with a 2,000-token system prompt handling 100,000 requests/month:
| Scenario | Without Caching | With Caching (90% hit rate) |
|---|---|---|
| System prompt tokens | 200M fresh | 20M cache writes + 180M cache reads |
| Cost (Claude Sonnet 4.5) | $600 | $75 + $54 = $129 |
| Savings | - | $471/month (~79%) |
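The same arithmetic in code, using the Claude Sonnet prices from the pricing table above:

```python
requests_per_month = 100_000
prompt_tokens = 2_000
hit_rate = 0.90
fresh_price, write_price, read_price = 3.00, 3.75, 0.30  # $ per 1M tokens

total_tokens = requests_per_month * prompt_tokens                 # 200M system-prompt tokens
baseline = total_tokens / 1e6 * fresh_price                       # $600.00
cache_writes = total_tokens * (1 - hit_rate) / 1e6 * write_price  # $75.00
cache_reads = total_tokens * hit_rate / 1e6 * read_price          # $54.00

print(f"Without caching: ${baseline:.2f}")
print(f"With caching:    ${cache_writes + cache_reads:.2f} (saves ${baseline - cache_writes - cache_reads:.2f})")
```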
Common Mistakes to Avoid
1. Putting Dynamic Content First
❌ User message before system prompt breaks caching entirely.
2. Ignoring Token Minimums
❌ Short system prompts won't be cached. Ensure 1,024+ tokens.
3. Not Monitoring Cache Performance
❌ Without tracking, you won't know if caching is working.
4. Inconsistent Prompt Formatting
❌ Adding/removing whitespace or changing order breaks cache.
Track Your Cache Performance
Effective caching requires monitoring. You need to track:
- Cache hit rate: % of requests that hit cache
- Cost savings: Actual vs theoretical savings
- Latency improvement: Time-to-first-token reduction (a measurement sketch follows)
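Time-to-first-token is easy to measure yourself with a streaming request. Below is a minimal sketch using the OpenAI client from earlier (the same pattern works with other providers' streaming APIs); the variable names in the trailing comment are placeholders.

```python
import time

def time_to_first_token(client, messages, model="gpt-5-mini"):
    """Return seconds until the first content token arrives on a streaming completion."""
    start = time.perf_counter()
    stream = client.chat.completions.create(model=model, messages=messages, stream=True)
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            return time.perf_counter() - start
    return None

# Compare a cold prefix vs. a warm (cached) one:
# ttft_cold = time_to_first_token(client, messages_with_new_prefix)
# ttft_warm = time_to_first_token(client, messages_with_cached_prefix)
```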
Monitor Your LLM Caching with Burnwise
Track cache hit rates, cost savings, and optimization opportunities across all providers. One-line SDK integration.
Start Free Trial
Next Steps
- Audit your prompts: Identify which have consistent prefixes
- Restructure for caching: Put static content first
- Implement provider-specific caching: Use code examples above
- Monitor performance: Track hit rates and savings
- Iterate: Optimize based on real data
For more cost optimization strategies, see our Complete Guide to LLM Cost Optimization and How to Reduce OpenAI Costs by 40%.
Questions? Check our SDK documentation or use our LLM Cost Calculator to estimate your savings.