Prompt Caching: Save 50-90% on LLM API Costs [2026 Guide]

January 12, 2026
11 min read

Prompt caching can reduce your LLM API costs by 50-90% and latency by up to 85%. Yet most teams don't use it—or use it incorrectly. This guide covers everything you need to implement prompt caching with OpenAI, Anthropic, and Google.

Prompt Caching Quick Facts (January 2026)

  • Cost Savings: 50% (OpenAI) to 90% (Anthropic) on cached tokens
  • Latency Reduction: Up to 85% faster for long prompts
  • Minimum Tokens: 1,024 tokens for cache eligibility
  • Cache TTL: 5 minutes (Anthropic, refreshed on each hit); 5-10 minutes for OpenAI, extended to 24 hours on GPT-5.1

What Is Prompt Caching?

Prompt caching is a feature that stores the computed key-value (KV) tensors from your prompt's attention layers. When you send a similar prompt, the provider retrieves the cached computation instead of reprocessing it from scratch.

Think of it like this: if you send the same system prompt with 100 different user questions, the model only processes the system prompt once. The remaining 99 requests reuse the cached result.

Why Does This Matter?

  • Cost: Cached tokens cost 50-90% less than fresh tokens
  • Speed: Skip reprocessing = faster time-to-first-token
  • Scale: High-volume apps save thousands per month

According to one estimate, about 31% of LLM queries closely resemble earlier queries, which represents a large amount of redundant computation when nothing is cached. Applications with knowledge bases or lengthy instructions commonly see 60-80% cost reductions from prompt caching.

How Does Prompt Caching Work?

When you send a prompt to an LLM, the model computes attention scores for every token. This computation is expensive, especially for long prompts.

Prompt caching works by:

  1. Processing your prompt and generating KV tensors for the attention layers
  2. Storing these tensors in GPU memory or fast storage
  3. Comparing new prompts to check if the prefix matches a cached entry
  4. Reusing cached tensors when a match is found, only processing new tokens

The key insight: only the prompt prefix can be cached. This means you must structure your prompts with static content (system prompt, context documents, examples) at the beginning, followed by dynamic content (user messages).
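
To make the prefix rule concrete, here is a deliberately simplified sketch of prefix-based cache lookup. It is not any provider's real implementation (providers match token prefixes incrementally rather than hashing exact strings), and compute_kv is a hypothetical stand-in for the expensive attention computation.

import hashlib

# Toy prefix cache: prefix hash -> precomputed "KV state" (a string here)
kv_cache = {}

def compute_kv(text: str) -> str:
    """Stand-in for the expensive attention computation over `text`."""
    return f"kv({len(text)} chars)"

def process_prompt(static_prefix: str, dynamic_suffix: str) -> str:
    key = hashlib.sha256(static_prefix.encode()).hexdigest()
    if key in kv_cache:
        prefix_kv = kv_cache[key]           # cache hit: reuse stored computation
    else:
        prefix_kv = compute_kv(static_prefix)
        kv_cache[key] = prefix_kv           # cache miss: compute once, store
    suffix_kv = compute_kv(dynamic_suffix)  # only the new tokens are processed
    return prefix_kv + " + " + suffix_kv

process_prompt("SYSTEM PROMPT ...", "What is machine learning?")  # miss: full work
process_prompt("SYSTEM PROMPT ...", "Explain neural networks")    # hit: prefix reused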

Provider Comparison: OpenAI vs Anthropic vs Google

Feature            OpenAI                        Anthropic               Google Gemini
Caching Type       Automatic                     Manual (cache_control)  Automatic
Cost Savings       50%                           90%                     50%
Min Tokens         1,024                         1,024 per checkpoint    1,024
Cache TTL          5-10 min (24h for GPT-5.1)    5 minutes               ~10 minutes
Max Breakpoints    N/A (automatic)               4 per prompt            N/A (automatic)
Code Changes       None required                 Required                None required

OpenAI Prompt Caching (Automatic)

OpenAI's approach is the most seamless: caching happens automatically with no code changes required.

How It Works

  • Automatic caching is enabled for prompts of 1,024 tokens or more
  • Cache hits occur in increments of 128 tokens
  • Matching prefixes get 50% cost reduction and up to 80% latency reduction
  • You cannot manually clear the cache

OpenAI Caching Example

# No special code needed - caching is automatic
from openai import OpenAI

client = OpenAI()

# This system prompt will be cached after first request
system_prompt = """You are a helpful assistant specialized in...
[Long system prompt with 1000+ tokens]
"""

# First request - full processing
response1 = client.chat.completions.create(
    model="gpt-5-mini",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "What is machine learning?"}
    ]
)

# Second request - system prompt cached, 50% cheaper
response2 = client.chat.completions.create(
    model="gpt-5-mini",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "Explain neural networks"}
    ]
)
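
To confirm caching is actually kicking in, inspect the usage object on the response. In current versions of the Chat Completions API, cached prompt tokens are reported under prompt_tokens_details.cached_tokens (0 means no cache hit):

# Check how many prompt tokens were served from cache on the second request
details = response2.usage.prompt_tokens_details
print(f"Cached prompt tokens: {details.cached_tokens}")
print(f"Total prompt tokens:  {response2.usage.prompt_tokens}")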

Pro tip: OpenAI recently rolled out 24-hour cache retention for GPT-5.1 series and GPT-4.1. This is ideal for applications with consistent system prompts.

Anthropic Prompt Caching (Manual Control)

Anthropic gives you explicit control over what gets cached using the cache_control parameter. This requires code changes but offers 90% cost savings.

Key Features

  • No special header required: prompt caching is generally available (the anthropic-beta: prompt-caching-2024-07-31 header was only needed during the beta)
  • Use cache_control parameter to mark cacheable sections
  • Up to 4 cache breakpoints per prompt
  • Minimum 1,024 tokens per cache breakpoint (2,048 for Haiku models)
  • 5-minute TTL that resets on each cache hit

Anthropic Caching Example

import anthropic

client = anthropic.Anthropic()

# Define cacheable system prompt
system_content = """You are a legal expert assistant...
[Long system prompt with 1000+ tokens]
"""

response = client.messages.create(
    model="claude-sonnet-4-5-20250929",
    max_tokens=1024,
    extra_headers={
        # Legacy beta header; prompt caching is now generally available,
        # so this is optional and kept only for backwards compatibility
        "anthropic-beta": "prompt-caching-2024-07-31"
    },
    system=[
        {
            "type": "text",
            "text": system_content,
            "cache_control": {"type": "ephemeral"}  # Mark for caching
        }
    ],
    messages=[
        {"role": "user", "content": "What are the key points of contract law?"}
    ]
)

# Check cache performance in response
print(f"Cache read tokens: {response.usage.cache_read_input_tokens}")
print(f"Cache creation tokens: {response.usage.cache_creation_input_tokens}")

Anthropic Pricing

Token Type    Claude Sonnet 4.5   Claude Haiku 4.5
Fresh input   $3.00/1M            $1.00/1M
Cache write   $3.75/1M            $1.25/1M
Cache read    $0.30/1M            $0.10/1M

Cache reads are 90% cheaper than fresh tokens!

Google Gemini Prompt Caching

Google Gemini offers implicit (automatic) caching on its newer models, similar to OpenAI's approach, plus an explicit context-caching API for guaranteed reuse. It pairs particularly well with Gemini's very large context windows (up to 2M tokens on some models).

Gemini Caching Example

import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")

model = genai.GenerativeModel('gemini-2.5-flash')  # implicit (automatic) caching is available on Gemini 2.5 models

# A long, stable context placed at the start of the conversation is eligible
# for implicit caching once it exceeds the minimum token count
context = """[Your 1000+ token context document]"""

chat = model.start_chat(history=[
    {"role": "user", "parts": [context]},
    {"role": "model", "parts": ["I understand the context. How can I help?"]}
])

# Subsequent messages that repeat the same prefix reuse the cached context
response = chat.send_message("Summarize the key points")
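
If you need guaranteed reuse and control over the TTL, the Gemini API also offers explicit context caching. The sketch below assumes the legacy google-generativeai package used above (the newer google-genai SDK exposes an equivalent caches API); the model version string and document are illustrative, since explicit caching requires a specific versioned model.

# Sketch: explicit context caching with a fixed TTL (details may vary by SDK version)
import datetime
from google.generativeai import caching

long_document = """[Your large reference document]"""

doc_cache = caching.CachedContent.create(
    model="models/gemini-2.0-flash-001",   # versioned model name (illustrative)
    display_name="product-docs",           # label for this cache entry
    system_instruction="You answer questions about the attached document.",
    contents=[long_document],
    ttl=datetime.timedelta(minutes=30),    # keep the cache alive for 30 minutes
)

cached_model = genai.GenerativeModel.from_cached_content(cached_content=doc_cache)
response = cached_model.generate_content("Summarize the key points")
print(response.text)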

Best Practices for Prompt Caching

1. Structure Your Prompts Correctly

The most important rule: put static content at the beginning.

✅ Correct structure:
[System prompt] → [Context documents] → [Few-shot examples] → [User message]

❌ Wrong structure:
[User message] → [System prompt] → [Context]

2. Meet Minimum Token Requirements

  • Ensure cacheable sections have 1,024+ tokens
  • Shorter prompts won't benefit from caching
  • Pad system prompts if needed (but don't add noise)

3. Maintain Cache Consistency

  • Exact match required: even whitespace differences break cache
  • Use consistent formatting across requests
  • Avoid dynamic timestamps or IDs in cached sections (see the sketch below)
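
For example, keep anything that changes per request, such as a timestamp, out of the cached prefix and inject it into the dynamic part instead. A minimal sketch using an OpenAI-style message list (names are illustrative):

from datetime import datetime, timezone

STATIC_SYSTEM_PROMPT = "You are a support assistant...\n[1,000+ tokens of stable instructions]"

def build_messages(user_question: str) -> list[dict]:
    # Bad: appending datetime.now() to STATIC_SYSTEM_PROMPT would change the
    # prefix on every request and break caching.
    # Good: keep the system prompt byte-for-byte identical and put volatile
    # values in the uncached, dynamic part of the prompt.
    now = datetime.now(timezone.utc).isoformat()
    return [
        {"role": "system", "content": STATIC_SYSTEM_PROMPT},
        {"role": "user", "content": f"Current time: {now}\n\n{user_question}"},
    ]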

4. Monitor Cache Hit Rates

  • Track cache_read_input_tokens in responses (a helper sketch follows this list)
  • Target 70%+ cache hit rate for high-volume apps
  • Investigate if hit rates drop unexpectedly
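
A hypothetical helper like the one below shows one way to compute a running hit rate. Field names follow Anthropic's usage object; for OpenAI, the equivalent signal is usage.prompt_tokens_details.cached_tokens.

class CacheStats:
    """Accumulates cache statistics from API usage objects."""

    def __init__(self):
        self.cached_tokens = 0
        self.total_input_tokens = 0

    def record(self, usage) -> None:
        cached = getattr(usage, "cache_read_input_tokens", 0) or 0
        written = getattr(usage, "cache_creation_input_tokens", 0) or 0
        fresh = usage.input_tokens  # uncached input tokens
        self.cached_tokens += cached
        self.total_input_tokens += cached + written + fresh

    @property
    def hit_rate(self) -> float:
        if self.total_input_tokens == 0:
            return 0.0
        return self.cached_tokens / self.total_input_tokens

stats = CacheStats()
# stats.record(response.usage)            # call after each API response
# print(f"Cache hit rate: {stats.hit_rate:.1%}")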

When Should You Use Prompt Caching?

Ideal Use Cases

  • Chatbots with consistent system prompts
  • RAG applications with repeated document context
  • Code assistants with codebase context
  • Customer support with product documentation
  • Data extraction pipelines with schema definitions

Less Effective For

  • Short prompts (<1,024 tokens)
  • Highly dynamic prompts with no consistent prefix
  • One-off requests with unique context

Real Cost Calculation Example

Let's calculate savings for a chatbot with a 2,000-token system prompt handling 100,000 requests/month:

Scenario               Without Caching   With Caching (90% hit rate)
System prompt tokens   200M tokens       20M cache writes + 180M cache reads
Cost (Claude Sonnet)   $600              $75 + $54 = $129
Savings                -                 $471/month (78.5%)
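
The same arithmetic in code, using the Claude Sonnet 4.5 prices from the pricing table above (a back-of-the-envelope estimate that ignores output tokens and the user-message portion of each request):

SYSTEM_PROMPT_TOKENS = 2_000
REQUESTS_PER_MONTH = 100_000
HIT_RATE = 0.90

# Claude Sonnet 4.5 prices per million input tokens
PRICE_FRESH = 3.00
PRICE_CACHE_WRITE = 3.75
PRICE_CACHE_READ = 0.30

total_tokens = SYSTEM_PROMPT_TOKENS * REQUESTS_PER_MONTH        # 200M tokens
cost_without = total_tokens / 1e6 * PRICE_FRESH                 # $600

cache_reads = total_tokens * HIT_RATE                           # 180M tokens
cache_writes = total_tokens * (1 - HIT_RATE)                    # 20M tokens
cost_with = (cache_writes / 1e6 * PRICE_CACHE_WRITE
             + cache_reads / 1e6 * PRICE_CACHE_READ)            # $75 + $54 = $129

savings = cost_without - cost_with                              # $471/month
print(f"Without caching: ${cost_without:,.0f}  With caching: ${cost_with:,.0f}  "
      f"Savings: ${savings:,.0f} ({savings / cost_without:.1%})")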

Common Mistakes to Avoid

1. Putting Dynamic Content First

❌ User message before system prompt breaks caching entirely.

2. Ignoring Token Minimums

❌ Short system prompts won't be cached. Ensure 1,024+ tokens.

3. Not Monitoring Cache Performance

❌ Without tracking, you won't know if caching is working.

4. Inconsistent Prompt Formatting

❌ Adding/removing whitespace or changing order breaks cache.

Track Your Cache Performance

Effective caching requires monitoring. You need to track:

  • Cache hit rate: % of requests that hit cache
  • Cost savings: Actual vs theoretical savings
  • Latency improvement: Time-to-first-token reduction

Monitor Your LLM Caching with Burnwise

Track cache hit rates, cost savings, and optimization opportunities across all providers. One-line SDK integration.

Start Free Trial

Next Steps

  1. Audit your prompts: Identify which have consistent prefixes
  2. Restructure for caching: Put static content first
  3. Implement provider-specific caching: Use code examples above
  4. Monitor performance: Track hit rates and savings
  5. Iterate: Optimize based on real data

For more cost optimization strategies, see our Complete Guide to LLM Cost Optimization and How to Reduce OpenAI Costs by 40%.

Questions? Check our SDK documentation or use our LLM Cost Calculator to estimate your savings.

Tags: prompt caching, cost optimization, OpenAI, Anthropic, Google, LLM

Put These Insights Into Practice

Burnwise tracks your LLM costs automatically and shows you exactly where to optimize.

Start Free Trial