Prompt caching can reduce your LLM API costs by 50-90% and latency by up to 85%. Yet most teams don't use it—or use it incorrectly. This guide covers everything you need to implement prompt caching with OpenAI, Anthropic, and Google.
Prompt Caching Quick Facts (January 2026)
- Cost Savings: 50% (OpenAI) to 90% (Anthropic) on cached tokens
- Latency Reduction: Up to 85% faster for long prompts
- Minimum Tokens: 1,024 tokens for cache eligibility
- Cache TTL: 5 minutes (Anthropic, refreshed on each hit), 5-10 minutes (OpenAI, 24 hours for GPT-5.1), ~10 minutes (Gemini)
What Is Prompt Caching?
Prompt caching is a feature that stores the computed key-value (KV) tensors from your prompt's attention layers. When you send a similar prompt, the provider retrieves the cached computation instead of reprocessing it from scratch.
Think of it like this: if you send the same system prompt with 100 different user questions, the model only processes the system prompt once. The remaining 99 requests reuse the cached result.
Why Does This Matter?
- Cost: Cached tokens cost 50-90% less than fresh tokens
- Speed: Skip reprocessing = faster time-to-first-token
- Scale: High-volume apps save thousands per month
Published research suggests that roughly 31% of LLM queries are semantically similar to earlier ones, which means a large share of prompt processing is redundant when nothing is cached. Applications with knowledge bases or lengthy instructions typically see 60-80% cost reductions from prompt caching.
How Does Prompt Caching Work?
When you send a prompt to an LLM, the model computes attention scores for every token. This computation is expensive, especially for long prompts.
Prompt caching works by:
- Processing your prompt and generating KV tensors for the attention layers
- Storing these tensors in GPU memory or fast storage
- Comparing new prompts to check if the prefix matches a cached entry
- Reusing cached tensors when a match is found, only processing new tokens
The key insight: only the prompt prefix can be cached. This means you must structure your prompts with static content (system prompt, context documents, examples) at the beginning, followed by dynamic content (user messages).
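To make the prefix rule concrete, here is a toy sketch: a dictionary keyed by hashes of prompt prefixes stands in for the cached KV tensors. Real providers store KV state in accelerator memory at fixed block sizes, so treat this purely as an illustration of why only a shared prefix can be reused.

```python
import hashlib

# Toy prefix cache: hash of a token prefix -> stand-in for precomputed KV state.
kv_cache = {}

def prefix_key(tokens, cut):
    return hashlib.sha256("\x1f".join(tokens[:cut]).encode()).hexdigest()

def process(tokens):
    # Find the longest prefix that has already been computed.
    reused = 0
    for cut in range(len(tokens), 0, -1):
        if prefix_key(tokens, cut) in kv_cache:
            reused = cut
            break
    print(f"reused {reused} cached segments, computed {len(tokens) - reused} fresh")
    # Remember every prefix of this prompt for future requests.
    for cut in range(1, len(tokens) + 1):
        kv_cache[prefix_key(tokens, cut)] = "kv-state"

static = ["<system prompt>", "<context docs>"]
process(static + ["What is machine learning?"])  # reused 0, computed 3
process(static + ["Explain neural networks"])    # reused 2, computed 1
```

Swapping the order (user question first) would change the very first segment, so nothing after it could ever match.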
Provider Comparison: OpenAI vs Anthropic vs Google
| Feature | OpenAI | Anthropic | Google Gemini |
|---|---|---|---|
| Caching Type | Automatic | Manual (cache_control) | Automatic |
| Cost Savings | 50% | 90% | 50% |
| Min Tokens | 1,024 | 1,024 per checkpoint | 1,024 |
| Cache TTL | 5-10 min (24h for GPT-5.1) | 5 minutes | ~10 minutes |
| Max Breakpoints | N/A (automatic) | 4 per prompt | N/A (automatic) |
| Code Changes | None required | Required | None required |
OpenAI Prompt Caching (Automatic)
OpenAI's approach is the most seamless: caching happens automatically with no code changes required.
How It Works
- Automatic caching is enabled for prompts exceeding 1,024 tokens
- Cache hits occur in increments of 128 tokens
- Matching prefixes get 50% cost reduction and up to 80% latency reduction
- You cannot manually clear the cache
OpenAI Caching Example
```python
# No special code needed - caching is automatic
from openai import OpenAI

client = OpenAI()

# This system prompt will be cached after the first request
system_prompt = """You are a helpful assistant specialized in...
[Long system prompt with 1000+ tokens]
"""

# First request - full processing
response1 = client.chat.completions.create(
    model="gpt-5-mini",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "What is machine learning?"},
    ],
)

# Second request - system prompt cached, 50% cheaper
response2 = client.chat.completions.create(
    model="gpt-5-mini",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "Explain neural networks"},
    ],
)
```
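To confirm the cache is actually being hit, inspect the usage details on the second response. Recent Chat Completions responses report cached prompt tokens under `usage.prompt_tokens_details.cached_tokens`; if your SDK version predates that field, the attribute may be missing, so the check below guards for it.

```python
# Cached prompt tokens appear in the usage details of the second response.
details = getattr(response2.usage, "prompt_tokens_details", None)
cached = getattr(details, "cached_tokens", 0) if details else 0
print(f"Prompt tokens: {response2.usage.prompt_tokens}, served from cache: {cached}")
```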
Pro tip: OpenAI recently rolled out 24-hour cache retention for GPT-5.1 series and GPT-4.1. This is ideal for applications with consistent system prompts.
Anthropic Prompt Caching (Manual Control)
Anthropic gives you explicit control over what gets cached using the cache_control parameter. This requires code changes but offers 90% cost savings.
Key Features
- Add the header: `anthropic-beta: prompt-caching-2024-07-31`
- Use the `cache_control` parameter to mark cacheable sections
- Up to 4 cache breakpoints per prompt
- Minimum 1,024 tokens per checkpoint
- 5-minute TTL that resets on each cache hit
Anthropic Caching Example
```python
import anthropic

client = anthropic.Anthropic()

# Define a cacheable system prompt
system_content = """You are a legal expert assistant...
[Long system prompt with 1000+ tokens]
"""

response = client.messages.create(
    model="claude-sonnet-4-5-20250929",
    max_tokens=1024,
    extra_headers={
        "anthropic-beta": "prompt-caching-2024-07-31"
    },
    system=[
        {
            "type": "text",
            "text": system_content,
            "cache_control": {"type": "ephemeral"},  # Mark for caching
        }
    ],
    messages=[
        {"role": "user", "content": "What are the key points of contract law?"}
    ],
)

# Check cache performance in the response
print(f"Cache read tokens: {response.usage.cache_read_input_tokens}")
print(f"Cache creation tokens: {response.usage.cache_creation_input_tokens}")
```
Anthropic Pricing
| Token Type | Claude Sonnet 4.5 | Claude Haiku 4.5 |
|---|---|---|
| Fresh input | $3.00/1M | $1.00/1M |
| Cache write | $3.75/1M | $1.25/1M |
| Cache read | $0.30/1M | $0.10/1M |
Cache reads are 90% cheaper than fresh tokens!
Google Gemini Prompt Caching
Google Gemini offers automatic caching similar to OpenAI, with particularly strong performance for its massive context windows (up to 2M tokens).
Gemini Caching Example
```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-2.0-flash")

# Long context will be cached automatically
context = """[Your 1000+ token context document]"""

chat = model.start_chat(history=[
    {"role": "user", "parts": [context]},
    {"role": "model", "parts": ["I understand the context. How can I help?"]},
])

# Subsequent messages reuse the cached context
response = chat.send_message("Summarize the key points")
```
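If you want explicit control over how long Gemini keeps a large context, the google-generativeai library also exposes a context-caching API. The sketch below is illustrative rather than definitive: the model version, display name, and TTL are assumptions, explicit caching requires a cache-capable versioned model, and cached content is billed for storage while it lives.

```python
import datetime
from google.generativeai import caching

# Create an explicit cache for the long context with a 30-minute TTL (assumed values).
cache = caching.CachedContent.create(
    model="models/gemini-1.5-flash-001",   # versioned, cache-capable model
    display_name="kb-context",
    system_instruction="Answer questions using the attached knowledge base.",
    contents=[context],                     # the long document from the example above
    ttl=datetime.timedelta(minutes=30),
)

cached_model = genai.GenerativeModel.from_cached_content(cached_content=cache)
response = cached_model.generate_content("Summarize the key points")
print(response.usage_metadata.cached_content_token_count)
```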
Best Practices for Prompt Caching
1. Structure Your Prompts Correctly
The most important rule: put static content at the beginning.
✅ Correct structure:
[System prompt] → [Context documents] → [Few-shot examples] → [User message]
❌ Wrong structure:
[User message] → [System prompt] → [Context]
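One way to enforce this ordering is a small request builder. The helper below is a generic sketch (its name and arguments are illustrative, not a specific SDK's API): everything static goes into a fixed prefix, and only the final user message changes per request.

```python
def build_messages(system_prompt, context_docs, few_shot_turns, user_message):
    """Assemble messages so the static, cacheable prefix always comes first."""
    static_block = system_prompt + "\n\n" + "\n\n".join(context_docs)
    messages = [{"role": "system", "content": static_block}]
    messages.extend(few_shot_turns)  # fixed example turns, identical on every request
    messages.append({"role": "user", "content": user_message})  # dynamic content last
    return messages
```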
2. Meet Minimum Token Requirements
- Ensure cacheable sections have 1,024+ tokens (a quick length check is sketched after this list)
- Shorter prompts won't benefit from caching
- Pad system prompts if needed (but don't add noise)
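A quick way to sanity-check length is to count tokens locally. The sketch below uses tiktoken's `o200k_base` encoding and the `system_prompt` from the OpenAI example above; Anthropic and Gemini use different tokenizers, so treat the count as a ballpark figure.

```python
import tiktoken

# Approximate token count for the cacheable section.
enc = tiktoken.get_encoding("o200k_base")
n_tokens = len(enc.encode(system_prompt))
if n_tokens < 1024:
    print(f"Only {n_tokens} tokens - below the caching minimum, this prompt won't be cached")
else:
    print(f"{n_tokens} tokens - eligible for caching")
```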
3. Maintain Cache Consistency
- Exact match required: even whitespace differences break cache
- Use consistent formatting across requests
- Avoid dynamic timestamps or IDs in cached sections (see the example after this list)
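For example, injecting a timestamp into the system prompt changes the prefix on every request, while keeping the system prompt byte-identical and passing per-request values in the user message preserves the cached prefix (reusing `system_prompt` from the earlier example):

```python
from datetime import datetime, timezone

# ❌ Breaks the cache: the cached prefix changes on every request.
bad_system = f"Current time: {datetime.now(timezone.utc).isoformat()}\n" + system_prompt

# ✅ Keeps the prefix stable: static system prompt first, per-request values
#    go into the (uncached) user message instead.
good_messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": f"(asked at {datetime.now(timezone.utc).isoformat()}) What changed today?"},
]
```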
4. Monitor Cache Hit Rates
- Track `cache_read_input_tokens` in responses (a hit-rate tracker is sketched after this list)
- Target a 70%+ cache hit rate for high-volume apps
- Investigate if hit rates drop unexpectedly
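A minimal tracker might look like the sketch below, built on the usage fields shown earlier (`cache_read_input_tokens` / `cache_creation_input_tokens` for Anthropic, `prompt_tokens_details.cached_tokens` for OpenAI). It reports a token-level hit rate; adapt the field access to your SDK versions.

```python
class CacheStats:
    """Accumulates cached vs. total input tokens across requests."""

    def __init__(self):
        self.cached_tokens = 0
        self.total_input_tokens = 0

    def record_anthropic(self, usage):
        read = usage.cache_read_input_tokens or 0
        created = usage.cache_creation_input_tokens or 0
        # input_tokens excludes cached tokens, so add all three for the true total.
        self.cached_tokens += read
        self.total_input_tokens += usage.input_tokens + read + created

    def record_openai(self, usage):
        details = getattr(usage, "prompt_tokens_details", None)
        self.cached_tokens += getattr(details, "cached_tokens", 0) or 0
        self.total_input_tokens += usage.prompt_tokens

    @property
    def hit_rate(self):
        if not self.total_input_tokens:
            return 0.0
        return self.cached_tokens / self.total_input_tokens
```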
When Should You Use Prompt Caching?
Ideal Use Cases
- Chatbots with consistent system prompts
- RAG applications with repeated document context
- Code assistants with codebase context
- Customer support with product documentation
- Data extraction pipelines with schema definitions
Less Effective For
- Short prompts (<1,024 tokens)
- Highly dynamic prompts with no consistent prefix
- One-off requests with unique context
Real Cost Calculation Example
Let's calculate savings for a chatbot with a 2,000-token system prompt handling 100,000 requests/month:
| Scenario | Without Caching | With Caching (90% hit rate) |
|---|---|---|
| System prompt tokens | 200M fresh | 20M cache writes + 180M cache reads |
| Cost (Claude Sonnet 4.5) | $600 | $75 + $54 = $129 |
| Savings | - | $471/month (~79%) |
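The same arithmetic in code, using the Claude Sonnet prices from the pricing table above:

```python
requests_per_month = 100_000
prompt_tokens = 2_000
hit_rate = 0.90
fresh_price, write_price, read_price = 3.00, 3.75, 0.30  # $ per 1M tokens

total_tokens = requests_per_month * prompt_tokens                 # 200M system-prompt tokens
baseline = total_tokens / 1e6 * fresh_price                       # $600.00
cache_writes = total_tokens * (1 - hit_rate) / 1e6 * write_price  # $75.00
cache_reads = total_tokens * hit_rate / 1e6 * read_price          # $54.00

print(f"Without caching: ${baseline:.2f}")
print(f"With caching:    ${cache_writes + cache_reads:.2f} (saves ${baseline - cache_writes - cache_reads:.2f})")
```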
Common Mistakes to Avoid
1. Putting Dynamic Content First
❌ User message before system prompt breaks caching entirely.
2. Ignoring Token Minimums
❌ Short system prompts won't be cached. Ensure 1,024+ tokens.
3. Not Monitoring Cache Performance
❌ Without tracking, you won't know if caching is working.
4. Inconsistent Prompt Formatting
❌ Adding/removing whitespace or changing order breaks cache.
Track Your Cache Performance
Effective caching requires monitoring. You need to track:
- Cache hit rate: % of requests that hit cache
- Cost savings: Actual vs theoretical savings
- Latency improvement: Time-to-first-token reduction (a measurement sketch follows)
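Time-to-first-token is easy to measure yourself with a streaming request. Below is a minimal sketch using the OpenAI client from earlier (the same pattern works with other providers' streaming APIs); the variable names in the trailing comment are placeholders.

```python
import time

def time_to_first_token(client, messages, model="gpt-5-mini"):
    """Return seconds until the first content token arrives on a streaming completion."""
    start = time.perf_counter()
    stream = client.chat.completions.create(model=model, messages=messages, stream=True)
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            return time.perf_counter() - start
    return None

# Compare a cold prefix vs. a warm (cached) one:
# ttft_cold = time_to_first_token(client, messages_with_new_prefix)
# ttft_warm = time_to_first_token(client, messages_with_cached_prefix)
```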
Monitor Your LLM Caching with Burnwise
Track cache hit rates, cost savings, and optimization opportunities across all providers. One-line SDK integration.
Start Free Trial
Next Steps
- Audit your prompts: Identify which have consistent prefixes
- Restructure for caching: Put static content first
- Implement provider-specific caching: Use code examples above
- Monitor performance: Track hit rates and savings
- Iterate: Optimize based on real data
For more cost optimization strategies, see our Complete Guide to LLM Cost Optimization and How to Reduce OpenAI Costs by 40%.
Questions? Check our SDK documentation or use our LLM Cost Calculator to estimate your savings.