Batch processing offers a flat 50% discount on LLM API costs from all major providers. If your workload doesn't need real-time responses, you're leaving money on the table. This guide covers everything from basic concepts to production implementation across OpenAI, Anthropic, and Google.
Batch Processing Quick Facts (January 2026)
- Cost Savings: 50% discount on input AND output tokens
- Completion Time: Results within 24 hours (often much faster)
- Providers: OpenAI, Anthropic Claude, Google Gemini all support it
- Rate Limits: Significantly higher (250M+ tokens enqueued)
What Is LLM Batch Processing?
Batch processing is an asynchronous API pattern where you submit multiple requests together and receive results within 24 hours instead of immediately. In exchange for giving up real-time responses, providers offer a 50% discount on all tokens.
The trade-off is simple:
- Real-time API: Instant responses, full price
- Batch API: 24-hour window, 50% off
For many workloads—data processing, content generation, evaluations—this trade-off is a no-brainer.
How Does the Batch API Work?
The workflow is straightforward but different from standard API calls:
- Create a JSONL file — Each line is a valid JSON request identical to real-time API format
- Upload the file — Send the file to the provider's servers
- Submit a batch job — Reference the uploaded file
- Poll for completion — Check status until results are ready
- Download results — Retrieve and map responses to original requests
Include a custom_id in each request so you can match responses back to their original queries.
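Each line of the JSONL file is one complete, self-contained request. Using OpenAI's format (walked through in full below), a single line looks like this:
{"custom_id": "request-1", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "gpt-5-mini", "messages": [{"role": "user", "content": "Summarize this document..."}]}}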
Batch Pricing Comparison (January 2026)
OpenAI Batch Pricing
| Model | Standard Input/1M | Batch Input/1M | Standard Output/1M | Batch Output/1M |
|---|---|---|---|---|
| GPT-5.2 | $1.75 | $0.875 | $14.00 | $7.00 |
| GPT-5-mini | $0.30 | $0.15 | $1.00 | $0.50 |
| GPT-4.1 | $2.00 | $1.00 | $8.00 | $4.00 |
| o4-mini | $1.10 | $0.55 | $4.40 | $2.20 |
Anthropic Claude Batch Pricing
| Model | Standard Input/1M | Batch Input/1M | Standard Output/1M | Batch Output/1M |
|---|---|---|---|---|
| Claude Opus 4.5 | $5.00 | $2.50 | $25.00 | $12.50 |
| Claude Sonnet 4.5 | $3.00 | $1.50 | $15.00 | $7.50 |
| Claude Haiku 4.5 | $1.00 | $0.50 | $5.00 | $2.50 |
Google Gemini Batch Pricing
| Model | Standard Input/1M | Batch Input/1M | Standard Output/1M | Batch Output/1M |
|---|---|---|---|---|
| Gemini 3.0 Pro | $2.00 | $1.00 | $12.00 | $6.00 |
| Gemini 3.0 Flash | $0.50 | $0.25 | $3.00 | $1.50 |
| Gemini 2.5 Pro | $1.25 | $0.625 | $10.00 | $5.00 |
Ideal Use Cases for Batch Processing
Perfect for Batch Processing
- Bulk content generation: Blog posts, product descriptions, marketing copy
- Data extraction at scale: Processing thousands of documents
- Training data generation: Creating datasets for fine-tuning
- Prompt evaluations: Testing prompts against large datasets
- Document classification: Categorizing large document collections
- Nightly analytics jobs: Processing daily data pipelines
- Embedding generation: Vectorizing large document corpora
NOT Suitable for Batch Processing
- User-facing chat: Users expect immediate responses
- Real-time assistants: Interactive applications need instant feedback
- Streaming responses: Progressive rendering requires real-time API
- Time-sensitive decisions: Trading, alerts, urgent notifications
OpenAI Batch API Implementation
Step 1: Create the JSONL File
import json

# Create batch requests
requests = [
    {
        "custom_id": "request-1",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-5-mini",
            "messages": [
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": "Summarize this document..."}
            ],
            "max_tokens": 500
        }
    },
    {
        "custom_id": "request-2",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-5-mini",
            "messages": [
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": "Extract key entities..."}
            ],
            "max_tokens": 500
        }
    }
]

# Write to JSONL file
with open("batch_requests.jsonl", "w") as f:
    for req in requests:
        f.write(json.dumps(req) + "\n")
Step 2: Upload and Submit Batch
from openai import OpenAI

client = OpenAI()

# Upload the batch file
batch_file = client.files.create(
    file=open("batch_requests.jsonl", "rb"),
    purpose="batch"
)

# Create the batch job
batch_job = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h"
)

print(f"Batch ID: {batch_job.id}")
print(f"Status: {batch_job.status}")
Step 3: Poll for Completion
import time

def wait_for_batch(client, batch_id, poll_interval=60):
    """Poll batch status until completion."""
    while True:
        batch = client.batches.retrieve(batch_id)
        print(f"Status: {batch.status}")
        if batch.status == "completed":
            return batch
        elif batch.status in ["failed", "expired", "cancelled"]:
            raise Exception(f"Batch failed with status: {batch.status}")
        time.sleep(poll_interval)

# Wait for completion
completed_batch = wait_for_batch(client, batch_job.id)
Step 4: Download and Process Results
# Download results file
result_file = client.files.content(completed_batch.output_file_id)
results = result_file.text

# Parse results (JSONL format)
for line in results.strip().split("\n"):
    result = json.loads(line)
    custom_id = result["custom_id"]
    response = result["response"]["body"]["choices"][0]["message"]["content"]
    print(f"{custom_id}: {response[:100]}...")
Anthropic Claude Batch API
Anthropic offers batch processing with the same 50% discount. The API is similar but uses Anthropic's message format.
import anthropic

client = anthropic.Anthropic()

# Create batch request
batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": "doc-1",
            "params": {
                "model": "claude-sonnet-4-5-20250929",
                "max_tokens": 1024,
                "messages": [
                    {"role": "user", "content": "Summarize this document..."}
                ]
            }
        },
        {
            "custom_id": "doc-2",
            "params": {
                "model": "claude-sonnet-4-5-20250929",
                "max_tokens": 1024,
                "messages": [
                    {"role": "user", "content": "Extract key insights..."}
                ]
            }
        }
    ]
)

print(f"Batch ID: {batch.id}")
Combining Batch with Prompt Caching
Anthropic uniquely allows stacking discounts. You can combine batch processing (50% off) with prompt caching (90% off cached tokens):
# Batch + Prompt Caching combined
batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": "cached-1",
            "params": {
                "model": "claude-sonnet-4-5-20250929",
                "max_tokens": 1024,
                "system": [
                    {
                        "type": "text",
                        "text": "Long system prompt with context...",
                        "cache_control": {"type": "ephemeral"}
                    }
                ],
                "messages": [
                    {"role": "user", "content": "Question 1"}
                ]
            }
        }
        # More requests with same cached system prompt...
    ]
)
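One caveat: batch requests may be processed concurrently and in any order, so cache hits within a batch are best-effort rather than guaranteed. Grouping requests that share the same cached prefix into one batch improves the hit rate.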
Best Practices for Batch Processing
1. Always Use Custom IDs
Results may return in different order than submitted. Always include a unique custom_id to map responses back to requests.
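One simple pattern (a sketch that reuses the requests list and results string from the OpenAI example above) is to keep a lookup from custom_id to the original input when building the batch, then join on it when parsing results:
# Build the lookup while creating requests
originals = {req["custom_id"]: req["body"]["messages"][-1]["content"] for req in requests}

# Later, when parsing the results file
for line in results.strip().split("\n"):
    result = json.loads(line)
    prompt = originals[result["custom_id"]]
    answer = result["response"]["body"]["choices"][0]["message"]["content"]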
2. Implement Retry Logic
Some requests in a batch may fail. Check the error file and retry failed requests:
def handle_batch_errors(client, batch):
    """Process errors from completed batch."""
    if batch.error_file_id:
        errors = client.files.content(batch.error_file_id).text
        failed_ids = []
        for line in errors.strip().split("\n"):
            error = json.loads(line)
            failed_ids.append(error["custom_id"])
            print(f"Failed: {error['custom_id']} - {error['error']}")
        return failed_ids
    return []
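To actually retry, one approach (a sketch, assuming you still have the original requests list in memory) is to filter the failed IDs into a new JSONL file and submit it as a fresh batch:
def resubmit_failed(client, requests, failed_ids, path="batch_retry.jsonl"):
    """Write failed requests to a new JSONL file and submit a fresh batch."""
    failed = set(failed_ids)
    retry_requests = [r for r in requests if r["custom_id"] in failed]
    with open(path, "w") as f:
        for req in retry_requests:
            f.write(json.dumps(req) + "\n")
    retry_file = client.files.create(file=open(path, "rb"), purpose="batch")
    return client.batches.create(
        input_file_id=retry_file.id,
        endpoint="/v1/chat/completions",
        completion_window="24h"
    )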
3. Optimize Batch Size
The discount doesn't depend on how many requests you put in one batch, but providers do cap batch jobs: OpenAI's Batch API, for example, accepts up to 50,000 requests per input file. Split very large workloads across multiple batches, and keep the 24-hour completion window in mind when sizing them.
4. Set Appropriate Timeouts
Uploading a large JSONL file can take longer than an aggressively configured client timeout allows. If your client uses a short timeout, raise it to 60 seconds or more:
client = OpenAI(timeout=60.0)
5. Monitor Batch Status
Poll status every 30-60 seconds. Don't poll too frequently—it's unnecessary and may hit rate limits.
6. Handle Partial Failures
A batch can complete with some requests failed. Always check both the output file AND the error file.
Real Cost Calculation Example
Let's calculate savings for processing 10,000 documents with GPT-5-mini:
| Metric | Real-time API | Batch API |
|---|---|---|
| Documents | 10,000 | 10,000 |
| Avg input tokens/doc | 2,000 | 2,000 |
| Avg output tokens/doc | 500 | 500 |
| Total input tokens | 20M | 20M |
| Total output tokens | 5M | 5M |
| Input cost | $6.00 | $3.00 |
| Output cost | $5.00 | $2.50 |
| Total cost | $11.00 | $5.50 |
| Savings | - | $5.50 (50%) |
For larger workloads, savings scale linearly. Processing 1M documents monthly saves $550 with GPT-5-mini alone.
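The arithmetic is easy to script if you want to estimate other workloads. A minimal sketch using the GPT-5-mini rates from the pricing table above:
# GPT-5-mini rates from the table above, in dollars per 1M tokens
STANDARD = {"input": 0.30, "output": 1.00}
BATCH = {"input": 0.15, "output": 0.50}

def job_cost(docs, input_tokens_per_doc, output_tokens_per_doc, rates):
    """Total cost of a job at the given per-1M-token rates."""
    total_input = docs * input_tokens_per_doc / 1_000_000
    total_output = docs * output_tokens_per_doc / 1_000_000
    return total_input * rates["input"] + total_output * rates["output"]

standard = job_cost(10_000, 2_000, 500, STANDARD)  # $11.00
batch = job_cost(10_000, 2_000, 500, BATCH)        # $5.50
print(f"Standard: ${standard:.2f}  Batch: ${batch:.2f}  Savings: ${standard - batch:.2f}")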
Batch Processing vs Async Concurrency
Don't confuse batch processing with async/concurrent API calls:
| Feature | Batch API | Async Concurrency |
|---|---|---|
| Response time | Up to 24 hours | Seconds |
| Cost | 50% discount | Full price |
| Rate limits | Much higher (250M+) | Standard limits |
| Use case | Background jobs | Real-time throughput |
Use async concurrency when you need fast responses at scale. Use batch when you can wait 24 hours for 50% savings.
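For comparison, here is what the async-concurrency side looks like: full price, but results in seconds. A minimal sketch using the OpenAI SDK's AsyncOpenAI client with a semaphore to stay under rate limits (the concurrency cap and prompts are illustrative):
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI()
semaphore = asyncio.Semaphore(10)  # illustrative concurrency cap

async def complete(prompt: str) -> str:
    # Bound the number of in-flight requests
    async with semaphore:
        response = await client.chat.completions.create(
            model="gpt-5-mini",
            messages=[{"role": "user", "content": prompt}]
        )
        return response.choices[0].message.content

async def main(prompts):
    return await asyncio.gather(*(complete(p) for p in prompts))

results = asyncio.run(main(["Summarize doc 1...", "Summarize doc 2..."]))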
Common Mistakes to Avoid
1. Using Batch for User-Facing Features
Users won't wait 24 hours. Batch is for background processing only.
2. Not Handling Partial Failures
Some requests in a batch may fail. Always check error files and implement retry logic.
3. Forgetting Custom IDs
Without custom IDs, you can't map responses to requests. Always include them.
4. Polling Too Frequently
Checking status every second wastes resources. Poll every 30-60 seconds.
5. Ignoring the 24-Hour Window
Plan your pipelines around the full 24-hour completion window. Most batches finish much sooner, but don't build schedules that depend on early completion.
Combining with Other Optimizations
Batch processing stacks with other cost optimization techniques:
Batch + Prompt Caching
Anthropic allows combining batch (50% off) with prompt caching (90% off cached tokens). For repeated context across batch requests, this can yield 95%+ savings on cached portions.
Batch + Model Routing
Use model routing within your batch to send simple tasks to cheaper models. Combined with batch discount, you can achieve 75-90% total savings.
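A minimal sketch of routing while building the JSONL file (the is_simple heuristic is a placeholder, and the model names mirror the pricing table above):
def build_request(custom_id, prompt):
    # Send short or simple prompts to the cheaper model, the rest to the larger one
    model = "gpt-5-mini" if is_simple(prompt) else "gpt-5.2"
    return {
        "custom_id": custom_id,
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {"model": model, "messages": [{"role": "user", "content": prompt}]}
    }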
Batch + Smaller Models
For classification and extraction tasks, GPT-5-mini or Claude Haiku 4.5 often suffice. Batch + cheap model = maximum savings.
Decision Framework: When to Use Batch
Use Batch Processing When:
- Processing data overnight or during off-hours
- Generating training data or embeddings
- Running evaluations or benchmarks
- Bulk content generation for queued publishing
- Data transformation pipelines
Use Real-Time API When:
- Users are waiting for responses
- You need streaming for progressive display
- Response latency is critical
- Interactive applications
Track Your Batch Processing Costs with Burnwise
Monitor batch vs real-time usage, track savings, and get recommendations for which workloads to move to batch. One-line SDK integration.
Next Steps
- Audit your workloads: Identify which don't need real-time responses
- Start with one pipeline: Move a single batch job first
- Measure savings: Track actual cost reduction
- Expand gradually: Move more workloads as you gain confidence
- Combine optimizations: Add prompt caching and model routing
For more cost optimization strategies, see our Prompt Caching Guide (50-90% savings) and Model Routing Guide (85% cost reduction).
Check our AI Pricing page for current model costs or use the LLM Cost Calculator to estimate your batch savings.