LLM Batch Processing: Save 50% on OpenAI, Claude & Gemini APIs

January 12, 2026
13 min read

Batch processing offers a flat 50% discount on LLM API costs from all major providers. If your workload doesn't need real-time responses, you're leaving money on the table. This guide covers everything from basic concepts to production implementation across OpenAI, Anthropic, and Google.

Batch Processing Quick Facts (January 2026)

  • Cost Savings: 50% discount on input AND output tokens
  • Completion Time: Results within 24 hours (often much faster)
  • Providers: OpenAI, Anthropic Claude, Google Gemini all support it
  • Rate Limits: Significantly higher (250M+ tokens enqueued)

What Is LLM Batch Processing?

Batch processing is an asynchronous API pattern where you submit multiple requests together and receive results within 24 hours instead of immediately. In exchange for giving up real-time responses, providers offer a 50% discount on all tokens.

The trade-off is simple:

  • Real-time API: Instant responses, full price
  • Batch API: 24-hour window, 50% off

For many workloads—data processing, content generation, evaluations—this trade-off is a no-brainer.

How Does the Batch API Work?

The workflow is straightforward but different from standard API calls:

  1. Create a JSONL file — Each line is a JSON object containing a custom_id and a request body in the same format as the real-time API
  2. Upload the file — Send the file to the provider's servers
  3. Submit a batch job — Reference the uploaded file
  4. Poll for completion — Check status until results are ready
  5. Download results — Retrieve and map responses to original requests

Important: Result order may differ from input order. Always include a custom_id in each request to match responses to their original queries.

Batch Pricing Comparison (January 2026)

OpenAI Batch Pricing

Model | Standard Input/1M | Batch Input/1M | Standard Output/1M | Batch Output/1M
GPT-5.2 | $1.75 | $0.875 | $14.00 | $7.00
GPT-5-mini | $0.30 | $0.15 | $1.00 | $0.50
GPT-4.1 | $2.00 | $1.00 | $8.00 | $4.00
o4-mini | $1.10 | $0.55 | $4.40 | $2.20

Anthropic Claude Batch Pricing

Model | Standard Input/1M | Batch Input/1M | Standard Output/1M | Batch Output/1M
Claude Opus 4.5 | $5.00 | $2.50 | $25.00 | $12.50
Claude Sonnet 4.5 | $3.00 | $1.50 | $15.00 | $7.50
Claude Haiku 4.5 | $1.00 | $0.50 | $5.00 | $2.50

Google Gemini Batch Pricing

Model | Standard Input/1M | Batch Input/1M | Standard Output/1M | Batch Output/1M
Gemini 3.0 Pro | $2.00 | $1.00 | $12.00 | $6.00
Gemini 3.0 Flash | $0.50 | $0.25 | $3.00 | $1.50
Gemini 2.5 Pro | $1.25 | $0.625 | $10.00 | $5.00

Ideal Use Cases for Batch Processing

Perfect for Batch Processing

  • Bulk content generation: Blog posts, product descriptions, marketing copy
  • Data extraction at scale: Processing thousands of documents
  • Training data generation: Creating datasets for fine-tuning
  • Prompt evaluations: Testing prompts against large datasets
  • Document classification: Categorizing large document collections
  • Nightly analytics jobs: Processing daily data pipelines
  • Embedding generation: Vectorizing large document corpora

NOT Suitable for Batch Processing

  • User-facing chat: Users expect immediate responses
  • Real-time assistants: Interactive applications need instant feedback
  • Streaming responses: Progressive rendering requires real-time API
  • Time-sensitive decisions: Trading, alerts, urgent notifications

Rule of thumb: If a user is waiting for the response, use real-time API. If it's a background job, use batch.

OpenAI Batch API Implementation

Step 1: Create the JSONL File

import json

# Create batch requests
requests = [
    {
        "custom_id": "request-1",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-5-mini",
            "messages": [
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": "Summarize this document..."}
            ],
            "max_tokens": 500
        }
    },
    {
        "custom_id": "request-2",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-5-mini",
            "messages": [
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": "Extract key entities..."}
            ],
            "max_tokens": 500
        }
    }
]

# Write to JSONL file
with open("batch_requests.jsonl", "w") as f:
    for req in requests:
        f.write(json.dumps(req) + "\n")

Step 2: Upload and Submit Batch

from openai import OpenAI

client = OpenAI()

# Upload the batch file
batch_file = client.files.create(
    file=open("batch_requests.jsonl", "rb"),
    purpose="batch"
)

# Create the batch job
batch_job = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h"
)

print(f"Batch ID: {batch_job.id}")
print(f"Status: {batch_job.status}")

Step 3: Poll for Completion

import time

def wait_for_batch(client, batch_id, poll_interval=60):
    """Poll batch status until completion."""
    while True:
        batch = client.batches.retrieve(batch_id)
        print(f"Status: {batch.status}")

        if batch.status == "completed":
            return batch
        elif batch.status in ["failed", "expired", "cancelled"]:
            raise Exception(f"Batch failed with status: {batch.status}")

        time.sleep(poll_interval)

# Wait for completion
completed_batch = wait_for_batch(client, batch_job.id)

Step 4: Download and Process Results

# Download results file
result_file = client.files.content(completed_batch.output_file_id)
results = result_file.text

# Parse results (JSONL format)
for line in results.strip().split("\n"):
    result = json.loads(line)
    custom_id = result["custom_id"]
    response = result["response"]["body"]["choices"][0]["message"]["content"]
    print(f"{custom_id}: {response[:100]}...")

Anthropic Claude Batch API

Anthropic offers batch processing with the same 50% discount. The API is similar but uses Anthropic's message format.

import anthropic

client = anthropic.Anthropic()

# Create batch request
batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": "doc-1",
            "params": {
                "model": "claude-sonnet-4-5-20250929",
                "max_tokens": 1024,
                "messages": [
                    {"role": "user", "content": "Summarize this document..."}
                ]
            }
        },
        {
            "custom_id": "doc-2",
            "params": {
                "model": "claude-sonnet-4-5-20250929",
                "max_tokens": 1024,
                "messages": [
                    {"role": "user", "content": "Extract key insights..."}
                ]
            }
        }
    ]
)

print(f"Batch ID: {batch.id}")

Combining Batch with Prompt Caching

Anthropic explicitly supports stacking discounts. You can combine batch processing (50% off) with prompt caching (90% off cached tokens):

# Batch + Prompt Caching combined
batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": "cached-1",
            "params": {
                "model": "claude-sonnet-4-5-20250929",
                "max_tokens": 1024,
                "system": [
                    {
                        "type": "text",
                        "text": "Long system prompt with context...",
                        "cache_control": {"type": "ephemeral"}
                    }
                ],
                "messages": [
                    {"role": "user", "content": "Question 1"}
                ]
            }
        }
        # More requests with same cached system prompt...
    ]
)

Best Practices for Batch Processing

1. Always Use Custom IDs

Results may return in a different order than they were submitted. Always include a unique custom_id to map responses back to requests.

2. Implement Retry Logic

Some requests in a batch may fail. Check the error file and retry failed requests:

def handle_batch_errors(client, batch):
    """Process errors from completed batch."""
    if batch.error_file_id:
        errors = client.files.content(batch.error_file_id).text
        failed_ids = []
        for line in errors.strip().split("\n"):
            error = json.loads(line)
            failed_ids.append(error["custom_id"])
            print(f"Failed: {error['custom_id']} - {error['error']}")
        return failed_ids
    return []
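
A straightforward retry strategy is to rebuild a JSONL file containing only the failed requests and submit it as a new batch job. The sketch below assumes you still have the original requests list in memory and that failed_ids comes from handle_batch_errors above:

def retry_failed_requests(client, original_requests, failed_ids):
    """Resubmit failed requests as a new batch job."""
    failed = set(failed_ids)
    retry_requests = [r for r in original_requests if r["custom_id"] in failed]
    if not retry_requests:
        return None

    # Write a fresh JSONL file containing only the failed requests.
    with open("batch_retry.jsonl", "w") as f:
        for req in retry_requests:
            f.write(json.dumps(req) + "\n")

    retry_file = client.files.create(
        file=open("batch_retry.jsonl", "rb"),
        purpose="batch"
    )
    return client.batches.create(
        input_file_id=retry_file.id,
        endpoint="/v1/chat/completions",
        completion_window="24h"
    )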

3. Optimize Batch Size

Providers cap batch job size; OpenAI, for example, limits a single batch to 50,000 requests and a 200 MB input file. For larger workloads, split requests across multiple batch jobs (see the sketch below), keeping in mind that each job has its own 24-hour completion window.
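
A minimal chunking sketch under that assumption: max_per_batch is a number you choose below your provider's documented request limit, and file size is not checked here.

def submit_in_chunks(client, requests, max_per_batch=50_000):
    """Split a large request list across multiple batch jobs."""
    jobs = []
    for i in range(0, len(requests), max_per_batch):
        chunk = requests[i:i + max_per_batch]
        path = f"batch_chunk_{i // max_per_batch}.jsonl"
        # Write one JSONL file per chunk.
        with open(path, "w") as f:
            for req in chunk:
                f.write(json.dumps(req) + "\n")
        batch_file = client.files.create(file=open(path, "rb"), purpose="batch")
        jobs.append(client.batches.create(
            input_file_id=batch_file.id,
            endpoint="/v1/chat/completions",
            completion_window="24h"
        ))
    return jobs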

4. Set Appropriate Timeouts

Uploading a large JSONL file can take longer than a tight HTTP client timeout allows. If you have configured a short timeout on your client, raise it to 60+ seconds for batch uploads:

client = OpenAI(timeout=60.0)

5. Monitor Batch Status

Poll status every 30-60 seconds. Don't poll too frequently—it's unnecessary and may hit rate limits.

6. Handle Partial Failures

A batch can complete with some requests failed. Always check both the output file AND the error file.
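
One way to do this is to merge both files into a single mapping keyed by custom_id, so downstream code sees successes and failures in one place. A sketch using the OpenAI client from the earlier steps:

def collect_batch_results(client, batch):
    """Map custom_id -> {"ok": bool, "data": ...} from output and error files."""
    results = {}

    # Successful requests land in the output file.
    if batch.output_file_id:
        for line in client.files.content(batch.output_file_id).text.strip().split("\n"):
            item = json.loads(line)
            content = item["response"]["body"]["choices"][0]["message"]["content"]
            results[item["custom_id"]] = {"ok": True, "data": content}

    # Failed requests land in the error file.
    if batch.error_file_id:
        for line in client.files.content(batch.error_file_id).text.strip().split("\n"):
            item = json.loads(line)
            results[item["custom_id"]] = {"ok": False, "data": item["error"]}

    return results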

Real Cost Calculation Example

Let's calculate savings for processing 10,000 documents with GPT-5-mini:

Metric | Real-time API | Batch API
Documents | 10,000 | 10,000
Avg input tokens/doc | 2,000 | 2,000
Avg output tokens/doc | 500 | 500
Total input tokens | 20M | 20M
Total output tokens | 5M | 5M
Input cost | $6.00 | $3.00
Output cost | $5.00 | $2.50
Total cost | $11.00 | $5.50
Savings | - | $5.50 (50%)

For larger workloads, savings scale linearly. Processing 1M documents monthly saves $550 with GPT-5-mini alone.
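
The same arithmetic as a small helper, using the GPT-5-mini rates from the pricing table above (swap in your own model's rates):

def batch_savings(docs, in_tokens_per_doc, out_tokens_per_doc,
                  input_per_m, output_per_m, batch_discount=0.5):
    """Compare real-time vs batch cost for a workload."""
    input_m = docs * in_tokens_per_doc / 1_000_000    # total input tokens, in millions
    output_m = docs * out_tokens_per_doc / 1_000_000  # total output tokens, in millions
    realtime = input_m * input_per_m + output_m * output_per_m
    batch = realtime * (1 - batch_discount)
    return realtime, batch, realtime - batch

# 10,000 docs at GPT-5-mini rates ($0.30 input / $1.00 output per 1M tokens)
realtime, batch, saved = batch_savings(10_000, 2_000, 500, 0.30, 1.00)
print(f"Real-time: ${realtime:.2f}  Batch: ${batch:.2f}  Saved: ${saved:.2f}")
# Real-time: $11.00  Batch: $5.50  Saved: $5.50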

Batch Processing vs Async Concurrency

Don't confuse batch processing with async/concurrent API calls:

Feature | Batch API | Async Concurrency
Response time | Up to 24 hours | Seconds
Cost | 50% discount | Full price
Rate limits | Much higher (250M+) | Standard limits
Use case | Background jobs | Real-time throughput

Use async concurrency when you need fast responses at scale. Use batch when you can wait 24 hours for 50% savings.
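
For contrast, here is what async concurrency looks like: full-price, real-time requests fired in parallel. A minimal sketch using the OpenAI SDK's AsyncOpenAI client:

import asyncio
from openai import AsyncOpenAI

async_client = AsyncOpenAI()

async def summarize(doc: str) -> str:
    # A normal real-time request, billed at full price.
    response = await async_client.chat.completions.create(
        model="gpt-5-mini",
        messages=[{"role": "user", "content": f"Summarize: {doc}"}],
    )
    return response.choices[0].message.content

async def main(docs: list[str]) -> list[str]:
    # Fire requests concurrently; standard rate limits still apply.
    return await asyncio.gather(*(summarize(d) for d in docs))

# summaries = asyncio.run(main(["doc one...", "doc two..."]))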

Common Mistakes to Avoid

1. Using Batch for User-Facing Features

Users won't wait 24 hours. Batch is for background processing only.

2. Not Handling Partial Failures

Some requests in a batch may fail. Always check error files and implement retry logic.

3. Forgetting Custom IDs

Without custom IDs, you can't map responses to requests. Always include them.

4. Polling Too Frequently

Checking status every second wastes resources. Poll every 30-60 seconds.

5. Ignoring the 24-Hour Window

Plan your pipelines around the 24-hour completion window. Most batches complete much faster, but don't rely on it.

Combining with Other Optimizations

Batch processing stacks with other cost optimization techniques:

Batch + Prompt Caching

Anthropic allows combining batch (50% off) with prompt caching (90% off cached tokens). For repeated context across batch requests, this can yield 95%+ savings on cached portions.

Batch + Model Routing

Use model routing within your batch to send simple tasks to cheaper models. Combined with batch discount, you can achieve 75-90% total savings.
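
A sketch of routing at JSONL-build time: group requests by model and submit one batch per model so each batch stays homogeneous. is_simple_task() is a placeholder for your own routing heuristic, and the model names simply mirror the pricing tables above.

from collections import defaultdict

def build_routed_requests(tasks):
    """Group batch requests per model so each model gets its own batch job."""
    per_model = defaultdict(list)
    for i, task in enumerate(tasks):
        # is_simple_task() stands in for task type, input length,
        # or a lightweight classifier.
        model = "gpt-5-mini" if is_simple_task(task) else "gpt-5.2"
        per_model[model].append({
            "custom_id": f"task-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": model,
                "messages": [{"role": "user", "content": task["prompt"]}],
                "max_tokens": 500
            }
        })
    return per_model  # write one JSONL file and submit one batch per model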

Batch + Smaller Models

For classification and extraction tasks, GPT-5-mini or Claude Haiku 4.5 often suffice. Batch + cheap model = maximum savings.

Decision Framework: When to Use Batch

Use Batch Processing When:

  • Processing data overnight or during off-hours
  • Generating training data or embeddings
  • Running evaluations or benchmarks
  • Bulk content generation for queued publishing
  • Data transformation pipelines

Use Real-Time API When:

  • Users are waiting for responses
  • You need streaming for progressive display
  • Response latency is critical
  • Interactive applications

Track Your Batch Processing Costs with Burnwise

Monitor batch vs real-time usage, track savings, and get recommendations for which workloads to move to batch. One-line SDK integration.

Start Free Trial

Next Steps

  1. Audit your workloads: Identify which don't need real-time responses
  2. Start with one pipeline: Move a single batch job first
  3. Measure savings: Track actual cost reduction
  4. Expand gradually: Move more workloads as you gain confidence
  5. Combine optimizations: Add prompt caching and model routing

For more cost optimization strategies, see our Prompt Caching Guide (50-90% savings) and Model Routing Guide (85% cost reduction).

Check our AI Pricing page for current model costs or use the LLM Cost Calculator to estimate your batch savings.

Tags: batch processing, cost optimization, openai, anthropic, google, async api, llm

Put These Insights Into Practice

Burnwise tracks your LLM costs automatically and shows you exactly where to optimize.

Start Free Trial