The 50% discount most teams have never configured.

OpenAI offers a Batch API that costs 50% less than the regular API. Anthropic's Message Batches API costs 50% less. Google Vertex AI batch prediction costs up to 50% less. AWS Bedrock batch inference costs 50% less. All four providers launched or expanded batch inference in 2024 and 2025. The documentation has been public for months.

In a 2026 survey of engineering teams spending more than $100,000 per year on LLM APIs, fewer than 15% reported using batch inference for any workload. The rest, more than 85%, pay real-time prices for jobs that have no latency requirement at all.

Source: AICC, "Enterprise Token Costs Drop 67% Year-Over-Year," 2026

The math is not subtle. A team spending $1 million per year on AI API calls, with half of that spend on jobs that have no real-time requirement, saves $250,000 per year by enabling batch inference on the eligible half ($500,000 at 50% off). The implementation takes a few hours. The savings start on the next billing cycle.

What batch inference actually is.

Real-time inference is synchronous: you send a request, you wait for a response, you receive it. The provider allocates GPU capacity immediately, processes your request, and returns the result. You pay a premium for that instant availability.

Batch inference is asynchronous: you submit a list of requests, the provider processes them within 24 hours, and you retrieve the results once the job completes, by polling or, where the provider supports it, a webhook notification. The provider queues your job alongside other non-urgent work, fills GPU capacity during off-peak hours, and charges you less because it can schedule the work efficiently.

The quality of the response is identical. The same model, the same weights, the same attention mechanism. The only difference is when you get the answer.

Source: OpenAI, "Batch API," OpenAI Docs
Source: Anthropic, "Message Batches API," Anthropic Docs

The provider landscape.

Every major provider now offers batch inference with significant pricing discounts:

| Provider | API | Discount | Turnaround | Notes |
|---|---|---|---|---|
| OpenAI | Batch API | 50% off | 24 hours | Up to 50K requests/batch, 200 MB input file |
| Anthropic | Message Batches API | 50% off | 24 hours | Up to 100K requests/batch |
| Google | Vertex AI Batch | Up to 50% | 24 hours | Varies by model |
| AWS Bedrock | Batch Inference | 50% off | 24 hours | Configurable window |

Source: OpenAI, API Pricing, June 2026
Source: Anthropic, Claude Pricing, June 2026
Source: Google, Vertex AI Pricing, June 2026

The discount structure is consistent across providers because the economics are the same. Batch jobs allow providers to consolidate requests, balance GPU utilization, and reduce the overhead of real-time scheduling. The savings are passed directly to the customer.

What qualifies for batch processing.

The only requirement for batch eligibility is that the job has no latency requirement. The response does not need to arrive in under two seconds. It does not need to arrive in under two minutes. It needs to arrive within 24 hours. This describes a larger fraction of enterprise AI work than most teams realize.

Data pipelines. ETL jobs that process documents, classify records, extract entities, or summarize content from a database or data lake. These typically run overnight or on a scheduled cadence.

Document processing. Contract review, invoice extraction, compliance scanning, receipt parsing, policy analysis. None of these need a real-time response. Users submit documents and check back in minutes or hours.

Evaluation suites. Running your LLM test suite to measure quality, regression, or benchmark performance. Evals are scheduled jobs with no interactive user waiting for results.

Reporting and analytics. Generating summaries, trend analysis, or narrative descriptions from structured data. These run on a cadence, not in response to a live request.

Content enrichment.
Adding metadata, tags, embeddings, or descriptions to a corpus of records. Background enrichment pipelines have no SLA measured in milliseconds.

Fine-tuning data generation. Creating synthetic training data, paraphrases, or labeled examples. This is batch processing by definition.

A/B test scoring. Evaluating model responses against quality rubrics for product experiments. The experiment runs regardless of when individual responses are scored.

Source: Anthropic, "When to use Message Batches," Anthropic Docs

What does not qualify.

Batch inference does not replace real-time inference for user-facing features. If a user types a message and waits for a reply, you need real-time inference. If a function calls a model and the user waits for an immediate result, you need real-time inference. If your SLA is measured in seconds, batch is not the right tool.

The question to ask for each workload is not "does this use an LLM?" but "does a human wait synchronously for this response?"

A customer service chatbot: real-time.
An overnight report summarizing the day's customer tickets: batch.
A code completion suggestion while a developer types: real-time.
Nightly analysis of code quality patterns across a team's commits: batch.
An AI assistant answering user questions live: real-time.
Weekly classification of all support tickets by root cause: batch.

The boundary is cleaner in practice than it looks in theory. Most production AI systems contain both types of work. Most teams have only wired up one path.

The math at enterprise scale.

The savings from batch inference grow with scale, which is when they matter most.

Consider a team processing 500,000 documents per month for compliance review. Each request averages 2,000 input tokens and 500 output tokens. The team uses Claude Sonnet 4.5.
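The arithmetic for this scenario can be sketched in a few lines of Python. The per-million-token prices ($3.00 input, $15.00 output) and the 50% batch discount are the figures this article uses; the function and variable names are illustrative, and real pricing should be checked against the provider's current price list.

```python
def monthly_cost(requests, input_tokens, output_tokens,
                 input_price_per_m, output_price_per_m, discount=0.0):
    """Monthly cost in dollars, assuming every request has the same token profile."""
    input_cost = requests * input_tokens / 1_000_000 * input_price_per_m
    output_cost = requests * output_tokens / 1_000_000 * output_price_per_m
    return (input_cost + output_cost) * (1.0 - discount)


# Scenario from the text: 500,000 requests, 2,000 input and 500 output tokens each.
realtime = monthly_cost(500_000, 2_000, 500, 3.00, 15.00)
batch = monthly_cost(500_000, 2_000, 500, 3.00, 15.00, discount=0.5)
annual_savings = (realtime - batch) * 12

print(f"real-time: ${realtime:,.0f}/month")
print(f"batch:     ${batch:,.0f}/month")
print(f"savings:   ${annual_savings:,.0f}/year")
```

Swapping in your own request volume and token averages gives a first-order estimate for any pipeline; it ignores prompt caching and any variation in request size.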

| Mode | Input Cost | Output Cost | Monthly Total |
|---|---|---|---|
| Real-time | 500K × 2K × $3.00/M = $3,000 | 500K × 500 × $15.00/M = $3,750 | $6,750 |
| Batch (50% off) | 500K × 2K × $1.50/M = $1,500 | 500K × 500 × $7.50/M = $1,875 | $3,375 |

Annual savings: $40,500 for one pipeline. From one configuration change.

At $500,000 per month total LLM spend, with 40% of that spend on batch-eligible workloads, the annual savings come to $1.2 million ($200,000 per month eligible, half of it saved, over 12 months). The decision has a single-day payback period on any workload running at scale.

How routing and batch inference stack.

Batch inference and model routing solve different parts of the cost problem. Routing selects the cheapest model capable of handling each task. Batch inference selects the cheapest scheduling mode for tasks with flexible timing. Both optimizations apply independently, and they compound.

A document classification pipeline that currently hits Claude Opus 4.8 in real time can do two things:

Route simple classifications to Claude Haiku 4.5 (up to 10x cost reduction from model selection).
Send all classifications via the batch API (2x cost reduction from scheduling).

Combined reduction: up to 20x on that pipeline, because the two factors multiply. The accuracy does not change. Haiku 4.5 handles binary classification at equivalent quality to Opus 4.8 for well-defined categories. The Batch API uses the same weights as the real-time endpoint.

| Optimization | Typical Savings | Applies To |
|---|---|---|
| Model routing (frontier → efficient model) | 40-80% | All workloads |
| Prompt caching | 50-90% on cached tokens | Repeated prompts |
| Batch inference | 50% | Non-real-time jobs |
| Combined (routing + caching + batch) | Up to 95% | Eligible workloads |

The teams that implement all three do not save 30 to 40% of their LLM spend. They save 80 to 90% on the workloads where all three apply.

The implementation path.

All major providers expose batch inference through their standard SDKs.
The implementation follows the same pattern: create a batch job with a list of requests, poll (or receive a webhook, where the provider supports one) until it is done, then retrieve the results.

Anthropic Python SDK:

```python
import anthropic

client = anthropic.Anthropic()

# Each request carries a custom_id so results can be matched back to inputs.
message_batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": "doc-001",
            "params": {
                "model": "claude-haiku-4-5-20251001",
                "max_tokens": 1024,
                "messages": [{"role": "user", "content": "Classify this contract: ..."}],
            },
        },
        {
            "custom_id": "doc-002",
            "params": {
                "model": "claude-haiku-4-5-20251001",
                "max_tokens": 1024,
                "messages": [{"role": "user", "content": "Classify this contract: ..."}],
            },
        },
    ]
)

# Poll client.messages.batches.retrieve(message_batch.id) until
# processing_status == "ended", then read
# client.messages.batches.results(message_batch.id).
```

OpenAI Python SDK:

```python
import json

import openai

client = openai.OpenAI()

# One JSON object per request; custom_id lets you match results back to inputs.
requests = [
    {
        "custom_id": "doc-001",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4.1",
            "messages": [{"role": "user", "content": "Classify this contract: ..."}],
        },
    },
]

# The Batch API takes its input as an uploaded JSONL file.
with open("batch_requests.jsonl", "w") as f:
    for req in requests:
        f.write(json.dumps(req) + "\n")

with open("batch_requests.jsonl", "rb") as f:
    batch_file = client.files.create(file=f, purpose="batch")

batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
```

The code change is small. The operational model changes: instead of request/response in your API call, you submit a job and retrieve results later. For jobs that were already running asynchronously in a pipeline, this is often a one-day migration.

The audit you should run this week.

Pull your API logs for the last 30 days. For each LLM call, answer one question: did a human wait synchronously for this response?

If yes: real-time inference is the right tool.
If no: you are paying a 100% premium over the batch price for real-time GPU scheduling you do not need.

Most teams find that 30 to 60% of their calls have no real-time requirement.
They are data pipelines, evaluation runs, scheduled enrichment jobs, and reporting queries that were built against the real-time API because it was the only API in the SDK example when the engineer wrote the code.

Source: Datadog, "State of AI Engineering 2026"

The batch API was available then too. It just was not in the example.

Three steps that take less than a day:

1. Classify your workloads. List every LLM call in your system. Mark each one: real-time user-facing, or background job with no latency requirement. The background jobs are your batch candidates.
2. Estimate the savings. Take the monthly token cost for batch-eligible workloads and multiply by 0.5. That is your savings number. If 30 to 60% of your spend is batch-eligible, it works out to between 15 and 30% of total LLM spend.
3. Migrate the highest-spend pipeline first. Pick the single most expensive batch-eligible job, migrate it to the batch API, and measure the cost reduction on the next billing cycle. Then repeat.

The teams not using batch inference, more than 85% in the survey above, are for the most part not making a deliberate tradeoff. They are not aware that the endpoint exists. The endpoint exists. It costs half as much. The quality is the same.

Sources:

OpenAI, "Batch API," OpenAI Docs.
Anthropic, "Message Batches API," Anthropic Docs.
Google, "Vertex AI Batch Prediction".
AWS, "Amazon Bedrock Batch Inference".
Anthropic, Claude Pricing, June 2026.
OpenAI, API Pricing, June 2026.
Google, Vertex AI Pricing, June 2026.
Datadog, "State of AI Engineering 2026".
AICC, "Enterprise Token Costs Drop 67% Year-Over-Year".
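
The three audit steps described earlier (classify each workload, estimate the savings, pick the highest-spend candidate first) can be sketched in Python. This is a minimal illustration, not a production tool: the `Workload` type, the workload names, and the dollar figures are hypothetical placeholders, and the 0.5 multiplier is the batch discount quoted in this article.

```python
from dataclasses import dataclass

BATCH_DISCOUNT = 0.5  # discount quoted in this article; confirm against your provider


@dataclass
class Workload:
    name: str
    monthly_cost: float  # dollars per month at real-time prices
    human_waits: bool    # does a person wait synchronously for the response?


def audit(workloads):
    """Return (batch candidates by descending spend, monthly savings, share of total spend)."""
    candidates = sorted(
        (w for w in workloads if not w.human_waits),
        key=lambda w: w.monthly_cost,
        reverse=True,
    )
    savings = sum(w.monthly_cost for w in candidates) * BATCH_DISCOUNT
    total = sum(w.monthly_cost for w in workloads)
    share = savings / total if total else 0.0
    return candidates, savings, share


# Hypothetical inventory; replace with figures from your own API logs.
workloads = [
    Workload("support-chatbot", 40_000.0, True),
    Workload("nightly-ticket-summary", 12_000.0, False),
    Workload("eval-suite", 8_000.0, False),
    Workload("code-completion", 30_000.0, True),
    Workload("contract-classification", 10_000.0, False),
]

candidates, savings, share = audit(workloads)
print("migrate first:", candidates[0].name)
print(f"estimated savings: ${savings:,.0f}/month ({share:.0%} of total spend)")
```

The first entry in `candidates` is the pipeline to migrate first under step 3; the estimate assumes the full discount applies to every eligible workload.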