The image tokenization schemes your team has never compared. Every vision API charges tokens for images. The billing looks straightforward until you run the same image through three providers. Claude uses 28×28 pixel patches. Each patch is one visual token. A 1920×1080 screenshot: ⌈1920/28⌉ × ⌈1080/28⌉ = 69 × 39 = 2,691 tokens. GPT-4o rescales the image so its shortest side equals 768 pixels, then tiles it into 512×512 blocks at 170 tokens each, plus a 85-token base cost. A 1920×1080 screenshot: scale to 1366×768, divide into 6 tiles = 1,105 tokens. Gemini 2.5 Pro tiles images into 768×768 sections at 258 tokens each. A 1920×1080 screenshot: 3 × 2 = 6 tiles × 258 = 1,548 tokens. Same image. Same task. Cost ratio: 2.4× between cheapest and most expensive provider. At 4K resolution (3840×2160), the gap becomes 9.7×: 10,764 tokens on Claude versus 1,105 tokens on GPT-4o. GPT-4o caps token counts on large images because its scaling step collapses everything above a threshold to the same effective resolution. Claude does not. It attempts to process every 28×28 patch of the original image. The formulas. Claude: 28×28 pixel patches. Anthropic's vision tokenizer divides each image into a grid of 28×28 pixel patches. Each patch becomes one visual token. Formula: ⌈width / 28⌉ × ⌈height / 28⌉ | Image | Calculation | Tokens | |---|---|---:| | 256 × 256 | ⌈9.1⌉ × ⌈9.1⌉ | 100 | | 512 × 512 | ⌈18.3⌉ × ⌈18.3⌉ | 361 | | 1024 × 768 | ⌈36.6⌉ × ⌈27.4⌉ | 1,036 | | 1920 × 1080 | ⌈68.6⌉ × ⌈38.6⌉ | 2,691 | | 3840 × 2160 | ⌈137.1⌉ × ⌈77.1⌉ | 10,764 | The higher token count on Claude reflects a real difference: Claude processes more of the original image, which matters for tasks requiring fine-grained reading of dense text, small labels in diagrams, or handwriting. For most enterprise document tasks — invoice extraction, form parsing, screenshot classification — that level of pixel fidelity is not required. GPT-4o: 512×512 tiles with effective resolution cap. OpenAI applies two scaling steps before tiling: Scale image to fit within a 2048×2048 bounding box (if needed) Scale so the shortest side equals 768 pixels (including upscaling small images) Tile into 512×512 blocks Total = 85 + 170 × n_tiles Two behaviors worth knowing: Small images get upscaled. A 256×256 image becomes 768×768 before tiling, giving 4 tiles and 765 tokens — more than Claude's 100 tokens for the same image. Large images hit a plateau. A 1920×1080 and a 3840×2160 both scale down to approximately 1366×768 before tiling, giving the same 6 tiles and 1,105 tokens. Above roughly 1080p, GPT-4o's token count stops growing. GPT-4o also offers a low-detail mode: flat 85 tokens for any image. | Image | High detail | Low detail | |---|---:|---:| | 256 × 256 | 765 tokens | 85 tokens | | 512 × 512 | 765 tokens | 85 tokens | | 1024 × 768 | 765 tokens | 85 tokens | | 1920 × 1080 | 1,105 tokens | 85 tokens | | 3840 × 2160 | 1,105 tokens | 85 tokens | Gemini 2.5 Pro: 768×768 tiles with flat small-image fee. Formula: ⌈width / 768⌉ × ⌈height / 768⌉ × 258 tokens Images 384px or smaller on both dimensions cost a flat 258 tokens. | Image | Tiles | Tokens | |---|---|---:| | 256 × 256 | flat fee | 258 | | 512 × 512 | 1 × 1 | 258 | | 1024 × 768 | 2 × 1 | 516 | | 1920 × 1080 | 3 × 2 | 1,548 | | 3840 × 2160 | 5 × 3 | 3,870 | The cost comparison across common image types. Pricing as of July 2026: Claude Opus 4.8 at $5/M input, GPT-4o at $2.50/M, Gemini 2.5 Pro at $2/M. | Image | Claude Opus 4.8 | GPT-4o high | GPT-4o low | Gemini 2.5 Pro | |---|---:|---:|---:|---:| | 256 × 256 | $0.00050 | $0.00191 | $0.00021 | $0.00052 | | 512 × 512 | $0.00181 | $0.00191 | $0.00021 | $0.00052 | | 1024 × 768 | $0.00518 | $0.00191 | $0.00021 | $0.00103 | | 1920 × 1080 | $0.01346 | $0.00276 | $0.00021 | $0.00310 | | 3840 × 2160 | $0.05382 | $0.00276 | $0.00021 | $0.00774 | Which provider wins depends on image size: Small images (< 512px): Claude is cheapest. 100 tokens vs 765 on GPT-4o. Medium images (512px–1080p): Gemini wins on price for extraction tasks. Large images (> 1080p): GPT-4o high-detail wins. GPT-4o low-detail wins always on raw price. The production cost of picking the wrong default. A document processing pipeline handling 10,000 HD screenshots (1920×1080) per day: | Approach | Daily image cost | Annual image cost | |---|---:|---:| | Claude Opus 4.8, every image | $134.55 | $49,112 | | Gemini 2.5 Pro, every image | $30.96 | $11,299 | | GPT-4o high detail, every image | $27.60 | $10,074 | | GPT-4o low detail, every image | $2.10 | $767 | | Routed (classify on low, extract on high) | ~$12.00 | ~$4,380 | These are image tokens only — before any text input or output is billed. A team running an invoice scanning pipeline on Claude Opus because that is what the senior engineer used in the prototype is paying $49,112 per year in image tokens for a task GPT-4o handles at $10,074 — or at $4,380 with a two-stage routing approach. The $44,732 difference is invisible until someone audits token cost by modality, not just by total API spend. Most teams do not. Code: calculate image tokens before sending. Token estimation runs locally in under 1ms. Build this into your request middleware to catch expensive images before they hit the bill. import math def claude_image_tokens(width: int, height: int) -> int: return math.ceil(width / 28) math.ceil(height / 28) def gpt4o_image_tokens(width: int, height: int, detail: str = "high") -> int: if detail == "low": return 85 Step 1: fit in 2048x2048 scale = min(2048 / width, 2048 / height, 1.0) w, h = width scale, height scale Step 2: scale so shortest side = 768 short = min(w, h) scale2 = 768 / short w, h = w scale2, h scale2 tiles = math.ceil(w / 512) math.ceil(h / 512) return 85 + 170 tiles def gemini_image_tokens(width: int, height: int) -> int: if width <= 384 and height <= 384: return 258 return math.ceil(width / 768) math.ceil(height / 768) 258 Compare providers for a 1920x1080 screenshot w, h = 1920, 1080 PRICING = {"claude": 5.0, "gpt4o": 2.50, "gemini": 2.0} claude_t = claude_image_tokens(w, h) gpt4o_h_t = gpt4o_image_tokens(w, h, "high") gpt4o_l_t = gpt4o_image_tokens(w, h, "low") gemini_t = gemini_image_tokens(w, h) print(f"Claude Opus 4.8: {claude_t:,} tokens (${claude_t PRICING['claude'] / 1e6:.5f})") print(f"GPT-4o high detail: {gpt4o_h_t:,} tokens (${gpt4o_h_t PRICING['gpt4o'] / 1e6:.5f})") print(f"GPT-4o low detail: {gpt4o_l_t:,} tokens (${gpt4o_l_t PRICING['gpt4o'] / 1e6:.5f})") print(f"Gemini 2.5 Pro: {gemini_t:,} tokens (${gemini_t PRICING['gemini'] / 1e6:.5f})") Claude Opus 4.8: 2,691 tokens ($0.01346) GPT-4o high detail: 1,105 tokens ($0.00276) GPT-4o low detail: 85 tokens ($0.00021) Gemini 2.5 Pro: 1,548 tokens ($0.00310) Add this to a middleware layer and you can log per-image cost before any API call, compare providers at request time, and surface the most expensive images in your pipeline for targeted optimization. Optimization 1: resize before sending. Token cost scales with pixel area. Resizing before the API call cuts cost directly. For invoice OCR, form extraction, and UI screenshot analysis, quality is equivalent down to approximately 50% of original linear resolution. Below that, fine-grained text recognition degrades measurably. from PIL import Image from io import BytesIO def resize_for_api(image_path: str, max_dim: int = 1024) -> tuple[bytes, int, int]: """Resize image to reduce token cost. Returns (bytes, final_width, final_height).""" img = Image.open(image_path) w, h = img.size if max(w, h) > max_dim: ratio = max_dim / max(w, h) img = img.resize((int(w ratio), int(h ratio)), Image.LANCZOS) buf = BytesIO() img.save(buf, format="JPEG", quality=85) return buf.getvalue(), img.width, img.height 1920x1080 → 1024x576 img_bytes, rw, rh = resize_for_api("screenshot.png", max_dim=1024) original = claude_image_tokens(1920, 1080) resized = claude_image_tokens(rw, rh) print(f"Original: {original:,} tokens Resized: {resized:,} tokens Saved: {(1 - resized/original):.0%}") Original: 2,691 tokens Resized: 1,120 tokens Saved: 58% At 10,000 documents/day on Claude Opus, that single resize operation saves $77.95/day — $28,452/year. Optimization 2: use low-detail mode for classification. GPT-4o's low-detail mode sends a 512×512 downscaled image at a flat 85 tokens. For binary classification (is this an invoice? is this a receipt? is this handwritten?), the model's classification capability is identical in low-detail mode. import openai import base64 client = openai.OpenAI() def classify_image(image_path: str, question: str) -> str: """Classify image using low-detail mode. Always 85 tokens input.""" with open(image_path, "rb") as f: b64 = base64.b64encode(f.read()).decode() response = client.chat.completions.create( model="gpt-4o", messages=[{ "role": "user", "content": [ { "type": "image_url", "image_url": { "url": f"data:image/jpeg;base64,{b64}", "detail": "low" # 85 tokens regardless of image size } }, {"type": "text", "text": question} ] }], max_tokens=10 # yes/no answers ) return response.choices[0].message.content.strip().lower() Cost: 85 input + ~5 output = $0.000226 per classification vs high-detail: 1,105 + 5 = $0.002775 per image — 12x more expensive is_invoice = classify_image("scan.jpg", "Is this an invoice? Answer yes or no.") A two-stage pipeline — classify on low-detail GPT-4o, extract only confirmed invoices on a higher-quality model — reduces image token cost by approximately 60% on pipelines with a 10%+ false-positive rate. Optimization 3: match provider to task type. | Task | Recommended provider | Mode | Cost (1920×1080) | Reason | |---|---|---|---:|---| | Document type routing | GPT-4o | low detail | $0.00021 | 85 tokens, same classification quality | | Invoice / form extraction | Gemini 2.5 Pro | — | $0.00310 | Cheapest at this resolution for extraction | | Handwriting recognition | Claude Sonnet 4.6 | — | $0.00807 | 28×28 patches preserve stroke detail | | Small thumbnail tasks | Claude Haiku 4.5 | — | $0.00010 | 100 tokens vs 765 on GPT-4o high | | High-res diagram analysis | Claude Opus 4.8 | — | $0.01346 | Patch resolution matters for small labels | from PIL import Image def route_vision_request(image_path: str, task: str) -> dict: img = Image.open(image_path) w, h = img.size is_small = max(w, h) <= 512 ROUTES = { "classify": {"provider": "openai", "model": "gpt-4o", "detail": "low"}, "extract": {"provider": "google", "model": "gemini-2.5-pro", "detail": "high"}, "handwriting": {"provider": "anthropic", "model": "claude-sonnet-4-6","detail": "high"}, "diagram": {"provider": "anthropic", "model": "claude-opus-4-8", "detail": "high"}, } config = ROUTES.get(task, ROUTES["extract"]) Small images: Claude Haiku is cheapest across all basic tasks if is_small and task in ("classify", "extract"): config = {"provider": "anthropic", "model": "claude-haiku-4-5", "detail": "high"} return config The practical audit. If your team uses vision APIs and has not compared tokenization costs across providers, the audit takes under an hour: 1. Log image dimensions in production. Most teams discover their images are 1080p by default because they come from screen-capture tools or document scanners set to maximum resolution. 2. Estimate token cost before the API call. The formulas above run in under 1ms. Add them to your logging middleware. Flag any image that would cost more than $0.01 in tokens before sending. 3. Benchmark quality at 50% resolution. For extraction tasks, run your eval suite on 1024px max-dimension images. Most teams find quality holds. Document it. The resize decision becomes easy. 4. Separate classification from extraction. GPT-4o low-detail at 85 tokens is correct for routing and filtering. Send only confirmed targets to high-quality extraction. 5. Match provider to task, not developer preference. The engineer who picked Opus for the prototype picked it because it returned the best answers in manual testing. At 10,000 images per day, that choice costs $44,000 per year more than routing to the right provider per task. Nadir's routing layer routes image queries by task type, image dimensions, and provider token cost — the same infrastructure it uses for text routing. Vision task routing is a built-in dimension, not a custom integration. Reduce your vision pipeline costs today → Tokenization formulas from provider documentation as of July 2026. Pricing as of July 2026: Claude Opus 4.8 $5/M input, Claude Sonnet 4.6 $3/M, Claude Haiku 4.5 $1/M; GPT-4o $2.50/M; Gemini 2.5 Pro $2/M (≤200K context). Sources: Anthropic Vision Documentation. OpenAI Vision Guide. Google Gemini Multimodal API. Roboflow: What does it cost to process an image with a vision model?. Stellaxon AI Image Token Calculator.*