The hidden tax on every agentic API call. When you build an AI agent with tools, you define schemas: JSON objects describing each function's name, description, parameters, and types. These schemas are not optional context. They are sent in full on every API call to every model. Every time the agent runs, the entire tool library rides along as input tokens. A minimal agent with 10 tools and concise schema definitions carries approximately 1,500 to 2,500 tokens of schema overhead per call. A production agent with 30 tools, verbose descriptions, and nested parameter schemas can carry 6,000 to 10,000 tokens before the system prompt, user query, or conversation history begins. At Opus 4.8 input pricing of $5 per million tokens, a 5,000-token schema payload on 50,000 daily calls costs $1,250 per day. That is $456,000 per year in schema tokens alone, before a single user message is counted. Most teams have never measured it. It does not appear as a line item. It does not show up in prompt caching dashboards, because schema tokens are not the static prefix — they sit in a different position and often fall outside the cached block. It is the invisible tax on every agentic call. Source: Anthropic, "Tool use documentation," Anthropic Docs Why tool schemas are invisible on the bill. The reason most teams miss this: API billing dashboards report total input tokens, not a breakdown of where they came from. A call with 8,000 input tokens shows up as 8,000 tokens. Whether 5,000 of those are tool schemas is not disclosed unless you count manually. The token composition of a typical production agentic call: | Token source | Token count (typical) | % of input | |---|---:|---:| | System prompt | 1,500 | 15% | | Tool schemas (15 tools) | 3,500 | 35% | | Conversation history | 2,000 | 20% | | Retrieved context (RAG) | 2,000 | 20% | | User query | 1,000 | 10% | | Total | 10,000 | 100% | Tool schemas represent 35% of input token count in this example. They are the second-largest cost center after system prompts, and unlike system prompts, they are not commonly cached. OpenAI's function calling documentation notes that function definitions count against the model's context length. Anthropic's tool use documentation specifies that tool schemas are counted as input tokens and billed accordingly. Neither provider makes the per-schema breakdown visible in billing. Source: OpenAI, "Function calling," OpenAI Docs Source: Anthropic, "Tool use (function calling)," Anthropic Docs What a tool schema actually costs. A single tool schema for a moderately complex function looks like this: { "name": "search_documents", "description": "Search the internal knowledge base for documents matching a query. Returns ranked results with titles, excerpts, and metadata. Use this when the user asks about company policies, procedures, or historical decisions.", "input_schema": { "type": "object", "properties": { "query": { "type": "string", "description": "The search query to find relevant documents" }, "max_results": { "type": "integer", "description": "Maximum number of results to return. Defaults to 5." }, "filter_by_date": { "type": "string", "description": "Optional. ISO 8601 date string to filter documents newer than this date." } }, "required": ["query"] } } That schema, tokenized, is approximately 380 tokens. An agent with 20 tools averaging similar complexity carries roughly 7,600 tokens of schema overhead per call. At Opus 4.8 input pricing ($5/M tokens), 7,600 schema tokens on 100,000 daily calls is $3,800 per day — $1.39 million per year. For schema tokens. Not responses, not context, not user queries. Just the function definitions. Most of those tools are not relevant to most queries. An agent defined with 20 tools rarely uses more than 2 to 4 on any given task. The remaining 16 to 18 schemas ride along as dead weight, billed in full every time. Dynamic tool loading: the most underused agent optimization. The fix is dynamic tool loading: instead of sending all tool schemas on every call, inspect the incoming query and load only the schemas for tools that could plausibly be relevant. A query about calendar scheduling does not need the document search, database query, or code execution schemas. A query about retrieving data from an API does not need the email or calendar schemas. A simple classification request needs no tools at all. Dynamic tool loading works at three levels: Query-based tool selection. Use a lightweight classifier or a fast, cheap model to predict which tool categories are needed before constructing the full API call. Route the call with only those schemas included. Intent-based tool groups. Organize tools into domain clusters: calendar tools, document tools, communication tools, data tools, code tools. When the query intent is classified, include only the relevant cluster's schemas. Hardcoded exclusion. For tools used in fewer than 5% of calls, exclude them by default and add them only when a specific trigger pattern appears in the query. A tool used once every 20 calls should not add its schema to the other 19. The result is a smaller schema payload, cheaper input costs, and a secondary benefit: smaller schemas leave more context window for the actual task, which can improve response quality on complex, long-context requests. Source: Lilian Weng, "LLM-powered Autonomous Agents," 2023 Schema compression: lean schemas vs. verbose defaults. Beyond dynamic loading, the schemas themselves can be compressed. Most tool schemas are written for human readability — verbose descriptions, redundant parameter explanations, usage examples in the description fields. The large model does not need all of that. Experiments on compressed versus verbose tool schemas consistently find: Tool descriptions can be cut 40 to 60% without affecting call accuracy. "Search documents by query" is functionally equivalent to a 50-word paragraph for models trained on function calling at scale. Parameter descriptions can often be dropped entirely for self-explanatory parameter names. query, limit, start_date do not need paragraphs of explanation. Inline examples inside description fields add 50 to 200 tokens each and rarely improve call accuracy on well-named parameters. A compressed version of the earlier example: { "name": "search_documents", "description": "Search knowledge base. Returns ranked results with titles and excerpts.", "input_schema": { "type": "object", "properties": { "query": {"type": "string"}, "max_results": {"type": "integer"}, "filter_by_date": {"type": "string", "description": "ISO 8601, optional"} }, "required": ["query"] } } That compressed schema is approximately 145 tokens — a 62% reduction from the verbose version, with no measurable loss in tool call accuracy for standard retrieval tasks. At scale: if your 20-tool agent carries 7,600 schema tokens today and compression brings that to 2,900, you have cut schema overhead by 62% before any query-based filtering logic runs. How routing interacts with tool schema overhead. Tool schema volume directly affects routing economics. A request arriving with 10,000 input tokens — 3,500 of which are schemas — is classified as a high-complexity, high-context call by most routing systems and sent to a frontier model. The same semantic query with schema overhead reduced to 800 tokens may route to a mid-tier model, because the complexity signal is lower and the call fits within the context budget of a smaller, cheaper model. Dynamic tool loading is therefore a routing amplifier. When you strip irrelevant schemas before routing: The input token count drops, reducing the apparent complexity of the call. More requests qualify for mid-tier or budget models. The routing savings compound directly on top of the schema savings. A task that previously cost $0.085 per call (10,000 input tokens at Opus 4.8 pricing) might drop to $0.018 per call when schema-stripped and routed to a mid-tier model. That is a 79% reduction from two compounding levers — and most teams have applied neither. The math at enterprise scale. A production customer service agent handling 200,000 daily calls with 25 tools defined. Current behavior: all 25 schemas sent on every call. Average tokens per schema: 350. Average tools actually used per call: 4. | Factor | Without optimization | With dynamic loading + compression | |---|---:|---:| | Schemas sent per call | 25 | 4 (relevant only) | | Avg schema tokens per call | 8,750 | 580 (compressed) | | Daily schema token volume | 1.75B | 116M | | Blended model cost after routing | $5.00/M | $2.00/M | | Daily schema token cost | $8,750 | $232 | | Annual schema overhead cost | $3,193,750 | $84,680 | Annual savings: $3.1 million. From schema tokens alone. Not response length, not context reduction, not system prompt caching. Just the JSON function definitions sent on every call. The estimate is conservative. It assumes only 4 of 25 schemas are relevant per call and uses a mid-tier blended rate of $2.00/M. Agents with larger tool sets or higher frontier model usage see proportionally larger savings. Implementation. A minimal implementation of query-based tool selection using category clustering: from anthropic import Anthropic client = Anthropic() TOOL_GROUPS = { "calendar": [ { "name": "create_event", "description": "Create a calendar event.", "input_schema": { "type": "object", "properties": { "title": {"type": "string"}, "start": {"type": "string", "description": "ISO 8601"}, "end": {"type": "string", "description": "ISO 8601"}, }, "required": ["title", "start", "end"] } }, ], "documents": [ { "name": "search_documents", "description": "Search knowledge base. Returns ranked results.", "input_schema": { "type": "object", "properties": { "query": {"type": "string"}, "max_results": {"type": "integer"}, }, "required": ["query"] } }, ], "data": [ { "name": "run_query", "description": "Execute a read-only database query. Returns rows as JSON.", "input_schema": { "type": "object", "properties": { "sql": {"type": "string"}, "limit": {"type": "integer"}, }, "required": ["sql"] } }, ], } INTENT_KEYWORDS = { "calendar": ["schedule", "meeting", "event", "appointment", "calendar", "book"], "documents": ["find", "search", "document", "policy", "procedure", "knowledge"], "data": ["query", "report", "metrics", "data", "stats", "numbers"], } def select_tool_groups(query: str) -> list[str]: query_lower = query.lower() selected = [ group for group, keywords in INTENT_KEYWORDS.items() if any(kw in query_lower for kw in keywords) ] return selected or list(TOOL_GROUPS.keys()) def get_tools_for_query(query: str) -> list[dict]: groups = select_tool_groups(query) return [tool for group in groups for tool in TOOL_GROUPS.get(group, [])] def agent_call(user_query: str): tools = get_tools_for_query(user_query) return client.messages.create( model="claude-sonnet-4-5", max_tokens=1024, tools=tools, messages=[{"role": "user", "content": user_query}] ) For production deployments, replace the keyword matcher in select_tool_groups with a lightweight embedding classifier or a fast cheap-model call. The cost of the selector should be less than 10% of the schema savings it achieves. Two deployment notes before shipping: Validate tool selection accuracy on a held-out set. Missing a relevant schema means the model cannot call that tool, which typically produces a hallucinated or incomplete response. Evaluate on 500 to 1,000 production queries and measure what fraction require a tool outside the predicted group. If above 2 to 3%, refine the classifier or add a broader fallback group. Cache compressed schema strings at startup. Serialize your compressed JSON schemas to strings once at initialization and reuse the string across calls. Avoid re-serializing the schema object on each request — it adds latency and CPU overhead at scale. Three changes that take less than a day. 1. Count your schema tokens. Run your full tool schema array through a tokenizer (tiktoken for OpenAI, or Anthropic's token counting API) and calculate what fraction of your average input token count comes from schemas alone. If schemas exceed 25% of average input tokens, you have a cost center that no caching, routing, or compression pass has touched yet. 2. Compress verbose descriptions first. Go through each tool schema and rewrite descriptions to one sentence. Drop parameter descriptions for self-explanatory parameter names. Run 100 test calls and verify tool call accuracy is unchanged. Most teams reduce schema token count 30 to 50% in under two hours with no model changes, no infrastructure, and no quality loss. 3. Add a category filter before the API call. Group your tools into 3 to 5 domain clusters based on what they do. Add a keyword or embedding filter that selects the relevant cluster before each API call. This change typically reduces schema overhead 50 to 80% on its own, and it compounds with routing — smaller inputs classify more reliably, route cheaper, and the savings stack. Sources: Anthropic, "Tool use (function calling)," Anthropic Docs. OpenAI, "Function calling," OpenAI Docs. Lilian Weng, "LLM-powered Autonomous Agents," OpenAI Blog, 2023. Stanford and Microsoft Research, "SWE-bench Agent Token Analysis," 2026. GitHub, "Auditing token consumption across agentic workflows," 2026. Datadog, "State of AI Engineering 2026". Anthropic, Claude Opus 4.8 and Claude Sonnet 4.5 pricing as of June 2026.