An always-on Claude agent calling MCP tools all day quietly turns into a four-figure monthly bill. None of the savings below require changing your model — they are the levers that work after you have already picked Sonnet or Opus.
The 12 levers, in order of impact
1. Turn on prompt caching
Single biggest lever. On cache hits, system prompts and tool definitions are billed at roughly 10% of the base input rate (cache writes carry a small surcharge). For an agent with a long system prompt and 30 tools loaded, this typically cuts input cost by 60–80% over a session. See our MCP caching guide for adjacent wins at the tool layer.
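A minimal sketch of the request shape, assuming the Messages API `cache_control` breakpoint: everything up to and including the marked system block (tool definitions included) is cached, so later turns read it at the discounted rate. The model name mirrors the routing example below and is illustrative.

```javascript
// Build a Messages API request with a cache breakpoint on the
// system prompt. Tool definitions sit before the system block in
// the prompt, so they are cached along with it.
function buildRequest(systemPrompt, tools, messages) {
  return {
    model: 'sonnet-4-6',
    max_tokens: 1024,
    tools, // cached together with the system prompt
    system: [
      {
        type: 'text',
        text: systemPrompt,
        cache_control: { type: 'ephemeral' }, // cache breakpoint
      },
    ],
    messages,
  };
}
```

The breakpoint goes after the stable prefix; anything that changes per turn (user messages, tool results) stays below it so it never invalidates the cache.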
2. Route by task complexity
Do not send "list my files" to Opus. A small upstream classifier or a heuristic on prompt length picks Haiku for the easy 70% and reserves Opus for the hard 30%. 4–5x cost cut, no quality drop on the easy path.
```javascript
// Route by complexity: the cheap model for short, read-only turns,
// the expensive model only for writes or long prompts.
function pickModel(prompt, toolsUsed) {
  if (prompt.length < 200 && toolsUsed.every(isReadOnly)) return 'haiku-4-5';
  if (toolsUsed.some(isWrite) || prompt.length > 2000) return 'opus-4-7';
  return 'sonnet-4-6';
}
```
3. Trim tool definitions to what the user is doing
Loading 50 MCP tools into the system prompt costs tokens on every turn. Build a tool router that exposes only the relevant 5–10 based on the conversation topic. Pattern: a tiny first call asks "which tool category does this need?", subsequent calls only load that category.
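A sketch of that pattern with a keyword heuristic standing in for the tiny first model call; the category names and tool lists are hypothetical placeholders for your own MCP inventory.

```javascript
// Hypothetical tool router: expose only the category the
// conversation needs instead of all 50 definitions every turn.
const TOOL_CATEGORIES = {
  files:    ['list_files', 'read_file', 'write_file'],
  database: ['run_query', 'describe_table'],
  email:    ['search_inbox', 'send_email'],
};

// In production a cheap classifier call picks the category; a
// keyword heuristic stands in for it here.
function pickCategory(userMessage) {
  const m = userMessage.toLowerCase();
  if (/\b(sql|query|table|database)\b/.test(m)) return 'database';
  if (/\b(email|inbox|send)\b/.test(m)) return 'email';
  return 'files';
}

function toolsFor(userMessage) {
  return TOOL_CATEGORIES[pickCategory(userMessage)];
}
```

Five tool definitions in the prompt instead of fifty is a direct per-turn input-token saving, multiplied by every turn in the session.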
4. Compress tool results before they hit the model
A SQL query that returns 10,000 rows does not need to be pasted verbatim. Truncate, summarise or paginate at the MCP server boundary. The model rarely needs more than the first 50 rows plus the schema and a row count.
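A minimal compressor for that boundary, assuming a tabular result; the field names are illustrative, not a fixed MCP schema.

```javascript
// Compress a large query result at the MCP server boundary:
// keep the schema, a row count, and only the first N rows.
function compressResult(columns, rows, maxRows = 50) {
  return {
    schema: columns,
    total_rows: rows.length,
    truncated: rows.length > maxRows,
    rows: rows.slice(0, maxRows),
  };
}
```

The `truncated` flag lets the model ask for a follow-up page only when it actually needs one.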
5. Set max_tokens aggressively
Default max_tokens is far higher than most agent steps need. Set it to the smallest number that lets the agent complete a step. Output tokens are 5x more expensive than input tokens — capping output is high-leverage.
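One way to apply this per step; the step names and caps below are assumptions to tune against your own agent's traces, not recommended values.

```javascript
// Illustrative per-step output caps: a routing turn only needs a
// short tool_use block, while a user-facing answer gets more room.
const MAX_TOKENS_BY_STEP = {
  route: 64,       // pick a tool, emit arguments
  extract: 256,    // structured output extraction
  summarise: 1024, // final user-facing answer
};

function maxTokensFor(step) {
  return MAX_TOKENS_BY_STEP[step] ?? 512; // conservative fallback
}
```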
6. Use streaming and stop early when you can
If your agent’s loop has an early-exit condition (for example structured output extraction), abort the stream as soon as the condition is met. Saves the rest of the output token cost.
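A sketch of the early-exit loop, assuming `stream` is any async iterable of text chunks and `abort` is the caller's cancel callback (for example an `AbortController`); the exit condition here, a closing brace ending a JSON object, is just an example.

```javascript
// Consume a token stream until the exit condition is met, then
// abort so the remaining output tokens are never generated or billed.
async function readUntilJson(stream, abort) {
  let buf = '';
  for await (const chunk of stream) {
    buf += chunk;
    const end = buf.indexOf('}');
    if (end !== -1) {
      abort(); // cancel the request: everything after this is saved cost
      return buf.slice(0, end + 1);
    }
  }
  return buf;
}
```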
7. Batch read-only tool calls
If your agent often calls get_user(id) three times in a row, expose get_users([id1, id2, id3]) instead. One round trip, one model interpretation step. Compound savings: latency, prompt cost, and round-trip count.
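The batched tool can still fan out concurrently server-side; `fetchUser` below is a hypothetical stand-in for your data layer.

```javascript
// Batched read tool: one round trip and one model interpretation
// step instead of three sequential tool calls.
async function getUsers(ids, fetchUser) {
  // Concurrency lives on the server; the model sees one result.
  return Promise.all(ids.map((id) => fetchUser(id)));
}
```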
8. Persist agent memory between sessions
Re-deriving context every session means re-spending tokens on the same setup work. A persistent memory MCP (graph or vector) lets the agent skip re-discovery. See our memory MCP page.
9. Cache MCP responses (covered separately)
The biggest tool-side lever. Full guide here.
10. Move expensive subroutines off the LLM
Date math, JSON reshaping, regex extraction — these do not need a model. Put them in deterministic post-processors that the agent calls as tools. Each one removes a model round trip.
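Two trivial examples of such deterministic tools; the names are hypothetical but the shape is the point: plain functions the agent calls instead of reasoning through the arithmetic itself.

```javascript
// Deterministic post-processors exposed as tools, so the model
// never spends tokens on date math or JSON reshaping.
function daysBetween(isoA, isoB) {
  const ms = Math.abs(new Date(isoB) - new Date(isoA));
  return Math.round(ms / 86_400_000); // ms per day
}

function pluck(records, field) {
  return records.map((r) => r[field]);
}
```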
11. Parallelise tool calls
The Claude API supports parallel tool calls. Issuing three calls in one assistant turn costs three tool result token blocks but only one model interpretation round, vs three separate sequential rounds. Cuts both cost and latency.
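A sketch of the executor side, assuming the assistant turn arrives as an array of `tool_use` blocks and `executeTool` is a hypothetical dispatcher for your MCP tools.

```javascript
// Run every tool_use block from one assistant turn concurrently,
// then return all results in a single user turn, keeping the loop
// at one interpretation round instead of three sequential ones.
async function runToolBlocks(toolUseBlocks, executeTool) {
  const results = await Promise.all(
    toolUseBlocks.map((b) => executeTool(b.name, b.input)),
  );
  return {
    role: 'user',
    content: toolUseBlocks.map((b, i) => ({
      type: 'tool_result',
      tool_use_id: b.id,
      content: JSON.stringify(results[i]),
    })),
  };
}
```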
12. Set hard token budgets per task
Wrap each agent invocation with a counter. If the running token total crosses a budget, the agent halts and surfaces the partial result. Stops infinite-loop blow-ups from costing you a paycheck.
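A minimal budget counter for that wrapper; the loop checks `exceeded()` after every model call and halts with the partial result when it trips.

```javascript
// Hard per-task token budget: accumulate usage from each response
// and signal the agent loop to halt once the cap is crossed.
function makeBudget(maxTokens) {
  let used = 0;
  return {
    record(usage) {
      used += usage.input_tokens + usage.output_tokens;
    },
    exceeded: () => used > maxTokens,
    used: () => used,
  };
}
```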
How to measure each lever’s payoff
The Anthropic API includes usage data in every response. Aggregate it per session, broken down by model, cached vs uncached, input vs output. A small Postgres table is enough:
```sql
CREATE TABLE agent_usage (
  ts timestamptz DEFAULT now(),
  session_id text,
  model text,
  input_tokens int,
  cached_input_tokens int,
  output_tokens int,
  tool_count int
);
```
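Turning those rows into dollars is a one-liner per row. The rates below are illustrative per-million-token prices, and this assumes `cached_input_tokens` counts cache reads already included in `input_tokens`; substitute your real rate card and accounting convention.

```javascript
// Illustrative per-million-token rates; replace with your rate card.
const PRICE_PER_MTOK = {
  input: 3.0,
  cached_input: 0.3, // ~10% of the input rate on cache reads
  output: 15.0,
};

// Dollar cost of one agent_usage row.
function rowCost(row) {
  const uncached = row.input_tokens - row.cached_input_tokens;
  return (
    (uncached * PRICE_PER_MTOK.input +
      row.cached_input_tokens * PRICE_PER_MTOK.cached_input +
      row.output_tokens * PRICE_PER_MTOK.output) / 1e6
  );
}
```

Sum `rowCost` per day and per model and you have the chart the next paragraph asks for.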
Plot daily cost over time after each change. Most teams see the first three levers take 50–70% off the bill within a week.
The "agent FinOps" mindset
Cost optimisation for agents is not a one-shot exercise. Usage patterns drift as users find new workflows. Dashboard the spend, alert on anomalies, set quarterly cost-per-task targets the way you would for cloud infrastructure. Treating an agent like an unmonitored microservice ends in tears.