The monthly Anthropic bill is not where your agent's costs live. It is a lagging indicator of six upstream choices, most of which your team made without thinking. Here is where to look.
Where the money goes
A production agent's spend breaks down, roughly, into:
| Driver | Share of typical bill | Controllable? |
|---|---|---|
| Input tokens (context) | 40–60% | yes — caching, compression |
| Output tokens | 15–25% | partly — prompt shape |
| Tool-call loops (wasted work) | 10–20% | yes — observability |
| Model choice (routing) | 5–15% | yes — tiered routing |
| Embeddings + storage | 5–10% | yes — dedupe, cheaper dims |
| Observability + trace storage | 2–5% | yes — sampling |
The interesting ones are input tokens and tool-call loops. Input tokens because they scale with conversation length and retrieval size. Tool-call loops because you usually do not know you have them until the invoice arrives.
Lever 1: prompt caching
The cheapest optimisation available. Both Anthropic and OpenAI offer prompt caching — cached prefixes cost 10% of the original input price on Anthropic, 50% on OpenAI. A stable system prompt of 5k tokens turns a $0.05 call into a $0.005 call.
Three rules:
- Cache the stable part, not the user turn. Cache boundaries must be after immutable content and before variable content.
- Budget for the cache write. The first call writes the cache at 1.25× input price. Amortise across ≥3 calls.
- Keep the cacheable block past the minimum threshold (1024 tokens on most Anthropic models, 2048 for Haiku) or it won't cache at all.
See also mcp call caching strategies for the tool-result side of caching.
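The amortisation rule above is just arithmetic, and worth making explicit. A minimal sketch, assuming the multipliers already quoted (1.25× for the cache write, 0.10× for cached reads on Anthropic; OpenAI's cached reads are 0.5× with no write premium) — the raw break-even lands below the conservative ≥3-call budget:

```python
def cache_breakeven(write_mult: float = 1.25, read_mult: float = 0.10) -> int:
    """Smallest call count at which caching a stable prefix beats
    resending it at full input price on every call."""
    n = 1
    # cached path: one write at write_mult, then (n - 1) reads at read_mult
    # uncached path: n full-price sends
    while write_mult + read_mult * (n - 1) >= n:
        n += 1
    return n
```

With Anthropic's defaults this returns 2: by the second call the 1.35× cumulative cached cost already undercuts 2× uncached. The ≥3 rule simply adds margin for cache evictions.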
Lever 2: model routing
Not every task needs Opus. A triage classifier on Haiku can route 70% of calls to cheaper models. Measured impact: 40–60% cost reduction with negligible quality change when the routing is calibrated.
Tiered routing pattern:
```
User query
    │
    ▼
[Classifier: Haiku] → simple? → [Haiku answers]
    │ else
    ▼
[Sonnet worker] → uncertain? → [Opus final pass]
```
Full playbook in agent model routing optimization.
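A minimal sketch of the tiered pattern. The keyword heuristics here are a stand-in for the real Haiku triage call, and the model names are placeholders, not exact API identifiers:

```python
# Placeholder tier → model mapping; substitute real model IDs.
MODELS = {"simple": "haiku", "standard": "sonnet", "hard": "opus"}

def classify(query: str) -> str:
    # Stand-in for a cheap-model triage call; production routing would use
    # a calibrated Haiku classifier, not keyword heuristics like these.
    words = query.split()
    if len(words) < 12 and query.strip().endswith("?"):
        return "simple"
    if any(w in query.lower() for w in ("design", "migrate", "prove")):
        return "hard"
    return "standard"

def route(query: str) -> str:
    return MODELS[classify(query)]
```

The structural point is that the classifier runs on every call but the expensive model only runs on the slice that needs it.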
Lever 3: context compression
Every turn, the full conversation history is re-sent. At turn 20 you are paying for turns 1–19 over and over.
Three ways to cut it:
- Summarisation windows. After N turns, replace the earliest M turns with a summary. Budget: lose ~5% accuracy for ~70% token savings at turn 30.
- Retrieval-backed memory. Don't send history; retrieve relevant slices. Trade latency for cost.
- Structured state. Keep a compact JSON state object; the model writes to it, not to free text. 10× denser than natural language.
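The summarisation-window option can be sketched in a few lines. `summarize` here is a hypothetical hook standing in for a cheap-model summarisation call:

```python
def compress_history(history, keep_recent=6, summarize=None):
    """Summarisation window: once the transcript outgrows the window,
    fold everything but the most recent turns into one summary turn."""
    if len(history) <= keep_recent:
        return history
    old, recent = history[:-keep_recent], history[-keep_recent:]
    # `summarize` would be a cheap-model call in production; the
    # placeholder string keeps this sketch self-contained.
    text = summarize(old) if summarize else f"[summary of {len(old)} earlier turns]"
    return [{"role": "system", "content": text}] + recent
```

At turn 20 with a 6-turn window this re-sends 7 messages instead of 20; the summary turn is where the ~5% accuracy cost is paid.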
Lever 4: kill runaway loops
Loops — agent calls tool, gets bad result, retries, gets same bad result — are the single biggest sudden-cost event. They rarely show up in aggregate metrics because they happen in single sessions.
Defences:
- Hard turn limit. 20 turns max per task. After that, escalate to a human or abort.
- Token budget per task. If a task crosses 50k tokens, stop. It is either stuck or solving the wrong problem.
- Identical-call detection. If the agent made the same tool call with the same arguments twice, something is wrong.
- Progress check. Every 5 turns, an internal evaluator asks: "Are we closer than 5 turns ago?" If no, abort.
Instrument all four. See multi-agent debugging techniques for how to find loops in traces.
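The first three defences are cheap enough to combine in one guard object. A sketch (the progress check is omitted because it needs an evaluator model call; thresholds match the numbers above):

```python
class LoopGuard:
    """Turn cap, token budget, and identical-call detection in one gate."""

    def __init__(self, max_turns=20, token_budget=50_000):
        self.max_turns = max_turns
        self.token_budget = token_budget
        self.turns = 0
        self.tokens = 0
        self.seen_calls = set()

    def allow(self, tool: str, args: str, tokens_used: int) -> bool:
        """Call before each tool invocation; False means abort or escalate."""
        self.turns += 1
        self.tokens += tokens_used
        if (tool, args) in self.seen_calls:
            return False  # same tool, same arguments: likely a loop
        self.seen_calls.add((tool, args))
        return self.turns <= self.max_turns and self.tokens <= self.token_budget
```

Pass serialised arguments (e.g. the JSON string) so equality checks are cheap and deterministic.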
Lever 5: batch where possible
Anthropic Message Batches and OpenAI Batch API both offer 50% discounts for async work. If any of your agent's work is not user-facing — offline evals, bulk data processing, overnight generation — move it to batch. The 24-hour SLA is usually fine.
Lever 6: embedding hygiene
Embeddings are cheap per call but add up in bulk.
- Dedupe content before embedding. Near-duplicate chunks waste 30–50% of an embedding budget.
- Use cheaper dimensions. 768-dim Cohere or 512-dim OpenAI small embeddings often land within 2% recall of 1536-dim.
- Re-use across tenants where safe. A generic doc embedded once serves many users.
For the memory side see vector memory for ai agents.
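Deduping before embedding can start as a one-pass hash filter. This sketch catches only exact duplicates after case and whitespace normalisation; real near-duplicate detection would use MinHash or embedding similarity:

```python
import hashlib

def dedupe_chunks(chunks):
    """Drop chunks identical after case/whitespace normalisation,
    keeping the first occurrence of each."""
    seen, unique = set(), []
    for chunk in chunks:
        key = hashlib.sha1(" ".join(chunk.lower().split()).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(chunk)
    return unique
```

Run it between chunking and the embedding call; every chunk it drops is an embedding you never pay for or store.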
A week-one cost audit
If you inherit or launch an agent, run this audit in week one:
- Per-task cost distribution. Plot cost per completed task. If p99 is > 10× p50, you have loops.
- Model mix. What % of calls go to Opus? If > 30%, you probably have under-calibrated routing.
- Cache hit rate. Anthropic exposes it via `cache_read_input_tokens` in each response's `usage` block. Below 60%? Fix cache boundaries.
- Failed-call cost. Calls that errored still cost input tokens. If > 5% of spend, fix error handling.
- Storage drift. Trace + embedding + memory storage grows linearly. Set retention policies on day one.
The first four are quick wins. The fifth prevents slow bleed.
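The first audit check reduces to one ratio. A sketch using nearest-rank percentiles, no interpolation:

```python
def p99_over_p50(costs):
    """Ratio of p99 to p50 per-task cost; > 10 suggests runaway
    loops hiding in the tail of the distribution."""
    s = sorted(costs)
    p50 = s[int(0.50 * (len(s) - 1))]
    p99 = s[int(0.99 * (len(s) - 1))]
    return p99 / p50
```

Feed it per-completed-task cost from your traces; a healthy agent's ratio sits in the low single digits.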
What does not work (despite claims)
- "Switch to a smaller model globally." Quality regression usually cancels the savings.
- "Self-hosted open-weight models will save us." For most teams, ops cost > API savings below ~1B daily tokens.
- "Just add a cache" without thinking about where the cacheable prefix is. Caching random suffixes accomplishes nothing.