The monthly Anthropic bill is not where your agent's costs live. It is a lagging indicator of six upstream choices, most of which your team made without thinking. Here is where to look.
Where the money goes
A production agent's spend breaks down, roughly, into:
| Driver | Share of typical bill | Controllable? |
|---|---|---|
| Input tokens (context) | 40–60% | yes — caching, compression |
| Output tokens | 15–25% | partly — prompt shape |
| Tool-call loops (wasted work) | 10–20% | yes — observability |
| Model choice (routing) | 5–15% | yes — tiered routing |
| Embeddings + storage | 5–10% | yes — dedupe, cheaper dims |
| Observability + trace storage | 2–5% | yes — sampling |
The interesting ones are input tokens and tool-call loops. Input tokens because they scale with conversation length and retrieval size. Tool-call loops because you usually do not know you have them until the invoice arrives.
Lever 1: prompt caching
The cheapest optimisation available. Both Anthropic and OpenAI offer prompt caching — cached prefixes cost 10% of the original input price on Anthropic, 50% on OpenAI. A stable system prompt of 5k tokens turns a $0.05 call into a $0.005 call.
Three rules:
- Cache the stable part, not the user turn. Cache boundaries must be after immutable content and before variable content.
- Budget for the cache write. The first call writes the cache at 1.25× input price. Amortise across ≥3 calls.
- Keep the cacheable block past the minimum threshold (1024 tokens on most Anthropic models, 2048 for Haiku) or it won't cache at all.
See also mcp call caching strategies for the tool-result side of caching.
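The amortisation rule above is just arithmetic, and worth making explicit. A minimal sketch, assuming the multipliers already quoted (1.25× for the cache write, 0.10× for cached reads on Anthropic; OpenAI's cached reads are 0.5× with no write premium) — the raw break-even lands below the conservative ≥3-call budget:

```python
def cache_breakeven(write_mult: float = 1.25, read_mult: float = 0.10) -> int:
    """Smallest call count at which caching a stable prefix beats
    resending it at full input price on every call."""
    n = 1
    # cached path: one write at write_mult, then (n - 1) reads at read_mult
    # uncached path: n full-price sends
    while write_mult + read_mult * (n - 1) >= n:
        n += 1
    return n
```

With Anthropic's defaults this returns 2: by the second call the 1.35× cumulative cached cost already undercuts 2× uncached. The ≥3 rule simply adds margin for cache evictions.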
Lever 2: model routing
Not every task needs Opus. A triage classifier on Haiku can route 70% of calls to cheaper models. Measured impact: 40–60% cost reduction with negligible quality change when the routing is calibrated.
Tiered routing pattern:
```
User query
    │
    ▼
[Classifier: Haiku] → simple? → [Haiku answers]
    │ else
    ▼
[Sonnet worker] → uncertain? → [Opus final pass]
```
Full playbook in agent model routing optimization.
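A minimal sketch of the tiered pattern. The keyword heuristics here are a stand-in for the real Haiku triage call, and the model names are placeholders, not exact API identifiers:

```python
# Placeholder tier → model mapping; substitute real model IDs.
MODELS = {"simple": "haiku", "standard": "sonnet", "hard": "opus"}

def classify(query: str) -> str:
    # Stand-in for a cheap-model triage call; production routing would use
    # a calibrated Haiku classifier, not keyword heuristics like these.
    words = query.split()
    if len(words) < 12 and query.strip().endswith("?"):
        return "simple"
    if any(w in query.lower() for w in ("design", "migrate", "prove")):
        return "hard"
    return "standard"

def route(query: str) -> str:
    return MODELS[classify(query)]
```

The structural point is that the classifier runs on every call but the expensive model only runs on the slice that needs it.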
Lever 3: context compression
Every turn, the full conversation history is re-sent. At turn 20 you are paying for turns 1–19 over and over.
Three ways to cut it:
- Summarisation windows. After N turns, replace the earliest M turns with a summary. Budget: lose ~5% accuracy for ~70% token savings at turn 30.
- Retrieval-backed memory. Don't send history; retrieve relevant slices. Trade latency for cost.
- Structured state. Keep a compact JSON state object; the model writes to it, not to free text. 10× denser than natural language.
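The summarisation-window option can be sketched in a few lines. `summarize` here is a hypothetical hook standing in for a cheap-model summarisation call:

```python
def compress_history(history, keep_recent=6, summarize=None):
    """Summarisation window: once the transcript outgrows the window,
    fold everything but the most recent turns into one summary turn."""
    if len(history) <= keep_recent:
        return history
    old, recent = history[:-keep_recent], history[-keep_recent:]
    # `summarize` would be a cheap-model call in production; the
    # placeholder string keeps this sketch self-contained.
    text = summarize(old) if summarize else f"[summary of {len(old)} earlier turns]"
    return [{"role": "system", "content": text}] + recent
```

At turn 20 with a 6-turn window this re-sends 7 messages instead of 20; the summary turn is where the ~5% accuracy cost is paid.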
Lever 4: kill runaway loops
Loops — agent calls tool, gets bad result, retries, gets same bad result — are the single biggest sudden-cost event. They rarely show up in aggregate metrics because they happen in single sessions.
Defences:
- Hard turn limit. 20 turns max per task. After that, escalate to a human or abort.
- Token budget per task. If a task crosses 50k tokens, stop. It is either stuck or solving the wrong problem.
- Identical-call detection. If the agent made the same tool call with the same arguments twice, something is wrong.
- Progress check. Every 5 turns, an internal evaluator asks: "Are we closer than 5 turns ago?" If no, abort.
Instrument all four. See multi-agent debugging techniques for how to find loops in traces.
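The first three defences are cheap enough to combine in one guard object. A sketch (the progress check is omitted because it needs an evaluator model call; thresholds match the numbers above):

```python
class LoopGuard:
    """Turn cap, token budget, and identical-call detection in one gate."""

    def __init__(self, max_turns=20, token_budget=50_000):
        self.max_turns = max_turns
        self.token_budget = token_budget
        self.turns = 0
        self.tokens = 0
        self.seen_calls = set()

    def allow(self, tool: str, args: str, tokens_used: int) -> bool:
        """Call before each tool invocation; False means abort or escalate."""
        self.turns += 1
        self.tokens += tokens_used
        if (tool, args) in self.seen_calls:
            return False  # same tool, same arguments: likely a loop
        self.seen_calls.add((tool, args))
        return self.turns <= self.max_turns and self.tokens <= self.token_budget
```

Pass serialised arguments (e.g. the JSON string) so equality checks are cheap and deterministic.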
Lever 5: batch where possible
Anthropic Message Batches and OpenAI Batch API both offer 50% discounts for async work. If any of your agent's work is not user-facing — offline evals, bulk data processing, overnight generation — move it to batch. The 24-hour SLA is usually fine.
Lever 6: embedding hygiene
Embeddings are cheap per call but add up in bulk.
- Dedupe content before embedding. Near-duplicate chunks waste 30–50% of an embedding budget.
- Use cheaper dimensions. 768-dim Cohere or 512-dim OpenAI small embeddings often land within 2% recall of 1536-dim.
- Re-use across tenants where safe. A generic doc embedded once serves many users.
For the memory side see vector memory for ai agents.
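Deduping before embedding can start as a one-pass hash filter. This sketch catches only exact duplicates after case and whitespace normalisation; real near-duplicate detection would use MinHash or embedding similarity:

```python
import hashlib

def dedupe_chunks(chunks):
    """Drop chunks identical after case/whitespace normalisation,
    keeping the first occurrence of each."""
    seen, unique = set(), []
    for chunk in chunks:
        key = hashlib.sha1(" ".join(chunk.lower().split()).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(chunk)
    return unique
```

Run it between chunking and the embedding call; every chunk it drops is an embedding you never pay for or store.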
A week-one cost audit
If you inherit or launch an agent, run this audit in week one:
- Per-task cost distribution. Plot cost per completed task. If p99 is > 10× p50, you have loops.
- Model mix. What % of calls go to Opus? If > 30%, you probably have under-calibrated routing.
- Cache hit rate. Anthropic exposes it via `cache_read_input_tokens` in each response's `usage` block. Below 60%? Fix cache boundaries.
- Failed-call cost. Calls that errored still cost input tokens. If > 5% of spend, fix error handling.
- Storage drift. Trace + embedding + memory storage grows linearly. Set retention policies on day one.
The first four are quick wins. The fifth prevents slow bleed.
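The first audit check reduces to one ratio. A sketch using nearest-rank percentiles, no interpolation:

```python
def p99_over_p50(costs):
    """Ratio of p99 to p50 per-task cost; > 10 suggests runaway
    loops hiding in the tail of the distribution."""
    s = sorted(costs)
    p50 = s[int(0.50 * (len(s) - 1))]
    p99 = s[int(0.99 * (len(s) - 1))]
    return p99 / p50
```

Feed it per-completed-task cost from your traces; a healthy agent's ratio sits in the low single digits.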
What does not work (despite claims)
- "Switch to a smaller model globally." Quality regression usually cancels the savings.
- "Self-hosted open-weight models will save us." For most teams, ops cost > API savings below ~1B daily tokens.
- "Just add a cache" without thinking about where the cacheable prefix is. Caching random suffixes accomplishes nothing.