The prompt cache is lever number one for cutting agent cost. Model routing is lever number two — and it requires more thought, because if you get it wrong, quality drops alongside the bill. Here is a working routing strategy with the measured results from our own stack.
Why routing works
A production agent mostly serves easy requests. "Summarise this paragraph" and "what is the schema of this table" do not need Opus. If you send 70% of traffic to Haiku, 25% to Sonnet, 5% to Opus, total cost drops 4-6x compared to always-Opus — with zero quality regression on the easy path.
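The 4-6x figure falls out of simple blended-cost arithmetic. The relative prices below are illustrative assumptions (Opus normalised to 1.0), not published rates:

```js
// Illustrative relative per-request prices, Opus = 1.0. Real rates vary.
const relativeCost = { haiku: 0.08, sonnet: 0.3, opus: 1.0 };
// The traffic mix from the text: 70% / 25% / 5%.
const mix = { haiku: 0.7, sonnet: 0.25, opus: 0.05 };

// Blended cost per request, relative to sending everything to Opus.
const blended = Object.keys(mix).reduce(
  (sum, m) => sum + mix[m] * relativeCost[m],
  0,
);

console.log(blended);     // ~0.181 of always-Opus cost
console.log(1 / blended); // ~5.5x cheaper
```

Shift the mix or the price assumptions and the multiple moves, which is why the strategies below all hinge on classifying traffic correctly.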
Three routing strategies
1. Heuristic
Fast, cheap, opaque. A function of prompt length, tool count, and conversation phase picks the model.
```js
function pickModel(ctx) {
  // Short prompt and read-only tools: the cheap model is safe.
  if (ctx.promptTokens < 200 && ctx.toolsUsed.every(isReadOnly)) return 'haiku-4-5';
  // Long context or write-capable tools: pay for the strongest model.
  if (ctx.promptTokens > 2000 || ctx.toolsUsed.some(isWrite)) return 'opus-4-7';
  // Everything else lands on the mid tier.
  return 'sonnet-4-6';
}
```
Good for: a first cut; produces 60% of the win with one day of engineering.
2. Classifier
A tiny model (or a Haiku call with a classification prompt) picks the downstream model. Slower by ~200ms but far more accurate.
```js
// A cheap classifier call buckets the request into one of three tiers.
const decision = await classify(userPrompt, {
  classes: ['trivial', 'standard', 'complex'],
  examples: [...],
});

// Map the bucket to a model.
const model = {
  trivial: 'haiku-4-5',
  standard: 'sonnet-4-6',
  complex: 'opus-4-7',
}[decision];
```
Good for: production systems where quality regression is costly.
3. Cascade
Try Haiku first. If the response fails a quality check (heuristic or eval-based), retry with Sonnet. Rarely, escalate to Opus.
```js
// Cheapest model first; escalate only when the quality check fails.
let res = await call('haiku-4-5', prompt);
if (!passesQualityCheck(res)) res = await call('sonnet-4-6', prompt);
if (!passesQualityCheck(res)) res = await call('opus-4-7', prompt);
```
Good for: the highest cost-saving when you have a good quality signal; worst-case latency is three calls.
Measured impact, from our own stack
| Strategy | Cost vs always-Opus | p95 latency | Quality |
|---|---|---|---|
| Always Opus | 100% | 1.9s | Baseline |
| Heuristic | 24% | 1.2s | -0.8% |
| Classifier | 19% | 1.5s | -0.3% |
| Cascade | 17% | 1.8s | -0.1% |
Measuring quality regression
Do not ship a router without a quality check. Three pragmatic options:
- Offline eval on a labelled set before rollout.
- Shadow mode where the router runs but always-Opus still answers; compare outputs on a sample.
- Online A/B with a business metric (resolution rate, thumbs-up rate) as the guardrail.
Whatever you pick, keep running it — drift is real as user behaviour changes.
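The shadow-mode option can be sketched in a few lines. The function and parameter names (`answerWithShadow`, `sampleRate`, `logPair`) are illustrative, not part of any SDK:

```js
// Shadow mode: the router's pick runs alongside always-Opus, but only the
// Opus answer is returned to the user. A sample of answer pairs is logged
// for offline comparison.
const sampleRate = 0.05; // compare 5% of traffic

async function answerWithShadow(prompt, call, pickModel, logPair) {
  const opusAnswer = await call('opus-4-7', prompt); // the answer users see
  if (Math.random() < sampleRate) {
    const routed = pickModel(prompt);
    const routerAnswer = await call(routed, prompt); // shadow call, never shown
    logPair({ prompt, routed, routerAnswer, opusAnswer });
  }
  return opusAnswer;
}
```

The shadow call adds cost during the trial, so keep `sampleRate` low and time-box the experiment.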
Fallback chains
Beyond cost, routing doubles as a reliability mechanism. If Opus is over capacity, cascade to Sonnet. If the Anthropic API is degraded, fall back to an alternative provider where the payload maps cleanly. Build the chain once; benefit every time something breaks.
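A minimal sketch of such a chain, with each link as a plain async function so provider-specific payload mapping stays local to its link. The commented-out chain at the bottom is hypothetical; `call` and `callAltProvider` stand in for real client functions:

```js
// Ordered fallback chain: try each link until one succeeds.
async function withFallback(prompt, chain) {
  let lastErr;
  for (const attempt of chain) {
    try {
      return await attempt(prompt);
    } catch (err) {
      lastErr = err; // record, then fall through to the next link
    }
  }
  throw lastErr; // every link failed
}

// Hypothetical chain: primary model, cheaper sibling, alternative provider.
// const chain = [
//   (p) => call('opus-4-7', p),
//   (p) => call('sonnet-4-6', p),          // capacity fallback
//   (p) => callAltProvider(mapPayload(p)), // provider-outage fallback
// ];
```

In practice you would also log which link answered, so a degraded primary shows up in metrics rather than as a silent quality shift.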
What breaks if you get routing wrong
- Silent quality drop on tool selection — Haiku picks the wrong tool more often on ambiguous prompts.
- Latency cliff — cascade triples p95 for the worst 1% of requests.
- Cost leak — misclassified "trivial" requests land on Opus and blow your budget.
Every failure mode is observable if you log the routing decision alongside the outcome. See our analytics guide.
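Logging can be one structured record per request, with the decision and the outcome side by side. Field names and reason labels here are illustrative:

```js
// One structured record per request: routing decision plus outcome.
function record(ctx) {
  return {
    requestId: ctx.requestId,
    model: ctx.model,           // what the router picked
    reason: ctx.reason,         // e.g. 'classifier:trivial' (illustrative label)
    escalated: !!ctx.escalated, // did a cascade retry fire?
    latencyMs: ctx.latencyMs,
    outcome: ctx.outcome,       // e.g. 'resolved' | 'error'
  };
}

// Example query: the cost-leak check. How often did a request the router
// labelled trivial end up on Opus anyway (e.g. via cascade escalation)?
function trivialOnOpusRate(records) {
  const trivial = records.filter((r) => r.reason === 'classifier:trivial');
  if (trivial.length === 0) return 0;
  const leaked = trivial.filter((r) => r.model === 'opus-4-7').length;
  return leaked / trivial.length;
}
```

The same records answer the other two failure modes: group by `model` and `outcome` for silent quality drops, and take the p95 of `latencyMs` where `escalated` is true for the cascade cliff.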
Where this is heading
Expect two shifts: routing as a first-class feature in the Claude Agent SDK (explicit cost hints), and provider-side routing (an Anthropic endpoint that picks for you given a quality SLA). Either way, the patterns above stay relevant — the router just moves.