Tutorial · 7 min read

Agent model routing: route between Opus, Sonnet, and Haiku without losing quality

A practical pattern for routing agent requests between Claude Opus, Sonnet, and Haiku. Classifiers, heuristics, fallback strategies, and measured cost impact.

The prompt cache is lever number one for cutting agent cost. Model routing is lever number two — and it requires more thought, because if you get it wrong, quality drops alongside the bill. Here is a working routing strategy with the measured results from our own stack.

Why routing works

A production agent mostly serves easy requests. "Summarise this paragraph" and "what is the schema of this table" do not need Opus. If you send 70% of traffic to Haiku, 25% to Sonnet, 5% to Opus, total cost drops 4-6x compared to always-Opus — with zero quality regression on the easy path.
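The arithmetic behind that claim is easy to check. The relative prices below are illustrative placeholders, not published rates — plug in your own:

```javascript
// Back-of-envelope check of the 70/25/5 split. Relative prices are
// illustrative assumptions (Opus = 1.0x, Sonnet = 0.3x, Haiku = 0.08x),
// not real list prices.
const relativePrice = { haiku: 0.08, sonnet: 0.3, opus: 1.0 };

function blendedCost(trafficShare) {
  // Weighted average cost per request, relative to always-Opus (= 1.0).
  return Object.entries(trafficShare)
    .reduce((sum, [model, share]) => sum + share * relativePrice[model], 0);
}

const blended = blendedCost({ haiku: 0.70, sonnet: 0.25, opus: 0.05 });
console.log(`blended cost: ${blended.toFixed(3)}x of always-Opus`);
console.log(`savings: ${(1 / blended).toFixed(1)}x cheaper`);
```

With these assumed prices the blend lands around 5.5x cheaper — inside the 4-6x range, and very sensitive to how much traffic actually qualifies for Haiku.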

Three routing strategies

1. Heuristic

Fast, cheap, opaque. A function of prompt length, tool count, and conversation phase picks the model.

function pickModel(ctx) {
  // Short prompt, read-only tools: safe for the cheapest model.
  if (ctx.promptTokens < 200 && ctx.toolsUsed.every(isReadOnly)) return 'haiku-4-5';
  // Long context or write-capable tools: pay for the strongest model.
  if (ctx.promptTokens > 2000 || ctx.toolsUsed.some(isWrite)) return 'opus-4-7';
  return 'sonnet-4-6';
}

Good for: a first cut; produces 60% of the win with one day of engineering.

2. Classifier

A tiny model (or a Haiku call with a classification prompt) picks the downstream model. Slower by ~200ms but far more accurate.

const decision = await classify(userPrompt, {
  classes: ['trivial', 'standard', 'complex'],
  examples: [...],
});
const model = {
  trivial: 'haiku-4-5',
  standard: 'sonnet-4-6',
  complex: 'opus-4-7',
}[decision];

Good for: production systems where quality regression is costly.
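One way the classify() helper could look. This is a sketch, not a fixed API: `callModel` is an assumed wrapper around whatever client you use (injected here so the helper stays testable), and the few-shot prompt format is one choice among many:

```javascript
// A minimal sketch of classify(). `callModel(model, prompt)` is an
// assumed helper that returns the model's text completion as a string.
async function classify(userPrompt, { classes, examples }, callModel) {
  const shots = examples
    .map(e => `Request: ${e.prompt}\nClass: ${e.label}`)
    .join('\n\n');
  const prompt =
    `Classify the request as one of: ${classes.join(', ')}.\n` +
    `Answer with the class name only.\n\n${shots}\n\n` +
    `Request: ${userPrompt}\nClass:`;
  const raw = (await callModel('haiku-4-5', prompt)).trim().toLowerCase();
  // If the answer is not a known class, fail safe to the most capable tier.
  return classes.includes(raw) ? raw : classes[classes.length - 1];
}
```

The fail-safe default matters: a classifier that mumbles should cost you money, not quality.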

3. Cascade

Try Haiku first. If the response fails a quality check (heuristic or eval-based), retry with Sonnet. Rarely, escalate to Opus.

let res = await call('haiku-4-5', prompt);
if (!passesQualityCheck(res)) res = await call('sonnet-4-6', prompt);
if (!passesQualityCheck(res)) res = await call('opus-4-7', prompt);

Good for: the biggest cost savings when you have a good quality signal; the worst case pays the latency of three sequential calls.
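The cascade stands or falls on passesQualityCheck(). One cheap shape for it is structural heuristics only — the thresholds and refusal markers below are illustrative guesses to tune against a labelled sample, not a vetted list:

```javascript
// Structural quality heuristics for the cascade. Every constant here is
// an assumption: calibrate the length floor and refusal markers against
// your own labelled data before trusting the escalation decision.
function passesQualityCheck(res) {
  if (!res || res.trim().length < 20) return false;        // empty or truncated
  const refusals = ["i can't", 'i cannot', 'as an ai'];
  const lower = res.toLowerCase();
  if (refusals.some(r => lower.includes(r))) return false; // likely refusal
  return true;
}
```

Heuristics like these catch the obvious failures; for subtle quality drops you still want the eval-based check mentioned above.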

Measured impact, from our own stack

Strategy       Cost vs always-Opus   p95 latency   Quality
Always Opus    100%                  1.9s          Baseline
Heuristic      24%                   1.2s          -0.8%
Classifier     19%                   1.5s          -0.3%
Cascade        17%                   1.8s          -0.1%

Measuring quality regression

Do not ship a router without a quality check. Three pragmatic options:

  1. Offline eval on a labelled set before rollout.
  2. Shadow mode where the router runs but always-Opus still answers; compare outputs on a sample.
  3. Online A/B with a business metric (resolution rate, thumbs-up rate) as the guardrail.

Whatever you pick, keep running it — drift is real as user behaviour changes.
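Shadow mode (option 2) can be sketched in a few lines. `call` and `logShadowPair` are injected placeholders here, not real APIs — the key property is that the shadow call can never delay or fail the real answer:

```javascript
// Shadow mode sketch: the user always gets the Opus answer; the routed
// model runs on a sample and both outputs are logged for comparison.
// `deps.call` and `deps.logShadowPair` are assumed helpers, injected so
// the sketch stays testable.
async function answerWithShadow(prompt, routedModel, deps, sampleRate = 0.05) {
  const { call, logShadowPair } = deps;
  const opusPromise = call('opus-4-7', prompt);
  if (Math.random() < sampleRate) {
    // Fire-and-forget: the shadow path must never block the real answer.
    call(routedModel, prompt)
      .then(shadow => opusPromise.then(real =>
        logShadowPair({ prompt, routedModel, shadow, real })))
      .catch(() => {}); // swallow shadow failures
  }
  return opusPromise;
}
```

Sampling at a few percent is usually enough to detect regression without doubling your bill.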

Fallback chains

Beyond cost, routing doubles as a reliability mechanism. If Opus is over capacity, cascade to Sonnet. If the Anthropic API is degraded, fall back to an alternative provider where the payload maps cleanly. Build the chain once; benefit every time something breaks.
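A minimal fallback chain can reuse the same machinery. This sketch assumes errors expose an HTTP-style `status` and treats capacity-type codes as "try the next model" — adjust the retryable set to whatever your provider actually returns:

```javascript
// Fallback chain sketch. Assumption: thrown errors carry an HTTP-style
// `status`; 429 and 529 are treated as capacity problems worth falling
// through on, anything else is a real failure and is rethrown.
async function callWithFallback(prompt, chain, call) {
  let lastErr;
  for (const model of chain) {
    try {
      return await call(model, prompt);
    } catch (err) {
      if (err.status !== 429 && err.status !== 529) throw err;
      lastErr = err; // over capacity: fall through to the next model
    }
  }
  throw lastErr; // every model in the chain was over capacity
}

// Build the chain once, e.g.:
// await callWithFallback(prompt, ['opus-4-7', 'sonnet-4-6', 'haiku-4-5'], call);
```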

What breaks if you get routing wrong

  • Silent quality drop on tool selection — Haiku picks the wrong tool more often on ambiguous prompts.
  • Latency cliff — cascade triples p95 for the worst 1% of requests.
  • Cost leak — misclassified "trivial" requests land on Opus and blow your budget.

Every failure mode is observable if you log the routing decision alongside the outcome. See our analytics guide.
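The minimum worth logging per request looks something like this — field names are illustrative, not a schema you must adopt, but each field maps to one of the failure modes above:

```javascript
// One log entry per routed request. Field names are illustrative; the
// point is that each maps to a failure mode: qualityPassed catches
// silent quality drops, escalations catches the latency cliff, and
// costUsd catches cost leaks.
function routingLogEntry({ requestId, decision, outcome }) {
  return {
    requestId,
    routedModel: decision.model,          // what the router picked
    strategy: decision.strategy,          // heuristic | classifier | cascade
    escalations: decision.escalations,    // cascade retries behind this answer
    latencyMs: outcome.latencyMs,
    qualityPassed: outcome.qualityPassed,
    costUsd: outcome.costUsd,
  };
}
```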

Where this is heading

Expect two shifts: routing as a first-class feature in the Claude Agent SDK (explicit cost hints), and provider-side routing (an Anthropic endpoint that picks for you given a quality SLA). Either way, the patterns above stay relevant — the router just moves.
