Adding the eightieth tool to your mega-agent is the moment quality stops improving and starts dropping. Specialisation — splitting one agent into a fleet of focused ones — is the next move. The framework below turns that decision from intuition into engineering.
When to specialise
Three signals that one agent is overloaded:
- Tool selection accuracy drops below 90% on tasks the agent should handle.
- System prompt exceeds 20k tokens because every domain needs its quirks documented.
- Eval set fragments — different task classes have wildly different pass rates.
Any one is a hint; two is a strong signal; all three is a deadline.
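The three signals can be written down as an explicit check. The 90% and 20k thresholds come from the list above; treating a >20-point spread in per-class pass rates as "fragmented" is an illustrative assumption, as are the metric names:

```python
def overload_signals(tool_accuracy: float,
                     prompt_tokens: int,
                     pass_rates_by_class: dict[str, float]) -> int:
    """Count how many of the three specialisation signals are firing."""
    signals = 0
    if tool_accuracy < 0.90:                      # signal 1: tool selection slipping
        signals += 1
    if prompt_tokens > 20_000:                    # signal 2: bloated system prompt
        signals += 1
    rates = list(pass_rates_by_class.values())
    if rates and max(rates) - min(rates) > 0.20:  # signal 3: fragmented eval set
        signals += 1
    return signals

# Two signals is a strong hint; three means specialise now.
n = overload_signals(0.87, 23_000, {"support": 0.90, "codegen": 0.60})
```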
The five-step framework
1. Cluster the eval set
Take your last 1000 real user requests. Cluster them by intent: support questions, code generation, data analysis, planning, writing. Each cluster is a candidate agent.
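A real clustering pass would use embeddings plus k-means, but a toy keyword map is enough to see the shape of the output. The intents and keywords below are illustrative, not a recommended taxonomy:

```python
from collections import defaultdict

INTENT_KEYWORDS = {
    "support":  ["refund", "login", "error", "help"],
    "code-gen": ["function", "bug", "implement", "test"],
    "planning": ["schedule", "meeting", "plan", "deadline"],
}

def cluster_requests(requests: list[str]) -> dict[str, list[str]]:
    """Assign each request to the intent whose keywords it matches most."""
    clusters = defaultdict(list)
    for req in requests:
        words = req.lower().split()
        scores = {intent: sum(w in words for w in kws)
                  for intent, kws in INTENT_KEYWORDS.items()}
        best = max(scores, key=scores.get)
        clusters[best if scores[best] > 0 else "unclassified"].append(req)
    return dict(clusters)
```

Each resulting cluster is a candidate agent; the "unclassified" bucket tells you how much traffic your taxonomy fails to cover.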
2. Map tools to clusters
For each cluster, list the tools actually used. The intersection gives the "shared core"; the differences give the specialisation.
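The shared-core computation is literally a set intersection. The tool assignments below are hypothetical:

```python
tools_by_cluster = {
    "support":  {"zendesk", "docs-search", "postgres-readonly"},
    "code-gen": {"github", "filesystem", "test-runner", "docs-search"},
    "planner":  {"calendar", "notion", "docs-search"},
}

# Intersection across clusters = shared core; the per-cluster
# difference is each agent's specialisation.
shared_core = set.intersection(*tools_by_cluster.values())
specialisation = {name: tools - shared_core
                  for name, tools in tools_by_cluster.items()}
```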
3. Define a role per cluster
A role is a tuple: (system prompt, tool set, model choice, evaluation set, escalation policy). Write these down explicitly — they become the agent specs.
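One way to make the tuple concrete is a frozen dataclass; the field names here are a sketch, not an SDK contract, with values taken from the support spec further down:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RoleSpec:
    name: str
    system_prompt_path: str
    tools: tuple[str, ...]
    model: str
    eval_set_path: str
    escalates_to: str  # next role, or "human"

support = RoleSpec(
    name="support",
    system_prompt_path="prompts/support.md",
    tools=("zendesk", "postgres-readonly:support_db", "docs-search"),
    model="sonnet-4-6",
    eval_set_path="evals/support.yaml",
    escalates_to="human",
)
```

Freezing the dataclass keeps the spec immutable at runtime: an agent cannot quietly grant itself a tool mid-session.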
4. Design the router
A small classifier (Haiku or rules-based) routes incoming requests to the right specialist. Cheap, low-latency, audit-friendly. See model routing for the cousin pattern.
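A rules-first router is the cheapest version: regex rules catch the obvious cases, and anything that falls through goes to the small classifier model (stubbed as a plain fallback here). The patterns are illustrative:

```python
import re

ROUTES = [
    (re.compile(r"\b(refund|invoice|cancel|login)\b", re.I), "support"),
    (re.compile(r"\b(code|function|bug|deploy)\b", re.I),    "code-gen"),
    (re.compile(r"\b(schedule|meeting|plan)\b", re.I),       "planner"),
]

def route(request: str, fallback: str = "support") -> str:
    for pattern, role in ROUTES:
        if pattern.search(request):
            return role
    # In production this branch calls the small classifier model instead
    # of returning a static fallback.
    return fallback
```

Rules are auditable: when a request is misrouted, you can point at the exact pattern that fired.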
5. Define handoffs
Some tasks span clusters. The handoff protocol says: when does specialist A pass control to B, what context goes with it, who returns to the user.
Sample role specs
- role: support
  prompt_path: prompts/support.md
  tools: [zendesk, postgres-readonly:support_db, docs-search]
  model: sonnet-4-6
  eval: evals/support.yaml
  escalates_to: human
- role: code-gen
  prompt_path: prompts/codegen.md
  tools: [github, filesystem, test-runner]
  model: opus-4-7
  eval: evals/codegen.yaml
  escalates_to: senior-engineer
- role: planner
  prompt_path: prompts/planner.md
  tools: [calendar, notion]
  model: haiku-4-5
  eval: evals/planner.yaml
  escalates_to: support
The router design
user request
↓
classifier: which role?
↓
specialist agent runs
↓
if escalation: handoff to next role with context bundle
↓
return to user
The classifier is the single most important piece. Evaluate its accuracy as carefully as you evaluate the specialists themselves.
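The flow above can be sketched as an orchestration loop. The `route` rule, the specialist runner, and the escalation table are all stubs here; real versions call the classifier and the agent runtime:

```python
from dataclasses import dataclass

@dataclass
class Result:
    answer: str = ""
    escalate: bool = False
    summary: str = ""

ESCALATES_TO = {"planner": "support", "support": "human"}

def route(request: str) -> str:               # stub classifier
    return "planner" if "schedule" in request else "support"

def run_specialist(role: str, request: str, context: dict) -> Result:
    if role == "planner":                     # stub: planner always escalates
        return Result(escalate=True, summary="no calendar found")
    return Result(answer=f"{role} handled: {request}")

def handle(request: str, max_hops: int = 3) -> str:
    role = route(request)                             # classifier step
    context = {"intent": request, "trail": []}        # context bundle
    for _ in range(max_hops):
        result = run_specialist(role, request, context)
        if not result.escalate:
            return result.answer                      # close the loop
        context["trail"].append((role, result.summary))
        role = ESCALATES_TO.get(role, "human")        # handoff
        if role == "human":
            return "escalated to a human"
    return "escalation chain exceeded max hops"
```

Note the `max_hops` guard: without it, two specialists that escalate to each other loop forever.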
When NOT to specialise
Three anti-patterns:
- Premature decomposition — splitting a 200-request-a-day agent into six. You will spend more on plumbing than the split saves.
- Specialisation that mirrors the model menu — one agent per Opus/Sonnet/Haiku is just routing, not specialisation.
- Over-narrow roles — six agents that each handle 5% of traffic; the 70% common case is then nobody's job.
A good rule: 3–6 specialists at most for typical product agents. Beyond that, you are reinventing microservices badly.
Handoff design
A handoff should carry:
- Original user intent (verbatim).
- Specialist A's summary of what was done.
- Why control is being handed.
- Compact context bundle (relevant memory, retrieved chunks, decisions).
- Return-to-user expectation (who closes the loop).
Bad handoffs are the largest source of multi-agent bugs. Write the handoff protocol as carefully as you write the agent prompts.
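The five-item bundle above can be written as an explicit type; field names and the example values are illustrative, not an SDK contract:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Handoff:
    user_intent: str        # verbatim original request
    work_summary: str       # what specialist A already did
    reason: str             # why control is moving
    context_bundle: dict    # relevant memory, retrieved chunks, decisions
    returns_to_user: str    # which role closes the loop

h = Handoff(
    user_intent="refund order 1234 and update my invoice",
    work_summary="verified the order exists; refund needs finance approval",
    reason="refund amount above support's authority",
    context_bundle={"order_id": "1234", "amount": 420.0},
    returns_to_user="support",
)
```

A typed handoff turns "specialist B got confused" from a prompt-debugging session into a missing-field bug.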
Eval per role
Each specialist needs its own eval set. Aggregating across all specialists hides regressions. The router gets its own eval too — accuracy on a held-out classification set.
See the evaluation framework for the supporting infra.
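A small worked example of why aggregation hides regressions, with made-up numbers: the overall pass rate barely moves while one role breaks.

```python
# (passed, total) per role, before and after a prompt change
before = {"support": (90, 100), "code-gen": (80, 100)}
after  = {"support": (98, 100), "code-gen": (70, 100)}

def overall(results: dict) -> float:
    passed = sum(p for p, _ in results.values())
    total = sum(t for _, t in results.values())
    return passed / total

# Overall: 0.85 -> 0.84, which reads as noise.
# Per-role: code-gen dropped from 80% to 70%, a real regression.
```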
Where this is heading
Two shifts to expect: agent fleet primitives in the Claude Agent SDK (define the roles, declare handoffs, the SDK orchestrates), and managed role catalogues (download "support-agent" and configure for your stack). Until then, the framework above is the working version.