When an agent feels slow, instinct says "blame the model". The data almost always says otherwise. In our profiles, MCP tool calls account for 60–80% of agent latency. Here is the profiling stack, the bottlenecks we see most often, and the fixes that move p95.
The latency budget for one agent turn
A typical agent turn breaks down roughly:
| Phase | Median time |
|---|---|
| User prompt → first token | 250–500 ms |
| Model output streaming | 800–2000 ms |
| Each tool call (round trip) | 100–800 ms |
| Tool result → next model call | 200–400 ms |
The model's share of the budget is relatively predictable; tool-call time varies by 8x. That is where to look first.
Where to instrument
Three boundaries to wrap:
- MCP transport — JSON-RPC request out, response back. Captures network + remote work.
- Tool implementation — inside the server, before and after the actual work. Splits transport from compute.
- Agent loop — host-side time between sending a tool result and receiving the next model response.
OpenTelemetry spans at all three give a full breakdown. See the observability platforms guide for ingestion options.
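For the middle boundary, a minimal sketch of wrapping a tool handler in a span using @opentelemetry/api; the span name, attributes, and the runQuery implementation are placeholders, not part of any MCP SDK:

```typescript
import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('mcp-server');

// Placeholder for the tool's real work.
declare function runQuery(sql: string): Promise<unknown>;

async function handleQueryTool(args: { sql: string }) {
  return tracer.startActiveSpan('tool.query', async (span) => {
    try {
      span.setAttribute('tool.args.bytes', JSON.stringify(args).length);
      const result = await runQuery(args.sql);
      span.setAttribute('tool.result.bytes', JSON.stringify(result).length);
      return result;
    } catch (err) {
      span.recordException(err as Error);
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw err;
    } finally {
      span.end();
    }
  });
}
```

The gap between this span and the transport span around the same call is your network plus serialisation cost.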
A minimal profiler
If you do not have OTel yet, a five-line wrapper gives 80% of the insight:
async function timed<T>(name: string, fn: () => Promise<T>): Promise<T> {
  const start = performance.now();
  try {
    return await fn();
  } finally {
    console.log(name, (performance.now() - start).toFixed(0), 'ms');
  }
}
Wrap every tool handler with timed(name, () => realHandler(args)). Aggregate the logs after a real session. The slow tools jump out immediately.
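If you would rather get percentiles in-process than grep logs, a sketch of a tiny collector; record could be called from the finally block above, and the names are illustrative:

```typescript
const samples = new Map<string, number[]>();

function record(name: string, ms: number) {
  const arr = samples.get(name) ?? [];
  arr.push(ms);
  samples.set(name, arr);
}

// Print per-tool p50/p95 at the end of a session.
function report() {
  for (const [name, arr] of samples) {
    const sorted = [...arr].sort((a, b) => a - b);
    const pct = (p: number) =>
      sorted[Math.min(sorted.length - 1, Math.floor(p * sorted.length))];
    console.log(`${name}: n=${sorted.length} p50=${pct(0.5).toFixed(0)}ms p95=${pct(0.95).toFixed(0)}ms`);
  }
}
```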
The five most common bottlenecks
1. Cold start of the MCP server process
Servers launched with npx -y re-resolve the package and relaunch on every Claude Desktop start; the first call after launch can take 1–3 seconds. Fix: pin versions, install locally, or pre-warm long-running servers.
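In Claude Desktop this is usually a one-line change in claude_desktop_config.json: point the entry at a locally installed build instead of npx -y. The server name and path below are illustrative:

```json
{
  "mcpServers": {
    "db": {
      "command": "node",
      "args": ["/opt/mcp-servers/db/dist/index.js"]
    }
  }
}
```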
2. Synchronous database queries returning huge result sets
A SELECT * on a wide table dumps megabytes of JSON across stdio. Fix: paginate at the server, project only needed columns, stream when possible.
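A sketch of the server-side fix, assuming node-postgres; the table, columns, and limits are illustrative:

```typescript
import { Pool } from 'pg';

const db = new Pool();

// Paginated, projected query: the agent pulls more pages only if it needs them.
async function listOrders(args: { cursor?: number; limit?: number }) {
  const limit = Math.min(args.limit ?? 100, 500); // hard cap per call
  const cursor = args.cursor ?? 0;
  const { rows } = await db.query(
    'SELECT id, status, total FROM orders WHERE id > $1 ORDER BY id LIMIT $2',
    [cursor, limit], // only the columns the model actually needs
  );
  return {
    rows,
    nextCursor: rows.length === limit ? rows[rows.length - 1].id : null,
  };
}
```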
3. Sequential network calls inside one tool
A tool calls three APIs in sequence when it could run them in parallel. Fix: Promise.all at the tool boundary; the agent never sees the difference.
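A sketch of the parallel version; the endpoints are illustrative:

```typescript
// One tool call, three upstream requests in flight at once.
async function customerOverview(customerId: string) {
  const [profile, orders, tickets] = await Promise.all([
    fetch(`https://api.example.com/customers/${customerId}`).then(r => r.json()),
    fetch(`https://api.example.com/customers/${customerId}/orders`).then(r => r.json()),
    fetch(`https://api.example.com/customers/${customerId}/tickets`).then(r => r.json()),
  ]);
  return { profile, orders, tickets };
}
```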
4. JSON serialisation of nested results
Encoding 50k-row results to JSON eats CPU. Fix: pre-summarise at the server. The model rarely needs the raw rows.
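A sketch of summarising before serialisation; the row shape and grouping are illustrative:

```typescript
interface Row { region: string; revenue: number }

// Collapse tens of thousands of rows into a few hundred bytes of aggregates.
function summarise(rows: Row[]) {
  const byRegion = new Map<string, { count: number; revenue: number }>();
  for (const row of rows) {
    const agg = byRegion.get(row.region) ?? { count: 0, revenue: 0 };
    agg.count += 1;
    agg.revenue += row.revenue;
    byRegion.set(row.region, agg);
  }
  return { totalRows: rows.length, byRegion: Object.fromEntries(byRegion) };
}
```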
5. Stdio backpressure
Long results block the JSON-RPC channel; subsequent calls queue. Fix: split large results into multiple smaller responses or move to HTTP+SSE for the heavy server.
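One way to split, sketched under the assumption of a simple in-memory store; the page size, handle format, and second fetch tool are illustrative:

```typescript
const resultStore = new Map<string, unknown[][]>();

// Keep the full result server-side; return page 0 plus a handle for the rest.
function storePaged(items: unknown[], pageSize = 200) {
  const pages: unknown[][] = [];
  for (let i = 0; i < items.length; i += pageSize) {
    pages.push(items.slice(i, i + pageSize));
  }
  const handle = `result-${Date.now()}`;
  resultStore.set(handle, pages);
  return { handle, pageCount: pages.length, firstPage: pages[0] ?? [] };
}

// A second, cheap tool serves later pages on demand.
function getPage(handle: string, page: number) {
  return resultStore.get(handle)?.[page] ?? [];
}
```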
Reading the trace
A good profile shows you the fix in seconds:
- Tool latency >> network RTT → the work is in the server, optimise the implementation.
- Tool latency ≈ network RTT → the work is the network, batch or co-locate.
- Tool latency low, agent-loop wait high → the model is slow on the next turn, look at prompt size.
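If you want to apply the same rules automatically over aggregated spans, a rough triage sketch; the thresholds are illustrative, not measured:

```typescript
// toolMs: full round trip for the call; rttMs: network RTT to the server;
// loopWaitMs: host-side wait before the next model response.
function triage(toolMs: number, rttMs: number, loopWaitMs: number): string {
  if (toolMs > 5 * rttMs) return 'server-bound: optimise the tool implementation';
  if (toolMs <= 2 * rttMs && loopWaitMs <= 2 * toolMs) return 'network-bound: batch or co-locate';
  if (loopWaitMs > toolMs) return 'model-bound: look at prompt size';
  return 'mixed: add finer-grained spans';
}
```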
Caching as a first-pass fix
Before you optimise an MCP server, ask if it needs to be called at all. The caching guide covers the patterns. Most read-heavy tools (schema introspection, docs lookups) are 90% repeats.
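A minimal sketch of that first pass: a TTL cache keyed on the tool's arguments, wrapped around any read-heavy handler. The five-minute TTL and the fetchSchema handler are illustrative; see the caching guide for invalidation strategies:

```typescript
// Memoise an async tool handler by its serialised arguments.
function cached<A extends unknown[], R>(
  fn: (...args: A) => Promise<R>,
  ttlMs = 5 * 60_000,
) {
  const store = new Map<string, { value: R; expires: number }>();
  return async (...args: A): Promise<R> => {
    const key = JSON.stringify(args);
    const hit = store.get(key);
    if (hit && hit.expires > Date.now()) return hit.value;
    const value = await fn(...args);
    store.set(key, { value, expires: Date.now() + ttlMs });
    return value;
  };
}

// e.g. const getSchema = cached(fetchSchema); // fetchSchema is your real handler
```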
SLO recommendations
Reasonable per-tool latency targets for production agents:
| Tool class | p50 | p95 |
|---|---|---|
| In-memory read | < 5 ms | < 20 ms |
| Local DB query | < 50 ms | < 200 ms |
| Same-region API | < 150 ms | < 600 ms |
| Cross-region API | < 400 ms | < 1500 ms |
| Browser automation | < 1500 ms | < 5000 ms |
If your tool is slower, that is an observability target before it is a code change.
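If you already aggregate per-tool p95s (from OTel or the minimal profiler above), a sketch of turning the table into a check; the class names and the tool-to-class mapping are illustrative:

```typescript
const p95TargetMs: Record<string, number> = {
  'in-memory-read': 20,
  'local-db-query': 200,
  'same-region-api': 600,
  'cross-region-api': 1500,
  'browser-automation': 5000,
};

// toolClass maps each tool name to one of the classes above.
function checkSlos(p95ByTool: Record<string, number>, toolClass: Record<string, string>) {
  for (const [tool, p95] of Object.entries(p95ByTool)) {
    const target = p95TargetMs[toolClass[tool]];
    if (target !== undefined && p95 > target) {
      console.warn(`${tool}: p95 ${Math.round(p95)} ms exceeds the ${target} ms target`);
    }
  }
}
```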
Where this is heading
Two shifts to watch: per-tool latency budgets enforced at the host (Claude Desktop kills slow tools), and standardised MCP performance attributes in the protocol so hosts can pick fast servers automatically. Profile now and you will be ready.