When an agent feels slow, instinct says "blame the model". The data almost always says otherwise. In our profiles, MCP tool calls account for 60–80% of agent latency. Here is the profiling stack, the bottlenecks we see most often, and the fixes that move p95.
The latency budget for one agent turn
A typical agent turn breaks down roughly:
| Phase | Median time |
|---|---|
| User prompt → first token | 250–500 ms |
| Model output streaming | 800–2000 ms |
| Each tool call (round trip) | 100–800 ms |
| Tool result → next model call | 200–400 ms |
The model's share of the budget is relatively predictable; tool-call time varies by 8x. That is where to look first.
Where to instrument
Three boundaries to wrap:
- MCP transport — JSON-RPC request out, response back. Captures network + remote work.
- Tool implementation — inside the server, before and after the actual work. Splits transport from compute.
- Agent loop — host-side time between sending a tool result and receiving the next model response.
OpenTelemetry spans at all three give a full breakdown. See the observability platforms guide for ingestion options.
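For the middle boundary, a minimal sketch of wrapping a tool handler in a span using @opentelemetry/api; the span name, attributes, and the runQuery implementation are placeholders, not part of any MCP SDK:

```typescript
import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('mcp-server');

// Placeholder for the tool's real work.
declare function runQuery(sql: string): Promise<unknown>;

async function handleQueryTool(args: { sql: string }) {
  return tracer.startActiveSpan('tool.query', async (span) => {
    try {
      span.setAttribute('tool.args.bytes', JSON.stringify(args).length);
      const result = await runQuery(args.sql);
      span.setAttribute('tool.result.bytes', JSON.stringify(result).length);
      return result;
    } catch (err) {
      span.recordException(err as Error);
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw err;
    } finally {
      span.end();
    }
  });
}
```

The gap between this span and the transport span around the same call is your network plus serialisation cost.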
A minimal profiler
If you do not have OTel yet, a five-line wrapper gives 80% of the insight:
async function timed<T>(name: string, fn: () => Promise<T>): Promise<T> {
  const start = performance.now();
  try {
    return await fn();
  } finally {
    console.log(name, (performance.now() - start).toFixed(0), 'ms');
  }
}
Wrap every tool handler with timed(name, () => realHandler(args)). Aggregate the logs after a real session. The slow tools jump out immediately.
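If you would rather get percentiles in-process than grep logs, a sketch of a tiny collector; record could be called from the finally block above, and the names are illustrative:

```typescript
const samples = new Map<string, number[]>();

function record(name: string, ms: number) {
  const arr = samples.get(name) ?? [];
  arr.push(ms);
  samples.set(name, arr);
}

// Print per-tool p50/p95 at the end of a session.
function report() {
  for (const [name, arr] of samples) {
    const sorted = [...arr].sort((a, b) => a - b);
    const pct = (p: number) =>
      sorted[Math.min(sorted.length - 1, Math.floor(p * sorted.length))];
    console.log(`${name}: n=${sorted.length} p50=${pct(0.5).toFixed(0)}ms p95=${pct(0.95).toFixed(0)}ms`);
  }
}
```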
The five most common bottlenecks
1. Cold start of the MCP server process
Servers launched with npx -y re-resolve the package and relaunch on every Claude Desktop start; the first call after launch can take 1–3 seconds. Fix: pin versions, install locally, or pre-warm long-running servers.
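In Claude Desktop this is usually a one-line change in claude_desktop_config.json: point the entry at a locally installed build instead of npx -y. The server name and path below are illustrative:

```json
{
  "mcpServers": {
    "db": {
      "command": "node",
      "args": ["/opt/mcp-servers/db/dist/index.js"]
    }
  }
}
```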
2. Synchronous database queries returning huge result sets
A SELECT * on a wide table dumps megabytes of JSON across stdio. Fix: paginate at the server, project only needed columns, stream when possible.
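A sketch of the server-side fix, assuming node-postgres; the table, columns, and limits are illustrative:

```typescript
import { Pool } from 'pg';

const db = new Pool();

// Paginated, projected query: the agent pulls more pages only if it needs them.
async function listOrders(args: { cursor?: number; limit?: number }) {
  const limit = Math.min(args.limit ?? 100, 500); // hard cap per call
  const cursor = args.cursor ?? 0;
  const { rows } = await db.query(
    'SELECT id, status, total FROM orders WHERE id > $1 ORDER BY id LIMIT $2',
    [cursor, limit], // only the columns the model actually needs
  );
  return {
    rows,
    nextCursor: rows.length === limit ? rows[rows.length - 1].id : null,
  };
}
```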
3. Sequential network calls inside one tool
A tool calls three APIs in sequence when it could run them in parallel. Fix: Promise.all at the tool boundary; the agent never sees the difference.
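A sketch of the parallel version; the endpoints are illustrative:

```typescript
// One tool call, three upstream requests in flight at once.
async function customerOverview(customerId: string) {
  const [profile, orders, tickets] = await Promise.all([
    fetch(`https://api.example.com/customers/${customerId}`).then(r => r.json()),
    fetch(`https://api.example.com/customers/${customerId}/orders`).then(r => r.json()),
    fetch(`https://api.example.com/customers/${customerId}/tickets`).then(r => r.json()),
  ]);
  return { profile, orders, tickets };
}
```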
4. JSON serialisation of nested results
Encoding 50k-row results to JSON eats CPU. Fix: pre-summarise at the server. The model rarely needs the raw rows.
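A sketch of summarising before serialisation; the row shape and grouping are illustrative:

```typescript
interface Row { region: string; revenue: number }

// Collapse tens of thousands of rows into a few hundred bytes of aggregates.
function summarise(rows: Row[]) {
  const byRegion = new Map<string, { count: number; revenue: number }>();
  for (const row of rows) {
    const agg = byRegion.get(row.region) ?? { count: 0, revenue: 0 };
    agg.count += 1;
    agg.revenue += row.revenue;
    byRegion.set(row.region, agg);
  }
  return { totalRows: rows.length, byRegion: Object.fromEntries(byRegion) };
}
```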
5. Stdio backpressure
Long results block the JSON-RPC channel; subsequent calls queue. Fix: split large results into multiple smaller responses or move to HTTP+SSE for the heavy server.
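One way to split, sketched under the assumption of a simple in-memory store; the page size, handle format, and second fetch tool are illustrative:

```typescript
const resultStore = new Map<string, unknown[][]>();

// Keep the full result server-side; return page 0 plus a handle for the rest.
function storePaged(items: unknown[], pageSize = 200) {
  const pages: unknown[][] = [];
  for (let i = 0; i < items.length; i += pageSize) {
    pages.push(items.slice(i, i + pageSize));
  }
  const handle = `result-${Date.now()}`;
  resultStore.set(handle, pages);
  return { handle, pageCount: pages.length, firstPage: pages[0] ?? [] };
}

// A second, cheap tool serves later pages on demand.
function getPage(handle: string, page: number) {
  return resultStore.get(handle)?.[page] ?? [];
}
```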
Reading the trace
A good profile shows you the fix in seconds:
- Tool latency >> network RTT → the work is in the server, optimise the implementation.
- Tool latency ≈ network RTT → the work is the network, batch or co-locate.
- Tool latency low, agent-loop wait high → the model is slow on the next turn, look at prompt size.
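If you want to apply the same rules automatically over aggregated spans, a rough triage sketch; the thresholds are illustrative, not measured:

```typescript
// toolMs: full round trip for the call; rttMs: network RTT to the server;
// loopWaitMs: host-side wait before the next model response.
function triage(toolMs: number, rttMs: number, loopWaitMs: number): string {
  if (toolMs > 5 * rttMs) return 'server-bound: optimise the tool implementation';
  if (toolMs <= 2 * rttMs && loopWaitMs <= 2 * toolMs) return 'network-bound: batch or co-locate';
  if (loopWaitMs > toolMs) return 'model-bound: look at prompt size';
  return 'mixed: add finer-grained spans';
}
```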
Caching as a first-pass fix
Before you optimise an MCP server, ask if it needs to be called at all. The caching guide covers the patterns. Most read-heavy tools (schema introspection, docs lookups) are 90% repeats.
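A minimal sketch of that first pass: a TTL cache keyed on the tool's arguments, wrapped around any read-heavy handler. The five-minute TTL and the fetchSchema handler are illustrative; see the caching guide for invalidation strategies:

```typescript
// Memoise an async tool handler by its serialised arguments.
function cached<A extends unknown[], R>(
  fn: (...args: A) => Promise<R>,
  ttlMs = 5 * 60_000,
) {
  const store = new Map<string, { value: R; expires: number }>();
  return async (...args: A): Promise<R> => {
    const key = JSON.stringify(args);
    const hit = store.get(key);
    if (hit && hit.expires > Date.now()) return hit.value;
    const value = await fn(...args);
    store.set(key, { value, expires: Date.now() + ttlMs });
    return value;
  };
}

// e.g. const getSchema = cached(fetchSchema); // fetchSchema is your real handler
```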
SLO recommendations
Reasonable per-tool latency targets for production agents:
| Tool class | p50 | p95 |
|---|---|---|
| In-memory read | < 5 ms | < 20 ms |
| Local DB query | < 50 ms | < 200 ms |
| Same-region API | < 150 ms | < 600 ms |
| Cross-region API | < 400 ms | < 1500 ms |
| Browser automation | < 1500 ms | < 5000 ms |
If your tool is slower, that is an observability target before it is a code change.
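If you already aggregate per-tool p95s (from OTel or the minimal profiler above), a sketch of turning the table into a check; the class names and the tool-to-class mapping are illustrative:

```typescript
const p95TargetMs: Record<string, number> = {
  'in-memory-read': 20,
  'local-db-query': 200,
  'same-region-api': 600,
  'cross-region-api': 1500,
  'browser-automation': 5000,
};

// toolClass maps each tool name to one of the classes above.
function checkSlos(p95ByTool: Record<string, number>, toolClass: Record<string, string>) {
  for (const [tool, p95] of Object.entries(p95ByTool)) {
    const target = p95TargetMs[toolClass[tool]];
    if (target !== undefined && p95 > target) {
      console.warn(`${tool}: p95 ${Math.round(p95)} ms exceeds the ${target} ms target`);
    }
  }
}
```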
Where this is heading
Two shifts to watch: per-tool latency budgets enforced at the host (Claude Desktop kills slow tools), and standardised MCP performance attributes in the protocol so hosts can pick fast servers automatically. Profile now and you will be ready.