Tutorial · 7 min read

MCP call caching strategies: cut latency and cost without breaking your agent

Where to cache MCP tool calls, how to design cache keys, when to invalidate, and what never to cache. With code examples and an 80/20 rule for the highest-leverage tools.

Every MCP tool call costs you milliseconds and tokens. Cache the repeated ones and you can sharply cut both latency and the bill. Here is how to add a cache layer without breaking your agent’s correctness.

Why MCP calls are slow and expensive

Three costs add up:

  • Round-trip latency — every tool call is a JSON-RPC request, often 50–500 ms even for trivial reads.
  • Tool result tokens — large responses (a 10k-row SQL dump) eat context and bill at output rate.
  • Re-prompt cost — each tool result triggers another LLM round trip to interpret it.

Caching attacks all three at once. The trick is doing it without serving stale data when the underlying state has changed.

Three places you can cache

1. At the MCP server

The server caches results inside its own tool handlers. This is the cleanest layer, because the server knows the semantics of its data: it can invalidate precisely when the underlying state changes.

// inside an MCP server tool handler
const cache = new Map(); // unbounded: add eviction for long-lived servers
async function handle(name, args) {
  const key = name + ':' + JSON.stringify(args);
  if (cache.has(key)) return cache.get(key);
  const result = await realCall(name, args); // the real tool implementation
  cache.set(key, result);
  return result;
}

For read-heavy tools (schema introspection, documentation lookups) this alone can cut latency to single-digit milliseconds.

2. At a proxy / gateway

If you do not own the server, drop a cache between the host and the server. A small Node process speaks MCP over stdio, forwards requests to the upstream server, and caches responses by argument hash. Useful for shared team setups — see our team config guide.

3. At the LLM layer (prompt cache)

Anthropic’s prompt caching bills cached input tokens at roughly 10% of the normal input rate. Tool results that stay byte-for-byte identical across turns can sit in the cached conversation prefix, so keeping results stable makes much of their cost drop away on repeat turns.
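One practical consequence: serialise tool results deterministically, so repeating a call renders the same bytes. A sketch, where stableResult and the id sort field are illustrative assumptions:

```javascript
// Deterministic serialisation: sort rows by a stable field so the same
// logical result always renders identically, which keeps a prompt-cache
// prefix valid across turns.
function stableResult(rows) {
  const sorted = [...rows].sort((a, b) => a.id - b.id);
  return JSON.stringify(sorted);
}
```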

Cache key design

The hardest part is choosing what counts as "the same call". A naive JSON.stringify(args) is brittle (key order matters), and ignores context (the same query at 09:00 and 23:00 might want different answers).

Two patterns work well:

  1. Canonicalise arguments — sort object keys, strip whitespace, normalise casing.
  2. Tag with a TTL bucket — append Math.floor(Date.now() / 60_000) to the key so cache entries expire by the minute.

// Recursively sort keys so logically equal args produce the same string.
// (A replacer array in JSON.stringify only reorders top-level keys and
// silently drops nested keys, so canonicalise explicitly.)
const canon = v => Array.isArray(v) ? v.map(canon)
  : v && typeof v === 'object'
    ? Object.fromEntries(Object.keys(v).sort().map(k => [k, canon(v[k])]))
    : v;
function cacheKey(name, args, ttlMs) {
  const bucket = Math.floor(Date.now() / ttlMs);
  return name + ':' + bucket + ':' + JSON.stringify(canon(args));
}

Invalidation strategies

Time-based

Simplest, works well for slow-changing data (docs, schema, reference content). Pick a TTL by data freshness tolerance: 60 s for hot config, 1 hour for docs, 1 day for schemas.
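The TTL idea fits in a few lines. A minimal in-memory sketch, with makeTtlCache as an illustrative helper name:

```javascript
// Minimal in-memory TTL cache: an entry is served only while it is
// younger than ttlMs; expired entries are deleted lazily on access.
function makeTtlCache(ttlMs) {
  const entries = new Map(); // key -> { value, expiresAt }
  return {
    get(key) {
      const entry = entries.get(key);
      if (!entry) return undefined;
      if (Date.now() >= entry.expiresAt) { entries.delete(key); return undefined; }
      return entry.value;
    },
    set(key, value) {
      entries.set(key, { value, expiresAt: Date.now() + ttlMs });
    },
  };
}
```

Lazy deletion keeps the code tiny; a long-lived server would also want a periodic sweep so dead entries do not accumulate.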

Mutation-keyed

If your tools include both reads and writes, every write should bump a version key for the affected scope. Reads include the version in their cache key, so writes implicitly invalidate.

let dbVersion = 0;
const store = new Map();
// Tiny memoiser: compute once per distinct key.
async function cached(key, fn) {
  if (!store.has(key)) store.set(key, await fn());
  return store.get(key);
}
async function writeTool() { dbVersion++; /* ...perform the write... */ }
async function readTool(args) {
  // Bumping dbVersion makes every old key unreachable: implicit invalidation.
  const key = 'db:' + dbVersion + ':' + JSON.stringify(args);
  return cached(key, () => realRead(args));
}

Stale-while-revalidate

Return the cached value immediately; refresh in the background. Best for tools the agent calls many times in a row with the same arguments (common pattern for exploratory queries).
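A minimal sketch of that pattern, assuming a fetcher function that performs the real call (makeSwr is an illustrative name):

```javascript
// Stale-while-revalidate: serve the cached value immediately and
// refresh it in the background for the next caller.
function makeSwr(fetcher) {
  const cache = new Map();
  return async function get(key) {
    if (cache.has(key)) {
      // Kick off a background refresh; swallow failures, keep stale value.
      fetcher(key).then(v => cache.set(key, v)).catch(() => {});
      return cache.get(key);
    }
    const value = await fetcher(key); // first call has to wait
    cache.set(key, value);
    return value;
  };
}
```

Only the very first call pays the full round trip; every subsequent call with the same key returns instantly with a value at most one refresh behind.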

What not to cache

  • Tools whose responses include timestamps the agent will reason about.
  • Tools that return paginated results (cache per page only).
  • Tools that perform side effects (writes, sends, deletes) — the whole point is they should run.
  • Anything user-specific without a user key in the cache.
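One defensive pattern that covers the last two bullets is an explicit allowlist plus a user-scoped key. A sketch with illustrative tool names:

```javascript
// Hypothetical allowlist: only known read-only tools are cacheable, and
// the user id is baked into the key so results never leak across users.
const CACHEABLE = new Set(['read_file', 'get_schema', 'fetch_docs']);
function cacheKeyFor(name, args, userId) {
  if (!CACHEABLE.has(name)) return null; // null means: never cache
  return userId + ':' + name + ':' + JSON.stringify(args);
}
```

An allowlist fails safe: a new side-effecting tool added to the server is uncached by default until someone deliberately opts it in.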

Measuring impact

Wire up basic metrics around your cache layer: hit rate, latency p50/p95, bytes saved. A cheap setup is logging hit/miss to a local SQLite, then a tiny sqlite3 dashboard. After a week of agent use you will see exactly which tools repay the engineering effort.
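A hit-rate counter can wrap any Map-like cache before you invest in the SQLite setup. A minimal sketch:

```javascript
// Wrap a Map-like cache with hit/miss counters. Caveat: a stored value
// of literally `undefined` would be miscounted as a miss.
function instrument(cache) {
  const stats = { hits: 0, misses: 0 };
  return {
    stats,
    get(key) {
      const value = cache.get(key);
      if (value === undefined) stats.misses++; else stats.hits++;
      return value;
    },
    set(key, value) { cache.set(key, value); },
  };
}
```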

The 80/20 rule

In our own setup, 80% of cache wins came from three tools: filesystem reads, Postgres schema introspection, and HTTP fetches of public documentation. If you only cache those three, you get most of the latency benefit at a quarter of the engineering cost.
