MCP data exfiltration protection: stop the agent

Sandboxing isolates the server. It does not stop a compromised one from sending what it already legitimately reads. Data exfiltration through MCP is the hard problem nobody wants to talk about. Here is the layered DLP playbook that actually works for agentic flows.

How exfiltration through MCP looks

Three vectors we see in the wild:

Direct outbound — server reads PII from a database, then POSTs it to an attacker URL.
Allowed channel — server is supposed to call its own API; attacker uses that channel to smuggle other data out.
Slow drip — server emits PII into legitimate-looking telemetry, kept under threshold.

Classical DLP catches the first easily and the rest poorly.

Five layers that combine

1. Outbound allowlist per server

Every MCP server declares what hosts it talks to. The runtime blocks everything else.

server: github-mcp
egress:
  - api.github.com
  - uploads.github.com

A compromised server cannot exfiltrate to evil.example.com if the runtime never resolves the hostname and refuses connections to anything off the list, raw IP addresses included.
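As a rough illustration, the runtime-side check could look like the sketch below. It assumes a Python egress hook; `EGRESS_ALLOWLIST` and `is_egress_allowed` are hypothetical names, not part of any MCP runtime.

```python
from urllib.parse import urlsplit

# Hypothetical in-memory form of the per-server egress declaration above.
EGRESS_ALLOWLIST = {
    "github-mcp": {"api.github.com", "uploads.github.com"},
}


def is_egress_allowed(server: str, url: str) -> bool:
    """Return True only if the URL's host is declared for this server."""
    host = urlsplit(url).hostname  # lowercased by urlsplit; None if absent
    if host is None:
        return False
    host = host.rstrip(".")  # treat "api.github.com." like "api.github.com"
    # Exact match only: no suffix or wildcard matching, so a host such as
    # "api.github.com.evil.example.com" is rejected. Unknown servers get
    # an empty allowlist, which means everything is blocked.
    return host in EGRESS_ALLOWLIST.get(server, set())
```

Exact matching is a deliberate choice in this sketch: suffix or wildcard rules are easier to write but also easier to abuse.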

2. Payload size caps

Most legitimate tool calls are small. Cap responses at, say, 256 KB. Anything larger triggers an audit event and, optionally, a hard block. Tools that genuinely need large payloads opt in explicitly.
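A minimal sketch of the cap, assuming payloads arrive as bytes. The opt-in table, the tool name in it, and the `CapDecision` type are all hypothetical; only the 256 KB default comes from the text above.

```python
from dataclasses import dataclass

DEFAULT_CAP_BYTES = 256 * 1024  # the 256 KB default discussed above

# Hypothetical opt-in table for tools that legitimately move large payloads.
LARGE_PAYLOAD_CAPS = {
    "github-mcp/upload_release_asset": 50 * 1024 * 1024,
}


@dataclass
class CapDecision:
    allowed: bool  # may the payload go through?
    audit: bool    # should an audit event be emitted?
    limit: int     # the cap that was applied, in bytes


def check_payload(tool: str, payload: bytes, hard_block: bool = True) -> CapDecision:
    """Apply the per-tool cap; oversize payloads always produce an audit event."""
    limit = LARGE_PAYLOAD_CAPS.get(tool, DEFAULT_CAP_BYTES)
    over = len(payload) > limit
    # With hard_block=False the payload still passes, but is audited.
    return CapDecision(allowed=not (over and hard_block), audit=over, limit=limit)
```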

3. Content classifiers on outbound

A small classifier sits between the agent and external destinations. It flags responses containing PII patterns, credentials, or large structured data. Regex handles roughly 80% of cases; an LLM-based classifier picks up much of the remainder, though neither is exhaustive.
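The regex tier might look something like this. The pattern set is illustrative and far from complete, and `PATTERNS` and `classify` are hypothetical names; the LLM tier is out of scope for the sketch.

```python
import re

# Illustrative patterns only; a production set would be larger and tuned.
PATTERNS = {
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key": re.compile(r"-----BEGIN (?:[A-Z]+ )?PRIVATE KEY-----"),
}


def classify(text: str) -> dict[str, int]:
    """Count matches per pattern; an empty dict means nothing was flagged."""
    hits = {name: len(rx.findall(text)) for name, rx in PATTERNS.items()}
    return {name: count for name, count in hits.items() if count}
```

Returning counts rather than a boolean lets the caller treat one email address differently from a thousand.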

4. Per-scope egress at the gateway

The MCP gateway enforces egress policies per scope, not per server. A read:pii scope automatically restricts outbound destinations more than read:public.
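One way to express this is as an intersection: the effective allowlist is whatever the server declared, narrowed by every scope in play. The scope table and the `api.internal.example.com` host below are hypothetical.

```python
# Hypothetical scope policies. None means "no restriction beyond the
# server's own allowlist"; a set narrows egress to those hosts.
SCOPE_EGRESS = {
    "read:public": None,
    "read:pii": {"api.internal.example.com"},
}


def effective_egress(server_hosts: set[str], scopes: list[str]) -> set[str]:
    """Intersect the server's allowlist with each active scope's restriction."""
    allowed = set(server_hosts)
    for scope in scopes:
        if scope not in SCOPE_EGRESS:
            return set()  # unknown scope: fail closed
        limit = SCOPE_EGRESS[scope]
        if limit is not None:
            allowed &= limit
    return allowed
```

Because the result is an intersection, adding a scope can only shrink the set of reachable hosts, never grow it.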

5. Anomaly detection on tool-call patterns

Baseline each server's outbound volume and destinations. Alert on deviations: a sudden spike, a previously unseen destination, a new port. These alerts feed the error rate dashboards.
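A toy baseline, assuming traffic is summarised per time window as bytes sent plus a list of (host, port) pairs. The class name, the z-score threshold, and the minimum sample count are illustrative choices, not recommendations.

```python
from statistics import mean, pstdev


class EgressBaseline:
    """Per-server baseline of outbound volume and (host, port) destinations."""

    def __init__(self, min_samples: int = 5, z_threshold: float = 3.0):
        self.volumes: list[int] = []  # bytes sent per observed window
        self.destinations: set[tuple[str, int]] = set()
        self.min_samples = min_samples
        self.z_threshold = z_threshold

    def observe(self, bytes_out: int, dests: list[tuple[str, int]]) -> None:
        """Fold a known-good window into the baseline."""
        self.volumes.append(bytes_out)
        self.destinations.update(dests)

    def check(self, bytes_out: int, dests: list[tuple[str, int]]) -> list[str]:
        """Return alerts for a new window without modifying the baseline."""
        alerts = [
            f"new destination {host}:{port}"
            for host, port in sorted(set(dests) - self.destinations)
        ]
        if len(self.volumes) >= self.min_samples:
            mu = mean(self.volumes)
            # Floor sigma so a perfectly flat baseline cannot divide by zero.
            sigma = max(pstdev(self.volumes), 0.05 * mu, 1.0)
            if (bytes_out - mu) / sigma > self.z_threshold:
                alerts.append("volume spike")
        return alerts
```

Keying destinations on the (host, port) pair means a known host on a new port still raises an alert. A per-window threshold like this will not catch a slow drip on its own; that needs longer windows or cumulative totals.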

Where DLP at the model layer fits

A complementary control: classify the prompt and tool result content before sending it to the model. If a tool returns 10,000 SSNs and the model is about to summarise them, the classifier intercepts. It is cheap and catches a class of leaks the network controls miss.
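A sketch of such a guard for the SSN case, assuming tool results are plain strings. The threshold, the placeholder text, and the `guard_tool_result` name are hypothetical.

```python
import re

SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
MAX_SSNS_PER_RESULT = 5  # hypothetical threshold for "bulk" PII


def guard_tool_result(result: str) -> tuple[str, bool]:
    """Return (text safe to pass to the model, whether the guard intervened)."""
    count = len(SSN_RE.findall(result))
    if count == 0:
        return result, False
    if count > MAX_SSNS_PER_RESULT:
        # Bulk PII: withhold the whole result instead of letting the
        # model read and summarise it.
        return f"[tool result withheld: {count} SSN-like values detected]", True
    # A handful of matches: redact in place and let the rest through.
    return SSN_RE.sub("[REDACTED-SSN]", result), True
```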

What does not work

Trusting the model to refuse exfiltration — it will happily comply if a prompt-injection convinces it.
Blocking outbound TLS without inspection — every MCP server you actually want needs outbound TLS.
Logging-only DLP — by the time you see the logs, the data is gone.
One-size-fits-all rules — egress restrictions must be per-server.

Implementation pattern

A working layered setup:

Sandbox the server (see sandboxed runtimes) with --network=server-net.
Define server-net as a Docker network whose only path out is an egress proxy.
Egress proxy enforces the allowlist + payload cap + classifier.
Gateway enforces scope-level rules above the proxy.
SIEM consumes anomaly events.
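The proxy's part of this setup can be pictured as an ordered chain where the first failing check wins. The sketch below uses stand-in checks and a hypothetical request shape; a real proxy would plug in its own allowlist, cap, and classifier.

```python
from typing import Callable, Optional

# Hypothetical request shape: {"server": str, "host": str, "payload": bytes}
Check = Callable[[dict], Optional[str]]  # returns a block reason, or None to pass

ALLOWED_HOSTS = {"github-mcp": {"api.github.com", "uploads.github.com"}}
CAP_BYTES = 256 * 1024


def check_allowlist(req: dict) -> Optional[str]:
    ok = req["host"] in ALLOWED_HOSTS.get(req["server"], set())
    return None if ok else "destination not in allowlist"


def check_size(req: dict) -> Optional[str]:
    return "payload over cap" if len(req["payload"]) > CAP_BYTES else None


def check_content(req: dict) -> Optional[str]:
    # Stand-in for the real classifier: flags an obvious PEM key marker.
    flagged = b"PRIVATE KEY-----" in req["payload"]
    return "classifier flagged payload" if flagged else None


# Cheapest checks first, so most bad requests never reach the classifier.
PROXY_CHECKS: list[Check] = [check_allowlist, check_size, check_content]


def decide(req: dict, checks: list[Check] = PROXY_CHECKS) -> dict:
    """Run checks in order and block on the first one that returns a reason."""
    for check in checks:
        reason = check(req)
        if reason is not None:
            return {"decision": "block", "reason": reason}
    return {"decision": "allow", "reason": None}
```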

Three weeks of engineering on top of basic MCP usage. Pays for itself the first time it catches a real attempt.

Audit story

Every blocked outbound becomes an audit event:

{
  "ts": "2026-04-25T09:14:22Z",
  "server": "github-mcp",
  "destination": "evil.example.com",
  "decision": "block",
  "reason": "destination not in allowlist",
  "payload_size": 4582
}

These events feed the audit trail. After a month, the noise level tells you which servers need their allowlist expanded vs. which are doing something they should not.
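For the monthly review, something along these lines would produce events in the shape shown above and tally blocked destinations per server. It assumes events are stored one JSON object per line; the function names are hypothetical.

```python
import json
from collections import Counter
from datetime import datetime, timezone
from typing import Iterable, Optional


def audit_event(server: str, destination: str, reason: str,
                payload_size: int, now: Optional[datetime] = None) -> str:
    """Serialise one blocked-egress event using the field names shown above."""
    now = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    return json.dumps({
        "ts": now.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "server": server,
        "destination": destination,
        "decision": "block",
        "reason": reason,
        "payload_size": payload_size,
    })


def blocked_destinations(lines: Iterable[str]) -> Counter:
    """Count blocked (server, destination) pairs across JSON-lines records."""
    counts: Counter = Counter()
    for line in lines:
        event = json.loads(line)
        if event.get("decision") == "block":
            counts[(event["server"], event["destination"])] += 1
    return counts
```

Sorting the result with `most_common()` gives a quick review list: a frequently blocked destination that belongs to the vendor suggests an allowlist gap, while one that does not deserves a closer look.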

Common mistakes

Allowlist by IP, not hostname — IPs change; hostnames carry intent.
No payload cap — exfiltration of large blobs slips past hostname allowlists.
DLP only on the inbound side — agents leak through outbound; instrument both.
No baseline — anomaly detection without a baseline is just noisy alerts.

Where this is heading

Two shifts to watch by 2027: native egress declarations in the MCP spec, and managed DLP services tuned for agent traffic patterns. Build the layers now, swap implementations later.