Centralised inference is great until you measure first-token latency from Sydney to a US data centre. Edge inference closes that gap, opens new ones, and meaningfully reshapes the architecture. Here are five deployment patterns that work in practice, and what each is good for.
Why edge for agents
Three drivers:
- Latency — speech-driven agents, real-time interactions, anything sub-second.
- Privacy — data stays in the user's region (or device).
- Cost — for some workloads, edge is cheaper at scale.
Not every workload benefits. Heavy reasoning still wants centralised frontier models.
The five patterns
1. POP-cached
A lightweight model runs at every CDN POP: Cloudflare Workers AI, Fastly's equivalent, and similar platforms. Good for classification and light generation; a minimal sketch follows this list.
- Strengths: lowest latency outside on-device.
- Weaknesses: small model class only.
- Pick when: classifier, router, light text utility.
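Here is a minimal sketch of the pattern on Cloudflare Workers AI. The model id, the binding's response shape, and the label set are assumptions for illustration; check your account's model catalogue before copying.

```ts
// Hypothetical POP-cached classifier on Cloudflare Workers AI.
// Model id and response shape are assumptions; verify against your catalogue.

interface Env {
  // Workers AI binding: run(model, inputs) returns a model-specific payload.
  AI: { run(model: string, inputs: unknown): Promise<{ response?: string }> };
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const { text } = (await request.json()) as { text: string };

    // A small instruct model at the POP is enough for routing labels.
    const result = await env.AI.run("@cf/meta/llama-3.1-8b-instruct", {
      messages: [
        {
          role: "system",
          content:
            "Classify the user message as one of: billing, support, sales. Reply with the label only.",
        },
        { role: "user", content: text },
      ],
    });

    return Response.json({ label: result.response?.trim().toLowerCase() });
  },
};
```

Sub-50 ms first tokens come from the weights already sitting at the POP; there is no central round-trip anywhere in this handler.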
2. Regional inference
Larger models run in a handful of major regions: closer to users than a central deployment, with more compute than a POP. A region-selection sketch follows this list.
- Strengths: balanced quality and latency.
- Weaknesses: still milliseconds of network.
- Pick when: mid-tier reasoning, voice agents, regional residency.
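A hedged sketch of the client side: pick the nearest regional endpoint from a CDN-provided geo hint. The endpoint URLs, the `/v1/complete` path, and the response shape are all placeholders, not a real provider API.

```ts
// Hypothetical regional endpoints keyed by continent code.
const REGIONAL_ENDPOINTS: Record<string, string> = {
  NA: "https://inference.us-east.example.com",
  EU: "https://inference.eu-west.example.com",
  OC: "https://inference.au-syd.example.com",
};

// Fall back to NA when the geo hint is missing or unmapped.
function pickRegion(continent: string | null): string {
  return REGIONAL_ENDPOINTS[continent ?? ""] ?? REGIONAL_ENDPOINTS.NA;
}

async function regionalComplete(prompt: string, continent: string | null): Promise<string> {
  const res = await fetch(`${pickRegion(continent)}/v1/complete`, {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify({ prompt }),
  });
  const { completion } = (await res.json()) as { completion: string };
  return completion;
}
```

A Sydney user hits `au-syd` instead of `us-east`, trimming exactly the trans-Pacific hop that motivated this post.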
3. On-device
The model runs on the user's device. See on-device inference and browser-native agents; a browser sketch follows this list.
- Strengths: no network latency, full privacy.
- Weaknesses: model size constrained; battery cost.
- Pick when: privacy-critical, latency-critical.
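For the browser case, a sketch with transformers.js (the `@xenova/transformers` package): weights download once, then inference runs locally, so expect a cold start on first call. The model id is one public example, not a recommendation.

```ts
// On-device classification in the browser via transformers.js.
// Weights are fetched on first use and cached; inference then runs locally.

import { pipeline } from "@xenova/transformers";

const classify = await pipeline(
  "text-classification",
  "Xenova/distilbert-base-uncased-finetuned-sst-2-english",
);

// The text never leaves the device.
const [result] = await classify("Cancel my subscription immediately.");
console.log(result); // e.g. { label: "NEGATIVE", score: 0.99 }
```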
4. Hybrid
The easy ~70% of requests run at the edge or on-device; the hard ~30% route to a central frontier model. See model routing; a routing sketch follows this list.
- Strengths: near-edge latency for most requests, frontier quality when it matters.
- Weaknesses: the routing classifier adds complexity and a new failure mode.
- Pick when: product spans easy and hard tasks.
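A sketch of the router, with a deliberately crude heuristic (length plus keyword triggers); real systems often use a small edge-hosted classifier instead. `edgeComplete` and `centralComplete` are hypothetical stand-ins for the clients sketched earlier.

```ts
type Tier = "edge" | "central";

// Crude "hard task" signals; a POP-cached classifier is the usual upgrade.
const HARD_SIGNALS = /\b(prove|plan|refactor|analyse|legal)\b/i;

function routeTier(prompt: string): Tier {
  return prompt.length > 2_000 || HARD_SIGNALS.test(prompt) ? "central" : "edge";
}

async function complete(prompt: string): Promise<string> {
  return routeTier(prompt) === "edge"
    ? edgeComplete(prompt) // small model at the POP or on-device
    : centralComplete(prompt); // frontier model in a central region
}

// Hypothetical clients; see the POP and regional sketches above.
declare function edgeComplete(prompt: string): Promise<string>;
declare function centralComplete(prompt: string): Promise<string>;
```

Instrument the router from day one: the easy/hard split is the number everything else in this pattern depends on.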
5. Edge gateway
The MCP gateway runs at the edge; tool calls are evaluated and executed at the edge; only the model call goes central. A sketch follows this list.
- Strengths: policy and audit at the edge; lower latency for every hop except the model call.
- Weaknesses: model latency unchanged.
- Pick when: tool-call-heavy agents.
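A sketch of the split: policy and audit run in the edge runtime, tools execute at the edge, and only the model call crosses to a central region. The policy table and helper functions are illustrative, not part of the MCP spec.

```ts
// Hypothetical per-tool policy, enforced before any tool executes.
const TOOL_POLICY: Record<string, { allowed: boolean; audit: boolean }> = {
  search_docs: { allowed: true, audit: false },
  send_email: { allowed: true, audit: true },
  delete_record: { allowed: false, audit: true },
};

async function handleToolCall(name: string, args: unknown): Promise<unknown> {
  const policy = TOOL_POLICY[name];
  if (!policy?.allowed) throw new Error(`tool ${name} denied at edge`);
  if (policy.audit) console.log(JSON.stringify({ event: "tool_call", name, args }));
  return executeToolAtEdge(name, args); // no central hop for tools
}

async function handleModelCall(prompt: string): Promise<string> {
  return centralComplete(prompt); // the only hop that pays central latency
}

// Hypothetical helpers bound to the edge runtime and the central endpoint.
declare function executeToolAtEdge(name: string, args: unknown): Promise<unknown>;
declare function centralComplete(prompt: string): Promise<string>;
```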
Comparison
| Pattern | First-token latency (typical) | Model size | Privacy boundary | Cost at scale |
|---|---|---|---|---|
| POP-cached | < 50 ms | < 7B params | edge tenant | Low |
| Regional | 80–200 ms | up to ~70B | region-pinned | Medium |
| On-device | 0–50 ms | 1–8B | user-local | No per-token cost (user hardware) |
| Hybrid | varies by route | varies | varies | Medium |
| Edge gateway | same as central (model call unchanged) | central-class | policy enforced per region | Medium |
Architectural implications
Three things change at the edge:
State management
Edge runtimes are usually stateless or per-POP. Memory layers must live somewhere persistent, typically the regional tier; a sketch follows.
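A sketch of the shape, assuming a hypothetical `MemoryStore` interface; KV stores, Redis, or Cloudflare Durable Objects all fit behind it.

```ts
// The edge handler holds no local state; memory lives in a regional store.
interface MemoryStore {
  get(sessionId: string): Promise<string[]>; // prior turns
  append(sessionId: string, turn: string): Promise<void>;
}

async function handleTurn(
  store: MemoryStore,
  sessionId: string,
  userTurn: string,
): Promise<string> {
  const history = await store.get(sessionId); // one regional round-trip
  const reply = await edgeComplete([...history, userTurn].join("\n"));
  await store.append(sessionId, userTurn);
  await store.append(sessionId, reply);
  return reply;
}

// Hypothetical edge inference client from the earlier sketches.
declare function edgeComplete(prompt: string): Promise<string>;
```

Note the regional read adds back some of the latency the edge bought you; batching reads and asynchronous write-back are the usual mitigations.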
Tool call routing
A tool call from an edge agent can cross network boundaries several times, so latency budgets compound; the sketch below makes the drawdown explicit.
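One way to do this is a per-request budget object threaded through the handler; the class and numbers below are illustrative.

```ts
// Shared per-request latency budget: every awaited hop draws it down.
class LatencyBudget {
  constructor(private remainingMs: number) {}

  async spend<T>(label: string, op: () => Promise<T>): Promise<T> {
    const start = Date.now();
    const result = await op();
    this.remainingMs -= Date.now() - start;
    console.log(`${label}: ${this.remainingMs} ms left`);
    return result;
  }

  get low(): boolean {
    return this.remainingMs < 100; // illustrative threshold
  }
}

// Usage: skip optional tools once the budget runs low.
// const budget = new LatencyBudget(800);
// const docs = await budget.spend("tool:search", () => searchDocs(query));
// if (!budget.low) await budget.spend("tool:enrich", () => enrich(docs));
```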
Observability
Trace ingest from dozens of POPs is harder than from one region. Use observability platforms built for high-cardinality trace data.
When edge does NOT make sense
Three counter-indications:
- Heavy reasoning agents — frontier models do not run at edge today.
- Cross-region orchestration — multi-region planners are slower at the edge than central.
- Compliance pinning — if data must stay in one specific data centre, distributing inference multiplies the compliance surface.
Cost reality
Edge per-token cost is comparable to central for the small-model class, but the real savings come from the architecture:
- No round-trip data transfer cost for cached content.
- Lower user abandonment because responses arrive faster.
- Cheaper authentication because identity sits at the edge.
A net cost reduction of roughly 30% is typical for the suitable workloads; a sketch for checking your own numbers follows.
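A purely illustrative cost model; none of the parameter names map to a real pricing sheet, and no default values are claimed.

```ts
// Plug in your own per-token prices and traffic split.
function netMonthlyCost(opts: {
  tokens: number; // monthly token volume
  edgeShare: number; // fraction served at the edge, 0..1
  edgePerToken: number; // $ per token at the edge
  centralPerToken: number; // $ per token centrally
  transferSavings: number; // $ saved on data transfer and egress
}): number {
  const { tokens, edgeShare, edgePerToken, centralPerToken, transferSavings } = opts;
  return (
    tokens * edgeShare * edgePerToken +
    tokens * (1 - edgeShare) * centralPerToken -
    transferSavings
  );
}
```

Compare the result against an all-central baseline (`edgeShare: 0, transferSavings: 0`) to see whether the ~30% holds for your traffic.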
A pragmatic rollout
Three phases for most teams:
- Wrap your existing classifier in POP-cached inference. First taste of edge wins.
- Move the gateway to the edge. Policy and audit get a latency win.
- Add hybrid routing. Easy/hard split; central only when needed.
After 90 days you have the data to decide whether the regional and on-device tiers are worth the engineering.
Common mistakes
- Putting heavy reasoning at edge — cost and quality both worse.
- No persistent state plan — edge is stateless, agents need state.
- Ignoring observability — losing trace visibility because of distributed inference.
- One pattern across the product — different parts want different patterns.
Where this is heading
Three trends by 2027: frontier-model-class inference at major regional edges, MCP-native edge primitives in the spec, and standardised edge-vs-central routing helpers in the Claude Agent SDK.