Centralised inference is great until you measure first-token latency from Sydney to a US data centre. Edge inference closes that gap, opens new ones, and meaningfully reshapes the architecture. Here are five deployment patterns that work in practice, and what each is good for.
Why edge for agents
Three drivers:
- Latency — speech-driven agents, real-time interactions, anything sub-second.
- Privacy — data stays in the user's region (or device).
- Cost — for some workloads, edge is cheaper at scale.
Not every workload benefits. Heavy reasoning still wants centralised frontier models.
The five patterns
1. POP-cached
A lightweight model runs at every CDN POP: Cloudflare Workers AI, Fastly's equivalent, and similar platforms. Good for classification and light generation; a minimal sketch follows this list.
- Strengths: lowest latency outside on-device.
- Weaknesses: small model class only.
- Pick when: classifier, router, light text utility.
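Here is a minimal sketch of the pattern on Cloudflare Workers AI. The model id, the binding's response shape, and the label set are assumptions for illustration; check your account's model catalogue before copying.

```ts
// Hypothetical POP-cached classifier on Cloudflare Workers AI.
// Model id and response shape are assumptions; verify against your catalogue.

interface Env {
  // Workers AI binding: run(model, inputs) returns a model-specific payload.
  AI: { run(model: string, inputs: unknown): Promise<{ response?: string }> };
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const { text } = (await request.json()) as { text: string };

    // A small instruct model at the POP is enough for routing labels.
    const result = await env.AI.run("@cf/meta/llama-3.1-8b-instruct", {
      messages: [
        {
          role: "system",
          content:
            "Classify the user message as one of: billing, support, sales. Reply with the label only.",
        },
        { role: "user", content: text },
      ],
    });

    return Response.json({ label: result.response?.trim().toLowerCase() });
  },
};
```

Sub-50 ms first tokens come from the weights already sitting at the POP; there is no central round-trip anywhere in this handler.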
2. Regional inference
Larger models run in a handful of major regions: closer to users than a central deployment, with more compute than a POP. A region-selection sketch follows this list.
- Strengths: balanced quality and latency.
- Weaknesses: still milliseconds of network.
- Pick when: mid-tier reasoning, voice agents, regional residency.
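A hedged sketch of the client side: pick the nearest regional endpoint from a CDN-provided geo hint. The endpoint URLs, the `/v1/complete` path, and the response shape are all placeholders, not a real provider API.

```ts
// Hypothetical regional endpoints keyed by continent code.
const REGIONAL_ENDPOINTS: Record<string, string> = {
  NA: "https://inference.us-east.example.com",
  EU: "https://inference.eu-west.example.com",
  OC: "https://inference.au-syd.example.com",
};

// Fall back to NA when the geo hint is missing or unmapped.
function pickRegion(continent: string | null): string {
  return REGIONAL_ENDPOINTS[continent ?? ""] ?? REGIONAL_ENDPOINTS.NA;
}

async function regionalComplete(prompt: string, continent: string | null): Promise<string> {
  const res = await fetch(`${pickRegion(continent)}/v1/complete`, {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify({ prompt }),
  });
  const { completion } = (await res.json()) as { completion: string };
  return completion;
}
```

A Sydney user hits `au-syd` instead of `us-east`, trimming exactly the trans-Pacific hop that motivated this post.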
3. On-device
The model runs on the user's device. See on-device inference and browser-native agents; a browser sketch follows this list.
- Strengths: no network latency, full privacy.
- Weaknesses: model size constrained; battery cost.
- Pick when: privacy-critical, latency-critical.
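For the browser case, a sketch with transformers.js (the `@xenova/transformers` package): weights download once, then inference runs locally, so expect a cold start on first call. The model id is one public example, not a recommendation.

```ts
// On-device classification in the browser via transformers.js.
// Weights are fetched on first use and cached; inference then runs locally.

import { pipeline } from "@xenova/transformers";

const classify = await pipeline(
  "text-classification",
  "Xenova/distilbert-base-uncased-finetuned-sst-2-english",
);

// The text never leaves the device.
const [result] = await classify("Cancel my subscription immediately.");
console.log(result); // e.g. { label: "NEGATIVE", score: 0.99 }
```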
4. Hybrid
The easy ~70% of requests run at the edge or on-device; the hard ~30% route to a central frontier model. See model routing; a routing sketch follows this list.
- Strengths: near-edge latency for most requests, frontier quality when it matters.
- Weaknesses: the routing classifier adds complexity and a new failure mode.
- Pick when: product spans easy and hard tasks.
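A sketch of the router, with a deliberately crude heuristic (length plus keyword triggers); real systems often use a small edge-hosted classifier instead. `edgeComplete` and `centralComplete` are hypothetical stand-ins for the clients sketched earlier.

```ts
type Tier = "edge" | "central";

// Crude "hard task" signals; a POP-cached classifier is the usual upgrade.
const HARD_SIGNALS = /\b(prove|plan|refactor|analyse|legal)\b/i;

function routeTier(prompt: string): Tier {
  return prompt.length > 2_000 || HARD_SIGNALS.test(prompt) ? "central" : "edge";
}

async function complete(prompt: string): Promise<string> {
  return routeTier(prompt) === "edge"
    ? edgeComplete(prompt) // small model at the POP or on-device
    : centralComplete(prompt); // frontier model in a central region
}

// Hypothetical clients; see the POP and regional sketches above.
declare function edgeComplete(prompt: string): Promise<string>;
declare function centralComplete(prompt: string): Promise<string>;
```

Instrument the router from day one: the easy/hard split is the number everything else in this pattern depends on.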
5. Edge gateway
The MCP gateway runs at the edge; tool calls are evaluated and executed at the edge; only the model call goes central. A sketch follows this list.
- Strengths: policy and audit at the edge; lower latency for every hop except the model call.
- Weaknesses: model latency unchanged.
- Pick when: tool-call-heavy agents.
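A sketch of the split: policy and audit run in the edge runtime, tools execute at the edge, and only the model call crosses to a central region. The policy table and helper functions are illustrative, not part of the MCP spec.

```ts
// Hypothetical per-tool policy, enforced before any tool executes.
const TOOL_POLICY: Record<string, { allowed: boolean; audit: boolean }> = {
  search_docs: { allowed: true, audit: false },
  send_email: { allowed: true, audit: true },
  delete_record: { allowed: false, audit: true },
};

async function handleToolCall(name: string, args: unknown): Promise<unknown> {
  const policy = TOOL_POLICY[name];
  if (!policy?.allowed) throw new Error(`tool ${name} denied at edge`);
  if (policy.audit) console.log(JSON.stringify({ event: "tool_call", name, args }));
  return executeToolAtEdge(name, args); // no central hop for tools
}

async function handleModelCall(prompt: string): Promise<string> {
  return centralComplete(prompt); // the only hop that pays central latency
}

// Hypothetical helpers bound to the edge runtime and the central endpoint.
declare function executeToolAtEdge(name: string, args: unknown): Promise<unknown>;
declare function centralComplete(prompt: string): Promise<string>;
```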
Comparison
| Pattern | First-token latency (typical) | Model size | Privacy boundary | Cost at scale |
|---|---|---|---|---|
| POP-cached | < 50 ms | < 7B params | edge tenant | Low |
| Regional | 80–200 ms | up to ~70B | region-pinned | Medium |
| On-device | 0–50 ms | 1–8B | user-local | No per-token cost (user hardware) |
| Hybrid | varies by route | varies | varies | Medium |
| Edge gateway | same as central (model call unchanged) | central-class | policy enforced per region | Medium |
Architectural implications
Three things change at the edge:
State management
Edge runtimes are usually stateless or per-POP. Memory layers must live somewhere persistent, typically the regional tier; a sketch follows.
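A sketch of the shape, assuming a hypothetical `MemoryStore` interface; KV stores, Redis, or Cloudflare Durable Objects all fit behind it.

```ts
// The edge handler holds no local state; memory lives in a regional store.
interface MemoryStore {
  get(sessionId: string): Promise<string[]>; // prior turns
  append(sessionId: string, turn: string): Promise<void>;
}

async function handleTurn(
  store: MemoryStore,
  sessionId: string,
  userTurn: string,
): Promise<string> {
  const history = await store.get(sessionId); // one regional round-trip
  const reply = await edgeComplete([...history, userTurn].join("\n"));
  await store.append(sessionId, userTurn);
  await store.append(sessionId, reply);
  return reply;
}

// Hypothetical edge inference client from the earlier sketches.
declare function edgeComplete(prompt: string): Promise<string>;
```

Note the regional read adds back some of the latency the edge bought you; batching reads and asynchronous write-back are the usual mitigations.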
Tool call routing
A tool call from an edge agent can cross network boundaries several times, so latency budgets compound; the sketch below makes the drawdown explicit.
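One way to do this is a per-request budget object threaded through the handler; the class and numbers below are illustrative.

```ts
// Shared per-request latency budget: every awaited hop draws it down.
class LatencyBudget {
  constructor(private remainingMs: number) {}

  async spend<T>(label: string, op: () => Promise<T>): Promise<T> {
    const start = Date.now();
    const result = await op();
    this.remainingMs -= Date.now() - start;
    console.log(`${label}: ${this.remainingMs} ms left`);
    return result;
  }

  get low(): boolean {
    return this.remainingMs < 100; // illustrative threshold
  }
}

// Usage: skip optional tools once the budget runs low.
// const budget = new LatencyBudget(800);
// const docs = await budget.spend("tool:search", () => searchDocs(query));
// if (!budget.low) await budget.spend("tool:enrich", () => enrich(docs));
```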
Observability
Trace ingest from dozens of POPs is harder than from one region. Use observability platforms built for high-cardinality trace data.
When edge does NOT make sense
Three counter-indications:
- Heavy reasoning agents — frontier models do not run at edge today.
- Cross-region orchestration — multi-region planners are slower at the edge than central.
- Compliance pinning — if data must stay in one specific data centre, distributing inference multiplies the compliance surface.
Cost reality
Edge per-token cost is comparable to central for the small-model class, but the real savings come from the architecture:
- No round-trip data transfer cost for cached content.
- Lower user abandonment because responses arrive faster.
- Cheaper authentication because identity sits at the edge.
A net cost reduction of roughly 30% is typical for the suitable workloads; a sketch for checking your own numbers follows.
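A purely illustrative cost model; none of the parameter names map to a real pricing sheet, and no default values are claimed.

```ts
// Plug in your own per-token prices and traffic split.
function netMonthlyCost(opts: {
  tokens: number; // monthly token volume
  edgeShare: number; // fraction served at the edge, 0..1
  edgePerToken: number; // $ per token at the edge
  centralPerToken: number; // $ per token centrally
  transferSavings: number; // $ saved on data transfer and egress
}): number {
  const { tokens, edgeShare, edgePerToken, centralPerToken, transferSavings } = opts;
  return (
    tokens * edgeShare * edgePerToken +
    tokens * (1 - edgeShare) * centralPerToken -
    transferSavings
  );
}
```

Compare the result against an all-central baseline (`edgeShare: 0, transferSavings: 0`) to see whether the ~30% holds for your traffic.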
A pragmatic rollout
Three phases for most teams:
- Wrap your existing classifier in POP-cached inference. First taste of edge wins.
- Move the gateway to the edge. Policy and audit get a latency win.
- Add hybrid routing. Easy/hard split; central only when needed.
After 90 days you have the data to decide whether the regional and on-device tiers are worth the engineering.
Common mistakes
- Putting heavy reasoning at edge — cost and quality both worse.
- No persistent state plan — edge is stateless, agents need state.
- Ignoring observability — losing trace visibility because of distributed inference.
- One pattern across the product — different parts want different patterns.
Where this is heading
Three trends by 2027: frontier-model-class inference at major regional edges, MCP-native edge primitives in the spec, and standardised edge-vs-central routing helpers in the Claude Agent SDK.