Desktop users tolerate two-second agent waits. Mobile users do not. Sub-second is the perception threshold that decides whether your agent feels intelligent or sluggish. Here is the latency budget and the levers that move it.
Why mobile is different
Three properties that compound:
- Smaller screens — users do not have other content to look at while waiting.
- Touch interaction — every wait interrupts the touch flow.
- Cellular variance — network round-trip can be 50 ms or 800 ms on the same device.
Designing for the median desktop case fails on mobile.
The latency budget
For an interactive agent turn:
| Phase | Target | Where time goes |
|---|---|---|
| Touch → request | < 50 ms | UI thread |
| Network out | 50–300 ms | radio + path |
| First model token | 200–500 ms | model + queue |
| Streaming visible | < 100 ms after first token | UI render |
| Tool call (typical) | 100–500 ms each | upstream |
| Full perceived response | < 1000 ms | end-to-end |
Past 1 s, users start looking away. Past 3 s, they switch apps.
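The budget can be enforced mechanically per phase. A minimal sketch, with phase names and budgets taken from the table above (the function and record shape are illustrative):

```typescript
// Per-phase budgets (ms) from the latency table; names are illustrative.
type Phase = { name: string; budgetMs: number };

const BUDGET: Phase[] = [
  { name: "touch-to-request", budgetMs: 50 },
  { name: "network-out", budgetMs: 300 },
  { name: "first-token", budgetMs: 500 },
  { name: "render", budgetMs: 100 },
];

// Given measured phase timings for one turn, return the phases over budget.
function overBudget(measured: Record<string, number>): string[] {
  return BUDGET
    .filter(p => (measured[p.name] ?? 0) > p.budgetMs)
    .map(p => p.name);
}
```

Run this on every production turn and alert on the phases that blow their slice, not just the end-to-end total.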
The five levers
1. Aggressive caching
Cache MCP tool results, embeddings, and prompt prefixes. See caching strategies. The first hit pays full latency; the next ones are sub-100 ms.
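A minimal TTL cache for tool results, as a sketch (the class name, TTL, and injected clock are illustrative; a production cache would also bound size):

```typescript
// Minimal TTL cache for tool results. The clock is injected so expiry
// is testable; in production you'd just use Date.now().
class ToolCache<V> {
  private store = new Map<string, { value: V; expires: number }>();
  constructor(private ttlMs: number) {}

  get(key: string, now = Date.now()): V | undefined {
    const entry = this.store.get(key);
    if (!entry || entry.expires < now) {
      this.store.delete(key); // drop stale entries lazily
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: V, now = Date.now()): void {
    this.store.set(key, { value, expires: now + this.ttlMs });
  }
}
```

Key the cache on the full tool call (tool name plus serialised arguments) so distinct calls never collide.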
2. On-device for the easy 70%
A local classifier picks the easy cases and runs them on device. See on-device inference. Avoids the network entirely.
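The router can be as simple as a heuristic gate in front of the model. A sketch, assuming a length-and-keyword heuristic stands in for the real on-device classifier (the thresholds and keywords are illustrative):

```typescript
type Route = "on-device" | "cloud";

// Stand-in for the on-device classifier: short, command-like utterances
// stay local; long or research-flavoured ones go to the cloud.
// Threshold and keyword list are illustrative, not tuned values.
function classify(utterance: string): Route {
  const hard =
    utterance.length > 120 ||
    /\b(research|compare|summarise|summarize)\b/i.test(utterance);
  return hard ? "cloud" : "on-device";
}
```

The point is not the heuristic itself but that the decision runs in microseconds on device, before any radio wakes up.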
3. Streaming everything
Stream model tokens to the UI as they arrive. Even if total time is 1.5 s, perceived first-response is under 500 ms.
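In code, streaming means rendering per token rather than per reply. A synchronous sketch (a real integration would consume an async token stream from the model API; the generator here simulates it):

```typescript
// Simulated token stream: a real model API yields tokens asynchronously.
function* modelTokens(reply: string): Generator<string> {
  for (const t of reply.split(" ")) yield t + " ";
}

// Push each token to the UI as it arrives instead of buffering the reply.
function streamToUi(reply: string, render: (chunk: string) => void): void {
  for (const token of modelTokens(reply)) render(token);
}
```

The UI paints on the first `render` call, so perceived latency is first-token latency, not total latency.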
4. Predictive prefetch
While the user is typing, prefetch likely tool results. Risky for cost; brilliant for speed.
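A sketch of the prediction half of prefetch, assuming a keyword-to-tool mapping (the tool names and patterns are hypothetical stand-ins for your real MCP tools):

```typescript
// Map partial input to tool calls worth warming before the user hits send.
// Tool names and trigger patterns are illustrative.
function predictTools(partial: string): string[] {
  const guesses: [RegExp, string][] = [
    [/weather|forecast/i, "get_weather"],
    [/calendar|meeting|schedule/i, "list_events"],
    [/email|inbox/i, "search_mail"],
  ];
  return guesses.filter(([re]) => re.test(partial)).map(([, tool]) => tool);
}
```

Fire the predicted calls into the tool cache while the user is still typing; a wrong guess costs one upstream call, a right one saves the whole tool-latency slice.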
5. Cellular-aware routing
On cellular, prefer on-device or POP-cached inference. On Wi-Fi, route to better-quality cloud. See edge agent deployment patterns.
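The routing policy fits in a few lines. A sketch, assuming the easy/hard classification from lever 2 is already available (type and backend names are illustrative):

```typescript
type Network = "wifi" | "cellular";
type Backend = "on-device" | "pop-cache" | "cloud";

// Illustrative policy: easy turns never leave the device; hard turns on
// cellular stay close (POP cache), hard turns on Wi-Fi get full cloud quality.
function pickBackend(net: Network, easy: boolean): Backend {
  if (easy) return "on-device";
  return net === "cellular" ? "pop-cache" : "cloud";
}
```

Re-evaluate on network-class changes (Wi-Fi drop, handover), not per request.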
Architecture pattern
user input
↓
debounce 50 ms (don't fire on every keystroke)
↓
classifier on-device: easy / hard?
↓ easy
on-device end-to-end
↓ hard
parallel: prefetch likely tools, route to cloud
↓
stream output to UI
↓
on first complete sentence: speak (if voice) or display (if text)
Streaming, prefetch, and on-device-first together are what deliver the perceived sub-second response.
What kills latency on mobile
Four common culprits, in order of impact:
Cold-start of background tasks
Reactivating an OS-suspended agent process can take 500 ms. Keep the agent process warm where allowed.
Synchronous JSON parsing of large tool results
A 200 KB JSON result parses in 50–200 ms on mobile, blocking the UI thread. Streaming JSON parsers (JSONStream-style) spread that cost across arrival, so the blocking parse is near-zero.
Model picker that re-evaluates per request
Decide once per session, not once per request.
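A once-per-session decision is one memoised field. A sketch, assuming a device-class heuristic drives the choice (class, model names, and the heuristic are illustrative):

```typescript
// Memoise the model choice on the session: evaluated on the first turn,
// reused for every turn after. Model names are illustrative.
class Session {
  private model?: string;

  pickModel(deviceClass: "low" | "high"): string {
    this.model ??= deviceClass === "high" ? "large-model" : "small-model";
    return this.model; // later turns skip re-evaluation entirely
  }
}
```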
TLS handshakes for every call
Connection pooling matters more on cellular, where each handshake pays a full radio round-trip. Use HTTP/2 or HTTP/3 multiplexing so every call shares one connection.
Measurement
What to track in production:
- Touch-to-first-token latency (TTFB equivalent for agents).
- Touch-to-end latency.
- Per-network-class breakdown (Wi-Fi vs LTE vs 5G).
- Cold-start rate — how often an agent turn pays cold-start cost.
Surface these in your observability platform. Set p95 SLOs aggressively (e.g. under 1.5 s for full touch-to-end).
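The p95 itself is a one-liner over the recorded samples. A sketch using nearest-rank percentile (the quantile method is a choice; your observability platform may interpolate instead):

```typescript
// Nearest-rank p95 over recorded latencies (ms). For SLO checks,
// compare the result against the budget per network class.
function p95(samples: number[]): number {
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil(0.95 * sorted.length) - 1;
  return sorted[Math.min(sorted.length - 1, rank)];
}
```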
Perceived vs actual
Three ways to fake speed:
- Acknowledgement haptic — phone vibrates instantly on touch; user feels heard.
- Skeleton states — show what the answer will look like before content arrives.
- Optimistic UI — show "added to calendar" before the API call completes.
These move perceived latency below actual latency by 200–400 ms. Cheap UX wins.
When to break the budget
Three cases where over-budget is acceptable:
- Heavy generation the user explicitly asked for ("write me a long email").
- Multi-step research — show progress, the wait is part of the value.
- Voice answers — once started, total length matters less than first-token.
Shape the UX around the wait. An indefinite spinner fails; narrated progress succeeds.
Common mistakes
- One-size-fits-all latency budget — different interactions deserve different budgets.
- No on-device fallback — every interaction pays the network round-trip.
- Synchronous tool calls — three sequential 200 ms calls = 600 ms; parallelise.
- Skipping streaming — total time is fine; perceived is not.
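The parallelisation point is worth spelling out: independent tool calls should fan out together. A sketch with stand-in tool functions (names and delays are illustrative; the test uses short delays for speed):

```typescript
// Stand-in for a tool call with upstream latency.
async function callTool(name: string, ms: number): Promise<string> {
  await new Promise(resolve => setTimeout(resolve, ms));
  return `${name}:ok`;
}

// Independent calls fan out concurrently: three 200 ms calls finish in
// ~200 ms total, not 600 ms. Only dependent calls need to be sequenced.
async function fetchAllTools(): Promise<string[]> {
  return Promise.all([
    callTool("calendar", 10),
    callTool("weather", 10),
    callTool("contacts", 10),
  ]);
}
```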
Where this is heading
Three trends by 2027: native streaming primitives in the Claude Agent SDK, on-device caching that survives app suspension, and OS-level "agent priority" task scheduling. Build for streaming and on-device first; the rest comes.