Desktop users tolerate two-second agent waits. Mobile users do not. Sub-second is the perception threshold that decides whether your agent feels intelligent or sluggish. Here is the latency budget and the levers that move it.
Why mobile is different
Three properties that compound:
- Smaller screens — users do not have other content to look at while waiting.
- Touch interaction — every wait interrupts the touch flow.
- Cellular variance — network round-trip can be 50 ms or 800 ms on the same device.
Designing for the median desktop case fails on mobile.
The latency budget
For an interactive agent turn:
| Phase | Target | Where time goes |
|---|---|---|
| Touch → request | < 50 ms | UI thread |
| Network out | 50–300 ms | radio + path |
| First model token | 200–500 ms | model + queue |
| Streaming visible | < 100 ms after first token | UI render |
| Tool call (typical) | 100–500 ms each | upstream |
| Full perceived response | < 1000 ms | end-to-end |
Past 1 s, users start looking away. Past 3 s, they switch apps.
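The budget can be enforced mechanically per phase. A minimal sketch, with phase names and budgets taken from the table above (the function and record shape are illustrative):

```typescript
// Per-phase budgets (ms) from the latency table; names are illustrative.
type Phase = { name: string; budgetMs: number };

const BUDGET: Phase[] = [
  { name: "touch-to-request", budgetMs: 50 },
  { name: "network-out", budgetMs: 300 },
  { name: "first-token", budgetMs: 500 },
  { name: "render", budgetMs: 100 },
];

// Given measured phase timings for one turn, return the phases over budget.
function overBudget(measured: Record<string, number>): string[] {
  return BUDGET
    .filter(p => (measured[p.name] ?? 0) > p.budgetMs)
    .map(p => p.name);
}
```

Run this on every production turn and alert on the phases that blow their slice, not just the end-to-end total.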
The five levers
1. Aggressive caching
Cache MCP tool results, embeddings, and prompt prefixes. See caching strategies. The first hit pays full latency; the next ones are sub-100 ms.
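A minimal TTL cache for tool results, as a sketch (the class name, TTL, and injected clock are illustrative; a production cache would also bound size):

```typescript
// Minimal TTL cache for tool results. The clock is injected so expiry
// is testable; in production you'd just use Date.now().
class ToolCache<V> {
  private store = new Map<string, { value: V; expires: number }>();
  constructor(private ttlMs: number) {}

  get(key: string, now = Date.now()): V | undefined {
    const entry = this.store.get(key);
    if (!entry || entry.expires < now) {
      this.store.delete(key); // drop stale entries lazily
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: V, now = Date.now()): void {
    this.store.set(key, { value, expires: now + this.ttlMs });
  }
}
```

Key the cache on the full tool call (tool name plus serialised arguments) so distinct calls never collide.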
2. On-device for the easy 70%
A local classifier picks the easy cases and runs them on device. See on-device inference. Avoids the network entirely.
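The router can be as simple as a heuristic gate in front of the model. A sketch, assuming a length-and-keyword heuristic stands in for the real on-device classifier (the thresholds and keywords are illustrative):

```typescript
type Route = "on-device" | "cloud";

// Stand-in for the on-device classifier: short, command-like utterances
// stay local; long or research-flavoured ones go to the cloud.
// Threshold and keyword list are illustrative, not tuned values.
function classify(utterance: string): Route {
  const hard =
    utterance.length > 120 ||
    /\b(research|compare|summarise|summarize)\b/i.test(utterance);
  return hard ? "cloud" : "on-device";
}
```

The point is not the heuristic itself but that the decision runs in microseconds on device, before any radio wakes up.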
3. Streaming everything
Stream model tokens to the UI as they arrive. Even if total time is 1.5 s, perceived first-response is under 500 ms.
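In code, streaming means rendering per token rather than per reply. A synchronous sketch (a real integration would consume an async token stream from the model API; the generator here simulates it):

```typescript
// Simulated token stream: a real model API yields tokens asynchronously.
function* modelTokens(reply: string): Generator<string> {
  for (const t of reply.split(" ")) yield t + " ";
}

// Push each token to the UI as it arrives instead of buffering the reply.
function streamToUi(reply: string, render: (chunk: string) => void): void {
  for (const token of modelTokens(reply)) render(token);
}
```

The UI paints on the first `render` call, so perceived latency is first-token latency, not total latency.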
4. Predictive prefetch
While the user is typing, prefetch likely tool results. Risky for cost; brilliant for speed.
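A sketch of the prediction half of prefetch, assuming a keyword-to-tool mapping (the tool names and patterns are hypothetical stand-ins for your real MCP tools):

```typescript
// Map partial input to tool calls worth warming before the user hits send.
// Tool names and trigger patterns are illustrative.
function predictTools(partial: string): string[] {
  const guesses: [RegExp, string][] = [
    [/weather|forecast/i, "get_weather"],
    [/calendar|meeting|schedule/i, "list_events"],
    [/email|inbox/i, "search_mail"],
  ];
  return guesses.filter(([re]) => re.test(partial)).map(([, tool]) => tool);
}
```

Fire the predicted calls into the tool cache while the user is still typing; a wrong guess costs one upstream call, a right one saves the whole tool-latency slice.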
5. Cellular-aware routing
On cellular, prefer on-device or POP-cached inference. On Wi-Fi, route to better-quality cloud. See edge agent deployment patterns.
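The routing policy fits in a few lines. A sketch, assuming the easy/hard classification from lever 2 is already available (type and backend names are illustrative):

```typescript
type Network = "wifi" | "cellular";
type Backend = "on-device" | "pop-cache" | "cloud";

// Illustrative policy: easy turns never leave the device; hard turns on
// cellular stay close (POP cache), hard turns on Wi-Fi get full cloud quality.
function pickBackend(net: Network, easy: boolean): Backend {
  if (easy) return "on-device";
  return net === "cellular" ? "pop-cache" : "cloud";
}
```

Re-evaluate on network-class changes (Wi-Fi drop, handover), not per request.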
Architecture pattern
user input
↓
debounce 50 ms (don't fire on every keystroke)
↓
classifier on-device: easy / hard?
↓ easy
on-device end-to-end
↓ hard
parallel: prefetch likely tools, route to cloud
↓
stream output to UI
↓
on first complete sentence: speak (if voice) or display (if text)
Streaming, prefetch, and on-device-first together are what deliver the perceived sub-second response.
What kills latency on mobile
Four common culprits, in order of impact:
Cold-start of background tasks
Reactivating an OS-suspended agent process can take 500 ms. Keep the agent process warm where allowed.
Synchronous JSON parsing of large tool results
A 200 KB JSON result parses in 50–200 ms on mobile, blocking the UI thread. Streaming JSON parsers (JSONStream-style) spread that cost across arrival, so the blocking parse is near-zero.
Model picker that re-evaluates per request
Decide once per session, not once per request.
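A once-per-session decision is one memoised field. A sketch, assuming a device-class heuristic drives the choice (class, model names, and the heuristic are illustrative):

```typescript
// Memoise the model choice on the session: evaluated on the first turn,
// reused for every turn after. Model names are illustrative.
class Session {
  private model?: string;

  pickModel(deviceClass: "low" | "high"): string {
    this.model ??= deviceClass === "high" ? "large-model" : "small-model";
    return this.model; // later turns skip re-evaluation entirely
  }
}
```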
TLS handshakes for every call
Connection pooling matters more on cellular, where each handshake pays a full radio round-trip. Use HTTP/2 or HTTP/3 multiplexing so every call shares one connection.
Measurement
What to track in production:
- Touch-to-first-token latency (TTFB equivalent for agents).
- Touch-to-end latency.
- Per-network-class breakdown (Wi-Fi vs LTE vs 5G).
- Cold-start rate — how often an agent turn pays cold-start cost.
Surface these in your observability platform. Set p95 SLOs aggressively (e.g. under 1.5 s for full touch-to-end).
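The p95 itself is a one-liner over the recorded samples. A sketch using nearest-rank percentile (the quantile method is a choice; your observability platform may interpolate instead):

```typescript
// Nearest-rank p95 over recorded latencies (ms). For SLO checks,
// compare the result against the budget per network class.
function p95(samples: number[]): number {
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil(0.95 * sorted.length) - 1;
  return sorted[Math.min(sorted.length - 1, rank)];
}
```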
Perceived vs actual
Three ways to fake speed:
- Acknowledgement haptic — phone vibrates instantly on touch; user feels heard.
- Skeleton states — show what the answer will look like before content arrives.
- Optimistic UI — show "added to calendar" before the API call completes.
These move perceived latency below actual latency by 200–400 ms. Cheap UX wins.
When to break the budget
Three cases where over-budget is acceptable:
- Heavy generation the user explicitly asked for ("write me a long email").
- Multi-step research — show progress, the wait is part of the value.
- Voice answers — once started, total length matters less than first-token.
Shape the UX around the wait. An indefinite spinner fails; narrated progress succeeds.
Common mistakes
- One-size-fits-all latency budget — different interactions deserve different budgets.
- No on-device fallback — every interaction pays the network round-trip.
- Synchronous tool calls — three sequential 200 ms calls = 600 ms; parallelise.
- Skipping streaming — total time is fine; perceived is not.
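The parallelisation point is worth spelling out: independent tool calls should fan out together. A sketch with stand-in tool functions (names and delays are illustrative; the test uses short delays for speed):

```typescript
// Stand-in for a tool call with upstream latency.
async function callTool(name: string, ms: number): Promise<string> {
  await new Promise(resolve => setTimeout(resolve, ms));
  return `${name}:ok`;
}

// Independent calls fan out concurrently: three 200 ms calls finish in
// ~200 ms total, not 600 ms. Only dependent calls need to be sequenced.
async function fetchAllTools(): Promise<string[]> {
  return Promise.all([
    callTool("calendar", 10),
    callTool("weather", 10),
    callTool("contacts", 10),
  ]);
}
```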
Where this is heading
Three trends by 2027: native streaming primitives in the Claude Agent SDK, on-device caching that survives app suspension, and OS-level "agent priority" task scheduling. Build for streaming and on-device first; the rest comes.