Explainer · 3 min read

Agent on-device inference: what the new generation of phone and laptop models actually do well

Apple Intelligence, Gemini Nano, Phi-class models — the on-device LLM stack matured fast in 2025–2026. What they are good at, where they fail, and how to design an agent that uses them where they shine.

"On-device inference" went from "demo at WWDC" to "shipping to billions of devices" in 18 months. The 1–8B parameter model class is now real, not toy. What it can and cannot do shapes the agent stack of 2027.

The model class today

Roughly comparable performance:

  • Apple Intelligence — proprietary; runs on iPhone 15 Pro and later, M-series Macs.
  • Gemini Nano — Pixel 8 and later, plus supported Android devices.
  • Phi-3.5 / Phi-4 mini — open weights, runs anywhere.
  • Llama-3.2 1B/3B — open weights, mobile-friendly.

All in the 1–8B range, all targeting under 1.5 GB on disk, all hitting 30–150 tokens/second on modern hardware.

What they are good at

Five categories:

Classification

Sentiment, intent, language ID, content category. Often at parity with frontier models on common label sets.

Extraction

Pull phone numbers, dates, addresses, and names from text. High accuracy.

Light text generation

Summaries, rewrites, polishing. Quality is "good enough for inline use", not "publishable".

Translation between major language pairs

EN-ES, EN-FR, EN-DE — surprisingly close to dedicated translators.

Embedding generation

Local embeddings for retrieval. Slower than batch cloud embeddings but private.
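To make the first two categories concrete, a minimal sketch. `local_generate` is a hypothetical binding to whatever on-device runtime you ship (Core ML, AICore, llama.cpp, ...); the prompts and labels are illustrative, not a vendor API.

```python
import json

def local_generate(prompt: str, max_tokens: int = 64) -> str:
    """Hypothetical stub; bind to your on-device runtime."""
    raise NotImplementedError

def classify_intent(text: str) -> str:
    # Classification: constrain the model to a closed label set.
    prompt = (
        "Label the intent as exactly one of: question, command, smalltalk.\n"
        f"Text: {text}\nLabel:"
    )
    return local_generate(prompt, max_tokens=2).strip().lower()

def extract_contact(text: str) -> dict:
    # Extraction: ask for JSON with fixed keys, null when absent.
    prompt = (
        "Extract name, phone, and date from the text as JSON with exactly "
        "those keys (use null if a field is absent).\n"
        f"Text: {text}\nJSON:"
    )
    return json.loads(local_generate(prompt, max_tokens=64))
```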

What they are bad at

Five known weaknesses:

  • Multi-step reasoning — reliability collapses past 2–3 steps.
  • Long-form coherent generation — coherence breaks past ~500 tokens.
  • Code generation beyond snippets — usable for completion, not for whole functions.
  • Niche knowledge — small models cannot hold the long tail of facts.
  • Tool use — function-calling support exists but is unreliable.

Designing around these is the agent designer's job.

The router pattern

Most production agents use on-device for the easy 70% and route the hard 30% to a frontier model:

incoming user request
   ↓
on-device classifier: easy / medium / hard?
   ↓
easy: on-device end-to-end
medium: on-device draft, cloud verify
hard: cloud end-to-end

See model routing. The classifier itself runs on-device.
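A minimal sketch of that router, assuming hypothetical `local_generate` and `cloud_generate` bindings; swap in your own runtime and frontier API.

```python
def local_generate(prompt: str, max_tokens: int = 256) -> str:
    raise NotImplementedError("bind to your on-device runtime")

def cloud_generate(prompt: str, max_tokens: int = 1024) -> str:
    raise NotImplementedError("bind to your frontier-model API")

def classify_difficulty(request: str) -> str:
    """The on-device model acting as a 3-way classifier."""
    prompt = (
        "Classify this request as easy, medium, or hard.\n"
        f"Request: {request}\nLabel:"
    )
    label = local_generate(prompt, max_tokens=2).strip().lower()
    return label if label in {"easy", "medium", "hard"} else "hard"  # fail safe to cloud

def handle(request: str) -> str:
    difficulty = classify_difficulty(request)
    if difficulty == "easy":
        return local_generate(request)       # on-device end-to-end
    if difficulty == "medium":
        draft = local_generate(request)      # on-device draft
        return cloud_generate(               # cloud verify
            "Verify this draft answer and fix any errors.\n"
            f"Request: {request}\nDraft: {draft}"
        )
    return cloud_generate(request)           # hard: cloud end-to-end
```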

Latency and battery

On-device inference is fast in tokens/second, but every inference also wakes the GPU/NPU and burns battery.

Operation                        Energy (relative)
Idle text input                  1x
50-token on-device generation    5–10x
500-token on-device generation   50–100x
Send to cloud (cellular)         10–20x
Send to cloud (Wi-Fi)            2–5x

For short generations, on-device wins. For long ones, the energy cost can exceed cloud + radio. Profile in your specific app.
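The table turns into a simple local-vs-cloud heuristic. The constants below are midpoints of the relative ranges above, not measurements; replace them with profiled numbers from your app.

```python
# Relative energy costs, taken from the midpoints of the table above.
ON_DEVICE_PER_50_TOKENS = 7.5    # 5-10x per 50 generated tokens
CLOUD_WIFI = 3.5                 # 2-5x per round trip
CLOUD_CELLULAR = 15.0            # 10-20x per round trip

def cheaper_locally(expected_tokens: int, on_wifi: bool) -> bool:
    local_cost = ON_DEVICE_PER_50_TOKENS * (expected_tokens / 50)
    radio_cost = CLOUD_WIFI if on_wifi else CLOUD_CELLULAR
    return local_cost < radio_cost

# cheaper_locally(50, on_wifi=True)  -> False: on Wi-Fi, even short replies
#                                       can favour the cloud
# cheaper_locally(50, on_wifi=False) -> True: on cellular, local wins up to
#                                       roughly 100 tokens
```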

Privacy implications

Three guarantees on-device inference can offer:

  • No data leaves the device for inference.
  • No vendor logs what you asked.
  • Compliance simplification — many GDPR considerations vanish for on-device-only flows.

The catch: tool calls may still cross the network. Be precise about what stays local.
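One way to be precise in code: tag every tool with whether it crosses the network, and enforce a local-only mode. The `Tool` type and registry here are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Tool:
    name: str
    run: Callable[[str], str]
    crosses_network: bool  # the property your privacy claims hinge on

TOOLS = {
    "calendar_lookup": Tool("calendar_lookup", lambda q: "stub", crosses_network=False),
    "web_search":      Tool("web_search",      lambda q: "stub", crosses_network=True),
}

def call_tool(name: str, query: str, local_only: bool) -> str:
    tool = TOOLS[name]
    if local_only and tool.crosses_network:
        raise PermissionError(f"{name} would send data off-device")
    return tool.run(query)
```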

Memory layer for on-device agents

A small on-device store works:

  • SQLite + sqlite-vec for memory + retrieval.
  • A few hundred MB of context cached.
  • Cloud sync optional, encrypted, user-controlled.

See persistent memory architecture for the cloud-side analogue.
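A minimal sketch of the local store above, assuming the sqlite-vec Python bindings; `embed()` is a stand-in for your local embedding model, and the 384-dimension size is an assumption.

```python
import sqlite3
import sqlite_vec  # pip install sqlite-vec

def embed(text: str) -> bytes:
    # Stand-in: return a float32 blob, e.g. via sqlite_vec.serialize_float32(...)
    raise NotImplementedError("bind to your on-device embedding model")

db = sqlite3.connect("agent_memory.db")
db.enable_load_extension(True)
sqlite_vec.load(db)
db.enable_load_extension(False)

db.execute("CREATE VIRTUAL TABLE IF NOT EXISTS memories USING vec0(embedding float[384])")
db.execute("CREATE TABLE IF NOT EXISTS memory_text(id INTEGER PRIMARY KEY, body TEXT)")

def remember(body: str) -> None:
    cur = db.execute("INSERT INTO memory_text(body) VALUES (?)", (body,))
    db.execute("INSERT INTO memories(rowid, embedding) VALUES (?, ?)",
               (cur.lastrowid, embed(body)))

def recall(query: str, k: int = 5) -> list[str]:
    # KNN search in the vector table, then fetch the matching text rows.
    ids = db.execute(
        "SELECT rowid FROM memories WHERE embedding MATCH ? AND k = ? ORDER BY distance",
        (embed(query), k),
    ).fetchall()
    return [db.execute("SELECT body FROM memory_text WHERE id = ?", (i,)).fetchone()[0]
            for (i,) in ids]
```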

What to ship today

A pragmatic agent that uses on-device inference:

  • Classification and routing on-device, always.
  • Easy generation on-device.
  • Hard generation on cloud, with explicit "this requires connectivity" UX.
  • Memory primarily on-device with optional encrypted sync.

Three months of work for a mid-sized engineering team.
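One way to pin that scope down is an explicit policy object the whole team can review in a diff; the keys here are hypothetical.

```python
AGENT_POLICY = {
    "routing_classifier": "on_device",   # always local
    "easy_generation": "on_device",
    "hard_generation": "cloud",          # pair with explicit "requires connectivity" UX
    "memory_store": "on_device",
    "memory_sync": {
        "enabled": False,                # off by default
        "encrypted": True,
        "user_opt_in": True,
    },
}
```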

Common mistakes

  • Treating on-device as a single-model fallback — it is a tier with its own quality profile.
  • Ignoring battery — long generations drain the battery fast.
  • No cloud handoff — users hit limits with no escape.
  • Hardcoding model names — Apple, Google ship updates; abstract behind a router.

Where this is heading

Three trends by 2027: 10–20B on-device models on flagship phones, vendor-neutral on-device APIs (today they are Apple-specific and Google-specific), and on-device fine-tuning per user. Build the router pattern: the underlying models get better; the architecture stays.
