Voice agents went from "demo only" to "deployed at scale" between 2024 and 2026. The deployments that stuck share UX patterns that look obvious in hindsight yet were missing from most failed attempts. Here are the five patterns and their gotchas.
## What changed
Three technical shifts made voice-first viable:
- Sub-300ms first-token latency in Realtime APIs (OpenAI, Anthropic, Google).
- Robust interruption handling — the agent stops talking when the user does.
- Streaming TTS that does not sound robotic.
Together, voice agents now feel like a phone call, not a kiosk.
## Pattern 1: Barge-in done right

The user must be able to interrupt the agent at any point. Three rules:
- Generation stops instantly when user speech is detected.
- Playback finishes the current word, not the current sentence.
- The interrupted output is dropped, not queued.

Without barge-in, voice agents feel slow. With buggy barge-in, they feel rude.
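A minimal sketch of the three rules as a controller object. The class and its hooks are hypothetical; a real deployment wires these rules into the vendor's Realtime API events rather than a hand-rolled state machine:

```python
from dataclasses import dataclass, field

@dataclass
class BargeInController:
    """Tracks what the agent has synthesized but not yet played."""
    generating: bool = False
    pending_audio: list = field(default_factory=list)

    def on_agent_chunk(self, chunk: str) -> None:
        # Agent output streams in word-sized chunks while generating.
        self.generating = True
        self.pending_audio.append(chunk)

    def on_user_speech(self) -> None:
        # Rule 1: stop generating instantly (a real system also cancels
        # the upstream model request here).
        self.generating = False
        # Rule 2 lives in the audio layer: playback cuts at a word boundary.
        # Rule 3: drop the interrupted output; never queue it for later.
        self.pending_audio.clear()

ctl = BargeInController()
for word in ["I", "see", "three", "options"]:
    ctl.on_agent_chunk(word)
ctl.on_user_speech()
assert not ctl.generating and ctl.pending_audio == []  # dropped, not queued
```

The key design choice is that the interrupt path only clears state; it never replays or resumes the cancelled output.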
## Pattern 2: Partial commitment
Long agent actions should commit progressively, not at the end. "I am pulling up your account... I see three options... let me read them..." gives the user the chance to redirect mid-stream.
Without partial commitment, every long answer feels like buffering.
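One way to sketch progressive commitment is a generator that yields a spoken chunk after each completed step, so the caller can stop iterating the moment the user barges in. The account structure here is illustrative:

```python
def progressive_response(account: dict):
    """Yield one spoken chunk per completed step instead of one
    monologue at the end; the caller stops iterating on barge-in."""
    yield "I am pulling up your account..."
    options = account.get("options", [])
    yield f"I see {len(options)} options..."
    yield "Let me read them..."
    for option in options:
        yield option

chunks = list(progressive_response({"options": ["basic", "plus", "pro"]}))
# chunks[0] is spoken before the options are even enumerated
```

Because each `yield` is a commit point, a redirect mid-stream simply abandons the generator.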
## Pattern 3: Voice handoff
When the agent encounters something it cannot handle, the handoff to a human (or another agent) must:
- Tell the user what is happening ("connecting you to billing now").
- Carry the context to the recipient (so the user does not repeat).
- Keep the line open during the handoff.
Without smooth handoff, users hang up.
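The three handoff requirements can be sketched as one function over a session object. The `Session` class and its methods are stand-ins for whatever telephony layer is in use, not a real API:

```python
from dataclasses import dataclass, field

@dataclass
class Session:
    """Minimal stand-in for a live call session (hypothetical API)."""
    transcript: list = field(default_factory=list)
    intent: str = ""
    events: list = field(default_factory=list)

    def say(self, text):
        self.events.append(("say", text))

    def hold(self):
        self.events.append(("hold", True))

    def transfer(self, target, context):
        self.events.append(("transfer", target))
        return context

def handoff(session: Session, target: str = "billing") -> dict:
    session.say(f"Connecting you to {target} now.")   # 1. tell the user
    session.hold()                                    # 2. keep the line open
    context = {"transcript": session.transcript,
               "intent": session.intent}
    return session.transfer(target, context)          # 3. carry the context

s = Session(transcript=["I was double-charged"], intent="billing_dispute")
ctx = handoff(s)
assert ctx["intent"] == "billing_dispute"  # recipient sees it; no repeating
```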
## Pattern 4: Ambient confirmation
For low-stakes actions, do not ask "do you want me to do X?" — just do it and confirm. "Done — added to your calendar." For high-stakes, ask. The threshold matters; over-confirming is the most cited UX complaint.
See *consent management for agents* for the full spectrum.
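A sketch of the threshold as a simple stakes split. The action names are illustrative; a real agent would classify stakes rather than hard-code a set:

```python
HIGH_STAKES = {"send_payment", "delete_account", "book_flight"}  # illustrative

def respond_to(action: str, execute) -> str:
    """Ask first for high-stakes actions; act then confirm otherwise."""
    label = action.replace("_", " ")
    if action in HIGH_STAKES:
        return f"Do you want me to {label}?"   # confirm first, do nothing yet
    execute(action)                            # low stakes: just do it...
    return f"Done, {label}."                   # ...and confirm after the fact

done = []
print(respond_to("add_calendar_event", done.append))  # acts immediately
print(respond_to("send_payment", done.append))        # asks, does not act
```

The asymmetry is the point: the low-stakes branch performs the action before speaking, the high-stakes branch speaks without performing it.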
## Pattern 5: Progressive trust
A new user gets confirm-everything mode. After 10–20 successful interactions, the agent moves to ambient mode. Users prefer this trajectory; "trust earned" feels right.
Implement as a per-user counter; surface a setting to revert.
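The per-user counter might look like this. The threshold of 15 sits inside the article's 10–20 band; the storage and names are assumptions:

```python
AMBIENT_AFTER = 15  # successes before ambient mode; article suggests 10-20

class TrustTracker:
    def __init__(self):
        self.successes: dict = {}        # user_id -> successful interactions
        self.forced_confirm: set = set() # the user-facing revert setting

    def record_success(self, user_id: str) -> None:
        self.successes[user_id] = self.successes.get(user_id, 0) + 1

    def mode(self, user_id: str) -> str:
        if user_id in self.forced_confirm:
            return "confirm"  # an explicit revert always wins
        n = self.successes.get(user_id, 0)
        return "ambient" if n >= AMBIENT_AFTER else "confirm"

    def revert(self, user_id: str) -> None:
        self.forced_confirm.add(user_id)

t = TrustTracker()
for _ in range(AMBIENT_AFTER):
    t.record_success("u1")
assert t.mode("u1") == "ambient"   # trust earned
t.revert("u1")
assert t.mode("u1") == "confirm"   # setting overrides the counter
```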
## The hard problems
Three things that do not go away with better tech:
### Latency budget
Total user-perceived latency: capture (~50 ms) + STT (100–300 ms) + model first token (200–500 ms) + TTS first audio (50–100 ms). That sums to roughly 400–950 ms end to end. Past 1 second the experience cracks. Optimise every layer; cache where possible.
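Summing the budget above shows how little slack remains under the 1-second ceiling:

```python
# Component latency ranges in milliseconds, as listed above.
BUDGET_MS = {
    "capture": (50, 50),
    "stt": (100, 300),
    "model_first_token": (200, 500),
    "tts_first_audio": (50, 100),
}

best = sum(lo for lo, _ in BUDGET_MS.values())
worst = sum(hi for _, hi in BUDGET_MS.values())
print(f"user-perceived latency: {best}-{worst} ms")  # 400-950 ms
assert worst < 1000  # even the worst case must stay under the ceiling
```

The worst case leaves only 50 ms of headroom, which is why buffering anywhere in the chain is fatal.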
### Background noise
Real users are in cars, kitchens, and open-plan offices, and STT is the weakest link. Vendor STT works in the lab; under real conditions, word error rates run 10–20% worse.
### Silence handling
When does silence mean "I am thinking" versus "your turn"? Make the VAD (voice activity detection) end-of-turn timeout tunable per user: default tuning fails for ESL users and for slow speakers.
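One way to make the end-of-turn timeout adapt per user is to scale a base timeout by the user's observed speaking rate. The base value and reference rate below are illustrative defaults, not from the article:

```python
def end_of_turn_ms(observed_wpm: float,
                   base_ms: float = 700.0,
                   reference_wpm: float = 150.0) -> float:
    """Slower speakers (lower words-per-minute) get a longer silence
    window before the agent takes its turn."""
    return base_ms * (reference_wpm / max(observed_wpm, 1.0))

print(end_of_turn_ms(150))  # 700.0 ms for an average speaker
print(end_of_turn_ms(100))  # 1050.0 ms for a slower speaker
```

Observed speaking rate is something the agent can measure passively from the transcript, so this tunes itself without a settings screen.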
## Cost model
Voice is more expensive per minute than text. Typical:
| Component | Cost per minute |
|---|---|
| STT | $0.005–0.015 |
| Model (medium, ~80 tokens/sec out) | $0.05–0.20 |
| TTS | $0.005–0.030 |
| Latency-driven re-runs | + 10–30% |
| Total | $0.07–0.30 / min |
For a 10-minute call, $0.70–3.00 / call. Plan accordingly.
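The table collapses into a two-line estimate; the computed bounds land within rounding of the table's totals:

```python
# Per-minute cost ranges (USD) from the table above.
PER_MIN = {"stt": (0.005, 0.015), "model": (0.05, 0.20), "tts": (0.005, 0.030)}
RERUN_OVERHEAD = (0.10, 0.30)  # latency-driven re-runs, +10-30%

def call_cost(minutes: float):
    low = sum(lo for lo, _ in PER_MIN.values()) * (1 + RERUN_OVERHEAD[0])
    high = sum(hi for _, hi in PER_MIN.values()) * (1 + RERUN_OVERHEAD[1])
    return low * minutes, high * minutes

low, high = call_cost(10)
print(f"10-minute call: ${low:.2f}-${high:.2f}")
```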
## Architecture
Streaming end-to-end:
```
mic → VAD → STT stream → model stream → TTS stream → speaker
                              ↓
                         tool calls
                              ↓
                        result stream
```
Buffering anywhere in this chain costs latency. Use the vendor's Realtime API if at all possible.
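The no-buffering rule can be sketched with chained generators, where each stage forwards chunks as soon as they arrive. The stage bodies are placeholders for the real STT, model, and TTS streams:

```python
def stt(audio_chunks):
    for chunk in audio_chunks:   # transcribe incrementally
        yield f"text({chunk})"

def model(transcripts):
    for t in transcripts:        # respond to partial transcripts
        yield f"reply({t})"

def tts(tokens):
    for tok in tokens:           # synthesize per token
        yield f"audio({tok})"

# mic -> STT -> model -> TTS -> speaker, with no stage holding
# more than one chunk at a time.
speaker = tts(model(stt(["hello", "agent"])))
print(next(speaker))  # audio(reply(text(hello))) before "agent" is even read
```

A stage that collected its whole input before yielding would break this property, which is exactly the buffering the article warns against.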
## What does NOT work
- Long agent monologues — break them into chunks with implicit pauses.
- Reading back numbers and codes — users want them in chat or text, not voice.
- Complex menus — voice menus are an anti-pattern; ask open questions, classify intent.
- Verbatim from text agents — voice prompts are different artefacts.
## Where it shines
Voice-first wins for:
- Hands-busy contexts (driving, cooking, surgery).
- Accessibility (screen-reader users, motor impairments).
- High-empathy interactions (support, healthcare).
- Hands-free wearables and ambient computing.
## Common mistakes
- Reusing text prompts verbatim — voice needs different rhythm.
- No barge-in — kills perceived intelligence.
- Asking before every action — users disengage.
- No handoff path — when the agent is stuck, the user is stuck.
## Where this is heading
Three trends by 2027: voice-first becomes the default for support and onboarding agents; multilingual voice agents reach parity with English; and ambient voice agents (always on, always listening) emerge as a category — with the privacy questions that follow.