Web apps have a mature monitoring culture — RED metrics, SLOs, pager runbooks. Agents are still running on vibes. The eight signals below are the minimum to cross from demo into production.
Why agents need their own monitoring stack
Traditional APM (Datadog, New Relic) catches HTTP errors, DB latency, pod restarts. It does not catch: a model that returns grammatical nonsense, a tool-call loop that consumes 50k tokens before finishing, a retrieval step that silently returned zero results. All of those can be p99-latency green while the product is broken.
Agents need a second layer: semantic monitoring. Here is the signal set.
The eight signals
1. Task success rate
The only top-level metric that matters. Did the agent achieve the user's goal?
Measured by:
- Explicit user feedback. Thumbs up/down, retry-from-scratch button.
- Implicit signals. User abandoned the conversation, asked the same question again, escalated to human.
- Offline eval. Replay a sample against a scoring model; see continuous agent regression testing.
Blend all three. No single source is reliable.
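One way to blend the three sources is a weighted average over whichever signals each task happens to have. This is a minimal sketch; the weights, field names, and dict shape are illustrative assumptions, not a standard:

```python
# Sketch: blend explicit, implicit, and offline success signals into one rate.
# Weights and field names are assumptions -- tune them to your own data.

def blended_success_rate(tasks, w_explicit=0.5, w_implicit=0.3, w_offline=0.2):
    """Each task dict may carry any subset of:
       explicit: True/False (thumbs up/down)
       implicit: True/False (no abandon / retry / escalation observed)
       offline:  0.0-1.0 score from a replay eval."""
    num = den = 0.0
    for t in tasks:
        for key, w in (("explicit", w_explicit),
                       ("implicit", w_implicit),
                       ("offline", w_offline)):
            if key in t:
                num += w * float(t[key])
                den += w
    return num / den if den else None

tasks = [
    {"explicit": True, "implicit": True},
    {"implicit": False, "offline": 0.8},
    {"explicit": False},
]
rate = blended_success_rate(tasks)
```

Tasks missing a signal simply contribute less weight, so sparse feedback does not bias the rate toward any one source.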
2. End-to-end latency
Not per-call LLM latency — end-to-end. From user message to final answer, including all tool calls, retries, and reranking. Users experience this number.
Break it down by:
- Model inference time
- Tool call time (often dominates)
- Framework overhead (often surprising)
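Given spans from your trace layer, the breakdown falls out by subtraction: whatever the total is not spending in model or tool spans is framework overhead. The span tuple shape here is an assumption about your trace schema:

```python
# Sketch: split one trace's end-to-end latency into the three buckets.
# Span shape (name, kind, duration_ms) is an assumed trace-layer schema.

def latency_breakdown(total_ms, spans):
    model = sum(d for _, kind, d in spans if kind == "llm")
    tools = sum(d for _, kind, d in spans if kind == "tool")
    return {
        "model_ms": model,
        "tool_ms": tools,
        # Time the framework spent between spans: prompt assembly,
        # serialization, retry bookkeeping. Often the surprise.
        "framework_ms": max(total_ms - model - tools, 0),
    }

spans = [("plan", "llm", 900), ("search_db", "tool", 2400), ("answer", "llm", 1100)]
bd = latency_breakdown(5000, spans)
```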
3. Token usage per task
Cost and quality both live here. A task that was 2k tokens yesterday and is 8k tokens today has regressed — usually because context grew or the model is retrying silently.
Track p50 and p99 separately. The p99 is where runaway loops hide. Pair this metric with agent token usage analytics for drill-down.
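A nearest-rank percentile over per-task token totals is enough to see the split; in production you would pull this from your metrics library rather than compute it by hand:

```python
# Sketch: p50/p99 token usage per task via nearest-rank percentiles.
# The sample data is illustrative -- note how one runaway loop owns the p99.

def percentile(values, p):
    s = sorted(values)
    # Nearest-rank: smallest value with at least p% of samples at or below it.
    k = max(0, -(-len(s) * p // 100) - 1)  # ceil(n * p / 100) - 1, clamped
    return s[int(k)]

tokens_per_task = [1800, 2100, 2300, 2050, 1900, 45000]  # one runaway loop
p50 = percentile(tokens_per_task, 50)
p99 = percentile(tokens_per_task, 99)
```

The p50 barely moves when a loop runs away; the p99 jumps immediately, which is exactly why the two need separate tracking.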
4. Tool-call success rate
Per tool. A Postgres MCP server returning 10% errors is either misconfigured or under attack. A GitHub MCP server with a healthy success rate but a p99 creeping past 300ms is about to cause user-visible slowdowns — error rate and latency fail independently, so watch both.
Tag each call with tool_name, tool_version, and caller_agent. Aggregate by all three.
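A sketch of the tagging and aggregation, with an in-memory counter standing in for your real metrics client; the tag names match the text, everything else is an assumption:

```python
# Sketch: tag every tool call and aggregate success rate by any tag subset.
# The defaultdict stands in for a real metrics backend.

from collections import defaultdict

# (tool_name, tool_version, caller_agent) -> [ok_count, total_count]
calls = defaultdict(lambda: [0, 0])

def record_tool_call(tool_name, tool_version, caller_agent, ok):
    entry = calls[(tool_name, tool_version, caller_agent)]
    entry[0] += int(ok)
    entry[1] += 1

def success_rate(**tags):
    # Filter on whichever of the three tags were passed; aggregate the rest.
    ok = total = 0
    for (tool, ver, agent), (o, t) in calls.items():
        if (tags.get("tool_name", tool) == tool
                and tags.get("tool_version", ver) == ver
                and tags.get("caller_agent", agent) == agent):
            ok += o
            total += t
    return ok / total if total else None

record_tool_call("postgres_mcp", "1.2", "support_agent", ok=True)
record_tool_call("postgres_mcp", "1.2", "support_agent", ok=False)
record_tool_call("github_mcp", "0.9", "dev_agent", ok=True)
```

Keeping all three tags on every call is what lets you later answer "is it this tool, this version, or this agent?" without re-instrumenting.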
5. Retrieval recall proxy
For agents with RAG or memory: are the retrievals returning anything useful?
Cheap proxies:
- Fraction of turns where retrieval returned zero results.
- Average number of retrieved docs cited in the final answer.
- p50 embedding similarity between the query and the retrieved docs.
None are perfect. All are better than not measuring.
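The first two proxies are one pass over per-turn trace records. The field names (`retrieved_ids`, `cited_ids`) are assumptions about your trace schema:

```python
# Sketch: zero-result rate and average cited docs from per-turn traces.
# Field names are assumed; adapt to whatever your trace layer emits.

def retrieval_proxies(turns):
    with_retrieval = [t for t in turns if "retrieved_ids" in t]
    if not with_retrieval:
        return None
    zero = sum(1 for t in with_retrieval if not t["retrieved_ids"])
    cited = [len(t.get("cited_ids", [])) for t in with_retrieval]
    return {
        "zero_result_rate": zero / len(with_retrieval),
        "avg_docs_cited": sum(cited) / len(cited),
    }

turns = [
    {"retrieved_ids": ["a", "b"], "cited_ids": ["a"]},
    {"retrieved_ids": [], "cited_ids": []},
    {"retrieved_ids": ["c"], "cited_ids": ["c"]},
]
m = retrieval_proxies(turns)
```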
6. Model output quality
Cheap online signals:
- Rate of JSON parse failures (for structured outputs).
- Rate of refusals or "I don't know" responses.
- Rate of output-format violations (responses that don't match schema).
These do not measure correctness — they measure whether the model is trying. A spike in refusals often precedes a bigger regression.
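Two of these signals cost one string scan per response. The refusal phrase list below is an illustrative assumption; tune it to your model's actual wording:

```python
# Sketch: JSON-parse-failure rate and refusal rate over a batch of responses.
# REFUSAL_MARKERS is a toy list -- real deployments curate per-model phrasing.

import json

REFUSAL_MARKERS = ("i can't", "i cannot", "i don't know")

def quality_signals(responses):
    parse_fail = refusal = 0
    for r in responses:
        try:
            json.loads(r)
        except json.JSONDecodeError:
            parse_fail += 1
        if any(m in r.lower() for m in REFUSAL_MARKERS):
            refusal += 1
    n = len(responses)
    return {"json_fail_rate": parse_fail / n, "refusal_rate": refusal / n}

batch = ['{"answer": 42}', '{"answer": ', "I don't know the ticket status."]
sig = quality_signals(batch)
```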
7. Hallucination signal
Sample 1% of traces, score with a stronger model for factual accuracy. Alert when the score drifts. This is expensive but irreplaceable: no cheap online metric catches confident wrongness.
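For the 1% sample, hash-based sampling beats random sampling because the same traces stay selected across reruns, so the judge scores are reproducible. A minimal sketch; the judge call itself is a placeholder for your stronger-model scorer:

```python
# Sketch: deterministic 1% trace sampling for the hallucination eval queue.
# A stable hash of the trace id decides membership, so reruns pick the
# same traces. Scoring the queued traces with a judge model is out of scope.

import hashlib

def sampled(trace_id, rate=0.01):
    # Map the trace id to a stable bucket in [0, 1).
    h = int(hashlib.sha256(trace_id.encode()).hexdigest(), 16)
    return (h % 10_000) / 10_000 < rate

queue = [tid for tid in (f"trace-{i}" for i in range(10_000)) if sampled(tid)]
# len(queue) lands near 100; each queued trace then goes to the judge model.
```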
8. Safety and policy violations
Counts of:
- PII in model outputs (regex and NER).
- Denied tool calls from the policy engine.
- Attempted actions outside the agent's scope.
A flat line is boring. A spike is an incident.
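The regex half of the PII check is a few patterns over model output (NER is a separate, heavier pass). These two patterns cover only emails and US-style SSNs and are illustrative; extend them for your jurisdiction:

```python
# Sketch: regex-based PII detection on model outputs.
# Only emails and US SSNs here -- a real deployment needs a fuller pattern set
# plus an NER pass for names and addresses.

import re

PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def pii_hits(text):
    # Return only the pattern names that matched, with their matches.
    return {name: pat.findall(text)
            for name, pat in PII_PATTERNS.items() if pat.search(text)}

out = pii_hits("Contact jane.doe@example.com, SSN 123-45-6789.")
```

Because the alert rule is "any instance pages", the counter this feeds should be exact, not sampled.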
Alerting rules that actually fire
Most agent teams alert on nothing useful. Start with these:
| Rule | Threshold | Page? |
|---|---|---|
| Task success rate drop | > 3σ from 7-day mean | yes |
| p99 end-to-end latency | > 2× baseline for 10 min | yes |
| Token usage p99 | > 3× baseline, any 5-min window | yes |
| Tool error rate per tool | > 10% for 15 min | yes |
| Denied tool calls | any burst > 5/min | yes |
| PII in outputs | any instance | yes |
| Retrieval zero-result rate | > 20% for 30 min | no (ticket) |
| Refusal rate | > 3σ | no (ticket) |
"Yes" means the on-call gets paged. Tune thresholds from baselines, not round numbers.
Stack choices as of April 2026
The three-tier stack most teams end up with:
- Trace layer. Langfuse, Arize Phoenix, Braintrust, Helicone. Captures raw calls.
- Metrics + dashboards. Grafana fed by a custom exporter from traces. Some teams use the trace tool's dashboards directly.
- Alerts. PagerDuty / Opsgenie, wired to the metrics layer. Keep alerts out of the trace tool — you want them in the pager stack SREs already know.
For deeper comparison of trace tools see agent trace visualization tools.
Runbook template
Every alert should have a runbook. For agents, the runbook shape is:
- What broke — plain English.
- How to confirm — specific dashboard link.
- What to rule out first — model vendor incident, regional outage, deploy rollout.
- Mitigations in order of cost — kill switch, rollback, rate-limit, escalate.
- Escalation paths — who owns the agent.
Most first-incident fumbling happens in step 3. LLM vendors have status pages; link them from the dashboard so nobody hunts for them mid-incident.
Common failure patterns
- "Green everywhere, broken product." Your top-level task-success metric is wrong or missing. Fix that before adding anything else.
- Alert fatigue. More than one page per engineer per week is too many. Tighten thresholds or delete rules.
- No per-tenant view. Global metrics hide per-customer outages. Partition by tenant or user segment.
- Trace storage cost. Full-fidelity traces balloon. Sample at the user-session level, not the call level.
Where this is heading
By 2027 expect Datadog, New Relic, and the cloud vendors to ship native "AI agent" monitoring tiers — most will wrap the metrics above with their existing UIs. Open-source alternatives will stay relevant because self-hosting traces with PII is often a hard requirement.