An agent that "mostly works" hides a lot of failure modes. Eight specific error rates surface the problems before users do. Here is the metric set, the SLO targets, and the dashboard layout we landed on after two years of running production agents.
Why generic dashboards miss
A 5xx rate of 0% does not mean the agent is healthy. It means it returned something. The eight metrics below catch failure modes a classical APM cannot see.
The 8 metrics
1. Tool-call failure rate
Percentage of tool invocations that errored. Catches: broken integrations, expired credentials, schema drift on the upstream API. Target: < 1% for steady-state.
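A minimal sketch of the computation, assuming each tool invocation lands in the trace store as a span record with "kind" and "status" fields (the schema is illustrative, not any particular vendor's):

    def tool_call_failure_rate(spans: list[dict]) -> float:
        """Fraction of tool-invocation spans that ended in an error."""
        calls = [s for s in spans if s["kind"] == "tool_call"]
        errors = sum(1 for s in calls if s["status"] == "error")
        return errors / len(calls) if calls else 0.0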
2. Tool-call retry rate
Percentage of tool invocations the agent retried after a failed first attempt. Catches: rate-limited APIs, flaky networks, transient schema mismatches. Target: < 3%.
3. Empty-tool-result rate
Percentage of tool calls that returned an empty result when the agent expected content. Catches: silent data loss, search-index misses, broken pagination. Target: < 5% for searches; < 0.1% for direct lookups.
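Because the target differs for searches and direct lookups, compute the rate per tool class. A sketch, assuming each span carries a "tool_class" tag (an illustrative field, not a standard one):

    from collections import defaultdict

    def empty_result_rates(spans: list[dict]) -> dict[str, float]:
        """Empty-result rate per tool class ('search', 'lookup', ...)."""
        totals: dict[str, int] = defaultdict(int)
        empties: dict[str, int] = defaultdict(int)
        for s in spans:
            if s["kind"] != "tool_call" or s["status"] != "ok":
                continue  # failed calls belong to the failure rate, not here
            totals[s["tool_class"]] += 1
            if not s.get("result"):  # None, empty string, or empty list
                empties[s["tool_class"]] += 1
        return {cls: empties[cls] / totals[cls] for cls in totals}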
4. Step-budget exhaustion rate
Percentage of agent runs that hit the maximum step count without finishing. Catches: planner regressions, tool loops, prompt drift. Target: < 0.5%.
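This one is a run-level rather than a span-level rate. A sketch, assuming each run record carries a step count and a finished flag (the field names and the cap of 25 are illustrative):

    def step_exhaustion_rate(runs: list[dict], max_steps: int = 25) -> float:
        """Fraction of runs that hit the step cap without finishing."""
        exhausted = sum(1 for r in runs
                        if r["steps"] >= max_steps and not r["finished"])
        return exhausted / len(runs) if runs else 0.0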
5. Hallucination-flag rate
Percentage of responses flagged by the hallucination detector. Catches: model regression after vendor update, prompt drift. Target: < 2%.
6. Citation-mismatch rate
For citing agents, percentage of responses where at least one citation does not support the claim. Catches: extractor failures, source pinning bugs. Target: < 1%.
7. User-thumbs-down rate
Self-reported user dissatisfaction. Lagging, but it is ground truth. Target: depends on your baseline; alert when the rate rises more than 50% above the trailing 7-day average.
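Because the target is relative, the alert compares today against a trailing baseline. A sketch, assuming a list of daily thumbs-down rates with today last:

    def thumbs_down_alert(daily_rates: list[float], window: int = 7) -> bool:
        """Alert when today's rate is more than 50% above the
        trailing 7-day average."""
        if len(daily_rates) < window + 1:
            return False  # not enough history to form a baseline
        baseline = sum(daily_rates[-window - 1:-1]) / window
        return baseline > 0 and daily_rates[-1] > 1.5 * baseline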
8. Cost-per-task anomaly rate
Percentage of runs whose cost exceeded twice the rolling p95. Catches: runaway loops, prompt regressions, expensive routing decisions. Target: < 1%.
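A sketch of the anomaly check, assuming per-run costs arrive in chronological order; the 500-run window is an illustrative choice, not a recommendation:

    import statistics

    def cost_anomaly_indices(costs: list[float], window: int = 500) -> list[int]:
        """Indices of runs whose cost exceeded 2x the rolling p95
        of the preceding window of runs."""
        flagged = []
        for i in range(window, len(costs)):
            cuts = statistics.quantiles(costs[i - window:i], n=20)
            p95 = cuts[-1]  # last of 19 cut points, i.e. the 95th percentile
            if costs[i] > 2 * p95:
                flagged.append(i)
        return flagged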
Dashboard layout
Place metrics by where they are most actionable:
┌─────────────────────────────────────────────────────────────┐
│ Top row: 8 single-number tiles (current + 7d delta) │
├─────────────────────────────────────────────────────────────┤
│ Mid row: time-series for the 4 most volatile metrics │
├──────────────────┬──────────────────────────────────────────┤
│ Top errors │ Top users by failure count │
│ (grouped) │ │
├──────────────────┼──────────────────────────────────────────┤
│ Top expensive │ Recent flagged responses (sample) │
│ sessions │ │
└──────────────────┴──────────────────────────────────────────┘
The single-number tiles let on-call see status in 2 seconds. The bottom panels are for the investigation that follows.
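Each tile needs exactly two numbers. A sketch of the payload behind one tile, assuming daily metric values with today last (the shape is illustrative, not any dashboard tool's schema):

    def tile_payload(daily_values: list[float]) -> dict[str, float]:
        """Single-number tile: current value plus the 7-day delta.
        Assumes at least 8 days of history."""
        current = daily_values[-1]
        week_ago = daily_values[-8]
        return {"current": current, "delta_7d": current - week_ago}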
Alerting rules
Three rules pay for themselves in the first month:
- Burn-rate alert on each metric: page when the 4-hour burn rate exceeds 10x the SLO budget (sketched below).
- Sudden change: alert when any metric moves by more than 50% in an hour.
- Per-user blow-up: page when a single user accounts for more than 20% of failures.
Avoid threshold alerts on raw counts; alert on rates and rates-of-change.
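A sketch of the burn-rate rule from the first bullet, assuming failure and total counts over the last four hours. The intuition: at a sustained 10x burn rate, a 28-day error budget is gone in under three days.

    def should_page(failures_4h: int, total_4h: int,
                    slo: float = 0.99, threshold: float = 10.0) -> bool:
        """Page when the 4-hour burn rate exceeds the threshold.
        Burn rate = observed error rate / allowed error rate (1 - SLO)."""
        if total_4h == 0:
            return False
        burn_rate = (failures_4h / total_4h) / (1.0 - slo)
        return burn_rate > threshold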
SLO design
Pick a window (28 days is standard) and an error budget per metric. Spend the budget intentionally: a planned migration that consumes 30% of the monthly budget is fine if it is booked in advance (the arithmetic is sketched at the end of this section).
- Tool-call failure SLO: 99% of tool calls succeed over 28 days.
- Step-budget SLO: 99.5% of runs complete within budget over 28 days.
- Citation-match SLO: 99% of citations support their claims over 28 days.
Any month you blow the budget on a metric, the next month freezes feature work on the agent until it stabilises. Cultural lever, not just a number.
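The budget arithmetic behind these SLOs is simple enough to keep in a helper. A sketch, with illustrative numbers matching the 30% migration example above:

    def budget_consumed(failures: int, total: int, slo: float = 0.99) -> float:
        """Fraction of the window's error budget spent so far."""
        allowed = (1.0 - slo) * total  # e.g. 1% of all tool calls
        return failures / allowed if allowed else 0.0

    # 1,000,000 tool calls in the window at a 99% SLO allow 10,000 failures;
    # a migration that causes 3,000 of them consumes 30% of the budget.
    print(budget_consumed(3_000, 1_000_000))  # ~0.3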
Sources for the metrics
Most of these metrics come from your trace store; the observability platforms cover the ingestion. The audit log feeds the cost-per-task and hallucination metrics (see audit trails).
Common mistakes
- One mega-dashboard — split per agent, per environment, per tier.
- No baselines — every metric needs a "normal" for the alert to mean anything.
- No drill-down — every tile must click into a query that explains it.
- No on-call rotation — dashboards without owners drift instantly.
Where this is heading
Expect templated agent dashboards to ship out of the box from Anthropic and from observability vendors, with the eight metrics above as the default set. Build your own now; swap to the managed version later.