Traces alone are not observability. For agents in production you need traces, metrics, evals, cost attribution, and drift detection — stitched into alerts that wake up the right person. Six platforms now bundle all of it. Here is how they stack up.
## Why generic APM is not enough
Datadog and New Relic catch latency and errors. They miss the three things that matter for an agent:
- Semantic drift — the prompt changes and accuracy quietly drops 5%.
- Cost attribution — which feature burns tokens, not which endpoint.
- Eval regression — a model update breaks a specific task class.
The six platforms below were built with these in mind.
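To make cost attribution concrete, here is a minimal sketch of rolling token spend up by feature rather than by endpoint. The per-token prices, the `feature` tag, and the span dicts are all illustrative assumptions, not any platform's actual schema:

```python
from collections import defaultdict

# Hypothetical per-token prices; substitute your model's actual rates.
PRICE_PER_INPUT_TOKEN = 3.00 / 1_000_000
PRICE_PER_OUTPUT_TOKEN = 15.00 / 1_000_000

def attribute_cost(spans):
    """Aggregate token cost per feature tag rather than per endpoint."""
    costs = defaultdict(float)
    for span in spans:
        cost = (span["input_tokens"] * PRICE_PER_INPUT_TOKEN
                + span["output_tokens"] * PRICE_PER_OUTPUT_TOKEN)
        costs[span["feature"]] += cost
    return dict(costs)

# Example model-call spans, each tagged with the feature that triggered it.
spans = [
    {"feature": "summarize", "input_tokens": 12_000, "output_tokens": 800},
    {"feature": "search", "input_tokens": 2_000, "output_tokens": 150},
    {"feature": "summarize", "input_tokens": 9_000, "output_tokens": 600},
]
print(attribute_cost(spans))
```

The same aggregation works per user, per tenant, or per prompt version — whatever tag you attach to the model-call span.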
## The six contenders
### 1. Arize Phoenix
Open-source core, enterprise cloud. Strong on drift detection (their ML roots show) and trace clustering. Best for teams already running ML observability.
- Strengths: drift, clustering, eval UI.
- Weaknesses: UX assumes ML-ops vocabulary.
### 2. Datadog LLM Observability
Datadog bolted an LLM-specific view on top of its existing APM. If Datadog is already your infra, this is the lowest-friction option.
- Strengths: integrates with existing alerts, SLOs, dashboards.
- Weaknesses: evals and drift are thinner than purpose-built tools.
### 3. Langfuse
Open source and self-hostable, with a strong trace tree plus lightweight evals and metrics. Best when data residency is non-negotiable.
- Strengths: on-prem, MIT licence, fast.
- Weaknesses: heavier to run than SaaS; alerting needs external tooling.
### 4. LangSmith
Strong trace UI and eval pipeline. Tightly integrated with LangChain, but it also works with the raw Anthropic SDK.
- Strengths: best-in-class prompt versioning and evals.
- Weaknesses: metrics and alerting are thinner than APM-first tools.
### 5. Honeycomb (with OTel instrumentation)
Not purpose-built for agents, but Honeycomb’s BubbleUp for anomaly investigation is unmatched once you emit the right spans.
- Strengths: high-cardinality analytics, powerful querying.
- Weaknesses: you instrument everything yourself.
### 6. New Relic AI Monitoring
A newer entrant: an agent-aware trace view plus cost tracking, a natural fit if New Relic is already your APM.
- Strengths: tight integration with existing New Relic fleet.
- Weaknesses: fewer agent-specific primitives than Arize or LangSmith.
## Feature comparison
| Platform | Traces | Metrics | Evals | Drift | On-prem |
|---|---|---|---|---|---|
| Arize Phoenix | Yes | Yes | Yes | Best | Yes |
| Datadog LLM Obs | Yes | Best | Limited | Limited | No |
| Langfuse | Best | Yes | Yes | Limited | Yes |
| LangSmith | Best | Yes | Best | Limited | No |
| Honeycomb | Best | Best | DIY | Yes | No |
| New Relic AI | Yes | Best | Limited | Limited | No |
## Decision matrix
- Already run Datadog? Start with Datadog LLM Obs.
- Already run New Relic? Start with New Relic AI.
- Need on-prem? Langfuse or self-hosted Phoenix.
- Eval-first team? LangSmith or Phoenix.
- Strong ML-ops culture? Phoenix.
- High-cardinality investigator? Honeycomb.
## What to instrument first
If you do not know where to start, emit these five spans per agent turn:
- The root span for the user request (with feature tag).
- The model call (with model, input_tokens, output_tokens, cache stats).
- Each tool call (with name, args, latency).
- The final response (with token count).
- Any errors (with type and retryable flag).
Every platform above reads those five spans well. Once they are in place, you can swap platforms relatively cheaply.