Reading a tool-call trace as raw JSON is a special form of pain. The visualizer market exploded in 2026 — here are seven tools we have actually shipped with, what they do well, and where they break down.
Why trace visualization matters
An agent that takes 14 steps to answer a question generates 14 prompts, 14 tool calls, 14 results, and a final response. When something goes wrong, "ctrl-F" through 100KB of JSON is not a debugging strategy. A good trace UI shows the call graph, latency on each edge, token cost per step, and lets you click into prompts and outputs.
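What the UI is doing is simple to approximate: group steps under their parents and annotate each node with latency and tokens. A minimal dependency-free sketch (the trace shape here is hypothetical, not any vendor's schema):

```python
# Minimal trace renderer: turns a flat list of agent steps into an
# indented tree with latency and token counts per node.
# Trace shape is hypothetical, not any vendor's schema.
trace = [
    {"id": 1, "parent": None, "name": "agent",       "ms": 4210, "in_tok": 900, "out_tok": 350},
    {"id": 2, "parent": 1,    "name": "search_tool", "ms": 830,  "in_tok": 120, "out_tok": 40},
    {"id": 3, "parent": 1,    "name": "summarize",   "ms": 1900, "in_tok": 640, "out_tok": 210},
]

def render(steps, parent=None, depth=0):
    lines = []
    for s in steps:
        if s["parent"] == parent:
            lines.append(f'{"  " * depth}{s["name"]}  {s["ms"]}ms  '
                         f'{s["in_tok"]}in/{s["out_tok"]}out tok')
            lines += render(steps, s["id"], depth + 1)
    return lines

print("\n".join(render(trace)))
```

Every tool below is, at heart, this loop plus search, storage, and a clickable UI.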
The 7 tools, ranked
1. LangSmith
The market leader. Clean tree view, generous free tier, plugs into LangChain/LangGraph in one line and into the raw Anthropic SDK with a wrapper. Best-in-class for cost-per-step breakdowns.
- Strengths: mature UI, eval pipelines, prompt versioning.
- Weaknesses: opinionated about LangChain ergonomics.
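Onboarding is mostly environment variables. A hedged sketch (variable names as documented by LangSmith at time of writing; verify against current docs):

```python
import os

# Enable LangSmith tracing via environment variables.
# Names per LangSmith docs at time of writing; verify before relying on them.
os.environ["LANGSMITH_TRACING"] = "true"
os.environ["LANGSMITH_API_KEY"] = "<your-key>"

# For the raw Anthropic SDK, wrap the client (requires `pip install langsmith`):
#   from langsmith.wrappers import wrap_anthropic
#   client = wrap_anthropic(Anthropic())
```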
2. Helicone
Proxy-based: you point your Anthropic baseURL at Helicone, with no SDK changes. Because every request flows through the proxy, it catches everything by definition. Strong on cost dashboards.
- Strengths: zero-friction onboarding, generous self-hosting option.
- Weaknesses: tree view less polished than LangSmith.
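The proxy setup is a two-line client change. A hedged sketch (proxy URL and header name as documented by Helicone; verify before use):

```python
# Route Anthropic traffic through Helicone's proxy by swapping the base URL.
# URL and header name per Helicone's docs; verify before use.
HELICONE_API_KEY = "<your-key>"  # placeholder

client_config = {
    "base_url": "https://anthropic.helicone.ai",
    "default_headers": {"Helicone-Auth": f"Bearer {HELICONE_API_KEY}"},
}
# Then construct your client unchanged otherwise:
#   client = anthropic.Anthropic(**client_config)
```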
3. Langfuse
Open-source, self-hostable. Strong UI, can run on your own Postgres + ClickHouse. Best choice when data residency matters.
- Strengths: on-prem deployment, MIT license, fast tree rendering.
- Weaknesses: setup heavier than SaaS competitors.
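Once the server is up (the langfuse/langfuse repo ships a docker compose file), the SDK is pointed at your own host. A hedged sketch (env variable names per the Langfuse docs; hostname is a placeholder):

```python
import os

# Point the Langfuse SDK at a self-hosted instance instead of the cloud.
# Variable names per Langfuse docs; the host below is a placeholder.
os.environ["LANGFUSE_HOST"] = "https://langfuse.internal.example.com"
os.environ["LANGFUSE_PUBLIC_KEY"] = "<public-key>"
os.environ["LANGFUSE_SECRET_KEY"] = "<secret-key>"

# Self-hosting the server itself is `docker compose up` from the
# langfuse/langfuse repository; pin a release tag in production.
```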
4. Phoenix (Arize)
Born for traditional ML, retrofitted for LLM agents. Strong eval framework, clusters traces to find common failure modes.
- Strengths: trace clustering, drift detection.
- Weaknesses: learning curve coming from the agent side.
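Trace clustering is worth understanding even crudely. A dependency-free toy version (not Phoenix's API, which clusters on embeddings) that groups failed traces by error signature:

```python
from collections import defaultdict

# Group failed traces by a crude error signature -- a toy stand-in for
# Phoenix-style trace clustering. Trace data is hypothetical.
traces = [
    {"id": "t1", "error": "ToolTimeout: search_tool exceeded 30s"},
    {"id": "t2", "error": "ToolTimeout: fetch_tool exceeded 30s"},
    {"id": "t3", "error": "SchemaError: missing field 'query'"},
]

def signature(error: str) -> str:
    # First token of the message, i.e. the exception class.
    return error.split(":", 1)[0]

clusters = defaultdict(list)
for t in traces:
    clusters[signature(t["error"])].append(t["id"])

print(dict(clusters))  # {'ToolTimeout': ['t1', 't2'], 'SchemaError': ['t3']}
```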
5. Braintrust
Eval-first product that happens to ship a great trace viewer. Best UX for "this trace is wrong, let me turn it into a regression test".
- Strengths: trace -> eval pipeline is unmatched.
- Weaknesses: pricing aimed at teams, not solo devs.
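The trace-to-eval move is the core loop regardless of vendor, so it is worth sketching. A dependency-free version (not Braintrust's API; trace shape and file name are hypothetical) that freezes a bad trace into a regression case:

```python
import json

# Freeze a captured bad trace into a regression-test fixture: keep the
# input plus the output we *wanted*, so future prompt or model changes
# get diffed against it. Shapes and file name are hypothetical.
bad_trace = {
    "input": "What is our refund policy?",
    "output": "I don't know.",  # the wrong answer we observed
}
fixture = {
    "input": bad_trace["input"],
    "expected": "Cite the 30-day refund policy page.",  # corrected target
}
with open("regression_cases.jsonl", "a") as f:
    f.write(json.dumps(fixture) + "\n")
```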
6. Weights & Biases Weave
Part of the W&B stack. If you already use W&B for ML, this slots in naturally. Otherwise, it is overkill.
- Strengths: tight integration with W&B model registry.
- Weaknesses: heavy if you do not already live in W&B.
7. OpenLLMetry + your own Grafana
Pure OpenTelemetry. Emit spans from your agent and ingest them into any OTel backend (Tempo, Jaeger, Datadog APM). Maximum flexibility, but you build the UI yourself.
- Strengths: stack-agnostic, no vendor lock-in.
- Weaknesses: you build everything past raw spans.
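"Emit spans" concretely means one span per agent step, with timestamps, a parent id, and attributes. A dependency-free sketch of the shape (attribute names follow the still-evolving OTel GenAI semantic conventions; verify current names before shipping):

```python
import time
import uuid

# One OTel-style span per agent step. Attribute names follow the
# (still-evolving) OTel GenAI semantic conventions; verify current names.
def make_span(name, parent_id, trace_id, in_tokens, out_tokens, work):
    start = time.time_ns()
    result = work()  # run the actual step and time it
    span = {
        "trace_id": trace_id,
        "span_id": uuid.uuid4().hex[:16],
        "parent_span_id": parent_id,
        "name": name,
        "start_time_unix_nano": start,
        "end_time_unix_nano": time.time_ns(),
        "attributes": {
            "gen_ai.usage.input_tokens": in_tokens,
            "gen_ai.usage.output_tokens": out_tokens,
        },
    }
    return span, result

trace_id = uuid.uuid4().hex
span, answer = make_span("summarize", None, trace_id, 640, 210, lambda: "ok")
```

In a real setup you would use the `opentelemetry-sdk` package instead of hand-rolled dicts; the point is that the data model is this small.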
Pick by your situation
| If you... | Pick |
|---|---|
| Want SaaS, fast onboarding | LangSmith or Helicone |
| Need on-prem / data residency | Langfuse |
| Care most about evals | Braintrust |
| Already have an APM | OpenLLMetry + your APM |
| Already use W&B | Weave |
The features that matter
When evaluating, check these five:
- Tree vs flat view — multi-step agents need the tree.
- Cost-per-span — surface input/output tokens at every node.
- Searchable prompts — find every trace where the system prompt changed.
- MCP tool-call rendering — most tools still treat MCP as opaque function calls; the good ones show schema and argument diffs.
- Replay — re-run a captured trace against a new prompt or model and diff the output.
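Cost-per-span is the easiest of the five to sanity-check yourself against whatever a tool reports. A minimal sketch with hypothetical per-million-token prices (substitute your model's real rates):

```python
# Dollar cost per span from token counts. Prices are hypothetical
# placeholders per million tokens; substitute your model's real rates.
PRICE_IN_PER_M = 3.00
PRICE_OUT_PER_M = 15.00

def span_cost(in_tokens: int, out_tokens: int) -> float:
    return (in_tokens * PRICE_IN_PER_M + out_tokens * PRICE_OUT_PER_M) / 1_000_000

# (input_tokens, output_tokens) per span in one trace
spans = [(900, 350), (120, 40), (640, 210)]
total = sum(span_cost(i, o) for i, o in spans)
print(f"${total:.4f}")
```

If a tool's per-trace total disagrees with this arithmetic, it is usually double-counting cached or retried spans.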
What is coming next
Three trends to watch:
- Tighter MCP-aware visualization (resource subscriptions, tool versions).
- Cross-agent traces for multi-agent workflows.
- Native trace viewers inside agent IDEs (Cursor and VS Code Copilot are both working on it).