"It worked on my machine" is amplified for agents. Replay tools — capture a trace, rerun it deterministically, change one variable, see what shifts — are the closest thing the field has to a debugger. Six options, head-to-head.
## What replay actually means for agents
True determinism with LLMs is hard. Replay tools approximate it three ways:
- Stub mode — every model call returns its captured response; no model is actually called.
- Pin mode — same model + version + temperature 0 + seed; mostly deterministic.
- Live mode — call the model fresh, see how it diverges from the captured trace.
A good replay tool supports all three.
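To make the three modes concrete, here is a minimal dispatch sketch in Python. Nothing in it is any tool's API: `call_model`, the captured-trace dict, and the lookup key are all stand-ins for whatever your stack provides.

```python
import hashlib
import json

def call_key(prompt: str, params: dict) -> str:
    """Stable lookup key for a captured model call."""
    blob = json.dumps({"prompt": prompt, "params": params}, sort_keys=True)
    return hashlib.sha256(blob.encode()).hexdigest()

def replay_call(prompt: str, params: dict, captured: dict, mode: str, call_model):
    if mode == "stub":
        # Return the captured response verbatim; no model is called.
        return captured[call_key(prompt, params)]
    if mode == "pin":
        # Same model and version, temperature 0, plus a seed where the
        # provider supports one. Mostly deterministic, not perfectly so.
        return call_model(prompt, {**params, "temperature": 0})
    if mode == "live":
        # Fresh call with the original parameters; diff against the capture.
        return call_model(prompt, params)
    raise ValueError(f"unknown replay mode: {mode!r}")
```

Stub mode is the one worth wiring up first: it is the only mode that is free and exactly reproducible.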
## The six tools
### 1. LangSmith Replay
Replay any captured run in stub, pin, or live mode. Tight UI for diffing the new run against the original. Best for LangChain users.
### 2. Helicone Sessions
Less polished, but works with any Anthropic SDK code. Replay works by re-issuing the captured prompts; the proxy captures both runs and diffs them.
### 3. Langfuse Trace Replay
Open-source. Replay against the same model or a different one. Strong on cross-model comparison.
### 4. Braintrust Trial Reruns
Replay turned into "trials": rerun N times with temperature enabled and measure the variance. Best for evaluating non-deterministic behaviour.
### 5. MLflow with the LangChain integration
Reuses MLflow's experiment-tracking primitives for LLM traces. Works if you already live in MLflow.
### 6. DIY with OpenTelemetry capture
Capture all OTel spans for an agent run; a small script replays the run by re-issuing the captured prompts. Most flexible, most work.
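A sketch of what that "small script" might look like, assuming spans were exported to a JSONL file with one span per line. The field and attribute names (`gen_ai.prompt` and friends) are illustrative, not a guaranteed schema; use whatever your instrumentation actually emits.

```python
import json

def replay_from_spans(path: str, call_model):
    """Re-issue the model calls recorded in an exported span file."""
    with open(path) as f:
        spans = [json.loads(line) for line in f]
    spans.sort(key=lambda s: s["start_time"])  # restore original call order
    results = []
    for span in spans:
        attrs = span.get("attributes", {})
        if "gen_ai.prompt" not in attrs:
            continue  # tool or memory span, not a model call
        fresh = call_model(attrs["gen_ai.prompt"],
                           {"model": attrs.get("gen_ai.request.model")})
        results.append((span["name"], attrs.get("gen_ai.completion"), fresh))
    return results  # (span name, captured output, fresh output)
```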
## Comparison
| Tool | Stub | Pin | Live | Diff UI |
|---|---|---|---|---|
| LangSmith Replay | Yes | Yes | Yes | Best |
| Helicone Sessions | Yes | Limited | Yes | Yes |
| Langfuse Replay | Yes | Yes | Yes | Yes |
| Braintrust Trials | No | Yes | Yes | Best for variance |
| MLflow + LangChain | Yes | Limited | Yes | Limited |
| DIY OTel | Yes | Yes | Yes | DIY |
## When to use each mode
- Stub — debugging the orchestration layer, not the model. Fast, free, deterministic.
- Pin — debugging the model's behaviour with a controlled change. Slower, costs money, mostly reproducible.
- Live — measuring drift after a vendor update. The original trace is the baseline; the new run shows the delta.
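For live mode, the simplest drift check is a plain text diff of the fresh output against the captured baseline. A stdlib-only sketch; real diff UIs do more, but the shape is the same:

```python
import difflib

def drift_report(captured_output: str, live_output: str) -> str:
    """Unified diff of the baseline (captured) vs the fresh (live) run."""
    diff = difflib.unified_diff(
        captured_output.splitlines(),
        live_output.splitlines(),
        fromfile="captured",
        tofile="live",
        lineterm="",
    )
    return "\n".join(diff)
```

An empty report means no token-level drift; semantic drift needs an eval on top.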
## Capturing for replay
Whatever replay tool you pick, capture must include:
- Every model call: prompt, model, parameters (temperature, top_p, seed, max_tokens).
- Every tool call: name, arguments, result, latency.
- Every memory read/write.
- Timestamps and span hierarchy.
Skip any of those and replay accuracy drops. See trace visualization tools for the capture stack.
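One way to shape the capture record so nothing on that checklist gets dropped. The field names are illustrative, not any tool's schema:

```python
from dataclasses import dataclass, field

@dataclass
class ModelCall:
    prompt: str
    model: str
    params: dict            # temperature, top_p, seed, max_tokens, ...
    response: str
    span_id: str
    parent_span_id: str | None
    started_at: float       # unix timestamp
    latency_ms: float

@dataclass
class ToolCall:
    name: str
    arguments: dict
    result: str
    span_id: str
    parent_span_id: str | None
    started_at: float
    latency_ms: float

@dataclass
class MemoryOp:
    op: str                 # "read" or "write"
    key: str
    value: str
    span_id: str
    started_at: float

@dataclass
class Trace:
    trace_id: str
    events: list = field(default_factory=list)  # ModelCall | ToolCall | MemoryOp
```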
## Bisecting a regression
The classic workflow:
1. Find a failing trace from the post-regression period.
2. Replay it with the pre-regression prompt → does it pass?
3. Replay with the pre-regression model version → does it pass?
4. Replay with the pre-regression tool definitions → does it pass?
Whichever step flips the result is the cause.
Same shape as git bisect, applied to agent components.
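The loop is mechanical enough to script. A hypothetical sketch: `replay` stands in for your replay tool, `passes` for your pass/fail check, and the `old`/`new` dicts hold the pre- and post-regression versions of each component.

```python
def bisect_regression(failing_trace, old: dict, new: dict, replay, passes) -> str | None:
    """old/new map component name -> version (prompt, model, tools, ...)."""
    for component in ("prompt", "model", "tools"):
        # Revert exactly one component to its pre-regression version.
        candidate = {**new, component: old[component]}
        result = replay(failing_trace, **candidate)
        if passes(result):
            return component  # reverting this one flips the result
    return None  # not isolated to a single component; interactions at play
```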
## Common pitfalls
- Forgetting non-determinism — even `temperature=0` is not perfectly deterministic across versions.
- Stale tool stubs — stubbed tool responses get out of date as the underlying tool evolves.
- Replaying production traces with real side-effects — strip `email_send`, `payment`, etc. before replay (see the sketch below).
- Treating replay as a test — replay reproduces, eval validates. Both, not either.
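For the side-effects pitfall, the safe pattern is a denylist applied before any replay runs. A sketch, assuming tools are registered as a name-to-callable dict; the denylist contents are yours to fill in:

```python
# Tool names that touch the outside world; illustrative, not exhaustive.
SIDE_EFFECTING = {"email_send", "payment", "db_write"}

def sanitize_tools(tools: dict) -> dict:
    """Replace side-effecting tool implementations with inert stubs."""
    def make_stub(name):
        def stub(**kwargs):
            # Record what would have happened instead of doing it.
            return {"status": "stubbed", "tool": name, "args": kwargs}
        return stub
    return {
        name: make_stub(name) if name in SIDE_EFFECTING else impl
        for name, impl in tools.items()
    }
```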
## Where this is heading
Two shifts to expect: native replay primitives in the Claude Agent SDK (`agent.replay(traceId)`), and standardised trace-capture formats so replay tools become interchangeable. Until then, capture rigorously and pick the tool that fits your stack.