"It worked on my machine" is amplified for agents. Replay tools — capture a trace, rerun it deterministically, change one variable, see what shifts — are the closest thing the field has to a debugger. Six options, head-to-head.
## What replay actually means for agents
True determinism with LLMs is hard. Replay tools approximate it three ways:
- Stub mode — every model call returns its captured response; no model is actually called.
- Pin mode — same model + version + temperature 0 + seed; mostly deterministic.
- Live mode — call the model fresh, see how it diverges from the captured trace.
A good replay tool supports all three.
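To make the three modes concrete, here is a minimal dispatch sketch in Python. Nothing in it is any tool's API: `call_model`, the captured-trace dict, and the lookup key are all stand-ins for whatever your stack provides.

```python
import hashlib
import json

def call_key(prompt: str, params: dict) -> str:
    """Stable lookup key for a captured model call."""
    blob = json.dumps({"prompt": prompt, "params": params}, sort_keys=True)
    return hashlib.sha256(blob.encode()).hexdigest()

def replay_call(prompt: str, params: dict, captured: dict, mode: str, call_model):
    if mode == "stub":
        # Return the captured response verbatim; no model is called.
        return captured[call_key(prompt, params)]
    if mode == "pin":
        # Same model and version, temperature 0, plus a seed where the
        # provider supports one. Mostly deterministic, not perfectly so.
        return call_model(prompt, {**params, "temperature": 0})
    if mode == "live":
        # Fresh call with the original parameters; diff against the capture.
        return call_model(prompt, params)
    raise ValueError(f"unknown replay mode: {mode!r}")
```

Stub mode is the one worth wiring up first: it is the only mode that is free and exactly reproducible.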
## The six tools
### 1. LangSmith Replay
Replay any captured run in stub, pin, or live mode. Tight UI for diffing the new run against the original. Best for LangChain users.
### 2. Helicone Sessions
Less polished, but works with any Anthropic SDK code. Replay works by re-issuing the captured prompts; the proxy captures both runs and diffs them.
### 3. Langfuse Trace Replay
Open-source. Replay against the same model or a different one. Strong on cross-model comparison.
### 4. Braintrust Trial Reruns
Replay turned into "trials": rerun N times with temperature enabled and measure the variance. Best for evaluating non-deterministic behaviour.
### 5. MLflow with the LangChain integration
Reuses MLflow's experiment-tracking primitives for LLM traces. Works if you already live in MLflow.
### 6. DIY with OpenTelemetry capture
Capture all OTel spans for an agent run; a small script replays the run by re-issuing the captured prompts. Most flexible, most work.
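A sketch of what that "small script" might look like, assuming spans were exported to a JSONL file with one span per line. The field and attribute names (`gen_ai.prompt` and friends) are illustrative, not a guaranteed schema; use whatever your instrumentation actually emits.

```python
import json

def replay_from_spans(path: str, call_model):
    """Re-issue the model calls recorded in an exported span file."""
    with open(path) as f:
        spans = [json.loads(line) for line in f]
    spans.sort(key=lambda s: s["start_time"])  # restore original call order
    results = []
    for span in spans:
        attrs = span.get("attributes", {})
        if "gen_ai.prompt" not in attrs:
            continue  # tool or memory span, not a model call
        fresh = call_model(attrs["gen_ai.prompt"],
                           {"model": attrs.get("gen_ai.request.model")})
        results.append((span["name"], attrs.get("gen_ai.completion"), fresh))
    return results  # (span name, captured output, fresh output)
```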
## Comparison
| Tool | Stub | Pin | Live | Diff UI |
|---|---|---|---|---|
| LangSmith Replay | Yes | Yes | Yes | Best |
| Helicone Sessions | Yes | Limited | Yes | Yes |
| Langfuse Replay | Yes | Yes | Yes | Yes |
| Braintrust Trials | No | Yes | Yes | Best for variance |
| MLflow + LangChain | Yes | Limited | Yes | Limited |
| DIY OTel | Yes | Yes | Yes | DIY |
## When to use each mode
- Stub — debugging the orchestration layer, not the model. Fast, free, deterministic.
- Pin — debugging the model's behaviour with a controlled change. Slower, costs money, mostly reproducible.
- Live — measuring drift after a vendor update. The original trace is the baseline; the new run shows the delta.
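For live mode, the simplest drift check is a plain text diff of the fresh output against the captured baseline. A stdlib-only sketch; real diff UIs do more, but the shape is the same:

```python
import difflib

def drift_report(captured_output: str, live_output: str) -> str:
    """Unified diff of the baseline (captured) vs the fresh (live) run."""
    diff = difflib.unified_diff(
        captured_output.splitlines(),
        live_output.splitlines(),
        fromfile="captured",
        tofile="live",
        lineterm="",
    )
    return "\n".join(diff)
```

An empty report means no token-level drift; semantic drift needs an eval on top.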
## Capturing for replay
Whatever replay tool you pick, capture must include:
- Every model call: prompt, model, parameters (temperature, top_p, seed, max_tokens).
- Every tool call: name, arguments, result, latency.
- Every memory read/write.
- Timestamps and span hierarchy.
Skip any of those and replay accuracy drops. See trace visualization tools for the capture stack.
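One way to shape the capture record so nothing on that checklist gets dropped. The field names are illustrative, not any tool's schema:

```python
from dataclasses import dataclass, field

@dataclass
class ModelCall:
    prompt: str
    model: str
    params: dict            # temperature, top_p, seed, max_tokens, ...
    response: str
    span_id: str
    parent_span_id: str | None
    started_at: float       # unix timestamp
    latency_ms: float

@dataclass
class ToolCall:
    name: str
    arguments: dict
    result: str
    span_id: str
    parent_span_id: str | None
    started_at: float
    latency_ms: float

@dataclass
class MemoryOp:
    op: str                 # "read" or "write"
    key: str
    value: str
    span_id: str
    started_at: float

@dataclass
class Trace:
    trace_id: str
    events: list = field(default_factory=list)  # ModelCall | ToolCall | MemoryOp
```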
## Bisecting a regression
The classic workflow:
1. Find a failing trace from the post-regression period.
2. Replay it with the pre-regression prompt → does it pass?
3. Replay with the pre-regression model version → does it pass?
4. Replay with the pre-regression tool definitions → does it pass?
Whichever step flips the result is the cause.
Same shape as git bisect, applied to agent components.
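The loop is mechanical enough to script. A hypothetical sketch: `replay` stands in for your replay tool, `passes` for your pass/fail check, and the `old`/`new` dicts hold the pre- and post-regression versions of each component.

```python
def bisect_regression(failing_trace, old: dict, new: dict, replay, passes) -> str | None:
    """old/new map component name -> version (prompt, model, tools, ...)."""
    for component in ("prompt", "model", "tools"):
        # Revert exactly one component to its pre-regression version.
        candidate = {**new, component: old[component]}
        result = replay(failing_trace, **candidate)
        if passes(result):
            return component  # reverting this one flips the result
    return None  # not isolated to a single component; interactions at play
```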
## Common pitfalls
- Forgetting non-determinism — even `temperature=0` is not perfectly deterministic across versions.
- Stale tool stubs — stubbed tool responses get out of date as the underlying tool evolves.
- Replaying production traces with real side-effects — strip `email_send`, `payment`, etc. before replay (see the sketch below).
- Treating replay as a test — replay reproduces, eval validates. Both, not either.
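For the side-effects pitfall, the safe pattern is a denylist applied before any replay runs. A sketch, assuming tools are registered as a name-to-callable dict; the denylist contents are yours to fill in:

```python
# Tool names that touch the outside world; illustrative, not exhaustive.
SIDE_EFFECTING = {"email_send", "payment", "db_write"}

def sanitize_tools(tools: dict) -> dict:
    """Replace side-effecting tool implementations with inert stubs."""
    def make_stub(name):
        def stub(**kwargs):
            # Record what would have happened instead of doing it.
            return {"status": "stubbed", "tool": name, "args": kwargs}
        return stub
    return {
        name: make_stub(name) if name in SIDE_EFFECTING else impl
        for name, impl in tools.items()
    }
```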
## Where this is heading
Two shifts to expect: native replay primitives in the Claude Agent SDK (`agent.replay(traceId)`), and standardised trace-capture formats so replay tools become interchangeable. Until then, capture rigorously and pick the tool that fits your stack.