The build works. The tests pass. In production, one in twenty runs returns nonsense and you cannot reproduce it. Welcome to multi-agent debugging — where stack traces are fiction and the bug is usually four prompts upstream.
Why traditional debugging fails
A deterministic program has one execution path for a given input. An agent has many, weighted by sampling temperature, model version, tool-call ordering, and the exact tokens another agent happened to return. The debugger you know — step through, set a breakpoint, inspect variables — assumes determinism. Multi-agent systems have none of it.
The techniques below assume you already have trace-level observability. If not, fix that first: ai agent observability platform covers the stack.
Technique 1: trace replay with pinned randomness
Capture every LLM call's input, output, and sampling parameters. To reproduce a bug, replay the trace against the same model version with temperature=0 and the captured seed. Deterministic replay is not perfect (OpenAI's seed is best-effort, and some providers expose no seed parameter at all), but it fails noisily when it fails — you see the divergence point immediately.
```python
# Pseudo-code for trace replay: pin sampling and compare each step's
# output to the captured one; the first mismatch is the divergence point.
for step in trace:
    result = llm.invoke(
        messages=step.input,
        model=step.model_id,
        temperature=0,
        seed=step.seed,
    )
    if result != step.output:
        print(f"Divergence at step {step.id}")
        break
```
Langfuse, LangSmith, and Arize Phoenix all support replaying captured traces out of the box.
Technique 2: binary search on the trace
When a 20-step run produces a bad output, the cause is usually one step. Delete (mock out) half the steps, replay, check the output. Repeat. In five replays you isolate the culprit to a single LLM call.
This works because most agent failures propagate linearly: one bad tool call poisons every subsequent step. Binary search needs only log₂(n) replays instead of n.
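The bisection loop can be sketched as follows. `replay(trace, mocked)` is a hypothetical helper, assumed to re-run the trace while substituting known-good recorded outputs for the step ids in `mocked`, returning True if the final output is acceptable:

```python
def find_culprit(trace, replay):
    """Bisect to the single step whose bad output breaks the run.

    Assumes one culprit step and linear propagation: mocking out the
    culprit fixes the run; mocking any set that excludes it does not.
    """
    lo, hi = 0, len(trace) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        # Mock the first half; if the run is now good, the bug is in that half.
        if replay(trace, mocked={s.id for s in trace[lo:mid + 1]}):
            hi = mid
        else:
            lo = mid + 1
    return trace[lo]
```

For a 20-step trace this converges in five replays, matching the count above.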
Technique 3: shadow mode with an oracle
Run the production pipeline and a "better" pipeline (bigger model, stricter prompt) side by side on the same inputs. Log every disagreement. The disagreements are a curated dataset of your worst cases — the place to focus eval effort.
Shadow mode pairs naturally with continuous agent regression testing: today's shadow divergences become tomorrow's regression tests.
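A minimal shadow-mode harness, assuming `prod_pipeline`, `oracle_pipeline`, and `log_disagreement` are placeholders for your own pipelines and trace store:

```python
def shadow_run(inputs, prod_pipeline, oracle_pipeline, log_disagreement):
    """Run both pipelines on the same inputs and collect every disagreement."""
    disagreements = []
    for item in inputs:
        prod_out = prod_pipeline(item)
        oracle_out = oracle_pipeline(item)  # bigger model / stricter prompt
        if prod_out != oracle_out:
            log_disagreement(item, prod_out, oracle_out)
            disagreements.append(item)
    return disagreements  # your curated worst-case dataset
```

In practice you would run the oracle asynchronously so it never adds latency to the production path.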
Technique 4: tool-call mutation testing
Introduce controlled faults at the tool-call boundary:
- Return empty results.
- Return the right schema with wrong content.
- Return a timeout error.
- Return 10× more results than normal.
For each mutation, check that the supervisor/agent handles it gracefully. Any mutation that produces a confident wrong answer is a latent prod bug. See hierarchical agent supervisor pattern for where recovery logic belongs.
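The four mutations above can be injected at the tool-call boundary with a thin wrapper. This is a sketch; `agent_run` and `tool` stand in for your agent entry point and the real tool function:

```python
def mutate(name, real_result):
    """Apply one controlled fault to a real tool result."""
    if name == "empty":
        return []                      # right call shape, no data
    if name == "wrong_content":
        return [{"id": "bogus"}]       # right schema, wrong content
    if name == "timeout":
        raise TimeoutError("injected tool timeout")
    if name == "flood":
        return real_result * 10        # 10x more results than normal
    return real_result

def run_with_mutation(agent_run, tool, name, query):
    """Run the agent with the named fault injected into every tool call."""
    def mutated_tool(*args, **kwargs):
        return mutate(name, tool(*args, **kwargs))
    return agent_run(query, tool=mutated_tool)
```

Run each mutation against your eval set and flag any case where the agent answers confidently instead of degrading gracefully.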
Technique 5: log-based span diffing
Use a trace visualiser that lets you diff two runs span-by-span. Open the failing run next to a passing run for the same task. The first diverging span is where the bug lives.
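Even without a visualiser, the first diverging span is easy to find programmatically. A sketch, assuming spans are dicts with "name" and "output" keys:

```python
from itertools import zip_longest

def first_divergence(failing, passing):
    """Return the index of the first span where the two runs disagree."""
    for i, (a, b) in enumerate(zip_longest(failing, passing)):
        if a is None or b is None or a["name"] != b["name"] or a["output"] != b["output"]:
            return i
    return None  # traces are identical
```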
Three visualisers that support diffing (as of April 2026):
- Langfuse — open-source, self-hostable.
- Arize Phoenix — OSS with hosted option.
- Braintrust — commercial, strongest diff UX.
More on the space in agent trace visualization tools.
Technique 6: the prompt-archaeology trick
Sometimes the bug is not in the current prompt — it is in the summarisation that fed into it three turns ago. Inspect the actual tokens the failing model saw. Copy them verbatim into a fresh conversation with the same model. If the bug reproduces, you have isolated it to prompt content; if not, the bug is in the runtime (tool-call handling, state persistence).
This sounds obvious until you realise most agent frameworks hide the final rendered prompt behind abstractions. Dump the raw payload.
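Dumping and replaying the raw payload takes a few lines. A sketch, assuming an OpenAI-style chat client and a `captured` dict pulled from your trace store:

```python
import json

def reproduce(client, captured):
    """Replay the exact rendered prompt in a fresh conversation."""
    # Log the raw messages the failing model actually saw.
    print(json.dumps(captured["messages"], indent=2))
    # Re-send them verbatim, same model, greedy sampling.
    resp = client.chat.completions.create(
        model=captured["model"],
        messages=captured["messages"],
        temperature=0,
    )
    return resp.choices[0].message.content
```

If the bad output reappears here, the bug is in the prompt content; if not, suspect the runtime.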
A debugging checklist for production incidents
When a multi-agent run fails in production, work through this in order:
- Does it reproduce? If no, capture traces and wait. Non-reproducing incidents are data-collection problems first.
- Is the model version stable? Vendor updates silently break things. Pin model IDs in all calls.
- Is it the supervisor or a worker? Inspect the span tree; the first span with bad output is the origin.
- Is the bad output caused by a bad input or a bad prompt? Run the worker alone with the captured input. If it fails in isolation, fix the prompt. If not, fix the caller.
- Is retry making it worse? Some retries inherit corrupted state. Make sure the retry starts from the pre-failure checkpoint.
- Is token usage abnormal? Unexpected cost spikes correlate with runaway loops — see reducing agent api costs.
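The retry point in the checklist is the one most often got wrong, so here is a sketch of retrying from the pre-failure checkpoint rather than the corrupted live state. `save` and `load` are placeholder checkpoint helpers:

```python
def run_with_checkpoint_retry(steps, state, save, load, max_retries=2):
    """Run steps in order; on failure, restore state from the checkpoint
    taken before the failing step instead of retrying with live state."""
    for step in steps:
        save(step.id, state)              # checkpoint before each step
        for attempt in range(max_retries + 1):
            try:
                state = step.run(state)
                break
            except Exception:
                state = load(step.id)     # discard corrupted state
                if attempt == max_retries:
                    raise
    return state
```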
Tools worth installing today
- Langfuse — traces, replay, eval. Open source.
- Pytest-recording / cassette libraries — check trace snapshots into git.
- LiteLLM proxy — one place to intercept, log, and mutate every LLM call.
- Custom middleware for your agent framework — because production debugging always needs a hook your framework does not expose.
What changes as systems grow
Single-agent debugging is about the prompt. Multi-agent debugging is about the edges between agents — the handoff format, the error semantics, the retry policy. Budget accordingly: in a mature multi-agent system, debugging tools and replay infrastructure end up costing as much as the agents themselves. It is worth it.