The build works. The tests pass. In production, one in twenty runs returns nonsense and you cannot reproduce it. Welcome to multi-agent debugging — where stack traces are fiction and the bug is usually four prompts upstream.
Why traditional debugging fails
A deterministic program has one execution path for a given input. An agent has many, weighted by sampling temperature, model version, tool-call ordering, and the exact tokens another agent happened to return. The debugger you know — step through, set a breakpoint, inspect variables — assumes determinism. Multi-agent systems have none of it.
The techniques below assume you already have trace-level observability. If not, fix that first: ai agent observability platform covers the stack.
Technique 1: trace replay with pinned randomness
Capture every LLM call's input, output, and sampling parameters. To reproduce a bug, replay the trace against the same model version with temperature=0 and the captured seed. Deterministic replay is not perfect (OpenAI's seed is best-effort, and some providers expose no seed parameter at all), but it fails noisily when it fails — you see the divergence point immediately.
```python
# Pseudo-code for trace replay: pin sampling and compare each step's
# output to the captured one; the first mismatch is the divergence point.
for step in trace:
    result = llm.invoke(
        messages=step.input,
        model=step.model_id,
        temperature=0,
        seed=step.seed,
    )
    if result != step.output:
        print(f"Divergence at step {step.id}")
        break
```
Langfuse, LangSmith, and Arize Phoenix all support replaying captured traces out of the box.
Technique 2: binary search on the trace
When a 20-step run produces a bad output, the cause is usually one step. Delete (mock out) half the steps, replay, check the output. Repeat. In five replays you isolate the culprit to a single LLM call.
This works because most agent failures propagate linearly: one bad tool call poisons every subsequent step. Binary search needs only log₂(n) replays instead of n.
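The bisection loop can be sketched as follows. `replay(trace, mocked)` is a hypothetical helper, assumed to re-run the trace while substituting known-good recorded outputs for the step ids in `mocked`, returning True if the final output is acceptable:

```python
def find_culprit(trace, replay):
    """Bisect to the single step whose bad output breaks the run.

    Assumes one culprit step and linear propagation: mocking out the
    culprit fixes the run; mocking any set that excludes it does not.
    """
    lo, hi = 0, len(trace) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        # Mock the first half; if the run is now good, the bug is in that half.
        if replay(trace, mocked={s.id for s in trace[lo:mid + 1]}):
            hi = mid
        else:
            lo = mid + 1
    return trace[lo]
```

For a 20-step trace this converges in five replays, matching the count above.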
Technique 3: shadow mode with an oracle
Run the production pipeline and a "better" pipeline (bigger model, stricter prompt) side by side on the same inputs. Log every disagreement. The disagreements are a curated dataset of your worst cases — the place to focus eval effort.
Shadow mode pairs naturally with continuous agent regression testing: today's shadow divergences become tomorrow's regression tests.
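A minimal shadow-mode harness, assuming `prod_pipeline`, `oracle_pipeline`, and `log_disagreement` are placeholders for your own pipelines and trace store:

```python
def shadow_run(inputs, prod_pipeline, oracle_pipeline, log_disagreement):
    """Run both pipelines on the same inputs and collect every disagreement."""
    disagreements = []
    for item in inputs:
        prod_out = prod_pipeline(item)
        oracle_out = oracle_pipeline(item)  # bigger model / stricter prompt
        if prod_out != oracle_out:
            log_disagreement(item, prod_out, oracle_out)
            disagreements.append(item)
    return disagreements  # your curated worst-case dataset
```

In practice you would run the oracle asynchronously so it never adds latency to the production path.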
Technique 4: tool-call mutation testing
Introduce controlled faults at the tool-call boundary:
- Return empty results.
- Return the right schema with wrong content.
- Return a timeout error.
- Return 10× more results than normal.
For each mutation, check that the supervisor/agent handles it gracefully. Any mutation that produces a confident wrong answer is a latent prod bug. See hierarchical agent supervisor pattern for where recovery logic belongs.
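The four mutations above can be injected at the tool-call boundary with a thin wrapper. This is a sketch; `agent_run` and `tool` stand in for your agent entry point and the real tool function:

```python
def mutate(name, real_result):
    """Apply one controlled fault to a real tool result."""
    if name == "empty":
        return []                      # right call shape, no data
    if name == "wrong_content":
        return [{"id": "bogus"}]       # right schema, wrong content
    if name == "timeout":
        raise TimeoutError("injected tool timeout")
    if name == "flood":
        return real_result * 10        # 10x more results than normal
    return real_result

def run_with_mutation(agent_run, tool, name, query):
    """Run the agent with the named fault injected into every tool call."""
    def mutated_tool(*args, **kwargs):
        return mutate(name, tool(*args, **kwargs))
    return agent_run(query, tool=mutated_tool)
```

Run each mutation against your eval set and flag any case where the agent answers confidently instead of degrading gracefully.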
Technique 5: log-based span diffing
Use a trace visualiser that lets you diff two runs span-by-span. Open the failing run next to a passing run for the same task. The first diverging span is where the bug lives.
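Even without a visualiser, the first diverging span is easy to find programmatically. A sketch, assuming spans are dicts with "name" and "output" keys:

```python
from itertools import zip_longest

def first_divergence(failing, passing):
    """Return the index of the first span where the two runs disagree."""
    for i, (a, b) in enumerate(zip_longest(failing, passing)):
        if a is None or b is None or a["name"] != b["name"] or a["output"] != b["output"]:
            return i
    return None  # traces are identical
```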
Three visualisers that support diffing (as of April 2026):
- Langfuse — open-source, self-hostable.
- Arize Phoenix — OSS with hosted option.
- Braintrust — commercial, strongest diff UX.
More on the space in agent trace visualization tools.
Technique 6: the prompt-archaeology trick
Sometimes the bug is not in the current prompt — it is in the summarisation that fed into it three turns ago. Inspect the actual tokens the failing model saw. Copy them verbatim into a fresh conversation with the same model. If the bug reproduces, you have isolated it to prompt content; if not, the bug is in the runtime (tool-call handling, state persistence).
This sounds obvious until you realise most agent frameworks hide the final rendered prompt behind abstractions. Dump the raw payload.
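Dumping and replaying the raw payload takes a few lines. A sketch, assuming an OpenAI-style chat client and a `captured` dict pulled from your trace store:

```python
import json

def reproduce(client, captured):
    """Replay the exact rendered prompt in a fresh conversation."""
    # Log the raw messages the failing model actually saw.
    print(json.dumps(captured["messages"], indent=2))
    # Re-send them verbatim, same model, greedy sampling.
    resp = client.chat.completions.create(
        model=captured["model"],
        messages=captured["messages"],
        temperature=0,
    )
    return resp.choices[0].message.content
```

If the bad output reappears here, the bug is in the prompt content; if not, suspect the runtime.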
A debugging checklist for production incidents
When a multi-agent run fails in production, work through this in order:
- Does it reproduce? If no, capture traces and wait. Non-reproducing incidents are data-collection problems first.
- Is the model version stable? Vendor updates silently break things. Pin model IDs in all calls.
- Is it the supervisor or a worker? Inspect the span tree; the first span with bad output is the origin.
- Is the bad output caused by a bad input or a bad prompt? Run the worker alone with the captured input. If it fails in isolation, fix the prompt. If not, fix the caller.
- Is retry making it worse? Some retries inherit corrupted state. Make sure the retry starts from the pre-failure checkpoint.
- Is token usage abnormal? Unexpected cost spikes correlate with runaway loops — see reducing agent api costs.
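The retry point in the checklist is the one most often got wrong, so here is a sketch of retrying from the pre-failure checkpoint rather than the corrupted live state. `save` and `load` are placeholder checkpoint helpers:

```python
def run_with_checkpoint_retry(steps, state, save, load, max_retries=2):
    """Run steps in order; on failure, restore state from the checkpoint
    taken before the failing step instead of retrying with live state."""
    for step in steps:
        save(step.id, state)              # checkpoint before each step
        for attempt in range(max_retries + 1):
            try:
                state = step.run(state)
                break
            except Exception:
                state = load(step.id)     # discard corrupted state
                if attempt == max_retries:
                    raise
    return state
```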
Tools worth installing today
- Langfuse — traces, replay, eval. Open source.
- Pytest-recording / cassette libraries — check trace snapshots into git.
- LiteLLM proxy — one place to intercept, log, and mutate every LLM call.
- Custom middleware for your agent framework — because production debugging always needs a hook your framework does not expose.
What changes as systems grow
Single-agent debugging is about the prompt. Multi-agent debugging is about the edges between agents — the handoff format, the error semantics, the retry policy. Budget accordingly: in a mature multi-agent system, debugging tools and replay infrastructure end up costing as much as the agents themselves. It is worth it.