Distributed agent failure recovery: saga patterns for non-deterministic systems

Microservices learned that "just retry" does not survive contact with production. Multi-agent systems are about to learn the same lesson — except their failures are non-deterministic and their compensating actions are written by the very models that just failed.

Why agents need their own failure model

Web service failures are loud: HTTP 500, timeout, connection refused. Agent failures are quieter and weirder:

Half-success. A code-writing agent committed three of four files; the fourth failed validation.
Confident wrongness. Output looks fine, is wrong, no error raised.
Silent stop. The model said "done" five turns early.
Loop instead of fail. The agent retries forever instead of erroring.

A retry-and-pray strategy compounds these. You need explicit failure semantics, mid-pipeline checkpoints, and compensating actions.

The three failure classes

Categorise every agent failure into one of three. The recovery differs.

Transient

Infra failure that should resolve. Vendor 5xx, rate limit, network blip.

Recovery: Exponential backoff with jitter. Cap retries (3–5). After cap, escalate to one of the other classes.

Logical

The agent did something wrong but the system is healthy. Bad SQL, wrong tool call, hallucinated output.

Recovery: Do not retry the same prompt. Retries change nothing on a logical failure. Either:

Re-prompt with the error included so the model can self-correct (1–2 attempts max).
Hand off to a different model tier.
Abort and escalate to a human.

Environmental

Something downstream changed: schema migrated, API removed, file deleted. The agent's plan is stale.

Recovery: Invalidate the agent's working assumptions and restart from a checkpoint with the new state surfaced.

Saga patterns for agents

A saga is a sequence of steps where each has a compensating action that undoes its effect. If step N fails, you compensate steps N-1, N-2, ... back to a clean state.

For agents:

Step 1: Create draft PR        → Compensation: close PR
Step 2: Run tests              → (idempotent, no compensation)
Step 3: Comment on linked issue → Compensation: delete comment
Step 4: Merge                   → (commit point)

Two practical rules:

Define compensations at design time, not at failure time. An agent asked to "undo what you just did" cannot reliably do it. The compensation must be a tool the agent can call.
Make compensation idempotent. Run twice = run once. Failures during compensation are common.

Checkpointing

Without checkpoints, recovery means restarting from scratch. With them, recovery means resuming from the last good state.

A checkpoint captures:

Working state (the structured-state object — see cross-agent context handoff).
Tool-call log with results.
Current step in the plan.
External side effects that have already landed.

Checkpoint after every external side effect, not on a timer. You want to never re-execute a side effect on resume.

def with_checkpoint(state, action):
    state = save_checkpoint(state)
    try:
        result = action(state)
        state.last_completed = action.id
        return state, result
    except Exception as e:
        state.error = serialize(e)
        save_checkpoint(state)
        raise

The escalation ladder

Most failures should never reach a human. But some must. Define the ladder and follow it.

Level	Trigger	Action
0	Transient	Backoff + retry (≤ 3)
1	Logical	Re-prompt with error context (≤ 2)
2	Logical, repeated	Switch to stronger model / different agent
3	Repeated step failure	Pause, run compensations, notify
4	Compensation failure	Page human; manual intervention

The thing that breaks teams: skipping levels. Going straight from level 1 to "page human" trains the on-call to ignore pages. Going from level 2 to "log warning and continue" produces silent corruption.

Idempotency tokens

Every external action gets a deterministic token derived from the saga ID + step ID. The downstream service treats the token as a dedupe key.

token = sha256(f"{saga_id}:{step_id}:{step_input_hash}").hexdigest()
github.create_pr(repo, branch, idempotency_key=token)

Re-execution of the same step produces the same token. The downstream service returns the prior result instead of creating a duplicate. Critical for: GitHub PRs, Stripe charges, email sends, file writes.

Observability

Every failure is a teaching moment for next time. Log:

The exact prompt and response that produced the failure.
The state snapshot at failure.
The compensation actions taken.
The recovery path (which level of escalation).

Pair with real-time agent monitoring for live alerting and multi-agent debugging techniques for post-incident analysis.

Where this goes wrong

Compensations rot. The "delete PR" tool changes semantics; nobody updates the saga. Test compensations in CI like you test forward actions.
Checkpoints leak secrets. Working state often includes credentials. Encrypt at rest, redact at log time.
Retries hammer downstream. Five agents each retrying three times against the same overloaded API = 15× load. Centralise rate limits.
Partial commits accumulate. A saga that fails at step 4 of 6 leaves debris if compensations are wrong. Schedule cleanup jobs that find and remove orphans.

Where this is heading

Expect saga and checkpoint primitives to ship in agent frameworks through 2026–27 — Temporal already has agent-flavoured docs, LangGraph adds checkpoint stores, Inngest is iterating on agent retries. Until those mature, treat your failure-recovery code as first-class production infrastructure, not as scaffolding.