Guide · 4 min read

Agent A/B testing methodology: why classical experiment design breaks for agents and what to do

Classical A/B testing assumes deterministic variants. Agents are non-deterministic, multi-step, and entangled with memory. Here is the methodology that survives, with the metric design, sample-size adjustments, and shadow-traffic patterns.

A/B testing for agents is not the A/B testing your growth team knows. Variance is higher, randomisation is messier, and the unit of analysis is ambiguous: request, session, or user. The methodology below is what survives once you have shipped a few real agent experiments.

Why classical A/B breaks for agents

Three properties that classical experiments do not handle:

  • Non-determinism — the same prompt can produce different outputs across runs. Variance inside a variant is huge.
  • Multi-step entanglement — one prompt change cascades through 8 tool calls. Attributing impact gets fuzzy.
  • Memory carry-over — yesterday's prompt influences today's behaviour. Subjects are not independent across sessions.

Run a typical product-experiment playbook against an agent and the result is "not significant" on metrics that are actually shifting. You need a different toolkit.

Three experiment types that work

1. Prompt A/B

Two prompt variants, randomised at the request level. Simplest case, closest to classical experimentation.

  • Randomise at: request id (not user — the same user can hit either variant); see the hashing sketch below.
  • Watch for: carry-over via memory (rinse memory or stratify by memory state).
  • Sample size: 3–5x what classical theory says, because variance is higher.
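
A minimal sketch of that assignment, assuming each request carries a stable request_id (a hypothetical field name). Hashing the id keeps retries of the same request in one variant while letting the same user land in either:

```python
# Sketch: deterministic request-level variant assignment.
# `request_id` is a hypothetical per-request identifier.
import hashlib

def assign_variant(request_id: str, variants=("control", "treatment")) -> str:
    # Hash the request id into a stable bucket: retries of the same
    # request stay in one variant, but different requests from the
    # same user can land in either.
    digest = hashlib.sha256(request_id.encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]

print(assign_variant("req-001"))
print(assign_variant("req-002"))
```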

2. Model A/B

Same prompt, different models. Higher cost contrast, narrower quality contrast.

  • Randomise at: request id.
  • Watch for: latency confound (a faster model serves more requests in the same time window). Compare per-request, not per-window; see the sketch below.
  • Sample size: smaller than prompt A/B because effects are bigger.
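
A sketch of the per-request comparison, assuming results are logged as (variant, succeeded) pairs. Success is computed per request, so the faster model's extra volume in a time window cannot inflate its rate:

```python
# Sketch: per-request success rates, immune to the latency confound.
from collections import defaultdict

def per_request_rates(results):
    counts = defaultdict(lambda: [0, 0])  # variant -> [successes, total]
    for variant, succeeded in results:
        counts[variant][0] += int(succeeded)
        counts[variant][1] += 1
    return {variant: s / n for variant, (s, n) in counts.items()}

results = [("model_a", True), ("model_a", False), ("model_b", True)]
print(per_request_rates(results))  # {'model_a': 0.5, 'model_b': 1.0}
```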

3. Tool A/B

Same agent, different tool set or different tool implementation. Hardest because tool selection itself is part of the agent's decision.

  • Randomise at: session start (not request — switching tools mid-session would confuse the agent); see the sketch below.
  • Watch for: path entanglement; one missing tool reroutes everything.
  • Sample size: large; effects are diffuse.
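
A sketch of session-level assignment, assuming a session_id fixed at session start and two hypothetical tool sets that differ only in the implementation under test:

```python
# Sketch: sticky session-level tool-set assignment.
import hashlib

# Hypothetical tool sets; only the implementation under test differs.
TOOLSETS = {
    "control":   ["search", "calculator", "crm_lookup"],
    "treatment": ["search", "calculator", "crm_lookup_v2"],
}

def toolset_for_session(session_id: str) -> list[str]:
    # Assign once at session start and never change mid-session,
    # so the agent's tool selection stays coherent.
    bucket = int(hashlib.sha256(session_id.encode()).hexdigest(), 16) % 2
    return TOOLSETS["control" if bucket == 0 else "treatment"]
```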

Metric design

The metric must satisfy three properties:

  1. Composable per request — task-level success, not session-level only.
  2. Robust to non-determinism — bin into discrete outcomes (success/partial/fail) when continuous metrics are too noisy; a binning sketch follows this list.
  3. Lagged metrics secondary — CSAT and retention move slowly; use them as guardrails, not primaries.
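
A sketch of property 2, binning a noisy continuous score into discrete outcomes; the 0.8 and 0.4 thresholds are illustrative, not recommendations:

```python
# Sketch: discretise a noisy continuous quality score.
def bin_outcome(score: float) -> str:
    if score >= 0.8:
        return "success"
    if score >= 0.4:
        return "partial"
    return "fail"
```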

A working hierarchy:

Metric                 Use as
------                 ------
Task success rate      Primary
Tool calls per task    Secondary (efficiency)
Tokens per task        Secondary (cost)
Time to resolution     Secondary (latency)
7-day retention        Guardrail

Shadow traffic before live A/B

Before ramping a variant to real users, run shadow traffic: the variant runs alongside the control on the same input, both responses are captured, and only the control's response is shown to the user (a harness sketch follows the list below). This gives you:

  • Variance estimates without user impact.
  • Quality comparison against ground truth (your eval set).
  • Cost projection at full ramp.
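
A sketch of the harness, with run_control, run_variant and log as hypothetical callables; in production the variant call would run asynchronously so it never adds user-facing latency:

```python
# Sketch: shadow-traffic wrapper. Both responses are logged,
# but only the control's response reaches the user.
import time

def handle_request(request, run_control, run_variant, log):
    t0 = time.monotonic()
    control_response = run_control(request)
    control_ms = (time.monotonic() - t0) * 1000

    t1 = time.monotonic()
    variant_response = run_variant(request)  # run async in production
    variant_ms = (time.monotonic() - t1) * 1000

    log({
        "control": {"response": control_response, "latency_ms": control_ms},
        "variant": {"response": variant_response, "latency_ms": variant_ms},
    })
    return control_response  # the user only ever sees the control
```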

Two weeks of shadow before live A/B catches most regressions.

Sample-size rules of thumb

For prompt and tool experiments, multiply classical sample-size estimates by 3–5x. The variance inside an LLM variant is roughly that much higher than in a deterministic UI variant.

For model A/B, multiply by 1.5x — model differences tend to be larger and easier to detect.
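
A sketch of the arithmetic, using the standard two-proportion sample-size approximation and then the agent multiplier; the 70% → 75% success rates, alpha = 0.05 and power = 0.80 are illustrative:

```python
# Sketch: classical sample size per arm, scaled by the agent multiplier.
from statistics import NormalDist

def classical_n_per_arm(p_control, p_variant, alpha=0.05, power=0.80):
    z_a = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided significance
    z_b = NormalDist().inv_cdf(power)           # desired power
    p_bar = (p_control + p_variant) / 2
    delta = abs(p_variant - p_control)
    return ((z_a + z_b) ** 2 * 2 * p_bar * (1 - p_bar)) / delta ** 2

# Prompt/tool experiment: scale by 3x, the low end of the rule of thumb.
n = classical_n_per_arm(0.70, 0.75) * 3
print(round(n))  # ≈ 3756 requests per arm
```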

Run the experiment for at least one full weekly cycle. Agent usage patterns are weekly-cyclic.

Avoiding common biases

  • Sequence effect — the same user behaving differently after one bad agent turn. Mitigate with request-level randomisation, not session.
  • Memory confound — variant A trained the memory the agent uses for variant B. Mitigate by clearing memory between variant assignments or by stratifying by memory state (see the sketch after this list).
  • Latency-mediated effects — slower variant produces fewer total interactions. Compare per-attempt rates, not per-day totals.
  • Non-stationary base rate — model vendor updates mid-experiment. Re-baseline; rerun if a vendor pushes a model update.
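
A sketch of the stratified read-out for the memory confound, assuming each record carries a memory-state label such as "empty" or "warm" (hypothetical names):

```python
# Sketch: success rates split by memory state and variant.
from collections import defaultdict

def rates_by_stratum(records):
    agg = defaultdict(lambda: [0, 0])  # (memory_state, variant) -> [wins, n]
    for variant, memory_state, succeeded in records:
        agg[(memory_state, variant)][0] += int(succeeded)
        agg[(memory_state, variant)][1] += 1
    return {key: wins / n for key, (wins, n) in agg.items()}

# If a variant only wins in the "warm" stratum, suspect that the other
# variant populated the memory it is benefiting from.
```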

Reading the results

A significant result on a primary metric is the start, not the end. Confirm with:

  • Eval-set delta — variant beats control on your offline eval too.
  • Failure mode analysis — sample 30 disagreements, classify them, and look for patterns (a sampling sketch follows this list).
  • Cost-quality trade-off — quality up but cost up means quality-per-dollar might be down.
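
A sketch of the disagreement sampling, assuming shadow traffic left you with (input, control_outcome, variant_outcome) triples:

```python
# Sketch: pull a reproducible sample of cases where the variants disagree.
import random

def sample_disagreements(paired, k=30, seed=0):
    disagreements = [p for p in paired if p[1] != p[2]]
    return random.Random(seed).sample(disagreements, min(k, len(disagreements)))
```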

A win that does not show up both offline and online is not a win.

Where this is heading

Two shifts: native experiment primitives in the Claude Agent SDK (assign variants, capture results), and statistical libraries that handle LLM-specific variance out of the box. Until then, the methodology above is the working version.
