A/B testing for agents is not the A/B testing your growth team knows. Variance is higher, randomisation is messier, and the right unit of analysis is not obvious. The methodology below is what survives once you have shipped a few real agent experiments.
Why classical A/B breaks for agents
Three properties that classical experiments do not handle:
- Non-determinism — the same prompt can produce different outputs across runs. Variance inside a variant is huge.
- Multi-step entanglement — one prompt change cascades through 8 tool calls. Attributing impact gets fuzzy.
- Memory carry-over — yesterday's prompt influences today's behaviour. Subjects are not independent across sessions.
Run the playbook from a typical product experiment and the result is "not significant" on metrics that are actually shifting. You need a different toolkit.
Three experiment types that work
1. Prompt A/B
Two prompt variants, randomised at the request level. Simplest case, closest to classical experimentation.
- Randomise at: request id (not user — same user can hit either variant).
- Watch for: carry-over via memory (rinse memory or stratify by memory state).
- Sample size: 3–5x what classical theory says, because variance is higher.
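A minimal sketch of request-level randomisation, assuming a simple hash-based split; the experiment name and ids are illustrative placeholders. The same helper keyed on a session id gives the session-level assignment a tool A/B (below) needs.

```python
import hashlib

def assign_variant(unit_id: str, experiment: str,
                   variants: tuple[str, ...] = ("control", "treatment")) -> str:
    """Deterministically map a unit id to a variant.

    Hashing the experiment name together with the unit id keeps assignment
    stable for the same unit and independent across experiments.
    """
    digest = hashlib.sha256(f"{experiment}:{unit_id}".encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]

# Prompt A/B: key on the request id, so the same user can hit either variant.
variant = assign_variant("req-12345", "prompt-v2-vs-v1")

# If memory carry-over is a concern, log the memory state at assignment time
# so you can stratify by it in the analysis.
```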
2. Model A/B
Same prompt, different models. The cost contrast between models is usually large; the quality contrast is narrower.
- Randomise at: request id.
- Watch for: latency confound (faster model = more requests in the same time window). Compare per-request, not per-window.
- Sample size: smaller than prompt A/B because effects are bigger.
3. Tool A/B
Same agent, different tool set or different tool implementation. Hardest because tool selection itself is part of the agent's decision.
- Randomise at: session start (not request — would confuse the agent).
- Watch for: path entanglement; one missing tool reroutes everything.
- Sample size: large; effects are diffuse.
Metric design
The metric must satisfy three properties:
- Composable per request — task-level success, not session-level only.
- Robust to non-determinism — bin into discrete outcomes (success/partial/fail) when continuous metrics are too noisy.
- Lagged metrics secondary — CSAT and retention move slowly; use them as guardrails, not primaries.
A working hierarchy:
| Metric | Use as |
|---|---|
| Task success rate | Primary |
| Tool calls per task | Secondary (efficiency) |
| Tokens per task | Secondary (cost) |
| Time to resolution | Secondary (latency) |
| 7-day retention | Guardrail |
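As a sketch of what a composable per-request record might look like under this hierarchy (the field names are assumptions, not a required schema):

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class TaskResult:
    variant: str        # "control" or "treatment"
    outcome: str        # binned: "success" | "partial" | "fail"
    tool_calls: int     # secondary: efficiency
    tokens: int         # secondary: cost
    latency_s: float    # secondary: time to resolution

def summarise(results: list[TaskResult], variant: str) -> dict:
    """Aggregate per-request records into the metric hierarchy for one variant."""
    rows = [r for r in results if r.variant == variant]
    return {
        "n": len(rows),
        "task_success_rate": mean(r.outcome == "success" for r in rows),
        "tool_calls_per_task": mean(r.tool_calls for r in rows),
        "tokens_per_task": mean(r.tokens for r in rows),
        "time_to_resolution_s": mean(r.latency_s for r in rows),
    }
```

Retention stays outside this record: it is a guardrail you read from product analytics, not something you compute per request.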
Shadow traffic before live A/B
Before ramping a variant to real users, run shadow traffic: the variant runs alongside the control on the same input, both responses are captured, and only the control's response is shown to the user. This gives you:
- Variance estimates without user impact.
- Quality comparison against ground truth (your eval set).
- Cost projection at full ramp.
Two weeks of shadow before live A/B catches most regressions.
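A shadow harness can be small. The sketch below assumes an async agent interface (`agent.run(request)`) and a `log.record(...)` sink; both are placeholders for whatever your stack exposes.

```python
import asyncio

async def shadow(request, variant_agent, control_response, log):
    """Run the variant off the user's critical path and capture both responses."""
    try:
        variant_response = await variant_agent.run(request)
        log.record(request_id=request.id,
                   control=control_response,
                   variant=variant_response)
    except Exception as exc:
        # Variant failures are data too; they must never surface to the user.
        log.record(request_id=request.id, variant_error=repr(exc))

async def handle_request(request, control_agent, variant_agent, log):
    control_response = await control_agent.run(request)   # the user only ever sees this
    asyncio.create_task(shadow(request, variant_agent, control_response, log))
    return control_response
```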
Sample-size rules of thumb
For prompt and tool experiments, multiply classical sample-size estimates by 3x. The variance inside an LLM variant is roughly that much higher than a deterministic UI variant.
For model A/B, multiply by 1.5x — model differences tend to be larger and easier to detect.
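Applied to a classical two-proportion calculation, the rules of thumb look like this. The baseline and lift below are illustrative, and the multipliers come from the rules above, not from statsmodels.

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Illustrative target: detect a lift in task success from 70% to 74%
# at alpha = 0.05 with 80% power.
effect = proportion_effectsize(0.70, 0.74)
classical_n = NormalIndPower().solve_power(effect_size=effect, alpha=0.05, power=0.8)

print(f"classical per-variant n: {classical_n:.0f}")
print(f"prompt/tool A/B (x3):    {3.0 * classical_n:.0f}")
print(f"model A/B (x1.5):        {1.5 * classical_n:.0f}")
```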
Run the experiment for at least one full weekly cycle. Agent usage patterns are weekly-cyclic.
Avoiding common biases
- Sequence effect — the same user behaving differently after one bad agent turn. Mitigate with request-level randomisation, not session.
- Memory confound — variant A trained the memory the agent uses for variant B. Mitigate by clearing memory between variant assignments or by stratifying.
- Latency-mediated effects — slower variant produces fewer total interactions. Compare per-attempt rates, not per-day totals (see the sketch after this list).
- Non-stationary base rate — model vendor updates mid-experiment. Re-baseline; rerun if a vendor pushes a model update.
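A tiny sketch of the per-attempt comparison for the latency-mediated case; the records are illustrative, and the point is only that the denominator is attempts, not days.

```python
attempts = [
    # one record per agent attempt, whichever day it happened on
    {"variant": "A", "success": True},
    {"variant": "A", "success": False},
    {"variant": "B", "success": True},
    {"variant": "B", "success": True},
]

def per_attempt_success_rate(records, variant):
    rows = [r for r in records if r["variant"] == variant]
    return sum(r["success"] for r in rows) / len(rows)

# Per-day totals reward the faster variant for squeezing in more attempts;
# per-attempt rates compare quality on an equal footing.
rate_a = per_attempt_success_rate(attempts, "A")
rate_b = per_attempt_success_rate(attempts, "B")
```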
Reading the results
A significant result on a primary metric is the start, not the end. Confirm with:
- Eval-set delta — variant beats control on your offline eval too.
- Failure mode analysis — sample 30 disagreements, classify them, look for patterns (a sampling sketch follows below).
- Cost-quality trade-off — quality up but cost up means quality-per-dollar might be down.
A win that does not show up both offline and online is not a win.
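For the failure mode analysis, one way to pull the sample, assuming you have paired per-request outcomes for both variants (the field names are placeholders):

```python
import random

def sample_disagreements(paired_results, n=30, seed=7):
    """paired_results: per-request records such as
    {"request_id": "...", "control_outcome": "success", "treatment_outcome": "fail"}."""
    disagreements = [r for r in paired_results
                     if r["control_outcome"] != r["treatment_outcome"]]
    random.seed(seed)
    return random.sample(disagreements, min(n, len(disagreements)))

# Classify each sampled disagreement by hand (tool misuse, hallucination,
# formatting, refusal, ...) and look for clusters before calling it a win.
```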
Tools
- Online A/B: any feature-flag platform (LaunchDarkly, GrowthBook). The flag goes around the prompt or model selection; a minimal sketch follows this list.
- Offline eval: see the evaluation framework.
- Trace capture: see observability platforms.
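How the flag wraps prompt selection, sketched against a generic flag client rather than any specific vendor's SDK; the method name and prompts below are placeholders.

```python
PROMPTS = {
    "control": "You are a support agent. Resolve the user's issue.",
    "treatment": "You are a support agent. Before acting, restate the goal, then resolve it.",
}

def build_system_prompt(flag_client, request_id: str) -> str:
    # flag_client stands in for whatever your feature-flag SDK exposes;
    # the call below is a placeholder, not a vendor API.
    variant = flag_client.get_variant("prompt-experiment", unit_id=request_id)
    return PROMPTS.get(variant, PROMPTS["control"])   # fall back to control on unknown values
```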
Where this is heading
Two shifts to watch for: native experiment primitives in the Claude Agent SDK (assign variants, capture results), and statistical libraries that handle LLM-specific variance out of the box. Until then, the methodology above is the working version.