A/B testing for agents is not the A/B testing your growth team knows. Variance is higher, randomisation is messier, and the right unit of analysis is not obvious. The methodology below is what survives once you have shipped a few real agent experiments.
Why classical A/B breaks for agents
Three properties that classical experiments do not handle:
- Non-determinism — the same prompt can produce different outputs across runs. Variance inside a variant is huge.
- Multi-step entanglement — one prompt change cascades through 8 tool calls. Attributing impact gets fuzzy.
- Memory carry-over — yesterday's prompt influences today's behaviour. Subjects are not independent across sessions.
Run the playbook from a typical product experiment and the result is "not significant" on metrics that are actually shifting. You need a different toolkit.
Three experiment types that work
1. Prompt A/B
Two prompt variants, randomised at the request level. Simplest case, closest to classical experimentation.
- Randomise at: request id (not user — same user can hit either variant).
- Watch for: carry-over via memory (rinse memory or stratify by memory state).
- Sample size: 3–5x what classical theory says, because variance is higher.
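A minimal sketch of request-level randomisation, assuming a simple hash-based split; the experiment name and ids are illustrative placeholders. The same helper keyed on a session id gives the session-level assignment a tool A/B (below) needs.

```python
import hashlib

def assign_variant(unit_id: str, experiment: str,
                   variants: tuple[str, ...] = ("control", "treatment")) -> str:
    """Deterministically map a unit id to a variant.

    Hashing the experiment name together with the unit id keeps assignment
    stable for the same unit and independent across experiments.
    """
    digest = hashlib.sha256(f"{experiment}:{unit_id}".encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]

# Prompt A/B: key on the request id, so the same user can hit either variant.
variant = assign_variant("req-12345", "prompt-v2-vs-v1")

# If memory carry-over is a concern, log the memory state at assignment time
# so you can stratify by it in the analysis.
```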
2. Model A/B
Same prompt, different models. The cost contrast between models is usually large; the quality contrast is narrower.
- Randomise at: request id.
- Watch for: latency confound (faster model = more requests in the same time window). Compare per-request, not per-window.
- Sample size: smaller than prompt A/B because effects are bigger.
3. Tool A/B
Same agent, different tool set or different tool implementation. Hardest because tool selection itself is part of the agent's decision.
- Randomise at: session start (not request — would confuse the agent).
- Watch for: path entanglement; one missing tool reroutes everything.
- Sample size: large; effects are diffuse.
Metric design
The metric must satisfy three properties:
- Composable per request — task-level success, not session-level only.
- Robust to non-determinism — bin into discrete outcomes (success/partial/fail) when continuous metrics are too noisy.
- Lagged metrics secondary — CSAT and retention move slowly; use them as guardrails, not primaries.
A working hierarchy:
| Metric | Use as |
|---|---|
| Task success rate | Primary |
| Tool calls per task | Secondary (efficiency) |
| Tokens per task | Secondary (cost) |
| Time to resolution | Secondary (latency) |
| 7-day retention | Guardrail |
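As a sketch of what a composable per-request record might look like under this hierarchy (the field names are assumptions, not a required schema):

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class TaskResult:
    variant: str        # "control" or "treatment"
    outcome: str        # binned: "success" | "partial" | "fail"
    tool_calls: int     # secondary: efficiency
    tokens: int         # secondary: cost
    latency_s: float    # secondary: time to resolution

def summarise(results: list[TaskResult], variant: str) -> dict:
    """Aggregate per-request records into the metric hierarchy for one variant."""
    rows = [r for r in results if r.variant == variant]
    return {
        "n": len(rows),
        "task_success_rate": mean(r.outcome == "success" for r in rows),
        "tool_calls_per_task": mean(r.tool_calls for r in rows),
        "tokens_per_task": mean(r.tokens for r in rows),
        "time_to_resolution_s": mean(r.latency_s for r in rows),
    }
```

Retention stays outside this record: it is a guardrail you read from product analytics, not something you compute per request.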
Shadow traffic before live A/B
Before ramping a variant to real users, run shadow traffic: the variant runs alongside the control on the same input, both responses are captured, and only the control's response is shown to the user. This gives you:
- Variance estimates without user impact.
- Quality comparison against ground truth (your eval set).
- Cost projection at full ramp.
Two weeks of shadow before live A/B catches most regressions.
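A shadow harness can be small. The sketch below assumes an async agent interface (`agent.run(request)`) and a `log.record(...)` sink; both are placeholders for whatever your stack exposes.

```python
import asyncio

async def shadow(request, variant_agent, control_response, log):
    """Run the variant off the user's critical path and capture both responses."""
    try:
        variant_response = await variant_agent.run(request)
        log.record(request_id=request.id,
                   control=control_response,
                   variant=variant_response)
    except Exception as exc:
        # Variant failures are data too; they must never surface to the user.
        log.record(request_id=request.id, variant_error=repr(exc))

async def handle_request(request, control_agent, variant_agent, log):
    control_response = await control_agent.run(request)   # the user only ever sees this
    asyncio.create_task(shadow(request, variant_agent, control_response, log))
    return control_response
```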
Sample-size rules of thumb
For prompt and tool experiments, multiply classical sample-size estimates by 3x. The variance inside an LLM variant is roughly that much higher than a deterministic UI variant.
For model A/B, multiply by 1.5x — model differences tend to be larger and easier to detect.
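Applied to a classical two-proportion calculation, the rules of thumb look like this. The baseline and lift below are illustrative, and the multipliers come from the rules above, not from statsmodels.

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Illustrative target: detect a lift in task success from 70% to 74%
# at alpha = 0.05 with 80% power.
effect = proportion_effectsize(0.70, 0.74)
classical_n = NormalIndPower().solve_power(effect_size=effect, alpha=0.05, power=0.8)

print(f"classical per-variant n: {classical_n:.0f}")
print(f"prompt/tool A/B (x3):    {3.0 * classical_n:.0f}")
print(f"model A/B (x1.5):        {1.5 * classical_n:.0f}")
```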
Run the experiment for at least one full weekly cycle. Agent usage patterns are weekly-cyclic.
Avoiding common biases
- Sequence effect — the same user behaving differently after one bad agent turn. Mitigate with request-level randomisation, not session.
- Memory confound — variant A trained the memory the agent uses for variant B. Mitigate by clearing memory between variant assignments or by stratifying.
- Latency-mediated effects — slower variant produces fewer total interactions. Compare per-attempt rates, not per-day totals (see the sketch after this list).
- Non-stationary base rate — model vendor updates mid-experiment. Re-baseline; rerun if a vendor pushes a model update.
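A tiny sketch of the per-attempt comparison for the latency-mediated case; the records are illustrative, and the point is only that the denominator is attempts, not days.

```python
attempts = [
    # one record per agent attempt, whichever day it happened on
    {"variant": "A", "success": True},
    {"variant": "A", "success": False},
    {"variant": "B", "success": True},
    {"variant": "B", "success": True},
]

def per_attempt_success_rate(records, variant):
    rows = [r for r in records if r["variant"] == variant]
    return sum(r["success"] for r in rows) / len(rows)

# Per-day totals reward the faster variant for squeezing in more attempts;
# per-attempt rates compare quality on an equal footing.
rate_a = per_attempt_success_rate(attempts, "A")
rate_b = per_attempt_success_rate(attempts, "B")
```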
Reading the results
A significant result on a primary metric is the start, not the end. Confirm with:
- Eval-set delta — variant beats control on your offline eval too.
- Failure mode analysis — sample 30 disagreements, classify them, look for patterns (a sampling sketch follows below).
- Cost-quality trade-off — quality up but cost up means quality-per-dollar might be down.
A win that does not show up both offline and online is not a win.
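For the failure mode analysis, one way to pull the sample, assuming you have paired per-request outcomes for both variants (the field names are placeholders):

```python
import random

def sample_disagreements(paired_results, n=30, seed=7):
    """paired_results: per-request records such as
    {"request_id": "...", "control_outcome": "success", "treatment_outcome": "fail"}."""
    disagreements = [r for r in paired_results
                     if r["control_outcome"] != r["treatment_outcome"]]
    random.seed(seed)
    return random.sample(disagreements, min(n, len(disagreements)))

# Classify each sampled disagreement by hand (tool misuse, hallucination,
# formatting, refusal, ...) and look for clusters before calling it a win.
```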
Tools
- Online A/B: any feature-flag platform (LaunchDarkly, GrowthBook). The flag goes around the prompt or model selection; a minimal sketch follows this list.
- Offline eval: see the evaluation framework.
- Trace capture: see observability platforms.
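How the flag wraps prompt selection, sketched against a generic flag client rather than any specific vendor's SDK; the method name and prompts below are placeholders.

```python
PROMPTS = {
    "control": "You are a support agent. Resolve the user's issue.",
    "treatment": "You are a support agent. Before acting, restate the goal, then resolve it.",
}

def build_system_prompt(flag_client, request_id: str) -> str:
    # flag_client stands in for whatever your feature-flag SDK exposes;
    # the call below is a placeholder, not a vendor API.
    variant = flag_client.get_variant("prompt-experiment", unit_id=request_id)
    return PROMPTS.get(variant, PROMPTS["control"])   # fall back to control on unknown values
```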
Where this is heading
Two shifts to watch for: native experiment primitives in the Claude Agent SDK (assign variants, capture results), and statistical libraries that handle LLM-specific variance out of the box. Until then, the methodology above is the working version.