A code change that breaks your agent does not always break the compiler. The prompt tweak that fixed one bug silently regressed three others, and the CI was green the whole time. Agent regression testing is the discipline that finds this before users do.
## Why CI alone does not catch agent regressions
Traditional CI runs unit tests on deterministic code. When the LLM is the code, two things change:
- Outputs are non-deterministic. Running the same prompt twice can yield different tokens. A naive `assertEqual(output, expected)` fails on the first run and passes on the second.
- Regressions are statistical. "Accuracy dropped from 92% to 88%" is a real regression, but no single test case caught it.
Agent regression tests are closer to A/B experiments than to unit tests: you run a batch, aggregate, and compare.
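Whether a drop like 92% → 88% is signal or sampling noise depends on batch size. A rough sketch using a two-proportion z-test (normal approximation, equal batch sizes) makes that concrete:

```python
import math

def drop_is_significant(p_old, p_new, n, z_crit=1.96):
    # Two-proportion z-test: is the observed drop larger than
    # sampling noise at ~95% confidence, given n cases per run?
    p_pool = (p_old + p_new) / 2
    se = math.sqrt(2 * p_pool * (1 - p_pool) / n)
    return (p_old - p_new) / se > z_crit
```

On 200 cases, a 92% → 88% drop is still within noise; on 1000 cases it is significant — one reason the larger eval tiers below exist.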
## The three-tier test pyramid
Borrowing the shape from classical testing, agents need three tiers:
| Tier | Count | Runs on | Purpose |
|---|---|---|---|
| Smoke | 10–30 | every PR | catch structural breaks |
| Regression | 100–500 | nightly / pre-merge | catch quality drift |
| Broad eval | 1000+ | weekly / release | catch slow regressions, new capabilities |
Each tier has a budget (time + tokens). Smoke is deterministic-enough and cheap. Regression is the workhorse. Broad eval is expensive and async.
## Tier 1: smoke tests
These catch the obvious. Format compliance, no-op failures, "can the agent even start."
```python
import json

def test_agent_returns_valid_json():
    result = agent.run("What is the capital of France?")
    assert result.output.strip().startswith("{")
    json.loads(result.output)  # must parse

def test_agent_calls_search_tool():
    result = agent.run("Who won the 2024 election?")
    assert any(call.tool == "web_search" for call in result.tool_calls)
```
Temperature=0, model ID pinned, < 10 seconds. If these fail, your PR is broken.
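That pinning discipline can be encoded as a small guard so a smoke run refuses to start against a floating alias. The config fields and the model ID below are illustrative, not a real provider's naming scheme:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SmokeConfig:
    model: str = "provider/model-2025-01-15"  # hypothetical pinned ID, never an alias
    temperature: float = 0.0
    timeout_s: int = 10

def smoke_ready(cfg: SmokeConfig) -> bool:
    # Refuse to run smoke tests with sampling enabled, against a
    # floating "latest" alias, or with a budget over 10 seconds.
    return cfg.temperature == 0.0 and "latest" not in cfg.model and cfg.timeout_s <= 10
```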
## Tier 2: regression eval
This is where drift shows up.
Components:
- A golden dataset. 200–500 input/expected-behaviour pairs, curated from real production traces. Not synthetic.
- A scoring function. For each case, score the output 0–1. Can be exact-match, LLM-as-judge, or task-specific (e.g., "the correct SQL runs").
- A baseline. Last week's score per case.
- A threshold. Fail PR if aggregate score drops more than X%, or any case flips from pass to fail without explanation.
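One possible shape for a golden dataset entry, matching the fields the pipeline below consumes (the `source` field is an extra for dataset hygiene, not something the source mandates):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GoldenCase:
    id: str
    input: str
    expected_behaviour: str   # what a correct answer must do, not exact tokens
    source: str = "production"  # production | shadow | adversarial | synthetic
```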
A minimal pipeline with LLM-as-judge:
```python
from statistics import mean

PASS_CUTOFF = 0.7  # judge score at or above which a case counts as a pass (illustrative)

def run_regression(agent, dataset, judge_model):
    results = {}
    for case in dataset:
        output = agent.run(case.input)
        score = judge_model.score(
            input=case.input,
            expected=case.expected_behaviour,
            actual=output,
        )
        results[case.id] = score
    return results

def compare_to_baseline(current, baseline, threshold=0.03):
    # current and baseline map case_id -> judge score.
    delta = mean(current.values()) - mean(baseline.values())
    flipped = [
        case_id for case_id, score in current.items()
        if baseline.get(case_id, 0.0) >= PASS_CUTOFF and score < PASS_CUTOFF
    ]
    if delta < -threshold or flipped:
        return "FAIL"
    return "PASS"
```
LLM-as-judge quality is the whole game here. Invest in the judge prompt as much as in the agent prompt.
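An illustrative judge prompt template — the rubric wording below is an assumption, and it should be tuned (and audited) with the same care as the agent prompt:

```python
JUDGE_PROMPT = """\
You are grading an AI agent's output against an expected behaviour.
Score from 0.0 (completely wrong) to 1.0 (fully correct). Judge the
behaviour, not the wording: different phrasing with the same effect
scores 1.0.

Input: {input}
Expected behaviour: {expected}
Actual output: {actual}

Reply with the score as a bare number, then one sentence of reasoning.
"""

def render_judge_prompt(input, expected, actual):
    return JUDGE_PROMPT.format(input=input, expected=expected, actual=actual)
```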
## Tier 3: broad eval
Weekly or per-release. Runs against a much larger dataset, maybe the full production log for a time window. Uses the strongest available model as judge. Produces a quality scorecard.
Broad eval is where you see slow regressions — the ones that smoke and regression missed because they are < 1% per run. Over a month, they add up to a 5% product quality drop.
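One way to surface that accumulation: compare the latest aggregate score against the score several runs back, not just the previous run, so sub-noise weekly drops still trip the alarm once they stack up. A sketch, with assumed window and threshold values:

```python
def slow_drift(history, window=4, threshold=0.03):
    # history: chronological list of aggregate scores, one per broad-eval run.
    # Flags a drift that no single run-over-run comparison would catch.
    if len(history) <= window:
        return False
    return history[-1] < history[-1 - window] - threshold
```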
## Building the golden dataset
The dataset is the asset. Everything else is mechanics.
Sources, in order of value:
- Production failures. Real traces where users retried or escalated. Manually labelled.
- Historical shadow-mode divergences. Output from the current agent disagrees with output from a stronger agent — a curated seed of hard cases. See multi-agent debugging techniques for how to generate these.
- Adversarial prompts. Known-bad patterns (prompt injection, jailbreak attempts). Keep these pinned even after you fix them.
- Synthetic cases. Last resort, because they bias toward the model that generated them.
Rotate 10% of cases every quarter. Stale datasets mask real drift.
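A minimal rotation helper, assuming a pool of freshly curated cases to draw from (pinned adversarial cases should be filtered out of `dataset` before rotating, per the note above):

```python
import random

def rotate_cases(dataset, fresh_pool, fraction=0.10, seed=None):
    # Retire a random `fraction` of golden cases and replace them with
    # newly curated ones, keeping the dataset size constant.
    rng = random.Random(seed)
    n = max(1, int(len(dataset) * fraction))
    keep = rng.sample(dataset, len(dataset) - n)
    return keep + rng.sample(fresh_pool, n)
```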
## What to fail on
Three fail conditions that correlate with real regressions:
- Aggregate score drop. The summary metric dropped by > 3%.
- Pass-to-fail flips. Any case that passed last run fails now, and the delta is not explained by the PR.
- Confidence drops. Even if the answer is right, the model is less sure. Often precedes aggregate drops.
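The confidence condition can be approximated with per-token logprobs, where the provider exposes them — a crude proxy, not calibrated confidence, and the margin below is an assumption:

```python
def confidence_dropped(curr_logprobs, base_logprobs, margin=0.15):
    # Mean token logprob for the same case across two runs; a drop
    # beyond `margin` flags falling confidence even when the answer
    # is still judged correct.
    curr = sum(curr_logprobs) / len(curr_logprobs)
    base = sum(base_logprobs) / len(base_logprobs)
    return curr < base - margin
```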
And three fail conditions that waste your time:
- ❌ "Output is different from last run." Different ≠ worse.
- ❌ "Judge score went from 0.92 to 0.91." Noise.
- ❌ "New capability appeared." New is not a regression.
## Running in CI
Practical setup:
- GitHub Actions matrix across model versions. Regression must pass on the pinned model and the latest vendor release.
- Nightly full regression with notification on failure. PR-level runs a 50-case fast subset.
- Artefacts. Store every failed case's full trace for 30 days. When someone asks "why did this fail," you have the exact input, output, and judge reasoning.
- Cost budget. A 200-case regression at $0.02/case is $4/run. A 1000-case broad eval is $20. Budget monthly.
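Back-of-envelope, using the $0.02/case figure above — the PR volume and run counts are assumptions, not numbers from the text:

```python
COST_PER_CASE = 0.02  # dollars, from the estimate above

def monthly_eval_cost(prs=100, pr_subset=50, nightly_cases=200, broad_cases=1000):
    pr_runs = prs * pr_subset * COST_PER_CASE     # fast subset on every PR
    nightly = 30 * nightly_cases * COST_PER_CASE  # full regression, nightly
    broad = 4 * broad_cases * COST_PER_CASE       # broad eval, weekly
    return pr_runs + nightly + broad
```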
## Anti-patterns
- Writing tests after the fact. If you did not have the regression test when the bug shipped, you will not have it when the bug re-ships.
- Averaging over heterogeneous tasks. One dataset per agent capability. Averaging coder + summariser means you miss regressions in both.
- No judge audits. The judge drifts too. Sample 5% of judge decisions for human review.
- Treating it as done. A regression suite that does not grow is dying.
## Where this fits
Continuous regression testing is a peer of real-time agent monitoring: one catches regressions before prod, the other after. Run both.