"Our agent improved 3% on MMLU after the model update" tells you nothing about whether your users will notice. Public benchmarks are the wrong tool. A useful benchmark suite is built from your tasks, your tools, and your failure modes. Here is how to design one.
What public benchmarks are good for
- Comparing model providers in broad strokes.
- Sanity-checking new model releases.
- Marketing.
What they are not good for: deciding whether your specific agent is getting better or worse.
Five components of a good in-house benchmark suite
1. Task-specific scenarios
Every scenario mirrors a real user flow, built from anonymised production traces. Cover the top 20 most-used flows plus 10 edge cases; a sketch of a scenario record follows this list.
2. Tool-set realism
The agent runs against the same tool set you ship in production. Stubs are acceptable for cost control, provided their behaviour is frozen so results stay comparable.
3. Diverse difficulty distribution
Easy / medium / hard split that mirrors real traffic. Easy 60%, medium 30%, hard 10%. A suite that is all-hard tells you nothing about regressions in the easy band.
4. Stable scoring
Deterministic scorers where possible. LLM judges where not. Calibrate the LLM judge against human-labelled samples quarterly.
5. Versioned and append-only
Once a scenario is in, it stays in. New scenarios are added alongside, never replacing old ones. Compare model X to model Y on the same scenarios.
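Concretely, a scenario can be a small frozen record. A minimal sketch in Python, assuming a home-grown harness; the field names are illustrative, not a prescribed schema:

    # Sketch of a scenario record. Frozen, because scenarios are immutable once added.
    from dataclasses import dataclass, field

    @dataclass(frozen=True)
    class Scenario:
        scenario_id: str          # stable ID, never reused
        task_class: str           # e.g. "billing_lookup", one of your task classes
        difficulty: str           # "easy" | "medium" | "hard"
        input: dict               # anonymised production trace: user message + context
        expected: dict            # ground truth for deterministic checks
        rubric: list[str] = field(default_factory=list)  # yes/no checks
        scorer_version: str = "v1"       # scoring versioned alongside the scenario
        added_in_suite: str = "2025.1"   # suite version it first appeared in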
The sizing math
How many scenarios to detect a 5% regression with 95% confidence?
Roughly: 200–400 scenarios per "task class" you want to monitor independently. For a five-class agent that is 1000–2000 scenarios.
Add up:
n_per_class * num_classes * num_runs_per_scenario
= 300 * 5 * 3 (for variance)
= 4500 model calls per benchmark run
At Sonnet pricing: $30–80 per benchmark run. Run on every model update, every prompt change.
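The same arithmetic as a small helper, so the cost of adding a task class or an extra run is visible before you commit. The per-call cost is an assumption; substitute your own token counts and pricing:

    # Back-of-envelope cost estimator for a full benchmark run.
    def benchmark_cost(n_per_class: int = 300,
                       num_classes: int = 5,
                       runs_per_scenario: int = 3,
                       cost_per_call_usd: float = 0.012) -> tuple[int, float]:
        calls = n_per_class * num_classes * runs_per_scenario
        return calls, calls * cost_per_call_usd

    calls, usd = benchmark_cost()
    print(calls, round(usd, 2))   # 4500 calls, ~$54 at the assumed per-call cost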
Failure mode coverage
Beyond happy-path tasks, include:
- Adversarial inputs — known prompt-injection patterns.
- Tool failures — what happens when a tool returns an error.
- Token-budget edges — what happens at the limit.
- Identity edge cases — different user types, different scopes.
Without these, your benchmark tracks average quality but misses regressions in exactly the corners that hurt most.
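One way to make tool-failure scenarios reproducible is a frozen stub that can be switched into a failure mode. A sketch, with an invented CRM tool as the example:

    # Sketch: a stub tool with a configurable failure mode, so "tool returns an
    # error" scenarios replay identically on every run. Names are illustrative.
    class StubCRMTool:
        def __init__(self, fail_with: str | None = None):
            self.fail_with = fail_with   # "timeout", "403", or None for happy path

        def lookup_customer(self, customer_id: str) -> dict:
            if self.fail_with == "timeout":
                return {"error": "upstream timeout after 30s"}
            if self.fail_with == "403":
                return {"error": "permission denied for this scope"}
            # Frozen canned response so scores stay comparable across runs.
            return {"customer_id": customer_id, "plan": "pro", "status": "active"}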
Scoring
Three categories, used together:
Deterministic scorers
For tasks with verifiable outputs (extracted entity matches expected, tool call shape matches expected). Cheap, exact.
LLM-as-judge
For open-ended outputs (summary quality, response helpfulness). Calibrate against human labels.
Pass/fail rubrics
A list of yes/no checks per scenario. Fast to apply, easy to communicate.
Most scenarios use 2–3 of these in combination.
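A sketch of how the three can combine for one scenario. The judge is passed in as a callable returning a 0-1 score, since the judge API is deployment-specific; nothing here is a prescribed interface:

    # Sketch: combine deterministic checks, a rubric, and an LLM judge per scenario.
    from typing import Callable

    def score_scenario(output: dict,
                       expected: dict,
                       rubric_checks: list[Callable[[dict], bool]],
                       judge: Callable[[dict], float]) -> dict:
        scores = {}
        # Deterministic: exact match on verifiable fields.
        scores["exact_match"] = float(output.get("entity") == expected.get("entity"))
        # Rubric: yes/no checks, reported as a pass rate.
        results = [check(output) for check in rubric_checks]
        scores["rubric_pass_rate"] = sum(results) / len(results) if results else 1.0
        # LLM-as-judge for open-ended quality (calibrate against human labels).
        scores["judge_quality"] = judge(output)
        return scores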
Versioning the suite
A working pattern:
- Scenarios are immutable once added.
- Scoring logic is versioned alongside the scenario, so old scores remain comparable.
- The suite version bumps when additions amount to 10%+ of the suite.
Every benchmark run records the suite version, agent version, model version, and prompt version, so results are comparable across time.
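A minimal sketch of that run record; the field names are illustrative:

    # Sketch: metadata pinned to every benchmark run.
    from dataclasses import dataclass, field
    from datetime import datetime, timezone

    @dataclass(frozen=True)
    class BenchmarkRun:
        suite_version: str    # e.g. "2025.3"
        agent_version: str    # your agent code / SDK pin
        model_version: str    # exact vendor model string, not "latest"
        prompt_version: str   # hash or tag of the prompt set
        started_at: str = field(
            default_factory=lambda: datetime.now(timezone.utc).isoformat())
        scores_by_class: dict = field(default_factory=dict)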
Detecting drift vs regression
Two modes:
- Drift — slow change in score over many runs. Often vendor-side.
- Regression — sudden change after one of your changes.
Plot score over time per task class. Drift shows as gradual; regression shows as a step. Both warrant investigation; the source is different.
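A crude classifier over the per-class score series makes the distinction operational. The thresholds here are assumptions to tune against your own history:

    # Sketch: label a score series (0-1 pass rates, one value per run) as
    # regression (step between consecutive runs) or drift (steady decline).
    def classify_change(scores: list[float],
                        step: float = 0.05,     # 5 percentage points in one run
                        trend: float = 0.03,    # 3 points over the window
                        window: int = 10) -> str:
        if len(scores) >= 2 and scores[-2] - scores[-1] >= step:
            return "regression"   # sudden drop after one of your changes
        recent = scores[-window:]
        if len(recent) >= window and recent[0] - recent[-1] >= trend:
            return "drift"        # slow decline over many runs, often vendor-side
        return "stable"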
Anti-patterns
Three suite designs that produce false confidence:
- All happy-path — misses real regressions.
- No baseline — score in isolation tells you nothing.
- Frequent rebalancing — comparing across versions becomes impossible.
Comparing your suite to public benchmarks
Public benchmarks (MMLU, GSM8K, SWE-bench) are useful as cross-checks. Track them quarterly:
- Did the model regress on a public benchmark in a way your suite missed?
- Did your suite catch a regression the public ones did not?
Both directions inform suite design. The latter is the goal: your suite catches what matters to you that public ones cannot see.
CI integration
Benchmark on every:
- Prompt change.
- Tool change.
- Model upgrade (vendor or your routing).
- Agent SDK upgrade.
Block production rollouts on score regression beyond a threshold (typically 2–5%).
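A sketch of such a gate, assuming per-class scores are available from the current run and the last accepted baseline; the threshold is a policy choice, not a recommendation:

    # Sketch: fail the rollout if any task class regresses beyond the threshold.
    import sys

    def gate(baseline: dict[str, float], current: dict[str, float],
             max_regression: float = 0.03) -> list[str]:
        return [cls for cls, base in baseline.items()
                if base - current.get(cls, 0.0) > max_regression]

    if __name__ == "__main__":
        baseline = {"billing": 0.91, "search": 0.84}   # from the last green run
        current = {"billing": 0.86, "search": 0.85}    # from this run
        failures = gate(baseline, current)
        if failures:
            print(f"Regression beyond threshold in: {failures}")
            sys.exit(1)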
Common mistakes
- Trusting public benchmarks — they do not measure what you care about.
- Suite never grows — coverage stalls; new failure modes go undetected.
- Suite only grows — too expensive to run; nobody runs it.
- No human review of LLM judges — judges drift; calibrate.
Where this is heading
Three trends by 2027: shared-and-customisable benchmark frameworks (clone, modify per agent), benchmark-as-a-service products, and standardised reporting formats so agent benchmark scores become comparable across teams.