"Our agent improved 3% on MMLU after the model update" tells you nothing about whether your users will notice. Public benchmarks are the wrong tool. A useful benchmark suite is built from your tasks, your tools, and your failure modes. Here is how to design one.
What public benchmarks are good for
- Comparing model providers in broad strokes.
- Sanity-checking new model releases.
- Marketing.
What they are not good for: deciding whether your specific agent is getting better or worse.
Five components of a good in-house benchmark suite
1. Task-specific scenarios
Every scenario mirrors a real user flow, built from anonymised production traces. Cover the top 20 most-used flows plus 10 edge cases; a sketch of a scenario record follows this list.
2. Tool-set realism
The agent runs against the same tool set you ship in production. Stubs are acceptable for cost control, provided their behaviour is frozen so results stay comparable.
3. Diverse difficulty distribution
Easy / medium / hard split that mirrors real traffic. Easy 60%, medium 30%, hard 10%. A suite that is all-hard tells you nothing about regressions in the easy band.
4. Stable scoring
Deterministic scorers where possible. LLM judges where not. Calibrate the LLM judge against human-labelled samples quarterly.
5. Versioned and append-only
Once a scenario is in, it stays in. New scenarios are added alongside, never replacing old ones. Compare model X to model Y on the same scenarios.
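Concretely, a scenario can be a small frozen record. A minimal sketch in Python, assuming a home-grown harness; the field names are illustrative, not a prescribed schema:

    # Sketch of a scenario record. Frozen, because scenarios are immutable once added.
    from dataclasses import dataclass, field

    @dataclass(frozen=True)
    class Scenario:
        scenario_id: str          # stable ID, never reused
        task_class: str           # e.g. "billing_lookup", one of your task classes
        difficulty: str           # "easy" | "medium" | "hard"
        input: dict               # anonymised production trace: user message + context
        expected: dict            # ground truth for deterministic checks
        rubric: list[str] = field(default_factory=list)  # yes/no checks
        scorer_version: str = "v1"       # scoring versioned alongside the scenario
        added_in_suite: str = "2025.1"   # suite version it first appeared in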
The sizing math
How many scenarios to detect a 5% regression with 95% confidence?
Roughly: 200–400 scenarios per "task class" you want to monitor independently. For a five-class agent that is 1000–2000 scenarios.
Add up:
n_per_class * num_classes * num_runs_per_scenario
= 300 * 5 * 3 (for variance)
= 4500 model calls per benchmark run
At Sonnet pricing: $30–80 per benchmark run. Run on every model update, every prompt change.
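The same arithmetic as a small helper, so the cost of adding a task class or an extra run is visible before you commit. The per-call cost is an assumption; substitute your own token counts and pricing:

    # Back-of-envelope cost estimator for a full benchmark run.
    def benchmark_cost(n_per_class: int = 300,
                       num_classes: int = 5,
                       runs_per_scenario: int = 3,
                       cost_per_call_usd: float = 0.012) -> tuple[int, float]:
        calls = n_per_class * num_classes * runs_per_scenario
        return calls, calls * cost_per_call_usd

    calls, usd = benchmark_cost()
    print(calls, round(usd, 2))   # 4500 calls, ~$54 at the assumed per-call cost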
Failure mode coverage
Beyond happy-path tasks, include:
- Adversarial inputs — known prompt-injection patterns.
- Tool failures — what happens when a tool returns an error.
- Token-budget edges — what happens at the limit.
- Identity edge cases — different user types, different scopes.
Without these, your benchmark tracks average quality but misses regressions in exactly the corners that hurt most.
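One way to make tool-failure scenarios reproducible is a frozen stub that can be switched into a failure mode. A sketch, with an invented CRM tool as the example:

    # Sketch: a stub tool with a configurable failure mode, so "tool returns an
    # error" scenarios replay identically on every run. Names are illustrative.
    class StubCRMTool:
        def __init__(self, fail_with: str | None = None):
            self.fail_with = fail_with   # "timeout", "403", or None for happy path

        def lookup_customer(self, customer_id: str) -> dict:
            if self.fail_with == "timeout":
                return {"error": "upstream timeout after 30s"}
            if self.fail_with == "403":
                return {"error": "permission denied for this scope"}
            # Frozen canned response so scores stay comparable across runs.
            return {"customer_id": customer_id, "plan": "pro", "status": "active"}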
Scoring
Three categories, used together:
Deterministic scorers
For tasks with verifiable outputs (extracted entity matches expected, tool call shape matches expected). Cheap, exact.
LLM-as-judge
For open-ended outputs (summary quality, response helpfulness). Calibrate against human labels.
Pass/fail rubrics
A list of yes/no checks per scenario. Fast to apply, easy to communicate.
Most scenarios use 2–3 of these in combination.
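A sketch of how the three can combine for one scenario. The judge is passed in as a callable returning a 0-1 score, since the judge API is deployment-specific; nothing here is a prescribed interface:

    # Sketch: combine deterministic checks, a rubric, and an LLM judge per scenario.
    from typing import Callable

    def score_scenario(output: dict,
                       expected: dict,
                       rubric_checks: list[Callable[[dict], bool]],
                       judge: Callable[[dict], float]) -> dict:
        scores = {}
        # Deterministic: exact match on verifiable fields.
        scores["exact_match"] = float(output.get("entity") == expected.get("entity"))
        # Rubric: yes/no checks, reported as a pass rate.
        results = [check(output) for check in rubric_checks]
        scores["rubric_pass_rate"] = sum(results) / len(results) if results else 1.0
        # LLM-as-judge for open-ended quality (calibrate against human labels).
        scores["judge_quality"] = judge(output)
        return scores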
Versioning the suite
A working pattern:
- Scenarios are immutable once added.
- Scoring logic is versioned alongside the scenario, so old scores remain comparable.
- The suite version bumps when additions amount to 10%+ of the suite.
Every benchmark run records the suite version, agent version, model version, and prompt version, so results are comparable across time.
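A minimal sketch of that run record; the field names are illustrative:

    # Sketch: metadata pinned to every benchmark run.
    from dataclasses import dataclass, field
    from datetime import datetime, timezone

    @dataclass(frozen=True)
    class BenchmarkRun:
        suite_version: str    # e.g. "2025.3"
        agent_version: str    # your agent code / SDK pin
        model_version: str    # exact vendor model string, not "latest"
        prompt_version: str   # hash or tag of the prompt set
        started_at: str = field(
            default_factory=lambda: datetime.now(timezone.utc).isoformat())
        scores_by_class: dict = field(default_factory=dict)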
Detecting drift vs regression
Two modes:
- Drift — slow change in score over many runs. Often vendor-side.
- Regression — sudden change after one of your changes.
Plot score over time per task class. Drift shows as gradual; regression shows as a step. Both warrant investigation; the source is different.
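A crude classifier over the per-class score series makes the distinction operational. The thresholds here are assumptions to tune against your own history:

    # Sketch: label a score series (0-1 pass rates, one value per run) as
    # regression (step between consecutive runs) or drift (steady decline).
    def classify_change(scores: list[float],
                        step: float = 0.05,     # 5 percentage points in one run
                        trend: float = 0.03,    # 3 points over the window
                        window: int = 10) -> str:
        if len(scores) >= 2 and scores[-2] - scores[-1] >= step:
            return "regression"   # sudden drop after one of your changes
        recent = scores[-window:]
        if len(recent) >= window and recent[0] - recent[-1] >= trend:
            return "drift"        # slow decline over many runs, often vendor-side
        return "stable"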
Anti-patterns
Three suite designs that produce false confidence:
- All happy-path — misses real regressions.
- No baseline — score in isolation tells you nothing.
- Frequent rebalancing — comparing across versions becomes impossible.
Comparing your suite to public benchmarks
Public benchmarks (MMLU, GSM8K, SWE-bench) are useful as cross-checks. Track them quarterly:
- Did the model regress on a public benchmark in a way your suite missed?
- Did your suite catch a regression the public ones did not?
Both directions inform suite design. The latter is the goal: your suite catches what matters to you that public ones cannot see.
CI integration
Benchmark on every:
- Prompt change.
- Tool change.
- Model upgrade (vendor or your routing).
- Agent SDK upgrade.
Block production rollouts on score regression beyond a threshold (typically 2–5%).
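A sketch of such a gate, assuming per-class scores are available from the current run and the last accepted baseline; the threshold is a policy choice, not a recommendation:

    # Sketch: fail the rollout if any task class regresses beyond the threshold.
    import sys

    def gate(baseline: dict[str, float], current: dict[str, float],
             max_regression: float = 0.03) -> list[str]:
        return [cls for cls, base in baseline.items()
                if base - current.get(cls, 0.0) > max_regression]

    if __name__ == "__main__":
        baseline = {"billing": 0.91, "search": 0.84}   # from the last green run
        current = {"billing": 0.86, "search": 0.85}    # from this run
        failures = gate(baseline, current)
        if failures:
            print(f"Regression beyond threshold in: {failures}")
            sys.exit(1)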
Common mistakes
- Trusting public benchmarks — they do not measure what you care about.
- Suite never grows — coverage stalls; new failure modes go undetected.
- Suite only grows — too expensive to run; nobody runs it.
- No human review of LLM judges — judges drift; calibrate.
Where this is heading
Three trends by 2027: shared-and-customisable benchmark frameworks (clone, modify per agent), benchmark-as-a-service products, and standardised reporting formats so agent benchmark scores become comparable across teams.