"We added evals" usually means "we ran 50 examples once and the score looked fine." Production-quality eval is a discipline that takes months to build and pays back forever — here is what it looks like.
What "evaluation framework" means
An evaluation framework is infrastructure for measuring agent quality continuously. Concretely:
- A dataset of representative tasks.
- A scoring system that turns outputs into numbers.
- A runner that executes tasks and records results.
- A comparison layer that surfaces regressions.
- A gating mechanism wired to deployments.
Miss any one and you have a benchmark, not a framework. This article is the design-level companion to continuous agent regression testing: regression tests are the implementation; the eval framework is the discipline.
The four-metric core
Every framework needs at least these four:
1. Task success rate
Did the agent achieve the user's goal? Binary or graded.
This is the metric you ship on. Everything else is diagnostic.
Measurement options:
- End-to-end check. A function that asserts success (e.g., "did the SQL run and return the expected count?").
- Reference comparison. "Did the output match the reference, by some similarity threshold?"
- LLM-as-judge. A stronger model scores the output against a rubric.
Real frameworks use all three, weighted by task type.
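As a sketch of how those options combine in one scorer — the `TaskResult` shape and the 0.8 similarity threshold are illustrative assumptions, not a standard schema:

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Callable, Optional

@dataclass
class TaskResult:
    output: str
    reference: Optional[str] = None
    end_to_end_check: Optional[Callable[[str], bool]] = None

def success_score(result: TaskResult, similarity_threshold: float = 0.8) -> float:
    """Score task success, preferring the strongest available signal."""
    # 1. End-to-end check: a programmatic assertion wins outright.
    if result.end_to_end_check is not None:
        return 1.0 if result.end_to_end_check(result.output) else 0.0
    # 2. Reference comparison: similarity against a known-good answer.
    if result.reference is not None:
        ratio = SequenceMatcher(None, result.output, result.reference).ratio()
        return 1.0 if ratio >= similarity_threshold else 0.0
    # 3. Remaining tasks fall through to LLM-as-judge (not shown here).
    raise ValueError("no programmatic scoring method available for this task")
```

In practice the dispatch is per task type, with the judge path filling in wherever no programmatic check exists.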
2. Trajectory quality
Did the agent take a sensible path? Two correct answers can come from a 3-step path or a 30-step circuitous one. The latter regresses cost and latency.
Measure: number of tool calls, presence of obviously redundant steps, ratio of useful turns to total turns.
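A minimal version of those signals, assuming each tool call is recorded as a name plus arguments (a repeated name/args pair is taken as a redundant step — a deliberately crude proxy):

```python
def trajectory_metrics(tool_calls: list[dict]) -> dict:
    """Compute simple trajectory-quality signals from a list of tool calls."""
    seen: set[tuple] = set()
    redundant = 0
    for call in tool_calls:
        # An exact repeat of (tool name, arguments) counts as redundant.
        key = (call["name"], repr(call.get("args")))
        if key in seen:
            redundant += 1
        seen.add(key)
    n = len(tool_calls)
    return {
        "tool_calls": n,
        "redundant_steps": redundant,
        "useful_ratio": (n - redundant) / n if n else 1.0,
    }
```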
3. Faithfulness
Did the agent hallucinate? Specifically: do its claims trace to its tool outputs and retrievals?
Measurement (expensive but necessary): for each claim in the output, can the rubric model find supporting evidence in the conversation? Score the fraction.
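The scoring loop itself is cheap; the expensive step is the evidence check, which in practice is a rubric-model call. A sketch with that check injected as a function so the logic stays testable:

```python
from typing import Callable

def faithfulness_score(
    claims: list[str],
    transcript: str,
    has_evidence: Callable[[str, str], bool],
) -> float:
    """Fraction of output claims with supporting evidence in the transcript."""
    if not claims:
        return 1.0  # no claims to verify: vacuously faithful
    supported = sum(1 for claim in claims if has_evidence(claim, transcript))
    return supported / len(claims)
```

A substring checker like `lambda c, t: c in t` is only a stand-in for the rubric model, but it is enough to exercise the scorer.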
4. Calibration
When the agent says "I don't know" or "I'm uncertain", is it actually uncertain? When it says "definitely", is it actually right?
Measure: bin outputs by stated confidence; compute accuracy in each bin. The agent is well-calibrated when accuracy tracks stated confidence: roughly 90% accuracy in the 90%-confidence bin, and so on down the scale.
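The binning might look like this — uniform confidence bins, with the bin count an arbitrary choice:

```python
from collections import defaultdict

def calibration_bins(results: list[tuple[float, bool]], n_bins: int = 3) -> dict:
    """Bin (stated_confidence, was_correct) pairs; return accuracy per bin."""
    bins: dict[int, list[bool]] = defaultdict(list)
    for confidence, correct in results:
        # Clamp confidence == 1.0 into the top bin.
        idx = min(int(confidence * n_bins), n_bins - 1)
        bins[idx].append(correct)
    return {
        f"{i / n_bins:.2f}-{(i + 1) / n_bins:.2f}": sum(v) / len(v)
        for i, v in sorted(bins.items())
    }
```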
Calibration matters because users learn to trust agents that admit uncertainty — and stop using ones that confidently lie.
The dataset is the asset
Frameworks come and go; the dataset compounds. Treat it as a first-class engineering artefact:
- Versioned in git. Reviewable diffs, blame, history.
- Sourced from production. Synthetic-only datasets bias eval toward what the synth model already does well.
- Stratified. Cover task types proportional to production traffic, plus oversample edge cases.
- Maintained. Stale tasks get pruned; new failure modes get added.
Aim for these proportions:
| Source | Share |
|---|---|
| Real production traces (curated) | 60% |
| Adversarial / red-team cases | 20% |
| Edge cases / regression seeds | 15% |
| Synthetic generated cases | 5% |
For red-team cases see red teaming AI agents.
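Those proportions are enforceable in CI. A hedged sketch, assuming each case carries a `source` field (an invented schema for illustration, not a standard one):

```python
from collections import Counter

TARGET_SHARES = {  # target proportions from the table above
    "production": 0.60,
    "adversarial": 0.20,
    "edge": 0.15,
    "synthetic": 0.05,
}

def composition_drift(cases: list[dict], tolerance: float = 0.05) -> list[str]:
    """Flag sources whose share drifts more than `tolerance` from target."""
    counts = Counter(c["source"] for c in cases)
    total = len(cases)
    warnings = []
    for source, target in TARGET_SHARES.items():
        actual = counts.get(source, 0) / total
        if abs(actual - target) > tolerance:
            warnings.append(f"{source}: {actual:.0%} vs target {target:.0%}")
    return warnings
```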
Scoring rubrics that work
Free-form "is this good?" scoring produces noise. Structured rubrics produce signal.
A rubric for a coding agent might be:
```yaml
correctness:
  weight: 0.5
  levels:
    0: code does not run
    0.5: code runs, partially correct
    1: code runs, fully correct
safety:
  weight: 0.2
  levels:
    0: introduces security regression
    1: safe
style:
  weight: 0.1
  levels:
    0: ignores existing conventions
    1: matches conventions
minimality:
  weight: 0.2
  levels:
    0: changed unrelated code
    1: minimal diff
```
Per-criterion scoring with explicit levels reduces judge variance dramatically. A 5-point Likert without levels is roughly noise.
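Combining per-criterion judge scores into one number is then just a weighted sum. A sketch, with weights mirroring the example rubric:

```python
RUBRIC_WEIGHTS = {  # mirrors the example rubric above; weights sum to 1.0
    "correctness": 0.5,
    "safety": 0.2,
    "style": 0.1,
    "minimality": 0.2,
}

def rubric_score(criterion_scores: dict[str, float]) -> float:
    """Weighted sum of per-criterion judge scores (each in [0, 1])."""
    missing = set(RUBRIC_WEIGHTS) - set(criterion_scores)
    if missing:
        raise ValueError(f"missing criteria: {sorted(missing)}")
    return sum(w * criterion_scores[c] for c, w in RUBRIC_WEIGHTS.items())
```

Failing loudly on a missing criterion matters: a judge that silently skips `safety` inflates every score it emits.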
LLM-as-judge: when it works, when it does not
Works:
- High-volume scoring at modest cost.
- Tasks where the rubric is well-defined.
- Comparative judgement ("is A better than B?"), which is more reliable than absolute scoring.
Does not work:
- Scoring outputs from the same family as the judge (Sonnet judging Sonnet).
- Tasks where the judge does not understand the domain.
- Subtle correctness (math, code that looks right but is wrong).
Mitigations: use a stronger model class as judge; sample 5–10% of judge decisions for human review; compare judges against each other.
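Two of those mitigations are mechanical enough to sketch: deterministic sampling for human review, and pairwise judge agreement (the 8% rate and 0.1 tolerance are arbitrary example values):

```python
import random

def sample_for_review(decisions: list[dict], rate: float = 0.08,
                      seed: int = 0) -> list[dict]:
    """Deterministically sample judge decisions for human spot-checks."""
    rng = random.Random(seed)  # fixed seed: same sample on every run
    k = max(1, round(len(decisions) * rate))
    return rng.sample(decisions, k)

def judge_agreement(scores_a: list[float], scores_b: list[float],
                    tol: float = 0.1) -> float:
    """Fraction of cases where two judges agree within `tol`."""
    if len(scores_a) != len(scores_b):
        raise ValueError("score lists must align case-for-case")
    agree = sum(abs(a - b) <= tol for a, b in zip(scores_a, scores_b))
    return agree / len(scores_a)
```

Low agreement between two independent judges is a cheap early warning that at least one rubric is underspecified.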
Wiring eval into the deployment pipeline
The framework is only as useful as its gating power. Three gates worth having:
- PR gate. A 50–200 case fast subset must pass within X% of baseline.
- Pre-release gate. Full eval (1000+ cases) must pass within Y% on the candidate.
- Canary gate. Live shadow eval against production for 24h before full rollout.
Each gate has a numeric threshold and a documented procedure for waivers. No waiver-by-Slack.
For the CI mechanics see continuous agent regression testing.
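Stripped to its essentials, a gate is a threshold comparison against a recorded baseline. A minimal sketch (names and thresholds are illustrative):

```python
from dataclasses import dataclass

@dataclass
class Gate:
    name: str
    max_regression: float  # allowed score drop vs baseline (absolute)

def check_gate(gate: Gate, baseline: float, candidate: float) -> bool:
    """Pass iff the candidate score stays within the allowed drop."""
    passed = candidate >= baseline - gate.max_regression
    print(f"[{gate.name}] baseline={baseline:.3f} "
          f"candidate={candidate:.3f} {'PASS' if passed else 'FAIL'}")
    return passed
```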
Eval cost economics
A typical mid-size eval looks like:
| Stage | Cases | Avg cost / case | Runs / month | Monthly cost |
|---|---|---|---|---|
| PR gate | 100 | $0.05 | 50 PRs | $250 |
| Nightly | 500 | $0.10 | 30 | $1,500 |
| Pre-release | 2000 | $0.15 | 4 | $1,200 |
| Canary | shadow | $0.02 | live | $500 |
About $3,500/month for a full framework. That is also roughly what one prevented production regression saves you in support time.
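The table's line items reduce to simple arithmetic. Treating the canary's $500 as a flat monthly figure, the gated stages sum to $2,950 and the total to $3,450:

```python
def monthly_cost(stages: list[tuple[int, float, int]]) -> float:
    """Sum monthly spend from (cases, cost_per_case, runs_per_month) rows."""
    return sum(cases * cost * runs for cases, cost, runs in stages)

gated = monthly_cost([
    (100, 0.05, 50),   # PR gate
    (500, 0.10, 30),   # nightly
    (2000, 0.15, 4),   # pre-release
])
total = gated + 500    # canary taken as a flat monthly figure
```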
Common failures of eval programs
- Building the framework, not the dataset. A great runner with 30 stale cases is worthless.
- Optimising the metric, breaking the product. Goodhart's law: if "judge score" becomes the target, you optimise judge perception, not user value.
- One judge to rule them all. Single-judge eval inherits the judge's blind spots. Rotate.
- No baselines. Every eval needs a "what did we score yesterday" comparison. Without it, scores are uninterpretable.
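That baseline comparison can be as small as a per-case diff against yesterday's scores (the 0.05 threshold is an example value, not a recommendation):

```python
def regressions(baseline: dict[str, float], current: dict[str, float],
                threshold: float = 0.05) -> dict[str, float]:
    """Per-case score drops beyond `threshold` versus the stored baseline."""
    return {
        case: round(baseline[case] - score, 4)
        for case, score in current.items()
        if case in baseline and baseline[case] - score > threshold
    }
```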
What is changing
Frameworks-as-a-service (Braintrust, Patronus, LangSmith Evals) are eating the runner and scoring layers. The dataset and rubrics remain yours, and they are your most defensible asset. Pick the framework that keeps your dataset portable.