A benchmark suite is for catching regressions. A golden dataset is for defining what "right" means. Hand-curated, small, expensive — and the foundation every other eval rests on. Here is how to build one that lasts.
What makes a dataset "golden"
Four properties:
- Hand-labelled by experts on the task.
- Stable — the labels do not change with the wind.
- Diverse — covers the breadth of your task.
- Justified — every label has a reason recorded.
A dataset missing any of the four is a working dataset, not golden.
What golden datasets are for
Three uses:
- Calibrating LLM judges — measure how often the judge agrees with the gold (sketched below).
- Truth source for benchmarks — when in doubt about a benchmark scoring decision, go to gold.
- Training human raters — onboard new labellers against the gold.
If your eval system has no golden anchor, it drifts.
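A sketch of the first use, judge calibration: run the judge over the gold and count matches. `run_judge` and the field names here are stand-ins for whatever your harness provides, not a real API.

```python
# Judge calibration against the gold, as a minimal sketch.
# `run_judge` and the field names are assumptions, not a real API.
def judge_agreement(golden: list[dict], run_judge) -> float:
    """Share of golden examples where the judge's verdict matches the gold."""
    matches = sum(run_judge(ex["input"]) == ex["gold_verdict"] for ex in golden)
    return matches / len(golden)
```

Run it per judge-prompt revision; a drop in agreement means the judge drifted, not the gold.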
Sizing
Smaller than you might think. For a typical product agent:
- 100–300 examples for a small project.
- 500–1500 examples for a mature product.
- 3000+ only when the task domain is genuinely vast (e.g., open-domain QA).
The cost of a golden example is high (1–4 hours of expert time); at two hours apiece, 1000 examples is roughly a full person-year, not a side project.
Composition
A working composition:
| Category | Share |
|---|---|
| Common-case happy path | 50% |
| Common edge cases | 25% |
| Rare-but-important edge cases | 15% |
| Adversarial / tricky | 10% |
Skewing too far toward edge cases hides common-case regressions. Skewing too far toward happy path hides everything else.
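The split is cheap to check automatically. A sketch, assuming each example carries a category tag (the tag names are illustrative):

```python
# Check actual composition against the target shares in the table above.
# Category tag names are illustrative assumptions.
from collections import Counter

TARGETS = {"happy_path": 0.50, "common_edge": 0.25,
           "rare_edge": 0.15, "adversarial": 0.10}
TOLERANCE = 0.05   # alert when a share strays by more than 5 points

def composition_drift(categories: list[str]) -> dict[str, float]:
    """Categories whose actual share deviates from target beyond tolerance."""
    shares = Counter(categories)
    n = len(categories)
    return {cat: shares[cat] / n - target
            for cat, target in TARGETS.items()
            if abs(shares[cat] / n - target) > TOLERANCE}
```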
The label structure
A useful label is more than a number:
```yaml
example_id: g_001
input: "User asks for the refund policy on a non-refundable product"
expected_behavior: deflect to policy + offer credit
unacceptable_behaviors:
  - issue refund silently
  - claim policy that does not exist
  - escalate without trying
notes: "Tests deflection skill on a known-tense scenario."
labeled_by: alex.k
labeled_at: 2026-03-15
last_reviewed: 2026-04-15
```
Labels carry intent, not just verdict. Future engineers must understand why this is the gold.
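In code, the same structure might look like this: a sketch mirroring the YAML fields above, not a prescribed schema.

```python
# The label structure as a Python dataclass; field names mirror the YAML above.
from dataclasses import dataclass, field
from datetime import date

@dataclass(frozen=True)          # accepted gold labels are immutable
class GoldenExample:
    example_id: str
    input: str
    expected_behavior: str
    unacceptable_behaviors: list[str] = field(default_factory=list)
    notes: str = ""              # the recorded "why": intent, not just verdict
    labeled_by: str = ""
    labeled_at: date | None = None
    last_reviewed: date | None = None
```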
Rotation discipline
Golden datasets go stale. Two patterns:
- Quarterly review — every example reviewed; outdated ones replaced or annotated.
- Append-mostly, prune-rarely — add new examples as the product evolves; prune only when an example is genuinely wrong.
A dataset that grows by 10% per quarter and has 20% reviewed per quarter stays alive.
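A rough simulation of those two rates, purely illustrative, shows why they work:

```python
# Simulate a dataset growing 10% per quarter with 20% of examples reviewed
# per quarter; track quarters since each example's last review. Illustrative.
import random

ages = [0] * 200                       # start: 200 examples, all fresh
for quarter in range(20):              # five years
    reviewed = set(random.sample(range(len(ages)), k=len(ages) // 5))
    ages = [0 if i in reviewed else a + 1 for i, a in enumerate(ages)]
    ages += [0] * (len(ages) // 10)    # growth: new examples arrive fresh

print(f"avg quarters since review: {sum(ages) / len(ages):.1f}")
print(f"share unreviewed for a year+: {sum(a >= 4 for a in ages) / len(ages):.0%}")
```

The average review age stabilizes at a couple of quarters instead of growing without bound; cut the review rate and watch it climb.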
Sourcing examples
Three sources, all in the mix:
- Production sampling — anonymised real interactions.
- Synthetic generation — LLM-generated, human-reviewed.
- Adversarial creation — humans inventing tricky cases.
Most teams over-rely on production samples, missing rare-but-important cases. Force diversity at sourcing.
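One way to force it, as a sketch; the source names and quota split are assumptions, not a recommendation:

```python
# Quota sampling across the three sources, so production volume cannot
# crowd out adversarial cases. Quotas are illustrative assumptions.
QUOTAS = {"production": 0.5, "synthetic": 0.3, "adversarial": 0.2}

def draw_batch(pools: dict[str, list], batch_size: int) -> list:
    """Draw a labelling batch with a fixed share from each source pool."""
    batch = []
    for source, share in QUOTAS.items():
        take = round(batch_size * share)
        batch.extend(pools[source][:take])   # pools assumed pre-shuffled
    return batch
```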
Labelling workflow
A working pipeline:
- Sample candidates from the three sources above.
- Initial labelling by primary labeller.
- Independent review by second labeller.
- Disagreement resolution — third labeller as tiebreaker, or flag for an SME (sketched after this list).
- Acceptance — accepted labels join the gold.
A pool of three labellers minimum (two per example, a third for ties); with fewer, individual bias dominates.
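A minimal sketch of the review-and-resolution steps, assuming plain verdict strings; the names are illustrative, not an established API:

```python
# Disagreement resolution, as a sketch. Label and resolve() are
# illustrative names, not an established API.
from dataclasses import dataclass

@dataclass
class Label:
    example_id: str
    verdict: str                 # e.g. "pass" / "fail"
    labeler: str

def resolve(primary: Label, review: Label,
            tiebreak: Label | None = None) -> Label | None:
    """Accept on agreement; defer to a tiebreaker on disagreement;
    return None to flag the example for an SME."""
    if primary.verdict == review.verdict:
        return primary           # agreement: label joins the gold
    return tiebreak              # third labeller decides; None means SME flag
```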
Inter-rater agreement
Track how often labellers agree before resolution. Below 80% agreement is a sign that the task definition is unclear, not that the labellers are bad. Fix the definition.
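Raw pre-resolution agreement is cheap to track. A sketch over two labellers' verdicts:

```python
# Raw agreement between two labellers over commonly-labelled examples.
def agreement_rate(labels_a: dict[str, str], labels_b: dict[str, str]) -> float:
    """Share of shared example_ids where both verdicts match."""
    shared = labels_a.keys() & labels_b.keys()
    return sum(labels_a[k] == labels_b[k] for k in shared) / len(shared)
```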
Golden vs. benchmark
The distinction matters:
- Golden is small, hand-curated, calibrates everything else.
- Benchmark is large, captures distribution, catches regressions. See benchmark suite design.
Both are needed. Golden anchors; benchmark covers.
Common mistakes
- Confusing benchmark and golden — they have different sizes and purposes.
- Single-labeller bias — get a second pair of eyes minimum.
- No rotation — datasets go stale within a year.
- No labelling guidelines — labellers diverge; agreement drops.
What you ship
Three deliverables:
- The dataset itself — version-controlled, immutable per version.
- Labelling guidelines — the doc that keeps reviewers aligned.
- Inter-rater agreement reports — quarterly snapshot of trust in the labels.
Auditors and incoming team members both need all three.
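"Immutable per version" is enforceable with a content hash: any silent edit changes the identifier. A sketch:

```python
# Content-addressed dataset versioning, as a sketch. Any edit to any
# example changes the version string, so silent drift is detectable.
import hashlib
import json

def dataset_version(examples: list[dict]) -> str:
    """Short content hash over canonically serialized examples."""
    canonical = json.dumps(sorted(examples, key=lambda e: e["example_id"]),
                           sort_keys=True).encode()
    return hashlib.sha256(canonical).hexdigest()[:12]
```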
Where this is heading
Three trends by 2027: shared golden datasets per industry, golden-dataset-as-a-service products from eval vendors, and certification programmes for agent labellers similar to data-labelling certifications. Build the discipline; trade up to better datasets later.