Guide · 4 min read

Golden dataset for agents: building the canonical examples that anchor your evals

A golden dataset is the small, hand-curated set of examples that anchors every eval you run. Here is how to build one that survives a year, the size you actually need, and the rotation discipline that keeps it fresh.

A benchmark suite is for catching regressions. A golden dataset is for defining what "right" means. Hand-curated, small, expensive — and the foundation every other eval rests on. Here is how to build one that lasts.

What makes a dataset "golden"

Four properties:

  • Hand-labelled by experts on the task.
  • Stable — the labels do not change with the wind.
  • Diverse — covers the breadth of your task.
  • Justified — every label has a reason recorded.

A dataset missing any of the four is a working dataset, not golden.

What golden datasets are for

Three uses:

  • Calibrating LLM judges — measure how often the judge agrees with the gold.
  • Truth source for benchmarks — when in doubt about a benchmark scoring decision, go to gold.
  • Training human raters — onboard new labellers against the gold.

If your eval system has no golden anchor, it drifts.
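
The first of those uses is the most mechanical. A minimal sketch of judge calibration, assuming the gold carries a pass/fail verdict per example and the judge produces one for the same inputs (the verdict strings, ids, and function name are illustrative, not any particular library's API):

# Judge calibration: how often does the LLM judge agree with the gold,
# and on which examples does it miss? Ids and verdicts are hypothetical.

def calibrate_judge(gold: dict[str, str], judge: dict[str, str]) -> tuple[float, list[str]]:
    """Return the agreement rate and the ids where the judge disagrees with gold."""
    shared = sorted(gold.keys() & judge.keys())
    misses = [eid for eid in shared if gold[eid] != judge[eid]]
    return 1 - len(misses) / len(shared), misses

gold_verdicts  = {"g_001": "fail", "g_002": "pass", "g_003": "pass"}
judge_verdicts = {"g_001": "fail", "g_002": "pass", "g_003": "fail"}
rate, misses = calibrate_judge(gold_verdicts, judge_verdicts)
print(f"judge agrees with gold on {rate:.0%} of examples; review {misses}")

The disagreeing ids are the useful output: each one is either a judge failure to fix or a gold label to re-examine.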

Sizing

Smaller than you might think. For a typical product agent:

  • 100–300 examples for a small project.
  • 500–1500 examples for a mature product.
  • 3000+ only when the task domain is genuinely vast (e.g., open-domain QA).

The cost of a golden example is high: 1–4 hours of expert time. At that rate, a 1000-example dataset represents roughly half a person-year to two person-years of labelling effort.

Composition

A working composition:

Category                        Share
Common-case happy path          50%
Common edge cases               25%
Rare-but-important edge cases   15%
Adversarial / tricky            10%

Skewing too far toward edge cases hides common-case regressions. Skewing too far toward happy path hides everything else.
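
One way to keep the mix honest is to tally categories against the target shares whenever examples are added. A minimal sketch, assuming each example record carries a category field matching the table above (the field names and the 5-point tolerance are illustrative):

from collections import Counter

# Target composition from the table above.
TARGET_SHARES = {
    "happy_path": 0.50,
    "common_edge": 0.25,
    "rare_edge": 0.15,
    "adversarial": 0.10,
}

def composition_drift(examples: list[dict], tolerance: float = 0.05) -> list[str]:
    """Return warnings for categories whose share drifts beyond the tolerance."""
    counts = Counter(ex["category"] for ex in examples)
    total = sum(counts.values())
    warnings = []
    for category, target in TARGET_SHARES.items():
        actual = counts.get(category, 0) / total if total else 0.0
        if abs(actual - target) > tolerance:
            warnings.append(f"{category}: {actual:.0%} vs target {target:.0%}")
    return warnings

examples = [{"category": "happy_path"}] * 80 + [{"category": "adversarial"}] * 20
print(composition_drift(examples))  # flags the over-weighted and missing categories

Run it in CI or as part of the quarterly review; drift in composition is as damaging as drift in individual labels.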

The label structure

A useful label is more than a number:

example_id: g_001
input: "User asks for the refund policy on a non-refundable product"
expected_behavior: deflect to policy + offer credit
unacceptable_behaviors:
  - issue refund silently
  - claim policy that does not exist
  - escalate without trying
notes: "Tests deflection skill on a known-tense scenario."
labeled_by: alex.k
labeled_at: 2026-03-15
last_reviewed: 2026-04-15

Labels carry intent, not just verdict. Future engineers must understand why this is the gold.
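
Codifying that structure makes the intent machine-checkable. A minimal sketch of the record as a Python dataclass, assuming the fields shown above (the class name and the rule that notes must be non-empty are illustrative choices):

from dataclasses import dataclass
from datetime import date

@dataclass
class GoldenExample:
    """One golden example: the label carries intent, not just a verdict."""
    example_id: str
    input: str
    expected_behavior: str
    unacceptable_behaviors: list[str]
    notes: str                 # why this is the gold
    labeled_by: str
    labeled_at: date
    last_reviewed: date

    def __post_init__(self):
        # A verdict without a recorded reason is not a golden label.
        if not self.notes.strip():
            raise ValueError(f"{self.example_id}: every label needs a justification")

g_001 = GoldenExample(
    example_id="g_001",
    input="User asks for the refund policy on a non-refundable product",
    expected_behavior="deflect to policy + offer credit",
    unacceptable_behaviors=[
        "issue refund silently",
        "claim policy that does not exist",
        "escalate without trying",
    ],
    notes="Tests deflection skill on a known-tense scenario.",
    labeled_by="alex.k",
    labeled_at=date(2026, 3, 15),
    last_reviewed=date(2026, 4, 15),
)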

Rotation discipline

Golden datasets go stale. Two patterns:

  • Quarterly review — every example reviewed; outdated ones replaced or annotated.
  • Append-mostly, prune-rarely — add new examples as the product evolves; prune only when an example is genuinely wrong.

A dataset that grows by 10% per quarter and has 20% reviewed per quarter stays alive.
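
Either pattern reduces to the same check: flag examples whose last review is more than a quarter old. A minimal sketch, assuming each example carries a last_reviewed date as in the label structure above (the 90-day interval is an illustrative stand-in for "a quarter"):

from datetime import date, timedelta

REVIEW_INTERVAL = timedelta(days=90)   # roughly one quarter

def overdue_for_review(examples: list[dict], today: date | None = None) -> list[str]:
    """Return ids of examples whose last review is older than one quarter."""
    today = today or date.today()
    return [
        ex["example_id"]
        for ex in examples
        if today - ex["last_reviewed"] > REVIEW_INTERVAL
    ]

examples = [
    {"example_id": "g_001", "last_reviewed": date(2026, 4, 15)},
    {"example_id": "g_002", "last_reviewed": date(2025, 11, 2)},
]
print(overdue_for_review(examples, today=date(2026, 6, 1)))  # ['g_002']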

Sourcing examples

Three sources, all in the mix:

  • Production sampling — anonymised real interactions.
  • Synthetic generation — LLM-generated, human-reviewed.
  • Adversarial creation — humans inventing tricky cases.

Most teams over-rely on production samples, missing rare-but-important cases. Force diversity at sourcing.

Labelling workflow

A working pipeline:

  1. Sample candidates from the three sources above.
  2. Initial labelling by primary labeller.
  3. Independent review by second labeller.
  4. Disagreement resolution — third labeller as tiebreaker; or flag for SME.
  5. Acceptance — accepted labels join the gold.

Keep at least three people in the labelling pool; with fewer there is no tiebreaker, and individual bias dominates.
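
The resolution step is worth making explicit rather than leaving to convention. A minimal sketch of steps 3–5, assuming verdicts are comparable strings (the function name and status labels are illustrative):

def resolve_label(primary: str, reviewer: str, tiebreaker: str | None = None) -> tuple[str, str]:
    """Steps 3-5: accept on agreement, break ties with a third labeller, else flag for an SME."""
    if primary == reviewer:
        return primary, "accepted"                     # two labellers agree
    if tiebreaker is None:
        return primary, "needs_tiebreaker"             # disagreement, no third label yet
    if tiebreaker in (primary, reviewer):
        return tiebreaker, "accepted_by_majority"      # third labeller sides with one of them
    return primary, "flag_for_sme"                     # three different answers: escalate

print(resolve_label("deflect", "deflect"))                    # ('deflect', 'accepted')
print(resolve_label("deflect", "escalate", "deflect"))        # ('deflect', 'accepted_by_majority')
print(resolve_label("deflect", "escalate", "issue_credit"))   # ('deflect', 'flag_for_sme')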

Inter-rater agreement

Track how often labellers agree before resolution. Below 80% agreement is a sign that the task definition is unclear, not that the labellers are bad. Fix the definition.
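
Tracking it can be as simple as a raw agreement rate over the examples both labellers touched before resolution; Cohen's kappa is a common refinement when verdicts are heavily skewed toward one class. A minimal sketch of the raw rate, with illustrative labeller names and verdicts:

def inter_rater_agreement(labels_a: dict[str, str], labels_b: dict[str, str]) -> float:
    """Raw pre-resolution agreement rate over the examples both raters labelled."""
    shared = labels_a.keys() & labels_b.keys()
    if not shared:
        raise ValueError("no examples labelled by both raters")
    return sum(labels_a[eid] == labels_b[eid] for eid in shared) / len(shared)

alex = {"g_001": "deflect", "g_002": "escalate", "g_003": "deflect"}
sam  = {"g_001": "deflect", "g_002": "deflect",  "g_003": "deflect"}
rate = inter_rater_agreement(alex, sam)
if rate < 0.80:
    print(f"agreement {rate:.0%}: revisit the task definition, not the labellers")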

Golden vs. benchmark

The distinction matters:

  • Golden is small, hand-curated, calibrates everything else.
  • Benchmark is large, captures distribution, catches regressions. See benchmark suite design.

Both are needed. Golden anchors; benchmark covers.

Common mistakes

  • Confusing benchmark and golden — they have different sizes and purposes.
  • Single-labeller bias — get a second pair of eyes minimum.
  • No rotation — datasets go stale within a year.
  • No labelling guidelines — labellers diverge; agreement drops.

What you ship

Three deliverables:

  • The dataset itself — version-controlled, immutable per version.
  • Labelling guidelines — the doc that keeps reviewers aligned.
  • Inter-rater agreement reports — quarterly snapshot of trust in the labels.

Auditors and incoming team members both need all three.

Where this is heading

Three trends by 2027: shared golden datasets per industry, golden-dataset-as-a-service products from eval vendors, and certification programmes for agent labellers similar to data-labelling certifications. Build the discipline; trade up to better datasets later.
