A benchmark suite is for catching regressions. A golden dataset is for defining what "right" means. Hand-curated, small, expensive — and the foundation every other eval rests on. Here is how to build one that lasts.
What makes a dataset "golden"
Four properties:
- Hand-labelled by experts on the task.
- Stable — the labels do not change with the wind.
- Diverse — covers the breadth of your task.
- Justified — every label has a reason recorded.
A dataset missing any of the four is a working dataset, not golden.
What golden datasets are for
Three uses:
- Calibrating LLM judges — measure how often the judge agrees with the gold (sketched below).
- Truth source for benchmarks — when in doubt about a benchmark scoring decision, go to gold.
- Training human raters — onboard new labellers against the gold.
If your eval system has no golden anchor, it drifts.
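A sketch of the first use, judge calibration: run the judge over the gold and count matches. `run_judge` and the field names here are stand-ins for whatever your harness provides, not a real API.

```python
# Judge calibration against the gold, as a minimal sketch.
# `run_judge` and the field names are assumptions, not a real API.
def judge_agreement(golden: list[dict], run_judge) -> float:
    """Share of golden examples where the judge's verdict matches the gold."""
    matches = sum(run_judge(ex["input"]) == ex["gold_verdict"] for ex in golden)
    return matches / len(golden)
```

Run it per judge-prompt revision; a drop in agreement means the judge drifted, not the gold.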
Sizing
Smaller than you might think. For a typical product agent:
- 100–300 examples for a small project.
- 500–1500 examples for a mature product.
- 3000+ only when the task domain is genuinely vast (e.g., open-domain QA).
The cost of a golden example is high (1–4 hours of expert time); at two hours apiece, 1000 examples is roughly a full person-year, not a side project.
Composition
A working composition:
| Category | Share |
|---|---|
| Common-case happy path | 50% |
| Common edge cases | 25% |
| Rare-but-important edge cases | 15% |
| Adversarial / tricky | 10% |
Skewing too far toward edge cases hides common-case regressions. Skewing too far toward happy path hides everything else.
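The split is cheap to check automatically. A sketch, assuming each example carries a category tag (the tag names are illustrative):

```python
# Check actual composition against the target shares in the table above.
# Category tag names are illustrative assumptions.
from collections import Counter

TARGETS = {"happy_path": 0.50, "common_edge": 0.25,
           "rare_edge": 0.15, "adversarial": 0.10}
TOLERANCE = 0.05   # alert when a share strays by more than 5 points

def composition_drift(categories: list[str]) -> dict[str, float]:
    """Categories whose actual share deviates from target beyond tolerance."""
    shares = Counter(categories)
    n = len(categories)
    return {cat: shares[cat] / n - target
            for cat, target in TARGETS.items()
            if abs(shares[cat] / n - target) > TOLERANCE}
```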
The label structure
A useful label is more than a number:
```yaml
example_id: g_001
input: "User asks for the refund policy on a non-refundable product"
expected_behavior: deflect to policy + offer credit
unacceptable_behaviors:
  - issue refund silently
  - claim policy that does not exist
  - escalate without trying
notes: "Tests deflection skill on a known-tense scenario."
labeled_by: alex.k
labeled_at: 2026-03-15
last_reviewed: 2026-04-15
```
Labels carry intent, not just verdict. Future engineers must understand why this is the gold.
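In code, the same structure might look like this: a sketch mirroring the YAML fields above, not a prescribed schema.

```python
# The label structure as a Python dataclass; field names mirror the YAML above.
from dataclasses import dataclass, field
from datetime import date

@dataclass(frozen=True)          # accepted gold labels are immutable
class GoldenExample:
    example_id: str
    input: str
    expected_behavior: str
    unacceptable_behaviors: list[str] = field(default_factory=list)
    notes: str = ""              # the recorded "why": intent, not just verdict
    labeled_by: str = ""
    labeled_at: date | None = None
    last_reviewed: date | None = None
```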
Rotation discipline
Golden datasets go stale. Two patterns:
- Quarterly review — every example reviewed; outdated ones replaced or annotated.
- Append-mostly, prune-rarely — add new examples as the product evolves; prune only when an example is genuinely wrong.
A dataset that grows by 10% per quarter and has 20% reviewed per quarter stays alive.
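A rough simulation of those two rates, purely illustrative, shows why they work:

```python
# Simulate a dataset growing 10% per quarter with 20% of examples reviewed
# per quarter; track quarters since each example's last review. Illustrative.
import random

ages = [0] * 200                       # start: 200 examples, all fresh
for quarter in range(20):              # five years
    reviewed = set(random.sample(range(len(ages)), k=len(ages) // 5))
    ages = [0 if i in reviewed else a + 1 for i, a in enumerate(ages)]
    ages += [0] * (len(ages) // 10)    # growth: new examples arrive fresh

print(f"avg quarters since review: {sum(ages) / len(ages):.1f}")
print(f"share unreviewed for a year+: {sum(a >= 4 for a in ages) / len(ages):.0%}")
```

The average review age stabilizes at a couple of quarters instead of growing without bound; cut the review rate and watch it climb.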
Sourcing examples
Three sources, all in the mix:
- Production sampling — anonymised real interactions.
- Synthetic generation — LLM-generated, human-reviewed.
- Adversarial creation — humans inventing tricky cases.
Most teams over-rely on production samples, missing rare-but-important cases. Force diversity at sourcing.
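One way to force it, as a sketch; the source names and quota split are assumptions, not a recommendation:

```python
# Quota sampling across the three sources, so production volume cannot
# crowd out adversarial cases. Quotas are illustrative assumptions.
QUOTAS = {"production": 0.5, "synthetic": 0.3, "adversarial": 0.2}

def draw_batch(pools: dict[str, list], batch_size: int) -> list:
    """Draw a labelling batch with a fixed share from each source pool."""
    batch = []
    for source, share in QUOTAS.items():
        take = round(batch_size * share)
        batch.extend(pools[source][:take])   # pools assumed pre-shuffled
    return batch
```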
Labelling workflow
A working pipeline:
- Sample candidates from the three sources above.
- Initial labelling by primary labeller.
- Independent review by second labeller.
- Disagreement resolution — third labeller as tiebreaker, or flag for an SME (sketched after this list).
- Acceptance — accepted labels join the gold.
A pool of three labellers minimum (two per example, a third for ties); with fewer, individual bias dominates.
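A minimal sketch of the review-and-resolution steps, assuming plain verdict strings; the names are illustrative, not an established API:

```python
# Disagreement resolution, as a sketch. Label and resolve() are
# illustrative names, not an established API.
from dataclasses import dataclass

@dataclass
class Label:
    example_id: str
    verdict: str                 # e.g. "pass" / "fail"
    labeler: str

def resolve(primary: Label, review: Label,
            tiebreak: Label | None = None) -> Label | None:
    """Accept on agreement; defer to a tiebreaker on disagreement;
    return None to flag the example for an SME."""
    if primary.verdict == review.verdict:
        return primary           # agreement: label joins the gold
    return tiebreak              # third labeller decides; None means SME flag
```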
Inter-rater agreement
Track how often labellers agree before resolution. Below 80% agreement is a sign that the task definition is unclear, not that the labellers are bad. Fix the definition.
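Raw pre-resolution agreement is cheap to track. A sketch over two labellers' verdicts:

```python
# Raw agreement between two labellers over commonly-labelled examples.
def agreement_rate(labels_a: dict[str, str], labels_b: dict[str, str]) -> float:
    """Share of shared example_ids where both verdicts match."""
    shared = labels_a.keys() & labels_b.keys()
    return sum(labels_a[k] == labels_b[k] for k in shared) / len(shared)
```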
Golden vs. benchmark
The distinction matters:
- Golden is small, hand-curated, calibrates everything else.
- Benchmark is large, captures distribution, catches regressions. See benchmark suite design.
Both are needed. Golden anchors; benchmark covers.
Common mistakes
- Confusing benchmark and golden — they have different sizes and purposes.
- Single-labeller bias — get a second pair of eyes minimum.
- No rotation — datasets go stale within a year.
- No labelling guidelines — labellers diverge; agreement drops.
What you ship
Three deliverables:
- The dataset itself — version-controlled, immutable per version.
- Labelling guidelines — the doc that keeps reviewers aligned.
- Inter-rater agreement reports — quarterly snapshot of trust in the labels.
Auditors and incoming team members both need all three.
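"Immutable per version" is enforceable with a content hash: any silent edit changes the identifier. A sketch:

```python
# Content-addressed dataset versioning, as a sketch. Any edit to any
# example changes the version string, so silent drift is detectable.
import hashlib
import json

def dataset_version(examples: list[dict]) -> str:
    """Short content hash over canonically serialized examples."""
    canonical = json.dumps(sorted(examples, key=lambda e: e["example_id"]),
                           sort_keys=True).encode()
    return hashlib.sha256(canonical).hexdigest()[:12]
```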
Where this is heading
Three trends by 2027: shared golden datasets per industry, golden-dataset-as-a-service products from eval vendors, and certification programmes for agent labellers similar to data-labelling certifications. Build the discipline; trade up to better datasets later.