For decisions an organisation cannot afford to get wrong, one LLM run is not enough. Running multiple agents and requiring consensus is the credible alternative. Five mechanisms exist, with very different cost and quality profiles. Here is the comparison and a rule for picking between them.
## When consensus matters
Three classes of decision call for it:
- Irreversible — the action cannot be undone (publish, send, transfer).
- High-stakes — the cost of being wrong is large (medical, financial, legal).
- Adversarial — an attacker may probe a single model and exploit its idiosyncrasies.
For chat-style use cases, consensus is overkill. For the above three, single-shot is reckless.
## The five mechanisms
### 1. Majority vote
N agents independently produce an answer; the majority wins. The simplest possible mechanism.
- Strengths: trivial, parallelisable.
- Weaknesses: ties; minority view may be correct.
- Pick when: discrete classification with clear options.
### 2. Weighted vote
Each agent's vote carries a weight (model size, historical accuracy, confidence score).
- Strengths: lets you trust Opus more than Haiku.
- Weaknesses: picking weights is half the problem.
- Pick when: mixed-model ensembles.
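The tally generalises in a few lines. The weights below (0.9 for a large model, 0.6 for a small one) are illustrative; in practice they might come from historical accuracy on a held-out set:

```python
from collections import defaultdict

def weighted_vote(votes: list[tuple[str, float]]) -> str:
    """votes: (answer, weight) pairs. Returns the answer with
    the highest total weight."""
    totals: dict[str, float] = defaultdict(float)
    for answer, weight in votes:
        totals[answer] += weight
    return max(totals, key=totals.get)
```

Note that two small-model votes can outweigh one large-model vote — which is exactly the behaviour you want, and exactly why picking the weights is half the problem.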
### 3. Debate-judge
Two or three agents argue; a separate judge decides who is right. See orchestration patterns.
- Strengths: surfaces edge cases vote misses.
- Weaknesses: expensive; judge becomes the new single point of failure.
- Pick when: open-ended judgement (ethical reviews, complex policy).
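The orchestration loop can be sketched independently of any model API. Here agents are plain callables from prompt to response; the shared-transcript structure and the `rounds` parameter are assumptions about how you would wire it up, not a fixed protocol:

```python
from typing import Callable

Agent = Callable[[str], str]  # prompt -> response

def debate_judge(question: str, debaters: list[Agent],
                 judge: Agent, rounds: int = 2) -> str:
    """Each round, every debater sees the transcript so far and
    responds; the judge then reads the full transcript and decides."""
    transcript = f"Question: {question}\n"
    for r in range(rounds):
        for i, agent in enumerate(debaters):
            reply = agent(transcript)
            transcript += f"Debater {i + 1} (round {r + 1}): {reply}\n"
    return judge(transcript + "Judge, decide who is right.")
```

The sequential rounds are what drive the latency row in the comparison table below: each debater call must see the previous one's output.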
### 4. Confidence-thresholded
Each agent reports a confidence; consensus is declared only if all agents give the same answer above a threshold. Otherwise escalate to a human.
- Strengths: fail-safe; explicit "I do not know".
- Weaknesses: requires reliable confidence scores (LLMs are notoriously badly calibrated).
- Pick when: human-in-the-loop is acceptable for ambiguous cases.
### 5. Probabilistic consensus
Treat each agent's output as a sample from a distribution. Compute posterior over possible answers; pick the maximum a posteriori (MAP) estimate.
- Strengths: principled treatment of uncertainty.
- Weaknesses: complex; rarely worth the engineering.
- Pick when: high-volume decisions where calibration matters.
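One concrete instantiation, assuming a uniform prior and independent agents (a product-of-experts combination), so the posterior over answers is proportional to the product of each agent's probability for that answer. Working in log space avoids underflow:

```python
import math

def map_consensus(agent_dists: list[dict[str, float]]) -> str:
    """Each agent reports a probability distribution over answers.
    With a uniform prior and independence, posterior(a) is
    proportional to the product over agents of p_i(a); return the
    argmax (the MAP estimate)."""
    answers = set().union(*agent_dists)
    def log_posterior(a: str) -> float:
        # Missing answers get a small floor probability.
        return sum(math.log(d.get(a, 1e-9)) for d in agent_dists)
    return max(answers, key=log_posterior)
```

The independence assumption is the weak point: agents built from the same base model are correlated, which is the diversity problem discussed below.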
## Comparison
| Mechanism | Cost (vs single) | Latency | Best for |
|---|---|---|---|
| Majority vote | N× | Parallel; same | Discrete answers |
| Weighted vote | N× | Parallel; same | Mixed-model |
| Debate-judge | 4–6× | Sequential; longer | Open-ended judgement |
| Confidence-thresholded | N× | Parallel; same | HITL fallback OK |
| Probabilistic | N× + analysis | Parallel; same | High-volume calibration |
## Diversity is the multiplier
A consensus over N identical agents is just one agent paying N times. To get the benefit:
- Vary temperature across instances.
- Vary the prompt framing.
- Vary the model (Opus + Sonnet + Haiku).
- Vary the seed.
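One way to make the variation explicit is to generate the ensemble from a cross-product of the axes above. The model names, framings, and temperature schedule here are illustrative, not a recommendation:

```python
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class AgentConfig:
    model: str         # e.g. "opus", "sonnet" (names illustrative)
    temperature: float
    framing: str       # prompt prefix
    seed: int

MODELS = ["opus", "sonnet", "haiku"]
FRAMINGS = ["Answer directly.",
            "List counterarguments first, then answer."]

# Cross two axes of variation to get a six-agent diverse ensemble,
# with temperature and seed varied per instance.
ensemble = [
    AgentConfig(model=m, temperature=0.3 + 0.2 * i, framing=f, seed=i)
    for i, (m, f) in enumerate(product(MODELS, FRAMINGS))
]
```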
Without these variations the ensemble collapses: every instance lands on the same wrong answer, and N votes carry no more evidence than one.
## Calibration: the hidden problem
Most consensus mechanisms assume the agents' confidence scores mean something. They do not by default. LLMs are systematically overconfident.
Three ways to calibrate:
- Temperature scaling on a held-out set — adjust raw probabilities.
- Self-consistency check — ask the agent N times, measure how often it agrees with itself.
- Calibrator model — train a small model that maps raw confidence to true probability.
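The first two fit in a few lines. The temperature `T` would be fit on held-out data (values above 1 pull overconfident scores back toward 0.5); the `T = 2.0` in the test below is illustrative:

```python
import math
from collections import Counter

def temperature_scale(confidence: float, T: float) -> float:
    """Rescale a raw binary confidence by temperature T.
    Works on the logit: divide by T, then map back through
    the sigmoid. T > 1 softens overconfident scores."""
    logit = math.log(confidence / (1 - confidence))
    return 1 / (1 + math.exp(-logit / T))

def self_consistency(samples: list[str]) -> float:
    """Cheap empirical confidence: re-ask the agent N times and
    measure how often it agrees with its own modal answer."""
    return Counter(samples).most_common(1)[0][1] / len(samples)
```

For example, with `T = 2.0` a raw confidence of 0.99 softens to roughly 0.91 — still high, but no longer "all agents agreed above 0.9" territory by default.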
Without calibration, "all agents agreed with confidence > 0.9" is much weaker evidence than it sounds.
## Cost reality
Consensus is expensive. Typical mechanism costs:
| Mechanism | Cost vs single Opus |
|---|---|
| 3-vote | 3x |
| 5-vote | 5x |
| Debate (3 rounds + judge) | 6x |
| Confidence-thresholded | 3–5x + occasional human |
The return comes from wrong decisions avoided. Consensus pays for itself only when the expected cost of a wrong decision exceeds roughly 10x the consensus overhead.
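The break-even arithmetic can be made explicit. This is a sketch of the expected-value check; all the numbers in the test are hypothetical:

```python
def consensus_worth_it(cost_single: float, n_agents: int,
                       p_wrong_single: float, p_wrong_consensus: float,
                       cost_of_wrong: float) -> bool:
    """Pay (n - 1) extra runs to reduce the probability of an
    expensive wrong decision; worth it when the expected risk
    reduction exceeds the overhead."""
    overhead = cost_single * (n_agents - 1)
    risk_reduction = (p_wrong_single - p_wrong_consensus) * cost_of_wrong
    return risk_reduction > overhead
```

For a $0.50 call, a 3-vote ensemble that cuts the error rate from 5% to 1% is worth it when a wrong decision costs $500 (risk reduction $20 vs $1 overhead), and not when it costs $10.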
## When NOT to use consensus
Three anti-patterns:
- Creative open-ended writing — averaging produces blandness.
- Real-time decisions — even parallel fan-out adds tail latency, and sequential debate multiplies it beyond most budgets.
- Tasks with no verifiable answer — you cannot measure consensus quality.
## Where this is heading
Three trends by 2027: native consensus primitives in the Claude Agent SDK, calibration-as-a-service products that sit between the model and your application, and consensus-aware MCP gateways that route to the right ensemble. Build the basic vote pattern now, swap in better mechanisms as products mature.