For decisions an organisation cannot afford to get wrong, one LLM run is not enough. Running multiple agents and requiring consensus is the credible alternative. Five mechanisms exist, with very different cost and quality profiles. Here is the comparison and a rule for picking between them.
## When consensus matters
Three classes of decision call for it:
- Irreversible — the action cannot be undone (publish, send, transfer).
- High-stakes — the cost of being wrong is large (medical, financial, legal).
- Adversarial — an attacker may probe a single model and exploit its idiosyncrasies.
For chat-style use cases, consensus is overkill. For the above three, single-shot is reckless.
## The five mechanisms
### 1. Majority vote
N agents independently produce an answer; the majority wins. The simplest possible mechanism.
- Strengths: trivial, parallelisable.
- Weaknesses: ties; minority view may be correct.
- Pick when: discrete classification with clear options.
### 2. Weighted vote
Each agent's vote carries a weight (model size, historical accuracy, confidence score).
- Strengths: lets you trust Opus more than Haiku.
- Weaknesses: picking weights is half the problem.
- Pick when: mixed-model ensembles.
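The tally generalises in a few lines. The weights below (0.9 for a large model, 0.6 for a small one) are illustrative; in practice they might come from historical accuracy on a held-out set:

```python
from collections import defaultdict

def weighted_vote(votes: list[tuple[str, float]]) -> str:
    """votes: (answer, weight) pairs. Returns the answer with
    the highest total weight."""
    totals: dict[str, float] = defaultdict(float)
    for answer, weight in votes:
        totals[answer] += weight
    return max(totals, key=totals.get)
```

Note that two small-model votes can outweigh one large-model vote — which is exactly the behaviour you want, and exactly why picking the weights is half the problem.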
### 3. Debate-judge
Two or three agents argue; a separate judge decides who is right. See orchestration patterns.
- Strengths: surfaces edge cases vote misses.
- Weaknesses: expensive; judge becomes the new single point of failure.
- Pick when: open-ended judgement (ethical reviews, complex policy).
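The orchestration loop can be sketched independently of any model API. Here agents are plain callables from prompt to response; the shared-transcript structure and the `rounds` parameter are assumptions about how you would wire it up, not a fixed protocol:

```python
from typing import Callable

Agent = Callable[[str], str]  # prompt -> response

def debate_judge(question: str, debaters: list[Agent],
                 judge: Agent, rounds: int = 2) -> str:
    """Each round, every debater sees the transcript so far and
    responds; the judge then reads the full transcript and decides."""
    transcript = f"Question: {question}\n"
    for r in range(rounds):
        for i, agent in enumerate(debaters):
            reply = agent(transcript)
            transcript += f"Debater {i + 1} (round {r + 1}): {reply}\n"
    return judge(transcript + "Judge, decide who is right.")
```

The sequential rounds are what drive the latency row in the comparison table below: each debater call must see the previous one's output.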
### 4. Confidence-thresholded
Each agent reports a confidence; consensus is declared only if all agents give the same answer above a threshold. Otherwise escalate to a human.
- Strengths: fail-safe; explicit "I do not know".
- Weaknesses: requires reliable confidence scores (LLMs are notoriously badly calibrated).
- Pick when: human-in-the-loop is acceptable for ambiguous cases.
### 5. Probabilistic consensus
Treat each agent's output as a sample from a distribution. Compute posterior over possible answers; pick the maximum a posteriori (MAP) estimate.
- Strengths: principled treatment of uncertainty.
- Weaknesses: complex; rarely worth the engineering.
- Pick when: high-volume decisions where calibration matters.
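One concrete instantiation, assuming a uniform prior and independent agents (a product-of-experts combination), so the posterior over answers is proportional to the product of each agent's probability for that answer. Working in log space avoids underflow:

```python
import math

def map_consensus(agent_dists: list[dict[str, float]]) -> str:
    """Each agent reports a probability distribution over answers.
    With a uniform prior and independence, posterior(a) is
    proportional to the product over agents of p_i(a); return the
    argmax (the MAP estimate)."""
    answers = set().union(*agent_dists)
    def log_posterior(a: str) -> float:
        # Missing answers get a small floor probability.
        return sum(math.log(d.get(a, 1e-9)) for d in agent_dists)
    return max(answers, key=log_posterior)
```

The independence assumption is the weak point: agents built from the same base model are correlated, which is the diversity problem discussed below.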
## Comparison
| Mechanism | Cost (vs single) | Latency | Best for |
|---|---|---|---|
| Majority vote | N× | Parallel; same | Discrete answers |
| Weighted vote | N× | Parallel; same | Mixed-model |
| Debate-judge | 4–6× | Sequential; longer | Open-ended judgement |
| Confidence-thresholded | N× | Parallel; same | HITL fallback OK |
| Probabilistic | N× + analysis | Parallel; same | High-volume calibration |
## Diversity is the multiplier
A consensus over N identical agents is just one agent paying N times. To get the benefit:
- Vary temperature across instances.
- Vary the prompt framing.
- Vary the model (Opus + Sonnet + Haiku).
- Vary the seed.
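One way to make the variation explicit is to generate the ensemble from a cross-product of the axes above. The model names, framings, and temperature schedule here are illustrative, not a recommendation:

```python
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class AgentConfig:
    model: str         # e.g. "opus", "sonnet" (names illustrative)
    temperature: float
    framing: str       # prompt prefix
    seed: int

MODELS = ["opus", "sonnet", "haiku"]
FRAMINGS = ["Answer directly.",
            "List counterarguments first, then answer."]

# Cross two axes of variation to get a six-agent diverse ensemble,
# with temperature and seed varied per instance.
ensemble = [
    AgentConfig(model=m, temperature=0.3 + 0.2 * i, framing=f, seed=i)
    for i, (m, f) in enumerate(product(MODELS, FRAMINGS))
]
```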
Without these variations the ensemble collapses: every instance lands on the same wrong answer, and N votes carry no more evidence than one.
## Calibration: the hidden problem
Most consensus mechanisms assume the agents' confidence scores mean something. They do not by default. LLMs are systematically overconfident.
Three ways to calibrate:
- Temperature scaling on a held-out set — adjust raw probabilities.
- Self-consistency check — ask the agent N times, measure how often it agrees with itself.
- Calibrator model — train a small model that maps raw confidence to true probability.
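The first two fit in a few lines. The temperature `T` would be fit on held-out data (values above 1 pull overconfident scores back toward 0.5); the `T = 2.0` in the test below is illustrative:

```python
import math
from collections import Counter

def temperature_scale(confidence: float, T: float) -> float:
    """Rescale a raw binary confidence by temperature T.
    Works on the logit: divide by T, then map back through
    the sigmoid. T > 1 softens overconfident scores."""
    logit = math.log(confidence / (1 - confidence))
    return 1 / (1 + math.exp(-logit / T))

def self_consistency(samples: list[str]) -> float:
    """Cheap empirical confidence: re-ask the agent N times and
    measure how often it agrees with its own modal answer."""
    return Counter(samples).most_common(1)[0][1] / len(samples)
```

For example, with `T = 2.0` a raw confidence of 0.99 softens to roughly 0.91 — still high, but no longer "all agents agreed above 0.9" territory by default.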
Without calibration, "all agents agreed with confidence > 0.9" is much weaker evidence than it sounds.
## Cost reality
Consensus is expensive. Typical mechanism costs:
| Mechanism | Cost vs single Opus |
|---|---|
| 3-vote | 3x |
| 5-vote | 5x |
| Debate (3 rounds + judge) | 6x |
| Confidence-thresholded | 3–5x + occasional human |
The return comes from wrong decisions avoided. Consensus pays for itself only when the expected cost of a wrong decision exceeds roughly 10x the consensus overhead.
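The break-even arithmetic can be made explicit. This is a sketch of the expected-value check; all the numbers in the test are hypothetical:

```python
def consensus_worth_it(cost_single: float, n_agents: int,
                       p_wrong_single: float, p_wrong_consensus: float,
                       cost_of_wrong: float) -> bool:
    """Pay (n - 1) extra runs to reduce the probability of an
    expensive wrong decision; worth it when the expected risk
    reduction exceeds the overhead."""
    overhead = cost_single * (n_agents - 1)
    risk_reduction = (p_wrong_single - p_wrong_consensus) * cost_of_wrong
    return risk_reduction > overhead
```

For a $0.50 call, a 3-vote ensemble that cuts the error rate from 5% to 1% is worth it when a wrong decision costs $500 (risk reduction $20 vs $1 overhead), and not when it costs $10.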
## When NOT to use consensus
Three anti-patterns:
- Creative open-ended writing — averaging produces blandness.
- Real-time decisions — even parallel fan-out adds tail latency, and sequential debate multiplies it beyond most budgets.
- Tasks with no verifiable answer — you cannot measure consensus quality.
## Where this is heading
Three trends by 2027: native consensus primitives in the Claude Agent SDK, calibration-as-a-service products that sit between the model and your application, and consensus-aware MCP gateways that route to the right ensemble. Build the basic vote pattern now, swap in better mechanisms as products mature.