Explainer · 4 min read

Agent consensus mechanisms: how multiple LLM agents agree on a critical decision

When one model is not trusted enough, run several and reach consensus. Five consensus mechanisms compared — majority vote, weighted vote, debate-judge, confidence-thresholded, and probabilistic — with the production trade-offs.

For decisions an organisation cannot afford to get wrong, one LLM run is not enough. Multiple runs, reaching consensus, are the only credible answer. Five mechanisms exist; they have very different cost and quality profiles. Here is the comparison and the picking rule.

When consensus matters

Three classes of decision call for it:

  • Irreversible — the action cannot be undone (publish, send, transfer).
  • High-stakes — the cost of being wrong is large (medical, financial, legal).
  • Adversarial — an attacker may probe a single model and exploit its idiosyncrasies.

For chat-style use cases, consensus is overkill. For the above three, single-shot is reckless.

The five mechanisms

1. Majority vote

N agents independently produce an answer; the majority wins. The simplest possible scheme.

  • Strengths: trivial, parallelisable.
  • Weaknesses: ties; minority view may be correct.
  • Pick when: discrete classification with clear options.
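A minimal sketch of the vote, assuming each agent's answer has already been collected into a list. The `tie_breaker` hook is a hypothetical helper for the tie weakness noted above (e.g. escalate to a human or a designated senior model); returning `None` signals an unresolved tie.

```python
from collections import Counter

def majority_vote(answers, tie_breaker=None):
    """Return the most common answer across N agents.

    On a tie, delegate to tie_breaker if given, else return None
    so the caller can escalate instead of guessing.
    """
    top_two = Counter(answers).most_common(2)
    if len(top_two) > 1 and top_two[0][1] == top_two[1][1]:
        return tie_breaker(answers) if tie_breaker else None
    return top_two[0][0]
```

With an odd N and a binary answer space, ties cannot occur; with three or more options they can, which is why the escalation path matters.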

2. Weighted vote

Each agent's vote carries a weight (model size, historical accuracy, confidence score).

  • Strengths: lets you trust Opus more than Haiku.
  • Weaknesses: picking weights is half the problem.
  • Pick when: mixed-model ensembles.
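The weighted variant is a one-line change: sum weights per answer instead of counting votes. A sketch, assuming weights are supplied by the caller (e.g. derived from historical accuracy on a held-out set; how you pick them is, as noted, half the problem):

```python
from collections import defaultdict

def weighted_vote(votes):
    """votes: list of (answer, weight) pairs, one per agent.

    Returns the answer with the highest total weight.
    """
    scores = defaultdict(float)
    for answer, weight in votes:
        scores[answer] += weight
    return max(scores, key=scores.get)
```

Here a single high-weight model can outvote two weaker ones, which is exactly the point of mixed-model ensembles.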

3. Debate-judge

Two or three agents argue; a separate judge decides who is right. See orchestration patterns.

  • Strengths: surfaces edge cases vote misses.
  • Weaknesses: expensive; judge becomes the new single point of failure.
  • Pick when: open-ended judgement (ethical reviews, complex policy).
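A sketch of the orchestration loop, with the LLM calls abstracted as plain callables (the `debaters` and `judge` signatures here are assumptions, not a real SDK API). Each round, every debater sees the transcript so far; the judge then rules on the full transcript:

```python
def debate_judge(question, debaters, judge, rounds=2):
    """debaters: dict of name -> callable(transcript) -> reply.
    judge: callable(transcript) -> final answer.

    Runs `rounds` rounds of debate, then asks the judge to decide.
    """
    transcript = [f"Question: {question}"]
    for r in range(rounds):
        for name, agent in debaters.items():
            reply = agent("\n".join(transcript))
            transcript.append(f"[round {r + 1}] {name}: {reply}")
    return judge("\n".join(transcript))
```

The sequential structure is what makes this the slowest mechanism: each turn depends on the previous one, so latency scales with rounds, not just with N.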

4. Confidence-thresholded

Each agent reports a confidence; consensus is declared only if all agents agree above a threshold. Otherwise, escalate to a human.

  • Strengths: fail-safe; explicit "I do not know".
  • Weaknesses: requires reliable confidence scores (LLMs are notoriously poorly calibrated).
  • Pick when: human-in-the-loop is acceptable for ambiguous cases.
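The fail-safe logic fits in a few lines. A sketch, assuming each agent returns an `(answer, confidence)` pair; `None` is the explicit "I do not know" that routes the case to a human:

```python
def thresholded_consensus(results, threshold=0.9):
    """results: list of (answer, confidence) pairs, one per agent.

    Consensus only if every agent gives the same answer with
    confidence >= threshold; otherwise None (escalate to a human).
    """
    answers = {answer for answer, _ in results}
    if len(answers) == 1 and all(c >= threshold for _, c in results):
        return answers.pop()
    return None
```

Note the conjunction: a single hesitant agent is enough to block consensus, which is conservative by design.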

5. Probabilistic consensus

Treat each agent's output as a sample from a distribution. Compute posterior over possible answers; pick the maximum a posteriori (MAP) estimate.

  • Strengths: principled treatment of uncertainty.
  • Weaknesses: complex; rarely worth the engineering.
  • Pick when: high-volume decisions where calibration matters.
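A toy version of the MAP computation, under a deliberately naive noise model (an assumption of this sketch, not a prescribed method): each agent is right with its calibrated accuracy, and its error mass is spread uniformly over the other answers. Agents are treated as independent, which real ensembles only approximate:

```python
import math

def map_consensus(observations, answer_space, accuracies, prior=None):
    """observations[i]: answer reported by agent i.
    accuracies[i]: calibrated P(agent i is correct).

    Returns the maximum a posteriori answer under a uniform-error
    noise model and (by default) a uniform prior.
    """
    k = len(answer_space)
    prior = prior or {a: 1.0 / k for a in answer_space}
    log_post = {}
    for a in answer_space:
        lp = math.log(prior[a])
        for obs, acc in zip(observations, accuracies):
            lp += math.log(acc if obs == a else (1 - acc) / (k - 1))
        log_post[a] = lp
    return max(log_post, key=log_post.get)
```

Unlike a raw vote, a single very accurate agent can override two mediocre ones here, because the likelihoods, not the head-count, decide.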

Comparison

| Mechanism | Cost (vs single) | Latency | Best for |
|---|---|---|---|
| Majority vote | N x | Parallel; same | Discrete answers |
| Weighted vote | N x | Parallel; same | Mixed-model |
| Debate-judge | 4–6 x | Sequential; longer | Open-ended judgement |
| Confidence-thresholded | N x | Parallel; same | HITL fallback OK |
| Probabilistic | N x + analysis | Parallel; same | High-volume calibration |

Diversity is the multiplier

A consensus over N identical agents is just one agent paying N times. To get the benefit:

  • Vary temperature across instances.
  • Vary the prompt framing.
  • Vary the model (Opus + Sonnet + Haiku).
  • Vary the seed.
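One way to enforce this is to build the ensemble as the cross-product of those axes, so no two instances share a full configuration. A sketch with hypothetical model identifiers and framings (the names are placeholders, not a real API):

```python
import itertools

MODELS = ["opus", "sonnet", "haiku"]        # assumed model identifiers
TEMPERATURES = [0.2, 0.7]
FRAMINGS = [
    "act as a skeptical reviewer",
    "act as a domain expert",
]

def build_ensemble():
    """Every combination of model, temperature and prompt framing,
    so the agents do not all fail in the same way."""
    return [
        {"model": m, "temperature": t, "framing": f}
        for m, t, f in itertools.product(MODELS, TEMPERATURES, FRAMINGS)
    ]
```
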

Without diversity, you get diversity-collapse: every instance lands on the same wrong answer.

Calibration: the hidden problem

Most consensus mechanisms assume the agents' confidence scores mean something. They do not by default. LLMs are systematically overconfident.

Three ways to calibrate:

  • Temperature scaling on a held-out set — adjust raw probabilities.
  • Self-consistency check — ask the agent N times, measure how often it agrees with itself.
  • Calibrator model — train a small model that maps raw confidence to true probability.
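The self-consistency check is the cheapest of the three to implement: sample the same question N times at nonzero temperature and use the agreement rate as an empirical confidence. A minimal sketch:

```python
from collections import Counter

def self_consistency(samples):
    """samples: answers from asking the same question N times.

    Returns (modal_answer, agreement_rate). The agreement rate is a
    cheap empirical proxy for confidence, often better behaved than
    the model's own stated confidence.
    """
    answer, count = Counter(samples).most_common(1)[0]
    return answer, count / len(samples)
```

An agreement rate of 0.75 over eight samples is a very different signal from a self-reported "0.95 confident" on a single run.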

Without calibration, "all agents agreed with confidence > 0.9" is much weaker evidence than it sounds.

Cost reality

Consensus is expensive. Typical mechanism costs:

| Mechanism | Cost vs single Opus |
|---|---|
| 3-vote | 3x |
| 5-vote | 5x |
| Debate (3 rounds + judge) | 6x |
| Confidence-thresholded | 3–5x + occasional human |

The savings come from avoiding the cost of a wrong decision. Consensus is worth it only when a wrong decision costs more than roughly 10x the consensus overhead.
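That break-even rule is just an expected-value comparison. A sketch with illustrative numbers (the error-rate improvement is an assumption you would have to measure for your own ensemble):

```python
def consensus_worth_it(p_wrong_single, p_wrong_consensus,
                       cost_of_error, cost_single, cost_consensus):
    """Consensus pays off when the reduction in expected error cost
    exceeds the extra inference spend."""
    saved = (p_wrong_single - p_wrong_consensus) * cost_of_error
    extra = cost_consensus - cost_single
    return saved > extra
```

For example, cutting a 5% error rate to 1% on a $10,000-cost decision saves $400 in expectation, which dwarfs a few dollars of extra inference; the same improvement on a $100 decision does not.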

When NOT to use consensus

Three anti-patterns:

  • Creative open-ended writing — averaging produces blandness.
  • Real-time decisions — N x latency may not fit the budget.
  • Tasks with no verifiable answer — you cannot measure consensus quality.

Where this is heading

Three trends by 2027: native consensus primitives in the Claude Agent SDK, calibration-as-a-service products that sit between the model and your application, and consensus-aware MCP gateways that route to the right ensemble. Build the basic vote pattern now, swap in better mechanisms as products mature.

© 2026 Loadout. Built on Angular 21 SSR.