Procurement teams cannot yet approve agents the way lenders approve credit lines: on a single number that summarises trust. The "capability score" (analogous to a credit score, and built from components) is taking shape in 2026. Here is what it measures, how it is built, and why your procurement deals will run on it by 2027.
Why a single number
Three audiences need it:
- Procurement — comparing vendors at a glance.
- Risk and compliance — gating deployment by score floor.
- End users — surfaced in marketplaces, like app store ratings.
A single number is reductive. It is also useful when the alternative is reading a 50-page report for every agent.
The components
A composite score combines five sub-scores:
1. Functional capability (0–100)
Does it do what it claims? Measured by benchmark suite pass rate. See benchmark suite design.
2. Reliability (0–100)
How often does it work? Measured by SLO adherence over the last 90 days.
3. Safety (0–100)
Does it refuse correctly? Measured by adversarial test pass rate. See red teaming.
4. Compliance (0–100)
Does it satisfy the regulations it claims to? Scored from audit evidence.
5. Operational maturity (0–100)
Does it have observability, audit logging, and governance in place? See governance framework.
The composite is a weighted average; weights vary by use case.
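A minimal sketch of that weighted average, assuming every sub-score is already on a 0–100 scale and that weights are normalised by their sum. The function name, key names, and the "safety-heavy" weights are illustrative assumptions, not part of any published methodology; the sub-score values are borrowed from the sample capability card in this article.

```python
from typing import Mapping

# The five sub-scores named in this article (key names are illustrative).
SUB_SCORES = (
    "functional",
    "reliability",
    "safety",
    "compliance",
    "operational_maturity",
)


def composite_score(sub_scores: Mapping[str, float],
                    weights: Mapping[str, float]) -> float:
    """Return the weighted average of the five sub-scores (0-100 scale)."""
    missing = [n for n in SUB_SCORES if n not in sub_scores or n not in weights]
    if missing:
        raise ValueError(f"missing sub-scores or weights: {missing}")
    total_weight = sum(weights[n] for n in SUB_SCORES)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    weighted_sum = sum(sub_scores[n] * weights[n] for n in SUB_SCORES)
    return weighted_sum / total_weight


# Sub-score values taken from the sample capability card.
example = {
    "functional": 88,
    "reliability": 82,
    "safety": 91,
    "compliance": 78,
    "operational_maturity": 81,
}

# Equal weighting across all five components.
equal_weights = {name: 1.0 for name in SUB_SCORES}

# Hypothetical weighting for a use case that cares most about safety
# and compliance; these numbers are an assumption for illustration.
safety_heavy = {
    "functional": 1,
    "reliability": 1,
    "safety": 3,
    "compliance": 3,
    "operational_maturity": 1,
}

print(composite_score(example, equal_weights))           # 84.0
print(round(composite_score(example, safety_heavy), 1))  # 84.2
```

The same sub-scores produce different composites under different weights, which is why a composite is only comparable across vendors when the weighting is disclosed alongside it.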
Sample capability card
Agent: support-bot-pro
Vendor: Acme AI
Composite Score: 84 / 100
Functional 88/100 Excellent on standard support tasks
Reliability 82/100 99.2% uptime over 90 days
Safety 91/100 Strong refusal on adversarial inputs
Compliance 78/100 SOC2 Type II; GDPR controls audited
Op. Maturity 81/100 Full audit log; mature observability
Last assessed: 2026-04-15
Next review: 2026-07-15
Methodology: AgentScore v2.1
Like a credit report, the composite is the headline; the breakdown is the substance.
Who computes the score
Three patterns are emerging in 2026:
Self-assessed
The vendor publishes its own score against a published methodology. Cheapest, lowest trust.
Independent assessor
A third party assesses against a methodology. Trust depends on assessor reputation.
Standards body
A formal body (an industry consortium or a regulator-endorsed organisation) issues scores. Highest trust, slowest cadence.
Standards-body scoring will dominate by 2028. Independent assessment is the bridge.
How procurement will use it
Three patterns:
- Score floor in RFPs — minimum composite required to bid.
- Component thresholds — minimum on safety + compliance, regardless of composite.
- Tier-based procurement — different score floors for different use-case tiers.
Vendors below the floor are filtered out at intake, so a composite above the floor becomes a moat.
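The three patterns above can be combined into one intake check. The sketch below is an assumption about how a buyer might encode them: the tier names, floor values, and the `TierPolicy` and `intake_failures` names are hypothetical, chosen only to illustrate the idea.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Mapping


@dataclass(frozen=True)
class TierPolicy:
    """Score floors for one use-case tier (all values are hypothetical)."""
    composite_floor: float
    component_floors: Dict[str, float] = field(default_factory=dict)


# Hypothetical tiers: stricter floors for higher-risk use cases.
POLICIES = {
    "low_risk": TierPolicy(composite_floor=70),
    "customer_facing": TierPolicy(80, {"safety": 85, "compliance": 75}),
    "regulated": TierPolicy(85, {"safety": 90, "compliance": 90}),
}


def intake_failures(composite: float,
                    sub_scores: Mapping[str, float],
                    policy: TierPolicy) -> List[str]:
    """Return the reasons a bid fails the policy; an empty list means it passes."""
    failures = []
    if composite < policy.composite_floor:
        failures.append(
            f"composite {composite} below floor {policy.composite_floor}")
    # Component thresholds apply regardless of the composite.
    for name, floor in policy.component_floors.items():
        score = sub_scores.get(name)
        if score is None:
            failures.append(f"{name} sub-score not disclosed")
        elif score < floor:
            failures.append(f"{name} {score} below floor {floor}")
    return failures


# Values from the sample capability card in this article.
card_sub_scores = {"safety": 91, "compliance": 78}
for tier, policy in POLICIES.items():
    print(tier, intake_failures(84, card_sub_scores, policy))
```

With these assumed floors, the sample agent clears the two lower tiers but fails the "regulated" tier on both the composite and the compliance threshold. Treating an undisclosed sub-score as a failure is a design choice here, not something the article prescribes.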
Limitations of single-number scores
Three failure modes:
Gaming
Composites with a public methodology get gamed. The methodology must evolve faster than the gaming techniques do.
False precision
"84/100" sounds objective; it is not. Always disclose methodology and confidence intervals.
Use-case mismatch
A high score on a support-bot benchmark does not mean the agent is good for medical triage. Match scores to use cases.
Mature scoring schemes acknowledge these limitations explicitly.
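To make the false-precision point concrete, here is a minimal sketch of a confidence interval around a pass-rate sub-score. It assumes the sub-score is a simple pass rate over independent test cases and uses the Wilson score interval, which is one common choice; the article does not specify an interval method, and the test counts below are made up for illustration.

```python
import math
from typing import Tuple


def wilson_interval(passes: int, trials: int,
                    z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a pass rate (z=1.96 gives roughly 95%)."""
    if trials <= 0 or not 0 <= passes <= trials:
        raise ValueError("need trials > 0 and 0 <= passes <= trials")
    p = passes / trials
    z2 = z * z
    denom = 1 + z2 / trials
    centre = (p + z2 / (2 * trials)) / denom
    half_width = (z * math.sqrt(p * (1 - p) / trials
                                + z2 / (4 * trials * trials))) / denom
    # Clamp to [0, 1] to guard against floating-point overshoot.
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


# Illustrative counts only: the same 84% pass rate on two test-set sizes.
small_lo, small_hi = wilson_interval(84, 100)
large_lo, large_hi = wilson_interval(840, 1000)
print(f"100 tests:  {small_lo:.3f} to {small_hi:.3f}")
print(f"1000 tests: {large_lo:.3f} to {large_hi:.3f}")
```

On 100 test cases, an 84% pass rate is compatible with a true rate anywhere from roughly 76% to 90%; the interval narrows as the test set grows. That is the kind of context a bare "84/100" hides.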
What buyers should ask
Beyond the score:
- Which methodology version?
- When was it last assessed?
- What was the test set composition?
- Are sub-scores available for inspection?
- Is there an appeals process?
Vendors that resist these questions are, in effect, publishing self-assessed scores under a marketing label.
What sellers should publish
To stay competitive:
- Score by methodology version — show progression over time.
- Sub-score breakdowns — buyers will look.
- Methodology compliance evidence — links to test sets, audits, attestations.
- Update cadence — quarterly minimum.
This is your capability marketing in 2027.
Score evolution
Scores must update:
- On version change (new model, new prompt, new tools).
- On regulation change (compliance sub-score moves).
- On periodic re-assessment.
A score frozen for 12 months is not a current claim.
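The three update triggers above can be checked mechanically. This is a sketch under stated assumptions: the `Assessment` fields, the `regulation_set` label, and the 90-day default (standing in for a quarterly cadence) are all hypothetical, not part of any scoring methodology.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List


@dataclass(frozen=True)
class Assessment:
    """What was true when the score was issued (field names are hypothetical)."""
    assessed_on: date
    agent_version: str    # covers model, prompt, and tool changes
    regulation_set: str   # label for the regulations in force at assessment


def reassessment_reasons(assessment: Assessment,
                         deployed_version: str,
                         current_regulation_set: str,
                         today: date,
                         max_age: timedelta = timedelta(days=90)) -> List[str]:
    """Return the reasons the score is stale; an empty list means it is current."""
    reasons = []
    if deployed_version != assessment.agent_version:
        reasons.append("version change")
    if current_regulation_set != assessment.regulation_set:
        reasons.append("regulation change")
    if today - assessment.assessed_on > max_age:
        reasons.append("periodic re-assessment due")
    return reasons


# Assessment date from the sample capability card; other values are made up.
card = Assessment(date(2026, 4, 15), "v3.2", "2026-Q1")

print(reassessment_reasons(card, "v3.2", "2026-Q1", date(2026, 6, 1)))
print(reassessment_reasons(card, "v3.3", "2026-Q1", date(2026, 8, 1)))
```

The first call returns an empty list because nothing has changed and the assessment is under 90 days old. The second flags both the version change and the overdue periodic review.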
Common mistakes
- Self-published without a methodology, so buyers ignore it.
- Composite without sub-scores — opaque; useless for risk decisions.
- Static scores — go stale fast.
- Ignoring the score in your own product team — your agent has one; you should know it.
Where this is heading
Three trends to expect by 2027: industry-specific scoring methodologies (FinScore for fintech agents, ClinScore for clinical agents), regulatory recognition of certified scores, and procurement platforms that auto-filter on scores. Build operations that score well; the scoring infrastructure is coming.