"The agent said no" is no longer an acceptable answer to a customer or a regulator. EU AI Act, GDPR Article 22, and basic UX all push toward explainable decisions. Here are the four primitives that move agents from black-box to citable reasoning.
What "explainable" actually requires
Three audiences want different things:
- End user — "Why did the agent recommend this plan?" One paragraph in plain language.
- Operator — "What did the agent see when it decided?" Full trace with context.
- Regulator — "Demonstrate the agent did not discriminate against this user." Comparable runs across cohorts.
The four primitives below cover all three.
Primitive 1: chain-of-thought capture
Capture the reasoning trace on every agent turn, not just the final output. Store it alongside the audit log, queryable on user request.
- What to store: the model's intermediate reasoning, not raw token streams.
- What to redact: any PII the reasoning happens to mention.
- How to surface: a "show reasoning" UI affordance for the operator view.
For end-user explanations, summarise the chain into one paragraph. Do not show raw chain-of-thought to users — it is verbose and frequently embarrassing.
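A minimal sketch of what the stored record might look like, in Python. The field names and the redact_pii hook are illustrative assumptions, not an established schema:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

def redact_pii(text: str) -> str:
    """Placeholder: swap in your real PII scrubber (regex, NER, or a vendor API)."""
    return text

@dataclass
class ReasoningTrace:
    turn_id: str
    model: str
    reasoning: str      # intermediate reasoning, not raw token streams
    final_output: str
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

    @classmethod
    def capture(cls, turn_id: str, model: str, reasoning: str, final_output: str):
        # Redact before persisting, so PII never lands in the audit store.
        return cls(turn_id, model, redact_pii(reasoning), final_output)
```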
Primitive 2: tool-call attribution
For each claim in the final output, link it to the tool calls that produced the supporting evidence.
"Your premium increased because of two new claims filed this year."
← attributed to: get_claims(user_id) call at t+2.4s
← attributed to: pricing_lookup(state, claims) call at t+3.1s
Storage: a side table linking output spans to call IDs. Surfacing: a hover tooltip on the response, or an expandable "what informed this" panel.
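One way to shape that side table, sketched in Python. The field names and the two example rows (mirroring the premium example above) are assumptions:

```python
from dataclasses import dataclass

@dataclass
class ToolAttribution:
    turn_id: str       # foreign key into the audit log
    span_start: int    # character offsets of the claim in the final output
    span_end: int
    tool_call_id: str  # ID of the tool call that produced the evidence
    tool_name: str     # e.g. "get_claims", "pricing_lookup"
    offset_ms: float   # when the call happened, relative to turn start

# The premium example above becomes two rows against one output span:
rows = [
    ToolAttribution("turn_481", 0, 62, "call_01", "get_claims", 2400.0),
    ToolAttribution("turn_481", 0, 62, "call_02", "pricing_lookup", 3100.0),
]
```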
Primitive 3: citation graphs
Every retrieval-augmented response must cite its sources. The citation graph captures which retrieved chunks supported which claims, with confidence scores.
```json
{
  "claim": "Coverage starts on the policy date.",
  "citations": [
    { "chunk_id": "policy_doc:p3", "score": 0.94 },
    { "chunk_id": "faq:coverage_start", "score": 0.87 }
  ]
}
```
A post-pass verifier confirms the citations actually support the claim. See the hallucination detection guide.
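A sketch of that post-pass, assuming a verify_entailment helper backed by an NLI model or a cheap LLM call; neither function is a real library API:

```python
def verify_entailment(claim: str, chunk_text: str) -> bool:
    """Placeholder: answers 'does this chunk actually support this claim?'
    Back it with an NLI model or a small LLM judge."""
    ...

def verify_citations(claim: str, citations: list[dict], get_chunk) -> list[dict]:
    """Keep only citations whose chunks support the claim.
    `get_chunk` maps a chunk_id to its stored text."""
    verified = []
    for citation in citations:
        if verify_entailment(claim, get_chunk(citation["chunk_id"])):
            verified.append(citation)
    return verified  # empty list => citation theatre; flag before shipping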
Primitive 4: counterfactual probes
Run the same decision with one variable changed. Did changing the user's gender, age, or zip code change the answer? If yes, that is either a feature or a bug — but you can show the comparison.
```
Original: deny coverage (income $40k, zip 90210)
Counterfactual: approve coverage (income $40k, zip 10001)
Δ: zip code change flipped the decision
```
Run these probes periodically against your eval set and surface anomalies; they are indispensable for fairness audits.
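A sketch of a single probe, assuming a decide function that maps a feature dict to a decision; everything here is illustrative:

```python
from copy import deepcopy

def counterfactual_probe(decide, features: dict, variable: str, alt_value):
    """Re-run one decision with a single variable changed and report the delta."""
    original = decide(features)
    probed = deepcopy(features)
    probed[variable] = alt_value
    counterfactual = decide(probed)
    return {
        "variable": variable,
        "original": original,
        "counterfactual": counterfactual,
        "flipped": original != counterfactual,  # surface these in fairness audits
    }

# The zip-code example above:
# counterfactual_probe(decide, {"income": 40_000, "zip": "90210"}, "zip", "10001")
```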
What good user-facing explanation looks like
A working pattern:
We recommended Plan B because:
- You said you want a single-user plan (you mentioned this earlier in the conversation).
- Plan B has a lower per-seat cost than Plan A at that seat count.
- Plan A's overage charges would apply given your usage pattern.
[Show me the source data] | [This is wrong]
Three sentences. Linked to source. With escape hatches.
Storage cost
Per agent turn, explainability storage adds:
- Chain-of-thought: ~500 tokens of text.
- Tool attribution: ~50 bytes per call.
- Citation graph: ~200 bytes per claim.
- Counterfactuals: optional, only on request or scheduled audit.
At scale, low-single-digit cents per active user per month. Cheap relative to the model bill.
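A back-of-the-envelope check on that figure. The turn volume, retention window, and storage price below are all assumptions, as is the 4-bytes-per-token estimate:

```python
# Rough per-turn bytes, from the list above (1 token ~= 4 bytes of text).
chain_of_thought = 500 * 4   # ~2,000 bytes of reasoning text
tool_attribution = 5 * 50    # assume ~5 tool calls per turn
citation_graph = 10 * 200    # assume ~10 cited claims per turn
bytes_per_turn = chain_of_thought + tool_attribution + citation_graph  # ~4,250

turns_per_month = 1_000      # assumption: a heavy active user
retention_months = 12        # assumption: how long traces are kept
price_per_gb_month = 0.25    # assumption: indexed database storage

steady_state_gb = bytes_per_turn * turns_per_month * retention_months / 1e9
print(f"${steady_state_gb * price_per_gb_month:.3f}/user/month")  # ~$0.013
```

Cheaper object-storage rates drop this by another order of magnitude, so low-single-digit cents is a comfortable ceiling.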
Compliance hooks
GDPR Article 22, read with the transparency duties in Articles 13–15, requires:
- "Meaningful information about the logic involved" — covered by user-facing explanation.
- "Significance and envisaged consequences" — covered by counterfactuals.
- "Right to obtain human intervention" — UI affordance, not data.
For high-risk systems, the EU AI Act requires the operator-level trace. Both regulations are satisfied by the same primitives, surfaced differently.
Common mistakes
- Showing raw chain-of-thought to users — frequently exposes prompt internals or PII.
- Citation theatre — citations that do not actually support the claim. Verify.
- No counterfactuals — until the regulator asks; build them in advance.
- Explainability only for negative outcomes — users want it for "why this recommendation" too.
Where this is heading
Two trends to watch by 2027: native explainability primitives in the Claude Agent SDK (capture chain, attribute tools, render citations as one block), and standardised explainability schemas in industry compliance frameworks. Build the primitives now, swap implementations later.