
Agent quality assurance pipeline: the QA discipline that did not exist three years ago

Quality assurance for agents is not the QA you know. The discipline that has taken shape over the past two years has its own roles, its own tools, and its own metrics. Here is the pipeline that catches regressions before users do.

QA for deterministic software has 30 years of practice behind it. QA for agents is roughly two years old. The teams that built it well converged on patterns the field is only now writing down. Here is the pipeline that earns an agent the label "production-ready".

What changed

Classical QA assumes:

  • Output is the same for the same input.
  • Bugs reproduce.
  • Coverage is measurable in lines or branches.

None of these hold for agents. The new discipline replaces them.
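
Concretely: because the same input can yield different outputs, tests assert properties of the output rather than exact strings. A minimal sketch in TypeScript, assuming a hypothetical JSON reply shape with `answer` and `citations` fields:

```ts
// property-check.ts -- non-deterministic output means asserting invariants,
// not exact strings. The reply shape below is a hypothetical example.
interface AgentReply {
  answer: string;
  citations: string[];
}

// Every acceptable reply must satisfy these properties, whatever its wording.
function satisfiesInvariants(raw: string): boolean {
  let reply: AgentReply;
  try {
    reply = JSON.parse(raw);
  } catch {
    return false; // must be valid JSON
  }
  return (
    typeof reply.answer === "string" &&
    reply.answer.length > 0 &&
    Array.isArray(reply.citations) &&
    reply.citations.every((c) => typeof c === "string" && c.startsWith("https://"))
  );
}

console.log(satisfiesInvariants('{"answer":"42","citations":["https://example.com"]}')); // true
console.log(satisfiesInvariants("plain text, no structure")); // false
```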

The pipeline

Five stages, each running on every change:

Stage 1: pre-commit

Linting prompts (length, structure), tool-schema validation, dry-run against a tiny smoke set.
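
A sketch of what that gate can look like. The `prompts/` and `tools/` layout, the 4,000-character budget, and the required schema fields are all assumptions to adapt:

```ts
// pre-commit-check.ts -- hypothetical pre-commit gate; layout and limits are assumptions.
import { readFileSync, readdirSync } from "node:fs";
import { join } from "node:path";

const MAX_PROMPT_CHARS = 4000; // assumed budget; tune per agent

let failures = 0;

// Lint prompts: flag empty files and prompts over the length budget.
for (const file of readdirSync("prompts")) {
  const text = readFileSync(join("prompts", file), "utf8");
  if (text.trim().length === 0) {
    console.error(`${file}: prompt is empty`);
    failures++;
  }
  if (text.length > MAX_PROMPT_CHARS) {
    console.error(`${file}: ${text.length} chars exceeds budget of ${MAX_PROMPT_CHARS}`);
    failures++;
  }
}

// Validate tool schemas: every tool declaration needs a name, description,
// and a parameters object (shape assumed here; match your SDK's format).
for (const file of readdirSync("tools")) {
  const tool = JSON.parse(readFileSync(join("tools", file), "utf8"));
  for (const field of ["name", "description", "parameters"]) {
    if (!(field in tool)) {
      console.error(`${file}: missing required field "${field}"`);
      failures++;
    }
  }
}

process.exit(failures > 0 ? 1 : 0); // non-zero exit aborts the commit
```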

Stage 2: pull-request

Full eval against the benchmark suite. Comparison against the baseline. Block on regression.
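
The blocking logic itself is small. A sketch, assuming the eval harness emits per-task-type score maps as JSON; the file names and the one-point threshold are illustrative:

```ts
// regression-gate.ts -- block the PR when any task type regresses past a threshold.
// File names and the threshold are assumptions; adapt to your eval harness output.
import { readFileSync } from "node:fs";

const THRESHOLD = 1.0; // max allowed score drop per task type, in points

const baseline: Record<string, number> = JSON.parse(readFileSync("baseline-scores.json", "utf8"));
const candidate: Record<string, number> = JSON.parse(readFileSync("candidate-scores.json", "utf8"));

// A task type regresses if its candidate score drops more than THRESHOLD
// below baseline, or disappears from the results entirely.
const regressions = Object.entries(baseline)
  .filter(([task, base]) => (candidate[task] ?? 0) < base - THRESHOLD)
  .map(([task, base]) => `${task}: ${base} -> ${candidate[task] ?? "missing"}`);

if (regressions.length > 0) {
  console.error("Regressions vs. baseline:\n" + regressions.join("\n"));
  process.exit(1); // non-zero exit blocks the merge in CI
}
console.log("No regressions beyond threshold; PR may merge.");
```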

Stage 3: pre-deploy

Shadow traffic against production for N days. Diff outputs; surface where the new version disagrees materially.

Stage 4: canary

1–10% of production traffic. Watch error-rate dashboards and key business metrics.
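
Two pieces make a canary safe: a deterministic traffic split, so each user consistently sees one version, and an automatic rollback trigger. A sketch; the 5% slice and the 2% error ceiling are assumptions:

```ts
// canary-routing.ts -- deterministic traffic split plus a simple rollback check.
// The 5% slice, the error-rate ceiling, and the thresholds are assumptions.
import { createHash } from "node:crypto";

const CANARY_FRACTION = 0.05;    // 5% of traffic, within the 1-10% band
const ERROR_RATE_CEILING = 0.02; // roll back if canary errors exceed 2%

// Hash the user id so the same user always lands in the same bucket.
function isCanaryUser(userId: string): boolean {
  const hash = createHash("sha256").update(userId).digest();
  return hash.readUInt32BE(0) / 0xffffffff < CANARY_FRACTION;
}

// Called by the dashboard poller; errs on the side of rolling back.
function shouldRollBack(canaryErrors: number, canaryRequests: number): boolean {
  if (canaryRequests === 0) return false;
  return canaryErrors / canaryRequests > ERROR_RATE_CEILING;
}

console.log(isCanaryUser("user-1234")); // stable per user across requests
console.log(shouldRollBack(7, 200));    // true: 3.5% exceeds the 2% ceiling
```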

Stage 5: post-deploy

Continuous regression testing. Drift detection. Sample reviews by human raters.
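
Drift detection can start very simple. A sketch that alerts when the recent mean quality score moves more than two baseline standard deviations; the rule and the numbers are illustrative:

```ts
// drift-check.ts -- flag drift when the recent mean quality score moves more
// than two baseline standard deviations. Window and scores are illustrative.
function mean(xs: number[]): number {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

function stddev(xs: number[]): number {
  const m = mean(xs);
  return Math.sqrt(xs.reduce((a, x) => a + (x - m) ** 2, 0) / xs.length);
}

function hasDrifted(baseline: number[], recent: number[]): boolean {
  const shift = Math.abs(mean(recent) - mean(baseline));
  return shift > 2 * stddev(baseline); // assumed alerting rule; tune to taste
}

// Example: scores sampled at launch vs. this week's production samples.
const launchScores = [0.82, 0.79, 0.85, 0.8, 0.83, 0.81, 0.84];
const weekScores = [0.71, 0.74, 0.7, 0.73, 0.72, 0.69, 0.75];
console.log(hasDrifted(launchScores, weekScores)); // true: quality has slipped
```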

Skip any stage and bugs reach users.

Roles in the new QA team

Three roles emerged in 2026:

Eval engineer

Owns the eval suite, scoring, and CI integration. Often a former data engineer.

Agent quality analyst

Reviews production samples, labels them, feeds the eval suite. Part QA, part product analyst.

Red teamer

Proactively breaks the agent. Adversarial prompts, edge cases, abuse patterns. See red teaming.

A team of six engineers ships an agent comfortably with one of each. Larger teams scale these roles roughly linearly with headcount.

The tools they use

Five categories, mapping onto the pipeline stages:

  • Prompt linters and tool-schema validators (pre-commit).
  • Eval harnesses with CI integration (pull-request).
  • Shadow-traffic capture and diff services (pre-deploy).
  • Error-rate and drift-detection dashboards (canary and post-deploy).
  • Annotation interfaces for human raters (post-deploy).

Most teams DIY the annotation interface; everything else is increasingly off-the-shelf.

What good looks like

Five metrics for the QA team itself:

  • Coverage — % of production task types represented in the eval suite.
  • Detection latency — time from regression introduction to QA flagging it.
  • False-positive rate of eval — how often the eval flags a regression that turns out not to be one.
  • Production incident rate — escapes that QA missed.
  • Human label throughput — how many production samples reviewed per week.

Track quarterly. Below-target metrics drive the next quarter's investments.
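
Two of these are straightforward to compute once the records exist. A sketch, with assumed record shapes:

```ts
// qa-metrics.ts -- two of the five metrics, computed from assumed record shapes.
interface RegressionRecord {
  introducedAt: Date; // commit that caused the regression
  flaggedAt: Date;    // when QA (eval or human review) caught it
}

// Coverage: share of production task types represented in the eval suite.
function coverage(productionTaskTypes: Set<string>, evalTaskTypes: Set<string>): number {
  const covered = [...productionTaskTypes].filter((t) => evalTaskTypes.has(t));
  return covered.length / productionTaskTypes.size;
}

// Detection latency: median hours between a regression landing and QA flagging it.
function medianDetectionLatencyHours(records: RegressionRecord[]): number {
  const hours = records
    .map((r) => (r.flaggedAt.getTime() - r.introducedAt.getTime()) / 3_600_000)
    .sort((a, b) => a - b);
  const mid = Math.floor(hours.length / 2);
  return hours.length % 2 ? hours[mid] : (hours[mid - 1] + hours[mid]) / 2;
}

console.log(coverage(new Set(["search", "summarise", "book"]), new Set(["search", "summarise"])));
// 0.666…: one production task type has no eval representation yet
```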

The shadow-traffic pattern

Before any production deploy, run the new version against the same inputs as production:

real user request
   ↓
production agent (current version)  → returned to user
   ↓
shadow agent (new version)          → captured, not returned
   ↓
diff service: compare outputs
   ↓
flag material disagreements for review

A week of shadow traffic catches more bugs than a month of offline testing.
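
What counts as "material" is the diff service's judgment call. Many teams use an LLM judge for it; the sketch below substitutes a cruder lexical proxy (token-set Jaccard similarity) just to show the shape, with an assumed 0.6 floor:

```ts
// diff-service.ts -- flag shadow outputs that disagree materially with production.
// "Material" is approximated by token-set Jaccard similarity here; the 0.6
// floor is an assumption you would calibrate on labelled examples.
const SIMILARITY_FLOOR = 0.6;

function tokens(text: string): Set<string> {
  return new Set(text.toLowerCase().split(/\s+/).filter(Boolean));
}

function jaccard(a: string, b: string): number {
  const ta = tokens(a);
  const tb = tokens(b);
  const inter = [...ta].filter((t) => tb.has(t)).length;
  const union = new Set([...ta, ...tb]).size;
  return union === 0 ? 1 : inter / union;
}

interface ShadowPair {
  requestId: string;
  production: string; // returned to the user
  shadow: string;     // captured, never returned
}

function flagForReview(pairs: ShadowPair[]): ShadowPair[] {
  return pairs.filter((p) => jaccard(p.production, p.shadow) < SIMILARITY_FLOOR);
}

console.log(flagForReview([
  { requestId: "r1", production: "Your order ships Tuesday.", shadow: "Your order ships Tuesday." },
  { requestId: "r2", production: "Refund approved.", shadow: "Refund denied per policy." },
]).map((p) => p.requestId)); // ["r2"]
```

Flagged pairs land in the agent quality analyst's review queue.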

What QA cannot catch

Three categories of bug that survive any pipeline:

  • Long-tail rare inputs — the eval set never had them.
  • Real-world drift — user behaviour shifts after launch.
  • Cross-agent interactions — two agents that pass solo fail together.

These are caught by production monitoring, not pre-deploy QA. Both layers are needed.

Process specifics

Two non-negotiables:

  • Every prompt change goes through QA. Even one-word changes.
  • Every tool change goes through QA. A tool whose signature changes can break agents that previously called it correctly (see the schema-diff sketch below).
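
Catching a signature change is mechanical. A sketch that diffs two versions of a tool's parameter schema; the JSON-Schema-like shape is an assumption, so match your SDK's actual format:

```ts
// schema-diff.ts -- detect breaking changes between two versions of a tool's
// parameter schema. The "properties"/"required" shape is an assumption.
interface ToolSchema {
  properties: Record<string, { type: string }>;
  required: string[];
}

function breakingChanges(oldS: ToolSchema, newS: ToolSchema): string[] {
  const issues: string[] = [];
  // Removed or retyped parameters break existing callers.
  for (const [name, spec] of Object.entries(oldS.properties)) {
    const next = newS.properties[name];
    if (!next) issues.push(`parameter removed: ${name}`);
    else if (next.type !== spec.type) issues.push(`type changed: ${name} (${spec.type} -> ${next.type})`);
  }
  // Newly required parameters break callers that omit them.
  for (const name of newS.required) {
    if (!oldS.required.includes(name)) issues.push(`newly required: ${name}`);
  }
  return issues; // non-empty => route the change through the full pipeline
}

const v1: ToolSchema = { properties: { query: { type: "string" } }, required: ["query"] };
const v2: ToolSchema = {
  properties: { query: { type: "string" }, limit: { type: "number" } },
  required: ["query", "limit"],
};
console.log(breakingChanges(v1, v2)); // ["newly required: limit"]
```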

The discipline is: nothing reaches production without the pipeline.

Failure modes of the QA team itself

Three patterns that hollow out QA over time:

  • QA bypassed for "urgent" changes — happens once, becomes routine.
  • Eval suite goes stale — coverage drops as features grow.
  • No connection to incident response — incidents do not feed back into the suite.

A quarterly retrospective on the QA process itself prevents this drift.

Common mistakes

  • One-eval-fits-all — different agents need different evals.
  • Treating QA as a gate, not a partner — best results when QA pairs with engineering.
  • No human in the loop — automated scores miss what users actually care about.
  • Skipping shadow traffic — the highest-leverage stage gets dropped first under deadline pressure.

Where this is heading

Three trends by 2027: vendor-provided QA suites for major agent SDKs, professional certifications for agent QA engineers, and shared-corpus benchmarks across organisations in the same vertical.
