QA for deterministic software has 30 years of practice behind it. QA for agents is two years old. The teams that do it well share patterns the rest of the field is only beginning to publish. Here is the pipeline that earns an agent the label "production-ready".
What changed
Classical QA assumes:
- Output is the same for the same input.
- Bugs reproduce.
- Coverage is measurable in lines or branches.
None of these hold for agents. The new discipline replaces them.
The pipeline
Five stages, each running on every change:
Stage 1: pre-commit
Linting prompts (length, structure), tool-schema validation, dry-run against a tiny smoke set.
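The pre-commit checks above can be sketched in a few lines. This is a minimal illustration, not a real tool: the limits, section names, and required schema keys are all assumptions for the example.

```python
# Hypothetical pre-commit checks: prompt lint plus tool-schema validation.
# MAX_PROMPT_CHARS and REQUIRED_SECTIONS are illustrative, not from the text.

MAX_PROMPT_CHARS = 8000
REQUIRED_SECTIONS = ("## Role", "## Tools", "## Constraints")

def lint_prompt(prompt: str) -> list[str]:
    """Return a list of lint errors; an empty list means the prompt passes."""
    errors = []
    if len(prompt) > MAX_PROMPT_CHARS:
        errors.append(f"prompt too long: {len(prompt)} > {MAX_PROMPT_CHARS}")
    for section in REQUIRED_SECTIONS:
        if section not in prompt:
            errors.append(f"missing section: {section}")
    return errors

def validate_tool_schema(schema: dict) -> list[str]:
    """Minimal structural check on a JSON-schema-style tool definition."""
    errors = []
    for key in ("name", "description", "parameters"):
        if key not in schema:
            errors.append(f"tool schema missing '{key}'")
    return errors
```

Both checks return error lists rather than raising, so a pre-commit hook can print every problem in one pass.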
Stage 2: pull-request
Full eval against the benchmark suite. Comparison against the baseline. Block on regression.
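"Block on regression" can be as simple as comparing per-task scores against the baseline with a tolerance. A sketch, assuming eval results arrive as task-to-score mappings; the tolerance value is an assumption:

```python
# Illustrative PR regression gate: compare candidate eval scores to the
# baseline and block the merge if any task drops beyond the tolerance.

def regression_gate(baseline: dict[str, float],
                    candidate: dict[str, float],
                    tolerance: float = 0.02) -> list[str]:
    """Return per-task regressions; an empty list means the PR may merge."""
    regressions = []
    for task, base_score in baseline.items():
        cand_score = candidate.get(task, 0.0)  # a missing task counts as a failure
        if cand_score < base_score - tolerance:
            regressions.append(f"{task}: {base_score:.2f} -> {cand_score:.2f}")
    return regressions
```

In CI, a non-empty return value fails the build and the list itself becomes the PR comment.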
Stage 3: pre-deploy
Shadow traffic against production for N days. Diff outputs; surface where the new version disagrees materially.
Stage 4: canary
1–10% of production traffic. Watch the error rate dashboards and key business metrics.
Stage 5: post-deploy
Continuous regression testing. Drift detection. Sample reviews by human raters.
Skip any stage and bugs reach users.
Roles in the new QA team
Three roles emerged in 2026:
Eval engineer
Owns the eval suite, scoring, and CI integration. Often a former data engineer.
Agent quality analyst
Reviews production samples, labels them, feeds the eval suite. Part QA, part product analyst.
Red teamer
Proactively breaks the agent. Adversarial prompts, edge cases, abuse patterns. See red teaming.
A team of 6 engineers ships an agent comfortably with 1 of each. Larger teams scale linearly.
The tools they use
Five categories:
- Eval framework — see evaluation framework.
- Trace store — see trace visualization tools.
- Replay — see replay debugging tools.
- Annotation interface — for human-in-the-loop labelling.
- Dashboard — see error rate dashboards.
Most teams DIY the annotation interface; everything else is increasingly off-the-shelf.
What good looks like
Five metrics for the QA team itself:
- Coverage — % of production task types represented in the eval suite.
- Detection latency — time from regression introduction to QA flagging it.
- False-positive rate of eval — how often the eval flags a regression that turns out not to be.
- Production incident rate — escapes that QA missed.
- Human label throughput — how many production samples reviewed per week.
Track quarterly. Below-target metrics drive the next quarter's investments.
The shadow-traffic pattern
Before any production deploy, run the new version against the same inputs as production:
real user request
↓
production agent (current version) → returned to user
↓
shadow agent (new version) → captured, not returned
↓
diff service: compare outputs
↓
flag material disagreements for review
A week of shadow catches more bugs than a month of pre-deploy testing.
What QA cannot catch
Three categories of bug that survive any pipeline:
- Long-tail rare inputs — the eval set never had them.
- Real-world drift — user behaviour shifts after launch.
- Cross-agent interactions — two agents that pass solo fail together.
These are caught by production monitoring, not pre-deploy QA. Both layers are needed.
Process specifics
Two non-negotiables:
- Every prompt change goes through QA. Even one-word changes.
- Every tool change goes through QA. A tool that changes signature can break agents that called it correctly before.
The discipline is: nothing reaches production without the pipeline.
Failure modes of the QA team itself
Three patterns that hollow out QA over time:
- QA bypassed for "urgent" changes — happens once, becomes routine.
- Eval suite goes stale — coverage drops as features grow.
- No connection to incident response — incidents do not feed back into the suite.
Quarterly retrospective on the QA process itself prevents drift.
Common mistakes
- One-eval-fits-all — different agents need different evals.
- Treating QA as a gate, not a partner — best results when QA pairs with engineering.
- No human in the loop — automated scores miss what users actually care about.
- Skipping shadow traffic — the highest-leverage stage gets dropped first under deadline pressure.
Where this is heading
Three trends by 2027: vendor-provided QA suites for major agent SDKs, professional certifications for agent QA engineers, and shared-corpus benchmarks across organisations in the same vertical.