Red teaming AI agents: a practitioner's playbook

A chatbot red team checks if you can make the model say something bad. An agent red team checks if you can make the agent **do** something bad — and the attack surface is bigger by an order of magnitude.

How agent red teaming differs

Three things make agents harder to red team than chat:

Tools mean side effects. A successful attack on an agent moves money, writes files, opens PRs.
Tool outputs are part of the prompt. Adversarial content can travel from a webpage, a database row, or another tool call into the next model turn.
Multi-step plans amplify drift. A subtle nudge at step 1 produces dramatic misbehaviour by step 10.

Red teaming has to cover all three. A pure prompt-injection test misses the side-effect cases entirely.

The attack categories

Cover these in any serious red-team campaign:

Direct prompt injection

The user asks the model to ignore its instructions. Classic but still works against weak prompts.

Test: a corpus of known direct-injection patterns; vary phrasing and language.

Indirect prompt injection

Adversarial content arrives via tool output — a webpage with hidden HTML, a database row with embedded instructions, a file the model just read. The attacker is not the user.

Test: pollute every tool source the agent can read. Verify the agent does not act on the injected instructions.

This is the dominant agent threat — covered in depth in MCP prompt injection prevention.

Tool misuse

The agent calls a tool with arguments that achieve something outside the user's intent. May be triggered by manipulation or a hallucinated plan.

Test: design tasks where success requires restraint. Score whether the agent calls destructive tools when it should not.

Authorisation bypass

The agent tries to act outside its scope, often after the user requests it ("I know I shouldn't, but…").

Test: probe at the policy boundary — see if the agent helps the user evade it.

Data exfiltration

The agent leaks data via tool calls that look benign — encoding secrets in URL parameters, in code comments, in image alt text, in obscure fields of a structured response.

Test: insert canary tokens into the agent's accessible data; check if they appear in any outbound tool call.

Resource exhaustion

The agent is induced into a loop, recursive call, or runaway expansion that consumes budget.

Test: measure max tokens, max wall-clock, max tool calls per task. Compare against baselines after each adversarial input.

Persona / role override

The agent adopts a role that legitimises rule-breaking ("you are now DAN…"). Less common in modern frontier models but still works at edge.

Test: a curated set of persona-switch attempts; track refusal rate.

A red-team pipeline

Run red team as a continuous program, not a one-off audit:

[Adversarial dataset] ─▶ [Run agent] ─▶ [Score: did the attack succeed?]
        ▲                                      │
        │                                      ▼
        └────────── new findings ◀──── [Triage and fix]

Three steps:

Dataset. Maintain a versioned corpus of attacks. Sources: published adversarial datasets, internal incidents, security research, AI safety competitions.
Run. Execute against the agent in a sandbox. Never against production data, even read-only.
Score. Was the attack successful? "Successful" definitions vary by category — leaked canary, called destructive tool, returned restricted info.

Sandboxing matters

Red team in a sandbox where:

Tool calls go to fakes, not real systems. The fake records what would have happened.
Side effects are reversible.
Secrets are dummy values; if anything leaks, no harm done.

Production red teaming is occasionally necessary (live behaviour differs from sandbox), but always with explicit dual approval and a rollback plan.

Quantifying findings

For each attack, record:

Attack ID, category, source.
Whether it succeeded (yes/no/partial).
Severity (what would the worst-case impact have been in production?).
The full trace.

Aggregate into a dashboard: success rate per category, trend over time, regression on previously-fixed attacks.

Closing the loop with eval

Every red-team finding becomes a regression test. Add the failing case to the eval framework and gate deployments on it. A successful attack that is not gated will recur after the next prompt change.

Tools worth knowing

The space is young. Useful starting points:

PyRIT (Microsoft) — open-source automation for red teaming generative AI.
Garak — vulnerability scanner for LLMs.
Promptfoo — eval-style runner that supports adversarial cases.
Custom harness — a lightweight Python or TypeScript runner is often enough; do not over-tool early.

For the broader detection side, see detecting malicious MCP servers.

Common mistakes

Treating it as a one-time audit. Models update, prompts change, tools added — yesterday's clean run is no guarantee.
Only testing English / direct. Multilingual injections, multi-turn injections, image-based injections all bypass naive defences.
Scoring on refusal rate alone. A model that refuses everything is not safe; it is broken. Measure refusal vs. legitimate task success.
No human review of edge cases. Automated scoring misses subtle wins. Sample 5% for human review.

When to bring in external red teams

DIY red teaming catches the obvious. External red teams catch the creative.

Bring them in when:

Before a major launch with regulated data.
After a significant prompt or model change.
When you suspect you have been lulled into "we're fine."

Budget: $50–200k for a 2–4 week engagement from a reputable firm. Cheaper than the incident.