
Simulation environments for agents: 6 platforms for stress-testing without breaking prod

Six simulation platforms compared for testing agents under load, edge cases, and adversarial users — without touching production. Plus the design patterns for building an in-house simulator and the trade-off between realism and cost.

Testing an agent against the real world is expensive, slow, and dangerous. Simulation environments — synthetic users, mock tools, controlled scenarios — let you stress an agent before it meets production. Here are the six platforms and the build patterns.

What a good simulator gives you

Five capabilities:

  • Synthetic users with diverse personas and goals.
  • Mock tools with controllable failure modes.
  • Reproducibility — same scenario, repeatable runs.
  • Speed — many runs per minute, not per hour.
  • Adversarial mode — actively tries to break the agent.

A simulator without all five is a half-tool that misses critical regressions.
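
To make the contract concrete, here is a minimal interface sketch in Python. Every name in it is illustrative; none of the platforms below expose exactly this API.

from dataclasses import dataclass, field
from typing import Protocol

@dataclass
class Persona:
    """A synthetic user: identity, goals, and how they misbehave."""
    name: str
    goals: list[str]
    constraints: list[str] = field(default_factory=list)

@dataclass
class RunResult:
    transcript: list[str]
    passed: bool

class Simulator(Protocol):
    """The five capabilities, expressed as an interface."""

    def add_persona(self, persona: Persona) -> None:
        """1. Synthetic users with diverse personas and goals."""

    def mock_tool(self, name: str, error_rate: float = 0.0,
                  latency_s: float = 0.0) -> None:
        """2. Mock tools with controllable failure modes."""

    def run(self, scenario_id: str, seed: int) -> RunResult:
        """3. Reproducibility: the same seed replays the same run."""

    def run_batch(self, scenario_ids: list[str],
                  workers: int = 8) -> list[RunResult]:
        """4. Speed: parallel runs, many per minute."""

    def run_adversarial(self, scenario_id: str, seed: int) -> RunResult:
        """5. Adversarial mode: the synthetic user actively tries to break the agent."""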

The six platforms

1. AgentBench-Pro

Open-source. Strong on multi-step task scenarios; weak on user diversity.

  • Strengths: large public scenario library; reproducible.
  • Weaknesses: synthetic users feel artificial.
  • Pick when: core capability testing.

2. Hyperbrowser Simulator

Browser-based simulation; ideal for browser-using agents.

  • Strengths: real browser DOM as the environment.
  • Weaknesses: narrow modality.
  • Pick when: testing browser automation agents.

3. Lume Test

Python framework. Lightweight; good for embedding in CI.

  • Strengths: CI-friendly, fast.
  • Weaknesses: limited adversarial scenarios.
  • Pick when: continuous regression in CI.

4. Trillium Persona

Large persona library. Diverse synthetic users; multi-language support.

  • Strengths: persona realism.
  • Weaknesses: SaaS only; cost at scale.
  • Pick when: consumer-facing agents needing diversity.

5. SimAgent (Anthropic)

First-party simulator from Anthropic. Tightly integrated with the Claude Agent SDK.

  • Strengths: integration; vendor support.
  • Weaknesses: Claude-only; less mature than competitors.
  • Pick when: building on Anthropic stack.

6. DIY on the Claude Agent SDK

Custom simulator. Most flexible; most engineering investment.

  • Strengths: matches your specific use cases.
  • Weaknesses: maintenance burden.
  • Pick when: unique workflows that off-the-shelf cannot capture.

Comparison

Platform | Persona diversity | Mock tools | Adversarial | CI fit | Cost
AgentBench-Pro | Medium | Yes | Limited | Good | Free
Hyperbrowser | Low | DOM only | Limited | Good | Medium
Lume Test | Low | Yes | Limited | Best | Low
Trillium Persona | High | Yes | Yes | Good | High
SimAgent | Medium | Yes | Limited | Good | Bundled
DIY (SDK) | Custom | Custom | Custom | Custom | Engineering time

Pick by phase

Phase | Platform
First eval | Lume Test or AgentBench-Pro
Pre-launch | Trillium Persona or DIY
Adversarial / red team | DIY + Trillium
CI regression | Lume Test
Browser-heavy | Hyperbrowser
Vendor-aligned | SimAgent

Designing in-house

If you build, three principles:

Realistic synthetic users

Generate them with a sibling LLM. Each persona has goals, constraints, and error patterns. Vary across cohorts so no single voice dominates your tests.
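
A minimal sketch of that generation step. The complete() helper is a stand-in for whatever LLM client you use, and all names here are illustrative:

import json
import random

ERROR_PATTERNS = ["typos", "contradicts earlier turns",
                  "ignores direct questions", "vague about identifiers"]

def complete(prompt: str) -> str:
    """Placeholder for your LLM client call; must return JSON text."""
    raise NotImplementedError

def generate_persona(cohort: str, seed: int) -> dict:
    """One persona from a sibling LLM, with a seeded error-pattern draw so cohorts replay."""
    rng = random.Random(seed)
    prompt = (
        f"Create a synthetic customer persona in the '{cohort}' cohort. "
        "Return JSON with keys: name, goals (list), constraints (list)."
    )
    persona = json.loads(complete(prompt))
    persona["error_patterns"] = rng.sample(ERROR_PATTERNS, k=2)
    return persona

# Vary across cohorts: one batch per cohort, seeds recorded for replay.
cohorts = ["angry_repeat_customer", "confused_first_timer", "non_native_speaker"]
personas = [generate_persona(c, seed=i) for i, c in enumerate(cohorts)]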

Tool mocks with controllable failure

Each mock tool exposes the same interface as the real one but accepts a fault-injection parameter: latency, error_rate, partial_data.
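
A sketch of that pattern, using a refund-lookup tool as the example. FaultConfig and the method names are illustrative:

import random
import time
from dataclasses import dataclass

@dataclass
class FaultConfig:
    latency_s: float = 0.0      # added delay before each call returns
    error_rate: float = 0.0     # probability a call raises
    partial_data: bool = False  # drop fields the agent expects

class MockRefundAPI:
    """Same interface as the real tool, plus injectable faults."""

    def __init__(self, faults: FaultConfig, seed: int = 0):
        self.faults = faults
        self.rng = random.Random(seed)  # seeded so failures replay identically

    def lookup_order(self, order_id: str) -> dict:
        time.sleep(self.faults.latency_s)
        if self.rng.random() < self.faults.error_rate:
            raise TimeoutError(f"upstream timeout for order {order_id}")
        record = {"order_id": order_id, "amount": 49.90, "refundable": True}
        if self.faults.partial_data:
            record.pop("refundable")  # force the agent to handle missing fields
        return record

Seeding the failure draw matters: an unseeded mock produces different faults on every run and quietly breaks reproducibility.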

Scenarios as data

Scenarios live in YAML / JSON, not code. Version controlled. Reviewable.

scenario: refund_request_difficult_user
persona: angry_repeat_customer
initial_state:
  account_age_days: 1842
  prior_refunds: 4
goals:
  - get refund for order 8392
constraints:
  - resists first denial
  - escalates to threats by turn 5
expected_agent_behavior:
  - de-escalate
  - propose alternatives
  - escalate to human if not resolved by turn 8
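
Loading these stays small. A minimal sketch, assuming PyYAML; the Scenario dataclass mirrors the fields of the example above:

from dataclasses import dataclass, field
from pathlib import Path

import yaml  # PyYAML

@dataclass
class Scenario:
    scenario: str
    persona: str
    initial_state: dict
    goals: list
    constraints: list = field(default_factory=list)
    expected_agent_behavior: list = field(default_factory=list)

def load_scenarios(directory: str) -> list[Scenario]:
    """Load every versioned scenario file in the directory."""
    return [
        Scenario(**yaml.safe_load(path.read_text()))
        for path in sorted(Path(directory).glob("*.yaml"))
    ]

Because the dataclass rejects unknown keys, a typo in a reviewed scenario file fails at load time rather than mid-run.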

What simulators cannot replace

Three things:

  • Real production traffic — actual user behaviour drifts.
  • Production data scale — simulated data is small.
  • Real adversaries — they invent attacks the simulator cannot.

Simulators are the first line; production monitoring is the last.

Workflow integration

A working pipeline:

  1. Pre-merge: small simulator run on every PR (~5 min; pytest sketch below).
  2. Pre-deploy: full scenario library (~30 min).
  3. Weekly: persona-diverse stress test.
  4. Pre-major-release: adversarial campaign (red team + simulator).
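
As a sketch of step 1, the pre-merge run can be an ordinary pytest suite over a small scenario subset. The sim_harness module, run_scenario(), and the result fields are assumptions standing in for your own harness:

import pytest

from sim_harness import load_scenarios, run_scenario  # hypothetical in-house harness

SMOKE = load_scenarios("scenarios/smoke")  # a dozen fast scenarios, not the full library

@pytest.mark.parametrize("scenario", SMOKE, ids=lambda s: s.scenario)
def test_agent_passes_smoke_scenario(scenario):
    result = run_scenario(scenario, seed=42)  # fixed seed: a CI failure reproduces locally
    assert result.passed, "\n".join(result.transcript)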

See QA pipeline for how this fits.

Cost reality

For a mid-sized agent project:

  • Lume Test in CI: under $50 / month.
  • Trillium Persona at production scale: $500–2000 / month.
  • Custom simulator: 1–3 engineer-months upfront.

Most teams DIY the scenarios and use vendor platforms for personas.

Common mistakes

  • Simulating only happy paths — misses real regressions.
  • No reproducibility — same scenario produces different results; cannot diagnose.
  • No adversarial mode — security regressions stay invisible.
  • Skipping persona diversity — one persona means one perspective.

Where this is heading

Three trends by 2027: shared persona libraries across organisations, simulation-as-a-service products with industry-specific scenarios, and standardised reporting formats so simulator outputs become comparable. Build incrementally; as the platforms mature, they will catch up to your scenarios.
