Testing an agent against the real world is expensive, slow, and dangerous. Simulation environments — synthetic users, mock tools, controlled scenarios — let you stress-test an agent before it meets production. Here are six platforms worth knowing, plus patterns for building your own.
What a good simulator gives you
Five capabilities:
- Synthetic users with diverse personas and goals.
- Mock tools with controllable failure modes.
- Reproducibility — same scenario, repeatable runs.
- Speed — many runs per minute, not per hour.
- Adversarial mode — actively tries to break the agent.
A simulator without all five is a half-tool that misses critical regressions.
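Reproducibility and speed, in particular, come down to a seeded driver loop. A minimal sketch of that idea — all names here are hypothetical, with trivial stand-ins for the real agent and a synthetic user:

```python
import random

def run_scenario(agent, user, max_turns=8, seed=0):
    """Drive one simulated conversation, deterministically.

    `agent` and `user` are callables that take the transcript so far and
    return the next message; the user also receives a seeded RNG, so the
    same (scenario, seed) pair replays turn for turn.
    """
    rng = random.Random(seed)
    transcript = []
    for _ in range(max_turns):
        transcript.append(("user", user(rng, transcript)))
        reply = agent(transcript)
        transcript.append(("agent", reply))
        if reply == "ESCALATE_TO_HUMAN":  # hypothetical stop signal
            break
    return transcript

# Trivial stand-ins, just to show the shape:
def stub_user(rng, transcript):
    return rng.choice(["I want a refund", "This is unacceptable"])

def stub_agent(transcript):
    return "Let me look into that for you."

run_a = run_scenario(stub_agent, stub_user, seed=7)
run_b = run_scenario(stub_agent, stub_user, seed=7)  # identical replay
```

Because the only randomness lives in the seeded `rng`, two runs with the same seed produce identical transcripts — which is what makes a failing scenario diagnosable.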
The six platforms
1. AgentBench-Pro
Open-source. Strong on multi-step task scenarios; weak on user diversity.
- Strengths: large public scenario library; reproducible.
- Weaknesses: synthetic users feel artificial.
- Pick when: core capability testing.
2. Hyperbrowser Simulator
Browser-based simulation; ideal for browser-using agents.
- Strengths: real browser DOM as the environment.
- Weaknesses: narrow modality.
- Pick when: testing browser automation agents.
3. Lume Test
Python framework. Lightweight; good for embedding in CI.
- Strengths: CI-friendly, fast.
- Weaknesses: limited adversarial scenarios.
- Pick when: continuous regression in CI.
4. Trillium Persona
Heavy persona library. Diverse synthetic users; multi-language.
- Strengths: persona realism.
- Weaknesses: SaaS only; cost at scale.
- Pick when: consumer-facing agents needing diversity.
5. SimAgent (Anthropic)
First-party simulator from Anthropic, tightly integrated with the Claude Agent SDK.
- Strengths: integration; vendor support.
- Weaknesses: Claude-only; less mature than competitors.
- Pick when: building on Anthropic stack.
6. DIY on the Claude Agent SDK
Custom simulator. Most flexible; most engineering investment.
- Strengths: matches your specific use cases.
- Weaknesses: maintenance burden.
- Pick when: unique workflows that off-the-shelf cannot capture.
Comparison
| Platform | Persona diversity | Mock tools | Adversarial | CI fit | Cost |
|---|---|---|---|---|---|
| AgentBench-Pro | Medium | Yes | Limited | Good | Free |
| Hyperbrowser | Low | DOM only | Limited | Good | Medium |
| Lume Test | Low | Yes | Limited | Best | Low |
| Trillium Persona | High | Yes | Yes | Good | High |
| SimAgent | Medium | Yes | Limited | Good | Bundled |
| DIY (SDK) | Custom | Custom | Custom | Custom | Engineering time |
Pick by phase
| Phase | Platform |
|---|---|
| First eval | Lume Test or AgentBench-Pro |
| Pre-launch | Trillium Persona or DIY |
| Adversarial / red team | DIY + Trillium |
| CI regression | Lume Test |
| Browser-heavy | Hyperbrowser |
| Vendor-aligned | SimAgent |
Designing in-house
If you build, three principles:
Realistic synthetic users
Generate them with a sibling LLM. Give each persona explicit goals, constraints, and characteristic error patterns, and vary these across cohorts.
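One way to structure this — a sketch with illustrative trait pools and a seeded sampler, so a cohort is reproducible across runs (the `Persona` fields and goal strings are our assumptions, not a standard schema):

```python
import random
from dataclasses import dataclass

@dataclass
class Persona:
    name: str
    goal: str
    patience_turns: int   # turns before the user escalates
    error_rate: float     # chance per turn of a typo or ambiguous message
    language: str = "en"

def sample_cohort(n: int, seed: int) -> list[Persona]:
    """Sample a reproducible cohort of synthetic users.

    Traits are drawn from fixed pools; the seed makes the cohort
    identical across runs, so a regression can be replayed exactly.
    """
    rng = random.Random(seed)
    goals = ["get a refund", "cancel subscription", "dispute a charge"]
    return [
        Persona(
            name=f"user_{i}",
            goal=rng.choice(goals),
            patience_turns=rng.randint(2, 8),
            error_rate=rng.uniform(0.0, 0.3),
        )
        for i in range(n)
    ]

cohort = sample_cohort(5, seed=42)
```

In practice the trait pools would be generated by the sibling LLM rather than hard-coded, but the seeded-sampling structure stays the same.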
Tool mocks with controllable failure
Each mock tool exposes the same interface as the real one but accepts fault-injection parameters: latency, error_rate, partial_data.
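For example — a hypothetical refund tool mocked with all three fault knobs (the class and method names are illustrative, not from any real API):

```python
import random
import time

class MockRefundTool:
    """Mirrors the real refund tool's interface, with injectable faults."""

    def __init__(self, latency_s=0.0, error_rate=0.0,
                 partial_data=False, seed=0):
        self.latency_s = latency_s
        self.error_rate = error_rate
        self.partial_data = partial_data
        self.rng = random.Random(seed)  # seeded: failures replay identically

    def issue_refund(self, order_id: str, amount: float) -> dict:
        time.sleep(self.latency_s)                # injected latency
        if self.rng.random() < self.error_rate:   # injected failure
            raise TimeoutError(f"refund service timed out for {order_id}")
        result = {"order_id": order_id, "amount": amount,
                  "status": "refunded"}
        if self.partial_data:                     # injected truncation
            result.pop("status")
        return result
```

A scenario can then instantiate the same tool three ways — healthy, flaky, and degraded — and assert the agent behaves sensibly under each.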
Scenarios as data
Scenarios live in YAML / JSON, not code. Version controlled. Reviewable.
```yaml
scenario: refund_request_difficult_user
persona: angry_repeat_customer
initial_state:
  account_age_days: 1842
  prior_refunds: 4
goals:
  - get refund for order 8392
constraints:
  - resists first denial
  - escalates to threats by turn 5
expected_agent_behavior:
  - de-escalate
  - propose alternatives
  - escalate to human if not resolved by turn 8
```
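Scenarios-as-data pays off most when the loader validates eagerly, so a malformed scenario fails in review rather than mid-run. A sketch using the JSON variant (the required-field set is our assumption about a sensible schema):

```python
import json

REQUIRED_KEYS = {"scenario", "persona", "goals", "expected_agent_behavior"}

def load_scenario(text: str) -> dict:
    """Parse a scenario file and fail fast on missing fields."""
    scenario = json.loads(text)
    missing = REQUIRED_KEYS - scenario.keys()
    if missing:
        raise ValueError(f"scenario missing fields: {sorted(missing)}")
    return scenario

raw = """{
  "scenario": "refund_request_difficult_user",
  "persona": "angry_repeat_customer",
  "goals": ["get refund for order 8392"],
  "expected_agent_behavior": ["de-escalate", "propose alternatives"]
}"""
scenario = load_scenario(raw)
```

The same check works for the YAML form with a YAML parser swapped in; the point is that validation lives in one place, not scattered across test code.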
What simulators cannot replace
Three things:
- Real production traffic — actual user behaviour drifts.
- Production data scale — simulated data is small.
- Real adversaries — they invent attacks the simulator cannot anticipate.
Simulators are the first line; production monitoring is the last.
Workflow integration
A working pipeline:
- Pre-merge: small simulator run on every PR (~5 min).
- Pre-deploy: full scenario library (~30 min).
- Weekly: persona-diverse stress test.
- Pre-major-release: adversarial campaign (red team + simulator).
See QA pipeline for how this fits.
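One way to wire the stages together is to tag each scenario with a tier and have each pipeline stage select a superset of the stage before it. A sketch — the scenario names and tier numbers are illustrative:

```python
# Each pipeline stage runs every scenario at or below its tier, so the
# pre-merge smoke set is a strict subset of the pre-deploy set.
TIERS = {"pre_merge": 1, "pre_deploy": 2, "weekly": 3}

SCENARIOS = [
    {"name": "refund_happy_path", "tier": 1},
    {"name": "refund_request_difficult_user", "tier": 2},
    {"name": "persona_stress_multilingual", "tier": 3},
]

def select(stage: str) -> list[str]:
    """Return the scenario names this pipeline stage should run."""
    limit = TIERS[stage]
    return [s["name"] for s in SCENARIOS if s["tier"] <= limit]
```

Keeping the tier on the scenario (data) rather than in CI config means a scenario promoted from weekly to pre-merge is a one-line reviewable diff.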
Cost reality
For a mid-sized agent project:
- Lume Test in CI: under $50 / month.
- Trillium Persona at production scale: $500–2000 / month.
- Custom simulator: 1–3 engineer-months upfront.
Most teams DIY the scenarios and use vendor platforms for personas.
Common mistakes
- Simulating only happy paths — misses real regressions.
- No reproducibility — the same scenario produces different results, so failures cannot be diagnosed.
- No adversarial mode — security regressions stay invisible.
- Skipping persona diversity — one persona means one perspective.
Where this is heading
Three trends by 2027: shared persona libraries across organisations, simulation-as-a-service products with industry-specific scenarios, and standardised reporting formats that make simulator outputs comparable. Build incrementally; as the platforms mature, they will catch up to your scenarios.