An SRE agent that paged itself into a recent outage and rolled back the wrong service is now infamous. The lesson: incident response is the highest-stakes use case for agent autonomy, and the easiest place to ship a self-inflicted disaster. Here is the pattern that works.
Where the agent actually helps
Three roles where SRE agents earn their keep:
1. Triage
"Is this real?" The agent reads the alert, recent metrics, recent deploys, and similar past incidents. Returns a one-pager: what is happening, what changed, who to page.
Saves the on-call 5–15 minutes per page. Most incidents start with this question.
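A minimal sketch of that triage flow, assuming hypothetical field names (`service`, `sha`, `author`) for the alert and deploy records — the real inputs would come from your metrics and CD APIs:

```python
from dataclasses import dataclass

@dataclass
class TriageSummary:
    """The one-pager the agent posts before anyone is paged."""
    what_is_happening: str
    what_changed: list[str]
    who_to_page: list[str]

def triage(alert: dict, recent_deploys: list[dict]) -> TriageSummary:
    # Correlate the alert with recent deploys to the same service (read-only).
    related = [d for d in recent_deploys if d["service"] == alert["service"]]
    return TriageSummary(
        what_is_happening=f"{alert['name']} firing on {alert['service']}",
        what_changed=[f"deploy {d['sha']} by {d['author']}" for d in related],
        who_to_page=sorted({d["author"] for d in related}) + ["@oncall"],
    )
```

The point is the shape, not the correlation logic: everything here is a read, and the output is a summary for a human, not an action.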
2. Information gathering
Once a human is paged, the agent fetches: relevant logs, runbook entries, traces from the affected service, and the diff from the last deploy. Tags them in the incident channel.
The on-call sees three minutes of evidence-gathering done, not three minutes of starting from scratch.
3. Runbook execution
The agent runs read-only steps from the runbook automatically: "fetch the queue depth", "list the recent failed jobs". Surfaces the results inline. Stops at any write step unless explicitly told to proceed.
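The stop-at-write rule can be sketched as a gate in the executor; the `Step` structure and `mutating` flag are assumptions, not a standard runbook format:

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Step:
    name: str
    run: Callable[[], str]
    mutating: bool  # write steps require explicit human invocation

def run_runbook(steps: list[Step]) -> tuple[list[tuple[str, str]], Optional[str]]:
    """Execute read-only steps in order; stop at the first write step."""
    results = []
    for step in steps:
        if step.mutating:
            return results, step.name  # surface where we stopped and why
        results.append((step.name, step.run()))
    return results, None
```

The on-call sees every read result inline plus the name of the write step the agent refused to run.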
Where it backfires
Three patterns to refuse:
- Auto-rollback — the agent decides to roll back. Sometimes right, often catastrophic.
- Cross-service action — the agent operates outside the affected service.
- Communication on behalf of the team — the agent posts to the status page or messages customers without human review.
All three have produced production incidents in 2026. Recommendation: never enable them.
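One way to make "never enable them" enforceable rather than aspirational is a hard deny-list in the action dispatcher — a sketch, with hypothetical category names:

```python
# Categories the agent may never perform, with no approval path at all.
FORBIDDEN = {"rollback", "cross_service_action", "external_communication"}

def refuse_if_forbidden(action: dict) -> None:
    """Hard refusal: these stay human-only; there is no override flag."""
    if action["category"] in FORBIDDEN:
        raise PermissionError(f"agent may not perform {action['category']}")
```

The design choice is the absence of an override parameter: a deny that can be toggled per-incident will eventually be toggled.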
The architecture that worked
Three components:
alert fires
↓
triage agent (read-only)
↓
posts summary + evidence to incident channel
↓
human on-call decides
↓
human invokes "execute step N" via slash command
↓
agent executes (logged, reversible-only)
↓
loop until resolved
Read-only by default; write only on explicit invocation; never decides to write on its own.
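The loop above can be sketched as a single dispatch function; the step fields (`mutating`, `reversible`) and the invocation record are assumptions about how the slash command is wired:

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("sre-agent")

def handle(step: dict, invoked_by: str = None) -> str:
    """Write steps run only when a named human invoked them via slash command."""
    if step["mutating"]:
        if invoked_by is None:
            return "refused: write step requires explicit human invocation"
        if not step.get("reversible", False):
            return "refused: only reversible write steps may be executed"
        log.info("executing %s on behalf of %s", step["name"], invoked_by)
        return f"executed {step['name']}"
    return f"executed {step['name']}"  # read-only: no approval needed
```

Note that `invoked_by` defaults to `None`: the agent never supplies it for itself; only the slash-command handler does.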
The tools the agent needs
Minimum set:
- Metrics queries (Prometheus, Datadog, Honeycomb).
- Log search (ELK, Loki, your equivalent).
- Recent deploys (CD pipeline metadata).
- Runbook search (Notion, Confluence, your wiki).
- Trace fetch (trace viewers).
- Past-incident search (Jeli, FireHydrant, or a self-hosted Postgres).
- Pager state (PagerDuty, Opsgenie).
All read scopes. Write scopes only behind explicit per-step approval. See agent SSO patterns.
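A sketch of that scoping rule as a tool registry — the tool names are hypothetical stand-ins for the integrations above, and `job_restart` is a made-up example of a write-scoped tool:

```python
# Every tool declares its scope up front; write scopes sit behind approval.
TOOLS = {
    "metrics_query":   {"scope": "read"},
    "log_search":      {"scope": "read"},
    "deploy_history":  {"scope": "read"},
    "runbook_search":  {"scope": "read"},
    "trace_fetch":     {"scope": "read"},
    "incident_search": {"scope": "read"},
    "pager_state":     {"scope": "read"},
    "job_restart":     {"scope": "write"},  # example write tool
}

def authorize(tool: str, approved_by: str = None) -> bool:
    entry = TOOLS.get(tool)
    if entry is None:
        return False  # unknown tools are denied, not guessed at
    if entry["scope"] == "read":
        return True
    return approved_by is not None  # write only behind per-step approval
```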
What good triage looks like
A triage post:
Alert: "checkout-service p95 > 2s"
Severity guess: SEV-2 (real, customer-impacting, not catastrophic)
Likely related:
• Deploy at 10:14 UTC by @alice (3 min before alert)
• Spike in DB connection wait; pool exhaustion suspected
Suggested first steps:
1. /sre run "check pool config diff alice@10:14"
2. /sre run "fetch checkout-service db connection metrics"
3. /sre run "show recent incidents tagged 'pool exhaustion'"
Page added: @alice (deploy author), @oncall-data (DB)
Three minutes saved per page. Compounding for major incidents.
Quality metrics
Track:
- Triage accuracy — was the SEV guess right?
- Time-to-first-evidence — how long from page to summary in channel?
- Agent-suggested-step utility — what % of suggested steps did the on-call run?
- False-page rate — agent's contribution to noise.
Surface these in the error-rate dashboards. Triage accuracy below 80% means rethink the agent.
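The four metrics can be computed from per-incident records; the field names here are assumptions about what your incident tooling exports:

```python
def triage_metrics(incidents: list[dict]) -> dict:
    """Aggregate the four quality metrics from per-incident records."""
    n = len(incidents)
    return {
        # Was the SEV guess right?
        "triage_accuracy": sum(i["sev_guess"] == i["sev_actual"] for i in incidents) / n,
        # Page to summary-in-channel, in seconds (upper median).
        "median_time_to_first_evidence_s": sorted(i["tfe_s"] for i in incidents)[n // 2],
        # Share of agent-suggested steps the on-call actually ran.
        "suggested_step_utility": sum(i["steps_run"] for i in incidents)
                                  / sum(i["steps_suggested"] for i in incidents),
        # The agent's contribution to pager noise.
        "false_page_rate": sum(i["false_page"] for i in incidents) / n,
    }
```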
Drill the agent in
Like any new SRE, drill the agent on game days:
- Replay last quarter's incidents; see what the agent would have done.
- Compare its recommendations to the post-incident review.
- Tune prompts based on the gaps.
Quarterly cadence. The agent gets better; the on-call learns to use it.
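The replay-and-compare loop can be sketched as a diff against the post-incident review; the incident record fields and the single-string `pir_action` comparison are simplifying assumptions:

```python
from typing import Callable

def replay(incidents: list[dict], agent_fn: Callable[[dict], str]) -> list[dict]:
    """Replay past alerts through the agent; diff recommendations vs the PIR."""
    gaps = []
    for inc in incidents:
        recommendation = agent_fn(inc["alert"])
        if recommendation != inc["pir_action"]:
            gaps.append({"incident": inc["id"],
                         "agent": recommendation,
                         "pir": inc["pir_action"]})
    return gaps  # each gap is a prompt-tuning candidate for the next game day
```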
Compliance hooks
Two things to log for every agent action during an incident:
- Decision and reasoning — what the agent recommended, why.
- Approval or override — who approved, who overrode.
Feeds audit trails and the post-incident review.
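Both records fit in one JSON line per agent action — a minimal sketch, with hypothetical field names:

```python
import json
import time

def audit_entry(action: str, reasoning: str,
                approved_by: str = None, overridden_by: str = None) -> str:
    """One JSON line per agent action: decision, reasoning, approval/override."""
    return json.dumps({
        "ts": time.time(),
        "action": action,
        "reasoning": reasoning,
        "approved_by": approved_by,      # who said yes
        "overridden_by": overridden_by,  # who said no and did something else
    })
```

Append-only JSON lines are deliberately boring: the post-incident review can grep them, and the audit trail needs no schema migration.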
Common mistakes
- Letting the agent decide write actions — recipe for self-inflicted incidents.
- No human on the page — agent runs alone; nothing checks it.
- Giving the agent broad write scopes — even for "easy" actions.
- Skipping post-incident reviews — the agent's mistakes get baked in.
Where this is heading
Three shifts to expect by 2027: native incident-response primitives in observability platforms, AI-aware runbook formats (steps the agent can execute safely), and shared "how the agent decided" sections in PIRs across the industry. Build read-only first, autonomy never.