Know that feeling when a "low priority" ticket turns into a production fire? Or when your on-call rotation starts showing signs of serious burnout from alert overload?
This workflow handles that problem. Two AI agents do the triage work—checking severity, validating against runbooks, triggering the right response.
Incident comes in through webhook → two-agent analysis kicks off:
Agent 1 (Incident Analyzer) checks the report against your Google Sheets runbook database. Looks for matching known issues, evaluates risk signals, assigns a confidence-scored severity (P1/P2/P3). Finally stops you from trusting "CRITICAL URGENT!!!" subject lines.
Agent 2 (Response Planner) builds the action plan: what to do first, who needs to know, investigation steps, post-incident tasks. Like having your most experienced engineer review every single ticket.
Then routing happens: the incident lands in the Slack channel that matches its validated severity, and P1s get a war room spun up automatically. Nobody responds within the SLA window? Auto-escalates to management. Everything logs to Google Sheets for the inevitable post-mortem.
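The routing step boils down to a severity-to-channel map. Here is a minimal sketch: the channel names come from the setup section below, while the war-room naming scheme and the `route_incident` helper itself are illustrative assumptions, not the workflow's actual node logic.

```python
# Map validated severity to a Slack channel (channel names from the setup section).
ROUTES = {
    "P1": "#incidents-critical",
    "P2": "#incidents",
    "P3": "#incidents",
}

def route_incident(incident_id: str, severity: str) -> dict:
    """Return the routing decision for a validated incident."""
    decision = {"channel": ROUTES.get(severity, "#incidents"), "war_room": None}
    if severity == "P1":
        # P1 incidents get a dedicated war-room channel automatically.
        # The naming convention here is an assumption for illustration.
        decision["war_room"] = f"#war-room-{incident_id.lower()}"
    return decision
```

For example, `route_incident("INC-20260324-143022-a7f3", "P1")` yields the critical channel plus a dedicated war-room name, while a P3 routes to the shared channel with no war room.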
| Feature | This Workflow | Typical AI Triage |
|---|---|---|
| Architecture | Two specialized agents (analyze + coordinate) | Single generic prompt |
| Reliability | Multi-LLM fallback (Gemini → Groq) | Single model, fails if down |
| SLA Enforcement | Auto-waits, checks, escalates autonomously | Sends alert, then done |
| Learning | Feedback webhook improves accuracy over time | Static prompts forever |
| Knowledge Source | Your runbooks (Google Sheets) | Generic templates |
| War Room Creation | Automatic for P1 incidents | Manual |
| Audit Trail | Every decision logged to Sheets | Often missing |
Scenario: Your monitoring system detects database errors.
Webhook receives this messy alert:
```json
{
  "title": "DB Connection Pool Exhausted",
  "description": "user-service reporting 503 errors",
  "severity": "P3",
  "service": "user-service"
}
```
Agent 1 (Incident Analyzer) reasoning:
Agent 2 (Response Coordinator) builds the plan:
What happens next (autonomously):
Human feedback loop (optional but powerful):
On-call engineer reviews the decision and submits:
POST /incident-feedback

```json
{
  "incidentId": "INC-20260324-143022-a7f3",
  "feedback": "Correct severity upgrade - good catch",
  "correctSeverity": "P2"
}
```
→ This correction gets logged to AI_Audit_Log. Over time, Agent 1 learns which patterns justify severity overrides.
Stop manual triage: What took your on-call engineer 5-10 minutes now takes 3 seconds. Agent 1 checks the runbook, Agent 2 builds the response plan.
Severity validation = fewer false alarms: The workflow cross-checks reported severity against runbook patterns and risk signals. That "P1 URGENT" email from marketing? Gets downgraded to P3 automatically.
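The cross-check behaves the same in both directions, upgrading under-reported incidents and downgrading inflated ones. A minimal sketch of the idea, assuming the runbook lookup returns a row with a `severity` field (the field name and the `validate_severity` helper are assumptions):

```python
RANK = {"P1": 1, "P2": 2, "P3": 3}  # lower number = more severe

def validate_severity(reported: str, runbook_match) -> str:
    """Return the severity to act on. A matched runbook pattern
    overrides the reporter's claim in either direction; with no
    match, the reported severity stands."""
    if runbook_match is None:
        return reported
    return runbook_match["severity"]
```

In the scenario above, a reported P3 matching a known connection-pool pattern filed as P2 gets upgraded; a "P1 URGENT" with no matching risk signals falls back to whatever the runbook actually says.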
SLAs enforce themselves: P1 gets 15 minutes. P2 gets 60. Timers run autonomously. If nobody acknowledges, management gets paged. No more "I forgot to check Slack."
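The timer logic is simple to sketch. The 15- and 60-minute windows come from the text; treating P3 as having no hard SLA, and the `check_sla` function itself, are assumptions for illustration:

```python
SLA_MINUTES = {"P1": 15, "P2": 60}  # from the text; P3 assumed to have no hard SLA

def check_sla(severity: str, minutes_elapsed: float, acknowledged: bool) -> str:
    """Decide what the autonomous SLA timer should do."""
    if acknowledged:
        return "stand_down"  # someone picked it up in time
    limit = SLA_MINUTES.get(severity)
    if limit is None:
        return "no_sla"
    return "escalate_to_management" if minutes_elapsed >= limit else "keep_waiting"
```

An unacknowledged P1 at minute 16 escalates; the same incident acknowledged at minute 5 stands down.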
Uses YOUR runbooks, not generic templates: Agent 1 pulls context from your Google Sheets runbook database — known issues, escalation contacts, SLA targets. It knows your systems.
Multi-LLM fallback = 99.9% uptime: Primary: Gemini 2.0. Fallback: Groq. Each agent retries 3x with 5-sec intervals. Basically always works.
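The fallback chain described above (primary model, then fallback, 3 retries at 5-second intervals each) can be sketched as a generic wrapper. The model-calling functions here are placeholders, not real Gemini or Groq client code:

```python
import time

def call_with_fallback(prompt, models, retries=3, delay=5.0):
    """Try each (name, call_fn) pair in order; retry each model up to
    `retries` times, sleeping `delay` seconds between attempts, before
    falling through to the next model in the chain."""
    last_error = None
    for name, call in models:
        for attempt in range(retries):
            try:
                return name, call(prompt)
            except Exception as exc:
                last_error = exc
                if attempt < retries - 1:
                    time.sleep(delay)
    raise RuntimeError(f"all models failed: {last_error}")
```

With `models=[("gemini", call_gemini), ("groq", call_groq)]` (both hypothetical client functions), an outage on the primary burns through its three retries and then silently continues on the fallback.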
Self-improving feedback loop: Engineers can submit corrections via /incident-feedback webhook. The workflow logs every decision + human feedback to AI_Audit_Log. Track accuracy over time, identify patterns where AI needs tuning.
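Conceptually, the feedback handler just appends a correction row to the audit log. A minimal sketch, using a Python list as a stand-in for the AI_Audit_Log sheet (the column names are assumptions; the incoming field names match the webhook payload shown earlier):

```python
from datetime import datetime, timezone

def log_feedback(audit_log: list, payload: dict) -> dict:
    """Append a human correction to the audit log. `audit_log` stands in
    for the AI_Audit_Log sheet; column names here are assumptions."""
    row = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "incident_id": payload["incidentId"],
        "feedback": payload["feedback"],
        "correct_severity": payload.get("correctSeverity"),
    }
    audit_log.append(row)
    return row
```

Because every AI decision and every correction land in the same log, computing accuracy over time is a matter of comparing the AI's severity against `correct_severity` wherever the latter is present.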
Complete audit trail: Every incident, every AI decision, every escalation — all in Google Sheets. Perfect for post-mortems and compliance.
This is not a 5-minute setup. You'll need:
- Google Sheets structure: three tabs: Runbooks, Incidents, AI_Audit_Log
- Slack configuration: four channels: #incidents-critical, #incidents, #management-escalation, #engineering-leads

Estimated setup time: 30-45 minutes
Quick start option: Begin with just Slack + Google Sheets. Add PagerDuty later.