Incident Command Center

Centralized war room view of all active incidents, response timelines, and on-call coverage.

Major Operations Active

3 active incidents · 5 paging alerts · 3 firing

Active: 3
High+: 1
Healthy Svcs: 9

Active Incidents: 3 (0 critical)

Ack SLA (24h): 92% (-3% vs target 95%)

Response SLA (24h): 87% (+2% vs target 90%)

MTTR (30d): 18.4 min (-8%, improving)

Active War Rooms

Live incident response rooms with commander and responder details

INC-2841 · High · Investigating · 1d 1h 11m

Checkout API elevated 5xx rate in us-east-1

Approximately 3% of checkout attempts in US-East are failing with a 500 error. Retry usually succeeds on second attempt.

Commander: Marcus Anderson
Impacted Users: 4,200
Service: checkout-api
Responders (4): Marcus Anderson, Sofia Bianchi, Priya Raman, Daniel Vargas
4/9

Marcus Anderson: Posted public update: 'We are investigating elevated checkout failures in US-East. Some payment attempts may fail. Please retry.'

05:48 AM

INC-2840 · Medium · Monitoring · 1d 7h 53m

Billing invoice generation backlog

Invoice emails delayed by up to 6 hours. No financial impact. Payments continue to process normally.

Commander: Caleb Foster
Impacted Users: 14,200
Service: billing-service
Responders (3): Caleb Foster, Yuki Tanaka, Priya Raman
4/6

Caleb Foster: Status changed to monitoring. Queue depth back under 1k.

04:12 AM

INC-2838 · Medium · Identified · 1d 21h 23m

Web app slow page loads (LCP regression)

No functional impact, but slower page loads affect SEO ranking and user experience.

Commander: Mei Lin
Impacted Users: 0
Service: web-app
Responders (2): Mei Lin, Priya Raman
2/5

Mei Lin: Identified: new hero image is 4.2MB. Need to add proper sizing + lazy loading.

06:20 PM

Response Timeline

Latest activity across all incidents

Marcus Anderson · INC-2841

Posted public update: 'We are investigating elevated checkout failures in US-East. Some payment attempts may fail. Please retry.'

7/2/2026, 5:48:00 AM

Marcus Anderson · INC-2841

Linked runbook: Checkout 3DS Timeout

7/2/2026, 5:38:00 AM

Priya Raman · INC-2841

Internal note: error rate started climbing 4 minutes after canary rollout began. Pattern matches a known issue with the new 3DS retry logic.

7/2/2026, 5:31:00 AM

Sofia Bianchi · INC-2841

Linked deployment DEP-2024-006 (canary v2.4.1-rc1)

7/2/2026, 5:24:00 AM

Marcus Anderson · INC-2841

Added service: checkout-api, billing-service

7/2/2026, 5:22:00 AM

Marcus Anderson · INC-2841

Added Sofia Bianchi (payments on-call) and Priya Raman (reliability on-call) as responders

7/2/2026, 5:18:00 AM

Marcus Anderson · INC-2841

Marcus Anderson acknowledged and took commander role

7/2/2026, 5:16:00 AM

Naomi Chen · INC-2841

Alert ALR-1248 linked to incident

7/2/2026, 5:14:00 AM

Naomi Chen · INC-2841

Incident auto-created from alert ALR-1248 (checkout 5xx > 1%)

7/2/2026, 5:12:00 AM

Caleb Foster · INC-2840

Status changed to monitoring. Queue depth back under 1k.

7/2/2026, 4:12:00 AM

Caleb Foster · INC-2840

Scaled workers from 4 to 8. Queue depth dropping ~500/min.

7/1/2026, 11:02:00 PM

Caleb Foster · INC-2840

Caleb Foster acknowledged as commander

7/1/2026, 10:45:00 PM

Naomi Chen · INC-2840

Incident auto-created from alert ALR-1245 (invoice queue depth > 10k)

7/1/2026, 10:30:00 PM

Mei Lin · INC-2838

Identified: new hero image is 4.2MB. Need to add proper sizing + lazy loading.

7/1/2026, 6:20:00 PM

Theo Lambert · INC-2839

401 rate back to baseline. Resolving.

7/1/2026, 2:18:00 PM

Arjun Mehta · INC-2839

Purging JWKS cache across all authz-engine instances

7/1/2026, 2:12:00 PM

Theo Lambert · INC-2839

Identified stale JWKS cache as cause

7/1/2026, 2:08:00 PM

Theo Lambert · INC-2839

Theo Lambert acknowledged

7/1/2026, 2:03:00 PM

On-Call Now

Currently paged engineer

Caleb Foster
primary on-call
Platform · Senior Backend Engineer

Escalation Ladder

Platform - Standard policy

1. Daniel Vargas · Staff Site Reliability Engineer · Immediate
2. Liam Walker · Senior Site Reliability Engineer · +15m
3. Maya Okonkwo · Director, Platform Engineering · +30m
