Executive Reliability Dashboard

Business-impact view of platform reliability, SLO attainment, and top organizational risks.

Service Availability
99.96%
+0.04% vs last quarter
Customer Impact
2.4hrs
this quarter · revenue at risk: $184k
SLO Attainment
67%
6/9 SLOs healthy · 1 breached
MTTD / MTTR
4.2m detect / 18.4m recover
-8% vs last quarter

Executive Summary

Q3 2026 reliability posture

Platform availability is 99.96% for the quarter, exceeding our 99.9% target. However, a checkout SLO breach is currently affecting approximately 3% of payment attempts in us-east-1, putting an estimated $184k of quarterly revenue at risk. The Reliability and Payments teams are actively mitigating. SLO attainment stands at 67% (6 of 9 SLOs healthy), with one objective breached. MTTD and MTTR are both improving quarter-over-quarter. Top organizational risks are documented below with owners and mitigation plans.
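The headline figures above follow from simple arithmetic. The following is a minimal sketch, assuming attainment is the share of healthy SLOs and that error budget is measured against the 99.9% availability target; the function names are illustrative and not part of the dashboard.

```python
def slo_attainment(healthy: int, total: int) -> float:
    """Percentage of SLOs currently healthy."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * healthy / total


def error_budget_consumed(availability_pct: float, target_pct: float) -> float:
    """Fraction of the error budget used (1.0 means fully spent).

    Assumes the budget is the allowed unavailability, 100 - target.
    """
    allowed = 100.0 - target_pct
    used = 100.0 - availability_pct
    return used / allowed


# Figures taken from the dashboard: 6 of 9 SLOs healthy,
# 99.96% availability against a 99.9% target.
attainment = slo_attainment(6, 9)
budget_used = error_budget_consumed(99.96, 99.9)

print(f"SLO attainment: {attainment:.0f}%")
print(f"Error budget consumed: {budget_used:.0%}")
```

Under these assumptions, a platform running above target can still carry a breached objective, because attainment counts individual SLOs while availability is an aggregate.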

Quarterly Reliability Trend

Weekly incident volume overlaid on availability (last 12 weeks)

SLO Attainment by Tier

Healthy SLO percentage grouped by service tier

86% Avg

Top Business Risks

Ranked register of organizational risks with owners and mitigation plans

Risk | Score | Category | Owner | Mitigation | Status
Checkout API SLO breach impacting 3% of payment attempts | 92 | Customer Impact | Marcus Anderson | Rolling back canary v2.4.1-rc1 and engaging 3DS provider. | mitigating
Billing invoice backlog delaying customer communications | 74 | Customer Impact | Caleb Foster | Scaled workers from 4 to 8. Queue depth dropping. | mitigating
OpenSSL critical CVE on API gateway requires emergency patch | 88 | Security | Hannah Wright | Patch window scheduled for 03:00 UTC Saturday. | mitigating
Web app LCP regression affecting SEO ranking | 56 | Growth | Mei Lin | Image optimization hotfix in progress. | mitigating
Single-region PostgreSQL limits DR posture | 64 | Architecture | Priya Raman | Multi-region replica design in review. Q3 roadmap item. | monitoring
On-call fatigue trending upward in Reliability team | 42 | Team Health | Maya Okonkwo | Hiring additional SRE. Two candidates in onsite loop. | monitoring
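
The register is described as ranked, so one way to read it is in descending score order. This is a sketch only, assuming a higher score means higher priority; the tuple layout and shortened risk names are illustrative, and the scores and statuses are copied from the register.

```python
# (short name, score, status) for each risk in the register.
risks = [
    ("Checkout API SLO breach", 92, "mitigating"),
    ("Billing invoice backlog", 74, "mitigating"),
    ("OpenSSL critical CVE on API gateway", 88, "mitigating"),
    ("Web app LCP regression", 56, "mitigating"),
    ("Single-region PostgreSQL", 64, "monitoring"),
    ("On-call fatigue in Reliability team", 42, "monitoring"),
]

# Assumption: a higher score means a higher-priority risk.
ranked = sorted(risks, key=lambda risk: risk[1], reverse=True)

for rank, (name, score, status) in enumerate(ranked, start=1):
    print(f"{rank}. [{score:>3}] {name} ({status})")
```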

Active Customer Impact

Live incidents affecting customers

INC-2841 · High
Checkout API elevated 5xx rate in us-east-1

Approximately 3% of checkout attempts in us-east-1 are failing with a 500 error. A retry usually succeeds on the second attempt.

4,200 users · since 7/2/2026, 5:12:00 AM
INC-2840 · Medium
Billing invoice generation backlog

Invoice emails delayed by up to 6 hours. No financial impact. Payments continue to process normally.

14,200 users · since 7/1/2026, 10:30:00 PM
INC-2838 · Medium
Web app slow page loads (LCP regression)

No functional impact, but slower page loads affect SEO ranking and user experience.

0 users · since 7/1/2026, 9:00:00 AM

Tier-1 Service Posture

Critical services summary

api-gateway: 99.992%
billing-service: 99.860%
auth-service: 99.970%
postgres-primary: 99.999%
checkout-api: 99.620%
vault: 99.999%
Tier-1 Availability: 99.97%
Coverage: 7 services
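
To make the availability percentages easier to compare, they can be converted to approximate downtime. This is a rough sketch, assuming a 90-day quarter and that availability is measured as a share of wall-clock time; neither assumption is stated by the dashboard, and the helper name is hypothetical.

```python
QUARTER_DAYS = 90  # assumption: the dashboard does not state the window length


def downtime_minutes(availability_pct: float, days: int = QUARTER_DAYS) -> float:
    """Minutes of unavailability implied by an availability percentage."""
    total_minutes = days * 24 * 60
    return total_minutes * (100.0 - availability_pct) / 100.0


# Availability figures for the services listed above.
services = {
    "api-gateway": 99.992,
    "billing-service": 99.860,
    "auth-service": 99.970,
    "postgres-primary": 99.999,
    "checkout-api": 99.620,
    "vault": 99.999,
}

# List the least available service first.
for name, pct in sorted(services.items(), key=lambda item: item[1]):
    print(f"{name:<18}{pct:>8.3f}%  ~{downtime_minutes(pct):7.1f} min")
```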
