Runbooks

API Gateway 5xx Spike

Incident Response | api-gateway | api-gateway-5xx-spike

Operational runbook - 3 steps

Author

Daniel Vargas

Last Updated

2026-06-18

Steps

3

Procedure

Follow the steps in order. Mark each step complete as you proceed.

1. Triage

Check the rate of 5xx responses per upstream in Grafana. If only one upstream is failing, route investigation to its owning team.
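The triage decision above can be sketched as a small check over per-upstream request counts. This is only an illustration: the function names, the input shape, and the 5% failure threshold are assumptions, not values taken from this runbook or from Grafana.

```python
from typing import Dict, List, Tuple

# Assumed threshold: an upstream counts as "failing" above a 5% 5xx rate.
FAILING_RATE = 0.05


def failing_upstreams(counts: Dict[str, Tuple[int, int]],
                      threshold: float = FAILING_RATE) -> List[str]:
    """counts maps upstream name -> (total_requests, responses_5xx)."""
    failing = []
    for name, (total, errors) in sorted(counts.items()):
        # Skip idle upstreams to avoid dividing by zero.
        if total > 0 and errors / total > threshold:
            failing.append(name)
    return failing


def triage(counts: Dict[str, Tuple[int, int]]) -> str:
    """Summarize whether the spike is isolated to a single upstream."""
    failing = failing_upstreams(counts)
    if len(failing) == 1:
        return f"isolated: route to owning team of {failing[0]}"
    if not failing:
        return "no upstream above threshold"
    return f"widespread: {len(failing)} upstreams failing"
```

For example, `triage({"orders": (1000, 120), "users": (1000, 3)})` reports the spike as isolated to the hypothetical `orders` upstream.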

2. Mitigate

If the upstream is overloaded, enable the circuit breaker for it via the runtime config. If you suspect a bad deploy instead, prepare to roll back.
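For readers unfamiliar with the pattern, here is a minimal sketch of what a circuit breaker does once enabled. It is not the gateway's implementation, and this runbook does not name the runtime config key that enables it; the class name, the consecutive-failure limit, and the cool-down period are all assumptions chosen for illustration.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Hypothetical minimal breaker: opens after N consecutive failures."""

    def __init__(self, max_failures: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after  # cool-down in seconds (assumed)
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None  # None means closed

    def allow(self) -> bool:
        """Return True if a request may be sent to the upstream."""
        if self.opened_at is None:
            return True
        # Once the cool-down has elapsed, let requests through again
        # so the next recorded outcome decides whether to close or reopen.
        return self.clock() - self.opened_at >= self.reset_after

    def record(self, ok: bool) -> None:
        """Record the outcome of a request that was allowed through."""
        if ok:
            self.failures = 0
            self.opened_at = None
            return
        self.failures += 1
        if self.failures >= self.max_failures:
            # Open (or reopen) the breaker and restart the cool-down.
            self.opened_at = self.clock()
```

While the breaker is open, the overloaded upstream receives no traffic, which gives it room to recover instead of being held down by retries.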

3. Rollback

Use `sg deploy rollback api-gateway --ref <last-good>` to revert to the previous stable release.
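If the rollback is scripted, a small helper can refuse to build the command while the `<last-good>` placeholder is still unfilled. The sketch below only assembles the argument list shown in this step and does not run anything; the helper name and its validation rules are assumptions.

```python
from typing import List


def rollback_command(ref: str, service: str = "api-gateway") -> List[str]:
    """Build the rollback command from this runbook as an argument list."""
    ref = ref.strip()
    # Reject an empty ref or a placeholder that was never filled in.
    if not ref or ref.startswith("<"):
        raise ValueError("a concrete last-good ref is required")
    return ["sg", "deploy", "rollback", service, "--ref", ref]
```

Returning a list rather than a single string keeps the ref as one argument if the command is later passed to `subprocess.run`.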

Related Incidents

0 on this service

No related incidents

Related Alerts

3 on this service

ALR-1247 (firing)

API Gateway p99 latency is 218ms (threshold 200ms)

ALR-1246 (firing)

Host db-replica-2 disk usage is 87% (threshold 85%)

ALR-1244 (suppressed)

API Gateway p99 latency is 215ms (threshold 200ms)
