Auth Service JWKS Cache Stale
Postmortem for INC-2839 · Auth service token validation errors in EU
Impact Summary
What happened, who was affected, and how severely
850 EU users received 401 errors during an 18-minute window following JWKS rotation and were prompted to re-authenticate. No data loss, no auth bypass.
Root Cause
The underlying technical cause
The JWKS rotation completed at 14:00 UTC. authz-engine still held a cached copy of the previous keys with a TTL of 1 hour, so tokens signed with the new keys were validated against the old keys and rejected with 401 responses.
Contributing Factors
Conditions that allowed the incident to occur or worsen
1. JWKS cache TTL was set to 1h, but rotation happens every 12h with a 6h overlap window.
2. The JWKS cache was not proactively purged on rotation events.
3. The 401-rate alert threshold was set too high: it was 5%, while the spike peaked at only 3.2%.
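The threshold gap in the third factor can be checked with simple arithmetic. The sketch below is illustrative only: `alert_fires` is a hypothetical helper, not the alerting system's actual rule, and it assumes the alert compares the observed 401 rate against a fixed percentage threshold. The 1.5% value is the threshold proposed in the action items.

```python
# Hypothetical helper: assumes a simple "rate >= threshold" alert rule.
def alert_fires(error_rate_pct: float, threshold_pct: float) -> bool:
    """Return True when the observed 401 rate meets or exceeds the threshold."""
    return error_rate_pct >= threshold_pct


PEAK_401_RATE_PCT = 3.2    # peak 401 rate during the incident
CURRENT_THRESHOLD_PCT = 5.0  # threshold in place at the time
PROPOSED_THRESHOLD_PCT = 1.5  # threshold proposed in the action items

# A 3.2% spike stays under a 5% threshold but crosses a 1.5% one.
fires_today = alert_fires(PEAK_401_RATE_PCT, CURRENT_THRESHOLD_PCT)
fires_after_tuning = alert_fires(PEAK_401_RATE_PCT, PROPOSED_THRESHOLD_PCT)
```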
Timeline
Key events during the incident
JWKS rotation completed successfully
First 401 errors started appearing
Alert ALR-1240 triggered
Theo acknowledged incident
Cache purge executed
401 rate back to baseline
Lessons Learned
What we learned from this incident
JWKS rotation needs proactive cache invalidation, not just TTL expiry.
Alerting thresholds should be calibrated to detect meaningful deviations, not just severe spikes.
The authz-engine should fetch new keys eagerly when it encounters a kid it doesn't recognize.
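A minimal sketch of the eager-fetch idea, under stated assumptions: `JwksCache`, `fetch_jwks`, and `min_refetch_interval` are hypothetical names, not authz-engine's actual code, and keys are modelled as plain strings keyed by kid. The refetch is rate-limited so that tokens carrying bogus kids cannot force a fetch on every request.

```python
import time
from typing import Callable, Dict, Optional


class JwksCache:
    """Caches keys by kid; refetches on TTL expiry or on an unknown kid."""

    def __init__(
        self,
        fetch_jwks: Callable[[], Dict[str, str]],  # hypothetical fetcher: kid -> key
        ttl_seconds: float = 3600.0,               # 1h TTL, as in the incident
        min_refetch_interval: float = 30.0,        # assumed rate limit for eager fetches
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch_jwks
        self._ttl = ttl_seconds
        self._min_interval = min_refetch_interval
        self._clock = clock
        self._keys: Dict[str, str] = {}
        self._fetched_at: Optional[float] = None

    def _refresh(self) -> None:
        self._keys = dict(self._fetch())
        self._fetched_at = self._clock()

    def get_key(self, kid: str) -> Optional[str]:
        now = self._clock()
        if self._fetched_at is None or now - self._fetched_at >= self._ttl:
            # Cold or expired cache: normal TTL-driven refresh.
            self._refresh()
        elif kid not in self._keys and now - self._fetched_at >= self._min_interval:
            # Unknown kid, probably a rotation: refetch now instead of waiting for TTL.
            self._refresh()
        return self._keys.get(kid)

    def purge(self) -> None:
        """Drop all cached keys so the next lookup refetches."""
        self._keys = {}
        self._fetched_at = None
```

With this shape, a token signed with a freshly rotated key triggers at most one extra fetch per `min_refetch_interval` rather than failing until the TTL expires.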
Action Items
1 of 4 completed
Implement eager JWKS fetch on unknown kid
Add JWKS rotation event broadcast
Tune 401 rate alert threshold to 1.5%
Document JWKS rotation runbook
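The rotation event broadcast could look roughly like the following in-process sketch. Everything here is an assumption for illustration: `RotationBroadcaster` and `KeyCache` are hypothetical names, and a real deployment would likely publish over a message bus rather than direct callbacks.

```python
from typing import Callable, Dict, List


class RotationBroadcaster:
    """Hypothetical publisher: notifies subscribers when JWKS rotation completes."""

    def __init__(self) -> None:
        self._subscribers: List[Callable[[str], None]] = []

    def subscribe(self, handler: Callable[[str], None]) -> None:
        self._subscribers.append(handler)

    def publish(self, new_kid: str) -> int:
        """Call every subscriber with the new kid; return how many were notified."""
        for handler in self._subscribers:
            handler(new_kid)
        return len(self._subscribers)


class KeyCache:
    """Hypothetical consumer-side cache that purges itself on a rotation event."""

    def __init__(self, keys: Dict[str, str]) -> None:
        self.keys = dict(keys)

    def on_rotation(self, new_kid: str) -> None:
        # Purge proactively so the next validation refetches the current key set.
        self.keys.clear()
```

Usage: call `broadcaster.subscribe(cache.on_rotation)` once at startup, then have the rotation job call `broadcaster.publish(new_kid)` when rotation completes.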
Tags
Classification tags