Auth Service JWKS Cache Stale

PM-2026-014 · Low · Published

Postmortem for INC-2839 · Auth service token validation errors in EU

Incident: INC-2839
Severity: Low
Duration: 0.3h
Author: Theo Lambert
Created: 7/1/2026
Published: 7/1/2026

Impact Summary

What happened, who was affected, and how severely

850 EU users received 401 errors for 18 minutes following a JWKS rotation. There was no data loss and no authentication bypass.

Customer Impact Statement

Users saw elevated 401 errors during a brief 18-minute window and were prompted to re-authenticate. No data was lost.

Root Cause

The underlying technical cause

The JWKS rotation completed at 14:00 UTC. authz-engine still held a cached copy of the previous keys with a TTL of 1 hour, so tokens signed with the new keys were validated against the old keys and rejected with 401 responses.
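
The failure mode can be illustrated with a minimal sketch of a TTL-only key cache. This is not the authz-engine implementation: the class `TtlOnlyJwksCache`, the `fetch_keys` callable, and the key IDs are hypothetical names chosen for illustration, and time is passed in explicitly so the behavior is easy to follow.

```python
from typing import Callable, Dict, Optional


class TtlOnlyJwksCache:
    """Hypothetical TTL-only JWKS cache (illustrative, not authz-engine code)."""

    def __init__(self, fetch_keys: Callable[[], Dict[str, str]],
                 ttl_seconds: float = 3600.0) -> None:
        self._fetch_keys = fetch_keys      # returns {kid: public key}
        self._ttl = ttl_seconds            # 1 hour, as in the incident
        self._keys: Dict[str, str] = {}
        self._fetched_at: Optional[float] = None

    def get_key(self, kid: str, now: float) -> Optional[str]:
        # Refetch only when the cache is empty or the TTL has elapsed.
        if self._fetched_at is None or now - self._fetched_at >= self._ttl:
            self._keys = dict(self._fetch_keys())
            self._fetched_at = now
        # An unknown kid yields None, which a caller would map to a 401.
        return self._keys.get(kid)


# Simulated key set published by the issuer (assumed shape).
published = {"old-kid": "old-public-key"}
cache = TtlOnlyJwksCache(fetch_keys=published.copy)

before_rotation = cache.get_key("old-kid", now=0.0)     # fills the cache
published["new-kid"] = "new-public-key"                 # rotation completes
during_incident = cache.get_key("new-kid", now=600.0)   # 10 min later, still stale
after_ttl = cache.get_key("new-kid", now=3600.0)        # TTL elapsed, keys refetched
```

In this sketch the new key only becomes visible once the TTL elapses or the cache is purged by hand, which matches the manual cache purge in the timeline.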

Contributing Factors

Conditions that allowed the incident to occur or worsen

  1. JWKS cache TTL was set to 1h, but rotation happens every 12h with a 6h overlap window.

  2. No proactive purge of the JWKS cache on a rotation event.

  3. The 401-rate alert threshold was too high: it was set to 5%, and the spike peaked at only 3.2%.
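
The threshold arithmetic in the last factor can be checked with a short sketch. The function name `alert_fires` and the "meets or exceeds" comparison are assumptions for illustration; the 1.5% value is the tuned threshold listed under Action Items.

```python
def alert_fires(error_rate_pct: float, threshold_pct: float) -> bool:
    """Return True when the observed 401 rate meets or exceeds the threshold."""
    return error_rate_pct >= threshold_pct


peak_401_rate_pct = 3.2  # peak rate reported for this incident

fired_at_old_threshold = alert_fires(peak_401_rate_pct, 5.0)  # 5% threshold
fired_at_new_threshold = alert_fires(peak_401_rate_pct, 1.5)  # tuned 1.5% threshold
```

Under these assumptions a 3.2% peak stays below a 5% threshold but crosses a 1.5% one.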

Timeline

Key events during the incident

7/1/2026, 2:00:00 PM - JWKS rotation completed successfully

7/1/2026, 2:01:00 PM - First 401 errors started appearing

7/1/2026, 2:02:00 PM - Alert ALR-1240 triggered

7/1/2026, 2:03:00 PM - Theo acknowledged incident

7/1/2026, 2:12:00 PM - Cache purge executed

7/1/2026, 2:18:00 PM - 401 rate back to baseline

Lessons Learned

What we learned from this incident

  • JWKS rotation needs proactive cache invalidation, not just TTL expiry.

  • Alerting thresholds should be calibrated to detect meaningful deviations, not just severe spikes.

  • The authz-engine should fetch new keys eagerly when it encounters a kid it doesn't recognize.
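
The first and third lessons can be sketched together as a cache that supports a proactive purge and an eager refetch on an unrecognized kid. This is a hedged sketch under stated assumptions, not the planned authz-engine change: `RefreshingJwksCache`, `FakeIssuer`, `invalidate`, and the `min_refresh_interval` rate limit are all hypothetical. The rate limit is an added design choice so that tokens carrying made-up kids cannot force a fetch on every request.

```python
from typing import Callable, Dict, Optional


class RefreshingJwksCache:
    """Hypothetical cache with purge and eager refetch (illustrative only)."""

    def __init__(self, fetch_keys: Callable[[], Dict[str, str]],
                 ttl_seconds: float = 3600.0,
                 min_refresh_interval: float = 30.0) -> None:
        self._fetch_keys = fetch_keys
        self._ttl = ttl_seconds
        self._min_refresh = min_refresh_interval
        self._keys: Dict[str, str] = {}
        self._fetched_at: Optional[float] = None

    def invalidate(self) -> None:
        """Proactive purge, e.g. called from a rotation-event handler."""
        self._keys = {}
        self._fetched_at = None

    def _refresh(self, now: float) -> None:
        self._keys = dict(self._fetch_keys())
        self._fetched_at = now

    def get_key(self, kid: str, now: float) -> Optional[str]:
        if self._fetched_at is None or now - self._fetched_at >= self._ttl:
            self._refresh(now)  # empty or expired cache
        elif kid not in self._keys and now - self._fetched_at >= self._min_refresh:
            self._refresh(now)  # unknown kid: refetch eagerly, rate-limited
        return self._keys.get(kid)


class FakeIssuer:
    """Stand-in for the JWKS endpoint that counts fetches."""

    def __init__(self) -> None:
        self.keys = {"old-kid": "old-public-key"}
        self.fetches = 0

    def fetch(self) -> Dict[str, str]:
        self.fetches += 1
        return dict(self.keys)


issuer = FakeIssuer()
cache = RefreshingJwksCache(issuer.fetch)

cache.get_key("old-kid", now=0.0)                 # initial fetch
issuer.keys["new-kid"] = "new-public-key"         # rotation completes
new_key = cache.get_key("new-kid", now=60.0)      # unknown kid triggers a refetch
bogus_key = cache.get_key("bogus-kid", now=61.0)  # rate limit blocks another fetch
```

In this sketch the new key is picked up on the first token that carries it, well inside the 1-hour TTL, while `invalidate()` gives a rotation broadcast something to call.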

Action Items

1 of 4 completed (25%)

Implement eager JWKS fetch on unknown kid

In Progress · prevention · High · Theo Lambert · Due 2026-07-15

Add JWKS rotation event broadcast

To Do · detection · Medium · Theo Lambert · Due 2026-07-22

Tune 401 rate alert threshold to 1.5%

Done · detection · Low · Priya Raman · Due 2026-07-05

Document JWKS rotation runbook

In Progress · process · Medium · Theo Lambert · Due 2026-07-19

Tags

Classification tags

#auth #jwks #cache #rotation

Author

Theo Lambert
Senior Site Reliability Engineer

Reviewers (2)

Priya Raman
Senior Site Reliability Engineer
Approved
Maya Okonkwo
Director, Platform Engineering
Approved