Redis Cluster Memory Pressure Event
Postmortem for inc_prev1 ยท
Impact Summary
What happened, who was affected, and how severely
5-minute latency spike on cached endpoints after Redis memory reached 92%.
Root Cause
The underlying technical cause
A misconfigured session TTL on the web app caused sessions to persist for 30 days instead of 7 days, leading to gradual memory growth.
Contributing Factors
Conditions that allowed the incident to occur or worsen
- 1
No memory usage alerting between 80-90% (only critical at 95%).
- 2
Session TTL change was made without corresponding capacity review.
- 3
eviction policy was set to no-eviction, so memory growth led to OOM errors instead of graceful eviction.
Timeline
Key events during the incident
Memory usage crossed 80%
Memory usage crossed 90%
First OOM errors
Alert triggered
Mitigation: manual key eviction
Back to normal
Lessons Learned
What we learned from this incident
Session TTL changes require capacity review.
Eviction policy should be allkeys-lru for session caches.
Need graduated alerting at 80%, 85%, 90%.
Action Items
2 of 3 completed
Change eviction policy to allkeys-lru
Add graduated memory alerting
Add capacity review step to TTL changes
Tags
Classification tags