Redis Cluster Memory Pressure Event

PM-2026-013 · Medium · Published

Postmortem for inc_prev1

Incident
Severity: Medium
Duration: 0.08h
Author: Yuki Tanaka
Created: 6/25/2026
Published: 6/26/2026

Impact Summary

What happened, who was affected, and how severely

5-minute latency spike on cached endpoints after Redis memory reached 92%.

Root Cause

The underlying technical cause

A misconfigured session TTL on the web app caused sessions to persist for 30 days instead of 7 days, leading to gradual memory growth.
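
A rough steady-state estimate shows why the longer TTL matters. If sessions are created at a roughly constant rate and each one lives for its full TTL, the number of live sessions is about rate × TTL, so memory scales linearly with the TTL. The sketch below is illustrative only: the creation rate and per-session size are hypothetical figures that this postmortem does not state, and only the 7-day and 30-day TTLs come from the incident.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def steady_state_session_bytes(sessions_per_second: float,
                               ttl_days: float,
                               bytes_per_session: int) -> float:
    """Approximate memory held by live sessions once the keyspace stabilises.

    Assumes a constant creation rate and that every session lives for its
    full TTL (no early logout, no eviction).
    """
    live_sessions = sessions_per_second * ttl_days * SECONDS_PER_DAY
    return live_sessions * bytes_per_session


# Hypothetical workload figures, chosen only to make the arithmetic concrete.
ASSUMED_SESSIONS_PER_SECOND = 2.0
ASSUMED_BYTES_PER_SESSION = 2048

intended = steady_state_session_bytes(
    ASSUMED_SESSIONS_PER_SECOND, 7, ASSUMED_BYTES_PER_SESSION)
misconfigured = steady_state_session_bytes(
    ASSUMED_SESSIONS_PER_SECOND, 30, ASSUMED_BYTES_PER_SESSION)

# The ratio depends only on the two TTLs, not on the assumed workload.
growth_factor = misconfigured / intended
print(f"Session memory at 30 days vs 7 days: {growth_factor:.2f}x")
```

Under these assumptions the ratio is 30/7 whatever the workload, which matches the gradual growth described above: the extra memory only shows up as sessions older than 7 days accumulate.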

Contributing Factors

Conditions that allowed the incident to occur or worsen

  1. No memory usage alerting between 80% and 90% (the only alert was critical at 95%).

  2. The session TTL change was made without a corresponding capacity review.

  3. The eviction policy was set to noeviction, so memory growth led to OOM errors instead of graceful eviction.
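
The third factor can be caught with a simple configuration audit. The sketch below is a minimal example, not the check the team actually runs: `audit_memory_config` is a hypothetical helper that takes a mapping of Redis configuration values (such as the dictionary returned by a `CONFIG GET` call through a client library) and reports settings that would make Redis reject writes, or grow without bound, instead of evicting keys.

```python
from typing import Dict, List


def audit_memory_config(config: Dict[str, str]) -> List[str]:
    """Return human-readable findings for risky Redis memory settings.

    `config` maps Redis configuration names to their string values.
    This helper is hypothetical and only inspects two settings.
    """
    findings: List[str] = []

    # With noeviction (the Redis default), writes fail with an OOM error
    # once maxmemory is reached instead of evicting existing keys.
    policy = config.get("maxmemory-policy", "noeviction")
    if policy == "noeviction":
        findings.append(
            "maxmemory-policy is noeviction: writes will fail at the "
            "memory limit instead of evicting keys"
        )

    # maxmemory of 0 means no limit is enforced by Redis itself.
    if int(config.get("maxmemory", "0")) == 0:
        findings.append("maxmemory is 0: Redis enforces no memory limit")

    return findings


# Example: a configuration resembling the one described in this incident.
# The 4 GiB limit is an assumed value for illustration.
incident_like = {"maxmemory-policy": "noeviction", "maxmemory": "4294967296"}
for finding in audit_memory_config(incident_like):
    print(finding)
```

Running a check like this in CI or as a periodic job is one way to keep a session cache from drifting back to noeviction after the fix.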

Timeline

Key events during the incident

6/25/2026, 8:00:00 AM

Memory usage crossed 80%

6/25/2026, 9:30:00 AM

Memory usage crossed 90%

6/25/2026, 9:42:00 AM

First OOM errors

6/25/2026, 9:45:00 AM

Alert triggered

6/25/2026, 9:50:00 AM

Mitigation: manual key eviction

6/25/2026, 9:55:00 AM

Back to normal
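
The gaps between these events can be made explicit with a short calculation. The sketch below copies the timestamps from the timeline above; the helper name `minutes_between` and the labels on each interval are illustrative choices, not terms used by the incident tooling.

```python
from datetime import datetime

# Timestamp format used in the timeline, e.g. "6/25/2026, 9:42:00 AM".
TIMESTAMP_FORMAT = "%m/%d/%Y, %I:%M:%S %p"

CROSSED_80 = "6/25/2026, 8:00:00 AM"
CROSSED_90 = "6/25/2026, 9:30:00 AM"
FIRST_OOM = "6/25/2026, 9:42:00 AM"
ALERT = "6/25/2026, 9:45:00 AM"
MITIGATION = "6/25/2026, 9:50:00 AM"
RECOVERED = "6/25/2026, 9:55:00 AM"


def minutes_between(start: str, end: str) -> float:
    """Return the number of minutes from `start` to `end`."""
    delta = (datetime.strptime(end, TIMESTAMP_FORMAT)
             - datetime.strptime(start, TIMESTAMP_FORMAT))
    return delta.total_seconds() / 60


intervals = {
    "80% crossed to alert": minutes_between(CROSSED_80, ALERT),
    "90% crossed to first OOM": minutes_between(CROSSED_90, FIRST_OOM),
    "first OOM to alert": minutes_between(FIRST_OOM, ALERT),
    "mitigation to recovery": minutes_between(MITIGATION, RECOVERED),
}

for label, minutes in intervals.items():
    print(f"{label}: {minutes:.0f} min")
```

The first interval is the warning window that an alert at 80% would have provided before the existing alert fired.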

Lessons Learned

What we learned from this incident

  • Session TTL changes require capacity review.

  • Eviction policy should be allkeys-lru for session caches.

  • Graduated alerting is needed at 80%, 85%, and 90%.
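
The graduated alerting idea can be sketched as a simple threshold ladder. This is a minimal illustration, not the alert rules that were deployed: the 80%, 85%, and 90% tiers come from the lesson above and the 95% critical tier from the contributing factors, while the severity names and the function `memory_alert_level` are assumptions made for the example.

```python
from typing import List, Optional, Tuple

# (threshold, severity) pairs, checked from highest to lowest.
# Severity names other than "critical" are hypothetical.
ALERT_TIERS: List[Tuple[float, str]] = [
    (0.95, "critical"),
    (0.90, "high"),
    (0.85, "warning"),
    (0.80, "info"),
]


def memory_alert_level(used_bytes: int, max_bytes: int) -> Optional[str]:
    """Return the severity for the current memory usage, or None if below 80%."""
    if max_bytes <= 0:
        raise ValueError("max_bytes must be positive")
    usage = used_bytes / max_bytes
    for threshold, severity in ALERT_TIERS:
        if usage >= threshold:
            return severity
    return None


# Usage at the 92% level reported in the impact summary.
print(memory_alert_level(92, 100))
```

With only the 95% tier in place, the 92% peak in this incident maps to no memory alert at all; with the ladder above it would already have raised three lower-severity alerts on the way up.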

Action Items

2 of 3 completed (67%)

Change eviction policy to allkeys-lru

Done · prevention · High · Yuki Tanaka · Due 2026-06-28

Add graduated memory alerting

Done · detection · Medium · Priya Raman · Due 2026-06-30

Add capacity review step to TTL changes

In Progress · process · Low · Priya Raman · Due 2026-07-15

Tags

Classification tags

#redis #memory #session #capacity

Author

Yuki Tanaka, Senior Site Reliability Engineer

Reviewers (2)

Priya Raman, Senior Site Reliability Engineer (Approved)
Maya Okonkwo, Director, Platform Engineering (Approved)
