
🚨 Caching Failures in Production: SRE Guide to Problems and Mitigations

Caching is a powerful tool to reduce latency and load on backend systems, but poorly implemented caching can fail catastrophically under pressure. This guide walks through common caching problems encountered in real-world production systems and how to mitigate them from an SRE perspective.


1. 🌩️ Thundering Herd Problem

🔍 What Happens:

🧠 Example Scenario:

Imagine thousands of requests trying to retrieve a product list: all keys expire at once → cache miss → DB overload.

✅ Mitigation Strategies:

// Add 0–60s random TTL jitter
TTL = 300 + rand(0, 60)
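The jitter rule above can be sketched in Python as follows. The function name `jittered_ttl` and the constant names are assumptions made for illustration; only the 300 s base and the 0 to 60 s jitter range come from the snippet above.

```python
import random

BASE_TTL_SECONDS = 300   # base TTL from the snippet above
MAX_JITTER_SECONDS = 60  # upper bound of the random jitter


def jittered_ttl(base: int = BASE_TTL_SECONDS,
                 max_jitter: int = MAX_JITTER_SECONDS) -> int:
    """Return base plus a uniformly random whole number of seconds in [0, max_jitter]."""
    return base + random.randint(0, max_jitter)


# Keys written in the same batch now expire spread across a 60 s window
# instead of all at the same instant.
ttls = [jittered_ttl() for _ in range(1000)]
```

Each key gets its own TTL at write time, so a bulk load no longer produces a single synchronized expiry.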

2. 🕳️ Cache Penetration

🔍 What Happens:

✅ Mitigation Strategies:

GET /user/unknown_id  
→ cache → miss  
→ DB → miss  
→ store `null` in cache with 30s TTL
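The flow above might look like this in Python. This is a minimal in-process sketch, not a Redis client: the dictionary cache, the `get_user` and `load_from_db` names, and the 300 s TTL for real values are assumptions; only the 30 s TTL for cached misses comes from the flow above.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple

_MISSING = object()          # sentinel meaning "the database has no such row"
NEGATIVE_TTL_SECONDS = 30    # short TTL for cached misses, as in the flow above
POSITIVE_TTL_SECONDS = 300   # assumed TTL for real values

_cache: Dict[str, Tuple[Any, float]] = {}  # key -> (value, expires_at)


def get_user(user_id: str,
             load_from_db: Callable[[str], Optional[dict]],
             now: Callable[[], float] = time.monotonic) -> Optional[dict]:
    """Look up a user, caching "not found" results for a short time."""
    entry = _cache.get(user_id)
    if entry is not None and entry[1] > now():
        value = entry[0]
        return None if value is _MISSING else value

    row = load_from_db(user_id)  # cache miss or expired entry
    if row is None:
        # Remember the miss so repeated lookups do not reach the database.
        _cache[user_id] = (_MISSING, now() + NEGATIVE_TTL_SECONDS)
        return None
    _cache[user_id] = (row, now() + POSITIVE_TTL_SECONDS)
    return row
```

The sentinel keeps "known to be missing" distinct from "not cached yet", and the short TTL limits how long a newly created record stays hidden behind a cached miss.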

3. 🔥 Cache Breakdown (Hot Key Expiry)

🔍 What Happens:

✅ Mitigation Strategies:
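The summary table in this guide lists "no expiry for hot keys" and "lazy revalidation" for this problem. One possible reading of that, as a sketch: entries never physically expire, a soft TTL marks them stale, and only one caller reloads a stale entry while the others keep serving the old value. The `HotKeyCache` class and its parameters are hypothetical, and a single lock is used for brevity where a real system would likely use per-key locks.

```python
import threading
import time
from typing import Any, Callable, Dict, Tuple


class HotKeyCache:
    """Hypothetical cache: entries never expire, a soft TTL only marks them stale."""

    def __init__(self, soft_ttl: float,
                 clock: Callable[[], float] = time.monotonic):
        self._soft_ttl = soft_ttl
        self._clock = clock
        self._data: Dict[str, Tuple[Any, float]] = {}  # key -> (value, stale_after)
        self._refresh_lock = threading.Lock()  # one lock for brevity

    def _store(self, key: str, value: Any) -> Any:
        self._data[key] = (value, self._clock() + self._soft_ttl)
        return value

    def get(self, key: str, loader: Callable[[str], Any]) -> Any:
        entry = self._data.get(key)
        if entry is None:
            # Cold start: nothing stale to serve, so callers wait for one load.
            with self._refresh_lock:
                entry = self._data.get(key)
                if entry is None:
                    return self._store(key, loader(key))
                return entry[0]

        value, stale_after = entry
        if self._clock() < stale_after:
            return value  # fresh hit

        # Stale: only the caller that wins the lock reloads; everyone else
        # keeps serving the old value instead of piling onto the database.
        if self._refresh_lock.acquire(blocking=False):
            try:
                current = self._data[key]
                if self._clock() < current[1]:
                    return current[0]  # refreshed just before we got the lock
                return self._store(key, loader(key))
            finally:
                self._refresh_lock.release()
        return value
```

The trade-off is that some readers briefly see stale data, which is usually acceptable for hot keys where a database stampede is the larger risk.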


4. 💥 Cache Crash

🔍 What Happens:

✅ Mitigation Strategies:

// Circuit breaker for cache dependency
if (cacheDown && dataType == "nonCritical") {
  return fallbackValue;
}
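The pseudocode above leaves open how `cacheDown` gets set. One way to derive it, sketched in Python: count consecutive cache errors, open the breaker after a threshold, and probe the cache again after a cool-down. The `CacheCircuitBreaker` class, its thresholds, and the choice of exception types are all assumptions for illustration.

```python
import time
from typing import Any, Callable, Optional


class CacheCircuitBreaker:
    """Hypothetical breaker: skip the cache after repeated errors."""

    def __init__(self, failure_threshold: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self._failure_threshold = failure_threshold
        self._reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def is_open(self) -> bool:
        if self._opened_at is None:
            return False
        if self._clock() - self._opened_at >= self._reset_after:
            # Half-open: let one call probe the cache; a single failure reopens.
            self._opened_at = None
            self._failures = self._failure_threshold - 1
            return False
        return True

    def call(self, cache_get: Callable[[], Any],
             fallback: Callable[[], Any]) -> Any:
        if self.is_open():
            return fallback()  # cache considered down: do not even try it
        try:
            result = cache_get()
        except (ConnectionError, TimeoutError):
            self._failures += 1
            if self._failures >= self._failure_threshold:
                self._opened_at = self._clock()
            return fallback()
        self._failures = 0
        return result
```

While the breaker is open, requests skip the cache call entirely, so a dead cache node does not add a timeout to every request.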

5. 🧟 Stale Data / Inconsistent Cache

🔍 What Happens:

✅ Mitigation Strategies:
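The summary table in this guide names write-through as one mitigation for stale data. A minimal sketch of the idea, using plain dictionaries as stand-ins for the database and the cache (the `WriteThroughStore` class is hypothetical, and this version does not address races between concurrent writers):

```python
from typing import Any, Dict, Optional


class WriteThroughStore:
    """Hypothetical store that updates the cache on every write."""

    def __init__(self, db: Dict[str, Any], cache: Dict[str, Any]):
        self._db = db
        self._cache = cache

    def write(self, key: str, value: Any) -> None:
        self._db[key] = value     # 1. persist to the source of truth
        self._cache[key] = value  # 2. update the cache in the same code path

    def read(self, key: str) -> Optional[Any]:
        if key in self._cache:
            return self._cache[key]
        value = self._db.get(key)
        if value is not None:
            self._cache[key] = value  # populate the cache on a miss
        return value
```

Because every write goes through one path that touches both stores, readers do not have to wait for a TTL to expire before seeing a new value.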


6. 🧠 Over-Caching / Memory Pressure

🔍 What Happens:

✅ Mitigation Strategies:
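The summary table in this guide lists LRU configuration for this problem. To illustrate what least-recently-used eviction does, here is a small in-process sketch; the `LRUCache` class is hypothetical. In Redis the comparable behavior is configured rather than coded, through the `maxmemory` and `maxmemory-policy` settings (for example `allkeys-lru`).

```python
from collections import OrderedDict
from typing import Any, Hashable


class LRUCache:
    """Hypothetical bounded cache that evicts the least recently used entry."""

    def __init__(self, max_entries: int):
        if max_entries <= 0:
            raise ValueError("max_entries must be positive")
        self._max_entries = max_entries
        self._data: "OrderedDict[Hashable, Any]" = OrderedDict()

    def get(self, key: Hashable, default: Any = None) -> Any:
        if key not in self._data:
            return default
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def put(self, key: Hashable, value: Any) -> None:
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self._max_entries:
            self._data.popitem(last=False)  # drop the least recently used entry

    def __len__(self) -> int:
        return len(self._data)
```

A hard entry limit keeps memory bounded no matter how many distinct keys are written, at the cost of evicting cold entries early.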


🧰 Summary Table of Problems & Solutions

| Problem | Root Cause | Solution Highlights |
|---|---|---|
| Thundering Herd | Mass expiration | TTL jitter, request coalescing |
| Cache Penetration | Nonexistent keys | Cache nulls, Bloom filter |
| Cache Breakdown | Hot key expires | No expiry for hot keys, lazy revalidation |
| Cache Crash | Cache service down | Circuit breaker, replication, failover cache |
| Inconsistent Cache | Race conditions, stale writes | Write-through, pub-sub invalidation |
| Over-Caching | Irrelevant/expired data retained | TTL tuning, LRU config, metrics |

🧪 Monitoring Tips

# Redis CLI monitoring: show the cache hit and miss counters
redis-cli INFO stats | grep -E 'keyspace_(hits|misses)'
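The `keyspace_hits` and `keyspace_misses` counters in `INFO stats` output can be turned into a hit ratio. The sketch below parses the text that `redis-cli INFO stats` prints; the `hit_ratio` function name and the sample numbers are made up for illustration.

```python
from typing import Dict, Optional


def hit_ratio(info_stats: str) -> Optional[float]:
    """Compute hits / (hits + misses) from `INFO stats` text, or None if no lookups."""
    fields: Dict[str, str] = {}
    for line in info_stats.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue  # skip blank lines and section headers such as "# Stats"
        name, _, value = line.partition(":")
        fields[name] = value

    hits = int(fields.get("keyspace_hits", 0))
    misses = int(fields.get("keyspace_misses", 0))
    total = hits + misses
    return hits / total if total else None


# Sample input with invented counter values.
sample = "# Stats\r\nkeyspace_hits:900\r\nkeyspace_misses:100\r\n"
print(hit_ratio(sample))  # 0.9
```

These counters are cumulative since the server started (or since the last `CONFIG RESETSTAT`), so comparing two samples taken over an interval gives a more useful picture than a single reading.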

📚 Further Reading

