Caching Failures in Production: SRE Guide to Problems and Mitigations
Caching is a powerful tool to reduce latency and load on backend systems, but poorly implemented caching can fail catastrophically under pressure. This guide walks through common caching problems encountered in real-world production systems and how to mitigate them from an SRE perspective.
1. Thundering Herd Problem
What Happens:
- Many cached keys expire at the same time
- All incoming requests bypass the cache and hit the database
- The database gets overwhelmed by a flood of identical queries
Example Scenario:
Imagine thousands of requests trying to retrieve a product list: all keys expire at once → cache miss → DB overload.
Mitigation Strategies:
- Stagger expiration times: Add random jitter to TTL values
// Base TTL of 300s plus 0–60s of random jitter
TTL = 300 + rand(0, 60)
- Request coalescing: Let only one request per key trigger the DB call while the others wait for its result (e.g., Redis + locks, Go singleflight)
- Protect non-core services: Block requests for non-essential data until the cache repopulates
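Request coalescing can be sketched in a few lines. The following is a minimal in-process illustration, not a production implementation: `SingleFlight` and its `do` method are hypothetical names, and it only coalesces callers inside one process (a distributed setup would need a shared lock, for example in Redis).

```python
import threading
from concurrent.futures import Future


class SingleFlight:
    """Hypothetical helper: collapse concurrent loads of one key into one call."""

    def __init__(self):
        self._lock = threading.Lock()
        self._in_flight = {}  # key -> Future shared by every waiting caller

    def do(self, key, loader):
        with self._lock:
            future = self._in_flight.get(key)
            leader = future is None
            if leader:
                future = Future()
                self._in_flight[key] = future
        if not leader:
            # Followers block here until the leader publishes its result.
            return future.result()
        try:
            value = loader()  # only the leader reaches the database
        except Exception as exc:
            future.set_exception(exc)  # followers see the same error
            raise
        else:
            future.set_result(value)
            return value
        finally:
            with self._lock:
                del self._in_flight[key]
```

A caller would wrap its cache-miss path as `sf.do("products", load_products_from_db)`, where `load_products_from_db` stands for whatever function queries the database.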
2. Cache Penetration
What Happens:
- Requests are made for keys that don't exist in the cache or DB
- Cache keeps missing → DB gets queried repeatedly
- Leads to performance degradation and wasted DB cycles
Mitigation Strategies:
- Cache negative responses: Store `null` or `not_found` values with a short TTL
GET /user/unknown_id
→ cache → miss
→ DB → miss
→ store `null` in cache with 30s TTL
- Use a Bloom filter:
  - Tracks known keys efficiently
  - Rejects clearly invalid queries early
  - Especially useful for read-heavy systems
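Both mitigations can be combined in one lookup path. The sketch below is illustrative only: `BloomFilter`, `get_user`, `NOT_FOUND`, and the TTL constants are hypothetical names, the cache is a plain dict, and the filter sizing is arbitrary. A Bloom filter never gives false negatives but can give false positives, which is why negative caching is still useful behind it.

```python
import hashlib
import time


class BloomFilter:
    """Tiny illustrative Bloom filter (sizes chosen arbitrarily)."""

    def __init__(self, size_bits=8192, num_hashes=4):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, key):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


NOT_FOUND = object()  # sentinel stored for negative cache entries
NEGATIVE_TTL = 30     # seconds, as in the 30s flow above
POSITIVE_TTL = 300    # seconds, assumed value for real records


def get_user(user_id, cache, bloom, db_lookup, now=time.monotonic):
    """Return the user record, or None if the ID does not exist."""
    if not bloom.might_contain(user_id):
        return None  # definitely unknown: skip cache and DB entirely
    entry = cache.get(user_id)
    if entry is not None and entry[1] > now():
        return None if entry[0] is NOT_FOUND else entry[0]
    record = db_lookup(user_id)  # returns None when the row is missing
    ttl = POSITIVE_TTL if record is not None else NEGATIVE_TTL
    cache[user_id] = (NOT_FOUND if record is None else record, now() + ttl)
    return record
```

The short negative TTL keeps a key that is created later from staying invisible for long.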
3. Cache Breakdown (Hot Key Expiry)
What Happens:
- A popular key (hot key) expires
- All users simultaneously trigger a cache miss
- Causes a spike in DB load for that key
Mitigation Strategies:
- Avoid expiring hot keys: Set a long TTL, or none, for keys accessed very frequently
- Pre-warm hot keys after a restart or cache flush
- Lazy expiration: Extend the TTL when a key is accessed, so keys in active use do not expire
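One way to read the last two strategies is a cache that renews an entry's TTL on every hit and can be pre-populated with known hot keys. This is a minimal in-memory sketch under that assumption; `SlidingTTLCache`, `warm`, and the injectable `clock` are hypothetical names, not part of any library.

```python
import time


class SlidingTTLCache:
    """Hypothetical in-memory cache whose entries get a fresh TTL on each hit."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (value, expires_at)

    def set(self, key, value):
        self._entries[key] = (value, self._clock() + self.ttl_seconds)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if expires_at <= self._clock():
            del self._entries[key]  # expired: behave like a miss
            return None
        # Hit: push the expiry out again so a key in active use never lapses.
        self._entries[key] = (value, self._clock() + self.ttl_seconds)
        return value

    def warm(self, keys, loader):
        """Pre-populate known hot keys, e.g. after a restart or flush."""
        for key in keys:
            self.set(key, loader(key))
```

With Redis, a similar effect can be approximated by re-issuing `EXPIRE` on the key after a successful read. Note that a key renewed this way never refreshes its value on its own, so it still needs explicit invalidation when the underlying data changes.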
4. Cache Crash
What Happens:
- Entire cache service (e.g., Redis, Memcached) goes down
- All traffic reroutes directly to the origin system (e.g., DB, API)
- Sudden load causes cascading failures
Mitigation Strategies:
- Circuit breakers: Block non-critical traffic to the DB while the cache is down
- Cache clustering / replication:
  - Run Redis Sentinel, Redis Cluster, or ElastiCache Multi-AZ setups
- Graceful degradation:
  - Return defaults or reduced data if the cache fails
// Circuit breaker for cache dependency
if (cacheDown && dataType == "nonCritical") {
    return fallbackValue;
}
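The snippet above assumes something already knows that the cache is down. A minimal sketch of how that state might be tracked follows; `CacheCircuitBreaker`, `fetch`, the thresholds, and the use of `ConnectionError` to signal a cache failure are all assumptions made for illustration.

```python
import time


class CacheCircuitBreaker:
    """Hypothetical breaker: opens after repeated cache failures."""

    def __init__(self, failure_threshold=3, reset_after=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at = None

    def is_open(self):
        if self._opened_at is None:
            return False
        if self._clock() - self._opened_at >= self.reset_after:
            # Cool-down elapsed: allow one trial call (half-open state).
            self._opened_at = None
            self._failures = self.failure_threshold - 1
            return False
        return True

    def record_success(self):
        self._failures = 0

    def record_failure(self):
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._clock()


def fetch(key, critical, breaker, cache_get, db_get, fallback):
    cache_down = breaker.is_open()
    if not cache_down:
        try:
            value = cache_get(key)
        except ConnectionError:
            breaker.record_failure()
            cache_down = True
        else:
            breaker.record_success()
            if value is not None:
                return value
    if cache_down and not critical:
        return fallback  # shed non-critical load instead of hitting the DB
    return db_get(key)
```

While the breaker is open, requests skip the cache call entirely, which avoids paying a connection timeout on every request during the outage.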
5. Stale Data / Inconsistent Cache
What Happens:
- Cache contains outdated data (due to race conditions or sync delay)
- System reflects incorrect state or introduces bugs
Mitigation Strategies:
- Use a write-through cache: The cache is updated on every write instead of waiting for the next read
- Use event-driven cache invalidation (Kafka, pub-sub, etc.)
- Prefer short TTL + versioning for consistency-sensitive data
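Write-through and versioning can be illustrated together. In this sketch, which uses plain dicts in place of a real database and cache, every write bumps a version number and the cache refuses any update older than what it already holds. `WriteThroughStore` and `cache_put` are hypothetical names.

```python
class WriteThroughStore:
    """Hypothetical store that writes to the DB and the cache together."""

    def __init__(self):
        self.db = {}     # key -> (version, value); stands in for the real DB
        self.cache = {}  # key -> (version, value)

    def write(self, key, value):
        version = self.db.get(key, (0, None))[0] + 1
        self.db[key] = (version, value)
        self.cache_put(key, version, value)  # write-through step
        return version

    def cache_put(self, key, version, value):
        """Apply a cache update only if it is newer than what is cached."""
        cached = self.cache.get(key)
        if cached is None or version > cached[0]:
            self.cache[key] = (version, value)
            return True
        return False  # late or duplicate update: ignore it

    def read(self, key):
        cached = self.cache.get(key)
        if cached is not None:
            return cached[1]
        row = self.db.get(key)
        if row is None:
            return None
        self.cache_put(key, *row)  # repopulate on a miss
        return row[1]
```

The version check is what protects against the race where a delayed invalidation or update event arrives after a newer write has already been cached.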
6. Over-Caching / Memory Pressure
What Happens:
- Cache stores too much irrelevant or stale data
- Eviction policies (e.g., LRU) remove useful data prematurely
Mitigation Strategies:
- Set appropriate TTLs
- Use namespaced keys and selectively expire
- Monitor hit ratio and memory usage
- Tune eviction strategies (`allkeys-lru`, `volatile-ttl`, etc.)
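To see how LRU eviction can push out useful data, consider this small in-memory sketch. `LRUCache` is a hypothetical class and the key names are invented for the example; it is not how Redis implements its approximated LRU.

```python
from collections import OrderedDict


class LRUCache:
    """Hypothetical fixed-capacity cache that evicts the least recently used key."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._data = OrderedDict()
        self.evictions = 0

    def get(self, key):
        if key not in self._data:
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def set(self, key, value):
        if key in self._data:
            self._data.move_to_end(key)
        self._data[key] = value
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # drop the oldest entry
            self.evictions += 1


cache = LRUCache(capacity=3)
cache.set("hot:homepage", "<html>")
for i in range(3):
    cache.set(f"report:{i}", "rarely read")  # one-off keys fill the cache
```

After the loop, `hot:homepage` has been evicted even though it is the entry most worth keeping, because three rarely read keys were written after it. Shorter TTLs on one-off data, or not caching it at all, leave room for keys that are actually reused.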
Summary Table of Problems & Solutions
Problem | Root Cause | Solution Highlights |
---|---|---|
Thundering Herd | Mass expiration | TTL jitter, request coalescing |
Cache Penetration | Nonexistent keys | Cache nulls, Bloom filter |
Cache Breakdown | Hot key expires | No expiry for hot keys, lazy revalidation |
Cache Crash | Cache service down | Circuit breaker, replication, failover cache |
Inconsistent Cache | Race conditions, stale writes | Write-through, pub-sub invalidation |
Over-Caching | Irrelevant/expired data retained | TTL tuning, LRU config, metrics |
Monitoring Tips
- Track cache hit/miss ratio
- Monitor cache node CPU/memory/network usage
- Alert on keyspace evictions, latency spikes, or connection errors
- Use dashboards: Redis Insight, Grafana, CloudWatch, etc.
# Redis CLI: show the cache hit and miss counters
redis-cli INFO stats | grep -E 'keyspace_(hits|misses)'
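The hit ratio itself has to be derived from the `keyspace_hits` and `keyspace_misses` counters. A minimal sketch of that calculation, working on text captured from `redis-cli INFO stats`, is shown below. The `hit_ratio` function is a hypothetical helper and the sample numbers are made up for illustration.

```python
def hit_ratio(info_stats_text):
    """Compute hits / (hits + misses) from `redis-cli INFO stats` output."""
    fields = {}
    for line in info_stats_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue  # skip blank lines and section headers
        name, _, value = line.partition(":")
        fields[name] = value
    hits = int(fields.get("keyspace_hits", 0))
    misses = int(fields.get("keyspace_misses", 0))
    total = hits + misses
    return hits / total if total else None  # None when there is no traffic yet


# Made-up sample in the INFO stats line format.
sample = """# Stats
keyspace_hits:9500
keyspace_misses:500
evicted_keys:12
"""
print(hit_ratio(sample))  # prints 0.95
```

Because these counters are cumulative since the server started, a dashboard would normally compute the ratio over the difference between two samples rather than over the raw totals.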