A cache stampede is a problem most businesses rarely encounter, but those that have tend to come away with horror stories. Just ask Facebook: in 2010, it suffered one of its worst outages, lasting roughly two and a half hours, in an incident widely described as a cache stampede.
So, what exactly is a cache stampede, and how can we prevent or mitigate it? Let’s dive in!
First, What is a Cache?
For those unfamiliar, a cache is a temporary storage layer that holds frequently accessed data. Think of it as keeping readers’ favorite books at the front counter of a library for easy and quick access, rather than having each user get them from the shelves every time.
A cache helps reduce the load on backend servers by providing quick access to commonly requested information.
PS: We’ll be using this library analogy throughout the article, so follow along!
What, then, is a Cache Stampede?
Now, imagine you have a high-traffic website that relies heavily on cached data to ensure smooth performance. Everything works seamlessly — until the cached data expires. Cached data needs periodic refreshing to stay current, and when it expires, the backend can suddenly become overwhelmed with requests.
It’s like all the readers discovering their favorite books are missing from the counter and rushing to the bookshelves at once — causing a “stampede.”
This surge overwhelms the library, just as expired cache data prompts a flood of requests to the backend, causing a massive spike in load and potentially crippling the system. This, in simple terms, is what’s known as a cache stampede.
More Technically Speaking…
Now let’s give a slightly more technical, and more precise, explanation.
Cached data requires periodic regeneration to ensure it remains current, which naturally involves fetching updated data from the backend.
During this regeneration phase, clients querying the cache may encounter a “cache miss” — meaning the requested data is absent from the cache — unlike a “cache hit,” where the desired data is readily available.
The problem arises when too many clients experience a cache miss and all turn to the backend for data at once.
The result is a cascade of requests that can snowball into severe resource contention, where many threads and processes compete to regenerate the same data at once, ultimately degrading performance and possibly leading to system collapse.
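To make the miss-then-rush sequence concrete, here is a minimal sketch that simulates it inside a single Node.js process. The `Map` stands in for the cache and `fetchDataFromBackend` is a hypothetical slow backend call; neither is a real Redis API, and the 50 ms delay is an arbitrary assumption.

```javascript
// In-memory stand-ins (assumptions for illustration, not a real Redis client)
const cache = new Map();
let backendCalls = 0;

// Hypothetical slow backend: takes about 50 ms and counts how often it is hit
async function fetchDataFromBackend() {
  backendCalls += 1;
  await new Promise((resolve) => setTimeout(resolve, 50));
  return { title: 'Reader favorite' };
}

// Naive read-through: every caller that sees a miss goes to the backend
async function getNaive(key) {
  if (cache.has(key)) return cache.get(key); // cache hit
  const data = await fetchDataFromBackend(); // cache miss
  cache.set(key, data);
  return data;
}

// Many clients arrive right after the entry expired
async function simulateStampede(clients = 100) {
  cache.clear();
  backendCalls = 0;
  await Promise.all(Array.from({ length: clients }, () => getNaive('book:42')));
  return backendCalls; // equals `clients`: every request missed and hit the backend
}
```

Every simulated client checks the cache before the first backend response has arrived, so each one triggers its own backend call. The techniques that follow are all ways of breaking that one-miss-one-backend-call relationship.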
How to Prevent a Cache Stampede Using Redis
Now that we know what a cache stampede is, let’s explore the ways we can prevent it using everyone’s favorite open-source caching server — Redis!
Redis?
Redis is an open-source, in-memory data store often used as a database, cache, and message broker. Its primary advantage lies in its ability to store and access data with lightning-fast speed, making it an ideal choice for caching.
Redis helps your application quickly retrieve frequently accessed data, reducing the need to constantly query the backend.
Preventing a Cache Stampede with Redis
Redis, with advanced features such as distributed locks, provides several ways to mitigate cache stampedes.
Let’s dive into these techniques—we’ll briefly explain each one, explore its tradeoffs, and then give a small code snippet depicting basic implementation.
1- Mutex Locking
Mutex locking is a method to ensure that only one process can regenerate a piece of cache data at a time.
This would be like allowing only one reader to fetch a favorite book or a few favorite books on behalf of multiple others.
In a cache stampede scenario, mutex locks prevent multiple clients from overwhelming the backend by ensuring that only one client fetches the data, while the others wait for the cache to be updated.
The idea here is simple: when a piece of data needs to be regenerated, only one client does the work while others wait.
How It Works:
When a cache miss occurs, the client attempts to acquire a lock.
If the lock is acquired, the client proceeds to regenerate the cache and updates Redis with the new data.
Once the data is refreshed, the lock is released.
If the lock is not acquired (because another client is already regenerating the cache), the client can either wait for the lock to be released or return a default response.
Pros:
Prevents multiple clients from simultaneously hitting the backend, reducing load and preventing a stampede.
Simple to understand and implement with Redis.
Cons:
Requires careful management of the lock to avoid deadlocks (a situation where two or more processes are unable to proceed because each is waiting for the other to release resources) or delays if a client fails to release it.
May introduce slight latency for clients waiting for the lock to be released.
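If the lock TTL (time-to-live) is not managed properly, problems follow: a lock with no TTL can be left behind by a crashed client so that no process can acquire it again, while a TTL that is too short lets the lock expire before regeneration finishes.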
Implementation:
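import { randomUUID } from 'node:crypto';

// Assumes `redis` is an ioredis client and this code runs inside an async function
const token = randomUUID(); // Unique value so we only ever release our own lock
// SET with NX and EX is atomic: the lock is created only if absent and expires after 10 seconds
const acquired = await redis.set('lock:key', token, 'EX', 10, 'NX');
if (acquired === 'OK') {
  // The lock was successfully acquired
  try {
    const data = await fetchDataFromBackend();
    await redis.setex('cache:key', 3600, JSON.stringify(data)); // Cache the data for 1 hour
  } finally {
    // Release the lock only if we still hold it (a Lua script would make this check atomic)
    if ((await redis.get('lock:key')) === token) {
      await redis.del('lock:key');
    }
  }
} else {
  // Lock not acquired; handle accordingly (e.g., wait or return stale data)
}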
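The else branch above is where the "wait" option lives. The following is a rough sketch of one way to do that waiting: poll the cache until the lock holder has filled it. `FakeRedis` and its `setNx` method are hypothetical in-memory stand-ins that mimic only the behavior needed here so the example runs without a server; lock expiry is deliberately left out, and the retry count and delays are arbitrary assumptions.

```javascript
// Hypothetical in-memory stand-in for the few Redis commands used below
class FakeRedis {
  constructor() { this.store = new Map(); }
  async get(key) { return this.store.has(key) ? this.store.get(key) : null; }
  // Mimics SET key value NX: writes only when the key is absent (no expiry in this sketch)
  async setNx(key, value) {
    if (this.store.has(key)) return null;
    this.store.set(key, value);
    return 'OK';
  }
  async set(key, value) { this.store.set(key, value); return 'OK'; }
  async del(key) { return this.store.delete(key) ? 1 : 0; }
}

const redis = new FakeRedis();
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));
let backendCalls = 0;

// Hypothetical slow backend call that counts how often it runs
async function fetchDataFromBackend() {
  backendCalls += 1;
  await sleep(50);
  return 'fresh value';
}

async function getWithLock(key, { retries = 20, waitMs = 25 } = {}) {
  const cacheKey = `cache:${key}`;
  const lockKey = `lock:${key}`;
  for (let attempt = 0; attempt <= retries; attempt++) {
    const cached = await redis.get(cacheKey);
    if (cached !== null) return cached; // hit, possibly filled by the lock holder

    if ((await redis.setNx(lockKey, '1')) === 'OK') {
      try {
        // Re-check: another client may have filled the cache just before we got the lock
        const again = await redis.get(cacheKey);
        if (again !== null) return again;
        const data = await fetchDataFromBackend();
        await redis.set(cacheKey, data);
        return data;
      } finally {
        await redis.del(lockKey); // always release the lock
      }
    }
    await sleep(waitMs); // someone else is regenerating; wait, then check the cache again
  }
  throw new Error(`Timed out waiting for ${cacheKey}`);
}
```

With this shape, a burst of concurrent callers for the same key results in a single backend call, and the waiting callers pick up the value from the cache on a later attempt. Returning a stale or default value instead of throwing on timeout is an equally reasonable choice, depending on how much latency the application can tolerate.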
2- Cache Warming
Instead of placing a fixed number of favorite books at the counter every morning, imagine you — being the hardworking librarian that you are — replenished the supply at the counter from time to time.
This proactive approach ensures that readers will always find their favorites at the counter, never needing to make their way to the shelves.
In the world of caching, this proactive strategy is known as Cache Warming. By refreshing the cache with frequently requested data before it expires, you reduce the likelihood of a cache stampede.
Instead of multiple clients bombarding the backend when a cache miss occurs, the system preloads critical data, ensuring smooth and efficient access.
How It Works:
Identify the data most frequently accessed or critical to your application’s performance.
Use background jobs, cron tasks, or application logic to periodically refresh these cache entries before they expire.
Pros:
Significantly reduces the risk of cache misses, thus preventing a stampede.
Ensures that users experience minimal latency as the data they request is almost always ready in the cache.
Cons:
Requires accurate prediction of cache expiration and usage patterns to be effective.
Can lead to unnecessary backend calls and resource usage if the data isn’t as frequently requested as anticipated.
Refreshing the cache in the background requires additional computational resources.
Implementation:
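const refreshInterval = 50 * 60 * 1000; // 50 minutes: refresh before the 1-hour TTL runs out
setInterval(async () => {
  try {
    const data = await fetchDataFromBackend();
    await redis.setex('cache:key', 3600, JSON.stringify(data)); // Cache the data for 1 hour
  } catch (err) { console.error('Cache warm-up failed:', err); }
}, refreshInterval);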
3- Stale-While-Revalidate
The Stale-While-Revalidate (SWR) strategy offers a practical balance between speed and freshness.
In this approach, when a client requests data, it serves the existing cached version (even if it’s slightly stale) while simultaneously triggering a background process to update the cache with fresh data from the backend.
Following our library analogy, this would be like offering users slightly older copies of favorites while your assistant fetches a new batch of the latest copies from the shelves.
How It Works:
When a request comes in, the system first checks the cache.
If the cache has data, even if it’s stale, it immediately serves this data to the client to ensure low latency.
In parallel, a background process fetches fresh data from the backend and updates the cache for future requests.
Pros:
Provides immediate data to users, minimizing wait time.
Asynchronous updating keeps the cache relevant without blocking user requests.
Cons:
Requires a tolerance for slightly outdated data being served.
Additional complexity in managing the synchronization of background refresh tasks.
Implementation:
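const SOFT_TTL = 300; // Seconds after which an entry counts as stale
const HARD_TTL = 3600; // Seconds after which Redis drops the entry entirely

async function getWithSwr(key) {
  const raw = await redis.get(key);
  if (raw) {
    const entry = JSON.parse(raw);
    if (Date.now() > entry.staleAt) {
      // Serve stale data and trigger a refresh in the background
      refreshCacheInBackground(key);
    }
    return entry.data;
  }
  // Fallback to backend data fetching and cache update
  return refreshCache(key);
}

async function refreshCache(key) {
  const data = await fetchDataFromBackend();
  // Store the soft expiry next to the data so readers can tell fresh from stale
  const entry = { data, staleAt: Date.now() + SOFT_TTL * 1000 };
  await redis.setex(key, HARD_TTL, JSON.stringify(entry));
  return data;
}

function refreshCacheInBackground(key) {
  // Not awaited, so the caller is never blocked; failures are logged rather than thrown
  refreshCache(key).catch((err) => console.error('Background refresh failed:', err));
}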
4- Distributed Caching and Load Balancing
Distributed caching involves spreading cached data across multiple servers or regions to ensure high availability, fault tolerance, and load distribution.
Think of it as setting up multiple counters in different locations in the library, each holding a collection of reader favorites, thus distributing the “demand” and reducing the traffic at any one counter.
In a distributed caching system, load balancing ensures that requests are efficiently routed to the cache node best suited to serve them, preventing any single server from becoming a bottleneck. This strategy leverages geographic proximity and network efficiency to optimize response times and reduce the load on any individual backend server.
Pros:
Improved Availability: Data is available even if some nodes fail.
Scalability: Easily handle increased traffic by adding more nodes.
Reduced Latency: Users are served by the nearest cache node, reducing response times.
Cons:
Complexity: Requires careful configuration and management of distributed nodes.
Data Consistency: Ensuring consistency across multiple nodes can be challenging.
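One simple way to spread keys across cache nodes is to hash each key and pick a node from the result. The sketch below assumes three hypothetical node addresses and uses a small FNV-1a hash written inline. It is only an illustration of the routing idea: production setups more commonly rely on consistent hashing or on Redis Cluster's built-in hash slots, which move fewer keys around when nodes are added or removed.

```javascript
// Hypothetical cache node addresses (assumptions for illustration)
const nodes = ['cache-a:6379', 'cache-b:6379', 'cache-c:6379'];

// 32-bit FNV-1a hash of a string, returned as an unsigned integer
function fnv1a(text) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash;
}

// The same key always maps to the same node, so its cached value lives in one place
function nodeForKey(key) {
  return nodes[fnv1a(key) % nodes.length];
}
```

Because the mapping depends only on the key, every application server routes a given key to the same counter, to borrow the library picture. The weakness of plain modulo routing is that changing `nodes.length` remaps most keys at once, which can itself cause a wave of cache misses.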
Redlock
In distributed systems, the use of locks—as you can imagine—can quickly get messy. This is why a reliable lock mechanism, such as Redlock, is crucial to ensure that only one process can modify a shared resource at a time.
Redlock is an algorithm designed for distributed locking using Redis, providing fault tolerance and ensuring that the lock is correctly managed across multiple Redis nodes.
How It Works:
Acquire Lock: The client tries to acquire the lock on every Redis node in turn, using the same key and a unique random value on each.
Set Expiry: Each lock has an expiration to prevent deadlocks if the client fails to release it.
Consensus: The lock is considered acquired if the client manages to lock it on a majority of nodes within a given timeframe.
Release Lock: Once the operation is complete, the client releases the lock across all nodes.
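The majority rule in the steps above can be sketched without any Redis servers. In this simplified illustration each "node" is just a `Map`, and `tryLockOnNode`, `releaseOnAllNodes`, and `acquireWithQuorum` are hypothetical helper names. The real algorithm also accounts for elapsed time, clock drift, and per-node timeouts, all of which are left out here.

```javascript
// Each Map stands in for one independent Redis node (illustration only)
function createNodes(count) {
  return Array.from({ length: count }, () => new Map());
}

// Hypothetical helper: lock one node if the resource is free or its lock has expired
function tryLockOnNode(node, resource, token, ttlMs, now) {
  const existing = node.get(resource);
  if (existing && existing.expiresAt > now) return false;
  node.set(resource, { token, expiresAt: now + ttlMs });
  return true;
}

// Remove the lock from every node, but only where this token owns it
function releaseOnAllNodes(nodes, resource, token) {
  for (const node of nodes) {
    const existing = node.get(resource);
    if (existing && existing.token === token) node.delete(resource);
  }
}

// Returns true only when a majority of nodes granted the lock
function acquireWithQuorum(nodes, resource, token, ttlMs, now = Date.now()) {
  const quorum = Math.floor(nodes.length / 2) + 1;
  const granted = nodes.filter((node) => tryLockOnNode(node, resource, token, ttlMs, now)).length;
  if (granted >= quorum) return true;
  releaseOnAllNodes(nodes, resource, token); // roll back a partial acquisition
  return false;
}
```

With three nodes the quorum is two, so a client can still take the lock when one node is unavailable or already holds a leftover lock, while two competing clients can never both reach a majority at the same moment.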
Pros:
Fault Tolerance: Works even if some nodes fail.
Avoids Single Point of Failure: The lock isn’t dependent on a single Redis instance.
Cons:
Complexity: More complex to implement compared to single-node locking.
Slight Overhead: Involves multiple Redis nodes, which could introduce slight delays.
Basic NestJS-based Implementation of Redlock:
import { Injectable } from '@nestjs/common';
import Redis from 'ioredis';
import Redlock from 'redlock';

@Injectable()
export class LockService {
  private redisClients: Redis[];
  private redlock: Redlock;

  constructor() {
    // Use an odd number of independent nodes (three or five is typical)
    // so that a majority still exists when one node fails
    this.redisClients = [
      new Redis({ host: '127.0.0.1', port: 6379 }),
      new Redis({ host: '127.0.0.2', port: 6379 }),
      new Redis({ host: '127.0.0.3', port: 6379 }),
    ];
    this.redlock = new Redlock(
      this.redisClients,
      {
        driftFactor: 0.01, // time drift factor
        retryCount: 10, // max number of retries
        retryDelay: 200, // time in ms between retries
      }
    );
  }

  // `ttl` is the lock duration in milliseconds
  async acquireLock(resource: string, ttl: number): Promise<any> {
    try {
      const lock = await this.redlock.acquire([resource], ttl);
      return lock;
    } catch (err) {
      console.error('Failed to acquire lock:', err);
      throw err;
    }
  }

  async releaseLock(lock: any): Promise<void> {
    try {
      await lock.release();
    } catch (err) {
      // The lock may already have expired; log and continue
      console.error('Failed to release lock:', err);
    }
  }
}
5- Proactive Cache Updates
Proactive cache updates involve updating the cache whenever a write operation occurs in the database.
This ensures that the cache always reflects the latest state of the data, akin to a librarian immediately replacing old copies of reader favorites at the counter with the latest editions (in our imaginary library, books get outdated fast!).
This approach, also known as write-through, eliminates cache misses for updated data, as the cache is refreshed in real-time alongside database updates.
Pros:
Consistency: Cache always reflects the latest data, reducing stale data risks (users will never find an outdated copy at the counter).
Reduced Latency: Users receive updated data without waiting for cache regeneration.
Cons:
Increased Write Load: Every database write triggers a cache update, potentially increasing the load.
Complexity: Requires integration between database write operations and cache updates.
Implementation:
// `db` is assumed to emit a 'write' event carrying the record that was written
db.on('write', async (data) => {
  await redis.setex(`cache:key:${data.id}`, 3600, JSON.stringify(data));
});
6- Rate Limiting
Rate limiting controls the number of requests a client can make in a given time frame.
It’s like imposing a limit on how many books a reader can pick per hour or per day. This prevents any single client or group of clients from overwhelming the system with too many requests in a short period.
Rate limiting helps mitigate the risk of a stampede by throttling requests, ensuring the backend and cache can handle the load gracefully.
Pros:
Protection Against Overload: Prevents system overload by limiting excessive requests.
Fair Resource Distribution: Ensures all clients get fair access to resources.
Cons:
Potential User Frustration: Users may be denied service if they hit the limit, potentially harming their experience.
Complex Configuration: Requires careful tuning to balance user experience and system protection.
Implementation:
const limit = 10; // Max 10 requests per minute
const ttl = 60; // Time-to-live in seconds

async function rateLimit(clientId) {
  const key = `rate:${clientId}`;
  const current = await redis.incr(key);
  if (current === 1) {
    // First request in this window: start the 60-second countdown
    await redis.expire(key, ttl);
  }
  if (current > limit) {
    throw new Error('Rate limit exceeded');
  }
}
7- Probabilistic Early Expiration
Probabilistic early expiration involves renewing cache entries before they expire based on a probabilistic decision.
This would be like replenishing the supply of some randomly selected favorites at the front counter before it actually runs out.
In practice, each request that reads an entry has a small chance of refreshing it early, and that chance grows as the expiry time approaches. Regeneration is therefore spread out over time instead of happening in one sudden burst when the entry expires.
Pros:
Reduced Stampede Risk: Spreads cache regeneration over time, preventing sudden backend spikes.
Improved System Stability: Reduces the chance of simultaneous cache misses leading to a stampede.
Cons:
Complexity: Requires careful tuning of the probabilistic model.
Potential Resource Waste: Some cache entries may be renewed unnecessarily.
Implementation:
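const maxTtl = 3600; // Full lifetime of a cache entry, in seconds
const scalingFactor = 0.1; // Example tuning value: higher means earlier, more frequent refreshes

// `ttl` is the remaining lifetime of the entry in seconds
function shouldExpireEarly(ttl) {
  const probability = Math.min(1, (1 - (ttl / maxTtl)) * scalingFactor);
  return Math.random() < probability;
}

// Inside an async request handler; a missing key (negative TTL) should be handled as a normal miss
const cacheTtl = await redis.ttl('cache:key');
if (cacheTtl > 0 && shouldExpireEarly(cacheTtl)) {
  const data = await fetchDataFromBackend();
  await redis.setex('cache:key', 3600, JSON.stringify(data));
}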
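A widely cited formulation of this idea is the "XFetch" rule from the paper "Optimal Probabilistic Cache Stampede Prevention": recompute when `now - delta * beta * ln(rand())` reaches the expiry time, where `delta` is how long the last recomputation took and `beta` tunes how eager the refresh is. The sketch below is a hedged illustration of that rule rather than a drop-in implementation; the function name and parameter names are illustrative, and the random source is injectable only to make the behavior easy to check.

```javascript
// Decide whether to refresh an entry early (XFetch-style rule).
// nowMs and expiryMs are timestamps in ms; deltaMs is the duration of the last recomputation.
// beta > 1 favors earlier refreshes, beta < 1 later ones.
function shouldRefreshEarly(nowMs, expiryMs, deltaMs, beta = 1, random = Math.random) {
  const u = 1 - random(); // in (0, 1], which avoids Math.log(0)
  // Math.log(u) is zero or negative, so subtracting it shifts "now" forward by a random amount
  return nowMs - deltaMs * beta * Math.log(u) >= expiryMs;
}
```

Entries that are expensive to rebuild (large `deltaMs`) start refreshing earlier, which is the point: the costlier the regeneration, the more headroom a request gets to finish it before the entry actually expires.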
Conclusion
There you have it! We’ve explored various ways you can mitigate the cache stampede problem, leveraging the time-tested and trusted Redis.
The right approach for you will of course depend on your system’s requirements such as latency and tolerance for stale data, as well as the complexity you’re willing to manage.
By implementing these techniques, you can protect your backend from sudden load spikes and ensure a smooth experience for your users (or perhaps maintain a quiet library!).