An Interview with an Exhausted Redis Node
I sat down with our caching server to talk about cache stampedes, missing TTLs, and the things backend developers keep getting wrong.

The Query That Wouldn't Stop
Last September we had a dashboard page that loaded user activity summaries. Each page load triggered a query that joined three tables, aggregated counts across a rolling 30-day window, and returned about 2KB of JSON. The query itself took around 180ms. Not terrible for one request. But this was the landing page for every logged-in user, and we had about 6,000 active users hitting it throughout the day.
Postgres was averaging 70% CPU by mid-morning. By 2pm it would regularly spike to 95%. Alerts were firing. The on-call engineer was restarting connections. And the thing was, the underlying data only changed maybe once or twice per hour for most users. We were re-running that expensive aggregation thousands of times a day to return the same result.
So: caching. We needed it. We already had a Redis instance sitting around doing session storage, barely breaking a sweat. The plan was simple. Put the activity summary in Redis, serve it from memory, stop hammering Postgres.
The plan was simple. The execution was not.
Cache-Aside and the First Attempt
We went with cache-aside because it's the pattern everyone learns first and it made sense for our case. The application checks Redis before querying the database. On a miss, it queries Postgres, writes the result to Redis, and returns it. On a hit, it skips the database entirely.
def get_activity_summary(user_id):
    cache_key = f"activity_summary:{user_id}"
    cached = redis_client.get(cache_key)
    if cached:
        return json.loads(cached)
    # Cache miss: run the expensive query and store the result
    summary = db.execute(ACTIVITY_SUMMARY_QUERY, user_id=user_id)
    redis_client.set(cache_key, json.dumps(summary))
    return summary
This worked immediately. Postgres CPU dropped to 30% within an hour of deployment. The dashboard loaded in 12ms instead of 180ms for cache hits. Everyone was happy.
For about three days.
Then our Redis instance started using way more memory than expected. It had been sitting at around 400MB for session data. After deploying the activity cache it climbed to 1.2GB within a week and kept going. We'd allocated 2GB and it was on track to hit that ceiling fast.
The problem is visible in that code sample if you look carefully. There's no TTL. We were writing keys and never expiring them. Every user who logged in got an activity_summary:{user_id} key that lived forever. Users who hadn't logged in for months still had cached summaries sitting in memory. We had keys for users who had deleted their accounts.
The fix was one argument:
redis_client.set(cache_key, json.dumps(summary), ex=1800) # 30 min TTL
Thirty minutes felt right for our use case. The data changed infrequently enough that serving a slightly stale summary was acceptable, and 30 minutes meant that even during peak hours, a given key would only trigger a database query twice an hour instead of dozens of times.
Memory stabilized within a day. The lesson sounds obvious in retrospect. It always does. But when you're rushing to fix a production performance problem, "just throw it in Redis" can skip past the part where you think about what happens to those keys over time.
Choosing TTL Values Is Harder Than It Looks
There's no formula for the right TTL. I've seen people recommend starting with 5 minutes as a default, but that's arbitrary and sometimes wrong in both directions.
For our activity summaries, 30 minutes worked because the tolerance for staleness was high. Nobody cares if their dashboard shows 847 events instead of 849 for the next half hour. But we also cached product inventory counts on a different page, and 30 minutes there would have been a disaster. A product could sell out and the page would still show "In Stock" for potentially half an hour. We used a 60-second TTL for inventory, which meant more database hits but acceptable freshness.
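If it helps to see that concretely, the way I'd write those decisions down is a single per-prefix TTL table. This is only a sketch, it assumes the shared redis_client from earlier, and the prefix names and helper are illustrative rather than our exact code:

CACHE_TTLS = {
    "activity_summary": 1800,  # dashboards tolerate ~30 minutes of staleness
    "user_profile": 1800,      # profile data changes rarely
    "inventory_count": 60,     # stock levels need near-real-time freshness
}

def cache_set(prefix, suffix, value):
    # Every cached write goes through one helper so nobody forgets the TTL again.
    redis_client.set(f"{prefix}:{suffix}", json.dumps(value), ex=CACHE_TTLS[prefix])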
The tricky part is that TTL decisions are business decisions disguised as technical ones. You're deciding how stale the data is allowed to be, and that answer depends on what the data means to the user. A profile photo URL can be cached for hours. A bank account balance should probably not be cached at all, or if it is, with a very short window and aggressive invalidation on writes.
One thing I started doing was logging cache hit rates per key prefix. We added a simple counter that tracked hits versus misses over 5-minute windows. This told us which TTLs were too short (low hit rate, lots of unnecessary misses) and which were probably too long (near 100% hit rate, meaning the data was almost never refreshed). We adjusted several TTLs based on this data over the following weeks and found a much better balance than our initial guesses.
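The counter itself was only a few lines. Here's roughly the shape of it, a sketch that assumes the same shared redis_client; the key naming and hash layout are illustrative:

import time

def record_cache_result(prefix, hit):
    # Bucket counts into 5-minute windows so hit rate can be charted over time.
    window = int(time.time() // 300)
    counter_key = f"cache_stats:{prefix}:{window}"
    redis_client.hincrby(counter_key, "hits" if hit else "misses", 1)
    redis_client.expire(counter_key, 86400)  # keep a day of history, then let it expire

def hit_rate(prefix, window):
    stats = redis_client.hgetall(f"cache_stats:{prefix}:{window}")
    hits = int(stats.get(b"hits", 0))    # assumes decode_responses=False (bytes keys)
    misses = int(stats.get(b"misses", 0))
    total = hits + misses
    return hits / total if total else None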
The Black Friday Stampede
This is the one that actually scared me.
We run sales events a few times a year. During the biggest one, we had a promotional pricing page that was essentially a single Redis key, promo:pricing:main, serving a blob of JSON with all the sale prices. TTL of 5 minutes. Hundreds of thousands of page views per hour during the event.
At some point during peak traffic, that key expired. Normal behavior. Expected. But in the window between the key expiring and the cache being repopulated, somewhere around 3,000 requests arrived, all saw a cache miss, and all fired the same database query simultaneously.
The Postgres connection pool was maxed out instantly. Queries started queueing. Response times went from 15ms to 8 seconds. Some requests timed out. The monitoring dashboard lit up like a Christmas tree.
This is a cache stampede, sometimes called a thundering herd. It happens when a popular key expires and many concurrent requests all try to repopulate it at the same time. The irony is that every single one of those requests was doing the right thing according to the cache-aside pattern. Check cache, miss, query database, write result. They were all following the pattern correctly. The pattern just doesn't account for thousands of processes following it simultaneously.
The Mutex Lock Fix
The solution we implemented was a distributed lock using Redis itself. When a request sees a cache miss, before querying the database it tries to acquire a short-lived lock:
def get_promo_pricing():
    cache_key = "promo:pricing:main"
    lock_key = "lock:promo:pricing:main"
    cached = redis_client.get(cache_key)
    if cached:
        return json.loads(cached)
    # Try to acquire the lock
    acquired = redis_client.set(lock_key, "1", nx=True, ex=3)
    if acquired:
        # This request is responsible for refreshing
        pricing = db.execute(PROMO_PRICING_QUERY)
        redis_client.set(cache_key, json.dumps(pricing), ex=300)
        redis_client.delete(lock_key)
        return pricing
    else:
        # Another request is already refreshing; wait and retry
        time.sleep(0.1)
        cached = redis_client.get(cache_key)
        if cached:
            return json.loads(cached)
        # If still no data, fall through to the database
        return db.execute(PROMO_PRICING_QUERY)
The nx=True flag means "only set if the key doesn't exist," so only one request wins the race. That request refreshes the cache. Everyone else waits briefly and retries, and by then the cache is warm again.
The ex=3 on the lock key is a safety net. If the winning request crashes or takes too long, the lock auto-expires after 3 seconds so we don't end up with a permanent lock blocking cache refreshes forever.
This is a simplified version. In production we added a retry loop with exponential backoff and a maximum retry count. We also added a fallback that serves slightly stale data if it's available (by keeping the old value around briefly after the TTL expires, using a separate key). But the core idea is the mutex. One request refreshes, everyone else waits.
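For what it's worth, the retry side ended up shaped roughly like this. This is a sketch rather than our production code; the stale-key name and the backoff numbers are illustrative, but keeping the previous value under a separate key is the fallback described above:

import json
import time

def wait_for_refresh(cache_key, stale_key, max_retries=5):
    # Another request holds the lock, so back off and keep re-checking the cache.
    delay = 0.05
    for _ in range(max_retries):
        time.sleep(delay)
        cached = redis_client.get(cache_key)
        if cached:
            return json.loads(cached)
        delay *= 2  # exponential backoff: 50ms, 100ms, 200ms, ...
    # Still nothing after all retries: serve the previous value if we kept one.
    stale = redis_client.get(stale_key)
    if stale:
        return json.loads(stale)
    # Last resort: go to the database directly.
    return db.execute(PROMO_PRICING_QUERY)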
After deploying this, our next sale event handled 4x the traffic with zero stampede incidents. Postgres barely noticed.
Cache Invalidation: Delete, Don't Update
There's that old joke: the two hardest problems in computer science are cache invalidation, naming things, and off-by-one errors. After a year of running caching in production, I think cache invalidation deserves its reputation.
We had two patterns in different parts of the codebase. Some endpoints used write-through: when the database was updated, the code would also update the corresponding Redis key with the new value. Other endpoints used invalidation: on a database write, delete the Redis key and let the next read repopulate it.
Write-through seemed cleaner at first. The cache always has the latest data. No miss penalty after an update. But we ran into a subtle bug that took two weeks to track down.
The sequence was: a user updated their profile, the app wrote to Postgres, then wrote to Redis. But occasionally, maybe once in a thousand requests, the Redis write would fail silently. A network blip, a brief timeout, something transient. The database had the new data. Redis had the old data. And because the TTL was 30 minutes, the user would see their old profile for up to half an hour after updating it. They'd refresh the page, see the old version, try updating again, and enter a frustrating loop.
We switched everything to invalidation. On a write, just delete the cache key: redis_client.delete(cache_key). If the delete fails, the worst case is the old data gets served until the TTL expires naturally. If the delete succeeds, the next read triggers a cache miss, queries the database, and gets the fresh data. There's a small performance penalty, since that next read is slower because it hits the database, but correctness beats speed.
def update_user_profile(user_id, new_data):
    db.execute(UPDATE_PROFILE_QUERY, user_id=user_id, data=new_data)
    # Don't try to update the cache; just kill it
    redis_client.delete(f"user_profile:{user_id}")
    redis_client.delete(f"activity_summary:{user_id}")
Notice we're deleting two keys there. Profile changes can affect the activity summary display too (username changes, avatar updates). This is where invalidation gets annoying: you need to know all the cache keys that might be affected by a given write. Miss one and you get stale data in some view that nobody thinks to check.
We started maintaining a simple mapping in our codebase: a dictionary that listed, for each database table, all the cache key prefixes that should be invalidated when that table is written to. It's manual and it's tedious and it needs to be updated every time someone adds a new cached query. But it works, and it's saved us from stale data bugs multiple times.
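The mapping is nothing clever. Simplified, with example table names standing in for our real schema, it looks like this:

# For each table, the per-user cache key prefixes that go stale when it changes.
INVALIDATION_MAP = {
    "users": ["user_profile", "activity_summary"],
    "events": ["activity_summary"],
}

def invalidate_caches_for(table, user_id):
    for prefix in INVALIDATION_MAP.get(table, []):
        redis_client.delete(f"{prefix}:{user_id}")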
The Serialization Tax
Something nobody warned me about.
We were caching large JSON objects; some of our cached responses were 15-20KB of serialized JSON. Every cache hit involved deserializing that JSON back into a Python dictionary. For a single request, this is negligible. At 2,000 requests per minute, the cumulative CPU time spent on json.loads() was actually showing up in our profiling.
We switched to MessagePack for the largest cached objects. Faster serialization, smaller payloads. The 18KB JSON blobs became 11KB MessagePack blobs. Deserialization time dropped by about 40%. Not a game changer on its own, but it reduced our application server CPU usage by a few percentage points during peak hours, which gave us headroom we needed.
For smaller cached values (anything under a few KB) we stayed with JSON because the difference was negligible and JSON is easier to debug. You can redis-cli GET a key and actually read what's in it. MessagePack values are binary and opaque without tooling.
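There are a few ways to wire up that split. One simple approach, sketched here with an illustrative size threshold and a one-byte format tag so reads know what they're looking at (the tag is my own convention, not something from either library), would be:

import json
import msgpack

LARGE_PAYLOAD_BYTES = 4096  # illustrative cutoff; tune against your own payloads

def serialize(value):
    data = json.dumps(value).encode()
    if len(data) > LARGE_PAYLOAD_BYTES:
        return b"m" + msgpack.packb(value)  # big blobs: smaller and faster to decode
    return b"j" + data  # small values stay JSON so redis-cli GET is still readable

def deserialize(raw):
    tag, body = raw[:1], raw[1:]
    return msgpack.unpackb(body) if tag == b"m" else json.loads(body)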
Monitoring What Redis Is Actually Doing
For the first few months, we treated Redis as a black box. Data goes in, data comes out. We monitored memory usage and connection count, which is the bare minimum.
Then we started running INFO stats periodically and feeding the metrics into our monitoring system. Three numbers turned out to be really useful:
Hit rate: keyspace_hits / (keyspace_hits + keyspace_misses). This is the single most important metric for a cache, and there's a sketch of pulling it from INFO just after this list. If your hit rate is 60%, your cache is barely helping. Ours was at 94% for most key prefixes, which told us the TTLs were in a reasonable range. The inventory keys were at 72%, which was expected given the short TTL, and we accepted that tradeoff.
Evicted keys: if this number is climbing, your cache is running out of memory and dropping keys before their TTL expires. This means your effective TTL is shorter than what you configured, and your hit rate is lower than it should be. We hit this during one of our scaling events and had to bump the instance size.
Connected clients: a sudden spike usually means something is wrong in the application layer. Connection leaks, retry storms, or a deployment that's creating too many connections. We set an alert at 80% of maxclients and it caught a connection leak in a background job worker before it caused an outage.
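Pulling those numbers out of INFO with redis-py is only a few lines; a sketch, assuming the same redis_client:

def collect_cache_metrics():
    stats = redis_client.info("stats")
    hits = stats["keyspace_hits"]
    misses = stats["keyspace_misses"]
    total = hits + misses
    return {
        # These counters are cumulative since the server started;
        # sample periodically and diff the samples for a windowed rate.
        "hit_rate": hits / total if total else None,
        "evicted_keys": stats["evicted_keys"],
        "connected_clients": redis_client.info("clients")["connected_clients"],
    }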
We also set up SLOWLOG to catch any commands taking longer than 10ms. This caught a KEYS * command that someone had put in a debug endpoint and forgotten about. KEYS scans every key in the database and blocks Redis while it does it. In production, with hundreds of thousands of keys, that command was taking 200ms and blocking all other operations. We replaced it with SCAN, which does the same thing incrementally without blocking.
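The swap itself is small with redis-py, which exposes the incremental cursor as an iterator (the helper name here is just for illustration):

def find_keys(pattern):
    # SCAN walks the keyspace in batches, so Redis stays responsive in between;
    # KEYS would block while it examined every key in one shot.
    return list(redis_client.scan_iter(match=pattern, count=500))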
Don't Cache Everything
This sounds obvious but it's easy to get into a mindset where the answer to every performance problem is "put it in Redis." After our initial success with the activity summary cache, there was pressure to cache more and more things.
We cached user notification counts. Fine, that made sense. We cached search results. That was more questionable: search queries have high cardinality and low repeat rates, so the hit rate was terrible. We were filling Redis memory with cached search results that nobody would ever request again. We removed that cache after a week when we saw the hit rate was 8%.
The rule of thumb I use now: cache data that is read frequently, written infrequently, and where some staleness is tolerable. If any of those three conditions isn't met, caching might not be the right tool. High-cardinality data with unique access patterns (like search) is better served by optimizing the query itself โ better indexes, denormalized tables, materialized views.
Sometimes the answer to "the database is slow" is just "add an index" and you don't need a cache at all.
What I'm Still Working On
We're experimenting with a two-tier TTL approach for some of our higher-traffic keys. The idea is to have a "soft" TTL that's shorter than the actual Redis TTL. When a request hits a key that's past its soft TTL but still within the hard TTL, it serves the existing cached data immediately but kicks off an asynchronous background refresh. The user gets a fast response with slightly stale data, and the cache gets refreshed without blocking anyone.
It's working in our staging environment. The code is more involved than I'd like: managing the soft TTL as a separate field within the cached value, and coordinating the background refresh so only one worker does it (basically the stampede mutex again, but for proactive refreshes). I haven't deployed it to production yet because I want to get the monitoring right first. I need to be able to see how often the soft TTL triggers a refresh versus how often data is served from the hard TTL window, so I can tune the two values independently.
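To give a sense of the shape it's taking, here's a trimmed-down sketch. The payload layout (value plus a refresh_after timestamp), the lock key name, and enqueue_background_refresh, which stands in for whatever task queue you already run, are assumptions of mine rather than finished code:

import json
import time

SOFT_TTL = 300    # refresh in the background after 5 minutes
HARD_TTL = 1800   # Redis evicts the key after 30 minutes

def get_with_soft_ttl(cache_key, refresh_fn):
    cached = redis_client.get(cache_key)
    if cached:
        entry = json.loads(cached)
        if time.time() > entry["refresh_after"]:
            # Stale but still servable: trigger one background refresh.
            # The nx lock is the stampede mutex again, just for proactive refreshes.
            if redis_client.set(f"refreshing:{cache_key}", "1", nx=True, ex=30):
                enqueue_background_refresh(cache_key, refresh_fn)  # hypothetical task-queue hook
        return entry["value"]
    # Hard miss: refresh inline, exactly like plain cache-aside.
    value = refresh_fn()
    entry = {"value": value, "refresh_after": time.time() + SOFT_TTL}
    redis_client.set(cache_key, json.dumps(entry), ex=HARD_TTL)
    return value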
Caching feels like it should be simple. Store a value, retrieve a value. But every edge case (expiration timing, concurrent access, invalidation scope, memory pressure, serialization cost) adds a layer of complexity that only shows up under real traffic. Most of our caching bugs never appeared in development or staging. They needed thousands of concurrent users and months of accumulated keys to surface. That's the part that makes it interesting, and occasionally stressful.
Written by
Anurag Sinha
Developer who writes about the stuff I actually use day-to-day. If I got something wrong, let me know.