An Interview with an Exhausted Redis Node
I sat down with our caching server to talk about cache stampedes, missing TTLs, and the things backend developers keep getting wrong.

70% CPU on the Postgres server by mid-morning. 95% by 2 PM. Alerts firing. On-call engineer restarting connections. And the root cause was a single dashboard query running thousands of times per day to return the same 2KB of JSON that changed maybe twice an hour.
That's the kind of problem that makes caching feel urgent. And it is, until you deploy the cache and discover that caching introduces its own category of problems, each one more subtle than the last. We needed Redis. We got Redis. And then we spent the next several months learning that "just throw it in the cache" is the beginning of the work, not the end.
Cache-Aside: The Simple Version That Worked (Briefly)
We went with cache-aside because it's the first pattern everyone learns and it made sense for our situation. Application checks Redis before querying the database. Miss means query Postgres, write the result to Redis, return it. Hit means skip the database entirely.
def get_activity_summary(user_id):
    cache_key = f"activity_summary:{user_id}"
    cached = redis_client.get(cache_key)
    if cached:
        return json.loads(cached)
    summary = db.execute(ACTIVITY_SUMMARY_QUERY, user_id=user_id)
    redis_client.set(cache_key, json.dumps(summary))
    return summary
Postgres CPU dropped to 30% within an hour of deployment. Dashboard loaded in 12ms instead of 180ms for cache hits. Everyone was happy.
For about three days.
Then Redis memory started climbing. It had been sitting at around 400MB for session storage. After the activity cache deployed, it crossed 1.2GB within a week and kept going. We'd allocated 2GB and were on track to hit the ceiling.
The problem is visible in that code if you look. No TTL. Keys were being written and never expiring. Every user who ever logged in got an activity_summary:{user_id} key that lived forever. Users who hadn't been active in months still had cached summaries occupying memory. Users who'd deleted their accounts. Keys for ghosts.
One argument to the set call fixed it:
redis_client.set(cache_key, json.dumps(summary), ex=1800) # 30 min TTL
Thirty minutes. Data changed infrequently enough that a slightly stale summary was acceptable, and 30 minutes meant each key would trigger a database query twice an hour instead of dozens of times.
Memory stabilized within a day. The fix sounds obvious in retrospect. That's the pattern with caching mistakes: they always sound obvious after the fact. During the rush to solve a production performance problem, "just throw it in Redis" can very easily skip past the part where you think about what happens to those keys over weeks and months.
TTL Selection Is a Business Decision Wearing a Technical Costume
There's no formula for the right TTL. I've seen recommendations to start with 5 minutes as a default, but that's arbitrary, and wrong in both directions depending on what you're caching.
Activity summaries: 30 minutes worked because tolerance for staleness was high. Nobody cares whether the dashboard shows 847 events or 849 for the next half hour. But we also cached product inventory counts on a different page. Thirty minutes for inventory would have been catastrophic: a product could sell out and the page would still say "In Stock" for potentially half an hour. We used a 60-second TTL for inventory. More database hits. Acceptable freshness.
The tricky part: TTL decisions are really business decisions. You're choosing how stale data is allowed to be, and that depends on what the data means to the person looking at it. A profile photo URL can be cached for hours without consequence. A bank account balance probably shouldn't be cached at all, or if it is, with an extremely short window and aggressive invalidation on writes.
One thing that proved useful: logging cache hit rates per key prefix. Simple counter tracking hits versus misses over 5-minute windows. Told us which TTLs were too short (low hit rate, wasted effort) and which were too long (near 100% hit rate, meaning data was almost never refreshed). Adjusted several TTLs based on that data over the following weeks and found a much better balance than our initial guesses provided.
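As a sketch, that per-prefix counter can be very small. The class and method names here are my own invention; the real system flushed these counters to monitoring in 5-minute windows rather than keeping them in process memory:

```python
from collections import defaultdict

class CacheStats:
    """Track cache hits and misses per key prefix so TTLs can be
    tuned from real data instead of guesses."""

    def __init__(self):
        self.hits = defaultdict(int)
        self.misses = defaultdict(int)

    def record(self, cache_key, hit):
        # Key prefix is everything before the first colon,
        # e.g. "activity_summary:42" -> "activity_summary"
        prefix = cache_key.split(":", 1)[0]
        if hit:
            self.hits[prefix] += 1
        else:
            self.misses[prefix] += 1

    def hit_rate(self, prefix):
        total = self.hits[prefix] + self.misses[prefix]
        return self.hits[prefix] / total if total else 0.0
```

Wrapping every `redis_client.get` call to feed `record()` is a one-line change in the cache-aside helper.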
The Black Friday Stampede
This one was scary.
We run sales events a few times a year. During the biggest one, a promotional pricing page was backed by a single Redis key, promo:pricing:main, serving a blob of JSON with all the sale prices. TTL of 5 minutes. Hundreds of thousands of page views per hour during the event.
At some point during peak traffic, that key expired. Normal. Expected. But in the milliseconds between the key expiring and the cache being repopulated, roughly 3,000 requests arrived, all saw a cache miss, and all simultaneously fired the same database query.
Postgres connection pool maxed instantly. Queries started stacking up. Response times jumped from 15ms to 8 seconds. Timeouts cascaded. Monitoring dashboard went solid red.
Cache stampede. Sometimes called a thundering herd. Happens when a popular key expires and many concurrent requests all try to repopulate it at the same time. The irony: every single one of those requests was doing the right thing according to the cache-aside pattern. Check cache. Miss. Query database. Write result. They were all following the protocol correctly. The protocol just doesn't account for thousands of processes following it at the same instant.
The Mutex Lock Approach
Fix: a distributed lock using Redis itself. When a request sees a cache miss, it tries to acquire a short-lived lock before querying the database:
def get_promo_pricing():
    cache_key = "promo:pricing:main"
    lock_key = "lock:promo:pricing:main"
    cached = redis_client.get(cache_key)
    if cached:
        return json.loads(cached)
    # Try to acquire the lock
    acquired = redis_client.set(lock_key, "1", nx=True, ex=3)
    if acquired:
        # This request is responsible for refreshing
        pricing = db.execute(PROMO_PRICING_QUERY)
        redis_client.set(cache_key, json.dumps(pricing), ex=300)
        redis_client.delete(lock_key)
        return pricing
    else:
        # Another request is already refreshing; wait and retry
        time.sleep(0.1)
        cached = redis_client.get(cache_key)
        if cached:
            return json.loads(cached)
        # If still no data, fall through to the database
        return db.execute(PROMO_PRICING_QUERY)
nx=True means "set only if the key doesn't exist." One request wins the race. That request refreshes the cache. Everyone else waits briefly and retries; by then the cache is warm again.
ex=3 on the lock is a safety net. If the winning request crashes or hangs, the lock auto-expires after 3 seconds. No permanent lock blocking refreshes indefinitely.
Simplified version here. Production code has a retry loop with exponential backoff and a maximum attempt count. Also a fallback that serves slightly stale data when available: keeping the old value around briefly after TTL expiry in a separate key so there's something to serve during the repopulation window. But the core idea is the mutex. One refresher. Everyone else waits.
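The stale-fallback idea can be sketched like this. Everything here is illustrative rather than our production code: the `StubRedis` class exists only so the example runs without a server, and the `:stale` suffix and `STALE_GRACE` value are assumptions:

```python
import json
import time

class StubRedis:
    """Minimal in-memory stand-in for a Redis client (illustration only)."""

    def __init__(self):
        self.store = {}

    def set(self, key, value, ex=None, nx=False):
        if nx and key in self.store:
            return None
        self.store[key] = (value, time.time() + ex if ex else None)
        return True

    def get(self, key):
        item = self.store.get(key)
        if item is None:
            return None
        value, expires = item
        if expires is not None and time.time() >= expires:
            del self.store[key]
            return None
        return value

    def delete(self, key):
        self.store.pop(key, None)

STALE_GRACE = 60  # extra seconds the stale copy outlives the primary (assumed)

def set_with_stale_copy(client, key, value, ttl):
    payload = json.dumps(value)
    client.set(key, payload, ex=ttl)                           # primary copy
    client.set(key + ":stale", payload, ex=ttl + STALE_GRACE)  # fallback copy

def get_with_stale_fallback(client, key):
    """Return (value, served_stale). Serving the stale copy buys time
    while whoever holds the mutex repopulates the primary key."""
    cached = client.get(key)
    if cached is not None:
        return json.loads(cached), False
    stale = client.get(key + ":stale")
    if stale is not None:
        return json.loads(stale), True
    return None, False
```

The cost is double the writes and roughly double the memory for these keys, which for a handful of hot keys is cheap insurance.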
Next sale event after deploying this handled 4x the traffic with zero stampede incidents. Postgres barely registered the load increase.
Invalidation: Delete the Key, Don't Update It
Two hard problems in computer science: cache invalidation, naming things, and off-by-one errors. After a year of running caching in production, invalidation deserves every bit of its reputation.
Two patterns existed in different parts of the codebase. Some endpoints used write-through: update the database, then also update the corresponding Redis key with the new value. Other endpoints used invalidation: on a write, just delete the Redis key and let the next read repopulate it.
Write-through felt cleaner initially. Cache always has fresh data. No miss penalty after an update. But a subtle bug took two weeks to track down.
The sequence: user updates their profile, app writes to Postgres, then writes to Redis. Occasionally, maybe once in a thousand requests, the Redis write failed silently. Network blip, brief timeout, something transient. Database had the new data. Redis had the old data. And because the TTL was 30 minutes, the user would see their old profile for up to half an hour after updating it. Refresh. Still old. Try updating again. Frustrating loop.
Switched everything to invalidation. On write, just delete the key. redis_client.delete(cache_key). If the delete fails, worst case is old data gets served until the TTL expires naturally. If it succeeds, next read triggers a cache miss, hits the database, gets fresh data. Small performance penalty: that next read is slower. But correctness beats speed, every time.
def update_user_profile(user_id, new_data):
    db.execute(UPDATE_PROFILE_QUERY, user_id=user_id, data=new_data)
    # Don't try to update the cache; just kill it
    redis_client.delete(f"user_profile:{user_id}")
    redis_client.delete(f"activity_summary:{user_id}")
Two keys deleted there. Profile changes can affect the activity summary display (username changes, avatar updates). Invalidation gets tedious this way: you need to know all the cache keys that might be affected by a given write. Miss one and you get stale data in a view that nobody thinks to check.
We started maintaining a mapping in the codebase: a dictionary listing, for each database table, all the cache key prefixes that should be invalidated when that table gets written to. Manual. Tedious. Has to be updated every time someone adds a new cached query. But it works, and it's caught stale data issues multiple times.
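A minimal version of that mapping might look like the following. The table names, prefixes, and helper are hypothetical, not our actual schema:

```python
# Hypothetical mapping from database table to the cache key prefixes
# that a write to that table can make stale. Maintained by hand and
# updated whenever someone adds a new cached query.
INVALIDATION_MAP = {
    "users": ["user_profile", "activity_summary"],
    "products": ["product_inventory"],
}

def keys_to_invalidate(table, entity_id):
    """Build the list of cache keys to delete after writing to `table`."""
    prefixes = INVALIDATION_MAP.get(table, [])
    return [f"{prefix}:{entity_id}" for prefix in prefixes]
```

Write paths then loop over `keys_to_invalidate(...)` and call `redis_client.delete` on each, so the knowledge lives in one place instead of being scattered across endpoints.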
The Serialization Tax Nobody Warned Me About
We were caching large JSON objects; some cached responses ran 15-20KB of serialized JSON. Every cache hit meant deserializing that JSON back into a Python dictionary. One request: negligible. 2,000 requests per minute: the cumulative CPU time on json.loads() started showing up in profiling.
Switched to MessagePack for the largest cached objects. Faster serialization, smaller payloads. 18KB JSON blobs became 11KB MessagePack blobs. Deserialization time dropped roughly 40%. Not a game changer by itself, but it shaved a few percentage points off application server CPU during peak hours. Headroom we needed.
Smaller cached values (anything under a few KB) stayed as JSON. The performance difference was negligible and JSON is debuggable. You can redis-cli GET a key and actually read what's stored. MessagePack values are binary and opaque without tooling.
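A size-based switch between the two formats might look like this sketch. The 4KB threshold is an assumption to tune for your workload, and `msgpack` is the third-party msgpack-python package (the code falls back to JSON if it isn't installed):

```python
import json

try:
    import msgpack  # third-party: pip install msgpack
except ImportError:
    msgpack = None

SIZE_THRESHOLD = 4096  # bytes; assumed cutoff between JSON and MessagePack

def serialize(value):
    """Return (payload_bytes, format_tag). Large values use MessagePack
    when available; small values stay as JSON for debuggability."""
    as_json = json.dumps(value).encode()
    if msgpack is not None and len(as_json) > SIZE_THRESHOLD:
        return msgpack.packb(value), "msgpack"
    return as_json, "json"

def deserialize(payload, fmt):
    if fmt == "msgpack":
        return msgpack.unpackb(payload)
    return json.loads(payload.decode())
```

The format tag has to be stored alongside the value (a key suffix or a leading byte both work) so reads know which decoder to use.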
Monitoring What Redis Actually Does
First few months, Redis was treated as a black box. Data goes in. Data comes out. We monitored memory and connection count. Bare minimum.
Then we started feeding INFO stats into our monitoring system, and three numbers turned out to be the ones worth watching:
Hit rate: keyspace_hits / (keyspace_hits + keyspace_misses). The single most important cache metric. Hit rate at 60% means your cache is barely contributing. Ours ran at 94% for most key prefixes. Inventory keys at 72%, expected given the short TTL. Acceptable tradeoff.
Evicted keys: climbing evictions mean the cache is running out of memory and dropping keys before their TTL expires. Your effective TTL is shorter than configured, hit rate is lower than it should be. We hit this during a scaling event and had to bump the instance size.
Connected clients: sudden spikes usually mean something's wrong in the application layer. Connection leaks, retry storms, a deploy that's creating too many connections. Alert set at 80% of maxclients caught a connection leak in a background worker before it caused an outage.
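Computing the hit rate from that INFO output is a one-liner worth wiring into monitoring. This helper assumes the dict shape that redis-py returns from `redis_client.info("stats")`, which includes keyspace_hits and keyspace_misses:

```python
def hit_rate(stats):
    """Overall cache hit rate from a Redis INFO stats dict,
    e.g. the result of redis_client.info("stats")."""
    hits = stats.get("keyspace_hits", 0)
    misses = stats.get("keyspace_misses", 0)
    total = hits + misses
    return hits / total if total else 0.0
```

Note these counters are cumulative since server start, so for a meaningful trend you want the rate over deltas between samples, not the lifetime ratio.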
Also set up SLOWLOG to catch commands taking longer than 10ms. This flagged a KEYS * command that someone had put in a debug endpoint and forgotten about. KEYS scans every key in the database and blocks Redis while it does it. On a production instance with hundreds of thousands of keys, that command was taking 200ms and blocking all other operations during the scan. Replaced with SCAN, which does the same thing incrementally without the blocking.
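The non-blocking replacement follows the SCAN cursor protocol: call SCAN repeatedly, passing back the cursor it returns, until the cursor comes back as 0. This sketch runs against a tiny in-memory stub so it's self-contained; with redis-py you'd more likely just use `scan_iter` directly:

```python
import fnmatch

class StubRedis:
    """Tiny in-memory stand-in so the loop below runs without a server."""

    def __init__(self, keys):
        self.keys_set = set(keys)

    def scan(self, cursor=0, match="*", count=100):
        # A real server returns keys in batches with an opaque cursor;
        # returning everything at once with cursor 0 is legal behavior.
        matched = [k for k in self.keys_set if fnmatch.fnmatch(k, match)]
        return 0, matched

def find_keys(client, pattern):
    """Collect keys matching `pattern` via SCAN, batch by batch,
    instead of blocking the server with a single KEYS call."""
    cursor = 0
    found = []
    while True:
        cursor, batch = client.scan(cursor=cursor, match=pattern, count=500)
        found.extend(batch)
        if cursor == 0:
            return found
```

One caveat the docs are upfront about: SCAN offers weaker guarantees than KEYS, and keys written or expired mid-scan may or may not appear, which is fine for debugging tooling.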
Not Everything Belongs in the Cache
Easy to develop a mindset where the answer to every performance problem is "put it in Redis." After the initial success with activity summaries, there was pressure to cache more and more.
Cached user notification counts. Reasonable: high read frequency, low write frequency. Cached search results. Less reasonable: search queries have high cardinality and low repeat rates. Hit rate was 8%. We were filling Redis memory with results that nobody would ever request again. Removed that cache after a week.
The heuristic I use now: cache data that is read frequently, written infrequently, and where some staleness is acceptable. If any of those three conditions isn't met, caching might be the wrong tool. High-cardinality data with unique access patterns is better served by optimizing the underlying query: better indexes, denormalized tables, materialized views.
Sometimes "the database is slow" just means "add an index." I wrote about exactly that kind of situation in my post on debugging slow PostgreSQL queries; a missing index was causing 47-second queries that no amount of caching would have truly addressed at the root.
When Redis Itself Goes Down
One scenario we didn't plan for early enough: what happens when Redis is unavailable? Network partition, instance restart, whatever the cause. If your code treats a cache miss as "go to the database," a Redis outage effectively becomes a cache miss on every single request simultaneously. That's a stampede, but one where the mutex approach can't save you because the lock mechanism lives in the same Redis instance that's down.
We added a circuit breaker around Redis calls. If Redis fails three times in a row within a 10-second window, the breaker trips and the application stops trying to reach Redis for the next 30 seconds. During that window, all requests go straight to the database. Postgres handles it because the circuit breaker also triggers a reduced-functionality mode: certain non-critical features get temporarily disabled to keep database load manageable. After the cooldown, the breaker half-opens and sends a test request. If Redis responds, traffic flows back through the cache.
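A minimal version of that breaker might look like this. The class name and the injectable clock are my own; the thresholds match the numbers above:

```python
import time

class RedisCircuitBreaker:
    """Trip after `threshold` failures inside `window` seconds, then
    skip Redis entirely for `cooldown` seconds before half-opening."""

    def __init__(self, threshold=3, window=10.0, cooldown=30.0,
                 clock=time.monotonic):
        self.threshold = threshold
        self.window = window
        self.cooldown = cooldown
        self.clock = clock          # injectable for testing
        self.failures = []          # timestamps of recent failures
        self.opened_at = None       # when the breaker tripped

    def allow_request(self):
        if self.opened_at is None:
            return True
        if self.clock() - self.opened_at >= self.cooldown:
            # Half-open: let one request through to probe Redis.
            self.opened_at = None
            self.failures = []
            return True
        return False

    def record_failure(self):
        now = self.clock()
        self.failures = [t for t in self.failures if now - t <= self.window]
        self.failures.append(now)
        if len(self.failures) >= self.threshold:
            self.opened_at = now

    def record_success(self):
        self.failures = []
        self.opened_at = None
```

Call `allow_request()` before each Redis operation; on an exception, call `record_failure()` and fall through to the database, and on success call `record_success()`.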
Could we have avoided this? Probably not entirely. Any system with a caching layer needs a plan for when that layer disappears. The important thing is having thought about it before the 3 AM alert.
What's Still in Progress
Experimenting with a two-tier TTL for higher-traffic keys. A "soft" TTL shorter than the actual Redis TTL. When a request hits a key past its soft TTL but still within the hard TTL, it serves the existing data immediately but kicks off an asynchronous background refresh. User gets a fast response with slightly stale data. Cache gets refreshed without blocking anyone.
Working in staging. Code is more involved than I'd like: managing the soft TTL as a separate field within the cached value, coordinating the background refresh so only one worker handles it (the stampede mutex again, but for proactive refreshes). Haven't deployed to production yet because the monitoring needs to be right first. Need to see how often the soft TTL triggers a refresh versus how often data gets served from the hard TTL window, so the two values can be tuned independently.
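The envelope format can be sketched as a pair of helpers. The field names and TTL values here are illustrative, not the staging code:

```python
import json
import time

SOFT_TTL = 60    # seconds before a background refresh should kick in (assumed)
HARD_TTL = 300   # the actual Redis TTL on the key (assumed)

def wrap(value, now=None):
    """Build the cached envelope: the data plus its soft-expiry timestamp.
    The result is what gets written to Redis with ex=HARD_TTL."""
    now = time.time() if now is None else now
    return json.dumps({"data": value, "soft_expires": now + SOFT_TTL})

def unwrap(payload, now=None):
    """Return (value, needs_refresh). needs_refresh turns True once the
    soft TTL passes, even though Redis still holds the key."""
    now = time.time() if now is None else now
    envelope = json.loads(payload)
    return envelope["data"], now >= envelope["soft_expires"]
```

When `needs_refresh` comes back True, the request enqueues an async refresh (guarded by the same nx lock as the stampede fix) and returns the stale value immediately.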
If you're running Redis alongside a web app during development, my post on Docker Compose for development covers how we set up the local Redis instance alongside Postgres and the API.
Caching looks simple from the outside. Store a value. Retrieve a value. But every edge case (expiration timing, concurrent access, invalidation scope, memory pressure, serialization overhead) adds a layer of complexity that only surfaces under real traffic. Most of our caching bugs never showed up in development or staging. They needed thousands of concurrent users and months of accumulated keys to manifest. That's what makes this domain interesting. Also, occasionally, what makes Thursday afternoons stressful.
Further Resources
- Redis Documentation: The official reference covering data types, commands, persistence, clustering, and caching patterns.
- Redis University: Free courses on Redis fundamentals, data structures, and advanced patterns for caching and real-time applications.
- AWS ElastiCache Best Practices: Production guidance on Redis caching strategies, connection management, and scaling from AWS.
Written by
Anurag Sinha
Full-stack developer specializing in React, Next.js, cloud infrastructure, and AI. Writing about web development, DevOps, and the tools I actually use in production.