Debugging Slow PostgreSQL Queries in Production
How to track down and fix multi-second query delays when your API starts timing out.

3:47 PM on a Thursday. Slack alerts started piling up. API response times spiking, P95 latency at twelve seconds and climbing. The Grafana dashboard looked like a stock chart during a crash, except the line was supposed to stay flat and it was going vertical.
Postgres metrics panel: solid red. Active connections maxed out. Query durations in the tens of thousands of milliseconds. Every app server was waiting on the database, and the database was choking on something.
First useful action (it took me a couple of minutes of staring at dashboards before I thought of it) was checking pg_stat_statements. If this extension isn't enabled on your production Postgres instance, stop reading and go turn it on. Probably the single most useful tool for understanding what your database is spending time on.
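For reference, enabling it takes two steps, and the first needs a server restart because the module has to be preloaded. A minimal sketch, assuming superuser access:

```sql
-- Preload the module; this only takes effect after a server restart
ALTER SYSTEM SET shared_preload_libraries = 'pg_stat_statements';

-- After the restart, create the extension in the database you query from
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
```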
SELECT
    query,
    calls,
    round(mean_exec_time::numeric, 2) AS avg_latency_ms,
    round(total_exec_time::numeric, 2) AS total_time_ms
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;
Top offender: averaging 47 seconds per call. Forty-seven. And it was being called hundreds of times per minute because the frontend had a retry loop: each timeout at 30 seconds spawned another request, adding another 47-second query to the pile. The database was drowning in its own backlog of the same terrible query repeated endlessly.
The query looked harmless. A join between users and posts, counting posts per user, filtered by signup date. Tutorial-level SQL.
SELECT u.name, COUNT(p.id)
FROM users u
LEFT JOIN posts p ON p.author_id = u.id
WHERE u.created_at > '2025-01-01'
GROUP BY u.id, u.name;
Ran EXPLAIN ANALYZE on a read replica (never run diagnostic queries against your primary when it's already struggling; that's a lesson from a different bad day) and the execution plan told the story immediately.
Seq Scan on posts p  (actual time=0.012..234.567 rows=85000 loops=1)
  Filter: (author_id = u.id)
  Rows Removed by Filter: 2847000
Sequential scan on the posts table. Postgres was reading every single row, all 2.9 million of them, to satisfy the join. No index on author_id.
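A quick way to spot tables suffering from this pattern, using the standard statistics views (a sketch; it ranks tables by how many rows sequential scans have chewed through):

```sql
-- Tables doing the most sequential-scan work; a large seq_tup_read
-- next to a small idx_scan usually means a missing index
SELECT relname, seq_scan, seq_tup_read, idx_scan
FROM pg_stat_user_tables
ORDER BY seq_tup_read DESC
LIMIT 10;
```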
Embarrassing part: the posts table had existed for over a year. The author_id column had been there since day one. Nobody, not me, not any other developer, had thought to index it. The table started small. The query was fine. Data grew slowly enough that nobody noticed the degradation until one day the table crossed some threshold and the query planner's strategy flipped from something reasonable to a full sequential scan. Performance didn't degrade gradually, as far as I can tell. It fell off a cliff.
CREATE INDEX CONCURRENTLY idx_posts_author_id ON posts (author_id);
CONCURRENTLY is non-negotiable. Without it, CREATE INDEX takes a lock that blocks every write to posts until the build finishes. On a table with 2.9 million rows, that's several minutes of complete write lockout. In production. While the app is already on fire. CONCURRENTLY is slower and uses more resources, but it doesn't lock the table against writes.
Sat there refreshing pg_stat_activity every few seconds, watching the index build while Slack messages piled up asking if the site was fixed. Eleven minutes. A long eleven minutes.
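If you're on PostgreSQL 12 or later, there's a more direct view than pg_stat_activity for watching an index build. A sketch:

```sql
-- Live progress of a running CREATE INDEX (PostgreSQL 12+):
-- which phase it's in and how many blocks it has processed
SELECT relid::regclass AS table_name,
       phase,
       blocks_done,
       blocks_total
FROM pg_stat_progress_create_index;
```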
While that was building, kicked off another concurrent index for the WHERE clause:
CREATE INDEX CONCURRENTLY idx_users_created_at ON users (created_at);
Debated making this a composite index with name included (CREATE INDEX CONCURRENTLY idx_users_created_name ON users (created_at, name)) so Postgres could satisfy the query entirely from the index without hitting the heap. A covering index. Decided against it in the moment because I wasn't sure about the column order and didn't want to build the wrong index and waste another ten minutes. In retrospect, the composite index probably would have been worth it, though I'm still not sure. Triage mode isn't the time for optimization perfection, anyway.
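For what it's worth, PostgreSQL 11+ also has an INCLUDE clause that sidesteps the column-order question for columns you only need to read, not search on. A hypothetical version of that index (the index name is mine):

```sql
-- name is stored in the index leaf pages but is not part of the key,
-- so there is no column-order decision to get wrong
CREATE INDEX CONCURRENTLY idx_users_created_at_incl_name
    ON users (created_at) INCLUDE (name);
```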
The author_id index finished. Ran EXPLAIN ANALYZE again.
Index Scan using idx_posts_author_id on posts p
(actual time=0.023..0.089 rows=12 loops=4521)
Index scan. 0.089 milliseconds per loop instead of 234 milliseconds per full sequential scan. The query went from 47 seconds to about 180 milliseconds. The relief was physical.
But the alerts didn't fully clear. Latency dropped dramatically (P95 from 12 seconds to about 900 milliseconds) but was still way above normal. Something else was contributing.
Back to pg_stat_statements. Second-worst query: a paginated listing endpoint using OFFSET:
SELECT * FROM posts
ORDER BY created_at DESC
LIMIT 20 OFFSET 100000;
The problem with OFFSET that isn't obvious until you think about it: Postgres still computes and sorts all 100,000+ rows before skipping them. No magic jump-ahead ability. It builds the entire result set up to the offset point, then discards the first 100,000 rows and returns only the 20 you asked for. The higher the page number, the worse the performance. Roughly linear degradation.
Known problem, known solution: keyset pagination (cursor-based). If your API serves paginated data, this also matters for how you design your GraphQL or REST endpoints โ pagination strategy affects the client experience directly. Instead of "skip N rows," you say "give me rows after this specific value":
SELECT * FROM posts
WHERE created_at < '2025-06-15T10:30:00Z'
ORDER BY created_at DESC
LIMIT 20;
With an index on created_at, Postgres jumps directly to the right spot and reads exactly 20 rows. No scanning. No discarding. Page 1 and page 5,000 take the same time.
The downside: no random page access. Can't jump to "page 347." You can only go forward and backward. For infinite scroll, feeds, and API pagination, that's fine. For the rare case where random page access matters, you're stuck with OFFSET and the acknowledgment that deep pages will be slow.
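One subtlety worth flagging: if created_at isn't unique, a cursor on that column alone can skip or duplicate rows at page boundaries. The usual fix is a compound cursor with a unique tie-breaker, sketched here with a hypothetical (timestamp, id) pair from the last row of the previous page:

```sql
-- Row-value comparison: strictly "before" the last row already seen,
-- with id breaking ties between identical timestamps
SELECT * FROM posts
WHERE (created_at, id) < ('2025-06-15T10:30:00Z', 91842)
ORDER BY created_at DESC, id DESC
LIMIT 20;
```

With a composite index on (created_at, id), Postgres can still satisfy this with a single index scan.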
Changed the endpoint to cursor-based pagination. Second query resolved. P95 dropped to about 200 milliseconds.
Still high. Normal was under 50ms.
Looked at connection metrics more carefully. During the incident, we'd hit the max_connections limit (set to 100) and connections were queueing in the application's connection pool. Even with queries now fast, the connection bottleneck was creating latency.
Three app servers, each with a connection pool of 40. That's 120 potential connections against a limit of 100. Under normal load, that's fine; not all connections are active simultaneously. Under this burst of retried requests, every pool maxed out and connections competed.
The real fix: PgBouncer. It sits between the application and Postgres, multiplexing connections. App servers open hundreds of connections to PgBouncer; PgBouncer maintains a smaller pool of actual Postgres connections (we set it to 50) and routes transactions through them. (If you're also running Redis alongside Postgres, I wrote about caching patterns with Redis that can take read pressure off the database in the first place.)
[pgbouncer]
pool_mode = transaction
default_pool_size = 50
max_client_conn = 500
Deployed PgBouncer the next morning when things were calmer. The effect was stark. Postgres CPU, which had been hovering at 85-90% even after the query fixes, dropped to about 25%. Connection wait times disappeared.
One gotcha about PgBouncer in transaction pooling mode that bit us later: prepared statements don't work. When connections are shared between clients at the transaction level, server-side prepared statements from one client interfere with another. Our ORM used prepared statements by default. After deploying PgBouncer, a wave of "prepared statement does not exist" errors. Had to add ?prepared_statements=false to the connection string. One of those things you learn by getting hit with it.
Set up a slow query log the following week:
ALTER SYSTEM SET log_min_duration_statement = 500;
SELECT pg_reload_conf();
Logs any query taking longer than 500 milliseconds. Not a replacement for pg_stat_statements, but a real-time feed of problematic queries you can pipe into log aggregation and alert on. If we'd had this before the incident, we would have caught the degradation weeks earlier โ when the query was at 5 seconds, not 47.
Also configured auto_explain to log execution plans for slow queries automatically:
LOAD 'auto_explain';
SET auto_explain.log_min_duration = '1s';
SET auto_explain.log_analyze = true;
When a slow query appears in logs, the execution plan is right there. No need to reproduce the issue to see what the planner did. Valuable for intermittent problems where a query is usually fast but occasionally slow due to plan changes or data distribution shifts.
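One caveat: LOAD and SET as written only affect the current session. To capture plans from every backend, the settings have to live in the server configuration. A sketch, assuming superuser access (loading the module first so the custom parameters are recognized):

```sql
LOAD 'auto_explain';

-- Load auto_explain in every new session; takes effect for sessions
-- started after the reload, no restart required
ALTER SYSTEM SET session_preload_libraries = 'auto_explain';
ALTER SYSTEM SET auto_explain.log_min_duration = '1s';
ALTER SYSTEM SET auto_explain.log_analyze = true;
SELECT pg_reload_conf();
```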
The whole incident, first alert to fully stable, was about three and a half hours. Most of that was waiting for index builds and deploying the pagination change. Actual diagnosis took maybe twenty minutes. Frustrating, because all the fixes were straightforward. Missing index. Bad pagination. No connection pooler. None of it was subtle. Just stuff that hadn't been set up correctly from the beginning, and the data grew until it couldn't be ignored.
There's a pattern here. Database starts fine because the data is small. Missing indexes and bad query patterns are invisible because everything fits in memory and sequential scans are fast when the table is tiny. Growth happens gradually. Nobody notices queries getting a few milliseconds slower each month because there's no baseline alerting, no slow query log, no periodic EXPLAIN ANALYZE review. Then one day, a threshold. Table too large for the shared buffer cache. Planner switches strategies. Performance falls off a cliff all at once.
Prevention? Maybe assume every foreign key column needs an index from day one. Maybe every listing endpoint should use cursor pagination from the start. Maybe PgBouncer belongs in your stack before you think you need it.
Though even that I'm not fully sure about. Cases exist where foreign key indexes aren't worth the write overhead. Cursor pagination makes some UX patterns harder. The textbook answer: "don't prematurely optimize." The Thursday-afternoon answer: "that missing index will cost you an afternoon, a pile of Slack messages, and a heart rate you'd rather not discuss."
What Changed After the Incident
The week after the fire drill, we set up a few things to prevent a repeat.
Weekly pg_stat_statements review. Every Monday, a cron job runs a query that extracts the top 10 slowest queries by mean execution time and posts them to a Slack channel. Most weeks, everything looks fine. Occasionally something creeps up: a new feature with a poorly optimized query, or an existing query that's gotten slower as the table grew. Catching it at "this query takes 800ms" is much cheaper than catching it at "this query takes 47 seconds and the app is down."
Percentile-based latency alerts. Not just average response time; P95 and P99 alerts too. Average can look fine while a subset of users is having a terrible experience. We set the P95 alert at 500ms and the P99 at 2 seconds. The P99 alert catches the long-tail outliers that often indicate a missing index or a query plan regression affecting a specific data subset.
An EXPLAIN ANALYZE habit in code review. When a PR adds a new query or modifies an existing one, the reviewer runs EXPLAIN ANALYZE against a staging database with production-like data volumes and posts the execution plan in the PR comments. Not automated yet; it's a manual step. But making it part of the review process means queries get scrutinized before they hit production instead of after.
Connection monitoring dashboard. A simple Grafana panel showing active connections, idle connections, and waiting connections over time. The shape of that graph during an incident tells you a lot. Active connections spiking while idle connections drop to zero means the database is overloaded. Waiting connections climbing means the pool is exhausted. Idle connections climbing while everything else is stable probably means a connection leak.
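The same breakdown is available ad hoc from pg_stat_activity when the dashboard isn't in front of you. A sketch:

```sql
-- Connection count by state: 'active', 'idle', 'idle in transaction', etc.
SELECT state, count(*)
FROM pg_stat_activity
WHERE datname IS NOT NULL   -- skip background worker processes
GROUP BY state
ORDER BY count(*) DESC;
```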
On Premature Optimization vs. Preparedness
There's a tension between "don't optimize prematurely" and "set yourself up so you're not scrambling at 3 PM on a Thursday." After living through this incident, I've landed somewhere practical: baseline monitoring costs almost nothing and prevents the worst surprises.
The five-minute version of preparedness that I now set up on every new project: enable pg_stat_statements, set log_min_duration_statement to something reasonable like 1000ms, create indexes on every foreign key column by default, and use cursor-based pagination from day one on any endpoint that returns lists. None of this is premature. It's just not leaving obvious traps in the codebase.
The premature part would be adding composite indexes before you know the query patterns, partitioning tables that have 50,000 rows, or deploying PgBouncer for an app with 10 concurrent users. That stuff can wait. But the basics? Do them on day one. The cost of adding a foreign key index to a small table is negligible: a few milliseconds during writes. The cost of adding one to a 3-million-row table during an outage is eleven minutes of staring at pg_stat_activity while your phone buzzes with Slack messages. Pick your moment.
One thing I've started doing in code reviews: when someone adds a new table or a new column that references another table, I check for the corresponding index in the same migration file. If it's missing, I flag it. Not as a style nit, but as a production safety issue. Takes five seconds to check. Prevents a future incident.
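The same check can be run against the catalogs for everything already in production. A sketch that flags single-column foreign keys with no index whose leading column matches (kept to single-column FKs for brevity; conkey is 1-indexed, indkey is 0-indexed):

```sql
-- Single-column foreign keys with no supporting index
SELECT c.conrelid::regclass AS table_name,
       a.attname            AS fk_column
FROM pg_constraint c
JOIN pg_attribute a
  ON a.attrelid = c.conrelid
 AND a.attnum   = c.conkey[1]
WHERE c.contype = 'f'
  AND array_length(c.conkey, 1) = 1
  AND NOT EXISTS (
        SELECT 1
        FROM pg_index i
        WHERE i.indrelid  = c.conrelid
          AND i.indkey[0] = c.conkey[1]
      );
```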
Vacuum and Bloat
One more thing that came out of the postmortem: table bloat. Postgres uses MVCC (Multi-Version Concurrency Control) for transaction isolation, which means updates and deletes don't immediately remove old row versions. They're marked as dead tuples and left in place until the autovacuum process cleans them up.
On a busy table with lots of updates, like our posts table where view counts and comment counts get incremented frequently, dead tuples can accumulate faster than autovacuum cleans them up. The table grows physically larger than the live data it contains. Indexes grow too, because they reference all tuple versions, including dead ones. A larger table means more I/O for sequential scans. Larger indexes mean slower index lookups.
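Checking for this is straightforward with the statistics views. A sketch, worst offenders first:

```sql
-- Dead vs. live tuples per table, plus when autovacuum last ran
SELECT relname,
       n_live_tup,
       n_dead_tup,
       last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;
```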
Checked our posts table after the incident. Autovacuum was running, but the default settings were too conservative for our update rate. Adjusted the per-table autovacuum settings:
ALTER TABLE posts SET (
    autovacuum_vacuum_scale_factor = 0.05,
    autovacuum_analyze_scale_factor = 0.02
);
Default vacuum_scale_factor is 0.2, meaning autovacuum triggers when 20% of the table is dead tuples. For a 3-million-row table, that's 600,000 dead rows before vacuum kicks in. Setting it to 0.05 (5%) means vacuum runs more frequently on smaller batches of dead tuples. Less bloat accumulation. Slightly more CPU used by vacuum processes, but the tradeoff is worth it for a high-churn table.
VACUUM FULL is the nuclear option that actually reclaims disk space and shrinks the table file. But it takes an exclusive lock on the table โ no reads or writes during the operation. On a 3-million-row table, that's minutes of downtime. We've never run VACUUM FULL in production. Regular autovacuum prevents dead tuple buildup enough that the table doesn't grow uncontrollably. If it ever gets bad enough to need VACUUM FULL, we'd schedule a maintenance window.
Still Wondering About Partitioning
Still wondering whether we should have looked at partitioning the posts table. It's big and getting bigger. At some point even good indexes show their age on a table with tens of millions of rows. Range partitioning by created_at might match our access patterns; most queries only touch recent data. Haven't had time to test it properly, and I've heard mixed things about how well partition pruning performs in practice versus in theory. Something for a future Thursday. Preferably a less eventful one.
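If we ever do test it, declarative range partitioning would look roughly like this. A sketch with a hypothetical table name and a cut-down column list; migrating the existing table's data into it is the hard part this skips:

```sql
-- Parent table partitioned by month on created_at
CREATE TABLE posts_by_month (
    id         bigint      NOT NULL,
    author_id  bigint      NOT NULL,
    created_at timestamptz NOT NULL
) PARTITION BY RANGE (created_at);

-- One partition per month; queries filtering on recent created_at
-- values only touch the relevant partitions (partition pruning)
CREATE TABLE posts_2025_06 PARTITION OF posts_by_month
    FOR VALUES FROM ('2025-06-01') TO ('2025-07-01');
```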
Further Resources
- PostgreSQL Official Documentation: the complete reference for PostgreSQL configuration, query planning, indexing strategies, and performance tuning.
- pganalyze Blog: in-depth articles on PostgreSQL query optimization, EXPLAIN ANALYZE interpretation, and production database monitoring.
- Use The Index, Luke: a site dedicated to SQL indexing and tuning, explaining how indexes work internally and how to design them for your query patterns.
Written by
Anurag Sinha
Full-stack developer specializing in React, Next.js, cloud infrastructure, and AI. Writing about web development, DevOps, and the tools I actually use in production.