In Part 1 we discussed how to build the Postgres cluster and the supplementary infrastructure to launch the project in production. Let's assume we were lucky with the startup and it began to grow. In my experience, at this point the teething problems and silly bugs are solved, the database holds more than 350 GB of data, there are a couple of tables of 50-80 GB or more, QPS (Queries Per Second) reaches 3 000 - 20 000 at peak times, and so on. The number of users/clients grows by 2x-10x per year or more, and growth and workloads become more or less predictable and follow our marketing activities. Now we face other, more complicated challenges.
Solving the issues
Now our app starts to crash at peak time: we see 5xx responses - HTTP timeouts and server errors - or 499s caused by client timeouts. The possible reasons show up in different ways. We start digging through the logs with the why-why-why method, and on the next “why” we find that in most cases the root cause is increased Postgres latency.
The “why” chain to the root cause can be indirect. For example, you see in the logs that the API throws timeouts because it cannot obtain a database connection from its own pool. Why? Because all connections are busy serving earlier requests, and the older requests are waiting in the queue. Why is that? Because average SQL query times have become much longer than before. A similar indirect case: the HTTP worker pool is busy, HTTP requests wait in your application queue until the client times out, and you get 499s in the log. In the simplest case, you see the API endpoint with a long-running query, which leads to a 502 timeout or a 503 server error. All of these cases produce different visual effects on the application, logging, monitoring, and alerting behavior, but all of them are caused by increased Postgres response times. The bad news, proven by my own experience, is that increasing the HTTP or database connection pool sizes in the application config will not help and can make things worse. Adding CPU/memory/IOPS to Postgres will not help either. Why?...
But how to diagnose and fix issues? First of all, you need a set of monitoring tools, as described in Part 1.
In the easiest case, when the situation is clear, you can extract the timed-out query from the logs or from the monitoring system and go debug it.
In harder cases, I start the diagnosis process like this.
1. Dashboards - the starting point
Check the Postgres dashboard in Grafana for CPU load and QPS - they should be higher than usual, assuming we have no other (network or...) issues.
2. Digging logs
Ensure that slow query logging is enabled in Postgres:
log_min_duration_statement: "100ms"
I recommend starting with a Loki query that returns queries running for more than 1 second, to filter out faster queries and focus on the heaviest ones:
{namespace="production-db",pod="postgres-0"} |~ "duration: \\d{4,}"
For all long-running queries:
{namespace="production-db",pod="postgres-0"} |~ "duration: "
3. Blocking and blocked queries
If you get a lot of queries - usually COMMIT, UPDATE …, or SELECT … FOR UPDATE - running for more than 30 seconds, it is a symptom of locking. In that case, let’s check the blocked queries in the psql console.
-- blocked queries
select pid, usename, pg_blocking_pids(pid) as blocked_by,
query as blocked_query
from pg_stat_activity
where cardinality(pg_blocking_pids(pid)) > 0;
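To also see what the blocking sessions are doing, you can extend this with a self-join of pg_stat_activity. This is my own sketch, not a query from the original toolbox, but it uses only standard columns and functions:
-- blocked queries together with the sessions that block them
select blocked.pid    as blocked_pid,
       blocked.query  as blocked_query,
       blocking.pid   as blocking_pid,
       blocking.state as blocking_state,
       blocking.query as blocking_query
from pg_stat_activity blocked
join pg_stat_activity blocking
  on blocking.pid = any(pg_blocking_pids(blocked.pid))
where cardinality(pg_blocking_pids(blocked.pid)) > 0;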
4. Resolving locks
If you find locks, resolve them by killing the blocking queries, and then check whether both the blocking and the blocked queries are optimal.
SELECT pg_cancel_backend(pid);    -- graceful cancel of the running query
SELECT pg_terminate_backend(pid); -- forced backend termination
Use these with caution. In general, it is safe to kill blocked queries because they are waiting for permission to execute and are doing nothing at the moment of termination. If the query is inside a transaction block, it is also safe: the transaction will simply be rolled back.
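If many sessions are blocked at once, you can cancel all of the current blockers in one go. This is a sketch of mine, not a command from the article - review the pids before running it on production:
-- cancel every session that currently blocks someone else
select pid, pg_cancel_backend(pid)
from (
  select distinct unnest(pg_blocking_pids(pid)) as pid
  from pg_stat_activity
  where cardinality(pg_blocking_pids(pid)) > 0
) blockers;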
To analyze a SELECT … FOR UPDATE, run EXPLAIN ANALYZE SELECT … without the FOR UPDATE clause, and ensure that the query uses indexes and filters without sequential scans. If it runs with seqscans, create additional indexes.
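For example (the table and filter here are hypothetical), take the query text from the log, drop the FOR UPDATE clause, and check the plan:
-- original query: SELECT * FROM wallet_transaction WHERE wallet_id = 42 FOR UPDATE;
EXPLAIN ANALYZE
SELECT * FROM wallet_transaction WHERE wallet_id = 42;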
How to analyze the UPDATE against production data?
It is scary to run an UPDATE against a production database, but what do you do if you fall into an incident and have to fix it ASAP? In some cases, there is a safe way to debug UPDATE or INSERT queries by using transactions. Be careful: if the UPDATE runs on a huge table and changes a lot of records, it can run for hours and keep the affected rows locked for writes for a long time, because it performs the UPDATE as-is and only then rolls it back.
-- easy way if we're lucky
BEGIN;
EXPLAIN ANALYZE UPDATE …;
ROLLBACK;
If we have a huge table and our UPDATE changes a lot of records, we can instead analyze a SELECT with the same WHERE clause.
-- safest way if the table is huge and the UPDATE changes a lot of rows
EXPLAIN ANALYZE SELECT * FROM table WHERE …; -- the same conditions as in the UPDATE query
If EXPLAIN reports that the query runs with seqscans - create additional indexes.
If the queries are already good, but you are still having issues - congrats, you and your developers must fix the application logic to avoid locking. There are three major ways to go: change the application logic to use an append-only approach, make such update requests asynchronous, or decrease the overall QPS to let the Postgres instance focus on the activities that cause the locks. The third way may be cheaper at the moment of the incident, but it will not solve the root cause of the locking. I want to write a separate article on these approaches and how they help to survive high loads; a small sketch of the append-only idea follows below. The good news is that resolving the locks manually often brings your application back up until the next deadlock, so your team has some time to develop and apply a hotfix.
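To give a taste of the append-only approach before that article: instead of many transactions fighting over one hot row with UPDATE, each change is inserted as a new record and the current value is computed on read. The table and column names below are hypothetical:
-- hot-row version: every transaction locks the same row
-- UPDATE wallet SET balance = balance - 10 WHERE wallet_id = 42;

-- append-only version: writers never block each other
INSERT INTO wallet_balance_change (wallet_id, amount, created_at)
VALUES (42, -10.00, NOW());

-- the current balance is aggregated on read (or from a periodic rollup)
SELECT SUM(amount) AS balance
FROM wallet_balance_change
WHERE wallet_id = 42;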
5. Temporary files
If there are no locks, and/or most long-running queries stay below 30 seconds while CPU usage is high, these are symptoms of non-optimal queries, non-optimal data structures, or heavy temporary file usage. We need to dig deeper. Ensure that temp file logging is enabled:
log_temp_files: 0
There are two ways to check whether queries use temporary files. The first is Loki queries:
{app="postgres"} |= "temporary file"
{namespace="production-db", pod=~"postgres.*"} |= "temporary"
The second is to check temp_blks_read and temp_blks_written in pg_stat_statements:
SELECT * FROM pg_stat_statements WHERE
temp_blks_read > 0 OR temp_blks_written > 0;
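To go straight to the worst offenders, the same view can be sorted by temporary block writes (a sketch of mine, assuming the pg_stat_statements extension is enabled):
-- top 20 statements by temporary file writes
SELECT query,
       calls,
       temp_blks_read,
       temp_blks_written
FROM pg_stat_statements
WHERE temp_blks_read > 0 OR temp_blks_written > 0
ORDER BY temp_blks_written DESC
LIMIT 20;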
Additionally, you can check in Grafana, or on the Postgres pod directly, whether it has high I/O and high iowait. If you use remote storage like an AWS Elastic Block Store volume, it has limited IOPS; the absolute IOPS numbers can be relatively small, yet the Postgres container will show high CPU iowait!
If high temporary file usage is confirmed, increase the work_mem and maintenance_work_mem Postgres parameters and the memory requests/limits in the Kubernetes manifest. Be careful: changing the memory requests/limits restarts the Postgres pod. If it is unclear from the logs or from the temp_blks_written analysis how much work_mem to add, I recommend increasing it in big steps, like 2x from the current value.
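A minimal sketch of such a change on a plain instance could look like the following (the values are illustrative - start from 2x your current settings); with the Zalando operator you would put the same parameters into the cluster manifest instead:
-- illustrative values, assuming the current work_mem is 32MB
ALTER SYSTEM SET work_mem = '64MB';
ALTER SYSTEM SET maintenance_work_mem = '512MB';
SELECT pg_reload_conf();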
6. Nonoptimal queries and data.
The most common cause of long-running SELECT queries is the sequential scan, when Postgres applies filters or joins by scanning the entire table. Imagine how long that takes if your table holds 100+ GB of data. The query can take minutes with significant I/O consumption. How do we detect it?
The symptoms are, together or separately:
- high IOPS in Grafana or in iotop output, high iowait,
- the query appears in the Postgres log with a duration of more than 100 ms (we enabled slow query logging in Part 1),
- the query shows up with high time consumption in Grafana query analysis,
- 502/503 errors in your app API calls.
Analyzing the query:
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM wallet_transaction WHERE currency_id = 10;
Seq Scan on wallet_transaction (cost=0.00..25000000.00 rows=10000 width=100) (actual time=0.100..7500.500 rows=10000 loops=1)
Filter: (currency_id = 10)
  Rows Removed by Filter: 5000000
  Buffers: shared hit=1571 read=60955 dirtied=739
  I/O Timings: shared read=6878.111
Planning Time: 0.200 ms
Execution Time: 7502.300 ms
It is a very primitive query, far from real life, but it shows the problems of sequential scans:
- unpredictable execution cost (cost from zero to high numbers, not matching with actual running time),
- high buffer usage,
- high I/O consumption.
It means that we cannot predict the actual load generated by this query; cache and memory usage is high and can be ineffective, and we can crowd out the cache, which is limited, from other queries. We also abuse the disk I/O, which is critical and limited in clouds and their managed database solutions. High disk I/O also leads to high CPU iowait, which simply steals CPU from other tasks. Happily, in most cases the solution is simple - create the right index for your table. With an index, Postgres performs an index scan, and you can save from 100x to even 10 000x of I/O and buffers. In general, to prevent or reduce such issues, set a rule for your project to run EXPLAIN for all SELECT queries at the development stage, and keep this problem in mind when designing the database and table structure. In my experience, a sequential scan is acceptable and even more efficient for small tables of up to a few thousand rows. For bigger tables, indexes and index scans are necessary.
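For the example query above, the fix is a single-column index (the index name is mine; CONCURRENTLY avoids locking the table for writes while the index is being built):
CREATE INDEX CONCURRENTLY idx_wallet_transaction_currency_id
    ON wallet_transaction (currency_id);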
In some cases, when you have complex filters, the Postgres planner performs sequential scans even if you have proper indexes. In those cases, consider creating indexes with more complex expressions, ordering, and other parameters matching the WHERE expressions of your SELECT query, or create composite indexes over 2-3 columns if that better matches the WHERE clause. Also check the random_page_cost parameter and consider lowering it; for SSDs it should be between 1.1 and 1.5.
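A sketch of both ideas, assuming a WHERE clause that filters by currency_id and a created_at range as in the examples further below (the index name is mine):
-- composite index matching a WHERE currency_id = ... AND created_at BETWEEN ... filter
CREATE INDEX CONCURRENTLY idx_wallet_transaction_currency_created
    ON wallet_transaction (currency_id, created_at);

-- tell the planner that random reads on SSD are cheap
ALTER SYSTEM SET random_page_cost = 1.1;
SELECT pg_reload_conf();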
Preventive actions
Earlier we discussed a lot of pain, but the good news is that most of it can be avoided. We can identify and fix problematic queries peacefully, during working hours, with proper testing, and without impact on clients. In my routine, I set up a recurring calendar event on the second Monday of each month to reserve time for Postgres analysis. Once a month I look at the dashboards, dig through the logs, and play with the Postgres console to find, identify, and analyze potential problems, queries, and application behavior. The result is a set of tickets for our developers and for my team. For complex cases I prepare the material to discuss the potential issues and prevention actions with the team and with the boss.
Depending on the database change rate and the release cycle of your app, you can schedule the Postgres analysis sessions every 2 or 4 weeks.
It is also very helpful to adopt the EXPLAIN rule in your organization: whenever your backend developers add or modify SELECT, SELECT … FOR UPDATE, or UPDATE queries, they must check them with EXPLAIN ANALYZE and ensure that all of these queries run efficiently.
With these simple steps, in a couple of months you will optimize your database and prevent a lot of issues.
You did a lot of optimizations, but it did not help
You did a lot of optimizations, increased IOPS, added CPU and memory resources, and fixed non-optimal queries, but it did not help. Maybe it is time to switch application reads to read replicas? There are a couple of problems with that, but it surely helps a lot to keep the CPU, memory, and IOPS of the cluster primary focused on processing transactions and move all read workload and analytics to the replicas. The pain points of read replicas:
- a read replica needs near-real-time data synchronization with the primary,
- each read replica requires the same resources as the primary; a cluster with a primary and 2 read replicas takes 3x CPU, 3x RAM, and 3x disk space,
- your application has to support reads from another database, so you need application reconfiguration or code changes to use read replicas.
IMHO the ideal architecture with read replicas is 1 primary and 2 replicas with a small max_standby_archive_delay - I use 15 to 30 seconds for production - plus 1 replica outside of the cluster with a huge max_standby_archive_delay of up to 30-90 minutes.
Why is that?
For application reads, and to prevent data loss, we need a replica with data as fresh as possible, and a small max_standby_archive_delay helps us achieve that. When applying WAL conflicts with running queries, the replica will cancel those queries after this delay and catch up with the primary.
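In the same config style as above, this could look like the following for the in-cluster replicas. Setting the streaming counterpart, max_standby_streaming_delay, to the same value is my own addition, since most replicas receive WAL over streaming replication rather than from the archive:
max_standby_archive_delay: "30s"
max_standby_streaming_delay: "30s"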
For high availability, we need two replicas behind the pooler so the application can keep reading during any maintenance or reconfiguration.
We also know that we have analytics queries returning huge datasets, which can prevent WAL from being applied on the replica and lead to a big replication lag. For those cases we need a read replica with a high max_standby_archive_delay. Usually such a workload does not require high availability, so we can keep only one read replica tolerant to a big replication lag and save some resources.
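For the analytics replica, the corresponding settings (a sketch, using a value from the 30-90 minute range mentioned above):
max_standby_archive_delay: "60min"
max_standby_streaming_delay: "60min"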
With read replicas, you also have to monitor the replication lag carefully. An example query to get the current replica lag in seconds:
SELECT EXTRACT(EPOCH FROM NOW() -
pg_last_xact_replay_timestamp()) AS replication_lag_seconds;
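A complementary view from the primary side shows the lag per connected replica; this is a standard query over the pg_stat_replication view (the lag columns are available since Postgres 10):
SELECT application_name,
       client_addr,
       state,
       write_lag,
       flush_lag,
       replay_lag
FROM pg_stat_replication;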
Here is the link to the Postgres manifest to use with the Zalando operator - 1 primary, 2 replicas with increased resources.
Here is the link to the Postgres manifest to use with the Zalando operator - single replica cluster for analytics and high max_standby_archive_delay.
Debugging long-running queries on read replicas.
Queries returning huge result sets can cause problems on read replicas. Imagine a query with no or minimal filters running against a huge 100 GB+ table. The problem is not only the query itself and its CPU or I/O usage; it also takes a long time for the application to retrieve the data. On the server it abuses the buffers, and on a read replica it causes locks and delays in applying WAL. You will also see this query in the Postgres logs as long-running. As I said before, when the lag reaches max_standby_archive_delay, the read replica will cancel all conflicting queries until it catches up with the primary, with the error ERROR: canceling statement due to conflict with recovery. To prevent such behavior and protect your read replicas, all SELECTs returning a lot of data must be paged and must have ordering and limits. An example of a bad query:
SELECT * FROM wallet_transaction
WHERE currency_id = 10;
Of course, you have to check for an index on currency_id and create it if it is missing. But even so, the query will run for a long time, and the amount of data is huge if the wallet_transaction table is big. On a read replica, WAL application will be blocked while the client is consuming the query result.
SELECT * FROM wallet_transaction
WHERE currency_id = 10
ORDER BY created_at DESC
OFFSET 10100 LIMIT 100;
This query is better, but it is still bad. It limits the returned dataset, but Postgres will scan the full table and/or indexes anyway, consuming CPU/memory/IOPS.
SELECT * FROM wallet_transaction
WHERE currency_id = 10
AND created_at BETWEEN NOW() - INTERVAL '1 month' AND NOW()
ORDER BY created_at DESC
OFFSET 10100 LIMIT 100;
This is the best query. The result is limited, it takes little time to deliver the data to the client, and the index/table scans are bounded by date. It is also paged, so your app can fetch and show all the data to a user.
In real life, in my experience, the first query can take from 10 seconds to minutes and can lead to query cancellation errors, the second 1-15 seconds, and the third 5-500 milliseconds. So, for big tables always use ORDER BY/OFFSET/LIMIT to limit the results, and date intervals to limit the data selection and scans.
Conclusions
In the second part, we discussed the second stage of growth problems of Postgres databases. The main root causes are:
- high temporary files usage,
- locks and deadlocks in the database,
- non-optimal table structure and lack of indexing, which causes sequential scans,
- huge result sets from SELECT queries.
I estimate this growth stage in the following numbers:
- database size 350GB to 2TB,
- there are big tables from 50GB to 1TB,
- there are tables with 100+ million to billions of records,
- QPS (queries per second) from 3 000 to 10 000-20 000 and more,
- the number of clients/orders/views and/or other business metrics are growing 2x to 10x per 6-12 months.
At this stage you are going to face new problems and challenges, and I showed the debugging process, from query and database optimizations to making decisions about and debugging read replicas.
Also, I provided some examples of useful queries to detect problems and Postgres cluster manifests for databases. Please check my Github repo https://github.com/petrushinvs/databases-in-prod.
Good luck.