In PostgreSQL, when using a primary-replica architecture, read replicas are commonly used to offload read-heavy queries from the primary database. This improves performance and scalability. However, you may encounter an error message like:
```
Error executing query on DB instance B: canceling statement due to conflict with recovery
DETAIL: User was holding shared buffer pin for too long.
```
In this article, we'll explore the causes of this error and how to resolve it. Additionally, we will demonstrate how to handle concurrent queries on a primary and read-replica database setup using Python.
The Issue: Canceling Statement Due to Conflict with Recovery
When using a read replica in PostgreSQL, the replica server constantly applies changes from the primary database to stay in sync. However, long-running queries on the replica may conflict with the recovery process, leading to errors like:
```
canceling statement due to conflict with recovery
DETAIL: User was holding shared buffer pin for too long.
```
What Causes This Error?
This error typically occurs when:
- A query running on the replica holds onto shared buffers for too long.
- The replica is lagging behind the primary database and is applying changes (recovery).
- PostgreSQL detects that the query's execution could delay the recovery process, and thus cancels the query to ensure the replica stays in sync.
This can happen during long-running SELECT statements, especially those that join multiple tables or involve complex calculations.
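Besides tuning the server, application code can also defend against this error by retrying read-only queries that get canceled. The sketch below is a minimal, hypothetical retry helper: it matches on the error message text for simplicity (in production you would inspect the SQLSTATE on the psycopg2 exception instead), and `flaky_query` is a stand-in for a real replica query:

```python
import time

def run_with_retry(run_query, max_attempts=3, backoff_seconds=1.0):
    """Retry a replica query that was canceled due to a recovery conflict.

    `run_query` is a zero-argument callable that executes the query and
    returns its result; the query must be read-only so it is safe to re-run.
    """
    attempt = 0
    while True:
        attempt += 1
        try:
            return run_query()
        except Exception as exc:
            # Simplification: detect the conflict by its message text.
            if "conflict with recovery" in str(exc) and attempt < max_attempts:
                time.sleep(backoff_seconds * attempt)  # simple linear backoff
                continue
            raise

# Hypothetical usage: fails twice with a recovery conflict, then succeeds.
calls = {"n": 0}

def flaky_query():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("canceling statement due to conflict with recovery")
    return 42

print(run_with_retry(flaky_query, backoff_seconds=0))  # → 42
```

Retries only paper over occasional cancellations; for queries that are consistently too slow, the server-side parameters below are the more direct fix.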
The Solution: Adjusting PostgreSQL Parameters
One way to resolve this issue is to adjust two key parameters in PostgreSQL:
1. max_standby_archive_delay
This parameter sets the maximum total time the replica will allow WAL (Write-Ahead Log) application to be delayed by a conflicting query when WAL data is being read from the archive. Increasing max_standby_archive_delay lets a query run longer before PostgreSQL cancels it in favor of recovery.
2. max_standby_streaming_delay
This parameter is the equivalent for streaming replication: it sets the maximum time the replica will delay applying WAL received directly from the primary before canceling conflicting queries.
By adjusting both of these parameters, you can reduce the likelihood of conflicts between long-running queries and the recovery process on the replica.
Example: Modify PostgreSQL Configuration
To increase the delay for both archive and streaming replication, update the postgresql.conf file on the replica:

```
max_standby_archive_delay = 900s
max_standby_streaming_delay = 900s
```
These settings will allow queries to run for up to 15 minutes (900 seconds) before PostgreSQL forces them to cancel in order to apply WAL data.
After making these changes, reload PostgreSQL to apply the new settings:

```sql
SELECT pg_reload_conf();
```
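If you prefer not to edit postgresql.conf by hand, the same settings can be applied with ALTER SYSTEM (run on the replica as a superuser), which writes them to postgresql.auto.conf. Both parameters are reloadable, so no restart is required:

```sql
ALTER SYSTEM SET max_standby_archive_delay = '900s';
ALTER SYSTEM SET max_standby_streaming_delay = '900s';
SELECT pg_reload_conf();
```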
Example Scenario: Python Code with Concurrent Queries on Primary and Replica
Now, let's walk through a practical example of executing concurrent queries on a primary PostgreSQL instance (DB instance A) and a read replica (DB instance B) using Python and the psycopg2 library.
Python Code Overview
We will use two threads:
- Thread 1 (DB instance A): creates a table, inserts 10,000 records, and then periodically updates them.
- Thread 2 (DB instance B): repeatedly executes a SELECT COUNT(1) query on the read replica that joins three copies of my_table to simulate a heavier query.
Python Code Implementation
```python
import psycopg2
import threading
import time

# Function to execute queries on DB instance A (Primary DB)
def execute_queries_db1():
    db1_connection = None  # Initialize to avoid UnboundLocalError if connect fails
    try:
        # Connect to DB instance A (Primary)
        db1_connection = psycopg2.connect(
            host='db1_host',
            user='db_user',
            password='db_password',
            database='db_name'
        )
        with db1_connection.cursor() as cursor:
            # Create a table if it doesn't exist
            query_1 = "CREATE TABLE IF NOT EXISTS my_table (some_val VARCHAR(100));"
            cursor.execute(query_1)
            db1_connection.commit()
            print("Table 'my_table' created.")

            # Insert 10,000 records
            query_2 = "INSERT INTO my_table (some_val) SELECT generate_series(1, 10000)::TEXT;"
            cursor.execute(query_2)
            db1_connection.commit()
            print("Inserted 10,000 records into 'my_table'.")

            # Periodically update the table every 2 seconds
            while True:
                query_3 = "UPDATE my_table SET some_val = some_val || '1';"
                cursor.execute(query_3)
                db1_connection.commit()
                print("Updated 'my_table'.")
                time.sleep(2)  # Wait for 2 seconds before the next update
    except Exception as e:
        print(f"Error executing queries on DB instance A: {e}")
    finally:
        if db1_connection:
            db1_connection.close()

# Function to execute queries on DB instance B (Read Replica)
def execute_query_db2():
    db2_connection = None  # Initialize to avoid UnboundLocalError if connect fails
    try:
        # Connect to DB instance B (Read Replica)
        db2_connection = psycopg2.connect(
            host='db2_host',
            user='db_user',
            password='db_password',
            database='db_name'
        )
        with db2_connection.cursor() as cursor:
            # Run a heavy SELECT COUNT query that joins three copies of my_table
            while True:
                query_4 = "SELECT COUNT(1) FROM my_table t1, my_table t2, my_table t3;"
                cursor.execute(query_4)
                result = cursor.fetchone()
                print(f"Count from query_4 on DB instance B: {result[0]}")
                time.sleep(1)  # Wait for 1 second before running query_4 again
    except Exception as e:
        print(f"Error executing query on DB instance B: {e}")
    finally:
        if db2_connection:
            db2_connection.close()

# Start threads for both DB tasks
def start_threads():
    # Create threads for the DB operations
    thread_db1 = threading.Thread(target=execute_queries_db1)
    thread_db2 = threading.Thread(target=execute_query_db2)

    # Start the first thread
    thread_db1.start()

    # Wait for 20 seconds before starting the second thread
    print("Waiting for 20 seconds before starting the second thread...")
    time.sleep(20)

    # Start the second thread
    thread_db2.start()
    print("Both DB tasks are running.")

    # Wait for both threads to complete (they will run indefinitely)
    thread_db1.join()
    thread_db2.join()

if __name__ == "__main__":
    start_threads()
```
Code Explanation:
- DB Instance A (Primary DB): This thread creates a table, inserts 10,000 rows, and then updates them every 2 seconds.
- DB Instance B (Read Replica): This thread continuously executes a SELECT COUNT(1) query that joins three copies of my_table, simulating a complex query on the read replica.
- Concurrency: We use Python's threading module to run both tasks concurrently. The second thread starts 20 seconds after the first to simulate a primary and replica operating at the same time.
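The fixed 20-second stagger can be generalized into a small helper. This is a hypothetical sketch (the `start_staggered` helper and the delay value are our own, not part of the original script) that starts each task in its own thread with a pause between starts:

```python
import threading
import time

def start_staggered(tasks, delay_seconds):
    """Start each callable in its own thread, sleeping `delay_seconds`
    between starts. Returns the Thread objects so the caller can join them."""
    threads = []
    for i, task in enumerate(tasks):
        if i > 0:
            time.sleep(delay_seconds)
        t = threading.Thread(target=task, daemon=True)
        t.start()
        threads.append(t)
    return threads

# Hypothetical usage with trivial stand-in tasks:
order = []
threads = start_staggered([lambda: order.append("primary"),
                           lambda: order.append("replica")], 0.1)
for t in threads:
    t.join()
print(order)
```

Marking the threads as daemons also means the script can be stopped with Ctrl+C even though the real DB tasks loop forever.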
Key Points:
- Make sure your read replica is configured to handle conflicts between long-running queries and recovery operations. Increasing max_standby_archive_delay and max_standby_streaming_delay can help reduce query cancellations.
- Always monitor your system's performance and replication delay to catch query cancellations early.
Conclusion
Handling conflicts between queries and recovery on PostgreSQL read replicas is a common challenge, especially with long-running or complex queries. By tuning parameters like max_standby_archive_delay and max_standby_streaming_delay, you can mitigate this issue and ensure smoother replication and query execution. Additionally, Python's threading capabilities let you run concurrent tasks against the primary and replica databases efficiently.
If you encounter issues like the one discussed in this article, adjusting these PostgreSQL parameters should provide a robust solution to keep your read replicas in sync without prematurely canceling important queries.