Kris Iyer for AWS Community Builders

Originally published at hmh.engineering

GP2 to GP3 for AWS RDS Postgres


Overview

Recently I had an opportunity to work on a migration of RDS Postgres storage types from GP2 to GP3 for a large database. The migration was mostly motivated by potential performance improvements: getting past throughput limits on GP2, better IOPS, and of course, getting onto a more modern storage architecture offered by AWS at a lower cost. This post highlights some challenges you may encounter during the migration process that are not very obvious until you run into them.

Migration Options

Depending on the size of the database and your migration window and strategy, you may choose one of the following:

  • Modify Storage Type: This is the simplest approach. You can directly modify the storage type of your primary instance to GP3 through the AWS Management Console, CLI, or SDK (a minimal SDK sketch follows this list). It's generally non-disruptive, with minimal performance impact during the conversion. Use this option when:
    • You need a simple and non-disruptive upgrade: This method is the easiest way to convert your storage and offers minimal downtime. It's generally suitable for most GP2 to GP3 migration scenarios.
    • Downtime tolerance is low: The modification process is typically non-disruptive and in place, with only a brief period of potential performance impact while the underlying storage configuration is adjusted (Storage Optimization).
    • Your database size is manageable: Modifying the storage type works well for databases of various sizes. However, for very large databases, a snapshot/restore approach might offer more control over the migration process.
  • Database Migration Service (DMS): Use DMS to migrate your data from the GP2 instance to a newly created GP3 instance. This is flexible for complex migrations and minimizes downtime, but it might require more configuration, potentially incur additional costs, and could have an impact on the primary depending on the configuration. Options for limiting that impact include:
    • Replication task scheduling: Schedule the initial data load for off-peak hours when the writer's workload is lighter.
    • Bandwidth throttling: Limit the amount of data DMS reads from the source per unit of time to minimize the performance impact.
    • CDC (Change Data Capture): For ongoing replication, DMS uses CDC techniques to capture only the changes made to the source data, reducing the load compared to full table scans.
    • pglogical: Can be used as the plugin for PostgreSQL logical replication with DMS, further reducing the impact on the writer.
  • Snapshot and Restore: If you have a very strict downtime window for the migration and modifying the storage type directly isn't feasible due to the brief outage it can cause, snapshot and restore can offer more control. You can create a snapshot of the GP2 volume during a maintenance window and then restore it to a newly provisioned GP3 instance during another window. This approach allows you to minimize downtime on the primary instance by performing the restoration on a separate instance.
  • Replica with GP3 storage: While keeping your primary instance on GP2, adding a replica with GP3 storage can be a potential strategy for transitioning to GP3. Some considerations:
    • Cost: This can be a factor, as this solution requires additional instance(s).
    • Rehydration: The rehydration status can be a big factor; you may need to work with tools to speed up rehydration and/or collaborate with AWS over support tickets to understand the rehydration status from S3 to EBS.
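As a rough illustration of the first option, the sketch below uses the AWS SDK for Python (boto3) to convert an existing instance to GP3. The instance identifier and the IOPS/throughput values are placeholders, not recommendations; whether you can set them at all depends on your engine and allocated storage.

# Sketch: convert an existing RDS instance's storage from GP2 to GP3 with boto3.
# The identifier and the IOPS/throughput values are placeholders, not sizing guidance.
import boto3

rds = boto3.client("rds")

response = rds.modify_db_instance(
    DBInstanceIdentifier="pg-primary-gp2",   # hypothetical instance name
    StorageType="gp3",
    Iops=12000,                # optional; only where provisioned IOPS apply to gp3
    StorageThroughput=500,     # optional, MiB/s; gp3 only
    ApplyImmediately=True,     # apply now instead of the next maintenance window
)

# The instance stays available while the volume goes through storage optimization.
print(response["DBInstance"]["PendingModifiedValues"])

The same change is available in the console or with aws rds modify-db-instance on the CLI.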

Some scenarios and comparisons

Hydration after restore

For Amazon RDS instances that are restored from snapshots (automated and manual), the instances are made available as soon as the needed infrastructure is provisioned. However, there is an ongoing process that continues to copy the storage blocks from Amazon S3 to the EBS volume; this is called lazy loading. While lazy loading is in progress, I/O operations might need to wait for the blocks being accessed to be first read from Amazon S3. This causes increased I/O latency, which doesn’t always have an impact on applications using the Amazon RDS instance. If you want to reduce any slowness due to hydration, read all the data blocks as soon as the restore is complete.

Mitigating Effects of Lazy Loading

There are a few strategies available that help mitigate or minimize the impact of lazy loading, or in other words, help speed it up. For Amazon RDS for PostgreSQL, the following options are available:

  • Use the pg_prewarm shared library module to read through all the tables (a minimal sketch follows this list). Note that pg_prewarm doesn't pre-fetch the following:
    • TOAST tables [RDS limitation]: no workaround.
    • Indexes [pg_prewarm limitation]: no workaround.
    • DB objects owned by other users [RDS limitation]: the workaround here is to re-run the SQL once as each DB user that owns any table. Useful scripts can be found at https://github.com/robins/PrewarmRDSPostgres/blob/master/singledb.sql and https://github.com/robins/PrewarmRDSPostgres/blob/master/toast.sql
  • Use the pg_dump utility with the jobs and data-only parameters to perform an export of all application schemas.
  • Perform an explicit select on all the large and heavily used tables individually, with parallelism (see the parallel range sketch after this list).
    • For large tables, you may be able to split the query into ranges based on the primary key. For example, the query below returns 4 ranges with an equal number of rows (primary key col1):
select nt,max(col1),count(*) 
from (SELECT col1, Ntile(4) over(ORDER BY col1) nt FROM testuser.test_table)st 
group by nt 
order by nt;

        NT  MAX(COL1)   COUNT(*)
---------- ---------- ----------
         1     125000     125000
         2     250000     125000
         3     375000     125000
         4     500000     125000

NOTE: If you decide to use this query, test it before running it in production.

  • DMS, however, acts as a conduit for transferring data between databases. During a GP2 to GP3 migration using DMS, the data is transferred directly from the source GP2 instance to the target GP3 instance. DMS doesn't store the entire migrated dataset in S3 as an intermediate step, thus eliminating the need to rehydrate from S3.
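To make the pg_prewarm option above concrete, here is a minimal sketch (Python with psycopg2; the endpoint, database, and user are hypothetical) that loops over the tables owned by the connected user and prewarms each one. Per the RDS limitation noted above, you would re-run it once as each owning DB user.

# Sketch: prewarm all tables owned by the connected user after a restore, so lazy
# loading from S3 happens up front instead of during live traffic. Endpoint,
# database, and user are placeholders; requires the pg_prewarm extension.
import psycopg2

conn = psycopg2.connect(
    host="my-gp3-replica.abc123.us-east-1.rds.amazonaws.com",  # hypothetical endpoint
    dbname="appdb",
    user="testuser",  # password comes from PGPASSWORD or .pgpass
)
conn.autocommit = True

with conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS pg_prewarm")
    # format('%I.%I', ...) quotes identifiers safely for mixed-case or unusual names.
    cur.execute(
        "SELECT format('%I.%I', schemaname, tablename) FROM pg_tables "
        "WHERE tableowner = current_user "
        "AND schemaname NOT IN ('pg_catalog', 'information_schema')"
    )
    for (qualified_name,) in cur.fetchall():
        with conn.cursor() as prewarm:
            prewarm.execute("SELECT pg_prewarm(%s)", (qualified_name,))
            print(f"prewarmed {qualified_name}: {prewarm.fetchone()[0]} blocks")

conn.close()

And for the explicit-select option, a rough sketch of reading one large table in parallel by primary-key range, using boundaries like the ones the ntile query above produces (table name, ranges, and connection string are again placeholders):

# Sketch: read a large table in parallel, one worker per primary-key range, to pull
# its blocks from S3 up front. Ranges mirror the ntile output above; adjust to yours.
from concurrent.futures import ThreadPoolExecutor

import psycopg2

DSN = "host=my-gp3-replica.abc123.us-east-1.rds.amazonaws.com dbname=appdb user=testuser"  # placeholder
RANGES = [(1, 125000), (125001, 250000), (250001, 375000), (375001, 500000)]

def read_range(bounds):
    lo, hi = bounds
    conn = psycopg2.connect(DSN)
    try:
        with conn.cursor() as cur:
            # If col1 is indexed, consider selecting non-indexed columns too so the
            # heap (not just the index) is actually read.
            cur.execute(
                "SELECT count(*) FROM testuser.test_table WHERE col1 BETWEEN %s AND %s",
                (lo, hi),
            )
            return lo, hi, cur.fetchone()[0]
    finally:
        conn.close()

with ThreadPoolExecutor(max_workers=len(RANGES)) as pool:
    for lo, hi, rows in pool.map(read_range, RANGES):
        print(f"warmed rows {lo}-{hi}: {rows}")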

PostgreSQL Recovery Conflict Error

ERROR: canceling statement due to conflict with recovery
DETAIL: User query might have needed to see row versions that must be removed.
CONTEXT: SQL statement "select pg_prewarm(temprow.tablename)"
PL/pgSQL function inline_code_block line 6 at SQL statement
  • A recovery conflict error might occur due to the lack of visibility from the primary instance into the activity that's happening on the read replica. The conflict with recovery occurs when WAL information can't be applied on the read replica because the changes might obstruct an activity that's happening there.
  • Query conflict might happen when a transaction on the read replica is reading tuples that are set for deletion on the primary instance. The deletion of tuples followed by vacuuming on the primary instance causes a conflict with the SELECT query that’s still running on the replica. In this case, the SELECT query on the replica is terminated with the following error message:
ERROR: canceling statement due to conflict with recovery
DETAIL: User query might have needed to see row versions that must be removed.

max_standby_archive_delay

max_standby_archive_delay is a configuration parameter on a PostgreSQL replica instance in RDS that controls how long the replica waits for conflicting queries to finish before canceling them so that it can apply a pending Write-Ahead Log (WAL) segment. In other words, it allows you to manage the trade-off between how far the replica lags behind and accommodating long-running queries on your RDS PostgreSQL replica.

If WAL data is read from the archive location in Amazon Simple Storage Service (Amazon S3), then use the max_standby_archive_delay parameter.

max_standby_streaming_delay

max_standby_streaming_delay on the replica primarily affects its behavior and data consistency with the primary. However, in high-load scenarios or during failovers, it can have indirect consequences for the writer (primary) due to potential replication lag. For instance, if the replica falls significantly behind due to frequent pauses in WAL replay caused by conflicting queries, the primary might experience increased load as it needs to retain more WAL segments until they can be applied on the replica.

If you are increasing max_standby_archive_delay to avoid canceling queries that conflict with reading WAL archive entries, then consider increasing max_standby_streaming_delay as well to avoid cancellations linked to conflicts with streaming WAL entries.

If WAL data is read from streaming replication, then use the max_standby_streaming_delay parameter.
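On RDS these two settings are managed through the replica's DB parameter group rather than postgresql.conf, so a change might look like the sketch below. The parameter group name and the 30-minute values are placeholders; the right delay is a trade-off against how much replication lag you can tolerate.

# Sketch: raise both standby delay parameters on the replica's custom parameter group.
# Group name and values are placeholders; values are in milliseconds.
import boto3

rds = boto3.client("rds")

rds.modify_db_parameter_group(
    DBParameterGroupName="my-gp3-replica-params",  # hypothetical custom parameter group
    Parameters=[
        {
            "ParameterName": "max_standby_archive_delay",
            "ParameterValue": "1800000",  # 30 minutes
            "ApplyMethod": "immediate",   # both parameters are dynamic
        },
        {
            "ParameterName": "max_standby_streaming_delay",
            "ParameterValue": "1800000",
            "ApplyMethod": "immediate",
        },
    ],
)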

vacuum_defer_cleanup_age

NOTE: Applies to PostgreSQL versions < 16 (vacuum_defer_cleanup_age was removed in PostgreSQL 16).

With vacuum_defer_cleanup_age, you could specify the number of transactions by which VACUUM (and HOT updates) would defer cleaning up dead row versions on the primary. The purpose of this deferral was to avoid conflicts between ongoing queries on the replica and the cleanup of row versions that those queries might still be accessing.

For versions ≥ 16

The combination of the statement_timeout parameter and the hot_standby_feedback feature can achieve a similar outcome to vacuum_defer_cleanup_age.
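If you take that route on newer versions, a quick sanity check is to read the live values from pg_settings on the replica; the connection details below are placeholders.

# Sketch: confirm the replica-side settings that replace vacuum_defer_cleanup_age.
import psycopg2

conn = psycopg2.connect(
    host="my-gp3-replica.abc123.us-east-1.rds.amazonaws.com",  # placeholder endpoint
    dbname="appdb",
    user="testuser",
)
with conn.cursor() as cur:
    cur.execute(
        "SELECT name, setting, unit FROM pg_settings "
        "WHERE name IN ('hot_standby_feedback', 'statement_timeout', "
        "'max_standby_archive_delay', 'max_standby_streaming_delay')"
    )
    for name, setting, unit in cur.fetchall():
        print(f"{name} = {setting} {unit or ''}".strip())
conn.close()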

Lazy Loading and Multi-AZ

When you change your Single-AZ instance to Multi-AZ, Amazon RDS creates a snapshot of the instance’s volumes. The snapshot is used to create new volumes in another Availability Zone. Although these new volumes are immediately available for use, you might experience a performance impact. This impact occurs because the new volume’s data is still loading from Amazon Simple Storage Service (Amazon S3). Meanwhile, the DB instance continues to load data in the background. This process might lead to elevated write latency and a performance impact during and after the modification process.

a. Initiate a failover on your Reader instance to be sure that the new AZ is the primary AZ.

b. Perform read operations on your Reader instance: perform an explicit select on all the large and heavily used tables individually with parallelism, or use the pg_prewarm shared library module to read through all the tables.

c. Confirm that the write latency has returned to normal levels by reviewing the WriteLatency metric in Amazon CloudWatch.
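For step c, a check might look like the sketch below, which pulls the average WriteLatency for the last hour; the instance identifier and the 10 ms threshold are assumptions, not AWS guidance.

# Sketch: confirm WriteLatency has settled after the Multi-AZ conversion.
# Instance identifier and threshold are placeholders.
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch")
now = datetime.now(timezone.utc)

stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/RDS",
    MetricName="WriteLatency",
    Dimensions=[{"Name": "DBInstanceIdentifier", "Value": "pg-primary-gp2"}],
    StartTime=now - timedelta(hours=1),
    EndTime=now,
    Period=300,               # 5-minute datapoints
    Statistics=["Average"],
)

for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    latency_ms = point["Average"] * 1000  # the metric is reported in seconds
    status = "ok" if latency_ms < 10 else "still elevated"
    print(f"{point['Timestamp']:%H:%M} write latency {latency_ms:.1f} ms ({status})")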

For more information, refer to "What's the impact of modifying my Single-AZ Amazon RDS instance to a Multi-AZ instance and vice versa?"

Cascading Replicas and Rollback Scenarios

Note: This only applies to Postgres versions ≥14.1.

While you make your transition from GP2 to GP3, plan on a fallback or rollback strategy as a best practice. As you promote your GP3-based replica to primary (standalone), make sure to add a GP2 replica just in case. You can achieve this with cascading replicas, which can be set up ahead of time along with the GP3 replica. Rehydration and lazy loading apply here just as they do elsewhere, including to any Multi-AZ standby instances.

With cascading read replicas, your RDS for PostgreSQL DB instance sends WAL data to the first read replica in the chain. That read replica then sends WAL data to the second replica in the chain, and so on. The end result is that all read replicas in the chain have the changes from the RDS for PostgreSQL DB instance, but without the replication overhead falling solely on the source DB instance.
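As a rough sketch of that chain (all identifiers are placeholders), you could create the GP3 replica from the GP2 primary and then a GP2 cascading replica from the GP3 replica before any promotion:

# Sketch: build a GP3 replica plus a GP2 cascading replica as a rollback path.
# Identifiers are placeholders; cascading replicas require a supported PostgreSQL version.
import boto3

rds = boto3.client("rds")

# First-tier replica on GP3 storage, sourced from the GP2 primary.
rds.create_db_instance_read_replica(
    DBInstanceIdentifier="pg-replica-gp3",
    SourceDBInstanceIdentifier="pg-primary-gp2",
    StorageType="gp3",
)

# Wait until the GP3 replica is available before chaining off of it.
rds.get_waiter("db_instance_available").wait(DBInstanceIdentifier="pg-replica-gp3")

# Second-tier (cascading) replica back on GP2, kept as a fallback target in case
# you need to roll back after promoting the GP3 instance.
rds.create_db_instance_read_replica(
    DBInstanceIdentifier="pg-replica-gp2-fallback",
    SourceDBInstanceIdentifier="pg-replica-gp3",
    StorageType="gp2",
)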

Conclusion

Choosing the right strategy depends on how critical minimizing downtime is, as well as cost and scalability, among other things. Choose wisely, consult AWS Support, and engage your AWS TAM and AWS Solutions Architects to discuss the impact of the migration on your databases and SLAs, while considering lazy loading and/or full initialization of the RDS GP3 volumes as you plan yours!

