Migrating Relational Databases to an AWS S3 Data Lake Using AWS DMS - Part III
This is the third blog in a multi-part series on considerations and observations from migrating databases to an AWS S3 data lake using the AWS DMS service. The databases migrated to AWS included relational engines such as Oracle, MS SQL Server, MySQL, and PostgreSQL, as well as MongoDB (a NoSQL document database).
Part I in this series covered considerations such as:
• Modified date not populated properly at the source
• Enabling supplemental logging for Oracle as the source
• Network bandwidth issues between source and target databases
Part II in this series covered considerations such as:
• Missing LOB columns for change data capture from MS SQL Server as source
• MS SQL Server time zone issue for timestamp column
• RDBMS instances pausing or shutting down during non-business hours
This blog will address the following considerations:
Deleted MongoDB records moved to S3 data lake:
MongoDB records, referred to as documents, were migrated in Document mode into the AWS S3 data lake using AWS DMS. Both new and updated documents were successfully populated into the data lake and validated to match the source.
On one occasion, several thousand deleted records were received alongside inserted and updated records.
Initially, these records were treated as actual deletions and processed accordingly. However, it was later observed that records with the same object IDs had also been received as inserts, only milliseconds apart, from a different replica set (MongoDB server instance).
I, 2024-11-21 04:18:55.009812, "{""id"": {""objid"": ""2345687fge8""}, ""CustId"": ""121""}"
D, 2024-11-21 04:18:56.001234, "{""id"": {""objid"": ""2345687fge8""}}"
Further analysis revealed that these documents were not newly inserted or deleted during the specified time period. Instead, they were added to the MongoDB server a few hours to a few days earlier.
Following discussion with the MongoDB DBA team, it was determined that no records were actually deleted or inserted during that time. The MongoDB server is horizontally scaled into shards (each a replica set) across multiple servers, and the MongoDB balancer automatically redistributes chunks of data across shards so that each shard holds a balanced amount of data. The balancer begins migrating data for sharded collections once a migration threshold is reached.
During a chunk migration, collection data is copied from the source shard into the destination shard and then deleted from the source shard. Hence, the corresponding insert and delete events appeared in the oplog for the MongoDB database.
As there were no actual deletions occurring in the MongoDB server, these deleted records were skipped and not processed in the S3 data lake.
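The skip logic can be sketched in Python. This is a minimal illustration, not the actual pipeline code: it assumes the DMS output rows have already been parsed into (operation, timestamp, object ID) tuples, and it treats any delete that is paired with an insert of the same object ID within a short window as a balancer chunk move rather than a real deletion. The five-second window is an assumed tuning knob.

```python
from datetime import datetime, timedelta

def drop_balancer_deletes(records, window=timedelta(seconds=5)):
    """Skip deletes paired with an insert of the same object ID within
    `window` -- these come from MongoDB balancer chunk migrations, not
    from real application deletes. `records` is an iterable of
    (op, timestamp, object_id) tuples, where op is 'I', 'U', or 'D'."""
    inserts = {}  # object ID -> list of insert timestamps
    for op, ts, oid in records:
        if op == "I":
            inserts.setdefault(oid, []).append(ts)

    kept = []
    for op, ts, oid in records:
        if op == "D" and any(abs(ts - t) <= window for t in inserts.get(oid, [])):
            continue  # balancer move: same document re-inserted on another shard
        kept.append((op, ts, oid))
    return kept
```

A delete with no nearby insert for the same object ID still flows through unchanged, so genuine deletions are not lost.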
MongoDB with load balancing across multiple replica sets:
During data migration from MongoDB, note that the server may be scaled into shards/replica sets. Each shard holds a subset of the collection data to ensure high availability and boost read/write throughput. This is primarily used in large databases for horizontal scaling.
The table below lists databases stored in different shards (instances):
| Sr. No. | Database   | Instance 1 (xx.xx.xx.1) | Instance 2 (xx.xx.xx.2) |
|---------|------------|-------------------------|-------------------------|
| 1       | Customer   | Yes                     | Yes                     |
| 2       | Master     | Yes                     | No                      |
| 3       | UserConfig | No                      | Yes                     |
The Customer database is distributed across two instances, with the Master and UserConfig databases stored in instance 1 and instance 2, respectively.
Create two separate source endpoints in AWS DMS to fetch data from the Customer database instances. Merge and process the data from both instances (instance 1 and instance 2) at the destination, which is the S3 data lake.
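The merge step at the destination can be sketched as follows. This is an illustrative Python snippet, not the production job: it assumes the records from each source endpoint have already been parsed into (object ID, timestamp, document) tuples, and it simply keeps the latest version of each document across the two instances.

```python
def merge_instances(*instance_records):
    """Merge change records from multiple DMS source-endpoint outputs,
    keeping the latest version of each document by timestamp. Each
    argument is an iterable of (object_id, timestamp, document) tuples."""
    latest = {}  # object ID -> (timestamp, document)
    for records in instance_records:
        for oid, ts, doc in records:
            if oid not in latest or ts > latest[oid][0]:
                latest[oid] = (ts, doc)
    return {oid: doc for oid, (ts, doc) in latest.items()}
```

Because the Customer database is sharded across both instances, the same object ID can legitimately appear in both endpoint outputs (for example, after a balancer chunk move); keying by object ID and taking the latest timestamp resolves those duplicates before loading into the data lake.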
Oracle archived log retention:
When capturing change data from Oracle, an issue with archive logs was noted. The CloudWatch logs for the DMS task showed the following error:
00019300: 2024-11-29T02:22:13 [SOURCE_CAPTURE ]E: Archived Redo log with the sequence XXXXX does not exist, thread 1 [XXXXX] (oradcdc_thread.c:XXX)
The issue was traced to the scheduled purging of archived logs from the Oracle database source. Increasing the retention period for these logs allowed the DMS change data capture tasks to be restarted successfully.
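Detecting this condition early avoids a stalled CDC task going unnoticed. A small sketch of how the DMS task log could be scanned for this specific error is shown below; the regular expression is written against the error format shown above, and the sequence numbers in the test are hypothetical.

```python
import re

# Pattern for the DMS source-capture error raised when an archived redo
# log was purged from the Oracle source before CDC could read it.
MISSING_ARCHLOG = re.compile(
    r"\[SOURCE_CAPTURE\s*\]E:\s*Archived Redo log with the sequence (\d+) does not exist"
)

def missing_log_sequences(log_lines):
    """Return the redo-log sequence numbers reported as missing, so
    retention can be extended before restarting the DMS task."""
    hits = []
    for line in log_lines:
        m = MISSING_ARCHLOG.search(line)
        if m:
            hits.append(int(m.group(1)))
    return hits
```

Running a check like this against the CloudWatch log stream for the task (for example, on a schedule) surfaces the purge problem as soon as it happens, rather than when downstream data is found to be stale.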
Conclusion:
This is Part III of a series on database migration challenges using AWS DMS. This blog focused on MongoDB-specific considerations, balancer-induced deletes and sharded deployments, along with Oracle archived log retention.