DEV Community

Raj Murugan
Raj Murugan

Posted on

AWS DMS Fails After Azure SQL Managed Instance Failover? Here’s How to Fix It

Problem Statement

A customer is using AWS Database Migration Service (DMS) to migrate data from Azure SQL Managed Instance to AWS RDS PostgreSQL over a VPN in Full Load + Change Data Capture (CDC) mode. The migration runs smoothly until Azure performs maintenance, triggering a failover. When this happens, DMS fails to locate the last tracked LSN (Log Sequence Number), causing a fatal error that stops the task.

Error
Currently, the only workaround is to manually restart the DMS task and perform a full load again, which is inefficient and time-consuming.


Understanding the Issue

When an Azure SQL Managed Instance failover occurs:

  • The old primary is demoted, and a new primary takes over.
  • The transaction logs on the new primary may not have the same LSN sequence as the old primary.
  • AWS DMS attempts to resume CDC from the last tracked LSN, but if that LSN doesn’t exist on the new primary, DMS throws an error and stops the task.
  • This issue arises because SQL Server Availability Groups do not synchronize backup history across replicas.

Potential Solutions & Trade-offs

Since this scenario is easy to replicate, testing these solutions in a non-production environment first is recommended.

1️⃣ Temporary Fix – Manual Action Required on Every Failover

  • Find the latest LSN on the new primary:
  SELECT sys.fn_cdc_get_max_lsn() AS CurrentMaxLSN;
Enter fullscreen mode Exit fullscreen mode
  • Restart DMS using the new LSN via AWS CLI or console.

Trade-offs:
✅ Simple and requires no architectural changes.

❌ Requires manual intervention every time a failover occurs.

❌ Increases downtime and operational overhead.

2️⃣ Permanent Fix – Adjust DMS Settings (May Not Work for Azure SQL MI)

  • Set:
  "AlwaysOnSharedSynchedBackupIsEnabled": false
Enter fullscreen mode Exit fullscreen mode
  • This allows DMS to poll all nodes in the Always On cluster for transaction backups.
  • Works for on-prem SQL Server, but may prevent reading from the old primary in Azure SQL MI.

Trade-offs:
✅ Fully automated once configured.

✅ No need for manual intervention.

❌ May not be supported in Azure SQL MI, requiring additional testing.

❌ Potential data consistency risks if backups are not fully synchronized.

3️⃣ Permanent Fix – Use Transaction Log Backups Instead of Live Logs

  • Azure SQL Managed Instance creates transaction log backups every ~10 minutes.
  • These backups are consistent across failovers, avoiding LSN loss.
  • DMS v3.5.3+ supports reading from log backups for Amazon RDS for SQL Server.
  • Grant necessary permissions for DMS to read log backups:
  GRANT EXEC ON msdb.dbo.rds_dms_tlog_download TO rds_user;
  GRANT EXEC ON msdb.dbo.rds_dms_tlog_read TO rds_user;
  GRANT EXEC ON msdb.dbo.rds_dms_tlog_list_current_lsn TO rds_user;
  GRANT EXEC ON msdb.dbo.rds_task_status TO rds_user;
Enter fullscreen mode Exit fullscreen mode

📌 Supported for Amazon RDS for SQL Server from DMS v3.5.3+ but may not work for Azure SQL MI—needs testing.

Trade-offs:
✅ Provides a seamless failover experience without breaking CDC.

✅ No manual intervention required after failovers.

❌ Might not be supported in Azure SQL MI, requiring thorough testing.

❌ Backup frequency (~10 min) could introduce minor delays in CDC.


Alternative SQL Managed Instance Solution

4️⃣ Synchronize Backup History Across Replicas (If Solution #2 Fails)

  • If the secondary replica is not configured for read access, set:
  "AlwaysOnSharedSynchedBackupIsEnabled": true
Enter fullscreen mode Exit fullscreen mode
  • This forces DMS to use the primary replica as a standalone SQL Server.
  • However, SQL Server does not automatically synchronize backup metadata across replicas.
  • Workaround: Run a script on the replica where backups occur to register backup history on other replicas.
  • Example script & details: Qlik Replicate & SQL Server AG HA Handling.

Trade-offs:
✅ Ensures backup history is consistent across replicas.

✅ Works well with Always On Availability Groups.

❌ Requires additional scripting and maintenance overhead.

❌ May not work in all configurations, depending on permissions and replication settings.


Conclusion

Failovers in Azure SQL Managed Instance disrupt AWS DMS CDC due to LSN mismatches. Using transaction log backups or modifying DMS settings can help mitigate the issue. Testing these solutions in a staging environment before deploying to production is essential.

Let me know if you’ve encountered similar issues or tested any of these solutions! 🚀


References

Top comments (0)