George Lin

Migrating Thousands of Databases To Azure ONLINE? Here Is How We Got It Done - Part 4: Sh!t Happens (A)

Updated: Nov 13, 2021


Note: all real Azure resource names in the error messages posted below have been replaced with "*" or masked in the screenshots for confidentiality reasons.


I really wish I didn't need to write this part, but unfortunately, shit does happen when dealing with Azure Database Migration Service. Well, it actually happened A LOT in our case.


Incident #1: DMS failed the database restore due to a "too many requests" issue
Log shipping start operation for database "*" on Managed Instance "/subscriptions/*" failed due to error. Error code: "SubscriptionTooManyCreateUpdateRequests". Error message: "Cannot process create or update request. Too many create or update operations in progress for subscription "*". Query sys.dm_operation_stats for pending operations. Wait till pending create/update requests are complete or delete one of your pending create/update requests and retry your request." 

Root cause: Azure enforces throttling limits on create/update operations (default 160), subscription-level rate limits, and maximum parallelism limits


Solution: Raise the limits for both Create and non-Create requests, or disable throttling at the subscription level
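
The error message itself points at sys.dm_operation_stats. Here is a minimal PowerShell sketch of the kind of pre-check we could run before queueing more restores. It assumes the SqlServer module; the server name and credentials are placeholders, and the columns are assumed to mirror sys.dm_operation_status:

# Check pending create/update operations on the target before starting more migrations.
# The DMV name comes from the error message; server name and credentials are placeholders,
# and the state_desc filter assumes the same columns as sys.dm_operation_status.
$ops = Invoke-Sqlcmd -ServerInstance "your-sqlmi-host" -Database "master" `
    -Username "migration_admin" -Password "<password>" `
    -Query "SELECT * FROM sys.dm_operation_stats;"
$pending = @($ops | Where-Object { $_.state_desc -in @('PENDING', 'IN_PROGRESS') })
if ($pending.Count -ge 160) {
    Write-Warning "$($pending.Count) operations still pending; wait before starting more migrations."
}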


Incident #2: Failed to stop/delete DMS instances; the command hangs even with the -DeleteRunningTask parameter
Stop-AzDmsService : DMS API Error One or more activities are currently running. To stop the service, please wait until the activities have completed or stop those activities manually and try again.

Remove-AzDms : DMS API Error One or more tasks are currently running. To delete the service, please wait until the tasks have completed or cancel the tasks manually and retry the deletion.

Root cause: The Azure VM backing the DMS instance ran into an issue


Solution: Ask Microsoft DMS support to patch the DMS services, then try to stop/delete them again.
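
Before escalating, the error text suggests stopping the running activities manually. Here is a hedged sketch with the classic Az.DataMigration cmdlets; the resource group, service, and project names are placeholders, and parameter names may differ in your module version. In our case the backing VM was unhealthy, so this only helps when the tasks are still responsive:

# Stop any running tasks on the service before retrying Stop/Remove.
# Resource group, service, and project names are placeholders; parameter names
# follow the classic Az.DataMigration module and may differ in your version.
$tasks = Get-AzDataMigrationTask -ResourceGroupName "rg-dms" -ServiceName "prod-eus2-dms16" `
    -ProjectName "proj1" -Expand
$tasks | Where-Object { $_.ProjectTask.Properties.State -eq 'Running' } | ForEach-Object {
    Stop-AzDataMigrationTask -ResourceGroupName "rg-dms" -ServiceName "prod-eus2-dms16" `
        -ProjectName "proj1" -TaskName $_.ProjectTask.Name
}
Get-AzDms -ResourceGroupName "rg-dms" -ServiceName "prod-eus2-dms16" | Stop-AzDmsService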


Incident #3: DMS shows activity failure due to "double-cutover"

Root cause: If the cutover is slow on some databases, those databases can receive more than one cutover request, and the failure shows up when the DMS state is queried. Sometimes DMS doesn't change the state after a cutover has been issued, so if I run the script again, another cutover is issued


Solution: Add logic to the PowerShell script so it never issues more than one cutover per database. For example, update the CutoverTime column in the MigGrp table as a flag the first time the cutover is issued, so that subsequent cutover requests skip the database even when it is still in the 'LOG_FILES_UPLOADING' state with the last log applied (see the sketch below).
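
Here is a minimal sketch of that guard, assuming the MigGrp tracking table lives in our control database and the actual DMS cutover call is wrapped in a hypothetical Invoke-DmsCutover helper (both the table schema and the helper name stand in for our own tooling):

# Skip the cutover if CutoverTime is already stamped for this database.
# $ctrlServer/$ctrlDb point at our control database; Invoke-DmsCutover is a
# placeholder for the real cutover call in the migration script.
$ctrlServer = "control-sql01"; $ctrlDb = "MigControl"; $dbName = "YourDb"
$row = Invoke-Sqlcmd -ServerInstance $ctrlServer -Database $ctrlDb `
    -Query "SELECT CutoverTime FROM dbo.MigGrp WHERE DatabaseName = N'$dbName';"
if ($null -eq $row -or $row.CutoverTime -is [System.DBNull]) {
    Invoke-DmsCutover -DatabaseName $dbName   # placeholder for the real cutover call
    Invoke-Sqlcmd -ServerInstance $ctrlServer -Database $ctrlDb `
        -Query "UPDATE dbo.MigGrp SET CutoverTime = SYSUTCDATETIME() WHERE DatabaseName = N'$dbName';"
}
else {
    Write-Host "Cutover already issued for $dbName at $($row.CutoverTime); skipping."
}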


Incident #4: A DMS activity fails very often if it migrates more than 4 databases

Root cause: The DMS VM runs into 100% CPU and encounters a Service Bus error


Solution: Ask the MS support team to assign more vCores to the VM, or split the work across more DMS instances (see the sketch below), then redo the migration.
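
If you go the split-the-work route, here is a simple sketch of capping each activity at 4 databases; $allDatabases and Submit-MigrationActivity are placeholders for your own inventory list and for however you create the DMS task:

# Split the database inventory into batches of at most 4 per DMS activity.
# $allDatabases and Submit-MigrationActivity are placeholders for your own
# inventory and for however you create the DMS task (e.g. New-AzDataMigrationTask).
$batchSize = 4
for ($i = 0; $i -lt $allDatabases.Count; $i += $batchSize) {
    $last  = [Math]::Min($i + $batchSize - 1, $allDatabases.Count - 1)
    $batch = $allDatabases[$i..$last]
    Submit-MigrationActivity -Databases $batch   # placeholder: one DMS activity per batch
}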


Incident #5: The database is online on the target SQLMI, but the status in DMS is still 'Cutover in progress'

Root cause: When cutover is issued on multiple databases at the same time, each database tries to report its progress. Once one has reported, another can hit a race condition where the ETag value it holds for that record has already expired (the other database jumped in and reported ahead of it).


Solution: The Microsoft DMS team needs to optimize their code to better deal with this race condition. Basically, when it happens, the code should read the state again and consolidate it into the new result


Incident #6: Migration cutover kept failing on some databases.

Root cause: The agent on the DMS VM crashed.


Solution:

  • Ask the Microsoft DMS support engineer to fix the agent crash issue, then retry the cutover

  • Complete the database restore through a REST API call (invoked in our script with -Action CutoverByRESTAPI; see the sketch below)
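
For reference, here is a hedged sketch of what that REST call looks like when the migration runs on Log Replay Service, using the managed database completeRestore operation. The subscription, resource group, instance, database, last backup file name, and api-version are all placeholders, so check the current API reference before relying on it:

# Complete the restore on the target managed database directly via the ARM REST API.
# All resource names, the last backup file name and the api-version are placeholders.
$token = (Get-AzAccessToken -ResourceUrl "https://management.azure.com/").Token
$uri = "https://management.azure.com/subscriptions/<subId>/resourceGroups/<rg>" +
       "/providers/Microsoft.Sql/managedInstances/<miName>/databases/<dbName>" +
       "/completeRestore?api-version=2021-05-01-preview"
Invoke-RestMethod -Method Post -Uri $uri `
    -Headers @{ Authorization = "Bearer $token" } `
    -ContentType "application/json" `
    -Body (@{ lastBackupName = "<last_log_backup>.bak" } | ConvertTo-Json)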


Incident #7: In SSMS, the database appears stuck in Restoring. When we try to delete it, SQL Server says it doesn't exist, and the database is not showing on the portal side.

Root cause: It appears the problem was resource contention and longer wait times for back-end processes. Microsoft support engineers saw error messages of type "Operation Timed Out" and also observed many operation waits of types Physical Database Creation and Drop Database, which would reinforce either queuing or resource contention. These pointers are what drive the suggestion to queue up fewer databases in each batch.


Solution: Wait and hope it will clear soon (maybe after 24 hours), or call the Microsoft SQLMI team to fix it in the backend.


Incident #8: The DMS activities failed with a 'missing transaction log' error, but it is actually a false error because the log chain is not broken and all the backup files are present in the blob containers
{"resourceId":"/subscriptions/*/resourceGroups/*/providers/Microsoft.DataMigration/services/prod-eus2-dms16", "errorType":"Database migration warning", "warningDetail":"Transaction log file(s) with LSN from '46622000000042200001' to '46652000000011000001' are missing. Please add the file(s) to fileshare for migration to continue." }

Root cause: DMS encountered the Service Bus error "MessageProcessingFailRenewLockException"


Solution: Restart the DMS VM to recreate the log sequence
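
To confirm the error really is false, we can check the LSN chain on the source from msdb: each log backup's first_lsn should match the previous backup's last_lsn. A minimal sketch (the server instance, database name, and time window are placeholders):

# Verify the log backup chain on the source: each first_lsn should match the
# previous backup's last_lsn. The server instance and database name are placeholders.
Invoke-Sqlcmd -ServerInstance "source-sql01" -Database "msdb" -Query @"
SELECT bs.database_name, bs.type, bs.first_lsn, bs.last_lsn, bs.backup_finish_date
FROM msdb.dbo.backupset AS bs
WHERE bs.database_name = N'YourDb'
  AND bs.backup_finish_date >= DATEADD(DAY, -2, SYSUTCDATETIME())
ORDER BY bs.backup_finish_date;
"@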


Incident #9: DMS activity failed due to "state machine fault"

Root cause: The "state machine faulted" error is a known issue


Solution: The Microsoft Azure DMS team will fix it (most likely it has been fixed by the time you read this)


Incident #10: The DMS activities failed due to some timeout issue

Root cause: An error occurred in Microsoft's pipeline while DMS tried to fetch the database migration result from an internal storage account table... which threw a StorageException


Solution: Ask the MS DMS support team to restart the DMS VMs in the backend. If that doesn't solve the issue, redo the migration activity


Incident #11: DMS activities stopped copying the log backup files

Root cause: The CPU usage on the DMS VM is at 100%


Solution: Ask the MS team to upgrade the DMS VM from 4 vCores to 8 vCores


Incident #12: The DMS activities restarted from restoring the full database backups, but didn't report any failure

Root cause: The restart from the full backup was caused by a SQLMI restart (confirmed by the corresponding records in the SQLMI error log). Our migration window didn't hit the 36-hour limit that applies to deployment events, but SQLMI could have crashed or restarted for other reasons. DMS telemetry didn't detect any restore restart events (a known issue pending a fix)


Solution: Notify the MS SQLMI team about your migration schedule in advance so that they are aware of the situation when performing any maintenance tasks in the SQLMI backend. In the meantime, the MS DMS team needs to make sure that restarting the restore service or the SQL Managed Instance resumes from the last applied log backup.


Incident #13: About 40 databases suddenly restarted their migration from restoring the full backup files, and I don't see any restart info in the SQLMI error log.

Root cause: There was an Internal Server Error that the DMS team had encountered earlier on a database, and they filed an incident for it. The Microsoft on-call engineer investigating it identified that the issue was in a component of SQL Managed Instance called Restore Service, which timed out and hence dropped the database, causing the restore to be terminated. As a mitigation, the Restore Service was restarted to avoid further issues. However, the unfortunate consequence of this mitigation is that all the other restores on the same instance got restarted from the full backup.


Solution: The MS DMS team needs to make sure that restarting the Restore Service or the SQL Managed Instance resumes from the last applied log backup.


Incident #14: Failed to start the DMS instances due to IP address contention in the DMS subnet; even after we made room (10 free IP addresses are now available in the subnet), the services are still failing to start.
Start-AzDmsService : Long running operation failed with status 'Failed'. Additional Info:'The service start operation failed with error 'The provisioning of deployment nic_gnr7sh8zxda2rjidfy4vx559 for service
/subscriptions/*/resourceGroups/*/providers/Microsoft.DataMigration/services/prod-eus2-dms110 failed. State was Failed. NIC-gnr7sh8zxda2rjidfy4vx559 (Microsoft.Network/networkInterfaces):
SubnetIsFull - Subnet * with address prefix 10.64.34.0/25 does not have enough capacity for 1 IP addresses. Template output evaluation skipped: at least one resource deployment operation failed. Please list deployment
operations for details. Please see https://aka.ms/DeployOperations for usage details.'.'
At line:1 char:106
+ ... ervice | where ProvisioningState -ne 'Succeeded' | Start-AzDmsService
+                                                        ~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : CloseError: (:) [Start-AzDataMigrationService], CloudException
    + FullyQualifiedErrorId : Microsoft.Azure.Commands.DataMigration.Cmdlets.StartDataMigrationService

Root cause: The NICs of the failed-to-start DMS services keep trying to use their initially assigned IP addresses, which are now being used by the DMS services that started earlier.


Solution: Recreate the failed-to-start DMS services, preferably with different names (see the sketch below)
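
Here is a hedged sketch of the recreate step with the classic Az.DataMigration cmdlets. The resource group, service names, SKU, and subnet values are placeholders, and the parameter names should be verified against your module version:

# Drop the failed service and recreate it under a new name in the same subnet.
# Resource group, service names, SKU, and subnet are placeholders.
Remove-AzDms -ResourceGroupName "rg-dms" -ServiceName "prod-eus2-dms110" -Force
$vnet   = Get-AzVirtualNetwork -ResourceGroupName "rg-net" -Name "vnet-migration"
$subnet = Get-AzVirtualNetworkSubnetConfig -VirtualNetwork $vnet -Name "dms-subnet"
New-AzDms -ResourceGroupName "rg-dms" -ServiceName "prod-eus2-dms110b" `
    -Location "eastus2" -Sku "Premium_4vCores" -VirtualSubnetId $subnet.Id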


Incident #15: DMS activities failed due to a login issue, even though we tested the same login in the same environment without any problem
FailedInputValidation
During validation of the server '10.21.81.40', a connection problem prevented server properties from being read.
     Error 18452 - Login failed for impersonated user '*'.
Validating DMS activity 'dms19-proj8-activity1' at '04/01/2020 14:02:48' with the migration id '19' for the following databases:

Root cause: If this error message only appears sporadically in an application using Windows Authentication, it may be because the SQL Server cannot contact the Domain Controller to validate the user. This may be caused by high network load stressing the hardware, or by a faulty piece of networking equipment. The next step here is to troubleshoot the network hardware between the SQL Server and the Domain Controller by taking network traces and replacing network hardware as necessary. https://stackoverflow.com/questions/29097103/intermittent-connection-to-sql-server-database-the-login-is-from-an-untrusted


DMS does have "login failure retry" logic in place, but it seems it only retries the login when accessing the backup share folders, not when accessing the source SQL Servers.


Solution: The DMS team needs to fix the retry logic bug (most likely done already)


Incident #16: DMS activities show the migration state as "full backup restoring" while all log files have actually been applied on the target database.

Root cause: The SQL Managed Instance is not returning the correct restore status to DMS, so DMS is unable to display the correct status even though the log files have been restored on the MI


Solution: Ignore the issue and continue the migration as usual. The MS DMS and SQLMI teams need to work together to get this fixed in the future.


Incident #17: The DMS activity is not failing, but it doesn't restore the database on the target and has stopped copying the subsequent log backup files from the source folder to Azure storage. The same activity is also migrating two other databases, which look normal


Root cause: DMS telemetry did not get a response from SQLMI on the LRS (Log Replay Service) call, and hence the migration is stuck


Solution: Ask the MS DMS team to restart the VM in the backend so that the worker node queries LRS for the status of the operation again


Incident #18: The DMS activity seems stuck in the "Full backup uploading" state, not moving forward.

Root cause: DMS first encountered a StorageException and retried the file upload, but the storage stopped responding with the status of the upload


Solution: Restart the DMS VM so that the activity can resume the migration
