Note: all the real Azure resource names in the error messages posted below were replaced by "*" or masked in the screenshots for confidentiality reasons.
Incident #19: I was trying to create 50 new DMS instances; 49 of them succeeded, but one kept failing. I tried both PowerShell commands and the Portal a few times; all attempts failed.
Root cause: we hit an Azure capacity restriction. Having quota does not necessarily mean that capacity is available; ideally it should be, but occasionally Azure hits its own limits.
Solution: Retry later, when capacity may have become available again, since this could have been transient spot pressure. In our case, Microsoft helped adjust the maximum limit percentage (the default is 85%). A retry sketch is shown below.
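For reference, here is roughly what the retry looks like in PowerShell. This is a minimal sketch assuming the classic Az.DataMigration cmdlets; the resource group, instance name, location, SKU, and subnet below are placeholders, and parameter names may differ slightly between module versions.
# Retry creating the DMS instance a few times, waiting between attempts (all names/values are placeholders)
$rg     = "MyResourceGroup"
$vnet   = Get-AzVirtualNetwork -ResourceGroupName $rg -Name "MyVNet"
$subnet = (Get-AzVirtualNetworkSubnetConfig -VirtualNetwork $vnet -Name "MySubnet").Id
for ($i = 1; $i -le 5; $i++) {
    try {
        New-AzDataMigrationService -ResourceGroupName $rg -Name "DMS50" -Location "EastUS" `
            -Sku "Premium_4vCores" -VirtualSubnetId $subnet -ErrorAction Stop
        break
    }
    catch {
        Write-Warning "Attempt $i failed: $($_.Exception.Message). Retrying in 30 minutes..."
        Start-Sleep -Seconds 1800   # capacity pressure is usually transient
    }
}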
Incident #20: I couldn't stop the service because it had an activity running. I tried to stop the DMS activity, but it was stuck at "Cancelling" (in some other cases "Queued") forever. The activity couldn't be dropped because it was not in the "Stopped" state.
Root cause: code bug in DMS
Solution: run Remove-AzDmsProject -Force -DeleteRunningTask, which deletes all the tasks in the project (along with the project itself). Not an ideal workaround, but currently it's the only way that works. A full example is sketched below.
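For reference, the full command looks roughly like this. Remove-AzDmsProject is an alias of Remove-AzDataMigrationProject; the resource group, service, and project names are placeholders, and the project-name parameter may be -ProjectName instead of -Name depending on the module version.
# Force-delete the project together with its stuck running tasks (names are placeholders)
Remove-AzDmsProject -ResourceGroupName "MyResourceGroup" `
    -ServiceName "DMS135" `
    -Name "Proj4" `
    -DeleteRunningTask `
    -Force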
Incident #21: a DMS instance was stuck at "Starting" for a very long time (hours) and then failed with:
State was Failed. VM-7mcrx7frwz66nni6b4n3nxpd (Microsoft.Compute/virtualMachines): OSProvisioningTimedOut - OS Provisioning for VM 'VM-7mcrx7frwz66nni6b4n3nxpd' did not finish in the allotted time.
The VM may still finish provisioning successfully. Please check provisioning state later. For details on how to check current provisioning state of Windows VMs,
refer to https://aka.ms/WindowsVMLifecycle and Linux VMs, refer to https://aka.ms/LinuxVMLifecycle. Template output evaluation skipped: at least one resource deployment operation failed. Please list deployment operations for details. Please see https://aka.ms/DeployOperations for usage details.
Root cause: it seems the service was actually up and running, but somehow the API call returned the wrong state information; it looks like the DMS VM ran into an issue and timed out.
Solution: wait until the state changes to 'FailedToStart' and then start the instance again. Most likely it will come online ("Succeeded") immediately after the command is issued; see the sketch below.
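A minimal polling sketch, assuming you check the provisioning state through Get-AzResource and then use Start-AzDataMigrationService from the classic Az.DataMigration module; the resource names are placeholders, and the exact property path and parameter names may vary in your environment.
# Poll the provisioning state until the instance leaves 'Starting' (names are placeholders)
do {
    Start-Sleep -Seconds 300
    $svc = Get-AzResource -ResourceGroupName "MyResourceGroup" `
        -ResourceType "Microsoft.DataMigration/services" -Name "DMS37" -ExpandProperties
    $state = $svc.Properties.provisioningState
    Write-Host "Provisioning state: $state"
} while ($state -eq 'Starting')

# Once it reports 'FailedToStart', start it again; it usually comes online ('Succeeded') right away
if ($state -eq 'FailedToStart') {
    Start-AzDataMigrationService -ResourceGroupName "MyResourceGroup" -Name "DMS37"
}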
Incident #22: The DMS activity was migrating three databases. Two of them ran OK, but the third didn't start uploading log backup files even though its full database backup had already been uploaded. The migration state of all three databases was "Log Shipping in progress".
Root cause: the status (Log Shipping in progress) indicates that DMS fired off the request for LRS (Log Replay Service) to start restoring, but it never received, or missed, the response for the failed database. DMS won't move to the log-file-uploading state until it gets that feedback and the operation ID. Further digging into the issue showed that a SqlManagementClient failure happened at that time.
Solution: In the short term, ask the Microsoft DMS team to restart the DMS VM in the back end so that it calls start LRS again. In the long term, the Microsoft DMS team should add retry logic to their code so that the status and operation ID are requested again after a timeout period.
Incident #23: The state of the DMS activity was 'Faulted', but all the logs had been applied to the target database successfully.
Root cause: the SQL Managed Instance was not returning the correct restore status to DMS, so DMS was unable to display the correct status even though the log files had been restored on the MI instance.
Solution: Ignore the issue and complete the database restore (cutover) through a REST API call:
-Action CutoverByRESTAPI
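For reference, the documented Invoke-AzDataMigrationCommand cmdlet can issue the same cutover request if you are not using a wrapper script; a hedged sketch with placeholder names:
# Trigger cutover for one database of an online (MiSync) migration task (all names are placeholders)
Invoke-AzDataMigrationCommand -CommandType CompleteSqlMiSync `
    -ResourceGroupName "MyResourceGroup" `
    -ServiceName "DMS135" `
    -ProjectName "Proj4" `
    -TaskName "DMS135-Proj4-Activity1" `
    -DatabaseName "MyDatabase"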
Incident #24: the DMS activity was uploading database backup files, but very slowly.
Root cause: the DMS VM ran into an issue (unknown).
Solution: recreate the task in a different DMS instance.
Incident #25: DMS couldn't restore more than 30 databases simultaneously on one target SQLMI
Root cause: the DMS Log Replay Service hit a SQLMI connection pool issue.
Solution: the Microsoft SQLMI team needs to disable the changed behavior through a feature switch so that Azure safely allows 60-70 concurrent operations per instance. It's better to notify the Microsoft SQLMI team in advance about your migration schedule so that they can apply the mitigation in the target environment. However, if the maximum number of databases being restored concurrently on the target SQLMI is fewer than 30, the mitigation is not required.
Note: After the mitigation is deployed on the SQLMIs, any new SQLMIs in the same subnet get the latest version automatically.
Update received from Microsoft on 9/11/2020 1:37 PM:
I just want to confirm that the mitigation pre-migration that needed to be done before on target MIs is not needed anymore, as over the previous week or two, the latest version of MI got deployed on all MIs including those owned by ***. So, to summarize, everything is ok from MI side regarding issues that happened on migrations weeks back.
Incident #26: about 310 databases altogether failed migration on the first try. Some of them failed while validating the full backup set, complaining that the backup set in the shared folder was incomplete (a DMS issue); others complained that the full backup was not found on the storage account, even though the backup files had been verified to be completely fine.
Root cause: the errors were caused by a transient issue that has been observed with the DMS / storage interface. It usually works on restart and happens only early in the migration phase. The issue is caused by a small delay between the time the commit for a blob uploaded by DMS completes and the time DMS kicks off the Log Replay.
Solution: The fix in Microsoft's pipeline will basically retry the operation and treat this as a retriable failure, so it will be transparent to the customer.
Incident #27: after the full backup had been restored, the DMS activity failed to copy the log backup files to the blob containers, or failed to apply them to the target database, complaining about an LSN error (24).
Root cause: DMS observed a broken (false) log chain due to some bug in its code.
Solution: Ask the DMS support team to restart the VM so that it can rebuild the log chain, or redo the migration using a different DMS instance.
Incident #28: All databases suddenly disappeared on the target SQLMIs during the migration. SSMS didn't show any databases, but the Azure portal was still showing the newly restored databases. The DMS activities appeared to have stopped applying logs since 22:15.
Root cause: the restores were suspended due to an unlikely sequence of events: a resize of a cluster, which takes precedence over restores (this should not be the case and will be addressed), also picked up another, bigger change due to batching of updates. This deployment was allowed through the policy gates, which should normally prevent deployments while restores are happening (within roughly a 30-hour window). Since the restores were suspended, the databases were not showing up and the log changes were not getting applied.
Solution: For each migration, notify the Microsoft SQLMI team in advance about the schedule so that they are aware of the situation when performing any maintenance tasks in the SQLMI back end.
Incident #29: the DMS activities had been running for more than 3 hours without doing anything. They would eventually start migrating the databases, but why did they have to wait so long to start working?
Root cause: the migration activities hit a bug on the DMS side.
Solution: restarting the DMS instances usually fixes the issue; see the sketch below.
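A restart is just a stop followed by a start. A minimal sketch assuming the classic Az.DataMigration cmdlets; the instance name is a placeholder and parameter names may differ by module version.
# Restart the DMS instance: stop it, wait a bit, then start it again (names are placeholders)
Stop-AzDataMigrationService -ResourceGroupName "MyResourceGroup" -Name "DMS17"
Start-Sleep -Seconds 120
Start-AzDataMigrationService -ResourceGroupName "MyResourceGroup" -Name "DMS17"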
Incident #30: the DMS activities kept failing even after restarting the DMS instance.
INFO: TaskName:'DMS135-Proj4-Activity1', TaskState:'Faulted', TaskError:'Task could not be processed due to an internal service error.', DBName:'*', State:'FULL_BACKUP_UPLOADING', StartOn:'09/11/2020 15:54:24', LastRestoredFile:'', ErrorMsg:'Task could not be processed due to an internal service error.'
Then I dropped and recreated the DMS instance with the same name, and got a new error when creating a new DMS activity:
INFO: Creating DMS task(Type:'MigrateSqlServerSqlDbMiSync') 'DMS250-Proj4-Activity1' at '09/11/2020 16:40:57' with the migration id '250' for the following databases:
*
Write-Error: C:\Users\ling\.test\MyAzDMS-V2.ps1:2327
Line |
2327 | … onIDArray | Add-MyAzDMSTask -TaskType "MigrateSqlServerSqlDbMiSync" - …
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
| Failed to create the task 'DMS250-Proj4-Activity1'
New-AzDataMigrationTask: C:\Users\ling\.test\MyAzDMS-V2.ps1:1091
Line |
1091 | $NewTask = New-AzDMSTask @Params
| ~~~~~~~~~~~~~~~~~~~~~
| DMS API Error ScenarioMessageFailedToExecuteViaProtocolException - The message of type
| 'Microsoft.SqlServer.Fundamentals.OperationsInfrastructure.Scenarios.Agent.Contracts.Messages.StartScenarioMessageData' could not be executed via the protocol of type
| 'Microsoft.SqlServer.Fundamentals.OperationsInfrastructure.Scenarios.Agent.AgentServiceProtocol' due to an error.. FailedToSendScenarioMessageDataException - The client was unable to send scenario
| message data of type 'Microsoft.SqlServer.Fundamentals.OperationsInfrastructure.Scenarios.Agent.Contracts.Messages.StartScenarioMessageData' due to an error.. FailedToEnqueueMessageException - Failed
| to enqueue the message of type 'Microsoft.SqlServer.Fundamentals.OperationsInfrastructure.Cloud.Messaging.CloudMessage' due to an error.. UnauthorizedAccessException - 40103: Invalid authorization
| token signature, Resource:sb://sbpaz0cb63472f7866dfcbcd.servicebus.windows.net/ieykxmy94ezs2skxd46bwp3z. TrackingId:0c8ccda1-2a60-4cec-bd56-e0f6c797d310_G5S2,
| SystemTracker:sbpaz0cb63472f7866dfcbcd.servicebus.windows.net:ieykxmy94ezs2skxd46bwp3z, Timestamp:2020-09-11T20:41:19. FaultException - 40103: Invalid authorization token signature,
| Resource:sb://sbpaz0cb63472f7866dfcbcd.servicebus.windows.net/ieykxmy94ezs2skxd46bwp3z. TrackingId:0c8ccda1-2a60-4cec-bd56-e0f6c797d310_G5S2,
| SystemTracker:sbpaz0cb63472f7866dfcbcd.servicebus.windows.net:ieykxmy94ezs2skxd46bwp3z, Timestamp:2020-09-11T20:41:19
Root cause: due to previously failed migrations with the same activity names, an error is thrown when DMS tries to switch the migration state from full backup upload to log shipping start. Yes, another bug.
Solution: Drop the DMS instance and create a new one with a DIFFERENT name; see the sketch below.
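Roughly like this, assuming the classic Az.DataMigration cmdlets and reusing the same subnet; all names and values are placeholders.
# Drop the failed instance and recreate it under a NEW name (all names/values are placeholders)
# $subnet: the subnet resource ID, as in the Incident #19 sketch
Remove-AzDataMigrationService -ResourceGroupName "MyResourceGroup" -Name "DMS250" -Force
New-AzDataMigrationService -ResourceGroupName "MyResourceGroup" -Name "DMS251" -Location "EastUS" `
    -Sku "Premium_4vCores" -VirtualSubnetId $subnet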
Incident #31: the full backup files were restored, but the task was not uploading the log backup files. I tried using 3 different DMS services to migrate this DB; all encountered the same issue. I could see all the log backup files in the source shared folder, and DMS also reported all the log backup files as "Arrived", which means the files were found at the source location.
Root cause: the log chain was broken
Solution: start over by taking a new full database backup. It's better to move all the old backup files, which are now useless to DMS, out of the source folder (SMB network share) to somewhere else, to reduce the DMS activity's workload when scanning the source folder; see the sketch below.
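A small housekeeping sketch for moving the stale backups aside; the share path, file extensions, and age cutoff are placeholders, so adjust the filter to make sure the new full backup and its log backups stay in place.
# Move stale backup files out of the folder that DMS scans (path/extensions/cutoff are placeholders)
$source  = "\\FileServer\SQLBackups\MyDatabase"
$archive = "\\FileServer\SQLBackups\MyDatabase\_archive"
New-Item -ItemType Directory -Path $archive -Force | Out-Null
Get-ChildItem -Path "$source\*" -Include *.bak, *.trn -File |
    Where-Object { $_.LastWriteTime -lt (Get-Date).AddHours(-12) } |
    Move-Item -Destination $archive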
Incident #32: The database seemed to have been migrated successfully, but it was empty, with no data.
Root cause: Still under investigation.
Solution: Drop the target databases and redo the migration
Incident #33: The databases were cut over, but the migration state remained at "Cutover in progress" forever. The databases appeared online on the target SQLMI, but were completely empty, with no data.
Root cause: Still under investigation.
Solution: Drop the target databases and redo the migration
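For incidents #32 and #33, dropping the empty target databases can be scripted before redoing the migration. A sketch using Invoke-Sqlcmd from the SqlServer module; the managed instance FQDN and database names are placeholders.
# Drop the empty target databases on the SQLMI before re-running the migration (all names are placeholders)
$mi   = "mymanagedinstance.xxxxxxxx.database.windows.net"
$cred = Get-Credential
foreach ($db in @("Database1", "Database2")) {
    Invoke-Sqlcmd -ServerInstance $mi -Credential $cred -Database "master" -Query "DROP DATABASE [$db];"
}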
Incident #34: The DMS activities had not failed but were hanging while restoring databases, making no progress. I issued the cutover anyway, but nothing happened.
Root cause: the CPU usage on the DMS VM was at 100%.
Solution: Ask the Microsoft team to upgrade the DMS VM from 4 vCores to 8 vCores, or redo the migration using more DMS instances to split the workload.
Incident #35: I could see the restore history records for the database in msdb.dbo.restorehistory (see the query sketch below), and SSMS also displayed the DB as "Restoring", but the DB didn't show up at all in the Azure portal.
Root cause: unknown, or the explanation given was not convincing.
Solution: Call Microsoft SQLMI support.
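For reference, this is roughly how the restore history can be checked on the target; a sketch using Invoke-Sqlcmd from the SqlServer module, with a placeholder server name.
# List the most recent restores recorded in msdb on the target instance (server name is a placeholder)
Invoke-Sqlcmd -ServerInstance "mymanagedinstance.xxxxxxxx.database.windows.net" `
    -Credential (Get-Credential) -Database "msdb" `
    -Query "SELECT TOP (20) destination_database_name, restore_date, restore_type
            FROM dbo.restorehistory ORDER BY restore_date DESC;"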