Part 4. DRaaS – The four-click failover to VMware Cloud on AWS and back

This is the fourth and final installment of this series.
These are the steps I took to fail over and fail back multi-million dollar datacenters in as few as four clicks.
At this point we should have our SDDC deployed and linked back to our onprem environment. VR/SRM should be deployed both onprem and in the SDDC. All site pairings and mappings should be complete. Finally, our virtual machines should be replicating to the SDDC in their own protection groups with their own recovery plans.

What happens during a failover/failback
1. Failover – you execute the failover, which performs one last sync of your data before powering off the virtual machines onprem. Once the onprem VMs are powered off, SRM continues executing its script to power on the VMs in the SDDC and apply IP customization rules if you set them up.
2. Reprotect – you execute the reprotect functionality, which reverses replication so that the SDDC replicates back to onprem and converts your powered-off onprem VMs into placeholder VMs.
3. Failback – you execute the failover again, which once more performs a last sync of your data before powering off your VMs in the SDDC. Once the SDDC VMs are powered off, SRM continues executing the script to power on the VMs onprem and apply any IP customization rules you have set.
4. Reprotect – you execute the reprotect back to the SDDC, which causes the onprem VMs to replicate back to the SDDC and converts the powered-off SDDC VMs back into placeholders.
5. You are now ready to execute another failover any time in the future.
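
The cycle above behaves like a small state machine in which only one click is valid at any time. Here is a minimal Python sketch of that idea. It does not call SRM at all, and the state names, the TRANSITIONS table, and the click() and round_trip() helpers are my own hypothetical names for illustration.

```python
from enum import Enum


class PlanState(Enum):
    # Hypothetical labels for where the workload runs and what is protected.
    READY_TO_FAIL_OVER = "running onprem, replicating to the SDDC"
    RUNNING_IN_SDDC = "recovered in the SDDC, reprotect needed"
    READY_TO_FAIL_BACK = "running in the SDDC, replicating to onprem"
    RUNNING_ONPREM = "recovered onprem, reprotect needed"


# (current state, click) -> next state. Exactly one click is valid per state.
TRANSITIONS = {
    (PlanState.READY_TO_FAIL_OVER, "run"): PlanState.RUNNING_IN_SDDC,
    (PlanState.RUNNING_IN_SDDC, "reprotect"): PlanState.READY_TO_FAIL_BACK,
    (PlanState.READY_TO_FAIL_BACK, "run"): PlanState.RUNNING_ONPREM,
    (PlanState.RUNNING_ONPREM, "reprotect"): PlanState.READY_TO_FAIL_OVER,
}


def click(state, action):
    """Return the state after one click, or raise if the click is not allowed."""
    try:
        return TRANSITIONS[(state, action)]
    except KeyError:
        raise ValueError(f"'{action}' is not available while {state.value}") from None


def round_trip(start=PlanState.READY_TO_FAIL_OVER):
    """Apply the four clicks in order and return every state visited."""
    history = [start]
    for action in ("run", "reprotect", "run", "reprotect"):
        history.append(click(history[-1], action))
    return history


if __name__ == "__main__":
    for state in round_trip():
        print(state.name)
```

Calling click(PlanState.RUNNING_IN_SDDC, "run") raises a ValueError in this sketch, which mirrors why you must reprotect before you can run the plan in the other direction.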

Things to watch out for and general knowledge
1. During a Planned Migration, SRM replicates the last deltas of each VM before powering down the VMs onprem. Depending on the amount of change, this could extend the recovery process.
2. Should the planned recovery fail, you won’t be able to reprotect, even if all your SDDC VMs are powered on. Most of the time the cause is that VMware Tools didn’t load within the default timeout. Just go to the recovery plan, change the default timeout to 15 minutes, and execute the recovery plan again. It will skip any steps SRM has already completed and you will then be able to reprotect.
3. Reprotect – during the reprotect process, progress will hit 80% rather quickly but then stay there. That is because reprotect must scan every block, used or unused, at the other side to calculate which blocks have changed since the failover. An 800GB VM will need all 800GB scanned even if only 256MB has changed. Yes, this is the only thing I do not like about SRM, and yes, this is the longest part of the entire SRM process.
4. You should always run a test recovery before the actual DR test. This will spin up the VMs in the SDDC on a private (non-routable) network while your onprem VMs are still running, which allows you to identify any potential issues with replication, networking, etc.
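
To put item 3 in numbers, here is a rough Python sketch of why reprotect sits at 80% for so long. The 800GB and 256MB figures come from the example above. The 100 MB/s scan rate is purely an assumption for illustration, not a measured or documented SRM figure, and scan_hours() and changed_fraction() are hypothetical helper names.

```python
def scan_hours(disk_gb, scan_mb_per_s):
    """Hours needed to read every block of a disk at the given scan rate."""
    if scan_mb_per_s <= 0:
        raise ValueError("scan rate must be positive")
    return disk_gb * 1024 / scan_mb_per_s / 3600


def changed_fraction(changed_mb, disk_gb):
    """Fraction of the disk that actually changed since the failover."""
    return changed_mb / (disk_gb * 1024)


if __name__ == "__main__":
    disk_gb = 800          # VM size from the example above
    changed_mb = 256       # changed data from the example above
    assumed_rate = 100     # MB/s, an assumed scan rate; measure your own

    print(f"Full scan: {scan_hours(disk_gb, assumed_rate):.1f} hours")
    print(f"Changed:   {changed_fraction(changed_mb, disk_gb):.4%} of the disk")
```

The point of the sketch is that the duration depends on the size of the disk, not on how much data changed, so plan your failback window around your largest VMs.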

Executing a failover
First, go to the main page of SRM and select Recovery Plans, then select the recovery plan you wish to execute. When you are ready to fail over, start the process by clicking “Run.”

My personal recommendation is to select Planned migration if you are doing a DR test. Should anything go wrong or an error occur, you will have the opportunity to fix it before proceeding. If you select Disaster recovery, you are going to the cloud no matter what.

In my experience, a failover took between 15 and 30 minutes to complete. Once the recovery to the SDDC is complete your screen should look like the above.
Note: the option to run a failback to onprem is greyed out because you must go through the reprotect process first.

Once you have failed over to the SDDC, click Reprotect on the recovery plan page to begin reverse replication.

Once reprotect completes, the plan status returns to Ready. We are now ready to fail back to onprem whenever testing is complete. Once again, click the Run button.

Now that our virtual machines are recovered and running onprem, we need to start replicating back to the SDDC. Click Reprotect.


Now we are ready to fail over to the cloud again in the future if needed.

Summary
SRM is a great tool for automating the failover/failback process for your virtual machines. VMware Cloud on AWS makes an affordable and elastic secondary site for DR. SRM does require some up-front work to configure, but when the time comes it truly is a 4-click trip to the cloud and back. I hope everyone found this series on DRaaS useful.
