Google Cloud VM Live Migration

Introduction

Heartbleed bug was revealed on April 7th, 2014. On that day, most cloud customers were impacted because patching the system requires VM reboot. At Google, none of the customers were impacted due to the transparent maintenance functionality introduced in Google Compute Engine in Dec 2013.

Through a combination of datacenter topology innovations and live migration technology, they can move their customer running VMs out of the way of planned hardware and software maintenance events, so they keep the infrastructure protected and reliable — without customers’ VMs, applications or workloads noticing that anything happened.

VM Migration Procedure 

The high-level steps are illustrated in the following

  • The process begins with a notification that VMs need to be evicted from their current host machine. The notification might start with a file change (e.g., a release engineer indicating that a new BIOS is available), Hardware Operations scheduling maintenance, an automatic signal from an impending hardware failure etc. 
  • Once a VM is selected for migration, we provide a notification to the guest that a migration is imminent. After a waiting period, a target host is selected and the host is asked to set up a new, empty “target” VM to receive the migrating “source” VM. Authentication is used to establish a connection between the source and target. 
  • There are three stages involved in the VM’s migration
    • During pre-migration brownout, the VM is still executing on the source, while most state is sent from the source to the target. For instance, we copy all the guest memory to the target, while tracking the pages that have been re-dirtied on the source. The time spent in pre-migration brownout is a function of the size of the guest memory and the rate at which pages are being dirtied.
    • During blackout, which is a very brief moment when the VM is not running anywhere, it is paused, and all the remaining state required to being running the VM on the target is sent. 
    • During post-migration brownout, the VM executes on the target. The source VM is present, and may be providing supporting functionality for the target. For instance, until the network fabric has caught up the new location of the VM, and source VM provides forwarding services for packets to and from the target VM
  • Finally, the migration is completed, and the system deletes the source VM.

    Reference
    [1] Google Compute Engine Users Live Migration Technology to service infrastructure without application downtime 

    Leave a Reply