Live Migration

1. Live Migration Workflow

  • Verify the storage backend is appropriate for the migration type
    • Perform a shared storage check for normal migrations
    • For block migrations, verify the inverse: the storage must not be shared
    • Checks are run on both the source and destination, orchestrated via RPC calls from the scheduler
  • On the destination
    • Create the necessary volume connections
    • If block migration, create the instance directory, populate missing backing files from Glance and create empty instance disks
  • On the source
    • Initiate the actual live migration
  • Upon completion
    • Generate the Libvirt XML and define it on the destination
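
The workflow above can be sketched as a small planning function. This is only an illustration of the ordering and of the storage pre-check, not Nova's actual code: `MigrationRequest`, `PreCheckError`, `check_storage` and `plan_migration` are hypothetical names, and the steps are plain strings rather than real RPC calls.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MigrationRequest:
    """Hypothetical container for the facts the pre-checks need."""
    block_migration: bool
    shared_storage: bool  # True if source and destination share the instance path


class PreCheckError(Exception):
    """Raised when the storage backend does not match the migration type."""


def check_storage(request: MigrationRequest) -> None:
    # Normal live migration needs shared storage; block migration needs the inverse.
    if request.block_migration and request.shared_storage:
        raise PreCheckError("block migration requested but storage is shared")
    if not request.block_migration and not request.shared_storage:
        raise PreCheckError("live migration requested but storage is not shared")


def plan_migration(request: MigrationRequest) -> List[str]:
    """Return the ordered steps from the outline, after the storage check."""
    check_storage(request)
    steps = ["destination: create volume connections"]
    if request.block_migration:
        steps += [
            "destination: create instance directory",
            "destination: populate missing backing files from Glance",
            "destination: create empty instance disks",
        ]
    steps.append("source: initiate live migration")
    steps.append("destination: generate Libvirt XML and define it")
    return steps


if __name__ == "__main__":
    for step in plan_migration(MigrationRequest(block_migration=True, shared_storage=False)):
        print(step)
```

Failing fast in `check_storage` mirrors the point of the outline: a mismatch between storage backend and migration type should be caught before anything is created on the destination.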

    2. Migrations

    • Why migration
      • Operations
        • Key to performing non-disruptive work
        • Re-balancing workloads and resources
      • Expectations versus reality
        • Special snowflakes
        • Ephemeral instances and the “cloud” way
    • Types of migration
      • Migrate
        • Completely “cold”, Libvirt does almost nothing
        • Shares a code path with “resize”
        • Extremely brittle (uses SSH and copies files around)
      • Live migration
        • Orchestrated almost entirely by Libvirt (via DomainMigrateToURI)
      • Block migration
        • Similar code path to live migration
        • Riskier and more brittle (disks are moving along with state)

    3. Live Migrations

    • Nova offloads capabilities comparisons to Libvirt
      • The API equivalent of `virsh capabilities` is run by the scheduler on the source and destination
    • Nova live migration
      • Important config options
        • live_migration_flag += VIR_MIGRATE_LIVE
        • block_migration_flag += VIR_MIGRATE_LIVE
      • Standardized virtual CPU flags
        • libvirt_cpu_mode = custom
        • libvirt_cpu_model = cpu64-rhel6
      • “Max Downtime” (not currently tunable)
        • Look for upstream patches soon
        • QEMU keeps iterating until the cutover can be completed within “30” milliseconds
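
To see why the “max downtime” value matters, here is a deliberately simplified model of pre-copy convergence. It is not QEMU's actual algorithm: the function name `precopy_rounds`, the constant bandwidth and dirty rate, and the example numbers are all assumptions made for illustration. Only the 30 ms default comes from the notes above.

```python
from typing import Optional


def precopy_rounds(ram_mb: float,
                   bandwidth_mb_s: float,
                   dirty_rate_mb_s: float,
                   max_downtime_ms: float = 30.0,
                   max_rounds: int = 100) -> Optional[int]:
    """Count pre-copy rounds needed before the final cutover fits in max_downtime_ms.

    Returns None if the migration does not converge within max_rounds.
    """
    remaining_mb = ram_mb  # the first round has to copy all guest memory
    for completed_rounds in range(max_rounds):
        # How long the guest would be paused if we cut over right now.
        downtime_ms = remaining_mb / bandwidth_mb_s * 1000.0
        if downtime_ms <= max_downtime_ms:
            return completed_rounds
        # Copy what is outstanding; the guest dirties more memory meanwhile.
        transfer_s = remaining_mb / bandwidth_mb_s
        remaining_mb = min(ram_mb, dirty_rate_mb_s * transfer_s)
    return None


if __name__ == "__main__":
    # A quiet guest converges after a few rounds.
    print(precopy_rounds(ram_mb=4096, bandwidth_mb_s=1000, dirty_rate_mb_s=100))
    # A guest that dirties memory faster than the link can copy it never converges.
    print(precopy_rounds(ram_mb=4096, bandwidth_mb_s=1000, dirty_rate_mb_s=1200))
```

In this model, a guest whose dirty rate approaches or exceeds the available bandwidth never reaches the cutover threshold, while raising `max_downtime_ms` reduces the number of rounds required.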

    4. Brittle Operations

    • Any long running, synchronous tasks
      • All migrations (memory sync, disk sync, etc.)
    • No graceful way to stop services
    • Most prone to failure
      • Migrate and resize
      • Live migration (block or otherwise)
      • Instance snapshot

    5. Recovering from failures

    • Always investigate before forcing actions
      • Look at the logs for exceptions
      • Check whether an instance is running on multiple hypervisors
      • `nova reset-state --active` and `nova reboot --hard` can go a long way
    • Sometimes, brute force is going to be required
      • `kill -9` the QEMU or KVM processes
      • Alter the database records, commonly `host`
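
The “running on multiple hypervisors” check can be scripted once the list of running domains has been collected from each compute node. The sketch below assumes that collection has already happened (for example by running `virsh list --uuid` on every host) and only shows the comparison step. The function name, host names and UUIDs are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def find_duplicate_instances(running: Dict[str, Iterable[str]]) -> Dict[str, List[str]]:
    """Map each instance UUID seen on more than one hypervisor to those hosts.

    `running` maps a hypervisor host name to the UUIDs of domains running there.
    """
    hosts_by_instance = defaultdict(set)
    for host, uuids in running.items():
        for uuid in uuids:
            hosts_by_instance[uuid].add(host)
    return {
        uuid: sorted(hosts)
        for uuid, hosts in hosts_by_instance.items()
        if len(hosts) > 1
    }


if __name__ == "__main__":
    # Made-up inventory: instance "aaaa" is running on two hosts at once.
    inventory = {
        "compute-01": ["aaaa", "bbbb"],
        "compute-02": ["aaaa", "cccc"],
        "compute-03": ["dddd"],
    }
    for uuid, hosts in find_duplicate_instances(inventory).items():
        print(f"{uuid} is running on: {', '.join(hosts)}")
```

Any UUID reported here deserves investigation before resetting state or editing the `host` column, since one of the two copies has to be stopped first.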

    6. “Stuck” Live Migrations

    • Live migrations can get stuck
    • Instances left in a paused state on both ends
      • Monitor socket is unresponsive, Libvirt is helpless
    • Generally a result of an overly aggressive “max downtime” and rapidly changing memory state (e.g., JVM)
    • Can be a result of a QEMU issue/bug
      • managedSave (suspend) will generally be prone to the same problem
