Remote Restart for a High Performance Virtual Machine Recovery in a Cloud
Abstract
In this paper, we present a scalable, parallel virtual machine (VM) planning and failover method that enables high availability at the VM level in a data center. The solution is implemented and used in IBM's CMS enterprise private cloud as a high-availability feature for efficient failover in large data centers with large numbers of servers, VMs, and disks. The restart system we introduce plans and executes failover dynamically, at failover time, and keeps the recovery time within the time budget allowed by the service level agreement (SLA). Relative to the initial serial implementation, failover time is reduced by a factor of up to 11 with parallel failover, and by a factor of up to 44 with parallel failover combined with parallel storage mapping. As part of our future work, we plan to explore the applicability of this planning and failover solution to disaster recovery.