Automated Data Management and Learning-based Scheduling for Ray-based Hybrid HPC-Cloud Systems
Abstract
HPC-Cloud hybrid systems are gaining popularity among scientists because they can absorb sudden demand spikes, shortening turnaround times for HPC workloads. However, deploying workloads on such systems currently requires complicated configuration, particularly for data migration between HPC clusters and the Cloud. Additionally, existing schedulers lack support for scheduling workloads across such hybrid systems. To address these issues, we have designed and implemented an HPC-Cloud bursting system based on Ray, an open-source distributed framework. Our system integrates automated data management with learning-based scheduling at the function level, using a dynamic label-based design. It automatically prefetches data files based on demand and detects data movement and execution patterns to inform future scheduling decisions. We evaluate the framework with two workloads, machine learning model training and image processing, and compare its performance against naive data fetching under various network speeds and storage locations. Results indicate that our system is effective across all scenarios. The system is open source, and the source code and replication packages for reproducing the experimental results are provided.
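To make the idea of label-based, function-level placement with demand-driven prefetching more concrete, the following is a minimal sketch in plain Python. It is not the system's implementation and does not use the Ray API. The names `Node`, `Task`, and `Scheduler`, the `site` label, the bandwidth figures, and the cost model (estimated transfer time plus the mean of previously observed runtimes, with unseen function/site pairs assumed to cost zero) are all illustrative assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    labels: dict                 # hypothetical labels, e.g. {"site": "hpc"}
    bandwidth_mb_s: float        # assumed speed for pulling files to this node
    cached: set = field(default_factory=set)  # files already local to the node


@dataclass
class Task:
    func_name: str
    files: dict                  # file name -> size in MB


class Scheduler:
    """Toy scheduler: picks the node with the lowest estimated total cost."""

    def __init__(self, nodes):
        self.nodes = nodes
        self.history = {}        # (func_name, site) -> (mean runtime in s, samples)

    def transfer_cost(self, task, node):
        # Seconds needed to fetch the files the node does not hold yet.
        missing_mb = sum(s for f, s in task.files.items() if f not in node.cached)
        return missing_mb / node.bandwidth_mb_s

    def expected_runtime(self, task, node):
        # Assumption: a function never seen on a site is estimated at 0 s.
        mean, _ = self.history.get((task.func_name, node.labels["site"]), (0.0, 0))
        return mean

    def place(self, task):
        node = min(
            self.nodes,
            key=lambda n: self.transfer_cost(task, n) + self.expected_runtime(task, n),
        )
        node.cached.update(task.files)  # simulate prefetching the task's files
        return node

    def record(self, task, node, runtime_s):
        # Update the running mean of observed runtimes for this function and site.
        key = (task.func_name, node.labels["site"])
        mean, n = self.history.get(key, (0.0, 0))
        self.history[key] = ((mean * n + runtime_s) / (n + 1), n + 1)


hpc = Node("hpc-0", {"site": "hpc"}, bandwidth_mb_s=100.0, cached={"train.bin"})
cloud = Node("cloud-0", {"site": "cloud"}, bandwidth_mb_s=10.0)
sched = Scheduler([hpc, cloud])
task = Task("train", {"train.bin": 500.0})

first = sched.place(task)        # data is local to the HPC node, so it is chosen
sched.record(task, first, 120.0)
sched.record(task, first, 80.0)
second = sched.place(task)       # the learned HPC runtime now outweighs the transfer
```

In this toy run the first call is placed where the data already lives. Once the observed runtimes on the HPC node exceed the estimated transfer time to the cloud node, the next call bursts to the cloud and its input file is prefetched there. A real deployment would replace the simulated cache with actual file transfers and the running mean with whatever learned model the system uses.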