AI Gauge: Runtime estimation for deep learning in the cloud
Abstract
Major cloud providers, including IBM Cloud, Amazon Web Services, Microsoft Azure, and Google Cloud, offer services to train, debug, store, and deploy machine learning models at scale. Estimating the runtime of training a machine learning model is important for an enhanced user experience in SLA-driven control, cost-effective budgeting, elastic scaling, and efficient operations. We present AI Gauge, a cloud service that estimates the runtime and cost of training deep learning models under different configuration options on the cloud. AI Gauge is designed using a microservice architecture and performs estimations with machine learning models calibrated on an extensive and continuously populated job trace dataset. We show that AI Gauge can accurately predict the remaining time of a running job based on its runtime progress (less than 10% relative error) and can predict the total runtime of a job before it starts with 7-8% relative error on average.