I. Introduction
The process of training deep neural networks (DNNs) has evolved from single-GPU servers [1] to distributed GPU clusters [2], [3] that can support larger and more complex DNNs. Cloud computing, which provides on-demand access to these critical yet expensive GPU resources, has become a popular option for practitioners. Today’s clouds offer customers abundant options for configuring training clusters, presenting opportunities to tailor resource acquisition to the specific training workload. When using cloud-based GPU servers to train deep learning models, one can choose the server’s CPU and memory, specify the GPU type, decide on the number of servers, and pick the desired datacenter location. However, this configuration flexibility also imposes additional complexity on deep learning practitioners.
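To make these configuration knobs concrete, the following is a minimal sketch (not from this paper) of how they surface in practice, assuming AWS EC2 via the boto3 SDK; the instance type, AMI ID, region, and cluster size below are illustrative placeholders rather than recommendations.

```python
# Illustrative sketch of the configuration dimensions described above,
# assuming AWS EC2 via boto3. All identifiers are hypothetical examples.
import boto3

# Datacenter location: chosen via the region of the EC2 client.
ec2 = boto3.client("ec2", region_name="us-west-2")

# CPU, memory, and GPU type are bundled into the instance type
# (e.g., "p3.2xlarge" pairs one NVIDIA V100 with a fixed vCPU/memory
# allotment); the number of servers is set via MinCount/MaxCount.
response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",  # hypothetical deep learning AMI
    InstanceType="p3.2xlarge",        # fixes GPU type, vCPUs, and memory
    MinCount=4,                       # size of the training cluster
    MaxCount=4,
)
instance_ids = [i["InstanceId"] for i in response["Instances"]]
print("Launched training cluster:", instance_ids)
```

Note that in this setting several dimensions are coupled: picking an instance type simultaneously fixes the GPU model, CPU, and memory, which is part of why navigating the configuration space is nontrivial.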