
Characterizing and Modeling Distributed Training with Transient Cloud GPU Servers



Abstract:

Cloud GPU servers have become the de facto way for deep learning practitioners to train complex models on large-scale datasets. However, it is challenging to determine the appropriate cluster configuration (e.g., server type and number) for different training workloads while balancing the trade-offs in training time, cost, and model accuracy. Adding to the complexity is the potential to reduce monetary cost by using cheaper, but revocable, transient GPU servers. In this work, we analyze distributed training performance under diverse cluster configurations using CM-DARE, a cloud-based measurement and training framework. Our empirical datasets include measurements from three GPU types, six geographic regions, twenty convolutional neural networks, and thousands of Google Cloud servers. We also demonstrate the feasibility of predicting training speed and overhead using regression-based models. Finally, we discuss potential use cases of our performance modeling, such as detecting and mitigating performance bottlenecks.
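To illustrate the kind of regression-based performance prediction the abstract mentions, the sketch below fits a simple linear model of training throughput versus worker count. This is a hypothetical minimal example, not the paper's actual model; the measurement values are placeholders for demonstration only.

```python
import numpy as np

# Illustrative (made-up) measurements: worker count vs. observed
# training throughput in images/sec.
workers = np.array([1, 2, 4, 8], dtype=float)
throughput = np.array([210.0, 400.0, 760.0, 1400.0])

# Least-squares fit of throughput ~= a * workers + b.
A = np.vstack([workers, np.ones_like(workers)]).T
(a, b), *_ = np.linalg.lstsq(A, throughput, rcond=None)

def predict_throughput(n_workers: float) -> float:
    """Predict images/sec for a hypothetical cluster size."""
    return a * n_workers + b
```

In practice, a model like this could be extended with additional features (GPU type, batch size, network bandwidth) and used to choose a cluster size that meets a training-time budget.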
Date of Conference: 29 November 2020 - 01 December 2020
Date Added to IEEE Xplore: 23 February 2021
Conference Location: Singapore, Singapore


I. Introduction

The process of training deep neural networks (DNNs) has evolved from using single-GPU servers [1] to distributed GPU clusters [2], [3] that can support larger and more complex DNNs. Cloud computing, which provides on-demand access to these critical yet expensive GPU resources, has become a popular option for practitioners. Today's clouds offer customers abundant options for configuring training clusters, presenting opportunities to tailor resource acquisition to the specific training workload. When using cloud-based GPU servers to train deep learning models, one can choose the server's CPU and memory, specify the GPU type, decide the number of servers, and pick the desired datacenter location. However, this configuration flexibility also imposes additional complexity upon deep learning practitioners.
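The configuration space described above can be made concrete with a small sketch. The example below enumerates a slice of that space (GPU type, server count, region, on-demand vs. transient pricing) and ranks configurations by hourly cost. All names and prices are illustrative assumptions, not real Google Cloud rates.

```python
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class ClusterConfig:
    gpu_type: str      # e.g., "k80", "p100", "v100"
    num_servers: int
    region: str
    transient: bool    # revocable (cheaper) vs. on-demand

# Assumed placeholder $/GPU-hour rates, for illustration only.
HOURLY_RATE = {"k80": 0.45, "p100": 1.46, "v100": 2.48}
# Assumed discount: transient servers cost ~30% of on-demand price.
TRANSIENT_FACTOR = 0.3

def hourly_cost(cfg: ClusterConfig) -> float:
    """Estimated cluster cost per hour under the assumed rates."""
    rate = HOURLY_RATE[cfg.gpu_type] * cfg.num_servers
    return rate * TRANSIENT_FACTOR if cfg.transient else rate

# Enumerate a small slice of the configuration space.
configs = [ClusterConfig(g, n, "us-central1", t)
           for g, n, t in product(HOURLY_RATE, [4, 8], [False, True])]
cheapest = min(configs, key=hourly_cost)
```

Cost alone is not sufficient, of course; the trade-off against training time and revocation risk is precisely what makes the configuration problem hard.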

