Abstract:
Cloud GPU servers have become the de facto way for deep learning practitioners to train complex models on large-scale datasets. However, it is challenging to determine the appropriate cluster configuration (e.g., server type and number) for different training workloads while balancing the trade-offs in training time, cost, and model accuracy. Adding to the complexity is the potential to reduce the monetary cost by using cheaper, but revocable, transient GPU servers. In this work, we analyze distributed training performance under diverse cluster configurations using CM-DARE, a cloud-based measurement and training framework. Our empirical datasets include measurements from three GPU types, six geographic regions, twenty convolutional neural networks, and thousands of Google Cloud servers. We also demonstrate the feasibility of predicting training speed and overhead using regression-based models. Finally, we discuss potential use cases of our performance modeling, such as detecting and mitigating performance bottlenecks.
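The abstract and index terms point to regression-based prediction of training speed, including support vector regression with a polynomial kernel evaluated by mean absolute percentage error. The snippet below is a minimal, illustrative sketch of that kind of model, not the paper's actual pipeline: the feature names and the synthetic data are assumptions made for illustration.

```python
# Sketch: predicting distributed training speed from cluster-configuration
# features with support vector regression (polynomial kernel), evaluated by
# mean absolute percentage error. Features and data are hypothetical.
import numpy as np
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)

# Hypothetical per-configuration features:
# [number of worker GPUs, GPU peak TFLOPS, per-worker batch size]
X = rng.uniform(low=[1, 5, 32], high=[8, 120, 256], size=(200, 3))

# Synthetic stand-in for measured throughput (images/sec): roughly
# proportional to aggregate compute, plus measurement noise.
y = 50 + 1.5 * X[:, 0] * X[:, 1] + 0.2 * X[:, 2] + rng.normal(0, 10, size=200)

# SVR with a polynomial kernel; features are standardized first.
model = make_pipeline(StandardScaler(), SVR(kernel="poly", degree=2, C=100.0))
model.fit(X[:150], y[:150])

# Evaluate with mean absolute percentage error on held-out configurations.
pred = model.predict(X[150:])
mape = np.mean(np.abs((y[150:] - pred) / y[150:])) * 100
print(f"Held-out MAPE: {mape:.1f}%")
```

In practice, the feature set would come from measured cluster properties (GPU type, number of workers, parameter-server placement, and so on) rather than the synthetic values used here.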
Date of Conference: 29 November 2020 - 01 December 2020
Date Added to IEEE Xplore: 23 February 2021
Index Terms:
- GPU Server
- Deep Learning
- Complex Models
- Convolutional Neural Network
- Training Time
- Cloud Computing
- Training Performance
- Training Speed
- Training Framework
- Measurement Framework
- Performance Bottleneck
- Regression-based Models
- Cluster Configuration
- Potential Use Cases
- Regression Model
- Mean Absolute Error
- TensorFlow
- Convolutional Neural Network Model
- Parameter Server
- Checkpointing
- Support Vector Regression Model
- Mean Absolute Percentage Error
- Powerful GPU
- Start-up Time
- Index File
- Polynomial Kernel
- Support Vector Regression
- GB Memory