Conferences >SC21: International Conferenc...

Chimera: Efficiently Training Large-Scale Neural Networks with Bidirectional Pipelines

Download PDF
Download References
Request Permissions
Save to
Alerts

Abstract:

Training large deep learning models at scale is very challenging. This paper proposes Chimera, a novel pipeline parallelism scheme which combines bidirectional pipelines ...Show More

Metadata

Abstract:

Training large deep learning models at scale is very challenging. This paper proposes Chimera, a novel pipeline parallelism scheme which combines bidirectional pipelines for efficiently training large-scale models. Chimera is a synchronous approach and therefore no loss of accuracy, which is more convergence-friendly than asynchro-nous approaches. Compared with the latest synchronous pipeline approach, Chimera reduces the number of bubbles by up to 50%; ben-efiting from the sophisticated scheduling of bidirectional pipelines, Chimera has a more balanced activation memory consumption. Evaluations are conducted on Transformer based language models. For a GPT-2 model with 1.3 billion parameters running on 2,048 GPU nodes of the Piz Daint supercomputer, Chimera improves the training throughput by 1.16x-2.34x over the state-of-the-art synchronous and asynchronous pipeline approaches.

Published in: SC21: International Conference for High Performance Computing, Networking, Storage and Analysis

Date of Conference: 14-19 November 2021

Date Added to IEEE Xplore: 18 October 2022

ISBN Information:

ISSN Information:

DOI: 10.1145/3458817.3476145

Conference Location: St. Louis, MO, USA

Funding Agency:

Contents

1 Introduction

Deep learning is continuing to deliver groundbreaking results on the path towards human-level intelligence. This path is character-ized by growing model size, in just six years, the compute requirements for model training grew by 300,000 times [3]. Transform-ers [54] are a typical representative in this trend. As the model size grows, Transformer based models have proven their success in in the field of natural language processing [13, 43, 43, 54]. Recent work [8-10, 14] shows that Transformers also achieve promising re-sults in computer vision tasks, i.e., on par or better than other types of models such as convolutional [31] and recurrent [21] networks. These growing models must be trained on distributed accelera-tor supercomputers. Even today's models are too big to be stored on a single accelerator-for example, GPT-3's 175 billion parame-ters [7] require 350 GiB main memory if stored with 16 bits precision. Switch transformers [17] have in their largest configuration 1.6 tril-lion parameters, a 6.4 TiB storage requirements. Furthermore, the necessary memory for activations, gradients, and optimizer state during training at least triples these memory requirements.

References is not available for this document.

Chimera: Efficiently Training Large-Scale Neural Networks with Bidirectional Pipelines

Abstract:

Metadata

Abstract:

ISSN Information:

Funding Agency:

1 Introduction

References

IEEE Account

Purchase Details

Profile Information

Need Help?

Chimera: Efficiently Training Large-Scale Neural Networks with Bidirectional Pipelines

Alerts

Abstract:

Metadata

Abstract:

ISSN Information:

Funding Agency:

1 Introduction

References