I. Introduction
In recent years, Transformer-based deep neural networks (DNNs) [1] have achieved impressive performance in various fields, such as natural language processing (NLP) [2], [3], [4], [5], computer vision (CV) [6], [7], [8], and speech processing [9], [10], [11]. Prevailing Transformer-based models are usually pretrained on large-scale unlabeled data and then fine-tuned on downstream tasks, rather than trained from scratch. The pretraining and fine-tuning of Transformers are typically carried out on high-end graphics processing unit (GPU) platforms, which offer powerful computing capacity and ample energy supply. However, as the application scope of Transformers expands, it becomes necessary to retrain models on edge platforms [12]. Because the data distribution in real application scenarios may differ from that of the training dataset, the model needs to be adapted to personal data. Uploading private data to servers for retraining incurs significant latency and raises privacy risks. Therefore, efficient training of Transformers on edge devices is attracting continuous attention.