I. Introduction
In recent years, Transformer-based deep neural networks (DNNs) [1] have achieved impressive performance in various fields, such as natural language processing (NLP) [2], [3], [4], [5], computer vision (CV) [6], [7], [8], and speech processing [9], [10], [11]. Prevailing Transformer-based models are usually pretrained on large-scale unlabeled data and then fine-tuned on downstream tasks, rather than trained from scratch. The pretraining and fine-tuning of Transformers are typically carried out on high-end graphics processing unit (GPU) platforms, which offer powerful computing capacity and ample energy supply. However, as the application scope of Transformers expands, it becomes necessary to retrain models on edge platforms [12]. Because the data distribution in real application scenarios may differ from that of the training dataset, the model needs to be adapted to personal data. Uploading private data to servers for retraining incurs significant latency and raises privacy risks. Therefore, efficient training of Transformers on edge devices is attracting continuous attention.