I. Introduction
The Transformer model [1] has revolutionized deep learning, particularly in Natural Language Processing [2] and Computer Vision [3], by addressing the limitations of Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs). It alleviates the weak parallelism and long-range dependency problems of RNNs and is not constrained by the limited receptive field of CNNs.

Despite its popularity and the emergence of numerous variants, the model still faces challenges in computational efficiency and memory management, stemming from its heavy computation and from variable-length inputs. The computation required by Transformer models is significantly higher than that of traditional models, which can slow research progress where computing resources are limited. In addition, variable-length inputs are typically converted into multiple fixed-length inputs, which introduces redundant computation; this degrades performance and complicates memory management.
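To make the overhead of this conversion concrete, the following minimal Python sketch illustrates one common case: padding every sequence in a batch to the length of the longest one. The sketch is ours, not taken from the paper or any framework; the function name, token IDs, and batch are purely illustrative. Unless the padded positions are masked out, they are still carried through the attention and feed-forward layers, which is the redundant computation referred to above.

```python
# Illustrative sketch (not from the paper): pad variable-length token
# sequences to a single fixed length, as is commonly done when batching
# Transformer inputs, and measure how many positions are padding.

def pad_batch(sequences, pad_id=0):
    """Pad each sequence to the length of the longest one in the batch."""
    max_len = max(len(seq) for seq in sequences)
    padded = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    return padded, max_len

if __name__ == "__main__":
    # Hypothetical batch of tokenized inputs with very different lengths.
    batch = [[11, 42, 7], [5, 9, 23, 88, 14, 2, 61, 30], [17, 3]]
    padded, max_len = pad_batch(batch)

    real = sum(len(seq) for seq in batch)
    total = len(batch) * max_len
    print(f"padded length: {max_len}")
    print(f"wasted positions: {total - real} of {total} "
          f"({100 * (total - real) / total:.0f}%)")
```

For this toy batch, roughly half of the computed positions are padding, which also inflates the memory footprint of the batch by the same proportion.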