I. Introduction
In the realm of computer vision, vision transformers (ViTs) [1] have emerged as a new fundamental backbone, significantly challenging convolutional neural networks (CNNs). By leveraging the multi-head self-attention (MHSA) mechanism to capture long-range dependencies, ViTs exhibit strong and flexible representation capacity, leading to impressive progress across a variety of vision tasks [2], [3], [4], [5], [6], [7]. However, this great power comes with considerable complexity. The intricate architecture and large number of parameters of ViTs result in high computational and memory demands. As a result, deploying ViTs in resource-constrained environments such as mobile phones remains a major challenge [8], [9], [10], [11], [12], [13], [14].