1. INTRODUCTION
Originating from natural language processing (NLP) tasks [1], transformer-based models have achieved remarkable performance and outperformed convolutional neural networks (CNNs) on various computer vision (CV) tasks [2]-[6]. However, vision transformers (ViTs) incur heavier memory and computational costs than CNNs. For example, ViT-L [2] contains 307 M parameters and requires 64 G FLOPs. Such overheads prevent ViTs from running on resource-constrained edge devices, limiting their real-world applications. Consequently, model compression for ViTs has become an urgent problem.
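As a rough, hedged check of the scale figures quoted above, the following Python sketch counts the parameters of a ViT-L model and estimates its FP32 weight footprint. It assumes PyTorch and the timm library are available; "vit_large_patch16_224" is timm's identifier for ViT-L/16, and the exact count may differ slightly from the 307 M figure cited in [2] depending on the classification head and input resolution.

```python
# Back-of-the-envelope estimate of ViT-L's size (illustrative sketch, not from the cited works).
# Assumes PyTorch and timm are installed; "vit_large_patch16_224" is timm's ViT-L/16 variant.
import timm

model = timm.create_model("vit_large_patch16_224", pretrained=False)

num_params = sum(p.numel() for p in model.parameters())   # total trainable + non-trainable parameters
fp32_bytes = num_params * 4                                # 4 bytes per FP32 weight

print(f"parameters: {num_params / 1e6:.1f} M")             # roughly 300 M for ViT-L/16
print(f"FP32 weight memory: {fp32_bytes / 2**30:.2f} GiB") # about 1.2 GiB for weights alone, before activations
```

Even before counting activations or optimizer state, the weights alone occupy on the order of a gigabyte in FP32, which illustrates why deployment on memory-limited edge devices is impractical without compression.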