I. Introduction
The Transformer has become a dominant model in natural language processing (NLP), built on the concept of self-attention [1]–[3]. Its core idea is to use attention mechanisms to capture dependencies between different positions in an input sequence, improving both the expressiveness and the efficiency of the model. Although originally designed for NLP, the Transformer has recently proven effective for image classification as well [4]–[7]. After pre-training on large datasets such as ImageNet-21K and JFT-300M, the Vision Transformer has not only achieved state-of-the-art performance on the large-scale ImageNet benchmark [8] but has also demonstrated strong feature extraction from large-scale data.
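To make the attention mechanism referenced above concrete, the following is a minimal NumPy sketch of scaled dot-product self-attention, the operation at the heart of the Transformer [1]. It is an illustration written for this discussion, not code from the cited works; the function name and the toy dimensions are our own choices.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over one sequence.

    X:           (seq_len, d_model) input embeddings
    Wq, Wk, Wv:  (d_model, d_k) learned projection matrices
    """
    Q = X @ Wq  # queries, one per position
    K = X @ Wk  # keys
    V = X @ Wv  # values
    # Pairwise affinities between positions, scaled by sqrt(d_k)
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # Softmax over positions: each row becomes a weighting of the sequence
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output position is a weighted mixture of values from all positions
    return weights @ V

# Toy usage: 4 tokens with 8-dimensional embeddings
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))
Wq, Wk, Wv = (rng.standard_normal((8, 8)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (4, 8)
```

Because every output position is computed from all input positions in a single step, the mechanism captures long-range dependencies directly, which is the property the Vision Transformer exploits when treating image patches as a sequence.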