1. Introduction
Convolutional neural networks (CNNs) [19] have been the de facto standard architecture for computer vision models across different applications for years. AlexNet [18] demonstrated their effectiveness on ImageNet [10], and many others followed suit with architectures such as VGG [26], ResNet [17], and EfficientNet [27]. Transformers [31], on the other hand, were originally proposed as attention-based models for natural language processing (NLP), designed to exploit the sequential structure of language. They were the basis upon which BERT [11] and GPT [2, 23, 24] were built, and they remain the state-of-the-art architecture in NLP.