1. Introduction
Recent advances in Foundation Models (FM) [3], both in Natural Language Processing (NLP) [38], [42], [45] and Computer Vision (CV) [17], [27], [40], have substantially improved the predictive performance of deep learning models. These advances, however, come at a cost in computational requirements and memory resources. The current baseline architecture for FMs is the transformer, built around the attention mechanism [48]. Initially designed for NLP, transformers have been adapted to CV, giving rise to Vision Transformers (ViT) [17]. ViTs are encoder-only models that are typically pre-trained in a self-supervised manner on large amounts of data and later adapted to downstream tasks such as image classification, object detection, or instance segmentation.

Similar to large language models [45], ViTs come in different sizes depending on the number of layers and, consequently, of parameters, which ranges from millions to 22 billion [12]. The largest models therefore require a dedicated accelerator with sufficient memory to process the data. The large size of ViTs makes them good candidates for compression methods such as quantization, but outliers in intermediate activations pose a challenge [4], [11]. Quantization is a compression technique that reduces the number of bits used to represent data and computation, converting them from "continuous" (floating point) to discrete (integer) values. Integer 8-bit (INT8) inference is faster and more energy-efficient than its floating-point counterpart, but the limited range in which values can be represented makes it susceptible to quantization errors during computation that may degrade the final accuracy of the deep learning model [20], [30].
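To make the quantization step and the effect of activation outliers concrete, the sketch below shows asymmetric per-tensor INT8 quantization in NumPy. It is a minimal illustration under simple assumptions: the function names, the min/max calibration, and the synthetic outlier are illustrative choices, not the specific scheme evaluated in this work.

```python
import numpy as np


def quantize_int8(x: np.ndarray):
    """Asymmetric per-tensor quantization of a float tensor to INT8."""
    qmin, qmax = -128, 127
    # The scale maps the observed floating-point range onto the 8-bit integer range.
    scale = (x.max() - x.min()) / (qmax - qmin)
    zero_point = int(round(qmin - x.min() / scale))
    x_q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.int8)
    return x_q, scale, zero_point


def dequantize(x_q: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    """Map INT8 values back to floating point; the difference from the
    original tensor is the quantization error."""
    return (x_q.astype(np.float32) - zero_point) * scale


# A single large-magnitude outlier (here an assumed value of 60.0) stretches the
# scale and reduces the resolution left for the remaining activations.
activations = np.concatenate([np.random.randn(1000), [60.0]]).astype(np.float32)
x_q, scale, zp = quantize_int8(activations)
error = np.abs(dequantize(x_q, scale, zp) - activations).mean()
print(f"scale={scale:.4f}, zero_point={zp}, mean abs error={error:.4f}")
```

Running the sketch with and without the outlier shows how a wider calibration range inflates the quantization error on the bulk of the values, which is the behaviour that activation outliers induce in quantized ViTs.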