
QAttn: Efficient GPU Kernels for mixed-precision Vision Transformers


Abstract:

Vision Transformers have demonstrated outstanding performance in Computer Vision tasks. Nevertheless, this superior performance for large models comes at the expense of increasing memory usage for storing the parameters and intermediate activations. To accelerate model inference, in this work we develop and evaluate integer and mixed-precision kernels in Triton for the efficient execution of two fundamental building blocks of transformers, the linear layer and attention, on graphics processing units (GPUs). On an NVIDIA A100 GPU, our kernel implementations of Vision Transformers achieve a throughput speedup of up to 7x compared with reference kernels in PyTorch floating-point single precision (FP32). Additionally, the top-1 accuracy of the ViT Large model drops by less than one percent on the ImageNet1K classification task. We also observe up to 6x higher throughput when applying our kernels to the Segment Anything Model image encoder, while keeping the mIoU close to the FP32 reference on the COCO2017 dataset for both static and dynamic quantization. Furthermore, our kernels demonstrate improved speed compared with the TensorRT INT8 linear layer, and we improve the throughput of the baseline FP16 (half-precision) Triton attention by up to 19 ± 4.01% on average. We have open-sourced the QAttn framework, which is tightly integrated with the PyTorch quantization workflow: https://github.com/IBM/qattn.
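To illustrate the kind of integer kernel the abstract refers to, the following is a minimal sketch only, not the QAttn kernels: a symmetric per-tensor INT8 matmul written in Triton, accumulating in INT32 and dequantizing to FP32 on output. The kernel name, block sizes, and the `int8_matmul` wrapper are hypothetical choices made for this sketch.

```python
import torch
import triton
import triton.language as tl


@triton.jit
def int8_matmul_kernel(
    a_ptr, b_ptr, c_ptr, M, N, K, scale,
    stride_am, stride_ak, stride_bk, stride_bn, stride_cm, stride_cn,
    BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr,
):
    # Each program computes one BLOCK_M x BLOCK_N tile of C = (A_int8 @ B_int8) * scale.
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.int32)  # INT32 accumulator
    for k in range(0, K, BLOCK_K):
        a = tl.load(
            a_ptr + offs_m[:, None] * stride_am + (k + offs_k)[None, :] * stride_ak,
            mask=(offs_m[:, None] < M) & ((k + offs_k)[None, :] < K), other=0,
        )
        b = tl.load(
            b_ptr + (k + offs_k)[:, None] * stride_bk + offs_n[None, :] * stride_bn,
            mask=((k + offs_k)[:, None] < K) & (offs_n[None, :] < N), other=0,
        )
        acc = tl.dot(a, b, acc)  # INT8 x INT8 -> INT32 dot product
    c = acc.to(tl.float32) * scale  # dequantize once, after accumulation
    tl.store(
        c_ptr + offs_m[:, None] * stride_cm + offs_n[None, :] * stride_cn, c,
        mask=(offs_m[:, None] < M) & (offs_n[None, :] < N),
    )


def int8_matmul(a_int8: torch.Tensor, b_int8: torch.Tensor, scale: float) -> torch.Tensor:
    # Hypothetical launcher: a_int8 is (M, K), b_int8 is (K, N), both torch.int8 on the GPU.
    M, K = a_int8.shape
    _, N = b_int8.shape
    c = torch.empty((M, N), device=a_int8.device, dtype=torch.float32)
    grid = (triton.cdiv(M, 64), triton.cdiv(N, 64))
    int8_matmul_kernel[grid](
        a_int8, b_int8, c, M, N, K, scale,
        a_int8.stride(0), a_int8.stride(1),
        b_int8.stride(0), b_int8.stride(1),
        c.stride(0), c.stride(1),
        BLOCK_M=64, BLOCK_N=64, BLOCK_K=64,
    )
    return c
```

In such a scheme the dequantization scale is applied only once per output tile, which is one reason INT8 linear layers can outperform their FP32 counterparts; the actual QAttn kernels and fusion strategy are described in the paper and repository.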
Date of Conference: 17-18 June 2024
Date Added to IEEE Xplore: 27 September 2024
Conference Location: Seattle, WA, USA

1. Introduction

Recent advancements in Foundation Models (FMs) [3], both in Natural Language Processing (NLP) [38], [42], [45] and Computer Vision (CV) [17], [27], [40], have substantially improved the predictive performance of deep learning models. Nevertheless, these advances come at a cost in computational requirements and memory resources. The current baseline architecture for FMs is the transformer, built around the attention mechanism [48]. Initially designed for NLP, transformers have been adapted for CV, resulting in the development of Vision Transformers (ViTs) [17]. ViTs are encoder-only models that are typically pre-trained in a self-supervised manner on large amounts of data and later adapted to downstream tasks such as image classification, object detection, or instance segmentation. Similar to large language models [45], ViTs come in different sizes, depending on the number of layers and hence of parameters, which range from millions to 22 billion [12]. As a result, the largest models require a dedicated accelerator with sufficient memory to process the data. The large size of ViTs makes them suitable candidates for compression methods such as quantization, but outliers in intermediate activations pose a challenge [4], [11]. Quantization is a compression technique that reduces the number of bits used to represent values, converting computation and data from continuous (floating-point) to discrete (integer) representations. Integer 8-bit (INT8) inference is faster and more energy-efficient than its floating-point counterpart, but the limited range in which values can be represented makes it susceptible to quantization errors that may affect the final accuracy of the deep learning model [20], [30].
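To make this trade-off concrete, the following minimal sketch (illustrative only, not QAttn code) shows symmetric per-tensor INT8 quantization in PyTorch and how a single activation outlier stretches the quantization scale, increasing the rounding error on all other values.

```python
import torch


def quantize_int8(x: torch.Tensor):
    # Symmetric per-tensor quantization: map the largest magnitude to 127.
    scale = x.abs().max().clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(x / scale), -128, 127).to(torch.int8)
    return q, scale


def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Map the discrete INT8 codes back to floating point.
    return q.to(torch.float32) * scale


x = torch.randn(4, 64)
q, s = quantize_int8(x)
print("max error, no outlier:", (x - dequantize(q, s)).abs().max().item())

x[0, 0] = 50.0  # inject an activation outlier, as observed in ViT activations
q, s = quantize_int8(x)
err = (x - dequantize(q, s)).abs()
print("max error on non-outlier values:", err[x.abs() < 10].max().item())
```

The second printed error is markedly larger, since the outlier forces a coarser scale for the whole tensor; this is the quantization-error issue that static and dynamic quantization schemes must handle.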
