1. INTRODUCTION
Originating from natural language processing (NLP) tasks [1], transformer-based models have achieved remarkable performance and outperformed convolutional neural networks (CNNs) on various computer vision (CV) tasks [2]-[6]. However, vision transformers (ViTs) incur heavier memory and computational costs than CNNs. For example, ViT-L [2] contains 307 M parameters and requires 64 G FLOPs. Such overheads prevent ViTs from running on resource-constrained edge devices, limiting their real-world applications. Consequently, model compression for ViTs has become an urgent problem.
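As a rough, hedged check of the scale figures quoted above, the following Python sketch counts the parameters of a ViT-L model and estimates its FP32 weight footprint. It assumes PyTorch and the timm library are available; "vit_large_patch16_224" is timm's identifier for ViT-L/16, and the exact count may differ slightly from the 307 M figure cited in [2] depending on the classification head and input resolution.

```python
# Back-of-the-envelope estimate of ViT-L's size (illustrative sketch, not from the cited works).
# Assumes PyTorch and timm are installed; "vit_large_patch16_224" is timm's ViT-L/16 variant.
import timm

model = timm.create_model("vit_large_patch16_224", pretrained=False)

num_params = sum(p.numel() for p in model.parameters())   # total trainable + non-trainable parameters
fp32_bytes = num_params * 4                                # 4 bytes per FP32 weight

print(f"parameters: {num_params / 1e6:.1f} M")             # roughly 300 M for ViT-L/16
print(f"FP32 weight memory: {fp32_bytes / 2**30:.2f} GiB") # about 1.2 GiB for weights alone, before activations
```

Even before counting activations or optimizer state, the weights alone occupy on the order of a gigabyte in FP32, which illustrates why deployment on memory-limited edge devices is impractical without compression.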