1. Introduction
Object detection [10], [11], [23], [42], [45], [57] is one of the most fundamental and challenging problems in computer vision. The currently popular architectures, including convolutional neural network (CNN)-based [22], [46], [47], [50], [52], [53] and transformer-based [7], [14], [30], [33]–[35], [60] detection models, are designed as powerful yet complex structures for detecting visual objects [61]. However, existing detection models suffer from extremely high computational costs, making them infeasible to deploy on edge devices and limiting their broader application in practical scenarios. To mitigate this gap, several compression techniques [1], [15]–[17], [58] have been proposed to improve network efficiency, among which quantization reduces computational complexity and memory footprint by using lower bit-widths to represent network parameters. Post-training quantization (PTQ), which directly quantizes a well-trained floating-point model without time-consuming retraining, is a widely used approach because of its broad versatility and low production cost.
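To make the idea concrete, the following is a minimal sketch of symmetric uniform post-training quantization of a weight tensor, in the spirit described above. This is an illustrative example only, not the specific calibration scheme used by any of the cited methods; the function names and the per-tensor scale choice are assumptions for illustration.

```python
import numpy as np

def quantize_uniform(x, num_bits=4):
    """Illustrative symmetric uniform quantizer (per-tensor scale).

    Maps float values to signed integers in [-(2^(b-1)), 2^(b-1) - 1],
    e.g. [-8, 7] for the 4-bit (W4) setting mentioned in the text.
    """
    qmax = 2 ** (num_bits - 1) - 1            # 7 for 4-bit signed
    scale = np.abs(x).max() / qmax            # simple max-based calibration
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an approximate float tensor from integers and scale."""
    return q.astype(np.float32) * scale

# Example: quantize a random "weight" tensor to 4 bits post hoc,
# i.e. without any retraining, as PTQ does.
w = np.random.randn(64, 64).astype(np.float32)
q, s = quantize_uniform(w, num_bits=4)
w_hat = dequantize(q, s)
max_err = np.abs(w - w_hat).max()             # bounded by ~scale / 2
```

With max-based calibration and rounding to the nearest integer, the reconstruction error per element is at most about half the scale; real PTQ methods refine the scale (and often the rounding) per channel or per layer to reduce this error further.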
Comparison of FLOPs and parameters in full-precision and W4A4 quantized detection models. Head structures account for a non-negligible share of computation and memory, and quantization significantly reduces the overall FLOPs and memory storage.