1. Introduction
Deep Neural Networks (DNNs) have rapidly become a research hotspot in recent years and are applied to a wide variety of practical scenarios. However, as DNNs evolve, better model performance usually comes at the cost of huge resource consumption from deeper and wider networks [8], [14], [28]. Meanwhile, the field of neural network compression and acceleration, which aims to deploy models in resource-constrained scenarios, is attracting growing attention; it includes, but is not limited to, Neural Architecture Search [18], [19], [33], [35]–[40], [42], [43], network pruning [4], [7], [16], [27], [32], [41], and quantization [3], [5], [6], [15], [17], [21], [22], [29], [31]. Among these methods, quantization transforms floating-point network activations and weights into low-bit fixed-point values, which accelerates inference [13] or training [44] with little performance degradation.
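As a brief illustration of the idea, the sketch below implements a minimal asymmetric min-max uniform quantizer in PyTorch. The function name, the 4-bit setting, and the min-max calibration are illustrative assumptions; the cited works use more sophisticated quantizers and calibration schemes.

```python
import torch

def uniform_quantize(x: torch.Tensor, num_bits: int = 4) -> torch.Tensor:
    """Quantize a float tensor to num_bits fixed-point levels and de-quantize.

    Illustrative asymmetric min-max quantizer, not the exact scheme of any
    specific cited method.
    """
    qmin, qmax = 0, 2 ** num_bits - 1
    # Derive scale and zero-point from the tensor's dynamic range.
    scale = (x.max() - x.min()).clamp(min=1e-8) / (qmax - qmin)
    zero_point = torch.round(-x.min() / scale).clamp(qmin, qmax)
    # Round to the nearest grid point and clamp to the representable range.
    x_q = (torch.round(x / scale) + zero_point).clamp(qmin, qmax)
    # De-quantize to recover a low-bit approximation of the original values.
    return (x_q - zero_point) * scale

# Example: quantize random weights to 4 bits and measure the mean error.
w = torch.randn(256)
w_q = uniform_quantize(w, num_bits=4)
print((w - w_q).abs().mean())
```

The rounding and clamping steps are the source of the quantization error that reconstruction-based methods such as BRECQ [17] aim to minimize.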
Figure: Left: reconstruction loss distribution of BRECQ [17] on 0.5-scaled MobileNetV2 quantized to 4/4 bit; the loss oscillation in BRECQ during reconstruction is marked by the red dashed box. Right: mixed reconstruction granularity (MRECG) smooths the loss oscillation and achieves higher accuracy.