
Block-Wise Mixed-Precision Quantization: Enabling High Efficiency for Practical ReRAM-Based DNN Accelerators



Abstract:

Resistive random access memory (ReRAM)-based processing-in-memory (PIM) architectures have demonstrated great potential to accelerate deep neural network (DNN) training/inference. However, the computational accuracy of analog PIM is compromised by nonidealities, such as the conductance variation of ReRAM cells. The impact of these nonidealities worsens as the number of concurrently activated wordlines (WLs) and bitlines (BLs) increases. To guarantee computational accuracy, only a limited number of WLs and BLs of the crossbar array can be turned on concurrently, significantly reducing the achievable parallelism of the architecture. While the constraints on parallelism limit the efficiency of the accelerators, they also provide a new opportunity for fine-grained mixed-precision quantization. To enable efficient DNN inference on practical ReRAM-based accelerators, we propose an algorithm-architecture co-design framework called block-wise mixed-precision quantization (BWQ). At the algorithm level, the BWQ algorithm (BWQ-A) introduces a mixed-precision quantization scheme at the block level, which achieves a high weight and activation compression ratio with negligible accuracy degradation. We also present a hardware architecture design, BWQ-H, which leverages the low-bit-width models produced by BWQ-A to perform high-efficiency DNN inference on ReRAM devices. BWQ-H also adopts a novel precision-aware weight mapping method to increase the throughput of the ReRAM crossbars. Our evaluation demonstrates the effectiveness of BWQ, which achieves a 6.08× speedup and a 17.47× energy saving on average compared to existing ReRAM-based architectures.
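To make the block-level idea concrete, the following is a minimal NumPy sketch of block-wise mixed-precision quantization: the weight matrix is split into small row blocks, and each block independently receives the lowest bit-width whose quantization error stays within a tolerance. The block shape (chosen here to echo the 9-row operation-unit constraint discussed in Section I), the candidate bit-widths, the error tolerance, and all function names are illustrative assumptions, not BWQ-A's actual selection policy.

```python
import numpy as np

def quantize_block(block, bits):
    """Symmetric uniform quantization of one weight block to `bits` bits."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(block).max() / qmax
    if scale == 0.0:
        return np.zeros_like(block)
    q = np.clip(np.round(block / scale), -qmax - 1, qmax)
    return q * scale  # dequantized values, used to gauge the error

def blockwise_mixed_precision(weights, block_rows=9, bit_choices=(2, 4, 8), tol=0.02):
    """Assign each row block the lowest bit-width whose quantization error
    stays within `tol` of the block's dynamic range (a hypothetical criterion)."""
    out = np.empty_like(weights)
    for r in range(0, weights.shape[0], block_rows):
        block = weights[r:r + block_rows]
        for bits in bit_choices:               # try the cheapest precision first
            deq = quantize_block(block, bits)
            if np.abs(deq - block).max() <= tol * np.abs(block).max():
                break                          # this block tolerates `bits` bits
        out[r:r + block_rows] = deq
    return out
```

Because each block is quantized against its own range, outlier-heavy blocks can keep a higher bit-width while the majority of blocks drop to very low precision, which is the source of the compression the abstract describes.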
Page(s): 4558 - 4571
Date of Publication: 03 June 2024


I. Introduction

Resistive random access memory (ReRAM)-based processing-in-memory (PIM) architectures can perform in-situ computation within the memory devices, and they have demonstrated great potential in accelerating deep neural network (DNN) training/inference [1], [2]. However, the manufacturing technology of ReRAM devices is still in its early stage, and many challenges stand in the way of its practical adoption [3]. Most prior work on ReRAM-based DNN accelerators has overlooked these practical considerations and relies on idealized assumptions about the ReRAM devices and the associated analog-to-digital converter (ADC) overhead: that all the rows and columns of a crossbar array can be activated simultaneously within a single clock cycle without impacting computational accuracy [4], [5].

Several challenges render this assumption impractical. The major problem is the conductance variation of the ReRAM devices. Since ReRAM crossbar arrays leverage Kirchhoff's current law to perform vector-matrix multiplication (VMM) operations, the conductance variation accumulated along the bitlines (BLs) is proportional to the number of concurrently activated wordlines (WLs) [6]. Activating too many WLs simultaneously also leads to high BL current, which induces a significant IR drop and causes nonuniform voltage and current distributions along the crossbar [7]. Therefore, to achieve high-accuracy computation, the number of WLs that can be activated simultaneously within a crossbar array must be limited. Another challenge is that, in a practical ReRAM-based DNN accelerator, the number of ADCs per crossbar array must be restricted, as they consume a significant amount of power and area [5], [8]. As such, one ADC must be shared among multiple BLs. Given that an ADC can only convert the signal of one BL per clock cycle, the number of BLs activated simultaneously should match the number of ADCs in each crossbar [3].

Consequently, in a practical ReRAM-based DNN accelerator, the VMM on the crossbar arrays should operate at a much finer granularity, termed an operation unit (OU), rather than at the subarray granularity [3], [9], [10]. Several recent studies have demonstrated that, for a practical ReRAM-based DNN accelerator to attain an acceptable level of inference accuracy, only nine WLs and eight BLs can be turned on concurrently [3], [11], [12].
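The throughput cost of OU-granularity operation can be illustrated with a short sketch. It assumes a 128×128 crossbar, the 9×8 OU shape cited above, ideal devices, and exact digital accumulation of the per-OU partial sums; the `ou_vmm` function and the step-count accounting are hypothetical illustrations, not the paper's design.

```python
import numpy as np

def ou_vmm(x, G, ou_rows=9, ou_cols=8):
    """Compute y = x @ G one operation unit (OU) at a time: only ou_rows
    WLs and ou_cols BLs are active per step, and the OU-level partial
    sums are accumulated digitally."""
    n_rows, n_cols = G.shape
    y = np.zeros(n_cols)
    steps = 0
    for r in range(0, n_rows, ou_rows):        # groups of concurrently active WLs
        for c in range(0, n_cols, ou_cols):    # groups of concurrently active BLs
            y[c:c + ou_cols] += x[r:r + ou_rows] @ G[r:r + ou_rows, c:c + ou_cols]
            steps += 1                         # one crossbar cycle per OU
    return y, steps

# A fully parallel 128x128 crossbar would finish this VMM in 1 step; at
# 9x8 OU granularity it takes ceil(128/9) * ceil(128/8) = 15 * 16 = 240 steps.
x = np.random.randn(128)
G = np.random.rand(128, 128)     # conductances are nonnegative
y, steps = ou_vmm(x, G)
assert np.allclose(y, x @ G)     # the partial sums reassemble the full product
print(steps)                     # 240
```

The two-orders-of-magnitude gap between 1 step and 240 steps is exactly the parallelism loss that motivates BWQ: lower-precision blocks need fewer crossbar cycles, recovering part of the throughput that the OU constraint takes away.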

