
Block-Wise Mixed-Precision Quantization: Enabling High Efficiency for Practical ReRAM-Based DNN Accelerators



Abstract:

Resistive random access memory (ReRAM)-based processing-in-memory (PIM) architectures have demonstrated great potential to accelerate deep neural network (DNN) training/inference. However, the computational accuracy of analog PIM is compromised by nonidealities, such as the conductance variation of ReRAM cells. The impact of these nonidealities worsens as the number of concurrently activated wordlines (WLs) and bitlines (BLs) increases. To guarantee computational accuracy, only a limited number of WLs and BLs of the crossbar array can be turned on concurrently, significantly reducing the achievable parallelism of the architecture. While the constraints on parallelism limit the efficiency of the accelerators, they also provide a new opportunity for fine-grained mixed-precision quantization. To enable efficient DNN inference on practical ReRAM-based accelerators, we propose an algorithm-architecture co-design framework called block-wise mixed-precision quantization (BWQ). At the algorithm level, the BWQ algorithm (BWQ-A) introduces a mixed-precision quantization scheme at the block level, which achieves a high weight and activation compression ratio with negligible accuracy degradation. We also present a hardware architecture design, BWQ-H, which leverages the low-bit-width models produced by BWQ-A to perform high-efficiency DNN inference on ReRAM devices. BWQ-H also adopts a novel precision-aware weight mapping method to increase the throughput of the ReRAM crossbars. Our evaluation demonstrates the effectiveness of BWQ, which achieves a 6.08× speedup and a 17.47× energy saving on average compared to existing ReRAM-based architectures.
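To make the block-level idea concrete, the following is a minimal NumPy sketch of block-wise mixed-precision quantization: the weight matrix is split into small row blocks, and each block independently receives the lowest bit-width whose quantization error stays within a tolerance. The block shape (chosen here to echo the 9-row operation-unit constraint discussed in Section I), the candidate bit-widths, the error tolerance, and all function names are illustrative assumptions, not BWQ-A's actual selection policy.

```python
import numpy as np

def quantize_block(block, bits):
    """Symmetric uniform quantization of one weight block to `bits` bits."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(block).max() / qmax
    if scale == 0.0:
        return np.zeros_like(block)
    q = np.clip(np.round(block / scale), -qmax - 1, qmax)
    return q * scale  # dequantized values, used to gauge the error

def blockwise_mixed_precision(weights, block_rows=9, bit_choices=(2, 4, 8), tol=0.02):
    """Assign each row block the lowest bit-width whose quantization error
    stays within `tol` of the block's dynamic range (a hypothetical criterion)."""
    out = np.empty_like(weights)
    for r in range(0, weights.shape[0], block_rows):
        block = weights[r:r + block_rows]
        for bits in bit_choices:               # try the cheapest precision first
            deq = quantize_block(block, bits)
            if np.abs(deq - block).max() <= tol * np.abs(block).max():
                break                          # this block tolerates `bits` bits
        out[r:r + block_rows] = deq
    return out
```

Because each block is quantized against its own range, outlier-heavy blocks can keep a higher bit-width while the majority of blocks drop to very low precision, which is the source of the compression the abstract describes.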
Page(s): 4558 - 4571
Date of Publication: 03 June 2024


I. Introduction

Resistive random access memory (ReRAM)-based processing-in-memory (PIM) architectures can perform in-situ computation within the memory devices, and they have demonstrated great potential in accelerating deep neural network (DNN) training/inference [1], [2]. However, the manufacturing technology of ReRAM devices is still in its early stage, and many challenges stand in the way of its practical adoption [3]. Most prior work on ReRAM-based DNN accelerators has overlooked these practical considerations and relies on idealized assumptions about the ReRAM devices and the associated analog-to-digital converter (ADC) overhead: that all the rows and columns of a crossbar array can be activated simultaneously within a single clock cycle without impacting computational accuracy [4], [5].

Several challenges render this assumption impractical. The major problem is the conductance variation of the ReRAM devices. Since ReRAM crossbar arrays leverage Kirchhoff's current law to perform vector-matrix multiplication (VMM) operations, the conductance variation accumulated along the bitlines (BLs) is proportional to the number of concurrently activated wordlines (WLs) [6]. Activating too many WLs simultaneously also leads to high BL current, which induces a significant IR drop and causes nonuniform voltage and current distributions along the crossbar [7]. Therefore, to achieve high-accuracy computation, the number of WLs that can be activated simultaneously within a crossbar array must be limited. Another challenge is that, in a practical ReRAM-based DNN accelerator, the number of ADCs per crossbar array must be restricted, as they consume a significant amount of power and area [5], [8]. As such, one ADC must be shared among multiple BLs. Given that an ADC can only convert the signal of one BL per clock cycle, the number of BLs activated simultaneously should match the number of ADCs in each crossbar [3].

Consequently, in a practical ReRAM-based DNN accelerator, the VMM on the crossbar arrays should operate at a much finer granularity, termed an operation unit (OU), rather than at the subarray granularity [3], [9], [10]. Several recent studies have demonstrated that, for a practical ReRAM-based DNN accelerator to attain an acceptable level of inference accuracy, only nine WLs and eight BLs can be turned on concurrently [3], [11], [12].
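The throughput cost of OU-granularity operation can be illustrated with a short sketch. It assumes a 128×128 crossbar, the 9×8 OU shape cited above, ideal devices, and exact digital accumulation of the per-OU partial sums; the `ou_vmm` function and the step-count accounting are hypothetical illustrations, not the paper's design.

```python
import numpy as np

def ou_vmm(x, G, ou_rows=9, ou_cols=8):
    """Compute y = x @ G one operation unit (OU) at a time: only ou_rows
    WLs and ou_cols BLs are active per step, and the OU-level partial
    sums are accumulated digitally."""
    n_rows, n_cols = G.shape
    y = np.zeros(n_cols)
    steps = 0
    for r in range(0, n_rows, ou_rows):        # groups of concurrently active WLs
        for c in range(0, n_cols, ou_cols):    # groups of concurrently active BLs
            y[c:c + ou_cols] += x[r:r + ou_rows] @ G[r:r + ou_rows, c:c + ou_cols]
            steps += 1                         # one crossbar cycle per OU
    return y, steps

# A fully parallel 128x128 crossbar would finish this VMM in 1 step; at
# 9x8 OU granularity it takes ceil(128/9) * ceil(128/8) = 15 * 16 = 240 steps.
x = np.random.randn(128)
G = np.random.rand(128, 128)     # conductances are nonnegative
y, steps = ou_vmm(x, G)
assert np.allclose(y, x @ G)     # the partial sums reassemble the full product
print(steps)                     # 240
```

The two-orders-of-magnitude gap between 1 step and 240 steps is exactly the parallelism loss that motivates BWQ: lower-precision blocks need fewer crossbar cycles, recovering part of the throughput that the OU constraint takes away.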

