Enabling Software Resilience in GPGPU Applications via Partial Thread Protection | IEEE Conference Publication | IEEE Xplore

Enabling Software Resilience in GPGPU Applications via Partial Thread Protection


Abstract:

Graphics Processing Units (GPUs) are widely used to accelerate computation in a broad variety of fields, but they remain susceptible to transient hardware faults (soft errors) that can easily compromise application output. By taking advantage of the hierarchical organization of general-purpose GPU applications into threads, warps, and cooperative thread arrays, we propose a methodology that identifies the resilience of threads and aims to map threads with the same resilience characteristics to the same warp. This allows partial replication mechanisms for error detection/correction to be engaged at the warp level. By exploring 12 benchmarks (17 kernels) from 4 benchmark suites, we illustrate that threads can be remapped into reliable or unreliable warps with only 1.63% introduced overhead (on average), which enables selective protection via replication for only those groups of threads that truly need it. Furthermore, we show that thread remapping to different warps does not sacrifice application performance. We show how this remapping facilitates warp replication for error detection and/or correction and achieves average reductions of 20.61% and 27.15% in execution cycles compared to standard duplication and triplication, respectively.
Date of Conference: 22-30 May 2021
Date Added to IEEE Xplore: 07 May 2021
Print ISBN: 978-1-6654-0296-5
Print ISSN: 1558-1225
Conference Location: Madrid, ES

I. Introduction

General purpose GPUs (GPGPUs) are becoming increasingly susceptible to transient hardware faults (soft errors), often caused by cosmic radiation [1] or by operation at low voltage [2], which makes their reliable operation critically important. GPGPUs are now omnipresent in fields such as high-performance computing (HPC), artificial intelligence, deep learning, virtual/augmented reality, and safety-critical systems such as autonomous vehicles [3]–[10]. Transient hardware faults can lead to bit flips in storage devices including the register file and DRAM, and such bit flips are becoming more frequent as system scale increases, especially in the HPC domain [11]–[13]. If bit flips occur during application execution, they may result in application crashes or hangs or, even worse, in silent data corruption (SDC), where the application completes execution successfully but its output is incorrect. Executions that end in SDC are the most undesirable because they give the user the illusion of correct output, although some SDC outputs may fall within user-acceptable ranges [14]. To ensure reliable application execution, several mechanisms are widely employed, including error correction codes (ECC) [15]–[17], but ECC still cannot protect against datapath errors that originate from unprotected latches in functional units (e.g., arithmetic logic and load-store units) [18].
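
To make the outcome categories above concrete, the following is a minimal Python sketch of a single bit flip in a 32-bit floating-point value and of how a run might be labeled afterwards. It is an illustration only, not the paper's fault-injection methodology: the functions `flip_bit`, `dot`, and `classify`, the toy dot-product "kernel", and the outcome label `"tolerable"` are all hypothetical names introduced here, and crashes and hangs are not modeled.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Return `value`, stored as an IEEE 754 float32, with one bit flipped."""
    if not 0 <= bit < 32:
        raise ValueError("bit must be in the range 0..31")
    (raw,) = struct.unpack("<I", struct.pack("<f", value))
    (flipped,) = struct.unpack("<f", struct.pack("<I", raw ^ (1 << bit)))
    return flipped


def dot(xs, ys, fault=None):
    """Toy 'kernel': a dot product with an optional injected fault.

    `fault` is a hypothetical (element_index, bit_index) pair; the chosen
    element of `xs` has that bit flipped just before it is used.
    """
    total = 0.0
    for i, (x, y) in enumerate(zip(xs, ys)):
        if fault is not None and fault[0] == i:
            x = flip_bit(x, fault[1])
        total += x * y
    return total


def classify(golden: float, observed: float, tolerance: float = 0.0) -> str:
    """Compare a faulty run against a fault-free ('golden') run."""
    if observed == golden:
        return "masked"      # the flip had no effect on the output
    if abs(observed - golden) <= tolerance:
        return "tolerable"   # wrong, but within a user-chosen range
    return "sdc"             # silently wrong output


if __name__ == "__main__":
    xs, ys = [1.0, 2.0], [3.0, 4.0]
    golden = dot(xs, ys)
    for bit in (0, 23, 31):
        observed = dot(xs, ys, fault=(0, bit))
        print(bit, observed, classify(golden, observed, tolerance=1e-6))
```

In this sketch, flipping the sign bit of an input changes the result substantially, flipping the lowest mantissa bit perturbs it only slightly, and a flip in an element that is multiplied by zero leaves the output unchanged. The three cases correspond to SDC, SDC within a user-acceptable range, and a masked fault.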

