I. Introduction
General-purpose GPUs (GPGPUs) are becoming increasingly susceptible to transient hardware faults (soft errors), often caused by cosmic radiation [1] or by operation at low voltage [2], which makes their reliable operation critically important. GPGPUs are now omnipresent in fields such as high-performance computing (HPC), artificial intelligence, deep learning, virtual/augmented reality, and safety-critical systems such as autonomous vehicles [3]–[10]. Transient hardware faults can cause bit flips in storage devices, including the register file and DRAM, and such bit flips are becoming more frequent as system scale increases, especially in the HPC domain [11]–[13]. If a bit flip occurs during application execution, it may cause the application to crash or hang or, even worse, produce silent data corruption (SDC), where the application completes execution successfully but its output is incorrect. SDC outcomes are the most undesirable because they give the user the illusion of correct output, although in some cases the corrupted output may still fall within a user-acceptable range [14]. To ensure reliable application execution, several mechanisms are widely employed, including error correction codes (ECC) [15]–[17]; however, ECC still cannot protect against datapath errors that originate in unprotected latches in functional units (e.g., arithmetic logic and load-store units) [18].