Combining Cluster Sampling and ACE analysis to improve fault-injection based reliability evaluation of GPU-based systems


Abstract:

Demand for computing capability has grown massively in recent years. Modern GPU chips are designed to deliver extreme performance both for graphics and for data-parallel general-purpose computing workloads (GPGPU computing). Many GPGPU applications require high reliability, so reliability evaluation has become a crucial step in their design. The state-of-the-art techniques for assessing the reliability of a system are fault injection and ACE analysis. The former produces accurate results but requires very long evaluation times, while the latter is very fast but less accurate. In this paper we introduce a new sampling methodology, based on cluster sampling, that exploits ACE analysis to accelerate the fault injection process. Our experiments demonstrate that state-of-the-art fault injection techniques, which generate random faults according to a uniform distribution, are outperformed by the proposed sampling technique in terms of both accuracy and evaluation time. To quantify these benefits we analyzed the micro-architectural reliability of an AMD Southern Islands GPU in the presence of single-bit upsets affecting the vector register file, for 6 benchmarks. One of the most important results is that, averaged over all benchmarks, the proposed technique is one order of magnitude faster or more accurate than uniform-sampling-based techniques for non-exhaustive fault injection campaigns, and more than two orders of magnitude for exhaustive campaigns.
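
As a rough illustration of why ACE information can speed up fault injection, the sketch below compares uniform sampling with sampling restricted to ACE-flagged fault sites. It is a simplified two-group illustration on synthetic data, not the paper's cluster-sampling procedure: the fault-space model, the function names, and the assumption that sites flagged un-ACE never cause a failure are all hypothetical choices made for this example.

```python
import random

def make_fault_space(n_sites, ace_fraction, failure_rate_in_ace, rng):
    """Build a synthetic fault space of (ace_flag, causes_failure) pairs.

    Assumption for this sketch: ACE analysis is conservative, so a site
    flagged un-ACE never fails, while an ACE-flagged site fails only
    with some probability.
    """
    sites = []
    for _ in range(n_sites):
        ace = rng.random() < ace_fraction
        fails = ace and rng.random() < failure_rate_in_ace
        sites.append((ace, fails))
    return sites

def true_rate(sites):
    """Exact failure rate over the whole fault space."""
    return sum(fails for _, fails in sites) / len(sites)

def uniform_estimate(sites, n_injections, rng):
    """Baseline: inject faults at sites drawn uniformly at random."""
    sample = rng.sample(sites, n_injections)
    return sum(fails for _, fails in sample) / n_injections

def ace_guided_estimate(sites, n_injections, rng):
    """Inject only at ACE-flagged sites, then rescale to the full space."""
    ace_sites = [s for s in sites if s[0]]
    if not ace_sites:
        return 0.0
    n = min(n_injections, len(ace_sites))
    sample = rng.sample(ace_sites, n)
    rate_in_ace = sum(fails for _, fails in sample) / n
    # Un-ACE sites contribute no failures, so weight by the ACE share.
    return rate_in_ace * len(ace_sites) / len(sites)

if __name__ == "__main__":
    rng = random.Random(0)
    space = make_fault_space(100_000, 0.2, 0.5, rng)
    print("true failure rate  :", true_rate(space))
    print("uniform, 500 faults:", uniform_estimate(space, 500, rng))
    print("ACE-guided, 500    :", ace_guided_estimate(space, 500, rng))
```

In this toy model, an exhaustive ACE-guided campaign needs only as many injections as there are ACE-flagged sites, which is where the saving over an exhaustive uniform campaign would come from; the real gains reported in the abstract depend on the benchmarks and on the actual clustering used.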
Date of Conference: 02-04 October 2019
Date Added to IEEE Xplore: 21 October 2019
Conference Location: Noordwijk, Netherlands

I. Introduction

Graphics Processing Units (GPUs) constitute an important part of the recently emerging computing continuum, whose total market exceeds two billion devices per year and whose application fields range from smartphones to mission-critical data center machines [1]. The technologies of this continuum have brought benefits for several design parameters (e.g., performance and power consumption), but reliability remains a major concern [2]. Evaluating the reliability of GPU-based systems running complex applications is extremely challenging because of their hardware complexity, and it requires complex and time-consuming simulations. Nevertheless, addressing GPU reliability is necessary, since GPUs are finding application in critical scenarios [3]. Accurate and fast techniques able to carefully trade off reliability analysis time against the accuracy of the reported measurements are required to design complex GPU-

References
[1] D. Buchholz and I. J. Dunlop, "The future of enterprise computing: Preparing for the compute continuum" in IT@ Intel White Paper Intel IT, 2011.
[2] A. Biswas, Cost-effective reliability trade-offs and challenges, April 2018.
[3] P. Rech, L. Pilla, P. Navaux and L. Carro, "Impact of GPUs Parallelism Management on Safety-Critical and HPC Applications Reliability", 2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, pp. 455-466, Jun. 2014.
[4] G.-H. Asadi, V. Mehdi, B. Tahoori and D. Kaeli, "Balancing Performance and Reliability in the Memory Hierarchy", IEEE International Symposium on Performance Analysis of Systems and Software 2005. ISPASS 2005, pp. 269-279, Mar. 2005.
[5] P. Rech, C. Aguiar, R. Ferreira, M. Silvestri, A. Griffoni, C. Frost, et al., "Neutron-induced soft errors in graphic processing units", 2012 IEEE Radiation Effects Data Workshop, pp. 1-6, Jul. 2012.
[6] I. S. Haque and V. S. Pande, "Hard data on soft errors: A large-scale assessment of real-world error rates in GPGPU", 2010 10th IEEE/ACM International Conference on Cluster Cloud and Grid Computing, pp. 691-696, May 2010.
[7] N. Farazmand, R. Ubal and D. Kaeli, "Statistical fault injection-based AVF analysis of a GPU architecture", Proceedings of the IEEE Workshop on Silicon Errors in Logic - System Effects, 2012.
[8] B. Fang, K. Pattabiraman, M. Ripeanu and S. Gurumurthi, "GPU-Qin: A methodology for evaluating the error resilience of GPGPU applications", 2014 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pp. 221-230, Mar. 2014.
[9] S. Tselonis and D. Gizopoulos, "GUFI: A framework for GPUs reliability assessment", 2016 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pp. 90-100, Apr. 2016.
[10] A. Vallero, D. Gizopoulos and S. di Carlo, "SIFI: AMD southern islands GPU microarchitectural level fault injector", 2017 IEEE 23rd International Symposium on On-Line Testing and Robust System Design (IOLTS), pp. 138-144, Jul. 2017.
[11] S. K. S. Hari, T. Tsai, M. Stephenson, S. W. Keckler and J. Emer, "SASSIFI: An architecture-level fault injection tool for GPU application resilience evaluation", 2017 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pp. 249-258, Apr. 2017.
[12] A. Vallero, S. Tselonis, D. Gizopoulos and S. di Carlo, "Multi-faceted microarchitecture level reliability characterization for NVIDIA and AMD GPUs", 2018 IEEE 36th VLSI Test Symposium (VTS), pp. 1-6, Apr. 2018.
[13] J. Tan, N. Goswami, T. Li and X. Fu, "Analyzing soft-error vulnerability on GPGPU microarchitecture", 2011 IEEE International Symposium on Workload Characterization (IISWC), pp. 226-235, Nov. 2011.
[14] A. Benso, M. Rebaudengo, L. Impagliazzo and P. Marmo, "Fault-list collapsing for fault-injection experiments", Annual Reliability and Maintainability Symposium. 1998 Proceedings. International Symposium on Product Quality and Integrity, pp. 383-388, Jan. 1998.
[15] S. Mukherjee, C. Weaver, J. Emer, S. Reinhardt and T. Austin, "A systematic methodology to compute the architectural vulnerability factors for a high-performance microprocessor", Proceedings. 36th Annual IEEE/ACM International Symposium on Microarchitecture 2003. MICRO-36, pp. 29-40, Dec. 2003.
[16] R. Frerichs, Cluster sampling chapter five, Rapid Surveys, 2004.
[17] R. Ubal, B. Jang, P. Mistry, D. Schaa and D. Kaeli, "Multi2sim: a simulation framework for CPU-GPU computing", Proceedings of the 21st international conference on Parallel architectures and compilation techniques, pp. 335-344, 2012.
