FTI: High performance Fault Tolerance Interface for hybrid systems

Abstract:

Large scientific applications deployed on current petascale systems expend a significant amount of their execution time dumping checkpoint files to remote storage. New fault tolerant techniques will be critical to efficiently exploit post-petascale systems. In this work, we propose a low-overhead high-frequency multi-level checkpoint technique in which we integrate a highly-reliable topology-aware Reed-Solomon encoding in a three-level checkpoint scheme. We efficiently hide the encoding time using one Fault-Tolerance dedicated thread per node. We implement our technique in the Fault Tolerance Interface FTI. We evaluate the correctness of our performance model and conduct a study of the reliability of our library. To demonstrate the performance of FTI, we present a case study of the Mw9.0 Tohoku Japan earthquake simulation with SPECFEM3D on TSUBAME2.0. We demonstrate a checkpoint overhead as low as 8% on sustained 0.1 petaflops runs (1152 GPUs) while checkpointing at high frequency.

Published in: SC '11: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis

Date of Conference: 12-18 November 2011

Date Added to IEEE Xplore: 29 December 2011

DOI: 10.1145/2063384.2063427

Conference Location: Seattle, WA, USA

1. Introduction

In high performance computing (HPC), systems are built from highly reliable components. However, the overall failure rate of a supercomputer increases with its component count. Today's petascale machines have a mean time between failures (MTBF) measured in hours or days [41], and fault tolerance (FT) is a well-known issue. Long-running, large-scale applications rely on FT techniques to complete their executions. Checkpoint/Restart (CR) is a popular technique in which an application saves its state to stable storage, frequently a parallel file system (PFS); upon a failure, the application restarts from the last saved checkpoint. CR is relatively inexpensive compared with process replication, which imposes over 100% overhead.
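
To make the CR idea concrete, the following is a minimal single-process sketch, not FTI's interface or the authors' implementation. It assumes a local file standing in for stable storage such as a PFS, and the names `save_checkpoint`, `load_checkpoint`, `run`, and the `fail_at` parameter are hypothetical, introduced only for illustration. The job periodically saves its state, and a restarted run resumes from the last saved checkpoint rather than from the beginning.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write the state atomically: dump to a temp file, then rename it over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())  # push the bytes to storage before the rename
    os.replace(tmp_path, path)  # a crash never leaves a half-written checkpoint


def load_checkpoint(path):
    """Return the last saved state, or None if no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def run(path, total_steps, interval, fail_at=None):
    """Toy iterative job (hypothetical): sums the step indices 0..total_steps-1.

    A checkpoint is taken every `interval` steps. `fail_at` injects a
    simulated failure at the given step so the restart path can be exercised.
    """
    state = load_checkpoint(path) or {"step": 0, "acc": 0}
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated failure")
        state["acc"] += state["step"]
        state["step"] += 1
        if state["step"] % interval == 0:
            save_checkpoint(path, state)
    return state["acc"]
```

Calling `run` once with `fail_at` set and then again without it on the same path exercises the restart: the second call picks up from the last checkpoint and repeats only the steps executed after it. The work lost to a failure is therefore bounded by the checkpoint interval, which is why checkpoint frequency and per-checkpoint cost both matter.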
