I. Introduction
LARGE and complex scientific workflows rely on computational grids to satisfy their massive computational and data requirements. With the increasing heterogeneity and complexity of computational grids, executing large scientific workflows reliably becomes a challenge. Although the mean time to failure of any single entity in a computational grid is high, the sheer number of entities in a grid (hardware, network, software, grid middleware, core services, etc.) means that the grid as a whole will fail frequently. For example, the authors of [1] studied failure data collected over nine years from several high-performance computing systems operated by Los Alamos National Laboratory (LANL). Although failure rates varied from 0.1 to 3 failures per processor per year, systems with 4096 processors averaged as many as 3 failures per day. Thus, even though the failure rate per processor is relatively low, the aggregate reliability of a system clearly deteriorates as the number of processors increases.

Since the aggregate failure rate is roughly proportional to the number of processors in the system, a computational grid with over 112,000 processors, such as the TeraGrid [2], [3], would experience a failure roughly every two minutes. Moreover, unlike the LANL infrastructure, which is high priority, expensive, and tightly controlled, with substantial resources devoted to its maintenance, a computational grid is also susceptible to failures at the grid middleware level: the software and services that tie together the heterogeneous computing systems on the grid. Failures will therefore be the norm rather than the exception, and workflow execution systems must be designed to execute workflows in a fault-tolerant manner.
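As a rough back-of-the-envelope check of the failure-interval extrapolation above (assuming, purely for illustration, a per-processor failure rate near the upper end of the LANL range, about 3 failures per processor per year, and statistically independent failures; the notation $N$ for the processor count, $\lambda$ for the per-processor rate, and $\overline{T}_{\text{fail}}$ for the expected interval between failures is introduced here and is not taken from [1]):
\[
\overline{T}_{\text{fail}} \approx \frac{1}{N\lambda}
= \frac{525{,}600\ \text{minutes/year}}{112{,}000\ \text{processors} \times 3\ \text{failures/processor/year}}
\approx 1.6\ \text{minutes},
\]
which is consistent with the failure-every-two-minutes figure cited above; even at the lower rate of 1 failure per processor per year, a failure would be expected roughly every five minutes.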