I. Introduction
LARGE and complex scientific workflows rely on computational grids to satisfy their massive computational and data requirements. With the increasing heterogeneity and complexity of computational grids, executing large scientific workflows reliably becomes a challenge. Although the mean time to failure of any single entity in a computational grid is high, the sheer number of entities in a grid (hardware, network, software, grid middleware, core services, etc.) means that the grid as a whole will fail frequently. For example, the authors of [1] studied failure data collected over nine years from several high-performance computing systems operated by Los Alamos National Laboratory (LANL). Although failure rates varied from 0.1 to 3 failures per processor per year, systems with 4096 processors averaged as many as 3 failures per day. Thus, even though the failure rate per processor is relatively low, the aggregate reliability of a system clearly deteriorates as the number of processors increases.

Since the aggregate failure rate is roughly proportional to the number of processors in the system, a computational grid with over 112,000 processors, such as the TeraGrid [2], [3], would experience a failure roughly every two minutes. Moreover, unlike the LANL infrastructure, which is high priority, expensive, and tightly controlled, with substantial resources devoted to its maintenance, a computational grid is also susceptible to failures at the grid middleware level: the software and services that tie together the heterogeneous computing systems on the grid. Failures will therefore be the norm rather than the exception, and workflow execution systems must be designed to execute workflows in a fault-tolerant manner.
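As a rough back-of-the-envelope check of the failure-interval extrapolation above (assuming, purely for illustration, a per-processor failure rate near the upper end of the LANL range, about 3 failures per processor per year, and statistically independent failures; the notation $N$ for the processor count, $\lambda$ for the per-processor rate, and $\overline{T}_{\text{fail}}$ for the expected interval between failures is introduced here and is not taken from [1]):
\[
\overline{T}_{\text{fail}} \approx \frac{1}{N\lambda}
= \frac{525{,}600\ \text{minutes/year}}{112{,}000\ \text{processors} \times 3\ \text{failures/processor/year}}
\approx 1.6\ \text{minutes},
\]
which is consistent with the failure-every-two-minutes figure cited above; even at the lower rate of 1 failure per processor per year, a failure would be expected roughly every five minutes.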