1. Introduction
Building highly available systems has always been critical in certain areas, but society's increased reliance on distributed systems for applications such as search, entertainment, and e-commerce calls for low-cost approaches to high availability. In such settings, high availability is typically achieved by combining redundancy with 24/7 operations support, in which human operators detect and repair failures and restore redundancy before the service is compromised. However, putting human operators in the critical path for availability is not ideal. To address this, techniques that automatically recover failed hardware and software components through restart and other methods have been proposed, including software rejuvenation [1], recursive restartability [2], and recovery-oriented computing [3].