I. Introduction
Without a revolutionary hardware redesign, a massive further reduction of clock frequency, or an increased power budget that accommodates hardware failure correction, we have to assume that exascale machines will fail more frequently than today's machines [1], [2]. Empirical data lead us to expect a linear correlation between the system size and the failure rate [3]. As machines grow, the mean time between failures (MTBF) will therefore shrink. Simulation codes thus have to become more resilient. In particular, they must be able to identify machine errors and to handle them.