Resiliency
Exascale systems will require a combination of resiliency techniques that are powerful yet flexible enough to accommodate heterogeneous systems such as the DEEP prototypes. In the DEEP projects we are developing a comprehensive set of resiliency methods that target different failure types and can be combined to provide a high level of resiliency at an affordable cost.
The overall aim is to isolate soft or partial system failures so that full application restarts are not necessary. This will be key to enabling computing at Exascale.
Fault-Tolerant Interface (FTI)
The DEEP-EST project has contributed substantially to FTI, a multi-level checkpointing library with a simple API.
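To illustrate the idea behind multi-level checkpointing, the following is a minimal conceptual sketch in Python. It is not the FTI API: the class `MultiLevelCheckpointer`, its methods, the file name `ckpt.pkl`, and the two-level setup (a frequent "local" level and an infrequent, more resilient "global" level) are all assumptions made for illustration. The point is only that cheap checkpoints are written often, costly ones rarely, and recovery falls back to the newest checkpoint that survived the failure.

```python
import os
import pickle
import tempfile


class MultiLevelCheckpointer:
    """Toy multi-level checkpointing (hypothetical, not the FTI API)."""

    def __init__(self, levels):
        # levels: list of (directory, interval) pairs, ordered from the
        # cheapest, least resilient level to the costliest, most resilient one.
        self.levels = levels

    @staticmethod
    def _path(directory):
        return os.path.join(directory, "ckpt.pkl")

    def maybe_checkpoint(self, iteration, state):
        """Write to every level whose interval divides `iteration`."""
        written = []
        for directory, interval in self.levels:
            if iteration > 0 and iteration % interval == 0:
                tmp = self._path(directory) + ".tmp"
                with open(tmp, "wb") as f:
                    pickle.dump((iteration, state), f)
                # Rename so a crash mid-write never leaves a partial checkpoint.
                os.replace(tmp, self._path(directory))
                written.append(directory)
        return written

    def recover(self):
        """Return (iteration, state) of the newest surviving checkpoint, or None."""
        best = None
        for directory, _ in self.levels:
            path = self._path(directory)
            if os.path.exists(path):
                with open(path, "rb") as f:
                    candidate = pickle.load(f)
                if best is None or candidate[0] > best[0]:
                    best = candidate
        return best


def demo():
    local_dir = tempfile.mkdtemp()   # stands in for node-local storage
    global_dir = tempfile.mkdtemp()  # stands in for a parallel file system
    ckpt = MultiLevelCheckpointer([(local_dir, 2), (global_dir, 5)])
    state = {"sum": 0}
    for it in range(1, 9):
        state["sum"] += it
        ckpt.maybe_checkpoint(it, state)
    # Simulate a node failure that destroys the local checkpoint.
    os.remove(os.path.join(local_dir, "ckpt.pkl"))
    return ckpt.recover()
```

In this sketch the local level holds the checkpoint from iteration 8, but once it is lost, recovery falls back to the global checkpoint from iteration 5. A real library would additionally coordinate checkpoints across processes and protect them with replication or encoding, which is omitted here.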
Error Classification
The classification of hardware errors is an essential prerequisite for handling them.
Task-based resiliency
To address Uncorrected Recoverable Errors, OmpSs is extended with lightweight task-based checkpoint/restart functionality.
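The principle of task-based checkpoint/restart can be sketched as follows. This is a conceptual Python illustration, not OmpSs code: `RecoverableError`, `run_task_with_restart`, and `make_flaky_task` are hypothetical names, and the fault is simulated by raising an exception. The sketch assumes that a task's inputs are snapshotted before it runs, so that after a recoverable error only that task is rolled back and re-executed rather than the whole application.

```python
import copy


class RecoverableError(Exception):
    """Hypothetical stand-in for an uncorrected but recoverable error."""


def run_task_with_restart(task, data, max_retries=3):
    """Run task(data) in place; on RecoverableError restore the inputs and retry.

    Returns the number of restarts that were needed.
    """
    snapshot = copy.deepcopy(data)  # checkpoint covers only this task's data
    for attempt in range(max_retries + 1):
        try:
            task(data)
            return attempt
        except RecoverableError:
            # Roll back any partial updates made before the fault.
            data.clear()
            data.update(copy.deepcopy(snapshot))
    raise RuntimeError("task did not complete within the retry budget")


def make_flaky_task(failures):
    """Build a task that faults `failures` times before succeeding."""
    remaining = {"n": failures}

    def task(data):
        data["x"] += 1  # partial update that must be undone on a fault
        if remaining["n"] > 0:
            remaining["n"] -= 1
            raise RecoverableError("simulated fault")
        data["y"] = data["x"] * 2

    return task


def demo():
    data = {"x": 1}
    restarts = run_task_with_restart(make_flaky_task(2), data)
    return restarts, data
```

Because the snapshot is limited to one task's data, the cost of both the checkpoint and the restart stays small compared with a full application checkpoint, which is what makes the approach lightweight.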
Uncorrected Errors
To recover from Uncorrected Errors (UC), various techniques are implemented.