Improving resiliency with the SCR (Scalable Checkpoint/Restart) library

 

The SCR library

Scalable Checkpoint/Restart (SCR) is a library for application checkpointing. It supports multi-level checkpointing and redundancy (buddy checkpointing). The application developer lets SCR decide whether a checkpoint is necessary. SCR caches the checkpoint data in fast local storage on the compute nodes, which keeps checkpointing and restarting fast as the number of nodes grows.
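The idea of buddy checkpointing can be illustrated with a small, self-contained sketch. It is not SCR's implementation: the ring-shaped buddy assignment, the NodeStore type and the constants NODES and STATE_LEN are assumptions made purely for illustration. Each node caches its own checkpoint in node-local storage and mirrors it to one partner node, so the data survives the loss of a single node.

```c
#include <assert.h>
#include <string.h>

#define NODES 4        /* hypothetical number of compute nodes */
#define STATE_LEN 16   /* hypothetical size of one checkpoint in bytes */

/* Node-local storage: a node's own checkpoint plus a copy of its neighbour's. */
typedef struct {
    char own[STATE_LEN];    /* latest checkpoint of this node */
    char buddy[STATE_LEN];  /* redundant copy held on behalf of another node */
    int  alive;             /* 0 once the node and its local storage are lost */
} NodeStore;

/* Node that holds the redundant copy for `rank`: its right neighbour in a ring. */
static int buddy_of(int rank) {
    assert(rank >= 0 && rank < NODES);
    return (rank + 1) % NODES;
}

/* Cache a checkpoint locally and mirror it to the buddy node. */
static void save_checkpoint(NodeStore nodes[NODES], int rank, const char *state) {
    assert(strlen(state) < STATE_LEN);
    strcpy(nodes[rank].own, state);
    strcpy(nodes[buddy_of(rank)].buddy, state);
}

/* Return the checkpoint of `rank`; fall back to the buddy copy if the node
 * was lost. Returns NULL if the node and its buddy are both gone. */
static const char *load_checkpoint(const NodeStore nodes[NODES], int rank) {
    if (nodes[rank].alive) {
        return nodes[rank].own;
    }
    int b = buddy_of(rank);
    return nodes[b].alive ? nodes[b].buddy : NULL;
}
```

In this sketch a single node failure is recoverable from local storage alone. Only if a node and its buddy fail together is the data lost, which is the case a slower, more durable checkpoint level would have to cover in a multi-level scheme.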

 

Increasing resiliency with SCR

The SeisSol application from TU Munich, which LRZ works on within the project, uses SCR to increase the code’s resiliency. Only a few SCR calls have to be added, e.g. SCR_Initialize(…) or SCR_Need_Checkpoint(int *flag). Integrating SCR improves the checkpointing strategy and makes the application robust against hardware failures.
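The control flow that such calls add to a time-stepping code can be sketched as follows. So that the example compiles without MPI or SCR installed, it deliberately uses hypothetical stand-ins (DemoCkpt, demo_init, demo_need_checkpoint, demo_complete_checkpoint) rather than SCR's own functions. The fixed-interval policy is also an assumption; a real library may base the decision on other criteria. The exact SCR names and signatures should be taken from the SCR documentation.

```c
#include <assert.h>

/* Hypothetical stand-in for the checkpoint library's state; not part of SCR. */
typedef struct {
    int interval;          /* a checkpoint is due every `interval` time steps */
    int steps_since_last;  /* time steps computed since the last checkpoint */
    int last_saved_step;   /* step stored in the newest checkpoint, -1 if none */
} DemoCkpt;

static void demo_init(DemoCkpt *c, int interval) {
    assert(interval > 0);
    c->interval = interval;
    c->steps_since_last = 0;
    c->last_saved_step = -1;
}

/* The library, not the application, decides: *flag becomes 1 if a checkpoint is due. */
static void demo_need_checkpoint(DemoCkpt *c, int *flag) {
    c->steps_since_last++;
    *flag = (c->steps_since_last >= c->interval);
}

/* Record that the application has finished writing its state for `step`. */
static void demo_complete_checkpoint(DemoCkpt *c, int step) {
    c->last_saved_step = step;
    c->steps_since_last = 0;
}

/* Time-step loop of a hypothetical solver; returns the number of checkpoints taken. */
static int run_solver(DemoCkpt *c, int first_step, int last_step) {
    int taken = 0;
    for (int step = first_step; step <= last_step; step++) {
        /* ... compute one time step here ... */
        int flag = 0;
        demo_need_checkpoint(c, &flag);
        if (flag) {
            /* ... write the application state here ... */
            demo_complete_checkpoint(c, step);
            taken++;
        }
    }
    return taken;
}
```

The solver loop only asks whether a checkpoint is due and writes its state when told to; the scheduling policy stays inside the library and can change without touching the application.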

Measurements show that the overhead produced by SCR is low. The ability to restart saves a considerable amount of time: after a failure, the application can resume from the last checkpoint instead of starting the run all over again.
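The saving can be made concrete with a little arithmetic. The sketch below assumes, purely for illustration, that checkpoints are taken at every multiple of a fixed step interval and that failed_step counts the time steps completed before the failure. The function names are hypothetical and the numbers in the usage note are not measurements.

```c
#include <assert.h>

/* Step stored in the newest checkpoint taken at or before `failed_step`
 * (0 if no checkpoint has been taken yet). */
static long last_checkpoint_step(long failed_step, long interval) {
    assert(failed_step >= 0 && interval > 0);
    return (failed_step / interval) * interval;
}

/* Time steps that must be recomputed after a failure. Without checkpoints the
 * whole run is repeated; with them, only the steps since the last checkpoint. */
static long steps_to_redo(long failed_step, long interval, int have_checkpoints) {
    if (!have_checkpoints) {
        return failed_step;
    }
    return failed_step - last_checkpoint_step(failed_step, interval);
}
```

For a failure after 9500 completed steps with a checkpoint every 1000 steps, steps_to_redo(9500, 1000, 1) gives 500, whereas the same failure without checkpoints costs all 9500 steps.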