Resiliency & Fault Tolerance for HPC

Category
Trainings
Date
2016-04-20 09:00 - 2016-04-21 15:00
Venue
Barcelona Supercomputing Centre - 08034 Barcelona
Province of Barcelona, Spain

The third DEEP-ER training was co-organised with the Mont-Blanc project at Barcelona Supercomputing Centre and focused on HPC strategies for resiliency and fault tolerance.

Alongside a series of introductory talks, the focus was clearly on hands-on workshop sessions with the HPC application developers from both projects.

Our application teams and software experts benefited tremendously from this collaboration activity and made the most of the short time available.

  • The application by partner Inria on ‘Human exposure to electromagnetic fields’ has already progressed quite far in integrating SIONlib and SCR (Scalable Checkpoint-Restart). The team focused on OmpSs and benefited from the on-site support by partner BSC.
  • KU Leuven successfully integrated SCR into the mock-up of their space weather code and worked with the SIONlib experts to further extend the I/O functionality.
  • For the oil exploration code from BSC (FWI = full waveform inversion), discussions on how to tackle current I/O challenges were extremely useful.

In addition, Mont-Blanc and DEEP-ER experts discussed how technologies used in one project could be integrated into the other and exchanged some promising first ideas.

A detailed agenda can be found here.

 

The training material is available here:

 

Session on SIONlib

Slides for an in-depth introduction to 'Parallel Task Local File I/O with SIONlib' are provided by Jülich Supercomputing Centre.

Session on Scalable Checkpoint-Restart (SCR)

 

 
 
