Reducing network-wide data movement with the network attached memory (NAM).

The solution

libNAM Checkpoint Restart

The DEEP-ER project exploits the ability of the FPGA in the NAM to create XOR sets. The general idea is to do XOR checkpointing by storing parity information on the NAM. In the configuration phase, all necessary data, such as node IDs, memory addresses and byte counts, are registered in the NAM via RRA, the Remote Register Access of the EXTOLL network. The XOR calculation can then be triggered at any time: the NAM fetches the data from all nodes and calculates the parity information with the help of the FPGA. Each node must also store its data locally, because a later reconstruction needs the checkpoint data of the surviving nodes.
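The parity calculation described above can be illustrated in software. The following is a minimal sketch, not the libNAM API: the function name `xor_parity` and the sample buffers are hypothetical, and the loop only mimics in Python what the FPGA computes after fetching the registered memory regions. It assumes every node registers the same byte count.

```python
def xor_parity(buffers):
    """Return the byte-wise XOR of equally sized checkpoint buffers."""
    if not buffers:
        raise ValueError("need at least one buffer")
    length = len(buffers[0])
    if any(len(buf) != length for buf in buffers):
        raise ValueError("all buffers must have the same byte count")
    parity = bytearray(length)  # starts as all zeros, the XOR identity
    for buf in buffers:
        for i, byte in enumerate(buf):
            parity[i] ^= byte
    return bytes(parity)


# Hypothetical checkpoint data held locally on three nodes (node ID -> bytes).
node_data = {
    0: b"\x01\x02\x03\x04",
    1: b"\x10\x20\x30\x40",
    2: b"\xff\x00\xff\x00",
}

# Only this parity block would be kept on the NAM.
parity = xor_parity(list(node_data.values()))
print(parity.hex())  # prints "ee22cc44"
```

The parity block is as large as a single node's checkpoint, regardless of how many nodes belong to the XOR set.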


libNAM restart

If a node fails and the user code is restarted on a new node, the NAM can reconstruct the data of the failed node in four steps:

  1. All remaining nodes send their local checkpointing data to the NAM.
  2. The FPGA in the NAM combines the data from all remaining nodes with the stored parity information and thereby rebuilds the data of the failed node.
  3. The rebuilt data set is stored on the NAM, where the newly participating node can read it out.
  4. The NAM informs this node when the calculation is finished, so that it can read out its restart data.
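The reconstruction in the steps above relies on the fact that XOR-ing the parity with the data of all surviving nodes yields the missing data. The following is a minimal sketch under stated assumptions: `xor_bytes`, `rebuild_lost` and the sample checkpoints are hypothetical names for illustration, not part of libNAM, and the work that the FPGA performs is simulated in Python.

```python
def xor_bytes(a, b):
    """Byte-wise XOR of two equally sized buffers."""
    if len(a) != len(b):
        raise ValueError("buffers must have the same byte count")
    return bytes(x ^ y for x, y in zip(a, b))


def rebuild_lost(parity, surviving):
    """Combine the stored parity with the surviving nodes' checkpoints."""
    rebuilt = parity
    for buf in surviving:
        rebuilt = xor_bytes(rebuilt, buf)
    return rebuilt


# Hypothetical local checkpoints of three nodes, all with the same byte count.
checkpoints = [b"node0-data", b"node1-data", b"node2-data"]

# Parity as it would have been computed before the failure.
parity = bytes(len(checkpoints[0]))
for chunk in checkpoints:
    parity = xor_bytes(parity, chunk)

# Node 1 fails; steps 1 and 2: the remaining nodes send their data and the
# parity is combined with it.
failed = 1
surviving = [c for i, c in enumerate(checkpoints) if i != failed]
restored = rebuild_lost(parity, surviving)

# Steps 3 and 4: the replacement node would read this buffer from the NAM.
assert restored == checkpoints[failed]
```

A single parity block of this kind can recover the data of only one failed node per XOR set; if two nodes of the same set fail, their data cannot be separated.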