Autumn 2015: we are basically halfway through the project, which also means things are shaping up, slowly but surely. Read on for an update on what we have been working on lately.
Hardware development is still hard work
A lesson already learned in the DEEP project: hardware development takes its toll. For DEEP-ER, too, we had to adjust our project plan for various reasons. On the bright side, we have come a long way already, and by leveraging our previous experience we came up with more than decent workarounds for the challenges we faced. But let’s recap from the beginning:
In DEEP-ER we set out to enhance the Cluster-Booster concept of our predecessor project DEEP by implementing the latest technology and experimenting with innovative memory technologies like non-volatile memory (NVM) and network attached memory (NAM). The upgraded architecture has now been completely designed, and components for the prototype are under development. Our experts have also chosen the configuration for the NVM cards based on the requirements of the HPC application developers in the team. The NAM prototype is also well on its way. Here again, the hardware, software and application developer teams are cooperating intensively on the use cases for the NAM.
All in all, we are working towards a prototype system of 8 Intel Xeon cards for the Cluster, 32 to 64 Intel Xeon Phi (2nd generation) nodes for the Booster, an equivalent number of NVM cards and 2 NAM devices, all interconnected by the EXTOLL TOURMALET network. Integration is done via a Eurotech Aurora blade architecture.
Nevertheless, the actual realisation of the prototype will take some more time. That is why the team is currently deploying a Software Development Vehicle (SDV), so that the work of the software and application developers is not delayed. The SDV is already installed in the computer room at JSC and is being fine-tuned so that software and application developers can use it.
Software going strong
On the software side, DEEP-ER focuses on resiliency and I/O topics. By now, the resiliency and I/O layers have been fully defined and the necessary interfaces identified. An analytical model has been developed to evaluate how our resiliency techniques will work at Exascale level. Furthermore, a cache layer has been implemented in the file system, and application-based checkpointing and task-based resiliency features are under way.
Regarding task-based resiliency, our expert team decided to follow a concept similar to the one currently favoured to become part of the next MPI standard. The basic idea is to use MPI error handlers to report broken connections and to interpret these as process failures on the remote side, while keeping the surviving processes alive and healthy. As a first step towards this, ParaStation MPI has been extended to detect, isolate and clean up failed child processes in such a manner that parent processes can continue to work properly. Once integrated into the task-based checkpoint/restart mechanism of OmpSs, this feature will allow failed offloaded tasks to be restarted transparently via MPI_COMM_SPAWN, avoiding the overhead of a full application recovery.
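To illustrate the idea of restarting only the failed offloaded task instead of the whole application, here is a minimal sketch. It is a pure-Python simulation and does not use MPI, ParaStation MPI or OmpSs; every name in it (`BrokenConnection`, `run_offloaded_task`, `run_with_task_restart`, `fail_plan`) is hypothetical and only stands in for what an error handler and a respawn would do in the real stack.

```python
class BrokenConnection(Exception):
    """Stands in for the broken connection an error handler would report."""


def run_offloaded_task(task, attempt, fail_plan):
    """Simulate one offloaded task; fail if (task, attempt) is in fail_plan."""
    if (task, attempt) in fail_plan:
        raise BrokenConnection(f"task {task} lost on attempt {attempt}")
    return task * task  # placeholder for the real computation


def run_with_task_restart(tasks, fail_plan, max_attempts=3):
    """Run all tasks; on a failure, rerun only the affected task.

    Results of tasks that already finished are kept, so a failure never
    forces the whole run to start again.
    """
    results = {}
    failures = 0
    for task in tasks:
        for attempt in range(max_attempts):
            try:
                results[task] = run_offloaded_task(task, attempt, fail_plan)
                break  # task succeeded, move on to the next one
            except BrokenConnection:
                # Treated as a failure of the remote process: the parent
                # survives and simply tries the same task again.
                failures += 1
        else:
            raise RuntimeError(f"task {task} failed {max_attempts} times")
    return results, failures


if __name__ == "__main__":
    # Task 2 fails on its first attempt only.
    results, failures = run_with_task_restart([1, 2, 3], {(2, 0)})
    print(results, failures)
```

In this sketch the parent loop plays the role of the surviving process, and the retry plays the role of the respawn: only the failed task is repeated, while completed results stay untouched.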
Overall, the software implementation in all the layers is progressing fast and benefits greatly from the close cooperation of the different groups inside the project. Examples include the synchronous and asynchronous implementations of the BeeGFS file system, extensions to the SCR scalable checkpointing library, adaptations of the parallel I/O library SIONlib, and developments in the I/O software E10. The integration of all the components with each other to achieve a full I/O and resiliency software stack is also on its way.
Applications: Ready for the SDV
In the meantime, the applications team has completed the groundwork: the structural analysis of the code and the partitioning between code parts suitable for the Cluster and those best run on the Booster. In addition, a lot of work has already been put into code modernisation, such as vectorisation and parallelisation. As experience from the DEEP project shows, this is tedious work, but it pays off not only on DEEP/-ER platforms but on heterogeneous platforms in general. By now, all seven of our real-life HPC applications are ready to be ported and optimised on the SDV.
All in all, quite a few tricky yet exciting tasks lie ahead of us. Let’s continue our joint effort!
Will you be at SC15 in Austin from Nov 15 to 20? Our experts will be on site and happy to discuss any questions you might have about our project in more detail. Come and visit our European Exascale Projects booth #197.
Looking forward to seeing you and speaking soon,
Your DEEP-ER Project Team