Since our last status update, issued just before SC15 half a year ago, we have made a lot of progress in our three main fields of research: hardware, software and applications. These advances would not have been possible without intensive co-design between the individual areas. Below is our current project status.
Hardware bring-up well under way
The SDV, the software development vehicle, went into operation at Jülich Supercomputing Centre back in January. The main idea is to provide DEEP-ER users with a development platform, so that software and application developers can work on their codes while the hardware team brings up the final DEEP-ER prototype. The SDV has three main parts, all connected via an EXTOLL Tourmalet A2 interconnect:
Xeon part (or SDV-Cluster)
- 16 dual-socket Intel® Xeon® E5-2680 nodes
- 16 NVMe cards Intel DC P3400, 400 GB each (one in each server)
Xeon Phi part (or SDV-Booster)
- 8 Intel® Xeon Phi (KNL) nodes
- 2 NVMe cards Intel DC P3400 (integrated in two of the KNL boards)
Storage
- 2 storage servers (spinning disks, 57 TB) and
- 1 metadata server (SSDs)
Lastly, one network attached memory (NAM) board was integrated in early May, so the SDV now closely resembles the final prototype.
Updates on hardware components
The first Intel Xeon Phi (KNL) boards have arrived at the project. First test runs on the SDV have shown very good performance, thanks to the code modernization and improvements carried out on the DEEP-ER applications over the past months.

Concerning the actual prototype, substantial progress has been made on the Eurotech Aurora Blade architecture and its adaptation to the DEEP-ER project. The design has been finalised, first results from the backplane testing are available, the mechanics and installation are in preparation, and testing of the root card will be completed soon. Once all tests have been completed, all components will be signed off for production. Installation of the prototype is expected in autumn this year.
With respect to the network, DEEP-ER will use the A3 version of the Tourmalet cards, which EXTOLL has just released and which offers a bandwidth of 100.8 Gbit/s per link. Together with the application developers, the hardware experts are currently making the final decisions on the network topology.
Finally, the NAM prototype has been fully verified and the first release of the libnam library is available. The library is designed to make the NAM devices easy to use for both middleware and application developers. The NAM experts are currently implementing a second EXTOLL link to double the bandwidth, but this is still work in progress.
Smooth progress on software developments
The two software teams have already come a long way towards achieving their objectives for the project. Both the I/O team and the resiliency team have finalized most of their developments:
With respect to I/O, the team has come a lot closer to the goal of making I/O scalable at all levels of usage. This development is guided by the I/O benchmarks, the resiliency scheme and, of course, the requirements voiced by the application developers. The current status is:
- For BeeGFS the asynchronous API has been implemented. The resiliency features have been integrated with BeeGFS. All software is ported to KNL and is currently in validation.
- The buddy-checkpointing version of SIONlib has been implemented as well. Tests are under way and first results are available.
- The Exascale10 extension of existing MPI-IO hints, along with the corresponding implementation of the new DEEP-ER enabled I/O features in the ROMIO layer, has been completed.
The I/O experts are currently benchmarking the different I/O layers with JUBE, a benchmarking environment developed at JSC. JUBE has been set up on the SDV and the first applications have been integrated; the rest will follow to enable continuous benchmarking. Future work will compare the benchmarking results and evaluate which I/O functionalities are most useful for the different types of applications.
Regarding the resiliency architecture, a comprehensive and complex set of methods has been developed which at the same time guarantees ease of use for the DEEP-ER users. The various techniques enable applications to recover from both soft and fatal errors.

To achieve this, OmpSs has been enhanced and integrated with the I/O layers. A more detailed overview of current developments is available on the project's homepage.
The JUBE benchmarking environment has also been used for resiliency. The test results will help to monitor the error-recovery performance and the overheads of the different resilience techniques developed in the project.
Applications: Using the DEEP projects' prototypes
By now, all application teams have finalized their code analysis and worked on code improvements. The developers have ported their codes to the SDV and run first tests, which show the performance and scalability improvements achieved with these code optimisations. Tests continue, now focusing on the resiliency and I/O mechanisms as well as on porting applications to KNL.
Intensive co-design for successful project collaboration
Most of the work presented here is based on intensified co-design efforts over the last few months. The application developers worked together with hardware and software experts to make effective use of the non-volatile memory (NVM) devices that are part of the elaborate DEEP-ER memory hierarchy.

Additionally, the first use case for the network attached memory (NAM) has been identified and is about to be implemented: the NAM will extend the resiliency layers by serving as the device for calculating and storing parity checkpoints. ParaStation MPI has been enhanced so that applications achieve better parallel efficiency on the SDV. Discussions are currently ongoing between application developers and hardware experts on the network topology and the ratio of Booster node links to Cluster node links, to name just a few of the many and very different co-design activities going on in this phase of the project.
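As an illustration of the parity-checkpoint idea mentioned above, the following sketch shows how a single lost checkpoint can be rebuilt from a parity block. It assumes a simple byte-wise XOR parity over equally sized checkpoints; the text does not state which parity scheme the NAM will use, and the function names here are hypothetical and unrelated to libnam.

```python
from functools import reduce


def xor_blocks(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equally sized blocks."""
    return bytes(x ^ y for x, y in zip(a, b))


def compute_parity(checkpoints):
    """XOR all checkpoints into one parity block (assumed scheme)."""
    if len({len(c) for c in checkpoints}) != 1:
        raise ValueError("checkpoints must be non-empty and of equal size")
    return reduce(xor_blocks, checkpoints)


def rebuild_lost(surviving, parity):
    """Recover the single missing checkpoint from the survivors and the parity."""
    return reduce(xor_blocks, surviving, parity)


# Usage: three nodes checkpoint, the parity is kept on a separate device,
# then the checkpoint of the middle node is lost and rebuilt.
checkpoints = [b"node0-data", b"node1-data", b"node2-data"]
parity = compute_parity(checkpoints)
rebuilt = rebuild_lost([checkpoints[0], checkpoints[2]], parity)
```

With this kind of scheme, the extra storage is one block for the whole group rather than a full copy per node, but only one lost checkpoint per group can be rebuilt.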
What is left for the final 9 months of the project is the production and full bring-up of the prototype, so that applications can run their tests on the full system. We are confident that we will achieve this on time, and we will update you again before SC16 at the latest.
In the meantime, if you are at ISC’16 in Frankfurt, our experts will be on site and happy to discuss any detailed questions you might have about our project. So please do come and visit us at the European Exascale Projects booth #1340.
We look forward to seeing you and speaking with you soon,
Your DEEP-ER Project Team