
Time flies! It’s already been half a year since our last update right before ISC’16 in Frankfurt. And only a couple of months to go until the official DEEP-ER project end in late March 2017.

As usual, the recent developments have been guided by our stringent co-design approach – truly great teamwork thanks to all colleagues involved.

The DEEP-ER Prototype System

Since our last update, all remaining important technical specifications have been finalised, and Eurotech plans to have the system fully installed by mid-December:

[Image: DEEP-ER Booster Node]
  • The Aurora DEEP-ER Booster will consist of 72 nodes, each comprising an Intel Xeon Phi 7210 CPU and 96 GB of memory. This leads to a peak performance of about 210 TFLOP/s and a total memory of 8 TB for the whole DEEP-ER prototype system including the Cluster (a quick sanity check follows this list).
  • The rack is already installed at Jülich after completion of the necessary adaptations to the infrastructure in the computer room. It will hold 4 chassis with 18 KNL boards each.
  • The first chassis will be shipped by Eurotech at the end of November, and installation of the remaining three chassis is planned to finish by mid-December. Each chassis comes with a root card holding 18 TOURMALET NICs and 18 P3700 NVMe devices (i.e. one NIC and one NVMe device per KNL node).
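
For those who like to double-check the numbers: the small C snippet below reproduces the Booster part of the performance figure from the publicly documented Xeon Phi 7210 specifications (64 cores, 1.3 GHz base clock, 32 double-precision FLOP per cycle with AVX-512). It is a back-of-the-envelope sketch, not project code; the Cluster part accounts for the remainder of the quoted system totals.

    /* Back-of-the-envelope check of the Booster peak performance and
     * memory, using the public Intel Xeon Phi 7210 specifications
     * (sketch only, not project code). */
    #include <stdio.h>

    int main(void) {
        const int    nodes          = 72;   /* Booster nodes */
        const int    cores_per_node = 64;   /* Xeon Phi 7210 cores */
        const double ghz            = 1.3;  /* base clock */
        const int    flops_per_cyc  = 32;   /* 2 AVX-512 FMA units x 8 DP lanes x 2 */
        const int    gb_per_node    = 96;

        double tflops = nodes * cores_per_node * ghz * flops_per_cyc / 1000.0;
        double tbytes = nodes * gb_per_node / 1024.0;

        /* Prints ~192 TFLOP/s and ~6.8 TB for the Booster alone; the
         * Cluster contributes the rest of the ~210 TFLOP/s / 8 TB totals. */
        printf("Booster peak: %.0f TFLOP/s, memory: %.1f TB\n", tflops, tbytes);
        return 0;
    }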


Network topology and Cluster-Booster connection

These are prime examples of co-design at work: our hardware experts took the decisions on the network topology and on how to connect Cluster and Booster in intensive discussions with our application developers. The result:

  • The network will be a 6x3x4 grid on the Booster side and a 2x2x4 torus on the Cluster side. To handle this complex topology, the EXTOLL network management application EMP has been extended beyond the DEEP version: it now also supports the hierarchical topologies that arise from connecting to the Cluster part of the system (see the addressing sketch after this list).
  • Nine cables will connect the Cluster to the Booster side of the system, achieving a 1:8 ratio. This provides the Cluster-Booster communication bandwidth required by the applications while keeping the number of long cables in the system as low as possible.
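
To make the two topologies a bit more concrete, here is a minimal sketch of neighbor addressing in a 3D grid versus a 3D torus. The dimensions match the prototype as listed above, but the addressing scheme is purely illustrative; EMP’s actual topology and routing management is far more involved.

    /* Illustrative neighbor addressing in a 3D grid vs. a 3D torus:
     * a grid clips at its edges, a torus wraps around. */
    #include <stdio.h>

    /* Linear id of the neighbor at offset d along dimension `dim` of a
     * topology with extents n[3]; returns -1 if it falls off a grid. */
    static int neighbor(const int c[3], const int n[3], int dim, int d, int torus) {
        int p[3] = { c[0], c[1], c[2] };
        p[dim] += d;
        if (torus)
            p[dim] = (p[dim] % n[dim] + n[dim]) % n[dim];  /* wrap around */
        else if (p[dim] < 0 || p[dim] >= n[dim])
            return -1;                                     /* edge of the grid */
        return (p[2] * n[1] + p[1]) * n[0] + p[0];
    }

    int main(void) {
        const int booster[3] = { 6, 3, 4 };   /* 6x3x4 grid  -> 72 nodes */
        const int cluster[3] = { 2, 2, 4 };   /* 2x2x4 torus -> 16 nodes */
        const int origin[3]  = { 0, 0, 0 };

        /* On the Booster grid, stepping -x from the origin leaves the grid. */
        printf("Booster -x neighbor of (0,0,0): %d\n",
               neighbor(origin, booster, 0, -1, 0));   /* prints -1 */
        /* On the Cluster torus, the same step wraps around to x = 1. */
        printf("Cluster -x neighbor of (0,0,0): %d\n",
               neighbor(origin, cluster, 0, -1, 1));   /* prints 1 */
        return 0;
    }

The wraparound links are what distinguish a torus from a plain grid: they halve the worst-case hop distance along each dimension.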


Fully functioning software stack

On the two focus topics in the software area – resiliency and I/O – huge progress has been made since our last update. The DEEP-ER software stack is now functionally complete, and the teams from JSC, BSC, EXTOLL, Fraunhofer, Intel, and ParTec are working in close collaboration on the last bits and pieces to achieve maximum performance.

I/O stack

  • The BeeGFS team is concentrating on providing native support for the EXTOLL protocol, which will allow for maximum communication performance to the file system.
  • In the coming weeks, the team will work on the NAM interface for the checkpointing use case.
  • A new SIONlib version, including the buddy-checkpointing functionality, has been released very recently.
  • In a co-design discussion between the two software teams (I/O and resiliency), additional features for the cache API were identified and added. The result: BeeGFS now supports the calculation of CRC checksums to speed up the resiliency checks (see the illustrative sketch after this list). The I/O team is currently supporting the application developers in implementing against the API.
  • The JUBE benchmarking environment is installed on the SDV, and further applications are currently being integrated to enable continuous benchmarking.
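
To illustrate the checksum idea mentioned above: verifying a checkpoint block against a stored CRC is much cheaper than re-reading a redundant copy. The snippet below is a generic bitwise CRC-32 routine (IEEE polynomial) for illustration only; it is not the BeeGFS implementation or API.

    /* Generic CRC-32 (IEEE polynomial, bitwise variant) over a data
     * buffer -- illustrates the kind of checksum used to validate
     * checkpoint data; this is not BeeGFS code. */
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    static uint32_t crc32(const unsigned char *buf, size_t len) {
        uint32_t crc = 0xFFFFFFFFu;
        for (size_t i = 0; i < len; i++) {
            crc ^= buf[i];
            for (int k = 0; k < 8; k++)          /* process one bit at a time */
                crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
        }
        return ~crc;
    }

    int main(void) {
        const char *chunk = "checkpoint data";
        uint32_t sum = crc32((const unsigned char *)chunk, strlen(chunk));
        /* Store `sum` alongside the checkpoint; on restart, recompute
         * and compare to detect corruption before restoring state. */
        printf("crc32 = 0x%08X\n", sum);
        return 0;
    }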


Resiliency features

  • The Scalable Checkpoint/Restart library (SCR) has been extended to leverage DEEP-ER I/O technologies such as SIONlib buddy checkpointing and the synchronous and asynchronous flushing API provided by BeeGFS.
  • An analytical model has also been integrated into SCR to automatically calculate the optimal checkpointing frequency based on application and system characteristics.
  • Up next is the extension of SCR to enable use of the NAM, as well as support for the application developers in adopting the resilience techniques. Application users will be able to utilise the NAM transparently, as the SCR and SIONlib stacks will take care of its handling (a minimal usage sketch follows this list).
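
For readers unfamiliar with SCR, the typical checkpoint loop looks roughly like the sketch below (generic SCR API usage as documented by LLNL, not DEEP-ER-specific code). The point of the layered design is visible here: buddy checkpointing, the BeeGFS flushing, and later the NAM all plug in underneath this API, so application code stays unchanged.

    /* Minimal sketch of a typical SCR checkpoint loop (generic SCR
     * usage, not DEEP-ER-specific code; error handling omitted). */
    #include <mpi.h>
    #include <scr.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        SCR_Init();

        for (int step = 0; step < 100; step++) {
            /* ... compute one timestep ... */

            int need = 0;
            SCR_Need_checkpoint(&need);   /* SCR decides when, e.g. via its cost model */
            if (!need)
                continue;

            SCR_Start_checkpoint();
            char path[SCR_MAX_FILENAME];
            SCR_Route_file("ckpt.dat", path);  /* SCR picks the storage tier */

            FILE *f = fopen(path, "w");
            int valid = (f != NULL);
            if (f) {
                fprintf(f, "step %d\n", step); /* application state goes here */
                fclose(f);
            }
            SCR_Complete_checkpoint(valid);
        }

        SCR_Finalize();
        MPI_Finalize();
        return 0;
    }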

Use cases for DEEP-ER technologies: Our seven real-world HPC applications ready for deployment

Every DEEP-ER technological development – be it hardware like NVMe and the NAM, or software like the resiliency features and I/O improvements – is tested by several of the applications in our portfolio. For the time being, the majority of tests are done on the DEEP-ER SDV (software development vehicle), which largely resembles the DEEP-ER prototype: it contains the Cluster part of the prototype (realised with 16 Haswell nodes) and 8 KNL nodes to mimic the Booster side.


Application use cases

In terms of NVMe (non-volatile memory, available on the SDV), considerable improvements have been achieved in the first use case with the GERShWIN application from INRIA (a 14% to 70% reduction in wallclock runtime, depending on the problem simulated). Additionally, Barcelona’s oil exploration code (FWI) has been set up for an NVMe use case. First tests show significant speedups when using local NVMe storage, with the I/O time (traditionally about 50% of the execution time) reduced to almost zero.

With respect to the I/O stack, E10 has been fully implemented, and a first use case has been defined for the earthquake source dynamics application by LRZ. Additionally, the team at KU Leuven has fully integrated SIONlib into their space weather application.

Interested in learning even more about the DEEP-ER project? We’d love to see you in person at SC16 and discuss the details! So please do stop by our booth – this time we’re co-located at the booth of our coordinator, Jülich Supercomputing Centre (#2413).


Stay in touch & take care
Your DEEP-ER Project Team