One of the focus topics of the DEEP-ER project is I/O optimisation. In many massively parallel applications, parallel I/O limits performance. Because I/O has not evolved as quickly as computation in the past and probably will not in the future, even applications that have no I/O performance limitations today might run into I/O bottlenecks when scaling towards Exascale. To address this issue, the DEEP-ER project provides both software and hardware features in a comprehensive I/O stack.


I/O optimisation

This software stack for parallel I/O consists of three components: 

  • the parallel file system BeeGFS
  • the scalable I/O library SIONlib
  • Exascale10 for collective parallel I/O operations


BeeGFS is split into two tiers. The upper tier is a partitioned, non-coherent cache layer based on the NVMe devices (see below); it provides linear scalability and very high throughput. The lower tier is based on HDDs and offers high capacity for longer-term data storage. Users can control the staging of data between the tiers with BeeOND.
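The idea of staging data from the cache tier to the capacity tier in the background can be illustrated with a minimal Python sketch. It does not use BeeOND or any BeeGFS interface: the function name stage_out_async is hypothetical, and a plain file copy in a worker thread stands in for the actual staging mechanism.

```python
import shutil
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def stage_out_async(executor, cache_file, global_dir):
    """Schedule a background copy of cache_file into global_dir.

    Returns a Future, so the caller can keep computing and only wait
    for the copy when the data must be safe on the capacity tier.
    Generic stand-in for tier staging, not the BeeOND mechanism.
    """
    destination = Path(global_dir) / Path(cache_file).name
    return executor.submit(shutil.copy2, cache_file, destination)
```

A caller would create a ThreadPoolExecutor once, submit each finished output file, and call result() on the returned futures before the job ends.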

Parallel task-local I/O leads to a huge number of files in large-scale parallel applications. The SIONlib library makes it possible to read and write data from thousands of processors using a small number of files. Only small code adaptations are needed to use SIONlib: the open/close and read/write operations from C or Fortran have to be replaced with calls from the SIONlib API. The more MPI tasks take part in the I/O, the more the application benefits from SIONlib. SIONlib can potentially lower the writing time and also reduce the time for reading operations, as is the case in TurboRVB.
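The principle of mapping many task-local streams onto one physical file can be sketched in a few lines of Python. This is only a conceptual illustration and not the SIONlib API: the fixed CHUNK_SIZE, the function names, and the zero padding are all assumptions made for the example. Each task, identified by its rank, owns a region of the shared file at an offset derived from that rank.

```python
import os

CHUNK_SIZE = 64  # hypothetical fixed-size region reserved per task, in bytes


def write_task_chunk(path, rank, payload):
    """Write one task's payload into its own region of a shared file."""
    if len(payload) > CHUNK_SIZE:
        raise ValueError("payload exceeds the region reserved for this task")
    # Open without truncating so the regions of other tasks are preserved.
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    with os.fdopen(fd, "r+b") as f:
        f.seek(rank * CHUNK_SIZE)
        # Pad with zero bytes so every region has the same length.
        f.write(payload.ljust(CHUNK_SIZE, b"\0"))


def read_task_chunk(path, rank):
    """Read back the payload stored for the given task."""
    with open(path, "rb") as f:
        f.seek(rank * CHUNK_SIZE)
        # Strip the padding; payloads ending in zero bytes would be shortened.
        return f.read(CHUNK_SIZE).rstrip(b"\0")
```

With four tasks this produces one file of 4 × CHUNK_SIZE bytes instead of four separate files, which is the effect the text describes at much larger scale.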

In parallel applications the workload is distributed over all processes, and so is the I/O. Exascale10 (E10) is a parallel I/O mechanism that overcomes current collective I/O limitations. E10 provides wrappers for MPI_Init, MPI_Finalize, MPI_File_open and MPI_File_close. The MPIWRAP wrapper library can be used to exploit the SSD cache features provided by E10. With MPIWRAP, application developers can dynamically change the information passed to MPI-IO without modifying the application. This approach was used in Chroma from UREG during their I/O optimisation. Chroma’s I/O time could be reduced to up to 50% (large files with 7 GB) and 25% (smaller files with 10 MB). More information about MPIWRAP can be found in this paper (Congiu, 2015).

On the hardware side, the Non-Volatile Memory Express (NVMe) devices will help to optimise I/O performance. Each Booster Node will have its own NVMe devices. These are SSDs with a capacity of 400 GB that are attached directly via PCIe, which allows a higher bandwidth. When using the NVMe device, reading and writing should be much faster than with a conventional HDD. Only the paths to input and output files have to be changed.

  • /nvme/tmp/filename can be used for data that is only needed temporarily.
  • With /mnt/beeond/filename, the file is stored on the local NVMe device.

In the latter case, application developers can migrate the file asynchronously to the global file system with the help of BeeOND. No other code changes are needed. In some initial tests with the data processing pipeline from ASTRON, the achieved bandwidth increased from 490 GByte/s (global BeeGFS) to 658 GByte/s (NVMe), a gain of roughly 34%. More details about NVMe can be found in this document.
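Since only the file paths have to change, an application can keep the choice of storage target in one place. The following Python sketch shows one possible way to do that. The two path prefixes come from the list above; the target labels "scratch" and "beeond" and the helper name storage_path are hypothetical.

```python
import posixpath

# Path prefixes taken from the text; the target names are hypothetical labels.
STORAGE_PREFIXES = {
    "scratch": "/nvme/tmp",    # data that is only needed temporarily
    "beeond": "/mnt/beeond",   # file stored on the local NVMe device
}


def storage_path(filename, target):
    """Return the full path for filename on the chosen storage target."""
    if target not in STORAGE_PREFIXES:
        raise ValueError(f"unknown storage target: {target!r}")
    return posixpath.join(STORAGE_PREFIXES[target], filename)
```

The rest of the I/O code then opens storage_path("filename", "beeond") instead of a hard-coded location, so moving between storage targets does not touch the read and write calls.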