
A central pillar of the software stack in the DEEP projects is ParaStation MPI, an MSA-enabled implementation of the Message-Passing Interface (MPI) standard.


Since MPI is the most widely used programming standard for parallel applications, the extensions implemented in ParaStation MPI within the DEEP projects can benefit a broad range of applications. At the same time, ParaStation MPI remains fully MPI-3 compliant, and its DEEP-related extensions are designed to stay as close as possible to the current standard while still reflecting the peculiarities of the DEEP prototypes. This way, applications tuned to MSA environments remain portable and essentially compliant with the MPI standard.


MSA Extensions

ParaStation MPI makes affinity information available to applications running across different modules of the DEEP prototypes while adhering to the MPI interface. This way, applications may exploit the underlying hardware topology for further optimisations. These optimisations are not limited to the program flow but may likewise affect the communication patterns. For example, by using the new split type MPIX_COMM_TYPE_MODULE, applications can create module-specific MPI communicators.
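A minimal sketch of how an application might obtain such a module-specific communicator is shown below. It assumes an MSA-enabled ParaStation MPI that defines MPIX_COMM_TYPE_MODULE; error handling is omitted.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int world_rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

        /* Split MPI_COMM_WORLD into one communicator per MSA module
           (MPIX_COMM_TYPE_MODULE is a ParaStation MPI extension). */
        MPI_Comm module_comm;
        MPI_Comm_split_type(MPI_COMM_WORLD, MPIX_COMM_TYPE_MODULE, 0,
                            MPI_INFO_NULL, &module_comm);

        int module_rank, module_size;
        MPI_Comm_rank(module_comm, &module_rank);
        MPI_Comm_size(module_comm, &module_size);

        printf("world rank %d is rank %d of %d within its module\n",
               world_rank, module_rank, module_size);

        MPI_Comm_free(&module_comm);
        MPI_Finalize();
        return 0;
    }

Such a communicator can then be used, for example, to restrict collective operations or I/O aggregation to a single module.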


Additionally, ParaStation MPI itself applies application-transparent optimisations for modular systems, in particular regarding collective communication patterns. Based on topology information, collective operations such as Broadcast or Reduce can be performed hierarchically so that the inter-module communication (forming a potential bottleneck) can be reduced.
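To illustrate the idea behind such hierarchical collectives, a hand-written two-level reduction might look as follows, reusing the module communicator from the example above. This is purely illustrative and not the actual ParaStation MPI implementation, which applies such optimisations transparently when the application simply calls, e.g., MPI_Allreduce on MPI_COMM_WORLD.

    #include <mpi.h>
    #include <stdlib.h>

    /* Illustrative two-level Allreduce: reduce within each module first,
       combine the per-module results across modules, then broadcast back. */
    void hierarchical_allreduce(const double *sendbuf, double *recvbuf,
                                int count, MPI_Comm world, MPI_Comm module_comm)
    {
        int module_rank;
        MPI_Comm_rank(module_comm, &module_rank);

        /* Step 1: intra-module reduction (fast local network). */
        double *partial = malloc(count * sizeof(double));
        MPI_Reduce(sendbuf, partial, count, MPI_DOUBLE, MPI_SUM, 0, module_comm);

        /* Step 2: only the module leaders communicate across the
           (potentially slower) inter-module links. */
        MPI_Comm leaders;
        MPI_Comm_split(world, module_rank == 0 ? 0 : MPI_UNDEFINED, 0, &leaders);
        if (module_rank == 0)
            MPI_Allreduce(partial, recvbuf, count, MPI_DOUBLE, MPI_SUM, leaders);

        /* Step 3: distribute the final result within each module. */
        MPI_Bcast(recvbuf, count, MPI_DOUBLE, 0, module_comm);

        if (leaders != MPI_COMM_NULL)
            MPI_Comm_free(&leaders);
        free(partial);
    }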


NAM Integration

One distinctive feature of the DEEP-EST prototype is the Network Attached Memory (NAM): special memory regions that can be accessed directly via Put and Get operations from every node within the EXTOLL network. A NAM can take two forms: either a purpose-built plug-in board for the Fabri3 chassis, combining an FPGA with memory modules, or a software NAM, i.e., a lightweight service running on dedicated nodes with EXTOLL connectivity that provides the local node memory as NAM regions for others.


To make the use of the NAM more convenient and familiar for application programmers, one goal of the DEEP-EST project regarding ParaStation MPI has been the implementation and integration of an interface for accessing the NAM via MPI. This way, application programmers can use well-known MPI functions (in particular those of the MPI RMA interface) to access NAM regions much like other remote memory regions, in a standardised (or at least harmonised) fashion under the single roof of an MPI world.


For this purpose, the new PSNAM wrapper has been implemented, which links the MPI RMA interface to the management of, and access to, persistent memory regions such as those of the NAM. In addition to memory regions on the NAM, PSNAM also supports persistent shared memory regions on the compute nodes.
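The following sketch shows how such a region might be allocated and accessed through the standard MPI RMA interface. The info key and value used here are only placeholders indicating where PSNAM-specific hints would be passed; the actual key names and values are defined by the PSNAM documentation of ParaStation MPI.

    #include <mpi.h>

    /* Sketch: allocate a window that PSNAM may place on the NAM (or in
       persistent shared memory) and access it via standard MPI RMA calls.
       The info hint below is a hypothetical placeholder. */
    void nam_rma_example(MPI_Comm comm)
    {
        int rank;
        MPI_Comm_rank(comm, &rank);

        MPI_Info info;
        MPI_Info_create(&info);
        MPI_Info_set(info, "psnam_manifestation", "nam");  /* placeholder hint */

        double *base;
        MPI_Win win;
        MPI_Win_allocate(1024 * sizeof(double), sizeof(double), info, comm,
                         &base, &win);

        MPI_Win_fence(0, win);
        if (rank == 0) {
            double value = 42.0;
            /* Write one element into the region exposed by rank 1. */
            MPI_Put(&value, 1, MPI_DOUBLE, 1, 0, 1, MPI_DOUBLE, win);
        }
        MPI_Win_fence(0, win);

        MPI_Win_free(&win);
        MPI_Info_free(&info);
    }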


Regarding the management of NAM memory, an integration with the resource manager has been implemented so that PSNAM can also handle pre-allocated NAM regions, e.g., those provided by the NAM burst buffer plugin for Slurm, which has likewise been developed in the project.


CUDA Awareness

With a CUDA-aware MPI implementation, mixed CUDA+MPI applications can pass pointers to CUDA buffers residing in GPU memory directly to MPI functions, whereas a non-CUDA-aware MPI library would fail in such a case. Furthermore, a CUDA-aware MPI library can detect that a pointer references a GPU buffer and apply appropriate communication optimisations. For example, so-called GPUDirect capabilities can then be used to enable direct RDMA transfers to and from GPU memory.
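A minimal sketch of this usage pattern, assuming a CUDA-aware build of ParaStation MPI, two ranks and one GPU per rank (error checking omitted):

    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Allocate the message buffer directly in GPU memory. */
        const int count = 1 << 20;
        double *d_buf;
        cudaMalloc((void **)&d_buf, count * sizeof(double));

        /* With a CUDA-aware MPI, the device pointer can be handed to MPI
           directly; no staging through a host buffer is required. */
        if (rank == 0)
            MPI_Send(d_buf, count, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(d_buf, count, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);

        cudaFree(d_buf);
        MPI_Finalize();
        return 0;
    }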


ParaStation MPI supports CUDA awareness, e.g., for the DEEP-EST ESB, at different levels. On the one hand, GPU pointers can be passed to MPI functions. On the other hand, if an interconnect technology provides features such as GPUDirect, ParaStation MPI can bypass its own GPU-pointer handling and forward the required information to the lower software layers so that such hardware capabilities can be exploited.


One goal within the DEEP-EST project is the utilisation of GPUDirect together with EXTOLL via ParaStation MPI.


Gateway Support

The MSA concept allows distinct modules to use different network technologies. Therefore, ParaStation MPI provides means for message forwarding based on so-called gateway daemons. These daemons run on dedicated gateway nodes that are directly connected to the different networks of an MSA system; in the DEEP-EST prototype, for example, gateway nodes bridge between the InfiniBand and the EXTOLL network.


This gateway mechanism is transparent to the MPI processes, i.e., they see a common MPI_COMM_WORLD communicator spanning the whole MSA system. To achieve this, the mechanism introduces a new connection type, the gateway connection, alongside the fabric-native transports such as InfiniBand, EXTOLL, and shared memory. These virtual gateway connections map onto the underlying physical connections to and from the gateway daemons.


Transparency to the MPI layer is achieved by implementing the gateway logic entirely within the lower pscom layer, i.e., the high-performance point-to-point communication layer of ParaStation MPI. This way, more complex communication patterns implemented on top, e.g., collective communication operations, can be executed across different modules out of the box.


ParaStation MPI takes several measures to avoid bandwidth bottlenecks in cross-gateway communication. On the one hand, the module interface may comprise multiple gateway nodes; the MPI bridging framework can handle all of them and achieves a transparent load balancing among them on the basis of a static routing scheme. On the other hand, the upper MPICH layer of ParaStation MPI can retrieve topology information (cf. MSA extensions) to optimise complex communication patterns, e.g., to minimise inter-module traffic.


In the DEEP-EST project, several optimisations of the gateway protocol have been implemented. These optimisations leverage the RMA capabilities of EXTOLL (as the interconnect of the ESB and the DAM) in combination with message forwarding from and to the InfiniBand-equipped Cluster Module (CM). The gateway connections support the fragmentation of MPI messages into smaller chunks, so the gateway daemons can benefit from a pipelining effect: while message parts are still being received on one end of the connection, completely received fragments can already be forwarded to the destination node on the other end. Ideally, the data transfer from the source to the gateway daemon overlaps perfectly with the transfer from the gateway daemon to the destination.

Furthermore, the gateway protocol supports so-called rendezvous semantics. Instead of relying on intermediate, pre-allocated communication buffers, the MPI message is first announced by a small control message. The actual data transfer can then be conducted efficiently using the Remote Direct Memory Access (RDMA) capabilities of the hardware, avoiding the costly involvement of the CPU. Moreover, with this approach the message transfer can be delayed until the actual receive buffer is known to the communication layer, i.e., until the receive call has been posted by the application layer.
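Purely as a conceptual illustration of the pipelining effect (this is not the actual pscom gateway code), a fragment-forwarding loop in a gateway daemon could be structured as follows; recv_fragment, send_fragment_async and wait_send are hypothetical placeholders for the operations on the two attached networks.

    #include <stddef.h>

    /* Hypothetical placeholders for the transfers on the two networks. */
    void recv_fragment(void *buf, size_t len);
    void send_fragment_async(const void *buf, size_t len);
    void wait_send(const void *buf);   /* no-op if nothing is pending */

    /* Conceptual sketch: forward an MPI message through a gateway daemon in
       fragments, overlapping the receive of the next fragment with the
       forwarding of the previous one (double buffering). */
    enum { FRAGMENT_SIZE = 64 * 1024 };

    void forward_message(size_t message_size)
    {
        static char buf[2][FRAGMENT_SIZE];
        size_t forwarded = 0;
        int cur = 0;

        while (forwarded < message_size) {
            size_t chunk = message_size - forwarded;
            if (chunk > FRAGMENT_SIZE)
                chunk = FRAGMENT_SIZE;

            /* Receive the next fragment from the source network while the
               previous fragment may still be in flight towards the
               destination network. */
            recv_fragment(buf[cur], chunk);

            wait_send(buf[cur ^ 1]);
            send_fragment_async(buf[cur], chunk);

            forwarded += chunk;
            cur ^= 1;
        }
        wait_send(buf[cur ^ 1]);   /* drain the last forward */
    }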