On a Modular Supercomputer Architecture (MSA) system, applications either use resources within a single module or run across several modules, either at the same time or successively in a workflow-like model. This requires scalable scheduling and co-allocation of resources for jobs within and across modules.

 

Scheduling

The DEEP projects use the widely adopted open-source scheduler Slurm. The vanilla Slurm implementation supports scheduling and allocation of jobs that require a single module or multiple modules. For heterogeneous jobs (jobs spanning multiple modules), all the required resources in the different modules are allocated concurrently. This is very efficient for jobs that use all the allocated resources simultaneously.

However, some heterogeneous jobs require different modules at different times during their execution. Slurm supports this through its dependency mechanism: dependent jobs are not considered for scheduling until their dependencies are fulfilled. There may be cases, though, in which the dependent job has to consume a large amount of data from the producer job. With the current dependency mechanism, that data has to be stored somewhere so that it can be consumed by the dependent job, which starts only after the data-producing job has ended. This can cause a significant overhead in both time and storage space. The overhead can be avoided if the producer and consumer jobs are scheduled with a guaranteed overlap between their execution times, forming a workflow. The jobs can then communicate over the network and bypass the storage system altogether.

In the DEEP projects, we have extended the Slurm scheduler to provide this workflow functionality. The user submits a heterogeneous job (a job pack) made up of different dependent jobs, just like a normal Slurm heterogeneous job. The only difference in our version is the addition of delays: the user provides a time delay parameter for each job. This parameter tells the Slurm scheduler how much time should elapse between the start of the first job in the job pack and the start of the respective job. Upon receiving such a heterogeneous job, the Slurm scheduler tries to schedule it with the desired delays and reserves time slots for all the jobs in the job pack. This ensures the desired overlap in execution time among the jobs of a job pack.
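The effect of the per-job delays can be illustrated with a small model. The following is a minimal sketch in Python, not the scheduler's actual code: the `PackJob` class, its field names, and the example durations are all assumptions made for illustration. It computes the planned time window of each job from its delay and checks how long a producer and a consumer would run side by side.

```python
from dataclasses import dataclass


@dataclass
class PackJob:
    """Hypothetical model of one job in a job pack (names are illustrative)."""
    name: str
    delay: int     # seconds between the first job's start and this job's start
    walltime: int  # requested run time in seconds


def planned_windows(jobs, pack_start=0):
    """Return {name: (start, end)}; delays are relative to the pack's start."""
    return {
        j.name: (pack_start + j.delay, pack_start + j.delay + j.walltime)
        for j in jobs
    }


def overlap(a, b):
    """Length in seconds of the overlap between two (start, end) windows."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


# Assumed example: the consumer starts 3000 s after the producer.
pack = [PackJob("producer", 0, 3600), PackJob("consumer", 3000, 1800)]
windows = planned_windows(pack)
shared = overlap(windows["producer"], windows["consumer"])
print(windows, shared)
```

With these assumed values the two jobs overlap for 600 seconds, which is the period in which they could exchange data over the network instead of through the storage system.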

A major hurdle in our approach is the use of user-defined delays. Users tend to overestimate the time required by the jobs they submit. This places an extra burden on the scheduler, as it has to reserve the nodes for workflows. To reduce the side effects of user-provided delays, we provide a library that communicates with the Slurm controller to change the start times of all the reservations of the future jobs in a workflow. The currently executing job can thus try to bring forward the start times of future jobs if it finds that the data required by the next jobs in the workflow is available sooner than anticipated. This allows the resources allocated to the current job to be released early, as it no longer has to wait as long for the next job to start.
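The start-time adjustment can be sketched as follows. This is an illustrative Python model only; the library's real interface is not shown here, so the function name `advance_future_starts`, the dictionary layout, and the example times are hypothetical. All reservations that have not yet started are moved earlier by the requested amount, but never to a point before the current time.

```python
def advance_future_starts(reservations, now, seconds_early):
    """Shift not-yet-started reservations earlier (hypothetical helper).

    reservations: {job_name: (start, end)} in seconds
    now:          current time in seconds
    seconds_early: how much sooner the data became available
    """
    updated = {}
    for name, (start, end) in reservations.items():
        if start > now:
            # Never move a reservation into the past.
            new_start = max(now, start - seconds_early)
            shift = start - new_start
            updated[name] = (new_start, end - shift)
        else:
            # Running or finished jobs keep their window.
            updated[name] = (start, end)
    return updated


# Assumed example workflow with three reserved jobs.
reservations = {
    "producer": (0, 3600),
    "consumer": (3000, 4800),
    "postproc": (4500, 5400),
}
moved = advance_future_starts(reservations, now=2000, seconds_early=600)
print(moved)
```

In this assumed example the consumer and post-processing reservations both move 600 seconds earlier while the running producer keeps its window. A real implementation would also have to check that the nodes are actually free in the earlier time slot.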

Our library can also change the dependency type of dependent jobs at the request of the currently executing job. With this mechanism, reservations for the jobs of a workflow can be avoided completely. The user submits multiple jobs with dependencies among them. At any time during its execution, the currently executing job can ask the Slurm controller to change the dependency type of its dependent jobs so that they become eligible for allocation. The drawback of this mechanism is that it does not guarantee an overlap between the execution times of the producer and consumer jobs, since the start of the dependent jobs depends on the availability of the resources they require. On the other hand, it does not impose extra reservation constraints on the Slurm scheduler.
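As a rough analogue using stock Slurm tooling rather than our library, whose interface is not shown here, a dependency can be relaxed with `scontrol update`. The minimal Python sketch below only builds the command line and does not run it. It assumes the dependent job was submitted with a completion-based dependency such as `afterok` and switches it to `after`, which in Slurm is satisfied once the referenced job has started. The helper name and the job IDs are hypothetical.

```python
def relax_dependency_cmd(dependent_job_id, producer_job_id):
    """Build an scontrol command that makes the dependent job eligible
    while the producer is still running (hypothetical helper).

    'after:<id>' is satisfied once job <id> has started, in contrast to
    'afterok:<id>', which waits for successful completion.
    """
    return [
        "scontrol",
        "update",
        f"JobId={dependent_job_id}",
        f"Dependency=after:{producer_job_id}",
    ]


# Assumed job IDs, for illustration only.
cmd = relax_dependency_cmd(1235, 1234)
print(" ".join(cmd))
```

The resulting list could be passed to `subprocess.run` from inside the producer job once its output is ready to be consumed. As noted above, this makes the consumer eligible for allocation but does not guarantee that it starts while the producer is still running.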

 

Resource and Process Management

As applications increasingly use different types of accelerators for different tasks within their workflows, they become more and more heterogeneous. The Resource Management System therefore needs to be capable of managing these kinds of resources in the different modules (including hardware accelerators, memory class and capacity, and storage systems) and of providing them to the jobs. In the DEEP prototype architecture, the accelerator nodes (called Booster nodes) and the data analytics nodes of the Data Analytics Module (DAM) are connected via gateway nodes to the regular compute nodes (called Cluster nodes). These gateway nodes handle the data transfer between Cluster and Booster nodes, and the Resource Management System has to manage them as a kind of global resource. In the DEEP projects, we use Slurm as the resource allocator as far as it supports our needs, supplemented by the ParaStation Management System, which also serves as the process manager in place of the one that comes with Slurm.


Currently, Slurm does not provide any support for the allocation of dynamically determined resources such as the gateway nodes needed to connect the different networks of the Cluster and Booster modules. This includes starting additional daemons, such as the required gateway daemon, on these nodes. We therefore extended the resource management with the corresponding functionality. For this purpose, we implemented a new plugin for the ParaStation Management Daemon, called psgw, which builds on several existing capabilities. In the long run, it would be desirable to bring at least parts of this functionality into Slurm itself.

Once the scheduler decides to run a heterogeneous job using multiple modules of the MSA system, the process manager has to set up the infrastructure, e.g. by starting the required gateway daemons. Subsequently, it starts and manages the processes on each module's nodes and provides the necessary information for inter-module communication.