This workflow runs OpenMPI hello world jobs using Parsl. The OpenMPI hello world code is defined in mpitest.c
and, when compiled and executed, prints the hostname and rank in the format below:
Hello world from processor alvaro-gcpslurmv2dev-00091-1-0001, rank 0 out of 2 processors
Hello world from processor alvaro-gcpslurmv2dev-00091-1-0002, rank 1 out of 2 processors
High performance computing workflows may depend on the orchestration of many diverse applications: ensemble members, data assimilation, pre-/post-processing, visualization tools, and, most recently, machine learning. To enable community involvement, these complex workflows need to be automated, users need to be able to manage “live” workflows (i.e. as they run) in a straightforward manner, and the workflows need to be portable so they can run in a wide range of compute environments (e.g. on-premise clusters and cloud).

Over the last decades, the workflow community has developed different workflow systems for expressing automation and control, but the majority of this work has been to support high-throughput workflows, i.e. thousands of single-node tasks. Workflow fabrics’ support for the coordination and management of multi-node tasks (i.e. large MPI jobs) is less widespread and less well documented. Here, we aim to help bridge this knowledge gap with a curated set of templates and examples that demonstrate the automation and control of workflows that launch MPI tasks with existing workflow fabrics. This project is working toward a proof-of-concept workflow that manages MPI tasks and has a topology similar to that of real-world operational weather forecasting workflows. By making these documented and evolving building blocks available to the community, we hope to empower users to work in the “MPI task niche” within the broader landscape of workflow fabrics.
The files in the top level of this repo are designed to be launched from the Parallel Works platform as workflows. The run_on_cluster subdirectory is a starting point for experimentation/debugging/development because it provides scripts that are designed to run directly on a cluster (i.e. manually) rather than being launched via a workflow. In particular, the run_on_cluster directory archives experimentation using Parsl and Flux together (via Parsl's FluxExecutor). As we gain additional experience with this framework, Flux will be integrated into PW-launched workflows at the top level.
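For reference, a minimal sketch of what a Parsl configuration using the FluxExecutor might look like is shown below; the executor label and the LocalProvider choice are assumptions, not the settings used in run_on_cluster.

```python
# Minimal sketch (assumed settings, not taken from run_on_cluster):
# Parsl hands tasks to a Flux instance through the FluxExecutor, and Flux
# schedules the MPI work on the nodes of the allocation.
from parsl.config import Config
from parsl.executors import FluxExecutor
from parsl.providers import LocalProvider

flux_config = Config(
    executors=[
        FluxExecutor(
            label="flux",              # illustrative label
            provider=LocalProvider(),  # assumes the script already runs inside an allocation
        )
    ]
)

# Apps submitted to this executor can request MPI resources through
# parsl_resource_specification (e.g. num_tasks/num_nodes), which Flux uses
# to size each job.
```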
Parsl apps are defined as bash_app decorated functions in the workflow_apps.py script. The OpenMPI code is compiled by compile_mpi_hello_world_ompi and then executed by compile_mpi_hello_world_ompi_localprovider or compile_mpi_hello_world_ompi_slurmprovider, depending on the selected provider (see the execution section below). The execution app is launched multiple times in parallel, as defined by the repeats workflow input parameter.
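As an illustration, a hedged sketch of a compile app of this kind is shown below; the argument names and paths are assumptions, and the actual definitions live in workflow_apps.py.

```python
# Hypothetical sketch of a bash_app for the compile step; the real app in
# workflow_apps.py may take different arguments.
from parsl.app.app import bash_app

@bash_app
def compile_mpi_hello_world_ompi(ompi_dir, stdout='compile.out', stderr='compile.err'):
    # Compile mpitest.c with the OpenMPI wrapper compiler
    return f"{ompi_dir}/bin/mpicc -o mpitest mpitest.c"

# The repeats workflow input parameter simply controls how many execution app
# futures are created in parallel, e.g.:
# futures = [run_app(...) for _ in range(int(repeats))]
# results = [f.result() for f in futures]
```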
The workflow can be executed using the LocalProvider or the SlurmProvider. The parsl_utils repository is used to integrate Parsl on the PW clusters; it builds the Parsl configuration object from a JSON file. Sample JSON files are provided in the executors directory. The selected Parsl provider is started on the controller node of the SLURM cluster using the Parsl SSHChannel.
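The sketch below shows the kind of translation parsl_utils performs, building a Config object from a JSON description. The JSON keys, the executor label, and the SSH details are placeholders rather than the actual parsl_utils schema, and SSHChannel is only available in older Parsl releases that still provide channels.

```python
# Hedged sketch: build a Parsl Config from a JSON file. The keys used here
# ("partition", "nodes_per_block", "controller", "username", "label") are
# placeholders, not the real parsl_utils schema.
import json

from parsl.channels import SSHChannel  # present in older Parsl releases
from parsl.config import Config
from parsl.executors import HighThroughputExecutor
from parsl.providers import SlurmProvider

def config_from_json(path):
    with open(path) as f:
        spec = json.load(f)
    provider = SlurmProvider(
        partition=spec["partition"],
        nodes_per_block=spec["nodes_per_block"],
        # Start the provider on the controller node of the SLURM cluster over SSH
        channel=SSHChannel(hostname=spec["controller"], username=spec["username"]),
    )
    return Config(
        executors=[HighThroughputExecutor(label=spec["label"], provider=provider)]
    )
```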
With the LocalProvider, the nodes for each job are allocated with the sbatch -W command in the run_mpi_hello_world_ompi function. The workflow generates the #SBATCH options from the SLURM parameters in the workflow.xml file; the names of these parameters start with slurm_ to indicate that they correspond to SLURM sbatch options.
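A hedged sketch of such an app is shown below; the signature and the slurm_opts dictionary are assumptions, and the actual function in workflow_apps.py may differ.

```python
# Illustrative only: a bash_app in the spirit of run_mpi_hello_world_ompi that
# writes a batch script from the slurm_* parameters and blocks on sbatch -W.
from parsl.app.app import bash_app

@bash_app
def run_mpi_hello_world_ompi(np, slurm_opts, stdout='run.out', stderr='run.err'):
    # slurm_opts is assumed to map each workflow.xml parameter (slurm_ prefix
    # removed) to its value, e.g. {"nodes": "2", "partition": "compute"}
    directives = "\n".join(f"#SBATCH --{k}={v}" for k, v in slurm_opts.items())
    return f"""
cat > mpi_job.sh <<'EOF'
#!/bin/bash
{directives}
mpirun -np {np} ./mpitest
EOF
# -W makes sbatch block until the job finishes, so the Parsl future only
# completes once the MPI run is done
sbatch -W mpi_job.sh
"""
```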
With the SlurmProvider, the nodes for each job are allocated by the pilot job and the MPI code is executed directly on those nodes without using a SLURM command.
Running MPI jobs using Parsl's SlurmProvider can be challenging, as discussed in this video. Here is a list of challenges:
If you launch the following command inside a bash_app:

mpirun -np 2 mpitest

it returns the error below:

There are not enough slots available in the system to satisfy the 2 slots that were requested by the application:
The reason for this is that mpirun uses the SLURM environment variables that Parsl sets for the pilot job, in which --ntasks-per-node=1 is hardcoded. One solution would be to overwrite these variables in the bash_app itself, as sketched below.
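The sketch below illustrates this workaround; the function name, its argument, and the exact set of variables that need to be overwritten are assumptions (they may depend on the OpenMPI and SLURM versions), so treat it as a starting point rather than the repo's actual code.

```python
# Assumed workaround: overwrite the SLURM variables exported for the pilot job
# before calling mpirun, so OpenMPI sees more than one slot per node.
from parsl.app.app import bash_app

@bash_app
def run_mpi_hello_world_ompi_slurmprovider(tasks_per_node, stdout='run.out', stderr='run.err'):
    return f"""
# The pilot job was submitted with --ntasks-per-node=1; override the variables
# OpenMPI reads to compute the available slots.
export SLURM_NTASKS_PER_NODE={tasks_per_node}
export SLURM_TASKS_PER_NODE="{tasks_per_node}(x$SLURM_NNODES)"
mpirun -np $(( {tasks_per_node} * SLURM_NNODES )) ./mpitest
"""
```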
Most combinations of cores_per_worker, nodes_per_block, and cores_per_node do not return the right MPI output. The only combination that we found to work is shown in the run_on_cluster/slurmprovider.py config object (a sketch follows this list). It requires the following setup:
- Use the SimpleLauncher
- Set nodes_per_block to the desired number of nodes per MPI task
- Set cores_per_worker to the number of cores in a single node
- Set parallelism to the desired number of nodes per MPI task
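A hedged sketch following this recipe is shown below; see run_on_cluster/slurmprovider.py for the actual object. The partition name, core count, and executor label are placeholders.

```python
# Sketch of the working combination described above (assumed values):
# SimpleLauncher, nodes_per_block = nodes per MPI task, cores_per_worker =
# cores per node, parallelism = nodes per MPI task.
from parsl.config import Config
from parsl.executors import HighThroughputExecutor
from parsl.launchers import SimpleLauncher
from parsl.providers import SlurmProvider

NODES_PER_MPI_TASK = 2   # desired number of nodes per MPI task
CORES_PER_NODE = 30      # placeholder core count of a single node

config = Config(
    executors=[
        HighThroughputExecutor(
            label="slurm_mpi",                    # illustrative label
            cores_per_worker=CORES_PER_NODE,      # one worker per node
            provider=SlurmProvider(
                partition="compute",              # placeholder partition
                nodes_per_block=NODES_PER_MPI_TASK,
                parallelism=NODES_PER_MPI_TASK,
                launcher=SimpleLauncher(),        # do not wrap workers in srun
            ),
        )
    ]
)
```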