
NERSC Open Hackathon 2024

Roland Haas edited this page Aug 28, 2024 · 15 revisions

2024 NERSC Open Hackathon

The Hackathon is described here.

Zoom Meeting Link (for all dates):

https://us06web.zoom.us/j/81999469980?pwd=1nDlwyma78Pj7XMH6OE41JKvN6iTHM.1

Meeting ID: 819 9946 9980 // Passcode: 924348

System Access:

  • Teams and mentors attending the event will be given access to the Perlmutter compute system for the duration of the Hackathon. Additional systems may also be offered, depending on availability and interest.
  • If you don’t have an account at NERSC, please sign up for a training account here: https://iris.nersc.gov/train with training code dyrW. If your organization is not listed in the dropdown menu, select "NERSC".
  • The training account will be active from August 6 to August 30, 2024. To log in: ssh [email protected]
  • For more information visit: https://docs.nersc.gov/getting-started https://docs.nersc.gov/connect/
  • If you have questions, please submit them in the Slack #perlmutter-cluster-support channel.

Timeline

  • NERSC Open Hackathon Day 0 - Team/Mentor Meeting: August 06, 2024, 10:30 AM PT

  • NERSC Open Hackathon Day 1: August 13, 2024, 9:00 AM - 5:00 PM PT

  • NERSC Open Hackathon Day 2: August 20, 2024, 9:00 AM - 5:00 PM PT

  • NERSC Open Hackathon Day 3: August 21, 2024, 9:00 AM - 5:00 PM PT

  • NERSC Open Hackathon Day 4: August 22, 2024, 9:00 AM - 5:00 PM PT

AsterX Participants

  • Hannah Ross (mentor)

  • Mukul Dave (mentor)

  • Steve Brandt

  • Michail Chabanov

  • Lorenzo Ennoggi

  • Roland Haas

  • Liwei Ji

  • Jay Kalinani

  • Lucas Timotheo Sanches

  • Erik Schnetter

Communication channel:

Slack Workspace

Please join the NERSC Open Hackathon workspace using the following link: https://join.slack.com/t/nerscopenhackathon/shared_invite/zt-2nwxpsmev-GhiLfxFVsJ86UmlVH6tStQ

After joining the workspace, please search for and join the #team-asterx channel.

Compiling and running on Perlmutter

If you have used Perlmutter in the past and created a ~/.hostname file, please delete it as it is no longer required and can confuse things:

rm -f ~/.hostname
  • Create an ET folder in your home directory:

    cd ~/
    mkdir ET
    cd ET
    
  • Download the code via the following commands:

    curl -kLO https://raw.githubusercontent.com/gridaphobe/CRL/master/GetComponents
    chmod a+x GetComponents
    ./GetComponents --root Cactus --parallel --no-shallow https://raw.githubusercontent.com/jaykalinani/AsterX/main/Docs/thornlist/asterx.th
    
  • Add a defs.local.ini file in Cactus/simfactory/etc/ containing your user account details and the source and base directory paths. See, for example: https://github.com/jaykalinani/AsterX/blob/main/Docs/compile-notes/frontier/defs.local.ini

  • Return to the Cactus directory and compile using the following command:

    ./simfactory/bin/sim build -j32 <config_name> --thornlist=./thornlists/asterx.th
    
  • Example command to create and submit a job for a shocktube test via simfactory:

    ./simfactory/bin/sim submit B1 --parfile=./arrangements/AsterX/AsterX/test/Balsara1_shocktube.par --config=<config_name> --allocation=m3374 --procs=1 --num-threads=1 --ppn-used=1 --walltime 00:05:00 --machine=perlmutter-p1
    
  • For a magnetized TOV test evolving spacetime, an example submit command via simfactory is:

    ./simfactory/bin/sim submit magTOV_Cowling_unigrid --parfile=./arrangements/AsterX/AsterX/par/magTOV_Cowling_unigrid.par --config=<config_name> --allocation=m3374 --procs=1 --num-threads=1 --ppn-used=1 --walltime 00:05:00 --machine=perlmutter-p1
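The defs.local.ini step above can be sketched as a small script that writes the file. This is a minimal sketch, not the project's official configuration: the key names (user, email, allocation, sourcebasedir, basedir) are assumptions based on commonly used simfactory settings, and the user name, e-mail address, and directory paths are hypothetical placeholders to replace with your own.

```shell
#!/bin/bash
# Sketch: write a minimal defs.local.ini for simfactory.
# Key names are assumptions based on commonly used simfactory settings;
# all values are hypothetical placeholders.

out="${DEFS_FILE:-defs.local.ini}"            # where to write the file
nersc_user="${NERSC_USER:-your_username}"     # placeholder login name
initial="$(printf '%s' "$nersc_user" | cut -c1)"  # first letter, used in NERSC paths

cat > "$out" <<EOF
[default]
user       = ${nersc_user}
email      = ${nersc_user}@example.org
allocation = m3374

[perlmutter-p1]
sourcebasedir = /global/homes/${initial}/${nersc_user}/ET
basedir       = /pscratch/sd/${initial}/${nersc_user}/simulations
EOF

echo "Wrote $out"
```

After checking the values, the generated file would be copied to Cactus/simfactory/etc/defs.local.ini.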
    

Build system notes (mostly for mentors)

The build system ultimately just uses make, plus some auto-generated code (which is not in the performance-critical path). The environment modules that get loaded are those listed in the envsetup entry of simfactory/mdb/machines/perlmutter-p1.ini. Compilers and other options are in simfactory/mdb/optionlists/perlmutter-p1.cfg. The SLURM script is simfactory/mdb/submitscripts/perlmutter-p1.sub, and it runs the actual code via the bash script simfactory/mdb/runscripts/perlmutter-p1.run.
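To see which modules a build would load without opening the machine file by hand, the envsetup entry can be printed with a short helper. This is a sketch under the assumption that the entry is written as a `<<EOT ... EOT` block, as is common in simfactory machine files; a single-line entry is not handled. The helper name print_envsetup and the demo file contents are hypothetical.

```shell
# print_envsetup FILE: print the body of the "envsetup = <<EOT ... EOT" entry
print_envsetup() {
    awk '
        /^envsetup[ \t]*=[ \t]*<<EOT/ { inside = 1; next }   # start of the block
        inside && /^EOT[ \t]*$/       { inside = 0 }          # end of the block
        inside                        { sub(/^[ \t]+/, ""); print }
    ' "$1"
}

# Demo on a hypothetical, shortened machine file
demo_ini="$(mktemp)"
cat > "$demo_ini" <<'INI'
[perlmutter-p1]
nickname = perlmutter-p1
envsetup = <<EOT
    module unload PrgEnv-nvidia
    module load PrgEnv-gnu/8.5.0
EOT
optionlist = perlmutter-p1.cfg
INI

print_envsetup "$demo_ini"
```

On a real checkout one would call print_envsetup simfactory/mdb/machines/perlmutter-p1.ini from the Cactus directory.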

One way to change compile-time options is to edit simfactory/mdb/optionlists/perlmutter-p1.cfg and then run

./simfactory/bin/sim build -j32 <config_name> --thornlist=./thornlists/asterx.th --reconfig  --optionlist perlmutter-p1.cfg

For a more traditional build process (without the extra "simfactory" layer) one can also manually load the modules listed in the envsetup key. Then use

make foo-config options=simfactory/mdb/optionlists/perlmutter-p1.cfg

make -j32 foo

to build configuration foo in exe/cactus_foo. Editing options requires re-running the -config step.

Helpful make targets are make foo-clean and make foo-realclean, as well as make foo-build BUILDLIST=<Thorn>, where <Thorn> could be, e.g., AsterX. The latter builds only that one module (e.g., for testing compiler options).

Set

export VERBOSE=yes

to see the exact commands make executes.
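The manual make steps above can be wrapped in a couple of shell functions. This is only a convenience sketch: the function names (run, build_config, rebuild_thorn) and the DRY_RUN switch are hypothetical and not part of Cactus or simfactory. With DRY_RUN=1 the commands are printed instead of executed, which allows checking them before starting a long build.

```shell
# run CMD...: execute CMD, or only print it when DRY_RUN=1
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# build_config NAME OPTIONLIST [JOBS]: configure, then build configuration NAME
build_config() {
    name="$1"; optionlist="$2"; jobs="${3:-32}"
    run make "${name}-config" "options=${optionlist}" &&
    run make "-j${jobs}" "${name}"
}

# rebuild_thorn NAME THORN: rebuild a single thorn of configuration NAME
rebuild_thorn() {
    run make "$1-build" "BUILDLIST=$2"
}

# Demo: print the commands for configuration "foo" without running them
DRY_RUN=1
build_config foo simfactory/mdb/optionlists/perlmutter-p1.cfg
rebuild_thorn foo AsterX
```

Unsetting DRY_RUN (or setting it to 0) and calling the functions from the Cactus directory, with the modules already loaded, would run the actual build.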

Submitting without Simfactory

The actual code does not depend on Simfactory, and many groups use it without Simfactory. One can use bits and pieces from Simfactory to craft a "regular" SLURM script and srun invocation.

#! /bin/bash
#SBATCH -A ntrain4_g
#SBATCH -C gpu
#SBATCH -p regular
#SBATCH -t 0:05:00
#SBATCH -N 1 -n 1 -c 1
#SBATCH --ntasks-per-node 1
#SBATCH --gpus-per-task 1
#SBATCH --gpu-bind=map_gpu:0,1,2,3
#SBATCH
#SBATCH -J foo-01-0000
#SBATCH --mail-type=ALL
#SBATCH [email protected]
#SBATCH -o /pscratch/sd/r/rhaas/simulations/foo-01/foo-01.out
#SBATCH -e /pscratch/sd/r/rhaas/simulations/foo-01/foo-01.err

# "SubmitScript" above

# modules from perlmutter-p1.ini file's envsetup line
# cmake >= 3.19 causes issues with ADIOS2 (bad HDF5 version detection in system cmake)
module unload PrgEnv-nvidia
module unload PrgEnv-intel
module unload PrgEnv-cray
module unload PrgEnv-gnu
module unload gcc
module unload intel
module unload nvc
module unload cudatoolkit
module load PrgEnv-gnu/8.5.0 &&
module load gcc/11.2.0 &&
module load cray-libsci/23.02.1.1 &&
module load cray-mpich/8.1.25 &&
module load cudatoolkit/11.7 &&
module load cray-hdf5-parallel/1.12.2.3 &&
module load cray-fftw/3.3.10.6 &&
module load papi/7.0.1.2 &&
module load cmake/3.22.0


# "RunScript" content
echo "Preparing:"
set -x                          # Output commands
set -e                          # Abort on errors

cd /pscratch/sd/r/rhaas/simulations/foo-01

module list

echo "Checking:"
pwd
hostname
date
# Record the nodes allocated to this job (PBS_NODES is not set under SLURM)
scontrol show hostnames "${SLURM_JOB_NODELIST}" > NODES || true

echo "Environment:"
export CACTUS_NUM_PROCS=1
export CACTUS_NUM_THREADS=1
export GMON_OUT_PREFIX=gmon.out
export OMP_NUM_THREADS=1
export OMP_PLACES=cores # TODO: maybe use threads when smt is used?
export SLURM_CPU_BIND="cores"
env | sort > ENVIRONMENT

echo "Starting:"
export CACTUS_STARTTIME=$(date +%s)
srun /pscratch/sd/r/rhaas/AsterX/exe/cactus_sim -L 3 /pscratch/sd/r/rhaas/AsterX/arrangements/AsterX/AsterX/par/magTOV_Cowling_unigrid.par
echo "Stopping:"
date

echo "Done."

This lets one more easily play with options and add bits to the command line. Submit as usual using sbatch foo-01.sbatch or similar. Note that all paths to input decks and output directories, including the path to the executable, are hard-coded as absolute paths (look for /pscratch/sd/r/rhaas/).
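One way to avoid editing the hard-coded paths by hand is to generate the script from a few variables. The following is a minimal sketch, not a replacement for the full script above: the variable names (SIMNAME, SIMDIR, EXE, PARFILE, ACCOUNT) and their default values are hypothetical, only a subset of the #SBATCH directives is reproduced, and the module lines still have to be added where indicated.

```shell
# Sketch: generate an sbatch script with all paths taken from variables.
# Defaults are hypothetical placeholders; override them in the environment.
SIMNAME="${SIMNAME:-foo-01}"
SIMDIR="${SIMDIR:-$PWD/simulations/$SIMNAME}"
EXE="${EXE:-$PWD/exe/cactus_sim}"
PARFILE="${PARFILE:-$PWD/par/magTOV_Cowling_unigrid.par}"
ACCOUNT="${ACCOUNT:-ntrain4_g}"

mkdir -p "$SIMDIR"

# Variables are expanded now, so the generated script contains fixed paths.
cat > "$SIMDIR/$SIMNAME.sbatch" <<EOF
#! /bin/bash
#SBATCH -A $ACCOUNT
#SBATCH -C gpu
#SBATCH -t 0:05:00
#SBATCH -N 1 -n 1 -c 1
#SBATCH --gpus-per-task 1
#SBATCH -J $SIMNAME
#SBATCH -o $SIMDIR/$SIMNAME.out
#SBATCH -e $SIMDIR/$SIMNAME.err

# (module unload/load lines from the envsetup entry go here)

set -e
cd "$SIMDIR"
export OMP_NUM_THREADS=1
srun "$EXE" -L 3 "$PARFILE"
EOF

echo "Wrote $SIMDIR/$SIMNAME.sbatch"
```

The generated file is only written, not submitted; after inspecting it, it would be submitted with sbatch as usual.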

Quick links

Profiling using NVIDIA Nsight Systems:

  • To generate the profile reports, the Perlmutter RunScript (located at Cactus/simfactory/mdb/runscripts/perlmutter-p1.run) needs to be modified before submitting the job. Within the RunScript, add nsys profile --stats=true before the executable name. For example, the modified command will look like:
srun nsys profile --stats=true @EXECUTABLE@ -L 3 @PARFILE@
  • Once the report files (with extensions .nsys-rep and .sqlite) have been generated in the simulation folder, they can be copied to a local workstation and loaded into the NVIDIA Nsight Systems software.

  • Link to download the software: https://developer.nvidia.com/nsight-systems/get-started
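The RunScript modification described above can also be applied with sed rather than by hand. This is a sketch assuming the RunScript contains a line of the form srun @EXECUTABLE@ -L 3 @PARFILE@; the helper name add_nsys and the demo file are hypothetical. A backup of the original file is kept with the suffix .orig, and the helper does nothing if the file already mentions nsys profile.

```shell
# add_nsys RUNSCRIPT: put "nsys profile --stats=true" in front of @EXECUTABLE@
# on lines starting with "srun "; keeps a backup as RUNSCRIPT.orig.
add_nsys() {
    # Do nothing if the script was already modified
    if grep -q 'nsys profile' "$1"; then
        return 0
    fi
    cp "$1" "$1.orig" &&
    sed '/^srun /s/@EXECUTABLE@/nsys profile --stats=true @EXECUTABLE@/' "$1.orig" > "$1"
}

# Demo on a hypothetical two-line RunScript
demo_run="$(mktemp)"
printf '%s\n' 'echo "Starting:"' 'srun @EXECUTABLE@ -L 3 @PARFILE@' > "$demo_run"
add_nsys "$demo_run"
cat "$demo_run"
```

To undo the change, the .orig backup can be copied back over the RunScript.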

Profiling using nsight-compute

NVIDIA's Nsight Systems and Nsight Compute profilers offer a way to profile GPU code performance at the system and kernel level. There are also compiler options that output useful information at compile time.

CUCCFLAGS = [...]  --ptxas-options=-v

will make the compiler report the register usage of the kernels that it assembles.

CUCCFLAGS = [...] -generate-line-info

corresponds to -g for CPU code and includes line number information in the executable, which is used by nsight-compute.
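To try these flags without touching the stock option list, one can write a modified copy and point the build at it. The sketch below assumes the option list has a single-line entry of the form CUCCFLAGS = ...; the helper name append_cucc_flags and the demo file contents are hypothetical.

```shell
# append_cucc_flags CFG FLAGS...: print CFG with FLAGS appended to the
# CUCCFLAGS line; the original file is left untouched.
append_cucc_flags() {
    cfg="$1"; shift
    sed "/^CUCCFLAGS[[:space:]]*=/s|\$| $*|" "$cfg"
}

# Demo on a hypothetical two-line option list
demo_cfg="$(mktemp)"
printf '%s\n' 'CUCC = nvcc' 'CUCCFLAGS = -std=c++17' > "$demo_cfg"
append_cucc_flags "$demo_cfg" --ptxas-options=-v -generate-line-info
```

Redirecting the output to a new file gives an option list that can be passed to the build in place of perlmutter-p1.cfg.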

Running nsight-compute requires adding to the RunScript

srun --ntasks-per-node=1 dcgmi profile --pause
srun ncu -o report-%h-%p.ncu-rep --set full --nvtx --nvtx-include=Fluxes @EXECUTABLE@ -L 3 @PARFILE@

which will profile kernels within the region marked by

#ifdef __CUDACC__
  const nvtxRangeId_t range = nvtxRangeStartA("Fluxes");
#endif
[...]
#ifdef __CUDACC__
  nvtxRangeEnd(range);
#endif

See Perlmutter's help pages on nsight: https://docs.nersc.gov/tools/performance/nvidiaproftools/

Tasks

To-do list