Erik's Notes from the Pre Workshop Meeting
- don't use `SystemTopology`
- use shared memory (64 kB per CU)
- benchmark with 2 iterations instead of 10
- use `amd/5.4.3` (might improve performance)
- use `cray-mpich/8.1.25` (might improve performance)
- generally, use newest versions
- set `MPICH_ENV_DISPLAY=1`, `MPICH_VERSION_DISPLAY=1`, `MPICH_GPU_SUPPORT_ENABLED=1`, `MPICH_GPU_MANAGED_MEMORY_SUPPORT_ENABLED=1`
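These MPICH settings would typically go near the top of a job script; a minimal sketch:

```shell
# Print the MPICH configuration and version at startup (good for run records)
export MPICH_ENV_DISPLAY=1
export MPICH_VERSION_DISPLAY=1
# Enable GPU-aware MPI, including passing managed-memory buffers to MPI calls
export MPICH_GPU_SUPPORT_ENABLED=1
export MPICH_GPU_MANAGED_MEMORY_SUPPORT_ENABLED=1
```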
- `AMD_LOG_LEVEL=3` may be a good default, or just for GPU debugging; also set `AMD_LOG_MASK=?`
- for debugging: `AMD_SERIALIZE_KERNEL=?`, `AMD_SERIALIZE_COPY=?`
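The notes leave the exact values open; a sketch using values I believe are common for the HIP runtime (3 = serialize both before and after; verify against the ROCm documentation for your version):

```shell
# Verbose HIP runtime logging; higher levels print more detail
export AMD_LOG_LEVEL=3
# Serialize kernel launches and memory copies so errors surface
# at the operation that caused them (value is an assumption: 3 = wait
# both before and after each operation; check the ROCm docs)
export AMD_SERIALIZE_KERNEL=3
export AMD_SERIALIZE_COPY=3
```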
- `rocgdb` for debugging, including debugging compute kernels; probably only up to rocm/5.3.0
- `gdb4hpc` launches code, side-by-side debugging
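Since `rocgdb` is gdb-based, a single-process session looks like ordinary gdb; a sketch (the binary name and argument are hypothetical, and the module version follows the "only up to rocm/5.3.0" note above):

```shell
# Load a ROCm version that rocgdb reportedly still supports
module load rocm/5.3.0
# Debug one process interactively; breakpoints can be set inside device kernels
rocgdb --args ./myapp input.cfg
```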
- `__launch_bounds__(256)` to reduce the number of threads; move registers to LDS ("shared memory")
- (try different compiler versions)
- use Omniperf to study kernel performance
- see https://amdresearch.github.io/omnitrace
- use https://ui.perfetto.dev to visualize output?
- build with `--save-temps`; look at the assembler code
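A sketch of that workflow, assuming a hipcc/clang-style toolchain where `--save-temps` keeps the intermediate files (the source file name is a placeholder):

```shell
# Keep intermediate files (.i preprocessed, .s assembler) instead of deleting them
hipcc --save-temps -O3 -c kernel.cpp -o kernel.o
# Inspect the generated host and device assembler
ls kernel*.s
```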
- Craypat (Cray's Performance Analysis Toolkit): performance, but not inside kernels
- Omnitrace?
- magic sauce, test and benchmark before using:
  - `FI_MR_CACHE_MONITOR=memhooks` (might improve performance at scale)
  - `FI_CXI_RX_MATCH_MODE=software` (if suggested in an error message)
- report any results to OLCF
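Per the note above, treat these as experiments: a sketch of setting both, to be benchmarked against a run without them:

```shell
# libfabric memory-registration cache monitor; may help performance at scale
export FI_MR_CACHE_MONITOR=memhooks
# software tag matching for the CXI provider; only if an error message suggests it
export FI_CXI_RX_MATCH_MODE=software
```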
- `man intro_mpi`, `man fi_cxi`
- srun/sbatch `-u` for unbuffered output when debugging
- use `--reservation=frontier-hackathon` with Slurm
- we could (but can't) use `--cpus-per-task=7` ("low-noise mode"); only 56 cores per node are available
- use `--gpus-per-task=1` (instead of `--gpus-per-node`)
- for elegance, set `-N`, `-n`, `-c`
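The Slurm flags above might combine into a job-script header like this sketch (node and rank counts are placeholders; 8 ranks per node assumes one rank per GCD):

```shell
#!/bin/bash
#SBATCH --reservation=frontier-hackathon
#SBATCH -N 2                 # nodes
#SBATCH -n 16                # MPI ranks, 8 per node
#SBATCH -c 7                 # cores per rank
#SBATCH --gpus-per-task=1    # rather than --gpus-per-node

srun -u ./myapp              # -u: unbuffered output for debugging
```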
- use `sbcast` to broadcast files to compute nodes; reduces launch time at scale
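A minimal sketch of the `sbcast` pattern (destination path and binary name are placeholders):

```shell
# Stage the executable onto node-local storage of every allocated node...
sbcast ./myapp /tmp/myapp
# ...then launch the local copy instead of hammering the shared filesystem
srun /tmp/myapp
```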
- `man intro_<tab>` (`mpi`, `perftools`, `craypat`, ...)
- OLCF office hours: sign up, get direct help
- AMReX: experiment with `ParallelFor<MY_BLOCK_SIZE>(...)`; see https://amrex-codes.github.io/amrex/docs_html/GPU.html#gpu-block-size
- AMReX: use multiple streams
- AMReX: generally, turn off OpenMP when running on GPUs
- AMReX: use shared memory! `launch` has a `shared_mem_bytes` argument.
- each CU ("SM") has only 16 kB of L1 cache
- wavefront ("warp") size is 64
- CUs have many registers (1024 words?)
- `XNACK=1` (not the default) migrates memory between host and device automatically, even when allocated with plain `malloc`.
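In ROCm this switch is usually spelled `HSA_XNACK`; a sketch (verify the exact variable name for the stack in use):

```shell
# Enable automatic page migration between host and device,
# so even plain malloc'ed buffers can be touched from kernels
export HSA_XNACK=1
```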
- can use burst buffers (`--constraint nvme`), 4 TByte per node, e.g. for checkpointing or output?
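A checkpointing sketch for the burst buffer, assuming the node-local NVMe is mounted at `/mnt/bb/$USER` (the mount path and the `--checkpoint-dir` flag are assumptions; check the OLCF documentation):

```shell
#SBATCH --constraint=nvme

# Write checkpoints to the node-local drive, then drain them
# to the parallel filesystem after the run if needed
ckpt_dir=/mnt/bb/$USER/ckpt
mkdir -p "$ckpt_dir"
srun ./myapp --checkpoint-dir "$ckpt_dir"   # flag name hypothetical
```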