Erik's Notes from the Pre Workshop Meeting
- don't use `SystemTopology`
- use shared memory (64 kB per CU)
- benchmark with 2 iterations instead of 10
- use `amd/5.4.3` (might improve performance)
- use `cray-mpich/8.1.25` (might improve performance)
- generally, use newest versions
- set `MPICH_ENV_DISPLAY=1`, `MPICH_VERSION_DISPLAY=1`, `MPICH_GPU_SUPPORT_ENABLED=1`, `MPICH_GPU_MANAGED_MEMORY_SUPPORT_ENABLED=1`
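These MPICH settings would typically go near the top of a job script; a minimal sketch:

```shell
# Print the MPICH configuration and version at startup (good for run records)
export MPICH_ENV_DISPLAY=1
export MPICH_VERSION_DISPLAY=1
# Enable GPU-aware MPI, including passing managed-memory buffers to MPI calls
export MPICH_GPU_SUPPORT_ENABLED=1
export MPICH_GPU_MANAGED_MEMORY_SUPPORT_ENABLED=1
```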
- `AMD_LOG_LEVEL=3` may be a good default, or just for GPU debugging; also set `AMD_LOG_MASK=?`
- for debugging: `AMD_SERIALIZE_KERNEL=?`, `AMD_SERIALIZE_COPY=?`
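The notes leave the exact values open; a sketch using values I believe are common for the HIP runtime (3 = serialize both before and after; verify against the ROCm documentation for your version):

```shell
# Verbose HIP runtime logging; higher levels print more detail
export AMD_LOG_LEVEL=3
# Serialize kernel launches and memory copies so errors surface
# at the operation that caused them (value is an assumption: 3 = wait
# both before and after each operation; check the ROCm docs)
export AMD_SERIALIZE_KERNEL=3
export AMD_SERIALIZE_COPY=3
```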
- `rocgdb` for debugging, including debugging compute kernels; probably only up to rocm/5.3.0
- `gdb4hpc` launches code, side-by-side debugging
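Since `rocgdb` is gdb-based, a single-process session looks like ordinary gdb; a sketch (the binary name and argument are hypothetical, and the module version follows the "only up to rocm/5.3.0" note above):

```shell
# Load a ROCm version that rocgdb reportedly still supports
module load rocm/5.3.0
# Debug one process interactively; breakpoints can be set inside device kernels
rocgdb --args ./myapp input.cfg
```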
- `__launch_bounds__(256)` to reduce the number of threads; move registers to LDS ("shared memory")
- (try different compiler versions)
- use Omniperf to study kernel performance
- see https://amdresearch.github.io/omnitrace
- use https://ui.perfetto.dev to visualize output?
- build with `--save-temps`; look at the assembler code
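A sketch of that workflow, assuming a hipcc/clang-style toolchain where `--save-temps` keeps the intermediate files (the source file name is a placeholder):

```shell
# Keep intermediate files (.i preprocessed, .s assembler) instead of deleting them
hipcc --save-temps -O3 -c kernel.cpp -o kernel.o
# Inspect the generated host and device assembler
ls kernel*.s
```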
- Craypat (Cray's Performance Analysis Toolkit): performance, but not inside kernels
- Omnitrace?
- magic sauce, test and benchmark before using:
  - `FI_MR_CACHE_MONITOR=memhooks` (might improve performance at scale)
  - `FI_CXI_RX_MATCH_MODE=software` (if suggested in an error message)
- report any results to OLCF
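Per the note above, treat these as experiments: a sketch of setting both, to be benchmarked against a run without them:

```shell
# libfabric memory-registration cache monitor; may help performance at scale
export FI_MR_CACHE_MONITOR=memhooks
# software tag matching for the CXI provider; only if an error message suggests it
export FI_CXI_RX_MATCH_MODE=software
```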
- `man intro_mpi`, `man fi_cxi`
- srun/sbatch `-u` for unbuffered output when debugging
- use `--reservation=frontier-hackathon` with Slurm
- we could (but can't) use `--cpus-per-task=7` ("low-noise mode"); only 56 cores per node are available
- use `--gpus-per-task=1` (instead of `--gpus-per-node`)
- for elegance, set `-N`, `-n`, `-c`
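The Slurm flags above might combine into a job-script header like this sketch (node and rank counts are placeholders; 8 ranks per node assumes one rank per GCD):

```shell
#!/bin/bash
#SBATCH --reservation=frontier-hackathon
#SBATCH -N 2                 # nodes
#SBATCH -n 16                # MPI ranks, 8 per node
#SBATCH -c 7                 # cores per rank
#SBATCH --gpus-per-task=1    # rather than --gpus-per-node

srun -u ./myapp              # -u: unbuffered output for debugging
```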
- use `sbcast` to broadcast files to compute nodes; reduces launch time at scale
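A minimal sketch of the `sbcast` pattern (destination path and binary name are placeholders):

```shell
# Stage the executable onto node-local storage of every allocated node...
sbcast ./myapp /tmp/myapp
# ...then launch the local copy instead of hammering the shared filesystem
srun /tmp/myapp
```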
- `man intro_<tab>` (`mpi`, `perftools`, `craypat`, ...)
- OLCF office hours: sign up, get direct help
- AMReX: experiment with `ParallelFor<MY_BLOCK_SIZE>(...)`; see https://amrex-codes.github.io/amrex/docs_html/GPU.html#gpu-block-size
- AMReX: use multiple streams
- AMReX: generally, turn off OpenMP when running on GPUs
- AMReX: use shared memory! `launch` has a `shared_mem_bytes` argument.
- each CU ("SM") has only 16 kB of L1 cache
- wavefront ("warp") size is 64
- CUs have many registers (1024 words?)
- `XNACK=1` (not the default) migrates memory between host and device automatically, even when allocated with plain `malloc`.
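In ROCm this switch is usually spelled `HSA_XNACK`; a sketch (verify the exact variable name for the stack in use):

```shell
# Enable automatic page migration between host and device,
# so even plain malloc'ed buffers can be touched from kernels
export HSA_XNACK=1
```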
- can use burst buffers (`--constraint nvme`), 4 TByte per node, e.g. for checkpointing or output?
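A checkpointing sketch for the burst buffer, assuming the node-local NVMe is mounted at `/mnt/bb/$USER` (the mount path and the `--checkpoint-dir` flag are assumptions; check the OLCF documentation):

```shell
#SBATCH --constraint=nvme

# Write checkpoints to the node-local drive, then drain them
# to the parallel filesystem after the run if needed
ckpt_dir=/mnt/bb/$USER/ckpt
mkdir -p "$ckpt_dir"
srun ./myapp --checkpoint-dir "$ckpt_dir"   # flag name hypothetical
```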