
Erik's Notes from the Pre Workshop Meeting

Erik Schnetter edited this page Jun 23, 2023 · 2 revisions

Frontier Workshop 2023 Pre-Meeting:

We know already:

  • don't use SystemTopology

We should do this:

System

  • use shared memory (64 kB per CU)
  • benchmark with 2 iterations instead of 10
  • use amd/5.4.3 (might improve performance)
  • use cray-mpich/8.1.25 (might improve performance)
  • generally, use newest versions
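
A sketch of the module setup suggested above (run before building; exact module names may differ as versions change):

```shell
# Prefer newer toolchain versions; these are the suggestions from the bullets above.
module load amd/5.4.3          # newer AMD/ROCm compiler
module load cray-mpich/8.1.25  # newer Cray MPICH
module list                    # confirm what actually got loaded
```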

Environment variables

  • MPICH_ENV_DISPLAY=1
  • MPICH_VERSION_DISPLAY=1
  • MPICH_GPU_SUPPORT_ENABLED=1
  • MPICH_GPU_MANAGED_MEMORY_SUPPORT_ENABLED=1
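
In a job script these would be exported before the srun call; a minimal sketch, assuming bash:

```shell
# MPICH diagnostics and GPU support; set before launching the application.
export MPICH_ENV_DISPLAY=1                         # print MPICH settings at startup
export MPICH_VERSION_DISPLAY=1                     # print the MPICH version
export MPICH_GPU_SUPPORT_ENABLED=1                 # enable GPU-aware MPI
export MPICH_GPU_MANAGED_MEMORY_SUPPORT_ENABLED=1  # allow managed memory in MPI calls
```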

Debugging

  • AMD_LOG_LEVEL=3 may be a good default, or enable it only for GPU debugging

  • also set AMD_LOG_MASK=?

  • for debugging: AMD_SERIALIZE_KERNEL=?, AMD_SERIALIZE_COPY=?

  • rocgdb for debugging, including debugging compute kernels; probably works only up to rocm/5.3.0

  • gdb4hpc: launches the code itself, supports side-by-side debugging
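
The debugging knobs above might be combined like this (./app is a placeholder; the AMD_LOG_MASK and AMD_SERIALIZE_* values are still open questions above, so they are left out):

```shell
# Verbose HIP/ROCr runtime logging; 3 was suggested as a reasonable level.
export AMD_LOG_LEVEL=3

# rocgdb reportedly works only up to rocm/5.3.0.
module load rocm/5.3.0
rocgdb ./app
```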

Kernels

  • use __launch_bounds__(256) to reduce the number of threads
  • move registers to LDS ("shared memory")
  • (try different compiler versions)

Performance Tools

Kernel Assembly Code

  • build with --save-temps; look at assembler code
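
A sketch of inspecting kernel assembly, assuming a hipcc build (the file names are illustrative; the names of the generated .s files depend on the target):

```shell
# Keep all intermediate files, including the GPU assembly (*.s).
hipcc --save-temps -O2 -c kernel.cpp -o kernel.o

# Inspect the AMD GPU assembly, e.g. for register usage:
grep -E 'vgpr|sgpr' ./*amdgcn*.s
```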

More Performance Tools

  • CrayPat (Cray's Performance Analysis Tool): measures performance, but not inside GPU kernels

  • Omnitrace?

Performance Tuning

  • magic sauce, test and benchmark before using:
  • FI_MR_CACHE_MONITOR=memhooks (might improve performance at scale)
  • FI_CXI_RX_MATCH_MODE=software (if suggested in error message)
  • report any results to OLCF
  • man intro_mpi, man fi_cxi
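
The bullets above can be sketched as exports in the job script; test and benchmark each one:

```shell
# "Magic sauce": benchmark with and without, and report any results to OLCF.
export FI_MR_CACHE_MONITOR=memhooks   # might improve performance at scale

# Only if suggested in an error message:
# export FI_CXI_RX_MATCH_MODE=software
```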

More Debugging

  • srun/sbatch -u for unbuffered output when debugging
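
For example (./app and the node/task counts are placeholders):

```shell
# --unbuffered (-u) forwards task output immediately; useful when a job hangs.
srun -u -N 1 -n 1 ./app
```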

Running Jobs

  • use --reservation=frontier-hackathon with Slurm

  • we could (but currently can't) use --cpus-per-task=7 ("low-noise mode"); there are only 56 cores per node available

  • use --gpus-per-task=1 (instead of --gpus-per-node)

  • for elegance, set -N, -n, and -c explicitly

  • use sbcast to broadcast files to compute nodes, reduces launch time at scale
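
A sketch combining the flags above into a batch script (the binary ./app is a placeholder; the task counts assume 8 GCDs per node on Frontier):

```shell
#!/bin/bash
#SBATCH --reservation=frontier-hackathon
#SBATCH -N 2                # nodes
#SBATCH -n 16               # total tasks, assuming 8 GCDs per node
#SBATCH -c 7                # cores per task (see the low-noise-mode caveat above)
#SBATCH --gpus-per-task=1   # instead of --gpus-per-node

# Broadcast the executable to node-local storage; reduces launch time at scale.
sbcast ./app /tmp/app
srun /tmp/app
```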

More Help

  • man intro_<tab> (mpi, perftools, craypat, ...)

  • OLCF office hours, sign up, get direct help

AMReX-Specific Ideas

General information:

GPU Hardware

  • has only 16 kB L1 cache per CU ("SM")
  • wavefront ("warp") size is 64
  • CUs have many registers (1024 words?)

Memory management

  • XNACK=1 (not the default) migrates memory between host and device automatically, even when allocated with plain malloc.
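
A sketch of enabling this; note that the runtime environment variable is named HSA_XNACK (an assumption beyond the note above, which only says "XNACK=1"):

```shell
# Automatically migrate pages between host and device, even for plain malloc.
# Not the default; benchmark, since migration can also hurt performance.
export HSA_XNACK=1
```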

SSDs

  • can use burst buffers (--constraint nvme), 4 TByte per node, e.g. for checkpointing or output?
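
A sketch of requesting and using the node-local SSDs (the /mnt/bb/$USER path is an assumption based on OLCF conventions; check the Frontier docs):

```shell
#SBATCH --constraint=nvme   # request nodes with the burst-buffer SSDs

# Node-local scratch, 4 TByte per node; path assumed, verify with OLCF docs.
BB=/mnt/bb/$USER
srun --ntasks-per-node=1 mkdir -p "$BB"
# e.g. write checkpoints to $BB, then copy them to the parallel file system
# before the job ends (node-local data is lost when the job exits).
```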