The purpose of this file is to collate all non-obvious known magic shell variables
and compiler flags required for either correctness or performance on various systems.

A repository of work-arounds.

Contents:
1. Interconnect + MPI
2. Compilation

************************
* 1. INTERCONNECT + MPI
************************

--------------------------------------------------------------------
MPI-2 I/O correctness: force Open MPI to use the MPICH ROMIO implementation for parallel I/O
--------------------------------------------------------------------
export OMPI_MCA_io=romio321
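
The component name carries a ROMIO version suffix, so it may differ between
Open MPI builds. A minimal sketch of looking it up rather than hard-coding it,
assuming ompi_info prints component lines of the form "MCA io: romio321 (...)".
The helper name pick_romio is hypothetical.

```shell
# Hypothetical helper: read `ompi_info` output on stdin and print the first
# ROMIO io component name found (e.g. romio321), or nothing if none is listed.
pick_romio() {
    sed -n 's/^ *MCA io: \(romio[0-9]*\).*/\1/p' | head -n 1
}

# Only query Open MPI if ompi_info is actually on the PATH.
if command -v ompi_info >/dev/null 2>&1; then
    romio=$(ompi_info | pick_romio)
    if [ -n "$romio" ]; then
        export OMPI_MCA_io="$romio"
    fi
    echo "OMPI_MCA_io=${OMPI_MCA_io:-unset}"
fi
```

If no ROMIO component is listed, the variable is left untouched.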

--------------------------------------
ROMIO fails with > 2GB per node read (32-bit issue)
--------------------------------------

Use a later MPICH.

https://github.com/paboyle/Grid/issues/381

https://github.com/pmodels/mpich/commit/3a479ab0

--------------------------------------------------------------------
Slingshot: Frontier and Perlmutter libfabric slowdown
and physical memory fragmentation
--------------------------------------------------------------------
export FI_MR_CACHE_MONITOR=disabled
or
export FI_MR_CACHE_MONITOR=kdreg2
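
The kdreg2 setting is only usable where the kdreg2 kernel module is present.
A minimal sketch of choosing between the two settings, assuming the module
exposes a device node at /dev/kdreg2 (an assumption; check your system).
Which setting performs better is system-dependent; this sketch simply prefers
kdreg2 when it appears to be available. The helper name pick_mr_monitor is
hypothetical.

```shell
# Hypothetical helper: print "kdreg2" if the given device node exists,
# otherwise "disabled". The default path /dev/kdreg2 is an assumption.
pick_mr_monitor() {
    dev=${1:-/dev/kdreg2}
    if [ -e "$dev" ]; then
        echo kdreg2
    else
        echo disabled
    fi
}

FI_MR_CACHE_MONITOR=$(pick_mr_monitor)
export FI_MR_CACHE_MONITOR
echo "FI_MR_CACHE_MONITOR=$FI_MR_CACHE_MONITOR"
```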

--------------------------------------------------------------------
Perlmutter
--------------------------------------------------------------------

export MPICH_RDMA_ENABLED_CUDA=1
export MPICH_GPU_IPC_ENABLED=1
export MPICH_GPU_EAGER_REGISTER_HOST_MEM=0
export MPICH_GPU_NO_ASYNC_MEMCPY=0

--------------------------------------------------------------------
Frontier/LumiG
--------------------------------------------------------------------

Unsetting ROCR_VISIBLE_DEVICES (and selecting the GPU with HIP_VISIBLE_DEVICES instead)
triggers the SDMA engines to be used for GPU-GPU transfers

cat << EOF > select_gpu
#!/bin/bash
export MPICH_GPU_SUPPORT_ENABLED=1
export MPICH_SMP_SINGLE_COPY_MODE=XPMEM
export GPU_MAP=(0 1 2 3 7 6 5 4)
export NUMA_MAP=(3 3 1 1 2 2 0 0)
export GPU=\${GPU_MAP[\$SLURM_LOCALID]}
export NUMA=\${NUMA_MAP[\$SLURM_LOCALID]}
export HIP_VISIBLE_DEVICES=\$GPU
unset ROCR_VISIBLE_DEVICES
echo RANK \$SLURM_LOCALID using GPU \$GPU
exec numactl -m \$NUMA -N \$NUMA "\$@"
EOF
chmod +x ./select_gpu

srun ./select_gpu BINARY
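
Before launching, it can help to print the local-rank to GPU/NUMA assignment
that select_gpu will make. A minimal sketch using the same two tables, written
with space-separated lists so that it also runs under a plain POSIX sh. The
helper name nth is hypothetical.

```shell
# Same tables as in select_gpu, as space-separated lists indexed by local rank.
GPU_LIST="0 1 2 3 7 6 5 4"
NUMA_LIST="3 3 1 1 2 2 0 0"

# Hypothetical helper: print the Nth (0-based) word of a space-separated list.
nth() {
    idx=$1
    set -- $2                       # split the list into positional parameters
    [ "$idx" -lt "$#" ] || return 1 # out-of-range index: print nothing
    shift "$idx"
    echo "$1"
}

# Dry run: show what each of the 8 local ranks would be given.
for rank in 0 1 2 3 4 5 6 7; do
    echo "local rank $rank -> GPU $(nth "$rank" "$GPU_LIST") NUMA $(nth "$rank" "$NUMA_LIST")"
done
```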


--------------------------------------------------------------------
Mellanox performance with A100 GPU (Tursa, Booster, Leonardo)
--------------------------------------------------------------------
export OMPI_MCA_btl=^uct,openib
export UCX_TLS=gdr_copy,rc,rc_x,sm,cuda_copy,cuda_ipc
export UCX_RNDV_SCHEME=put_zcopy
export UCX_RNDV_THRESH=16384
export UCX_IB_GPU_DIRECT_RDMA=yes

--------------------------------------------------------------------
Mellanox + A100 correctness (Tursa, Booster, Leonardo)
--------------------------------------------------------------------
export UCX_MEMTYPE_CACHE=n
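
The settings from the two Mellanox + A100 sections above can be kept together
in one job-script fragment. A minimal sketch, assuming the job is launched with
Open MPI's mpirun and that the UCX_* variables need to be forwarded explicitly
with its -x option (whether that is needed depends on the launcher; under srun
the environment is normally inherited). BINARY is a placeholder.

```shell
# Settings from the two Mellanox + A100 sections above.
export OMPI_MCA_btl='^uct,openib'
export UCX_TLS=gdr_copy,rc,rc_x,sm,cuda_copy,cuda_ipc
export UCX_RNDV_SCHEME=put_zcopy
export UCX_RNDV_THRESH=16384
export UCX_IB_GPU_DIRECT_RDMA=yes
export UCX_MEMTYPE_CACHE=n

# Build "-x VAR" arguments so mpirun forwards the UCX variables to all ranks.
forward=""
for v in UCX_TLS UCX_RNDV_SCHEME UCX_RNDV_THRESH UCX_IB_GPU_DIRECT_RDMA UCX_MEMTYPE_CACHE; do
    forward="$forward -x $v"
done

# Print the launch line instead of running it; BINARY is a placeholder.
echo "mpirun$forward BINARY"
```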

--------------------------------------------------------------------
MPICH/Aurora/PVC correctness and performance 
--------------------------------------------------------------------

https://github.com/pmodels/mpich/issues/7302

Configure Grid with:

--enable-cuda-aware-mpi=no
--enable-unified=no

This selects Grid's internal D-H-H-D pipeline mode, which avoids passing device memory to MPI.
Do not use SVM.

Ideally, use an MPICH that includes the fix for issue 7302:

https://github.com/pmodels/mpich/pull/7312

Ideally:
export MPIR_CVAR_CH4_IPC_GPU_HANDLE_CACHE=generic

Alternatives:
export MPIR_CVAR_NOLOCAL=1
export MPIR_CVAR_CH4_IPC_GPU_P2P_THRESHOLD=1000000000

--------------------------------------------------------------------
MPICH/Aurora/PVC GPU pipeline (broken)
--------------------------------------------------------------------

Broken:
export MPIR_CVAR_CH4_OFI_ENABLE_GPU_PIPELINE=1

This gives good performance without requiring
--enable-cuda-aware-mpi=no

but is subject to an open issue reported by James Osborn:
https://github.com/pmodels/mpich/issues/7139

Possibly resolved, but it is unclear whether the fix is in the installed software yet.

************************
* 2. COMPILATION
************************

--------------------------------------------------------------------
G++ compiler breakage / graveyard
--------------------------------------------------------------------

Known breakage: 9.3.0, 10.3.1
https://github.com/paboyle/Grid/issues/290
https://github.com/paboyle/Grid/issues/264

Working (-) Broken (X):

4.9.0 -
4.9.1 -
5.1.0 X
5.2.0 X
5.3.0 X
5.4.0 X
6.1.0 X
6.2.0 X
6.3.0 -
7.1.0 -
8.0.0 (HEAD) -

https://github.com/paboyle/Grid/issues/100
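
The table above can be turned into a quick pre-build check. A minimal sketch
that encodes only the exact versions listed in the table and reports anything
else as unknown. The helper name gxx_status is hypothetical.

```shell
# Hypothetical helper: classify a g++ version string against the table above.
# Only the exact versions listed there are encoded; others print "unknown".
gxx_status() {
    case "$1" in
        5.1.0|5.2.0|5.3.0|5.4.0|6.1.0|6.2.0) echo broken ;;
        4.9.0|4.9.1|6.3.0|7.1.0|8.0.0)       echo working ;;
        *)                                   echo unknown ;;
    esac
}

# Report on the g++ found on the PATH, if any.
if command -v g++ >/dev/null 2>&1; then
    ver=$(g++ -dumpfullversion -dumpversion)
    echo "g++ $ver: $(gxx_status "$ver")"
fi
```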

--------------------------------------------------------------------
AMD GPU nodes:
--------------------------------------------------------------------

Multiple ROCm versions are broken; use 5.3.0.
The breakage manifests itself as wrong results in fp32.

https://github.com/paboyle/Grid/issues/464
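
A job script can check the ROCm version before running. A minimal sketch,
assuming the installed version is recorded in /opt/rocm/.info/version (an
assumption; the location may differ on your system, and ROCM_VERSION_FILE is a
hypothetical override). The helper name rocm_ok is also hypothetical.

```shell
# Hypothetical helper: succeed only for a 5.3.0 version string
# (e.g. "5.3.0" or "5.3.0-63").
rocm_ok() {
    case "$1" in
        5.3.0|5.3.0-*|5.3.0.*) return 0 ;;
        *)                     return 1 ;;
    esac
}

# Assumed location of the version file; override ROCM_VERSION_FILE if needed.
ROCM_VERSION_FILE=${ROCM_VERSION_FILE:-/opt/rocm/.info/version}
if [ -r "$ROCM_VERSION_FILE" ]; then
    ver=$(cat "$ROCM_VERSION_FILE")
    if rocm_ok "$ver"; then
        echo "ROCm $ver: OK"
    else
        echo "ROCm $ver: not 5.3.0, check fp32 results (see issue 464)"
    fi
fi
```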

--------------------------------------------------------------------
Aurora/PVC
--------------------------------------------------------------------

SYCL ahead-of-time compilation (fixes rare runtime JIT errors and gives a faster runtime, PB)
SYCL slow link and relocatable device code issues, see -fno-sycl-rdc below (Christoph Lehner)
The large register file option is required for good performance in fp64


export SYCL_PROGRAM_COMPILE_OPTIONS="-ze-opt-large-register-file"
export LDFLAGS="-fiopenmp -fsycl -fsycl-device-code-split=per_kernel -fsycl-targets=spir64_gen -Xs -device -Xs pvc -fsycl-device-lib=all -lze_loader -L${MKLROOT}/lib -qmkl=parallel -lsycl -fPIC -fsycl-max-parallel-link-jobs=16 -fno-sycl-rdc"
export CXXFLAGS="-O3 -fiopenmp -fsycl-unnamed-lambda -fsycl -Wno-tautological-compare -qmkl=parallel -fno-exceptions -fPIC"