Replace hand-rolled shared-memory reduction kernels (reduceBlock/reduceBlocks/reduceKernel) and the global device variable retirementCount with a unified CUB/hipCUB DeviceReduce::Reduce path for CUDA/HIP and sycl::reduction for SYCL. No small/large split is needed: both CUB and sycl::reduction handle arbitrary object sizes internally. The old implementations are preserved as sum_gpu_old/sumD_gpu_old etc. in the original files for regression testing on GPU hardware. Also adds CLAUDE.md with build, test, and architecture guidance.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
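For reference, a minimal sketch of the two-phase CUB `DeviceReduce::Reduce` pattern the commit message describes (hipCUB mirrors the same API under `hipcub::`); the function and buffer names below are illustrative, not Grid's actual implementation:

```cpp
#include <cub/cub.cuh>

// Illustrative helper, not Grid's sum_gpu: reduce num_items doubles
// already resident on the device into *d_out.
cudaError_t device_sum(const double *d_in, double *d_out, int num_items)
{
  void  *d_temp     = nullptr;
  size_t temp_bytes = 0;
  // First call: d_temp == nullptr, so CUB only reports the scratch size.
  cub::DeviceReduce::Reduce(d_temp, temp_bytes, d_in, d_out,
                            num_items, cub::Sum(), 0.0);
  cudaMalloc(&d_temp, temp_bytes);
  // Second call performs the reduction; CUB combines partial block
  // results internally, so no retirementCount-style counter is needed.
  cudaError_t err = cub::DeviceReduce::Reduce(d_temp, temp_bytes, d_in,
                                              d_out, num_items,
                                              cub::Sum(), 0.0);
  cudaFree(d_temp);
  return err;
}
```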
# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## What This Is
Grid is a data-parallel C++ library for lattice QCD. It provides SIMD-vectorised lattice containers, MPI-based domain decomposition, GPU acceleration (CUDA/HIP/SYCL), and a full suite of QCD algorithms including HMC.
## Build
Uses GNU Autotools. The bootstrap step only needs to run once (or after `configure.ac` changes).
```bash
./bootstrap.sh            # downloads Eigen 3.4.0, generates configure
mkdir build && cd build
../configure [options]
make -j$(nproc)
make check                # run root-level tests
make install
```
Key configure options:
| Option | Common values |
|---|---|
| `--enable-simd=` | `AVX2`, `AVX512`, `KNL`, `A64FX`, `NEONv8`, `GPU` |
| `--enable-comms=` | `mpi-auto`, `mpi3-auto`, `none` |
| `--enable-accelerator=` | `cuda`, `hip`, `sycl` |
| `--enable-shm=` | `shmopen`, `hugetlbfs`, `nvlink` |
| `--enable-Nc=` | `3` (default), `2`, `4`, `5` |
| `--with-gmp=`, `--with-mpfr=`, `--with-fftw=`, `--with-lime=` | paths to libs |
| `--enable-hdf5`, `--enable-mkl`, `--enable-lapack` | optional features |
Platform recipes from `README.md`:

- KNL: `--enable-simd=KNL --enable-comms=mpi3-auto --enable-mkl`
- Skylake/Haswell: `--enable-simd=AVX512` (or `AVX2`) plus `--enable-comms=mpi3-auto`
- AMD EPYC: `--enable-simd=AVX2 --enable-comms=mpi3`
- A64FX (Fugaku): `--enable-simd=A64FX --enable-comms=mpi3 --enable-shm=shmget` (see `SVE_README.txt`)
Required external libs: GMP, MPFR, OpenSSL, zlib.
## Running Tests
```bash
# From build directory
make check                    # root-level tests (Test_simd, Test_cshift, etc.)
make -C tests/<subdir> tests  # build tests in a subdirectory
./tests/core/Test_simd        # run a single test binary directly
```
Test subdirectories and their focus: `core` (SIMD, stencil, comms), `solver` (CG, GMRES, eigensolvers), `hmc` (MD integrators), `forces` (fermion forces), `lanczos`, `IO`, `smearing`, `sp2n`, `debug`.
## Architecture
### Layer stack (bottom to top)
- **SIMD layer** (`Grid/simd/`) — platform-specific intrinsics wrapped into `vRealF`, `vComplexD`, etc. The SIMD width and layout are compile-time constants controlled by `--enable-simd`.
- **Tensor layer** (`Grid/tensors/`) — Lorentz/colour/spin tensor algebra built on top of SIMD types. `iMatrix`, `iVector`, `iScalar` templates compose into QCD types like `ColourMatrix`, `SpinColourVector`.
- **Lattice layer** (`Grid/lattice/`) — `Lattice<T>` container: a site-local tensor replicated across a distributed Cartesian grid. All arithmetic is site-parallel and expression-template-fused (see the sketch after this list).
- **Cartesian/comms layer** (`Grid/cartesian/`, `Grid/communicator/`) — `GridCartesian` holds the MPI topology and local/global geometry. `Grid/cshift/` implements nearest-neighbour halo exchange; `Grid/stencil/` is the optimised multi-hop stencil used by Dirac operators.
- **Algorithm layer** (`Grid/algorithms/`) — iterative solvers (CG, GMRES, BiCGSTAB, mixed-precision), eigensolvers (Lanczos, LAPACK), FFT, smearing.
- **QCD layer** (`Grid/qcd/`) — gauge and fermion actions, HMC integrators, observables.
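As a rough illustration of how these layers combine in user code, a sketch based on Grid's standard test boilerplate (the variable names are illustrative):

```cpp
#include <Grid/Grid.h>
using namespace Grid;

int main(int argc, char **argv)
{
  Grid_init(&argc, &argv);

  // 4-d lattice decomposed over MPI ranks; SIMD layout fixed at compile time
  GridCartesian *grid = SpaceTimeGrid::makeFourDimGrid(
      GridDefaultLatt(), GridDefaultSimd(Nd, vComplex::Nsimd()),
      GridDefaultMpi());

  GridParallelRNG rng(grid);
  rng.SeedFixedIntegers(std::vector<int>{1, 2, 3, 4});

  LatticeColourMatrix U(grid), V(grid), W(grid);
  random(rng, U);
  random(rng, V);

  W = U * V + adj(U);   // expression templates fuse this into one site loop

  Grid_finalize();
  return 0;
}
```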
### QCD subsystem (`Grid/qcd/`)
- `action/fermion/` — Wilson, Clover, DWF (Mobius), Staggered, twisted-mass, G-parity variants (see the sketch after this list)
- `action/gauge/` — Wilson gauge, Symanzik, Iwasaki, DBW2, plaquette+rect
- `representations/` — Fundamental, Adjoint, Two-index, Sp(2n)
- `hmc/` — Leapfrog, OMF2/OMF4 integrators; pseudofermion refreshment; Metropolis accept/reject
- `smearing/` — APE, Stout, HEX, gradient flow
- `observables/` — Polyakov loop, plaquette, topological charge
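A sketch of the fermion-action API, following the pattern used throughout `tests/` (the mass value and names are illustrative; the grids and RNG are assumed set up as in the earlier sketch):

```cpp
#include <Grid/Grid.h>
using namespace Grid;

// Sketch: build a Wilson operator from a gauge field and apply it once.
void wilson_demo(GridCartesian *UGrid, GridRedBlackCartesian *UrbGrid,
                 GridParallelRNG &pRNG)
{
  LatticeGaugeField Umu(UGrid);
  SU<Nc>::HotConfiguration(pRNG, Umu);   // random "hot" gauge start

  RealD mass = 0.1;                      // illustrative value
  WilsonFermionD Dw(Umu, *UGrid, *UrbGrid, mass);

  LatticeFermion src(UGrid), res(UGrid);
  gaussian(pRNG, src);
  Dw.M(src, res);                        // apply the Dirac operator
}
```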
### GPU acceleration
GPU support is injected via macros (`accelerator_for`, `accelerator_for2dNB`). The `Grid/simd/` SIMD types map to scalar types in GPU device code, while host code paths remain vectorised. Unified virtual memory is on by default (`--enable-unified=yes`); device-aware MPI (`--enable-accelerator-aware-mpi`) avoids device→host copies on transfers.
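A sketch of the `accelerator_for` idiom (the view and `coalescedWrite` pattern follows Grid's usage; the fields and function name are illustrative):

```cpp
#include <Grid/Grid.h>
using namespace Grid;

// Sketch: one fused site-parallel kernel, runnable on host or device.
void add_fields(GridCartesian *grid)
{
  LatticeComplex a(grid), b(grid), c(grid);
  b = ComplexD(1.0);
  c = ComplexD(2.0);

  autoView(a_v, a, AcceleratorWrite);   // views fix the access mode
  autoView(b_v, b, AcceleratorRead);
  autoView(c_v, c, AcceleratorRead);
  accelerator_for(ss, grid->oSites(), vComplex::Nsimd(), {
    // GPU: one thread per SIMD lane per outer site;
    // CPU: an OpenMP loop over vectorised sites.
    coalescedWrite(a_v[ss], b_v(ss) + c_v(ss));
  });
}
```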
### Memory and I/O
- `Grid/allocator/` — aligned/NUMA-aware allocators; caching allocator via `--enable-alloc-cache`
- `Grid/parallelIO/` — distributed parallel reader/writer for ILDG (via LIME), SciDAC, and native binary formats
- `Grid/serialisation/` — text, binary, HDF5, XML/JSON serialisation of arbitrary Grid objects (see the sketch after this list)
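A sketch of the serialisation API (the struct and field names are illustrative; the macro and writer classes come from `Grid/serialisation/`):

```cpp
#include <Grid/Grid.h>
using namespace Grid;

// Sketch: a user-defined serialisable struct.
struct Params : Serializable {
  GRID_SERIALIZABLE_CLASS_MEMBERS(Params,
                                  double, beta,
                                  int,    Nsteps);
};

void save_params(const Params &p)
{
  XmlWriter wr("params.xml");   // binary/HDF5/JSON writers are analogous
  write(wr, "Params", p);
}
```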
### HMC applications
`HMC/` contains production-ready HMC driver programmes (e.g. `Mobius2p1f.cc`, `DWF_plus_DSDR_nf2plus1_Shamir_Gparity.cc`). These are built separately from the library tests.
## Key Conventions
- C++17 is required throughout.
- Template structure: most classes are templated on `<FImpl>` (fermion impl) or `<Gimpl>` (gauge impl), which encode the representation and precision. Instantiation is controlled by `--enable-fermion-instantiations`.
- The `RealD`/`RealF`/`ComplexD`/`ComplexF` typedefs are used everywhere; avoid raw `double`/`float`.
- Logging uses the `Grid_log`/`Grid_error` macros (from `Grid/log/`); performance-critical paths use the `GRID_TRACE`/timer macros from `Grid/perfmon/`.
- Reductions across MPI ranks go through `GridBase::GlobalSum`/`GlobalMax`; never reduce with bare MPI calls inside library code (see the sketch after this list).
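A sketch of that last convention (the function and names are illustrative):

```cpp
#include <Grid/Grid.h>
using namespace Grid;

// Sketch: combine per-rank partial results through GridBase,
// never through a bare MPI_Allreduce.
RealD global_average(GridBase *grid, RealD local_sum, RealD local_count)
{
  grid->GlobalSum(local_sum);     // in-place sum over all MPI ranks
  grid->GlobalSum(local_count);
  return local_sum / local_count;
}
```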