Replace hand-rolled shared-memory reduction kernels (reduceBlock/reduceBlocks/reduceKernel) and the global device variable retirementCount with a unified CUB/hipCUB DeviceReduce::Reduce path for CUDA/HIP and sycl::reduction for SYCL. No small/large split is needed: both CUB and sycl::reduction handle arbitrary object sizes internally. The old implementations are preserved as sum_gpu_old/sumD_gpu_old etc. in the original files for regression testing on GPU hardware. Also adds CLAUDE.md with build, test, and architecture guidance.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
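For reference, a minimal sketch of the two-phase CUB `DeviceReduce::Reduce` pattern the commit message describes (hipCUB mirrors the same API under `hipcub::`); the function and buffer names below are illustrative, not Grid's actual implementation:

```cpp
#include <cub/cub.cuh>

// Illustrative helper, not Grid's sum_gpu: reduce num_items doubles
// already resident on the device into *d_out.
cudaError_t device_sum(const double *d_in, double *d_out, int num_items)
{
  void  *d_temp     = nullptr;
  size_t temp_bytes = 0;
  // First call: d_temp == nullptr, so CUB only reports the scratch size.
  cub::DeviceReduce::Reduce(d_temp, temp_bytes, d_in, d_out,
                            num_items, cub::Sum(), 0.0);
  cudaMalloc(&d_temp, temp_bytes);
  // Second call performs the reduction; CUB combines partial block
  // results internally, so no retirementCount-style counter is needed.
  cudaError_t err = cub::DeviceReduce::Reduce(d_temp, temp_bytes, d_in,
                                              d_out, num_items,
                                              cub::Sum(), 0.0);
  cudaFree(d_temp);
  return err;
}
```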
# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## What This Is
Grid is a data-parallel C++ library for lattice QCD. It provides SIMD-vectorised lattice containers, MPI-based domain decomposition, GPU acceleration (CUDA/HIP/SYCL), and a full suite of QCD algorithms including HMC.
## Build
Uses GNU Autotools. The bootstrap step only needs to run once (or after `configure.ac` changes).
```bash
./bootstrap.sh            # downloads Eigen 3.4.0, generates configure
mkdir build && cd build
../configure [options]
make -j$(nproc)
make check                # run root-level tests
make install
```
Key configure options:
| Option | Common values |
|---|---|
| `--enable-simd=` | `AVX2`, `AVX512`, `KNL`, `A64FX`, `NEONv8`, `GPU` |
| `--enable-comms=` | `mpi-auto`, `mpi3-auto`, `none` |
| `--enable-accelerator=` | `cuda`, `hip`, `sycl` |
| `--enable-shm=` | `shmopen`, `hugetlbfs`, `nvlink` |
| `--enable-Nc=` | `3` (default), `2`, `4`, `5` |
| `--with-gmp=`, `--with-mpfr=`, `--with-fftw=`, `--with-lime=` | paths to libs |
| `--enable-hdf5`, `--enable-mkl`, `--enable-lapack` | optional features |
Platform recipes from `README.md`:

- KNL: `--enable-simd=KNL --enable-comms=mpi3-auto --enable-mkl`
- Skylake/Haswell: `--enable-simd=AVX512` (or `AVX2`) plus `--enable-comms=mpi3-auto`
- AMD EPYC: `--enable-simd=AVX2 --enable-comms=mpi3`
- A64FX (Fugaku): `--enable-simd=A64FX --enable-comms=mpi3 --enable-shm=shmget` (see `SVE_README.txt`)
Required external libs: GMP, MPFR, OpenSSL, zlib.
## Running Tests
```bash
# From build directory
make check                    # root-level tests (Test_simd, Test_cshift, etc.)
make -C tests/<subdir> tests  # build tests in a subdirectory
./tests/core/Test_simd        # run a single test binary directly
```
Test subdirectories and their focus: `core` (SIMD, stencil, comms), `solver` (CG, GMRES, eigensolvers), `hmc` (MD integrators), `forces` (fermion forces), `lanczos`, `IO`, `smearing`, `sp2n`, `debug`.
## Architecture
### Layer stack (bottom to top)
- **SIMD layer** (`Grid/simd/`) — platform-specific intrinsics wrapped into `vRealF`, `vComplexD`, etc. The SIMD width and layout are compile-time constants controlled by `--enable-simd`.
- **Tensor layer** (`Grid/tensors/`) — Lorentz/colour/spin tensor algebra built on top of SIMD types. `iMatrix`, `iVector`, `iScalar` templates compose into QCD types like `ColourMatrix`, `SpinColourVector`.
- **Lattice layer** (`Grid/lattice/`) — `Lattice<T>` container: a site-local tensor replicated across a distributed Cartesian grid. All arithmetic is site-parallel and expression-template-fused (see the sketch after this list).
- **Cartesian/comms layer** (`Grid/cartesian/`, `Grid/communicator/`) — `GridCartesian` holds the MPI topology and local/global geometry. `Grid/cshift/` implements nearest-neighbour halo exchange; `Grid/stencil/` is the optimised multi-hop stencil used by Dirac operators.
- **Algorithm layer** (`Grid/algorithms/`) — iterative solvers (CG, GMRES, BiCGSTAB, mixed-precision), eigensolvers (Lanczos, LAPACK), FFT, smearing.
- **QCD layer** (`Grid/qcd/`) — gauge and fermion actions, HMC integrators, observables.
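As a rough illustration of how these layers combine in user code, a sketch based on Grid's standard test boilerplate (the variable names are illustrative):

```cpp
#include <Grid/Grid.h>
using namespace Grid;

int main(int argc, char **argv)
{
  Grid_init(&argc, &argv);

  // 4-d lattice decomposed over MPI ranks; SIMD layout fixed at compile time
  GridCartesian *grid = SpaceTimeGrid::makeFourDimGrid(
      GridDefaultLatt(), GridDefaultSimd(Nd, vComplex::Nsimd()),
      GridDefaultMpi());

  GridParallelRNG rng(grid);
  rng.SeedFixedIntegers(std::vector<int>{1, 2, 3, 4});

  LatticeColourMatrix U(grid), V(grid), W(grid);
  random(rng, U);
  random(rng, V);

  W = U * V + adj(U);   // expression templates fuse this into one site loop

  Grid_finalize();
  return 0;
}
```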
### QCD subsystem (`Grid/qcd/`)
- `action/fermion/` — Wilson, Clover, DWF (Mobius), Staggered, twisted-mass, G-parity variants (see the sketch after this list)
- `action/gauge/` — Wilson gauge, Symanzik, Iwasaki, DBW2, plaquette+rect
- `representations/` — Fundamental, Adjoint, Two-index, Sp(2n)
- `hmc/` — Leapfrog, OMF2/OMF4 integrators; pseudofermion refreshment; Metropolis accept/reject
- `smearing/` — APE, Stout, HEX, gradient flow
- `observables/` — Polyakov loop, plaquette, topological charge
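A sketch of the fermion-action API, following the pattern used throughout `tests/` (the mass value and names are illustrative; the grids and RNG are assumed set up as in the earlier sketch):

```cpp
#include <Grid/Grid.h>
using namespace Grid;

// Sketch: build a Wilson operator from a gauge field and apply it once.
void wilson_demo(GridCartesian *UGrid, GridRedBlackCartesian *UrbGrid,
                 GridParallelRNG &pRNG)
{
  LatticeGaugeField Umu(UGrid);
  SU<Nc>::HotConfiguration(pRNG, Umu);   // random "hot" gauge start

  RealD mass = 0.1;                      // illustrative value
  WilsonFermionD Dw(Umu, *UGrid, *UrbGrid, mass);

  LatticeFermion src(UGrid), res(UGrid);
  gaussian(pRNG, src);
  Dw.M(src, res);                        // apply the Dirac operator
}
```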
### GPU acceleration
GPU support is injected via macros (`accelerator_for`, `accelerator_for2dNB`). The `Grid/simd/` SIMD types map to scalar types in GPU device code, while host code paths remain vectorised. Unified virtual memory is on by default (`--enable-unified=yes`); device-aware MPI (`--enable-accelerator-aware-mpi`) avoids device→host copies on transfers.
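A sketch of the `accelerator_for` idiom (the view and `coalescedWrite` pattern follows Grid's usage; the fields and function name are illustrative):

```cpp
#include <Grid/Grid.h>
using namespace Grid;

// Sketch: one fused site-parallel kernel, runnable on host or device.
void add_fields(GridCartesian *grid)
{
  LatticeComplex a(grid), b(grid), c(grid);
  b = ComplexD(1.0);
  c = ComplexD(2.0);

  autoView(a_v, a, AcceleratorWrite);   // views fix the access mode
  autoView(b_v, b, AcceleratorRead);
  autoView(c_v, c, AcceleratorRead);
  accelerator_for(ss, grid->oSites(), vComplex::Nsimd(), {
    // GPU: one thread per SIMD lane per outer site;
    // CPU: an OpenMP loop over vectorised sites.
    coalescedWrite(a_v[ss], b_v(ss) + c_v(ss));
  });
}
```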
### Memory and I/O
- `Grid/allocator/` — aligned/NUMA-aware allocators; caching allocator via `--enable-alloc-cache`
- `Grid/parallelIO/` — distributed parallel reader/writer for ILDG (via LIME), SciDAC, and native binary formats
- `Grid/serialisation/` — text, binary, HDF5, XML/JSON serialisation of arbitrary Grid objects (see the sketch after this list)
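A sketch of the serialisation API (the struct and field names are illustrative; the macro and writer classes come from `Grid/serialisation/`):

```cpp
#include <Grid/Grid.h>
using namespace Grid;

// Sketch: a user-defined serialisable struct.
struct Params : Serializable {
  GRID_SERIALIZABLE_CLASS_MEMBERS(Params,
                                  double, beta,
                                  int,    Nsteps);
};

void save_params(const Params &p)
{
  XmlWriter wr("params.xml");   // binary/HDF5/JSON writers are analogous
  write(wr, "Params", p);
}
```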
### HMC applications
`HMC/` contains production-ready HMC driver programmes (e.g. `Mobius2p1f.cc`, `DWF_plus_DSDR_nf2plus1_Shamir_Gparity.cc`). These are built separately from the library tests.
## Key Conventions
- C++17 is required throughout.
- Template structure: most classes are templated on `<FImpl>` (fermion impl) or `<Gimpl>` (gauge impl), which encode the representation and precision. Instantiation is controlled by `--enable-fermion-instantiations`.
- The `RealD`/`RealF`/`ComplexD`/`ComplexF` typedefs are used everywhere; avoid raw `double`/`float`.
- Logging uses the `Grid_log`/`Grid_error` macros (from `Grid/log/`); performance-critical paths use the `GRID_TRACE`/timer macros from `Grid/perfmon/`.
- Reductions across MPI ranks go through `GridBase::GlobalSum`/`GlobalMax`; never reduce with bare MPI calls inside library code (see the sketch after this list).
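A sketch of that last convention (the function and names are illustrative):

```cpp
#include <Grid/Grid.h>
using namespace Grid;

// Sketch: combine per-rank partial results through GridBase,
// never through a bare MPI_Allreduce.
RealD global_average(GridBase *grid, RealD local_sum, RealD local_count)
{
  grid->GlobalSum(local_sum);     // in-place sum over all MPI ranks
  grid->GlobalSum(local_count);
  return local_sum / local_count;
}
```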