mirror of
https://github.com/paboyle/Grid.git
synced 2026-05-17 07:34:31 +01:00
Rewrite lattice GPU reduction to use CUB, hipCUB, and SYCL reduction

Replace hand-rolled shared-memory reduction kernels (reduceBlock/reduceBlocks/reduceKernel) and the global device variable retirementCount with a unified CUB/hipCUB DeviceReduce::Reduce path for CUDA/HIP and sycl::reduction for SYCL.

No small/large split is needed: both CUB and sycl::reduction handle arbitrary object sizes internally.

Old implementations preserved as sum_gpu_old / sumD_gpu_old etc. in the original files for regression testing on GPU hardware.

Also add CLAUDE.md with build, test, and architecture guidance.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@@ -0,0 +1,98 @@
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## What This Is

Grid is a data-parallel C++ library for lattice QCD. It provides SIMD-vectorised lattice containers, MPI-based domain decomposition, GPU acceleration (CUDA/HIP/SYCL), and a full suite of QCD algorithms including HMC.

## Build

Uses GNU Autotools. The bootstrap step only needs to run once (or after `configure.ac` changes).

```bash
./bootstrap.sh          # downloads Eigen 3.4.0, generates configure
mkdir build && cd build
../configure [options]
make -j$(nproc)
make check              # run root-level tests
make install
```

Key configure options:

| Option | Common values |
|--------|---------------|
| `--enable-simd=` | `AVX2`, `AVX512`, `KNL`, `A64FX`, `NEONv8`, `GPU` |
| `--enable-comms=` | `mpi-auto`, `mpi3-auto`, `none` |
| `--enable-accelerator=` | `cuda`, `hip`, `sycl` |
| `--enable-shm=` | `shmopen`, `hugetlbfs`, `nvlink` |
| `--enable-Nc=` | `3` (default), `2`, `4`, `5` |
| `--with-gmp=`, `--with-mpfr=`, `--with-fftw=`, `--with-lime=` | paths to libs |
| `--enable-hdf5`, `--enable-mkl`, `--enable-lapack` | optional features |

Platform recipes from `README.md`:
- **KNL**: `--enable-simd=KNL --enable-comms=mpi3-auto --enable-mkl`
- **Skylake/Haswell**: `--enable-simd=AVX512` (Skylake) or `--enable-simd=AVX2` (Haswell), together with `--enable-comms=mpi3-auto`
- **AMD EPYC**: `--enable-simd=AVX2 --enable-comms=mpi3`
- **A64FX (Fugaku)**: `--enable-simd=A64FX --enable-comms=mpi3 --enable-shm=shmget` (see `SVE_README.txt`)

Required external libs: GMP, MPFR, OpenSSL, zlib.

## Running Tests

```bash
# From build directory
make check                        # root-level tests (Test_simd, Test_cshift, etc.)
make -C tests/<subdir> tests      # build tests in a subdirectory
./tests/core/Test_simd            # run a single test binary directly
```

Test subdirectories and their focus: `core` (SIMD, stencil, comms), `solver` (CG, GMRES, eigensolvers), `hmc` (MD integrators), `forces` (fermion forces), `lanczos`, `IO`, `smearing`, `sp2n`, `debug`.

## Architecture

### Layer stack (bottom to top)

1. **SIMD layer** (`Grid/simd/`): platform-specific intrinsics wrapped into `vRealF`, `vComplexD`, etc. The SIMD width and layout are compile-time constants controlled by `--enable-simd`.

2. **Tensor layer** (`Grid/tensors/`): Lorentz/colour/spin tensor algebra built on top of SIMD types. `iMatrix`, `iVector`, `iScalar` templates compose into QCD types like `ColourMatrix`, `SpinColourVector`.

3. **Lattice layer** (`Grid/lattice/`): the `Lattice<T>` container, a site-local tensor replicated across a distributed Cartesian grid. All arithmetic is site-parallel and expression-template-fused.

4. **Cartesian/comms layer** (`Grid/cartesian/`, `Grid/communicator/`): `GridCartesian` holds the MPI topology and local/global geometry. `Grid/cshift/` implements nearest-neighbour halo exchange; `Grid/stencil/` is the optimised multi-hop stencil used by Dirac operators.

5. **Algorithm layer** (`Grid/algorithms/`): iterative solvers (CG, GMRES, BiCGSTAB, mixed-precision), eigensolvers (Lanczos, LAPACK), FFT, smearing.

6. **QCD layer** (`Grid/qcd/`): gauge and fermion actions, HMC integrators, observables.

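As a rough illustration of how nested templates in the tensor layer can compose into site types, the sketch below builds toy versions of a colour matrix and a spin-colour vector. Everything in the `toy` namespace is a simplified stand-in invented for this example (plain `std::complex<double>` instead of SIMD types, no expression templates, and an assumed spin dimension of 4); it is not Grid's actual `iScalar`/`iVector`/`iMatrix` code.

```cpp
#include <array>
#include <complex>
#include <cstddef>

namespace toy {  // hypothetical stand-ins, not Grid's real templates

using Complex = std::complex<double>;

// Fixed-length vector of some inner object (loosely analogous to iVector).
template <class T, std::size_t N>
struct Vector {
  std::array<T, N> v{};  // value-initialised, so complex entries start at zero
};

// Fixed-size square matrix of some inner object (loosely analogous to iMatrix).
template <class T, std::size_t N>
struct Matrix {
  std::array<std::array<T, N>, N> m{};
};

constexpr std::size_t Ns = 4;  // spin components (assumed for this sketch)
constexpr std::size_t Nc = 3;  // colours, matching the default --enable-Nc=3

// Nesting the templates yields QCD-like site types.
using ColourVector = Vector<Complex, Nc>;
using ColourMatrix = Matrix<Complex, Nc>;
using SpinColourVector = Vector<ColourVector, Ns>;  // spin outside, colour inside

// Colour matrix times colour vector: y_i = sum_j U_ij x_j.
inline ColourVector operator*(const ColourMatrix& U, const ColourVector& x) {
  ColourVector y;
  for (std::size_t i = 0; i < Nc; ++i)
    for (std::size_t j = 0; j < Nc; ++j)
      y.v[i] += U.m[i][j] * x.v[j];
  return y;
}

// The colour matrix acts on each spin component independently.
inline SpinColourVector operator*(const ColourMatrix& U,
                                  const SpinColourVector& psi) {
  SpinColourVector out;
  for (std::size_t s = 0; s < Ns; ++s) out.v[s] = U * psi.v[s];
  return out;
}

}  // namespace toy
```

With these definitions, `U * psi` for a `toy::ColourMatrix U` and a `toy::SpinColourVector psi` dispatches on the nesting depth: the outer overload loops over spin and delegates each component to the colour-level product.
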
### QCD subsystem (`Grid/qcd/`)

- `action/fermion/`: Wilson, Clover, DWF (Mobius), Staggered, twisted-mass, G-parity variants
- `action/gauge/`: Wilson gauge, Symanzik, Iwasaki, DBW2, plaquette+rect
- `representations/`: Fundamental, Adjoint, Two-index, Sp(2n)
- `hmc/`: Leapfrog, OMF2/OMF4 integrators; pseudofermion refreshment; Metropolis accept/reject
- `smearing/`: APE, Stout, HEX, gradient flow
- `observables/`: Polyakov loop, plaquette, topological charge

### GPU acceleration

GPU support is injected via macros (`accelerator_for`, `accelerator_for2dNB`). The `Grid/simd/` SIMD types map to scalar on GPU device code; host code paths remain vectorised. Unified virtual memory is on by default (`--enable-unified=yes`); device-aware MPI (`--enable-accelerator-aware-mpi`) avoids device→host copies on transfers.

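The macro-injection idea can be sketched in plain C++. `TOY_ACCELERATOR_FOR` and `toy_axpy` below are hypothetical, host-only stand-ins written for this example: the macro simply expands to a serial loop, whereas Grid's real `accelerator_for` has its own signature and launches a device kernel on GPU builds. The point is only that call sites stay the same while the macro definition decides where the loop body runs.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a loop-injection macro. The variadic tail lets the
// loop body contain commas. Here it expands to a serial host loop; a GPU build
// could redefine it to launch a kernel without touching any call site.
#define TOY_ACCELERATOR_FOR(iter, num, ...)                      \
  for (std::size_t iter = 0; iter < (num); ++iter) { __VA_ARGS__ }

// y[i] = a * x[i] + y[i] for every site i (assumes x.size() >= y.size()).
inline void toy_axpy(double a, const std::vector<double>& x,
                     std::vector<double>& y) {
  // Raw pointers mimic the "view" a device kernel would capture by value.
  const double* xp = x.data();
  double* yp = y.data();
  TOY_ACCELERATOR_FOR(i, y.size(), {
    yp[i] = a * xp[i] + yp[i];
  });
}
```

Calling `toy_axpy(2.0, x, y)` updates `y` in place; swapping the macro definition would change the execution backend without editing `toy_axpy` itself.
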
### Memory and I/O

- `Grid/allocator/`: aligned/NUMA-aware allocators; caching allocator via `--enable-alloc-cache`
- `Grid/parallelIO/`: distributed parallel reader/writer for ILDG (via LIME), SciDAC, and native binary formats
- `Grid/serialisation/`: text, binary, HDF5, XML/JSON serialisation of arbitrary Grid objects

### HMC applications

`HMC/` contains production-ready HMC driver programmes (e.g. `Mobius2p1f.cc`, `DWF_plus_DSDR_nf2plus1_Shamir_Gparity.cc`). These are built separately from the library tests.

## Key Conventions

- **C++17** is required throughout.
- Template structure: most classes are templated on `<_FImpl>` (fermion impl) or `<Gimpl>` (gauge impl), which encode the representation and precision. Instantiation is controlled by `--enable-fermion-instantiations`.
- The `RealD`/`RealF`/`ComplexD`/`ComplexF` typedefs are used everywhere; avoid raw `double`/`float`.
- Logging uses `Grid_log`, `Grid_error` macros (from `Grid/log/`); performance-critical paths use the `GRID_TRACE` / timer macros from `Grid/perfmon/`.
- Reductions across MPI ranks go through `GridBase::GlobalSum` / `GlobalMax`; never reduce with bare MPI calls inside library code.

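The implementation-policy pattern from the template-structure bullet can be sketched as follows. `ToyImplD`, `ToyImplF` and `ToyAction` are hypothetical names made up for this illustration, and the `RealD`/`RealF` aliases are redeclared locally as `double`/`float` so the block stands alone (an assumption about their underlying types). Grid's real `_FImpl`/`Gimpl` classes carry far more than a precision.

```cpp
#include <type_traits>
#include <vector>

namespace toy {  // hypothetical illustration, not Grid's real Impl classes

using RealD = double;  // assumed underlying types of the Grid typedefs
using RealF = float;

// An "Impl" policy bundles the choices a templated class depends on.
struct ToyImplD { using Real = RealD; };
struct ToyImplF { using Real = RealF; };

// A class templated on the policy picks up its precision from the Impl,
// so the body never mentions double or float directly.
template <class Impl>
class ToyAction {
 public:
  using Real = typename Impl::Real;

  // Sum of squares, accumulated in the Impl's precision.
  Real norm2(const std::vector<Real>& field) const {
    Real acc = 0;
    for (Real x : field) acc += x * x;
    return acc;
  }
};

// The precision is fixed at compile time by the template argument.
static_assert(std::is_same<ToyAction<ToyImplF>::Real, RealF>::value,
              "single-precision Impl selects RealF");
static_assert(std::is_same<ToyAction<ToyImplD>::Real, RealD>::value,
              "double-precision Impl selects RealD");

}  // namespace toy
```

Instantiating `toy::ToyAction<toy::ToyImplD>` versus `toy::ToyAction<toy::ToyImplF>` yields two separate classes from one source, which is why a build can limit the number of instantiations it compiles.
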