# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## What This Is

Grid is a data-parallel C++ library for lattice QCD. It provides SIMD-vectorised lattice containers, MPI-based domain decomposition, GPU acceleration (CUDA/HIP/SYCL), and a full suite of QCD algorithms including HMC.
## Build

Uses GNU Autotools. The bootstrap step only needs to run once (or after `configure.ac` changes).

```bash
./bootstrap.sh            # downloads Eigen 3.4.0, generates configure
mkdir build && cd build
../configure [options]
make -j$(nproc)
make check                # run root-level tests
make install
```
Key configure options:

| Option | Common values |
|--------|---------------|
| `--enable-simd=` | `AVX2`, `AVX512`, `KNL`, `A64FX`, `NEONv8`, `GPU` |
| `--enable-comms=` | `mpi-auto`, `mpi3-auto`, `none` |
| `--enable-accelerator=` | `cuda`, `hip`, `sycl` |
| `--enable-shm=` | `shmopen`, `hugetlbfs`, `nvlink` |
| `--enable-Nc=` | `3` (default), `2`, `4`, `5` |
| `--with-gmp=`, `--with-mpfr=`, `--with-fftw=`, `--with-lime=` | paths to libs |
| `--enable-hdf5`, `--enable-mkl`, `--enable-lapack` | optional features |

Platform recipes from `README.md`:

- **KNL**: `--enable-simd=KNL --enable-comms=mpi3-auto --enable-mkl`
- **Skylake/Haswell**: `--enable-simd=AVX512` or `AVX2` + `--enable-comms=mpi3-auto`
- **AMD EPYC**: `--enable-simd=AVX2 --enable-comms=mpi3`
- **A64FX (Fugaku)**: `--enable-simd=A64FX --enable-comms=mpi3 --enable-shm=shmget` (see `SVE_README.txt`)

Required external libs: GMP, MPFR, OpenSSL, zlib.
## Running Tests

```bash
# From build directory
make check                      # root-level tests (Test_simd, Test_cshift, etc.)
make -C tests/<subdir> tests    # build tests in a subdirectory
./tests/core/Test_simd          # run a single test binary directly
```

Test subdirectories and their focus: `core` (SIMD, stencil, comms), `solver` (CG, GMRES, eigensolvers), `hmc` (MD integrators), `forces` (fermion forces), `lanczos`, `IO`, `smearing`, `sp2n`, `debug`.
## Architecture

### Layer stack (bottom to top)

1. **SIMD layer** (`Grid/simd/`) — platform-specific intrinsics wrapped into `vRealF`, `vComplexD`, etc. The SIMD width and layout are compile-time constants controlled by `--enable-simd`.
2. **Tensor layer** (`Grid/tensors/`) — Lorentz/colour/spin tensor algebra built on top of SIMD types. `iMatrix`, `iVector`, `iScalar` templates compose into QCD types like `ColourMatrix`, `SpinColourVector`.
3. **Lattice layer** (`Grid/lattice/`) — `Lattice<T>` container: a site-local tensor replicated across a distributed Cartesian grid. All arithmetic is site-parallel and expression-template-fused (see the sketch after this list).
4. **Cartesian/comms layer** (`Grid/cartesian/`, `Grid/communicator/`) — `GridCartesian` holds the MPI topology and local/global geometry. `Grid/cshift/` implements nearest-neighbour halo exchange; `Grid/stencil/` is the optimised multi-hop stencil used by Dirac operators.
5. **Algorithm layer** (`Grid/algorithms/`) — iterative solvers (CG, GMRES, BiCGSTAB, mixed-precision), eigensolvers (Lanczos, LAPACK), FFT, smearing.
6. **QCD layer** (`Grid/qcd/`) — gauge and fermion actions, HMC integrators, observables.
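
For orientation, here is a minimal sketch of how these layers combine in user code, written in the style of the programs under `tests/`. It is illustrative only, so check helper names and constructor signatures against the headers in this checkout:

```cpp
#include <Grid/Grid.h>

using namespace Grid;

int main(int argc, char **argv)
{
  Grid_init(&argc, &argv);                     // MPI, accelerator and SIMD setup

  // Cartesian/comms layer: distributed grid built from the command-line defaults
  GridCartesian grid(GridDefaultLatt(),
                     GridDefaultSimd(Nd, vComplex::Nsimd()),
                     GridDefaultMpi());

  GridParallelRNG pRNG(&grid);
  pRNG.SeedFixedIntegers({1, 2, 3, 4});

  // Lattice layer: site-local tensors (tensor layer) over SIMD types (SIMD layer)
  LatticeColourMatrix U(&grid), V(&grid), W(&grid);
  random(pRNG, U);
  random(pRNG, V);

  W = U * V + adj(V);                          // expression-template fused, site-parallel

  RealD n2 = norm2(W);                         // lattice-wide reduction, MPI-summed internally
  std::cout << GridLogMessage << "norm2(W) = " << n2 << std::endl;

  Grid_finalize();
  return 0;
}
```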
### QCD subsystem (`Grid/qcd/`)

- `action/fermion/` — Wilson, Clover, DWF (Mobius), Staggered, twisted-mass, G-parity variants (a usage sketch follows this list)
- `action/gauge/` — Wilson gauge, Symanzik, Iwasaki, DBW2, plaquette+rect
- `representations/` — Fundamental, Adjoint, Two-index, Sp(2n)
- `hmc/` — Leapfrog, OMF2/OMF4 integrators; pseudofermion refreshment; Metropolis accept/reject
- `smearing/` — APE, Stout, HEX, gradient flow
- `observables/` — Polyakov loop, plaquette, topological charge
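
A rough sketch of how a fermion action from `action/fermion/` combines with a solver from the algorithm layer, in the style of the programs in `tests/solver/`. The names used here (`WilsonFermionD`, `SU<Nc>::HotConfiguration`, `Zero()`) vary between Grid versions, so confirm them against the local headers:

```cpp
#include <Grid/Grid.h>

using namespace Grid;

int main(int argc, char **argv)
{
  Grid_init(&argc, &argv);

  GridCartesian *FGrid = SpaceTimeGrid::makeFourDimGrid(
      GridDefaultLatt(), GridDefaultSimd(Nd, vComplex::Nsimd()), GridDefaultMpi());
  GridRedBlackCartesian *FrbGrid = SpaceTimeGrid::makeFourDimRedBlackGrid(FGrid);

  GridParallelRNG pRNG(FGrid);
  pRNG.SeedFixedIntegers({1, 2, 3, 4});

  LatticeGaugeField Umu(FGrid);
  SU<Nc>::HotConfiguration(pRNG, Umu);              // random gauge configuration

  RealD mass = 0.1;
  WilsonFermionD Dw(Umu, *FGrid, *FrbGrid, mass);   // Wilson Dirac operator

  LatticeFermion src(FGrid), sol(FGrid);
  gaussian(pRNG, src);
  sol = Zero();

  // Solve (Mdag M) sol = src with unpreconditioned CG from Grid/algorithms/
  MdagMLinearOperator<WilsonFermionD, LatticeFermion> HermOp(Dw);
  ConjugateGradient<LatticeFermion> CG(1.0e-8, 10000);
  CG(HermOp, src, sol);

  Grid_finalize();
  return 0;
}
```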
### GPU acceleration

GPU support is injected via macros (`accelerator_for`, `accelerator_for2dNB`), as sketched below. The SIMD types from `Grid/simd/` map to scalar types in GPU device code, while host code paths remain vectorised. Unified virtual memory is on by default (`--enable-unified=yes`); device-aware MPI (`--enable-accelerator-aware-mpi`) avoids device→host copies on transfers.
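
A simplified sketch of the site-parallel kernel pattern these macros support (illustrative only; the view and `coalescedRead`/`coalescedWrite` details should be checked against the real kernels under `Grid/lattice/`):

```cpp
#include <Grid/Grid.h>

using namespace Grid;

// Schematic axpy-style kernel: compiled as a CUDA/HIP/SYCL kernel when an
// accelerator is enabled, and as a threaded host loop otherwise.
template <class vobj>
void scale_into(Lattice<vobj> &w, RealD a, const Lattice<vobj> &x)
{
  autoView(w_v, w, AcceleratorWrite);   // device-accessible views of the lattices
  autoView(x_v, x, AcceleratorRead);

  accelerator_for(ss, w.Grid()->oSites(), vobj::Nsimd(), {
    // On the device each thread handles one SIMD lane of one outer site
    // (scalar view); on the host the full vector object is processed per site.
    coalescedWrite(w_v[ss], a * coalescedRead(x_v[ss]));
  });
}
```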
### Memory and I/O

- `Grid/allocator/` — aligned/NUMA-aware allocators; caching allocator via `--enable-alloc-cache`
- `Grid/parallelIO/` — distributed parallel reader/writer for ILDG (via LIME), SciDAC, and native binary formats
- `Grid/serialisation/` — text, binary, HDF5, XML/JSON serialisation of arbitrary Grid objects
### HMC applications

`HMC/` contains production-ready HMC driver programmes (e.g. `Mobius2p1f.cc`, `DWF_plus_DSDR_nf2plus1_Shamir_Gparity.cc`). These are built separately from the library tests.
## Key Conventions

- **C++17** is required throughout.
- Template structure: most classes are templated on `<FImpl>` (fermion impl) or `<Gimpl>` (gauge impl), which encode the representation and precision. Instantiation is controlled by `--enable-fermion-instantiations`.
- The `RealD`/`RealF`/`ComplexD`/`ComplexF` typedefs are used everywhere; avoid raw `double`/`float`.
- Logging uses the `Grid_log`, `Grid_error` macros (from `Grid/log/`); performance-critical paths use the `GRID_TRACE` / timer macros from `Grid/perfmon/`.
- Reductions across MPI ranks go through `GridBase::GlobalSum` / `GlobalMax`; never reduce with bare MPI calls inside library code (see the sketch after this list).
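
A small illustrative sketch of the typedef and reduction conventions (the helpers `my_rank_local_estimate` and `my_global_estimate` are hypothetical; only `RealD`, `GridBase` and `GlobalSum` come from Grid):

```cpp
#include <Grid/Grid.h>

using namespace Grid;

// Hypothetical per-rank partial result; stands in for any local contribution.
RealD my_rank_local_estimate(void)
{
  return 1.0;
}

// Library-style global reduction: RealD (not raw double) and GridBase::GlobalSum.
RealD my_global_estimate(GridBase *grid)
{
  RealD S = my_rank_local_estimate();
  grid->GlobalSum(S);   // in-place sum across MPI ranks; never call MPI_Allreduce directly
  return S;
}
```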