
Rewrite lattice GPU reduction to use CUB, hipCUB, and SYCL reduction

Replace hand-rolled shared-memory reduction kernels (reduceBlock/reduceBlocks/
reduceKernel) and the global device variable retirementCount with a unified
CUB/hipCUB DeviceReduce::Reduce path for CUDA/HIP and sycl::reduction for SYCL.
No small/large split is needed: both CUB and sycl::reduction handle arbitrary
object sizes internally.

Old implementations preserved as sum_gpu_old / sumD_gpu_old etc. in the
original files for regression testing on GPU hardware.

Also add CLAUDE.md with build, test, and architecture guidance.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Peter Boyle committed 2026-05-15 13:41:56 -04:00
parent c6c2834e03, commit 969b0a3922
5 changed files with 306 additions and 21 deletions
CLAUDE.md (new file)  +98
@@ -0,0 +1,98 @@
# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## What This Is
Grid is a data-parallel C++ library for lattice QCD. It provides SIMD-vectorised lattice containers, MPI-based domain decomposition, GPU acceleration (CUDA/HIP/SYCL), and a full suite of QCD algorithms including HMC.
## Build
Uses GNU Autotools. The bootstrap step only needs to run once (or after `configure.ac` changes).
```bash
./bootstrap.sh # downloads Eigen 3.4.0, generates configure
mkdir build && cd build
../configure [options]
make -j$(nproc)
make check # run root-level tests
make install
```
Key configure options:
| Option | Common values |
|--------|---------------|
| `--enable-simd=` | `AVX2`, `AVX512`, `KNL`, `A64FX`, `NEONv8`, `GPU` |
| `--enable-comms=` | `mpi-auto`, `mpi3-auto`, `none` |
| `--enable-accelerator=` | `cuda`, `hip`, `sycl` |
| `--enable-shm=` | `shmopen`, `hugetlbfs`, `nvlink` |
| `--enable-Nc=` | `3` (default), `2`, `4`, `5` |
| `--with-gmp=`, `--with-mpfr=`, `--with-fftw=`, `--with-lime=` | paths to libs |
| `--enable-hdf5`, `--enable-mkl`, `--enable-lapack` | optional features |
Platform recipes from `README.md`:
- **KNL**: `--enable-simd=KNL --enable-comms=mpi3-auto --enable-mkl`
- **Skylake/Haswell**: `--enable-simd=AVX512` or `AVX2` + `--enable-comms=mpi3-auto`
- **AMD EPYC**: `--enable-simd=AVX2 --enable-comms=mpi3`
- **A64FX (Fugaku)**: `--enable-simd=A64FX --enable-comms=mpi3 --enable-shm=shmget` (see `SVE_README.txt`)
Required external libs: GMP, MPFR, OpenSSL, zlib.
## Running Tests
```bash
# From build directory
make check # root-level tests (Test_simd, Test_cshift, etc.)
make -C tests/<subdir> tests # build tests in a subdirectory
./tests/core/Test_simd # run a single test binary directly
```
Test subdirectories and their focus: `core` (SIMD, stencil, comms), `solver` (CG, GMRES, eigensolvers), `hmc` (MD integrators), `forces` (fermion forces), `lanczos`, `IO`, `smearing`, `sp2n`, `debug`.
## Architecture
### Layer stack (bottom to top)
1. **SIMD layer** (`Grid/simd/`) — platform-specific intrinsics wrapped into `vRealF`, `vComplexD`, etc. The SIMD width and layout are compile-time constants controlled by `--enable-simd`.
2. **Tensor layer** (`Grid/tensors/`) — Lorentz/colour/spin tensor algebra built on top of SIMD types. `iMatrix`, `iVector`, `iScalar` templates compose into QCD types like `ColourMatrix`, `SpinColourVector`.
3. **Lattice layer** (`Grid/lattice/`) — `Lattice<T>` container: a site-local tensor replicated across a distributed Cartesian grid. All arithmetic is site-parallel and expression-template-fused.
4. **Cartesian/comms layer** (`Grid/cartesian/`, `Grid/communicator/`) — `GridCartesian` holds the MPI topology and local/global geometry. `Grid/cshift/` implements nearest-neighbour halo exchange; `Grid/stencil/` is the optimised multi-hop stencil used by Dirac operators.
5. **Algorithm layer** (`Grid/algorithms/`) — iterative solvers (CG, GMRES, BiCGSTAB, mixed-precision), eigensolvers (Lanczos, LAPACK), FFT, smearing.
6. **QCD layer** (`Grid/qcd/`) — gauge and fermion actions, HMC integrators, observables.
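A minimal sketch of how these layers compose in user code (hedged: assumes the standard initialisation helpers `Grid_init`, `GridDefault*`, and the `SpaceTimeGrid` factory used throughout `tests/`; consult `tests/core` for canonical usage):
```cpp
#include <Grid/Grid.h>
using namespace Grid;

int main(int argc, char **argv) {
  Grid_init(&argc, &argv);

  // Cartesian/comms layer: distributed 4-d grid with the default
  // lattice size, SIMD layout, and MPI decomposition.
  GridCartesian *grid = SpaceTimeGrid::makeFourDimGrid(
      GridDefaultLatt(), GridDefaultSimd(Nd, vComplex::Nsimd()), GridDefaultMpi());

  // Lattice layer over tensor-layer types: one ColourMatrix per site.
  GridParallelRNG rng(grid);
  rng.SeedFixedIntegers(std::vector<int>{1, 2, 3, 4});
  LatticeColourMatrix x(grid), y(grid), z(grid);
  random(rng, x);
  random(rng, y);

  // Site-parallel, expression-template-fused arithmetic.
  z = x * y + adj(x);

  // norm2 reduces over sites (and across MPI ranks via GlobalSum).
  std::cout << GridLogMessage << "norm2(z) = " << norm2(z) << std::endl;

  Grid_finalize();
  return 0;
}
```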
### QCD subsystem (`Grid/qcd/`)
- `action/fermion/` — Wilson, Clover, DWF (Mobius), Staggered, twisted-mass, G-parity variants
- `action/gauge/` — Wilson gauge, Symanzik, Iwasaki, DBW2, plaquette+rect
- `representations/` — Fundamental, Adjoint, Two-index, Sp(2n)
- `hmc/` — Leapfrog, OMF2/OMF4 integrators; pseudofermion refreshment; Metropolis accept/reject
- `smearing/` — APE, Stout, HEX, gradient flow
- `observables/` — Polyakov loop, plaquette, topological charge
### GPU acceleration
GPU support is injected via macros (`accelerator_for`, `accelerator_for2dNB`). In device code the `Grid/simd/` SIMD types collapse to scalar types; host code paths remain vectorised. Unified virtual memory is on by default (`--enable-unified=yes`); device-aware MPI (`--enable-accelerator-aware-mpi`) avoids device→host copies on transfers.
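A hedged sketch of the `accelerator_for` pattern, mirroring its use in this commit's new reduction header (`deviceVector` and `accelerator_for` are the real Grid primitives; the function itself is hypothetical):
```cpp
#include <Grid/Grid.h>
using namespace Grid;

// Hypothetical example: scale a device-resident array in place.
// accelerator_for(iterator, range, nsimd, body) compiles to a CUDA/HIP/SYCL
// kernel launch under --enable-accelerator, and to a host loop otherwise.
// nsimd is 1 here because the body operates on scalar objects.
void scale_in_place(deviceVector<RealD> &data, RealD alpha) {
  uint64_t n = data.size();
  RealD *p = &data[0];
  accelerator_for(ss, n, 1, {
    p[ss] = alpha * p[ss];
  });
}
```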
### Memory and I/O
- `Grid/allocator/` — aligned/NUMA-aware allocators; caching allocator via `--enable-alloc-cache`
- `Grid/parallelIO/` — distributed parallel reader/writer for ILDG (via LIME), SciDAC, and native binary formats
- `Grid/serialisation/` — text, binary, HDF5, XML/JSON serialisation of arbitrary Grid objects
### HMC applications
`HMC/` contains production-ready HMC driver programmes (e.g. `Mobius2p1f.cc`, `DWF_plus_DSDR_nf2plus1_Shamir_Gparity.cc`). These are built separately from the library tests.
## Key Conventions
- **C++17** is required throughout.
- Template structure: most classes are templated on `<_FImpl>` (fermion impl) or `<Gimpl>` (gauge impl), which encode the representation and precision. Instantiation is controlled by `--enable-fermion-instantiations`.
- The `RealD`/`RealF`/`ComplexD`/`ComplexF` typedefs are used everywhere; avoid raw `double`/`float`.
- Logging uses `Grid_log`, `Grid_error` macros (from `Grid/log/`); performance-critical paths use the `GRID_TRACE` / timer macros from `Grid/perfmon/`.
- Reductions across MPI ranks go through `GridBase::GlobalSum` / `GlobalMax`; never reduce with bare MPI calls inside library code.
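A minimal sketch of the reduction convention from the last bullet (`GridBase::GlobalSum` is the real method; the helper and its inputs are hypothetical):
```cpp
#include <Grid/Grid.h>
using namespace Grid;

// Hypothetical helper: combine per-rank partial sums into a global value.
RealD global_total(GridBase *grid, const std::vector<RealD> &partials) {
  RealD total = 0.0;
  for (auto v : partials) total += v; // rank-local accumulation
  grid->GlobalSum(total);             // MPI reduction via the Grid communicator,
  return total;                       // never a bare MPI_Allreduce in library code
}
```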
Grid/lattice/Lattice_reduction.h  +3
@@ -31,6 +31,9 @@ Author: Christoph Lehner <christoph@lhnr.de>
 #if defined(GRID_SYCL)
 #include <Grid/lattice/Lattice_reduction_sycl.h>
 #endif
+#if defined(GRID_CUDA)||defined(GRID_HIP)||defined(GRID_SYCL)
+#include <Grid/lattice/Lattice_reduction_gpu_cub.h>
+#endif
 #include <Grid/lattice/Lattice_slicesum_core.h>
 NAMESPACE_BEGIN(Grid);
Grid/lattice/Lattice_reduction_gpu.h  +10 -10
@@ -198,7 +198,7 @@ __global__ void reduceKernel(const vobj *lat, sobj *buffer, Iterator n) {
 // Possibly promote to double and sum
 /////////////////////////////////////////////////////////////////////////////////////////////////////////
 template <class vobj>
-inline typename vobj::scalar_objectD sumD_gpu_small(const vobj *lat, Integer osites)
+inline typename vobj::scalar_objectD sumD_gpu_small_old(const vobj *lat, Integer osites)
 {
   typedef typename vobj::scalar_objectD sobj;
   typedef decltype(lat) Iterator;
@@ -224,7 +224,7 @@ inline typename vobj::scalar_objectD sumD_gpu_small(const vobj *lat, Integer osi
 }
 template <class vobj>
-inline typename vobj::scalar_objectD sumD_gpu_large(const vobj *lat, Integer osites)
+inline typename vobj::scalar_objectD sumD_gpu_large_old(const vobj *lat, Integer osites)
 {
   typedef typename vobj::vector_type vector;
   typedef typename vobj::scalar_typeD scalarD;
@@ -244,13 +244,13 @@ inline typename vobj::scalar_objectD sumD_gpu_large(const vobj *lat, Integer osi
       buf[ss] = dat[ss*words+w];
     });
-    ret_p[w] = sumD_gpu_small(tbuf,osites);
+    ret_p[w] = sumD_gpu_small_old(tbuf,osites);
   }
   return ret;
 }
 template <class vobj>
-inline typename vobj::scalar_objectD sumD_gpu(const vobj *lat, Integer osites)
+inline typename vobj::scalar_objectD sumD_gpu_old(const vobj *lat, Integer osites)
 {
   typedef typename vobj::scalar_objectD sobj;
   sobj ret;
@@ -261,9 +261,9 @@ inline typename vobj::scalar_objectD sumD_gpu(const vobj *lat, Integer osites)
   int ok = getNumBlocksAndThreads(size, sizeof(sobj), numThreads, numBlocks);
   if ( ok ) {
-    ret = sumD_gpu_small(lat,osites);
+    ret = sumD_gpu_small_old(lat,osites);
   } else {
-    ret = sumD_gpu_large(lat,osites);
+    ret = sumD_gpu_large_old(lat,osites);
   }
   return ret;
 }
@@ -272,20 +272,20 @@ inline typename vobj::scalar_objectD sumD_gpu(const vobj *lat, Integer osites)
 // Return as same precision as input performing reduction in double precision though
 /////////////////////////////////////////////////////////////////////////////////////////////////////////
 template <class vobj>
-inline typename vobj::scalar_object sum_gpu(const vobj *lat, Integer osites)
+inline typename vobj::scalar_object sum_gpu_old(const vobj *lat, Integer osites)
 {
   typedef typename vobj::scalar_object sobj;
   sobj result;
-  result = sumD_gpu(lat,osites);
+  result = sumD_gpu_old(lat,osites);
   return result;
 }
 template <class vobj>
-inline typename vobj::scalar_object sum_gpu_large(const vobj *lat, Integer osites)
+inline typename vobj::scalar_object sum_gpu_large_old(const vobj *lat, Integer osites)
 {
   typedef typename vobj::scalar_object sobj;
   sobj result;
-  result = sumD_gpu_large(lat,osites);
+  result = sumD_gpu_large_old(lat,osites);
   return result;
 }
Grid/lattice/Lattice_reduction_gpu_cub.h (new file)  +184
@@ -0,0 +1,184 @@
/*************************************************************************************
Grid physics library, www.github.com/paboyle/Grid
Source file: ./Grid/lattice/Lattice_reduction_gpu_cub.h
Copyright (C) 2015-2024
Author: Peter Boyle <paboyle@ph.ed.ac.uk>
This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License along
with this program; if not, write to the Free Software Foundation, Inc.,
51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
See the full license in the file "LICENSE" in the top level distribution directory
*************************************************************************************/
/* END LEGAL */
#pragma once
#if defined(GRID_CUDA)
#include <cub/cub.cuh>
#define gpucub cub
#define gpuError_t cudaError_t
#define gpuSuccess cudaSuccess
#elif defined(GRID_HIP)
#include <hipcub/hipcub.hpp>
#define gpucub hipcub
#define gpuError_t hipError_t
#define gpuSuccess hipSuccess
#endif
NAMESPACE_BEGIN(Grid);
/////////////////////////////////////////////////////////////////////////////////////////////////////////
// Unified lattice reduction using CUB (CUDA/HIP) and sycl::reduction (SYCL).
//
// Strategy: one accelerator_for pass per site to extract SIMD lanes and promote to sobjD,
// then a single library reduce over the sobjD array. No small/large split is needed:
// CUB DeviceReduce::Reduce and sycl::reduction both handle arbitrary object sizes by
// tuning block occupancy internally.
/////////////////////////////////////////////////////////////////////////////////////////////////////////
#if defined(GRID_CUDA) || defined(GRID_HIP)
template<class vobj>
inline typename vobj::scalar_objectD sumD_gpu(const vobj *lat, Integer osites)
{
  typedef typename vobj::scalar_object  sobj;
  typedef typename vobj::scalar_objectD sobjD;

  // Per-site: sum SIMD lanes (Reduce) and promote to double precision.
  deviceVector<sobjD> per_site(osites);
  sobjD *per_site_p = &per_site[0];
  accelerator_for(ss, osites, 1, {
    sobj tmp = Reduce(lat[ss]);
    sobjD tmpD; tmpD = tmp;
    per_site_p[ss] = tmpD;
  });

  // CUB global reduction over the sobjD array.
  sobjD zero; zeroit(zero);
  sobjD *d_out = static_cast<sobjD *>(acceleratorAllocDevice(sizeof(sobjD)));
  void  *d_temp = nullptr;
  size_t temp_bytes = 0;
  gpuError_t gpuErr;
  gpuErr = gpucub::DeviceReduce::Reduce(d_temp, temp_bytes, per_site_p, d_out,
                                        (int)osites, gpucub::Sum(), zero, computeStream);
  if (gpuErr != gpuSuccess) {
    std::cout << GridLogError << "Lattice_reduction_gpu_cub.h: DeviceReduce size query failed: "
              << gpuErr << std::endl;
    exit(EXIT_FAILURE);
  }
  d_temp = acceleratorAllocDevice(temp_bytes);
  gpuErr = gpucub::DeviceReduce::Reduce(d_temp, temp_bytes, per_site_p, d_out,
                                        (int)osites, gpucub::Sum(), zero, computeStream);
  if (gpuErr != gpuSuccess) {
    std::cout << GridLogError << "Lattice_reduction_gpu_cub.h: DeviceReduce failed: "
              << gpuErr << std::endl;
    exit(EXIT_FAILURE);
  }
  accelerator_barrier();

  sobjD result;
  acceleratorCopyFromDevice(d_out, &result, sizeof(sobjD));
  acceleratorFreeDevice(d_temp);
  acceleratorFreeDevice(d_out);
  return result;
}

// sumD_gpu_small and sumD_gpu_large are preserved as aliases for API compatibility.
template<class vobj>
inline typename vobj::scalar_objectD sumD_gpu_small(const vobj *lat, Integer osites)
{
  return sumD_gpu(lat, osites);
}
template<class vobj>
inline typename vobj::scalar_objectD sumD_gpu_large(const vobj *lat, Integer osites)
{
  return sumD_gpu(lat, osites);
}
template<class vobj>
inline typename vobj::scalar_object sum_gpu(const vobj *lat, Integer osites)
{
  typedef typename vobj::scalar_object sobj;
  sobj result;
  result = sumD_gpu(lat, osites);
  return result;
}
template<class vobj>
inline typename vobj::scalar_object sum_gpu_large(const vobj *lat, Integer osites)
{
  return sum_gpu(lat, osites);
}
#endif // GRID_CUDA || GRID_HIP
#if defined(GRID_SYCL)
// Accumulates in sobjD throughout, fixing the precision bug in the original
// Lattice_reduction_sycl.h which accumulated in sobj then converted at the end.
template<class vobj>
inline typename vobj::scalar_objectD sumD_gpu(const vobj *lat, Integer osites)
{
  typedef typename vobj::scalar_object  sobj;
  typedef typename vobj::scalar_objectD sobjD;
  sobjD identity; zeroit(identity);
  sobjD ret;      zeroit(ret);
  {
    sycl::buffer<sobjD, 1> abuff(&ret, {1});
    theGridAccelerator->submit([&](sycl::handler &cgh) {
      auto Reduction = sycl::reduction(abuff, cgh, identity, std::plus<>());
      cgh.parallel_for(sycl::range<1>{(size_t)osites},
                       Reduction,
                       [=](sycl::id<1> item, auto &sum) {
                         sobj s = Reduce(lat[item[0]]);
                         sobjD sd; sd = s;
                         sum += sd;
                       });
    });
  }
  return ret;
}
template<class vobj>
inline typename vobj::scalar_objectD sumD_gpu_small(const vobj *lat, Integer osites)
{
  return sumD_gpu(lat, osites);
}
template<class vobj>
inline typename vobj::scalar_objectD sumD_gpu_large(const vobj *lat, Integer osites)
{
  return sumD_gpu(lat, osites);
}
template<class vobj>
inline typename vobj::scalar_object sum_gpu(const vobj *lat, Integer osites)
{
  typedef typename vobj::scalar_object sobj;
  sobj result;
  result = sumD_gpu(lat, osites);
  return result;
}
template<class vobj>
inline typename vobj::scalar_object sum_gpu_large(const vobj *lat, Integer osites)
{
  return sum_gpu(lat, osites);
}
#endif // GRID_SYCL
NAMESPACE_END(Grid);
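For context, a simplified sketch (not part of this diff) of how these routines are reached from user code: the `sum()` entry point in `Lattice_reduction.h` reduces on the device, then folds the result across MPI ranks.
```cpp
// Simplified sketch of the host-side entry point in Lattice_reduction.h:
// expose the lattice's device view, reduce on the accelerator via the
// sum_gpu path above, then combine per-rank results with GlobalSum.
template<class vobj>
inline typename vobj::scalar_object sum(const Lattice<vobj> &arg)
{
  autoView(arg_v, arg, AcceleratorRead);   // device-readable view
  Integer osites = arg.Grid()->oSites();   // outer (SIMD-blocked) site count
  typename vobj::scalar_object ssum = sum_gpu(&arg_v[0], osites);
  arg.Grid()->GlobalSum(ssum);             // reduction across MPI ranks
  return ssum;
}
```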
Grid/lattice/Lattice_reduction_sycl.h  +11 -11
@@ -6,7 +6,7 @@ NAMESPACE_BEGIN(Grid);
 template <class vobj>
-inline typename vobj::scalar_objectD sumD_gpu_tensor(const vobj *lat, Integer osites)
+inline typename vobj::scalar_objectD sumD_gpu_tensor_old(const vobj *lat, Integer osites)
 {
   typedef typename vobj::scalar_object sobj;
   typedef typename vobj::scalar_objectD sobjD;
@@ -31,40 +31,40 @@ inline typename vobj::scalar_objectD sumD_gpu_tensor(const vobj *lat, Integer os
 }
 template <class vobj>
-inline typename vobj::scalar_objectD sumD_gpu_large(const vobj *lat, Integer osites)
+inline typename vobj::scalar_objectD sumD_gpu_large_old(const vobj *lat, Integer osites)
 {
-  return sumD_gpu_tensor(lat,osites);
+  return sumD_gpu_tensor_old(lat,osites);
 }
 template <class vobj>
-inline typename vobj::scalar_objectD sumD_gpu_small(const vobj *lat, Integer osites)
+inline typename vobj::scalar_objectD sumD_gpu_small_old(const vobj *lat, Integer osites)
 {
-  return sumD_gpu_large(lat,osites);
+  return sumD_gpu_large_old(lat,osites);
 }
 template <class vobj>
-inline typename vobj::scalar_objectD sumD_gpu(const vobj *lat, Integer osites)
+inline typename vobj::scalar_objectD sumD_gpu_old(const vobj *lat, Integer osites)
 {
-  return sumD_gpu_large(lat,osites);
+  return sumD_gpu_large_old(lat,osites);
 }
 /////////////////////////////////////////////////////////////////////////////////////////////////////////
 // Return as same precision as input performing reduction in double precision though
 /////////////////////////////////////////////////////////////////////////////////////////////////////////
 template <class vobj>
-inline typename vobj::scalar_object sum_gpu(const vobj *lat, Integer osites)
+inline typename vobj::scalar_object sum_gpu_old(const vobj *lat, Integer osites)
 {
   typedef typename vobj::scalar_object sobj;
   sobj result;
-  result = sumD_gpu(lat,osites);
+  result = sumD_gpu_old(lat,osites);
   return result;
 }
 template <class vobj>
-inline typename vobj::scalar_object sum_gpu_large(const vobj *lat, Integer osites)
+inline typename vobj::scalar_object sum_gpu_large_old(const vobj *lat, Integer osites)
 {
   typedef typename vobj::scalar_object sobj;
   sobj result;
-  result = sumD_gpu_large(lat,osites);
+  result = sumD_gpu_large_old(lat,osites);
   return result;
 }