
Rewrite lattice GPU reduction to use CUB, hipCUB, and SYCL reduction

Replace hand-rolled shared-memory reduction kernels (reduceBlock/reduceBlocks/
reduceKernel) and the global device variable retirementCount with a unified
CUB/hipCUB DeviceReduce::Reduce path for CUDA/HIP and sycl::reduction for SYCL.
No small/large split is needed: both CUB and sycl::reduction handle arbitrary
object sizes internally.

Old implementations preserved as sum_gpu_old / sumD_gpu_old etc. in the
original files for regression testing on GPU hardware.

Also add CLAUDE.md with build, test, and architecture guidance.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Peter Boyle committed 2026-05-15 13:41:56 -04:00
parent c6c2834e03, commit 969b0a3922
5 changed files with 306 additions and 21 deletions
CLAUDE.md (new file)  +98
@@ -0,0 +1,98 @@
# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## What This Is
Grid is a data-parallel C++ library for lattice QCD. It provides SIMD-vectorised lattice containers, MPI-based domain decomposition, GPU acceleration (CUDA/HIP/SYCL), and a full suite of QCD algorithms including HMC.
## Build
Uses GNU Autotools. The bootstrap step only needs to run once (or after `configure.ac` changes).
```bash
./bootstrap.sh # downloads Eigen 3.4.0, generates configure
mkdir build && cd build
../configure [options]
make -j$(nproc)
make check # run root-level tests
make install
```
Key configure options:
| Option | Common values |
|--------|---------------|
| `--enable-simd=` | `AVX2`, `AVX512`, `KNL`, `A64FX`, `NEONv8`, `GPU` |
| `--enable-comms=` | `mpi-auto`, `mpi3-auto`, `none` |
| `--enable-accelerator=` | `cuda`, `hip`, `sycl` |
| `--enable-shm=` | `shmopen`, `hugetlbfs`, `nvlink` |
| `--enable-Nc=` | `3` (default), `2`, `4`, `5` |
| `--with-gmp=`, `--with-mpfr=`, `--with-fftw=`, `--with-lime=` | paths to libs |
| `--enable-hdf5`, `--enable-mkl`, `--enable-lapack` | optional features |
Platform recipes from `README.md`:
- **KNL**: `--enable-simd=KNL --enable-comms=mpi3-auto --enable-mkl`
- **Skylake/Haswell**: `--enable-simd=AVX512` or `AVX2` + `--enable-comms=mpi3-auto`
- **AMD EPYC**: `--enable-simd=AVX2 --enable-comms=mpi3`
- **A64FX (Fugaku)**: `--enable-simd=A64FX --enable-comms=mpi3 --enable-shm=shmget` (see `SVE_README.txt`)
Required external libs: GMP, MPFR, OpenSSL, zlib.
## Running Tests
```bash
# From build directory
make check # root-level tests (Test_simd, Test_cshift, etc.)
make -C tests/<subdir> tests # build tests in a subdirectory
./tests/core/Test_simd # run a single test binary directly
```
Test subdirectories and their focus: `core` (SIMD, stencil, comms), `solver` (CG, GMRES, eigensolvers), `hmc` (MD integrators), `forces` (fermion forces), `lanczos`, `IO`, `smearing`, `sp2n`, `debug`.
## Architecture
### Layer stack (bottom to top)
1. **SIMD layer** (`Grid/simd/`) — platform-specific intrinsics wrapped into `vRealF`, `vComplexD`, etc. The SIMD width and layout are compile-time constants controlled by `--enable-simd`.
2. **Tensor layer** (`Grid/tensors/`) — Lorentz/colour/spin tensor algebra built on top of SIMD types. `iMatrix`, `iVector`, `iScalar` templates compose into QCD types like `ColourMatrix`, `SpinColourVector`.
3. **Lattice layer** (`Grid/lattice/`) — `Lattice<T>` container: a site-local tensor replicated across a distributed Cartesian grid. All arithmetic is site-parallel and expression-template-fused.
4. **Cartesian/comms layer** (`Grid/cartesian/`, `Grid/communicator/`) — `GridCartesian` holds the MPI topology and local/global geometry. `Grid/cshift/` implements nearest-neighbour halo exchange; `Grid/stencil/` is the optimised multi-hop stencil used by Dirac operators.
5. **Algorithm layer** (`Grid/algorithms/`) — iterative solvers (CG, GMRES, BiCGSTAB, mixed-precision), eigensolvers (Lanczos, LAPACK), FFT, smearing.
6. **QCD layer** (`Grid/qcd/`) — gauge and fermion actions, HMC integrators, observables.
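A minimal sketch of how these layers compose in user code (hedged: assumes the standard initialisation helpers `Grid_init`, `GridDefault*`, and the `SpaceTimeGrid` factory used throughout `tests/`; consult `tests/core` for canonical usage):
```cpp
#include <Grid/Grid.h>
using namespace Grid;

int main(int argc, char **argv) {
  Grid_init(&argc, &argv);

  // Cartesian/comms layer: distributed 4-d grid with the default
  // lattice size, SIMD layout, and MPI decomposition.
  GridCartesian *grid = SpaceTimeGrid::makeFourDimGrid(
      GridDefaultLatt(), GridDefaultSimd(Nd, vComplex::Nsimd()), GridDefaultMpi());

  // Lattice layer over tensor-layer types: one ColourMatrix per site.
  GridParallelRNG rng(grid);
  rng.SeedFixedIntegers(std::vector<int>{1, 2, 3, 4});
  LatticeColourMatrix x(grid), y(grid), z(grid);
  random(rng, x);
  random(rng, y);

  // Site-parallel, expression-template-fused arithmetic.
  z = x * y + adj(x);

  // norm2 reduces over sites (and across MPI ranks via GlobalSum).
  std::cout << GridLogMessage << "norm2(z) = " << norm2(z) << std::endl;

  Grid_finalize();
  return 0;
}
```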
### QCD subsystem (`Grid/qcd/`)
- `action/fermion/` — Wilson, Clover, DWF (Mobius), Staggered, twisted-mass, G-parity variants
- `action/gauge/` — Wilson gauge, Symanzik, Iwasaki, DBW2, plaquette+rect
- `representations/` — Fundamental, Adjoint, Two-index, Sp(2n)
- `hmc/` — Leapfrog, OMF2/OMF4 integrators; pseudofermion refreshment; Metropolis accept/reject
- `smearing/` — APE, Stout, HEX, gradient flow
- `observables/` — Polyakov loop, plaquette, topological charge
### GPU acceleration
GPU support is injected via macros (`accelerator_for`, `accelerator_for2dNB`). In device code the `Grid/simd/` SIMD types collapse to scalar types; host code paths remain vectorised. Unified virtual memory is on by default (`--enable-unified=yes`); device-aware MPI (`--enable-accelerator-aware-mpi`) avoids device→host copies on transfers.
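A hedged sketch of the `accelerator_for` pattern, mirroring its use in this commit's new reduction header (`deviceVector` and `accelerator_for` are the real Grid primitives; the function itself is hypothetical):
```cpp
#include <Grid/Grid.h>
using namespace Grid;

// Hypothetical example: scale a device-resident array in place.
// accelerator_for(iterator, range, nsimd, body) compiles to a CUDA/HIP/SYCL
// kernel launch under --enable-accelerator, and to a host loop otherwise.
// nsimd is 1 here because the body operates on scalar objects.
void scale_in_place(deviceVector<RealD> &data, RealD alpha) {
  uint64_t n = data.size();
  RealD *p = &data[0];
  accelerator_for(ss, n, 1, {
    p[ss] = alpha * p[ss];
  });
}
```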
### Memory and I/O
- `Grid/allocator/` — aligned/NUMA-aware allocators; caching allocator via `--enable-alloc-cache`
- `Grid/parallelIO/` — distributed parallel reader/writer for ILDG (via LIME), SciDAC, and native binary formats
- `Grid/serialisation/` — text, binary, HDF5, XML/JSON serialisation of arbitrary Grid objects
### HMC applications
`HMC/` contains production-ready HMC driver programmes (e.g. `Mobius2p1f.cc`, `DWF_plus_DSDR_nf2plus1_Shamir_Gparity.cc`). These are built separately from the library tests.
## Key Conventions
- **C++17** is required throughout.
- Template structure: most classes are templated on `<_FImpl>` (fermion impl) or `<Gimpl>` (gauge impl), which encode the representation and precision. Instantiation is controlled by `--enable-fermion-instantiations`.
- The `RealD`/`RealF`/`ComplexD`/`ComplexF` typedefs are used everywhere; avoid raw `double`/`float`.
- Logging uses `Grid_log`, `Grid_error` macros (from `Grid/log/`); performance-critical paths use the `GRID_TRACE` / timer macros from `Grid/perfmon/`.
- Reductions across MPI ranks go through `GridBase::GlobalSum` / `GlobalMax`; never reduce with bare MPI calls inside library code.
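A minimal sketch of the reduction convention from the last bullet (`GridBase::GlobalSum` is the real method; the helper and its inputs are hypothetical):
```cpp
#include <Grid/Grid.h>
using namespace Grid;

// Hypothetical helper: combine per-rank partial sums into a global value.
RealD global_total(GridBase *grid, const std::vector<RealD> &partials) {
  RealD total = 0.0;
  for (auto v : partials) total += v; // rank-local accumulation
  grid->GlobalSum(total);             // MPI reduction via the Grid communicator,
  return total;                       // never a bare MPI_Allreduce in library code
}
```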
Grid/lattice/Lattice_reduction.h  +3
@@ -31,6 +31,9 @@ Author: Christoph Lehner <christoph@lhnr.de>
 #if defined(GRID_SYCL)
 #include <Grid/lattice/Lattice_reduction_sycl.h>
 #endif
+#if defined(GRID_CUDA)||defined(GRID_HIP)||defined(GRID_SYCL)
+#include <Grid/lattice/Lattice_reduction_gpu_cub.h>
+#endif
 #include <Grid/lattice/Lattice_slicesum_core.h>
 NAMESPACE_BEGIN(Grid);
Grid/lattice/Lattice_reduction_gpu.h  +10 -10
@@ -198,7 +198,7 @@ __global__ void reduceKernel(const vobj *lat, sobj *buffer, Iterator n) {
 // Possibly promote to double and sum
 /////////////////////////////////////////////////////////////////////////////////////////////////////////
 template <class vobj>
-inline typename vobj::scalar_objectD sumD_gpu_small(const vobj *lat, Integer osites)
+inline typename vobj::scalar_objectD sumD_gpu_small_old(const vobj *lat, Integer osites)
 {
   typedef typename vobj::scalar_objectD sobj;
   typedef decltype(lat) Iterator;
@@ -224,7 +224,7 @@ inline typename vobj::scalar_objectD sumD_gpu_small(const vobj *lat, Integer osi
 }
 template <class vobj>
-inline typename vobj::scalar_objectD sumD_gpu_large(const vobj *lat, Integer osites)
+inline typename vobj::scalar_objectD sumD_gpu_large_old(const vobj *lat, Integer osites)
 {
   typedef typename vobj::vector_type vector;
   typedef typename vobj::scalar_typeD scalarD;
@@ -244,13 +244,13 @@ inline typename vobj::scalar_objectD sumD_gpu_large(const vobj *lat, Integer osi
       buf[ss] = dat[ss*words+w];
     });
-    ret_p[w] = sumD_gpu_small(tbuf,osites);
+    ret_p[w] = sumD_gpu_small_old(tbuf,osites);
   }
   return ret;
 }
 template <class vobj>
-inline typename vobj::scalar_objectD sumD_gpu(const vobj *lat, Integer osites)
+inline typename vobj::scalar_objectD sumD_gpu_old(const vobj *lat, Integer osites)
 {
   typedef typename vobj::scalar_objectD sobj;
   sobj ret;
@@ -261,9 +261,9 @@ inline typename vobj::scalar_objectD sumD_gpu(const vobj *lat, Integer osites)
   int ok = getNumBlocksAndThreads(size, sizeof(sobj), numThreads, numBlocks);
   if ( ok ) {
-    ret = sumD_gpu_small(lat,osites);
+    ret = sumD_gpu_small_old(lat,osites);
   } else {
-    ret = sumD_gpu_large(lat,osites);
+    ret = sumD_gpu_large_old(lat,osites);
   }
   return ret;
 }
@@ -272,20 +272,20 @@ inline typename vobj::scalar_objectD sumD_gpu(const vobj *lat, Integer osites)
 // Return as same precision as input performing reduction in double precision though
 /////////////////////////////////////////////////////////////////////////////////////////////////////////
 template <class vobj>
-inline typename vobj::scalar_object sum_gpu(const vobj *lat, Integer osites)
+inline typename vobj::scalar_object sum_gpu_old(const vobj *lat, Integer osites)
 {
   typedef typename vobj::scalar_object sobj;
   sobj result;
-  result = sumD_gpu(lat,osites);
+  result = sumD_gpu_old(lat,osites);
   return result;
 }
 template <class vobj>
-inline typename vobj::scalar_object sum_gpu_large(const vobj *lat, Integer osites)
+inline typename vobj::scalar_object sum_gpu_large_old(const vobj *lat, Integer osites)
 {
   typedef typename vobj::scalar_object sobj;
   sobj result;
-  result = sumD_gpu_large(lat,osites);
+  result = sumD_gpu_large_old(lat,osites);
   return result;
 }
Grid/lattice/Lattice_reduction_gpu_cub.h (new file)  +184
@@ -0,0 +1,184 @@
/*************************************************************************************
Grid physics library, www.github.com/paboyle/Grid
Source file: ./Grid/lattice/Lattice_reduction_gpu_cub.h
Copyright (C) 2015-2024
Author: Peter Boyle <paboyle@ph.ed.ac.uk>
This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License along
with this program; if not, write to the Free Software Foundation, Inc.,
51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
See the full license in the file "LICENSE" in the top level distribution directory
*************************************************************************************/
/* END LEGAL */
#pragma once
#if defined(GRID_CUDA)
#include <cub/cub.cuh>
#define gpucub cub
#define gpuError_t cudaError_t
#define gpuSuccess cudaSuccess
#elif defined(GRID_HIP)
#include <hipcub/hipcub.hpp>
#define gpucub hipcub
#define gpuError_t hipError_t
#define gpuSuccess hipSuccess
#endif
NAMESPACE_BEGIN(Grid);
/////////////////////////////////////////////////////////////////////////////////////////////////////////
// Unified lattice reduction using CUB (CUDA/HIP) and sycl::reduction (SYCL).
//
// Strategy: one accelerator_for pass per site to extract SIMD lanes and promote to sobjD,
// then a single library reduce over the sobjD array. No small/large split is needed:
// CUB DeviceReduce::Reduce and sycl::reduction both handle arbitrary object sizes by
// tuning block occupancy internally.
/////////////////////////////////////////////////////////////////////////////////////////////////////////
#if defined(GRID_CUDA) || defined(GRID_HIP)
template<class vobj>
inline typename vobj::scalar_objectD sumD_gpu(const vobj *lat, Integer osites)
{
  typedef typename vobj::scalar_object  sobj;
  typedef typename vobj::scalar_objectD sobjD;

  // Per-site: sum SIMD lanes (Reduce) and promote to double precision.
  deviceVector<sobjD> per_site(osites);
  sobjD *per_site_p = &per_site[0];
  accelerator_for(ss, osites, 1, {
    sobj tmp = Reduce(lat[ss]);
    sobjD tmpD; tmpD = tmp;
    per_site_p[ss] = tmpD;
  });

  // CUB global reduction over the sobjD array.
  sobjD zero; zeroit(zero);
  sobjD *d_out = static_cast<sobjD *>(acceleratorAllocDevice(sizeof(sobjD)));
  void  *d_temp = nullptr;
  size_t temp_bytes = 0;
  gpuError_t gpuErr;
  gpuErr = gpucub::DeviceReduce::Reduce(d_temp, temp_bytes, per_site_p, d_out,
                                        (int)osites, gpucub::Sum(), zero, computeStream);
  if (gpuErr != gpuSuccess) {
    std::cout << GridLogError << "Lattice_reduction_gpu_cub.h: DeviceReduce size query failed: "
              << gpuErr << std::endl;
    exit(EXIT_FAILURE);
  }
  d_temp = acceleratorAllocDevice(temp_bytes);
  gpuErr = gpucub::DeviceReduce::Reduce(d_temp, temp_bytes, per_site_p, d_out,
                                        (int)osites, gpucub::Sum(), zero, computeStream);
  if (gpuErr != gpuSuccess) {
    std::cout << GridLogError << "Lattice_reduction_gpu_cub.h: DeviceReduce failed: "
              << gpuErr << std::endl;
    exit(EXIT_FAILURE);
  }
  accelerator_barrier();

  sobjD result;
  acceleratorCopyFromDevice(d_out, &result, sizeof(sobjD));
  acceleratorFreeDevice(d_temp);
  acceleratorFreeDevice(d_out);
  return result;
}

// sumD_gpu_small and sumD_gpu_large are preserved as aliases for API compatibility.
template<class vobj>
inline typename vobj::scalar_objectD sumD_gpu_small(const vobj *lat, Integer osites)
{
  return sumD_gpu(lat, osites);
}
template<class vobj>
inline typename vobj::scalar_objectD sumD_gpu_large(const vobj *lat, Integer osites)
{
  return sumD_gpu(lat, osites);
}
template<class vobj>
inline typename vobj::scalar_object sum_gpu(const vobj *lat, Integer osites)
{
  typedef typename vobj::scalar_object sobj;
  sobj result;
  result = sumD_gpu(lat, osites);
  return result;
}
template<class vobj>
inline typename vobj::scalar_object sum_gpu_large(const vobj *lat, Integer osites)
{
  return sum_gpu(lat, osites);
}
#endif // GRID_CUDA || GRID_HIP
#if defined(GRID_SYCL)
// Accumulates in sobjD throughout, fixing the precision bug in the original
// Lattice_reduction_sycl.h which accumulated in sobj then converted at the end.
template<class vobj>
inline typename vobj::scalar_objectD sumD_gpu(const vobj *lat, Integer osites)
{
  typedef typename vobj::scalar_object  sobj;
  typedef typename vobj::scalar_objectD sobjD;
  sobjD identity; zeroit(identity);
  sobjD ret;      zeroit(ret);
  {
    sycl::buffer<sobjD, 1> abuff(&ret, {1});
    theGridAccelerator->submit([&](sycl::handler &cgh) {
      auto Reduction = sycl::reduction(abuff, cgh, identity, std::plus<>());
      cgh.parallel_for(sycl::range<1>{(size_t)osites},
                       Reduction,
                       [=](sycl::id<1> item, auto &sum) {
                         sobj s = Reduce(lat[item[0]]);
                         sobjD sd; sd = s;
                         sum += sd;
                       });
    });
  }
  return ret;
}
template<class vobj>
inline typename vobj::scalar_objectD sumD_gpu_small(const vobj *lat, Integer osites)
{
  return sumD_gpu(lat, osites);
}
template<class vobj>
inline typename vobj::scalar_objectD sumD_gpu_large(const vobj *lat, Integer osites)
{
  return sumD_gpu(lat, osites);
}
template<class vobj>
inline typename vobj::scalar_object sum_gpu(const vobj *lat, Integer osites)
{
  typedef typename vobj::scalar_object sobj;
  sobj result;
  result = sumD_gpu(lat, osites);
  return result;
}
template<class vobj>
inline typename vobj::scalar_object sum_gpu_large(const vobj *lat, Integer osites)
{
  return sum_gpu(lat, osites);
}
#endif // GRID_SYCL
NAMESPACE_END(Grid);
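For context, a simplified sketch (not part of this diff) of how these routines are reached from user code: the `sum()` entry point in `Lattice_reduction.h` reduces on the device, then folds the result across MPI ranks.
```cpp
// Simplified sketch of the host-side entry point in Lattice_reduction.h:
// expose the lattice's device view, reduce on the accelerator via the
// sum_gpu path above, then combine per-rank results with GlobalSum.
template<class vobj>
inline typename vobj::scalar_object sum(const Lattice<vobj> &arg)
{
  autoView(arg_v, arg, AcceleratorRead);   // device-readable view
  Integer osites = arg.Grid()->oSites();   // outer (SIMD-blocked) site count
  typename vobj::scalar_object ssum = sum_gpu(&arg_v[0], osites);
  arg.Grid()->GlobalSum(ssum);             // reduction across MPI ranks
  return ssum;
}
```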
Grid/lattice/Lattice_reduction_sycl.h  +11 -11
@@ -6,7 +6,7 @@ NAMESPACE_BEGIN(Grid);
 template <class vobj>
-inline typename vobj::scalar_objectD sumD_gpu_tensor(const vobj *lat, Integer osites)
+inline typename vobj::scalar_objectD sumD_gpu_tensor_old(const vobj *lat, Integer osites)
 {
   typedef typename vobj::scalar_object sobj;
   typedef typename vobj::scalar_objectD sobjD;
@@ -31,40 +31,40 @@ inline typename vobj::scalar_objectD sumD_gpu_tensor(const vobj *lat, Integer os
 }
 template <class vobj>
-inline typename vobj::scalar_objectD sumD_gpu_large(const vobj *lat, Integer osites)
+inline typename vobj::scalar_objectD sumD_gpu_large_old(const vobj *lat, Integer osites)
 {
-  return sumD_gpu_tensor(lat,osites);
+  return sumD_gpu_tensor_old(lat,osites);
 }
 template <class vobj>
-inline typename vobj::scalar_objectD sumD_gpu_small(const vobj *lat, Integer osites)
+inline typename vobj::scalar_objectD sumD_gpu_small_old(const vobj *lat, Integer osites)
 {
-  return sumD_gpu_large(lat,osites);
+  return sumD_gpu_large_old(lat,osites);
 }
 template <class vobj>
-inline typename vobj::scalar_objectD sumD_gpu(const vobj *lat, Integer osites)
+inline typename vobj::scalar_objectD sumD_gpu_old(const vobj *lat, Integer osites)
 {
-  return sumD_gpu_large(lat,osites);
+  return sumD_gpu_large_old(lat,osites);
 }
 /////////////////////////////////////////////////////////////////////////////////////////////////////////
 // Return as same precision as input performing reduction in double precision though
 /////////////////////////////////////////////////////////////////////////////////////////////////////////
 template <class vobj>
-inline typename vobj::scalar_object sum_gpu(const vobj *lat, Integer osites)
+inline typename vobj::scalar_object sum_gpu_old(const vobj *lat, Integer osites)
 {
   typedef typename vobj::scalar_object sobj;
   sobj result;
-  result = sumD_gpu(lat,osites);
+  result = sumD_gpu_old(lat,osites);
   return result;
 }
 template <class vobj>
-inline typename vobj::scalar_object sum_gpu_large(const vobj *lat, Integer osites)
+inline typename vobj::scalar_object sum_gpu_large_old(const vobj *lat, Integer osites)
 {
   typedef typename vobj::scalar_object sobj;
   sobj result;
-  result = sumD_gpu_large(lat,osites);
+  result = sumD_gpu_large_old(lat,osites);
   return result;
 }