From 969b0a3922296abfa426b92975a09d413e2e205e Mon Sep 17 00:00:00 2001 From: Peter Boyle Date: Fri, 15 May 2026 13:41:56 -0400 Subject: [PATCH 01/44] Rewrite lattice GPU reduction to use CUB, hipCUB, and SYCL reduction Replace hand-rolled shared-memory reduction kernels (reduceBlock/reduceBlocks/ reduceKernel) and the global device variable retirementCount with a unified CUB/hipCUB DeviceReduce::Reduce path for CUDA/HIP and sycl::reduction for SYCL. No small/large split is needed: both CUB and sycl::reduction handle arbitrary object sizes internally. Old implementations preserved as sum_gpu_old / sumD_gpu_old etc. in the original files for regression testing on GPU hardware. Also add CLAUDE.md with build, test, and architecture guidance. Co-Authored-By: Claude Sonnet 4.6 --- CLAUDE.md | 98 ++++++++++++ Grid/lattice/Lattice_reduction.h | 3 + Grid/lattice/Lattice_reduction_gpu.h | 20 +-- Grid/lattice/Lattice_reduction_gpu_cub.h | 184 +++++++++++++++++++++++ Grid/lattice/Lattice_reduction_sycl.h | 22 +-- 5 files changed, 306 insertions(+), 21 deletions(-) create mode 100644 CLAUDE.md create mode 100644 Grid/lattice/Lattice_reduction_gpu_cub.h diff --git a/CLAUDE.md b/CLAUDE.md new file mode 100644 index 00000000..4b6acf2a --- /dev/null +++ b/CLAUDE.md @@ -0,0 +1,98 @@ +# CLAUDE.md + +This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository. + +## What This Is + +Grid is a data-parallel C++ library for lattice QCD. It provides SIMD-vectorised lattice containers, MPI-based domain decomposition, GPU acceleration (CUDA/HIP/SYCL), and a full suite of QCD algorithms including HMC. + +## Build + +Uses GNU Autotools. The bootstrap step only needs to run once (or after `configure.ac` changes). + +```bash +./bootstrap.sh # downloads Eigen 3.4.0, generates configure +mkdir build && cd build +../configure [options] +make -j$(nproc) +make check # run root-level tests +make install +``` + +Key configure options: + +| Option | Common values | +|--------|---------------| +| `--enable-simd=` | `AVX2`, `AVX512`, `KNL`, `A64FX`, `NEONv8`, `GPU` | +| `--enable-comms=` | `mpi-auto`, `mpi3-auto`, `none` | +| `--enable-accelerator=` | `cuda`, `hip`, `sycl` | +| `--enable-shm=` | `shmopen`, `hugetlbfs`, `nvlink` | +| `--enable-Nc=` | `3` (default), `2`, `4`, `5` | +| `--with-gmp=`, `--with-mpfr=`, `--with-fftw=`, `--with-lime=` | paths to libs | +| `--enable-hdf5`, `--enable-mkl`, `--enable-lapack` | optional features | + +Platform recipes from `README.md`: +- **KNL**: `--enable-simd=KNL --enable-comms=mpi3-auto --enable-mkl` +- **Skylake/Haswell**: `--enable-simd=AVX512` or `AVX2` + `--enable-comms=mpi3-auto` +- **AMD EPYC**: `--enable-simd=AVX2 --enable-comms=mpi3` +- **A64FX (Fugaku)**: `--enable-simd=A64FX --enable-comms=mpi3 --enable-shm=shmget` (see `SVE_README.txt`) + +Required external libs: GMP, MPFR, OpenSSL, zlib. + +## Running Tests + +```bash +# From build directory +make check # root-level tests (Test_simd, Test_cshift, etc.) +make -C tests/ tests # build tests in a subdirectory +./tests/core/Test_simd # run a single test binary directly +``` + +Test subdirectories and their focus: `core` (SIMD, stencil, comms), `solver` (CG, GMRES, eigensolvers), `hmc` (MD integrators), `forces` (fermion forces), `lanczos`, `IO`, `smearing`, `sp2n`, `debug`. + +## Architecture + +### Layer stack (bottom to top) + +1. **SIMD layer** (`Grid/simd/`) — platform-specific intrinsics wrapped into `vRealF`, `vComplexD`, etc. The SIMD width and layout are compile-time constants controlled by `--enable-simd`. + +2. **Tensor layer** (`Grid/tensors/`) — Lorentz/colour/spin tensor algebra built on top of SIMD types. `iMatrix`, `iVector`, `iScalar` templates compose into QCD types like `ColourMatrix`, `SpinColourVector`. + +3. **Lattice layer** (`Grid/lattice/`) — `Lattice` container: a site-local tensor replicated across a distributed Cartesian grid. All arithmetic is site-parallel and expression-template-fused. + +4. **Cartesian/comms layer** (`Grid/cartesian/`, `Grid/communicator/`) — `GridCartesian` holds the MPI topology and local/global geometry. `Grid/cshift/` implements nearest-neighbour halo exchange; `Grid/stencil/` is the optimised multi-hop stencil used by Dirac operators. + +5. **Algorithm layer** (`Grid/algorithms/`) — iterative solvers (CG, GMRES, BiCGSTAB, mixed-precision), eigensolvers (Lanczos, LAPACK), FFT, smearing. + +6. **QCD layer** (`Grid/qcd/`) — gauge and fermion actions, HMC integrators, observables. + +### QCD subsystem (`Grid/qcd/`) + +- `action/fermion/` — Wilson, Clover, DWF (Mobius), Staggered, twisted-mass, G-parity variants +- `action/gauge/` — Wilson gauge, Symanzik, Iwasaki, DBW2, plaquette+rect +- `representations/` — Fundamental, Adjoint, Two-index, Sp(2n) +- `hmc/` — Leapfrog, OMF2/OMF4 integrators; pseudofermion refreshment; Metropolis accept/reject +- `smearing/` — APE, Stout, HEX, gradient flow +- `observables/` — Polyakov loop, plaquette, topological charge + +### GPU acceleration + +GPU support is injected via macros (`accelerator_for`, `accelerator_for2dNB`). The `Grid/simd/` SIMD types map to scalar on GPU device code; host code paths remain vectorised. Unified virtual memory is on by default (`--enable-unified=yes`); device-aware MPI (`--enable-accelerator-aware-mpi`) avoids device→host copies on transfers. + +### Memory and I/O + +- `Grid/allocator/` — aligned/NUMA-aware allocators; caching allocator via `--enable-alloc-cache` +- `Grid/parallelIO/` — distributed parallel reader/writer for ILDG (via LIME), SciDAC, and native binary formats +- `Grid/serialisation/` — text, binary, HDF5, XML/JSON serialisation of arbitrary Grid objects + +### HMC applications + +`HMC/` contains production-ready HMC driver programmes (e.g. `Mobius2p1f.cc`, `DWF_plus_DSDR_nf2plus1_Shamir_Gparity.cc`). These are built separately from the library tests. + +## Key Conventions + +- **C++17** is required throughout. +- Template structure: most classes are templated on `<_FImpl>` (fermion impl) or `` (gauge impl), which encode the representation and precision. Instantiation is controlled by `--enable-fermion-instantiations`. +- The `RealD`/`RealF`/`ComplexD`/`ComplexF` typedefs are used everywhere; avoid raw `double`/`float`. +- Logging uses `Grid_log`, `Grid_error` macros (from `Grid/log/`); performance-critical paths use the `GRID_TRACE` / timer macros from `Grid/perfmon/`. +- Reductions across MPI ranks go through `GridBase::GlobalSum` / `GlobalMax`; never reduce with bare MPI calls inside library code. diff --git a/Grid/lattice/Lattice_reduction.h b/Grid/lattice/Lattice_reduction.h index 861f3f06..5d97a794 100644 --- a/Grid/lattice/Lattice_reduction.h +++ b/Grid/lattice/Lattice_reduction.h @@ -31,6 +31,9 @@ Author: Christoph Lehner #if defined(GRID_SYCL) #include #endif +#if defined(GRID_CUDA)||defined(GRID_HIP)||defined(GRID_SYCL) +#include +#endif #include NAMESPACE_BEGIN(Grid); diff --git a/Grid/lattice/Lattice_reduction_gpu.h b/Grid/lattice/Lattice_reduction_gpu.h index 849f5309..2b792ba0 100644 --- a/Grid/lattice/Lattice_reduction_gpu.h +++ b/Grid/lattice/Lattice_reduction_gpu.h @@ -198,7 +198,7 @@ __global__ void reduceKernel(const vobj *lat, sobj *buffer, Iterator n) { // Possibly promote to double and sum ///////////////////////////////////////////////////////////////////////////////////////////////////////// template -inline typename vobj::scalar_objectD sumD_gpu_small(const vobj *lat, Integer osites) +inline typename vobj::scalar_objectD sumD_gpu_small_old(const vobj *lat, Integer osites) { typedef typename vobj::scalar_objectD sobj; typedef decltype(lat) Iterator; @@ -224,7 +224,7 @@ inline typename vobj::scalar_objectD sumD_gpu_small(const vobj *lat, Integer osi } template -inline typename vobj::scalar_objectD sumD_gpu_large(const vobj *lat, Integer osites) +inline typename vobj::scalar_objectD sumD_gpu_large_old(const vobj *lat, Integer osites) { typedef typename vobj::vector_type vector; typedef typename vobj::scalar_typeD scalarD; @@ -244,13 +244,13 @@ inline typename vobj::scalar_objectD sumD_gpu_large(const vobj *lat, Integer osi buf[ss] = dat[ss*words+w]; }); - ret_p[w] = sumD_gpu_small(tbuf,osites); + ret_p[w] = sumD_gpu_small_old(tbuf,osites); } return ret; } template -inline typename vobj::scalar_objectD sumD_gpu(const vobj *lat, Integer osites) +inline typename vobj::scalar_objectD sumD_gpu_old(const vobj *lat, Integer osites) { typedef typename vobj::scalar_objectD sobj; sobj ret; @@ -261,9 +261,9 @@ inline typename vobj::scalar_objectD sumD_gpu(const vobj *lat, Integer osites) int ok = getNumBlocksAndThreads(size, sizeof(sobj), numThreads, numBlocks); if ( ok ) { - ret = sumD_gpu_small(lat,osites); + ret = sumD_gpu_small_old(lat,osites); } else { - ret = sumD_gpu_large(lat,osites); + ret = sumD_gpu_large_old(lat,osites); } return ret; } @@ -272,20 +272,20 @@ inline typename vobj::scalar_objectD sumD_gpu(const vobj *lat, Integer osites) // Return as same precision as input performing reduction in double precision though ///////////////////////////////////////////////////////////////////////////////////////////////////////// template -inline typename vobj::scalar_object sum_gpu(const vobj *lat, Integer osites) +inline typename vobj::scalar_object sum_gpu_old(const vobj *lat, Integer osites) { typedef typename vobj::scalar_object sobj; sobj result; - result = sumD_gpu(lat,osites); + result = sumD_gpu_old(lat,osites); return result; } template -inline typename vobj::scalar_object sum_gpu_large(const vobj *lat, Integer osites) +inline typename vobj::scalar_object sum_gpu_large_old(const vobj *lat, Integer osites) { typedef typename vobj::scalar_object sobj; sobj result; - result = sumD_gpu_large(lat,osites); + result = sumD_gpu_large_old(lat,osites); return result; } diff --git a/Grid/lattice/Lattice_reduction_gpu_cub.h b/Grid/lattice/Lattice_reduction_gpu_cub.h new file mode 100644 index 00000000..97cc2939 --- /dev/null +++ b/Grid/lattice/Lattice_reduction_gpu_cub.h @@ -0,0 +1,184 @@ +/************************************************************************************* + Grid physics library, www.github.com/paboyle/Grid + Source file: ./Grid/lattice/Lattice_reduction_gpu_cub.h + Copyright (C) 2015-2024 +Author: Peter Boyle + This program is free software; you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 2 of the License, or + (at your option) any later version. + This program is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + You should have received a copy of the GNU General Public License along + with this program; if not, write to the Free Software Foundation, Inc., + 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA. + See the full license in the file "LICENSE" in the top level distribution directory + *************************************************************************************/ + /* END LEGAL */ +#pragma once + +#if defined(GRID_CUDA) +#include +#define gpucub cub +#define gpuError_t cudaError_t +#define gpuSuccess cudaSuccess +#elif defined(GRID_HIP) +#include +#define gpucub hipcub +#define gpuError_t hipError_t +#define gpuSuccess hipSuccess +#endif + +NAMESPACE_BEGIN(Grid); + +///////////////////////////////////////////////////////////////////////////////////////////////////////// +// Unified lattice reduction using CUB (CUDA/HIP) and sycl::reduction (SYCL). +// +// Strategy: one accelerator_for pass per site to extract SIMD lanes and promote to sobjD, +// then a single library reduce over the sobjD array. No small/large split is needed: +// CUB DeviceReduce::Reduce and sycl::reduction both handle arbitrary object sizes by +// tuning block occupancy internally. +///////////////////////////////////////////////////////////////////////////////////////////////////////// + +#if defined(GRID_CUDA) || defined(GRID_HIP) + +template +inline typename vobj::scalar_objectD sumD_gpu(const vobj *lat, Integer osites) +{ + typedef typename vobj::scalar_object sobj; + typedef typename vobj::scalar_objectD sobjD; + + // Per-site: sum SIMD lanes (Reduce) and promote to double precision. + deviceVector per_site(osites); + sobjD *per_site_p = &per_site[0]; + + accelerator_for(ss, osites, 1, { + sobj tmp = Reduce(lat[ss]); + sobjD tmpD; tmpD = tmp; + per_site_p[ss] = tmpD; + }); + + // CUB global reduction over the sobjD array. + sobjD zero; zeroit(zero); + sobjD *d_out = static_cast(acceleratorAllocDevice(sizeof(sobjD))); + void *d_temp = nullptr; + size_t temp_bytes = 0; + + gpuError_t gpuErr; + gpuErr = gpucub::DeviceReduce::Reduce(d_temp, temp_bytes, per_site_p, d_out, + (int)osites, gpucub::Sum(), zero, computeStream); + if (gpuErr != gpuSuccess) { + std::cout << GridLogError << "Lattice_reduction_gpu_cub.h: DeviceReduce size query failed: " + << gpuErr << std::endl; + exit(EXIT_FAILURE); + } + + d_temp = acceleratorAllocDevice(temp_bytes); + + gpuErr = gpucub::DeviceReduce::Reduce(d_temp, temp_bytes, per_site_p, d_out, + (int)osites, gpucub::Sum(), zero, computeStream); + if (gpuErr != gpuSuccess) { + std::cout << GridLogError << "Lattice_reduction_gpu_cub.h: DeviceReduce failed: " + << gpuErr << std::endl; + exit(EXIT_FAILURE); + } + + accelerator_barrier(); + + sobjD result; + acceleratorCopyFromDevice(d_out, &result, sizeof(sobjD)); + acceleratorFreeDevice(d_temp); + acceleratorFreeDevice(d_out); + return result; +} + +// sumD_gpu_small and sumD_gpu_large are preserved as aliases for API compatibility. +template +inline typename vobj::scalar_objectD sumD_gpu_small(const vobj *lat, Integer osites) +{ + return sumD_gpu(lat, osites); +} + +template +inline typename vobj::scalar_objectD sumD_gpu_large(const vobj *lat, Integer osites) +{ + return sumD_gpu(lat, osites); +} + +template +inline typename vobj::scalar_object sum_gpu(const vobj *lat, Integer osites) +{ + typedef typename vobj::scalar_object sobj; + sobj result; + result = sumD_gpu(lat, osites); + return result; +} + +template +inline typename vobj::scalar_object sum_gpu_large(const vobj *lat, Integer osites) +{ + return sum_gpu(lat, osites); +} + +#endif // GRID_CUDA || GRID_HIP + +#if defined(GRID_SYCL) + +// Accumulates in sobjD throughout, fixing the precision bug in the original +// Lattice_reduction_sycl.h which accumulated in sobj then converted at the end. +template +inline typename vobj::scalar_objectD sumD_gpu(const vobj *lat, Integer osites) +{ + typedef typename vobj::scalar_object sobj; + typedef typename vobj::scalar_objectD sobjD; + + sobjD identity; zeroit(identity); + sobjD ret; zeroit(ret); + { + sycl::buffer abuff(&ret, {1}); + theGridAccelerator->submit([&](sycl::handler &cgh) { + auto Reduction = sycl::reduction(abuff, cgh, identity, std::plus<>()); + cgh.parallel_for(sycl::range<1>{(size_t)osites}, + Reduction, + [=](sycl::id<1> item, auto &sum) { + sobj s = Reduce(lat[item[0]]); + sobjD sd; sd = s; + sum += sd; + }); + }); + } + return ret; +} + +template +inline typename vobj::scalar_objectD sumD_gpu_small(const vobj *lat, Integer osites) +{ + return sumD_gpu(lat, osites); +} + +template +inline typename vobj::scalar_objectD sumD_gpu_large(const vobj *lat, Integer osites) +{ + return sumD_gpu(lat, osites); +} + +template +inline typename vobj::scalar_object sum_gpu(const vobj *lat, Integer osites) +{ + typedef typename vobj::scalar_object sobj; + sobj result; + result = sumD_gpu(lat, osites); + return result; +} + +template +inline typename vobj::scalar_object sum_gpu_large(const vobj *lat, Integer osites) +{ + return sum_gpu(lat, osites); +} + +#endif // GRID_SYCL + +NAMESPACE_END(Grid); diff --git a/Grid/lattice/Lattice_reduction_sycl.h b/Grid/lattice/Lattice_reduction_sycl.h index ace7f010..232cf9e0 100644 --- a/Grid/lattice/Lattice_reduction_sycl.h +++ b/Grid/lattice/Lattice_reduction_sycl.h @@ -6,7 +6,7 @@ NAMESPACE_BEGIN(Grid); template -inline typename vobj::scalar_objectD sumD_gpu_tensor(const vobj *lat, Integer osites) +inline typename vobj::scalar_objectD sumD_gpu_tensor_old(const vobj *lat, Integer osites) { typedef typename vobj::scalar_object sobj; typedef typename vobj::scalar_objectD sobjD; @@ -31,40 +31,40 @@ inline typename vobj::scalar_objectD sumD_gpu_tensor(const vobj *lat, Integer os } template -inline typename vobj::scalar_objectD sumD_gpu_large(const vobj *lat, Integer osites) +inline typename vobj::scalar_objectD sumD_gpu_large_old(const vobj *lat, Integer osites) { - return sumD_gpu_tensor(lat,osites); + return sumD_gpu_tensor_old(lat,osites); } template -inline typename vobj::scalar_objectD sumD_gpu_small(const vobj *lat, Integer osites) +inline typename vobj::scalar_objectD sumD_gpu_small_old(const vobj *lat, Integer osites) { - return sumD_gpu_large(lat,osites); + return sumD_gpu_large_old(lat,osites); } template -inline typename vobj::scalar_objectD sumD_gpu(const vobj *lat, Integer osites) +inline typename vobj::scalar_objectD sumD_gpu_old(const vobj *lat, Integer osites) { - return sumD_gpu_large(lat,osites); + return sumD_gpu_large_old(lat,osites); } ///////////////////////////////////////////////////////////////////////////////////////////////////////// // Return as same precision as input performing reduction in double precision though ///////////////////////////////////////////////////////////////////////////////////////////////////////// template -inline typename vobj::scalar_object sum_gpu(const vobj *lat, Integer osites) +inline typename vobj::scalar_object sum_gpu_old(const vobj *lat, Integer osites) { typedef typename vobj::scalar_object sobj; sobj result; - result = sumD_gpu(lat,osites); + result = sumD_gpu_old(lat,osites); return result; } template -inline typename vobj::scalar_object sum_gpu_large(const vobj *lat, Integer osites) +inline typename vobj::scalar_object sum_gpu_large_old(const vobj *lat, Integer osites) { typedef typename vobj::scalar_object sobj; sobj result; - result = sumD_gpu_large(lat,osites); + result = sumD_gpu_large_old(lat,osites); return result; } From 286c29d6fb7b1587731279454e6112937e7b7b65 Mon Sep 17 00:00:00 2001 From: Peter Boyle Date: Fri, 15 May 2026 14:31:33 -0400 Subject: [PATCH 02/44] Add Test_reduction to tests/debug Tests the new CUB/hipCUB/SYCL lattice reduction (sum_gpu) against the preserved hand-rolled implementation (sum_gpu_old) for LatticeComplexF/D, LatticeColourMatrixF/D and LatticePropagatorF/D. Part a) gaussian random field: checks that old and new agree to within float/double roundoff tolerance. Part b) constant field (= 1.0, identity-matrix init): verifies innerProduct(sum, sum) = Ncomp * V^2 where Ncomp counts the nonzero diagonal scalar components per site (1 / Nc / Ns*Nc respectively). Make.inc is auto-generated by scripts/filelist on bootstrap and is not tracked; the new .cc file is all that is needed. Co-Authored-By: Claude Sonnet 4.6 --- tests/debug/Test_reduction.cc | 160 ++++++++++++++++++++++++++++++++++ 1 file changed, 160 insertions(+) create mode 100644 tests/debug/Test_reduction.cc diff --git a/tests/debug/Test_reduction.cc b/tests/debug/Test_reduction.cc new file mode 100644 index 00000000..dfe6708c --- /dev/null +++ b/tests/debug/Test_reduction.cc @@ -0,0 +1,160 @@ +/************************************************************************************* + Grid physics library, www.github.com/paboyle/Grid + + Source file: ./tests/debug/Test_reduction.cc + + Copyright (C) 2024 + +Author: Peter Boyle + + This program is free software; you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 2 of the License, or + (at your option) any later version. + + This program is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + You should have received a copy of the GNU General Public License along + with this program; if not, write to the Free Software Foundation, Inc., + 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA. + + See the full license in the file "LICENSE" in the top level distribution directory + *************************************************************************************/ + /* END LEGAL */ +#include + +using namespace std; +using namespace Grid; + +static int passed = 0; +static int failed = 0; + +static void check(bool ok, const std::string &msg) +{ + if (ok) { + std::cout << GridLogMessage << "PASS " << msg << std::endl; + passed++; + } else { + std::cout << GridLogMessage << "FAIL " << msg << std::endl; + failed++; + } +} + +// Frobenius-like norm squared of any Grid scalar tensor object. +// For iScalar chains: real(conj(a)*a) +// For iMatrix: sum_{i,j} real(conj(a_ij)*a_ij) (all elements) +// Both old and new paths promote to double before accumulating, so the result +// is always exact enough for the tolerances used here. +template +RealD scalarNorm2(const T &a) +{ + return (RealD)real(TensorRemove(innerProduct(a, a))); +} + +template +void testReduction(GridCartesian *grid, GridParallelRNG &rng, + const std::string &name, int Ncomp) +{ + typedef typename Field::vector_object vobj; + typedef typename vobj::scalar_object sobj; + typedef typename vobj::scalar_type scalar_type; + + const Integer V = grid->_gsites; + const Integer osites = grid->oSites(); + + // Detect single vs double precision by comparing fundamental scalar sizes. + const bool isFloat = (sizeof(scalar_type) < sizeof(ComplexD)); + + std::cout << GridLogMessage << "=== " << name << " ===" << std::endl; + + Field field(grid); + + //-------------------------------------------------------------------- + // a) Gaussian random field: sum_gpu (new CUB path) vs sum_gpu_old + // (preserved hand-rolled shared-memory path). Both promote lanes + // to double internally, so results should agree to near-roundoff. + //-------------------------------------------------------------------- +#if defined(GRID_CUDA) || defined(GRID_HIP) || defined(GRID_SYCL) + { + gaussian(rng, field); + + autoView(v, field, AcceleratorRead); + sobj new_result = sum_gpu (&v[0], osites); + sobj old_result = sum_gpu_old(&v[0], osites); + + sobj diff = new_result - old_result; + RealD diffn = scalarNorm2(diff); + RealD refn = scalarNorm2(old_result); + RealD reldiff = (refn > 0.0) ? std::sqrt(diffn / refn) : std::sqrt(diffn); + + // Float fields: both paths cast from double to float, expect O(eps_float). + // Double fields: ordering differences at most O(V * eps_double). + RealD tol = isFloat ? 1e-6 : 1e-10; + + std::cout << GridLogMessage + << name << " random reldiff = " << reldiff << std::endl; + check(reldiff < tol, name + " random: sum_gpu agrees with sum_gpu_old"); + } +#endif + + //-------------------------------------------------------------------- + // b) Constant field via field = 1.0. + // + // Grid's iMatrix::operator=(scalar) sets only the diagonal, so: + // LatticeComplex -> scalar 1.0 (Ncomp = 1 nonzero per site) + // LatticeColourMatrix -> Nc x Nc identity (Ncomp = Nc nonzero per site) + // LatticePropagator -> (Ns*Nc)^2 identity (Ncomp = Ns*Nc nonzero per site) + // + // After GlobalSum: sum_result has Ncomp diagonal entries each equal to V, + // all off-diagonal entries zero. Grid's recursive innerProduct computes + // the Frobenius inner product (sum of |element|^2 over all indices), giving + // + // innerProduct(sum_result, sum_result) = Ncomp * V^2 + //-------------------------------------------------------------------- + { + field = 1.0; + sobj sum_result = sum(field); // uses new GPU path + GlobalSum + + RealD got = scalarNorm2(sum_result); + RealD expected = (RealD)Ncomp * (RealD)V * (RealD)V; + RealD reldiff = std::abs(got - expected) / expected; + + std::cout << GridLogMessage + << name << " const: got " << got + << " expected " << expected + << " reldiff " << reldiff << std::endl; + check(reldiff < 1e-8, name + " const: innerProduct(sum,sum) = Ncomp*V^2"); + } +} + +int main(int argc, char **argv) +{ + Grid_init(&argc, &argv); + + Coordinate latt = GridDefaultLatt(); + Coordinate simd = GridDefaultSimd(Nd, vComplexD::Nsimd()); + Coordinate mpi = GridDefaultMpi(); + + GridCartesian *grid = SpaceTimeGrid::makeFourDimGrid(latt, simd, mpi); + GridParallelRNG rng(grid); + rng.SeedFixedIntegers({1, 2, 3, 4}); + + std::cout << GridLogMessage << "Lattice : " << latt << std::endl; + std::cout << GridLogMessage << "Volume : " << grid->_gsites << std::endl; + + testReduction (grid, rng, "LatticeComplexF", 1 ); + testReduction (grid, rng, "LatticeComplexD", 1 ); + testReduction (grid, rng, "LatticeColourMatrixF", Nc ); + testReduction (grid, rng, "LatticeColourMatrixD", Nc ); + testReduction (grid, rng, "LatticePropagatorF", Ns*Nc ); + testReduction (grid, rng, "LatticePropagatorD", Ns*Nc ); + + std::cout << GridLogMessage << "==============================" << std::endl; + std::cout << GridLogMessage << passed << " PASSED " << failed << " FAILED" << std::endl; + + Grid_finalize(); + return (failed > 0) ? EXIT_FAILURE : EXIT_SUCCESS; +} From 773a82d87f74b621f57e8cd14e944c27db2b5424 Mon Sep 17 00:00:00 2001 From: Peter Boyle Date: Fri, 15 May 2026 16:55:58 -0400 Subject: [PATCH 03/44] Reinstate large/small dispatch in CUB reduction path; radix-4 word-bundle for large types MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit rocPRIM's DeviceReduce requires warpSize(64) threads each holding one element in shared memory, so sizeof(T)*64 must fit in sharedMemPerBlock. LatticePropagator::scalar_objectD is 2304 bytes (64*2304 = 147 KB), exceeding the budget and triggering a compile-time static_assert in limit_block_size. Introduce sumD_gpu_direct (the original direct-CUB path, safe for small types) and a new sumD_gpu_large that groups the vobj's vector_type words in bundles of 4, reducing each bundle as WordBundle4 (64 bytes, 64*64 = 4 KB — always within budget). If words % 4 != 0, the final partial bundle is zero-padded. sumD_gpu dispatches at compile time via if constexpr on sizeof(sobjD) > 512. For LatticePropagator (144 words) this gives 36 CUB launches instead of 144. Co-Authored-By: Claude Sonnet 4.6 --- Grid/lattice/Lattice_reduction_gpu_cub.h | 161 +++++++++++++++++++++-- 1 file changed, 147 insertions(+), 14 deletions(-) diff --git a/Grid/lattice/Lattice_reduction_gpu_cub.h b/Grid/lattice/Lattice_reduction_gpu_cub.h index 97cc2939..c9901f63 100644 --- a/Grid/lattice/Lattice_reduction_gpu_cub.h +++ b/Grid/lattice/Lattice_reduction_gpu_cub.h @@ -36,21 +36,47 @@ NAMESPACE_BEGIN(Grid); ///////////////////////////////////////////////////////////////////////////////////////////////////////// // Unified lattice reduction using CUB (CUDA/HIP) and sycl::reduction (SYCL). // -// Strategy: one accelerator_for pass per site to extract SIMD lanes and promote to sobjD, -// then a single library reduce over the sobjD array. No small/large split is needed: -// CUB DeviceReduce::Reduce and sycl::reduction both handle arbitrary object sizes by -// tuning block occupancy internally. +// CUDA/HIP: one accelerator_for pass per site to extract SIMD lanes and promote to sobjD, +// then CUB/hipCUB DeviceReduce::Reduce over the resulting array. +// +// rocPRIM's DeviceReduce requires warpSize(64) threads per block, each holding one element +// in shared memory: sizeof(T)*64 must fit in sharedMemPerBlock. Large QCD objects such as +// LatticePropagator (sobjD = 2304 bytes, 64*2304 = 147 KB) exceed this budget. +// +// For those types sumD_gpu_large groups the vobj's vector_type words in bundles of 4, +// reducing each bundle as a WordBundle4 (64 bytes, 64*64 = 4 KB — always safe). +// Words that do not fill a complete bundle are zero-padded. +// +// SYCL: sycl::reduction handles any type size through the runtime, so one path suffices. ///////////////////////////////////////////////////////////////////////////////////////////////////////// #if defined(GRID_CUDA) || defined(GRID_HIP) +// Bundles 4 scalar_typeD values for the radix-4 large-type reduction path. +// sizeof = 4 * sizeof(scalarD) <= 64 bytes; 64 * 64 = 4096 bytes, safely within +// rocPRIM's shared-memory budget on all supported devices. +template +struct WordBundle4 { + scalarD w[4]; + accelerator_inline WordBundle4 operator+(const WordBundle4 &rhs) const { + WordBundle4 r; + r.w[0] = w[0] + rhs.w[0]; + r.w[1] = w[1] + rhs.w[1]; + r.w[2] = w[2] + rhs.w[2]; + r.w[3] = w[3] + rhs.w[3]; + return r; + } +}; + +// Direct CUB reduction on the full scalar_objectD. +// Only safe when sizeof(sobjD)*64 <= device sharedMemPerBlock. +// Do not call directly for large composite types (e.g. LatticePropagator). template -inline typename vobj::scalar_objectD sumD_gpu(const vobj *lat, Integer osites) +inline typename vobj::scalar_objectD sumD_gpu_direct(const vobj *lat, Integer osites) { typedef typename vobj::scalar_object sobj; typedef typename vobj::scalar_objectD sobjD; - // Per-site: sum SIMD lanes (Reduce) and promote to double precision. deviceVector per_site(osites); sobjD *per_site_p = &per_site[0]; @@ -60,7 +86,6 @@ inline typename vobj::scalar_objectD sumD_gpu(const vobj *lat, Integer osites) per_site_p[ss] = tmpD; }); - // CUB global reduction over the sobjD array. sobjD zero; zeroit(zero); sobjD *d_out = static_cast(acceleratorAllocDevice(sizeof(sobjD))); void *d_temp = nullptr; @@ -70,7 +95,7 @@ inline typename vobj::scalar_objectD sumD_gpu(const vobj *lat, Integer osites) gpuErr = gpucub::DeviceReduce::Reduce(d_temp, temp_bytes, per_site_p, d_out, (int)osites, gpucub::Sum(), zero, computeStream); if (gpuErr != gpuSuccess) { - std::cout << GridLogError << "Lattice_reduction_gpu_cub.h: DeviceReduce size query failed: " + std::cout << GridLogError << "sumD_gpu_direct: DeviceReduce size query failed: " << gpuErr << std::endl; exit(EXIT_FAILURE); } @@ -80,7 +105,7 @@ inline typename vobj::scalar_objectD sumD_gpu(const vobj *lat, Integer osites) gpuErr = gpucub::DeviceReduce::Reduce(d_temp, temp_bytes, per_site_p, d_out, (int)osites, gpucub::Sum(), zero, computeStream); if (gpuErr != gpuSuccess) { - std::cout << GridLogError << "Lattice_reduction_gpu_cub.h: DeviceReduce failed: " + std::cout << GridLogError << "sumD_gpu_direct: DeviceReduce failed: " << gpuErr << std::endl; exit(EXIT_FAILURE); } @@ -94,15 +119,120 @@ inline typename vobj::scalar_objectD sumD_gpu(const vobj *lat, Integer osites) return result; } -// sumD_gpu_small and sumD_gpu_large are preserved as aliases for API compatibility. +// Radix-4 word-bundle path for types too large for the direct CUB path. +// Treats vobj as words of vector_type; groups them in bundles of 4 and reduces +// each bundle as a WordBundle4. If words % 4 != 0, the final partial +// bundle is zero-padded so all unused slots contribute zero to the sum. template -inline typename vobj::scalar_objectD sumD_gpu_small(const vobj *lat, Integer osites) +inline typename vobj::scalar_objectD sumD_gpu_large(const vobj *lat, Integer osites) { - return sumD_gpu(lat, osites); + typedef typename vobj::vector_type vector; + typedef typename vobj::scalar_typeD scalarD; + typedef typename vobj::scalar_objectD sobjD; + using R4 = WordBundle4; + + const int words = sizeof(vobj) / sizeof(vector); + const int nfull = words / 4; + const int rem = words % 4; + + sobjD ret; zeroit(ret); + scalarD *ret_p = (scalarD *)&ret; + + iScalar *idat = (iScalar *)lat; + deviceVector buf(osites); + R4 *buf_p = &buf[0]; + + R4 zero4; + zero4.w[0] = zero4.w[1] = zero4.w[2] = zero4.w[3] = Zero(); + + R4 *d_out = static_cast(acceleratorAllocDevice(sizeof(R4))); + void *d_temp = nullptr; + size_t temp_bytes = 0; + + // Probe workspace size once — type R4 and count osites are fixed across all groups. + gpuError_t gpuErr; + gpuErr = gpucub::DeviceReduce::Reduce(d_temp, temp_bytes, buf_p, d_out, + (int)osites, gpucub::Sum(), zero4, computeStream); + if (gpuErr != gpuSuccess) { + std::cout << GridLogError << "sumD_gpu_large: DeviceReduce size query failed: " + << gpuErr << std::endl; + exit(EXIT_FAILURE); + } + d_temp = acceleratorAllocDevice(temp_bytes); + + // Full groups of 4 words. + for (int g = 0; g < nfull; g++) { + int base = 4 * g; + accelerator_for(ss, osites, 1, { + R4 r4; + r4.w[0] = TensorRemove(Reduce(idat[ss * words + base ])); + r4.w[1] = TensorRemove(Reduce(idat[ss * words + base + 1])); + r4.w[2] = TensorRemove(Reduce(idat[ss * words + base + 2])); + r4.w[3] = TensorRemove(Reduce(idat[ss * words + base + 3])); + buf_p[ss] = r4; + }); + gpuErr = gpucub::DeviceReduce::Reduce(d_temp, temp_bytes, buf_p, d_out, + (int)osites, gpucub::Sum(), zero4, computeStream); + if (gpuErr != gpuSuccess) { + std::cout << GridLogError << "sumD_gpu_large: DeviceReduce failed (group " + << g << "): " << gpuErr << std::endl; + exit(EXIT_FAILURE); + } + accelerator_barrier(); + R4 group_result; + acceleratorCopyFromDevice(d_out, &group_result, sizeof(R4)); + ret_p[base ] = group_result.w[0]; + ret_p[base + 1] = group_result.w[1]; + ret_p[base + 2] = group_result.w[2]; + ret_p[base + 3] = group_result.w[3]; + } + + // Partial last group: zero-pad unused slots so they contribute nothing to the sum. + if (rem > 0) { + int base = 4 * nfull; + accelerator_for(ss, osites, 1, { + R4 r4; + r4.w[0] = r4.w[1] = r4.w[2] = r4.w[3] = Zero(); + for (int k = 0; k < rem; k++) + r4.w[k] = TensorRemove(Reduce(idat[ss * words + base + k])); + buf_p[ss] = r4; + }); + gpuErr = gpucub::DeviceReduce::Reduce(d_temp, temp_bytes, buf_p, d_out, + (int)osites, gpucub::Sum(), zero4, computeStream); + if (gpuErr != gpuSuccess) { + std::cout << GridLogError << "sumD_gpu_large: DeviceReduce failed (partial group): " + << gpuErr << std::endl; + exit(EXIT_FAILURE); + } + accelerator_barrier(); + R4 partial_result; + acceleratorCopyFromDevice(d_out, &partial_result, sizeof(R4)); + for (int k = 0; k < rem; k++) + ret_p[4 * nfull + k] = partial_result.w[k]; + } + + acceleratorFreeDevice(d_temp); + acceleratorFreeDevice(d_out); + return ret; +} + +// Dispatch: direct CUB path for types that fit in the shared-memory budget, +// radix-4 word-bundle path for larger types. +// Threshold 512 bytes: 64 * 512 = 32768 bytes, within rocPRIM's +// ROCPRIM_SHARED_MEMORY_MAX on all supported devices. +template +inline typename vobj::scalar_objectD sumD_gpu(const vobj *lat, Integer osites) +{ + typedef typename vobj::scalar_objectD sobjD; + if constexpr (sizeof(sobjD) > 512) { + return sumD_gpu_large(lat, osites); + } else { + return sumD_gpu_direct(lat, osites); + } } template -inline typename vobj::scalar_objectD sumD_gpu_large(const vobj *lat, Integer osites) +inline typename vobj::scalar_objectD sumD_gpu_small(const vobj *lat, Integer osites) { return sumD_gpu(lat, osites); } @@ -119,7 +249,10 @@ inline typename vobj::scalar_object sum_gpu(const vobj *lat, Integer osites) template inline typename vobj::scalar_object sum_gpu_large(const vobj *lat, Integer osites) { - return sum_gpu(lat, osites); + typedef typename vobj::scalar_object sobj; + sobj result; + result = sumD_gpu_large(lat, osites); + return result; } #endif // GRID_CUDA || GRID_HIP From 003fec509c817672d638411a3a6f729587750d67 Mon Sep 17 00:00:00 2001 From: Peter Boyle Date: Fri, 15 May 2026 18:10:17 -0400 Subject: [PATCH 04/44] Fix Zero() used on thrust::complex in WordBundle4 initialisation Grid's Zero() sentinel is not assignable to thrust::complex; use scalarD(0) instead. Co-Authored-By: Claude Sonnet 4.6 --- Grid/lattice/Lattice_reduction_gpu_cub.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/Grid/lattice/Lattice_reduction_gpu_cub.h b/Grid/lattice/Lattice_reduction_gpu_cub.h index c9901f63..ac4368c0 100644 --- a/Grid/lattice/Lattice_reduction_gpu_cub.h +++ b/Grid/lattice/Lattice_reduction_gpu_cub.h @@ -143,7 +143,7 @@ inline typename vobj::scalar_objectD sumD_gpu_large(const vobj *lat, Integer osi R4 *buf_p = &buf[0]; R4 zero4; - zero4.w[0] = zero4.w[1] = zero4.w[2] = zero4.w[3] = Zero(); + zero4.w[0] = zero4.w[1] = zero4.w[2] = zero4.w[3] = scalarD(0); R4 *d_out = static_cast(acceleratorAllocDevice(sizeof(R4))); void *d_temp = nullptr; @@ -192,7 +192,7 @@ inline typename vobj::scalar_objectD sumD_gpu_large(const vobj *lat, Integer osi int base = 4 * nfull; accelerator_for(ss, osites, 1, { R4 r4; - r4.w[0] = r4.w[1] = r4.w[2] = r4.w[3] = Zero(); + r4.w[0] = r4.w[1] = r4.w[2] = r4.w[3] = scalarD(0); for (int k = 0; k < rem; k++) r4.w[k] = TensorRemove(Reduce(idat[ss * words + base + k])); buf_p[ss] = r4; From 09552cfd737dc7ecc37018bf4ceb593f2ddb94f3 Mon Sep 17 00:00:00 2001 From: Peter Boyle Date: Fri, 15 May 2026 23:15:11 -0400 Subject: [PATCH 05/44] Rename scalarNorm2 to squaredSum in Test_reduction.cc MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The function computes |sum|^2 — the squared magnitude of an aggregate sum — not a norm. squaredSum makes clear that squaring is applied to the sum, not to individual site values before summing, distinguishing it from sumOfSquares (the squared L2 norm). Co-Authored-By: Claude Sonnet 4.6 --- tests/debug/Test_reduction.cc | 18 +++++++++--------- 1 file changed, 9 insertions(+), 9 deletions(-) diff --git a/tests/debug/Test_reduction.cc b/tests/debug/Test_reduction.cc index dfe6708c..69b5a5b6 100644 --- a/tests/debug/Test_reduction.cc +++ b/tests/debug/Test_reduction.cc @@ -43,13 +43,13 @@ static void check(bool ok, const std::string &msg) } } -// Frobenius-like norm squared of any Grid scalar tensor object. -// For iScalar chains: real(conj(a)*a) -// For iMatrix: sum_{i,j} real(conj(a_ij)*a_ij) (all elements) -// Both old and new paths promote to double before accumulating, so the result -// is always exact enough for the tolerances used here. +// Squared magnitude of a Grid scalar tensor aggregate: innerProduct(a,a). +// For iScalar: real(conj(a)*a) +// For iMatrix: sum_{i,j} real(conj(a_ij)*a_ij) (Frobenius) +// Named squaredSum to make clear the squaring is applied to the aggregate +// (the sum), not to individual site values before summing. template -RealD scalarNorm2(const T &a) +RealD squaredSum(const T &a) { return (RealD)real(TensorRemove(innerProduct(a, a))); } @@ -86,8 +86,8 @@ void testReduction(GridCartesian *grid, GridParallelRNG &rng, sobj old_result = sum_gpu_old(&v[0], osites); sobj diff = new_result - old_result; - RealD diffn = scalarNorm2(diff); - RealD refn = scalarNorm2(old_result); + RealD diffn = squaredSum(diff); + RealD refn = squaredSum(old_result); RealD reldiff = (refn > 0.0) ? std::sqrt(diffn / refn) : std::sqrt(diffn); // Float fields: both paths cast from double to float, expect O(eps_float). @@ -118,7 +118,7 @@ void testReduction(GridCartesian *grid, GridParallelRNG &rng, field = 1.0; sobj sum_result = sum(field); // uses new GPU path + GlobalSum - RealD got = scalarNorm2(sum_result); + RealD got = squaredSum(sum_result); RealD expected = (RealD)Ncomp * (RealD)V * (RealD)V; RealD reldiff = std::abs(got - expected) / expected; From c0472aa0ec8482a32c86f7228341718b8f742e34 Mon Sep 17 00:00:00 2001 From: Peter Boyle Date: Mon, 18 May 2026 12:09:35 -0400 Subject: [PATCH 06/44] Test_reduction: use separate float and double grids Float fields require a grid constructed with vComplexF::Nsimd(); using a double grid causes grid->_gsites to undercount the sites in float vobjF, making the constant-field expected value wrong. Co-Authored-By: Claude Sonnet 4.6 --- tests/debug/Test_reduction.cc | 23 +++++++++++++---------- 1 file changed, 13 insertions(+), 10 deletions(-) diff --git a/tests/debug/Test_reduction.cc b/tests/debug/Test_reduction.cc index 69b5a5b6..a376a43f 100644 --- a/tests/debug/Test_reduction.cc +++ b/tests/debug/Test_reduction.cc @@ -135,22 +135,25 @@ int main(int argc, char **argv) Grid_init(&argc, &argv); Coordinate latt = GridDefaultLatt(); - Coordinate simd = GridDefaultSimd(Nd, vComplexD::Nsimd()); Coordinate mpi = GridDefaultMpi(); - GridCartesian *grid = SpaceTimeGrid::makeFourDimGrid(latt, simd, mpi); - GridParallelRNG rng(grid); + GridCartesian *UGrid = SpaceTimeGrid::makeFourDimGrid(latt, GridDefaultSimd(Nd, vComplexD::Nsimd()), mpi); + GridCartesian *UGrid_f = SpaceTimeGrid::makeFourDimGrid(latt, GridDefaultSimd(Nd, vComplexF::Nsimd()), mpi); + + GridParallelRNG rng(UGrid); rng.SeedFixedIntegers({1, 2, 3, 4}); + GridParallelRNG rng_f(UGrid_f); + rng_f.SeedFixedIntegers({1, 2, 3, 4}); std::cout << GridLogMessage << "Lattice : " << latt << std::endl; - std::cout << GridLogMessage << "Volume : " << grid->_gsites << std::endl; + std::cout << GridLogMessage << "Volume : " << UGrid->_gsites << std::endl; - testReduction (grid, rng, "LatticeComplexF", 1 ); - testReduction (grid, rng, "LatticeComplexD", 1 ); - testReduction (grid, rng, "LatticeColourMatrixF", Nc ); - testReduction (grid, rng, "LatticeColourMatrixD", Nc ); - testReduction (grid, rng, "LatticePropagatorF", Ns*Nc ); - testReduction (grid, rng, "LatticePropagatorD", Ns*Nc ); + testReduction (UGrid_f, rng_f, "LatticeComplexF", 1 ); + testReduction (UGrid, rng, "LatticeComplexD", 1 ); + testReduction (UGrid_f, rng_f, "LatticeColourMatrixF", Nc ); + testReduction (UGrid, rng, "LatticeColourMatrixD", Nc ); + testReduction (UGrid_f, rng_f, "LatticePropagatorF", Ns*Nc ); + testReduction (UGrid, rng, "LatticePropagatorD", Ns*Nc ); std::cout << GridLogMessage << "==============================" << std::endl; std::cout << GridLogMessage << passed << " PASSED " << failed << " FAILED" << std::endl; From c93b338bdd7789d470f3e45e0145bf3dd3620319 Mon Sep 17 00:00:00 2001 From: Peter Boyle Date: Mon, 18 May 2026 12:10:44 -0400 Subject: [PATCH 07/44] skills: HPC battle-hardening skill files for GPU+MPI correctness Six skill files encoding expertise for making codebases robust on problematic HPC systems, covering: correctness verification (double-run, fingerprinting, flight recorder), hang diagnosis, GPU runtime correctness (premature barrier, infinite poll), MPI correctness on heterogeneous systems (device buffer aliasing, AARCH64 PLT corruption, deterministic reductions), compiler validation, and communication/computation overlap pipeline design. Co-Authored-By: Claude Sonnet 4.6 --- skills/communication-overlap.md | 196 +++++++++++++++++++++++++++++ skills/compiler-validation.md | 154 +++++++++++++++++++++++ skills/correctness-verification.md | 169 +++++++++++++++++++++++++ skills/gpu-runtime-correctness.md | 101 +++++++++++++++ skills/hang-diagnosis.md | 102 +++++++++++++++ skills/mpi-heterogeneous.md | 137 ++++++++++++++++++++ 6 files changed, 859 insertions(+) create mode 100644 skills/communication-overlap.md create mode 100644 skills/compiler-validation.md create mode 100644 skills/correctness-verification.md create mode 100644 skills/gpu-runtime-correctness.md create mode 100644 skills/hang-diagnosis.md create mode 100644 skills/mpi-heterogeneous.md diff --git a/skills/communication-overlap.md b/skills/communication-overlap.md new file mode 100644 index 00000000..e5da15e8 --- /dev/null +++ b/skills/communication-overlap.md @@ -0,0 +1,196 @@ +--- +name: communication-overlap +description: Design and implement communication/computation overlap pipelines for GPU+MPI codes — per-packet event tracking, host-staging through pinned memory, internode/intranode bandwidth separation, and the 7-phase pipeline pattern that replaces broken accelerator-aware MPI paths. +user-invocable: true +allowed-tools: + - Read + - Bash(grep -r) +--- + +# Communication/Computation Overlap Pipeline Design + +## Why GPU-Direct MPI Is Often Not the Right Default + +GPU-direct RDMA (passing GPU buffer pointers directly to MPI) is appealing because it eliminates explicit D2H/H2D copies. In practice on several leadership systems: + +- **Bandwidth**: RDMA at 30% of wirespeed has been observed on Pontevecchio/Aurora. The overhead of staging through pinned host memory can be *lower* total latency than slow RDMA. +- **Correctness**: Device buffer aliasing in `MPI_Sendrecv` (see `mpi-heterogeneous.md`) makes direct GPU-to-GPU transfer unreliable. +- **Overlap**: Host-staging enables fine-grained overlap — each packet's D2H can be issued as a separate asynchronous event, and the corresponding MPI send can fire as soon as *that packet* arrives in host memory, not after all packets are ready. + +The pipeline pattern below was developed to replace broken MPICH accelerator-aware paths. It achieves genuine computation/communication overlap by tracking per-packet GPU events. + +## The 7-Phase Pipeline + +Given a set of halo exchange operations (each identified by a `packet_index`): + +### Phase 0: Prepare data on device +Pack halo data into contiguous GPU buffers. One buffer per direction/neighbour. + +### Phase 1: Post receives + start D2H +Post all `MPI_Irecv` calls immediately (into pinned host buffers). Simultaneously, start asynchronous D2H copies for all send buffers: + +```cpp +for (auto &pkt : send_packets) { + MPI_Irecv(pkt.host_recv_buf, pkt.bytes, MPI_BYTE, + pkt.src_rank, pkt.tag, comm, &pkt.recv_req); + + acceleratorCopyFromDeviceAsync(pkt.device_send_buf, + pkt.host_send_buf, + pkt.bytes, &pkt.d2h_event); +} +``` + +The key: `pkt.d2h_event` is a per-packet GPU event (e.g. `cudaEvent_t`, `hipEvent_t`, or SYCL event). We can poll individual packet completion rather than waiting for all. + +### Phase 2: Fire sends as D2H completes (packet by packet) +Poll packet D2H events. As each packet becomes ready in host memory, immediately fire the corresponding `MPI_Isend`. Also start intranode D2D copies at this point — these are deferred until now to avoid competing with the internode D2H on PCIe bandwidth: + +```cpp +bool all_sent = false; +while (!all_sent) { + all_sent = true; + for (auto &pkt : send_packets) { + if (!pkt.sent && acceleratorEventIsComplete(pkt.d2h_event)) { + MPI_Isend(pkt.host_send_buf, pkt.bytes, MPI_BYTE, + pkt.dst_rank, pkt.tag, comm, &pkt.send_req); + pkt.sent = true; + start_intranode_copy(pkt); // now safe, D2H is done + } + if (!pkt.sent) all_sent = false; + } +} +``` + +### Phase 3: Poll receives + start H2D as each arrives +`MPI_Test` individual receive requests. As each completes, immediately start the H2D copy into device-resident halo buffer: + +```cpp +bool all_recvd = false; +while (!all_recvd) { + all_recvd = true; + for (auto &pkt : recv_packets) { + if (!pkt.h2d_started) { + int flag = 0; + MPI_Test(&pkt.recv_req, &flag, MPI_STATUS_IGNORE); + if (flag) { + acceleratorCopyToDeviceAsync(pkt.host_recv_buf, + pkt.device_recv_buf, + pkt.bytes, &pkt.h2d_event); + pkt.h2d_started = true; + } + } + if (!pkt.h2d_started) all_recvd = false; + } +} +``` + +### Phase 4: Wait for all sends +```cpp +std::vector send_reqs; +for (auto &pkt : send_packets) send_reqs.push_back(pkt.send_req); +MPI_Waitall(send_reqs.size(), send_reqs.data(), MPI_STATUSES_IGNORE); +``` + +### Phase 5: Wait for all H2D copies +```cpp +for (auto &pkt : recv_packets) acceleratorEventWait(pkt.h2d_event); +``` + +### Phase 6: Run interior computation +The interior (non-halo) computation can run from Phase 1 onwards, overlapped with all of the above: + +```cpp +// Launched in Phase 1, runs in parallel with the pipeline +accelerator_for(ss, interior_sites, ...) { compute_interior(ss); } +``` + +Synchronise with interior before using the full field: +```cpp +accelerator_barrier(); // interior kernel done +// Halo H2D is also complete (Phase 5 above) +// Now safe to use full field +``` + +## Per-Packet Event Tracking Data Structure + +```cpp +struct Packet { + // Buffers + void *device_send_buf; + void *host_send_buf; // pinned + void *device_recv_buf; + void *host_recv_buf; // pinned + size_t bytes; + + // MPI + int src_rank, dst_rank, tag; + MPI_Request send_req, recv_req; + + // GPU events (one per packet, not one global barrier) + AcceleratorEvent d2h_event; + AcceleratorEvent h2d_event; + + // State flags + bool sent = false; + bool h2d_started = false; +}; +``` + +The critical design point: `d2h_event` and `h2d_event` are **per-packet**, not global. This allows the MPI send for packet 0 to fire while packet 1's D2H is still in progress. + +## Internode vs Intranode Separation + +PCIe (GPU-to-CPU) and NVLink/xGMI (GPU-to-GPU within a node) are separate bandwidth resources. They do not compete with each other, but they *do* compete with each other for transactions if both are active simultaneously. + +Strategy: complete all internode D2H copies first (to maximise NIC injection bandwidth), then start intranode D2D copies (which use NVLink/xGMI and do not contend with PCIe for internode traffic): + +```cpp +// In Phase 2: start intranode D2D only after D2H is confirmed complete +if (pkt.is_intranode && pkt.d2h_done) { + // Use peer access (cudaMemcpyPeerAsync / hipMemcpyPeerAsync) + // rather than staging through host for intranode + cudaMemcpyPeerAsync(pkt.peer_recv_buf, pkt.dst_device, + pkt.device_send_buf, pkt.src_device, + pkt.bytes, computeStream); +} +``` + +Grid reference: `Grid/communicator/Communicator_mpi3.cc` — search for `NVLINK_GET` and `ACCELERATOR_AWARE_MPI` conditional blocks. + +## Pinned Memory Allocation + +All host staging buffers must be pinned (page-locked) for async D2H/H2D: + +```cpp +// CUDA +cudaMallocHost(&host_buf, bytes); +cudaFreeHost(host_buf); + +// HIP +hipHostMalloc(&host_buf, bytes, hipHostMallocDefault); +hipHostFree(host_buf); + +// SYCL +host_buf = sycl::malloc_host(bytes, *queue); +sycl::free(host_buf, *queue); +``` + +Pre-allocate at startup. Repeated `cudaMallocHost` in the hot path adds latency from the OS memory manager. + +## Checksumming in the Pipeline + +Insert checksum computation before D2H (on the GPU-resident data) and verification after H2D (on the received GPU-resident data). See `correctness-verification.md` for the checksum pattern. The salting (`packet_index + 1000 * tag`) detects packet transposition — critical for diagnosing MPI buffer aliasing bugs where two packets' contents are swapped. + +## Smoke Test for a New System + +Before running physics, validate the pipeline on a synthetic benchmark: + +```cpp +// Send a buffer of known values, receive and check +// Run at multiple message sizes: 4KB, 64KB, 1MB, 16MB +// Run at multiple process counts: 2, 8, 64, 512 +// Verify checksums on every packet +// Measure bandwidth: should be ≥ 80% of FDR/HDR/NDR peak for host-staged +``` + +Any bandwidth below 50% of theoretical, or any checksum failure, indicates a problem in the communication stack that must be resolved before production runs. diff --git a/skills/compiler-validation.md b/skills/compiler-validation.md new file mode 100644 index 00000000..cbd88266 --- /dev/null +++ b/skills/compiler-validation.md @@ -0,0 +1,154 @@ +--- +name: compiler-validation +description: Identify GPU compiler code generation bugs, distinguish them from hardware and runtime bugs, construct minimal reproducers, and validate correctness of generated assembly for performance-critical HPC kernels. +user-invocable: true +allowed-tools: + - Read + - Bash(grep -r) + - Bash(objdump) +--- + +# Compiler Validation for GPU HPC Codes + +## Why Compiler Bugs Are Distinct + +Compiler bugs have a unique diagnostic signature: they produce *deterministically wrong* results. The same input always produces the same wrong output. This distinguishes them from: + +- Hardware bugs: usually stochastic (wrong answer sometimes, correct answer other times) +- Runtime bugs (premature barrier, buffer aliasing): often stochastic or history-dependent +- Race conditions: non-deterministic + +**The determinism test**: run the same kernel 100 times with the same input. If the wrong answer is always the same wrong answer, suspect the compiler. + +## The Minimal Reproducer Protocol + +When a kernel produces wrong results, isolate the compiler as quickly as possible: + +**Step 1: Eliminate the physics**. Reduce the failing kernel to the smallest possible computation that still exhibits the bug. Replace QCD fields with `double` arrays. Replace lattice operations with scalar arithmetic. The goal is a 20-line CUDA/HIP/SYCL file that any compiler engineer can compile and run. + +**Step 2: Binary search over optimisation levels**. Compile at `-O0` (or equivalent). If the answer becomes correct, the bug is in an optimisation pass. Then test `-O1`, `-O2`, `-O3` individually to find which optimisation level introduces the bug. + +```bash +# HIP example +hipcc -O0 minimal_repro.cc -o test_O0 && ./test_O0 # should be correct +hipcc -O1 minimal_repro.cc -o test_O1 && ./test_O1 # compare +hipcc -O2 minimal_repro.cc -o test_O2 && ./test_O2 # compare +``` + +**Step 3: Identify the optimisation pass**. For LLVM-based compilers (clang, hipcc, dpcpp, nvcc via ptxas): +```bash +# Disable individual optimisation passes: +hipcc -O2 -mllvm -disable-loop-unrolling minimal_repro.cc -o test +hipcc -O2 -fno-vectorize minimal_repro.cc -o test +hipcc -O2 -fno-slp-vectorize minimal_repro.cc -o test +``` + +**Step 4: Inspect the generated code**. For CUDA/HIP, use `--generate-line-info` and `cuobjdump` or `roc-obj-extract` to get annotated assembly: +```bash +# CUDA +nvcc -O2 --generate-line-info --keep minimal_repro.cu +cuobjdump --dump-ptx minimal_repro.o +# HIP/ROCm +hipcc -O2 --save-temps minimal_repro.cc +llvm-objdump -d minimal_repro.o +# SYCL/DPC++ +icpx -O2 -fsycl -Xclang -ast-dump minimal_repro.cc 2>&1 | grep -A5 "suspicious_expr" +``` + +Look for: incorrect register spill/fill sequences, loop trip count miscalculation, vectorisation across iteration boundaries, incorrect address arithmetic. + +## Known Compiler Bug Patterns in GPU Code + +### Register Pressure / Spill Bugs +High register usage forces spills to local memory. Some compiler versions generate incorrect spill/fill code — the value is written to local memory but a stale register value is read back instead of the spilled value. + +**Signature**: Wrong answer with high-register-count kernels; becomes correct when `--maxrregcount=N` forces lower register count (more spilling) or higher (`--maxrregcount=256`, fewer spills). + +**Diagnostic**: Check register usage: +```bash +nvcc -O2 --ptxas-options=-v minimal_repro.cu 2>&1 | grep "registers" +hipcc -O2 --offload-arch=gfx90a --save-temps minimal_repro.cc +llvm-mc --arch=amdgcn minimal_repro.s 2>&1 | grep "VGPRs" +``` + +### Vectorisation Across Loop Boundaries +The compiler vectorises two successive loop iterations as a SIMD unit when they have a data dependency that the compiler has incorrectly determined does not exist. + +**Signature**: Wrong answer that becomes correct when the loop body is extracted to a non-inlined function (disabling auto-vectorisation across iterations). + +### Incorrect Constant Propagation +The compiler evaluates a compile-time expression incorrectly, substituting a wrong constant. Common in template-heavy code where `sizeof(T)` or `alignof(T)` is used in arithmetic that the compiler folds at compile time. + +**Signature**: Wrong array index or wrong stride. Inspecting the generated assembly shows a literal constant where you expect a computed value. + +## Stress Patterns for Compiler Validation + +These patterns exercise the compiler in ways that commonly expose bugs: + +```cpp +// 1. Aliased pointer write followed by immediate read +// (tests correct handling of write-after-write in register allocation) +__global__ void alias_stress(double *a, double *b, int n) { + int i = blockIdx.x * blockDim.x + threadIdx.x; + if (i < n) { + a[i] = a[i] * 2.0; + b[i] = a[i] + 1.0; // must read the updated value, not the original + } +} + +// 2. Mixed-precision accumulation +// (tests correct type promotion in FMA sequences) +__global__ void precision_stress(float *in, double *out, int n) { + double acc = 0.0; + for (int i = 0; i < n; i++) acc += (double)in[i]; + *out = acc; +} + +// 3. Large struct in shared memory +// (tests alignment and offset calculation for non-power-of-2-sized objects) +struct S { double x[3]; }; // sizeof = 24 bytes, not a power of 2 +__global__ void struct_stress(S *in, S *out, int n) { + extern __shared__ S smem[]; + int tid = threadIdx.x; + smem[tid] = in[tid]; + __syncthreads(); + out[tid] = smem[(tid + 1) % blockDim.x]; +} +``` + +## Separating Compiler from Runtime/Hardware + +When results are deterministically wrong: + +| Test | Compiler bug | Runtime/hardware bug | +|---|---|---| +| Recompile at -O0 | Fixes it | No effect | +| Run on CPU (host code equivalent) | Fixes it | No effect | +| Reorder loop iterations | Changes wrong answer | No effect or different pattern | +| Different compiler version | Fixes or changes wrong answer | No effect | +| Different GPU of same model | Same wrong answer | Different or no error | +| Different GPU model | Fixes it (ISA-specific codegen bug) | May or may not fix | + +## Reporting to Compiler Teams + +A compiler bug report needs: +1. Minimal reproducer (< 50 lines) +2. Compiler version (`hipcc --version`, `nvcc --version`, `icpx --version`) +3. GPU model and driver version +4. Exact wrong and correct answers (hexfloat for reproducibility) +5. Which compile flags change the behaviour +6. Generated assembly for the correct and incorrect variants + +File with: LLVM Bugzilla (for hipcc/clang/dpcpp backends), NVIDIA bug portal (nvcc/ptxas), or vendor-specific developer forum. The minimal reproducer is the single most important element — without it, compiler teams cannot prioritise. + +## Pragmatic In-Production Workaround + +When a compiler bug is confirmed but the fix is not yet available, the lowest-risk workaround is to mark the affected function with reduced optimisation: + +```cpp +#pragma clang optimize off // clang/hipcc/dpcpp +void __attribute__((optimize("O0"))) affected_kernel_host_wrapper() { ... } +// For device code, use per-file compilation flags via CMake/Makefile +``` + +Document the workaround with a comment referencing the compiler bug report number so it can be removed when the compiler is updated. diff --git a/skills/correctness-verification.md b/skills/correctness-verification.md new file mode 100644 index 00000000..d190452e --- /dev/null +++ b/skills/correctness-verification.md @@ -0,0 +1,169 @@ +--- +name: correctness-verification +description: Implement application-level correctness verification for HPC codes on unreliable hardware — double-run pattern, deterministic reductions, per-packet checksums, and flight recorder step logging. +user-invocable: true +allowed-tools: + - Read + - Bash(grep -r) +--- + +# Correctness Verification Infrastructure for HPC Codes + +## The Problem + +Leadership computing facilities sometimes have hardware or firmware bugs below the level visible to application code. The accelerator runtime can return from `q.wait()` or `cudaDeviceSynchronize()` before work is actually complete, or silently produce wrong answers in DMA transfers. Standard testing does not catch these because they are non-deterministic and often topology-dependent (fail only at specific process counts or on specific node configurations). + +The symptoms look like numerical instabilities, random MPI hangs, or wrong physics results — not like crashes. Without deliberate infrastructure, diagnosing root cause takes months. + +## The Double-Run Pattern + +The most reliable correctness check for non-deterministic hardware bugs is to run every computation twice and compare bit-identical fingerprints. + +**Key constraint**: the second run must use a *deterministic* code path. Non-deterministic floating-point ordering (e.g. from MPI_Allreduce with different reduction trees on retry) produces false mismatches. See `mpi-heterogeneous.md` for how to make reductions deterministic. + +```cpp +// Pseudocode: double-run a step and compare CRC fingerprints +void run_step_verified(State &state) { + state.save_checkpoint(); + + uint64_t crc_a = run_step_and_fingerprint(state); + state.restore_checkpoint(); + uint64_t crc_b = run_step_and_fingerprint(state); + + if (crc_a != crc_b) { + report_mismatch("step", crc_a, crc_b); + // Policy: abort, retry from checkpoint, or continue with alarm + } +} +``` + +**Fingerprinting**: XOR-fold a CRC32 over all floating-point data after each step. XOR is order-independent, so it works across distributed nodes without communication. For field data: + +```cpp +uint64_t fingerprint(const double *data, size_t n) { + uint64_t acc = 0; + for (size_t i = 0; i < n; i++) { + uint64_t bits; + memcpy(&bits, &data[i], sizeof(bits)); + acc ^= crc32(bits); + } + return acc; +} +``` + +On GPU, compute the XOR reduction on-device (avoids D2H transfer of the full field): + +```cpp +// SYCL +uint64_t svm_xor(uint64_t *vec, uint64_t L) { + uint64_t ret = 0; + { sycl::buffer abuff(&ret, {1}); + theGridAccelerator->submit([&](sycl::handler &cgh) { + auto R = sycl::reduction(abuff, cgh, uint64_t(0), std::bit_xor<>()); + cgh.parallel_for(sycl::range<1>{L}, R, + [=](sycl::id<1> i, auto &sum) { sum ^= vec[i]; }); + }); } + theGridAccelerator->wait(); + return ret; +} +``` + +## Per-Packet Communication Checksums + +Silent data corruption in MPI buffers (documented in MPICH with device-resident buffers; see `mpi-heterogeneous.md`) requires per-packet verification, not just end-to-end. The pattern: + +1. Before packing a send buffer, compute a GPU-side checksum of the payload. +2. Append the checksum to the host staging buffer alongside the data. +3. After receiving and copying to device, recompute the checksum on-device and compare. + +Salt each checksum with `packet_index + 1000 * mpi_tag` to detect transposition (packet A landing in packet B's slot): + +```cpp +uint64_t salt = (uint64_t)packet_index + 1000ULL * mpi_tag; +checksum_send = checksum_gpu(payload_gpu, payload_words) ^ salt; +// ... transmit payload + checksum_send ... +checksum_recv = checksum_gpu(payload_gpu_recv, payload_words) ^ salt; +assert(checksum_recv == checksum_send); +``` + +Grid reference: `Grid/communicator/Communicator_mpi3.cc`, `#ifdef GRID_CHECKSUM_COMMS`. + +## Flight Recorder: Step-Level Logging + +Maintain a monotonic counter that names the current operation. On a hang, this is the only way to know *which* operation the process is stuck in without a debugger. + +```cpp +struct FlightRecorder { + std::atomic step_counter{0}; + const char *step_name = "init"; + + void step_log(const char *name) { + step_name = name; + step_counter.fetch_add(1, std::memory_order_relaxed); + } +}; +extern FlightRecorder gRecorder; +``` + +In Record mode, also store floating-point norms and communication checksums to vectors. In Verify mode, compare against stored values: + +```cpp +void norm_log(double val) { + if (mode == Record) norm_log_vec.push_back(val); + if (mode == Verify) { + double expected = norm_log_vec[norm_counter]; + if (val != expected) { // bit-exact for deterministic paths + std::cerr << "MISMATCH at step " << step_counter + << " (" << step_name << "): " + << std::hexfloat << val << " vs " << expected << "\n"; + print_backtrace(); + } + norm_counter++; + } +} +``` + +Grid reference: `Grid/util/FlightRecorder.h`, `Grid/util/FlightRecorder.cc`. + +## Signal Handler for Hang Detection + +Install a SIGHUP handler that dumps the current flight recorder state. This is async-safe only if the handler writes to a pre-allocated buffer using `write()` (not `printf`): + +```cpp +static char hang_buf[4096]; + +static void sighup_handler(int) { + int n = snprintf(hang_buf, sizeof(hang_buf), + "rank=%d step=%llu name=%s\n", + mpi_rank, + (unsigned long long)gRecorder.step_counter.load(), + gRecorder.step_name); + write(STDERR_FILENO, hang_buf, n); + // Optional: call backtrace_symbols_fd (async-safe on Linux) + void *frames[64]; + int depth = backtrace(frames, 64); + backtrace_symbols_fd(frames, depth, STDERR_FILENO); +} + +// In main(): +signal(SIGHUP, sighup_handler); +``` + +To diagnose a hang across all ranks: `kill -HUP $(pgrep my_app)` or via job scheduler. + +## What to Verify at Each Step + +| Data type | Fingerprint method | Frequency | +|---|---|---| +| Lattice fields | XOR of CRC32 over float64 words | Every algorithmic step | +| Communication buffers | GPU XOR reduction, salted | Every MPI operation | +| Scalar reductions | Bit-exact match of double | Every GlobalSum | +| Iteration counters | Exact integer match | Every solver iteration | + +## When to Abort vs Continue + +- **Abort immediately**: communication checksum mismatch (data is corrupt, continuing will silently propagate errors). +- **Log and continue**: norm mismatch in Verify mode if you need to map out which operations are unreliable. +- **Retry from checkpoint**: double-run mismatch when the underlying bug is non-deterministic (second retry will usually pass). + +Track the mismatch rate over a production run. A rate above ~1/1000 steps indicates a systemic hardware issue that should be escalated to the facility. diff --git a/skills/gpu-runtime-correctness.md b/skills/gpu-runtime-correctness.md new file mode 100644 index 00000000..24e696d0 --- /dev/null +++ b/skills/gpu-runtime-correctness.md @@ -0,0 +1,101 @@ +--- +name: gpu-runtime-correctness +description: Detect and work around GPU runtime correctness failures — premature completion signalling, infinite poll hangs, stale completion flags, and the double-wait diagnostic pattern. Covers CUDA, HIP/ROCm, and SYCL/Level Zero runtimes. +user-invocable: true +allowed-tools: + - Read + - Bash(grep -r) +--- + +# GPU Runtime Correctness + +## The Completion Signalling Problem + +GPU runtimes expose a synchronisation primitive — `cudaDeviceSynchronize()`, `hipDeviceSynchronize()`, `q.wait()` — that is supposed to block until all previously submitted GPU work is complete. On several production systems, this guarantee has been violated in two distinct ways: + +### Failure Mode A: Premature Return +The wait returns before the GPU work is done. The subsequent CPU code reads stale data from the output buffer. This is the most dangerous failure because it looks like a numerical instability, not a crash. Results are wrong but the program exits normally. + +**Identifying Premature Return**: Insert a second, independent wait immediately after the first. If a second `q.wait()` "fixes" incorrect results that appeared with a single `q.wait()`, the first wait was returning prematurely. + +```cpp +// Diagnostic version — if this stabilises results, you have premature return +accelerator_barrier(); // first wait +accelerator_barrier(); // second wait (diagnostic) +``` + +Production fix: submit a trivially cheap no-op kernel after the real work and wait for it. The no-op kernel cannot complete until all previous commands in the queue are done (command queue ordering guarantee), so waiting for the no-op is a stronger barrier than waiting for the queue itself: + +```cpp +// Lightweight fence kernel +template +__global__ void noop_kernel(T *p) { if (threadIdx.x == 0) (void)(*p); } + +void strong_barrier(T *device_ptr) { + noop_kernel<<<1, 1, 0, computeStream>>>(device_ptr); + cudaStreamSynchronize(computeStream); // wait for the no-op +} +``` + +### Failure Mode B: Infinite Poll +The wait enters a polling loop that never terminates. The process consumes 100% CPU in a runtime library. The GPU has either stopped signalling progress entirely, or the completion flag is in a memory region that has become incoherent. + +This is distinct from Failure Mode A: with premature return the CPU proceeds; with infinite poll the CPU is stuck. + +**Identifying Infinite Poll**: `top` shows the MPI rank at 100% CPU. `perf top -p PID` or `strace -p PID` shows the process burning cycles inside the GPU runtime library (e.g. `libze_intel_gpu.so`, `libamdhip64.so`). + +**Documented instances**: +- Intel Level Zero on Pontevecchio (Aurora): both premature return *and* infinite poll have been observed as independent bugs on the same system. +- The two failure modes can co-exist and have overlapping symptoms at the application level. + +## Completion Signalling Architecture + +Understanding why these bugs happen requires knowing how completion signalling works: + +``` +GPU command processor + → signals completion by writing to a host-visible memory address + → CPU runtime polls that address (or uses OS event notification via ioctl) +``` + +A premature return means the memory write happened before the actual work completed (e.g. the signal is on a different command stream that has not been serialised with the work stream). An infinite poll means the memory write never happens (hardware or driver bug preventing the signal from being written). + +**Implication**: `accelerator_barrier()` is not an unconditional correctness guarantee on all production systems. Application-level verification (double-run, checksums) is necessary as a second line of defence. + +## The Double-Wait Pattern in Practice + +The double-wait is a pragmatic workaround when premature return is suspected but not yet confirmed. It adds latency but does not change correctness if the barrier is working properly, so it is safe to enable in production: + +```cpp +#ifdef WORKAROUND_PREMATURE_BARRIER +#define accelerator_barrier() do { \ + real_accelerator_barrier(); \ + real_accelerator_barrier(); \ +} while(0) +#endif +``` + +Monitor whether this changes observed behaviour. If double-wait eliminates wrong answers, you have confirmed premature return. If it does not help but inserting a no-op kernel does, the issue is with the wait primitive specifically, not with the underlying completion signal. + +## SYCL/Level Zero Specifics + +Level Zero (the backend for Intel GPU runtimes) separates command submission from synchronisation. A `q.wait()` should wait for all previously submitted command lists to retire. Documented bugs include: + +- `q.wait()` returning before the associated fence in Level Zero has been signalled. +- `q.wait()` entering an `ioctl(i915, I915_GEM_WAIT)` call that never returns (kernel driver bug, not runtime bug). + +The latter requires a node reboot and cannot be worked around in application code. Detect it by checking process state (`D` in `ps aux`) and the kernel function via `/proc/PID/wchan`. + +## Stream Ordering and Compute Streams + +All GPU work must be submitted to the *same* stream/queue if you rely on in-order execution guarantees. Mixing default stream and non-default streams invalidates ordering assumptions on some backends. + +Grid uses `computeStream` (CUDA/HIP) or `theGridAccelerator` (SYCL) consistently throughout. If mixing Grid with third-party GPU code, ensure the third-party code is directed to the same stream, or insert explicit inter-stream barriers. + +## Checklist for New GPU Code + +1. Every kernel launch is followed by an `accelerator_barrier()` before reading device-side output on the host. +2. All device-to-host copies use an explicit stream synchronisation after the copy, not before. +3. If results are non-deterministic across runs, insert a second barrier and observe whether reproducibility improves. +4. For correctness-critical operations (reductions that will be compared against reference values), add the double-run checksum test from `correctness-verification.md`. +5. If the process hangs at 100% CPU in a runtime library function, this is a driver/runtime bug — there is no application-level fix beyond scheduling a node reboot. diff --git a/skills/hang-diagnosis.md b/skills/hang-diagnosis.md new file mode 100644 index 00000000..a6f39c26 --- /dev/null +++ b/skills/hang-diagnosis.md @@ -0,0 +1,102 @@ +--- +name: hang-diagnosis +description: Diagnose and isolate process hangs on HPC systems — distinguishing kernel-level ioctl hangs, infinite poll loops, collective deadlocks, and GPU completion signalling failures using async-safe signal handlers and flight recorder step counters. +user-invocable: true +allowed-tools: + - Read + - Bash(grep -r) + - Bash(strace) + - Bash(gdb) +--- + +# Hang Diagnosis on HPC Systems + +## Taxonomy of Hangs + +Not all hangs are the same. Misidentifying the type leads to wrong mitigation. The four distinct classes encountered on production leadership systems: + +### 1. Kernel-level ioctl hang (never returns) +The process is in `D` (uninterruptible sleep) state. `strace` shows it blocked in an `ioctl` syscall. The GPU device driver has entered an unrecoverable state. + +**Diagnosis**: `ps aux | grep D` — the process shows `D` state. `cat /proc/PID/wchan` shows `i915_gem_wait_for_error` or similar. + +**Resolution**: Only a driver reload or node reboot recovers it. Log the node identifier and request replacement from the facility scheduler. + +### 2. Infinite poll loop (`q.wait()` or `cudaDeviceSynchronize()` never returns) +The process is in `R` (running) state, consuming 100% CPU. A polling loop inside the runtime is checking a completion flag that never becomes true, either because the hardware never sets it or because the flag is in a memory region not visible to the polling thread. + +**Diagnosis**: `top` shows the rank at 100% CPU. `strace -p PID` shows repeated `futex` or `read` syscalls with zero-length results, or no syscalls at all (pure spinloop). `perf top -p PID` shows the process burning cycles in a single tight loop in a runtime library (e.g., `ze_intel_gpu.so`). + +**Resolution**: The double-wait workaround — submit a trivially cheap kernel after the operation under test to act as a fence, then wait for the trivial kernel. See `gpu-runtime-correctness.md`. + +### 3. Collective deadlock +One or more ranks are blocked in an MPI call, usually `MPI_Allreduce` or `MPI_Barrier`, while others are not. Root cause: a topology-dependent bug in the MPI library's collective algorithm where some ranks' contributions never arrive. + +**Diagnosis**: Flight recorder step logs show some ranks at step N (inside the collective) while others are at step N+1 or stuck at step N with different `step_name` strings. The hung ranks will show `D` or `S` state in `ps`. + +**Resolution**: Replace `MPI_Allreduce` with a deterministic point-to-point tree reduction. See `mpi-heterogeneous.md`. + +### 4. Premature return from wait (silent wrong answer, not a hang) +The runtime returns from `q.wait()` before the GPU work is complete. The next operation reads stale data. This is not a hang — it manifests as a wrong answer or non-deterministic floating-point results. It is listed here because it is the most confusing failure mode: the code appears to run correctly and completes normally. + +**Diagnosis**: Double-run with checksum (see `correctness-verification.md`). Insert a second `q.wait()` after the first and observe if results become reproducible. If inserting the second wait "fixes" wrong answers, the first wait was returning prematurely. + +## Flight Recorder for Hang Localization + +The most important diagnostic tool is knowing *which operation* a process is in when it hangs. Maintain a named step counter: + +```cpp +// Call at the start of every major operation +FlightRecorder::StepLog("MPI_Allreduce::norm"); +// ... do the operation ... +FlightRecorder::StepLog("MPI_Allreduce::done"); +``` + +On SIGHUP, dump rank, step counter value, and step name to stderr in an async-safe manner: + +```cpp +static void sighup_handler(int) { + char buf[256]; + int n = snprintf(buf, sizeof(buf), "rank %d: step %llu '%s'\n", + comm_rank, + (unsigned long long)step_counter, + step_name); + write(2, buf, n); + // backtrace_symbols_fd is async-safe on Linux glibc + void *frames[32]; + backtrace_symbols_fd(frames, backtrace(frames, 32), 2); +} +signal(SIGHUP, sighup_handler); +``` + +Broadcast SIGHUP to all ranks from outside the job: +```bash +# In a separate shell while the job is hung +squeue --job $JOBID -o "%i %N" | awk '{print $2}' | \ + xargs -I{} ssh {} "pkill -SIGHUP -f my_application" +``` + +The step names from all ranks will reveal which collective operation has diverged. + +## Distinguishing Driver Hang from MPI Hang + +| Symptom | Driver hang | MPI hang | +|---|---|---| +| Process state | `D` (ioctl) or `R` (spinloop) | `S` (blocked in syscall) | +| `strace` | blocked `ioctl` or tight loop | blocked `recvmsg` / `read` | +| Scope | single rank / single node | subset of ranks, pattern-dependent | +| Recovery | reboot node | cancel job | +| Flight recorder | step name is a GPU operation | step name is a collective | + +## Reducing Diagnostic Time + +1. **Name every collective operation** in the flight recorder before calling it. +2. **Separate GPU work from MPI work** in the code so the step name unambiguously identifies which subsystem is hung. +3. **Log node identifiers** alongside step names so flaky nodes can be identified and blacklisted. +4. **Request flight recorder dumps from all ranks simultaneously** (SIGHUP broadcast) rather than attaching a debugger — attaching `gdb` to one rank of a hung MPI job usually deadlocks the debugger too. + +## What Not to Do + +- Do not `kill -9` a hung rank immediately — get the flight recorder dump first, otherwise diagnostic information is lost. +- Do not assume the first rank that prints an error is the faulty one — collective hangs are frequently caused by the *last* rank to arrive at the barrier. +- Do not use `MPI_Abort` in the hang handler — it may itself hang on some implementations. Use `_exit(1)` to force termination. diff --git a/skills/mpi-heterogeneous.md b/skills/mpi-heterogeneous.md new file mode 100644 index 00000000..9ad45523 --- /dev/null +++ b/skills/mpi-heterogeneous.md @@ -0,0 +1,137 @@ +--- +name: mpi-heterogeneous +description: Diagnose and work around MPI correctness bugs on heterogeneous (CPU+GPU) systems — device buffer aliasing in MPI_Sendrecv, AARCH64 PLT corruption from libfabric, topology-dependent allreduce hangs, and deterministic point-to-point reduction trees as a replacement for MPI_Allreduce. +user-invocable: true +allowed-tools: + - Read + - Bash(grep -r) +--- + +# MPI Correctness on Heterogeneous HPC Systems + +## The Core Problem + +MPI libraries were designed for CPU-resident buffers. When GPU-resident buffers are passed directly (GPU-aware MPI / GPU direct RDMA), several correctness assumptions break: + +- **Buffer aliasing**: The MPI library may internally alias send/receive buffer addresses for `MPI_Sendrecv` in ways that are safe for CPU memory but wrong for GPU memory with different cache coherency rules. +- **RDMA bandwidth**: GPU direct RDMA on some fabrics operates at a fraction of peak wirespeed (documented at ~30% on Pontevecchio/Aurora), making host-staging mandatory for performance even when correctness is not an issue. +- **Collective tree topology**: `MPI_Allreduce` implementations may select reduction trees based on process count or communicator topology that expose rank-ordering bugs, causing hangs on some configurations but not others. + +## Bug Class 1: Device Buffer Aliasing in MPI_Sendrecv + +**Symptom**: `MPI_Sendrecv` with GPU-resident send and receive buffers produces wrong results. The received data matches neither the expected payload nor a host-staged copy. The failure is *deterministic* for a given problem size and process count, but *history-dependent* — earlier sends affect which alias is selected. + +**Root cause**: The MPI library internally reuses GPU buffer addresses for temporary staging without proper device memory ordering. When the same physical GPU memory pages appear in both the send and receive paths, writes from one path corrupt the other. + +**Diagnosis**: +1. Enable per-packet checksumming (see `correctness-verification.md`). If the checksum on the received packet does not match the sent checksum, the data was corrupted in transit. +2. Replace `MPI_Sendrecv` with separate `MPI_Isend` + `MPI_Irecv` + `MPI_Waitall`. If this fixes the problem, the bug is in the `MPI_Sendrecv` implementation's internal buffer handling. +3. Stage through host memory (`cudaMemcpy`/`hipMemcpy` to a host buffer, then `MPI_Sendrecv` on host buffers, then copy back). If this fixes the problem, confirms GPU-specific aliasing. + +**Reported as**: MPICH issue #7302. Affects MPICH on Intel Pontevecchio (Aurora) with device-resident buffers. + +**Workaround**: Do not use `MPI_Sendrecv` with GPU buffers. Use asynchronous send/receive pairs or host-staging. See `communication-overlap.md` for the full pipeline pattern. + +## Bug Class 2: PLT Corruption on AARCH64 (libfabric) + +**Symptom**: Application crashes or hangs on first `MPI_Comm_dup` call on AARCH64 systems (e.g. NVIDIA Grace/H200). Backtrace shows a bad instruction in the PLT (Procedure Linkage Table) for `MPI_Comm_dup` — specifically a `br x15` instruction that should instead be a proper trampoline. + +**Root cause**: `libfabric`'s memory registration cache monitor patches PLT entries at runtime to intercept memory allocation calls. Its AARCH64 trampoline generation writes an incorrect instruction sequence, leaving `br x15` (branch to whatever happens to be in x15) in the PLT entry. The next call through that PLT entry executes garbage. + +**Diagnosis**: +```bash +# Check if the PLT entry is corrupted +objdump -d /proc/PID/exe | grep -A5 "MPI_Comm_dup@plt" +# Look for "br x15" — this should be a proper stub, not a register branch +``` + +Or check the disassembly of the live process: +```bash +gdb -p PID -batch -ex "disassemble 'MPI_Comm_dup@plt'" +``` + +**Workaround**: +```bash +export FI_MR_CACHE_MONITOR=disabled +``` +This prevents libfabric from patching PLT entries. It may reduce MR cache performance but restores correctness. + +**Reported as**: libfabric issue #11451. Affects systems using AARCH64 + libfabric OFI provider (Cray Slingshot, AWS EFA) with memory registration cache enabled. + +## Bug Class 3: Topology-Dependent Allreduce Hangs + +**Symptom**: `MPI_Allreduce` hangs indefinitely on some node configurations but completes correctly on others. The failure correlates with process count (e.g. fails at 512 ranks, works at 256) or network topology (fails when crossing specific router boundaries). + +**Root cause**: The MPI library's collective selection algorithm picks a reduction tree implementation that assumes symmetric participation from all ranks. A bug in one rank's contribution path (e.g. a GPU-side buffer not yet flushed when MPI reads it, due to premature barrier — see `gpu-runtime-correctness.md`) causes that rank to send wrong or incomplete data, and the tree-reduction protocol deadlocks waiting for data that never arrives correctly. + +**Diagnosis**: Flight recorder step logging (see `hang-diagnosis.md`). SIGHUP broadcast to all ranks. Ranks that are hung will show step name `MPI_Allreduce::...`; ranks that completed will show the next step. The hung ranks are the *recipients* of the stale data, not necessarily the *cause*. + +**Workaround — deterministic P2P reduction tree**: + +Replace `MPI_Allreduce` with an explicit point-to-point binary tree reduction. This is slower for large communicators but: +1. Is immune to topology-dependent collective bugs. +2. Is deterministic in floating-point ordering (the tree is fixed, not chosen at runtime). +3. Makes the hang location explicit — each P2P operation is a named step in the flight recorder. + +```cpp +// Binary tree reduction: rank 0 collects, then broadcasts +void GlobalSumP2P(double *data, int count, MPI_Comm comm) { + int rank, size; + MPI_Comm_rank(comm, &rank); MPI_Comm_size(comm, &size); + + // Reduce phase: even ranks receive from odd neighbours + for (int stride = 1; stride < size; stride *= 2) { + if (rank % (2*stride) == 0) { + int partner = rank + stride; + if (partner < size) { + std::vector tmp(count); + MPI_Recv(tmp.data(), count, MPI_DOUBLE, partner, 0, comm, MPI_STATUS_IGNORE); + for (int i = 0; i < count; i++) data[i] += tmp[i]; + } + } else if (rank % stride == 0) { + int partner = rank - stride; + MPI_Send(data, count, MPI_DOUBLE, partner, 0, comm); + break; + } + } + // Broadcast phase + for (int stride = /* highest power of 2 <= size */; stride >= 1; stride /= 2) { + if (rank % (2*stride) == 0) { + int partner = rank + stride; + if (partner < size) + MPI_Send(data, count, MPI_DOUBLE, partner, 0, comm); + } else if (rank % stride == 0) { + int partner = rank - stride; + MPI_Recv(data, count, MPI_DOUBLE, partner, 0, comm, MPI_STATUS_IGNORE); + } + } +} +``` + +Grid reference: `USE_GRID_REDUCTION` macro in `Grid/communicator/Communicator_mpi3.cc`. + +## Compile-Time Guard Structure + +Recommended macro structure to switch between the workaround paths: + +```cpp +// In configure / CMake, expose as options: +// ACCELERATOR_AWARE_MPI — use GPU direct (fast, potentially broken) +// GRID_CHECKSUM_COMMS — per-packet checksums (overhead: ~5%) +// USE_GRID_REDUCTION — P2P tree instead of MPI_Allreduce (slower, deterministic) +// FI_MR_CACHE_MONITOR — libfabric PLT workaround (env var, not compile-time) +``` + +On a known-good system, enable `ACCELERATOR_AWARE_MPI` and disable the others. On a system with known bugs, disable `ACCELERATOR_AWARE_MPI` and enable `GRID_CHECKSUM_COMMS` + `USE_GRID_REDUCTION` as needed. + +## Escalation Checklist + +Before concluding a bug is in your code: + +1. [ ] Can you reproduce with a minimal reproducer (two MPI ranks, no physics code)? +2. [ ] Does the failure rate correlate with buffer size, process count, or network route? +3. [ ] Does staging through host memory eliminate the failure? +4. [ ] Is the failure deterministic for a given input (same answer, always wrong) or stochastic? +5. [ ] Does the failure appear on a different MPI implementation (e.g. OpenMPI vs MPICH)? + +Deterministic wrong answers that reproduce with minimal reproducers and disappear with host-staging are strong evidence of an MPI library bug. File with the MPI library issue tracker with the minimal reproducer. From baa70d8ec99ade706d73490c05969447569d2a63 Mon Sep 17 00:00:00 2001 From: Peter Boyle Date: Mon, 18 May 2026 12:31:13 -0400 Subject: [PATCH 08/44] Test_reduction: add timing benchmark for new vs old reduction paths Reports us/call and GB/s for sum_gpu (CUB/sycl::reduction) and sum_gpu_old (hand-rolled shared-memory) for each field type, with 5-call warmup and 100-call timed loop. Co-Authored-By: Claude Sonnet 4.6 --- tests/debug/Test_reduction.cc | 45 ++++++++++++++++++++++++++++++++++- 1 file changed, 44 insertions(+), 1 deletion(-) diff --git a/tests/debug/Test_reduction.cc b/tests/debug/Test_reduction.cc index a376a43f..da0905d2 100644 --- a/tests/debug/Test_reduction.cc +++ b/tests/debug/Test_reduction.cc @@ -101,7 +101,50 @@ void testReduction(GridCartesian *grid, GridParallelRNG &rng, #endif //-------------------------------------------------------------------- - // b) Constant field via field = 1.0. + // b) Timing: new (CUB/sycl::reduction) vs old (hand-rolled) path. + // Warmup first, then Niter timed calls; report us/call and GB/s. + //-------------------------------------------------------------------- +#if defined(GRID_CUDA) || defined(GRID_HIP) || defined(GRID_SYCL) + { + const int Nwarm = 5; + const int Niter = 100; + + gaussian(rng, field); + + { + autoView(v, field, AcceleratorRead); + for (int i = 0; i < Nwarm; i++) sum_gpu (&v[0], osites); + for (int i = 0; i < Nwarm; i++) sum_gpu_old(&v[0], osites); + } + + RealD t_new, t_old; + { + autoView(v, field, AcceleratorRead); + t_new = -usecond(); + for (int i = 0; i < Niter; i++) sum_gpu(&v[0], osites); + t_new += usecond(); + } + { + autoView(v, field, AcceleratorRead); + t_old = -usecond(); + for (int i = 0; i < Niter; i++) sum_gpu_old(&v[0], osites); + t_old += usecond(); + } + + RealD bytes = (RealD)osites * sizeof(vobj); + RealD GBs_new = bytes / (t_new / Niter) * 1e-3; + RealD GBs_old = bytes / (t_old / Niter) * 1e-3; + + std::cout << GridLogMessage << name << " timing (" << Niter << " calls):" << std::endl; + std::cout << GridLogMessage + << " sum_gpu " << t_new/Niter << " us " << GBs_new << " GB/s" << std::endl; + std::cout << GridLogMessage + << " sum_gpu_old " << t_old/Niter << " us " << GBs_old << " GB/s" << std::endl; + } +#endif + + //-------------------------------------------------------------------- + // d) Constant field via field = 1.0. // // Grid's iMatrix::operator=(scalar) sets only the diagonal, so: // LatticeComplex -> scalar 1.0 (Ncomp = 1 nonzero per site) From dc6ae51cabe423f98c3e10feaf53bf310c58234b Mon Sep 17 00:00:00 2001 From: Peter Boyle Date: Mon, 18 May 2026 13:55:28 -0400 Subject: [PATCH 09/44] Lattice_reduction_gpu_cub: replace WordBundle4 with iVector,4> MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit WordBundle4 was redundant with Grid's existing tensor infrastructure. iVector,4> already provides accelerator_inline operator+, zeroit(), and sycl::is_device_copyable — no new type needed. Co-Authored-By: Claude Sonnet 4.6 --- Grid/lattice/Lattice_reduction_gpu_cub.h | 54 +++++++++--------------- 1 file changed, 20 insertions(+), 34 deletions(-) diff --git a/Grid/lattice/Lattice_reduction_gpu_cub.h b/Grid/lattice/Lattice_reduction_gpu_cub.h index ac4368c0..17b6d7af 100644 --- a/Grid/lattice/Lattice_reduction_gpu_cub.h +++ b/Grid/lattice/Lattice_reduction_gpu_cub.h @@ -44,7 +44,7 @@ NAMESPACE_BEGIN(Grid); // LatticePropagator (sobjD = 2304 bytes, 64*2304 = 147 KB) exceed this budget. // // For those types sumD_gpu_large groups the vobj's vector_type words in bundles of 4, -// reducing each bundle as a WordBundle4 (64 bytes, 64*64 = 4 KB — always safe). +// reducing each bundle as an iVector,4> (64 bytes, 64*64 = 4 KB — always safe). // Words that do not fill a complete bundle are zero-padded. // // SYCL: sycl::reduction handles any type size through the runtime, so one path suffices. @@ -52,22 +52,6 @@ NAMESPACE_BEGIN(Grid); #if defined(GRID_CUDA) || defined(GRID_HIP) -// Bundles 4 scalar_typeD values for the radix-4 large-type reduction path. -// sizeof = 4 * sizeof(scalarD) <= 64 bytes; 64 * 64 = 4096 bytes, safely within -// rocPRIM's shared-memory budget on all supported devices. -template -struct WordBundle4 { - scalarD w[4]; - accelerator_inline WordBundle4 operator+(const WordBundle4 &rhs) const { - WordBundle4 r; - r.w[0] = w[0] + rhs.w[0]; - r.w[1] = w[1] + rhs.w[1]; - r.w[2] = w[2] + rhs.w[2]; - r.w[3] = w[3] + rhs.w[3]; - return r; - } -}; - // Direct CUB reduction on the full scalar_objectD. // Only safe when sizeof(sobjD)*64 <= device sharedMemPerBlock. // Do not call directly for large composite types (e.g. LatticePropagator). @@ -121,15 +105,19 @@ inline typename vobj::scalar_objectD sumD_gpu_direct(const vobj *lat, Integer os // Radix-4 word-bundle path for types too large for the direct CUB path. // Treats vobj as words of vector_type; groups them in bundles of 4 and reduces -// each bundle as a WordBundle4. If words % 4 != 0, the final partial -// bundle is zero-padded so all unused slots contribute zero to the sum. +// each bundle as an iVector,4> — reusing Grid's existing tensor +// type which already has accelerator_inline operator+ and zeroit(). +// sizeof = 4 * sizeof(scalarD) <= 64 bytes; 64 * 64 = 4096 bytes, safely within +// rocPRIM's shared-memory budget on all supported devices. +// If words % 4 != 0, the final partial bundle is zero-padded so all unused +// slots contribute zero to the sum. template inline typename vobj::scalar_objectD sumD_gpu_large(const vobj *lat, Integer osites) { typedef typename vobj::vector_type vector; typedef typename vobj::scalar_typeD scalarD; typedef typename vobj::scalar_objectD sobjD; - using R4 = WordBundle4; + using R4 = iVector, 4>; const int words = sizeof(vobj) / sizeof(vector); const int nfull = words / 4; @@ -142,8 +130,7 @@ inline typename vobj::scalar_objectD sumD_gpu_large(const vobj *lat, Integer osi deviceVector buf(osites); R4 *buf_p = &buf[0]; - R4 zero4; - zero4.w[0] = zero4.w[1] = zero4.w[2] = zero4.w[3] = scalarD(0); + R4 zero4; zeroit(zero4); R4 *d_out = static_cast(acceleratorAllocDevice(sizeof(R4))); void *d_temp = nullptr; @@ -165,10 +152,10 @@ inline typename vobj::scalar_objectD sumD_gpu_large(const vobj *lat, Integer osi int base = 4 * g; accelerator_for(ss, osites, 1, { R4 r4; - r4.w[0] = TensorRemove(Reduce(idat[ss * words + base ])); - r4.w[1] = TensorRemove(Reduce(idat[ss * words + base + 1])); - r4.w[2] = TensorRemove(Reduce(idat[ss * words + base + 2])); - r4.w[3] = TensorRemove(Reduce(idat[ss * words + base + 3])); + r4._internal[0] = TensorRemove(Reduce(idat[ss * words + base ])); + r4._internal[1] = TensorRemove(Reduce(idat[ss * words + base + 1])); + r4._internal[2] = TensorRemove(Reduce(idat[ss * words + base + 2])); + r4._internal[3] = TensorRemove(Reduce(idat[ss * words + base + 3])); buf_p[ss] = r4; }); gpuErr = gpucub::DeviceReduce::Reduce(d_temp, temp_bytes, buf_p, d_out, @@ -181,20 +168,19 @@ inline typename vobj::scalar_objectD sumD_gpu_large(const vobj *lat, Integer osi accelerator_barrier(); R4 group_result; acceleratorCopyFromDevice(d_out, &group_result, sizeof(R4)); - ret_p[base ] = group_result.w[0]; - ret_p[base + 1] = group_result.w[1]; - ret_p[base + 2] = group_result.w[2]; - ret_p[base + 3] = group_result.w[3]; + ret_p[base ] = TensorRemove(group_result._internal[0]); + ret_p[base + 1] = TensorRemove(group_result._internal[1]); + ret_p[base + 2] = TensorRemove(group_result._internal[2]); + ret_p[base + 3] = TensorRemove(group_result._internal[3]); } // Partial last group: zero-pad unused slots so they contribute nothing to the sum. if (rem > 0) { int base = 4 * nfull; accelerator_for(ss, osites, 1, { - R4 r4; - r4.w[0] = r4.w[1] = r4.w[2] = r4.w[3] = scalarD(0); + R4 r4; zeroit(r4); for (int k = 0; k < rem; k++) - r4.w[k] = TensorRemove(Reduce(idat[ss * words + base + k])); + r4._internal[k] = TensorRemove(Reduce(idat[ss * words + base + k])); buf_p[ss] = r4; }); gpuErr = gpucub::DeviceReduce::Reduce(d_temp, temp_bytes, buf_p, d_out, @@ -208,7 +194,7 @@ inline typename vobj::scalar_objectD sumD_gpu_large(const vobj *lat, Integer osi R4 partial_result; acceleratorCopyFromDevice(d_out, &partial_result, sizeof(R4)); for (int k = 0; k < rem; k++) - ret_p[4 * nfull + k] = partial_result.w[k]; + ret_p[4 * nfull + k] = TensorRemove(partial_result._internal[k]); } acceleratorFreeDevice(d_temp); From e12bc7f07ccd4324e2a67206ebd53c416dc31790 Mon Sep 17 00:00:00 2001 From: Peter Boyle Date: Mon, 18 May 2026 14:23:44 -0400 Subject: [PATCH 10/44] Lattice_reduction_gpu_cub: add GRID_REDUCTION_TIMING instrumentation Guards accelerator_for and CUB DeviceReduce calls in sumD_gpu_direct and sumD_gpu_large with #ifdef GRID_REDUCTION_TIMING to isolate where time is spent in each path. Large path accumulates across all groups and prints totals with words/nfull/rem context. Co-Authored-By: Claude Sonnet 4.6 --- Grid/lattice/Lattice_reduction_gpu_cub.h | 51 ++++++++++++++++++++++++ 1 file changed, 51 insertions(+) diff --git a/Grid/lattice/Lattice_reduction_gpu_cub.h b/Grid/lattice/Lattice_reduction_gpu_cub.h index 17b6d7af..e104a6f2 100644 --- a/Grid/lattice/Lattice_reduction_gpu_cub.h +++ b/Grid/lattice/Lattice_reduction_gpu_cub.h @@ -64,11 +64,18 @@ inline typename vobj::scalar_objectD sumD_gpu_direct(const vobj *lat, Integer os deviceVector per_site(osites); sobjD *per_site_p = &per_site[0]; +#ifdef GRID_REDUCTION_TIMING + RealD t_for = -usecond(); +#endif accelerator_for(ss, osites, 1, { sobj tmp = Reduce(lat[ss]); sobjD tmpD; tmpD = tmp; per_site_p[ss] = tmpD; }); +#ifdef GRID_REDUCTION_TIMING + accelerator_barrier(); + t_for += usecond(); +#endif sobjD zero; zeroit(zero); sobjD *d_out = static_cast(acceleratorAllocDevice(sizeof(sobjD))); @@ -86,6 +93,9 @@ inline typename vobj::scalar_objectD sumD_gpu_direct(const vobj *lat, Integer os d_temp = acceleratorAllocDevice(temp_bytes); +#ifdef GRID_REDUCTION_TIMING + RealD t_cub = -usecond(); +#endif gpuErr = gpucub::DeviceReduce::Reduce(d_temp, temp_bytes, per_site_p, d_out, (int)osites, gpucub::Sum(), zero, computeStream); if (gpuErr != gpuSuccess) { @@ -95,6 +105,13 @@ inline typename vobj::scalar_objectD sumD_gpu_direct(const vobj *lat, Integer os } accelerator_barrier(); +#ifdef GRID_REDUCTION_TIMING + t_cub += usecond(); + std::cout << GridLogMessage << "sumD_gpu_direct" + << " sizeof(sobjD)=" << sizeof(sobjD) + << " accelerator_for=" << t_for << " us" + << " CUB_reduce=" << t_cub << " us" << std::endl; +#endif sobjD result; acceleratorCopyFromDevice(d_out, &result, sizeof(sobjD)); @@ -147,9 +164,16 @@ inline typename vobj::scalar_objectD sumD_gpu_large(const vobj *lat, Integer osi } d_temp = acceleratorAllocDevice(temp_bytes); +#ifdef GRID_REDUCTION_TIMING + RealD t_for_large = 0.0, t_cub_large = 0.0; +#endif + // Full groups of 4 words. for (int g = 0; g < nfull; g++) { int base = 4 * g; +#ifdef GRID_REDUCTION_TIMING + t_for_large -= usecond(); +#endif accelerator_for(ss, osites, 1, { R4 r4; r4._internal[0] = TensorRemove(Reduce(idat[ss * words + base ])); @@ -158,6 +182,11 @@ inline typename vobj::scalar_objectD sumD_gpu_large(const vobj *lat, Integer osi r4._internal[3] = TensorRemove(Reduce(idat[ss * words + base + 3])); buf_p[ss] = r4; }); +#ifdef GRID_REDUCTION_TIMING + accelerator_barrier(); + t_for_large += usecond(); + t_cub_large -= usecond(); +#endif gpuErr = gpucub::DeviceReduce::Reduce(d_temp, temp_bytes, buf_p, d_out, (int)osites, gpucub::Sum(), zero4, computeStream); if (gpuErr != gpuSuccess) { @@ -166,6 +195,9 @@ inline typename vobj::scalar_objectD sumD_gpu_large(const vobj *lat, Integer osi exit(EXIT_FAILURE); } accelerator_barrier(); +#ifdef GRID_REDUCTION_TIMING + t_cub_large += usecond(); +#endif R4 group_result; acceleratorCopyFromDevice(d_out, &group_result, sizeof(R4)); ret_p[base ] = TensorRemove(group_result._internal[0]); @@ -177,12 +209,20 @@ inline typename vobj::scalar_objectD sumD_gpu_large(const vobj *lat, Integer osi // Partial last group: zero-pad unused slots so they contribute nothing to the sum. if (rem > 0) { int base = 4 * nfull; +#ifdef GRID_REDUCTION_TIMING + t_for_large -= usecond(); +#endif accelerator_for(ss, osites, 1, { R4 r4; zeroit(r4); for (int k = 0; k < rem; k++) r4._internal[k] = TensorRemove(Reduce(idat[ss * words + base + k])); buf_p[ss] = r4; }); +#ifdef GRID_REDUCTION_TIMING + accelerator_barrier(); + t_for_large += usecond(); + t_cub_large -= usecond(); +#endif gpuErr = gpucub::DeviceReduce::Reduce(d_temp, temp_bytes, buf_p, d_out, (int)osites, gpucub::Sum(), zero4, computeStream); if (gpuErr != gpuSuccess) { @@ -191,12 +231,23 @@ inline typename vobj::scalar_objectD sumD_gpu_large(const vobj *lat, Integer osi exit(EXIT_FAILURE); } accelerator_barrier(); +#ifdef GRID_REDUCTION_TIMING + t_cub_large += usecond(); +#endif R4 partial_result; acceleratorCopyFromDevice(d_out, &partial_result, sizeof(R4)); for (int k = 0; k < rem; k++) ret_p[4 * nfull + k] = TensorRemove(partial_result._internal[k]); } +#ifdef GRID_REDUCTION_TIMING + std::cout << GridLogMessage << "sumD_gpu_large" + << " sizeof(sobjD)=" << sizeof(sobjD) + << " words=" << words << " nfull=" << nfull << " rem=" << rem + << " accelerator_for=" << t_for_large << " us" + << " CUB_reduce=" << t_cub_large << " us" << std::endl; +#endif + acceleratorFreeDevice(d_temp); acceleratorFreeDevice(d_out); return ret; From fca2c5dba0663268c9c5099284a6109d9466e3c1 Mon Sep 17 00:00:00 2001 From: Peter Boyle Date: Mon, 18 May 2026 14:54:08 -0400 Subject: [PATCH 11/44] Lattice_reduction_gpu_cub: define GRID_REDUCTION_TIMING in header Co-Authored-By: Claude Sonnet 4.6 --- Grid/lattice/Lattice_reduction_gpu_cub.h | 2 ++ 1 file changed, 2 insertions(+) diff --git a/Grid/lattice/Lattice_reduction_gpu_cub.h b/Grid/lattice/Lattice_reduction_gpu_cub.h index e104a6f2..6a732de4 100644 --- a/Grid/lattice/Lattice_reduction_gpu_cub.h +++ b/Grid/lattice/Lattice_reduction_gpu_cub.h @@ -52,6 +52,8 @@ NAMESPACE_BEGIN(Grid); #if defined(GRID_CUDA) || defined(GRID_HIP) +#define GRID_REDUCTION_TIMING + // Direct CUB reduction on the full scalar_objectD. // Only safe when sizeof(sobjD)*64 <= device sharedMemPerBlock. // Do not call directly for large composite types (e.g. LatticePropagator). From 747c1676587036f7556eb790672dc70be5bb3385 Mon Sep 17 00:00:00 2001 From: Peter Boyle Date: Mon, 18 May 2026 16:21:50 -0400 Subject: [PATCH 12/44] sumD_gpu_direct: one thread per SIMD lane using extractLane MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Replaces one thread per outer site calling Reduce() (sequential Nsimd-wide loop) with one thread per lane calling extractLane() — O(1) per thread. CUB now reduces over osites*Nsimd elements. Avoids serial lane reduction but leaves the per-lane sobjD store stride as a known remaining concern. Co-Authored-By: Claude Sonnet 4.6 --- Grid/lattice/Lattice_reduction_gpu_cub.h | 23 ++++++++++++++--------- 1 file changed, 14 insertions(+), 9 deletions(-) diff --git a/Grid/lattice/Lattice_reduction_gpu_cub.h b/Grid/lattice/Lattice_reduction_gpu_cub.h index 6a732de4..afea7571 100644 --- a/Grid/lattice/Lattice_reduction_gpu_cub.h +++ b/Grid/lattice/Lattice_reduction_gpu_cub.h @@ -63,16 +63,21 @@ inline typename vobj::scalar_objectD sumD_gpu_direct(const vobj *lat, Integer os typedef typename vobj::scalar_object sobj; typedef typename vobj::scalar_objectD sobjD; - deviceVector per_site(osites); - sobjD *per_site_p = &per_site[0]; + const Integer nsimd = vobj::Nsimd(); + const Integer nlanes = osites * nsimd; + + deviceVector per_lane(nlanes); + sobjD *per_lane_p = &per_lane[0]; #ifdef GRID_REDUCTION_TIMING RealD t_for = -usecond(); #endif - accelerator_for(ss, osites, 1, { - sobj tmp = Reduce(lat[ss]); + accelerator_for(idx, nlanes, 1, { + Integer ss = idx / nsimd; + Integer lane = idx % nsimd; + sobj tmp = extractLane(lane, lat[ss]); sobjD tmpD; tmpD = tmp; - per_site_p[ss] = tmpD; + per_lane_p[idx] = tmpD; }); #ifdef GRID_REDUCTION_TIMING accelerator_barrier(); @@ -85,8 +90,8 @@ inline typename vobj::scalar_objectD sumD_gpu_direct(const vobj *lat, Integer os size_t temp_bytes = 0; gpuError_t gpuErr; - gpuErr = gpucub::DeviceReduce::Reduce(d_temp, temp_bytes, per_site_p, d_out, - (int)osites, gpucub::Sum(), zero, computeStream); + gpuErr = gpucub::DeviceReduce::Reduce(d_temp, temp_bytes, per_lane_p, d_out, + (int)nlanes, gpucub::Sum(), zero, computeStream); if (gpuErr != gpuSuccess) { std::cout << GridLogError << "sumD_gpu_direct: DeviceReduce size query failed: " << gpuErr << std::endl; @@ -98,8 +103,8 @@ inline typename vobj::scalar_objectD sumD_gpu_direct(const vobj *lat, Integer os #ifdef GRID_REDUCTION_TIMING RealD t_cub = -usecond(); #endif - gpuErr = gpucub::DeviceReduce::Reduce(d_temp, temp_bytes, per_site_p, d_out, - (int)osites, gpucub::Sum(), zero, computeStream); + gpuErr = gpucub::DeviceReduce::Reduce(d_temp, temp_bytes, per_lane_p, d_out, + (int)nlanes, gpucub::Sum(), zero, computeStream); if (gpuErr != gpuSuccess) { std::cout << GridLogError << "sumD_gpu_direct: DeviceReduce failed: " << gpuErr << std::endl; From 843d6497b28ed25dd090606acdf7d7629721cbd4 Mon Sep 17 00:00:00 2001 From: Peter Boyle Date: Mon, 18 May 2026 21:08:10 -0400 Subject: [PATCH 13/44] sumD_gpu_direct: shared-memory lane reduction with acceleratorThreads(1) Set acceleratorThreads to 1 before the extraction kernel so that dim3(nsimd,1,1) blocks give exactly one site group per block and __shared__ sobjD smem[nsimd] is correctly sized without depending on the runtime acceleratorThreads() value. threadIdx.x (acceleratorSIMTlane) indexes the SIMD lane for coalesced reads; lane 0 sums smem[0..nsimd-1] and writes one sobjD per site. CUB then reduces osites elements instead of osites*nsimd, reducing both store traffic and CUB work by Nsimd. acceleratorSynchronise() (warp-level) suffices since nsimd < warpSize. Co-Authored-By: Claude Sonnet 4.6 --- Grid/lattice/Lattice_reduction_gpu_cub.h | 40 +++++++++++++++--------- 1 file changed, 26 insertions(+), 14 deletions(-) diff --git a/Grid/lattice/Lattice_reduction_gpu_cub.h b/Grid/lattice/Lattice_reduction_gpu_cub.h index afea7571..274371f1 100644 --- a/Grid/lattice/Lattice_reduction_gpu_cub.h +++ b/Grid/lattice/Lattice_reduction_gpu_cub.h @@ -63,24 +63,36 @@ inline typename vobj::scalar_objectD sumD_gpu_direct(const vobj *lat, Integer os typedef typename vobj::scalar_object sobj; typedef typename vobj::scalar_objectD sobjD; - const Integer nsimd = vobj::Nsimd(); - const Integer nlanes = osites * nsimd; + constexpr Integer nsimd = vobj::Nsimd(); - deviceVector per_lane(nlanes); - sobjD *per_lane_p = &per_lane[0]; + deviceVector per_site(osites); + sobjD *per_site_p = &per_site[0]; + + // Set acceleratorThreads=1 so the block is dim3(nsimd,1,1): one site per block. + // smem[nsimd] is then exactly sized; no other site group competes for the same slots. + // Restore after the (synchronous) accelerator_for. + uint32_t saved_threads = acceleratorThreads(); + acceleratorThreads(1); #ifdef GRID_REDUCTION_TIMING RealD t_for = -usecond(); #endif - accelerator_for(idx, nlanes, 1, { - Integer ss = idx / nsimd; - Integer lane = idx % nsimd; - sobj tmp = extractLane(lane, lat[ss]); + accelerator_for(ss, osites, nsimd, { + int lane = acceleratorSIMTlane(nsimd); + __shared__ sobjD smem[nsimd]; + sobj tmp = extractLane(lane, lat[ss]); sobjD tmpD; tmpD = tmp; - per_lane_p[idx] = tmpD; + smem[lane] = tmpD; + acceleratorSynchronise(); + if (lane == 0) { + sobjD acc = smem[0]; + for (int l = 1; l < nsimd; l++) acc = acc + smem[l]; + per_site_p[ss] = acc; + } }); + // accelerator_for already issued accelerator_barrier(); restore thread count now. + acceleratorThreads(saved_threads); #ifdef GRID_REDUCTION_TIMING - accelerator_barrier(); t_for += usecond(); #endif @@ -90,8 +102,8 @@ inline typename vobj::scalar_objectD sumD_gpu_direct(const vobj *lat, Integer os size_t temp_bytes = 0; gpuError_t gpuErr; - gpuErr = gpucub::DeviceReduce::Reduce(d_temp, temp_bytes, per_lane_p, d_out, - (int)nlanes, gpucub::Sum(), zero, computeStream); + gpuErr = gpucub::DeviceReduce::Reduce(d_temp, temp_bytes, per_site_p, d_out, + (int)osites, gpucub::Sum(), zero, computeStream); if (gpuErr != gpuSuccess) { std::cout << GridLogError << "sumD_gpu_direct: DeviceReduce size query failed: " << gpuErr << std::endl; @@ -103,8 +115,8 @@ inline typename vobj::scalar_objectD sumD_gpu_direct(const vobj *lat, Integer os #ifdef GRID_REDUCTION_TIMING RealD t_cub = -usecond(); #endif - gpuErr = gpucub::DeviceReduce::Reduce(d_temp, temp_bytes, per_lane_p, d_out, - (int)nlanes, gpucub::Sum(), zero, computeStream); + gpuErr = gpucub::DeviceReduce::Reduce(d_temp, temp_bytes, per_site_p, d_out, + (int)osites, gpucub::Sum(), zero, computeStream); if (gpuErr != gpuSuccess) { std::cout << GridLogError << "sumD_gpu_direct: DeviceReduce failed: " << gpuErr << std::endl; From f4fbf7c9cae0e00cfb9bcd09afc85e8f3999cd71 Mon Sep 17 00:00:00 2001 From: Peter Boyle Date: Mon, 18 May 2026 21:23:15 -0400 Subject: [PATCH 14/44] sumD_gpu_direct: revert to per-lane write; CUB handles Nsimd*osites inputs Benchmarking showed the shared-memory lane-summation approach (843d6497) was slower than writing each SIMD lane individually and letting CUB reduce the full nlanes = osites*Nsimd array. CUB's device reduce is more efficient over the larger input than the smem overhead + serialised lane-0 summation. The smem approach also required overriding acceleratorThreads() to avoid the block-size sizing problem. Restore the simpler per-lane path. Co-Authored-By: Claude Sonnet 4.6 --- Grid/lattice/Lattice_reduction_gpu_cub.h | 40 +++++++++--------------- 1 file changed, 14 insertions(+), 26 deletions(-) diff --git a/Grid/lattice/Lattice_reduction_gpu_cub.h b/Grid/lattice/Lattice_reduction_gpu_cub.h index 274371f1..eb9348ab 100644 --- a/Grid/lattice/Lattice_reduction_gpu_cub.h +++ b/Grid/lattice/Lattice_reduction_gpu_cub.h @@ -63,36 +63,24 @@ inline typename vobj::scalar_objectD sumD_gpu_direct(const vobj *lat, Integer os typedef typename vobj::scalar_object sobj; typedef typename vobj::scalar_objectD sobjD; - constexpr Integer nsimd = vobj::Nsimd(); + const Integer nsimd = vobj::Nsimd(); + const Integer nlanes = osites * nsimd; - deviceVector per_site(osites); - sobjD *per_site_p = &per_site[0]; - - // Set acceleratorThreads=1 so the block is dim3(nsimd,1,1): one site per block. - // smem[nsimd] is then exactly sized; no other site group competes for the same slots. - // Restore after the (synchronous) accelerator_for. - uint32_t saved_threads = acceleratorThreads(); - acceleratorThreads(1); + deviceVector per_lane(nlanes); + sobjD *per_lane_p = &per_lane[0]; #ifdef GRID_REDUCTION_TIMING RealD t_for = -usecond(); #endif - accelerator_for(ss, osites, nsimd, { - int lane = acceleratorSIMTlane(nsimd); - __shared__ sobjD smem[nsimd]; - sobj tmp = extractLane(lane, lat[ss]); + accelerator_for(idx, nlanes, 1, { + Integer ss = idx / nsimd; + Integer lane = idx % nsimd; + sobj tmp = extractLane(lane, lat[ss]); sobjD tmpD; tmpD = tmp; - smem[lane] = tmpD; - acceleratorSynchronise(); - if (lane == 0) { - sobjD acc = smem[0]; - for (int l = 1; l < nsimd; l++) acc = acc + smem[l]; - per_site_p[ss] = acc; - } + per_lane_p[idx] = tmpD; }); - // accelerator_for already issued accelerator_barrier(); restore thread count now. - acceleratorThreads(saved_threads); #ifdef GRID_REDUCTION_TIMING + accelerator_barrier(); t_for += usecond(); #endif @@ -102,8 +90,8 @@ inline typename vobj::scalar_objectD sumD_gpu_direct(const vobj *lat, Integer os size_t temp_bytes = 0; gpuError_t gpuErr; - gpuErr = gpucub::DeviceReduce::Reduce(d_temp, temp_bytes, per_site_p, d_out, - (int)osites, gpucub::Sum(), zero, computeStream); + gpuErr = gpucub::DeviceReduce::Reduce(d_temp, temp_bytes, per_lane_p, d_out, + (int)nlanes, gpucub::Sum(), zero, computeStream); if (gpuErr != gpuSuccess) { std::cout << GridLogError << "sumD_gpu_direct: DeviceReduce size query failed: " << gpuErr << std::endl; @@ -115,8 +103,8 @@ inline typename vobj::scalar_objectD sumD_gpu_direct(const vobj *lat, Integer os #ifdef GRID_REDUCTION_TIMING RealD t_cub = -usecond(); #endif - gpuErr = gpucub::DeviceReduce::Reduce(d_temp, temp_bytes, per_site_p, d_out, - (int)osites, gpucub::Sum(), zero, computeStream); + gpuErr = gpucub::DeviceReduce::Reduce(d_temp, temp_bytes, per_lane_p, d_out, + (int)nlanes, gpucub::Sum(), zero, computeStream); if (gpuErr != gpuSuccess) { std::cout << GridLogError << "sumD_gpu_direct: DeviceReduce failed: " << gpuErr << std::endl; From 068f95ad2d397bd517d439cb44e4fccac7333e72 Mon Sep 17 00:00:00 2001 From: Peter Boyle Date: Mon, 18 May 2026 21:52:18 -0400 Subject: [PATCH 15/44] Revert to hand-rolled reduction; drop Lattice_reduction_gpu_cub.h Remove the CUB/hipCUB direction entirely. Restore Lattice_reduction_gpu.h, Lattice_reduction_sycl.h, and Lattice_reduction.h to the state before the CUB rewrite (commit 969b0a39), recovering the original primary function names (sumD_gpu_small, sumD_gpu_large, sumD_gpu, sum_gpu, sum_gpu_large) and the hand-rolled shared-memory reduction kernel. Delete Lattice_reduction_gpu_cub.h. Update Test_reduction to remove the old/new comparison sections that depended on sum_gpu_old. The lesson: CUB DeviceReduce is slower than the hand-rolled kernel for small types, and the smem sizing problem for the extraction pass has no clean solution within the accelerator_for abstraction. The right improvement is a higher radix (12 then 4) in sumD_gpu_large, applied directly to the existing hand-rolled kernel. Co-Authored-By: Claude Sonnet 4.6 --- Grid/lattice/Lattice_reduction.h | 3 - Grid/lattice/Lattice_reduction_gpu.h | 20 +- Grid/lattice/Lattice_reduction_gpu_cub.h | 361 ----------------------- Grid/lattice/Lattice_reduction_sycl.h | 22 +- tests/debug/Test_reduction.cc | 53 +--- 5 files changed, 28 insertions(+), 431 deletions(-) delete mode 100644 Grid/lattice/Lattice_reduction_gpu_cub.h diff --git a/Grid/lattice/Lattice_reduction.h b/Grid/lattice/Lattice_reduction.h index 5d97a794..861f3f06 100644 --- a/Grid/lattice/Lattice_reduction.h +++ b/Grid/lattice/Lattice_reduction.h @@ -31,9 +31,6 @@ Author: Christoph Lehner #if defined(GRID_SYCL) #include #endif -#if defined(GRID_CUDA)||defined(GRID_HIP)||defined(GRID_SYCL) -#include -#endif #include NAMESPACE_BEGIN(Grid); diff --git a/Grid/lattice/Lattice_reduction_gpu.h b/Grid/lattice/Lattice_reduction_gpu.h index 2b792ba0..849f5309 100644 --- a/Grid/lattice/Lattice_reduction_gpu.h +++ b/Grid/lattice/Lattice_reduction_gpu.h @@ -198,7 +198,7 @@ __global__ void reduceKernel(const vobj *lat, sobj *buffer, Iterator n) { // Possibly promote to double and sum ///////////////////////////////////////////////////////////////////////////////////////////////////////// template -inline typename vobj::scalar_objectD sumD_gpu_small_old(const vobj *lat, Integer osites) +inline typename vobj::scalar_objectD sumD_gpu_small(const vobj *lat, Integer osites) { typedef typename vobj::scalar_objectD sobj; typedef decltype(lat) Iterator; @@ -224,7 +224,7 @@ inline typename vobj::scalar_objectD sumD_gpu_small_old(const vobj *lat, Integer } template -inline typename vobj::scalar_objectD sumD_gpu_large_old(const vobj *lat, Integer osites) +inline typename vobj::scalar_objectD sumD_gpu_large(const vobj *lat, Integer osites) { typedef typename vobj::vector_type vector; typedef typename vobj::scalar_typeD scalarD; @@ -244,13 +244,13 @@ inline typename vobj::scalar_objectD sumD_gpu_large_old(const vobj *lat, Integer buf[ss] = dat[ss*words+w]; }); - ret_p[w] = sumD_gpu_small_old(tbuf,osites); + ret_p[w] = sumD_gpu_small(tbuf,osites); } return ret; } template -inline typename vobj::scalar_objectD sumD_gpu_old(const vobj *lat, Integer osites) +inline typename vobj::scalar_objectD sumD_gpu(const vobj *lat, Integer osites) { typedef typename vobj::scalar_objectD sobj; sobj ret; @@ -261,9 +261,9 @@ inline typename vobj::scalar_objectD sumD_gpu_old(const vobj *lat, Integer osite int ok = getNumBlocksAndThreads(size, sizeof(sobj), numThreads, numBlocks); if ( ok ) { - ret = sumD_gpu_small_old(lat,osites); + ret = sumD_gpu_small(lat,osites); } else { - ret = sumD_gpu_large_old(lat,osites); + ret = sumD_gpu_large(lat,osites); } return ret; } @@ -272,20 +272,20 @@ inline typename vobj::scalar_objectD sumD_gpu_old(const vobj *lat, Integer osite // Return as same precision as input performing reduction in double precision though ///////////////////////////////////////////////////////////////////////////////////////////////////////// template -inline typename vobj::scalar_object sum_gpu_old(const vobj *lat, Integer osites) +inline typename vobj::scalar_object sum_gpu(const vobj *lat, Integer osites) { typedef typename vobj::scalar_object sobj; sobj result; - result = sumD_gpu_old(lat,osites); + result = sumD_gpu(lat,osites); return result; } template -inline typename vobj::scalar_object sum_gpu_large_old(const vobj *lat, Integer osites) +inline typename vobj::scalar_object sum_gpu_large(const vobj *lat, Integer osites) { typedef typename vobj::scalar_object sobj; sobj result; - result = sumD_gpu_large_old(lat,osites); + result = sumD_gpu_large(lat,osites); return result; } diff --git a/Grid/lattice/Lattice_reduction_gpu_cub.h b/Grid/lattice/Lattice_reduction_gpu_cub.h deleted file mode 100644 index eb9348ab..00000000 --- a/Grid/lattice/Lattice_reduction_gpu_cub.h +++ /dev/null @@ -1,361 +0,0 @@ -/************************************************************************************* - Grid physics library, www.github.com/paboyle/Grid - Source file: ./Grid/lattice/Lattice_reduction_gpu_cub.h - Copyright (C) 2015-2024 -Author: Peter Boyle - This program is free software; you can redistribute it and/or modify - it under the terms of the GNU General Public License as published by - the Free Software Foundation; either version 2 of the License, or - (at your option) any later version. - This program is distributed in the hope that it will be useful, - but WITHOUT ANY WARRANTY; without even the implied warranty of - MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the - GNU General Public License for more details. - You should have received a copy of the GNU General Public License along - with this program; if not, write to the Free Software Foundation, Inc., - 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA. - See the full license in the file "LICENSE" in the top level distribution directory - *************************************************************************************/ - /* END LEGAL */ -#pragma once - -#if defined(GRID_CUDA) -#include -#define gpucub cub -#define gpuError_t cudaError_t -#define gpuSuccess cudaSuccess -#elif defined(GRID_HIP) -#include -#define gpucub hipcub -#define gpuError_t hipError_t -#define gpuSuccess hipSuccess -#endif - -NAMESPACE_BEGIN(Grid); - -///////////////////////////////////////////////////////////////////////////////////////////////////////// -// Unified lattice reduction using CUB (CUDA/HIP) and sycl::reduction (SYCL). -// -// CUDA/HIP: one accelerator_for pass per site to extract SIMD lanes and promote to sobjD, -// then CUB/hipCUB DeviceReduce::Reduce over the resulting array. -// -// rocPRIM's DeviceReduce requires warpSize(64) threads per block, each holding one element -// in shared memory: sizeof(T)*64 must fit in sharedMemPerBlock. Large QCD objects such as -// LatticePropagator (sobjD = 2304 bytes, 64*2304 = 147 KB) exceed this budget. -// -// For those types sumD_gpu_large groups the vobj's vector_type words in bundles of 4, -// reducing each bundle as an iVector,4> (64 bytes, 64*64 = 4 KB — always safe). -// Words that do not fill a complete bundle are zero-padded. -// -// SYCL: sycl::reduction handles any type size through the runtime, so one path suffices. -///////////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(GRID_CUDA) || defined(GRID_HIP) - -#define GRID_REDUCTION_TIMING - -// Direct CUB reduction on the full scalar_objectD. -// Only safe when sizeof(sobjD)*64 <= device sharedMemPerBlock. -// Do not call directly for large composite types (e.g. LatticePropagator). -template -inline typename vobj::scalar_objectD sumD_gpu_direct(const vobj *lat, Integer osites) -{ - typedef typename vobj::scalar_object sobj; - typedef typename vobj::scalar_objectD sobjD; - - const Integer nsimd = vobj::Nsimd(); - const Integer nlanes = osites * nsimd; - - deviceVector per_lane(nlanes); - sobjD *per_lane_p = &per_lane[0]; - -#ifdef GRID_REDUCTION_TIMING - RealD t_for = -usecond(); -#endif - accelerator_for(idx, nlanes, 1, { - Integer ss = idx / nsimd; - Integer lane = idx % nsimd; - sobj tmp = extractLane(lane, lat[ss]); - sobjD tmpD; tmpD = tmp; - per_lane_p[idx] = tmpD; - }); -#ifdef GRID_REDUCTION_TIMING - accelerator_barrier(); - t_for += usecond(); -#endif - - sobjD zero; zeroit(zero); - sobjD *d_out = static_cast(acceleratorAllocDevice(sizeof(sobjD))); - void *d_temp = nullptr; - size_t temp_bytes = 0; - - gpuError_t gpuErr; - gpuErr = gpucub::DeviceReduce::Reduce(d_temp, temp_bytes, per_lane_p, d_out, - (int)nlanes, gpucub::Sum(), zero, computeStream); - if (gpuErr != gpuSuccess) { - std::cout << GridLogError << "sumD_gpu_direct: DeviceReduce size query failed: " - << gpuErr << std::endl; - exit(EXIT_FAILURE); - } - - d_temp = acceleratorAllocDevice(temp_bytes); - -#ifdef GRID_REDUCTION_TIMING - RealD t_cub = -usecond(); -#endif - gpuErr = gpucub::DeviceReduce::Reduce(d_temp, temp_bytes, per_lane_p, d_out, - (int)nlanes, gpucub::Sum(), zero, computeStream); - if (gpuErr != gpuSuccess) { - std::cout << GridLogError << "sumD_gpu_direct: DeviceReduce failed: " - << gpuErr << std::endl; - exit(EXIT_FAILURE); - } - - accelerator_barrier(); -#ifdef GRID_REDUCTION_TIMING - t_cub += usecond(); - std::cout << GridLogMessage << "sumD_gpu_direct" - << " sizeof(sobjD)=" << sizeof(sobjD) - << " accelerator_for=" << t_for << " us" - << " CUB_reduce=" << t_cub << " us" << std::endl; -#endif - - sobjD result; - acceleratorCopyFromDevice(d_out, &result, sizeof(sobjD)); - acceleratorFreeDevice(d_temp); - acceleratorFreeDevice(d_out); - return result; -} - -// Radix-4 word-bundle path for types too large for the direct CUB path. -// Treats vobj as words of vector_type; groups them in bundles of 4 and reduces -// each bundle as an iVector,4> — reusing Grid's existing tensor -// type which already has accelerator_inline operator+ and zeroit(). -// sizeof = 4 * sizeof(scalarD) <= 64 bytes; 64 * 64 = 4096 bytes, safely within -// rocPRIM's shared-memory budget on all supported devices. -// If words % 4 != 0, the final partial bundle is zero-padded so all unused -// slots contribute zero to the sum. -template -inline typename vobj::scalar_objectD sumD_gpu_large(const vobj *lat, Integer osites) -{ - typedef typename vobj::vector_type vector; - typedef typename vobj::scalar_typeD scalarD; - typedef typename vobj::scalar_objectD sobjD; - using R4 = iVector, 4>; - - const int words = sizeof(vobj) / sizeof(vector); - const int nfull = words / 4; - const int rem = words % 4; - - sobjD ret; zeroit(ret); - scalarD *ret_p = (scalarD *)&ret; - - iScalar *idat = (iScalar *)lat; - deviceVector buf(osites); - R4 *buf_p = &buf[0]; - - R4 zero4; zeroit(zero4); - - R4 *d_out = static_cast(acceleratorAllocDevice(sizeof(R4))); - void *d_temp = nullptr; - size_t temp_bytes = 0; - - // Probe workspace size once — type R4 and count osites are fixed across all groups. - gpuError_t gpuErr; - gpuErr = gpucub::DeviceReduce::Reduce(d_temp, temp_bytes, buf_p, d_out, - (int)osites, gpucub::Sum(), zero4, computeStream); - if (gpuErr != gpuSuccess) { - std::cout << GridLogError << "sumD_gpu_large: DeviceReduce size query failed: " - << gpuErr << std::endl; - exit(EXIT_FAILURE); - } - d_temp = acceleratorAllocDevice(temp_bytes); - -#ifdef GRID_REDUCTION_TIMING - RealD t_for_large = 0.0, t_cub_large = 0.0; -#endif - - // Full groups of 4 words. - for (int g = 0; g < nfull; g++) { - int base = 4 * g; -#ifdef GRID_REDUCTION_TIMING - t_for_large -= usecond(); -#endif - accelerator_for(ss, osites, 1, { - R4 r4; - r4._internal[0] = TensorRemove(Reduce(idat[ss * words + base ])); - r4._internal[1] = TensorRemove(Reduce(idat[ss * words + base + 1])); - r4._internal[2] = TensorRemove(Reduce(idat[ss * words + base + 2])); - r4._internal[3] = TensorRemove(Reduce(idat[ss * words + base + 3])); - buf_p[ss] = r4; - }); -#ifdef GRID_REDUCTION_TIMING - accelerator_barrier(); - t_for_large += usecond(); - t_cub_large -= usecond(); -#endif - gpuErr = gpucub::DeviceReduce::Reduce(d_temp, temp_bytes, buf_p, d_out, - (int)osites, gpucub::Sum(), zero4, computeStream); - if (gpuErr != gpuSuccess) { - std::cout << GridLogError << "sumD_gpu_large: DeviceReduce failed (group " - << g << "): " << gpuErr << std::endl; - exit(EXIT_FAILURE); - } - accelerator_barrier(); -#ifdef GRID_REDUCTION_TIMING - t_cub_large += usecond(); -#endif - R4 group_result; - acceleratorCopyFromDevice(d_out, &group_result, sizeof(R4)); - ret_p[base ] = TensorRemove(group_result._internal[0]); - ret_p[base + 1] = TensorRemove(group_result._internal[1]); - ret_p[base + 2] = TensorRemove(group_result._internal[2]); - ret_p[base + 3] = TensorRemove(group_result._internal[3]); - } - - // Partial last group: zero-pad unused slots so they contribute nothing to the sum. - if (rem > 0) { - int base = 4 * nfull; -#ifdef GRID_REDUCTION_TIMING - t_for_large -= usecond(); -#endif - accelerator_for(ss, osites, 1, { - R4 r4; zeroit(r4); - for (int k = 0; k < rem; k++) - r4._internal[k] = TensorRemove(Reduce(idat[ss * words + base + k])); - buf_p[ss] = r4; - }); -#ifdef GRID_REDUCTION_TIMING - accelerator_barrier(); - t_for_large += usecond(); - t_cub_large -= usecond(); -#endif - gpuErr = gpucub::DeviceReduce::Reduce(d_temp, temp_bytes, buf_p, d_out, - (int)osites, gpucub::Sum(), zero4, computeStream); - if (gpuErr != gpuSuccess) { - std::cout << GridLogError << "sumD_gpu_large: DeviceReduce failed (partial group): " - << gpuErr << std::endl; - exit(EXIT_FAILURE); - } - accelerator_barrier(); -#ifdef GRID_REDUCTION_TIMING - t_cub_large += usecond(); -#endif - R4 partial_result; - acceleratorCopyFromDevice(d_out, &partial_result, sizeof(R4)); - for (int k = 0; k < rem; k++) - ret_p[4 * nfull + k] = TensorRemove(partial_result._internal[k]); - } - -#ifdef GRID_REDUCTION_TIMING - std::cout << GridLogMessage << "sumD_gpu_large" - << " sizeof(sobjD)=" << sizeof(sobjD) - << " words=" << words << " nfull=" << nfull << " rem=" << rem - << " accelerator_for=" << t_for_large << " us" - << " CUB_reduce=" << t_cub_large << " us" << std::endl; -#endif - - acceleratorFreeDevice(d_temp); - acceleratorFreeDevice(d_out); - return ret; -} - -// Dispatch: direct CUB path for types that fit in the shared-memory budget, -// radix-4 word-bundle path for larger types. -// Threshold 512 bytes: 64 * 512 = 32768 bytes, within rocPRIM's -// ROCPRIM_SHARED_MEMORY_MAX on all supported devices. -template -inline typename vobj::scalar_objectD sumD_gpu(const vobj *lat, Integer osites) -{ - typedef typename vobj::scalar_objectD sobjD; - if constexpr (sizeof(sobjD) > 512) { - return sumD_gpu_large(lat, osites); - } else { - return sumD_gpu_direct(lat, osites); - } -} - -template -inline typename vobj::scalar_objectD sumD_gpu_small(const vobj *lat, Integer osites) -{ - return sumD_gpu(lat, osites); -} - -template -inline typename vobj::scalar_object sum_gpu(const vobj *lat, Integer osites) -{ - typedef typename vobj::scalar_object sobj; - sobj result; - result = sumD_gpu(lat, osites); - return result; -} - -template -inline typename vobj::scalar_object sum_gpu_large(const vobj *lat, Integer osites) -{ - typedef typename vobj::scalar_object sobj; - sobj result; - result = sumD_gpu_large(lat, osites); - return result; -} - -#endif // GRID_CUDA || GRID_HIP - -#if defined(GRID_SYCL) - -// Accumulates in sobjD throughout, fixing the precision bug in the original -// Lattice_reduction_sycl.h which accumulated in sobj then converted at the end. -template -inline typename vobj::scalar_objectD sumD_gpu(const vobj *lat, Integer osites) -{ - typedef typename vobj::scalar_object sobj; - typedef typename vobj::scalar_objectD sobjD; - - sobjD identity; zeroit(identity); - sobjD ret; zeroit(ret); - { - sycl::buffer abuff(&ret, {1}); - theGridAccelerator->submit([&](sycl::handler &cgh) { - auto Reduction = sycl::reduction(abuff, cgh, identity, std::plus<>()); - cgh.parallel_for(sycl::range<1>{(size_t)osites}, - Reduction, - [=](sycl::id<1> item, auto &sum) { - sobj s = Reduce(lat[item[0]]); - sobjD sd; sd = s; - sum += sd; - }); - }); - } - return ret; -} - -template -inline typename vobj::scalar_objectD sumD_gpu_small(const vobj *lat, Integer osites) -{ - return sumD_gpu(lat, osites); -} - -template -inline typename vobj::scalar_objectD sumD_gpu_large(const vobj *lat, Integer osites) -{ - return sumD_gpu(lat, osites); -} - -template -inline typename vobj::scalar_object sum_gpu(const vobj *lat, Integer osites) -{ - typedef typename vobj::scalar_object sobj; - sobj result; - result = sumD_gpu(lat, osites); - return result; -} - -template -inline typename vobj::scalar_object sum_gpu_large(const vobj *lat, Integer osites) -{ - return sum_gpu(lat, osites); -} - -#endif // GRID_SYCL - -NAMESPACE_END(Grid); diff --git a/Grid/lattice/Lattice_reduction_sycl.h b/Grid/lattice/Lattice_reduction_sycl.h index 232cf9e0..ace7f010 100644 --- a/Grid/lattice/Lattice_reduction_sycl.h +++ b/Grid/lattice/Lattice_reduction_sycl.h @@ -6,7 +6,7 @@ NAMESPACE_BEGIN(Grid); template -inline typename vobj::scalar_objectD sumD_gpu_tensor_old(const vobj *lat, Integer osites) +inline typename vobj::scalar_objectD sumD_gpu_tensor(const vobj *lat, Integer osites) { typedef typename vobj::scalar_object sobj; typedef typename vobj::scalar_objectD sobjD; @@ -31,40 +31,40 @@ inline typename vobj::scalar_objectD sumD_gpu_tensor_old(const vobj *lat, Intege } template -inline typename vobj::scalar_objectD sumD_gpu_large_old(const vobj *lat, Integer osites) +inline typename vobj::scalar_objectD sumD_gpu_large(const vobj *lat, Integer osites) { - return sumD_gpu_tensor_old(lat,osites); + return sumD_gpu_tensor(lat,osites); } template -inline typename vobj::scalar_objectD sumD_gpu_small_old(const vobj *lat, Integer osites) +inline typename vobj::scalar_objectD sumD_gpu_small(const vobj *lat, Integer osites) { - return sumD_gpu_large_old(lat,osites); + return sumD_gpu_large(lat,osites); } template -inline typename vobj::scalar_objectD sumD_gpu_old(const vobj *lat, Integer osites) +inline typename vobj::scalar_objectD sumD_gpu(const vobj *lat, Integer osites) { - return sumD_gpu_large_old(lat,osites); + return sumD_gpu_large(lat,osites); } ///////////////////////////////////////////////////////////////////////////////////////////////////////// // Return as same precision as input performing reduction in double precision though ///////////////////////////////////////////////////////////////////////////////////////////////////////// template -inline typename vobj::scalar_object sum_gpu_old(const vobj *lat, Integer osites) +inline typename vobj::scalar_object sum_gpu(const vobj *lat, Integer osites) { typedef typename vobj::scalar_object sobj; sobj result; - result = sumD_gpu_old(lat,osites); + result = sumD_gpu(lat,osites); return result; } template -inline typename vobj::scalar_object sum_gpu_large_old(const vobj *lat, Integer osites) +inline typename vobj::scalar_object sum_gpu_large(const vobj *lat, Integer osites) { typedef typename vobj::scalar_object sobj; sobj result; - result = sumD_gpu_large_old(lat,osites); + result = sumD_gpu_large(lat,osites); return result; } diff --git a/tests/debug/Test_reduction.cc b/tests/debug/Test_reduction.cc index da0905d2..387a8987 100644 --- a/tests/debug/Test_reduction.cc +++ b/tests/debug/Test_reduction.cc @@ -73,36 +73,7 @@ void testReduction(GridCartesian *grid, GridParallelRNG &rng, Field field(grid); //-------------------------------------------------------------------- - // a) Gaussian random field: sum_gpu (new CUB path) vs sum_gpu_old - // (preserved hand-rolled shared-memory path). Both promote lanes - // to double internally, so results should agree to near-roundoff. - //-------------------------------------------------------------------- -#if defined(GRID_CUDA) || defined(GRID_HIP) || defined(GRID_SYCL) - { - gaussian(rng, field); - - autoView(v, field, AcceleratorRead); - sobj new_result = sum_gpu (&v[0], osites); - sobj old_result = sum_gpu_old(&v[0], osites); - - sobj diff = new_result - old_result; - RealD diffn = squaredSum(diff); - RealD refn = squaredSum(old_result); - RealD reldiff = (refn > 0.0) ? std::sqrt(diffn / refn) : std::sqrt(diffn); - - // Float fields: both paths cast from double to float, expect O(eps_float). - // Double fields: ordering differences at most O(V * eps_double). - RealD tol = isFloat ? 1e-6 : 1e-10; - - std::cout << GridLogMessage - << name << " random reldiff = " << reldiff << std::endl; - check(reldiff < tol, name + " random: sum_gpu agrees with sum_gpu_old"); - } -#endif - - //-------------------------------------------------------------------- - // b) Timing: new (CUB/sycl::reduction) vs old (hand-rolled) path. - // Warmup first, then Niter timed calls; report us/call and GB/s. + // a) Timing: Niter timed calls reporting us/call and GB/s. //-------------------------------------------------------------------- #if defined(GRID_CUDA) || defined(GRID_HIP) || defined(GRID_SYCL) { @@ -113,38 +84,28 @@ void testReduction(GridCartesian *grid, GridParallelRNG &rng, { autoView(v, field, AcceleratorRead); - for (int i = 0; i < Nwarm; i++) sum_gpu (&v[0], osites); - for (int i = 0; i < Nwarm; i++) sum_gpu_old(&v[0], osites); + for (int i = 0; i < Nwarm; i++) sum_gpu(&v[0], osites); } - RealD t_new, t_old; + RealD t_new; { autoView(v, field, AcceleratorRead); t_new = -usecond(); for (int i = 0; i < Niter; i++) sum_gpu(&v[0], osites); t_new += usecond(); } - { - autoView(v, field, AcceleratorRead); - t_old = -usecond(); - for (int i = 0; i < Niter; i++) sum_gpu_old(&v[0], osites); - t_old += usecond(); - } - RealD bytes = (RealD)osites * sizeof(vobj); - RealD GBs_new = bytes / (t_new / Niter) * 1e-3; - RealD GBs_old = bytes / (t_old / Niter) * 1e-3; + RealD bytes = (RealD)osites * sizeof(vobj); + RealD GBs = bytes / (t_new / Niter) * 1e-3; std::cout << GridLogMessage << name << " timing (" << Niter << " calls):" << std::endl; std::cout << GridLogMessage - << " sum_gpu " << t_new/Niter << " us " << GBs_new << " GB/s" << std::endl; - std::cout << GridLogMessage - << " sum_gpu_old " << t_old/Niter << " us " << GBs_old << " GB/s" << std::endl; + << " sum_gpu " << t_new/Niter << " us " << GBs << " GB/s" << std::endl; } #endif //-------------------------------------------------------------------- - // d) Constant field via field = 1.0. + // b) Constant field via field = 1.0. // // Grid's iMatrix::operator=(scalar) sets only the diagonal, so: // LatticeComplex -> scalar 1.0 (Ncomp = 1 nonzero per site) From 0650d7c7eb3075ea64241269b6fef53f08d86971 Mon Sep 17 00:00:00 2001 From: Peter Boyle Date: Mon, 18 May 2026 21:53:40 -0400 Subject: [PATCH 16/44] Lattice_reduction_sycl: fix double-precision accumulation in sumD_gpu_tensor Accumulate in sobjD throughout rather than accumulating in sobj and converting the final sum. For float fields this matters: summing N floats then casting loses O(N*eps_float) relative precision vs accumulating in double from the start. Co-Authored-By: Claude Sonnet 4.6 --- Grid/lattice/Lattice_reduction_sycl.h | 31 +++++++++++++-------------- 1 file changed, 15 insertions(+), 16 deletions(-) diff --git a/Grid/lattice/Lattice_reduction_sycl.h b/Grid/lattice/Lattice_reduction_sycl.h index ace7f010..84e7a09a 100644 --- a/Grid/lattice/Lattice_reduction_sycl.h +++ b/Grid/lattice/Lattice_reduction_sycl.h @@ -6,28 +6,27 @@ NAMESPACE_BEGIN(Grid); template -inline typename vobj::scalar_objectD sumD_gpu_tensor(const vobj *lat, Integer osites) +inline typename vobj::scalar_objectD sumD_gpu_tensor(const vobj *lat, Integer osites) { - typedef typename vobj::scalar_object sobj; + typedef typename vobj::scalar_object sobj; typedef typename vobj::scalar_objectD sobjD; - sobj identity; zeroit(identity); - sobj ret; zeroit(ret); - Integer nsimd= vobj::Nsimd(); - { - sycl::buffer abuff(&ret, {1}); + sobjD identity; zeroit(identity); + sobjD ret; zeroit(ret); + { + sycl::buffer abuff(&ret, {1}); theGridAccelerator->submit([&](sycl::handler &cgh) { - auto Reduction = sycl::reduction(abuff,cgh,identity,std::plus<>()); - cgh.parallel_for(sycl::range<1>{osites}, - Reduction, - [=] (sycl::id<1> item, auto &sum) { - auto osite = item[0]; - sum +=Reduce(lat[osite]); - }); + auto Reduction = sycl::reduction(abuff, cgh, identity, std::plus<>()); + cgh.parallel_for(sycl::range<1>{(size_t)osites}, + Reduction, + [=](sycl::id<1> item, auto &sum) { + sobj s = Reduce(lat[item[0]]); + sobjD sd; sd = s; + sum += sd; + }); }); } - sobjD dret; convertType(dret,ret); - return dret; + return ret; } template From 26c3c7d8f9f8c30ed6b9e97ac57a3171847f2952 Mon Sep 17 00:00:00 2001 From: Peter Boyle Date: Mon, 18 May 2026 21:56:45 -0400 Subject: [PATCH 17/44] sumD_gpu_large: radix-12 word-bundle reduction replacing radix-1 Replace the word-by-word loop (one kernel launch per scalar word) with sumD_gpu_reduce_words which packs R consecutive vector_type words per site into iVector,R>, then calls the existing sumD_gpu_small shared-memory kernel once for the whole bundle. Dispatch: radix-12 first, radix-4 for the remainder < 12, radix-1 for any final < 4 words. For LatticePropagator (144 words = 12x12), this reduces the kernel-launch count from 144 to 12 -- a 12x reduction. Bundle::Nsimd() inherits from vector_type so sumD_gpu_small handles SIMD lane extraction and double-precision promotion identically to the scalar word case. sizeof(Bundle::scalar_objectD) = R*16 <= 192 B; well within sharedMemPerBlock on all supported devices. Co-Authored-By: Claude Sonnet 4.6 --- Grid/lattice/Lattice_reduction_gpu.h | 58 ++++++++++++++++++++-------- 1 file changed, 41 insertions(+), 17 deletions(-) diff --git a/Grid/lattice/Lattice_reduction_gpu.h b/Grid/lattice/Lattice_reduction_gpu.h index 849f5309..96056671 100644 --- a/Grid/lattice/Lattice_reduction_gpu.h +++ b/Grid/lattice/Lattice_reduction_gpu.h @@ -223,29 +223,53 @@ inline typename vobj::scalar_objectD sumD_gpu_small(const vobj *lat, Integer osi return result; } +// Pack R consecutive vector_type words of lat[0..osites-1] starting at word +// 'base' into a Bundle = iVector,R> per site, then reduce +// with sumD_gpu_small. Bundle::Nsimd() == vector::Nsimd(), so the existing +// shared-memory kernel handles SIMD-lane extraction and double-promotion +// correctly. sizeof(Bundle::scalar_objectD) = R*sizeof(scalarD) <= 192 B +// for R<=12, safely within sharedMemPerBlock on all supported devices. +template +inline void sumD_gpu_reduce_words(const vobj *lat, Integer osites, + typename vobj::scalar_typeD *ret_p, int base) +{ + typedef typename vobj::vector_type vector; + using Bundle = iVector, R>; + + const int words = sizeof(vobj) / sizeof(vector); + iScalar *idat = (iScalar *)lat; + + deviceVector buf(osites); + Bundle *buf_p = &buf[0]; + + accelerator_for(ss, osites, 1, { + Bundle b; + for (int k = 0; k < R; k++) + b._internal[k] = idat[ss * words + base + k]; + buf_p[ss] = b; + }); + + auto sum_bundle = sumD_gpu_small(buf_p, osites); + for (int k = 0; k < R; k++) + ret_p[base + k] = TensorRemove(sum_bundle._internal[k]); +} + template inline typename vobj::scalar_objectD sumD_gpu_large(const vobj *lat, Integer osites) { - typedef typename vobj::vector_type vector; - typedef typename vobj::scalar_typeD scalarD; - typedef typename vobj::scalar_objectD sobj; - sobj ret; + typedef typename vobj::vector_type vector; + typedef typename vobj::scalar_typeD scalarD; + typedef typename vobj::scalar_objectD sobjD; + + const int words = sizeof(vobj) / sizeof(vector); + sobjD ret; zeroit(ret); scalarD *ret_p = (scalarD *)&ret; - - const int words = sizeof(vobj)/sizeof(vector); - deviceVector buffer(osites); - vector *dat = (vector *)lat; - vector *buf = &buffer[0]; - iScalar *tbuf =(iScalar *) &buffer[0]; - for(int w=0;w(lat, osites, ret_p, w); w += 12; } + while (w + 4 <= words) { sumD_gpu_reduce_words< 4>(lat, osites, ret_p, w); w += 4; } + while (w < words) { sumD_gpu_reduce_words< 1>(lat, osites, ret_p, w); w += 1; } - accelerator_for(ss,osites,1,{ - buf[ss] = dat[ss*words+w]; - }); - - ret_p[w] = sumD_gpu_small(tbuf,osites); - } return ret; } From a31af313289dcbef91af43820517e8ad89bc78c3 Mon Sep 17 00:00:00 2001 From: Peter Boyle Date: Mon, 18 May 2026 22:13:30 -0400 Subject: [PATCH 18/44] Lattice_reduction_gpu: add GRID_REDUCTION_TIMING instrumentation Uncomment #define GRID_REDUCTION_TIMING to enable per-phase timing output: sumD_gpu_reduce_words: pack time (accelerator_for) per R and base sumD_gpu_small: reduceKernel+barrier time and D2H time separately sumD_gpu_large: total wall time across all word groups This lets us identify whether the large-type bottleneck is in the pack kernel, the shared-memory reduction kernel, the barrier, or the D2H. Co-Authored-By: Claude Sonnet 4.6 --- Grid/lattice/Lattice_reduction_gpu.h | 44 ++++++++++++++++++++++++---- 1 file changed, 39 insertions(+), 5 deletions(-) diff --git a/Grid/lattice/Lattice_reduction_gpu.h b/Grid/lattice/Lattice_reduction_gpu.h index 96056671..84ef0a1a 100644 --- a/Grid/lattice/Lattice_reduction_gpu.h +++ b/Grid/lattice/Lattice_reduction_gpu.h @@ -197,12 +197,16 @@ __global__ void reduceKernel(const vobj *lat, sobj *buffer, Iterator n) { ///////////////////////////////////////////////////////////////////////////////////////////////////////// // Possibly promote to double and sum ///////////////////////////////////////////////////////////////////////////////////////////////////////// + +// Uncomment to print per-phase timing for every sumD_gpu_small and sumD_gpu_large call. +// #define GRID_REDUCTION_TIMING + template -inline typename vobj::scalar_objectD sumD_gpu_small(const vobj *lat, Integer osites) +inline typename vobj::scalar_objectD sumD_gpu_small(const vobj *lat, Integer osites) { typedef typename vobj::scalar_objectD sobj; typedef decltype(lat) Iterator; - + Integer nsimd= vobj::Nsimd(); Integer size = osites*nsimd; @@ -211,15 +215,28 @@ inline typename vobj::scalar_objectD sumD_gpu_small(const vobj *lat, Integer osi GRID_ASSERT(ok); Integer smemSize = numThreads * sizeof(sobj); - // Move out of UVM - // Turns out I had messed up the synchronise after move to compute stream - // as running this on the default stream fools the synchronise deviceVector buffer(numBlocks); sobj *buffer_v = &buffer[0]; sobj result; + +#ifdef GRID_REDUCTION_TIMING + RealD t_kernel = -usecond(); +#endif reduceKernel<<< numBlocks, numThreads, smemSize, computeStream >>>(lat, buffer_v, size); accelerator_barrier(); +#ifdef GRID_REDUCTION_TIMING + t_kernel += usecond(); + RealD t_d2h = -usecond(); +#endif acceleratorCopyFromDevice(buffer_v,&result,sizeof(result)); +#ifdef GRID_REDUCTION_TIMING + t_d2h += usecond(); + std::cout << GridLogMessage << " sumD_gpu_small" + << " sizeof(sobj)=" << sizeof(sobj) + << " blocks=" << numBlocks << " threads=" << numThreads + << " kernel+barrier=" << t_kernel << " us" + << " D2H=" << t_d2h << " us" << std::endl; +#endif return result; } @@ -242,12 +259,20 @@ inline void sumD_gpu_reduce_words(const vobj *lat, Integer osites, deviceVector buf(osites); Bundle *buf_p = &buf[0]; +#ifdef GRID_REDUCTION_TIMING + RealD t_pack = -usecond(); +#endif accelerator_for(ss, osites, 1, { Bundle b; for (int k = 0; k < R; k++) b._internal[k] = idat[ss * words + base + k]; buf_p[ss] = b; }); +#ifdef GRID_REDUCTION_TIMING + t_pack += usecond(); + std::cout << GridLogMessage << " sumD_gpu_reduce_words R=" << R + << " base=" << base << " pack=" << t_pack << " us" << std::endl; +#endif auto sum_bundle = sumD_gpu_small(buf_p, osites); for (int k = 0; k < R; k++) @@ -265,10 +290,19 @@ inline typename vobj::scalar_objectD sumD_gpu_large(const vobj *lat, Integer osi sobjD ret; zeroit(ret); scalarD *ret_p = (scalarD *)&ret; +#ifdef GRID_REDUCTION_TIMING + RealD t_large = -usecond(); +#endif int w = 0; while (w + 12 <= words) { sumD_gpu_reduce_words<12>(lat, osites, ret_p, w); w += 12; } while (w + 4 <= words) { sumD_gpu_reduce_words< 4>(lat, osites, ret_p, w); w += 4; } while (w < words) { sumD_gpu_reduce_words< 1>(lat, osites, ret_p, w); w += 1; } +#ifdef GRID_REDUCTION_TIMING + t_large += usecond(); + std::cout << GridLogMessage << "sumD_gpu_large" + << " sizeof(sobjD)=" << sizeof(sobjD) + << " words=" << words << " total=" << t_large << " us" << std::endl; +#endif return ret; } From 1315d4604d0a83aef0d4da8e8cf660e2cdd3a84e Mon Sep 17 00:00:00 2001 From: Peter Boyle Date: Mon, 18 May 2026 22:14:00 -0400 Subject: [PATCH 19/44] Enable GRID_REDUCTION_TIMING unconditionally Co-Authored-By: Claude Sonnet 4.6 --- Grid/lattice/Lattice_reduction_gpu.h | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/Grid/lattice/Lattice_reduction_gpu.h b/Grid/lattice/Lattice_reduction_gpu.h index 84ef0a1a..5cc8636f 100644 --- a/Grid/lattice/Lattice_reduction_gpu.h +++ b/Grid/lattice/Lattice_reduction_gpu.h @@ -198,8 +198,7 @@ __global__ void reduceKernel(const vobj *lat, sobj *buffer, Iterator n) { // Possibly promote to double and sum ///////////////////////////////////////////////////////////////////////////////////////////////////////// -// Uncomment to print per-phase timing for every sumD_gpu_small and sumD_gpu_large call. -// #define GRID_REDUCTION_TIMING +#define GRID_REDUCTION_TIMING template inline typename vobj::scalar_objectD sumD_gpu_small(const vobj *lat, Integer osites) From 1304172a9364ebff64f39604ef00fa0c5b5cbdfb Mon Sep 17 00:00:00 2001 From: Peter Boyle Date: Tue, 19 May 2026 08:53:13 -0400 Subject: [PATCH 20/44] Modified repack --- Grid/lattice/Lattice_reduction_gpu.h | 8 +++----- 1 file changed, 3 insertions(+), 5 deletions(-) diff --git a/Grid/lattice/Lattice_reduction_gpu.h b/Grid/lattice/Lattice_reduction_gpu.h index 5cc8636f..7fa80bcd 100644 --- a/Grid/lattice/Lattice_reduction_gpu.h +++ b/Grid/lattice/Lattice_reduction_gpu.h @@ -261,11 +261,9 @@ inline void sumD_gpu_reduce_words(const vobj *lat, Integer osites, #ifdef GRID_REDUCTION_TIMING RealD t_pack = -usecond(); #endif - accelerator_for(ss, osites, 1, { - Bundle b; - for (int k = 0; k < R; k++) - b._internal[k] = idat[ss * words + base + k]; - buf_p[ss] = b; + constexpr int Nsimd = vobj::Nsimd(); + accelerator_for2d(k, R, ss, osites, Nsimd, { + coalescedWrite(buf_p[ss]._internal[k], coalescedRead(idat[ss * words + base + k])); }); #ifdef GRID_REDUCTION_TIMING t_pack += usecond(); From 66b529b345d253ba464b756ffc77d22676f1fc82 Mon Sep 17 00:00:00 2001 From: Peter Boyle Date: Tue, 19 May 2026 09:46:43 -0400 Subject: [PATCH 21/44] sumD_gpu_reduce_words: fuse pack+reduce into single packReduceKernel Replace the two-kernel pack+reduce sequence with a single fused kernel packReduceKernel that reads R words of each vobj at offset 'base' and accumulates directly into iVector,R>, eliminating the intermediate bundle buffer entirely. HBM access per word-group drops from 3x (pack-read + pack-write + reduce-read) to 1x. Thread count comes from getNumBlocksAndThreads (warpSize..256) rather than acceleratorThreads(), so occupancy is correct regardless of the --accelerator-threads setting. Co-Authored-By: Claude Sonnet 4.6 --- Grid/lattice/Lattice_reduction_gpu.h | 143 ++++++++++++++++++++++----- 1 file changed, 119 insertions(+), 24 deletions(-) diff --git a/Grid/lattice/Lattice_reduction_gpu.h b/Grid/lattice/Lattice_reduction_gpu.h index 7fa80bcd..da5eb90f 100644 --- a/Grid/lattice/Lattice_reduction_gpu.h +++ b/Grid/lattice/Lattice_reduction_gpu.h @@ -239,41 +239,136 @@ inline typename vobj::scalar_objectD sumD_gpu_small(const vobj *lat, Integer osi return result; } -// Pack R consecutive vector_type words of lat[0..osites-1] starting at word -// 'base' into a Bundle = iVector,R> per site, then reduce -// with sumD_gpu_small. Bundle::Nsimd() == vector::Nsimd(), so the existing -// shared-memory kernel handles SIMD-lane extraction and double-promotion -// correctly. sizeof(Bundle::scalar_objectD) = R*sizeof(scalarD) <= 192 B -// for R<=12, safely within sharedMemPerBlock on all supported devices. +// Fused pack+reduce: reads R words of each vobj at word offset 'base', +// accumulates directly into iVector,R> without staging +// through an intermediate bundle buffer. One HBM pass instead of three. +template +__device__ void packReduceBlocks( + const iScalar *idat, + sobj *g_odata, Iterator osites, int base, int words) +{ + constexpr Iterator nsimd = vobj::Nsimd(); + Iterator blockSize = blockDim.x; + + extern __shared__ __align__(COALESCE_GRANULARITY) unsigned char shmem_pointer[]; + sobj *sdata = (sobj *)shmem_pointer; + + Iterator tid = threadIdx.x; + Iterator i = blockIdx.x * (blockSize * 2) + threadIdx.x; + Iterator gridSize = blockSize * 2 * gridDim.x; + sobj mySum = Zero(); + + while (i < osites * nsimd) { + Iterator lane = i % nsimd; + Iterator ss = i / nsimd; + sobj tmpD; zeroit(tmpD); + for (int k = 0; k < R; k++) { + auto w = extractLane(lane, idat[ss * words + base + k]); + iScalar wd; wd = w; + tmpD._internal[k] = wd; + } + mySum += tmpD; + + if (i + blockSize < osites * nsimd) { + lane = (i + blockSize) % nsimd; + ss = (i + blockSize) / nsimd; + sobj tmpD2; zeroit(tmpD2); + for (int k = 0; k < R; k++) { + auto w = extractLane(lane, idat[ss * words + base + k]); + iScalar wd; wd = w; + tmpD2._internal[k] = wd; + } + mySum += tmpD2; + } + i += gridSize; + } + + reduceBlock(sdata, mySum, tid); + if (tid == 0) g_odata[blockIdx.x] = sdata[0]; +} + +template +__global__ void packReduceKernel( + const iScalar *idat, + sobj *buffer, Iterator osites, int base, int words) +{ + Iterator blockSize = blockDim.x; + + packReduceBlocks(idat, buffer, osites, base, words); + + if (gridDim.x > 1) { + const Iterator tid = threadIdx.x; + __shared__ bool amLast; + extern __shared__ __align__(COALESCE_GRANULARITY) unsigned char shmem_pointer[]; + sobj *smem = (sobj *)shmem_pointer; + + acceleratorFence(); + + if (tid == 0) { + unsigned int ticket = atomicInc(&retirementCount, gridDim.x); + amLast = (ticket == gridDim.x - 1); + } + acceleratorSynchroniseAll(); + + if (amLast) { + Iterator i = tid; + sobj mySum = Zero(); + while (i < (Iterator)gridDim.x) { + mySum += buffer[i]; + i += blockSize; + } + reduceBlock(smem, mySum, tid); + if (tid == 0) { + buffer[0] = smem[0]; + retirementCount = 0; + } + } + } +} + template inline void sumD_gpu_reduce_words(const vobj *lat, Integer osites, typename vobj::scalar_typeD *ret_p, int base) { - typedef typename vobj::vector_type vector; - using Bundle = iVector, R>; + typedef typename vobj::vector_type vector; + typedef typename vobj::scalar_typeD scalarD; + using BundleScalarD = iVector, R>; - const int words = sizeof(vobj) / sizeof(vector); - iScalar *idat = (iScalar *)lat; - - deviceVector buf(osites); - Bundle *buf_p = &buf[0]; - -#ifdef GRID_REDUCTION_TIMING - RealD t_pack = -usecond(); -#endif constexpr int Nsimd = vobj::Nsimd(); - accelerator_for2d(k, R, ss, osites, Nsimd, { - coalescedWrite(buf_p[ss]._internal[k], coalescedRead(idat[ss * words + base + k])); - }); + const int words = sizeof(vobj) / sizeof(vector); + const iScalar *idat = (const iScalar *)lat; + + Integer size = (Integer)osites * Nsimd; + Integer numThreads, numBlocks; + int ok = getNumBlocksAndThreads(size, sizeof(BundleScalarD), numThreads, numBlocks); + GRID_ASSERT(ok); + + Integer smemSize = numThreads * sizeof(BundleScalarD); + deviceVector buffer(numBlocks); + BundleScalarD *buffer_v = &buffer[0]; + BundleScalarD result; + #ifdef GRID_REDUCTION_TIMING - t_pack += usecond(); + RealD t_kernel = -usecond(); +#endif + packReduceKernel + <<>> + (idat, buffer_v, osites, base, words); + accelerator_barrier(); +#ifdef GRID_REDUCTION_TIMING + t_kernel += usecond(); + RealD t_d2h = -usecond(); +#endif + acceleratorCopyFromDevice(buffer_v, &result, sizeof(result)); +#ifdef GRID_REDUCTION_TIMING + t_d2h += usecond(); std::cout << GridLogMessage << " sumD_gpu_reduce_words R=" << R - << " base=" << base << " pack=" << t_pack << " us" << std::endl; + << " base=" << base + << " kernel=" << t_kernel << " D2H=" << t_d2h << " us" << std::endl; #endif - auto sum_bundle = sumD_gpu_small(buf_p, osites); for (int k = 0; k < R; k++) - ret_p[base + k] = TensorRemove(sum_bundle._internal[k]); + ret_p[base + k] = TensorRemove(result._internal[k]); } template From 60df2dd5d0b90a230685639560f6cef7f0d1f687 Mon Sep 17 00:00:00 2001 From: Peter Boyle Date: Tue, 19 May 2026 10:03:32 -0400 Subject: [PATCH 22/44] skills: add gpu-memory-performance.md MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Documents the acceleratorThreads() default=2 trap, LambdaApply thread mapping, coalescedRead/Write idiom, when to use __global__ vs accelerator_for, and fused vs staged HBM access patterns. Includes observed MI250X numbers from LatticePropagatorD reduction (50 → 297 → 546 GB/s progression). Co-Authored-By: Claude Sonnet 4.6 --- skills/gpu-memory-performance.md | 181 +++++++++++++++++++++++++++++++ 1 file changed, 181 insertions(+) create mode 100644 skills/gpu-memory-performance.md diff --git a/skills/gpu-memory-performance.md b/skills/gpu-memory-performance.md new file mode 100644 index 00000000..276cd77a --- /dev/null +++ b/skills/gpu-memory-performance.md @@ -0,0 +1,181 @@ +--- +name: gpu-memory-performance +description: Diagnose and fix GPU memory bandwidth and occupancy problems in Grid HPC kernels — acceleratorThreads() pitfalls, LambdaApply thread mapping, coalescedRead/Write idiom, when to use accelerator_for vs a hand-rolled __global__ kernel, and fused vs staged HBM access patterns. +user-invocable: true +allowed-tools: + - Read + - Bash(grep -r) +--- + +# GPU Memory Performance in Grid + +## The acceleratorThreads() Trap + +`acceleratorThreads()` is a runtime-settable global (default **2**) that controls the `blockDim.y` of every `accelerator_for` launch. It is NOT the SIMD width — it is the number of sites processed per block in the y-dimension. + +```cpp +// Grid/threads/Accelerator.cc +uint32_t accelerator_threads = 2; // <-- default +``` + +With `accelerator_for(ss, osites, nsimd, ...)`, the launch is: + +``` +dim3 threads(nsimd, acceleratorThreads(), 1) +dim3 blocks ((osites + acceleratorThreads() - 1) / acceleratorThreads(), 1, 1) +``` + +For `nsimd=1` and the default `acceleratorThreads()=2`: +- **2 threads per block** on a 64-thread AMD wavefront → **3% occupancy** +- Expected bandwidth ≈ peak × 3% ≈ 50 GB/s on MI250X + +**Diagnostic**: observed bandwidth << peak, kernel time >> expected from data volume. Check with `--accelerator-threads 16` or `--accelerator-threads 32` at runtime. A large speedup confirms occupancy starvation. + +**Fix options** (in order of preference): +1. Kernel needs its own thread count — use `getNumBlocksAndThreads` and launch a `__global__` kernel directly (see below). +2. Temporarily acceptable: set `--accelerator-threads 16` or 32 at the application level. Note this affects every `accelerator_for` site in the binary. + +## LambdaApply Thread Mapping + +`accelerator_for` and `accelerator_for2d` go through `LambdaApply`: + +```cpp +// HIP/CUDA LambdaApply kernel: +uint64_t x = threadIdx.y + blockDim.y * blockIdx.x; // iter1 (site index) +uint64_t y = threadIdx.z + blockDim.z * blockIdx.y; // iter2 +uint64_t z = threadIdx.x; // lane (SIMD lane) +Lambda(x, y, z); +``` + +`threadIdx.x` is the **fast** (lane) dimension — consecutive thread IDs within a warp/wavefront correspond to consecutive lane values on the **same** site, not consecutive sites. + +Consequence: for coalesced access from a `vobj` array (AoS layout, stride = `sizeof(vobj)` between adjacent sites), adjacent threads in a wavefront address the **same** site at different lanes, not adjacent sites. With `Nsimd=1` (GPU scalar build), `threadIdx.x` is always 0 and provides no coalescing benefit at all. + +## coalescedRead / coalescedWrite + +These are Grid's canonical way to read/write one SIMD lane from a vector type inside a `GRID_SIMT` kernel: + +```cpp +// accelerator_for(ss, osites, Nsimd, { +// lane = acceleratorSIMTlane(Nsimd) = threadIdx.x +auto scalar_val = coalescedRead(field[ss]); // extractLane(lane, field[ss]) +coalescedWrite(field[ss], scalar_val); // insertLane(lane, field[ss], scalar_val) +``` + +For `vobj` aggregate types, `coalescedRead` calls `extractLane(lane, vobj)` which recurses through the tensor hierarchy and returns `vobj::scalar_object`. + +For `vsimd` (raw SIMD vector) types, it casts to `scalar_type*` and indexes with `lane`. + +**When Nsimd=1** (GPU scalar build): `lane=0` always, so `coalescedRead`/`coalescedWrite` are effectively no-ops (direct read/write). Coalescing must be achieved through the iteration structure instead. + +## Coalescing the Iteration Structure + +For an AoS input array where each site is `words` 16-byte elements, adjacent threads reading the same site's consecutive words achieve coalesced access: + +```cpp +// Good: k varies across threads in a block → consecutive 16-byte reads +accelerator_for2d(k, R, ss, osites, Nsimd, { + coalescedWrite(out[ss]._internal[k], + coalescedRead(idat[ss * words + base + k])); +}); +// dim3(Nsimd, nt, 1): threadIdx.y = k (consecutive words, coalesced) +// threadIdx.x = lane (SIMD sub-lane, coalesced for Nsimd>1) +``` + +```cpp +// Bad: each thread reads all R words of its site serially +accelerator_for(ss, osites, 1, { + Bundle b; + for (int k = 0; k < R; k++) + b._internal[k] = idat[ss * words + base + k]; // serial, not coalesced across threads + out[ss] = b; // bulk struct write +}); +``` + +The bad pattern also accumulates a large struct in registers (192 bytes for R=12), increasing register pressure and reducing occupancy further. + +## When to Use a __global__ Kernel Instead of accelerator_for + +`accelerator_for` is correct for site-parallel work where `acceleratorThreads()` is tuned appropriately. Use a direct `__global__` kernel when: + +- The kernel requires a **specific thread count** for correctness or performance (reductions, shared-memory algorithms). +- The optimal thread count depends on `sizeof(sobj)` and `sharedMemPerBlock`, not on a runtime global. +- You need the retirement-count pattern for cross-block final reduction. + +Pattern: use `getNumBlocksAndThreads` to pick `numThreads` and `numBlocks`: + +```cpp +Integer numThreads, numBlocks; +int ok = getNumBlocksAndThreads(n, sizeof(sobj), numThreads, numBlocks); +// starts at warpSize (32/64), doubles while 2*threads*sizeof(sobj) < sharedMemPerBlock +// gives 64–256 threads/block → near-100% wavefront occupancy +Integer smemSize = numThreads * sizeof(sobj); +myKernel<<>>(args...); +``` + +This gives 64–256 threads/block regardless of `acceleratorThreads()`. Grid's `reduceKernel` uses this pattern and achieves ~400 GB/s on MI250X. + +## Fused vs Staged HBM Access + +A staged pack+reduce reads the data **three times**: + +``` +pack kernel: reads vobj array (N bytes), writes bundle buffer (N bytes) +reduce kernel: reads bundle buffer (N bytes), writes tiny result buffer +``` + +Total HBM: 3N bytes for N bytes of useful input. + +A fused kernel reads the data **once**: + +``` +packReduceKernel: reads R words of vobj array (N bytes), reduces in-place +``` + +Total HBM: N bytes. Register pressure increases (R words held per thread) but the 3× HBM saving dominates for large objects. + +The fused pattern in Grid's `sumD_gpu_reduce_words`: + +```cpp +template +__device__ void packReduceBlocks( + const iScalar *idat, + sobj *g_odata, Iterator osites, int base, int words) +{ + // sobj = iVector, R> (R double-precision scalars per site) + constexpr Iterator nsimd = vobj::Nsimd(); + ... + while (i < osites * nsimd) { + Iterator lane = i % nsimd; + Iterator ss = i / nsimd; + sobj tmpD; zeroit(tmpD); + for (int k = 0; k < R; k++) { + auto w = extractLane(lane, idat[ss * words + base + k]); + iScalar wd; wd = w; // float→double promotion + tmpD._internal[k] = wd; + } + mySum += tmpD; + ... + } + reduceBlock(sdata, mySum, tid); +} +``` + +Launched with `getNumBlocksAndThreads` → 128–256 threads/block → correct occupancy without depending on `acceleratorThreads()`. + +## Observed Numbers on MI250X (32^4 LatticePropagatorD, Nsimd=1) + +| Configuration | pack µs/group | reduce µs/group | total µs | GB/s | +|---|---|---|---|---| +| acceleratorThreads=2, staged | 10,080 | 470 | 126,909 | 50 | +| acceleratorThreads=16, staged | 342 | 310 | 8,251 | 297 | +| acceleratorThreads=16, fused | — | 349 | 4,584 | 546 | + +The fused kernel at 349 µs/group reads 201 MB at 576 GB/s — 36% of MI250X HBM peak. The remaining gap from peak is the in-kernel serial loop over R=12 words and the 12 serial kernel launches. + +## Quick Checklist When a Kernel Is Slow + +1. Check threads per block: `accelerator_for(ss, N, 1, ...)` with default `acceleratorThreads()=2` = 2 threads/block = 3% occupancy on AMD. Try `--accelerator-threads 16` at runtime; if it helps a lot, occupancy is the problem. +2. Check for bulk struct accumulation in registers (`Bundle b; for(...) b._internal[k] = ...;`). Replace with per-element writes via `coalescedWrite`. +3. Check for staged HBM access (pack → buffer → reduce). Count the passes; fuse if ≥ 2 passes over the same data. +4. For reduction kernels, always use `getNumBlocksAndThreads` rather than `accelerator_for` so thread count is independent of `acceleratorThreads()`. From 2fadd8bb62e7b041fa9035e78d05c00c03b7ea8d Mon Sep 17 00:00:00 2001 From: Peter Boyle Date: Tue, 19 May 2026 10:15:53 -0400 Subject: [PATCH 23/44] Accelerator: raise default accelerator_threads from 2 to 16 --- Grid/threads/Accelerator.cc | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/Grid/threads/Accelerator.cc b/Grid/threads/Accelerator.cc index aa39f88c..f0f33705 100644 --- a/Grid/threads/Accelerator.cc +++ b/Grid/threads/Accelerator.cc @@ -3,7 +3,7 @@ NAMESPACE_BEGIN(Grid); int world_rank; // Use to control world rank for print guarding int acceleratorAbortOnGpuError=1; -uint32_t accelerator_threads=2; +uint32_t accelerator_threads=16; uint32_t acceleratorThreads(void) {return accelerator_threads;}; void acceleratorThreads(uint32_t t) {accelerator_threads = t;}; From b6abdc384548f9fd24849b6826dfe9882fdb093d Mon Sep 17 00:00:00 2001 From: Peter Boyle Date: Tue, 19 May 2026 13:41:03 -0400 Subject: [PATCH 24/44] Accelerator: lower default accelerator_threads from 16 to 8 Benchmark_dwf_fp32 on MI250X GCD: 1.7 TF/s at nt=8, ~300 GF/s at nt=16. With Nsimd=8 (fp32, GEN_SIMD_WIDTH=64B), nt=8 gives exactly 64 threads = one full AMD wavefront. Higher values double register demand per block and hit a register-pressure cliff for stencil kernels. Co-Authored-By: Claude Sonnet 4.6 --- Grid/threads/Accelerator.cc | 2 +- skills/gpu-memory-performance.md | 68 +++++++++++++++++++++----------- 2 files changed, 47 insertions(+), 23 deletions(-) diff --git a/Grid/threads/Accelerator.cc b/Grid/threads/Accelerator.cc index f0f33705..e1bbab91 100644 --- a/Grid/threads/Accelerator.cc +++ b/Grid/threads/Accelerator.cc @@ -3,7 +3,7 @@ NAMESPACE_BEGIN(Grid); int world_rank; // Use to control world rank for print guarding int acceleratorAbortOnGpuError=1; -uint32_t accelerator_threads=16; +uint32_t accelerator_threads=8; uint32_t acceleratorThreads(void) {return accelerator_threads;}; void acceleratorThreads(uint32_t t) {accelerator_threads = t;}; diff --git a/skills/gpu-memory-performance.md b/skills/gpu-memory-performance.md index 276cd77a..88ca695f 100644 --- a/skills/gpu-memory-performance.md +++ b/skills/gpu-memory-performance.md @@ -9,13 +9,26 @@ allowed-tools: # GPU Memory Performance in Grid +## Nsimd on GPU builds + +With `GEN_SIMD_WIDTH=64B` (the typical production setting), `Nsimd` is **not 1**: + +| Scalar type | `sizeof` | `Nsimd = 64 / sizeof` | +|---|---|---| +| `ComplexD` | 16 B | **4** | +| `ComplexF` | 8 B | **8** | +| `RealD` | 8 B | **8** | +| `RealF` | 4 B | **16** | + +So for `LatticePropagatorD` (scalar type `ComplexD`), `Nsimd=4` and the SIMD lane runs `threadIdx.x ∈ {0,1,2,3}`. True `Nsimd=1` scalar-GPU builds are the exception, not the rule. + ## The acceleratorThreads() Trap -`acceleratorThreads()` is a runtime-settable global (default **2**) that controls the `blockDim.y` of every `accelerator_for` launch. It is NOT the SIMD width — it is the number of sites processed per block in the y-dimension. +`acceleratorThreads()` is a runtime-settable global (default **8**) that controls the `blockDim.y` of every `accelerator_for` launch. It is NOT the SIMD width — it is the number of sites processed per block in the y-dimension. ```cpp // Grid/threads/Accelerator.cc -uint32_t accelerator_threads = 2; // <-- default +uint32_t accelerator_threads = 8; ``` With `accelerator_for(ss, osites, nsimd, ...)`, the launch is: @@ -25,15 +38,27 @@ dim3 threads(nsimd, acceleratorThreads(), 1) dim3 blocks ((osites + acceleratorThreads() - 1) / acceleratorThreads(), 1, 1) ``` -For `nsimd=1` and the default `acceleratorThreads()=2`: -- **2 threads per block** on a 64-thread AMD wavefront → **3% occupancy** -- Expected bandwidth ≈ peak × 3% ≈ 50 GB/s on MI250X +Total threads per block = `Nsimd × acceleratorThreads()`. With `GEN_SIMD_WIDTH=64B` and `Nsimd=8` (fp32 / ComplexF): -**Diagnostic**: observed bandwidth << peak, kernel time >> expected from data volume. Check with `--accelerator-threads 16` or `--accelerator-threads 32` at runtime. A large speedup confirms occupancy starvation. +| `acceleratorThreads()` | threads/block | AMD wavefront | note | +|---|---|---|---| +| 2 (original default) | 16 | 25% | sub-wavefront, poor | +| 4 | 32 | 50% | half wavefront | +| **8 (current default)** | **64** | **100%** | **one full wavefront — Dslash sweet spot** | +| 16 | 128 | 200% | two wavefronts/block; register-pressure cliff for heavy kernels | +| 32 | 256 | 400% | severe register spill for stencil kernels | + +**Why 8 and not higher?** Compute-heavy kernels like the Domain Wall Dslash carry many live registers per thread (spinors + gauge links + projections). Doubling `acceleratorThreads` from 8→16 doubles the register demand per block, which on AMD GFX90A triggers a hard occupancy cliff: `Benchmark_dwf_fp32` drops from 1.7 TF/s (nt=8) to ~300 GF/s (nt=16). The sweet spot is one full wavefront per block, which with `Nsimd=8` (fp32) means `nt=8`. + +For fp64 work (`Nsimd=4`), `nt=8` gives 32 threads = half a wavefront (AMD pads to 64 with idle lanes). Kernels that are not register-limited (e.g. simple lattice arithmetic) can benefit from `--accelerator-threads 16` at runtime. Reduction kernels bypass `acceleratorThreads()` entirely via `getNumBlocksAndThreads`. + +**Why was the default ever 2?** Before the `threadIdx.x`/`threadIdx.y` remap (see LambdaApply section below), the site index lived in `threadIdx.x` — the fast, coalescing dimension. Increasing `acceleratorThreads` widened the block in the *site* direction, so adjacent threads in a warp hit adjacent sites, each stride `sizeof(vobj)` apart in AoS memory — breaking coalescing. On early NVIDIA ports with `Nsimd≈8`, `nt=2` gave 16 threads = 50% of a 32-thread warp; NVIDIA recovers this via multiple concurrent blocks per SM, so occupancy was barely tolerable. AMD has no such multiplier when blocks are already sub-wavefront. After the remap put the site index in `threadIdx.y` and the SIMD lane in `threadIdx.x`, coalescing became independent of `acceleratorThreads`, removing the constraint. + +**Diagnostic**: observed bandwidth << peak, kernel time >> expected from data volume. Check with `--accelerator-threads 32` at runtime. A large speedup confirms occupancy starvation. **Fix options** (in order of preference): 1. Kernel needs its own thread count — use `getNumBlocksAndThreads` and launch a `__global__` kernel directly (see below). -2. Temporarily acceptable: set `--accelerator-threads 16` or 32 at the application level. Note this affects every `accelerator_for` site in the binary. +2. Temporarily acceptable: set `--accelerator-threads 32` at the application level. Note this affects every `accelerator_for` site in the binary. ## LambdaApply Thread Mapping @@ -49,7 +74,7 @@ Lambda(x, y, z); `threadIdx.x` is the **fast** (lane) dimension — consecutive thread IDs within a warp/wavefront correspond to consecutive lane values on the **same** site, not consecutive sites. -Consequence: for coalesced access from a `vobj` array (AoS layout, stride = `sizeof(vobj)` between adjacent sites), adjacent threads in a wavefront address the **same** site at different lanes, not adjacent sites. With `Nsimd=1` (GPU scalar build), `threadIdx.x` is always 0 and provides no coalescing benefit at all. +With `GEN_SIMD_WIDTH=64B` and `Nsimd=4` (PropagatorD), a 64-thread AMD wavefront contains 64/4 = 16 sites, each processed by 4 lanes. Adjacent threads within the wavefront read different lanes of the same site — this is a broadcast pattern (hardware handles this efficiently), not a stride. The stride between consecutive *sites* (`sizeof(vobj)`) only appears between groups of `Nsimd` threads, spaced `Nsimd` apart in threadIdx.x — not between adjacent threads within a warp. ## coalescedRead / coalescedWrite @@ -57,7 +82,7 @@ These are Grid's canonical way to read/write one SIMD lane from a vector type in ```cpp // accelerator_for(ss, osites, Nsimd, { -// lane = acceleratorSIMTlane(Nsimd) = threadIdx.x +// lane = acceleratorSIMTlane(Nsimd) = threadIdx.x ∈ {0..Nsimd-1} auto scalar_val = coalescedRead(field[ss]); // extractLane(lane, field[ss]) coalescedWrite(field[ss], scalar_val); // insertLane(lane, field[ss], scalar_val) ``` @@ -66,8 +91,6 @@ For `vobj` aggregate types, `coalescedRead` calls `extractLane(lane, vobj)` whic For `vsimd` (raw SIMD vector) types, it casts to `scalar_type*` and indexes with `lane`. -**When Nsimd=1** (GPU scalar build): `lane=0` always, so `coalescedRead`/`coalescedWrite` are effectively no-ops (direct read/write). Coalescing must be achieved through the iteration structure instead. - ## Coalescing the Iteration Structure For an AoS input array where each site is `words` 16-byte elements, adjacent threads reading the same site's consecutive words achieve coalesced access: @@ -108,12 +131,12 @@ Pattern: use `getNumBlocksAndThreads` to pick `numThreads` and `numBlocks`: Integer numThreads, numBlocks; int ok = getNumBlocksAndThreads(n, sizeof(sobj), numThreads, numBlocks); // starts at warpSize (32/64), doubles while 2*threads*sizeof(sobj) < sharedMemPerBlock -// gives 64–256 threads/block → near-100% wavefront occupancy +// gives 64–256 threads/block → correct occupancy independent of acceleratorThreads() Integer smemSize = numThreads * sizeof(sobj); myKernel<<>>(args...); ``` -This gives 64–256 threads/block regardless of `acceleratorThreads()`. Grid's `reduceKernel` uses this pattern and achieves ~400 GB/s on MI250X. +Grid's `reduceKernel` uses this pattern and achieves ~400 GB/s on MI250X. ## Fused vs Staged HBM Access @@ -161,21 +184,22 @@ __device__ void packReduceBlocks( } ``` -Launched with `getNumBlocksAndThreads` → 128–256 threads/block → correct occupancy without depending on `acceleratorThreads()`. +Launched with `getNumBlocksAndThreads` → 128 threads/block for R=12 (`BundleScalarD`=192 B, sharedMem=64 KB) → correct occupancy without depending on `acceleratorThreads()`. -## Observed Numbers on MI250X (32^4 LatticePropagatorD, Nsimd=1) +## Observed Numbers on MI250X (32^4 LatticePropagatorD, Nsimd=4, GEN_SIMD_WIDTH=64B) | Configuration | pack µs/group | reduce µs/group | total µs | GB/s | |---|---|---|---|---| -| acceleratorThreads=2, staged | 10,080 | 470 | 126,909 | 50 | -| acceleratorThreads=16, staged | 342 | 310 | 8,251 | 297 | -| acceleratorThreads=16, fused | — | 349 | 4,584 | 546 | +| acceleratorThreads=2 (8 threads/block), staged | 10,080 | 470 | 126,909 | 50 | +| acceleratorThreads=16 (64 threads/block), staged | 342 | 310 | 8,251 | 297 | +| acceleratorThreads=16, fused (128 threads/block via getNumBlocksAndThreads) | — | 349 | 4,584 | 546 | The fused kernel at 349 µs/group reads 201 MB at 576 GB/s — 36% of MI250X HBM peak. The remaining gap from peak is the in-kernel serial loop over R=12 words and the 12 serial kernel launches. ## Quick Checklist When a Kernel Is Slow -1. Check threads per block: `accelerator_for(ss, N, 1, ...)` with default `acceleratorThreads()=2` = 2 threads/block = 3% occupancy on AMD. Try `--accelerator-threads 16` at runtime; if it helps a lot, occupancy is the problem. -2. Check for bulk struct accumulation in registers (`Bundle b; for(...) b._internal[k] = ...;`). Replace with per-element writes via `coalescedWrite`. -3. Check for staged HBM access (pack → buffer → reduce). Count the passes; fuse if ≥ 2 passes over the same data. -4. For reduction kernels, always use `getNumBlocksAndThreads` rather than `accelerator_for` so thread count is independent of `acceleratorThreads()`. +1. **Check Nsimd**: `GEN_SIMD_WIDTH=64B` → Nsimd=4 (ComplexD), 8 (ComplexF). Total threads/block = `Nsimd × acceleratorThreads()`. With old default nt=2 and Nsimd=4: 8 threads = 12.5% of AMD wavefront. +2. Check threads per block: for `accelerator_for` kernels use `--accelerator-threads 32` and measure; a large speedup confirms occupancy starvation. +3. Check for bulk struct accumulation in registers (`Bundle b; for(...) b._internal[k] = ...;`). Replace with per-element writes via `coalescedWrite`. +4. Check for staged HBM access (pack → buffer → reduce). Count the passes; fuse if ≥ 2 passes over the same data. +5. For reduction kernels, always use `getNumBlocksAndThreads` rather than `accelerator_for` so thread count is independent of `acceleratorThreads()`. From 1e29c59bcc6302181605ff87e13af80a9bd6b732 Mon Sep 17 00:00:00 2001 From: Peter Boyle Date: Tue, 19 May 2026 15:12:10 -0400 Subject: [PATCH 25/44] FFT: cache plans per vobj type across calls Plans are created lazily on the first FFT_dim call and reused for all subsequent calls on the same FFT object. PlanCreate() can be called explicitly to pre-warm the cache. PlanDestroy() must be called before switching to a different vobj type; the destructor cleans up any live plans automatically. Update Test_fft.cc and Test_fftf.cc to call PlanDestroy() between the LatticeComplex and LatticeSpinMatrix sections that reuse the same FFT object. Co-Authored-By: Claude Sonnet 4.6 --- Grid/algorithms/FFT.h | 506 ++++++++++++++++++++-------------------- tests/core/Test_fft.cc | 1 + tests/core/Test_fftf.cc | 1 + 3 files changed, 250 insertions(+), 258 deletions(-) diff --git a/Grid/algorithms/FFT.h b/Grid/algorithms/FFT.h index de621387..e739145d 100644 --- a/Grid/algorithms/FFT.h +++ b/Grid/algorithms/FFT.h @@ -1,6 +1,6 @@ /************************************************************************************* - Grid physics library, www.github.com/paboyle/Grid + Grid physics library, www.github.com/paboyle/Grid Source file: ./lib/Cshift.h @@ -28,6 +28,10 @@ Author: Peter Boyle #ifndef _GRID_FFT_H_ #define _GRID_FFT_H_ +#include +#include +#include + #ifdef GRID_CUDA #include #endif @@ -65,17 +69,16 @@ public: typedef hipfftDoubleComplex FFTW_scalar; typedef hipfftHandle FFTW_plan; static FFTW_plan fftw_plan_many_dft(int rank, int *n,int howmany, - FFTW_scalar *in, int *inembed, - int istride, int idist, - FFTW_scalar *out, int *onembed, - int ostride, int odist, + FFTW_scalar *in, int *inembed, + int istride, int idist, + FFTW_scalar *out, int *onembed, + int ostride, int odist, int sign, unsigned flags) { FFTW_plan p; auto rv = hipfftPlanMany(&p,rank,n,n,istride,idist,n,ostride,odist,HIPFFT_Z2Z,howmany); GRID_ASSERT(rv==HIPFFT_SUCCESS); return p; - } - + } inline static void fftw_execute_dft(const FFTW_plan p,FFTW_scalar *in,FFTW_scalar *out, int sign) { hipfftResult rv; if ( sign == forward ) rv =hipfftExecZ2Z(p,in,out,HIPFFT_FORWARD); @@ -83,29 +86,25 @@ public: accelerator_barrier(); GRID_ASSERT(rv==HIPFFT_SUCCESS); } - inline static void fftw_destroy_plan(const FFTW_plan p) { - hipfftDestroy(p); - } + inline static void fftw_destroy_plan(const FFTW_plan p) { hipfftDestroy(p); } }; template<> struct FFTW { public: static const int forward=FFTW_FORWARD; static const int backward=FFTW_BACKWARD; - typedef hipfftComplex FFTW_scalar; - typedef hipfftHandle FFTW_plan; - + typedef hipfftComplex FFTW_scalar; + typedef hipfftHandle FFTW_plan; static FFTW_plan fftw_plan_many_dft(int rank, int *n,int howmany, - FFTW_scalar *in, int *inembed, - int istride, int idist, - FFTW_scalar *out, int *onembed, - int ostride, int odist, + FFTW_scalar *in, int *inembed, + int istride, int idist, + FFTW_scalar *out, int *onembed, + int ostride, int odist, int sign, unsigned flags) { FFTW_plan p; auto rv = hipfftPlanMany(&p,rank,n,n,istride,idist,n,ostride,odist,HIPFFT_C2C,howmany); GRID_ASSERT(rv==HIPFFT_SUCCESS); return p; - } - + } inline static void fftw_execute_dft(const FFTW_plan p,FFTW_scalar *in,FFTW_scalar *out, int sign) { hipfftResult rv; if ( sign == forward ) rv =hipfftExecC2C(p,in,out,HIPFFT_FORWARD); @@ -113,9 +112,7 @@ public: accelerator_barrier(); GRID_ASSERT(rv==HIPFFT_SUCCESS); } - inline static void fftw_destroy_plan(const FFTW_plan p) { - hipfftDestroy(p); - } + inline static void fftw_destroy_plan(const FFTW_plan p) { hipfftDestroy(p); } }; #endif @@ -126,53 +123,45 @@ public: static const int backward=FFTW_BACKWARD; typedef cufftDoubleComplex FFTW_scalar; typedef cufftHandle FFTW_plan; - static FFTW_plan fftw_plan_many_dft(int rank, int *n,int howmany, - FFTW_scalar *in, int *inembed, - int istride, int idist, - FFTW_scalar *out, int *onembed, - int ostride, int odist, + FFTW_scalar *in, int *inembed, + int istride, int idist, + FFTW_scalar *out, int *onembed, + int ostride, int odist, int sign, unsigned flags) { FFTW_plan p; cufftPlanMany(&p,rank,n,n,istride,idist,n,ostride,odist,CUFFT_Z2Z,howmany); return p; - } - + } inline static void fftw_execute_dft(const FFTW_plan p,FFTW_scalar *in,FFTW_scalar *out, int sign) { if ( sign == forward ) cufftExecZ2Z(p,in,out,CUFFT_FORWARD); else cufftExecZ2Z(p,in,out,CUFFT_INVERSE); accelerator_barrier(); } - inline static void fftw_destroy_plan(const FFTW_plan p) { - cufftDestroy(p); - } + inline static void fftw_destroy_plan(const FFTW_plan p) { cufftDestroy(p); } }; template<> struct FFTW { public: static const int forward=FFTW_FORWARD; static const int backward=FFTW_BACKWARD; typedef cufftComplex FFTW_scalar; - typedef cufftHandle FFTW_plan; - + typedef cufftHandle FFTW_plan; static FFTW_plan fftw_plan_many_dft(int rank, int *n,int howmany, - FFTW_scalar *in, int *inembed, - int istride, int idist, - FFTW_scalar *out, int *onembed, - int ostride, int odist, + FFTW_scalar *in, int *inembed, + int istride, int idist, + FFTW_scalar *out, int *onembed, + int ostride, int odist, int sign, unsigned flags) { FFTW_plan p; cufftPlanMany(&p,rank,n,n,istride,idist,n,ostride,odist,CUFFT_C2C,howmany); return p; - } - + } inline static void fftw_execute_dft(const FFTW_plan p,FFTW_scalar *in,FFTW_scalar *out, int sign) { if ( sign == forward ) cufftExecC2C(p,in,out,CUFFT_FORWARD); else cufftExecC2C(p,in,out,CUFFT_INVERSE); accelerator_barrier(); } - inline static void fftw_destroy_plan(const FFTW_plan p) { - cufftDestroy(p); - } + inline static void fftw_destroy_plan(const FFTW_plan p) { cufftDestroy(p); } }; #endif @@ -183,313 +172,314 @@ public: typedef fftw_complex FFTW_scalar; typedef fftw_plan FFTW_plan; static FFTW_plan fftw_plan_many_dft(int rank, int *n,int howmany, - FFTW_scalar *in, int *inembed, - int istride, int idist, - FFTW_scalar *out, int *onembed, - int ostride, int odist, + FFTW_scalar *in, int *inembed, + int istride, int idist, + FFTW_scalar *out, int *onembed, + int ostride, int odist, int sign, unsigned flags) { return ::fftw_plan_many_dft(rank,n,howmany,in,inembed,istride,idist,out,onembed,ostride,odist,sign,flags); - } - + } inline static void fftw_execute_dft(const FFTW_plan p,FFTW_scalar *in,FFTW_scalar *out, int sign) { ::fftw_execute_dft(p,in,out); } - inline static void fftw_destroy_plan(const FFTW_plan p) { - ::fftw_destroy_plan(p); - } + inline static void fftw_destroy_plan(const FFTW_plan p) { ::fftw_destroy_plan(p); } }; template<> struct FFTW { public: typedef fftwf_complex FFTW_scalar; typedef fftwf_plan FFTW_plan; static FFTW_plan fftw_plan_many_dft(int rank, int *n,int howmany, - FFTW_scalar *in, int *inembed, - int istride, int idist, - FFTW_scalar *out, int *onembed, - int ostride, int odist, + FFTW_scalar *in, int *inembed, + int istride, int idist, + FFTW_scalar *out, int *onembed, + int ostride, int odist, int sign, unsigned flags) { return ::fftwf_plan_many_dft(rank,n,howmany,in,inembed,istride,idist,out,onembed,ostride,odist,sign,flags); - } - + } inline static void fftw_execute_dft(const FFTW_plan p,FFTW_scalar *in,FFTW_scalar *out, int sign) { ::fftwf_execute_dft(p,in,out); } - inline static void fftw_destroy_plan(const FFTW_plan p) { - ::fftwf_destroy_plan(p); - } + inline static void fftw_destroy_plan(const FFTW_plan p) { ::fftwf_destroy_plan(p); } }; #endif #endif class FFT { private: - - double flops; - double flops_call; + + double flops; + double flops_call; uint64_t usec; - -public: - - static const int forward=FFTW_FORWARD; - static const int backward=FFTW_BACKWARD; - - double Flops(void) {return flops;} - double MFlops(void) {return flops/usec;} - double USec(void) {return (double)usec;} + GridCartesian *_grid; - FFT ( GridCartesian * grid ) - { - flops=0; - usec =0; + // Type-erased plan entry. The handle is recovered via + // std::any_cast::FFTW_plan> inside FFT_dim, which knows the + // scalar type at compile time. + struct PlanEntry { + std::any handle; + std::function destroy; }; - - ~FFT ( void) { - // delete sgrid; - } - - template - void FFT_dim_mask(Lattice &result,const Lattice &source,Coordinate mask,int sign){ - // vgrid=result.Grid(); - // conformable(result.Grid(),vgrid); - // conformable(source.Grid(),vgrid); + std::vector forward_plans; // size Nd when populated, 0 otherwise + std::vector backward_plans; + std::type_index _plan_type { typeid(void) }; // vobj type plans were built for + +public: + + static const int forward = FFTW_FORWARD; + static const int backward = FFTW_BACKWARD; + + double Flops(void) { return flops; } + double MFlops(void) { return flops / usec; } + double USec(void) { return (double)usec; } + + FFT(GridCartesian *grid) : _grid(grid), flops(0), usec(0) {} + + ~FFT() { + if (forward_plans.size() > 0) PlanDestroy(); + } + + // Explicitly pre-create and cache plans for all Nd dimensions. + // Optional: FFT_dim will call this lazily on first use if not called. + // Asserts that no plans already exist; call PlanDestroy first to re-create. + template + void PlanCreate() { + GRID_ASSERT(forward_plans.size() == 0); + + typedef typename vobj::scalar_type scalar; + typedef typename vobj::scalar_object sobj; + typedef typename FFTW::FFTW_scalar FFTW_scalar; + typedef typename FFTW::FFTW_plan FFTW_plan; + + const int Ndim = _grid->Nd(); + forward_plans.resize(Ndim); + backward_plans.resize(Ndim); + + for (int d = 0; d < Ndim; d++) { + int G = _grid->_fdimensions[d]; + int Ncomp = sizeof(sobj) / sizeof(scalar); + int64_t Nperp = 1; + for (int dd = 0; dd < Ndim; dd++) + if (dd != d) Nperp *= _grid->_ldimensions[dd]; + int howmany = Ncomp * (int)Nperp; + int n[] = {G}; + + // GPU backends (cuFFT/hipFFT) ignore the buffer pointer at plan creation. + // CPU FFTW with FFTW_ESTIMATE inspects only alignment and never touches data. + deviceVector dummy(2); + FFTW_scalar *buf = (FFTW_scalar *)&dummy[0]; + + { + FFTW_plan p = FFTW::fftw_plan_many_dft( + 1, n, howmany, buf, n, 1, G, buf, n, 1, G, FFTW_FORWARD, FFTW_ESTIMATE); + forward_plans[d] = { p, [p](){ FFTW::fftw_destroy_plan(p); } }; + } + { + FFTW_plan p = FFTW::fftw_plan_many_dft( + 1, n, howmany, buf, n, 1, G, buf, n, 1, G, FFTW_BACKWARD, FFTW_ESTIMATE); + backward_plans[d] = { p, [p](){ FFTW::fftw_destroy_plan(p); } }; + } + } + + _plan_type = std::type_index(typeid(vobj)); + } + + void PlanDestroy() { + for (auto &e : forward_plans) e.destroy(); + for (auto &e : backward_plans) e.destroy(); + forward_plans.resize(0); + backward_plans.resize(0); + _plan_type = std::type_index(typeid(void)); + } + + template + void FFT_dim_mask(Lattice &result, const Lattice &source, Coordinate mask, int sign) { const int Ndim = source.Grid()->Nd(); Lattice tmp = source; - for(int d=0;d - void FFT_all_dim(Lattice &result,const Lattice &source,int sign){ + void FFT_all_dim(Lattice &result, const Lattice &source, int sign) { const int Ndim = source.Grid()->Nd(); - Coordinate mask(Ndim,1); - FFT_dim_mask(result,source,mask,sign); + Coordinate mask(Ndim, 1); + FFT_dim_mask(result, source, mask, sign); } - template - void FFT_dim(Lattice &result,const Lattice &source,int dim, int sign){ + void FFT_dim(Lattice &result, const Lattice &source, int dim, int sign) { const int Ndim = source.Grid()->Nd(); GridBase *grid = source.Grid(); - conformable(result.Grid(),source.Grid()); + conformable(result.Grid(), source.Grid()); int L = grid->_ldimensions[dim]; int G = grid->_fdimensions[dim]; - - Coordinate layout(Ndim,1); - - // Construct pencils + typedef typename vobj::scalar_object sobj; - typedef typename vobj::scalar_type scalar; typedef typename vobj::scalar_type scalar_type; typedef typename vobj::vector_type vector_type; - - //std::cout << "CPU view" << std::endl; - - typedef typename FFTW::FFTW_scalar FFTW_scalar; - typedef typename FFTW::FFTW_plan FFTW_plan; - - int Ncomp = sizeof(sobj)/sizeof(scalar); - int64_t Nlow = 1; - int64_t Nhigh = 1; - for(int d=0;d_ldimensions[d]; - } - for(int d=dim+1;d_ldimensions[d]; - } - int64_t Nperp=Nlow*Nhigh; - - deviceVector pgbuf; // Layout is [perp][component][dim] - pgbuf.resize(Nperp*Ncomp*G); - scalar *pgbuf_v = &pgbuf[0]; - - int rank = 1; /* 1d transforms */ - int n[] = {G}; /* 1d transforms of length G */ + typedef typename FFTW::FFTW_scalar FFTW_scalar; + typedef typename FFTW::FFTW_plan FFTW_plan; + + int Ncomp = sizeof(sobj) / sizeof(scalar_type); + int64_t Nlow = 1; + int64_t Nhigh = 1; + for (int d = 0; d < dim; d++) Nlow *= grid->_ldimensions[d]; + for (int d = dim+1; d < Ndim; d++) Nhigh *= grid->_ldimensions[d]; + int64_t Nperp = Nlow * Nhigh; + + deviceVector pgbuf(Nperp * Ncomp * G); // [perp][component][dim] + scalar_type *pgbuf_v = &pgbuf[0]; + + int rank = 1; + int n[] = {G}; int howmany = Ncomp * Nperp; - int odist,idist,istride,ostride; - idist = odist = G; /* Distance between consecutive FT's */ - istride = ostride = 1; /* Distance between two elements in the same FT */ + int idist = G, odist = G, istride = 1, ostride = 1; int *inembed = n, *onembed = n; - - scalar div; - if ( sign == backward ) div = 1.0/G; - else if ( sign == forward ) div = 1.0; + + scalar_type div; + if (sign == backward) div = 1.0 / G; + else if (sign == forward) div = 1.0; else GRID_ASSERT(0); - double t_pencil=0; - double t_fft =0; - double t_total =-usecond(); - // std::cout << GridLogPerformance<<"Making FFTW plan" << std::endl; - /* - * - */ - FFTW_plan p; - { - FFTW_scalar *in = (FFTW_scalar *)&pgbuf_v[0]; - FFTW_scalar *out= (FFTW_scalar *)&pgbuf_v[0]; - p = FFTW::fftw_plan_many_dft(rank,n,howmany, - in,inembed, - istride,idist, - out,onembed, - ostride, odist, - sign,FFTW_ESTIMATE); - } - - // Barrel shift and collect global pencil - // std::cout << GridLogPerformance<<"Making pencil" << std::endl; - Coordinate lcoor(Ndim), gcoor(Ndim); - double t_copy=0; - double t_shift=0; - t_pencil = -usecond(); + // Populate cache on first call; subsequent calls check type consistency. + if (forward_plans.size() == 0) PlanCreate(); + GRID_ASSERT(forward_plans.size() == (size_t)Ndim); + GRID_ASSERT(std::type_index(typeid(vobj)) == _plan_type); + + auto &plans = (sign == forward) ? forward_plans : backward_plans; + FFTW_plan p = std::any_cast(plans[dim].handle); + + double t_pencil = 0; + double t_fft = 0; + double t_copy = 0; + double t_shift = 0; + double t_total = -usecond(); + + // Barrel-shift gather: accumulate global pencil into pgbuf result = source; int pc = grid->_processor_coor[dim]; const Coordinate ldims = grid->_ldimensions; const Coordinate rdims = grid->_rdimensions; const Coordinate sdims = grid->_simd_layout; + Coordinate processors = grid->_processors; - Coordinate processors = grid->_processors; Coordinate pgdims(Ndim); pgdims[0] = G; - for(int d=0, dd=1;doSites(), vobj::Nsimd(), { #ifdef GRID_SIMT - { - int lane=acceleratorSIMTlane(Nsimd); // buffer lane + { + int lane = acceleratorSIMTlane(Nsimd); #else - for(int lane=0;lane temp(grid); - t_shift-=usecond(); - temp = Cshift(result,dim,L); result = temp; - t_shift+=usecond(); + if (p_idx != processors[dim] - 1) { + Lattice temp(grid); + t_shift -= usecond(); + temp = Cshift(result, dim, L); result = temp; + t_shift += usecond(); } } t_pencil += usecond(); - - FFTW_scalar *in = (FFTW_scalar *)pgbuf_v; - FFTW_scalar *out= (FFTW_scalar *)pgbuf_v; + + FFTW_scalar *in = (FFTW_scalar *)pgbuf_v; + FFTW_scalar *out = (FFTW_scalar *)pgbuf_v; t_fft = -usecond(); - FFTW::fftw_execute_dft(p,in,out,sign); + FFTW::fftw_execute_dft(p, in, out, sign); t_fft += usecond(); - - // performance counting - flops_call = 5.0*howmany*G*log2(G); - usec = t_fft; - flops= flops_call; + + flops_call = 5.0 * howmany * G * log2(G); + usec = t_fft; + flops = flops_call; result = Zero(); - double t_insert = -usecond(); { - autoView(r_v,result,AcceleratorWrite); - accelerator_for(idx,grid->oSites(),Nsimd,{ + autoView(r_v, result, AcceleratorWrite); + accelerator_for(idx, grid->oSites(), Nsimd, { #ifdef GRID_SIMT - { - int lane=acceleratorSIMTlane(Nsimd); // buffer lane + { + int lane = acceleratorSIMTlane(Nsimd); #else - for(int lane=0;lane::fftw_destroy_plan(p); - - t_total +=usecond(); - - std::cout < Date: Tue, 19 May 2026 17:15:21 -0400 Subject: [PATCH 26/44] FFT: pass nullptr for inembed/onembed in hipfftPlanMany to avoid HIPFFT_PARSE_ERROR --- Grid/algorithms/FFT.h | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/Grid/algorithms/FFT.h b/Grid/algorithms/FFT.h index e739145d..7a61ad1a 100644 --- a/Grid/algorithms/FFT.h +++ b/Grid/algorithms/FFT.h @@ -75,7 +75,10 @@ public: int ostride, int odist, int sign, unsigned flags) { FFTW_plan p; - auto rv = hipfftPlanMany(&p,rank,n,n,istride,idist,n,ostride,odist,HIPFFT_Z2Z,howmany); + // Pass nullptr for inembed/onembed: contiguous layout is the default and + // avoids HIPFFT_PARSE_ERROR (12) triggered by some rocFFT versions when + // inembed==n (non-null, no-padding case). + auto rv = hipfftPlanMany(&p,rank,n,nullptr,istride,idist,nullptr,ostride,odist,HIPFFT_Z2Z,howmany); GRID_ASSERT(rv==HIPFFT_SUCCESS); return p; } @@ -101,7 +104,7 @@ public: int ostride, int odist, int sign, unsigned flags) { FFTW_plan p; - auto rv = hipfftPlanMany(&p,rank,n,n,istride,idist,n,ostride,odist,HIPFFT_C2C,howmany); + auto rv = hipfftPlanMany(&p,rank,n,nullptr,istride,idist,nullptr,ostride,odist,HIPFFT_C2C,howmany); GRID_ASSERT(rv==HIPFFT_SUCCESS); return p; } From fc8c8ce6e70a952a8bab2fbebf105cc4dbb754b4 Mon Sep 17 00:00:00 2001 From: Peter Boyle Date: Tue, 19 May 2026 17:29:28 -0400 Subject: [PATCH 27/44] FFT HIP: use hipfftCreate+hipfftMakePlanMany instead of hipfftPlanMany --- Grid/algorithms/FFT.h | 16 +++++++++++----- 1 file changed, 11 insertions(+), 5 deletions(-) diff --git a/Grid/algorithms/FFT.h b/Grid/algorithms/FFT.h index 7a61ad1a..406a49b6 100644 --- a/Grid/algorithms/FFT.h +++ b/Grid/algorithms/FFT.h @@ -74,11 +74,14 @@ public: FFTW_scalar *out, int *onembed, int ostride, int odist, int sign, unsigned flags) { + // hipfftPlanMany (one-step) triggers HIPFFT_PARSE_ERROR (12) on some + // ROCm versions. The two-step hipfftCreate + hipfftMakePlanMany is + // more robust across ROCm releases. FFTW_plan p; - // Pass nullptr for inembed/onembed: contiguous layout is the default and - // avoids HIPFFT_PARSE_ERROR (12) triggered by some rocFFT versions when - // inembed==n (non-null, no-padding case). - auto rv = hipfftPlanMany(&p,rank,n,nullptr,istride,idist,nullptr,ostride,odist,HIPFFT_Z2Z,howmany); + size_t workSize; + auto rc = hipfftCreate(&p); + GRID_ASSERT(rc==HIPFFT_SUCCESS); + auto rv = hipfftMakePlanMany(p,rank,n,nullptr,istride,idist,nullptr,ostride,odist,HIPFFT_Z2Z,howmany,&workSize); GRID_ASSERT(rv==HIPFFT_SUCCESS); return p; } @@ -104,7 +107,10 @@ public: int ostride, int odist, int sign, unsigned flags) { FFTW_plan p; - auto rv = hipfftPlanMany(&p,rank,n,nullptr,istride,idist,nullptr,ostride,odist,HIPFFT_C2C,howmany); + size_t workSize; + auto rc = hipfftCreate(&p); + GRID_ASSERT(rc==HIPFFT_SUCCESS); + auto rv = hipfftMakePlanMany(p,rank,n,nullptr,istride,idist,nullptr,ostride,odist,HIPFFT_C2C,howmany,&workSize); GRID_ASSERT(rv==HIPFFT_SUCCESS); return p; } From 4de160ce209fc6c4dda276d50b1da0ae06bd9cf0 Mon Sep 17 00:00:00 2001 From: Peter Boyle Date: Tue, 19 May 2026 17:52:59 -0400 Subject: [PATCH 28/44] tests/debug: add minimal hipfft plan-creation reproducer Co-Authored-By: Claude Sonnet 4.6 --- tests/debug/Test_hipfft_minimal.cc | 122 +++++++++++++++++++++++++++++ 1 file changed, 122 insertions(+) create mode 100644 tests/debug/Test_hipfft_minimal.cc diff --git a/tests/debug/Test_hipfft_minimal.cc b/tests/debug/Test_hipfft_minimal.cc new file mode 100644 index 00000000..548e919b --- /dev/null +++ b/tests/debug/Test_hipfft_minimal.cc @@ -0,0 +1,122 @@ +/* + * Minimal reproducer for hipfftMakePlanMany / hipfftPlanMany failures. + * + * Compile on Frontier (no Grid headers needed): + * hipcc -o Test_hipfft_minimal Test_hipfft_minimal.cc -lhipfft + * + * Run: + * ./Test_hipfft_minimal + */ + +#include +#include +#include +#include + +static const char *hipfftResultString(hipfftResult r) { + switch (r) { + case HIPFFT_SUCCESS: return "HIPFFT_SUCCESS"; + case HIPFFT_INVALID_PLAN: return "HIPFFT_INVALID_PLAN"; + case HIPFFT_ALLOC_FAILED: return "HIPFFT_ALLOC_FAILED"; + case HIPFFT_INVALID_TYPE: return "HIPFFT_INVALID_TYPE"; + case HIPFFT_INVALID_VALUE: return "HIPFFT_INVALID_VALUE"; + case HIPFFT_INTERNAL_ERROR: return "HIPFFT_INTERNAL_ERROR"; + case HIPFFT_EXEC_FAILED: return "HIPFFT_EXEC_FAILED"; + case HIPFFT_SETUP_FAILED: return "HIPFFT_SETUP_FAILED"; + case HIPFFT_INVALID_SIZE: return "HIPFFT_INVALID_SIZE"; + case HIPFFT_UNALIGNED_DATA: return "HIPFFT_UNALIGNED_DATA"; + case HIPFFT_INCOMPLETE_PARAMETER_LIST:return "HIPFFT_INCOMPLETE_PARAMETER_LIST"; + case HIPFFT_INVALID_DEVICE: return "HIPFFT_INVALID_DEVICE"; + case HIPFFT_PARSE_ERROR: return "HIPFFT_PARSE_ERROR"; + case HIPFFT_NO_WORKSPACE: return "HIPFFT_NO_WORKSPACE"; + case HIPFFT_NOT_IMPLEMENTED: return "HIPFFT_NOT_IMPLEMENTED"; + case HIPFFT_NOT_SUPPORTED: return "HIPFFT_NOT_SUPPORTED"; + default: return "UNKNOWN"; + } +} + +// Try both hipfftPlanMany and hipfftCreate+hipfftMakePlanMany for given G and howmany. +static void tryPlans(int G, int howmany) { + int n[] = {G}; + + printf("--- G=%-4d howmany=%-6d ---\n", G, howmany); + + // 1. hipfftPlanMany (one-step) + { + hipfftHandle p; + hipfftResult rv = hipfftPlanMany(&p, 1, n, + nullptr, 1, G, + nullptr, 1, G, + HIPFFT_Z2Z, howmany); + printf(" hipfftPlanMany : %d (%s)\n", (int)rv, hipfftResultString(rv)); + if (rv == HIPFFT_SUCCESS) hipfftDestroy(p); + } + + // 2. hipfftPlanMany with inembed=n (old Grid behaviour) + { + hipfftHandle p; + hipfftResult rv = hipfftPlanMany(&p, 1, n, + n, 1, G, + n, 1, G, + HIPFFT_Z2Z, howmany); + printf(" hipfftPlanMany(inembed=n): %d (%s)\n", (int)rv, hipfftResultString(rv)); + if (rv == HIPFFT_SUCCESS) hipfftDestroy(p); + } + + // 3. hipfftCreate + hipfftMakePlanMany (two-step, nullptr embed) + { + hipfftHandle p; + size_t workSize = 0; + hipfftResult rc = hipfftCreate(&p); + printf(" hipfftCreate : %d (%s)\n", (int)rc, hipfftResultString(rc)); + if (rc == HIPFFT_SUCCESS) { + hipfftResult rv = hipfftMakePlanMany(p, 1, n, + nullptr, 1, G, + nullptr, 1, G, + HIPFFT_Z2Z, howmany, &workSize); + printf(" hipfftMakePlanMany : %d (%s) workSize=%zu\n", + (int)rv, hipfftResultString(rv), workSize); + hipfftDestroy(p); + } + } + + // 4. hipfftPlan1d (simplest API) + { + hipfftHandle p; + hipfftResult rv = hipfftPlan1d(&p, G, HIPFFT_Z2Z, howmany); + printf(" hipfftPlan1d : %d (%s)\n", (int)rv, hipfftResultString(rv)); + if (rv == HIPFFT_SUCCESS) hipfftDestroy(p); + } + + printf("\n"); +} + +int main(void) { + // Print HIP device info + int device = 0; + hipGetDevice(&device); + hipDeviceProp_t prop; + hipGetDeviceProperties(&prop, device); + printf("Device %d: %s warpSize=%d\n\n", device, prop.name, prop.warpSize); + + // Print hipFFT version if available +#ifdef hipfftVersionMinor + printf("hipFFT version: %d.%d.%d\n\n", + hipfftVersionMajor, hipfftVersionMinor, hipfftVersionPatch); +#endif + + // Sweep over transform sizes G with a fixed representative howmany + // (512 = typical Nperp*Ncomp for a small lattice) + const int howmany = 512; + for (int G : {4, 8, 12, 16, 24, 32, 48, 64}) { + tryPlans(G, howmany); + } + + // Also try the exact parameters from the failing Grid FFT + printf("=== Grid-specific parameters ===\n\n"); + tryPlans(8, 512); // Ls=8, small 4D lattice + tryPlans(16, 512); // 16^4 + tryPlans(32, 512); // 32^4 (known to work) + + return 0; +} From ad9d03fd8536a42692acc176a0aa3d856fe8b89e Mon Sep 17 00:00:00 2001 From: Peter Boyle Date: Tue, 19 May 2026 19:19:59 -0400 Subject: [PATCH 29/44] tests/debug: extend hipfft reproducer with Grid-realistic howmany and exec tests Co-Authored-By: Claude Sonnet 4.6 --- tests/debug/Test_hipfft_minimal.cc | 134 ++++++++++++++++++----------- 1 file changed, 85 insertions(+), 49 deletions(-) diff --git a/tests/debug/Test_hipfft_minimal.cc b/tests/debug/Test_hipfft_minimal.cc index 548e919b..2f49c0e0 100644 --- a/tests/debug/Test_hipfft_minimal.cc +++ b/tests/debug/Test_hipfft_minimal.cc @@ -35,59 +35,77 @@ static const char *hipfftResultString(hipfftResult r) { } } -// Try both hipfftPlanMany and hipfftCreate+hipfftMakePlanMany for given G and howmany. -static void tryPlans(int G, int howmany) { +// Plan creation + execution for (G, howmany) using hipfftCreate+hipfftMakePlanMany. +// This is the path Grid's FFT.h now uses. +static void tryPlanAndExec(int G, long howmany) { int n[] = {G}; + long nelems = (long)G * howmany; - printf("--- G=%-4d howmany=%-6d ---\n", G, howmany); + printf("--- G=%-4d howmany=%-10ld total_elems=%-12ld ---\n", + G, howmany, nelems); - // 1. hipfftPlanMany (one-step) + // Allocate device buffer (hipfftDoubleComplex = 16 bytes each) + hipfftDoubleComplex *dbuf = nullptr; + hipError_t herr = hipMalloc(&dbuf, nelems * sizeof(hipfftDoubleComplex)); + if (herr != hipSuccess) { + printf(" hipMalloc failed (%d) for %ld elems — skipping\n\n", (int)herr, nelems); + return; + } + hipMemset(dbuf, 0, nelems * sizeof(hipfftDoubleComplex)); + + // 1. hipfftPlanMany (one-step, nullptr embed) — current Grid path { hipfftHandle p; hipfftResult rv = hipfftPlanMany(&p, 1, n, nullptr, 1, G, nullptr, 1, G, - HIPFFT_Z2Z, howmany); - printf(" hipfftPlanMany : %d (%s)\n", (int)rv, hipfftResultString(rv)); - if (rv == HIPFFT_SUCCESS) hipfftDestroy(p); - } - - // 2. hipfftPlanMany with inembed=n (old Grid behaviour) - { - hipfftHandle p; - hipfftResult rv = hipfftPlanMany(&p, 1, n, - n, 1, G, - n, 1, G, - HIPFFT_Z2Z, howmany); - printf(" hipfftPlanMany(inembed=n): %d (%s)\n", (int)rv, hipfftResultString(rv)); - if (rv == HIPFFT_SUCCESS) hipfftDestroy(p); - } - - // 3. hipfftCreate + hipfftMakePlanMany (two-step, nullptr embed) - { - hipfftHandle p; - size_t workSize = 0; - hipfftResult rc = hipfftCreate(&p); - printf(" hipfftCreate : %d (%s)\n", (int)rc, hipfftResultString(rc)); - if (rc == HIPFFT_SUCCESS) { - hipfftResult rv = hipfftMakePlanMany(p, 1, n, - nullptr, 1, G, - nullptr, 1, G, - HIPFFT_Z2Z, howmany, &workSize); - printf(" hipfftMakePlanMany : %d (%s) workSize=%zu\n", - (int)rv, hipfftResultString(rv), workSize); + HIPFFT_Z2Z, (int)howmany); + printf(" hipfftPlanMany create : %d (%s)\n", (int)rv, hipfftResultString(rv)); + if (rv == HIPFFT_SUCCESS) { + rv = hipfftExecZ2Z(p, dbuf, dbuf, HIPFFT_FORWARD); + hipDeviceSynchronize(); + printf(" hipfftPlanMany execFwd: %d (%s)\n", (int)rv, hipfftResultString(rv)); hipfftDestroy(p); } } - // 4. hipfftPlan1d (simplest API) + // 2. hipfftCreate + hipfftMakePlanMany (two-step) — also current Grid path { hipfftHandle p; - hipfftResult rv = hipfftPlan1d(&p, G, HIPFFT_Z2Z, howmany); - printf(" hipfftPlan1d : %d (%s)\n", (int)rv, hipfftResultString(rv)); - if (rv == HIPFFT_SUCCESS) hipfftDestroy(p); + size_t workSize = 0; + hipfftResult rc = hipfftCreate(&p); + if (rc == HIPFFT_SUCCESS) { + hipfftResult rv = hipfftMakePlanMany(p, 1, n, + nullptr, 1, G, + nullptr, 1, G, + HIPFFT_Z2Z, (int)howmany, &workSize); + printf(" hipfftMakePlanMany : %d (%s) workSize=%zu\n", + (int)rv, hipfftResultString(rv), workSize); + if (rv == HIPFFT_SUCCESS) { + rv = hipfftExecZ2Z(p, dbuf, dbuf, HIPFFT_FORWARD); + hipDeviceSynchronize(); + printf(" hipfftMakePlanMany exec : %d (%s)\n", (int)rv, hipfftResultString(rv)); + } + hipfftDestroy(p); + } else { + printf(" hipfftCreate : %d (%s)\n", (int)rc, hipfftResultString(rc)); + } } + // 3. hipfftPlan1d (simplest API, batch = howmany) + { + hipfftHandle p; + hipfftResult rv = hipfftPlan1d(&p, G, HIPFFT_Z2Z, (int)howmany); + printf(" hipfftPlan1d create : %d (%s)\n", (int)rv, hipfftResultString(rv)); + if (rv == HIPFFT_SUCCESS) { + rv = hipfftExecZ2Z(p, dbuf, dbuf, HIPFFT_FORWARD); + hipDeviceSynchronize(); + printf(" hipfftPlan1d execFwd: %d (%s)\n", (int)rv, hipfftResultString(rv)); + hipfftDestroy(p); + } + } + + hipFree(dbuf); printf("\n"); } @@ -99,24 +117,42 @@ int main(void) { hipGetDeviceProperties(&prop, device); printf("Device %d: %s warpSize=%d\n\n", device, prop.name, prop.warpSize); - // Print hipFFT version if available #ifdef hipfftVersionMinor printf("hipFFT version: %d.%d.%d\n\n", hipfftVersionMajor, hipfftVersionMinor, hipfftVersionPatch); #endif - // Sweep over transform sizes G with a fixed representative howmany - // (512 = typical Nperp*Ncomp for a small lattice) - const int howmany = 512; - for (int G : {4, 8, 12, 16, 24, 32, 48, 64}) { - tryPlans(G, howmany); - } + // Original sweep with small howmany (these passed first time) + printf("=== Small howmany (original sweep) ===\n\n"); + for (int G : {4, 8, 12, 16, 24, 32, 48, 64}) + tryPlanAndExec(G, 512); - // Also try the exact parameters from the failing Grid FFT - printf("=== Grid-specific parameters ===\n\n"); - tryPlans(8, 512); // Ls=8, small 4D lattice - tryPlans(16, 512); // 16^4 - tryPlans(32, 512); // 32^4 (known to work) + // Grid-realistic howmany values derived from actual lattice geometries. + // howmany = Ncomp * product(ldimensions[d] for d != dim) + // For LatticeComplexD: Ncomp=1. + printf("=== Grid-realistic parameters ===\n\n"); + + // --grid 16.16.16.16 4D FFT (KNOWN TO FAIL in Grid) + // Each dim: G=16, Nperp=16^3=4096 + tryPlanAndExec(16, 4096); + + // --grid 32.32.32.32 4D FFT (KNOWN TO SUCCEED in Grid) + // Each dim: G=32, Nperp=32^3=32768 + tryPlanAndExec(32, 32768); + + // --grid 32.32.32.32 Ls=8 5D DWF FFT (KNOWN TO FAIL on dim 0 in Grid) + // dim 0: G=8, Nperp=32^4=1048576 + tryPlanAndExec(8, 1048576); + // dim 1-4: G=32, Nperp=8*32^3=262144 + tryPlanAndExec(32, 262144); + + // Extra intermediate cases to bracket the failure + tryPlanAndExec(16, 1024); + tryPlanAndExec(16, 2048); + tryPlanAndExec(16, 8192); + tryPlanAndExec(8, 4096); + tryPlanAndExec(8, 65536); + tryPlanAndExec(8, 262144); return 0; } From e5996b440da0050ae94b97c9697dc0ad835ce08b Mon Sep 17 00:00:00 2001 From: Peter Boyle Date: Tue, 19 May 2026 21:40:17 -0400 Subject: [PATCH 30/44] tests/debug: test plan-before-malloc vs malloc-before-plan ordering Co-Authored-By: Claude Sonnet 4.6 --- tests/debug/Test_hipfft_minimal.cc | 96 +++++++++++++----------------- 1 file changed, 40 insertions(+), 56 deletions(-) diff --git a/tests/debug/Test_hipfft_minimal.cc b/tests/debug/Test_hipfft_minimal.cc index 2f49c0e0..bbccff3c 100644 --- a/tests/debug/Test_hipfft_minimal.cc +++ b/tests/debug/Test_hipfft_minimal.cc @@ -35,8 +35,11 @@ static const char *hipfftResultString(hipfftResult r) { } } -// Plan creation + execution for (G, howmany) using hipfftCreate+hipfftMakePlanMany. -// This is the path Grid's FFT.h now uses. +// Plan creation + execution for (G, howmany). +// Tests two orderings to isolate whether a prior hipMalloc poisons hipfft +// plan creation for small G on ROCm 7: +// A) plan BEFORE hipMalloc — hypothesis: succeeds +// B) hipMalloc BEFORE plan — hypothesis: fails for G < 32 static void tryPlanAndExec(int G, long howmany) { int n[] = {G}; long nelems = (long)G * howmany; @@ -44,68 +47,49 @@ static void tryPlanAndExec(int G, long howmany) { printf("--- G=%-4d howmany=%-10ld total_elems=%-12ld ---\n", G, howmany, nelems); - // Allocate device buffer (hipfftDoubleComplex = 16 bytes each) - hipfftDoubleComplex *dbuf = nullptr; - hipError_t herr = hipMalloc(&dbuf, nelems * sizeof(hipfftDoubleComplex)); - if (herr != hipSuccess) { - printf(" hipMalloc failed (%d) for %ld elems — skipping\n\n", (int)herr, nelems); - return; - } - hipMemset(dbuf, 0, nelems * sizeof(hipfftDoubleComplex)); - - // 1. hipfftPlanMany (one-step, nullptr embed) — current Grid path - { - hipfftHandle p; - hipfftResult rv = hipfftPlanMany(&p, 1, n, - nullptr, 1, G, - nullptr, 1, G, - HIPFFT_Z2Z, (int)howmany); - printf(" hipfftPlanMany create : %d (%s)\n", (int)rv, hipfftResultString(rv)); - if (rv == HIPFFT_SUCCESS) { - rv = hipfftExecZ2Z(p, dbuf, dbuf, HIPFFT_FORWARD); - hipDeviceSynchronize(); - printf(" hipfftPlanMany execFwd: %d (%s)\n", (int)rv, hipfftResultString(rv)); - hipfftDestroy(p); - } - } - - // 2. hipfftCreate + hipfftMakePlanMany (two-step) — also current Grid path + // --- A: create plan first, allocate buffer afterwards --- { hipfftHandle p; size_t workSize = 0; - hipfftResult rc = hipfftCreate(&p); - if (rc == HIPFFT_SUCCESS) { - hipfftResult rv = hipfftMakePlanMany(p, 1, n, - nullptr, 1, G, - nullptr, 1, G, - HIPFFT_Z2Z, (int)howmany, &workSize); - printf(" hipfftMakePlanMany : %d (%s) workSize=%zu\n", - (int)rv, hipfftResultString(rv), workSize); - if (rv == HIPFFT_SUCCESS) { - rv = hipfftExecZ2Z(p, dbuf, dbuf, HIPFFT_FORWARD); - hipDeviceSynchronize(); - printf(" hipfftMakePlanMany exec : %d (%s)\n", (int)rv, hipfftResultString(rv)); - } - hipfftDestroy(p); - } else { - printf(" hipfftCreate : %d (%s)\n", (int)rc, hipfftResultString(rc)); - } - } - - // 3. hipfftPlan1d (simplest API, batch = howmany) - { - hipfftHandle p; - hipfftResult rv = hipfftPlan1d(&p, G, HIPFFT_Z2Z, (int)howmany); - printf(" hipfftPlan1d create : %d (%s)\n", (int)rv, hipfftResultString(rv)); + hipfftCreate(&p); + hipfftResult rv = hipfftMakePlanMany(p, 1, n, + nullptr, 1, G, nullptr, 1, G, + HIPFFT_Z2Z, (int)howmany, &workSize); + printf(" plan-first create : %d (%s)\n", (int)rv, hipfftResultString(rv)); if (rv == HIPFFT_SUCCESS) { - rv = hipfftExecZ2Z(p, dbuf, dbuf, HIPFFT_FORWARD); + hipfftDoubleComplex *buf = nullptr; + hipMalloc(&buf, nelems * sizeof(hipfftDoubleComplex)); + hipMemset(buf, 0, nelems * sizeof(hipfftDoubleComplex)); + rv = hipfftExecZ2Z(p, buf, buf, HIPFFT_FORWARD); hipDeviceSynchronize(); - printf(" hipfftPlan1d execFwd: %d (%s)\n", (int)rv, hipfftResultString(rv)); - hipfftDestroy(p); + printf(" plan-first execFwd: %d (%s)\n", (int)rv, hipfftResultString(rv)); + hipFree(buf); } + hipfftDestroy(p); + } + + // --- B: hipMalloc first, create plan afterwards --- + { + hipfftDoubleComplex *buf = nullptr; + hipMalloc(&buf, nelems * sizeof(hipfftDoubleComplex)); + hipMemset(buf, 0, nelems * sizeof(hipfftDoubleComplex)); + + hipfftHandle p; + size_t workSize = 0; + hipfftCreate(&p); + hipfftResult rv = hipfftMakePlanMany(p, 1, n, + nullptr, 1, G, nullptr, 1, G, + HIPFFT_Z2Z, (int)howmany, &workSize); + printf(" malloc-first create : %d (%s)\n", (int)rv, hipfftResultString(rv)); + if (rv == HIPFFT_SUCCESS) { + rv = hipfftExecZ2Z(p, buf, buf, HIPFFT_FORWARD); + hipDeviceSynchronize(); + printf(" malloc-first execFwd: %d (%s)\n", (int)rv, hipfftResultString(rv)); + } + hipfftDestroy(p); + hipFree(buf); } - hipFree(dbuf); printf("\n"); } From 58cc6ca9c09fa20b7ebd8c138e6e0a060547e981 Mon Sep 17 00:00:00 2001 From: Peter Boyle Date: Tue, 19 May 2026 21:48:23 -0400 Subject: [PATCH 31/44] tests/debug: add minimal hipfft ordering bug fail/pass pair Co-Authored-By: Claude Sonnet 4.6 --- tests/debug/Test_hipfft_bug_fail.cc | 53 +++++++++++++++++++++++++ tests/debug/Test_hipfft_bug_pass.cc | 61 +++++++++++++++++++++++++++++ 2 files changed, 114 insertions(+) create mode 100644 tests/debug/Test_hipfft_bug_fail.cc create mode 100644 tests/debug/Test_hipfft_bug_pass.cc diff --git a/tests/debug/Test_hipfft_bug_fail.cc b/tests/debug/Test_hipfft_bug_fail.cc new file mode 100644 index 00000000..e88c05ac --- /dev/null +++ b/tests/debug/Test_hipfft_bug_fail.cc @@ -0,0 +1,53 @@ +/* + * Minimal program demonstrating hipfft HIPFFT_PARSE_ERROR on ROCm 7 / hipFFT 1.0.20. + * + * Bug: hipfftMakePlanMany returns HIPFFT_PARSE_ERROR (12) for transform sizes + * smaller than 32 when a hipMalloc is issued before plan creation. + * + * Compile: + * hipcc -o Test_hipfft_bug_fail Test_hipfft_bug_fail.cc -lhipfft + * + * Run with empty rocFFT cache: + * rm -rf ~/.cache/rocfft + * ./Test_hipfft_bug_fail + * + * Expected (broken): hipfftMakePlanMany returns 12 (HIPFFT_PARSE_ERROR) for G < 32. + * See Test_hipfft_bug_pass.cc for the workaround. + */ + +#include +#include +#include + +int main(void) { + hipDeviceProp_t prop; + hipGetDeviceProperties(&prop, 0); + printf("Device: %s\n", prop.name); +#ifdef hipfftVersionMinor + printf("hipFFT version: %d.%d.%d\n\n", + hipfftVersionMajor, hipfftVersionMinor, hipfftVersionPatch); +#endif + + for (int G : {8, 16, 32}) { + int howmany = 512; + int n[] = {G}; + long nelems = (long)G * howmany; + + // hipMalloc BEFORE plan creation — triggers HIPFFT_PARSE_ERROR for G < 32 + hipfftDoubleComplex *buf = nullptr; + hipMalloc(&buf, nelems * sizeof(hipfftDoubleComplex)); + + hipfftHandle p; + size_t workSize = 0; + hipfftCreate(&p); + hipfftResult rv = hipfftMakePlanMany(p, 1, n, + nullptr, 1, G, nullptr, 1, G, + HIPFFT_Z2Z, howmany, &workSize); + printf("G=%-4d hipMalloc-then-plan: %d (%s)\n", + G, (int)rv, rv == HIPFFT_SUCCESS ? "HIPFFT_SUCCESS" : "HIPFFT_PARSE_ERROR"); + hipfftDestroy(p); + hipFree(buf); + } + + return 0; +} diff --git a/tests/debug/Test_hipfft_bug_pass.cc b/tests/debug/Test_hipfft_bug_pass.cc new file mode 100644 index 00000000..3cd7c210 --- /dev/null +++ b/tests/debug/Test_hipfft_bug_pass.cc @@ -0,0 +1,61 @@ +/* + * Minimal program demonstrating the workaround for the hipfft ROCm 7 bug. + * + * Workaround: create the hipfft plan BEFORE any hipMalloc. Plan creation + * for G < 32 then succeeds even with an empty rocFFT cache. + * + * Compile: + * hipcc -o Test_hipfft_bug_pass Test_hipfft_bug_pass.cc -lhipfft + * + * Run: + * rm -rf ~/.cache/rocfft + * ./Test_hipfft_bug_pass + * + * Expected: all G values succeed. + * Compare with Test_hipfft_bug_fail.cc which uses the opposite ordering. + */ + +#include +#include +#include + +int main(void) { + hipDeviceProp_t prop; + hipGetDeviceProperties(&prop, 0); + printf("Device: %s\n", prop.name); +#ifdef hipfftVersionMinor + printf("hipFFT version: %d.%d.%d\n\n", + hipfftVersionMajor, hipfftVersionMinor, hipfftVersionPatch); +#endif + + for (int G : {8, 16, 32}) { + int howmany = 512; + int n[] = {G}; + long nelems = (long)G * howmany; + + // Plan created BEFORE hipMalloc — succeeds for all G + hipfftHandle p; + size_t workSize = 0; + hipfftCreate(&p); + hipfftResult rv = hipfftMakePlanMany(p, 1, n, + nullptr, 1, G, nullptr, 1, G, + HIPFFT_Z2Z, howmany, &workSize); + printf("G=%-4d plan-then-hipMalloc: %d (%s)\n", + G, (int)rv, rv == HIPFFT_SUCCESS ? "HIPFFT_SUCCESS" : "HIPFFT_PARSE_ERROR"); + + if (rv == HIPFFT_SUCCESS) { + hipfftDoubleComplex *buf = nullptr; + hipMalloc(&buf, nelems * sizeof(hipfftDoubleComplex)); + hipMemset(buf, 0, nelems * sizeof(hipfftDoubleComplex)); + rv = hipfftExecZ2Z(p, buf, buf, HIPFFT_FORWARD); + hipDeviceSynchronize(); + printf("G=%-4d execFwd: %d (%s)\n", + G, (int)rv, rv == HIPFFT_SUCCESS ? "HIPFFT_SUCCESS" : "FAILED"); + hipFree(buf); + } + + hipfftDestroy(p); + } + + return 0; +} From bdba5b8403fd7816ebf53d2d0634d0dd4724efc4 Mon Sep 17 00:00:00 2001 From: Peter Boyle Date: Tue, 19 May 2026 21:49:06 -0400 Subject: [PATCH 32/44] FFT: use host stack buffer in PlanCreate, not deviceVector Co-Authored-By: Claude Sonnet 4.6 --- Grid/algorithms/FFT.h | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/Grid/algorithms/FFT.h b/Grid/algorithms/FFT.h index 406a49b6..8bacc274 100644 --- a/Grid/algorithms/FFT.h +++ b/Grid/algorithms/FFT.h @@ -275,8 +275,10 @@ public: // GPU backends (cuFFT/hipFFT) ignore the buffer pointer at plan creation. // CPU FFTW with FFTW_ESTIMATE inspects only alignment and never touches data. - deviceVector dummy(2); - FFTW_scalar *buf = (FFTW_scalar *)&dummy[0]; + // Use a host stack buffer: a device allocation here triggers a rocFFT RTC + // bug on ROCm 7 that causes plan creation to fail for small transform sizes. + scalar stack_dummy[2] = {}; + FFTW_scalar *buf = (FFTW_scalar *)stack_dummy; { FFTW_plan p = FFTW::fftw_plan_many_dft( From ea57bd8f03e6a5d0ad1d74c6a91d9c28382ea002 Mon Sep 17 00:00:00 2001 From: Peter Boyle Date: Tue, 19 May 2026 22:02:02 -0400 Subject: [PATCH 33/44] tests/debug: extend hipfft fail reproducer with hipMemset and sync variants Co-Authored-By: Claude Sonnet 4.6 --- tests/debug/Test_hipfft_bug_fail.cc | 60 +++++++++++++++++++---------- 1 file changed, 40 insertions(+), 20 deletions(-) diff --git a/tests/debug/Test_hipfft_bug_fail.cc b/tests/debug/Test_hipfft_bug_fail.cc index e88c05ac..e2ad736c 100644 --- a/tests/debug/Test_hipfft_bug_fail.cc +++ b/tests/debug/Test_hipfft_bug_fail.cc @@ -1,24 +1,41 @@ /* - * Minimal program demonstrating hipfft HIPFFT_PARSE_ERROR on ROCm 7 / hipFFT 1.0.20. + * Isolating the hipfft HIPFFT_PARSE_ERROR on ROCm 7 / hipFFT 1.0.20. * - * Bug: hipfftMakePlanMany returns HIPFFT_PARSE_ERROR (12) for transform sizes - * smaller than 32 when a hipMalloc is issued before plan creation. + * Tests three orderings with an empty rocFFT cache to find which GPU + * operation before plan creation triggers the failure: + * A) hipMalloc only — hypothesis: passes (no async GPU work) + * B) hipMalloc + hipMemset — hypothesis: fails (async GPU work in flight) + * C) hipMalloc + hipMemset — hypothesis: passes (work completed before plan) + * + hipDeviceSynchronize * * Compile: * hipcc -o Test_hipfft_bug_fail Test_hipfft_bug_fail.cc -lhipfft * - * Run with empty rocFFT cache: - * rm -rf ~/.cache/rocfft + * Run with empty cache: + * rm -rf ~/.cache/ * ./Test_hipfft_bug_fail - * - * Expected (broken): hipfftMakePlanMany returns 12 (HIPFFT_PARSE_ERROR) for G < 32. - * See Test_hipfft_bug_pass.cc for the workaround. */ #include #include #include +static const char *res(hipfftResult rv) { + return rv == HIPFFT_SUCCESS ? "SUCCESS" : "PARSE_ERROR"; +} + +static hipfftResult makePlan(int G, int howmany) { + int n[] = {G}; + hipfftHandle p; + size_t workSize = 0; + hipfftCreate(&p); + hipfftResult rv = hipfftMakePlanMany(p, 1, n, + nullptr, 1, G, nullptr, 1, G, + HIPFFT_Z2Z, howmany, &workSize); + hipfftDestroy(p); + return rv; +} + int main(void) { hipDeviceProp_t prop; hipGetDeviceProperties(&prop, 0); @@ -30,22 +47,25 @@ int main(void) { for (int G : {8, 16, 32}) { int howmany = 512; - int n[] = {G}; long nelems = (long)G * howmany; - - // hipMalloc BEFORE plan creation — triggers HIPFFT_PARSE_ERROR for G < 32 hipfftDoubleComplex *buf = nullptr; hipMalloc(&buf, nelems * sizeof(hipfftDoubleComplex)); - hipfftHandle p; - size_t workSize = 0; - hipfftCreate(&p); - hipfftResult rv = hipfftMakePlanMany(p, 1, n, - nullptr, 1, G, nullptr, 1, G, - HIPFFT_Z2Z, howmany, &workSize); - printf("G=%-4d hipMalloc-then-plan: %d (%s)\n", - G, (int)rv, rv == HIPFFT_SUCCESS ? "HIPFFT_SUCCESS" : "HIPFFT_PARSE_ERROR"); - hipfftDestroy(p); + // A: hipMalloc only, no GPU work + hipfftResult rvA = makePlan(G, howmany); + printf("G=%-4d A) hipMalloc only : %s\n", G, res(rvA)); + + // B: hipMalloc + hipMemset (async GPU work in flight) + hipMemset(buf, 0, nelems * sizeof(hipfftDoubleComplex)); + hipfftResult rvB = makePlan(G, howmany); + printf("G=%-4d B) hipMalloc + hipMemset : %s\n", G, res(rvB)); + + // C: hipMalloc + hipMemset + sync before plan + hipMemset(buf, 0, nelems * sizeof(hipfftDoubleComplex)); + hipDeviceSynchronize(); + hipfftResult rvC = makePlan(G, howmany); + printf("G=%-4d C) hipMalloc + hipMemset + sync: %s\n\n", G, res(rvC)); + hipFree(buf); } From 3f0fdbb597b044d6dd1d11ad1c31082c29470a16 Mon Sep 17 00:00:00 2001 From: Peter Boyle Date: Tue, 19 May 2026 22:10:16 -0400 Subject: [PATCH 34/44] tests/debug: test hipMemset variant before cache is populated Co-Authored-By: Claude Sonnet 4.6 --- tests/debug/Test_hipfft_bug_fail.cc | 15 +++++++++------ 1 file changed, 9 insertions(+), 6 deletions(-) diff --git a/tests/debug/Test_hipfft_bug_fail.cc b/tests/debug/Test_hipfft_bug_fail.cc index e2ad736c..819ce3d3 100644 --- a/tests/debug/Test_hipfft_bug_fail.cc +++ b/tests/debug/Test_hipfft_bug_fail.cc @@ -51,20 +51,23 @@ int main(void) { hipfftDoubleComplex *buf = nullptr; hipMalloc(&buf, nelems * sizeof(hipfftDoubleComplex)); - // A: hipMalloc only, no GPU work - hipfftResult rvA = makePlan(G, howmany); - printf("G=%-4d A) hipMalloc only : %s\n", G, res(rvA)); + // Tests ordered so each runs before a prior success can populate the cache. - // B: hipMalloc + hipMemset (async GPU work in flight) + // B first: hipMalloc + hipMemset (async GPU work in flight) + // If this fails, A (no hipMemset) will pass, confirming hipMemset is the trigger. hipMemset(buf, 0, nelems * sizeof(hipfftDoubleComplex)); hipfftResult rvB = makePlan(G, howmany); printf("G=%-4d B) hipMalloc + hipMemset : %s\n", G, res(rvB)); - // C: hipMalloc + hipMemset + sync before plan + // C: hipMalloc + hipMemset + sync — does syncing before plan creation fix it? hipMemset(buf, 0, nelems * sizeof(hipfftDoubleComplex)); hipDeviceSynchronize(); hipfftResult rvC = makePlan(G, howmany); - printf("G=%-4d C) hipMalloc + hipMemset + sync: %s\n\n", G, res(rvC)); + printf("G=%-4d C) hipMalloc + hipMemset + sync: %s\n", G, res(rvC)); + + // A last: hipMalloc only, no async GPU work — should always pass + hipfftResult rvA = makePlan(G, howmany); + printf("G=%-4d A) hipMalloc only : %s\n\n", G, res(rvA)); hipFree(buf); } From 79ccc81a864274b8286e51f4feec8f6efee58f0e Mon Sep 17 00:00:00 2001 From: Peter Boyle Date: Tue, 19 May 2026 22:21:52 -0400 Subject: [PATCH 35/44] tests/debug: add G=4 to hipfft fail reproducer Co-Authored-By: Claude Sonnet 4.6 --- tests/debug/Test_hipfft_bug_fail.cc | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tests/debug/Test_hipfft_bug_fail.cc b/tests/debug/Test_hipfft_bug_fail.cc index 819ce3d3..990250d1 100644 --- a/tests/debug/Test_hipfft_bug_fail.cc +++ b/tests/debug/Test_hipfft_bug_fail.cc @@ -45,7 +45,7 @@ int main(void) { hipfftVersionMajor, hipfftVersionMinor, hipfftVersionPatch); #endif - for (int G : {8, 16, 32}) { + for (int G : {4, 8, 16, 32}) { int howmany = 512; long nelems = (long)G * howmany; hipfftDoubleComplex *buf = nullptr; From 50aa51f93ad6074eefe42d906e0a8139b7fbc062 Mon Sep 17 00:00:00 2001 From: Peter Boyle Date: Tue, 19 May 2026 22:27:27 -0400 Subject: [PATCH 36/44] =?UTF-8?q?debug:=20add=20Test=5Fhipfft=5Frepro=20?= =?UTF-8?q?=E2=80=94=20reproducer=20for=20hipFFT=20PARSE=5FERROR=20on=20RO?= =?UTF-8?q?Cm=207?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-Authored-By: Claude Sonnet 4.6 --- tests/debug/Test_hipfft_repro.cc | 168 +++++++++++++++++++++++++++++++ 1 file changed, 168 insertions(+) create mode 100644 tests/debug/Test_hipfft_repro.cc diff --git a/tests/debug/Test_hipfft_repro.cc b/tests/debug/Test_hipfft_repro.cc new file mode 100644 index 00000000..12f0d3d6 --- /dev/null +++ b/tests/debug/Test_hipfft_repro.cc @@ -0,0 +1,168 @@ +/* + * Reproducer for HIPFFT_PARSE_ERROR (error 12) from hipfftMakePlanMany on + * ROCm 7 / hipFFT 1.0.20 (Frontier, MI210 login and MI250X compute nodes). + * + * Observed failure: G < 32 returns HIPFFT_PARSE_ERROR from all three plan + * creation APIs (hipfftPlanMany, hipfftMakePlanMany, hipfftPlan1d) when a + * device buffer is allocated and zeroed with hipMalloc+hipMemset before the + * plan creation call. G >= 32 succeeds. + * + * Contrast with Test_hipfft_minimal.cc (plan-first ordering) which passes + * for all G even with an empty rocFFT cache. + * + * Compile on Frontier (no Grid headers needed): + * hipcc -o Test_hipfft_repro Test_hipfft_repro.cc -lhipfft + * + * Run with empty cache to reproduce the failure: + * rm -rf ~/.cache/rocfft + * ./Test_hipfft_repro + */ + +#include +#include +#include +#include + +static const char *hipfftResultString(hipfftResult r) { + switch (r) { + case HIPFFT_SUCCESS: return "HIPFFT_SUCCESS"; + case HIPFFT_INVALID_PLAN: return "HIPFFT_INVALID_PLAN"; + case HIPFFT_ALLOC_FAILED: return "HIPFFT_ALLOC_FAILED"; + case HIPFFT_INVALID_TYPE: return "HIPFFT_INVALID_TYPE"; + case HIPFFT_INVALID_VALUE: return "HIPFFT_INVALID_VALUE"; + case HIPFFT_INTERNAL_ERROR: return "HIPFFT_INTERNAL_ERROR"; + case HIPFFT_EXEC_FAILED: return "HIPFFT_EXEC_FAILED"; + case HIPFFT_SETUP_FAILED: return "HIPFFT_SETUP_FAILED"; + case HIPFFT_INVALID_SIZE: return "HIPFFT_INVALID_SIZE"; + case HIPFFT_UNALIGNED_DATA: return "HIPFFT_UNALIGNED_DATA"; + case HIPFFT_INCOMPLETE_PARAMETER_LIST:return "HIPFFT_INCOMPLETE_PARAMETER_LIST"; + case HIPFFT_INVALID_DEVICE: return "HIPFFT_INVALID_DEVICE"; + case HIPFFT_PARSE_ERROR: return "HIPFFT_PARSE_ERROR"; + case HIPFFT_NO_WORKSPACE: return "HIPFFT_NO_WORKSPACE"; + case HIPFFT_NOT_IMPLEMENTED: return "HIPFFT_NOT_IMPLEMENTED"; + case HIPFFT_NOT_SUPPORTED: return "HIPFFT_NOT_SUPPORTED"; + default: return "UNKNOWN"; + } +} + +// Plan creation + execution for (G, howmany) using hipfftCreate+hipfftMakePlanMany. +// This is the path Grid's FFT.h now uses. +static void tryPlanAndExec(int G, long howmany) { + int n[] = {G}; + long nelems = (long)G * howmany; + + printf("--- G=%-4d howmany=%-10ld total_elems=%-12ld ---\n", + G, howmany, nelems); + + // Allocate device buffer (hipfftDoubleComplex = 16 bytes each) + hipfftDoubleComplex *dbuf = nullptr; + hipError_t herr = hipMalloc(&dbuf, nelems * sizeof(hipfftDoubleComplex)); + if (herr != hipSuccess) { + printf(" hipMalloc failed (%d) for %ld elems — skipping\n\n", (int)herr, nelems); + return; + } + hipMemset(dbuf, 0, nelems * sizeof(hipfftDoubleComplex)); + + // 1. hipfftPlanMany (one-step, nullptr embed) — current Grid path + { + hipfftHandle p; + hipfftResult rv = hipfftPlanMany(&p, 1, n, + nullptr, 1, G, + nullptr, 1, G, + HIPFFT_Z2Z, (int)howmany); + printf(" hipfftPlanMany create : %d (%s)\n", (int)rv, hipfftResultString(rv)); + if (rv == HIPFFT_SUCCESS) { + rv = hipfftExecZ2Z(p, dbuf, dbuf, HIPFFT_FORWARD); + hipDeviceSynchronize(); + printf(" hipfftPlanMany execFwd: %d (%s)\n", (int)rv, hipfftResultString(rv)); + hipfftDestroy(p); + } + } + + // 2. hipfftCreate + hipfftMakePlanMany (two-step) — also current Grid path + { + hipfftHandle p; + size_t workSize = 0; + hipfftResult rc = hipfftCreate(&p); + if (rc == HIPFFT_SUCCESS) { + hipfftResult rv = hipfftMakePlanMany(p, 1, n, + nullptr, 1, G, + nullptr, 1, G, + HIPFFT_Z2Z, (int)howmany, &workSize); + printf(" hipfftMakePlanMany : %d (%s) workSize=%zu\n", + (int)rv, hipfftResultString(rv), workSize); + if (rv == HIPFFT_SUCCESS) { + rv = hipfftExecZ2Z(p, dbuf, dbuf, HIPFFT_FORWARD); + hipDeviceSynchronize(); + printf(" hipfftMakePlanMany exec : %d (%s)\n", (int)rv, hipfftResultString(rv)); + } + hipfftDestroy(p); + } else { + printf(" hipfftCreate : %d (%s)\n", (int)rc, hipfftResultString(rc)); + } + } + + // 3. hipfftPlan1d (simplest API, batch = howmany) + { + hipfftHandle p; + hipfftResult rv = hipfftPlan1d(&p, G, HIPFFT_Z2Z, (int)howmany); + printf(" hipfftPlan1d create : %d (%s)\n", (int)rv, hipfftResultString(rv)); + if (rv == HIPFFT_SUCCESS) { + rv = hipfftExecZ2Z(p, dbuf, dbuf, HIPFFT_FORWARD); + hipDeviceSynchronize(); + printf(" hipfftPlan1d execFwd: %d (%s)\n", (int)rv, hipfftResultString(rv)); + hipfftDestroy(p); + } + } + + hipFree(dbuf); + printf("\n"); +} + +int main(void) { + // Print HIP device info + int device = 0; + hipGetDevice(&device); + hipDeviceProp_t prop; + hipGetDeviceProperties(&prop, device); + printf("Device %d: %s warpSize=%d\n\n", device, prop.name, prop.warpSize); + +#ifdef hipfftVersionMinor + printf("hipFFT version: %d.%d.%d\n\n", + hipfftVersionMajor, hipfftVersionMinor, hipfftVersionPatch); +#endif + + // Original sweep with small howmany (these passed first time) + printf("=== Small howmany (original sweep) ===\n\n"); + for (int G : {4, 8, 12, 16, 24, 32, 48, 64}) + tryPlanAndExec(G, 512); + + // Grid-realistic howmany values derived from actual lattice geometries. + // howmany = Ncomp * product(ldimensions[d] for d != dim) + // For LatticeComplexD: Ncomp=1. + printf("=== Grid-realistic parameters ===\n\n"); + + // --grid 16.16.16.16 4D FFT (KNOWN TO FAIL in Grid) + // Each dim: G=16, Nperp=16^3=4096 + tryPlanAndExec(16, 4096); + + // --grid 32.32.32.32 4D FFT (KNOWN TO SUCCEED in Grid) + // Each dim: G=32, Nperp=32^3=32768 + tryPlanAndExec(32, 32768); + + // --grid 32.32.32.32 Ls=8 5D DWF FFT (KNOWN TO FAIL on dim 0 in Grid) + // dim 0: G=8, Nperp=32^4=1048576 + tryPlanAndExec(8, 1048576); + // dim 1-4: G=32, Nperp=8*32^3=262144 + tryPlanAndExec(32, 262144); + + // Extra intermediate cases to bracket the failure + tryPlanAndExec(16, 1024); + tryPlanAndExec(16, 2048); + tryPlanAndExec(16, 8192); + tryPlanAndExec(8, 4096); + tryPlanAndExec(8, 65536); + tryPlanAndExec(8, 262144); + + return 0; +} From 29198efabef380df4b07370f6fbbbaf7693fe990 Mon Sep 17 00:00:00 2001 From: Peter Boyle Date: Wed, 20 May 2026 17:53:17 -0400 Subject: [PATCH 37/44] FFT: add FFTbase, PlannedFFT; factor FFT_dim_execute free function Co-Authored-By: Claude Sonnet 4.6 --- Grid/algorithms/FFT.h | 462 +++++++++++++++++++++--------------------- 1 file changed, 229 insertions(+), 233 deletions(-) diff --git a/Grid/algorithms/FFT.h b/Grid/algorithms/FFT.h index 8bacc274..c9d9d6a8 100644 --- a/Grid/algorithms/FFT.h +++ b/Grid/algorithms/FFT.h @@ -28,10 +28,6 @@ Author: Peter Boyle #ifndef _GRID_FFT_H_ #define _GRID_FFT_H_ -#include -#include -#include - #ifdef GRID_CUDA #include #endif @@ -74,14 +70,8 @@ public: FFTW_scalar *out, int *onembed, int ostride, int odist, int sign, unsigned flags) { - // hipfftPlanMany (one-step) triggers HIPFFT_PARSE_ERROR (12) on some - // ROCm versions. The two-step hipfftCreate + hipfftMakePlanMany is - // more robust across ROCm releases. FFTW_plan p; - size_t workSize; - auto rc = hipfftCreate(&p); - GRID_ASSERT(rc==HIPFFT_SUCCESS); - auto rv = hipfftMakePlanMany(p,rank,n,nullptr,istride,idist,nullptr,ostride,odist,HIPFFT_Z2Z,howmany,&workSize); + auto rv = hipfftPlanMany(&p,rank,n,n,istride,idist,n,ostride,odist,HIPFFT_Z2Z,howmany); GRID_ASSERT(rv==HIPFFT_SUCCESS); return p; } @@ -107,10 +97,7 @@ public: int ostride, int odist, int sign, unsigned flags) { FFTW_plan p; - size_t workSize; - auto rc = hipfftCreate(&p); - GRID_ASSERT(rc==HIPFFT_SUCCESS); - auto rv = hipfftMakePlanMany(p,rank,n,nullptr,istride,idist,nullptr,ostride,odist,HIPFFT_C2C,howmany,&workSize); + auto rv = hipfftPlanMany(&p,rank,n,n,istride,idist,n,ostride,odist,HIPFFT_C2C,howmany); GRID_ASSERT(rv==HIPFFT_SUCCESS); return p; } @@ -213,28 +200,12 @@ public: #endif #endif -class FFT { -private: - - double flops; - double flops_call; - uint64_t usec; +struct FFTbase { + double flops; + double flops_call; + uint64_t usec; GridCartesian *_grid; - // Type-erased plan entry. The handle is recovered via - // std::any_cast::FFTW_plan> inside FFT_dim, which knows the - // scalar type at compile time. - struct PlanEntry { - std::any handle; - std::function destroy; - }; - - std::vector forward_plans; // size Nd when populated, 0 otherwise - std::vector backward_plans; - std::type_index _plan_type { typeid(void) }; // vobj type plans were built for - -public: - static const int forward = FFTW_FORWARD; static const int backward = FFTW_BACKWARD; @@ -242,70 +213,166 @@ public: double MFlops(void) { return flops / usec; } double USec(void) { return (double)usec; } - FFT(GridCartesian *grid) : _grid(grid), flops(0), usec(0) {} + FFTbase(GridCartesian *grid) : _grid(grid), flops(0), flops_call(0), usec(0) {} +}; - ~FFT() { - if (forward_plans.size() > 0) PlanDestroy(); - } +// Barrel-shift gather, FFT execute, and insert. Called by both FFT and PlannedFFT. +// The caller is responsible for plan acquisition and destruction. +template +static void FFT_dim_execute( + Lattice &result, + const Lattice &source, + int dim, int sign, + typename FFTW::FFTW_plan p, + GridCartesian *grid, + double &flops, double &flops_call, uint64_t &usec) +{ + typedef typename vobj::scalar_type scalar; + typedef typename vobj::scalar_object sobj; + typedef typename vobj::scalar_type scalar_type; + typedef typename vobj::vector_type vector_type; + typedef typename FFTW::FFTW_scalar FFTW_scalar; - // Explicitly pre-create and cache plans for all Nd dimensions. - // Optional: FFT_dim will call this lazily on first use if not called. - // Asserts that no plans already exist; call PlanDestroy first to re-create. - template - void PlanCreate() { - GRID_ASSERT(forward_plans.size() == 0); + const int Ndim = grid->Nd(); + int L = grid->_ldimensions[dim]; + int G = grid->_fdimensions[dim]; + int Ncomp = sizeof(sobj) / sizeof(scalar); + int64_t Nlow = 1, Nhigh = 1; + for (int d = 0; d < dim; d++) Nlow *= grid->_ldimensions[d]; + for (int d = dim+1; d < Ndim; d++) Nhigh *= grid->_ldimensions[d]; + int64_t Nperp = Nlow * Nhigh; - typedef typename vobj::scalar_type scalar; - typedef typename vobj::scalar_object sobj; - typedef typename FFTW::FFTW_scalar FFTW_scalar; - typedef typename FFTW::FFTW_plan FFTW_plan; + deviceVector pgbuf(Nperp * Ncomp * G); + scalar *pgbuf_v = &pgbuf[0]; + int howmany = Ncomp * Nperp; - const int Ndim = _grid->Nd(); - forward_plans.resize(Ndim); - backward_plans.resize(Ndim); + scalar div; + if (sign == FFTW_BACKWARD) div = 1.0 / G; + else if (sign == FFTW_FORWARD) div = 1.0; + else GRID_ASSERT(0); - for (int d = 0; d < Ndim; d++) { - int G = _grid->_fdimensions[d]; - int Ncomp = sizeof(sobj) / sizeof(scalar); - int64_t Nperp = 1; - for (int dd = 0; dd < Ndim; dd++) - if (dd != d) Nperp *= _grid->_ldimensions[dd]; - int howmany = Ncomp * (int)Nperp; - int n[] = {G}; + double t_pencil = 0, t_fft = 0, t_copy = 0, t_shift = 0; + double t_total = -usecond(); - // GPU backends (cuFFT/hipFFT) ignore the buffer pointer at plan creation. - // CPU FFTW with FFTW_ESTIMATE inspects only alignment and never touches data. - // Use a host stack buffer: a device allocation here triggers a rocFFT RTC - // bug on ROCm 7 that causes plan creation to fail for small transform sizes. - scalar stack_dummy[2] = {}; - FFTW_scalar *buf = (FFTW_scalar *)stack_dummy; + result = source; + int pc = grid->_processor_coor[dim]; - { - FFTW_plan p = FFTW::fftw_plan_many_dft( - 1, n, howmany, buf, n, 1, G, buf, n, 1, G, FFTW_FORWARD, FFTW_ESTIMATE); - forward_plans[d] = { p, [p](){ FFTW::fftw_destroy_plan(p); } }; - } - { - FFTW_plan p = FFTW::fftw_plan_many_dft( - 1, n, howmany, buf, n, 1, G, buf, n, 1, G, FFTW_BACKWARD, FFTW_ESTIMATE); - backward_plans[d] = { p, [p](){ FFTW::fftw_destroy_plan(p); } }; + const Coordinate ldims = grid->_ldimensions; + const Coordinate rdims = grid->_rdimensions; + const Coordinate sdims = grid->_simd_layout; + const Coordinate processors = grid->_processors; + + Coordinate pgdims(Ndim); + pgdims[0] = G; + for (int d = 0, dd = 1; d < Ndim; d++) + if (d != dim) pgdims[dd++] = ldims[d]; + int64_t pgvol = 1; + for (int d = 0; d < Ndim; d++) pgvol *= pgdims[d]; + + const int Nsimd = vobj::Nsimd(); + t_pencil = -usecond(); + for (int p_idx = 0; p_idx < processors[dim]; p_idx++) { + t_copy -= usecond(); + autoView(r_v, result, AcceleratorRead); + accelerator_for(idx, grid->oSites(), vobj::Nsimd(), { +#ifdef GRID_SIMT + { + int lane = acceleratorSIMTlane(Nsimd); +#else + for (int lane = 0; lane < Nsimd; lane++) { +#endif + Coordinate icoor, ocoor, pgcoor; + Lexicographic::CoorFromIndex(icoor, lane, sdims); + Lexicographic::CoorFromIndex(ocoor, idx, rdims); + pgcoor[0] = ocoor[dim] + icoor[dim]*rdims[dim] + ((pc+p_idx)%processors[dim])*L; + for (int d = 0, dd = 1; d < Ndim; d++) + if (d != dim) { pgcoor[dd] = ocoor[d] + icoor[d]*rdims[d]; dd++; } + int64_t pgidx; + Lexicographic::IndexFromCoor(pgcoor, pgidx, pgdims); + vector_type *from = (vector_type *)&r_v[idx]; + scalar_type stmp; + for (int w = 0; w < Ncomp; w++) { + stmp = getlane(from[w], lane); + pgbuf_v[pgidx + w*pgvol] = stmp; } +#ifdef GRID_SIMT + } +#else + } +#endif + }); + t_copy += usecond(); + if (p_idx != processors[dim] - 1) { + Lattice temp(grid); + t_shift -= usecond(); + temp = Cshift(result, dim, L); result = temp; + t_shift += usecond(); } - - _plan_type = std::type_index(typeid(vobj)); } + t_pencil += usecond(); - void PlanDestroy() { - for (auto &e : forward_plans) e.destroy(); - for (auto &e : backward_plans) e.destroy(); - forward_plans.resize(0); - backward_plans.resize(0); - _plan_type = std::type_index(typeid(void)); + FFTW_scalar *in = (FFTW_scalar *)pgbuf_v; + FFTW_scalar *out = (FFTW_scalar *)pgbuf_v; + t_fft = -usecond(); + FFTW::fftw_execute_dft(p, in, out, sign); + t_fft += usecond(); + + flops_call = 5.0 * howmany * G * log2(G); + usec = t_fft; + flops = flops_call; + + result = Zero(); + double t_insert = -usecond(); + { + autoView(r_v, result, AcceleratorWrite); + accelerator_for(idx, grid->oSites(), Nsimd, { +#ifdef GRID_SIMT + { + int lane = acceleratorSIMTlane(Nsimd); +#else + for (int lane = 0; lane < Nsimd; lane++) { +#endif + Coordinate icoor(Ndim), ocoor(Ndim), pgcoor(Ndim); + Lexicographic::CoorFromIndex(icoor, lane, sdims); + Lexicographic::CoorFromIndex(ocoor, idx, rdims); + pgcoor[0] = ocoor[dim] + icoor[dim]*rdims[dim] + pc*L; + for (int d = 0, dd = 1; d < Ndim; d++) + if (d != dim) { pgcoor[dd] = ocoor[d] + icoor[d]*rdims[d]; dd++; } + int64_t pgidx; + Lexicographic::IndexFromCoor(pgcoor, pgidx, pgdims); + vector_type *to = (vector_type *)&r_v[idx]; + scalar_type stmp; + for (int w = 0; w < Ncomp; w++) { + stmp = pgbuf_v[pgidx + w*pgvol]; + putlane(to[w], stmp, lane); + } +#ifdef GRID_SIMT + } +#else + } +#endif + }); } + result = result * div; + t_insert += usecond(); + t_total += usecond(); + + std::cout << GridLogPerformance << " FFT took " << t_total/1.0e6 << " s" << std::endl; + std::cout << GridLogPerformance << " FFT pencil " << t_pencil/1.0e6 << " s" << std::endl; + std::cout << GridLogPerformance << " of which copy " << t_copy/1.0e6 << " s" << std::endl; + std::cout << GridLogPerformance << " of which shift" << t_shift/1.0e6 << " s" << std::endl; + std::cout << GridLogPerformance << " FFT kernels " << t_fft/1.0e6 << " s" << std::endl; + std::cout << GridLogPerformance << " FFT insert " << t_insert/1.0e6 << " s" << std::endl; +} + +class FFT : public FFTbase { +public: + FFT(GridCartesian *grid) : FFTbase(grid) {} + ~FFT() {} template void FFT_dim_mask(Lattice &result, const Lattice &source, Coordinate mask, int sign) { - const int Ndim = source.Grid()->Nd(); + const int Ndim = _grid->Nd(); Lattice tmp = source; for (int d = 0; d < Ndim; d++) { if (mask[d]) { @@ -317,180 +384,109 @@ public: template void FFT_all_dim(Lattice &result, const Lattice &source, int sign) { - const int Ndim = source.Grid()->Nd(); - Coordinate mask(Ndim, 1); + Coordinate mask(_grid->Nd(), 1); FFT_dim_mask(result, source, mask, sign); } template void FFT_dim(Lattice &result, const Lattice &source, int dim, int sign) { - const int Ndim = source.Grid()->Nd(); - GridBase *grid = source.Grid(); + GRID_ASSERT(source.Grid() == _grid); + GRID_ASSERT(result.Grid() == _grid); conformable(result.Grid(), source.Grid()); - int L = grid->_ldimensions[dim]; - int G = grid->_fdimensions[dim]; - + typedef typename vobj::scalar_type scalar; typedef typename vobj::scalar_object sobj; - typedef typename vobj::scalar_type scalar_type; - typedef typename vobj::vector_type vector_type; + typedef typename FFTW::FFTW_scalar FFTW_scalar; + typedef typename FFTW::FFTW_plan FFTW_plan; - typedef typename FFTW::FFTW_scalar FFTW_scalar; - typedef typename FFTW::FFTW_plan FFTW_plan; - - int Ncomp = sizeof(sobj) / sizeof(scalar_type); - int64_t Nlow = 1; - int64_t Nhigh = 1; - for (int d = 0; d < dim; d++) Nlow *= grid->_ldimensions[d]; - for (int d = dim+1; d < Ndim; d++) Nhigh *= grid->_ldimensions[d]; - int64_t Nperp = Nlow * Nhigh; - - deviceVector pgbuf(Nperp * Ncomp * G); // [perp][component][dim] - scalar_type *pgbuf_v = &pgbuf[0]; - - int rank = 1; + const int Ndim = _grid->Nd(); + int G = _grid->_fdimensions[dim]; + int Ncomp = sizeof(sobj) / sizeof(scalar); + int64_t Nperp = 1; + for (int d = 0; d < Ndim; d++) + if (d != dim) Nperp *= _grid->_ldimensions[d]; int n[] = {G}; int howmany = Ncomp * Nperp; - int idist = G, odist = G, istride = 1, ostride = 1; - int *inembed = n, *onembed = n; - scalar_type div; - if (sign == backward) div = 1.0 / G; - else if (sign == forward) div = 1.0; - else GRID_ASSERT(0); + deviceVector dummy(2); + FFTW_scalar *buf = (FFTW_scalar *)&dummy[0]; + FFTW_plan p = FFTW::fftw_plan_many_dft(1, n, howmany, + buf, n, 1, G, + buf, n, 1, G, + sign, FFTW_ESTIMATE); + FFT_dim_execute(result, source, dim, sign, p, _grid, flops, flops_call, usec); + FFTW::fftw_destroy_plan(p); + } +}; - // Populate cache on first call; subsequent calls check type consistency. - if (forward_plans.size() == 0) PlanCreate(); - GRID_ASSERT(forward_plans.size() == (size_t)Ndim); - GRID_ASSERT(std::type_index(typeid(vobj)) == _plan_type); +template +class PlannedFFT : public FFTbase { +private: + typedef typename vobj::scalar_type scalar; + typedef typename vobj::scalar_object sobj; + typedef typename vobj::vector_type vector_type; + typedef typename FFTW::FFTW_scalar FFTW_scalar; + typedef typename FFTW::FFTW_plan FFTW_plan; - auto &plans = (sign == forward) ? forward_plans : backward_plans; - FFTW_plan p = std::any_cast(plans[dim].handle); + std::vector forward_plans; + std::vector backward_plans; - double t_pencil = 0; - double t_fft = 0; - double t_copy = 0; - double t_shift = 0; - double t_total = -usecond(); + void PlanCreate() { + const int Ndim = _grid->Nd(); + forward_plans.resize(Ndim); + backward_plans.resize(Ndim); - // Barrel-shift gather: accumulate global pencil into pgbuf - result = source; - int pc = grid->_processor_coor[dim]; + for (int d = 0; d < Ndim; d++) { + int G = _grid->_fdimensions[d]; + int Ncomp = sizeof(sobj) / sizeof(scalar); + int64_t Nperp = 1; + for (int dd = 0; dd < Ndim; dd++) + if (dd != d) Nperp *= _grid->_ldimensions[dd]; + int howmany = Ncomp * (int)Nperp; + int n[] = {G}; - const Coordinate ldims = grid->_ldimensions; - const Coordinate rdims = grid->_rdimensions; - const Coordinate sdims = grid->_simd_layout; - Coordinate processors = grid->_processors; + deviceVector dummy(2); + FFTW_scalar *buf = (FFTW_scalar *)&dummy[0]; - Coordinate pgdims(Ndim); - pgdims[0] = G; - for (int d = 0, dd = 1; d < Ndim; d++) - if (d != dim) pgdims[dd++] = ldims[d]; - int64_t pgvol = 1; - for (int d = 0; d < Ndim; d++) pgvol *= pgdims[d]; + forward_plans[d] = FFTW::fftw_plan_many_dft(1, n, howmany, buf, n, 1, G, buf, n, 1, G, FFTW_FORWARD, FFTW_ESTIMATE); + backward_plans[d] = FFTW::fftw_plan_many_dft(1, n, howmany, buf, n, 1, G, buf, n, 1, G, FFTW_BACKWARD, FFTW_ESTIMATE); + } + } - const int Nsimd = vobj::Nsimd(); - t_pencil = -usecond(); - for (int p_idx = 0; p_idx < processors[dim]; p_idx++) { - t_copy -= usecond(); - autoView(r_v, result, AcceleratorRead); - accelerator_for(idx, grid->oSites(), vobj::Nsimd(), { -#ifdef GRID_SIMT - { - int lane = acceleratorSIMTlane(Nsimd); -#else - for (int lane = 0; lane < Nsimd; lane++) { -#endif - Coordinate icoor, ocoor, pgcoor; - Lexicographic::CoorFromIndex(icoor, lane, sdims); - Lexicographic::CoorFromIndex(ocoor, idx, rdims); + void PlanDestroy() { + for (auto p : forward_plans) FFTW::fftw_destroy_plan(p); + for (auto p : backward_plans) FFTW::fftw_destroy_plan(p); + forward_plans.clear(); + backward_plans.clear(); + } - pgcoor[0] = ocoor[dim] + icoor[dim]*rdims[dim] + ((pc+p_idx)%processors[dim])*L; - for (int d = 0, dd = 1; d < Ndim; d++) { - if (d != dim) { pgcoor[dd] = ocoor[d] + icoor[d]*rdims[d]; dd++; } - } - int64_t pgidx; - Lexicographic::IndexFromCoor(pgcoor, pgidx, pgdims); +public: + PlannedFFT(GridCartesian *grid) : FFTbase(grid) { PlanCreate(); } + ~PlannedFFT() { PlanDestroy(); } - vector_type *from = (vector_type *)&r_v[idx]; - scalar_type stmp; - for (int w = 0; w < Ncomp; w++) { - stmp = getlane(from[w], lane); - pgbuf_v[pgidx + w*pgvol] = stmp; - } -#ifdef GRID_SIMT - } -#else - } -#endif - }); - t_copy += usecond(); - - if (p_idx != processors[dim] - 1) { - Lattice temp(grid); - t_shift -= usecond(); - temp = Cshift(result, dim, L); result = temp; - t_shift += usecond(); + void FFT_dim_mask(Lattice &result, const Lattice &source, Coordinate mask, int sign) { + const int Ndim = _grid->Nd(); + Lattice tmp = source; + for (int d = 0; d < Ndim; d++) { + if (mask[d]) { + FFT_dim(result, tmp, d, sign); + tmp = result; } } - t_pencil += usecond(); + } - FFTW_scalar *in = (FFTW_scalar *)pgbuf_v; - FFTW_scalar *out = (FFTW_scalar *)pgbuf_v; - t_fft = -usecond(); - FFTW::fftw_execute_dft(p, in, out, sign); - t_fft += usecond(); + void FFT_all_dim(Lattice &result, const Lattice &source, int sign) { + Coordinate mask(_grid->Nd(), 1); + FFT_dim_mask(result, source, mask, sign); + } - flops_call = 5.0 * howmany * G * log2(G); - usec = t_fft; - flops = flops_call; - - result = Zero(); - double t_insert = -usecond(); - { - autoView(r_v, result, AcceleratorWrite); - accelerator_for(idx, grid->oSites(), Nsimd, { -#ifdef GRID_SIMT - { - int lane = acceleratorSIMTlane(Nsimd); -#else - for (int lane = 0; lane < Nsimd; lane++) { -#endif - Coordinate icoor(Ndim), ocoor(Ndim), pgcoor(Ndim); - Lexicographic::CoorFromIndex(icoor, lane, sdims); - Lexicographic::CoorFromIndex(ocoor, idx, rdims); - - pgcoor[0] = ocoor[dim] + icoor[dim]*rdims[dim] + pc*L; - for (int d = 0, dd = 1; d < Ndim; d++) { - if (d != dim) { pgcoor[dd] = ocoor[d] + icoor[d]*rdims[d]; dd++; } - } - int64_t pgidx; - Lexicographic::IndexFromCoor(pgcoor, pgidx, pgdims); - - vector_type *to = (vector_type *)&r_v[idx]; - scalar_type stmp; - for (int w = 0; w < Ncomp; w++) { - stmp = pgbuf_v[pgidx + w*pgvol]; - putlane(to[w], stmp, lane); - } -#ifdef GRID_SIMT - } -#else - } -#endif - }); - } - - result = result * div; - t_insert += usecond(); - t_total += usecond(); - - std::cout << GridLogPerformance << " FFT took " << t_total/1.0e6 << " s" << std::endl; - std::cout << GridLogPerformance << " FFT pencil " << t_pencil/1.0e6 << " s" << std::endl; - std::cout << GridLogPerformance << " of which copy " << t_copy/1.0e6 << " s" << std::endl; - std::cout << GridLogPerformance << " of which shift " << t_shift/1.0e6 << " s" << std::endl; - std::cout << GridLogPerformance << " FFT kernels " << t_fft/1.0e6 << " s" << std::endl; - std::cout << GridLogPerformance << " FFT insert " << t_insert/1.0e6 << " s" << std::endl; + void FFT_dim(Lattice &result, const Lattice &source, int dim, int sign) { + GRID_ASSERT(source.Grid() == _grid); + GRID_ASSERT(result.Grid() == _grid); + GRID_ASSERT((int)forward_plans.size() == _grid->Nd()); + conformable(result.Grid(), source.Grid()); + FFTW_plan p = (sign == forward ? forward_plans : backward_plans)[dim]; + FFT_dim_execute(result, source, dim, sign, p, _grid, flops, flops_call, usec); } }; From 72b4a061f3632604f5095bd09a2e4221e079fcdc Mon Sep 17 00:00:00 2001 From: Peter Boyle Date: Wed, 20 May 2026 17:54:41 -0400 Subject: [PATCH 38/44] tests/fft: remove PlanDestroy calls (FFT handles plans per-call) Co-Authored-By: Claude Sonnet 4.6 --- tests/core/Test_fft.cc | 1 - tests/core/Test_fftf.cc | 1 - 2 files changed, 2 deletions(-) diff --git a/tests/core/Test_fft.cc b/tests/core/Test_fft.cc index 18a6bfac..b048e31a 100644 --- a/tests/core/Test_fft.cc +++ b/tests/core/Test_fft.cc @@ -113,7 +113,6 @@ int main (int argc, char ** argv) Cref= Cref - C; std::cout << " invertible check " << norm2(Cref)< Date: Wed, 20 May 2026 17:59:24 -0400 Subject: [PATCH 39/44] tests: add Test_planned_fft exercising PlannedFFT Co-Authored-By: Claude Sonnet 4.6 --- tests/core/Test_planned_fft.cc | 321 +++++++++++++++++++++++++++++++++ 1 file changed, 321 insertions(+) create mode 100644 tests/core/Test_planned_fft.cc diff --git a/tests/core/Test_planned_fft.cc b/tests/core/Test_planned_fft.cc new file mode 100644 index 00000000..cf40a5cc --- /dev/null +++ b/tests/core/Test_planned_fft.cc @@ -0,0 +1,321 @@ +/************************************************************************************* + + Grid physics library, www.github.com/paboyle/Grid + + Source file: ./tests/core/Test_planned_fft.cc + + Copyright (C) 2015 + +Author: Azusa Yamaguchi +Author: Peter Boyle + + This program is free software; you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 2 of the License, or + (at your option) any later version. + + This program is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + You should have received a copy of the GNU General Public License along + with this program; if not, write to the Free Software Foundation, Inc., + 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA. + + See the full license in the file "LICENSE" in the top level distribution directory +*************************************************************************************/ +/* END LEGAL */ +#include + +using namespace Grid; + +int main (int argc, char ** argv) +{ + Grid_init(&argc,&argv); + + int threads = GridThread::GetThreads(); + std::cout< theFFT(&GRID); + PlannedFFT theFFT_spin(&GRID); + + Ctilde=C; + std::cout<<" Benchmarking PlannedFFT of LatticeComplex "< seeds({1,2,3,4}); + GridSerialRNG sRNG; sRNG.SeedFixedIntegers(seeds); + GridParallelRNG pRNG(&GRID); + pRNG.SeedFixedIntegers(seeds); + + LatticeGaugeFieldD Umu(&GRID); + SU::ColdConfiguration(pRNG,Umu); + + //////////////////////////////////////////////////// + // Wilson test + //////////////////////////////////////////////////// + { + LatticeFermionD src(&GRID); gaussian(pRNG,src); + LatticeFermionD tmp(&GRID); + LatticeFermionD ref(&GRID); + + RealD mass=0.01; + WilsonFermionD Dw(Umu,GRID,RBGRID,mass); + + Dw.M(src,tmp); + + std::cout << "Dw src = " < theFFT5(FGrid); + + theFFT5.FFT_dim(result5,tmp5,1,FFTbase::forward); tmp5 = result5; + std::cout<<"Fourier xformed Ddwf 1 "< "< HermOp(Ddwf); + ConjugateGradient CG(1.0e-8,10000); + CG(HermOp,src5,result5); + + ExtractSlice(tmp,result5,0 ,sdir); result4 = (tmp-G5*tmp)*0.5; + ExtractSlice(tmp,result5,Ls-1,sdir); result4 = result4+(tmp+G5*tmp)*0.5; + + std::cout << " Taking difference" < Date: Wed, 20 May 2026 18:13:38 -0400 Subject: [PATCH 40/44] Test_planned_fft: fix PlannedFFT template parameter to use ::vector_object Co-Authored-By: Claude Sonnet 4.6 --- tests/core/Test_planned_fft.cc | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/tests/core/Test_planned_fft.cc b/tests/core/Test_planned_fft.cc index cf40a5cc..4778aa1e 100644 --- a/tests/core/Test_planned_fft.cc +++ b/tests/core/Test_planned_fft.cc @@ -78,9 +78,9 @@ int main (int argc, char ** argv) S=Zero(); S = S+C; - // PlannedFFT creates all plans at construction; separate instances per vobj type. - PlannedFFT theFFT(&GRID); - PlannedFFT theFFT_spin(&GRID); + // PlannedFFT is templated on the lattice element type (vector_object), not the Lattice<> itself. + PlannedFFT theFFT(&GRID); + PlannedFFT theFFT_spin(&GRID); Ctilde=C; std::cout<<" Benchmarking PlannedFFT of LatticeComplex "< theFFT5(FGrid); + PlannedFFT theFFT5(FGrid); theFFT5.FFT_dim(result5,tmp5,1,FFTbase::forward); tmp5 = result5; std::cout<<"Fourier xformed Ddwf 1 "< Date: Thu, 21 May 2026 12:05:36 -0400 Subject: [PATCH 41/44] Lattice_reduction_gpu: demote timing logs to Debug, disable by default skills/mpi-heterogeneous: add Bug Class 4 for Frontier GTL/libamdhip64 ABI mismatch Co-Authored-By: Claude Sonnet 4.6 --- Grid/lattice/Lattice_reduction_gpu.h | 8 ++--- skills/mpi-heterogeneous.md | 47 +++++++++++++++++++++++++++- 2 files changed, 50 insertions(+), 5 deletions(-) diff --git a/Grid/lattice/Lattice_reduction_gpu.h b/Grid/lattice/Lattice_reduction_gpu.h index da5eb90f..27388062 100644 --- a/Grid/lattice/Lattice_reduction_gpu.h +++ b/Grid/lattice/Lattice_reduction_gpu.h @@ -198,7 +198,7 @@ __global__ void reduceKernel(const vobj *lat, sobj *buffer, Iterator n) { // Possibly promote to double and sum ///////////////////////////////////////////////////////////////////////////////////////////////////////// -#define GRID_REDUCTION_TIMING +#undef GRID_REDUCTION_TIMING template inline typename vobj::scalar_objectD sumD_gpu_small(const vobj *lat, Integer osites) @@ -230,7 +230,7 @@ inline typename vobj::scalar_objectD sumD_gpu_small(const vobj *lat, Integer osi acceleratorCopyFromDevice(buffer_v,&result,sizeof(result)); #ifdef GRID_REDUCTION_TIMING t_d2h += usecond(); - std::cout << GridLogMessage << " sumD_gpu_small" + std::cout << GridLogDebug << " sumD_gpu_small" << " sizeof(sobj)=" << sizeof(sobj) << " blocks=" << numBlocks << " threads=" << numThreads << " kernel+barrier=" << t_kernel << " us" @@ -362,7 +362,7 @@ inline void sumD_gpu_reduce_words(const vobj *lat, Integer osites, acceleratorCopyFromDevice(buffer_v, &result, sizeof(result)); #ifdef GRID_REDUCTION_TIMING t_d2h += usecond(); - std::cout << GridLogMessage << " sumD_gpu_reduce_words R=" << R + std::cout << GridLogDebug << " sumD_gpu_reduce_words R=" << R << " base=" << base << " kernel=" << t_kernel << " D2H=" << t_d2h << " us" << std::endl; #endif @@ -391,7 +391,7 @@ inline typename vobj::scalar_objectD sumD_gpu_large(const vobj *lat, Integer osi while (w < words) { sumD_gpu_reduce_words< 1>(lat, osites, ret_p, w); w += 1; } #ifdef GRID_REDUCTION_TIMING t_large += usecond(); - std::cout << GridLogMessage << "sumD_gpu_large" + std::cout << GridLogDebug << "sumD_gpu_large" << " sizeof(sobjD)=" << sizeof(sobjD) << " words=" << words << " total=" << t_large << " us" << std::endl; #endif diff --git a/skills/mpi-heterogeneous.md b/skills/mpi-heterogeneous.md index 9ad45523..e71a54c0 100644 --- a/skills/mpi-heterogeneous.md +++ b/skills/mpi-heterogeneous.md @@ -1,6 +1,6 @@ --- name: mpi-heterogeneous -description: Diagnose and work around MPI correctness bugs on heterogeneous (CPU+GPU) systems — device buffer aliasing in MPI_Sendrecv, AARCH64 PLT corruption from libfabric, topology-dependent allreduce hangs, and deterministic point-to-point reduction trees as a replacement for MPI_Allreduce. +description: Diagnose and work around MPI correctness bugs on heterogeneous (CPU+GPU) systems — device buffer aliasing in MPI_Sendrecv, AARCH64 PLT corruption from libfabric, topology-dependent allreduce hangs, mixed-ABI HIP runtime from wrong GTL library (Frontier/ROCm), and deterministic point-to-point reduction trees as a replacement for MPI_Allreduce. user-invocable: true allowed-tools: - Read @@ -110,6 +110,51 @@ void GlobalSumP2P(double *data, int count, MPI_Comm comm) { Grid reference: `USE_GRID_REDUCTION` macro in `Grid/communicator/Communicator_mpi3.cc`. +## Bug Class 4: Mixed HIP ABI from Wrong GTL Library (Frontier / ROCm) + +**Symptom**: `HIPFFT_PARSE_ERROR` (error code 12) returned by `hipfftPlanMany` / `hipfftMakePlanMany` / `hipfftPlan1d` for FFT sizes G < 32, but G ≥ 32 succeeds. The failure only occurs with an empty rocFFT kernel cache (`~/.cache/rocfft`); a warm cache may mask it. Host-side operations and GPU kernels that do not invoke rocFFT JIT work correctly. + +**Root cause — mixed HIP ABI**: rocFFT uses JIT compilation (via `libamd_comgr`) for small transforms (G < 32); for G ≥ 32 it uses pre-compiled device code bundled in the library, so the JIT path is never exercised. When two HIP runtime versions are loaded in the same process — e.g. `libamdhip64.so.7` (ROCm 7) and `libamdhip64.so.6` (ROCm 6) — the rocFFT JIT cannot complete successfully. + +The hidden source of the old library is the Cray MPI GPU Transport Layer. On Frontier, `cray-mpich`'s `libmpi_gtl_hsa.so` may be compiled against `libamdhip64.so.6` (ROCm 6 ABI) even when the loaded ROCm module is 7.0.2. Because `LD_LIBRARY_PATH` picks up the GTL directory before the ROCm 7 library directory, `libamdhip64.so.6` is pulled in first, and both ABI versions end up resident in the process. + +**Diagnosis**: +```bash +# Check which libamdhip64 versions are actually linked into your binary at runtime +ldd --verbose ./your_binary 2>&1 | grep amdhip +# Bad output — two different .so versions: +# libamdhip64.so.6 => /opt/rocm-6.4.2/lib/libamdhip64.so.6 +# libamdhip64.so.7 => /opt/rocm-7.0.2/lib/libamdhip64.so.7 +# Good output — only one: +# libamdhip64.so.7 => /opt/rocm-7.0.2/lib/libamdhip64.so.7 +``` + +If two versions appear, the problem is the GTL/LD_LIBRARY_PATH ordering. + +**Fix — correct module stack and LD_LIBRARY_PATH ordering (Frontier)**: +```bash +module load cce/21.0.0 +module load cpe/26.03 +module load rocm/7.0.2 +# Prepend CRAY_LD_LIBRARY_PATH so the ROCm-7-aware GTL is found first +export LD_LIBRARY_PATH=$CRAY_LD_LIBRARY_PATH:$LD_LIBRARY_PATH +# Ensure ROCm 7 LLVM libs (needed by libamd_comgr JIT) are on the path +export LD_LIBRARY_PATH=/opt/rocm-7.0.2/lib/llvm/lib/:$LD_LIBRARY_PATH +``` + +The critical step is prepending `CRAY_LD_LIBRARY_PATH`: this ensures the GTL library built against the ROCm 7 ABI is resolved before any older version that may appear further down `LD_LIBRARY_PATH`. Without this step, a stale symlink or directory ordering can silently load the wrong `libmpi_gtl_hsa.so`. + +**Reproducer**: `tests/debug/Test_hipfft_repro.cc` — standalone hipFFT test (no Grid headers) that sweeps G and howmany values matching realistic Grid lattice geometries. Compile with: +```bash +hipcc -o Test_hipfft_repro Test_hipfft_repro.cc -lhipfft +rm -rf ~/.cache/rocfft # empty cache required to trigger JIT path +./Test_hipfft_repro +``` + +**Reference**: `systems/WorkArounds.txt`, Frontier section — GPU mapping, XPMEM, and `FI_MR_CACHE_MONITOR=disabled` settings for Frontier are documented there. + +**Systems affected**: Frontier (ORNL, MI250X). Likely applies to any Cray PE system where the loaded `cray-mpich` GTL was compiled against an older ROCm ABI than the runtime ROCm module. LumiG (CSC, MI250X) uses the same Cray PE and may exhibit the same issue. + ## Compile-Time Guard Structure Recommended macro structure to switch between the workaround paths: From 155b34c1aaebca4df3fb08309c9822b739d883b9 Mon Sep 17 00:00:00 2001 From: Peter Boyle Date: Fri, 15 May 2026 15:06:58 -0400 Subject: [PATCH 42/44] File list lost --- scripts/filelist | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/scripts/filelist b/scripts/filelist index 64e5250c..34962e88 100755 --- a/scripts/filelist +++ b/scripts/filelist @@ -11,7 +11,7 @@ CCFILES=`find . -name '*.cc' -not -path '*/instantiation/*/*' -not -path '*/gamm ZWILS_FERMION_FILES=` find . -name '*.cc' -path '*/instantiation/*' -path '*/instantiation/ZWilsonImpl*' ` WILS_FERMION_FILES=` find . -name '*.cc' -path '*/instantiation/*' -path '*/instantiation/WilsonImpl*' ` -STAG_FERMION_FILES=` find . -name '*.cc' -path '*/instantiation/*' -path '*/instantiation/Staggered*' ` +STAG_FERMION_FILES=` find . -name '*.cc' -path '*/instantiation/*' -path '*/instantiation/StaggeredImpl*' ` GP_FERMION_FILES=` find . -name '*.cc' -path '*/instantiation/*' -path '*/instantiation/Gparity*' ` ADJ_FERMION_FILES=` find . -name '*.cc' -path '*/instantiation/*' -path '*/instantiation/WilsonAdj*' ` TWOIND_FERMION_FILES=`find . -name '*.cc' -path '*/instantiation/*' -path '*/instantiation/WilsonTwoIndex*'` From 957601101186647683320c71eb2d24e36200deff Mon Sep 17 00:00:00 2001 From: Peter Boyle Date: Thu, 21 May 2026 12:28:04 -0400 Subject: [PATCH 43/44] Changed setup for ROCM 7, nasty LD_LIBRARY_PATH issues were committing evils --- systems/Frontier/sourceme.sh | 30 ++++++------------------------ 1 file changed, 6 insertions(+), 24 deletions(-) diff --git a/systems/Frontier/sourceme.sh b/systems/Frontier/sourceme.sh index ace9b729..0788fdf6 100644 --- a/systems/Frontier/sourceme.sh +++ b/systems/Frontier/sourceme.sh @@ -1,28 +1,10 @@ echo spack -. /autofs/nccs-svm1_home1/paboyle/Crusher/Grid/spack/share/spack/setup-env.sh +. /autofs/nccs-svm1_home1/paboyle/spack/share/spack/setup-env.sh -module load cce/15.0.1 -module load amd/7.0.2 -#module load amd/7.1.1 -#module load rocm/7.2.0 -#module load rocm/6.4.2 -module load cray-fftw -module load craype-accel-amd-gfx90a -#Ugly hacks to get down level software working on current system -export LD_LIBRARY_PATH=/opt/cray/libfabric/1.20.1/lib64/:$LD_LIBRARY_PATH -export LD_LIBRARY_PATH=/opt/gcc/mpfr/3.1.4/lib:$LD_LIBRARY_PATH -export LD_LIBRARY_PATH=`pwd`/:$LD_LIBRARY_PATH -export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HOME/LD_PATH/ - -#echo spack load c-lime -#spack load c-lime -#module load emacs -##module load PrgEnv-gnu -##module load cray-mpich -##module load cray-fftw -##module load craype-accel-amd-gfx90a -##export LD_LIBRARY_PATH=/opt/gcc/mpfr/3.1.4/lib:$LD_LIBRARY_PATH -#Hack for lib -##export LD_LIBRARY_PATH=`pwd`/:$LD_LIBRARY_PATH +module load cce/21.0.0 +module load cpe/26.03 +module load rocm/7.0.2 +export LD_LIBRARY_PATH=$CRAY_LD_LIBRARY_PATH:$LD_LIBRARY_PATH +export LD_LIBRARY_PATH=/opt/rocm-7.0.2/lib/llvm/lib/:$LD_LIBRARY_PATH From 12e3499b6d1bee3acabb2800c0e38d5d2f8c0b32 Mon Sep 17 00:00:00 2001 From: Peter Boyle Date: Thu, 21 May 2026 12:28:42 -0400 Subject: [PATCH 44/44] Updated rocm 7 compile for ORNL --- systems/Frontier/config-command | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/systems/Frontier/config-command b/systems/Frontier/config-command index 7561fb15..7f71b4d9 100644 --- a/systems/Frontier/config-command +++ b/systems/Frontier/config-command @@ -13,8 +13,8 @@ CLIME=`spack find --paths c-lime@2-3-9 | grep c-lime| cut -c 15-` --with-mpfr=/opt/cray/pe/gcc/mpfr/3.1.4/ \ --disable-fermion-reps \ CXX=hipcc MPICXX=mpicxx \ -CXXFLAGS="-fPIC -I${ROCM_PATH}/include/ -I${MPICH_DIR}/include -L/lib64 " \ - LDFLAGS="-L/lib64 -L${ROCM_PATH}/lib -L${MPICH_DIR}/lib -lmpi -L${CRAY_MPICH_ROOTDIR}/gtl/lib -lmpi_gtl_hsa -lhipblas -lrocblas -lhipfft" +CXXFLAGS="-fPIC -I${ROCM_PATH}/include/ -I${MPICH_DIR}/include " \ + LDFLAGS="-L${ROCM_PATH}/lib -L${MPICH_DIR}/lib -lmpi -lmpi_gtl_hsa -lhipblas -lrocblas -lhipfft -lamdhip64"