1
0
mirror of https://github.com/paboyle/Grid.git synced 2026-05-26 20:14:16 +01:00
Files
Grid/skills/gpu-memory-performance.md
T
Peter Boyle 5a9056cd93 Accelerator: lower default accelerator_threads from 16 to 8
Benchmark_dwf_fp32 on MI250X GCD: 1.7 TF/s at nt=8, ~300 GF/s at nt=16.
With Nsimd=8 (fp32, GEN_SIMD_WIDTH=64B), nt=8 gives exactly 64 threads =
one full AMD wavefront. Higher values double register demand per block and
hit a register-pressure cliff for stencil kernels.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-21 12:34:30 -04:00

11 KiB
Raw Blame History

name, description, user-invocable, allowed-tools
name description user-invocable allowed-tools
gpu-memory-performance Diagnose and fix GPU memory bandwidth and occupancy problems in Grid HPC kernels — acceleratorThreads() pitfalls, LambdaApply thread mapping, coalescedRead/Write idiom, when to use accelerator_for vs a hand-rolled __global__ kernel, and fused vs staged HBM access patterns. true
Read
Bash(grep -r)

GPU Memory Performance in Grid

Nsimd on GPU builds

With GEN_SIMD_WIDTH=64B (the typical production setting), Nsimd is not 1:

Scalar type sizeof Nsimd = 64 / sizeof
ComplexD 16 B 4
ComplexF 8 B 8
RealD 8 B 8
RealF 4 B 16

So for LatticePropagatorD (scalar type ComplexD), Nsimd=4 and the SIMD lane runs threadIdx.x ∈ {0,1,2,3}. True Nsimd=1 scalar-GPU builds are the exception, not the rule.

The acceleratorThreads() Trap

acceleratorThreads() is a runtime-settable global (default 8) that controls the blockDim.y of every accelerator_for launch. It is NOT the SIMD width — it is the number of sites processed per block in the y-dimension.

// Grid/threads/Accelerator.cc
uint32_t accelerator_threads = 8;

With accelerator_for(ss, osites, nsimd, ...), the launch is:

dim3 threads(nsimd, acceleratorThreads(), 1)
dim3 blocks ((osites + acceleratorThreads() - 1) / acceleratorThreads(), 1, 1)

Total threads per block = Nsimd × acceleratorThreads(). With GEN_SIMD_WIDTH=64B and Nsimd=8 (fp32 / ComplexF):

acceleratorThreads() threads/block AMD wavefront note
2 (original default) 16 25% sub-wavefront, poor
4 32 50% half wavefront
8 (current default) 64 100% one full wavefront — Dslash sweet spot
16 128 200% two wavefronts/block; register-pressure cliff for heavy kernels
32 256 400% severe register spill for stencil kernels

Why 8 and not higher? Compute-heavy kernels like the Domain Wall Dslash carry many live registers per thread (spinors + gauge links + projections). Doubling acceleratorThreads from 8→16 doubles the register demand per block, which on AMD GFX90A triggers a hard occupancy cliff: Benchmark_dwf_fp32 drops from 1.7 TF/s (nt=8) to ~300 GF/s (nt=16). The sweet spot is one full wavefront per block, which with Nsimd=8 (fp32) means nt=8.

For fp64 work (Nsimd=4), nt=8 gives 32 threads = half a wavefront (AMD pads to 64 with idle lanes). Kernels that are not register-limited (e.g. simple lattice arithmetic) can benefit from --accelerator-threads 16 at runtime. Reduction kernels bypass acceleratorThreads() entirely via getNumBlocksAndThreads.

Why was the default ever 2? Before the threadIdx.x/threadIdx.y remap (see LambdaApply section below), the site index lived in threadIdx.x — the fast, coalescing dimension. Increasing acceleratorThreads widened the block in the site direction, so adjacent threads in a warp hit adjacent sites, each stride sizeof(vobj) apart in AoS memory — breaking coalescing. On early NVIDIA ports with Nsimd≈8, nt=2 gave 16 threads = 50% of a 32-thread warp; NVIDIA recovers this via multiple concurrent blocks per SM, so occupancy was barely tolerable. AMD has no such multiplier when blocks are already sub-wavefront. After the remap put the site index in threadIdx.y and the SIMD lane in threadIdx.x, coalescing became independent of acceleratorThreads, removing the constraint.

Diagnostic: observed bandwidth << peak, kernel time >> expected from data volume. Check with --accelerator-threads 32 at runtime. A large speedup confirms occupancy starvation.

Fix options (in order of preference):

  1. Kernel needs its own thread count — use getNumBlocksAndThreads and launch a __global__ kernel directly (see below).
  2. Temporarily acceptable: set --accelerator-threads 32 at the application level. Note this affects every accelerator_for site in the binary.

LambdaApply Thread Mapping

accelerator_for and accelerator_for2d go through LambdaApply:

// HIP/CUDA LambdaApply kernel:
uint64_t x = threadIdx.y + blockDim.y * blockIdx.x;  // iter1 (site index)
uint64_t y = threadIdx.z + blockDim.z * blockIdx.y;  // iter2
uint64_t z = threadIdx.x;                             // lane (SIMD lane)
Lambda(x, y, z);

threadIdx.x is the fast (lane) dimension — consecutive thread IDs within a warp/wavefront correspond to consecutive lane values on the same site, not consecutive sites.

With GEN_SIMD_WIDTH=64B and Nsimd=4 (PropagatorD), a 64-thread AMD wavefront contains 64/4 = 16 sites, each processed by 4 lanes. Adjacent threads within the wavefront read different lanes of the same site — this is a broadcast pattern (hardware handles this efficiently), not a stride. The stride between consecutive sites (sizeof(vobj)) only appears between groups of Nsimd threads, spaced Nsimd apart in threadIdx.x — not between adjacent threads within a warp.

coalescedRead / coalescedWrite

These are Grid's canonical way to read/write one SIMD lane from a vector type inside a GRID_SIMT kernel:

// accelerator_for(ss, osites, Nsimd, {
//   lane = acceleratorSIMTlane(Nsimd) = threadIdx.x  ∈ {0..Nsimd-1}
auto scalar_val = coalescedRead(field[ss]);       // extractLane(lane, field[ss])
coalescedWrite(field[ss], scalar_val);             // insertLane(lane, field[ss], scalar_val)

For vobj aggregate types, coalescedRead calls extractLane(lane, vobj) which recurses through the tensor hierarchy and returns vobj::scalar_object.

For vsimd (raw SIMD vector) types, it casts to scalar_type* and indexes with lane.

Coalescing the Iteration Structure

For an AoS input array where each site is words 16-byte elements, adjacent threads reading the same site's consecutive words achieve coalesced access:

// Good: k varies across threads in a block → consecutive 16-byte reads
accelerator_for2d(k, R, ss, osites, Nsimd, {
    coalescedWrite(out[ss]._internal[k],
                   coalescedRead(idat[ss * words + base + k]));
});
// dim3(Nsimd, nt, 1): threadIdx.y = k (consecutive words, coalesced)
//                      threadIdx.x = lane (SIMD sub-lane, coalesced for Nsimd>1)
// Bad: each thread reads all R words of its site serially
accelerator_for(ss, osites, 1, {
    Bundle b;
    for (int k = 0; k < R; k++)
        b._internal[k] = idat[ss * words + base + k];  // serial, not coalesced across threads
    out[ss] = b;                                         // bulk struct write
});

The bad pattern also accumulates a large struct in registers (192 bytes for R=12), increasing register pressure and reducing occupancy further.

When to Use a global Kernel Instead of accelerator_for

accelerator_for is correct for site-parallel work where acceleratorThreads() is tuned appropriately. Use a direct __global__ kernel when:

  • The kernel requires a specific thread count for correctness or performance (reductions, shared-memory algorithms).
  • The optimal thread count depends on sizeof(sobj) and sharedMemPerBlock, not on a runtime global.
  • You need the retirement-count pattern for cross-block final reduction.

Pattern: use getNumBlocksAndThreads to pick numThreads and numBlocks:

Integer numThreads, numBlocks;
int ok = getNumBlocksAndThreads(n, sizeof(sobj), numThreads, numBlocks);
// starts at warpSize (32/64), doubles while 2*threads*sizeof(sobj) < sharedMemPerBlock
// gives 64256 threads/block → correct occupancy independent of acceleratorThreads()
Integer smemSize = numThreads * sizeof(sobj);
myKernel<<<numBlocks, numThreads, smemSize, computeStream>>>(args...);

Grid's reduceKernel uses this pattern and achieves ~400 GB/s on MI250X.

Fused vs Staged HBM Access

A staged pack+reduce reads the data three times:

pack kernel:   reads vobj array (N bytes), writes bundle buffer (N bytes)
reduce kernel: reads bundle buffer (N bytes), writes tiny result buffer

Total HBM: 3N bytes for N bytes of useful input.

A fused kernel reads the data once:

packReduceKernel: reads R words of vobj array (N bytes), reduces in-place

Total HBM: N bytes. Register pressure increases (R words held per thread) but the 3× HBM saving dominates for large objects.

The fused pattern in Grid's sumD_gpu_reduce_words<R>:

template <int R, class vobj, class sobj, class Iterator>
__device__ void packReduceBlocks(
    const iScalar<typename vobj::vector_type> *idat,
    sobj *g_odata, Iterator osites, int base, int words)
{
  // sobj = iVector<iScalar<scalarD>, R>  (R double-precision scalars per site)
  constexpr Iterator nsimd = vobj::Nsimd();
  ...
  while (i < osites * nsimd) {
    Iterator lane = i % nsimd;
    Iterator ss   = i / nsimd;
    sobj tmpD; zeroit(tmpD);
    for (int k = 0; k < R; k++) {
      auto w = extractLane(lane, idat[ss * words + base + k]);
      iScalar<typename vobj::scalar_typeD> wd; wd = w;   // float→double promotion
      tmpD._internal[k] = wd;
    }
    mySum += tmpD;
    ...
  }
  reduceBlock(sdata, mySum, tid);
}

Launched with getNumBlocksAndThreads → 128 threads/block for R=12 (BundleScalarD=192 B, sharedMem=64 KB) → correct occupancy without depending on acceleratorThreads().

Observed Numbers on MI250X (32^4 LatticePropagatorD, Nsimd=4, GEN_SIMD_WIDTH=64B)

Configuration pack µs/group reduce µs/group total µs GB/s
acceleratorThreads=2 (8 threads/block), staged 10,080 470 126,909 50
acceleratorThreads=16 (64 threads/block), staged 342 310 8,251 297
acceleratorThreads=16, fused (128 threads/block via getNumBlocksAndThreads) 349 4,584 546

The fused kernel at 349 µs/group reads 201 MB at 576 GB/s — 36% of MI250X HBM peak. The remaining gap from peak is the in-kernel serial loop over R=12 words and the 12 serial kernel launches.

Quick Checklist When a Kernel Is Slow

  1. Check Nsimd: GEN_SIMD_WIDTH=64B → Nsimd=4 (ComplexD), 8 (ComplexF). Total threads/block = Nsimd × acceleratorThreads(). With old default nt=2 and Nsimd=4: 8 threads = 12.5% of AMD wavefront.
  2. Check threads per block: for accelerator_for kernels use --accelerator-threads 32 and measure; a large speedup confirms occupancy starvation.
  3. Check for bulk struct accumulation in registers (Bundle b; for(...) b._internal[k] = ...;). Replace with per-element writes via coalescedWrite.
  4. Check for staged HBM access (pack → buffer → reduce). Count the passes; fuse if ≥ 2 passes over the same data.
  5. For reduction kernels, always use getNumBlocksAndThreads rather than accelerator_for so thread count is independent of acceleratorThreads().