1
0
mirror of https://github.com/paboyle/Grid.git synced 2026-05-28 04:54:16 +01:00

Accelerator: lower default accelerator_threads from 16 to 8

Benchmark_dwf_fp32 on MI250X GCD: 1.7 TF/s at nt=8, ~300 GF/s at nt=16.
With Nsimd=8 (fp32, GEN_SIMD_WIDTH=64B), nt=8 gives exactly 64 threads =
one full AMD wavefront. Higher values double register demand per block and
hit a register-pressure cliff for stencil kernels.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This commit is contained in:
Peter Boyle
2026-05-19 13:41:03 -04:00
parent 012c36ab5a
commit 5a9056cd93
2 changed files with 47 additions and 23 deletions
+1 -1
View File
@@ -3,7 +3,7 @@
NAMESPACE_BEGIN(Grid); NAMESPACE_BEGIN(Grid);
int world_rank; // Use to control world rank for print guarding int world_rank; // Use to control world rank for print guarding
int acceleratorAbortOnGpuError=1; int acceleratorAbortOnGpuError=1;
uint32_t accelerator_threads=16; uint32_t accelerator_threads=8;
uint32_t acceleratorThreads(void) {return accelerator_threads;}; uint32_t acceleratorThreads(void) {return accelerator_threads;};
void acceleratorThreads(uint32_t t) {accelerator_threads = t;}; void acceleratorThreads(uint32_t t) {accelerator_threads = t;};
+46 -22
View File
@@ -9,13 +9,26 @@ allowed-tools:
# GPU Memory Performance in Grid # GPU Memory Performance in Grid
## Nsimd on GPU builds
With `GEN_SIMD_WIDTH=64B` (the typical production setting), `Nsimd` is **not 1**:
| Scalar type | `sizeof` | `Nsimd = 64 / sizeof` |
|---|---|---|
| `ComplexD` | 16 B | **4** |
| `ComplexF` | 8 B | **8** |
| `RealD` | 8 B | **8** |
| `RealF` | 4 B | **16** |
So for `LatticePropagatorD` (scalar type `ComplexD`), `Nsimd=4` and the SIMD lane runs `threadIdx.x ∈ {0,1,2,3}`. True `Nsimd=1` scalar-GPU builds are the exception, not the rule.
## The acceleratorThreads() Trap ## The acceleratorThreads() Trap
`acceleratorThreads()` is a runtime-settable global (default **2**) that controls the `blockDim.y` of every `accelerator_for` launch. It is NOT the SIMD width — it is the number of sites processed per block in the y-dimension. `acceleratorThreads()` is a runtime-settable global (default **8**) that controls the `blockDim.y` of every `accelerator_for` launch. It is NOT the SIMD width — it is the number of sites processed per block in the y-dimension.
```cpp ```cpp
// Grid/threads/Accelerator.cc // Grid/threads/Accelerator.cc
uint32_t accelerator_threads = 2; // <-- default uint32_t accelerator_threads = 8;
``` ```
With `accelerator_for(ss, osites, nsimd, ...)`, the launch is: With `accelerator_for(ss, osites, nsimd, ...)`, the launch is:
@@ -25,15 +38,27 @@ dim3 threads(nsimd, acceleratorThreads(), 1)
dim3 blocks ((osites + acceleratorThreads() - 1) / acceleratorThreads(), 1, 1) dim3 blocks ((osites + acceleratorThreads() - 1) / acceleratorThreads(), 1, 1)
``` ```
For `nsimd=1` and the default `acceleratorThreads()=2`: Total threads per block = `Nsimd × acceleratorThreads()`. With `GEN_SIMD_WIDTH=64B` and `Nsimd=8` (fp32 / ComplexF):
- **2 threads per block** on a 64-thread AMD wavefront → **3% occupancy**
- Expected bandwidth ≈ peak × 3% ≈ 50 GB/s on MI250X
**Diagnostic**: observed bandwidth << peak, kernel time >> expected from data volume. Check with `--accelerator-threads 16` or `--accelerator-threads 32` at runtime. A large speedup confirms occupancy starvation. | `acceleratorThreads()` | threads/block | AMD wavefront | note |
|---|---|---|---|
| 2 (original default) | 16 | 25% | sub-wavefront, poor |
| 4 | 32 | 50% | half wavefront |
| **8 (current default)** | **64** | **100%** | **one full wavefront — Dslash sweet spot** |
| 16 | 128 | 200% | two wavefronts/block; register-pressure cliff for heavy kernels |
| 32 | 256 | 400% | severe register spill for stencil kernels |
**Why 8 and not higher?** Compute-heavy kernels like the Domain Wall Dslash carry many live registers per thread (spinors + gauge links + projections). Doubling `acceleratorThreads` from 8→16 doubles the register demand per block, which on AMD GFX90A triggers a hard occupancy cliff: `Benchmark_dwf_fp32` drops from 1.7 TF/s (nt=8) to ~300 GF/s (nt=16). The sweet spot is one full wavefront per block, which with `Nsimd=8` (fp32) means `nt=8`.
For fp64 work (`Nsimd=4`), `nt=8` gives 32 threads = half a wavefront (AMD pads to 64 with idle lanes). Kernels that are not register-limited (e.g. simple lattice arithmetic) can benefit from `--accelerator-threads 16` at runtime. Reduction kernels bypass `acceleratorThreads()` entirely via `getNumBlocksAndThreads`.
**Why was the default ever 2?** Before the `threadIdx.x`/`threadIdx.y` remap (see LambdaApply section below), the site index lived in `threadIdx.x` — the fast, coalescing dimension. Increasing `acceleratorThreads` widened the block in the *site* direction, so adjacent threads in a warp hit adjacent sites, each stride `sizeof(vobj)` apart in AoS memory — breaking coalescing. On early NVIDIA ports with `Nsimd≈8`, `nt=2` gave 16 threads = 50% of a 32-thread warp; NVIDIA recovers this via multiple concurrent blocks per SM, so occupancy was barely tolerable. AMD has no such multiplier when blocks are already sub-wavefront. After the remap put the site index in `threadIdx.y` and the SIMD lane in `threadIdx.x`, coalescing became independent of `acceleratorThreads`, removing the constraint.
**Diagnostic**: observed bandwidth << peak, kernel time >> expected from data volume. Check with `--accelerator-threads 32` at runtime. A large speedup confirms occupancy starvation.
**Fix options** (in order of preference): **Fix options** (in order of preference):
1. Kernel needs its own thread count — use `getNumBlocksAndThreads` and launch a `__global__` kernel directly (see below). 1. Kernel needs its own thread count — use `getNumBlocksAndThreads` and launch a `__global__` kernel directly (see below).
2. Temporarily acceptable: set `--accelerator-threads 16` or 32 at the application level. Note this affects every `accelerator_for` site in the binary. 2. Temporarily acceptable: set `--accelerator-threads 32` at the application level. Note this affects every `accelerator_for` site in the binary.
## LambdaApply Thread Mapping ## LambdaApply Thread Mapping
@@ -49,7 +74,7 @@ Lambda(x, y, z);
`threadIdx.x` is the **fast** (lane) dimension — consecutive thread IDs within a warp/wavefront correspond to consecutive lane values on the **same** site, not consecutive sites. `threadIdx.x` is the **fast** (lane) dimension — consecutive thread IDs within a warp/wavefront correspond to consecutive lane values on the **same** site, not consecutive sites.
Consequence: for coalesced access from a `vobj` array (AoS layout, stride = `sizeof(vobj)` between adjacent sites), adjacent threads in a wavefront address the **same** site at different lanes, not adjacent sites. With `Nsimd=1` (GPU scalar build), `threadIdx.x` is always 0 and provides no coalescing benefit at all. With `GEN_SIMD_WIDTH=64B` and `Nsimd=4` (PropagatorD), a 64-thread AMD wavefront contains 64/4 = 16 sites, each processed by 4 lanes. Adjacent threads within the wavefront read different lanes of the same site — this is a broadcast pattern (hardware handles this efficiently), not a stride. The stride between consecutive *sites* (`sizeof(vobj)`) only appears between groups of `Nsimd` threads, spaced `Nsimd` apart in threadIdx.x — not between adjacent threads within a warp.
## coalescedRead / coalescedWrite ## coalescedRead / coalescedWrite
@@ -57,7 +82,7 @@ These are Grid's canonical way to read/write one SIMD lane from a vector type in
```cpp ```cpp
// accelerator_for(ss, osites, Nsimd, { // accelerator_for(ss, osites, Nsimd, {
// lane = acceleratorSIMTlane(Nsimd) = threadIdx.x // lane = acceleratorSIMTlane(Nsimd) = threadIdx.x ∈ {0..Nsimd-1}
auto scalar_val = coalescedRead(field[ss]); // extractLane(lane, field[ss]) auto scalar_val = coalescedRead(field[ss]); // extractLane(lane, field[ss])
coalescedWrite(field[ss], scalar_val); // insertLane(lane, field[ss], scalar_val) coalescedWrite(field[ss], scalar_val); // insertLane(lane, field[ss], scalar_val)
``` ```
@@ -66,8 +91,6 @@ For `vobj` aggregate types, `coalescedRead` calls `extractLane(lane, vobj)` whic
For `vsimd` (raw SIMD vector) types, it casts to `scalar_type*` and indexes with `lane`. For `vsimd` (raw SIMD vector) types, it casts to `scalar_type*` and indexes with `lane`.
**When Nsimd=1** (GPU scalar build): `lane=0` always, so `coalescedRead`/`coalescedWrite` are effectively no-ops (direct read/write). Coalescing must be achieved through the iteration structure instead.
## Coalescing the Iteration Structure ## Coalescing the Iteration Structure
For an AoS input array where each site is `words` 16-byte elements, adjacent threads reading the same site's consecutive words achieve coalesced access: For an AoS input array where each site is `words` 16-byte elements, adjacent threads reading the same site's consecutive words achieve coalesced access:
@@ -108,12 +131,12 @@ Pattern: use `getNumBlocksAndThreads` to pick `numThreads` and `numBlocks`:
Integer numThreads, numBlocks; Integer numThreads, numBlocks;
int ok = getNumBlocksAndThreads(n, sizeof(sobj), numThreads, numBlocks); int ok = getNumBlocksAndThreads(n, sizeof(sobj), numThreads, numBlocks);
// starts at warpSize (32/64), doubles while 2*threads*sizeof(sobj) < sharedMemPerBlock // starts at warpSize (32/64), doubles while 2*threads*sizeof(sobj) < sharedMemPerBlock
// gives 64256 threads/block → near-100% wavefront occupancy // gives 64256 threads/block → correct occupancy independent of acceleratorThreads()
Integer smemSize = numThreads * sizeof(sobj); Integer smemSize = numThreads * sizeof(sobj);
myKernel<<<numBlocks, numThreads, smemSize, computeStream>>>(args...); myKernel<<<numBlocks, numThreads, smemSize, computeStream>>>(args...);
``` ```
This gives 64256 threads/block regardless of `acceleratorThreads()`. Grid's `reduceKernel` uses this pattern and achieves ~400 GB/s on MI250X. Grid's `reduceKernel` uses this pattern and achieves ~400 GB/s on MI250X.
## Fused vs Staged HBM Access ## Fused vs Staged HBM Access
@@ -161,21 +184,22 @@ __device__ void packReduceBlocks(
} }
``` ```
Launched with `getNumBlocksAndThreads` → 128256 threads/block → correct occupancy without depending on `acceleratorThreads()`. Launched with `getNumBlocksAndThreads` → 128 threads/block for R=12 (`BundleScalarD`=192 B, sharedMem=64 KB) → correct occupancy without depending on `acceleratorThreads()`.
## Observed Numbers on MI250X (32^4 LatticePropagatorD, Nsimd=1) ## Observed Numbers on MI250X (32^4 LatticePropagatorD, Nsimd=4, GEN_SIMD_WIDTH=64B)
| Configuration | pack µs/group | reduce µs/group | total µs | GB/s | | Configuration | pack µs/group | reduce µs/group | total µs | GB/s |
|---|---|---|---|---| |---|---|---|---|---|
| acceleratorThreads=2, staged | 10,080 | 470 | 126,909 | 50 | | acceleratorThreads=2 (8 threads/block), staged | 10,080 | 470 | 126,909 | 50 |
| acceleratorThreads=16, staged | 342 | 310 | 8,251 | 297 | | acceleratorThreads=16 (64 threads/block), staged | 342 | 310 | 8,251 | 297 |
| acceleratorThreads=16, fused | — | 349 | 4,584 | 546 | | acceleratorThreads=16, fused (128 threads/block via getNumBlocksAndThreads) | — | 349 | 4,584 | 546 |
The fused kernel at 349 µs/group reads 201 MB at 576 GB/s — 36% of MI250X HBM peak. The remaining gap from peak is the in-kernel serial loop over R=12 words and the 12 serial kernel launches. The fused kernel at 349 µs/group reads 201 MB at 576 GB/s — 36% of MI250X HBM peak. The remaining gap from peak is the in-kernel serial loop over R=12 words and the 12 serial kernel launches.
## Quick Checklist When a Kernel Is Slow ## Quick Checklist When a Kernel Is Slow
1. Check threads per block: `accelerator_for(ss, N, 1, ...)` with default `acceleratorThreads()=2` = 2 threads/block = 3% occupancy on AMD. Try `--accelerator-threads 16` at runtime; if it helps a lot, occupancy is the problem. 1. **Check Nsimd**: `GEN_SIMD_WIDTH=64B` → Nsimd=4 (ComplexD), 8 (ComplexF). Total threads/block = `Nsimd × acceleratorThreads()`. With old default nt=2 and Nsimd=4: 8 threads = 12.5% of AMD wavefront.
2. Check for bulk struct accumulation in registers (`Bundle b; for(...) b._internal[k] = ...;`). Replace with per-element writes via `coalescedWrite`. 2. Check threads per block: for `accelerator_for` kernels use `--accelerator-threads 32` and measure; a large speedup confirms occupancy starvation.
3. Check for staged HBM access (pack → buffer → reduce). Count the passes; fuse if ≥ 2 passes over the same data. 3. Check for bulk struct accumulation in registers (`Bundle b; for(...) b._internal[k] = ...;`). Replace with per-element writes via `coalescedWrite`.
4. For reduction kernels, always use `getNumBlocksAndThreads` rather than `accelerator_for` so thread count is independent of `acceleratorThreads()`. 4. Check for staged HBM access (pack → buffer → reduce). Count the passes; fuse if ≥ 2 passes over the same data.
5. For reduction kernels, always use `getNumBlocksAndThreads` rather than `accelerator_for` so thread count is independent of `acceleratorThreads()`.