Benchmark_dwf_fp32 on one MI250X GCD: 1.7 TF/s at nt=8, ~300 GF/s at nt=16.
With Nsimd=8 (fp32, GEN_SIMD_WIDTH=64B), nt=8 gives exactly 64 threads per
block, i.e. one full AMD wavefront. Each doubling of nt doubles the register
demand per block, and stencil kernels then hit a register-pressure cliff.
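
The thread-count arithmetic above can be checked with a small sketch. It
assumes that nt is the second block dimension next to Nsimd, and that one
SIMD lane holds one complex scalar; the helper names (nsimd,
threads_per_block, wavefronts) are hypothetical and not part of Grid.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>

// Values taken from the message above.
constexpr std::size_t kSimdWidthBytes = 64;  // GEN_SIMD_WIDTH=64B
constexpr std::size_t kWavefrontSize  = 64;  // one AMD wavefront on MI250X

// SIMD lanes per vector word, assuming one lane holds one complex scalar.
template <typename Real>
constexpr std::size_t nsimd() {
  return kSimdWidthBytes / sizeof(std::complex<Real>);
}

// Threads per block, assuming a block of shape (Nsimd, nt).
template <typename Real>
constexpr std::size_t threads_per_block(std::size_t nt) {
  return nsimd<Real>() * nt;
}

// Number of wavefronts needed to cover `threads` threads (rounded up).
constexpr std::size_t wavefronts(std::size_t threads) {
  return (threads + kWavefrontSize - 1) / kWavefrontSize;
}

// Compile-time checks of the figures quoted in the message.
static_assert(nsimd<float>() == 8, "fp32: 64 B / 8 B per complex = 8 lanes");
static_assert(threads_per_block<float>(8) == 64, "nt=8 fills one wavefront");
static_assert(wavefronts(threads_per_block<float>(16)) == 2,
              "nt=16 needs two wavefronts per block");
```

Under the same assumptions, fp64 would give Nsimd=4, so nt=16 would be the
value that fills one wavefront; that case was not benchmarked here.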
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Documents the acceleratorThreads() default=2 trap, the LambdaApply thread
mapping, the coalescedRead/coalescedWrite idiom, when to use __global__
instead of accelerator_for, and fused vs staged HBM access patterns.
Includes observed MI250X numbers from the LatticePropagatorD reduction
(50 → 297 → 546 GB/s progression).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>