Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
| name | description | metadata |
|---|---|---|
| ref_grid_simt_pattern | Grid GRID_SIMT | |
The problem
On CPU, accelerator_for(sf, oSites, Nsimd, {...}) expands to thread_for(sf, oSites, {...}) — one thread per osite. acceleratorSIMTlane(Nsimd) always returns 0 on CPU. If you need to iterate all Nsimd lanes (e.g. to extract SIMD-packed data), you must loop explicitly on CPU.
On GPU, accelerator_for launches one SIMT thread per (osite × lane) pair, and acceleratorSIMTlane(Nsimd) returns that thread's lane index in [0, Nsimd).
Correct pattern (from MomentumProject::ImportVector)
accelerator_for(sf, osites, Nsimd, {
#ifdef GRID_SIMT
  {
    int lane = acceleratorSIMTlane(Nsimd);
#else
  for (int lane = 0; lane < Nsimd; lane++) {
#endif
    // body using lane
  }
});
- On GPU: GRID_SIMT is defined → single-lane body, lane from hardware
- On CPU: GRID_SIMT is not defined → explicit lane loop inside the osite thread
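The control flow of this pattern can be tried out without Grid. The following is a minimal sketch only: MockVec, mockSIMTlane, mockExtractLane, unpack, and the MOCK_SIMT macro are hypothetical stand-ins invented for illustration, not Grid's types or functions, and the plain for loop over sf merely plays the role that accelerator_for has on CPU.

```cpp
#include <array>
#include <cassert>
#include <vector>

constexpr int Nsimd = 4;                    // illustrative lane count
using MockVec = std::array<double, Nsimd>;  // hypothetical stand-in for a SIMD-packed vobj

// Hypothetical stand-in for the lane query on a CPU build: always lane 0.
inline int mockSIMTlane(int /*nsimd*/) { return 0; }

// Hypothetical stand-in for a lane extraction: read one scalar from a packed word.
inline double mockExtractLane(int lane, const MockVec &v) { return v[lane]; }

// Unpack src (one MockVec per osite) into a flat array with one slot per (osite, lane).
inline std::vector<double> unpack(const std::vector<MockVec> &src) {
  const int osites = static_cast<int>(src.size());
  std::vector<double> out(osites * Nsimd, 0.0);
  for (int sf = 0; sf < osites; sf++) {  // plays the role of accelerator_for on CPU
#ifdef MOCK_SIMT
    {  // SIMT-style branch: a real GPU would run one such body per lane
      int lane = mockSIMTlane(Nsimd);
#else
    for (int lane = 0; lane < Nsimd; lane++) {  // CPU branch: explicit lane loop
#endif
      out[sf * Nsimd + lane] = mockExtractLane(lane, src[sf]);
    }
  }
  return out;
}
```

Compiled without MOCK_SIMT, every (osite, lane) slot is filled. Defining MOCK_SIMT in this mock selects the single-lane branch while the lane query still returns 0, which is the same combination that goes wrong when the lane loop is left out on CPU.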
When is this needed?
Only when you explicitly need the lane index, e.g.:
- Extracting scalar data from SIMD-packed vobj via extractLane(lane, src[sf])
- Computing full local coordinates from (osite, lane) → Lexicographic::CoorFromIndex(icoor, lane, simd_layout)
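To make the second case concrete, here is a standalone sketch of how an (osite, lane) pair could map to a full local coordinate. It rests on two assumptions that are not stated in this note: that lexicographic decoding runs with dimension 0 fastest, and that the local coordinate is the outer coordinate plus the reduced (per-lane) extent times the inner coordinate. The helpers coorFromIndex and localCoor and the rdims parameter are hypothetical names for illustration, not Grid's API.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Decode a lexicographic index into coordinates, assuming dimension 0 runs fastest.
inline std::vector<int> coorFromIndex(int index, const std::vector<int> &dims) {
  std::vector<int> coor(dims.size());
  for (std::size_t d = 0; d < dims.size(); d++) {
    coor[d] = index % dims[d];
    index /= dims[d];
  }
  return coor;
}

// Combine outer-site and lane coordinates into a local coordinate,
// assuming local[d] = ocoor[d] + rdims[d] * icoor[d].
inline std::vector<int> localCoor(int osite, int lane,
                                  const std::vector<int> &rdims,
                                  const std::vector<int> &simd_layout) {
  assert(rdims.size() == simd_layout.size());
  const std::vector<int> ocoor = coorFromIndex(osite, rdims);       // position inside one lane's block
  const std::vector<int> icoor = coorFromIndex(lane, simd_layout);  // which block this lane owns
  std::vector<int> coor(rdims.size());
  for (std::size_t d = 0; d < rdims.size(); d++) {
    coor[d] = ocoor[d] + rdims[d] * icoor[d];
  }
  return coor;
}
```

For example, with rdims = {2, 2} and simd_layout = {2, 1}, osite 3 decodes to (1, 1) and lane 1 to (1, 0), so the sketch returns (3, 1).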
When using coalescedRead/coalescedWrite, this pattern is not needed — those handle lane selection transparently.
Pitfall that caused a bug
A2ASpatialSum::PackVectors originally used accelerator_for without the #ifdef GRID_SIMT lane loop. On CPU, only lane=0 was extracted, giving wrong norms (~8× too small for GEN_SIMD_WIDTH=64, Nsimd=4). Fix: add the #ifdef GRID_SIMT pattern. See ref_accelerator_for.
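One way to catch this class of bug early is to pre-fill the output with a sentinel and check that every (osite, lane) slot was written. The sketch below assumes a flat output array indexed by sf * Nsimd + lane; mockPack and countUnwritten are hypothetical helpers, not the actual PackVectors code, and the allLanes flag only simulates the presence or absence of the lane loop.

```cpp
#include <cassert>
#include <cmath>
#include <limits>
#include <vector>

// Hypothetical packer: writes one scalar per (osite, lane) slot.
// allLanes = false mimics the bug, where only lane 0 is ever visited.
inline void mockPack(std::vector<double> &out, int osites, int Nsimd, bool allLanes) {
  const int nlanes = allLanes ? Nsimd : 1;
  for (int sf = 0; sf < osites; sf++) {
    for (int lane = 0; lane < nlanes; lane++) {
      out[sf * Nsimd + lane] = 1.0;
    }
  }
}

// Pre-fill with NaN sentinels, run the packer, and count slots it never wrote.
inline int countUnwritten(int osites, int Nsimd, bool allLanes) {
  std::vector<double> out(osites * Nsimd,
                          std::numeric_limits<double>::quiet_NaN());
  mockPack(out, osites, Nsimd, allLanes);
  int missing = 0;
  for (double x : out) {
    if (std::isnan(x)) missing++;
  }
  return missing;
}
```

With 8 osites and Nsimd = 4, the full-lane path leaves no sentinel behind, while the lane-0-only path leaves 24 of 32 slots unwritten. A zero-initialised buffer would hide the same omission, which is why the symptom here showed up only as a norm that was too small.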
Model file
Grid/algorithms/blas/MomentumProject.h, function ImportVector, lines ~166-207.