mirror of https://github.com/paboyle/Grid.git synced 2026-05-23 02:24:17 +01:00
Grid/Grid/lattice
Peter Boyle 747c167658 sumD_gpu_direct: one thread per SIMD lane using extractLane
Replaces the one-thread-per-outer-site scheme, in which each thread called
Reduce() (a sequential loop over all Nsimd lanes), with one thread per lane
calling extractLane(), which is O(1) work per thread. CUB now reduces over
osites*Nsimd elements. This avoids the serial lane reduction but leaves the
per-lane sobjD store stride as a known remaining concern.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-18 16:21:50 -04:00
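
The change of thread mapping described in the commit message can be illustrated with a small CPU-only sketch. This is not the Grid source: `VecSite`, `reduce_lanes`, `extract_lane`, `sum_per_site` and `sum_per_lane` are hypothetical stand-ins, the loops stand in for GPU threads, `std::accumulate` stands in for the CUB reduction, and the mapping `t -> (t / Nsimd, t % Nsimd)` is an assumed index layout that the actual kernel may not use.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical stand-in for a SIMD-vectorised site object:
// one outer site packs Nsimd scalar lanes.
constexpr std::size_t Nsimd = 4;
struct VecSite { std::array<double, Nsimd> lane; };

// Old scheme, per-thread work: sequential loop over all lanes, O(Nsimd).
double reduce_lanes(const VecSite& v) {
  double s = 0.0;
  for (double x : v.lane) s += x;
  return s;
}

// New scheme, per-thread work: read a single lane, O(1).
double extract_lane(const VecSite& v, std::size_t l) {
  assert(l < Nsimd);
  return v.lane[l];
}

// One "thread" per outer site; the final reduction sees osites elements.
double sum_per_site(const std::vector<VecSite>& f) {
  std::vector<double> buf(f.size());
  for (std::size_t ss = 0; ss < f.size(); ++ss)
    buf[ss] = reduce_lanes(f[ss]);
  return std::accumulate(buf.begin(), buf.end(), 0.0);
}

// One "thread" per (outer site, lane) pair; the final reduction sees
// osites*Nsimd elements. The index split below is an assumption.
double sum_per_lane(const std::vector<VecSite>& f) {
  std::vector<double> buf(f.size() * Nsimd);
  for (std::size_t t = 0; t < buf.size(); ++t) {
    const std::size_t ss = t / Nsimd;  // outer site handled by thread t
    const std::size_t l  = t % Nsimd;  // lane handled by thread t
    buf[t] = extract_lane(f[ss], l);
  }
  return std::accumulate(buf.begin(), buf.end(), 0.0);
}
```

Both functions return the same total for inputs where summation order does not matter. The difference is where the lane loop lives: inside each thread in the first version, and folded into the larger reduction in the second. The sketch does not model the store stride that the commit message flags as a remaining concern.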