mirror of https://github.com/paboyle/Grid.git synced 2026-05-20 00:54:30 +01:00
Grid/Grid/lattice
Peter Boyle 66b529b345 sumD_gpu_reduce_words: fuse pack+reduce into single packReduceKernel
Replace the two-kernel pack+reduce sequence with a single fused kernel
packReduceKernel<R> that reads R words of each vobj at offset 'base'
and accumulates directly into iVector<iScalar<scalarD>,R>, eliminating
the intermediate bundle buffer entirely.
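
The difference between the two-kernel sequence and the fused kernel can be sketched on the CPU. This is a minimal illustration, not the Grid implementation: `SiteObj`, `packThenReduce` and `packReduceFused` are hypothetical names, plain `double` words stand in for `scalarD`, `std::array<double, R>` stands in for `iVector<iScalar<scalarD>,R>`, and serial loops stand in for the GPU grid.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a vobj: W double-precision words per site.
template <std::size_t W>
using SiteObj = std::array<double, W>;

// Two-pass sketch: pack R words per site into a bundle buffer, then reduce it.
// Each selected word is read, written to the bundle, and read again.
template <std::size_t R, std::size_t W>
std::array<double, R> packThenReduce(const std::vector<SiteObj<W>>& field,
                                     std::size_t base) {
  assert(base + R <= W);  // the word window must lie inside the site object
  std::vector<std::array<double, R>> bundle(field.size());  // intermediate buffer
  for (std::size_t s = 0; s < field.size(); ++s)
    for (std::size_t r = 0; r < R; ++r)
      bundle[s][r] = field[s][base + r];  // pack: one read, one write
  std::array<double, R> sum{};            // zero-initialised accumulator
  for (const auto& b : bundle)
    for (std::size_t r = 0; r < R; ++r)
      sum[r] += b[r];                     // reduce: one more read
  return sum;
}

// Fused sketch: accumulate straight from the source words, no bundle buffer.
// Each selected word is read exactly once.
template <std::size_t R, std::size_t W>
std::array<double, R> packReduceFused(const std::vector<SiteObj<W>>& field,
                                      std::size_t base) {
  assert(base + R <= W);
  std::array<double, R> sum{};
  for (const auto& site : field)
    for (std::size_t r = 0; r < R; ++r)
      sum[r] += site[base + r];
  return sum;
}
```

Both functions return the same sums for the same `field` and `base`; only the number of memory passes over each word differs.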

HBM access per word-group drops from 3x (pack-read + pack-write +
reduce-read) to 1x.  Thread count comes from getNumBlocksAndThreads
(warpSize..256) rather than acceleratorThreads(), so occupancy is
correct regardless of the --accelerator-threads setting.
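
One way to picture a thread count in the warpSize..256 range is a power-of-two doubling from the warp size up to a cap. The sketch below is an assumption for illustration only: `pickThreadsPerBlock` is a hypothetical helper, and the shared-memory constraint and its parameters are guesses about what such a selection might take into account, not the actual logic of `getNumBlocksAndThreads`.

```cpp
#include <cstddef>

// Hypothetical helper (not part of Grid): choose a power-of-two multiple of
// the warp size, bounded above by `cap` (256 here, matching the range quoted
// in the commit message) and by an assumed per-block shared-memory budget.
inline std::size_t pickThreadsPerBlock(std::size_t warpSize,
                                       std::size_t bytesPerThread,
                                       std::size_t sharedMemPerBlock,
                                       std::size_t cap = 256) {
  // If a single warp does not fit in shared memory, report failure with 0.
  if (warpSize * bytesPerThread > sharedMemPerBlock) return 0;
  std::size_t threads = warpSize;
  // Double while both the cap and the shared-memory budget allow it.
  while (2 * threads <= cap &&
         2 * threads * bytesPerThread <= sharedMemPerBlock) {
    threads *= 2;
  }
  return threads;
}
```

The point of the sketch is that the result depends only on device limits and the per-thread footprint, not on a user-supplied setting such as --accelerator-threads.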

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-19 09:46:43 -04:00