mirror of
https://github.com/paboyle/Grid.git
synced 2026-05-23 02:24:17 +01:00
747c167658
Replaces the scheme in which one thread per outer site called Reduce(), a sequential loop over the Nsimd lanes, with one thread per lane calling extractLane(), which is O(1) work per thread. CUB now reduces over osites*Nsimd elements. This avoids the serial lane reduction, but the stride of the per-lane sobjD stores remains a known concern.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
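
The change in work decomposition described above can be illustrated with a small host-side C++ model. This is only a sketch under stated assumptions, not the Grid implementation: `VecSite`, `reduce_site`, `extract_lane`, `sum_per_site` and `sum_per_lane` are hypothetical stand-ins, the flat index mapping `idx -> (site, lane)` is assumed for illustration, and `std::accumulate` takes the place of the CUB device reduction.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical stand-in for a SIMD vector object: Nsimd lanes per outer site.
constexpr std::size_t Nsimd = 4;
struct VecSite { double lane[Nsimd]; };

// Toy analogue of Reduce(): a sequential sum over the lanes of one site.
double reduce_site(const VecSite &v) {
  double s = 0.0;
  for (std::size_t l = 0; l < Nsimd; ++l) s += v.lane[l];
  return s;
}

// Toy analogue of extractLane(): an O(1) read of a single lane.
double extract_lane(const VecSite &v, std::size_t l) { return v.lane[l]; }

// Old decomposition: one work item per outer site, each running an
// Nsimd-wide loop; the final reduction sees osites elements.
double sum_per_site(const std::vector<VecSite> &field) {
  std::vector<double> buf(field.size());
  for (std::size_t ss = 0; ss < field.size(); ++ss)
    buf[ss] = reduce_site(field[ss]);
  return std::accumulate(buf.begin(), buf.end(), 0.0);  // stands in for CUB
}

// New decomposition: one work item per (site, lane) pair, each doing O(1)
// work; the final reduction sees osites*Nsimd elements.
double sum_per_lane(const std::vector<VecSite> &field) {
  std::vector<double> buf(field.size() * Nsimd);
  for (std::size_t idx = 0; idx < buf.size(); ++idx) {
    const std::size_t ss = idx / Nsimd;  // assumed mapping: outer site index
    const std::size_t l  = idx % Nsimd;  // assumed mapping: lane index
    buf[idx] = extract_lane(field[ss], l);
  }
  return std::accumulate(buf.begin(), buf.end(), 0.0);  // stands in for CUB
}
```

Both functions return the same total for the same input. The second one moves all of the summation into the final reduction, at the cost of a buffer that is Nsimd times larger.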