Replace the word-by-word loop (one kernel launch per scalar word) with
sumD_gpu_reduce_words<R> which packs R consecutive vector_type words per
site into iVector<iScalar<vector>,R>, then calls the existing sumD_gpu_small
shared-memory kernel once for the whole bundle.
Dispatch: radix-12 first, radix-4 for the remainder < 12, radix-1 for
any final < 4 words. For LatticePropagator (144 words = 12x12), this
reduces the kernel-launch count from 144 to 12 -- a 12x reduction.
Bundle::Nsimd() inherits from vector_type so sumD_gpu_small handles SIMD
lane extraction and double-precision promotion identically to the scalar
word case. sizeof(Bundle::scalar_objectD) = R*16 <= 192 B; well within
sharedMemPerBlock on all supported devices.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Accumulate in sobjD throughout rather than accumulating in sobj and
converting the final sum. For float fields this matters: summing N floats
then casting loses O(N*eps_float) relative precision vs accumulating in
double from the start.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Remove the CUB/hipCUB direction entirely. Restore Lattice_reduction_gpu.h,
Lattice_reduction_sycl.h, and Lattice_reduction.h to the state before the
CUB rewrite (commit 969b0a39), recovering the original primary function names
(sumD_gpu_small, sumD_gpu_large, sumD_gpu, sum_gpu, sum_gpu_large) and the
hand-rolled shared-memory reduction kernel.
Delete Lattice_reduction_gpu_cub.h. Update Test_reduction to remove the
old/new comparison sections that depended on sum_gpu_old.
The lesson: CUB DeviceReduce is slower than the hand-rolled kernel for small
types, and the smem sizing problem for the extraction pass has no clean
solution within the accelerator_for abstraction. The right improvement is
a higher radix (12 then 4) in sumD_gpu_large, applied directly to the
existing hand-rolled kernel.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Benchmarking showed the shared-memory lane-summation approach (843d6497)
was slower than writing each SIMD lane individually and letting CUB reduce
the full nlanes = osites*Nsimd array. CUB's device reduce is more efficient
over the larger input than the smem overhead + serialised lane-0 summation.
The smem approach also required overriding acceleratorThreads() to avoid
the block-size sizing problem. Restore the simpler per-lane path.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Set acceleratorThreads to 1 before the extraction kernel so that
dim3(nsimd,1,1) blocks give exactly one site group per block and
__shared__ sobjD smem[nsimd] is correctly sized without depending on
the runtime acceleratorThreads() value. threadIdx.x (acceleratorSIMTlane)
indexes the SIMD lane for coalesced reads; lane 0 sums smem[0..nsimd-1]
and writes one sobjD per site. CUB then reduces osites elements instead
of osites*nsimd, reducing both store traffic and CUB work by Nsimd.
acceleratorSynchronise() (warp-level) suffices since nsimd < warpSize.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replaces one thread per outer site calling Reduce() (sequential Nsimd-wide
loop) with one thread per lane calling extractLane() — O(1) per thread.
CUB now reduces over osites*Nsimd elements. Avoids serial lane reduction
but leaves the per-lane sobjD store stride as a known remaining concern.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Guards accelerator_for and CUB DeviceReduce calls in sumD_gpu_direct
and sumD_gpu_large with #ifdef GRID_REDUCTION_TIMING to isolate where
time is spent in each path. Large path accumulates across all groups
and prints totals with words/nfull/rem context.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
WordBundle4 was redundant with Grid's existing tensor infrastructure.
iVector<iScalar<scalarD>,4> already provides accelerator_inline operator+,
zeroit(), and sycl::is_device_copyable — no new type needed.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Reports us/call and GB/s for sum_gpu (CUB/sycl::reduction) and
sum_gpu_old (hand-rolled shared-memory) for each field type, with
5-call warmup and 100-call timed loop.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Float fields require a grid constructed with vComplexF::Nsimd(); using
a double grid causes grid->_gsites to undercount the sites in float
vobjF, making the constant-field expected value wrong.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The function computes |sum|^2 — the squared magnitude of an aggregate sum —
not a norm. squaredSum makes clear that squaring is applied to the sum, not
to individual site values before summing, distinguishing it from sumOfSquares
(the squared L2 norm).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
rocPRIM's DeviceReduce requires warpSize(64) threads each holding one element in shared
memory, so sizeof(T)*64 must fit in sharedMemPerBlock. LatticePropagator::scalar_objectD
is 2304 bytes (64*2304 = 147 KB), exceeding the budget and triggering a compile-time
static_assert in limit_block_size.
Introduce sumD_gpu_direct (the original direct-CUB path, safe for small types) and a new
sumD_gpu_large that groups the vobj's vector_type words in bundles of 4, reducing each
bundle as WordBundle4<scalarD> (64 bytes, 64*64 = 4 KB — always within budget). If
words % 4 != 0, the final partial bundle is zero-padded. sumD_gpu dispatches at compile
time via if constexpr on sizeof(sobjD) > 512.
For LatticePropagator (144 words) this gives 36 CUB launches instead of 144.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Tests the new CUB/hipCUB/SYCL lattice reduction (sum_gpu) against the
preserved hand-rolled implementation (sum_gpu_old) for LatticeComplexF/D,
LatticeColourMatrixF/D and LatticePropagatorF/D.
Part a) gaussian random field: checks that old and new agree to within
float/double roundoff tolerance.
Part b) constant field (= 1.0, identity-matrix init): verifies
innerProduct(sum, sum) = Ncomp * V^2 where Ncomp counts the nonzero
diagonal scalar components per site (1 / Nc / Ns*Nc respectively).
Make.inc is auto-generated by scripts/filelist on bootstrap and is not
tracked; the new .cc file is all that is needed.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace hand-rolled shared-memory reduction kernels (reduceBlock/reduceBlocks/
reduceKernel) and the global device variable retirementCount with a unified
CUB/hipCUB DeviceReduce::Reduce path for CUDA/HIP and sycl::reduction for SYCL.
No small/large split is needed: both CUB and sycl::reduction handle arbitrary
object sizes internally.
Old implementations preserved as sum_gpu_old / sumD_gpu_old etc. in the
original files for regression testing on GPU hardware.
Also add CLAUDE.md with build, test, and architecture guidance.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>