portelli/Grid - Grid - DiRAC Tursa git server

mirror of https://github.com/paboyle/Grid.git synced 2026-06-24 12:33:30 +01:00

Author	SHA1	Message	Date
Peter Boyle	1304172a93	Modified repack	2026-05-19 08:53:13 -04:00
Peter Boyle	1315d4604d	Enable GRID_REDUCTION_TIMING unconditionally Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-18 22:14:00 -04:00
Peter Boyle	a31af31328	Lattice_reduction_gpu: add GRID_REDUCTION_TIMING instrumentation Uncomment #define GRID_REDUCTION_TIMING to enable per-phase timing output: sumD_gpu_reduce_words: pack time (accelerator_for) per R and base sumD_gpu_small: reduceKernel+barrier time and D2H time separately sumD_gpu_large: total wall time across all word groups This lets us identify whether the large-type bottleneck is in the pack kernel, the shared-memory reduction kernel, the barrier, or the D2H. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-18 22:13:30 -04:00
Peter Boyle	26c3c7d8f9	sumD_gpu_large: radix-12 word-bundle reduction replacing radix-1 Replace the word-by-word loop (one kernel launch per scalar word) with sumD_gpu_reduce_words<R> which packs R consecutive vector_type words per site into iVector<iScalar<vector>,R>, then calls the existing sumD_gpu_small shared-memory kernel once for the whole bundle. Dispatch: radix-12 first, radix-4 for the remainder < 12, radix-1 for any final < 4 words. For LatticePropagator (144 words = 12x12), this reduces the kernel-launch count from 144 to 12 -- a 12x reduction. Bundle::Nsimd() inherits from vector_type so sumD_gpu_small handles SIMD lane extraction and double-precision promotion identically to the scalar word case. sizeof(Bundle::scalar_objectD) = R*16 <= 192 B; well within sharedMemPerBlock on all supported devices. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-18 21:56:45 -04:00
Peter Boyle	068f95ad2d	Revert to hand-rolled reduction; drop Lattice_reduction_gpu_cub.h Remove the CUB/hipCUB direction entirely. Restore Lattice_reduction_gpu.h, Lattice_reduction_sycl.h, and Lattice_reduction.h to the state before the CUB rewrite (commit `969b0a39`), recovering the original primary function names (sumD_gpu_small, sumD_gpu_large, sumD_gpu, sum_gpu, sum_gpu_large) and the hand-rolled shared-memory reduction kernel. Delete Lattice_reduction_gpu_cub.h. Update Test_reduction to remove the old/new comparison sections that depended on sum_gpu_old. The lesson: CUB DeviceReduce is slower than the hand-rolled kernel for small types, and the smem sizing problem for the extraction pass has no clean solution within the accelerator_for abstraction. The right improvement is a higher radix (12 then 4) in sumD_gpu_large, applied directly to the existing hand-rolled kernel. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-18 21:52:18 -04:00
Peter Boyle	969b0a3922	Rewrite lattice GPU reduction to use CUB, hipCUB, and SYCL reduction Replace hand-rolled shared-memory reduction kernels (reduceBlock/reduceBlocks/ reduceKernel) and the global device variable retirementCount with a unified CUB/hipCUB DeviceReduce::Reduce path for CUDA/HIP and sycl::reduction for SYCL. No small/large split is needed: both CUB and sycl::reduction handle arbitrary object sizes internally. Old implementations preserved as sum_gpu_old / sumD_gpu_old etc. in the original files for regression testing on GPU hardware. Also add CLAUDE.md with build, test, and architecture guidance. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-15 13:41:56 -04:00
paboyle	9e6a4a4737	Assertion updates to macros (mostly) with backtrace. Wilson flow to include options for DBW2, Iwasaki, Symanzik. View logging for data assurance	2025-08-07 15:48:38 +00:00
paboyle	066544281f	Deprecate UVM	2024-09-17 13:34:27 +00:00
Peter Boyle	33097681b9	FTHMC compiled and merged to develop	2023-10-14 00:42:55 +03:00
Peter Boyle	5068413cdb	Merge branch 'feature/dirichlet' of https://github.com/paboyle/Grid into feature/dirichlet	2023-03-28 08:35:38 -07:00
Peter Boyle	71c6960eea	Comment	2023-03-28 08:34:24 -07:00
Peter Boyle	d8a9a745d8	stream synchronise	2023-03-24 15:40:30 -04:00
Peter Boyle	d0bb033ea2	Device resident GPU block buffer instead of UVM as hit likely UVM bug. Code worked on CUDA 11.4 but fails on later drivers (certainly 530.30.02, but need to find the perlmutter driver version).	2023-03-22 19:07:32 -04:00
Peter Boyle	551a5f8dc8	RRII gpu option	2022-10-11 14:44:55 -04:00
fjosw	d1decee4cc	Cleaned up unused variables in Lattice_reduction_gpu.h	2022-03-02 16:54:23 +00:00
fjosw	d4ae71b880	sum_gpu_large and sum_gpu templates added.	2022-03-02 15:40:18 +00:00
Peter Boyle	3e882f555d	Large / small sumD options	2022-03-01 08:54:45 -05:00
Peter Boyle	42d56ea6b6	Verbosity	2021-10-29 02:23:08 +01:00
Peter Boyle	0b905a72dd	Better reduction for GPUs	2021-10-29 02:22:22 +01:00
fjosw	1f9688417a	Error message added when attempting to sum object which is too large for the shared memory	2021-10-13 20:45:46 +01:00
Peter Boyle	288c615782	Hip improvements	2020-09-16 00:31:50 +01:00
Peter Boyle	92b342a477	Hip reduction too	2020-05-24 13:50:28 -04:00
Peter Boyle	3e49dc8a67	Reduction finished and hopefully fixes CI regression fail on single precision and force	2019-08-14 15:18:34 +01:00
Peter Boyle	ce97638bac	Think the reduction is now sorted and cleaned up	2019-08-11 11:09:01 +01:00
Peter Boyle	9dad7a0094	Reproducible reduction and axpy_norm offload from Gianluca. Hopefully get CG running entirely on GPU	2019-07-30 00:14:12 +01:00

25 Commits
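
The dispatch order described in commit 26c3c7d8f9 (radix-12 first, radix-4 for a remainder below 12, radix-1 for a final remainder below 4) can be illustrated with a small host-side sketch. This is not Grid's code: the names `BundlePlan`, `plan_bundles`, and `bundle_bytes` are hypothetical and exist only for illustration, and the 16-byte figure per word is taken from the commit message's statement that sizeof(Bundle::scalar_objectD) = R*16. The sketch only counts how many bundles, and therefore kernel launches, a given word count would need under that scheme.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical illustration type (not part of Grid): one entry per kernel launch.
struct BundlePlan {
  std::size_t first_word; // index of the first word packed into this bundle
  std::size_t radix;      // number of consecutive words in the bundle: 12, 4, or 1
};

// Bytes of one double-precision bundle, assuming 16 bytes per word
// as stated in the commit message (R * 16).
constexpr std::size_t bundle_bytes(std::size_t radix) { return radix * 16; }

// Split `words` per-site words into bundles: as many radix-12 bundles as fit,
// then radix-4 bundles for the remainder, then radix-1 for whatever is left.
inline std::vector<BundlePlan> plan_bundles(std::size_t words) {
  const std::size_t radices[] = {12, 4, 1};
  std::vector<BundlePlan> plan;
  std::size_t next = 0; // first word not yet assigned to a bundle
  for (std::size_t radix : radices) {
    while (words - next >= radix) {
      plan.push_back({next, radix});
      next += radix;
    }
  }
  assert(next == words); // radix-1 guarantees every word is covered
  return plan;
}
```

Under these assumptions, `plan_bundles(144)` yields 12 bundles of radix 12, matching the 144-to-12 launch reduction quoted for LatticePropagator, and `bundle_bytes(12)` gives the 192 B bound mentioned in the same commit. A word count such as 23 would split into one radix-12 bundle, two radix-4 bundles, and three radix-1 bundles.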