mirror of https://github.com/paboyle/Grid.git synced 2026-05-20 17:14:30 +01:00
Commit Graph

8191 Commits

Author SHA1 Message Date
Peter Boyle 003fec509c Fix Zero() used on thrust::complex in WordBundle4 initialisation
Grid's Zero() sentinel is not assignable to thrust::complex<double>;
use scalarD(0) instead.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-15 18:10:17 -04:00
Peter Boyle 773a82d87f Reinstate large/small dispatch in CUB reduction path; radix-4 word-bundle for large types
rocPRIM's DeviceReduce requires warpSize(64) threads each holding one element in shared
memory, so sizeof(T)*64 must fit in sharedMemPerBlock.  LatticePropagator::scalar_objectD
is 2304 bytes (64*2304 = 147 KB), exceeding the budget and triggering a compile-time
static_assert in limit_block_size.

Introduce sumD_gpu_direct (the original direct-CUB path, safe for small types) and a new
sumD_gpu_large that groups the vobj's vector_type words in bundles of 4, reducing each
bundle as WordBundle4<scalarD> (64 bytes, 64*64 = 4 KB — always within budget).  If
words % 4 != 0, the final partial bundle is zero-padded.  sumD_gpu dispatches at compile
time via if constexpr on sizeof(sobjD) > 512.

For LatticePropagator (144 words) this gives 36 CUB launches instead of 144.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-15 16:55:58 -04:00
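The arithmetic in the commit message above can be checked with a minimal host-side sketch. It is not Grid's code: `SmallObj`, `LargeObj`, `shared_bytes`, `bundle_launches` and `uses_large_path` are hypothetical names, and the 16-byte word size assumes a complex double. Only the constants (64 threads, bundles of 4, the 512-byte threshold, 2304 bytes, 144 words) come from the message.

```cpp
#include <cstddef>

// Hypothetical stand-ins for reduction element types (not Grid's real types).
struct SmallObj { double w[2 * 9]; };    // 9 complex doubles  = 144 bytes
struct LargeObj { double w[2 * 144]; };  // 144 complex doubles = 2304 bytes

constexpr std::size_t kThreads   = 64;   // one element per thread, per the commit message
constexpr std::size_t kBundle    = 4;    // words per bundle
constexpr std::size_t kThreshold = 512;  // bytes; dispatch threshold from the commit message

// Shared memory needed when each of kThreads threads holds one element.
constexpr std::size_t shared_bytes(std::size_t elem_bytes) {
  return elem_bytes * kThreads;
}

// Number of bundle reductions for `words` words; the last bundle is padded.
constexpr std::size_t bundle_launches(std::size_t words) {
  return (words + kBundle - 1) / kBundle;
}

// Compile-time dispatch on object size, in the style of `if constexpr`.
template <class T>
constexpr bool uses_large_path() {
  if constexpr (sizeof(T) > kThreshold) {
    return true;   // would take the bundled path
  } else {
    return false;  // would take the direct path
  }
}

static_assert(shared_bytes(sizeof(LargeObj)) == 147456, "64 * 2304 bytes");
static_assert(shared_bytes(4 * 16) == 4096, "64 * 64 bytes for a 4-word bundle");
static_assert(bundle_launches(144) == 36, "144 words in bundles of 4");
```

Under these assumptions a 144-word object needs 36 bundle reductions, and a 9-word object would need 3 if it were bundled, with the last bundle padded by three zero words.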
Peter Boyle 286c29d6fb Add Test_reduction to tests/debug
Tests the new CUB/hipCUB/SYCL lattice reduction (sum_gpu) against the
preserved hand-rolled implementation (sum_gpu_old) for LatticeComplexF/D,
LatticeColourMatrixF/D and LatticePropagatorF/D.

Part a) gaussian random field: checks that old and new agree to within
float/double roundoff tolerance.
Part b) constant field (= 1.0, identity-matrix init): verifies
innerProduct(sum, sum) = Ncomp * V^2 where Ncomp counts the nonzero
diagonal scalar components per site (1 / Nc / Ns*Nc respectively).

Make.inc is auto-generated by scripts/filelist on bootstrap and is not
tracked; the new .cc file is all that is needed.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-15 14:31:33 -04:00
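The identity checked in part b) above can be illustrated with a small host-side sketch. It does not use Grid: `identity_sum_norm` is a hypothetical helper that sums an n x n identity matrix over V sites and returns the inner product of that sum with itself, which should equal n * V^2. The values n = 1, 3 and 12 assume Nc = 3 and Ns = 4.

```cpp
#include <complex>
#include <cstddef>
#include <vector>

using Cplx = std::complex<double>;

// Hypothetical helper: sum an n x n identity matrix over `volume` sites,
// then return innerProduct(sum, sum) = sum over components of |c|^2.
double identity_sum_norm(std::size_t n, std::size_t volume) {
  std::vector<Cplx> sum(n * n, Cplx(0.0, 0.0));
  for (std::size_t site = 0; site < volume; ++site) {
    for (std::size_t i = 0; i < n; ++i) {
      sum[i * n + i] += Cplx(1.0, 0.0);  // only diagonal entries are nonzero
    }
  }
  double ip = 0.0;
  for (const Cplx& c : sum) {
    ip += std::norm(c);  // std::norm returns |c|^2
  }
  return ip;
}
```

For V = 16 this gives 256, 768 and 3072 for n = 1, 3 and 12, matching Ncomp * V^2 with Ncomp equal to the number of nonzero diagonal components.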
Peter Boyle 969b0a3922 Rewrite lattice GPU reduction to use CUB, hipCUB, and SYCL reduction
Replace hand-rolled shared-memory reduction kernels (reduceBlock/reduceBlocks/
reduceKernel) and the global device variable retirementCount with a unified
CUB/hipCUB DeviceReduce::Reduce path for CUDA/HIP and sycl::reduction for SYCL.
No small/large split is needed: both CUB and sycl::reduction handle arbitrary
object sizes internally.

Old implementations preserved as sum_gpu_old / sumD_gpu_old etc. in the
original files for regression testing on GPU hardware.

Also add CLAUDE.md with build, test, and architecture guidance.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-15 13:41:56 -04:00
Peter Boyle c6c2834e03 Hip Happy 2026-05-15 11:30:29 -04:00
Peter Boyle 856545a1db Support ROCM 7.0.2 2026-05-15 11:30:29 -04:00
Peter Boyle e2d607f6c7 Merge pull request #490 from jdmaia/hip-guard-acceleratorfor2dNB
[HIP] Including kernel launch parameter guard on accelerator_for2dNB
2026-05-06 14:51:30 -04:00
Julio Maia 66da4e0657 Including guard on accelerator_for2dNB against invalid kernel configurations if GRID_HIP 2026-05-06 13:26:33 -05:00
Peter Boyle b37390bb5a 4 node usqcd run 2026-04-27 14:40:11 -07:00
Peter Boyle 829dc8cceb 32 node 2026-04-27 14:38:02 -07:00
Peter Boyle 13cc2c39f5 FOM run 2026-04-27 14:20:49 -07:00
Peter Boyle 66ea3b271c Merge branch 'develop' of https://github.com/paboyle/Grid into develop 2026-04-27 13:55:52 -07:00
Peter Boyle d293b58a20 384 node baseline run 2026-04-27 13:54:40 -07:00
Peter Boyle ce093b2bf3 rdtsc 2026-04-27 13:54:06 -07:00
Peter Boyle e4404efe5a Perlmutter compile update 2026-04-27 13:53:28 -07:00
Peter Boyle 5ce270f1de Adding Claude related files 2026-04-21 10:41:18 -04:00
Peter Boyle af43b067a0 New CLAUDE controllable visualiser 2026-04-10 11:23:25 -04:00
Quadro 34b44d1fee New file for animation in MD time direction 2026-04-02 13:55:38 -04:00
Peter Boyle 595ceaac37 Include grid header and make the ENABLE correct 2026-03-11 17:24:44 -04:00
Peter Boyle daf5834e8e Fixing incorrect PR about disable fermion instantiations 2026-03-11 17:05:46 -04:00
Peter Boyle 0d8658a039 Optimised 2026-03-05 06:06:32 -05:00
Peter Boyle 095e004d01 Setup change GCR 2026-03-05 06:06:32 -05:00
Peter Boyle 0acabee7f6 Modest change 2026-03-05 06:06:32 -05:00
Peter Boyle 76fbcffb60 Improvement to 16^3 hdcg 2026-03-05 06:06:32 -05:00
Peter Boyle a0a62d7ead Merge pull request #478 from vataspro/PolyakovUpstream
Spatial Polyakov Loop implementation
2026-02-24 20:45:42 -05:00
Peter Boyle c5038ea6a5 Merge pull request #483 from cmcknigh/bugfix/rocm7-rocblas-type-refactor
Adding a version check to handle rocBlas type refactor
2026-02-24 20:45:03 -05:00
Peter Boyle a5120903eb Merge pull request #486 from RChrHill/fix/sp4-fp32
Define Sp4 ProjectOnGeneralGroup for generic vtype
2026-02-24 20:44:08 -05:00
Peter Boyle 00b286a08a Merge pull request #488 from RChrHill/feature/additional-ET-traces
Add ET support for Lattice spin- and colour-traces
2026-02-24 20:43:45 -05:00
Peter Boyle 24a9759353 Merge pull request #485 from edbennett/skip-fermion-instantiations
Be able to skip compiling fermion instantiations altogether
2026-02-24 20:43:20 -05:00
edbennett 1b56f6f46d be able to skip compiling fermion instantiations altogether 2026-02-24 23:52:18 +00:00
Peter Boyle 2a8084d569 Subspace setup 2026-02-13 17:26:11 -05:00
Peter Boyle 6ff29f9d4f Alternate multigrids 2026-02-13 17:25:45 -05:00
RChHill c4d3e79193 Add ET support for Lattice spin- and colour-traces 2026-01-29 14:46:52 +00:00
Peter Boyle 7cd3f21e6b preserving a bunch of experiments on setup and g5 subspace doubling 2026-01-06 05:57:39 -05:00
paboyle 4a0aaf0786 Fix issue with Aurora compilers 2025-11-21 21:41:13 +00:00
paboyle 9c3835524c Fix compile warn 2025-11-21 21:41:12 +00:00
paboyle 549351bb8a Stag verbose clean up 2025-11-20 18:22:57 +00:00
RChHill b650b89682 Define Sp4 ProjectOnGeneralGroup for generic vtype 2025-11-19 13:26:52 +00:00
Peter Boyle 74e6b19f83 Looks like the reuse of xfers in staggered has bugs or corner cases depending on volume 2025-11-17 22:29:06 -05:00
Peter Boyle 2e684028de Improvements 2025-11-14 18:12:27 -05:00
paboyle c54d87a472 Aurora compile fix for new compiler 2025-11-06 18:17:33 +00:00
Allen McKnight 4304245c1b Merge branch 'develop' into bugfix/rocm7-rocblas-type-refactor 2025-11-04 08:50:11 -06:00
Peter Boyle 6165931afa Update GridStd.h 2025-10-03 14:35:37 -04:00
Your Name 1d1fd3bcaf adding a version check to handle rocblas type change 2025-10-02 15:24:24 -05:00
paboyle 23581333e6 link cufft 2025-08-21 22:25:55 +01:00
paboyle e5fa3d887f Compile on CUDA 2025-08-21 22:10:27 +01:00
paboyle 583fa7bb0a FFTW guarded after CUDA and HIP 2025-08-21 22:00:12 +01:00
Peter Boyle fe0db53842 FFT offload to GPU and MUCH faster comms.
40x speed up on Frontier
2025-08-21 16:45:38 -04:00
Peter Boyle 76c0ada1e1 Benchmark for En Hung 2025-08-21 16:45:38 -04:00
Peter Boyle 92f49e9194 Merge pull request #482 from g-simonetti/wflow_sp2n_paboyle
Fixed Wilson flow for Nc not equal to 3
2025-08-21 09:10:25 -04:00