portelli/Grid - Grid - DiRAC Tursa git server

mirror of https://github.com/paboyle/Grid.git synced 2026-07-17 23:53:27 +01:00

Author	SHA1	Message	Date
Peter BoyleandClaude Sonnet 4.6	50aa51f93a	debug: add Test_hipfft_repro — reproducer for hipFFT PARSE_ERROR on ROCm 7 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-19 22:27:27 -04:00
Peter BoyleandClaude Sonnet 4.6	79ccc81a86	tests/debug: add G=4 to hipfft fail reproducer Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-19 22:21:52 -04:00
Peter BoyleandClaude Sonnet 4.6	3f0fdbb597	tests/debug: test hipMemset variant before cache is populated Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-19 22:10:16 -04:00
Peter BoyleandClaude Sonnet 4.6	ea57bd8f03	tests/debug: extend hipfft fail reproducer with hipMemset and sync variants Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-19 22:02:02 -04:00
Peter BoyleandClaude Sonnet 4.6	bdba5b8403	FFT: use host stack buffer in PlanCreate, not deviceVector Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-19 21:49:06 -04:00
Peter BoyleandClaude Sonnet 4.6	58cc6ca9c0	tests/debug: add minimal hipfft ordering bug fail/pass pair Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-19 21:48:23 -04:00
Peter BoyleandClaude Sonnet 4.6	e5996b440d	tests/debug: test plan-before-malloc vs malloc-before-plan ordering Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-19 21:40:17 -04:00
Peter BoyleandClaude Sonnet 4.6	ad9d03fd85	tests/debug: extend hipfft reproducer with Grid-realistic howmany and exec tests Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-19 19:19:59 -04:00
Peter BoyleandClaude Sonnet 4.6	4de160ce20	tests/debug: add minimal hipfft plan-creation reproducer Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-19 17:52:59 -04:00
Peter Boyle	fc8c8ce6e7	FFT HIP: use hipfftCreate+hipfftMakePlanMany instead of hipfftPlanMany	2026-05-19 17:29:28 -04:00
Peter Boyle	ddbb7f07c8	FFT: pass nullptr for inembed/onembed in hipfftPlanMany to avoid HIPFFT_PARSE_ERROR	2026-05-19 17:15:21 -04:00
Peter BoyleandClaude Sonnet 4.6	1e29c59bcc	FFT: cache plans per vobj type across calls Plans are created lazily on the first FFT_dim call and reused for all subsequent calls on the same FFT object. PlanCreate<vobj>() can be called explicitly to pre-warm the cache. PlanDestroy() must be called before switching to a different vobj type; the destructor cleans up any live plans automatically. Update Test_fft.cc and Test_fftf.cc to call PlanDestroy() between the LatticeComplex and LatticeSpinMatrix sections that reuse the same FFT object. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-19 15:12:10 -04:00
Peter BoyleandClaude Sonnet 4.6	b6abdc3845	Accelerator: lower default accelerator_threads from 16 to 8 Benchmark_dwf_fp32 on MI250X GCD: 1.7 TF/s at nt=8, ~300 GF/s at nt=16. With Nsimd=8 (fp32, GEN_SIMD_WIDTH=64B), nt=8 gives exactly 64 threads = one full AMD wavefront. Higher values double register demand per block and hit a register-pressure cliff for stencil kernels. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-19 13:41:03 -04:00
Peter Boyle	2fadd8bb62	Accelerator: raise default accelerator_threads from 2 to 16	2026-05-19 10:15:53 -04:00
Peter BoyleandClaude Sonnet 4.6	60df2dd5d0	skills: add gpu-memory-performance.md Documents the acceleratorThreads() default=2 trap, LambdaApply thread mapping, coalescedRead/Write idiom, when to use __global__ vs accelerator_for, and fused vs staged HBM access patterns. Includes observed MI250X numbers from LatticePropagatorD reduction (50 → 297 → 546 GB/s progression). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-19 10:03:32 -04:00
Peter BoyleandClaude Sonnet 4.6	66b529b345	sumD_gpu_reduce_words: fuse pack+reduce into single packReduceKernel Replace the two-kernel pack+reduce sequence with a single fused kernel packReduceKernel<R> that reads R words of each vobj at offset 'base' and accumulates directly into iVector<iScalar<scalarD>,R>, eliminating the intermediate bundle buffer entirely. HBM access per word-group drops from 3x (pack-read + pack-write + reduce-read) to 1x. Thread count comes from getNumBlocksAndThreads (warpSize..256) rather than acceleratorThreads(), so occupancy is correct regardless of the --accelerator-threads setting. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-19 09:46:43 -04:00
Peter Boyle	1304172a93	Modified repack	2026-05-19 08:53:13 -04:00
Peter BoyleandClaude Sonnet 4.6	1315d4604d	Enable GRID_REDUCTION_TIMING unconditionally Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-18 22:14:00 -04:00
Peter BoyleandClaude Sonnet 4.6	a31af31328	Lattice_reduction_gpu: add GRID_REDUCTION_TIMING instrumentation Uncomment #define GRID_REDUCTION_TIMING to enable per-phase timing output: sumD_gpu_reduce_words: pack time (accelerator_for) per R and base sumD_gpu_small: reduceKernel+barrier time and D2H time separately sumD_gpu_large: total wall time across all word groups This lets us identify whether the large-type bottleneck is in the pack kernel, the shared-memory reduction kernel, the barrier, or the D2H. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-18 22:13:30 -04:00
Peter BoyleandClaude Sonnet 4.6	26c3c7d8f9	sumD_gpu_large: radix-12 word-bundle reduction replacing radix-1 Replace the word-by-word loop (one kernel launch per scalar word) with sumD_gpu_reduce_words<R> which packs R consecutive vector_type words per site into iVector<iScalar<vector>,R>, then calls the existing sumD_gpu_small shared-memory kernel once for the whole bundle. Dispatch: radix-12 first, radix-4 for the remainder < 12, radix-1 for any final < 4 words. For LatticePropagator (144 words = 12x12), this reduces the kernel-launch count from 144 to 12 -- a 12x reduction. Bundle::Nsimd() inherits from vector_type so sumD_gpu_small handles SIMD lane extraction and double-precision promotion identically to the scalar word case. sizeof(Bundle::scalar_objectD) = R*16 <= 192 B; well within sharedMemPerBlock on all supported devices. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-18 21:56:45 -04:00
Peter BoyleandClaude Sonnet 4.6	0650d7c7eb	Lattice_reduction_sycl: fix double-precision accumulation in sumD_gpu_tensor Accumulate in sobjD throughout rather than accumulating in sobj and converting the final sum. For float fields this matters: summing N floats then casting loses O(N*eps_float) relative precision vs accumulating in double from the start. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-18 21:53:40 -04:00
Peter BoyleandClaude Sonnet 4.6	068f95ad2d	Revert to hand-rolled reduction; drop Lattice_reduction_gpu_cub.h Remove the CUB/hipCUB direction entirely. Restore Lattice_reduction_gpu.h, Lattice_reduction_sycl.h, and Lattice_reduction.h to the state before the CUB rewrite (commit `969b0a39`), recovering the original primary function names (sumD_gpu_small, sumD_gpu_large, sumD_gpu, sum_gpu, sum_gpu_large) and the hand-rolled shared-memory reduction kernel. Delete Lattice_reduction_gpu_cub.h. Update Test_reduction to remove the old/new comparison sections that depended on sum_gpu_old. The lesson: CUB DeviceReduce is slower than the hand-rolled kernel for small types, and the smem sizing problem for the extraction pass has no clean solution within the accelerator_for abstraction. The right improvement is a higher radix (12 then 4) in sumD_gpu_large, applied directly to the existing hand-rolled kernel. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-18 21:52:18 -04:00
Peter BoyleandClaude Sonnet 4.6	f4fbf7c9ca	sumD_gpu_direct: revert to per-lane write; CUB handles Nsimdosites inputs Benchmarking showed the shared-memory lane-summation approach (`843d6497`) was slower than writing each SIMD lane individually and letting CUB reduce the full nlanes = ositesNsimd array. CUB's device reduce is more efficient over the larger input than the smem overhead + serialised lane-0 summation. The smem approach also required overriding acceleratorThreads() to avoid the block-size sizing problem. Restore the simpler per-lane path. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-18 21:23:15 -04:00
Peter BoyleandClaude Sonnet 4.6	843d6497b2	sumD_gpu_direct: shared-memory lane reduction with acceleratorThreads(1) Set acceleratorThreads to 1 before the extraction kernel so that dim3(nsimd,1,1) blocks give exactly one site group per block and __shared__ sobjD smem[nsimd] is correctly sized without depending on the runtime acceleratorThreads() value. threadIdx.x (acceleratorSIMTlane) indexes the SIMD lane for coalesced reads; lane 0 sums smem[0..nsimd-1] and writes one sobjD per site. CUB then reduces osites elements instead of osites*nsimd, reducing both store traffic and CUB work by Nsimd. acceleratorSynchronise() (warp-level) suffices since nsimd < warpSize. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-18 21:08:10 -04:00
Peter BoyleandClaude Sonnet 4.6	747c167658	sumD_gpu_direct: one thread per SIMD lane using extractLane Replaces one thread per outer site calling Reduce() (sequential Nsimd-wide loop) with one thread per lane calling extractLane() — O(1) per thread. CUB now reduces over osites*Nsimd elements. Avoids serial lane reduction but leaves the per-lane sobjD store stride as a known remaining concern. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-18 16:21:50 -04:00
Peter BoyleandClaude Sonnet 4.6	fca2c5dba0	Lattice_reduction_gpu_cub: define GRID_REDUCTION_TIMING in header Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-18 14:54:08 -04:00
Peter BoyleandClaude Sonnet 4.6	e12bc7f07c	Lattice_reduction_gpu_cub: add GRID_REDUCTION_TIMING instrumentation Guards accelerator_for and CUB DeviceReduce calls in sumD_gpu_direct and sumD_gpu_large with #ifdef GRID_REDUCTION_TIMING to isolate where time is spent in each path. Large path accumulates across all groups and prints totals with words/nfull/rem context. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-18 14:23:44 -04:00
Peter BoyleandClaude Sonnet 4.6	dc6ae51cab	Lattice_reduction_gpu_cub: replace WordBundle4 with iVector<iScalar<scalarD>,4> WordBundle4 was redundant with Grid's existing tensor infrastructure. iVector<iScalar<scalarD>,4> already provides accelerator_inline operator+, zeroit(), and sycl::is_device_copyable — no new type needed. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-18 13:55:28 -04:00
Peter BoyleandClaude Sonnet 4.6	baa70d8ec9	Test_reduction: add timing benchmark for new vs old reduction paths Reports us/call and GB/s for sum_gpu (CUB/sycl::reduction) and sum_gpu_old (hand-rolled shared-memory) for each field type, with 5-call warmup and 100-call timed loop. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-18 12:31:13 -04:00
Peter BoyleandClaude Sonnet 4.6	c93b338bdd	skills: HPC battle-hardening skill files for GPU+MPI correctness Six skill files encoding expertise for making codebases robust on problematic HPC systems, covering: correctness verification (double-run, fingerprinting, flight recorder), hang diagnosis, GPU runtime correctness (premature barrier, infinite poll), MPI correctness on heterogeneous systems (device buffer aliasing, AARCH64 PLT corruption, deterministic reductions), compiler validation, and communication/computation overlap pipeline design. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-18 12:10:44 -04:00
Peter BoyleandClaude Sonnet 4.6	c0472aa0ec	Test_reduction: use separate float and double grids Float fields require a grid constructed with vComplexF::Nsimd(); using a double grid causes grid->_gsites to undercount the sites in float vobjF, making the constant-field expected value wrong. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-18 12:09:35 -04:00
Peter BoyleandClaude Sonnet 4.6	09552cfd73	Rename scalarNorm2 to squaredSum in Test_reduction.cc The function computes \|sum\|^2 — the squared magnitude of an aggregate sum — not a norm. squaredSum makes clear that squaring is applied to the sum, not to individual site values before summing, distinguishing it from sumOfSquares (the squared L2 norm). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-15 23:15:11 -04:00
Peter BoyleandClaude Sonnet 4.6	003fec509c	Fix Zero() used on thrust::complex in WordBundle4 initialisation Grid's Zero() sentinel is not assignable to thrust::complex<double>; use scalarD(0) instead. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-15 18:10:17 -04:00
Peter BoyleandClaude Sonnet 4.6	773a82d87f	Reinstate large/small dispatch in CUB reduction path; radix-4 word-bundle for large types rocPRIM's DeviceReduce requires warpSize(64) threads each holding one element in shared memory, so sizeof(T)64 must fit in sharedMemPerBlock. LatticePropagator::scalar_objectD is 2304 bytes (642304 = 147 KB), exceeding the budget and triggering a compile-time static_assert in limit_block_size. Introduce sumD_gpu_direct (the original direct-CUB path, safe for small types) and a new sumD_gpu_large that groups the vobj's vector_type words in bundles of 4, reducing each bundle as WordBundle4<scalarD> (64 bytes, 64*64 = 4 KB — always within budget). If words % 4 != 0, the final partial bundle is zero-padded. sumD_gpu dispatches at compile time via if constexpr on sizeof(sobjD) > 512. For LatticePropagator (144 words) this gives 36 CUB launches instead of 144. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-15 16:55:58 -04:00
Peter BoyleandClaude Sonnet 4.6	286c29d6fb	Add Test_reduction to tests/debug Tests the new CUB/hipCUB/SYCL lattice reduction (sum_gpu) against the preserved hand-rolled implementation (sum_gpu_old) for LatticeComplexF/D, LatticeColourMatrixF/D and LatticePropagatorF/D. Part a) gaussian random field: checks that old and new agree to within float/double roundoff tolerance. Part b) constant field (= 1.0, identity-matrix init): verifies innerProduct(sum, sum) = Ncomp * V^2 where Ncomp counts the nonzero diagonal scalar components per site (1 / Nc / Ns*Nc respectively). Make.inc is auto-generated by scripts/filelist on bootstrap and is not tracked; the new .cc file is all that is needed. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-15 14:31:33 -04:00
Peter BoyleandClaude Sonnet 4.6	969b0a3922	Rewrite lattice GPU reduction to use CUB, hipCUB, and SYCL reduction Replace hand-rolled shared-memory reduction kernels (reduceBlock/reduceBlocks/ reduceKernel) and the global device variable retirementCount with a unified CUB/hipCUB DeviceReduce::Reduce path for CUDA/HIP and sycl::reduction for SYCL. No small/large split is needed: both CUB and sycl::reduction handle arbitrary object sizes internally. Old implementations preserved as sum_gpu_old / sumD_gpu_old etc. in the original files for regression testing on GPU hardware. Also add CLAUDE.md with build, test, and architecture guidance. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-15 13:41:56 -04:00
Peter Boyle	c6c2834e03	Hip Happy	2026-05-15 11:30:29 -04:00
Peter Boyle	856545a1db	Support ROCM 7.0.2	2026-05-15 11:30:29 -04:00
Peter BoyleandGitHub	e2d607f6c7	Merge pull request #490 from jdmaia/hip-guard-acceleratorfor2dNB [HIP] Including kernel launch parameter guard on accelerator_for2dNB	2026-05-06 14:51:30 -04:00
Julio Maia	66da4e0657	Including guard on accelerator_for2dNB against invalid kernel configurations if GRID_HIP	2026-05-06 13:26:33 -05:00
Peter Boyle	b37390bb5a	4 node usqcd run	2026-04-27 14:40:11 -07:00
Peter Boyle	829dc8cceb	32 node	2026-04-27 14:38:02 -07:00
Peter Boyle	13cc2c39f5	FOM run	2026-04-27 14:20:49 -07:00
Peter Boyle	66ea3b271c	Merge branch 'develop' of https://github.com/paboyle/Grid into develop	2026-04-27 13:55:52 -07:00
Peter Boyle	d293b58a20	384 node baseline run	2026-04-27 13:54:40 -07:00
Peter Boyle	ce093b2bf3	rdtsc	2026-04-27 13:54:06 -07:00
Peter Boyle	e4404efe5a	Perlmutter compile update	2026-04-27 13:53:28 -07:00
Peter Boyle	5ce270f1de	Adding Claude related files	2026-04-21 10:41:18 -04:00
Peter Boyle	af43b067a0	New CLAUDE controllable visualiser	2026-04-10 11:23:25 -04:00
Quadro	34b44d1fee	New file for animation in MD time direction	2026-04-02 13:55:38 -04:00

1 2 3 4 5 ...