portelli/Grid - Grid - DiRAC Tursa git server

mirror of https://github.com/paboyle/Grid.git synced 2026-07-17 15:43:27 +01:00

Author	SHA1	Message	Date
Peter Boyle	42cd9eda71	Some improvements that should have been there if in synch with develop, and also some staggered hdcg type work	2026-05-29 13:36:57 -04:00
Thomas Blum and Claude Sonnet 4.5	34d8d003a8	staggered-hdcg: smoother shift tuning, CG baseline, Lanczos diagnostics Smoother shift: - Replace hard-coded mass^2 = 0.0025 with fine_lambda_max / divisor, measured at runtime via PowerMethod on the SchurStaggeredOperator. - Current divisor = 200 (tunable); concentrates the O(8) CG polynomial zeros on the high-frequency end of the spectrum [shift, lambda_max], repairing the spectral leakage introduced at coarse-cell boundaries when the coarse-grid solution is promoted back to the fine grid. - Add explanatory comment on the lego-block edge / covariant-derivative physics behind the high-mode smoothing requirement. Chebyshev filter (IRL): - Fix lambda_lo = 0.02 (was mass^2 * 0.5 = 0.00125). Tuning history logged in comments: lo=0.005 → 0/24 modes (T_70~53); lo=0.01 → 24/24 but 2 restarts; lo=0.02 → 24/24 in 1 restart. - Reduce Nk/Nm from 48/96 to 24/48 (target 24 near-null modes only). - Print Chebyshev filter parameters at run time. CG baseline: - Add sequential single-RHS CG loop before the HDCG solve to establish unpreconditioned iteration count and wall time for direct comparison. ImplicitlyRestartedBlockLanczosCoarse: - Print Ritz values before and after implicit shift at each restart. - Print alpha/beta block-diagonal elements at each Lanczos step. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>	2026-05-28 16:43:23 -04:00
Thomas Blum and Claude Sonnet 4.6	905651deaa	Test_staggered_hdcg: fix GridParallelRNG and Lanczos grid bugs - GridParallelRNG must be constructed on full (non-checkerboarded) UGrid, not UrbGrid; fill() recurses infinitely when _grid is checkerboarded. - evec and c_srcs for ImplicitlyRestartedBlockLanczosCoarse must both be on f_grid (Coarse4d), not CoarseMrhs; calc_irbl asserts evec[0].Grid() == src[0].Grid(). - Switch subspace generation from CreateSubspaceChebyshevNew to CreateSubspace (CG inverse iteration), which requires no spectral bound tuning and adapts automatically to the matrix spectrum. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-28 11:41:41 -04:00
Thomas Blum and Claude Sonnet 4.6	119308c42a	Test_staggered_hdcg: add missing ImplicitlyRestartedBlockLanczos.h include IRBLdiagonalisation, SortEigen, and LanczosType are defined in ImplicitlyRestartedBlockLanczos.h, which must be included before ImplicitlyRestartedBlockLanczosCoarse.h. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-27 20:55:51 -04:00
Thomas Blum and Claude Sonnet 4.6	89a32799e3	mac-arm: align --enable-Sp=no with upstream config-command-mpi style Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-27 16:21:02 -04:00
Thomas Blum	ce8b52749d	Merge remote-tracking branch 'origin/develop' into feature/staggered-hdcg	2026-05-27 16:20:47 -04:00
Peter Boyle	86c7f29183	Config command update	2026-05-27 16:19:33 -04:00
Thomas Blum and Claude Sonnet 4.6	bbdc8e95f4	mac-arm: disable Sp, fermion-reps, gparity for faster dev builds Reduces compile time significantly by skipping representations not needed for the staggered HDCG work. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-27 16:19:28 -04:00
Thomas Blum	1284acf37a	Merge remote-tracking branch 'origin/develop' into feature/staggered-hdcg	2026-05-27 16:19:19 -04:00
Peter Boyle	b0c99f876e	Configure on mac update	2026-05-27 16:16:55 -04:00
Peter Boyle	bf5fcdc860	Ease of use for std::complex interchangeable with thrust	2026-05-27 16:05:37 -04:00
Thomas Blum and Claude Sonnet 4.6	520b90259d	Add staggered HDCG multigrid test and mac-arm Homebrew build scripts Test_staggered_hdcg.cc implements a two-level ADEF2 multigrid solver for NaiveStaggeredFermion using SchurStaggeredOperator, following the mrhs hermitian multigrid approach of arXiv:2409.03904. Uses a 33-point coarse stencil (NextToNearestStencilGeometry4D) with nbasis=24, block={4,4,4,4}, and Chebyshev subspace generation with hi=5.0 (lambda_max ~4.6). Also adds systems/mac-arm/sourceme-homebrew.sh and config-command-homebrew for building Grid on Apple Silicon with Homebrew-installed dependencies. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-27 15:52:49 -04:00
Peter Boyle	b58a1508fa	Perlmutter cuda version update	2026-05-21 13:25:13 -07:00
Peter Boyle	4d527e81fa	Remove hip specific files	2026-05-21 12:34:30 -04:00
Peter Boyle and Claude Sonnet 4.6	7803580aa6	Lattice_reduction_gpu: demote timing logs to Debug, disable by default skills/mpi-heterogeneous: add Bug Class 4 for Frontier GTL/libamdhip64 ABI mismatch Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 12:34:30 -04:00
Peter Boyle and Claude Sonnet 4.6	32654db366	Test_planned_fft: fix PlannedFFT template parameter to use ::vector_object Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 12:34:30 -04:00
Peter Boyle and Claude Sonnet 4.6	cd340cfab3	tests: add Test_planned_fft exercising PlannedFFT<vobj> Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 12:34:30 -04:00
Peter Boyle and Claude Sonnet 4.6	f32866b2ff	tests/fft: remove PlanDestroy calls (FFT handles plans per-call) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 12:34:30 -04:00
Peter Boyle and Claude Sonnet 4.6	1cd1dc091e	FFT: add FFTbase, PlannedFFT; factor FFT_dim_execute free function Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 12:34:30 -04:00
Peter Boyle and Claude Sonnet 4.6	0493656e86	debug: add Test_hipfft_repro — reproducer for hipFFT PARSE_ERROR on ROCm 7 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 12:34:30 -04:00
Peter Boyle and Claude Sonnet 4.6	66fd504c4d	tests/debug: add G=4 to hipfft fail reproducer Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 12:34:30 -04:00
Peter Boyle and Claude Sonnet 4.6	be4dd2b52f	tests/debug: test hipMemset variant before cache is populated Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 12:34:30 -04:00
Peter Boyle and Claude Sonnet 4.6	707d059766	tests/debug: extend hipfft fail reproducer with hipMemset and sync variants Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 12:34:30 -04:00
Peter Boyle and Claude Sonnet 4.6	f08c755ae6	FFT: use host stack buffer in PlanCreate, not deviceVector Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 12:34:30 -04:00
Peter Boyle and Claude Sonnet 4.6	dbbfdd4e4b	tests/debug: add minimal hipfft ordering bug fail/pass pair Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 12:34:30 -04:00
Peter Boyle and Claude Sonnet 4.6	f967fb40bf	tests/debug: test plan-before-malloc vs malloc-before-plan ordering Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 12:34:30 -04:00
Peter Boyle and Claude Sonnet 4.6	74e0f846cb	tests/debug: extend hipfft reproducer with Grid-realistic howmany and exec tests Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 12:34:30 -04:00
Peter Boyle and Claude Sonnet 4.6	303a4d26e5	tests/debug: add minimal hipfft plan-creation reproducer Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 12:34:30 -04:00
Peter Boyle	119888653c	FFT HIP: use hipfftCreate+hipfftMakePlanMany instead of hipfftPlanMany	2026-05-21 12:34:30 -04:00
Peter Boyle	a9f42c08f9	FFT: pass nullptr for inembed/onembed in hipfftPlanMany to avoid HIPFFT_PARSE_ERROR	2026-05-21 12:34:30 -04:00
Peter Boyle and Claude Sonnet 4.6	e79adc9d31	FFT: cache plans per vobj type across calls Plans are created lazily on the first FFT_dim call and reused for all subsequent calls on the same FFT object. PlanCreate<vobj>() can be called explicitly to pre-warm the cache. PlanDestroy() must be called before switching to a different vobj type; the destructor cleans up any live plans automatically. Update Test_fft.cc and Test_fftf.cc to call PlanDestroy() between the LatticeComplex and LatticeSpinMatrix sections that reuse the same FFT object. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 12:34:30 -04:00
Peter Boyle and Claude Sonnet 4.6	5a9056cd93	Accelerator: lower default accelerator_threads from 16 to 8 Benchmark_dwf_fp32 on MI250X GCD: 1.7 TF/s at nt=8, ~300 GF/s at nt=16. With Nsimd=8 (fp32, GEN_SIMD_WIDTH=64B), nt=8 gives exactly 64 threads = one full AMD wavefront. Higher values double register demand per block and hit a register-pressure cliff for stencil kernels. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 12:34:30 -04:00
Peter Boyle	012c36ab5a	Accelerator: raise default accelerator_threads from 2 to 16	2026-05-21 12:34:30 -04:00
Peter Boyle and Claude Sonnet 4.6	5c4574f9aa	skills: add gpu-memory-performance.md Documents the acceleratorThreads() default=2 trap, LambdaApply thread mapping, coalescedRead/Write idiom, when to use __global__ vs accelerator_for, and fused vs staged HBM access patterns. Includes observed MI250X numbers from LatticePropagatorD reduction (50 → 297 → 546 GB/s progression). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 12:34:30 -04:00
Peter Boyle and Claude Sonnet 4.6	a424775884	sumD_gpu_reduce_words: fuse pack+reduce into single packReduceKernel Replace the two-kernel pack+reduce sequence with a single fused kernel packReduceKernel<R> that reads R words of each vobj at offset 'base' and accumulates directly into iVector<iScalar<scalarD>,R>, eliminating the intermediate bundle buffer entirely. HBM access per word-group drops from 3x (pack-read + pack-write + reduce-read) to 1x. Thread count comes from getNumBlocksAndThreads (warpSize..256) rather than acceleratorThreads(), so occupancy is correct regardless of the --accelerator-threads setting. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 12:34:30 -04:00
Peter Boyle	d6b1388741	Modified repack	2026-05-21 12:34:30 -04:00
Peter Boyle and Claude Sonnet 4.6	796c6cae4e	Enable GRID_REDUCTION_TIMING unconditionally Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 12:34:30 -04:00
Peter Boyle and Claude Sonnet 4.6	1a8064d6d9	Lattice_reduction_gpu: add GRID_REDUCTION_TIMING instrumentation Uncomment #define GRID_REDUCTION_TIMING to enable per-phase timing output: sumD_gpu_reduce_words: pack time (accelerator_for) per R and base sumD_gpu_small: reduceKernel+barrier time and D2H time separately sumD_gpu_large: total wall time across all word groups This lets us identify whether the large-type bottleneck is in the pack kernel, the shared-memory reduction kernel, the barrier, or the D2H. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 12:34:30 -04:00
Peter Boyle and Claude Sonnet 4.6	43648924c3	sumD_gpu_large: radix-12 word-bundle reduction replacing radix-1 Replace the word-by-word loop (one kernel launch per scalar word) with sumD_gpu_reduce_words<R> which packs R consecutive vector_type words per site into iVector<iScalar<vector>,R>, then calls the existing sumD_gpu_small shared-memory kernel once for the whole bundle. Dispatch: radix-12 first, radix-4 for the remainder < 12, radix-1 for any final < 4 words. For LatticePropagator (144 words = 12x12), this reduces the kernel-launch count from 144 to 12 -- a 12x reduction. Bundle::Nsimd() inherits from vector_type so sumD_gpu_small handles SIMD lane extraction and double-precision promotion identically to the scalar word case. sizeof(Bundle::scalar_objectD) = R*16 <= 192 B; well within sharedMemPerBlock on all supported devices. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 12:34:30 -04:00
Peter Boyle and Claude Sonnet 4.6	bf2140e74d	Lattice_reduction_sycl: fix double-precision accumulation in sumD_gpu_tensor Accumulate in sobjD throughout rather than accumulating in sobj and converting the final sum. For float fields this matters: summing N floats then casting loses O(N*eps_float) relative precision vs accumulating in double from the start. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 12:34:30 -04:00
Peter Boyle and Claude Sonnet 4.6	a1119266c1	Revert to hand-rolled reduction; drop Lattice_reduction_gpu_cub.h Remove the CUB/hipCUB direction entirely. Restore Lattice_reduction_gpu.h, Lattice_reduction_sycl.h, and Lattice_reduction.h to the state before the CUB rewrite (commit `969b0a39`), recovering the original primary function names (sumD_gpu_small, sumD_gpu_large, sumD_gpu, sum_gpu, sum_gpu_large) and the hand-rolled shared-memory reduction kernel. Delete Lattice_reduction_gpu_cub.h. Update Test_reduction to remove the old/new comparison sections that depended on sum_gpu_old. The lesson: CUB DeviceReduce is slower than the hand-rolled kernel for small types, and the smem sizing problem for the extraction pass has no clean solution within the accelerator_for abstraction. The right improvement is a higher radix (12 then 4) in sumD_gpu_large, applied directly to the existing hand-rolled kernel. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 12:34:30 -04:00
Peter Boyle and Claude Sonnet 4.6	a0f00c0eca	sumD_gpu_direct: revert to per-lane write; CUB handles Nsimd*osites inputs Benchmarking showed the shared-memory lane-summation approach (`843d6497`) was slower than writing each SIMD lane individually and letting CUB reduce the full nlanes = osites*Nsimd array. CUB's device reduce is more efficient over the larger input than the smem overhead + serialised lane-0 summation. The smem approach also required overriding acceleratorThreads() to avoid the block-size sizing problem. Restore the simpler per-lane path. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 12:34:30 -04:00
Peter Boyle and Claude Sonnet 4.6	d358954a84	sumD_gpu_direct: shared-memory lane reduction with acceleratorThreads(1) Set acceleratorThreads to 1 before the extraction kernel so that dim3(nsimd,1,1) blocks give exactly one site group per block and __shared__ sobjD smem[nsimd] is correctly sized without depending on the runtime acceleratorThreads() value. threadIdx.x (acceleratorSIMTlane) indexes the SIMD lane for coalesced reads; lane 0 sums smem[0..nsimd-1] and writes one sobjD per site. CUB then reduces osites elements instead of osites*nsimd, reducing both store traffic and CUB work by Nsimd. acceleratorSynchronise() (warp-level) suffices since nsimd < warpSize. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 12:34:30 -04:00
Peter Boyle and Claude Sonnet 4.6	aee00bdfb5	sumD_gpu_direct: one thread per SIMD lane using extractLane Replaces one thread per outer site calling Reduce() (sequential Nsimd-wide loop) with one thread per lane calling extractLane() — O(1) per thread. CUB now reduces over osites*Nsimd elements. Avoids serial lane reduction but leaves the per-lane sobjD store stride as a known remaining concern. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 12:34:30 -04:00
Peter Boyle and Claude Sonnet 4.6	cf324b0fa1	Lattice_reduction_gpu_cub: define GRID_REDUCTION_TIMING in header Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 12:34:30 -04:00
Peter Boyle and Claude Sonnet 4.6	b314dc224d	Lattice_reduction_gpu_cub: add GRID_REDUCTION_TIMING instrumentation Guards accelerator_for and CUB DeviceReduce calls in sumD_gpu_direct and sumD_gpu_large with #ifdef GRID_REDUCTION_TIMING to isolate where time is spent in each path. Large path accumulates across all groups and prints totals with words/nfull/rem context. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 12:34:30 -04:00
Peter Boyle and Claude Sonnet 4.6	1bbd62498e	Lattice_reduction_gpu_cub: replace WordBundle4 with iVector<iScalar<scalarD>,4> WordBundle4 was redundant with Grid's existing tensor infrastructure. iVector<iScalar<scalarD>,4> already provides accelerator_inline operator+, zeroit(), and sycl::is_device_copyable — no new type needed. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 12:34:30 -04:00
Peter Boyle and Claude Sonnet 4.6	f3c3b1c04b	Test_reduction: add timing benchmark for new vs old reduction paths Reports us/call and GB/s for sum_gpu (CUB/sycl::reduction) and sum_gpu_old (hand-rolled shared-memory) for each field type, with 5-call warmup and 100-call timed loop. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 12:34:30 -04:00
Peter Boyle and Claude Sonnet 4.6	069f98b253	skills: HPC battle-hardening skill files for GPU+MPI correctness Six skill files encoding expertise for making codebases robust on problematic HPC systems, covering: correctness verification (double-run, fingerprinting, flight recorder), hang diagnosis, GPU runtime correctness (premature barrier, infinite poll), MPI correctness on heterogeneous systems (device buffer aliasing, AARCH64 PLT corruption, deterministic reductions), compiler validation, and communication/computation overlap pipeline design. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 12:34:30 -04:00
Peter Boyle and Claude Sonnet 4.6	dfd0503eae	Test_reduction: use separate float and double grids Float fields require a grid constructed with vComplexF::Nsimd(); using a double grid causes grid->_gsites to undercount the sites in float vobjF, making the constant-field expected value wrong. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 12:34:30 -04:00
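
Commit bf2140e74d describes accumulating in double throughout instead of summing floats and converting the final sum. The sketch below illustrates that effect in plain C++, independent of Grid. The function names `sum_then_promote` and `promote_then_sum` are hypothetical and do not appear in the repository; this is not the code of `sumD_gpu_tensor`.

```cpp
#include <vector>

// Hypothetical helper: accumulate in single precision and widen only the
// final result. Rounding error builds up in the float accumulator.
double sum_then_promote(const std::vector<float>& v) {
    float acc = 0.0f;
    for (float x : v) {
        acc += x;
    }
    return static_cast<double>(acc);
}

// Hypothetical helper: widen each element before adding, so the running
// sum carries double precision from the first addition onwards.
double promote_then_sum(const std::vector<float>& v) {
    double acc = 0.0;
    for (float x : v) {
        acc += static_cast<double>(x);
    }
    return acc;
}
```

Calling both helpers on a long constant vector, for example ten million copies of `0.1f`, and comparing each result against `1.0e7 * static_cast<double>(0.1f)` should show a much larger error on the float-accumulated path.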

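
Commit 43648924c3 describes a dispatch of radix-12 first, radix-4 for the remainder below 12, and radix-1 for any final words below 4. The arithmetic can be sketched as follows. `BundlePlan` and `plan_word_bundles` are hypothetical names used only for illustration; the sketch counts bundle reductions and makes no claim about how `sumD_gpu_large` is actually written.

```cpp
#include <cstddef>

// Hypothetical record of how many bundles of each radix are needed.
struct BundlePlan {
    std::size_t radix12;
    std::size_t radix4;
    std::size_t radix1;
    std::size_t total() const { return radix12 + radix4 + radix1; }
};

// Split `words` scalar words into radix-12 bundles first, then radix-4
// bundles for the remainder below 12, then single words for what is left.
BundlePlan plan_word_bundles(std::size_t words) {
    BundlePlan plan{};
    plan.radix12 = words / 12;
    words %= 12;
    plan.radix4 = words / 4;
    words %= 4;
    plan.radix1 = words;
    return plan;
}
```

For the 144-word case quoted in the commit message, `plan_word_bundles(144).total()` is 12, matching the stated drop from 144 to 12. An 18-word type would need one radix-12 bundle, one radix-4 bundle and two single words.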