mirror of https://github.com/paboyle/Grid.git synced 2026-06-05 03:34:36 +01:00
Commit Graph

145 Commits

Author SHA1 Message Date
Peter Boyle 42cd9eda71 Some improvements that should have been there if in synch with develop,
and also some staggered hdcg type work
2026-05-29 13:36:57 -04:00
Thomas Blum 34d8d003a8 staggered-hdcg: smoother shift tuning, CG baseline, Lanczos diagnostics
Smoother shift:
- Replace hard-coded mass^2 = 0.0025 with fine_lambda_max / divisor,
  measured at runtime via PowerMethod on the SchurStaggeredOperator.
- Current divisor = 200 (tunable); concentrates the O(8) CG polynomial
  zeros on the high-frequency end of the spectrum [shift, lambda_max],
  repairing the spectral leakage introduced at coarse-cell boundaries
  when the coarse-grid solution is promoted back to the fine grid.
- Add explanatory comment on the lego-block edge / covariant-derivative
  physics behind the high-mode smoothing requirement.
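
The shift choice above can be sketched in a few lines. This is a minimal illustration, not the code in Test_staggered_hdcg.cc: a small dense symmetric matrix stands in for the SchurStaggeredOperator, and the names `estimate_lambda_max` and `smoother_shift` are hypothetical rather than Grid API.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;
using Mat = std::vector<Vec>;  // small dense symmetric matrix (assumption: stands in for the fine operator)

// y = A * x
Vec apply(const Mat& A, const Vec& x) {
  Vec y(A.size(), 0.0);
  for (std::size_t i = 0; i < A.size(); ++i)
    for (std::size_t j = 0; j < x.size(); ++j)
      y[i] += A[i][j] * x[j];
  return y;
}

double dot(const Vec& a, const Vec& b) {
  double s = 0.0;
  for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
  return s;
}

// Power iteration: returns the Rayleigh-quotient estimate of the largest
// eigenvalue of a symmetric positive-definite matrix A.
double estimate_lambda_max(const Mat& A, int iterations) {
  assert(!A.empty() && iterations > 0);
  Vec x(A.size(), 1.0);  // arbitrary nonzero start vector
  double lambda = 0.0;
  for (int k = 0; k < iterations; ++k) {
    const double nx = std::sqrt(dot(x, x));
    for (double& v : x) v /= nx;  // normalise so that x.x == 1
    const Vec y = apply(A, x);
    lambda = dot(x, y);           // Rayleigh quotient x^T A x
    x = y;                        // next iterate
  }
  return lambda;
}

// Smoother shift: measured lambda_max divided by a tunable divisor.
double smoother_shift(double lambda_max, double divisor) {
  assert(divisor > 0.0);
  return lambda_max / divisor;
}
```

As a purely illustrative number, a measured lambda_max of 4.6 with divisor = 200 would give a shift of 0.023, compared with the previous hard-coded 0.0025.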

Chebyshev filter (IRL):
- Fix lambda_lo = 0.02 (was mass^2 * 0.5 = 0.00125).
  Tuning history logged in comments: lo=0.005 → 0/24 modes (T_70~53);
  lo=0.01 → 24/24 but 2 restarts; lo=0.02 → 24/24 in 1 restart.
- Reduce Nk/Nm from 48/96 to 24/48 (target 24 near-null modes only).
- Print Chebyshev filter parameters at run time.
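
Why lambda_lo matters can be seen from a sketch of the filter polynomial. It assumes the standard construction, in which [lo, hi] is mapped linearly onto [-1, 1] and a Chebyshev polynomial T_n is evaluated by the three-term recurrence; `chebyshev_filter` is a hypothetical scalar helper, not the Grid Chebyshev class.

```cpp
#include <cassert>
#include <cmath>

// Evaluate T_order(y(x)), where y maps [lo, hi] linearly onto [-1, 1].
// Inside [lo, hi] the result is bounded by 1 in magnitude; below lo it grows
// rapidly, which is what amplifies the low modes relative to the rest.
double chebyshev_filter(double x, double lo, double hi, int order) {
  assert(hi > lo && order >= 0);
  const double y = (2.0 * x - hi - lo) / (hi - lo);
  double t_prev = 1.0;  // T_0
  if (order == 0) return t_prev;
  double t_curr = y;    // T_1
  for (int k = 1; k < order; ++k) {
    const double t_next = 2.0 * y * t_curr - t_prev;  // T_{k+1} = 2 y T_k - T_{k-1}
    t_prev = t_curr;
    t_curr = t_next;
  }
  return t_curr;
}
```

For a fixed order and hi, raising lo pushes an eigenvalue that sits below lo further outside the mapped interval, so its amplification grows. This is consistent with the tuning history above, where larger lo values converged more modes in fewer restarts.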

CG baseline:
- Add sequential single-RHS CG loop before the HDCG solve to establish
  unpreconditioned iteration count and wall time for direct comparison.
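
A baseline of this kind amounts to running plain conjugate gradient on each right-hand side in turn and recording the iteration count. The sketch below shows the idea on a small dense symmetric positive-definite matrix; `conjugate_gradient` and `CgResult` are hypothetical names, and nothing here uses Grid's ConjugateGradient class.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;
using Mat = std::vector<Vec>;  // small dense SPD matrix (assumption: stands in for the operator)

Vec apply(const Mat& A, const Vec& x) {
  Vec y(A.size(), 0.0);
  for (std::size_t i = 0; i < A.size(); ++i)
    for (std::size_t j = 0; j < x.size(); ++j)
      y[i] += A[i][j] * x[j];
  return y;
}

double dot(const Vec& a, const Vec& b) {
  double s = 0.0;
  for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
  return s;
}

struct CgResult {
  Vec x;           // approximate solution of A x = b
  int iterations;  // number of CG iterations performed
};

// Unpreconditioned CG starting from x = 0; stops when |r| <= tol * |b|.
CgResult conjugate_gradient(const Mat& A, const Vec& b, double tol, int max_iter) {
  assert(A.size() == b.size());
  Vec x(b.size(), 0.0), r = b, p = b;
  double rr = dot(r, r);
  const double stop = tol * tol * dot(b, b);
  int k = 0;
  while (k < max_iter && rr > stop) {
    const Vec Ap = apply(A, p);
    const double alpha = rr / dot(p, Ap);
    for (std::size_t i = 0; i < x.size(); ++i) {
      x[i] += alpha * p[i];
      r[i] -= alpha * Ap[i];
    }
    const double rr_new = dot(r, r);
    const double beta = rr_new / rr;
    for (std::size_t i = 0; i < p.size(); ++i) p[i] = r[i] + beta * p[i];
    rr = rr_new;
    ++k;
  }
  return {x, k};
}
```

Calling this once per right-hand side and timing the loop gives the iteration count and wall time that a preconditioned solve can then be compared against.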

ImplicitlyRestartedBlockLanczosCoarse:
- Print Ritz values before and after implicit shift at each restart.
- Print alpha/beta block-diagonal elements at each Lanczos step.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-05-28 16:43:23 -04:00
Thomas Blum 905651deaa Test_staggered_hdcg: fix GridParallelRNG and Lanczos grid bugs
- GridParallelRNG must be constructed on full (non-checkerboarded) UGrid,
  not UrbGrid; fill() recurses infinitely when _grid is checkerboarded.
- evec and c_srcs for ImplicitlyRestartedBlockLanczosCoarse must both be
  on f_grid (Coarse4d), not CoarseMrhs; calc_irbl asserts evec[0].Grid()
  == src[0].Grid().
- Switch subspace generation from CreateSubspaceChebyshevNew to
  CreateSubspace (CG inverse iteration), which requires no spectral
  bound tuning and adapts automatically to the matrix spectrum.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-28 11:41:41 -04:00
Thomas Blum 119308c42a Test_staggered_hdcg: add missing ImplicitlyRestartedBlockLanczos.h include
IRBLdiagonalisation, SortEigen, and LanczosType are defined in
ImplicitlyRestartedBlockLanczos.h, which must be included before
ImplicitlyRestartedBlockLanczosCoarse.h.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 20:55:51 -04:00
Thomas Blum 520b90259d Add staggered HDCG multigrid test and mac-arm Homebrew build scripts
Test_staggered_hdcg.cc implements a two-level ADEF2 multigrid solver for
NaiveStaggeredFermion using SchurStaggeredOperator, following the mrhs
hermitian multigrid approach of arXiv:2409.03904. Uses a 33-point coarse
stencil (NextToNearestStencilGeometry4D) with nbasis=24, block={4,4,4,4},
and Chebyshev subspace generation with hi=5.0 (lambda_max ~4.6).

Also adds systems/mac-arm/sourceme-homebrew.sh and config-command-homebrew
for building Grid on Apple Silicon with Homebrew-installed dependencies.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 15:52:49 -04:00
Peter Boyle 4d527e81fa Remove hip specific files 2026-05-21 12:34:30 -04:00
Peter Boyle 0493656e86 debug: add Test_hipfft_repro — reproducer for hipFFT PARSE_ERROR on ROCm 7
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-21 12:34:30 -04:00
Peter Boyle 66fd504c4d tests/debug: add G=4 to hipfft fail reproducer
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-21 12:34:30 -04:00
Peter Boyle be4dd2b52f tests/debug: test hipMemset variant before cache is populated
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-21 12:34:30 -04:00
Peter Boyle 707d059766 tests/debug: extend hipfft fail reproducer with hipMemset and sync variants
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-21 12:34:30 -04:00
Peter Boyle dbbfdd4e4b tests/debug: add minimal hipfft ordering bug fail/pass pair
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-21 12:34:30 -04:00
Peter Boyle f967fb40bf tests/debug: test plan-before-malloc vs malloc-before-plan ordering
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-21 12:34:30 -04:00
Peter Boyle 74e0f846cb tests/debug: extend hipfft reproducer with Grid-realistic howmany and exec tests
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-21 12:34:30 -04:00
Peter Boyle 303a4d26e5 tests/debug: add minimal hipfft plan-creation reproducer
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-21 12:34:30 -04:00
Peter Boyle a1119266c1 Revert to hand-rolled reduction; drop Lattice_reduction_gpu_cub.h
Remove the CUB/hipCUB direction entirely. Restore Lattice_reduction_gpu.h,
Lattice_reduction_sycl.h, and Lattice_reduction.h to the state before the
CUB rewrite (commit 969b0a39), recovering the original primary function names
(sumD_gpu_small, sumD_gpu_large, sumD_gpu, sum_gpu, sum_gpu_large) and the
hand-rolled shared-memory reduction kernel.

Delete Lattice_reduction_gpu_cub.h. Update Test_reduction to remove the
old/new comparison sections that depended on sum_gpu_old.

The lesson: CUB DeviceReduce is slower than the hand-rolled kernel for small
types, and the smem sizing problem for the extraction pass has no clean
solution within the accelerator_for abstraction. The right improvement is
a higher radix (12 then 4) in sumD_gpu_large, applied directly to the
existing hand-rolled kernel.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-21 12:34:30 -04:00
Peter Boyle f3c3b1c04b Test_reduction: add timing benchmark for new vs old reduction paths
Reports us/call and GB/s for sum_gpu (CUB/sycl::reduction) and
sum_gpu_old (hand-rolled shared-memory) for each field type, with
5-call warmup and 100-call timed loop.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-21 12:34:30 -04:00
Peter Boyle dfd0503eae Test_reduction: use separate float and double grids
Float fields require a grid constructed with vComplexF::Nsimd(); using
a double grid causes grid->_gsites to undercount the sites in float
vobjF, making the constant-field expected value wrong.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-21 12:34:30 -04:00
Peter Boyle c629b2e87e Rename scalarNorm2 to squaredSum in Test_reduction.cc
The function computes |sum|^2 — the squared magnitude of an aggregate sum —
not a norm. squaredSum makes clear that squaring is applied to the sum, not
to individual site values before summing, distinguishing it from sumOfSquares
(the squared L2 norm).
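
The distinction can be made concrete with a small sketch. The two function names come from the message above, but these bodies are illustrative stand-ins operating on a plain `std::vector` of complex site values, not the code in Test_reduction.cc.

```cpp
#include <complex>
#include <vector>

using Complex = std::complex<double>;

// |sum_i z_i|^2 : add the site values first, then take the squared magnitude.
double squaredSum(const std::vector<Complex>& sites) {
  Complex total(0.0, 0.0);
  for (const Complex& z : sites) total += z;
  return std::norm(total);  // std::norm returns the squared magnitude
}

// sum_i |z_i|^2 : square each site value first, then add (squared L2 norm).
double sumOfSquares(const std::vector<Complex>& sites) {
  double total = 0.0;
  for (const Complex& z : sites) total += std::norm(z);
  return total;
}
```

For the two-site field {1, -1} the first function returns 0 and the second returns 2, so the two quantities differ whenever site values cancel in the sum.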

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-21 12:34:30 -04:00
Peter Boyle bba328fac5 Add Test_reduction to tests/debug
Tests the new CUB/hipCUB/SYCL lattice reduction (sum_gpu) against the
preserved hand-rolled implementation (sum_gpu_old) for LatticeComplexF/D,
LatticeColourMatrixF/D and LatticePropagatorF/D.

Part a) gaussian random field: checks that old and new agree to within
float/double roundoff tolerance.
Part b) constant field (= 1.0, identity-matrix init): verifies
innerProduct(sum, sum) = Ncomp * V^2 where Ncomp counts the nonzero
diagonal scalar components per site (1 / Nc / Ns*Nc respectively).
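
The expected value in part b) follows from summing V identity matrices and taking the inner product of the result with itself. The sketch below works this out with plain real matrices, assuming the inner product is the Frobenius one (sum over all entries of a_ij * b_ij) and that Nc = 3 and Ns = 4; `sum_identity_field` and `inner_product` are hypothetical helpers, not Grid functions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Sum `volume` copies of the n x n identity, i.e. reduce a constant
// identity-valued field over all sites.
Matrix sum_identity_field(std::size_t n, std::size_t volume) {
  Matrix total(n, std::vector<double>(n, 0.0));
  for (std::size_t site = 0; site < volume; ++site)
    for (std::size_t i = 0; i < n; ++i)
      total[i][i] += 1.0;
  return total;
}

// Frobenius inner product for real matrices: sum_ij a_ij * b_ij.
double inner_product(const Matrix& a, const Matrix& b) {
  assert(a.size() == b.size());
  double s = 0.0;
  for (std::size_t i = 0; i < a.size(); ++i)
    for (std::size_t j = 0; j < a[i].size(); ++j)
      s += a[i][j] * b[i][j];
  return s;
}
```

Each of the n diagonal entries of the sum equals V, so the inner product is n * V^2, with n = 1, Nc, or Ns*Nc for the three field types.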

Make.inc is auto-generated by scripts/filelist on bootstrap and is not
tracked; the new .cc file is all that is needed.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-21 12:34:30 -04:00
Peter Boyle 0d8658a039 Optimised 2026-03-05 06:06:32 -05:00
Peter Boyle 76fbcffb60 Improvement to 16^3 hdcg 2026-03-05 06:06:32 -05:00
Peter Boyle 6ff29f9d4f Alternate multigrids 2026-02-13 17:25:45 -05:00
Peter Boyle 7cd3f21e6b preserving a bunch of experiments on setup and g5 subspace doubling 2026-01-06 05:57:39 -05:00
paboyle 9e6a4a4737 Assertion updates to macros (mostly) with backtrace.
Wilson flow to include options for DBW2, Iwasaki, Symanzik.
View logging for data assurance
2025-08-07 15:48:38 +00:00
Peter Boyle 677b4cc5b0 Make all tests compile 2025-04-24 20:33:26 -04:00
Peter Boyle 6fec3c15ca Cleaner printing 2025-04-04 18:35:06 -04:00
Peter Boyle c74d11e3d7 PVdagM MG 2025-02-01 11:04:13 -05:00
Peter Boyle 3f3661a86f Heading towards PVdagM multigrid 2025-01-17 14:33:35 +00:00
Peter Boyle 2a9cfeb9ea New files 2024-09-26 14:23:29 -04:00
Peter Boyle 575eb72182 Converges on 16^3 2024-08-27 19:20:38 +00:00
Peter Boyle 29f6b8a74a Setup 2024-08-27 12:02:49 -04:00
Peter Boyle 9779aaea33 16^3 optimise 2024-08-27 11:38:35 -04:00
Peter Boyle ec25604a67 Fastest solver for mrhs multigrid 2024-08-27 11:32:34 -04:00
Peter Boyle b461184797 Merge branch 'develop' of https://github.com/paboyle/Grid into develop 2024-07-23 09:53:58 -04:00
Peter Boyle 486412635a 8^4 test for PETSc 2024-07-22 15:25:17 -04:00
Peter Boyle 8b23a1546a Force compile temporarily 2024-07-22 15:24:56 -04:00
Peter Boyle a901e4e369 Regressed performance for paper 2024-07-22 15:24:04 -04:00
Peter Boyle 804d9367d4 Regressed performance 2024-07-22 15:23:25 -04:00
Peter Boyle 7c246606c1 Schur additional case 2024-07-10 22:04:32 +00:00
Peter Boyle 12b8be7cb9 Best so far on 96^3: 350 Evecs converged on 4^4 block 2024-06-18 16:31:37 -04:00
Peter Boyle dc80b08969 96^3 test 2024-06-10 15:07:29 -04:00
Peter Boyle 0e607a55e7 Updated for 8^4 test 2024-05-26 20:53:05 +00:00
Peter Boyle ad14a82742 Working as well as possible on 48^3 in double 2024-05-16 10:55:45 -04:00
Peter Boyle 98cf247f33 prepare to switch to mixed precision 2024-04-30 05:23:45 -04:00
Peter Boyle 0cf16522d1 Refine with HDCG choice 2024-04-30 05:22:14 -04:00
Peter Boyle 5147a42818 Updated hdcg 2024-04-05 01:05:57 -04:00
Peter Boyle 5b79d51c22 Improvements 2024-04-01 14:18:40 -04:00
Peter Boyle cc04dc42dc Merge branch 'develop' into feature/scidac-wp1 2024-03-06 14:55:21 -05:00
Peter Boyle 070b61f08f Simplifying the MultiRHS solver to make it do SRHS *and* MRHS 2024-03-06 14:04:33 -05:00
Peter Boyle cd15abe9d1 Mrhs prep 2024-02-27 11:41:13 -05:00