Smoother shift:
- Replace hard-coded mass^2 = 0.0025 with fine_lambda_max / divisor,
  where fine_lambda_max is measured at runtime via PowerMethod on the
  SchurStaggeredOperator.
- Current divisor = 200 (tunable); this concentrates the O(8) CG polynomial
  zeros on the high-frequency end of the spectrum [shift, lambda_max],
  repairing the spectral leakage introduced at coarse-cell boundaries
  when the coarse-grid solution is promoted back to the fine grid.
- Add explanatory comment on the lego-block edge / covariant-derivative
  physics behind the high-mode smoothing requirement.
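
The shift rule above can be sketched on a toy problem. This is a minimal
illustration, assuming a small dense symmetric matrix in place of the
SchurStaggeredOperator; matvec, estimate_lambda_max and smoother_shift are
hypothetical names for this sketch and are not Grid's PowerMethod API.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;
using Mat = std::vector<Vec>;  // toy dense symmetric stand-in for the fine operator

// y = A x
Vec matvec(const Mat& A, const Vec& x) {
    Vec y(A.size(), 0.0);
    for (std::size_t i = 0; i < A.size(); ++i)
        for (std::size_t j = 0; j < x.size(); ++j)
            y[i] += A[i][j] * x[j];
    return y;
}

double dot(const Vec& a, const Vec& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

// Power iteration: normalise, apply A, and take the Rayleigh quotient x^T A x.
// For a symmetric positive-definite A this tends to the largest eigenvalue.
double estimate_lambda_max(const Mat& A, int iters) {
    assert(!A.empty() && iters > 0);
    Vec x(A.size(), 1.0);
    double lambda = 0.0;
    for (int k = 0; k < iters; ++k) {
        const double n = std::sqrt(dot(x, x));
        assert(n > 0.0);
        for (double& v : x) v /= n;
        const Vec y = matvec(A, x);
        lambda = dot(x, y);
        x = y;
    }
    return lambda;
}

// Smoother shift as described above: lambda_max / divisor.
double smoother_shift(double lambda_max, double divisor) {
    assert(divisor > 0.0);
    return lambda_max / divisor;
}
```

With divisor = 200 the shift tracks the measured spectrum instead of a fixed
mass^2, so it does not need retuning when the operator changes.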
Chebyshev filter (IRL):
- Set lambda_lo = 0.02 (was mass^2 * 0.5 = 0.00125).
  Tuning history logged in comments: lo=0.005 → 0/24 modes (T_70~53);
  lo=0.01 → 24/24 but 2 restarts; lo=0.02 → 24/24 in 1 restart.
- Reduce Nk/Nm from 48/96 to 24/48 (target 24 near-null modes only).
- Print Chebyshev filter parameters at run time.
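
Why lambda_lo matters can be seen from a scalar sketch of the standard
Chebyshev filter construction: the interval [lo, hi] is mapped to [-1, 1],
where |T_n| <= 1, and eigenvalues below lo are amplified. This is not Grid's
Chebyshev class; chebyshev_filter is a hypothetical helper, and the hi value
and polynomial order used with it are assumptions of the caller.

```cpp
#include <cassert>
#include <cmath>

// Evaluate T_order(y) with y = (2x - hi - lo) / (hi - lo), using the
// three-term recurrence T_{n+1}(y) = 2 y T_n(y) - T_{n-1}(y).
double chebyshev_filter(double x, double lo, double hi, int order) {
    assert(hi > lo && order >= 0);
    const double y = (2.0 * x - hi - lo) / (hi - lo);
    double t_prev = 1.0;  // T_0
    if (order == 0) return t_prev;
    double t = y;         // T_1
    for (int n = 2; n <= order; ++n) {
        const double t_next = 2.0 * y * t - t_prev;
        t_prev = t;
        t = t_next;
    }
    return t;
}
```

Raising lo widens the gap between the wanted near-null modes and the damped
interval, which increases their relative amplification per filter application.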
CG baseline:
- Add sequential single-RHS CG loop before the HDCG solve to establish
  unpreconditioned iteration count and wall time for direct comparison.
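
A baseline of this kind can be sketched as follows: plain CG is run on each
right-hand side in turn, and the iteration counts and wall time are
accumulated. This is a toy dense-matrix illustration, not the test's actual
loop; cg_solve and cg_baseline are hypothetical names.

```cpp
#include <cassert>
#include <chrono>
#include <cmath>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;
using Mat = std::vector<Vec>;  // toy dense SPD matrix

Vec matvec(const Mat& A, const Vec& x) {
    Vec y(A.size(), 0.0);
    for (std::size_t i = 0; i < A.size(); ++i)
        for (std::size_t j = 0; j < x.size(); ++j)
            y[i] += A[i][j] * x[j];
    return y;
}

double dot(const Vec& a, const Vec& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

struct CgStats { Vec x; int iterations; };

// Unpreconditioned CG, stopping when |r| <= tol * |b|.
CgStats cg_solve(const Mat& A, const Vec& b, double tol, int max_iter) {
    assert(A.size() == b.size() && tol > 0.0);
    Vec x(b.size(), 0.0), r = b, p = b;
    double rr = dot(r, r);
    const double stop = tol * tol * dot(b, b);
    int k = 0;
    while (k < max_iter && rr > stop) {
        const Vec Ap = matvec(A, p);
        const double alpha = rr / dot(p, Ap);
        for (std::size_t i = 0; i < x.size(); ++i) {
            x[i] += alpha * p[i];
            r[i] -= alpha * Ap[i];
        }
        const double rr_new = dot(r, r);
        const double beta = rr_new / rr;
        for (std::size_t i = 0; i < p.size(); ++i) p[i] = r[i] + beta * p[i];
        rr = rr_new;
        ++k;
    }
    return {x, k};
}

struct Baseline { int total_iterations; double seconds; };

// Solve each right-hand side one after another and record the totals.
Baseline cg_baseline(const Mat& A, const std::vector<Vec>& rhs,
                     double tol, int max_iter) {
    const auto t0 = std::chrono::steady_clock::now();
    int total = 0;
    for (const Vec& b : rhs) total += cg_solve(A, b, tol, max_iter).iterations;
    const std::chrono::duration<double> dt = std::chrono::steady_clock::now() - t0;
    return {total, dt.count()};
}
```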
ImplicitlyRestartedBlockLanczosCoarse:
- Print Ritz values before and after implicit shift at each restart.
- Print alpha/beta block-diagonal elements at each Lanczos step.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- GridParallelRNG must be constructed on full (non-checkerboarded) UGrid,
  not UrbGrid; fill() recurses infinitely when _grid is checkerboarded.
- evec and c_srcs for ImplicitlyRestartedBlockLanczosCoarse must both be
  on f_grid (Coarse4d), not CoarseMrhs; calc_irbl asserts evec[0].Grid()
  == src[0].Grid().
- Switch subspace generation from CreateSubspaceChebyshevNew to
  CreateSubspace (CG inverse iteration), which requires no spectral
  bound tuning and adapts automatically to the matrix spectrum.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
IRBLdiagonalisation, SortEigen, and LanczosType are defined in
ImplicitlyRestartedBlockLanczos.h, which must be included before
ImplicitlyRestartedBlockLanczosCoarse.h.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Test_staggered_hdcg.cc implements a two-level ADEF2 multigrid solver for
NaiveStaggeredFermion using SchurStaggeredOperator, following the multi-RHS
(mrhs) Hermitian multigrid approach of arXiv:2409.03904. It uses a 33-point
coarse stencil (NextToNearestStencilGeometry4D) with nbasis=24,
block={4,4,4,4}, and Chebyshev subspace generation with hi=5.0
(lambda_max ~4.6).
Also adds systems/mac-arm/sourceme-homebrew.sh and config-command-homebrew
for building Grid on Apple Silicon with Homebrew-installed dependencies.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Plans are created lazily on the first FFT_dim call and reused for all
subsequent calls on the same FFT object. PlanCreate<vobj>() can be
called explicitly to pre-warm the cache. PlanDestroy() must be called
before switching to a different vobj type; the destructor cleans up any
live plans automatically.
Update Test_fft.cc and Test_fftf.cc to call PlanDestroy() between the
LatticeComplex and LatticeSpinMatrix sections that reuse the same FFT object.
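
The plan lifetime described above can be modelled with a small toy class.
This is only a sketch of the caching pattern: ToyFft and its lowercase
methods are hypothetical stand-ins for the real FFT object, and the element
size in bytes stands in for the vobj type.

```cpp
#include <cassert>
#include <cstddef>

class ToyFft {
public:
    // The destructor releases any live plan automatically.
    ~ToyFft() { plan_destroy(); }

    // Build the plan if none is live; may be called explicitly to pre-warm.
    void plan_create(std::size_t elem_bytes) {
        if (live_) {
            // Switching element type needs an explicit plan_destroy() first.
            assert(elem_bytes == elem_bytes_);
            return;
        }
        elem_bytes_ = elem_bytes;
        live_ = true;
        ++plans_built_;
    }

    void plan_destroy() { live_ = false; }

    // A transform creates the plan lazily on first use and reuses it after.
    void fft_dim(std::size_t elem_bytes) {
        plan_create(elem_bytes);
        ++transforms_;
    }

    int plans_built() const { return plans_built_; }
    int transforms() const { return transforms_; }
    bool has_live_plan() const { return live_; }

private:
    bool live_ = false;
    std::size_t elem_bytes_ = 0;
    int plans_built_ = 0;
    int transforms_ = 0;
};
```

In this model, two transforms on the same type build one plan; calling
plan_destroy() and then transforming a different type builds a second one.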
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Remove the CUB/hipCUB direction entirely. Restore Lattice_reduction_gpu.h,
Lattice_reduction_sycl.h, and Lattice_reduction.h to the state before the
CUB rewrite (commit 969b0a39), recovering the original primary function names
(sumD_gpu_small, sumD_gpu_large, sumD_gpu, sum_gpu, sum_gpu_large) and the
hand-rolled shared-memory reduction kernel.
Delete Lattice_reduction_gpu_cub.h. Update Test_reduction to remove the
old/new comparison sections that depended on sum_gpu_old.
The lesson: CUB DeviceReduce is slower than the hand-rolled kernel for small
types, and the shared-memory sizing problem for the extraction pass has no
clean solution within the accelerator_for abstraction. The right improvement
is a higher radix (12 then 4) in sumD_gpu_large, applied directly to the
existing hand-rolled kernel.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Reports us/call and GB/s for sum_gpu (CUB/sycl::reduction) and
sum_gpu_old (hand-rolled shared-memory) for each field type, with
5-call warmup and 100-call timed loop.
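
The timing scheme can be sketched as below: a few untimed warmup calls, then
a timed loop whose mean gives us/call, with GB/s derived from the bytes read
per call. bench and gb_per_s are hypothetical helpers, 1 GB is taken as 1e9
bytes, and a real GPU benchmark would also need a device synchronisation
before each clock read, which this host-only sketch omits.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

struct BenchResult {
    double us_per_call;  // mean wall time per timed call, in microseconds
    double gb_per_s;     // bytes_per_call / mean time, in units of 1e9 bytes/s
};

// Bandwidth helper; 1 GB = 1e9 bytes is an assumption of this sketch.
double gb_per_s(double bytes, double seconds) {
    assert(seconds > 0.0);
    return bytes / seconds / 1.0e9;
}

// Run `warmup` untimed calls, then time `ncall` calls of f().
template <class F>
BenchResult bench(F&& f, std::size_t bytes_per_call,
                  int warmup = 5, int ncall = 100) {
    assert(warmup >= 0 && ncall > 0);
    for (int i = 0; i < warmup; ++i) f();
    const auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < ncall; ++i) f();
    const std::chrono::duration<double> dt = std::chrono::steady_clock::now() - t0;
    const double per_call = dt.count() / ncall;
    // Guard against a clock too coarse to resolve a very cheap call.
    const double bw =
        per_call > 0.0 ? gb_per_s(static_cast<double>(bytes_per_call), per_call) : 0.0;
    return {per_call * 1.0e6, bw};
}
```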
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Float fields require a grid constructed with vComplexF::Nsimd(); using
a double grid causes grid->_gsites to undercount the sites in float
vobjF, making the constant-field expected value wrong.
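
The site-count mismatch can be illustrated with a toy model. The lane counts
below assume a 512-bit SIMD vector and are purely illustrative; ToyGrid and
stored_sites are hypothetical and only mimic the relation between outer
sites, SIMD lanes and the global site count.

```cpp
#include <cassert>
#include <cstddef>

// Assumed lane counts for a 512-bit vector (hypothetical build configuration).
constexpr std::size_t kNsimdComplexD = 4;  // 512 / (2 * 64 bits)
constexpr std::size_t kNsimdComplexF = 8;  // 512 / (2 * 32 bits)

struct ToyGrid {
    std::size_t osites;  // SIMD-packed outer sites
    std::size_t nsimd;   // lanes the grid was constructed for
    std::size_t gsites() const { return osites * nsimd; }
};

// Scalar sites actually held by a field whose vector type has field_nsimd lanes.
std::size_t stored_sites(const ToyGrid& g, std::size_t field_nsimd) {
    return g.osites * field_nsimd;
}
```

In this model a float field on a grid built with the double lane count holds
twice as many sites as gsites() reports, so an expected value computed from
gsites() is too small; building the grid with the float lane count removes
the mismatch.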
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The function computes |sum|^2, the squared magnitude of an aggregate sum,
not a norm. The name squaredSum makes clear that squaring is applied to the
sum, not to individual site values before summing, distinguishing it from
sumOfSquares (the squared L2 norm).
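
The distinction is easy to see on a plain vector of complex numbers. The
signatures below are assumptions for illustration only and do not reproduce
the real functions, which act on lattice fields.

```cpp
#include <cassert>
#include <complex>
#include <vector>

using Field = std::vector<std::complex<double>>;

// |sum_i z_i|^2: sum first, then take the squared magnitude.
double squaredSum(const Field& f) {
    std::complex<double> s(0.0, 0.0);
    for (const auto& z : f) s += z;
    return std::norm(s);  // std::norm returns the squared magnitude
}

// sum_i |z_i|^2: the squared L2 norm.
double sumOfSquares(const Field& f) {
    double s = 0.0;
    for (const auto& z : f) s += std::norm(z);
    return s;
}
```

For a field containing +1 and -1 the sum cancels, so squaredSum vanishes
while sumOfSquares does not.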
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Tests the new CUB/hipCUB/SYCL lattice reduction (sum_gpu) against the
preserved hand-rolled implementation (sum_gpu_old) for LatticeComplexF/D,
LatticeColourMatrixF/D and LatticePropagatorF/D.
Part a) Gaussian random field: checks that old and new agree to within
float/double roundoff tolerance.
Part b) constant field (= 1.0, identity-matrix init): verifies
innerProduct(sum, sum) = Ncomp * V^2 where Ncomp counts the nonzero
diagonal scalar components per site (1 / Nc / Ns*Nc respectively).
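
The Part b) identity can be checked by hand on a toy colour-matrix field.
This sketch assumes Nc = 3 and uses plain std::array storage; ColourMatrix,
sum_sites and inner_product here are hypothetical stand-ins, not the Grid
types or functions of the same flavour.

```cpp
#include <array>
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

constexpr int Nc = 3;
using Cplx = std::complex<double>;
using ColourMatrix = std::array<Cplx, Nc * Nc>;  // row-major Nc x Nc

// Identity-matrix initialisation: ones on the diagonal, zeros elsewhere.
ColourMatrix identity_matrix() {
    ColourMatrix m{};
    for (int c = 0; c < Nc; ++c) m[c * Nc + c] = Cplx(1.0, 0.0);
    return m;
}

// Component-wise sum over all sites of the field.
ColourMatrix sum_sites(const std::vector<ColourMatrix>& field) {
    ColourMatrix s{};
    for (const auto& site : field)
        for (std::size_t i = 0; i < s.size(); ++i) s[i] += site[i];
    return s;
}

// Real part of sum_i conj(a_i) * b_i over all matrix components.
double inner_product(const ColourMatrix& a, const ColourMatrix& b) {
    Cplx s(0.0, 0.0);
    for (std::size_t i = 0; i < a.size(); ++i) s += std::conj(a[i]) * b[i];
    return s.real();
}

// Each of the Nc diagonal entries of the sum equals V, so the inner product
// of the sum with itself is Nc * V^2.
bool constant_field_check(std::size_t volume) {
    const std::vector<ColourMatrix> field(volume, identity_matrix());
    const ColourMatrix s = sum_sites(field);
    const double v = static_cast<double>(volume);
    return inner_product(s, s) == Nc * v * v;
}
```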
Make.inc is auto-generated by scripts/filelist on bootstrap and is not
tracked; the new .cc file is all that is needed.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>