Test_staggered_hdcg.cc implements a two-level ADEF2 multigrid solver for
NaiveStaggeredFermion using SchurStaggeredOperator, following the mrhs
hermitian multigrid approach of arXiv:2409.03904. Uses a 33-point coarse
stencil (NextToNearestStencilGeometry4D) with nbasis=24, block={4,4,4,4},
and Chebyshev subspace generation with hi=5.0 (lambda_max ~4.6).
Also adds systems/mac-arm/sourceme-homebrew.sh and config-command-homebrew
for building Grid on Apple Silicon with Homebrew-installed dependencies.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace hand-rolled shared-memory reduction kernels (reduceBlock/reduceBlocks/
reduceKernel) and the global device variable retirementCount with a unified
CUB/hipCUB DeviceReduce::Reduce path for CUDA/HIP and sycl::reduction for SYCL.
No small/large split is needed: both CUB and sycl::reduction handle arbitrary
object sizes internally.
Old implementations preserved as sum_gpu_old / sumD_gpu_old etc. in the
original files for regression testing on GPU hardware.
Also add CLAUDE.md with build, test, and architecture guidance.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>