Add staggered HDCG multigrid test and mac-arm Homebrew build scripts

Test_staggered_hdcg.cc implements a two-level ADEF2 multigrid solver for NaiveStaggeredFermion using SchurStaggeredOperator, following the mrhs hermitian multigrid approach of arXiv:2409.03904. Uses a 33-point coarse stencil (NextToNearestStencilGeometry4D) with nbasis=24, block={4,4,4,4}, and Chebyshev subspace generation with hi=5.0 (lambda_max ~4.6).

Also adds systems/mac-arm/sourceme-homebrew.sh and config-command-homebrew for building Grid on Apple Silicon with Homebrew-installed dependencies.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-04 19:24:36 +01:00 · 2026-05-27 15:52:49 -04:00
parent b58a1508fa
commit 520b90259d
4 changed files with 317 additions and 0 deletions
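
The commit message sets the Chebyshev upper bound hi=5.0 slightly above the quoted lambda_max ~4.6. The sketch below illustrates why that ordering matters, under stated assumptions: it is a standalone evaluation of a first-kind Chebyshev polynomial mapped onto a window [lo, hi], not Grid's own Chebyshev class and not code from Test_staggered_hdcg.cc. The function names, the lower bound `lo`, and the polynomial `order` are hypothetical; only hi=5.0 and lambda_max ~4.6 come from the commit message.

```cpp
#include <cassert>
#include <cmath>

// Map lambda linearly so that [lo, hi] lands on [-1, 1].
double mapToUnitInterval(double lambda, double lo, double hi) {
  assert(lo < hi);
  return (2.0 * lambda - hi - lo) / (hi - lo);
}

// Evaluate the Chebyshev polynomial T_order at the mapped point using the
// three-term recurrence T_{k+1}(x) = 2 x T_k(x) - T_{k-1}(x).
// Inside [lo, hi] the result is bounded by 1 in magnitude; outside the
// window it grows rapidly with the order.
double chebyshevFilter(double lambda, double lo, double hi, int order) {
  assert(order >= 0);
  const double x = mapToUnitInterval(lambda, lo, hi);
  double tPrev = 1.0;  // T_0
  if (order == 0) return tPrev;
  double tCurr = x;    // T_1
  for (int k = 1; k < order; ++k) {
    const double tNext = 2.0 * x * tCurr - tPrev;
    tPrev = tCurr;
    tCurr = tNext;
  }
  return tCurr;
}
```

With the hypothetical choices lo=0.5 and order=10, `chebyshevFilter(4.6, 0.5, 5.0, 10)` stays within magnitude 1 because 4.6 lies inside the window, while an eigenvalue well below `lo` (for example 0.01) is amplified by a large factor. If hi were chosen below the largest eigenvalue, that top mode would sit outside the window and be amplified too, which is presumably why hi is set with some margin over lambda_max.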
@@ -30,6 +30,9 @@ Key configure options:
 | `--enable-Nc=` | `3` (default), `2`, `4`, `5` |
 | `--with-gmp=`, `--with-mpfr=`, `--with-fftw=`, `--with-lime=` | paths to libs |
 | `--enable-hdf5`, `--enable-mkl`, `--enable-lapack` | optional features |
+| `--enable-debug=yes` | adds `-g`, removes `-O3` |
+
+Use `make V=1` for verbose compiler output (shows full flags; required for bug reports).

 Platform recipes from `README.md`:
 - **KNL**: `--enable-simd=KNL --enable-comms=mpi3-auto --enable-mkl`
@@ -96,3 +99,17 @@ GPU support is injected via macros (`accelerator_for`, `accelerator_for2dNB`). T
 - The `RealD`/`RealF`/`ComplexD`/`ComplexF` typedefs are used everywhere; avoid raw `double`/`float`.
 - Logging uses `Grid_log`, `Grid_error` macros (from `Grid/log/`); performance-critical paths use the `GRID_TRACE` / timer macros from `Grid/perfmon/`.
 - Reductions across MPI ranks go through `GridBase::GlobalSum` / `GlobalMax`; never reduce with bare MPI calls inside library code.
+
+## Skills
+
+`skills/` contains seven user-invocable Claude Code skills encoding deep domain knowledge for HPC work in this repo. Invoke with `/skill-name` or ask Claude to use them by name:
+
+| Skill | When to use |
+|-------|-------------|
+| `gpu-memory-performance` | Bandwidth/occupancy problems — `acceleratorThreads()` pitfalls, `coalescedRead/Write`, fused vs staged HBM patterns |
+| `gpu-runtime-correctness` | Wrong answers, non-deterministic results, premature `q.wait()` returns |
+| `communication-overlap` | Designing GPU+MPI overlap pipelines; replacing broken accelerator-aware MPI paths with host-staged 7-phase pipeline |
+| `mpi-heterogeneous` | Collective hangs, buffer aliasing in `MPI_Sendrecv`, heterogeneous topology bugs |
+| `hang-diagnosis` | Distinguishing ioctl hangs, infinite poll loops, collective deadlocks, and silent wrong-answer races |
+| `correctness-verification` | Reproducibility checksums, double-wait testing, bisecting non-deterministic failures |
+| `compiler-validation` | Confirming compiler/optimisation flags are safe before production runs |