From 5822a6599c55b7a3b1540bdaf6f50171faa59846 Mon Sep 17 00:00:00 2001
From: Peter Boyle
Date: Wed, 27 May 2026 11:12:47 -0400
Subject: [PATCH] skills: add GPU/A2A reference skill documents

Co-Authored-By: Claude Sonnet 4.6
---
 skills/ref_a2a_emf_work.md      | 61 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 skills/ref_accelerator_for.md   | 43 ++++++++++++++++++++++++++++++++++++++++++
 skills/ref_coalesced_views.md   | 70 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 skills/ref_grid_simt_pattern.md | 47 ++++++++++++++++++++++++++++++++++++++++++++++
 skills/ref_lattice_vs_vector.md | 48 +++++++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 269 insertions(+)
 create mode 100644 skills/ref_a2a_emf_work.md
 create mode 100644 skills/ref_accelerator_for.md
 create mode 100644 skills/ref_coalesced_views.md
 create mode 100644 skills/ref_grid_simt_pattern.md
 create mode 100644 skills/ref_lattice_vs_vector.md

diff --git a/skills/ref_a2a_emf_work.md b/skills/ref_a2a_emf_work.md
new file mode 100644
index 00000000..187f61b5
--- /dev/null
+++ b/skills/ref_a2a_emf_work.md
@@ -0,0 +1,61 @@
+---
+name: ref_a2a_emf_work
+description: "A2A Extended Meson Field GPU offload work — status, file locations, pending task"
+metadata:
+  node_type: memory
+  type: project
+  originSessionId: 956e80aa-401d-481a-80bb-17f8abe1c131
+---
+
+## What was built
+
+`Grid/algorithms/blas/A2ASpatialSum.h` — batched GEMM spatial sum replacing scalar SIMD accumulation. Included via `Grid/algorithms/Algorithms.h`.
+
+`tests/Test_extended_meson_field.cc` — test with class `A2AExtendedMesonFieldRef` containing:
+- CPU reference path (`use_blas=false`)
+- BLAS path (`use_blas=true`) using `A2ASpatialSum`
+- Per-phase timing with `[ref type=N]` / `[blas type=N]` labels
+- 4 contraction types (0-3), all verified at machine precision (~4e-16 rel_err)
+
+## Pending task: GPU offload class
+
+**Goal**: Write `A2AExtendedMesonFieldGPU` in the same test file, replacing all `thread_for` loops with `accelerator_for`-based free function kernels.
+
+The `thread_for` blocks to replace all have the form:
+```cpp
+thread_for(r, rd, {
+  int so = r * grid->_ostride[orthogdim];
+  for (int n = 0; n < e1; n++)
+    for (int b = 0; b < e2; b++) {
+      int ss = so + n * stride + b;
+      // work
+    }
+});
+```
+Replace with `accelerator_for(ss, grid->oSites(), Nsimd, { ... })`.
+
+**Free functions to write** (each takes `Lattice` args, opens views internally):
+- `A2ALoopPropagator` — outerProduct sum (loop build)
+- `A2APackLeftConjugated` — conjugate left fermion fields into `Lattice`
+- `A2ALoopLeftContractionType0/1/2/3` — per-site loop × loop propagator → `tloop`
+- `A2ALoopRightContractionType0/1/2/3` — per-site tloop × right → `loopRight[j]`
+
+**Data structure changes required**:
+- `tloopv`: `std::vector` → `Lattice` (PropagatorField)
+- `leftv[i]`: `std::vector` → `Lattice`
+- `loopRight[j]`: `std::vector` → `Lattice`
+
+**Why**: `std::vector` is host memory, not GPU accessible. See [[ref_lattice_vs_vector]].
+
+**`A2ASpatialSum` impact**: `PackLeft`/`PackRight` currently take `std::vector<std::vector<vobj>>`. Once `leftv`/`loopRight` become `std::vector<Lattice<vobj>>`, those signatures must change to match.
+
+## Timing on 8.8.8.16 (N_i=N_j=8, Nloop=4, 1 MPI rank)
+
+Dominant costs:
+- `loop_build`: 4-6 ms (outerProduct over 4 propagators)
+- `pack_loopright`: 0.9-2.2 ms (type-dependent)
+- `spatial_sum` (ref): ~1.5 ms
+- `A2ASpatialSum TOTAL`: 2.5-4.3 ms (PackLeft+PackRight dominate GEMM on small volume)
+
+## Related
+[[ref_accelerator_for]] [[ref_coalesced_views]] [[ref_lattice_vs_vector]] [[ref_grid_simt_pattern]]
diff --git a/skills/ref_accelerator_for.md b/skills/ref_accelerator_for.md
new file mode 100644
index 00000000..fd942a8b
--- /dev/null
+++ b/skills/ref_accelerator_for.md
@@ -0,0 +1,43 @@
+---
+name: ref_accelerator_for
+description: Grid accelerator_for usage — converting block-strided thread_for to GPU-portable oSites loops
+metadata:
+  node_type: memory
+  type: reference
+  originSessionId: 956e80aa-401d-481a-80bb-17f8abe1c131
+---
+
+## Pattern: block-strided thread_for → accelerator_for over oSites
+
+Old CPU-only pattern (block-strided over orthog dimension):
+```cpp
+thread_for(r, rd, {
+  int so = r * grid->_ostride[orthogdim];
+  for (int n = 0; n < e1; n++)
+    for (int b = 0; b < e2; b++) {
+      int ss = so + n * stride + b;
+      // work on site ss
+    }
+});
+```
+
+GPU-portable replacement:
+```cpp
+accelerator_for(ss, grid->oSites(), Nsimd, {
+  // work on site ss — one SIMT thread per (osite, lane) on GPU
+  // one thread per osite (lane loop implicit via GRID_SIMT) on CPU
+});
+```
+
+Key rules:
+- `accelerator_for(iter, count, Nsimd, body)` — Nsimd is `vobj::Nsimd()` or `grid->Nsimd()`
+- On CPU: expands to `thread_for` over count, `acceleratorSIMTlane` always returns 0 — must use `#ifdef GRID_SIMT` pattern if iterating lanes explicitly (see [[ref_grid_simt_pattern]])
+- On GPU: one SIMT thread per (iter × lane), `acceleratorSIMTlane(Nsimd)` returns actual lane
+- Loop body must capture only scalar/POD by value or via device-accessible pointers; no `std::vector` or host containers inside the body
+- `Coordinate` inside `accelerator_for` must be `AcceleratorVector` (stack-allocated, device-safe) — Grid's `Coordinate` typedef already satisfies this
+
+## Where defined
+`Grid/threads/Accelerator.h` — CPU path ~line 607; GPU paths in conditional blocks above.
+
+## Model file
+`Grid/algorithms/blas/MomentumProject.h` — `ImportVector` is the canonical example of correct `accelerator_for` + SIMD lane extraction.
diff --git a/skills/ref_coalesced_views.md b/skills/ref_coalesced_views.md
new file mode 100644
index 00000000..e0d48032
--- /dev/null
+++ b/skills/ref_coalesced_views.md
@@ -0,0 +1,70 @@
+---
+name: ref_coalesced_views
+description: Grid coalescedRead/coalescedWrite and autoView — GPU-portable field access inside accelerator_for
+metadata:
+  node_type: memory
+  type: reference
+  originSessionId: 956e80aa-401d-481a-80bb-17f8abe1c131
+---
+
+## View access modes
+
+```cpp
+autoView(v, field, AcceleratorRead);      // read-only, device-accessible
+autoView(v, field, AcceleratorWrite);     // write-only, device-accessible
+autoView(v, field, AcceleratorReadWrite); // read-write, device-accessible
+autoView(v, field, CpuRead);              // CPU only (avoids GPU migration)
+autoView(v, field, CpuWrite);             // CPU only
+```
+
+Views must be opened **before** `accelerator_for` and closed (go out of scope) **after**. Never open a view inside the accelerator_for body.
+
+## coalescedRead / coalescedWrite
+
+Inside `accelerator_for(ss, oSites, Nsimd, { ... })`:
+
+```cpp
+auto site = coalescedRead(v[ss]); // reads SIMT lane; returns scalar_object on GPU, vobj on CPU
+coalescedWrite(v[ss], site);      // writes SIMT lane
+```
+
+- `coalescedRead(v[ss])` calls `v.operator()(ss)` which on GPU returns `extractLane(lane, v[ss])` — one lane per SIMT thread, contiguous across threads → coalesced
+- On CPU returns the full vobj (no lane extraction needed; handled transparently)
+- The returned type is `decltype(coalescedRead(v[ss]))` — use `auto` or match with scalar_object
+
+## Typical kernel pattern
+
+```cpp
+autoView(out_v, out, AcceleratorWrite);
+autoView(in_v, in, AcceleratorRead);
+accelerator_for(ss, grid->oSites(), vobj::Nsimd(), {
+  auto x = coalescedRead(in_v[ss]);
+  // modify x ...
+  coalescedWrite(out_v[ss], x);
+});
+```
+
+## Free function kernel signature
+
+```cpp
+template <class vobj>
+void MyKernel(Lattice<vobj> &out, const Lattice<vobj> &in)
+{
+  GridBase *grid = in.Grid();
+  autoView(out_v, out, AcceleratorWrite);
+  autoView(in_v, in, AcceleratorRead);
+  accelerator_for(ss, grid->oSites(), vobj::Nsimd(), {
+    auto x = coalescedRead(in_v[ss]);
+    coalescedWrite(out_v[ss], x);
+  });
+}
+```
+
+## What NOT to do
+- Do not access `std::vector` elements inside `accelerator_for` — not device-accessible
+- Do not use `CpuRead`/`CpuWrite` views inside `accelerator_for` — GPU will fault
+- Do not assign to `v[ss]` directly inside `accelerator_for` — use `coalescedWrite`
+- Do not open multiple write views on the same field simultaneously
+
+## Related
+[[ref_accelerator_for]] [[ref_lattice_vs_vector]]
diff --git a/skills/ref_grid_simt_pattern.md b/skills/ref_grid_simt_pattern.md
new file mode 100644
index 00000000..ef606f21
--- /dev/null
+++ b/skills/ref_grid_simt_pattern.md
@@ -0,0 +1,47 @@
+---
+name: ref_grid_simt_pattern
+description: Grid GRID_SIMT
+metadata:
+  node_type: memory
+  type: reference
+  originSessionId: 956e80aa-401d-481a-80bb-17f8abe1c131
+---
+
+## The problem
+
+On CPU, `accelerator_for(sf, oSites, Nsimd, {...})` expands to `thread_for(sf, oSites, {...})` — one thread per osite. `acceleratorSIMTlane(Nsimd)` always returns **0** on CPU. If you need to iterate all Nsimd lanes (e.g. to extract SIMD-packed data), you must loop explicitly on CPU.
+
+On GPU, `accelerator_for` launches one SIMT thread per (osite × lane). `acceleratorSIMTlane(Nsimd)` returns the actual lane index [0, Nsimd).
+
+## Correct pattern (from MomentumProject::ImportVector)
+
+```cpp
+accelerator_for(sf, osites, Nsimd, {
+#ifdef GRID_SIMT
+    {
+      int lane = acceleratorSIMTlane(Nsimd);
+#else
+    for (int lane = 0; lane < Nsimd; lane++) {
+#endif
+      // body using lane
+    }
+  });
+```
+
+- On GPU: `GRID_SIMT` is defined → single-lane body, lane from hardware
+- On CPU: `GRID_SIMT` is not defined → explicit lane loop inside the osite thread
+
+## When is this needed?
+
+Only when you explicitly need the lane index, e.g.:
+- Extracting scalar data from SIMD-packed `vobj` via `extractLane(lane, src[sf])`
+- Computing full local coordinates from (osite, lane) → `Lexicographic::CoorFromIndex(icoor, lane, simd_layout)`
+
+When using `coalescedRead`/`coalescedWrite`, this pattern is **not needed** — those handle lane selection transparently.
+
+## Pitfall that caused a bug
+
+`A2ASpatialSum::PackVectors` originally used `accelerator_for` without the `#ifdef GRID_SIMT` lane loop. On CPU, only lane=0 was extracted, giving wrong norms (~8× too small for `GEN_SIMD_WIDTH=64`, `Nsimd=4`). Fix: add the `#ifdef GRID_SIMT` pattern. See [[ref_accelerator_for]].
+
+## Model file
+`Grid/algorithms/blas/MomentumProject.h`, function `ImportVector`, lines ~166-207.
diff --git a/skills/ref_lattice_vs_vector.md b/skills/ref_lattice_vs_vector.md
new file mode 100644
index 00000000..f521a8a8
--- /dev/null
+++ b/skills/ref_lattice_vs_vector.md
@@ -0,0 +1,48 @@
+---
+name: ref_lattice_vs_vector
+description: When to use Lattice vs std::vector for GPU-portable field storage in Grid
+metadata:
+  node_type: memory
+  type: reference
+  originSessionId: 956e80aa-401d-481a-80bb-17f8abe1c131
+---
+
+## Rule
+
+Use `Lattice<vobj>` (or `std::vector<Lattice<vobj>>`) for any field that will be read or written inside `accelerator_for`. `std::vector` is host memory and is NOT device-accessible.
+
+## Before vs after GPU offload
+
+```cpp
+// CPU-only (host memory, not GPU accessible)
+std::vector<vobj> tloopv(oSites, Zero());
+// accessed directly: tloopv[ss]
+
+// GPU-portable
+Lattice<vobj> tloop(grid);
+// accessed via view: autoView(tloop_v, tloop, AcceleratorWrite);
+// coalescedWrite(tloop_v[ss], val);
+```
+
+## Corollary: function signatures
+
+CPU-only version:
+```cpp
+void PackLeft(const std::vector<std::vector<vobj>> &leftv);
+```
+
+GPU-portable version:
+```cpp
+void PackLeft(const std::vector<Lattice<vobj>> &leftv);
+```
+
+## deviceVector for raw device buffers
+
+`deviceVector` (defined in Grid) is like `std::vector` but in device-accessible memory. Use for raw scalar scratch/pack buffers (e.g. GEMM input/output staging). Not for structured lattice data.
+
+## Pointer arrays for batched BLAS
+
+`deviceVector` holds batch pointer arrays. Populate with `acceleratorPut(ptrs[t], base + offset)` — sets device-side pointer from host. See `A2ASpatialSum::Allocate`.
+
+## Related
+[[ref_coalesced_views]] [[ref_accelerator_for]]