mirror of
https://github.com/paboyle/Grid.git
synced 2026-06-04 11:14:38 +01:00
skills: add GPU/A2A reference skill documents
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@@ -0,0 +1,61 @@
---
name: ref_a2a_emf_work
description: "A2A Extended Meson Field GPU offload work: status, file locations, pending task"
metadata:
  node_type: memory
  type: project
  originSessionId: 956e80aa-401d-481a-80bb-17f8abe1c131
---

## What was built

`Grid/algorithms/blas/A2ASpatialSum.h`: batched GEMM spatial sum replacing scalar SIMD accumulation. Included via `Grid/algorithms/Algorithms.h`.

`tests/Test_extended_meson_field.cc`: test with class `A2AExtendedMesonFieldRef` containing:
- CPU reference path (`use_blas=false`)
- BLAS path (`use_blas=true`) using `A2ASpatialSum`
- Per-phase timing with `[ref type=N]` / `[blas type=N]` labels
- 4 contraction types (0-3), all verified at machine precision (~4e-16 rel_err)
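
As a rough illustration of why a spatial sum can be handed to GEMM: for one time slice, a sum of the form M_ij = Σ_x conj(L_i(x)) · R_j(x) is the matrix product of a conjugated (N_i × V) array with the transpose of an (N_j × V) array. The sketch below is standalone C++ with no Grid dependency. The row-major packing, the function names `spatial_sum_naive` and `spatial_sum_gemm_form`, and the reduction to one complex number per site are assumptions made for illustration; they are not the actual `A2ASpatialSum` layout.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

using Cplx = std::complex<double>;

// Naive form: for each (i, j), accumulate over the V sites of one time slice.
// left is Ni x V and right is Nj x V, both row-major (hypothetical packing).
std::vector<Cplx> spatial_sum_naive(const std::vector<Cplx> &left,
                                    const std::vector<Cplx> &right,
                                    std::size_t Ni, std::size_t Nj, std::size_t V) {
  std::vector<Cplx> M(Ni * Nj, Cplx(0.0, 0.0));
  for (std::size_t i = 0; i < Ni; i++)
    for (std::size_t j = 0; j < Nj; j++)
      for (std::size_t x = 0; x < V; x++)
        M[i * Nj + j] += std::conj(left[i * V + x]) * right[j * V + x];
  return M;
}

// GEMM form: conjugate left once (the "pack" step), then M = A * B^T with
// A = conj(left) (Ni x V) and B = right (Nj x V). A BLAS zgemm call would
// replace the triple loop; it is written out here to stay dependency-free.
std::vector<Cplx> spatial_sum_gemm_form(const std::vector<Cplx> &left,
                                        const std::vector<Cplx> &right,
                                        std::size_t Ni, std::size_t Nj, std::size_t V) {
  std::vector<Cplx> A(left.size());
  for (std::size_t k = 0; k < left.size(); k++) A[k] = std::conj(left[k]);
  std::vector<Cplx> M(Ni * Nj, Cplx(0.0, 0.0));
  for (std::size_t i = 0; i < Ni; i++)
    for (std::size_t x = 0; x < V; x++)
      for (std::size_t j = 0; j < Nj; j++)
        M[i * Nj + j] += A[i * V + x] * right[j * V + x];
  return M;
}
```

With one such product per time slice, the slices form a natural batch, which is where a batched GEMM call fits. On small volumes the pack step can cost more than the multiply itself, consistent with the timings recorded in this note.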

## Pending task: GPU offload class

**Goal**: Write `A2AExtendedMesonFieldGPU` in the same test file, replacing all `thread_for` loops with `accelerator_for`-based free function kernels.

The `thread_for` blocks to replace all have the form:
```cpp
thread_for(r, rd, {
  int so = r * grid->_ostride[orthogdim];
  for (int n = 0; n < e1; n++)
    for (int b = 0; b < e2; b++) {
      int ss = so + n * stride + b;
      // work
    }
});
```
Replace with `accelerator_for(ss, grid->oSites(), Nsimd, { ... })`.

**Free functions to write** (each takes `Lattice<T>` args, opens views internally):
- `A2ALoopPropagator`: outerProduct sum (loop build)
- `A2APackLeftConjugated`: conjugate left fermion fields into `Lattice<SpinColourVector_v>`
- `A2ALoopLeftContractionType0/1/2/3`: per-site loop × loop propagator → `tloop`
- `A2ALoopRightContractionType0/1/2/3`: per-site tloop × right → `loopRight[j]`

**Data structure changes required**:
- `tloopv`: `std::vector<SpinColourMatrix_v>` → `Lattice<SpinColourMatrix_v>` (PropagatorField)
- `leftv[i]`: `std::vector<SpinColourVector_v>` → `Lattice<SpinColourVector_v>`
- `loopRight[j]`: `std::vector<SpinColourVector_v>` → `Lattice<SpinColourVector_v>`

**Why**: `std::vector<vobj>` is host memory, not GPU accessible. See [[ref_lattice_vs_vector]].

**`A2ASpatialSum` impact**: `PackLeft`/`PackRight` currently take `std::vector<std::vector<vobj>>`. Once `leftv`/`loopRight` become `std::vector<Lattice<vobj>>`, those signatures must change to match.

## Timing on 8.8.8.16 (N_i=N_j=8, Nloop=4, 1 MPI rank)

Dominant costs:
- `loop_build`: 4-6 ms (outerProduct over 4 propagators)
- `pack_loopright`: 0.9-2.2 ms (type-dependent)
- `spatial_sum` (ref): ~1.5 ms
- `A2ASpatialSum TOTAL`: 2.5-4.3 ms (PackLeft+PackRight dominate GEMM on small volume)

## Related
[[ref_accelerator_for]] [[ref_coalesced_views]] [[ref_lattice_vs_vector]] [[ref_grid_simt_pattern]]
@@ -0,0 +1,43 @@
---
name: ref_accelerator_for
description: Grid accelerator_for usage, converting block-strided thread_for to GPU-portable oSites loops
metadata:
  node_type: memory
  type: reference
  originSessionId: 956e80aa-401d-481a-80bb-17f8abe1c131
---

## Pattern: block-strided thread_for → accelerator_for over oSites

Old CPU-only pattern (block-strided over orthog dimension):
```cpp
thread_for(r, rd, {
  int so = r * grid->_ostride[orthogdim];
  for (int n = 0; n < e1; n++)
    for (int b = 0; b < e2; b++) {
      int ss = so + n * stride + b;
      // work on site ss
    }
});
```

GPU-portable replacement:
```cpp
accelerator_for(ss, grid->oSites(), Nsimd, {
  // work on site ss: one SIMT thread per (osite, lane) on GPU,
  //                  one thread per osite (lane loop implicit via GRID_SIMT) on CPU
});
```
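
The replacement is only valid if the block-strided nest visits every outer site exactly once. A standalone check of that claim is sketched below, with no Grid dependency. The struct `SliceGeom`, the helpers `visit_counts` and `slice_of`, and the relations `ostride == e2` and `stride == e2 * rd` are assumptions chosen to mimic a slice decomposition; the real values come from the grid object.

```cpp
#include <cassert>
#include <vector>

// Hypothetical slice geometry for one orthogonal dimension, mirroring the
// names used in the loop nest: rd slices, e1 blocks of e2 contiguous sites.
struct SliceGeom {
  int rd;       // reduced extent of the orthogonal dimension
  int e1;       // number of blocks per slice
  int e2;       // contiguous sites per block
  int ostride;  // outer-site stride of the orthogonal dimension (assumed == e2)
  int stride;   // distance between blocks of one slice (assumed == e2 * rd)
};

// Count how often each outer site is visited by the block-strided loop nest.
std::vector<int> visit_counts(const SliceGeom &g) {
  const int osites = g.rd * g.e1 * g.e2;
  std::vector<int> count(osites, 0);
  for (int r = 0; r < g.rd; r++) {
    const int so = r * g.ostride;
    for (int n = 0; n < g.e1; n++)
      for (int b = 0; b < g.e2; b++)
        count[so + n * g.stride + b]++;
  }
  return count;
}

// Inverse map for a flat loop: which slice r does outer site ss belong to?
int slice_of(const SliceGeom &g, int ss) { return (ss / g.ostride) % g.rd; }
```

If a kernel body still needs the slice index `r` (for example to select a per-slice output), a flat loop can recover it from `ss` with an expression like `slice_of` instead of carrying the outer loop variable.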

Key rules:
- `accelerator_for(iter, count, Nsimd, body)`: Nsimd is `vobj::Nsimd()` or `grid->Nsimd()`
- On CPU: expands to `thread_for` over count, and `acceleratorSIMTlane` always returns 0; use the `#ifdef GRID_SIMT` pattern if iterating lanes explicitly (see [[ref_grid_simt_pattern]])
- On GPU: one SIMT thread per (iter × lane), `acceleratorSIMTlane(Nsimd)` returns actual lane
- Loop body must capture only scalar/POD by value or via device-accessible pointers; no `std::vector` or host containers inside the body
- `Coordinate` inside `accelerator_for` must be `AcceleratorVector<int, MaxDims>` (stack-allocated, device-safe); Grid's `Coordinate` typedef already satisfies this

## Where defined
`Grid/threads/Accelerator.h`: CPU path ~line 607; GPU paths in conditional blocks above.

## Model file
`Grid/algorithms/blas/MomentumProject.h`: `ImportVector` is the canonical example of correct `accelerator_for` + SIMD lane extraction.
@@ -0,0 +1,70 @@
---
name: ref_coalesced_views
description: Grid coalescedRead/coalescedWrite and autoView, GPU-portable field access inside accelerator_for
metadata:
  node_type: memory
  type: reference
  originSessionId: 956e80aa-401d-481a-80bb-17f8abe1c131
---

## View access modes

```cpp
autoView(v, field, AcceleratorRead);         // read-only, device-accessible
autoView(v, field, AcceleratorWrite);        // writable, device-accessible; existing contents stay valid (read-modify-write)
autoView(v, field, AcceleratorWriteDiscard); // write-only, device-accessible; old contents are not preserved
autoView(v, field, CpuRead);                 // CPU only (avoids GPU migration)
autoView(v, field, CpuWrite);                // CPU only
```

Views must be opened **before** `accelerator_for` and closed (go out of scope) **after**. Never open a view inside the `accelerator_for` body.

## coalescedRead / coalescedWrite

Inside `accelerator_for(ss, oSites, Nsimd, { ... })`:

```cpp
auto site = coalescedRead(v[ss]);   // reads SIMT lane; returns scalar_object on GPU, vobj on CPU
coalescedWrite(v[ss], site);        // writes SIMT lane
```

- On GPU, `coalescedRead(v[ss])` returns `extractLane(lane, v[ss])` (the view's `v(ss)` is shorthand for the same read): one lane per SIMT thread, contiguous across threads → coalesced
- On CPU it returns the full vobj (no lane extraction needed; handled transparently)
- The returned type is `decltype(coalescedRead(v[ss]))`; use `auto` or match with `scalar_object`
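
To picture why per-lane access is coalesced, it may help to treat a site object as a few components, each holding `Nsimd` scalars side by side. The mock below is standalone C++ and uses no Grid types: `MockVobj`, `MockSobj`, `mock_coalesced_read`, `mock_coalesced_write`, and `element_offset` are hypothetical stand-ins, and the lane is passed explicitly where a GPU thread would obtain it from the hardware.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr int Nsimd = 4;  // lanes per SIMD word (illustrative value)
constexpr int Ncomp = 3;  // components per site object (illustrative value)

// Mock site object: each component is one SIMD word of Nsimd scalars.
struct MockVobj { std::array<std::array<double, Nsimd>, Ncomp> c; };
// Mock scalar object: what a single SIMT thread works with.
struct MockSobj { std::array<double, Ncomp> c; };

// Stand-in for the GPU branch of a coalesced read: one lane of every component.
MockSobj mock_coalesced_read(const MockVobj &v, int lane) {
  MockSobj s;
  for (int k = 0; k < Ncomp; k++) s.c[k] = v.c[k][lane];
  return s;
}

// Stand-in for the matching write: store one lane of every component.
void mock_coalesced_write(MockVobj &v, const MockSobj &s, int lane) {
  for (int k = 0; k < Ncomp; k++) v.c[k][lane] = s.c[k];
}

// Flat offset (in scalars) of component k, lane `lane`, at site ss.
std::size_t element_offset(std::size_t ss, int k, int lane) {
  return (ss * Ncomp + k) * Nsimd + lane;
}
```

For a fixed site and component, neighbouring lanes map to neighbouring offsets, so threads that differ only in lane touch adjacent memory. Running one emulated thread per (site, lane) pair reproduces the whole-object result.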

## Typical kernel pattern

```cpp
autoView(out_v, out, AcceleratorWrite);
autoView(in_v, in, AcceleratorRead);
accelerator_for(ss, grid->oSites(), vobj::Nsimd(), {
  auto x = coalescedRead(in_v[ss]);
  // modify x ...
  coalescedWrite(out_v[ss], x);
});
```

## Free function kernel signature

```cpp
template<class vobj>
void MyKernel(Lattice<vobj> &out, const Lattice<vobj> &in)
{
  GridBase *grid = in.Grid();
  autoView(out_v, out, AcceleratorWrite);
  autoView(in_v, in, AcceleratorRead);
  accelerator_for(ss, grid->oSites(), vobj::Nsimd(), {
    auto x = coalescedRead(in_v[ss]);
    coalescedWrite(out_v[ss], x);
  });
}
```

## What NOT to do
- Do not access `std::vector` elements inside `accelerator_for`: they are not device-accessible
- Do not use `CpuRead`/`CpuWrite` views inside `accelerator_for`: the GPU will fault
- Do not assign to `v[ss]` directly inside `accelerator_for`; use `coalescedWrite`
- Do not open multiple write views on the same field simultaneously

## Related
[[ref_accelerator_for]] [[ref_lattice_vs_vector]]
@@ -0,0 +1,47 @@
---
name: ref_grid_simt_pattern
description: Grid GRID_SIMT
metadata:
  node_type: memory
  type: reference
  originSessionId: 956e80aa-401d-481a-80bb-17f8abe1c131
---

## The problem

On CPU, `accelerator_for(sf, oSites, Nsimd, {...})` expands to `thread_for(sf, oSites, {...})`: one thread per osite. `acceleratorSIMTlane(Nsimd)` always returns **0** on CPU. If you need to iterate all Nsimd lanes (e.g. to extract SIMD-packed data), you must loop explicitly on CPU.

On GPU, `accelerator_for` launches one SIMT thread per (osite × lane). `acceleratorSIMTlane(Nsimd)` returns the actual lane index in [0, Nsimd).

## Correct pattern (from MomentumProject::ImportVector)

```cpp
accelerator_for(sf, osites, Nsimd, {
#ifdef GRID_SIMT
  {
    int lane = acceleratorSIMTlane(Nsimd);
#else
  for (int lane = 0; lane < Nsimd; lane++) {
#endif
    // body using lane
  }
});
```

- On GPU: `GRID_SIMT` is defined → single-lane body, lane from hardware
- On CPU: `GRID_SIMT` is not defined → explicit lane loop inside the osite thread

## When is this needed?

Only when you explicitly need the lane index, e.g.:
- Extracting scalar data from SIMD-packed `vobj` via `extractLane(lane, src[sf])`
- Computing full local coordinates from (osite, lane) → `Lexicographic::CoorFromIndex(icoor, lane, simd_layout)`

When using `coalescedRead`/`coalescedWrite`, this pattern is **not needed**: those handle lane selection transparently.

## Pitfall that caused a bug

`A2ASpatialSum::PackVectors` originally used `accelerator_for` without the `#ifdef GRID_SIMT` lane loop. On CPU, only lane=0 was extracted, giving wrong norms (~8× too small for `GEN_SIMD_WIDTH=64`, `Nsimd=4`). Fix: add the `#ifdef GRID_SIMT` pattern. See [[ref_accelerator_for]].
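
The failure mode can be reproduced in miniature without Grid. In the standalone mock below, `MockVobj`, `mock_simt_lane`, `pack_lane0_only`, `pack_all_lanes`, and `norm2` are all hypothetical names: `mock_simt_lane` imitates a lane query that always answers 0, as on a CPU build, and the two pack functions stand for the kernel before and after the fix.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr int Nsimd = 4;                     // illustrative lane count
using MockVobj = std::array<double, Nsimd>;  // one SIMD word per site

// Stand-in for the lane query on a CPU build: always lane 0.
inline int mock_simt_lane() { return 0; }

// Before the fix: each site packs only the lane reported by the lane query.
std::vector<double> pack_lane0_only(const std::vector<MockVobj> &src) {
  std::vector<double> out(src.size() * Nsimd, 0.0);
  for (std::size_t sf = 0; sf < src.size(); sf++) {
    const int lane = mock_simt_lane();
    out[sf * Nsimd + lane] = src[sf][lane];
  }
  return out;
}

// After the fix: the CPU branch loops over every lane of the site.
std::vector<double> pack_all_lanes(const std::vector<MockVobj> &src) {
  std::vector<double> out(src.size() * Nsimd, 0.0);
  for (std::size_t sf = 0; sf < src.size(); sf++)
    for (int lane = 0; lane < Nsimd; lane++)
      out[sf * Nsimd + lane] = src[sf][lane];
  return out;
}

// Squared 2-norm of a packed buffer.
double norm2(const std::vector<double> &v) {
  double s = 0.0;
  for (double x : v) s += x * x;
  return s;
}
```

Comparing `norm2` of the two packed buffers against the norm of the source field is a cheap regression test for this class of bug, since the lane-0 version silently drops data while leaving the buffer size unchanged.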

## Model file
`Grid/algorithms/blas/MomentumProject.h`, function `ImportVector`, lines ~166-207.
@@ -0,0 +1,48 @@
---
name: ref_lattice_vs_vector
description: When to use Lattice<T> vs std::vector<T> for GPU-portable field storage in Grid
metadata:
  node_type: memory
  type: reference
  originSessionId: 956e80aa-401d-481a-80bb-17f8abe1c131
---

## Rule

Use `Lattice<vobj>` (or `std::vector<Lattice<vobj>>`) for any field that will be read or written inside `accelerator_for`. `std::vector<vobj>` is host memory and is NOT device-accessible.

## Before vs after GPU offload

```cpp
// CPU-only (host memory, not GPU accessible)
std::vector<SpinColourMatrix_v> tloopv(oSites, Zero());
// accessed directly: tloopv[ss]

// GPU-portable
Lattice<SpinColourMatrix_v> tloop(grid);
// accessed via view: autoView(tloop_v, tloop, AcceleratorWrite);
//                    coalescedWrite(tloop_v[ss], val);
```

## Corollary: function signatures

CPU-only version:
```cpp
void PackLeft(const std::vector<std::vector<vobj>> &leftv);
```

GPU-portable version:
```cpp
void PackLeft(const std::vector<Lattice<vobj>> &leftv);
```

## deviceVector for raw device buffers

`deviceVector<T>` (defined in Grid) is like `std::vector<T>` but in device-accessible memory. Use for raw scalar scratch/pack buffers (e.g. GEMM input/output staging). Not for structured lattice data.

## Pointer arrays for batched BLAS

`deviceVector<scalar *>` holds batch pointer arrays. Populate with `acceleratorPut(ptrs[t], base + offset)`, which sets the device-side pointer from the host. See `A2ASpatialSum::Allocate`.
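
The offset arithmetic behind such a pointer table can be sketched on the host alone. In the standalone example below, `std::vector` stands in for `deviceVector`, a plain assignment stands in for `acceleratorPut`, and the helper name `make_batch_pointers` and the one-contiguous-buffer layout (each batch entry holding `rows * cols` scalars) are assumptions for illustration, not a description of `A2ASpatialSum::Allocate`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Build one pointer per batch entry into a single contiguous buffer.
// `base` points at batch 0; each batch holds rows * cols scalars (assumed layout).
template <class scalar>
std::vector<scalar *> make_batch_pointers(scalar *base, std::size_t nbatch,
                                          std::size_t rows, std::size_t cols) {
  std::vector<scalar *> ptrs(nbatch);
  for (std::size_t t = 0; t < nbatch; t++) {
    // With a device-resident pointer table, this store would go through a
    // host-to-device put instead of a plain assignment.
    ptrs[t] = base + t * rows * cols;
  }
  return ptrs;
}
```

A batched GEMM interface typically takes three such tables (for the A, B, and C operands), so the same helper shape would be reused with different base pointers and matrix sizes.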

## Related
[[ref_coalesced_views]] [[ref_accelerator_for]]