1
0
mirror of https://github.com/paboyle/Grid.git synced 2026-06-04 19:24:36 +01:00
Files
Grid/skills/ref_a2a_emf_work.md
T
Peter Boyle 5822a6599c skills: add GPU/A2A reference skill documents
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 11:12:47 -04:00

62 lines
2.6 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
---
name: ref_a2a_emf_work
description: "A2A Extended Meson Field GPU offload work — status, file locations, pending task"
metadata:
node_type: memory
type: project
originSessionId: 956e80aa-401d-481a-80bb-17f8abe1c131
---
## What was built
`Grid/algorithms/blas/A2ASpatialSum.h` — batched GEMM spatial sum replacing scalar SIMD accumulation. Included via `Grid/algorithms/Algorithms.h`.
`tests/Test_extended_meson_field.cc` — test with class `A2AExtendedMesonFieldRef` containing:
- CPU reference path (`use_blas=false`)
- BLAS path (`use_blas=true`) using `A2ASpatialSum`
- Per-phase timing with `[ref type=N]` / `[blas type=N]` labels
- 4 contraction types (0-3), all verified at machine precision (~4e-16 rel_err)
## Pending task: GPU offload class
**Goal**: Write `A2AExtendedMesonFieldGPU` in the same test file, replacing all `thread_for` loops with `accelerator_for`-based free function kernels.
The `thread_for` blocks to replace all have the form:
```cpp
thread_for(r, rd, {
int so = r * grid->_ostride[orthogdim];
for (int n = 0; n < e1; n++)
for (int b = 0; b < e2; b++) {
int ss = so + n * stride + b;
// work
}
});
```
Replace with `accelerator_for(ss, grid->oSites(), Nsimd, { ... })`.
**Free functions to write** (each takes `Lattice<T>` args, opens views internally):
- `A2ALoopPropagator` — outerProduct sum (loop build)
- `A2APackLeftConjugated` — conjugate left fermion fields into `Lattice<SpinColourVector_v>`
- `A2ALoopLeftContractionType0/1/2/3` — per-site loop × loop propagator → `tloop`
- `A2ALoopRightContractionType0/1/2/3` — per-site tloop × right → `loopRight[j]`
**Data structure changes required**:
- `tloopv`: `std::vector<SpinColourMatrix_v>``Lattice<SpinColourMatrix_v>` (PropagatorField)
- `leftv[i]`: `std::vector<SpinColourVector_v>``Lattice<SpinColourVector_v>`
- `loopRight[j]`: `std::vector<SpinColourVector_v>``Lattice<SpinColourVector_v>`
**Why**: `std::vector<vobj>` is host memory, not GPU accessible. See [[ref_lattice_vs_vector]].
**`A2ASpatialSum` impact**: `PackLeft`/`PackRight` currently take `std::vector<std::vector<vobj>>`. Once leftv/loopRight become `std::vector<Lattice<vobj>>`, those signatures must change to match.
## Timing on 8.8.8.16 (N_i=N_j=8, Nloop=4, 1 MPI rank)
Dominant costs:
- `loop_build`: 4-6 ms (outerProduct over 4 propagators)
- `pack_loopright`: 0.9-2.2 ms (type-dependent)
- `spatial_sum` (ref): ~1.5 ms
- `A2ASpatialSum TOTAL`: 2.5-4.3 ms (PackLeft+PackRight dominate GEMM on small volume)
## Related
[[ref_accelerator_for]] [[ref_coalesced_views]] [[ref_lattice_vs_vector]] [[ref_grid_simt_pattern]]