aarch64 libunwind compatibility fix

2026-07-22 11:33:27 +01:00 · 2025-10-30 16:45:43 +01:00
124 changed files with 585 additions and 10752 deletions
@@ -1,98 +0,0 @@
-# CLAUDE.md
-
-This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
-
-## What This Is
-
-Grid is a data-parallel C++ library for lattice QCD. It provides SIMD-vectorised lattice containers, MPI-based domain decomposition, GPU acceleration (CUDA/HIP/SYCL), and a full suite of QCD algorithms including HMC.
-
-## Build
-
-Uses GNU Autotools. The bootstrap step only needs to run once (or after `configure.ac` changes).
-
-```bash
-./bootstrap.sh                    # downloads Eigen 3.4.0, generates configure
-mkdir build && cd build
-../configure [options]
-make -j$(nproc)
-make check                        # run root-level tests
-make install
-```
-
-Key configure options:
-
-| Option | Common values |
-|--------|---------------|
-| `--enable-simd=` | `AVX2`, `AVX512`, `KNL`, `A64FX`, `NEONv8`, `GPU` |
-| `--enable-comms=` | `mpi-auto`, `mpi3-auto`, `none` |
-| `--enable-accelerator=` | `cuda`, `hip`, `sycl` |
-| `--enable-shm=` | `shmopen`, `hugetlbfs`, `nvlink` |
-| `--enable-Nc=` | `3` (default), `2`, `4`, `5` |
-| `--with-gmp=`, `--with-mpfr=`, `--with-fftw=`, `--with-lime=` | paths to libs |
-| `--enable-hdf5`, `--enable-mkl`, `--enable-lapack` | optional features |
-
-Platform recipes from `README.md`:
- **KNL**: `--enable-simd=KNL --enable-comms=mpi3-auto --enable-mkl`
- **Skylake/Haswell**: `--enable-simd=AVX512` or `AVX2` + `--enable-comms=mpi3-auto`
- **AMD EPYC**: `--enable-simd=AVX2 --enable-comms=mpi3`
- **A64FX (Fugaku)**: `--enable-simd=A64FX --enable-comms=mpi3 --enable-shm=shmget` (see `SVE_README.txt`)
-
-Required external libs: GMP, MPFR, OpenSSL, zlib.
-
-## Running Tests
-
-```bash
-# From build directory
-make check                          # root-level tests (Test_simd, Test_cshift, etc.)
-make -C tests/<subdir> tests        # build tests in a subdirectory
-./tests/core/Test_simd              # run a single test binary directly
-```
-
-Test subdirectories and their focus: `core` (SIMD, stencil, comms), `solver` (CG, GMRES, eigensolvers), `hmc` (MD integrators), `forces` (fermion forces), `lanczos`, `IO`, `smearing`, `sp2n`, `debug`.
-
-## Architecture
-
-### Layer stack (bottom to top)
-
-1. **SIMD layer** (`Grid/simd/`) — platform-specific intrinsics wrapped into `vRealF`, `vComplexD`, etc. The SIMD width and layout are compile-time constants controlled by `--enable-simd`.
-
-2. **Tensor layer** (`Grid/tensors/`) — Lorentz/colour/spin tensor algebra built on top of SIMD types. `iMatrix`, `iVector`, `iScalar` templates compose into QCD types like `ColourMatrix`, `SpinColourVector`.
-
-3. **Lattice layer** (`Grid/lattice/`) — `Lattice<T>` container: a site-local tensor replicated across a distributed Cartesian grid. All arithmetic is site-parallel and expression-template-fused.
-
-4. **Cartesian/comms layer** (`Grid/cartesian/`, `Grid/communicator/`) — `GridCartesian` holds the MPI topology and local/global geometry. `Grid/cshift/` implements nearest-neighbour halo exchange; `Grid/stencil/` is the optimised multi-hop stencil used by Dirac operators.
-
-5. **Algorithm layer** (`Grid/algorithms/`) — iterative solvers (CG, GMRES, BiCGSTAB, mixed-precision), eigensolvers (Lanczos, LAPACK), FFT, smearing.
-
-6. **QCD layer** (`Grid/qcd/`) — gauge and fermion actions, HMC integrators, observables.
-
-### QCD subsystem (`Grid/qcd/`)
-
- `action/fermion/` — Wilson, Clover, DWF (Mobius), Staggered, twisted-mass, G-parity variants
- `action/gauge/` — Wilson gauge, Symanzik, Iwasaki, DBW2, plaquette+rect
- `representations/` — Fundamental, Adjoint, Two-index, Sp(2n)
- `hmc/` — Leapfrog, OMF2/OMF4 integrators; pseudofermion refreshment; Metropolis accept/reject
- `smearing/` — APE, Stout, HEX, gradient flow
- `observables/` — Polyakov loop, plaquette, topological charge
-
-### GPU acceleration
-
-GPU support is injected via macros (`accelerator_for`, `accelerator_for2dNB`). The `Grid/simd/` SIMD types map to scalar on GPU device code; host code paths remain vectorised. Unified virtual memory is on by default (`--enable-unified=yes`); device-aware MPI (`--enable-accelerator-aware-mpi`) avoids device→host copies on transfers.
-
-### Memory and I/O
-
- `Grid/allocator/` — aligned/NUMA-aware allocators; caching allocator via `--enable-alloc-cache`
- `Grid/parallelIO/` — distributed parallel reader/writer for ILDG (via LIME), SciDAC, and native binary formats
- `Grid/serialisation/` — text, binary, HDF5, XML/JSON serialisation of arbitrary Grid objects
-
-### HMC applications
-
-`HMC/` contains production-ready HMC driver programmes (e.g. `Mobius2p1f.cc`, `DWF_plus_DSDR_nf2plus1_Shamir_Gparity.cc`). These are built separately from the library tests.
-
-## Key Conventions
-
- **C++17** is required throughout.
- Template structure: most classes are templated on `<_FImpl>` (fermion impl) or `<Gimpl>` (gauge impl), which encode the representation and precision. Instantiation is controlled by `--enable-fermion-instantiations`.
- The `RealD`/`RealF`/`ComplexD`/`ComplexF` typedefs are used everywhere; avoid raw `double`/`float`.
- Logging uses `Grid_log`, `Grid_error` macros (from `Grid/log/`); performance-critical paths use the `GRID_TRACE` / timer macros from `Grid/perfmon/`.
- Reductions across MPI ranks go through `GridBase::GlobalSum` / `GlobalMax`; never reduce with bare MPI calls inside library code.
@@ -54,24 +54,22 @@ Version.h: version-cache
 include Make.inc
 include Eigen.inc

-if BUILD_FERMION_INSTANTIATIONS
-  extra_sources+=$(WILS_FERMION_FILES)
-  extra_sources+=$(STAG_FERMION_FILES)
+extra_sources+=$(WILS_FERMION_FILES)
+extra_sources+=$(STAG_FERMION_FILES)
 if BUILD_ZMOBIUS
-    extra_sources+=$(ZWILS_FERMION_FILES)
+  extra_sources+=$(ZWILS_FERMION_FILES)
 endif
 if BUILD_GPARITY
-    extra_sources+=$(GP_FERMION_FILES)
+  extra_sources+=$(GP_FERMION_FILES)
 endif
 if BUILD_FERMION_REPS
-    extra_sources+=$(ADJ_FERMION_FILES)
-    extra_sources+=$(TWOIND_FERMION_FILES)
+  extra_sources+=$(ADJ_FERMION_FILES)
+  extra_sources+=$(TWOIND_FERMION_FILES)
 endif
 if BUILD_SP
    extra_sources+=$(SP_FERMION_FILES)
 if BUILD_FERMION_REPS
-      extra_sources+=$(SP_TWOIND_FERMION_FILES)
-endif
+    extra_sources+=$(SP_TWOIND_FERMION_FILES)
 endif
 endif

@@ -53,7 +53,6 @@ NAMESPACE_CHECK(approx);
 #include <Grid/algorithms/deflation/MultiRHSBlockCGLinalg.h>
 // Not really deflation, but useful
 #include <Grid/algorithms/blas/MomentumProject.h>
-#include <Grid/algorithms/blas/A2ASpatialSum.h>
 NAMESPACE_CHECK(deflation);
 #include <Grid/algorithms/iterative/ConjugateGradient.h>
 NAMESPACE_CHECK(ConjGrad);
@@ -1,6 +1,6 @@
 /*************************************************************************************

-    Grid physics library, www.github.com/paboyle/Grid
+    Grid physics library, www.github.com/paboyle/Grid 

    Source file: ./lib/Cshift.h

@@ -65,16 +65,17 @@ public:
  typedef hipfftDoubleComplex FFTW_scalar;
  typedef hipfftHandle        FFTW_plan;
  static FFTW_plan fftw_plan_many_dft(int rank, int *n,int howmany,
-				      FFTW_scalar *in, int *inembed,
-				      int istride, int idist,
-				      FFTW_scalar *out, int *onembed,
-				      int ostride, int odist,
+				      FFTW_scalar *in, int *inembed,		
+				      int istride, int idist,		
+				      FFTW_scalar *out, int *onembed,		
+				      int ostride, int odist,		
 				      int sign, unsigned flags) {
    FFTW_plan p;
    auto rv = hipfftPlanMany(&p,rank,n,n,istride,idist,n,ostride,odist,HIPFFT_Z2Z,howmany);
    GRID_ASSERT(rv==HIPFFT_SUCCESS);
    return p;
-  }
+  }	  
+    
  inline static void fftw_execute_dft(const FFTW_plan p,FFTW_scalar *in,FFTW_scalar *out, int sign) {
    hipfftResult rv;
    if ( sign == forward ) rv =hipfftExecZ2Z(p,in,out,HIPFFT_FORWARD);
@@ -82,25 +83,29 @@ public:
    accelerator_barrier();
    GRID_ASSERT(rv==HIPFFT_SUCCESS);
  }
-  inline static void fftw_destroy_plan(const FFTW_plan p) { hipfftDestroy(p); }
+  inline static void fftw_destroy_plan(const FFTW_plan p) {
+    hipfftDestroy(p);
+  }
 };
 template<> struct FFTW<ComplexF> {
 public:
  static const int forward=FFTW_FORWARD;
  static const int backward=FFTW_BACKWARD;
-  typedef hipfftComplex FFTW_scalar;
-  typedef hipfftHandle  FFTW_plan;
+  typedef hipfftComplex      FFTW_scalar;
+  typedef hipfftHandle        FFTW_plan;
+
  static FFTW_plan fftw_plan_many_dft(int rank, int *n,int howmany,
-				      FFTW_scalar *in, int *inembed,
-				      int istride, int idist,
-				      FFTW_scalar *out, int *onembed,
-				      int ostride, int odist,
+				      FFTW_scalar *in, int *inembed,		
+				      int istride, int idist,		
+				      FFTW_scalar *out, int *onembed,		
+				      int ostride, int odist,		
 				      int sign, unsigned flags) {
    FFTW_plan p;
    auto rv = hipfftPlanMany(&p,rank,n,n,istride,idist,n,ostride,odist,HIPFFT_C2C,howmany);
    GRID_ASSERT(rv==HIPFFT_SUCCESS);
    return p;
-  }
+  }	  
+    
  inline static void fftw_execute_dft(const FFTW_plan p,FFTW_scalar *in,FFTW_scalar *out, int sign) {
    hipfftResult rv;
    if ( sign == forward ) rv =hipfftExecC2C(p,in,out,HIPFFT_FORWARD);
@@ -108,7 +113,9 @@ public:
    accelerator_barrier();
    GRID_ASSERT(rv==HIPFFT_SUCCESS);
  }
-  inline static void fftw_destroy_plan(const FFTW_plan p) { hipfftDestroy(p); }
+  inline static void fftw_destroy_plan(const FFTW_plan p) {
+    hipfftDestroy(p);
+  }
 };
 #endif

@@ -119,45 +126,53 @@ public:
  static const int backward=FFTW_BACKWARD;
  typedef cufftDoubleComplex FFTW_scalar;
  typedef cufftHandle        FFTW_plan;
+
  static FFTW_plan fftw_plan_many_dft(int rank, int *n,int howmany,
-				      FFTW_scalar *in, int *inembed,
-				      int istride, int idist,
-				      FFTW_scalar *out, int *onembed,
-				      int ostride, int odist,
+				      FFTW_scalar *in, int *inembed,		
+				      int istride, int idist,		
+				      FFTW_scalar *out, int *onembed,		
+				      int ostride, int odist,		
 				      int sign, unsigned flags) {
    FFTW_plan p;
    cufftPlanMany(&p,rank,n,n,istride,idist,n,ostride,odist,CUFFT_Z2Z,howmany);
    return p;
-  }
+  }	  
+    
  inline static void fftw_execute_dft(const FFTW_plan p,FFTW_scalar *in,FFTW_scalar *out, int sign) {
    if ( sign == forward ) cufftExecZ2Z(p,in,out,CUFFT_FORWARD);
    else                   cufftExecZ2Z(p,in,out,CUFFT_INVERSE);
    accelerator_barrier();
  }
-  inline static void fftw_destroy_plan(const FFTW_plan p) { cufftDestroy(p); }
+  inline static void fftw_destroy_plan(const FFTW_plan p) {
+    cufftDestroy(p);
+  }
 };
 template<> struct FFTW<ComplexF> {
 public:
  static const int forward=FFTW_FORWARD;
  static const int backward=FFTW_BACKWARD;
  typedef cufftComplex FFTW_scalar;
-  typedef cufftHandle  FFTW_plan;
+  typedef cufftHandle        FFTW_plan;
+
  static FFTW_plan fftw_plan_many_dft(int rank, int *n,int howmany,
-				      FFTW_scalar *in, int *inembed,
-				      int istride, int idist,
-				      FFTW_scalar *out, int *onembed,
-				      int ostride, int odist,
+				      FFTW_scalar *in, int *inembed,		
+				      int istride, int idist,		
+				      FFTW_scalar *out, int *onembed,		
+				      int ostride, int odist,		
 				      int sign, unsigned flags) {
    FFTW_plan p;
    cufftPlanMany(&p,rank,n,n,istride,idist,n,ostride,odist,CUFFT_C2C,howmany);
    return p;
-  }
+  }	  
+    
  inline static void fftw_execute_dft(const FFTW_plan p,FFTW_scalar *in,FFTW_scalar *out, int sign) {
    if ( sign == forward ) cufftExecC2C(p,in,out,CUFFT_FORWARD);
    else                   cufftExecC2C(p,in,out,CUFFT_INVERSE);
    accelerator_barrier();
  }
-  inline static void fftw_destroy_plan(const FFTW_plan p) { cufftDestroy(p); }
+  inline static void fftw_destroy_plan(const FFTW_plan p) {
+    cufftDestroy(p);
+  }
 };
 #endif

@@ -168,325 +183,313 @@ public:
  typedef fftw_complex FFTW_scalar;
  typedef fftw_plan    FFTW_plan;
  static FFTW_plan fftw_plan_many_dft(int rank, int *n,int howmany,
-				      FFTW_scalar *in, int *inembed,
-				      int istride, int idist,
-				      FFTW_scalar *out, int *onembed,
-				      int ostride, int odist,
+				      FFTW_scalar *in, int *inembed,		
+				      int istride, int idist,		
+				      FFTW_scalar *out, int *onembed,		
+				      int ostride, int odist,		
 				      int sign, unsigned flags) {
    return ::fftw_plan_many_dft(rank,n,howmany,in,inembed,istride,idist,out,onembed,ostride,odist,sign,flags);
-  }
+  }	  
+    
  inline static void fftw_execute_dft(const FFTW_plan p,FFTW_scalar *in,FFTW_scalar *out, int sign) {
    ::fftw_execute_dft(p,in,out);
  }
-  inline static void fftw_destroy_plan(const FFTW_plan p) { ::fftw_destroy_plan(p); }
+  inline static void fftw_destroy_plan(const FFTW_plan p) {
+    ::fftw_destroy_plan(p);
+  }
 };
 template<> struct FFTW<ComplexF> {
 public:
  typedef fftwf_complex FFTW_scalar;
  typedef fftwf_plan    FFTW_plan;
  static FFTW_plan fftw_plan_many_dft(int rank, int *n,int howmany,
-				      FFTW_scalar *in, int *inembed,
-				      int istride, int idist,
-				      FFTW_scalar *out, int *onembed,
-				      int ostride, int odist,
+				      FFTW_scalar *in, int *inembed,		
+				      int istride, int idist,		
+				      FFTW_scalar *out, int *onembed,		
+				      int ostride, int odist,		
 				      int sign, unsigned flags) {
    return ::fftwf_plan_many_dft(rank,n,howmany,in,inembed,istride,idist,out,onembed,ostride,odist,sign,flags);
-  }
+  }	  
+    
  inline static void fftw_execute_dft(const FFTW_plan p,FFTW_scalar *in,FFTW_scalar *out, int sign) {
    ::fftwf_execute_dft(p,in,out);
  }
-  inline static void fftw_destroy_plan(const FFTW_plan p) { ::fftwf_destroy_plan(p); }
+  inline static void fftw_destroy_plan(const FFTW_plan p) {
+    ::fftwf_destroy_plan(p);
+  }
 };
 #endif
 #endif

-struct FFTbase {
-  double        flops;
-  double        flops_call;
-  uint64_t      usec;
-  GridCartesian *_grid;
-
-  static const int forward  = FFTW_FORWARD;
-  static const int backward = FFTW_BACKWARD;
-
-  double Flops(void)  { return flops; }
-  double MFlops(void) { return flops / usec; }
-  double USec(void)   { return (double)usec; }
-
-  FFTbase(GridCartesian *grid) : _grid(grid), flops(0), flops_call(0), usec(0) {}
-};
-
-// Barrel-shift gather, FFT execute, and insert. Called by both FFT and PlannedFFT.
-// The caller is responsible for plan acquisition and destruction.
-template<class vobj>
-static void FFT_dim_execute(
-    Lattice<vobj>       &result,
-    const Lattice<vobj> &source,
-    int dim, int sign,
-    typename FFTW<typename vobj::scalar_type>::FFTW_plan p,
-    GridCartesian *grid,
-    double &flops, double &flops_call, uint64_t &usec)
-{
-  typedef typename vobj::scalar_type   scalar;
-  typedef typename vobj::scalar_object sobj;
-  typedef typename vobj::scalar_type   scalar_type;
-  typedef typename vobj::vector_type   vector_type;
-  typedef typename FFTW<scalar>::FFTW_scalar FFTW_scalar;
-
-  const int Ndim = grid->Nd();
-  int L = grid->_ldimensions[dim];
-  int G = grid->_fdimensions[dim];
-  int Ncomp = sizeof(sobj) / sizeof(scalar);
-  int64_t Nlow = 1, Nhigh = 1;
-  for (int d = 0; d < dim;      d++) Nlow  *= grid->_ldimensions[d];
-  for (int d = dim+1; d < Ndim; d++) Nhigh *= grid->_ldimensions[d];
-  int64_t Nperp = Nlow * Nhigh;
-
-  deviceVector<scalar> pgbuf(Nperp * Ncomp * G);
-  scalar *pgbuf_v = &pgbuf[0];
-  int howmany = Ncomp * Nperp;
-
-  scalar div;
-  if      (sign == FFTW_BACKWARD) div = 1.0 / G;
-  else if (sign == FFTW_FORWARD)  div = 1.0;
-  else GRID_ASSERT(0);
-
-  double t_pencil = 0, t_fft = 0, t_copy = 0, t_shift = 0;
-  double t_total  = -usecond();
-
-  result = source;
-  int pc = grid->_processor_coor[dim];
-
-  const Coordinate ldims      = grid->_ldimensions;
-  const Coordinate rdims      = grid->_rdimensions;
-  const Coordinate sdims      = grid->_simd_layout;
-  const Coordinate processors = grid->_processors;
-
-  Coordinate pgdims(Ndim);
-  pgdims[0] = G;
-  for (int d = 0, dd = 1; d < Ndim; d++)
-    if (d != dim) pgdims[dd++] = ldims[d];
-  int64_t pgvol = 1;
-  for (int d = 0; d < Ndim; d++) pgvol *= pgdims[d];
-
-  const int Nsimd = vobj::Nsimd();
-  t_pencil = -usecond();
-  for (int p_idx = 0; p_idx < processors[dim]; p_idx++) {
-    t_copy -= usecond();
-    autoView(r_v, result, AcceleratorRead);
-    accelerator_for(idx, grid->oSites(), vobj::Nsimd(), {
-#ifdef GRID_SIMT
-    {
-      int lane = acceleratorSIMTlane(Nsimd);
-#else
-    for (int lane = 0; lane < Nsimd; lane++) {
-#endif
-      Coordinate icoor, ocoor, pgcoor;
-      Lexicographic::CoorFromIndex(icoor, lane, sdims);
-      Lexicographic::CoorFromIndex(ocoor, idx,  rdims);
-      pgcoor[0] = ocoor[dim] + icoor[dim]*rdims[dim] + ((pc+p_idx)%processors[dim])*L;
-      for (int d = 0, dd = 1; d < Ndim; d++)
-        if (d != dim) { pgcoor[dd] = ocoor[d] + icoor[d]*rdims[d]; dd++; }
-      int64_t pgidx;
-      Lexicographic::IndexFromCoor(pgcoor, pgidx, pgdims);
-      vector_type *from = (vector_type *)&r_v[idx];
-      scalar_type  stmp;
-      for (int w = 0; w < Ncomp; w++) {
-        stmp = getlane(from[w], lane);
-        pgbuf_v[pgidx + w*pgvol] = stmp;
-      }
-#ifdef GRID_SIMT
-    }
-#else
-    }
-#endif
-    });
-    t_copy += usecond();
-    if (p_idx != processors[dim] - 1) {
-      Lattice<vobj> temp(grid);
-      t_shift -= usecond();
-      temp = Cshift(result, dim, L); result = temp;
-      t_shift += usecond();
-    }
-  }
-  t_pencil += usecond();
-
-  FFTW_scalar *in  = (FFTW_scalar *)pgbuf_v;
-  FFTW_scalar *out = (FFTW_scalar *)pgbuf_v;
-  t_fft = -usecond();
-  FFTW<scalar>::fftw_execute_dft(p, in, out, sign);
-  t_fft += usecond();
-
-  flops_call = 5.0 * howmany * G * log2(G);
-  usec  = t_fft;
-  flops = flops_call;
-
-  result = Zero();
-  double t_insert = -usecond();
-  {
-    autoView(r_v, result, AcceleratorWrite);
-    accelerator_for(idx, grid->oSites(), Nsimd, {
-#ifdef GRID_SIMT
-    {
-      int lane = acceleratorSIMTlane(Nsimd);
-#else
-    for (int lane = 0; lane < Nsimd; lane++) {
-#endif
-      Coordinate icoor(Ndim), ocoor(Ndim), pgcoor(Ndim);
-      Lexicographic::CoorFromIndex(icoor, lane, sdims);
-      Lexicographic::CoorFromIndex(ocoor, idx,  rdims);
-      pgcoor[0] = ocoor[dim] + icoor[dim]*rdims[dim] + pc*L;
-      for (int d = 0, dd = 1; d < Ndim; d++)
-        if (d != dim) { pgcoor[dd] = ocoor[d] + icoor[d]*rdims[d]; dd++; }
-      int64_t pgidx;
-      Lexicographic::IndexFromCoor(pgcoor, pgidx, pgdims);
-      vector_type *to  = (vector_type *)&r_v[idx];
-      scalar_type  stmp;
-      for (int w = 0; w < Ncomp; w++) {
-        stmp = pgbuf_v[pgidx + w*pgvol];
-        putlane(to[w], stmp, lane);
-      }
-#ifdef GRID_SIMT
-    }
-#else
-    }
-#endif
-    });
-  }
-  result = result * div;
-  t_insert += usecond();
-  t_total  += usecond();
-
-  std::cout << GridLogPerformance << " FFT took   " << t_total/1.0e6  << " s" << std::endl;
-  std::cout << GridLogPerformance << " FFT pencil " << t_pencil/1.0e6 << " s" << std::endl;
-  std::cout << GridLogPerformance << "  of which copy " << t_copy/1.0e6  << " s" << std::endl;
-  std::cout << GridLogPerformance << "  of which shift" << t_shift/1.0e6 << " s" << std::endl;
-  std::cout << GridLogPerformance << " FFT kernels " << t_fft/1.0e6    << " s" << std::endl;
-  std::cout << GridLogPerformance << " FFT insert  " << t_insert/1.0e6 << " s" << std::endl;
-}
-
-class FFT : public FFTbase {
+class FFT {
+private:
+    
+  double flops;
+  double flops_call;
+  uint64_t usec;
+    
 public:
-  FFT(GridCartesian *grid) : FFTbase(grid) {}
-  ~FFT() {}
+    
+  static const int forward=FFTW_FORWARD;
+  static const int backward=FFTW_BACKWARD;
+    
+  double Flops(void) {return flops;}
+  double MFlops(void) {return flops/usec;}
+  double USec(void)   {return (double)usec;}    

+  FFT ( GridCartesian * grid ) 
+  {
+    flops=0;
+    usec =0;
+  };
+    
+  ~FFT ( void)  {
+    //    delete sgrid;
+  }
+    
  template<class vobj>
-  void FFT_dim_mask(Lattice<vobj> &result, const Lattice<vobj> &source, Coordinate mask, int sign) {
-    const int Ndim = _grid->Nd();
+  void FFT_dim_mask(Lattice<vobj> &result,const Lattice<vobj> &source,Coordinate mask,int sign){
+
+    //    vgrid=result.Grid();
+    //    conformable(result.Grid(),vgrid);
+    //    conformable(source.Grid(),vgrid);
+    const int Ndim = source.Grid()->Nd();
    Lattice<vobj> tmp = source;
-    for (int d = 0; d < Ndim; d++) {
-      if (mask[d]) {
-        FFT_dim(result, tmp, d, sign);
-        tmp = result;
+    for(int d=0;d<Ndim;d++){
+      if( mask[d] ) {
+	FFT_dim(result,tmp,d,sign);
+	tmp=result;
      }
    }
  }

  template<class vobj>
-  void FFT_all_dim(Lattice<vobj> &result, const Lattice<vobj> &source, int sign) {
-    Coordinate mask(_grid->Nd(), 1);
-    FFT_dim_mask(result, source, mask, sign);
+  void FFT_all_dim(Lattice<vobj> &result,const Lattice<vobj> &source,int sign){
+    const int Ndim = source.Grid()->Nd();
+    Coordinate mask(Ndim,1);
+    FFT_dim_mask(result,source,mask,sign);
  }

-  template<class vobj>
-  void FFT_dim(Lattice<vobj> &result, const Lattice<vobj> &source, int dim, int sign) {
-    GRID_ASSERT(source.Grid() == _grid);
-    GRID_ASSERT(result.Grid() == _grid);
-    conformable(result.Grid(), source.Grid());

-    typedef typename vobj::scalar_type   scalar;
+  template<class vobj>
+  void FFT_dim(Lattice<vobj> &result,const Lattice<vobj> &source,int dim, int sign){
+    const int Ndim = source.Grid()->Nd();
+    GridBase *grid = source.Grid();
+    conformable(result.Grid(),source.Grid());
+
+    int L = grid->_ldimensions[dim];
+    int G = grid->_fdimensions[dim];
+      
+    Coordinate layout(Ndim,1);
+    
+    // Construct pencils
    typedef typename vobj::scalar_object sobj;
+    typedef typename vobj::scalar_type   scalar;
+    typedef typename vobj::scalar_type   scalar_type;
+    typedef typename vobj::vector_type   vector_type;
+      
+    //std::cout << "CPU view" << std::endl;
+    
    typedef typename FFTW<scalar>::FFTW_scalar FFTW_scalar;
    typedef typename FFTW<scalar>::FFTW_plan   FFTW_plan;
+      
+    int Ncomp = sizeof(sobj)/sizeof(scalar);
+    int64_t Nlow  = 1;
+    int64_t Nhigh = 1;

-    const int Ndim = _grid->Nd();
-    int G     = _grid->_fdimensions[dim];
-    int Ncomp = sizeof(sobj) / sizeof(scalar);
-    int64_t Nperp = 1;
-    for (int d = 0; d < Ndim; d++)
-      if (d != dim) Nperp *= _grid->_ldimensions[d];
-    int n[]     = {G};
-    int howmany = Ncomp * Nperp;
-
-    deviceVector<scalar> dummy(2);
-    FFTW_scalar *buf = (FFTW_scalar *)&dummy[0];
-    FFTW_plan p = FFTW<scalar>::fftw_plan_many_dft(1, n, howmany,
-                                                    buf, n, 1, G,
-                                                    buf, n, 1, G,
-                                                    sign, FFTW_ESTIMATE);
-    FFT_dim_execute(result, source, dim, sign, p, _grid, flops, flops_call, usec);
-    FFTW<scalar>::fftw_destroy_plan(p);
-  }
-};
-
-template<class vobj>
-class PlannedFFT : public FFTbase {
-private:
-  typedef typename vobj::scalar_type   scalar;
-  typedef typename vobj::scalar_object sobj;
-  typedef typename vobj::vector_type   vector_type;
-  typedef typename FFTW<scalar>::FFTW_scalar FFTW_scalar;
-  typedef typename FFTW<scalar>::FFTW_plan   FFTW_plan;
-
-  std::vector<FFTW_plan> forward_plans;
-  std::vector<FFTW_plan> backward_plans;
-
-  void PlanCreate() {
-    const int Ndim = _grid->Nd();
-    forward_plans.resize(Ndim);
-    backward_plans.resize(Ndim);
-
-    for (int d = 0; d < Ndim; d++) {
-      int G       = _grid->_fdimensions[d];
-      int Ncomp   = sizeof(sobj) / sizeof(scalar);
-      int64_t Nperp = 1;
-      for (int dd = 0; dd < Ndim; dd++)
-        if (dd != d) Nperp *= _grid->_ldimensions[dd];
-      int howmany = Ncomp * (int)Nperp;
-      int n[]     = {G};
-
-      deviceVector<scalar> dummy(2);
-      FFTW_scalar *buf = (FFTW_scalar *)&dummy[0];
-
-      forward_plans[d]  = FFTW<scalar>::fftw_plan_many_dft(1, n, howmany, buf, n, 1, G, buf, n, 1, G, FFTW_FORWARD,  FFTW_ESTIMATE);
-      backward_plans[d] = FFTW<scalar>::fftw_plan_many_dft(1, n, howmany, buf, n, 1, G, buf, n, 1, G, FFTW_BACKWARD, FFTW_ESTIMATE);
+    for(int d=0;d<dim;d++){
+      Nlow*=grid->_ldimensions[d];
    }
-  }
+    for(int d=dim+1;d<Ndim;d++){
+      Nhigh*=grid->_ldimensions[d];
+    }
+    int64_t Nperp=Nlow*Nhigh;
+    
+    deviceVector<scalar> pgbuf; // Layout is [perp][component][dim]
+    pgbuf.resize(Nperp*Ncomp*G);
+    scalar *pgbuf_v = &pgbuf[0];
+      
+    int rank = 1;  /* 1d transforms */
+    int n[] = {G}; /* 1d transforms of length G */
+    int howmany = Ncomp * Nperp;
+    int odist,idist,istride,ostride;
+    idist   = odist   = G;            /* Distance between consecutive FT's */
+    istride = ostride = 1;            /* Distance between two elements in the same FT */
+    int *inembed = n, *onembed = n;
+      
+    scalar div;
+    if ( sign == backward ) div = 1.0/G;
+    else if ( sign == forward ) div = 1.0;
+    else GRID_ASSERT(0);

-  void PlanDestroy() {
-    for (auto p : forward_plans)  FFTW<scalar>::fftw_destroy_plan(p);
-    for (auto p : backward_plans) FFTW<scalar>::fftw_destroy_plan(p);
-    forward_plans.clear();
-    backward_plans.clear();
-  }
+    double t_pencil=0;
+    double t_fft   =0;
+    double t_total =-usecond();
+    //    std::cout << GridLogPerformance<<"Making FFTW plan" << std::endl;
+    /*
+     *
+     */
+    FFTW_plan p;
+    {
+      FFTW_scalar *in = (FFTW_scalar *)&pgbuf_v[0];
+      FFTW_scalar *out= (FFTW_scalar *)&pgbuf_v[0];
+      p = FFTW<scalar>::fftw_plan_many_dft(rank,n,howmany,
+					   in,inembed,
+					   istride,idist,
+					   out,onembed,
+					   ostride, odist,
+					   sign,FFTW_ESTIMATE);
+    }
+      
+    // Barrel shift and collect global pencil
+    //    std::cout << GridLogPerformance<<"Making pencil" << std::endl;
+    Coordinate lcoor(Ndim), gcoor(Ndim);
+    double t_copy=0;
+    double t_shift=0;
+    t_pencil = -usecond();
+    result = source;
+    int pc = grid->_processor_coor[dim];

-public:
-  PlannedFFT(GridCartesian *grid) : FFTbase(grid) { PlanCreate(); }
-  ~PlannedFFT() { PlanDestroy(); }
+    const Coordinate ldims = grid->_ldimensions;
+    const Coordinate rdims = grid->_rdimensions;
+    const Coordinate sdims = grid->_simd_layout;

-  void FFT_dim_mask(Lattice<vobj> &result, const Lattice<vobj> &source, Coordinate mask, int sign) {
-    const int Ndim = _grid->Nd();
-    Lattice<vobj> tmp = source;
-    for (int d = 0; d < Ndim; d++) {
-      if (mask[d]) {
-        FFT_dim(result, tmp, d, sign);
-        tmp = result;
+    Coordinate processors = grid->_processors;
+    Coordinate pgdims(Ndim);
+    pgdims[0] = G;
+    for(int d=0, dd=1;d<Ndim;d++){
+      if ( d!=dim ) pgdims[dd++] = ldims[d];
+    }
+    int64_t pgvol=1;
+    for(int d=0;d<Ndim;d++) pgvol*=pgdims[d];
+    
+    const int Nsimd = vobj::Nsimd();
+    for(int p=0;p<processors[dim];p++) {
+      t_copy-=usecond();
+      autoView(r_v,result,AcceleratorRead);
+      accelerator_for(idx, grid->oSites(), vobj::Nsimd(), {
+#ifdef GRID_SIMT
+      {
+	int lane=acceleratorSIMTlane(Nsimd); // buffer lane
+#else
+      for(int lane=0;lane<Nsimd;lane++) {
+#endif
+	Coordinate icoor;
+	Coordinate ocoor;
+	Coordinate pgcoor;
+
+	Lexicographic::CoorFromIndex(icoor,lane,sdims);
+	Lexicographic::CoorFromIndex(ocoor,idx,rdims);
+
+	pgcoor[0] = ocoor[dim] + icoor[dim]*rdims[dim] + ((pc+p)%processors[dim])*L;
+	for(int d=0,dd=1;d<Ndim;d++){
+	  if ( d!=dim ) {
+	    pgcoor[dd] = ocoor[d] + icoor[d]*rdims[d];
+	    dd++;
+	  }
+	}
+
+	// Map coordinates in lattice layout to FFTW index
+	int64_t pgidx;
+	Lexicographic::IndexFromCoor(pgcoor,pgidx,pgdims);
+
+	vector_type *from = (vector_type *)&r_v[idx];
+	scalar_type stmp;
+	for(int w=0;w<Ncomp;w++){
+	  int64_t pg_idx = pgidx + w*pgvol;
+	  stmp = getlane(from[w], lane);
+	  pgbuf_v[pg_idx] = stmp;
+	}
+#ifdef GRID_SIMT
+      }
+#else
+      }
+#endif
+      });
+
+      t_copy+=usecond();
+      if (p != processors[dim] - 1) {
+	Lattice<vobj> temp(grid);
+	t_shift-=usecond();
+	temp = Cshift(result,dim,L); result = temp;
+	t_shift+=usecond();
      }
    }
-  }
+    t_pencil += usecond();
+      
+    FFTW_scalar *in = (FFTW_scalar *)pgbuf_v;
+    FFTW_scalar *out= (FFTW_scalar *)pgbuf_v;
+    t_fft = -usecond();
+    FFTW<scalar>::fftw_execute_dft(p,in,out,sign);
+    t_fft += usecond();
+    
+    // performance counting
+    flops_call = 5.0*howmany*G*log2(G);
+    usec = t_fft;
+    flops= flops_call;

-  void FFT_all_dim(Lattice<vobj> &result, const Lattice<vobj> &source, int sign) {
-    Coordinate mask(_grid->Nd(), 1);
-    FFT_dim_mask(result, source, mask, sign);
-  }
+    result = Zero();
+    
+    double t_insert = -usecond();
+    {
+      autoView(r_v,result,AcceleratorWrite);
+      accelerator_for(idx,grid->oSites(),Nsimd,{
+#ifdef GRID_SIMT
+      {
+	int lane=acceleratorSIMTlane(Nsimd); // buffer lane
+#else
+      for(int lane=0;lane<Nsimd;lane++) {
+#endif
+	Coordinate icoor(Ndim);
+	Coordinate ocoor(Ndim);
+	Coordinate pgcoor(Ndim);

-  void FFT_dim(Lattice<vobj> &result, const Lattice<vobj> &source, int dim, int sign) {
-    GRID_ASSERT(source.Grid() == _grid);
-    GRID_ASSERT(result.Grid() == _grid);
-    GRID_ASSERT((int)forward_plans.size() == _grid->Nd());
-    conformable(result.Grid(), source.Grid());
-    FFTW_plan p = (sign == forward ? forward_plans : backward_plans)[dim];
-    FFT_dim_execute(result, source, dim, sign, p, _grid, flops, flops_call, usec);
+	Lexicographic::CoorFromIndex(icoor,lane,sdims);
+	Lexicographic::CoorFromIndex(ocoor,idx,rdims);
+
+	pgcoor[0] = ocoor[dim] + icoor[dim]*rdims[dim] + pc*L;
+	for(int d=0,dd=1;d<Ndim;d++){
+	  if ( d!=dim ) {
+	    pgcoor[dd] = ocoor[d] + icoor[d]*rdims[d];
+	    dd++;
+	  }
+	}
+	// Map coordinates in lattice layout to FFTW index
+	int64_t pgidx;
+	Lexicographic::IndexFromCoor(pgcoor,pgidx,pgdims);
+
+	vector_type *to = (vector_type *)&r_v[idx];
+	scalar_type stmp;
+	for(int w=0;w<Ncomp;w++){
+	  int64_t pg_idx = pgidx + w*pgvol;
+	  stmp = pgbuf_v[pg_idx];
+	  putlane(to[w], stmp, lane);
+	}
+	
+#ifdef GRID_SIMT
+      }
+#else
+      }
+#endif
+      });
+    }
+
+    result = result*div;
+
+    t_insert +=usecond();
+    
+    // destroying plan
+    FFTW<scalar>::fftw_destroy_plan(p);
+
+    t_total +=usecond();
+
+    std::cout <<GridLogPerformance<< " FFT took   "<<t_total/1.0e6 <<" s" << std::endl;
+    std::cout <<GridLogPerformance<< " FFT pencil "<<t_pencil/1.0e6 <<" s" << std::endl;
+    std::cout <<GridLogPerformance<< "  of which copy "<<t_copy/1.0e6 <<" s" << std::endl;
+    std::cout <<GridLogPerformance<< "  of which shift"<<t_shift/1.0e6 <<" s" << std::endl;
+    std::cout <<GridLogPerformance<< " FFT kernels "<<t_fft/1.0e6 <<" s" << std::endl;
+    std::cout <<GridLogPerformance<< " FFT insert  "<<t_insert/1.0e6 <<" s" << std::endl;
+    
  }
 };

@@ -1,213 +0,0 @@
-/*************************************************************************************
-
-    Grid physics library, www.github.com/paboyle/Grid
-
-    Source file: Grid/algorithms/blas/A2ASpatialSum.h
-
-    Copyright (C) 2025
-
-Author: Peter Boyle <pboyle@bnl.gov>
-
-    This program is free software; you can redistribute it and/or modify
-    it under the terms of the GNU General Public License as published by
-    the Free Software Foundation; either version 2 of the License, or
-    (at your option) any later version.
-
-    This program is distributed in the hope that it will be useful,
-    but WITHOUT ANY WARRANTY; without even the implied warranty of
-    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
-    GNU General Public License for more details.
-
-    You should have received a copy of the GNU General Public License along
-    with this program; if not, write to the Free Software Foundation, Inc.,
-    51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
-
-    See the full license in the file "LICENSE" in the top level distribution directory
-*************************************************************************************/
-/*  END LEGAL */
-#pragma once
-
-NAMESPACE_BEGIN(Grid);
-
-/*
-  A2ASpatialSum
-
-  Replaces the scalar spatial accumulation loop in A2A extended meson field
-  contractions with a batched GEMM over local time slices, enabling GPU offload.
-
-  Given:
-    leftv[N_i][osite]    — conjugated left SpinColourVectors (SIMD-packed)
-    loopRight[N_j][osite]— type-contracted right SpinColourVectors (SIMD-packed)
-
-  Computes:
-    EMF[i,j,t] = sum_{x,s,c} leftv[i][x,t,s,c] * loopRight[j][x,t,s,c]
-
-  via batched GEMM over nt local time slices, then GlobalSumVector across MPI.
-
-  Memory layout (all C row-major):
-    W_buf  [nt][N_i][nxyz*Nsc]  — W[t][i][x*Nsc+sc]  = leftv[i] at (x,t)
-    LR_buf [nt][N_j][nxyz*Nsc]  — LR[t][j][x*Nsc+sc] = loopRight[j] at (x,t)
-    EMF_buf[nt][N_j][N_i]       — column-major result; EMF[i,j,t] = EMF_buf[t][j][i]
-
-  BLAS call (column-major, OP_T on A so A is read as W[i][k]):
-    C = A^T * B  where A=W[N_i×K C-row], B=LR[N_j×K C-row], C=[N_j×N_i C-row]
-    → C[i,j] = sum_k W[i][k] * LR[j][k] = EMF[i,j] ✓
-*/
-template<class vobj>
-class A2ASpatialSum
-{
-public:
-  typedef typename vobj::scalar_type   scalar;
-  typedef typename vobj::scalar_object sobj;
-
-  GridBase *grid;
-  int N_i, N_j;
-  int nt, nxyz, Nsc;
-
-  deviceVector<scalar>   W_buf;
-  deviceVector<scalar>   LR_buf;
-  deviceVector<scalar>   EMF_buf;
-  deviceVector<scalar *> W_ptrs;
-  deviceVector<scalar *> LR_ptrs;
-  deviceVector<scalar *> EMF_ptrs;
-
-  A2ASpatialSum() : grid(nullptr), N_i(0), N_j(0), nt(0), nxyz(0), Nsc(0) {}
-
-  void Allocate(int _N_i, int _N_j, GridBase *_grid)
-  {
-    grid = _grid;
-    N_i  = _N_i;
-    N_j  = _N_j;
-    Coordinate ldims = grid->LocalDimensions();
-    nt   = ldims[grid->Nd() - 1];
-    nxyz = grid->lSites() / nt;
-    Nsc  = sizeof(sobj) / sizeof(scalar);
-
-    W_buf.resize(nt * N_i * nxyz * Nsc);
-    LR_buf.resize(nt * N_j * nxyz * Nsc);
-    EMF_buf.resize(nt * N_j * N_i);
-
-    // Build persistent batch pointer arrays
-    W_ptrs.resize(nt);
-    LR_ptrs.resize(nt);
-    EMF_ptrs.resize(nt);
-    scalar *Wh   = &W_buf[0];
-    scalar *LRh  = &LR_buf[0];
-    scalar *EMFh = &EMF_buf[0];
-    int lN_i = N_i, lN_j = N_j, lnxyz = nxyz, lNsc = Nsc;
-    for (int t = 0; t < nt; t++) {
-      acceleratorPut(W_ptrs[t],   Wh   + t * lN_i * lnxyz * lNsc);
-      acceleratorPut(LR_ptrs[t],  LRh  + t * lN_j * lnxyz * lNsc);
-      acceleratorPut(EMF_ptrs[t], EMFh + t * lN_j * lN_i);
-    }
-  }
-
-  void PackLeft(const std::vector<Lattice<vobj>> &leftv)
-  {
-    GRID_ASSERT((int)leftv.size() == N_i);
-    PackVectors(leftv, &W_buf[0], N_i);
-  }
-
-  void PackRight(const std::vector<Lattice<vobj>> &loopRight)
-  {
-    GRID_ASSERT((int)loopRight.size() == N_j);
-    PackVectors(loopRight, &LR_buf[0], N_j);
-  }
-
-private:
-  // Pack vecs[N] lattice fields into buf[nt][N][nxyz*Nsc], extracting all SIMD lanes.
-  void PackVectors(const std::vector<Lattice<vobj>> &vecs, scalar *buf, int N)
-  {
-    int nd     = grid->_ndimension;
-    int osites = grid->oSites();
-    int Nsimd  = vobj::Nsimd();
-    int lN     = N;
-    int lNsc   = Nsc;
-    int lnxyz  = nxyz;
-    Coordinate rdimensions = grid->_rdimensions;
-    Coordinate ldims       = grid->LocalDimensions();
-    Coordinate simd        = grid->_simd_layout;
-
-    for (int n = 0; n < N; n++) {
-      autoView(src_v, vecs[n], AcceleratorRead);
-      accelerator_for(sf, osites, Nsimd, {
-#ifdef GRID_SIMT
-        {
-          int lane = acceleratorSIMTlane(Nsimd);
-#else
-          for (int lane = 0; lane < Nsimd; lane++) {
-#endif
-          Coordinate icoor(nd), ocoor(nd), lcoor(nd);
-          Lexicographic::CoorFromIndex(icoor, lane, simd);
-          Lexicographic::CoorFromIndex(ocoor, sf, rdimensions);
-          for (int d = 0; d < nd; d++)
-            lcoor[d] = rdimensions[d] * icoor[d] + ocoor[d];
-
-          int     l_t = lcoor[nd - 1];
-          Coordinate xyz_coor = lcoor;
-          xyz_coor[nd - 1] = 0;
-          int64_t l_xyz;
-          Lexicographic::IndexFromCoor(xyz_coor, l_xyz, ldims);
-
-          sobj    data   = extractLane(lane, src_v[sf]);
-          scalar *data_s = (scalar *)&data;
-
-          int64_t base = (int64_t)l_t * lN * lnxyz * lNsc
-                       + (int64_t)n   * lnxyz * lNsc
-                       + l_xyz * lNsc;
-          for (int sc = 0; sc < lNsc; sc++)
-            buf[base + sc] = data_s[sc];
-        }
-      });
-    }
-  }
-
-public:
-
-  // Batched GEMM + MPI reduction → result[nt_global][N_i][N_j]
-  //
-  // BLAS (column-major, OP_T on A):
-  //   C[N_j×N_i] = A^T[N_i×K] * B[N_j×K]    with K=nxyz*Nsc
-  //   reading A as C row-major [N_i][K] and B as C row-major [N_j][K]
-  //   → C[i,j] = sum_k W[i,k] * LR[j,k] = EMF[i,j] ✓
-  void Sum(Eigen::Tensor<ComplexD, 3> &result)
-  {
-    GridBLAS BLAS;
-
-    int K = nxyz * Nsc;
-    BLAS.gemmBatched(GridBLAS_OP_T, GridBLAS_OP_N,
-                     N_i, N_j, K,
-                     scalar(1.0),
-                     W_ptrs,
-                     LR_ptrs,
-                     scalar(0.0),
-                     EMF_ptrs);
-    BLAS.synchronise();
-
-    // Copy from device and distribute into global-t layout
-    int nt_global = result.dimension(0);
-    int nd        = grid->Nd();
-    int lt_start  = grid->LocalStarts()[nd - 1];
-
-    std::vector<scalar> host_emf(nt * N_j * N_i);
-    acceleratorCopyFromDevice(&EMF_buf[0], host_emf.data(),
-                              nt * N_j * N_i * sizeof(scalar));
-
-    // EMF_buf[t][j*N_i + i] = EMF[i,j] for local t
-    std::vector<scalar> global_emf(nt_global * N_i * N_j, scalar(0.0));
-    for (int lt = 0; lt < nt; lt++) {
-      int gt = lt + lt_start;
-      for (int i = 0; i < N_i; i++)
-      for (int j = 0; j < N_j; j++)
-        global_emf[gt * N_i * N_j + i * N_j + j] = host_emf[lt * N_j * N_i + j * N_i + i];
-    }
-    grid->GlobalSumVector(global_emf.data(), nt_global * N_i * N_j);
-
-    for (int gt = 0; gt < nt_global; gt++)
-    for (int i  = 0; i  < N_i;      i++)
-    for (int j  = 0; j  < N_j;      j++)
-      result(gt, i, j) = global_emf[gt * N_i * N_j + i * N_j + j];
-  }
-};
-
-NAMESPACE_END(Grid);
@@ -28,7 +28,6 @@ Author: Peter Boyle <pboyle@bnl.gov>
 #pragma once

 #ifdef GRID_HIP
-#include <hip/hip_version.h>
 #include <hipblas/hipblas.h>
 #endif
 #ifdef GRID_CUDA
@@ -256,30 +255,17 @@ public:
    if ( OpB == GridBLAS_OP_N ) hOpB = HIPBLAS_OP_N;
    if ( OpB == GridBLAS_OP_T ) hOpB = HIPBLAS_OP_T;
    if ( OpB == GridBLAS_OP_C ) hOpB = HIPBLAS_OP_C;
-#if defined(HIP_VERSION_MAJOR) && (HIP_VERSION_MAJOR >=7)
    auto err = hipblasZgemmBatched(gridblasHandle,
 				   hOpA,
 				   hOpB,
 				   m,n,k,
-				   (hipDoubleComplex *) &alpha_p[0],
-				   (hipDoubleComplex **)&Amk[0], lda,
-				   (hipDoubleComplex **)&Bkn[0], ldb,
-				   (hipDoubleComplex *) &beta_p[0],
-				   (hipDoubleComplex **)&Cmn[0], ldc,
+				   (hipblasDoubleComplex *) &alpha_p[0],
+				   (hipblasDoubleComplex **)&Amk[0], lda,
+				   (hipblasDoubleComplex **)&Bkn[0], ldb,
+				   (hipblasDoubleComplex *) &beta_p[0],
+				   (hipblasDoubleComplex **)&Cmn[0], ldc,
 				   batchCount);
-#else
-    auto err = hipblasZgemmBatched(gridblasHandle,
-                                   hOpA,
-                                   hOpB,
-                                   m,n,k,
-                                   (hipblasDoubleComplex *) &alpha_p[0],
-                                   (hipblasDoubleComplex **)&Amk[0], lda,
-                                   (hipblasDoubleComplex **)&Bkn[0], ldb,
-                                   (hipblasDoubleComplex *) &beta_p[0],
-                                   (hipblasDoubleComplex **)&Cmn[0], ldc,
-                                   batchCount);
-#endif
-    //    std::cout << " hipblas return code " <<(int)err<<" "<<__LINE__<<std::endl;
+    //	 std::cout << " hipblas return code " <<(int)err<<std::endl;
    GRID_ASSERT(err==HIPBLAS_STATUS_SUCCESS);
 #endif
 #ifdef GRID_CUDA
@@ -517,31 +503,17 @@ public:
    if ( OpB == GridBLAS_OP_N ) hOpB = HIPBLAS_OP_N;
    if ( OpB == GridBLAS_OP_T ) hOpB = HIPBLAS_OP_T;
    if ( OpB == GridBLAS_OP_C ) hOpB = HIPBLAS_OP_C;
-#if defined(HIP_VERSION_MAJOR) && (HIP_VERSION_MAJOR >=7)
    auto err = hipblasCgemmBatched(gridblasHandle,
 				   hOpA,
 				   hOpB,
 				   m,n,k,
-				   (hipComplex *) &alpha_p[0],
-				   (hipComplex **)&Amk[0], lda,
-				   (hipComplex **)&Bkn[0], ldb,
-				   (hipComplex *) &beta_p[0],
-				   (hipComplex **)&Cmn[0], ldc,
+				   (hipblasComplex *) &alpha_p[0],
+				   (hipblasComplex **)&Amk[0], lda,
+				   (hipblasComplex **)&Bkn[0], ldb,
+				   (hipblasComplex *) &beta_p[0],
+				   (hipblasComplex **)&Cmn[0], ldc,
 				   batchCount);
-#else
-    auto err = hipblasCgemmBatched(gridblasHandle,
-                                   hOpA,
-                                   hOpB,
-                                   m,n,k,
-                                   (hipblasComplex *) &alpha_p[0],
-                                   (hipblasComplex **)&Amk[0], lda,
-                                   (hipblasComplex **)&Bkn[0], ldb,
-                                   (hipblasComplex *) &beta_p[0],
-                                   (hipblasComplex **)&Cmn[0], ldc,
-                                   batchCount);

-#endif
-    //    std::cout << " hipblas return code " <<(int)err<<" "<<__LINE__<<std::endl;
    GRID_ASSERT(err==HIPBLAS_STATUS_SUCCESS);
 #endif
 #ifdef GRID_CUDA
@@ -578,7 +550,6 @@ public:
 				(void **)&Cmn[0], CUDA_C_32F, ldc,
 				batchCount, compute_precision, CUBLAS_GEMM_DEFAULT);
    }
-    //    std::cout << " hipblas return code " <<(int)err<<" "<<__LINE__<<std::endl;
    GRID_ASSERT(err==CUBLAS_STATUS_SUCCESS);
 #endif
 #ifdef GRID_SYCL
@@ -749,7 +720,6 @@ public:
 				   (float *) &beta_p[0],
 				   (float **)&Cmn[0], ldc,
 				   batchCount);
-    //    std::cout << " hipblas return code " <<(int)err<<" "<<__LINE__<<std::endl;
    GRID_ASSERT(err==HIPBLAS_STATUS_SUCCESS);
 #endif
 #ifdef GRID_CUDA
@@ -910,7 +880,6 @@ public:
 				   (double *) &beta_p[0],
 				   (double **)&Cmn[0], ldc,
 				   batchCount);
-    //    std::cout << " hipblas return code " <<(int)err<<" "<<__LINE__<<std::endl;
    GRID_ASSERT(err==HIPBLAS_STATUS_SUCCESS);
 #endif
 #ifdef GRID_CUDA
@@ -1125,20 +1094,11 @@ public:
    GRID_ASSERT(info.size()==batchCount);

 #ifdef GRID_HIP
-#if defined(HIP_VERSION_MAJOR) && (HIP_VERSION_MAJOR >=7)
    auto err = hipblasZgetrfBatched(gridblasHandle,(int)n,
-				    (hipDoubleComplex **)&Ann[0], (int)n,
+				    (hipblasDoubleComplex **)&Ann[0], (int)n,
 				    (int*) &ipiv[0],
 				    (int*) &info[0],
 				    (int)batchCount);
-#else
-    auto err = hipblasZgetrfBatched(gridblasHandle,(int)n,
-                                    (hipblasDoubleComplex **)&Ann[0], (int)n,
-                                    (int*) &ipiv[0],
-                                    (int*) &info[0],
-                                    (int)batchCount);
-#endif
-    //    std::cout << " hipblas return code " <<(int)err<<" "<<__LINE__<<std::endl;
    GRID_ASSERT(err==HIPBLAS_STATUS_SUCCESS);
 #endif
 #ifdef GRID_CUDA
@@ -1164,21 +1124,11 @@ public:
    GRID_ASSERT(info.size()==batchCount);

 #ifdef GRID_HIP
-#if defined(HIP_VERSION_MAJOR) && (HIP_VERSION_MAJOR >=7)
    auto err = hipblasCgetrfBatched(gridblasHandle,(int)n,
-				    (hipComplex **)&Ann[0], (int)n,
+				    (hipblasComplex **)&Ann[0], (int)n,
 				    (int*) &ipiv[0],
 				    (int*) &info[0],
 				    (int)batchCount);
-#else
-    auto err = hipblasCgetrfBatched(gridblasHandle,(int)n,
-                                    (hipblasComplex **)&Ann[0], (int)n,
-                                    (int*) &ipiv[0],
-                                    (int*) &info[0],
-                                    (int)batchCount);
-#endif
-
-    //    std::cout << " hipblas return code " <<(int)err<<" "<<__LINE__<<std::endl;
    GRID_ASSERT(err==HIPBLAS_STATUS_SUCCESS);
 #endif
 #ifdef GRID_CUDA
@@ -1251,23 +1201,12 @@ public:
    GRID_ASSERT(Cnn.size()==batchCount);

 #ifdef GRID_HIP
-#if defined(HIP_VERSION_MAJOR) && (HIP_VERSION_MAJOR >=7)
    auto err = hipblasZgetriBatched(gridblasHandle,(int)n,
-				    (hipDoubleComplex **)&Ann[0], (int)n,
+				    (hipblasDoubleComplex **)&Ann[0], (int)n,
 				    (int*) &ipiv[0],
-				    (hipDoubleComplex **)&Cnn[0], (int)n,
+				    (hipblasDoubleComplex **)&Cnn[0], (int)n,
 				    (int*) &info[0],
 				    (int)batchCount);
-#else
-    auto err = hipblasZgetriBatched(gridblasHandle,(int)n,
-                                    (hipblasDoubleComplex **)&Ann[0], (int)n,
-                                    (int*) &ipiv[0],
-                                    (hipblasDoubleComplex **)&Cnn[0], (int)n,
-                                    (int*) &info[0],
-                                    (int)batchCount);
-
-#endif
-    //    std::cout << " hipblas return code " <<(int)err<<" "<<__LINE__<<std::endl;
    GRID_ASSERT(err==HIPBLAS_STATUS_SUCCESS);
 #endif
 #ifdef GRID_CUDA
@@ -1296,22 +1235,12 @@ public:
    GRID_ASSERT(Cnn.size()==batchCount);

 #ifdef GRID_HIP
-#if defined(HIP_VERSION_MAJOR) && (HIP_VERSION_MAJOR >=7)
    auto err = hipblasCgetriBatched(gridblasHandle,(int)n,
-				    (hipComplex **)&Ann[0], (int)n,
+				    (hipblasComplex **)&Ann[0], (int)n,
 				    (int*) &ipiv[0],
-				    (hipComplex **)&Cnn[0], (int)n,
+				    (hipblasComplex **)&Cnn[0], (int)n,
 				    (int*) &info[0],
 				    (int)batchCount);
-#else
-    auto err = hipblasCgetriBatched(gridblasHandle,(int)n,
-                                    (hipblasComplex **)&Ann[0], (int)n,
-                                    (int*) &ipiv[0],
-                                    (hipblasComplex **)&Cnn[0], (int)n,
-                                    (int*) &info[0],
-                                    (int)batchCount);
-#endif
-    //    std::cout << " hipblas return code " <<(int)err<<" "<<__LINE__<<std::endl;
    GRID_ASSERT(err==HIPBLAS_STATUS_SUCCESS);
 #endif
 #ifdef GRID_CUDA
@@ -92,8 +92,8 @@ class TwoLevelCGmrhs
  // Vector case
  virtual void operator() (std::vector<Field> &src, std::vector<Field> &x)
  {
-    SolveSingleSystem(src,x);
-	// SolvePrecBlockCG(src,x);
+    //    SolveSingleSystem(src,x);
+    SolvePrecBlockCG(src,x);
  }

 ////////////////////////////////////////////////////////////////////////////////////////////////////
@@ -97,7 +97,7 @@ public:

    RealD scale;

-    ConjugateGradient<FineField> CG(1.0e-4,2000,false);
+    ConjugateGradient<FineField> CG(1.0e-3,400,false);
    FineField noise(FineGrid);
    FineField Mn(FineGrid);

@@ -131,10 +131,7 @@ public:
    RealD scale;

    TrivialPrecon<FineField> simple_fine;
-    //    PrecGeneralisedConjugateResidualNonHermitian<FineField> GCR(0.001,10,DiracOp,simple_fine,30,30);
-    //    PrecGeneralisedConjugateResidualNonHermitian<FineField> GCR(0.001,10,DiracOp,simple_fine,12,12);
-    //    PrecGeneralisedConjugateResidualNonHermitian<FineField> GCR(0.001,30,DiracOp,simple_fine,12,12);
-    PrecGeneralisedConjugateResidualNonHermitian<FineField> GCR(0.001,30,DiracOp,simple_fine,10,10);
+    PrecGeneralisedConjugateResidualNonHermitian<FineField> GCR(0.001,30,DiracOp,simple_fine,12,12);
    FineField noise(FineGrid);
    FineField src(FineGrid);
    FineField guess(FineGrid);
@@ -149,16 +146,16 @@ public:
      
      DiracOp.Op(noise,Mn); std::cout<<GridLogMessage << "noise   ["<<b<<"] <n|Op|n> "<<innerProduct(noise,Mn)<<std::endl;

-      for(int i=0;i<3;i++){
+      for(int i=0;i<2;i++){
 	//  void operator() (const Field &src, Field &psi){
 #if 1
-	if (i==0)std::cout << GridLogMessage << " inverting on noise "<<std::endl;
+	std::cout << GridLogMessage << " inverting on noise "<<std::endl;
 	src = noise;
 	guess=Zero();
 	GCR(src,guess);
 	subspace[b] = guess;
 #else
-	if (i==0)std::cout << GridLogMessage << " inverting on zero "<<std::endl;
+	std::cout << GridLogMessage << " inverting on zero "<<std::endl;
 	src=Zero();
 	guess = noise;
 	GCR(src,guess);
@@ -170,7 +167,7 @@ public:

      }

-      DiracOp.Op(noise,Mn); std::cout<<GridLogMessage << "filtered["<<b<<"] <f|Op|f> "<<innerProduct(noise,Mn)<<" <f|OpDagOp|f>"<<norm2(Mn)<<std::endl;
+      DiracOp.Op(noise,Mn); std::cout<<GridLogMessage << "filtered["<<b<<"] <f|Op|f> "<<innerProduct(noise,Mn)<<std::endl;
      subspace[b]   = noise;

    }
@@ -27,8 +27,6 @@ Author: Peter Boyle <paboyle@ph.ed.ac.uk>
 /*  END LEGAL */
 #include <Grid/GridCore.h>

-void GridAbort(void) { abort(); }
-
 NAMESPACE_BEGIN(Grid);

 ///////////////////////////////////////////////////////////////////////////////////////////////////
@@ -36,6 +34,7 @@ NAMESPACE_BEGIN(Grid);
 ///////////////////////////////////////////////////////////////////////////////////////////////////
 Grid_MPI_Comm       CartesianCommunicator::communicator_world;

+void GridAbort(void) { abort(); }

 void CartesianCommunicator::Init(int *argc, char *** arv)
 {
@@ -124,7 +124,7 @@ template<class vobj> void Cshift_simple(Lattice<vobj>& ret,const Lattice<vobj> &
    void *hsend_buf = (void *)&hrhs[0];
    void *hrecv_buf = (void *)&hret[0];

-    acceleratorCopyFromDevice(send_buf,hsend_buf,bytes);
+    acceleratorCopyFromDevice(&send_buf[0],&hsend_buf[0],bytes);

    grid->SendToRecvFrom(hsend_buf,
 			 xmit_to_rank,
@@ -132,7 +132,7 @@ template<class vobj> void Cshift_simple(Lattice<vobj>& ret,const Lattice<vobj> &
 			 recv_from_rank,
 			 bytes);

-    acceleratorCopyToDevice(hrecv_buf,recv_buf,bytes);
+    acceleratorCopyToDevice(&hrecv_buf[0],&recv_buf[0],bytes);
 #endif
  }
 }
@@ -197,15 +197,12 @@ __global__ void reduceKernel(const vobj *lat, sobj *buffer, Iterator n) {
 /////////////////////////////////////////////////////////////////////////////////////////////////////////
 // Possibly promote to double and sum
 /////////////////////////////////////////////////////////////////////////////////////////////////////////
-
-#undef GRID_REDUCTION_TIMING
-
 template <class vobj>
-inline typename vobj::scalar_objectD sumD_gpu_small(const vobj *lat, Integer osites)
+inline typename vobj::scalar_objectD sumD_gpu_small(const vobj *lat, Integer osites) 
 {
  typedef typename vobj::scalar_objectD sobj;
  typedef decltype(lat) Iterator;
-
+  
  Integer nsimd= vobj::Nsimd();
  Integer size = osites*nsimd;

@@ -214,188 +211,41 @@ inline typename vobj::scalar_objectD sumD_gpu_small(const vobj *lat, Integer osi
  GRID_ASSERT(ok);

  Integer smemSize = numThreads * sizeof(sobj);
+  // Move out of UVM
+  // Turns out I had messed up the synchronise after move to compute stream
+  // as running this on the default stream fools the synchronise
  deviceVector<sobj> buffer(numBlocks);
  sobj *buffer_v = &buffer[0];
  sobj result;
-
-#ifdef GRID_REDUCTION_TIMING
-  RealD t_kernel = -usecond();
-#endif
  reduceKernel<<< numBlocks, numThreads, smemSize, computeStream >>>(lat, buffer_v, size);
  accelerator_barrier();
-#ifdef GRID_REDUCTION_TIMING
-  t_kernel += usecond();
-  RealD t_d2h = -usecond();
-#endif
  acceleratorCopyFromDevice(buffer_v,&result,sizeof(result));
-#ifdef GRID_REDUCTION_TIMING
-  t_d2h += usecond();
-  std::cout << GridLogDebug << "  sumD_gpu_small"
-            << " sizeof(sobj)=" << sizeof(sobj)
-            << " blocks=" << numBlocks << " threads=" << numThreads
-            << " kernel+barrier=" << t_kernel << " us"
-            << " D2H=" << t_d2h << " us" << std::endl;
-#endif
  return result;
 }

-// Fused pack+reduce: reads R words of each vobj at word offset 'base',
-// accumulates directly into iVector<iScalar<scalarD>,R> without staging
-// through an intermediate bundle buffer.  One HBM pass instead of three.
-template <int R, class vobj, class sobj, class Iterator>
-__device__ void packReduceBlocks(
-    const iScalar<typename vobj::vector_type> *idat,
-    sobj *g_odata, Iterator osites, int base, int words)
-{
-  constexpr Iterator nsimd = vobj::Nsimd();
-  Iterator blockSize = blockDim.x;
-
-  extern __shared__ __align__(COALESCE_GRANULARITY) unsigned char shmem_pointer[];
-  sobj *sdata = (sobj *)shmem_pointer;
-
-  Iterator tid      = threadIdx.x;
-  Iterator i        = blockIdx.x * (blockSize * 2) + threadIdx.x;
-  Iterator gridSize = blockSize * 2 * gridDim.x;
-  sobj mySum = Zero();
-
-  while (i < osites * nsimd) {
-    Iterator lane = i % nsimd;
-    Iterator ss   = i / nsimd;
-    sobj tmpD; zeroit(tmpD);
-    for (int k = 0; k < R; k++) {
-      auto w = extractLane(lane, idat[ss * words + base + k]);
-      iScalar<typename vobj::scalar_typeD> wd; wd = w;
-      tmpD._internal[k] = wd;
-    }
-    mySum += tmpD;
-
-    if (i + blockSize < osites * nsimd) {
-      lane = (i + blockSize) % nsimd;
-      ss   = (i + blockSize) / nsimd;
-      sobj tmpD2; zeroit(tmpD2);
-      for (int k = 0; k < R; k++) {
-        auto w = extractLane(lane, idat[ss * words + base + k]);
-        iScalar<typename vobj::scalar_typeD> wd; wd = w;
-        tmpD2._internal[k] = wd;
-      }
-      mySum += tmpD2;
-    }
-    i += gridSize;
-  }
-
-  reduceBlock(sdata, mySum, tid);
-  if (tid == 0) g_odata[blockIdx.x] = sdata[0];
-}
-
-template <int R, class vobj, class sobj, class Iterator>
-__global__ void packReduceKernel(
-    const iScalar<typename vobj::vector_type> *idat,
-    sobj *buffer, Iterator osites, int base, int words)
-{
-  Iterator blockSize = blockDim.x;
-
-  packReduceBlocks<R, vobj, sobj>(idat, buffer, osites, base, words);
-
-  if (gridDim.x > 1) {
-    const Iterator tid = threadIdx.x;
-    __shared__ bool amLast;
-    extern __shared__ __align__(COALESCE_GRANULARITY) unsigned char shmem_pointer[];
-    sobj *smem = (sobj *)shmem_pointer;
-
-    acceleratorFence();
-
-    if (tid == 0) {
-      unsigned int ticket = atomicInc(&retirementCount, gridDim.x);
-      amLast = (ticket == gridDim.x - 1);
-    }
-    acceleratorSynchroniseAll();
-
-    if (amLast) {
-      Iterator i = tid;
-      sobj mySum = Zero();
-      while (i < (Iterator)gridDim.x) {
-        mySum += buffer[i];
-        i += blockSize;
-      }
-      reduceBlock(smem, mySum, tid);
-      if (tid == 0) {
-        buffer[0] = smem[0];
-        retirementCount = 0;
-      }
-    }
-  }
-}
-
-template<int R, class vobj>
-inline void sumD_gpu_reduce_words(const vobj *lat, Integer osites,
-                                   typename vobj::scalar_typeD *ret_p, int base)
-{
-  typedef typename vobj::vector_type  vector;
-  typedef typename vobj::scalar_typeD scalarD;
-  using BundleScalarD = iVector<iScalar<scalarD>, R>;
-
-  constexpr int Nsimd = vobj::Nsimd();
-  const int words = sizeof(vobj) / sizeof(vector);
-  const iScalar<vector> *idat = (const iScalar<vector> *)lat;
-
-  Integer size = (Integer)osites * Nsimd;
-  Integer numThreads, numBlocks;
-  int ok = getNumBlocksAndThreads(size, sizeof(BundleScalarD), numThreads, numBlocks);
-  GRID_ASSERT(ok);
-
-  Integer smemSize = numThreads * sizeof(BundleScalarD);
-  deviceVector<BundleScalarD> buffer(numBlocks);
-  BundleScalarD *buffer_v = &buffer[0];
-  BundleScalarD result;
-
-#ifdef GRID_REDUCTION_TIMING
-  RealD t_kernel = -usecond();
-#endif
-  packReduceKernel<R, vobj, BundleScalarD, Integer>
-    <<<numBlocks, numThreads, smemSize, computeStream>>>
-    (idat, buffer_v, osites, base, words);
-  accelerator_barrier();
-#ifdef GRID_REDUCTION_TIMING
-  t_kernel += usecond();
-  RealD t_d2h = -usecond();
-#endif
-  acceleratorCopyFromDevice(buffer_v, &result, sizeof(result));
-#ifdef GRID_REDUCTION_TIMING
-  t_d2h += usecond();
-  std::cout << GridLogDebug << " sumD_gpu_reduce_words R=" << R
-            << " base=" << base
-            << " kernel=" << t_kernel << " D2H=" << t_d2h << " us" << std::endl;
-#endif
-
-  for (int k = 0; k < R; k++)
-    ret_p[base + k] = TensorRemove(result._internal[k]);
-}
-
 template <class vobj>
 inline typename vobj::scalar_objectD sumD_gpu_large(const vobj *lat, Integer osites)
 {
-  typedef typename vobj::vector_type    vector;
-  typedef typename vobj::scalar_typeD   scalarD;
-  typedef typename vobj::scalar_objectD sobjD;
-
-  const int words = sizeof(vobj) / sizeof(vector);
-  sobjD ret; zeroit(ret);
+  typedef typename vobj::vector_type  vector;
+  typedef typename vobj::scalar_typeD scalarD;
+  typedef typename vobj::scalar_objectD sobj;
+  sobj ret;
  scalarD *ret_p = (scalarD *)&ret;
+  
+  const int words = sizeof(vobj)/sizeof(vector);

-#ifdef GRID_REDUCTION_TIMING
-  RealD t_large = -usecond();
-#endif
-  int w = 0;
-  while (w + 12 <= words) { sumD_gpu_reduce_words<12>(lat, osites, ret_p, w); w += 12; }
-  while (w +  4 <= words) { sumD_gpu_reduce_words< 4>(lat, osites, ret_p, w); w +=  4; }
-  while (w      <  words) { sumD_gpu_reduce_words< 1>(lat, osites, ret_p, w); w +=  1; }
-#ifdef GRID_REDUCTION_TIMING
-  t_large += usecond();
-  std::cout << GridLogDebug << "sumD_gpu_large"
-            << " sizeof(sobjD)=" << sizeof(sobjD)
-            << " words=" << words << " total=" << t_large << " us" << std::endl;
-#endif
+  deviceVector<vector> buffer(osites);
+  vector *dat = (vector *)lat;
+  vector *buf = &buffer[0];
+  iScalar<vector> *tbuf =(iScalar<vector> *)  &buffer[0];
+  for(int w=0;w<words;w++) {

+    accelerator_for(ss,osites,1,{
+	buf[ss] = dat[ss*words+w];
+      });
+      
+    ret_p[w] = sumD_gpu_small(tbuf,osites);
+  }
  return ret;
 }

@@ -6,27 +6,28 @@ NAMESPACE_BEGIN(Grid);


 template <class vobj>
-inline typename vobj::scalar_objectD sumD_gpu_tensor(const vobj *lat, Integer osites)
+inline typename vobj::scalar_objectD sumD_gpu_tensor(const vobj *lat, Integer osites) 
 {
-  typedef typename vobj::scalar_object  sobj;
+  typedef typename vobj::scalar_object sobj;
  typedef typename vobj::scalar_objectD sobjD;

-  sobjD identity; zeroit(identity);
-  sobjD ret;      zeroit(ret);
-  {
-    sycl::buffer<sobjD, 1> abuff(&ret, {1});
+  sobj identity; zeroit(identity);
+  sobj ret; zeroit(ret);
+  Integer nsimd= vobj::Nsimd();
+  { 
+    sycl::buffer<sobj, 1> abuff(&ret, {1});
    theGridAccelerator->submit([&](sycl::handler &cgh) {
-      auto Reduction = sycl::reduction(abuff, cgh, identity, std::plus<>());
-      cgh.parallel_for(sycl::range<1>{(size_t)osites},
-                       Reduction,
-                       [=](sycl::id<1> item, auto &sum) {
-                         sobj s = Reduce(lat[item[0]]);
-                         sobjD sd; sd = s;
-                         sum += sd;
-                       });
+      auto Reduction = sycl::reduction(abuff,cgh,identity,std::plus<>());
+      cgh.parallel_for(sycl::range<1>{osites},
+                      Reduction,
+                      [=] (sycl::id<1> item, auto &sum) {
+                        auto osite   = item[0];
+                        sum +=Reduce(lat[osite]);
+                      });
    });
  }
-  return ret;
+  sobjD dret; convertType(dret,ret);
+  return dret;
 }

 template <class vobj>
@@ -1,6 +1,7 @@
 #pragma once

 #if defined(GRID_CUDA)
+
 #include <cub/cub.cuh>
 #define gpucub cub
 #define gpuError_t cudaError_t
@@ -56,13 +57,8 @@ inline void sliceSumReduction_cub_small(const vobj *Data,
  //copy offsets to device
  acceleratorCopyToDeviceAsynch(&offsets[0],d_offsets,sizeof(int)*(rd+1),computeStream);
  
-#if defined(__CUDACC__) && (__CUDACC_VER_MAJOR__ >= 13)
-  #define GRID_CUB_SUM_OP ::cuda::std::plus<>{}
-#else
-  #define GRID_CUB_SUM_OP ::gpucub::Sum()
-#endif
  
-  gpuError_t gpuErr = gpucub::DeviceSegmentedReduce::Reduce(temp_storage_array, temp_storage_bytes, rb_p,d_out, rd, d_offsets, d_offsets+1, GRID_CUB_SUM_OP, zero_init, computeStream);
+  gpuError_t gpuErr = gpucub::DeviceSegmentedReduce::Reduce(temp_storage_array, temp_storage_bytes, rb_p,d_out, rd, d_offsets, d_offsets+1, ::gpucub::Sum(), zero_init, computeStream);
  if (gpuErr!=gpuSuccess) {
    std::cout << GridLogError << "Lattice_slicesum_gpu.h: Encountered error during gpucub::DeviceSegmentedReduce::Reduce (setup)! Error: " << gpuErr <<std::endl;
    exit(EXIT_FAILURE);
@@ -86,13 +82,11 @@ inline void sliceSumReduction_cub_small(const vobj *Data,
  });
  
  //issue segmented reductions in computeStream
-  gpuErr = gpucub::DeviceSegmentedReduce::Reduce(temp_storage_array, temp_storage_bytes, rb_p, d_out, rd, d_offsets, d_offsets+1, GRID_CUB_SUM_OP, zero_init, computeStream);
+  gpuErr = gpucub::DeviceSegmentedReduce::Reduce(temp_storage_array, temp_storage_bytes, rb_p, d_out, rd, d_offsets, d_offsets+1,::gpucub::Sum(), zero_init, computeStream);
  if (gpuErr!=gpuSuccess) {
    std::cout << GridLogError << "Lattice_slicesum_gpu.h: Encountered error during gpucub::DeviceSegmentedReduce::Reduce! Error: " << gpuErr <<std::endl;
    exit(EXIT_FAILURE);
  }
-
-#undef GRID_CUB_SUM_OP
  
  acceleratorCopyFromDeviceAsynch(d_out,&lvSum[0],rd*sizeof(vobj),computeStream);
  
@@ -51,8 +51,8 @@ Author: paboyle <paboyle@ph.ed.ac.uk>
 #endif
 #ifdef __x86_64__
 #ifdef GRID_CUDA
-//accelerator_inline uint64_t __rdtsc(void) {  return 0; }
-//accelerator_inline uint64_t __rdpmc(int ) {  return 0; }
+accelerator_inline uint64_t __rdtsc(void) {  return 0; }
+accelerator_inline uint64_t __rdpmc(int ) {  return 0; }
 #else
 #include <x86intrin.h>
 #endif
@@ -596,32 +596,16 @@ template<int Index,class vobj> inline vobj transposeColour(const vobj &lhs){
 //////////////////////////////////////////
 // Trace lattice and non-lattice
 //////////////////////////////////////////
-#define GRID_UNOP(name)   name
-#define GRID_DEF_UNOP(op, name)						\
-  template <typename T1, typename std::enable_if<is_lattice<T1>::value||is_lattice_expr<T1>::value,T1>::type * = nullptr> \
-  inline auto op(const T1 &arg) ->decltype(LatticeUnaryExpression<GRID_UNOP(name),T1>(GRID_UNOP(name)(), arg)) \
-  {									\
-    return     LatticeUnaryExpression<GRID_UNOP(name),T1>(GRID_UNOP(name)(), arg); \
-  }
-
 template<int Index,class vobj>
 inline auto traceSpin(const Lattice<vobj> &lhs) -> Lattice<decltype(traceIndex<SpinIndex>(vobj()))>
 {
  return traceIndex<SpinIndex>(lhs);
 }
-
-GridUnopClass(UnaryTraceSpin, traceIndex<SpinIndex>(a));
-GRID_DEF_UNOP(traceSpin, UnaryTraceSpin);
-
 template<int Index,class vobj>
 inline auto traceColour(const Lattice<vobj> &lhs) -> Lattice<decltype(traceIndex<ColourIndex>(vobj()))>
 {
  return traceIndex<ColourIndex>(lhs);
 }
-
-GridUnopClass(UnaryTraceColour, traceIndex<ColourIndex>(a));
-GRID_DEF_UNOP(traceColour, UnaryTraceColour);
-
 template<int Index,class vobj>
 inline auto traceSpin(const vobj &lhs) -> Lattice<decltype(traceIndex<SpinIndex>(lhs))>
 {
@@ -633,8 +617,6 @@ inline auto traceColour(const vobj &lhs) -> Lattice<decltype(traceIndex<ColourIn
  return traceIndex<ColourIndex>(lhs);
 }

-#undef GRID_UNOP
-#undef GRID_DEF_UNOP
 //////////////////////////////////////////
 // Current types
 //////////////////////////////////////////
@@ -411,7 +411,7 @@ void WilsonKernels<Impl>::DhopDirKernel( StencilImpl &st, DoubledGaugeField &U,S
 #undef LoopBody
 }

-#if 0
+#ifdef GRID_SYCL
 extern "C" {
    ulong SYCL_EXTERNAL __attribute__((overloadable)) intel_get_cycle_counter( void );
    uint  SYCL_EXTERNAL __attribute__((overloadable)) intel_get_active_channel_mask( void );
@@ -138,13 +138,10 @@ public:
    //auto start = std::chrono::high_resolution_clock::now();
    autoView(U_v,U,AcceleratorWrite);
    autoView(P_v,P,AcceleratorRead);
-    typedef typename Field::vector_object vobj;
-    const int Nsimd = vobj::Nsimd();
-    accelerator_for(ss, P.Grid()->oSites(),Nsimd,{
+    accelerator_for(ss, P.Grid()->oSites(),1,{
      for (int mu = 0; mu < Nd; mu++) {
-	auto tmp = Exponentiate(P_v(ss)(mu), ep, Nexp) * U_v(ss)(mu);
-	tmp      = Group::ProjectOnGeneralGroup(tmp);
-	coalescedWrite(U_v[ss](mu),tmp);
+          U_v[ss](mu) = Exponentiate(P_v[ss](mu), ep, Nexp) * U_v[ss](mu);
+          U_v[ss](mu) = Group::ProjectOnGeneralGroup(U_v[ss](mu));
      }
    });
   //auto end = std::chrono::high_resolution_clock::now();
@@ -103,18 +103,6 @@ class PolyakovMod: public ObservableModule<PolyakovLogger<Impl>, NoParameters>{
  PolyakovMod(): ObsBase(NoParameters()){}
 };

-template < class Impl >
-class SpatialPolyakovMod: public ObservableModule<SpatialPolyakovLogger<Impl>, NoParameters>{
-  typedef ObservableModule<SpatialPolyakovLogger<Impl>, NoParameters> ObsBase;
-  using ObsBase::ObsBase; // for constructors
-
-  // acquire resource
-  virtual void initialize(){
-    this->ObservablePtr.reset(new SpatialPolyakovLogger<Impl>());
-  }
-  public:
-  SpatialPolyakovMod(): ObsBase(NoParameters()){}
-};

 template < class Impl >
 class TopologicalChargeMod: public ObservableModule<TopologicalCharge<Impl>, TopologyObsParameters>{
@@ -2,12 +2,11 @@

 Grid physics library, www.github.com/paboyle/Grid

-Source file: ./Grid/qcd/observables/polyakov_loop.h
+Source file: ./lib/qcd/modules/polyakov_line.h

-Copyright (C) 2025
+Copyright (C) 2017

 Author: David Preti <david.preti@csic.es>
-Author: Alexis Verney-Provatas <2414441@swansea.ac.uk>

 This program is free software; you can redistribute it and/or modify
 it under the terms of the GNU General Public License as published by
@@ -61,43 +60,4 @@ class PolyakovLogger : public HmcObservable<typename Impl::Field> {
  }
 };

-template <class Impl>
-class SpatialPolyakovLogger : public HmcObservable<typename Impl::Field> {
- public:
-    // here forces the Impl to be of gauge fields
-    // if not the compiler will complain
-    INHERIT_GIMPL_TYPES(Impl);
-
-     // necessary for HmcObservable compatibility
-    typedef typename Impl::Field Field;
-
-    void TrajectoryComplete(int traj,
-                            Field &U,
-                            GridSerialRNG &sRNG,
-                            GridParallelRNG &pRNG) {
-
-    // Save current numerical output precision
-    int def_prec = std::cout.precision();
-
-    // Assume that the dimensions are D=3+1
-    int Ndim = 3;
-    ComplexD polyakov;
-   
-    // Iterate over the spatial directions and print the average spatial polyakov loop
-    // over them 
-    for (int idx=0; idx<Ndim; idx++) {
-        polyakov = WilsonLoops<Impl>::avgPolyakovLoop(U, idx);
-    
-        std::cout << GridLogMessage
-            << std::setprecision(std::numeric_limits<Real>::digits10 + 1)
-            << "Polyakov Loop in the " << idx << " spatial direction : [ " << traj << " ] "<< polyakov << std::endl;
-
-    }
-
-    // Return to original output precision
-    std::cout.precision(def_prec);
-
-  }
-};
-
 NAMESPACE_END(Grid);
@@ -291,8 +291,8 @@ public:
    int idx=0;
    for(int mu=0;mu<4;mu++){
    for(int nu=0;nu<4;nu++){
-      if ( mu!=nu) assert(this->StoutSmearing->SmearRho[idx]==rho);
-      else         assert(this->StoutSmearing->SmearRho[idx]==0.0);
+      if ( mu!=nu) GRID_ASSERT(this->StoutSmearing->SmearRho[idx]==rho);
+      else         GRID_ASSERT(this->StoutSmearing->SmearRho[idx]==0.0);
      idx++;
    }}
    //////////////////////////////////////////////////////////////////
@@ -825,7 +825,6 @@ public:
  virtual void fill_smearedSet(GaugeField &U)
  {
    this->ThinLinks = &U;  // attach the smearing routine to the field U
-    std::cout << GridLogMessage << " fill_smearedSet " << WilsonLoops<PeriodicGimplR>::avgPlaquette(U) << std::endl;

    // check the pointer is not null
    if (this->ThinLinks == NULL)
@@ -847,8 +846,6 @@ public:
 	ApplyMask(smeared_A,smearLvl);
 	smeared_B = previous_u;
 	ApplyMask(smeared_B,smearLvl);
-	std::cout << GridLogMessage << " smeared_A " << norm2(smeared_A) << std::endl;
-	std::cout << GridLogMessage << " smeared_B " << norm2(smeared_B) << std::endl;
 	// Replace only the masked portion
 	this->SmearedSet[smearLvl] = previous_u-smeared_B + smeared_A;
        previous_u = this->SmearedSet[smearLvl];
@@ -937,10 +934,10 @@ public:
  SmearedConfigurationMasked(GridCartesian* _UGrid, unsigned int Nsmear, Smear_Stout<Gimpl>& Stout)
    : SmearedConfiguration<Gimpl>(_UGrid, Nsmear,Stout)
  {
-    assert(Nsmear%(2*Nd)==0); // Or multiply by 8??
+    GRID_ASSERT(Nsmear%(2*Nd)==0); // Or multiply by 8??

    // was resized in base class
-    assert(this->SmearedSet.size()==Nsmear);
+    GRID_ASSERT(this->SmearedSet.size()==Nsmear);
    
    GridRedBlackCartesian * UrbGrid;
    UrbGrid = SpaceTimeGrid::makeFourDimRedBlackGrid(_UGrid);
@@ -54,7 +54,7 @@ public:
  // Usual cases are not used
  //////////////////////////////////
  virtual void refresh(const GaugeField &U, GridSerialRNG &sRNG, GridParallelRNG &pRNG){ GRID_ASSERT(0);};
-  virtual RealD S(const GaugeField &U) { GRID_ASSERT(0); return 0; }
+  virtual RealD S(const GaugeField &U) { GRID_ASSERT(0); }
  virtual void deriv(const GaugeField &U, GaugeField &dSdU) { GRID_ASSERT(0);  }

  //////////////////////////////////
@@ -254,9 +254,9 @@ static void testGenerators(GroupName::Sp) {
  }
 }

-template <class vtype, int N>
-static Lattice<iScalar<iScalar<iMatrix<vtype, N> > > >
-ProjectOnGeneralGroup(const Lattice<iScalar<iScalar<iMatrix<vtype, N> > > > &Umu, GroupName::Sp) {
+template <int N>
+static Lattice<iScalar<iScalar<iMatrix<vComplexD, N> > > >
+ProjectOnGeneralGroup(const Lattice<iScalar<iScalar<iMatrix<vComplexD, N> > > > &Umu, GroupName::Sp) {
  return ProjectOnSpGroup(Umu);
 }

@@ -177,43 +177,25 @@ public:
  }

  //////////////////////////////////////////////////
-  // average Polyakov loop in mu direction over all directions != mu
+  // average over all x,y,z the temporal loop
  //////////////////////////////////////////////////
-  static ComplexD avgPolyakovLoop(const GaugeField &Umu, const int mu) {  //assume Nd=4
-    
-    // Protect against bad value of mu [0, 3]
-    if ((mu < 0 ) || (mu > 3)) {
-      std::cout << GridLogError << "Index is not an integer inclusively between 0 and 3." << std::endl;
-      exit(1);
-    }
-
-    // U_loop is U_{mu}
-    GaugeMat U_loop(Umu.Grid()), P(Umu.Grid());
+  static ComplexD avgPolyakovLoop(const GaugeField &Umu) {  //assume Nd=4
+    GaugeMat Ut(Umu.Grid()), P(Umu.Grid());
    ComplexD out;
    int T = Umu.Grid()->GlobalDimensions()[3];
    int X = Umu.Grid()->GlobalDimensions()[0];
    int Y = Umu.Grid()->GlobalDimensions()[1];
    int Z = Umu.Grid()->GlobalDimensions()[2];

-    // Number of sites in mu direction
-    int N_mu = Umu.Grid()->GlobalDimensions()[mu];
-
-    U_loop = peekLorentz(Umu, mu); //Select direction
-    P = U_loop;
-    for (int t=1;t<N_mu;t++){ 
-      P = Gimpl::CovShiftForward(U_loop,mu,P);
+    Ut = peekLorentz(Umu,3); //Select temporal direction
+    P = Ut;
+    for (int t=1;t<T;t++){ 
+      P = Gimpl::CovShiftForward(Ut,3,P);
    }
   RealD norm = 1.0/(Nc*X*Y*Z*T);
   out = sum(trace(P))*norm;
   return out;   
-  }  
-
-  /////////////////////////////////////////////////
-  // overload for temporal Polyakov loop
-  /////////////////////////////////////////////////
-  static ComplexD avgPolyakovLoop(const GaugeField &Umu) { 
-    return avgPolyakovLoop(Umu, 3);
-  }
+}

  //////////////////////////////////////////////////
  // average over traced single links
@@ -113,14 +113,6 @@ accelerator_inline RealD    adj(const RealD  & r){ return r; }
 accelerator_inline ComplexD adj(const ComplexD& r){ return(conjugate(r)); }
 accelerator_inline ComplexF adj(const ComplexF& r ){ return(conjugate(r)); }

-#if defined(GRID_CUDA) || defined(GRID_HIP)
-//Provide for convenience
-accelerator_inline std::complex<double> conjugate(const std::complex<double>& r){ return(conj(r)); }
-accelerator_inline std::complex<float>  conjugate(const std::complex<float>& r) { return(conj(r)); }
-accelerator_inline std::complex<double> adj(const std::complex<double>& r)      { return(conj(r)); }
-accelerator_inline std::complex<float>  adj(const std::complex<float>& r)       { return(conj(r)); }
-#endif
-
 accelerator_inline RealF real(const RealF  & r){ return r; }
 accelerator_inline RealD real(const RealD  & r){ return r; }
 accelerator_inline RealF real(const ComplexF  & r){ return r.real(); }
@@ -751,7 +751,7 @@ public:
    obj.xbytes      = xbytes;
    obj.rbytes      = rbytes;
    obj.cb          = cb;
-    
+
    for(int i=0;i<CachedTransfers.size();i++){
      if (   (CachedTransfers[i].direction  ==direction)
 	   &&(CachedTransfers[i].OrthogPlane==OrthogPlane)
@@ -763,13 +763,11 @@ public:
 	     ){
 	// FIXME worry about duplicate with partial compression
 	// Wont happen as DWF has no duplicates, but...
-	//	AddCopy(CachedTransfers[i].recv_buf,recv_buf,rbytes);
-	//	std::cout << "Duplicate dir " <<direction<<" "<<" OrthogPlane "<<OrthogPlane<<" Dest"<<DestProc <<" xbytes " <<xbytes<<" lane "<< lane<<" cb "<<cb<<std::endl;
-	return 0;
-	
-	//	return 1;
+	AddCopy(CachedTransfers[i].recv_buf,recv_buf,rbytes);
+	return 1;
      }
    }
+
    CachedTransfers.push_back(obj);
    return 0;
  }
@@ -32,33 +32,6 @@ Author: Christoph Lehner <christoph@lhnr.de>

 NAMESPACE_BEGIN(Grid);

-//////////////////////////////////////
-// innerProductD scalar overloads must be visible before norm2 is defined
-//////////////////////////////////////
-accelerator_inline ComplexD innerProductD(const ComplexF &l,const ComplexF &r){  return innerProduct(l,r); }
-accelerator_inline ComplexD innerProductD(const ComplexD &l,const ComplexD &r){  return innerProduct(l,r); }
-accelerator_inline RealD    innerProductD(const RealD    &l,const RealD    &r){  return innerProduct(l,r); }
-accelerator_inline RealD    innerProductD(const RealF    &l,const RealF    &r){  return innerProduct(l,r); }
-
-accelerator_inline vComplexD innerProductD(const vComplexD &l,const vComplexD &r){  return innerProduct(l,r); }
-accelerator_inline vRealD    innerProductD(const vRealD    &l,const vRealD    &r){  return innerProduct(l,r); }
-accelerator_inline vComplexD innerProductD(const vComplexF &l,const vComplexF &r)
-{
-  vComplexD la,lb;
-  vComplexD ra,rb;
-  Optimization::PrecisionChange::StoD(l.v,la.v,lb.v);
-  Optimization::PrecisionChange::StoD(r.v,ra.v,rb.v);
-  return innerProduct(la,ra) + innerProduct(lb,rb);
-}
-accelerator_inline vRealD innerProductD(const vRealF &l,const vRealF &r)
-{
-  vRealD la,lb;
-  vRealD ra,rb;
-  Optimization::PrecisionChange::StoD(l.v,la.v,lb.v);
-  Optimization::PrecisionChange::StoD(r.v,ra.v,rb.v);
-  return innerProduct(la,ra) + innerProduct(lb,rb);
-}
-
 ///////////////////////////////////////////////////////////////////////////////////////
 // innerProduct Scalar x Scalar -> Scalar
 // innerProduct Vector x Vector -> Scalar
@@ -165,8 +138,30 @@ auto Reduce (const iScalar<l>& lhs) -> iScalar<decltype(Reduce(lhs._internal))>

 //////////////////////////////////////
 // innerProductD : if single promote to double and evaluate with sum 2x
-// (scalar/vector overloads are declared above norm2 for ADL visibility)
 //////////////////////////////////////
+accelerator_inline ComplexD innerProductD(const ComplexF &l,const ComplexF &r){  return innerProduct(l,r); }
+accelerator_inline ComplexD innerProductD(const ComplexD &l,const ComplexD &r){  return innerProduct(l,r); }
+accelerator_inline RealD    innerProductD(const RealD    &l,const RealD    &r){  return innerProduct(l,r); }
+accelerator_inline RealD    innerProductD(const RealF    &l,const RealF    &r){  return innerProduct(l,r); }
+
+accelerator_inline vComplexD innerProductD(const vComplexD &l,const vComplexD &r){  return innerProduct(l,r); }
+accelerator_inline vRealD    innerProductD(const vRealD    &l,const vRealD    &r){  return innerProduct(l,r); }
+accelerator_inline vComplexD innerProductD(const vComplexF &l,const vComplexF &r)
+{  
+  vComplexD la,lb;
+  vComplexD ra,rb;
+  Optimization::PrecisionChange::StoD(l.v,la.v,lb.v);
+  Optimization::PrecisionChange::StoD(r.v,ra.v,rb.v);
+  return innerProduct(la,ra) + innerProduct(lb,rb); 
+}
+accelerator_inline vRealD innerProductD(const vRealF &l,const vRealF &r)
+{  
+  vRealD la,lb;
+  vRealD ra,rb;
+  Optimization::PrecisionChange::StoD(l.v,la.v,lb.v);
+  Optimization::PrecisionChange::StoD(r.v,ra.v,rb.v);
+  return innerProduct(la,ra) + innerProduct(lb,rb); 
+}

 // Now do it for vector, matrix, scalar
 template<class l,class r,int N> accelerator_inline
@@ -3,7 +3,7 @@
 NAMESPACE_BEGIN(Grid);
 int      world_rank; // Use to control world rank for print guarding
 int      acceleratorAbortOnGpuError=1;
-uint32_t accelerator_threads=8;
+uint32_t accelerator_threads=2;
 uint32_t acceleratorThreads(void)       {return accelerator_threads;};
 void     acceleratorThreads(uint32_t t) {accelerator_threads = t;};

@@ -96,9 +96,7 @@ void     acceleratorInit(void);

 #ifdef GRID_CUDA

-NAMESPACE_END(Grid);
 #include <cuda.h>
-NAMESPACE_BEGIN(Grid);

 #ifdef __CUDA_ARCH__
 #define GRID_SIMT
@@ -434,20 +432,22 @@ accelerator_inline int acceleratorSIMTlane(int Nsimd) {

 #define accelerator_for2dNB( iter1, num1, iter2, num2, nsimd, ... )	\
  {									\
-    if (num1*num2) { \
-      typedef uint64_t Iterator;						\
-      auto lambda = [=] accelerator					\
-        (Iterator iter1,Iterator iter2,Iterator lane ) mutable {		\
-        { __VA_ARGS__;}							\
-      };									\
-      int nt=acceleratorThreads();					\
-      dim3 hip_threads(nsimd, nt, 1);					 \
-      dim3 hip_blocks ((num1+nt-1)/nt,num2,1); \
-      if(hip_threads.x * hip_threads.y * hip_threads.z <= 64){ \
-        LambdaApply64<<<hip_blocks,hip_threads,0,computeStream>>>(num1,num2,nsimd,lambda);			\
-      } else { \
-        LambdaApply<<<hip_blocks,hip_threads,0,computeStream>>>(num1,num2,nsimd,lambda);			\
-      } \
+    typedef uint64_t Iterator;						\
+    auto lambda = [=] accelerator					\
+      (Iterator iter1,Iterator iter2,Iterator lane ) mutable {		\
+      { __VA_ARGS__;}							\
+    };									\
+    int nt=acceleratorThreads();					\
+    dim3 hip_threads(nsimd, nt, 1);					 \
+    dim3 hip_blocks ((num1+nt-1)/nt,num2,1); \
+    if(hip_threads.x * hip_threads.y * hip_threads.z <= 64){ \
+      hipLaunchKernelGGL(LambdaApply64,hip_blocks,hip_threads,		\
+   	                 0,computeStream,						\
+			 num1,num2,nsimd, lambda);			\
+    } else { \
+      hipLaunchKernelGGL(LambdaApply,hip_blocks,hip_threads,		\
+			 0,computeStream,				\
+			 num1,num2,nsimd, lambda);			\
    } \
  }

@@ -755,7 +755,7 @@ void Grid_generic_handler(int sig,siginfo_t *si,void * ptr)
  sig_print_uint(si->si_code);
  SIGLOG("\n");

-  ucontext_t *uc= (ucontext_t *)ptr;
+  unw_context_t *uc= (unw_context_t *)ptr;
  
  SIGLOG("Backtrace:\n");
 #ifdef HAVE_UNWIND
@@ -24,11 +24,7 @@ See the full license in the file "LICENSE" in the top level distribution
 directory
 *************************************************************************************/
 /*  END LEGAL */
-
-#include "disable_examples_without_instantiations.h"
-#ifdef ENABLE_FERMION_INSTANTIATIONS
-
-#include<Grid/Grid.h>
+#include <Grid/Grid.h>

 #if Nc == 3
 #include <Grid/qcd/smearing/GaugeConfigurationMasked.h>
@@ -234,4 +230,3 @@ int main(int argc, char **argv)
 #endif
 } // main

-#endif
@@ -25,11 +25,7 @@ directory
 *************************************************************************************/
 /*  END LEGAL */

-
-#include "disable_examples_without_instantiations.h"
-#ifdef ENABLE_FERMION_INSTANTIATIONS
-
-#include<Grid/Grid.h>
+#include <Grid/Grid.h>

 #if Nc == 3
 #include <Grid/qcd/smearing/GaugeConfigurationMasked.h>
@@ -235,4 +231,5 @@ int main(int argc, char **argv)
 #endif
 } // main

-#endif
+
+
@@ -24,11 +24,7 @@ See the full license in the file "LICENSE" in the top level distribution
 directory
 *************************************************************************************/
 /*  END LEGAL */
-
-#include "disable_examples_without_instantiations.h"
-#ifdef ENABLE_FERMION_INSTANTIATIONS
-
-#include<Grid/Grid.h>
+#include <Grid/Grid.h>

 #if Nc == 3
 #include <Grid/qcd/smearing/GaugeConfigurationMasked.h>
@@ -234,4 +230,5 @@ int main(int argc, char **argv)
 #endif
 } // main

-#endif
+
+
@@ -27,11 +27,7 @@ See the full license in the file "LICENSE" in the top level distribution
 directory
 *************************************************************************************/
 /*  END LEGAL */
-
-#include "disable_examples_without_instantiations.h"
-#ifdef ENABLE_FERMION_INSTANTIATIONS
-
-#include<Grid/Grid.h>
+#include <Grid/Grid.h>

 int main(int argc, char **argv) {
  using namespace Grid;
@@ -199,4 +195,5 @@ int main(int argc, char **argv) {
  Grid_finalize();
 } // main

-#endif
+
+
@@ -28,11 +28,7 @@ See the full license in the file "LICENSE" in the top level distribution
 directory
 *************************************************************************************/
 /*  END LEGAL */
-
-#include "disable_examples_without_instantiations.h"
-#ifdef ENABLE_FERMION_INSTANTIATIONS
-
-#include<Grid/Grid.h>
+#include <Grid/Grid.h>

 #ifdef GRID_DEFAULT_PRECISION_DOUBLE
 #define MIXED_PRECISION
@@ -453,4 +449,5 @@ int main(int argc, char **argv) {
  Grid_finalize();
 } // main

-#endif
+
+
@@ -28,11 +28,7 @@ See the full license in the file "LICENSE" in the top level distribution
 directory
 *************************************************************************************/
 /*  END LEGAL */
-
-#include "disable_examples_without_instantiations.h"
-#ifdef ENABLE_FERMION_INSTANTIATIONS
-
-#include<Grid/Grid.h>
+#include <Grid/Grid.h>

 #ifdef GRID_DEFAULT_PRECISION_DOUBLE
 #define MIXED_PRECISION
@@ -446,4 +442,5 @@ int main(int argc, char **argv) {
  Grid_finalize();
 } // main

-#endif
+
+
@@ -28,11 +28,7 @@ See the full license in the file "LICENSE" in the top level distribution
 directory
 *************************************************************************************/
 /*  END LEGAL */
-
-#include "disable_examples_without_instantiations.h"
-#ifdef ENABLE_FERMION_INSTANTIATIONS
-
-#include<Grid/Grid.h>
+#include <Grid/Grid.h>

 using namespace Grid;

@@ -922,5 +918,3 @@ int main(int argc, char **argv) {
  return 0;
 #endif
 } // main
-
-#endif
@@ -28,11 +28,7 @@ See the full license in the file "LICENSE" in the top level distribution
 directory
 *************************************************************************************/
 /*  END LEGAL */
-
-#include "disable_examples_without_instantiations.h"
-#ifdef ENABLE_FERMION_INSTANTIATIONS
-
-#include<Grid/Grid.h>
+#include <Grid/Grid.h>

 using namespace Grid;

@@ -877,5 +873,3 @@ int main(int argc, char **argv) {
  return 0;
 #endif
 } // main
-
-#endif
@@ -27,11 +27,7 @@ See the full license in the file "LICENSE" in the top level distribution
 directory
 *************************************************************************************/
 /*  END LEGAL */
-
-#include "disable_examples_without_instantiations.h"
-#ifdef ENABLE_FERMION_INSTANTIATIONS
-
-#include<Grid/Grid.h>
+#include <Grid/Grid.h>

 int main(int argc, char **argv) {
  using namespace Grid;
@@ -197,4 +193,5 @@ int main(int argc, char **argv) {
  Grid_finalize();
 } // main

-#endif
+
+
@@ -27,11 +27,7 @@ See the full license in the file "LICENSE" in the top level distribution
 directory
 *************************************************************************************/
 /*  END LEGAL */
-
-#include "disable_examples_without_instantiations.h"
-#ifdef ENABLE_FERMION_INSTANTIATIONS
-
-#include<Grid/Grid.h>
+#include <Grid/Grid.h>

 NAMESPACE_BEGIN(Grid);

@@ -516,4 +512,5 @@ int main(int argc, char **argv) {
  Grid_finalize();
 } // main

-#endif
+
+
@@ -27,11 +27,7 @@ See the full license in the file "LICENSE" in the top level distribution
 directory
 *************************************************************************************/
 /*  END LEGAL */
-
-#include "disable_examples_without_instantiations.h"
-#ifdef ENABLE_FERMION_INSTANTIATIONS
-
-#include<Grid/Grid.h>
+#include <Grid/Grid.h>

 int main(int argc, char **argv) {
  using namespace Grid;
@@ -349,4 +345,5 @@ int main(int argc, char **argv) {
  Grid_finalize();
 } // main

-#endif
+
+
@@ -27,11 +27,7 @@ See the full license in the file "LICENSE" in the top level distribution
 directory
 *************************************************************************************/
 /*  END LEGAL */
-
-#include "disable_examples_without_instantiations.h"
-#ifdef ENABLE_FERMION_INSTANTIATIONS
-
-#include<Grid/Grid.h>
+#include <Grid/Grid.h>

 NAMESPACE_BEGIN(Grid);

@@ -520,4 +516,5 @@ int main(int argc, char **argv) {
  Grid_finalize();
 } // main

-#endif
+
+
@@ -27,11 +27,7 @@ See the full license in the file "LICENSE" in the top level distribution
 directory
 *************************************************************************************/
 /*  END LEGAL */
-
-#include "disable_examples_without_instantiations.h"
-#ifdef ENABLE_FERMION_INSTANTIATIONS
-
-#include<Grid/Grid.h>
+#include <Grid/Grid.h>

 NAMESPACE_BEGIN(Grid);

@@ -571,4 +567,5 @@ int main(int argc, char **argv) {
  Grid_finalize();
 } // main

-#endif
+
+
@@ -27,11 +27,7 @@ See the full license in the file "LICENSE" in the top level distribution
 directory
 *************************************************************************************/
 /*  END LEGAL */
-
-#include "disable_examples_without_instantiations.h"
-#ifdef ENABLE_FERMION_INSTANTIATIONS
-
-#include<Grid/Grid.h>
+#include <Grid/Grid.h>

 int main(int argc, char **argv) {
  using namespace Grid;
@@ -267,4 +263,5 @@ int main(int argc, char **argv) {
  Grid_finalize();
 } // main

-#endif
+
+
@@ -27,11 +27,7 @@ See the full license in the file "LICENSE" in the top level distribution
 directory
 *************************************************************************************/
 /*  END LEGAL */
-
-#include "disable_examples_without_instantiations.h"
-#ifdef ENABLE_FERMION_INSTANTIATIONS
-
-#include<Grid/Grid.h>
+#include <Grid/Grid.h>

 int main(int argc, char **argv) {
  using namespace Grid;
@@ -421,4 +417,5 @@ int main(int argc, char **argv) {
  Grid_finalize();
 } // main

-#endif
+
+
@@ -27,11 +27,7 @@ See the full license in the file "LICENSE" in the top level distribution
 directory
 *************************************************************************************/
 /*  END LEGAL */
-
-#include "disable_examples_without_instantiations.h"
-#ifdef ENABLE_FERMION_INSTANTIATIONS
-
-#include<Grid/Grid.h>
+#include <Grid/Grid.h>

 NAMESPACE_BEGIN(Grid);

@@ -456,4 +452,5 @@ int main(int argc, char **argv) {
  Grid_finalize();
 } // main

-#endif
+
+
@@ -27,11 +27,7 @@ See the full license in the file "LICENSE" in the top level distribution
 directory
 *************************************************************************************/
 /*  END LEGAL */
-
-#include "disable_examples_without_instantiations.h"
-#ifdef ENABLE_FERMION_INSTANTIATIONS
-
-#include<Grid/Grid.h>
+#include <Grid/Grid.h>

 NAMESPACE_BEGIN(Grid);

@@ -466,4 +462,5 @@ int main(int argc, char **argv) {
  Grid_finalize();
 } // main

-#endif
+
+
@@ -27,11 +27,7 @@ See the full license in the file "LICENSE" in the top level distribution
 directory
 *************************************************************************************/
 /*  END LEGAL */
-
-#include "disable_examples_without_instantiations.h"
-#ifdef ENABLE_FERMION_INSTANTIATIONS
-
-#include<Grid/Grid.h>
+#include <Grid/Grid.h>



@@ -268,4 +264,5 @@ int main(int argc, char **argv) {
  Grid_finalize();
 } // main

-#endif
+
+
@@ -1,16 +0,0 @@
-#include <Grid/Grid.h>
-#pragma once
-
-
-#ifndef ENABLE_FERMION_INSTANTIATIONS
-#include <iostream>
-
-int main(void) {
-  std::cout << "This build of Grid was configured to exclude fermion instantiations, "
-	    << "which this example relies on. "
-	    << "Please reconfigure and rebuild Grid with --enable-fermion-instantiations"
-	    << "to run this example."
-	    << std::endl;
-  return 1;
-}
-#endif
@@ -26,9 +26,6 @@ Author: paboyle <paboyle@ph.ed.ac.uk>
    See the full license in the file "LICENSE" in the top level distribution directory
    *************************************************************************************/
    /*  END LEGAL */
-#include "disable_benchmarks_without_instantiations.h"
-#ifdef ENABLE_FERMION_INSTANTIATIONS
-
 #include <Grid/Grid.h>

 using namespace Grid;
@@ -734,5 +731,3 @@ int main (int argc, char ** argv)

  Grid_finalize();
 }
-
-#endif
@@ -20,9 +20,6 @@
    See the full license in the file "LICENSE" in the top level distribution directory
    *************************************************************************************/
    /*  END LEGAL */
-#include "disable_benchmarks_without_instantiations.h"
-#ifdef ENABLE_FERMION_INSTANTIATIONS
-
 #include <Grid/Grid.h>
 #ifdef GRID_CUDA
 #define CUDA_PROFILE
@@ -442,4 +439,3 @@ void Benchmark(int Ls, Coordinate Dirichlet,bool sloppy)
  GRID_ASSERT(norm2(src_e)<1.0e-4);
  GRID_ASSERT(norm2(src_o)<1.0e-4);
 }
-#endif
@@ -20,10 +20,6 @@
    See the full license in the file "LICENSE" in the top level distribution directory
    *************************************************************************************/
    /*  END LEGAL */
-#include "disable_benchmarks_without_instantiations.h"
-
-#ifdef ENABLE_FERMION_INSTANTIATIONS
-
 #include <Grid/Grid.h>
 #ifdef GRID_CUDA
 #define CUDA_PROFILE
@@ -443,5 +439,3 @@ void Benchmark(int Ls, Coordinate Dirichlet,bool sloppy)
  GRID_ASSERT(norm2(src_e)<1.0e-4);
  GRID_ASSERT(norm2(src_o)<1.0e-4);
 }
-
-#endif
@@ -20,9 +20,6 @@
    See the full license in the file "LICENSE" in the top level distribution directory
    *************************************************************************************/
    /*  END LEGAL */
-#include "disable_benchmarks_without_instantiations.h"
-#ifdef ENABLE_FERMION_INSTANTIATIONS
-
 #include <Grid/Grid.h>
 #ifdef GRID_CUDA
 #define CUDA_PROFILE
@@ -388,5 +385,3 @@ int main (int argc, char ** argv)
  Grid_finalize();
  exit(0);
 }
-
-#endif
@@ -26,9 +26,6 @@ Author: paboyle <paboyle@ph.ed.ac.uk>
    See the full license in the file "LICENSE" in the top level distribution directory
    *************************************************************************************/
    /*  END LEGAL */
-#include "disable_benchmarks_without_instantiations.h"
-#ifdef ENABLE_FERMION_INSTANTIATIONS
-
 #include <Grid/Grid.h>

 using namespace std;
@@ -241,4 +238,5 @@ void benchDw(std::vector<int> & latt4, int Ls, int threads,int report )
  }
 }

-#endif
+
+
@@ -1,7 +1,3 @@
-#include "disable_benchmarks_without_instantiations.h"
-#ifdef ENABLE_FERMION_INSTANTIATIONS
-
-
 #include <Grid/Grid.h>
 #include <sstream>
 using namespace std;
@@ -159,4 +155,3 @@ int main (int argc, char ** argv)
  Grid_finalize();
 }

-#endif
@@ -20,9 +20,6 @@
    See the full license in the file "LICENSE" in the top level distribution directory
    *************************************************************************************/
    /*  END LEGAL */
-#include "disable_benchmarks_without_instantiations.h"
-#ifdef ENABLE_FERMION_INSTANTIATIONS
-
 #include <Grid/Grid.h>
 #ifdef GRID_CUDA
 #define CUDA_PROFILE
@@ -132,5 +129,3 @@ int main (int argc, char ** argv)
  Grid_finalize();
  exit(0);
 }
-
-#endif
@@ -26,9 +26,6 @@ Author: paboyle <paboyle@ph.ed.ac.uk>
    See the full license in the file "LICENSE" in the top level distribution directory
    *************************************************************************************/
    /*  END LEGAL */
-#include "disable_benchmarks_without_instantiations.h"
-#ifdef ENABLE_FERMION_INSTANTIATIONS
-
 #include <Grid/Grid.h>

 using namespace std;
@@ -152,5 +149,3 @@ int main (int argc, char ** argv)

  Grid_finalize();
 }
-
-#endif
@@ -26,9 +26,6 @@ Author: paboyle <paboyle@ph.ed.ac.uk>
    See the full license in the file "LICENSE" in the top level distribution directory
    *************************************************************************************/
    /*  END LEGAL */
-#include "disable_benchmarks_without_instantiations.h"
-#ifdef ENABLE_FERMION_INSTANTIATIONS
-
 #include <Grid/Grid.h>

 using namespace std;
@@ -175,4 +172,5 @@ void benchDw(std::vector<int> & latt4, int Ls)
  //  Dw.Report();
 }

-#endif
+
+
@@ -26,9 +26,6 @@ Author: paboyle <paboyle@ph.ed.ac.uk>
    See the full license in the file "LICENSE" in the top level distribution directory
    *************************************************************************************/
    /*  END LEGAL */
-#include "disable_benchmarks_without_instantiations.h"
-#ifdef ENABLE_FERMION_INSTANTIATIONS
-
 #include <Grid/Grid.h>

 using namespace std;
@@ -113,5 +110,3 @@ int main (int argc, char ** argv)

  Grid_finalize();
 }
-
-#endif
@@ -26,9 +26,6 @@ Author: paboyle <paboyle@ph.ed.ac.uk>
    See the full license in the file "LICENSE" in the top level distribution directory
    *************************************************************************************/
    /*  END LEGAL */
-#include "disable_benchmarks_without_instantiations.h"
-#ifdef ENABLE_FERMION_INSTANTIATIONS
-
 #include <Grid/Grid.h>

 using namespace std;
@@ -115,5 +112,3 @@ int main (int argc, char ** argv)

  Grid_finalize();
 }
-
-#endif
@@ -26,10 +26,6 @@ Author: paboyle <paboyle@ph.ed.ac.uk>
    See the full license in the file "LICENSE" in the top level distribution directory
    *************************************************************************************/
    /*  END LEGAL */
-#include "disable_benchmarks_without_instantiations.h"
-#ifdef ENABLE_FERMION_INSTANTIATIONS
-
-
 #include <Grid/Grid.h>
 #include <Grid/algorithms/blas/BatchedBlas.h>

@@ -982,5 +978,3 @@ int main (int argc, char ** argv)
  Grid_finalize();
  fclose(FP);
 }
-
-#endif
@@ -26,9 +26,6 @@ Author: paboyle <paboyle@ph.ed.ac.uk>
    See the full license in the file "LICENSE" in the top level distribution directory
    *************************************************************************************/
    /*  END LEGAL */
-#include "disable_benchmarks_without_instantiations.h"
-#ifdef ENABLE_FERMION_INSTANTIATIONS
-
 #include <Grid/Grid.h>

 using namespace std;
@@ -261,5 +258,3 @@ int main (int argc, char ** argv)

  Grid_finalize();
 }
-
-#endif
@@ -19,9 +19,6 @@ Author: Richard Rollins <rprollins@users.noreply.github.com>
    See the full license in the file "LICENSE" in the top level distribution directory
 *************************************************************************************/
 /*  END LEGAL */
-#include "disable_benchmarks_without_instantiations.h"
-#ifdef ENABLE_FERMION_INSTANTIATIONS
-
 #include <Grid/Grid.h>

 using namespace std;
@@ -164,5 +161,3 @@ void bench_wilson_eo (
  double flops = (single_site_flops * volume * ncall)/2.0;
  std::cout << flops/(t1-t0) << "\t\t";
 }
-
-#endif
@@ -1,16 +0,0 @@
-#include <Grid/Grid.h>
-
-#pragma once
-
-#ifndef ENABLE_FERMION_INSTANTIATIONS
-#include <iostream>
-
-int main(void) {
-  std::cout << "This build of Grid was configured to exclude fermion instantiations, "
-	    << "which this benchmark relies on. "
-	    << "Please reconfigure and rebuild Grid with --enable-fermion-instantiations"
-	    << "to run this benchmark."
-	    << std::endl;
-  return 1;
-}
-#endif
@@ -172,12 +172,6 @@ case ${ac_TRACING} in
 esac

 ############### fermions
-AC_ARG_ENABLE([fermion-instantiations],
-     [AS_HELP_STRING([--enable-fermion-instantiations=yes|no],[enable fermion instantiations])],
-     [ac_FERMION_REPS=${enable_fermion_instantiations}], [ac_FERMION_INSTANTIATIONS=yes])
-
-AM_CONDITIONAL(BUILD_FERMION_INSTANTIATIONS, [ test "${ac_FERMION_INSTANTIATIONS}X" == "yesX" ])
-
 AC_ARG_ENABLE([fermion-reps],
     [AS_HELP_STRING([--enable-fermion-reps=yes|no],[enable extra fermion representation support])],
     [ac_FERMION_REPS=${enable_fermion_reps}], [ac_FERMION_REPS=yes])
@@ -200,9 +194,6 @@ AM_CONDITIONAL(BUILD_ZMOBIUS, [ test "${ac_ZMOBIUS}X" == "yesX" ])
 case ${ac_FERMION_REPS} in
   yes) AC_DEFINE([ENABLE_FERMION_REPS],[1],[non QCD fermion reps]);;
 esac
-case ${ac_FERMION_INSTANTIATIONS} in
-   yes) AC_DEFINE([ENABLE_FERMION_INSTANTIATIONS],[1],[enable fermions]);;
-esac
 case ${ac_GPARITY} in
   yes) AC_DEFINE([ENABLE_GPARITY],[1],[fermion actions with GPARITY BCs]);;
 esac
@@ -418,10 +409,16 @@ AC_SEARCH_LIBS([unw_backtrace], [unwind],
               [have_unwind=true],
 	             [AC_MSG_WARN(libunwind library was not found in your system.)])

-AC_SEARCH_LIBS([_Ux86_64_step], [unwind-x86_64],
-               [AC_DEFINE([HAVE_UNWIND_X86_64], [1], [Define to 1 if you have the `libunwind-x86_64' library])]
-               [have_unwind_x86_64=true],
-	             [AC_MSG_WARN(libunwind library was not found in your system.)])
+AS_CASE([$host_cpu], [x86_64],
+  [AC_SEARCH_LIBS([_Ux86_64_step], [unwind-x86_64],
+                [AC_DEFINE([HAVE_UNWIND_X86_64], [1], [Define to 1 if you have the `libunwind-x86_64' library])]
+                [have_unwind_x86_64=true],
+                [AC_MSG_WARN(libunwind library was not found in your system.)])],
+  [aarch64],
+  [AC_SEARCH_LIBS([_Uaarch64_step], [unwind-aarch64],
+                [AC_DEFINE([HAVE_UNWIND_AARCH64], [1], [Define to 1 if you have the `libunwind-aarch64' library])]
+                [have_unwind_aarch64=true],
+                [AC_MSG_WARN(libunwind library was not found in your system.)])])

 AC_SEARCH_LIBS([SHA256_Init], [crypto],
               [AC_DEFINE([HAVE_CRYPTO], [1], [Define to 1 if you have the `OpenSSL' library])]
@@ -3,9 +3,6 @@
 * without regression / tests being applied
 */

-#include "disable_examples_without_instantiations.h"
-#ifdef ENABLE_FERMION_INSTANTIATIONS
-
 #include <Grid/Grid.h>

 using namespace std;
@@ -313,4 +310,5 @@ int main (int argc, char ** argv)
  Grid_finalize();
 }

-#endif
+
+
@@ -3,9 +3,6 @@
 * without regression / tests being applied
 */

-#include "disable_examples_without_instantiations.h"
-#ifdef ENABLE_FERMION_INSTANTIATIONS
-
 #include <Grid/Grid.h>

 using namespace std;
@@ -435,4 +432,5 @@ int main (int argc, char ** argv)
  Grid_finalize();
 }

-#endif
+
+
@@ -3,9 +3,6 @@
 * without regression / tests being applied
 */

-#include "disable_examples_without_instantiations.h"
-#ifdef ENABLE_FERMION_INSTANTIATIONS
-
 #include <Grid/Grid.h>

 using namespace std;
@@ -538,4 +535,5 @@ int main (int argc, char ** argv)
  Grid_finalize();
 }

-#endif
+
+
@@ -3,9 +3,6 @@
 * without regression / tests being applied
 */

-#include "disable_examples_without_instantiations.h"
-#ifdef ENABLE_FERMION_INSTANTIATIONS
-
 #include <Grid/Grid.h>

 using namespace std;
@@ -432,4 +429,5 @@ int main (int argc, char ** argv)
  Grid_finalize();
 }

-#endif
+
+
@@ -1,15 +0,0 @@
-#include <Grid/Grid.h>
-#pragma once
-
-#ifndef ENABLE_FERMION_INSTANTIATIONS
-#include <iostream>
-
-int main(void) {
-  std::cout << "This build of Grid was configured to exclude fermion instantiations, "
-	    << "which this example relies on. "
-	    << "Please reconfigure and rebuild Grid with --enable-fermion-instantiations"
-	    << "to run this example."
-	    << std::endl;
-  return 1;
-}
-#endif
@@ -11,7 +11,7 @@ CCFILES=`find . -name '*.cc' -not -path '*/instantiation/*/*' -not -path '*/gamm

 ZWILS_FERMION_FILES=` find . -name '*.cc' -path '*/instantiation/*' -path '*/instantiation/ZWilsonImpl*' `
 WILS_FERMION_FILES=`  find . -name '*.cc' -path '*/instantiation/*' -path '*/instantiation/WilsonImpl*' `
-STAG_FERMION_FILES=`  find . -name '*.cc' -path '*/instantiation/*' -path '*/instantiation/StaggeredImpl*' `
+STAG_FERMION_FILES=`  find . -name '*.cc' -path '*/instantiation/*' -path '*/instantiation/Staggered*' `
 GP_FERMION_FILES=`    find . -name '*.cc' -path '*/instantiation/*' -path '*/instantiation/Gparity*' `
 ADJ_FERMION_FILES=`   find . -name '*.cc' -path '*/instantiation/*' -path '*/instantiation/WilsonAdj*' `
 TWOIND_FERMION_FILES=`find . -name '*.cc' -path '*/instantiation/*' -path '*/instantiation/WilsonTwoIndex*'`
@@ -1,196 +0,0 @@
---
-name: communication-overlap
-description: Design and implement communication/computation overlap pipelines for GPU+MPI codes — per-packet event tracking, host-staging through pinned memory, internode/intranode bandwidth separation, and the 7-phase pipeline pattern that replaces broken accelerator-aware MPI paths.
-user-invocable: true
-allowed-tools:
-  - Read
-  - Bash(grep -r)
---
-
-# Communication/Computation Overlap Pipeline Design
-
-## Why GPU-Direct MPI Is Often Not the Right Default
-
-GPU-direct RDMA (passing GPU buffer pointers directly to MPI) is appealing because it eliminates explicit D2H/H2D copies. In practice on several leadership systems:
-
- **Bandwidth**: RDMA at 30% of wirespeed has been observed on Pontevecchio/Aurora. The overhead of staging through pinned host memory can be *lower* total latency than slow RDMA.
- **Correctness**: Device buffer aliasing in `MPI_Sendrecv` (see `mpi-heterogeneous.md`) makes direct GPU-to-GPU transfer unreliable.
- **Overlap**: Host-staging enables fine-grained overlap — each packet's D2H can be issued as a separate asynchronous event, and the corresponding MPI send can fire as soon as *that packet* arrives in host memory, not after all packets are ready.
-
-The pipeline pattern below was developed to replace broken MPICH accelerator-aware paths. It achieves genuine computation/communication overlap by tracking per-packet GPU events.
-
-## The 7-Phase Pipeline
-
-Given a set of halo exchange operations (each identified by a `packet_index`):
-
-### Phase 0: Prepare data on device
-Pack halo data into contiguous GPU buffers. One buffer per direction/neighbour.
-
-### Phase 1: Post receives + start D2H
-Post all `MPI_Irecv` calls immediately (into pinned host buffers). Simultaneously, start asynchronous D2H copies for all send buffers:
-
-```cpp
-for (auto &pkt : send_packets) {
-    MPI_Irecv(pkt.host_recv_buf, pkt.bytes, MPI_BYTE,
-               pkt.src_rank, pkt.tag, comm, &pkt.recv_req);
-    
-    acceleratorCopyFromDeviceAsync(pkt.device_send_buf,
-                                   pkt.host_send_buf,
-                                   pkt.bytes, &pkt.d2h_event);
-}
-```
-
-The key: `pkt.d2h_event` is a per-packet GPU event (e.g. `cudaEvent_t`, `hipEvent_t`, or SYCL event). We can poll individual packet completion rather than waiting for all.
-
-### Phase 2: Fire sends as D2H completes (packet by packet)
-Poll packet D2H events. As each packet becomes ready in host memory, immediately fire the corresponding `MPI_Isend`. Also start intranode D2D copies at this point — these are deferred until now to avoid competing with the internode D2H on PCIe bandwidth:
-
-```cpp
-bool all_sent = false;
-while (!all_sent) {
-    all_sent = true;
-    for (auto &pkt : send_packets) {
-        if (!pkt.sent && acceleratorEventIsComplete(pkt.d2h_event)) {
-            MPI_Isend(pkt.host_send_buf, pkt.bytes, MPI_BYTE,
-                       pkt.dst_rank, pkt.tag, comm, &pkt.send_req);
-            pkt.sent = true;
-            start_intranode_copy(pkt);  // now safe, D2H is done
-        }
-        if (!pkt.sent) all_sent = false;
-    }
-}
-```
-
-### Phase 3: Poll receives + start H2D as each arrives
-`MPI_Test` individual receive requests. As each completes, immediately start the H2D copy into device-resident halo buffer:
-
-```cpp
-bool all_recvd = false;
-while (!all_recvd) {
-    all_recvd = true;
-    for (auto &pkt : recv_packets) {
-        if (!pkt.h2d_started) {
-            int flag = 0;
-            MPI_Test(&pkt.recv_req, &flag, MPI_STATUS_IGNORE);
-            if (flag) {
-                acceleratorCopyToDeviceAsync(pkt.host_recv_buf,
-                                              pkt.device_recv_buf,
-                                              pkt.bytes, &pkt.h2d_event);
-                pkt.h2d_started = true;
-            }
-        }
-        if (!pkt.h2d_started) all_recvd = false;
-    }
-}
-```
-
-### Phase 4: Wait for all sends
-```cpp
-std::vector<MPI_Request> send_reqs;
-for (auto &pkt : send_packets) send_reqs.push_back(pkt.send_req);
-MPI_Waitall(send_reqs.size(), send_reqs.data(), MPI_STATUSES_IGNORE);
-```
-
-### Phase 5: Wait for all H2D copies
-```cpp
-for (auto &pkt : recv_packets) acceleratorEventWait(pkt.h2d_event);
-```
-
-### Phase 6: Run interior computation
-The interior (non-halo) computation can run from Phase 1 onwards, overlapped with all of the above:
-
-```cpp
-// Launched in Phase 1, runs in parallel with the pipeline
-accelerator_for(ss, interior_sites, ...) { compute_interior(ss); }
-```
-
-Synchronise with interior before using the full field:
-```cpp
-accelerator_barrier();  // interior kernel done
-// Halo H2D is also complete (Phase 5 above)
-// Now safe to use full field
-```
-
-## Per-Packet Event Tracking Data Structure
-
-```cpp
-struct Packet {
-    // Buffers
-    void *device_send_buf;
-    void *host_send_buf;    // pinned
-    void *device_recv_buf;
-    void *host_recv_buf;    // pinned
-    size_t bytes;
-    
-    // MPI
-    int src_rank, dst_rank, tag;
-    MPI_Request send_req, recv_req;
-    
-    // GPU events (one per packet, not one global barrier)
-    AcceleratorEvent d2h_event;
-    AcceleratorEvent h2d_event;
-    
-    // State flags
-    bool sent = false;
-    bool h2d_started = false;
-};
-```
-
-The critical design point: `d2h_event` and `h2d_event` are **per-packet**, not global. This allows the MPI send for packet 0 to fire while packet 1's D2H is still in progress.
-
-## Internode vs Intranode Separation
-
-PCIe (GPU-to-CPU) and NVLink/xGMI (GPU-to-GPU within a node) are separate bandwidth resources. They do not compete with each other, but they *do* compete with each other for transactions if both are active simultaneously.
-
-Strategy: complete all internode D2H copies first (to maximise NIC injection bandwidth), then start intranode D2D copies (which use NVLink/xGMI and do not contend with PCIe for internode traffic):
-
-```cpp
-// In Phase 2: start intranode D2D only after D2H is confirmed complete
-if (pkt.is_intranode && pkt.d2h_done) {
-    // Use peer access (cudaMemcpyPeerAsync / hipMemcpyPeerAsync)
-    // rather than staging through host for intranode
-    cudaMemcpyPeerAsync(pkt.peer_recv_buf, pkt.dst_device,
-                         pkt.device_send_buf, pkt.src_device,
-                         pkt.bytes, computeStream);
-}
-```
-
-Grid reference: `Grid/communicator/Communicator_mpi3.cc` — search for `NVLINK_GET` and `ACCELERATOR_AWARE_MPI` conditional blocks.
-
-## Pinned Memory Allocation
-
-All host staging buffers must be pinned (page-locked) for async D2H/H2D:
-
-```cpp
-// CUDA
-cudaMallocHost(&host_buf, bytes);
-cudaFreeHost(host_buf);
-
-// HIP  
-hipHostMalloc(&host_buf, bytes, hipHostMallocDefault);
-hipHostFree(host_buf);
-
-// SYCL
-host_buf = sycl::malloc_host(bytes, *queue);
-sycl::free(host_buf, *queue);
-```
-
-Pre-allocate at startup. Repeated `cudaMallocHost` in the hot path adds latency from the OS memory manager.
-
-## Checksumming in the Pipeline
-
-Insert checksum computation before D2H (on the GPU-resident data) and verification after H2D (on the received GPU-resident data). See `correctness-verification.md` for the checksum pattern. The salting (`packet_index + 1000 * tag`) detects packet transposition — critical for diagnosing MPI buffer aliasing bugs where two packets' contents are swapped.
-
-## Smoke Test for a New System
-
-Before running physics, validate the pipeline on a synthetic benchmark:
-
-```cpp
-// Send a buffer of known values, receive and check
-// Run at multiple message sizes: 4KB, 64KB, 1MB, 16MB
-// Run at multiple process counts: 2, 8, 64, 512
-// Verify checksums on every packet
-// Measure bandwidth: should be ≥ 80% of FDR/HDR/NDR peak for host-staged
-```
-
-Any bandwidth below 50% of theoretical, or any checksum failure, indicates a problem in the communication stack that must be resolved before production runs.
@@ -1,154 +0,0 @@
---
-name: compiler-validation
-description: Identify GPU compiler code generation bugs, distinguish them from hardware and runtime bugs, construct minimal reproducers, and validate correctness of generated assembly for performance-critical HPC kernels.
-user-invocable: true
-allowed-tools:
-  - Read
-  - Bash(grep -r)
-  - Bash(objdump)
---
-
-# Compiler Validation for GPU HPC Codes
-
-## Why Compiler Bugs Are Distinct
-
-Compiler bugs have a unique diagnostic signature: they produce *deterministically wrong* results. The same input always produces the same wrong output. This distinguishes them from:
-
- Hardware bugs: usually stochastic (wrong answer sometimes, correct answer other times)
- Runtime bugs (premature barrier, buffer aliasing): often stochastic or history-dependent
- Race conditions: non-deterministic
-
-**The determinism test**: run the same kernel 100 times with the same input. If the wrong answer is always the same wrong answer, suspect the compiler.
-
-## The Minimal Reproducer Protocol
-
-When a kernel produces wrong results, isolate the compiler as quickly as possible:
-
-**Step 1: Eliminate the physics**. Reduce the failing kernel to the smallest possible computation that still exhibits the bug. Replace QCD fields with `double` arrays. Replace lattice operations with scalar arithmetic. The goal is a 20-line CUDA/HIP/SYCL file that any compiler engineer can compile and run.
-
-**Step 2: Binary search over optimisation levels**. Compile at `-O0` (or equivalent). If the answer becomes correct, the bug is in an optimisation pass. Then test `-O1`, `-O2`, `-O3` individually to find which optimisation level introduces the bug.
-
-```bash
-# HIP example
-hipcc -O0 minimal_repro.cc -o test_O0 && ./test_O0   # should be correct
-hipcc -O1 minimal_repro.cc -o test_O1 && ./test_O1   # compare
-hipcc -O2 minimal_repro.cc -o test_O2 && ./test_O2   # compare
-```
-
-**Step 3: Identify the optimisation pass**. For LLVM-based compilers (clang, hipcc, dpcpp, nvcc via ptxas):
-```bash
-# Disable individual optimisation passes:
-hipcc -O2 -mllvm -disable-loop-unrolling minimal_repro.cc -o test
-hipcc -O2 -fno-vectorize minimal_repro.cc -o test
-hipcc -O2 -fno-slp-vectorize minimal_repro.cc -o test
-```
-
-**Step 4: Inspect the generated code**. For CUDA/HIP, use `--generate-line-info` and `cuobjdump` or `roc-obj-extract` to get annotated assembly:
-```bash
-# CUDA
-nvcc -O2 --generate-line-info --keep minimal_repro.cu
-cuobjdump --dump-ptx minimal_repro.o
-# HIP/ROCm
-hipcc -O2 --save-temps minimal_repro.cc
-llvm-objdump -d minimal_repro.o
-# SYCL/DPC++
-icpx -O2 -fsycl -Xclang -ast-dump minimal_repro.cc 2>&1 | grep -A5 "suspicious_expr"
-```
-
-Look for: incorrect register spill/fill sequences, loop trip count miscalculation, vectorisation across iteration boundaries, incorrect address arithmetic.
-
-## Known Compiler Bug Patterns in GPU Code
-
-### Register Pressure / Spill Bugs
-High register usage forces spills to local memory. Some compiler versions generate incorrect spill/fill code — the value is written to local memory but a stale register value is read back instead of the spilled value.
-
-**Signature**: Wrong answer with high-register-count kernels; becomes correct when `--maxrregcount=N` forces lower register count (more spilling) or higher (`--maxrregcount=256`, fewer spills).
-
-**Diagnostic**: Check register usage:
-```bash
-nvcc -O2 --ptxas-options=-v minimal_repro.cu 2>&1 | grep "registers"
-hipcc -O2 --offload-arch=gfx90a --save-temps minimal_repro.cc
-llvm-mc --arch=amdgcn minimal_repro.s 2>&1 | grep "VGPRs"
-```
-
-### Vectorisation Across Loop Boundaries
-The compiler vectorises two successive loop iterations as a SIMD unit when they have a data dependency that the compiler has incorrectly determined does not exist.
-
-**Signature**: Wrong answer that becomes correct when the loop body is extracted to a non-inlined function (disabling auto-vectorisation across iterations).
-
-### Incorrect Constant Propagation
-The compiler evaluates a compile-time expression incorrectly, substituting a wrong constant. Common in template-heavy code where `sizeof(T)` or `alignof(T)` is used in arithmetic that the compiler folds at compile time.
-
-**Signature**: Wrong array index or wrong stride. Inspecting the generated assembly shows a literal constant where you expect a computed value.
-
-## Stress Patterns for Compiler Validation
-
-These patterns exercise the compiler in ways that commonly expose bugs:
-
-```cpp
-// 1. Aliased pointer write followed by immediate read
-// (tests correct handling of write-after-write in register allocation)
-__global__ void alias_stress(double *a, double *b, int n) {
-    int i = blockIdx.x * blockDim.x + threadIdx.x;
-    if (i < n) {
-        a[i] = a[i] * 2.0;
-        b[i] = a[i] + 1.0;  // must read the updated value, not the original
-    }
-}
-
-// 2. Mixed-precision accumulation
-// (tests correct type promotion in FMA sequences)
-__global__ void precision_stress(float *in, double *out, int n) {
-    double acc = 0.0;
-    for (int i = 0; i < n; i++) acc += (double)in[i];
-    *out = acc;
-}
-
-// 3. Large struct in shared memory
-// (tests alignment and offset calculation for non-power-of-2-sized objects)
-struct S { double x[3]; };  // sizeof = 24 bytes, not a power of 2
-__global__ void struct_stress(S *in, S *out, int n) {
-    extern __shared__ S smem[];
-    int tid = threadIdx.x;
-    smem[tid] = in[tid];
-    __syncthreads();
-    out[tid] = smem[(tid + 1) % blockDim.x];
-}
-```
-
-## Separating Compiler from Runtime/Hardware
-
-When results are deterministically wrong:
-
-| Test | Compiler bug | Runtime/hardware bug |
-|---|---|---|
-| Recompile at -O0 | Fixes it | No effect |
-| Run on CPU (host code equivalent) | Fixes it | No effect |
-| Reorder loop iterations | Changes wrong answer | No effect or different pattern |
-| Different compiler version | Fixes or changes wrong answer | No effect |
-| Different GPU of same model | Same wrong answer | Different or no error |
-| Different GPU model | Fixes it (ISA-specific codegen bug) | May or may not fix |
-
-## Reporting to Compiler Teams
-
-A compiler bug report needs:
-1. Minimal reproducer (< 50 lines)
-2. Compiler version (`hipcc --version`, `nvcc --version`, `icpx --version`)
-3. GPU model and driver version
-4. Exact wrong and correct answers (hexfloat for reproducibility)
-5. Which compile flags change the behaviour
-6. Generated assembly for the correct and incorrect variants
-
-File with: LLVM Bugzilla (for hipcc/clang/dpcpp backends), NVIDIA bug portal (nvcc/ptxas), or vendor-specific developer forum. The minimal reproducer is the single most important element — without it, compiler teams cannot prioritise.
-
-## Pragmatic In-Production Workaround
-
-When a compiler bug is confirmed but the fix is not yet available, the lowest-risk workaround is to mark the affected function with reduced optimisation:
-
-```cpp
-#pragma clang optimize off  // clang/hipcc/dpcpp
-void __attribute__((optimize("O0"))) affected_kernel_host_wrapper() { ... }
-// For device code, use per-file compilation flags via CMake/Makefile
-```
-
-Document the workaround with a comment referencing the compiler bug report number so it can be removed when the compiler is updated.
@@ -1,169 +0,0 @@
---
-name: correctness-verification
-description: Implement application-level correctness verification for HPC codes on unreliable hardware — double-run pattern, deterministic reductions, per-packet checksums, and flight recorder step logging.
-user-invocable: true
-allowed-tools:
-  - Read
-  - Bash(grep -r)
---
-
-# Correctness Verification Infrastructure for HPC Codes
-
-## The Problem
-
-Leadership computing facilities sometimes have hardware or firmware bugs below the level visible to application code. The accelerator runtime can return from `q.wait()` or `cudaDeviceSynchronize()` before work is actually complete, or silently produce wrong answers in DMA transfers. Standard testing does not catch these because they are non-deterministic and often topology-dependent (fail only at specific process counts or on specific node configurations).
-
-The symptoms look like numerical instabilities, random MPI hangs, or wrong physics results — not like crashes. Without deliberate infrastructure, diagnosing root cause takes months.
-
-## The Double-Run Pattern
-
-The most reliable correctness check for non-deterministic hardware bugs is to run every computation twice and compare bit-identical fingerprints.
-
-**Key constraint**: the second run must use a *deterministic* code path. Non-deterministic floating-point ordering (e.g. from MPI_Allreduce with different reduction trees on retry) produces false mismatches. See `mpi-heterogeneous.md` for how to make reductions deterministic.
-
-```cpp
-// Pseudocode: double-run a step and compare CRC fingerprints
-void run_step_verified(State &state) {
-    state.save_checkpoint();
-    
-    uint64_t crc_a = run_step_and_fingerprint(state);
-    state.restore_checkpoint();
-    uint64_t crc_b = run_step_and_fingerprint(state);
-    
-    if (crc_a != crc_b) {
-        report_mismatch("step", crc_a, crc_b);
-        // Policy: abort, retry from checkpoint, or continue with alarm
-    }
-}
-```
-
-**Fingerprinting**: XOR-fold a CRC32 over all floating-point data after each step. XOR is order-independent, so it works across distributed nodes without communication. For field data:
-
-```cpp
-uint64_t fingerprint(const double *data, size_t n) {
-    uint64_t acc = 0;
-    for (size_t i = 0; i < n; i++) {
-        uint64_t bits;
-        memcpy(&bits, &data[i], sizeof(bits));
-        acc ^= crc32(bits);
-    }
-    return acc;
-}
-```
-
-On GPU, compute the XOR reduction on-device (avoids D2H transfer of the full field):
-
-```cpp
-// SYCL
-uint64_t svm_xor(uint64_t *vec, uint64_t L) {
-    uint64_t ret = 0;
-    { sycl::buffer<uint64_t,1> abuff(&ret, {1});
-      theGridAccelerator->submit([&](sycl::handler &cgh) {
-        auto R = sycl::reduction(abuff, cgh, uint64_t(0), std::bit_xor<>());
-        cgh.parallel_for(sycl::range<1>{L}, R,
-          [=](sycl::id<1> i, auto &sum) { sum ^= vec[i]; });
-      }); }
-    theGridAccelerator->wait();
-    return ret;
-}
-```
-
-## Per-Packet Communication Checksums
-
-Silent data corruption in MPI buffers (documented in MPICH with device-resident buffers; see `mpi-heterogeneous.md`) requires per-packet verification, not just end-to-end. The pattern:
-
-1. Before packing a send buffer, compute a GPU-side checksum of the payload.
-2. Append the checksum to the host staging buffer alongside the data.
-3. After receiving and copying to device, recompute the checksum on-device and compare.
-
-Salt each checksum with `packet_index + 1000 * mpi_tag` to detect transposition (packet A landing in packet B's slot):
-
-```cpp
-uint64_t salt = (uint64_t)packet_index + 1000ULL * mpi_tag;
-checksum_send = checksum_gpu(payload_gpu, payload_words) ^ salt;
-// ... transmit payload + checksum_send ...
-checksum_recv = checksum_gpu(payload_gpu_recv, payload_words) ^ salt;
-assert(checksum_recv == checksum_send);
-```
-
-Grid reference: `Grid/communicator/Communicator_mpi3.cc`, `#ifdef GRID_CHECKSUM_COMMS`.
-
-## Flight Recorder: Step-Level Logging
-
-Maintain a monotonic counter that names the current operation. On a hang, this is the only way to know *which* operation the process is stuck in without a debugger.
-
-```cpp
-struct FlightRecorder {
-    std::atomic<uint64_t> step_counter{0};
-    const char *step_name = "init";
-    
-    void step_log(const char *name) {
-        step_name = name;
-        step_counter.fetch_add(1, std::memory_order_relaxed);
-    }
-};
-extern FlightRecorder gRecorder;
-```
-
-In Record mode, also store floating-point norms and communication checksums to vectors. In Verify mode, compare against stored values:
-
-```cpp
-void norm_log(double val) {
-    if (mode == Record) norm_log_vec.push_back(val);
-    if (mode == Verify) {
-        double expected = norm_log_vec[norm_counter];
-        if (val != expected) {  // bit-exact for deterministic paths
-            std::cerr << "MISMATCH at step " << step_counter
-                      << " (" << step_name << "): "
-                      << std::hexfloat << val << " vs " << expected << "\n";
-            print_backtrace();
-        }
-        norm_counter++;
-    }
-}
-```
-
-Grid reference: `Grid/util/FlightRecorder.h`, `Grid/util/FlightRecorder.cc`.
-
-## Signal Handler for Hang Detection
-
-Install a SIGHUP handler that dumps the current flight recorder state. This is async-safe only if the handler writes to a pre-allocated buffer using `write()` (not `printf`):
-
-```cpp
-static char hang_buf[4096];
-
-static void sighup_handler(int) {
-    int n = snprintf(hang_buf, sizeof(hang_buf),
-        "rank=%d step=%llu name=%s\n",
-        mpi_rank,
-        (unsigned long long)gRecorder.step_counter.load(),
-        gRecorder.step_name);
-    write(STDERR_FILENO, hang_buf, n);
-    // Optional: call backtrace_symbols_fd (async-safe on Linux)
-    void *frames[64];
-    int depth = backtrace(frames, 64);
-    backtrace_symbols_fd(frames, depth, STDERR_FILENO);
-}
-
-// In main():
-signal(SIGHUP, sighup_handler);
-```
-
-To diagnose a hang across all ranks: `kill -HUP $(pgrep my_app)` or via job scheduler.
-
-## What to Verify at Each Step
-
-| Data type | Fingerprint method | Frequency |
-|---|---|---|
-| Lattice fields | XOR of CRC32 over float64 words | Every algorithmic step |
-| Communication buffers | GPU XOR reduction, salted | Every MPI operation |
-| Scalar reductions | Bit-exact match of double | Every GlobalSum |
-| Iteration counters | Exact integer match | Every solver iteration |
-
-## When to Abort vs Continue
-
- **Abort immediately**: communication checksum mismatch (data is corrupt, continuing will silently propagate errors).
- **Log and continue**: norm mismatch in Verify mode if you need to map out which operations are unreliable.
- **Retry from checkpoint**: double-run mismatch when the underlying bug is non-deterministic (second retry will usually pass).
-
-Track the mismatch rate over a production run. A rate above ~1/1000 steps indicates a systemic hardware issue that should be escalated to the facility.
@@ -1,205 +0,0 @@
---
-name: gpu-memory-performance
-description: Diagnose and fix GPU memory bandwidth and occupancy problems in Grid HPC kernels — acceleratorThreads() pitfalls, LambdaApply thread mapping, coalescedRead/Write idiom, when to use accelerator_for vs a hand-rolled __global__ kernel, and fused vs staged HBM access patterns.
-user-invocable: true
-allowed-tools:
-  - Read
-  - Bash(grep -r)
---
-
-# GPU Memory Performance in Grid
-
-## Nsimd on GPU builds
-
-With `GEN_SIMD_WIDTH=64B` (the typical production setting), `Nsimd` is **not 1**:
-
-| Scalar type | `sizeof` | `Nsimd = 64 / sizeof` |
-|---|---|---|
-| `ComplexD` | 16 B | **4** |
-| `ComplexF` | 8 B  | **8** |
-| `RealD`    | 8 B  | **8** |
-| `RealF`    | 4 B  | **16** |
-
-So for `LatticePropagatorD` (scalar type `ComplexD`), `Nsimd=4` and the SIMD lane runs `threadIdx.x ∈ {0,1,2,3}`.  True `Nsimd=1` scalar-GPU builds are the exception, not the rule.
-
-## The acceleratorThreads() Trap
-
-`acceleratorThreads()` is a runtime-settable global (default **8**) that controls the `blockDim.y` of every `accelerator_for` launch.  It is NOT the SIMD width — it is the number of sites processed per block in the y-dimension.
-
-```cpp
-// Grid/threads/Accelerator.cc
-uint32_t accelerator_threads = 8;
-```
-
-With `accelerator_for(ss, osites, nsimd, ...)`, the launch is:
-
-```
-dim3 threads(nsimd, acceleratorThreads(), 1)
-dim3 blocks ((osites + acceleratorThreads() - 1) / acceleratorThreads(), 1, 1)
-```
-
-Total threads per block = `Nsimd × acceleratorThreads()`.  With `GEN_SIMD_WIDTH=64B` and `Nsimd=8` (fp32 / ComplexF):
-
-| `acceleratorThreads()` | threads/block | AMD wavefront | note |
-|---|---|---|---|
-| 2 (original default) | 16 | 25% | sub-wavefront, poor |
-| 4 | 32 | 50% | half wavefront |
-| **8 (current default)** | **64** | **100%** | **one full wavefront — Dslash sweet spot** |
-| 16 | 128 | 200% | two wavefronts/block; register-pressure cliff for heavy kernels |
-| 32 | 256 | 400% | severe register spill for stencil kernels |
-
-**Why 8 and not higher?**  Compute-heavy kernels like the Domain Wall Dslash carry many live registers per thread (spinors + gauge links + projections).  Doubling `acceleratorThreads` from 8→16 doubles the register demand per block, which on AMD GFX90A triggers a hard occupancy cliff: `Benchmark_dwf_fp32` drops from 1.7 TF/s (nt=8) to ~300 GF/s (nt=16).  The sweet spot is one full wavefront per block, which with `Nsimd=8` (fp32) means `nt=8`.
-
-For fp64 work (`Nsimd=4`), `nt=8` gives 32 threads = half a wavefront (AMD pads to 64 with idle lanes).  Kernels that are not register-limited (e.g. simple lattice arithmetic) can benefit from `--accelerator-threads 16` at runtime.  Reduction kernels bypass `acceleratorThreads()` entirely via `getNumBlocksAndThreads`.
-
-**Why was the default ever 2?**  Before the `threadIdx.x`/`threadIdx.y` remap (see LambdaApply section below), the site index lived in `threadIdx.x` — the fast, coalescing dimension.  Increasing `acceleratorThreads` widened the block in the *site* direction, so adjacent threads in a warp hit adjacent sites, each stride `sizeof(vobj)` apart in AoS memory — breaking coalescing.  On early NVIDIA ports with `Nsimd≈8`, `nt=2` gave 16 threads = 50% of a 32-thread warp; NVIDIA recovers this via multiple concurrent blocks per SM, so occupancy was barely tolerable.  AMD has no such multiplier when blocks are already sub-wavefront.  After the remap put the site index in `threadIdx.y` and the SIMD lane in `threadIdx.x`, coalescing became independent of `acceleratorThreads`, removing the constraint.
-
-**Diagnostic**: observed bandwidth << peak, kernel time >> expected from data volume.  Check with `--accelerator-threads 32` at runtime.  A large speedup confirms occupancy starvation.
-
-**Fix options** (in order of preference):
-1. Kernel needs its own thread count — use `getNumBlocksAndThreads` and launch a `__global__` kernel directly (see below).
-2. Temporarily acceptable: set `--accelerator-threads 32` at the application level.  Note this affects every `accelerator_for` site in the binary.
-
-## LambdaApply Thread Mapping
-
-`accelerator_for` and `accelerator_for2d` go through `LambdaApply`:
-
-```cpp
-// HIP/CUDA LambdaApply kernel:
-uint64_t x = threadIdx.y + blockDim.y * blockIdx.x;  // iter1 (site index)
-uint64_t y = threadIdx.z + blockDim.z * blockIdx.y;  // iter2
-uint64_t z = threadIdx.x;                             // lane (SIMD lane)
-Lambda(x, y, z);
-```
-
-`threadIdx.x` is the **fast** (lane) dimension — consecutive thread IDs within a warp/wavefront correspond to consecutive lane values on the **same** site, not consecutive sites.
-
-With `GEN_SIMD_WIDTH=64B` and `Nsimd=4` (PropagatorD), a 64-thread AMD wavefront contains 64/4 = 16 sites, each processed by 4 lanes.  Adjacent threads within the wavefront read different lanes of the same site — this is a broadcast pattern (hardware handles this efficiently), not a stride.  The stride between consecutive *sites* (`sizeof(vobj)`) only appears between groups of `Nsimd` threads, spaced `Nsimd` apart in threadIdx.x — not between adjacent threads within a warp.
-
-## coalescedRead / coalescedWrite
-
-These are Grid's canonical way to read/write one SIMD lane from a vector type inside a `GRID_SIMT` kernel:
-
-```cpp
-// accelerator_for(ss, osites, Nsimd, {
-//   lane = acceleratorSIMTlane(Nsimd) = threadIdx.x  ∈ {0..Nsimd-1}
-auto scalar_val = coalescedRead(field[ss]);       // extractLane(lane, field[ss])
-coalescedWrite(field[ss], scalar_val);             // insertLane(lane, field[ss], scalar_val)
-```
-
-For `vobj` aggregate types, `coalescedRead` calls `extractLane(lane, vobj)` which recurses through the tensor hierarchy and returns `vobj::scalar_object`.
-
-For `vsimd` (raw SIMD vector) types, it casts to `scalar_type*` and indexes with `lane`.
-
-## Coalescing the Iteration Structure
-
-For an AoS input array where each site is `words` 16-byte elements, adjacent threads reading the same site's consecutive words achieve coalesced access:
-
-```cpp
-// Good: k varies across threads in a block → consecutive 16-byte reads
-accelerator_for2d(k, R, ss, osites, Nsimd, {
-    coalescedWrite(out[ss]._internal[k],
-                   coalescedRead(idat[ss * words + base + k]));
-});
-// dim3(Nsimd, nt, 1): threadIdx.y = k (consecutive words, coalesced)
-//                      threadIdx.x = lane (SIMD sub-lane, coalesced for Nsimd>1)
-```
-
-```cpp
-// Bad: each thread reads all R words of its site serially
-accelerator_for(ss, osites, 1, {
-    Bundle b;
-    for (int k = 0; k < R; k++)
-        b._internal[k] = idat[ss * words + base + k];  // serial, not coalesced across threads
-    out[ss] = b;                                         // bulk struct write
-});
-```
-
-The bad pattern also accumulates a large struct in registers (192 bytes for R=12), increasing register pressure and reducing occupancy further.
-
-## When to Use a __global__ Kernel Instead of accelerator_for
-
-`accelerator_for` is correct for site-parallel work where `acceleratorThreads()` is tuned appropriately.  Use a direct `__global__` kernel when:
-
- The kernel requires a **specific thread count** for correctness or performance (reductions, shared-memory algorithms).
- The optimal thread count depends on `sizeof(sobj)` and `sharedMemPerBlock`, not on a runtime global.
- You need the retirement-count pattern for cross-block final reduction.
-
-Pattern: use `getNumBlocksAndThreads` to pick `numThreads` and `numBlocks`:
-
-```cpp
-Integer numThreads, numBlocks;
-int ok = getNumBlocksAndThreads(n, sizeof(sobj), numThreads, numBlocks);
-// starts at warpSize (32/64), doubles while 2*threads*sizeof(sobj) < sharedMemPerBlock
-// gives 64–256 threads/block → correct occupancy independent of acceleratorThreads()
-Integer smemSize = numThreads * sizeof(sobj);
-myKernel<<<numBlocks, numThreads, smemSize, computeStream>>>(args...);
-```
-
-Grid's `reduceKernel` uses this pattern and achieves ~400 GB/s on MI250X.
-
-## Fused vs Staged HBM Access
-
-A staged pack+reduce reads the data **three times**:
-
-```
-pack kernel:   reads vobj array (N bytes), writes bundle buffer (N bytes)
-reduce kernel: reads bundle buffer (N bytes), writes tiny result buffer
-```
-
-Total HBM: 3N bytes for N bytes of useful input.
-
-A fused kernel reads the data **once**:
-
-```
-packReduceKernel: reads R words of vobj array (N bytes), reduces in-place
-```
-
-Total HBM: N bytes.  Register pressure increases (R words held per thread) but the 3× HBM saving dominates for large objects.
-
-The fused pattern in Grid's `sumD_gpu_reduce_words<R>`:
-
-```cpp
-template <int R, class vobj, class sobj, class Iterator>
-__device__ void packReduceBlocks(
-    const iScalar<typename vobj::vector_type> *idat,
-    sobj *g_odata, Iterator osites, int base, int words)
-{
-  // sobj = iVector<iScalar<scalarD>, R>  (R double-precision scalars per site)
-  constexpr Iterator nsimd = vobj::Nsimd();
-  ...
-  while (i < osites * nsimd) {
-    Iterator lane = i % nsimd;
-    Iterator ss   = i / nsimd;
-    sobj tmpD; zeroit(tmpD);
-    for (int k = 0; k < R; k++) {
-      auto w = extractLane(lane, idat[ss * words + base + k]);
-      iScalar<typename vobj::scalar_typeD> wd; wd = w;   // float→double promotion
-      tmpD._internal[k] = wd;
-    }
-    mySum += tmpD;
-    ...
-  }
-  reduceBlock(sdata, mySum, tid);
-}
-```
-
-Launched with `getNumBlocksAndThreads` → 128 threads/block for R=12 (`BundleScalarD`=192 B, sharedMem=64 KB) → correct occupancy without depending on `acceleratorThreads()`.
-
-## Observed Numbers on MI250X (32^4 LatticePropagatorD, Nsimd=4, GEN_SIMD_WIDTH=64B)
-
-| Configuration | pack µs/group | reduce µs/group | total µs | GB/s |
-|---|---|---|---|---|
-| acceleratorThreads=2 (8 threads/block), staged | 10,080 | 470 | 126,909 | 50 |
-| acceleratorThreads=16 (64 threads/block), staged | 342 | 310 | 8,251 | 297 |
-| acceleratorThreads=16, fused (128 threads/block via getNumBlocksAndThreads) | — | 349 | 4,584 | 546 |
-
-The fused kernel at 349 µs/group reads 201 MB at 576 GB/s — 36% of MI250X HBM peak.  The remaining gap from peak is the in-kernel serial loop over R=12 words and the 12 serial kernel launches.
-
-## Quick Checklist When a Kernel Is Slow
-
-1. **Check Nsimd**: `GEN_SIMD_WIDTH=64B` → Nsimd=4 (ComplexD), 8 (ComplexF).  Total threads/block = `Nsimd × acceleratorThreads()`.  With old default nt=2 and Nsimd=4: 8 threads = 12.5% of AMD wavefront.
-2. Check threads per block: for `accelerator_for` kernels use `--accelerator-threads 32` and measure; a large speedup confirms occupancy starvation.
-3. Check for bulk struct accumulation in registers (`Bundle b; for(...) b._internal[k] = ...;`).  Replace with per-element writes via `coalescedWrite`.
-4. Check for staged HBM access (pack → buffer → reduce).  Count the passes; fuse if ≥ 2 passes over the same data.
-5. For reduction kernels, always use `getNumBlocksAndThreads` rather than `accelerator_for` so thread count is independent of `acceleratorThreads()`.
@@ -1,101 +0,0 @@
---
-name: gpu-runtime-correctness
-description: Detect and work around GPU runtime correctness failures — premature completion signalling, infinite poll hangs, stale completion flags, and the double-wait diagnostic pattern. Covers CUDA, HIP/ROCm, and SYCL/Level Zero runtimes.
-user-invocable: true
-allowed-tools:
-  - Read
-  - Bash(grep -r)
---
-
-# GPU Runtime Correctness
-
-## The Completion Signalling Problem
-
-GPU runtimes expose a synchronisation primitive — `cudaDeviceSynchronize()`, `hipDeviceSynchronize()`, `q.wait()` — that is supposed to block until all previously submitted GPU work is complete. On several production systems, this guarantee has been violated in two distinct ways:
-
-### Failure Mode A: Premature Return
-The wait returns before the GPU work is done. The subsequent CPU code reads stale data from the output buffer. This is the most dangerous failure because it looks like a numerical instability, not a crash. Results are wrong but the program exits normally.
-
-**Identifying Premature Return**: Insert a second, independent wait immediately after the first. If a second `q.wait()` "fixes" incorrect results that appeared with a single `q.wait()`, the first wait was returning prematurely.
-
-```cpp
-// Diagnostic version — if this stabilises results, you have premature return
-accelerator_barrier();  // first wait
-accelerator_barrier();  // second wait (diagnostic)
-```
-
-Production fix: submit a trivially cheap no-op kernel after the real work and wait for it. The no-op kernel cannot complete until all previous commands in the queue are done (command queue ordering guarantee), so waiting for the no-op is a stronger barrier than waiting for the queue itself:
-
-```cpp
-// Lightweight fence kernel
-template<class T>
-__global__ void noop_kernel(T *p) { if (threadIdx.x == 0) (void)(*p); }
-
-void strong_barrier(T *device_ptr) {
-    noop_kernel<<<1, 1, 0, computeStream>>>(device_ptr);
-    cudaStreamSynchronize(computeStream);  // wait for the no-op
-}
-```
-
-### Failure Mode B: Infinite Poll
-The wait enters a polling loop that never terminates. The process consumes 100% CPU in a runtime library. The GPU has either stopped signalling progress entirely, or the completion flag is in a memory region that has become incoherent.
-
-This is distinct from Failure Mode A: with premature return the CPU proceeds; with infinite poll the CPU is stuck.
-
-**Identifying Infinite Poll**: `top` shows the MPI rank at 100% CPU. `perf top -p PID` or `strace -p PID` shows the process burning cycles inside the GPU runtime library (e.g. `libze_intel_gpu.so`, `libamdhip64.so`).
-
-**Documented instances**: 
- Intel Level Zero on Pontevecchio (Aurora): both premature return *and* infinite poll have been observed as independent bugs on the same system.
- The two failure modes can co-exist and have overlapping symptoms at the application level.
-
-## Completion Signalling Architecture
-
-Understanding why these bugs happen requires knowing how completion signalling works:
-
-```
-GPU command processor
-  → signals completion by writing to a host-visible memory address
-  → CPU runtime polls that address (or uses OS event notification via ioctl)
-```
-
-A premature return means the memory write happened before the actual work completed (e.g. the signal is on a different command stream that has not been serialised with the work stream). An infinite poll means the memory write never happens (hardware or driver bug preventing the signal from being written).
-
-**Implication**: `accelerator_barrier()` is not an unconditional correctness guarantee on all production systems. Application-level verification (double-run, checksums) is necessary as a second line of defence.
-
-## The Double-Wait Pattern in Practice
-
-The double-wait is a pragmatic workaround when premature return is suspected but not yet confirmed. It adds latency but does not change correctness if the barrier is working properly, so it is safe to enable in production:
-
-```cpp
-#ifdef WORKAROUND_PREMATURE_BARRIER
-#define accelerator_barrier() do { \
-    real_accelerator_barrier();    \
-    real_accelerator_barrier();    \
-} while(0)
-#endif
-```
-
-Monitor whether this changes observed behaviour. If double-wait eliminates wrong answers, you have confirmed premature return. If it does not help but inserting a no-op kernel does, the issue is with the wait primitive specifically, not with the underlying completion signal.
-
-## SYCL/Level Zero Specifics
-
-Level Zero (the backend for Intel GPU runtimes) separates command submission from synchronisation. A `q.wait()` should wait for all previously submitted command lists to retire. Documented bugs include:
-
- `q.wait()` returning before the associated fence in Level Zero has been signalled.
- `q.wait()` entering an `ioctl(i915, I915_GEM_WAIT)` call that never returns (kernel driver bug, not runtime bug).
-
-The latter requires a node reboot and cannot be worked around in application code. Detect it by checking process state (`D` in `ps aux`) and the kernel function via `/proc/PID/wchan`.
-
-## Stream Ordering and Compute Streams
-
-All GPU work must be submitted to the *same* stream/queue if you rely on in-order execution guarantees. Mixing default stream and non-default streams invalidates ordering assumptions on some backends.
-
-Grid uses `computeStream` (CUDA/HIP) or `theGridAccelerator` (SYCL) consistently throughout. If mixing Grid with third-party GPU code, ensure the third-party code is directed to the same stream, or insert explicit inter-stream barriers.
-
-## Checklist for New GPU Code
-
-1. Every kernel launch is followed by an `accelerator_barrier()` before reading device-side output on the host.
-2. All device-to-host copies use an explicit stream synchronisation after the copy, not before.
-3. If results are non-deterministic across runs, insert a second barrier and observe whether reproducibility improves.
-4. For correctness-critical operations (reductions that will be compared against reference values), add the double-run checksum test from `correctness-verification.md`.
-5. If the process hangs at 100% CPU in a runtime library function, this is a driver/runtime bug — there is no application-level fix beyond scheduling a node reboot.
@@ -1,102 +0,0 @@
---
-name: hang-diagnosis
-description: Diagnose and isolate process hangs on HPC systems — distinguishing kernel-level ioctl hangs, infinite poll loops, collective deadlocks, and GPU completion signalling failures using async-safe signal handlers and flight recorder step counters.
-user-invocable: true
-allowed-tools:
-  - Read
-  - Bash(grep -r)
-  - Bash(strace)
-  - Bash(gdb)
---
-
-# Hang Diagnosis on HPC Systems
-
-## Taxonomy of Hangs
-
-Not all hangs are the same. Misidentifying the type leads to wrong mitigation. The four distinct classes encountered on production leadership systems:
-
-### 1. Kernel-level ioctl hang (never returns)
-The process is in `D` (uninterruptible sleep) state. `strace` shows it blocked in an `ioctl` syscall. The GPU device driver has entered an unrecoverable state.
-
-**Diagnosis**: `ps aux | grep D` — the process shows `D` state. `cat /proc/PID/wchan` shows `i915_gem_wait_for_error` or similar.
-
-**Resolution**: Only a driver reload or node reboot recovers it. Log the node identifier and request replacement from the facility scheduler.
-
-### 2. Infinite poll loop (`q.wait()` or `cudaDeviceSynchronize()` never returns)
-The process is in `R` (running) state, consuming 100% CPU. A polling loop inside the runtime is checking a completion flag that never becomes true, either because the hardware never sets it or because the flag is in a memory region not visible to the polling thread.
-
-**Diagnosis**: `top` shows the rank at 100% CPU. `strace -p PID` shows repeated `futex` or `read` syscalls with zero-length results, or no syscalls at all (pure spinloop). `perf top -p PID` shows the process burning cycles in a single tight loop in a runtime library (e.g., `ze_intel_gpu.so`).
-
-**Resolution**: The double-wait workaround — submit a trivially cheap kernel after the operation under test to act as a fence, then wait for the trivial kernel. See `gpu-runtime-correctness.md`.
-
-### 3. Collective deadlock
-One or more ranks are blocked in an MPI call, usually `MPI_Allreduce` or `MPI_Barrier`, while others are not. Root cause: a topology-dependent bug in the MPI library's collective algorithm where some ranks' contributions never arrive.
-
-**Diagnosis**: Flight recorder step logs show some ranks at step N (inside the collective) while others are at step N+1 or stuck at step N with different `step_name` strings. The hung ranks will show `D` or `S` state in `ps`.
-
-**Resolution**: Replace `MPI_Allreduce` with a deterministic point-to-point tree reduction. See `mpi-heterogeneous.md`.
-
-### 4. Premature return from wait (silent wrong answer, not a hang)
-The runtime returns from `q.wait()` before the GPU work is complete. The next operation reads stale data. This is not a hang — it manifests as a wrong answer or non-deterministic floating-point results. It is listed here because it is the most confusing failure mode: the code appears to run correctly and completes normally.
-
-**Diagnosis**: Double-run with checksum (see `correctness-verification.md`). Insert a second `q.wait()` after the first and observe if results become reproducible. If inserting the second wait "fixes" wrong answers, the first wait was returning prematurely.
-
-## Flight Recorder for Hang Localization
-
-The most important diagnostic tool is knowing *which operation* a process is in when it hangs. Maintain a named step counter:
-
-```cpp
-// Call at the start of every major operation
-FlightRecorder::StepLog("MPI_Allreduce::norm");
-// ... do the operation ...
-FlightRecorder::StepLog("MPI_Allreduce::done");
-```
-
-On SIGHUP, dump rank, step counter value, and step name to stderr in an async-safe manner:
-
-```cpp
-static void sighup_handler(int) {
-    char buf[256];
-    int n = snprintf(buf, sizeof(buf), "rank %d: step %llu '%s'\n",
-        comm_rank,
-        (unsigned long long)step_counter,
-        step_name);
-    write(2, buf, n);
-    // backtrace_symbols_fd is async-safe on Linux glibc
-    void *frames[32];
-    backtrace_symbols_fd(frames, backtrace(frames, 32), 2);
-}
-signal(SIGHUP, sighup_handler);
-```
-
-Broadcast SIGHUP to all ranks from outside the job:
-```bash
-# In a separate shell while the job is hung
-squeue --job $JOBID -o "%i %N" | awk '{print $2}' | \
-  xargs -I{} ssh {} "pkill -SIGHUP -f my_application"
-```
-
-The step names from all ranks will reveal which collective operation has diverged.
-
-## Distinguishing Driver Hang from MPI Hang
-
-| Symptom | Driver hang | MPI hang |
-|---|---|---|
-| Process state | `D` (ioctl) or `R` (spinloop) | `S` (blocked in syscall) |
-| `strace` | blocked `ioctl` or tight loop | blocked `recvmsg` / `read` |
-| Scope | single rank / single node | subset of ranks, pattern-dependent |
-| Recovery | reboot node | cancel job |
-| Flight recorder | step name is a GPU operation | step name is a collective |
-
-## Reducing Diagnostic Time
-
-1. **Name every collective operation** in the flight recorder before calling it.
-2. **Separate GPU work from MPI work** in the code so the step name unambiguously identifies which subsystem is hung.
-3. **Log node identifiers** alongside step names so flaky nodes can be identified and blacklisted.
-4. **Request flight recorder dumps from all ranks simultaneously** (SIGHUP broadcast) rather than attaching a debugger — attaching `gdb` to one rank of a hung MPI job usually deadlocks the debugger too.
-
-## What Not to Do
-
- Do not `kill -9` a hung rank immediately — get the flight recorder dump first, otherwise diagnostic information is lost.
- Do not assume the first rank that prints an error is the faulty one — collective hangs are frequently caused by the *last* rank to arrive at the barrier.
- Do not use `MPI_Abort` in the hang handler — it may itself hang on some implementations. Use `_exit(1)` to force termination.
@@ -1,182 +0,0 @@
---
-name: mpi-heterogeneous
-description: Diagnose and work around MPI correctness bugs on heterogeneous (CPU+GPU) systems — device buffer aliasing in MPI_Sendrecv, AARCH64 PLT corruption from libfabric, topology-dependent allreduce hangs, mixed-ABI HIP runtime from wrong GTL library (Frontier/ROCm), and deterministic point-to-point reduction trees as a replacement for MPI_Allreduce.
-user-invocable: true
-allowed-tools:
-  - Read
-  - Bash(grep -r)
---
-
-# MPI Correctness on Heterogeneous HPC Systems
-
-## The Core Problem
-
-MPI libraries were designed for CPU-resident buffers. When GPU-resident buffers are passed directly (GPU-aware MPI / GPU direct RDMA), several correctness assumptions break:
-
- **Buffer aliasing**: The MPI library may internally alias send/receive buffer addresses for `MPI_Sendrecv` in ways that are safe for CPU memory but wrong for GPU memory with different cache coherency rules.
- **RDMA bandwidth**: GPU direct RDMA on some fabrics operates at a fraction of peak wirespeed (documented at ~30% on Pontevecchio/Aurora), making host-staging mandatory for performance even when correctness is not an issue.
- **Collective tree topology**: `MPI_Allreduce` implementations may select reduction trees based on process count or communicator topology that expose rank-ordering bugs, causing hangs on some configurations but not others.
-
-## Bug Class 1: Device Buffer Aliasing in MPI_Sendrecv
-
-**Symptom**: `MPI_Sendrecv` with GPU-resident send and receive buffers produces wrong results. The received data matches neither the expected payload nor a host-staged copy. The failure is *deterministic* for a given problem size and process count, but *history-dependent* — earlier sends affect which alias is selected.
-
-**Root cause**: The MPI library internally reuses GPU buffer addresses for temporary staging without proper device memory ordering. When the same physical GPU memory pages appear in both the send and receive paths, writes from one path corrupt the other.
-
-**Diagnosis**: 
-1. Enable per-packet checksumming (see `correctness-verification.md`). If the checksum on the received packet does not match the sent checksum, the data was corrupted in transit.
-2. Replace `MPI_Sendrecv` with separate `MPI_Isend` + `MPI_Irecv` + `MPI_Waitall`. If this fixes the problem, the bug is in the `MPI_Sendrecv` implementation's internal buffer handling.
-3. Stage through host memory (`cudaMemcpy`/`hipMemcpy` to a host buffer, then `MPI_Sendrecv` on host buffers, then copy back). If this fixes the problem, confirms GPU-specific aliasing.
-
-**Reported as**: MPICH issue #7302. Affects MPICH on Intel Pontevecchio (Aurora) with device-resident buffers.
-
-**Workaround**: Do not use `MPI_Sendrecv` with GPU buffers. Use asynchronous send/receive pairs or host-staging. See `communication-overlap.md` for the full pipeline pattern.
-
-## Bug Class 2: PLT Corruption on AARCH64 (libfabric)
-
-**Symptom**: Application crashes or hangs on first `MPI_Comm_dup` call on AARCH64 systems (e.g. NVIDIA Grace/H200). Backtrace shows a bad instruction in the PLT (Procedure Linkage Table) for `MPI_Comm_dup` — specifically a `br x15` instruction that should instead be a proper trampoline.
-
-**Root cause**: `libfabric`'s memory registration cache monitor patches PLT entries at runtime to intercept memory allocation calls. Its AARCH64 trampoline generation writes an incorrect instruction sequence, leaving `br x15` (branch to whatever happens to be in x15) in the PLT entry. The next call through that PLT entry executes garbage.
-
-**Diagnosis**:
-```bash
-# Check if the PLT entry is corrupted
-objdump -d /proc/PID/exe | grep -A5 "MPI_Comm_dup@plt"
-# Look for "br x15" — this should be a proper stub, not a register branch
-```
-
-Or check the disassembly of the live process:
-```bash
-gdb -p PID -batch -ex "disassemble 'MPI_Comm_dup@plt'"
-```
-
-**Workaround**:
-```bash
-export FI_MR_CACHE_MONITOR=disabled
-```
-This prevents libfabric from patching PLT entries. It may reduce MR cache performance but restores correctness.
-
-**Reported as**: libfabric issue #11451. Affects systems using AARCH64 + libfabric OFI provider (Cray Slingshot, AWS EFA) with memory registration cache enabled.
-
-## Bug Class 3: Topology-Dependent Allreduce Hangs
-
-**Symptom**: `MPI_Allreduce` hangs indefinitely on some node configurations but completes correctly on others. The failure correlates with process count (e.g. fails at 512 ranks, works at 256) or network topology (fails when crossing specific router boundaries).
-
-**Root cause**: The MPI library's collective selection algorithm picks a reduction tree implementation that assumes symmetric participation from all ranks. A bug in one rank's contribution path (e.g. a GPU-side buffer not yet flushed when MPI reads it, due to premature barrier — see `gpu-runtime-correctness.md`) causes that rank to send wrong or incomplete data, and the tree-reduction protocol deadlocks waiting for data that never arrives correctly.
-
-**Diagnosis**: Flight recorder step logging (see `hang-diagnosis.md`). SIGHUP broadcast to all ranks. Ranks that are hung will show step name `MPI_Allreduce::...`; ranks that completed will show the next step. The hung ranks are the *recipients* of the stale data, not necessarily the *cause*.
-
-**Workaround — deterministic P2P reduction tree**:
-
-Replace `MPI_Allreduce` with an explicit point-to-point binary tree reduction. This is slower for large communicators but:
-1. Is immune to topology-dependent collective bugs.
-2. Is deterministic in floating-point ordering (the tree is fixed, not chosen at runtime).
-3. Makes the hang location explicit — each P2P operation is a named step in the flight recorder.
-
-```cpp
-// Binary tree reduction: rank 0 collects, then broadcasts
-void GlobalSumP2P(double *data, int count, MPI_Comm comm) {
-    int rank, size;
-    MPI_Comm_rank(comm, &rank); MPI_Comm_size(comm, &size);
-    
-    // Reduce phase: even ranks receive from odd neighbours
-    for (int stride = 1; stride < size; stride *= 2) {
-        if (rank % (2*stride) == 0) {
-            int partner = rank + stride;
-            if (partner < size) {
-                std::vector<double> tmp(count);
-                MPI_Recv(tmp.data(), count, MPI_DOUBLE, partner, 0, comm, MPI_STATUS_IGNORE);
-                for (int i = 0; i < count; i++) data[i] += tmp[i];
-            }
-        } else if (rank % stride == 0) {
-            int partner = rank - stride;
-            MPI_Send(data, count, MPI_DOUBLE, partner, 0, comm);
-            break;
-        }
-    }
-    // Broadcast phase
-    for (int stride = /* highest power of 2 <= size */; stride >= 1; stride /= 2) {
-        if (rank % (2*stride) == 0) {
-            int partner = rank + stride;
-            if (partner < size)
-                MPI_Send(data, count, MPI_DOUBLE, partner, 0, comm);
-        } else if (rank % stride == 0) {
-            int partner = rank - stride;
-            MPI_Recv(data, count, MPI_DOUBLE, partner, 0, comm, MPI_STATUS_IGNORE);
-        }
-    }
-}
-```
-
-Grid reference: `USE_GRID_REDUCTION` macro in `Grid/communicator/Communicator_mpi3.cc`.
-
-## Bug Class 4: Mixed HIP ABI from Wrong GTL Library (Frontier / ROCm)
-
-**Symptom**: `HIPFFT_PARSE_ERROR` (error code 12) returned by `hipfftPlanMany` / `hipfftMakePlanMany` / `hipfftPlan1d` for FFT sizes G < 32, but G ≥ 32 succeeds. The failure only occurs with an empty rocFFT kernel cache (`~/.cache/rocfft`); a warm cache may mask it. Host-side operations and GPU kernels that do not invoke rocFFT JIT work correctly.
-
-**Root cause — mixed HIP ABI**: rocFFT uses JIT compilation (via `libamd_comgr`) for small transforms (G < 32); for G ≥ 32 it uses pre-compiled device code bundled in the library, so the JIT path is never exercised. When two HIP runtime versions are loaded in the same process — e.g. `libamdhip64.so.7` (ROCm 7) and `libamdhip64.so.6` (ROCm 6) — the rocFFT JIT cannot complete successfully.
-
-The hidden source of the old library is the Cray MPI GPU Transport Layer. On Frontier, `cray-mpich`'s `libmpi_gtl_hsa.so` may be compiled against `libamdhip64.so.6` (ROCm 6 ABI) even when the loaded ROCm module is 7.0.2. Because `LD_LIBRARY_PATH` picks up the GTL directory before the ROCm 7 library directory, `libamdhip64.so.6` is pulled in first, and both ABI versions end up resident in the process.
-
-**Diagnosis**:
-```bash
-# Check which libamdhip64 versions are actually linked into your binary at runtime
-ldd --verbose ./your_binary 2>&1 | grep amdhip
-# Bad output — two different .so versions:
-#   libamdhip64.so.6 => /opt/rocm-6.4.2/lib/libamdhip64.so.6
-#   libamdhip64.so.7 => /opt/rocm-7.0.2/lib/libamdhip64.so.7
-# Good output — only one:
-#   libamdhip64.so.7 => /opt/rocm-7.0.2/lib/libamdhip64.so.7
-```
-
-If two versions appear, the problem is the GTL/LD_LIBRARY_PATH ordering.
-
-**Fix — correct module stack and LD_LIBRARY_PATH ordering (Frontier)**:
-```bash
-module load cce/21.0.0
-module load cpe/26.03
-module load rocm/7.0.2
-# Prepend CRAY_LD_LIBRARY_PATH so the ROCm-7-aware GTL is found first
-export LD_LIBRARY_PATH=$CRAY_LD_LIBRARY_PATH:$LD_LIBRARY_PATH
-# Ensure ROCm 7 LLVM libs (needed by libamd_comgr JIT) are on the path
-export LD_LIBRARY_PATH=/opt/rocm-7.0.2/lib/llvm/lib/:$LD_LIBRARY_PATH
-```
-
-The critical step is prepending `CRAY_LD_LIBRARY_PATH`: this ensures the GTL library built against the ROCm 7 ABI is resolved before any older version that may appear further down `LD_LIBRARY_PATH`. Without this step, a stale symlink or directory ordering can silently load the wrong `libmpi_gtl_hsa.so`.
-
-**Reproducer**: `tests/debug/Test_hipfft_repro.cc` — standalone hipFFT test (no Grid headers) that sweeps G and howmany values matching realistic Grid lattice geometries. Compile with:
-```bash
-hipcc -o Test_hipfft_repro Test_hipfft_repro.cc -lhipfft
-rm -rf ~/.cache/rocfft   # empty cache required to trigger JIT path
-./Test_hipfft_repro
-```
-
-**Reference**: `systems/WorkArounds.txt`, Frontier section — GPU mapping, XPMEM, and `FI_MR_CACHE_MONITOR=disabled` settings for Frontier are documented there.
-
-**Systems affected**: Frontier (ORNL, MI250X). Likely applies to any Cray PE system where the loaded `cray-mpich` GTL was compiled against an older ROCm ABI than the runtime ROCm module. LumiG (CSC, MI250X) uses the same Cray PE and may exhibit the same issue.
-
-## Compile-Time Guard Structure
-
-Recommended macro structure to switch between the workaround paths:
-
-```cpp
-// In configure / CMake, expose as options:
-// ACCELERATOR_AWARE_MPI  — use GPU direct (fast, potentially broken)
-// GRID_CHECKSUM_COMMS    — per-packet checksums (overhead: ~5%)
-// USE_GRID_REDUCTION     — P2P tree instead of MPI_Allreduce (slower, deterministic)
-// FI_MR_CACHE_MONITOR    — libfabric PLT workaround (env var, not compile-time)
-```
-
-On a known-good system, enable `ACCELERATOR_AWARE_MPI` and disable the others. On a system with known bugs, disable `ACCELERATOR_AWARE_MPI` and enable `GRID_CHECKSUM_COMMS` + `USE_GRID_REDUCTION` as needed.
-
-## Escalation Checklist
-
-Before concluding a bug is in your code:
-
-1. [ ] Can you reproduce with a minimal reproducer (two MPI ranks, no physics code)?
-2. [ ] Does the failure rate correlate with buffer size, process count, or network route?
-3. [ ] Does staging through host memory eliminate the failure?
-4. [ ] Is the failure deterministic for a given input (same answer, always wrong) or stochastic?
-5. [ ] Does the failure appear on a different MPI implementation (e.g. OpenMPI vs MPICH)?
-
-Deterministic wrong answers that reproduce with minimal reproducers and disappear with host-staging are strong evidence of an MPI library bug. File with the MPI library issue tracker with the minimal reproducer.
@@ -1,61 +0,0 @@
---
-name: ref_a2a_emf_work
-description: "A2A Extended Meson Field GPU offload work — status, file locations, pending task"
-metadata: 
-  node_type: memory
-  type: project
-  originSessionId: 956e80aa-401d-481a-80bb-17f8abe1c131
---
-
-## What was built
-
-`Grid/algorithms/blas/A2ASpatialSum.h` — batched GEMM spatial sum replacing scalar SIMD accumulation. Included via `Grid/algorithms/Algorithms.h`.
-
-`tests/Test_extended_meson_field.cc` — test with class `A2AExtendedMesonFieldRef` containing:
- CPU reference path (`use_blas=false`)
- BLAS path (`use_blas=true`) using `A2ASpatialSum`
- Per-phase timing with `[ref type=N]` / `[blas type=N]` labels
- 4 contraction types (0-3), all verified at machine precision (~4e-16 rel_err)
-
-## Pending task: GPU offload class
-
-**Goal**: Write `A2AExtendedMesonFieldGPU` in the same test file, replacing all `thread_for` loops with `accelerator_for`-based free function kernels.
-
-The `thread_for` blocks to replace all have the form:
-```cpp
-thread_for(r, rd, {
-  int so = r * grid->_ostride[orthogdim];
-  for (int n = 0; n < e1; n++)
-  for (int b = 0; b < e2; b++) {
-    int ss = so + n * stride + b;
-    // work
-  }
-});
-```
-Replace with `accelerator_for(ss, grid->oSites(), Nsimd, { ... })`.
-
-**Free functions to write** (each takes `Lattice<T>` args, opens views internally):
- `A2ALoopPropagator` — outerProduct sum (loop build)
- `A2APackLeftConjugated` — conjugate left fermion fields into `Lattice<SpinColourVector_v>`
- `A2ALoopLeftContractionType0/1/2/3` — per-site loop × loop propagator → `tloop`
- `A2ALoopRightContractionType0/1/2/3` — per-site tloop × right → `loopRight[j]`
-
-**Data structure changes required**:
- `tloopv`: `std::vector<SpinColourMatrix_v>` → `Lattice<SpinColourMatrix_v>` (PropagatorField)
- `leftv[i]`: `std::vector<SpinColourVector_v>` → `Lattice<SpinColourVector_v>`
- `loopRight[j]`: `std::vector<SpinColourVector_v>` → `Lattice<SpinColourVector_v>`
-
-**Why**: `std::vector<vobj>` is host memory, not GPU accessible. See [[ref_lattice_vs_vector]].
-
-**`A2ASpatialSum` impact**: `PackLeft`/`PackRight` currently take `std::vector<std::vector<vobj>>`. Once leftv/loopRight become `std::vector<Lattice<vobj>>`, those signatures must change to match.
-
-## Timing on 8.8.8.16 (N_i=N_j=8, Nloop=4, 1 MPI rank)
-
-Dominant costs:
- `loop_build`: 4-6 ms (outerProduct over 4 propagators)
- `pack_loopright`: 0.9-2.2 ms (type-dependent)
- `spatial_sum` (ref): ~1.5 ms
- `A2ASpatialSum TOTAL`: 2.5-4.3 ms (PackLeft+PackRight dominate GEMM on small volume)
-
-## Related
-[[ref_accelerator_for]] [[ref_coalesced_views]] [[ref_lattice_vs_vector]] [[ref_grid_simt_pattern]]
@@ -1,43 +0,0 @@
---
-name: ref_accelerator_for
-description: Grid accelerator_for usage — converting block-strided thread_for to GPU-portable oSites loops
-metadata: 
-  node_type: memory
-  type: reference
-  originSessionId: 956e80aa-401d-481a-80bb-17f8abe1c131
---
-
-## Pattern: block-strided thread_for → accelerator_for over oSites
-
-Old CPU-only pattern (block-strided over orthog dimension):
-```cpp
-thread_for(r, rd, {
-  int so = r * grid->_ostride[orthogdim];
-  for (int n = 0; n < e1; n++)
-  for (int b = 0; b < e2; b++) {
-    int ss = so + n * stride + b;
-    // work on site ss
-  }
-});
-```
-
-GPU-portable replacement:
-```cpp
-accelerator_for(ss, grid->oSites(), Nsimd, {
-  // work on site ss — one SIMT thread per (osite, lane) on GPU
-  // one thread per osite (lane loop implicit via GRID_SIMT) on CPU
-});
-```
-
-Key rules:
- `accelerator_for(iter, count, Nsimd, body)` — Nsimd is `vobj::Nsimd()` or `grid->Nsimd()`
- On CPU: expands to `thread_for` over count, `acceleratorSIMTlane` always returns 0 — must use `#ifdef GRID_SIMT` pattern if iterating lanes explicitly (see [[ref_grid_simt_pattern]])
- On GPU: one SIMT thread per (iter × lane), `acceleratorSIMTlane(Nsimd)` returns actual lane
- Loop body must capture only scalar/POD by value or via device-accessible pointers; no `std::vector` or host containers inside the body
- `Coordinate` inside `accelerator_for` must be `AcceleratorVector<int, MaxDims>` (stack-allocated, device-safe) — Grid's `Coordinate` typedef already satisfies this
-
-## Where defined
-`Grid/threads/Accelerator.h` — CPU path ~line 607; GPU paths in conditional blocks above.
-
-## Model file
-`Grid/algorithms/blas/MomentumProject.h` — `ImportVector` is the canonical example of correct `accelerator_for` + SIMD lane extraction.
@@ -1,70 +0,0 @@
---
-name: ref_coalesced_views
-description: Grid coalescedRead/coalescedWrite and autoView — GPU-portable field access inside accelerator_for
-metadata: 
-  node_type: memory
-  type: reference
-  originSessionId: 956e80aa-401d-481a-80bb-17f8abe1c131
---
-
-## View access modes
-
-```cpp
-autoView(v, field, AcceleratorRead);   // read-only, device-accessible
-autoView(v, field, AcceleratorWrite);  // write-only, device-accessible
-autoView(v, field, AcceleratorReadWrite); // read-write, device-accessible
-autoView(v, field, CpuRead);           // CPU only (avoids GPU migration)
-autoView(v, field, CpuWrite);          // CPU only
-```
-
-Views must be opened **before** `accelerator_for` and closed (go out of scope) **after**. Never open a view inside the accelerator_for body.
-
-## coalescedRead / coalescedWrite
-
-Inside `accelerator_for(ss, oSites, Nsimd, { ... })`:
-
-```cpp
-auto site = coalescedRead(v[ss]);          // reads SIMT lane; returns scalar_object on GPU, vobj on CPU
-coalescedWrite(v[ss], site);               // writes SIMT lane
-```
-
- `coalescedRead(v[ss])` calls `v.operator()(ss)` which on GPU returns `extractLane(lane, v[ss])` — one lane per SIMT thread, contiguous across threads → coalesced
- On CPU returns the full vobj (no lane extraction needed; handled transparently)
- The returned type is `decltype(coalescedRead(v[ss]))` — use `auto` or match with scalar_object
-
-## Typical kernel pattern
-
-```cpp
-autoView(out_v, out, AcceleratorWrite);
-autoView(in_v,  in,  AcceleratorRead);
-accelerator_for(ss, grid->oSites(), vobj::Nsimd(), {
-  auto x = coalescedRead(in_v[ss]);
-  // modify x ...
-  coalescedWrite(out_v[ss], x);
-});
-```
-
-## Free function kernel signature
-
-```cpp
-template<class vobj>
-void MyKernel(Lattice<vobj> &out, const Lattice<vobj> &in)
-{
-  GridBase *grid = in.Grid();
-  autoView(out_v, out, AcceleratorWrite);
-  autoView(in_v,  in,  AcceleratorRead);
-  accelerator_for(ss, grid->oSites(), vobj::Nsimd(), {
-    auto x = coalescedRead(in_v[ss]);
-    coalescedWrite(out_v[ss], x);
-  });
-}
-```
-
-## What NOT to do
- Do not access `std::vector` elements inside `accelerator_for` — not device-accessible
- Do not use `CpuRead`/`CpuWrite` views inside `accelerator_for` — GPU will fault
- Do not assign to `v[ss]` directly inside `accelerator_for` — use `coalescedWrite`
- Do not open multiple write views on the same field simultaneously
-
-## Related
-[[ref_accelerator_for]] [[ref_lattice_vs_vector]]
@@ -1,47 +0,0 @@
---
-name: ref_grid_simt_pattern
-description: Grid GRID_SIMT
-metadata: 
-  node_type: memory
-  type: reference
-  originSessionId: 956e80aa-401d-481a-80bb-17f8abe1c131
---
-
-## The problem
-
-On CPU, `accelerator_for(sf, oSites, Nsimd, {...})` expands to `thread_for(sf, oSites, {...})` — one thread per osite. `acceleratorSIMTlane(Nsimd)` always returns **0** on CPU. If you need to iterate all Nsimd lanes (e.g. to extract SIMD-packed data), you must loop explicitly on CPU.
-
-On GPU, `accelerator_for` launches one SIMT thread per (osite × lane). `acceleratorSIMTlane(Nsimd)` returns the actual lane index [0, Nsimd).
-
-## Correct pattern (from MomentumProject::ImportVector)
-
-```cpp
-accelerator_for(sf, osites, Nsimd, {
-#ifdef GRID_SIMT
-  {
-    int lane = acceleratorSIMTlane(Nsimd);
-#else
-    for (int lane = 0; lane < Nsimd; lane++) {
-#endif
-      // body using lane
-    }
-  });
-```
-
- On GPU: `GRID_SIMT` is defined → single-lane body, lane from hardware
- On CPU: `GRID_SIMT` is not defined → explicit lane loop inside the osite thread
-
-## When is this needed?
-
-Only when you explicitly need the lane index, e.g.:
- Extracting scalar data from SIMD-packed `vobj` via `extractLane(lane, src[sf])`
- Computing full local coordinates from (osite, lane) → `Lexicographic::CoorFromIndex(icoor, lane, simd_layout)`
-
-When using `coalescedRead`/`coalescedWrite`, this pattern is **not needed** — those handle lane selection transparently.
-
-## Pitfall that caused a bug
-
-`A2ASpatialSum::PackVectors` originally used `accelerator_for` without the `#ifdef GRID_SIMT` lane loop. On CPU, only lane=0 was extracted, giving wrong norms (~8× too small for `GEN_SIMD_WIDTH=64`, `Nsimd=4`). Fix: add the `#ifdef GRID_SIMT` pattern. See [[ref_accelerator_for]].
-
-## Model file
-`Grid/algorithms/blas/MomentumProject.h`, function `ImportVector`, lines ~166-207.
@@ -1,48 +0,0 @@
---
-name: ref_lattice_vs_vector
-description: When to use Lattice<T> vs std::vector<T> for GPU-portable field storage in Grid
-metadata: 
-  node_type: memory
-  type: reference
-  originSessionId: 956e80aa-401d-481a-80bb-17f8abe1c131
---
-
-## Rule
-
-Use `Lattice<vobj>` (or `std::vector<Lattice<vobj>>`) for any field that will be read or written inside `accelerator_for`. `std::vector<vobj>` is host memory and is NOT device-accessible.
-
-## Before vs after GPU offload
-
-```cpp
-// CPU-only (host memory, not GPU accessible)
-std::vector<SpinColourVector_v> tloopv(oSites, Zero());
-// accessed directly: tloopv[ss]
-
-// GPU-portable
-Lattice<SpinColourVector_v> tloop(grid);
-// accessed via view: autoView(tloop_v, tloop, AcceleratorWrite);
-//                    coalescedWrite(tloop_v[ss], val);
-```
-
-## Corollary: function signatures
-
-CPU-only version:
-```cpp
-void PackLeft(const std::vector<std::vector<vobj>> &leftv);
-```
-
-GPU-portable version:
-```cpp
-void PackLeft(const std::vector<Lattice<vobj>> &leftv);
-```
-
-## deviceVector for raw device buffers
-
-`deviceVector<T>` (defined in Grid) is like `std::vector<T>` but in device-accessible memory. Use for raw scalar scratch/pack buffers (e.g. GEMM input/output staging). Not for structured lattice data.
-
-## Pointer arrays for batched BLAS
-
-`deviceVector<scalar *>` holds batch pointer arrays. Populate with `acceleratorPut(ptrs[t], base + offset)` — sets device-side pointer from host. See `A2ASpatialSum::Allocate`.
-
-## Related
-[[ref_coalesced_views]] [[ref_accelerator_for]]
@@ -10,7 +10,7 @@ export HDF5=/opt/cray/pe/hdf5/1.12.2.3/gnu/9.1
 	--disable-gparity \
 	--disable-fermion-reps \
 	--enable-shm=nvlink \
-	--enable-checksum-comms=no \
+	--enable-checksum-comms=yes \
 	--enable-log-views=yes \
 	--enable-accelerator=sycl \
 	--enable-accelerator-aware-mpi=no \
@@ -1,3 +1,4 @@
+CLIME=`spack find --paths c-lime@2-3-9 | grep c-lime| cut -c 15-`
 ../../configure --enable-comms=mpi-auto \
 --with-lime=$CLIME \
 --enable-unified=no \
@@ -8,13 +9,12 @@
 --disable-gparity \
 --disable-fermion-reps \
 --enable-simd=GPU \
--with-gmp=$GMP \
--with-mpfr=$MPFR \
--with-openssl=$OPENSSL \
+--with-gmp=$OLCF_GMP_ROOT \
+--with-mpfr=/opt/cray/pe/gcc/mpfr/3.1.4/ \
 --disable-fermion-reps \
 CXX=hipcc MPICXX=mpicxx \
-CXXFLAGS="-fPIC -I${ROCM_PATH}/include/ -I${MPICH_DIR}/include " \
- LDFLAGS="-L${ROCM_PATH}/lib -L${MPICH_DIR}/lib -lmpi -lmpi_gtl_hsa -lhipblas -lrocblas -lhipfft -lamdhip64"
+CXXFLAGS="-fPIC -I${ROCM_PATH}/include/ -I${MPICH_DIR}/include -L/lib64 " \
+ LDFLAGS="-L/lib64 -L${ROCM_PATH}/lib -L${MPICH_DIR}/lib -lmpi -L${CRAY_MPICH_ROOTDIR}/gtl/lib -lmpi_gtl_hsa -lhipblas -lrocblas -lhipfft"



@@ -1,14 +1,25 @@

 echo spack
-. /autofs/nccs-svm1_home1/paboyle/spack/share/spack/setup-env.sh
+. /autofs/nccs-svm1_home1/paboyle/Crusher/Grid/spack/share/spack/setup-env.sh

-export CLIME=`spack find --paths c-lime | grep ^c-lime | awk '{print $2}' `
-export MPFR=`spack find --paths mpfr    | grep ^mpfr  | awk '{print $2}' `
-export OPENSSL=`spack find --paths openssl | grep openssl | awk  '{print $2}' `
-export GMP=`spack find --paths gmp      | grep ^gmp | awk '{print $2}' `
+module load cce/15.0.1
+module load rocm/5.3.0
+module load cray-fftw
+module load craype-accel-amd-gfx90a

-module load cce/21.0.0
-module load cpe/26.03
-module load rocm/7.0.2
-export LD_LIBRARY_PATH=$CRAY_LD_LIBRARY_PATH:$LD_LIBRARY_PATH
-export LD_LIBRARY_PATH=/opt/rocm-7.0.2/lib/llvm/lib/:$LD_LIBRARY_PATH
+#Ugly hacks to get down level software working on current system
+export LD_LIBRARY_PATH=/opt/cray/libfabric/1.20.1/lib64/:$LD_LIBRARY_PATH
+export LD_LIBRARY_PATH=/opt/gcc/mpfr/3.1.4/lib:$LD_LIBRARY_PATH
+export LD_LIBRARY_PATH=`pwd`/:$LD_LIBRARY_PATH
+ln -s /opt/rocm-6.0.0/lib/libamdhip64.so.6 .
+
+#echo spack load c-lime
+#spack load c-lime
+#module load emacs 
+##module load PrgEnv-gnu
+##module load cray-mpich
+##module load cray-fftw
+##module load craype-accel-amd-gfx90a
+##export LD_LIBRARY_PATH=/opt/gcc/mpfr/3.1.4/lib:$LD_LIBRARY_PATH
+#Hack for lib
+##export LD_LIBRARY_PATH=`pwd`/:$LD_LIBRARY_PATH
@@ -1,16 +1,17 @@
 DIR=`pwd`
-
+PREFIX=$DIR/../Prequisites/install/
 ../../configure \
    --enable-comms=mpi \
    --enable-simd=GPU \
    --enable-shm=nvlink \
    --enable-gen-simd-width=64 \
-    --with-gmp=$GMP \
-    --with-mpfr=$MPFR \
    --enable-accelerator=cuda \
+    --enable-setdevice \
+    --disable-accelerator-cshift \
+    --with-gmp=$PREFIX \
    --disable-fermion-reps \
    --disable-unified \
    --disable-gparity \
    CXX=nvcc \
    LDFLAGS="-cudart shared " \
-    CXXFLAGS="-ccbin CC -gencode arch=compute_80,code=sm_80 -std=c++17 -cudart shared"
+    CXXFLAGS="-ccbin CC -gencode arch=compute_80,code=sm_80 -std=c++14 -cudart shared"
@@ -1,39 +0,0 @@
-#!/bin/bash
-##SBATCH -A m5294_g
-#SBATCH -A mp13_g
-#m3886_g
-#SBATCH -C gpu
-#SBATCH -q premium
-#SBATCH -t 00:20
-#SBATCH -c 32
-#SBATCH -N 384
-#SBATCH -n 1536
-#SBATCH --ntasks-per-node=4
-#SBATCH --gpus-per-task=1
-#SBATCH --exclusive
-#SBATCH --gpu-bind=none
-
-export SLURM_CPU_BIND="cores"
-export MPICH_GPU_SUPPORT_ENABLED=1
-export MPICH_RDMA_ENABLED_CUDA=1
-export MPICH_GPU_IPC_ENABLED=1
-export MPICH_GPU_EAGER_REGISTER_HOST_MEM=0
-export MPICH_GPU_NO_ASYNC_MEMCPY=0
-#export MPICH_SMP_SINGLE_COPY_MODE=CMA
-
-cat << EOF > select_gpu
-#!/bin/bash
-export GPU_MAP=(0 1 2 3)
-export NUMA_MAP=( 0 1 2 3 )
-export GPU=\$SLURM_LOCALID
-export NUMA=\$SLURM_LOCALID
-export CUDA_VISIBLE_DEVICES=\$GPU
-exec numactl -m \$NUMA -N \$NUMA \$*
-EOF
-
-chmod +x ./select_gpu
-
-export VOL=128.128.128.288
-OPT="--comms-overlap --shm-mpi 0"
-srun ./select_gpu ./benchmarks/Benchmark_dwf_fp32 --mpi 4.8.4.12 --grid $VOL --device-mem 16000 --accelerator-threads 8 --shm 2048 $OPT
-
@@ -1,6 +1,5 @@
 #!/bin/bash
-#SBATCH -A m5294_g
-#m3886_g
+#SBATCH -A m3886_g
 #SBATCH -C gpu
 #SBATCH -q debug
 #SBATCH -t 0:20:00
@@ -20,21 +19,9 @@ export MPICH_GPU_EAGER_REGISTER_HOST_MEM=0
 export MPICH_GPU_NO_ASYNC_MEMCPY=0
 #export MPICH_SMP_SINGLE_COPY_MODE=CMA

-cat << EOF > select_gpu
-#!/bin/bash
-export GPU_MAP=(0 1 2 3)
-export NUMA_MAP=( 0 1 2 3 )
-export GPU=\$SLURM_LOCALID
-export NUMA=\$SLURM_LOCALID
-export CUDA_VISIBLE_DEVICES=\$GPU
-exec numactl -m \$NUMA -N \$NUMA \$*
-EOF
-
-chmod +x ./select_gpu
-
-OPT="--comms-sequential --shm-mpi 0"
-VOL=64.64.32.32
-srun ./select_gpu ./benchmarks/Benchmark_dwf_fp32 --mpi 2.2.1.1 --grid $VOL --device-mem 16000 --accelerator-threads 8 --shm 2048 $OPT
-OPT="--comms-overlap --shm-mpi 0"
-srun ./select_gpu ./benchmarks/Benchmark_dwf_fp32 --mpi 2.2.1.1 --grid $VOL --device-mem 16000 --accelerator-threads 8 --shm 2048 $OPT
+OPT="--comms-sequential --shm-mpi 1"
+VOL=64.64.64.64
+srun ./benchmarks/Benchmark_dwf_fp32 --mpi 2.2.1.1 --grid $VOL --accelerator-threads 8 --shm 2048 $OPT
+#srun ./benchmarks/Benchmark_dwf_fp32 --mpi 2.1.1.4 --grid $VOL --accelerator-threads 8 --shm 2048 $OPT
+#srun ./benchmarks/Benchmark_dwf_fp32 --mpi 1.1.1.8 --grid $VOL --accelerator-threads 8 --shm 2048 $OPT

@@ -1,6 +1,4 @@
-export CRAY_ACCEL_TARGET=nvidia80
-source /global/homes/p/pboyle/spack/share/spack/setup-env.sh
-export MPFR=`spack find --paths mpfr | grep mpfr | cut -c 13-`
-export GMP=`spack find --paths gmp   | grep gmp  | cut -c 12-`

-module load PrgEnv-gnu cpe-cuda cudatoolkit/12.0
+export CRAY_ACCEL_TARGET=nvidia80
+
+module load PrgEnv-gnu cpe-cuda cudatoolkit/11.4
@@ -1,44 +0,0 @@
-#!/bin/bash
-##SBATCH -A m5294_g
-#SBATCH -A mp13_g
-#m3886_g
-#SBATCH -C gpu
-#SBATCH -q premium
-#SBATCH -t 00:10
-#SBATCH -c 32
-#SBATCH -N 32
-#SBATCH -n 128
-#SBATCH --ntasks-per-node=4
-#SBATCH --gpus-per-task=1
-#SBATCH --exclusive
-#SBATCH --gpu-bind=none
-
-export SLURM_CPU_BIND="cores"
-export MPICH_GPU_SUPPORT_ENABLED=1
-export MPICH_RDMA_ENABLED_CUDA=1
-export MPICH_GPU_IPC_ENABLED=1
-export MPICH_GPU_EAGER_REGISTER_HOST_MEM=0
-export MPICH_GPU_NO_ASYNC_MEMCPY=0
-#export MPICH_SMP_SINGLE_COPY_MODE=CMA
-
-cat << EOF > select_gpu
-#!/bin/bash
-export GPU_MAP=(0 1 2 3)
-export NUMA_MAP=( 0 1 2 3 )
-export GPU=\$SLURM_LOCALID
-export NUMA=\$SLURM_LOCALID
-export CUDA_VISIBLE_DEVICES=\$GPU
-exec numactl -m \$NUMA -N \$NUMA \$*
-EOF
-
-chmod +x ./select_gpu
-
-OPT="--comms-overlap --shm-mpi 0"
-#
-# Local volume WAS 32.16.32.24
-#
-# 384 nodes
-#srun ./select_gpu ./Test_dwf_mixedcg_prec --seconds 300 --grid 128.128.128.288 --mpi 4.8.4.12 --device-mem 16000 --accelerator-threads 8 --shm 2048 $OPT > job.log
-# 32 nodes, same volume per node
-srun ./select_gpu ./Test_dwf_mixedcg_prec --seconds 300 --grid 64.32.64.96 --mpi 2.2.2.4 --device-mem 16000 --accelerator-threads 8 --shm 2048 $OPT > job.log
-
@@ -1,38 +0,0 @@
-#!/bin/bash
-##SBATCH -A m5294_g
-#SBATCH -A mp13_g
-#m3886_g
-#SBATCH -C gpu
-#SBATCH -q premium
-#SBATCH -t 00:10
-#SBATCH -c 32
-#SBATCH -N 384
-#SBATCH -n 1536
-#SBATCH --ntasks-per-node=4
-#SBATCH --gpus-per-task=1
-#SBATCH --exclusive
-#SBATCH --gpu-bind=none
-
-export SLURM_CPU_BIND="cores"
-export MPICH_GPU_SUPPORT_ENABLED=1
-export MPICH_RDMA_ENABLED_CUDA=1
-export MPICH_GPU_IPC_ENABLED=1
-export MPICH_GPU_EAGER_REGISTER_HOST_MEM=0
-export MPICH_GPU_NO_ASYNC_MEMCPY=0
-#export MPICH_SMP_SINGLE_COPY_MODE=CMA
-
-cat << EOF > select_gpu
-#!/bin/bash
-export GPU_MAP=(0 1 2 3)
-export NUMA_MAP=( 0 1 2 3 )
-export GPU=\$SLURM_LOCALID
-export NUMA=\$SLURM_LOCALID
-export CUDA_VISIBLE_DEVICES=\$GPU
-exec numactl -m \$NUMA -N \$NUMA \$*
-EOF
-
-chmod +x ./select_gpu
-
-OPT="--comms-overlap --shm-mpi 0"
-srun ./select_gpu ./Test_dwf_mixedcg_prec --seconds 300 --grid 128.128.128.288 --mpi 4.8.4.12 --device-mem 16000 --accelerator-threads 8 --shm 2048 $OPT > job.log
-
@@ -1,39 +0,0 @@
-#!/bin/bash
-#SBATCH -A m5294_g
-#m3886_g
-#SBATCH -C gpu
-#SBATCH -q debug
-#SBATCH -t 0:30:00
-#SBATCH -c 32
-#SBATCH -N 4
-#SBATCH -n 16
-#SBATCH --ntasks-per-node=4
-#SBATCH --gpus-per-task=1
-#SBATCH --exclusive
-#SBATCH --gpu-bind=none
-
-export SLURM_CPU_BIND="cores"
-export MPICH_GPU_SUPPORT_ENABLED=1
-export MPICH_RDMA_ENABLED_CUDA=1
-export MPICH_GPU_IPC_ENABLED=1
-export MPICH_GPU_EAGER_REGISTER_HOST_MEM=0
-export MPICH_GPU_NO_ASYNC_MEMCPY=0
-#export MPICH_SMP_SINGLE_COPY_MODE=CMA
-
-cat << EOF > select_gpu
-#!/bin/bash
-export GPU_MAP=(0 1 2 3)
-export NUMA_MAP=( 0 1 2 3 )
-export GPU=\$SLURM_LOCALID
-export NUMA=\$SLURM_LOCALID
-export CUDA_VISIBLE_DEVICES=\$GPU
-exec numactl -m \$NUMA -N \$NUMA \$*
-EOF
-
-chmod +x ./select_gpu
-
-OPT="--comms-sequential --shm-mpi 0"
-VOL=64.64.32.32
-srun ./select_gpu ./benchmarks/Benchmark_usqcd --mpi 2.2.2.2 --grid $VOL --device-mem 16000 --accelerator-threads 8 --shm 2048 $OPT > usqcd.log
-
-
@@ -3,14 +3,12 @@
 CXX=mpicxx ../../configure \
 	   --enable-simd=GEN \
 	   --enable-comms=mpi-auto \
+	   --enable-Sp=yes \
 	   --enable-unified=yes \
-	   --disable-fermion-reps \
-	   --disable-gparity \
-	   --prefix /Users/peterboyle/QCD/Grid-install \
+	   --prefix /Users/peterboyle/QCD/vtk/Grid/install \
 	   --with-lime=$CLIME \
 	   --with-openssl=$OPENSSL \
 	   --with-gmp=$GMP \
 	   --with-mpfr=$MPFR \
-	   --with-fftw=$FFTW \
-	   --disable-debug
+	   --disable-debug 

@@ -1,7 +0,0 @@
-source /Users/peterboyle/QCD//Spack/spack//share/spack/setup-env.sh
-export FFTW=`spack find --paths fftw    | grep ^fftw   | awk '{print $2}' `
-export CLIME=`spack find --paths c-lime | grep ^c-lime | awk '{print $2}' `
-export MPFR=`spack find --paths mpfr    | grep ^mpfr  | awk '{print $2}' `
-export OPENSSL=`spack find --paths openssl | grep openssl | awk  '{print $2}' `
-export GMP=`spack find --paths gmp      | grep ^gmp | awk '{print $2}' `
-
@@ -25,9 +25,6 @@ Author: Peter Boyle <paboyle@ph.ed.ac.uk>
    See the full license in the file "LICENSE" in the top level distribution directory
    *************************************************************************************/
    /*  END LEGAL */
-#include "disable_tests_without_instantiations.h"
-#ifdef ENABLE_FERMION_INSTANTIATIONS
-
 #include <Grid/Grid.h>

 using namespace std;
@@ -276,6 +273,8 @@ void  TestWhat(What & Ddwf,

  err = phi-chi;
  std::cout<<GridLogMessage << "norm diff   "<< norm2(err)<< std::endl;
+
+  
 }

-#endif
+
@@ -30,9 +30,6 @@ Author: Peter Boyle <paboyle@ph.ed.ac.uk>
 *  Reimplement the badly named "multigrid" lanczos as compressed Lanczos using the features 
 *  in Grid that were intended to be used to support blocked Aggregates, from
 */
-#include "disable_tests_without_instantiations.h"
-#ifdef ENABLE_FERMION_INSTANTIATIONS
-
 #include <Grid/Grid.h>
 #include <Grid/algorithms/iterative/ImplicitlyRestartedLanczos.h>
 #include <Grid/algorithms/iterative/LocalCoherenceLanczos.h>
@@ -259,4 +256,3 @@ int main (int argc, char ** argv) {
  Grid_finalize();
 }

-#endif
@@ -25,9 +25,6 @@ Author: Peter Boyle <paboyle@ph.ed.ac.uk>
    See the full license in the file "LICENSE" in the top level distribution directory
    *************************************************************************************/
    /*  END LEGAL */
-#include "disable_tests_without_instantiations.h"
-#ifdef ENABLE_FERMION_INSTANTIATIONS
-
 #include <Grid/Grid.h>

 using namespace std;
@@ -240,5 +237,3 @@ int main (int argc, char ** argv)
  
  Grid_finalize();
 }
-
-#endif
@@ -25,9 +25,6 @@ Author: Peter Boyle <paboyle@ph.ed.ac.uk>
    See the full license in the file "LICENSE" in the top level distribution directory
    *************************************************************************************/
    /*  END LEGAL */
-#include "disable_tests_without_instantiations.h"
-#ifdef ENABLE_FERMION_INSTANTIATIONS
-
 #include <Grid/Grid.h>

 using namespace std;
@@ -225,5 +222,3 @@ int main (int argc, char ** argv)
  
  Grid_finalize();
 }
-
-#endif
@@ -25,9 +25,6 @@ Author: Peter Boyle <paboyle@ph.ed.ac.uk>
    See the full license in the file "LICENSE" in the top level distribution directory
    *************************************************************************************/
    /*  END LEGAL */
-#include "disable_tests_without_instantiations.h"
-#ifdef ENABLE_FERMION_INSTANTIATIONS
-
 #include <Grid/Grid.h>

 using namespace std;
@@ -121,4 +118,3 @@ int main (int argc, char ** argv)
  Grid_finalize();
 }
 #endif
-#endif
@@ -1,841 +0,0 @@
-/*************************************************************************************
-
-Grid physics library, www.github.com/paboyle/Grid
-
-Source file: tests/Test_extended_meson_field.cc
-
-Copyright (C) 2015-2025
-
-Author: Peter Boyle <pboyle@bnl.gov>
-Author: Masaaki Tomii <masaaki.tomii@uconn.edu> (original Hadrons kernels)
-
-This program is free software; you can redistribute it and/or modify
-it under the terms of the GNU General Public License as published by
-the Free Software Foundation; either version 2 of the License, or
-(at your option) any later version.
-
-This program is distributed in the hope that it will be useful,
-but WITHOUT ANY WARRANTY; without even the implied warranty of
-MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
-GNU General Public License for more details.
-
-You should have received a copy of the GNU General Public License along
-with this program; if not, write to the Free Software Foundation, Inc.,
-51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
-
-See the full license in the file "LICENSE" in the top level distribution directory
-*************************************************************************************/
-#include "disable_tests_without_instantiations.h"
-#ifdef ENABLE_FERMION_INSTANTIATIONS
-
-#include <Grid/Grid.h>
-#include <Grid/qcd/utils/A2Autils.h>
-
-using namespace Grid;
-
-typedef WilsonImplD FImpl;
-typedef typename FImpl::FermionField FermionField;
-typedef typename FImpl::SiteSpinor   vobj;
-typedef typename vobj::scalar_type   scalar_type;
-typedef typename vobj::vector_type   vector_type;
-typedef iSpinColourMatrix<vector_type> SpinColourMatrix_v;
-typedef iSpinColourVector<vector_type> SpinColourVector_v;
-typedef iSpinMatrix<vector_type>       SpinMatrix_v;
-typedef iSinglet<vector_type>          Scalar_v;
-typedef iSinglet<scalar_type>          Scalar_s;
-typedef Lattice<SpinColourMatrix_v>    PropagatorField;
-
-// CPU reference + optionally batched GEMM spatial sum, ported from
-// Hadrons/Modules/MContraction/A2AExtendedMesonField.hpp
-// (M. Tomii, mtomii/Hadrons:local-2025-edits). Hadrons infrastructure removed.
-// thread_for / CpuRead / orthogdim=3 preserved.
-class A2AExtendedMesonFieldRef
-{
-public:
-  // result is indexed [nt][N_i][N_j].
-  // use_blas=true replaces the scalar spatial accumulation with A2ASpatialSum.
-  static void compute(
-      Eigen::Tensor<ComplexD, 3> &result,
-      const std::vector<FermionField> &left,
-      const std::vector<FermionField> &right,
-      const std::vector<FermionField> &loop1,
-      const std::vector<FermionField> &loop2,
-      const std::vector<Gamma::Algebra> &gamma1,
-      const std::vector<Gamma::Algebra> &gamma2,
-      int type,
-      bool use_blas = false)
-  {
-    GridBase *grid = left[0].Grid();
-
-    const int orthogdim = 3;
-    int rd     = grid->_rdimensions[orthogdim];
-    int ld     = grid->_ldimensions[orthogdim];
-    int Nd     = grid->_ndimension;
-    int Nsimd  = grid->Nsimd();
-
-    int nt  = result.dimension(0);
-    int N_i = (int)left.size();
-    int N_j = (int)right.size();
-
-    std::string tag = std::string(use_blas ? "[blas" : "[ref ") + " type=" + std::to_string(type) + "]";
-    auto Tms = [](double us) { return us * 1e-3; };
-    double t0;
-
-    // ------------------------------------------------------------------
-    // Loop propagator: sum_k outerProduct(loop1[k], loop2[k])
-    // ------------------------------------------------------------------
-    t0 = usecond();
-    PropagatorField loop(grid);
-    loop = Zero();
-    for (unsigned int k = 0; k < loop1.size(); ++k)
-      loop += outerProduct(loop1[k], loop2[k]);
-    std::cout << GridLogMessage << tag << " loop_build:      " << Tms(usecond()-t0) << " ms\n";
-
-    // ------------------------------------------------------------------
-    // Pack conjugated left vectors
-    // ------------------------------------------------------------------
-    t0 = usecond();
-    std::vector<FermionField> leftv(N_i, grid);
-    for (int i = 0; i < N_i; i++)
-      leftv[i] = conjugate(left[i]);
-    std::cout << GridLogMessage << tag << " pack_left:       " << Tms(usecond()-t0) << " ms\n";
-
-    // ------------------------------------------------------------------
-    // Per-site loop contraction into PropagatorField tloop (type-dependent)
-    // ------------------------------------------------------------------
-    t0 = usecond();
-    PropagatorField tloop(grid);
-    tloop = Zero();
-    {
-      autoView(tloopv, tloop, CpuWrite);
-      autoView(loopv,  loop,  CpuRead);
-      if (type == 0) {
-        thread_for(ss, grid->oSites(), {
-          for (int s1 = 0; s1 < Ns; ++s1)
-          for (int s2 = 0; s2 < Ns; ++s2)
-            tloopv[ss]()(s1,s2)(0,0) = loopv[ss]()(s1,s2)(0,0)
-                                     + loopv[ss]()(s1,s2)(1,1)
-                                     + loopv[ss]()(s1,s2)(2,2);
-        });
-      }
-      if (type == 1) {
-        thread_for(ss, grid->oSites(), {
-          tloopv[ss] = Zero();
-          for (int mu = 0; mu < (int)gamma1.size(); ++mu)
-            tloopv[ss] = tloopv[ss] + Gamma(gamma1[mu]) * loopv[ss] * Gamma(gamma2[mu]);
-        });
-      }
-      if (type == 2) {
-        thread_for(ss, grid->oSites(), {
-          tloopv[ss] = Zero();
-          for (int mu = 0; mu < (int)gamma2.size(); ++mu) {
-            SpinColourMatrix_v tmp = Gamma(gamma2[mu]) * loopv[ss];
-            int s1 = mu / Ns;
-            int s2 = mu % Ns;
-            for (int c1 = 0; c1 < Nc; ++c1)
-            for (int c2 = 0; c2 < Nc; ++c2)
-              tloopv[ss]()(s1,s2)(c1,c2) = tmp()(0,0)(c1,c2) + tmp()(1,1)(c1,c2)
-                                          + tmp()(2,2)(c1,c2) + tmp()(3,3)(c1,c2);
-          }
-        });
-      }
-      if (type == 3) {
-        thread_for(ss, grid->oSites(), {
-          SpinMatrix_v spinLoop = Zero();
-          for (int s1 = 0; s1 < Ns; ++s1)
-          for (int s2 = 0; s2 < Ns; ++s2)
-            spinLoop()(s1,s2)() = loopv[ss]()(s1,s2)(0,0)
-                                + loopv[ss]()(s1,s2)(1,1)
-                                + loopv[ss]()(s1,s2)(2,2);
-          tloopv[ss] = Zero();
-          for (int mu = 0; mu < (int)gamma1.size(); ++mu) {
-            SpinMatrix_v tmp2 = Gamma(gamma1[mu]) * spinLoop * Gamma(gamma2[mu]);
-            for (int s1 = 0; s1 < Ns; ++s1)
-            for (int s2 = 0; s2 < Ns; ++s2)
-              tloopv[ss]()(s1,s2)(0,0) = tloopv[ss]()(s1,s2)(0,0) + tmp2()(s1,s2)();
-          }
-        });
-      }
-    }
-    std::cout << GridLogMessage << tag << " tloop:           " << Tms(usecond()-t0) << " ms\n";
-
-    // Select addLoopRight kernel for this type
-    std::function<void(SpinColourVector_v &,
-                       const SpinColourMatrix_v &,
-                       const SpinColourVector_v &,
-                       const std::vector<Gamma::Algebra> &,
-                       const std::vector<Gamma::Algebra> &)> addLoopRight;
-
-    if (type == 0) {
-      addLoopRight = [](SpinColourVector_v &lR,
-                        const SpinColourMatrix_v &loopm,
-                        const SpinColourVector_v &rightv,
-                        const std::vector<Gamma::Algebra> &g1,
-                        const std::vector<Gamma::Algebra> &g2) {
-        SpinMatrix_v spinLoop = Zero();
-        for (int s1 = 0; s1 < Ns; ++s1)
-        for (int s2 = 0; s2 < Ns; ++s2)
-          spinLoop()(s1,s2)() = loopm()(s1,s2)(0,0);
-        for (int mu = 0; mu < (int)g1.size(); ++mu) {
-          SpinMatrix_v GLoop = Gamma(g2[mu]) * spinLoop;
-          auto trGLoop = GLoop()(0,0)() + GLoop()(1,1)() + GLoop()(2,2)() + GLoop()(3,3)();
-          SpinColourVector_v Grightv = Gamma(g1[mu]) * rightv;
-          for (int s = 0; s < Ns; ++s)
-          for (int c = 0; c < Nc; ++c)
-            lR()(s)(c) += Grightv()(s)(c) * trGLoop;
-        }
-      };
-    }
-    if (type == 1) {
-      addLoopRight = [](SpinColourVector_v &lR,
-                        const SpinColourMatrix_v &loopm,
-                        const SpinColourVector_v &rightv,
-                        const std::vector<Gamma::Algebra> &g1,
-                        const std::vector<Gamma::Algebra> &g2) {
-        for (int s = 0; s < Ns; ++s)
-        for (int c = 0; c < Nc; ++c) {
-          lR()(s)(c)
-            += loopm()(s,0)(c,0) * rightv()(0)(0)
-            +  loopm()(s,0)(c,1) * rightv()(0)(1)
-            +  loopm()(s,0)(c,2) * rightv()(0)(2)
-            +  loopm()(s,1)(c,0) * rightv()(1)(0)
-            +  loopm()(s,1)(c,1) * rightv()(1)(1)
-            +  loopm()(s,1)(c,2) * rightv()(1)(2)
-            +  loopm()(s,2)(c,0) * rightv()(2)(0)
-            +  loopm()(s,2)(c,1) * rightv()(2)(1)
-            +  loopm()(s,2)(c,2) * rightv()(2)(2)
-            +  loopm()(s,3)(c,0) * rightv()(3)(0)
-            +  loopm()(s,3)(c,1) * rightv()(3)(1)
-            +  loopm()(s,3)(c,2) * rightv()(3)(2);
-        }
-      };
-    }
-    if (type == 2) {
-      addLoopRight = [](SpinColourVector_v &lR,
-                        const SpinColourMatrix_v &loopm,
-                        const SpinColourVector_v &rightv,
-                        const std::vector<Gamma::Algebra> &g1,
-                        const std::vector<Gamma::Algebra> &g2) {
-        for (int mu = 0; mu < (int)g1.size(); ++mu) {
-          int s1 = mu / Ns;
-          int s2 = mu % Ns;
-          SpinColourVector_v Grightv = Gamma(g1[mu]) * rightv;
-          for (int s = 0; s < Ns; ++s)
-          for (int c = 0; c < Nc; ++c)
-            lR()(s)(c) += loopm()(s1,s2)(c,0) * Grightv()(s)(0)
-                       +  loopm()(s1,s2)(c,1) * Grightv()(s)(1)
-                       +  loopm()(s1,s2)(c,2) * Grightv()(s)(2);
-        }
-      };
-    }
-    if (type == 3) {
-      addLoopRight = [](SpinColourVector_v &lR,
-                        const SpinColourMatrix_v &loopm,
-                        const SpinColourVector_v &rightv,
-                        const std::vector<Gamma::Algebra> &g1,
-                        const std::vector<Gamma::Algebra> &g2) {
-        for (int s = 0; s < Ns; ++s)
-        for (int c = 0; c < Nc; ++c)
-          lR()(s)(c) += loopm()(s,0)(0,0) * rightv()(0)(c)
-                     +  loopm()(s,1)(0,0) * rightv()(1)(c)
-                     +  loopm()(s,2)(0,0) * rightv()(2)(c)
-                     +  loopm()(s,3)(0,0) * rightv()(3)(c);
-      };
-    }
-
-    // ------------------------------------------------------------------
-    // Pack loopRight[j] = type-kernel(tloop, right[j]) per site
-    // ------------------------------------------------------------------
-    t0 = usecond();
-    std::vector<FermionField> loopRight(N_j, grid);
-    {
-      autoView(tlv, tloop, CpuRead);
-      for (int j = 0; j < N_j; j++) {
-        loopRight[j] = Zero();
-        autoView(lRv, loopRight[j], CpuWrite);
-        autoView(rv,  right[j],     CpuRead);
-        thread_for(ss, grid->oSites(), {
-          addLoopRight(lRv[ss], tlv[ss], rv[ss], gamma1, gamma2);
-        });
-      }
-    }
-    std::cout << GridLogMessage << tag << " pack_loopright:  " << Tms(usecond()-t0) << " ms\n";
-
-    if (use_blas) {
-      // ------------------------------------------------------------------
-      // BLAS path: A2ASpatialSum (Allocate + PackLeft + PackRight + Sum)
-      // ------------------------------------------------------------------
-      A2ASpatialSum<SpinColourVector_v> spatial_sum;
-      double t_blas_start = usecond();
-
-      t0 = usecond();
-      spatial_sum.Allocate(N_i, N_j, grid);
-      std::cout << GridLogMessage << tag << " Allocate:        " << Tms(usecond()-t0) << " ms\n";
-
-      t0 = usecond();
-      spatial_sum.PackLeft(leftv);
-      std::cout << GridLogMessage << tag << " PackLeft:        " << Tms(usecond()-t0) << " ms\n";
-
-      t0 = usecond();
-      spatial_sum.PackRight(loopRight);
-      std::cout << GridLogMessage << tag << " PackRight:       " << Tms(usecond()-t0) << " ms\n";
-
-      t0 = usecond();
-      spatial_sum.Sum(result);
-      std::cout << GridLogMessage << tag << " Sum (GEMM+MPI):  " << Tms(usecond()-t0) << " ms\n";
-
-      std::cout << GridLogMessage << tag << " A2ASpatialSum:   " << Tms(usecond()-t_blas_start) << " ms  [TOTAL]\n";
-
-    } else {
-      // ------------------------------------------------------------------
-      // Reference path: SIMD spatial sum + scalar extraction
-      // ------------------------------------------------------------------
-      int MFrvol = rd * N_i * N_j;
-      int MFlvol = ld * N_i * N_j;
-
-      Vector<Scalar_v> lvSum(MFrvol);
-      thread_for(r, MFrvol, { lvSum[r] = Zero(); });
-
-      t0 = usecond();
-      {
-        int e1     = grid->_slice_nblock[orthogdim];
-        int e2     = grid->_slice_block [orthogdim];
-        int stride = grid->_slice_stride[orthogdim];
-        using LView = decltype(leftv[0].View(CpuRead));
-        using RView = decltype(loopRight[0].View(CpuRead));
-        std::vector<LView> lv_views;
-        std::vector<RView> lr_views;
-        lv_views.reserve(N_i);
-        lr_views.reserve(N_j);
-        for (int i = 0; i < N_i; i++) lv_views.push_back(leftv[i].View(CpuRead));
-        for (int j = 0; j < N_j; j++) lr_views.push_back(loopRight[j].View(CpuRead));
-        thread_for(r, rd, {
-          int so   = r * grid->_ostride[orthogdim];
-          int base = N_i * N_j * r;
-          for (int n = 0; n < e1; n++)
-          for (int b = 0; b < e2; b++) {
-            int ss = so + n * stride + b;
-            for (int ii = 0; ii < N_i; ii++)
-            for (int jj = 0; jj < N_j; jj++) {
-              int idx = jj + N_j * ii + base;
-              for (int s = 0; s < Ns; ++s)
-              for (int c = 0; c < Nc; ++c)
-                lvSum[idx]()()() += lv_views[ii][ss]()(s)(c) * lr_views[jj][ss]()(s)(c);
-            }
-          }
-        });
-        for (auto &v : lv_views) v.ViewClose();
-        for (auto &v : lr_views) v.ViewClose();
-      }
-      std::cout << GridLogMessage << tag << " spatial_sum:     " << Tms(usecond()-t0) << " ms\n";
-
-      Vector<Scalar_s> lsSum(MFlvol);
-      thread_for(r, MFlvol, { lsSum[r] = scalar_type(0.0); });
-
-      t0 = usecond();
-      thread_for(rt, rd, {
-        Coordinate icoor(Nd);
-        ExtractBuffer<Scalar_s> extracted(Nsimd);
-        for (int ii = 0; ii < N_i; ii++)
-        for (int jj = 0; jj < N_j; jj++) {
-          int ij_rdx = jj + N_j * (ii + N_i * rt);
-          extract(lvSum[ij_rdx], extracted);
-          for (int idx = 0; idx < Nsimd; idx++) {
-            grid->iCoorFromIindex(icoor, idx);
-            int ldx    = rt + icoor[orthogdim] * rd;
-            int ij_ldx = jj + N_j * (ii + N_i * ldx);
-            lsSum[ij_ldx] = lsSum[ij_ldx] + extracted[idx];
-          }
-        }
-      });
-      std::cout << GridLogMessage << tag << " simd_extract:    " << Tms(usecond()-t0) << " ms\n";
-
-      int pd = grid->_processors[orthogdim];
-      int pc = grid->_processor_coor[orthogdim];
-
-      t0 = usecond();
-      Vector<ComplexD> cache(nt * N_i * N_j, ComplexD(0.0));
-      for (int lt = 0; lt < ld; lt++)
-      for (int pt = 0; pt < pd; pt++) {
-        int t = lt + pt * ld;
-        for (int ii = 0; ii < N_i; ii++)
-        for (int jj = 0; jj < N_j; jj++) {
-          if (pt == pc) {
-            int ij_ldx = jj + N_j * (ii + N_i * lt);
-            cache[jj + N_j * (ii + N_i * t)] = lsSum[ij_ldx]()()();
-          }
-        }
-      }
-      grid->GlobalSumVector(cache.data(), nt * N_i * N_j);
-      std::cout << GridLogMessage << tag << " globalsum:       " << Tms(usecond()-t0) << " ms\n";
-
-      for (int t  = 0; t  < nt;  t++)
-      for (int ii = 0; ii < N_i; ii++)
-      for (int jj = 0; jj < N_j; jj++)
-        result(t, ii, jj) = cache[jj + N_j * (ii + N_i * t)];
-    }
-  }
-};
-
-// ================================================================
-// Free-function GPU kernels — accelerator_for, v(ss) reads,
-// coalescedWrite writes, vobj-level arithmetic throughout.
-// Gamma arrays passed as Vector<Gamma::Algebra> (UVM).
-// ================================================================
-
-void A2ALoopPropagator(PropagatorField &loop,
-                       const std::vector<FermionField> &loop1,
-                       const std::vector<FermionField> &loop2)
-{
-  int Nk      = (int)loop1.size();
-  uint64_t oSites = loop.Grid()->oSites();
-  int Nsimd   = SpinColourVector_v::Nsimd();
-
-  typedef decltype(loop1[0].View(AcceleratorRead)) View;
-  std::vector<View> v1, v2;
-  v1.reserve(Nk); v2.reserve(Nk);
-  for (int k = 0; k < Nk; k++) {
-    v1.push_back(loop1[k].View(AcceleratorRead));
-    v2.push_back(loop2[k].View(AcceleratorRead));
-  }
-
-  deviceVector<SpinColourVector_v *> l1p(Nk), l2p(Nk);
-  for (int k = 0; k < Nk; k++) {
-    acceleratorPut(l1p[k], &v1[k][0]);
-    acceleratorPut(l2p[k], &v2[k][0]);
-  }
-
-  autoView(loopv, loop, AcceleratorWrite);
-  SpinColourVector_v **l1 = &l1p[0];
-  SpinColourVector_v **l2 = &l2p[0];
-  int lNk = Nk;
-  accelerator_for(ss, oSites, Nsimd, {
-    auto result = outerProduct(coalescedRead(l1[0][ss]), coalescedRead(l2[0][ss]));
-    for (int k = 1; k < lNk; k++)
-      result = result + outerProduct(coalescedRead(l1[k][ss]), coalescedRead(l2[k][ss]));
-    coalescedWrite(loopv[ss], result);
-  });
-}
-
-void A2APackLeftConjugated(FermionField &out, const FermionField &in)
-{
-  autoView(outv, out, AcceleratorWrite);
-  autoView(inv,  in,  AcceleratorRead);
-  uint64_t Osites = in.Grid()->oSites();
-  int Nsimd = SpinColourVector_v::Nsimd();
-  accelerator_for(ss, Osites, Nsimd, {
-    coalescedWrite(outv[ss], conjugate(inv(ss)));
-  });
-}
-
-// Type 0: colour-trace stored in (s1,s2)(0,0)
-void A2ALoopLeftContractionType0(PropagatorField &tloop, const PropagatorField &loop)
-{
-  autoView(tloopv, tloop, AcceleratorWrite);
-  autoView(loopv,  loop,  AcceleratorRead);
-  uint64_t Osites = loop.Grid()->oSites();
-  int Nsimd = SpinColourMatrix_v::Nsimd();
-  accelerator_for(ss, Osites, Nsimd, {
-    auto l = loopv(ss);
-    auto tmp = l; tmp = Zero();
-    for (int s1 = 0; s1 < Ns; ++s1)
-    for (int s2 = 0; s2 < Ns; ++s2)
-      tmp()(s1,s2)(0,0) = l()(s1,s2)(0,0) + l()(s1,s2)(1,1) + l()(s1,s2)(2,2);
-    coalescedWrite(tloopv[ss], tmp);
-  });
-}
-
-// Type 1: tloop = sum_mu Gamma(g1[mu]) * loop * Gamma(g2[mu])
-void A2ALoopLeftContractionType1(PropagatorField &tloop, const PropagatorField &loop,
-                                  const Vector<Gamma::Algebra> &gamma1,
-                                  const Vector<Gamma::Algebra> &gamma2)
-{
-  int ng = (int)gamma1.size();
-  const Gamma::Algebra *g1 = gamma1.data();
-  const Gamma::Algebra *g2 = gamma2.data();
-  autoView(tloopv, tloop, AcceleratorWrite);
-  autoView(loopv,  loop,  AcceleratorRead);
-  uint64_t Osites = loop.Grid()->oSites();
-  int Nsimd = SpinColourMatrix_v::Nsimd();
-  accelerator_for(ss, Osites, Nsimd, {
-    auto l = loopv(ss);
-    auto tmp = l; tmp = Zero();
-    for (int mu = 0; mu < ng; ++mu)
-      tmp = tmp + Gamma(g1[mu]) * l * Gamma(g2[mu]);
-    coalescedWrite(tloopv[ss], tmp);
-  });
-}
-
-// Type 2: for mu=[0..ng), s1=mu/Ns, s2=mu%Ns;
-//         tloop(s1,s2)(c1,c2) = Tr_spin( Gamma(g2[mu]) * loop )(c1,c2)
-void A2ALoopLeftContractionType2(PropagatorField &tloop, const PropagatorField &loop,
-                                  const Vector<Gamma::Algebra> &gamma2)
-{
-  int ng = (int)gamma2.size();
-  const Gamma::Algebra *g2 = gamma2.data();
-  autoView(tloopv, tloop, AcceleratorWrite);
-  autoView(loopv,  loop,  AcceleratorRead);
-  uint64_t Osites = loop.Grid()->oSites();
-  int Nsimd = SpinColourMatrix_v::Nsimd();
-  accelerator_for(ss, Osites, Nsimd, {
-    auto l = loopv(ss);
-    auto tmp = l; tmp = Zero();
-    for (int mu = 0; mu < ng; ++mu) {
-      auto gtmp = Gamma(g2[mu]) * l;
-      int s1 = mu / Ns;
-      int s2 = mu % Ns;
-      for (int c1 = 0; c1 < Nc; ++c1)
-      for (int c2 = 0; c2 < Nc; ++c2)
-        tmp()(s1,s2)(c1,c2) = gtmp()(0,0)(c1,c2) + gtmp()(1,1)(c1,c2)
-                            + gtmp()(2,2)(c1,c2) + gtmp()(3,3)(c1,c2);
-    }
-    coalescedWrite(tloopv[ss], tmp);
-  });
-}
-
-// Type 3: colour-trace → spin matrix → sum_mu G1*spinLoop*G2 stored in (s1,s2)(0,0)
-void A2ALoopLeftContractionType3(PropagatorField &tloop, const PropagatorField &loop,
-                                  const Vector<Gamma::Algebra> &gamma1,
-                                  const Vector<Gamma::Algebra> &gamma2)
-{
-  int ng = (int)gamma1.size();
-  const Gamma::Algebra *g1 = gamma1.data();
-  const Gamma::Algebra *g2 = gamma2.data();
-  autoView(tloopv, tloop, AcceleratorWrite);
-  autoView(loopv,  loop,  AcceleratorRead);
-  uint64_t Osites = loop.Grid()->oSites();
-  int Nsimd = SpinColourMatrix_v::Nsimd();
-  accelerator_for(ss, Osites, Nsimd, {
-    typedef decltype(coalescedRead(loopv[0])) calcSCMatrix;
-    typedef iSpinMatrix<typename calcSCMatrix::vector_type> calcSpinMatrix;
-    auto l = loopv(ss);
-    calcSpinMatrix spinLoop; spinLoop = Zero();
-    for (int s1 = 0; s1 < Ns; ++s1)
-    for (int s2 = 0; s2 < Ns; ++s2)
-      spinLoop()(s1,s2)() = l()(s1,s2)(0,0) + l()(s1,s2)(1,1) + l()(s1,s2)(2,2);
-    auto tmp = l; tmp = Zero();
-    for (int mu = 0; mu < ng; ++mu) {
-      calcSpinMatrix tmp2 = Gamma(g1[mu]) * spinLoop * Gamma(g2[mu]);
-      for (int s1 = 0; s1 < Ns; ++s1)
-      for (int s2 = 0; s2 < Ns; ++s2)
-        tmp()(s1,s2)(0,0) = tmp()(s1,s2)(0,0) + tmp2()(s1,s2)();
-    }
-    coalescedWrite(tloopv[ss], tmp);
-  });
-}
-
-// Type 0: loopRight = sum_mu Tr(G2*spinLoop) * G1*right
-//   where spinLoop(s1,s2) = tloop(s1,s2)(0,0)
-void A2ALoopRightContractionType0(FermionField &loopRight,
-                                   const PropagatorField &tloop,
-                                   const FermionField &right,
-                                   const Vector<Gamma::Algebra> &gamma1,
-                                   const Vector<Gamma::Algebra> &gamma2)
-{
-  int ng = (int)gamma1.size();
-  const Gamma::Algebra *g1 = gamma1.data();
-  const Gamma::Algebra *g2 = gamma2.data();
-  autoView(lRv, loopRight, AcceleratorWrite);
-  autoView(tlv, tloop,     AcceleratorRead);
-  autoView(rv,  right,     AcceleratorRead);
-  uint64_t Osites = right.Grid()->oSites();
-  int Nsimd = SpinColourVector_v::Nsimd();
-  accelerator_for(ss, Osites, Nsimd, {
-    typedef decltype(coalescedRead(rv[0]))  calcSCVector;
-    typedef decltype(coalescedRead(tlv[0])) calcSCMatrix;
-    typedef iSpinMatrix<typename calcSCMatrix::vector_type> calcSpinMatrix;
-    auto loopm  = tlv(ss);
-    auto rightv = rv(ss);
-    calcSpinMatrix spinLoop; spinLoop = Zero();
-    for (int s1 = 0; s1 < Ns; ++s1)
-    for (int s2 = 0; s2 < Ns; ++s2)
-      spinLoop()(s1,s2)() = loopm()(s1,s2)(0,0);
-    calcSCVector lR; lR = Zero();
-    for (int mu = 0; mu < ng; ++mu) {
-      auto GLoop   = Gamma(g2[mu]) * spinLoop;
-      auto trGLoop = GLoop()(0,0)() + GLoop()(1,1)() + GLoop()(2,2)() + GLoop()(3,3)();
-      auto Grightv = Gamma(g1[mu]) * rightv;
-      for (int s = 0; s < Ns; ++s)
-      for (int c = 0; c < Nc; ++c)
-        lR()(s)(c) = lR()(s)(c) + Grightv()(s)(c) * trGLoop;
-    }
-    coalescedWrite(lRv[ss], lR);
-  });
-}
-
-// Type 1: loopRight = tloop * right  (SpinColourMatrix * SpinColourVector)
-void A2ALoopRightContractionType1(FermionField &loopRight,
-                                   const PropagatorField &tloop,
-                                   const FermionField &right)
-{
-  autoView(lRv, loopRight, AcceleratorWrite);
-  autoView(tlv, tloop,     AcceleratorRead);
-  autoView(rv,  right,     AcceleratorRead);
-  uint64_t Osites = right.Grid()->oSites();
-  int Nsimd = SpinColourVector_v::Nsimd();
-  accelerator_for(ss, Osites, Nsimd, {
-    coalescedWrite(lRv[ss], tlv(ss) * rv(ss));
-  });
-}
-
-// Type 2: loopRight(s)(c) = sum_{mu,c'} tloop(s1,s2)(c,c') * (G(g1[mu])*right)(s)(c')
-void A2ALoopRightContractionType2(FermionField &loopRight,
-                                   const PropagatorField &tloop,
-                                   const FermionField &right,
-                                   const Vector<Gamma::Algebra> &gamma1)
-{
-  int ng = (int)gamma1.size();
-  const Gamma::Algebra *g1 = gamma1.data();
-  autoView(lRv, loopRight, AcceleratorWrite);
-  autoView(tlv, tloop,     AcceleratorRead);
-  autoView(rv,  right,     AcceleratorRead);
-  uint64_t Osites = right.Grid()->oSites();
-  int Nsimd = SpinColourVector_v::Nsimd();
-  accelerator_for(ss, Osites, Nsimd, {
-    typedef decltype(coalescedRead(rv[0])) calcSCVector;
-    auto loopm  = tlv(ss);
-    auto rightv = rv(ss);
-    calcSCVector lR; lR = Zero();
-    for (int mu = 0; mu < ng; ++mu) {
-      int s1 = mu / Ns;
-      int s2 = mu % Ns;
-      auto Grightv = Gamma(g1[mu]) * rightv;
-      for (int s = 0; s < Ns; ++s)
-      for (int c = 0; c < Nc; ++c)
-        lR()(s)(c) = lR()(s)(c)
-                   + loopm()(s1,s2)(c,0) * Grightv()(s)(0)
-                   + loopm()(s1,s2)(c,1) * Grightv()(s)(1)
-                   + loopm()(s1,s2)(c,2) * Grightv()(s)(2);
-    }
-    coalescedWrite(lRv[ss], lR);
-  });
-}
-
-// Type 3: loopRight(s)(c) = sum_{s'} tloop(s,s')(0,0) * right(s')(c)
-void A2ALoopRightContractionType3(FermionField &loopRight,
-                                   const PropagatorField &tloop,
-                                   const FermionField &right)
-{
-  autoView(lRv, loopRight, AcceleratorWrite);
-  autoView(tlv, tloop,     AcceleratorRead);
-  autoView(rv,  right,     AcceleratorRead);
-  uint64_t Osites = right.Grid()->oSites();
-  int Nsimd = SpinColourVector_v::Nsimd();
-  accelerator_for(ss, Osites, Nsimd, {
-    typedef decltype(coalescedRead(rv[0])) calcSCVector;
-    auto loopm  = tlv(ss);
-    auto rightv = rv(ss);
-    calcSCVector lR; lR = Zero();
-    for (int s = 0; s < Ns; ++s)
-    for (int c = 0; c < Nc; ++c)
-      lR()(s)(c) = loopm()(s,0)(0,0) * rightv()(0)(c)
-                 + loopm()(s,1)(0,0) * rightv()(1)(c)
-                 + loopm()(s,2)(0,0) * rightv()(2)(c)
-                 + loopm()(s,3)(0,0) * rightv()(3)(c);
-    coalescedWrite(lRv[ss], lR);
-  });
-}
-
-// ================================================================
-// GPU-offloaded extended meson field: accelerator_for contractions
-// + A2ASpatialSum GEMM spatial reduction.
-// ================================================================
-class A2AExtendedMesonFieldGPU
-{
-public:
-  static void compute(
-      Eigen::Tensor<ComplexD, 3> &result,
-      const std::vector<FermionField> &left,
-      const std::vector<FermionField> &right,
-      const std::vector<FermionField> &loop1,
-      const std::vector<FermionField> &loop2,
-      const std::vector<Gamma::Algebra> &gamma1_in,
-      const std::vector<Gamma::Algebra> &gamma2_in,
-      int type)
-  {
-    GridBase *grid = left[0].Grid();
-    int N_i = (int)left.size();
-    int N_j = (int)right.size();
-
-    std::string tag = std::string("[gpu  type=") + std::to_string(type) + "]";
-    auto Tms = [](double us) { return us * 1e-3; };
-    double t0;
-
-    Vector<Gamma::Algebra> gamma1(gamma1_in.begin(), gamma1_in.end());
-    Vector<Gamma::Algebra> gamma2(gamma2_in.begin(), gamma2_in.end());
-
-    t0 = usecond();
-    for (auto &f : loop1) { autoView(v, f, AcceleratorRead); }
-    for (auto &f : loop2) { autoView(v, f, AcceleratorRead); }
-    std::cout << GridLogMessage << tag << " view_open_loop:  " << Tms(usecond()-t0) << " ms\n";
-
-    t0 = usecond();
-    PropagatorField loop(grid);
-    A2ALoopPropagator(loop, loop1, loop2);
-    std::cout << GridLogMessage << tag << " loop_build:      " << Tms(usecond()-t0) << " ms\n";
-
-    t0 = usecond();
-    for (int i = 0; i < N_i; i++) { autoView(v, left[i], AcceleratorRead); }
-    std::cout << GridLogMessage << tag << " view_open_left:  " << Tms(usecond()-t0) << " ms\n";
-
-    t0 = usecond();
-    std::vector<FermionField> leftv(N_i, grid);
-    for (int i = 0; i < N_i; i++)
-      A2APackLeftConjugated(leftv[i], left[i]);
-    std::cout << GridLogMessage << tag << " pack_left:       " << Tms(usecond()-t0) << " ms\n";
-
-    t0 = usecond();
-    PropagatorField tloop(grid);
-    tloop = Zero();
-    switch (type) {
-    case 0: A2ALoopLeftContractionType0(tloop, loop);                break;
-    case 1: A2ALoopLeftContractionType1(tloop, loop, gamma1, gamma2); break;
-    case 2: A2ALoopLeftContractionType2(tloop, loop, gamma2);         break;
-    case 3: A2ALoopLeftContractionType3(tloop, loop, gamma1, gamma2); break;
-    }
-    std::cout << GridLogMessage << tag << " tloop:           " << Tms(usecond()-t0) << " ms\n";
-
-    t0 = usecond();
-    { autoView(tlv, tloop, AcceleratorRead); }
-    for (int j = 0; j < N_j; j++) { autoView(rv, right[j], AcceleratorRead); }
-    std::cout << GridLogMessage << tag << " view_open_right: " << Tms(usecond()-t0) << " ms\n";
-
-    t0 = usecond();
-    std::vector<FermionField> loopRight(N_j, grid);
-    for (int j = 0; j < N_j; j++) {
-      switch (type) {
-      case 0: A2ALoopRightContractionType0(loopRight[j], tloop, right[j], gamma1, gamma2); break;
-      case 1: A2ALoopRightContractionType1(loopRight[j], tloop, right[j]);                 break;
-      case 2: A2ALoopRightContractionType2(loopRight[j], tloop, right[j], gamma1);         break;
-      case 3: A2ALoopRightContractionType3(loopRight[j], tloop, right[j]);                 break;
-      }
-    }
-    std::cout << GridLogMessage << tag << " pack_loopright:  " << Tms(usecond()-t0) << " ms\n";
-
-    A2ASpatialSum<SpinColourVector_v> spatial_sum;
-    double t_blas = usecond();
-
-    t0 = usecond();
-    spatial_sum.Allocate(N_i, N_j, grid);
-    std::cout << GridLogMessage << tag << " Allocate:        " << Tms(usecond()-t0) << " ms\n";
-
-    t0 = usecond();
-    spatial_sum.PackLeft(leftv);
-    std::cout << GridLogMessage << tag << " PackLeft:        " << Tms(usecond()-t0) << " ms\n";
-
-    t0 = usecond();
-    spatial_sum.PackRight(loopRight);
-    std::cout << GridLogMessage << tag << " PackRight:       " << Tms(usecond()-t0) << " ms\n";
-
-    t0 = usecond();
-    spatial_sum.Sum(result);
-    std::cout << GridLogMessage << tag << " Sum (GEMM+MPI):  " << Tms(usecond()-t0) << " ms\n";
-
-    std::cout << GridLogMessage << tag << " A2ASpatialSum:   " << Tms(usecond()-t_blas) << " ms  [TOTAL]\n";
-  }
-};
-
-int main(int argc, char *argv[])
-{
-  Grid_init(&argc, &argv);
-
-  Coordinate latt_size   = GridDefaultLatt();
-  Coordinate simd_layout = GridDefaultSimd(4, vComplexD::Nsimd());
-  Coordinate mpi_layout  = GridDefaultMpi();
-  GridCartesian grid(latt_size, simd_layout, mpi_layout);
-
-  int Nt    = latt_size[Tp];
-  int N_i   = 8;
-  int N_j   = 8;
-  int Nloop = 4;
-
-  if (GridCmdOptionExists(argv, argv+argc, "--Ni"))
-    N_i = std::stoi(GridCmdOptionPayload(argv, argv+argc, "--Ni"));
-  if (GridCmdOptionExists(argv, argv+argc, "--Nj"))
-    N_j = std::stoi(GridCmdOptionPayload(argv, argv+argc, "--Nj"));
-
-  GridParallelRNG pRNG(&grid);
-  pRNG.SeedFixedIntegers({1, 2, 3, 4});
-
-  std::vector<FermionField> left(N_i,   &grid);
-  std::vector<FermionField> right(N_j,  &grid);
-  std::vector<FermionField> loop1(Nloop, &grid);
-  std::vector<FermionField> loop2(Nloop, &grid);
-
-  for (auto &f : left)  random(pRNG, f);
-  for (auto &f : right) random(pRNG, f);
-  for (auto &f : loop1) random(pRNG, f);
-  for (auto &f : loop2) random(pRNG, f);
-
-  std::vector<Gamma::Algebra> GammaMU = {
-    Gamma::Algebra::GammaX,
-    Gamma::Algebra::GammaY,
-    Gamma::Algebra::GammaZ,
-    Gamma::Algebra::GammaT
-  };
-
-  Eigen::Tensor<ComplexD, 3> result_ref(Nt, N_i, N_j);
-  Eigen::Tensor<ComplexD, 3> result_blas(Nt, N_i, N_j);
-  Eigen::Tensor<ComplexD, 3> result_gpu(Nt, N_i, N_j);
-  double t_ref = 0, t_blas = 0, t_gpu = 0, start, stop;
-
-  for (int type = 0; type < 4; type++) {
-
-    result_ref.setZero();
-    start = usecond();
-    A2AExtendedMesonFieldRef::compute(result_ref,  left, right, loop1, loop2,
-                                      GammaMU, GammaMU, type, false);
-    stop = usecond(); t_ref = stop - start;
-
-    result_blas.setZero();
-    start = usecond();
-    A2AExtendedMesonFieldRef::compute(result_blas, left, right, loop1, loop2,
-                                      GammaMU, GammaMU, type, true);
-    stop = usecond(); t_blas = stop - start;
-
-    result_gpu.setZero();
-    start = usecond();
-    A2AExtendedMesonFieldGPU::compute(result_gpu, left, right, loop1, loop2,
-                                      GammaMU, GammaMU, type);
-    stop = usecond(); t_gpu = stop - start;
-
-    double norm2_ref = 0.0, norm2_blas = 0.0, norm2_gpu = 0.0;
-    double norm2_diff_blas = 0.0, norm2_diff_gpu = 0.0;
-    for (int t  = 0; t  < Nt;  t++)
-    for (int ii = 0; ii < N_i; ii++)
-    for (int jj = 0; jj < N_j; jj++) {
-      norm2_ref  += norm2(result_ref(t, ii, jj));
-      norm2_blas += norm2(result_blas(t, ii, jj));
-      norm2_gpu  += norm2(result_gpu(t, ii, jj));
-      ComplexD diff_blas = result_ref(t, ii, jj) - result_blas(t, ii, jj);
-      ComplexD diff_gpu  = result_ref(t, ii, jj) - result_gpu(t, ii, jj);
-      norm2_diff_blas += norm2(diff_blas);
-      norm2_diff_gpu  += norm2(diff_gpu);
-    }
-
-    double rel_blas = (norm2_ref > 0) ? std::sqrt(norm2_diff_blas / norm2_ref) : 0.0;
-    double rel_gpu  = (norm2_ref > 0) ? std::sqrt(norm2_diff_gpu  / norm2_ref) : 0.0;
-
-    std::cout << GridLogMessage
-              << "type=" << type
-              << "  norm2_ref="  << norm2_ref
-              << "  norm2_blas=" << norm2_blas
-              << "  norm2_gpu="  << norm2_gpu
-              << "  rel_blas="   << rel_blas
-              << "  rel_gpu="    << rel_gpu
-              << "  t_ref="  << t_ref  * 1e-6 << "s"
-              << "  t_blas=" << t_blas * 1e-6 << "s"
-              << "  t_gpu="  << t_gpu  * 1e-6 << "s"
-              << std::endl;
-
-    GRID_ASSERT(rel_blas < 1e-10);
-    GRID_ASSERT(rel_gpu  < 1e-10);
-  }
-
-  std::cout << GridLogMessage << "All types passed A2ASpatialSum and GPU regression." << std::endl;
-
-  Grid_finalize();
-  return EXIT_SUCCESS;
-}
-
-#endif
--- a/Show More
+++ b/Show More