Merge branch 'develop' of https://github.com/paboyle/Grid into develop

2025-08-03 05:07:07 +01:00 · 2021-03-12 15:36:55 +01:00
parent 51f506553c 226be84937
commit 3c67d626ba
1 changed files with 35 additions and 40 deletions
--- a/75
+++ b/75
@@ -1,3 +1,6 @@
 -- comms threads issue??
 -- Part done: Staggered kernel performance on GPU
 =========================================================
 General
 =========================================================
@@ -5,28 +8,18 @@ General
 - Make representations code take Gimpl
 - Simplify the HMC and remove modules
 - Lattice_arith - are the mult, mac etc.. still needed after ET engine?
- Lattice_rng
+- Lattice_rng - faster local only loop in init
- Lattice_transfer.h
+- Audit: accelerate A2Autils -- off critical path for HMC
 - accelerate A2Autils -- off critical path for HMC
 =========================================================
-GPU branch code item work list
+GPU  work list
 =========================================================
-* sum_cpu promote to double during summation for increased precisoin.
+* sum_cpu promote to double during summation for increased precision.
 * Introduce sumD & ReduceD 
 * GPU sum is probably better currently.
 * Accelerate the cshift & benchmark
 * 0) Single GPU
 - 128 bit integer table load in GPU code.
  - ImprovedStaggered accelerate & measure perf
  - Gianluca's changes to Cayley into gpu-port
  - Mobius kernel fusion.                     -- Gianluca?
  - Lebesgue order reintroduction. StencilView should have pointer to it
  - Lebesgue reorder in all kernels
 * 3) Comms/NVlink
 - OpenMP tasks to run comms threads. Experiment with it 
 - Remove explicit OpenMP in staggered.
@@ -35,14 +28,6 @@ GPU branch code item work list
 - Stencil gather ??
 - SIMD dirs in stencil
 * 4) ET enhancements
 - eval -> scalar ops in ET engine
 - coalescedRead, coalescedWrite in expressions.
 * 5) Misc
 - Conserved current clean up.
 - multLinkProp eliminate
 8) Merge develop and test HMC
 9) Gamma tables on GPU; check this. Appear to work, but no idea why. Are these done on CPU?
@@ -52,7 +37,7 @@ GPU branch code item work list
 -     Audit NAMESPACE CHANGES
 -     Audit changes
-----
+---------
 Gianluca's changes
 - Performance impact of construct in aligned allocator???
 ---------
@@ -62,6 +47,33 @@ Gianluca's changes
 -----------------------------
 DONE:
 -----------------------------
 =====
 -- Done: Remez X^-1/2 X^-1/2 X = 1 test.
         Feed in MdagM^2 as a test and take its sqrt.
         Automated test that MdagM invsqrt(MdagM)invsqrt(MdagM) = 1 in HMC for bounds satisfaction.
 -- Done: Sycl Kernels into develop. Compare to existing unroll and just use.
 -- Done: sRNG into refresh functions
 -- Done: Tuned decomposition on CUDA into develop
 -- Done: Sycl friend accessor. Const view attempt via typedef??
 * Done 5) Misc
 - Conserved current clean up.
 - multLinkProp eliminate
 * Done 0) Single GPU
 - 128 bit integer table load in GPU code.
  - ImprovedStaggered accelerate & measure perf
  - Gianluca's changes to Cayley into gpu-port
  - Mobius kernel fusion.                     -- Gianluca?
  - Lebesgue order reintroduction. StencilView should have pointer to it
  - Lebesgue reorder in all kernels
 * 4) ET enhancements
 - Done eval -> scalar ops in ET engine
 - Done coalescedRead, coalescedWrite in expressions.
 =============================================================================================
 AUDIT ContractWWVV with respect to develop    -- DONE
 - GPU accelerate EOFA                                                  -- DONE
@@ -125,23 +137,6 @@ AUDIT ContractWWVV with respect to develop    -- DONE
 - -      (4) omp parallel for collapse(n)
 - - Only (1) has a natural mirror in accelerator_loop
 - - Nested loop macros get cumbersome; made a generic interface for N deep
 - - Don't like thread_region and thread_loop_in_region
 - - Could replace with 
    thread_nested(1,
      for (){
      }
    );
    thread_nested(2,
      for (){
        for (){
        }
      }
    );
    and same "in_region".
 -----------------------------
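The thread_nested idea noted above could be sketched roughly as follows. This is a minimal illustration only, not the macro as it exists in Grid: the names DO_PRAGMA, thread_nested as defined here, and the helper fill_grid are assumptions made for this example. The sketch maps the nesting depth straight onto option (4), omp parallel for collapse(n), so the loops passed in must be perfectly nested and in OpenMP canonical form.

```cpp
#include <vector>

// Hypothetical helpers (assumed for this sketch): build a _Pragma string
// from tokens so the collapse depth can be a macro argument.
#define DO_PRAGMA_(x) _Pragma(#x)
#define DO_PRAGMA(x) DO_PRAGMA_(x)

// thread_nested(n, loops): put "omp parallel for collapse(n)" in front of
// the loop nest given as the remaining arguments. The variadic form lets
// the loop body contain commas.
#define thread_nested(n, ...) DO_PRAGMA(omp parallel for collapse(n)) __VA_ARGS__

// Example use: fill an nx-by-ny grid with its linear site index.
// Every iteration writes a distinct element, so there is no data race.
std::vector<int> fill_grid(int nx, int ny) {
  std::vector<int> out(nx * ny, -1);
  int *p = out.data();
  thread_nested(2,
    for (int x = 0; x < nx; x++) {
      for (int y = 0; y < ny; y++) {
        p[x * ny + y] = x * ny + y;
      }
    }
  );
  return out;
}
```

Built without OpenMP support the pragma is ignored and the loops run serially, so the same source works in both cases. An "in_region" variant would presumably use "omp for collapse(n)" without "parallel"; that is left out here.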