From 001814b4422d9b17a2751d18e23172e5ded79690 Mon Sep 17 00:00:00 2001
From: Peter Boyle
Date: Fri, 12 Mar 2021 09:31:17 -0500
Subject: [PATCH] updated to do list. Start adding DDHMC work items

---
 TODO | 75 ++++++++++++++++++++++++++++--------------------------------
 1 file changed, 35 insertions(+), 40 deletions(-)

diff --git a/TODO b/TODO
index f1175560..e23e040d 100644
--- a/TODO
+++ b/TODO
@@ -1,3 +1,6 @@
+-- comms threads issue??
+-- Part done: Staggered kernel performance on GPU
+
 =========================================================
 General
 =========================================================
@@ -5,28 +8,18 @@ General
 - Make representations code take Gimpl
 - Simplify the HMCand remove modules
 - Lattice_arith - are the mult, mac etc.. still needed after ET engine?
-- Lattice_rng
-- Lattice_transfer.h
-- accelerate A2Autils -- off critical path for HMC
+- Lattice_rng - faster local only loop in init
+- Audit: accelerate A2Autils -- off critical path for HMC
 
 =========================================================
-GPU branch code item work list
+GPU work list
 =========================================================
 
-* sum_cpu promote to double during summation for increased precisoin.
+* sum_cpu promote to double during summation for increased precision.
 * Introduce sumD & ReduceD
 * GPU sum is probably better currently.
-
 * Accelerate the cshift & benchmark
 
-* 0) Single GPU
-- 128 bit integer table load in GPU code.
-  - ImprovedStaggered accelerate & measure perf
-  - Gianluca's changes to Cayley into gpu-port
-  - Mobius kernel fusion. -- Gianluca?
-  - Lebesque order reintroduction. StencilView should have pointer to it
-  - Lebesgue reorder in all kernels
-
 * 3) Comms/NVlink
 - OpenMP tasks to run comms threads. Experiment with it
 - Remove explicit openMP in staggered.
@@ -35,14 +28,6 @@ GPU branch code item work list
 - Stencil gather ??
 - SIMD dirs in stencil
 
-* 4) ET enhancements
-- eval -> scalar ops in ET engine
-- coalescedRead, coalescedWrite in expressions.
-
-* 5) Misc
-- Conserved current clean up.
-- multLinkProp eliminate
-
 8) Merge develop and test HMC
 
 9) Gamma tables on GPU; check this. Appear to work, but no idea why. Are these done on CPU?
@@ -52,7 +37,7 @@ GPU branch code item work list
 - Audit NAMESPACE CHANGES
 - Audit changes
 
------
+---------
 Gianluca's changes
 - Performance impact of construct in aligned allocator???
 ---------
@@ -62,6 +47,33 @@ Gianluca's changes
 -----------------------------
 DONE:
 -----------------------------
+=====
+-- Done: Remez X^-1/2 X^-1/2 X = 1 test.
+   Feed in MdagM^2 as a test and take its sqrt.
+   Automated test that MdagM invsqrt(MdagM)invsqrt(MdagM) = 1 in HMC for bounds satisfaction.
+
+-- Done: Sycl Kernels into develop. Compare to existing unroll and just use.
+-- Done: sRNG into refresh functions
+-- Done: Tuned decomposition on CUDA into develop
+-- Done: Sycl friend accessor. Const view attempt via typedef??
+
+
+* Done 5) Misc
+- Conserved current clean up.
+- multLinkProp eliminate
+
+* Done 0) Single GPU
+- 128 bit integer table load in GPU code.
+  - ImprovedStaggered accelerate & measure perf
+  - Gianluca's changes to Cayley into gpu-port
+  - Mobius kernel fusion. -- Gianluca?
+  - Lebesque order reintroduction. StencilView should have pointer to it
+  - Lebesgue reorder in all kernels
+
+* 4) ET enhancements
+- Done eval -> scalar ops in ET engine
+- Done coalescedRead, coalescedWrite in expressions.
+
 =============================================================================================
 AUDIT ContractWWVV with respect to develop -- DONE
 - GPU accelerate EOFA -- DONE
@@ -125,23 +137,6 @@ AUDIT ContractWWVV with respect to develop -- DONE
 - - (4) omp parallel for collapse(n)
 - - Only (1) has a natural mirror in accelerator_loop
 - - Nested loop macros get cumbersome made a generic interface for N deep
-- - Don't like thread_region and thread_loop_in_region
-- - Could replace with
-
-    thread_nested(1,
-      for {
-
-      }
-    );
-    thread_nested(2,
-      for (){
-        for (){
-
-        }
-      }
-    );
-
-    and same "in_region".
 
 -----------------------------
 