mirror of
https://github.com/paboyle/Grid.git
synced 2025-04-04 19:25:56 +01:00
Update todo list
parent
2d2de7aede
commit
2b037e3daa
73
TODO
@@ -1,30 +1,31 @@
|
|||||||
|
=========================================================
|
||||||
|
General
|
||||||
|
=========================================================
|
||||||
|
|
||||||
- Make representations code take Gimpl
|
- Make representations code take Gimpl
|
||||||
- Simplify the HMC and remove modules
|
- Simplify the HMC and remove modules
|
||||||
|
|
||||||
- Lattice_arith - are the mult, mac etc.. still needed after ET engine?
|
- Lattice_arith - are the mult, mac etc.. still needed after ET engine?
|
||||||
- ImprovedStaggered accelerate
|
- Lattice_rng
|
||||||
Lattice_rng
|
- Lattice_transfer.h
|
||||||
Lattice_transfer.h
|
|
||||||
|
|
||||||
- accelerate A2Autils -- off critical path for HMC
|
- accelerate A2Autils -- off critical path for HMC
|
||||||
- Lebesgue order reintroduction. StencilView should have pointer to it
|
|
||||||
|
|
||||||
|
=========================================================
|
||||||
GPU branch code item work list
|
GPU branch code item work list
|
||||||
-----------------------------
|
=========================================================
|
||||||
|
|
||||||
|
* sum_cpu promote to double during summation for increased precision.
|
||||||
|
* Introduce sumD & ReduceD
|
||||||
|
* GPU sum is probably better currently.
|
||||||
|
|
||||||
|
* Accelerate the cshift & benchmark
|
||||||
|
|
||||||
7) Accelerate the cshift & benchmark
|
|
||||||
* 0) Single GPU
|
* 0) Single GPU
|
||||||
- 128 bit integer table load in GPU code.
|
- 128 bit integer table load in GPU code.
|
||||||
- Staggered kernels -> GPU coalesced loop, loop in kernels
|
- ImprovedStaggered accelerate & measure perf
|
||||||
- Staggered kernels inline for GPU -- DONE
|
|
||||||
|
|
||||||
* Gianluca merger
|
|
||||||
- Cayley coefficients -> GPU retention or prefetch
|
|
||||||
- Gianluca's changes to Cayley into gpu-port
|
- Gianluca's changes to Cayley into gpu-port
|
||||||
- Mobius kernel fusion. -- Gianluca?
|
- Mobius kernel fusion. -- Gianluca?
|
||||||
- Make GPU offload reductions deterministic -- Gianluca merge
|
- Lebesgue order reintroduction. StencilView should have pointer to it
|
||||||
- Lattice_reduction - remnant thread_loops must offload. Audit thread_loop in main code for non-accelerated code
|
- Lebesgue reorder in all kernels
|
||||||
|
|
||||||
* 3) Comms/NVlink
|
* 3) Comms/NVlink
|
||||||
- OpenMP tasks to run comms threads. Experiment with it
|
- OpenMP tasks to run comms threads. Experiment with it
|
||||||
@@ -43,12 +44,24 @@ GPU branch code item work list
|
|||||||
- multLinkProp eliminate
|
- multLinkProp eliminate
|
||||||
|
|
||||||
8) Merge develop and test HMC
|
8) Merge develop and test HMC
|
||||||
|
|
||||||
9) Gamma tables on GPU; check this. Appear to work, but no idea why. Are these done on CPU?
|
9) Gamma tables on GPU; check this. Appear to work, but no idea why. Are these done on CPU?
|
||||||
|
|
||||||
10) Audit
|
10) Audit
|
||||||
- pragma once uniformly
|
- pragma once uniformly
|
||||||
- Audit NAMESPACE CHANGES
|
- Audit NAMESPACE CHANGES
|
||||||
- Audit changes
|
- Audit changes
|
||||||
|
|
||||||
|
-----
|
||||||
|
Gianluca's changes
|
||||||
|
- Performance impact of construct in aligned allocator???
|
||||||
|
---------
|
||||||
|
|
||||||
|
- merge2 where is it used. Audit routines, comment out and check compile.
|
||||||
|
|
||||||
|
-----------------------------
|
||||||
|
DONE:
|
||||||
|
-----------------------------
|
||||||
=============================================================================================
|
=============================================================================================
|
||||||
AUDIT ContractWWVV with respect to develop -- DONE
|
AUDIT ContractWWVV with respect to develop -- DONE
|
||||||
- GPU accelerate EOFA -- DONE
|
- GPU accelerate EOFA -- DONE
|
||||||
@@ -63,6 +76,12 @@ AUDIT ContractWWVV with respect to develop -- DONE
|
|||||||
_foreach
|
_foreach
|
||||||
_for
|
_for
|
||||||
|
|
||||||
|
- Staggered kernels -> GPU coalesced loop, loop in kernels -- DONE
|
||||||
|
- Staggered kernels inline for GPU -- DONE
|
||||||
|
|
||||||
|
-- Common source GPU and CPU generic kernels??? ---- DONE
|
||||||
|
-- - Uniform coding between GPU kernels and CPU kernels attempt ---- DONE, got faster !
|
||||||
|
|
||||||
-- Figure what to do about "multLinkGpu" etc.. in FermionOperatorImpl. -- DONE
|
-- Figure what to do about "multLinkGpu" etc.. in FermionOperatorImpl. -- DONE
|
||||||
-- Gparity is the awkward one -- DONE
|
-- Gparity is the awkward one -- DONE
|
||||||
-- Solve non-Gparity first. -- DONE
|
-- Solve non-Gparity first. -- DONE
|
||||||
@@ -75,25 +94,17 @@ AUDIT ContractWWVV with respect to develop -- DONE
|
|||||||
--
|
--
|
||||||
-- Reread WilsonKernels and check diffs -- DONE
|
-- Reread WilsonKernels and check diffs -- DONE
|
||||||
--
|
--
|
||||||
-- Common source GPU and CPU generic kernels??? ---- DONE
|
|
||||||
-- - Uniform coding between GPU kernels and CPU kernels attempt ---- DONE, got faster !
|
|
||||||
|
|
||||||
-----
|
|
||||||
Gianluca's changes
|
|
||||||
- Performance impact of construct in aligned allocator???
|
|
||||||
- Inner product compare to Summit inner product optimisation
|
|
||||||
- CayleyFermion5D.cc - flop count line 166 odd. Shouldn't depend on arch
|
|
||||||
- - Review Vector use
|
|
||||||
- CayleyFermion5D.h - DperpGPU unify coding style
|
|
||||||
---------
|
|
||||||
|
|
||||||
- Lebesgue reorder in all kernels
|
|
||||||
- merge2 where is it used. Audit routines, comment out and check compile.
|
|
||||||
- AVX512 still broken, lebesgue order missing ?
|
- AVX512 still broken, lebesgue order missing ?
|
||||||
|
|
||||||
|
* Gianluca merger
|
||||||
DONE:
|
- Cayley coefficients -> GPU retention or prefetch
|
||||||
-----------------------------
|
- Make GPU offload reductions deterministic -- Gianluca merge
|
||||||
|
- Lattice_reduction - remnant thread_loops must offload. Audit thread_loop in main code for non-accelerated code
|
||||||
|
- Inner product compare to Summit inner product optimisation
|
||||||
|
- CayleyFermion5D.cc - flop count line 166 odd. Shouldn't depend on arch
|
||||||
|
- - Review Vector use
|
||||||
|
- CayleyFermion5D.h - DperpGPU unify coding style
|
||||||
|
|
||||||
- Committed my modifications
|
- Committed my modifications
|
||||||
- Accelerate non-dslash elements of Mobius; check accelerator_loop uniformly used in fermion operators
|
- Accelerate non-dslash elements of Mobius; check accelerator_loop uniformly used in fermion operators
|
||||||
|
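
Note on the new "sum_cpu promote to double during summation" item: a minimal sketch of why promoting the accumulator helps, assuming plain float data. The helper names sum_single and sum_promoted are hypothetical and are not Grid's sum_cpu, sumD or ReduceD; this only illustrates the idea of keeping single-precision storage while accumulating in double.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of Grid): accumulate float data in a float.
// Once the running total reaches 2^24, adding 1.0f no longer changes it,
// because the next integer is not representable in single precision.
float sum_single(const std::vector<float>& v) {
  float acc = 0.0f;
  for (std::size_t i = 0; i < v.size(); ++i) acc += v[i];
  return acc;
}

// Hypothetical helper (not part of Grid): same input, but each element is
// promoted to double before it is added, so rounding error in the running
// total stays far below single-precision resolution.
double sum_promoted(const std::vector<float>& v) {
  double acc = 0.0;
  for (std::size_t i = 0; i < v.size(); ++i) acc += static_cast<double>(v[i]);
  return acc;
}
```

With an input of 2^24 + 16 elements all equal to 1.0f, sum_promoted returns the exact count, while sum_single is expected to stall at 2^24 on targets that evaluate float arithmetic in single precision. The stored data stays in float; only the accumulator type changes, which is presumably the trade-off the item has in mind.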