mirror of
https://github.com/paboyle/Grid.git
synced 2024-11-10 07:55:35 +00:00
ec9939c1ba
It should be possible to cache-block this at the outer levels. The global sum across nodes is not performed here; it is deferred to the caller, so that many sums can be blocked into one big all-reduce. Nc=3 and Fermion are hard-coded in an ugly way. We might consider benchmarking whether Grid should make a product without the conjugate available. It is not clear whether the explicit unroll or performing the conjugate on the left once was the real source of the speed-up. Gives 70-80 GF/s on my laptop in single precision, half that in double, and 70 GB/s to cache. This is competitive with dslash and a reasonable stopping point for the optimisation. If necessary we can revisit. |
||
---|---|---|
Benchmark_comms.cc | ||
Benchmark_dwf_sweep.cc | ||
Benchmark_dwf.cc | ||
Benchmark_gparity.cc | ||
Benchmark_IO.cc | ||
Benchmark_ITT.cc | ||
Benchmark_memory_asynch.cc | ||
Benchmark_memory_bandwidth.cc | ||
Benchmark_meson_field.cc | ||
Benchmark_mooee.cc | ||
Benchmark_staggered.cc | ||
Benchmark_su3.cc | ||
Benchmark_wilson_sweep.cc | ||
Benchmark_wilson.cc | ||
Makefile.am | ||
simple_simd_test.cc | ||
simple_su3_expr.cc | ||
simple_su3_test.cc |
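
The commit message above describes a node-local inner product that conjugates the left operand, unrolls the Nc=3 colour loop by hand, and leaves the cross-node sum to the caller. A minimal sketch of that idea follows. It is not Grid's code: `ColourVector`, `localInnerProduct` and `localInnerProducts` are hypothetical names chosen for illustration, plain `std::complex` stands in for Grid's SIMD types, and no MPI call is made.

```cpp
#include <array>
#include <complex>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for one lattice site's colour vector.
// Nc = 3 is hard-coded, mirroring the commit message; this is not Grid's type.
using ColourVector = std::array<std::complex<double>, 3>;

// Node-local inner product: sum over sites of conj(left[x]) . right[x].
// No global sum is performed; the result is only this node's contribution.
std::complex<double> localInnerProduct(const std::vector<ColourVector>& left,
                                       const std::vector<ColourVector>& right) {
  std::complex<double> sum(0.0, 0.0);
  const std::size_t n = left.size() < right.size() ? left.size() : right.size();
  for (std::size_t x = 0; x < n; ++x) {
    // Conjugate the left operand once per site.
    const std::complex<double> l0 = std::conj(left[x][0]);
    const std::complex<double> l1 = std::conj(left[x][1]);
    const std::complex<double> l2 = std::conj(left[x][2]);
    // Colour loop unrolled explicitly for Nc = 3.
    sum += l0 * right[x][0] + l1 * right[x][1] + l2 * right[x][2];
  }
  return sum;
}

// Gather the local sums for several left vectors into one contiguous buffer.
// A caller could then reduce the whole buffer across nodes in a single
// all-reduce instead of issuing one reduction per inner product.
std::vector<std::complex<double>> localInnerProducts(
    const std::vector<std::vector<ColourVector>>& lefts,
    const std::vector<ColourVector>& right) {
  std::vector<std::complex<double>> sums;
  sums.reserve(lefts.size());
  for (const auto& left : lefts) {
    sums.push_back(localInnerProduct(left, right));
  }
  return sums;
}
```

In a multi-node run, the buffer returned by `localInnerProducts` is what would be handed to one collective reduction. Whether the unroll or the conjugation does most of the work is exactly the open question the commit message raises, so this sketch makes no performance claim.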