f8797e1e3e
bug fix. works now and great face performance
2017-04-26 03:14:02 -04:00
5b55867a7a
Slightly cheaper Ext assembly
2017-04-24 05:36:11 -04:00
3accb1ef89
Debugged assemply split phase with interior suppression
2017-04-23 19:30:19 -04:00
5812eb8a8c
Partially fixed. But the comms-overlap does not work yet.
2017-04-22 18:50:25 -04:00
ac58565d0a
Dangerous rewrite of the assembly. If I make a mistake the debug will be painful.
2017-04-22 19:31:04 +01:00
2c246551d0
Overlap comms and compute options in wilson kernels
2017-02-07 01:37:10 -05:00
eabf316ed9
BGQ performance ASM
2016-12-22 21:56:08 +00:00
e1042aef77
First version of the doube prec for testing purposes
...
It does not compile single and double version at the same time
2016-10-28 17:20:04 +01:00
81f2aeaece
KNL streaming stores, and KNL performance coutners
2016-10-12 11:45:22 +01:00
90e70790f3
Feature for z-Mobius prep
2016-08-15 22:31:29 +01:00
bdaa5b1767
Updated to have perfect prefetching for the s-vectorised kernel with any cache blocking.
2016-06-30 14:35:02 -07:00
8fcefc021a
Improved the prefetching when using cache blocking codes
2016-06-30 14:35:02 -07:00
05c884a62a
Prefetch change
2016-06-30 14:35:01 -07:00
2d8bb4c594
Tweaks
2016-06-30 14:35:01 -07:00
6d58cb2a68
Enable reordering of the loops in the assembler for cache friendly.
...
This gets in the way of L2 prefetching however. Do next next link in stencil
prefetching.
2016-06-30 14:35:01 -07:00
87418e7df1
Slightly faster prefetching perf.
2016-06-13 02:32:52 -07:00
55f65b81b5
Improvements to the assembler interface that let us move chunks of the
...
site and s loop into the kernels. This will save on function call overhead and
guarantee L2 prefetching strategy is right since OMP can't distribute the
sub-chunks of work.
2016-06-09 01:12:36 -07:00
d9408893b3
Prefetching in the normal kernel implementation.
2016-06-08 05:43:48 -07:00
139cc5f1ae
Large change with KNL preparation
2016-06-03 03:24:26 -07:00