mirror of https://github.com/paboyle/Grid.git synced 2025-10-25 18:19:34 +01:00
Commit Graph

960 Commits

Author SHA1 Message Date
paboyle
a2ff068e29 Asm and threading for many core 2015-11-06 03:47:14 -08:00
paboyle
b362f8d27b Threading for many core 2015-11-06 03:46:41 -08:00
paboyle
64770d9052 Threading changes for many core and asm calls 2015-11-06 03:46:21 -08:00
paboyle
17af18dcab Changes for AVX512 assembler 2015-11-06 03:45:51 -08:00
paboyle
1159de165c Asm option for AVX512 2015-11-05 22:04:51 -08:00
paboyle
16c7993434 Merge branch 'master' of github.com:paboyle/Grid
Conflicts:
	lib/simd/Grid_avx512.h
	lib/simd/Grid_imci.h
2015-11-04 03:32:10 -08:00
paboyle
6be9716e6f New file 2015-11-04 03:26:28 -08:00
paboyle
4a41c885ed Use Linux kernel interface to hardware performance counters. Dead useful. 2015-11-04 03:24:19 -08:00
paboyle
757b31ed42 Threading for KNC mods. 2015-11-04 03:22:14 -08:00
paboyle
ac7d1f26ad Either blocking or lebesgue curve 2015-11-04 03:19:16 -08:00
paboyle
1a8bf938b3 Use either sub-blocking or lebesgue 2015-11-04 03:18:51 -08:00
paboyle
63a2993827 Exec info and cache blocking 2015-11-04 03:16:56 -08:00
paboyle
4e65ad21ac Adding a routine for AVX512 / IMCI with explicit assembly implementations 2015-11-04 03:15:08 -08:00
Peter Boyle
dfc1de6f60 Merge branch 'master' of github.com:paboyle/Grid 2015-11-04 05:14:26 -06:00
Peter Boyle
3b7576ad53 Switch off for now 2015-11-04 05:13:29 -06:00
paboyle
9b5d31ffc1 mac, mult routines
2015-11-04 03:10:34 -08:00
paboyle
a38762159c Inline assembly hooks for AVX 512. Better way in some ways than BAGEL to generate assembly.
Updated Grid_avx512.h
2015-11-04 03:09:06 -08:00
Peter Boyle
ffc5dab17f AMD FMA4 support added for Interlagos/BlueWaters 2015-11-04 04:29:58 -06:00
Peter Boyle
96608c70d1 chrono causing some problems on Cray systems. Suspend use for now 2015-11-04 04:28:31 -06:00
Peter Boyle
d35d63b171 Algorithm in 2015-11-04 04:27:44 -06:00
Peter Boyle
24044dbc56 Debugged a problem with checkerboarded cshift in the checker dimension which arose
only when mpi spread out in the checker dimension. Added a test that trapped and helped debug this
2015-11-04 10:00:55 +00:00
Peter Boyle
abb23df83f formatting only 2015-11-04 10:00:27 +00:00
Peter Boyle
12c5ec813c Useful debug messages (commented out) are included for preservation in case I need to revisit this 2015-11-04 09:59:27 +00:00
Peter Boyle
1271508ca2 Bug fix for spread out in x (EO) direction.
This is really annoying -- it is very hard to thread the loops with the index
recursion on buffer offset in the red-black case. Must think of a good threading
solution here.
2015-11-04 09:57:57 +00:00
Peter Boyle
ec5af35166 EO bug fix when spread out in x-direction 2015-11-04 09:56:58 +00:00
Peter Boyle
0f59356e86 Problem in comms fixed 2015-11-02 00:00:15 +00:00
Peter Boyle
8889af45ca FMA4 added 2015-10-09 01:00:53 +02:00
Peter Boyle
83afb2e26a Poly support for lanczos 2015-10-09 00:43:21 +02:00
Peter Boyle
6d06bd9493 Minor change in commented out code 2015-10-09 00:42:21 +02:00
Peter Boyle
6ee23f409e Lanczos addition 2015-10-09 00:41:00 +02:00
Peter Boyle
2d95dac6b6 Lanczos untested/partially tested additions. In middle of shake out but at least compiles 2015-10-09 00:40:25 +02:00
Peter Boyle
814c79f38d SIMD improvements for mac and madd use in complex for avx, sse 2015-10-09 00:38:52 +02:00
paboyle
1878bf97d0 Babbage fix 2015-09-30 16:04:01 -07:00
paboyle
a660ce716b No compile babbage fix 2015-09-30 16:02:44 -07:00
paboyle
f4b6d1dfea NGO stores reenabled 2015-09-30 16:02:14 -07:00
paboyle
23813ac798 No compile on babbage fix 2015-09-30 16:01:28 -07:00
Peter Boyle
9f4f65cb46 Added a decoupled memory system benchmark to remove thread synch overhead 2015-09-26 18:23:57 -07:00
Peter Boyle
64d64d1ab6 Updating to modify non-inlining permute routines and hopefully get better reg use and
enhance performance.
2015-09-25 08:55:04 -07:00
Peter Boyle
5ef42add2d Changes to remove warnings under icc; disambiguate AVX512 from IMCI correctly
and drop swizzles in AVX512. Don't know why these compiled.
2015-09-23 05:23:45 -07:00
Peter Boyle
2f38ebc446 Reintroducing the hand unrolled loops 2015-09-08 17:45:30 +01:00
Peter Boyle
638d6675ee Tested rms dH is ~ dt^4 numerically, so believe the ForceGradient is correct now.
Paranoia makes me want to diddle with the FG step to ensure dt^2 reappears.
2015-08-31 16:33:20 +01:00
Peter Boyle
357c6ab46d Reunitarise. Complete the HMC and integrator changes. 2015-08-31 16:32:04 +01:00
Peter Boyle
755dca9533 Added ForceGradient integrator. dH dropped so seems to work. Will only
believe it is right once I have pulled a dt^4 error scaling plot out.
2015-08-31 06:23:02 +01:00
Peter Boyle
29fd004d54 Unified integrator and integrator algorithm into virtual class used as a policy for the
HMC.
2015-08-30 13:39:19 +01:00
Peter Boyle
aa52fdadcc Global edit on HMC sector -- making GaugeField a template parameter and
preparing to pass integrator, smearing, bc's as policy classes to hmc.

Propose to unify "integrator" and integrator algorithm in a base/derived
way to override step. Want to read through ForceGradient to ensure
that abstraction covers the force gradient case.
2015-08-30 12:18:34 +01:00
Peter Boyle
76d752585b Started a tidy up in the HMC sector. Now comfortable with the two level integrators;
took a little while to figure out what Guido had done & why -- but there is a neat saving of force
evaluations across the nesting time boundary making use of linearity of the leapP in dt.

I cleaned up the printing, reduced the volume of code, in the process sharing printing
between all integrators. Placed an assert that the total integration time for all integrators
must match at end of trajectory.

Have now verified e^{-dH} = 1 for nested integrators in Wilson/Wilson runs with both
Omelyan and with Leapfrog so substantial confidence gained.
2015-08-29 17:18:43 +01:00
Peter Boyle
dc814f30da Binary IO file for generic Grid array parallel I/O.
Number of IO MPI tasks can be varied by selecting which
dimensions use parallel IO and which dimensions use Serial send to boss
I/O.

Thus can neck down from, say 1024 nodes = 4x4x8x8 to {1,8,32,64,128,256,1024} nodes
doing the I/O.

Interpolates nicely between ALL nodes writing their data, a single boss per time-plane
in processor space [old UKQCD fortran code did this], and a single node doing all I/O.

Not sure I have the transfer sizes big enough and am not overly convinced fstream
is guaranteed to not give buffer inconsistencies unless I set streambuf size to zero.

Practically it has worked on 8 tasks, 2x1x2x2 writing/cloning NERSC configurations
on my MacOS + OpenMPI and Clang environment.

It is VERY easy to switch to pwrite at a later date, and also easy to send x-strips around from
each node in order to gather bigger chunks at the syscall level.

That would push us up to the circa 8x 18*4*8 == 4KB size write chunk, and by taking, say, x/y non
parallel we get to 16MB contiguous chunks written in multi 4KB transactions
per IOnode in 64^3 lattices for configuration I/O.

I suspect this is fine for system performance.
2015-08-26 13:40:29 +01:00
Peter Boyle
612957f057 pull in original license. 2015-08-21 10:19:08 +01:00
Peter Boyle
cea8ac9a22 Credits to orig source where I found the macro tricks. 2015-08-21 10:14:53 +01:00
Peter Boyle
476da3ee62 Separated IO reader/writers into a proper abstract base,
derived relationship. Have Text/Binary/Xml versions of
Reader & Writer.

Any new Reader/Writer class inheriting the interface can give object serialisation
to any desired format now.

      new file:   lib/serialisation/BaseIO.h
      modified:   lib/serialisation/BinaryIO.h
      modified:   lib/serialisation/Serialisation.h
      modified:   lib/serialisation/TextIO.h
      modified:   lib/serialisation/XmlIO.h

The test uses the Xml, Binary and Text formats as well as cout << Object.
2015-08-21 10:06:33 +01:00