Peter Boyle
c05b2199f6
Improvements to huge memory
2017-09-04 10:41:21 -04:00
paboyle
7359df3501
Full reporting for benchmark; save robustness factor
2017-08-31 10:42:35 +01:00
Peter Boyle
c3b1263e75
Benchmark prep
2017-08-25 09:25:54 +01:00
paboyle
b49bec0cec
MAP_HUGETLB portability fix
2017-08-20 03:08:54 +01:00
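A minimal sketch, assuming the portability issue is that MAP_HUGETLB exists only on Linux, of how such an mmap call is typically guarded (the helper name is illustrative, not the actual Grid code):

#include <sys/mman.h>
#include <cstdio>
#include <cstdlib>

// Hypothetical helper: request huge pages where the flag exists, fall back otherwise.
void *allocate_comms_buffer(size_t bytes) {
  int flags = MAP_PRIVATE | MAP_ANONYMOUS;
#ifdef MAP_HUGETLB
  flags |= MAP_HUGETLB;                       // Linux-only flag
#endif
  void *ptr = mmap(NULL, bytes, PROT_READ | PROT_WRITE, flags, -1, 0);
#ifdef MAP_HUGETLB
  if (ptr == MAP_FAILED) {                    // huge pages may not be reserved; retry with normal pages
    flags &= ~MAP_HUGETLB;
    ptr = mmap(NULL, bytes, PROT_READ | PROT_WRITE, flags, -1, 0);
  }
#endif
  if (ptr == MAP_FAILED) { perror("mmap"); exit(EXIT_FAILURE); }
  return ptr;
}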
paboyle
1cdf999668
Moving the multi-communicator support into mpi3, also for threading
2017-08-20 02:39:10 +01:00
paboyle
11062fb686
Fix failure with the comms "none" target
2017-08-20 01:37:07 +01:00
paboyle
a446d95c33
Trying to pass TeamCity and Travis
2017-08-20 01:10:50 +01:00
Peter Boyle
0b0cf62193
Fix MPI3 interface change
2017-08-19 13:18:50 -04:00
Peter Boyle
7d88198387
Merge branch 'develop' into feature/multi-communicator
2017-08-19 13:03:35 -04:00
Peter Boyle
2f619482b8
Enable blocking stencil send
2017-08-19 12:53:59 -04:00
Peter Boyle
d6472eda8d
Use mmap
2017-08-19 12:53:18 -04:00
Peter Boyle
14d53e1c9e
Patches for threaded MPI calls
2017-07-29 13:08:10 -04:00
azusayamaguchi
dc6f078246
Fixed the header file for mpi3
2017-07-11 14:15:08 +01:00
Peter Boyle
40e119c61c
NUMA improvements worth preserving from AMD EPYC tests
2017-07-08 22:27:11 -04:00
paboyle
57002924bc
NERSC shakeout of these changes
2017-07-02 14:58:30 -07:00
paboyle
54e94360ad
Experimental: Multiple communicators to see if we can avoid thread locks in --enable-comms=mpit
2017-06-24 23:10:24 +01:00
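A sketch of the idea behind the mpit experiment described above, not the Grid implementation itself: give each thread its own duplicate of the world communicator so concurrent MPI calls do not serialise on one communicator's lock.

#include <mpi.h>
#include <omp.h>
#include <vector>

int main(int argc, char **argv) {
  int provided;
  MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

  int nthread = omp_get_max_threads();
  std::vector<MPI_Comm> comms(nthread);
  for (int t = 0; t < nthread; t++)
    MPI_Comm_dup(MPI_COMM_WORLD, &comms[t]);   // one communicator per thread

  // ... each OpenMP thread would post MPI_Isend/MPI_Irecv on comms[omp_get_thread_num()] ...

  for (int t = 0; t < nthread; t++) MPI_Comm_free(&comms[t]);
  MPI_Finalize();
  return 0;
}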
paboyle
869b99ec1e
Threaded calls to multiple communicators
2017-06-24 10:55:54 +01:00
paboyle
1feddf4ba6
const fixes
2017-06-22 19:32:41 +01:00
paboyle
e504260f3d
Able to run a test job splitting into multiple MPI subdomains.
2017-06-22 18:53:11 +01:00
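A hedged sketch of splitting a job into MPI subdomains with MPI_Comm_split; the grouping rule (contiguous blocks of ranks) and the block size are purely illustrative:

#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  const int ranks_per_domain = 4;            // illustrative choice
  int color = rank / ranks_per_domain;       // which subdomain this rank joins

  MPI_Comm subdomain;
  MPI_Comm_split(MPI_COMM_WORLD, color, rank, &subdomain);

  int sub_rank, sub_size;
  MPI_Comm_rank(subdomain, &sub_rank);
  MPI_Comm_size(subdomain, &sub_size);
  printf("world rank %d/%d -> subdomain %d, rank %d/%d\n",
         rank, size, color, sub_rank, sub_size);

  MPI_Comm_free(&subdomain);
  MPI_Finalize();
  return 0;
}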
paboyle
5e4bea8f20
Benchmark DWF works
2017-06-22 08:38:54 +01:00
paboyle
6ebf9f15b7
First cut at splitting communicators
2017-06-22 08:14:34 +01:00
paboyle
3bfd1f13e6
I/O improvements
2017-06-11 23:14:10 +01:00
paboyle
e30fa9f4b8
RankCount; need to clean up the ambiguous ProcessCount
2017-05-30 23:39:16 +01:00
8ef4300412
spurious .dirstamp files removed
2017-04-10 17:00:22 +01:00
paboyle
5592f7b8c1
Better implementation of creation mode
2017-04-05 02:35:34 +09:00
paboyle
35da4ece0b
UID fix
2017-04-05 02:18:15 +09:00
paboyle
417ec56cca
Release candidate
2017-03-29 05:45:33 -04:00
paboyle
35695ba57a
Bug fix in MPI3
2017-03-29 04:43:55 -04:00
paboyle
4b17e8eba8
Merge branch 'develop' into feature/bgq-asm
Conflicts:
lib/qcd/action/fermion/Fermion.h
lib/qcd/action/fermion/WilsonFermion.cc
lib/util/Init.cc
tests/Test_cayley_even_odd_vec.cc
2017-03-28 04:49:30 -04:00
paboyle
18bde08d1b
Merge branch 'feature/staggering' into develop
2017-03-28 15:25:55 +09:00
paboyle
fc93f0b2ec
Save some code for static huge TLBs. It is ifdef'ed out but an interesting root-only experiment.
No gain from it.
2017-03-21 22:30:29 -04:00
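The commit does not show the ifdef'ed-out code; one common form of a static huge page experiment is SysV shared memory with SHM_HUGETLB, which needs pages reserved in advance (e.g. root writing /proc/sys/vm/nr_hugepages), hence "root only". An assumed sketch:

#include <sys/ipc.h>
#include <sys/shm.h>
#include <cstdio>

// Hypothetical helper, Linux-only: allocate from the statically reserved huge page pool.
void *huge_static_alloc(size_t bytes) {
  int id = shmget(IPC_PRIVATE, bytes, IPC_CREAT | SHM_HUGETLB | 0600);
  if (id < 0) { perror("shmget(SHM_HUGETLB)"); return nullptr; }
  void *ptr = shmat(id, nullptr, 0);
  shmctl(id, IPC_RMID, nullptr);               // segment is freed once detached
  return (ptr == (void *)-1) ? nullptr : ptr;
}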
Christopher Kelly
06a132e3f9
Fixes to SHMEM comms
2017-02-28 13:31:54 -08:00
paboyle
4e7ab3166f
Refactoring header layout
2017-02-22 18:09:33 +00:00
paboyle
3ae92fa2e6
Global changes to parallel_for structure.
Move the comms flags to more sensible names
2017-02-21 05:24:27 -05:00
paboyle
37720c4db7
Count only off-node bytes
2017-02-20 17:47:40 -05:00
paboyle
5c0adf7bf2
Make clang happy with parentheses
2017-02-16 23:51:33 +00:00
paboyle
bd600702cf
Vectorise the XYZT face gathering better.
Hard-coded for simd_layout <= 2 in any given spread-out direction; full generality is inconsistent with efficiency.
2017-02-15 11:11:04 +00:00
paboyle
a48ee6f0f2
Don't use MPI3_leader any more; no real gain and it adds complexity
2017-02-07 01:31:24 -05:00
paboyle
73547cca66
MPI3 working, I think
2017-02-07 01:30:02 -05:00
paboyle
123c673db7
Policy to control async or sync SendRecv
2017-02-07 01:24:54 -05:00
paboyle
61f82216e2
Communicator policy; NodeCount distinct from RankCount
2017-02-07 01:22:53 -05:00
fad743fbb1
Build system sanity check: corrected several headers not in the <Grid/*> format
2017-01-26 17:00:41 -08:00
Azusa Yamaguchi
668ca57702
Merge branch 'develop' of https://github.com/paboyle/Grid into feature/staggering
2016-11-22 13:49:11 +00:00
azusayamaguchi
f85b35314d
Fix a routine for the single-node processor coordinate from rank
2016-11-08 11:49:13 +00:00
azusayamaguchi
6e548a8ad5
Changes needed for the Linux compile
2016-11-04 11:34:16 +00:00
Azusa Yamaguchi
ee686a7d85
Compiles now
2016-11-03 16:58:23 +00:00
paboyle
f41a230b32
Decrease mpi3l verbosity
2016-11-02 19:54:03 +00:00
paboyle
757a928f9a
Improvement: use our own SHM_OPEN call to avoid an OpenMPI bug.
2016-11-02 12:37:46 +00:00
paboyle
32375aca65
Semaphore sleep/wake-up on remote processes.
2016-11-02 09:27:20 +00:00
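A hedged sketch of process-to-process sleep/wake with a POSIX named semaphore; the names and layout are illustrative, not the actual Grid mechanism:

#include <semaphore.h>
#include <fcntl.h>     // O_CREAT
#include <cstdio>      // snprintf

// Each rank on a node opens the same named semaphore (hypothetical naming scheme).
sem_t *open_wakeup_sem(int node_local_rank) {
  char name[64];
  snprintf(name, sizeof(name), "/grid_wakeup_%d", node_local_rank);
  return sem_open(name, O_CREAT, 0600, 0);   // initial value 0: the waiter sleeps
}

// Receiver side: sleep until a peer signals that the shared buffer is ready.
void wait_for_peer(sem_t *sem) { sem_wait(sem); }

// Sender side: wake the sleeping peer after filling the shared buffer.
void wake_peer(sem_t *sem)     { sem_post(sem); }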
paboyle
bb94ddd0eb
Tidy up of mpi3; also some cleaning of the dslash controls.
2016-11-02 08:07:09 +00:00
paboyle
791cb050c8
Comms improvements
2016-11-01 11:35:43 +00:00
paboyle
09f66100d3
MPI3 compile on non-Linux
2016-10-25 06:01:12 +01:00
azusayamaguchi
d7d92af09d
Travis fail fix attempt
2016-10-25 01:45:53 +01:00
azusayamaguchi
d97a27f483
Verbose
2016-10-25 01:05:31 +01:00
azusayamaguchi
7c3363b91e
Compiles all comms targets
2016-10-25 00:04:17 +01:00
azusayamaguchi
b94478fa51
mpi, mpi3, shmem all compile.
mpi and mpi3 pass single-node multi-rank tests.
2016-10-24 23:45:31 +01:00
azusayamaguchi
b6a65059a2
Update to use shared memory to contain the stencil comms buffers
Tested on 2.1.1.1, 1.2.1.1, 4.1.1.1, 1.4.1.1 and 2.2.1.1 subnode decompositions
2016-10-24 17:30:43 +01:00
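A hedged sketch (not the Grid code itself) of the MPI-3 shared memory window pattern that lets ranks on one node see each other's stencil comms buffers directly:

#include <mpi.h>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);

  // Split off the ranks that share a node.
  MPI_Comm node_comm;
  MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                      MPI_INFO_NULL, &node_comm);

  // Each node-local rank contributes a slab to one shared window.
  MPI_Aint bytes = 1 << 20;          // illustrative buffer size
  void *my_buf;
  MPI_Win win;
  MPI_Win_allocate_shared(bytes, 1, MPI_INFO_NULL, node_comm, &my_buf, &win);

  // A neighbour's slab can then be addressed directly instead of message-passed.
  int peer = 0;
  MPI_Aint peer_bytes; int disp; void *peer_buf;
  MPI_Win_shared_query(win, peer, &peer_bytes, &disp, &peer_buf);

  MPI_Win_free(&win);
  MPI_Comm_free(&node_comm);
  MPI_Finalize();
  return 0;
}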
azusayamaguchi
c190221fd3
Internal SHM comms in non-simd directions working
Need to fix simd directions
2016-10-22 18:14:27 +01:00
azusayamaguchi
910b8dd6a1
Use SIMD type
2016-10-21 22:35:29 +01:00
azusayamaguchi
09fd5c43a7
Reasonably fast version
2016-10-21 15:17:39 +01:00
azusayamaguchi
fad96cf250
StencilBufs
2016-10-21 13:36:00 +01:00
azusayamaguchi
f331809c27
Use variable type for loop
2016-10-21 13:35:37 +01:00
paboyle
306160ad9a
bcopy threaded
2016-10-21 12:07:28 +01:00
paboyle
a762b1fb71
MPI3 working with a bounce through shared memory on my laptop.
Longer-term plan: make the "u_comm_buf" in Stencil point to the shared region and avoid the
send between ranks on the same node.
2016-10-21 09:03:26 +01:00
paboyle
b58adc6a4b
commVector
2016-10-20 17:00:15 +01:00
paboyle
5fe2b85cbd
MPI3 and shared memory support
2016-10-20 16:58:01 +01:00
paboyle
32bc7a6ab8
MPI: back out of a change that hangs
AVX2 for clang; gcc needs the -mfma flag.
2016-08-05 10:36:00 +01:00
paboyle
62601bb649
Bug fix
2016-07-08 20:46:29 +01:00
paboyle
ef97e32152
Adding persistent communicators
2016-07-08 17:16:08 +01:00
paboyle
680645f849
Merge branch 'release/v0.5.0'
2016-06-30 15:15:03 -07:00
Guido Cossu
5e02392f9c
Fixed compilation error for benchmark_dwf
Some parts were assuming a particular floating point precision
2016-06-20 12:30:51 +01:00
Richard Rollins
86187d7cca
Removed write to stdout in constructor for MPI CartesianCommunicator
2016-06-14 15:34:20 +01:00
paboyle
d6b64f47d9
Uint64 sum for IO rates
2016-03-16 02:27:22 -07:00
paboyle
a359f7a9f5
Merge branch 'master' of https://github.com/paboyle/Grid
2016-03-11 16:07:07 -08:00
paboyle
b606deb3f0
Uint64 gsum
2016-03-11 16:06:54 -08:00
paboyle
090e7aa930
Merge remote-tracking branch 'origin/chulwoo-dec12-2015'
Merge Chulwoo's Lanczos related improvements.
Merge Nd!=4 fixes for pure gauge HMC from Evan.
2016-03-08 09:55:14 +00:00
paboyle
e55c35734b
Fix a compile failure
2016-03-03 20:33:28 +00:00
Peter Boyle
6aeaf6f568
Parallel IO worked on. I'm puzzled because I thought I had already shaken this out on MacOS + OpenMPI and then
turned up problems on the BlueWaters Cray.
Gets 75MB/s from the home filesystem on a parallel configuration read. Need to make the RNG IO parallel,
and also to look at aggregating bigger writes for the parallel write.
Not sure what the home filesystem is.
2016-02-21 08:03:21 -06:00
paboyle
a3fbabf404
Bug fix
2016-02-18 18:08:24 +00:00
Peter Boyle
41c2b09184
Shmem comms [NO MPI] target added. The dwf test runs and passes.
Not really shaken out to my satisfaction though, as I want more tests done, so don't declare it as working.
But committing my current state while I try a few experimental changes.
2016-02-14 14:24:38 -06:00
paboyle
294dbf1bf0
Compile on OpenMPI shmem
2016-02-11 23:45:51 +00:00
Peter Boyle
7f927a541c
Shmem related fixes for shmem compile
2016-02-11 07:37:39 -06:00
paboyle
e2f73e3ead
Updates for shmem
2016-02-10 16:50:32 -08:00
Jung
5c57d4f403
Merge branch 'master' of https://github.com/paboyle/Grid into scidac1_2
Conflicts:
lib/qcd/action/fermion/WilsonKernels.h
2016-01-11 11:36:45 -05:00
Jung
5924e5a562
Merge branch 'master' of https://github.com/paboyle/Grid into scidac1_2
Conflicts:
configure
lib/qcd/action/Actions.h
lib/qcd/action/fermion/WilsonKernels.h
2016-01-06 03:44:57 -05:00
paboyle
aae8bf31a7
Global edit adding copyright and license info to every source file.
2016-01-02 14:51:32 +00:00
Peter Boyle
dc814f30da
Binary IO file for generic Grid array parallel I/O.
Number of IO MPI tasks can be varied by selecting which
dimensions use parallel IO and which dimensions use serial send to boss
I/O.
Thus can neck down from, say, 1024 nodes = 4x4x8x8 to {1,8,32,64,128,256,1024} nodes
doing the I/O.
Interpolates nicely between all nodes writing their data, a single boss per time-plane
in processor space [the old UKQCD fortran code did this], and a single node doing all the I/O.
Not sure I have the transfer sizes big enough, and am not overly convinced fstream
is guaranteed not to give buffer inconsistencies unless I set the streambuf size to zero.
Practically it has worked on 8 tasks, 2x1x2x2, writing/cloning NERSC configurations
on my MacOS + OpenMPI and Clang environment.
It is VERY easy to switch to pwrite at a later date, and also easy to send x-strips around from
each node in order to gather bigger chunks at the syscall level.
That would push us up to a circa 8 x 18*4*8 == 4KB write chunk, and by taking, say, x/y non-parallel
we get to 16MB contiguous chunks written in multiple 4KB transactions
per IO node on 64^3 lattices for configuration I/O.
I suspect this is fine for system performance.
2015-08-26 13:40:29 +01:00
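A worked sketch of the "neck down" count described above: the number of ranks doing IO is the product of the processor-grid extents in the dimensions chosen as parallel (the 4x4x8x8 layout is the example from the message; the choice of parallel dimensions is illustrative):

#include <cstdio>

int main() {
  const int procs[4]     = {4, 4, 8, 8};                  // 1024 ranks total
  const bool parallel[4] = {false, false, true, true};    // illustrative choice

  int io_ranks = 1;
  for (int d = 0; d < 4; d++)
    if (parallel[d]) io_ranks *= procs[d];                // serial dims funnel to a boss rank

  printf("%d of 1024 ranks perform IO\n", io_ranks);      // 64 for this choice
  return 0;
}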
neo
48bf4878c1
Experimental support for ARM
2015-06-09 15:46:21 +09:00
Azusa Yamaguchi
58a4f32298
Merge to the head
2015-06-05 10:15:31 +01:00
Peter Boyle
1d0df449e8
Reorganisation of file naming
2015-06-03 12:47:05 +01:00
Peter Boyle
3845f267cb
Domain wall fermions now invert; have the basis set up for
Tanh/Zolo * (Cayley/PartFrac/ContFrac) * (Mobius/Shamir/Wilson)
Approx Representation Kernel.
All are done with space-time taking part in checkerboarding, Ls uncheckerboarded.
Have so far only tested the Domain Wall limit of Mobius, and at that only checked
that it
i) Inverts
ii) 5dim DW == Ls copies of 4dim D2
iii) MeeInv Mee == 1
iv) Meo+Mee+Moe+Moo == M unprec.
v) MpcDagMpc is hermitian
vi) Mdag is the adjoint of M between stochastic vectors.
That said, the RB Schur solve, RB MpcDagMpc solve and unprec solve
all converge and the true residual becomes small; so these are pretty good tests.
2015-06-02 16:57:12 +01:00
neo
74e91cd925
Partial implementation of the SIMD vector types
Implementing SSE4 now.
A systematic series of tests must be written.
2015-05-19 17:21:17 +09:00
neo
baa382f055
Added check of mpfr and gmp at configure time
It automatically generates the linker flags, or complains if they are not found.
2015-05-19 13:54:55 +09:00
neo
b4cd37276b
Corrected some compilation errors (zolotarev.h) and fixed SSE4 vsplat and conj to make the cshift test pass.
2015-05-18 16:48:14 +09:00
Peter Boyle
b1d2c60d07
Moving some things around to make things prettier
2015-05-11 19:09:49 +01:00
paboyle
379943abf5
Command line args and a general clean up
2015-05-11 12:43:10 +01:00
Peter Boyle
29be76f958
Fixing the Comms compile breakage
2015-05-10 15:23:09 +01:00
Peter Boyle
193860dbc8
Comms and memory benchmarks added
2015-05-03 09:44:47 +01:00
Peter Boyle
f663be2a6c
Added a comms benchmark
2015-05-02 23:42:30 +01:00
Peter Boyle
9ec3529864
Improved the gamma quite a bit.
Serial RNGs which are set on node zero and broadcast
2015-04-24 20:21:40 +01:00