/*************************************************************************************

    Grid physics library, www.github.com/paboyle/Grid

    Source file: ./lib/parallelIO/BinaryIO.h

    Copyright (C) 2015

Author: Peter Boyle <paboyle@ph.ed.ac.uk>
Author: Guido Cossu <guido.cossu@ed.ac.uk>

    This program is free software; you can redistribute it and/or modify
    it under the terms of the GNU General Public License as published by
    the Free Software Foundation; either version 2 of the License, or
    (at your option) any later version.

    This program is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    GNU General Public License for more details.

    You should have received a copy of the GNU General Public License along
    with this program; if not, write to the Free Software Foundation, Inc.,
    51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.

    See the full license in the file "LICENSE" in the top level distribution directory
*************************************************************************************/
/*  END LEGAL */
/*
  Binary IO file for generic Grid array parallel I/O.

  The number of IO MPI tasks can be varied by selecting which dimensions use
  parallel IO and which dimensions use serial send to the boss I/O node.
  We can thus neck down from, say, 1024 nodes = 4x4x8x8 to {1,8,32,64,128,256,1024}
  nodes doing the I/O.

  This interpolates nicely between all nodes writing their own data, a single boss
  per time-plane in processor space [the old UKQCD Fortran code did this], and a
  single node doing all the I/O.

  I am not sure the transfer sizes are big enough, and am not overly convinced that
  fstream is guaranteed to avoid buffer inconsistencies unless I set the streambuf
  size to zero.  In practice it has worked on 8 tasks, 2x1x2x2, writing/cloning
  NERSC configurations in my MacOS + OpenMPI + Clang environment.

  It is VERY easy to switch to pwrite at a later date, and also easy to send
  x-strips around from each node in order to gather bigger chunks at the syscall
  level.  That would push us up to circa 8x 18*4*8 == 4KB write chunks, and by
  taking, say, x/y non-parallel we get to 16MB contiguous chunks written in
  multiple 4KB transactions per IO node for configuration I/O on 64^3 lattices.
  I suspect this is fine for system performance.
*/
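/*
  An assumed worked example of the necking-down described above: with 1024 ranks
  decomposed as 4x4x8x8, serialising the two size-4 axes leaves the 8x8 = 64 ranks
  of the remaining processor plane performing I/O.
*/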
#ifndef GRID_BINARY_IO_H
#define GRID_BINARY_IO_H

#if defined(GRID_COMMS_MPI) || defined(GRID_COMMS_MPI3) || defined(GRID_COMMS_MPIT)
#define USE_MPI_IO
#else
#undef  USE_MPI_IO
#endif

#ifdef HAVE_ENDIAN_H
#include <endian.h>
#endif

#include <arpa/inet.h>
#include <algorithm>

namespace Grid {
  /////////////////////////////////////////////////////////////////////////////////
  // Byte reversal garbage
  /////////////////////////////////////////////////////////////////////////////////
  inline uint32_t byte_reverse32(uint32_t f) {
    f = ((f&0xFF)<<24) | ((f&0xFF00)<<8) | ((f&0xFF0000)>>8) | ((f&0xFF000000UL)>>24) ;
    return f;
  }
  inline uint64_t byte_reverse64(uint64_t f) {
    uint64_t g;
    g = ((f&0xFF)<<24) | ((f&0xFF00)<<8) | ((f&0xFF0000)>>8) | ((f&0xFF000000UL)>>24) ;
    g = g << 32;
    f = f >> 32;
    g|= ((f&0xFF)<<24) | ((f&0xFF00)<<8) | ((f&0xFF0000)>>8) | ((f&0xFF000000UL)>>24) ;
    return g;
  }
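  // For example, byte_reverse32(0x11223344) == 0x44332211; byte_reverse64 swaps
  // the bytes within each 32-bit half and then exchanges the halves, reversing
  // all 8 bytes of the word.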
#if BYTE_ORDER == BIG_ENDIAN
  inline uint64_t Grid_ntohll(uint64_t A) { return A; }
#else
  inline uint64_t Grid_ntohll(uint64_t A) {
    return byte_reverse64(A);
  }
#endif
  // A little helper
  inline void removeWhitespace(std::string &key)
  {
    key.erase(std::remove_if(key.begin(), key.end(), ::isspace),key.end());
  }
  ///////////////////////////////////////////////////////////////////////////////////////////////////
  // Static class holding the parallel IO code
  // Could just use a namespace
  ///////////////////////////////////////////////////////////////////////////////////////////////////
  class BinaryIO {
   public:

    /////////////////////////////////////////////////////////////////////////////
    // more byte manipulation helpers
    /////////////////////////////////////////////////////////////////////////////

    template<class vobj> static inline void Uint32Checksum(Lattice<vobj> &lat,uint32_t &nersc_csum)
    {
      typedef typename vobj::scalar_object sobj;

      GridBase *grid = lat._grid;
      uint64_t lsites = grid->lSites();

      std::vector<sobj> scalardata(lsites);
      unvectorizeToLexOrdArray(scalardata,lat);

      NerscChecksum(grid,scalardata,nersc_csum);
    }
    template <class fobj>
    static inline void NerscChecksum(GridBase *grid, std::vector<fobj> &fbuf, uint32_t &nersc_csum)
    {
      const uint64_t size32 = sizeof(fobj) / sizeof(uint32_t);

      uint64_t lsites = grid->lSites();
      if (fbuf.size() == 1) {
        lsites = 1;
      }

#pragma omp parallel
      {
        uint32_t nersc_csum_thr = 0;

#pragma omp for
        for (uint64_t local_site = 0; local_site < lsites; local_site++)
        {
          uint32_t *site_buf = (uint32_t *)&fbuf[local_site];
          for (uint64_t j = 0; j < size32; j++)
          {
            nersc_csum_thr = nersc_csum_thr + site_buf[j];
          }
        }

        // fold the per-thread partial sums into the global checksum
#pragma omp critical
        {
          nersc_csum += nersc_csum_thr;
        }
      }
    }
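    // A minimal reference sketch (hypothetical helper, not used elsewhere in Grid):
    // the NERSC checksum of a raw buffer is just the wraparound 32-bit sum of its
    // uint32_t words, so the OpenMP reduction above must reproduce this serial loop.
    static inline uint32_t NerscChecksumReference(const void *buf, uint64_t bytes)
    {
      const uint32_t *words = (const uint32_t *)buf;
      const uint64_t  count = bytes / sizeof(uint32_t);
      uint32_t csum = 0;
      for (uint64_t i = 0; i < count; i++) csum += words[i]; // unsigned overflow wraps
      return csum;
    }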
    template<class fobj> static inline void ScidacChecksum(GridBase *grid,std::vector<fobj> &fbuf,uint32_t &scidac_csuma,uint32_t &scidac_csumb)
    {
      const uint64_t size32 = sizeof(fobj)/sizeof(uint32_t);

      int nd = grid->_ndimension;

      uint64_t lsites = grid->lSites();
      if (fbuf.size()==1) {
        lsites=1;
      }
      std::vector<int> local_vol   = grid->LocalDimensions();
      std::vector<int> local_start = grid->LocalStarts();
      std::vector<int> global_vol  = grid->FullDimensions();

#pragma omp parallel
      {
        std::vector<int> coor(nd);
        uint32_t scidac_csuma_thr=0;
        uint32_t scidac_csumb_thr=0;
        uint32_t site_crc=0;

#pragma omp for
        for(uint64_t local_site=0;local_site<lsites;local_site++){

          uint32_t * site_buf = (uint32_t *)&fbuf[local_site];

          /*
           * Scidac csum is rather more heavyweight
           * FIXME -- 128^3 x 256 x 16 will overflow.
           */
          int global_site;

          Lexicographic::CoorFromIndex(coor,local_site,local_vol);

          for(int d=0;d<nd;d++) {
            coor[d] = coor[d]+local_start[d];
          }

          Lexicographic::IndexFromCoor(coor,global_site,global_vol);

          uint32_t gsite29 = global_site%29;
          uint32_t gsite31 = global_site%31;

          site_crc = crc32(0,(unsigned char *)site_buf,sizeof(fobj));
          //  std::cout << "Site "<<local_site << " crc "<<std::hex<<site_crc<<std::dec<<std::endl;
          //  std::cout << "Site "<<local_site << std::hex<<site_buf[0] <<site_buf[1]<<std::dec <<std::endl;
          scidac_csuma_thr ^= site_crc<<gsite29 | site_crc>>(32-gsite29);
          scidac_csumb_thr ^= site_crc<<gsite31 | site_crc>>(32-gsite31);
        }

#pragma omp critical
        {
          scidac_csuma ^= scidac_csuma_thr;
          scidac_csumb ^= scidac_csumb_thr;
        }
      }
    }
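    // Note on the scheme above (as implemented, not a normative spec): each site
    // contributes the CRC32 of its file-format bytes, rotated left by
    // (global_site mod 29) bits into scidac_csuma and by (global_site mod 31) bits
    // into scidac_csumb, all XOR-combined.  Unlike the plain NERSC sum, the
    // site-dependent rotations make the result sensitive to site ordering, and the
    // two different moduli make a simultaneous collision in both checksums unlikely.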
    // Network is big endian.  The byte swaps are involutions, so host-to-file and
    // file-to-host are the same operation and the hto*_v versions simply forward.
    static inline void htobe32_v(void *file_object,uint32_t bytes){ be32toh_v(file_object,bytes);}
    static inline void htobe64_v(void *file_object,uint32_t bytes){ be64toh_v(file_object,bytes);}
    static inline void htole32_v(void *file_object,uint32_t bytes){ le32toh_v(file_object,bytes);}
    static inline void htole64_v(void *file_object,uint32_t bytes){ le64toh_v(file_object,bytes);}
    static inline void be32toh_v(void *file_object,uint64_t bytes)
    {
      uint32_t * f = (uint32_t *)file_object;
      uint64_t count = bytes/sizeof(uint32_t);
      parallel_for(uint64_t i=0;i<count;i++){
        f[i] = ntohl(f[i]);
      }
    }
    // LE must swap and switch to host
    static inline void le32toh_v(void *file_object,uint64_t bytes)
    {
      uint32_t *fp = (uint32_t *)file_object;
      uint32_t f;

      uint64_t count = bytes/sizeof(uint32_t);
      parallel_for(uint64_t i=0;i<count;i++){
        f = fp[i];
        // byte swap to network order, then network-to-host: a no-op on LE hosts
        f = ((f&0xFF)<<24) | ((f&0xFF00)<<8) | ((f&0xFF0000)>>8) | ((f&0xFF000000UL)>>24) ;
        fp[i] = ntohl(f);
      }
    }
    // BE is same as network
    static inline void be64toh_v(void *file_object,uint64_t bytes)
    {
      uint64_t * f = (uint64_t *)file_object;
      uint64_t count = bytes/sizeof(uint64_t);
      parallel_for(uint64_t i=0;i<count;i++){
        f[i] = Grid_ntohll(f[i]);
      }
    }
    // LE must swap and switch to host
    static inline void le64toh_v(void *file_object,uint64_t bytes)
    {
      uint64_t *fp = (uint64_t *)file_object;
      uint64_t f,g;

      uint64_t count = bytes/sizeof(uint64_t);
      parallel_for(uint64_t i=0;i<count;i++){
        f = fp[i];
        // byte swap each half and exchange the halves to reach network order,
        // then network-to-host: a no-op on LE hosts
        g = ((f&0xFF)<<24) | ((f&0xFF00)<<8) | ((f&0xFF0000)>>8) | ((f&0xFF000000UL)>>24) ;
        g = g << 32;
        f = f >> 32;
        g|= ((f&0xFF)<<24) | ((f&0xFF00)<<8) | ((f&0xFF0000)>>8) | ((f&0xFF000000UL)>>24) ;
        fp[i] = Grid_ntohll(g);
      }
    }
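    // Example: a record tagged "IEEE64BIG" read on a little-endian host passes
    // through be64toh_v, i.e. Grid_ntohll on each word (byte_reverse64 on LE hosts,
    // the identity on BE hosts); "IEEE64" (little-endian file) uses le64toh_v instead.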
    /////////////////////////////////////////////////////////////////////////////
    // Real action:
    // Read or Write distributed lexico array of ANY object to a specific location in file
    //////////////////////////////////////////////////////////////////////////////////////

    static const int BINARYIO_MASTER_APPEND = 0x10;
    static const int BINARYIO_UNORDERED     = 0x08;
    static const int BINARYIO_LEXICOGRAPHIC = 0x04;
    static const int BINARYIO_READ          = 0x02;
    static const int BINARYIO_WRITE         = 0x01;
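    // These are bit flags OR'd into the 'control' word of IOobject below, e.g.
    // BINARYIO_READ|BINARYIO_LEXICOGRAPHIC for a parallel lexicographic read, or
    // BINARYIO_WRITE|BINARYIO_MASTER_APPEND for a single boss record appended at
    // the end of the file (iodata.size()==1 is asserted in that case).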
    template<class word,class fobj>
    static inline void IOobject(word w,
                                GridBase *grid,
                                std::vector<fobj> &iodata,
                                std::string file,
                                uint64_t offset,
                                const std::string &format, int control,
                                uint32_t &nersc_csum,
                                uint32_t &scidac_csuma,
                                uint32_t &scidac_csumb)
    {
      grid->Barrier();
      GridStopWatch timer;
      GridStopWatch bstimer;

      nersc_csum=0;
      scidac_csuma=0;
      scidac_csumb=0;

      int ndim   = grid->Dimensions();
      int nrank  = grid->ProcessorCount();
      int myrank = grid->ThisRank();

      std::vector<int> psizes  = grid->ProcessorGrid();
      std::vector<int> pcoor   = grid->ThisProcessorCoor();
      std::vector<int> gLattice= grid->GlobalDimensions();
      std::vector<int> lLattice= grid->LocalDimensions();

      std::vector<int> lStart(ndim);
      std::vector<int> gStart(ndim);

      // Flatten the file
      uint64_t lsites = grid->lSites();
      if ( control & BINARYIO_MASTER_APPEND ) {
        assert(iodata.size()==1);
      } else {
        assert(lsites==iodata.size());
      }
      for(int d=0;d<ndim;d++){
        gStart[d] = lLattice[d]*pcoor[d];
        lStart[d] = 0;
      }
#ifdef USE_MPI_IO
      std::vector<int> distribs(ndim,MPI_DISTRIBUTE_BLOCK);
      std::vector<int> dargs   (ndim,MPI_DISTRIBUTE_DFLT_DARG);
      MPI_Datatype mpiObject;
      MPI_Datatype fileArray;
      MPI_Datatype localArray;
      MPI_Datatype mpiword;
      MPI_Offset disp = offset;
      MPI_File fh;
      MPI_Status status;
      int numword;

      if ( sizeof( word ) == sizeof(float ) ) {
        numword = sizeof(fobj)/sizeof(float);
        mpiword = MPI_FLOAT;
      } else {
        numword = sizeof(fobj)/sizeof(double);
        mpiword = MPI_DOUBLE;
      }

      //////////////////////////////////////////////////////////////////////////////
      // Sobj in MPI phrasing
      //////////////////////////////////////////////////////////////////////////////
      int ierr;
      ierr = MPI_Type_contiguous(numword,mpiword,&mpiObject);    assert(ierr==0);
      ierr = MPI_Type_commit(&mpiObject);

      //////////////////////////////////////////////////////////////////////////////
      // File global array data type
      //////////////////////////////////////////////////////////////////////////////
      ierr=MPI_Type_create_subarray(ndim,&gLattice[0],&lLattice[0],&gStart[0],MPI_ORDER_FORTRAN, mpiObject,&fileArray);    assert(ierr==0);
      ierr=MPI_Type_commit(&fileArray);    assert(ierr==0);

      //////////////////////////////////////////////////////////////////////////////
      // local lattice array
      //////////////////////////////////////////////////////////////////////////////
      ierr=MPI_Type_create_subarray(ndim,&lLattice[0],&lLattice[0],&lStart[0],MPI_ORDER_FORTRAN, mpiObject,&localArray);    assert(ierr==0);
      ierr=MPI_Type_commit(&localArray);    assert(ierr==0);
#endif
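      // (Summary of the MPI decomposition above: fileArray embeds this rank's
      // lLattice block at gStart inside the global gLattice as laid out in the
      // file, while localArray covers the whole contiguous local buffer; a
      // collective read_all/write_all against the set_view then moves each
      // rank's block in one shot.)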
      //////////////////////////////////////////////////////////////////////////////
      // Byte order
      //////////////////////////////////////////////////////////////////////////////
      int ieee32big = (format == std::string("IEEE32BIG"));
      int ieee32    = (format == std::string("IEEE32"));
      int ieee64big = (format == std::string("IEEE64BIG"));
      int ieee64    = (format == std::string("IEEE64"));

      //////////////////////////////////////////////////////////////////////////////
      // Do the I/O
      //////////////////////////////////////////////////////////////////////////////
      if ( control & BINARYIO_READ ) {

        timer.Start();

        if ( (control & BINARYIO_LEXICOGRAPHIC) && (nrank > 1) ) {
#ifdef USE_MPI_IO
          std::cout<< GridLogMessage<<"IOobject: MPI read I/O "<< file<< std::endl;
          ierr=MPI_File_open(grid->communicator,(char *) file.c_str(), MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);    assert(ierr==0);
          ierr=MPI_File_set_view(fh, disp, mpiObject, fileArray, "native", MPI_INFO_NULL);    assert(ierr==0);
          ierr=MPI_File_read_all(fh, &iodata[0], 1, localArray, &status);    assert(ierr==0);
          MPI_File_close(&fh);
          MPI_Type_free(&fileArray);
          MPI_Type_free(&localArray);
#else
          assert(0);
#endif
        } else {
          std::cout << GridLogMessage <<"IOobject: C++ read I/O " << file << " : "
                    << iodata.size() * sizeof(fobj) << " bytes" << std::endl;
          std::ifstream fin;
          fin.open(file, std::ios::binary | std::ios::in);
          if (control & BINARYIO_MASTER_APPEND)
          {
            fin.seekg(-sizeof(fobj), fin.end);
          }
          else
          {
            fin.seekg(offset + myrank * lsites * sizeof(fobj));
          }
          fin.read((char *)&iodata[0], iodata.size() * sizeof(fobj));
          assert(fin.fail() == 0);
          fin.close();
        }
        timer.Stop();

        grid->Barrier();

        bstimer.Start();
        // SciDAC checksums are taken in file byte order, the NERSC sum in host byte order
        ScidacChecksum(grid,iodata,scidac_csuma,scidac_csumb);
        if (ieee32big) be32toh_v((void *)&iodata[0], sizeof(fobj)*iodata.size());
        if (ieee32)    le32toh_v((void *)&iodata[0], sizeof(fobj)*iodata.size());
        if (ieee64big) be64toh_v((void *)&iodata[0], sizeof(fobj)*iodata.size());
        if (ieee64)    le64toh_v((void *)&iodata[0], sizeof(fobj)*iodata.size());
        NerscChecksum(grid,iodata,nersc_csum);
        bstimer.Stop();
      }

      if ( control & BINARYIO_WRITE ) {

        bstimer.Start();
        // NERSC sum is taken in host byte order, SciDAC checksums in file byte order
        NerscChecksum(grid,iodata,nersc_csum);
        if (ieee32big) htobe32_v((void *)&iodata[0], sizeof(fobj)*iodata.size());
        if (ieee32)    htole32_v((void *)&iodata[0], sizeof(fobj)*iodata.size());
        if (ieee64big) htobe64_v((void *)&iodata[0], sizeof(fobj)*iodata.size());
        if (ieee64)    htole64_v((void *)&iodata[0], sizeof(fobj)*iodata.size());
        ScidacChecksum(grid,iodata,scidac_csuma,scidac_csumb);
        bstimer.Stop();

        grid->Barrier();

        timer.Start();
        if ( (control & BINARYIO_LEXICOGRAPHIC) && (nrank > 1) ) {
#ifdef USE_MPI_IO
          std::cout << GridLogMessage <<"IOobject: MPI write I/O " << file << std::endl;
          ierr = MPI_File_open(grid->communicator, (char *)file.c_str(), MPI_MODE_RDWR | MPI_MODE_CREATE, MPI_INFO_NULL, &fh);
          //  std::cout << GridLogMessage << "Checking for errors" << std::endl;
          if (ierr != MPI_SUCCESS)
          {
            char error_string[BUFSIZ];
            int length_of_error_string, error_class;

            MPI_Error_class(ierr, &error_class);
            MPI_Error_string(error_class, error_string, &length_of_error_string);
            fprintf(stderr, "%3d: %s\n", myrank, error_string);
            MPI_Error_string(ierr, error_string, &length_of_error_string);
            fprintf(stderr, "%3d: %s\n", myrank, error_string);
            MPI_Abort(MPI_COMM_WORLD, 1); //assert(ierr == 0);
          }

          std::cout << GridLogDebug << "MPI write I/O set view " << file << std::endl;
          ierr = MPI_File_set_view(fh, disp, mpiObject, fileArray, "native", MPI_INFO_NULL);
          assert(ierr == 0);

          std::cout << GridLogDebug << "MPI write I/O write all " << file << std::endl;
          ierr = MPI_File_write_all(fh, &iodata[0], 1, localArray, &status);
          assert(ierr == 0);

          MPI_File_close(&fh);
          MPI_Type_free(&fileArray);
          MPI_Type_free(&localArray);
#else
          assert(0);
#endif
        } else {

          std::cout << GridLogMessage << "IOobject: C++ write I/O " << file << " : "
                    << iodata.size() * sizeof(fobj) << " bytes" << std::endl;

          std::ofstream fout;
          fout.exceptions ( std::fstream::failbit | std::fstream::badbit );
          try {
            if (offset) { // Must already exist and contain data
              fout.open(file,std::ios::binary|std::ios::out|std::ios::in);
            } else {      // Allow create
              fout.open(file,std::ios::binary|std::ios::out);
            }
          } catch (const std::fstream::failure& exc) {
            std::cout << GridLogError << "Error in opening the file " << file << " for output" <<std::endl;
            std::cout << GridLogError << "Exception description: " << exc.what() << std::endl;
            //  std::cout << GridLogError << "Probable cause: wrong path, inaccessible location "<< std::endl;
#ifdef USE_MPI_IO
            MPI_Abort(MPI_COMM_WORLD,1);
#else
            exit(1);
#endif
          }

          if ( control & BINARYIO_MASTER_APPEND ) {
            try {
              fout.seekp(0,fout.end);
            } catch (const std::fstream::failure& exc) {
              std::cout << "Exception in seeking file end " << file << std::endl;
            }
          } else {
            try {
              fout.seekp(offset+myrank*lsites*sizeof(fobj));
            } catch (const std::fstream::failure& exc) {
              std::cout << "Exception in seeking file " << file <<" offset "<< offset << std::endl;
            }
          }

          try {
            fout.write((char *)&iodata[0],iodata.size()*sizeof(fobj));//assert( fout.fail()==0);
          } catch (const std::fstream::failure& exc) {
            std::cout << "Exception in writing file " << file << std::endl;
            std::cout << GridLogError << "Exception description: "<< exc.what() << std::endl;
#ifdef USE_MPI_IO
            MPI_Abort(MPI_COMM_WORLD,1);
#else
            exit(1);
#endif
          }
          fout.close();
        }
        timer.Stop();
      }

      std::cout<<GridLogMessage<<"IOobject: ";
      if ( control & BINARYIO_READ) std::cout << " read  ";
      else                          std::cout << " write ";
      uint64_t bytes = sizeof(fobj)*iodata.size()*nrank;
      std::cout<< bytes <<" bytes in "<<timer.Elapsed() <<" "
               << (double)bytes/ (double)timer.useconds() <<" MB/s "<<std::endl;

      std::cout<<GridLogMessage<<"IOobject: endian and checksum overhead "<<bstimer.Elapsed()<<std::endl;

      //////////////////////////////////////////////////////////////////////////////
      // Safety check
      //////////////////////////////////////////////////////////////////////////////
      // if the data size is 1 we do not want to sum over the MPI ranks
      if (iodata.size() != 1){
        grid->Barrier();
        grid->GlobalSum(nersc_csum);
        grid->GlobalXOR(scidac_csuma);
        grid->GlobalXOR(scidac_csumb);
        grid->Barrier();
      }
    }
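    // Note: each rank checksums only its own sites, so the results are combined
    // globally above: the additive NERSC sum via GlobalSum, the XOR-based SciDAC
    // checksums via GlobalXOR.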
    /////////////////////////////////////////////////////////////////////////////
    // Read a Lattice of object
    //////////////////////////////////////////////////////////////////////////////////////
    template<class vobj,class fobj,class munger>
    static inline void readLatticeObject(Lattice<vobj> &Umu,
                                         std::string file,
                                         munger munge,
                                         uint64_t offset,
                                         const std::string &format,
                                         uint32_t &nersc_csum,
                                         uint32_t &scidac_csuma,
                                         uint32_t &scidac_csumb)
    {
      typedef typename vobj::scalar_object sobj;
      typedef typename vobj::Realified::scalar_type word;    word w=0;

      GridBase *grid = Umu._grid;
      uint64_t lsites = grid->lSites();

      std::vector<sobj> scalardata(lsites);
      std::vector<fobj>     iodata(lsites); // Munge, checksum, byte order in here

      IOobject(w,grid,iodata,file,offset,format,BINARYIO_READ|BINARYIO_LEXICOGRAPHIC,
               nersc_csum,scidac_csuma,scidac_csumb);

      GridStopWatch timer;
      timer.Start();

      parallel_for(uint64_t x=0;x<lsites;x++) munge(iodata[x], scalardata[x]);

      vectorizeFromLexOrdArray(scalardata,Umu);
      grid->Barrier();

      timer.Stop();
      std::cout<<GridLogMessage<<"readLatticeObject: vectorize overhead "<<timer.Elapsed()<<std::endl;
    }
/////////////////////////////////////////////////////////////////////////////
|
|
|
|
// Write a Lattice of object
|
|
|
|
//////////////////////////////////////////////////////////////////////////////////////
|
|
|
|
template<class vobj,class fobj,class munger>
|
2017-06-11 23:14:10 +01:00
|
|
|
static inline void writeLatticeObject(Lattice<vobj> &Umu,
|
|
|
|
std::string file,
|
|
|
|
munger munge,
|
2018-03-16 21:54:56 +00:00
|
|
|
uint64_t offset,
|
2017-06-11 23:14:10 +01:00
|
|
|
const std::string &format,
|
|
|
|
uint32_t &nersc_csum,
|
|
|
|
uint32_t &scidac_csuma,
|
|
|
|
uint32_t &scidac_csumb)
|
2015-12-19 18:32:25 +00:00
|
|
|
{
|
2017-06-01 22:36:53 +01:00
|
|
|
typedef typename vobj::scalar_object sobj;
|
|
|
|
typedef typename vobj::Realified::scalar_type word; word w=0;
|
|
|
|
GridBase *grid = Umu._grid;
|
2018-03-16 21:54:56 +00:00
|
|
|
uint64_t lsites = grid->lSites();
|
2015-12-19 18:32:25 +00:00
|
|
|
|
2017-06-01 22:36:53 +01:00
|
|
|
std::vector<sobj> scalardata(lsites);
|
|
|
|
std::vector<fobj> iodata(lsites); // Munge, checksum, byte order in here
|
2015-12-19 18:32:25 +00:00
|
|
|
|
2017-06-01 22:36:53 +01:00
|
|
|
//////////////////////////////////////////////////////////////////////////////
|
|
|
|
// Munge [ .e.g 3rd row recon ]
|
|
|
|
//////////////////////////////////////////////////////////////////////////////
|
|
|
|
GridStopWatch timer; timer.Start();
|
|
|
|
unvectorizeToLexOrdArray(scalardata,Umu);
|
2015-12-19 18:32:25 +00:00
|
|
|
|
2018-03-16 21:54:56 +00:00
|
|
|
parallel_for(uint64_t x=0;x<lsites;x++) munge(scalardata[x],iodata[x]);
|
2015-12-19 18:32:25 +00:00
|
|
|
|
2017-06-01 22:36:53 +01:00
|
|
|
grid->Barrier();
|
|
|
|
timer.Stop();
|
2015-12-19 18:32:25 +00:00
|
|
|
|
2017-06-11 23:14:10 +01:00
|
|
|
IOobject(w,grid,iodata,file,offset,format,BINARYIO_WRITE|BINARYIO_LEXICOGRAPHIC,
|
|
|
|
nersc_csum,scidac_csuma,scidac_csumb);
|
2016-10-19 16:56:11 +01:00
|
|
|
|
2017-06-01 22:36:53 +01:00
|
|
|
std::cout<<GridLogMessage<<"writeLatticeObject: unvectorize overhead "<<timer.Elapsed() <<std::endl;
|
2015-12-19 18:32:25 +00:00
|
|
|
}
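/////////////////////////////////////////////////////////////////////////////
// Round-trip usage sketch for the pair above, assuming a LatticeGaugeFieldD
// Umu, the illustrative IdentityMunger sketched earlier, and a hypothetical
// file name; real callers (e.g. gauge configuration I/O) supply their own
// format-specific mungers:
//
//   uint32_t nersc_csum,scidac_csuma,scidac_csumb;
//   uint64_t offset = 0;
//   IdentityMunger munge;
//   BinaryIO::writeLatticeObject<vLorentzColourMatrixD,LorentzColourMatrixD>
//     (Umu,"lat.bin",munge,offset,"IEEE64BIG",nersc_csum,scidac_csuma,scidac_csumb);
//   BinaryIO::readLatticeObject<vLorentzColourMatrixD,LorentzColourMatrixD>
//     (Umu,"lat.bin",munge,offset,"IEEE64BIG",nersc_csum,scidac_csuma,scidac_csumb);
/////////////////////////////////////////////////////////////////////////////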
/////////////////////////////////////////////////////////////////////////////
// Read a RNG; use IOobject and lexico map to an array of state
//////////////////////////////////////////////////////////////////////////////////////
static inline void readRNG(GridSerialRNG &serial,
                           GridParallelRNG &parallel,
                           std::string file,
                           uint64_t offset,
                           uint32_t &nersc_csum,
                           uint32_t &scidac_csuma,
                           uint32_t &scidac_csumb)
{
  typedef typename GridSerialRNG::RngStateType RngStateType;
  const int RngStateCount = GridSerialRNG::RngStateCount;
  typedef std::array<RngStateType,RngStateCount> RNGstate;
  typedef RngStateType word; word w=0;

  std::string format = "IEEE32BIG";

  GridBase *grid = parallel._grid;
  uint64_t gsites = grid->gSites();
  uint64_t lsites = grid->lSites();

  uint32_t nersc_csum_tmp   = 0;
  uint32_t scidac_csuma_tmp = 0;
  uint32_t scidac_csumb_tmp = 0;

  GridStopWatch timer;

  std::cout << GridLogMessage << "RNG read I/O on file " << file << std::endl;

  std::vector<RNGstate> iodata(lsites);
  IOobject(w,grid,iodata,file,offset,format,BINARYIO_READ|BINARYIO_LEXICOGRAPHIC,
           nersc_csum,scidac_csuma,scidac_csumb);

  timer.Start();
  parallel_for(uint64_t lidx=0;lidx<lsites;lidx++){
    std::vector<RngStateType> tmp(RngStateCount);
    std::copy(iodata[lidx].begin(),iodata[lidx].end(),tmp.begin());
    parallel.SetState(tmp,lidx);
  }
  timer.Stop();

  iodata.resize(1);
  IOobject(w,grid,iodata,file,offset,format,BINARYIO_READ|BINARYIO_MASTER_APPEND,
           nersc_csum_tmp,scidac_csuma_tmp,scidac_csumb_tmp);

  {
    std::vector<RngStateType> tmp(RngStateCount);
    std::copy(iodata[0].begin(),iodata[0].end(),tmp.begin());
    serial.SetState(tmp,0);
  }

  nersc_csum   = nersc_csum + nersc_csum_tmp;
  scidac_csuma = scidac_csuma ^ scidac_csuma_tmp;
  scidac_csumb = scidac_csumb ^ scidac_csumb_tmp;
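  // NERSC checksums are plain 32-bit sums over the data, so the lattice section
  // and the master-appended serial section combine by addition; the SciDAC
  // checksums are XOR reductions of per-site CRCs, so the sections combine by XOR.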

  std::cout << GridLogMessage << "RNG file nersc_checksum " << std::hex << nersc_csum << std::dec << std::endl;
  std::cout << GridLogMessage << "RNG file scidac_checksuma " << std::hex << scidac_csuma << std::dec << std::endl;
  std::cout << GridLogMessage << "RNG file scidac_checksumb " << std::hex << scidac_csumb << std::dec << std::endl;

  std::cout << GridLogMessage << "RNG state overhead " << timer.Elapsed() << std::endl;
}
/////////////////////////////////////////////////////////////////////////////
// Write a RNG; lexico map to an array of state and use IOobject
//////////////////////////////////////////////////////////////////////////////////////
static inline void writeRNG(GridSerialRNG &serial,
                            GridParallelRNG &parallel,
                            std::string file,
                            uint64_t offset,
                            uint32_t &nersc_csum,
                            uint32_t &scidac_csuma,
                            uint32_t &scidac_csumb)
{
  typedef typename GridSerialRNG::RngStateType RngStateType;
  typedef RngStateType word; word w=0;
  const int RngStateCount = GridSerialRNG::RngStateCount;
  typedef std::array<RngStateType,RngStateCount> RNGstate;

  GridBase *grid = parallel._grid;
  uint64_t gsites = grid->gSites();
  uint64_t lsites = grid->lSites();

  uint32_t nersc_csum_tmp   = 0;
  uint32_t scidac_csuma_tmp = 0;
  uint32_t scidac_csumb_tmp = 0;

  GridStopWatch timer;
  std::string format = "IEEE32BIG";

  std::cout << GridLogMessage << "RNG write I/O on file " << file << std::endl;

  timer.Start();
  std::vector<RNGstate> iodata(lsites);
  parallel_for(uint64_t lidx=0;lidx<lsites;lidx++){
    std::vector<RngStateType> tmp(RngStateCount);
    parallel.GetState(tmp,lidx);
    std::copy(tmp.begin(),tmp.end(),iodata[lidx].begin());
  }
  timer.Stop();

  IOobject(w,grid,iodata,file,offset,format,BINARYIO_WRITE|BINARYIO_LEXICOGRAPHIC,
           nersc_csum,scidac_csuma,scidac_csumb);

  iodata.resize(1);
  {
    std::vector<RngStateType> tmp(RngStateCount);
    serial.GetState(tmp,0);
    std::copy(tmp.begin(),tmp.end(),iodata[0].begin());
  }
  IOobject(w,grid,iodata,file,offset,format,BINARYIO_WRITE|BINARYIO_MASTER_APPEND,
           nersc_csum_tmp,scidac_csuma_tmp,scidac_csumb_tmp);

  nersc_csum   = nersc_csum + nersc_csum_tmp;
  scidac_csuma = scidac_csuma ^ scidac_csuma_tmp;
  scidac_csumb = scidac_csumb ^ scidac_csumb_tmp;
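  // Combine the two sections as in readRNG: additive NERSC sums add, XOR-based
  // SciDAC checksums XOR.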

  std::cout << GridLogMessage << "RNG file nersc_checksum " << std::hex << nersc_csum << std::dec << std::endl;
  std::cout << GridLogMessage << "RNG file scidac_checksuma " << std::hex << scidac_csuma << std::dec << std::endl;
  std::cout << GridLogMessage << "RNG file scidac_checksumb " << std::hex << scidac_csumb << std::dec << std::endl;

  std::cout << GridLogMessage << "RNG state overhead " << timer.Elapsed() << std::endl;
}
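/////////////////////////////////////////////////////////////////////////////
// Usage sketch, assuming a GridBase *grid and a hypothetical checkpoint file;
// writeRNG dumps every site's parallel state plus the serial state, readRNG
// restores them:
//
//   GridSerialRNG   sRNG;
//   GridParallelRNG pRNG(grid);
//   uint32_t nersc_csum,scidac_csuma,scidac_csumb;
//   BinaryIO::writeRNG(sRNG,pRNG,"ckpoint.rng",0,nersc_csum,scidac_csuma,scidac_csumb);
//   BinaryIO::readRNG (sRNG,pRNG,"ckpoint.rng",0,nersc_csum,scidac_csuma,scidac_csumb);
/////////////////////////////////////////////////////////////////////////////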
};
}
#endif