1
0
mirror of https://github.com/paboyle/Grid.git synced 2025-04-10 06:00:45 +01:00

Merge branch 'develop' into feature/hadrons

This commit is contained in:
Antonin Portelli 2017-08-24 17:05:45 +01:00
commit 21b02760c3
42 changed files with 1906 additions and 512 deletions

265
README.md
View File

@ -18,10 +18,41 @@
License: GPL v2. License: GPL v2.
Last update Nov 2016. Last update June 2017.
_Please do not send pull requests to the `master` branch which is reserved for releases._ _Please do not send pull requests to the `master` branch which is reserved for releases._
### Description
This library provides data parallel C++ container classes with internal memory layout
that is transformed to map efficiently to SIMD architectures. CSHIFT facilities
are provided, similar to HPF and cmfortran, and user control is given over the mapping of
array indices to both MPI tasks and SIMD processing elements.
* Identically shaped arrays then be processed with perfect data parallelisation.
* Such identically shaped arrays are called conformable arrays.
The transformation is based on the observation that Cartesian array processing involves
identical processing to be performed on different regions of the Cartesian array.
The library will both geometrically decompose into MPI tasks and across SIMD lanes.
Local vector loops are parallelised with OpenMP pragmas.
Data parallel array operations can then be specified with a SINGLE data parallel paradigm, but
optimally use MPI, OpenMP and SIMD parallelism under the hood. This is a significant simplification
for most programmers.
The layout transformations are parametrised by the SIMD vector length. This adapts according to the architecture.
Presently SSE4, ARM NEON (128 bits) AVX, AVX2, QPX (256 bits), IMCI and AVX512 (512 bits) targets are supported.
These are presented as `vRealF`, `vRealD`, `vComplexF`, and `vComplexD` internal vector data types.
The corresponding scalar types are named `RealF`, `RealD`, `ComplexF` and `ComplexD`.
MPI, OpenMP, and SIMD parallelism are present in the library.
Please see [this paper](https://arxiv.org/abs/1512.03487) for more detail.
### Compilers ### Compilers
Intel ICPC v16.0.3 and later Intel ICPC v16.0.3 and later
@ -56,35 +87,25 @@ When you file an issue, please go though the following checklist:
6. Attach the output of `make V=1`. 6. Attach the output of `make V=1`.
7. Describe the issue and any previous attempt to solve it. If relevant, show how to reproduce the issue using a minimal working example. 7. Describe the issue and any previous attempt to solve it. If relevant, show how to reproduce the issue using a minimal working example.
### Required libraries
Grid requires:
[GMP](https://gmplib.org/),
### Description [MPFR](http://www.mpfr.org/)
This library provides data parallel C++ container classes with internal memory layout
that is transformed to map efficiently to SIMD architectures. CSHIFT facilities
are provided, similar to HPF and cmfortran, and user control is given over the mapping of
array indices to both MPI tasks and SIMD processing elements.
* Identically shaped arrays then be processed with perfect data parallelisation. Bootstrapping grid downloads and uses for internal dense matrix (non-QCD operations) the Eigen library.
* Such identically shaped arrays are called conformable arrays.
The transformation is based on the observation that Cartesian array processing involves Grid optionally uses:
identical processing to be performed on different regions of the Cartesian array.
The library will both geometrically decompose into MPI tasks and across SIMD lanes. [HDF5](https://support.hdfgroup.org/HDF5/)
Local vector loops are parallelised with OpenMP pragmas.
Data parallel array operations can then be specified with a SINGLE data parallel paradigm, but [LIME](http://usqcd-software.github.io/c-lime/) for ILDG and SciDAC file format support.
optimally use MPI, OpenMP and SIMD parallelism under the hood. This is a significant simplification
for most programmers.
The layout transformations are parametrised by the SIMD vector length. This adapts according to the architecture. [FFTW](http://www.fftw.org) either generic version or via the Intel MKL library.
Presently SSE4 (128 bit) AVX, AVX2, QPX (256 bit), IMCI, and AVX512 (512 bit) targets are supported (ARM NEON on the way).
These are presented as `vRealF`, `vRealD`, `vComplexF`, and `vComplexD` internal vector data types. These may be useful in themselves for other programmers. LAPACK either generic version or Intel MKL library.
The corresponding scalar types are named `RealF`, `RealD`, `ComplexF` and `ComplexD`.
MPI, OpenMP, and SIMD parallelism are present in the library.
Please see https://arxiv.org/abs/1512.03487 for more detail.
### Quick start ### Quick start
First, start by cloning the repository: First, start by cloning the repository:
@ -155,7 +176,6 @@ The following options can be use with the `--enable-comms=` option to target dif
| `none` | no communications | | `none` | no communications |
| `mpi[-auto]` | MPI communications | | `mpi[-auto]` | MPI communications |
| `mpi3[-auto]` | MPI communications using MPI 3 shared memory | | `mpi3[-auto]` | MPI communications using MPI 3 shared memory |
| `mpi3l[-auto]` | MPI communications using MPI 3 shared memory and leader model |
| `shmem ` | Cray SHMEM communications | | `shmem ` | Cray SHMEM communications |
For the MPI interfaces the optional `-auto` suffix instructs the `configure` scripts to determine all the necessary compilation and linking flags. This is done by extracting the informations from the MPI wrapper specified in the environment variable `MPICXX` (if not specified `configure` will scan though a list of default names). The `-auto` suffix is not supported by the Cray environment wrapper scripts. Use the standard versions instead. For the MPI interfaces the optional `-auto` suffix instructs the `configure` scripts to determine all the necessary compilation and linking flags. This is done by extracting the informations from the MPI wrapper specified in the environment variable `MPICXX` (if not specified `configure` will scan though a list of default names). The `-auto` suffix is not supported by the Cray environment wrapper scripts. Use the standard versions instead.
@ -173,7 +193,8 @@ The following options can be use with the `--enable-simd=` option to target diff
| `AVXFMA4` | AVX (256 bit) + FMA4 | | `AVXFMA4` | AVX (256 bit) + FMA4 |
| `AVX2` | AVX 2 (256 bit) | | `AVX2` | AVX 2 (256 bit) |
| `AVX512` | AVX 512 bit | | `AVX512` | AVX 512 bit |
| `QPX` | QPX (256 bit) | | `NEONv8` | [ARM NEON](http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.den0024a/ch07s03.html) (128 bit) |
| `QPX` | IBM QPX (256 bit) |
Alternatively, some CPU codenames can be directly used: Alternatively, some CPU codenames can be directly used:
@ -195,21 +216,205 @@ The following configuration is recommended for the Intel Knights Landing platfor
``` bash ``` bash
../configure --enable-precision=double\ ../configure --enable-precision=double\
--enable-simd=KNL \ --enable-simd=KNL \
--enable-comms=mpi-auto \ --enable-comms=mpi-auto \
--with-gmp=<path> \
--with-mpfr=<path> \
--enable-mkl \ --enable-mkl \
CXX=icpc MPICXX=mpiicpc CXX=icpc MPICXX=mpiicpc
``` ```
The MKL flag enables use of BLAS and FFTW from the Intel Math Kernels Library.
where `<path>` is the UNIX prefix where GMP and MPFR are installed. If you are working on a Cray machine that does not use the `mpiicpc` wrapper, please use: If you are working on a Cray machine that does not use the `mpiicpc` wrapper, please use:
``` bash ``` bash
../configure --enable-precision=double\ ../configure --enable-precision=double\
--enable-simd=KNL \ --enable-simd=KNL \
--enable-comms=mpi \ --enable-comms=mpi \
--with-gmp=<path> \
--with-mpfr=<path> \
--enable-mkl \ --enable-mkl \
CXX=CC CC=cc CXX=CC CC=cc
``` ```
If gmp and mpfr are NOT in standard places (/usr/) these flags may be needed:
``` bash
--with-gmp=<path> \
--with-mpfr=<path> \
```
where `<path>` is the UNIX prefix where GMP and MPFR are installed.
Knight's Landing with Intel Omnipath adapters with two adapters per node
presently performs better with use of more than one rank per node, using shared memory
for interior communication. This is the mpi3 communications implementation.
We recommend four ranks per node for best performance, but optimum is local volume dependent.
``` bash
../configure --enable-precision=double\
--enable-simd=KNL \
--enable-comms=mpi3-auto \
--enable-mkl \
CC=icpc MPICXX=mpiicpc
```
### Build setup for Intel Haswell Xeon platform
The following configuration is recommended for the Intel Haswell platform:
``` bash
../configure --enable-precision=double\
--enable-simd=AVX2 \
--enable-comms=mpi3-auto \
--enable-mkl \
CXX=icpc MPICXX=mpiicpc
```
The MKL flag enables use of BLAS and FFTW from the Intel Math Kernels Library.
If gmp and mpfr are NOT in standard places (/usr/) these flags may be needed:
``` bash
--with-gmp=<path> \
--with-mpfr=<path> \
```
where `<path>` is the UNIX prefix where GMP and MPFR are installed.
If you are working on a Cray machine that does not use the `mpiicpc` wrapper, please use:
``` bash
../configure --enable-precision=double\
--enable-simd=AVX2 \
--enable-comms=mpi3 \
--enable-mkl \
CXX=CC CC=cc
```
Since Dual socket nodes are commonplace, we recommend MPI-3 as the default with the use of
one rank per socket. If using the Intel MPI library, threads should be pinned to NUMA domains using
```
export I_MPI_PIN=1
```
This is the default.
### Build setup for Intel Skylake Xeon platform
The following configuration is recommended for the Intel Skylake platform:
``` bash
../configure --enable-precision=double\
--enable-simd=AVX512 \
--enable-comms=mpi3 \
--enable-mkl \
CXX=mpiicpc
```
The MKL flag enables use of BLAS and FFTW from the Intel Math Kernels Library.
If gmp and mpfr are NOT in standard places (/usr/) these flags may be needed:
``` bash
--with-gmp=<path> \
--with-mpfr=<path> \
```
where `<path>` is the UNIX prefix where GMP and MPFR are installed.
If you are working on a Cray machine that does not use the `mpiicpc` wrapper, please use:
``` bash
../configure --enable-precision=double\
--enable-simd=AVX512 \
--enable-comms=mpi3 \
--enable-mkl \
CXX=CC CC=cc
```
Since Dual socket nodes are commonplace, we recommend MPI-3 as the default with the use of
one rank per socket. If using the Intel MPI library, threads should be pinned to NUMA domains using
```
export I_MPI_PIN=1
```
This is the default.
#### Expected Skylake Gold 6148 dual socket (single prec, single node 20+20 cores) performance using NUMA MPI mapping):
mpirun -n 2 benchmarks/Benchmark_dwf --grid 16.16.16.16 --mpi 2.1.1.1 --cacheblocking 2.2.2.2 --dslash-asm --shm 1024 --threads 18
TBA
### Build setup for AMD EPYC / RYZEN
The AMD EPYC is a multichip module comprising 32 cores spread over four distinct chips each with 8 cores.
So, even with a single socket node there is a quad-chip module. Dual socket nodes with 64 cores total
are common. Each chip within the module exposes a separate NUMA domain.
There are four NUMA domains per socket and we recommend one MPI rank per NUMA domain.
MPI-3 is recommended with the use of four ranks per socket,
and 8 threads per rank.
The following configuration is recommended for the AMD EPYC platform.
``` bash
../configure --enable-precision=double\
--enable-simd=AVX2 \
--enable-comms=mpi3 \
CXX=mpicxx
```
If gmp and mpfr are NOT in standard places (/usr/) these flags may be needed:
``` bash
--with-gmp=<path> \
--with-mpfr=<path> \
```
where `<path>` is the UNIX prefix where GMP and MPFR are installed.
Using MPICH and g++ v4.9.2, best performance can be obtained using explicit GOMP_CPU_AFFINITY flags for each MPI rank.
This can be done by invoking MPI on a wrapper script omp_bind.sh to handle this.
It is recommended to run 8 MPI ranks on a single dual socket AMD EPYC, with 8 threads per rank using MPI3 and
shared memory to communicate within this node:
mpirun -np 8 ./omp_bind.sh ./Benchmark_dwf --mpi 2.2.2.1 --dslash-unroll --threads 8 --grid 16.16.16.16 --cacheblocking 4.4.4.4
Where omp_bind.sh does the following:
```
#!/bin/bash
numanode=` expr $PMI_RANK % 8 `
basecore=`expr $numanode \* 16`
core0=`expr $basecore + 0 `
core1=`expr $basecore + 2 `
core2=`expr $basecore + 4 `
core3=`expr $basecore + 6 `
core4=`expr $basecore + 8 `
core5=`expr $basecore + 10 `
core6=`expr $basecore + 12 `
core7=`expr $basecore + 14 `
export GOMP_CPU_AFFINITY="$core0 $core1 $core2 $core3 $core4 $core5 $core6 $core7"
echo GOMP_CUP_AFFINITY $GOMP_CPU_AFFINITY
$@
```
Performance:
#### Expected AMD EPYC 7601 dual socket (single prec, single node 32+32 cores) performance using NUMA MPI mapping):
mpirun -np 8 ./omp_bind.sh ./Benchmark_dwf --threads 8 --mpi 2.2.2.1 --dslash-unroll --grid 16.16.16.16 --cacheblocking 4.4.4.4
TBA
### Build setup for BlueGene/Q
To be written...
### Build setup for ARM Neon
To be written...
### Build setup for laptops, other compilers, non-cluster builds
Many versions of g++ and clang++ work with Grid, and involve merely replacing CXX (and MPICXX),
and omit the enable-mkl flag.
Single node builds are enabled with
```
--enable-comms=none
```
FFTW support that is not in the default search path may then enabled with
```
--with-fftw=<installpath>
```
BLAS will not be compiled in by default, and Lanczos will default to Eigen diagonalisation.

14
TODO
View File

@ -2,18 +2,18 @@ TODO:
--------------- ---------------
Large item work list: Large item work list:
1)- MultiRHS with spread out extra dim -- Go through filesystem with SciDAC I/O
1)- BG/Q port and check
2)- Christoph's local basis expansion Lanczos 2)- Christoph's local basis expansion Lanczos
3)- BG/Q port and check 3)- Precision conversion and sort out localConvert <-- partial
4)- Precision conversion and sort out localConvert <-- partial
- Consistent linear solver flop count/rate -- PARTIAL, time but no flop/s yet - Consistent linear solver flop count/rate -- PARTIAL, time but no flop/s yet
5)- Physical propagator interface 4)- Physical propagator interface
6)- Conserved currents 5)- Conserved currents
7)- Multigrid Wilson and DWF, compare to other Multigrid implementations 6)- Multigrid Wilson and DWF, compare to other Multigrid implementations
8)- HDCR resume 7)- HDCR resume
Recent DONE Recent DONE
-- MultiRHS with spread out extra dim -- Go through filesystem with SciDAC I/O
-- Lanczos Remove DenseVector, DenseMatrix; Use Eigen instead. <-- DONE -- Lanczos Remove DenseVector, DenseMatrix; Use Eigen instead. <-- DONE
-- GaugeFix into central location <-- DONE -- GaugeFix into central location <-- DONE
-- Scidac and Ildg metadata handling <-- DONE -- Scidac and Ildg metadata handling <-- DONE

518
benchmarks/Benchmark_ITT.cc Normal file
View File

@ -0,0 +1,518 @@
/*************************************************************************************
Grid physics library, www.github.com/paboyle/Grid
Source file: ./benchmarks/Benchmark_memory_bandwidth.cc
Copyright (C) 2015
Author: Peter Boyle <paboyle@ph.ed.ac.uk>
Author: paboyle <paboyle@ph.ed.ac.uk>
This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License along
with this program; if not, write to the Free Software Foundation, Inc.,
51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
See the full license in the file "LICENSE" in the top level distribution directory
*************************************************************************************/
/* END LEGAL */
#include <Grid/Grid.h>
using namespace std;
using namespace Grid;
using namespace Grid::QCD;
struct time_statistics{
double mean;
double err;
double min;
double max;
void statistics(std::vector<double> v){
double sum = std::accumulate(v.begin(), v.end(), 0.0);
mean = sum / v.size();
std::vector<double> diff(v.size());
std::transform(v.begin(), v.end(), diff.begin(), [=](double x) { return x - mean; });
double sq_sum = std::inner_product(diff.begin(), diff.end(), diff.begin(), 0.0);
err = std::sqrt(sq_sum / (v.size()*(v.size() - 1)));
auto result = std::minmax_element(v.begin(), v.end());
min = *result.first;
max = *result.second;
}
};
void comms_header(){
std::cout <<GridLogMessage << " L "<<"\t"<<" Ls "<<"\t"
<<std::setw(11)<<"bytes"<<"MB/s uni (err/min/max)"<<"\t\t"<<"MB/s bidi (err/min/max)"<<std::endl;
};
Gamma::Algebra Gmu [] = {
Gamma::Algebra::GammaX,
Gamma::Algebra::GammaY,
Gamma::Algebra::GammaZ,
Gamma::Algebra::GammaT
};
struct controls {
int Opt;
int CommsOverlap;
Grid::CartesianCommunicator::CommunicatorPolicy_t CommsAsynch;
// int HugePages;
};
class Benchmark {
public:
static void Decomposition (void ) {
int threads = GridThread::GetThreads();
std::cout<<GridLogMessage << "=================================================================================="<<std::endl;
std::cout<<GridLogMessage << "= Grid is setup to use "<<threads<<" threads"<<std::endl;
std::cout<<GridLogMessage << "=================================================================================="<<std::endl;
std::cout<<GridLogMessage<<"Grid Default Decomposition patterns\n";
std::cout<<GridLogMessage<<"\tOpenMP threads : "<<GridThread::GetThreads()<<std::endl;
std::cout<<GridLogMessage<<"\tMPI tasks : "<<GridCmdVectorIntToString(GridDefaultMpi())<<std::endl;
std::cout<<GridLogMessage<<"\tvReal : "<<sizeof(vReal )*8 <<"bits ; " <<GridCmdVectorIntToString(GridDefaultSimd(4,vReal::Nsimd()))<<std::endl;
std::cout<<GridLogMessage<<"\tvRealF : "<<sizeof(vRealF)*8 <<"bits ; " <<GridCmdVectorIntToString(GridDefaultSimd(4,vRealF::Nsimd()))<<std::endl;
std::cout<<GridLogMessage<<"\tvRealD : "<<sizeof(vRealD)*8 <<"bits ; " <<GridCmdVectorIntToString(GridDefaultSimd(4,vRealD::Nsimd()))<<std::endl;
std::cout<<GridLogMessage<<"\tvComplex : "<<sizeof(vComplex )*8 <<"bits ; " <<GridCmdVectorIntToString(GridDefaultSimd(4,vComplex::Nsimd()))<<std::endl;
std::cout<<GridLogMessage<<"\tvComplexF : "<<sizeof(vComplexF)*8 <<"bits ; " <<GridCmdVectorIntToString(GridDefaultSimd(4,vComplexF::Nsimd()))<<std::endl;
std::cout<<GridLogMessage<<"\tvComplexD : "<<sizeof(vComplexD)*8 <<"bits ; " <<GridCmdVectorIntToString(GridDefaultSimd(4,vComplexD::Nsimd()))<<std::endl;
std::cout<<GridLogMessage << "=================================================================================="<<std::endl;
}
static void Comms(void)
{
int Nloop=100;
int nmu=0;
int maxlat=32;
std::vector<int> simd_layout = GridDefaultSimd(Nd,vComplexD::Nsimd());
std::vector<int> mpi_layout = GridDefaultMpi();
std::vector<double> t_time(Nloop);
time_statistics timestat;
std::cout<<GridLogMessage << "===================================================================================================="<<std::endl;
std::cout<<GridLogMessage << "= Benchmarking threaded STENCIL halo exchange in "<<nmu<<" dimensions"<<std::endl;
std::cout<<GridLogMessage << "===================================================================================================="<<std::endl;
comms_header();
for(int lat=4;lat<=maxlat;lat+=4){
for(int Ls=8;Ls<=8;Ls*=2){
std::vector<int> latt_size ({lat*mpi_layout[0],
lat*mpi_layout[1],
lat*mpi_layout[2],
lat*mpi_layout[3]});
GridCartesian Grid(latt_size,simd_layout,mpi_layout);
RealD Nrank = Grid._Nprocessors;
RealD Nnode = Grid.NodeCount();
RealD ppn = Nrank/Nnode;
std::vector<HalfSpinColourVectorD *> xbuf(8);
std::vector<HalfSpinColourVectorD *> rbuf(8);
Grid.ShmBufferFreeAll();
for(int d=0;d<8;d++){
xbuf[d] = (HalfSpinColourVectorD *)Grid.ShmBufferMalloc(lat*lat*lat*Ls*sizeof(HalfSpinColourVectorD));
rbuf[d] = (HalfSpinColourVectorD *)Grid.ShmBufferMalloc(lat*lat*lat*Ls*sizeof(HalfSpinColourVectorD));
bzero((void *)xbuf[d],lat*lat*lat*Ls*sizeof(HalfSpinColourVectorD));
bzero((void *)rbuf[d],lat*lat*lat*Ls*sizeof(HalfSpinColourVectorD));
}
int ncomm;
int bytes=lat*lat*lat*Ls*sizeof(HalfSpinColourVectorD);
double dbytes;
for(int i=0;i<Nloop;i++){
double start=usecond();
std::vector<CartesianCommunicator::CommsRequest_t> requests;
dbytes=0;
ncomm=0;
parallel_for(int dir=0;dir<8;dir++){
double tbytes;
int mu =dir % 4;
if (mpi_layout[mu]>1 ) {
ncomm++;
int xmit_to_rank;
int recv_from_rank;
if ( dir == mu ) {
int comm_proc=1;
Grid.ShiftedRanks(mu,comm_proc,xmit_to_rank,recv_from_rank);
} else {
int comm_proc = mpi_layout[mu]-1;
Grid.ShiftedRanks(mu,comm_proc,xmit_to_rank,recv_from_rank);
}
#if 0
tbytes= Grid.StencilSendToRecvFromBegin(requests,
(void *)&xbuf[dir][0],
xmit_to_rank,
(void *)&rbuf[dir][0],
recv_from_rank,
bytes,dir);
Grid.StencilSendToRecvFromComplete(requests,dir);
#endif
requests.resize(0);
#pragma omp atomic
dbytes+=tbytes;
}
}
Grid.Barrier();
double stop=usecond();
t_time[i] = stop-start; // microseconds
}
timestat.statistics(t_time);
dbytes=dbytes*ppn;
double xbytes = dbytes*0.5;
double rbytes = dbytes*0.5;
double bidibytes = dbytes;
std::cout<<GridLogMessage << std::setw(4) << lat<<"\t"<<Ls<<"\t"
<<std::setw(11) << bytes<< std::fixed << std::setprecision(1) << std::setw(7)
<<std::right<< xbytes/timestat.mean<<" "<< xbytes*timestat.err/(timestat.mean*timestat.mean)<< " "
<<xbytes/timestat.max <<" "<< xbytes/timestat.min
<< "\t\t"<<std::setw(7)<< bidibytes/timestat.mean<< " " << bidibytes*timestat.err/(timestat.mean*timestat.mean) << " "
<< bidibytes/timestat.max << " " << bidibytes/timestat.min << std::endl;
}
}
return;
}
static void Memory(void)
{
const int Nvec=8;
typedef Lattice< iVector< vReal,Nvec> > LatticeVec;
typedef iVector<vReal,Nvec> Vec;
std::vector<int> simd_layout = GridDefaultSimd(Nd,vReal::Nsimd());
std::vector<int> mpi_layout = GridDefaultMpi();
std::cout<<GridLogMessage << "=================================================================================="<<std::endl;
std::cout<<GridLogMessage << "= Benchmarking a*x + y bandwidth"<<std::endl;
std::cout<<GridLogMessage << "=================================================================================="<<std::endl;
std::cout<<GridLogMessage << " L "<<"\t\t"<<"bytes"<<"\t\t\t"<<"GB/s"<<"\t\t"<<"Gflop/s"<<"\t\t seconds"<<std::endl;
std::cout<<GridLogMessage << "----------------------------------------------------------"<<std::endl;
uint64_t lmax=48;
#define NLOOP (10*lmax*lmax*lmax*lmax/lat/lat/lat/lat)
GridSerialRNG sRNG; sRNG.SeedFixedIntegers(std::vector<int>({45,12,81,9}));
for(int lat=8;lat<=lmax;lat+=4){
std::vector<int> latt_size ({lat*mpi_layout[0],lat*mpi_layout[1],lat*mpi_layout[2],lat*mpi_layout[3]});
int64_t vol= latt_size[0]*latt_size[1]*latt_size[2]*latt_size[3];
GridCartesian Grid(latt_size,simd_layout,mpi_layout);
Vec rn ; random(sRNG,rn);
LatticeVec z(&Grid); z=rn;
LatticeVec x(&Grid); x=rn;
LatticeVec y(&Grid); y=rn;
double a=2.0;
uint64_t Nloop=NLOOP;
double start=usecond();
for(int i=0;i<Nloop;i++){
z=a*x-y;
x._odata[0]=z._odata[0]; // force serial dependency to prevent optimise away
y._odata[4]=z._odata[4];
}
double stop=usecond();
double time = (stop-start)/Nloop*1000;
double flops=vol*Nvec*2;// mul,add
double bytes=3.0*vol*Nvec*sizeof(Real);
std::cout<<GridLogMessage<<std::setprecision(3)
<< lat<<"\t\t"<<bytes<<" \t\t"<<bytes/time<<"\t\t"<<flops/time<<"\t\t"<<(stop-start)/1000./1000.<<std::endl;
}
};
static void DWF(int Ls,int L)
{
RealD mass=0.1;
RealD M5 =1.8;
double mflops;
double mflops_best = 0;
double mflops_worst= 0;
///////////////////////////////////////////////////////
// Set/Get the layout & grid size
///////////////////////////////////////////////////////
int threads = GridThread::GetThreads();
std::vector<int> mpi = GridDefaultMpi(); assert(mpi.size()==4);
std::vector<int> local({L,L,L,L});
GridCartesian * TmpGrid = SpaceTimeGrid::makeFourDimGrid(std::vector<int>({64,64,64,64}),
GridDefaultSimd(Nd,vComplex::Nsimd()),GridDefaultMpi());
uint64_t NP = TmpGrid->RankCount();
uint64_t NN = TmpGrid->NodeCount();
uint64_t SHM=NP/NN;
std::vector<int> internal;
if ( SHM == 1 ) internal = std::vector<int>({1,1,1,1});
else if ( SHM == 2 ) internal = std::vector<int>({2,1,1,1});
else if ( SHM == 4 ) internal = std::vector<int>({2,2,1,1});
else if ( SHM == 8 ) internal = std::vector<int>({2,2,2,1});
else assert(0);
std::vector<int> nodes({mpi[0]/internal[0],mpi[1]/internal[1],mpi[2]/internal[2],mpi[3]/internal[3]});
std::vector<int> latt4({local[0]*nodes[0],local[1]*nodes[1],local[2]*nodes[2],local[3]*nodes[3]});
///////// Welcome message ////////////
std::cout<<GridLogMessage << "=================================================================================="<<std::endl;
std::cout<<GridLogMessage << "Benchmark DWF on "<<L<<"^4 local volume "<<std::endl;
std::cout<<GridLogMessage << "* Global volume : "<<GridCmdVectorIntToString(latt4)<<std::endl;
std::cout<<GridLogMessage << "* Ls : "<<Ls<<std::endl;
std::cout<<GridLogMessage << "* MPI ranks : "<<GridCmdVectorIntToString(mpi)<<std::endl;
std::cout<<GridLogMessage << "* Intranode : "<<GridCmdVectorIntToString(internal)<<std::endl;
std::cout<<GridLogMessage << "* nodes : "<<GridCmdVectorIntToString(nodes)<<std::endl;
std::cout<<GridLogMessage << "* Using "<<threads<<" threads"<<std::endl;
std::cout<<GridLogMessage << "=================================================================================="<<std::endl;
///////// Lattice Init ////////////
GridCartesian * UGrid = SpaceTimeGrid::makeFourDimGrid(latt4, GridDefaultSimd(Nd,vComplex::Nsimd()),GridDefaultMpi());
GridRedBlackCartesian * UrbGrid = SpaceTimeGrid::makeFourDimRedBlackGrid(UGrid);
GridCartesian * FGrid = SpaceTimeGrid::makeFiveDimGrid(Ls,UGrid);
GridRedBlackCartesian * FrbGrid = SpaceTimeGrid::makeFiveDimRedBlackGrid(Ls,UGrid);
///////// RNG Init ////////////
std::vector<int> seeds4({1,2,3,4});
std::vector<int> seeds5({5,6,7,8});
GridParallelRNG RNG4(UGrid); RNG4.SeedFixedIntegers(seeds4);
GridParallelRNG RNG5(FGrid); RNG5.SeedFixedIntegers(seeds5);
std::cout << GridLogMessage << "Initialised RNGs" << std::endl;
///////// Source preparation ////////////
LatticeFermion src (FGrid); random(RNG5,src);
LatticeFermion ref (FGrid);
LatticeFermion tmp (FGrid);
RealD N2 = 1.0/::sqrt(norm2(src));
src = src*N2;
LatticeGaugeField Umu(UGrid); SU3::HotConfiguration(RNG4,Umu);
DomainWallFermionR Dw(Umu,*FGrid,*FrbGrid,*UGrid,*UrbGrid,mass,M5);
////////////////////////////////////
// Naive wilson implementation
////////////////////////////////////
{
LatticeGaugeField Umu5d(FGrid);
std::vector<LatticeColourMatrix> U(4,FGrid);
for(int ss=0;ss<Umu._grid->oSites();ss++){
for(int s=0;s<Ls;s++){
Umu5d._odata[Ls*ss+s] = Umu._odata[ss];
}
}
ref = zero;
for(int mu=0;mu<Nd;mu++){
U[mu] = PeekIndex<LorentzIndex>(Umu5d,mu);
}
for(int mu=0;mu<Nd;mu++){
tmp = U[mu]*Cshift(src,mu+1,1);
ref=ref + tmp - Gamma(Gmu[mu])*tmp;
tmp =adj(U[mu])*src;
tmp =Cshift(tmp,mu+1,-1);
ref=ref + tmp + Gamma(Gmu[mu])*tmp;
}
ref = -0.5*ref;
}
LatticeFermion src_e (FrbGrid);
LatticeFermion src_o (FrbGrid);
LatticeFermion r_e (FrbGrid);
LatticeFermion r_o (FrbGrid);
LatticeFermion r_eo (FGrid);
LatticeFermion err (FGrid);
{
pickCheckerboard(Even,src_e,src);
pickCheckerboard(Odd,src_o,src);
#if defined(AVX512)
const int num_cases = 6;
#else
const int num_cases = 4;
#endif
controls Cases [] = {
#if defined(AVX512)
{ QCD::WilsonKernelsStatic::OptInlineAsm , QCD::WilsonKernelsStatic::CommsAndCompute ,CartesianCommunicator::CommunicatorPolicySequential },
{ QCD::WilsonKernelsStatic::OptInlineAsm , QCD::WilsonKernelsStatic::CommsThenCompute ,CartesianCommunicator::CommunicatorPolicySequential },
#endif
{ QCD::WilsonKernelsStatic::OptHandUnroll, QCD::WilsonKernelsStatic::CommsThenCompute ,CartesianCommunicator::CommunicatorPolicySequential },
{ QCD::WilsonKernelsStatic::OptHandUnroll, QCD::WilsonKernelsStatic::CommsAndCompute ,CartesianCommunicator::CommunicatorPolicySequential },
{ QCD::WilsonKernelsStatic::OptGeneric , QCD::WilsonKernelsStatic::CommsThenCompute ,CartesianCommunicator::CommunicatorPolicySequential },
{ QCD::WilsonKernelsStatic::OptGeneric , QCD::WilsonKernelsStatic::CommsAndCompute ,CartesianCommunicator::CommunicatorPolicySequential }
};
for(int c=0;c<num_cases;c++) {
std::cout<<GridLogMessage << "=================================================================================="<<std::endl;
if ( WilsonKernelsStatic::Opt == WilsonKernelsStatic::OptGeneric ) std::cout << GridLogMessage<< "* Using GENERIC Nc WilsonKernels" <<std::endl;
if ( WilsonKernelsStatic::Opt == WilsonKernelsStatic::OptHandUnroll) std::cout << GridLogMessage<< "* Using Nc=3 WilsonKernels" <<std::endl;
if ( WilsonKernelsStatic::Opt == WilsonKernelsStatic::OptInlineAsm ) std::cout << GridLogMessage<< "* Using Asm Nc=3 WilsonKernels" <<std::endl;
if ( WilsonKernelsStatic::Comms == WilsonKernelsStatic::CommsAndCompute ) std::cout << GridLogMessage<< "* Using Overlapped Comms/Compute" <<std::endl;
if ( WilsonKernelsStatic::Comms == WilsonKernelsStatic::CommsThenCompute) std::cout << GridLogMessage<< "* Using sequential comms compute" <<std::endl;
if ( sizeof(Real)==4 ) std::cout << GridLogMessage<< "* SINGLE precision "<<std::endl;
if ( sizeof(Real)==8 ) std::cout << GridLogMessage<< "* DOUBLE precision "<<std::endl;
std::cout<<GridLogMessage << "=================================================================================="<<std::endl;
QCD::WilsonKernelsStatic::Comms = Cases[c].CommsOverlap;
QCD::WilsonKernelsStatic::Opt = Cases[c].Opt;
CartesianCommunicator::SetCommunicatorPolicy(Cases[c].CommsAsynch);
int nwarm = 10;
double t0=usecond();
FGrid->Barrier();
for(int i=0;i<nwarm;i++){
Dw.DhopEO(src_o,r_e,DaggerNo);
}
FGrid->Barrier();
double t1=usecond();
uint64_t ncall = (uint64_t) 2.5*1000.0*1000.0*nwarm/(t1-t0);
FGrid->Broadcast(0,&ncall,sizeof(ncall));
// std::cout << GridLogMessage << " Estimate " << ncall << " calls per second"<<std::endl;
Dw.ZeroCounters();
time_statistics timestat;
std::vector<double> t_time(ncall);
for(uint64_t i=0;i<ncall;i++){
t0=usecond();
Dw.DhopEO(src_o,r_e,DaggerNo);
t1=usecond();
t_time[i] = t1-t0;
}
FGrid->Barrier();
double volume=Ls; for(int mu=0;mu<Nd;mu++) volume=volume*latt4[mu];
double flops=(1344.0*volume)/2;
double mf_hi, mf_lo, mf_err;
timestat.statistics(t_time);
mf_hi = flops/timestat.min;
mf_lo = flops/timestat.max;
mf_err= flops/timestat.min * timestat.err/timestat.mean;
mflops = flops/timestat.mean;
if ( mflops_best == 0 ) mflops_best = mflops;
if ( mflops_worst== 0 ) mflops_worst= mflops;
if ( mflops>mflops_best ) mflops_best = mflops;
if ( mflops<mflops_worst) mflops_worst= mflops;
std::cout<<GridLogMessage << std::fixed << std::setprecision(1)<<"Deo mflop/s = "<< mflops << " ("<<mf_err<<") " << mf_lo<<"-"<<mf_hi <<std::endl;
std::cout<<GridLogMessage << std::fixed << std::setprecision(1)<<"Deo mflop/s per rank "<< mflops/NP<<std::endl;
std::cout<<GridLogMessage << std::fixed << std::setprecision(1)<<"Deo mflop/s per node "<< mflops/NN<<std::endl;
Dw.Report();
Dw.DhopEO(src_o,r_e,DaggerNo);
Dw.DhopOE(src_e,r_o,DaggerNo);
setCheckerboard(r_eo,r_o);
setCheckerboard(r_eo,r_e);
err = r_eo-ref;
std::cout<<GridLogMessage << "norm diff "<< norm2(err)<<std::endl;
assert((norm2(err)<1.0e-4));
}
std::cout<<GridLogMessage << "=================================================================================="<<std::endl;
std::cout<<GridLogMessage << L<<"^4 x "<<Ls<< " Deo Best mflop/s = "<< mflops_best <<std::endl;
std::cout<<GridLogMessage << L<<"^4 x "<<Ls<< " Deo Worst mflop/s = "<< mflops_worst<<std::endl;
std::cout<<GridLogMessage << L<<"^4 x "<<Ls<< " Performance Robustness = "<< mflops_worst/mflops_best <<std::endl;
std::cout<<GridLogMessage << "=================================================================================="<<std::endl;
}
}
};
int main (int argc, char ** argv)
{
Grid_init(&argc,&argv);
CartesianCommunicator::SetCommunicatorPolicy(CartesianCommunicator::CommunicatorPolicySequential);
LebesgueOrder::Block = std::vector<int>({2,2,2,2});
Benchmark::Decomposition();
int do_memory=1;
int do_comms =1;
int do_su3 =0;
int do_wilson=1;
int do_dwf =1;
if ( do_memory ) {
std::cout<<GridLogMessage << "=================================================================================="<<std::endl;
std::cout<<GridLogMessage << " Memory benchmark " <<std::endl;
std::cout<<GridLogMessage << "=================================================================================="<<std::endl;
Benchmark::Memory();
}
if ( do_comms ) {
std::cout<<GridLogMessage << "=================================================================================="<<std::endl;
std::cout<<GridLogMessage << " Communications benchmark " <<std::endl;
std::cout<<GridLogMessage << "=================================================================================="<<std::endl;
Benchmark::Comms();
}
if ( do_su3 ) {
// empty for now
}
if ( do_wilson ) {
int Ls=1;
std::cout<<GridLogMessage << "=================================================================================="<<std::endl;
std::cout<<GridLogMessage << " Wilson dslash 4D vectorised" <<std::endl;
std::cout<<GridLogMessage << "=================================================================================="<<std::endl;
Benchmark::DWF(Ls,16);
Benchmark::DWF(Ls,24);
Benchmark::DWF(Ls,32);
}
if ( do_dwf ) {
int Ls=16;
std::cout<<GridLogMessage << "=================================================================================="<<std::endl;
std::cout<<GridLogMessage << " Domain wall dslash 4D vectorised" <<std::endl;
std::cout<<GridLogMessage << "=================================================================================="<<std::endl;
Benchmark::DWF(Ls,8);
Benchmark::DWF(Ls,12);
Benchmark::DWF(Ls,16);
Benchmark::DWF(Ls,24);
}
Grid_finalize();
}

View File

@ -165,7 +165,7 @@ int main (int argc, char ** argv)
std::cout << GridLogMessage<< "*****************************************************************" <<std::endl; std::cout << GridLogMessage<< "*****************************************************************" <<std::endl;
DomainWallFermionR Dw(Umu,*FGrid,*FrbGrid,*UGrid,*UrbGrid,mass,M5); DomainWallFermionR Dw(Umu,*FGrid,*FrbGrid,*UGrid,*UrbGrid,mass,M5);
int ncall =1000; int ncall =500;
if (1) { if (1) {
FGrid->Barrier(); FGrid->Barrier();
Dw.ZeroCounters(); Dw.ZeroCounters();
@ -302,6 +302,7 @@ int main (int argc, char ** argv)
std::cout<< "sD ERR \n " << err <<std::endl; std::cout<< "sD ERR \n " << err <<std::endl;
} }
assert(sum < 1.0e-4); assert(sum < 1.0e-4);
if(1){ if(1){
std::cout << GridLogMessage<< "*********************************************************" <<std::endl; std::cout << GridLogMessage<< "*********************************************************" <<std::endl;
@ -381,8 +382,23 @@ int main (int argc, char ** argv)
} }
assert(error<1.0e-4); assert(error<1.0e-4);
} }
if(0){
std::cout << "Single cache warm call to sDw.Dhop " <<std::endl;
for(int i=0;i< PerformanceCounter::NumTypes(); i++ ){
sDw.Dhop(ssrc,sresult,0);
PerformanceCounter Counter(i);
Counter.Start();
sDw.Dhop(ssrc,sresult,0);
Counter.Stop();
Counter.Report();
}
} }
}
if (1) if (1)
{ // Naive wilson dag implementation { // Naive wilson dag implementation
ref = zero; ref = zero;

View File

@ -55,21 +55,21 @@ int main (int argc, char ** argv)
std::cout<<GridLogMessage << "===================================================================================================="<<std::endl; std::cout<<GridLogMessage << "===================================================================================================="<<std::endl;
std::cout<<GridLogMessage << " L "<<"\t\t"<<"bytes"<<"\t\t\t"<<"GB/s"<<"\t\t"<<"Gflop/s"<<"\t\t seconds"<<std::endl; std::cout<<GridLogMessage << " L "<<"\t\t"<<"bytes"<<"\t\t\t"<<"GB/s"<<"\t\t"<<"Gflop/s"<<"\t\t seconds"<<std::endl;
std::cout<<GridLogMessage << "----------------------------------------------------------"<<std::endl; std::cout<<GridLogMessage << "----------------------------------------------------------"<<std::endl;
uint64_t lmax=64; uint64_t lmax=96;
#define NLOOP (100*lmax*lmax*lmax*lmax/vol) #define NLOOP (10*lmax*lmax*lmax*lmax/vol)
for(int lat=4;lat<=lmax;lat+=4){ for(int lat=8;lat<=lmax;lat+=8){
std::vector<int> latt_size ({lat*mpi_layout[0],lat*mpi_layout[1],lat*mpi_layout[2],lat*mpi_layout[3]}); std::vector<int> latt_size ({lat*mpi_layout[0],lat*mpi_layout[1],lat*mpi_layout[2],lat*mpi_layout[3]});
int vol = latt_size[0]*latt_size[1]*latt_size[2]*latt_size[3]; int64_t vol= latt_size[0]*latt_size[1]*latt_size[2]*latt_size[3];
GridCartesian Grid(latt_size,simd_layout,mpi_layout); GridCartesian Grid(latt_size,simd_layout,mpi_layout);
uint64_t Nloop=NLOOP; uint64_t Nloop=NLOOP;
// GridParallelRNG pRNG(&Grid); pRNG.SeedFixedIntegers(std::vector<int>({45,12,81,9}); // GridParallelRNG pRNG(&Grid); pRNG.SeedFixedIntegers(std::vector<int>({45,12,81,9}));
LatticeVec z(&Grid); //random(pRNG,z); LatticeVec z(&Grid);// random(pRNG,z);
LatticeVec x(&Grid); //random(pRNG,x); LatticeVec x(&Grid);// random(pRNG,x);
LatticeVec y(&Grid); //random(pRNG,y); LatticeVec y(&Grid);// random(pRNG,y);
double a=2.0; double a=2.0;
@ -83,7 +83,7 @@ int main (int argc, char ** argv)
double time = (stop-start)/Nloop*1000; double time = (stop-start)/Nloop*1000;
double flops=vol*Nvec*2;// mul,add double flops=vol*Nvec*2;// mul,add
double bytes=3*vol*Nvec*sizeof(Real); double bytes=3.0*vol*Nvec*sizeof(Real);
std::cout<<GridLogMessage<<std::setprecision(3) << lat<<"\t\t"<<bytes<<" \t\t"<<bytes/time<<"\t\t"<<flops/time<<"\t\t"<<(stop-start)/1000./1000.<<std::endl; std::cout<<GridLogMessage<<std::setprecision(3) << lat<<"\t\t"<<bytes<<" \t\t"<<bytes/time<<"\t\t"<<flops/time<<"\t\t"<<(stop-start)/1000./1000.<<std::endl;
} }
@ -94,17 +94,17 @@ int main (int argc, char ** argv)
std::cout<<GridLogMessage << " L "<<"\t\t"<<"bytes"<<"\t\t\t"<<"GB/s"<<"\t\t"<<"Gflop/s"<<"\t\t seconds"<<std::endl; std::cout<<GridLogMessage << " L "<<"\t\t"<<"bytes"<<"\t\t\t"<<"GB/s"<<"\t\t"<<"Gflop/s"<<"\t\t seconds"<<std::endl;
std::cout<<GridLogMessage << "----------------------------------------------------------"<<std::endl; std::cout<<GridLogMessage << "----------------------------------------------------------"<<std::endl;
for(int lat=4;lat<=lmax;lat+=4){ for(int lat=8;lat<=lmax;lat+=8){
std::vector<int> latt_size ({lat*mpi_layout[0],lat*mpi_layout[1],lat*mpi_layout[2],lat*mpi_layout[3]}); std::vector<int> latt_size ({lat*mpi_layout[0],lat*mpi_layout[1],lat*mpi_layout[2],lat*mpi_layout[3]});
int vol = latt_size[0]*latt_size[1]*latt_size[2]*latt_size[3]; int64_t vol= latt_size[0]*latt_size[1]*latt_size[2]*latt_size[3];
GridCartesian Grid(latt_size,simd_layout,mpi_layout); GridCartesian Grid(latt_size,simd_layout,mpi_layout);
// GridParallelRNG pRNG(&Grid); pRNG.SeedFixedIntegers(std::vector<int>({45,12,81,9}); // GridParallelRNG pRNG(&Grid); pRNG.SeedFixedIntegers(std::vector<int>({45,12,81,9}));
LatticeVec z(&Grid); //random(pRNG,z); LatticeVec z(&Grid);// random(pRNG,z);
LatticeVec x(&Grid); //random(pRNG,x); LatticeVec x(&Grid);// random(pRNG,x);
LatticeVec y(&Grid); //random(pRNG,y); LatticeVec y(&Grid);// random(pRNG,y);
double a=2.0; double a=2.0;
uint64_t Nloop=NLOOP; uint64_t Nloop=NLOOP;
@ -119,7 +119,7 @@ int main (int argc, char ** argv)
double time = (stop-start)/Nloop*1000; double time = (stop-start)/Nloop*1000;
double flops=vol*Nvec*2;// mul,add double flops=vol*Nvec*2;// mul,add
double bytes=3*vol*Nvec*sizeof(Real); double bytes=3.0*vol*Nvec*sizeof(Real);
std::cout<<GridLogMessage<<std::setprecision(3) << lat<<"\t\t"<<bytes<<" \t\t"<<bytes/time<<"\t\t"<<flops/time<<"\t\t"<<(stop-start)/1000./1000.<<std::endl; std::cout<<GridLogMessage<<std::setprecision(3) << lat<<"\t\t"<<bytes<<" \t\t"<<bytes/time<<"\t\t"<<flops/time<<"\t\t"<<(stop-start)/1000./1000.<<std::endl;
} }
@ -129,20 +129,20 @@ int main (int argc, char ** argv)
std::cout<<GridLogMessage << "===================================================================================================="<<std::endl; std::cout<<GridLogMessage << "===================================================================================================="<<std::endl;
std::cout<<GridLogMessage << " L "<<"\t\t"<<"bytes"<<"\t\t\t"<<"GB/s"<<"\t\t"<<"Gflop/s"<<"\t\t seconds"<<std::endl; std::cout<<GridLogMessage << " L "<<"\t\t"<<"bytes"<<"\t\t\t"<<"GB/s"<<"\t\t"<<"Gflop/s"<<"\t\t seconds"<<std::endl;
for(int lat=4;lat<=lmax;lat+=4){ for(int lat=8;lat<=lmax;lat+=8){
std::vector<int> latt_size ({lat*mpi_layout[0],lat*mpi_layout[1],lat*mpi_layout[2],lat*mpi_layout[3]}); std::vector<int> latt_size ({lat*mpi_layout[0],lat*mpi_layout[1],lat*mpi_layout[2],lat*mpi_layout[3]});
int vol = latt_size[0]*latt_size[1]*latt_size[2]*latt_size[3]; int64_t vol= latt_size[0]*latt_size[1]*latt_size[2]*latt_size[3];
uint64_t Nloop=NLOOP; uint64_t Nloop=NLOOP;
GridCartesian Grid(latt_size,simd_layout,mpi_layout); GridCartesian Grid(latt_size,simd_layout,mpi_layout);
// GridParallelRNG pRNG(&Grid); pRNG.SeedFixedIntegers(std::vector<int>({45,12,81,9}); // GridParallelRNG pRNG(&Grid); pRNG.SeedFixedIntegers(std::vector<int>({45,12,81,9}));
LatticeVec z(&Grid); //random(pRNG,z); LatticeVec z(&Grid);// random(pRNG,z);
LatticeVec x(&Grid); //random(pRNG,x); LatticeVec x(&Grid);// random(pRNG,x);
LatticeVec y(&Grid); //random(pRNG,y); LatticeVec y(&Grid);// random(pRNG,y);
RealD a=2.0; RealD a=2.0;
@ -154,7 +154,7 @@ int main (int argc, char ** argv)
double stop=usecond(); double stop=usecond();
double time = (stop-start)/Nloop*1000; double time = (stop-start)/Nloop*1000;
double bytes=2*vol*Nvec*sizeof(Real); double bytes=2.0*vol*Nvec*sizeof(Real);
double flops=vol*Nvec*1;// mul double flops=vol*Nvec*1;// mul
std::cout<<GridLogMessage <<std::setprecision(3) << lat<<"\t\t"<<bytes<<" \t\t"<<bytes/time<<"\t\t"<<flops/time<<"\t\t"<<(stop-start)/1000./1000.<<std::endl; std::cout<<GridLogMessage <<std::setprecision(3) << lat<<"\t\t"<<bytes<<" \t\t"<<bytes/time<<"\t\t"<<flops/time<<"\t\t"<<(stop-start)/1000./1000.<<std::endl;
@ -166,17 +166,17 @@ int main (int argc, char ** argv)
std::cout<<GridLogMessage << " L "<<"\t\t"<<"bytes"<<"\t\t\t"<<"GB/s"<<"\t\t"<<"Gflop/s"<<"\t\t seconds"<<std::endl; std::cout<<GridLogMessage << " L "<<"\t\t"<<"bytes"<<"\t\t\t"<<"GB/s"<<"\t\t"<<"Gflop/s"<<"\t\t seconds"<<std::endl;
std::cout<<GridLogMessage << "----------------------------------------------------------"<<std::endl; std::cout<<GridLogMessage << "----------------------------------------------------------"<<std::endl;
for(int lat=4;lat<=lmax;lat+=4){ for(int lat=8;lat<=lmax;lat+=8){
std::vector<int> latt_size ({lat*mpi_layout[0],lat*mpi_layout[1],lat*mpi_layout[2],lat*mpi_layout[3]}); std::vector<int> latt_size ({lat*mpi_layout[0],lat*mpi_layout[1],lat*mpi_layout[2],lat*mpi_layout[3]});
int vol = latt_size[0]*latt_size[1]*latt_size[2]*latt_size[3]; int64_t vol= latt_size[0]*latt_size[1]*latt_size[2]*latt_size[3];
uint64_t Nloop=NLOOP; uint64_t Nloop=NLOOP;
GridCartesian Grid(latt_size,simd_layout,mpi_layout); GridCartesian Grid(latt_size,simd_layout,mpi_layout);
// GridParallelRNG pRNG(&Grid); pRNG.SeedFixedIntegers(std::vector<int>({45,12,81,9}); // GridParallelRNG pRNG(&Grid); pRNG.SeedFixedIntegers(std::vector<int>({45,12,81,9}));
LatticeVec z(&Grid); //random(pRNG,z); LatticeVec z(&Grid);// random(pRNG,z);
LatticeVec x(&Grid); //random(pRNG,x); LatticeVec x(&Grid);// random(pRNG,x);
LatticeVec y(&Grid); //random(pRNG,y); LatticeVec y(&Grid);// random(pRNG,y);
RealD a=2.0; RealD a=2.0;
Real nn; Real nn;
double start=usecond(); double start=usecond();
@ -187,7 +187,7 @@ int main (int argc, char ** argv)
double stop=usecond(); double stop=usecond();
double time = (stop-start)/Nloop*1000; double time = (stop-start)/Nloop*1000;
double bytes=vol*Nvec*sizeof(Real); double bytes=1.0*vol*Nvec*sizeof(Real);
double flops=vol*Nvec*2;// mul,add double flops=vol*Nvec*2;// mul,add
std::cout<<GridLogMessage<<std::setprecision(3) << lat<<"\t\t"<<bytes<<" \t\t"<<bytes/time<<"\t\t"<<flops/time<< "\t\t"<<(stop-start)/1000./1000.<< "\t\t " <<std::endl; std::cout<<GridLogMessage<<std::setprecision(3) << lat<<"\t\t"<<bytes<<" \t\t"<<bytes/time<<"\t\t"<<flops/time<< "\t\t"<<(stop-start)/1000./1000.<< "\t\t " <<std::endl;

View File

@ -37,12 +37,12 @@ int main (int argc, char ** argv)
Grid_init(&argc,&argv); Grid_init(&argc,&argv);
#define LMAX (64) #define LMAX (64)
int Nloop=20; int64_t Nloop=20;
std::vector<int> simd_layout = GridDefaultSimd(Nd,vComplex::Nsimd()); std::vector<int> simd_layout = GridDefaultSimd(Nd,vComplex::Nsimd());
std::vector<int> mpi_layout = GridDefaultMpi(); std::vector<int> mpi_layout = GridDefaultMpi();
int threads = GridThread::GetThreads(); int64_t threads = GridThread::GetThreads();
std::cout<<GridLogMessage << "Grid is setup to use "<<threads<<" threads"<<std::endl; std::cout<<GridLogMessage << "Grid is setup to use "<<threads<<" threads"<<std::endl;
std::cout<<GridLogMessage << "===================================================================================================="<<std::endl; std::cout<<GridLogMessage << "===================================================================================================="<<std::endl;
@ -54,16 +54,16 @@ int main (int argc, char ** argv)
for(int lat=2;lat<=LMAX;lat+=2){ for(int lat=2;lat<=LMAX;lat+=2){
std::vector<int> latt_size ({lat*mpi_layout[0],lat*mpi_layout[1],lat*mpi_layout[2],lat*mpi_layout[3]}); std::vector<int> latt_size ({lat*mpi_layout[0],lat*mpi_layout[1],lat*mpi_layout[2],lat*mpi_layout[3]});
int vol = latt_size[0]*latt_size[1]*latt_size[2]*latt_size[3]; int64_t vol = latt_size[0]*latt_size[1]*latt_size[2]*latt_size[3];
GridCartesian Grid(latt_size,simd_layout,mpi_layout); GridCartesian Grid(latt_size,simd_layout,mpi_layout);
// GridParallelRNG pRNG(&Grid); pRNG.SeedFixedIntegers(std::vector<int>({45,12,81,9}); GridParallelRNG pRNG(&Grid); pRNG.SeedFixedIntegers(std::vector<int>({45,12,81,9}));
LatticeColourMatrix z(&Grid);// random(pRNG,z); LatticeColourMatrix z(&Grid); random(pRNG,z);
LatticeColourMatrix x(&Grid);// random(pRNG,x); LatticeColourMatrix x(&Grid); random(pRNG,x);
LatticeColourMatrix y(&Grid);// random(pRNG,y); LatticeColourMatrix y(&Grid); random(pRNG,y);
double start=usecond(); double start=usecond();
for(int i=0;i<Nloop;i++){ for(int64_t i=0;i<Nloop;i++){
x=x*y; x=x*y;
} }
double stop=usecond(); double stop=usecond();
@ -86,17 +86,17 @@ int main (int argc, char ** argv)
for(int lat=2;lat<=LMAX;lat+=2){ for(int lat=2;lat<=LMAX;lat+=2){
std::vector<int> latt_size ({lat*mpi_layout[0],lat*mpi_layout[1],lat*mpi_layout[2],lat*mpi_layout[3]}); std::vector<int> latt_size ({lat*mpi_layout[0],lat*mpi_layout[1],lat*mpi_layout[2],lat*mpi_layout[3]});
int vol = latt_size[0]*latt_size[1]*latt_size[2]*latt_size[3]; int64_t vol = latt_size[0]*latt_size[1]*latt_size[2]*latt_size[3];
GridCartesian Grid(latt_size,simd_layout,mpi_layout); GridCartesian Grid(latt_size,simd_layout,mpi_layout);
// GridParallelRNG pRNG(&Grid); pRNG.SeedFixedIntegers(std::vector<int>({45,12,81,9}); GridParallelRNG pRNG(&Grid); pRNG.SeedFixedIntegers(std::vector<int>({45,12,81,9}));
LatticeColourMatrix z(&Grid); //random(pRNG,z); LatticeColourMatrix z(&Grid); random(pRNG,z);
LatticeColourMatrix x(&Grid); //random(pRNG,x); LatticeColourMatrix x(&Grid); random(pRNG,x);
LatticeColourMatrix y(&Grid); //random(pRNG,y); LatticeColourMatrix y(&Grid); random(pRNG,y);
double start=usecond(); double start=usecond();
for(int i=0;i<Nloop;i++){ for(int64_t i=0;i<Nloop;i++){
z=x*y; z=x*y;
} }
double stop=usecond(); double stop=usecond();
@ -117,17 +117,17 @@ int main (int argc, char ** argv)
for(int lat=2;lat<=LMAX;lat+=2){ for(int lat=2;lat<=LMAX;lat+=2){
std::vector<int> latt_size ({lat*mpi_layout[0],lat*mpi_layout[1],lat*mpi_layout[2],lat*mpi_layout[3]}); std::vector<int> latt_size ({lat*mpi_layout[0],lat*mpi_layout[1],lat*mpi_layout[2],lat*mpi_layout[3]});
int vol = latt_size[0]*latt_size[1]*latt_size[2]*latt_size[3]; int64_t vol = latt_size[0]*latt_size[1]*latt_size[2]*latt_size[3];
GridCartesian Grid(latt_size,simd_layout,mpi_layout); GridCartesian Grid(latt_size,simd_layout,mpi_layout);
// GridParallelRNG pRNG(&Grid); pRNG.SeedFixedIntegers(std::vector<int>({45,12,81,9}); GridParallelRNG pRNG(&Grid); pRNG.SeedFixedIntegers(std::vector<int>({45,12,81,9}));
LatticeColourMatrix z(&Grid); //random(pRNG,z); LatticeColourMatrix z(&Grid); random(pRNG,z);
LatticeColourMatrix x(&Grid); //random(pRNG,x); LatticeColourMatrix x(&Grid); random(pRNG,x);
LatticeColourMatrix y(&Grid); //random(pRNG,y); LatticeColourMatrix y(&Grid); random(pRNG,y);
double start=usecond(); double start=usecond();
for(int i=0;i<Nloop;i++){ for(int64_t i=0;i<Nloop;i++){
mult(z,x,y); mult(z,x,y);
} }
double stop=usecond(); double stop=usecond();
@ -148,17 +148,17 @@ int main (int argc, char ** argv)
for(int lat=2;lat<=LMAX;lat+=2){ for(int lat=2;lat<=LMAX;lat+=2){
std::vector<int> latt_size ({lat*mpi_layout[0],lat*mpi_layout[1],lat*mpi_layout[2],lat*mpi_layout[3]}); std::vector<int> latt_size ({lat*mpi_layout[0],lat*mpi_layout[1],lat*mpi_layout[2],lat*mpi_layout[3]});
int vol = latt_size[0]*latt_size[1]*latt_size[2]*latt_size[3]; int64_t vol = latt_size[0]*latt_size[1]*latt_size[2]*latt_size[3];
GridCartesian Grid(latt_size,simd_layout,mpi_layout); GridCartesian Grid(latt_size,simd_layout,mpi_layout);
// GridParallelRNG pRNG(&Grid); pRNG.SeedFixedIntegers(std::vector<int>({45,12,81,9}); GridParallelRNG pRNG(&Grid); pRNG.SeedFixedIntegers(std::vector<int>({45,12,81,9}));
LatticeColourMatrix z(&Grid); //random(pRNG,z); LatticeColourMatrix z(&Grid); random(pRNG,z);
LatticeColourMatrix x(&Grid); //random(pRNG,x); LatticeColourMatrix x(&Grid); random(pRNG,x);
LatticeColourMatrix y(&Grid); //random(pRNG,y); LatticeColourMatrix y(&Grid); random(pRNG,y);
double start=usecond(); double start=usecond();
for(int i=0;i<Nloop;i++){ for(int64_t i=0;i<Nloop;i++){
mac(z,x,y); mac(z,x,y);
} }
double stop=usecond(); double stop=usecond();

View File

@ -13,6 +13,10 @@ m4_ifdef([AM_SILENT_RULES], [AM_SILENT_RULES([yes])])
################ Get git info ################ Get git info
#AC_REVISION([m4_esyscmd_s([./scripts/configure.commit])]) #AC_REVISION([m4_esyscmd_s([./scripts/configure.commit])])
################ Set flags
# do not move!
CXXFLAGS="-O3 $CXXFLAGS"
############### Checks for programs ############### Checks for programs
AC_PROG_CXX AC_PROG_CXX
AC_PROG_RANLIB AC_PROG_RANLIB
@ -27,7 +31,6 @@ AX_GXX_VERSION
AC_DEFINE_UNQUOTED([GXX_VERSION],["$GXX_VERSION"], AC_DEFINE_UNQUOTED([GXX_VERSION],["$GXX_VERSION"],
[version of g++ that will compile the code]) [version of g++ that will compile the code])
CXXFLAGS="-g $CXXFLAGS"
############### Checks for typedefs, structures, and compiler characteristics ############### Checks for typedefs, structures, and compiler characteristics
@ -51,9 +54,14 @@ AC_CHECK_HEADERS(malloc/malloc.h)
AC_CHECK_HEADERS(malloc.h) AC_CHECK_HEADERS(malloc.h)
AC_CHECK_HEADERS(endian.h) AC_CHECK_HEADERS(endian.h)
AC_CHECK_HEADERS(execinfo.h) AC_CHECK_HEADERS(execinfo.h)
AC_CHECK_HEADERS(numaif.h)
AC_CHECK_DECLS([ntohll],[], [], [[#include <arpa/inet.h>]]) AC_CHECK_DECLS([ntohll],[], [], [[#include <arpa/inet.h>]])
AC_CHECK_DECLS([be64toh],[], [], [[#include <arpa/inet.h>]]) AC_CHECK_DECLS([be64toh],[], [], [[#include <arpa/inet.h>]])
############## Standard libraries
AC_CHECK_LIB([m],[cos])
AC_CHECK_LIB([stdc++],[abort])
############### GMP and MPFR ############### GMP and MPFR
AC_ARG_WITH([gmp], AC_ARG_WITH([gmp],
[AS_HELP_STRING([--with-gmp=prefix], [AS_HELP_STRING([--with-gmp=prefix],
@ -186,9 +194,14 @@ Info at: http://usqcd.jlab.org/usqcd-docs/c-lime/)])
AC_SEARCH_LIBS([crc32], [z], AC_SEARCH_LIBS([crc32], [z],
[AC_DEFINE([HAVE_ZLIB], [1], [Define to 1 if you have the `LIBZ' library])] [AC_DEFINE([HAVE_ZLIB], [1], [Define to 1 if you have the `LIBZ' library])]
[have_zlib=true], [have_zlib=true] [LIBS="${LIBS} -lz"],
[AC_MSG_ERROR(zlib library was not found in your system.)]) [AC_MSG_ERROR(zlib library was not found in your system.)])
AC_SEARCH_LIBS([move_pages], [numa],
[AC_DEFINE([HAVE_LIBNUMA], [1], [Define to 1 if you have the `LIBNUMA' library])]
[have_libnuma=true] [LIBS="${LIBS} -lnuma"],
[AC_MSG_WARN(libnuma library was not found in your system. Some optimisations will not apply)])
AC_SEARCH_LIBS([H5Fopen], [hdf5_cpp], AC_SEARCH_LIBS([H5Fopen], [hdf5_cpp],
[AC_DEFINE([HAVE_HDF5], [1], [Define to 1 if you have the `HDF5' library])] [AC_DEFINE([HAVE_HDF5], [1], [Define to 1 if you have the `HDF5' library])]
[have_hdf5=true] [have_hdf5=true]
@ -241,6 +254,7 @@ case ${ax_cv_cxx_compiler_vendor} in
SIMD_FLAGS='';; SIMD_FLAGS='';;
KNL) KNL)
AC_DEFINE([AVX512],[1],[AVX512 intrinsics]) AC_DEFINE([AVX512],[1],[AVX512 intrinsics])
AC_DEFINE([KNL],[1],[Knights landing processor])
SIMD_FLAGS='-march=knl';; SIMD_FLAGS='-march=knl';;
GEN) GEN)
AC_DEFINE([GEN],[1],[generic vector code]) AC_DEFINE([GEN],[1],[generic vector code])
@ -248,6 +262,9 @@ case ${ax_cv_cxx_compiler_vendor} in
[generic SIMD vector width (in bytes)]) [generic SIMD vector width (in bytes)])
SIMD_GEN_WIDTH_MSG=" (width= $ac_gen_simd_width)" SIMD_GEN_WIDTH_MSG=" (width= $ac_gen_simd_width)"
SIMD_FLAGS='';; SIMD_FLAGS='';;
NEONv8)
AC_DEFINE([NEONV8],[1],[ARMv8 NEON])
SIMD_FLAGS='-march=armv8-a';;
QPX|BGQ) QPX|BGQ)
AC_DEFINE([QPX],[1],[QPX intrinsics for BG/Q]) AC_DEFINE([QPX],[1],[QPX intrinsics for BG/Q])
SIMD_FLAGS='';; SIMD_FLAGS='';;
@ -276,6 +293,7 @@ case ${ax_cv_cxx_compiler_vendor} in
SIMD_FLAGS='';; SIMD_FLAGS='';;
KNL) KNL)
AC_DEFINE([AVX512],[1],[AVX512 intrinsics for Knights Landing]) AC_DEFINE([AVX512],[1],[AVX512 intrinsics for Knights Landing])
AC_DEFINE([KNL],[1],[Knights landing processor])
SIMD_FLAGS='-xmic-avx512';; SIMD_FLAGS='-xmic-avx512';;
GEN) GEN)
AC_DEFINE([GEN],[1],[generic vector code]) AC_DEFINE([GEN],[1],[generic vector code])

View File

@ -199,7 +199,12 @@ void BlockCGrQsolve(LinearOperatorBase<Field> &Linop, const Field &B, Field &X)
Linop.HermOp(X, AD); Linop.HermOp(X, AD);
tmp = B - AD; tmp = B - AD;
//std::cout << GridLogMessage << " initial tmp " << norm2(tmp)<< std::endl;
ThinQRfact (m_rr, m_C, m_Cinv, Q, tmp); ThinQRfact (m_rr, m_C, m_Cinv, Q, tmp);
//std::cout << GridLogMessage << " initial Q " << norm2(Q)<< std::endl;
//std::cout << GridLogMessage << " m_rr " << m_rr<<std::endl;
//std::cout << GridLogMessage << " m_C " << m_C<<std::endl;
//std::cout << GridLogMessage << " m_Cinv " << m_Cinv<<std::endl;
D=Q; D=Q;
std::cout << GridLogMessage<<"BlockCGrQ computed initial residual and QR fact " <<std::endl; std::cout << GridLogMessage<<"BlockCGrQ computed initial residual and QR fact " <<std::endl;
@ -221,13 +226,15 @@ void BlockCGrQsolve(LinearOperatorBase<Field> &Linop, const Field &B, Field &X)
MatrixTimer.Start(); MatrixTimer.Start();
Linop.HermOp(D, Z); Linop.HermOp(D, Z);
MatrixTimer.Stop(); MatrixTimer.Stop();
//std::cout << GridLogMessage << " norm2 Z " <<norm2(Z)<<std::endl;
//4. M = [D^dag Z]^{-1} //4. M = [D^dag Z]^{-1}
sliceInnerTimer.Start(); sliceInnerTimer.Start();
sliceInnerProductMatrix(m_DZ,D,Z,Orthog); sliceInnerProductMatrix(m_DZ,D,Z,Orthog);
sliceInnerTimer.Stop(); sliceInnerTimer.Stop();
m_M = m_DZ.inverse(); m_M = m_DZ.inverse();
//std::cout << GridLogMessage << " m_DZ " <<m_DZ<<std::endl;
//5. X = X + D MC //5. X = X + D MC
m_tmp = m_M * m_C; m_tmp = m_M * m_C;
sliceMaddTimer.Start(); sliceMaddTimer.Start();

View File

@ -11,7 +11,7 @@ int PointerCache::victim;
void *PointerCache::Insert(void *ptr,size_t bytes) { void *PointerCache::Insert(void *ptr,size_t bytes) {
if (bytes < 4096 ) return NULL; if (bytes < 4096 ) return ptr;
#ifdef GRID_OMP #ifdef GRID_OMP
assert(omp_in_parallel()==0); assert(omp_in_parallel()==0);

View File

@ -98,7 +98,14 @@ public:
#else #else
if ( ptr == (_Tp *) NULL ) ptr = (_Tp *) memalign(128,bytes); if ( ptr == (_Tp *) NULL ) ptr = (_Tp *) memalign(128,bytes);
#endif #endif
// First touch optimise in threaded loop
uint8_t *cp = (uint8_t *)ptr;
#ifdef GRID_OMP
#pragma omp parallel for
#endif
for(size_type n=0;n<bytes;n+=4096){
cp[n]=0;
}
return ptr; return ptr;
} }
@ -186,6 +193,13 @@ public:
#else #else
_Tp * ptr = (_Tp *) memalign(128,__n*sizeof(_Tp)); _Tp * ptr = (_Tp *) memalign(128,__n*sizeof(_Tp));
#endif #endif
size_type bytes = __n*sizeof(_Tp);
uint8_t *cp = (uint8_t *)ptr;
// One touch per 4k page, static OMP loop to catch same loop order
#pragma omp parallel for schedule(static)
for(size_type n=0;n<bytes;n+=4096){
cp[n]=0;
}
return ptr; return ptr;
} }
void deallocate(pointer __p, size_type) { void deallocate(pointer __p, size_type) {

View File

@ -185,17 +185,18 @@ public:
//////////////////////////////////////////////////////////////// ////////////////////////////////////////////////////////////////
void show_decomposition(){ void show_decomposition(){
std::cout << GridLogMessage << "Full Dimensions : " << _fdimensions << std::endl; std::cout << GridLogMessage << "\tFull Dimensions : " << _fdimensions << std::endl;
std::cout << GridLogMessage << "Global Dimensions : " << _gdimensions << std::endl; std::cout << GridLogMessage << "\tSIMD layout : " << _simd_layout << std::endl;
std::cout << GridLogMessage << "Local Dimensions : " << _ldimensions << std::endl; std::cout << GridLogMessage << "\tGlobal Dimensions : " << _gdimensions << std::endl;
std::cout << GridLogMessage << "Reduced Dimensions : " << _rdimensions << std::endl; std::cout << GridLogMessage << "\tLocal Dimensions : " << _ldimensions << std::endl;
std::cout << GridLogMessage << "Outer strides : " << _ostride << std::endl; std::cout << GridLogMessage << "\tReduced Dimensions : " << _rdimensions << std::endl;
std::cout << GridLogMessage << "Inner strides : " << _istride << std::endl; std::cout << GridLogMessage << "\tOuter strides : " << _ostride << std::endl;
std::cout << GridLogMessage << "iSites : " << _isites << std::endl; std::cout << GridLogMessage << "\tInner strides : " << _istride << std::endl;
std::cout << GridLogMessage << "oSites : " << _osites << std::endl; std::cout << GridLogMessage << "\tiSites : " << _isites << std::endl;
std::cout << GridLogMessage << "lSites : " << lSites() << std::endl; std::cout << GridLogMessage << "\toSites : " << _osites << std::endl;
std::cout << GridLogMessage << "gSites : " << gSites() << std::endl; std::cout << GridLogMessage << "\tlSites : " << lSites() << std::endl;
std::cout << GridLogMessage << "Nd : " << _ndimension << std::endl; std::cout << GridLogMessage << "\tgSites : " << gSites() << std::endl;
std::cout << GridLogMessage << "\tNd : " << _ndimension << std::endl;
} }
//////////////////////////////////////////////////////////////// ////////////////////////////////////////////////////////////////

View File

@ -62,77 +62,81 @@ public:
return shift; return shift;
} }
GridCartesian(const std::vector<int> &dimensions, GridCartesian(const std::vector<int> &dimensions,
const std::vector<int> &simd_layout, const std::vector<int> &simd_layout,
const std::vector<int> &processor_grid const std::vector<int> &processor_grid) : GridBase(processor_grid)
) : GridBase(processor_grid)
{ {
/////////////////////// ///////////////////////
// Grid information // Grid information
/////////////////////// ///////////////////////
_ndimension = dimensions.size(); _ndimension = dimensions.size();
_fdimensions.resize(_ndimension);
_gdimensions.resize(_ndimension);
_ldimensions.resize(_ndimension);
_rdimensions.resize(_ndimension);
_simd_layout.resize(_ndimension);
_lstart.resize(_ndimension);
_lend.resize(_ndimension);
_ostride.resize(_ndimension);
_istride.resize(_ndimension);
_fsites = _gsites = _osites = _isites = 1;
for(int d=0;d<_ndimension;d++){ _fdimensions.resize(_ndimension);
_fdimensions[d] = dimensions[d]; // Global dimensions _gdimensions.resize(_ndimension);
_gdimensions[d] = _fdimensions[d]; // Global dimensions _ldimensions.resize(_ndimension);
_simd_layout[d] = simd_layout[d]; _rdimensions.resize(_ndimension);
_fsites = _fsites * _fdimensions[d]; _simd_layout.resize(_ndimension);
_gsites = _gsites * _gdimensions[d]; _lstart.resize(_ndimension);
_lend.resize(_ndimension);
//FIXME check for exact division _ostride.resize(_ndimension);
_istride.resize(_ndimension);
// Use a reduced simd grid _fsites = _gsites = _osites = _isites = 1;
_ldimensions[d]= _gdimensions[d]/_processors[d]; //local dimensions
_rdimensions[d]= _ldimensions[d]/_simd_layout[d]; //overdecomposition for (int d = 0; d < _ndimension; d++)
_lstart[d] = _processor_coor[d]*_ldimensions[d]; {
_lend[d] = _processor_coor[d]*_ldimensions[d]+_ldimensions[d]-1; _fdimensions[d] = dimensions[d]; // Global dimensions
_osites *= _rdimensions[d]; _gdimensions[d] = _fdimensions[d]; // Global dimensions
_isites *= _simd_layout[d]; _simd_layout[d] = simd_layout[d];
_fsites = _fsites * _fdimensions[d];
// Addressing support _gsites = _gsites * _gdimensions[d];
if ( d==0 ) {
_ostride[d] = 1; // Use a reduced simd grid
_istride[d] = 1; _ldimensions[d] = _gdimensions[d] / _processors[d]; //local dimensions
} else { assert(_ldimensions[d] * _processors[d] == _gdimensions[d]);
_ostride[d] = _ostride[d-1]*_rdimensions[d-1];
_istride[d] = _istride[d-1]*_simd_layout[d-1]; _rdimensions[d] = _ldimensions[d] / _simd_layout[d]; //overdecomposition
} assert(_rdimensions[d] * _simd_layout[d] == _ldimensions[d]);
_lstart[d] = _processor_coor[d] * _ldimensions[d];
_lend[d] = _processor_coor[d] * _ldimensions[d] + _ldimensions[d] - 1;
_osites *= _rdimensions[d];
_isites *= _simd_layout[d];
// Addressing support
if (d == 0)
{
_ostride[d] = 1;
_istride[d] = 1;
} }
else
/////////////////////// {
// subplane information _ostride[d] = _ostride[d - 1] * _rdimensions[d - 1];
/////////////////////// _istride[d] = _istride[d - 1] * _simd_layout[d - 1];
_slice_block.resize(_ndimension);
_slice_stride.resize(_ndimension);
_slice_nblock.resize(_ndimension);
int block =1;
int nblock=1;
for(int d=0;d<_ndimension;d++) nblock*=_rdimensions[d];
for(int d=0;d<_ndimension;d++){
nblock/=_rdimensions[d];
_slice_block[d] =block;
_slice_stride[d]=_ostride[d]*_rdimensions[d];
_slice_nblock[d]=nblock;
block = block*_rdimensions[d];
} }
}
///////////////////////
// subplane information
///////////////////////
_slice_block.resize(_ndimension);
_slice_stride.resize(_ndimension);
_slice_nblock.resize(_ndimension);
int block = 1;
int nblock = 1;
for (int d = 0; d < _ndimension; d++)
nblock *= _rdimensions[d];
for (int d = 0; d < _ndimension; d++)
{
nblock /= _rdimensions[d];
_slice_block[d] = block;
_slice_stride[d] = _ostride[d] * _rdimensions[d];
_slice_nblock[d] = nblock;
block = block * _rdimensions[d];
}
}; };
}; };
} }
#endif #endif

View File

@ -131,21 +131,21 @@ public:
Init(dimensions,simd_layout,processor_grid,checker_dim_mask,0); Init(dimensions,simd_layout,processor_grid,checker_dim_mask,0);
} }
void Init(const std::vector<int> &dimensions, void Init(const std::vector<int> &dimensions,
const std::vector<int> &simd_layout, const std::vector<int> &simd_layout,
const std::vector<int> &processor_grid, const std::vector<int> &processor_grid,
const std::vector<int> &checker_dim_mask, const std::vector<int> &checker_dim_mask,
int checker_dim) int checker_dim)
{ {
/////////////////////// ///////////////////////
// Grid information // Grid information
/////////////////////// ///////////////////////
_checker_dim = checker_dim; _checker_dim = checker_dim;
assert(checker_dim_mask[checker_dim]==1); assert(checker_dim_mask[checker_dim] == 1);
_ndimension = dimensions.size(); _ndimension = dimensions.size();
assert(checker_dim_mask.size()==_ndimension); assert(checker_dim_mask.size() == _ndimension);
assert(processor_grid.size()==_ndimension); assert(processor_grid.size() == _ndimension);
assert(simd_layout.size()==_ndimension); assert(simd_layout.size() == _ndimension);
_fdimensions.resize(_ndimension); _fdimensions.resize(_ndimension);
_gdimensions.resize(_ndimension); _gdimensions.resize(_ndimension);
_ldimensions.resize(_ndimension); _ldimensions.resize(_ndimension);
@ -153,114 +153,133 @@ public:
_simd_layout.resize(_ndimension); _simd_layout.resize(_ndimension);
_lstart.resize(_ndimension); _lstart.resize(_ndimension);
_lend.resize(_ndimension); _lend.resize(_ndimension);
_ostride.resize(_ndimension); _ostride.resize(_ndimension);
_istride.resize(_ndimension); _istride.resize(_ndimension);
_fsites = _gsites = _osites = _isites = 1; _fsites = _gsites = _osites = _isites = 1;
_checker_dim_mask=checker_dim_mask;
for(int d=0;d<_ndimension;d++){ _checker_dim_mask = checker_dim_mask;
_fdimensions[d] = dimensions[d];
_gdimensions[d] = _fdimensions[d];
_fsites = _fsites * _fdimensions[d];
_gsites = _gsites * _gdimensions[d];
if (d==_checker_dim) {
_gdimensions[d] = _gdimensions[d]/2; // Remove a checkerboard
}
_ldimensions[d] = _gdimensions[d]/_processors[d];
_lstart[d] = _processor_coor[d]*_ldimensions[d];
_lend[d] = _processor_coor[d]*_ldimensions[d]+_ldimensions[d]-1;
// Use a reduced simd grid for (int d = 0; d < _ndimension; d++)
_simd_layout[d] = simd_layout[d]; {
_rdimensions[d]= _ldimensions[d]/_simd_layout[d]; _fdimensions[d] = dimensions[d];
assert(_rdimensions[d]>0); _gdimensions[d] = _fdimensions[d];
_fsites = _fsites * _fdimensions[d];
_gsites = _gsites * _gdimensions[d];
// all elements of a simd vector must have same checkerboard. if (d == _checker_dim)
// If Ls vectorised, this must still be the case; e.g. dwf rb5d {
if ( _simd_layout[d]>1 ) { assert((_gdimensions[d] & 0x1) == 0);
if ( checker_dim_mask[d] ) { _gdimensions[d] = _gdimensions[d] / 2; // Remove a checkerboard
assert( (_rdimensions[d]&0x1) == 0 ); }
} _ldimensions[d] = _gdimensions[d] / _processors[d];
} assert(_ldimensions[d] * _processors[d] == _gdimensions[d]);
_lstart[d] = _processor_coor[d] * _ldimensions[d];
_lend[d] = _processor_coor[d] * _ldimensions[d] + _ldimensions[d] - 1;
_osites *= _rdimensions[d]; // Use a reduced simd grid
_isites *= _simd_layout[d]; _simd_layout[d] = simd_layout[d];
_rdimensions[d] = _ldimensions[d] / _simd_layout[d]; // this is not checking if this is integer
// Addressing support assert(_rdimensions[d] * _simd_layout[d] == _ldimensions[d]);
if ( d==0 ) { assert(_rdimensions[d] > 0);
_ostride[d] = 1;
_istride[d] = 1;
} else {
_ostride[d] = _ostride[d-1]*_rdimensions[d-1];
_istride[d] = _istride[d-1]*_simd_layout[d-1];
}
// all elements of a simd vector must have same checkerboard.
// If Ls vectorised, this must still be the case; e.g. dwf rb5d
if (_simd_layout[d] > 1)
{
if (checker_dim_mask[d])
{
assert((_rdimensions[d] & 0x1) == 0);
}
}
_osites *= _rdimensions[d];
_isites *= _simd_layout[d];
// Addressing support
if (d == 0)
{
_ostride[d] = 1;
_istride[d] = 1;
}
else
{
_ostride[d] = _ostride[d - 1] * _rdimensions[d - 1];
_istride[d] = _istride[d - 1] * _simd_layout[d - 1];
}
} }
//////////////////////////////////////////////////////////////////////////////////////////// ////////////////////////////////////////////////////////////////////////////////////////////
// subplane information // subplane information
//////////////////////////////////////////////////////////////////////////////////////////// ////////////////////////////////////////////////////////////////////////////////////////////
_slice_block.resize(_ndimension); _slice_block.resize(_ndimension);
_slice_stride.resize(_ndimension); _slice_stride.resize(_ndimension);
_slice_nblock.resize(_ndimension); _slice_nblock.resize(_ndimension);
int block =1; int block = 1;
int nblock=1; int nblock = 1;
for(int d=0;d<_ndimension;d++) nblock*=_rdimensions[d]; for (int d = 0; d < _ndimension; d++)
nblock *= _rdimensions[d];
for(int d=0;d<_ndimension;d++){
nblock/=_rdimensions[d]; for (int d = 0; d < _ndimension; d++)
_slice_block[d] =block; {
_slice_stride[d]=_ostride[d]*_rdimensions[d]; nblock /= _rdimensions[d];
_slice_nblock[d]=nblock; _slice_block[d] = block;
block = block*_rdimensions[d]; _slice_stride[d] = _ostride[d] * _rdimensions[d];
_slice_nblock[d] = nblock;
block = block * _rdimensions[d];
} }
//////////////////////////////////////////////// ////////////////////////////////////////////////
// Create a checkerboard lookup table // Create a checkerboard lookup table
//////////////////////////////////////////////// ////////////////////////////////////////////////
int rvol = 1; int rvol = 1;
for(int d=0;d<_ndimension;d++){ for (int d = 0; d < _ndimension; d++)
rvol=rvol * _rdimensions[d]; {
rvol = rvol * _rdimensions[d];
} }
_checker_board.resize(rvol); _checker_board.resize(rvol);
for(int osite=0;osite<_osites;osite++){ for (int osite = 0; osite < _osites; osite++)
_checker_board[osite] = CheckerBoardFromOindex (osite); {
_checker_board[osite] = CheckerBoardFromOindex(osite);
} }
}; };
protected:
protected:
virtual int oIndex(std::vector<int> &coor) virtual int oIndex(std::vector<int> &coor)
{ {
int idx=0; int idx = 0;
for(int d=0;d<_ndimension;d++) { for (int d = 0; d < _ndimension; d++)
if( d==_checker_dim ) { {
idx+=_ostride[d]*((coor[d]/2)%_rdimensions[d]); if (d == _checker_dim)
} else { {
idx+=_ostride[d]*(coor[d]%_rdimensions[d]); idx += _ostride[d] * ((coor[d] / 2) % _rdimensions[d]);
} }
else
{
idx += _ostride[d] * (coor[d] % _rdimensions[d]);
}
} }
return idx; return idx;
}; };
virtual int iIndex(std::vector<int> &lcoor) virtual int iIndex(std::vector<int> &lcoor)
{ {
int idx=0; int idx = 0;
for(int d=0;d<_ndimension;d++) { for (int d = 0; d < _ndimension; d++)
if( d==_checker_dim ) { {
idx+=_istride[d]*(lcoor[d]/(2*_rdimensions[d])); if (d == _checker_dim)
} else { {
idx+=_istride[d]*(lcoor[d]/_rdimensions[d]); idx += _istride[d] * (lcoor[d] / (2 * _rdimensions[d]));
} }
} else
return idx; {
idx += _istride[d] * (lcoor[d] / _rdimensions[d]);
}
}
return idx;
} }
}; };
} }
#endif #endif

View File

@ -37,7 +37,10 @@ Author: Peter Boyle <paboyle@ph.ed.ac.uk>
#include <sys/ipc.h> #include <sys/ipc.h>
#include <sys/shm.h> #include <sys/shm.h>
#include <sys/mman.h> #include <sys/mman.h>
//#include <zlib.h> #include <zlib.h>
#ifdef HAVE_NUMAIF_H
#include <numaif.h>
#endif
#ifndef SHM_HUGETLB #ifndef SHM_HUGETLB
#define SHM_HUGETLB 04000 #define SHM_HUGETLB 04000
#endif #endif
@ -214,6 +217,25 @@ void CartesianCommunicator::Init(int *argc, char ***argv) {
void * ptr = mmap(NULL,size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0); void * ptr = mmap(NULL,size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
if ( ptr == MAP_FAILED ) { perror("failed mmap"); assert(0); } if ( ptr == MAP_FAILED ) { perror("failed mmap"); assert(0); }
assert(((uint64_t)ptr&0x3F)==0); assert(((uint64_t)ptr&0x3F)==0);
// Try to force numa domain on the shm segment if we have numaif.h
#ifdef HAVE_NUMAIF_H
int status;
int flags=MPOL_MF_MOVE;
#ifdef KNL
int nodes=1; // numa domain == MCDRAM
// Find out if in SNC2,SNC4 mode ?
#else
int nodes=r; // numa domain == MPI ID
#endif
unsigned long count=1;
for(uint64_t page=0;page<size;page+=4096){
void *pages = (void *) ( page + (uint64_t)ptr );
uint64_t *cow_it = (uint64_t *)pages; *cow_it = 1;
ierr= move_pages(0,count, &pages,&nodes,&status,flags);
if (ierr && (page==0)) perror("numa relocate command failed");
}
#endif
ShmCommBufs[r] =ptr; ShmCommBufs[r] =ptr;
} }

View File

@ -369,6 +369,7 @@ static void sliceMaddVector(Lattice<vobj> &R,std::vector<RealD> &a,const Lattice
} }
}; };
/*
inline GridBase *makeSubSliceGrid(const GridBase *BlockSolverGrid,int Orthog) inline GridBase *makeSubSliceGrid(const GridBase *BlockSolverGrid,int Orthog)
{ {
int NN = BlockSolverGrid->_ndimension; int NN = BlockSolverGrid->_ndimension;
@ -387,6 +388,7 @@ inline GridBase *makeSubSliceGrid(const GridBase *BlockSolverGrid,int Or
} }
return (GridBase *)new GridCartesian(latt_phys,simd_phys,mpi_phys); return (GridBase *)new GridCartesian(latt_phys,simd_phys,mpi_phys);
} }
*/
template<class vobj> template<class vobj>
static void sliceMaddMatrix (Lattice<vobj> &R,Eigen::MatrixXcd &aa,const Lattice<vobj> &X,const Lattice<vobj> &Y,int Orthog,RealD scale=1.0) static void sliceMaddMatrix (Lattice<vobj> &R,Eigen::MatrixXcd &aa,const Lattice<vobj> &X,const Lattice<vobj> &Y,int Orthog,RealD scale=1.0)
@ -398,14 +400,15 @@ static void sliceMaddMatrix (Lattice<vobj> &R,Eigen::MatrixXcd &aa,const Lattice
int Nblock = X._grid->GlobalDimensions()[Orthog]; int Nblock = X._grid->GlobalDimensions()[Orthog];
GridBase *FullGrid = X._grid; GridBase *FullGrid = X._grid;
GridBase *SliceGrid = makeSubSliceGrid(FullGrid,Orthog); // GridBase *SliceGrid = makeSubSliceGrid(FullGrid,Orthog);
Lattice<vobj> Xslice(SliceGrid); // Lattice<vobj> Xslice(SliceGrid);
Lattice<vobj> Rslice(SliceGrid); // Lattice<vobj> Rslice(SliceGrid);
assert( FullGrid->_simd_layout[Orthog]==1); assert( FullGrid->_simd_layout[Orthog]==1);
int nh = FullGrid->_ndimension; int nh = FullGrid->_ndimension;
int nl = SliceGrid->_ndimension; // int nl = SliceGrid->_ndimension;
int nl = nh-1;
//FIXME package in a convenient iterator //FIXME package in a convenient iterator
//Should loop over a plane orthogonal to direction "Orthog" //Should loop over a plane orthogonal to direction "Orthog"
@ -448,14 +451,14 @@ static void sliceMulMatrix (Lattice<vobj> &R,Eigen::MatrixXcd &aa,const Lattice<
int Nblock = X._grid->GlobalDimensions()[Orthog]; int Nblock = X._grid->GlobalDimensions()[Orthog];
GridBase *FullGrid = X._grid; GridBase *FullGrid = X._grid;
GridBase *SliceGrid = makeSubSliceGrid(FullGrid,Orthog); // GridBase *SliceGrid = makeSubSliceGrid(FullGrid,Orthog);
// Lattice<vobj> Xslice(SliceGrid);
Lattice<vobj> Xslice(SliceGrid); // Lattice<vobj> Rslice(SliceGrid);
Lattice<vobj> Rslice(SliceGrid);
assert( FullGrid->_simd_layout[Orthog]==1); assert( FullGrid->_simd_layout[Orthog]==1);
int nh = FullGrid->_ndimension; int nh = FullGrid->_ndimension;
int nl = SliceGrid->_ndimension; // int nl = SliceGrid->_ndimension;
int nl=1;
//FIXME package in a convenient iterator //FIXME package in a convenient iterator
//Should loop over a plane orthogonal to direction "Orthog" //Should loop over a plane orthogonal to direction "Orthog"
@ -498,18 +501,19 @@ static void sliceInnerProductMatrix( Eigen::MatrixXcd &mat, const Lattice<vobj>
typedef typename vobj::vector_type vector_type; typedef typename vobj::vector_type vector_type;
GridBase *FullGrid = lhs._grid; GridBase *FullGrid = lhs._grid;
GridBase *SliceGrid = makeSubSliceGrid(FullGrid,Orthog); // GridBase *SliceGrid = makeSubSliceGrid(FullGrid,Orthog);
int Nblock = FullGrid->GlobalDimensions()[Orthog]; int Nblock = FullGrid->GlobalDimensions()[Orthog];
Lattice<vobj> Lslice(SliceGrid); // Lattice<vobj> Lslice(SliceGrid);
Lattice<vobj> Rslice(SliceGrid); // Lattice<vobj> Rslice(SliceGrid);
mat = Eigen::MatrixXcd::Zero(Nblock,Nblock); mat = Eigen::MatrixXcd::Zero(Nblock,Nblock);
assert( FullGrid->_simd_layout[Orthog]==1); assert( FullGrid->_simd_layout[Orthog]==1);
int nh = FullGrid->_ndimension; int nh = FullGrid->_ndimension;
int nl = SliceGrid->_ndimension; // int nl = SliceGrid->_ndimension;
int nl = nh-1;
//FIXME package in a convenient iterator //FIXME package in a convenient iterator
//Should loop over a plane orthogonal to direction "Orthog" //Should loop over a plane orthogonal to direction "Orthog"
@ -540,7 +544,8 @@ static void sliceInnerProductMatrix( Eigen::MatrixXcd &mat, const Lattice<vobj>
for(int i=0;i<Nblock;i++){ for(int i=0;i<Nblock;i++){
for(int j=0;j<Nblock;j++){ for(int j=0;j<Nblock;j++){
auto tmp = innerProduct(Left[i],Right[j]); auto tmp = innerProduct(Left[i],Right[j]);
vector_typeD rtmp = TensorRemove(tmp); // vector_typeD rtmp = TensorRemove(tmp);
auto rtmp = TensorRemove(tmp);
mat_thread(i,j) += Reduce(rtmp); mat_thread(i,j) += Reduce(rtmp);
}} }}
}} }}
@ -549,6 +554,14 @@ static void sliceInnerProductMatrix( Eigen::MatrixXcd &mat, const Lattice<vobj>
mat += mat_thread; mat += mat_thread;
} }
} }
for(int i=0;i<Nblock;i++){
for(int j=0;j<Nblock;j++){
ComplexD sum = mat(i,j);
FullGrid->GlobalSum(sum);
mat(i,j)=sum;
}}
return; return;
} }

View File

@ -98,35 +98,39 @@ class BinaryIO {
NerscChecksum(grid,scalardata,nersc_csum); NerscChecksum(grid,scalardata,nersc_csum);
} }
template<class fobj> static inline void NerscChecksum(GridBase *grid,std::vector<fobj> &fbuf,uint32_t &nersc_csum) template <class fobj>
static inline void NerscChecksum(GridBase *grid, std::vector<fobj> &fbuf, uint32_t &nersc_csum)
{ {
const uint64_t size32 = sizeof(fobj)/sizeof(uint32_t); const uint64_t size32 = sizeof(fobj) / sizeof(uint32_t);
uint64_t lsites = grid->lSites();
uint64_t lsites =grid->lSites(); if (fbuf.size() == 1)
if (fbuf.size()==1) { {
lsites=1; lsites = 1;
} }
#pragma omp parallel #pragma omp parallel
{ {
uint32_t nersc_csum_thr=0; uint32_t nersc_csum_thr = 0;
#pragma omp for #pragma omp for
for(uint64_t local_site=0;local_site<lsites;local_site++){ for (uint64_t local_site = 0; local_site < lsites; local_site++)
uint32_t * site_buf = (uint32_t *)&fbuf[local_site]; {
for(uint64_t j=0;j<size32;j++){ uint32_t *site_buf = (uint32_t *)&fbuf[local_site];
nersc_csum_thr=nersc_csum_thr+site_buf[j]; for (uint64_t j = 0; j < size32; j++)
} {
nersc_csum_thr = nersc_csum_thr + site_buf[j];
}
} }
#pragma omp critical #pragma omp critical
{ {
nersc_csum += nersc_csum_thr; nersc_csum += nersc_csum_thr;
} }
} }
} }
template<class fobj> static inline void ScidacChecksum(GridBase *grid,std::vector<fobj> &fbuf,uint32_t &scidac_csuma,uint32_t &scidac_csumb) template<class fobj> static inline void ScidacChecksum(GridBase *grid,std::vector<fobj> &fbuf,uint32_t &scidac_csuma,uint32_t &scidac_csumb)
{ {
const uint64_t size32 = sizeof(fobj)/sizeof(uint32_t); const uint64_t size32 = sizeof(fobj)/sizeof(uint32_t);
@ -266,7 +270,7 @@ class BinaryIO {
grid->Barrier(); grid->Barrier();
GridStopWatch timer; GridStopWatch timer;
GridStopWatch bstimer; GridStopWatch bstimer;
nersc_csum=0; nersc_csum=0;
scidac_csuma=0; scidac_csuma=0;
scidac_csumb=0; scidac_csumb=0;
@ -362,18 +366,22 @@ class BinaryIO {
#else #else
assert(0); assert(0);
#endif #endif
} else { } else {
std::cout<< GridLogMessage<< "C++ read I/O "<< file<<" : " std::cout << GridLogMessage << "C++ read I/O " << file << " : "
<< iodata.size()*sizeof(fobj)<<" bytes"<<std::endl; << iodata.size() * sizeof(fobj) << " bytes" << std::endl;
std::ifstream fin; std::ifstream fin;
fin.open(file,std::ios::binary|std::ios::in); fin.open(file, std::ios::binary | std::ios::in);
if ( control & BINARYIO_MASTER_APPEND ) { if (control & BINARYIO_MASTER_APPEND)
fin.seekg(-sizeof(fobj),fin.end); {
} else { fin.seekg(-sizeof(fobj), fin.end);
fin.seekg(offset+myrank*lsites*sizeof(fobj)); }
} else
fin.read((char *)&iodata[0],iodata.size()*sizeof(fobj));assert( fin.fail()==0); {
fin.close(); fin.seekg(offset + myrank * lsites * sizeof(fobj));
}
fin.read((char *)&iodata[0], iodata.size() * sizeof(fobj));
assert(fin.fail() == 0);
fin.close();
} }
timer.Stop(); timer.Stop();
@ -405,30 +413,78 @@ class BinaryIO {
timer.Start(); timer.Start();
if ( (control & BINARYIO_LEXICOGRAPHIC) && (nrank > 1) ) { if ( (control & BINARYIO_LEXICOGRAPHIC) && (nrank > 1) ) {
#ifdef USE_MPI_IO #ifdef USE_MPI_IO
std::cout<< GridLogMessage<< "MPI write I/O "<< file<< std::endl; std::cout << GridLogMessage << "MPI write I/O " << file << std::endl;
ierr=MPI_File_open(grid->communicator,(char *) file.c_str(), MPI_MODE_RDWR|MPI_MODE_CREATE,MPI_INFO_NULL, &fh); assert(ierr==0); ierr = MPI_File_open(grid->communicator, (char *)file.c_str(), MPI_MODE_RDWR | MPI_MODE_CREATE, MPI_INFO_NULL, &fh);
ierr=MPI_File_set_view(fh, disp, mpiObject, fileArray, "native", MPI_INFO_NULL); assert(ierr==0); std::cout << GridLogMessage << "Checking for errors" << std::endl;
ierr=MPI_File_write_all(fh, &iodata[0], 1, localArray, &status); assert(ierr==0); if (ierr != MPI_SUCCESS)
MPI_File_close(&fh); {
MPI_Type_free(&fileArray); char error_string[BUFSIZ];
MPI_Type_free(&localArray); int length_of_error_string, error_class;
MPI_Error_class(ierr, &error_class);
MPI_Error_string(error_class, error_string, &length_of_error_string);
fprintf(stderr, "%3d: %s\n", myrank, error_string);
MPI_Error_string(ierr, error_string, &length_of_error_string);
fprintf(stderr, "%3d: %s\n", myrank, error_string);
MPI_Abort(MPI_COMM_WORLD, 1); //assert(ierr == 0);
}
std::cout << GridLogDebug << "MPI read I/O set view " << file << std::endl;
ierr = MPI_File_set_view(fh, disp, mpiObject, fileArray, "native", MPI_INFO_NULL);
assert(ierr == 0);
std::cout << GridLogDebug << "MPI read I/O write all " << file << std::endl;
ierr = MPI_File_write_all(fh, &iodata[0], 1, localArray, &status);
assert(ierr == 0);
MPI_File_close(&fh);
MPI_Type_free(&fileArray);
MPI_Type_free(&localArray);
#else #else
assert(0); assert(0);
#endif #endif
} else { } else {
std::ofstream fout; fout.open(file,std::ios::binary|std::ios::out|std::ios::in);
std::cout<< GridLogMessage<< "C++ write I/O "<< file<<" : " std::ofstream fout;
<< iodata.size()*sizeof(fobj)<<" bytes"<<std::endl; fout.exceptions ( std::fstream::failbit | std::fstream::badbit );
if ( control & BINARYIO_MASTER_APPEND ) { try {
fout.open(file,std::ios::binary|std::ios::out|std::ios::in);
} catch (const std::fstream::failure& exc) {
std::cout << GridLogError << "Error in opening the file " << file << " for output" <<std::endl;
std::cout << GridLogError << "Exception description: " << exc.what() << std::endl;
std::cout << GridLogError << "Probable cause: wrong path, inaccessible location "<< std::endl;
#ifdef USE_MPI_IO
MPI_Abort(MPI_COMM_WORLD,1);
#else
exit(1);
#endif
}
std::cout << GridLogMessage<< "C++ write I/O "<< file<<" : "
<< iodata.size()*sizeof(fobj)<<" bytes"<<std::endl;
if ( control & BINARYIO_MASTER_APPEND ) {
fout.seekp(0,fout.end); fout.seekp(0,fout.end);
} else { } else {
fout.seekp(offset+myrank*lsites*sizeof(fobj)); fout.seekp(offset+myrank*lsites*sizeof(fobj));
} }
fout.write((char *)&iodata[0],iodata.size()*sizeof(fobj));assert( fout.fail()==0);
try {
fout.write((char *)&iodata[0],iodata.size()*sizeof(fobj));//assert( fout.fail()==0);
}
catch (const std::fstream::failure& exc) {
std::cout << "Exception in writing file " << file << std::endl;
std::cout << GridLogError << "Exception description: "<< exc.what() << std::endl;
#ifdef USE_MPI_IO
MPI_Abort(MPI_COMM_WORLD,1);
#else
exit(1);
#endif
}
fout.close(); fout.close();
} }
timer.Stop(); timer.Stop();
} }
std::cout<<GridLogMessage<<"IOobject: "; std::cout<<GridLogMessage<<"IOobject: ";
if ( control & BINARYIO_READ) std::cout << " read "; if ( control & BINARYIO_READ) std::cout << " read ";
@ -442,11 +498,14 @@ class BinaryIO {
////////////////////////////////////////////////////////////////////////////// //////////////////////////////////////////////////////////////////////////////
// Safety check // Safety check
////////////////////////////////////////////////////////////////////////////// //////////////////////////////////////////////////////////////////////////////
grid->Barrier(); // if the data size is 1 we do not want to sum over the MPI ranks
grid->GlobalSum(nersc_csum); if (iodata.size() != 1){
grid->GlobalXOR(scidac_csuma); grid->Barrier();
grid->GlobalXOR(scidac_csumb); grid->GlobalSum(nersc_csum);
grid->Barrier(); grid->GlobalXOR(scidac_csuma);
grid->GlobalXOR(scidac_csumb);
grid->Barrier();
}
} }
///////////////////////////////////////////////////////////////////////////// /////////////////////////////////////////////////////////////////////////////
@ -546,9 +605,9 @@ class BinaryIO {
int gsites = grid->gSites(); int gsites = grid->gSites();
int lsites = grid->lSites(); int lsites = grid->lSites();
uint32_t nersc_csum_tmp; uint32_t nersc_csum_tmp = 0;
uint32_t scidac_csuma_tmp; uint32_t scidac_csuma_tmp = 0;
uint32_t scidac_csumb_tmp; uint32_t scidac_csumb_tmp = 0;
GridStopWatch timer; GridStopWatch timer;

View File

@ -40,7 +40,7 @@ const PerformanceCounter::PerformanceCounterConfig PerformanceCounter::Performan
{ PERF_TYPE_HARDWARE, PERF_COUNT_HW_CPU_CYCLES , "CPUCYCLES.........." , INSTRUCTIONS}, { PERF_TYPE_HARDWARE, PERF_COUNT_HW_CPU_CYCLES , "CPUCYCLES.........." , INSTRUCTIONS},
{ PERF_TYPE_HARDWARE, PERF_COUNT_HW_INSTRUCTIONS , "INSTRUCTIONS......." , CPUCYCLES }, { PERF_TYPE_HARDWARE, PERF_COUNT_HW_INSTRUCTIONS , "INSTRUCTIONS......." , CPUCYCLES },
// 4 // 4
#ifdef AVX512 #ifdef KNL
{ PERF_TYPE_RAW, RawConfig(0x40,0x04), "ALL_LOADS..........", CPUCYCLES }, { PERF_TYPE_RAW, RawConfig(0x40,0x04), "ALL_LOADS..........", CPUCYCLES },
{ PERF_TYPE_RAW, RawConfig(0x01,0x04), "L1_MISS_LOADS......", L1D_READ_ACCESS }, { PERF_TYPE_RAW, RawConfig(0x01,0x04), "L1_MISS_LOADS......", L1D_READ_ACCESS },
{ PERF_TYPE_RAW, RawConfig(0x40,0x04), "ALL_LOADS..........", L1D_READ_ACCESS }, { PERF_TYPE_RAW, RawConfig(0x40,0x04), "ALL_LOADS..........", L1D_READ_ACCESS },

View File

@ -230,8 +230,15 @@ void ImprovedStaggeredFermion5D<Impl>::DhopInternal(StencilImpl & st, LebesgueOr
{ {
Compressor compressor; Compressor compressor;
int LLs = in._grid->_rdimensions[0]; int LLs = in._grid->_rdimensions[0];
DhopTotalTime -= usecond();
DhopCommTime -= usecond();
st.HaloExchange(in,compressor); st.HaloExchange(in,compressor);
DhopCommTime += usecond();
DhopComputeTime -= usecond();
// Dhop takes the 4d grid from U, and makes a 5d index for fermion // Dhop takes the 4d grid from U, and makes a 5d index for fermion
if (dag == DaggerYes) { if (dag == DaggerYes) {
parallel_for (int ss = 0; ss < U._grid->oSites(); ss++) { parallel_for (int ss = 0; ss < U._grid->oSites(); ss++) {
@ -244,12 +251,15 @@ void ImprovedStaggeredFermion5D<Impl>::DhopInternal(StencilImpl & st, LebesgueOr
Kernels::DhopSite(st,lo,U,UUU,st.CommBuf(),LLs,sU,in,out); Kernels::DhopSite(st,lo,U,UUU,st.CommBuf(),LLs,sU,in,out);
} }
} }
DhopComputeTime += usecond();
DhopTotalTime += usecond();
} }
template<class Impl> template<class Impl>
void ImprovedStaggeredFermion5D<Impl>::DhopOE(const FermionField &in, FermionField &out,int dag) void ImprovedStaggeredFermion5D<Impl>::DhopOE(const FermionField &in, FermionField &out,int dag)
{ {
DhopCalls+=1;
conformable(in._grid,FermionRedBlackGrid()); // verifies half grid conformable(in._grid,FermionRedBlackGrid()); // verifies half grid
conformable(in._grid,out._grid); // drops the cb check conformable(in._grid,out._grid); // drops the cb check
@ -261,6 +271,7 @@ void ImprovedStaggeredFermion5D<Impl>::DhopOE(const FermionField &in, FermionFie
template<class Impl> template<class Impl>
void ImprovedStaggeredFermion5D<Impl>::DhopEO(const FermionField &in, FermionField &out,int dag) void ImprovedStaggeredFermion5D<Impl>::DhopEO(const FermionField &in, FermionField &out,int dag)
{ {
DhopCalls+=1;
conformable(in._grid,FermionRedBlackGrid()); // verifies half grid conformable(in._grid,FermionRedBlackGrid()); // verifies half grid
conformable(in._grid,out._grid); // drops the cb check conformable(in._grid,out._grid); // drops the cb check
@ -272,6 +283,7 @@ void ImprovedStaggeredFermion5D<Impl>::DhopEO(const FermionField &in, FermionFie
template<class Impl> template<class Impl>
void ImprovedStaggeredFermion5D<Impl>::Dhop(const FermionField &in, FermionField &out,int dag) void ImprovedStaggeredFermion5D<Impl>::Dhop(const FermionField &in, FermionField &out,int dag)
{ {
DhopCalls+=2;
conformable(in._grid,FermionGrid()); // verifies full grid conformable(in._grid,FermionGrid()); // verifies full grid
conformable(in._grid,out._grid); conformable(in._grid,out._grid);
@ -280,6 +292,54 @@ void ImprovedStaggeredFermion5D<Impl>::Dhop(const FermionField &in, FermionField
DhopInternal(Stencil,Lebesgue,Umu,UUUmu,in,out,dag); DhopInternal(Stencil,Lebesgue,Umu,UUUmu,in,out,dag);
} }
template<class Impl>
void ImprovedStaggeredFermion5D<Impl>::Report(void)
{
std::vector<int> latt = GridDefaultLatt();
RealD volume = Ls; for(int mu=0;mu<Nd;mu++) volume=volume*latt[mu];
RealD NP = _FourDimGrid->_Nprocessors;
RealD NN = _FourDimGrid->NodeCount();
std::cout << GridLogMessage << "#### Dhop calls report " << std::endl;
std::cout << GridLogMessage << "ImprovedStaggeredFermion5D Number of DhopEO Calls : "
<< DhopCalls << std::endl;
std::cout << GridLogMessage << "ImprovedStaggeredFermion5D TotalTime /Calls : "
<< DhopTotalTime / DhopCalls << " us" << std::endl;
std::cout << GridLogMessage << "ImprovedStaggeredFermion5D CommTime /Calls : "
<< DhopCommTime / DhopCalls << " us" << std::endl;
std::cout << GridLogMessage << "ImprovedStaggeredFermion5D ComputeTime/Calls : "
<< DhopComputeTime / DhopCalls << " us" << std::endl;
// Average the compute time
_FourDimGrid->GlobalSum(DhopComputeTime);
DhopComputeTime/=NP;
RealD mflops = 1154*volume*DhopCalls/DhopComputeTime/2; // 2 for red black counting
std::cout << GridLogMessage << "Average mflops/s per call : " << mflops << std::endl;
std::cout << GridLogMessage << "Average mflops/s per call per rank : " << mflops/NP << std::endl;
std::cout << GridLogMessage << "Average mflops/s per call per node : " << mflops/NN << std::endl;
RealD Fullmflops = 1154*volume*DhopCalls/(DhopTotalTime)/2; // 2 for red black counting
std::cout << GridLogMessage << "Average mflops/s per call (full) : " << Fullmflops << std::endl;
std::cout << GridLogMessage << "Average mflops/s per call per rank (full): " << Fullmflops/NP << std::endl;
std::cout << GridLogMessage << "Average mflops/s per call per node (full): " << Fullmflops/NN << std::endl;
std::cout << GridLogMessage << "ImprovedStaggeredFermion5D Stencil" <<std::endl; Stencil.Report();
std::cout << GridLogMessage << "ImprovedStaggeredFermion5D StencilEven"<<std::endl; StencilEven.Report();
std::cout << GridLogMessage << "ImprovedStaggeredFermion5D StencilOdd" <<std::endl; StencilOdd.Report();
}
template<class Impl>
void ImprovedStaggeredFermion5D<Impl>::ZeroCounters(void)
{
DhopCalls = 0;
DhopTotalTime = 0;
DhopCommTime = 0;
DhopComputeTime = 0;
Stencil.ZeroCounters();
StencilEven.ZeroCounters();
StencilOdd.ZeroCounters();
}
///////////////////////////////////////////////////////////////////////// /////////////////////////////////////////////////////////////////////////
// Implement the general interface. Here we use SAME mass on all slices // Implement the general interface. Here we use SAME mass on all slices

View File

@ -55,6 +55,16 @@ namespace QCD {
FermionField _tmp; FermionField _tmp;
FermionField &tmp(void) { return _tmp; } FermionField &tmp(void) { return _tmp; }
////////////////////////////////////////
// Performance monitoring
////////////////////////////////////////
void Report(void);
void ZeroCounters(void);
double DhopTotalTime;
double DhopCalls;
double DhopCommTime;
double DhopComputeTime;
/////////////////////////////////////////////////////////////// ///////////////////////////////////////////////////////////////
// Implement the abstract base // Implement the abstract base
/////////////////////////////////////////////////////////////// ///////////////////////////////////////////////////////////////

View File

@ -93,6 +93,8 @@ class ScalarImplTypes {
class ScalarAdjMatrixImplTypes { class ScalarAdjMatrixImplTypes {
public: public:
typedef S Simd; typedef S Simd;
typedef QCD::SU<N> Group;
template <typename vtype> template <typename vtype>
using iImplField = iScalar<iScalar<iMatrix<vtype, N>>>; using iImplField = iScalar<iScalar<iMatrix<vtype, N>>>;
template <typename vtype> template <typename vtype>
@ -108,7 +110,7 @@ class ScalarImplTypes {
typedef Field PropagatorField; typedef Field PropagatorField;
static inline void generate_momenta(Field& P, GridParallelRNG& pRNG) { static inline void generate_momenta(Field& P, GridParallelRNG& pRNG) {
QCD::SU<N>::GaussianFundamentalLieAlgebraMatrix(pRNG, P); Group::GaussianFundamentalLieAlgebraMatrix(pRNG, P);
} }
static inline Field projectForce(Field& P) {return P;} static inline Field projectForce(Field& P) {return P;}
@ -122,11 +124,11 @@ class ScalarImplTypes {
} }
static inline void HotConfiguration(GridParallelRNG &pRNG, Field &U) { static inline void HotConfiguration(GridParallelRNG &pRNG, Field &U) {
QCD::SU<N>::LieRandomize(pRNG, U); Group::GaussianFundamentalLieAlgebraMatrix(pRNG, U);
} }
static inline void TepidConfiguration(GridParallelRNG &pRNG, Field &U) { static inline void TepidConfiguration(GridParallelRNG &pRNG, Field &U) {
QCD::SU<N>::LieRandomize(pRNG, U, 0.01); Group::GaussianFundamentalLieAlgebraMatrix(pRNG, U, 0.01);
} }
static inline void ColdConfiguration(GridParallelRNG &pRNG, Field &U) { static inline void ColdConfiguration(GridParallelRNG &pRNG, Field &U) {

View File

@ -81,7 +81,7 @@ namespace Grid {
phiStencil.HaloExchange(p, compressor); phiStencil.HaloExchange(p, compressor);
Field action(p._grid), pshift(p._grid), phisquared(p._grid); Field action(p._grid), pshift(p._grid), phisquared(p._grid);
phisquared = p*p; phisquared = p*p;
action = (2.0*Ndim + mass_square)*phisquared + lambda*phisquared*phisquared; action = (2.0*Ndim + mass_square)*phisquared - lambda/24.*phisquared*phisquared;
for (int mu = 0; mu < Ndim; mu++) { for (int mu = 0; mu < Ndim; mu++) {
// pshift = Cshift(p, mu, +1); // not efficient, implement with stencils // pshift = Cshift(p, mu, +1); // not efficient, implement with stencils
parallel_for (int i = 0; i < p._grid->oSites(); i++) { parallel_for (int i = 0; i < p._grid->oSites(); i++) {
@ -98,7 +98,7 @@ namespace Grid {
permute(temp2, *temp, permute_type); permute(temp2, *temp, permute_type);
action._odata[i] -= temp2*(*t_p) + (*t_p)*temp2; action._odata[i] -= temp2*(*t_p) + (*t_p)*temp2;
} else { } else {
action._odata[i] -= *temp*(*t_p) + (*t_p)*(*temp); action._odata[i] -= (*temp)*(*t_p) + (*t_p)*(*temp);
} }
} else { } else {
action._odata[i] -= phiStencil.CommBuf()[SE->_offset]*(*t_p) + (*t_p)*phiStencil.CommBuf()[SE->_offset]; action._odata[i] -= phiStencil.CommBuf()[SE->_offset]*(*t_p) + (*t_p)*phiStencil.CommBuf()[SE->_offset];
@ -113,7 +113,7 @@ namespace Grid {
virtual void deriv(const Field &p, Field &force) { virtual void deriv(const Field &p, Field &force) {
assert(p._grid->Nd() == Ndim); assert(p._grid->Nd() == Ndim);
force = (2.0*Ndim + mass_square)*p + 2.0*lambda*p*p*p; force = (2.0*Ndim + mass_square)*p - lambda/12.*p*p*p;
// move this outside // move this outside
static Stencil phiStencil(p._grid, npoint, 0, directions, displacements); static Stencil phiStencil(p._grid, npoint, 0, directions, displacements);
phiStencil.HaloExchange(p, compressor); phiStencil.HaloExchange(p, compressor);

View File

@ -76,7 +76,7 @@ struct HMCparameters: Serializable {
template < class ReaderClass > template < class ReaderClass >
void initialize(Reader<ReaderClass> &TheReader){ void initialize(Reader<ReaderClass> &TheReader){
std::cout << "Reading HMC\n"; std::cout << GridLogMessage << "Reading HMC\n";
read(TheReader, "HMC", *this); read(TheReader, "HMC", *this);
} }

View File

@ -165,7 +165,7 @@ class HMCResourceManager {
// Grids // Grids
////////////////////////////////////////////////////////////// //////////////////////////////////////////////////////////////
void AddGrid(std::string s, GridModule& M) { void AddGrid(const std::string s, GridModule& M) {
// Check for name clashes // Check for name clashes
auto search = Grids.find(s); auto search = Grids.find(s);
if (search != Grids.end()) { if (search != Grids.end()) {
@ -174,14 +174,24 @@ class HMCResourceManager {
exit(1); exit(1);
} }
Grids[s] = std::move(M); Grids[s] = std::move(M);
std::cout << GridLogMessage << "::::::::::::::::::::::::::::::::::::::::" <<std::endl;
std::cout << GridLogMessage << "HMCResourceManager:" << std::endl;
std::cout << GridLogMessage << "Created grid set with name '" << s << "' and decomposition for the full cartesian " << std::endl;
Grids[s].show_full_decomposition();
std::cout << GridLogMessage << "::::::::::::::::::::::::::::::::::::::::" <<std::endl;
} }
// Add a named grid set, 4d shortcut // Add a named grid set, 4d shortcut
void AddFourDimGrid(std::string s) { void AddFourDimGrid(const std::string s) {
GridFourDimModule<vComplex> Mod; GridFourDimModule<vComplex> Mod;
AddGrid(s, Mod); AddGrid(s, Mod);
} }
// Add a named grid set, 4d shortcut + tweak simd lanes
void AddFourDimGrid(const std::string s, const std::vector<int> simd_decomposition) {
GridFourDimModule<vComplex> Mod(simd_decomposition);
AddGrid(s, Mod);
}
GridCartesian* GetCartesian(std::string s = "") { GridCartesian* GetCartesian(std::string s = "") {
@ -253,6 +263,7 @@ class HMCResourceManager {
template<class T, class... Types> template<class T, class... Types>
void AddObservable(Types&&... Args){ void AddObservable(Types&&... Args){
ObservablesList.push_back(std::unique_ptr<T>(new T(std::forward<Types>(Args)...))); ObservablesList.push_back(std::unique_ptr<T>(new T(std::forward<Types>(Args)...)));
ObservablesList.back()->print_parameters();
} }
std::vector<HmcObservable<typename ImplementationPolicy::Field>* > GetObservables(){ std::vector<HmcObservable<typename ImplementationPolicy::Field>* > GetObservables(){
@ -297,4 +308,4 @@ private:
} }
} }
#endif // HMC_RESOURCE_MANAGER_H #endif // HMC_RESOURCE_MANAGER_H

View File

@ -33,28 +33,29 @@ directory
namespace Grid { namespace Grid {
// Resources // Resources
// Modules for grids // Modules for grids
// Introduce another namespace HMCModules? // Introduce another namespace HMCModules?
class GridModuleParameters: Serializable{ class GridModuleParameters: Serializable{
public: public:
GRID_SERIALIZABLE_CLASS_MEMBERS(GridModuleParameters, GRID_SERIALIZABLE_CLASS_MEMBERS(GridModuleParameters,
std::string, lattice, std::string, lattice,
std::string, mpi); std::string, mpi);
std::vector<int> getLattice(){return strToVec<int>(lattice);} std::vector<int> getLattice() const {return strToVec<int>(lattice);}
std::vector<int> getMpi() {return strToVec<int>(mpi);} std::vector<int> getMpi() const {return strToVec<int>(mpi);}
void check(){
if (getLattice().size() != getMpi().size()) { void check() const {
std::cout << GridLogError if (getLattice().size() != getMpi().size() ) {
std::cout << GridLogError
<< "Error in GridModuleParameters: lattice and mpi dimensions " << "Error in GridModuleParameters: lattice and mpi dimensions "
"do not match" "do not match"
<< std::endl; << std::endl;
exit(1); exit(1);
} }
} }
template <class ReaderClass> template <class ReaderClass>
GridModuleParameters(Reader<ReaderClass>& Reader, std::string n = "LatticeGrid"):name(n) { GridModuleParameters(Reader<ReaderClass>& Reader, std::string n = "LatticeGrid"):name(n) {
@ -75,51 +76,94 @@ private:
// Lower level class // Lower level class
class GridModule { class GridModule {
public: public:
GridCartesian* get_full() { GridCartesian* get_full() {
std::cout << GridLogDebug << "Getting cartesian in module"<< std::endl; std::cout << GridLogDebug << "Getting cartesian in module"<< std::endl;
return grid_.get(); } return grid_.get(); }
GridRedBlackCartesian* get_rb() { GridRedBlackCartesian* get_rb() {
std::cout << GridLogDebug << "Getting rb-cartesian in module"<< std::endl; std::cout << GridLogDebug << "Getting rb-cartesian in module"<< std::endl;
return rbgrid_.get(); } return rbgrid_.get(); }
void set_full(GridCartesian* grid) { grid_.reset(grid); } void set_full(GridCartesian* grid) { grid_.reset(grid); }
void set_rb(GridRedBlackCartesian* rbgrid) { rbgrid_.reset(rbgrid); } void set_rb(GridRedBlackCartesian* rbgrid) { rbgrid_.reset(rbgrid); }
void show_full_decomposition(){ grid_->show_decomposition(); }
void show_rb_decomposition(){ rbgrid_->show_decomposition(); }
protected: protected:
std::unique_ptr<GridCartesian> grid_; std::unique_ptr<GridCartesian> grid_;
std::unique_ptr<GridRedBlackCartesian> rbgrid_; std::unique_ptr<GridRedBlackCartesian> rbgrid_;
}; };
//////////////////////////////////// ////////////////////////////////////
// Classes for the user // Classes for the user
//////////////////////////////////// ////////////////////////////////////
// Note: the space time grid should be out of the QCD namespace // Note: the space time grid should be out of the QCD namespace
template< class vector_type> template <class vector_type>
class GridFourDimModule : public GridModule { class GridFourDimModule : public GridModule
public: {
GridFourDimModule() { public:
GridFourDimModule()
{
using namespace QCD; using namespace QCD;
set_full(SpaceTimeGrid::makeFourDimGrid( set_full(SpaceTimeGrid::makeFourDimGrid(
GridDefaultLatt(), GridDefaultSimd(4, vector_type::Nsimd()), GridDefaultLatt(),
GridDefaultSimd(4, vector_type::Nsimd()),
GridDefaultMpi())); GridDefaultMpi()));
set_rb(SpaceTimeGrid::makeFourDimRedBlackGrid(grid_.get())); set_rb(SpaceTimeGrid::makeFourDimRedBlackGrid(grid_.get()));
} }
GridFourDimModule(GridModuleParameters Params) { GridFourDimModule(const std::vector<int> tweak_simd)
{
using namespace QCD;
if (tweak_simd.size() != 4)
{
std::cout << GridLogError
<< "Error in GridFourDimModule: SIMD size different from 4"
<< std::endl;
exit(1);
}
// Checks that the product agrees with the expectation
int simd_sum = 1;
for (auto &n : tweak_simd)
simd_sum *= n;
std::cout << GridLogDebug << "TweakSIMD: " << tweak_simd << " Sum: " << simd_sum << std::endl;
if (simd_sum == vector_type::Nsimd())
{
set_full(SpaceTimeGrid::makeFourDimGrid(
GridDefaultLatt(),
tweak_simd,
GridDefaultMpi()));
set_rb(SpaceTimeGrid::makeFourDimRedBlackGrid(grid_.get()));
}
else
{
std::cout << GridLogError
<< "Error in GridFourDimModule: SIMD lanes must sum to "
<< vector_type::Nsimd()
<< std::endl;
}
}
GridFourDimModule(const GridModuleParameters Params)
{
using namespace QCD; using namespace QCD;
Params.check();
std::vector<int> lattice_v = Params.getLattice(); std::vector<int> lattice_v = Params.getLattice();
std::vector<int> mpi_v = Params.getMpi(); std::vector<int> mpi_v = Params.getMpi();
if (lattice_v.size() == 4) { if (lattice_v.size() == 4)
{
set_full(SpaceTimeGrid::makeFourDimGrid( set_full(SpaceTimeGrid::makeFourDimGrid(
lattice_v, GridDefaultSimd(4, vector_type::Nsimd()), lattice_v,
GridDefaultSimd(4, vector_type::Nsimd()),
mpi_v)); mpi_v));
set_rb(SpaceTimeGrid::makeFourDimRedBlackGrid(grid_.get())); set_rb(SpaceTimeGrid::makeFourDimRedBlackGrid(grid_.get()));
} else { }
std::cout << GridLogError else
<< "Error in GridFourDimModule: lattice dimension different from 4" {
<< std::endl; std::cout << GridLogError
<< "Error in GridFourDimModule: lattice dimension different from 4"
<< std::endl;
exit(1); exit(1);
} }
} }

View File

@ -84,8 +84,6 @@ class PlaquetteMod: public ObservableModule<PlaquetteLogger<Impl>, NoParameters>
typedef ObservableModule<PlaquetteLogger<Impl>, NoParameters> ObsBase; typedef ObservableModule<PlaquetteLogger<Impl>, NoParameters> ObsBase;
using ObsBase::ObsBase; // for constructors using ObsBase::ObsBase; // for constructors
// acquire resource // acquire resource
virtual void initialize(){ virtual void initialize(){
this->ObservablePtr.reset(new PlaquetteLogger<Impl>()); this->ObservablePtr.reset(new PlaquetteLogger<Impl>());
@ -94,23 +92,22 @@ class PlaquetteMod: public ObservableModule<PlaquetteLogger<Impl>, NoParameters>
PlaquetteMod(): ObsBase(NoParameters()){} PlaquetteMod(): ObsBase(NoParameters()){}
}; };
template < class Impl > template < class Impl >
class TopologicalChargeMod: public ObservableModule<TopologicalCharge<Impl>, NoParameters>{ class TopologicalChargeMod: public ObservableModule<TopologicalCharge<Impl>, TopologyObsParameters>{
typedef ObservableModule<TopologicalCharge<Impl>, NoParameters> ObsBase; typedef ObservableModule<TopologicalCharge<Impl>, TopologyObsParameters> ObsBase;
using ObsBase::ObsBase; // for constructors using ObsBase::ObsBase; // for constructors
// acquire resource // acquire resource
virtual void initialize(){ virtual void initialize(){
this->ObservablePtr.reset(new TopologicalCharge<Impl>()); this->ObservablePtr.reset(new TopologicalCharge<Impl>(this->Par_));
} }
public: public:
TopologicalChargeMod(): ObsBase(NoParameters()){} TopologicalChargeMod(TopologyObsParameters Par): ObsBase(Par){}
TopologicalChargeMod(): ObsBase(){}
}; };
}// QCD temporarily here }// QCD temporarily here

View File

@ -33,9 +33,45 @@ directory
namespace Grid { namespace Grid {
namespace QCD { namespace QCD {
struct TopologySmearingParameters : Serializable {
GRID_SERIALIZABLE_CLASS_MEMBERS(TopologySmearingParameters,
int, steps,
float, step_size,
int, meas_interval,
float, maxTau);
TopologySmearingParameters(int s = 0, float ss = 0.0f, int mi = 0, float mT = 0.0f):
steps(s), step_size(ss), meas_interval(mi), maxTau(mT){}
template < class ReaderClass >
TopologySmearingParameters(Reader<ReaderClass>& Reader){
read(Reader, "Smearing", *this);
}
};
struct TopologyObsParameters : Serializable {
GRID_SERIALIZABLE_CLASS_MEMBERS(TopologyObsParameters,
int, interval,
bool, do_smearing,
TopologySmearingParameters, Smearing);
TopologyObsParameters(int interval = 1, bool smearing = false):
interval(interval), Smearing(smearing){}
template <class ReaderClass >
TopologyObsParameters(Reader<ReaderClass>& Reader){
read(Reader, "TopologyMeasurement", *this);
}
};
// this is only defined for a gauge theory // this is only defined for a gauge theory
template <class Impl> template <class Impl>
class TopologicalCharge : public HmcObservable<typename Impl::Field> { class TopologicalCharge : public HmcObservable<typename Impl::Field> {
TopologyObsParameters Pars;
public: public:
// here forces the Impl to be of gauge fields // here forces the Impl to be of gauge fields
// if not the compiler will complain // if not the compiler will complain
@ -44,20 +80,39 @@ class TopologicalCharge : public HmcObservable<typename Impl::Field> {
// necessary for HmcObservable compatibility // necessary for HmcObservable compatibility
typedef typename Impl::Field Field; typedef typename Impl::Field Field;
TopologicalCharge(int interval = 1, bool do_smearing = false):
Pars(interval, do_smearing){}
TopologicalCharge(TopologyObsParameters P):Pars(P){
std::cout << GridLogDebug << "Creating TopologicalCharge " << std::endl;
}
void TrajectoryComplete(int traj, void TrajectoryComplete(int traj,
Field &U, Field &U,
GridSerialRNG &sRNG, GridSerialRNG &sRNG,
GridParallelRNG &pRNG) { GridParallelRNG &pRNG) {
Real q = WilsonLoops<Impl>::TopologicalCharge(U); if (traj%Pars.interval == 0){
// Smearing
Field Usmear = U;
int def_prec = std::cout.precision();
if (Pars.do_smearing){
// using wilson flow by default here
WilsonFlow<PeriodicGimplR> WF(Pars.Smearing.steps, Pars.Smearing.step_size, Pars.Smearing.meas_interval);
WF.smear_adaptive(Usmear, U, Pars.Smearing.maxTau);
Real T0 = WF.energyDensityPlaquette(Usmear);
std::cout << GridLogMessage << std::setprecision(std::numeric_limits<Real>::digits10 + 1)
<< "T0 : [ " << traj << " ] "<< T0 << std::endl;
}
int def_prec = std::cout.precision(); Real q = WilsonLoops<Impl>::TopologicalCharge(Usmear);
std::cout << GridLogMessage
<< std::setprecision(std::numeric_limits<Real>::digits10 + 1)
<< "Topological Charge: [ " << traj << " ] "<< q << std::endl;
std::cout << GridLogMessage std::cout.precision(def_prec);
<< std::setprecision(std::numeric_limits<Real>::digits10 + 1) }
<< "Topological Charge: [ " << traj << " ] "<< q << std::endl;
std::cout.precision(def_prec);
} }
}; };

View File

@ -108,7 +108,7 @@ void WilsonFlow<Gimpl>::evolve_step_adaptive(typename Gimpl::GaugeField &U, Real
if (maxTau - taus < epsilon){ if (maxTau - taus < epsilon){
epsilon = maxTau-taus; epsilon = maxTau-taus;
} }
std::cout << GridLogMessage << "Integration epsilon : " << epsilon << std::endl; //std::cout << GridLogMessage << "Integration epsilon : " << epsilon << std::endl;
GaugeField Z(U._grid); GaugeField Z(U._grid);
GaugeField Zprime(U._grid); GaugeField Zprime(U._grid);
GaugeField tmp(U._grid), Uprime(U._grid); GaugeField tmp(U._grid), Uprime(U._grid);
@ -138,10 +138,10 @@ void WilsonFlow<Gimpl>::evolve_step_adaptive(typename Gimpl::GaugeField &U, Real
// adjust integration step // adjust integration step
taus += epsilon; taus += epsilon;
std::cout << GridLogMessage << "Adjusting integration step with distance: " << diff << std::endl; //std::cout << GridLogMessage << "Adjusting integration step with distance: " << diff << std::endl;
epsilon = epsilon*0.95*std::pow(1e-4/diff,1./3.); epsilon = epsilon*0.95*std::pow(1e-4/diff,1./3.);
std::cout << GridLogMessage << "New epsilon : " << epsilon << std::endl; //std::cout << GridLogMessage << "New epsilon : " << epsilon << std::endl;
} }
@ -166,7 +166,6 @@ void WilsonFlow<Gimpl>::smear(GaugeField& out, const GaugeField& in) const {
out = in; out = in;
for (unsigned int step = 1; step <= Nstep; step++) { for (unsigned int step = 1; step <= Nstep; step++) {
auto start = std::chrono::high_resolution_clock::now(); auto start = std::chrono::high_resolution_clock::now();
std::cout << GridLogMessage << "Evolution time :"<< tau(step) << std::endl;
evolve_step(out); evolve_step(out);
auto end = std::chrono::high_resolution_clock::now(); auto end = std::chrono::high_resolution_clock::now();
std::chrono::duration<double> diff = end - start; std::chrono::duration<double> diff = end - start;
@ -191,7 +190,7 @@ void WilsonFlow<Gimpl>::smear_adaptive(GaugeField& out, const GaugeField& in, Re
unsigned int step = 0; unsigned int step = 0;
do{ do{
step++; step++;
std::cout << GridLogMessage << "Evolution time :"<< taus << std::endl; //std::cout << GridLogMessage << "Evolution time :"<< taus << std::endl;
evolve_step_adaptive(out, maxTau); evolve_step_adaptive(out, maxTau);
std::cout << GridLogMessage << "[WilsonFlow] Energy density (plaq) : " std::cout << GridLogMessage << "[WilsonFlow] Energy density (plaq) : "
<< step << " " << step << " "

View File

@ -26,12 +26,14 @@ Author: Peter Boyle <paboyle@ph.ed.ac.uk>
/* END LEGAL */ /* END LEGAL */
//#include <Grid/Grid.h> //#include <Grid/Grid.h>
using namespace Grid; #ifndef GRID_QCD_GAUGE_FIX_H
using namespace Grid::QCD; #define GRID_QCD_GAUGE_FIX_H
namespace Grid {
namespace QCD {
template <class Gimpl> template <class Gimpl>
class FourierAcceleratedGaugeFixer : public Gimpl { class FourierAcceleratedGaugeFixer : public Gimpl {
public: public:
INHERIT_GIMPL_TYPES(Gimpl); INHERIT_GIMPL_TYPES(Gimpl);
typedef typename Gimpl::GaugeLinkField GaugeMat; typedef typename Gimpl::GaugeLinkField GaugeMat;
@ -186,3 +188,6 @@ class FourierAcceleratedGaugeFixer : public Gimpl {
} }
}; };
}
}
#endif

View File

@ -716,8 +716,7 @@ template<typename GaugeField,typename GaugeMat>
for (int a = 0; a < AdjointDimension; a++) { for (int a = 0; a < AdjointDimension; a++) {
generator(a, Ta); generator(a, Ta);
auto tmp = - 2.0 * (trace(timesI(Ta) * in)) * scale;// 2.0 for the normalization of the trace in the fundamental rep pokeColour(h_out, - 2.0 * (trace(timesI(Ta) * in)) * scale, a);
pokeColour(h_out, tmp, a);
} }
} }

View File

@ -65,10 +65,12 @@ Hdf5Reader::Hdf5Reader(const std::string &fileName)
Hdf5Type<unsigned int>::type()); Hdf5Type<unsigned int>::type());
} }
void Hdf5Reader::push(const std::string &s) bool Hdf5Reader::push(const std::string &s)
{ {
group_ = group_.openGroup(s); group_ = group_.openGroup(s);
path_.push_back(s); path_.push_back(s);
return true;
} }
void Hdf5Reader::pop(void) void Hdf5Reader::pop(void)

View File

@ -54,7 +54,7 @@ namespace Grid
public: public:
Hdf5Reader(const std::string &fileName); Hdf5Reader(const std::string &fileName);
virtual ~Hdf5Reader(void) = default; virtual ~Hdf5Reader(void) = default;
void push(const std::string &s); bool push(const std::string &s);
void pop(void); void pop(void);
template <typename U> template <typename U>
void readDefault(const std::string &s, U &output); void readDefault(const std::string &s, U &output);

View File

@ -1,13 +1,14 @@
/************************************************************************************* /*************************************************************************************
Grid physics library, www.github.com/paboyle/Grid Grid physics library, www.github.com/paboyle/Grid
Source file: ./lib/simd/Grid_neon.h Source file: ./lib/simd/Grid_neon.h
Copyright (C) 2015 Copyright (C) 2015
Author: Peter Boyle <paboyle@ph.ed.ac.uk> Author: Nils Meyer <nils.meyer@ur.de>
Author: neo <cossu@post.kek.jp> Author: Peter Boyle <paboyle@ph.ed.ac.uk>
Author: neo <cossu@post.kek.jp>
This program is free software; you can redistribute it and/or modify This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by it under the terms of the GNU General Public License as published by
@ -26,19 +27,25 @@ Author: neo <cossu@post.kek.jp>
See the full license in the file "LICENSE" in the top level distribution directory See the full license in the file "LICENSE" in the top level distribution directory
*************************************************************************************/ *************************************************************************************/
/* END LEGAL */ /* END LEGAL */
//----------------------------------------------------------------------
/*! @file Grid_sse4.h
@brief Optimization libraries for NEON (ARM) instructions set ARMv8
Experimental - Using intrinsics - DEVELOPING! /*
ARMv8 NEON intrinsics layer by
Nils Meyer <nils.meyer@ur.de>,
University of Regensburg, Germany
SFB/TRR55
*/ */
// Time-stamp: <2015-07-10 17:45:09 neo>
//----------------------------------------------------------------------
#ifndef GEN_SIMD_WIDTH
#define GEN_SIMD_WIDTH 16u
#endif
#include "Grid_generic_types.h"
#include <arm_neon.h> #include <arm_neon.h>
// ARMv8 supports double precision namespace Grid {
namespace Optimization { namespace Optimization {
template<class vtype> template<class vtype>
@ -46,16 +53,20 @@ namespace Optimization {
float32x4_t f; float32x4_t f;
vtype v; vtype v;
}; };
union u128f { union u128f {
float32x4_t v; float32x4_t v;
float f[4]; float f[4];
}; };
union u128d { union u128d {
float64x2_t v; float64x2_t v;
double f[4]; double f[2];
}; };
// half precision
union u128h {
float16x8_t v;
uint16_t f[8];
};
struct Vsplat{ struct Vsplat{
//Complex float //Complex float
inline float32x4_t operator()(float a, float b){ inline float32x4_t operator()(float a, float b){
@ -64,31 +75,31 @@ namespace Optimization {
} }
// Real float // Real float
inline float32x4_t operator()(float a){ inline float32x4_t operator()(float a){
return vld1q_dup_f32(&a); return vdupq_n_f32(a);
} }
//Complex double //Complex double
inline float32x4_t operator()(double a, double b){ inline float64x2_t operator()(double a, double b){
float tmp[4]={(float)a,(float)b,(float)a,(float)b}; double tmp[2]={a,b};
return vld1q_f32(tmp); return vld1q_f64(tmp);
} }
//Real double //Real double // N:tbc
inline float32x4_t operator()(double a){ inline float64x2_t operator()(double a){
return vld1q_dup_f32(&a); return vdupq_n_f64(a);
} }
//Integer //Integer // N:tbc
inline uint32x4_t operator()(Integer a){ inline uint32x4_t operator()(Integer a){
return vld1q_dup_u32(&a); return vdupq_n_u32(a);
} }
}; };
struct Vstore{ struct Vstore{
//Float //Float
inline void operator()(float32x4_t a, float* F){ inline void operator()(float32x4_t a, float* F){
vst1q_f32(F, a); vst1q_f32(F, a);
} }
//Double //Double
inline void operator()(float32x4_t a, double* D){ inline void operator()(float64x2_t a, double* D){
vst1q_f32((float*)D, a); vst1q_f64(D, a);
} }
//Integer //Integer
inline void operator()(uint32x4_t a, Integer* I){ inline void operator()(uint32x4_t a, Integer* I){
@ -97,54 +108,54 @@ namespace Optimization {
}; };
struct Vstream{ struct Vstream{ // N:equivalents to _mm_stream_p* in NEON?
//Float //Float // N:generic
inline void operator()(float * a, float32x4_t b){ inline void operator()(float * a, float32x4_t b){
memcpy(a,&b,4*sizeof(float));
} }
//Double //Double // N:generic
inline void operator()(double * a, float32x4_t b){ inline void operator()(double * a, float64x2_t b){
memcpy(a,&b,2*sizeof(double));
} }
}; };
// Nils: Vset untested; not used currently in Grid at all;
// git commit 4a8c4ccfba1d05159348d21a9698028ea847e77b
struct Vset{ struct Vset{
// Complex float // Complex float // N:ok
inline float32x4_t operator()(Grid::ComplexF *a){ inline float32x4_t operator()(Grid::ComplexF *a){
float32x4_t foo; float tmp[4]={a[1].imag(),a[1].real(),a[0].imag(),a[0].real()};
return foo; return vld1q_f32(tmp);
} }
// Complex double // Complex double // N:ok
inline float32x4_t operator()(Grid::ComplexD *a){ inline float64x2_t operator()(Grid::ComplexD *a){
float32x4_t foo; double tmp[2]={a[0].imag(),a[0].real()};
return foo; return vld1q_f64(tmp);
} }
// Real float // Real float // N:ok
inline float32x4_t operator()(float *a){ inline float32x4_t operator()(float *a){
float32x4_t foo; float tmp[4]={a[3],a[2],a[1],a[0]};
return foo; return vld1q_f32(tmp);
} }
// Real double // Real double // N:ok
inline float32x4_t operator()(double *a){ inline float64x2_t operator()(double *a){
float32x4_t foo; double tmp[2]={a[1],a[0]};
return foo; return vld1q_f64(tmp);
} }
// Integer // Integer // N:ok
inline uint32x4_t operator()(Integer *a){ inline uint32x4_t operator()(Integer *a){
uint32x4_t foo; return vld1q_dup_u32(a);
return foo;
} }
}; };
// N:leaving as is
template <typename Out_type, typename In_type> template <typename Out_type, typename In_type>
struct Reduce{ struct Reduce{
//Need templated class to overload output type //Need templated class to overload output type
//General form must generate error if compiled //General form must generate error if compiled
inline Out_type operator()(In_type in){ inline Out_type operator()(In_type in){
printf("Error, using wrong Reduce function\n"); printf("Error, using wrong Reduce function\n");
exit(1); exit(1);
return 0; return 0;
@ -184,26 +195,98 @@ namespace Optimization {
} }
}; };
struct MultRealPart{
inline float32x4_t operator()(float32x4_t a, float32x4_t b){
float32x4_t re = vtrn1q_f32(a, a);
return vmulq_f32(re, b);
}
inline float64x2_t operator()(float64x2_t a, float64x2_t b){
float64x2_t re = vzip1q_f64(a, a);
return vmulq_f64(re, b);
}
};
struct MaddRealPart{
inline float32x4_t operator()(float32x4_t a, float32x4_t b, float32x4_t c){
float32x4_t re = vtrn1q_f32(a, a);
return vfmaq_f32(c, re, b);
}
inline float64x2_t operator()(float64x2_t a, float64x2_t b, float64x2_t c){
float64x2_t re = vzip1q_f64(a, a);
return vfmaq_f64(c, re, b);
}
};
struct Div{
// Real float
inline float32x4_t operator()(float32x4_t a, float32x4_t b){
return vdivq_f32(a, b);
}
// Real double
inline float64x2_t operator()(float64x2_t a, float64x2_t b){
return vdivq_f64(a, b);
}
};
struct MultComplex{ struct MultComplex{
// Complex float // Complex float
inline float32x4_t operator()(float32x4_t a, float32x4_t b){ inline float32x4_t operator()(float32x4_t a, float32x4_t b){
float32x4_t foo;
return foo; float32x4_t r0, r1, r2, r3, r4;
// a = ar ai Ar Ai
// b = br bi Br Bi
// collect real/imag part, negate bi and Bi
r0 = vtrn1q_f32(b, b); // br br Br Br
r1 = vnegq_f32(b); // -br -bi -Br -Bi
r2 = vtrn2q_f32(b, r1); // bi -bi Bi -Bi
// the fun part
r3 = vmulq_f32(r2, a); // bi*ar -bi*ai ...
r4 = vrev64q_f32(r3); // -bi*ai bi*ar ...
// fma(a,b,c) = a+b*c
return vfmaq_f32(r4, r0, a); // ar*br-ai*bi ai*br+ar*bi ...
// no fma, use mul and add
//float32x4_t r5;
//r5 = vmulq_f32(r0, a);
//return vaddq_f32(r4, r5);
} }
// Complex double // Complex double
inline float64x2_t operator()(float64x2_t a, float64x2_t b){ inline float64x2_t operator()(float64x2_t a, float64x2_t b){
float32x4_t foo;
return foo; float64x2_t r0, r1, r2, r3, r4;
// b = br bi
// collect real/imag part, negate bi
r0 = vtrn1q_f64(b, b); // br br
r1 = vnegq_f64(b); // -br -bi
r2 = vtrn2q_f64(b, r1); // bi -bi
// the fun part
r3 = vmulq_f64(r2, a); // bi*ar -bi*ai
r4 = vextq_f64(r3,r3,1); // -bi*ai bi*ar
// fma(a,b,c) = a+b*c
return vfmaq_f64(r4, r0, a); // ar*br-ai*bi ai*br+ar*bi
// no fma, use mul and add
//float64x2_t r5;
//r5 = vmulq_f64(r0, a);
//return vaddq_f64(r4, r5);
} }
}; };
struct Mult{ struct Mult{
// Real float // Real float
inline float32x4_t mac(float32x4_t a, float32x4_t b, float32x4_t c){ inline float32x4_t mac(float32x4_t a, float32x4_t b, float32x4_t c){
return vaddq_f32(vmulq_f32(b,c),a); //return vaddq_f32(vmulq_f32(b,c),a);
return vfmaq_f32(a, b, c);
} }
inline float64x2_t mac(float64x2_t a, float64x2_t b, float64x2_t c){ inline float64x2_t mac(float64x2_t a, float64x2_t b, float64x2_t c){
return vaddq_f64(vmulq_f64(b,c),a); //return vaddq_f64(vmulq_f64(b,c),a);
return vfmaq_f64(a, b, c);
} }
inline float32x4_t operator()(float32x4_t a, float32x4_t b){ inline float32x4_t operator()(float32x4_t a, float32x4_t b){
return vmulq_f32(a,b); return vmulq_f32(a,b);
@ -221,89 +304,275 @@ namespace Optimization {
struct Conj{ struct Conj{
// Complex single // Complex single
inline float32x4_t operator()(float32x4_t in){ inline float32x4_t operator()(float32x4_t in){
return in; // ar ai br bi -> ar -ai br -bi
float32x4_t r0, r1;
r0 = vnegq_f32(in); // -ar -ai -br -bi
r1 = vrev64q_f32(r0); // -ai -ar -bi -br
return vtrn1q_f32(in, r1); // ar -ai br -bi
} }
// Complex double // Complex double
//inline float32x4_t operator()(float32x4_t in){ inline float64x2_t operator()(float64x2_t in){
// return 0;
//} float64x2_t r0, r1;
r0 = vextq_f64(in, in, 1); // ai ar
r1 = vnegq_f64(r0); // -ai -ar
return vextq_f64(r0, r1, 1); // ar -ai
}
// do not define for integer input // do not define for integer input
}; };
struct TimesMinusI{ struct TimesMinusI{
//Complex single //Complex single
inline float32x4_t operator()(float32x4_t in, float32x4_t ret){ inline float32x4_t operator()(float32x4_t in, float32x4_t ret){
return in; // ar ai br bi -> ai -ar ai -br
float32x4_t r0, r1;
r0 = vnegq_f32(in); // -ar -ai -br -bi
r1 = vrev64q_f32(in); // ai ar bi br
return vtrn1q_f32(r1, r0); // ar -ai br -bi
} }
//Complex double //Complex double
//inline float32x4_t operator()(float32x4_t in, float32x4_t ret){ inline float64x2_t operator()(float64x2_t in, float64x2_t ret){
// return in; // a ib -> b -ia
//} float64x2_t tmp;
tmp = vnegq_f64(in);
return vextq_f64(in, tmp, 1);
}
}; };
struct TimesI{ struct TimesI{
//Complex single //Complex single
inline float32x4_t operator()(float32x4_t in, float32x4_t ret){ inline float32x4_t operator()(float32x4_t in, float32x4_t ret){
//need shuffle // ar ai br bi -> -ai ar -bi br
return in; float32x4_t r0, r1;
r0 = vnegq_f32(in); // -ar -ai -br -bi
r1 = vrev64q_f32(r0); // -ai -ar -bi -br
return vtrn1q_f32(r1, in); // -ai ar -bi br
} }
//Complex double //Complex double
//inline float32x4_t operator()(float32x4_t in, float32x4_t ret){ inline float64x2_t operator()(float64x2_t in, float64x2_t ret){
// return 0; // a ib -> -b ia
//} float64x2_t tmp;
tmp = vnegq_f64(in);
return vextq_f64(tmp, in, 1);
}
};
struct Permute{
static inline float32x4_t Permute0(float32x4_t in){ // N:ok
// AB CD -> CD AB
return vextq_f32(in, in, 2);
};
static inline float32x4_t Permute1(float32x4_t in){ // N:ok
// AB CD -> BA DC
return vrev64q_f32(in);
};
static inline float32x4_t Permute2(float32x4_t in){ // N:not used by Boyle
return in;
};
static inline float32x4_t Permute3(float32x4_t in){ // N:not used by Boyle
return in;
};
static inline float64x2_t Permute0(float64x2_t in){ // N:ok
// AB -> BA
return vextq_f64(in, in, 1);
};
static inline float64x2_t Permute1(float64x2_t in){ // N:not used by Boyle
return in;
};
static inline float64x2_t Permute2(float64x2_t in){ // N:not used by Boyle
return in;
};
static inline float64x2_t Permute3(float64x2_t in){ // N:not used by Boyle
return in;
};
};
struct Rotate{
static inline float32x4_t rotate(float32x4_t in,int n){ // N:ok
switch(n){
case 0: // AB CD -> AB CD
return tRotate<0>(in);
break;
case 1: // AB CD -> BC DA
return tRotate<1>(in);
break;
case 2: // AB CD -> CD AB
return tRotate<2>(in);
break;
case 3: // AB CD -> DA BC
return tRotate<3>(in);
break;
default: assert(0);
}
}
static inline float64x2_t rotate(float64x2_t in,int n){ // N:ok
switch(n){
case 0: // AB -> AB
return tRotate<0>(in);
break;
case 1: // AB -> BA
return tRotate<1>(in);
break;
default: assert(0);
}
}
// working, but no restriction on n
// template<int n> static inline float32x4_t tRotate(float32x4_t in){ return vextq_f32(in,in,n); };
// template<int n> static inline float64x2_t tRotate(float64x2_t in){ return vextq_f64(in,in,n); };
// restriction on n
template<int n> static inline float32x4_t tRotate(float32x4_t in){ return vextq_f32(in,in,n%4); };
template<int n> static inline float64x2_t tRotate(float64x2_t in){ return vextq_f64(in,in,n%2); };
};
struct PrecisionChange {
static inline float16x8_t StoH (const float32x4_t &a,const float32x4_t &b) {
float16x4_t h = vcvt_f16_f32(a);
return vcvt_high_f16_f32(h, b);
}
static inline void HtoS (float16x8_t h,float32x4_t &sa,float32x4_t &sb) {
sb = vcvt_high_f32_f16(h);
// there is no direct conversion from lower float32x4_t to float64x2_t
// vextq_f16 not supported by clang 3.8 / 4.0 / arm clang
//float16x8_t h1 = vextq_f16(h, h, 4); // correct, but not supported by clang
// workaround for clang
uint32x4_t h1u = reinterpret_cast<uint32x4_t>(h);
float16x8_t h1 = reinterpret_cast<float16x8_t>(vextq_u32(h1u, h1u, 2));
sa = vcvt_high_f32_f16(h1);
}
static inline float32x4_t DtoS (float64x2_t a,float64x2_t b) {
float32x2_t s = vcvt_f32_f64(a);
return vcvt_high_f32_f64(s, b);
}
static inline void StoD (float32x4_t s,float64x2_t &a,float64x2_t &b) {
b = vcvt_high_f64_f32(s);
// there is no direct conversion from lower float32x4_t to float64x2_t
float32x4_t s1 = vextq_f32(s, s, 2);
a = vcvt_high_f64_f32(s1);
}
static inline float16x8_t DtoH (float64x2_t a,float64x2_t b,float64x2_t c,float64x2_t d) {
float32x4_t s1 = DtoS(a, b);
float32x4_t s2 = DtoS(c, d);
return StoH(s1, s2);
}
static inline void HtoD (float16x8_t h,float64x2_t &a,float64x2_t &b,float64x2_t &c,float64x2_t &d) {
float32x4_t s1, s2;
HtoS(h, s1, s2);
StoD(s1, a, b);
StoD(s2, c, d);
}
};
//////////////////////////////////////////////
// Exchange support
struct Exchange{
static inline void Exchange0(float32x4_t &out1,float32x4_t &out2,float32x4_t in1,float32x4_t in2){
// in1: ABCD -> out1: ABEF
// in2: EFGH -> out2: CDGH
// z: CDAB
float32x4_t z = vextq_f32(in1, in1, 2);
// out1: ABEF
out1 = vextq_f32(z, in2, 2);
// z: GHEF
z = vextq_f32(in2, in2, 2);
// out2: CDGH
out2 = vextq_f32(in1, z, 2);
};
static inline void Exchange1(float32x4_t &out1,float32x4_t &out2,float32x4_t in1,float32x4_t in2){
// in1: ABCD -> out1: AECG
// in2: EFGH -> out2: BFDH
out1 = vtrn1q_f32(in1, in2);
out2 = vtrn2q_f32(in1, in2);
};
static inline void Exchange2(float32x4_t &out1,float32x4_t &out2,float32x4_t in1,float32x4_t in2){
assert(0);
return;
};
static inline void Exchange3(float32x4_t &out1,float32x4_t &out2,float32x4_t in1,float32x4_t in2){
assert(0);
return;
};
// double precision
static inline void Exchange0(float64x2_t &out1,float64x2_t &out2,float64x2_t in1,float64x2_t in2){
// in1: AB -> out1: AC
// in2: CD -> out2: BD
out1 = vzip1q_f64(in1, in2);
out2 = vzip2q_f64(in1, in2);
};
static inline void Exchange1(float64x2_t &out1,float64x2_t &out2,float64x2_t in1,float64x2_t in2){
assert(0);
return;
};
static inline void Exchange2(float64x2_t &out1,float64x2_t &out2,float64x2_t in1,float64x2_t in2){
assert(0);
return;
};
static inline void Exchange3(float64x2_t &out1,float64x2_t &out2,float64x2_t in1,float64x2_t in2){
assert(0);
return;
};
}; };
////////////////////////////////////////////// //////////////////////////////////////////////
// Some Template specialization // Some Template specialization
template < typename vtype >
void permute(vtype &a, vtype b, int perm) {
};
//Complex float Reduce //Complex float Reduce
template<> template<>
inline Grid::ComplexF Reduce<Grid::ComplexF, float32x4_t>::operator()(float32x4_t in){ inline Grid::ComplexF Reduce<Grid::ComplexF, float32x4_t>::operator()(float32x4_t in){
return 0; float32x4_t v1; // two complex
v1 = Optimization::Permute::Permute0(in);
v1 = vaddq_f32(v1,in);
u128f conv; conv.v=v1;
return Grid::ComplexF(conv.f[0],conv.f[1]);
} }
//Real float Reduce //Real float Reduce
template<> template<>
inline Grid::RealF Reduce<Grid::RealF, float32x4_t>::operator()(float32x4_t in){ inline Grid::RealF Reduce<Grid::RealF, float32x4_t>::operator()(float32x4_t in){
float32x2_t high = vget_high_f32(in); return vaddvq_f32(in);
float32x2_t low = vget_low_f32(in);
float32x2_t tmp = vadd_f32(low, high);
float32x2_t sum = vpadd_f32(tmp, tmp);
return vget_lane_f32(sum,0);
} }
//Complex double Reduce //Complex double Reduce
template<> template<> // N:by Boyle
inline Grid::ComplexD Reduce<Grid::ComplexD, float64x2_t>::operator()(float64x2_t in){ inline Grid::ComplexD Reduce<Grid::ComplexD, float64x2_t>::operator()(float64x2_t in){
return 0; u128d conv; conv.v = in;
return Grid::ComplexD(conv.f[0],conv.f[1]);
} }
//Real double Reduce //Real double Reduce
template<> template<>
inline Grid::RealD Reduce<Grid::RealD, float64x2_t>::operator()(float64x2_t in){ inline Grid::RealD Reduce<Grid::RealD, float64x2_t>::operator()(float64x2_t in){
float64x2_t sum = vpaddq_f64(in, in); return vaddvq_f64(in);
return vgetq_lane_f64(sum,0);
} }
//Integer Reduce //Integer Reduce
template<> template<>
inline Integer Reduce<Integer, uint32x4_t>::operator()(uint32x4_t in){ inline Integer Reduce<Integer, uint32x4_t>::operator()(uint32x4_t in){
// FIXME unimplemented // FIXME unimplemented
printf("Reduce : Missing integer implementation -> FIX\n"); printf("Reduce : Missing integer implementation -> FIX\n");
assert(0); assert(0);
} }
} }
////////////////////////////////////////////////////////////////////////////////////// //////////////////////////////////////////////////////////////////////////////////////
// Here assign types // Here assign types
namespace Grid {
// typedef Optimization::vech SIMD_Htype; // Reduced precision type
typedef float16x8_t SIMD_Htype; // Half precision type
typedef float32x4_t SIMD_Ftype; // Single precision type typedef float32x4_t SIMD_Ftype; // Single precision type
typedef float64x2_t SIMD_Dtype; // Double precision type typedef float64x2_t SIMD_Dtype; // Double precision type
typedef uint32x4_t SIMD_Itype; // Integer type typedef uint32x4_t SIMD_Itype; // Integer type
@ -312,13 +581,6 @@ namespace Grid {
inline void prefetch_HINT_T0(const char *ptr){}; inline void prefetch_HINT_T0(const char *ptr){};
// Gpermute function
template < typename VectorSIMD >
inline void Gpermute(VectorSIMD &y,const VectorSIMD &b, int perm ) {
Optimization::permute(y.v,b.v,perm);
}
// Function name aliases // Function name aliases
typedef Optimization::Vsplat VsplatSIMD; typedef Optimization::Vsplat VsplatSIMD;
typedef Optimization::Vstore VstoreSIMD; typedef Optimization::Vstore VstoreSIMD;
@ -326,16 +588,19 @@ namespace Grid {
typedef Optimization::Vstream VstreamSIMD; typedef Optimization::Vstream VstreamSIMD;
template <typename S, typename T> using ReduceSIMD = Optimization::Reduce<S,T>; template <typename S, typename T> using ReduceSIMD = Optimization::Reduce<S,T>;
// Arithmetic operations // Arithmetic operations
typedef Optimization::Sum SumSIMD; typedef Optimization::Sum SumSIMD;
typedef Optimization::Sub SubSIMD; typedef Optimization::Sub SubSIMD;
typedef Optimization::Div DivSIMD;
typedef Optimization::Mult MultSIMD; typedef Optimization::Mult MultSIMD;
typedef Optimization::MultComplex MultComplexSIMD; typedef Optimization::MultComplex MultComplexSIMD;
typedef Optimization::MultRealPart MultRealPartSIMD;
typedef Optimization::MaddRealPart MaddRealPartSIMD;
typedef Optimization::Conj ConjSIMD; typedef Optimization::Conj ConjSIMD;
typedef Optimization::TimesMinusI TimesMinusISIMD; typedef Optimization::TimesMinusI TimesMinusISIMD;
typedef Optimization::TimesI TimesISIMD; typedef Optimization::TimesI TimesISIMD;
} }

View File

@ -53,7 +53,7 @@ directory
#if defined IMCI #if defined IMCI
#include "Grid_imci.h" #include "Grid_imci.h"
#endif #endif
#ifdef NEONv8 #ifdef NEONV8
#include "Grid_neon.h" #include "Grid_neon.h"
#endif #endif
#if defined QPX #if defined QPX

View File

@ -32,8 +32,11 @@ Author: paboyle <paboyle@ph.ed.ac.uk>
namespace Grid { namespace Grid {
int LebesgueOrder::UseLebesgueOrder; int LebesgueOrder::UseLebesgueOrder;
#ifdef KNL
std::vector<int> LebesgueOrder::Block({8,2,2,2}); std::vector<int> LebesgueOrder::Block({8,2,2,2});
#else
std::vector<int> LebesgueOrder::Block({2,2,2,2});
#endif
LebesgueOrder::IndexInteger LebesgueOrder::alignup(IndexInteger n){ LebesgueOrder::IndexInteger LebesgueOrder::alignup(IndexInteger n){
n--; // 1000 0011 --> 1000 0010 n--; // 1000 0011 --> 1000 0010
n |= n >> 1; // 1000 0010 | 0100 0001 = 1100 0011 n |= n >> 1; // 1000 0010 | 0100 0001 = 1100 0011
@ -51,8 +54,31 @@ LebesgueOrder::LebesgueOrder(GridBase *_grid)
if ( Block[0]==0) ZGraph(); if ( Block[0]==0) ZGraph();
else if ( Block[1]==0) NoBlocking(); else if ( Block[1]==0) NoBlocking();
else CartesianBlocking(); else CartesianBlocking();
}
if (0) {
std::cout << "Thread Interleaving"<<std::endl;
ThreadInterleave();
}
}
void LebesgueOrder::ThreadInterleave(void)
{
std::vector<IndexInteger> reorder = _LebesgueReorder;
std::vector<IndexInteger> throrder;
int vol = _LebesgueReorder.size();
int threads = GridThread::GetThreads();
int blockbits=3;
int blocklen = 8;
int msk = 0x7;
for(int t=0;t<threads;t++){
for(int ss=0;ss<vol;ss++){
if ( ( ss >> blockbits) % threads == t ) {
throrder.push_back(reorder[ss]);
}
}
}
_LebesgueReorder = throrder;
}
void LebesgueOrder::NoBlocking(void) void LebesgueOrder::NoBlocking(void)
{ {
std::cout<<GridLogDebug<<"Lexicographic : no cache blocking"<<std::endl; std::cout<<GridLogDebug<<"Lexicographic : no cache blocking"<<std::endl;

View File

@ -70,6 +70,8 @@ namespace Grid {
std::vector<IndexInteger> & xi, std::vector<IndexInteger> & xi,
std::vector<IndexInteger> &dims); std::vector<IndexInteger> &dims);
void ThreadInterleave(void);
private: private:
std::vector<IndexInteger> _LebesgueReorder; std::vector<IndexInteger> _LebesgueReorder;

View File

@ -98,7 +98,9 @@ template<class rtype,class vtype,class mtype,int N>
strong_inline void mult(iVector<rtype,N> * __restrict__ ret, strong_inline void mult(iVector<rtype,N> * __restrict__ ret,
const iVector<vtype,N> * __restrict__ rhs, const iVector<vtype,N> * __restrict__ rhs,
const iScalar<mtype> * __restrict__ lhs){ const iScalar<mtype> * __restrict__ lhs){
mult(ret,lhs,rhs); for(int c1=0;c1<N;c1++){
mult(&ret->_internal[c1],&rhs->_internal[c1],&lhs->_internal);
}
} }

View File

@ -377,7 +377,7 @@ void Grid_init(int *argc,char ***argv)
std::cout << GridLogDebug << "Requesting "<< CartesianCommunicator::MAX_MPI_SHM_BYTES <<" byte stencil comms buffers "<<std::endl; std::cout << GridLogDebug << "Requesting "<< CartesianCommunicator::MAX_MPI_SHM_BYTES <<" byte stencil comms buffers "<<std::endl;
if( GridCmdOptionExists(*argv,*argv+*argc,"--decomposition") ){ if( GridCmdOptionExists(*argv,*argv+*argc,"--decomposition") ){
std::cout<<GridLogMessage<<"Grid Decomposition\n"; std::cout<<GridLogMessage<<"Grid Default Decomposition patterns\n";
std::cout<<GridLogMessage<<"\tOpenMP threads : "<<GridThread::GetThreads()<<std::endl; std::cout<<GridLogMessage<<"\tOpenMP threads : "<<GridThread::GetThreads()<<std::endl;
std::cout<<GridLogMessage<<"\tMPI tasks : "<<GridCmdVectorIntToString(GridDefaultMpi())<<std::endl; std::cout<<GridLogMessage<<"\tMPI tasks : "<<GridCmdVectorIntToString(GridDefaultMpi())<<std::endl;
std::cout<<GridLogMessage<<"\tvRealF : "<<sizeof(vRealF)*8 <<"bits ; " <<GridCmdVectorIntToString(GridDefaultSimd(4,vRealF::Nsimd()))<<std::endl; std::cout<<GridLogMessage<<"\tvRealF : "<<sizeof(vRealF)*8 <<"bits ; " <<GridCmdVectorIntToString(GridDefaultSimd(4,vRealF::Nsimd()))<<std::endl;

View File

@ -28,6 +28,9 @@ Author: Peter Boyle <paboyle@ph.ed.ac.uk>
/* END LEGAL */ /* END LEGAL */
#include <Grid/Grid.h> #include <Grid/Grid.h>
using namespace Grid;
using namespace Grid::QCD;
int main (int argc, char ** argv) int main (int argc, char ** argv)
{ {
std::vector<int> seeds({1,2,3,4}); std::vector<int> seeds({1,2,3,4});
@ -82,6 +85,7 @@ int main (int argc, char ** argv)
Uorg = Uorg - Umu; Uorg = Uorg - Umu;
std::cout << " Norm Difference "<< norm2(Uorg) << std::endl; std::cout << " Norm Difference "<< norm2(Uorg) << std::endl;
std::cout << " Norm "<< norm2(Umu) << std::endl;
std::cout<< "*****************************************************************" <<std::endl; std::cout<< "*****************************************************************" <<std::endl;

View File

@ -40,12 +40,6 @@ namespace Grid{
double, StoppingCondition, double, StoppingCondition,
int, MaxCGIterations, int, MaxCGIterations,
bool, ApplySmearing); bool, ApplySmearing);
//template <class ReaderClass >
//FermionParameters(Reader<ReaderClass>& Reader){
// read(Reader, "Mobius", *this);
//}
}; };
@ -113,9 +107,17 @@ int main(int argc, char **argv) {
bool ApplySmearing = MyParams.Mobius.ApplySmearing; bool ApplySmearing = MyParams.Mobius.ApplySmearing;
// Use this if you want to tweak the default decomposition
// commented out as very architecture speficic
//std::vector<int> simd_lanes({2,2,1,1});
// Grid from the command line // Grid from the command line arguments --grid and --mpi
TheHMC.Resources.AddFourDimGrid("gauge"); // drop the simd_lanes argument to fall back to the default decomposition for the SIMD lanes
//TheHMC.Resources.AddFourDimGrid("gauge", simd_lanes); // tweak the SIMD lanes
TheHMC.Resources.AddFourDimGrid("gauge"); // use default simd lanes decomposition
// Possibile to create the module by hand // Possibile to create the module by hand
// hardcoding parameters or using a Reader // hardcoding parameters or using a Reader

View File

@ -45,7 +45,7 @@ using namespace Grid;
using namespace Grid::QCD; using namespace Grid::QCD;
template <class Impl> template <class Impl>
class MagLogger : public HmcObservable<typename Impl::Field> { class MagMeas : public HmcObservable<typename Impl::Field> {
public: public:
typedef typename Impl::Field Field; typedef typename Impl::Field Field;
typedef typename Impl::Simd::scalar_type Trace; typedef typename Impl::Simd::scalar_type Trace;
@ -72,13 +72,13 @@ private:
}; };
template <class Impl> template <class Impl>
class MagMod: public ObservableModule<MagLogger<Impl>, NoParameters>{ class MagMod: public ObservableModule<MagMeas<Impl>, NoParameters>{
typedef ObservableModule<MagLogger<Impl>, NoParameters> ObsBase; typedef ObservableModule<MagMeas<Impl>, NoParameters> ObsBase;
using ObsBase::ObsBase; // for constructors using ObsBase::ObsBase; // for constructors
// acquire resource // acquire resource
virtual void initialize(){ virtual void initialize(){
this->ObservablePtr.reset(new MagLogger<Impl>()); this->ObservablePtr.reset(new MagMeas<Impl>());
} }
public: public:
MagMod(): ObsBase(NoParameters()){} MagMod(): ObsBase(NoParameters()){}

View File

@ -66,7 +66,14 @@ int main(int argc, char **argv) {
typedef PlaquetteMod<HMCWrapper::ImplPolicy> PlaqObs; typedef PlaquetteMod<HMCWrapper::ImplPolicy> PlaqObs;
typedef TopologicalChargeMod<HMCWrapper::ImplPolicy> QObs; typedef TopologicalChargeMod<HMCWrapper::ImplPolicy> QObs;
TheHMC.Resources.AddObservable<PlaqObs>(); TheHMC.Resources.AddObservable<PlaqObs>();
TheHMC.Resources.AddObservable<QObs>(); TopologyObsParameters TopParams;
TopParams.interval = 5;
TopParams.do_smearing = true;
TopParams.Smearing.steps = 200;
TopParams.Smearing.step_size = 0.01;
TopParams.Smearing.meas_interval = 50;
TopParams.Smearing.maxTau = 2.0;
TheHMC.Resources.AddObservable<QObs>(TopParams);
////////////////////////////////////////////// //////////////////////////////////////////////
///////////////////////////////////////////////////////////// /////////////////////////////////////////////////////////////

View File

@ -75,7 +75,7 @@ int main (int argc, char ** argv)
LatticeGaugeField Umu(UGrid); SU3::HotConfiguration(pRNG,Umu); LatticeGaugeField Umu(UGrid); SU3::HotConfiguration(pRNG,Umu);
RealD mass=0.003; RealD mass=0.003;
ImprovedStaggeredFermion5DR Ds(Umu,Umu,*FGrid,*FrbGrid,*UGrid,*UrbGrid,mass); ImprovedStaggeredFermion5DR Ds(Umu,Umu,*FGrid,*FrbGrid,*UGrid,*UrbGrid,mass);
MdagMLinearOperator<ImprovedStaggeredFermion5DR,FermionField> HermOp(Ds); MdagMLinearOperator<ImprovedStaggeredFermion5DR,FermionField> HermOp(Ds);
ConjugateGradient<FermionField> CG(1.0e-8,10000); ConjugateGradient<FermionField> CG(1.0e-8,10000);
@ -99,21 +99,27 @@ int main (int argc, char ** argv)
std::cout << GridLogMessage << " Calling 5d CG for "<<Ls <<" right hand sides" <<std::endl; std::cout << GridLogMessage << " Calling 5d CG for "<<Ls <<" right hand sides" <<std::endl;
std::cout << GridLogMessage << "************************************************************************ "<<std::endl; std::cout << GridLogMessage << "************************************************************************ "<<std::endl;
result=zero; result=zero;
Ds.ZeroCounters();
CG(HermOp,src,result); CG(HermOp,src,result);
Ds.Report();
std::cout << GridLogMessage << "************************************************************************ "<<std::endl; std::cout << GridLogMessage << "************************************************************************ "<<std::endl;
std::cout << GridLogMessage << "************************************************************************ "<<std::endl; std::cout << GridLogMessage << "************************************************************************ "<<std::endl;
std::cout << GridLogMessage << " Calling multiRHS CG for "<<Ls <<" right hand sides" <<std::endl; std::cout << GridLogMessage << " Calling multiRHS CG for "<<Ls <<" right hand sides" <<std::endl;
std::cout << GridLogMessage << "************************************************************************ "<<std::endl; std::cout << GridLogMessage << "************************************************************************ "<<std::endl;
result=zero; result=zero;
Ds.ZeroCounters();
mCG(HermOp,src,result); mCG(HermOp,src,result);
Ds.Report();
std::cout << GridLogMessage << "************************************************************************ "<<std::endl; std::cout << GridLogMessage << "************************************************************************ "<<std::endl;
std::cout << GridLogMessage << "************************************************************************ "<<std::endl; std::cout << GridLogMessage << "************************************************************************ "<<std::endl;
std::cout << GridLogMessage << " Calling Block CG for "<<Ls <<" right hand sides" <<std::endl; std::cout << GridLogMessage << " Calling Block CG for "<<Ls <<" right hand sides" <<std::endl;
std::cout << GridLogMessage << "************************************************************************ "<<std::endl; std::cout << GridLogMessage << "************************************************************************ "<<std::endl;
result=zero; result=zero;
Ds.ZeroCounters();
BCGrQ(HermOp,src,result); BCGrQ(HermOp,src,result);
Ds.Report();
std::cout << GridLogMessage << "************************************************************************ "<<std::endl; std::cout << GridLogMessage << "************************************************************************ "<<std::endl;