diff --git a/README.md b/README.md
index 9432abe1..1e0988f3 100644
--- a/README.md
+++ b/README.md
@@ -18,10 +18,41 @@ License: GPL v2.
-Last update Nov 2016.
+Last update June 2017.
 _Please do not send pull requests to the `master` branch which is reserved for releases._
+
+
+### Description
+This library provides data parallel C++ container classes with internal memory layout
+that is transformed to map efficiently to SIMD architectures. CSHIFT facilities
+are provided, similar to HPF and cmfortran, and user control is given over the mapping of
+array indices to both MPI tasks and SIMD processing elements.
+
+* Identically shaped arrays can then be processed with perfect data parallelisation.
+* Such identically shaped arrays are called conformable arrays.
+
+The transformation is based on the observation that Cartesian array processing involves
+identical processing to be performed on different regions of the Cartesian array.
+
+The library will geometrically decompose both across MPI tasks and across SIMD lanes.
+Local vector loops are parallelised with OpenMP pragmas.
+
+Data parallel array operations can then be specified with a SINGLE data parallel paradigm, but
+optimally use MPI, OpenMP and SIMD parallelism under the hood. This is a significant simplification
+for most programmers.
+
+The layout transformations are parametrised by the SIMD vector length. This adapts according to the architecture.
+Presently SSE4, ARM NEON (128 bits), AVX, AVX2, QPX (256 bits), IMCI and AVX512 (512 bits) targets are supported.
+
+These are presented as `vRealF`, `vRealD`, `vComplexF`, and `vComplexD` internal vector data types.
+The corresponding scalar types are named `RealF`, `RealD`, `ComplexF` and `ComplexD`.
+
+MPI, OpenMP, and SIMD parallelism are present in the library.
+Please see [this paper](https://arxiv.org/abs/1512.03487) for more detail.
+
+
 ### Compilers
 Intel ICPC v16.0.3 and later
@@ -56,35 +87,25 @@ When you file an issue, please go through the following checklist:
 6. Attach the output of `make V=1`.
 7. Describe the issue and any previous attempt to solve it. If relevant, show how to reproduce the issue using a minimal working example.
+### Required libraries
+Grid requires:
+[GMP](https://gmplib.org/),
-### Description
-This library provides data parallel C++ container classes with internal memory layout
-that is transformed to map efficiently to SIMD architectures. CSHIFT facilities
-are provided, similar to HPF and cmfortran, and user control is given over the mapping of
-array indices to both MPI tasks and SIMD processing elements.
+[MPFR](http://www.mpfr.org/)
-* Identically shaped arrays then be processed with perfect data parallelisation.
-* Such identically shaped arrays are called conformable arrays.
+Bootstrapping Grid downloads the Eigen library, which is used internally for dense matrix (non-QCD) operations.
-The transformation is based on the observation that Cartesian array processing involves
-identical processing to be performed on different regions of the Cartesian array.
+Grid optionally uses:
-The library will both geometrically decompose into MPI tasks and across SIMD lanes.
-Local vector loops are parallelised with OpenMP pragmas.
+[HDF5](https://support.hdfgroup.org/HDF5/)
-Data parallel array operations can then be specified with a SINGLE data parallel paradigm, but
-optimally use MPI, OpenMP and SIMD parallelism under the hood. This is a significant simplification
-for most programmers.
+[LIME](http://usqcd-software.github.io/c-lime/) for ILDG and SciDAC file format support.
-The layout transformations are parametrised by the SIMD vector length. This adapts according to the architecture.
-Presently SSE4 (128 bit) AVX, AVX2, QPX (256 bit), IMCI, and AVX512 (512 bit) targets are supported (ARM NEON on the way).
+[FFTW](http://www.fftw.org) either generic version or via the Intel MKL library.
-These are presented as `vRealF`, `vRealD`, `vComplexF`, and `vComplexD` internal vector data types. These may be useful in themselves for other programmers.
-The corresponding scalar types are named `RealF`, `RealD`, `ComplexF` and `ComplexD`.
+LAPACK either generic version or Intel MKL library.
-MPI, OpenMP, and SIMD parallelism are present in the library.
-Please see https://arxiv.org/abs/1512.03487 for more detail.
 ### Quick start
 First, start by cloning the repository:
@@ -155,7 +176,6 @@ The following options can be used with the `--enable-comms=` option to target dif
 | `none` | no communications |
 | `mpi[-auto]` | MPI communications |
 | `mpi3[-auto]` | MPI communications using MPI 3 shared memory |
-| `mpi3l[-auto]` | MPI communications using MPI 3 shared memory and leader model |
 | `shmem ` | Cray SHMEM communications |
 For the MPI interfaces the optional `-auto` suffix instructs the `configure` scripts to determine all the necessary compilation and linking flags. This is done by extracting the information from the MPI wrapper specified in the environment variable `MPICXX` (if not specified `configure` will scan through a list of default names). The `-auto` suffix is not supported by the Cray environment wrapper scripts. Use the standard versions instead.
@@ -173,7 +193,8 @@ The following options can be used with the `--enable-simd=` option to target diff
 | `AVXFMA4` | AVX (256 bit) + FMA4 |
 | `AVX2` | AVX 2 (256 bit) |
 | `AVX512` | AVX 512 bit |
-| `QPX` | QPX (256 bit) |
+| `NEONv8` | [ARM NEON](http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.den0024a/ch07s03.html) (128 bit) |
+| `QPX` | IBM QPX (256 bit) |
 Alternatively, some CPU codenames can be directly used:
@@ -195,21 +216,205 @@ The following configuration is recommended for the Intel Knights Landing platfor
 ``` bash
 ../configure --enable-precision=double\
              --enable-simd=KNL \
-             --enable-comms=mpi-auto \
-             --with-gmp=<path> \
-             --with-mpfr=<path> \
+             --enable-comms=mpi-auto \
              --enable-mkl \
              CXX=icpc MPICXX=mpiicpc
 ```
+The MKL flag enables use of BLAS and FFTW from the Intel Math Kernels Library.
-where `<path>` is the UNIX prefix where GMP and MPFR are installed. If you are working on a Cray machine that does not use the `mpiicpc` wrapper, please use:
+If you are working on a Cray machine that does not use the `mpiicpc` wrapper, please use:
 ``` bash
 ../configure --enable-precision=double\
              --enable-simd=KNL \
              --enable-comms=mpi \
-             --with-gmp=<path> \
-             --with-mpfr=<path> \
              --enable-mkl \
              CXX=CC CC=cc
-```
\ No newline at end of file
+```
+
+If GMP and MPFR are NOT in standard places (e.g. /usr/) these flags may be needed:
+``` bash
+             --with-gmp=<path> \
+             --with-mpfr=<path> \
+```
+where `<path>` is the UNIX prefix where GMP and MPFR are installed.
+
+Knights Landing with Intel Omni-Path adapters, with two adapters per node,
+presently performs better with more than one rank per node, using shared memory
+for intra-node communication. This is the mpi3 communications implementation.
+We recommend four ranks per node for best performance, but the optimum depends on the local volume.
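+
+As an illustrative sketch (not a tuned recommendation), a four-ranks-per-node launch
+might look like the following; the `--mpi`, `--grid` and thread values here are assumed
+and must be adapted to your own partition and local volume:
+
+``` bash
+# hypothetical example: 4 ranks on one KNL node, 16 OpenMP threads per rank
+export OMP_NUM_THREADS=16
+mpirun -np 4 ./benchmarks/Benchmark_dwf --mpi 1.1.2.2 --grid 32.32.32.32 --threads 16
+```
+
+The corresponding `mpi3` build is configured as follows: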
+
+``` bash
+../configure --enable-precision=double\
+             --enable-simd=KNL \
+             --enable-comms=mpi3-auto \
+             --enable-mkl \
+             CXX=icpc MPICXX=mpiicpc
+```
+
+### Build setup for Intel Haswell Xeon platform
+
+The following configuration is recommended for the Intel Haswell platform:
+
+``` bash
+../configure --enable-precision=double\
+             --enable-simd=AVX2 \
+             --enable-comms=mpi3-auto \
+             --enable-mkl \
+             CXX=icpc MPICXX=mpiicpc
+```
+The MKL flag enables use of BLAS and FFTW from the Intel Math Kernels Library.
+
+If GMP and MPFR are NOT in standard places (e.g. /usr/) these flags may be needed:
+``` bash
+             --with-gmp=<path> \
+             --with-mpfr=<path> \
+```
+where `<path>` is the UNIX prefix where GMP and MPFR are installed.
+
+If you are working on a Cray machine that does not use the `mpiicpc` wrapper, please use:
+
+``` bash
+../configure --enable-precision=double\
+             --enable-simd=AVX2 \
+             --enable-comms=mpi3 \
+             --enable-mkl \
+             CXX=CC CC=cc
+```
+Since dual-socket nodes are commonplace, we recommend MPI-3 as the default with the use of
+one rank per socket. If using the Intel MPI library, threads should be pinned to NUMA domains using
+```
+  export I_MPI_PIN=1
+```
+This is the default.
+
+### Build setup for Intel Skylake Xeon platform
+
+The following configuration is recommended for the Intel Skylake platform:
+
+``` bash
+../configure --enable-precision=double\
+             --enable-simd=AVX512 \
+             --enable-comms=mpi3 \
+             --enable-mkl \
+             CXX=mpiicpc
+```
+The MKL flag enables use of BLAS and FFTW from the Intel Math Kernels Library.
+
+If GMP and MPFR are NOT in standard places (e.g. /usr/) these flags may be needed:
+``` bash
+             --with-gmp=<path> \
+             --with-mpfr=<path> \
+```
+where `<path>` is the UNIX prefix where GMP and MPFR are installed.
+
+If you are working on a Cray machine that does not use the `mpiicpc` wrapper, please use:
+
+``` bash
+../configure --enable-precision=double\
+             --enable-simd=AVX512 \
+             --enable-comms=mpi3 \
+             --enable-mkl \
+             CXX=CC CC=cc
+```
+Since dual-socket nodes are commonplace, we recommend MPI-3 as the default with the use of
+one rank per socket. If using the Intel MPI library, threads should be pinned to NUMA domains using
+```
+  export I_MPI_PIN=1
+```
+This is the default.
+
+#### Expected Skylake Gold 6148 dual socket (single prec, single node 20+20 cores) performance using NUMA MPI mapping:
+
+mpirun -n 2 benchmarks/Benchmark_dwf --grid 16.16.16.16 --mpi 2.1.1.1 --cacheblocking 2.2.2.2 --dslash-asm --shm 1024 --threads 18
+
+TBA
+
+
+### Build setup for AMD EPYC / RYZEN
+
+The AMD EPYC is a multichip module comprising 32 cores spread over four distinct chips, each with 8 cores.
+So, even a single-socket node contains a quad-chip module. Dual-socket nodes with 64 cores total
+are common. Each chip within the module exposes a separate NUMA domain.
+There are four NUMA domains per socket and we recommend one MPI rank per NUMA domain.
+MPI-3 is recommended with the use of four ranks per socket,
+and 8 threads per rank.
+
+The following configuration is recommended for the AMD EPYC platform:
+
+``` bash
+../configure --enable-precision=double\
+             --enable-simd=AVX2 \
+             --enable-comms=mpi3 \
+             CXX=mpicxx
+```
+
+If GMP and MPFR are NOT in standard places (e.g. /usr/) these flags may be needed:
+``` bash
+             --with-gmp=<path> \
+             --with-mpfr=<path> \
+```
+where `<path>` is the UNIX prefix where GMP and MPFR are installed.
+
+Using MPICH and g++ v4.9.2, best performance can be obtained using explicit GOMP_CPU_AFFINITY flags for each MPI rank.
+This can be done by invoking MPI via a wrapper script, omp_bind.sh, which sets the affinity for each rank.
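+
+For reference, GOMP_CPU_AFFINITY is the GNU OpenMP (libgomp) environment variable that
+binds successive OpenMP threads to the listed logical CPU ids, in order. A minimal
+sketch, assuming even-numbered core ids (the numbering is machine dependent; check it
+with `lscpu` or `numactl --hardware` before copying these ids):
+
+``` bash
+# assumed core numbering: bind 8 OpenMP threads to even-numbered cores 0-14
+export GOMP_CPU_AFFINITY="0 2 4 6 8 10 12 14"
+```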
+
+It is recommended to run 8 MPI ranks on a single dual-socket AMD EPYC, with 8 threads per rank, using MPI3 and
+shared memory to communicate within this node:
+
+mpirun -np 8 ./omp_bind.sh ./Benchmark_dwf --mpi 2.2.2.1 --dslash-unroll --threads 8 --grid 16.16.16.16 --cacheblocking 4.4.4.4
+
+where `omp_bind.sh` does the following:
+```
+#!/bin/bash
+
+numanode=` expr $PMI_RANK % 8 `
+basecore=`expr $numanode \* 16`
+core0=`expr $basecore + 0 `
+core1=`expr $basecore + 2 `
+core2=`expr $basecore + 4 `
+core3=`expr $basecore + 6 `
+core4=`expr $basecore + 8 `
+core5=`expr $basecore + 10 `
+core6=`expr $basecore + 12 `
+core7=`expr $basecore + 14 `
+
+export GOMP_CPU_AFFINITY="$core0 $core1 $core2 $core3 $core4 $core5 $core6 $core7"
+echo GOMP_CPU_AFFINITY $GOMP_CPU_AFFINITY
+
+$@
+```
+
+Performance:
+
+#### Expected AMD EPYC 7601 dual socket (single prec, single node 32+32 cores) performance using NUMA MPI mapping:
+
+mpirun -np 8 ./omp_bind.sh ./Benchmark_dwf --threads 8 --mpi 2.2.2.1 --dslash-unroll --grid 16.16.16.16 --cacheblocking 4.4.4.4
+
+TBA
+
+### Build setup for BlueGene/Q
+
+To be written...
+
+### Build setup for ARM Neon
+
+To be written...
+
+### Build setup for laptops, other compilers, non-cluster builds
+
+Many versions of g++ and clang++ work with Grid: simply replace CXX (and MPICXX) accordingly
+and omit the `--enable-mkl` flag.
+
+Single node builds are enabled with
+```
+  --enable-comms=none
+```
+
+FFTW support that is not in the default search path may then be enabled with
+```
+  --with-fftw=<path>
+```
+
+BLAS will not be compiled in by default, and Lanczos will default to Eigen diagonalisation.
+
diff --git a/TODO b/TODO
index 001c6c0c..cccc5f45 100644
--- a/TODO
+++ b/TODO
@@ -2,18 +2,18 @@ TODO:
 ---------------
 Large item work list:
-1)- MultiRHS with spread out extra dim -- Go through filesystem with SciDAC I/O
+1)- BG/Q port and check
 2)- Christoph's local basis expansion Lanczos
-3)- BG/Q port and check
-4)- Precision conversion and sort out localConvert <-- partial
+3)- Precision conversion and sort out localConvert <-- partial
 - Consistent linear solver flop count/rate -- PARTIAL, time but no flop/s yet
-5)- Physical propagator interface
-6)- Conserved currents
-7)- Multigrid Wilson and DWF, compare to other Multigrid implementations
-8)- HDCR resume
+4)- Physical propagator interface
+5)- Conserved currents
+6)- Multigrid Wilson and DWF, compare to other Multigrid implementations
+7)- HDCR resume
 Recent DONE
+-- MultiRHS with spread out extra dim -- Go through filesystem with SciDAC I/O
 -- Lanczos Remove DenseVector, DenseMatrix; Use Eigen instead. <-- DONE
 -- GaugeFix into central location <-- DONE
 -- Scidac and Ildg metadata handling <-- DONE
diff --git a/benchmarks/Benchmark_ITT.cc b/benchmarks/Benchmark_ITT.cc
new file mode 100644
index 00000000..91524149
--- /dev/null
+++ b/benchmarks/Benchmark_ITT.cc
@@ -0,0 +1,518 @@
+    /*************************************************************************************
+
+    Grid physics library, www.github.com/paboyle/Grid
+
+    Source file: ./benchmarks/Benchmark_memory_bandwidth.cc
+
+    Copyright (C) 2015
+
+Author: Peter Boyle
+Author: paboyle
+
+    This program is free software; you can redistribute it and/or modify
+    it under the terms of the GNU General Public License as published by
+    the Free Software Foundation; either version 2 of the License, or
+    (at your option) any later version.
+ + This program is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + You should have received a copy of the GNU General Public License along + with this program; if not, write to the Free Software Foundation, Inc., + 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA. + + See the full license in the file "LICENSE" in the top level distribution directory + *************************************************************************************/ + /* END LEGAL */ +#include + +using namespace std; +using namespace Grid; +using namespace Grid::QCD; + + +struct time_statistics{ + double mean; + double err; + double min; + double max; + + void statistics(std::vector v){ + double sum = std::accumulate(v.begin(), v.end(), 0.0); + mean = sum / v.size(); + + std::vector diff(v.size()); + std::transform(v.begin(), v.end(), diff.begin(), [=](double x) { return x - mean; }); + double sq_sum = std::inner_product(diff.begin(), diff.end(), diff.begin(), 0.0); + err = std::sqrt(sq_sum / (v.size()*(v.size() - 1))); + + auto result = std::minmax_element(v.begin(), v.end()); + min = *result.first; + max = *result.second; +} +}; + +void comms_header(){ + std::cout < simd_layout = GridDefaultSimd(Nd,vComplexD::Nsimd()); + std::vector mpi_layout = GridDefaultMpi(); + + std::vector t_time(Nloop); + time_statistics timestat; + + std::cout< latt_size ({lat*mpi_layout[0], + lat*mpi_layout[1], + lat*mpi_layout[2], + lat*mpi_layout[3]}); + + GridCartesian Grid(latt_size,simd_layout,mpi_layout); + RealD Nrank = Grid._Nprocessors; + RealD Nnode = Grid.NodeCount(); + RealD ppn = Nrank/Nnode; + + std::vector xbuf(8); + std::vector rbuf(8); + Grid.ShmBufferFreeAll(); + for(int d=0;d<8;d++){ + xbuf[d] = (HalfSpinColourVectorD *)Grid.ShmBufferMalloc(lat*lat*lat*Ls*sizeof(HalfSpinColourVectorD)); + rbuf[d] = (HalfSpinColourVectorD *)Grid.ShmBufferMalloc(lat*lat*lat*Ls*sizeof(HalfSpinColourVectorD)); + bzero((void *)xbuf[d],lat*lat*lat*Ls*sizeof(HalfSpinColourVectorD)); + bzero((void *)rbuf[d],lat*lat*lat*Ls*sizeof(HalfSpinColourVectorD)); + } + + int ncomm; + int bytes=lat*lat*lat*Ls*sizeof(HalfSpinColourVectorD); + double dbytes; + for(int i=0;i requests; + dbytes=0; + ncomm=0; + + parallel_for(int dir=0;dir<8;dir++){ + + double tbytes; + int mu =dir % 4; + + if (mpi_layout[mu]>1 ) { + + ncomm++; + int xmit_to_rank; + int recv_from_rank; + if ( dir == mu ) { + int comm_proc=1; + Grid.ShiftedRanks(mu,comm_proc,xmit_to_rank,recv_from_rank); + } else { + int comm_proc = mpi_layout[mu]-1; + Grid.ShiftedRanks(mu,comm_proc,xmit_to_rank,recv_from_rank); + } +#if 0 + tbytes= Grid.StencilSendToRecvFromBegin(requests, + (void *)&xbuf[dir][0], + xmit_to_rank, + (void *)&rbuf[dir][0], + recv_from_rank, + bytes,dir); + Grid.StencilSendToRecvFromComplete(requests,dir); +#endif + requests.resize(0); + +#pragma omp atomic + dbytes+=tbytes; + } + } + Grid.Barrier(); + double stop=usecond(); + t_time[i] = stop-start; // microseconds + } + + timestat.statistics(t_time); + + dbytes=dbytes*ppn; + double xbytes = dbytes*0.5; + double rbytes = dbytes*0.5; + double bidibytes = dbytes; + + + std::cout< > LatticeVec; + typedef iVector Vec; + + std::vector simd_layout = GridDefaultSimd(Nd,vReal::Nsimd()); + std::vector mpi_layout = GridDefaultMpi(); + + std::cout<({45,12,81,9})); + for(int lat=8;lat<=lmax;lat+=4){ + + std::vector latt_size 
({lat*mpi_layout[0],lat*mpi_layout[1],lat*mpi_layout[2],lat*mpi_layout[3]}); + int64_t vol= latt_size[0]*latt_size[1]*latt_size[2]*latt_size[3]; + GridCartesian Grid(latt_size,simd_layout,mpi_layout); + + Vec rn ; random(sRNG,rn); + + LatticeVec z(&Grid); z=rn; + LatticeVec x(&Grid); x=rn; + LatticeVec y(&Grid); y=rn; + double a=2.0; + + uint64_t Nloop=NLOOP; + + double start=usecond(); + for(int i=0;i mpi = GridDefaultMpi(); assert(mpi.size()==4); + std::vector local({L,L,L,L}); + + GridCartesian * TmpGrid = SpaceTimeGrid::makeFourDimGrid(std::vector({64,64,64,64}), + GridDefaultSimd(Nd,vComplex::Nsimd()),GridDefaultMpi()); + uint64_t NP = TmpGrid->RankCount(); + uint64_t NN = TmpGrid->NodeCount(); + uint64_t SHM=NP/NN; + + std::vector internal; + if ( SHM == 1 ) internal = std::vector({1,1,1,1}); + else if ( SHM == 2 ) internal = std::vector({2,1,1,1}); + else if ( SHM == 4 ) internal = std::vector({2,2,1,1}); + else if ( SHM == 8 ) internal = std::vector({2,2,2,1}); + else assert(0); + + std::vector nodes({mpi[0]/internal[0],mpi[1]/internal[1],mpi[2]/internal[2],mpi[3]/internal[3]}); + std::vector latt4({local[0]*nodes[0],local[1]*nodes[1],local[2]*nodes[2],local[3]*nodes[3]}); + + ///////// Welcome message //////////// + std::cout< seeds4({1,2,3,4}); + std::vector seeds5({5,6,7,8}); + GridParallelRNG RNG4(UGrid); RNG4.SeedFixedIntegers(seeds4); + GridParallelRNG RNG5(FGrid); RNG5.SeedFixedIntegers(seeds5); + std::cout << GridLogMessage << "Initialised RNGs" << std::endl; + + ///////// Source preparation //////////// + LatticeFermion src (FGrid); random(RNG5,src); + LatticeFermion ref (FGrid); + LatticeFermion tmp (FGrid); + + RealD N2 = 1.0/::sqrt(norm2(src)); + src = src*N2; + + LatticeGaugeField Umu(UGrid); SU3::HotConfiguration(RNG4,Umu); + + DomainWallFermionR Dw(Umu,*FGrid,*FrbGrid,*UGrid,*UrbGrid,mass,M5); + + //////////////////////////////////// + // Naive wilson implementation + //////////////////////////////////// + { + LatticeGaugeField Umu5d(FGrid); + std::vector U(4,FGrid); + for(int ss=0;ssoSites();ss++){ + for(int s=0;s(Umu5d,mu); + } + for(int mu=0;muBarrier(); + for(int i=0;iBarrier(); + double t1=usecond(); + uint64_t ncall = (uint64_t) 2.5*1000.0*1000.0*nwarm/(t1-t0); + FGrid->Broadcast(0,&ncall,sizeof(ncall)); + + // std::cout << GridLogMessage << " Estimate " << ncall << " calls per second"< t_time(ncall); + for(uint64_t i=0;iBarrier(); + + double volume=Ls; for(int mu=0;mumflops_best ) mflops_best = mflops; + if ( mflops({2,2,2,2}); + + Benchmark::Decomposition(); + + int do_memory=1; + int do_comms =1; + int do_su3 =0; + int do_wilson=1; + int do_dwf =1; + + if ( do_memory ) { + std::cout<Barrier(); Dw.ZeroCounters(); @@ -302,6 +302,7 @@ int main (int argc, char ** argv) std::cout<< "sD ERR \n " << err < latt_size ({lat*mpi_layout[0],lat*mpi_layout[1],lat*mpi_layout[2],lat*mpi_layout[3]}); - int vol = latt_size[0]*latt_size[1]*latt_size[2]*latt_size[3]; + int64_t vol= latt_size[0]*latt_size[1]*latt_size[2]*latt_size[3]; GridCartesian Grid(latt_size,simd_layout,mpi_layout); uint64_t Nloop=NLOOP; - // GridParallelRNG pRNG(&Grid); pRNG.SeedFixedIntegers(std::vector({45,12,81,9}); + // GridParallelRNG pRNG(&Grid); pRNG.SeedFixedIntegers(std::vector({45,12,81,9})); - LatticeVec z(&Grid); //random(pRNG,z); - LatticeVec x(&Grid); //random(pRNG,x); - LatticeVec y(&Grid); //random(pRNG,y); + LatticeVec z(&Grid);// random(pRNG,z); + LatticeVec x(&Grid);// random(pRNG,x); + LatticeVec y(&Grid);// random(pRNG,y); double a=2.0; @@ -83,7 +83,7 @@ int main (int argc, char ** 
argv) double time = (stop-start)/Nloop*1000; double flops=vol*Nvec*2;// mul,add - double bytes=3*vol*Nvec*sizeof(Real); + double bytes=3.0*vol*Nvec*sizeof(Real); std::cout< latt_size ({lat*mpi_layout[0],lat*mpi_layout[1],lat*mpi_layout[2],lat*mpi_layout[3]}); - int vol = latt_size[0]*latt_size[1]*latt_size[2]*latt_size[3]; + int64_t vol= latt_size[0]*latt_size[1]*latt_size[2]*latt_size[3]; GridCartesian Grid(latt_size,simd_layout,mpi_layout); - // GridParallelRNG pRNG(&Grid); pRNG.SeedFixedIntegers(std::vector({45,12,81,9}); + // GridParallelRNG pRNG(&Grid); pRNG.SeedFixedIntegers(std::vector({45,12,81,9})); - LatticeVec z(&Grid); //random(pRNG,z); - LatticeVec x(&Grid); //random(pRNG,x); - LatticeVec y(&Grid); //random(pRNG,y); + LatticeVec z(&Grid);// random(pRNG,z); + LatticeVec x(&Grid);// random(pRNG,x); + LatticeVec y(&Grid);// random(pRNG,y); double a=2.0; uint64_t Nloop=NLOOP; @@ -119,7 +119,7 @@ int main (int argc, char ** argv) double time = (stop-start)/Nloop*1000; double flops=vol*Nvec*2;// mul,add - double bytes=3*vol*Nvec*sizeof(Real); + double bytes=3.0*vol*Nvec*sizeof(Real); std::cout< latt_size ({lat*mpi_layout[0],lat*mpi_layout[1],lat*mpi_layout[2],lat*mpi_layout[3]}); - int vol = latt_size[0]*latt_size[1]*latt_size[2]*latt_size[3]; + int64_t vol= latt_size[0]*latt_size[1]*latt_size[2]*latt_size[3]; uint64_t Nloop=NLOOP; GridCartesian Grid(latt_size,simd_layout,mpi_layout); - // GridParallelRNG pRNG(&Grid); pRNG.SeedFixedIntegers(std::vector({45,12,81,9}); + // GridParallelRNG pRNG(&Grid); pRNG.SeedFixedIntegers(std::vector({45,12,81,9})); - LatticeVec z(&Grid); //random(pRNG,z); - LatticeVec x(&Grid); //random(pRNG,x); - LatticeVec y(&Grid); //random(pRNG,y); + LatticeVec z(&Grid);// random(pRNG,z); + LatticeVec x(&Grid);// random(pRNG,x); + LatticeVec y(&Grid);// random(pRNG,y); RealD a=2.0; @@ -154,7 +154,7 @@ int main (int argc, char ** argv) double stop=usecond(); double time = (stop-start)/Nloop*1000; - double bytes=2*vol*Nvec*sizeof(Real); + double bytes=2.0*vol*Nvec*sizeof(Real); double flops=vol*Nvec*1;// mul std::cout< latt_size ({lat*mpi_layout[0],lat*mpi_layout[1],lat*mpi_layout[2],lat*mpi_layout[3]}); - int vol = latt_size[0]*latt_size[1]*latt_size[2]*latt_size[3]; + int64_t vol= latt_size[0]*latt_size[1]*latt_size[2]*latt_size[3]; uint64_t Nloop=NLOOP; GridCartesian Grid(latt_size,simd_layout,mpi_layout); - // GridParallelRNG pRNG(&Grid); pRNG.SeedFixedIntegers(std::vector({45,12,81,9}); - LatticeVec z(&Grid); //random(pRNG,z); - LatticeVec x(&Grid); //random(pRNG,x); - LatticeVec y(&Grid); //random(pRNG,y); + // GridParallelRNG pRNG(&Grid); pRNG.SeedFixedIntegers(std::vector({45,12,81,9})); + LatticeVec z(&Grid);// random(pRNG,z); + LatticeVec x(&Grid);// random(pRNG,x); + LatticeVec y(&Grid);// random(pRNG,y); RealD a=2.0; Real nn; double start=usecond(); @@ -187,7 +187,7 @@ int main (int argc, char ** argv) double stop=usecond(); double time = (stop-start)/Nloop*1000; - double bytes=vol*Nvec*sizeof(Real); + double bytes=1.0*vol*Nvec*sizeof(Real); double flops=vol*Nvec*2;// mul,add std::cout< simd_layout = GridDefaultSimd(Nd,vComplex::Nsimd()); std::vector mpi_layout = GridDefaultMpi(); - int threads = GridThread::GetThreads(); + int64_t threads = GridThread::GetThreads(); std::cout< latt_size ({lat*mpi_layout[0],lat*mpi_layout[1],lat*mpi_layout[2],lat*mpi_layout[3]}); - int vol = latt_size[0]*latt_size[1]*latt_size[2]*latt_size[3]; + int64_t vol = latt_size[0]*latt_size[1]*latt_size[2]*latt_size[3]; GridCartesian Grid(latt_size,simd_layout,mpi_layout); - 
// GridParallelRNG pRNG(&Grid); pRNG.SeedFixedIntegers(std::vector({45,12,81,9}); + GridParallelRNG pRNG(&Grid); pRNG.SeedFixedIntegers(std::vector({45,12,81,9})); - LatticeColourMatrix z(&Grid);// random(pRNG,z); - LatticeColourMatrix x(&Grid);// random(pRNG,x); - LatticeColourMatrix y(&Grid);// random(pRNG,y); + LatticeColourMatrix z(&Grid); random(pRNG,z); + LatticeColourMatrix x(&Grid); random(pRNG,x); + LatticeColourMatrix y(&Grid); random(pRNG,y); double start=usecond(); - for(int i=0;i latt_size ({lat*mpi_layout[0],lat*mpi_layout[1],lat*mpi_layout[2],lat*mpi_layout[3]}); - int vol = latt_size[0]*latt_size[1]*latt_size[2]*latt_size[3]; + int64_t vol = latt_size[0]*latt_size[1]*latt_size[2]*latt_size[3]; GridCartesian Grid(latt_size,simd_layout,mpi_layout); - // GridParallelRNG pRNG(&Grid); pRNG.SeedFixedIntegers(std::vector({45,12,81,9}); + GridParallelRNG pRNG(&Grid); pRNG.SeedFixedIntegers(std::vector({45,12,81,9})); - LatticeColourMatrix z(&Grid); //random(pRNG,z); - LatticeColourMatrix x(&Grid); //random(pRNG,x); - LatticeColourMatrix y(&Grid); //random(pRNG,y); + LatticeColourMatrix z(&Grid); random(pRNG,z); + LatticeColourMatrix x(&Grid); random(pRNG,x); + LatticeColourMatrix y(&Grid); random(pRNG,y); double start=usecond(); - for(int i=0;i latt_size ({lat*mpi_layout[0],lat*mpi_layout[1],lat*mpi_layout[2],lat*mpi_layout[3]}); - int vol = latt_size[0]*latt_size[1]*latt_size[2]*latt_size[3]; + int64_t vol = latt_size[0]*latt_size[1]*latt_size[2]*latt_size[3]; GridCartesian Grid(latt_size,simd_layout,mpi_layout); - // GridParallelRNG pRNG(&Grid); pRNG.SeedFixedIntegers(std::vector({45,12,81,9}); + GridParallelRNG pRNG(&Grid); pRNG.SeedFixedIntegers(std::vector({45,12,81,9})); - LatticeColourMatrix z(&Grid); //random(pRNG,z); - LatticeColourMatrix x(&Grid); //random(pRNG,x); - LatticeColourMatrix y(&Grid); //random(pRNG,y); + LatticeColourMatrix z(&Grid); random(pRNG,z); + LatticeColourMatrix x(&Grid); random(pRNG,x); + LatticeColourMatrix y(&Grid); random(pRNG,y); double start=usecond(); - for(int i=0;i latt_size ({lat*mpi_layout[0],lat*mpi_layout[1],lat*mpi_layout[2],lat*mpi_layout[3]}); - int vol = latt_size[0]*latt_size[1]*latt_size[2]*latt_size[3]; + int64_t vol = latt_size[0]*latt_size[1]*latt_size[2]*latt_size[3]; GridCartesian Grid(latt_size,simd_layout,mpi_layout); - // GridParallelRNG pRNG(&Grid); pRNG.SeedFixedIntegers(std::vector({45,12,81,9}); + GridParallelRNG pRNG(&Grid); pRNG.SeedFixedIntegers(std::vector({45,12,81,9})); - LatticeColourMatrix z(&Grid); //random(pRNG,z); - LatticeColourMatrix x(&Grid); //random(pRNG,x); - LatticeColourMatrix y(&Grid); //random(pRNG,y); + LatticeColourMatrix z(&Grid); random(pRNG,z); + LatticeColourMatrix x(&Grid); random(pRNG,x); + LatticeColourMatrix y(&Grid); random(pRNG,y); double start=usecond(); - for(int i=0;i]]) AC_CHECK_DECLS([be64toh],[], [], [[#include ]]) +############## Standard libraries +AC_CHECK_LIB([m],[cos]) +AC_CHECK_LIB([stdc++],[abort]) + ############### GMP and MPFR AC_ARG_WITH([gmp], [AS_HELP_STRING([--with-gmp=prefix], @@ -186,9 +194,14 @@ Info at: http://usqcd.jlab.org/usqcd-docs/c-lime/)]) AC_SEARCH_LIBS([crc32], [z], [AC_DEFINE([HAVE_ZLIB], [1], [Define to 1 if you have the `LIBZ' library])] - [have_zlib=true], + [have_zlib=true] [LIBS="${LIBS} -lz"], [AC_MSG_ERROR(zlib library was not found in your system.)]) +AC_SEARCH_LIBS([move_pages], [numa], + [AC_DEFINE([HAVE_LIBNUMA], [1], [Define to 1 if you have the `LIBNUMA' library])] + [have_libnuma=true] [LIBS="${LIBS} -lnuma"], + [AC_MSG_WARN(libnuma library 
was not found in your system. Some optimisations will not apply)]) + AC_SEARCH_LIBS([H5Fopen], [hdf5_cpp], [AC_DEFINE([HAVE_HDF5], [1], [Define to 1 if you have the `HDF5' library])] [have_hdf5=true] @@ -241,6 +254,7 @@ case ${ax_cv_cxx_compiler_vendor} in SIMD_FLAGS='';; KNL) AC_DEFINE([AVX512],[1],[AVX512 intrinsics]) + AC_DEFINE([KNL],[1],[Knights landing processor]) SIMD_FLAGS='-march=knl';; GEN) AC_DEFINE([GEN],[1],[generic vector code]) @@ -248,6 +262,9 @@ case ${ax_cv_cxx_compiler_vendor} in [generic SIMD vector width (in bytes)]) SIMD_GEN_WIDTH_MSG=" (width= $ac_gen_simd_width)" SIMD_FLAGS='';; + NEONv8) + AC_DEFINE([NEONV8],[1],[ARMv8 NEON]) + SIMD_FLAGS='-march=armv8-a';; QPX|BGQ) AC_DEFINE([QPX],[1],[QPX intrinsics for BG/Q]) SIMD_FLAGS='';; @@ -276,6 +293,7 @@ case ${ax_cv_cxx_compiler_vendor} in SIMD_FLAGS='';; KNL) AC_DEFINE([AVX512],[1],[AVX512 intrinsics for Knights Landing]) + AC_DEFINE([KNL],[1],[Knights landing processor]) SIMD_FLAGS='-xmic-avx512';; GEN) AC_DEFINE([GEN],[1],[generic vector code]) diff --git a/lib/algorithms/iterative/BlockConjugateGradient.h b/lib/algorithms/iterative/BlockConjugateGradient.h index 9418f63c..d7817c05 100644 --- a/lib/algorithms/iterative/BlockConjugateGradient.h +++ b/lib/algorithms/iterative/BlockConjugateGradient.h @@ -199,7 +199,12 @@ void BlockCGrQsolve(LinearOperatorBase &Linop, const Field &B, Field &X) Linop.HermOp(X, AD); tmp = B - AD; + //std::cout << GridLogMessage << " initial tmp " << norm2(tmp)<< std::endl; ThinQRfact (m_rr, m_C, m_Cinv, Q, tmp); + //std::cout << GridLogMessage << " initial Q " << norm2(Q)<< std::endl; + //std::cout << GridLogMessage << " m_rr " << m_rr< &Linop, const Field &B, Field &X) MatrixTimer.Start(); Linop.HermOp(D, Z); MatrixTimer.Stop(); + //std::cout << GridLogMessage << " norm2 Z " < &dimensions, - const std::vector &simd_layout, - const std::vector &processor_grid - ) : GridBase(processor_grid) + const std::vector &simd_layout, + const std::vector &processor_grid) : GridBase(processor_grid) { - /////////////////////// - // Grid information - /////////////////////// - _ndimension = dimensions.size(); - - _fdimensions.resize(_ndimension); - _gdimensions.resize(_ndimension); - _ldimensions.resize(_ndimension); - _rdimensions.resize(_ndimension); - _simd_layout.resize(_ndimension); - _lstart.resize(_ndimension); - _lend.resize(_ndimension); - - _ostride.resize(_ndimension); - _istride.resize(_ndimension); - - _fsites = _gsites = _osites = _isites = 1; + /////////////////////// + // Grid information + /////////////////////// + _ndimension = dimensions.size(); - for(int d=0;d<_ndimension;d++){ - _fdimensions[d] = dimensions[d]; // Global dimensions - _gdimensions[d] = _fdimensions[d]; // Global dimensions - _simd_layout[d] = simd_layout[d]; - _fsites = _fsites * _fdimensions[d]; - _gsites = _gsites * _gdimensions[d]; + _fdimensions.resize(_ndimension); + _gdimensions.resize(_ndimension); + _ldimensions.resize(_ndimension); + _rdimensions.resize(_ndimension); + _simd_layout.resize(_ndimension); + _lstart.resize(_ndimension); + _lend.resize(_ndimension); - //FIXME check for exact division + _ostride.resize(_ndimension); + _istride.resize(_ndimension); - // Use a reduced simd grid - _ldimensions[d]= _gdimensions[d]/_processors[d]; //local dimensions - _rdimensions[d]= _ldimensions[d]/_simd_layout[d]; //overdecomposition - _lstart[d] = _processor_coor[d]*_ldimensions[d]; - _lend[d] = _processor_coor[d]*_ldimensions[d]+_ldimensions[d]-1; - _osites *= _rdimensions[d]; - _isites *= _simd_layout[d]; - - // 
Addressing support - if ( d==0 ) { - _ostride[d] = 1; - _istride[d] = 1; - } else { - _ostride[d] = _ostride[d-1]*_rdimensions[d-1]; - _istride[d] = _istride[d-1]*_simd_layout[d-1]; - } + _fsites = _gsites = _osites = _isites = 1; + + for (int d = 0; d < _ndimension; d++) + { + _fdimensions[d] = dimensions[d]; // Global dimensions + _gdimensions[d] = _fdimensions[d]; // Global dimensions + _simd_layout[d] = simd_layout[d]; + _fsites = _fsites * _fdimensions[d]; + _gsites = _gsites * _gdimensions[d]; + + // Use a reduced simd grid + _ldimensions[d] = _gdimensions[d] / _processors[d]; //local dimensions + assert(_ldimensions[d] * _processors[d] == _gdimensions[d]); + + _rdimensions[d] = _ldimensions[d] / _simd_layout[d]; //overdecomposition + assert(_rdimensions[d] * _simd_layout[d] == _ldimensions[d]); + + _lstart[d] = _processor_coor[d] * _ldimensions[d]; + _lend[d] = _processor_coor[d] * _ldimensions[d] + _ldimensions[d] - 1; + _osites *= _rdimensions[d]; + _isites *= _simd_layout[d]; + + // Addressing support + if (d == 0) + { + _ostride[d] = 1; + _istride[d] = 1; } - - /////////////////////// - // subplane information - /////////////////////// - _slice_block.resize(_ndimension); - _slice_stride.resize(_ndimension); - _slice_nblock.resize(_ndimension); - - int block =1; - int nblock=1; - for(int d=0;d<_ndimension;d++) nblock*=_rdimensions[d]; - - for(int d=0;d<_ndimension;d++){ - nblock/=_rdimensions[d]; - _slice_block[d] =block; - _slice_stride[d]=_ostride[d]*_rdimensions[d]; - _slice_nblock[d]=nblock; - block = block*_rdimensions[d]; + else + { + _ostride[d] = _ostride[d - 1] * _rdimensions[d - 1]; + _istride[d] = _istride[d - 1] * _simd_layout[d - 1]; } + } + /////////////////////// + // subplane information + /////////////////////// + _slice_block.resize(_ndimension); + _slice_stride.resize(_ndimension); + _slice_nblock.resize(_ndimension); + + int block = 1; + int nblock = 1; + for (int d = 0; d < _ndimension; d++) + nblock *= _rdimensions[d]; + + for (int d = 0; d < _ndimension; d++) + { + nblock /= _rdimensions[d]; + _slice_block[d] = block; + _slice_stride[d] = _ostride[d] * _rdimensions[d]; + _slice_nblock[d] = nblock; + block = block * _rdimensions[d]; + } }; }; - - } #endif diff --git a/lib/cartesian/Cartesian_red_black.h b/lib/cartesian/Cartesian_red_black.h index 3037de00..b1a5b9ef 100644 --- a/lib/cartesian/Cartesian_red_black.h +++ b/lib/cartesian/Cartesian_red_black.h @@ -131,21 +131,21 @@ public: Init(dimensions,simd_layout,processor_grid,checker_dim_mask,0); } void Init(const std::vector &dimensions, - const std::vector &simd_layout, - const std::vector &processor_grid, - const std::vector &checker_dim_mask, - int checker_dim) + const std::vector &simd_layout, + const std::vector &processor_grid, + const std::vector &checker_dim_mask, + int checker_dim) { - /////////////////////// - // Grid information - /////////////////////// + /////////////////////// + // Grid information + /////////////////////// _checker_dim = checker_dim; - assert(checker_dim_mask[checker_dim]==1); + assert(checker_dim_mask[checker_dim] == 1); _ndimension = dimensions.size(); - assert(checker_dim_mask.size()==_ndimension); - assert(processor_grid.size()==_ndimension); - assert(simd_layout.size()==_ndimension); - + assert(checker_dim_mask.size() == _ndimension); + assert(processor_grid.size() == _ndimension); + assert(simd_layout.size() == _ndimension); + _fdimensions.resize(_ndimension); _gdimensions.resize(_ndimension); _ldimensions.resize(_ndimension); @@ -153,114 +153,133 @@ public: 
_simd_layout.resize(_ndimension); _lstart.resize(_ndimension); _lend.resize(_ndimension); - + _ostride.resize(_ndimension); _istride.resize(_ndimension); - + _fsites = _gsites = _osites = _isites = 1; - - _checker_dim_mask=checker_dim_mask; - for(int d=0;d<_ndimension;d++){ - _fdimensions[d] = dimensions[d]; - _gdimensions[d] = _fdimensions[d]; - _fsites = _fsites * _fdimensions[d]; - _gsites = _gsites * _gdimensions[d]; - - if (d==_checker_dim) { - _gdimensions[d] = _gdimensions[d]/2; // Remove a checkerboard - } - _ldimensions[d] = _gdimensions[d]/_processors[d]; - _lstart[d] = _processor_coor[d]*_ldimensions[d]; - _lend[d] = _processor_coor[d]*_ldimensions[d]+_ldimensions[d]-1; + _checker_dim_mask = checker_dim_mask; - // Use a reduced simd grid - _simd_layout[d] = simd_layout[d]; - _rdimensions[d]= _ldimensions[d]/_simd_layout[d]; - assert(_rdimensions[d]>0); + for (int d = 0; d < _ndimension; d++) + { + _fdimensions[d] = dimensions[d]; + _gdimensions[d] = _fdimensions[d]; + _fsites = _fsites * _fdimensions[d]; + _gsites = _gsites * _gdimensions[d]; - // all elements of a simd vector must have same checkerboard. - // If Ls vectorised, this must still be the case; e.g. dwf rb5d - if ( _simd_layout[d]>1 ) { - if ( checker_dim_mask[d] ) { - assert( (_rdimensions[d]&0x1) == 0 ); - } - } + if (d == _checker_dim) + { + assert((_gdimensions[d] & 0x1) == 0); + _gdimensions[d] = _gdimensions[d] / 2; // Remove a checkerboard + } + _ldimensions[d] = _gdimensions[d] / _processors[d]; + assert(_ldimensions[d] * _processors[d] == _gdimensions[d]); + _lstart[d] = _processor_coor[d] * _ldimensions[d]; + _lend[d] = _processor_coor[d] * _ldimensions[d] + _ldimensions[d] - 1; - _osites *= _rdimensions[d]; - _isites *= _simd_layout[d]; - - // Addressing support - if ( d==0 ) { - _ostride[d] = 1; - _istride[d] = 1; - } else { - _ostride[d] = _ostride[d-1]*_rdimensions[d-1]; - _istride[d] = _istride[d-1]*_simd_layout[d-1]; - } + // Use a reduced simd grid + _simd_layout[d] = simd_layout[d]; + _rdimensions[d] = _ldimensions[d] / _simd_layout[d]; // this is not checking if this is integer + assert(_rdimensions[d] * _simd_layout[d] == _ldimensions[d]); + assert(_rdimensions[d] > 0); + // all elements of a simd vector must have same checkerboard. + // If Ls vectorised, this must still be the case; e.g. 
dwf rb5d + if (_simd_layout[d] > 1) + { + if (checker_dim_mask[d]) + { + assert((_rdimensions[d] & 0x1) == 0); + } + } + _osites *= _rdimensions[d]; + _isites *= _simd_layout[d]; + + // Addressing support + if (d == 0) + { + _ostride[d] = 1; + _istride[d] = 1; + } + else + { + _ostride[d] = _ostride[d - 1] * _rdimensions[d - 1]; + _istride[d] = _istride[d - 1] * _simd_layout[d - 1]; + } } - + //////////////////////////////////////////////////////////////////////////////////////////// // subplane information //////////////////////////////////////////////////////////////////////////////////////////// _slice_block.resize(_ndimension); _slice_stride.resize(_ndimension); _slice_nblock.resize(_ndimension); - - int block =1; - int nblock=1; - for(int d=0;d<_ndimension;d++) nblock*=_rdimensions[d]; - - for(int d=0;d<_ndimension;d++){ - nblock/=_rdimensions[d]; - _slice_block[d] =block; - _slice_stride[d]=_ostride[d]*_rdimensions[d]; - _slice_nblock[d]=nblock; - block = block*_rdimensions[d]; + + int block = 1; + int nblock = 1; + for (int d = 0; d < _ndimension; d++) + nblock *= _rdimensions[d]; + + for (int d = 0; d < _ndimension; d++) + { + nblock /= _rdimensions[d]; + _slice_block[d] = block; + _slice_stride[d] = _ostride[d] * _rdimensions[d]; + _slice_nblock[d] = nblock; + block = block * _rdimensions[d]; } //////////////////////////////////////////////// // Create a checkerboard lookup table //////////////////////////////////////////////// int rvol = 1; - for(int d=0;d<_ndimension;d++){ - rvol=rvol * _rdimensions[d]; + for (int d = 0; d < _ndimension; d++) + { + rvol = rvol * _rdimensions[d]; } _checker_board.resize(rvol); - for(int osite=0;osite<_osites;osite++){ - _checker_board[osite] = CheckerBoardFromOindex (osite); + for (int osite = 0; osite < _osites; osite++) + { + _checker_board[osite] = CheckerBoardFromOindex(osite); } - }; -protected: + + protected: virtual int oIndex(std::vector &coor) { - int idx=0; - for(int d=0;d<_ndimension;d++) { - if( d==_checker_dim ) { - idx+=_ostride[d]*((coor[d]/2)%_rdimensions[d]); - } else { - idx+=_ostride[d]*(coor[d]%_rdimensions[d]); - } + int idx = 0; + for (int d = 0; d < _ndimension; d++) + { + if (d == _checker_dim) + { + idx += _ostride[d] * ((coor[d] / 2) % _rdimensions[d]); + } + else + { + idx += _ostride[d] * (coor[d] % _rdimensions[d]); + } } return idx; }; - + virtual int iIndex(std::vector &lcoor) { - int idx=0; - for(int d=0;d<_ndimension;d++) { - if( d==_checker_dim ) { - idx+=_istride[d]*(lcoor[d]/(2*_rdimensions[d])); - } else { - idx+=_istride[d]*(lcoor[d]/_rdimensions[d]); - } - } - return idx; + int idx = 0; + for (int d = 0; d < _ndimension; d++) + { + if (d == _checker_dim) + { + idx += _istride[d] * (lcoor[d] / (2 * _rdimensions[d])); + } + else + { + idx += _istride[d] * (lcoor[d] / _rdimensions[d]); + } + } + return idx; } }; - } #endif diff --git a/lib/communicator/Communicator_mpi3.cc b/lib/communicator/Communicator_mpi3.cc index 632eb991..4192300b 100644 --- a/lib/communicator/Communicator_mpi3.cc +++ b/lib/communicator/Communicator_mpi3.cc @@ -37,7 +37,10 @@ Author: Peter Boyle #include #include #include -//#include +#include +#ifdef HAVE_NUMAIF_H +#include +#endif #ifndef SHM_HUGETLB #define SHM_HUGETLB 04000 #endif @@ -214,6 +217,25 @@ void CartesianCommunicator::Init(int *argc, char ***argv) { void * ptr = mmap(NULL,size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0); if ( ptr == MAP_FAILED ) { perror("failed mmap"); assert(0); } assert(((uint64_t)ptr&0x3F)==0); + + // Try to force numa domain on the shm segment if we 
have numaif.h +#ifdef HAVE_NUMAIF_H + int status; + int flags=MPOL_MF_MOVE; +#ifdef KNL + int nodes=1; // numa domain == MCDRAM + // Find out if in SNC2,SNC4 mode ? +#else + int nodes=r; // numa domain == MPI ID +#endif + unsigned long count=1; + for(uint64_t page=0;page &R,std::vector &a,const Lattice } }; +/* inline GridBase *makeSubSliceGrid(const GridBase *BlockSolverGrid,int Orthog) { int NN = BlockSolverGrid->_ndimension; @@ -387,6 +388,7 @@ inline GridBase *makeSubSliceGrid(const GridBase *BlockSolverGrid,int Or } return (GridBase *)new GridCartesian(latt_phys,simd_phys,mpi_phys); } +*/ template static void sliceMaddMatrix (Lattice &R,Eigen::MatrixXcd &aa,const Lattice &X,const Lattice &Y,int Orthog,RealD scale=1.0) @@ -398,14 +400,15 @@ static void sliceMaddMatrix (Lattice &R,Eigen::MatrixXcd &aa,const Lattice int Nblock = X._grid->GlobalDimensions()[Orthog]; GridBase *FullGrid = X._grid; - GridBase *SliceGrid = makeSubSliceGrid(FullGrid,Orthog); + // GridBase *SliceGrid = makeSubSliceGrid(FullGrid,Orthog); - Lattice Xslice(SliceGrid); - Lattice Rslice(SliceGrid); + // Lattice Xslice(SliceGrid); + // Lattice Rslice(SliceGrid); assert( FullGrid->_simd_layout[Orthog]==1); int nh = FullGrid->_ndimension; - int nl = SliceGrid->_ndimension; + // int nl = SliceGrid->_ndimension; + int nl = nh-1; //FIXME package in a convenient iterator //Should loop over a plane orthogonal to direction "Orthog" @@ -448,14 +451,14 @@ static void sliceMulMatrix (Lattice &R,Eigen::MatrixXcd &aa,const Lattice< int Nblock = X._grid->GlobalDimensions()[Orthog]; GridBase *FullGrid = X._grid; - GridBase *SliceGrid = makeSubSliceGrid(FullGrid,Orthog); - - Lattice Xslice(SliceGrid); - Lattice Rslice(SliceGrid); + // GridBase *SliceGrid = makeSubSliceGrid(FullGrid,Orthog); + // Lattice Xslice(SliceGrid); + // Lattice Rslice(SliceGrid); assert( FullGrid->_simd_layout[Orthog]==1); int nh = FullGrid->_ndimension; - int nl = SliceGrid->_ndimension; + // int nl = SliceGrid->_ndimension; + int nl=1; //FIXME package in a convenient iterator //Should loop over a plane orthogonal to direction "Orthog" @@ -498,18 +501,19 @@ static void sliceInnerProductMatrix( Eigen::MatrixXcd &mat, const Lattice typedef typename vobj::vector_type vector_type; GridBase *FullGrid = lhs._grid; - GridBase *SliceGrid = makeSubSliceGrid(FullGrid,Orthog); + // GridBase *SliceGrid = makeSubSliceGrid(FullGrid,Orthog); int Nblock = FullGrid->GlobalDimensions()[Orthog]; - Lattice Lslice(SliceGrid); - Lattice Rslice(SliceGrid); + // Lattice Lslice(SliceGrid); + // Lattice Rslice(SliceGrid); mat = Eigen::MatrixXcd::Zero(Nblock,Nblock); assert( FullGrid->_simd_layout[Orthog]==1); int nh = FullGrid->_ndimension; - int nl = SliceGrid->_ndimension; + // int nl = SliceGrid->_ndimension; + int nl = nh-1; //FIXME package in a convenient iterator //Should loop over a plane orthogonal to direction "Orthog" @@ -540,7 +544,8 @@ static void sliceInnerProductMatrix( Eigen::MatrixXcd &mat, const Lattice for(int i=0;i mat += mat_thread; } } + + for(int i=0;iGlobalSum(sum); + mat(i,j)=sum; + }} + return; } diff --git a/lib/parallelIO/BinaryIO.h b/lib/parallelIO/BinaryIO.h index 117bec01..f56f6514 100644 --- a/lib/parallelIO/BinaryIO.h +++ b/lib/parallelIO/BinaryIO.h @@ -98,35 +98,39 @@ class BinaryIO { NerscChecksum(grid,scalardata,nersc_csum); } - - template static inline void NerscChecksum(GridBase *grid,std::vector &fbuf,uint32_t &nersc_csum) + + template + static inline void NerscChecksum(GridBase *grid, std::vector &fbuf, uint32_t &nersc_csum) { - const uint64_t 
size32 = sizeof(fobj)/sizeof(uint32_t); + const uint64_t size32 = sizeof(fobj) / sizeof(uint32_t); - - uint64_t lsites =grid->lSites(); - if (fbuf.size()==1) { - lsites=1; + uint64_t lsites = grid->lSites(); + if (fbuf.size() == 1) + { + lsites = 1; } -#pragma omp parallel - { - uint32_t nersc_csum_thr=0; + #pragma omp parallel + { + uint32_t nersc_csum_thr = 0; -#pragma omp for - for(uint64_t local_site=0;local_site static inline void ScidacChecksum(GridBase *grid,std::vector &fbuf,uint32_t &scidac_csuma,uint32_t &scidac_csumb) { const uint64_t size32 = sizeof(fobj)/sizeof(uint32_t); @@ -266,7 +270,7 @@ class BinaryIO { grid->Barrier(); GridStopWatch timer; GridStopWatch bstimer; - + nersc_csum=0; scidac_csuma=0; scidac_csumb=0; @@ -362,18 +366,22 @@ class BinaryIO { #else assert(0); #endif - } else { - std::cout<< GridLogMessage<< "C++ read I/O "<< file<<" : " - << iodata.size()*sizeof(fobj)<<" bytes"< 1) ) { #ifdef USE_MPI_IO - std::cout<< GridLogMessage<< "MPI write I/O "<< file<< std::endl; - ierr=MPI_File_open(grid->communicator,(char *) file.c_str(), MPI_MODE_RDWR|MPI_MODE_CREATE,MPI_INFO_NULL, &fh); assert(ierr==0); - ierr=MPI_File_set_view(fh, disp, mpiObject, fileArray, "native", MPI_INFO_NULL); assert(ierr==0); - ierr=MPI_File_write_all(fh, &iodata[0], 1, localArray, &status); assert(ierr==0); - MPI_File_close(&fh); - MPI_Type_free(&fileArray); - MPI_Type_free(&localArray); + std::cout << GridLogMessage << "MPI write I/O " << file << std::endl; + ierr = MPI_File_open(grid->communicator, (char *)file.c_str(), MPI_MODE_RDWR | MPI_MODE_CREATE, MPI_INFO_NULL, &fh); + std::cout << GridLogMessage << "Checking for errors" << std::endl; + if (ierr != MPI_SUCCESS) + { + char error_string[BUFSIZ]; + int length_of_error_string, error_class; + + MPI_Error_class(ierr, &error_class); + MPI_Error_string(error_class, error_string, &length_of_error_string); + fprintf(stderr, "%3d: %s\n", myrank, error_string); + MPI_Error_string(ierr, error_string, &length_of_error_string); + fprintf(stderr, "%3d: %s\n", myrank, error_string); + MPI_Abort(MPI_COMM_WORLD, 1); //assert(ierr == 0); + } + + std::cout << GridLogDebug << "MPI read I/O set view " << file << std::endl; + ierr = MPI_File_set_view(fh, disp, mpiObject, fileArray, "native", MPI_INFO_NULL); + assert(ierr == 0); + + std::cout << GridLogDebug << "MPI read I/O write all " << file << std::endl; + ierr = MPI_File_write_all(fh, &iodata[0], 1, localArray, &status); + assert(ierr == 0); + + MPI_File_close(&fh); + MPI_Type_free(&fileArray); + MPI_Type_free(&localArray); #else assert(0); #endif } else { - std::ofstream fout; fout.open(file,std::ios::binary|std::ios::out|std::ios::in); - std::cout<< GridLogMessage<< "C++ write I/O "<< file<<" : " - << iodata.size()*sizeof(fobj)<<" bytes"<Barrier(); - grid->GlobalSum(nersc_csum); - grid->GlobalXOR(scidac_csuma); - grid->GlobalXOR(scidac_csumb); - grid->Barrier(); + // if the data size is 1 we do not want to sum over the MPI ranks + if (iodata.size() != 1){ + grid->Barrier(); + grid->GlobalSum(nersc_csum); + grid->GlobalXOR(scidac_csuma); + grid->GlobalXOR(scidac_csumb); + grid->Barrier(); + } } ///////////////////////////////////////////////////////////////////////////// @@ -546,9 +605,9 @@ class BinaryIO { int gsites = grid->gSites(); int lsites = grid->lSites(); - uint32_t nersc_csum_tmp; - uint32_t scidac_csuma_tmp; - uint32_t scidac_csumb_tmp; + uint32_t nersc_csum_tmp = 0; + uint32_t scidac_csuma_tmp = 0; + uint32_t scidac_csumb_tmp = 0; GridStopWatch timer; diff --git a/lib/perfmon/PerfCount.cc 
b/lib/perfmon/PerfCount.cc index 4778295a..c6f92b9f 100644 --- a/lib/perfmon/PerfCount.cc +++ b/lib/perfmon/PerfCount.cc @@ -40,7 +40,7 @@ const PerformanceCounter::PerformanceCounterConfig PerformanceCounter::Performan { PERF_TYPE_HARDWARE, PERF_COUNT_HW_CPU_CYCLES , "CPUCYCLES.........." , INSTRUCTIONS}, { PERF_TYPE_HARDWARE, PERF_COUNT_HW_INSTRUCTIONS , "INSTRUCTIONS......." , CPUCYCLES }, // 4 -#ifdef AVX512 +#ifdef KNL { PERF_TYPE_RAW, RawConfig(0x40,0x04), "ALL_LOADS..........", CPUCYCLES }, { PERF_TYPE_RAW, RawConfig(0x01,0x04), "L1_MISS_LOADS......", L1D_READ_ACCESS }, { PERF_TYPE_RAW, RawConfig(0x40,0x04), "ALL_LOADS..........", L1D_READ_ACCESS }, diff --git a/lib/qcd/action/fermion/ImprovedStaggeredFermion5D.cc b/lib/qcd/action/fermion/ImprovedStaggeredFermion5D.cc index 293077f7..7308ff30 100644 --- a/lib/qcd/action/fermion/ImprovedStaggeredFermion5D.cc +++ b/lib/qcd/action/fermion/ImprovedStaggeredFermion5D.cc @@ -230,8 +230,15 @@ void ImprovedStaggeredFermion5D::DhopInternal(StencilImpl & st, LebesgueOr { Compressor compressor; int LLs = in._grid->_rdimensions[0]; + + + + DhopTotalTime -= usecond(); + DhopCommTime -= usecond(); st.HaloExchange(in,compressor); + DhopCommTime += usecond(); + DhopComputeTime -= usecond(); // Dhop takes the 4d grid from U, and makes a 5d index for fermion if (dag == DaggerYes) { parallel_for (int ss = 0; ss < U._grid->oSites(); ss++) { @@ -244,12 +251,15 @@ void ImprovedStaggeredFermion5D::DhopInternal(StencilImpl & st, LebesgueOr Kernels::DhopSite(st,lo,U,UUU,st.CommBuf(),LLs,sU,in,out); } } + DhopComputeTime += usecond(); + DhopTotalTime += usecond(); } template void ImprovedStaggeredFermion5D::DhopOE(const FermionField &in, FermionField &out,int dag) { + DhopCalls+=1; conformable(in._grid,FermionRedBlackGrid()); // verifies half grid conformable(in._grid,out._grid); // drops the cb check @@ -261,6 +271,7 @@ void ImprovedStaggeredFermion5D::DhopOE(const FermionField &in, FermionFie template void ImprovedStaggeredFermion5D::DhopEO(const FermionField &in, FermionField &out,int dag) { + DhopCalls+=1; conformable(in._grid,FermionRedBlackGrid()); // verifies half grid conformable(in._grid,out._grid); // drops the cb check @@ -272,6 +283,7 @@ void ImprovedStaggeredFermion5D::DhopEO(const FermionField &in, FermionFie template void ImprovedStaggeredFermion5D::Dhop(const FermionField &in, FermionField &out,int dag) { + DhopCalls+=2; conformable(in._grid,FermionGrid()); // verifies full grid conformable(in._grid,out._grid); @@ -280,6 +292,54 @@ void ImprovedStaggeredFermion5D::Dhop(const FermionField &in, FermionField DhopInternal(Stencil,Lebesgue,Umu,UUUmu,in,out,dag); } +template +void ImprovedStaggeredFermion5D::Report(void) +{ + std::vector latt = GridDefaultLatt(); + RealD volume = Ls; for(int mu=0;mu_Nprocessors; + RealD NN = _FourDimGrid->NodeCount(); + + std::cout << GridLogMessage << "#### Dhop calls report " << std::endl; + + std::cout << GridLogMessage << "ImprovedStaggeredFermion5D Number of DhopEO Calls : " + << DhopCalls << std::endl; + std::cout << GridLogMessage << "ImprovedStaggeredFermion5D TotalTime /Calls : " + << DhopTotalTime / DhopCalls << " us" << std::endl; + std::cout << GridLogMessage << "ImprovedStaggeredFermion5D CommTime /Calls : " + << DhopCommTime / DhopCalls << " us" << std::endl; + std::cout << GridLogMessage << "ImprovedStaggeredFermion5D ComputeTime/Calls : " + << DhopComputeTime / DhopCalls << " us" << std::endl; + + // Average the compute time + _FourDimGrid->GlobalSum(DhopComputeTime); + DhopComputeTime/=NP; + + 
RealD mflops = 1154*volume*DhopCalls/DhopComputeTime/2; // 2 for red black counting + std::cout << GridLogMessage << "Average mflops/s per call : " << mflops << std::endl; + std::cout << GridLogMessage << "Average mflops/s per call per rank : " << mflops/NP << std::endl; + std::cout << GridLogMessage << "Average mflops/s per call per node : " << mflops/NN << std::endl; + + RealD Fullmflops = 1154*volume*DhopCalls/(DhopTotalTime)/2; // 2 for red black counting + std::cout << GridLogMessage << "Average mflops/s per call (full) : " << Fullmflops << std::endl; + std::cout << GridLogMessage << "Average mflops/s per call per rank (full): " << Fullmflops/NP << std::endl; + std::cout << GridLogMessage << "Average mflops/s per call per node (full): " << Fullmflops/NN << std::endl; + + std::cout << GridLogMessage << "ImprovedStaggeredFermion5D Stencil" < +void ImprovedStaggeredFermion5D::ZeroCounters(void) +{ + DhopCalls = 0; + DhopTotalTime = 0; + DhopCommTime = 0; + DhopComputeTime = 0; + Stencil.ZeroCounters(); + StencilEven.ZeroCounters(); + StencilOdd.ZeroCounters(); +} ///////////////////////////////////////////////////////////////////////// // Implement the general interface. Here we use SAME mass on all slices diff --git a/lib/qcd/action/fermion/ImprovedStaggeredFermion5D.h b/lib/qcd/action/fermion/ImprovedStaggeredFermion5D.h index 1c540892..22fb9e7d 100644 --- a/lib/qcd/action/fermion/ImprovedStaggeredFermion5D.h +++ b/lib/qcd/action/fermion/ImprovedStaggeredFermion5D.h @@ -55,6 +55,16 @@ namespace QCD { FermionField _tmp; FermionField &tmp(void) { return _tmp; } + //////////////////////////////////////// + // Performance monitoring + //////////////////////////////////////// + void Report(void); + void ZeroCounters(void); + double DhopTotalTime; + double DhopCalls; + double DhopCommTime; + double DhopComputeTime; + /////////////////////////////////////////////////////////////// // Implement the abstract base /////////////////////////////////////////////////////////////// diff --git a/lib/qcd/action/scalar/ScalarImpl.h b/lib/qcd/action/scalar/ScalarImpl.h index 5342a1fa..f85ab840 100644 --- a/lib/qcd/action/scalar/ScalarImpl.h +++ b/lib/qcd/action/scalar/ScalarImpl.h @@ -93,6 +93,8 @@ class ScalarImplTypes { class ScalarAdjMatrixImplTypes { public: typedef S Simd; + typedef QCD::SU Group; + template using iImplField = iScalar>>; template @@ -108,7 +110,7 @@ class ScalarImplTypes { typedef Field PropagatorField; static inline void generate_momenta(Field& P, GridParallelRNG& pRNG) { - QCD::SU::GaussianFundamentalLieAlgebraMatrix(pRNG, P); + Group::GaussianFundamentalLieAlgebraMatrix(pRNG, P); } static inline Field projectForce(Field& P) {return P;} @@ -122,11 +124,11 @@ class ScalarImplTypes { } static inline void HotConfiguration(GridParallelRNG &pRNG, Field &U) { - QCD::SU::LieRandomize(pRNG, U); + Group::GaussianFundamentalLieAlgebraMatrix(pRNG, U); } static inline void TepidConfiguration(GridParallelRNG &pRNG, Field &U) { - QCD::SU::LieRandomize(pRNG, U, 0.01); + Group::GaussianFundamentalLieAlgebraMatrix(pRNG, U, 0.01); } static inline void ColdConfiguration(GridParallelRNG &pRNG, Field &U) { diff --git a/lib/qcd/action/scalar/ScalarInteractionAction.h b/lib/qcd/action/scalar/ScalarInteractionAction.h index 5f4c630c..4d189352 100644 --- a/lib/qcd/action/scalar/ScalarInteractionAction.h +++ b/lib/qcd/action/scalar/ScalarInteractionAction.h @@ -81,7 +81,7 @@ namespace Grid { phiStencil.HaloExchange(p, compressor); Field action(p._grid), pshift(p._grid), phisquared(p._grid); phisquared = 
p*p; - action = (2.0*Ndim + mass_square)*phisquared + lambda*phisquared*phisquared; + action = (2.0*Ndim + mass_square)*phisquared - lambda/24.*phisquared*phisquared; for (int mu = 0; mu < Ndim; mu++) { // pshift = Cshift(p, mu, +1); // not efficient, implement with stencils parallel_for (int i = 0; i < p._grid->oSites(); i++) { @@ -98,7 +98,7 @@ namespace Grid { permute(temp2, *temp, permute_type); action._odata[i] -= temp2*(*t_p) + (*t_p)*temp2; } else { - action._odata[i] -= *temp*(*t_p) + (*t_p)*(*temp); + action._odata[i] -= (*temp)*(*t_p) + (*t_p)*(*temp); } } else { action._odata[i] -= phiStencil.CommBuf()[SE->_offset]*(*t_p) + (*t_p)*phiStencil.CommBuf()[SE->_offset]; @@ -113,7 +113,7 @@ namespace Grid { virtual void deriv(const Field &p, Field &force) { assert(p._grid->Nd() == Ndim); - force = (2.0*Ndim + mass_square)*p + 2.0*lambda*p*p*p; + force = (2.0*Ndim + mass_square)*p - lambda/12.*p*p*p; // move this outside static Stencil phiStencil(p._grid, npoint, 0, directions, displacements); phiStencil.HaloExchange(p, compressor); diff --git a/lib/qcd/hmc/HMC.h b/lib/qcd/hmc/HMC.h index ac690b60..5688bb24 100644 --- a/lib/qcd/hmc/HMC.h +++ b/lib/qcd/hmc/HMC.h @@ -76,7 +76,7 @@ struct HMCparameters: Serializable { template < class ReaderClass > void initialize(Reader &TheReader){ - std::cout << "Reading HMC\n"; + std::cout << GridLogMessage << "Reading HMC\n"; read(TheReader, "HMC", *this); } diff --git a/lib/qcd/hmc/HMCResourceManager.h b/lib/qcd/hmc/HMCResourceManager.h index 9f4c99a9..3e20a8c1 100644 --- a/lib/qcd/hmc/HMCResourceManager.h +++ b/lib/qcd/hmc/HMCResourceManager.h @@ -165,7 +165,7 @@ class HMCResourceManager { // Grids ////////////////////////////////////////////////////////////// - void AddGrid(std::string s, GridModule& M) { + void AddGrid(const std::string s, GridModule& M) { // Check for name clashes auto search = Grids.find(s); if (search != Grids.end()) { @@ -174,14 +174,24 @@ class HMCResourceManager { exit(1); } Grids[s] = std::move(M); + std::cout << GridLogMessage << "::::::::::::::::::::::::::::::::::::::::" < Mod; AddGrid(s, Mod); } + // Add a named grid set, 4d shortcut + tweak simd lanes + void AddFourDimGrid(const std::string s, const std::vector simd_decomposition) { + GridFourDimModule Mod(simd_decomposition); + AddGrid(s, Mod); + } GridCartesian* GetCartesian(std::string s = "") { @@ -253,6 +263,7 @@ class HMCResourceManager { template void AddObservable(Types&&... Args){ ObservablesList.push_back(std::unique_ptr(new T(std::forward(Args)...))); + ObservablesList.back()->print_parameters(); } std::vector* > GetObservables(){ @@ -297,4 +308,4 @@ private: } } -#endif // HMC_RESOURCE_MANAGER_H \ No newline at end of file +#endif // HMC_RESOURCE_MANAGER_H diff --git a/lib/qcd/hmc/HMC_GridModules.h b/lib/qcd/hmc/HMC_GridModules.h index 8331c02b..0f34e9a7 100644 --- a/lib/qcd/hmc/HMC_GridModules.h +++ b/lib/qcd/hmc/HMC_GridModules.h @@ -33,28 +33,29 @@ directory namespace Grid { // Resources -// Modules for grids +// Modules for grids // Introduce another namespace HMCModules? 
diff --git a/lib/qcd/hmc/HMC.h b/lib/qcd/hmc/HMC.h
index ac690b60..5688bb24 100644
--- a/lib/qcd/hmc/HMC.h
+++ b/lib/qcd/hmc/HMC.h
@@ -76,7 +76,7 @@ struct HMCparameters: Serializable {
 
   template < class ReaderClass >
   void initialize(Reader<ReaderClass> &TheReader){
-    std::cout << "Reading HMC\n";
+    std::cout << GridLogMessage << "Reading HMC\n";
     read(TheReader, "HMC", *this);
   }
diff --git a/lib/qcd/hmc/HMCResourceManager.h b/lib/qcd/hmc/HMCResourceManager.h
index 9f4c99a9..3e20a8c1 100644
--- a/lib/qcd/hmc/HMCResourceManager.h
+++ b/lib/qcd/hmc/HMCResourceManager.h
@@ -165,7 +165,7 @@ class HMCResourceManager {
   // Grids
   //////////////////////////////////////////////////////////////
 
-  void AddGrid(std::string s, GridModule& M) {
+  void AddGrid(const std::string s, GridModule& M) {
     // Check for name clashes
     auto search = Grids.find(s);
     if (search != Grids.end()) {
@@ -174,14 +174,24 @@ class HMCResourceManager {
       exit(1);
     }
     Grids[s] = std::move(M);
+    std::cout << GridLogMessage << "::::::::::::::::::::::::::::::::::::::::" << std::endl;
+    std::cout << GridLogMessage << "Added grid set with name '" << s << "'" << std::endl;
+    Grids[s].show_full_decomposition();
+    std::cout << GridLogMessage << "::::::::::::::::::::::::::::::::::::::::" << std::endl;
   }
 
+  // Add a named grid set, 4d shortcut
   void AddFourDimGrid(const std::string s) {
     GridFourDimModule<vComplex> Mod;
     AddGrid(s, Mod);
   }
+
+  // Add a named grid set, 4d shortcut + tweak simd lanes
+  void AddFourDimGrid(const std::string s, const std::vector<int> simd_decomposition) {
+    GridFourDimModule<vComplex> Mod(simd_decomposition);
+    AddGrid(s, Mod);
+  }
 
   GridCartesian* GetCartesian(std::string s = "") {
@@ -253,6 +263,7 @@ class HMCResourceManager {
   template <class T, class... Types>
   void AddObservable(Types&&... Args){
     ObservablesList.push_back(std::unique_ptr<T>(new T(std::forward<Types>(Args)...)));
+    ObservablesList.back()->print_parameters();
   }
 
   std::vector<HmcObservable<typename ImplementationPolicy::Field>* > GetObservables(){
@@ -297,4 +308,4 @@ private:
 }
 }
 
-#endif // HMC_RESOURCE_MANAGER_H
\ No newline at end of file
+#endif // HMC_RESOURCE_MANAGER_H
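The new `AddFourDimGrid` overload lets a driver pick the SIMD lane decomposition of a named grid set explicitly instead of taking `GridDefaultSimd`. A hedged usage fragment — it assumes the usual HMC driver scaffolding around it, and `{2,2,1,1}` is only one valid split (the entries must multiply to the vector type's lane count):

``` cpp
// Sketch (assumes the usual Grid HMC runner, e.g. HMCWrapper TheHMC, is in scope)
std::vector<int> simd_lanes({2, 2, 1, 1});          // must multiply to vComplex::Nsimd()
TheHMC.Resources.AddFourDimGrid("gauge", simd_lanes);

// or fall back to the default decomposition, as before:
//TheHMC.Resources.AddFourDimGrid("gauge");
```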
diff --git a/lib/qcd/hmc/HMC_GridModules.h b/lib/qcd/hmc/HMC_GridModules.h
index 8331c02b..0f34e9a7 100644
--- a/lib/qcd/hmc/HMC_GridModules.h
+++ b/lib/qcd/hmc/HMC_GridModules.h
@@ -33,28 +33,29 @@ directory
 namespace Grid {
 
 // Resources
-// Modules for grids
+// Modules for grids
 
 // Introduce another namespace HMCModules?
 
-class GridModuleParameters: Serializable{
+class GridModuleParameters: Serializable{
 public:
   GRID_SERIALIZABLE_CLASS_MEMBERS(GridModuleParameters,
     std::string, lattice,
     std::string, mpi);
 
-  std::vector<int> getLattice(){return strToVec<int>(lattice);}
-  std::vector<int> getMpi()    {return strToVec<int>(mpi);}
+  std::vector<int> getLattice() const {return strToVec<int>(lattice);}
+  std::vector<int> getMpi()     const {return strToVec<int>(mpi);}
 
-  void check(){
-    if (getLattice().size() != getMpi().size()) {
-      std::cout << GridLogError
+
+  void check() const {
+    if (getLattice().size() != getMpi().size() ) {
+      std::cout << GridLogError
                 << "Error in GridModuleParameters: lattice and mpi dimensions "
                    "do not match"
                 << std::endl;
       exit(1);
     }
-  }
+  }
 
   template <class ReaderClass>
   GridModuleParameters(Reader<ReaderClass>& Reader, std::string n = "LatticeGrid"):name(n) {
@@ -75,51 +76,94 @@ private:
 // Lower level class
 class GridModule {
  public:
-  GridCartesian* get_full() {
+  GridCartesian* get_full() {
     std::cout << GridLogDebug << "Getting cartesian in module"<< std::endl;
     return grid_.get(); }
-  GridRedBlackCartesian* get_rb() {
+  GridRedBlackCartesian* get_rb() {
     std::cout << GridLogDebug << "Getting rb-cartesian in module"<< std::endl;
     return rbgrid_.get(); }
 
   void set_full(GridCartesian* grid) { grid_.reset(grid); }
   void set_rb(GridRedBlackCartesian* rbgrid) { rbgrid_.reset(rbgrid); }
+  void show_full_decomposition(){ grid_->show_decomposition(); }
+  void show_rb_decomposition(){ rbgrid_->show_decomposition(); }
 
  protected:
   std::unique_ptr<GridCartesian> grid_;
   std::unique_ptr<GridRedBlackCartesian> rbgrid_;
-
+
 };
 
 ////////////////////////////////////
 // Classes for the user
 ////////////////////////////////////
 // Note: the space time grid should be out of the QCD namespace
-template< class vector_type>
-class GridFourDimModule : public GridModule {
- public:
-  GridFourDimModule() {
+template <class vector_type>
+class GridFourDimModule : public GridModule
+{
+public:
+  GridFourDimModule()
+  {
     using namespace QCD;
     set_full(SpaceTimeGrid::makeFourDimGrid(
-        GridDefaultLatt(), GridDefaultSimd(4, vector_type::Nsimd()),
+        GridDefaultLatt(),
+        GridDefaultSimd(4, vector_type::Nsimd()),
         GridDefaultMpi()));
     set_rb(SpaceTimeGrid::makeFourDimRedBlackGrid(grid_.get()));
   }
 
-  GridFourDimModule(GridModuleParameters Params) {
+  GridFourDimModule(const std::vector<int> tweak_simd)
+  {
+    using namespace QCD;
+    if (tweak_simd.size() != 4)
+    {
+      std::cout << GridLogError
+                << "Error in GridFourDimModule: SIMD size different from 4"
+                << std::endl;
+      exit(1);
+    }
+
+    // Checks that the product agrees with the expectation
+    int simd_sum = 1;
+    for (auto &n : tweak_simd)
+      simd_sum *= n;
+    std::cout << GridLogDebug << "TweakSIMD: " << tweak_simd << " Sum: " << simd_sum << std::endl;
+
+    if (simd_sum == vector_type::Nsimd())
+    {
+      set_full(SpaceTimeGrid::makeFourDimGrid(
+          GridDefaultLatt(),
+          tweak_simd,
+          GridDefaultMpi()));
+      set_rb(SpaceTimeGrid::makeFourDimRedBlackGrid(grid_.get()));
+    }
+    else
+    {
+      std::cout << GridLogError
+                << "Error in GridFourDimModule: SIMD lanes must multiply to "
+                << vector_type::Nsimd()
+                << std::endl;
+    }
+  }
+
+  GridFourDimModule(const GridModuleParameters Params)
+  {
    using namespace QCD;
-    Params.check();
    std::vector<int> lattice_v = Params.getLattice();
    std::vector<int> mpi_v = Params.getMpi();
-    if (lattice_v.size() == 4) {
+    if (lattice_v.size() == 4)
+    {
      set_full(SpaceTimeGrid::makeFourDimGrid(
-          lattice_v, GridDefaultSimd(4, vector_type::Nsimd()),
+          lattice_v,
+          GridDefaultSimd(4, vector_type::Nsimd()),
          mpi_v));
      set_rb(SpaceTimeGrid::makeFourDimRedBlackGrid(grid_.get()));
-    } else {
-      std::cout << GridLogError
-                << "Error in GridFourDimModule: lattice dimension different from 4"
-                << std::endl;
+    }
+    else
+    {
+      std::cout << GridLogError
+                << "Error in GridFourDimModule: lattice dimension different from 4"
+                << std::endl;
       exit(1);
    }
  }
diff --git a/lib/qcd/modules/ObservableModules.h b/lib/qcd/modules/ObservableModules.h
index 579fc1ec..24511617 100644
--- a/lib/qcd/modules/ObservableModules.h
+++ b/lib/qcd/modules/ObservableModules.h
@@ -84,8 +84,6 @@ class PlaquetteMod: public ObservableModule<PlaquetteLogger<Impl>, NoParameters>
   typedef ObservableModule<PlaquetteLogger<Impl>, NoParameters> ObsBase;
   using ObsBase::ObsBase; // for constructors
 
-
-  // acquire resource
   virtual void initialize(){
     this->ObservablePtr.reset(new PlaquetteLogger<Impl>());
@@ -94,23 +92,22 @@ class PlaquetteMod: public ObservableModule<PlaquetteLogger<Impl>, NoParameters>
   PlaquetteMod(): ObsBase(NoParameters()){}
 };
 
+
 template < class Impl >
-class TopologicalChargeMod: public ObservableModule<TopologicalCharge<Impl>, NoParameters>{
-  typedef ObservableModule<TopologicalCharge<Impl>, NoParameters> ObsBase;
+class TopologicalChargeMod: public ObservableModule<TopologicalCharge<Impl>, TopologyObsParameters>{
+  typedef ObservableModule<TopologicalCharge<Impl>, TopologyObsParameters> ObsBase;
   using ObsBase::ObsBase; // for constructors
-
-  // acquire resource
   virtual void initialize(){
-    this->ObservablePtr.reset(new TopologicalCharge<Impl>());
+    this->ObservablePtr.reset(new TopologicalCharge<Impl>(this->Par_));
   }
   public:
-  TopologicalChargeMod(): ObsBase(NoParameters()){}
+  TopologicalChargeMod(TopologyObsParameters Par): ObsBase(Par){}
+  TopologicalChargeMod(): ObsBase(){}
 };
-
 }// QCD temporarily here
diff --git a/lib/qcd/observables/topological_charge.h b/lib/qcd/observables/topological_charge.h
index 5d09c420..5af8d77b 100644
--- a/lib/qcd/observables/topological_charge.h
+++ b/lib/qcd/observables/topological_charge.h
@@ -33,9 +33,45 @@ directory
 namespace Grid {
 namespace QCD {
 
+struct TopologySmearingParameters : Serializable {
+  GRID_SERIALIZABLE_CLASS_MEMBERS(TopologySmearingParameters,
+    int, steps,
+    float, step_size,
+    int, meas_interval,
+    float, maxTau);
+
+  TopologySmearingParameters(int s = 0, float ss = 0.0f, int mi = 0, float mT = 0.0f):
+    steps(s), step_size(ss), meas_interval(mi), maxTau(mT){}
+
+  template < class ReaderClass >
+  TopologySmearingParameters(Reader<ReaderClass>& Reader){
+    read(Reader, "Smearing", *this);
+  }
+};
+
+
+
+struct TopologyObsParameters : Serializable {
+  GRID_SERIALIZABLE_CLASS_MEMBERS(TopologyObsParameters,
+    int, interval,
+    bool, do_smearing,
+    TopologySmearingParameters, Smearing);
+
+  TopologyObsParameters(int interval = 1, bool smearing = false):
+    interval(interval), Smearing(smearing){}
+
+  template <class ReaderClass >
+  TopologyObsParameters(Reader<ReaderClass>& Reader){
+    read(Reader, "TopologyMeasurement", *this);
+  }
+};
+
+
 // this is only defined for a gauge theory
 template <class Impl>
 class TopologicalCharge : public HmcObservable<typename Impl::Field> {
+  TopologyObsParameters Pars;
+
  public:
   // here forces the Impl to be of gauge fields
   // if not the compiler will complain
@@ -44,20 +80,39 @@ class TopologicalCharge : public HmcObservable<typename Impl::Field> {
   // necessary for HmcObservable compatibility
   typedef typename Impl::Field Field;
 
+  TopologicalCharge(int interval = 1, bool do_smearing = false):
+    Pars(interval, do_smearing){}
+
+  TopologicalCharge(TopologyObsParameters P):Pars(P){
+    std::cout << GridLogDebug << "Creating TopologicalCharge " << std::endl;
+  }
+
   void TrajectoryComplete(int traj,
                           Field &U,
                           GridSerialRNG &sRNG,
                           GridParallelRNG &pRNG) {
 
-    Real q = WilsonLoops<Impl>::TopologicalCharge(U);
+    if (traj%Pars.interval == 0){
+      // Smearing
+      Field Usmear = U;
+      int def_prec = std::cout.precision();
+
+      if (Pars.do_smearing){
+        // using wilson flow by default here
+        WilsonFlow<PeriodicGimplR> WF(Pars.Smearing.steps, Pars.Smearing.step_size, Pars.Smearing.meas_interval);
+        WF.smear_adaptive(Usmear, U, Pars.Smearing.maxTau);
+        Real T0 = WF.energyDensityPlaquette(Usmear);
+        std::cout << GridLogMessage << std::setprecision(std::numeric_limits<Real>::digits10 + 1)
+                  << "T0 : [ " << traj << " ] "<< T0 << std::endl;
+      }
 
-    int def_prec = std::cout.precision();
+      Real q = WilsonLoops<Impl>::TopologicalCharge(Usmear);
+      std::cout << GridLogMessage
+                << std::setprecision(std::numeric_limits<Real>::digits10 + 1)
+                << "Topological Charge: [ " << traj << " ] "<< q << std::endl;
 
-    std::cout << GridLogMessage
-              << std::setprecision(std::numeric_limits<Real>::digits10 + 1)
-              << "Topological Charge: [ " << traj << " ] "<< q << std::endl;
-
-    std::cout.precision(def_prec);
+      std::cout.precision(def_prec);
+    }
   }
 };
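The observable above is the only consumer of the new smearing parameters, but the same Wilson-flow calls can be driven by hand. A hedged sketch — the function name is hypothetical, a Grid build is assumed, and the step/flow-time values are only illustrative (they mirror the defaults used in the test further down):

``` cpp
#include <Grid/Grid.h>
using namespace Grid;
using namespace Grid::QCD;

// Hedged sketch: smear a configuration with the adaptive Wilson flow and
// measure the topological charge on the smeared field.
void measure_flowed_topology(LatticeGaugeField &U) {
  LatticeGaugeField Usmear = U;
  WilsonFlow<PeriodicGimplR> WF(200, 0.01, 50);  // steps, step size, meas interval
  WF.smear_adaptive(Usmear, U, 2.0);             // integrate up to flow time maxTau = 2.0
  Real T0 = WF.energyDensityPlaquette(Usmear);   // the quantity printed as "T0" above
  Real q  = WilsonLoops<PeriodicGimplR>::TopologicalCharge(Usmear);
  std::cout << GridLogMessage << "T0 " << T0 << "  Q " << q << std::endl;
}
```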
diff --git a/lib/qcd/smearing/WilsonFlow.h b/lib/qcd/smearing/WilsonFlow.h
index 5e9f2d95..4f5c0d43 100644
--- a/lib/qcd/smearing/WilsonFlow.h
+++ b/lib/qcd/smearing/WilsonFlow.h
@@ -108,7 +108,7 @@ void WilsonFlow<Gimpl>::evolve_step_adaptive(typename Gimpl::GaugeField &U, Real maxTau) {
     if (maxTau - taus < epsilon){
         epsilon = maxTau-taus;
     }
-    std::cout << GridLogMessage << "Integration epsilon : " << epsilon << std::endl;
+    //std::cout << GridLogMessage << "Integration epsilon : " << epsilon << std::endl;
     GaugeField Z(U._grid);
     GaugeField Zprime(U._grid);
     GaugeField tmp(U._grid), Uprime(U._grid);
@@ -138,10 +138,10 @@ void WilsonFlow<Gimpl>::evolve_step_adaptive(typename Gimpl::GaugeField &U, Real maxTau) {
     // adjust integration step
 
     taus += epsilon;
-    std::cout << GridLogMessage << "Adjusting integration step with distance: " << diff << std::endl;
+    //std::cout << GridLogMessage << "Adjusting integration step with distance: " << diff << std::endl;
 
     epsilon = epsilon*0.95*std::pow(1e-4/diff,1./3.);
-    std::cout << GridLogMessage << "New epsilon : " << epsilon << std::endl;
+    //std::cout << GridLogMessage << "New epsilon : " << epsilon << std::endl;
 
 }
@@ -166,7 +166,6 @@ void WilsonFlow<Gimpl>::smear(GaugeField& out, const GaugeField& in) const {
     out = in;
     for (unsigned int step = 1; step <= Nstep; step++) {
         auto start = std::chrono::high_resolution_clock::now();
-        std::cout << GridLogMessage << "Evolution time :"<< tau(step) << std::endl;
         evolve_step(out);
         auto end = std::chrono::high_resolution_clock::now();
         std::chrono::duration<double> diff = end - start;
@@ -191,7 +190,7 @@ void WilsonFlow<Gimpl>::smear_adaptive(GaugeField& out, const GaugeField& in, Real maxTau) {
     unsigned int step = 0;
     do{
         step++;
-        std::cout << GridLogMessage << "Evolution time :"<< taus << std::endl;
+        //std::cout << GridLogMessage << "Evolution time :"<< taus << std::endl;
         evolve_step_adaptive(out, maxTau);
         std::cout << GridLogMessage << "[WilsonFlow] Energy density (plaq) : "
                   << step << "  "
diff --git a/lib/qcd/utils/GaugeFix.h b/lib/qcd/utils/GaugeFix.h
index 4ff216e4..c4ea31aa 100644
--- a/lib/qcd/utils/GaugeFix.h
+++ b/lib/qcd/utils/GaugeFix.h
@@ -26,12 +26,14 @@ Author: Peter Boyle
 /* END LEGAL */
 
 //#include <Grid/Grid.h>
 
-using namespace Grid;
-using namespace Grid::QCD;
+#ifndef GRID_QCD_GAUGE_FIX_H
+#define GRID_QCD_GAUGE_FIX_H
+namespace Grid {
+namespace QCD {
 
 template <class Gimpl>
 class FourierAcceleratedGaugeFixer  : public Gimpl {
-  public:
+ public:
   INHERIT_GIMPL_TYPES(Gimpl);
 
   typedef typename Gimpl::GaugeLinkField GaugeMat;
@@ -186,3 +188,6 @@ class FourierAcceleratedGaugeFixer : public Gimpl {
   }
 };
 
+}
+}
+#endif
diff --git a/lib/qcd/utils/SUn.h b/lib/qcd/utils/SUn.h
index 99a620bc..8f0c0a7b 100644
--- a/lib/qcd/utils/SUn.h
+++ b/lib/qcd/utils/SUn.h
@@ -716,8 +716,7 @@ template
     for (int a = 0; a < AdjointDimension; a++) {
       generator(a, Ta);
-      auto tmp = - 2.0 * (trace(timesI(Ta) * in)) * scale;// 2.0 for the normalization of the trace in the fundamental rep
-      pokeColour(h_out, tmp, a);
+      pokeColour(h_out, - 2.0 * (trace(timesI(Ta) * in)) * scale, a);
     }
   }
diff --git a/lib/serialisation/Hdf5IO.cc b/lib/serialisation/Hdf5IO.cc
index b9bb0b87..1fb7be0c 100644
--- a/lib/serialisation/Hdf5IO.cc
+++ b/lib/serialisation/Hdf5IO.cc
@@ -65,10 +65,12 @@ Hdf5Reader::Hdf5Reader(const std::string &fileName)
                       Hdf5Type<unsigned int>::type());
 }
 
-void Hdf5Reader::push(const std::string &s)
+bool Hdf5Reader::push(const std::string &s)
 {
   group_ = group_.openGroup(s);
   path_.push_back(s);
+
+  return true;
 }
 
 void Hdf5Reader::pop(void)
diff --git a/lib/serialisation/Hdf5IO.h b/lib/serialisation/Hdf5IO.h
index 2f891cd4..94ad9736 100644
--- a/lib/serialisation/Hdf5IO.h
+++ b/lib/serialisation/Hdf5IO.h
@@ -54,7 +54,7 @@ namespace Grid
   public:
     Hdf5Reader(const std::string &fileName);
     virtual ~Hdf5Reader(void) = default;
-    void push(const std::string &s);
+    bool push(const std::string &s);
     void pop(void);
     template <typename U>
     void readDefault(const std::string &s, U &output);
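`push()` now returns a `bool` (always `true` here — by the look of it, for interface uniformity with the other reader backends; that reading is an assumption). The call pattern itself is unchanged. A hedged sketch, with the file name and group/leaf names made up:

``` cpp
#include <Grid/Grid.h>
using namespace Grid;

// Hedged sketch of nested-group navigation with the serialisation readers;
// "run.h5", "HMC" and "Trajectories" are hypothetical.
int main(void) {
  Hdf5Reader reader("run.h5");
  bool ok = reader.push("HMC");          // descend into a group; now returns a flag
  int trajectories;
  read(reader, "Trajectories", trajectories);
  reader.pop();                          // back to the parent group
  return ok ? 0 : 1;
}
```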
diff --git a/lib/simd/Grid_neon.h b/lib/simd/Grid_neon.h
index 7c1ad443..d6eb9c5a 100644
--- a/lib/simd/Grid_neon.h
+++ b/lib/simd/Grid_neon.h
@@ -1,13 +1,14 @@
- /*************************************************************************************
+/*************************************************************************************
 
-    Grid physics library, www.github.com/paboyle/Grid
+    Grid physics library, www.github.com/paboyle/Grid
 
     Source file: ./lib/simd/Grid_neon.h
 
     Copyright (C) 2015
 
-Author: Peter Boyle
-Author: neo
+    Author: Nils Meyer
+    Author: Peter Boyle
+    Author: neo
 
 This program is free software; you can redistribute it and/or modify
 it under the terms of the GNU General Public License as published by
@@ -26,19 +27,25 @@ Author: neo
     See the full license in the file "LICENSE" in the top level distribution directory
 *************************************************************************************/
 /* END LEGAL */
-//----------------------------------------------------------------------
-/*! @file Grid_sse4.h
-  @brief Optimization libraries for NEON (ARM) instructions set ARMv8
-
-  Experimental - Using intrinsics - DEVELOPING!
+/*
+
+  ARMv8 NEON intrinsics layer by
+
+  Nils Meyer,
+  University of Regensburg, Germany
+  SFB/TRR55
+
 */
-// Time-stamp:
-//----------------------------------------------------------------------
+#ifndef GEN_SIMD_WIDTH
+#define GEN_SIMD_WIDTH 16u
+#endif
+
+#include "Grid_generic_types.h"
 
 #include <arm_neon.h>
 
-// ARMv8 supports double precision
-
+namespace Grid {
 namespace Optimization {
 
   template<class vtype>
@@ -46,16 +53,20 @@ namespace Optimization {
    float32x4_t f;
    vtype v;
  };
-
  union u128f {
    float32x4_t v;
    float f[4];
  };
  union u128d {
    float64x2_t v;
-    double f[4];
+    double f[2];
  };
-
+  // half precision
+  union u128h {
+    float16x8_t v;
+    uint16_t f[8];
+  };
+
  struct Vsplat{
    //Complex float
    inline float32x4_t operator()(float a, float b){
      float tmp[4]={a,b,a,b};
      return vld1q_f32(tmp);
    }
    // Real float
    inline float32x4_t operator()(float a){
-      return vld1q_dup_f32(&a);
+      return vdupq_n_f32(a);
    }
    //Complex double
-    inline float32x4_t operator()(double a, double b){
-      float tmp[4]={(float)a,(float)b,(float)a,(float)b};
-      return vld1q_f32(tmp);
+    inline float64x2_t operator()(double a, double b){
+      double tmp[2]={a,b};
+      return vld1q_f64(tmp);
    }
-    //Real double
-    inline float32x4_t operator()(double a){
-      return vld1q_dup_f32(&a);
+    //Real double // N:tbc
+    inline float64x2_t operator()(double a){
+      return vdupq_n_f64(a);
    }
-    //Integer
+    //Integer // N:tbc
    inline uint32x4_t operator()(Integer a){
-      return vld1q_dup_u32(&a);
+      return vdupq_n_u32(a);
    }
  };
 
  struct Vstore{
-    //Float
+    //Float
    inline void operator()(float32x4_t a, float* F){
      vst1q_f32(F, a);
    }
    //Double
-    inline void operator()(float32x4_t a, double* D){
-      vst1q_f32((float*)D, a);
+    inline void operator()(float64x2_t a, double* D){
+      vst1q_f64(D, a);
    }
    //Integer
    inline void operator()(uint32x4_t a, Integer* I){
      vst1q_u32(I, a);
    }
  };
 
-  struct Vstream{
-    //Float
+  struct Vstream{ // N:equivalents to _mm_stream_p* in NEON?
+    //Float // N:generic
    inline void operator()(float * a, float32x4_t b){
-
+      memcpy(a,&b,4*sizeof(float));
    }
-    //Double
-    inline void operator()(double * a, float32x4_t b){
-
+    //Double // N:generic
+    inline void operator()(double * a, float64x2_t b){
+      memcpy(a,&b,2*sizeof(double));
    }
  };
 
+  // Nils: Vset untested; not used currently in Grid at all;
+  // git commit 4a8c4ccfba1d05159348d21a9698028ea847e77b
  struct Vset{
-    // Complex float
+    // Complex float // N:ok
    inline float32x4_t operator()(Grid::ComplexF *a){
-      float32x4_t foo;
-      return foo;
+      float tmp[4]={a[1].imag(),a[1].real(),a[0].imag(),a[0].real()};
+      return vld1q_f32(tmp);
    }
-    // Complex double
-    inline float32x4_t operator()(Grid::ComplexD *a){
-      float32x4_t foo;
-      return foo;
+    // Complex double // N:ok
+    inline float64x2_t operator()(Grid::ComplexD *a){
+      double tmp[2]={a[0].imag(),a[0].real()};
+      return vld1q_f64(tmp);
    }
-    // Real float
+    // Real float // N:ok
    inline float32x4_t operator()(float *a){
-      float32x4_t foo;
-      return foo;
+      float tmp[4]={a[3],a[2],a[1],a[0]};
+      return vld1q_f32(tmp);
    }
-    // Real double
-    inline float32x4_t operator()(double *a){
-      float32x4_t foo;
-      return foo;
+    // Real double // N:ok
+    inline float64x2_t operator()(double *a){
+      double tmp[2]={a[1],a[0]};
+      return vld1q_f64(tmp);
    }
-    // Integer
+    // Integer // N:ok
    inline uint32x4_t operator()(Integer *a){
-      uint32x4_t foo;
-      return foo;
+      return vld1q_dup_u32(a);
    }
-
  };
 
+  // N:leaving as is
  template <typename Out_type, typename In_type>
  struct Reduce{
    //Need templated class to overload output type
    //General form must generate error if compiled
-    inline Out_type operator()(In_type in){
+    inline Out_type operator()(In_type in){
      printf("Error, using wrong Reduce function\n");
      exit(1);
      return 0;
@@ -184,26 +195,98 @@ namespace Optimization {
    }
  };
 
+  struct MultRealPart{
+    inline float32x4_t operator()(float32x4_t a, float32x4_t b){
+      float32x4_t re = vtrn1q_f32(a, a);
+      return vmulq_f32(re, b);
+    }
+    inline float64x2_t operator()(float64x2_t a, float64x2_t b){
+      float64x2_t re = vzip1q_f64(a, a);
+      return vmulq_f64(re, b);
+    }
+  };
+
+  struct MaddRealPart{
+    inline float32x4_t operator()(float32x4_t a, float32x4_t b, float32x4_t c){
+      float32x4_t re = vtrn1q_f32(a, a);
+      return vfmaq_f32(c, re, b);
+    }
+    inline float64x2_t operator()(float64x2_t a, float64x2_t b, float64x2_t c){
+      float64x2_t re = vzip1q_f64(a, a);
+      return vfmaq_f64(c, re, b);
+    }
+  };
+
+  struct Div{
+    // Real float
+    inline float32x4_t operator()(float32x4_t a, float32x4_t b){
+      return vdivq_f32(a, b);
+    }
+    // Real double
+    inline float64x2_t operator()(float64x2_t a, float64x2_t b){
+      return vdivq_f64(a, b);
+    }
+  };
  struct MultComplex{
    // Complex float
    inline float32x4_t operator()(float32x4_t a, float32x4_t b){
-      float32x4_t foo;
-      return foo;
+
+      float32x4_t r0, r1, r2, r3, r4;
+
+      // a = ar ai Ar Ai
+      // b = br bi Br Bi
+      // collect real/imag part, negate bi and Bi
+      r0 = vtrn1q_f32(b, b);       // br br Br Br
+      r1 = vnegq_f32(b);           // -br -bi -Br -Bi
+      r2 = vtrn2q_f32(b, r1);      // bi -bi Bi -Bi
+
+      // the fun part
+      r3 = vmulq_f32(r2, a);       // bi*ar -bi*ai ...
+      r4 = vrev64q_f32(r3);        // -bi*ai bi*ar ...
+
+      // fma(a,b,c) = a+b*c
+      return vfmaq_f32(r4, r0, a); // ar*br-ai*bi ai*br+ar*bi ...
+
+      // no fma, use mul and add
+      //float32x4_t r5;
+      //r5 = vmulq_f32(r0, a);
+      //return vaddq_f32(r4, r5);
    }
    // Complex double
    inline float64x2_t operator()(float64x2_t a, float64x2_t b){
-      float32x4_t foo;
-      return foo;
+
+      float64x2_t r0, r1, r2, r3, r4;
+
+      // b = br bi
+      // collect real/imag part, negate bi
+      r0 = vtrn1q_f64(b, b);       // br br
+      r1 = vnegq_f64(b);           // -br -bi
+      r2 = vtrn2q_f64(b, r1);      // bi -bi
+
+      // the fun part
+      r3 = vmulq_f64(r2, a);       // bi*ar -bi*ai
+      r4 = vextq_f64(r3,r3,1);     // -bi*ai bi*ar
+
+      // fma(a,b,c) = a+b*c
+      return vfmaq_f64(r4, r0, a); // ar*br-ai*bi ai*br+ar*bi
+
+      // no fma, use mul and add
+      //float64x2_t r5;
+      //r5 = vmulq_f64(r0, a);
+      //return vaddq_f64(r4, r5);
    }
  };
 
  struct Mult{
    // Real float
    inline float32x4_t mac(float32x4_t a, float32x4_t b, float32x4_t c){
-      return vaddq_f32(vmulq_f32(b,c),a);
+      //return vaddq_f32(vmulq_f32(b,c),a);
+      return vfmaq_f32(a, b, c);
    }
    inline float64x2_t mac(float64x2_t a, float64x2_t b, float64x2_t c){
-      return vaddq_f64(vmulq_f64(b,c),a);
+      //return vaddq_f64(vmulq_f64(b,c),a);
+      return vfmaq_f64(a, b, c);
    }
    inline float32x4_t operator()(float32x4_t a, float32x4_t b){
      return vmulq_f32(a,b);
@@ -221,89 +304,275 @@ namespace Optimization {
 
  struct Conj{
    // Complex single
    inline float32x4_t operator()(float32x4_t in){
-      return in;
+      // ar ai br bi -> ar -ai br -bi
+      float32x4_t r0, r1;
+      r0 = vnegq_f32(in);          // -ar -ai -br -bi
+      r1 = vrev64q_f32(r0);        // -ai -ar -bi -br
+      return vtrn1q_f32(in, r1);   // ar -ai br -bi
    }
    // Complex double
-    //inline float32x4_t operator()(float32x4_t in){
-    //  return 0;
-    //}
+    inline float64x2_t operator()(float64x2_t in){
+
+      float64x2_t r0, r1;
+      r0 = vextq_f64(in, in, 1);   // ai ar
+      r1 = vnegq_f64(r0);          // -ai -ar
+      return vextq_f64(r0, r1, 1); // ar -ai
+    }
    // do not define for integer input
  };
 
  struct TimesMinusI{
    //Complex single
    inline float32x4_t operator()(float32x4_t in, float32x4_t ret){
-      return in;
+      // ar ai br bi -> ai -ar bi -br
+      float32x4_t r0, r1;
+      r0 = vnegq_f32(in);          // -ar -ai -br -bi
+      r1 = vrev64q_f32(in);        // ai ar bi br
+      return vtrn1q_f32(r1, r0);   // ai -ar bi -br
    }
    //Complex double
-    //inline float32x4_t operator()(float32x4_t in, float32x4_t ret){
-    //  return in;
-    //}
+    inline float64x2_t operator()(float64x2_t in, float64x2_t ret){
+      // a ib -> b -ia
+      float64x2_t tmp;
+      tmp = vnegq_f64(in);
+      return vextq_f64(in, tmp, 1);
+    }
  };
 
  struct TimesI{
    //Complex single
    inline float32x4_t operator()(float32x4_t in, float32x4_t ret){
-      //need shuffle
-      return in;
+      // ar ai br bi -> -ai ar -bi br
+      float32x4_t r0, r1;
+      r0 = vnegq_f32(in);          // -ar -ai -br -bi
+      r1 = vrev64q_f32(r0);        // -ai -ar -bi -br
+      return vtrn1q_f32(r1, in);   // -ai ar -bi br
    }
    //Complex double
-    //inline float32x4_t operator()(float32x4_t in, float32x4_t ret){
-    //  return 0;
-    //}
+    inline float64x2_t operator()(float64x2_t in, float64x2_t ret){
+      // a ib -> -b ia
+      float64x2_t tmp;
+      tmp = vnegq_f64(in);
+      return vextq_f64(tmp, in, 1);
+    }
  };
+  struct Permute{
+
+    static inline float32x4_t Permute0(float32x4_t in){ // N:ok
+      // AB CD -> CD AB
+      return vextq_f32(in, in, 2);
+    };
+    static inline float32x4_t Permute1(float32x4_t in){ // N:ok
+      // AB CD -> BA DC
+      return vrev64q_f32(in);
+    };
+    static inline float32x4_t Permute2(float32x4_t in){ // N:not used by Boyle
+      return in;
+    };
+    static inline float32x4_t Permute3(float32x4_t in){ // N:not used by Boyle
+      return in;
+    };
+
+    static inline float64x2_t Permute0(float64x2_t in){ // N:ok
+      // AB -> BA
+      return vextq_f64(in, in, 1);
+    };
+    static inline float64x2_t Permute1(float64x2_t in){ // N:not used by Boyle
+      return in;
+    };
+    static inline float64x2_t Permute2(float64x2_t in){ // N:not used by Boyle
+      return in;
+    };
+    static inline float64x2_t Permute3(float64x2_t in){ // N:not used by Boyle
+      return in;
+    };
+
+  };
+
+  struct Rotate{
+
+    static inline float32x4_t rotate(float32x4_t in,int n){ // N:ok
+      switch(n){
+        case 0: // AB CD -> AB CD
+          return tRotate<0>(in);
+          break;
+        case 1: // AB CD -> BC DA
+          return tRotate<1>(in);
+          break;
+        case 2: // AB CD -> CD AB
+          return tRotate<2>(in);
+          break;
+        case 3: // AB CD -> DA BC
+          return tRotate<3>(in);
+          break;
+        default: assert(0);
+      }
+    }
+    static inline float64x2_t rotate(float64x2_t in,int n){ // N:ok
+      switch(n){
+        case 0: // AB -> AB
+          return tRotate<0>(in);
+          break;
+        case 1: // AB -> BA
+          return tRotate<1>(in);
+          break;
+        default: assert(0);
+      }
+    }
+
+// working, but no restriction on n
+//    template <int n> static inline float32x4_t tRotate(float32x4_t in){ return vextq_f32(in,in,n); };
+//    template <int n> static inline float64x2_t tRotate(float64x2_t in){ return vextq_f64(in,in,n); };
+
+// restriction on n
+    template <int n> static inline float32x4_t tRotate(float32x4_t in){ return vextq_f32(in,in,n%4); };
+    template <int n> static inline float64x2_t tRotate(float64x2_t in){ return vextq_f64(in,in,n%2); };
+
+  };
+
+  struct PrecisionChange {
+
+    static inline float16x8_t StoH (const float32x4_t &a,const float32x4_t &b) {
+      float16x4_t h = vcvt_f16_f32(a);
+      return vcvt_high_f16_f32(h, b);
+    }
+    static inline void HtoS (float16x8_t h,float32x4_t &sa,float32x4_t &sb) {
+      sb = vcvt_high_f32_f16(h);
+      // there is no direct conversion from the lower half of float16x8_t to float32x4_t
+      // vextq_f16 not supported by clang 3.8 / 4.0 / arm clang
+      //float16x8_t h1 = vextq_f16(h, h, 4); // correct, but not supported by clang
+      // workaround for clang
+      uint32x4_t h1u = reinterpret_cast<uint32x4_t>(h);
+      float16x8_t h1 = reinterpret_cast<float16x8_t>(vextq_u32(h1u, h1u, 2));
+      sa = vcvt_high_f32_f16(h1);
+    }
+    static inline float32x4_t DtoS (float64x2_t a,float64x2_t b) {
+      float32x2_t s = vcvt_f32_f64(a);
+      return vcvt_high_f32_f64(s, b);
+    }
+    static inline void StoD (float32x4_t s,float64x2_t &a,float64x2_t &b) {
+      b = vcvt_high_f64_f32(s);
+      // there is no direct conversion from the lower half of float32x4_t to float64x2_t
+      float32x4_t s1 = vextq_f32(s, s, 2);
+      a = vcvt_high_f64_f32(s1);
+    }
+    static inline float16x8_t DtoH (float64x2_t a,float64x2_t b,float64x2_t c,float64x2_t d) {
+      float32x4_t s1 = DtoS(a, b);
+      float32x4_t s2 = DtoS(c, d);
+      return StoH(s1, s2);
+    }
+    static inline void HtoD (float16x8_t h,float64x2_t &a,float64x2_t &b,float64x2_t &c,float64x2_t &d) {
+      float32x4_t s1, s2;
+      HtoS(h, s1, s2);
+      StoD(s1, a, b);
+      StoD(s2, c, d);
+    }
+  };
+  //////////////////////////////////////////////
+  // Exchange support
+
+  struct Exchange{
+    static inline void Exchange0(float32x4_t &out1,float32x4_t &out2,float32x4_t in1,float32x4_t in2){
+      // in1: ABCD -> out1: ABEF
+      // in2: EFGH -> out2: CDGH
+
+      // z: CDAB
+      float32x4_t z = vextq_f32(in1, in1, 2);
+      // out1: ABEF
+      out1 = vextq_f32(z, in2, 2);
+
+      // z: GHEF
+      z = vextq_f32(in2, in2, 2);
+      // out2: CDGH
+      out2 = vextq_f32(in1, z, 2);
+    };
+
+    static inline void Exchange1(float32x4_t &out1,float32x4_t &out2,float32x4_t in1,float32x4_t in2){
+      // in1: ABCD -> out1: AECG
+      // in2: EFGH -> out2: BFDH
+      out1 = vtrn1q_f32(in1, in2);
+      out2 = vtrn2q_f32(in1, in2);
+    };
+    static inline void Exchange2(float32x4_t &out1,float32x4_t &out2,float32x4_t in1,float32x4_t in2){
+      assert(0);
+      return;
+    };
+    static inline void Exchange3(float32x4_t &out1,float32x4_t &out2,float32x4_t in1,float32x4_t in2){
+      assert(0);
+      return;
+    };
+    // double precision
+    static inline void Exchange0(float64x2_t &out1,float64x2_t &out2,float64x2_t in1,float64x2_t in2){
+      // in1: AB -> out1: AC
+      // in2: CD -> out2: BD
+      out1 = vzip1q_f64(in1, in2);
+      out2 = vzip2q_f64(in1, in2);
+    };
+    static inline void Exchange1(float64x2_t &out1,float64x2_t &out2,float64x2_t in1,float64x2_t in2){
+      assert(0);
+      return;
+    };
+    static inline void Exchange2(float64x2_t &out1,float64x2_t &out2,float64x2_t in1,float64x2_t in2){
+      assert(0);
+      return;
+    };
+    static inline void Exchange3(float64x2_t &out1,float64x2_t &out2,float64x2_t in1,float64x2_t in2){
+      assert(0);
+      return;
+    };
+  };
 
  //////////////////////////////////////////////
  // Some Template specialization
-  template < typename vtype >
-  void permute(vtype &a, vtype b, int perm) {
-  };
 
  //Complex float Reduce
  template<>
  inline Grid::ComplexF Reduce<Grid::ComplexF, float32x4_t>::operator()(float32x4_t in){
-    return 0;
+    float32x4_t v1; // two complex
+    v1 = Optimization::Permute::Permute0(in);
+    v1 = vaddq_f32(v1,in);
+    u128f conv; conv.v=v1;
+    return Grid::ComplexF(conv.f[0],conv.f[1]);
  }
  //Real float Reduce
  template<>
  inline Grid::RealF Reduce<Grid::RealF, float32x4_t>::operator()(float32x4_t in){
-    float32x2_t high = vget_high_f32(in);
-    float32x2_t low = vget_low_f32(in);
-    float32x2_t tmp = vadd_f32(low, high);
-    float32x2_t sum = vpadd_f32(tmp, tmp);
-    return vget_lane_f32(sum,0);
+    return vaddvq_f32(in);
  }
 
  //Complex double Reduce
-  template<>
+  template<> // N:by Boyle
  inline Grid::ComplexD Reduce<Grid::ComplexD, float64x2_t>::operator()(float64x2_t in){
-    return 0;
+    u128d conv; conv.v = in;
+    return Grid::ComplexD(conv.f[0],conv.f[1]);
  }
 
  //Real double Reduce
  template<>
  inline Grid::RealD Reduce<Grid::RealD, float64x2_t>::operator()(float64x2_t in){
-    float64x2_t sum = vpaddq_f64(in, in);
-    return vgetq_lane_f64(sum,0);
+    return vaddvq_f64(in);
  }
 
  //Integer Reduce
  template<>
  inline Integer Reduce<Integer, uint32x4_t>::operator()(uint32x4_t in){
    // FIXME unimplemented
-    printf("Reduce : Missing integer implementation -> FIX\n");
+    printf("Reduce : Missing integer implementation -> FIX\n");
    assert(0);
  }
}
 
//////////////////////////////////////////////////////////////////////////////////////
-// Here assign types
-namespace Grid {
+// Here assign types
 
+// typedef Optimization::vech SIMD_Htype; // Reduced precision type
+  typedef float16x8_t  SIMD_Htype; // Half precision type
  typedef float32x4_t  SIMD_Ftype; // Single precision type
  typedef float64x2_t  SIMD_Dtype; // Double precision type
  typedef uint32x4_t   SIMD_Itype; // Integer type
@@ -312,13 +581,6 @@ namespace Grid {
 
  inline void prefetch_HINT_T0(const char *ptr){};
 
-  // Gpermute function
-  template < typename VectorSIMD >
-  inline void Gpermute(VectorSIMD &y,const VectorSIMD &b, int perm ) {
-    Optimization::permute(y.v,b.v,perm);
-  }
-
-
  // Function name aliases
  typedef Optimization::Vsplat   VsplatSIMD;
  typedef Optimization::Vstore   VstoreSIMD;
@@ -326,16 +588,19 @@ namespace Grid {
  typedef Optimization::Vstream  VstreamSIMD;
  template <typename S, typename T> using ReduceSIMD = Optimization::Reduce<S,T>;
 
  // Arithmetic operations
  typedef Optimization::Sum          SumSIMD;
  typedef Optimization::Sub          SubSIMD;
+  typedef Optimization::Div          DivSIMD;
  typedef Optimization::Mult         MultSIMD;
  typedef Optimization::MultComplex  MultComplexSIMD;
+  typedef Optimization::MultRealPart MultRealPartSIMD;
+  typedef Optimization::MaddRealPart MaddRealPartSIMD;
  typedef Optimization::Conj         ConjSIMD;
  typedef Optimization::TimesMinusI  TimesMinusISIMD;
  typedef Optimization::TimesI       TimesISIMD;
 
-}
+}
\ No newline at end of file
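The lane shuffles in `MultComplex` above implement nothing more exotic than the textbook product (ar + i·ai)(br + i·bi) = (ar·br − ai·bi) + i(ar·bi + ai·br), fused with `vfmaq`. A scalar re-enactment of the same algebra, runnable anywhere without NEON, makes the comment trail easy to verify:

``` cpp
#include <cassert>
#include <cstdio>

// Scalar model of the NEON MultComplex lane trick:
// r0 = (br, br), r2 = (bi, -bi); result = r0*a + swap(r2*a).
int main(void) {
  double ar = 1.5, ai = -0.5, br = 2.0, bi = 3.0;
  double r3[2] = { bi*ar, -bi*ai };               // r2 * a, lane by lane
  double r4[2] = { r3[1], r3[0] };                // the vrev64q/vextq swap
  double re = br*ar + r4[0], im = br*ai + r4[1];  // fma with r0 = (br, br)
  assert(re == ar*br - ai*bi && im == ar*bi + ai*br);
  printf("(%g) + i(%g)\n", re, im);
  return 0;
}
```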
diff --git a/lib/simd/Grid_vector_types.h b/lib/simd/Grid_vector_types.h
index e05fecc4..27585547 100644
--- a/lib/simd/Grid_vector_types.h
+++ b/lib/simd/Grid_vector_types.h
@@ -53,7 +53,7 @@ directory
 #if defined IMCI
 #include "Grid_imci.h"
 #endif
-#ifdef NEONv8
+#ifdef NEONV8
 #include "Grid_neon.h"
 #endif
 #if defined QPX
diff --git a/lib/stencil/Lebesgue.cc b/lib/stencil/Lebesgue.cc
index 4551878c..2880e4b6 100644
--- a/lib/stencil/Lebesgue.cc
+++ b/lib/stencil/Lebesgue.cc
@@ -32,8 +32,11 @@ Author: paboyle
 namespace Grid {
 
 int LebesgueOrder::UseLebesgueOrder;
+#ifdef KNL
 std::vector<int> LebesgueOrder::Block({8,2,2,2});
-
+#else
+std::vector<int> LebesgueOrder::Block({2,2,2,2});
+#endif
 LebesgueOrder::IndexInteger LebesgueOrder::alignup(IndexInteger n){
   n--;           // 1000 0011 --> 1000 0010
   n |= n >> 1;   // 1000 0010 | 0100 0001 = 1100 0011
@@ -51,8 +54,31 @@ LebesgueOrder::LebesgueOrder(GridBase *_grid)
   if ( Block[0]==0) ZGraph();
   else if ( Block[1]==0) NoBlocking();
   else CartesianBlocking();
-}
+  if (0) {
+    std::cout << "Thread Interleaving" << std::endl;
+    ThreadInterleave();
+  }
+}
+void LebesgueOrder::ThreadInterleave(void)
+{
+  std::vector<IndexInteger> reorder = _LebesgueReorder;
+  std::vector<IndexInteger> throrder;
+  int vol = _LebesgueReorder.size();
+  int threads = GridThread::GetThreads();
+  int blockbits=3;
+  int blocklen = 8;
+  int msk = 0x7;
+
+  for(int t=0;t<threads;t++){
+    for(int ss=0;ss<vol;ss++){
+      if ( (ss >> blockbits) % threads == t ) {
+        throrder.push_back(reorder[ss]);
+      }
+    }
+  }
+  _LebesgueReorder = throrder;
+}
 void LebesgueOrder::NoBlocking(void)
 {
   std::cout<
diff --git a/lib/stencil/Lebesgue.h b/lib/stencil/Lebesgue.h
--- a/lib/stencil/Lebesgue.h
+++ b/lib/stencil/Lebesgue.h
   void CartesianBlocking(IndexInteger n, std::vector<IndexInteger> & xi, std::vector<IndexInteger> &dims);
+  void ThreadInterleave(void);
+
 private:
   std::vector<IndexInteger> _LebesgueReorder;
diff --git a/lib/tensors/Tensor_arith_mul.h b/lib/tensors/Tensor_arith_mul.h
index c24853b7..a474db9c 100644
--- a/lib/tensors/Tensor_arith_mul.h
+++ b/lib/tensors/Tensor_arith_mul.h
@@ -98,7 +98,9 @@ template <class rtype, class vtype, class mtype, int N>
 strong_inline void mult(iVector<rtype,N> * __restrict__ ret,
                         const iVector<vtype,N> * __restrict__ rhs,
                         const iScalar<mtype> * __restrict__ lhs){
-    mult(ret,lhs,rhs);
+    for(int c1=0;c1<N;c1++){
+      mult(&ret->_internal[c1],&rhs->_internal[c1],&lhs->_internal);
+    }
 }
diff --git a/lib/util/Init.cc b/lib/util/Init.cc
index fe3b1734..35a569ba 100644
--- a/lib/util/Init.cc
+++ b/lib/util/Init.cc
@@ -377,7 +377,7 @@ void Grid_init(int *argc,char ***argv)
   std::cout << GridLogDebug << "Requesting "<< CartesianCommunicator::MAX_MPI_SHM_BYTES <<" byte stencil comms buffers "<<std::endl;
 /* END LEGAL */
 #include <Grid/Grid.h>
 
+using namespace Grid;
+using namespace Grid::QCD;
+
 int main (int argc, char ** argv)
 {
   std::vector<int> seeds({1,2,3,4});
@@ -82,6 +85,7 @@ int main (int argc, char ** argv)
 
   Uorg = Uorg - Umu;
   std::cout << " Norm Difference "<< norm2(Uorg) << std::endl;
+  std::cout << " Norm "<< norm2(Umu) << std::endl;
 
   std::cout<< "*****************************************************************" <<std::endl;
-  //FermionParameters(Reader<ReaderClass>& Reader){
-  //  read(Reader, "Mobius", *this);
-  //}
 };
 
@@ -113,9 +107,17 @@ int main(int argc, char **argv) {
 
   bool ApplySmearing = MyParams.Mobius.ApplySmearing;
 
+  // Use this if you want to tweak the default decomposition
+  // commented out as very architecture specific
+
+  //std::vector<int> simd_lanes({2,2,1,1});
 
-  // Grid from the command line
-  TheHMC.Resources.AddFourDimGrid("gauge");
+  // Grid from the command line arguments --grid and --mpi
+  // drop the simd_lanes argument to fall back to the default decomposition for the SIMD lanes
+
+  //TheHMC.Resources.AddFourDimGrid("gauge", simd_lanes); // tweak the SIMD lanes
+  TheHMC.Resources.AddFourDimGrid("gauge"); // use default simd lanes decomposition
+
   // Possibile to create the module by hand
   // hardcoding parameters or using a Reader
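The `Tensor_arith_mul.h` fix above matters because the old body forwarded to `mult(ret, lhs, rhs)` with operand roles swapped: for matrix-valued internal types, scalar-times-vector and vector-times-scalar differ. A hedged toy model of the corrected semantics, with a 2x2 matrix standing in for the wrapped "scalar" so the non-commutativity is visible:

``` cpp
#include <array>
#include <cstdio>

// Toy model of the corrected iVector * iScalar mult: the "scalar" wraps a
// 2x2 matrix, and the fixed loop right-multiplies every vector component.
using Mat = std::array<std::array<double,2>,2>;

Mat mul(const Mat &a, const Mat &b) {
  Mat r{};
  for (int i=0;i<2;i++) for (int j=0;j<2;j++)
    for (int k=0;k<2;k++) r[i][j] += a[i][k]*b[k][j];
  return r;
}

int main(void) {
  Mat s = {{{0,1},{1,0}}};                        // the wrapped "scalar"
  std::array<Mat,2> rhs = { Mat{{{1,2},{3,4}}}, Mat{{{0,1},{2,3}}} };
  std::array<Mat,2> ret;
  for (size_t c1 = 0; c1 < rhs.size(); c1++)      // mirrors mult(&ret[c1], &rhs[c1], &s)
    ret[c1] = mul(rhs[c1], s);                    // component * scalar, not scalar * component
  printf("rhs*s gives %g, while s*rhs would give %g\n",
         ret[0][0][0], mul(s, rhs[0])[0][0]);     // 2 vs 3: order matters
  return 0;
}
```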
diff --git a/tests/hmc/Test_hmc_ScalarActionNxN.cc b/tests/hmc/Test_hmc_ScalarActionNxN.cc
index a7490f51..a4dad1a3 100644
--- a/tests/hmc/Test_hmc_ScalarActionNxN.cc
+++ b/tests/hmc/Test_hmc_ScalarActionNxN.cc
@@ -45,7 +45,7 @@ using namespace Grid;
 using namespace Grid::QCD;
 
 template <class Impl>
-class MagLogger : public HmcObservable<typename Impl::Field> {
+class MagMeas : public HmcObservable<typename Impl::Field> {
  public:
   typedef typename Impl::Field Field;
   typedef typename Impl::Simd::scalar_type Trace;
@@ -72,13 +72,13 @@ private:
 };
 
 template <class Impl>
-class MagMod: public ObservableModule<MagLogger<Impl>, NoParameters>{
-  typedef ObservableModule<MagLogger<Impl>, NoParameters> ObsBase;
+class MagMod: public ObservableModule<MagMeas<Impl>, NoParameters>{
+  typedef ObservableModule<MagMeas<Impl>, NoParameters> ObsBase;
   using ObsBase::ObsBase; // for constructors
 
   // acquire resource
   virtual void initialize(){
-    this->ObservablePtr.reset(new MagLogger<Impl>());
+    this->ObservablePtr.reset(new MagMeas<Impl>());
   }
   public:
   MagMod(): ObsBase(NoParameters()){}
diff --git a/tests/hmc/Test_hmc_WilsonGauge.cc b/tests/hmc/Test_hmc_WilsonGauge.cc
index b2d5fb02..05bf81a2 100644
--- a/tests/hmc/Test_hmc_WilsonGauge.cc
+++ b/tests/hmc/Test_hmc_WilsonGauge.cc
@@ -66,7 +66,14 @@ int main(int argc, char **argv) {
   typedef PlaquetteMod<HMCWrapper::ImplPolicy> PlaqObs;
   typedef TopologicalChargeMod<HMCWrapper::ImplPolicy> QObs;
   TheHMC.Resources.AddObservable<PlaqObs>();
-  TheHMC.Resources.AddObservable<QObs>();
+  TopologyObsParameters TopParams;
+  TopParams.interval = 5;
+  TopParams.do_smearing = true;
+  TopParams.Smearing.steps = 200;
+  TopParams.Smearing.step_size = 0.01;
+  TopParams.Smearing.meas_interval = 50;
+  TopParams.Smearing.maxTau = 2.0;
+  TheHMC.Resources.AddObservable<QObs>(TopParams);
   //////////////////////////////////////////////
 
   /////////////////////////////////////////////////////////////
diff --git a/tests/solver/Test_staggered_block_cg_unprec.cc b/tests/solver/Test_staggered_block_cg_unprec.cc
index 8db41e98..f54bc3b2 100644
--- a/tests/solver/Test_staggered_block_cg_unprec.cc
+++ b/tests/solver/Test_staggered_block_cg_unprec.cc
@@ -75,7 +75,7 @@ int main (int argc, char ** argv)
   LatticeGaugeField Umu(UGrid); SU3::HotConfiguration(pRNG,Umu);
 
   RealD mass=0.003;
-  ImprovedStaggeredFermion5DR Ds(Umu,Umu,*FGrid,*FrbGrid,*UGrid,*UrbGrid,mass);
+  ImprovedStaggeredFermion5DR Ds(Umu,Umu,*FGrid,*FrbGrid,*UGrid,*UrbGrid,mass);
   MdagMLinearOperator<ImprovedStaggeredFermion5DR,FermionField> HermOp(Ds);
 
   ConjugateGradient<FermionField> CG(1.0e-8,10000);
@@ -99,21 +99,27 @@ int main (int argc, char ** argv)
 
   std::cout << GridLogMessage << " Calling 5d CG for "<
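The parameter block hard-coded in the WilsonGauge test above can equally come from an input file, since both parameter structs are declared with GRID_SERIALIZABLE_CLASS_MEMBERS and carry reader constructors. A hedged sketch using the XML backend — the file name and its element layout are assumptions:

``` cpp
#include <Grid/Grid.h>
using namespace Grid;
using namespace Grid::QCD;

// Hedged sketch: read the topology-measurement parameters from XML instead
// of hard-coding them; "input.xml" and its contents are hypothetical.
int main(int argc, char **argv) {
  Grid_init(&argc, &argv);
  XmlReader reader("input.xml");
  TopologyObsParameters TopParams(reader);  // reads the "TopologyMeasurement" node
  std::cout << GridLogMessage << "measurement interval = "
            << TopParams.interval << std::endl;
  Grid_finalize();
  return 0;
}
```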