Merge branch 'release/v0.6.0' into feature/hadrons

# Conflicts:
#	Makefile.am
2025-09-18 17:21:05 +01:00 · 2016-11-08 20:43:39 +00:00
parent a034e9901b 9576f0903d
commit 13a8997789
140 changed files with 10370 additions and 5042 deletions
--- a/.travis.yml
+++ b/.travis.yml
@@ -9,10 +9,6 @@ matrix:
    - os:        osx
      osx_image: xcode7.2
      compiler: clang
    - os:        osx
      osx_image: xcode7.2
      compiler: gcc
      env: VERSION=-5
    - compiler: gcc
      addons:
        apt:
@@ -107,3 +103,4 @@ script:
    - make -j4
    - if [[ "$TRAVIS_OS_NAME" == "linux" ]]; then mpirun.openmpi -n 2 ./benchmarks/Benchmark_dwf --threads 1 --mpi 2.1.1.1; fi
    - if [[ "$TRAVIS_OS_NAME" == "osx" ]]; then mpirun -n 2 ./benchmarks/Benchmark_dwf --threads 1 --mpi 2.1.1.1; fi
--- a/Makefile.am
+++ b/Makefile.am
@@ -1,4 +1,10 @@
 # additional include paths necessary to compile the C++ library
 SUBDIRS = lib benchmarks tests programs
 .PHONY: tests
 tests: all
 	$(MAKE) -C tests tests
 AM_CXXFLAGS += -I$(top_builddir)/include
 ACLOCAL_AMFLAGS = -I m4
--- a/44
+++ b/44
@@ -1,44 +0,0 @@
 This library provides data parallel C++ container classes with internal memory layout
 that is transformed to map efficiently to SIMD architectures. CSHIFT facilities
 are provided, similar to HPF and cmfortran, and user control is given over the mapping of
 array indices to both MPI tasks and SIMD processing elements.
 * Identically shaped arrays then be processed with perfect data parallelisation.
 * Such identically shapped arrays are called conformable arrays.
 The transformation is based on the observation that Cartesian array processing involves
 identical processing to be performed on different regions of the Cartesian array.
 The library will (eventually) both geometrically decompose into MPI tasks and across SIMD lanes.
 Data parallel array operations can then be specified with a SINGLE data parallel paradigm, but
 optimally use MPI, OpenMP and SIMD parallelism under the hood. This is a significant simplification
 for most programmers.
 The layout transformations are parametrised by the SIMD vector length. This adapts according to the architecture.
 Presently SSE2 (128 bit) AVX, AVX2 (256 bit) and IMCI and AVX512 (512 bit) targets are supported.
 These are presented as 
  vRealF, vRealD, vComplexF, vComplexD 
 internal vector data types. These may be useful in themselves for other programmers.
 The corresponding scalar types are named
  RealF, RealD, ComplexF, ComplexD
 MPI parallelism is UNIMPLEMENTED and for now only OpenMP and SIMD parallelism is present in the library.
   You can give `configure' initial values for configuration parameters
 by setting variables in the command line or in the environment.  Here
 is are examples:
     ./configure CXX=clang++ CXXFLAGS="-std=c++11 -O3 -msse4" --enable-simd=SSE4
     ./configure CXX=clang++ CXXFLAGS="-std=c++11 -O3 -mavx" --enable-simd=AVX1
     ./configure CXX=clang++ CXXFLAGS="-std=c++11 -O3 -mavx2" --enable-simd=AVX2
     ./configure CXX=icpc CXXFLAGS="-std=c++11 -O3 -mmic" --enable-simd=AVX512 --host=none
--- a/1
+++ b/1
@@ -0,0 +1 @@
 README.md
--- a/README.md
+++ b/README.md
@@ -16,11 +16,27 @@
 **Data parallel C++ mathematical object library.**
 Please send all pull requests to the `develop` branch.
 License: GPL v2.
-Last update 2016/08/03.
+Last update Nov 2016.
 _Please do not send pull requests to the `master` branch which is reserved for releases._
 ### Bug report
 _To help us track and resolve issues with Grid more efficiently, please report problems using the GitHub issue system rather than emailing Grid developers._
 When you file an issue, please go through the following checklist:
 1. Check that the code is pointing to the `HEAD` of `develop` or any commit in `master` which is tagged with a version number. 
 2. Give a description of the target platform (CPU, network, compiler). Please give the full CPU part description, using for example `cat /proc/cpuinfo | grep 'model name' | uniq` (Linux) or `sysctl machdep.cpu.brand_string` (macOS) and the full output of the `--version` option of your compiler.
 3. Give the exact `configure` command used.
 4. Attach `config.log`.
 5. Attach `config.summary`.
 6. Attach the output of `make V=1`.
 7. Describe the issue and any previous attempt to solve it. If relevant, show how to reproduce the issue using a minimal working example.
 ### Description
 This library provides data parallel C++ container classes with internal memory layout
@@ -29,7 +45,7 @@ are provided, similar to HPF and cmfortran, and user control is given over the m
 array indices to both MPI tasks and SIMD processing elements.
 * Identically shaped arrays can then be processed with perfect data parallelisation.
-* Such identically shapped arrays are called conformable arrays.
+* Such identically shaped arrays are called conformable arrays.
 The transformation is based on the observation that Cartesian array processing involves
 identical processing to be performed on different regions of the Cartesian array.
@@ -42,7 +58,7 @@ optimally use MPI, OpenMP and SIMD parallelism under the hood. This is a signifi
 for most programmers.
 The layout transformations are parametrised by the SIMD vector length. This adapts according to the architecture.
-Presently SSE4 (128 bit) AVX, AVX2 (256 bit) and IMCI and AVX512 (512 bit) targets are supported (ARM NEON and BG/Q QPX on the way).
+Presently SSE4 (128 bit) AVX, AVX2, QPX (256 bit), IMCI, and AVX512 (512 bit) targets are supported (ARM NEON on the way).
 These are presented as `vRealF`, `vRealD`, `vComplexF`, and `vComplexD` internal vector data types. These may be useful in themselves for other programmers.
 The corresponding scalar types are named `RealF`, `RealD`, `ComplexF` and `ComplexD`.
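The relation between the SIMD width and the number of scalars per vector type can be sketched as below. This is a minimal illustration, not Grid code: `simd_lanes` is a hypothetical helper, the scalar aliases are assumed stand-ins for the README's `RealF`, `RealD`, `ComplexF` and `ComplexD`, and dense packing of scalars in the register is assumed.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>

// Assumed stand-ins for the scalar types named in the README.
using RealF    = float;
using RealD    = double;
using ComplexF = std::complex<float>;
using ComplexD = std::complex<double>;

// Hypothetical helper (not part of Grid): how many scalars of type T fit in
// one SIMD register of `vector_bits` bits, assuming dense packing.
template <typename T>
constexpr std::size_t simd_lanes(std::size_t vector_bits) {
  return vector_bits / (8 * sizeof(T));
}

// Compile-time checks for the widths quoted in the README.
static_assert(simd_lanes<RealF>(128) == 4, "128 bit register: 4 floats");
static_assert(simd_lanes<RealD>(256) == 4, "256 bit register: 4 doubles");
static_assert(simd_lanes<ComplexD>(512) == 4, "512 bit register: 4 complex doubles");
```

Under these assumptions a 256 bit target holds eight `RealF` values but only two `ComplexD` values per vector, which is why the layout transformation has to be parametrised by the vector length rather than fixed.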
@@ -50,7 +66,7 @@ The corresponding scalar types are named `RealF`, `RealD`, `ComplexF` and `Compl
 MPI, OpenMP, and SIMD parallelism are present in the library.
 Please see https://arxiv.org/abs/1512.03487 for more detail.
-### Installation
+### Quick start
 First, start by cloning the repository:
 ``` bash
@@ -71,12 +87,10 @@ mkdir build; cd build
 ../configure --enable-precision=double --enable-simd=AVX --enable-comms=mpi-auto --prefix=<path>
 ```
-where `--enable-precision=` set the default precision (`single` or `double`),
-`--enable-simd=` set the SIMD type (see possible values below), `--enable-
-comms=` set the protocol used for communications (`none`, `mpi`, `mpi-auto` or
-`shmem`), and `<path>` should be replaced by the prefix path where you want to
-install Grid. The `mpi-auto` communication option set `configure` to determine
-automatically how to link to MPI. Other options are available, use `configure
---help` to display them. Like with any other program using GNU autotool, the
-`CXX`, `CXXFLAGS`, `LDFLAGS`, ... environment variables can be modified to
-customise the build.
+where `--enable-precision=` set the default precision,
+`--enable-simd=` set the SIMD type, `--enable-
+comms=`, and `<path>` should be replaced by the prefix path where you want to
+install Grid. Other options are detailed in the next section, you can also use `configure
+--help` to display them. Like with any other program using GNU autotool, the
+`CXX`, `CXXFLAGS`, `LDFLAGS`, ... environment variables can be modified to
+customise the build.
@@ -92,25 +106,88 @@ To minimise the build time, only the tests at the root of the `tests` directory
 ``` bash
 make -C tests/<subdir> tests
 ```
 If you want to build all the tests at once just use `make tests`.
 ### Build configuration options
 - `--prefix=<path>`: installation prefix for Grid.
 - `--with-gmp=<path>`: look for GMP in the UNIX prefix `<path>`
 - `--with-mpfr=<path>`: look for MPFR in the UNIX prefix `<path>`
 - `--with-fftw=<path>`: look for FFTW in the UNIX prefix `<path>`
 - `--enable-lapack[=<path>]`: enable LAPACK support in Lanczos eigensolver. A UNIX prefix containing the library can be specified (optional).
 - `--enable-mkl[=<path>]`: use Intel MKL for FFT (and LAPACK if enabled) routines. A UNIX prefix containing the library can be specified (optional).
 - `--enable-numa`: ???
 - `--enable-simd=<code>`: set up Grid for the SIMD target `<code>` (default: `GEN`). A list of possible SIMD targets is detailed in a section below.
 - `--enable-precision={single|double}`: set the default precision (default: `double`).
 - `--enable-comms=<comm>`: use `<comm>` for message passing (default: `none`). A list of possible communication interfaces is detailed in a section below.
 - `--enable-rng={ranlux48|mt19937}`: choose the RNG (default: `ranlux48`).
 - `--disable-timers`: disable system dependent high-resolution timers.
 - `--enable-chroma`: enable Chroma regression tests.
 ### Possible communication interfaces
 The following options can be used with the `--enable-comms=` option to target different communication interfaces:
 | `<comm>`       | Description                                                   |
 | -------------- | ------------------------------------------------------------- |
 | `none`         | no communications                                             |
 | `mpi[-auto]`   | MPI communications                                            |
 | `mpi3[-auto]`  | MPI communications using MPI 3 shared memory                  |
 | `mpi3l[-auto]` | MPI communications using MPI 3 shared memory and leader model |
 | `shmem`        | Cray SHMEM communications                                     |
 For the MPI interfaces the optional `-auto` suffix instructs the `configure` script to determine all the necessary compilation and linking flags. This is done by extracting the information from the MPI wrapper specified in the environment variable `MPICXX` (if it is not specified, `configure` will scan through a list of default names).
 ### Possible SIMD types
 The following options can be used with the `--enable-simd=` option to target different SIMD instruction sets:
-| String      | Description                            |
+| `<code>`    | Description                            |
 | ----------- | -------------------------------------- |
 | `GEN`       | generic portable vector code           |
 | `SSE4`      | SSE 4.2 (128 bit)                      |
 | `AVX`       | AVX (256 bit)                          |
-| `AVXFMA4`   | AVX (256 bit) + FMA                    |
+| `AVXFMA`    | AVX (256 bit) + FMA                    |
 | `AVXFMA4`   | AVX (256 bit) + FMA4                   |
 | `AVX2`      | AVX 2 (256 bit)                        |
 | `AVX512`    | AVX 512 bit                            |
-| `AVX512MIC` | AVX 512 bit for Intel MIC architecture |
+| `QPX`       | QPX (256 bit)                          |
 | `IMCI`      | Intel IMCI instructions (512 bit)      |
 Alternatively, some CPU codenames can be directly used:
-| String      | Description                            |
+| `<code>`    | Description                            |
 | ----------- | -------------------------------------- |
-| `KNC`       | [Intel Knights Corner](http://ark.intel.com/products/codename/57721/Knights-Corner) |
+| `KNC`       | [Intel Xeon Phi codename Knights Corner](http://ark.intel.com/products/codename/57721/Knights-Corner) |
-| `KNL`       | [Intel Knights Landing](http://ark.intel.com/products/codename/48999/Knights-Landing) |
+| `KNL`       | [Intel Xeon Phi codename Knights Landing](http://ark.intel.com/products/codename/48999/Knights-Landing) |
 | `BGQ`       | Blue Gene/Q                            |
 #### Notes:
 - We currently support AVX512 only for the Intel compiler. Support for GCC and clang will appear in future versions of Grid once AVX512 support within GCC and clang is more advanced.
 - For BG/Q only [bgclang](http://trac.alcf.anl.gov/projects/llvm-bgq) is supported. We do not presently plan to support more compilers for this platform.
 - BG/Q performance is currently rather poor. This is being investigated for future versions.
 ### Build setup for Intel Knights Landing platform
 The following configuration is recommended for the Intel Knights Landing platform:
 ``` bash
 ../configure --enable-precision=double\
             --enable-simd=KNL        \
             --enable-comms=mpi-auto \
             --with-gmp=<path>        \
             --with-mpfr=<path>       \
             --enable-mkl             \
             CXX=icpc MPICXX=mpiicpc
 ```
 where `<path>` is the UNIX prefix where GMP and MPFR are installed. If you are working on a Cray machine that does not use the `mpiicpc` wrapper, please use:
 ``` bash
 ../configure --enable-precision=double\
             --enable-simd=KNL        \
             --enable-comms=mpi       \
             --with-gmp=<path>        \
             --with-mpfr=<path>       \
             --enable-mkl             \
             CXX=CC CC=cc
 ```
--- a/4
+++ b/4
@@ -1,4 +1,6 @@
-Version : 0.5.0
+Version : 0.6.0
 - AVX512, AVX2, AVX, SSE good
 - Clang 3.5 and above, ICPC v16 and above, GCC 4.9 and above
 - MPI and MPI3
 - HiRep, Smearing, Generic gauge group
--- a/benchmarks/Benchmark_comms.cc
+++ b/benchmarks/Benchmark_comms.cc
@@ -42,15 +42,14 @@ int main (int argc, char ** argv)
  int Nloop=10;
  int nmu=0;
-  for(int mu=0;mu<4;mu++) if (mpi_layout[mu]>1) nmu++;
+  for(int mu=0;mu<Nd;mu++) if (mpi_layout[mu]>1) nmu++;
  std::cout<<GridLogMessage << "===================================================================================================="<<std::endl;
  std::cout<<GridLogMessage << "= Benchmarking concurrent halo exchange in "<<nmu<<" dimensions"<<std::endl;
  std::cout<<GridLogMessage << "===================================================================================================="<<std::endl;
  std::cout<<GridLogMessage << "  L  "<<"\t\t"<<" Ls  "<<"\t\t"<<"bytes"<<"\t\t"<<"MB/s uni"<<"\t\t"<<"MB/s bidi"<<std::endl;
-
-
-  for(int lat=4;lat<=32;lat+=2){
+  int maxlat=16;
+  for(int lat=4;lat<=maxlat;lat+=2){
    for(int Ls=1;Ls<=16;Ls*=2){
      std::vector<int> latt_size  ({lat*mpi_layout[0],
@@ -125,7 +124,7 @@ int main (int argc, char ** argv)
  std::cout<<GridLogMessage << "  L  "<<"\t\t"<<" Ls  "<<"\t\t"<<"bytes"<<"\t\t"<<"MB/s uni"<<"\t\t"<<"MB/s bidi"<<std::endl;
-  for(int lat=4;lat<=32;lat+=2){
+  for(int lat=4;lat<=maxlat;lat+=2){
    for(int Ls=1;Ls<=16;Ls*=2){
      std::vector<int> latt_size  ({lat,lat,lat,lat});
@@ -194,128 +193,83 @@ int main (int argc, char ** argv)
    }
  }  
-#if 0
+  Nloop=100;
  std::cout<<GridLogMessage << "===================================================================================================="<<std::endl;
-  std::cout<<GridLogMessage << "= Benchmarking sequential persistent halo exchange in "<<nmu<<" dimensions"<<std::endl;
+  std::cout<<GridLogMessage << "= Benchmarking concurrent STENCIL halo exchange in "<<nmu<<" dimensions"<<std::endl;
  std::cout<<GridLogMessage << "===================================================================================================="<<std::endl;
  std::cout<<GridLogMessage << "  L  "<<"\t\t"<<" Ls  "<<"\t\t"<<"bytes"<<"\t\t"<<"MB/s uni"<<"\t\t"<<"MB/s bidi"<<std::endl;
-
+  for(int lat=4;lat<=maxlat;lat+=2){
  for(int lat=4;lat<=32;lat+=2){
    for(int Ls=1;Ls<=16;Ls*=2){
-      std::vector<int> latt_size  ({lat,lat,lat,lat});
+      std::vector<int> latt_size  ({lat*mpi_layout[0],
      				    lat*mpi_layout[1],
      				    lat*mpi_layout[2],
      				    lat*mpi_layout[3]});
      GridCartesian     Grid(latt_size,simd_layout,mpi_layout);
-      std::vector<std::vector<HalfSpinColourVectorD> > xbuf(8,std::vector<HalfSpinColourVectorD>(lat*lat*lat*Ls));
+      std::vector<HalfSpinColourVectorD *> xbuf(8);
-      std::vector<std::vector<HalfSpinColourVectorD> > rbuf(8,std::vector<HalfSpinColourVectorD>(lat*lat*lat*Ls));
+      std::vector<HalfSpinColourVectorD *> rbuf(8);
-
+      Grid.ShmBufferFreeAll();
      for(int d=0;d<8;d++){
 	xbuf[d] = (HalfSpinColourVectorD *)Grid.ShmBufferMalloc(lat*lat*lat*Ls*sizeof(HalfSpinColourVectorD));
 	rbuf[d] = (HalfSpinColourVectorD *)Grid.ShmBufferMalloc(lat*lat*lat*Ls*sizeof(HalfSpinColourVectorD));
      }
      int ncomm;
      int bytes=lat*lat*lat*Ls*sizeof(HalfSpinColourVectorD);
      double start=usecond();
      for(int i=0;i<Nloop;i++){
-      std::vector<CartesianCommunicator::CommsRequest_t> empty;
+	std::vector<CartesianCommunicator::CommsRequest_t> requests;
      std::vector<std::vector<CartesianCommunicator::CommsRequest_t> > requests_fwd(Nd,empty);
      std::vector<std::vector<CartesianCommunicator::CommsRequest_t> > requests_bwd(Nd,empty);
      for(int mu=0;mu<4;mu++){
 	ncomm=0;
-	if (mpi_layout[mu]>1 ) {
+	for(int mu=0;mu<4;mu++){
 	  ncomm++;
-	  int comm_proc;
+	  if (mpi_layout[mu]>1 ) {
 	  int xmit_to_rank;
 	  int recv_from_rank;
-	  comm_proc=1;
+	    ncomm++;
-	  Grid.ShiftedRanks(mu,comm_proc,xmit_to_rank,recv_from_rank);
+	    int comm_proc=1;
-	  Grid.SendToRecvFromInit(requests_fwd[mu],
+	    int xmit_to_rank;
-				  (void *)&xbuf[mu][0],
+	    int recv_from_rank;
 				  xmit_to_rank,
 				  (void *)&rbuf[mu][0],
 				  recv_from_rank,
 				  bytes);
-	  comm_proc = mpi_layout[mu]-1;
+	    Grid.ShiftedRanks(mu,comm_proc,xmit_to_rank,recv_from_rank);
-	  Grid.ShiftedRanks(mu,comm_proc,xmit_to_rank,recv_from_rank);
+	    Grid.StencilSendToRecvFromBegin(requests,
-	  Grid.SendToRecvFromInit(requests_bwd[mu],
+					    (void *)&xbuf[mu][0],
-				  (void *)&xbuf[mu+4][0],
+					    xmit_to_rank,
-				  xmit_to_rank,
+					    (void *)&rbuf[mu][0],
-				  (void *)&rbuf[mu+4][0],
+					    recv_from_rank,
-				  recv_from_rank,
+					    bytes);
 				  bytes);
-	}
+	    comm_proc = mpi_layout[mu]-1;
      }
-      {
+	    Grid.ShiftedRanks(mu,comm_proc,xmit_to_rank,recv_from_rank);
-	double start=usecond();
+	    Grid.StencilSendToRecvFromBegin(requests,
-	for(int i=0;i<Nloop;i++){
+					    (void *)&xbuf[mu+4][0],
 					    xmit_to_rank,
 					    (void *)&rbuf[mu+4][0],
 					    recv_from_rank,
 					    bytes);
 	  for(int mu=0;mu<4;mu++){
 	    if (mpi_layout[mu]>1 ) {
 	      Grid.SendToRecvFromBegin(requests_fwd[mu]);
 	      Grid.SendToRecvFromComplete(requests_fwd[mu]);
 	      Grid.SendToRecvFromBegin(requests_bwd[mu]);
 	      Grid.SendToRecvFromComplete(requests_bwd[mu]);
 	    }
 	  }
 	  Grid.Barrier();
 	}
-	
+	Grid.StencilSendToRecvFromComplete(requests);
-	double stop=usecond();
+	Grid.Barrier();
 	double dbytes    = bytes;
 	double xbytes    = Nloop*dbytes*2.0*ncomm;
 	double rbytes    = xbytes;
 	double bidibytes = xbytes+rbytes;
 	double time = stop-start;
 	std::cout<<GridLogMessage << lat<<"\t\t"<<Ls<<"\t\t"<<bytes<<"\t\t"<<xbytes/time<<"\t\t"<<bidibytes/time<<std::endl;
      }
      double stop=usecond();
      double dbytes    = bytes;
      double xbytes    = Nloop*dbytes*2.0*ncomm;
      double rbytes    = xbytes;
      double bidibytes = xbytes+rbytes;
-      {
+      double time = stop-start; // microseconds
 	double start=usecond();
 	for(int i=0;i<Nloop;i++){
 	  for(int mu=0;mu<4;mu++){
 	    if (mpi_layout[mu]>1 ) {
 	      Grid.SendToRecvFromBegin(requests_fwd[mu]);
 	      Grid.SendToRecvFromBegin(requests_bwd[mu]);
 	      Grid.SendToRecvFromComplete(requests_fwd[mu]);
 	      Grid.SendToRecvFromComplete(requests_bwd[mu]);
 	    }
 	  }
 	  Grid.Barrier();
 	}
 	double stop=usecond();
 	double dbytes    = bytes;
 	double xbytes    = Nloop*dbytes*2.0*ncomm;
 	double rbytes    = xbytes;
 	double bidibytes = xbytes+rbytes;
 	double time = stop-start;
 	std::cout<<GridLogMessage << lat<<"\t\t"<<Ls<<"\t\t"<<bytes<<"\t\t"<<xbytes/time<<"\t\t"<<bidibytes/time<<std::endl;
      }
      std::cout<<GridLogMessage << lat<<"\t\t"<<Ls<<"\t\t"<<bytes<<"\t\t"<<xbytes/time<<"\t\t"<<bidibytes/time<<std::endl;
    }
  }    
-#endif
  Grid_finalize();
 }
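The bandwidth columns printed by this benchmark follow from simple arithmetic on the loop counters: `usecond()` returns microseconds, so bytes divided by elapsed microseconds is numerically MB/s. The sketch below restates that calculation in isolation. `HaloBandwidth` and `halo_bandwidth` are hypothetical names introduced for illustration and are not part of Grid; the formula mirrors the `xbytes`, `rbytes` and `bidibytes` lines in the hunk above.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical result type (not part of Grid).
struct HaloBandwidth {
  double uni_MBps;   // bytes sent per microsecond, i.e. MB/s
  double bidi_MBps;  // bytes sent plus received per microsecond
};

// Hypothetical helper mirroring the benchmark's arithmetic.
// bytes_per_message: size of one halo buffer
// nloop:             number of timed iterations (Nloop)
// ncomm:             number of dimensions with more than one rank
// elapsed_us:        stop - start, in microseconds
inline HaloBandwidth halo_bandwidth(std::size_t bytes_per_message, int nloop,
                                    int ncomm, double elapsed_us) {
  assert(elapsed_us > 0.0);
  const double dbytes = static_cast<double>(bytes_per_message);
  // Factor 2.0: one forward and one backward message per communicating dimension.
  const double xbytes = nloop * dbytes * 2.0 * ncomm;
  const double rbytes = xbytes;  // every send is matched by a receive
  return {xbytes / elapsed_us, (xbytes + rbytes) / elapsed_us};
}
```

For example, 10 iterations of 1000-byte messages over 2 communicating dimensions in 100 microseconds move 40000 bytes in each direction, giving 400 MB/s unidirectional and 800 MB/s bidirectional.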
--- a/benchmarks/Benchmark_dwf.cc
+++ b/benchmarks/Benchmark_dwf.cc
@@ -44,7 +44,6 @@ struct scal {
    Gamma::GammaT
  };
 bool overlapComms = false;
 typedef WilsonFermion5D<DomainWallVec5dImplR> WilsonFermion5DR;
 typedef WilsonFermion5D<DomainWallVec5dImplF> WilsonFermion5DF;
 typedef WilsonFermion5D<DomainWallVec5dImplD> WilsonFermion5DD;
@@ -54,10 +53,6 @@ int main (int argc, char ** argv)
 {
  Grid_init(&argc,&argv);
  if( GridCmdOptionExists(argv,argv+argc,"--asynch") ){
    overlapComms = true;
  }
  int threads = GridThread::GetThreads();
  std::cout<<GridLogMessage << "Grid is setup to use "<<threads<<" threads"<<std::endl;
@@ -86,18 +81,6 @@ int main (int argc, char ** argv)
  LatticeFermion    tmp(FGrid);
  LatticeFermion    err(FGrid);
  /*  src=zero;
  std::vector<int> origin(5,0);
  SpinColourVector f=zero;
  for(int sp=0;sp<4;sp++){
  for(int co=0;co<3;co++){
    f()(sp)(co)=Complex(1.0,0.0); 
  }}
  pokeSite(f,src,origin);
  */
  ColourMatrix cm = Complex(1.0,0.0);
  LatticeGaugeField Umu(UGrid); 
  random(RNG4,Umu);
@@ -138,16 +121,25 @@ int main (int argc, char ** argv)
  RealD NP = UGrid->_Nprocessors;
  for(int doasm=1;doasm<2;doasm++){
    QCD::WilsonKernelsStatic::AsmOpt=doasm;
  DomainWallFermionR Dw(Umu,*FGrid,*FrbGrid,*UGrid,*UrbGrid,mass,M5);
-  std::cout<<GridLogMessage << "Calling Dw"<<std::endl;
+  std::cout << GridLogMessage<< "*****************************************************************" <<std::endl;
  std::cout << GridLogMessage<< "* Kernel options --dslash-generic, --dslash-unroll, --dslash-asm" <<std::endl;
  std::cout << GridLogMessage<< "*****************************************************************" <<std::endl;
  std::cout << GridLogMessage<< "*****************************************************************" <<std::endl;
  std::cout << GridLogMessage<< "* Benchmarking DomainWallFermionR::Dhop                  "<<std::endl;
  std::cout << GridLogMessage<< "* Vectorising space-time by "<<vComplex::Nsimd()<<std::endl;
  if ( sizeof(Real)==4 )   std::cout << GridLogMessage<< "* SINGLE precision "<<std::endl;
  if ( sizeof(Real)==8 )   std::cout << GridLogMessage<< "* DOUBLE precision "<<std::endl;
  if ( WilsonKernelsStatic::Opt == WilsonKernelsStatic::OptGeneric   ) std::cout << GridLogMessage<< "* Using GENERIC Nc WilsonKernels" <<std::endl;
  if ( WilsonKernelsStatic::Opt == WilsonKernelsStatic::OptHandUnroll) std::cout << GridLogMessage<< "* Using Nc=3       WilsonKernels" <<std::endl;
  if ( WilsonKernelsStatic::Opt == WilsonKernelsStatic::OptInlineAsm ) std::cout << GridLogMessage<< "* Using Asm Nc=3   WilsonKernels" <<std::endl;
  std::cout << GridLogMessage<< "*****************************************************************" <<std::endl;
  int ncall =100;
  if (1) {
    Dw.ZeroCounters();
    double t0=usecond();
    for(int i=0;i<ncall;i++){
      __SSC_START;
@@ -163,14 +155,26 @@ int main (int argc, char ** argv)
    std::cout<<GridLogMessage << "norm result "<< norm2(result)<<std::endl;
    std::cout<<GridLogMessage << "norm ref    "<< norm2(ref)<<std::endl;
    std::cout<<GridLogMessage << "mflop/s =   "<< flops/(t1-t0)<<std::endl;
-    std::cout<<GridLogMessage << "mflop/s per node =  "<< flops/(t1-t0)/NP<<std::endl;
+    std::cout<<GridLogMessage << "mflop/s per rank =  "<< flops/(t1-t0)/NP<<std::endl;
    err = ref-result; 
    std::cout<<GridLogMessage << "norm diff   "<< norm2(err)<<std::endl;
-    //    Dw.Report();
+    assert (norm2(err)< 1.0e-5 );
    Dw.Report();
  }
  if (1)
  {
    std::cout << GridLogMessage<< "*********************************************************" <<std::endl;
    std::cout << GridLogMessage<< "* Benchmarking WilsonFermion5D<DomainWallVec5dImplR>::Dhop "<<std::endl;
    std::cout << GridLogMessage<< "* Vectorising fifth dimension by "<<vComplex::Nsimd()<<std::endl;
    if ( sizeof(Real)==4 )   std::cout << GridLogMessage<< "* SINGLE precision "<<std::endl;
    if ( sizeof(Real)==8 )   std::cout << GridLogMessage<< "* DOUBLE precision "<<std::endl;
    if ( WilsonKernelsStatic::Opt == WilsonKernelsStatic::OptGeneric   ) std::cout << GridLogMessage<< "* Using GENERIC Nc WilsonKernels" <<std::endl;
    if ( WilsonKernelsStatic::Opt == WilsonKernelsStatic::OptHandUnroll) std::cout << GridLogMessage<< "* Using Nc=3       WilsonKernels" <<std::endl;
    if ( WilsonKernelsStatic::Opt == WilsonKernelsStatic::OptInlineAsm ) std::cout << GridLogMessage<< "* Using Asm Nc=3   WilsonKernels" <<std::endl;
    std::cout << GridLogMessage<< "*********************************************************" <<std::endl;
    typedef WilsonFermion5D<DomainWallVec5dImplR> WilsonFermion5DR;
    LatticeFermion ssrc(sFGrid);
    LatticeFermion sref(sFGrid);
@@ -188,8 +192,9 @@ int main (int argc, char ** argv)
      peekSite(tmp,src,site);
      pokeSite(tmp,ssrc,site);
    }}}}}
-    std::cout<<"src norms "<< norm2(src)<<" " <<norm2(ssrc)<<std::endl;
+    std::cout<<GridLogMessage<< "src norms "<< norm2(src)<<" " <<norm2(ssrc)<<std::endl;
    double t0=usecond();
    sDw.ZeroCounters();
    for(int i=0;i<ncall;i++){
      __SSC_START;
      sDw.Dhop(ssrc,sresult,0);
@@ -199,26 +204,25 @@ int main (int argc, char ** argv)
    double volume=Ls;  for(int mu=0;mu<Nd;mu++) volume=volume*latt4[mu];
    double flops=1344*volume*ncall;
-    std::cout<<GridLogMessage << "Called Dw sinner "<<ncall<<" times in "<<t1-t0<<" us"<<std::endl;
+    std::cout<<GridLogMessage << "Called Dw s_inner "<<ncall<<" times in "<<t1-t0<<" us"<<std::endl;
    std::cout<<GridLogMessage << "mflop/s =   "<< flops/(t1-t0)<<std::endl;
-    std::cout<<GridLogMessage << "mflop/s per node =  "<< flops/(t1-t0)/NP<<std::endl;
+    std::cout<<GridLogMessage << "mflop/s per rank =  "<< flops/(t1-t0)/NP<<std::endl;
-    //  sDw.Report();
+    sDw.Report();
    if(0){
      for(int i=0;i< PerformanceCounter::NumTypes(); i++ ){
-	sDw.Dhop(ssrc,sresult,0);
+  sDw.Dhop(ssrc,sresult,0);
-	PerformanceCounter Counter(i);
+  PerformanceCounter Counter(i);
-	Counter.Start();
+  Counter.Start();
-	sDw.Dhop(ssrc,sresult,0);
+  sDw.Dhop(ssrc,sresult,0);
-	Counter.Stop();
+  Counter.Stop();
-	Counter.Report();
+  Counter.Report();
      }
    }
-    std::cout<<"res norms "<< norm2(result)<<" " <<norm2(sresult)<<std::endl;
+    std::cout<<GridLogMessage<< "res norms "<< norm2(result)<<" " <<norm2(sresult)<<std::endl;
-
-    RealF sum=0;
+    RealD sum=0;
    for(int x=0;x<latt4[0];x++){
    for(int y=0;y<latt4[1];y++){
    for(int z=0;z<latt4[2];z++){
@@ -235,13 +239,13 @@ int main (int argc, char ** argv)
 	std::cout << "site "<<x<<","<<y<<","<<z<<","<<t<<","<<s<<" simd   "<<simd<<std::endl;
      }
    }}}}}
-    std::cout<<" difference between normal and simd is "<<sum<<std::endl;
+    std::cout<<GridLogMessage<<" difference between normal and simd is "<<sum<<std::endl;
    assert (sum< 1.0e-5 );
    if (1) {
      LatticeFermion sr_eo(sFGrid);
      LatticeFermion serr(sFGrid);
      LatticeFermion ssrc_e (sFrbGrid);
      LatticeFermion ssrc_o (sFrbGrid);
@@ -253,23 +257,35 @@ int main (int argc, char ** argv)
      setCheckerboard(sr_eo,ssrc_o);
      setCheckerboard(sr_eo,ssrc_e);
      serr = sr_eo-ssrc; 
      std::cout<<GridLogMessage << "EO src norm diff   "<< norm2(serr)<<std::endl;
      sr_e = zero;
      sr_o = zero;
      std::cout << GridLogMessage<< "*********************************************************" <<std::endl;
      std::cout << GridLogMessage<< "* Benchmarking WilsonFermion5D<DomainWallVec5dImplR>::DhopEO "<<std::endl;
      std::cout << GridLogMessage<< "* Vectorising fifth dimension by "<<vComplex::Nsimd()<<std::endl;
      if ( sizeof(Real)==4 )   std::cout << GridLogMessage<< "* SINGLE precision "<<std::endl;
      if ( sizeof(Real)==8 )   std::cout << GridLogMessage<< "* DOUBLE precision "<<std::endl;
      if ( WilsonKernelsStatic::Opt == WilsonKernelsStatic::OptGeneric   ) std::cout << GridLogMessage<< "* Using GENERIC Nc WilsonKernels" <<std::endl;
      if ( WilsonKernelsStatic::Opt == WilsonKernelsStatic::OptHandUnroll) std::cout << GridLogMessage<< "* Using Nc=3       WilsonKernels" <<std::endl;
      if ( WilsonKernelsStatic::Opt == WilsonKernelsStatic::OptInlineAsm ) std::cout << GridLogMessage<< "* Using Asm Nc=3   WilsonKernels" <<std::endl;
      std::cout << GridLogMessage<< "*********************************************************" <<std::endl;
      sDw.ZeroCounters();
      sDw.stat.init("DhopEO");
      double t0=usecond();
-      for(int i=0;i<ncall;i++){
+      for (int i = 0; i < ncall; i++) {
-	sDw.DhopEO(ssrc_o,sr_e,DaggerNo);
+        sDw.DhopEO(ssrc_o, sr_e, DaggerNo);
      }
      double t1=usecond();
      sDw.stat.print();
      double volume=Ls;  for(int mu=0;mu<Nd;mu++) volume=volume*latt4[mu];
      double flops=(1344.0*volume*ncall)/2;
      std::cout<<GridLogMessage << "sDeo mflop/s =   "<< flops/(t1-t0)<<std::endl;
-      std::cout<<GridLogMessage << "sDeo mflop/s per node   "<< flops/(t1-t0)/NP<<std::endl;
+      std::cout<<GridLogMessage << "sDeo mflop/s per rank   "<< flops/(t1-t0)/NP<<std::endl;
      sDw.Report();
      sDw.DhopEO(ssrc_o,sr_e,DaggerNo);
      sDw.DhopOE(ssrc_e,sr_o,DaggerNo);
@@ -278,9 +294,18 @@ int main (int argc, char ** argv)
      pickCheckerboard(Even,ssrc_e,sresult);
      pickCheckerboard(Odd ,ssrc_o,sresult);
      ssrc_e = ssrc_e - sr_e;
      RealD error = norm2(ssrc_e);
      std::cout<<GridLogMessage << "sE norm diff   "<< norm2(ssrc_e)<< "  vec nrm"<<norm2(sr_e) <<std::endl;
      ssrc_o = ssrc_o - sr_o;
      error+= norm2(ssrc_o);
      std::cout<<GridLogMessage << "sO norm diff   "<< norm2(ssrc_o)<< "  vec nrm"<<norm2(sr_o) <<std::endl;
      if(error>1.0e-5) { 
 	setCheckerboard(ssrc,ssrc_o);
 	setCheckerboard(ssrc,ssrc_e);
 	std::cout<< ssrc << std::endl;
      }
    }
@@ -294,24 +319,25 @@ int main (int argc, char ** argv)
      //    ref =  src - Gamma(Gamma::GammaX)* src ; // 1+gamma_x
      tmp = U[mu]*Cshift(src,mu+1,1);
      for(int i=0;i<ref._odata.size();i++){
-	ref._odata[i]+= tmp._odata[i] + Gamma(Gmu[mu])*tmp._odata[i]; ;
+  ref._odata[i]+= tmp._odata[i] + Gamma(Gmu[mu])*tmp._odata[i]; ;
      }
      tmp =adj(U[mu])*src;
      tmp =Cshift(tmp,mu+1,-1);
      for(int i=0;i<ref._odata.size();i++){
-	ref._odata[i]+= tmp._odata[i] - Gamma(Gmu[mu])*tmp._odata[i]; ;
+  ref._odata[i]+= tmp._odata[i] - Gamma(Gmu[mu])*tmp._odata[i]; ;
      }
    }
    ref = -0.5*ref;
  }
  Dw.Dhop(src,result,1);
  std::cout << GridLogMessage << "Compare to naive wilson implementation Dag to verify correctness" << std::endl;
  std::cout<<GridLogMessage << "Called DwDag"<<std::endl;
  std::cout<<GridLogMessage << "norm result "<< norm2(result)<<std::endl;
  std::cout<<GridLogMessage << "norm ref    "<< norm2(ref)<<std::endl;
  err = ref-result; 
  std::cout<<GridLogMessage << "norm diff   "<< norm2(err)<<std::endl;
-
+  assert(norm2(err)<1.0e-5);
  LatticeFermion src_e (FrbGrid);
  LatticeFermion src_o (FrbGrid);
  LatticeFermion r_e   (FrbGrid);
@@ -319,14 +345,24 @@ int main (int argc, char ** argv)
  LatticeFermion r_eo  (FGrid);
-  std::cout<<GridLogMessage << "Calling Deo and Doe"<<std::endl;
+  std::cout<<GridLogMessage << "Calling Deo and Doe and assert Deo+Doe == Dunprec"<<std::endl;
  pickCheckerboard(Even,src_e,src);
  pickCheckerboard(Odd,src_o,src);
  std::cout<<GridLogMessage << "src_e"<<norm2(src_e)<<std::endl;
  std::cout<<GridLogMessage << "src_o"<<norm2(src_o)<<std::endl;
  std::cout << GridLogMessage<< "*********************************************************" <<std::endl;
  std::cout << GridLogMessage<< "* Benchmarking DomainWallFermionR::DhopEO                "<<std::endl;
  std::cout << GridLogMessage<< "* Vectorising space-time by "<<vComplex::Nsimd()<<std::endl;
  if ( sizeof(Real)==4 )   std::cout << GridLogMessage<< "* SINGLE precision "<<std::endl;
  if ( sizeof(Real)==8 )   std::cout << GridLogMessage<< "* DOUBLE precision "<<std::endl;
  if ( WilsonKernelsStatic::Opt == WilsonKernelsStatic::OptGeneric   ) std::cout << GridLogMessage<< "* Using GENERIC Nc WilsonKernels" <<std::endl;
  if ( WilsonKernelsStatic::Opt == WilsonKernelsStatic::OptHandUnroll) std::cout << GridLogMessage<< "* Using Nc=3       WilsonKernels" <<std::endl;
  if ( WilsonKernelsStatic::Opt == WilsonKernelsStatic::OptInlineAsm ) std::cout << GridLogMessage<< "* Using Asm Nc=3   WilsonKernels" <<std::endl;
  std::cout << GridLogMessage<< "*********************************************************" <<std::endl;
  {
    Dw.ZeroCounters();
    double t0=usecond();
    for(int i=0;i<ncall;i++){
      Dw.DhopEO(src_o,r_e,DaggerNo);
@@ -337,7 +373,8 @@ int main (int argc, char ** argv)
    double flops=(1344.0*volume*ncall)/2;
    std::cout<<GridLogMessage << "Deo mflop/s =   "<< flops/(t1-t0)<<std::endl;
-    std::cout<<GridLogMessage << "Deo mflop/s per node   "<< flops/(t1-t0)/NP<<std::endl;
+    std::cout<<GridLogMessage << "Deo mflop/s per rank   "<< flops/(t1-t0)/NP<<std::endl;
    Dw.Report();
  }
  Dw.DhopEO(src_o,r_e,DaggerNo);
  Dw.DhopOE(src_e,r_o,DaggerNo);
@@ -352,14 +389,14 @@ int main (int argc, char ** argv)
  err = r_eo-result; 
  std::cout<<GridLogMessage << "norm diff   "<< norm2(err)<<std::endl;
  assert(norm2(err)<1.0e-5);
  pickCheckerboard(Even,src_e,err);
  pickCheckerboard(Odd,src_o,err);
  std::cout<<GridLogMessage << "norm diff even  "<< norm2(src_e)<<std::endl;
  std::cout<<GridLogMessage << "norm diff odd   "<< norm2(src_o)<<std::endl;
-
+  assert(norm2(src_e)<1.0e-5);
-
+  assert(norm2(src_o)<1.0e-5);
  }
  Grid_finalize();
 }
--- a/benchmarks/Benchmark_dwf_ntpf
+++ b/benchmarks/Benchmark_dwf_ntpf
--- a/benchmarks/Benchmark_dwf_ntpf.cc
+++ b/benchmarks/Benchmark_dwf_ntpf.cc
@@ -1,153 +0,0 @@
    /*************************************************************************************
    Grid physics library, www.github.com/paboyle/Grid 
    Source file: ./benchmarks/Benchmark_dwf.cc
    Copyright (C) 2015
 Author: Peter Boyle <paboyle@ph.ed.ac.uk>
 Author: paboyle <paboyle@ph.ed.ac.uk>
    This program is free software; you can redistribute it and/or modify
    it under the terms of the GNU General Public License as published by
    the Free Software Foundation; either version 2 of the License, or
    (at your option) any later version.
    This program is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    GNU General Public License for more details.
    You should have received a copy of the GNU General Public License along
    with this program; if not, write to the Free Software Foundation, Inc.,
    51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
    See the full license in the file "LICENSE" in the top level distribution directory
    *************************************************************************************/
    /*  END LEGAL */
 #include <Grid/Grid.h>
 using namespace std;
 using namespace Grid;
 using namespace Grid::QCD;
 template<class d>
 struct scal {
  d internal;
 };
  Gamma::GammaMatrix Gmu [] = {
    Gamma::GammaX,
    Gamma::GammaY,
    Gamma::GammaZ,
    Gamma::GammaT
  };
 bool overlapComms = false;
 int main (int argc, char ** argv)
 {
  Grid_init(&argc,&argv);
  if( GridCmdOptionExists(argv,argv+argc,"--asynch") ){
    overlapComms = true;
  }
  int threads = GridThread::GetThreads();
  std::cout<<GridLogMessage << "Grid is setup to use "<<threads<<" threads"<<std::endl;
  std::vector<int> latt4 = GridDefaultLatt();
  const int Ls=16;
  GridCartesian         * UGrid   = SpaceTimeGrid::makeFourDimGrid(GridDefaultLatt(), GridDefaultSimd(Nd,vComplex::Nsimd()),GridDefaultMpi());
  GridRedBlackCartesian * UrbGrid = SpaceTimeGrid::makeFourDimRedBlackGrid(UGrid);
  GridCartesian         * FGrid   = SpaceTimeGrid::makeFiveDimGrid(Ls,UGrid);
  GridRedBlackCartesian * FrbGrid = SpaceTimeGrid::makeFiveDimRedBlackGrid(Ls,UGrid);
  std::vector<int> seeds4({1,2,3,4});
  std::vector<int> seeds5({5,6,7,8});
  GridParallelRNG          RNG4(UGrid);  RNG4.SeedFixedIntegers(seeds4);
  GridParallelRNG          RNG5(FGrid);  RNG5.SeedFixedIntegers(seeds5);
  LatticeFermion src   (FGrid); random(RNG5,src);
  LatticeFermion result(FGrid); result=zero;
  LatticeFermion    ref(FGrid);    ref=zero;
  LatticeFermion    tmp(FGrid);
  LatticeFermion    err(FGrid);
  ColourMatrix cm = Complex(1.0,0.0);
  LatticeGaugeField Umu(UGrid); 
  random(RNG4,Umu);
  LatticeGaugeField Umu5d(FGrid); 
  // replicate across fifth dimension
  for(int ss=0;ss<Umu._grid->oSites();ss++){
    for(int s=0;s<Ls;s++){
      Umu5d._odata[Ls*ss+s] = Umu._odata[ss];
    }
  }
  ////////////////////////////////////
  // Naive wilson implementation
  ////////////////////////////////////
  std::vector<LatticeColourMatrix> U(4,FGrid);
  for(int mu=0;mu<Nd;mu++){
    U[mu] = PeekIndex<LorentzIndex>(Umu5d,mu);
  }
  if (1)
  {
    ref = zero;
    for(int mu=0;mu<Nd;mu++){
      tmp = U[mu]*Cshift(src,mu+1,1);
      ref=ref + tmp - Gamma(Gmu[mu])*tmp;
      tmp =adj(U[mu])*src;
      tmp =Cshift(tmp,mu+1,-1);
      ref=ref + tmp + Gamma(Gmu[mu])*tmp;
    }
    ref = -0.5*ref;
  }
  RealD mass=0.1;
  RealD M5  =1.8;
  typename DomainWallFermionR::ImplParams params; 
  params.overlapCommsCompute = overlapComms;
  RealD NP = UGrid->_Nprocessors;
  QCD::WilsonKernelsStatic::AsmOpt=1;
  DomainWallFermionR Dw(Umu,*FGrid,*FrbGrid,*UGrid,*UrbGrid,mass,M5,params);
  std::cout<<GridLogMessage << "Calling Dw"<<std::endl;
  int ncall =50;
  if (1) {
    double t0=usecond();
    for(int i=0;i<ncall;i++){
      Dw.Dhop(src,result,0);
    }
    double t1=usecond();
    double volume=Ls;  for(int mu=0;mu<Nd;mu++) volume=volume*latt4[mu];
    double flops=1344*volume*ncall;
    std::cout<<GridLogMessage << "Called Dw "<<ncall<<" times in "<<t1-t0<<" us"<<std::endl;
    std::cout<<GridLogMessage << "norm result "<< norm2(result)<<std::endl;
    std::cout<<GridLogMessage << "norm ref    "<< norm2(ref)<<std::endl;
    std::cout<<GridLogMessage << "mflop/s =   "<< flops/(t1-t0)<<std::endl;
    std::cout<<GridLogMessage << "mflop/s per node =  "<< flops/(t1-t0)/NP<<std::endl;
    err = ref-result; 
    std::cout<<GridLogMessage << "norm diff   "<< norm2(err)<<std::endl;
    //    Dw.Report();
  }
  Grid_finalize();
 }
--- a/benchmarks/Benchmark_dwf_sweep.cc
+++ b/benchmarks/Benchmark_dwf_sweep.cc
@@ -51,16 +51,18 @@ int main (int argc, char ** argv)
 {
  Grid_init(&argc,&argv);
-  const int Ls=16;
+  std::cout << GridLogMessage<< "*****************************************************************" <<std::endl;
  std::cout << GridLogMessage<< "* Kernel options --dslash-generic, --dslash-unroll, --dslash-asm" <<std::endl;
  std::cout << GridLogMessage<< "*****************************************************************" <<std::endl;
  if ( WilsonKernelsStatic::Opt == WilsonKernelsStatic::OptGeneric   ) std::cout << GridLogMessage<< "* Using GENERIC Nc WilsonKernels" <<std::endl;
  if ( WilsonKernelsStatic::Opt == WilsonKernelsStatic::OptHandUnroll) std::cout << GridLogMessage<< "* Using Nc=3       WilsonKernels" <<std::endl;
  if ( WilsonKernelsStatic::Opt == WilsonKernelsStatic::OptInlineAsm ) std::cout << GridLogMessage<< "* Using Asm Nc=3   WilsonKernels" <<std::endl;
  std::cout << GridLogMessage<< "*****************************************************************" <<std::endl;
  const int Ls=8;
  int threads = GridThread::GetThreads();
  std::cout<<GridLogMessage << "Grid is setup to use "<<threads<<" threads"<<std::endl;
  if ( getenv("ASMOPT") )  {
    QCD::WilsonKernelsStatic::AsmOpt=1;
  } else { 
    QCD::WilsonKernelsStatic::AsmOpt=0;
  }
  std::cout<<GridLogMessage << "=========================================================================="<<std::endl;
  std::cout<<GridLogMessage << "= Benchmarking DWF"<<std::endl;
  std::cout<<GridLogMessage << "=========================================================================="<<std::endl;
--- a/benchmarks/Benchmark_wilson_sweep.cc
+++ b/benchmarks/Benchmark_wilson_sweep.cc
@@ -58,6 +58,19 @@ int main (int argc, char ** argv)
  std::vector<int> seeds({1,2,3,4});
  RealD mass = 0.1;
  std::cout << GridLogMessage<< "*****************************************************************" <<std::endl;
  std::cout << GridLogMessage<< "* Kernel options --dslash-generic, --dslash-unroll, --dslash-asm" <<std::endl;
  std::cout << GridLogMessage<< "*****************************************************************" <<std::endl;
  std::cout << GridLogMessage<< "*****************************************************************" <<std::endl;
  std::cout << GridLogMessage<< "* Benchmarking WilsonFermionR::Dhop                  "<<std::endl;
  std::cout << GridLogMessage<< "* Vectorising space-time by "<<vComplex::Nsimd()<<std::endl;
  if ( sizeof(Real)==4 )   std::cout << GridLogMessage<< "* SINGLE precision "<<std::endl;
  if ( sizeof(Real)==8 )   std::cout << GridLogMessage<< "* DOUBLE precision "<<std::endl;
  if ( WilsonKernelsStatic::Opt == WilsonKernelsStatic::OptGeneric   ) std::cout << GridLogMessage<< "* Using GENERIC Nc WilsonKernels" <<std::endl;
  if ( WilsonKernelsStatic::Opt == WilsonKernelsStatic::OptHandUnroll) std::cout << GridLogMessage<< "* Using Nc=3       WilsonKernels" <<std::endl;
  if ( WilsonKernelsStatic::Opt == WilsonKernelsStatic::OptInlineAsm ) std::cout << GridLogMessage<< "* Using Asm Nc=3   WilsonKernels" <<std::endl;
  std::cout << GridLogMessage<< "*****************************************************************" <<std::endl;
  std::cout<<GridLogMessage << "============================================================================="<< std::endl;
  std::cout<<GridLogMessage << "= Benchmarking Wilson" << std::endl;
  std::cout<<GridLogMessage << "============================================================================="<< std::endl;
--- a/benchmarks/Benchmark_zmm
+++ b/benchmarks/Benchmark_zmm
--- a/benchmarks/Benchmark_zmm.cc
+++ b/benchmarks/Benchmark_zmm.cc
@@ -1,175 +0,0 @@
    /*************************************************************************************
    Grid physics library, www.github.com/paboyle/Grid 
    Source file: ./tests/Test_zmm.cc
    Copyright (C) 2015
 Author: paboyle <paboyle@ph.ed.ac.uk>
    This program is free software; you can redistribute it and/or modify
    it under the terms of the GNU General Public License as published by
    the Free Software Foundation; either version 2 of the License, or
    (at your option) any later version.
    This program is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    GNU General Public License for more details.
    You should have received a copy of the GNU General Public License along
    with this program; if not, write to the Free Software Foundation, Inc.,
    51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
    See the full license in the file "LICENSE" in the top level distribution directory
    *************************************************************************************/
    /*  END LEGAL */
 #include <Grid/Grid.h>
 using namespace Grid;
 using namespace Grid::QCD;
 int bench(std::ofstream &os, std::vector<int> &latt4,int Ls);
 int main(int argc,char **argv)
 {
  Grid_init(&argc,&argv);
  std::ofstream os("zmm.dat");
  os << "#V Ls Lxy Lzt C++ Asm OMP L1 " <<std::endl;
  std::cout<<GridLogMessage << "====================================================================="<<std::endl;
  std::cout<<GridLogMessage << "= Benchmarking ZMM"<<std::endl;
  std::cout<<GridLogMessage << "====================================================================="<<std::endl;
  std::cout<<GridLogMessage << "Volume \t\t\t\tC++DW/MFLOPs\tASM-DW/MFLOPs\tdiff"<<std::endl;
  std::cout<<GridLogMessage << "====================================================================="<<std::endl;
  for(int L=4;L<=32;L+=4){
    for(int m=1;m<=2;m++){
      for(int Ls=8;Ls<=16;Ls+=8){
 	std::vector<int> grid({L,L,m*L,m*L});
  std::cout << GridLogMessage <<"\t";
 	for(int i=0;i<4;i++) { 
 	  std::cout << grid[i]<<"x";
 	}
 	std::cout << Ls<<"\t\t";
 	bench(os,grid,Ls);
      }
    }
  }
 }
 int bench(std::ofstream &os, std::vector<int> &latt4,int Ls)
 {
  GridCartesian         * UGrid   = SpaceTimeGrid::makeFourDimGrid(latt4, GridDefaultSimd(Nd,vComplex::Nsimd()),GridDefaultMpi());
  GridRedBlackCartesian * UrbGrid = SpaceTimeGrid::makeFourDimRedBlackGrid(UGrid);
  GridCartesian         * FGrid   = SpaceTimeGrid::makeFiveDimGrid(Ls,UGrid);
  GridRedBlackCartesian * FrbGrid = SpaceTimeGrid::makeFiveDimRedBlackGrid(Ls,UGrid);
  std::vector<int> simd_layout = GridDefaultSimd(Nd,vComplex::Nsimd());
  std::vector<int> mpi_layout  = GridDefaultMpi();
  int threads = GridThread::GetThreads();
  std::vector<int> seeds4({1,2,3,4});
  std::vector<int> seeds5({5,6,7,8});
  GridSerialRNG sRNG; sRNG.SeedFixedIntegers(seeds4);
  LatticeFermion src (FGrid);
  LatticeFermion tmp (FGrid);
  LatticeFermion srce(FrbGrid);
  LatticeFermion resulto(FrbGrid); resulto=zero;
  LatticeFermion resulta(FrbGrid); resulta=zero;
  LatticeFermion junk(FrbGrid); junk=zero;
  LatticeFermion diff(FrbGrid); 
  LatticeGaugeField Umu(UGrid);
  double mfc, mfa, mfo, mfl1;
  GridParallelRNG          RNG4(UGrid);  RNG4.SeedFixedIntegers(seeds4);
  GridParallelRNG          RNG5(FGrid);  RNG5.SeedFixedIntegers(seeds5);
  random(RNG5,src);
 #if 1
  random(RNG4,Umu);
 #else
  int mmu=2;
  std::vector<LatticeColourMatrix> U(4,UGrid);
  for(int mu=0;mu<Nd;mu++){
    U[mu] = PeekIndex<LorentzIndex>(Umu,mu);
    if ( mu!=mmu ) U[mu] = zero;
    if ( mu==mmu ) U[mu] = 1.0;
    PokeIndex<LorentzIndex>(Umu,U[mu],mu);
  }
 #endif
 pickCheckerboard(Even,srce,src);
  RealD mass=0.1;
  RealD M5  =1.8;
  DomainWallFermionR Dw(Umu,*FGrid,*FrbGrid,*UGrid,*UrbGrid,mass,M5);
  int ncall=50;
  double t0=usecond();
  for(int i=0;i<ncall;i++){
    Dw.DhopOE(srce,resulto,0);
  }
  double t1=usecond();
  double volume=Ls;  for(int mu=0;mu<Nd;mu++) volume=volume*latt4[mu];
  double flops=1344*volume/2;
  mfc = flops*ncall/(t1-t0);
  std::cout<<mfc<<"\t\t";
  QCD::WilsonKernelsStatic::AsmOpt=1;
  t0=usecond();
  for(int i=0;i<ncall;i++){
    Dw.DhopOE(srce,resulta,0);
  }
  t1=usecond();
  mfa = flops*ncall/(t1-t0);
  std::cout<<mfa<<"\t\t";
  /*
  int dag=DaggerNo;
  t0=usecond();
  for(int i=0;i<1;i++){
    Dw.DhopInternalOMPbench(Dw.StencilEven,Dw.LebesgueEvenOdd,Dw.UmuOdd,srce,resulta,dag);
  }
  t1=usecond();
  mfo = flops*100/(t1-t0);
  std::cout<<GridLogMessage << "Called ASM-OMP Dw"<< " mflop/s =   "<< mfo<<std::endl;
  t0=usecond();
  for(int i=0;i<1;i++){
    Dw.DhopInternalL1bench(Dw.StencilEven,Dw.LebesgueEvenOdd,Dw.UmuOdd,srce,resulta,dag);
  }
  t1=usecond();
  mfl1= flops*100/(t1-t0);
  std::cout<<GridLogMessage << "Called ASM-L1 Dw"<< " mflop/s =   "<< mfl1<<std::endl;
  os << latt4[0]*latt4[1]*latt4[2]*latt4[3]<< " "<<Ls<<" "<< latt4[0] <<" " <<latt4[2]<< " "
     << mfc<<" "
     << mfa<<" "
     << mfo<<" "
     << mfl1<<std::endl;
  */
 #if 0
  for(int i=0;i< PerformanceCounter::NumTypes(); i++ ){
    Dw.DhopOE(srce,resulta,0);
    PerformanceCounter Counter(i);
    Counter.Start();
    Dw.DhopOE(srce,resulta,0);
    Counter.Stop();
    Counter.Report();
  }
 #endif
  //resulta = (-0.5) * resulta;
  diff = resulto-resulta;
  std::cout<<norm2(diff)<<std::endl;
  return 0;
 }
--- a/bootstrap.sh
+++ b/bootstrap.sh
@@ -1,18 +1,12 @@
 #!/usr/bin/env bash
 EIGEN_URL='http://bitbucket.org/eigen/eigen/get/3.2.9.tar.bz2'
 FFTW_URL=http://www.fftw.org/fftw-3.3.4.tar.gz
 echo "-- deploying Eigen source..."
 wget ${EIGEN_URL} --no-check-certificate
 ./scripts/update_eigen.sh `basename ${EIGEN_URL}`
 rm `basename ${EIGEN_URL}`
 echo "-- copying fftw prototypes..."
 wget ${FFTW_URL}
 ./scripts/update_fftw.sh `basename ${FFTW_URL}`
 rm `basename ${FFTW_URL}`
 echo '-- generating Make.inc files...'
 ./scripts/filelist
 echo '-- generating configure script...'
--- a/configure.ac
+++ b/configure.ac
@@ -1,5 +1,8 @@
 AC_PREREQ([2.63])
-AC_INIT([Grid], [0.5.1-dev], [https://github.com/paboyle/Grid], [Grid])
+AC_INIT([Grid], [0.6.0], [https://github.com/paboyle/Grid], [Grid])
 AC_CANONICAL_BUILD
 AC_CANONICAL_HOST
 AC_CANONICAL_TARGET
 AM_INIT_AUTOMAKE(subdir-objects)
 AC_CONFIG_MACRO_DIR([m4])
 AC_CONFIG_SRCDIR([lib/Grid.h])
@@ -7,20 +10,32 @@ AC_CONFIG_HEADERS([lib/Config.h])
 m4_ifdef([AM_SILENT_RULES], [AM_SILENT_RULES([yes])])
 ############### Checks for programs
 AC_LANG(C++)
 CXXFLAGS="-O3 $CXXFLAGS"
 AC_PROG_CXX
 AC_PROG_RANLIB
-############ openmp  ###############
+############### Get compiler information
 AC_LANG([C++])
 AX_CXX_COMPILE_STDCXX_11([noext],[mandatory])
 AX_COMPILER_VENDOR
 AC_DEFINE_UNQUOTED([CXX_COMP_VENDOR],["$ax_cv_cxx_compiler_vendor"],
      [vendor of C++ compiler that will compile the code])
 AX_GXX_VERSION
 AC_DEFINE_UNQUOTED([GXX_VERSION],["$GXX_VERSION"],
      [version of g++ that will compile the code])
 ############### Checks for typedefs, structures, and compiler characteristics
 AC_TYPE_SIZE_T
 AC_TYPE_UINT32_T
 AC_TYPE_UINT64_T
 ############### OpenMP 
 AC_OPENMP
 ac_openmp=no
 if test "${OPENMP_CXXFLAGS}X" != "X"; then
-ac_openmp=yes
+  ac_openmp=yes
-AM_CXXFLAGS="$OPENMP_CXXFLAGS $AM_CXXFLAGS"
+  AM_CXXFLAGS="$OPENMP_CXXFLAGS $AM_CXXFLAGS"
-AM_LDFLAGS="$OPENMP_CXXFLAGS $AM_LDFLAGS"
+  AM_LDFLAGS="$OPENMP_CXXFLAGS $AM_LDFLAGS"
 fi
 ############### Checks for header files
@@ -33,27 +48,29 @@ AC_CHECK_HEADERS(execinfo.h)
 AC_CHECK_DECLS([ntohll],[], [], [[#include <arpa/inet.h>]])
 AC_CHECK_DECLS([be64toh],[], [], [[#include <arpa/inet.h>]])
-############### Checks for typedefs, structures, and compiler characteristics
+############### GMP and MPFR
 AC_TYPE_SIZE_T
 AC_TYPE_UINT32_T
 AC_TYPE_UINT64_T
 ############### GMP and MPFR #################
 AC_ARG_WITH([gmp],
    [AS_HELP_STRING([--with-gmp=prefix],
    [try this for a non-standard install prefix of the GMP library])],
    [AM_CXXFLAGS="-I$with_gmp/include $AM_CXXFLAGS"]
-    [AM_LDFLAGS="-L$with_gmp/lib" $AM_LDFLAGS])
+    [AM_LDFLAGS="-L$with_gmp/lib $AM_LDFLAGS"])
 AC_ARG_WITH([mpfr],
    [AS_HELP_STRING([--with-mpfr=prefix],
    [try this for a non-standard install prefix of the MPFR library])],
    [AM_CXXFLAGS="-I$with_mpfr/include $AM_CXXFLAGS"]
    [AM_LDFLAGS="-L$with_mpfr/lib $AM_LDFLAGS"])
-################## lapack ####################
+############### FFTW3 
 AC_ARG_WITH([fftw],    
            [AS_HELP_STRING([--with-fftw=prefix],
            [try this for a non-standard install prefix of the FFTW3 library])],
            [AM_CXXFLAGS="-I$with_fftw/include $AM_CXXFLAGS"]
            [AM_LDFLAGS="-L$with_fftw/lib $AM_LDFLAGS"])
 ############### lapack 
 AC_ARG_ENABLE([lapack],
    [AC_HELP_STRING([--enable-lapack=yes|no|prefix], [enable LAPACK])], 
-    [ac_LAPACK=${enable_lapack}],[ac_LAPACK=no])
+    [ac_LAPACK=${enable_lapack}], [ac_LAPACK=no])
 case ${ac_LAPACK} in
    no)
@@ -63,59 +80,77 @@ case ${ac_LAPACK} in
    *)
        AM_CXXFLAGS="-I$ac_LAPACK/include $AM_CXXFLAGS"
        AM_LDFLAGS="-L$ac_LAPACK/lib $AM_LDFLAGS"
-        AC_DEFINE([USE_LAPACK],[1],[use LAPACK])
+        AC_DEFINE([USE_LAPACK],[1],[use LAPACK]);;
 esac
-################## FFTW3 ####################
+############### MKL
-AC_ARG_WITH([fftw],    
+AC_ARG_ENABLE([mkl],
-            [AS_HELP_STRING([--with-fftw=prefix],
+    [AC_HELP_STRING([--enable-mkl=yes|no|prefix], [enable Intel MKL for LAPACK & FFTW])],
-            [try this for a non-standard install prefix of the FFTW3 library])],
+    [ac_MKL=${enable_mkl}], [ac_MKL=no])
            [AM_CXXFLAGS="-I$with_fftw/include $AM_CXXFLAGS"]
            [AM_LDFLAGS="-L$with_fftw/lib $AM_LDFLAGS"])
-################ Get compiler informations
+case ${ac_MKL} in
-AC_LANG([C++])
+    no)
-AX_CXX_COMPILE_STDCXX_11([noext],[mandatory])
+        ;;
-AX_COMPILER_VENDOR
+    yes)
-AC_DEFINE_UNQUOTED([CXX_COMP_VENDOR],["$ax_cv_cxx_compiler_vendor"],
+        AC_DEFINE([USE_MKL], [1], [Define to 1 if you use the Intel MKL]);;
-      [vendor of C++ compiler that will compile the code])
+    *)
-AX_GXX_VERSION
+        AM_CXXFLAGS="-I$ac_MKL/include $AM_CXXFLAGS"
-AC_DEFINE_UNQUOTED([GXX_VERSION],["$GXX_VERSION"],
+        AM_LDFLAGS="-L$ac_MKL/lib $AM_LDFLAGS"
-      [version of g++ that will compile the code])
+        AC_DEFINE([USE_MKL], [1], [Define to 1 if you use the Intel MKL]);;
 esac
 ############### first-touch
 AC_ARG_ENABLE([numa],
    [AC_HELP_STRING([--enable-numa=yes|no|prefix], [enable first touch numa opt])], 
    [ac_NUMA=${enable_numa}],[ac_NUMA=no])
 case ${ac_NUMA} in
    no)
        ;;
    yes)
        AC_DEFINE([GRID_NUMA],[1],[First touch numa locality]);;
    *)
        AC_DEFINE([GRID_NUMA],[1],[First touch numa locality]);;
 esac
 ############### Checks for library functions
 CXXFLAGS_CPY=$CXXFLAGS
 LDFLAGS_CPY=$LDFLAGS
 CXXFLAGS="$AM_CXXFLAGS $CXXFLAGS"
 LDFLAGS="$AM_LDFLAGS $LDFLAGS"
 AC_CHECK_FUNCS([gettimeofday])
-AC_CHECK_LIB([gmp],[__gmpf_init],
+
-             [AC_CHECK_LIB([mpfr],[mpfr_init],
+if test "${ac_MKL}x" != "nox"; then
-                 [AC_DEFINE([HAVE_LIBMPFR], [1], [Define to 1 if you have the `MPFR' library (-lmpfr).])]
+    AC_SEARCH_LIBS([mkl_set_interface_layer], [mkl_rt], [],
-                 [have_mpfr=true]
+                   [AC_MSG_ERROR("MKL enabled but library not found")])
-                 [LIBS="$LIBS -lmpfr"],
+fi
-                 [AC_MSG_ERROR([MPFR library not found])])]
+
-   	     [AC_DEFINE([HAVE_LIBGMP], [1], [Define to 1 if you have the `GMP' library (-lgmp).])]
+AC_SEARCH_LIBS([__gmpf_init], [gmp],
-             [have_gmp=true]
+               [AC_SEARCH_LIBS([mpfr_init], [mpfr], 
-             [LIBS="$LIBS -lgmp"],
+                               [AC_DEFINE([HAVE_LIBMPFR], [1], 
-             [AC_MSG_WARN([**** GMP library not found, Grid can still compile but RHMC will not work ****])])
+                                          [Define to 1 if you have the `MPFR' library])]
                               [have_mpfr=true], [AC_MSG_ERROR([MPFR library not found])])]
               [AC_DEFINE([HAVE_LIBGMP], [1], [Define to 1 if you have the `GMP' library])]
               [have_gmp=true])
 if test "${ac_LAPACK}x" != "nox"; then
-    AC_CHECK_LIB([lapack],[LAPACKE_sbdsdc],[],
+    AC_SEARCH_LIBS([LAPACKE_sbdsdc], [lapack], [],
-                 [AC_MSG_ERROR("LAPACK enabled but library not found")])
+                   [AC_MSG_ERROR("LAPACK enabled but library not found")])
 fi   
-AC_CHECK_LIB([fftw3],[fftw_execute],
+
-  [AC_DEFINE([HAVE_FFTW],[1],[Define to 1 if you have the `FFTW' library (-lfftw3).])]
+AC_SEARCH_LIBS([fftw_execute], [fftw3],
-  [have_fftw=true]
+               [AC_SEARCH_LIBS([fftwf_execute], [fftw3f], [],
-  [LIBS="$LIBS -lfftw3 -lfftw3f"],
+                               [AC_MSG_ERROR("single precision FFTW library not found")])]
-  [AC_MSG_WARN([**** FFTW library not found, Grid can still compile but FFT-based routines will not work ****])])
+               [AC_DEFINE([HAVE_FFTW], [1], [Define to 1 if you have the `FFTW' library])]
               [have_fftw=true])
 CXXFLAGS=$CXXFLAGS_CPY
 LDFLAGS=$LDFLAGS_CPY
 ############### SIMD instruction selection
-AC_ARG_ENABLE([simd],[AC_HELP_STRING([--enable-simd=SSE4|AVX|AVXFMA4|AVX2|AVX512|AVX512MIC|IMCI|KNL|KNC],\
+AC_ARG_ENABLE([simd],[AC_HELP_STRING([--enable-simd=<code>],
-	[Select instructions to be SSE4.0, AVX 1.0, AVX 2.0+FMA, AVX 512, IMCI])],\
+	            [select SIMD target (cf. README.md)])], [ac_SIMD=${enable_simd}], [ac_SIMD=GEN])
 	[ac_SIMD=${enable_simd}],[ac_SIMD=GEN])
 case ${ax_cv_cxx_compiler_vendor} in
  clang|gnu)
@@ -129,15 +164,21 @@ case ${ax_cv_cxx_compiler_vendor} in
      AVXFMA4)
        AC_DEFINE([AVXFMA4],[1],[AVX intrinsics with FMA4])
        SIMD_FLAGS='-mavx -mfma4';;
      AVXFMA)
        AC_DEFINE([AVXFMA],[1],[AVX intrinsics with FMA3])
        SIMD_FLAGS='-mavx -mfma';;
      AVX2)
        AC_DEFINE([AVX2],[1],[AVX2 intrinsics])
        SIMD_FLAGS='-mavx2 -mfma';;
-      AVX512|AVX512MIC|KNL)
+      AVX512)
        AC_DEFINE([AVX512],[1],[AVX512 intrinsics])
        SIMD_FLAGS='-mavx512f -mavx512pf -mavx512er -mavx512cd';;
-      IMCI|KNC)
+      KNC)
        AC_DEFINE([IMCI],[1],[IMCI intrinsics for Knights Corner])
        SIMD_FLAGS='';;
      KNL)
        AC_DEFINE([AVX512],[1],[AVX512 intrinsics])
        SIMD_FLAGS='-march=knl';;
      GEN)
        AC_DEFINE([GENERIC_VEC],[1],[generic vector code])
        SIMD_FLAGS='';;
@@ -155,21 +196,21 @@ case ${ax_cv_cxx_compiler_vendor} in
      AVX)
        AC_DEFINE([AVX1],[1],[AVX intrinsics])
        SIMD_FLAGS='-mavx -xavx';;
-      AVXFMA4)
+      AVXFMA)
-        AC_DEFINE([AVXFMA4],[1],[AVX intrinsics with FMA4])
+        AC_DEFINE([AVXFMA],[1],[AVX intrinsics with FMA3])
-        SIMD_FLAGS='-mavx -xavx -mfma';;
+        SIMD_FLAGS='-mavx -mfma';;
      AVX2)
        AC_DEFINE([AVX2],[1],[AVX2 intrinsics])
        SIMD_FLAGS='-march=core-avx2 -xcore-avx2';;
      AVX512)
        AC_DEFINE([AVX512],[1],[AVX512 intrinsics])
        SIMD_FLAGS='-xcore-avx512';;
-      AVX512MIC|KNL)
+      KNC)
        AC_DEFINE([AVX512],[1],[AVX512 intrinsics for Knights Landing])
        SIMD_FLAGS='-xmic-avx512';;
      IMCI|KNC)
        AC_DEFINE([IMCI],[1],[IMCI Intrinsics for Knights Corner])
        SIMD_FLAGS='';;
      KNL)
        AC_DEFINE([AVX512],[1],[AVX512 intrinsics for Knights Landing])
        SIMD_FLAGS='-xmic-avx512';;
      GEN)
        AC_DEFINE([GENERIC_VEC],[1],[generic vector code])
        SIMD_FLAGS='';;
@@ -184,14 +225,18 @@ AM_CXXFLAGS="$SIMD_FLAGS $AM_CXXFLAGS"
 AM_CFLAGS="$SIMD_FLAGS $AM_CFLAGS"
 case ${ac_SIMD} in
-  AVX512|AVX512MIC|KNL)
+  AVX512|KNL)
    AC_DEFINE([TEST_ZMM],[1],[compile ZMM test]);;
  *)
 	;;
 esac
-############### precision selection
+############### Precision selection
-AC_ARG_ENABLE([precision],[AC_HELP_STRING([--enable-precision=single|double],[Select default word size of Real])],[ac_PRECISION=${enable_precision}],[ac_PRECISION=double])
+AC_ARG_ENABLE([precision],
              [AC_HELP_STRING([--enable-precision=single|double],
                              [Select default word size of Real])],
              [ac_PRECISION=${enable_precision}],[ac_PRECISION=double])
 case ${ac_PRECISION} in
     single)
       AC_DEFINE([GRID_DEFAULT_PRECISION_SINGLE],[1],[GRID_DEFAULT_PRECISION is SINGLE] )
@@ -202,39 +247,56 @@ case ${ac_PRECISION} in
 esac
 ############### communication type selection
-AC_ARG_ENABLE([comms],[AC_HELP_STRING([--enable-comms=none|mpi|mpi-auto|shmem],[Select communications])],[ac_COMMS=${enable_comms}],[ac_COMMS=none])
+AC_ARG_ENABLE([comms],[AC_HELP_STRING([--enable-comms=none|mpi|mpi-auto|mpi3|mpi3-auto|shmem],
              [Select communications])],[ac_COMMS=${enable_comms}],[ac_COMMS=none])
 case ${ac_COMMS} in
     none)
-       AC_DEFINE([GRID_COMMS_NONE],[1],[GRID_COMMS_NONE] )
+        AC_DEFINE([GRID_COMMS_NONE],[1],[GRID_COMMS_NONE] )
        comms_type='none'
     ;;
-     mpi-auto)
+     mpi3l*)
-       AC_DEFINE([GRID_COMMS_MPI],[1],[GRID_COMMS_MPI] )
+       AC_DEFINE([GRID_COMMS_MPI3L],[1],[GRID_COMMS_MPI3L] )
-       LX_FIND_MPI
+       comms_type='mpi3l'
       if test "x$have_CXX_mpi" = 'xno'; then AC_MSG_ERROR(["MPI not found"]); fi
       AM_CXXFLAGS="$MPI_CXXFLAGS $AM_CXXFLAGS"
       AM_CFLAGS="$MPI_CFLAGS $AM_CFLAGS"
       AM_LDFLAGS="`echo $MPI_CXXLDFLAGS | sed -E 's/-l@<:@^ @:>@+//g'` $AM_LDFLAGS"
       LIBS="`echo $MPI_CXXLDFLAGS | sed -E 's/-L@<:@^ @:>@+//g'` $LIBS"
     ;;
-     mpi)
+     mpi3*)
-       AC_DEFINE([GRID_COMMS_MPI],[1],[GRID_COMMS_MPI] )
+        AC_DEFINE([GRID_COMMS_MPI3],[1],[GRID_COMMS_MPI3] )
        comms_type='mpi3'
     ;;
     mpi*)
        AC_DEFINE([GRID_COMMS_MPI],[1],[GRID_COMMS_MPI] )
        comms_type='mpi'
     ;;
     shmem)
-       AC_DEFINE([GRID_COMMS_SHMEM],[1],[GRID_COMMS_SHMEM] )
+        AC_DEFINE([GRID_COMMS_SHMEM],[1],[GRID_COMMS_SHMEM] )
        comms_type='shmem'
     ;;
     *)
-     AC_MSG_ERROR([${ac_COMMS} unsupported --enable-comms option]); 
+        AC_MSG_ERROR([${ac_COMMS} unsupported --enable-comms option]); 
     ;;
 esac
-AM_CONDITIONAL(BUILD_COMMS_SHMEM,[ test "X${ac_COMMS}X" == "XshmemX" ])
+case ${ac_COMMS} in
-AM_CONDITIONAL(BUILD_COMMS_MPI,[ test "X${ac_COMMS}X" == "XmpiX" || test "X${ac_COMMS}X" == "Xmpi-autoX" ])
+    *-auto)
-AM_CONDITIONAL(BUILD_COMMS_NONE,[ test "X${ac_COMMS}X" == "XnoneX" ])
+        LX_FIND_MPI
        if test "x$have_CXX_mpi" = 'xno'; then AC_MSG_ERROR(["MPI not found"]); fi
        AM_CXXFLAGS="$MPI_CXXFLAGS $AM_CXXFLAGS"
        AM_CFLAGS="$MPI_CFLAGS $AM_CFLAGS"
        AM_LDFLAGS="`echo $MPI_CXXLDFLAGS | sed -E 's/-l@<:@^ @:>@+//g'` $AM_LDFLAGS"
        LIBS="`echo $MPI_CXXLDFLAGS | sed -E 's/-L@<:@^ @:>@+//g'` $LIBS";;
    *)
        ;;
 esac
 AM_CONDITIONAL(BUILD_COMMS_SHMEM, [ test "${comms_type}X" == "shmemX" ])
 AM_CONDITIONAL(BUILD_COMMS_MPI,   [ test "${comms_type}X" == "mpiX" ])
 AM_CONDITIONAL(BUILD_COMMS_MPI3,  [ test "${comms_type}X" == "mpi3X" ] )
 AM_CONDITIONAL(BUILD_COMMS_MPI3L, [ test "${comms_type}X" == "mpi3lX" ] )
 AM_CONDITIONAL(BUILD_COMMS_NONE,  [ test "${comms_type}X" == "noneX" ])
 ############### RNG selection
 AC_ARG_ENABLE([rng],[AC_HELP_STRING([--enable-rng=ranlux48|mt19937],\
-	[Select Random Number Generator to be used])],\
+	            [Select Random Number Generator to be used])],\
-	[ac_RNG=${enable_rng}],[ac_RNG=ranlux48])
+	            [ac_RNG=${enable_rng}],[ac_RNG=ranlux48])
 case ${ac_RNG} in
     ranlux48)
@@ -248,10 +310,11 @@ case ${ac_RNG} in
     ;;
 esac
-############### timer option
+############### Timer option
 AC_ARG_ENABLE([timers],[AC_HELP_STRING([--enable-timers],\
-	[Enable system dependent high res timers])],\
+	            [Enable system dependent high res timers])],\
-	[ac_TIMERS=${enable_timers}],[ac_TIMERS=yes])
+	            [ac_TIMERS=${enable_timers}],[ac_TIMERS=yes])
 case ${ac_TIMERS} in
     yes)
      AC_DEFINE([TIMERS_ON],[1],[TIMERS_ON] )
@@ -265,7 +328,9 @@ case ${ac_TIMERS} in
 esac
 ############### Chroma regression test
-AC_ARG_ENABLE([chroma],[AC_HELP_STRING([--enable-chroma],[Expect chroma compiled under c++11 ])],ac_CHROMA=yes,ac_CHROMA=no)
+AC_ARG_ENABLE([chroma],[AC_HELP_STRING([--enable-chroma],
              [Expect chroma compiled under c++11 ])],ac_CHROMA=yes,ac_CHROMA=no)
 case ${ac_CHROMA} in
     yes|no)
     ;;
@@ -273,6 +338,7 @@ case ${ac_CHROMA} in
       AC_MSG_ERROR([${ac_CHROMA} unsupported --enable-chroma option]); 
     ;;
 esac
 AM_CONDITIONAL(BUILD_CHROMA_REGRESSION,[ test "X${ac_CHROMA}X" == "XyesX" ])
 ############### Doxygen
@@ -306,35 +372,36 @@ AC_CONFIG_FILES(programs/Makefile)
 AC_CONFIG_FILES(programs/Hadrons/Makefile)
 AC_OUTPUT
-echo "
+echo "~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 Summary of configuration for $PACKAGE v$VERSION
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 ----- PLATFORM ----------------------------------------
- architecture (build)          : $build_cpu
+architecture (build)        : $build_cpu
- os (build)                    : $build_os
+os (build)                  : $build_os
- architecture (target)         : $target_cpu
+architecture (target)       : $target_cpu
- os (target)                   : $target_os
+os (target)                 : $target_os
- compiler vendor               : ${ax_cv_cxx_compiler_vendor}
+compiler vendor             : ${ax_cv_cxx_compiler_vendor}
- compiler version              : ${ax_cv_gxx_version}
+compiler version            : ${ax_cv_gxx_version}
 ----- BUILD OPTIONS -----------------------------------
- SIMD                          : ${ac_SIMD}
+SIMD                        : ${ac_SIMD}
- Threading                     : ${ac_openmp} 
+Threading                   : ${ac_openmp} 
- Communications type           : ${ac_COMMS}
+Communications type         : ${comms_type}
- Default precision             : ${ac_PRECISION}
+Default precision           : ${ac_PRECISION}
- RNG choice                    : ${ac_RNG} 
+RNG choice                  : ${ac_RNG} 
- GMP                           : `if test "x$have_gmp" = xtrue; then echo yes; else echo no; fi`
+GMP                         : `if test "x$have_gmp" = xtrue; then echo yes; else echo no; fi`
- LAPACK                        : ${ac_LAPACK}
+LAPACK                      : ${ac_LAPACK}
- FFTW                          : `if test "x$have_fftw" = xtrue; then echo yes; else echo no; fi`
+FFTW                        : `if test "x$have_fftw" = xtrue; then echo yes; else echo no; fi`
- build DOXYGEN documentation   : `if test "x$enable_doc" = xyes; then echo yes; else echo no; fi`
+build DOXYGEN documentation : `if test "x$enable_doc" = xyes; then echo yes; else echo no; fi`
- graphs and diagrams           : `if test "x$enable_dot" = xyes; then echo yes; else echo no; fi`
+graphs and diagrams         : `if test "x$enable_dot" = xyes; then echo yes; else echo no; fi`
 ----- BUILD FLAGS -------------------------------------
- CXXFLAGS:
+CXXFLAGS:
-`echo ${AM_CXXFLAGS} ${CXXFLAGS} | sed 's/ -/\n\t-/g' | sed 's/^-/\t-/g'`
+`echo ${AM_CXXFLAGS} ${CXXFLAGS} | tr ' ' '\n' | sed 's/^-/    -/g'`
- LDFLAGS:
+LDFLAGS:
-`echo ${AM_LDFLAGS} ${LDFLAGS} | sed 's/ -/\n\t-/g' | sed 's/^-/\t-/g'`
+`echo ${AM_LDFLAGS} ${LDFLAGS} | tr ' ' '\n' | sed 's/^-/    -/g'`
- LIBS:
+LIBS:
-`echo ${LIBS} | sed 's/ -/\n\t-/g' | sed 's/^-/\t-/g'`
+`echo ${LIBS} | tr ' ' '\n' | sed 's/^-/    -/g'`
-------------------------------------------------------
+-------------------------------------------------------" > config.summary
-"
+echo ""
 cat config.summary
 echo ""
--- a/lib/AlignedAllocator.h
+++ b/lib/AlignedAllocator.h
@@ -40,14 +40,6 @@ Author: Peter Boyle <paboyle@ph.ed.ac.uk>
 #include <mm_malloc.h>
 #endif
 #ifdef GRID_COMMS_SHMEM
 extern "C" { 
 #include <mpp/shmem.h>
 extern void * shmem_align(size_t, size_t);
 extern void  shmem_free(void *);
 }
 #endif
 namespace Grid {
 ////////////////////////////////////////////////////////////////////
@@ -65,28 +57,85 @@ public:
  typedef _Tp        value_type;
  template<typename _Tp1>  struct rebind { typedef alignedAllocator<_Tp1> other; };
  alignedAllocator() throw() { }
  alignedAllocator(const alignedAllocator&) throw() { }
  template<typename _Tp1> alignedAllocator(const alignedAllocator<_Tp1>&) throw() { }
  ~alignedAllocator() throw() { }
  pointer       address(reference __x)       const { return &__x; }
  //  const_pointer address(const_reference __x) const { return &__x; }
  size_type  max_size() const throw() { return size_t(-1) / sizeof(_Tp); }
  pointer allocate(size_type __n, const void* _p= 0)
  { 
 #ifdef HAVE_MM_MALLOC_H
    _Tp * ptr = (_Tp *) _mm_malloc(__n*sizeof(_Tp),128);
 #else
    _Tp * ptr = (_Tp *) memalign(128,__n*sizeof(_Tp));
 #endif
    _Tp tmp;
 #ifdef GRID_NUMA
 #pragma omp parallel for schedule(static)
  for(int i=0;i<__n;i++){
    ptr[i]=tmp;
  }
 #endif 
    return ptr;
  }
  void deallocate(pointer __p, size_type) { 
 #ifdef HAVE_MM_MALLOC_H
    _mm_free((void *)__p); 
 #else
    free((void *)__p);
 #endif
  }
  void construct(pointer __p, const _Tp& __val) { };
  void construct(pointer __p) { };
  void destroy(pointer __p) { };
 };
 template<typename _Tp>  inline bool operator==(const alignedAllocator<_Tp>&, const alignedAllocator<_Tp>&){ return true; }
 template<typename _Tp>  inline bool operator!=(const alignedAllocator<_Tp>&, const alignedAllocator<_Tp>&){ return false; }
 //////////////////////////////////////////////////////////////////////////////////////////
 // MPI3 : comms must use shm region
 // SHMEM: comms must use symmetric heap
 //////////////////////////////////////////////////////////////////////////////////////////
 #ifdef GRID_COMMS_SHMEM
-
+extern "C" { 
-    _Tp *ptr = (_Tp *) shmem_align(__n*sizeof(_Tp),64);
+#include <mpp/shmem.h>
-
+extern void * shmem_align(size_t, size_t);
-
+extern void  shmem_free(void *);
 }
 #define PARANOID_SYMMETRIC_HEAP
 #endif
 template<typename _Tp>
 class commAllocator {
 public: 
  typedef std::size_t     size_type;
  typedef std::ptrdiff_t  difference_type;
  typedef _Tp*       pointer;
  typedef const _Tp* const_pointer;
  typedef _Tp&       reference;
  typedef const _Tp& const_reference;
  typedef _Tp        value_type;
  template<typename _Tp1>  struct rebind { typedef commAllocator<_Tp1> other; };
  commAllocator() throw() { }
  commAllocator(const commAllocator&) throw() { }
  template<typename _Tp1> commAllocator(const commAllocator<_Tp1>&) throw() { }
  ~commAllocator() throw() { }
  pointer       address(reference __x)       const { return &__x; }
  size_type  max_size() const throw() { return size_t(-1) / sizeof(_Tp); }
 #ifdef GRID_COMMS_SHMEM
  pointer allocate(size_type __n, const void* _p= 0)
  {
 #ifdef CRAY
    _Tp *ptr = (_Tp *) shmem_align(__n*sizeof(_Tp),64);
 #else
    _Tp *ptr = (_Tp *) shmem_align(64,__n*sizeof(_Tp));
 #endif
 #ifdef PARANOID_SYMMETRIC_HEAP
    static void * bcast;
    static long  psync[_SHMEM_REDUCE_SYNC_SIZE];
@@ -96,55 +145,47 @@ public:
    if ( bcast != ptr ) {
      std::printf("inconsistent alloc pe %d %lx %lx \n",shmem_my_pe(),bcast,ptr);std::fflush(stdout);
-      BACKTRACEFILE();
+      //      BACKTRACEFILE();
      exit(0);
    }
    assert( bcast == (void *) ptr);
 #endif 
    return ptr;
  }
  void deallocate(pointer __p, size_type) { 
    shmem_free((void *)__p);
  }
 #else
-
+  pointer allocate(size_type __n, const void* _p= 0) 
  {
 #ifdef HAVE_MM_MALLOC_H
    _Tp * ptr = (_Tp *) _mm_malloc(__n*sizeof(_Tp),128);
 #else
    _Tp * ptr = (_Tp *) memalign(128,__n*sizeof(_Tp));
 #endif
 #endif
    _Tp tmp;
 #undef FIRST_TOUCH_OPTIMISE
 #ifdef FIRST_TOUCH_OPTIMISE
 #pragma omp parallel for 
  for(int i=0;i<__n;i++){
    ptr[i]=tmp;
  }
 #endif 
    return ptr;
  }
  void deallocate(pointer __p, size_type) { 
 #ifdef GRID_COMMS_SHMEM
    shmem_free((void *)__p);
 #else
 #ifdef HAVE_MM_MALLOC_H
    _mm_free((void *)__p); 
 #else
    free((void *)__p);
 #endif
 #endif
  }
 #endif
  void construct(pointer __p, const _Tp& __val) { };
  void construct(pointer __p) { };
  void destroy(pointer __p) { };
 };
 template<typename _Tp>  inline bool operator==(const commAllocator<_Tp>&, const commAllocator<_Tp>&){ return true; }
 template<typename _Tp>  inline bool operator!=(const commAllocator<_Tp>&, const commAllocator<_Tp>&){ return false; }
-template<typename _Tp>  inline bool
+////////////////////////////////////////////////////////////////////////////////
-operator==(const alignedAllocator<_Tp>&, const alignedAllocator<_Tp>&){ return true; }
+// Template typedefs
-
+////////////////////////////////////////////////////////////////////////////////
-template<typename _Tp>  inline bool
+template<class T> using Vector     = std::vector<T,alignedAllocator<T> >;           
-operator!=(const alignedAllocator<_Tp>&, const alignedAllocator<_Tp>&){ return false; }
+template<class T> using commVector = std::vector<T,commAllocator<T> >;              
 template<class T> using Matrix     = std::vector<std::vector<T,alignedAllocator<T> > >;
 }; // namespace Grid
 #endif
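Note on the allocator hunks above: both `alignedAllocator` and the non-SHMEM path of `commAllocator` request 128-byte aligned storage and are then wrapped in `std::vector` through the `Vector` and `commVector` aliases. The following is a minimal sketch of that pattern, not the Grid implementation. It assumes a POSIX system (`posix_memalign`) and C++11, and every name prefixed with `Sketch` or `kSketch`, as well as `is_aligned`, is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>
#include <stdlib.h>   // posix_memalign, free (POSIX; an assumption of this sketch)
#include <vector>

// 128 bytes mirrors the alignment requested in the diff.
constexpr std::size_t kSketchAlign = 128;

// Hypothetical minimal C++11 allocator; std::allocator_traits supplies
// rebind, construct, destroy and max_size defaults.
template <typename T>
struct SketchAlignedAllocator {
  typedef T value_type;

  SketchAlignedAllocator() noexcept {}
  template <typename U>
  SketchAlignedAllocator(const SketchAlignedAllocator<U>&) noexcept {}

  T* allocate(std::size_t n) {
    void* p = nullptr;
    if (::posix_memalign(&p, kSketchAlign, n * sizeof(T)) != 0) {
      throw std::bad_alloc();
    }
    return static_cast<T*>(p);
  }
  void deallocate(T* p, std::size_t) noexcept { ::free(p); }
};

// Stateless allocators always compare equal.
template <typename T, typename U>
bool operator==(const SketchAlignedAllocator<T>&, const SketchAlignedAllocator<U>&) { return true; }
template <typename T, typename U>
bool operator!=(const SketchAlignedAllocator<T>&, const SketchAlignedAllocator<U>&) { return false; }

// Alias in the same spirit as the Vector / commVector template typedefs.
template <class T> using SketchVector = std::vector<T, SketchAlignedAllocator<T> >;

// True when p is a multiple of a bytes.
inline bool is_aligned(const void* p, std::size_t a) {
  return reinterpret_cast<std::uintptr_t>(p) % a == 0;
}
```

A caller would declare `SketchVector<double> v(1000, 1.0);` and could check `is_aligned(v.data(), kSketchAlign)`. Unlike the allocators in the diff, this sketch leaves element construction to `std::allocator_traits` rather than making `construct` a no-op.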
--- a/lib/Cshift.h
+++ b/lib/Cshift.h
@@ -38,6 +38,14 @@ Author: Peter Boyle <paboyle@ph.ed.ac.uk>
 #include <Grid/cshift/Cshift_mpi.h>
 #endif 
 #ifdef GRID_COMMS_MPI3
 #include <Grid/cshift/Cshift_mpi.h>
 #endif 
 #ifdef GRID_COMMS_MPI3L
 #include <Grid/cshift/Cshift_mpi.h>
 #endif 
 #ifdef GRID_COMMS_SHMEM
 #include <Grid/cshift/Cshift_mpi.h> // uses same implementation of communicator
 #endif 
--- a/lib/FFT.h
+++ b/lib/FFT.h
@@ -30,8 +30,14 @@ Author: Peter Boyle <paboyle@ph.ed.ac.uk>
 #define _GRID_FFT_H_
 #ifdef HAVE_FFTW
 #ifdef USE_MKL
 #include <fftw/fftw3.h>
 #else
 #include <fftw3.h>
 #endif
 #endif
 namespace Grid {
  template<class scalar> struct FFTW { };
@@ -120,13 +126,14 @@ namespace Grid {
    double Flops(void) {return flops;}
    double MFlops(void) {return flops/usec;}
    double USec(void)   {return (double)usec;}    
    FFT ( GridCartesian * grid ) :
-      vgrid(grid),
+    vgrid(grid),
-      Nd(grid->_ndimension),
+    Nd(grid->_ndimension),
-      dimensions(grid->_fdimensions),
+    dimensions(grid->_fdimensions),
-      processors(grid->_processors),
+    processors(grid->_processors),
-      processor_coor(grid->_processor_coor)
+    processor_coor(grid->_processor_coor)
    {
      flops=0;
      usec =0;
@@ -139,10 +146,34 @@ namespace Grid {
    }
    template<class vobj>
-    void FFT_dim(Lattice<vobj> &result,const Lattice<vobj> &source,int dim, int inverse){
+    void FFT_dim_mask(Lattice<vobj> &result,const Lattice<vobj> &source,std::vector<int> mask,int sign){
      conformable(result._grid,vgrid);
      conformable(source._grid,vgrid);
      Lattice<vobj> tmp(vgrid);
      tmp = source;
      for(int d=0;d<Nd;d++){
 	if( mask[d] ) {
 	  FFT_dim(result,tmp,d,sign);
 	  tmp=result;
 	}
      }
    }
    template<class vobj>
    void FFT_all_dim(Lattice<vobj> &result,const Lattice<vobj> &source,int sign){
      std::vector<int> mask(Nd,1);
      FFT_dim_mask(result,source,mask,sign);
    }
    template<class vobj>
    void FFT_dim(Lattice<vobj> &result,const Lattice<vobj> &source,int dim, int sign){
 #ifndef HAVE_FFTW
      assert(0);
 #else
      conformable(result._grid,vgrid);
      conformable(source._grid,vgrid);
      int L = vgrid->_ldimensions[dim];
      int G = vgrid->_fdimensions[dim];
@@ -159,118 +190,113 @@ namespace Grid {
      typedef typename vobj::scalar_object sobj;
      typedef typename sobj::scalar_type   scalar;
-      Lattice<vobj> ssource(vgrid); ssource =source;
+      Lattice<sobj> pgbuf(&pencil_g);
-      Lattice<sobj> pgsource(&pencil_g);
+      
      Lattice<sobj> pgresult(&pencil_g); pgresult=zero;
 #ifndef HAVE_FFTW	
      assert(0);
 #else 
      typedef typename FFTW<scalar>::FFTW_scalar FFTW_scalar;
      typedef typename FFTW<scalar>::FFTW_plan   FFTW_plan;
-      {
+      int Ncomp = sizeof(sobj)/sizeof(scalar);
-	int Ncomp = sizeof(sobj)/sizeof(scalar);
+      int Nlow  = 1;
-	int Nlow  = 1;
+      for(int d=0;d<dim;d++){
-	for(int d=0;d<dim;d++){
+        Nlow*=vgrid->_ldimensions[d];
 	  Nlow*=vgrid->_ldimensions[d];
 	}
 	int rank = 1;  /* 1d transforms */
 	int n[] = {G}; /* 1d transforms of length G */
 	int howmany = Ncomp;
 	int odist,idist,istride,ostride;
 	idist   = odist   = 1;          /* Distance between consecutive FT's */
 	istride = ostride = Ncomp*Nlow; /* distance between two elements in the same FT */
 	int *inembed = n, *onembed = n;
 	int sign = FFTW_FORWARD;
 	if (inverse) sign = FFTW_BACKWARD;
 	FFTW_plan p;
 	{
 	  FFTW_scalar *in = (FFTW_scalar *)&pgsource._odata[0];
 	  FFTW_scalar *out= (FFTW_scalar *)&pgresult._odata[0];
 	  p = FFTW<scalar>::fftw_plan_many_dft(rank,n,howmany,
 					       in,inembed,
 					       istride,idist,
 					       out,onembed,
 					       ostride, odist,
 					       sign,FFTW_ESTIMATE);
 	}
 	double add,mul,fma;
 	FFTW<scalar>::fftw_flops(p,&add,&mul,&fma);
 	flops_call = add+mul+2.0*fma;
 	GridStopWatch timer;
 	// Barrel shift and collect global pencil
 	for(int p=0;p<processors[dim];p++) { 
 	  for(int idx=0;idx<sgrid->lSites();idx++) { 
 	    std::vector<int> lcoor(Nd);
    	    sgrid->LocalIndexToLocalCoor(idx,lcoor);
 	    sobj s;
 	    peekLocalSite(s,ssource,lcoor);
 	    lcoor[dim]+=p*L;
 	    pokeLocalSite(s,pgsource,lcoor);
 	  }
 	  ssource = Cshift(ssource,dim,L);
 	}
 	// Loop over orthog coords
 	int NN=pencil_g.lSites();
 	GridStopWatch Timer;
 	Timer.Start();
 PARALLEL_FOR_LOOP
 	for(int idx=0;idx<NN;idx++) { 
 	  std::vector<int> lcoor(Nd);
 	  pencil_g.LocalIndexToLocalCoor(idx,lcoor);
 	  if ( lcoor[dim] == 0 ) {  // restricts loop to plane at lcoor[dim]==0
 	    FFTW_scalar *in = (FFTW_scalar *)&pgsource._odata[idx];
 	    FFTW_scalar *out= (FFTW_scalar *)&pgresult._odata[idx];
 	    FFTW<scalar>::fftw_execute_dft(p,in,out);
 	  }
 	}
        Timer.Stop();
 	usec += Timer.useconds();
 	flops+= flops_call*NN;
        int pc = processor_coor[dim];
        for(int idx=0;idx<sgrid->lSites();idx++) { 
 	  std::vector<int> lcoor(Nd);
 	  sgrid->LocalIndexToLocalCoor(idx,lcoor);
 	  std::vector<int> gcoor = lcoor;
 	  // extract the result
 	  sobj s;
 	  gcoor[dim] = lcoor[dim]+L*pc;
 	  peekLocalSite(s,pgresult,gcoor);
 	  pokeLocalSite(s,result,lcoor);
 	}
 	FFTW<scalar>::fftw_destroy_plan(p);
      }
      int rank = 1;  /* 1d transforms */
      int n[] = {G}; /* 1d transforms of length G */
      int howmany = Ncomp;
      int odist,idist,istride,ostride;
      idist   = odist   = 1;          /* Distance between consecutive FT's */
      istride = ostride = Ncomp*Nlow; /* distance between two elements in the same FT */
      int *inembed = n, *onembed = n;
      scalar div;
      if ( sign == backward ) div = 1.0/G;
      else if ( sign == forward ) div = 1.0;
      else assert(0);
      FFTW_plan p;
      {
        FFTW_scalar *in = (FFTW_scalar *)&pgbuf._odata[0];
        FFTW_scalar *out= (FFTW_scalar *)&pgbuf._odata[0];
        p = FFTW<scalar>::fftw_plan_many_dft(rank,n,howmany,
                                             in,inembed,
                                             istride,idist,
                                             out,onembed,
                                             ostride, odist,
                                             sign,FFTW_ESTIMATE);
      }
      // Barrel shift and collect global pencil
      std::vector<int> lcoor(Nd), gcoor(Nd);
      result = source;
      for(int p=0;p<processors[dim];p++) {
        PARALLEL_REGION
        {
          std::vector<int> cbuf(Nd);
          sobj s;
          PARALLEL_FOR_LOOP_INTERN
          for(int idx=0;idx<sgrid->lSites();idx++) {
            sgrid->LocalIndexToLocalCoor(idx,cbuf);
            peekLocalSite(s,result,cbuf);
            cbuf[dim]+=p*L;
            pokeLocalSite(s,pgbuf,cbuf);
          }
        }
        result = Cshift(result,dim,L);
      }
      // Loop over orthog coords
      int NN=pencil_g.lSites();
      GridStopWatch timer;
      timer.Start();
      PARALLEL_REGION
      {
        std::vector<int> cbuf(Nd);
        PARALLEL_FOR_LOOP_INTERN
        for(int idx=0;idx<NN;idx++) {
          pencil_g.LocalIndexToLocalCoor(idx, cbuf);
          if ( cbuf[dim] == 0 ) {  // restricts loop to plane at lcoor[dim]==0
            FFTW_scalar *in = (FFTW_scalar *)&pgbuf._odata[idx];
            FFTW_scalar *out= (FFTW_scalar *)&pgbuf._odata[idx];
            FFTW<scalar>::fftw_execute_dft(p,in,out);
          }
        }
      }
      timer.Stop();
      // performance counting
      double add,mul,fma;
      FFTW<scalar>::fftw_flops(p,&add,&mul,&fma);
      flops_call = add+mul+2.0*fma;
      usec += timer.useconds();
      flops+= flops_call*NN;
      // writing out result
      int pc = processor_coor[dim];
      PARALLEL_REGION
      {
        std::vector<int> clbuf(Nd), cgbuf(Nd);
        sobj s;
        PARALLEL_FOR_LOOP_INTERN
        for(int idx=0;idx<sgrid->lSites();idx++) {
          sgrid->LocalIndexToLocalCoor(idx,clbuf);
          cgbuf = clbuf;
          cgbuf[dim] = clbuf[dim]+L*pc;
          peekLocalSite(s,pgbuf,cgbuf);
          s = s * div;
          pokeLocalSite(s,result,clbuf);
        }
      }
      // destroying plan
      FFTW<scalar>::fftw_destroy_plan(p);
 #endif
    }
  };
 }
 #endif
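Note on the FFT hunks above: the new `FFT_dim` takes a `sign` argument and multiplies each site by `div`, which is `1.0/G` for the backward transform and `1.0` for the forward one, so that a forward transform followed by a backward transform returns the input. The sketch below illustrates only that normalisation convention with a naive O(G^2) DFT. It does not use FFTW or Grid, `naive_dft` and `max_abs_diff` are hypothetical names, and the sign values follow the FFTW convention (forward is -1, backward is +1).

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

typedef std::complex<double> cplx;

// Naive 1d DFT of length G = in.size().
// sign == -1 : forward, unscaled.
// sign == +1 : backward, scaled by 1/G (the role of `div` in the diff).
std::vector<cplx> naive_dft(const std::vector<cplx>& in, int sign) {
  assert(sign == -1 || sign == +1);
  const std::size_t G = in.size();
  const double pi  = std::acos(-1.0);
  const double div = (sign == +1 && G > 0) ? 1.0 / static_cast<double>(G) : 1.0;
  std::vector<cplx> out(G);
  for (std::size_t k = 0; k < G; ++k) {
    cplx acc(0.0, 0.0);
    for (std::size_t x = 0; x < G; ++x) {
      // Reduce k*x modulo G before forming the phase to limit rounding error.
      const double phase = sign * 2.0 * pi * static_cast<double>((k * x) % G)
                           / static_cast<double>(G);
      acc += in[x] * std::polar(1.0, phase);
    }
    out[k] = acc * div;
  }
  return out;
}

// Largest element-wise distance between two equally sized vectors.
double max_abs_diff(const std::vector<cplx>& a, const std::vector<cplx>& b) {
  assert(a.size() == b.size());
  double m = 0.0;
  for (std::size_t i = 0; i < a.size(); ++i) {
    const double d = std::abs(a[i] - b[i]);
    if (d > m) m = d;
  }
  return m;
}
```

With this convention, `naive_dft(naive_dft(v, -1), +1)` reproduces `v` up to rounding, which is the property the `div` factor in the diff is there to provide.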
--- a/lib/Grid.h
+++ b/lib/Grid.h
@@ -77,11 +77,10 @@ Author: paboyle <paboyle@ph.ed.ac.uk>
 #include <Grid/Stencil.h>      
 #include <Grid/Algorithms.h>   
 #include <Grid/parallelIO/BinaryIO.h>
 #include <Grid/qcd/QCD.h>
 #include <Grid/parallelIO/NerscIO.h>
 #include <Grid/FFT.h>
 #include <Grid/qcd/QCD.h>
 #include <Grid/parallelIO/NerscIO.h>
 #include <Grid/qcd/hmc/NerscCheckpointer.h>
 #include <Grid/qcd/hmc/HmcRunner.h>
--- a/lib/Init.cc
+++ b/lib/Init.cc
@@ -44,9 +44,33 @@ Author: paboyle <paboyle@ph.ed.ac.uk>
 #include <Grid.h>
 #include <algorithm>
 #include <iterator>
 #include <cstdlib>
 #include <memory>
 #include <fenv.h>
 #ifdef __APPLE__
 static int
 feenableexcept (unsigned int excepts)
 {
  static fenv_t fenv;
  unsigned int new_excepts = excepts & FE_ALL_EXCEPT,
    old_excepts;  // previous masks
  if ( fegetenv (&fenv) ) return -1;
  old_excepts = fenv.__control & FE_ALL_EXCEPT;
  // unmask
  fenv.__control &= ~new_excepts;
  fenv.__mxcsr   &= ~(new_excepts << 7);
  return ( fesetenv (&fenv) ? -1 : old_excepts );
 }
 #endif
 namespace Grid {
 //////////////////////////////////////////////////////
 // Convenience functions to access standard command line arg
 // driven parallelism controls
@@ -123,6 +147,13 @@ void GridCmdOptionIntVector(std::string &str,std::vector<int> & vec)
  return;
 }
 void GridCmdOptionInt(std::string &str,int & val)
 {
  std::stringstream ss(str);
  ss>>val;
  return;
 }
 void GridParseLayout(char **argv,int argc,
 		     std::vector<int> &latt,
@@ -153,14 +184,12 @@ void GridParseLayout(char **argv,int argc,
    assert(ompthreads.size()==1);
    GridThread::SetThreads(ompthreads[0]);
  }
  if( GridCmdOptionExists(argv,argv+argc,"--cores") ){
-    std::vector<int> cores(0);
+    int cores;
    arg= GridCmdOptionPayload(argv,argv+argc,"--cores");
-    GridCmdOptionIntVector(arg,cores);
+    GridCmdOptionInt(arg,cores);
-    GridThread::SetCores(cores[0]);
+    GridThread::SetCores(cores);
  }
 }
 std::string GridCmdVectorIntToString(const std::vector<int> & vec){
@@ -169,33 +198,40 @@ std::string GridCmdVectorIntToString(const std::vector<int> & vec){
  return oss.str();
 }
 /////////////////////////////////////////////////////////
-//
+// Reinit guard
 /////////////////////////////////////////////////////////
 static int Grid_is_initialised = 0;
 void Grid_init(int *argc,char ***argv)
 {
  CartesianCommunicator::Init(argc,argv);
  // Parse command line args.
  GridLogger::StopWatch.Start();
  std::string arg;
  ////////////////////////////////////
  // Shared memory block size
  ////////////////////////////////////
  if( GridCmdOptionExists(*argv,*argv+*argc,"--shm") ){
    int MB;
    arg= GridCmdOptionPayload(*argv,*argv+*argc,"--shm");
    GridCmdOptionInt(arg,MB);
    CartesianCommunicator::MAX_MPI_SHM_BYTES = MB*1024*1024;
  }
  CartesianCommunicator::Init(argc,argv);
  ////////////////////////////////////
  // Logging
  ////////////////////////////////////
  std::vector<std::string> logstreams;
  std::string defaultLog("Error,Warning,Message,Performance");
  GridCmdOptionCSL(defaultLog,logstreams);
  GridLogConfigure(logstreams);
-  if( GridCmdOptionExists(*argv,*argv+*argc,"--help") ){
+  if( !GridCmdOptionExists(*argv,*argv+*argc,"--debug-stdout") ){
-    std::cout<<GridLogMessage<<"--help : this message"<<std::endl;
+    Grid_quiesce_nodes();
    std::cout<<GridLogMessage<<"--debug-signals : catch sigsegv and print a blame report"<<std::endl;
    std::cout<<GridLogMessage<<"--debug-stdout  : print stdout from EVERY node"<<std::endl;    
    std::cout<<GridLogMessage<<"--decomposition : report on default omp,mpi and simd decomposition"<<std::endl;    
    std::cout<<GridLogMessage<<"--mpi n.n.n.n   : default MPI decomposition"<<std::endl;    
    std::cout<<GridLogMessage<<"--threads n     : default number of OMP threads"<<std::endl;
    std::cout<<GridLogMessage<<"--grid n.n.n.n  : default Grid size"<<std::endl;    
    std::cout<<GridLogMessage<<"--log list      : comma separated list of streams from Error,Warning,Message,Performance,Iterative,Integrator,Debug,Colours"<<std::endl;
    exit(EXIT_SUCCESS);
  }
  if( GridCmdOptionExists(*argv,*argv+*argc,"--log") ){
@@ -204,35 +240,39 @@ void Grid_init(int *argc,char ***argv)
    GridLogConfigure(logstreams);
  }
-  if( GridCmdOptionExists(*argv,*argv+*argc,"--debug-signals") ){
+  ////////////////////////////////////
-    Grid_debug_handler_init();
+  // Help message
-  }
+  ////////////////////////////////////
-  if( !GridCmdOptionExists(*argv,*argv+*argc,"--debug-stdout") ){
+
-    Grid_quiesce_nodes();
+  if( GridCmdOptionExists(*argv,*argv+*argc,"--help") ){
-  }
+    std::cout<<GridLogMessage<<"  --help : this message"<<std::endl;
-  if( GridCmdOptionExists(*argv,*argv+*argc,"--dslash-opt") ){
+    std::cout<<GridLogMessage<<std::endl;
-    QCD::WilsonKernelsStatic::HandOpt=1;
+    std::cout<<GridLogMessage<<"Geometry:"<<std::endl;
-  }
+    std::cout<<GridLogMessage<<"  --mpi n.n.n.n   : default MPI decomposition"<<std::endl;    
-  if( GridCmdOptionExists(*argv,*argv+*argc,"--lebesgue") ){
+    std::cout<<GridLogMessage<<"  --threads n     : default number of OMP threads"<<std::endl;
-    LebesgueOrder::UseLebesgueOrder=1;
+    std::cout<<GridLogMessage<<"  --grid n.n.n.n  : default Grid size"<<std::endl;    
    std::cout<<GridLogMessage<<"  --shm  M        : allocate M megabytes of shared memory for comms"<<std::endl;    
    std::cout<<GridLogMessage<<std::endl;
    std::cout<<GridLogMessage<<"Verbose and debug:"<<std::endl;
    std::cout<<GridLogMessage<<"  --log list      : comma separated list of streams from Error,Warning,Message,Performance,Iterative,Integrator,Debug,Colours"<<std::endl;
    std::cout<<GridLogMessage<<"  --decomposition : report on default omp,mpi and simd decomposition"<<std::endl;    
    std::cout<<GridLogMessage<<"  --debug-signals : catch sigsegv and print a blame report"<<std::endl;
    std::cout<<GridLogMessage<<"  --debug-stdout  : print stdout from EVERY node"<<std::endl;    
    std::cout<<GridLogMessage<<"  --notimestamp   : suppress millisecond resolution stamps"<<std::endl;    
    std::cout<<GridLogMessage<<std::endl;
    std::cout<<GridLogMessage<<"Performance:"<<std::endl;
    std::cout<<GridLogMessage<<"  --dslash-generic: Wilson kernel for generic Nc"<<std::endl;    
    std::cout<<GridLogMessage<<"  --dslash-unroll : Wilson kernel for Nc=3"<<std::endl;    
    std::cout<<GridLogMessage<<"  --dslash-asm    : Wilson kernel for AVX512"<<std::endl;    
    std::cout<<GridLogMessage<<"  --lebesgue      : Cache oblivious Lebesgue curve/Morton order/Z-graph stencil looping"<<std::endl;    
    std::cout<<GridLogMessage<<"  --cacheblocking n.m.o.p : Hypercuboidal cache blocking"<<std::endl;    
    std::cout<<GridLogMessage<<std::endl;
    exit(EXIT_SUCCESS);
  }
-  if( GridCmdOptionExists(*argv,*argv+*argc,"--cacheblocking") ){
+  ////////////////////////////////////
-    arg= GridCmdOptionPayload(*argv,*argv+*argc,"--cacheblocking");
+  // Banner
-    GridCmdOptionIntVector(arg,LebesgueOrder::Block);
+  ////////////////////////////////////
  }
  GridParseLayout(*argv,*argc,
 		  Grid_default_latt,
 		  Grid_default_mpi);
  if( GridCmdOptionExists(*argv,*argv+*argc,"--decomposition") ){
    std::cout<<GridLogMessage<<"Grid Decomposition\n";
    std::cout<<GridLogMessage<<"\tOpenMP threads : "<<GridThread::GetThreads()<<std::endl;
    std::cout<<GridLogMessage<<"\tMPI tasks      : "<<GridCmdVectorIntToString(GridDefaultMpi())<<std::endl;
    std::cout<<GridLogMessage<<"\tvRealF         : "<<sizeof(vRealF)*8    <<"bits ; " <<GridCmdVectorIntToString(GridDefaultSimd(4,vRealF::Nsimd()))<<std::endl;
    std::cout<<GridLogMessage<<"\tvRealD         : "<<sizeof(vRealD)*8    <<"bits ; " <<GridCmdVectorIntToString(GridDefaultSimd(4,vRealD::Nsimd()))<<std::endl;
    std::cout<<GridLogMessage<<"\tvComplexF      : "<<sizeof(vComplexF)*8 <<"bits ; " <<GridCmdVectorIntToString(GridDefaultSimd(4,vComplexF::Nsimd()))<<std::endl;
    std::cout<<GridLogMessage<<"\tvComplexD      : "<<sizeof(vComplexD)*8 <<"bits ; " <<GridCmdVectorIntToString(GridDefaultSimd(4,vComplexD::Nsimd()))<<std::endl;
  }
  std::string COL_RED    = GridLogColours.colour["RED"];
  std::string COL_PURPLE = GridLogColours.colour["PURPLE"];
@@ -242,19 +282,18 @@ void Grid_init(int *argc,char ***argv)
  std::string COL_YELLOW = GridLogColours.colour["YELLOW"];
  std::string COL_BACKGROUND = GridLogColours.colour["NORMAL"];
  std::cout <<std::endl;
  std::cout <<COL_RED  << "__|__|__|__|__"<<             "|__|__|_"<<COL_PURPLE<<"_|__|__|"<<                "__|__|__|__|__"<<std::endl; 
  std::cout <<COL_RED  << "__|__|__|__|__"<<             "|__|__|_"<<COL_PURPLE<<"_|__|__|"<<                "__|__|__|__|__"<<std::endl; 
-  std::cout <<COL_RED  << "__|__|  |  |  "<<             "|  |  | "<<COL_PURPLE<<" |  |  |"<<                "  |  |  | _|__"<<std::endl; 
+  std::cout <<COL_RED  << "__|_ |  |  |  "<<             "|  |  | "<<COL_PURPLE<<" |  |  |"<<                "  |  |  | _|__"<<std::endl; 
-  std::cout <<COL_RED  << "__|__         "<<             "        "<<COL_PURPLE<<"        "<<                "          _|__"<<std::endl; 
+  std::cout <<COL_RED  << "__|_          "<<             "        "<<COL_PURPLE<<"        "<<                "          _|__"<<std::endl; 
  std::cout <<COL_RED  << "__|_  "<<COL_GREEN<<" GGGG   "<<COL_RED<<" RRRR   "<<COL_BLUE  <<" III    "<<COL_PURPLE<<"DDDD  "<<COL_PURPLE<<"    _|__"<<std::endl;
  std::cout <<COL_RED  << "__|_  "<<COL_GREEN<<"G       "<<COL_RED<<" R   R  "<<COL_BLUE  <<"  I     "<<COL_PURPLE<<"D   D "<<COL_PURPLE<<"    _|__"<<std::endl;
  std::cout <<COL_RED  << "__|_  "<<COL_GREEN<<"G       "<<COL_RED<<" R   R  "<<COL_BLUE  <<"  I     "<<COL_PURPLE<<"D    D"<<COL_PURPLE<<"    _|__"<<std::endl;
  std::cout <<COL_BLUE << "__|_  "<<COL_GREEN<<"G  GG   "<<COL_RED<<" RRRR   "<<COL_BLUE  <<"  I     "<<COL_PURPLE<<"D    D"<<COL_GREEN <<"    _|__"<<std::endl;
  std::cout <<COL_BLUE << "__|_  "<<COL_GREEN<<"G   G   "<<COL_RED<<" R  R   "<<COL_BLUE  <<"  I     "<<COL_PURPLE<<"D   D "<<COL_GREEN <<"    _|__"<<std::endl;
  std::cout <<COL_BLUE << "__|_  "<<COL_GREEN<<" GGGG   "<<COL_RED<<" R   R  "<<COL_BLUE  <<" III    "<<COL_PURPLE<<"DDDD  "<<COL_GREEN <<"    _|__"<<std::endl;
-  std::cout <<COL_BLUE << "__|__         "<<             "        "<<COL_GREEN <<"        "<<                "          _|__"<<std::endl; 
+  std::cout <<COL_BLUE << "__|_          "<<             "        "<<COL_GREEN <<"        "<<                "          _|__"<<std::endl; 
  std::cout <<COL_BLUE << "__|__|__|__|__"<<             "|__|__|_"<<COL_GREEN <<"_|__|__|"<<                "__|__|__|__|__"<<std::endl; 
  std::cout <<COL_BLUE << "__|__|__|__|__"<<             "|__|__|_"<<COL_GREEN <<"_|__|__|"<<                "__|__|__|__|__"<<std::endl; 
  std::cout <<COL_BLUE << "  |  |  |  |  "<<             "|  |  | "<<COL_GREEN <<" |  |  |"<<                "  |  |  |  |  "<<std::endl; 
@@ -274,12 +313,63 @@ void Grid_init(int *argc,char ***argv)
  std::cout << "GNU General Public License for more details."<<std::endl;
  std::cout << COL_BACKGROUND <<std::endl;
  std::cout << std::endl;
  ////////////////////////////////////
  // Debug and performance options
  ////////////////////////////////////
  if( GridCmdOptionExists(*argv,*argv+*argc,"--debug-signals") ){
    Grid_debug_handler_init();
  }
  if( GridCmdOptionExists(*argv,*argv+*argc,"--dslash-unroll") ){
    QCD::WilsonKernelsStatic::Opt=QCD::WilsonKernelsStatic::OptHandUnroll;
  }
  if( GridCmdOptionExists(*argv,*argv+*argc,"--dslash-asm") ){
    QCD::WilsonKernelsStatic::Opt=QCD::WilsonKernelsStatic::OptInlineAsm;
  }
  if( GridCmdOptionExists(*argv,*argv+*argc,"--dslash-generic") ){
    QCD::WilsonKernelsStatic::Opt=QCD::WilsonKernelsStatic::OptGeneric;
  }
  if( GridCmdOptionExists(*argv,*argv+*argc,"--lebesgue") ){
    LebesgueOrder::UseLebesgueOrder=1;
  }
  if( GridCmdOptionExists(*argv,*argv+*argc,"--cacheblocking") ){
    arg= GridCmdOptionPayload(*argv,*argv+*argc,"--cacheblocking");
    GridCmdOptionIntVector(arg,LebesgueOrder::Block);
  }
  if( GridCmdOptionExists(*argv,*argv+*argc,"--notimestamp") ){
    GridLogTimestamp(0);
  } else { 
    GridLogTimestamp(1);
  }
  GridParseLayout(*argv,*argc,
 		  Grid_default_latt,
 		  Grid_default_mpi);
  std::cout << GridLogMessage << "Requesting "<< CartesianCommunicator::MAX_MPI_SHM_BYTES <<" byte stencil comms buffers "<<std::endl;
  if( GridCmdOptionExists(*argv,*argv+*argc,"--decomposition") ){
    std::cout<<GridLogMessage<<"Grid Decomposition\n";
    std::cout<<GridLogMessage<<"\tOpenMP threads : "<<GridThread::GetThreads()<<std::endl;
    std::cout<<GridLogMessage<<"\tMPI tasks      : "<<GridCmdVectorIntToString(GridDefaultMpi())<<std::endl;
    std::cout<<GridLogMessage<<"\tvRealF         : "<<sizeof(vRealF)*8    <<"bits ; " <<GridCmdVectorIntToString(GridDefaultSimd(4,vRealF::Nsimd()))<<std::endl;
    std::cout<<GridLogMessage<<"\tvRealD         : "<<sizeof(vRealD)*8    <<"bits ; " <<GridCmdVectorIntToString(GridDefaultSimd(4,vRealD::Nsimd()))<<std::endl;
    std::cout<<GridLogMessage<<"\tvComplexF      : "<<sizeof(vComplexF)*8 <<"bits ; " <<GridCmdVectorIntToString(GridDefaultSimd(4,vComplexF::Nsimd()))<<std::endl;
    std::cout<<GridLogMessage<<"\tvComplexD      : "<<sizeof(vComplexD)*8 <<"bits ; " <<GridCmdVectorIntToString(GridDefaultSimd(4,vComplexD::Nsimd()))<<std::endl;
  }
  Grid_is_initialised = 1;
 }
 void Grid_finalize(void)
 {
-#ifdef GRID_COMMS_MPI
+#if defined (GRID_COMMS_MPI) || defined (GRID_COMMS_MPI3)
  MPI_Finalize();
  Grid_unquiesce_nodes();
 #endif
@@ -326,10 +416,7 @@ void Grid_sa_signal_handler(int sig,siginfo_t *si,void * ptr)
  exit(0);
  return;
 };
-#ifdef GRID_FPE
+
 #define _GNU_SOURCE
 #include <fenv.h>
 #endif
 void Grid_debug_handler_init(void)
 {
  struct sigaction sa,osa;
@@ -338,9 +425,9 @@ void Grid_debug_handler_init(void)
  sa.sa_flags    = SA_SIGINFO;
  sigaction(SIGSEGV,&sa,NULL);
  sigaction(SIGTRAP,&sa,NULL);
-#ifdef GRID_FPE
+
  feenableexcept( FE_INVALID|FE_OVERFLOW|FE_DIVBYZERO);
  sigaction(SIGFPE,&sa,NULL);
 #endif
 }
 }
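Note on the Init.cc hunks above: `--cores` and `--shm` are now read as a single integer through `GridCmdOptionInt`, which parses the payload with a `std::stringstream`. The sketch below shows the same idea in isolation. The three `sketch_` helpers are hypothetical stand-ins written for this note, not the Grid functions, and unlike `GridCmdOptionInt` the integer parser here reports whether parsing succeeded.

```cpp
#include <algorithm>
#include <cassert>
#include <sstream>
#include <string>

// True when `opt` appears anywhere in [begin, end).
bool sketch_option_exists(char** begin, char** end, const std::string& opt) {
  return std::find(begin, end, opt) != end;
}

// Returns the argument that follows `opt`, or an empty string when
// `opt` is absent or is the last argument.
std::string sketch_option_payload(char** begin, char** end, const std::string& opt) {
  char** it = std::find(begin, end, opt);
  if (it != end && ++it != end) return std::string(*it);
  return std::string();
}

// Parses a single integer; returns false and leaves the stream in a
// failed state when the text does not start with a number.
bool sketch_option_int(const std::string& str, int& val) {
  std::stringstream ss(str);
  return static_cast<bool>(ss >> val);
}
```

For an argument list such as `prog --cores 8`, `sketch_option_payload` yields `"8"` and `sketch_option_int` stores 8. Checking the return value avoids using an uninitialised integer when the payload is missing or malformed, which the unchecked `ss>>val` pattern does not guard against.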
--- a/lib/Init.h
+++ b/lib/Init.h
@@ -33,6 +33,7 @@ namespace Grid {
  void Grid_init(int *argc,char ***argv);
  void Grid_finalize(void);
  // internal, controlled with --handle
  void Grid_sa_signal_handler(int sig,siginfo_t *si,void * ptr);
  void Grid_debug_handler_init(void);
@@ -44,6 +45,7 @@ namespace Grid {
  const std::vector<int> &GridDefaultMpi(void);
  const int              &GridThreads(void)  ;
  void                    GridSetThreads(int t) ;
  void GridLogTimestamp(int);
  // Common parsing chores
  std::string GridCmdOptionPayload(char ** begin, char ** end, const std::string & option);
@@ -52,6 +54,7 @@ namespace Grid {
  void GridCmdOptionCSL(std::string str,std::vector<std::string> & vec);
  void GridCmdOptionIntVector(std::string &str,std::vector<int> & vec);
  void GridParseLayout(char **argv,int argc,
 		       std::vector<int> &latt,
 		       std::vector<int> &simd,
--- a/lib/Log.cc
+++ b/lib/Log.cc
@@ -31,11 +31,31 @@ directory
 /*  END LEGAL */
 #include <Grid.h>
 #include <cxxabi.h>
 namespace Grid {
  std::string demangle(const char* name) {
    int status = -4; // some arbitrary value to eliminate the compiler warning
    // enable c++11 by passing the flag -std=c++11 to g++
    std::unique_ptr<char, void(*)(void*)> res {
      abi::__cxa_demangle(name, NULL, NULL, &status),
 	std::free
 	};
    return (status==0) ? res.get() : name ;
  }
 GridStopWatch Logger::StopWatch;
 int Logger::timestamp;
 std::ostream Logger::devnull(0);
 void GridLogTimestamp(int on){
  Logger::Timestamp(on);
 }
 Colours GridLogColours(0);
 GridLogger GridLogError(1, "Error", GridLogColours, "RED");
 GridLogger GridLogWarning(1, "Warning", GridLogColours, "YELLOW");
@@ -73,7 +93,7 @@ void GridLogConfigure(std::vector<std::string> &logstreams) {
 ////////////////////////////////////////////////////////////
 void Grid_quiesce_nodes(void) {
  int me = 0;
-#ifdef GRID_COMMS_MPI
+#if defined(GRID_COMMS_MPI) || defined(GRID_COMMS_MPI3)
  MPI_Comm_rank(MPI_COMM_WORLD, &me);
 #endif
 #ifdef GRID_COMMS_SHMEM
--- a/lib/Log.h
+++ b/lib/Log.h
@@ -37,10 +37,11 @@
 #include <execinfo.h>
 #endif
-    namespace Grid {
+namespace Grid {
 //////////////////////////////////////////////////////////////////////////////////////////////////
 // Dress the output; use std::chrono for time stamping via the StopWatch class
-int Rank(void); // used for early stage debug before library init
+//////////////////////////////////////////////////////////////////////////////////////////////////
 class Colours{
@@ -55,7 +56,6 @@ public:
  void Active(bool activate){
    is_active=activate;
    if (is_active){
     colour["BLACK"]  ="\033[30m";
     colour["RED"]    ="\033[31m";
@@ -66,21 +66,18 @@ public:
     colour["CYAN"]   ="\033[36m";
     colour["WHITE"]  ="\033[37m";
     colour["NORMAL"] ="\033[0;39m";
-   } else {
+    } else {
-    colour["BLACK"] ="";
+      colour["BLACK"] ="";
-    colour["RED"]   ="";
+      colour["RED"]   ="";
-    colour["GREEN"] ="";
+      colour["GREEN"] ="";
-    colour["YELLOW"]="";
+      colour["YELLOW"]="";
-    colour["BLUE"]  ="";
+      colour["BLUE"]  ="";
-    colour["PURPLE"]="";
+      colour["PURPLE"]="";
-    colour["CYAN"]  ="";
+      colour["CYAN"]  ="";
-    colour["WHITE"] ="";
+      colour["WHITE"] ="";
-    colour["NORMAL"]="";
+      colour["NORMAL"]="";
-  }
+    }
-
+  };
 };
 };
@@ -88,6 +85,7 @@ class Logger {
 protected:
  Colours &Painter;
  int active;
  static int timestamp;
  std::string name, topName;
  std::string COLOUR;
@@ -99,25 +97,28 @@ public:
  std::string evidence() {return Painter.colour["YELLOW"];}
  std::string colour() {return Painter.colour[COLOUR];}
-  Logger(std::string topNm, int on, std::string nm, Colours& col_class, std::string col)
+  Logger(std::string topNm, int on, std::string nm, Colours& col_class, std::string col)  : active(on),
-  : active(on),
+    name(nm),
-  name(nm),
+    topName(topNm),
-  topName(topNm),
+    Painter(col_class),
-  Painter(col_class),
+    COLOUR(col) {} ;
  COLOUR(col){} ;
  void Active(int on) {active = on;};
  int  isActive(void) {return active;};
  static void Timestamp(int on) {timestamp = on;};
  friend std::ostream& operator<< (std::ostream& stream, Logger& log){
    if ( log.active ) {
      StopWatch.Stop();
      GridTime now = StopWatch.Elapsed();
      StopWatch.Start();
      stream << log.background()<< log.topName << log.background()<< " : ";
      stream << log.colour() <<std::setw(14) << std::left << log.name << log.background() << " : ";
-      stream << log.evidence()<< now << log.background() << " : " << log.colour();
+      if ( log.timestamp ) {
 	StopWatch.Stop();
 	GridTime now = StopWatch.Elapsed();
 	StopWatch.Start();
 	stream << log.evidence()<< now << log.background() << " : " ;
      }
      stream << log.colour();
      return stream;
    } else { 
      return devnull;
@@ -143,13 +144,14 @@ extern GridLogger GridLogIterative  ;
 extern GridLogger GridLogIntegrator  ;
 extern Colours    GridLogColours;
 std::string demangle(const char* name) ;
 #define _NBACKTRACE (256)
 extern void * Grid_backtrace_buffer[_NBACKTRACE];
 #define BACKTRACEFILE() {\
 char string[20];					\
-std::sprintf(string,"backtrace.%d",Rank());				\
+std::sprintf(string,"backtrace.%d",CartesianCommunicator::RankWorld()); \
 std::FILE * fp = std::fopen(string,"w");				\
 BACKTRACEFP(fp)\
 std::fclose(fp);	    \
@@ -161,7 +163,7 @@ std::fclose(fp);	    \
 int symbols    = backtrace        (Grid_backtrace_buffer,_NBACKTRACE);\
 char **strings = backtrace_symbols(Grid_backtrace_buffer,symbols);\
 for (int i = 0; i < symbols; i++){\
-  std::fprintf (fp,"BackTrace Strings: %d %s\n",i, strings[i]); std::fflush(fp); \
+  std::fprintf (fp,"BackTrace Strings: %d %s\n",i, demangle(strings[i]).c_str()); std::fflush(fp); \
 }\
 }
 #else 
--- a/lib/Makefile.am
+++ b/lib/Makefile.am
@@ -1,14 +1,27 @@
 extra_sources=
 if BUILD_COMMS_MPI
  extra_sources+=communicator/Communicator_mpi.cc
  extra_sources+=communicator/Communicator_base.cc
 endif
 if BUILD_COMMS_MPI3
  extra_sources+=communicator/Communicator_mpi3.cc
  extra_sources+=communicator/Communicator_base.cc
 endif
 if BUILD_COMMS_MPI3L
  extra_sources+=communicator/Communicator_mpi3_leader.cc
  extra_sources+=communicator/Communicator_base.cc
 endif
 if BUILD_COMMS_SHMEM
  extra_sources+=communicator/Communicator_shmem.cc
  extra_sources+=communicator/Communicator_base.cc
 endif
 if BUILD_COMMS_NONE
  extra_sources+=communicator/Communicator_none.cc
  extra_sources+=communicator/Communicator_base.cc
 endif
 #
--- a/lib/PerfCount.h
+++ b/lib/PerfCount.h
@@ -43,6 +43,9 @@ Author: paboyle <paboyle@ph.ed.ac.uk>
 #else
 #include <sys/syscall.h>
 #endif
 #ifdef __x86_64__
 #include <x86intrin.h>
 #endif
 namespace Grid {
@@ -86,7 +89,6 @@ inline uint64_t cyclecount(void){
   return tmp;
 }
 #elif defined __x86_64__
 #include <x86intrin.h>
 inline uint64_t cyclecount(void){ 
  return __rdtsc();
  //  unsigned int dummy;
--- a/lib/Simd.h
+++ b/lib/Simd.h
@@ -237,6 +237,18 @@ namespace Grid {
    stream<<">";
    return stream;
  }
  inline std::ostream& operator<< (std::ostream& stream, const vInteger &o){
    int nn=vInteger::Nsimd();
    std::vector<Integer,alignedAllocator<Integer> > buf(nn);
    vstore(o,&buf[0]);
    stream<<"<";
    for(int i=0;i<nn;i++){
      stream<<buf[i];
      if(i<nn-1) stream<<",";
    }
    stream<<">";
    return stream;
  }
 }
--- a/lib/Stat.cc
+++ b/lib/Stat.cc
@@ -0,0 +1,247 @@
 #include <Grid.h>
 #include <PerfCount.h>
 #include <Stat.h>
 namespace Grid { 
 bool PmuStat::pmu_initialized=false;
 void PmuStat::init(const char *regname)
 {
 #ifdef __x86_64__
  name = regname;
  if (!pmu_initialized)
    {
      std::cout<<"initialising pmu"<<std::endl;
      pmu_initialized = true;
      pmu_init();
    }
  clear();
 #endif
 }
 void PmuStat::clear(void)
 {
 #ifdef __x86_64__
  count = 0;
  tregion = 0;
  pmc0 = 0;
  pmc1 = 0;
  inst = 0;
  cyc = 0;
  ref = 0;
  tcycles = 0;
  reads = 0;
  writes = 0;
 #endif
 }
 void PmuStat::print(void)
 {
 #ifdef __x86_64__
  std::cout <<"Reg "<<std::string(name)<<":\n";
  std::cout <<"  region "<<tregion<<std::endl;
  std::cout <<"  cycles "<<tcycles<<std::endl;
  std::cout <<"  inst   "<<inst   <<std::endl;
  std::cout <<"  cyc    "<<cyc    <<std::endl;
  std::cout <<"  ref    "<<ref    <<std::endl;
  std::cout <<"  pmc0   "<<pmc0   <<std::endl;
  std::cout <<"  pmc1   "<<pmc1   <<std::endl;
  std::cout <<"  count  "<<count  <<std::endl;
  std::cout <<"  reads  "<<reads  <<std::endl;
  std::cout <<"  writes "<<writes <<std::endl;
 #endif
 }
 void PmuStat::start(void)
 {
 #ifdef __x86_64__
  pmu_start();
  ++count;
  xmemctrs(&mrstart, &mwstart);
  tstart = __rdtsc();
 #endif
 }
 void PmuStat::enter(int t)
 {
 #ifdef __x86_64__
  counters[0][t] = __rdpmc(0);
  counters[1][t] = __rdpmc(1);
  counters[2][t] = __rdpmc((1<<30)|0);
  counters[3][t] = __rdpmc((1<<30)|1);
  counters[4][t] = __rdpmc((1<<30)|2);
  counters[5][t] = __rdtsc();
 #endif
 }
 void PmuStat::exit(int t)
 {
 #ifdef __x86_64__
  counters[0][t] = __rdpmc(0) - counters[0][t];
  counters[1][t] = __rdpmc(1) - counters[1][t];
  counters[2][t] = __rdpmc((1<<30)|0) - counters[2][t];
  counters[3][t] = __rdpmc((1<<30)|1) - counters[3][t];
  counters[4][t] = __rdpmc((1<<30)|2) - counters[4][t];
  counters[5][t] = __rdtsc() - counters[5][t];
 #endif
 }
 void PmuStat::accum(int nthreads)
 {
 #ifdef __x86_64__
  tend = __rdtsc();
  xmemctrs(&mrend, &mwend);
  pmu_stop();
  for (int t = 0; t < nthreads; ++t) {
    pmc0 += counters[0][t];
    pmc1 += counters[1][t];
    inst += counters[2][t];
    cyc += counters[3][t];
    ref += counters[4][t];
    tcycles += counters[5][t];
  }
  uint64_t region = tend - tstart;
  tregion += region;
  uint64_t mreads = mrend - mrstart;
  reads += mreads;
  uint64_t mwrites = mwend - mwstart;
  writes += mwrites;
 #endif
 }
 void PmuStat::pmu_fini(void) {}
 void PmuStat::pmu_start(void) {};
 void PmuStat::pmu_stop(void) {};
 void PmuStat::pmu_init(void)
 {
 #ifdef _KNIGHTS_LANDING_
  KNLsetup();
 #endif
 }
 void PmuStat::xmemctrs(uint64_t *mr, uint64_t *mw)
 {
 #ifdef _KNIGHTS_LANDING_
  ctrs c;
  KNLreadctrs(c);
  uint64_t emr = 0, emw = 0;
  for (int i = 0; i < NEDC; ++i)
    {
      emr += c.edcrd[i];
      emw += c.edcwr[i];
    }
  *mr = emr;
  *mw = emw;
 #else
  *mr = *mw = 0;
 #endif
 }
 #ifdef _KNIGHTS_LANDING_
 struct knl_gbl_ PmuStat::gbl;
 #define PMU_MEM
 void PmuStat::KNLevsetup(const char *ename, int &fd, int event, int umask)
 {
  char fname[1024];
  snprintf(fname, sizeof(fname), "%s/type", ename);
  FILE *fp = fopen(fname, "r");
  if (fp == 0) {
    ::printf("open %s", fname);
    ::exit(0);
  }
  int type;
  int ret = fscanf(fp, "%d", &type);
  assert(ret == 1);
  fclose(fp);
  //  std::cout << "Using PMU type "<<type<<" from " << std::string(ename) <<std::endl;
  struct perf_event_attr hw = {};
  hw.size = sizeof(hw);
  hw.type = type;
  // see /sys/devices/uncore_*/format/*
  // All of the events we are interested in are configured the same way, but
  // that isn't always true. Proper code would parse the format files
  hw.config = event | (umask << 8);
  //hw.read_format = PERF_FORMAT_GROUP;
  // unfortunately the above only works within a single PMU; might
  // as well just read them one at a time
  int cpu = 0;
  fd = perf_event_open(&hw, -1, cpu, -1, 0);
  if (fd == -1) {
    ::printf("CPU %d, box %s, event 0x%lx", cpu, ename, hw.config);
    ::exit(0);
  } else { 
    //    std::cout << "event "<<std::string(ename)<<" set up for fd "<<fd<<" hw.config "<<hw.config <<std::endl;
  }
 }
 void PmuStat::KNLsetup(void){
   int ret;
   char fname[1024];
   // MC RPQ inserts and WPQ inserts (reads & writes)
   for (int mc = 0; mc < NMC; ++mc)
     {
       ::snprintf(fname, sizeof(fname), "/sys/devices/uncore_imc_%d",mc);
       // RPQ Inserts
       KNLevsetup(fname, gbl.mc_rd[mc], 0x1, 0x1);
       // WPQ Inserts
       KNLevsetup(fname, gbl.mc_wr[mc], 0x2, 0x1);
     }
   // EDC RPQ inserts and WPQ inserts
   for (int edc=0; edc < NEDC; ++edc)
     {
       ::snprintf(fname, sizeof(fname), "/sys/devices/uncore_edc_eclk_%d",edc);
       // RPQ inserts
       KNLevsetup(fname, gbl.edc_rd[edc], 0x1, 0x1);
       // WPQ inserts
       KNLevsetup(fname, gbl.edc_wr[edc], 0x2, 0x1);
     }
   // EDC HitE, HitM, MissE, MissM
   for (int edc=0; edc < NEDC; ++edc)
     {
       ::snprintf(fname, sizeof(fname), "/sys/devices/uncore_edc_uclk_%d", edc);
       KNLevsetup(fname, gbl.edc_hite[edc], 0x2, 0x1);
       KNLevsetup(fname, gbl.edc_hitm[edc], 0x2, 0x2);
       KNLevsetup(fname, gbl.edc_misse[edc], 0x2, 0x4);
       KNLevsetup(fname, gbl.edc_missm[edc], 0x2, 0x8);
     }
 }
 uint64_t PmuStat::KNLreadctr(int fd)
 {
  uint64_t data;
  size_t s = ::read(fd, &data, sizeof(data));
  if (s != sizeof(uint64_t)){
    ::printf("read counter %lu", s);
    ::exit(0);
  }
  return data;
 }
 void PmuStat::KNLreadctrs(ctrs &c)
 {
  for (int i = 0; i < NMC; ++i)
    {
      c.mcrd[i] = KNLreadctr(gbl.mc_rd[i]);
      c.mcwr[i] = KNLreadctr(gbl.mc_wr[i]);
    }
  for (int i = 0; i < NEDC; ++i)
    {
      c.edcrd[i] = KNLreadctr(gbl.edc_rd[i]);
      c.edcwr[i] = KNLreadctr(gbl.edc_wr[i]);
    }
  for (int i = 0; i < NEDC; ++i)
    {
      c.edchite[i] = KNLreadctr(gbl.edc_hite[i]);
      c.edchitm[i] = KNLreadctr(gbl.edc_hitm[i]);
      c.edcmisse[i] = KNLreadctr(gbl.edc_misse[i]);
      c.edcmissm[i] = KNLreadctr(gbl.edc_missm[i]);
    }
 }
 #endif
 }
--- a/lib/Stat.h
+++ b/lib/Stat.h
@@ -0,0 +1,104 @@
 #ifndef _GRID_STAT_H
 #define _GRID_STAT_H
 #ifdef AVX512
 #define _KNIGHTS_LANDING_ROOTONLY
 #endif
 namespace Grid { 
 ///////////////////////////////////////////////////////////////////////////////
 // Extra KNL counters from MCDRAM
 ///////////////////////////////////////////////////////////////////////////////
 #ifdef _KNIGHTS_LANDING_
 #define NMC 6
 #define NEDC 8
 struct ctrs
 {
    uint64_t mcrd[NMC];
    uint64_t mcwr[NMC];
    uint64_t edcrd[NEDC]; 
    uint64_t edcwr[NEDC];
    uint64_t edchite[NEDC];
    uint64_t edchitm[NEDC];
    uint64_t edcmisse[NEDC];
    uint64_t edcmissm[NEDC];
 };
 // Peter/Azusa:
 // Our modification of code provided by Larry Meadows from Intel.
 // Verified by email exchange as non-NDA and OK for GitHub. It should be, since it
 // uses the /sys/devices/ filesystem, which is already public and in the Linux kernel for KNL.
 struct knl_gbl_
 {
  int mc_rd[NMC];
  int mc_wr[NMC];
  int edc_rd[NEDC];
  int edc_wr[NEDC];
  int edc_hite[NEDC];
  int edc_hitm[NEDC];
  int edc_misse[NEDC];
  int edc_missm[NEDC];
 };
 #endif
 ///////////////////////////////////////////////////////////////////////////////
 class PmuStat
 {
    uint64_t counters[8][256];
 #ifdef _KNIGHTS_LANDING_
    static struct knl_gbl_ gbl;
 #endif
    const char *name;
    uint64_t reads;     // memory reads
    uint64_t writes;    // memory writes
    uint64_t mrstart;   // memory read counter at start of parallel region
    uint64_t mrend;     // memory read counter at end of parallel region
    uint64_t mwstart;   // memory write counter at start of parallel region
    uint64_t mwend;     // memory write counter at end of parallel region
    // cumulative counters
    uint64_t count;     // number of invocations
    uint64_t tregion;   // total time in parallel region (from thread 0)
    uint64_t tcycles;   // total cycles inside parallel region
    uint64_t inst, ref, cyc;   // fixed counters
    uint64_t pmc0, pmc1;// pmu
    // add memory counters here
    // temp variables
    uint64_t tstart;    // tsc at start of parallel region
    uint64_t tend;      // tsc at end of parallel region
    // map for counters[i][t] values (per-thread deltas after exit())
    // 0 pmc0
    // 1 pmc1
    // 2 inst (fixed counter 0)
    // 3 cyc  (fixed counter 1)
    // 4 ref  (fixed counter 2)
    // 5 tsc
    static bool pmu_initialized;
 public:
    static bool is_init(void){ return pmu_initialized;}
    static void pmu_init(void);
    static void pmu_fini(void);
    static void pmu_start(void);
    static void pmu_stop(void);
    void accum(int nthreads);
    static void xmemctrs(uint64_t *mr, uint64_t *mw);
    void start(void);
    void enter(int t);
    void exit(int t);
    void print(void);
    void init(const char *regname);
    void clear(void);
 #ifdef _KNIGHTS_LANDING_
    static void     KNLsetup(void);
    static uint64_t KNLreadctr(int fd);
    static void     KNLreadctrs(ctrs &c);
    static void     KNLevsetup(const char *ename, int &fd, int event, int umask);
 #endif
  };
 }
 #endif
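
Note on `PmuStat`: `enter(t)` stores a start reading per thread, `exit(t)` overwrites it with the delta, and `accum(nthreads)` sums the deltas into cumulative totals. The sketch below shows that pattern in portable form, assuming the caller passes in the counter reading instead of using `__rdpmc` or `__rdtsc`. `RegionCounter` and its members are hypothetical names, not Grid API.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical, portable stand-in for the enter/exit/accum pattern in PmuStat.
// The caller supplies the counter reading, so any monotonic counter works.
class RegionCounter {
  std::vector<std::uint64_t> per_thread_;  // start value, then delta, per thread
  std::uint64_t total_ = 0;                // cumulative sum over all regions

 public:
  explicit RegionCounter(int max_threads) : per_thread_(max_threads, 0) {}

  // Record the counter value when thread t enters the region.
  void enter(int t, std::uint64_t now) { per_thread_[t] = now; }

  // Replace the stored start value with the elapsed delta for thread t.
  void exit(int t, std::uint64_t now) { per_thread_[t] = now - per_thread_[t]; }

  // Fold the per-thread deltas of the region just finished into the total.
  void accum(int nthreads) {
    for (int t = 0; t < nthreads; ++t) total_ += per_thread_[t];
  }

  std::uint64_t total() const { return total_; }
};
```

Each thread writes only its own slot, so the region itself needs no locking. Only `accum`, called after the parallel region, touches the shared total.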
--- a/lib/Stencil.h
+++ b/lib/Stencil.h
--- a/lib/Threads.h
+++ b/lib/Threads.h
@@ -37,11 +37,20 @@ Author: paboyle <paboyle@ph.ed.ac.uk>
 #ifdef GRID_OMP
 #include <omp.h>
-#define PARALLEL_FOR_LOOP _Pragma("omp parallel for ")
+#ifdef GRID_NUMA
 #define PARALLEL_FOR_LOOP        _Pragma("omp parallel for schedule(static)")
 #define PARALLEL_FOR_LOOP_INTERN _Pragma("omp for schedule(static)")
 #else
 #define PARALLEL_FOR_LOOP        _Pragma("omp parallel for schedule(runtime)")
 #define PARALLEL_FOR_LOOP_INTERN _Pragma("omp for schedule(runtime)")
 #endif
 #define PARALLEL_NESTED_LOOP2 _Pragma("omp parallel for collapse(2)")
 #define PARALLEL_REGION       _Pragma("omp parallel")
 #else
 #define PARALLEL_FOR_LOOP
 #define PARALLEL_FOR_LOOP_INTERN
 #define PARALLEL_NESTED_LOOP2
 #define PARALLEL_REGION
 #endif
 namespace Grid {
@@ -123,6 +132,22 @@ class GridThread {
    ThreadBarrier();
  };
  static void bcopy(const void *src, void *dst, size_t len) {
 #ifdef GRID_OMP
 #pragma omp parallel 
    {
      const char *c_src =(char *) src;
      char *c_dest=(char *) dst;
      int me,mywork,myoff;
      GridThread::GetWorkBarrier(len,me, mywork,myoff);
      bcopy(&c_src[myoff],&c_dest[myoff],mywork);
    }
 #else 
    bcopy(src,dst,len);
 #endif
  }
 };
 }
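
Note on `GridThread::bcopy`: the idea is to give each thread a disjoint byte range and let it copy only that slice. The sketch below illustrates the partitioning under stated assumptions. `split_work` is a hypothetical stand-in for `GridThread::GetWorkBarrier` (whose exact policy is not shown in this diff), and the worker loop runs serially here, where the real code would run one iteration per OpenMP thread. The inner copy uses `std::memcpy` (destination first, unlike `bcopy`). A distinct name matters because an unqualified `bcopy` call inside a member function named `bcopy` resolves to the member itself.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>

// Hypothetical stand-in for GridThread::GetWorkBarrier: split `len` items
// across `nthreads`; the first (len % nthreads) workers get one extra item.
inline void split_work(std::size_t len, int nthreads, int me,
                       std::size_t &mywork, std::size_t &myoff) {
  const std::size_t n = static_cast<std::size_t>(nthreads);
  const std::size_t t = static_cast<std::size_t>(me);
  const std::size_t base = len / n;
  const std::size_t rem = len % n;
  mywork = base + (t < rem ? 1 : 0);
  myoff = t * base + std::min(t, rem);
}

// Copy `len` bytes as `nthreads` disjoint slices. Each loop iteration is the
// work one thread would do; it is run serially here to stay dependency-free.
inline void chunked_copy(const void *src, void *dst, std::size_t len,
                         int nthreads) {
  const char *c_src = static_cast<const char *>(src);
  char *c_dst = static_cast<char *>(dst);
  for (int me = 0; me < nthreads; ++me) {
    std::size_t mywork = 0, myoff = 0;
    split_work(len, nthreads, me, mywork, myoff);
    std::memcpy(c_dst + myoff, c_src + myoff, mywork);
  }
}
```

Because the slices never overlap, the iterations are independent and can be distributed over threads without synchronisation beyond the final join.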
--- a/lib/algorithms/CoarsenedMatrix.h
+++ b/lib/algorithms/CoarsenedMatrix.h
@@ -282,7 +282,7 @@ PARALLEL_FOR_LOOP
 	  } else if(SE->_is_local) { 
 	    nbr = in._odata[SE->_offset];
 	  } else {
-	    nbr = Stencil.comm_buf[SE->_offset];
+	    nbr = Stencil.CommBuf()[SE->_offset];
 	  }
 	  res = res + A[point]._odata[ss]*nbr;
 	}
--- a/lib/algorithms/iterative/ConjugateGradient.h
+++ b/lib/algorithms/iterative/ConjugateGradient.h
@@ -1,153 +1,168 @@
-    /*************************************************************************************
+/*************************************************************************************
-    Grid physics library, www.github.com/paboyle/Grid 
+Grid physics library, www.github.com/paboyle/Grid
-    Source file: ./lib/algorithms/iterative/ConjugateGradient.h
+Source file: ./lib/algorithms/iterative/ConjugateGradient.h
-    Copyright (C) 2015
+Copyright (C) 2015
 Author: Azusa Yamaguchi <ayamaguc@staffmail.ed.ac.uk>
 Author: Peter Boyle <paboyle@ph.ed.ac.uk>
 Author: paboyle <paboyle@ph.ed.ac.uk>
-    This program is free software; you can redistribute it and/or modify
+This program is free software; you can redistribute it and/or modify
-    it under the terms of the GNU General Public License as published by
+it under the terms of the GNU General Public License as published by
-    the Free Software Foundation; either version 2 of the License, or
+the Free Software Foundation; either version 2 of the License, or
-    (at your option) any later version.
+(at your option) any later version.
-    This program is distributed in the hope that it will be useful,
+This program is distributed in the hope that it will be useful,
-    but WITHOUT ANY WARRANTY; without even the implied warranty of
+but WITHOUT ANY WARRANTY; without even the implied warranty of
-    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
-    GNU General Public License for more details.
+GNU General Public License for more details.
-    You should have received a copy of the GNU General Public License along
+You should have received a copy of the GNU General Public License along
-    with this program; if not, write to the Free Software Foundation, Inc.,
+with this program; if not, write to the Free Software Foundation, Inc.,
-    51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
+51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
-    See the full license in the file "LICENSE" in the top level distribution directory
+See the full license in the file "LICENSE" in the top level distribution
-    *************************************************************************************/
+directory
-    /*  END LEGAL */
+*************************************************************************************/
 /*  END LEGAL */
 #ifndef GRID_CONJUGATE_GRADIENT_H
 #define GRID_CONJUGATE_GRADIENT_H
 namespace Grid {
-    /////////////////////////////////////////////////////////////
+/////////////////////////////////////////////////////////////
-    // Base classes for iterative processes based on operators
+// Base classes for iterative processes based on operators
-    // single input vec, single output vec.
+// single input vec, single output vec.
-    /////////////////////////////////////////////////////////////
+/////////////////////////////////////////////////////////////
-  template<class Field> 
+template <class Field>
-    class ConjugateGradient : public OperatorFunction<Field> {
+class ConjugateGradient : public OperatorFunction<Field> {
-public:                                                
+ public:
-    bool ErrorOnNoConverge; //throw an assert when the CG fails to converge. Defaults true.
+  bool ErrorOnNoConverge;  // throw an assert when the CG fails to converge.
-    RealD   Tolerance;
+                           // Defaults true.
-    Integer MaxIterations;
+  RealD Tolerance;
-  ConjugateGradient(RealD tol,Integer maxit, bool err_on_no_conv = true) : Tolerance(tol), MaxIterations(maxit), ErrorOnNoConverge(err_on_no_conv){ 
+  Integer MaxIterations;
-    };
+  ConjugateGradient(RealD tol, Integer maxit, bool err_on_no_conv = true)
      : Tolerance(tol),
        MaxIterations(maxit),
        ErrorOnNoConverge(err_on_no_conv){};
  void operator()(LinearOperatorBase<Field> &Linop, const Field &src,
                  Field &psi) {
    psi.checkerboard = src.checkerboard;
    conformable(psi, src);
    RealD cp, c, a, d, b, ssq, qq, b_pred;
    Field p(src);
    Field mmp(src);
    Field r(src);
    // Initial residual computation & set up
    RealD guess = norm2(psi);
    assert(std::isnan(guess) == 0);
-    void operator() (LinearOperatorBase<Field> &Linop,const Field &src, Field &psi){
+    Linop.HermOpAndNorm(psi, mmp, d, b);
      psi.checkerboard = src.checkerboard;
      conformable(psi,src);
-      RealD cp,c,a,d,b,ssq,qq,b_pred;
+    r = src - mmp;
    p = r;
-      Field   p(src);
+    a = norm2(p);
-      Field mmp(src);
+    cp = a;
-      Field   r(src);
+    ssq = norm2(src);
-      //Initial residual computation & set up
+    std::cout << GridLogIterative << std::setprecision(4)
-      RealD guess = norm2(psi);
+              << "ConjugateGradient: guess " << guess << std::endl;
-      assert(std::isnan(guess)==0);
+    std::cout << GridLogIterative << std::setprecision(4)
              << "ConjugateGradient:   src " << ssq << std::endl;
    std::cout << GridLogIterative << std::setprecision(4)
              << "ConjugateGradient:    mp " << d << std::endl;
    std::cout << GridLogIterative << std::setprecision(4)
              << "ConjugateGradient:   mmp " << b << std::endl;
    std::cout << GridLogIterative << std::setprecision(4)
              << "ConjugateGradient:  cp,r " << cp << std::endl;
    std::cout << GridLogIterative << std::setprecision(4)
              << "ConjugateGradient:     p " << a << std::endl;
-      Linop.HermOpAndNorm(psi,mmp,d,b);
+    RealD rsq = Tolerance * Tolerance * ssq;
-      r= src-mmp;
+    // Check if guess is really REALLY good :)
-      p= r;
+    if (cp <= rsq) {
-      
+      return;
      a  =norm2(p);
      cp =a;
      ssq=norm2(src);
      std::cout<<GridLogIterative <<std::setprecision(4)<< "ConjugateGradient: guess "<<guess<<std::endl;
      std::cout<<GridLogIterative <<std::setprecision(4)<< "ConjugateGradient:   src "<<ssq  <<std::endl;
      std::cout<<GridLogIterative <<std::setprecision(4)<< "ConjugateGradient:    mp "<<d    <<std::endl;
      std::cout<<GridLogIterative <<std::setprecision(4)<< "ConjugateGradient:   mmp "<<b    <<std::endl;
      std::cout<<GridLogIterative <<std::setprecision(4)<< "ConjugateGradient:  cp,r "<<cp   <<std::endl;
      std::cout<<GridLogIterative <<std::setprecision(4)<< "ConjugateGradient:     p "<<a    <<std::endl;
      RealD rsq =  Tolerance* Tolerance*ssq;
      //Check if guess is really REALLY good :)
      if ( cp <= rsq ) {
 	return;
      }
      std::cout<<GridLogIterative << std::setprecision(4)<< "ConjugateGradient: k=0 residual "<<cp<<" target "<<rsq<<std::endl;
      GridStopWatch LinalgTimer;
      GridStopWatch MatrixTimer;
      GridStopWatch SolverTimer;
      SolverTimer.Start();
      int k;
      for (k=1;k<=MaxIterations;k++){
 	c=cp;
 	MatrixTimer.Start();
 	Linop.HermOpAndNorm(p,mmp,d,qq);
 	MatrixTimer.Stop();
 	LinalgTimer.Start();
 	//	RealD    qqck = norm2(mmp);
 	//	ComplexD dck  = innerProduct(p,mmp);
 	a      = c/d;
 	b_pred = a*(a*qq-d)/c;
 	cp = axpy_norm(r,-a,mmp,r);
 	b = cp/c;
 	// Fuse these loops ; should be really easy
 	psi= a*p+psi;
 	p  = p*b+r;
 	LinalgTimer.Stop();
 	std::cout<<GridLogIterative<<"ConjugateGradient: Iteration " <<k<<" residual "<<cp<< " target "<< rsq<<std::endl;
 	// Stopping condition
 	if ( cp <= rsq ) { 
 	  SolverTimer.Stop();
 	  Linop.HermOpAndNorm(psi,mmp,d,qq);
 	  p=mmp-src;
 	  RealD mmpnorm = sqrt(norm2(mmp));
 	  RealD psinorm = sqrt(norm2(psi));
 	  RealD srcnorm = sqrt(norm2(src));
 	  RealD resnorm = sqrt(norm2(p));
 	  RealD true_residual = resnorm/srcnorm;
 	  std::cout<<GridLogMessage<<"ConjugateGradient: Converged on iteration " <<k
 		   <<" computed residual "<<sqrt(cp/ssq)
 		   <<" true residual "    <<true_residual
 		   <<" target "<<Tolerance<<std::endl;
 	  std::cout<<GridLogMessage<<"Time elapsed: Total "<< SolverTimer.Elapsed() << " Matrix  "<<MatrixTimer.Elapsed() << " Linalg "<<LinalgTimer.Elapsed();
 	  std::cout<<std::endl;
 	  if(ErrorOnNoConverge)
 	    assert(true_residual/Tolerance < 1000.0);
 	  return;
 	}
      }
      std::cout<<GridLogMessage<<"ConjugateGradient did NOT converge"<<std::endl;
      if(ErrorOnNoConverge)	
 	assert(0);
    }
-  };
+
    std::cout << GridLogIterative << std::setprecision(4)
              << "ConjugateGradient: k=0 residual " << cp << " target " << rsq
              << std::endl;
    GridStopWatch LinalgTimer;
    GridStopWatch MatrixTimer;
    GridStopWatch SolverTimer;
    SolverTimer.Start();
    int k;
    for (k = 1; k <= MaxIterations; k++) {
      c = cp;
      MatrixTimer.Start();
      Linop.HermOpAndNorm(p, mmp, d, qq);
      MatrixTimer.Stop();
      LinalgTimer.Start();
      //  RealD    qqck = norm2(mmp);
      //  ComplexD dck  = innerProduct(p,mmp);
      a = c / d;
      b_pred = a * (a * qq - d) / c;
      cp = axpy_norm(r, -a, mmp, r);
      b = cp / c;
      // Fuse these loops ; should be really easy
      psi = a * p + psi;
      p = p * b + r;
      LinalgTimer.Stop();
      std::cout << GridLogIterative << "ConjugateGradient: Iteration " << k
                << " residual " << cp << " target " << rsq << std::endl;
      // Stopping condition
      if (cp <= rsq) {
        SolverTimer.Stop();
        Linop.HermOpAndNorm(psi, mmp, d, qq);
        p = mmp - src;
        RealD mmpnorm = sqrt(norm2(mmp));
        RealD psinorm = sqrt(norm2(psi));
        RealD srcnorm = sqrt(norm2(src));
        RealD resnorm = sqrt(norm2(p));
        RealD true_residual = resnorm / srcnorm;
        std::cout << GridLogMessage
                  << "ConjugateGradient: Converged on iteration " << k << std::endl;
        std::cout << GridLogMessage << "Computed residual " << sqrt(cp / ssq)
                  << " true residual " << true_residual << " target "
                  << Tolerance << std::endl;
        std::cout << GridLogMessage << "Time elapsed: Iterations "
                  << SolverTimer.Elapsed() << " Matrix  "
                  << MatrixTimer.Elapsed() << " Linalg "
                  << LinalgTimer.Elapsed();
        std::cout << std::endl;
        if (ErrorOnNoConverge) assert(true_residual / Tolerance < 1000.0);
        return;
      }
    }
    std::cout << GridLogMessage << "ConjugateGradient did NOT converge"
              << std::endl;
    if (ErrorOnNoConverge) assert(0);
  }
 };
 }
 #endif
--- a/lib/algorithms/iterative/ImplicitlyRestartedLanczos.h
+++ b/lib/algorithms/iterative/ImplicitlyRestartedLanczos.h
@@ -31,7 +31,11 @@ Author: paboyle <paboyle@ph.ed.ac.uk>
 #include <string.h> //memset
 #ifdef USE_LAPACK
-#include <lapacke.h>
+void LAPACK_dstegr(char *jobz, char *range, int *n, double *d, double *e,
                   double *vl, double *vu, int *il, int *iu, double *abstol,
                   int *m, double *w, double *z, int *ldz, int *isuppz,
                   double *work, int *lwork, int *iwork, int *liwork,
                   int *info);
 #endif
 #include "DenseMatrix.h"
 #include "EigenSort.h"
--- a/lib/cartesian/Cartesian_base.h
+++ b/lib/cartesian/Cartesian_base.h
@@ -77,15 +77,12 @@ public:
    // GridCartesian / GridRedBlackCartesian
    ////////////////////////////////////////////////////////////////
    virtual int CheckerBoarded(int dim)=0;
-    virtual int CheckerBoard(std::vector<int> site)=0;
+    virtual int CheckerBoard(std::vector<int> &site)=0;
    virtual int CheckerBoardDestination(int source_cb,int shift,int dim)=0;
    virtual int CheckerBoardShift(int source_cb,int dim,int shift,int osite)=0;
    virtual int CheckerBoardShiftForCB(int source_cb,int dim,int shift,int cb)=0;
-    int  CheckerBoardFromOindex (int Oindex){
+    virtual int CheckerBoardFromOindex (int Oindex)=0;
-      std::vector<int> ocoor;
+    virtual int CheckerBoardFromOindexTable (int Oindex)=0;
      oCoorFromOindex(ocoor,Oindex); 
      return CheckerBoard(ocoor);
    }
    //////////////////////////////////////////////////////////////////////////////////////////////
    // Local layout calculations
--- a/lib/cartesian/Cartesian_full.h
+++ b/lib/cartesian/Cartesian_full.h
@@ -39,10 +39,17 @@ class GridCartesian: public GridBase {
 public:
    virtual int  CheckerBoardFromOindexTable (int Oindex) {
      return 0;
    }
    virtual int  CheckerBoardFromOindex (int Oindex)
    {
      return 0;
    }
    virtual int CheckerBoarded(int dim){
      return 0;
    }
-    virtual int CheckerBoard(std::vector<int> site){
+    virtual int CheckerBoard(std::vector<int> &site){
        return 0;
    }
    virtual int CheckerBoardDestination(int cb,int shift,int dim){
--- a/lib/cartesian/Cartesian_red_black.h
+++ b/lib/cartesian/Cartesian_red_black.h
@@ -43,12 +43,13 @@ class GridRedBlackCartesian : public GridBase
 public:
    std::vector<int> _checker_dim_mask;
    int              _checker_dim;
    std::vector<int> _checker_board;
    virtual int CheckerBoarded(int dim){
      if( dim==_checker_dim) return 1;
      else return 0;
    }
-    virtual int CheckerBoard(std::vector<int> site){
+    virtual int CheckerBoard(std::vector<int> &site){
      int linear=0;
      assert(site.size()==_ndimension);
      for(int d=0;d<_ndimension;d++){ 
@@ -72,12 +73,20 @@ public:
      // or by looping over x,y,z and multiply rather than computing checkerboard.
      if ( (source_cb+ocb)&1 ) {
 	return (shift)/2;
      } else {
 	return (shift+1)/2;
      }
    }
    virtual int  CheckerBoardFromOindexTable (int Oindex) {
      return _checker_board[Oindex];
    }
    virtual int  CheckerBoardFromOindex (int Oindex)
    {
      std::vector<int> ocoor;
      oCoorFromOindex(ocoor,Oindex);
      return CheckerBoard(ocoor);
    }
    virtual int CheckerBoardShift(int source_cb,int dim,int shift,int osite){
      if(dim != _checker_dim) return shift;
@@ -169,7 +178,7 @@ public:
 	// all elements of a simd vector must have same checkerboard.
 	// If Ls vectorised, this must still be the case; e.g. dwf rb5d
 	if ( _simd_layout[d]>1 ) {
-	  if ( d != _checker_dim ) { 
+	  if ( checker_dim_mask[d] ) { 
 	    assert( (_rdimensions[d]&0x1) == 0 );
 	  }
 	}
@@ -185,6 +194,8 @@ public:
 	  _ostride[d] = _ostride[d-1]*_rdimensions[d-1];
 	  _istride[d] = _istride[d-1]*_simd_layout[d-1];
 	}
      }
      ////////////////////////////////////////////////////////////////////////////////////////////
@@ -206,6 +217,18 @@ public:
 	block = block*_rdimensions[d];
      }
      ////////////////////////////////////////////////
      // Create a checkerboard lookup table
      ////////////////////////////////////////////////
      int rvol = 1;
      for(int d=0;d<_ndimension;d++){
 	rvol=rvol * _rdimensions[d];
      }
      _checker_board.resize(rvol);
      for(int osite=0;osite<_osites;osite++){
 	_checker_board[osite] = CheckerBoardFromOindex (osite);
      }
    };
 protected:
    virtual int oIndex(std::vector<int> &coor)
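
Note on the `_checker_board` table: it is filled once so that `CheckerBoardFromOindexTable` is a plain array lookup instead of a coordinate computation per call. The sketch below shows the precompute-then-lookup idea, assuming the usual even/odd definition (sum of coordinates modulo 2) on a full, non-reduced lattice with dimension 0 running fastest. Grid's own table is indexed by outer site on the reduced grid and honours `_checker_dim_mask`. Neither is reproduced here, and all names below are hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Parity of a site: 0 for even, 1 for odd (sum of coordinates modulo 2).
inline int parity(const std::vector<int> &coor) {
  int linear = 0;
  for (int c : coor) linear += c;
  return linear & 1;
}

// Lexicographic site index -> coordinates, dimension 0 running fastest.
inline std::vector<int> coor_from_index(int index,
                                        const std::vector<int> &dims) {
  std::vector<int> coor(dims.size());
  for (std::size_t d = 0; d < dims.size(); ++d) {
    coor[d] = index % dims[d];
    index /= dims[d];
  }
  return coor;
}

// Build the lookup table once; later queries are a single array access.
inline std::vector<int> build_parity_table(const std::vector<int> &dims) {
  int vol = 1;
  for (int n : dims) vol *= n;
  std::vector<int> table(vol);
  for (int s = 0; s < vol; ++s) table[s] = parity(coor_from_index(s, dims));
  return table;
}
```

The table costs one `int` per site, traded for skipping the index-to-coordinate conversion on every lookup.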
--- a/lib/communicator/Communicator_base.cc
+++ b/lib/communicator/Communicator_base.cc
@@ -0,0 +1,124 @@
    /*************************************************************************************
    Grid physics library, www.github.com/paboyle/Grid 
    Source file: ./lib/communicator/Communicator_base.cc
    Copyright (C) 2015
 Author: Peter Boyle <paboyle@ph.ed.ac.uk>
    This program is free software; you can redistribute it and/or modify
    it under the terms of the GNU General Public License as published by
    the Free Software Foundation; either version 2 of the License, or
    (at your option) any later version.
    This program is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    GNU General Public License for more details.
    You should have received a copy of the GNU General Public License along
    with this program; if not, write to the Free Software Foundation, Inc.,
    51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
    See the full license in the file "LICENSE" in the top level distribution directory
    *************************************************************************************/
    /*  END LEGAL */
 #include "Grid.h"
 namespace Grid {
 ///////////////////////////////////////////////////////////////
 // Info that is set up once and independent of the Cartesian layout
 ///////////////////////////////////////////////////////////////
 void *              CartesianCommunicator::ShmCommBuf;
 uint64_t            CartesianCommunicator::MAX_MPI_SHM_BYTES   = 128*1024*1024; 
 /////////////////////////////////
 // Alloc, free shmem region
 /////////////////////////////////
 void *CartesianCommunicator::ShmBufferMalloc(size_t bytes){
  //  bytes = (bytes+sizeof(vRealD))&(~(sizeof(vRealD)-1));// align up bytes
  void *ptr = (void *)heap_top;
  heap_top  += bytes;
  heap_bytes+= bytes;
  if (heap_bytes >= MAX_MPI_SHM_BYTES) {
    std::cout<< " ShmBufferMalloc exceeded shared heap size -- try increasing with --shm <MB> flag" <<std::endl;
    std::cout<< " Parameter specified in units of MB (megabytes) " <<std::endl;
    std::cout<< " Current value is " << (MAX_MPI_SHM_BYTES/(1024*1024)) <<std::endl;
    assert(heap_bytes<MAX_MPI_SHM_BYTES);
  }
  return ptr;
 }
 void CartesianCommunicator::ShmBufferFreeAll(void) { 
  heap_top  =(size_t)ShmBufferSelf();
  heap_bytes=0;
 }
 /////////////////////////////////
 // Grid information queries
 /////////////////////////////////
 int                      CartesianCommunicator::IsBoss(void)            { return _processor==0; };
 int                      CartesianCommunicator::BossRank(void)          { return 0; };
 int                      CartesianCommunicator::ThisRank(void)          { return _processor; };
 const std::vector<int> & CartesianCommunicator::ThisProcessorCoor(void) { return _processor_coor; };
 const std::vector<int> & CartesianCommunicator::ProcessorGrid(void)     { return _processors; };
 int                      CartesianCommunicator::ProcessorCount(void)    { return _Nprocessors; };
 ////////////////////////////////////////////////////////////////////////////////
 // very VERY rarely (Log, serial RNG) we need world without a grid
 ////////////////////////////////////////////////////////////////////////////////
 void CartesianCommunicator::GlobalSum(ComplexF &c)
 {
  GlobalSumVector((float *)&c,2);
 }
 void CartesianCommunicator::GlobalSumVector(ComplexF *c,int N)
 {
  GlobalSumVector((float *)c,2*N);
 }
 void CartesianCommunicator::GlobalSum(ComplexD &c)
 {
  GlobalSumVector((double *)&c,2);
 }
 void CartesianCommunicator::GlobalSumVector(ComplexD *c,int N)
 {
  GlobalSumVector((double *)c,2*N);
 }
 #if !defined( GRID_COMMS_MPI3) && !defined (GRID_COMMS_MPI3L)
 void CartesianCommunicator::StencilSendToRecvFromBegin(std::vector<CommsRequest_t> &list,
 						       void *xmit,
 						       int xmit_to_rank,
 						       void *recv,
 						       int recv_from_rank,
 						       int bytes)
 {
  SendToRecvFromBegin(list,xmit,xmit_to_rank,recv,recv_from_rank,bytes);
 }
 void CartesianCommunicator::StencilSendToRecvFromComplete(std::vector<CommsRequest_t> &waitall)
 {
  SendToRecvFromComplete(waitall);
 }
 void CartesianCommunicator::StencilBarrier(void){};
 commVector<uint8_t> CartesianCommunicator::ShmBufStorageVector;
 void *CartesianCommunicator::ShmBufferSelf(void) { return ShmCommBuf; }
 void *CartesianCommunicator::ShmBuffer(int rank) {
  return NULL;
 }
 void *CartesianCommunicator::ShmBufferTranslate(int rank,void * local_p) { 
  return NULL;
 }
 void CartesianCommunicator::ShmInitGeneric(void){
  ShmBufStorageVector.resize(MAX_MPI_SHM_BYTES);
  ShmCommBuf=(void *)&ShmBufStorageVector[0];
 }
 #endif
 }
--- a/lib/communicator/Communicator_base.h
+++ b/lib/communicator/Communicator_base.h
@@ -1,3 +1,4 @@
    /*************************************************************************************
    Grid physics library, www.github.com/paboyle/Grid 
@@ -34,123 +35,196 @@ Author: Peter Boyle <paboyle@ph.ed.ac.uk>
 #ifdef GRID_COMMS_MPI
 #include <mpi.h>
 #endif
 #ifdef GRID_COMMS_MPI3
 #include <mpi.h>
 #endif
 #ifdef GRID_COMMS_MPI3L
 #include <mpi.h>
 #endif
 #ifdef GRID_COMMS_SHMEM
 #include <mpp/shmem.h>
 #endif
 namespace Grid {
 class CartesianCommunicator {
  public:    
  // 65536 ranks per node adequate for now
  // 128MB shared memory for comms is enough for 48^4 local vol comms
  // Give external control (command line override?) of this
  static const int      MAXLOG2RANKSPERNODE = 16;            
  static uint64_t MAX_MPI_SHM_BYTES;
  // Communicator should know nothing of the physics grid, only processor grid.
  int              _Nprocessors;     // How many in all
  std::vector<int> _processors;      // Which dimensions get relayed out over processors lanes.
  int              _processor;       // linear processor rank
  std::vector<int> _processor_coor;  // linear processor coordinate
  unsigned long _ndimension;
-    int              _Nprocessors;     // How many in all
+#if defined (GRID_COMMS_MPI) || defined (GRID_COMMS_MPI3) || defined (GRID_COMMS_MPI3L)
-    std::vector<int> _processors;      // Which dimensions get relayed out over processors lanes.
+  static MPI_Comm communicator_world;
-    int              _processor;       // linear processor rank
+         MPI_Comm communicator;
-    std::vector<int> _processor_coor;  // linear processor coordinate
+  typedef MPI_Request CommsRequest_t;
    unsigned long _ndimension;
 #ifdef GRID_COMMS_MPI
    MPI_Comm communicator;
    typedef MPI_Request CommsRequest_t;
 #else 
-    typedef int CommsRequest_t;
+  typedef int CommsRequest_t;
 #endif
-    static void Init(int *argc, char ***argv);
+  ////////////////////////////////////////////////////////////////////
  // Helper functionality for SHM Windows common to all other impls
  ////////////////////////////////////////////////////////////////////
  // Longer term; drop this in favour of a master / slave model with 
  // cartesian communicator on a subset of ranks, slave ranks controlled
  // by group leader with data xfer via shared memory
  ////////////////////////////////////////////////////////////////////
 #ifdef GRID_COMMS_MPI3
-    // Constructor
+  static int ShmRank;
-    CartesianCommunicator(const std::vector<int> &pdimensions_in);
+  static int ShmSize;
  static int GroupRank;
  static int GroupSize;
  static int WorldRank;
  static int WorldSize;
-    // Wraps MPI_Cart routines
+  std::vector<int>  WorldDims;
-    void ShiftedRanks(int dim,int shift,int & source, int & dest);
+  std::vector<int>  GroupDims;
-    int  RankFromProcessorCoor(std::vector<int> &coor);
+  std::vector<int>  ShmDims;
    void ProcessorCoorFromRank(int rank,std::vector<int> &coor);
-    /////////////////////////////////
+  std::vector<int> GroupCoor;
-    // Grid information queries
+  std::vector<int> ShmCoor;
-    /////////////////////////////////
+  std::vector<int> WorldCoor;
    int                      IsBoss(void)            { return _processor==0; };
    int                      BossRank(void)          { return 0; };
    int                      ThisRank(void)          { return _processor; };
    const std::vector<int> & ThisProcessorCoor(void) { return _processor_coor; };
    const std::vector<int> & ProcessorGrid(void)     { return _processors; };
    int                      ProcessorCount(void)    { return _Nprocessors; };
-    ////////////////////////////////////////////////////////////
+  static std::vector<int> GroupRanks; 
-    // Reduction
+  static std::vector<int> MyGroup;
-    ////////////////////////////////////////////////////////////
+  static int ShmSetup;
-    void GlobalSum(RealF &);
+  static MPI_Win ShmWindow; 
-    void GlobalSumVector(RealF *,int N);
+  static MPI_Comm ShmComm;
-    void GlobalSum(RealD &);
+  std::vector<int>  LexicographicToWorldRank;
    void GlobalSumVector(RealD *,int N);
-    void GlobalSum(uint32_t &);
+  static std::vector<void *> ShmCommBufs;
    void GlobalSum(uint64_t &);
-    void GlobalSum(ComplexF &c)
+#else 
-    {
+  static void ShmInitGeneric(void);
-      GlobalSumVector((float *)&c,2);
+  static commVector<uint8_t> ShmBufStorageVector;
-    }
+#endif 
    void GlobalSumVector(ComplexF *c,int N)
    {
      GlobalSumVector((float *)c,2*N);
    }
-    void GlobalSum(ComplexD &c)
+  /////////////////////////////////
-    {
+  // Grid information and queries
-      GlobalSumVector((double *)&c,2);
+  // Implemented in Communicator_base.cc
-    }
+  /////////////////////////////////
-    void GlobalSumVector(ComplexD *c,int N)
+  static void * ShmCommBuf;
-    {
+  size_t heap_top;
-      GlobalSumVector((double *)c,2*N);
+  size_t heap_bytes;
    }
-    template<class obj> void GlobalSum(obj &o){
+  void *ShmBufferSelf(void);
-      typedef typename obj::scalar_type scalar_type;
+  void *ShmBuffer(int rank);
-      int words = sizeof(obj)/sizeof(scalar_type);
+  void *ShmBufferTranslate(int rank,void * local_p);
-      scalar_type * ptr = (scalar_type *)& o;
+  void *ShmBufferMalloc(size_t bytes);
-      GlobalSumVector(ptr,words);
+  void ShmBufferFreeAll(void) ;
    }
    ////////////////////////////////////////////////////////////
    // Face exchange, buffer swap in translational invariant way
    ////////////////////////////////////////////////////////////
    void SendToRecvFrom(void *xmit,
 			int xmit_to_rank,
 			void *recv,
 			int recv_from_rank,
 			int bytes);
-    void SendRecvPacket(void *xmit,
+  ////////////////////////////////////////////////
-			void *recv,
+  // Must call in Grid startup
-			int xmit_to_rank,
+  ////////////////////////////////////////////////
-			int recv_from_rank,
+  static void Init(int *argc, char ***argv);
 			int bytes);
-    void SendToRecvFromBegin(std::vector<CommsRequest_t> &list,
+  ////////////////////////////////////////////////
-			 void *xmit,
+  // Constructor of any given grid
-			 int xmit_to_rank,
+  ////////////////////////////////////////////////
-			 void *recv,
+  CartesianCommunicator(const std::vector<int> &pdimensions_in);
 			 int recv_from_rank,
 			 int bytes);
    void SendToRecvFromComplete(std::vector<CommsRequest_t> &waitall);
-    ////////////////////////////////////////////////////////////
+  ////////////////////////////////////////////////////////////////////////////////////////
-    // Barrier
+  // Wraps MPI_Cart routines, or implements equivalent on other impls
-    ////////////////////////////////////////////////////////////
+  ////////////////////////////////////////////////////////////////////////////////////////
-    void Barrier(void);
+  void ShiftedRanks(int dim,int shift,int & source, int & dest);
  int  RankFromProcessorCoor(std::vector<int> &coor);
  void ProcessorCoorFromRank(int rank,std::vector<int> &coor);
-    ////////////////////////////////////////////////////////////
+  int                      IsBoss(void)            ;
-    // Broadcast a buffer and composite larger
+  int                      BossRank(void)          ;
-    ////////////////////////////////////////////////////////////
+  int                      ThisRank(void)          ;
-    void Broadcast(int root,void* data, int bytes);
+  const std::vector<int> & ThisProcessorCoor(void) ;
-    template<class obj> void Broadcast(int root,obj &data)
+  const std::vector<int> & ProcessorGrid(void)     ;
  int                      ProcessorCount(void)    ;
  ////////////////////////////////////////////////////////////////////////////////
  // very VERY rarely (Log, serial RNG) we need world without a grid
  ////////////////////////////////////////////////////////////////////////////////
  static int  RankWorld(void) ;
  static void BroadcastWorld(int root,void* data, int bytes);
  ////////////////////////////////////////////////////////////
  // Reduction
  ////////////////////////////////////////////////////////////
  void GlobalSum(RealF &);
  void GlobalSumVector(RealF *,int N);
  void GlobalSum(RealD &);
  void GlobalSumVector(RealD *,int N);
  void GlobalSum(uint32_t &);
  void GlobalSum(uint64_t &);
  void GlobalSum(ComplexF &c);
  void GlobalSumVector(ComplexF *c,int N);
  void GlobalSum(ComplexD &c);
  void GlobalSumVector(ComplexD *c,int N);
  template<class obj> void GlobalSum(obj &o){
    typedef typename obj::scalar_type scalar_type;
    int words = sizeof(obj)/sizeof(scalar_type);
    scalar_type * ptr = (scalar_type *)& o;
    GlobalSumVector(ptr,words);
  }
  ////////////////////////////////////////////////////////////
  // Face exchange, buffer swap in translational invariant way
  ////////////////////////////////////////////////////////////
  void SendToRecvFrom(void *xmit,
 		      int xmit_to_rank,
 		      void *recv,
 		      int recv_from_rank,
 		      int bytes);
  void SendRecvPacket(void *xmit,
 		      void *recv,
 		      int xmit_to_rank,
 		      int recv_from_rank,
 		      int bytes);
  void SendToRecvFromBegin(std::vector<CommsRequest_t> &list,
 			   void *xmit,
 			   int xmit_to_rank,
 			   void *recv,
 			   int recv_from_rank,
 			   int bytes);
  void SendToRecvFromComplete(std::vector<CommsRequest_t> &waitall);
  void StencilSendToRecvFromBegin(std::vector<CommsRequest_t> &list,
 				  void *xmit,
 				  int xmit_to_rank,
 				  void *recv,
 				  int recv_from_rank,
 				  int bytes);
  void StencilSendToRecvFromComplete(std::vector<CommsRequest_t> &waitall);
  void StencilBarrier(void);
  ////////////////////////////////////////////////////////////
  // Barrier
  ////////////////////////////////////////////////////////////
  void Barrier(void);
  ////////////////////////////////////////////////////////////
  // Broadcast a buffer and composite larger
  ////////////////////////////////////////////////////////////
  void Broadcast(int root,void* data, int bytes);
  template<class obj> void Broadcast(int root,obj &data)
    {
      Broadcast(root,(void *)&data,sizeof(data));
    };
    static void BroadcastWorld(int root,void* data, int bytes);
 }; 
 }
--- a/lib/communicator/Communicator_mpi.cc
+++ b/lib/communicator/Communicator_mpi.cc
@@ -30,21 +30,23 @@ Author: Peter Boyle <paboyle@ph.ed.ac.uk>
 namespace Grid {
-  // Should error check all MPI calls.
+
 ///////////////////////////////////////////////////////////////////////////////////////////////////
 // Info that is set up once and is independent of the Cartesian layout
 ///////////////////////////////////////////////////////////////////////////////////////////////////
 MPI_Comm CartesianCommunicator::communicator_world;
 // Should error check all MPI calls.
 void CartesianCommunicator::Init(int *argc, char ***argv) {
  int flag;
  MPI_Initialized(&flag); // needed to coexist with other libs apparently
  if ( !flag ) {
    MPI_Init(argc,argv);
  }
  MPI_Comm_dup (MPI_COMM_WORLD,&communicator_world);
  ShmInitGeneric();
 }
  int Rank(void) {
    int pe;
    MPI_Comm_rank(MPI_COMM_WORLD,&pe);
    return pe;
  }
 CartesianCommunicator::CartesianCommunicator(const std::vector<int> &processors)
 {
  _ndimension = processors.size();
@@ -54,7 +56,7 @@ CartesianCommunicator::CartesianCommunicator(const std::vector<int> &processors)
  _processors = processors;
  _processor_coor.resize(_ndimension);
-  MPI_Cart_create(MPI_COMM_WORLD, _ndimension,&_processors[0],&periodic[0],1,&communicator);
+  MPI_Cart_create(communicator_world, _ndimension,&_processors[0],&periodic[0],1,&communicator);
  MPI_Comm_rank(communicator,&_processor);
  MPI_Cart_coords(communicator,_processor,_ndimension,&_processor_coor[0]);
@@ -67,7 +69,6 @@ CartesianCommunicator::CartesianCommunicator(const std::vector<int> &processors)
  assert(Size==_Nprocessors);
 }
 void CartesianCommunicator::GlobalSum(uint32_t &u){
  int ierr=MPI_Allreduce(MPI_IN_PLACE,&u,1,MPI_UINT32_T,MPI_SUM,communicator);
  assert(ierr==0);
@@ -168,7 +169,6 @@ void CartesianCommunicator::SendToRecvFromComplete(std::vector<CommsRequest_t> &
  int nreq=list.size();
  std::vector<MPI_Status> status(nreq);
  int ierr = MPI_Waitall(nreq,&list[0],&status[0]);
  assert(ierr==0);
 }
@@ -187,14 +187,22 @@ void CartesianCommunicator::Broadcast(int root,void* data, int bytes)
 		     communicator);
  assert(ierr==0);
 }
-
+  ///////////////////////////////////////////////////////
  // Should only be used prior to Grid Init finished.
  // Check for this?
  ///////////////////////////////////////////////////////
 int CartesianCommunicator::RankWorld(void){ 
  int r; 
  MPI_Comm_rank(communicator_world,&r);
  return r;
 }
 void CartesianCommunicator::BroadcastWorld(int root,void* data, int bytes)
 {
  int ierr= MPI_Bcast(data,
 		      bytes,
 		      MPI_BYTE,
 		      root,
-		      MPI_COMM_WORLD);
+		      communicator_world);
  assert(ierr==0);
 }
--- a/lib/communicator/Communicator_mpi3.cc
+++ b/lib/communicator/Communicator_mpi3.cc
@@ -0,0 +1,580 @@
    /*************************************************************************************
    Grid physics library, www.github.com/paboyle/Grid 
    Source file: ./lib/communicator/Communicator_mpi3.cc
    Copyright (C) 2015
 Author: Peter Boyle <paboyle@ph.ed.ac.uk>
    This program is free software; you can redistribute it and/or modify
    it under the terms of the GNU General Public License as published by
    the Free Software Foundation; either version 2 of the License, or
    (at your option) any later version.
    This program is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    GNU General Public License for more details.
    You should have received a copy of the GNU General Public License along
    with this program; if not, write to the Free Software Foundation, Inc.,
    51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
    See the full license in the file "LICENSE" in the top level distribution directory
    *************************************************************************************/
    /*  END LEGAL */
 #include "Grid.h"
 #include <mpi.h>
 namespace Grid {
 ///////////////////////////////////////////////////////////////////////////////////////////////////
 // Info that is set up once and is independent of the Cartesian layout
 ///////////////////////////////////////////////////////////////////////////////////////////////////
 int CartesianCommunicator::ShmSetup = 0;
 int CartesianCommunicator::ShmRank;
 int CartesianCommunicator::ShmSize;
 int CartesianCommunicator::GroupRank;
 int CartesianCommunicator::GroupSize;
 int CartesianCommunicator::WorldRank;
 int CartesianCommunicator::WorldSize;
 MPI_Comm CartesianCommunicator::communicator_world;
 MPI_Comm CartesianCommunicator::ShmComm;
 MPI_Win  CartesianCommunicator::ShmWindow;
 std::vector<int> CartesianCommunicator::GroupRanks;  
 std::vector<int> CartesianCommunicator::MyGroup;
 std::vector<void *> CartesianCommunicator::ShmCommBufs;
 void *CartesianCommunicator::ShmBufferSelf(void)
 {
  return ShmCommBufs[ShmRank];
 }
 void *CartesianCommunicator::ShmBuffer(int rank)
 {
  int gpeer = GroupRanks[rank];
  if (gpeer == MPI_UNDEFINED){
    return NULL;
  } else { 
    return ShmCommBufs[gpeer];
  }
 }
 void *CartesianCommunicator::ShmBufferTranslate(int rank,void * local_p)
 {
  int gpeer = GroupRanks[rank];
  if (gpeer == MPI_UNDEFINED){
    return NULL;
  } else { 
    uint64_t offset = (uint64_t)local_p - (uint64_t)ShmCommBufs[ShmRank];
    uint64_t remote = (uint64_t)ShmCommBufs[gpeer]+offset;
    return (void *) remote;
  }
 }
 void CartesianCommunicator::Init(int *argc, char ***argv) {
  int flag;
  MPI_Initialized(&flag); // needed to coexist with other libs apparently
  if ( !flag ) {
    MPI_Init(argc,argv);
  }
  MPI_Comm_dup (MPI_COMM_WORLD,&communicator_world);
  MPI_Comm_rank(communicator_world,&WorldRank);
  MPI_Comm_size(communicator_world,&WorldSize);
  /////////////////////////////////////////////////////////////////////
  // Split into groups that can share memory
  /////////////////////////////////////////////////////////////////////
  MPI_Comm_split_type(communicator_world, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL,&ShmComm);
  MPI_Comm_rank(ShmComm     ,&ShmRank);
  MPI_Comm_size(ShmComm     ,&ShmSize);
  GroupSize = WorldSize/ShmSize;
  /////////////////////////////////////////////////////////////////////
  // find world ranks in our SHM group (i.e. which ranks are on our node)
  /////////////////////////////////////////////////////////////////////
  MPI_Group WorldGroup, ShmGroup;
  MPI_Comm_group (communicator_world, &WorldGroup); 
  MPI_Comm_group (ShmComm, &ShmGroup);
  std::vector<int> world_ranks(WorldSize); 
  GroupRanks.resize(WorldSize); 
  for(int r=0;r<WorldSize;r++) world_ranks[r]=r;
  MPI_Group_translate_ranks (WorldGroup,WorldSize,&world_ranks[0],ShmGroup, &GroupRanks[0]); 
  ///////////////////////////////////////////////////////////////////
  // Identify who is in my group and nominate the leader
  ///////////////////////////////////////////////////////////////////
  int g=0;
  MyGroup.resize(ShmSize);
  for(int rank=0;rank<WorldSize;rank++){
    if(GroupRanks[rank]!=MPI_UNDEFINED){
      assert(g<ShmSize);
      MyGroup[g++] = rank;
    }
  }
  std::sort(MyGroup.begin(),MyGroup.end(),std::less<int>());
  int myleader = MyGroup[0];
  std::vector<int> leaders_1hot(WorldSize,0);
  std::vector<int> leaders_group(GroupSize,0);
  leaders_1hot [ myleader ] = 1;
  ///////////////////////////////////////////////////////////////////
  // global sum leaders over comm world
  ///////////////////////////////////////////////////////////////////
  int ierr=MPI_Allreduce(MPI_IN_PLACE,&leaders_1hot[0],WorldSize,MPI_INT,MPI_SUM,communicator_world);
  assert(ierr==0);
  ///////////////////////////////////////////////////////////////////
  // find the group leaders world rank
  ///////////////////////////////////////////////////////////////////
  int group=0;
  for(int l=0;l<WorldSize;l++){
    if(leaders_1hot[l]){
      leaders_group[group++] = l;
    }
  }
  ///////////////////////////////////////////////////////////////////
  // Identify the rank of the group in which I (and my leader) live
  ///////////////////////////////////////////////////////////////////
  GroupRank=-1;
  for(int g=0;g<GroupSize;g++){
    if (myleader == leaders_group[g]){
      GroupRank=g;
    }
  }
  assert(GroupRank!=-1);
  //////////////////////////////////////////////////////////////////////////////////////////////////////////
  // allocate the shared window for our group
  //////////////////////////////////////////////////////////////////////////////////////////////////////////
  ShmCommBuf = 0;
  ierr = MPI_Win_allocate_shared(MAX_MPI_SHM_BYTES,1,MPI_INFO_NULL,ShmComm,&ShmCommBuf,&ShmWindow);
  assert(ierr==0);
  // KNL hack -- force to numa-domain 1 in flat
 #if 0
  //#include <numaif.h>
  for(uint64_t page=0;page<MAX_MPI_SHM_BYTES;page+=4096){
    void *pages = (void *) ( page + ShmCommBuf );
    int status;
    int flags=MPOL_MF_MOVE_ALL;
    int nodes=1; // numa domain == MCDRAM
    unsigned long count=1;
    ierr= move_pages(0,count, &pages,&nodes,&status,flags);
    if (ierr && (page==0)) perror("numa relocate command failed");
  }
 #endif
  MPI_Win_lock_all (MPI_MODE_NOCHECK, ShmWindow);
  /////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
  // Plan: allocate a fixed SHM region. Scratch that is just used via some scheme during stencil comms, with no allocate free.
  /////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
  ShmCommBufs.resize(ShmSize);
  for(int r=0;r<ShmSize;r++){
    MPI_Aint sz;
    int dsp_unit;
    MPI_Win_shared_query (ShmWindow, r, &sz, &dsp_unit, &ShmCommBufs[r]);
  }
  //////////////////////////////////////////////////////////////////////////////////////////////////////////
  // Verbose for now
  //////////////////////////////////////////////////////////////////////////////////////////////////////////
  if (WorldRank == 0){
    std::cout<<GridLogMessage<< "Grid MPI-3 configuration: detected ";
    std::cout<< WorldSize << " Ranks " ;
    std::cout<< GroupSize << " Nodes " ;
    std::cout<< " with " << ShmSize << " ranks-per-node "<<std::endl;
    std::cout<<GridLogMessage     <<"Grid MPI-3 configuration: allocated shared memory region of size ";
    std::cout<<std::hex << MAX_MPI_SHM_BYTES <<" ShmCommBuf address = "<<ShmCommBuf << std::dec<<std::endl;
    for(int g=0;g<GroupSize;g++){
      std::cout<<GridLogMessage<<" Node "<<g<<" led by MPI rank "<<leaders_group[g]<<std::endl;
    }
    std::cout<<GridLogMessage<<" Boss Node Shm Pointers are {";
    for(int g=0;g<ShmSize;g++){
      std::cout<<std::hex<<ShmCommBufs[g]<<std::dec;
      if(g!=ShmSize-1) std::cout<<",";
      else std::cout<<"}"<<std::endl;
    }
  }
  for(int g=0;g<GroupSize;g++){
    if ( (ShmRank == 0) && (GroupRank==g) )  std::cout<<GridLogMessage<<"["<<g<<"] Node Group "<<g<<" is ranks {";
    for(int r=0;r<ShmSize;r++){
      if ( (ShmRank == 0) && (GroupRank==g) ) {
 	std::cout<<MyGroup[r];
 	if(r<ShmSize-1) std::cout<<",";
 	else std::cout<<"}"<<std::endl;
      }
      MPI_Barrier(communicator_world);
    }
  }
  assert(ShmSetup==0);  ShmSetup=1;
 }
 ////////////////////////////////////////////////////////////////////////////////////////////////////////////
 // Want to implement some magic ... Group sub-cubes into those on same node
 ////////////////////////////////////////////////////////////////////////////////////////////////////////////
 void CartesianCommunicator::ShiftedRanks(int dim,int shift,int &source,int &dest)
 {
  std::vector<int> coor = _processor_coor;
  assert(std::abs(shift) <_processors[dim]);
  coor[dim] = (_processor_coor[dim] + shift + _processors[dim])%_processors[dim];
  Lexicographic::IndexFromCoor(coor,source,_processors);
  source = LexicographicToWorldRank[source];
  coor[dim] = (_processor_coor[dim] - shift + _processors[dim])%_processors[dim];
  Lexicographic::IndexFromCoor(coor,dest,_processors);
  dest = LexicographicToWorldRank[dest];
 }
 int CartesianCommunicator::RankFromProcessorCoor(std::vector<int> &coor)
 {
  int rank;
  Lexicographic::IndexFromCoor(coor,rank,_processors);
  rank = LexicographicToWorldRank[rank];
  return rank;
 }
 void  CartesianCommunicator::ProcessorCoorFromRank(int rank, std::vector<int> &coor)
 {
  Lexicographic::CoorFromIndex(coor,rank,_processors);
  rank = LexicographicToWorldRank[rank];
 }
 CartesianCommunicator::CartesianCommunicator(const std::vector<int> &processors)
 { 
  int ierr;
  communicator=communicator_world;
  _ndimension = processors.size();
  ////////////////////////////////////////////////////////////////
  // Assert power of two shm_size.
  ////////////////////////////////////////////////////////////////
  int log2size = -1;
  for(int i=0;i<=MAXLOG2RANKSPERNODE;i++){  
    if ( (0x1<<i) == ShmSize ) {
      log2size = i;
      break;
    }
  }
  assert(log2size != -1);
  ////////////////////////////////////////////////////////////////
  // Identify subblock of ranks on node spreading across dims
  // in a maximally symmetrical way
  ////////////////////////////////////////////////////////////////
  int dim = 0;
  std::vector<int> WorldDims = processors;
  ShmDims.resize(_ndimension,1);
  GroupDims.resize(_ndimension);
  ShmCoor.resize(_ndimension);
  GroupCoor.resize(_ndimension);
  WorldCoor.resize(_ndimension);
  for(int l2=0;l2<log2size;l2++){
    while ( WorldDims[dim] / ShmDims[dim] <= 1 ) dim=(dim+1)%_ndimension;
    ShmDims[dim]*=2;
    dim=(dim+1)%_ndimension;
  }
  ////////////////////////////////////////////////////////////////
  // Establish torus of processes and nodes with sub-blockings
  ////////////////////////////////////////////////////////////////
  for(int d=0;d<_ndimension;d++){
    GroupDims[d] = WorldDims[d]/ShmDims[d];
  }
  ////////////////////////////////////////////////////////////////
  // Check processor counts match
  ////////////////////////////////////////////////////////////////
  _Nprocessors=1;
  _processors = processors;
  _processor_coor.resize(_ndimension);
  for(int i=0;i<_ndimension;i++){
    _Nprocessors*=_processors[i];
  }
  assert(WorldSize==_Nprocessors);
  ////////////////////////////////////////////////////////////////
  // Establish mapping between lexico physics coord and WorldRank
  // 
  ////////////////////////////////////////////////////////////////
  LexicographicToWorldRank.resize(WorldSize,0);
  Lexicographic::CoorFromIndex(GroupCoor,GroupRank,GroupDims);
  Lexicographic::CoorFromIndex(ShmCoor,ShmRank,ShmDims);
  for(int d=0;d<_ndimension;d++){
    WorldCoor[d] = GroupCoor[d]*ShmDims[d]+ShmCoor[d];
  }
  _processor_coor = WorldCoor;
  int lexico;
  Lexicographic::IndexFromCoor(WorldCoor,lexico,WorldDims);
  LexicographicToWorldRank[lexico]=WorldRank;
  _processor = lexico;
  ///////////////////////////////////////////////////////////////////
  // global sum Lexico to World mapping
  ///////////////////////////////////////////////////////////////////
  ierr=MPI_Allreduce(MPI_IN_PLACE,&LexicographicToWorldRank[0],WorldSize,MPI_INT,MPI_SUM,communicator);
  assert(ierr==0);
 };
 void CartesianCommunicator::GlobalSum(uint32_t &u){
  int ierr=MPI_Allreduce(MPI_IN_PLACE,&u,1,MPI_UINT32_T,MPI_SUM,communicator);
  assert(ierr==0);
 }
 void CartesianCommunicator::GlobalSum(uint64_t &u){
  int ierr=MPI_Allreduce(MPI_IN_PLACE,&u,1,MPI_UINT64_T,MPI_SUM,communicator);
  assert(ierr==0);
 }
 void CartesianCommunicator::GlobalSum(float &f){
  int ierr=MPI_Allreduce(MPI_IN_PLACE,&f,1,MPI_FLOAT,MPI_SUM,communicator);
  assert(ierr==0);
 }
 void CartesianCommunicator::GlobalSumVector(float *f,int N)
 {
  int ierr=MPI_Allreduce(MPI_IN_PLACE,f,N,MPI_FLOAT,MPI_SUM,communicator);
  assert(ierr==0);
 }
 void CartesianCommunicator::GlobalSum(double &d)
 {
  int ierr = MPI_Allreduce(MPI_IN_PLACE,&d,1,MPI_DOUBLE,MPI_SUM,communicator);
  assert(ierr==0);
 }
 void CartesianCommunicator::GlobalSumVector(double *d,int N)
 {
  int ierr = MPI_Allreduce(MPI_IN_PLACE,d,N,MPI_DOUBLE,MPI_SUM,communicator);
  assert(ierr==0);
 }
 // Basic Halo comms primitive
 void CartesianCommunicator::SendToRecvFrom(void *xmit,
 					   int dest,
 					   void *recv,
 					   int from,
 					   int bytes)
 {
  std::vector<CommsRequest_t> reqs(0);
  SendToRecvFromBegin(reqs,xmit,dest,recv,from,bytes);
  SendToRecvFromComplete(reqs);
 }
 void CartesianCommunicator::SendRecvPacket(void *xmit,
 					   void *recv,
 					   int sender,
 					   int receiver,
 					   int bytes)
 {
  MPI_Status stat;
  assert(sender != receiver);
  int tag = sender;
  if ( _processor == sender ) {
    MPI_Send(xmit, bytes, MPI_CHAR,receiver,tag,communicator);
  }
  if ( _processor == receiver ) { 
    MPI_Recv(recv, bytes, MPI_CHAR,sender,tag,communicator,&stat);
  }
 }
 // Basic Halo comms primitive
 void CartesianCommunicator::SendToRecvFromBegin(std::vector<CommsRequest_t> &list,
 						void *xmit,
 						int dest,
 						void *recv,
 						int from,
 						int bytes)
 {
 #if 0
  this->StencilBarrier();
  MPI_Request xrq;
  MPI_Request rrq;
  static int sequence;
  int ierr;
  int tag;
  int check;
  assert(dest != _processor);
  assert(from != _processor);
  int gdest = GroupRanks[dest];
  int gfrom = GroupRanks[from];
  int gme   = GroupRanks[_processor];
  sequence++;
  char *from_ptr = (char *)ShmCommBufs[ShmRank];
  int small = (bytes<MAX_MPI_SHM_BYTES);
  typedef uint64_t T;
  int words = bytes/sizeof(T);
  assert(((size_t)bytes &(sizeof(T)-1))==0);
  assert(gme == ShmRank);
  if ( small && (gdest !=MPI_UNDEFINED) ) {
    char *to_ptr   = (char *)ShmCommBufs[gdest];
    assert(gme != gdest);
    T *ip = (T *)xmit;
    T *op = (T *)to_ptr;
 PARALLEL_FOR_LOOP 
    for(int w=0;w<words;w++) {
      op[w]=ip[w];
    }
    bcopy(&_processor,&to_ptr[bytes],sizeof(_processor));
    bcopy(&  sequence,&to_ptr[bytes+4],sizeof(sequence));
  } else { 
    ierr =MPI_Isend(xmit, bytes, MPI_CHAR,dest,_processor,communicator,&xrq);
    assert(ierr==0);
    list.push_back(xrq);
  }
  this->StencilBarrier();
  if (small && (gfrom !=MPI_UNDEFINED) ) {
    T *ip = (T *)from_ptr;
    T *op = (T *)recv;
 PARALLEL_FOR_LOOP 
    for(int w=0;w<words;w++) {
      op[w]=ip[w];
    }
    bcopy(&from_ptr[bytes]  ,&tag  ,sizeof(tag));
    bcopy(&from_ptr[bytes+4],&check,sizeof(check));
    assert(check==sequence);
    assert(tag==from);
  } else { 
    ierr=MPI_Irecv(recv, bytes, MPI_CHAR,from,from,communicator,&rrq);
    assert(ierr==0);
    list.push_back(rrq);
  }
  this->StencilBarrier();
 #else
  MPI_Request xrq;
  MPI_Request rrq;
  int rank = _processor;
  int ierr;
  ierr =MPI_Isend(xmit, bytes, MPI_CHAR,dest,_processor,communicator,&xrq);
  ierr|=MPI_Irecv(recv, bytes, MPI_CHAR,from,from,communicator,&rrq);
  assert(ierr==0);
  list.push_back(xrq);
  list.push_back(rrq);
 #endif
 }
 void CartesianCommunicator::StencilSendToRecvFromBegin(std::vector<CommsRequest_t> &list,
 						       void *xmit,
 						       int dest,
 						       void *recv,
 						       int from,
 						       int bytes)
 {
  MPI_Request xrq;
  MPI_Request rrq;
  int ierr;
  assert(dest != _processor);
  assert(from != _processor);
  int gdest = GroupRanks[dest];
  int gfrom = GroupRanks[from];
  int gme   = GroupRanks[_processor];
  assert(gme == ShmRank);
  if ( gdest == MPI_UNDEFINED ) {
    ierr =MPI_Isend(xmit, bytes, MPI_CHAR,dest,_processor,communicator,&xrq);
    assert(ierr==0);
    list.push_back(xrq);
  }
  if ( gfrom ==MPI_UNDEFINED) {
    ierr=MPI_Irecv(recv, bytes, MPI_CHAR,from,from,communicator,&rrq);
    assert(ierr==0);
    list.push_back(rrq);
  }
 }
 void CartesianCommunicator::StencilSendToRecvFromComplete(std::vector<CommsRequest_t> &list)
 {
  SendToRecvFromComplete(list);
 }
 void CartesianCommunicator::StencilBarrier(void)
 {
  MPI_Win_sync (ShmWindow);   
  MPI_Barrier  (ShmComm);
  MPI_Win_sync (ShmWindow);   
 }
 void CartesianCommunicator::SendToRecvFromComplete(std::vector<CommsRequest_t> &list)
 {
  int nreq=list.size();
  if ( nreq == 0 ) return; // nothing was posted; &list[0] on an empty vector is undefined
  std::vector<MPI_Status> status(nreq);
  int ierr = MPI_Waitall(nreq,&list[0],&status[0]);
  assert(ierr==0);
 }
 void CartesianCommunicator::Barrier(void)
 {
  int ierr = MPI_Barrier(communicator);
  assert(ierr==0);
 }
 void CartesianCommunicator::Broadcast(int root,void* data, int bytes)
 {
  int ierr=MPI_Bcast(data,
 		     bytes,
 		     MPI_BYTE,
 		     root,
 		     communicator);
  assert(ierr==0);
 }
 void CartesianCommunicator::BroadcastWorld(int root,void* data, int bytes)
 {
  int ierr= MPI_Bcast(data,
 		      bytes,
 		      MPI_BYTE,
 		      root,
 		      communicator_world);
  assert(ierr==0);
 }
 }
--- a/lib/communicator/Communicator_mpi3_leader.cc
+++ b/lib/communicator/Communicator_mpi3_leader.cc
@@ -0,0 +1,874 @@
    /*************************************************************************************
    Grid physics library, www.github.com/paboyle/Grid 
    Source file: ./lib/communicator/Communicator_mpi3_leader.cc
    Copyright (C) 2015
 Author: Peter Boyle <paboyle@ph.ed.ac.uk>
    This program is free software; you can redistribute it and/or modify
    it under the terms of the GNU General Public License as published by
    the Free Software Foundation; either version 2 of the License, or
    (at your option) any later version.
    This program is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    GNU General Public License for more details.
    You should have received a copy of the GNU General Public License along
    with this program; if not, write to the Free Software Foundation, Inc.,
    51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
    See the full license in the file "LICENSE" in the top level distribution directory
    *************************************************************************************/
    /*  END LEGAL */
 #include "Grid.h"
 #include <mpi.h>
 ////////////////////////////////////////////////////////////////////////////////////////////////////////////////
 /// Workarounds:
 /// i) macOS does not implement unnamed semaphores, since they are an "optional" part of POSIX.
 ///    Darwin dispatch semaphores do not appear to work across processes.
 ///
 /// ii) OpenMPI under --mca shmem posix works with two squadrons per node; 
 ///     OpenMPI under default mca settings (I think --mca shmem mmap) on macOS makes two squadrons map the SAME
 ///     memory as each other, despite their living on different communicators. This appears to be a bug in OpenMPI.
 ///
 ////////////////////////////////////////////////////////////////////////////////////////////////////////////////
 #include <semaphore.h>
 #include <fcntl.h>
 #include <unistd.h>
 #include <limits.h>
 typedef sem_t *Grid_semaphore;
 #define SEM_INIT(S)      S = sem_open(sem_name,0,0600,0); assert ( S != SEM_FAILED );
 #define SEM_INIT_EXCL(S) sem_unlink(sem_name); S = sem_open(sem_name,O_CREAT|O_EXCL,0600,0); assert ( S != SEM_FAILED );
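 // Keep the semaphore call outside assert() so it still runs when NDEBUG is defined
 #define SEM_POST(S) do { int sem_rc_ = sem_post(S); assert ( sem_rc_ == 0 ); (void)sem_rc_; } while(0)
 #define SEM_WAIT(S) do { int sem_rc_ = sem_wait(S); assert ( sem_rc_ == 0 ); (void)sem_rc_; } while(0)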
 #include <sys/mman.h>
 namespace Grid {
 enum { COMMAND_ISEND, COMMAND_IRECV, COMMAND_WAITALL };
 struct Descriptor {
  uint64_t buf;
  size_t bytes;
  int rank;
  int tag;
  int command;
  MPI_Request request;
 };
 const int pool = 48;
 class SlaveState {
 public:
  volatile int head;
  volatile int start;
  volatile int tail;
  volatile Descriptor Descrs[pool];
 };
 class Slave {
 public:
  Grid_semaphore  sem_head;
  Grid_semaphore  sem_tail;
  SlaveState *state;
  MPI_Comm squadron;
  uint64_t     base;
  int universe_rank;
  int vertical_rank;
  char sem_name [NAME_MAX];
  ////////////////////////////////////////////////////////////
  // Descriptor circular pointers
  ////////////////////////////////////////////////////////////
  Slave() {};
  void Init(SlaveState * _state,MPI_Comm _squadron,int _universe_rank,int _vertical_rank);
  void SemInit(void) {
    sprintf(sem_name,"/Grid_mpi3_sem_head_%d",universe_rank);
    //    printf("SEM_NAME: %s \n",sem_name);
    SEM_INIT(sem_head);
    sprintf(sem_name,"/Grid_mpi3_sem_tail_%d",universe_rank);
    //    printf("SEM_NAME: %s \n",sem_name);
    SEM_INIT(sem_tail);
  }  
  void SemInitExcl(void) {
    sprintf(sem_name,"/Grid_mpi3_sem_head_%d",universe_rank);
    //    printf("SEM_INIT_EXCL: %s \n",sem_name);
    SEM_INIT_EXCL(sem_head);
    sprintf(sem_name,"/Grid_mpi3_sem_tail_%d",universe_rank);
    //    printf("SEM_INIT_EXCL: %s \n",sem_name);
    SEM_INIT_EXCL(sem_tail);
  }  
  void WakeUpDMA(void) { 
    SEM_POST(sem_head);
  };
  void WakeUpCompute(void) { 
    SEM_POST(sem_tail);
  };
  void WaitForCommand(void) { 
    SEM_WAIT(sem_head);
  };
  void WaitForComplete(void) { 
    SEM_WAIT(sem_tail);
  };
  void EventLoop (void) {
    //    std::cout<< " Entering event loop "<<std::endl;
    while(1){
      WaitForCommand();
      //      std::cout << "Getting command "<<std::endl;
      Event();
    }
  }
  int Event (void) ;
  uint64_t QueueCommand(int command,void *buf, int bytes, int hashtag, MPI_Comm comm,int u_rank) ;
  void WaitAll() {
    //    std::cout << "Queueing WAIT command  "<<std::endl;
    QueueCommand(COMMAND_WAITALL,0,0,0,squadron,0);
    //    std::cout << "Waking up DMA "<<std::endl;
    WakeUpDMA();
    //    std::cout << "Waiting from semaphore "<<std::endl;
    WaitForComplete();
    //    std::cout << "Checking FIFO is empty "<<std::endl;
    assert ( state->tail == state->head );
  }
 };
 ////////////////////////////////////////////////////////////////////////
 // One instance of a data mover.
 // Master and Slave must agree on location in shared memory
 ////////////////////////////////////////////////////////////////////////
 class MPIoffloadEngine { 
 public:
  static std::vector<Slave> Slaves;
  static int ShmSetup;
  static int UniverseRank;
  static int UniverseSize;
  static MPI_Comm communicator_universe;
  static MPI_Comm communicator_cached;
  static MPI_Comm HorizontalComm;
  static int HorizontalRank;
  static int HorizontalSize;
  static MPI_Comm VerticalComm;
  static MPI_Win  VerticalWindow; 
  static int VerticalSize;
  static int VerticalRank;
  static std::vector<void *> VerticalShmBufs;
  static std::vector<std::vector<int> > UniverseRanks;
  static std::vector<int> UserCommunicatorToWorldRanks; 
  static MPI_Group WorldGroup, CachedGroup;
  static void CommunicatorInit (MPI_Comm &communicator_world,
 				MPI_Comm &ShmComm,
 				void * &ShmCommBuf);
  static void MapCommRankToWorldRank(int &hashtag, int & comm_world_peer,int tag, MPI_Comm comm,int commrank);
  /////////////////////////////////////////////////////////
  // routines for master proc must handle any communicator
  /////////////////////////////////////////////////////////
  static void QueueSend(int slave,void *buf, int bytes, int tag, MPI_Comm comm,int rank) {
     //    std::cout<< " Queueing send  "<< bytes<< " slave "<< slave << " to comm "<<rank  <<std::endl;
    Slaves[slave].QueueCommand(COMMAND_ISEND,buf,bytes,tag,comm,rank);
    //    std::cout << "Queued send command to rank "<< rank<< " via "<<slave <<std::endl;
    Slaves[slave].WakeUpDMA();
    //    std::cout << "Waking up DMA "<< slave<<std::endl;
  };
  static void QueueRecv(int slave, void *buf, int bytes, int tag, MPI_Comm comm,int rank) {
    //    std::cout<< " Queueing recv "<< bytes<< " slave "<< slave << " from comm "<<rank  <<std::endl;
    Slaves[slave].QueueCommand(COMMAND_IRECV,buf,bytes,tag,comm,rank);
    //    std::cout << "Queued recv command from rank "<< rank<< " via "<<slave <<std::endl;
    Slaves[slave].WakeUpDMA();
    //    std::cout << "Waking up DMA "<< slave<<std::endl;
  };
  static void WaitAll() {
    for(int s=1;s<VerticalSize;s++) {
      //      std::cout << "Waiting for slave "<< s<<std::endl;
      Slaves[s].WaitAll();
    }
    //    std::cout << " Wait all Complete "<<std::endl;
  };
  static void GetWork(int nwork, int me, int & mywork, int & myoff,int units){
    int basework = nwork/units;
    int backfill = units-(nwork%units);
    if ( me >= units ) { 
      mywork = myoff = 0;
    } else { 
      mywork = (nwork+me)/units;
      myoff  = basework * me;
      if ( me > backfill ) 
 	myoff+= (me-backfill);
    }
    return;
  };
  static void QueueMultiplexedSend(void *buf, int bytes, int tag, MPI_Comm comm,int rank) {
    uint8_t * cbuf = (uint8_t *) buf;
    int mywork, myoff, procs;
    procs = VerticalSize-1;
    for(int s=0;s<procs;s++) {
      GetWork(bytes,s,mywork,myoff,procs);
      QueueSend(s+1,&cbuf[myoff],mywork,tag,comm,rank);
    }
  };
  static void QueueMultiplexedRecv(void *buf, int bytes, int tag, MPI_Comm comm,int rank) {
    uint8_t * cbuf = (uint8_t *) buf;
    int mywork, myoff, procs;
    procs = VerticalSize-1;
    for(int s=0;s<procs;s++) {
      GetWork(bytes,s,mywork,myoff,procs);
      QueueRecv(s+1,&cbuf[myoff],mywork,tag,comm,rank);
    }
  };
 };
 ///////////////////////////////////////////////////////////////////////////////////////////////////
 // Info that is set up once and is independent of the Cartesian layout
 ///////////////////////////////////////////////////////////////////////////////////////////////////
 std::vector<Slave> MPIoffloadEngine::Slaves;
 int MPIoffloadEngine::UniverseRank;
 int MPIoffloadEngine::UniverseSize;
 MPI_Comm  MPIoffloadEngine::communicator_universe;
 MPI_Comm  MPIoffloadEngine::communicator_cached;
 MPI_Group MPIoffloadEngine::WorldGroup;
 MPI_Group MPIoffloadEngine::CachedGroup;
 MPI_Comm MPIoffloadEngine::HorizontalComm;
 int      MPIoffloadEngine::HorizontalRank;
 int      MPIoffloadEngine::HorizontalSize;
 MPI_Comm MPIoffloadEngine::VerticalComm;
 int      MPIoffloadEngine::VerticalSize;
 int      MPIoffloadEngine::VerticalRank;
 MPI_Win  MPIoffloadEngine::VerticalWindow; 
 std::vector<void *>            MPIoffloadEngine::VerticalShmBufs;
 std::vector<std::vector<int> > MPIoffloadEngine::UniverseRanks;
 std::vector<int>               MPIoffloadEngine::UserCommunicatorToWorldRanks; 
 int MPIoffloadEngine::ShmSetup = 0;
 void MPIoffloadEngine::CommunicatorInit (MPI_Comm &communicator_world,
 					 MPI_Comm &ShmComm,
 					 void * &ShmCommBuf)
 {      
  int flag;
  assert(ShmSetup==0);  
  //////////////////////////////////////////////////////////////////////
  // Universe is all nodes prior to squadron grouping
  //////////////////////////////////////////////////////////////////////
  MPI_Comm_dup (MPI_COMM_WORLD,&communicator_universe);
  MPI_Comm_rank(communicator_universe,&UniverseRank);
  MPI_Comm_size(communicator_universe,&UniverseSize);
  /////////////////////////////////////////////////////////////////////
  // Split into groups that can share memory (Verticals)
  /////////////////////////////////////////////////////////////////////
 #undef MPI_SHARED_MEM_DEBUG
 #ifdef  MPI_SHARED_MEM_DEBUG
  MPI_Comm_split(communicator_universe,(UniverseRank/4),UniverseRank,&VerticalComm);
 #else 
  MPI_Comm_split_type(communicator_universe, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL,&VerticalComm);
 #endif
  MPI_Comm_rank(VerticalComm     ,&VerticalRank);
  MPI_Comm_size(VerticalComm     ,&VerticalSize);
  //////////////////////////////////////////////////////////////////////
  // Split into horizontal groups by rank in squadron
  //////////////////////////////////////////////////////////////////////
  MPI_Comm_split(communicator_universe,VerticalRank,UniverseRank,&HorizontalComm);
  MPI_Comm_rank(HorizontalComm,&HorizontalRank);
  MPI_Comm_size(HorizontalComm,&HorizontalSize);
  assert(HorizontalSize*VerticalSize==UniverseSize);
  ////////////////////////////////////////////////////////////////////////////////
  // What is my place in the world
  ////////////////////////////////////////////////////////////////////////////////
  int WorldRank=0;
  if(VerticalRank==0) WorldRank = HorizontalRank;
  int ierr=MPI_Allreduce(MPI_IN_PLACE,&WorldRank,1,MPI_INT,MPI_SUM,VerticalComm);
  assert(ierr==0);
  ////////////////////////////////////////////////////////////////////////////////
  // Where is the world in the universe?
  ////////////////////////////////////////////////////////////////////////////////
  UniverseRanks = std::vector<std::vector<int> >(HorizontalSize,std::vector<int>(VerticalSize,0));
  UniverseRanks[WorldRank][VerticalRank] = UniverseRank;
  for(int w=0;w<HorizontalSize;w++){
    ierr=MPI_Allreduce(MPI_IN_PLACE,&UniverseRanks[w][0],VerticalSize,MPI_INT,MPI_SUM,communicator_universe);
    assert(ierr==0);
  }
  //////////////////////////////////////////////////////////////////////////////////////////////////////////
  // allocate the shared window for our group, pass back Shm info to CartesianCommunicator
  //////////////////////////////////////////////////////////////////////////////////////////////////////////
  VerticalShmBufs.resize(VerticalSize);
 #undef MPI_SHARED_MEM
 #ifdef MPI_SHARED_MEM
  ierr = MPI_Win_allocate_shared(CartesianCommunicator::MAX_MPI_SHM_BYTES,1,MPI_INFO_NULL,VerticalComm,&ShmCommBuf,&VerticalWindow);
  ierr|= MPI_Win_lock_all (MPI_MODE_NOCHECK, VerticalWindow);
  assert(ierr==0);
  //  std::cout<<"SHM "<<ShmCommBuf<<std::endl;
  for(int r=0;r<VerticalSize;r++){
    MPI_Aint sz;
    int dsp_unit;
    MPI_Win_shared_query (VerticalWindow, r, &sz, &dsp_unit, &VerticalShmBufs[r]);
    //    std::cout<<"SHM "<<r<<" " <<VerticalShmBufs[r]<<std::endl;
  }
 #else 
  char shm_name [NAME_MAX];
  MPI_Barrier(VerticalComm);
  if ( VerticalRank == 0 ) {
    for(int r=0;r<VerticalSize;r++){
      size_t size = CartesianCommunicator::MAX_MPI_SHM_BYTES;
      if ( r>0 ) size = sizeof(SlaveState);
      sprintf(shm_name,"/Grid_mpi3_shm_%d_%d",WorldRank,r);
      shm_unlink(shm_name);
      int fd=shm_open(shm_name,O_RDWR|O_CREAT,0600);
      if ( fd < 0 ) {
 	perror("failed shm_open");
 	assert(0);
      }
      ftruncate(fd, size);
      VerticalShmBufs[r] = mmap(NULL,size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
      if ( VerticalShmBufs[r] == MAP_FAILED ) { 
 	perror("failed mmap");
 	assert(0);
      }
      uint64_t * check = (uint64_t *) VerticalShmBufs[r];
      check[0] = WorldRank;
      check[1] = r;
      //      std::cout<<"SHM "<<r<<" " <<VerticalShmBufs[r]<<std::endl;
    }
  }
  MPI_Barrier(VerticalComm);
  if ( VerticalRank != 0 ) { 
  for(int r=0;r<VerticalSize;r++){
    size_t size = CartesianCommunicator::MAX_MPI_SHM_BYTES ;
    if ( r>0 ) size = sizeof(SlaveState);
    sprintf(shm_name,"/Grid_mpi3_shm_%d_%d",WorldRank,r);
    int fd=shm_open(shm_name,O_RDWR|O_CREAT,0600);
    if ( fd<0 ) {
      perror("failed shm_open");
      assert(0);
    }
    VerticalShmBufs[r] = mmap(NULL,size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if ( VerticalShmBufs[r] == MAP_FAILED ) { 
      perror("failed mmap");
      assert(0);
    }
    uint64_t * check = (uint64_t *) VerticalShmBufs[r];
    assert(check[0]== WorldRank);
    assert(check[1]== r);
    std::cerr<<"SHM "<<r<<" " <<VerticalShmBufs[r]<<std::endl;
  }
  }
 #endif
  MPI_Barrier(VerticalComm);
  //////////////////////////////////////////////////////////////////////
  // Map the rank of each node's leader in the new world to its
  // rank in this vertical plane's horizontal communicator
  //////////////////////////////////////////////////////////////////////
  communicator_world = HorizontalComm;
  ShmComm            = VerticalComm;
  ShmCommBuf         = VerticalShmBufs[0];
  MPI_Comm_group (communicator_world, &WorldGroup); 
  ///////////////////////////////////////////////////////////
  // Start the slave data movers
  ///////////////////////////////////////////////////////////
  if ( VerticalRank != 0 ) {
    Slave indentured;
    indentured.Init( (SlaveState *) VerticalShmBufs[VerticalRank], VerticalComm, UniverseRank,VerticalRank);
    indentured.SemInitExcl();// init semaphore in shared memory
    MPI_Barrier(VerticalComm);
    MPI_Barrier(VerticalComm);
    indentured.EventLoop();
    assert(0);
  } else {
    Slaves.resize(VerticalSize);
    for(int i=1;i<VerticalSize;i++){
      Slaves[i].Init((SlaveState *)VerticalShmBufs[i],VerticalComm, UniverseRanks[HorizontalRank][i],i);
    }
    MPI_Barrier(VerticalComm);
    for(int i=1;i<VerticalSize;i++){
      Slaves[i].SemInit();// init semaphore in shared memory
    }
    MPI_Barrier(VerticalComm);
  }
  ///////////////////////////////////////////////////////////
  // Verbose for now
  ///////////////////////////////////////////////////////////
  ShmSetup=1;
  if (UniverseRank == 0){
    std::cout<<GridLogMessage << "Grid MPI-3 configuration: detected ";
    std::cout<<UniverseSize   << " ranks on " ;
    std::cout<<HorizontalSize << " nodes with " ;
    std::cout<<VerticalSize   << " ranks per node"<<std::endl;
    std::cout<<GridLogMessage << "Grid MPI-3 configuration: using one lead process per node " << std::endl;
    std::cout<<GridLogMessage << "Grid MPI-3 configuration: reduced communicator has size " << HorizontalSize << std::endl;
    for(int g=0;g<HorizontalSize;g++){
      std::cout<<GridLogMessage<<" Node "<<g<<" led by MPI rank "<< UniverseRanks[g][0]<<std::endl;
    }
    for(int g=0;g<HorizontalSize;g++){
      std::cout<<GridLogMessage<<" { ";
      for(int s=0;s<VerticalSize;s++){
 	std::cout<< UniverseRanks[g][s];
 	if ( s<VerticalSize-1 ) {
 	  std::cout<<",";
 	}
      }
      std::cout<<" } "<<std::endl;
    }
  }
 };
  ///////////////////////////////////////////////////////////////////////////////////////////////
  // Map the communicator into communicator_world, and find the neighbour.
  // Cache the mappings; cache size is 1.
  ///////////////////////////////////////////////////////////////////////////////////////////////
 void MPIoffloadEngine::MapCommRankToWorldRank(int &hashtag, int & comm_world_peer,int tag, MPI_Comm comm,int rank) {
  if ( comm == HorizontalComm ) {
    comm_world_peer = rank;
    //    std::cout << " MapCommRankToWorldRank  horiz " <<rank<<"->"<<comm_world_peer<<std::endl;
  } else if ( comm == communicator_cached ) {
    comm_world_peer = UserCommunicatorToWorldRanks[rank];
    //    std::cout << " MapCommRankToWorldRank  cached " <<rank<<"->"<<comm_world_peer<<std::endl;
  } else { 
    int size;
    MPI_Comm_size(comm,&size);
    UserCommunicatorToWorldRanks.resize(size);
    std::vector<int> cached_ranks(size); 
    for(int r=0;r<size;r++) {
      cached_ranks[r]=r;
    }
    communicator_cached=comm;
    MPI_Comm_group(communicator_cached, &CachedGroup);
    MPI_Group_translate_ranks(CachedGroup,size,&cached_ranks[0],WorldGroup, &UserCommunicatorToWorldRanks[0]); 
    comm_world_peer = UserCommunicatorToWorldRanks[rank];
    //    std::cout << " MapCommRankToWorldRank  cache miss " <<rank<<"->"<<comm_world_peer<<std::endl;
    assert(comm_world_peer != MPI_UNDEFINED);
  }
  assert( (tag & (~0xFFFFL)) ==0); 
  uint64_t icomm = (uint64_t)comm;
  int comm_hash = ((icomm>>0 )&0xFFFF)^((icomm>>16)&0xFFFF)
                ^ ((icomm>>32)&0xFFFF)^((icomm>>48)&0xFFFF);
  //  hashtag = (comm_hash<<15) | tag;      
  hashtag = tag;      
 };
 void Slave::Init(SlaveState * _state,MPI_Comm _squadron,int _universe_rank,int _vertical_rank)
 {
  squadron=_squadron;
  universe_rank=_universe_rank;
  vertical_rank=_vertical_rank;
  state   =_state;
  //  std::cout << "state "<<_state<<" comm "<<_squadron<<" universe_rank"<<universe_rank <<std::endl;
  state->head = state->tail = state->start = 0;
  base = (uint64_t)MPIoffloadEngine::VerticalShmBufs[0];
  int rank; MPI_Comm_rank(_squadron,&rank);
 }
 #define PERI_PLUS(A) ( ((A)+1)%pool )
 int Slave::Event (void) {
  static int tail_last;
  static int head_last;
  static int start_last;
  int ierr;
  ////////////////////////////////////////////////////
  // Try to advance the start pointers
  ////////////////////////////////////////////////////
  int s=state->start;
  if ( s != state->head ) {
    switch ( state->Descrs[s].command ) {
    case COMMAND_ISEND:
      /*
            std::cout<< " Send "<<s << " ptr "<< state<<" "<< state->Descrs[s].buf<< "["<<state->Descrs[s].bytes<<"]"
      	       << " to " << state->Descrs[s].rank<< " tag" << state->Descrs[s].tag
       << " Comm " << MPIoffloadEngine::communicator_universe<< " me " <<universe_rank<< std::endl;
      */
      ierr = MPI_Isend((void *)(state->Descrs[s].buf+base), 
 		       state->Descrs[s].bytes, 
 		       MPI_CHAR,
 		       state->Descrs[s].rank,
 		       state->Descrs[s].tag,
 		       MPIoffloadEngine::communicator_universe,
 		       (MPI_Request *)&state->Descrs[s].request);
      assert(ierr==0);
      state->start = PERI_PLUS(s);
      return 1;
      break;
    case COMMAND_IRECV:
      /*
      std::cout<< " Recv "<<s << " ptr "<< state<<" "<< state->Descrs[s].buf<< "["<<state->Descrs[s].bytes<<"]"
 	       << " from " << state->Descrs[s].rank<< " tag" << state->Descrs[s].tag
 	       << " Comm " << MPIoffloadEngine::communicator_universe<< " me "<< universe_rank<< std::endl;
      */
      ierr=MPI_Irecv((void *)(state->Descrs[s].buf+base), 
 		     state->Descrs[s].bytes, 
 		     MPI_CHAR,
 		     state->Descrs[s].rank,
 		     state->Descrs[s].tag,
 		     MPIoffloadEngine::communicator_universe,
 		     (MPI_Request *)&state->Descrs[s].request);
      //      std::cout<< " Request is "<<state->Descrs[s].request<<std::endl;
      //      std::cout<< " Request0 is "<<state->Descrs[0].request<<std::endl;
      assert(ierr==0);
      state->start = PERI_PLUS(s);
      return 1;
      break;
    case COMMAND_WAITALL:
      for(int t=state->tail;t!=s; t=PERI_PLUS(t) ){
 	MPI_Wait((MPI_Request *)&state->Descrs[t].request,MPI_STATUS_IGNORE);
      };
      s=PERI_PLUS(s);
      state->start = s;
      state->tail  = s;
      WakeUpCompute();
      return 1;
      break;
    default:
      assert(0);
      break;
    }
  }
  return 0;
 }
  //////////////////////////////////////////////////////////////////////////////
  // External interaction with the queue
  //////////////////////////////////////////////////////////////////////////////
 uint64_t Slave::QueueCommand(int command,void *buf, int bytes, int tag, MPI_Comm comm,int commrank) 
 {
  /////////////////////////////////////////
  // Spin while the FIFO is full
  /////////////////////////////////////////
  int head =state->head;
  int next = PERI_PLUS(head);
  // Set up descriptor
  int worldrank;
  int hashtag;
  MPI_Comm    communicator;
  MPI_Request request;
  MPIoffloadEngine::MapCommRankToWorldRank(hashtag,worldrank,tag,comm,commrank);
  uint64_t relative= (uint64_t)buf - base;
  state->Descrs[head].buf    = relative;
  state->Descrs[head].bytes  = bytes;
  state->Descrs[head].rank   = MPIoffloadEngine::UniverseRanks[worldrank][vertical_rank];
  state->Descrs[head].tag    = hashtag;
  state->Descrs[head].command= command;
  /*  
  if ( command == COMMAND_ISEND ) { 
  std::cout << "QueueSend from "<< universe_rank <<" to commrank " << commrank 
            << " to worldrank " << worldrank <<std::endl;
  std::cout << " via VerticalRank "<< vertical_rank <<" to universerank " << MPIoffloadEngine::UniverseRanks[worldrank][vertical_rank]<<std::endl;
  std::cout << " QueueCommand "<<buf<<"["<<bytes<<"]" << std::endl;
  } 
  if ( command == COMMAND_IRECV ) { 
  std::cout << "QueueRecv on "<< universe_rank <<" from commrank " << commrank 
            << " from worldrank " << worldrank <<std::endl;
  std::cout << " via VerticalRank "<< vertical_rank <<" from universerank " << MPIoffloadEngine::UniverseRanks[worldrank][vertical_rank]<<std::endl;
  std::cout << " QueueSend "<<buf<<"["<<bytes<<"]" << std::endl;
  } 
  */
  // Block until FIFO has space
  while( state->tail==next );
  // NB: weakly ordered architectures would need a memory fence here
  // Advance pointer
  state->head = next;
  return 0;
 }
 ///////////////////////////////////////////////////////////////////////////////////////////////////
 // Info that is set up once and is independent of the Cartesian layout
 ///////////////////////////////////////////////////////////////////////////////////////////////////
 MPI_Comm CartesianCommunicator::communicator_world;
 void CartesianCommunicator::Init(int *argc, char ***argv) 
 {
  int flag;
  MPI_Initialized(&flag); // needed to coexist with other libs apparently
  if ( !flag ) {
    MPI_Init(argc,argv);
  }
  communicator_world = MPI_COMM_WORLD;
  MPI_Comm ShmComm;
  MPIoffloadEngine::CommunicatorInit (communicator_world,ShmComm,ShmCommBuf);
 }
 void CartesianCommunicator::ShiftedRanks(int dim,int shift,int &source,int &dest)
 {
  int ierr=MPI_Cart_shift(communicator,dim,shift,&source,&dest);
  assert(ierr==0);
 }
 int CartesianCommunicator::RankFromProcessorCoor(std::vector<int> &coor)
 {
  int rank;
  int ierr=MPI_Cart_rank  (communicator, &coor[0], &rank);
  assert(ierr==0);
  return rank;
 }
 void  CartesianCommunicator::ProcessorCoorFromRank(int rank, std::vector<int> &coor)
 {
  coor.resize(_ndimension);
  int ierr=MPI_Cart_coords  (communicator, rank, _ndimension,&coor[0]);
  assert(ierr==0);
 }
 CartesianCommunicator::CartesianCommunicator(const std::vector<int> &processors)
 { 
  _ndimension = processors.size();
  std::vector<int> periodic(_ndimension,1);
  _Nprocessors=1;
  _processors = processors;
  for(int i=0;i<_ndimension;i++){
    _Nprocessors*=_processors[i];
  }
  int Size; 
  MPI_Comm_size(communicator_world,&Size);
  assert(Size==_Nprocessors);
  _processor_coor.resize(_ndimension);
  MPI_Cart_create(communicator_world, _ndimension,&_processors[0],&periodic[0],1,&communicator);
  MPI_Comm_rank  (communicator,&_processor);
  MPI_Cart_coords(communicator,_processor,_ndimension,&_processor_coor[0]);
 };
 void CartesianCommunicator::GlobalSum(uint32_t &u){
  int ierr=MPI_Allreduce(MPI_IN_PLACE,&u,1,MPI_UINT32_T,MPI_SUM,communicator);
  assert(ierr==0);
 }
 void CartesianCommunicator::GlobalSum(uint64_t &u){
  int ierr=MPI_Allreduce(MPI_IN_PLACE,&u,1,MPI_UINT64_T,MPI_SUM,communicator);
  assert(ierr==0);
 }
 void CartesianCommunicator::GlobalSum(float &f){
  int ierr=MPI_Allreduce(MPI_IN_PLACE,&f,1,MPI_FLOAT,MPI_SUM,communicator);
  assert(ierr==0);
 }
 void CartesianCommunicator::GlobalSumVector(float *f,int N)
 {
  int ierr=MPI_Allreduce(MPI_IN_PLACE,f,N,MPI_FLOAT,MPI_SUM,communicator);
  assert(ierr==0);
 }
 void CartesianCommunicator::GlobalSum(double &d)
 {
  int ierr = MPI_Allreduce(MPI_IN_PLACE,&d,1,MPI_DOUBLE,MPI_SUM,communicator);
  assert(ierr==0);
 }
 void CartesianCommunicator::GlobalSumVector(double *d,int N)
 {
  int ierr = MPI_Allreduce(MPI_IN_PLACE,d,N,MPI_DOUBLE,MPI_SUM,communicator);
  assert(ierr==0);
 }
 // Basic Halo comms primitive
 void CartesianCommunicator::SendToRecvFrom(void *xmit,
 					   int dest,
 					   void *recv,
 					   int from,
 					   int bytes)
 {
  std::vector<CommsRequest_t> reqs(0);
  SendToRecvFromBegin(reqs,xmit,dest,recv,from,bytes);
  SendToRecvFromComplete(reqs);
 }
 void CartesianCommunicator::SendRecvPacket(void *xmit,
 					   void *recv,
 					   int sender,
 					   int receiver,
 					   int bytes)
 {
  MPI_Status stat;
  assert(sender != receiver);
  int tag = sender;
  if ( _processor == sender ) {
    MPI_Send(xmit, bytes, MPI_CHAR,receiver,tag,communicator);
  }
  if ( _processor == receiver ) { 
    MPI_Recv(recv, bytes, MPI_CHAR,sender,tag,communicator,&stat);
  }
 }
 // Basic Halo comms primitive
 void CartesianCommunicator::SendToRecvFromBegin(std::vector<CommsRequest_t> &list,
 						void *xmit,
 						int dest,
 						void *recv,
 						int from,
 						int bytes)
 {
  MPI_Request xrq;
  MPI_Request rrq;
  int rank = _processor;
  int ierr;
  ierr =MPI_Isend(xmit, bytes, MPI_CHAR,dest,_processor,communicator,&xrq);
  ierr|=MPI_Irecv(recv, bytes, MPI_CHAR,from,from,communicator,&rrq);
  assert(ierr==0);
  list.push_back(xrq);
  list.push_back(rrq);
 }
 void CartesianCommunicator::StencilSendToRecvFromBegin(std::vector<CommsRequest_t> &list,
 						       void *xmit,
 						       int dest,
 						       void *recv,
 						       int from,
 						       int bytes)
 {
  uint64_t xmit_i = (uint64_t) xmit;
  uint64_t recv_i = (uint64_t) recv;
  uint64_t shm    = (uint64_t) ShmCommBuf;
  // assert xmit and recv lie in shared memory region
  assert( (xmit_i >= shm) && (xmit_i+bytes <= shm+MAX_MPI_SHM_BYTES) );
  assert( (recv_i >= shm) && (recv_i+bytes <= shm+MAX_MPI_SHM_BYTES) );
  assert(from!=_processor);
  assert(dest!=_processor);
  MPIoffloadEngine::QueueMultiplexedSend(xmit,bytes,_processor,communicator,dest);
  MPIoffloadEngine::QueueMultiplexedRecv(recv,bytes,from,communicator,from);
 }
 void CartesianCommunicator::StencilSendToRecvFromComplete(std::vector<CommsRequest_t> &list)
 {
  MPIoffloadEngine::WaitAll();
 }
 void CartesianCommunicator::StencilBarrier(void)
 {
 }
 void CartesianCommunicator::SendToRecvFromComplete(std::vector<CommsRequest_t> &list)
 {
  int nreq=list.size();
  std::vector<MPI_Status> status(nreq);
  int ierr = MPI_Waitall(nreq,&list[0],&status[0]);
  assert(ierr==0);
 }
 void CartesianCommunicator::Barrier(void)
 {
  int ierr = MPI_Barrier(communicator);
  assert(ierr==0);
 }
 void CartesianCommunicator::Broadcast(int root,void* data, int bytes)
 {
  int ierr=MPI_Bcast(data,
 		     bytes,
 		     MPI_BYTE,
 		     root,
 		     communicator);
  assert(ierr==0);
 }
 void CartesianCommunicator::BroadcastWorld(int root,void* data, int bytes)
 {
  int ierr= MPI_Bcast(data,
 		      bytes,
 		      MPI_BYTE,
 		      root,
 		      communicator_world);
  assert(ierr==0);
 }
 void *CartesianCommunicator::ShmBufferSelf(void) { return ShmCommBuf; }
 void *CartesianCommunicator::ShmBuffer(int rank) {
  return NULL;
 }
 void *CartesianCommunicator::ShmBufferTranslate(int rank,void * local_p) { 
  return NULL;
 }
 };
--- a/lib/communicator/Communicator_none.cc
+++ b/lib/communicator/Communicator_none.cc
@@ -28,12 +28,15 @@ Author: Peter Boyle <paboyle@ph.ed.ac.uk>
 #include "Grid.h"
 namespace Grid {
 ///////////////////////////////////////////////////////////////////////////////////////////////////
 // Info that is set up once and is independent of the Cartesian layout
 ///////////////////////////////////////////////////////////////////////////////////////////////////
 void CartesianCommunicator::Init(int *argc, char *** argv)
 {
  ShmInitGeneric();
 }
 int Rank(void ){ return 0; };
 CartesianCommunicator::CartesianCommunicator(const std::vector<int> &processors)
 {
  _processors = processors;
@@ -89,30 +92,17 @@ void CartesianCommunicator::SendToRecvFromComplete(std::vector<CommsRequest_t> &
  assert(0);
 }
-void CartesianCommunicator::Barrier(void)
-{
-}
-
-void CartesianCommunicator::Broadcast(int root,void* data, int bytes)
-{
+int  CartesianCommunicator::RankWorld(void){return 0;}
+void CartesianCommunicator::Barrier(void){}
+void CartesianCommunicator::Broadcast(int root,void* data, int bytes) {}
+void CartesianCommunicator::BroadcastWorld(int root,void* data, int bytes) { }
+int  CartesianCommunicator::RankFromProcessorCoor(std::vector<int> &coor) {  return 0;}
+void CartesianCommunicator::ProcessorCoorFromRank(int rank, std::vector<int> &coor){ coor = _processor_coor ;}
 }
 void CartesianCommunicator::BroadcastWorld(int root,void* data, int bytes)
 {
 }
 void CartesianCommunicator::ShiftedRanks(int dim,int shift,int &source,int &dest)
 {
  source =0;
  dest=0;
 }
 int CartesianCommunicator::RankFromProcessorCoor(std::vector<int> &coor)
 {
  return 0;
 }
 void  CartesianCommunicator::ProcessorCoorFromRank(int rank, std::vector<int> &coor)
 {
 }
 }
--- a/lib/communicator/Communicator_shmem.cc
+++ b/lib/communicator/Communicator_shmem.cc
@@ -39,14 +39,24 @@ namespace Grid {
    BACKTRACEFILE();		   \
  }\
 }
-int Rank(void) {
-  return shmem_my_pe();
-}
+
+
+///////////////////////////////////////////////////////////////////////////////////////////////////
 // Info that is setup once and indept of cartesian layout
 ///////////////////////////////////////////////////////////////////////////////////////////////////
 typedef struct HandShake_t { 
  uint64_t seq_local;
  uint64_t seq_remote;
 } HandShake;
 std::array<long,_SHMEM_REDUCE_SYNC_SIZE> make_psync_init(void) {
  array<long,_SHMEM_REDUCE_SYNC_SIZE> ret;
  ret.fill(SHMEM_SYNC_VALUE);
  return ret;
 }
 static std::array<long,_SHMEM_REDUCE_SYNC_SIZE> psync_init = make_psync_init();
 static Vector< HandShake > XConnections;
 static Vector< HandShake > RConnections;
@@ -61,7 +71,9 @@ void CartesianCommunicator::Init(int *argc, char ***argv) {
    RConnections[pe].seq_remote= 0;
  }
  shmem_barrier_all();
  ShmInitGeneric();
 }
 CartesianCommunicator::CartesianCommunicator(const std::vector<int> &processors)
 {
  _ndimension = processors.size();
@@ -89,7 +101,7 @@ void CartesianCommunicator::GlobalSum(uint32_t &u){
  static long long source ;
  static long long dest   ;
  static long long llwrk[_SHMEM_REDUCE_MIN_WRKDATA_SIZE];
-  static long      psync[_SHMEM_REDUCE_SYNC_SIZE];
+  static std::array<long,_SHMEM_REDUCE_SYNC_SIZE> psync =  psync_init;
  //  int nreduce=1;
  //  int pestart=0;
@@ -105,7 +117,7 @@ void CartesianCommunicator::GlobalSum(uint64_t &u){
  static long long source ;
  static long long dest   ;
  static long long llwrk[_SHMEM_REDUCE_MIN_WRKDATA_SIZE];
-  static long      psync[_SHMEM_REDUCE_SYNC_SIZE];
+  static std::array<long,_SHMEM_REDUCE_SYNC_SIZE> psync =  psync_init;
  //  int nreduce=1;
  //  int pestart=0;
@@ -121,7 +133,7 @@ void CartesianCommunicator::GlobalSum(float &f){
  static float source ;
  static float dest   ;
  static float llwrk[_SHMEM_REDUCE_MIN_WRKDATA_SIZE];
-  static long  psync[_SHMEM_REDUCE_SYNC_SIZE];
+  static std::array<long,_SHMEM_REDUCE_SYNC_SIZE> psync =  psync_init;
  source = f;
  dest   =0.0;
@@ -133,7 +145,7 @@ void CartesianCommunicator::GlobalSumVector(float *f,int N)
  static float source ;
  static float dest   = 0 ;
  static float llwrk[_SHMEM_REDUCE_MIN_WRKDATA_SIZE];
-  static long  psync[_SHMEM_REDUCE_SYNC_SIZE];
+  static std::array<long,_SHMEM_REDUCE_SYNC_SIZE> psync =  psync_init;
  if ( shmem_addr_accessible(f,_processor)  ){
    shmem_float_sum_to_all(f,f,N,0,0,_Nprocessors,llwrk,psync);
@@ -152,7 +164,7 @@ void CartesianCommunicator::GlobalSum(double &d)
  static double source;
  static double dest  ;
  static double llwrk[_SHMEM_REDUCE_MIN_WRKDATA_SIZE];
-  static long  psync[_SHMEM_REDUCE_SYNC_SIZE];
+  static std::array<long,_SHMEM_REDUCE_SYNC_SIZE> psync =  psync_init;
  source = d;
  dest   = 0;
@@ -164,7 +176,8 @@ void CartesianCommunicator::GlobalSumVector(double *d,int N)
  static double source ;
  static double dest   ;
  static double llwrk[_SHMEM_REDUCE_MIN_WRKDATA_SIZE];
-  static long  psync[_SHMEM_REDUCE_SYNC_SIZE];
+  static std::array<long,_SHMEM_REDUCE_SYNC_SIZE> psync =  psync_init;
  if ( shmem_addr_accessible(d,_processor)  ){
    shmem_double_sum_to_all(d,d,N,0,0,_Nprocessors,llwrk,psync);
@@ -230,12 +243,9 @@ void CartesianCommunicator::SendRecvPacket(void *xmit,
  if ( _processor == sender ) {
    printf("Sender SHMEM pt2pt %d -> %d\n",sender,receiver);
    // Check he has posted a receive
    while(SendSeq->seq_remote == SendSeq->seq_local);
    printf("Sender receive %d posted\n",sender,receiver);
    // Advance our send count
    seq = ++(SendSeq->seq_local);
@@ -244,26 +254,19 @@ void CartesianCommunicator::SendRecvPacket(void *xmit,
    shmem_putmem(recv,xmit,bytes,receiver);
    shmem_fence();
    printf("Sender sent payload %d\n",seq);
    //Notify him we're done
    shmem_putmem((void *)&(RecvSeq->seq_remote),&seq,sizeof(seq),receiver);
    shmem_fence();
    printf("Sender ringing door bell  %d\n",seq);
  }
  if ( _processor == receiver ) {
    printf("Receiver SHMEM pt2pt %d->%d\n",sender,receiver);
    // Post a receive
    seq = ++(RecvSeq->seq_local);
    shmem_putmem((void *)&(SendSeq->seq_remote),&seq,sizeof(seq),sender);
    printf("Receiver Opening letter box %d\n",seq);
    // Now wait until he has advanced our reception counter
    while(RecvSeq->seq_remote != RecvSeq->seq_local);
    printf("Receiver Got the mail %d\n",seq);
  }
 }
@@ -291,7 +294,7 @@ void CartesianCommunicator::Barrier(void)
 }
 void CartesianCommunicator::Broadcast(int root,void* data, int bytes)
 {
-  static long  psync[_SHMEM_REDUCE_SYNC_SIZE];
+  static std::array<long,_SHMEM_REDUCE_SYNC_SIZE> psync =  psync_init;
  static uint32_t word;
  uint32_t *array = (uint32_t *) data;
  assert( (bytes % 4)==0);
@@ -314,7 +317,7 @@ void CartesianCommunicator::Broadcast(int root,void* data, int bytes)
 }
 void CartesianCommunicator::BroadcastWorld(int root,void* data, int bytes)
 {
-  static long  psync[_SHMEM_REDUCE_SYNC_SIZE];
+  static std::array<long,_SHMEM_REDUCE_SYNC_SIZE> psync =  psync_init;
  static uint32_t word;
  uint32_t *array = (uint32_t *) data;
  assert( (bytes % 4)==0);
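Review note: the hunks in this file replace zero-initialised `static long psync[...]` work arrays with `std::array` copies of a pre-filled `psync_init`. The pattern can be sketched without SHMEM as below. This is a minimal illustration, not the library's code: `kSyncSize`, `kSyncValue`, `make_sync_init`, `sync_init`, `check_sync_init` and `first_sync_word` are hypothetical names, with the two constants standing in for `_SHMEM_REDUCE_SYNC_SIZE` and `SHMEM_SYNC_VALUE`.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins for _SHMEM_REDUCE_SYNC_SIZE and SHMEM_SYNC_VALUE.
constexpr std::size_t kSyncSize = 8;
constexpr long kSyncValue = -1L;

// Build a work array with every element set to the sync value.
std::array<long, kSyncSize> make_sync_init() {
  std::array<long, kSyncSize> ret;
  ret.fill(kSyncValue);
  return ret;
}

// One shared, pre-filled template, initialised before main() runs.
static const std::array<long, kSyncSize> sync_init = make_sync_init();

// Sanity check: every element of the template carries the sync value.
void check_sync_init() {
  for (long v : sync_init) assert(v == kSyncValue);
}

// Each call site keeps its own function-local static copy,
// initialised exactly once from the shared template.
long first_sync_word() {
  static std::array<long, kSyncSize> psync = sync_init;
  return psync[0];
}
```

The function-local static keeps one persistent array per call site, as the original `static long psync[...]` did, while starting from the required fill value rather than from zero.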
--- a/lib/cshift/Cshift_common.h
+++ b/lib/cshift/Cshift_common.h
@@ -1,3 +1,4 @@
    /*************************************************************************************
    Grid physics library, www.github.com/paboyle/Grid 
@@ -44,7 +45,7 @@ public:
 // Gather for when there is no need to SIMD split with compression
 ///////////////////////////////////////////////////////////////////
 template<class vobj,class cobj,class compressor> void 
-Gather_plane_simple (const Lattice<vobj> &rhs,std::vector<cobj,alignedAllocator<cobj> > &buffer,int dimension,int plane,int cbmask,compressor &compress, int off=0)
+Gather_plane_simple (const Lattice<vobj> &rhs,commVector<cobj> &buffer,int dimension,int plane,int cbmask,compressor &compress, int off=0)
 {
  int rd = rhs._grid->_rdimensions[dimension];
@@ -56,6 +57,7 @@ Gather_plane_simple (const Lattice<vobj> &rhs,std::vector<cobj,alignedAllocator<
  int e1=rhs._grid->_slice_nblock[dimension];
  int e2=rhs._grid->_slice_block[dimension];
  int stride=rhs._grid->_slice_stride[dimension];
  if ( cbmask == 0x3 ) { 
 PARALLEL_NESTED_LOOP2
@@ -68,15 +70,20 @@ PARALLEL_NESTED_LOOP2
    }
  } else { 
     int bo=0;
     std::vector<std::pair<int,int> > table;
     for(int n=0;n<e1;n++){
       for(int b=0;b<e2;b++){
 	 int o  = n*stride;
-	 int ocb=1<<rhs._grid->CheckerBoardFromOindex(o+b);// Could easily be a table lookup
+	 int ocb=1<<rhs._grid->CheckerBoardFromOindexTable(o+b);
 	 if ( ocb &cbmask ) {
-	   buffer[off+bo++]=compress(rhs._odata[so+o+b]);
+	   table.push_back(std::pair<int,int> (bo++,o+b));
 	 }
       }
     }
 PARALLEL_FOR_LOOP     
     for(int i=0;i<table.size();i++){
       buffer[off+table[i].first]=compress(rhs._odata[so+table[i].second]);
     }
  }
 }
@@ -107,6 +114,7 @@ PARALLEL_NESTED_LOOP2
 	int o      =   n*n1;
 	int offset = b+n*n2;
 	cobj temp =compress(rhs._odata[so+o+b]);
 	extract<cobj>(temp,pointers,offset);
      }
@@ -114,6 +122,7 @@ PARALLEL_NESTED_LOOP2
  } else { 
    assert(0); //Fixme think this is buggy
    for(int n=0;n<e1;n++){
      for(int b=0;b<e2;b++){
 	int o=n*rhs._grid->_slice_stride[dimension];
@@ -132,7 +141,7 @@ PARALLEL_NESTED_LOOP2
 //////////////////////////////////////////////////////
 // Gather for when there is no need to SIMD split
 //////////////////////////////////////////////////////
-template<class vobj> void Gather_plane_simple (const Lattice<vobj> &rhs,std::vector<vobj,alignedAllocator<vobj> > &buffer,             int dimension,int plane,int cbmask)
+template<class vobj> void Gather_plane_simple (const Lattice<vobj> &rhs,commVector<vobj> &buffer, int dimension,int plane,int cbmask)
 {
  SimpleCompressor<vobj> dontcompress;
  Gather_plane_simple (rhs,buffer,dimension,plane,cbmask,dontcompress);
@@ -150,7 +159,7 @@ template<class vobj> void Gather_plane_extract(const Lattice<vobj> &rhs,std::vec
 //////////////////////////////////////////////////////
 // Scatter for when there is no need to SIMD split
 //////////////////////////////////////////////////////
-template<class vobj> void Scatter_plane_simple (Lattice<vobj> &rhs,std::vector<vobj,alignedAllocator<vobj> > &buffer, int dimension,int plane,int cbmask)
+template<class vobj> void Scatter_plane_simple (Lattice<vobj> &rhs,commVector<vobj> &buffer, int dimension,int plane,int cbmask)
 {
  int rd = rhs._grid->_rdimensions[dimension];
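Review note: the checkerboarded branch of `Gather_plane_simple` now builds a table of (buffer offset, site offset) pairs serially and then copies in a `PARALLEL_FOR_LOOP`. A self-contained sketch of that two-phase idea follows. It assumes a toy parity rule: `parity` is a hypothetical stand-in for `CheckerBoardFromOindexTable`, `gather_masked` is an illustrative name, and plain `double` replaces the lattice objects and the compressor.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical checkerboard rule: even sites are 0, odd sites are 1.
inline int parity(int i) { return i & 1; }

// Gather the elements of src whose checkerboard bit is selected by cbmask.
std::vector<double> gather_masked(const std::vector<double>& src, int cbmask) {
  assert(cbmask >= 0 && cbmask <= 0x3);

  // Phase 1 (serial): the running counter bo makes this loop order dependent,
  // so only the (destination, source) index pairs are recorded here.
  std::vector<std::pair<int, int> > table;
  int bo = 0;
  for (int i = 0; i < static_cast<int>(src.size()); i++) {
    int ocb = 1 << parity(i);
    if (ocb & cbmask) table.push_back(std::make_pair(bo++, i));
  }

  // Phase 2: each iteration writes a distinct slot, so this loop has no
  // cross-iteration dependence and could be run in parallel.
  std::vector<double> buffer(table.size());
  for (std::size_t t = 0; t < table.size(); t++) {
    buffer[table[t].first] = src[table[t].second];
  }
  return buffer;
}
```

Separating index generation from data movement is what allows the copy loop to be threaded; the original single loop could not be, because of the `bo++` increment.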
--- a/lib/cshift/Cshift_mpi.h
+++ b/lib/cshift/Cshift_mpi.h
@@ -119,8 +119,8 @@ template<class vobj> void Cshift_comms(Lattice<vobj> &ret,const Lattice<vobj> &r
  assert(shift<fd);
  int buffer_size = rhs._grid->_slice_nblock[dimension]*rhs._grid->_slice_block[dimension];
-  std::vector<vobj,alignedAllocator<vobj> > send_buf(buffer_size);
+  commVector<vobj> send_buf(buffer_size);
-  std::vector<vobj,alignedAllocator<vobj> > recv_buf(buffer_size);
+  commVector<vobj> recv_buf(buffer_size);
  int cb= (cbmask==0x2)? Odd : Even;
  int sshift= rhs._grid->CheckerBoardShiftForCB(rhs.checkerboard,dimension,shift,cb);
@@ -191,8 +191,8 @@ template<class vobj> void  Cshift_comms_simd(Lattice<vobj> &ret,const Lattice<vo
  int buffer_size = grid->_slice_nblock[dimension]*grid->_slice_block[dimension];
  int words = sizeof(vobj)/sizeof(vector_type);
-  std::vector<Vector<scalar_object> >   send_buf_extract(Nsimd,Vector<scalar_object>(buffer_size) );
+  std::vector<commVector<scalar_object> >   send_buf_extract(Nsimd,commVector<scalar_object>(buffer_size) );
-  std::vector<Vector<scalar_object> >   recv_buf_extract(Nsimd,Vector<scalar_object>(buffer_size) );
+  std::vector<commVector<scalar_object> >   recv_buf_extract(Nsimd,commVector<scalar_object>(buffer_size) );
  int bytes = buffer_size*sizeof(scalar_object);
--- a/lib/lattice/Lattice_ET.h
+++ b/lib/lattice/Lattice_ET.h
@@ -261,6 +261,7 @@ GridUnopClass(UnaryExp, exp(a));
 GridBinOpClass(BinaryAdd, lhs + rhs);
 GridBinOpClass(BinarySub, lhs - rhs);
 GridBinOpClass(BinaryMul, lhs *rhs);
 GridBinOpClass(BinaryDiv, lhs /rhs);
 GridBinOpClass(BinaryAnd, lhs &rhs);
 GridBinOpClass(BinaryOr, lhs | rhs);
@@ -385,6 +386,7 @@ GRID_DEF_UNOP(exp, UnaryExp);
 GRID_DEF_BINOP(operator+, BinaryAdd);
 GRID_DEF_BINOP(operator-, BinarySub);
 GRID_DEF_BINOP(operator*, BinaryMul);
 GRID_DEF_BINOP(operator/, BinaryDiv);
 GRID_DEF_BINOP(operator&, BinaryAnd);
 GRID_DEF_BINOP(operator|, BinaryOr);
--- a/lib/lattice/Lattice_base.h
+++ b/lib/lattice/Lattice_base.h
@@ -65,9 +65,6 @@ public:
 class LatticeExpressionBase {};
 template<class T> using Vector = std::vector<T,alignedAllocator<T> >;               // Aligned allocator??
 template<class T> using Matrix = std::vector<std::vector<T,alignedAllocator<T> > >; // Aligned allocator??
 template <typename Op, typename T1>                           
 class LatticeUnaryExpression  : public std::pair<Op,std::tuple<T1> > , public LatticeExpressionBase {
 public:
@@ -303,17 +300,6 @@ PARALLEL_FOR_LOOP
        *this = (*this)+r;
        return *this;
    }
    strong_inline friend Lattice<vobj> operator / (const Lattice<vobj> &lhs,const Lattice<vobj> &rhs){
        conformable(lhs,rhs);
        Lattice<vobj> ret(lhs._grid);
 PARALLEL_FOR_LOOP
        for(int ss=0;ss<lhs._grid->oSites();ss++){
 	  ret._odata[ss] = lhs._odata[ss]*pow(rhs._odata[ss],-1.0);
        }
        return ret;
    };
 }; // class Lattice
  template<class vobj> std::ostream& operator<< (std::ostream& stream, const Lattice<vobj> &o){
--- a/lib/lattice/Lattice_peekpoke.h
+++ b/lib/lattice/Lattice_peekpoke.h
@@ -154,7 +154,7 @@ PARALLEL_FOR_LOOP
    template<class vobj,class sobj>
    void peekLocalSite(sobj &s,const Lattice<vobj> &l,std::vector<int> &site){
-      GridBase *grid=l._grid;
+      GridBase *grid = l._grid;
      typedef typename vobj::scalar_type scalar_type;
      typedef typename vobj::vector_type vector_type;
@@ -164,15 +164,17 @@ PARALLEL_FOR_LOOP
      assert( l.checkerboard== l._grid->CheckerBoard(site));
      assert( sizeof(sobj)*Nsimd == sizeof(vobj));
      static const int words=sizeof(vobj)/sizeof(vector_type);
      int odx,idx;
      idx= grid->iIndex(site);
      odx= grid->oIndex(site);
-      std::vector<sobj> buf(Nsimd);
+      scalar_type * vp = (scalar_type *)&l._odata[odx];
      scalar_type * pt = (scalar_type *)&s;
-      extract(l._odata[odx],buf);
-      
-      s = buf[idx];
+      for(int w=0;w<words;w++){
+        pt[w] = vp[idx+w*Nsimd];
+      }
      return;
    };
@@ -190,18 +192,17 @@ PARALLEL_FOR_LOOP
      assert( l.checkerboard== l._grid->CheckerBoard(site));
      assert( sizeof(sobj)*Nsimd == sizeof(vobj));
      static const int words=sizeof(vobj)/sizeof(vector_type);
      int odx,idx;
      idx= grid->iIndex(site);
      odx= grid->oIndex(site);
-      std::vector<sobj> buf(Nsimd);
+      scalar_type * vp = (scalar_type *)&l._odata[odx];
      scalar_type * pt = (scalar_type *)&s;
-      // extract-modify-merge cycle is easiest way and this is not perf critical
-      extract(l._odata[odx],buf);
-      
-      buf[idx] = s;
-      merge(l._odata[odx],buf);
+      for(int w=0;w<words;w++){
+        vp[idx+w*Nsimd] = pt[w];
+      }
      return;
    };
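Review note: the new `peekLocalSite` and `pokeLocalSite` bodies index the vectorised object directly, reading or writing scalar `idx + w*Nsimd` for each word `w`. The sketch below shows that addressing on a flat buffer. It assumes the layout is `words` consecutive SIMD vectors of `nsimd` scalars each; `peek_lane` and `poke_lane` are hypothetical helper names and `float` stands in for `scalar_type`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Copy lane `lane` out of an interleaved buffer that holds
// packed.size()/nsimd vectors of nsimd scalars each.
std::vector<float> peek_lane(const std::vector<float>& packed, int nsimd, int lane) {
  assert(nsimd > 0 && lane >= 0 && lane < nsimd);
  assert(packed.size() % static_cast<std::size_t>(nsimd) == 0);
  const int words = static_cast<int>(packed.size()) / nsimd;
  std::vector<float> site(words);
  for (int w = 0; w < words; w++) {
    site[w] = packed[lane + w * nsimd];  // same stride as vp[idx+w*Nsimd]
  }
  return site;
}

// Write one lane back in place, leaving the other lanes untouched.
void poke_lane(std::vector<float>& packed, int nsimd, int lane,
               const std::vector<float>& site) {
  assert(nsimd > 0 && lane >= 0 && lane < nsimd);
  assert(site.size() * static_cast<std::size_t>(nsimd) == packed.size());
  for (std::size_t w = 0; w < site.size(); w++) {
    packed[lane + w * nsimd] = site[w];
  }
}
```

Compared with the removed extract/merge cycle, this touches only the `words` scalars of one lane and needs no temporary `std::vector<sobj>`.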
--- a/lib/lattice/Lattice_rng.h
+++ b/lib/lattice/Lattice_rng.h
@@ -297,8 +297,9 @@ namespace Grid {
 	int l_idx=generator_idx(o_idx,i_idx);
-	std::vector<int> site_seeds(4);
-	for(int i=0;i<4;i++){
+	const int num_rand_seed=16;
+	std::vector<int> site_seeds(num_rand_seed);
 	for(int i=0;i<site_seeds.size();i++){
 	  site_seeds[i]= ui(pseeder);
 	}
--- a/lib/parallelIO/BinaryIO.h
+++ b/lib/parallelIO/BinaryIO.h
@@ -457,7 +457,7 @@ class BinaryIO {
    // available (how short sighted is that?)
    //////////////////////////////////////////////////////////
    Umu = zero;
-    static uint32_t csum=0;
+    static uint32_t csum; csum=0;
    fobj fileObj;
    static sobj siteObj; // Static to place in symmetric region for SHMEM
--- a/lib/qcd/QCD.h
+++ b/lib/qcd/QCD.h
@@ -493,16 +493,27 @@ namespace QCD {
 }   //namespace QCD
 } // Grid
 #include <Grid/qcd/utils/SpaceTimeGrid.h>
 #include <Grid/qcd/spin/Dirac.h>
 #include <Grid/qcd/spin/TwoSpinor.h>
 #include <Grid/qcd/utils/LinalgUtils.h>
 #include <Grid/qcd/utils/CovariantCshift.h>
 // Include representations 	
 #include <Grid/qcd/utils/SUn.h>
 #include <Grid/qcd/utils/SUnAdjoint.h>
 #include <Grid/qcd/utils/SUnTwoIndex.h>
 #include <Grid/qcd/representations/hmc_types.h>
 #include <Grid/qcd/action/Actions.h>
 #include <Grid/qcd/smearing/Smearing.h>
 #include <Grid/qcd/hmc/integrators/Integrator.h>
 #include <Grid/qcd/hmc/integrators/Integrator_algorithm.h>
 #include <Grid/qcd/hmc/HMC.h>
-#include <Grid/qcd/smearing/Smearing.h>
+
 #endif
--- a/lib/qcd/action/ActionBase.h
+++ b/lib/qcd/action/ActionBase.h
@@ -1,87 +1,153 @@
-    /*************************************************************************************
+/*************************************************************************************
-    Grid physics library, www.github.com/paboyle/Grid 
+Grid physics library, www.github.com/paboyle/Grid
-    Source file: ./lib/qcd/action/ActionBase.h
+Source file: ./lib/qcd/action/ActionBase.h
-    Copyright (C) 2015
+Copyright (C) 2015
 Author: Peter Boyle <paboyle@ph.ed.ac.uk>
 Author: neo <cossu@post.kek.jp>
-    This program is free software; you can redistribute it and/or modify
+This program is free software; you can redistribute it and/or modify
-    it under the terms of the GNU General Public License as published by
+it under the terms of the GNU General Public License as published by
-    the Free Software Foundation; either version 2 of the License, or
+the Free Software Foundation; either version 2 of the License, or
-    (at your option) any later version.
+(at your option) any later version.
-    This program is distributed in the hope that it will be useful,
+This program is distributed in the hope that it will be useful,
-    but WITHOUT ANY WARRANTY; without even the implied warranty of
+but WITHOUT ANY WARRANTY; without even the implied warranty of
-    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
-    GNU General Public License for more details.
+GNU General Public License for more details.
-    You should have received a copy of the GNU General Public License along
+You should have received a copy of the GNU General Public License along
-    with this program; if not, write to the Free Software Foundation, Inc.,
+with this program; if not, write to the Free Software Foundation, Inc.,
-    51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
+51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
-    See the full license in the file "LICENSE" in the top level distribution directory
+See the full license in the file "LICENSE" in the top level distribution
-    *************************************************************************************/
+directory
-    /*  END LEGAL */
+*************************************************************************************/
 /*  END LEGAL */
 #ifndef QCD_ACTION_BASE
 #define QCD_ACTION_BASE
 namespace Grid {
-namespace QCD{
+namespace QCD {
-template<class GaugeField>
+template <class GaugeField>
 class Action {
 public:
  bool is_smeared = false;
  // Boundary conditions? // Heatbath?
-  virtual void  refresh(const GaugeField &U, GridParallelRNG& pRNG) = 0;// refresh pseudofermions
-  virtual RealD S    (const GaugeField &U)                        = 0;  // evaluate the action
-  virtual void  deriv(const GaugeField &U,GaugeField & dSdU )     = 0;  // evaluate the action derivative
-  virtual ~Action() {};
+  virtual void refresh(const GaugeField& U,
+                       GridParallelRNG& pRNG) = 0;  // refresh pseudofermions
+  virtual RealD S(const GaugeField& U) = 0;         // evaluate the action
+  virtual void deriv(const GaugeField& U,
                     GaugeField& dSdU) = 0;  // evaluate the action derivative
  virtual ~Action(){};
 };
 // Indexing of tuple types
 template <class T, class Tuple>
 struct Index;
 template <class T, class... Types>
 struct Index<T, std::tuple<T, Types...>> {
  static const std::size_t value = 0;
 };
 template <class T, class U, class... Types>
 struct Index<T, std::tuple<U, Types...>> {
  static const std::size_t value = 1 + Index<T, std::tuple<Types...>>::value;
 };
 // Could derive PseudoFermion action with a PF field, FermionField, and a Grid; implement refresh
 /*
-template<class GaugeField, class FermionField>
+template <class GaugeField>
-class PseudoFermionAction : public Action<GaugeField> {
+struct ActionLevel {
 public:
-  FermionField Phi;
+  typedef Action<GaugeField>*
-  GridParallelRNG &pRNG;
+      ActPtr;  // now force the same colours as the rest of the code
  GridBase &Grid;
-  PseudoFermionAction(GridBase &_Grid,GridParallelRNG &_pRNG) : Grid(_Grid), Phi(&_Grid), pRNG(_pRNG) {
+  //Add supported representations here
  };
  virtual void refresh(const GaugeField &gauge) {
    gaussian(Phi,pRNG);
  };
-};
+  unsigned int multiplier;
 */
 template<class GaugeField> struct ActionLevel{
 public:
  typedef Action<GaugeField>*  ActPtr; // now force the same colours as the rest of the code
  int multiplier;
  std::vector<ActPtr> actions;
-  ActionLevel(int mul = 1) : multiplier(mul) {
+  ActionLevel(unsigned int mul = 1) : actions(0), multiplier(mul) {
-    assert (mul > 0);
+    assert(mul >= 1);
  };
-  void push_back(ActPtr ptr){
+  void push_back(ActPtr ptr) { actions.push_back(ptr); }
-    actions.push_back(ptr);
+};
 */
 template <class GaugeField, class Repr = NoHirep >
 struct ActionLevel {
 public:
  unsigned int multiplier; 
  // Fundamental repr actions separated because of the smearing
  typedef Action<GaugeField>* ActPtr;
  // construct a tuple of vectors of the actions for the corresponding higher
  // representation fields
  typedef typename AccessTypes<Action, Repr>::VectorCollection action_collection;
  action_collection actions_hirep;
  typedef typename  AccessTypes<Action, Repr>::FieldTypeCollection action_hirep_types;
  std::vector<ActPtr>& actions;
  // Temporary conversion between ActionLevel and ActionLevelHirep
  //ActionLevelHirep(ActionLevel<GaugeField>& AL ):actions(AL.actions), multiplier(AL.multiplier){}
  ActionLevel(unsigned int mul = 1) : actions(std::get<0>(actions_hirep)), multiplier(mul) {
    // initialize the hirep vectors to zero.
    //apply(this->resize, actions_hirep, 0); //need a working resize
    assert(mul >= 1);
  };
  //void push_back(ActPtr ptr) { actions.push_back(ptr); }
  template < class Field >
  void push_back(Action<Field>* ptr) {
    // insert only in the correct vector
    std::get< Index < Field, action_hirep_types>::value >(actions_hirep).push_back(ptr);
  };
  template < class ActPtr>
  static void resize(ActPtr ap, unsigned int n){
    ap->resize(n);
  }
  //template <std::size_t I>
  //auto getRepresentation(Repr& R)->decltype(std::get<I>(R).U)  {return std::get<I>(R).U;}
  // Loop on tuple for a callable function
  template <std::size_t I = 1, typename Callable, typename ...Args>
  inline typename std::enable_if<I == std::tuple_size<action_collection>::value, void>::type apply(
      Callable, Repr& R,Args&...) const {}
  template <std::size_t I = 1, typename Callable, typename ...Args>
  inline typename std::enable_if<I < std::tuple_size<action_collection>::value, void>::type apply(
      Callable fn, Repr& R, Args&... arguments) const {
    fn(std::get<I>(actions_hirep), std::get<I>(R.rep), arguments...);
    apply<I + 1>(fn, R, arguments...);
  }  
 };
 template<class GaugeField> using ActionSet = std::vector<ActionLevel< GaugeField > >;
 //template <class GaugeField>
 //using ActionSet = std::vector<ActionLevel<GaugeField> >;
-}}
+template <class GaugeField, class R>
 using ActionSet = std::vector<ActionLevel<GaugeField, R> >;
 }
 }
 #endif
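Review note: the new `ActionLevel` relies on two template idioms, the `Index` metafunction that maps a field type to its position in a tuple, and an `enable_if` recursion that visits each tuple element. The sketch below isolates both. The `Index` template is copied from the hunk; everything else (`FundField`, `AdjField`, `FieldTypes`, `Collections`, `Level`, `for_each_collection`, `Counter`) is hypothetical scaffolding, and unlike `apply` in the diff the loop here starts at element 0 rather than 1.

```cpp
#include <cassert>
#include <cstddef>
#include <tuple>
#include <type_traits>
#include <vector>

// Position of T in a tuple type (first match), as in the hunk above.
template <class T, class Tuple> struct Index;
template <class T, class... Types>
struct Index<T, std::tuple<T, Types...> > {
  static const std::size_t value = 0;
};
template <class T, class U, class... Types>
struct Index<T, std::tuple<U, Types...> > {
  static const std::size_t value = 1 + Index<T, std::tuple<Types...> >::value;
};

// Hypothetical field types for two representations.
struct FundField {};
struct AdjField {};
typedef std::tuple<FundField, AdjField> FieldTypes;
typedef std::tuple<std::vector<const FundField*>,
                   std::vector<const AdjField*> > Collections;

static_assert(Index<FundField, FieldTypes>::value == 0, "fundamental is slot 0");
static_assert(Index<AdjField, FieldTypes>::value == 1, "adjoint is slot 1");

struct Level {
  Collections actions;
  // Insert only into the vector whose slot matches the field type.
  template <class Field>
  void push_back(const Field* ptr) {
    std::get<Index<Field, FieldTypes>::value>(actions).push_back(ptr);
  }
};

// Recursion terminator: I has reached the tuple size.
template <std::size_t I = 0, class Fn>
typename std::enable_if<I == std::tuple_size<Collections>::value, void>::type
for_each_collection(const Collections&, Fn&) {}

// Visit element I, then recurse on I + 1.
template <std::size_t I = 0, class Fn>
typename std::enable_if<(I < std::tuple_size<Collections>::value), void>::type
for_each_collection(const Collections& c, Fn& fn) {
  fn(std::get<I>(c).size());
  for_each_collection<I + 1>(c, fn);
}

// Simple callable that sums the sizes it is shown.
struct Counter {
  std::size_t total = 0;
  void operator()(std::size_t n) { total += n; }
};
```

Passing a field type that is absent from `FieldTypes` fails at compile time, because the `Index` recursion reaches the undefined primary template.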
--- a/lib/qcd/action/Actions.h
+++ b/lib/qcd/action/Actions.h
@@ -116,6 +116,14 @@ typedef SymanzikGaugeAction<ConjugateGimplD>        ConjugateSymanzikGaugeAction
  template class A<GparityWilsonImplF>;		\
  template class A<GparityWilsonImplD>;		
 #define AdjointFermOpTemplateInstantiate(A) \
  template class A<WilsonAdjImplF>; \
  template class A<WilsonAdjImplD>; 
 #define TwoIndexFermOpTemplateInstantiate(A) \
  template class A<WilsonTwoIndexSymmetricImplF>; \
  template class A<WilsonTwoIndexSymmetricImplD>; 
 #define FermOp5dVecTemplateInstantiate(A) \
  template class A<DomainWallVec5dImplF>;	\
  template class A<DomainWallVec5dImplD>;	\
@@ -126,6 +134,7 @@ typedef SymanzikGaugeAction<ConjugateGimplD>        ConjugateSymanzikGaugeAction
 FermOp4dVecTemplateInstantiate(A) \
 FermOp5dVecTemplateInstantiate(A) 
 #define GparityFermOpTemplateInstantiate(A) 
 ////////////////////////////////////////////
@@ -171,6 +180,14 @@ typedef WilsonFermion<WilsonImplR> WilsonFermionR;
 typedef WilsonFermion<WilsonImplF> WilsonFermionF;
 typedef WilsonFermion<WilsonImplD> WilsonFermionD;
 typedef WilsonFermion<WilsonAdjImplR> WilsonAdjFermionR;
 typedef WilsonFermion<WilsonAdjImplF> WilsonAdjFermionF;
 typedef WilsonFermion<WilsonAdjImplD> WilsonAdjFermionD;
 typedef WilsonFermion<WilsonTwoIndexSymmetricImplR> WilsonTwoIndexSymmetricFermionR;
 typedef WilsonFermion<WilsonTwoIndexSymmetricImplF> WilsonTwoIndexSymmetricFermionF;
 typedef WilsonFermion<WilsonTwoIndexSymmetricImplD> WilsonTwoIndexSymmetricFermionD;
 typedef WilsonTMFermion<WilsonImplR> WilsonTMFermionR;
 typedef WilsonTMFermion<WilsonImplF> WilsonTMFermionF;
 typedef WilsonTMFermion<WilsonImplD> WilsonTMFermionD;
--- a/lib/qcd/action/fermion/CayleyFermion5D.cc
+++ b/lib/qcd/action/fermion/CayleyFermion5D.cc
@@ -50,6 +50,30 @@ namespace QCD {
   mass(_mass)
 { }
 template<class Impl>  
 void CayleyFermion5D<Impl>::Dminus(const FermionField &psi, FermionField &chi)
 {
  int Ls=this->Ls;
  FermionField tmp(psi._grid);
  this->DW(psi,tmp,DaggerNo);
  for(int s=0;s<Ls;s++){
    axpby_ssp(chi,Coeff_t(1.0),psi,-cs[s],tmp,s,s);// chi = (1-c[s] D_W) psi
  }
 }
 template<class Impl>  
 void CayleyFermion5D<Impl>::DminusDag(const FermionField &psi, FermionField &chi)
 {
  int Ls=this->Ls;
  FermionField tmp(psi._grid);
  this->DW(psi,tmp,DaggerYes);
  for(int s=0;s<Ls;s++){
    axpby_ssp(chi,Coeff_t(1.0),psi,-cs[s],tmp,s,s);// chi = (1-c[s] D_W) psi
  }
 }
 template<class Impl>  
 void CayleyFermion5D<Impl>::M5D   (const FermionField &psi, FermionField &chi)
 {
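Review note: written out per fifth-dimension slice, the two new operators read as follows. This assumes `axpby_ssp(z,a,x,b,y,s,sp)` sets slice `s` of `z` to `a*x[s] + b*y[sp]`, and that `DaggerYes` applies the adjoint of `D_W`; neither is shown in this diff. Under those assumptions the inline comment in `DminusDag`, which repeats the `Dminus` formula, appears to omit the dagger.

```latex
\begin{align}
  \texttt{Dminus:}    \quad \chi_s &= \psi_s - c_s\,(D_W\,\psi)_s,
  & s &= 0,\dots,L_s-1,\\
  \texttt{DminusDag:} \quad \chi_s &= \psi_s - c_s\,(D_W^{\dagger}\,\psi)_s,
  & s &= 0,\dots,L_s-1.
\end{align}
```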
--- a/lib/qcd/action/fermion/CayleyFermion5D.h
+++ b/lib/qcd/action/fermion/CayleyFermion5D.h
@@ -56,6 +56,9 @@ namespace Grid {
      virtual void   M5D   (const FermionField &psi, FermionField &chi);
      virtual void   M5Ddag(const FermionField &psi, FermionField &chi);
      virtual void   Dminus(const FermionField &psi, FermionField &chi);
      virtual void   DminusDag(const FermionField &psi, FermionField &chi);
      /////////////////////////////////////////////////////
      // Instantiate different versions depending on Impl
      /////////////////////////////////////////////////////
@@ -117,6 +120,7 @@ namespace Grid {
 		      GridRedBlackCartesian &FourDimRedBlackGrid,
 		      RealD _mass,RealD _M5,const ImplParams &p= ImplParams());
    protected:
      void SetCoefficientsZolotarev(RealD zolohi,Approx::zolotarev_data *zdata,RealD b,RealD c);
      void SetCoefficientsTanh(Approx::zolotarev_data *zdata,RealD b,RealD c);
--- a/lib/qcd/action/fermion/DomainWallFermion.h
+++ b/lib/qcd/action/fermion/DomainWallFermion.h
@@ -42,6 +42,10 @@ namespace Grid {
     INHERIT_IMPL_TYPES(Impl);
    public:
      void  MomentumSpacePropagator(FermionField &out,const FermionField &in,RealD _m) { 
 	this->MomentumSpacePropagatorHt(out,in,_m);
      };
      virtual void   Instantiatable(void) {};
      // Constructors
      DomainWallFermion(GaugeField &_Umu,
@@ -51,6 +55,7 @@ namespace Grid {
 			GridRedBlackCartesian &FourDimRedBlackGrid,
 			RealD _mass,RealD _M5,const ImplParams &p= ImplParams()) : 
      CayleyFermion5D<Impl>(_Umu,
 			    FiveDimGrid,
 			    FiveDimRedBlackGrid,
--- a/lib/qcd/action/fermion/FermionOperator.h
+++ b/lib/qcd/action/fermion/FermionOperator.h
@@ -91,6 +91,20 @@ namespace Grid {
      virtual void  Mdiag  (const FermionField &in, FermionField &out) { Mooee(in,out);};   // Same as Mooee applied to both CB's
      virtual void  Mdir   (const FermionField &in, FermionField &out,int dir,int disp)=0;   // case by case Wilson, Clover, Cayley, ContFrac, PartFrac
      virtual void  MomentumSpacePropagator(FermionField &out,const FermionField &in,RealD _m) { assert(0);};
      virtual void  FreePropagator(const FermionField &in,FermionField &out,RealD mass) { 
 	FFT theFFT((GridCartesian *) in._grid);
 	FermionField in_k(in._grid);
 	FermionField prop_k(in._grid);
 	theFFT.FFT_all_dim(in_k,in,FFT::forward);
        this->MomentumSpacePropagator(prop_k,in_k,mass);
 	theFFT.FFT_all_dim(out,prop_k,FFT::backward);
      };
      ///////////////////////////////////////////////
      // Updates gauge field during HMC
      ///////////////////////////////////////////////
--- a/lib/qcd/action/fermion/FermionOperatorImpl.h
+++ b/lib/qcd/action/fermion/FermionOperatorImpl.h
@@ -1,513 +1,532 @@
-    /*************************************************************************************
+/*************************************************************************************
-    Grid physics library, www.github.com/paboyle/Grid 
+Grid physics library, www.github.com/paboyle/Grid
-    Source file: ./lib/qcd/action/fermion/FermionOperatorImpl.h
+Source file: ./lib/qcd/action/fermion/FermionOperatorImpl.h
-    Copyright (C) 2015
+Copyright (C) 2015
 Author: Peter Boyle <pabobyle@ph.ed.ac.uk>
 Author: Peter Boyle <paboyle@ph.ed.ac.uk>
 Author: Peter Boyle <peterboyle@Peters-MacBook-Pro-2.local>
 Author: paboyle <paboyle@ph.ed.ac.uk>
-    This program is free software; you can redistribute it and/or modify
+This program is free software; you can redistribute it and/or modify
-    it under the terms of the GNU General Public License as published by
+it under the terms of the GNU General Public License as published by
-    the Free Software Foundation; either version 2 of the License, or
+the Free Software Foundation; either version 2 of the License, or
-    (at your option) any later version.
+(at your option) any later version.
-    This program is distributed in the hope that it will be useful,
+This program is distributed in the hope that it will be useful,
-    but WITHOUT ANY WARRANTY; without even the implied warranty of
+but WITHOUT ANY WARRANTY; without even the implied warranty of
-    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
-    GNU General Public License for more details.
+GNU General Public License for more details.
-    You should have received a copy of the GNU General Public License along
+You should have received a copy of the GNU General Public License along
-    with this program; if not, write to the Free Software Foundation, Inc.,
+with this program; if not, write to the Free Software Foundation, Inc.,
-    51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
+51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
-    See the full license in the file "LICENSE" in the top level distribution directory
+See the full license in the file "LICENSE" in the top level distribution
-    *************************************************************************************/
+directory
-    /*  END LEGAL */
+*************************************************************************************/
-#ifndef  GRID_QCD_FERMION_OPERATOR_IMPL_H
+/*  END LEGAL */
-#define  GRID_QCD_FERMION_OPERATOR_IMPL_H
+#ifndef GRID_QCD_FERMION_OPERATOR_IMPL_H
 #define GRID_QCD_FERMION_OPERATOR_IMPL_H
 namespace Grid {
-
+namespace QCD {
  namespace QCD {
-    //////////////////////////////////////////////
+  //////////////////////////////////////////////
-    // Template parameter class constructs to package
+  // Template parameter class constructs to package
-    // externally control Fermion implementations
+  // externally control Fermion implementations
-    // in orthogonal directions
+  // in orthogonal directions
-    //
+  //
-    // Ultimately need Impl to always define types where XXX is opaque
+  // Ultimately need Impl to always define types where XXX is opaque
-    //
+  //
-    //    typedef typename XXX               Simd;
+  //    typedef typename XXX               Simd;
-    //    typedef typename XXX     GaugeLinkField;	
+  //    typedef typename XXX     GaugeLinkField;	
-    //    typedef typename XXX         GaugeField;
+  //    typedef typename XXX         GaugeField;
-    //    typedef typename XXX      GaugeActField;
+  //    typedef typename XXX      GaugeActField;
-    //    typedef typename XXX       FermionField;
+  //    typedef typename XXX       FermionField;
-    //    typedef typename XXX  DoubledGaugeField;
+  //    typedef typename XXX  DoubledGaugeField;
-    //    typedef typename XXX         SiteSpinor;
+  //    typedef typename XXX         SiteSpinor;
-    //    typedef typename XXX     SiteHalfSpinor;	
+  //    typedef typename XXX     SiteHalfSpinor;	
-    //    typedef typename XXX         Compressor;	
+  //    typedef typename XXX         Compressor;	
-    //
+  //
-    // and Methods:
+  // and Methods:
-    //    void ImportGauge(GridBase *GaugeGrid,DoubledGaugeField &Uds,const GaugeField &Umu)
+  //    void ImportGauge(GridBase *GaugeGrid,DoubledGaugeField &Uds,const GaugeField &Umu)
-    //    void DoubleStore(GridBase *GaugeGrid,DoubledGaugeField &Uds,const GaugeField &Umu)
+  //    void DoubleStore(GridBase *GaugeGrid,DoubledGaugeField &Uds,const GaugeField &Umu)
-    //    void multLink(SiteHalfSpinor &phi,const SiteDoubledGaugeField &U,const SiteHalfSpinor &chi,int mu,StencilEntry *SE,StencilImpl &St)
+  //    void multLink(SiteHalfSpinor &phi,const SiteDoubledGaugeField &U,const SiteHalfSpinor &chi,int mu,StencilEntry *SE,StencilImpl &St)
-    //    void InsertForce4D(GaugeField &mat,const FermionField &Btilde,const FermionField &A,int mu)
+  //    void InsertForce4D(GaugeField &mat,const FermionField &Btilde,const FermionField &A,int mu)
-    //    void InsertForce5D(GaugeField &mat,const FermionField &Btilde,const FermionField &A,int mu)
+  //    void InsertForce5D(GaugeField &mat,const FermionField &Btilde,const FermionField &A,int mu)
-    //
+  //
-    //
+  //
-    // To acquire the typedefs from "Base" (either a base class or template param) use:
+  // To acquire the typedefs from "Base" (either a base class or template param) use:
-    //
+  //
-    // INHERIT_GIMPL_TYPES(Base)
+  // INHERIT_GIMPL_TYPES(Base)
-    // INHERIT_FIMPL_TYPES(Base)
+  // INHERIT_FIMPL_TYPES(Base)
-    // INHERIT_IMPL_TYPES(Base)
+  // INHERIT_IMPL_TYPES(Base)
-    //
+  //
-    // The Fermion operators will do the following:
+  // The Fermion operators will do the following:
-    //
+  //
-    // struct MyOpParams { 
+  // struct MyOpParams { 
-    //   RealD mass;
+  //   RealD mass;
-    // };
+  // };
-    //
+  //
-    //
+  //
-    // template<class Impl>
+  // template<class Impl>
-    // class MyOp : public<Impl> { 
+  // class MyOp : public<Impl> { 
-    // public:
+  // public:
-    //
+  //
-    //    INHERIT_ALL_IMPL_TYPES(Impl);
+  //    INHERIT_ALL_IMPL_TYPES(Impl);
-    //
+  //
-    //    MyOp(MyOpParams Myparm, ImplParams &ImplParam) :  Impl(ImplParam)
+  //    MyOp(MyOpParams Myparm, ImplParams &ImplParam) :  Impl(ImplParam)
-    //    {
+  //    {
-    //
+  //
-    //    };
+  //    };
-    //    
+  //    
-    //  }
+  //  }
-    //////////////////////////////////////////////
+  //////////////////////////////////////////////
-    ////////////////////////////////////////////////////////////////////////
+  ////////////////////////////////////////////////////////////////////////
-    // Implementation dependent fermion types
+  // Implementation dependent fermion types
-    ////////////////////////////////////////////////////////////////////////
+  ////////////////////////////////////////////////////////////////////////
 #define INHERIT_FIMPL_TYPES(Impl)\
-    typedef typename Impl::FermionField           FermionField;		\
+  typedef typename Impl::FermionField           FermionField;		\
-    typedef typename Impl::DoubledGaugeField DoubledGaugeField;		\
+  typedef typename Impl::DoubledGaugeField DoubledGaugeField;		\
-    typedef typename Impl::SiteSpinor               SiteSpinor;		\
+  typedef typename Impl::SiteSpinor               SiteSpinor;		\
-    typedef typename Impl::SiteHalfSpinor       SiteHalfSpinor;		\
+  typedef typename Impl::SiteHalfSpinor       SiteHalfSpinor;		\
-    typedef typename Impl::Compressor               Compressor;		\
+  typedef typename Impl::Compressor               Compressor;		\
-    typedef typename Impl::StencilImpl             StencilImpl;	  \
+  typedef typename Impl::StencilImpl             StencilImpl;		\
-    typedef typename Impl::ImplParams ImplParams; \
+  typedef typename Impl::ImplParams ImplParams;				\
-    typedef typename Impl::Coeff_t       Coeff_t;
+  typedef typename Impl::Coeff_t       Coeff_t;
 #define INHERIT_IMPL_TYPES(Base) \
-    INHERIT_GIMPL_TYPES(Base)\
+  INHERIT_GIMPL_TYPES(Base)	 \
-    INHERIT_FIMPL_TYPES(Base)
+  INHERIT_FIMPL_TYPES(Base)
  /////////////////////////////////////////////////////////////////////////////
  // Single flavour four spinors with colour index
  /////////////////////////////////////////////////////////////////////////////
  template <class S, class Representation = FundamentalRepresentation,class _Coeff_t = RealD >
  class WilsonImpl : public PeriodicGaugeImpl<GaugeImplTypes<S, Representation::Dimension > > {
    ///////
    // Single flavour four spinors with colour index
    ///////
    template<class S,int Nrepresentation=Nc,class _Coeff_t = RealD>
    class WilsonImpl :  public PeriodicGaugeImpl< GaugeImplTypes< S, Nrepresentation> > { 
    public:
    static const int Dimension = Representation::Dimension;
    typedef PeriodicGaugeImpl<GaugeImplTypes<S, Dimension > > Gimpl;
-      const bool LsVectorised=false;
+    //Necessary?
    constexpr bool is_fundamental() const{return Dimension == Nc ? 1 : 0;}
-      typedef _Coeff_t Coeff_t;
+    const bool LsVectorised=false;
-      typedef PeriodicGaugeImpl< GaugeImplTypes< S,Nrepresentation> > Gimpl;
+    typedef _Coeff_t Coeff_t;
-      INHERIT_GIMPL_TYPES(Gimpl);
+    INHERIT_GIMPL_TYPES(Gimpl);
-      template<typename vtype> using iImplSpinor             = iScalar<iVector<iVector<vtype, Nrepresentation>, Ns> >;
+    template <typename vtype> using iImplSpinor            = iScalar<iVector<iVector<vtype, Dimension>, Ns> >;
-      template<typename vtype> using iImplHalfSpinor         = iScalar<iVector<iVector<vtype, Nrepresentation>, Nhs> >;
+    template <typename vtype> using iImplHalfSpinor        = iScalar<iVector<iVector<vtype, Dimension>, Nhs> >;
-      template<typename vtype> using iImplDoubledGaugeField  = iVector<iScalar<iMatrix<vtype, Nrepresentation> >, Nds >;
+    template <typename vtype> using iImplDoubledGaugeField = iVector<iScalar<iMatrix<vtype, Dimension> >, Nds>;
-      typedef iImplSpinor    <Simd>           SiteSpinor;
+    typedef iImplSpinor<Simd>            SiteSpinor;
-      typedef iImplHalfSpinor<Simd>           SiteHalfSpinor;
+    typedef iImplHalfSpinor<Simd>        SiteHalfSpinor;
-      typedef iImplDoubledGaugeField<Simd>    SiteDoubledGaugeField;
+    typedef iImplDoubledGaugeField<Simd> SiteDoubledGaugeField;
-      typedef Lattice<SiteSpinor>                 FermionField;
+    typedef Lattice<SiteSpinor>            FermionField;
-      typedef Lattice<SiteDoubledGaugeField> DoubledGaugeField;
+    typedef Lattice<SiteDoubledGaugeField> DoubledGaugeField;
-      typedef WilsonCompressor<SiteHalfSpinor,SiteSpinor> Compressor;
+    typedef WilsonCompressor<SiteHalfSpinor, SiteSpinor> Compressor;
-      typedef WilsonImplParams ImplParams;
+    typedef WilsonImplParams ImplParams;
-      typedef WilsonStencil<SiteSpinor,SiteHalfSpinor> StencilImpl;
+    typedef WilsonStencil<SiteSpinor, SiteHalfSpinor> StencilImpl;
-      ImplParams Params;
+    ImplParams Params;
-      WilsonImpl(const ImplParams &p= ImplParams()) : Params(p) {}; 
+    WilsonImpl(const ImplParams &p = ImplParams()) : Params(p){};
-      bool overlapCommsCompute(void) { return Params.overlapCommsCompute; };
+    bool overlapCommsCompute(void) { return Params.overlapCommsCompute; };
-      inline void multLink(SiteHalfSpinor &phi,const SiteDoubledGaugeField &U,const SiteHalfSpinor &chi,int mu,StencilEntry *SE,StencilImpl &St){
+    inline void multLink(SiteHalfSpinor &phi,
-        mult(&phi(),&U(mu),&chi());
+			 const SiteDoubledGaugeField &U,
 			 const SiteHalfSpinor &chi,
 			 int mu,
 			 StencilEntry *SE,
 			 StencilImpl &St) {
      mult(&phi(), &U(mu), &chi());
    }
    template <class ref>
    inline void loadLinkElement(Simd &reg, ref &memory) {
      reg = memory;
    }
    inline void DoubleStore(GridBase *GaugeGrid,
 			    DoubledGaugeField &Uds,
 			    const GaugeField &Umu) {
      conformable(Uds._grid, GaugeGrid);
      conformable(Umu._grid, GaugeGrid);
      GaugeLinkField U(GaugeGrid);
      for (int mu = 0; mu < Nd; mu++) {
 	U = PeekIndex<LorentzIndex>(Umu, mu);
 	PokeIndex<LorentzIndex>(Uds, U, mu);
 	U = adj(Cshift(U, mu, -1));
 	PokeIndex<LorentzIndex>(Uds, U, mu + 4);
      }
    }
-      template<class ref>
+    inline void InsertForce4D(GaugeField &mat, FermionField &Btilde, FermionField &A,int mu){
-      inline void loadLinkElement(Simd & reg,ref &memory){
+      GaugeLinkField link(mat._grid);
-	reg = memory;
+      link = TraceIndex<SpinIndex>(outerProduct(Btilde,A)); 
-      }
+      PokeIndex<LorentzIndex>(mat,link,mu);
-      inline void DoubleStore(GridBase *GaugeGrid,DoubledGaugeField &Uds,const GaugeField &Umu)
+    }   
-      {
+      
-        conformable(Uds._grid,GaugeGrid);
+    inline void InsertForce5D(GaugeField &mat, FermionField &Btilde, FermionField &Atilde,int mu){
-        conformable(Umu._grid,GaugeGrid);
+      
-        GaugeLinkField U(GaugeGrid);
+      int Ls=Btilde._grid->_fdimensions[0];
-        for(int mu=0;mu<Nd;mu++){
+      GaugeLinkField tmp(mat._grid);
-  	  U = PeekIndex<LorentzIndex>(Umu,mu);
+      tmp = zero;
-	  PokeIndex<LorentzIndex>(Uds,U,mu);
+      
-	  U = adj(Cshift(U,mu,-1));
+      PARALLEL_FOR_LOOP
-	  PokeIndex<LorentzIndex>(Uds,U,mu+4);
+      for(int sss=0;sss<tmp._grid->oSites();sss++){
 	int sU=sss;
 	for(int s=0;s<Ls;s++){
 	  int sF = s+Ls*sU;
 	  tmp[sU] = tmp[sU]+ traceIndex<SpinIndex>(outerProduct(Btilde[sF],Atilde[sF])); // ordering here
 	}
      }
      PokeIndex<LorentzIndex>(mat,tmp,mu);
-      inline void InsertForce4D(GaugeField &mat, FermionField &Btilde, FermionField &A,int mu){
+    }
-	GaugeLinkField link(mat._grid);
+  };
-	link = TraceIndex<SpinIndex>(outerProduct(Btilde,A)); 
+
-	PokeIndex<LorentzIndex>(mat,link,mu);
+  ////////////////////////////////////////////////////////////////////////////////////
  // Single flavour four spinors with colour index, 5d redblack
  ////////////////////////////////////////////////////////////////////////////////////
 template<class S,int Nrepresentation=Nc,class _Coeff_t = RealD>
 class DomainWallVec5dImpl :  public PeriodicGaugeImpl< GaugeImplTypes< S,Nrepresentation> > { 
  public:
  static const int Dimension = Nrepresentation;
  const bool LsVectorised=true;
  typedef _Coeff_t Coeff_t;      
  typedef PeriodicGaugeImpl<GaugeImplTypes<S, Nrepresentation> > Gimpl;
  INHERIT_GIMPL_TYPES(Gimpl);
  template <typename vtype> using iImplSpinor            = iScalar<iVector<iVector<vtype, Nrepresentation>, Ns> >;
  template <typename vtype> using iImplHalfSpinor        = iScalar<iVector<iVector<vtype, Nrepresentation>, Nhs> >;
  template <typename vtype> using iImplDoubledGaugeField = iVector<iScalar<iMatrix<vtype, Nrepresentation> >, Nds>;
  template <typename vtype> using iImplGaugeField        = iVector<iScalar<iMatrix<vtype, Nrepresentation> >, Nd>;
  template <typename vtype> using iImplGaugeLink         = iScalar<iScalar<iMatrix<vtype, Nrepresentation> > >;
  typedef iImplSpinor<Simd> SiteSpinor;
  typedef iImplHalfSpinor<Simd> SiteHalfSpinor;
  typedef Lattice<SiteSpinor> FermionField;
  // Make the doubled gauge field a *scalar*
  typedef iImplDoubledGaugeField<typename Simd::scalar_type>  SiteDoubledGaugeField;  // This is a scalar
  typedef iImplGaugeField<typename Simd::scalar_type>         SiteScalarGaugeField;  // scalar
  typedef iImplGaugeLink<typename Simd::scalar_type>          SiteScalarGaugeLink;  // scalar
  typedef Lattice<SiteDoubledGaugeField> DoubledGaugeField;
  typedef WilsonCompressor<SiteHalfSpinor, SiteSpinor> Compressor;
  typedef WilsonImplParams ImplParams;
  typedef WilsonStencil<SiteSpinor, SiteHalfSpinor> StencilImpl;
  ImplParams Params;
  DomainWallVec5dImpl(const ImplParams &p = ImplParams()) : Params(p){};
  bool overlapCommsCompute(void) { return false; };
  template <class ref>
  inline void loadLinkElement(Simd &reg, ref &memory) {
    vsplat(reg, memory);
  }
  inline void multLink(SiteHalfSpinor &phi, const SiteDoubledGaugeField &U,
 		       const SiteHalfSpinor &chi, int mu, StencilEntry *SE,
 		       StencilImpl &St) {
    SiteGaugeLink UU;
    for (int i = 0; i < Nrepresentation; i++) {
      for (int j = 0; j < Nrepresentation; j++) {
 	vsplat(UU()()(i, j), U(mu)()(i, j));
      }
    }
    mult(&phi(), &UU(), &chi());
  }
-      inline void InsertForce5D(GaugeField &mat, FermionField &Btilde, FermionField &Atilde,int mu){
+  inline void DoubleStore(GridBase *GaugeGrid, DoubledGaugeField &Uds,const GaugeField &Umu) 
  {
    SiteScalarGaugeField ScalarUmu;
    SiteDoubledGaugeField ScalarUds;
-	int Ls=Btilde._grid->_fdimensions[0];
+    GaugeLinkField U(Umu._grid);
    GaugeField Uadj(Umu._grid);
    for (int mu = 0; mu < Nd; mu++) {
      U = PeekIndex<LorentzIndex>(Umu, mu);
      U = adj(Cshift(U, mu, -1));
      PokeIndex<LorentzIndex>(Uadj, U, mu);
    }
-	GaugeLinkField tmp(mat._grid);
+    for (int lidx = 0; lidx < GaugeGrid->lSites(); lidx++) {
-	tmp = zero;
+      std::vector<int> lcoor;
-PARALLEL_FOR_LOOP
+      GaugeGrid->LocalIndexToLocalCoor(lidx, lcoor);
 	for(int sss=0;sss<tmp._grid->oSites();sss++){
 	  int sU=sss;
 	  for(int s=0;s<Ls;s++){
 	    int sF = s+Ls*sU;
 	    tmp[sU] = tmp[sU]+ traceIndex<SpinIndex>(outerProduct(Btilde[sF],Atilde[sF])); // ordering here
 	  }
 	}
 	PokeIndex<LorentzIndex>(mat,tmp,mu);
-      }
+      peekLocalSite(ScalarUmu, Umu, lcoor);
      for (int mu = 0; mu < 4; mu++) ScalarUds(mu) = ScalarUmu(mu);
-    };
+      peekLocalSite(ScalarUmu, Uadj, lcoor);
      for (int mu = 0; mu < 4; mu++) ScalarUds(mu + 4) = ScalarUmu(mu);
      pokeLocalSite(ScalarUds, Uds, lcoor);
    }
  }
  inline void InsertForce4D(GaugeField &mat, FermionField &Btilde,FermionField &A, int mu) 
  {
    assert(0);
  }
-    ///////
+  inline void InsertForce5D(GaugeField &mat, FermionField &Btilde,FermionField &Atilde, int mu) 
-    // Single flavour four spinors with colour index, 5d redblack
+  {
    ///////
    template<class S,int Nrepresentation=Nc,class _Coeff_t = RealD>
    class DomainWallVec5dImpl :  public PeriodicGaugeImpl< GaugeImplTypes< S,Nrepresentation> > { 
    public:
      const bool LsVectorised=true;
      typedef _Coeff_t Coeff_t;
      typedef PeriodicGaugeImpl< GaugeImplTypes< S,Nrepresentation> > Gimpl;
      INHERIT_GIMPL_TYPES(Gimpl);
      template<typename vtype> using iImplSpinor             = iScalar<iVector<iVector<vtype, Nrepresentation>, Ns> >;
      template<typename vtype> using iImplHalfSpinor         = iScalar<iVector<iVector<vtype, Nrepresentation>, Nhs> >;
      template<typename vtype> using iImplDoubledGaugeField  = iVector<iScalar<iMatrix<vtype, Nrepresentation> >, Nds >;
      template<typename vtype> using iImplGaugeField         = iVector<iScalar<iMatrix<vtype, Nrepresentation> >, Nd >;
      template<typename vtype> using iImplGaugeLink          = iScalar<iScalar<iMatrix<vtype, Nrepresentation> > >;
      typedef iImplSpinor    <Simd>           SiteSpinor;
      typedef iImplHalfSpinor<Simd>           SiteHalfSpinor;
      typedef Lattice<SiteSpinor>             FermionField;
      // Make the doubled gauge field a *scalar*
      typedef iImplDoubledGaugeField<typename Simd::scalar_type>    SiteDoubledGaugeField; // This is a scalar
      typedef iImplGaugeField<typename Simd::scalar_type>           SiteScalarGaugeField;  // scalar
      typedef iImplGaugeLink <typename Simd::scalar_type>           SiteScalarGaugeLink;   // scalar
      typedef Lattice<SiteDoubledGaugeField>                  DoubledGaugeField;
      typedef WilsonCompressor<SiteHalfSpinor,SiteSpinor> Compressor;
      typedef WilsonImplParams ImplParams;
      typedef WilsonStencil<SiteSpinor,SiteHalfSpinor> StencilImpl;
      ImplParams Params;
      DomainWallVec5dImpl(const ImplParams &p= ImplParams()) : Params(p) {}; 
      bool overlapCommsCompute(void) { return false; };
      template<class ref>
      inline void loadLinkElement(Simd & reg,ref &memory){
 	vsplat(reg,memory);
      }
      inline void multLink(SiteHalfSpinor &phi,const SiteDoubledGaugeField &U,const SiteHalfSpinor &chi,int mu,StencilEntry *SE,StencilImpl &St)
      {
 	SiteGaugeLink UU;
 	for(int i=0;i<Nrepresentation;i++){
 	  for(int j=0;j<Nrepresentation;j++){
 	    vsplat(UU()()(i,j),U(mu)()(i,j));
 	  }
 	}
        mult(&phi(),&UU(),&chi());
      }
      inline void DoubleStore(GridBase *GaugeGrid,DoubledGaugeField &Uds,const GaugeField &Umu)
      {
 	SiteScalarGaugeField  ScalarUmu;
 	SiteDoubledGaugeField ScalarUds;
        GaugeLinkField U   (Umu._grid);
 	GaugeField     Uadj(Umu._grid);
        for(int mu=0;mu<Nd;mu++){
  	  U = PeekIndex<LorentzIndex>(Umu,mu);
 	  U = adj(Cshift(U,mu,-1));
 	  PokeIndex<LorentzIndex>(Uadj,U,mu);
 	}
 	for(int lidx=0;lidx<GaugeGrid->lSites();lidx++){
 	  std::vector<int> lcoor;
 	  GaugeGrid->LocalIndexToLocalCoor(lidx,lcoor);
 	  peekLocalSite(ScalarUmu,Umu,lcoor);
 	  for(int mu=0;mu<4;mu++) ScalarUds(mu) = ScalarUmu(mu);
 	  peekLocalSite(ScalarUmu,Uadj,lcoor);
 	  for(int mu=0;mu<4;mu++) ScalarUds(mu+4) = ScalarUmu(mu);
 	  pokeLocalSite(ScalarUds,Uds,lcoor);
 	}
      }
      inline void InsertForce4D(GaugeField &mat, FermionField &Btilde, FermionField &A,int mu){
 	assert(0);
-      }   
+  }
-
+};
      inline void InsertForce5D(GaugeField &mat, FermionField &Btilde, FermionField &Atilde,int mu){
 	assert(0);
      }
    };
    ////////////////////////////////////////////////////////////////////////////////////////
    // Flavour doubled spinors; is Gparity the only? what about C*?
    ////////////////////////////////////////////////////////////////////////////////////////
-    template<class S,int Nrepresentation,class _Coeff_t = RealD>
+template <class S, int Nrepresentation,class _Coeff_t = RealD>
-    class GparityWilsonImpl : public ConjugateGaugeImpl< GaugeImplTypes<S,Nrepresentation> >{ 
+class GparityWilsonImpl : public ConjugateGaugeImpl<GaugeImplTypes<S, Nrepresentation> > {
-    public:
+ public:
-      const bool LsVectorised=false;
+ static const int Dimension = Nrepresentation;
-      typedef _Coeff_t Coeff_t;
+ const bool LsVectorised=false;
      typedef ConjugateGaugeImpl< GaugeImplTypes<S,Nrepresentation> > Gimpl;
-      INHERIT_GIMPL_TYPES(Gimpl);
+ typedef _Coeff_t Coeff_t;
 typedef ConjugateGaugeImpl< GaugeImplTypes<S,Nrepresentation> > Gimpl;
-      template<typename vtype> using iImplSpinor             = iVector<iVector<iVector<vtype, Nrepresentation>, Ns>, Ngp >;
+ INHERIT_GIMPL_TYPES(Gimpl);
      template<typename vtype> using iImplHalfSpinor         = iVector<iVector<iVector<vtype, Nrepresentation>, Nhs>, Ngp >;
      template<typename vtype> using iImplDoubledGaugeField  = iVector<iVector<iScalar<iMatrix<vtype, Nrepresentation> >, Nds >, Ngp >;
-      typedef iImplSpinor    <Simd>           SiteSpinor;
+ template <typename vtype> using iImplSpinor            = iVector<iVector<iVector<vtype, Nrepresentation>, Ns>, Ngp>;
-      typedef iImplHalfSpinor<Simd>           SiteHalfSpinor;
+ template <typename vtype> using iImplHalfSpinor        = iVector<iVector<iVector<vtype, Nrepresentation>, Nhs>, Ngp>;
-      typedef iImplDoubledGaugeField<Simd>    SiteDoubledGaugeField;
+ template <typename vtype> using iImplDoubledGaugeField = iVector<iVector<iScalar<iMatrix<vtype, Nrepresentation> >, Nds>, Ngp>;
-      typedef Lattice<SiteSpinor>                 FermionField;
+ typedef iImplSpinor<Simd> SiteSpinor;
-      typedef Lattice<SiteDoubledGaugeField> DoubledGaugeField;
+ typedef iImplHalfSpinor<Simd> SiteHalfSpinor;
 typedef iImplDoubledGaugeField<Simd> SiteDoubledGaugeField;
-      typedef WilsonCompressor<SiteHalfSpinor,SiteSpinor> Compressor;
+ typedef Lattice<SiteSpinor> FermionField;
-      typedef WilsonStencil<SiteSpinor,SiteHalfSpinor> StencilImpl;
+ typedef Lattice<SiteDoubledGaugeField> DoubledGaugeField;
-      typedef GparityWilsonImplParams ImplParams;
+ typedef WilsonCompressor<SiteHalfSpinor, SiteSpinor> Compressor;
 typedef WilsonStencil<SiteSpinor, SiteHalfSpinor> StencilImpl;
-      ImplParams Params;
+ typedef GparityWilsonImplParams ImplParams;
-      GparityWilsonImpl(const ImplParams &p= ImplParams()) : Params(p) {}; 
+ ImplParams Params;
-      bool overlapCommsCompute(void) { return Params.overlapCommsCompute; };
+ GparityWilsonImpl(const ImplParams &p = ImplParams()) : Params(p){};
-      // provide the multiply by link that is differentiated between Gparity (with flavour index) and non-Gparity
+ bool overlapCommsCompute(void) { return Params.overlapCommsCompute; };
      inline void multLink(SiteHalfSpinor &phi,const SiteDoubledGaugeField &U,const SiteHalfSpinor &chi,int mu,StencilEntry *SE,StencilImpl &St){
-	typedef SiteHalfSpinor vobj;
+ // provide the multiply by link that is differentiated between Gparity (with
-	typedef typename SiteHalfSpinor::scalar_object sobj;
+ // flavour index) and non-Gparity
 inline void multLink(SiteHalfSpinor &phi, const SiteDoubledGaugeField &U,
 		      const SiteHalfSpinor &chi, int mu, StencilEntry *SE,
 		      StencilImpl &St) {
-	vobj vtmp;
+  typedef SiteHalfSpinor vobj;
-	sobj stmp;
+   typedef typename SiteHalfSpinor::scalar_object sobj;
-	GridBase *grid = St._grid;
+   vobj vtmp;
   sobj stmp;
-	const int Nsimd = grid->Nsimd();
+   GridBase *grid = St._grid;
-	int direction    = St._directions[mu];
+   const int Nsimd = grid->Nsimd();
 	int distance     = St._distances[mu];
 	int ptype        = St._permute_type[mu]; 
 	int sl           = St._grid->_simd_layout[direction];
-	// Fixme X.Y.Z.T hardcode in stencil
+   int direction = St._directions[mu];
-	int mmu          = mu % Nd;
+   int distance = St._distances[mu];
   int ptype = St._permute_type[mu];
   int sl = St._grid->_simd_layout[direction];
-	// assert our assumptions
+   // Fixme X.Y.Z.T hardcode in stencil
-	assert((distance==1)||(distance==-1)); // nearest neighbour stencil hard code
+   int mmu = mu % Nd;
 	assert((sl==1)||(sl==2));
-	std::vector<int> icoor;
+   // assert our assumptions
   assert((distance == 1) || (distance == -1));  // nearest neighbour stencil hard code
   assert((sl == 1) || (sl == 2));
-	if ( SE->_around_the_world && Params.twists[mmu] ) {
+   std::vector<int> icoor;
-	  if ( sl == 2 ) {
+   if ( SE->_around_the_world && Params.twists[mmu] ) {
-	    std::vector<sobj> vals(Nsimd);
+     if ( sl == 2 ) {
-	    extract(chi,vals);
+       std::vector<sobj> vals(Nsimd);
 	    for(int s=0;s<Nsimd;s++){
-	      grid->iCoorFromIindex(icoor,s);
+       extract(chi,vals);
       for(int s=0;s<Nsimd;s++){
-	      assert((icoor[direction]==0)||(icoor[direction]==1));
+	 grid->iCoorFromIindex(icoor,s);
-	      int permute_lane;
+	 assert((icoor[direction]==0)||(icoor[direction]==1));
-	      if ( distance == 1) {
+	      
-		permute_lane = icoor[direction]?1:0;
+	 int permute_lane;
-	      } else {
+	 if ( distance == 1) {
-		permute_lane = icoor[direction]?0:1;
+	   permute_lane = icoor[direction]?1:0;
 	 } else {
 	   permute_lane = icoor[direction]?0:1;
 	 }
 	 if ( permute_lane ) { 
 	   stmp(0) = vals[s](1);
 	   stmp(1) = vals[s](0);
 	   vals[s] = stmp;
 	      }
       }
       merge(vtmp,vals);
-	      if ( permute_lane ) { 
+     } else { 
-		stmp(0) = vals[s](1);
+       vtmp(0) = chi(1);
-		stmp(1) = vals[s](0);
+       vtmp(1) = chi(0);
-		vals[s] = stmp;
+     }
-	      }
+     mult(&phi(0),&U(0)(mu),&vtmp(0));
-	    }
+     mult(&phi(1),&U(1)(mu),&vtmp(1));
 	    merge(vtmp,vals);
-	  } else { 
+   } else { 
-	    vtmp(0) = chi(1);
+     mult(&phi(0),&U(0)(mu),&chi(0));
-	    vtmp(1) = chi(0);
+     mult(&phi(1),&U(1)(mu),&chi(1));
-	  }
+   }
 	  mult(&phi(0),&U(0)(mu),&vtmp(0));
 	  mult(&phi(1),&U(1)(mu),&vtmp(1));
-	} else { 
+ }
 	  mult(&phi(0),&U(0)(mu),&chi(0));
 	  mult(&phi(1),&U(1)(mu),&chi(1));
 	}
-      }
+ inline void DoubleStore(GridBase *GaugeGrid,DoubledGaugeField &Uds,const GaugeField &Umu)
 {
   conformable(Uds._grid,GaugeGrid);
   conformable(Umu._grid,GaugeGrid);
-      inline void DoubleStore(GridBase *GaugeGrid,DoubledGaugeField &Uds,const GaugeField &Umu)
+   GaugeLinkField Utmp (GaugeGrid);
-      {
+   GaugeLinkField U    (GaugeGrid);
   GaugeLinkField Uconj(GaugeGrid);
-	conformable(Uds._grid,GaugeGrid);
+   Lattice<iScalar<vInteger> > coor(GaugeGrid);
 	conformable(Umu._grid,GaugeGrid);
-	GaugeLinkField Utmp (GaugeGrid);
+   for(int mu=0;mu<Nd;mu++){
 	GaugeLinkField U    (GaugeGrid);
 	GaugeLinkField Uconj(GaugeGrid);
-	Lattice<iScalar<vInteger> > coor(GaugeGrid);
+     LatticeCoordinate(coor,mu);
     U     = PeekIndex<LorentzIndex>(Umu,mu);
     Uconj = conjugate(U);
-	for(int mu=0;mu<Nd;mu++){
+     // This phase could come from a simple bc 1,1,-1,1 ..
-	  
+     int neglink = GaugeGrid->GlobalDimensions()[mu]-1;
-	  LatticeCoordinate(coor,mu);
+     if ( Params.twists[mu] ) { 
-	  
+       Uconj = where(coor==neglink,-Uconj,Uconj);
-	  U     = PeekIndex<LorentzIndex>(Umu,mu);
+     }
 	  Uconj = conjugate(U);
 	  // This phase could come from a simple bc 1,1,-1,1 ..
 	  int neglink = GaugeGrid->GlobalDimensions()[mu]-1;
 	  if ( Params.twists[mu] ) { 
 	    Uconj = where(coor==neglink,-Uconj,Uconj);
 	  }
 PARALLEL_FOR_LOOP
-	  for(auto ss=U.begin();ss<U.end();ss++){
+     for(auto ss=U.begin();ss<U.end();ss++){
-	    Uds[ss](0)(mu) = U[ss]();
+       Uds[ss](0)(mu) = U[ss]();
-	    Uds[ss](1)(mu) = Uconj[ss]();
+       Uds[ss](1)(mu) = Uconj[ss]();
-	  }
+     }
-	  U     = adj(Cshift(U    ,mu,-1));      // correct except for spanning the boundary
+     U     = adj(Cshift(U    ,mu,-1));      // correct except for spanning the boundary
-	  Uconj = adj(Cshift(Uconj,mu,-1));
+     Uconj = adj(Cshift(Uconj,mu,-1));
-	  Utmp = U;
+     Utmp = U;
-	  if ( Params.twists[mu] ) { 
+     if ( Params.twists[mu] ) { 
-	    Utmp = where(coor==0,Uconj,Utmp);
+       Utmp = where(coor==0,Uconj,Utmp);
-	  }
+     }
 PARALLEL_FOR_LOOP
-	  for(auto ss=U.begin();ss<U.end();ss++){
+     for(auto ss=U.begin();ss<U.end();ss++){
-	    Uds[ss](0)(mu+4) = Utmp[ss]();
+       Uds[ss](0)(mu+4) = Utmp[ss]();
-	  }
+     }
-	  Utmp = Uconj;
+     Utmp = Uconj;
-	  if ( Params.twists[mu] ) { 
+     if ( Params.twists[mu] ) { 
-	    Utmp = where(coor==0,U,Utmp);
+       Utmp = where(coor==0,U,Utmp);
-	  }
+     }
 PARALLEL_FOR_LOOP
-	  for(auto ss=U.begin();ss<U.end();ss++){
+     for(auto ss=U.begin();ss<U.end();ss++){
-	    Uds[ss](1)(mu+4) = Utmp[ss]();
+       Uds[ss](1)(mu+4) = Utmp[ss]();
-	  }
+     }
-	}
+   }
-      }
+ }
      inline void InsertForce4D(GaugeField &mat, FermionField &Btilde, FermionField &A,int mu){
-	// DhopDir provides U or Uconj depending on coor/flavour.
+ inline void InsertForce4D(GaugeField &mat, FermionField &Btilde, FermionField &A, int mu) {
-	GaugeLinkField link(mat._grid);
+
-	// use lorentz for flavour as hack.
+   // DhopDir provides U or Uconj depending on coor/flavour.
   GaugeLinkField link(mat._grid);
   // use lorentz for flavour as hack.
   auto tmp = TraceIndex<SpinIndex>(outerProduct(Btilde, A));
 PARALLEL_FOR_LOOP
-        for(auto ss=link.begin();ss<link.end();ss++){
+   for (auto ss = tmp.begin(); ss < tmp.end(); ss++) {
-	  auto ttmp = traceIndex<SpinIndex>(outerProduct(Btilde[ss],A[ss]));  
+     link[ss]() = tmp[ss](0, 0) - conjugate(tmp[ss](1, 1));
-	  link[ss]() = ttmp(0,0) + conjugate(ttmp(1,1)) ; 
+   }
-	}
+   PokeIndex<LorentzIndex>(mat, link, mu);
-	PokeIndex<LorentzIndex>(mat,link,mu);
+   return;
-	return;
+ }
      }
      inline void InsertForce5D(GaugeField &mat, FermionField &Btilde, FermionField &Atilde,int mu){
-	int Ls=Btilde._grid->_fdimensions[0];
+ inline void InsertForce5D(GaugeField &mat, FermionField &Btilde, FermionField &Atilde, int mu) {
-	GaugeLinkField tmp(mat._grid);
+   int Ls = Btilde._grid->_fdimensions[0];
-	tmp = zero;
+	
   GaugeLinkField tmp(mat._grid);
   tmp = zero;
 PARALLEL_FOR_LOOP
-	for(int ss=0;ss<tmp._grid->oSites();ss++){
+   for (int ss = 0; ss < tmp._grid->oSites(); ss++) {
-	  for(int s=0;s<Ls;s++){
+     for (int s = 0; s < Ls; s++) {
-	    int sF = s+Ls*ss;
+       int sF = s + Ls * ss;
-	    auto ttmp = traceIndex<SpinIndex>(outerProduct(Btilde[sF],Atilde[sF]));
+       auto ttmp = traceIndex<SpinIndex>(outerProduct(Btilde[sF], Atilde[sF]));
-	    tmp[ss]() = tmp[ss]()+ ttmp(0,0) + conjugate(ttmp(1,1));
+       tmp[ss]() = tmp[ss]() + ttmp(0, 0) + conjugate(ttmp(1, 1));
-	  }
+     }
-	}
+   }
-	PokeIndex<LorentzIndex>(mat,tmp,mu);
+   PokeIndex<LorentzIndex>(mat, tmp, mu);
-	return;
+   return;
-      }
+ }
    };
-    typedef WilsonImpl<vComplex ,Nc> WilsonImplR; // Real.. whichever prec
+};
    typedef WilsonImpl<vComplexF,Nc> WilsonImplF; // Float
    typedef WilsonImpl<vComplexD,Nc> WilsonImplD; // Double
-    typedef WilsonImpl<vComplex ,Nc,ComplexD> ZWilsonImplR; // Real.. whichever prec
+ typedef WilsonImpl<vComplex,  FundamentalRepresentation > WilsonImplR;   // Real.. whichever prec
-    typedef WilsonImpl<vComplexF,Nc,ComplexD> ZWilsonImplF; // Float
+ typedef WilsonImpl<vComplexF, FundamentalRepresentation > WilsonImplF;  // Float
-    typedef WilsonImpl<vComplexD,Nc,ComplexD> ZWilsonImplD; // Double
+ typedef WilsonImpl<vComplexD, FundamentalRepresentation > WilsonImplD;  // Double
-    typedef DomainWallVec5dImpl<vComplex ,Nc> DomainWallVec5dImplR; // Real.. whichever prec
+ typedef WilsonImpl<vComplex,  FundamentalRepresentation, ComplexD > ZWilsonImplR; // Real.. whichever prec
-    typedef DomainWallVec5dImpl<vComplexF,Nc> DomainWallVec5dImplF; // Float
+ typedef WilsonImpl<vComplexF, FundamentalRepresentation, ComplexD > ZWilsonImplF; // Float
-    typedef DomainWallVec5dImpl<vComplexD,Nc> DomainWallVec5dImplD; // Double
+ typedef WilsonImpl<vComplexD, FundamentalRepresentation, ComplexD > ZWilsonImplD; // Double
-    typedef DomainWallVec5dImpl<vComplex ,Nc,ComplexD> ZDomainWallVec5dImplR; // Real.. whichever prec
+ typedef WilsonImpl<vComplex,  AdjointRepresentation > WilsonAdjImplR;   // Real.. whichever prec
-    typedef DomainWallVec5dImpl<vComplexF,Nc,ComplexD> ZDomainWallVec5dImplF; // Float
+ typedef WilsonImpl<vComplexF, AdjointRepresentation > WilsonAdjImplF;  // Float
-    typedef DomainWallVec5dImpl<vComplexD,Nc,ComplexD> ZDomainWallVec5dImplD; // Double
+ typedef WilsonImpl<vComplexD, AdjointRepresentation > WilsonAdjImplD;  // Double
-    typedef DomainWallVec5dImpl<vComplex ,Nc> DomainWallVec5dImplR; // Real.. whichever prec
+ typedef WilsonImpl<vComplex,  TwoIndexSymmetricRepresentation > WilsonTwoIndexSymmetricImplR;   // Real.. whichever prec
-    typedef DomainWallVec5dImpl<vComplexF,Nc> DomainWallVec5dImplF; // Float
+ typedef WilsonImpl<vComplexF, TwoIndexSymmetricRepresentation > WilsonTwoIndexSymmetricImplF;  // Float
-    typedef DomainWallVec5dImpl<vComplexD,Nc> DomainWallVec5dImplD; // Double
+ typedef WilsonImpl<vComplexD, TwoIndexSymmetricRepresentation > WilsonTwoIndexSymmetricImplD;  // Double
-    typedef GparityWilsonImpl<vComplex ,Nc> GparityWilsonImplR; // Real.. whichever prec
+ typedef DomainWallVec5dImpl<vComplex ,Nc> DomainWallVec5dImplR; // Real.. whichever prec
-    typedef GparityWilsonImpl<vComplexF,Nc> GparityWilsonImplF; // Float
+ typedef DomainWallVec5dImpl<vComplexF,Nc> DomainWallVec5dImplF; // Float
-    typedef GparityWilsonImpl<vComplexD,Nc> GparityWilsonImplD; // Double
+ typedef DomainWallVec5dImpl<vComplexD,Nc> DomainWallVec5dImplD; // Double
 typedef DomainWallVec5dImpl<vComplex ,Nc,ComplexD> ZDomainWallVec5dImplR; // Real.. whichever prec
 typedef DomainWallVec5dImpl<vComplexF,Nc,ComplexD> ZDomainWallVec5dImplF; // Float
 typedef DomainWallVec5dImpl<vComplexD,Nc,ComplexD> ZDomainWallVec5dImplD; // Double
 typedef GparityWilsonImpl<vComplex , Nc> GparityWilsonImplR;  // Real.. whichever prec
 typedef GparityWilsonImpl<vComplexF, Nc> GparityWilsonImplF;  // Float
 typedef GparityWilsonImpl<vComplexD, Nc> GparityWilsonImplD;  // Double
 }}
  }
 }
 #endif
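
Reviewer note on the hunk above: the main semantic change is that `WilsonImpl` now takes a representation class and reads its colour dimension from `Representation::Dimension`, instead of taking an integer `Nrepresentation`. The following is a minimal standalone sketch of that pattern, not Grid code. Every name prefixed with `Sketch` or `SKETCH_`, the constant `kNc`, and the helper `site_size()` are assumptions made for illustration; only the idea of a static `Dimension` member and of a typedef-inheriting macro comes from the diff.

```cpp
#include <array>
#include <cassert>
#include <complex>

// Illustrative stand-ins only: these are NOT Grid's classes. Each tag
// exposes a static Dimension, which is the one property the diff relies on.
constexpr int kNc = 3;  // number of colours, assumed SU(3) for this sketch
struct SketchFundamental { static constexpr int Dimension = kNc; };
struct SketchAdjoint     { static constexpr int Dimension = kNc * kNc - 1; };

// In the spirit of INHERIT_FIMPL_TYPES: re-export typedefs from Impl.
#define SKETCH_INHERIT_TYPES(Impl)                 \
  typedef typename Impl::Coeff_t    Coeff_t;       \
  typedef typename Impl::SiteSpinor SiteSpinor

template <class S, class Representation = SketchFundamental, class CoeffT = double>
class SketchWilsonImpl {
 public:
  // The representation class, not an integer parameter, fixes the size.
  static constexpr int Dimension = Representation::Dimension;
  typedef CoeffT Coeff_t;
  // Four spin components, each a colour vector of length Dimension.
  typedef std::array<std::array<S, Dimension>, 4> SiteSpinor;
  static constexpr bool is_fundamental() { return Dimension == kNc; }
};

// An operator templated on the Impl picks up its types through the macro.
template <class Impl>
class SketchOp : public Impl {
 public:
  SKETCH_INHERIT_TYPES(Impl);
  // Number of scalar entries in one site spinor.
  static constexpr int site_size() { return 4 * Impl::Dimension; }
};

typedef SketchWilsonImpl<std::complex<double>, SketchFundamental> SketchWilsonImplD;
typedef SketchWilsonImpl<std::complex<double>, SketchAdjoint>     SketchWilsonAdjImplD;
```

With this layout, adding a new representation only needs a new tag type and a new typedef, which is what the added `WilsonAdjImpl*` and `WilsonTwoIndexSymmetricImpl*` typedefs in the hunk appear to do.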
--- a/lib/qcd/action/fermion/OverlapWilsonCayleyTanhFermion.h
+++ b/lib/qcd/action/fermion/OverlapWilsonCayleyTanhFermion.h
@@ -42,7 +42,11 @@ namespace Grid {
     INHERIT_IMPL_TYPES(Impl);
    public:
+     void  MomentumSpacePropagator(FermionField &out,const FermionField &in,RealD _m) { 
+       this->MomentumSpacePropagatorHw(out,in,_m);
+     };
     // Constructors
    OverlapWilsonCayleyTanhFermion(GaugeField &_Umu,
 				   GridCartesian         &FiveDimGrid,
 				   GridRedBlackCartesian &FiveDimRedBlackGrid,
--- a/lib/qcd/action/fermion/WilsonFermion.cc
+++ b/lib/qcd/action/fermion/WilsonFermion.cc
@@ -1,128 +1,127 @@
-    /*************************************************************************************
+/*************************************************************************************
-    Grid physics library, www.github.com/paboyle/Grid 
+Grid physics library, www.github.com/paboyle/Grid
-    Source file: ./lib/qcd/action/fermion/WilsonFermion.cc
+Source file: ./lib/qcd/action/fermion/WilsonFermion.cc
-    Copyright (C) 2015
+Copyright (C) 2015
 Author: Peter Boyle <pabobyle@ph.ed.ac.uk>
 Author: Peter Boyle <paboyle@ph.ed.ac.uk>
 Author: Peter Boyle <peterboyle@Peters-MacBook-Pro-2.local>
 Author: paboyle <paboyle@ph.ed.ac.uk>
-    This program is free software; you can redistribute it and/or modify
+This program is free software; you can redistribute it and/or modify
-    it under the terms of the GNU General Public License as published by
+it under the terms of the GNU General Public License as published by
-    the Free Software Foundation; either version 2 of the License, or
+the Free Software Foundation; either version 2 of the License, or
-    (at your option) any later version.
+(at your option) any later version.
-    This program is distributed in the hope that it will be useful,
+This program is distributed in the hope that it will be useful,
-    but WITHOUT ANY WARRANTY; without even the implied warranty of
+but WITHOUT ANY WARRANTY; without even the implied warranty of
-    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
-    GNU General Public License for more details.
+GNU General Public License for more details.
-    You should have received a copy of the GNU General Public License along
+You should have received a copy of the GNU General Public License along
-    with this program; if not, write to the Free Software Foundation, Inc.,
+with this program; if not, write to the Free Software Foundation, Inc.,
-    51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
+51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
-    See the full license in the file "LICENSE" in the top level distribution directory
+See the full license in the file "LICENSE" in the top level distribution
-    *************************************************************************************/
+directory
-    /*  END LEGAL */
+*************************************************************************************/
 /*  END LEGAL */
 #include <Grid.h>
 namespace Grid {
 namespace QCD {
-  const std::vector<int> WilsonFermionStatic::directions   ({0,1,2,3, 0, 1, 2, 3});
+const std::vector<int> WilsonFermionStatic::directions({0, 1, 2, 3, 0, 1, 2,
-  const std::vector<int> WilsonFermionStatic::displacements({1,1,1,1,-1,-1,-1,-1});
+                                                        3});
-  int WilsonFermionStatic::HandOptDslash;
+const std::vector<int> WilsonFermionStatic::displacements({1, 1, 1, 1, -1, -1,
                                                           -1, -1});
 int WilsonFermionStatic::HandOptDslash;
-  /////////////////////////////////
+/////////////////////////////////
-  // Constructor and gauge import
+// Constructor and gauge import
-  /////////////////////////////////
+/////////////////////////////////
-  template<class Impl>
+template <class Impl>
-  WilsonFermion<Impl>::WilsonFermion(GaugeField &_Umu,
+WilsonFermion<Impl>::WilsonFermion(GaugeField &_Umu, GridCartesian &Fgrid,
-				     GridCartesian         &Fgrid,
+                                   GridRedBlackCartesian &Hgrid, RealD _mass,
-				     GridRedBlackCartesian &Hgrid, 
+                                   const ImplParams &p)
-				     RealD _mass,const ImplParams &p) :
+    : Kernels(p),
-        Kernels(p),
+      _grid(&Fgrid),
-        _grid(&Fgrid),
+      _cbgrid(&Hgrid),
-	_cbgrid(&Hgrid),
+      Stencil(&Fgrid, npoint, Even, directions, displacements),
-	Stencil    (&Fgrid,npoint,Even,directions,displacements),
+      StencilEven(&Hgrid, npoint, Even, directions,
-	StencilEven(&Hgrid,npoint,Even,directions,displacements), // source is Even
+                  displacements),  // source is Even
-	StencilOdd (&Hgrid,npoint,Odd ,directions,displacements), // source is Odd
+      StencilOdd(&Hgrid, npoint, Odd, directions,
-	mass(_mass),
+                 displacements),  // source is Odd
-	Lebesgue(_grid),
+      mass(_mass),
-	LebesgueEvenOdd(_cbgrid),
+      Lebesgue(_grid),
-	Umu(&Fgrid),
+      LebesgueEvenOdd(_cbgrid),
-	UmuEven(&Hgrid),
+      Umu(&Fgrid),
-	UmuOdd (&Hgrid) 
+      UmuEven(&Hgrid),
-  {
+      UmuOdd(&Hgrid) {
-    // Allocate the required comms buffer
+  // Allocate the required comms buffer
-    ImportGauge(_Umu);
+  ImportGauge(_Umu);
 }
 template <class Impl>
 void WilsonFermion<Impl>::ImportGauge(const GaugeField &_Umu) {
  GaugeField HUmu(_Umu._grid);
  HUmu = _Umu * (-0.5);
  Impl::DoubleStore(GaugeGrid(), Umu, HUmu);
  pickCheckerboard(Even, UmuEven, Umu);
  pickCheckerboard(Odd, UmuOdd, Umu);
 }
 /////////////////////////////
 // Implement the interface
 /////////////////////////////
 template <class Impl>
 RealD WilsonFermion<Impl>::M(const FermionField &in, FermionField &out) {
  out.checkerboard = in.checkerboard;
  Dhop(in, out, DaggerNo);
  return axpy_norm(out, 4 + mass, in, out);
 }
 template <class Impl>
 RealD WilsonFermion<Impl>::Mdag(const FermionField &in, FermionField &out) {
  out.checkerboard = in.checkerboard;
  Dhop(in, out, DaggerYes);
  return axpy_norm(out, 4 + mass, in, out);
 }
 template <class Impl>
 void WilsonFermion<Impl>::Meooe(const FermionField &in, FermionField &out) {
  if (in.checkerboard == Odd) {
    DhopEO(in, out, DaggerNo);
  } else {
    DhopOE(in, out, DaggerNo);
  }
 }
-  template<class Impl>
+template <class Impl>
-  void WilsonFermion<Impl>::ImportGauge(const GaugeField &_Umu)
+void WilsonFermion<Impl>::MeooeDag(const FermionField &in, FermionField &out) {
-  {
+  if (in.checkerboard == Odd) {
-    GaugeField HUmu(_Umu._grid);
+    DhopEO(in, out, DaggerYes);
-    HUmu = _Umu*(-0.5);
+  } else {
-    Impl::DoubleStore(GaugeGrid(),Umu,HUmu);
+    DhopOE(in, out, DaggerYes);
    pickCheckerboard(Even,UmuEven,Umu);
    pickCheckerboard(Odd ,UmuOdd,Umu);
  }
 }
-  /////////////////////////////
+  template <class Impl>
  // Implement the interface
  /////////////////////////////
  template<class Impl>
  RealD WilsonFermion<Impl>::M(const FermionField &in, FermionField &out) 
  {
    out.checkerboard=in.checkerboard;
    Dhop(in,out,DaggerNo);
    return axpy_norm(out,4+mass,in,out);
  }
  template<class Impl>
  RealD WilsonFermion<Impl>::Mdag(const FermionField &in, FermionField &out) 
  {
    out.checkerboard=in.checkerboard;
    Dhop(in,out,DaggerYes);
    return axpy_norm(out,4+mass,in,out);
  }
  template<class Impl>
  void WilsonFermion<Impl>::Meooe(const FermionField &in, FermionField &out) 
  {
    if ( in.checkerboard == Odd ) {
      DhopEO(in,out,DaggerNo);
    } else {
      DhopOE(in,out,DaggerNo);
    }
  }
  template<class Impl>
  void WilsonFermion<Impl>::MeooeDag(const FermionField &in, FermionField &out) 
  {
    if ( in.checkerboard == Odd ) {
      DhopEO(in,out,DaggerYes);
    } else {
      DhopOE(in,out,DaggerYes);
    }
  }
  template<class Impl>
  void WilsonFermion<Impl>::Mooee(const FermionField &in, FermionField &out) {
    out.checkerboard = in.checkerboard;
-    typename FermionField::scalar_type scal(4.0+mass);
+    typename FermionField::scalar_type scal(4.0 + mass);
-    out = scal*in;
+    out = scal * in;
  }
-  template<class Impl>
+  template <class Impl>
  void WilsonFermion<Impl>::MooeeDag(const FermionField &in, FermionField &out) {
    out.checkerboard = in.checkerboard;
-    Mooee(in,out);
+    Mooee(in, out);
  }
  template<class Impl>
@@ -137,183 +136,236 @@ namespace QCD {
    MooeeInv(in,out);
  }
  ///////////////////////////////////
  // Internal
  ///////////////////////////////////
  template<class Impl>
-  void WilsonFermion<Impl>::DerivInternal(StencilImpl & st,
-					  DoubledGaugeField & U,
-					  GaugeField &mat,
-					  const FermionField &A,
-					  const FermionField &B,int dag) {
-    assert((dag==DaggerNo) ||(dag==DaggerYes));
-    Compressor compressor(dag);
-    FermionField Btilde(B._grid);
-    FermionField Atilde(B._grid);
-    Atilde = A;
-    st.HaloExchange(B,compressor);
-    for(int mu=0;mu<Nd;mu++){
-      ////////////////////////////////////////////////////////////////////////
-      // Flip gamma (1+g)<->(1-g) if dag
-      ////////////////////////////////////////////////////////////////////////
-      int gamma = mu;
-      if ( !dag ) gamma+= Nd;
-      ////////////////////////
-      // Call the single hop
-      ////////////////////////
-PARALLEL_FOR_LOOP
-	for(int sss=0;sss<B._grid->oSites();sss++){
-	  Kernels::DiracOptDhopDir(st,U,st.comm_buf,sss,sss,B,Btilde,mu,gamma);
-	}
-      //////////////////////////////////////////////////
-      // spin trace outer product
-      //////////////////////////////////////////////////
-      Impl::InsertForce4D(mat,Btilde,Atilde,mu);
+  void WilsonFermion<Impl>::MomentumSpacePropagator(FermionField &out, const FermionField &in,RealD _m) {
+    // what type LatticeComplex 
+    conformable(_grid,out._grid);
+    typedef typename FermionField::vector_type vector_type;
+    typedef typename FermionField::scalar_type ScalComplex;
+    typedef Lattice<iSinglet<vector_type> > LatComplex;
+    Gamma::GammaMatrix Gmu [] = {
+      Gamma::GammaX,
+      Gamma::GammaY,
+      Gamma::GammaZ,
+      Gamma::GammaT
+    };
+    std::vector<int> latt_size   = _grid->_fdimensions;
+    FermionField   num  (_grid); num  = zero;
+    LatComplex    wilson(_grid); wilson= zero;
+    LatComplex     one  (_grid); one = ScalComplex(1.0,0.0);
+    LatComplex denom(_grid); denom= zero;
+    LatComplex kmu(_grid); 
+    ScalComplex ci(0.0,1.0);
+    // momphase = n * 2pi / L
+    for(int mu=0;mu<Nd;mu++) {
+      LatticeCoordinate(kmu,mu);
+      RealD TwoPiL =  M_PI * 2.0/ latt_size[mu];
+      kmu = TwoPiL * kmu;
+      wilson = wilson + 2.0*sin(kmu*0.5)*sin(kmu*0.5); // Wilson term
+      num = num - sin(kmu)*ci*(Gamma(Gmu[mu])*in);    // derivative term
+      denom=denom + sin(kmu)*sin(kmu);
     }
+    wilson = wilson + _m;     // 2 sin^2 k/2 + m
+    num   = num + wilson*in;     // -i gmu sin k + 2 sin^2 k/2 + m
+    denom= denom+wilson*wilson; // sin^2 k + (2 sin^2 k/2 + m)^2
+    denom= one/denom;
+    out = num*denom; // [ -i gmu sin k + 2 sin^2 k/2 + m] / [ sin^2 k + (2 sin^2 k/2 + m)^2 ]
   }
 ///////////////////////////////////
 // Internal
 ///////////////////////////////////
 template <class Impl>
 void WilsonFermion<Impl>::DerivInternal(StencilImpl &st, DoubledGaugeField &U,
                                        GaugeField &mat, const FermionField &A,
                                        const FermionField &B, int dag) {
  assert((dag == DaggerNo) || (dag == DaggerYes));
  Compressor compressor(dag);
  FermionField Btilde(B._grid);
  FermionField Atilde(B._grid);
  Atilde = A;
  st.HaloExchange(B, compressor);
  for (int mu = 0; mu < Nd; mu++) {
    ////////////////////////////////////////////////////////////////////////
    // Flip gamma (1+g)<->(1-g) if dag
    ////////////////////////////////////////////////////////////////////////
    int gamma = mu;
    if (!dag) gamma += Nd;
    ////////////////////////
    // Call the single hop
    ////////////////////////
    PARALLEL_FOR_LOOP
    for (int sss = 0; sss < B._grid->oSites(); sss++) {
      Kernels::DiracOptDhopDir(st, U, st.CommBuf(), sss, sss, B, Btilde, mu,
                               gamma);
    }
    //////////////////////////////////////////////////
    // spin trace outer product
    //////////////////////////////////////////////////
    Impl::InsertForce4D(mat, Btilde, Atilde, mu);
  }
 }
 template <class Impl>
 void WilsonFermion<Impl>::DhopDeriv(GaugeField &mat, const FermionField &U,
                                    const FermionField &V, int dag) {
  conformable(U._grid, _grid);
  conformable(U._grid, V._grid);
  conformable(U._grid, mat._grid);
  mat.checkerboard = U.checkerboard;
  DerivInternal(Stencil, Umu, mat, U, V, dag);
 }
 template <class Impl>
 void WilsonFermion<Impl>::DhopDerivOE(GaugeField &mat, const FermionField &U,
                                      const FermionField &V, int dag) {
  conformable(U._grid, _cbgrid);
  conformable(U._grid, V._grid);
  conformable(U._grid, mat._grid);
  assert(V.checkerboard == Even);
  assert(U.checkerboard == Odd);
  mat.checkerboard = Odd;
  DerivInternal(StencilEven, UmuOdd, mat, U, V, dag);
 }
 template <class Impl>
 void WilsonFermion<Impl>::DhopDerivEO(GaugeField &mat, const FermionField &U,
                                      const FermionField &V, int dag) {
  conformable(U._grid, _cbgrid);
  conformable(U._grid, V._grid);
  conformable(U._grid, mat._grid);
  assert(V.checkerboard == Odd);
  assert(U.checkerboard == Even);
  mat.checkerboard = Even;
  DerivInternal(StencilOdd, UmuEven, mat, U, V, dag);
 }
 template <class Impl>
 void WilsonFermion<Impl>::Dhop(const FermionField &in, FermionField &out,
                               int dag) {
  conformable(in._grid, _grid);  // verifies full grid
  conformable(in._grid, out._grid);
  out.checkerboard = in.checkerboard;
  DhopInternal(Stencil, Lebesgue, Umu, in, out, dag);
 }
 template <class Impl>
 void WilsonFermion<Impl>::DhopOE(const FermionField &in, FermionField &out,
                                 int dag) {
  conformable(in._grid, _cbgrid);    // verifies half grid
  conformable(in._grid, out._grid);  // drops the cb check
  assert(in.checkerboard == Even);
  out.checkerboard = Odd;
  DhopInternal(StencilEven, LebesgueEvenOdd, UmuOdd, in, out, dag);
 }
 template <class Impl>
 void WilsonFermion<Impl>::DhopEO(const FermionField &in, FermionField &out,
                                 int dag) {
  conformable(in._grid, _cbgrid);    // verifies half grid
  conformable(in._grid, out._grid);  // drops the cb check
  assert(in.checkerboard == Odd);
  out.checkerboard = Even;
  DhopInternal(StencilOdd, LebesgueEvenOdd, UmuEven, in, out, dag);
 }
 template <class Impl>
 void WilsonFermion<Impl>::Mdir(const FermionField &in, FermionField &out,
                               int dir, int disp) {
  DhopDir(in, out, dir, disp);
 }
 template <class Impl>
 void WilsonFermion<Impl>::DhopDir(const FermionField &in, FermionField &out,
                                  int dir, int disp) {
  int skip = (disp == 1) ? 0 : 1;
  int dirdisp = dir + skip * 4;
  int gamma = dir + (1 - skip) * 4;
  DhopDirDisp(in, out, dirdisp, gamma, DaggerNo);
 };
 template <class Impl>
 void WilsonFermion<Impl>::DhopDirDisp(const FermionField &in, FermionField &out,
                                      int dirdisp, int gamma, int dag) {
  Compressor compressor(dag);
  Stencil.HaloExchange(in, compressor);
  PARALLEL_FOR_LOOP
  for (int sss = 0; sss < in._grid->oSites(); sss++) {
    Kernels::DiracOptDhopDir(Stencil, Umu, Stencil.CommBuf(), sss, sss, in, out,
                             dirdisp, gamma);
  }
 };
 template <class Impl>
 void WilsonFermion<Impl>::DhopInternal(StencilImpl &st, LebesgueOrder &lo,
                                       DoubledGaugeField &U,
                                       const FermionField &in,
                                       FermionField &out, int dag) {
  assert((dag == DaggerNo) || (dag == DaggerYes));
  Compressor compressor(dag);
  st.HaloExchange(in, compressor);
  if (dag == DaggerYes) {
    PARALLEL_FOR_LOOP
    for (int sss = 0; sss < in._grid->oSites(); sss++) {
      Kernels::DiracOptDhopSiteDag(st, lo, U, st.CommBuf(), sss, sss, 1, 1, in,
                                   out);
    }
  } else {
    PARALLEL_FOR_LOOP
    for (int sss = 0; sss < in._grid->oSites(); sss++) {
      Kernels::DiracOptDhopSite(st, lo, U, st.CommBuf(), sss, sss, 1, 1, in,
                                out);
    }
  }
 };
-  template<class Impl>
+FermOpTemplateInstantiate(WilsonFermion);
-  void WilsonFermion<Impl>::DhopDeriv(GaugeField &mat,const FermionField &U,const FermionField &V,int dag)
+AdjointFermOpTemplateInstantiate(WilsonFermion);
-  {
+TwoIndexFermOpTemplateInstantiate(WilsonFermion);
-    conformable(U._grid,_grid);  
+GparityFermOpTemplateInstantiate(WilsonFermion);
-    conformable(U._grid,V._grid);
+}
-    conformable(U._grid,mat._grid);
+}
    mat.checkerboard = U.checkerboard;
    DerivInternal(Stencil,Umu,mat,U,V,dag);
  }
  template<class Impl>
  void WilsonFermion<Impl>::DhopDerivOE(GaugeField &mat,const FermionField &U,const FermionField &V,int dag)
  {
    conformable(U._grid,_cbgrid);  
    conformable(U._grid,V._grid);
    conformable(U._grid,mat._grid);
    assert(V.checkerboard==Even);
    assert(U.checkerboard==Odd);
    mat.checkerboard = Odd;
    DerivInternal(StencilEven,UmuOdd,mat,U,V,dag);
  }
  template<class Impl>
  void WilsonFermion<Impl>::DhopDerivEO(GaugeField &mat,const FermionField &U,const FermionField &V,int dag)
  {
    conformable(U._grid,_cbgrid);  
    conformable(U._grid,V._grid);
    conformable(U._grid,mat._grid);
    assert(V.checkerboard==Odd);
    assert(U.checkerboard==Even);
    mat.checkerboard = Even;
    DerivInternal(StencilOdd,UmuEven,mat,U,V,dag);
  }
  template<class Impl>
  void WilsonFermion<Impl>::Dhop(const FermionField &in, FermionField &out,int dag) {
    conformable(in._grid,_grid); // verifies full grid
    conformable(in._grid,out._grid);
    out.checkerboard = in.checkerboard;
    DhopInternal(Stencil,Lebesgue,Umu,in,out,dag);
  }
  template<class Impl>
  void WilsonFermion<Impl>::DhopOE(const FermionField &in, FermionField &out,int dag) {
    conformable(in._grid,_cbgrid);    // verifies half grid
    conformable(in._grid,out._grid); // drops the cb check
    assert(in.checkerboard==Even);
    out.checkerboard = Odd;
    DhopInternal(StencilEven,LebesgueEvenOdd,UmuOdd,in,out,dag);
  }
  template<class Impl>
  void WilsonFermion<Impl>::DhopEO(const FermionField &in, FermionField &out,int dag) {
    conformable(in._grid,_cbgrid);    // verifies half grid
    conformable(in._grid,out._grid); // drops the cb check
    assert(in.checkerboard==Odd);
    out.checkerboard = Even;
    DhopInternal(StencilOdd,LebesgueEvenOdd,UmuEven,in,out,dag);
  }
  template<class Impl>
  void WilsonFermion<Impl>::Mdir (const FermionField &in, FermionField &out,int dir,int disp) {
    DhopDir(in,out,dir,disp);
  }
  template<class Impl>
  void WilsonFermion<Impl>::DhopDir(const FermionField &in, FermionField &out,int dir,int disp){
    int skip = (disp==1) ? 0 : 1;
    int dirdisp  = dir+skip*4;
    int gamma    = dir+(1-skip)*4;
    DhopDirDisp(in,out,dirdisp,gamma,DaggerNo);
  };
  template<class Impl>
  void WilsonFermion<Impl>::DhopDirDisp(const FermionField &in, FermionField &out,int dirdisp,int gamma,int dag) {
    Compressor compressor(dag);
    Stencil.HaloExchange(in,compressor);
 PARALLEL_FOR_LOOP
      for(int sss=0;sss<in._grid->oSites();sss++){
 	Kernels::DiracOptDhopDir(Stencil,Umu,Stencil.comm_buf,sss,sss,in,out,dirdisp,gamma);
      }
  };
  template<class Impl>
  void WilsonFermion<Impl>::DhopInternal(StencilImpl & st,LebesgueOrder& lo,DoubledGaugeField & U,
 					 const FermionField &in, FermionField &out,int dag) 
  {
    assert((dag==DaggerNo) ||(dag==DaggerYes));
    Compressor compressor(dag);
    st.HaloExchange(in,compressor);
    if ( dag == DaggerYes ) {
 PARALLEL_FOR_LOOP
      for(int sss=0;sss<in._grid->oSites();sss++){
 	Kernels::DiracOptDhopSiteDag(st,lo,U,st.comm_buf,sss,sss,1,1,in,out);
      }
    } else {
 PARALLEL_FOR_LOOP
      for(int sss=0;sss<in._grid->oSites();sss++){
 	Kernels::DiracOptDhopSite(st,lo,U,st.comm_buf,sss,sss,1,1,in,out);
      }
    }
  };
  FermOpTemplateInstantiate(WilsonFermion);
  GparityFermOpTemplateInstantiate(WilsonFermion);
 }}
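Note on `MomentumSpacePropagator` in `WilsonFermion.cc`: per its own comments, the routine builds `[ -i gmu sin k + 2 sin^2 k/2 + m ] / [ sin^2 k + (2 sin^2 k/2 + m)^2 ]` with `k_mu = 2 pi n_mu / L_mu`, summing over `mu`. The sketch below evaluates only the scalar pieces (the `wilson` and `denom` terms) for a single momentum mode in plain C++. It is an illustration under stated assumptions, not Grid code: `FreeWilsonTerms` and `freeWilsonTerms` are hypothetical names, and the gamma-matrix part of the numerator is left out.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper type for this sketch; not part of Grid.
struct FreeWilsonTerms {
  double wilson;  // sum_mu 2 sin^2(k_mu/2) + m
  double denom;   // sum_mu sin^2(k_mu) + wilson^2
};

// n[mu] is the integer mode number, L[mu] the lattice extent, m the mass.
// Mirrors the loop in MomentumSpacePropagator for one momentum mode only.
FreeWilsonTerms freeWilsonTerms(const std::vector<int> &n,
                                const std::vector<int> &L, double m) {
  assert(n.size() == L.size());
  const double twoPi = 2.0 * std::acos(-1.0);
  double wilson = 0.0;
  double sinSq = 0.0;
  for (std::size_t mu = 0; mu < n.size(); ++mu) {
    assert(L[mu] > 0);
    const double k = twoPi * n[mu] / L[mu];  // momphase = n * 2pi / L
    const double sHalf = std::sin(0.5 * k);
    wilson += 2.0 * sHalf * sHalf;           // Wilson term
    sinSq += std::sin(k) * std::sin(k);      // derivative term squared
  }
  wilson += m;                               // 2 sin^2 k/2 + m
  return {wilson, sinSq + wilson * wilson};  // sin^2 k + (2 sin^2 k/2 + m)^2
}
```

At zero momentum every sine vanishes, so `wilson` reduces to `m` and `denom` to `m*m`. That gives a quick sanity check for any implementation of the formula.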
--- a/lib/qcd/action/fermion/WilsonFermion.h
+++ b/lib/qcd/action/fermion/WilsonFermion.h
@@ -1,161 +1,154 @@
-    /*************************************************************************************
+/*************************************************************************************
-    Grid physics library, www.github.com/paboyle/Grid 
+Grid physics library, www.github.com/paboyle/Grid
-    Source file: ./lib/qcd/action/fermion/WilsonFermion.h
+Source file: ./lib/qcd/action/fermion/WilsonFermion.h
-    Copyright (C) 2015
+Copyright (C) 2015
 Author: Peter Boyle <pabobyle@ph.ed.ac.uk>
 Author: Peter Boyle <paboyle@ph.ed.ac.uk>
 Author: paboyle <paboyle@ph.ed.ac.uk>
-    This program is free software; you can redistribute it and/or modify
+This program is free software; you can redistribute it and/or modify
-    it under the terms of the GNU General Public License as published by
+it under the terms of the GNU General Public License as published by
-    the Free Software Foundation; either version 2 of the License, or
+the Free Software Foundation; either version 2 of the License, or
-    (at your option) any later version.
+(at your option) any later version.
-    This program is distributed in the hope that it will be useful,
+This program is distributed in the hope that it will be useful,
-    but WITHOUT ANY WARRANTY; without even the implied warranty of
+but WITHOUT ANY WARRANTY; without even the implied warranty of
-    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
-    GNU General Public License for more details.
+GNU General Public License for more details.
-    You should have received a copy of the GNU General Public License along
+You should have received a copy of the GNU General Public License along
-    with this program; if not, write to the Free Software Foundation, Inc.,
+with this program; if not, write to the Free Software Foundation, Inc.,
-    51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
+51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
-    See the full license in the file "LICENSE" in the top level distribution directory
+See the full license in the file "LICENSE" in the top level distribution
-    *************************************************************************************/
+directory
-    /*  END LEGAL */
+*************************************************************************************/
-#ifndef  GRID_QCD_WILSON_FERMION_H
+/*  END LEGAL */
-#define  GRID_QCD_WILSON_FERMION_H
+#ifndef GRID_QCD_WILSON_FERMION_H
 #define GRID_QCD_WILSON_FERMION_H
 namespace Grid {
-  namespace QCD {
+namespace QCD {
-    class WilsonFermionStatic {
+class WilsonFermionStatic {
-    public:
+ public:
-      static int HandOptDslash; // these are a temporary hack
+  static int HandOptDslash;  // these are a temporary hack
-      static int MortonOrder;
+  static int MortonOrder;
-      static const std::vector<int> directions   ;
+  static const std::vector<int> directions;
-      static const std::vector<int> displacements;
+  static const std::vector<int> displacements;
-      static const int npoint=8;
+  static const int npoint = 8;
-    };
+};
-    template<class Impl>
+template <class Impl>
-    class WilsonFermion : public WilsonKernels<Impl>, public WilsonFermionStatic
+class WilsonFermion : public WilsonKernels<Impl>, public WilsonFermionStatic {
-    {
+ public:
-    public:
+  INHERIT_IMPL_TYPES(Impl);
-    INHERIT_IMPL_TYPES(Impl);
+  typedef WilsonKernels<Impl> Kernels;
    typedef WilsonKernels<Impl> Kernels;
-      ///////////////////////////////////////////////////////////////
+  ///////////////////////////////////////////////////////////////
-      // Implement the abstract base
+  // Implement the abstract base
-      ///////////////////////////////////////////////////////////////
+  ///////////////////////////////////////////////////////////////
-      GridBase *GaugeGrid(void)              { return _grid ;}
+  GridBase *GaugeGrid(void) { return _grid; }
-      GridBase *GaugeRedBlackGrid(void)      { return _cbgrid ;}
+  GridBase *GaugeRedBlackGrid(void) { return _cbgrid; }
-      GridBase *FermionGrid(void)            { return _grid;}
+  GridBase *FermionGrid(void) { return _grid; }
-      GridBase *FermionRedBlackGrid(void)    { return _cbgrid;}
+  GridBase *FermionRedBlackGrid(void) { return _cbgrid; }
-      //////////////////////////////////////////////////////////////////
+  //////////////////////////////////////////////////////////////////
-      // override multiply; cut number routines if pass dagger argument
+  // override multiply; cut number routines if pass dagger argument
-      // and also make interface more uniformly consistent
+  // and also make interface more uniformly consistent
-      //////////////////////////////////////////////////////////////////
+  //////////////////////////////////////////////////////////////////
-      RealD M(const FermionField &in, FermionField &out);
+  RealD M(const FermionField &in, FermionField &out);
-      RealD Mdag(const FermionField &in, FermionField &out);
+  RealD Mdag(const FermionField &in, FermionField &out);
-      /////////////////////////////////////////////////////////
+  /////////////////////////////////////////////////////////
-      // half checkerboard operations
+  // half checkerboard operations
-      // could remain virtual so we  can derive Clover from Wilson base
+  // could remain virtual so we  can derive Clover from Wilson base
-      /////////////////////////////////////////////////////////
+  /////////////////////////////////////////////////////////
-      void Meooe(const FermionField &in, FermionField &out) ;
+  void Meooe(const FermionField &in, FermionField &out);
-      void MeooeDag(const FermionField &in, FermionField &out) ;
+  void MeooeDag(const FermionField &in, FermionField &out);
-      // allow override for twisted mass and clover
+  // allow override for twisted mass and clover
-      virtual void Mooee(const FermionField &in, FermionField &out) ;
+  virtual void Mooee(const FermionField &in, FermionField &out);
-      virtual void MooeeDag(const FermionField &in, FermionField &out) ;
+  virtual void MooeeDag(const FermionField &in, FermionField &out);
-      virtual void MooeeInv(const FermionField &in, FermionField &out) ;
+  virtual void MooeeInv(const FermionField &in, FermionField &out);
-      virtual void MooeeInvDag(const FermionField &in, FermionField &out) ;
+  virtual void MooeeInvDag(const FermionField &in, FermionField &out);
-      ////////////////////////
-      // Derivative interface
-      ////////////////////////
-      // Interface calls an internal routine
-      void DhopDeriv(GaugeField &mat,const FermionField &U,const FermionField &V,int dag);
-      void DhopDerivOE(GaugeField &mat,const FermionField &U,const FermionField &V,int dag);
-      void DhopDerivEO(GaugeField &mat,const FermionField &U,const FermionField &V,int dag);
+  virtual void  MomentumSpacePropagator(FermionField &out,const FermionField &in,RealD _mass) ;
+
+  ////////////////////////
+  // Derivative interface
+  ////////////////////////
+  // Interface calls an internal routine
+  void DhopDeriv(GaugeField &mat,const FermionField &U,const FermionField &V,int dag);
+  void DhopDerivOE(GaugeField &mat,const FermionField &U,const FermionField &V,int dag);
+  void DhopDerivEO(GaugeField &mat,const FermionField &U,const FermionField &V,int dag);
  ///////////////////////////////////////////////////////////////
  // non-hermitian hopping term; half cb or both
  ///////////////////////////////////////////////////////////////
  void Dhop(const FermionField &in, FermionField &out, int dag);
  void DhopOE(const FermionField &in, FermionField &out, int dag);
  void DhopEO(const FermionField &in, FermionField &out, int dag);
  ///////////////////////////////////////////////////////////////
  // Multigrid assistance; force term uses too
  ///////////////////////////////////////////////////////////////
  void Mdir(const FermionField &in, FermionField &out, int dir, int disp);
  void DhopDir(const FermionField &in, FermionField &out, int dir, int disp);
  void DhopDirDisp(const FermionField &in, FermionField &out, int dirdisp,
                   int gamma, int dag);
  ///////////////////////////////////////////////////////////////
  // Extra methods added by derived
  ///////////////////////////////////////////////////////////////
  void DerivInternal(StencilImpl &st, DoubledGaugeField &U, GaugeField &mat,
                     const FermionField &A, const FermionField &B, int dag);
  void DhopInternal(StencilImpl &st, LebesgueOrder &lo, DoubledGaugeField &U,
                    const FermionField &in, FermionField &out, int dag);
  // Constructor
  WilsonFermion(GaugeField &_Umu, GridCartesian &Fgrid,
                GridRedBlackCartesian &Hgrid, RealD _mass,
                const ImplParams &p = ImplParams());
  // DoubleStore impl dependent
  void ImportGauge(const GaugeField &_Umu);
  ///////////////////////////////////////////////////////////////
  // Data members require to support the functionality
  ///////////////////////////////////////////////////////////////
  //    protected:
 public:
  RealD mass;
  GridBase *_grid;
  GridBase *_cbgrid;
  // Defines the stencils for even and odd
  StencilImpl Stencil;
  StencilImpl StencilEven;
  StencilImpl StencilOdd;
  // Copy of the gauge field , with even and odd subsets
  DoubledGaugeField Umu;
  DoubledGaugeField UmuEven;
  DoubledGaugeField UmuOdd;
  LebesgueOrder Lebesgue;
  LebesgueOrder LebesgueEvenOdd;
 };
 typedef WilsonFermion<WilsonImplF> WilsonFermionF;
 typedef WilsonFermion<WilsonImplD> WilsonFermionD;
-      ///////////////////////////////////////////////////////////////
+}
      // non-hermitian hopping term; half cb or both
      ///////////////////////////////////////////////////////////////
      void Dhop(const FermionField &in, FermionField &out,int dag) ;
      void DhopOE(const FermionField &in, FermionField &out,int dag) ;
      void DhopEO(const FermionField &in, FermionField &out,int dag) ;
      ///////////////////////////////////////////////////////////////
      // Multigrid assistance; force term uses too
      ///////////////////////////////////////////////////////////////
      void Mdir (const FermionField &in, FermionField &out,int dir,int disp) ;
      void DhopDir(const FermionField &in, FermionField &out,int dir,int disp);
      void DhopDirDisp(const FermionField &in, FermionField &out,int dirdisp,int gamma,int dag) ;
      ///////////////////////////////////////////////////////////////
      // Extra methods added by derived
      ///////////////////////////////////////////////////////////////
      void DerivInternal(StencilImpl & st,
 			 DoubledGaugeField & U,
 			 GaugeField &mat,
 			 const FermionField &A,
 			 const FermionField &B,
 			 int dag);
      void DhopInternal(StencilImpl & st,LebesgueOrder & lo,DoubledGaugeField & U,
 			const FermionField &in, FermionField &out,int dag) ;
      // Constructor
      WilsonFermion(GaugeField &_Umu,
 		    GridCartesian         &Fgrid,
 		    GridRedBlackCartesian &Hgrid, 
 		    RealD _mass,
 		    const ImplParams &p= ImplParams()
 		    ) ;
      // DoubleStore impl dependent
      void ImportGauge(const GaugeField &_Umu);
      ///////////////////////////////////////////////////////////////
      // Data members require to support the functionality
      ///////////////////////////////////////////////////////////////
      //    protected:
    public:
      RealD                        mass;
      GridBase                     *    _grid; 
      GridBase                     *  _cbgrid;
      //Defines the stencils for even and odd
      StencilImpl Stencil; 
      StencilImpl StencilEven; 
      StencilImpl StencilOdd; 
      // Copy of the gauge field , with even and odd subsets
      DoubledGaugeField Umu;
      DoubledGaugeField UmuEven;
      DoubledGaugeField UmuOdd;
      LebesgueOrder Lebesgue;
      LebesgueOrder LebesgueEvenOdd;
    };
    typedef WilsonFermion<WilsonImplF> WilsonFermionF;
    typedef WilsonFermion<WilsonImplD> WilsonFermionD;
  }
 }
 #endif
--- a/lib/qcd/action/fermion/WilsonFermion5D.cc
+++ b/lib/qcd/action/fermion/WilsonFermion5D.cc
@@ -42,11 +42,11 @@ const std::vector<int> WilsonFermion5DStatic::displacements({1,1,1,1,-1,-1,-1,-1
  // 5d lattice for DWF.
 template<class Impl>
 WilsonFermion5D<Impl>::WilsonFermion5D(GaugeField &_Umu,
-				       GridCartesian         &FiveDimGrid,
+               GridCartesian         &FiveDimGrid,
-				       GridRedBlackCartesian &FiveDimRedBlackGrid,
+               GridRedBlackCartesian &FiveDimRedBlackGrid,
-				       GridCartesian         &FourDimGrid,
+               GridCartesian         &FourDimGrid,
-				       GridRedBlackCartesian &FourDimRedBlackGrid,
+               GridRedBlackCartesian &FourDimRedBlackGrid,
-				       RealD _M5,const ImplParams &p) :
+               RealD _M5,const ImplParams &p) :
  Kernels(p),
  _FiveDimGrid        (&FiveDimGrid),
  _FiveDimRedBlackGrid(&FiveDimRedBlackGrid),
@@ -135,10 +135,10 @@ WilsonFermion5D<Impl>::WilsonFermion5D(GaugeField &_Umu,
  /*
 template<class Impl>
 WilsonFermion5D<Impl>::WilsonFermion5D(int simd,GaugeField &_Umu,
-				       GridCartesian         &FiveDimGrid,
+               GridCartesian         &FiveDimGrid,
-				       GridRedBlackCartesian &FiveDimRedBlackGrid,
+               GridRedBlackCartesian &FiveDimRedBlackGrid,
-				       GridCartesian         &FourDimGrid,
+               GridCartesian         &FourDimGrid,
-				       RealD _M5,const ImplParams &p) :
+               RealD _M5,const ImplParams &p) :
 {
  int nsimd = Simd::Nsimd();
@@ -175,6 +175,66 @@ WilsonFermion5D<Impl>::WilsonFermion5D(int simd,GaugeField &_Umu,
 }  
  */
 template<class Impl>
 void WilsonFermion5D<Impl>::Report(void)
 {
    std::vector<int> latt = GridDefaultLatt();          
    RealD volume = Ls;  for(int mu=0;mu<Nd;mu++) volume=volume*latt[mu];
    RealD NP = _FourDimGrid->_Nprocessors;
  if ( DhopCalls > 0 ) {
    std::cout << GridLogMessage << "#### Dhop calls report " << std::endl;
    std::cout << GridLogMessage << "WilsonFermion5D Number of Dhop Calls     : " << DhopCalls   << std::endl;
    std::cout << GridLogMessage << "WilsonFermion5D Total Communication time : " << DhopCommTime<< " us" << std::endl;
    std::cout << GridLogMessage << "WilsonFermion5D CommTime/Calls           : " << DhopCommTime / DhopCalls << " us" << std::endl;
    std::cout << GridLogMessage << "WilsonFermion5D Total Compute time       : " << DhopComputeTime << " us" << std::endl;
    std::cout << GridLogMessage << "WilsonFermion5D ComputeTime/Calls        : " << DhopComputeTime / DhopCalls << " us" << std::endl;
    RealD mflops = 1344*volume*DhopCalls/DhopComputeTime/2; // 2 for red black counting
    std::cout << GridLogMessage << "Average mflops/s per call                : " << mflops << std::endl;
    std::cout << GridLogMessage << "Average mflops/s per call per rank       : " << mflops/NP << std::endl;
   }
  if ( DerivCalls > 0 ) {
    std::cout << GridLogMessage << "#### Deriv calls report "<< std::endl;
    std::cout << GridLogMessage << "WilsonFermion5D Number of Deriv Calls    : " <<DerivCalls <<std::endl;
    std::cout << GridLogMessage << "WilsonFermion5D Total Communication time : " <<DerivCommTime <<" us"<<std::endl;
    std::cout << GridLogMessage << "WilsonFermion5D CommTime/Calls           : " <<DerivCommTime/DerivCalls<<" us" <<std::endl;
    std::cout << GridLogMessage << "WilsonFermion5D Total Compute time       : " <<DerivComputeTime <<" us"<<std::endl;
    std::cout << GridLogMessage << "WilsonFermion5D ComputeTime/Calls        : " <<DerivComputeTime/DerivCalls<<" us" <<std::endl;
    std::cout << GridLogMessage << "WilsonFermion5D Total Dhop Compute time  : " <<DerivDhopComputeTime <<" us"<<std::endl;
    std::cout << GridLogMessage << "WilsonFermion5D Dhop ComputeTime/Calls   : " <<DerivDhopComputeTime/DerivCalls<<" us" <<std::endl;
    RealD mflops = 144*volume*DerivCalls/DerivDhopComputeTime;
    std::cout << GridLogMessage << "Average mflops/s per call                : " << mflops << std::endl;
    std::cout << GridLogMessage << "Average mflops/s per call per node       : " << mflops/NP << std::endl;
  }
  if (DerivCalls > 0 || DhopCalls > 0){
    std::cout << GridLogMessage << "WilsonFermion5D Stencil"<<std::endl;  Stencil.Report();
    std::cout << GridLogMessage << "WilsonFermion5D StencilEven"<<std::endl;  StencilEven.Report();
    std::cout << GridLogMessage << "WilsonFermion5D StencilOdd"<<std::endl;  StencilOdd.Report();
  }
 }
 template<class Impl>
 void WilsonFermion5D<Impl>::ZeroCounters(void) {
  DhopCalls       = 0;
  DhopCommTime    = 0;
  DhopComputeTime = 0;
  DerivCalls       = 0;
  DerivCommTime    = 0;
  DerivComputeTime = 0;
  DerivDhopComputeTime = 0;
  Stencil.ZeroCounters();
  StencilEven.ZeroCounters();
  StencilOdd.ZeroCounters();
 }
 template<class Impl>
 void WilsonFermion5D<Impl>::ImportGauge(const GaugeField &_Umu)
 {
@@ -208,19 +268,20 @@ PARALLEL_FOR_LOOP
    for(int s=0;s<Ls;s++){
      int sU=ss;
      int sF = s+Ls*sU; 
-      Kernels::DiracOptDhopDir(Stencil,Umu,Stencil.comm_buf,sF,sU,in,out,dirdisp,gamma);
+      Kernels::DiracOptDhopDir(Stencil,Umu,Stencil.CommBuf(),sF,sU,in,out,dirdisp,gamma);
    }
  }
 };
 template<class Impl>
 void WilsonFermion5D<Impl>::DerivInternal(StencilImpl & st,
-					  DoubledGaugeField & U,
+            DoubledGaugeField & U,
-					  GaugeField &mat,
+            GaugeField &mat,
-					  const FermionField &A,
+            const FermionField &A,
-					  const FermionField &B,
+            const FermionField &B,
-					  int dag)
+            int dag)
 {
  DerivCalls++;
  assert((dag==DaggerNo) ||(dag==DaggerYes));
  conformable(st._grid,A._grid);
@@ -231,51 +292,52 @@ void WilsonFermion5D<Impl>::DerivInternal(StencilImpl & st,
  FermionField Btilde(B._grid);
  FermionField Atilde(B._grid);
  DerivCommTime-=usecond();
  st.HaloExchange(B,compressor);
  DerivCommTime+=usecond();
  Atilde=A;
-  for(int mu=0;mu<Nd;mu++){
+  DerivComputeTime-=usecond();
-      
+  for (int mu = 0; mu < Nd; mu++) {
    ////////////////////////////////////////////////////////////////////////
    // Flip gamma if dag
    ////////////////////////////////////////////////////////////////////////
    int gamma = mu;
-    if ( !dag ) gamma+= Nd;
+    if (!dag) gamma += Nd;
    ////////////////////////
    // Call the single hop
    ////////////////////////
-PARALLEL_FOR_LOOP
+    DerivDhopComputeTime -= usecond();
-    for(int sss=0;sss<U._grid->oSites();sss++){
+    PARALLEL_FOR_LOOP
-      for(int s=0;s<Ls;s++){
+    for (int sss = 0; sss < U._grid->oSites(); sss++) {
-	int sU=sss;
+      for (int s = 0; s < Ls; s++) {
-	int sF = s+Ls*sU;
+        int sU = sss;
        int sF = s + Ls * sU;
-	assert ( sF< B._grid->oSites());
+        assert(sF < B._grid->oSites());
-	assert ( sU< U._grid->oSites());
+        assert(sU < U._grid->oSites());
-	Kernels::DiracOptDhopDir(st,U,st.comm_buf,sF,sU,B,Btilde,mu,gamma);
+        Kernels::DiracOptDhopDir(st, U, st.CommBuf(), sF, sU, B, Btilde, mu, gamma);
    ////////////////////////////
    // spin trace outer product
    ////////////////////////////
        ////////////////////////////
        // spin trace outer product
        ////////////////////////////
      }
    }
-
+    DerivDhopComputeTime += usecond();
-    Impl::InsertForce5D(mat,Btilde,Atilde,mu);
+    Impl::InsertForce5D(mat, Btilde, Atilde, mu);
  }
  DerivComputeTime += usecond();
 }
 template<class Impl>
-void WilsonFermion5D<Impl>::DhopDeriv(      GaugeField &mat,
+void WilsonFermion5D<Impl>::DhopDeriv(GaugeField &mat,
-					    const FermionField &A,
+				      const FermionField &A,
-					    const FermionField &B,
+				      const FermionField &B,
-					    int dag)
+				      int dag)
 {
  conformable(A._grid,FermionGrid());  
  conformable(A._grid,B._grid);
@@ -306,9 +368,9 @@ void WilsonFermion5D<Impl>::DhopDerivEO(GaugeField &mat,
 template<class Impl>
 void WilsonFermion5D<Impl>::DhopDerivOE(GaugeField &mat,
-				  const FermionField &A,
+					const FermionField &A,
-				  const FermionField &B,
+					const FermionField &B,
-				  int dag)
+					int dag)
 {
  conformable(A._grid,FermionRedBlackGrid());
  conformable(GaugeRedBlackGrid(),mat._grid);
@@ -331,30 +393,56 @@ void WilsonFermion5D<Impl>::DhopInternal(StencilImpl & st, LebesgueOrder &lo,
  int LLs = in._grid->_rdimensions[0];
  DhopCommTime-=usecond();
  st.HaloExchange(in,compressor);
  DhopCommTime+=usecond();
  DhopComputeTime-=usecond();
  // Dhop takes the 4d grid from U, and makes a 5d index for fermion
-  if ( dag == DaggerYes ) {
+  if (dag == DaggerYes) {
-PARALLEL_FOR_LOOP
+    PARALLEL_FOR_LOOP
-    for(int ss=0;ss<U._grid->oSites();ss++){
+    for (int ss = 0; ss < U._grid->oSites(); ss++) {
-	int sU=ss;
+      int sU = ss;
-	int sF=LLs*sU;
+      int sF = LLs * sU;
-	Kernels::DiracOptDhopSiteDag(st,lo,U,st.comm_buf,sF,sU,LLs,1,in,out);
+      Kernels::DiracOptDhopSiteDag(st, lo, U, st.CommBuf(), sF, sU, LLs, 1, in, out);
    }
-  } else {
+#ifdef AVX512
-PARALLEL_FOR_LOOP
+  } else if (stat.is_init() ) {
-    for(int ss=0;ss<U._grid->oSites();ss++){
+
    int nthreads;
    stat.start();
 #pragma omp parallel
    {
 #pragma omp master
    nthreads = omp_get_num_threads();
    int mythread = omp_get_thread_num();
    stat.enter(mythread);
 #pragma omp for nowait
    for(int ss=0;ss<U._grid->oSites();ss++) {
      int sU=ss;
      int sF=LLs*sU;
-      Kernels::DiracOptDhopSite(st,lo,U,st.comm_buf,sF,sU,LLs,1,in,out);
+      Kernels::DiracOptDhopSite(st,lo,U,st.CommBuf(),sF,sU,LLs,1,in,out);
    }
    stat.exit(mythread);
    }
    stat.accum(nthreads);
 #endif
  } else {
    PARALLEL_FOR_LOOP
    for (int ss = 0; ss < U._grid->oSites(); ss++) {
      int sU = ss;
      int sF = LLs * sU;
      Kernels::DiracOptDhopSite(st,lo,U,st.CommBuf(),sF,sU,LLs,1,in,out);
    }
  }
  DhopComputeTime+=usecond();
 }
 template<class Impl>
 void WilsonFermion5D<Impl>::DhopOE(const FermionField &in, FermionField &out,int dag)
 {
  DhopCalls++;
  conformable(in._grid,FermionRedBlackGrid());    // verifies half grid
  conformable(in._grid,out._grid); // drops the cb check
@@ -366,6 +454,7 @@ void WilsonFermion5D<Impl>::DhopOE(const FermionField &in, FermionField &out,int
 template<class Impl>
 void WilsonFermion5D<Impl>::DhopEO(const FermionField &in, FermionField &out,int dag)
 {
  DhopCalls++;
  conformable(in._grid,FermionRedBlackGrid());    // verifies half grid
  conformable(in._grid,out._grid); // drops the cb check
@@ -377,6 +466,7 @@ void WilsonFermion5D<Impl>::DhopEO(const FermionField &in, FermionField &out,int
 template<class Impl>
 void WilsonFermion5D<Impl>::Dhop(const FermionField &in, FermionField &out,int dag)
 {
  DhopCalls+=2;
  conformable(in._grid,FermionGrid()); // verifies full grid
  conformable(in._grid,out._grid);
@@ -392,6 +482,148 @@ void WilsonFermion5D<Impl>::DW(const FermionField &in, FermionField &out,int dag
  axpy(out,4.0-M5,in,out);
 }
 template<class Impl>
 void WilsonFermion5D<Impl>::MomentumSpacePropagatorHt(FermionField &out,const FermionField &in, RealD mass) 
 {
  // what type LatticeComplex 
  GridBase *_grid = _FourDimGrid;
  conformable(_grid,out._grid);
  typedef typename FermionField::vector_type vector_type;
  typedef typename FermionField::scalar_type ScalComplex;
  typedef iSinglet<ScalComplex> Tcomplex;
  typedef Lattice<iSinglet<vector_type> > LatComplex;
  Gamma::GammaMatrix Gmu [] = {
    Gamma::GammaX,
    Gamma::GammaY,
    Gamma::GammaZ,
    Gamma::GammaT
  };
  std::vector<int> latt_size   = _grid->_fdimensions;
  FermionField   num  (_grid); num  = zero;
  LatComplex    sk(_grid);  sk = zero;
  LatComplex    sk2(_grid); sk2= zero;
  LatComplex    W(_grid); W= zero;
  LatComplex    a(_grid); a= zero;
  LatComplex    one  (_grid); one = ScalComplex(1.0,0.0);
  LatComplex denom(_grid); denom= zero;
  LatComplex cosha(_grid); 
  LatComplex kmu(_grid); 
  LatComplex Wea(_grid); 
  LatComplex Wema(_grid); 
  ScalComplex ci(0.0,1.0);
  for(int mu=0;mu<Nd;mu++) {
    LatticeCoordinate(kmu,mu);
    RealD TwoPiL =  M_PI * 2.0/ latt_size[mu];
    kmu = TwoPiL * kmu;
    sk2 = sk2 + 2.0*sin(kmu*0.5)*sin(kmu*0.5);
    sk  = sk  +     sin(kmu)    *sin(kmu); 
    num = num - sin(kmu)*ci*(Gamma(Gmu[mu])*in);
  }
  W = one - M5 + sk2;
  ////////////////////////////////////////////
  // Cosh alpha -> alpha
  ////////////////////////////////////////////
  cosha =  (one + W*W + sk) / (W*2.0);
  // FIXME Need a Lattice acosh
  for(int idx=0;idx<_grid->lSites();idx++){
    std::vector<int> lcoor(Nd);
    Tcomplex cc;
    RealD sgn;
    _grid->LocalIndexToLocalCoor(idx,lcoor);
    peekLocalSite(cc,cosha,lcoor);
    assert((double)real(cc)>=1.0);
    assert(fabs((double)imag(cc))<=1.0e-15);
    cc = ScalComplex(::acosh(real(cc)),0.0);
    pokeLocalSite(cc,a,lcoor);
  }
  Wea = ( exp( a) * W  ); 
  Wema= ( exp(-a) * W  ); 
  num   = num + ( one - Wema ) * mass * in;
  denom= ( Wea - one ) + mass*mass * (one - Wema); 
  out = num/denom;
 }
 template<class Impl>
 void WilsonFermion5D<Impl>::MomentumSpacePropagatorHw(FermionField &out,const FermionField &in,RealD mass) 
 {
    Gamma::GammaMatrix Gmu [] = {
      Gamma::GammaX,
      Gamma::GammaY,
      Gamma::GammaZ,
      Gamma::GammaT
    };
    GridBase *_grid = _FourDimGrid;
    conformable(_grid,out._grid);
    typedef typename FermionField::vector_type vector_type;
    typedef typename FermionField::scalar_type ScalComplex;
    typedef Lattice<iSinglet<vector_type> > LatComplex;
    std::vector<int> latt_size   = _grid->_fdimensions;
    LatComplex    sk(_grid);  sk = zero;
    LatComplex    sk2(_grid); sk2= zero;
    LatComplex    w_k(_grid); w_k= zero;
    LatComplex    b_k(_grid); b_k= zero;
    LatComplex     one  (_grid); one = ScalComplex(1.0,0.0);
    FermionField   num  (_grid); num  = zero;
    LatComplex denom(_grid); denom= zero;
    LatComplex kmu(_grid); 
    ScalComplex ci(0.0,1.0);
    for(int mu=0;mu<Nd;mu++) {
      LatticeCoordinate(kmu,mu);
      RealD TwoPiL =  M_PI * 2.0/ latt_size[mu];
      kmu = TwoPiL * kmu;
      sk2 = sk2 + 2.0*sin(kmu*0.5)*sin(kmu*0.5);
      sk  = sk  + sin(kmu)*sin(kmu); 
      num = num - sin(kmu)*ci*(Gamma(Gmu[mu])*in);
    }
    num = num + mass * in ;
    b_k = sk2 - M5;
    w_k = sqrt(sk + b_k*b_k);
    denom= ( w_k + b_k + mass*mass) ;
    denom= one/denom;
    out = num*denom;
 }
 FermOpTemplateInstantiate(WilsonFermion5D);
 GparityFermOpTemplateInstantiate(WilsonFermion5D);
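Note on MomentumSpacePropagatorHw above: the per-momentum arithmetic can be sketched on plain scalars. This is an illustrative transcription, not Grid code. HwTerms and HwScalarTerms are hypothetical names, std::array stands in for the lattice types, four dimensions are assumed (matching the four entries of Gmu), and the gamma-matrix part of the numerator is left out, so only sk, sk2, b_k, w_k and denom are reproduced.

```cpp
#include <array>
#include <cassert>
#include <cmath>

// Hypothetical container for the scalar quantities built in
// MomentumSpacePropagatorHw; not part of Grid.
struct HwTerms {
  double sk;     // sum over mu of sin^2(k_mu)
  double sk2;    // sum over mu of 2 sin^2(k_mu / 2)
  double b_k;    // sk2 - M5
  double w_k;    // sqrt(sk + b_k^2)
  double denom;  // w_k + b_k + mass^2
};

// n[mu] is the integer momentum-space coordinate and latt[mu] the lattice
// extent, so k_mu = 2 pi n[mu] / latt[mu], as in the diff.
HwTerms HwScalarTerms(const std::array<int, 4> &n,
                      const std::array<int, 4> &latt,
                      double M5, double mass) {
  const double kPi = std::acos(-1.0);
  HwTerms t{0.0, 0.0, 0.0, 0.0, 0.0};
  for (int mu = 0; mu < 4; mu++) {
    assert(latt[mu] > 0);  // guard against division by zero
    const double kmu = 2.0 * kPi * n[mu] / latt[mu];
    t.sk2 += 2.0 * std::sin(0.5 * kmu) * std::sin(0.5 * kmu);
    t.sk  += std::sin(kmu) * std::sin(kmu);
  }
  t.b_k   = t.sk2 - M5;
  t.w_k   = std::sqrt(t.sk + t.b_k * t.b_k);
  t.denom = t.w_k + t.b_k + mass * mass;
  return t;
}
```

At zero momentum with M5 > 0, sk and sk2 vanish, b_k = -M5 and w_k = M5, so denom reduces to mass*mass. That makes a convenient sanity check on the transcription.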
--- a/lib/qcd/action/fermion/WilsonFermion5D.h
+++ b/lib/qcd/action/fermion/WilsonFermion5D.h
@@ -31,9 +31,21 @@ Author: paboyle <paboyle@ph.ed.ac.uk>
 #ifndef  GRID_QCD_WILSON_FERMION_5D_H
 #define  GRID_QCD_WILSON_FERMION_5D_H
-namespace Grid {
+#include <Grid/Stat.h>
-  namespace QCD {
+namespace Grid {
 namespace QCD {
  ////////////////////////////////////////////////////////////////////////////////
  // This is the 4d red black case appropriate to support
  //
  // parity = (x+y+z+t)|2;
  // generalised five dim fermions like mobius, zolotarev etc..	
  //
  // i.e. even even contains fifth dim hopping term.
  //
  // [DIFFERS from original CPS red black implementation parity = (x+y+z+t+s)|2 ]
  ////////////////////////////////////////////////////////////////////////////////
    ////////////////////////////////////////////////////////////////////////////////
    // This is the 4d red black case appropriate to support
@@ -60,6 +72,18 @@ namespace Grid {
    public:
     INHERIT_IMPL_TYPES(Impl);
     typedef WilsonKernels<Impl> Kernels;
     PmuStat stat;
     void Report(void);
     void ZeroCounters(void);
     double DhopCalls;
     double DhopCommTime;
     double DhopComputeTime;
     double DerivCalls;
     double DerivCommTime;
     double DerivComputeTime;
     double DerivDhopComputeTime;
      ///////////////////////////////////////////////////////////////
      // Implement the abstract base
@@ -88,6 +112,9 @@ namespace Grid {
      virtual void DhopDerivEO(GaugeField &mat,const FermionField &U,const FermionField &V,int dag);
      virtual void DhopDerivOE(GaugeField &mat,const FermionField &U,const FermionField &V,int dag);
      void MomentumSpacePropagatorHt(FermionField &out,const FermionField &in,RealD mass) ;
      void MomentumSpacePropagatorHw(FermionField &out,const FermionField &in,RealD mass) ;
      // Implement hopping term non-hermitian hopping term; half cb or both
      // Implement s-diagonal DW
      void DW    (const FermionField &in, FermionField &out,int dag);
@@ -97,78 +124,78 @@ namespace Grid {
      // add a DhopComm
      // -- suboptimal interface will presently trigger multiple comms.
-      void DhopDir(const FermionField &in, FermionField &out,int dir,int disp);
+    void DhopDir(const FermionField &in, FermionField &out,int dir,int disp);
-      ///////////////////////////////////////////////////////////////
+    ///////////////////////////////////////////////////////////////
-      // New methods added 
+    // New methods added 
-      ///////////////////////////////////////////////////////////////
+    ///////////////////////////////////////////////////////////////
-      void DerivInternal(StencilImpl & st,
+    void DerivInternal(StencilImpl & st,
-			 DoubledGaugeField & U,
+		       DoubledGaugeField & U,
-			 GaugeField &mat,
+		       GaugeField &mat,
-			 const FermionField &A,
+		       const FermionField &A,
-			 const FermionField &B,
+		       const FermionField &B,
-			 int dag);
+		       int dag);
-      void DhopInternal(StencilImpl & st,
+    void DhopInternal(StencilImpl & st,
-			LebesgueOrder &lo,
+		      LebesgueOrder &lo,
-			DoubledGaugeField &U,
+		      DoubledGaugeField &U,
-			const FermionField &in, 
+		      const FermionField &in, 
-			FermionField &out,
+		      FermionField &out,
-			int dag);
+		      int dag);
-      // Constructors
+    // Constructors
-      WilsonFermion5D(GaugeField &_Umu,
+    WilsonFermion5D(GaugeField &_Umu,
-		      GridCartesian         &FiveDimGrid,
+		    GridCartesian         &FiveDimGrid,
-		      GridRedBlackCartesian &FiveDimRedBlackGrid,
+		    GridRedBlackCartesian &FiveDimRedBlackGrid,
-		      GridCartesian         &FourDimGrid,
+		    GridCartesian         &FourDimGrid,
-		      GridRedBlackCartesian &FourDimRedBlackGrid,
+		    GridRedBlackCartesian &FourDimRedBlackGrid,
-		      double _M5,const ImplParams &p= ImplParams());
+		    double _M5,const ImplParams &p= ImplParams());
-      // Constructors
+    // Constructors
-      /*
+    /*
      WilsonFermion5D(int simd, 
-		      GaugeField &_Umu,
+      GaugeField &_Umu,
-		      GridCartesian         &FiveDimGrid,
+      GridCartesian         &FiveDimGrid,
-		      GridRedBlackCartesian &FiveDimRedBlackGrid,
+      GridRedBlackCartesian &FiveDimRedBlackGrid,
-		      GridCartesian         &FourDimGrid,
+      GridCartesian         &FourDimGrid,
-		      double _M5,const ImplParams &p= ImplParams());
+      double _M5,const ImplParams &p= ImplParams());
-      */
+    */
-      // DoubleStore
+    // DoubleStore
-      void ImportGauge(const GaugeField &_Umu);
+    void ImportGauge(const GaugeField &_Umu);
-      ///////////////////////////////////////////////////////////////
+    ///////////////////////////////////////////////////////////////
-      // Data members require to support the functionality
+    // Data members require to support the functionality
-      ///////////////////////////////////////////////////////////////
+    ///////////////////////////////////////////////////////////////
-    public:
+  public:
-      // Add these to the support from Wilson
+    // Add these to the support from Wilson
-      GridBase *_FourDimGrid;
+    GridBase *_FourDimGrid;
-      GridBase *_FourDimRedBlackGrid;
+    GridBase *_FourDimRedBlackGrid;
-      GridBase *_FiveDimGrid;
+    GridBase *_FiveDimGrid;
-      GridBase *_FiveDimRedBlackGrid;
+    GridBase *_FiveDimRedBlackGrid;
-      double                        M5;
+    double                        M5;
-      int Ls;
+    int Ls;
-      //Defines the stencils for even and odd
+    //Defines the stencils for even and odd
-      StencilImpl Stencil; 
+    StencilImpl Stencil; 
-      StencilImpl StencilEven; 
+    StencilImpl StencilEven; 
-      StencilImpl StencilOdd; 
+    StencilImpl StencilOdd; 
-      // Copy of the gauge field , with even and odd subsets
+    // Copy of the gauge field , with even and odd subsets
-      DoubledGaugeField Umu;
+    DoubledGaugeField Umu;
-      DoubledGaugeField UmuEven;
+    DoubledGaugeField UmuEven;
-      DoubledGaugeField UmuOdd;
+    DoubledGaugeField UmuOdd;
-      LebesgueOrder Lebesgue;
+    LebesgueOrder Lebesgue;
-      LebesgueOrder LebesgueEvenOdd;
+    LebesgueOrder LebesgueEvenOdd;
-      // Comms buffer
+    // Comms buffer
-      std::vector<SiteHalfSpinor,alignedAllocator<SiteHalfSpinor> >  comm_buf;
+    std::vector<SiteHalfSpinor,alignedAllocator<SiteHalfSpinor> >  comm_buf;
-    };
+  };
-  }
+
-}
+}}
 #endif
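Note on the counters added to this header (DhopCalls, DhopCommTime, DhopComputeTime and the Deriv equivalents): WilsonFermion5D.cc accumulates them with a subtract-then-add idiom around each timed region, and Report() converts them into a rate. A minimal sketch of that bookkeeping follows. It assumes std::chrono as a stand-in for Grid's usecond(), and UsecondNow, DhopCounters, TimedDhop, FiveDimVolume and DhopMflops are hypothetical names. The constant 1344 and the division by 2 are copied from Report() in this diff.

```cpp
#include <cassert>
#include <chrono>
#include <vector>

// Stand-in for Grid's usecond(): microseconds from a monotonic clock.
// Assumption for illustration; Grid's own timer may differ.
double UsecondNow() {
  using namespace std::chrono;
  return duration<double, std::micro>(steady_clock::now().time_since_epoch()).count();
}

// Hypothetical subset of the counters added to WilsonFermion5D.
struct DhopCounters {
  double DhopCalls = 0;
  double DhopComputeTime = 0;  // microseconds
};

// Times one call of work() with the "-= before, += after" idiom from the diff.
template <class Work>
void TimedDhop(DhopCounters &c, Work work) {
  c.DhopCalls++;
  c.DhopComputeTime -= UsecondNow();
  work();
  c.DhopComputeTime += UsecondNow();
}

// 5d volume as computed in Report(): Ls times the product of the 4d extents.
double FiveDimVolume(int Ls, const std::vector<int> &latt) {
  double volume = Ls;
  for (int extent : latt) volume *= extent;
  return volume;
}

// Flops divided by microseconds is Mflop/s. The 1344 and the division by 2
// ("red black counting") are taken verbatim from Report().
double DhopMflops(double volume, double calls, double compute_time_us) {
  assert(compute_time_us > 0.0);  // Report() only prints when DhopCalls > 0
  return 1344.0 * volume * calls / compute_time_us / 2.0;
}
```

The subtract-then-add form lets the counter accumulate across many calls without a temporary start-time variable. The running total is only meaningful once the matching += has executed.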
--- a/lib/qcd/action/fermion/WilsonKernels.cc
+++ b/lib/qcd/action/fermion/WilsonKernels.cc
@@ -1,103 +1,52 @@
-    /*************************************************************************************
+/*************************************************************************************
-    Grid physics library, www.github.com/paboyle/Grid 
+Grid physics library, www.github.com/paboyle/Grid
-    Source file: ./lib/qcd/action/fermion/WilsonKernels.cc
+Source file: ./lib/qcd/action/fermion/WilsonKernels.cc
-    Copyright (C) 2015
+Copyright (C) 2015
 Author: Peter Boyle <paboyle@ph.ed.ac.uk>
 Author: Peter Boyle <peterboyle@Peters-MacBook-Pro-2.local>
 Author: paboyle <paboyle@ph.ed.ac.uk>
-    This program is free software; you can redistribute it and/or modify
+This program is free software; you can redistribute it and/or modify
-    it under the terms of the GNU General Public License as published by
+it under the terms of the GNU General Public License as published by
-    the Free Software Foundation; either version 2 of the License, or
+the Free Software Foundation; either version 2 of the License, or
-    (at your option) any later version.
+(at your option) any later version.
-    This program is distributed in the hope that it will be useful,
+This program is distributed in the hope that it will be useful,
-    but WITHOUT ANY WARRANTY; without even the implied warranty of
+but WITHOUT ANY WARRANTY; without even the implied warranty of
-    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
-    GNU General Public License for more details.
+GNU General Public License for more details.
-    You should have received a copy of the GNU General Public License along
+You should have received a copy of the GNU General Public License along
-    with this program; if not, write to the Free Software Foundation, Inc.,
+with this program; if not, write to the Free Software Foundation, Inc.,
-    51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
+51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
-    See the full license in the file "LICENSE" in the top level distribution directory
+See the full license in the file "LICENSE" in the top level distribution
-    *************************************************************************************/
+directory
-    /*  END LEGAL */
+*************************************************************************************/
 /*  END LEGAL */
 #include <Grid.h>
 namespace Grid {
 namespace QCD {
-  int WilsonKernelsStatic::HandOpt;
+int WilsonKernelsStatic::Opt;
  int WilsonKernelsStatic::AsmOpt;
-template<class Impl> 
+template <class Impl>
-WilsonKernels<Impl>::WilsonKernels(const ImplParams &p): Base(p) {};
+WilsonKernels<Impl>::WilsonKernels(const ImplParams &p) : Base(p){};
-template<class Impl> 
+////////////////////////////////////////////
-void WilsonKernels<Impl>::DiracOptDhopSite(StencilImpl &st,LebesgueOrder &lo,DoubledGaugeField &U,
+// Generic implementation; move to different file?
-						  std::vector<SiteHalfSpinor,alignedAllocator<SiteHalfSpinor> >  &buf,
+////////////////////////////////////////////
 						  int sF,int sU,int Ls, int Ns, const FermionField &in, FermionField &out)
 {
 #ifdef AVX512
  if ( AsmOpt ) {
-    WilsonKernels<Impl>::DiracOptAsmDhopSite(st,lo,U,buf,sF,sU,Ls,Ns,in,out);
+template <class Impl>
-
+void WilsonKernels<Impl>::DiracOptGenericDhopSiteDag(StencilImpl &st, LebesgueOrder &lo, DoubledGaugeField &U,
-  } else {
+						     SiteHalfSpinor *buf, int sF,
-#else
+						     int sU, const FermionField &in, FermionField &out) {
-  {  
+  SiteHalfSpinor tmp;
-#endif
+  SiteHalfSpinor chi;
    for(int site=0;site<Ns;site++) {
      for(int s=0;s<Ls;s++) {
 	if (HandOpt) WilsonKernels<Impl>::DiracOptHandDhopSite(st,lo,U,buf,sF,sU,in,out);
 	else         WilsonKernels<Impl>::DiracOptGenericDhopSite(st,lo,U,buf,sF,sU,in,out);
 	sF++;
      }
      sU++;
    }
  }
 }
 template<class Impl> 
 void WilsonKernels<Impl>::DiracOptDhopSiteDag(StencilImpl &st,LebesgueOrder &lo,DoubledGaugeField &U,
 					   std::vector<SiteHalfSpinor,alignedAllocator<SiteHalfSpinor> >  &buf,
 					   int sF,int sU,int Ls, int Ns, const FermionField &in, FermionField &out)
 {
 #ifdef AVX512
  if ( AsmOpt ) {
    WilsonKernels<Impl>::DiracOptAsmDhopSiteDag(st,lo,U,buf,sF,sU,Ls,Ns,in,out);
  } else {
 #else
  {  
 #endif
    for(int site=0;site<Ns;site++) {
      for(int s=0;s<Ls;s++) {
 	if (HandOpt) WilsonKernels<Impl>::DiracOptHandDhopSiteDag(st,lo,U,buf,sF,sU,in,out);
 	else         WilsonKernels<Impl>::DiracOptGenericDhopSiteDag(st,lo,U,buf,sF,sU,in,out);
 	sF++;
      }
      sU++;
    }
  }
 }
  ////////////////////////////////////////////
  // Generic implementation; move to different file?
  ////////////////////////////////////////////
 template<class Impl> 
 void WilsonKernels<Impl>::DiracOptGenericDhopSiteDag(StencilImpl &st,LebesgueOrder &lo,DoubledGaugeField &U,
 					   std::vector<SiteHalfSpinor,alignedAllocator<SiteHalfSpinor> >  &buf,
 					   int sF,int sU,const FermionField &in, FermionField &out)
 {
  SiteHalfSpinor  tmp;    
  SiteHalfSpinor  chi;    
  SiteHalfSpinor *chi_p;
  SiteHalfSpinor Uchi;
  SiteSpinor result;
@@ -107,175 +56,173 @@ void WilsonKernels<Impl>::DiracOptGenericDhopSiteDag(StencilImpl &st,LebesgueOrd
  ///////////////////////////
  // Xp
  ///////////////////////////
-  SE=st.GetEntry(ptype,Xp,sF);
+  SE = st.GetEntry(ptype, Xp, sF);
-  if (SE->_is_local ) { 
+  if (SE->_is_local) {
    chi_p = &chi;
-    if ( SE->_permute ) {
+    if (SE->_permute) {
-      spProjXp(tmp,in._odata[SE->_offset]);
+      spProjXp(tmp, in._odata[SE->_offset]);
-      permute(chi,tmp,ptype);
+      permute(chi, tmp, ptype);
    } else {
-      spProjXp(chi,in._odata[SE->_offset]);
+      spProjXp(chi, in._odata[SE->_offset]);
    }
  } else {
-    chi_p=&buf[SE->_offset];
+    chi_p = &buf[SE->_offset];
  }
-  Impl::multLink(Uchi,U._odata[sU],*chi_p,Xp,SE,st);
+  Impl::multLink(Uchi, U._odata[sU], *chi_p, Xp, SE, st);
-  spReconXp(result,Uchi);
+  spReconXp(result, Uchi);
  ///////////////////////////
  // Yp
  ///////////////////////////
-  SE=st.GetEntry(ptype,Yp,sF);
+  SE = st.GetEntry(ptype, Yp, sF);
-  if ( SE->_is_local ) { 
+  if (SE->_is_local) {
    chi_p = &chi;
-    if ( SE->_permute ) {
+    if (SE->_permute) {
-      spProjYp(tmp,in._odata[SE->_offset]);
+      spProjYp(tmp, in._odata[SE->_offset]);
-      permute(chi,tmp,ptype);
+      permute(chi, tmp, ptype);
    } else {
-      spProjYp(chi,in._odata[SE->_offset]);
+      spProjYp(chi, in._odata[SE->_offset]);
    }
  } else {
-    chi_p=&buf[SE->_offset];
+    chi_p = &buf[SE->_offset];
  }
-  Impl::multLink(Uchi,U._odata[sU],*chi_p,Yp,SE,st);
+  Impl::multLink(Uchi, U._odata[sU], *chi_p, Yp, SE, st);
-  accumReconYp(result,Uchi);
+  accumReconYp(result, Uchi);
  ///////////////////////////
  // Zp
  ///////////////////////////
-  SE=st.GetEntry(ptype,Zp,sF);
+  SE = st.GetEntry(ptype, Zp, sF);
-  if ( SE->_is_local ) { 
+  if (SE->_is_local) {
    chi_p = &chi;
-    if ( SE->_permute ) {
+    if (SE->_permute) {
-      spProjZp(tmp,in._odata[SE->_offset]);
+      spProjZp(tmp, in._odata[SE->_offset]);
-      permute(chi,tmp,ptype);
+      permute(chi, tmp, ptype);
    } else {
-      spProjZp(chi,in._odata[SE->_offset]);
+      spProjZp(chi, in._odata[SE->_offset]);
    }
  } else {
-    chi_p=&buf[SE->_offset];
+    chi_p = &buf[SE->_offset];
  }
-  Impl::multLink(Uchi,U._odata[sU],*chi_p,Zp,SE,st);
+  Impl::multLink(Uchi, U._odata[sU], *chi_p, Zp, SE, st);
-  accumReconZp(result,Uchi);
+  accumReconZp(result, Uchi);
  ///////////////////////////
  // Tp
  ///////////////////////////
-  SE=st.GetEntry(ptype,Tp,sF);
+  SE = st.GetEntry(ptype, Tp, sF);
-  if ( SE->_is_local ) {
+  if (SE->_is_local) {
    chi_p = &chi;
-    if ( SE->_permute ) {
+    if (SE->_permute) {
-      spProjTp(tmp,in._odata[SE->_offset]);
+      spProjTp(tmp, in._odata[SE->_offset]);
-      permute(chi,tmp,ptype);
+      permute(chi, tmp, ptype);
    } else {
-      spProjTp(chi,in._odata[SE->_offset]);
+      spProjTp(chi, in._odata[SE->_offset]);
    }
  } else {
-    chi_p=&buf[SE->_offset];
+    chi_p = &buf[SE->_offset];
  }
-  Impl::multLink(Uchi,U._odata[sU],*chi_p,Tp,SE,st);
+  Impl::multLink(Uchi, U._odata[sU], *chi_p, Tp, SE, st);
-  accumReconTp(result,Uchi);
+  accumReconTp(result, Uchi);
  ///////////////////////////
  // Xm
  ///////////////////////////
-  SE=st.GetEntry(ptype,Xm,sF);
+  SE = st.GetEntry(ptype, Xm, sF);
-  if ( SE->_is_local ) {
+  if (SE->_is_local) {
    chi_p = &chi;
-    if ( SE->_permute ) {
+    if (SE->_permute) {
-      spProjXm(tmp,in._odata[SE->_offset]);
+      spProjXm(tmp, in._odata[SE->_offset]);
-      permute(chi,tmp,ptype);
+      permute(chi, tmp, ptype);
    } else {
-      spProjXm(chi,in._odata[SE->_offset]);
+      spProjXm(chi, in._odata[SE->_offset]);
    }
  } else {
-    chi_p=&buf[SE->_offset];
+    chi_p = &buf[SE->_offset];
  }
-  Impl::multLink(Uchi,U._odata[sU],*chi_p,Xm,SE,st);
+  Impl::multLink(Uchi, U._odata[sU], *chi_p, Xm, SE, st);
-  accumReconXm(result,Uchi);
+  accumReconXm(result, Uchi);
  ///////////////////////////
  // Ym
  ///////////////////////////
-  SE=st.GetEntry(ptype,Ym,sF);
+  SE = st.GetEntry(ptype, Ym, sF);
-  if ( SE->_is_local ) {
+  if (SE->_is_local) {
    chi_p = &chi;
-    if ( SE->_permute ) {
+    if (SE->_permute) {
-      spProjYm(tmp,in._odata[SE->_offset]);
+      spProjYm(tmp, in._odata[SE->_offset]);
-      permute(chi,tmp,ptype);
+      permute(chi, tmp, ptype);
    } else {
-      spProjYm(chi,in._odata[SE->_offset]);
+      spProjYm(chi, in._odata[SE->_offset]);
    }
  } else {
-    chi_p=&buf[SE->_offset];
+    chi_p = &buf[SE->_offset];
  }
-  Impl::multLink(Uchi,U._odata[sU],*chi_p,Ym,SE,st);
+  Impl::multLink(Uchi, U._odata[sU], *chi_p, Ym, SE, st);
-  accumReconYm(result,Uchi);
+  accumReconYm(result, Uchi);
  ///////////////////////////
  // Zm
  ///////////////////////////
-  SE=st.GetEntry(ptype,Zm,sF);
+  SE = st.GetEntry(ptype, Zm, sF);
-  if ( SE->_is_local ) {
+  if (SE->_is_local) {
    chi_p = &chi;
-    if ( SE->_permute ) {
+    if (SE->_permute) {
-      spProjZm(tmp,in._odata[SE->_offset]);
+      spProjZm(tmp, in._odata[SE->_offset]);
-      permute(chi,tmp,ptype);
+      permute(chi, tmp, ptype);
    } else {
-      spProjZm(chi,in._odata[SE->_offset]);
+      spProjZm(chi, in._odata[SE->_offset]);
    }
  } else {
-    chi_p=&buf[SE->_offset];
+    chi_p = &buf[SE->_offset];
  }
-  Impl::multLink(Uchi,U._odata[sU],*chi_p,Zm,SE,st);
+  Impl::multLink(Uchi, U._odata[sU], *chi_p, Zm, SE, st);
-  accumReconZm(result,Uchi);
+  accumReconZm(result, Uchi);
  ///////////////////////////
  // Tm
  ///////////////////////////
-  SE=st.GetEntry(ptype,Tm,sF);
+  SE = st.GetEntry(ptype, Tm, sF);
-  if ( SE->_is_local ) {
+  if (SE->_is_local) {
    chi_p = &chi;
-    if ( SE->_permute ) {
+    if (SE->_permute) {
-      spProjTm(tmp,in._odata[SE->_offset]);
+      spProjTm(tmp, in._odata[SE->_offset]);
-      permute(chi,tmp,ptype);
+      permute(chi, tmp, ptype);
    } else {
-      spProjTm(chi,in._odata[SE->_offset]);
+      spProjTm(chi, in._odata[SE->_offset]);
    }
  } else {
-    chi_p=&buf[SE->_offset];
+    chi_p = &buf[SE->_offset];
  }
-  Impl::multLink(Uchi,U._odata[sU],*chi_p,Tm,SE,st);
+  Impl::multLink(Uchi, U._odata[sU], *chi_p, Tm, SE, st);
-  accumReconTm(result,Uchi);
+  accumReconTm(result, Uchi);
-  vstream(out._odata[sF],result);
+  vstream(out._odata[sF], result);
 };
-
+// Need controls to do interior, exterior, or both
-  // Need controls to do interior, exterior, or both
+template <class Impl>
-template<class Impl> 
+void WilsonKernels<Impl>::DiracOptGenericDhopSite(StencilImpl &st, LebesgueOrder &lo, DoubledGaugeField &U,
-void WilsonKernels<Impl>::DiracOptGenericDhopSite(StencilImpl &st,LebesgueOrder &lo,DoubledGaugeField &U,
+						  SiteHalfSpinor *buf, int sF,
-						  std::vector<SiteHalfSpinor,alignedAllocator<SiteHalfSpinor> >  &buf,
+						  int sU, const FermionField &in, FermionField &out) {
-						  int sF,int sU,const FermionField &in, FermionField &out)
+  SiteHalfSpinor tmp;
-{
+  SiteHalfSpinor chi;
-  SiteHalfSpinor  tmp;    
-  SiteHalfSpinor  chi;    
  SiteHalfSpinor *chi_p;
  SiteHalfSpinor Uchi;
  SiteSpinor result;
@@ -285,296 +232,297 @@ void WilsonKernels<Impl>::DiracOptGenericDhopSite(StencilImpl &st,LebesgueOrder
  ///////////////////////////
  // Xp
  ///////////////////////////
-  SE=st.GetEntry(ptype,Xm,sF);
+  SE = st.GetEntry(ptype, Xm, sF);
-  if ( SE->_is_local ) { 
+  if (SE->_is_local) {
    chi_p = &chi;
-    if ( SE->_permute ) {
+    if (SE->_permute) {
-      spProjXp(tmp,in._odata[SE->_offset]);
+      spProjXp(tmp, in._odata[SE->_offset]);
-      permute(chi,tmp,ptype);
+      permute(chi, tmp, ptype);
    } else {
-      spProjXp(chi,in._odata[SE->_offset]);
+      spProjXp(chi, in._odata[SE->_offset]);
    }
  } else {
-    chi_p=&buf[SE->_offset];
+    chi_p = &buf[SE->_offset];
  }
-  Impl::multLink(Uchi,U._odata[sU],*chi_p,Xm,SE,st);
+  Impl::multLink(Uchi, U._odata[sU], *chi_p, Xm, SE, st);
-  spReconXp(result,Uchi);
+  spReconXp(result, Uchi);
  ///////////////////////////
  // Yp
  ///////////////////////////
-  SE=st.GetEntry(ptype,Ym,sF);
+  SE = st.GetEntry(ptype, Ym, sF);
-  if ( SE->_is_local ) { 
+  if (SE->_is_local) {
    chi_p = &chi;
-    if ( SE->_permute ) {
+    if (SE->_permute) {
-      spProjYp(tmp,in._odata[SE->_offset]);
+      spProjYp(tmp, in._odata[SE->_offset]);
-      permute(chi,tmp,ptype);
+      permute(chi, tmp, ptype);
    } else {
-      spProjYp(chi,in._odata[SE->_offset]);
+      spProjYp(chi, in._odata[SE->_offset]);
    }
  } else {
-    chi_p=&buf[SE->_offset];
+    chi_p = &buf[SE->_offset];
  }
-  Impl::multLink(Uchi,U._odata[sU],*chi_p,Ym,SE,st);
+  Impl::multLink(Uchi, U._odata[sU], *chi_p, Ym, SE, st);
-  accumReconYp(result,Uchi);
+  accumReconYp(result, Uchi);
  ///////////////////////////
  // Zp
  ///////////////////////////
-  SE=st.GetEntry(ptype,Zm,sF);
+  SE = st.GetEntry(ptype, Zm, sF);
-  if ( SE->_is_local ) { 
+  if (SE->_is_local) {
    chi_p = &chi;
-    if ( SE->_permute ) {
+    if (SE->_permute) {
-      spProjZp(tmp,in._odata[SE->_offset]);
+      spProjZp(tmp, in._odata[SE->_offset]);
-      permute(chi,tmp,ptype);
+      permute(chi, tmp, ptype);
    } else {
-      spProjZp(chi,in._odata[SE->_offset]);
+      spProjZp(chi, in._odata[SE->_offset]);
    }
  } else {
-    chi_p=&buf[SE->_offset];
+    chi_p = &buf[SE->_offset];
  }
-  Impl::multLink(Uchi,U._odata[sU],*chi_p,Zm,SE,st);
+  Impl::multLink(Uchi, U._odata[sU], *chi_p, Zm, SE, st);
-  accumReconZp(result,Uchi);
+  accumReconZp(result, Uchi);
  ///////////////////////////
  // Tp
  ///////////////////////////
-  SE=st.GetEntry(ptype,Tm,sF);
+  SE = st.GetEntry(ptype, Tm, sF);
-  if ( SE->_is_local ) {
+  if (SE->_is_local) {
    chi_p = &chi;
-    if ( SE->_permute ) {
+    if (SE->_permute) {
-      spProjTp(tmp,in._odata[SE->_offset]);
+      spProjTp(tmp, in._odata[SE->_offset]);
-      permute(chi,tmp,ptype);
+      permute(chi, tmp, ptype);
    } else {
-      spProjTp(chi,in._odata[SE->_offset]);
+      spProjTp(chi, in._odata[SE->_offset]);
    }
  } else {
-    chi_p=&buf[SE->_offset];
+    chi_p = &buf[SE->_offset];
  }
-  Impl::multLink(Uchi,U._odata[sU],*chi_p,Tm,SE,st);
+  Impl::multLink(Uchi, U._odata[sU], *chi_p, Tm, SE, st);
-  accumReconTp(result,Uchi);
+  accumReconTp(result, Uchi);
  ///////////////////////////
  // Xm
  ///////////////////////////
-  SE=st.GetEntry(ptype,Xp,sF);
+  SE = st.GetEntry(ptype, Xp, sF);
-  if ( SE->_is_local ) {
+  if (SE->_is_local) {
    chi_p = &chi;
-    if ( SE->_permute ) {
+    if (SE->_permute) {
-      spProjXm(tmp,in._odata[SE->_offset]);
+      spProjXm(tmp, in._odata[SE->_offset]);
-      permute(chi,tmp,ptype);
+      permute(chi, tmp, ptype);
    } else {
-      spProjXm(chi,in._odata[SE->_offset]);
+      spProjXm(chi, in._odata[SE->_offset]);
    }
  } else {
-    chi_p=&buf[SE->_offset];
+    chi_p = &buf[SE->_offset];
  }
-  Impl::multLink(Uchi,U._odata[sU],*chi_p,Xp,SE,st);
+  Impl::multLink(Uchi, U._odata[sU], *chi_p, Xp, SE, st);
-  accumReconXm(result,Uchi);
+  accumReconXm(result, Uchi);
  ///////////////////////////
  // Ym
  ///////////////////////////
-  SE=st.GetEntry(ptype,Yp,sF);
+  SE = st.GetEntry(ptype, Yp, sF);
-  if ( SE->_is_local ) {
+  if (SE->_is_local) {
    chi_p = &chi;
-    if ( SE->_permute ) {
+    if (SE->_permute) {
-      spProjYm(tmp,in._odata[SE->_offset]);
+      spProjYm(tmp, in._odata[SE->_offset]);
-      permute(chi,tmp,ptype);
+      permute(chi, tmp, ptype);
    } else {
-      spProjYm(chi,in._odata[SE->_offset]);
+      spProjYm(chi, in._odata[SE->_offset]);
    }
  } else {
-    chi_p=&buf[SE->_offset];
+    chi_p = &buf[SE->_offset];
  }
-  Impl::multLink(Uchi,U._odata[sU],*chi_p,Yp,SE,st);
+  Impl::multLink(Uchi, U._odata[sU], *chi_p, Yp, SE, st);
-  accumReconYm(result,Uchi);
+  accumReconYm(result, Uchi);
  ///////////////////////////
  // Zm
  ///////////////////////////
-  SE=st.GetEntry(ptype,Zp,sF);
+  SE = st.GetEntry(ptype, Zp, sF);
-  if ( SE->_is_local ) {
+  if (SE->_is_local) {
    chi_p = &chi;
-    if ( SE->_permute ) {
+    if (SE->_permute) {
-      spProjZm(tmp,in._odata[SE->_offset]);
+      spProjZm(tmp, in._odata[SE->_offset]);
-      permute(chi,tmp,ptype);
+      permute(chi, tmp, ptype);
    } else {
-      spProjZm(chi,in._odata[SE->_offset]);
+      spProjZm(chi, in._odata[SE->_offset]);
    }
  } else {
-    chi_p=&buf[SE->_offset];
+    chi_p = &buf[SE->_offset];
  }
-  Impl::multLink(Uchi,U._odata[sU],*chi_p,Zp,SE,st);
+  Impl::multLink(Uchi, U._odata[sU], *chi_p, Zp, SE, st);
-  accumReconZm(result,Uchi);
+  accumReconZm(result, Uchi);
  ///////////////////////////
  // Tm
  ///////////////////////////
-  SE=st.GetEntry(ptype,Tp,sF);
+  SE = st.GetEntry(ptype, Tp, sF);
-  if ( SE->_is_local ) {
+  if (SE->_is_local) {
    chi_p = &chi;
-    if ( SE->_permute ) {
+    if (SE->_permute) {
-      spProjTm(tmp,in._odata[SE->_offset]);
+      spProjTm(tmp, in._odata[SE->_offset]);
-      permute(chi,tmp,ptype);
+      permute(chi, tmp, ptype);
    } else {
-      spProjTm(chi,in._odata[SE->_offset]);
+      spProjTm(chi, in._odata[SE->_offset]);
    }
  } else {
-    chi_p=&buf[SE->_offset];
+    chi_p = &buf[SE->_offset];
  }
-  Impl::multLink(Uchi,U._odata[sU],*chi_p,Tp,SE,st);
+  Impl::multLink(Uchi, U._odata[sU], *chi_p, Tp, SE, st);
-  accumReconTm(result,Uchi);
+  accumReconTm(result, Uchi);
-  vstream(out._odata[sF],result);
+  vstream(out._odata[sF], result);
 };
-template<class Impl> 
+template <class Impl>
-void WilsonKernels<Impl>::DiracOptDhopDir(StencilImpl &st,DoubledGaugeField &U,
+void WilsonKernels<Impl>::DiracOptDhopDir( StencilImpl &st, DoubledGaugeField &U,SiteHalfSpinor *buf, int sF,
-					  std::vector<SiteHalfSpinor,alignedAllocator<SiteHalfSpinor> >  &buf,
+					   int sU, const FermionField &in, FermionField &out, int dir, int gamma) {
-					  int sF,int sU,const FermionField &in, FermionField &out,int dir,int gamma)
+
-{
+  SiteHalfSpinor tmp;
-  SiteHalfSpinor  tmp;    
+  SiteHalfSpinor chi;
-  SiteHalfSpinor  chi;    
+  SiteSpinor result;
-  SiteSpinor   result;
  SiteHalfSpinor Uchi;
  StencilEntry *SE;
  int ptype;
-  SE=st.GetEntry(ptype,dir,sF);
+  SE = st.GetEntry(ptype, dir, sF);
  // Xp
-  if(gamma==Xp){
+  if (gamma == Xp) {
-    if (  SE->_is_local && SE->_permute ) {
+    if (SE->_is_local && SE->_permute) {
-      spProjXp(tmp,in._odata[SE->_offset]);
+      spProjXp(tmp, in._odata[SE->_offset]);
-      permute(chi,tmp,ptype);
+      permute(chi, tmp, ptype);
-    } else if ( SE->_is_local ) {
+    } else if (SE->_is_local) {
-      spProjXp(chi,in._odata[SE->_offset]);
+      spProjXp(chi, in._odata[SE->_offset]);
    } else {
-      chi=buf[SE->_offset];
+      chi = buf[SE->_offset];
    }
-    Impl::multLink(Uchi,U._odata[sU],chi,dir,SE,st);
+    Impl::multLink(Uchi, U._odata[sU], chi, dir, SE, st);
-    spReconXp(result,Uchi);
+    spReconXp(result, Uchi);
  }
  // Yp
-  if ( gamma==Yp ){
+  if (gamma == Yp) {
-    if (  SE->_is_local && SE->_permute ) {
+    if (SE->_is_local && SE->_permute) {
-      spProjYp(tmp,in._odata[SE->_offset]);
+      spProjYp(tmp, in._odata[SE->_offset]);
-      permute(chi,tmp,ptype);
+      permute(chi, tmp, ptype);
-    } else if ( SE->_is_local ) {
+    } else if (SE->_is_local) {
-      spProjYp(chi,in._odata[SE->_offset]);
+      spProjYp(chi, in._odata[SE->_offset]);
    } else {
-      chi=buf[SE->_offset];
+      chi = buf[SE->_offset];
    }
-    Impl::multLink(Uchi,U._odata[sU],chi,dir,SE,st);
+    Impl::multLink(Uchi, U._odata[sU], chi, dir, SE, st);
-    spReconYp(result,Uchi);
+    spReconYp(result, Uchi);
  }
  // Zp
-  if ( gamma ==Zp ){
+  if (gamma == Zp) {
-    if (  SE->_is_local && SE->_permute ) {
+    if (SE->_is_local && SE->_permute) {
-      spProjZp(tmp,in._odata[SE->_offset]);
+      spProjZp(tmp, in._odata[SE->_offset]);
-      permute(chi,tmp,ptype);
+      permute(chi, tmp, ptype);
-    } else if ( SE->_is_local ) {
+    } else if (SE->_is_local) {
-      spProjZp(chi,in._odata[SE->_offset]);
+      spProjZp(chi, in._odata[SE->_offset]);
    } else {
-      chi=buf[SE->_offset];
+      chi = buf[SE->_offset];
    }
-    Impl::multLink(Uchi,U._odata[sU],chi,dir,SE,st);
+    Impl::multLink(Uchi, U._odata[sU], chi, dir, SE, st);
-    spReconZp(result,Uchi);
+    spReconZp(result, Uchi);
  }
  // Tp
-  if ( gamma ==Tp ){
+  if (gamma == Tp) {
-    if (  SE->_is_local && SE->_permute ) {
+    if (SE->_is_local && SE->_permute) {
-      spProjTp(tmp,in._odata[SE->_offset]);
+      spProjTp(tmp, in._odata[SE->_offset]);
-      permute(chi,tmp,ptype);
+      permute(chi, tmp, ptype);
-    } else if ( SE->_is_local ) {
+    } else if (SE->_is_local) {
-      spProjTp(chi,in._odata[SE->_offset]);
+      spProjTp(chi, in._odata[SE->_offset]);
    } else {
-      chi=buf[SE->_offset];
+      chi = buf[SE->_offset];
    }
-    Impl::multLink(Uchi,U._odata[sU],chi,dir,SE,st);
+    Impl::multLink(Uchi, U._odata[sU], chi, dir, SE, st);
-    spReconTp(result,Uchi);
+    spReconTp(result, Uchi);
  }
  // Xm
-  if ( gamma==Xm ){
+  if (gamma == Xm) {
-    if (  SE->_is_local && SE->_permute ) {
+    if (SE->_is_local && SE->_permute) {
-      spProjXm(tmp,in._odata[SE->_offset]);
+      spProjXm(tmp, in._odata[SE->_offset]);
-      permute(chi,tmp,ptype);
+      permute(chi, tmp, ptype);
-    } else if ( SE->_is_local ) {
+    } else if (SE->_is_local) {
-      spProjXm(chi,in._odata[SE->_offset]);
+      spProjXm(chi, in._odata[SE->_offset]);
    } else {
-      chi=buf[SE->_offset];
+      chi = buf[SE->_offset];
    }
-    Impl::multLink(Uchi,U._odata[sU],chi,dir,SE,st);
+    Impl::multLink(Uchi, U._odata[sU], chi, dir, SE, st);
-    spReconXm(result,Uchi);
+    spReconXm(result, Uchi);
  }
  // Ym
-  if ( gamma == Ym ){
+  if (gamma == Ym) {
-    if (  SE->_is_local && SE->_permute ) {
+    if (SE->_is_local && SE->_permute) {
-      spProjYm(tmp,in._odata[SE->_offset]);
+      spProjYm(tmp, in._odata[SE->_offset]);
-      permute(chi,tmp,ptype);
+      permute(chi, tmp, ptype);
-    } else if ( SE->_is_local ) {
+    } else if (SE->_is_local) {
-      spProjYm(chi,in._odata[SE->_offset]);
+      spProjYm(chi, in._odata[SE->_offset]);
    } else {
-      chi=buf[SE->_offset];
+      chi = buf[SE->_offset];
    }
-    Impl::multLink(Uchi,U._odata[sU],chi,dir,SE,st);
+    Impl::multLink(Uchi, U._odata[sU], chi, dir, SE, st);
-    spReconYm(result,Uchi);
+    spReconYm(result, Uchi);
  }
  // Zm
-  if ( gamma == Zm ){
+  if (gamma == Zm) {
-    if (  SE->_is_local && SE->_permute ) {
+    if (SE->_is_local && SE->_permute) {
-      spProjZm(tmp,in._odata[SE->_offset]);
+      spProjZm(tmp, in._odata[SE->_offset]);
-      permute(chi,tmp,ptype);
+      permute(chi, tmp, ptype);
-    } else if ( SE->_is_local ) {
+    } else if (SE->_is_local) {
-      spProjZm(chi,in._odata[SE->_offset]);
+      spProjZm(chi, in._odata[SE->_offset]);
    } else {
-      chi=buf[SE->_offset];
+      chi = buf[SE->_offset];
    }
-    Impl::multLink(Uchi,U._odata[sU],chi,dir,SE,st);
+    Impl::multLink(Uchi, U._odata[sU], chi, dir, SE, st);
-    spReconZm(result,Uchi);
+    spReconZm(result, Uchi);
  }
  // Tm
-  if ( gamma==Tm ) {
+  if (gamma == Tm) {
-    if (  SE->_is_local && SE->_permute ) {
+    if (SE->_is_local && SE->_permute) {
-      spProjTm(tmp,in._odata[SE->_offset]);
+      spProjTm(tmp, in._odata[SE->_offset]);
-      permute(chi,tmp,ptype);
+      permute(chi, tmp, ptype);
-    } else if ( SE->_is_local ) {
+    } else if (SE->_is_local) {
-      spProjTm(chi,in._odata[SE->_offset]);
+      spProjTm(chi, in._odata[SE->_offset]);
    } else {
-      chi=buf[SE->_offset];
+      chi = buf[SE->_offset];
    }
-    Impl::multLink(Uchi,U._odata[sU],chi,dir,SE,st);
+    Impl::multLink(Uchi, U._odata[sU], chi, dir, SE, st);
-    spReconTm(result,Uchi);
+    spReconTm(result, Uchi);
  }
-  vstream(out._odata[sF],result);
+  vstream(out._odata[sF], result);
 }
-
+FermOpTemplateInstantiate(WilsonKernels);
-  FermOpTemplateInstantiate(WilsonKernels);
+AdjointFermOpTemplateInstantiate(WilsonKernels);
+TwoIndexFermOpTemplateInstantiate(WilsonKernels);
 }}
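Note on the `buf` parameter change in the file above: the kernels now take `SiteHalfSpinor *buf` instead of a reference to an aligned `std::vector`, while the body still indexes it as `buf[SE->_offset]`. The following is only a minimal sketch of why that indexing is unaffected and how a caller that still owns a vector could pass its storage. `HalfSpinorSketch`, `readRemote` and `sumFromVector` are hypothetical names for illustration, not Grid types or functions, and how the real callers obtain the pointer is not shown in this diff.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for SiteHalfSpinor; not a Grid type.
struct HalfSpinorSketch {
  double re;
  double im;
};

// Kernel-style routine taking the buffer as a raw pointer, mirroring the
// `SiteHalfSpinor *buf` parameter introduced in this diff.
inline double readRemote(const HalfSpinorSketch *buf, int offset) {
  const HalfSpinorSketch *chi_p = &buf[offset];  // same indexing as with a vector
  return chi_p->re + chi_p->im;
}

// Caller side (assumption): the storage may still live in a std::vector or any
// other contiguous allocation; only the address of its first element is passed.
inline double sumFromVector(const std::vector<HalfSpinorSketch> &storage, int offset) {
  return readRemote(storage.data(), offset);
}
```

A raw pointer leaves the kernel independent of the container and allocator type, so the same routine can read from a vector, a plain array, or some other contiguous region.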
--- a/lib/qcd/action/fermion/WilsonKernels.h
+++ b/lib/qcd/action/fermion/WilsonKernels.h
@@ -1,120 +1,183 @@
-    /*************************************************************************************
+/*************************************************************************************
-    Grid physics library, www.github.com/paboyle/Grid 
+Grid physics library, www.github.com/paboyle/Grid
-    Source file: ./lib/qcd/action/fermion/WilsonKernels.h
+Source file: ./lib/qcd/action/fermion/WilsonKernels.h
-    Copyright (C) 2015
+Copyright (C) 2015
 Author: Peter Boyle <pabobyle@ph.ed.ac.uk>
 Author: Peter Boyle <paboyle@ph.ed.ac.uk>
 Author: paboyle <paboyle@ph.ed.ac.uk>
-    This program is free software; you can redistribute it and/or modify
+This program is free software; you can redistribute it and/or modify
-    it under the terms of the GNU General Public License as published by
+it under the terms of the GNU General Public License as published by
-    the Free Software Foundation; either version 2 of the License, or
+the Free Software Foundation; either version 2 of the License, or
-    (at your option) any later version.
+(at your option) any later version.
-    This program is distributed in the hope that it will be useful,
+This program is distributed in the hope that it will be useful,
-    but WITHOUT ANY WARRANTY; without even the implied warranty of
+but WITHOUT ANY WARRANTY; without even the implied warranty of
-    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
-    GNU General Public License for more details.
+GNU General Public License for more details.
-    You should have received a copy of the GNU General Public License along
+You should have received a copy of the GNU General Public License along
-    with this program; if not, write to the Free Software Foundation, Inc.,
+with this program; if not, write to the Free Software Foundation, Inc.,
-    51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
+51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
-    See the full license in the file "LICENSE" in the top level distribution directory
+See the full license in the file "LICENSE" in the top level distribution
-    *************************************************************************************/
+directory
-    /*  END LEGAL */
+*************************************************************************************/
-#ifndef  GRID_QCD_DHOP_H
+/*  END LEGAL */
-#define  GRID_QCD_DHOP_H
+#ifndef GRID_QCD_DHOP_H
+#define GRID_QCD_DHOP_H
 namespace Grid {
 namespace QCD {
-  namespace QCD {
+  ////////////////////////////////////////////////////////////////////////////////////////////////////////////////
  // Helper routines that implement Wilson stencil for a single site.
  // Common to both the WilsonFermion and WilsonFermion5D
  ////////////////////////////////////////////////////////////////////////////////////////////////////////////////
 class WilsonKernelsStatic { 
 public:
  enum { OptGeneric, OptHandUnroll, OptInlineAsm };
  // S-direction is INNERMOST and takes no part in the parity.
  static int Opt;  // these are a temporary hack
 };
-    ////////////////////////////////////////////////////////////////////////////////////////////////////////////////
+template<class Impl> class WilsonKernels : public FermionOperator<Impl> , public WilsonKernelsStatic { 
-    // Helper routines that implement Wilson stencil for a single site.
+ public:
    // Common to both the WilsonFermion and WilsonFermion5D
    ////////////////////////////////////////////////////////////////////////////////////////////////////////////////
    class WilsonKernelsStatic { 
    public:
      // S-direction is INNERMOST and takes no part in the parity.
      static int AsmOpt;  // these are a temporary hack
      static int HandOpt; // these are a temporary hack
    };
-    template<class Impl> class WilsonKernels : public FermionOperator<Impl> , public WilsonKernelsStatic { 
+  INHERIT_IMPL_TYPES(Impl);
-    public:
+  typedef FermionOperator<Impl> Base;
-     INHERIT_IMPL_TYPES(Impl);
+public:
-     typedef FermionOperator<Impl> Base;
+   
  template <bool EnableBool = true>
  typename std::enable_if<Impl::Dimension == 3 && Nc == 3 &&EnableBool, void>::type
  DiracOptDhopSite(StencilImpl &st, LebesgueOrder &lo, DoubledGaugeField &U, SiteHalfSpinor * buf,
 		   int sF, int sU, int Ls, int Ns, const FermionField &in, FermionField &out) 
  {
    switch(Opt) {
 #ifdef AVX512
    case OptInlineAsm:
       WilsonKernels<Impl>::DiracOptAsmDhopSite(st,lo,U,buf,sF,sU,Ls,Ns,in,out);
       break;
 #endif
    case OptHandUnroll:
      for (int site = 0; site < Ns; site++) {
 	for (int s = 0; s < Ls; s++) {
 	  WilsonKernels<Impl>::DiracOptHandDhopSite(st,lo,U,buf,sF,sU,in,out);
 	  sF++;
 	}
 	sU++;
      }
      break;
    case OptGeneric:
      for (int site = 0; site < Ns; site++) {
 	for (int s = 0; s < Ls; s++) {
 	  WilsonKernels<Impl>::DiracOptGenericDhopSite(st,lo,U,buf,sF,sU,in,out);
 	  sF++;
 	}
 	sU++;
      }
      break;
    default:
      assert(0);
    }
  }
  template <bool EnableBool = true>
  typename std::enable_if<(Impl::Dimension != 3 || (Impl::Dimension == 3 && Nc != 3)) && EnableBool, void>::type
  DiracOptDhopSite(StencilImpl &st, LebesgueOrder &lo, DoubledGaugeField &U, SiteHalfSpinor * buf,
 		   int sF, int sU, int Ls, int Ns, const FermionField &in, FermionField &out) {
    // no kernel choice  
    for (int site = 0; site < Ns; site++) {
      for (int s = 0; s < Ls; s++) {
 	WilsonKernels<Impl>::DiracOptGenericDhopSite(st, lo, U, buf, sF, sU, in, out);
 	sF++;
      }
      sU++;
    }
  }
  template <bool EnableBool = true>
  typename std::enable_if<Impl::Dimension == 3 && Nc == 3 && EnableBool,void>::type
  DiracOptDhopSiteDag(StencilImpl &st, LebesgueOrder &lo, DoubledGaugeField &U, SiteHalfSpinor * buf,
 		      int sF, int sU, int Ls, int Ns, const FermionField &in, FermionField &out) {
    switch(Opt) {
 #ifdef AVX512
    case OptInlineAsm:
      WilsonKernels<Impl>::DiracOptAsmDhopSiteDag(st,lo,U,buf,sF,sU,Ls,Ns,in,out);
      break;
 #endif
    case OptHandUnroll:
      for (int site = 0; site < Ns; site++) {
 	for (int s = 0; s < Ls; s++) {
 	  WilsonKernels<Impl>::DiracOptHandDhopSiteDag(st,lo,U,buf,sF,sU,in,out);
 	  sF++;
 	}
 	sU++;
      }
      break;
    case OptGeneric:
      for (int site = 0; site < Ns; site++) {
 	for (int s = 0; s < Ls; s++) {
 	  WilsonKernels<Impl>::DiracOptGenericDhopSiteDag(st,lo,U,buf,sF,sU,in,out);
 	  sF++;
 	}
 	sU++;
      }
      break;
    default:
      assert(0);
    }
  }
  template <bool EnableBool = true>
  typename std::enable_if<(Impl::Dimension != 3 || (Impl::Dimension == 3 && Nc != 3)) && EnableBool,void>::type
  DiracOptDhopSiteDag(StencilImpl &st, LebesgueOrder &lo, DoubledGaugeField &U,SiteHalfSpinor * buf,
 		      int sF, int sU, int Ls, int Ns, const FermionField &in, FermionField &out) {
    for (int site = 0; site < Ns; site++) {
      for (int s = 0; s < Ls; s++) {
 	WilsonKernels<Impl>::DiracOptGenericDhopSiteDag(st,lo,U,buf,sF,sU,in,out);
 	sF++;
      }
      sU++;
    }
  }
  void DiracOptDhopDir(StencilImpl &st, DoubledGaugeField &U,SiteHalfSpinor * buf,
 		       int sF, int sU, const FermionField &in, FermionField &out, int dirdisp, int gamma);
 private:
     // Specialised variants
  void DiracOptGenericDhopSite(StencilImpl &st, LebesgueOrder &lo, DoubledGaugeField &U, SiteHalfSpinor * buf,
 			       int sF, int sU, const FermionField &in, FermionField &out);
  void DiracOptGenericDhopSiteDag(StencilImpl &st, LebesgueOrder &lo, DoubledGaugeField &U, SiteHalfSpinor * buf,
 				  int sF, int sU, const FermionField &in, FermionField &out);
  void DiracOptAsmDhopSite(StencilImpl &st, LebesgueOrder &lo, DoubledGaugeField &U, SiteHalfSpinor * buf,
 			   int sF, int sU, int Ls, int Ns, const FermionField &in,FermionField &out);
  void DiracOptAsmDhopSiteDag(StencilImpl &st, LebesgueOrder &lo, DoubledGaugeField &U, SiteHalfSpinor * buf,
 			      int sF, int sU, int Ls, int Ns, const FermionField &in, FermionField &out);
  void DiracOptHandDhopSite(StencilImpl &st, LebesgueOrder &lo, DoubledGaugeField &U, SiteHalfSpinor * buf,
 			    int sF, int sU, const FermionField &in, FermionField &out);
  void DiracOptHandDhopSiteDag(StencilImpl &st, LebesgueOrder &lo, DoubledGaugeField &U, SiteHalfSpinor * buf,
 			       int sF, int sU, const FermionField &in, FermionField &out);
 public:
  WilsonKernels(const ImplParams &p = ImplParams());
 };
 }}
    public:
     void DiracOptDhopSite(StencilImpl &st,LebesgueOrder &lo,DoubledGaugeField &U,
 			   std::vector<SiteHalfSpinor,alignedAllocator<SiteHalfSpinor> >  &buf,
 			   int sF, int sU,int Ls, int Ns, const FermionField &in, FermionField &out);
     void DiracOptDhopSiteDag(StencilImpl &st,LebesgueOrder &lo,DoubledGaugeField &U,
 			      std::vector<SiteHalfSpinor,alignedAllocator<SiteHalfSpinor> >  &buf,
 			      int sF,int sU,int Ls, int Ns, const FermionField &in,FermionField &out);
     void DiracOptDhopDir(StencilImpl &st,DoubledGaugeField &U,
 			  std::vector<SiteHalfSpinor,alignedAllocator<SiteHalfSpinor> >  &buf,
 			  int sF,int sU,const FermionField &in, FermionField &out,int dirdisp,int gamma);
    private:
     // Specialised variants
     void DiracOptGenericDhopSite(StencilImpl &st,LebesgueOrder &lo,DoubledGaugeField &U,
 			   std::vector<SiteHalfSpinor,alignedAllocator<SiteHalfSpinor> >  &buf,
 			   int sF,int sU, const FermionField &in, FermionField &out);
     void DiracOptGenericDhopSiteDag(StencilImpl &st,LebesgueOrder &lo,DoubledGaugeField &U,
 			      std::vector<SiteHalfSpinor,alignedAllocator<SiteHalfSpinor> >  &buf,
 			      int sF,int sU,const FermionField &in,FermionField &out);
     void DiracOptAsmDhopSite(StencilImpl &st,LebesgueOrder &lo,DoubledGaugeField &U,
 			      std::vector<SiteHalfSpinor,alignedAllocator<SiteHalfSpinor> >  &buf,
 			      int sF,int sU,int Ls, int Ns, const FermionField &in, FermionField &out);
     void DiracOptAsmDhopSiteDag(StencilImpl &st,LebesgueOrder &lo,DoubledGaugeField &U,
 			      std::vector<SiteHalfSpinor,alignedAllocator<SiteHalfSpinor> >  &buf,
 			      int sF,int sU,int Ls, int Ns, const FermionField &in, FermionField &out);
     void DiracOptHandDhopSite(StencilImpl &st,LebesgueOrder &lo,DoubledGaugeField &U,
 			      std::vector<SiteHalfSpinor,alignedAllocator<SiteHalfSpinor> >  &buf,
 			      int sF,int sU,const FermionField &in, FermionField &out);
     void DiracOptHandDhopSiteDag(StencilImpl &st,LebesgueOrder &lo,DoubledGaugeField &U,
 				 std::vector<SiteHalfSpinor,alignedAllocator<SiteHalfSpinor> >  &buf,
 				 int sF,int sU,const FermionField &in, FermionField &out);
    public:
     WilsonKernels(const ImplParams &p= ImplParams());
    };
    ///////////////////////////////////////////////////////////
    // Default to no assembler implementation
    ///////////////////////////////////////////////////////////
    template<class Impl>
    void WilsonKernels<Impl >::DiracOptAsmDhopSite(StencilImpl &st,LebesgueOrder & lo,DoubledGaugeField &U,
 						   std::vector<SiteHalfSpinor,alignedAllocator<SiteHalfSpinor> >  &buf,
 						   int ss,int ssU,int Ls,int Ns,const FermionField &in, FermionField &out)
    {
      assert(0);
    }
    template<class Impl>
    void WilsonKernels<Impl >::DiracOptAsmDhopSiteDag(StencilImpl &st,LebesgueOrder & lo,DoubledGaugeField &U,
 						      std::vector<SiteHalfSpinor,alignedAllocator<SiteHalfSpinor> >  &buf,
 						      int ss,int ssU,int Ls,int Ns,const FermionField &in, FermionField &out)
    {
      assert(0);
    }
  }
 }
 #endif
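Note on the dispatch pattern in the header above: `DiracOptDhopSite` is declared twice as a member template with a defaulted `bool EnableBool` parameter, and `std::enable_if` keeps exactly one overload viable per `Impl`. Only the enabled-for-optimisation overload consults the runtime `Opt` switch. The sketch below reproduces that shape in isolation. `ImplA`, `ImplB`, `KernelsSketch` and the returned strings are hypothetical and stand in for the real implementation classes and kernels.

```cpp
#include <cassert>
#include <string>
#include <type_traits>

// Runtime kernel selector, modelled on WilsonKernelsStatic in the diff.
struct KernelsStaticSketch {
  enum { OptGeneric, OptHandUnroll };
  static int Opt;
};
int KernelsStaticSketch::Opt = KernelsStaticSketch::OptGeneric;

// Hypothetical implementation policies (not Grid classes).
struct ImplA { static constexpr int Dimension = 3; static constexpr int Nc = 3; };
struct ImplB { static constexpr int Dimension = 3; static constexpr int Nc = 2; };

template <class Impl>
class KernelsSketch : public KernelsStaticSketch {
public:
  // Viable only when the optimised kernels apply. EnableBool makes the
  // condition depend on the member template's own parameter, so a false
  // condition removes this overload (SFINAE) instead of being a hard error.
  template <bool EnableBool = true>
  typename std::enable_if<Impl::Dimension == 3 && Impl::Nc == 3 && EnableBool, std::string>::type
  DhopSite() {
    switch (Opt) {
      case OptHandUnroll: return "hand";
      case OptGeneric:    return "generic";
      default:            return "invalid";
    }
  }

  // Complementary condition: no kernel choice, always the generic path.
  template <bool EnableBool = true>
  typename std::enable_if<(Impl::Dimension != 3 || Impl::Nc != 3) && EnableBool, std::string>::type
  DhopSite() {
    return "generic";
  }
};
```

With `Opt` set to `OptHandUnroll`, `KernelsSketch<ImplA>().DhopSite()` takes the hand-unrolled branch, while `KernelsSketch<ImplB>().DhopSite()` ignores `Opt` and stays on the generic path.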
--- a/lib/qcd/action/fermion/WilsonKernelsAsm.cc
+++ b/lib/qcd/action/fermion/WilsonKernelsAsm.cc
@@ -10,6 +10,7 @@
 Author: Peter Boyle <paboyle@ph.ed.ac.uk>
 Author: paboyle <paboyle@ph.ed.ac.uk>
 Author: Guido Cossu <guido.cossu@ed.ac.uk>
    This program is free software; you can redistribute it and/or modify
    it under the terms of the GNU General Public License as published by
@@ -31,29 +32,48 @@ Author: paboyle <paboyle@ph.ed.ac.uk>
 #include <Grid.h>
 namespace Grid {
-  namespace QCD {
+namespace QCD {
 ///////////////////////////////////////////////////////////
 // Default to no assembler implementation
 ///////////////////////////////////////////////////////////
 template<class Impl> void 
 WilsonKernels<Impl >::DiracOptAsmDhopSite(StencilImpl &st,LebesgueOrder & lo,DoubledGaugeField &U,SiteHalfSpinor *buf,
 					  int ss,int ssU,int Ls,int Ns,const FermionField &in, FermionField &out)
 {
  assert(0);
 }
 template<class Impl> void 
 WilsonKernels<Impl >::DiracOptAsmDhopSiteDag(StencilImpl &st,LebesgueOrder & lo,DoubledGaugeField &U,SiteHalfSpinor *buf,
 					     int ss,int ssU,int Ls,int Ns,const FermionField &in, FermionField &out)
 {
  assert(0);
 }
 #if defined(AVX512) 
-    
+#include <simd/Intel512wilson.h>
    ///////////////////////////////////////////////////////////
    // If we are AVX512 specialise the single precision routine
    ///////////////////////////////////////////////////////////
 #include <simd/Intel512wilson.h>
 #include <simd/Intel512single.h>
-    static Vector<vComplexF> signs;
+static Vector<vComplexF> signsF;
-    int setupSigns(void ){
+  template<typename vtype>    
-      Vector<vComplexF> bother(2);
+  int setupSigns(Vector<vtype>& signs ){
-      signs = bother;
+    Vector<vtype> bother(2);
-      vrsign(signs[0]);
+    signs = bother;
-      visign(signs[1]);
+    vrsign(signs[0]);
-      return 1;
+    visign(signs[1]);
-    }
+    return 1;
-    static int signInit = setupSigns();
+  }
  static int signInitF = setupSigns(signsF);
 #define label(A)  ilabel(A)
 #define ilabel(A) ".globl\n"  #A ":\n" 
@@ -61,19 +81,19 @@ namespace Grid {
 #define MAYBEPERM(A,perm) if (perm) { A ; }
 #define MULT_2SPIN(ptr,pf) MULT_ADDSUB_2SPIN(ptr,pf)
 #define FX(A) WILSONASM_ ##A
 #define COMPLEX_TYPE vComplexF
 #define signs signsF
 #undef KERNEL_DAG
-    template<>
+template<> void 
-    void WilsonKernels<WilsonImplF>::DiracOptAsmDhopSite(StencilImpl &st,LebesgueOrder & lo,DoubledGaugeField &U,
+WilsonKernels<WilsonImplF>::DiracOptAsmDhopSite(StencilImpl &st,LebesgueOrder & lo,DoubledGaugeField &U, SiteHalfSpinor *buf,
-							 std::vector<SiteHalfSpinor,alignedAllocator<SiteHalfSpinor> >  &buf,
+						int ss,int ssU,int Ls,int Ns,const FermionField &in, FermionField &out)
-							 int ss,int ssU,int Ls,int Ns,const FermionField &in, FermionField &out)
 #include <qcd/action/fermion/WilsonKernelsAsmBody.h>
 #define KERNEL_DAG
-    template<>
+template<> void 
-    void WilsonKernels<WilsonImplF>::DiracOptAsmDhopSiteDag(StencilImpl &st,LebesgueOrder & lo,DoubledGaugeField &U,
+WilsonKernels<WilsonImplF>::DiracOptAsmDhopSiteDag(StencilImpl &st,LebesgueOrder & lo,DoubledGaugeField &U,SiteHalfSpinor *buf,
-							    std::vector<SiteHalfSpinor,alignedAllocator<SiteHalfSpinor> >  &buf,
+						   int ss,int ssU,int Ls,int Ns,const FermionField &in, FermionField &out)
-							    int ss,int ssU,int Ls,int Ns,const FermionField &in, FermionField &out)
 #include <qcd/action/fermion/WilsonKernelsAsmBody.h>
 #undef VMOVIDUP
@@ -83,25 +103,104 @@ namespace Grid {
 #undef FX 
 #define FX(A) DWFASM_ ## A
 #define MAYBEPERM(A,B) 
-#define VMOVIDUP(A,B,C)                                  VBCASTIDUPf(A,B,C)
+//#define VMOVIDUP(A,B,C)                                  VBCASTIDUPf(A,B,C)
-#define VMOVRDUP(A,B,C)                                  VBCASTRDUPf(A,B,C)
+//#define VMOVRDUP(A,B,C)                                  VBCASTRDUPf(A,B,C)
 #define MULT_2SPIN(ptr,pf) MULT_ADDSUB_2SPIN_LS(ptr,pf)
 #undef KERNEL_DAG
-    template<>
+template<> void 
-    void WilsonKernels<DomainWallVec5dImplF>::DiracOptAsmDhopSite(StencilImpl &st,LebesgueOrder & lo,DoubledGaugeField &U,
+WilsonKernels<DomainWallVec5dImplF>::DiracOptAsmDhopSite(StencilImpl &st,LebesgueOrder & lo,DoubledGaugeField &U, SiteHalfSpinor *buf,
-								  std::vector<SiteHalfSpinor,alignedAllocator<SiteHalfSpinor> >  &buf,
+							 int ss,int ssU,int Ls,int Ns,const FermionField &in, FermionField &out)
-								  int ss,int ssU,int Ls,int Ns,const FermionField &in, FermionField &out)
 #include <qcd/action/fermion/WilsonKernelsAsmBody.h>
 #define KERNEL_DAG
-    template<>
+template<> void 
-    void WilsonKernels<DomainWallVec5dImplF>::DiracOptAsmDhopSiteDag(StencilImpl &st,LebesgueOrder & lo,DoubledGaugeField &U,
+WilsonKernels<DomainWallVec5dImplF>::DiracOptAsmDhopSiteDag(StencilImpl &st,LebesgueOrder & lo,DoubledGaugeField &U,SiteHalfSpinor *buf,
-								     std::vector<SiteHalfSpinor,alignedAllocator<SiteHalfSpinor> >  &buf,
+							    int ss,int ssU,int Ls,int Ns,const FermionField &in, FermionField &out)
-								     int ss,int ssU,int Ls,int Ns,const FermionField &in, FermionField &out)
+#include <qcd/action/fermion/WilsonKernelsAsmBody.h>
 #undef COMPLEX_TYPE
 #undef signs
 #undef VMOVRDUP
 #undef MAYBEPERM
 #undef MULT_2SPIN
 #undef FX 
 ///////////////////////////////////////////////////////////
 // If we are AVX512 specialise the double precision routine
 ///////////////////////////////////////////////////////////
 #include <simd/Intel512double.h>
 static Vector<vComplexD> signsD;
 #define signs signsD
 static int signInitD = setupSigns(signsD);
 #define MAYBEPERM(A,perm) if (perm) { A ; }
 #define MULT_2SPIN(ptr,pf) MULT_ADDSUB_2SPIN(ptr,pf)
 #define FX(A) WILSONASM_ ##A
 #define COMPLEX_TYPE vComplexD
 #undef KERNEL_DAG
 template<> void 
 WilsonKernels<WilsonImplD>::DiracOptAsmDhopSite(StencilImpl &st,LebesgueOrder & lo,DoubledGaugeField &U, SiteHalfSpinor *buf,
 						int ss,int ssU,int Ls,int Ns,const FermionField &in, FermionField &out)
 #include <qcd/action/fermion/WilsonKernelsAsmBody.h>
-#endif
+#define KERNEL_DAG
-  }
+template<> void 
-}
+WilsonKernels<WilsonImplD>::DiracOptAsmDhopSiteDag(StencilImpl &st,LebesgueOrder & lo,DoubledGaugeField &U,SiteHalfSpinor *buf,
 						   int ss,int ssU,int Ls,int Ns,const FermionField &in, FermionField &out)
 #include <qcd/action/fermion/WilsonKernelsAsmBody.h>
 #undef VMOVIDUP
 #undef VMOVRDUP
 #undef MAYBEPERM
 #undef MULT_2SPIN
 #undef FX 
 #define FX(A) DWFASM_ ## A
 #define MAYBEPERM(A,B) 
 //#define VMOVIDUP(A,B,C)                                  VBCASTIDUPd(A,B,C)
 //#define VMOVRDUP(A,B,C)                                  VBCASTRDUPd(A,B,C)
 #define MULT_2SPIN(ptr,pf) MULT_ADDSUB_2SPIN_LS(ptr,pf)
 #undef KERNEL_DAG
 template<> void 
 WilsonKernels<DomainWallVec5dImplD>::DiracOptAsmDhopSite(StencilImpl &st,LebesgueOrder & lo,DoubledGaugeField &U, SiteHalfSpinor *buf,
 							 int ss,int ssU,int Ls,int Ns,const FermionField &in, FermionField &out)
 #include <qcd/action/fermion/WilsonKernelsAsmBody.h>
 #define KERNEL_DAG
 template<> void 
 WilsonKernels<DomainWallVec5dImplD>::DiracOptAsmDhopSiteDag(StencilImpl &st,LebesgueOrder & lo,DoubledGaugeField &U,SiteHalfSpinor *buf,
 							    int ss,int ssU,int Ls,int Ns,const FermionField &in, FermionField &out)
 #include <qcd/action/fermion/WilsonKernelsAsmBody.h>
 #undef COMPLEX_TYPE
 #undef signs
 #undef VMOVRDUP
 #undef MAYBEPERM
 #undef MULT_2SPIN
 #undef FX 
 #endif //AVX512
 #define INSTANTIATE_ASM(A)\
 template void WilsonKernels<A>::DiracOptAsmDhopSite(StencilImpl &st,LebesgueOrder & lo,DoubledGaugeField &U, SiteHalfSpinor *buf,\
                                  int ss,int ssU,int Ls,int Ns,const FermionField &in, FermionField &out);\
 \
 template void WilsonKernels<A>::DiracOptAsmDhopSiteDag(StencilImpl &st,LebesgueOrder & lo,DoubledGaugeField &U, SiteHalfSpinor *buf,\
                                  int ss,int ssU,int Ls,int Ns,const FermionField &in, FermionField &out);\
 INSTANTIATE_ASM(WilsonImplF);
 INSTANTIATE_ASM(WilsonImplD);
 INSTANTIATE_ASM(ZWilsonImplF);
 INSTANTIATE_ASM(ZWilsonImplD);
 INSTANTIATE_ASM(GparityWilsonImplF);
 INSTANTIATE_ASM(GparityWilsonImplD);
 INSTANTIATE_ASM(DomainWallVec5dImplF);
 INSTANTIATE_ASM(DomainWallVec5dImplD);
 INSTANTIATE_ASM(ZDomainWallVec5dImplF);
 INSTANTIATE_ASM(ZDomainWallVec5dImplD);
 }}
--- a/lib/qcd/action/fermion/WilsonKernelsAsmBody.h
+++ b/lib/qcd/action/fermion/WilsonKernelsAsmBody.h
@@ -5,7 +5,9 @@
  const uint64_t plocal =(uint64_t) & in._odata[0];
  //  vComplexF isigns[2] = { signs[0], signs[1] };
-  vComplexF *isigns = &signs[0];
+  //COMPLEX_TYPE is vComplexF or vComplexD depending
+  //on the chosen precision
+  COMPLEX_TYPE *isigns = &signs[0];
  MASK_REGS;
  int nmax=U._grid->oSites();
@@ -134,7 +136,9 @@
  ////////////////////////////////
  // Xm
  ////////////////////////////////
+#ifndef STREAM_STORE
   basep= (uint64_t) &out._odata[ss];
+#endif
  //  basep= st.GetPFInfo(nent,plocal); nent++;
  if ( local ) {
    LOAD64(%r10,isigns);  // times i => shuffle and xor the real part sign bit
@@ -229,7 +233,9 @@
    LOAD_CHI(base);
  }
  base= (uint64_t) &out._odata[ss];
+#ifndef STREAM_STORE
   PREFETCH_CHIMU(base);
+#endif
  {
    MULT_2SPIN_DIR_PFTM(Tm,basep);
  }
--- a/lib/qcd/action/fermion/WilsonKernelsHand.cc
+++ b/lib/qcd/action/fermion/WilsonKernelsHand.cc
@@ -311,10 +311,9 @@ namespace Grid {
 namespace QCD {
-template<class Impl>
+template<class Impl> void 
-void WilsonKernels<Impl >::DiracOptHandDhopSite(StencilImpl &st,LebesgueOrder &lo,DoubledGaugeField &U,
+WilsonKernels<Impl>::DiracOptHandDhopSite(StencilImpl &st,LebesgueOrder &lo,DoubledGaugeField &U,SiteHalfSpinor  *buf,
-					       std::vector<SiteHalfSpinor,alignedAllocator<SiteHalfSpinor> >  &buf,
+					  int ss,int sU,const FermionField &in, FermionField &out)
 					       int ss,int sU,const FermionField &in, FermionField &out)
 {
  typedef typename Simd::scalar_type S;
  typedef typename Simd::vector_type V;
@@ -555,9 +554,8 @@ void WilsonKernels<Impl >::DiracOptHandDhopSite(StencilImpl &st,LebesgueOrder &l
 }
 template<class Impl>
-void WilsonKernels<Impl >::DiracOptHandDhopSiteDag(StencilImpl &st,LebesgueOrder &lo,DoubledGaugeField &U,
+void WilsonKernels<Impl>::DiracOptHandDhopSiteDag(StencilImpl &st,LebesgueOrder &lo,DoubledGaugeField &U,SiteHalfSpinor *buf,
-					       std::vector<SiteHalfSpinor,alignedAllocator<SiteHalfSpinor> >  &buf,
+						  int ss,int sU,const FermionField &in, FermionField &out)
 					       int ss,int sU,const FermionField &in, FermionField &out)
 {
  //  std::cout << "Hand op Dhop "<<std::endl;
  typedef typename Simd::scalar_type S;
@@ -798,38 +796,35 @@ void WilsonKernels<Impl >::DiracOptHandDhopSiteDag(StencilImpl &st,LebesgueOrder
  }
 }
  ////////////////////////////////////////////////
  // Specialise Gparity to simple implementation
  ////////////////////////////////////////////////
-template<>
+template<> void 
-void WilsonKernels<GparityWilsonImplF>::DiracOptHandDhopSite(StencilImpl &st,LebesgueOrder &lo,DoubledGaugeField &U,
+WilsonKernels<GparityWilsonImplF>::DiracOptHandDhopSite(StencilImpl &st,LebesgueOrder &lo,DoubledGaugeField &U,
-							     std::vector<SiteHalfSpinor,alignedAllocator<SiteHalfSpinor> >  &buf,
+							SiteHalfSpinor *buf,
-							     int sF,int sU,const FermionField &in, FermionField &out)
+							int sF,int sU,const FermionField &in, FermionField &out)
 {
  assert(0);
 }
-template<>
+template<> void 
-void WilsonKernels<GparityWilsonImplF>::DiracOptHandDhopSiteDag(StencilImpl &st,LebesgueOrder &lo,DoubledGaugeField &U,
+WilsonKernels<GparityWilsonImplF>::DiracOptHandDhopSiteDag(StencilImpl &st,LebesgueOrder &lo,DoubledGaugeField &U,
-								std::vector<SiteHalfSpinor,alignedAllocator<SiteHalfSpinor> >  &buf,
+							   SiteHalfSpinor *buf,
-								int sF,int sU,const FermionField &in, FermionField &out)
+							   int sF,int sU,const FermionField &in, FermionField &out)
 {
  assert(0);
 }
-template<>
+template<> void 
-void WilsonKernels<GparityWilsonImplD>::DiracOptHandDhopSite(StencilImpl &st,LebesgueOrder &lo,DoubledGaugeField &U,
+WilsonKernels<GparityWilsonImplD>::DiracOptHandDhopSite(StencilImpl &st,LebesgueOrder &lo,DoubledGaugeField &U,SiteHalfSpinor *buf,
-							     std::vector<SiteHalfSpinor,alignedAllocator<SiteHalfSpinor> >  &buf,
+							int sF,int sU,const FermionField &in, FermionField &out)
 							     int sF,int sU,const FermionField &in, FermionField &out)
 {
  assert(0);
 }
-template<>
+template<> void 
-void WilsonKernels<GparityWilsonImplD>::DiracOptHandDhopSiteDag(StencilImpl &st,LebesgueOrder &lo,DoubledGaugeField &U,
+WilsonKernels<GparityWilsonImplD>::DiracOptHandDhopSiteDag(StencilImpl &st,LebesgueOrder &lo,DoubledGaugeField &U,SiteHalfSpinor *buf,
-								std::vector<SiteHalfSpinor,alignedAllocator<SiteHalfSpinor> >  &buf,
+							   int sF,int sU,const FermionField &in, FermionField &out)
 								int sF,int sU,const FermionField &in, FermionField &out)
 {
  assert(0);
 }
@@ -840,12 +835,10 @@ void WilsonKernels<GparityWilsonImplD>::DiracOptHandDhopSiteDag(StencilImpl &st,
 // Need Nc=3 though //
 #define INSTANTIATE_THEM(A) \
-template void WilsonKernels<A>::DiracOptHandDhopSite(StencilImpl &st,LebesgueOrder &lo,DoubledGaugeField &U,\
+template void WilsonKernels<A>::DiracOptHandDhopSite(StencilImpl &st,LebesgueOrder &lo,DoubledGaugeField &U,SiteHalfSpinor *buf,\
-							       std::vector<SiteHalfSpinor,alignedAllocator<SiteHalfSpinor> >  &buf,\
+						     int ss,int sU,const FermionField &in, FermionField &out); \
-							       int ss,int sU,const FermionField &in, FermionField &out);\
+template void WilsonKernels<A>::DiracOptHandDhopSiteDag(StencilImpl &st,LebesgueOrder &lo,DoubledGaugeField &U,SiteHalfSpinor *buf,\
-template void WilsonKernels<A>::DiracOptHandDhopSiteDag(StencilImpl &st,LebesgueOrder &lo,DoubledGaugeField &U,\
+							int ss,int sU,const FermionField &in, FermionField &out);
 								  std::vector<SiteHalfSpinor,alignedAllocator<SiteHalfSpinor> >  &buf,\
 								  int ss,int sU,const FermionField &in, FermionField &out);
 INSTANTIATE_THEM(WilsonImplF);
 INSTANTIATE_THEM(WilsonImplD);
--- a/lib/qcd/action/gauge/GaugeImpl.h
+++ b/lib/qcd/action/gauge/GaugeImpl.h
@@ -30,7 +30,6 @@ directory
 #define GRID_QCD_GAUGE_IMPL_H
 namespace Grid {
 namespace QCD {
 ////////////////////////////////////////////////////////////////////////
@@ -52,7 +51,7 @@ public:
  typedef S Simd;
  template <typename vtype>
-  using iImplGaugeLink = iScalar<iScalar<iMatrix<vtype, Nrepresentation>>>;
+  using iImplGaugeLink  = iScalar<iScalar<iMatrix<vtype, Nrepresentation>>>;
  template <typename vtype>
  using iImplGaugeField = iVector<iScalar<iMatrix<vtype, Nrepresentation>>, Nd>;
@@ -64,7 +63,7 @@ public:
                                                 // ugly
  typedef Lattice<SiteGaugeField> GaugeField;
-  // Move this elsewhere?
+  // Move this elsewhere? FIXME
  static inline void AddGaugeLink(GaugeField &U, GaugeLinkField &W,
                                  int mu) { // U[mu] += W
    PARALLEL_FOR_LOOP
@@ -174,12 +173,19 @@ typedef GaugeImplTypes<vComplex, Nc> GimplTypesR;
 typedef GaugeImplTypes<vComplexF, Nc> GimplTypesF;
 typedef GaugeImplTypes<vComplexD, Nc> GimplTypesD;
 typedef GaugeImplTypes<vComplex, SU<Nc>::AdjointDimension> GimplAdjointTypesR;
 typedef GaugeImplTypes<vComplexF, SU<Nc>::AdjointDimension> GimplAdjointTypesF;
 typedef GaugeImplTypes<vComplexD, SU<Nc>::AdjointDimension> GimplAdjointTypesD;
 typedef PeriodicGaugeImpl<GimplTypesR> PeriodicGimplR; // Real.. whichever prec
 typedef PeriodicGaugeImpl<GimplTypesF> PeriodicGimplF; // Float
 typedef PeriodicGaugeImpl<GimplTypesD> PeriodicGimplD; // Double
-typedef ConjugateGaugeImpl<GimplTypesR>
+typedef PeriodicGaugeImpl<GimplAdjointTypesR> PeriodicGimplAdjR; // Real.. whichever prec
-    ConjugateGimplR; // Real.. whichever prec
+typedef PeriodicGaugeImpl<GimplAdjointTypesF> PeriodicGimplAdjF; // Float
 typedef PeriodicGaugeImpl<GimplAdjointTypesD> PeriodicGimplAdjD; // Double
 typedef ConjugateGaugeImpl<GimplTypesR> ConjugateGimplR; // Real.. whichever prec
 typedef ConjugateGaugeImpl<GimplTypesF> ConjugateGimplF; // Float
 typedef ConjugateGaugeImpl<GimplTypesD> ConjugateGimplD; // Double
 }
--- a/lib/qcd/action/pseudofermion/TwoFlavour.h
+++ b/lib/qcd/action/pseudofermion/TwoFlavour.h
@@ -1,149 +1,151 @@
-    /*************************************************************************************
+/*************************************************************************************
-    Grid physics library, www.github.com/paboyle/Grid 
+Grid physics library, www.github.com/paboyle/Grid
-    Source file: ./lib/qcd/action/pseudofermion/TwoFlavour.h
+Source file: ./lib/qcd/action/pseudofermion/TwoFlavour.h
-    Copyright (C) 2015
+Copyright (C) 2015
 Author: Peter Boyle <pabobyle@ph.ed.ac.uk>
 Author: Peter Boyle <paboyle@ph.ed.ac.uk>
-    This program is free software; you can redistribute it and/or modify
+This program is free software; you can redistribute it and/or modify
-    it under the terms of the GNU General Public License as published by
+it under the terms of the GNU General Public License as published by
-    the Free Software Foundation; either version 2 of the License, or
+the Free Software Foundation; either version 2 of the License, or
-    (at your option) any later version.
+(at your option) any later version.
-    This program is distributed in the hope that it will be useful,
+This program is distributed in the hope that it will be useful,
-    but WITHOUT ANY WARRANTY; without even the implied warranty of
+but WITHOUT ANY WARRANTY; without even the implied warranty of
-    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
-    GNU General Public License for more details.
+GNU General Public License for more details.
-    You should have received a copy of the GNU General Public License along
+You should have received a copy of the GNU General Public License along
-    with this program; if not, write to the Free Software Foundation, Inc.,
+with this program; if not, write to the Free Software Foundation, Inc.,
-    51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
+51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
-    See the full license in the file "LICENSE" in the top level distribution directory
+See the full license in the file "LICENSE" in the top level distribution
-    *************************************************************************************/
+directory
-    /*  END LEGAL */
+*************************************************************************************/
 /*  END LEGAL */
 #ifndef QCD_PSEUDOFERMION_TWO_FLAVOUR_H
 #define QCD_PSEUDOFERMION_TWO_FLAVOUR_H
-namespace Grid{
+namespace Grid {
-  namespace QCD{
+namespace QCD {
-    ////////////////////////////////////////////////////////////////////////
+////////////////////////////////////////////////////////////////////////
-    // Two flavour pseudofermion action for any dop
+// Two flavour pseudofermion action for any dop
-    ////////////////////////////////////////////////////////////////////////
+////////////////////////////////////////////////////////////////////////
-    template<class Impl>
+template <class Impl>
-    class TwoFlavourPseudoFermionAction : public Action<typename Impl::GaugeField> {
+class TwoFlavourPseudoFermionAction : public Action<typename Impl::GaugeField> {
-    public:
+ public:
-      INHERIT_IMPL_TYPES(Impl);
+  INHERIT_IMPL_TYPES(Impl);
-    private:
+ private:
  FermionOperator<Impl> &FermOp;  // the basic operator
-      FermionOperator<Impl> & FermOp;// the basic operator
+  OperatorFunction<FermionField> &DerivativeSolver;
-      OperatorFunction<FermionField> &DerivativeSolver;
+  OperatorFunction<FermionField> &ActionSolver;
-      OperatorFunction<FermionField> &ActionSolver;
+  FermionField Phi;  // the pseudo fermion field for this trajectory
-      FermionField Phi; // the pseudo fermion field for this trajectory
+ public:
  /////////////////////////////////////////////////
  // Pass in required objects.
  /////////////////////////////////////////////////
  TwoFlavourPseudoFermionAction(FermionOperator<Impl> &Op,
                                OperatorFunction<FermionField> &DS,
                                OperatorFunction<FermionField> &AS)
      : FermOp(Op),
        DerivativeSolver(DS),
        ActionSolver(AS),
        Phi(Op.FermionGrid()){};
-    public:
+  //////////////////////////////////////////////////////////////////////////////////////
-      /////////////////////////////////////////////////
+  // Push the gauge field in to the dops. Assume any BC's and smearing already
-      // Pass in required objects.
+  // applied
-      /////////////////////////////////////////////////
+  //////////////////////////////////////////////////////////////////////////////////////
-    TwoFlavourPseudoFermionAction(FermionOperator<Impl>  &Op, 
+  virtual void refresh(const GaugeField &U, GridParallelRNG &pRNG) {
-				  OperatorFunction<FermionField> & DS,
+    // P(phi) = e^{- phi^dag (MdagM)^-1 phi}
-				  OperatorFunction<FermionField> & AS
+    // Phi = Mdag eta
-				  ) : FermOp(Op), DerivativeSolver(DS), ActionSolver(AS), Phi(Op.FermionGrid()) {
+    // P(eta) = e^{- eta^dag eta}
-      };
+    //
    // e^{x^2/2 sig^2} => sig^2 = 0.5.
    //
    // So eta should be of width sig = 1/sqrt(2).
    // and must multiply by 0.707....
    //
    // Chroma has this scale factor: two_flavor_monomial_w.h
    // IroIro: does not use this scale. It is absorbed by a change of vars
    //         in the Phi integral, and thus is only an irrelevant prefactor for
    //         the partition function.
    //
    RealD scale = std::sqrt(0.5);
    FermionField eta(FermOp.FermionGrid());
-      //////////////////////////////////////////////////////////////////////////////////////
+    gaussian(pRNG, eta);
      // Push the gauge field in to the dops. Assume any BC's and smearing already applied
      //////////////////////////////////////////////////////////////////////////////////////
      virtual void refresh(const GaugeField &U, GridParallelRNG& pRNG) {
-	// P(phi) = e^{- phi^dag (MdagM)^-1 phi}
+    FermOp.ImportGauge(U);
-	// Phi = Mdag eta 
+    FermOp.Mdag(eta, Phi);
 	// P(eta) = e^{- eta^dag eta}
 	//
 	// e^{x^2/2 sig^2} => sig^2 = 0.5.
 	// 
 	// So eta should be of width sig = 1/sqrt(2).
 	// and must multiply by 0.707....
 	//
 	// Chroma has this scale factor: two_flavor_monomial_w.h
 	// IroIro: does not use this scale. It is absorbed by a change of vars
 	//         in the Phi integral, and thus is only an irrelevant prefactor for the partition function.
 	//
 	RealD scale = std::sqrt(0.5);
 	FermionField eta(FermOp.FermionGrid());
-	gaussian(pRNG,eta);
+    Phi = Phi * scale;
  };
-	FermOp.ImportGauge(U);
+  //////////////////////////////////////////////////////
-	FermOp.Mdag(eta,Phi);
+  // S = phi^dag (Mdag M)^-1 phi
  //////////////////////////////////////////////////////
  virtual RealD S(const GaugeField &U) {
    FermOp.ImportGauge(U);
-	Phi=Phi*scale;
+    FermionField X(FermOp.FermionGrid());
    FermionField Y(FermOp.FermionGrid());
-      };
+    MdagMLinearOperator<FermionOperator<Impl>, FermionField> MdagMOp(FermOp);
    X = zero;
    ActionSolver(MdagMOp, Phi, X);
    MdagMOp.Op(X, Y);
-      //////////////////////////////////////////////////////
+    RealD action = norm2(Y);
-      // S = phi^dag (Mdag M)^-1 phi
+    std::cout << GridLogMessage << "Pseudofermion action " << action
-      //////////////////////////////////////////////////////
+              << std::endl;
-      virtual RealD S(const GaugeField &U) {
+    return action;
  };
-	FermOp.ImportGauge(U);
+  //////////////////////////////////////////////////////
  // dS/du = - phi^dag  (Mdag M)^-1 [ Mdag dM + dMdag M ]  (Mdag M)^-1 phi
  //       = - phi^dag M^-1 dM (MdagM)^-1 phi -  phi^dag (MdagM)^-1 dMdag dM
  //       (Mdag)^-1 phi
  //
  //       = - Ydag dM X  - Xdag dMdag Y
  //
  //////////////////////////////////////////////////////
  virtual void deriv(const GaugeField &U, GaugeField &dSdU) {
    FermOp.ImportGauge(U);
-	FermionField X(FermOp.FermionGrid());
+    FermionField X(FermOp.FermionGrid());
-	FermionField Y(FermOp.FermionGrid());
+    FermionField Y(FermOp.FermionGrid());
    GaugeField tmp(FermOp.GaugeGrid());
-	MdagMLinearOperator<FermionOperator<Impl> ,FermionField> MdagMOp(FermOp);
+    MdagMLinearOperator<FermionOperator<Impl>, FermionField> MdagMOp(FermOp);
 	X=zero;
 	ActionSolver(MdagMOp,Phi,X);
 	MdagMOp.Op(X,Y);
-	RealD action = norm2(Y);
+    X = zero;
-	std::cout << GridLogMessage << "Pseudofermion action "<<action<<std::endl;
+    DerivativeSolver(MdagMOp, Phi, X); // X = (MdagM)^-1 phi    
-	return action;
+    MdagMOp.Op(X, Y);                  // Y = M X = (Mdag)^-1 phi
      };
-      //////////////////////////////////////////////////////
+    // Our conventions really make this UdSdU; We do not differentiate wrt Udag
-      // dS/du = - phi^dag  (Mdag M)^-1 [ Mdag dM + dMdag M ]  (Mdag M)^-1 phi
+    // here.
-      //       = - phi^dag M^-1 dM (MdagM)^-1 phi -  phi^dag (MdagM)^-1 dMdag dM (Mdag)^-1 phi 
+    // So must take dSdU - adj(dSdU) and left multiply by mom to get dS/dt.
      //
      //       = - Ydag dM X  - Xdag dMdag Y
      //
      //////////////////////////////////////////////////////
      virtual void deriv(const GaugeField &U,GaugeField & dSdU) {
-	FermOp.ImportGauge(U);
+    FermOp.MDeriv(tmp, Y, X, DaggerNo);
    dSdU = tmp;
    FermOp.MDeriv(tmp, X, Y, DaggerYes);
    dSdU = dSdU + tmp;
-	FermionField X(FermOp.FermionGrid());
+    // not taking here the traceless antihermitian component
-	FermionField Y(FermOp.FermionGrid());
+  };
-	GaugeField   tmp(FermOp.GaugeGrid());
+};
-
+}
 	MdagMLinearOperator<FermionOperator<Impl> ,FermionField> MdagMOp(FermOp);
 	X=zero;
 	DerivativeSolver(MdagMOp,Phi,X);
 	MdagMOp.Op(X,Y);
 	// Our conventions really make this UdSdU; We do not differentiate wrt Udag here.
 	// So must take dSdU - adj(dSdU) and left multiply by mom to get dS/dt.
 	FermOp.MDeriv(tmp , Y, X,DaggerNo );  dSdU=tmp;
 	FermOp.MDeriv(tmp , X, Y,DaggerYes);  dSdU=dSdU+tmp;
 	//dSdU = Ta(dSdU);
      };
    };
  }
 }
 #endif
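
The comment block in `refresh` above argues that, for P(eta) = e^{- eta^dag eta}, each real component of eta must have variance 0.5, which is why the field is multiplied by `std::sqrt(0.5)`. The standalone sketch below checks that arithmetic numerically. It does not use Grid: `draw_scaled_eta` and `mean_norm2` are hypothetical helpers, and it assumes the generator fills real and imaginary parts with unit-variance normal numbers, which is an assumption about `gaussian(pRNG, eta)` rather than something this diff states.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical helper (not part of Grid): draw n complex numbers whose real
// and imaginary parts are unit-variance normals, then scale by sqrt(0.5) so
// that each real component has variance 0.5.
std::vector<std::complex<double>> draw_scaled_eta(std::size_t n, unsigned seed) {
  std::mt19937_64 rng(seed);
  std::normal_distribution<double> unit(0.0, 1.0);
  const double scale = std::sqrt(0.5);  // same factor as in refresh()
  std::vector<std::complex<double>> eta(n);
  for (auto &z : eta) {
    const double re = unit(rng);
    const double im = unit(rng);
    z = scale * std::complex<double>(re, im);
  }
  return eta;
}

// Hypothetical helper: average of |z|^2 over the sample.
double mean_norm2(const std::vector<std::complex<double>> &v) {
  assert(!v.empty());
  double sum = 0.0;
  for (const auto &z : v) sum += std::norm(z);  // std::norm returns |z|^2
  return sum / static_cast<double>(v.size());
}
```

With this scaling the sample average of |eta|^2 per complex component should approach 1, i.e. 0.5 from the real part plus 0.5 from the imaginary part. In the patch the factor is applied to `Phi` after `FermOp.Mdag(eta, Phi)`; because `Mdag` is linear, that is equivalent to scaling `eta` first.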
--- a/lib/qcd/action/pseudofermion/TwoFlavourEvenOdd.h
+++ b/lib/qcd/action/pseudofermion/TwoFlavourEvenOdd.h
@@ -1,70 +1,66 @@
-    /*************************************************************************************
+/*************************************************************************************
-    Grid physics library, www.github.com/paboyle/Grid 
+Grid physics library, www.github.com/paboyle/Grid
-    Source file: ./lib/qcd/action/pseudofermion/TwoFlavourEvenOdd.h
+Source file: ./lib/qcd/action/pseudofermion/TwoFlavourEvenOdd.h
-    Copyright (C) 2015
+Copyright (C) 2015
 Author: Peter Boyle <pabobyle@ph.ed.ac.uk>
 Author: Peter Boyle <paboyle@ph.ed.ac.uk>
-    This program is free software; you can redistribute it and/or modify
+This program is free software; you can redistribute it and/or modify
-    it under the terms of the GNU General Public License as published by
+it under the terms of the GNU General Public License as published by
-    the Free Software Foundation; either version 2 of the License, or
+the Free Software Foundation; either version 2 of the License, or
-    (at your option) any later version.
+(at your option) any later version.
-    This program is distributed in the hope that it will be useful,
+This program is distributed in the hope that it will be useful,
-    but WITHOUT ANY WARRANTY; without even the implied warranty of
+but WITHOUT ANY WARRANTY; without even the implied warranty of
-    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
-    GNU General Public License for more details.
+GNU General Public License for more details.
-    You should have received a copy of the GNU General Public License along
+You should have received a copy of the GNU General Public License along
-    with this program; if not, write to the Free Software Foundation, Inc.,
+with this program; if not, write to the Free Software Foundation, Inc.,
-    51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
+51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
-    See the full license in the file "LICENSE" in the top level distribution directory
+See the full license in the file "LICENSE" in the top level distribution
-    *************************************************************************************/
+directory
-    /*  END LEGAL */
+*************************************************************************************/
 /*  END LEGAL */
 #ifndef QCD_PSEUDOFERMION_TWO_FLAVOUR_EVEN_ODD_H
 #define QCD_PSEUDOFERMION_TWO_FLAVOUR_EVEN_ODD_H
-namespace Grid{
+namespace Grid {
-  namespace QCD{
+namespace QCD {
 ////////////////////////////////////////////////////////////////////////
 // Two flavour pseudofermion action for any EO prec dop
 ////////////////////////////////////////////////////////////////////////
 template <class Impl>
 class TwoFlavourEvenOddPseudoFermionAction
    : public Action<typename Impl::GaugeField> {
 public:
  INHERIT_IMPL_TYPES(Impl);
 private:
  FermionOperator<Impl> &FermOp;  // the basic operator
-    ////////////////////////////////////////////////////////////////////////
+  OperatorFunction<FermionField> &DerivativeSolver;
-    // Two flavour pseudofermion action for any EO prec dop
+  OperatorFunction<FermionField> &ActionSolver;
    ////////////////////////////////////////////////////////////////////////
    template<class Impl>
    class TwoFlavourEvenOddPseudoFermionAction : public Action<typename Impl::GaugeField> {
-    public:
+  FermionField PhiOdd;   // the pseudo fermion field for this trajectory
  FermionField PhiEven;  // the pseudo fermion field for this trajectory
-      INHERIT_IMPL_TYPES(Impl);
+ public:
-
+  /////////////////////////////////////////////////
-    private:
+  // Pass in required objects.
-      
+  /////////////////////////////////////////////////
-      FermionOperator<Impl> & FermOp;// the basic operator
+  TwoFlavourEvenOddPseudoFermionAction(FermionOperator<Impl> &Op,
-
+                                       OperatorFunction<FermionField> &DS,
-      OperatorFunction<FermionField> &DerivativeSolver;
+                                       OperatorFunction<FermionField> &AS)
-      OperatorFunction<FermionField> &ActionSolver;
+      : FermOp(Op),
-
+        DerivativeSolver(DS),
-      FermionField PhiOdd;   // the pseudo fermion field for this trajectory
+        ActionSolver(AS),
      FermionField PhiEven;  // the pseudo fermion field for this trajectory
    public:
      /////////////////////////////////////////////////
      // Pass in required objects.
      /////////////////////////////////////////////////
      TwoFlavourEvenOddPseudoFermionAction(FermionOperator<Impl>  &Op, 
 					 OperatorFunction<FermionField> & DS,
 					 OperatorFunction<FermionField> & AS
 					   ) : 
        FermOp(Op), 
 	DerivativeSolver(DS), 
 	ActionSolver(AS), 
        PhiEven(Op.FermionRedBlackGrid()),
 	PhiOdd(Op.FermionRedBlackGrid())
 		  {};
--- a/lib/qcd/action/pseudofermion/TwoFlavourEvenOddRatio.h
+++ b/lib/qcd/action/pseudofermion/TwoFlavourEvenOddRatio.h
@@ -131,9 +131,11 @@ namespace Grid{
 	Vpc.MpcDag(PhiOdd,Y);           // Y= Vdag phi
 	X=zero;
 	ActionSolver(Mpc,Y,X);          // X= (MdagM)^-1 Vdag phi
-	Mpc.Mpc(X,Y);                   // Y=  Mdag^-1 Vdag phi
+	//Mpc.Mpc(X,Y);                   // Y=  Mdag^-1 Vdag phi
+	// Multiply by Ydag
+	RealD action = real(innerProduct(Y,X));
-	RealD action = norm2(Y);
+	//RealD action = norm2(Y);
 	// The EE factorised block; normally can replace with zero if det is constant (gauge field indept)
 	// Only really clover term that creates this. Leave the EE portion as a future to do to make most
--- a/lib/qcd/hmc/HmcRunner.h
+++ b/lib/qcd/hmc/HmcRunner.h
@@ -1,179 +1,191 @@
-    /*************************************************************************************
+/*************************************************************************************
-    Grid physics library, www.github.com/paboyle/Grid 
+Grid physics library, www.github.com/paboyle/Grid
-    Source file: ./lib/qcd/hmc/HmcRunner.h
+Source file: ./lib/qcd/hmc/HmcRunner.h
-    Copyright (C) 2015
+Copyright (C) 2015
 Author: paboyle <paboyle@ph.ed.ac.uk>
-    This program is free software; you can redistribute it and/or modify
+This program is free software; you can redistribute it and/or modify
-    it under the terms of the GNU General Public License as published by
+it under the terms of the GNU General Public License as published by
-    the Free Software Foundation; either version 2 of the License, or
+the Free Software Foundation; either version 2 of the License, or
-    (at your option) any later version.
+(at your option) any later version.
-    This program is distributed in the hope that it will be useful,
+This program is distributed in the hope that it will be useful,
-    but WITHOUT ANY WARRANTY; without even the implied warranty of
+but WITHOUT ANY WARRANTY; without even the implied warranty of
-    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
-    GNU General Public License for more details.
+GNU General Public License for more details.
-    You should have received a copy of the GNU General Public License along
+You should have received a copy of the GNU General Public License along
-    with this program; if not, write to the Free Software Foundation, Inc.,
+with this program; if not, write to the Free Software Foundation, Inc.,
-    51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
+51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
-    See the full license in the file "LICENSE" in the top level distribution directory
+See the full license in the file "LICENSE" in the top level distribution
-    *************************************************************************************/
+directory
-    /*  END LEGAL */
+*************************************************************************************/
 /*  END LEGAL */
 #ifndef HMC_RUNNER
 #define HMC_RUNNER
-namespace Grid{
+namespace Grid {
-  namespace QCD{
+namespace QCD {
-
+template <class Gimpl, class RepresentationsPolicy = NoHirep >
 template<class Gimpl>
 class NerscHmcRunnerTemplate {
-public:
+ public:
  INHERIT_GIMPL_TYPES(Gimpl);
  enum StartType_t { ColdStart, HotStart, TepidStart, CheckpointStart };
-  ActionSet<GaugeField> TheAction;
+  ActionSet<GaugeField, RepresentationsPolicy> TheAction;
-  GridCartesian         * UGrid   ;
+  GridCartesian *UGrid;
-  GridCartesian         * FGrid   ;
+  GridCartesian *FGrid;
-  GridRedBlackCartesian * UrbGrid ;
+  GridRedBlackCartesian *UrbGrid;
-  GridRedBlackCartesian * FrbGrid ;
+  GridRedBlackCartesian *FrbGrid;
-  virtual void BuildTheAction (int argc, char **argv) = 0; // necessary?
+  virtual void BuildTheAction(int argc, char **argv) = 0;  // necessary?
-  void Run (int argc, char  **argv){
+  void Run(int argc, char **argv) {
    StartType_t StartType = HotStart;
    std::string arg;
-    if( GridCmdOptionExists(argv,argv+argc,"--StartType") ){
+    if (GridCmdOptionExists(argv, argv + argc, "--StartType")) {
-      arg = GridCmdOptionPayload(argv,argv+argc,"--StartType");
+      arg = GridCmdOptionPayload(argv, argv + argc, "--StartType");
-      if ( arg == "HotStart" ) { StartType = HotStart; }
+      if (arg == "HotStart") {
-      else if ( arg == "ColdStart" ) { StartType = ColdStart; }
+        StartType = HotStart;
-      else if ( arg == "TepidStart" ) { StartType = TepidStart; }
+      } else if (arg == "ColdStart") {
-      else if ( arg == "CheckpointStart" ) { StartType = CheckpointStart; }
+        StartType = ColdStart;
-      else assert(0);
+      } else if (arg == "TepidStart") {
        StartType = TepidStart;
      } else if (arg == "CheckpointStart") {
        StartType = CheckpointStart;
      } else {
        std::cout << GridLogError << "Unrecognized option in --StartType\n";
        std::cout << GridLogError << "Valid [HotStart, ColdStart, TepidStart, CheckpointStart]\n";
        assert(0);
      }
    }
    int StartTraj = 0;
-    if( GridCmdOptionExists(argv,argv+argc,"--StartTrajectory") ){
+    if (GridCmdOptionExists(argv, argv + argc, "--StartTrajectory")) {
-      arg= GridCmdOptionPayload(argv,argv+argc,"--StartTrajectory");
+      arg = GridCmdOptionPayload(argv, argv + argc, "--StartTrajectory");
      std::vector<int> ivec(0);
-      GridCmdOptionIntVector(arg,ivec);
+      GridCmdOptionIntVector(arg, ivec);
      StartTraj = ivec[0];
    }
    int NumTraj = 1;
-    if( GridCmdOptionExists(argv,argv+argc,"--Trajectories") ){
+    if (GridCmdOptionExists(argv, argv + argc, "--Trajectories")) {
-      arg= GridCmdOptionPayload(argv,argv+argc,"--Trajectories");
+      arg = GridCmdOptionPayload(argv, argv + argc, "--Trajectories");
      std::vector<int> ivec(0);
-      GridCmdOptionIntVector(arg,ivec);
+      GridCmdOptionIntVector(arg, ivec);
      NumTraj = ivec[0];
    }
    int NumThermalizations = 10;
-    if( GridCmdOptionExists(argv,argv+argc,"--Thermalizations") ){
+    if (GridCmdOptionExists(argv, argv + argc, "--Thermalizations")) {
-      arg= GridCmdOptionPayload(argv,argv+argc,"--Thermalizations");
+      arg = GridCmdOptionPayload(argv, argv + argc, "--Thermalizations");
      std::vector<int> ivec(0);
-      GridCmdOptionIntVector(arg,ivec);
+      GridCmdOptionIntVector(arg, ivec);
      NumThermalizations = ivec[0];
    }
    GridSerialRNG sRNG;
    GridParallelRNG pRNG(UGrid);
    LatticeGaugeField U(UGrid);  // change this to an extended field (smearing class)?
-    GridSerialRNG    sRNG;
+    std::vector<int> SerSeed({1, 2, 3, 4, 5});
-    GridParallelRNG  pRNG(UGrid);
+    std::vector<int> ParSeed({6, 7, 8, 9, 10});
    LatticeGaugeField  U(UGrid); // change this to an extended field (smearing class)
    std::vector<int> SerSeed({1,2,3,4,5});
    std::vector<int> ParSeed({6,7,8,9,10});
    // Create integrator, including the smearing policy
-    // Smearing policy
+    // Smearing policy, only defined for Nc=3
    /*
    std::cout << GridLogDebug << " Creating the Stout class\n";
-    double rho = 0.1; // smearing parameter, now hardcoded
+    double rho = 0.1;  // smearing parameter, now hardcoded
-    int Nsmear = 1;   // number of smearing levels
+    int Nsmear = 1;    // number of smearing levels
    Smear_Stout<Gimpl> Stout(rho);
    std::cout << GridLogDebug << " Creating the SmearedConfiguration class\n";
-    SmearedConfiguration<Gimpl> SmearingPolicy(UGrid, Nsmear, Stout);
+    //SmearedConfiguration<Gimpl> SmearingPolicy(UGrid, Nsmear, Stout);
    std::cout << GridLogDebug << " done\n";
    */
    //////////////
-    typedef MinimumNorm2<GaugeField, SmearedConfiguration<Gimpl> >  IntegratorType;// change here to change the algorithm
+    NoSmearing<Gimpl> SmearingPolicy;
-    IntegratorParameters MDpar(20);
+    typedef MinimumNorm2<GaugeField, NoSmearing<Gimpl>, RepresentationsPolicy >
        IntegratorType;  // change here to change the algorithm
    IntegratorParameters MDpar(20, 1.0);
    IntegratorType MDynamics(UGrid, MDpar, TheAction, SmearingPolicy);
    // Checkpoint strategy
-    NerscHmcCheckpointer<Gimpl> Checkpoint(std::string("ckpoint_lat"),std::string("ckpoint_rng"),1);
+    NerscHmcCheckpointer<Gimpl> Checkpoint(std::string("ckpoint_lat"),
-    PlaquetteLogger<Gimpl>      PlaqLog(std::string("plaq"));
+                                           std::string("ckpoint_rng"), 1);
    PlaquetteLogger<Gimpl> PlaqLog(std::string("plaq"));
    HMCparameters HMCpar;
-    HMCpar.StartTrajectory   = StartTraj;
+    HMCpar.StartTrajectory = StartTraj;
-    HMCpar.Trajectories      = NumTraj;
+    HMCpar.Trajectories = NumTraj;
    HMCpar.NoMetropolisUntil = NumThermalizations;
-
+    if (StartType == HotStart) {
    if ( StartType == HotStart ) {
      // Hot start
      HMCpar.MetropolisTest = true;
      sRNG.SeedFixedIntegers(SerSeed);
      pRNG.SeedFixedIntegers(ParSeed);
-      SU3::HotConfiguration(pRNG, U);
+      SU<Nc>::HotConfiguration(pRNG, U);
-    } else if ( StartType == ColdStart ) { 
+    } else if (StartType == ColdStart) {
      // Cold start
      HMCpar.MetropolisTest = true;
      sRNG.SeedFixedIntegers(SerSeed);
      pRNG.SeedFixedIntegers(ParSeed);
-      SU3::ColdConfiguration(pRNG, U);
+      SU<Nc>::ColdConfiguration(pRNG, U);
-    } else if ( StartType == TepidStart ) {       
+    } else if (StartType == TepidStart) {
      // Tepid start
      HMCpar.MetropolisTest = true;
      sRNG.SeedFixedIntegers(SerSeed);
      pRNG.SeedFixedIntegers(ParSeed);
-      SU3::TepidConfiguration(pRNG, U);
+      SU<Nc>::TepidConfiguration(pRNG, U);
-    } else if ( StartType == CheckpointStart ) { 
+    } else if (StartType == CheckpointStart) {
      HMCpar.MetropolisTest = true;
      // CheckpointRestart
      Checkpoint.CheckpointRestore(StartTraj, U, sRNG, pRNG);
    }
-    // Attach the gauge field to the smearing Policy and create the fill the smeared set
+    // Attach the gauge field to the smearing policy and fill the
    // smeared set
    // notice that the unit configuration is singular in this procedure
    std::cout << GridLogMessage << "Filling the smeared set\n";
    SmearingPolicy.set_GaugeField(U);
-    HybridMonteCarlo<GaugeField,IntegratorType>  HMC(HMCpar, MDynamics,sRNG,pRNG,U); 
+    HybridMonteCarlo<GaugeField, IntegratorType> HMC(HMCpar, MDynamics, sRNG,
                                                     pRNG, U);
    HMC.AddObservable(&Checkpoint);
    HMC.AddObservable(&PlaqLog);
    // Run it
    HMC.evolve();
  }
 };
- typedef NerscHmcRunnerTemplate<PeriodicGimplR> NerscHmcRunner;
+typedef NerscHmcRunnerTemplate<PeriodicGimplR> NerscHmcRunner;
- typedef NerscHmcRunnerTemplate<PeriodicGimplF> NerscHmcRunnerF;
+typedef NerscHmcRunnerTemplate<PeriodicGimplF> NerscHmcRunnerF;
- typedef NerscHmcRunnerTemplate<PeriodicGimplD> NerscHmcRunnerD;
+typedef NerscHmcRunnerTemplate<PeriodicGimplD> NerscHmcRunnerD;
- typedef NerscHmcRunnerTemplate<PeriodicGimplR> PeriodicNerscHmcRunner;
+typedef NerscHmcRunnerTemplate<PeriodicGimplR> PeriodicNerscHmcRunner;
- typedef NerscHmcRunnerTemplate<PeriodicGimplF> PeriodicNerscHmcRunnerF;
+typedef NerscHmcRunnerTemplate<PeriodicGimplF> PeriodicNerscHmcRunnerF;
- typedef NerscHmcRunnerTemplate<PeriodicGimplD> PeriodicNerscHmcRunnerD;
+typedef NerscHmcRunnerTemplate<PeriodicGimplD> PeriodicNerscHmcRunnerD;
- typedef NerscHmcRunnerTemplate<ConjugateGimplR> ConjugateNerscHmcRunner;
+typedef NerscHmcRunnerTemplate<ConjugateGimplR> ConjugateNerscHmcRunner;
- typedef NerscHmcRunnerTemplate<ConjugateGimplF> ConjugateNerscHmcRunnerF;
+typedef NerscHmcRunnerTemplate<ConjugateGimplF> ConjugateNerscHmcRunnerF;
- typedef NerscHmcRunnerTemplate<ConjugateGimplD> ConjugateNerscHmcRunnerD;
+typedef NerscHmcRunnerTemplate<ConjugateGimplD> ConjugateNerscHmcRunnerD;
-}}
+template <class RepresentationsPolicy>
 using NerscHmcRunnerHirep = NerscHmcRunnerTemplate<PeriodicGimplR, RepresentationsPolicy>;
 }
 }
 #endif
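The `--StartType` handling in `Run` above maps a command-line string onto an enum and aborts on anything unrecognized. A minimal standalone sketch of that pattern follows. The names `optionExists`, `optionPayload`, `parseStartType` and `StartKind` are hypothetical stand-ins written for this illustration, not Grid's own helpers, and the sketch throws an exception where the runner uses `assert(0)`.

```cpp
#include <algorithm>
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the runner's StartType_t enum.
enum StartKind { HotStart, ColdStart, TepidStart, CheckpointStart };

// True if `option` appears anywhere in [begin, end).
bool optionExists(char** begin, char** end, const std::string& option) {
  return std::find(begin, end, option) != end;
}

// Returns the argument following `option`, or an empty string if absent.
std::string optionPayload(char** begin, char** end, const std::string& option) {
  char** itr = std::find(begin, end, option);
  if (itr != end && ++itr != end) return std::string(*itr);
  return std::string();
}

// Maps the payload onto the enum; unknown values are reported, not ignored.
StartKind parseStartType(const std::string& arg) {
  if (arg == "HotStart") return HotStart;
  if (arg == "ColdStart") return ColdStart;
  if (arg == "TepidStart") return TepidStart;
  if (arg == "CheckpointStart") return CheckpointStart;
  throw std::invalid_argument(
      "Unrecognized option in --StartType: " + arg +
      " (valid: HotStart, ColdStart, TepidStart, CheckpointStart)");
}
```

Throwing keeps the sketch testable; a production runner may still prefer to log and abort, as the code above does.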
--- a/lib/qcd/hmc/integrators/Integrator.h
+++ b/lib/qcd/hmc/integrators/Integrator.h
@@ -1,33 +1,34 @@
-    /*************************************************************************************
+/*************************************************************************************
-    Grid physics library, www.github.com/paboyle/Grid 
+Grid physics library, www.github.com/paboyle/Grid
-    Source file: ./lib/qcd/hmc/integrators/Integrator.h
+Source file: ./lib/qcd/hmc/integrators/Integrator.h
-    Copyright (C) 2015
+Copyright (C) 2015
 Author: Azusa Yamaguchi <ayamaguc@staffmail.ed.ac.uk>
 Author: Peter Boyle <paboyle@ph.ed.ac.uk>
 Author: neo <cossu@post.kek.jp>
 Author: paboyle <paboyle@ph.ed.ac.uk>
-    This program is free software; you can redistribute it and/or modify
+This program is free software; you can redistribute it and/or modify
-    it under the terms of the GNU General Public License as published by
+it under the terms of the GNU General Public License as published by
-    the Free Software Foundation; either version 2 of the License, or
+the Free Software Foundation; either version 2 of the License, or
-    (at your option) any later version.
+(at your option) any later version.
-    This program is distributed in the hope that it will be useful,
+This program is distributed in the hope that it will be useful,
-    but WITHOUT ANY WARRANTY; without even the implied warranty of
+but WITHOUT ANY WARRANTY; without even the implied warranty of
-    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
-    GNU General Public License for more details.
+GNU General Public License for more details.
-    You should have received a copy of the GNU General Public License along
+You should have received a copy of the GNU General Public License along
-    with this program; if not, write to the Free Software Foundation, Inc.,
+with this program; if not, write to the Free Software Foundation, Inc.,
-    51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
+51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
-    See the full license in the file "LICENSE" in the top level distribution directory
+See the full license in the file "LICENSE" in the top level distribution
-    *************************************************************************************/
+directory
-    /*  END LEGAL */
+*************************************************************************************/
 /*  END LEGAL */
 //--------------------------------------------------------------------
 /*! @file Integrator.h
 * @brief Classes for the Molecular Dynamics integrator
@@ -40,208 +41,278 @@ Author: paboyle <paboyle@ph.ed.ac.uk>
 #ifndef INTEGRATOR_INCLUDED
 #define INTEGRATOR_INCLUDED
-//class Observer;
+// class Observer;
 #include <memory>
- namespace Grid{
+namespace Grid {
- 	namespace QCD{
+namespace QCD {
- 		struct IntegratorParameters{
+struct IntegratorParameters {
  int Nexp;
  int MDsteps;  // number of outer steps
  RealD trajL;  // trajectory length
  RealD stepsize;
- 			int Nexp;
+  IntegratorParameters(int MDsteps_, RealD trajL_ = 1.0, int Nexp_ = 12)
-      int MDsteps;  // number of outer steps
+      : Nexp(Nexp_),
-      RealD trajL;  // trajectory length 
+        MDsteps(MDsteps_),
-      RealD stepsize;
+        trajL(trajL_),
-
+        stepsize(trajL / MDsteps){
-      IntegratorParameters(int MDsteps_, 
+            // empty body constructor
-      	RealD trajL_=1.0,
+        };
      	int Nexp_=12):
      Nexp(Nexp_),
      MDsteps(MDsteps_),
      trajL(trajL_),
      stepsize(trajL/MDsteps)
      {
 	  // empty body constructor
      };
  };
    /*! @brief Class for Molecular Dynamics management */   
    template<class GaugeField, class SmearingPolicy>
  class Integrator {
  protected:
  	typedef IntegratorParameters ParameterType;
  	IntegratorParameters Params;
  	const ActionSet<GaugeField> as;
      int levels;              //
      double t_U;              // Track time passing on each level and for U and for P
      std::vector<double> t_P; //
      GaugeField P;
      SmearingPolicy &Smearer;
      // Should match any legal (SU(n)) gauge field
      // Need to use this template to match Ncol to pass to SU<N> class
      template<int Ncol,class vec> void generate_momenta(Lattice< iVector< iScalar< iMatrix<vec,Ncol> >, Nd> > & P,GridParallelRNG& pRNG){
      typedef Lattice< iScalar< iScalar< iMatrix<vec,Ncol> > > > GaugeLinkField;
      GaugeLinkField Pmu(P._grid);
      Pmu = zero;
      for(int mu=0;mu<Nd;mu++){
      	SU<Ncol>::GaussianLieAlgebraMatrix(pRNG, Pmu);
      	PokeIndex<LorentzIndex>(P, Pmu, mu);
      }
  }
      //ObserverList observers; // not yet
      //      typedef std::vector<Observer*> ObserverList;
      //      void register_observers();
      //      void notify_observers();
  void update_P(GaugeField&U, int level, double ep){
  	t_P[level]+=ep;
  	update_P(P,U,level,ep);
  	std::cout<<GridLogIntegrator<<"["<<level<<"] P " << " dt "<< ep <<" : t_P "<< t_P[level] <<std::endl;
  }
  void update_P(GaugeField &Mom,GaugeField&U, int level,double ep){
  	// input U actually not used... 
  	for(int a=0; a<as[level].actions.size(); ++a){
  		GaugeField force(U._grid);
  		GaugeField& Us = Smearer.get_U(as[level].actions.at(a)->is_smeared);
  		as[level].actions.at(a)->deriv(Us,force); // deriv should NOT include Ta
 	  	std::cout<< GridLogIntegrator << "Smearing (on/off): "<<as[level].actions.at(a)->is_smeared <<std::endl;
 	  	if (as[level].actions.at(a)->is_smeared) Smearer.smeared_force(force);
 	  	force = Ta(force);
 	  	std::cout<< GridLogIntegrator << "Force average: "<< norm2(force)/(U._grid->gSites()) <<std::endl;
 	  	Mom -= force*ep;
 	  }
 	}
 	void update_U(GaugeField&U, double ep){
 		update_U(P,U,ep);
 		t_U+=ep;
 		int fl = levels-1;
 		std::cout<< GridLogIntegrator <<"   "<<"["<<fl<<"] U " << " dt "<< ep <<" : t_U "<< t_U <<std::endl;
 	}
 	void update_U(GaugeField &Mom, GaugeField&U, double ep){
 	//rewrite exponential to deal automatically  with the lorentz index?
 	//	GaugeLinkField Umu(U._grid);
 	//	GaugeLinkField Pmu(U._grid);
 		for (int mu = 0; mu < Nd; mu++){
 			auto Umu=PeekIndex<LorentzIndex>(U, mu);
 			auto Pmu=PeekIndex<LorentzIndex>(Mom, mu);
 			Umu = expMat(Pmu, ep, Params.Nexp)*Umu;
 			ProjectOnGroup(Umu);
 			PokeIndex<LorentzIndex>(U, Umu, mu);
 		}
 	// Update the smeared fields, can be implemented as observer
 		Smearer.set_GaugeField(U);
 	}
 	virtual void step (GaugeField& U,int level, int first,int last)=0;
 public:
 	Integrator(GridBase* grid, 
 		IntegratorParameters Par,
 		ActionSet<GaugeField> & Aset,
 		SmearingPolicy &Sm):
 	Params(Par),
 	as(Aset),
 	P(grid),
 	levels(Aset.size()),
 	Smearer(Sm)
 	{
 		t_P.resize(levels,0.0);
 		t_U=0.0;
 	// initialization of smearer delegated outside of Integrator
 	};
 	virtual ~Integrator(){}
      //Initialization of momenta and actions
 	void refresh(GaugeField& U,GridParallelRNG &pRNG){
 		std::cout<<GridLogIntegrator<< "Integrator refresh\n";
 		generate_momenta(P,pRNG);
 		for(int level=0; level< as.size(); ++level){
 			for(int actionID=0; actionID<as[level].actions.size(); ++actionID){
 	    // get gauge field from the SmearingPolicy and
 	    // based on the boolean is_smeared in actionID
 				GaugeField& Us = Smearer.get_U(as[level].actions.at(actionID)->is_smeared);
 				as[level].actions.at(actionID)->refresh(Us, pRNG);
 			}
 		}
 	}
      // Calculate action
 	RealD S(GaugeField& U){// here also U not used
 		LatticeComplex Hloc(U._grid);	Hloc = zero;
 	// Momenta
 		for (int mu=0; mu <Nd; mu++){
 			auto Pmu = PeekIndex<LorentzIndex>(P, mu);
 			Hloc -= trace(Pmu*Pmu);
 		}
 		Complex Hsum = sum(Hloc);
 		RealD H = Hsum.real();
 		RealD Hterm;
 		std::cout<<GridLogMessage << "Momentum action H_p = "<< H << "\n";
 	// Actions
 		for(int level=0; level<as.size(); ++level){
 			for(int actionID=0; actionID<as[level].actions.size(); ++actionID){
 	    // get gauge field from the SmearingPolicy and
 	    // based on the boolean is_smeared in actionID
 				GaugeField& Us = Smearer.get_U(as[level].actions.at(actionID)->is_smeared);
 				Hterm = as[level].actions.at(actionID)->S(Us);
 				std::cout<<GridLogMessage << "S Level "<<level<<" term "<<actionID<<" H = "<<Hterm<<std::endl;
 				H += Hterm;
 			}
 		}
 		return H;
 	}
 	void integrate(GaugeField& U){
 	// reset the clocks
 		t_U=0;
 		for(int level=0; level<as.size(); ++level){
 			t_P[level]=0;
 		}	
 	for(int step=0; step< Params.MDsteps; ++step){   // MD step
 		int first_step = (step==0);
 		int  last_step = (step==Params.MDsteps-1);
 		this->step(U,0,first_step,last_step);
 	}
 	// Check the clocks all match on all levels
 	for(int level=0; level<as.size(); ++level){
 	  assert(fabs(t_U - t_P[level])<1.0e-6); // must be the same
 	  std::cout<<GridLogIntegrator<<" times["<<level<<"]= "<<t_P[level]<< " " << t_U <<std::endl;
 	}	
 	// and that we indeed got to the end of the trajectory
 	assert(fabs(t_U-Params.trajL) < 1.0e-6);
 }
 };
 /*! @brief Class for Molecular Dynamics management */
 template <class GaugeField, class SmearingPolicy, class RepresentationPolicy>
 class Integrator {
 protected:
  typedef IntegratorParameters ParameterType;
  IntegratorParameters Params;
  const ActionSet<GaugeField, RepresentationPolicy> as;
  int levels;  //
  double t_U;  // Track time passing on each level and for U and for P
  std::vector<double> t_P;  //
  GaugeField P;
  SmearingPolicy& Smearer;
  RepresentationPolicy Representations;
  // Should match any legal (SU(n)) gauge field
  // Need to use this template to match Ncol to pass to SU<N> class
  template <int Ncol, class vec>
  void generate_momenta(Lattice<iVector<iScalar<iMatrix<vec, Ncol> >, Nd> >& P,
                        GridParallelRNG& pRNG) {
    typedef Lattice<iScalar<iScalar<iMatrix<vec, Ncol> > > > GaugeLinkField;
    GaugeLinkField Pmu(P._grid);
    Pmu = zero;
    for (int mu = 0; mu < Nd; mu++) {
      SU<Ncol>::GaussianFundamentalLieAlgebraMatrix(pRNG, Pmu);
      PokeIndex<LorentzIndex>(P, Pmu, mu);
    }
  }
  // ObserverList observers; // not yet
  //      typedef std::vector<Observer*> ObserverList;
  //      void register_observers();
  //      void notify_observers();
  void update_P(GaugeField& U, int level, double ep) {
    t_P[level] += ep;
    update_P(P, U, level, ep);
    std::cout << GridLogIntegrator << "[" << level << "] P "
              << " dt " << ep << " : t_P " << t_P[level] << std::endl;
  }
  // to be used by the actionlevel class to iterate
  // over the representations
  struct _updateP {
    template <class FieldType, class GF, class Repr>
    void operator()(std::vector<Action<FieldType>*> repr_set, Repr& Rep,
                    GF& Mom, GF& U, double ep) {
      for (int a = 0; a < repr_set.size(); ++a) {
        FieldType forceR(U._grid);
        // Implement smearing only for the fundamental representation now
        repr_set.at(a)->deriv(Rep.U, forceR);
        GF force =
            Rep.RtoFundamentalProject(forceR);  // Ta for the fundamental rep
        std::cout << GridLogIntegrator << "Hirep Force average: "
                  << norm2(force) / (U._grid->gSites()) << std::endl;
        Mom -= force * ep ;
      }
    }
  } update_P_hireps{};
  void update_P(GaugeField& Mom, GaugeField& U, int level, double ep) {
    // input U actually not used in the fundamental case
    // Fundamental updates, include smearing
    for (int a = 0; a < as[level].actions.size(); ++a) {
      GaugeField force(U._grid);
      GaugeField& Us = Smearer.get_U(as[level].actions.at(a)->is_smeared);
      as[level].actions.at(a)->deriv(Us, force);  // deriv should NOT include Ta
      std::cout << GridLogIntegrator
                << "Smearing (on/off): " << as[level].actions.at(a)->is_smeared
                << std::endl;
      if (as[level].actions.at(a)->is_smeared) Smearer.smeared_force(force);
      force = Ta(force);
      std::cout << GridLogIntegrator
                << "Force average: " << norm2(force) / (U._grid->gSites())
                << std::endl;
      Mom -= force * ep;
    }
    // Force from the other representations
    as[level].apply(update_P_hireps, Representations, Mom, U, ep);
  }
  void update_U(GaugeField& U, double ep) {
    update_U(P, U, ep);
    t_U += ep;
    int fl = levels - 1;
    std::cout << GridLogIntegrator << "   "
              << "[" << fl << "] U "
              << " dt " << ep << " : t_U " << t_U << std::endl;
  }
  void update_U(GaugeField& Mom, GaugeField& U, double ep) {
    // rewrite exponential to deal internally with the lorentz index?
    for (int mu = 0; mu < Nd; mu++) {
      auto Umu = PeekIndex<LorentzIndex>(U, mu);
      auto Pmu = PeekIndex<LorentzIndex>(Mom, mu);
      Umu = expMat(Pmu, ep, Params.Nexp) * Umu;
      PokeIndex<LorentzIndex>(U, ProjectOnGroup(Umu), mu);
    }
    // Update the smeared fields, can be implemented as observer
    Smearer.set_GaugeField(U);
    // Update the higher representations fields
    Representations.update(U);  // no-op for the fundamental representation
  }
  virtual void step(GaugeField& U, int level, int first, int last) = 0;
 public:
  Integrator(GridBase* grid, IntegratorParameters Par,
             ActionSet<GaugeField, RepresentationPolicy>& Aset,
             SmearingPolicy& Sm)
      : Params(Par),
        as(Aset),
        P(grid),
        levels(Aset.size()),
        Smearer(Sm),
        Representations(grid) {
    t_P.resize(levels, 0.0);
    t_U = 0.0;
    // initialization of smearer delegated outside of Integrator
  };
  virtual ~Integrator() {}
  // to be used by the actionlevel class to iterate
  // over the representations
  struct _refresh {
    template <class FieldType, class Repr>
    void operator()(std::vector<Action<FieldType>*> repr_set, Repr& Rep,
                    GridParallelRNG& pRNG) {
      for (int a = 0; a < repr_set.size(); ++a) {
        repr_set.at(a)->refresh(Rep.U, pRNG);
        std::cout << GridLogDebug << "Hirep refreshing pseudofermions" << std::endl;
      }
    }
  } refresh_hireps{};
  // Initialization of momenta and actions
  void refresh(GaugeField& U, GridParallelRNG& pRNG) {
    std::cout << GridLogIntegrator << "Integrator refresh\n";
    generate_momenta(P, pRNG);
    // Update the smeared fields (could be implemented as an observer).
    // Needed to keep the fields up to date even after a Metropolis rejection.
    Smearer.set_GaugeField(U);
    // Update the gauge fields of the higher representations, if any
    Representations.update(U);
    // The Smearer holds a pointer to the gauge field, so it automatically
    // gets the correct field whether or not the update was accepted in the
    // previous sweep
    for (int level = 0; level < as.size(); ++level) {
      for (int actionID = 0; actionID < as[level].actions.size(); ++actionID) {
        // get gauge field from the SmearingPolicy and
        // based on the boolean is_smeared in actionID
        GaugeField& Us =
            Smearer.get_U(as[level].actions.at(actionID)->is_smeared);
        as[level].actions.at(actionID)->refresh(Us, pRNG);
      }
      // Refresh the higher representation actions
      as[level].apply(refresh_hireps, Representations, pRNG);
    }
  }
  // to be used by the actionlevel class to iterate
  // over the representations
  struct _S {
    template <class FieldType, class Repr>
    void operator()(std::vector<Action<FieldType>*> repr_set, Repr& Rep,
                    int level, RealD& H) {
      for (int a = 0; a < repr_set.size(); ++a) {
        RealD Hterm = repr_set.at(a)->S(Rep.U);
        std::cout << GridLogMessage << "S Level " << level << " term " << a
                  << " H Hirep = " << Hterm << std::endl;
        H += Hterm;
      }
    }
  } S_hireps{};
  // Calculate action
  RealD S(GaugeField& U) {  // here also U not used
    LatticeComplex Hloc(U._grid);
    Hloc = zero;
    // Momenta
    for (int mu = 0; mu < Nd; mu++) {
      auto Pmu = PeekIndex<LorentzIndex>(P, mu);
      Hloc -= trace(Pmu * Pmu);
    }
    Complex Hsum = sum(Hloc);
    RealD H = Hsum.real();
    RealD Hterm;
    std::cout << GridLogMessage << "Momentum action H_p = " << H << "\n";
    // Actions
    for (int level = 0; level < as.size(); ++level) {
      for (int actionID = 0; actionID < as[level].actions.size(); ++actionID) {
        // get gauge field from the SmearingPolicy and
        // based on the boolean is_smeared in actionID
        GaugeField& Us =
            Smearer.get_U(as[level].actions.at(actionID)->is_smeared);
        Hterm = as[level].actions.at(actionID)->S(Us);
        std::cout << GridLogMessage << "S Level " << level << " term "
                  << actionID << " H = " << Hterm << std::endl;
        H += Hterm;
      }
      as[level].apply(S_hireps, Representations, level, H);
    }
    return H;
  }
  void integrate(GaugeField& U) {
    // reset the clocks
    t_U = 0;
    for (int level = 0; level < as.size(); ++level) {
      t_P[level] = 0;
    }
    for (int step = 0; step < Params.MDsteps; ++step) {  // MD step
      int first_step = (step == 0);
      int last_step = (step == Params.MDsteps - 1);
      this->step(U, 0, first_step, last_step);
    }
    // Check the clocks all match on all levels
    for (int level = 0; level < as.size(); ++level) {
      assert(fabs(t_U - t_P[level]) < 1.0e-6);  // must be the same
      std::cout << GridLogIntegrator << " times[" << level
                << "]= " << t_P[level] << " " << t_U << std::endl;
    }
    // and that we indeed got to the end of the trajectory
    assert(fabs(t_U - Params.trajL) < 1.0e-6);
  }
 };
 }
 }
-#endif//INTEGRATOR_INCLUDED
+#endif  // INTEGRATOR_INCLUDED
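The `integrate` method above resets the clocks `t_U` and `t_P`, runs `MDsteps` steps, and asserts that both clocks agree and reach `trajL`. The sketch below reproduces that bookkeeping on a toy one-dimensional harmonic oscillator with a leapfrog step. `ToyLeapFrog` and `energy` are hypothetical names for this illustration only; the real integrators act on lattice fields and on several action levels.

```cpp
#include <cassert>
#include <cmath>

// Toy model, not Grid code: H = p^2/2 + q^2/2, so the force is dS/dq = q.
struct ToyLeapFrog {
  int MDsteps;      // number of outer steps
  double trajL;     // trajectory length
  double stepsize;  // trajL / MDsteps, as in IntegratorParameters
  double t_U = 0.0; // time accumulated by the coordinate updates
  double t_P = 0.0; // time accumulated by the momentum updates

  ToyLeapFrog(int steps, double length)
      : MDsteps(steps), trajL(length), stepsize(length / steps) {}

  void update_P(double& p, double q, double ep) { p -= q * ep; t_P += ep; }
  void update_U(double p, double& q, double ep) { q += p * ep; t_U += ep; }

  void integrate(double& q, double& p) {
    t_U = 0.0;  // reset the clocks
    t_P = 0.0;
    for (int step = 0; step < MDsteps; ++step) {
      bool first = (step == 0);
      bool last = (step == MDsteps - 1);
      if (first) update_P(p, q, stepsize / 2.0);        // opening half step
      update_U(p, q, stepsize);                         // full coordinate step
      update_P(p, q, last ? stepsize / 2.0 : stepsize); // merged half steps
    }
    assert(std::fabs(t_U - t_P) < 1.0e-6);    // clocks must match
    assert(std::fabs(t_U - trajL) < 1.0e-6);  // and reach the trajectory end
  }
};

double energy(double q, double p) { return 0.5 * (p * p + q * q); }
```

With 20 steps over a unit trajectory, the energy drift stays small, which is the quantity the Metropolis test at the end of an HMC trajectory depends on.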
--- a/lib/qcd/hmc/integrators/Integrator_algorithm.h
+++ b/lib/qcd/hmc/integrators/Integrator_algorithm.h
@@ -91,17 +91,19 @@ namespace Grid{
    *  P 1/2                            P 1/2
    */    
-    template<class GaugeField, class SmearingPolicy> class LeapFrog :
+    template<class GaugeField,
-      public Integrator<GaugeField, SmearingPolicy> {
+	     class SmearingPolicy,
 	     class RepresentationPolicy = Representations< FundamentalRepresentation > > class LeapFrog :
      public Integrator<GaugeField, SmearingPolicy, RepresentationPolicy> {
    public:
-      typedef LeapFrog<GaugeField, SmearingPolicy> Algorithm;
+      typedef LeapFrog<GaugeField, SmearingPolicy, RepresentationPolicy> Algorithm;
      LeapFrog(GridBase* grid, 
 	       IntegratorParameters Par,
-	       ActionSet<GaugeField> & Aset,
+	       ActionSet<GaugeField, RepresentationPolicy> & Aset,
 	       SmearingPolicy & Sm):
-	Integrator<GaugeField, SmearingPolicy>(grid,Par,Aset,Sm) {};
+	Integrator<GaugeField, SmearingPolicy, RepresentationPolicy>(grid,Par,Aset,Sm) {};
      void step (GaugeField& U, int level,int _first, int _last){
@@ -138,8 +140,10 @@ namespace Grid{
      }
    };
-    template<class GaugeField, class SmearingPolicy> class MinimumNorm2 :
+    template<class GaugeField,
-      public Integrator<GaugeField, SmearingPolicy> {
+	     class SmearingPolicy,
 	     class RepresentationPolicy = Representations < FundamentalRepresentation > > class MinimumNorm2 :
      public Integrator<GaugeField, SmearingPolicy, RepresentationPolicy> {
    private:
      const RealD lambda = 0.1931833275037836;
@@ -147,9 +151,9 @@ namespace Grid{
      MinimumNorm2(GridBase* grid, 
 		   IntegratorParameters Par,
-		   ActionSet<GaugeField> & Aset,
+		   ActionSet<GaugeField, RepresentationPolicy> & Aset,
 		   SmearingPolicy& Sm):
-	Integrator<GaugeField, SmearingPolicy>(grid,Par,Aset,Sm) {};
+	Integrator<GaugeField, SmearingPolicy, RepresentationPolicy>(grid,Par,Aset,Sm) {};
      void step (GaugeField& U, int level, int _first,int _last){
@@ -197,8 +201,10 @@ namespace Grid{
    };
-    template<class GaugeField, class SmearingPolicy> class ForceGradient :
+    template<class GaugeField,
-      public Integrator<GaugeField, SmearingPolicy> {
+	     class SmearingPolicy,
 	     class RepresentationPolicy = Representations< FundamentalRepresentation > > class ForceGradient :
      public Integrator<GaugeField, SmearingPolicy, RepresentationPolicy> {
    private:
      const RealD lambda = 1.0/6.0;;
      const RealD chi    = 1.0/72.0;
@@ -209,9 +215,9 @@ namespace Grid{
      // Looks like dH scales as dt^4. tested wilson/wilson 2 level.
    ForceGradient(GridBase* grid, 
 		  IntegratorParameters Par,
-		  ActionSet<GaugeField> & Aset,
+		  ActionSet<GaugeField, RepresentationPolicy> & Aset,
 		  SmearingPolicy &Sm):
-      Integrator<GaugeField, SmearingPolicy>(grid,Par,Aset, Sm) {};
+      Integrator<GaugeField, SmearingPolicy, RepresentationPolicy>(grid,Par,Aset, Sm) {};
      void FG_update_P(GaugeField&U, int level,double fg_dt,double ep){
--- a/lib/qcd/representations/adjoint.h
+++ b/lib/qcd/representations/adjoint.h
@@ -0,0 +1,115 @@
 /*
 *  Policy classes for the HMC
 *  Author: Guido Cossu
 */
 #ifndef ADJOINT_H
 #define ADJOINT_H
 namespace Grid {
 namespace QCD {
 /*
 * This is a helper class for the HMC
 * Should contain only the data for the adjoint representation
 * and the facility to convert from the fundamental -> adjoint
 */
 template <int ncolour>
 class AdjointRep {
 public:
  // typedef to be used by the Representations class in HMC to get the
  // types for the higher representation fields
  typedef typename SU_Adjoint<ncolour>::LatticeAdjMatrix LatticeMatrix;
  typedef typename SU_Adjoint<ncolour>::LatticeAdjField LatticeField;
  static const int Dimension = ncolour * ncolour - 1;
  LatticeField U;
  explicit AdjointRep(GridBase *grid) : U(grid) {}
  void update_representation(const LatticeGaugeField &Uin) {
    std::cout << GridLogDebug << "Updating adjoint representation\n";
    // Uin is in the fundamental representation
    // get the U in AdjointRep
    // (U_adj)_ab = tr[e^a U e^b U^dag]
    // e^a = t^a/sqrt(T_F)
    // where t^a is the generator in the fundamental
    // T_F is 1/2 for the fundamental representation
    conformable(U, Uin);
    U = zero;
    LatticeColourMatrix tmp(Uin._grid);
    Vector<typename SU<ncolour>::Matrix> ta(Dimension);
    // Debug lines
    // LatticeMatrix uno(Uin._grid);
    // uno = 1.0;
    ////////////////
    // FIXME probably not very efficient to get all the generators
    // every time
    for (int a = 0; a < Dimension; a++) SU<ncolour>::generator(a, ta[a]);
    for (int mu = 0; mu < Nd; mu++) {
      auto Uin_mu = peekLorentz(Uin, mu);
      auto U_mu = peekLorentz(U, mu);
      for (int a = 0; a < Dimension; a++) {
        tmp = 2.0 * adj(Uin_mu) * ta[a] * Uin_mu;
        for (int b = 0; b < Dimension; b++)
          pokeColour(U_mu, trace(tmp * ta[b]), a, b);
      }
      pokeLorentz(U, U_mu, mu);
      // Check matrix U_mu, must be real orthogonal
      // reality
      /*
      LatticeMatrix Ucheck = U_mu - conjugate(U_mu);
      std::cout << GridLogMessage << "Reality check: " << norm2(Ucheck) <<
      std::endl;
      Ucheck = U_mu * adj(U_mu) - uno;
      std::cout << GridLogMessage << "orthogonality check: " << norm2(Ucheck) <<
      std::endl;
      */
    }
  }
  LatticeGaugeField RtoFundamentalProject(const LatticeField &in,
                                          Real scale = 1.0) const {
    LatticeGaugeField out(in._grid);
    out = zero;
    for (int mu = 0; mu < Nd; mu++) {
      LatticeColourMatrix out_mu(in._grid);  // fundamental representation
      LatticeMatrix in_mu = peekLorentz(in, mu);
      out_mu = zero;
      typename SU<ncolour>::LatticeAlgebraVector h(in._grid);
      projectOnAlgebra(h, in_mu, double(Nc) * 2.0);  // factor C(r)/C(fund)
      FundamentalLieAlgebraMatrix(h, out_mu);   // apply scale only once
      pokeLorentz(out, out_mu, mu);
      // Returns traceless antihermitian matrix Nc * Nc.
      // Confirmed
    }
    return out;
  }
 private:
  void projectOnAlgebra(typename SU<ncolour>::LatticeAlgebraVector &h_out,
                        const LatticeMatrix &in, Real scale = 1.0) const {
    SU_Adjoint<ncolour>::projectOnAlgebra(h_out, in, scale);
  }
  void FundamentalLieAlgebraMatrix(
      typename SU<ncolour>::LatticeAlgebraVector &h,
      typename SU<ncolour>::LatticeMatrix &out, Real scale = 1.0) const {
    SU<ncolour>::FundamentalLieAlgebraMatrix(h, out, scale);
  }
 };
 typedef AdjointRep<Nc> AdjointRepresentation;
 }
 }
 #endif
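The commented-out checks in `update_representation` state that each adjoint link must be real and orthogonal. The sketch below verifies this for a single SU(2) matrix using the same expression as the loop above, R_ab = 2 tr(U^dag t^a U t^b), with t^a = sigma^a / 2. It uses plain `std::complex` arrays rather than Grid types, and every name in it (`Mat2`, `su2Element`, `adjointFromFundamental`, `orthogonalityDefect`) is hypothetical.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cmath>
#include <complex>

using Cplx = std::complex<double>;
using Mat2 = std::array<std::array<Cplx, 2>, 2>;
using Mat3 = std::array<std::array<double, 3>, 3>;

Mat2 mul(const Mat2& a, const Mat2& b) {
  Mat2 c{};  // value-initialized to zero
  for (int i = 0; i < 2; i++)
    for (int j = 0; j < 2; j++)
      for (int k = 0; k < 2; k++) c[i][j] += a[i][k] * b[k][j];
  return c;
}

Mat2 dagger(const Mat2& a) {
  Mat2 c{};
  for (int i = 0; i < 2; i++)
    for (int j = 0; j < 2; j++) c[i][j] = std::conj(a[j][i]);
  return c;
}

Cplx trace(const Mat2& a) { return a[0][0] + a[1][1]; }

// SU(2) generators t^a = sigma^a / 2, normalized so tr(t^a t^b) = delta_ab / 2.
std::array<Mat2, 3> generators() {
  const Cplx I(0.0, 1.0);
  std::array<Mat2, 3> t{};
  t[0][0][1] = 0.5;       t[0][1][0] = 0.5;
  t[1][0][1] = -0.5 * I;  t[1][1][0] = 0.5 * I;
  t[2][0][0] = 0.5;       t[2][1][1] = -0.5;
  return t;
}

// U = cos(theta/2) + i sin(theta/2) n.sigma, with (nx, ny, nz) a unit vector.
Mat2 su2Element(double theta, double nx, double ny, double nz) {
  assert(std::fabs(nx * nx + ny * ny + nz * nz - 1.0) < 1e-12);
  double c = std::cos(theta / 2.0), s = std::sin(theta / 2.0);
  Mat2 U{};
  U[0][0] = Cplx(c, s * nz);        U[0][1] = Cplx(s * ny, s * nx);
  U[1][0] = Cplx(-s * ny, s * nx);  U[1][1] = Cplx(c, -s * nz);
  return U;
}

// R_ab = 2 tr(U^dag t^a U t^b); maxImag reports the largest discarded imaginary part.
Mat3 adjointFromFundamental(const Mat2& U, double& maxImag) {
  const std::array<Mat2, 3> t = generators();
  Mat3 R{};
  maxImag = 0.0;
  for (int a = 0; a < 3; a++) {
    Mat2 tmp = mul(mul(dagger(U), t[a]), U);
    for (int b = 0; b < 3; b++) {
      Cplx v = 2.0 * trace(mul(tmp, t[b]));
      R[a][b] = v.real();
      maxImag = std::max(maxImag, std::fabs(v.imag()));
    }
  }
  return R;
}

// Largest entry of |R R^T - 1|; zero for an orthogonal matrix.
double orthogonalityDefect(const Mat3& R) {
  double worst = 0.0;
  for (int a = 0; a < 3; a++)
    for (int b = 0; b < 3; b++) {
      double s = 0.0;
      for (int c = 0; c < 3; c++) s += R[a][c] * R[b][c];
      worst = std::max(worst, std::fabs(s - (a == b ? 1.0 : 0.0)));
    }
  return worst;
}
```

Both `maxImag` and the orthogonality defect should vanish up to rounding for any SU(2) input; a nonzero value would point at a wrong generator normalization or a missing factor of 2.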
--- a/lib/qcd/representations/fundamental.h
+++ b/lib/qcd/representations/fundamental.h
@@ -0,0 +1,45 @@
 /*
 *	Policy classes for the HMC
 *	Author: Guido Cossu
 */	
 #ifndef FUNDAMENTAL_H
 #define FUNDAMENTAL_H
 namespace Grid {
 namespace QCD {
 /*
 * This is a helper class for the HMC
 * Empty, since the HMC already updates the fundamental representation
 */
 template <int ncolour>
 class FundamentalRep {
 public:
  static const int Dimension = ncolour;
  // typedef to be used by the Representations class in HMC to get the
  // types for the higher representation fields
  typedef typename SU<ncolour>::LatticeMatrix LatticeMatrix;
  typedef LatticeGaugeField LatticeField;
  explicit FundamentalRep(GridBase* grid) {} //do nothing
  void update_representation(const LatticeGaugeField& Uin) {} // do nothing
  LatticeField RtoFundamentalProject(const LatticeField& in, Real scale = 1.0) const{
    return (scale * in);
  }
 };
 typedef	 FundamentalRep<Nc> FundamentalRepresentation;
 }
 }
 #endif
--- a/lib/qcd/representations/hmc_types.h
+++ b/lib/qcd/representations/hmc_types.h
@@ -0,0 +1,91 @@
 #ifndef HMC_TYPES_H
 #define HMC_TYPES_H
 #include <Grid/qcd/representations/adjoint.h>
 #include <Grid/qcd/representations/two_index.h>
 #include <Grid/qcd/representations/fundamental.h>
 #include <tuple>
 #include <utility>
 namespace Grid {
 namespace QCD {
 // Supported types
 // enum {Fundamental, Adjoint} repr_type;
 // Utility to add support to the HMC for representations other than the
 // fundamental
 template <class... Reptypes>
 class Representations {
 public:
  typedef std::tuple<Reptypes...> Representation_type;
  // Size of the tuple, known at compile time
  static const int tuple_size = sizeof...(Reptypes);
  // The collection of types for the gauge fields
  typedef std::tuple<typename Reptypes::LatticeField...> Representation_Fields;
  // To access the Reptypes (FundamentalRepresentation, AdjointRepresentation)
  template <std::size_t N>
  using repr_type = typename std::tuple_element<N, Representation_type>::type;
  // to name the field type of the I-th representation use
  //   typename repr_type<I>::LatticeField
  Representation_type rep;
  // Multiple types constructor
  explicit Representations(GridBase* grid) : rep(Reptypes(grid)...){};
  int size() { return tuple_size; }
  // update the fields
  template <std::size_t I = 0>
  inline typename std::enable_if<(I == tuple_size), void>::type update(
      LatticeGaugeField& U) {}
  template <std::size_t I = 0>
  inline typename std::enable_if<(I < tuple_size), void>::type update(
      LatticeGaugeField& U) {
    std::get<I>(rep).update_representation(U);
    update<I + 1>(U);
  }
 };
 typedef Representations<FundamentalRepresentation> NoHirep;
 // Helper class to access the elements.
 // The recursion peels N down to 0 while prepending N - 1 to the pack S,
 // so the terminating specialisation sees S = 0, 1, ..., tuple_size - 1.
 // It then defines a tuple of vectors of pointers to A<field type>, one
 // vector per representation.
 template <template <typename> class A, class TupleClass,
          size_t N = TupleClass::tuple_size, size_t... S>
 struct AccessTypes : AccessTypes<A, TupleClass, N - 1, N - 1, S...> {};
 template <template <typename> class A, class TupleClass, size_t... S>
 struct AccessTypes<A, TupleClass, 0, S...> {
 public:
  typedef typename TupleClass::Representation_Fields Rfields;
  template <std::size_t N>
  using elem = typename std::tuple_element<N, Rfields>::type;  // fields types
  typedef std::tuple<std::vector< A< elem<S> >* > ... > VectorCollection;
  typedef std::tuple< elem<S> ... > FieldTypeCollection;
  // Debug
  void return_size() {
    std::cout << GridLogMessage
              << "Access:" << std::tuple_size<std::tuple<elem<S>...> >::value
              << "\n";
    std::cout << GridLogMessage
              << "Access vectors:" << std::tuple_size<VectorCollection>::value
              << "\n";
  }
 };
 }
 }
 #endif
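The two `update` overloads in `Representations` walk the tuple at compile time: `std::enable_if` selects the recursive overload while `I < tuple_size` and the empty one when `I == tuple_size`. A minimal standalone sketch of the same pattern follows. `RepA`, `RepB`, `RepSet` and `demo_update` are hypothetical stand-ins introduced only for illustration, and an `int` replaces the gauge field; none of them are Grid types.

```cpp
#include <cassert>
#include <cstddef>
#include <tuple>
#include <type_traits>

// Hypothetical stand-ins for representation classes (not Grid types).
struct RepA { int last = 0; void update_representation(int u) { last = u; } };
struct RepB { int last = 0; void update_representation(int u) { last = 2 * u; } };

template <class... Reps>
struct RepSet {
  static constexpr std::size_t tuple_size = sizeof...(Reps);
  std::tuple<Reps...> rep;

  // Terminating overload: selected once I has run past the last element.
  template <std::size_t I = 0>
  typename std::enable_if<(I == sizeof...(Reps)), void>::type update(int) {}

  // Recursive overload: update element I, then recurse on I + 1.
  template <std::size_t I = 0>
  typename std::enable_if<(I < sizeof...(Reps)), void>::type update(int u) {
    std::get<I>(rep).update_representation(u);
    update<I + 1>(u);
  }
};

// Example driver: one call to update() reaches every element of the tuple.
inline int demo_update(int u) {
  RepSet<RepA, RepB> reps;
  reps.update(u);
  assert(std::get<0>(reps.rep).last == u);
  return std::get<1>(reps.rep).last;
}
```

Calling `demo_update(21)` updates both stand-ins through a single `update` call; the recursion is fully unrolled by the compiler, so no runtime loop over the tuple exists.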
--- a/lib/qcd/representations/two_index.h
+++ b/lib/qcd/representations/two_index.h
@@ -0,0 +1,99 @@
 /*
 *  Policy classes for the HMC
 *  Authors: Guido Cossu, David Preti
 */
 #ifndef SUN2INDEX_H_H
 #define SUN2INDEX_H_H
 namespace Grid {
 namespace QCD {
 /*
 * This is a helper class for the HMC.
 * It should contain only the data for the two-index representations
 * and the facility to convert from the fundamental to the two-index
 * representation.
 * The template parameter TwoIndexSymmetry chooses between the
 * symmetric and antisymmetric representations.
 *
 * The enum
 *   enum TwoIndexSymmetry { Symmetric = 1, AntiSymmetric = -1 };
 * is defined in SUnTwoIndex.h.
 */
 template <int ncolour, TwoIndexSymmetry S>
 class TwoIndexRep {
 public:
  // typedefs used by the Representations class in the HMC to get the
  // types for the higher-representation fields
  typedef typename SU_TwoIndex<ncolour, S>::LatticeTwoIndexMatrix LatticeMatrix;
  typedef typename SU_TwoIndex<ncolour, S>::LatticeTwoIndexField LatticeField;
  static const int Dimension = ncolour * (ncolour + S) / 2;
  LatticeField U;
  explicit TwoIndexRep(GridBase *grid) : U(grid) {}
  void update_representation(const LatticeGaugeField &Uin) {
    std::cout << GridLogDebug << "Updating TwoIndex representation\n";
    // Uin is in the fundamental representation
    // get the U in TwoIndexRep
    // (U)_{(ij)(lk)} = tr [ adj(e^(ij)) U e^(lk) transpose(U) ]
    conformable(U, Uin);
    U = zero;
    LatticeColourMatrix tmp(Uin._grid);
    Vector<typename SU<ncolour>::Matrix> eij(Dimension);
    for (int a = 0; a < Dimension; a++)
      SU_TwoIndex<ncolour, S>::base(a, eij[a]);
    for (int mu = 0; mu < Nd; mu++) {
      auto Uin_mu = peekLorentz(Uin, mu);
      auto U_mu = peekLorentz(U, mu);
      for (int a = 0; a < Dimension; a++) {
        tmp = transpose(Uin_mu) * adj(eij[a]) * Uin_mu;
        for (int b = 0; b < Dimension; b++)
          pokeColour(U_mu, trace(tmp * eij[b]), a, b);
      }
      pokeLorentz(U, U_mu, mu);
    }
  }
  LatticeGaugeField RtoFundamentalProject(const LatticeField &in,
                                          Real scale = 1.0) const {
    LatticeGaugeField out(in._grid);
    out = zero;
    for (int mu = 0; mu < Nd; mu++) {
      LatticeColourMatrix out_mu(in._grid);  // fundamental representation
      LatticeMatrix in_mu = peekLorentz(in, mu);
      out_mu = zero;
      typename SU<ncolour>::LatticeAlgebraVector h(in._grid);
      projectOnAlgebra(h, in_mu, double(Nc + 2 * S));  // factor T(r)/T(fund)
      FundamentalLieAlgebraMatrix(h, out_mu);          // apply scale only once
      pokeLorentz(out, out_mu, mu);
    }
    return out;
  }
 private:
  void projectOnAlgebra(typename SU<ncolour>::LatticeAlgebraVector &h_out,
                        const LatticeMatrix &in, Real scale = 1.0) const {
    SU_TwoIndex<ncolour, S>::projectOnAlgebra(h_out, in, scale);
  }
  void FundamentalLieAlgebraMatrix(
      typename SU<ncolour>::LatticeAlgebraVector &h,
      typename SU<ncolour>::LatticeMatrix &out, Real scale = 1.0) const {
    SU<ncolour>::FundamentalLieAlgebraMatrix(h, out, scale);
  }
 };
 typedef TwoIndexRep<Nc, Symmetric> TwoIndexSymmetricRepresentation;
 typedef TwoIndexRep<Nc, AntiSymmetric> TwoIndexAntiSymmetricRepresentation;
 }
 }
 #endif
--- a/lib/qcd/smearing/GaugeConfiguration.h
+++ b/lib/qcd/smearing/GaugeConfiguration.h
@@ -10,6 +10,29 @@ namespace Grid {
 namespace QCD {
  //trivial class for no smearing
  template< class Gimpl >
 class NoSmearing {
 public:
  INHERIT_GIMPL_TYPES(Gimpl);
  GaugeField* ThinLinks;
  NoSmearing(): ThinLinks(NULL) {}
  void set_GaugeField(GaugeField& U) { ThinLinks = &U; }
  void smeared_force(GaugeField& SigmaTilde) const {}
  GaugeField& get_SmearedU() { return *ThinLinks; }
  GaugeField& get_U(bool smeared = false) {
    return *ThinLinks;
  }
 };
 /*!
  @brief Smeared configuration container
@@ -201,6 +224,8 @@ class SmearedConfiguration {
  SmearedConfiguration()
      : smearingLevels(0), StoutSmearing(), SmearedSet(), ThinLinks(NULL) {}
  // attach the smeared routines to the thin links U and fill the smeared set
  void set_GaugeField(GaugeField& U) { fill_smearedSet(U); }
--- a/lib/qcd/smearing/StoutSmearing.h
+++ b/lib/qcd/smearing/StoutSmearing.h
@@ -18,14 +18,12 @@ class Smear_Stout : public Smear<Gimpl> {
  INHERIT_GIMPL_TYPES(Gimpl)
  Smear_Stout(Smear<Gimpl>* base) : SmearBase(base) {
-    static_assert(Nc == 3,
+    assert(Nc == 3);//                  "Stout smearing currently implemented only for Nc==3");
                  "Stout smearing currently implemented only for Nc==3");
  }
  /*! Default constructor */
  Smear_Stout(double rho = 1.0) : SmearBase(new Smear_APE<Gimpl>(rho)) {
-    static_assert(Nc == 3,
+    assert(Nc == 3);//                  "Stout smearing currently implemented only for Nc==3");
                  "Stout smearing currently implemented only for Nc==3");
  }
  ~Smear_Stout() {}  // delete SmearBase...
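The hunk above moves the `Nc == 3` check from compile time to run time, so the header can be compiled for other colour counts and only fails if the class is actually constructed. The sketch below illustrates that trade-off under stated assumptions: `StoutLike` and `can_construct` are hypothetical names, and an exception is used instead of `assert` so the check also survives builds with `NDEBUG` defined. It is not the Grid implementation.

```cpp
#include <stdexcept>

// Hypothetical stand-in for a smearing class templated on the colour count.
template <int Ncolours>
struct StoutLike {
  StoutLike() {
    // Runtime check: the class compiles for every Ncolours, but constructing
    // it for anything other than 3 fails.
    if (Ncolours != 3)
      throw std::invalid_argument("Stout sketch: only Nc == 3 is supported");
  }
};

// Returns true when construction succeeds for the given colour count.
template <int Ncolours>
bool can_construct() {
  try {
    StoutLike<Ncolours> s;
    (void)s;  // silence unused-variable warnings
    return true;
  } catch (const std::invalid_argument &) {
    return false;
  }
}
```

With a `static_assert` in the constructor, merely instantiating `StoutLike<2>` would stop the build; with the runtime check, `can_construct<2>()` compiles and reports the failure when called.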
--- a/lib/qcd/utils/LinalgUtils.h
+++ b/lib/qcd/utils/LinalgUtils.h
@@ -39,8 +39,8 @@ namespace QCD{
 //on the 5d (rb4d) checkerboarded lattices
 ////////////////////////////////////////////////////////////////////////
-template<class vobj> 
+template<class vobj,class Coeff>
-void axpibg5x(Lattice<vobj> &z,const Lattice<vobj> &x,RealD a,RealD b)
+void axpibg5x(Lattice<vobj> &z,const Lattice<vobj> &x,Coeff a,Coeff b)
 {
  z.checkerboard = x.checkerboard;
  conformable(x,z);
@@ -57,8 +57,8 @@ PARALLEL_FOR_LOOP
  }
 }
-template<class vobj> 
+template<class vobj,class Coeff> 
-void axpby_ssp(Lattice<vobj> &z, RealD a,const Lattice<vobj> &x,RealD b,const Lattice<vobj> &y,int s,int sp)
+void axpby_ssp(Lattice<vobj> &z, Coeff a,const Lattice<vobj> &x,Coeff b,const Lattice<vobj> &y,int s,int sp)
 {
  z.checkerboard = x.checkerboard;
  conformable(x,y);
@@ -72,8 +72,8 @@ PARALLEL_FOR_LOOP
  }
 }
-template<class vobj> 
+template<class vobj,class Coeff> 
-void ag5xpby_ssp(Lattice<vobj> &z,RealD a,const Lattice<vobj> &x,RealD b,const Lattice<vobj> &y,int s,int sp)
+void ag5xpby_ssp(Lattice<vobj> &z,Coeff a,const Lattice<vobj> &x,Coeff b,const Lattice<vobj> &y,int s,int sp)
 {
  z.checkerboard = x.checkerboard;
  conformable(x,y);
@@ -90,8 +90,8 @@ PARALLEL_FOR_LOOP
  }
 }
-template<class vobj> 
+template<class vobj,class Coeff> 
-void axpbg5y_ssp(Lattice<vobj> &z,RealD a,const Lattice<vobj> &x,RealD b,const Lattice<vobj> &y,int s,int sp)
+void axpbg5y_ssp(Lattice<vobj> &z,Coeff a,const Lattice<vobj> &x,Coeff b,const Lattice<vobj> &y,int s,int sp)
 {
  z.checkerboard = x.checkerboard;
  conformable(x,y);
@@ -108,8 +108,8 @@ PARALLEL_FOR_LOOP
  }
 }
-template<class vobj> 
+template<class vobj,class Coeff> 
-void ag5xpbg5y_ssp(Lattice<vobj> &z,RealD a,const Lattice<vobj> &x,RealD b,const Lattice<vobj> &y,int s,int sp)
+void ag5xpbg5y_ssp(Lattice<vobj> &z,Coeff a,const Lattice<vobj> &x,Coeff b,const Lattice<vobj> &y,int s,int sp)
 {
  z.checkerboard = x.checkerboard;
  conformable(x,y);
@@ -127,8 +127,8 @@ PARALLEL_FOR_LOOP
  }
 }
-template<class vobj> 
+template<class vobj,class Coeff> 
-void axpby_ssp_pminus(Lattice<vobj> &z,RealD a,const Lattice<vobj> &x,RealD b,const Lattice<vobj> &y,int s,int sp)
+void axpby_ssp_pminus(Lattice<vobj> &z,Coeff a,const Lattice<vobj> &x,Coeff b,const Lattice<vobj> &y,int s,int sp)
 {
  z.checkerboard = x.checkerboard;
  conformable(x,y);
@@ -144,8 +144,8 @@ PARALLEL_FOR_LOOP
  }
 }
-template<class vobj> 
+template<class vobj,class Coeff> 
-void axpby_ssp_pplus(Lattice<vobj> &z,RealD a,const Lattice<vobj> &x,RealD b,const Lattice<vobj> &y,int s,int sp)
+void axpby_ssp_pplus(Lattice<vobj> &z,Coeff a,const Lattice<vobj> &x,Coeff b,const Lattice<vobj> &y,int s,int sp)
 {
  z.checkerboard = x.checkerboard;
  conformable(x,y);
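The hunks above replace the fixed `RealD` coefficients with a template parameter `Coeff`, so the same routine accepts real or complex coefficients. A minimal sketch of the idea on plain `std::vector` data follows; `axpby` here is a hypothetical free function written for illustration, not the Grid routine, and it ignores checkerboarding and the fifth-dimension indices `s`, `sp`.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

// z = a*x + b*y, with the coefficient type deduced at the call site.
// Both coefficients must have the same type, as in the diff.
template <class T, class Coeff>
void axpby(std::vector<T> &z, Coeff a, const std::vector<T> &x, Coeff b,
           const std::vector<T> &y) {
  assert(x.size() == y.size());
  z.resize(x.size());
  for (std::size_t i = 0; i < x.size(); ++i) z[i] = a * x[i] + b * y[i];
}
```

For example, with `std::complex<double>` data, `axpby(z, 2.0, x, 1.0, y)` deduces `Coeff = double`, while passing two `std::complex<double>` coefficients deduces a complex `Coeff` without any second overload.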
--- a/lib/qcd/utils/SUn.h
+++ b/lib/qcd/utils/SUn.h
--- a/lib/qcd/utils/SUnAdjoint.h
+++ b/lib/qcd/utils/SUnAdjoint.h
@@ -0,0 +1,182 @@
 #ifndef QCD_UTIL_SUNADJOINT_H
 #define QCD_UTIL_SUNADJOINT_H
 ////////////////////////////////////////////////////////////////////////
 //
 // * Adjoint representation generators
 //
 // * Normalisation for the fundamental generators: 
 //   trace ta tb = 1/2 delta_ab = T_F delta_ab
 //   T_F = 1/2  for SU(N) groups
 //
 //
 //   base for NxN hermitian traceless matrices
 //   normalized to 1:
 //
 //   (e_Adj)^a = t^a / sqrt(T_F)
 //
 //   then the real, antisymmetric generators for the adjoint representations
 //   are computed ( shortcut: e^a == (e_Adj)^a )
 //
 //   (iT_adj)^d_ba = i tr[e^a t^d e^b - t^d e^a e^b]
 //
 ////////////////////////////////////////////////////////////////////////
 namespace Grid {
 namespace QCD {
 template <int ncolour>
 class SU_Adjoint : public SU<ncolour> {
 public:
  static const int Dimension = ncolour * ncolour - 1;
  template <typename vtype>
  using iSUnAdjointMatrix =
      iScalar<iScalar<iMatrix<vtype, Dimension > > >;
  // Actually the adjoint matrices are real...
  // Consider this overhead... FIXME
  typedef iSUnAdjointMatrix<Complex> AMatrix;
  typedef iSUnAdjointMatrix<ComplexF> AMatrixF;
  typedef iSUnAdjointMatrix<ComplexD> AMatrixD;
  typedef iSUnAdjointMatrix<vComplex> vAMatrix;
  typedef iSUnAdjointMatrix<vComplexF> vAMatrixF;
  typedef iSUnAdjointMatrix<vComplexD> vAMatrixD;
  typedef Lattice<vAMatrix>  LatticeAdjMatrix;
  typedef Lattice<vAMatrixF> LatticeAdjMatrixF;
  typedef Lattice<vAMatrixD> LatticeAdjMatrixD;
  typedef Lattice<iVector<iScalar<iMatrix<vComplex, Dimension> >, Nd> >
      LatticeAdjField;
  typedef Lattice<iVector<iScalar<iMatrix<vComplexF, Dimension> >, Nd> >
      LatticeAdjFieldF;
  typedef Lattice<iVector<iScalar<iMatrix<vComplexD, Dimension> >, Nd> >
      LatticeAdjFieldD;
  template <class cplx>
  static void generator(int Index, iSUnAdjointMatrix<cplx> &iAdjTa) {
    // returns i(T_Adj)^index necessary for the projectors
    // see definitions above
    iAdjTa = zero;
    Vector<typename SU<ncolour>::template iSUnMatrix<cplx> > ta(ncolour * ncolour - 1);
    typename SU<ncolour>::template iSUnMatrix<cplx> tmp;
    // FIXME not very efficient to get all the generators every time
    for (int a = 0; a < Dimension; a++) SU<ncolour>::generator(a, ta[a]);
    for (int a = 0; a < Dimension; a++) {
      tmp = ta[a] * ta[Index] - ta[Index] * ta[a];
      for (int b = 0; b < (ncolour * ncolour - 1); b++) {
        typename SU<ncolour>::template iSUnMatrix<cplx> tmp1 =
            2.0 * tmp * ta[b];  // 2.0 from the normalization
        Complex iTr = TensorRemove(timesI(trace(tmp1)));
        //iAdjTa()()(b, a) = iTr;
        iAdjTa()()(a, b) = iTr;
      }
    }
  }
  static void printGenerators(void) {
    for (int gen = 0; gen < Dimension; gen++) {
      AMatrix ta;
      generator(gen, ta);
      std::cout << GridLogMessage << "Nc = " << ncolour << " t_" << gen
                << std::endl;
      std::cout << GridLogMessage << ta << std::endl;
    }
  }
  static void testGenerators(void) {
    AMatrix adjTa;
    std::cout << GridLogMessage << "Adjoint - Checking if real" << std::endl;
    for (int a = 0; a < Dimension; a++) {
      generator(a, adjTa);
      std::cout << GridLogMessage << a << std::endl;
      assert(norm2(adjTa - conjugate(adjTa)) < 1.0e-6);
    }
    std::cout << GridLogMessage << std::endl;
    std::cout << GridLogMessage << "Adjoint - Checking if antisymmetric"
              << std::endl;
    for (int a = 0; a < Dimension; a++) {
      generator(a, adjTa);
      std::cout << GridLogMessage << a << std::endl;
      assert(norm2(adjTa + transpose(adjTa)) < 1.0e-6);
    }
    std::cout << GridLogMessage << std::endl;
  }
  static void AdjointLieAlgebraMatrix(
      const typename SU<ncolour>::LatticeAlgebraVector &h,
      LatticeAdjMatrix &out, Real scale = 1.0) {
    conformable(h, out);
    GridBase *grid = out._grid;
    LatticeAdjMatrix la(grid);
    AMatrix iTa;
    out = zero;
    for (int a = 0; a < Dimension; a++) {
      generator(a, iTa);
      la = peekColour(h, a) * iTa;
      out += la;
    }
    out *= scale;
  }
  // Projects a lattice matrix (of dimension ncol*ncol - 1) onto its algebra components
  static void projectOnAlgebra(typename SU<ncolour>::LatticeAlgebraVector &h_out, const LatticeAdjMatrix &in, Real scale = 1.0) {
    conformable(h_out, in);
    h_out = zero;
    AMatrix iTa;
    Real coefficient = - 1.0/(ncolour) * scale;// 1/Nc for the normalization of the trace in the adj rep
    for (int a = 0; a < Dimension; a++) {
      generator(a, iTa);
      auto tmp = real(trace(iTa * in)) * coefficient;
      pokeColour(h_out, tmp, a);
    }
  }
  // a projector that keeps the generators stored to avoid the overhead of recomputing them 
  static void projector(typename SU<ncolour>::LatticeAlgebraVector &h_out, const LatticeAdjMatrix &in, Real scale = 1.0) {
    conformable(h_out, in);
    static std::vector<AMatrix> iTa(Dimension);  // to store the generators
    h_out = zero;
    static bool precalculated = false; 
    if (!precalculated){
      precalculated = true;
        for (int a = 0; a < Dimension; a++) generator(a, iTa[a]);
    }
    Real coefficient = -1.0 / (ncolour) * scale;  // 1/Nc for the normalization of
                                                // the trace in the adj rep
    for (int a = 0; a < Dimension; a++) {
      auto tmp = real(trace(iTa[a] * in)) * coefficient; 
      pokeColour(h_out, tmp, a);
    }
  }
 };
 // Some useful type names
 typedef SU_Adjoint<2> SU2Adjoint;
 typedef SU_Adjoint<3> SU3Adjoint;
 typedef SU_Adjoint<4> SU4Adjoint;
 typedef SU_Adjoint<5> SU5Adjoint;
 typedef SU_Adjoint<Nc> AdjointMatrices;
 }
 }
 #endif
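The construction in `SU_Adjoint::generator` can be checked by hand for SU(2), where the fundamental generators are the Pauli matrices divided by 2. The sketch below rebuilds `(iT^c)_{ab} = i * 2 * tr([t^a, t^c] t^b)` with plain `std::array` storage and tests the two properties that `testGenerators` asserts (real and antisymmetric). All names (`Mat2`, `Mat3`, `su2_generator`, `adjoint_generator`, and so on) are hypothetical helpers for illustration, not Grid API.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <complex>

typedef std::complex<double> Cplx;
typedef std::array<std::array<Cplx, 2>, 2> Mat2;
typedef std::array<std::array<Cplx, 3>, 3> Mat3;

inline Mat2 mat_mul(const Mat2 &A, const Mat2 &B) {
  Mat2 R{};  // value-initialised to zero
  for (int i = 0; i < 2; ++i)
    for (int j = 0; j < 2; ++j)
      for (int k = 0; k < 2; ++k) R[i][j] += A[i][k] * B[k][j];
  return R;
}

inline Mat2 commutator(const Mat2 &A, const Mat2 &B) {
  const Mat2 AB = mat_mul(A, B), BA = mat_mul(B, A);
  Mat2 R{};
  for (int i = 0; i < 2; ++i)
    for (int j = 0; j < 2; ++j) R[i][j] = AB[i][j] - BA[i][j];
  return R;
}

// Fundamental SU(2) generators t^a = sigma^a / 2, so tr(t^a t^b) = delta_ab / 2.
inline Mat2 su2_generator(int a) {
  Mat2 t{};
  if (a == 0) { t[0][1] = 0.5; t[1][0] = 0.5; }
  if (a == 1) { t[0][1] = Cplx(0.0, -0.5); t[1][0] = Cplx(0.0, 0.5); }
  if (a == 2) { t[0][0] = 0.5; t[1][1] = -0.5; }
  return t;
}

// (iT^c)_{ab} = i * 2 * tr([t^a, t^c] t^b); the 2 comes from T_F = 1/2.
inline Mat3 adjoint_generator(int c) {
  assert(c >= 0 && c < 3);
  Mat3 out{};
  for (int a = 0; a < 3; ++a) {
    const Mat2 comm = commutator(su2_generator(a), su2_generator(c));
    for (int b = 0; b < 3; ++b) {
      const Mat2 prod = mat_mul(comm, su2_generator(b));
      out[a][b] = Cplx(0.0, 1.0) * 2.0 * (prod[0][0] + prod[1][1]);
    }
  }
  return out;
}

inline bool is_real_antisymmetric(const Mat3 &m) {
  for (int a = 0; a < 3; ++a)
    for (int b = 0; b < 3; ++b) {
      if (std::abs(m[a][b].imag()) > 1e-12) return false;
      if (std::abs(m[a][b] + m[b][a]) > 1e-12) return false;
    }
  return true;
}
```

For SU(2) the entries reduce to the Levi-Civita symbol, `(iT^c)_{ab} = eps_{abc}`, which is why the matrices come out real and antisymmetric.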
--- a/lib/qcd/utils/SUnTwoIndex.h
+++ b/lib/qcd/utils/SUnTwoIndex.h
@@ -0,0 +1,276 @@
 ////////////////////////////////////////////////////////////////////////
 //
 // * Two index representation generators
 //
 // * Normalisation for the fundamental generators:
 //   trace ta tb = 1/2 delta_ab = T_F delta_ab
 //   T_F = 1/2  for SU(N) groups
 //
 //
 //   base for NxN two index (symmetric or anti-symmetric) matrices
 //   normalized to 1 (d_ij is the Kronecker delta)
 //
 //   (e^(ij))_{kl} = 1 / sqrt(2) (d_ik d_jl +/- d_jk d_il)
 //
 //   Then the generators are written as
 //
 //   (iT_a)^(ij)(lk) = i * ( tr[e^(ij)^dag e^(lk) T^transp_a] +
 //                           tr[e^(lk) e^(ij)^dag T_a] )
 //
 //
 ////////////////////////////////////////////////////////////////////////
 // Authors: David Preti, Guido Cossu
 #ifndef QCD_UTIL_SUN2INDEX_H
 #define QCD_UTIL_SUN2INDEX_H
 namespace Grid {
 namespace QCD {
 enum TwoIndexSymmetry { Symmetric = 1, AntiSymmetric = -1 };
 inline Real delta(int a, int b) { return (a == b) ? 1.0 : 0.0; }
 template <int ncolour, TwoIndexSymmetry S>
 class SU_TwoIndex : public SU<ncolour> {
 public:
  static const int Dimension = ncolour * (ncolour + S) / 2;
  static const int NumGenerators = SU<ncolour>::AdjointDimension;
  template <typename vtype>
  using iSUnTwoIndexMatrix = iScalar<iScalar<iMatrix<vtype, Dimension> > >;
  typedef iSUnTwoIndexMatrix<Complex> TIMatrix;
  typedef iSUnTwoIndexMatrix<ComplexF> TIMatrixF;
  typedef iSUnTwoIndexMatrix<ComplexD> TIMatrixD;
  typedef iSUnTwoIndexMatrix<vComplex> vTIMatrix;
  typedef iSUnTwoIndexMatrix<vComplexF> vTIMatrixF;
  typedef iSUnTwoIndexMatrix<vComplexD> vTIMatrixD;
  typedef Lattice<vTIMatrix> LatticeTwoIndexMatrix;
  typedef Lattice<vTIMatrixF> LatticeTwoIndexMatrixF;
  typedef Lattice<vTIMatrixD> LatticeTwoIndexMatrixD;
  typedef Lattice<iVector<iScalar<iMatrix<vComplex, Dimension> >, Nd> >
      LatticeTwoIndexField;
  typedef Lattice<iVector<iScalar<iMatrix<vComplexF, Dimension> >, Nd> >
      LatticeTwoIndexFieldF;
  typedef Lattice<iVector<iScalar<iMatrix<vComplexD, Dimension> >, Nd> >
      LatticeTwoIndexFieldD;
  template <typename vtype>
  using iSUnMatrix = iScalar<iScalar<iMatrix<vtype, ncolour> > >;
  typedef iSUnMatrix<Complex> Matrix;
  typedef iSUnMatrix<ComplexF> MatrixF;
  typedef iSUnMatrix<ComplexD> MatrixD;
  template <class cplx>
  static void base(int Index, iSUnMatrix<cplx> &eij) {
    // returns (e)^(ij)_{kl} necessary for change of base U_F -> U_R
    assert(Index < Dimension);  // the base has Dimension elements, not NumGenerators
    eij = zero;
    // for the linearisation of the 2 indexes 
    static int a[ncolour * (ncolour - 1) / 2][2]; // store the a <-> i,j
    static bool filled = false;
    if (!filled) {
      int counter = 0;
      for (int i = 1; i < ncolour; i++) {
        for (int j = 0; j < i; j++) {
          a[counter][0] = i;
          a[counter][1] = j;
          counter++;
        }
      }
      filled = true;
    }
    if (Index < ncolour * (ncolour - 1) / 2) {
      baseOffDiagonal(a[Index][0], a[Index][1], eij);
    } else {
      baseDiagonal(Index, eij);
    }
  }
  template <class cplx>
  static void baseDiagonal(int Index, iSUnMatrix<cplx> &eij) {
    eij = zero;
    eij()()(Index - ncolour * (ncolour - 1) / 2,
            Index - ncolour * (ncolour - 1) / 2) = 1.0;
  }
  template <class cplx>
  static void baseOffDiagonal(int i, int j, iSUnMatrix<cplx> &eij) {
    eij = zero;
    for (int k = 0; k < ncolour; k++)
      for (int l = 0; l < ncolour; l++)
        eij()()(l, k) = delta(i, k) * delta(j, l) +
                        S * delta(j, k) * delta(i, l);
    RealD nrm = 1. / std::sqrt(2.0);
    eij = eij * nrm;
  }
  static void printBase(void) {
    for (int gen = 0; gen < Dimension; gen++) {
      Matrix tmp;
      base(gen, tmp);
      std::cout << GridLogMessage << "Nc = " << ncolour << " t_" << gen
                << std::endl;
      std::cout << GridLogMessage << tmp << std::endl;
    }
  }
  template <class cplx>
  static void generator(int Index, iSUnTwoIndexMatrix<cplx> &i2indTa) {
    Vector<typename SU<ncolour>::template iSUnMatrix<cplx> > ta(
        ncolour * ncolour - 1);
    Vector<typename SU<ncolour>::template iSUnMatrix<cplx> > eij(Dimension);
    typename SU<ncolour>::template iSUnMatrix<cplx> tmp;
    i2indTa = zero;
    for (int a = 0; a < ncolour * ncolour - 1; a++)
      SU<ncolour>::generator(a, ta[a]);
    for (int a = 0; a < Dimension; a++) base(a, eij[a]);
    for (int a = 0; a < Dimension; a++) {
      tmp = transpose(ta[Index]) * adj(eij[a]) + adj(eij[a]) * ta[Index];
      for (int b = 0; b < Dimension; b++) {
        typename SU<ncolour>::template iSUnMatrix<cplx> tmp1 =
            tmp * eij[b]; 
        Complex iTr = TensorRemove(timesI(trace(tmp1)));
        i2indTa()()(a, b) = iTr;
      }
    }
  }
  static void printGenerators(void) {
    for (int gen = 0; gen < ncolour * ncolour - 1; gen++) {
      TIMatrix i2indTa;
      generator(gen, i2indTa);
      std::cout << GridLogMessage << "Nc = " << ncolour << " t_" << gen
                << std::endl;
      std::cout << GridLogMessage << i2indTa << std::endl;
    }
  }
  static void testGenerators(void) {
    TIMatrix i2indTa, i2indTb;
    std::cout << GridLogMessage << "2IndexRep - Checking if traceless"
              << std::endl;
    for (int a = 0; a < ncolour * ncolour - 1; a++) {
      generator(a, i2indTa);
      std::cout << GridLogMessage << a << std::endl;
      assert(norm2(trace(i2indTa)) < 1.0e-6);
    }
    std::cout << GridLogMessage << std::endl;
    std::cout << GridLogMessage << "2IndexRep - Checking if antihermitean"
              << std::endl;
    for (int a = 0; a < ncolour * ncolour - 1; a++) {
      generator(a, i2indTa);
      std::cout << GridLogMessage << a << std::endl;
      assert(norm2(adj(i2indTa) + i2indTa) < 1.0e-6);
    }
    std::cout << GridLogMessage << std::endl;
    std::cout << GridLogMessage
              << "2IndexRep - Checking Tr[Ta*Tb]=delta(a,b)*(N +- 2)/2"
              << std::endl;
    for (int a = 0; a < ncolour * ncolour - 1; a++) {
      for (int b = 0; b < ncolour * ncolour - 1; b++) {
        generator(a, i2indTa);
        generator(b, i2indTb);
        // generator returns iTa, so we need a minus sign here
        Complex Tr = -TensorRemove(trace(i2indTa * i2indTb));
        std::cout << GridLogMessage << "a=" << a << "b=" << b << "Tr=" << Tr
                  << std::endl;
      }
    }
    std::cout << GridLogMessage << std::endl;
  }
  static void TwoIndexLieAlgebraMatrix(
      const typename SU<ncolour>::LatticeAlgebraVector &h,
      LatticeTwoIndexMatrix &out, Real scale = 1.0) {
    conformable(h, out);
    GridBase *grid = out._grid;
    LatticeTwoIndexMatrix la(grid);
    TIMatrix i2indTa;
    out = zero;
    for (int a = 0; a < ncolour * ncolour - 1; a++) {
      generator(a, i2indTa);
      la = peekColour(h, a) * i2indTa;
      out += la;
    }
    out *= scale;
  }
  // Projects the algebra components 
  // of a lattice matrix ( of dimension ncol*ncol -1 )
  static void projectOnAlgebra(
      typename SU<ncolour>::LatticeAlgebraVector &h_out,
      const LatticeTwoIndexMatrix &in, Real scale = 1.0) {
    conformable(h_out, in);
    h_out = zero;
    TIMatrix i2indTa;
    Real coefficient = -2.0 / (ncolour + 2 * S) * scale;
    // 2/(Nc +/- 2) for the normalization of the trace in the two index rep
    for (int a = 0; a < ncolour * ncolour - 1; a++) {
      generator(a, i2indTa);
      auto tmp = real(trace(i2indTa * in)) * coefficient;
      pokeColour(h_out, tmp, a);
    }
  }
  // a projector that keeps the generators stored to avoid the overhead of
  // recomputing them
  static void projector(typename SU<ncolour>::LatticeAlgebraVector &h_out,
                        const LatticeTwoIndexMatrix &in, Real scale = 1.0) {
    conformable(h_out, in);
    // to store the generators
    static std::vector<TIMatrix> i2indTa(ncolour * ncolour -1); 
    h_out = zero;
    static bool precalculated = false;
    if (!precalculated) {
      precalculated = true;
      for (int a = 0; a < ncolour * ncolour - 1; a++) generator(a, i2indTa[a]);
    }
    Real coefficient =
        -2.0 / (ncolour + 2 * S) * scale;  // 2/(Nc +/- 2) for the normalization
                                           // of the trace in the two index rep
    for (int a = 0; a < ncolour * ncolour - 1; a++) {
      auto tmp = real(trace(i2indTa[a] * in)) * coefficient;
      pokeColour(h_out, tmp, a);
    }
  }
 };
 // Some useful type names
 typedef SU_TwoIndex<Nc, Symmetric> TwoIndexSymmMatrices;
 typedef SU_TwoIndex<Nc, AntiSymmetric> TwoIndexAntiSymmMatrices;
 typedef SU_TwoIndex<2, Symmetric> SU2TwoIndexSymm;
 typedef SU_TwoIndex<3, Symmetric> SU3TwoIndexSymm;
 typedef SU_TwoIndex<4, Symmetric> SU4TwoIndexSymm;
 typedef SU_TwoIndex<5, Symmetric> SU5TwoIndexSymm;
 typedef SU_TwoIndex<2, AntiSymmetric> SU2TwoIndexAntiSymm;
 typedef SU_TwoIndex<3, AntiSymmetric> SU3TwoIndexAntiSymm;
 typedef SU_TwoIndex<4, AntiSymmetric> SU4TwoIndexAntiSymm;
 typedef SU_TwoIndex<5, AntiSymmetric> SU5TwoIndexAntiSymm;
 }
 }
 #endif
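The constant `Dimension = ncolour * (ncolour + S) / 2` and the index table built in `base` can be reproduced in a few lines. This is a sketch only: `Symmetry`, `two_index_dimension` and `off_diagonal_pairs` are hypothetical names, with `Symmetry` a local copy of the `TwoIndexSymmetry` enum so the block stands alone.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Local copy of the enum from the diff, so the sketch is self-contained.
enum Symmetry { Symmetric = 1, AntiSymmetric = -1 };

// N(N+1)/2 for the symmetric and N(N-1)/2 for the antisymmetric representation.
constexpr int two_index_dimension(int n, Symmetry s) { return n * (n + s) / 2; }

// Off-diagonal index pairs (i, j) with i > j, in the order base() stores them.
inline std::vector<std::pair<int, int> > off_diagonal_pairs(int n) {
  assert(n >= 1);
  std::vector<std::pair<int, int> > pairs;
  for (int i = 1; i < n; ++i)
    for (int j = 0; j < i; ++j) pairs.push_back(std::make_pair(i, j));
  return pairs;
}

static_assert(two_index_dimension(3, Symmetric) == 6, "SU(3) sextet");
static_assert(two_index_dimension(3, AntiSymmetric) == 3, "SU(3) antisymmetric");
```

The symmetric dimension equals the number of off-diagonal pairs plus the `n` diagonal elements, which matches the split between `baseOffDiagonal` and `baseDiagonal`; the antisymmetric case has no diagonal elements.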
--- a/lib/simd/Grid_avx.h
+++ b/lib/simd/Grid_avx.h
@@ -365,6 +365,18 @@ namespace Optimization {
    }
  };
  struct Div{
    // Real float
    inline __m256 operator()(__m256 a, __m256 b){
      return _mm256_div_ps(a,b);
    }
    // Real double
    inline __m256d operator()(__m256d a, __m256d b){
      return _mm256_div_pd(a,b);
    }
  };
  struct Conj{
    // Complex single
    inline __m256 operator()(__m256 in){
@@ -437,14 +449,13 @@ namespace Optimization {
  };
-#if defined (AVX2) || defined (AVXFMA4) 
+#if defined (AVX2)
-#define _mm256_alignr_epi32(ret,a,b,n) ret=(__m256) _mm256_alignr_epi8((__m256i)a,(__m256i)b,(n*4)%16)
+#define _mm256_alignr_epi32_grid(ret,a,b,n) ret=(__m256)  _mm256_alignr_epi8((__m256i)a,(__m256i)b,(n*4)%16)
-#define _mm256_alignr_epi64(ret,a,b,n) ret=(__m256d) _mm256_alignr_epi8((__m256i)a,(__m256i)b,(n*8)%16)
+#define _mm256_alignr_epi64_grid(ret,a,b,n) ret=(__m256d) _mm256_alignr_epi8((__m256i)a,(__m256i)b,(n*8)%16)
 #endif
-#if defined (AVX1) 
+#if defined (AVX1) || defined (AVXFMA)  
-
+#define _mm256_alignr_epi32_grid(ret,a,b,n) {	\
 #define _mm256_alignr_epi32(ret,a,b,n) {	\
    __m128 aa, bb;				\
 						\
    aa  = _mm256_extractf128_ps(a,1);		\
@@ -458,7 +469,7 @@ namespace Optimization {
    ret = _mm256_insertf128_ps(ret,aa,0);	\
  }
-#define _mm256_alignr_epi64(ret,a,b,n) {	\
+#define _mm256_alignr_epi64_grid(ret,a,b,n) {	\
    __m128d aa, bb;				\
 						\
    aa  = _mm256_extractf128_pd(a,1);		\
@@ -474,19 +485,6 @@ namespace Optimization {
 #endif
    inline std::ostream & operator << (std::ostream& stream, const __m256 a)
    {
      const float *p=(const float *)&a;
      stream<< "{"<<p[0]<<","<<p[1]<<","<<p[2]<<","<<p[3]<<","<<p[4]<<","<<p[5]<<","<<p[6]<<","<<p[7]<<"}";
      return stream;
    };
    inline std::ostream & operator<< (std::ostream& stream, const __m256d a)
    {
      const double *p=(const double *)&a;
      stream<< "{"<<p[0]<<","<<p[1]<<","<<p[2]<<","<<p[3]<<"}";
      return stream;
    };
  struct Rotate{
    static inline __m256 rotate(__m256 in,int n){ 
@@ -518,11 +516,10 @@ namespace Optimization {
      __m256 tmp = Permute::Permute0(in);
      __m256 ret;
      if ( n > 3 ) { 
-	_mm256_alignr_epi32(ret,in,tmp,n);  
+	_mm256_alignr_epi32_grid(ret,in,tmp,n);  
      } else {
-        _mm256_alignr_epi32(ret,tmp,in,n);          
+        _mm256_alignr_epi32_grid(ret,tmp,in,n);          
      }
      //      std::cout << " align epi32 n=" <<n<<" in "<<tmp<<in<<" -> "<< ret <<std::endl;
      return ret;
    };
@@ -531,18 +528,15 @@ namespace Optimization {
      __m256d tmp = Permute::Permute0(in);
      __m256d ret;
      if ( n > 1 ) {
-	_mm256_alignr_epi64(ret,in,tmp,n);          
+	_mm256_alignr_epi64_grid(ret,in,tmp,n);          
      } else {
-        _mm256_alignr_epi64(ret,tmp,in,n);          
+        _mm256_alignr_epi64_grid(ret,tmp,in,n);          
      }
      //      std::cout << " align epi64 n=" <<n<<" in "<<tmp<<in<<" -> "<< ret <<std::endl;
      return ret;
    };
  };
  //Complex float Reduce
  template<>
    inline Grid::ComplexF Reduce<Grid::ComplexF, __m256>::operator()(__m256 in){
@@ -631,6 +625,7 @@ namespace Optimization {
  // Arithmetic operations
  typedef Optimization::Sum         SumSIMD;
  typedef Optimization::Sub         SubSIMD;
  typedef Optimization::Div         DivSIMD;
  typedef Optimization::Mult        MultSIMD;
  typedef Optimization::MultComplex MultComplexSIMD;
  typedef Optimization::Conj        ConjSIMD;
--- a/lib/simd/Grid_avx512.h
+++ b/lib/simd/Grid_avx512.h
@@ -42,6 +42,16 @@ Author: paboyle <paboyle@ph.ed.ac.uk>
 namespace Grid{
 namespace Optimization {
  union u512f {
    __m512 v;
    float f[16];
  };
  union u512d {
    __m512d v;
    double f[8];
  };
  struct Vsplat{
    //Complex float
    inline __m512 operator()(float a, float b){
@@ -230,6 +240,17 @@ namespace Optimization {
    }
  };
  struct Div{
    // Real float
    inline __m512 operator()(__m512 a, __m512 b){
      return _mm512_div_ps(a,b);
    }
    // Real double
    inline __m512d operator()(__m512d a, __m512d b){
      return _mm512_div_pd(a,b);
    }
  };
  struct Conj{
    // Complex single
@@ -360,6 +381,66 @@ namespace Optimization {
  //////////////////////////////////////////////
  // Some Template specialization
  // Hack for CLANG until mm512_reduce_add_ps etc... are implemented in GCC and Clang releases
 #ifndef __INTEL_COMPILER
 #warning "Slow reduction due to incomplete reduce intrinsics"
  //Complex float Reduce
  template<>
    inline Grid::ComplexF Reduce<Grid::ComplexF, __m512>::operator()(__m512 in){
    __m512 v1,v2;
    v1=Optimization::Permute::Permute0(in); // avx 512; octo complex single
    v1= _mm512_add_ps(v1,in);
    v2=Optimization::Permute::Permute1(v1); 
    v1 = _mm512_add_ps(v1,v2);
    v2=Optimization::Permute::Permute2(v1); 
    v1 = _mm512_add_ps(v1,v2);
    u512f conv; conv.v = v1;
    return Grid::ComplexF(conv.f[0],conv.f[1]);
  }
  //Real float Reduce
  template<>
    inline Grid::RealF Reduce<Grid::RealF, __m512>::operator()(__m512 in){
    __m512 v1,v2;
    v1 = Optimization::Permute::Permute0(in); // avx 512; 16 real single
    v1 = _mm512_add_ps(v1,in);
    v2 = Optimization::Permute::Permute1(v1); 
    v1 = _mm512_add_ps(v1,v2);
    v2 = Optimization::Permute::Permute2(v1); 
    v1 = _mm512_add_ps(v1,v2);
    v2 = Optimization::Permute::Permute3(v1); 
    v1 = _mm512_add_ps(v1,v2);
    u512f conv; conv.v=v1;
    return conv.f[0];
  }
  //Complex double Reduce
  template<>
    inline Grid::ComplexD Reduce<Grid::ComplexD, __m512d>::operator()(__m512d in){
    __m512d v1,v2;
    v1 = Optimization::Permute::Permute0(in); // avx 512; quad complex double
    v1 = _mm512_add_pd(v1,in);
    v2 = Optimization::Permute::Permute1(v1); // permute the running sum, not the input
    v1 = _mm512_add_pd(v1,v2);
    u512d conv; conv.v = v1;
    return Grid::ComplexD(conv.f[0],conv.f[1]);
  }
  //Real double Reduce
  template<>
    inline Grid::RealD Reduce<Grid::RealD, __m512d>::operator()(__m512d in){
    __m512d v1,v2;
    v1 = Optimization::Permute::Permute0(in); // avx 512; octo double
    v1 = _mm512_add_pd(v1,in);
      v2 = Optimization::Permute::Permute1(v1); 
      v1 = _mm512_add_pd(v1,v2);
      v2 = Optimization::Permute::Permute2(v1); 
      v1 = _mm512_add_pd(v1,v2);
     u512d conv; conv.v = v1;
     return conv.f[0];
  }
 #else
  //Complex float Reduce
  template<>
  inline Grid::ComplexF Reduce<Grid::ComplexF, __m512>::operator()(__m512 in){
@@ -371,7 +452,6 @@ namespace Optimization {
    return _mm512_reduce_add_ps(in);
  }
  //Complex double Reduce
  template<>
  inline Grid::ComplexD Reduce<Grid::ComplexD, __m512d>::operator()(__m512d in){
@@ -391,6 +471,7 @@ namespace Optimization {
    printf("Reduce : Missing integer implementation -> FIX\n");
    assert(0);
  }
 #endif
 }
@@ -427,6 +508,7 @@ namespace Optimization {
  typedef Optimization::Sum         SumSIMD;
  typedef Optimization::Sub         SubSIMD;
  typedef Optimization::Mult        MultSIMD;
  typedef Optimization::Div         DivSIMD;
  typedef Optimization::MultComplex MultComplexSIMD;
  typedef Optimization::Conj        ConjSIMD;
  typedef Optimization::TimesMinusI TimesMinusISIMD;
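Note on the Permute/add sequences in the Reduce specialisations above: each step adds the running partial sum to a permuted copy of itself, so that after log2(N) steps lane 0 holds the total. The following is a minimal scalar sketch of that pattern, not the library's code. The name `tree_reduce` and the XOR lane mapping are assumptions made for illustration; the actual lane mapping of `Permute0` to `Permute3` is not shown in this diff.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical scalar model of a permute-and-add reduction.
// N must be a power of two, like the lane count of a SIMD register.
template <typename T, std::size_t N>
T tree_reduce(std::array<T, N> v) {
  static_assert(N > 0 && (N & (N - 1)) == 0, "N must be a power of two");
  for (std::size_t stride = N / 2; stride >= 1; stride /= 2) {
    std::array<T, N> permuted;
    // Stand-in for PermuteK: exchange neighbouring blocks of `stride` lanes.
    for (std::size_t i = 0; i < N; ++i) permuted[i] = v[i ^ stride];
    // Stand-in for the add intrinsic: add the permuted copy to the
    // running partial sum (not to the original input).
    for (std::size_t i = 0; i < N; ++i) v[i] += permuted[i];
  }
  return v[0];  // every lane now holds the full sum
}
```

With 8 lanes the loop runs three times (strides 4, 2, 1), matching the three Permute/add pairs in the RealD specialisation; with 16 lanes it runs four times, matching RealF.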
--- a/lib/simd/Grid_imci.h
+++ b/lib/simd/Grid_imci.h
@@ -244,6 +244,17 @@ namespace Optimization {
    }
  };
  struct Div{
    // Real float
    inline __m512 operator()(__m512 a, __m512 b){
      return _mm512_div_ps(a,b);
    }
    // Real double
    inline __m512d operator()(__m512d a, __m512d b){
      return _mm512_div_pd(a,b);
    }
  };
  struct Conj{
    // Complex single
@@ -437,6 +448,7 @@ namespace Optimization {
  // Arithmetic operations
  typedef Optimization::Sum         SumSIMD;
  typedef Optimization::Sub         SubSIMD;
  typedef Optimization::Div         DivSIMD;
  typedef Optimization::Mult        MultSIMD;
  typedef Optimization::MultComplex MultComplexSIMD;
  typedef Optimization::Conj        ConjSIMD;
--- a/lib/simd/Grid_sse4.h
+++ b/lib/simd/Grid_sse4.h
@@ -224,6 +224,18 @@ namespace Optimization {
    }
  };
  struct Div{
    // Real float
    inline __m128 operator()(__m128 a, __m128 b){
      return _mm_div_ps(a,b);
    }
    // Real double
    inline __m128d operator()(__m128d a, __m128d b){
      return _mm_div_pd(a,b);
    }
  };
  struct Conj{
    // Complex single
    inline __m128 operator()(__m128 in){
@@ -372,6 +384,8 @@ namespace Optimization {
  }
 }
 //////////////////////////////////////////////////////////////////////////////////////
 // Here assign types 
@@ -398,6 +412,7 @@ namespace Optimization {
  // Arithmetic operations
  typedef Optimization::Sum         SumSIMD;
  typedef Optimization::Sub         SubSIMD;
  typedef Optimization::Div         DivSIMD;
  typedef Optimization::Mult        MultSIMD;
  typedef Optimization::MultComplex MultComplexSIMD;
  typedef Optimization::Conj        ConjSIMD;
--- a/lib/simd/Grid_vector_types.h
+++ b/lib/simd/Grid_vector_types.h
@@ -77,38 +77,24 @@ struct RealPart<std::complex<T> > {
 //////////////////////////////////////
 // demote a vector to real type
 //////////////////////////////////////
 // type alias used to simplify the syntax of std::enable_if
-template <typename T>
-using Invoke = typename T::type;
-template <typename Condition, typename ReturnType>
-using EnableIf = Invoke<std::enable_if<Condition::value, ReturnType> >;
-template <typename Condition, typename ReturnType>
-using NotEnableIf = Invoke<std::enable_if<!Condition::value, ReturnType> >;
+template <typename T> using Invoke = typename T::type;
+template <typename Condition, typename ReturnType> using EnableIf = Invoke<std::enable_if<Condition::value, ReturnType> >;
+template <typename Condition, typename ReturnType> using NotEnableIf = Invoke<std::enable_if<!Condition::value, ReturnType> >;
 ////////////////////////////////////////////////////////
 // Check for complexity with type traits
-template <typename T>
-struct is_complex : public std::false_type {};
-template <>
-struct is_complex<std::complex<double> > : public std::true_type {};
-template <>
-struct is_complex<std::complex<float> > : public std::true_type {};
+template <typename T> struct is_complex : public std::false_type {};
+template <> struct is_complex<std::complex<double> > : public std::true_type {};
+template <> struct is_complex<std::complex<float> > : public std::true_type {};
-template <typename T>
-using IfReal = Invoke<std::enable_if<std::is_floating_point<T>::value, int> >;
-template <typename T>
-using IfComplex = Invoke<std::enable_if<is_complex<T>::value, int> >;
-template <typename T>
-using IfInteger = Invoke<std::enable_if<std::is_integral<T>::value, int> >;
+template <typename T> using IfReal       = Invoke<std::enable_if<std::is_floating_point<T>::value, int> >;
+template <typename T> using IfComplex    = Invoke<std::enable_if<is_complex<T>::value, int> >;
+template <typename T> using IfInteger    = Invoke<std::enable_if<std::is_integral<T>::value, int> >;
-template <typename T>
-using IfNotReal =
-    Invoke<std::enable_if<!std::is_floating_point<T>::value, int> >;
-template <typename T>
-using IfNotComplex = Invoke<std::enable_if<!is_complex<T>::value, int> >;
-template <typename T>
-using IfNotInteger = Invoke<std::enable_if<!std::is_integral<T>::value, int> >;
+template <typename T> using IfNotReal    = Invoke<std::enable_if<!std::is_floating_point<T>::value, int> >;
+template <typename T> using IfNotComplex = Invoke<std::enable_if<!is_complex<T>::value, int> >;
+template <typename T> using IfNotInteger = Invoke<std::enable_if<!std::is_integral<T>::value, int> >;
 ////////////////////////////////////////////////////////
 // Define the operation templates functors
@@ -285,6 +271,20 @@ class Grid_simd {
    return a * b;
  }
  //////////////////////////////////
  // Divides
  //////////////////////////////////
  friend inline Grid_simd operator/(const Scalar_type &a, Grid_simd b) {
    Grid_simd va;
    vsplat(va, a);
    return va / b;
  }
  friend inline Grid_simd operator/(Grid_simd b, const Scalar_type &a) {
    Grid_simd va;
    vsplat(va, a);
    return b / va;
  }
  ///////////////////////
  // Unary negation
  ///////////////////////
@@ -428,7 +428,6 @@ inline void rotate(Grid_simd<S,V> &ret,Grid_simd<S,V> b,int nrot)
  ret.v = Optimization::Rotate::rotate(b.v,2*nrot);
 }
 template <class S, class V> 
 inline void vbroadcast(Grid_simd<S,V> &ret,const Grid_simd<S,V> &src,int lane){
  S* typepun =(S*) &src;
@@ -512,7 +511,6 @@ template <class S, class V, IfInteger<S> = 0>
 inline void vfalse(Grid_simd<S, V> &ret) {
  vsplat(ret, 0);
 }
 template <class S, class V>
 inline void zeroit(Grid_simd<S, V> &z) {
  vzero(z);
@@ -530,7 +528,6 @@ inline void vstream(Grid_simd<S, V> &out, const Grid_simd<S, V> &in) {
  typedef typename S::value_type T;
  binary<void>((T *)&out.v, in.v, VstreamSIMD());
 }
 template <class S, class V, IfInteger<S> = 0>
 inline void vstream(Grid_simd<S, V> &out, const Grid_simd<S, V> &in) {
  out = in;
@@ -569,6 +566,34 @@ inline Grid_simd<S, V> operator*(Grid_simd<S, V> a, Grid_simd<S, V> b) {
  return ret;
 };
 // Distinguish between complex types and others
 template <class S, class V, IfComplex<S> = 0>
 inline Grid_simd<S, V> operator/(Grid_simd<S, V> a, Grid_simd<S, V> b) {
  typedef Grid_simd<S, V> simd;
  simd ret;
  simd den;
  typename simd::conv_t conv;
  ret = a * conjugate(b) ;
  den = b * conjugate(b) ;
  auto real_den = toReal(den);
  ret.v=binary<V>(ret.v, real_den.v, DivSIMD());
  return ret;
 };
 // Real/Integer types
 template <class S, class V, IfNotComplex<S> = 0>
 inline Grid_simd<S, V> operator/(Grid_simd<S, V> a, Grid_simd<S, V> b) {
  Grid_simd<S, V> ret;
  ret.v = binary<V>(a.v, b.v, DivSIMD());
  return ret;
 };
 ///////////////////////
 // Conjugate
 ///////////////////////
@@ -582,7 +607,6 @@ template <class S, class V, IfNotComplex<S> = 0>
 inline Grid_simd<S, V> conjugate(const Grid_simd<S, V> &in) {
  return in;  // for real objects
 }
 // Suppress adj for integer types... // odd; why conjugate above but not adj??
 template <class S, class V, IfNotInteger<S> = 0>
 inline Grid_simd<S, V> adj(const Grid_simd<S, V> &in) {
@@ -596,14 +620,12 @@ template <class S, class V, IfComplex<S> = 0>
 inline void timesMinusI(Grid_simd<S, V> &ret, const Grid_simd<S, V> &in) {
  ret.v = binary<V>(in.v, ret.v, TimesMinusISIMD());
 }
 template <class S, class V, IfComplex<S> = 0>
 inline Grid_simd<S, V> timesMinusI(const Grid_simd<S, V> &in) {
  Grid_simd<S, V> ret;
  timesMinusI(ret, in);
  return ret;
 }
 template <class S, class V, IfNotComplex<S> = 0>
 inline Grid_simd<S, V> timesMinusI(const Grid_simd<S, V> &in) {
  return in;
@@ -616,14 +638,12 @@ template <class S, class V, IfComplex<S> = 0>
 inline void timesI(Grid_simd<S, V> &ret, const Grid_simd<S, V> &in) {
  ret.v = binary<V>(in.v, ret.v, TimesISIMD());
 }
 template <class S, class V, IfComplex<S> = 0>
 inline Grid_simd<S, V> timesI(const Grid_simd<S, V> &in) {
  Grid_simd<S, V> ret;
  timesI(ret, in);
  return ret;
 }
 template <class S, class V, IfNotComplex<S> = 0>
 inline Grid_simd<S, V> timesI(const Grid_simd<S, V> &in) {
  return in;
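Note on the complex `operator/` added in this file: it divides by multiplying the numerator by the conjugate of the denominator and then dividing by the real quantity `b * conjugate(b)`, that is a / b = a * conj(b) / |b|^2. The following is a minimal scalar sketch of the same identity using `std::complex`; the function name `divide_via_conjugate` is hypothetical and the sketch does not reproduce the SIMD layout or the `toReal` helper.

```cpp
#include <cassert>
#include <cmath>
#include <complex>

// Hypothetical scalar analogue of the SIMD complex division:
//   a / b = a * conj(b) / (b * conj(b)),  where b * conj(b) is real.
template <typename T>
std::complex<T> divide_via_conjugate(std::complex<T> a, std::complex<T> b) {
  std::complex<T> num = a * std::conj(b);   // mirrors: ret = a * conjugate(b)
  T den = (b * std::conj(b)).real();        // mirrors: toReal(b * conjugate(b))
  // Dividing both components by the real denominator mirrors the lane-wise
  // DivSIMD applied to the interleaved real and imaginary parts.
  return std::complex<T>(num.real() / den, num.imag() / den);
}
```

For example, `divide_via_conjugate(std::complex<double>(1, 2), std::complex<double>(3, -4))` evaluates (1+2i)(3+4i)/25 = (-5+10i)/25. Like the hunk above, the sketch does not guard against a zero denominator.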
--- a/lib/simd/Intel512common.h
+++ b/lib/simd/Intel512common.h
@@ -138,9 +138,14 @@ Author: paboyle <paboyle@ph.ed.ac.uk>
 #define ZLOADf(OFF,PTR,ri,ir)  VLOADf(OFF,PTR,ir)  VSHUFf(ir,ri)
 #define ZLOADd(OFF,PTR,ri,ir)  VLOADd(OFF,PTR,ir)  VSHUFd(ir,ri)
-
+#define STREAM_STORE
 #ifdef STREAM_STORE
 #define VSTOREf(OFF,PTR,SRC)   "vmovntps " #SRC "," #OFF "*64(" #PTR ")"  ";\n"
 #define VSTOREd(OFF,PTR,SRC)   "vmovntpd " #SRC "," #OFF "*64(" #PTR ")"  ";\n"
 #else
 #define VSTOREf(OFF,PTR,SRC)   "vmovaps " #SRC "," #OFF "*64(" #PTR ")"  ";\n"
 #define VSTOREd(OFF,PTR,SRC)   "vmovapd " #SRC "," #OFF "*64(" #PTR ")"  ";\n"
 #endif
 // Swaps Re/Im ; could unify this with IMCI
 #define VSHUFd(A,DEST)         "vpshufd  $0x4e," #A "," #DEST  ";\n"    
--- a/lib/stencil/Lebesgue.cc
+++ b/lib/stencil/Lebesgue.cc
@@ -32,7 +32,7 @@ Author: paboyle <paboyle@ph.ed.ac.uk>
 namespace Grid {
 int LebesgueOrder::UseLebesgueOrder;
-std::vector<int> LebesgueOrder::Block({2,2,2,2});
+std::vector<int> LebesgueOrder::Block({8,2,2,2});
 LebesgueOrder::IndexInteger LebesgueOrder::alignup(IndexInteger n){
  n--;           // 1000 0011 --> 1000 0010
--- a/lib/tensors/Tensor_arith_mul.h
+++ b/lib/tensors/Tensor_arith_mul.h
@@ -127,6 +127,36 @@ iVector<rtype,N> operator * (const iVector<mtype,N>& lhs,const iScalar<vtype>& r
    return ret;
 }
 //////////////////////////////////////////////////////////////////
 // Divide by scalar
 //////////////////////////////////////////////////////////////////
 template<class rtype,class vtype> strong_inline
 iScalar<rtype> operator / (const iScalar<rtype>& lhs,const iScalar<vtype>& rhs)
 {
    iScalar<rtype> ret;
    ret._internal = lhs._internal/rhs._internal;
    return ret;
 }
 template<class rtype,class vtype,int N> strong_inline
 iVector<rtype,N> operator / (const iVector<rtype,N>& lhs,const iScalar<vtype>& rhs)
 {
    iVector<rtype,N> ret;
    for(int i=0;i<N;i++){
      ret._internal[i] = lhs._internal[i]/rhs._internal;
    }
    return ret;
 }
 template<class rtype,class vtype,int N> strong_inline
 iMatrix<rtype,N> operator / (const iMatrix<rtype,N>& lhs,const iScalar<vtype>& rhs)
 {
    iMatrix<rtype,N> ret;
    for(int i=0;i<N;i++){
    for(int j=0;j<N;j++){
      ret._internal[i][j] = lhs._internal[i][j]/rhs._internal;
    }}
    return ret;
 }
    //////////////////////////////////////////////////////////////////
    // Glue operators to mult routines. Must resolve return type cleverly from typeof(internal)
    // since nesting matrix<scalar> x matrix<matrix>-> matrix<matrix>
--- a/lib/tensors/Tensor_exp.h
+++ b/lib/tensors/Tensor_exp.h
@@ -56,7 +56,7 @@ namespace Grid {
 	temp = unit + temp*arg;
      }
-      return ProjectOnGroup(temp);//maybe not strictly necessary
+      return temp;
    }
--- a/lib/tensors/Tensor_extract_merge.h
+++ b/lib/tensors/Tensor_extract_merge.h
@@ -250,8 +250,7 @@ void merge(vobj &vec,std::vector<typename vobj::scalar_object *> &extracted,int
  }
 }
-template<class vobj> inline 
-void merge1(vobj &vec,std::vector<typename vobj::scalar_object *> &extracted,int offset)
+template<class vobj> inline void merge1(vobj &vec,std::vector<typename vobj::scalar_object *> &extracted,int offset)
 {
  typedef typename vobj::scalar_type scalar_type ;
  typedef typename vobj::vector_type vector_type ;
@@ -269,8 +268,7 @@ void merge1(vobj &vec,std::vector<typename vobj::scalar_object *> &extracted,int
  }}
 }
-template<class vobj> inline 
-void merge2(vobj &vec,std::vector<typename vobj::scalar_object *> &extracted,int offset)
+template<class vobj> inline void merge2(vobj &vec,std::vector<typename vobj::scalar_object *> &extracted,int offset)
 {
  typedef typename vobj::scalar_type scalar_type ;
  typedef typename vobj::vector_type vector_type ;
--- a/prerequisites/fftw-3.3.4.tar.gz
+++ b/prerequisites/fftw-3.3.4.tar.gz
--- a/scripts/arm_configure.experimental
+++ b/scripts/arm_configure.experimental
@@ -1 +0,0 @@
 ./configure --host=arm-linux-gnueabihf  CXX=clang++-3.5 CXXFLAGS='-std=c++11 -O3 -target arm-linux-gnueabihf -I/usr/arm-linux-gnueabihf/include/ -I/home/neo/Codes/gmp6.0/gmp-arm/include/ -I/usr/arm-linux-gnueabihf/include/c++/4.8.2/arm-linux-gnueabihf/ -L/home/neo/Codes/gmp6.0/gmp-arm/lib/ -I/home/neo/Codes/mpfr3.1.2/mpfr-arm/include/ -L/home/neo/Codes/mpfr3.1.2/mpfr-arm/lib/ -static -mcpu=cortex-a7' --enable-simd=NEONv7
--- a/scripts/arm_configure.experimental_cortex57
+++ b/scripts/arm_configure.experimental_cortex57
@@ -1,3 +0,0 @@
 #./configure --host=arm-linux-gnueabihf  CXX=clang++-3.5 CXXFLAGS='-std=c++11 -O3 -target arm-linux-gnueabihf -I/usr/arm-linux-gnueabihf/include/ -I/home/neo/Codes/gmp6.0/gmp-arm/include/ -I/usr/lib/llvm-3.5/lib/clang/3.5.0/include/ -L/home/neo/Codes/gmp6.0/gmp-arm/lib/ -I/home/neo/Codes/mpfr3.1.2/mpfr-arm/include/ -L/home/neo/Codes/mpfr3.1.2/mpfr-arm/lib/ -static -mcpu=cortex-a57' --enable-simd=NEONv7
 ./configure --host=aarch64-linux-gnu  CXX=clang++-3.5 CXXFLAGS='-std=c++11 -O3 -target aarch64-linux-gnu -static -I/home/neo/Codes/gmp6.0/gmp-armv8/include/ -L/home/neo/Codes/gmp6.0/gmp-armv8/lib/ -I/home/neo/Codes/mpfr3.1.2/mpfr-armv8/include/ -L/home/neo/Codes/mpfr3.1.2/mpfr-armv8/lib/ -I/usr/aarch64-linux-gnu/include/ -I/usr/aarch64-linux-gnu/include/c++/4.8.2/aarch64-linux-gnu/' --enable-simd=NEONv7
--- a/scripts/bench_wilson.sh
+++ b/scripts/bench_wilson.sh
@@ -1,9 +0,0 @@
 for omp in 1 2 4
 do
 echo > wilson.t$omp
 for vol in 4.4.4.4 4.4.4.8 4.4.8.8  4.8.8.8  8.8.8.8   8.8.8.16 8.8.16.16  8.16.16.16
 do   
 perf=` ./benchmarks/Grid_wilson --grid $vol --omp $omp  | grep mflop | awk '{print $3}'`
 echo $vol $perf >> wilson.t$omp
 done
 done