Merge branch 'develop' into feature/json-fix

2026-05-24 02:54:16 +01:00 · 2017-07-07 14:17:50 +01:00
parent b672717096 7b0237b081
commit d9593c4b81
102 changed files with 4235 additions and 3759 deletions
@@ -18,10 +18,41 @@
 License: GPL v2.
-Last update Nov 2016.
+Last update June 2017.
 _Please do not send pull requests to the `master` branch which is reserved for releases._
 ### Description
 This library provides data parallel C++ container classes with internal memory layout
 that is transformed to map efficiently to SIMD architectures. CSHIFT facilities
 are provided, similar to HPF and cmfortran, and user control is given over the mapping of
 array indices to both MPI tasks and SIMD processing elements.
 * Identically shaped arrays can then be processed with perfect data parallelisation.
 * Such identically shaped arrays are called conformable arrays.
 The transformation is based on the observation that Cartesian array processing involves
 identical processing to be performed on different regions of the Cartesian array.
 The library will both geometrically decompose into MPI tasks and across SIMD lanes.
 Local vector loops are parallelised with OpenMP pragmas.
 Data parallel array operations can then be specified with a SINGLE data parallel paradigm, but
 optimally use MPI, OpenMP and SIMD parallelism under the hood. This is a significant simplification
 for most programmers.
 The layout transformations are parametrised by the SIMD vector length. This adapts according to the architecture.
 Presently SSE4, ARM NEON (128 bits), AVX, AVX2, QPX (256 bits), IMCI and AVX512 (512 bits) targets are supported.
 These are presented as `vRealF`, `vRealD`, `vComplexF`, and `vComplexD` internal vector data types. 
 The corresponding scalar types are named `RealF`, `RealD`, `ComplexF` and `ComplexD`.
 MPI, OpenMP, and SIMD parallelism are present in the library.
 Please see [this paper](https://arxiv.org/abs/1512.03487) for more detail.
 ### Compilers
 Intel ICPC v16.0.3 and later
@@ -56,35 +87,25 @@ When you file an issue, please go though the following checklist:
 6. Attach the output of `make V=1`.
 7. Describe the issue and any previous attempt to solve it. If relevant, show how to reproduce the issue using a minimal working example.
 ### Required libraries
 Grid requires:
 [GMP](https://gmplib.org/), 
-### Description
+[MPFR](http://www.mpfr.org/) 
 This library provides data parallel C++ container classes with internal memory layout
 that is transformed to map efficiently to SIMD architectures. CSHIFT facilities
 are provided, similar to HPF and cmfortran, and user control is given over the mapping of
 array indices to both MPI tasks and SIMD processing elements.
-* Identically shaped arrays then be processed with perfect data parallelisation.
+Bootstrapping Grid downloads the Eigen library, which is used for internal dense matrix (non-QCD) operations.
 * Such identically shaped arrays are called conformable arrays.
-The transformation is based on the observation that Cartesian array processing involves
+Grid optionally uses:
 identical processing to be performed on different regions of the Cartesian array.
-The library will both geometrically decompose into MPI tasks and across SIMD lanes.
+[HDF5](https://support.hdfgroup.org/HDF5/)  
 Local vector loops are parallelised with OpenMP pragmas.
-Data parallel array operations can then be specified with a SINGLE data parallel paradigm, but
+[LIME](http://usqcd-software.github.io/c-lime/) for ILDG and SciDAC file format support. 
 optimally use MPI, OpenMP and SIMD parallelism under the hood. This is a significant simplification
 for most programmers.
-The layout transformations are parametrised by the SIMD vector length. This adapts according to the architecture.
+[FFTW](http://www.fftw.org) either generic version or via the Intel MKL library.
 Presently SSE4 (128 bit) AVX, AVX2, QPX (256 bit), IMCI, and AVX512 (512 bit) targets are supported (ARM NEON on the way).
-These are presented as `vRealF`, `vRealD`, `vComplexF`, and `vComplexD` internal vector data types. These may be useful in themselves for other programmers.
+LAPACK either generic version or Intel MKL library.
 The corresponding scalar types are named `RealF`, `RealD`, `ComplexF` and `ComplexD`.
 MPI, OpenMP, and SIMD parallelism are present in the library.
 Please see https://arxiv.org/abs/1512.03487 for more detail.
 ### Quick start
 First, start by cloning the repository:
@@ -155,7 +176,6 @@ The following options can be use with the `--enable-comms=` option to target dif
 | `none`         | no communications                                             |
 | `mpi[-auto]`   | MPI communications                                            |
 | `mpi3[-auto]`  | MPI communications using MPI 3 shared memory                  |
 | `mpi3l[-auto]` | MPI communications using MPI 3 shared memory and leader model |
 | `shmem `       | Cray SHMEM communications                                     |
 For the MPI interfaces the optional `-auto` suffix instructs the `configure` script to determine all the necessary compilation and linking flags. This is done by extracting the information from the MPI wrapper specified in the environment variable `MPICXX` (if not specified, `configure` will scan through a list of default names). The `-auto` suffix is not supported by the Cray environment wrapper scripts. Use the standard versions instead.
@@ -173,7 +193,8 @@ The following options can be use with the `--enable-simd=` option to target diff
 | `AVXFMA4`   | AVX (256 bit) + FMA4                   |
 | `AVX2`      | AVX 2 (256 bit)                        |
 | `AVX512`    | AVX 512 bit                            |
-| `QPX`       | QPX (256 bit)                          |
+| `NEONv8`    | [ARM NEON](http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.den0024a/ch07s03.html) (128 bit)                     |
 | `QPX`       | IBM QPX (256 bit)                      |
 Alternatively, some CPU codenames can be directly used:
@@ -195,21 +216,136 @@ The following configuration is recommended for the Intel Knights Landing platfor
 ``` bash
 ../configure --enable-precision=double\
             --enable-simd=KNL        \
-             --enable-comms=mpi-auto \
+             --enable-comms=mpi-auto  \
             --with-gmp=<path>        \
             --with-mpfr=<path>       \
             --enable-mkl             \
             CXX=icpc MPICXX=mpiicpc
 ```
 The MKL flag enables use of BLAS and FFTW from the Intel Math Kernels Library.
-where `<path>` is the UNIX prefix where GMP and MPFR are installed. If you are working on a Cray machine that does not use the `mpiicpc` wrapper, please use:
+If you are working on a Cray machine that does not use the `mpiicpc` wrapper, please use:
 ``` bash
 ../configure --enable-precision=double\
             --enable-simd=KNL        \
             --enable-comms=mpi       \
             --with-gmp=<path>        \
             --with-mpfr=<path>       \
             --enable-mkl             \
             CXX=CC CC=cc
-```
+```
 If gmp and mpfr are NOT in standard places (/usr/) these flags may be needed:
 ``` bash
               --with-gmp=<path>        \
               --with-mpfr=<path>       \
 ```
 where `<path>` is the UNIX prefix where GMP and MPFR are installed. 
 Knights Landing nodes with two Intel Omni-Path adapters per node
 presently perform better with more than one rank per node, using shared memory
 for interior communication. This is the mpi3 communications implementation.
 We recommend four ranks per node for best performance, but the optimum depends on the local volume.
 ``` bash
 ../configure --enable-precision=double\
             --enable-simd=KNL        \
             --enable-comms=mpi3-auto \
             --enable-mkl             \
             CC=icpc MPICXX=mpiicpc 
 ```
 ### Build setup for Intel Haswell Xeon platform
 The following configuration is recommended for the Intel Haswell platform:
 ``` bash
 ../configure --enable-precision=double\
             --enable-simd=AVX2       \
             --enable-comms=mpi3-auto \
             --enable-mkl             \
             CXX=icpc MPICXX=mpiicpc
 ```
 The MKL flag enables use of BLAS and FFTW from the Intel Math Kernels Library.
 If gmp and mpfr are NOT in standard places (/usr/) these flags may be needed:
 ``` bash
               --with-gmp=<path>        \
               --with-mpfr=<path>       \
 ```
 where `<path>` is the UNIX prefix where GMP and MPFR are installed. 
 If you are working on a Cray machine that does not use the `mpiicpc` wrapper, please use:
 ``` bash
 ../configure --enable-precision=double\
             --enable-simd=AVX2       \
             --enable-comms=mpi3      \
             --enable-mkl             \
             CXX=CC CC=cc
 ```
 Since Dual socket nodes are commonplace, we recommend MPI-3 as the default with the use of 
 one rank per socket. If using the Intel MPI library, threads should be pinned to NUMA domains using
 ```
        export I_MPI_PIN=1
 ```
 This is the default.
 ### Build setup for Intel Skylake Xeon platform
 The following configuration is recommended for the Intel Skylake platform:
 ``` bash
 ../configure --enable-precision=double\
             --enable-simd=AVX512     \
             --enable-comms=mpi3      \
             --enable-mkl             \
             CXX=mpiicpc
 ```
 The MKL flag enables use of BLAS and FFTW from the Intel Math Kernels Library.
 If gmp and mpfr are NOT in standard places (/usr/) these flags may be needed:
 ``` bash
               --with-gmp=<path>        \
               --with-mpfr=<path>       \
 ```
 where `<path>` is the UNIX prefix where GMP and MPFR are installed. 
 If you are working on a Cray machine that does not use the `mpiicpc` wrapper, please use:
 ``` bash
 ../configure --enable-precision=double\
             --enable-simd=AVX512     \
             --enable-comms=mpi3      \
             --enable-mkl             \
             CXX=CC CC=cc
 ```
 Since Dual socket nodes are commonplace, we recommend MPI-3 as the default with the use of 
 one rank per socket. If using the Intel MPI library, threads should be pinned to NUMA domains using
 ``` 
        export I_MPI_PIN=1
 ```
 This is the default. 
 ### Build setup for BlueGene/Q
 To be written...
 ### Build setup for ARM Neon
 To be written...
 ### Build setup for laptops, other compilers, non-cluster builds
 Many versions of g++ and clang++ work with Grid; using them merely involves replacing CXX (and MPICXX)
 and omitting the enable-mkl flag.
 Single node builds are enabled with 
 ```
            --enable-comms=none
 ```
 FFTW support that is not in the default search path may then be enabled with
 ```
    --with-fftw=<installpath>
 ```
 BLAS will not be compiled in by default, and Lanczos will default to Eigen diagonalisation.
@@ -1,24 +1,30 @@
 TODO:
 ---------------
-Peter's work list:
+Large item work list:
-1)- Precision conversion and sort out localConvert      <-- 
+1)- MultiRHS with spread out extra dim -- Go through filesystem with SciDAC I/O
 2)- Remove DenseVector, DenseMatrix; Use Eigen instead. <-- 
-- Profile CG, BlockCG, etc... Flop count/rate -- PARTIAL, time but no flop/s yet
+2)- Christoph's local basis expansion Lanczos
-- Physical propagator interface
+3)- BG/Q port and check
-- Conserved currents
+4)- Precision conversion and sort out localConvert      <-- partial
-- GaugeFix into central location
+  - Consistent linear solver flop count/rate -- PARTIAL, time but no flop/s yet
-- Multigrid Wilson and DWF, compare to other Multigrid implementations
+5)- Physical propagator interface
-- HDCR resume
+6)- Conserved currents
 7)- Multigrid Wilson and DWF, compare to other Multigrid implementations
 8)- HDCR resume
 Recent DONE 
 -- Lanczos Remove DenseVector, DenseMatrix; Use Eigen instead. <-- DONE
 -- GaugeFix into central location                      <-- DONE
 -- Scidac and Ildg metadata handling                   <-- DONE
 -- Binary I/O MPI2 IO                                  <-- DONE
 -- Binary I/O speed up & x-strips                      <-- DONE
 -- Cut down the exterior overhead                      <-- DONE
 -- Interior legs from SHM comms                        <-- DONE
 -- Half-precision comms                                <-- DONE
-- Merge high precision reduction into develop        
+-- Merge high precision reduction into develop         <-- DONE
-- multiRHS DWF; benchmark on Cori/BNL for comms elimination
+-- BlockCG, BCGrQ                                      <-- DONE
 -- multiRHS DWF; benchmark on Cori/BNL for comms elimination <-- DONE
   -- slice* linalg routines for multiRHS, BlockCG    
 -----
@@ -165,7 +165,7 @@ int main (int argc, char ** argv)
  std::cout << GridLogMessage<< "*****************************************************************" <<std::endl;
  DomainWallFermionR Dw(Umu,*FGrid,*FrbGrid,*UGrid,*UrbGrid,mass,M5);
-  int ncall =1000;
+  int ncall =500;
  if (1) {
    FGrid->Barrier();
    Dw.ZeroCounters();
@@ -302,6 +302,7 @@ int main (int argc, char ** argv)
      std::cout<< "sD ERR   \n " << err  <<std::endl;
    }
    assert(sum < 1.0e-4);
    if(1){
      std::cout << GridLogMessage<< "*********************************************************" <<std::endl;
@@ -381,8 +382,23 @@ int main (int argc, char ** argv)
      }
      assert(error<1.0e-4);
    }
  if(0){
    std::cout << "Single cache warm call to sDw.Dhop " <<std::endl;
    for(int i=0;i< PerformanceCounter::NumTypes(); i++ ){
      sDw.Dhop(ssrc,sresult,0);
      PerformanceCounter Counter(i);
      Counter.Start();
      sDw.Dhop(ssrc,sresult,0);
      Counter.Stop();
      Counter.Report();
    }
  }
  }
  if (1)
  { // Naive wilson dag implementation
    ref = zero;
@@ -55,9 +55,9 @@ int main (int argc, char ** argv)
  std::cout<<GridLogMessage << "===================================================================================================="<<std::endl;
  std::cout<<GridLogMessage << "  L  "<<"\t\t"<<"bytes"<<"\t\t\t"<<"GB/s"<<"\t\t"<<"Gflop/s"<<"\t\t seconds"<<std::endl;
  std::cout<<GridLogMessage << "----------------------------------------------------------"<<std::endl;
-  uint64_t lmax=64;
+  uint64_t lmax=96;
-#define NLOOP (100*lmax*lmax*lmax*lmax/vol)
+#define NLOOP (10*lmax*lmax*lmax*lmax/vol)
-  for(int lat=4;lat<=lmax;lat+=4){
+  for(int lat=8;lat<=lmax;lat+=8){
      std::vector<int> latt_size  ({lat*mpi_layout[0],lat*mpi_layout[1],lat*mpi_layout[2],lat*mpi_layout[3]});
      int vol = latt_size[0]*latt_size[1]*latt_size[2]*latt_size[3];
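The `NLOOP` macro in the hunk above scales the iteration count inversely with the lattice volume, so each lattice size does roughly the same total work. It silently depends on `lmax` and `vol` being in scope where it is expanded. A minimal sketch of the same arithmetic as a function, assuming a hypothetical name `nloop` that is not part of the benchmark:

```cpp
#include <cstdint>

// Hypothetical replacement for the NLOOP macro (illustration only):
// both inputs are explicit parameters and the arithmetic is 64-bit throughout.
inline std::uint64_t nloop(std::uint64_t lmax, std::uint64_t vol)
{
    // Same expression as the macro: 10 * lmax^4 / vol (integer division).
    return 10 * lmax * lmax * lmax * lmax / vol;
}
```

With `lmax = 96`, the largest single-rank lattice (`vol = 96^4`) gets 10 iterations, and smaller lattices get proportionally more.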
@@ -65,11 +65,11 @@ int main (int argc, char ** argv)
      uint64_t Nloop=NLOOP;
-      //      GridParallelRNG          pRNG(&Grid);      pRNG.SeedFixedIntegers(std::vector<int>({45,12,81,9});
+      GridParallelRNG          pRNG(&Grid);      pRNG.SeedFixedIntegers(std::vector<int>({45,12,81,9}));
-      LatticeVec z(&Grid); //random(pRNG,z);
+      LatticeVec z(&Grid); random(pRNG,z);
-      LatticeVec x(&Grid); //random(pRNG,x);
+      LatticeVec x(&Grid); random(pRNG,x);
-      LatticeVec y(&Grid); //random(pRNG,y);
+      LatticeVec y(&Grid); random(pRNG,y);
      double a=2.0;
@@ -94,17 +94,17 @@ int main (int argc, char ** argv)
  std::cout<<GridLogMessage << "  L  "<<"\t\t"<<"bytes"<<"\t\t\t"<<"GB/s"<<"\t\t"<<"Gflop/s"<<"\t\t seconds"<<std::endl;
  std::cout<<GridLogMessage << "----------------------------------------------------------"<<std::endl;
-  for(int lat=4;lat<=lmax;lat+=4){
+  for(int lat=8;lat<=lmax;lat+=8){
      std::vector<int> latt_size  ({lat*mpi_layout[0],lat*mpi_layout[1],lat*mpi_layout[2],lat*mpi_layout[3]});
      int vol = latt_size[0]*latt_size[1]*latt_size[2]*latt_size[3];
      GridCartesian     Grid(latt_size,simd_layout,mpi_layout);
-      //      GridParallelRNG          pRNG(&Grid);      pRNG.SeedFixedIntegers(std::vector<int>({45,12,81,9});
+      GridParallelRNG          pRNG(&Grid);      pRNG.SeedFixedIntegers(std::vector<int>({45,12,81,9}));
-      LatticeVec z(&Grid); //random(pRNG,z);
+      LatticeVec z(&Grid); random(pRNG,z);
-      LatticeVec x(&Grid); //random(pRNG,x);
+      LatticeVec x(&Grid); random(pRNG,x);
-      LatticeVec y(&Grid); //random(pRNG,y);
+      LatticeVec y(&Grid); random(pRNG,y);
      double a=2.0;
      uint64_t Nloop=NLOOP;
@@ -129,7 +129,7 @@ int main (int argc, char ** argv)
  std::cout<<GridLogMessage << "===================================================================================================="<<std::endl;
  std::cout<<GridLogMessage << "  L  "<<"\t\t"<<"bytes"<<"\t\t\t"<<"GB/s"<<"\t\t"<<"Gflop/s"<<"\t\t seconds"<<std::endl;
-  for(int lat=4;lat<=lmax;lat+=4){
+  for(int lat=8;lat<=lmax;lat+=8){
      std::vector<int> latt_size  ({lat*mpi_layout[0],lat*mpi_layout[1],lat*mpi_layout[2],lat*mpi_layout[3]});
@@ -138,11 +138,11 @@ int main (int argc, char ** argv)
      GridCartesian     Grid(latt_size,simd_layout,mpi_layout);
-      //      GridParallelRNG          pRNG(&Grid);      pRNG.SeedFixedIntegers(std::vector<int>({45,12,81,9});
+      GridParallelRNG          pRNG(&Grid);      pRNG.SeedFixedIntegers(std::vector<int>({45,12,81,9}));
-      LatticeVec z(&Grid); //random(pRNG,z);
+      LatticeVec z(&Grid); random(pRNG,z);
-      LatticeVec x(&Grid); //random(pRNG,x);
+      LatticeVec x(&Grid); random(pRNG,x);
-      LatticeVec y(&Grid); //random(pRNG,y);
+      LatticeVec y(&Grid); random(pRNG,y);
      RealD a=2.0;
@@ -166,17 +166,17 @@ int main (int argc, char ** argv)
  std::cout<<GridLogMessage << "  L  "<<"\t\t"<<"bytes"<<"\t\t\t"<<"GB/s"<<"\t\t"<<"Gflop/s"<<"\t\t seconds"<<std::endl;
  std::cout<<GridLogMessage << "----------------------------------------------------------"<<std::endl;
-  for(int lat=4;lat<=lmax;lat+=4){
+  for(int lat=8;lat<=lmax;lat+=8){
      std::vector<int> latt_size  ({lat*mpi_layout[0],lat*mpi_layout[1],lat*mpi_layout[2],lat*mpi_layout[3]});
      int vol = latt_size[0]*latt_size[1]*latt_size[2]*latt_size[3];
      uint64_t Nloop=NLOOP;
      GridCartesian     Grid(latt_size,simd_layout,mpi_layout);
-      //      GridParallelRNG          pRNG(&Grid);      pRNG.SeedFixedIntegers(std::vector<int>({45,12,81,9});
+      GridParallelRNG          pRNG(&Grid);      pRNG.SeedFixedIntegers(std::vector<int>({45,12,81,9}));
-      LatticeVec z(&Grid); //random(pRNG,z);
+      LatticeVec z(&Grid); random(pRNG,z);
-      LatticeVec x(&Grid); //random(pRNG,x);
+      LatticeVec x(&Grid); random(pRNG,x);
-      LatticeVec y(&Grid); //random(pRNG,y);
+      LatticeVec y(&Grid); random(pRNG,y);
      RealD a=2.0;
      Real nn;      
      double start=usecond();
@@ -37,12 +37,12 @@ int main (int argc, char ** argv)
  Grid_init(&argc,&argv);
 #define LMAX (64)
-  int Nloop=20;
+  int64_t Nloop=20;
  std::vector<int> simd_layout = GridDefaultSimd(Nd,vComplex::Nsimd());
  std::vector<int> mpi_layout  = GridDefaultMpi();
-  int threads = GridThread::GetThreads();
+  int64_t threads = GridThread::GetThreads();
  std::cout<<GridLogMessage << "Grid is setup to use "<<threads<<" threads"<<std::endl;
  std::cout<<GridLogMessage << "===================================================================================================="<<std::endl;
@@ -54,16 +54,16 @@ int main (int argc, char ** argv)
  for(int lat=2;lat<=LMAX;lat+=2){
      std::vector<int> latt_size  ({lat*mpi_layout[0],lat*mpi_layout[1],lat*mpi_layout[2],lat*mpi_layout[3]});
-      int vol = latt_size[0]*latt_size[1]*latt_size[2]*latt_size[3];
+      int64_t vol = latt_size[0]*latt_size[1]*latt_size[2]*latt_size[3];
      GridCartesian     Grid(latt_size,simd_layout,mpi_layout);
-      //      GridParallelRNG          pRNG(&Grid);      pRNG.SeedFixedIntegers(std::vector<int>({45,12,81,9});
+      GridParallelRNG          pRNG(&Grid);      pRNG.SeedFixedIntegers(std::vector<int>({45,12,81,9}));
-      LatticeColourMatrix z(&Grid);// random(pRNG,z);
+      LatticeColourMatrix z(&Grid); random(pRNG,z);
-      LatticeColourMatrix x(&Grid);// random(pRNG,x);
+      LatticeColourMatrix x(&Grid); random(pRNG,x);
-      LatticeColourMatrix y(&Grid);// random(pRNG,y);
+      LatticeColourMatrix y(&Grid); random(pRNG,y);
      double start=usecond();
-      for(int i=0;i<Nloop;i++){
+      for(int64_t i=0;i<Nloop;i++){
 	x=x*y;
      }
      double stop=usecond();
@@ -86,17 +86,17 @@ int main (int argc, char ** argv)
  for(int lat=2;lat<=LMAX;lat+=2){
      std::vector<int> latt_size  ({lat*mpi_layout[0],lat*mpi_layout[1],lat*mpi_layout[2],lat*mpi_layout[3]});
-      int vol = latt_size[0]*latt_size[1]*latt_size[2]*latt_size[3];
+      int64_t vol = latt_size[0]*latt_size[1]*latt_size[2]*latt_size[3];
      GridCartesian     Grid(latt_size,simd_layout,mpi_layout);
-      //      GridParallelRNG          pRNG(&Grid);      pRNG.SeedFixedIntegers(std::vector<int>({45,12,81,9});
+      GridParallelRNG          pRNG(&Grid);      pRNG.SeedFixedIntegers(std::vector<int>({45,12,81,9}));
-      LatticeColourMatrix z(&Grid); //random(pRNG,z);
+      LatticeColourMatrix z(&Grid); random(pRNG,z);
-      LatticeColourMatrix x(&Grid); //random(pRNG,x);
+      LatticeColourMatrix x(&Grid); random(pRNG,x);
-      LatticeColourMatrix y(&Grid); //random(pRNG,y);
+      LatticeColourMatrix y(&Grid); random(pRNG,y);
      double start=usecond();
-      for(int i=0;i<Nloop;i++){
+      for(int64_t i=0;i<Nloop;i++){
 	z=x*y;
      }
      double stop=usecond();
@@ -117,17 +117,17 @@ int main (int argc, char ** argv)
  for(int lat=2;lat<=LMAX;lat+=2){
      std::vector<int> latt_size  ({lat*mpi_layout[0],lat*mpi_layout[1],lat*mpi_layout[2],lat*mpi_layout[3]});
-      int vol = latt_size[0]*latt_size[1]*latt_size[2]*latt_size[3];
+      int64_t vol = latt_size[0]*latt_size[1]*latt_size[2]*latt_size[3];
      GridCartesian     Grid(latt_size,simd_layout,mpi_layout);
-      //      GridParallelRNG          pRNG(&Grid);      pRNG.SeedFixedIntegers(std::vector<int>({45,12,81,9});
+      GridParallelRNG          pRNG(&Grid);      pRNG.SeedFixedIntegers(std::vector<int>({45,12,81,9}));
-      LatticeColourMatrix z(&Grid); //random(pRNG,z);
+      LatticeColourMatrix z(&Grid); random(pRNG,z);
-      LatticeColourMatrix x(&Grid); //random(pRNG,x);
+      LatticeColourMatrix x(&Grid); random(pRNG,x);
-      LatticeColourMatrix y(&Grid); //random(pRNG,y);
+      LatticeColourMatrix y(&Grid); random(pRNG,y);
      double start=usecond();
-      for(int i=0;i<Nloop;i++){
+      for(int64_t i=0;i<Nloop;i++){
 	mult(z,x,y);
      }
      double stop=usecond();
@@ -148,17 +148,17 @@ int main (int argc, char ** argv)
  for(int lat=2;lat<=LMAX;lat+=2){
      std::vector<int> latt_size  ({lat*mpi_layout[0],lat*mpi_layout[1],lat*mpi_layout[2],lat*mpi_layout[3]});
-      int vol = latt_size[0]*latt_size[1]*latt_size[2]*latt_size[3];
+      int64_t vol = latt_size[0]*latt_size[1]*latt_size[2]*latt_size[3];
      GridCartesian     Grid(latt_size,simd_layout,mpi_layout);
-      //      GridParallelRNG          pRNG(&Grid);      pRNG.SeedFixedIntegers(std::vector<int>({45,12,81,9});
+      GridParallelRNG          pRNG(&Grid);      pRNG.SeedFixedIntegers(std::vector<int>({45,12,81,9}));
-      LatticeColourMatrix z(&Grid); //random(pRNG,z);
+      LatticeColourMatrix z(&Grid); random(pRNG,z);
-      LatticeColourMatrix x(&Grid); //random(pRNG,x);
+      LatticeColourMatrix x(&Grid); random(pRNG,x);
-      LatticeColourMatrix y(&Grid); //random(pRNG,y);
+      LatticeColourMatrix y(&Grid); random(pRNG,y);
      double start=usecond();
-      for(int i=0;i<Nloop;i++){
+      for(int64_t i=0;i<Nloop;i++){
 	mac(z,x,y);
      }
      double stop=usecond();
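A note on the `int64_t vol` changes above: `latt_size` holds `int` values, so the product on the right-hand side is still evaluated in `int` and only the result is widened. A minimal sketch of accumulating in 64 bits instead, assuming a hypothetical helper `volume64` that is not part of Grid:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of Grid): compute the lattice volume in
// 64-bit arithmetic so the product cannot overflow a 32-bit int.
inline std::int64_t volume64(const std::vector<int> &latt_size)
{
    std::int64_t vol = 1;
    for (int extent : latt_size)
    {
        // Promote each extent before multiplying.
        vol *= static_cast<std::int64_t>(extent);
    }
    return vol;
}
```

For the lattice sizes in this benchmark on a single rank the product fits in an `int`; the distinction only matters once `lat*mpi_layout[mu]` grows large.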
@@ -27,7 +27,7 @@ AX_GXX_VERSION
 AC_DEFINE_UNQUOTED([GXX_VERSION],["$GXX_VERSION"],
      [version of g++ that will compile the code])
-CXXFLAGS="-g $CXXFLAGS"
+CXXFLAGS="-O3 $CXXFLAGS"
 ############### Checks for typedefs, structures, and compiler characteristics
@@ -241,6 +241,7 @@ case ${ax_cv_cxx_compiler_vendor} in
        SIMD_FLAGS='';;
      KNL)
        AC_DEFINE([AVX512],[1],[AVX512 intrinsics])
        AC_DEFINE([KNL],[1],[Knights landing processor])
        SIMD_FLAGS='-march=knl';;
      GEN)
        AC_DEFINE([GEN],[1],[generic vector code])
@@ -248,6 +249,9 @@ case ${ax_cv_cxx_compiler_vendor} in
                           [generic SIMD vector width (in bytes)])
        SIMD_GEN_WIDTH_MSG=" (width= $ac_gen_simd_width)"
        SIMD_FLAGS='';;
      NEONv8)
        AC_DEFINE([NEONV8],[1],[ARMv8 NEON])
        SIMD_FLAGS='-march=armv8-a';;
      QPX|BGQ)
        AC_DEFINE([QPX],[1],[QPX intrinsics for BG/Q])
        SIMD_FLAGS='';;
@@ -276,6 +280,7 @@ case ${ax_cv_cxx_compiler_vendor} in
        SIMD_FLAGS='';;
      KNL)
        AC_DEFINE([AVX512],[1],[AVX512 intrinsics for Knights Landing])
        AC_DEFINE([KNL],[1],[Knights landing processor])
        SIMD_FLAGS='-xmic-avx512';;
      GEN)
        AC_DEFINE([GEN],[1],[generic vector code])
@@ -41,9 +41,10 @@ using namespace Hadrons;
 // constructor /////////////////////////////////////////////////////////////////
 Environment::Environment(void)
 {
-    nd_ = GridDefaultLatt().size();
+    dim_ = GridDefaultLatt();
    nd_  = dim_.size();
    grid4d_.reset(SpaceTimeGrid::makeFourDimGrid(
-        GridDefaultLatt(), GridDefaultSimd(nd_, vComplex::Nsimd()),
+        dim_, GridDefaultSimd(nd_, vComplex::Nsimd()),
        GridDefaultMpi()));
    gridRb4d_.reset(SpaceTimeGrid::makeFourDimRedBlackGrid(grid4d_.get()));
    auto loc = getGrid()->LocalDimensions();
@@ -132,6 +133,16 @@ unsigned int Environment::getNd(void) const
    return nd_;
 }
 std::vector<int> Environment::getDim(void) const
 {
    return dim_;
 }
 int Environment::getDim(const unsigned int mu) const
 {
    return dim_[mu];
 }
 // random number generator /////////////////////////////////////////////////////
 void Environment::setSeed(const std::vector<int> &seed)
 {
@@ -271,6 +282,21 @@ std::string Environment::getModuleType(const std::string name) const
    return getModuleType(getModuleAddress(name));
 }
 std::string Environment::getModuleNamespace(const unsigned int address) const
 {
    std::string type = getModuleType(address), ns;
    auto pos2 = type.rfind("::");
    auto pos1 = type.rfind("::", pos2 - 2);
    return type.substr(pos1 + 2, pos2 - pos1 - 2);
 }
 std::string Environment::getModuleNamespace(const std::string name) const
 {
    return getModuleNamespace(getModuleAddress(name));
 }
 bool Environment::hasModule(const unsigned int address) const
 {
    return (address < module_.size());
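`getModuleNamespace` above returns the component between the last two `::` separators of the module type name. As written it appears to assume the name contains at least two separators; with fewer, `rfind` returns `npos` and the index arithmetic wraps around. A minimal standalone sketch of the same parsing with explicit guards, assuming a hypothetical free function `enclosingNamespace` that is not part of Hadrons:

```cpp
#include <string>

// Hypothetical standalone version of the parsing idea (illustration only).
// Returns the name component immediately before the last "::", or an empty
// string when the type name has no "::" at all.
inline std::string enclosingNamespace(const std::string &type)
{
    const auto pos2 = type.rfind("::");
    if (pos2 == std::string::npos)
    {
        return ""; // unqualified name
    }
    if (pos2 < 2)
    {
        return type.substr(0, pos2); // no room for an earlier "::"
    }
    const auto pos1 = type.rfind("::", pos2 - 2);
    if (pos1 == std::string::npos)
    {
        return type.substr(0, pos2); // only one separator
    }
    return type.substr(pos1 + 2, pos2 - pos1 - 2);
}
```

The type name `Grid::Hadrons::MSource::Point` used in the checks is illustrative and not taken from the diff.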
@@ -492,7 +518,14 @@ std::string Environment::getObjectType(const unsigned int address) const
 {
    if (hasRegisteredObject(address))
    {
-        return typeName(object_[address].type);
+        if (object_[address].type)
        {
            return typeName(object_[address].type);
        }
        else
        {
            return "<no type>";
        }
    }
    else if (hasObject(address))
    {
@@ -532,6 +565,23 @@ Environment::Size Environment::getObjectSize(const std::string name) const
    return getObjectSize(getObjectAddress(name));
 }
 unsigned int Environment::getObjectModule(const unsigned int address) const
 {
    if (hasObject(address))
    {
        return object_[address].module;
    }
    else
    {
        HADRON_ERROR("no object with address " + std::to_string(address));
    }
 }
 unsigned int Environment::getObjectModule(const std::string name) const
 {
    return getObjectModule(getObjectAddress(name));
 }
 unsigned int Environment::getObjectLs(const unsigned int address) const
 {
    if (hasRegisteredObject(address))
@@ -106,6 +106,8 @@ public:
    void                    createGrid(const unsigned int Ls);
    GridCartesian *         getGrid(const unsigned int Ls = 1) const;
    GridRedBlackCartesian * getRbGrid(const unsigned int Ls = 1) const;
    std::vector<int>        getDim(void) const;
    int                     getDim(const unsigned int mu) const;
    unsigned int            getNd(void) const;
    // random number generator
    void                    setSeed(const std::vector<int> &seed);
@@ -131,6 +133,8 @@ public:
    std::string             getModuleName(const unsigned int address) const;
    std::string             getModuleType(const unsigned int address) const;
    std::string             getModuleType(const std::string name) const;
    std::string             getModuleNamespace(const unsigned int address) const;
    std::string             getModuleNamespace(const std::string name) const;
    bool                    hasModule(const unsigned int address) const;
    bool                    hasModule(const std::string name) const;
    Graph<unsigned int>     makeModuleGraph(void) const;
@@ -171,6 +175,8 @@ public:
    std::string             getObjectType(const std::string name) const;
    Size                    getObjectSize(const unsigned int address) const;
    Size                    getObjectSize(const std::string name) const;
    unsigned int            getObjectModule(const unsigned int address) const;
    unsigned int            getObjectModule(const std::string name) const;
    unsigned int            getObjectLs(const unsigned int address) const;
    unsigned int            getObjectLs(const std::string name) const;
    bool                    hasObject(const unsigned int address) const;
@@ -181,6 +187,10 @@ public:
    bool                    hasCreatedObject(const std::string name) const;
    bool                    isObject5d(const unsigned int address) const;
    bool                    isObject5d(const std::string name) const;
    template <typename T>
    bool                    isObjectOfType(const unsigned int address) const;
    template <typename T>
    bool                    isObjectOfType(const std::string name) const;
    Environment::Size       getTotalSize(void) const;
    void                    addOwnership(const unsigned int owner,
                                         const unsigned int property);
@@ -197,6 +207,7 @@ private:
    bool                                   dryRun_{false};
    unsigned int                           traj_, locVol_;
    // grids
    std::vector<int>                       dim_;
    GridPt                                 grid4d_;
    std::map<unsigned int, GridPt>         grid5d_;
    GridRbPt                               gridRb4d_;
@@ -343,7 +354,7 @@ T * Environment::getObject(const unsigned int address) const
        else
        {
            HADRON_ERROR("object with address " + std::to_string(address) +
-                         " does not have type '" + typeid(T).name() +
+                         " does not have type '" + typeName(&typeid(T)) +
                         "' (has type '" + getObjectType(address) + "')");
        }
    }
@@ -380,6 +391,37 @@ T * Environment::createLattice(const std::string name)
    return createLattice<T>(getObjectAddress(name));
 }
 template <typename T>
 bool Environment::isObjectOfType(const unsigned int address) const
 {
    if (hasRegisteredObject(address))
    {
        if (auto h = dynamic_cast<Holder<T> *>(object_[address].data.get()))
        {
            return true;
        }
        else
        {
            return false;
        }
    }
    else if (hasObject(address))
    {
        HADRON_ERROR("object with address " + std::to_string(address) +
                     " exists but is not registered");
    }
    else
    {
        HADRON_ERROR("no object with address " + std::to_string(address));
    }
 }
 template <typename T>
 bool Environment::isObjectOfType(const std::string name) const
 {
    return isObjectOfType<T>(getObjectAddress(name));
 }
 END_HADRONS_NAMESPACE
 #endif // Hadrons_Environment_hpp_
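The new `isObjectOfType<T>` relies on `dynamic_cast` to a `Holder<T>` pointer returning null when the stored object holds a different type. A minimal sketch of that check in isolation; `Object`, `Holder` and `isOfType` here are simplified stand-ins written for illustration, and the real Hadrons classes may differ:

```cpp
#include <memory>
#include <string>

// Simplified stand-in for a polymorphic storage base class.
class Object
{
public:
    virtual ~Object() = default; // a virtual member is required for dynamic_cast
};

// Simplified stand-in for a typed holder that owns a T.
template <typename T>
class Holder : public Object
{
public:
    explicit Holder(T *pt) : data_(pt) {}
    T &get() const { return *data_; }
private:
    std::unique_ptr<T> data_;
};

// True when obj points to a Holder<T>; false for null or any other type.
template <typename T>
bool isOfType(const Object *obj)
{
    return dynamic_cast<const Holder<T> *>(obj) != nullptr;
}
```

Returning the comparison directly also avoids the unused local variable `h` in the `if (auto h = ...)` form.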
@@ -51,23 +51,43 @@ using Grid::operator<<;
 * error with GCC 5 (clang & GCC 6 compile fine without it).
 */
 // FIXME: find a way to do that in a more general fashion
 #ifndef FIMPL
 #define FIMPL WilsonImplR
 #endif
 #ifndef SIMPL
 #define SIMPL ScalarImplCR
 #endif
 BEGIN_HADRONS_NAMESPACE
 // type aliases
-#define TYPE_ALIASES(FImpl, suffix)\
+#define FERM_TYPE_ALIASES(FImpl, suffix)\
 typedef FermionOperator<FImpl>                       FMat##suffix;             \
 typedef typename FImpl::FermionField                 FermionField##suffix;     \
 typedef typename FImpl::PropagatorField              PropagatorField##suffix;  \
 typedef typename FImpl::SitePropagator               SitePropagator##suffix;   \
-typedef typename FImpl::DoubledGaugeField            DoubledGaugeField##suffix;\
+typedef std::vector<typename FImpl::SitePropagator::scalar_object>             \
-typedef std::function<void(FermionField##suffix &,                             \
+                                                     SlicedPropagator##suffix;
 #define GAUGE_TYPE_ALIASES(FImpl, suffix)\
 typedef typename FImpl::DoubledGaugeField DoubledGaugeField##suffix;
 #define SCALAR_TYPE_ALIASES(SImpl, suffix)\
 typedef typename SImpl::Field ScalarField##suffix;\
 typedef typename SImpl::Field PropagatorField##suffix;
 #define SOLVER_TYPE_ALIASES(FImpl, suffix)\
 typedef std::function<void(FermionField##suffix &,\
                      const FermionField##suffix &)> SolverFn##suffix;
 #define SINK_TYPE_ALIASES(suffix)\
 typedef std::function<SlicedPropagator##suffix(const PropagatorField##suffix &)> SinkFn##suffix;
 #define FGS_TYPE_ALIASES(FImpl, suffix)\
 FERM_TYPE_ALIASES(FImpl, suffix)\
 GAUGE_TYPE_ALIASES(FImpl, suffix)\
 SOLVER_TYPE_ALIASES(FImpl, suffix)
 // logger
 class HadronsLogger: public Logger
 {
@@ -1,31 +1,3 @@
 /*************************************************************************************
 Grid physics library, www.github.com/paboyle/Grid 
 Source file: extras/Hadrons/Modules.hpp
 Copyright (C) 2015
 Copyright (C) 2016
 Author: Antonin Portelli <antonin.portelli@me.com>
 This program is free software; you can redistribute it and/or modify
 it under the terms of the GNU General Public License as published by
 the Free Software Foundation; either version 2 of the License, or
 (at your option) any later version.
 This program is distributed in the hope that it will be useful,
 but WITHOUT ANY WARRANTY; without even the implied warranty of
 MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 GNU General Public License for more details.
 You should have received a copy of the GNU General Public License along
 with this program; if not, write to the Free Software Foundation, Inc.,
 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
 See the full license in the file "LICENSE" in the top level distribution directory
 *************************************************************************************/
 /*  END LEGAL */
 #include <Grid/Hadrons/Modules/MAction/DWF.hpp>
 #include <Grid/Hadrons/Modules/MAction/Wilson.hpp>
 #include <Grid/Hadrons/Modules/MContraction/Baryon.hpp>
@@ -36,13 +8,18 @@ See the full license in the file "LICENSE" in the top level distribution directo
 #include <Grid/Hadrons/Modules/MContraction/WeakHamiltonianEye.hpp>
 #include <Grid/Hadrons/Modules/MContraction/WeakHamiltonianNonEye.hpp>
 #include <Grid/Hadrons/Modules/MContraction/WeakNeutral4ptDisc.hpp>
 #include <Grid/Hadrons/Modules/MFermion/GaugeProp.hpp>
 #include <Grid/Hadrons/Modules/MGauge/Load.hpp>
 #include <Grid/Hadrons/Modules/MGauge/Random.hpp>
 #include <Grid/Hadrons/Modules/MGauge/StochEm.hpp>
 #include <Grid/Hadrons/Modules/MGauge/Unit.hpp>
 #include <Grid/Hadrons/Modules/MLoop/NoiseLoop.hpp>
 #include <Grid/Hadrons/Modules/MScalar/ChargedProp.hpp>
 #include <Grid/Hadrons/Modules/MScalar/FreeProp.hpp>
 #include <Grid/Hadrons/Modules/MScalar/Scalar.hpp>
 #include <Grid/Hadrons/Modules/MSink/Point.hpp>
 #include <Grid/Hadrons/Modules/MSolver/RBPrecCG.hpp>
 #include <Grid/Hadrons/Modules/MSource/Point.hpp>
 #include <Grid/Hadrons/Modules/MSource/SeqGamma.hpp>
 #include <Grid/Hadrons/Modules/MSource/Wall.hpp>
 #include <Grid/Hadrons/Modules/MSource/Z2.hpp>
 #include <Grid/Hadrons/Modules/Quark.hpp>
@@ -27,8 +27,8 @@ See the full license in the file "LICENSE" in the top level distribution directo
 *************************************************************************************/
 /*  END LEGAL */
-#ifndef Hadrons_DWF_hpp_
+#ifndef Hadrons_MAction_DWF_hpp_
-#define Hadrons_DWF_hpp_
+#define Hadrons_MAction_DWF_hpp_
 #include <Grid/Hadrons/Global.hpp>
 #include <Grid/Hadrons/Module.hpp>
@@ -56,7 +56,7 @@ template <typename FImpl>
 class TDWF: public Module<DWFPar>
 {
 public:
-    TYPE_ALIASES(FImpl,);
+    FGS_TYPE_ALIASES(FImpl,);
 public:
    // constructor
    TDWF(const std::string name);
@@ -137,4 +137,4 @@ END_MODULE_NAMESPACE
 END_HADRONS_NAMESPACE
-#endif // Hadrons_DWF_hpp_
+#endif // Hadrons_MAction_DWF_hpp_
@@ -27,8 +27,8 @@ See the full license in the file "LICENSE" in the top level distribution directo
 *************************************************************************************/
 /*  END LEGAL */
-#ifndef Hadrons_Wilson_hpp_
+#ifndef Hadrons_MAction_Wilson_hpp_
-#define Hadrons_Wilson_hpp_
+#define Hadrons_MAction_Wilson_hpp_
 #include <Grid/Hadrons/Global.hpp>
 #include <Grid/Hadrons/Module.hpp>
@@ -54,7 +54,7 @@ template <typename FImpl>
 class TWilson: public Module<WilsonPar>
 {
 public:
-    TYPE_ALIASES(FImpl,);
+    FGS_TYPE_ALIASES(FImpl,);
 public:
    // constructor
    TWilson(const std::string name);
@@ -27,8 +27,8 @@ See the full license in the file "LICENSE" in the top level distribution directo
 *************************************************************************************/
 /*  END LEGAL */
-#ifndef Hadrons_Baryon_hpp_
+#ifndef Hadrons_MContraction_Baryon_hpp_
-#define Hadrons_Baryon_hpp_
+#define Hadrons_MContraction_Baryon_hpp_
 #include <Grid/Hadrons/Global.hpp>
 #include <Grid/Hadrons/Module.hpp>
@@ -55,9 +55,9 @@ template <typename FImpl1, typename FImpl2, typename FImpl3>
 class TBaryon: public Module<BaryonPar>
 {
 public:
-    TYPE_ALIASES(FImpl1, 1);
+    FERM_TYPE_ALIASES(FImpl1, 1);
-    TYPE_ALIASES(FImpl2, 2);
+    FERM_TYPE_ALIASES(FImpl2, 2);
-    TYPE_ALIASES(FImpl3, 3);
+    FERM_TYPE_ALIASES(FImpl3, 3);
    class Result: Serializable
    {
    public:
@@ -121,11 +121,11 @@ void TBaryon<FImpl1, FImpl2, FImpl3>::execute(void)
    // FIXME: do contractions
-    write(writer, "meson", result);
+    // write(writer, "meson", result);
 }
 END_MODULE_NAMESPACE
 END_HADRONS_NAMESPACE
-#endif // Hadrons_Baryon_hpp_
+#endif // Hadrons_MContraction_Baryon_hpp_
@@ -26,8 +26,8 @@ See the full license in the file "LICENSE" in the top level distribution directo
 *************************************************************************************/
 /*  END LEGAL */
-#ifndef Hadrons_DiscLoop_hpp_
+#ifndef Hadrons_MContraction_DiscLoop_hpp_
-#define Hadrons_DiscLoop_hpp_
+#define Hadrons_MContraction_DiscLoop_hpp_
 #include <Grid/Hadrons/Global.hpp>
 #include <Grid/Hadrons/Module.hpp>
@@ -52,7 +52,7 @@ public:
 template <typename FImpl>
 class TDiscLoop: public Module<DiscLoopPar>
 {
-    TYPE_ALIASES(FImpl,);
+    FERM_TYPE_ALIASES(FImpl,);
    class Result: Serializable
    {
    public:
@@ -141,4 +141,4 @@ END_MODULE_NAMESPACE
 END_HADRONS_NAMESPACE
-#endif // Hadrons_DiscLoop_hpp_
+#endif // Hadrons_MContraction_DiscLoop_hpp_
@@ -26,8 +26,8 @@ See the full license in the file "LICENSE" in the top level distribution directo
 *************************************************************************************/
 /*  END LEGAL */
-#ifndef Hadrons_Gamma3pt_hpp_
+#ifndef Hadrons_MContraction_Gamma3pt_hpp_
-#define Hadrons_Gamma3pt_hpp_
+#define Hadrons_MContraction_Gamma3pt_hpp_
 #include <Grid/Hadrons/Global.hpp>
 #include <Grid/Hadrons/Module.hpp>
@@ -72,9 +72,9 @@ public:
 template <typename FImpl1, typename FImpl2, typename FImpl3>
 class TGamma3pt: public Module<Gamma3ptPar>
 {
-    TYPE_ALIASES(FImpl1, 1);
+    FERM_TYPE_ALIASES(FImpl1, 1);
-    TYPE_ALIASES(FImpl2, 2);
+    FERM_TYPE_ALIASES(FImpl2, 2);
-    TYPE_ALIASES(FImpl3, 3);
+    FERM_TYPE_ALIASES(FImpl3, 3);
    class Result: Serializable
    {
    public:
@@ -167,4 +167,4 @@ END_MODULE_NAMESPACE
 END_HADRONS_NAMESPACE
-#endif // Hadrons_Gamma3pt_hpp_
+#endif // Hadrons_MContraction_Gamma3pt_hpp_
@@ -29,8 +29,8 @@ See the full license in the file "LICENSE" in the top level distribution directo
 *************************************************************************************/
 /*  END LEGAL */
-#ifndef Hadrons_Meson_hpp_
+#ifndef Hadrons_MContraction_Meson_hpp_
-#define Hadrons_Meson_hpp_
+#define Hadrons_MContraction_Meson_hpp_
 #include <Grid/Hadrons/Global.hpp>
 #include <Grid/Hadrons/Module.hpp>
@@ -69,7 +69,7 @@ public:
                                    std::string, q1,
                                    std::string, q2,
                                    std::string, gammas,
-                                    std::string, mom,
+                                    std::string, sink,
                                    std::string, output);
 };
@@ -77,8 +77,10 @@ template <typename FImpl1, typename FImpl2>
 class TMeson: public Module<MesonPar>
 {
 public:
-    TYPE_ALIASES(FImpl1, 1);
+    FERM_TYPE_ALIASES(FImpl1, 1);
-    TYPE_ALIASES(FImpl2, 2);
+    FERM_TYPE_ALIASES(FImpl2, 2);
    FERM_TYPE_ALIASES(ScalarImplCR, Scalar);
    SINK_TYPE_ALIASES(Scalar);
    class Result: Serializable
    {
    public:
@@ -115,7 +117,7 @@ TMeson<FImpl1, FImpl2>::TMeson(const std::string name)
 template <typename FImpl1, typename FImpl2>
 std::vector<std::string> TMeson<FImpl1, FImpl2>::getInput(void)
 {
-    std::vector<std::string> input = {par().q1, par().q2};
+    std::vector<std::string> input = {par().q1, par().q2, par().sink};
    return input;
 }
@@ -154,6 +156,9 @@ void TMeson<FImpl1, FImpl2>::parseGammaString(std::vector<GammaPair> &gammaList)
 // execution ///////////////////////////////////////////////////////////////////
 #define mesonConnected(q1, q2, gSnk, gSrc) \
 (g5*(gSnk))*(q1)*(adj(gSrc)*g5)*adj(q2)
 template <typename FImpl1, typename FImpl2>
 void TMeson<FImpl1, FImpl2>::execute(void)
 {
@@ -161,43 +166,72 @@ void TMeson<FImpl1, FImpl2>::execute(void)
                 << " quarks '" << par().q1 << "' and '" << par().q2 << "'"
                 << std::endl;
-    CorrWriter              writer(par().output);
+    CorrWriter             writer(par().output);
    PropagatorField1       &q1 = *env().template getObject<PropagatorField1>(par().q1);
    PropagatorField2       &q2 = *env().template getObject<PropagatorField2>(par().q2);
    LatticeComplex         c(env().getGrid());
    Gamma                  g5(Gamma::Algebra::Gamma5);
    std::vector<GammaPair> gammaList;
    std::vector<TComplex>  buf;
    std::vector<Result>    result;
-    std::vector<Real>      p;
+    Gamma                  g5(Gamma::Algebra::Gamma5);
-
+    std::vector<GammaPair> gammaList;
-    p  = strToVec<Real>(par().mom);
+    int                    nt = env().getDim(Tp);
    LatticeComplex         ph(env().getGrid()), coor(env().getGrid());
    Complex                i(0.0,1.0);
    ph = zero;
    for(unsigned int mu = 0; mu < env().getNd(); mu++)
    {
        LatticeCoordinate(coor, mu);
        ph = ph + p[mu]*coor*((1./(env().getGrid()->_fdimensions[mu])));
    }
    ph = exp((Real)(2*M_PI)*i*ph);
    parseGammaString(gammaList);
    result.resize(gammaList.size());
    for (unsigned int i = 0; i < result.size(); ++i)
    {
        Gamma gSnk(gammaList[i].first);
        Gamma gSrc(gammaList[i].second);
        c = trace((g5*gSnk)*q1*(adj(gSrc)*g5)*adj(q2))*ph;
        sliceSum(c, buf, Tp);
        result[i].gamma_snk = gammaList[i].first;
        result[i].gamma_src = gammaList[i].second;
-        result[i].corr.resize(buf.size());
+        result[i].corr.resize(nt);
-        for (unsigned int t = 0; t < buf.size(); ++t)
+    }
    if (env().template isObjectOfType<SlicedPropagator1>(par().q1) and
        env().template isObjectOfType<SlicedPropagator2>(par().q2))
    {
        SlicedPropagator1 &q1 = *env().template getObject<SlicedPropagator1>(par().q1);
        SlicedPropagator2 &q2 = *env().template getObject<SlicedPropagator2>(par().q2);
        LOG(Message) << "(propagator already sinked)" << std::endl;
        for (unsigned int i = 0; i < result.size(); ++i)
        {
-            result[i].corr[t] = TensorRemove(buf[t]);
+            Gamma gSnk(gammaList[i].first);
            Gamma gSrc(gammaList[i].second);
            for (unsigned int t = 0; t < buf.size(); ++t)
            {
                result[i].corr[t] = TensorRemove(trace(mesonConnected(q1[t], q2[t], gSnk, gSrc)));
            }
        }
    }
    else
    {
        PropagatorField1 &q1   = *env().template getObject<PropagatorField1>(par().q1);
        PropagatorField2 &q2   = *env().template getObject<PropagatorField2>(par().q2);
        LatticeComplex   c(env().getGrid());
        LOG(Message) << "(using sink '" << par().sink << "')" << std::endl;
        for (unsigned int i = 0; i < result.size(); ++i)
        {
            Gamma       gSnk(gammaList[i].first);
            Gamma       gSrc(gammaList[i].second);
            std::string ns;
            ns = env().getModuleNamespace(env().getObjectModule(par().sink));
            if (ns == "MSource")
            {
                PropagatorField1 &sink =
                    *env().template getObject<PropagatorField1>(par().sink);
                c = trace(mesonConnected(q1, q2, gSnk, gSrc)*sink);
                sliceSum(c, buf, Tp);
            }
            else if (ns == "MSink")
            {
                SinkFnScalar &sink = *env().template getObject<SinkFnScalar>(par().sink);
                c   = trace(mesonConnected(q1, q2, gSnk, gSrc));
                buf = sink(c);
            }
            for (unsigned int t = 0; t < buf.size(); ++t)
            {
                result[i].corr[t] = TensorRemove(buf[t]);
            }
        }
    }
    write(writer, "meson", result);
@@ -207,4 +241,4 @@ END_MODULE_NAMESPACE
 END_HADRONS_NAMESPACE
-#endif // Hadrons_Meson_hpp_
+#endif // Hadrons_MContraction_Meson_hpp_
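Editor's note, not part of the commit: in the non-sliced branch of `TMeson::execute`, `sliceSum(c, buf, Tp)` reduces a lattice field to one value per timeslice. The sketch below illustrates that reduction on a plain array. The flat layout (`site = x + nx * t`) and the name `timesliceSum` are assumptions made for illustration. Grid's own `sliceSum` works on distributed, SIMD-vectorised lattices and is not reproduced here.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

using Cplx = std::complex<double>;

// Sum a field over all spatial sites of each timeslice.
// Assumed layout: site index = x + nx * t, with nx spatial sites per
// timeslice and nt timeslices.
std::vector<Cplx> timesliceSum(const std::vector<Cplx> &field,
                               std::size_t nx, std::size_t nt)
{
    std::vector<Cplx> corr(nt, Cplx(0.0, 0.0));

    for (std::size_t t = 0; t < nt; ++t)
    {
        for (std::size_t x = 0; x < nx; ++x)
        {
            corr[t] += field[x + nx * t];
        }
    }

    return corr;
}

// Small usage example: 2 spatial sites, 3 timeslices.
void example()
{
    std::vector<Cplx> field = {1.0, 2.0, 3.0, 4.0, 5.0, 6.0};
    std::vector<Cplx> corr  = timesliceSum(field, 2, 3);

    assert(corr.size() == 3);
    assert(corr[0] == Cplx(3.0, 0.0));
    assert(corr[1] == Cplx(7.0, 0.0));
    assert(corr[2] == Cplx(11.0, 0.0));
}
```

The result has one entry per timeslice, matching the `result[i].corr.resize(nt)` sizing used in the new code.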
@@ -26,8 +26,8 @@ See the full license in the file "LICENSE" in the top level distribution directo
 *************************************************************************************/
 /*  END LEGAL */
-#ifndef Hadrons_WeakHamiltonian_hpp_
+#ifndef Hadrons_MContraction_WeakHamiltonian_hpp_
-#define Hadrons_WeakHamiltonian_hpp_
+#define Hadrons_MContraction_WeakHamiltonian_hpp_
 #include <Grid/Hadrons/Global.hpp>
 #include <Grid/Hadrons/Module.hpp>
@@ -83,7 +83,7 @@ public:
 class T##modname: public Module<WeakHamiltonianPar>\
 {\
 public:\
-    TYPE_ALIASES(FIMPL,)\
+    FERM_TYPE_ALIASES(FIMPL,)\
    class Result: Serializable\
    {\
    public:\
@@ -111,4 +111,4 @@ END_MODULE_NAMESPACE
 END_HADRONS_NAMESPACE
-#endif // Hadrons_WeakHamiltonian_hpp_
+#endif // Hadrons_MContraction_WeakHamiltonian_hpp_
@@ -26,8 +26,8 @@ See the full license in the file "LICENSE" in the top level distribution directo
 *************************************************************************************/
 /*  END LEGAL */
-#ifndef Hadrons_WeakHamiltonianEye_hpp_
+#ifndef Hadrons_MContraction_WeakHamiltonianEye_hpp_
-#define Hadrons_WeakHamiltonianEye_hpp_
+#define Hadrons_MContraction_WeakHamiltonianEye_hpp_
 #include <Grid/Hadrons/Modules/MContraction/WeakHamiltonian.hpp>
@@ -55,4 +55,4 @@ END_MODULE_NAMESPACE
 END_HADRONS_NAMESPACE
-#endif // Hadrons_WeakHamiltonianEye_hpp_
+#endif // Hadrons_MContraction_WeakHamiltonianEye_hpp_
@@ -26,8 +26,8 @@ See the full license in the file "LICENSE" in the top level distribution directo
 *************************************************************************************/
 /*  END LEGAL */
-#ifndef Hadrons_WeakHamiltonianNonEye_hpp_
+#ifndef Hadrons_MContraction_WeakHamiltonianNonEye_hpp_
-#define Hadrons_WeakHamiltonianNonEye_hpp_
+#define Hadrons_MContraction_WeakHamiltonianNonEye_hpp_
 #include <Grid/Hadrons/Modules/MContraction/WeakHamiltonian.hpp>
@@ -54,4 +54,4 @@ END_MODULE_NAMESPACE
 END_HADRONS_NAMESPACE
-#endif // Hadrons_WeakHamiltonianNonEye_hpp_
+#endif // Hadrons_MContraction_WeakHamiltonianNonEye_hpp_
@@ -26,8 +26,8 @@ See the full license in the file "LICENSE" in the top level distribution directo
 *************************************************************************************/
 /*  END LEGAL */
-#ifndef Hadrons_WeakNeutral4ptDisc_hpp_
+#ifndef Hadrons_MContraction_WeakNeutral4ptDisc_hpp_
-#define Hadrons_WeakNeutral4ptDisc_hpp_
+#define Hadrons_MContraction_WeakNeutral4ptDisc_hpp_
 #include <Grid/Hadrons/Modules/MContraction/WeakHamiltonian.hpp>
@@ -56,4 +56,4 @@ END_MODULE_NAMESPACE
 END_HADRONS_NAMESPACE
-#endif // Hadrons_WeakNeutral4ptDisc_hpp_
+#endif // Hadrons_MContraction_WeakNeutral4ptDisc_hpp_
@@ -1,34 +1,5 @@
-/*************************************************************************************
+#ifndef Hadrons_MFermion_GaugeProp_hpp_
-
+#define Hadrons_MFermion_GaugeProp_hpp_
 Grid physics library, www.github.com/paboyle/Grid 
 Source file: extras/Hadrons/Modules/Quark.hpp
 Copyright (C) 2015
 Copyright (C) 2016
 Author: Antonin Portelli <antonin.portelli@me.com>
 This program is free software; you can redistribute it and/or modify
 it under the terms of the GNU General Public License as published by
 the Free Software Foundation; either version 2 of the License, or
 (at your option) any later version.
 This program is distributed in the hope that it will be useful,
 but WITHOUT ANY WARRANTY; without even the implied warranty of
 MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 GNU General Public License for more details.
 You should have received a copy of the GNU General Public License along
 with this program; if not, write to the Free Software Foundation, Inc.,
 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
 See the full license in the file "LICENSE" in the top level distribution directory
 *************************************************************************************/
 /*  END LEGAL */
 #ifndef Hadrons_Quark_hpp_
 #define Hadrons_Quark_hpp_
 #include <Grid/Hadrons/Global.hpp>
 #include <Grid/Hadrons/Module.hpp>
@@ -37,27 +8,29 @@ See the full license in the file "LICENSE" in the top level distribution directo
 BEGIN_HADRONS_NAMESPACE
 /******************************************************************************
- *                               TQuark                                       *
+ *                                GaugeProp                                   *
 ******************************************************************************/
-class QuarkPar: Serializable
+BEGIN_MODULE_NAMESPACE(MFermion)
 class GaugePropPar: Serializable
 {
 public:
-    GRID_SERIALIZABLE_CLASS_MEMBERS(QuarkPar,
+    GRID_SERIALIZABLE_CLASS_MEMBERS(GaugePropPar,
                                    std::string, source,
                                    std::string, solver);
 };
 template <typename FImpl>
-class TQuark: public Module<QuarkPar>
+class TGaugeProp: public Module<GaugePropPar>
 {
 public:
-    TYPE_ALIASES(FImpl,);
+    FGS_TYPE_ALIASES(FImpl,);
 public:
    // constructor
-    TQuark(const std::string name);
+    TGaugeProp(const std::string name);
    // destructor
-    virtual ~TQuark(void) = default;
+    virtual ~TGaugeProp(void) = default;
-    // dependencies/products
+    // dependency relation
    virtual std::vector<std::string> getInput(void);
    virtual std::vector<std::string> getOutput(void);
    // setup
@@ -69,20 +42,20 @@ private:
    SolverFn     *solver_{nullptr};
 };
-MODULE_REGISTER(Quark, TQuark<FIMPL>);
+MODULE_REGISTER_NS(GaugeProp, TGaugeProp<FIMPL>, MFermion);
 /******************************************************************************
- *                          TQuark implementation                             *
+ *                      TGaugeProp implementation                             *
 ******************************************************************************/
 // constructor /////////////////////////////////////////////////////////////////
 template <typename FImpl>
-TQuark<FImpl>::TQuark(const std::string name)
+TGaugeProp<FImpl>::TGaugeProp(const std::string name)
-: Module(name)
+: Module<GaugePropPar>(name)
 {}
 // dependencies/products ///////////////////////////////////////////////////////
 template <typename FImpl>
-std::vector<std::string> TQuark<FImpl>::getInput(void)
+std::vector<std::string> TGaugeProp<FImpl>::getInput(void)
 {
    std::vector<std::string> in = {par().source, par().solver};
@@ -90,7 +63,7 @@ std::vector<std::string> TQuark<FImpl>::getInput(void)
 }
 template <typename FImpl>
-std::vector<std::string> TQuark<FImpl>::getOutput(void)
+std::vector<std::string> TGaugeProp<FImpl>::getOutput(void)
 {
    std::vector<std::string> out = {getName(), getName() + "_5d"};
@@ -99,7 +72,7 @@ std::vector<std::string> TQuark<FImpl>::getOutput(void)
 // setup ///////////////////////////////////////////////////////////////////////
 template <typename FImpl>
-void TQuark<FImpl>::setup(void)
+void TGaugeProp<FImpl>::setup(void)
 {
    Ls_ = env().getObjectLs(par().solver);
    env().template registerLattice<PropagatorField>(getName());
@@ -111,13 +84,13 @@ void TQuark<FImpl>::setup(void)
 // execution ///////////////////////////////////////////////////////////////////
 template <typename FImpl>
-void TQuark<FImpl>::execute(void)
+void TGaugeProp<FImpl>::execute(void)
 {
    LOG(Message) << "Computing quark propagator '" << getName() << "'"
-                 << std::endl;
+    << std::endl;
    FermionField    source(env().getGrid(Ls_)), sol(env().getGrid(Ls_)),
-                    tmp(env().getGrid());
+    tmp(env().getGrid());
    std::string     propName = (Ls_ == 1) ? getName() : (getName() + "_5d");
    PropagatorField &prop    = *env().template createLattice<PropagatorField>(propName);
    PropagatorField &fullSrc = *env().template getObject<PropagatorField>(par().source);
@@ -128,7 +101,7 @@ void TQuark<FImpl>::execute(void)
    }
    LOG(Message) << "Inverting using solver '" << par().solver
-                 << "' on source '" << par().source << "'" << std::endl;
+    << "' on source '" << par().source << "'" << std::endl;
    for (unsigned int s = 0; s < Ns; ++s)
    for (unsigned int c = 0; c < Nc; ++c)
    {
@@ -170,7 +143,7 @@ void TQuark<FImpl>::execute(void)
        if (Ls_ > 1)
        {
            PropagatorField &p4d =
-                *env().template getObject<PropagatorField>(getName());
+            *env().template getObject<PropagatorField>(getName());
            axpby_ssp_pminus(sol, 0., sol, 1., sol, 0, 0);
            axpby_ssp_pplus(sol, 1., sol, 1., sol, 0, Ls_-1);
@@ -180,6 +153,8 @@ void TQuark<FImpl>::execute(void)
    }
 }
 END_MODULE_NAMESPACE
 END_HADRONS_NAMESPACE
-#endif // Hadrons_Quark_hpp_
+#endif // Hadrons_MFermion_GaugeProp_hpp_
@@ -27,8 +27,8 @@ See the full license in the file "LICENSE" in the top level distribution directo
 *************************************************************************************/
 /*  END LEGAL */
-#ifndef Hadrons_Load_hpp_
+#ifndef Hadrons_MGauge_Load_hpp_
-#define Hadrons_Load_hpp_
+#define Hadrons_MGauge_Load_hpp_
 #include <Grid/Hadrons/Global.hpp>
 #include <Grid/Hadrons/Module.hpp>
@@ -70,4 +70,4 @@ END_MODULE_NAMESPACE
 END_HADRONS_NAMESPACE
-#endif // Hadrons_Load_hpp_
+#endif // Hadrons_MGauge_Load_hpp_
@@ -27,8 +27,8 @@ See the full license in the file "LICENSE" in the top level distribution directo
 *************************************************************************************/
 /*  END LEGAL */
-#ifndef Hadrons_Random_hpp_
+#ifndef Hadrons_MGauge_Random_hpp_
-#define Hadrons_Random_hpp_
+#define Hadrons_MGauge_Random_hpp_
 #include <Grid/Hadrons/Global.hpp>
 #include <Grid/Hadrons/Module.hpp>
@@ -63,4 +63,4 @@ END_MODULE_NAMESPACE
 END_HADRONS_NAMESPACE
-#endif // Hadrons_Random_hpp_
+#endif // Hadrons_MGauge_Random_hpp_
@@ -0,0 +1,88 @@
 /*************************************************************************************
 Grid physics library, www.github.com/paboyle/Grid 
 Source file: extras/Hadrons/Modules/MGauge/StochEm.cc
 Copyright (C) 2015
 Copyright (C) 2016
 This program is free software; you can redistribute it and/or modify
 it under the terms of the GNU General Public License as published by
 the Free Software Foundation; either version 2 of the License, or
 (at your option) any later version.
 This program is distributed in the hope that it will be useful,
 but WITHOUT ANY WARRANTY; without even the implied warranty of
 MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 GNU General Public License for more details.
 You should have received a copy of the GNU General Public License along
 with this program; if not, write to the Free Software Foundation, Inc.,
 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
 See the full license in the file "LICENSE" in the top level distribution directory
 *************************************************************************************/
 /*  END LEGAL */
 #include <Grid/Hadrons/Modules/MGauge/StochEm.hpp>
 using namespace Grid;
 using namespace Hadrons;
 using namespace MGauge;
 /******************************************************************************
 *                  TStochEm implementation                             *
 ******************************************************************************/
 // constructor /////////////////////////////////////////////////////////////////
 TStochEm::TStochEm(const std::string name)
 : Module<StochEmPar>(name)
 {}
 // dependencies/products ///////////////////////////////////////////////////////
 std::vector<std::string> TStochEm::getInput(void)
 {
    std::vector<std::string> in;
    return in;
 }
 std::vector<std::string> TStochEm::getOutput(void)
 {
    std::vector<std::string> out = {getName()};
    return out;
 }
 // setup ///////////////////////////////////////////////////////////////////////
 void TStochEm::setup(void)
 {
    if (!env().hasRegisteredObject("_" + getName() + "_weight"))
    {
        env().registerLattice<EmComp>("_" + getName() + "_weight");
    }
    env().registerLattice<EmField>(getName());
 }
 // execution ///////////////////////////////////////////////////////////////////
 void TStochEm::execute(void)
 {
    PhotonR photon(par().gauge, par().zmScheme);
    EmField &a = *env().createLattice<EmField>(getName());
    EmComp  *w;
    if (!env().hasCreatedObject("_" + getName() + "_weight"))
    {
         LOG(Message) << "Caching stochastic EM potential weight (gauge: "
                     << par().gauge << ", zero-mode scheme: "
                     << par().zmScheme << ")..." << std::endl;
        w = env().createLattice<EmComp>("_" + getName() + "_weight");
        photon.StochasticWeight(*w);
    }
    else
    {
        w = env().getObject<EmComp>("_" + getName() + "_weight");
    }
     LOG(Message) << "Generating stochastic EM potential..." << std::endl;
    photon.StochasticField(a, *env().get4dRng(), *w);
 }
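Editor's note, not part of the commit: `TStochEm::execute` builds the stochastic weight only when no object with that name has been created yet, and otherwise reuses the cached one. The sketch below shows that create-once, reuse-afterwards pattern in isolation. The `Cache` class, its `getOrCreate` method and the use of `double` as the cached value are hypothetical simplifications, not the Hadrons `Environment` API.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>

// Hypothetical name-keyed cache standing in for the environment's object
// store; the cached value is a double for simplicity.
class Cache
{
public:
    // Return the cached value for name, calling build() only on first use.
    const double &getOrCreate(const std::string &name,
                              const std::function<double()> &build)
    {
        auto it = store_.find(name);

        if (it == store_.end())
        {
            ++builds_;
            it = store_.emplace(name, build()).first;
        }

        return it->second;
    }

    // Number of times a value actually had to be built.
    int builds() const { return builds_; }

private:
    std::map<std::string, double> store_;
    int                           builds_ = 0;
};

// Small usage example: the second request reuses the first result.
void example()
{
    Cache cache;

    double first  = cache.getOrCreate("_photon_weight", [] { return 2.5; });
    double second = cache.getOrCreate("_photon_weight", [] { return 9.0; });

    assert(first == 2.5);
    assert(second == 2.5);
    assert(cache.builds() == 1);
}
```

Keying the cache on a name derived from the module, as the commit does with `"_" + getName() + "_weight"`, keeps the cached weight private to that module instance.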
@@ -0,0 +1,75 @@
 /*************************************************************************************
 Grid physics library, www.github.com/paboyle/Grid 
 Source file: extras/Hadrons/Modules/MGauge/StochEm.hpp
 Copyright (C) 2015
 Copyright (C) 2016
 This program is free software; you can redistribute it and/or modify
 it under the terms of the GNU General Public License as published by
 the Free Software Foundation; either version 2 of the License, or
 (at your option) any later version.
 This program is distributed in the hope that it will be useful,
 but WITHOUT ANY WARRANTY; without even the implied warranty of
 MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 GNU General Public License for more details.
 You should have received a copy of the GNU General Public License along
 with this program; if not, write to the Free Software Foundation, Inc.,
 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
 See the full license in the file "LICENSE" in the top level distribution directory
 *************************************************************************************/
 /*  END LEGAL */
 #ifndef Hadrons_MGauge_StochEm_hpp_
 #define Hadrons_MGauge_StochEm_hpp_
 #include <Grid/Hadrons/Global.hpp>
 #include <Grid/Hadrons/Module.hpp>
 #include <Grid/Hadrons/ModuleFactory.hpp>
 BEGIN_HADRONS_NAMESPACE
 /******************************************************************************
 *                         StochEm                                 *
 ******************************************************************************/
 BEGIN_MODULE_NAMESPACE(MGauge)
 class StochEmPar: Serializable
 {
 public:
    GRID_SERIALIZABLE_CLASS_MEMBERS(StochEmPar,
                                    PhotonR::Gauge,    gauge,
                                    PhotonR::ZmScheme, zmScheme);
 };
 class TStochEm: public Module<StochEmPar>
 {
 public:
    typedef PhotonR::GaugeField     EmField;
    typedef PhotonR::GaugeLinkField EmComp;
 public:
    // constructor
    TStochEm(const std::string name);
    // destructor
    virtual ~TStochEm(void) = default;
    // dependency relation
    virtual std::vector<std::string> getInput(void);
    virtual std::vector<std::string> getOutput(void);
    // setup
    virtual void setup(void);
    // execution
    virtual void execute(void);
 };
 MODULE_REGISTER_NS(StochEm, TStochEm, MGauge);
 END_MODULE_NAMESPACE
 END_HADRONS_NAMESPACE
 #endif // Hadrons_MGauge_StochEm_hpp_
@@ -27,8 +27,8 @@ See the full license in the file "LICENSE" in the top level distribution directo
 *************************************************************************************/
 /*  END LEGAL */
-#ifndef Hadrons_Unit_hpp_
+#ifndef Hadrons_MGauge_Unit_hpp_
-#define Hadrons_Unit_hpp_
+#define Hadrons_MGauge_Unit_hpp_
 #include <Grid/Hadrons/Global.hpp>
 #include <Grid/Hadrons/Module.hpp>
@@ -63,4 +63,4 @@ END_MODULE_NAMESPACE
 END_HADRONS_NAMESPACE
-#endif // Hadrons_Unit_hpp_
+#endif // Hadrons_MGauge_Unit_hpp_
@@ -26,8 +26,8 @@ See the full license in the file "LICENSE" in the top level distribution directo
 *************************************************************************************/
 /*  END LEGAL */
-#ifndef Hadrons_NoiseLoop_hpp_
+#ifndef Hadrons_MLoop_NoiseLoop_hpp_
-#define Hadrons_NoiseLoop_hpp_
+#define Hadrons_MLoop_NoiseLoop_hpp_
 #include <Grid/Hadrons/Global.hpp>
 #include <Grid/Hadrons/Module.hpp>
@@ -65,7 +65,7 @@ template <typename FImpl>
 class TNoiseLoop: public Module<NoiseLoopPar>
 {
 public:
-    TYPE_ALIASES(FImpl,);
+    FERM_TYPE_ALIASES(FImpl,);
 public:
    // constructor
    TNoiseLoop(const std::string name);
@@ -129,4 +129,4 @@ END_MODULE_NAMESPACE
 END_HADRONS_NAMESPACE
-#endif // Hadrons_NoiseLoop_hpp_
+#endif // Hadrons_MLoop_NoiseLoop_hpp_
@@ -0,0 +1,226 @@
 #include <Grid/Hadrons/Modules/MScalar/ChargedProp.hpp>
 #include <Grid/Hadrons/Modules/MScalar/Scalar.hpp>
 using namespace Grid;
 using namespace Hadrons;
 using namespace MScalar;
 /******************************************************************************
 *                     TChargedProp implementation                             *
 ******************************************************************************/
 // constructor /////////////////////////////////////////////////////////////////
 TChargedProp::TChargedProp(const std::string name)
 : Module<ChargedPropPar>(name)
 {}
 // dependencies/products ///////////////////////////////////////////////////////
 std::vector<std::string> TChargedProp::getInput(void)
 {
    std::vector<std::string> in = {par().source, par().emField};
    return in;
 }
 std::vector<std::string> TChargedProp::getOutput(void)
 {
    std::vector<std::string> out = {getName()};
    return out;
 }
 // setup ///////////////////////////////////////////////////////////////////////
 void TChargedProp::setup(void)
 {
    freeMomPropName_ = FREEMOMPROP(par().mass);
    phaseName_.clear();
    for (unsigned int mu = 0; mu < env().getNd(); ++mu)
    {
        phaseName_.push_back("_shiftphase_" + std::to_string(mu));
    }
    GFSrcName_ = "_" + getName() + "_DinvSrc";
    if (!env().hasRegisteredObject(freeMomPropName_))
    {
        env().registerLattice<ScalarField>(freeMomPropName_);
    }
    if (!env().hasRegisteredObject(phaseName_[0]))
    {
        for (unsigned int mu = 0; mu < env().getNd(); ++mu)
        {
            env().registerLattice<ScalarField>(phaseName_[mu]);
        }
    }
    if (!env().hasRegisteredObject(GFSrcName_))
    {
        env().registerLattice<ScalarField>(GFSrcName_);
    }
    env().registerLattice<ScalarField>(getName());
 }
 // execution ///////////////////////////////////////////////////////////////////
 void TChargedProp::execute(void)
 {
    // CACHING ANALYTIC EXPRESSIONS
    ScalarField &source = *env().getObject<ScalarField>(par().source);
    Complex     ci(0.0,1.0);
    FFT         fft(env().getGrid());
    // cache free scalar propagator
    if (!env().hasCreatedObject(freeMomPropName_))
    {
        LOG(Message) << "Caching momentum space free scalar propagator"
                     << " (mass= " << par().mass << ")..." << std::endl;
        freeMomProp_ = env().createLattice<ScalarField>(freeMomPropName_);
        SIMPL::MomentumSpacePropagator(*freeMomProp_, par().mass);
    }
    else
    {
        freeMomProp_ = env().getObject<ScalarField>(freeMomPropName_);
    }
    // cache G*F*src
    if (!env().hasCreatedObject(GFSrcName_))
    {
        GFSrc_ = env().createLattice<ScalarField>(GFSrcName_);
        fft.FFT_all_dim(*GFSrc_, source, FFT::forward);
        *GFSrc_ = (*freeMomProp_)*(*GFSrc_);
    }
    else
    {
        GFSrc_ = env().getObject<ScalarField>(GFSrcName_);
    }
    // cache phases
    if (!env().hasCreatedObject(phaseName_[0]))
    {
        std::vector<int> &l = env().getGrid()->_fdimensions;
        LOG(Message) << "Caching shift phases..." << std::endl;
        for (unsigned int mu = 0; mu < env().getNd(); ++mu)
        {
            Real    twoPiL = M_PI*2./l[mu];
            phase_.push_back(env().createLattice<ScalarField>(phaseName_[mu]));
            LatticeCoordinate(*(phase_[mu]), mu);
            *(phase_[mu]) = exp(ci*twoPiL*(*(phase_[mu])));
        }
    }
    else
    {
        for (unsigned int mu = 0; mu < env().getNd(); ++mu)
        {
            phase_.push_back(env().getObject<ScalarField>(phaseName_[mu]));
        }
    }
    // PROPAGATOR CALCULATION
    LOG(Message) << "Computing charged scalar propagator"
                 << " (mass= " << par().mass
                 << ", charge= " << par().charge << ")..." << std::endl;
    ScalarField &prop   = *env().createLattice<ScalarField>(getName());
    ScalarField buf(env().getGrid());
    ScalarField &GFSrc = *GFSrc_, &G = *freeMomProp_;
    double      q = par().charge;
    // G*F*Src
    prop = GFSrc;
    // - q*G*momD1*G*F*Src (momD1 = F*D1*Finv)
    buf = GFSrc;
    momD1(buf, fft);
    buf = G*buf;
    prop = prop - q*buf;
    // + q^2*G*momD1*G*momD1*G*F*Src (here buf = G*momD1*G*F*Src)
    momD1(buf, fft);
    prop = prop + q*q*G*buf;
    // - q^2*G*momD2*G*F*Src (momD2 = F*D2*Finv)
    buf = GFSrc;
    momD2(buf, fft);
    prop = prop - q*q*G*buf;
    // final FT
    fft.FFT_all_dim(prop, prop, FFT::backward);
    // OUTPUT IF NECESSARY
    if (!par().output.empty())
    {
        std::string           filename = par().output + "." +
                                         std::to_string(env().getTrajectory());
        LOG(Message) << "Saving zero-momentum projection to '"
                     << filename << "'..." << std::endl;
        CorrWriter            writer(filename);
        std::vector<TComplex> vecBuf;
        std::vector<Complex>  result;
        sliceSum(prop, vecBuf, Tp);
        result.resize(vecBuf.size());
        for (unsigned int t = 0; t < vecBuf.size(); ++t)
        {
            result[t] = TensorRemove(vecBuf[t]);
        }
        write(writer, "charge", q);
        write(writer, "prop", result);
    }
 }
 void TChargedProp::momD1(ScalarField &s, FFT &fft)
 {
    EmField     &A = *env().getObject<EmField>(par().emField);
    ScalarField buf(env().getGrid()), result(env().getGrid()),
                Amu(env().getGrid());
    Complex     ci(0.0,1.0);
    result = zero;
    for (unsigned int mu = 0; mu < env().getNd(); ++mu)
    {
        Amu = peekLorentz(A, mu);
        buf = (*phase_[mu])*s;
        fft.FFT_all_dim(buf, buf, FFT::backward);
        buf = Amu*buf;
        fft.FFT_all_dim(buf, buf, FFT::forward);
        result = result - ci*buf;
    }
    fft.FFT_all_dim(s, s, FFT::backward);
    for (unsigned int mu = 0; mu < env().getNd(); ++mu)
    {
        Amu = peekLorentz(A, mu);
        buf = Amu*s;
        fft.FFT_all_dim(buf, buf, FFT::forward);
        result = result + ci*adj(*phase_[mu])*buf;
    }
    s = result;
 }
 void TChargedProp::momD2(ScalarField &s, FFT &fft)
 {
    EmField     &A = *env().getObject<EmField>(par().emField);
    ScalarField buf(env().getGrid()), result(env().getGrid()),
                Amu(env().getGrid());
    result = zero;
    for (unsigned int mu = 0; mu < env().getNd(); ++mu)
    {
        Amu = peekLorentz(A, mu);
        buf = (*phase_[mu])*s;
        fft.FFT_all_dim(buf, buf, FFT::backward);
        buf = Amu*Amu*buf;
        fft.FFT_all_dim(buf, buf, FFT::forward);
        result = result + .5*buf;
    }
    fft.FFT_all_dim(s, s, FFT::backward);
    for (unsigned int mu = 0; mu < env().getNd(); ++mu)
    {
        Amu = peekLorentz(A, mu);
        buf = Amu*Amu*s;
        fft.FFT_all_dim(buf, buf, FFT::forward);
        result = result + .5*adj(*phase_[mu])*buf;
    }
    s = result;
 }
@@ -0,0 +1,61 @@
 #ifndef Hadrons_MScalar_ChargedProp_hpp_
 #define Hadrons_MScalar_ChargedProp_hpp_
 #include <Grid/Hadrons/Global.hpp>
 #include <Grid/Hadrons/Module.hpp>
 #include <Grid/Hadrons/ModuleFactory.hpp>
 BEGIN_HADRONS_NAMESPACE
 /******************************************************************************
 *                       Charged scalar propagator                            *
 ******************************************************************************/
 BEGIN_MODULE_NAMESPACE(MScalar)
 class ChargedPropPar: Serializable
 {
 public:
    GRID_SERIALIZABLE_CLASS_MEMBERS(ChargedPropPar,
                                    std::string, emField,
                                    std::string, source,
                                    double,      mass,
                                    double,      charge,
                                    std::string, output);
 };
 class TChargedProp: public Module<ChargedPropPar>
 {
 public:
    SCALAR_TYPE_ALIASES(SIMPL,);
    typedef PhotonR::GaugeField     EmField;
    typedef PhotonR::GaugeLinkField EmComp;
 public:
    // constructor
    TChargedProp(const std::string name);
    // destructor
    virtual ~TChargedProp(void) = default;
    // dependency relation
    virtual std::vector<std::string> getInput(void);
    virtual std::vector<std::string> getOutput(void);
    // setup
    virtual void setup(void);
    // execution
    virtual void execute(void);
 private:
    void momD1(ScalarField &s, FFT &fft);
    void momD2(ScalarField &s, FFT &fft);
 private:
    std::string                freeMomPropName_, GFSrcName_;
    std::vector<std::string>   phaseName_;
    ScalarField                *freeMomProp_, *GFSrc_;
    std::vector<ScalarField *> phase_;
    EmField                    *A;
 };
 MODULE_REGISTER_NS(ChargedProp, TChargedProp, MScalar);
 END_MODULE_NAMESPACE
 END_HADRONS_NAMESPACE
 #endif // Hadrons_MScalar_ChargedProp_hpp_
@@ -0,0 +1,79 @@
 #include <Grid/Hadrons/Modules/MScalar/FreeProp.hpp>
 #include <Grid/Hadrons/Modules/MScalar/Scalar.hpp>
 using namespace Grid;
 using namespace Hadrons;
 using namespace MScalar;
 /******************************************************************************
 *                        TFreeProp implementation                             *
 ******************************************************************************/
 // constructor /////////////////////////////////////////////////////////////////
 TFreeProp::TFreeProp(const std::string name)
 : Module<FreePropPar>(name)
 {}
 // dependencies/products ///////////////////////////////////////////////////////
 std::vector<std::string> TFreeProp::getInput(void)
 {
    std::vector<std::string> in = {par().source};
    return in;
 }
 std::vector<std::string> TFreeProp::getOutput(void)
 {
    std::vector<std::string> out = {getName()};
    return out;
 }
 // setup ///////////////////////////////////////////////////////////////////////
 void TFreeProp::setup(void)
 {
    freeMomPropName_ = FREEMOMPROP(par().mass);
    if (!env().hasRegisteredObject(freeMomPropName_))
    {
        env().registerLattice<ScalarField>(freeMomPropName_);
    }
    env().registerLattice<ScalarField>(getName());
 }
 // execution ///////////////////////////////////////////////////////////////////
 void TFreeProp::execute(void)
 {
    ScalarField &prop   = *env().createLattice<ScalarField>(getName());
    ScalarField &source = *env().getObject<ScalarField>(par().source);
    ScalarField *freeMomProp;
    if (!env().hasCreatedObject(freeMomPropName_))
    {
        LOG(Message) << "Caching momentum space free scalar propagator"
                     << " (mass= " << par().mass << ")..." << std::endl;
        freeMomProp = env().createLattice<ScalarField>(freeMomPropName_);
        SIMPL::MomentumSpacePropagator(*freeMomProp, par().mass);
    }
    else
    {
        freeMomProp = env().getObject<ScalarField>(freeMomPropName_);
    }
    LOG(Message) << "Computing free scalar propagator..." << std::endl;
    SIMPL::FreePropagator(source, prop, *freeMomProp);
    if (!par().output.empty())
    {
        TextWriter            writer(par().output + "." +
                                     std::to_string(env().getTrajectory()));
        std::vector<TComplex> buf;
        std::vector<Complex>  result;
        sliceSum(prop, buf, Tp);
        result.resize(buf.size());
        for (unsigned int t = 0; t < buf.size(); ++t)
        {
            result[t] = TensorRemove(buf[t]);
        }
        write(writer, "prop", result);
    }
 }
@@ -0,0 +1,50 @@
 #ifndef Hadrons_MScalar_FreeProp_hpp_
 #define Hadrons_MScalar_FreeProp_hpp_
 #include <Grid/Hadrons/Global.hpp>
 #include <Grid/Hadrons/Module.hpp>
 #include <Grid/Hadrons/ModuleFactory.hpp>
 BEGIN_HADRONS_NAMESPACE
 /******************************************************************************
 *                               FreeProp                                     *
 ******************************************************************************/
 BEGIN_MODULE_NAMESPACE(MScalar)
 class FreePropPar: Serializable
 {
 public:
    GRID_SERIALIZABLE_CLASS_MEMBERS(FreePropPar,
                                    std::string, source,
                                    double,      mass,
                                    std::string, output);
 };
 class TFreeProp: public Module<FreePropPar>
 {
 public:
    SCALAR_TYPE_ALIASES(SIMPL,);
 public:
    // constructor
    TFreeProp(const std::string name);
    // destructor
    virtual ~TFreeProp(void) = default;
    // dependency relation
    virtual std::vector<std::string> getInput(void);
    virtual std::vector<std::string> getOutput(void);
    // setup
    virtual void setup(void);
    // execution
    virtual void execute(void);
 private:
    std::string freeMomPropName_;
 };
 MODULE_REGISTER_NS(FreeProp, TFreeProp, MScalar);
 END_MODULE_NAMESPACE
 END_HADRONS_NAMESPACE
 #endif // Hadrons_MScalar_FreeProp_hpp_
@@ -0,0 +1,6 @@
 #ifndef Hadrons_Scalar_hpp_
 #define Hadrons_Scalar_hpp_
 #define FREEMOMPROP(m) "_scalar_mom_prop_" + std::to_string(m)
 #endif // Hadrons_Scalar_hpp_
@@ -0,0 +1,114 @@
 #ifndef Hadrons_MSink_Point_hpp_
 #define Hadrons_MSink_Point_hpp_
 #include <Grid/Hadrons/Global.hpp>
 #include <Grid/Hadrons/Module.hpp>
 #include <Grid/Hadrons/ModuleFactory.hpp>
 BEGIN_HADRONS_NAMESPACE
 /******************************************************************************
 *                                   Point                                    *
 ******************************************************************************/
 BEGIN_MODULE_NAMESPACE(MSink)
 class PointPar: Serializable
 {
 public:
    GRID_SERIALIZABLE_CLASS_MEMBERS(PointPar,
                                    std::string, mom);
 };
 template <typename FImpl>
 class TPoint: public Module<PointPar>
 {
 public:
    FERM_TYPE_ALIASES(FImpl,);
    SINK_TYPE_ALIASES();
 public:
    // constructor
    TPoint(const std::string name);
    // destructor
    virtual ~TPoint(void) = default;
    // dependency relation
    virtual std::vector<std::string> getInput(void);
    virtual std::vector<std::string> getOutput(void);
    // setup
    virtual void setup(void);
    // execution
    virtual void execute(void);
 };
 MODULE_REGISTER_NS(Point,       TPoint<FIMPL>,        MSink);
 MODULE_REGISTER_NS(ScalarPoint, TPoint<ScalarImplCR>, MSink);
 /******************************************************************************
 *                          TPoint implementation                             *
 ******************************************************************************/
 // constructor /////////////////////////////////////////////////////////////////
 template <typename FImpl>
 TPoint<FImpl>::TPoint(const std::string name)
 : Module<PointPar>(name)
 {}
 // dependencies/products ///////////////////////////////////////////////////////
 template <typename FImpl>
 std::vector<std::string> TPoint<FImpl>::getInput(void)
 {
    std::vector<std::string> in;
    return in;
 }
 template <typename FImpl>
 std::vector<std::string> TPoint<FImpl>::getOutput(void)
 {
    std::vector<std::string> out = {getName()};
    return out;
 }
 // setup ///////////////////////////////////////////////////////////////////////
 template <typename FImpl>
 void TPoint<FImpl>::setup(void)
 {
    unsigned int size;
    size = env().template lattice4dSize<LatticeComplex>();
    env().registerObject(getName(), size);
 }
 // execution ///////////////////////////////////////////////////////////////////
 template <typename FImpl>
 void TPoint<FImpl>::execute(void)
 {
    std::vector<Real> p = strToVec<Real>(par().mom);
    LatticeComplex    ph(env().getGrid()), coor(env().getGrid());
    Complex           i(0.0,1.0);
    LOG(Message) << "Setting up point sink function for momentum ["
                 << par().mom << "]" << std::endl;
    ph = zero;
    for(unsigned int mu = 0; mu < env().getNd(); mu++)
    {
        LatticeCoordinate(coor, mu);
        ph = ph + (p[mu]/env().getGrid()->_fdimensions[mu])*coor;
    }
    ph = exp((Real)(2*M_PI)*i*ph);
    auto sink = [ph](const PropagatorField &field)
    {
        SlicedPropagator res;
        PropagatorField  tmp = ph*field;
        sliceSum(tmp, res, Tp);
        return res;
    };
    env().setObject(getName(), new SinkFn(sink));
 }
 END_MODULE_NAMESPACE
 END_HADRONS_NAMESPACE
 #endif // Hadrons_MSink_Point_hpp_
@@ -27,8 +27,8 @@ See the full license in the file "LICENSE" in the top level distribution directo
 *************************************************************************************/
 /*  END LEGAL */
-#ifndef Hadrons_RBPrecCG_hpp_
+#ifndef Hadrons_MSolver_RBPrecCG_hpp_
-#define Hadrons_RBPrecCG_hpp_
+#define Hadrons_MSolver_RBPrecCG_hpp_
 #include <Grid/Hadrons/Global.hpp>
 #include <Grid/Hadrons/Module.hpp>
@@ -53,7 +53,7 @@ template <typename FImpl>
 class TRBPrecCG: public Module<RBPrecCGPar>
 {
 public:
-    TYPE_ALIASES(FImpl,);
+    FGS_TYPE_ALIASES(FImpl,);
 public:
    // constructor
    TRBPrecCG(const std::string name);
@@ -129,4 +129,4 @@ END_MODULE_NAMESPACE
 END_HADRONS_NAMESPACE
-#endif // Hadrons_RBPrecCG_hpp_
+#endif // Hadrons_MSolver_RBPrecCG_hpp_
@@ -27,8 +27,8 @@ See the full license in the file "LICENSE" in the top level distribution directo
 *************************************************************************************/
 /*  END LEGAL */
-#ifndef Hadrons_Point_hpp_
+#ifndef Hadrons_MSource_Point_hpp_
-#define Hadrons_Point_hpp_
+#define Hadrons_MSource_Point_hpp_
 #include <Grid/Hadrons/Global.hpp>
 #include <Grid/Hadrons/Module.hpp>
@@ -63,7 +63,7 @@ template <typename FImpl>
 class TPoint: public Module<PointPar>
 {
 public:
-    TYPE_ALIASES(FImpl,);
+    FERM_TYPE_ALIASES(FImpl,);
 public:
    // constructor
    TPoint(const std::string name);
@@ -78,7 +78,8 @@ public:
    virtual void execute(void);
 };
-MODULE_REGISTER_NS(Point, TPoint<FIMPL>, MSource);
+MODULE_REGISTER_NS(Point,       TPoint<FIMPL>,        MSource);
 MODULE_REGISTER_NS(ScalarPoint, TPoint<ScalarImplCR>, MSource);
 /******************************************************************************
 *                       TPoint template implementation                       *
@@ -132,4 +133,4 @@ END_MODULE_NAMESPACE
 END_HADRONS_NAMESPACE
-#endif // Hadrons_Point_hpp_
+#endif // Hadrons_MSource_Point_hpp_
@@ -28,8 +28,8 @@ See the full license in the file "LICENSE" in the top level distribution directo
 *************************************************************************************/
 /*  END LEGAL */
-#ifndef Hadrons_SeqGamma_hpp_
+#ifndef Hadrons_MSource_SeqGamma_hpp_
-#define Hadrons_SeqGamma_hpp_
+#define Hadrons_MSource_SeqGamma_hpp_
 #include <Grid/Hadrons/Global.hpp>
 #include <Grid/Hadrons/Module.hpp>
@@ -72,7 +72,7 @@ template <typename FImpl>
 class TSeqGamma: public Module<SeqGammaPar>
 {
 public:
-    TYPE_ALIASES(FImpl,);
+    FGS_TYPE_ALIASES(FImpl,);
 public:
    // constructor
    TSeqGamma(const std::string name);
@@ -161,4 +161,4 @@ END_MODULE_NAMESPACE
 END_HADRONS_NAMESPACE
-#endif // Hadrons_SeqGamma_hpp_
+#endif // Hadrons_MSource_SeqGamma_hpp_
@@ -26,8 +26,8 @@ See the full license in the file "LICENSE" in the top level distribution directo
 *************************************************************************************/
 /*  END LEGAL */
-#ifndef Hadrons_WallSource_hpp_
+#ifndef Hadrons_MSource_WallSource_hpp_
-#define Hadrons_WallSource_hpp_
+#define Hadrons_MSource_WallSource_hpp_
 #include <Grid/Hadrons/Global.hpp>
 #include <Grid/Hadrons/Module.hpp>
@@ -64,7 +64,7 @@ template <typename FImpl>
 class TWall: public Module<WallPar>
 {
 public:
-    TYPE_ALIASES(FImpl,);
+    FERM_TYPE_ALIASES(FImpl,);
 public:
    // constructor
    TWall(const std::string name);
@@ -144,4 +144,4 @@ END_MODULE_NAMESPACE
 END_HADRONS_NAMESPACE
-#endif // Hadrons_WallSource_hpp_
+#endif // Hadrons_MSource_WallSource_hpp_
@@ -27,8 +27,8 @@ See the full license in the file "LICENSE" in the top level distribution directo
 *************************************************************************************/
 /*  END LEGAL */
-#ifndef Hadrons_Z2_hpp_
+#ifndef Hadrons_MSource_Z2_hpp_
-#define Hadrons_Z2_hpp_
+#define Hadrons_MSource_Z2_hpp_
 #include <Grid/Hadrons/Global.hpp>
 #include <Grid/Hadrons/Module.hpp>
@@ -67,7 +67,7 @@ template <typename FImpl>
 class TZ2: public Module<Z2Par>
 {
 public:
-    TYPE_ALIASES(FImpl,);
+    FERM_TYPE_ALIASES(FImpl,);
 public:
    // constructor
    TZ2(const std::string name);
@@ -82,7 +82,8 @@ public:
    virtual void execute(void);
 };
-MODULE_REGISTER_NS(Z2, TZ2<FIMPL>, MSource);
+MODULE_REGISTER_NS(Z2,       TZ2<FIMPL>,        MSource);
 MODULE_REGISTER_NS(ScalarZ2, TZ2<ScalarImplCR>, MSource);
 /******************************************************************************
 *                       TZ2 template implementation                          *
@@ -148,4 +149,4 @@ END_MODULE_NAMESPACE
 END_HADRONS_NAMESPACE
-#endif // Hadrons_Z2_hpp_
+#endif // Hadrons_MSource_Z2_hpp_
@@ -1,5 +1,5 @@
-#ifndef Hadrons____FILEBASENAME____hpp_
+#ifndef Hadrons____NAMESPACE_______FILEBASENAME____hpp_
-#define Hadrons____FILEBASENAME____hpp_
+#define Hadrons____NAMESPACE_______FILEBASENAME____hpp_
 #include <Grid/Hadrons/Global.hpp>
 #include <Grid/Hadrons/Module.hpp>
@@ -41,4 +41,4 @@ END_MODULE_NAMESPACE
 END_HADRONS_NAMESPACE
-#endif // Hadrons____FILEBASENAME____hpp_
+#endif // Hadrons____NAMESPACE_______FILEBASENAME____hpp_
@@ -1,5 +1,5 @@
-#ifndef Hadrons____FILEBASENAME____hpp_
+#ifndef Hadrons____NAMESPACE_______FILEBASENAME____hpp_
-#define Hadrons____FILEBASENAME____hpp_
+#define Hadrons____NAMESPACE_______FILEBASENAME____hpp_
 #include <Grid/Hadrons/Global.hpp>
 #include <Grid/Hadrons/Module.hpp>
@@ -82,4 +82,4 @@ END_MODULE_NAMESPACE
 END_HADRONS_NAMESPACE
-#endif // Hadrons____FILEBASENAME____hpp_
+#endif // Hadrons____NAMESPACE_______FILEBASENAME____hpp_
@@ -4,7 +4,10 @@ modules_cc =\
  Modules/MContraction/WeakNeutral4ptDisc.cc \
  Modules/MGauge/Load.cc \
  Modules/MGauge/Random.cc \
-  Modules/MGauge/Unit.cc
+  Modules/MGauge/StochEm.cc \
  Modules/MGauge/Unit.cc \
  Modules/MScalar/ChargedProp.cc \
  Modules/MScalar/FreeProp.cc
 modules_hpp =\
  Modules/MAction/DWF.hpp \
@@ -17,14 +20,19 @@ modules_hpp =\
  Modules/MContraction/WeakHamiltonianEye.hpp \
  Modules/MContraction/WeakHamiltonianNonEye.hpp \
  Modules/MContraction/WeakNeutral4ptDisc.hpp \
  Modules/MFermion/GaugeProp.hpp \
  Modules/MGauge/Load.hpp \
  Modules/MGauge/Random.hpp \
  Modules/MGauge/StochEm.hpp \
  Modules/MGauge/Unit.hpp \
  Modules/MLoop/NoiseLoop.hpp \
  Modules/MScalar/ChargedProp.hpp \
  Modules/MScalar/FreeProp.hpp \
  Modules/MScalar/Scalar.hpp \
  Modules/MSink/Point.hpp \
  Modules/MSolver/RBPrecCG.hpp \
  Modules/MSource/Point.hpp \
  Modules/MSource/SeqGamma.hpp \
  Modules/MSource/Wall.hpp \
-  Modules/MSource/Z2.hpp \
+  Modules/MSource/Z2.hpp
  Modules/Quark.hpp
@@ -0,0 +1,11 @@
 #include <qed-fvol/Global.hpp>
 using namespace Grid;
 using namespace QCD;
 using namespace QedFVol;
 QedFVolLogger QedFVol::QedFVolLogError(1,"Error");
 QedFVolLogger QedFVol::QedFVolLogWarning(1,"Warning");
 QedFVolLogger QedFVol::QedFVolLogMessage(1,"Message");
 QedFVolLogger QedFVol::QedFVolLogIterative(1,"Iterative");
 QedFVolLogger QedFVol::QedFVolLogDebug(1,"Debug");
@@ -0,0 +1,42 @@
 #ifndef QedFVol_Global_hpp_
 #define QedFVol_Global_hpp_
 #include <Grid/Grid.h>
 #define BEGIN_QEDFVOL_NAMESPACE \
 namespace Grid {\
 using namespace QCD;\
 namespace QedFVol {\
 using Grid::operator<<;
 #define END_QEDFVOL_NAMESPACE }}
 /* the 'using Grid::operator<<;' statement prevents a very nasty compilation
 * error with GCC (clang compiles fine without it).
 */
 BEGIN_QEDFVOL_NAMESPACE
 class QedFVolLogger: public Logger
 {
 public:
    QedFVolLogger(int on, std::string nm): Logger("QedFVol", on, nm,
                                                  GridLogColours, "BLACK"){};
 };
 #define LOG(channel) std::cout << QedFVolLog##channel
 #define QEDFVOL_ERROR(msg)\
 LOG(Error) << msg << " (" << __FUNCTION__ << " at " << __FILE__ << ":"\
           << __LINE__ << ")" << std::endl;\
 abort();
 #define DEBUG_VAR(var) LOG(Debug) << #var << "= " << (var) << std::endl;
 extern QedFVolLogger QedFVolLogError;
 extern QedFVolLogger QedFVolLogWarning;
 extern QedFVolLogger QedFVolLogMessage;
 extern QedFVolLogger QedFVolLogIterative;
 extern QedFVolLogger QedFVolLogDebug;
 END_QEDFVOL_NAMESPACE
 #endif // QedFVol_Global_hpp_
@@ -0,0 +1,9 @@
 AM_CXXFLAGS += -I$(top_srcdir)/extras
 bin_PROGRAMS = qed-fvol
 qed_fvol_SOURCES =   \
    qed-fvol.cc      \
    Global.cc
 qed_fvol_LDADD   = -lGrid
@@ -0,0 +1,265 @@
 #ifndef QEDFVOL_WILSONLOOPS_H
 #define QEDFVOL_WILSONLOOPS_H
 #include <Global.hpp>
 BEGIN_QEDFVOL_NAMESPACE
 template <class Gimpl> class NewWilsonLoops : public Gimpl {
 public:
  INHERIT_GIMPL_TYPES(Gimpl);
  typedef typename Gimpl::GaugeLinkField GaugeMat;
  typedef typename Gimpl::GaugeField GaugeLorentz;
  //////////////////////////////////////////////////
  // directed plaquette oriented in mu,nu plane
  //////////////////////////////////////////////////
  static void dirPlaquette(GaugeMat &plaq, const std::vector<GaugeMat> &U,
                           const int mu, const int nu) {
    // Annoyingly, we must use explicit scope resolution to find the dependent
    // base class: "this->" is not available because there is no "this" in a
    // static method. This forces explicit Gimpl:: scope resolution throughout
    // this file, and rather defeats the purpose of deriving from Gimpl.
    plaq = Gimpl::CovShiftBackward(
        U[mu], mu, Gimpl::CovShiftBackward(
                       U[nu], nu, Gimpl::CovShiftForward(U[mu], mu, U[nu])));
  }
  //////////////////////////////////////////////////
  // trace of directed plaquette oriented in mu,nu plane
  //////////////////////////////////////////////////
  static void traceDirPlaquette(LatticeComplex &plaq,
                                const std::vector<GaugeMat> &U, const int mu,
                                const int nu) {
    GaugeMat sp(U[0]._grid);
    dirPlaquette(sp, U, mu, nu);
    plaq = trace(sp);
  }
  //////////////////////////////////////////////////
  // sum over all planes of plaquette
  //////////////////////////////////////////////////
  static void sitePlaquette(LatticeComplex &Plaq,
                            const std::vector<GaugeMat> &U) {
    LatticeComplex sitePlaq(U[0]._grid);
    Plaq = zero;
    for (int mu = 1; mu < U[0]._grid->_ndimension; mu++) {
      for (int nu = 0; nu < mu; nu++) {
        traceDirPlaquette(sitePlaq, U, mu, nu);
        Plaq = Plaq + sitePlaq;
      }
    }
  }
  //////////////////////////////////////////////////
  // sum over all x,y,z,t and over all planes of plaquette
  //////////////////////////////////////////////////
  static Real sumPlaquette(const GaugeLorentz &Umu) {
    std::vector<GaugeMat> U(4, Umu._grid);
    for (int mu = 0; mu < Umu._grid->_ndimension; mu++) {
      U[mu] = PeekIndex<LorentzIndex>(Umu, mu);
    }
    LatticeComplex Plaq(Umu._grid);
    sitePlaquette(Plaq, U);
    TComplex Tp = sum(Plaq);
    Complex p = TensorRemove(Tp);
    return p.real();
  }
  //////////////////////////////////////////////////
  // average over all x,y,z,t and over all planes of plaquette
  //////////////////////////////////////////////////
  static Real avgPlaquette(const GaugeLorentz &Umu) {
    int ndim = Umu._grid->_ndimension;
    Real sumplaq = sumPlaquette(Umu);
    Real vol = Umu._grid->gSites();
    Real faces = (1.0 * ndim * (ndim - 1)) / 2.0;
    return sumplaq / vol / faces / Nc; // Nc dependent... FIXME
  }
  //////////////////////////////////////////////////
  // Wilson loop of size (Rmu, Rnu), oriented in mu,nu plane
  //////////////////////////////////////////////////
  static void wilsonLoop(GaugeMat &wl, const std::vector<GaugeMat> &U,
                           const int Rmu, const int Rnu,
                           const int mu, const int nu) {
    wl = U[nu];
    for(int i = 0; i < Rnu-1; i++){
      wl = Gimpl::CovShiftForward(U[nu], nu, wl);
    }
    for(int i = 0; i < Rmu; i++){
      wl = Gimpl::CovShiftForward(U[mu], mu, wl);
    }
    for(int i = 0; i < Rnu; i++){
      wl = Gimpl::CovShiftBackward(U[nu], nu, wl);
    }
    for(int i = 0; i < Rmu; i++){
      wl = Gimpl::CovShiftBackward(U[mu], mu, wl);
    }
  }
  //////////////////////////////////////////////////
  // trace of Wilson Loop oriented in mu,nu plane
  //////////////////////////////////////////////////
  static void traceWilsonLoop(LatticeComplex &wl,
                                const std::vector<GaugeMat> &U,
                                const int Rmu, const int Rnu,
                                const int mu, const int nu) {
    GaugeMat sp(U[0]._grid);
    wilsonLoop(sp, U, Rmu, Rnu, mu, nu);
    wl = trace(sp);
  }
  //////////////////////////////////////////////////
  // sum over all planes of Wilson loop
  //////////////////////////////////////////////////
  static void siteWilsonLoop(LatticeComplex &Wl,
                            const std::vector<GaugeMat> &U,
                            const int R1, const int R2) {
    LatticeComplex siteWl(U[0]._grid);
    Wl = zero;
    for (int mu = 1; mu < U[0]._grid->_ndimension; mu++) {
      for (int nu = 0; nu < mu; nu++) {
        traceWilsonLoop(siteWl, U, R1, R2, mu, nu);
        Wl = Wl + siteWl;
        traceWilsonLoop(siteWl, U, R2, R1, mu, nu);
        Wl = Wl + siteWl;
      }
    }
  }
  //////////////////////////////////////////////////
  // sum over planes of Wilson loop with length R1
  // in the time direction
  //////////////////////////////////////////////////
  static void siteTimelikeWilsonLoop(LatticeComplex &Wl,
                            const std::vector<GaugeMat> &U,
                            const int R1, const int R2) {
    LatticeComplex siteWl(U[0]._grid);
    int ndim = U[0]._grid->_ndimension;
    Wl = zero;
    for (int nu = 0; nu < ndim - 1; nu++) {
      traceWilsonLoop(siteWl, U, R1, R2, ndim-1, nu);
      Wl = Wl + siteWl;
    }
  }
  //////////////////////////////////////////////////
  // sum Wilson loop over all planes orthogonal to the time direction
  //////////////////////////////////////////////////
  static void siteSpatialWilsonLoop(LatticeComplex &Wl,
                            const std::vector<GaugeMat> &U,
                            const int R1, const int R2) {
    LatticeComplex siteWl(U[0]._grid);
    Wl = zero;
    for (int mu = 1; mu < U[0]._grid->_ndimension - 1; mu++) {
      for (int nu = 0; nu < mu; nu++) {
        traceWilsonLoop(siteWl, U, R1, R2, mu, nu);
        Wl = Wl + siteWl;
        traceWilsonLoop(siteWl, U, R2, R1, mu, nu);
        Wl = Wl + siteWl;
      }
    }
  }
  //////////////////////////////////////////////////
  // sum over all x,y,z,t and over all planes of Wilson loop
  //////////////////////////////////////////////////
  static Real sumWilsonLoop(const GaugeLorentz &Umu,
                            const int R1, const int R2) {
    std::vector<GaugeMat> U(4, Umu._grid);
    for (int mu = 0; mu < Umu._grid->_ndimension; mu++) {
      U[mu] = PeekIndex<LorentzIndex>(Umu, mu);
    }
    LatticeComplex Wl(Umu._grid);
    siteWilsonLoop(Wl, U, R1, R2);
    TComplex Tp = sum(Wl);
    Complex p = TensorRemove(Tp);
    return p.real();
  }
  //////////////////////////////////////////////////
  // sum over all x,y,z,t and over all planes of timelike Wilson loop
  //////////////////////////////////////////////////
  static Real sumTimelikeWilsonLoop(const GaugeLorentz &Umu,
                            const int R1, const int R2) {
    std::vector<GaugeMat> U(4, Umu._grid);
    for (int mu = 0; mu < Umu._grid->_ndimension; mu++) {
      U[mu] = PeekIndex<LorentzIndex>(Umu, mu);
    }
    LatticeComplex Wl(Umu._grid);
    siteTimelikeWilsonLoop(Wl, U, R1, R2);
    TComplex Tp = sum(Wl);
    Complex p = TensorRemove(Tp);
    return p.real();
  }
  //////////////////////////////////////////////////
  // sum over all x,y,z,t and over all planes of spatial Wilson loop
  //////////////////////////////////////////////////
  static Real sumSpatialWilsonLoop(const GaugeLorentz &Umu,
                            const int R1, const int R2) {
    std::vector<GaugeMat> U(4, Umu._grid);
    for (int mu = 0; mu < Umu._grid->_ndimension; mu++) {
      U[mu] = PeekIndex<LorentzIndex>(Umu, mu);
    }
    LatticeComplex Wl(Umu._grid);
    siteSpatialWilsonLoop(Wl, U, R1, R2);
    TComplex Tp = sum(Wl);
    Complex p = TensorRemove(Tp);
    return p.real();
  }
  //////////////////////////////////////////////////
  // average over all x,y,z,t and over all planes of Wilson loop
  //////////////////////////////////////////////////
  static Real avgWilsonLoop(const GaugeLorentz &Umu,
                            const int R1, const int R2) {
    int ndim = Umu._grid->_ndimension;
    Real sumWl = sumWilsonLoop(Umu, R1, R2);
    Real vol = Umu._grid->gSites();
    Real faces = 1.0 * ndim * (ndim - 1);
    return sumWl / vol / faces / Nc; // Nc dependent... FIXME
  }
  //////////////////////////////////////////////////
  // average over all x,y,z,t and over all planes of timelike Wilson loop
  //////////////////////////////////////////////////
  static Real avgTimelikeWilsonLoop(const GaugeLorentz &Umu,
                            const int R1, const int R2) {
    int ndim = Umu._grid->_ndimension;
    Real sumWl = sumTimelikeWilsonLoop(Umu, R1, R2);
    Real vol = Umu._grid->gSites();
    Real faces = 1.0 * (ndim - 1);
    return sumWl / vol / faces / Nc; // Nc dependent... FIXME
  }
  //////////////////////////////////////////////////
  // average over all x,y,z,t and over all planes of spatial Wilson loop
  //////////////////////////////////////////////////
  static Real avgSpatialWilsonLoop(const GaugeLorentz &Umu,
                            const int R1, const int R2) {
    int ndim = Umu._grid->_ndimension;
    Real sumWl = sumSpatialWilsonLoop(Umu, R1, R2);
    Real vol = Umu._grid->gSites();
    Real faces = 1.0 * (ndim - 1) * (ndim - 2);
    return sumWl / vol / faces / Nc; // Nc dependent... FIXME
  }
 };
 END_QEDFVOL_NAMESPACE
 #endif // QEDFVOL_WILSONLOOPS_H
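The three avg*WilsonLoop functions above share one normalisation: the lattice sum is divided by the global volume, a per-variant plane count ("faces"), and Nc. A minimal standalone sketch of that arithmetic follows. `LoopKind`, `wilsonLoopFaces` and `normalisedLoop` are hypothetical names introduced only for illustration and are not part of Grid. The divisors are copied from the code above; whether they match the number of terms actually summed depends on the site*WilsonLoop functions, which are not shown in this hunk.

```cpp
#include <cassert>

// Hypothetical illustration, not Grid API: which variant of the loop average.
enum class LoopKind { All, Timelike, Spatial };

// "faces" divisor exactly as written in avgWilsonLoop, avgTimelikeWilsonLoop
// and avgSpatialWilsonLoop in the diff above.
inline double wilsonLoopFaces(int ndim, LoopKind kind) {
  assert(ndim >= 2);
  switch (kind) {
    case LoopKind::All:      return 1.0 * ndim * (ndim - 1);
    case LoopKind::Timelike: return 1.0 * (ndim - 1);
    case LoopKind::Spatial:  return 1.0 * (ndim - 1) * (ndim - 2);
  }
  return 0.0;  // unreachable for valid enum values
}

// sum / volume / faces / Nc, in the same order as the source.
inline double normalisedLoop(double sumWl, double vol, double faces, int nc) {
  assert(vol > 0.0 && faces > 0.0 && nc > 0);
  return sumWl / vol / faces / nc;
}
```

For a four-dimensional lattice this gives divisors of 12, 3 and 6 for the full, timelike and spatial averages respectively.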
@@ -0,0 +1,88 @@
 #include <Global.hpp>
 #include <WilsonLoops.h>
 using namespace Grid;
 using namespace QCD;
 using namespace QedFVol;
 typedef PeriodicGaugeImpl<QedGimplR>    QedPeriodicGimplR;
 typedef PhotonR::GaugeField             EmField;
 typedef PhotonR::GaugeLinkField         EmComp;
 const int NCONFIGS = 10;
 const int NWILSON = 10;
 int main(int argc, char *argv[])
 {
    // parse command line
    std::string parameterFileName;
    if (argc < 2)
    {
        std::cerr << "usage: " << argv[0] << " <parameter file> [Grid options]";
        std::cerr << std::endl;
        std::exit(EXIT_FAILURE);
    }
    parameterFileName = argv[1];
    // initialization
    Grid_init(&argc, &argv);
    QedFVolLogError.Active(GridLogError.isActive());
    QedFVolLogWarning.Active(GridLogWarning.isActive());
    QedFVolLogMessage.Active(GridLogMessage.isActive());
    QedFVolLogIterative.Active(GridLogIterative.isActive());
    QedFVolLogDebug.Active(GridLogDebug.isActive());
    LOG(Message) << "Grid initialized" << std::endl;
    // QED stuff
    std::vector<int> latt_size   = GridDefaultLatt();
    std::vector<int> simd_layout = GridDefaultSimd(4, vComplex::Nsimd());
    std::vector<int> mpi_layout  = GridDefaultMpi();
    GridCartesian    grid(latt_size,simd_layout,mpi_layout);
    GridParallelRNG  pRNG(&grid);
    PhotonR          photon(PhotonR::Gauge::feynman,
                            PhotonR::ZmScheme::qedL);
    EmField          a(&grid);
    EmField          expA(&grid);
    Complex imag_unit(0, 1);
    Real wlA;
    std::vector<Real> logWlAvg(NWILSON, 0.0), logWlTime(NWILSON, 0.0), logWlSpace(NWILSON, 0.0);
    pRNG.SeedRandomDevice();
    LOG(Message) << "Wilson loop calculation beginning" << std::endl;
    for(int ic = 0; ic < NCONFIGS; ic++){
        LOG(Message) << "Configuration " << ic <<std::endl;
        photon.StochasticField(a, pRNG);
        // Exponentiate photon field
        expA = exp(imag_unit*a);
        // Calculate Wilson loops
        for(int iw=1; iw<=NWILSON; iw++){
            wlA = NewWilsonLoops<QedPeriodicGimplR>::avgWilsonLoop(expA, iw, iw) * 3;
            logWlAvg[iw-1] -= 2*log(wlA);
            wlA = NewWilsonLoops<QedPeriodicGimplR>::avgTimelikeWilsonLoop(expA, iw, iw) * 3;
            logWlTime[iw-1] -= 2*log(wlA);
            wlA = NewWilsonLoops<QedPeriodicGimplR>::avgSpatialWilsonLoop(expA, iw, iw) * 3;
            logWlSpace[iw-1] -= 2*log(wlA);
        }
    }
    LOG(Message) << "Wilson loop calculation completed" << std::endl;
    // Print ensemble-averaged results
    for(int iw=1; iw<=NWILSON; iw++){
        LOG(Message) << iw << 'x' << iw << " Wilson loop" << std::endl;
        LOG(Message) << "-2log(W) average: " << logWlAvg[iw-1]/NCONFIGS << std::endl;
        LOG(Message) << "-2log(W) timelike: " << logWlTime[iw-1]/NCONFIGS << std::endl;
        LOG(Message) << "-2log(W) spatial: " << logWlSpace[iw-1]/NCONFIGS << std::endl;
    }
    // epilogue
    LOG(Message) << "Grid is finalizing now" << std::endl;
    Grid_finalize();
    return EXIT_SUCCESS;
 }
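The driver above accumulates -2*log(W) for each configuration and divides by NCONFIGS when printing; the factor 3 applied to each average appears to cancel the division by Nc inside the avg*WilsonLoop functions. A minimal sketch of the accumulation step, assuming plain `double` values in place of Grid's `Real` (the helper name `meanMinusTwoLogW` is hypothetical):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical helper: ensemble mean of -2*log(W_i) over per-configuration
// loop values W_i, mirroring the logWl*[iw-1] -= 2*log(wlA) accumulation.
inline double meanMinusTwoLogW(const std::vector<double> &w) {
  assert(!w.empty());
  double acc = 0.0;
  for (double wi : w) {
    assert(wi > 0.0);  // log is only defined for positive loop values
    acc -= 2.0 * std::log(wi);
  }
  return acc / static_cast<double>(w.size());
}
```

This is the mean of the logarithms, which in general differs from the logarithm of the mean loop value; the sketch follows the driver rather than choosing between the two.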
@@ -41,6 +41,7 @@ Author: paboyle <paboyle@ph.ed.ac.uk>
 #include <Grid/GridCore.h>
 #include <Grid/GridQCDcore.h>
 #include <Grid/qcd/action/Action.h>
 #include <Grid/qcd/utils/GaugeFix.h>
 #include <Grid/qcd/smearing/Smearing.h>
 #include <Grid/parallelIO/MetaData.h>
 #include <Grid/qcd/hmc/HMC_aggregate.h>
@@ -1,137 +0,0 @@
    /*************************************************************************************
    Grid physics library, www.github.com/paboyle/Grid 
    Source file: ./lib/algorithms/iterative/DenseMatrix.h
    Copyright (C) 2015
 Author: Peter Boyle <paboyle@ph.ed.ac.uk>
 Author: paboyle <paboyle@ph.ed.ac.uk>
    This program is free software; you can redistribute it and/or modify
    it under the terms of the GNU General Public License as published by
    the Free Software Foundation; either version 2 of the License, or
    (at your option) any later version.
    This program is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    GNU General Public License for more details.
    You should have received a copy of the GNU General Public License along
    with this program; if not, write to the Free Software Foundation, Inc.,
    51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
    See the full license in the file "LICENSE" in the top level distribution directory
    *************************************************************************************/
    /*  END LEGAL */
 #ifndef GRID_DENSE_MATRIX_H
 #define GRID_DENSE_MATRIX_H
 namespace Grid {
    /////////////////////////////////////////////////////////////
    // Matrix utilities
    /////////////////////////////////////////////////////////////
 template<class T> using DenseVector = std::vector<T>;
 template<class T> using DenseMatrix = DenseVector<DenseVector<T> >;
 template<class T> void Size(DenseVector<T> & vec, int &N) 
 { 
  N= vec.size();
 }
 template<class T> void Size(DenseMatrix<T> & mat, int &N,int &M) 
 { 
  N= mat.size();
  M= mat[0].size();
 }
 template<class T> void SizeSquare(DenseMatrix<T> & mat, int &N) 
 { 
  int M; Size(mat,N,M);
  assert(N==M);
 }
 template<class T> void Resize(DenseVector<T > & mat, int N) { 
  mat.resize(N);
 }
 template<class T> void Resize(DenseMatrix<T > & mat, int N, int M) { 
  mat.resize(N);
  for(int i=0;i<N;i++){
    mat[i].resize(M);
  }
 }
 template<class T> void Fill(DenseMatrix<T> & mat, T&val) { 
  int N,M;
  Size(mat,N,M);
  for(int i=0;i<N;i++){
  for(int j=0;j<M;j++){
    mat[i][j] = val;
  }}
 }
 /** Transpose of a matrix **/
 template<class T> DenseMatrix<T> Transpose(DenseMatrix<T> & mat){
  int N,M;
  Size(mat,N,M);
  DenseMatrix<T> C; Resize(C,M,N);
  for(int i=0;i<M;i++){
  for(int j=0;j<N;j++){
    C[i][j] = mat[j][i];
  }} 
  return C;
 }
 /** Set DenseMatrix to unit matrix **/
 template<class T> void Unity(DenseMatrix<T> &A){
  int N;  SizeSquare(A,N);
  for(int i=0;i<N;i++){
    for(int j=0;j<N;j++){
      if ( i==j ) A[i][j] = 1;
      else        A[i][j] = 0;
    } 
  } 
 }
 /** Add C * I to matrix **/
 template<class T>
 void PlusUnit(DenseMatrix<T> & A,T c){
  int dim;  SizeSquare(A,dim);
  for(int i=0;i<dim;i++){A[i][i] = A[i][i] + c;} 
 }
 /** return the Hermitian conjugate of matrix **/
 template<class T>
 DenseMatrix<T> HermitianConj(DenseMatrix<T> &mat){
  int dim; SizeSquare(mat,dim);
  DenseMatrix<T> C; Resize(C,dim,dim);
  for(int i=0;i<dim;i++){
    for(int j=0;j<dim;j++){
      C[i][j] = conj(mat[j][i]);
    } 
  } 
  return C;
 }
 /** Get the submatrix of rows [row_st,row_end) and columns [col_st,col_end); need not be square **/
 template <class T>
 DenseMatrix<T> GetSubMtx(DenseMatrix<T> &A,int row_st, int row_end, int col_st, int col_end)
 {
  DenseMatrix<T> H; Resize(H,row_end - row_st,col_end-col_st);
  for(int i = row_st; i<row_end; i++){
  for(int j = col_st; j<col_end; j++){
    H[i-row_st][j-col_st]=A[i][j];
  }}
  return H;
 }
 }
 #include "Householder.h"
 #include "Francis.h"
 #endif
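In the removed DenseMatrix.h, `HermitianConj` goes through `SizeSquare` and so only accepts square input, and `Size` reads `mat[0]`, which is undefined for an empty matrix. A sketch of a conjugate transpose that also handles rectangular and empty input, using the same vector-of-vectors aliases, could look as follows. `adjointRect` and `Cplx` are hypothetical names, and the element type is fixed to `std::complex<double>` for brevity.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

using Cplx = std::complex<double>;
template <class T> using DenseVector = std::vector<T>;
template <class T> using DenseMatrix = DenseVector<DenseVector<T>>;

// Conjugate transpose of a rows x cols matrix; returns a cols x rows matrix.
inline DenseMatrix<Cplx> adjointRect(const DenseMatrix<Cplx> &mat) {
  const std::size_t rows = mat.size();
  const std::size_t cols = rows ? mat[0].size() : 0;  // guard the empty case
  DenseMatrix<Cplx> out(cols, DenseVector<Cplx>(rows));
  for (std::size_t i = 0; i < rows; ++i) {
    assert(mat[i].size() == cols);  // every row must have the same length
    for (std::size_t j = 0; j < cols; ++j) out[j][i] = std::conj(mat[i][j]);
  }
  return out;
}
```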
@@ -1,525 +0,0 @@
    /*************************************************************************************
    Grid physics library, www.github.com/paboyle/Grid 
    Source file: ./lib/algorithms/iterative/Francis.h
    Copyright (C) 2015
 Author: Peter Boyle <paboyle@ph.ed.ac.uk>
    This program is free software; you can redistribute it and/or modify
    it under the terms of the GNU General Public License as published by
    the Free Software Foundation; either version 2 of the License, or
    (at your option) any later version.
    This program is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    GNU General Public License for more details.
    You should have received a copy of the GNU General Public License along
    with this program; if not, write to the Free Software Foundation, Inc.,
    51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
    See the full license in the file "LICENSE" in the top level distribution directory
    *************************************************************************************/
    /*  END LEGAL */
 #ifndef FRANCIS_H
 #define FRANCIS_H
 #include <cstdlib>
 #include <string>
 #include <cmath>
 #include <iostream>
 #include <sstream>
 #include <stdexcept>
 #include <fstream>
 #include <complex>
 #include <algorithm>
 //#include <timer.h>
 //#include <lapacke.h>
 //#include <Eigen/Dense>
 namespace Grid {
 template <class T> int SymmEigensystem(DenseMatrix<T > &Ain, DenseVector<T> &evals, DenseMatrix<T> &evecs, RealD small);
 template <class T> int     Eigensystem(DenseMatrix<T > &Ain, DenseVector<T> &evals, DenseMatrix<T> &evecs, RealD small);
 /**
  Find the eigenvalues of an upper Hessenberg matrix using the Francis QR algorithm.
 H =
      x  x  x  x  x  x  x  x  x
      x  x  x  x  x  x  x  x  x
      0  x  x  x  x  x  x  x  x
      0  0  x  x  x  x  x  x  x
      0  0  0  x  x  x  x  x  x
      0  0  0  0  x  x  x  x  x
      0  0  0  0  0  x  x  x  x
      0  0  0  0  0  0  x  x  x
      0  0  0  0  0  0  0  x  x
 Factorization is P T P^H where T is upper triangular (mod cc blocks) and P is orthogonal/unitary.
 **/
 template <class T>
 int QReigensystem(DenseMatrix<T> &Hin, DenseVector<T> &evals, DenseMatrix<T> &evecs, RealD small)
 {
  DenseMatrix<T> H = Hin; 
  int N ; SizeSquare(H,N);
  int M = N;
  Fill(evals,0);
  Fill(evecs,0);
  T s,t,x=0,y=0,z=0;
  T u,d;
  T apd,amd,bc;
  DenseVector<T> p(N,0);
  T nrm = Norm(H);    ///DenseMatrix Norm
  int n, m;
  int e = 0;
  int it = 0;
  int tot_it = 0;
  int l = 0;
  int r = 0;
  DenseMatrix<T> P; Resize(P,N,N); Unity(P);
  DenseVector<int> trows(N,0);
  /// Check if the matrix is really hessenberg, if not abort
  RealD sth = 0;
  for(int j=0;j<N;j++){
    for(int i=j+2;i<N;i++){
      sth = abs(H[i][j]);
      if(sth > small){
 	std::cout << "Non hessenberg H = " << sth << " > " << small << std::endl;
 	exit(1);
      }
    }
  }
  do{
    std::cout << "Francis QR Step N = " << N << std::endl;
    /** Check for convergence
      x  x  x  x  x
      0  x  x  x  x
      0  0  x  x  x
      0  0  x  x  x
      0  0  0  0  x
      for this matrix l = 4
     **/
    do{
      l = Chop_subdiag(H,nrm,e,small);
      r = 0;    ///May have converged on more than one eval
      ///Single eval
      if(l == N-1){
        evals[e] = H[l][l];
        N--; e++; r++; it = 0;
      }
      ///Double eval (trailing 2x2 block)
      if(l == N-2){
        trows[l+1] = 1;    ///Needed for UTSolve
        apd = H[l][l] + H[l+1][l+1];
        amd = H[l][l] - H[l+1][l+1];
        bc =  (T)4.0*H[l+1][l]*H[l][l+1];
        evals[e]   = (T)0.5*( apd + sqrt(amd*amd + bc) );
        evals[e+1] = (T)0.5*( apd - sqrt(amd*amd + bc) );
        N-=2; e+=2; r++; it = 0;
      }
    } while(r>0);
    if(N ==0) break;
    DenseVector<T > ck; Resize(ck,3);
    DenseVector<T> v;   Resize(v,3);
    for(int m = N-3; m >= l; m--){
      ///Starting vector essentially random shift.
      if(it%10 == 0 && N >= 3 && it > 0){
        s = (T)1.618033989*( abs( H[N-1][N-2] ) + abs( H[N-2][N-3] ) );
        t = (T)0.618033989*( abs( H[N-1][N-2] ) + abs( H[N-2][N-3] ) );
        x = H[m][m]*H[m][m] + H[m][m+1]*H[m+1][m] - s*H[m][m] + t;
        y = H[m+1][m]*(H[m][m] + H[m+1][m+1] - s);
        z = H[m+1][m]*H[m+2][m+1];
      }
      ///Starting vector implicit Q theorem
      else{
        s = (H[N-2][N-2] + H[N-1][N-1]);
        t = (H[N-2][N-2]*H[N-1][N-1] - H[N-2][N-1]*H[N-1][N-2]);
        x = H[m][m]*H[m][m] + H[m][m+1]*H[m+1][m] - s*H[m][m] + t;
        y = H[m+1][m]*(H[m][m] + H[m+1][m+1] - s);
        z = H[m+1][m]*H[m+2][m+1];
      }
      ck[0] = x; ck[1] = y; ck[2] = z;
      if(m == l) break;
      /** Some stupid thing from Numerical Recipes, seems to work**/
      // PAB.. for heaven's sake quote page, purpose, evidence it works.
      //       what sort of comment is that!?!?!?
      u=abs(H[m][m-1])*(abs(y)+abs(z));
      d=abs(x)*(abs(H[m-1][m-1])+abs(H[m][m])+abs(H[m+1][m+1]));
      if ((T)abs(u+d) == (T)abs(d) ){
 	l = m; break;
      }
      //if (u < small){l = m; break;}
    }
    if(it > 100000){
     std::cout << "QReigensystem: bugger it got stuck after 100000 iterations" << std::endl;
     std::cout << "got " << e << " evals " << l << " " << N << std::endl;
      exit(1);
    }
    normalize(ck);    ///Normalization cancels in PHP anyway
    T beta;
    Householder_vector<T >(ck, 0, 2, v, beta);
    Householder_mult<T >(H,v,beta,0,l,l+2,0);
    Householder_mult<T >(H,v,beta,0,l,l+2,1);
    ///Accumulate eigenvector
    Householder_mult<T >(P,v,beta,0,l,l+2,1);
    int sw = 0;      ///Are we on the last row?
    for(int k=l;k<N-2;k++){
      x = H[k+1][k];
      y = H[k+2][k];
      z = (T)0.0;
      if(k+3 <= N-1){
 	z = H[k+3][k];
      } else{
 	sw = 1; 
 	v[2] = (T)0.0;
      }
      ck[0] = x; ck[1] = y; ck[2] = z;
      normalize(ck);
      Householder_vector<T >(ck, 0, 2-sw, v, beta);
      Householder_mult<T >(H,v, beta,0,k+1,k+3-sw,0);
      Householder_mult<T >(H,v, beta,0,k+1,k+3-sw,1);
      ///Accumulate eigenvector
      Householder_mult<T >(P,v, beta,0,k+1,k+3-sw,1);
    }
    it++;
    tot_it++;
  }while(N > 1);
  N = evals.size();
  ///Annoying - UT solves in reverse order;
  DenseVector<T> tmp; Resize(tmp,N);
  for(int i=0;i<N;i++){
    tmp[i] = evals[N-i-1];
  } 
  evals = tmp;
  UTeigenvectors(H, trows, evals, evecs);
  for(int i=0;i<evals.size();i++){evecs[i] = P*evecs[i]; normalize(evecs[i]);}
  return tot_it;
 }
 template <class T>
 int my_Wilkinson(DenseMatrix<T> &Hin, DenseVector<T> &evals, DenseMatrix<T> &evecs, RealD small)
 {
  /**
  Find the eigenvalues of an upper Hessenberg matrix using the Wilkinson QR algorithm.
  H =
  x  x  0  0  0  0
  x  x  x  0  0  0
  0  x  x  x  0  0
  0  0  x  x  x  0
  0  0  0  x  x  x
  0  0  0  0  x  x
  Factorization is P T P^H where T is upper triangular (mod cc blocks) and P is orthogonal/unitary.  **/
  return my_Wilkinson(Hin, evals, evecs, small, small);
 }
 template <class T>
 int my_Wilkinson(DenseMatrix<T> &Hin, DenseVector<T> &evals, DenseMatrix<T> &evecs, RealD small, RealD tol)
 {
  int N; SizeSquare(Hin,N);
  int M = N;
  ///I don't want to modify the input but matrices must be passed by reference
  //Scale a matrix by its "norm"
  //RealD Hnorm = abs( Hin.LargestDiag() ); H =  H*(1.0/Hnorm);
  DenseMatrix<T> H;  H = Hin;
  RealD Hnorm = abs(Norm(Hin));
  H = H * (1.0 / Hnorm);
  // TODO use openmp and memset
  Fill(evals,0);
  Fill(evecs,0);
  T s, t, x = 0, y = 0, z = 0;
  T u, d;
  T apd, amd, bc;
  DenseVector<T> p; Resize(p,N); Fill(p,0);
  T nrm = Norm(H);    ///DenseMatrix Norm
  int n, m;
  int e = 0;
  int it = 0;
  int tot_it = 0;
  int l = 0;
  int r = 0;
  DenseMatrix<T> P; Resize(P,N,N);
  Unity(P);
  DenseVector<int> trows(N, 0);
  /// Check if the matrix is really symm tridiag
  RealD sth = 0;
  for(int j = 0; j < N; ++j)
  {
    for(int i = j + 2; i < N; ++i)
    {
      if(abs(H[i][j]) > tol || abs(H[j][i]) > tol)
      {
 	std::cout << "Non Tridiagonal H(" << i << ","<< j << ") = |" << Real( real( H[j][i] ) ) << "| > " << tol << std::endl;
 	std::cout << "Warning tridiagonalize and call again" << std::endl;
        // exit(1); // see what is going on
        //return;
      }
    }
  }
  do{
    do{
      //Jasper
      //Check if the subdiagonal term is small enough (<small)
      //if true then it is converged.
      //check start from H.dim - e - 1
      //How to deal with the case where more than 2 have converged?
      //What if Chop_symm_subdiag returns something in the middle?
      //--------------
      l = Chop_symm_subdiag(H,nrm, e, small);
      r = 0;    ///May have converged on more than one eval
      //Jasper
      //In this case
      // x  x  0  0  0  0
      // x  x  x  0  0  0
      // 0  x  x  x  0  0
      // 0  0  x  x  x  0
      // 0  0  0  x  x  0
      // 0  0  0  0  0  x  <- l
      //--------------
      ///Single eval
      if(l == N - 1)
      {
        evals[e] = H[l][l];
        N--;
        e++;
        r++;
        it = 0;
      }
      //Jasper
      // x  x  0  0  0  0
      // x  x  x  0  0  0
      // 0  x  x  x  0  0
      // 0  0  x  x  0  0
      // 0  0  0  0  x  x  <- l
      // 0  0  0  0  x  x
      //--------------
      ///Double eval (trailing 2x2 block)
      if(l == N - 2)
      {
        trows[l + 1] = 1;    ///Needed for UTSolve
        apd = H[l][l] + H[l + 1][ l + 1];
        amd = H[l][l] - H[l + 1][l + 1];
        bc =  (T) 4.0 * H[l + 1][l] * H[l][l + 1];
        evals[e] = (T) 0.5 * (apd + sqrt(amd * amd + bc));
        evals[e + 1] = (T) 0.5 * (apd - sqrt(amd * amd + bc));
        N -= 2;
        e += 2;
        r++;
        it = 0;
      }
    }while(r > 0);
    //Jasper
    //Already converged
    //--------------
    if(N == 0) break;
    DenseVector<T> ck,v; Resize(ck,2); Resize(v,2);
    for(int m = N - 3; m >= l; m--)
    {
      ///Starting vector essentially random shift.
      if(it%10 == 0 && N >= 3 && it > 0)
      {
        t = abs(H[N - 1][N - 2]) + abs(H[N - 2][N - 3]);
        x = H[m][m] - t;
        z = H[m + 1][m];
      } else {
      ///Starting vector implicit Q theorem
        d = (H[N - 2][N - 2] - H[N - 1][N - 1]) * (T) 0.5;
        t =  H[N - 1][N - 1] - H[N - 1][N - 2] * H[N - 1][N - 2] 
 	  / (d + sign(d) * sqrt(d * d + H[N - 1][N - 2] * H[N - 1][N - 2]));
        x = H[m][m] - t;
        z = H[m + 1][m];
      }
      //Jasper
      //why is it here????
      //-----------------------
      if(m == l)
        break;
      u = abs(H[m][m - 1]) * (abs(y) + abs(z));
      d = abs(x) * (abs(H[m - 1][m - 1]) + abs(H[m][m]) + abs(H[m + 1][m + 1]));
      if ((T)abs(u + d) == (T)abs(d))
      {
        l = m;
        break;
      }
    }
    //Jasper
    if(it > 1000000)
    {
      std::cout << "Wilkinson: bugger it got stuck after 1000000 iterations" << std::endl;
      std::cout << "got " << e << " evals " << l << " " << N << std::endl;
      exit(1);
    }
    //
    T s, c;
    Givens_calc<T>(x, z, c, s);
    Givens_mult<T>(H, l, l + 1, c, -s, 0);
    Givens_mult<T>(H, l, l + 1, c,  s, 1);
    Givens_mult<T>(P, l, l + 1, c,  s, 1);
    //
    for(int k = l; k < N - 2; ++k)
    {
      x = H[k + 1][k];
      z = H[k + 2][k];
      Givens_calc<T>(x, z, c, s);
      Givens_mult<T>(H, k + 1, k + 2, c, -s, 0);
      Givens_mult<T>(H, k + 1, k + 2, c,  s, 1);
      Givens_mult<T>(P, k + 1, k + 2, c,  s, 1);
    }
    it++;
    tot_it++;
  }while(N > 1);
  N = evals.size();
  ///Annoying - UT solves in reverse order;
  DenseVector<T> tmp(N);
  for(int i = 0; i < N; ++i)
    tmp[i] = evals[N-i-1];
  evals = tmp;
  //
  UTeigenvectors(H, trows, evals, evecs);
  //UTSymmEigenvectors(H, trows, evals, evecs);
  for(int i = 0; i < evals.size(); ++i)
  {
    evecs[i] = P * evecs[i];
    normalize(evecs[i]);
    evals[i] = evals[i] * Hnorm;
  }
  // // FIXME this is to test
  // Hin.write("evecs3", evecs);
  // Hin.write("evals3", evals);
  // // check rsd
  // for(int i = 0; i < M; i++) {
  //   vector<T> Aevec = Hin * evecs[i];
  //   RealD norm2(0.);
  //   for(int j = 0; j < M; j++) {
  //     norm2 += (Aevec[j] - evals[i] * evecs[i][j]) * (Aevec[j] - evals[i] * evecs[i][j]);
  //   }
  // }
  return tot_it;
 }
 template <class T>
 void Hess(DenseMatrix<T > &A, DenseMatrix<T> &Q, int start){
  /**
  turn a matrix A =
  x  x  x  x  x
  x  x  x  x  x
  x  x  x  x  x
  x  x  x  x  x
  x  x  x  x  x
  into
  x  x  x  x  x
  x  x  x  x  x
  0  x  x  x  x
  0  0  x  x  x
  0  0  0  x  x
  with householder rotations
  Slow.
  */
  int N ; SizeSquare(A,N);
  DenseVector<T > p; Resize(p,N); Fill(p,0);
  for(int k=start;k<N-2;k++){
    //cerr << "hess" << k << std::endl;
    DenseVector<T > ck,v; Resize(ck,N-k-1); Resize(v,N-k-1);
    for(int i=k+1;i<N;i++){ck[i-k-1] = A[i][k];}  ///kth column
    normalize(ck);    ///Normalization cancels in PHP anyway
    T beta;
    Householder_vector<T >(ck, 0, ck.size()-1, v, beta);  ///Householder vector
    Householder_mult<T>(A,v,beta,start,k+1,N-1,0);  ///A -> PA
    Householder_mult<T >(A,v,beta,start,k+1,N-1,1);  ///PA -> PAP^H
    ///Accumulate eigenvector
    Householder_mult<T >(Q,v,beta,start,k+1,N-1,1);  ///Q -> QP^H
  }
  /*for(int l=0;l<N-2;l++){
    for(int k=l+2;k<N;k++){
    A(0,k,l);
    }
    }*/
 }
 template <class T>
 void Tri(DenseMatrix<T > &A, DenseMatrix<T> &Q, int start){
 ///Tridiagonalize a matrix
  int N; SizeSquare(A,N);
  Hess(A,Q,start);
  /*for(int l=0;l<N-2;l++){
    for(int k=l+2;k<N;k++){
    A(0,l,k);
    }
    }*/
 }
 template <class T>
 void ForceTridiagonal(DenseMatrix<T> &A){
 ///Force a matrix to tridiagonal form by zeroing every entry outside the three central diagonals
  int N ; SizeSquare(A,N);
  for(int l=0;l<N-2;l++){
    for(int k=l+2;k<N;k++){
      A[l][k]=0;
      A[k][l]=0;
    }
  }
 }
 template <class T>
 int my_SymmEigensystem(DenseMatrix<T > &Ain, DenseVector<T> &evals, DenseVector<DenseVector<T> > &evecs, RealD small){
  ///Solve a symmetric eigensystem, not necessarily in tridiagonal form
  int N; SizeSquare(Ain,N);
  DenseMatrix<T > A; A = Ain;
  DenseMatrix<T > Q; Resize(Q,N,N); Unity(Q);
  Tri(A,Q,0);
  int it = my_Wilkinson<T>(A, evals, evecs, small);
  for(int k=0;k<N;k++){evecs[k] = Q*evecs[k];}
  return it;
 }
 template <class T>
 int Wilkinson(DenseMatrix<T> &Ain, DenseVector<T> &evals, DenseVector<DenseVector<T> > &evecs, RealD small){
  return my_Wilkinson(Ain, evals, evecs, small);
 }
 template <class T>
 int SymmEigensystem(DenseMatrix<T> &Ain, DenseVector<T> &evals, DenseVector<DenseVector<T> > &evecs, RealD small){
  return my_SymmEigensystem(Ain, evals, evecs, small);
 }
 template <class T>
 int Eigensystem(DenseMatrix<T > &Ain, DenseVector<T> &evals, DenseVector<DenseVector<T> > &evecs, RealD small){
 ///Solve a general eigensystem, not necessarily in Hessenberg form
  int N; SizeSquare(Ain,N);
  DenseMatrix<T > A; A = Ain;
  DenseMatrix<T > Q; Resize(Q,N,N); Unity(Q);
  Hess(A,Q,0);
  int it = QReigensystem<T>(A, evals, evecs, small);
  for(int k=0;k<N;k++){evecs[k] = Q*evecs[k];}
  return it;
 }
 }
 #endif
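Both QReigensystem and my_Wilkinson in the removed Francis.h deflate a trailing 2x2 block with the closed form evals = 0.5*(apd +/- sqrt(amd*amd + bc)), where apd and amd are the sum and difference of the diagonal entries and bc is four times the product of the off-diagonal entries. The sketch below reproduces that formula on its own, using `std::complex<double>` so that complex-conjugate pairs are representable. `eig2x2` and `Cplx` are hypothetical names, not part of Grid.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <utility>

using Cplx = std::complex<double>;

// Eigenvalues of the 2x2 block [[a, b], [c, d]] via the quadratic formula,
// written with the same intermediate names (apd, amd, bc) as the source.
inline std::pair<Cplx, Cplx> eig2x2(Cplx a, Cplx b, Cplx c, Cplx d) {
  const Cplx apd = a + d;        // trace
  const Cplx amd = a - d;        // difference of the diagonal entries
  const Cplx bc = 4.0 * b * c;   // four times the off-diagonal product
  const Cplx root = std::sqrt(amd * amd + bc);
  return {0.5 * (apd + root), 0.5 * (apd - root)};
}
```

A quick sanity check for any input is that the two values sum to the trace a + d and multiply to the determinant a*d - b*c.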
@@ -1,242 +0,0 @@
    /*************************************************************************************
    Grid physics library, www.github.com/paboyle/Grid 
    Source file: ./lib/algorithms/iterative/Householder.h
    Copyright (C) 2015
 Author: Peter Boyle <paboyle@ph.ed.ac.uk>
    This program is free software; you can redistribute it and/or modify
    it under the terms of the GNU General Public License as published by
    the Free Software Foundation; either version 2 of the License, or
    (at your option) any later version.
    This program is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    GNU General Public License for more details.
    You should have received a copy of the GNU General Public License along
    with this program; if not, write to the Free Software Foundation, Inc.,
    51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
    See the full license in the file "LICENSE" in the top level distribution directory
    *************************************************************************************/
    /*  END LEGAL */
 #ifndef HOUSEHOLDER_H
 #define HOUSEHOLDER_H
 #define TIMER(A) std::cout << GridLogMessage << __FUNC__ << " file "<< __FILE__ <<" line " << __LINE__ << std::endl;
 #define ENTER()  std::cout << GridLogMessage << "ENTRY "<<__FUNC__ << " file "<< __FILE__ <<" line " << __LINE__ << std::endl;
 #define LEAVE()  std::cout << GridLogMessage << "EXIT  "<<__FUNC__ << " file "<< __FILE__ <<" line " << __LINE__ << std::endl;
 #include <cstdlib>
 #include <string>
 #include <cmath>
 #include <iostream>
 #include <sstream>
 #include <stdexcept>
 #include <fstream>
 #include <complex>
 #include <algorithm>
 namespace Grid {
 /** Comparison function for finding the max element in a vector **/
 template <class T> bool cf(T i, T j) { 
  return abs(i) < abs(j); 
 }
 /** 
 	Calculate a real Givens angle 
 **/
 template <class T> inline void Givens_calc(T y, T z, T &c, T &s){
  RealD mz = (RealD)abs(z);
  if(mz==0.0){
    c = 1; s = 0;
    return; // nothing to rotate; also avoids 0/0 below when y is zero too
  }
  if(mz >= (RealD)abs(y)){
    T t = -y/z;
    s = (T)1.0 / sqrt ((T)1.0 + t * t);
    c = s * t;
  } else {
    T t = -z/y;
    c = (T)1.0 / sqrt ((T)1.0 + t * t);
    s = c * t;
  }
 }
 template <class T> inline void Givens_mult(DenseMatrix<T> &A,  int i, int k, T c, T s, int dir)
 {
  int q ; SizeSquare(A,q);
  if(dir == 0){
    for(int j=0;j<q;j++){
      T nu = A[i][j];
      T w  = A[k][j];
      A[i][j] = (c*nu + s*w);
      A[k][j] = (-s*nu + c*w);
    }
  }
  if(dir == 1){
    for(int j=0;j<q;j++){
      T nu = A[j][i];
      T w  = A[j][k];
      A[j][i] = (c*nu - s*w);
      A[j][k] = (s*nu + c*w);
    }
  }
 }
 /**
 	from input = x;
 	Compute the complex Householder vector, v, such that
 	P = (I - b v transpose(v) )
 	b = 2/v.v
 	P | x |    | x | k = 0
 	| x |    | 0 | 
 	| x | =  | 0 |
 	| x |    | 0 | j = 3
 	| x |	   | x |
 	These are the "Unreduced" Householder vectors.
 **/
 template <class T> inline void Householder_vector(DenseVector<T> input, int k, int j, DenseVector<T> &v, T &beta)
 {
  int N ; Size(input,N);
  T m = *max_element(input.begin() + k, input.begin() + j + 1, cf<T> );
  if(abs(m) > 0.0){
    T alpha = 0;
    for(int i=k; i<j+1; i++){
      v[i] = input[i]/m;
      alpha = alpha + v[i]*conj(v[i]);
    }
    alpha = sqrt(alpha);
    beta = (T)1.0/(alpha*(alpha + abs(v[k]) ));
    if(abs(v[k]) > 0.0)  v[k] = v[k] + (v[k]/abs(v[k]))*alpha;
    else                 v[k] = -alpha;
  } else{
    for(int i=k; i<j+1; i++){
      v[i] = 0.0;
    } 
  }
 }
 /**
 	from input = x;
 	Compute the complex Householder vector, v, such that
 	P = (I - b v transpose(v) )
 	b = 2/v.v
 	Px = alpha*e_dir
 	These are the "Unreduced" Householder vectors.
 **/
 template <class T> inline void Householder_vector(DenseVector<T> input, int k, int j, int dir, DenseVector<T> &v, T &beta)
 {
  int N = input.size();
  T m = *max_element(input.begin() + k, input.begin() + j + 1, cf<T> );
  if(abs(m) > 0.0){
    T alpha = 0;
    for(int i=k; i<j+1; i++){
      v[i] = input[i]/m;
      alpha = alpha + v[i]*conj(v[i]);
    }
    alpha = sqrt(alpha);
    beta = (T)1.0/(alpha*(alpha + abs(v[dir]) ));
    if(abs(v[dir]) > 0.0) v[dir] = v[dir] + (v[dir]/abs(v[dir]))*alpha;
    else                  v[dir] = -alpha;
  }else{
    for(int i=k; i<j+1; i++){
      v[i] = 0.0;
    } 
  }
 }
 /**
 	Compute the product PA if trans = 0
 	AP if trans = 1
 	P = (I - b v transpose(v) )
 	b = 2/v.v
 	start at element l of matrix A
 	the first j - k + 1 elements of v are nonzero
 **/
 template <class T> inline void Householder_mult(DenseMatrix<T> &A , DenseVector<T> v, T beta, int l, int k, int j, int trans)
 {
  int N ; SizeSquare(A,N);
  if(abs(beta) > 0.0){
    for(int p=l; p<N; p++){
      T s = 0;
      if(trans==0){
 	for(int i=k;i<j+1;i++) s += conj(v[i-k])*A[i][p];
 	s *= beta;
 	for(int i=k;i<j+1;i++){ A[i][p] = A[i][p]-s*conj(v[i-k]);}
      } else {
 	for(int i=k;i<j+1;i++){ s += conj(v[i-k])*A[p][i];}
 	s *= beta;
 	for(int i=k;i<j+1;i++){ A[p][i]=A[p][i]-s*conj(v[i-k]);}
      }
    }
  }
 }
 /**
 	Compute the product PA if trans = 0
 	AP if trans = 1
 	P = (I - b v transpose(v) )
 	b = 2/v.v
 	start at element l of matrix A
 	the first j - k + 1 elements of v are nonzero
 	A is tridiagonal
 **/
 template <class T> inline void Householder_mult_tri(DenseMatrix<T> &A , DenseVector<T> v, T beta, int l, int M, int k, int j, int trans)
 {
  if(abs(beta) > 0.0){
    int N ; SizeSquare(A,N);
    DenseMatrix<T> tmp; Resize(tmp,N,N); Fill(tmp,0); 
    T s;
    for(int p=l; p<M; p++){
      s = 0;
      if(trans==0){
 	for(int i=k;i<j+1;i++) s = s + conj(v[i-k])*A[i][p];
      }else{
 	for(int i=k;i<j+1;i++) s = s + v[i-k]*A[p][i];
      }
      s = beta*s;
      if(trans==0){
 	for(int i=k;i<j+1;i++) tmp[i][p] = tmp[i][p] - s*v[i-k];
      }else{
 	for(int i=k;i<j+1;i++) tmp[p][i] = tmp[p][i] - s*conj(v[i-k]);
      }
    }
    for(int p=l; p<M; p++){
      if(trans==0){
 	for(int i=k;i<j+1;i++) A[i][p] = A[i][p] + tmp[i][p];
      }else{
 	for(int i=k;i<j+1;i++) A[p][i] = A[p][i] + tmp[p][i];
      }
    }
  }
 }
 }
 #endif
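Givens_calc in the removed Householder.h picks between two branches depending on which of |y| and |z| is larger, and my_Wilkinson then applies the result with `Givens_mult(H, l, l + 1, c, -s, 0)`. The sketch below restates that for plain `double`, with an explicit early return when z is zero, and applies the rotation with the same sign convention so that the second component is eliminated. `givensCalc` and `givensApply` are hypothetical names for illustration only.

```cpp
#include <cassert>
#include <cmath>

// Real Givens coefficients following the branch structure of Givens_calc.
inline void givensCalc(double y, double z, double &c, double &s) {
  assert(std::isfinite(y) && std::isfinite(z));
  if (z == 0.0) {  // nothing to rotate away; also avoids 0/0 when y == 0
    c = 1.0;
    s = 0.0;
    return;
  }
  if (std::abs(z) >= std::abs(y)) {
    const double t = -y / z;  // |t| <= 1, so 1 + t*t cannot overflow
    s = 1.0 / std::sqrt(1.0 + t * t);
    c = s * t;
  } else {
    const double t = -z / y;
    c = 1.0 / std::sqrt(1.0 + t * t);
    s = c * t;
  }
}

// Apply [[c, -s], [s, c]] to (y, z): the row update performed by
// Givens_mult(A, i, k, c, -s, 0) in the source.
inline void givensApply(double c, double s, double &y, double &z) {
  const double y0 = y, z0 = z;
  y = c * y0 - s * z0;
  z = s * y0 + c * z0;
}
```

With (y, z) = (3, 4) the rotation leaves a vector of length 5 in the first component and zero in the second; the sign of the first component depends on which branch was taken.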
@@ -33,6 +33,8 @@ directory
 namespace Grid {
 enum BlockCGtype { BlockCG, BlockCGrQ, CGmultiRHS };
 //////////////////////////////////////////////////////////////////////////
 // Block conjugate gradient. Dimension zero should be the block direction
 //////////////////////////////////////////////////////////////////////////
@@ -40,25 +42,273 @@ template <class Field>
 class BlockConjugateGradient : public OperatorFunction<Field> {
 public:
  typedef typename Field::scalar_type scomplex;
-  const int blockDim = 0;
+  int blockDim ;
  int Nblock;
  BlockCGtype CGtype;
  bool ErrorOnNoConverge;  // throw an assert when the CG fails to converge.
                           // Defaults true.
  RealD Tolerance;
  Integer MaxIterations;
  Integer IterationsToComplete; //Number of iterations the CG took to finish. Filled in upon completion
-  BlockConjugateGradient(RealD tol, Integer maxit, bool err_on_no_conv = true)
+  BlockConjugateGradient(BlockCGtype cgtype,int _Orthog,RealD tol, Integer maxit, bool err_on_no_conv = true)
-    : Tolerance(tol),
+    : Tolerance(tol), CGtype(cgtype),   blockDim(_Orthog),  MaxIterations(maxit), ErrorOnNoConverge(err_on_no_conv)
-    MaxIterations(maxit),
+  {};
    ErrorOnNoConverge(err_on_no_conv){};
 ////////////////////////////////////////////////////////////////////////////////////////////////////
 // Thin QR factorisation (google it)
 ////////////////////////////////////////////////////////////////////////////////////////////////////
 void ThinQRfact (Eigen::MatrixXcd &m_rr,
 		 Eigen::MatrixXcd &C,
 		 Eigen::MatrixXcd &Cinv,
 		 Field & Q,
 		 const Field & R)
 {
  int Orthog = blockDim; // First dimension is block dim; this is an assumption
  ////////////////////////////////////////////////////////////////////////////////////////////////////
  //Dimensions
  // R_{ferm x Nblock} =  Q_{ferm x Nblock} x  C_{Nblock x Nblock} -> ferm x Nblock
  //
  // Rdag R = m_rr = Herm = L L^dag        <-- Cholesky decomposition (LLT routine in Eigen)
  //
  //   Q  C = R => Q = R C^{-1}
  //
  // Want  Ident = Q^dag Q = C^{-dag} R^dag R C^{-1} = C^{-dag} L L^dag C^{-1} = 1_{Nblock x Nblock} 
  //
  // Set C = L^{dag}, and then Q^dag Q = ident 
  //
  // Checks:
  // Cdag C = Rdag R ; passes.
  // QdagQ  = 1      ; passes
  ////////////////////////////////////////////////////////////////////////////////////////////////////
  sliceInnerProductMatrix(m_rr,R,R,Orthog);
  ////////////////////////////////////////////////////////////////////////////////////////////////////
  // Cholesky from Eigen
  // There exists a ldlt that is documented as more stable
  ////////////////////////////////////////////////////////////////////////////////////////////////////
  Eigen::MatrixXcd L    = m_rr.llt().matrixL(); 
  C    = L.adjoint();
  Cinv = C.inverse();
  ////////////////////////////////////////////////////////////////////////////////////////////////////
  // Q = R C^{-1}
  //
  // Q_j  = R_i Cinv(i,j) 
  //
  // NB maddMatrix conventions are Right multiplication X[j] a[j,i] already
  ////////////////////////////////////////////////////////////////////////////////////////////////////
 // sliceMulMatrix is used here so that no zero vector is needed as an addend
  sliceMulMatrix(Q,Cinv,R,Orthog);
 }
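
The Cholesky-based thin QR step described in the comments of `ThinQRfact` can be sketched on small dense matrices. This is a minimal illustration using only the standard library, not the Grid or Eigen implementation: the names `gram`, `cholesky` and `thinQ` are hypothetical, the matrices are plain nested `std::vector`s, and the columns of `R` are assumed to be linearly independent so that `R^dag R` is positive definite.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <vector>

using cplx = std::complex<double>;
using Mat  = std::vector<std::vector<cplx>>;  // row-major: Mat[row][col]

// Gram matrix G = R^dag R for a tall matrix R (n rows, k columns).
Mat gram(const Mat &R) {
  const std::size_t n = R.size(), k = R[0].size();
  Mat G(k, std::vector<cplx>(k, cplx(0.0)));
  for (std::size_t i = 0; i < k; i++)
    for (std::size_t j = 0; j < k; j++)
      for (std::size_t s = 0; s < n; s++)
        G[i][j] += std::conj(R[s][i]) * R[s][j];
  return G;
}

// Lower-triangular L with G = L L^dag (G Hermitian positive definite).
Mat cholesky(const Mat &G) {
  const std::size_t k = G.size();
  Mat L(k, std::vector<cplx>(k, cplx(0.0)));
  for (std::size_t j = 0; j < k; j++) {
    double d = G[j][j].real();
    for (std::size_t p = 0; p < j; p++) d -= std::norm(L[j][p]);
    L[j][j] = std::sqrt(d);  // diagonal of L is real and positive
    for (std::size_t i = j + 1; i < k; i++) {
      cplx s = G[i][j];
      for (std::size_t p = 0; p < j; p++) s -= L[i][p] * std::conj(L[j][p]);
      L[i][j] = s / L[j][j];
    }
  }
  return L;
}

// Q = R C^{-1} with C = L^dag (upper triangular), solved row by row
// from Q C = R by substitution over the columns.
Mat thinQ(const Mat &R, const Mat &L) {
  const std::size_t n = R.size(), k = L.size();
  Mat Q(n, std::vector<cplx>(k, cplx(0.0)));
  for (std::size_t s = 0; s < n; s++)
    for (std::size_t j = 0; j < k; j++) {
      cplx acc = R[s][j];
      for (std::size_t p = 0; p < j; p++) acc -= Q[s][p] * std::conj(L[j][p]);
      Q[s][j] = acc / L[j][j];
    }
  return Q;
}
```

Calling `thinQ(R, cholesky(gram(R)))` and then `gram` on the result should give the identity up to rounding, which is the `QdagQ = 1` check listed in the comment block above.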
 ////////////////////////////////////////////////////////////////////////////////////////////////////
 // Call one of several implementations
 ////////////////////////////////////////////////////////////////////////////////////////////////////
 void operator()(LinearOperatorBase<Field> &Linop, const Field &Src, Field &Psi) 
 {
-  int Orthog = 0; // First dimension is block dim
+  if ( CGtype == BlockCGrQ ) {
    BlockCGrQsolve(Linop,Src,Psi);
  } else if (CGtype == BlockCG ) {
    BlockCGsolve(Linop,Src,Psi);
  } else if (CGtype == CGmultiRHS ) {
    CGmultiRHSsolve(Linop,Src,Psi);
  } else {
    assert(0);
  }
 }
 ////////////////////////////////////////////////////////////////////////////
 // BlockCGrQ implementation:
 //--------------------------
 // X is guess/Solution
 // B is RHS
 // Solve A X_i = B_i    ;        i refers to Nblock index
 ////////////////////////////////////////////////////////////////////////////
 void BlockCGrQsolve(LinearOperatorBase<Field> &Linop, const Field &B, Field &X) 
 {
  int Orthog = blockDim; // First dimension is block dim; this is an assumption
  Nblock = B._grid->_fdimensions[Orthog];
  std::cout<<GridLogMessage<<" Block Conjugate Gradient : Orthog "<<Orthog<<" Nblock "<<Nblock<<std::endl;
  X.checkerboard = B.checkerboard;
  conformable(X, B);
  Field tmp(B);
  Field Q(B);
  Field D(B);
  Field Z(B);
  Field AD(B);
  Eigen::MatrixXcd m_DZ     = Eigen::MatrixXcd::Identity(Nblock,Nblock);
  Eigen::MatrixXcd m_M      = Eigen::MatrixXcd::Identity(Nblock,Nblock);
  Eigen::MatrixXcd m_rr     = Eigen::MatrixXcd::Zero(Nblock,Nblock);
  Eigen::MatrixXcd m_C      = Eigen::MatrixXcd::Zero(Nblock,Nblock);
  Eigen::MatrixXcd m_Cinv   = Eigen::MatrixXcd::Zero(Nblock,Nblock);
  Eigen::MatrixXcd m_S      = Eigen::MatrixXcd::Zero(Nblock,Nblock);
  Eigen::MatrixXcd m_Sinv   = Eigen::MatrixXcd::Zero(Nblock,Nblock);
  Eigen::MatrixXcd m_tmp    = Eigen::MatrixXcd::Identity(Nblock,Nblock);
  Eigen::MatrixXcd m_tmp1   = Eigen::MatrixXcd::Identity(Nblock,Nblock);
  // Initial residual computation & set up
  std::vector<RealD> residuals(Nblock);
  std::vector<RealD> ssq(Nblock);
  sliceNorm(ssq,B,Orthog);
  RealD sssum=0;
  for(int b=0;b<Nblock;b++) sssum+=ssq[b];
  sliceNorm(residuals,B,Orthog);
  for(int b=0;b<Nblock;b++){ assert(std::isnan(residuals[b])==0); }
  sliceNorm(residuals,X,Orthog);
  for(int b=0;b<Nblock;b++){ assert(std::isnan(residuals[b])==0); }
  /************************************************************************
   * Block conjugate gradient rQ (Sebastien Birk Thesis, after Dubrulle 2001)
   ************************************************************************
   * Dimensions:
   *
   *   X,B==(Nferm x Nblock)
   *   A==(Nferm x Nferm)
   *  
   * Nferm = Nspin x Ncolour x Ncomplex x Nlattice_site
   * 
   * QC = R = B-AX, D = Q     ; QC => Thin QR factorisation (google it)
   * for k: 
   *   Z  = AD
   *   M  = [D^dag Z]^{-1}
   *   X  = X + D MC
   *   QS = Q - ZM
   *   D  = Q + D S^dag
   *   C  = S C
   */
  ///////////////////////////////////////
  // Initial block: initial search dir is guess
  ///////////////////////////////////////
  std::cout << GridLogMessage<<"BlockCGrQ algorithm initialisation " <<std::endl;
  //1.  QC = R = B-AX, D = Q     ; QC => Thin QR factorisation (google it)
  Linop.HermOp(X, AD);
  tmp = B - AD;  
  ThinQRfact (m_rr, m_C, m_Cinv, Q, tmp);
  D=Q;
  std::cout << GridLogMessage<<"BlockCGrQ computed initial residual and QR fact " <<std::endl;
  ///////////////////////////////////////
  // Timers
  ///////////////////////////////////////
  GridStopWatch sliceInnerTimer;
  GridStopWatch sliceMaddTimer;
  GridStopWatch QRTimer;
  GridStopWatch MatrixTimer;
  GridStopWatch SolverTimer;
  SolverTimer.Start();
  int k;
  for (k = 1; k <= MaxIterations; k++) {
    //3. Z  = AD
    MatrixTimer.Start();
    Linop.HermOp(D, Z);      
    MatrixTimer.Stop();
    //4. M  = [D^dag Z]^{-1}
    sliceInnerTimer.Start();
    sliceInnerProductMatrix(m_DZ,D,Z,Orthog);
    sliceInnerTimer.Stop();
    m_M       = m_DZ.inverse();
    //5. X  = X + D MC
    m_tmp     = m_M * m_C;
    sliceMaddTimer.Start();
    sliceMaddMatrix(X,m_tmp, D,X,Orthog);     
    sliceMaddTimer.Stop();
    //6. QS = Q - ZM
    sliceMaddTimer.Start();
    sliceMaddMatrix(tmp,m_M,Z,Q,Orthog,-1.0);
    sliceMaddTimer.Stop();
    QRTimer.Start();
    ThinQRfact (m_rr, m_S, m_Sinv, Q, tmp);
    QRTimer.Stop();
    //7. D  = Q + D S^dag
    m_tmp = m_S.adjoint();
    sliceMaddTimer.Start();
    sliceMaddMatrix(D,m_tmp,D,Q,Orthog);
    sliceMaddTimer.Stop();
    //8. C  = S C
    m_C = m_S*m_C;
    /*********************
     * convergence monitor
     *********************
     */
    m_rr = m_C.adjoint() * m_C;
    RealD max_resid=0;
    RealD rrsum=0;
    RealD rr;
    for(int b=0;b<Nblock;b++) {
      rrsum+=real(m_rr(b,b));
      rr = real(m_rr(b,b))/ssq[b];
      if ( rr > max_resid ) max_resid = rr;
    }
    std::cout << GridLogIterative << "\titeration "<<k<<" rr_sum "<<rrsum<<" ssq_sum "<< sssum
 	      <<" ave "<<std::sqrt(rrsum/sssum) << " max "<< max_resid <<std::endl;
    if ( max_resid < Tolerance*Tolerance ) { 
      SolverTimer.Stop();
      std::cout << GridLogMessage<<"BlockCGrQ converged in "<<k<<" iterations"<<std::endl;
      for(int b=0;b<Nblock;b++){
 	std::cout << GridLogMessage<< "\t\tblock "<<b<<" computed resid "
 		  << std::sqrt(real(m_rr(b,b))/ssq[b])<<std::endl;
      }
      std::cout << GridLogMessage<<"\tMax residual is "<<std::sqrt(max_resid)<<std::endl;
      Linop.HermOp(X, AD);
      AD = AD-B;
      std::cout << GridLogMessage <<"\t True residual is " << std::sqrt(norm2(AD)/norm2(B)) <<std::endl;
      std::cout << GridLogMessage << "Time Breakdown "<<std::endl;
      std::cout << GridLogMessage << "\tElapsed    " << SolverTimer.Elapsed()     <<std::endl;
      std::cout << GridLogMessage << "\tMatrix     " << MatrixTimer.Elapsed()     <<std::endl;
      std::cout << GridLogMessage << "\tInnerProd  " << sliceInnerTimer.Elapsed() <<std::endl;
      std::cout << GridLogMessage << "\tMaddMatrix " << sliceMaddTimer.Elapsed()  <<std::endl;
      std::cout << GridLogMessage << "\tThinQRfact " << QRTimer.Elapsed()  <<std::endl;
      IterationsToComplete = k;
      return;
    }
  }
  std::cout << GridLogMessage << "BlockConjugateGradient(rQ) did NOT converge" << std::endl;
  if (ErrorOnNoConverge) assert(0);
  IterationsToComplete = k;
 }
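
The stopping rule used by the convergence monitor above can be isolated as a small helper. This is a sketch under assumptions: `maxRelResidualSq` and `converged` are hypothetical names, `rr` stands for the diagonal of `C^dag C` (the squared residual norm of each right-hand side) and `ssq` for the squared source norms.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Largest squared relative residual over all right-hand sides.
double maxRelResidualSq(const std::vector<double> &rr,
                        const std::vector<double> &ssq) {
  double max_resid = 0.0;
  for (std::size_t b = 0; b < rr.size(); b++)
    max_resid = std::max(max_resid, rr[b] / ssq[b]);
  return max_resid;
}

// Converged only when every block satisfies |r_b| / |b_b| < tol.
// Comparing squared quantities avoids a square root per block.
bool converged(const std::vector<double> &rr,
               const std::vector<double> &ssq, double tol) {
  return maxRelResidualSq(rr, ssq) < tol * tol;
}
```

Because the test uses the maximum over blocks, the solver keeps iterating until the slowest right-hand side has reached the tolerance.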
 //////////////////////////////////////////////////////////////////////////
 // Block conjugate gradient (original O'Leary algorithm). Dimension zero should be the block direction.
 //////////////////////////////////////////////////////////////////////////
 void BlockCGsolve(LinearOperatorBase<Field> &Linop, const Field &Src, Field &Psi) 
 {
  int Orthog = blockDim; // First dimension is block dim; this is an assumption
  Nblock = Src._grid->_fdimensions[Orthog];
  std::cout<<GridLogMessage<<" Block Conjugate Gradient : Orthog "<<Orthog<<" Nblock "<<Nblock<<std::endl;
@@ -162,8 +412,9 @@ void operator()(LinearOperatorBase<Field> &Linop, const Field &Src, Field &Psi)
     *********************
     */
    RealD max_resid=0;
    RealD rr;
    for(int b=0;b<Nblock;b++){
-      RealD rr = real(m_rr(b,b))/ssq[b];
+      rr = real(m_rr(b,b))/ssq[b];
      if ( rr > max_resid ) max_resid = rr;
    }
@@ -173,13 +424,14 @@ void operator()(LinearOperatorBase<Field> &Linop, const Field &Src, Field &Psi)
      std::cout << GridLogMessage<<"BlockCG converged in "<<k<<" iterations"<<std::endl;
      for(int b=0;b<Nblock;b++){
-	std::cout << GridLogMessage<< "\t\tblock "<<b<<" resid "<< std::sqrt(real(m_rr(b,b))/ssq[b])<<std::endl;
+	std::cout << GridLogMessage<< "\t\tblock "<<b<<" computed resid "
 		  << std::sqrt(real(m_rr(b,b))/ssq[b])<<std::endl;
      }
      std::cout << GridLogMessage<<"\tMax residual is "<<std::sqrt(max_resid)<<std::endl;
      Linop.HermOp(Psi, AP);
      AP = AP-Src;
-      std::cout << GridLogMessage <<"\tTrue residual is " << std::sqrt(norm2(AP)/norm2(Src)) <<std::endl;
+      std::cout << GridLogMessage <<"\t True residual is " << std::sqrt(norm2(AP)/norm2(Src)) <<std::endl;
      std::cout << GridLogMessage << "Time Breakdown "<<std::endl;
      std::cout << GridLogMessage << "\tElapsed    " << SolverTimer.Elapsed()     <<std::endl;
@@ -197,35 +449,13 @@ void operator()(LinearOperatorBase<Field> &Linop, const Field &Src, Field &Psi)
  if (ErrorOnNoConverge) assert(0);
  IterationsToComplete = k;
 }
 };
 //////////////////////////////////////////////////////////////////////////
 // multiRHS conjugate gradient. Dimension zero should be the block direction
 // Use this when the block direction is spread out across nodes
 //////////////////////////////////////////////////////////////////////////
-template <class Field>
+void CGmultiRHSsolve(LinearOperatorBase<Field> &Linop, const Field &Src, Field &Psi) 
 class MultiRHSConjugateGradient : public OperatorFunction<Field> {
 public:
  typedef typename Field::scalar_type scomplex;
  const int blockDim = 0;
  int Nblock;
  bool ErrorOnNoConverge;  // throw an assert when the CG fails to converge.
                           // Defaults true.
  RealD Tolerance;
  Integer MaxIterations;
  Integer IterationsToComplete; //Number of iterations the CG took to finish. Filled in upon completion
   MultiRHSConjugateGradient(RealD tol, Integer maxit, bool err_on_no_conv = true)
    : Tolerance(tol),
    MaxIterations(maxit),
    ErrorOnNoConverge(err_on_no_conv){};
 void operator()(LinearOperatorBase<Field> &Linop, const Field &Src, Field &Psi) 
 {
-  int Orthog = 0; // First dimension is block dim
+  int Orthog = blockDim; // First dimension is block dim
  Nblock = Src._grid->_fdimensions[Orthog];
  std::cout<<GridLogMessage<<"MultiRHS Conjugate Gradient : Orthog "<<Orthog<<" Nblock "<<Nblock<<std::endl;
@@ -285,12 +515,10 @@ void operator()(LinearOperatorBase<Field> &Linop, const Field &Src, Field &Psi)
    MatrixTimer.Stop();
    // Alpha
    //    sliceInnerProductVectorTest(v_pAp_test,P,AP,Orthog);
    sliceInnerTimer.Start();
    sliceInnerProductVector(v_pAp,P,AP,Orthog);
    sliceInnerTimer.Stop();
    for(int b=0;b<Nblock;b++){
      //      std::cout << " "<< v_pAp[b]<<" "<< v_pAp_test[b]<<std::endl;
      v_alpha[b] = v_rr[b]/real(v_pAp[b]);
    }
@@ -332,7 +560,7 @@ void operator()(LinearOperatorBase<Field> &Linop, const Field &Src, Field &Psi)
      std::cout << GridLogMessage<<"MultiRHS solver converged in " <<k<<" iterations"<<std::endl;
      for(int b=0;b<Nblock;b++){
-	std::cout << GridLogMessage<< "\t\tBlock "<<b<<" resid "<< std::sqrt(v_rr[b]/ssq[b])<<std::endl;
+	std::cout << GridLogMessage<< "\t\tBlock "<<b<<" computed resid "<< std::sqrt(v_rr[b]/ssq[b])<<std::endl;
      }
      std::cout << GridLogMessage<<"\tMax residual is "<<std::sqrt(max_resid)<<std::endl;
@@ -358,9 +586,8 @@ void operator()(LinearOperatorBase<Field> &Linop, const Field &Src, Field &Psi)
  if (ErrorOnNoConverge) assert(0);
  IterationsToComplete = k;
 }
 };
 }
 #endif
@@ -1,81 +0,0 @@
    /*************************************************************************************
    Grid physics library, www.github.com/paboyle/Grid 
    Source file: ./lib/algorithms/iterative/EigenSort.h
    Copyright (C) 2015
 Author: Peter Boyle <paboyle@ph.ed.ac.uk>
    This program is free software; you can redistribute it and/or modify
    it under the terms of the GNU General Public License as published by
    the Free Software Foundation; either version 2 of the License, or
    (at your option) any later version.
    This program is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    GNU General Public License for more details.
    You should have received a copy of the GNU General Public License along
    with this program; if not, write to the Free Software Foundation, Inc.,
    51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
    See the full license in the file "LICENSE" in the top level distribution directory
    *************************************************************************************/
    /*  END LEGAL */
 #ifndef GRID_EIGENSORT_H
 #define GRID_EIGENSORT_H
 namespace Grid {
    /////////////////////////////////////////////////////////////
    // Eigen sorter to begin with
    /////////////////////////////////////////////////////////////
 template<class Field>
 class SortEigen {
 private:
 //hacking for testing for now
 private:
  static bool less_lmd(RealD left,RealD right){
    return left > right;
  }  
  static bool less_pair(std::pair<RealD,Field const*>& left,
                        std::pair<RealD,Field const*>& right){
    return left.first > (right.first);
  }  
 public:
  void push(DenseVector<RealD>& lmd,
            DenseVector<Field>& evec,int N) {
    DenseVector<Field> cpy(lmd.size(),evec[0]._grid);
    for(int i=0;i<lmd.size();i++) cpy[i] = evec[i];
    DenseVector<std::pair<RealD, Field const*> > emod(lmd.size());    
    for(int i=0;i<lmd.size();++i)
      emod[i] = std::pair<RealD,Field const*>(lmd[i],&cpy[i]);
    partial_sort(emod.begin(),emod.begin()+N,emod.end(),less_pair);
    typename DenseVector<std::pair<RealD, Field const*> >::iterator it = emod.begin();
    for(int i=0;i<N;++i){
      lmd[i]=it->first;
      evec[i]=*(it->second);
      ++it;
    }
  }
  void push(DenseVector<RealD>& lmd,int N) {
    std::partial_sort(lmd.begin(),lmd.begin()+N,lmd.end(),less_lmd);
  }
  bool saturated(RealD lmd, RealD thrs) {
    return fabs(lmd) > fabs(thrs);
  }
 };
 }
 #endif
@@ -98,7 +98,14 @@ public:
 #else
    if ( ptr == (_Tp *) NULL ) ptr = (_Tp *) memalign(128,bytes);
 #endif
-
+    // First touch optimise in threaded loop
    uint8_t *cp = (uint8_t *)ptr;
 #ifdef GRID_OMP
 #pragma omp parallel for
 #endif
    for(size_type n=0;n<bytes;n+=4096){
      cp[n]=0;
    }
    return ptr;
  }
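
The first-touch loop added in this hunk can be shown in isolation. This is a minimal sketch, assuming a 4096-byte page and a first-touch NUMA placement policy on the host; `firstTouch` is a hypothetical helper, not part of the Grid allocator. Without OpenMP enabled, the loop simply runs serially.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Write one byte in every page of a fresh buffer from a threaded loop, so
// that each page is first touched by the thread whose chunk of the loop
// covers it. Only the first byte of each page is modified.
void firstTouch(void *ptr, std::size_t bytes, std::size_t page = 4096) {
  std::uint8_t *cp = static_cast<std::uint8_t *>(ptr);
#ifdef _OPENMP
#pragma omp parallel for
#endif
  for (std::size_t n = 0; n < bytes; n += page) {
    cp[n] = 0;
  }
}
```

The benefit depends on later compute loops using a similar thread-to-index mapping, so that threads mostly read and write pages they touched first.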
@@ -186,6 +193,12 @@ public:
 #else
    _Tp * ptr = (_Tp *) memalign(128,__n*sizeof(_Tp));
 #endif
    size_type bytes = __n*sizeof(_Tp);
    uint8_t *cp = (uint8_t *)ptr;
 #ifdef GRID_OMP
 #pragma omp parallel for
 #endif
    for(size_type n=0;n<bytes;n+=4096){
      cp[n]=0;
    }
    return ptr;
  }
  void deallocate(pointer __p, size_type) { 
@@ -1,4 +1,4 @@
- /*************************************************************************************
+/*************************************************************************************
    Grid physics library, www.github.com/paboyle/Grid 
    Source file: ./lib/lattice/Lattice_reduction.h
    Copyright (C) 2015
@@ -369,71 +369,6 @@ static void sliceMaddVector(Lattice<vobj> &R,std::vector<RealD> &a,const Lattice
  }
 };
 /*
 template<class vobj>
 static void sliceMaddVectorSlow (Lattice<vobj> &R,std::vector<RealD> &a,const Lattice<vobj> &X,const Lattice<vobj> &Y,
 			     int Orthog,RealD scale=1.0) 
 {    
  // FIXME: Implementation is slow
  // Best base the linear combination by constructing a 
  // set of vectors of size grid->_rdimensions[Orthog].
  typedef typename vobj::scalar_object sobj;
  typedef typename vobj::scalar_type scalar_type;
  typedef typename vobj::vector_type vector_type;
  int Nblock = X._grid->GlobalDimensions()[Orthog];
  GridBase *FullGrid  = X._grid;
  GridBase *SliceGrid = makeSubSliceGrid(FullGrid,Orthog);
  Lattice<vobj> Xslice(SliceGrid);
  Lattice<vobj> Rslice(SliceGrid);
  // If we based this on Cshift it would work for spread out
  // but it would be even slower
  for(int i=0;i<Nblock;i++){
    ExtractSlice(Rslice,Y,i,Orthog);
    ExtractSlice(Xslice,X,i,Orthog);
    Rslice = Rslice + Xslice*(scale*a[i]);
    InsertSlice(Rslice,R,i,Orthog);
  }
 };
 template<class vobj>
 static void sliceInnerProductVectorSlow( std::vector<ComplexD> & vec, const Lattice<vobj> &lhs,const Lattice<vobj> &rhs,int Orthog) 
  {
    // FIXME: Implementation is slow
    // Look at localInnerProduct implementation,
    // and do inside a site loop with block strided iterators
    typedef typename vobj::scalar_object sobj;
    typedef typename vobj::scalar_type scalar_type;
    typedef typename vobj::vector_type vector_type;
    typedef typename vobj::tensor_reduced scalar;
    typedef typename scalar::scalar_object  scomplex;
    int Nblock = lhs._grid->GlobalDimensions()[Orthog];
    vec.resize(Nblock);
    std::vector<scomplex> sip(Nblock);
    Lattice<scalar> IP(lhs._grid); 
    IP=localInnerProduct(lhs,rhs);
    sliceSum(IP,sip,Orthog);
    for(int ss=0;ss<Nblock;ss++){
      vec[ss] = TensorRemove(sip[ss]);
    }
  }
 */
 //////////////////////////////////////////////////////////////////////////////////////////
 // FIXME: Implementation is slow
 // If we based this on Cshift it would work for spread out
 // but it would be even slower
 //
 // Repeated extract slice is inefficient
 //
 // Best base the linear combination by constructing a 
 // set of vectors of size grid->_rdimensions[Orthog].
 //////////////////////////////////////////////////////////////////////////////////////////
 inline GridBase         *makeSubSliceGrid(const GridBase *BlockSolverGrid,int Orthog)
 {
  int NN    = BlockSolverGrid->_ndimension;
@@ -453,7 +388,6 @@ inline GridBase         *makeSubSliceGrid(const GridBase *BlockSolverGrid,int Or
  return (GridBase *)new GridCartesian(latt_phys,simd_phys,mpi_phys); 
 }
 template<class vobj>
 static void sliceMaddMatrix (Lattice<vobj> &R,Eigen::MatrixXcd &aa,const Lattice<vobj> &X,const Lattice<vobj> &Y,int Orthog,RealD scale=1.0) 
 {    
@@ -462,28 +396,103 @@ static void sliceMaddMatrix (Lattice<vobj> &R,Eigen::MatrixXcd &aa,const Lattice
  typedef typename vobj::vector_type vector_type;
  int Nblock = X._grid->GlobalDimensions()[Orthog];
-  
+
  GridBase *FullGrid  = X._grid;
  GridBase *SliceGrid = makeSubSliceGrid(FullGrid,Orthog);
-  
+
  Lattice<vobj> Xslice(SliceGrid);
  Lattice<vobj> Rslice(SliceGrid);
-  
+
-  for(int i=0;i<Nblock;i++){
+  assert( FullGrid->_simd_layout[Orthog]==1);
-    ExtractSlice(Rslice,Y,i,Orthog);
+  int nh =  FullGrid->_ndimension;
-    for(int j=0;j<Nblock;j++){
+  int nl = SliceGrid->_ndimension;
-      ExtractSlice(Xslice,X,j,Orthog);
+
-      Rslice = Rslice + Xslice*(scale*aa(j,i));
+  //FIXME package in a convenient iterator
-    }
+  //Should loop over a plane orthogonal to direction "Orthog"
-    InsertSlice(Rslice,R,i,Orthog);
+  int stride=FullGrid->_slice_stride[Orthog];
  int block =FullGrid->_slice_block [Orthog];
  int nblock=FullGrid->_slice_nblock[Orthog];
  int ostride=FullGrid->_ostride[Orthog];
 #pragma omp parallel 
  {
    std::vector<vobj> s_x(Nblock);
 #pragma omp for collapse(2)
    for(int n=0;n<nblock;n++){
    for(int b=0;b<block;b++){
      int o  = n*stride + b;
      for(int i=0;i<Nblock;i++){
 	s_x[i] = X[o+i*ostride];
      }
      vobj dot;
      for(int i=0;i<Nblock;i++){
 	dot = Y[o+i*ostride];
 	for(int j=0;j<Nblock;j++){
 	  dot = dot + s_x[j]*(scale*aa(j,i));
 	}
 	R[o+i*ostride]=dot;
      }
    }}
  }
 };
 template<class vobj>
 static void sliceMulMatrix (Lattice<vobj> &R,Eigen::MatrixXcd &aa,const Lattice<vobj> &X,int Orthog,RealD scale=1.0) 
 {    
  typedef typename vobj::scalar_object sobj;
  typedef typename vobj::scalar_type scalar_type;
  typedef typename vobj::vector_type vector_type;
  int Nblock = X._grid->GlobalDimensions()[Orthog];
  GridBase *FullGrid  = X._grid;
  GridBase *SliceGrid = makeSubSliceGrid(FullGrid,Orthog);
  Lattice<vobj> Xslice(SliceGrid);
  Lattice<vobj> Rslice(SliceGrid);
  assert( FullGrid->_simd_layout[Orthog]==1);
  int nh =  FullGrid->_ndimension;
  int nl = SliceGrid->_ndimension;
  //FIXME package in a convenient iterator
  //Should loop over a plane orthogonal to direction "Orthog"
  int stride=FullGrid->_slice_stride[Orthog];
  int block =FullGrid->_slice_block [Orthog];
  int nblock=FullGrid->_slice_nblock[Orthog];
  int ostride=FullGrid->_ostride[Orthog];
 #pragma omp parallel 
  {
    std::vector<vobj> s_x(Nblock);
 #pragma omp for collapse(2)
    for(int n=0;n<nblock;n++){
    for(int b=0;b<block;b++){
      int o  = n*stride + b;
      for(int i=0;i<Nblock;i++){
 	s_x[i] = X[o+i*ostride];
      }
      vobj dot;
      for(int i=0;i<Nblock;i++){
 	dot = s_x[0]*(scale*aa(0,i));
 	for(int j=1;j<Nblock;j++){
 	  dot = dot + s_x[j]*(scale*aa(j,i));
 	}
 	R[o+i*ostride]=dot;
      }
    }}
  }
 };
 template<class vobj>
 static void sliceInnerProductMatrix(  Eigen::MatrixXcd &mat, const Lattice<vobj> &lhs,const Lattice<vobj> &rhs,int Orthog) 
 {
  // FIXME: Implementation is slow
  // Not sure of best solution.. think about it
  typedef typename vobj::scalar_object sobj;
  typedef typename vobj::scalar_type scalar_type;
  typedef typename vobj::vector_type vector_type;
@@ -497,22 +506,49 @@ static void sliceInnerProductMatrix(  Eigen::MatrixXcd &mat, const Lattice<vobj>
  Lattice<vobj> Rslice(SliceGrid);
  mat = Eigen::MatrixXcd::Zero(Nblock,Nblock);
-  
+
-  for(int i=0;i<Nblock;i++){
+  assert( FullGrid->_simd_layout[Orthog]==1);
-    ExtractSlice(Lslice,lhs,i,Orthog);
+  int nh =  FullGrid->_ndimension;
-    for(int j=0;j<Nblock;j++){
+  int nl = SliceGrid->_ndimension;
-      ExtractSlice(Rslice,rhs,j,Orthog);
+
-      mat(i,j) = innerProduct(Lslice,Rslice);
+  //FIXME package in a convenient iterator
-    }
+  //Should loop over a plane orthogonal to direction "Orthog"
  int stride=FullGrid->_slice_stride[Orthog];
  int block =FullGrid->_slice_block [Orthog];
  int nblock=FullGrid->_slice_nblock[Orthog];
  int ostride=FullGrid->_ostride[Orthog];
  typedef typename vobj::vector_typeD vector_typeD;
 #pragma omp parallel 
  {
    std::vector<vobj> Left(Nblock);
    std::vector<vobj> Right(Nblock);
    Eigen::MatrixXcd  mat_thread = Eigen::MatrixXcd::Zero(Nblock,Nblock);
 #pragma omp for collapse(2)
    for(int n=0;n<nblock;n++){
    for(int b=0;b<block;b++){
      int o  = n*stride + b;
      for(int i=0;i<Nblock;i++){
 	Left [i] = lhs[o+i*ostride];
 	Right[i] = rhs[o+i*ostride];
      }
      for(int i=0;i<Nblock;i++){
      for(int j=0;j<Nblock;j++){
 	auto tmp = innerProduct(Left[i],Right[j]);
 	vector_typeD rtmp = TensorRemove(tmp);
 	mat_thread(i,j) += Reduce(rtmp);
      }}
    }}
 #pragma omp critical
    {
      mat += mat_thread;
    }  
  }
 #undef FORCE_DIAG
 #ifdef FORCE_DIAG
  for(int i=0;i<Nblock;i++){
    for(int j=0;j<Nblock;j++){
      if ( i != j ) mat(i,j)=0.0;
    }
  }
 #endif
  return;
 }
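
The reduction pattern in `sliceInnerProductMatrix` (a private accumulator per thread, merged inside a critical section) can be sketched without Grid types. The function name `blockInnerProduct` and the flat `std::vector` layout are assumptions made for illustration; each "field" is just a vector of complex site values.

```cpp
#include <cassert>
#include <complex>
#include <vector>

using cplx = std::complex<double>;

// mat[i*nblock + j] = sum over sites s of conj(lhs[i][s]) * rhs[j][s]
std::vector<cplx> blockInnerProduct(const std::vector<std::vector<cplx>> &lhs,
                                    const std::vector<std::vector<cplx>> &rhs) {
  const int  nblock = static_cast<int>(lhs.size());
  const long nsite  = static_cast<long>(lhs[0].size());
  std::vector<cplx> mat(nblock * nblock, cplx(0.0));
#pragma omp parallel
  {
    // Private accumulator: no synchronisation is needed inside the site loop.
    std::vector<cplx> mat_thread(nblock * nblock, cplx(0.0));
#pragma omp for
    for (long s = 0; s < nsite; s++)
      for (int i = 0; i < nblock; i++)
        for (int j = 0; j < nblock; j++)
          mat_thread[i * nblock + j] += std::conj(lhs[i][s]) * rhs[j][s];
    // Each thread merges its partial sums exactly once.
#pragma omp critical
    {
      for (int e = 0; e < nblock * nblock; e++) mat[e] += mat_thread[e];
    }
  }
  return mat;
}
```

The critical section is entered once per thread rather than once per site, so its cost is small compared with the site loop. Compiled without OpenMP, the pragmas are ignored and the code runs as a plain serial sum.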
@@ -40,7 +40,7 @@ const PerformanceCounter::PerformanceCounterConfig PerformanceCounter::Performan
  { PERF_TYPE_HARDWARE, PERF_COUNT_HW_CPU_CYCLES          ,  "CPUCYCLES.........." , INSTRUCTIONS},
  { PERF_TYPE_HARDWARE, PERF_COUNT_HW_INSTRUCTIONS        ,  "INSTRUCTIONS......." , CPUCYCLES   },
    // 4
-#ifdef AVX512
+#ifdef KNL
    { PERF_TYPE_RAW, RawConfig(0x40,0x04), "ALL_LOADS..........", CPUCYCLES    },
    { PERF_TYPE_RAW, RawConfig(0x01,0x04), "L1_MISS_LOADS......", L1D_READ_ACCESS  },
    { PERF_TYPE_RAW, RawConfig(0x40,0x04), "ALL_LOADS..........", L1D_READ_ACCESS    },
@@ -237,4 +237,11 @@ typedef ImprovedStaggeredFermion5D<StaggeredVec5dImplD> ImprovedStaggeredFermion
  }}
 ////////////////////
 // Scalar QED actions
 // TODO: this needs to move to another header after rename to Fermion.h
 ////////////////////
 #include <Grid/qcd/action/scalar/Scalar.h>
 #include <Grid/qcd/action/gauge/Photon.h>
 #endif
@@ -0,0 +1,286 @@
 /*************************************************************************************
 Grid physics library, www.github.com/paboyle/Grid
 Source file: ./lib/qcd/action/gauge/Photon.h
 Copyright (C) 2015
 Author: Peter Boyle <paboyle@ph.ed.ac.uk>
 This program is free software; you can redistribute it and/or modify
 it under the terms of the GNU General Public License as published by
 the Free Software Foundation; either version 2 of the License, or
 (at your option) any later version.
 This program is distributed in the hope that it will be useful,
 but WITHOUT ANY WARRANTY; without even the implied warranty of
 MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 GNU General Public License for more details.
 You should have received a copy of the GNU General Public License along
 with this program; if not, write to the Free Software Foundation, Inc.,
 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
 See the full license in the file "LICENSE" in the top level distribution directory
 *************************************************************************************/
 /*  END LEGAL */
 #ifndef QCD_PHOTON_ACTION_H
 #define QCD_PHOTON_ACTION_H
 namespace Grid{
 namespace QCD{
  template <class S>
  class QedGimpl
  {
  public:
    typedef S Simd;
    template <typename vtype>
    using iImplGaugeLink  = iScalar<iScalar<iScalar<vtype>>>;
    template <typename vtype>
    using iImplGaugeField = iVector<iScalar<iScalar<vtype>>, Nd>;
    typedef iImplGaugeLink<Simd>  SiteLink;
    typedef iImplGaugeField<Simd> SiteField;
    typedef SiteField             SiteComplex;
    typedef Lattice<SiteLink>  LinkField;
    typedef Lattice<SiteField> Field;
    typedef Field              ComplexField;
  };
  typedef QedGimpl<vComplex> QedGimplR;
  template<class Gimpl>
  class Photon
  {
  public:
    INHERIT_GIMPL_TYPES(Gimpl);
    GRID_SERIALIZABLE_ENUM(Gauge, undef, feynman, 1, coulomb, 2, landau, 3);
    GRID_SERIALIZABLE_ENUM(ZmScheme, undef, qedL, 1, qedTL, 2);
  public:
    Photon(Gauge gauge, ZmScheme zmScheme);
    virtual ~Photon(void) = default;
    void FreePropagator(const GaugeField &in, GaugeField &out);
    void MomentumSpacePropagator(const GaugeField &in, GaugeField &out);
    void StochasticWeight(GaugeLinkField &weight);
    void StochasticField(GaugeField &out, GridParallelRNG &rng);
    void StochasticField(GaugeField &out, GridParallelRNG &rng,
                         const GaugeLinkField &weight);
  private:
    void invKHatSquared(GaugeLinkField &out);
    void zmSub(GaugeLinkField &out);
  private:
    Gauge    gauge_;
    ZmScheme zmScheme_;
  };
  typedef Photon<QedGimplR>  PhotonR;
  template<class Gimpl>
  Photon<Gimpl>::Photon(Gauge gauge, ZmScheme zmScheme)
  : gauge_(gauge), zmScheme_(zmScheme)
  {}
  template<class Gimpl>
  void Photon<Gimpl>::FreePropagator (const GaugeField &in,GaugeField &out)
  {
    FFT theFFT(in._grid);
    GaugeField in_k(in._grid);
    GaugeField prop_k(in._grid);
    theFFT.FFT_all_dim(in_k,in,FFT::forward);
    MomentumSpacePropagator(prop_k,in_k);
    theFFT.FFT_all_dim(out,prop_k,FFT::backward);
  }
  template<class Gimpl>
  void Photon<Gimpl>::invKHatSquared(GaugeLinkField &out)
  {
    GridBase           *grid = out._grid;
    GaugeLinkField     kmu(grid), one(grid);
    const unsigned int nd    = grid->_ndimension;
    std::vector<int>   &l    = grid->_fdimensions;
    std::vector<int>   zm(nd,0);
    TComplex           Tone = Complex(1.0,0.0);
    TComplex           Tzero= Complex(0.0,0.0);
    one = Complex(1.0,0.0);
    out = zero;
    for(int mu = 0; mu < nd; mu++)
    {
      Real twoPiL = M_PI*2./l[mu];
      LatticeCoordinate(kmu,mu);
      kmu = 2.*sin(.5*twoPiL*kmu);
      out = out + kmu*kmu;
    }
    pokeSite(Tone, out, zm);
    out = one/out;
    pokeSite(Tzero, out, zm);
  }
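
For a single momentum, the quantity computed by `invKHatSquared` reduces to a few lines of scalar code. This is a sketch only: the free function below is a hypothetical stand-in that takes an integer momentum `n` and lattice extents `L` directly, whereas the Grid version operates on whole lattice fields.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// 1 / khat^2 with khat_mu = 2 sin(pi n_mu / L_mu).
// The zero mode has khat^2 = 0, so it is set to 0 rather than divided out,
// mirroring the pokeSite calls before and after the inversion.
double invKHatSquared(const std::vector<int> &n, const std::vector<int> &L) {
  const double pi = std::acos(-1.0);
  double k2 = 0.0;
  bool zeroMode = true;
  for (std::size_t mu = 0; mu < n.size(); mu++) {
    if (n[mu] % L[mu] != 0) zeroMode = false;
    const double khat = 2.0 * std::sin(pi * n[mu] / L[mu]);
    k2 += khat * khat;
  }
  return zeroMode ? 0.0 : 1.0 / k2;
}
```

For example, on a lattice with extent 4 in every direction, the momentum `n = (2,0,0,0)` gives `khat = 2 sin(pi/2) = 2` and therefore `1/khat^2 = 0.25`.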
  template<class Gimpl>
  void Photon<Gimpl>::zmSub(GaugeLinkField &out)
  {
    GridBase           *grid = out._grid;
    const unsigned int nd    = grid->_ndimension;
    switch (zmScheme_)
    {
      case ZmScheme::qedTL:
      {
        std::vector<int> zm(nd,0);
        TComplex         Tzero = Complex(0.0,0.0);
        pokeSite(Tzero, out, zm);
        break;
      }
      case ZmScheme::qedL:
      {
        LatticeInteger spNrm(grid), coor(grid);
        GaugeLinkField z(grid);
        spNrm = zero;
        for(int d = 0; d < grid->_ndimension - 1; d++)
        {
          LatticeCoordinate(coor,d);
          spNrm = spNrm + coor*coor;
        }
        out = where(spNrm == Integer(0), 0.*out, out);
        break;
      }
      default:
        break;
    }
  }
  template<class Gimpl>
  void Photon<Gimpl>::MomentumSpacePropagator(const GaugeField &in,
                                               GaugeField &out)
  {
    GridBase           *grid = out._grid;
    LatticeComplex     k2Inv(grid);
    invKHatSquared(k2Inv);
    zmSub(k2Inv);
    out = in*k2Inv;
  }
  template<class Gimpl>
  void Photon<Gimpl>::StochasticWeight(GaugeLinkField &weight)
  {
    auto               *grid     = dynamic_cast<GridCartesian *>(weight._grid);
    const unsigned int nd        = grid->_ndimension;
    std::vector<int>   latt_size = grid->_fdimensions;
    Integer vol = 1;
    for(int d = 0; d < nd; d++)
    {
      vol = vol * latt_size[d];
    }
    invKHatSquared(weight);
    weight = sqrt(vol*real(weight));
    zmSub(weight);
  }
  template<class Gimpl>
  void Photon<Gimpl>::StochasticField(GaugeField &out, GridParallelRNG &rng)
  {
    auto           *grid = dynamic_cast<GridCartesian *>(out._grid);
    GaugeLinkField weight(grid);
    StochasticWeight(weight);
    StochasticField(out, rng, weight);
  }
  template<class Gimpl>
  void Photon<Gimpl>::StochasticField(GaugeField &out, GridParallelRNG &rng,
                                      const GaugeLinkField &weight)
  {
    auto               *grid = dynamic_cast<GridCartesian *>(out._grid);
    const unsigned int nd = grid->_ndimension;
    GaugeLinkField     r(grid);
    GaugeField         aTilde(grid);
    FFT                fft(grid);
    for(int mu = 0; mu < nd; mu++)
    {
      gaussian(rng, r);
      r = weight*r;
      pokeLorentz(aTilde, r, mu);
    }
    fft.FFT_all_dim(out, aTilde, FFT::backward);
    out = real(out);
  }
 //  template<class Gimpl>
 //  void Photon<Gimpl>::FeynmanGaugeMomentumSpacePropagator_L(GaugeField &out,
 //                                                            const GaugeField &in)
 //  {
 //    
 //    FeynmanGaugeMomentumSpacePropagator_TL(out,in);
 //    
 //    GridBase *grid = out._grid;
 //    LatticeInteger     coor(grid);
 //    GaugeField zz(grid); zz=zero;
 //    
 //    // xyzt
 //    for(int d = 0; d < grid->_ndimension-1;d++){
 //      LatticeCoordinate(coor,d);
 //      out = where(coor==Integer(0),zz,out);
 //    }
 //  }
 //  
 //  template<class Gimpl>
 //  void Photon<Gimpl>::FeynmanGaugeMomentumSpacePropagator_TL(GaugeField &out,
 //                                                             const GaugeField &in)
 //  {
 //    
 //    // what type LatticeComplex
 //    GridBase *grid = out._grid;
 //    int nd = grid->_ndimension;
 //    
 //    typedef typename GaugeField::vector_type vector_type;
 //    typedef typename GaugeField::scalar_type ScalComplex;
 //    typedef Lattice<iSinglet<vector_type> > LatComplex;
 //    
 //    std::vector<int> latt_size   = grid->_fdimensions;
 //    
 //    LatComplex denom(grid); denom= zero;
 //    LatComplex   one(grid); one = ScalComplex(1.0,0.0);
 //    LatComplex   kmu(grid);
 //    
 //    ScalComplex ci(0.0,1.0);
 //    // momphase = n * 2pi / L
 //    for(int mu=0;mu<Nd;mu++) {
 //      
 //      LatticeCoordinate(kmu,mu);
 //      
 //      RealD TwoPiL =  M_PI * 2.0/ latt_size[mu];
 //      
 //      kmu = TwoPiL * kmu ;
 //      
 //      denom = denom + 4.0*sin(kmu*0.5)*sin(kmu*0.5); // Wilson term
 //    }
 //    std::vector<int> zero_mode(nd,0);
 //    TComplexD Tone = ComplexD(1.0,0.0);
 //    TComplexD Tzero= ComplexD(0.0,0.0);
 //    
 //    pokeSite(Tone,denom,zero_mode);
 //    
 //    denom= one/denom;
 //    
 //    pokeSite(Tzero,denom,zero_mode);
 //    
 //    out = zero;
 //    out = in*denom;
 //  };
 }}
 #endif
@@ -31,6 +31,7 @@ directory
 #include <Grid/qcd/action/scalar/ScalarImpl.h>
 #include <Grid/qcd/action/scalar/ScalarAction.h>
 #include <Grid/qcd/action/scalar/ScalarInteractionAction.h>
 namespace Grid {
 namespace QCD {
@@ -39,6 +40,10 @@ namespace QCD {
  typedef ScalarAction<ScalarImplF>                 ScalarActionF;
  typedef ScalarAction<ScalarImplD>                 ScalarActionD;
  template <int Colours, int Dimensions> using ScalarAdjActionR = ScalarInteractionAction<ScalarNxNAdjImplR<Colours>, Dimensions>;
  template <int Colours, int Dimensions> using ScalarAdjActionF = ScalarInteractionAction<ScalarNxNAdjImplF<Colours>, Dimensions>;
  template <int Colours, int Dimensions> using ScalarAdjActionD = ScalarInteractionAction<ScalarNxNAdjImplD<Colours>, Dimensions>;
 }
 }
@@ -6,10 +6,10 @@
  Copyright (C) 2015
-Author: Azusa Yamaguchi <ayamaguc@staffmail.ed.ac.uk>
+  Author: Azusa Yamaguchi <ayamaguc@staffmail.ed.ac.uk>
-Author: Peter Boyle <paboyle@ph.ed.ac.uk>
+  Author: Peter Boyle <paboyle@ph.ed.ac.uk>
-Author: neo <cossu@post.kek.jp>
+  Author: neo <cossu@post.kek.jp>
-Author: paboyle <paboyle@ph.ed.ac.uk>
+  Author: paboyle <paboyle@ph.ed.ac.uk>
  This program is free software; you can redistribute it and/or modify
  it under the terms of the GNU General Public License as published by
@@ -35,50 +35,49 @@ directory
 namespace Grid {
  // FIXME drop the QCD namespace everywhere here
-  
+
-  template <class Impl>
+template <class Impl>
-  class ScalarAction : public QCD::Action<typename Impl::Field> {
+class ScalarAction : public QCD::Action<typename Impl::Field> {
-  public:
+ public:
    INHERIT_FIELD_TYPES(Impl);
-    
+
-  private:
+ private:
    RealD mass_square;
    RealD lambda;
  public:
    ScalarAction(RealD ms, RealD l) : mass_square(ms), lambda(l){};
-    virtual std::string LogParameters(){
+ public:
    ScalarAction(RealD ms, RealD l) : mass_square(ms), lambda(l) {}
    virtual std::string LogParameters() {
      std::stringstream sstream;
      sstream << GridLogMessage << "[ScalarAction] lambda      : " << lambda      << std::endl;
      sstream << GridLogMessage << "[ScalarAction] mass_square : " << mass_square << std::endl;
      return sstream.str();
    }
-    
+    virtual std::string action_name() {return "ScalarAction";}
-    virtual std::string action_name(){return "ScalarAction";}
+
-    
+    virtual void refresh(const Field &U, GridParallelRNG &pRNG) {}  // noop as no pseudoferms
-    virtual void refresh(const Field &U,
+
 			 GridParallelRNG &pRNG){};  // noop as no pseudoferms
    virtual RealD S(const Field &p) {
      return (mass_square * 0.5 + QCD::Nd) * ScalarObs<Impl>::sumphisquared(p) +
-	(lambda / 24.) * ScalarObs<Impl>::sumphifourth(p) +
+    (lambda / 24.) * ScalarObs<Impl>::sumphifourth(p) +
-	ScalarObs<Impl>::sumphider(p);
+    ScalarObs<Impl>::sumphider(p);
    };
-    
+
    virtual void deriv(const Field &p,
-		       Field &force) {
+                       Field &force) {
      Field tmp(p._grid);
      Field p2(p._grid);
      ScalarObs<Impl>::phisquared(p2, p);
      tmp = -(Cshift(p, 0, -1) + Cshift(p, 0, 1));
      for (int mu = 1; mu < QCD::Nd; mu++) tmp -= Cshift(p, mu, -1) + Cshift(p, mu, 1);
-      
+
-      force=+(mass_square + 2. * QCD::Nd) * p + (lambda / 6.) * p2 * p + tmp;
+      force =+(mass_square + 2. * QCD::Nd) * p + (lambda / 6.) * p2 * p + tmp;
-    };
+    }
-  };
+};
-  
+
-} // Grid
+
 }  // namespace Grid
 #endif // SCALAR_ACTION_H
@@ -5,96 +5,158 @@
 namespace Grid {
  //namespace QCD {
-  template <class S>
+template <class S>
-  class ScalarImplTypes {
+class ScalarImplTypes {
-  public:
+ public:
    typedef S Simd;
-    
+
    template <typename vtype>
    using iImplField = iScalar<iScalar<iScalar<vtype> > >;
-    
+
    typedef iImplField<Simd> SiteField;
-    
+    typedef SiteField        SitePropagator;
    typedef SiteField        SiteComplex;
    typedef Lattice<SiteField> Field;
    typedef Field              ComplexField;
    typedef Field              FermionField;
    typedef Field              PropagatorField;
    static inline void generate_momenta(Field& P, GridParallelRNG& pRNG){
      gaussian(pRNG, P);
    }
-    
+
    static inline Field projectForce(Field& P){return P;}
-    
+
-    static inline void update_field(Field& P, Field& U, double ep){
+    static inline void update_field(Field& P, Field& U, double ep) {
      U += P*ep;
    }
-    
+
-    static inline RealD FieldSquareNorm(Field& U){
+    static inline RealD FieldSquareNorm(Field& U) {
      return (- sum(trace(U*U))/2.0);
    }
-    
+
    static inline void HotConfiguration(GridParallelRNG &pRNG, Field &U) {
      gaussian(pRNG, U);
    }
-    
+
    static inline void TepidConfiguration(GridParallelRNG &pRNG, Field &U) {
      gaussian(pRNG, U);
    }
-    
+
    static inline void ColdConfiguration(GridParallelRNG &pRNG, Field &U) {
      U = 1.0;
    }
    static void MomentumSpacePropagator(Field &out, RealD m)
    {
      GridBase           *grid = out._grid;
      Field              kmu(grid), one(grid);
      const unsigned int nd    = grid->_ndimension;
      std::vector<int>   &l    = grid->_fdimensions;
      one = Complex(1.0,0.0);
      out = m*m;
      for(int mu = 0; mu < nd; mu++)
      {
        Real twoPiL = M_PI*2./l[mu];
        LatticeCoordinate(kmu,mu);
        kmu = 2.*sin(.5*twoPiL*kmu);
        out = out + kmu*kmu;
      }
      out = one/out;
    }
    static void FreePropagator(const Field &in, Field &out,
                               const Field &momKernel)
    {
      FFT   fft((GridCartesian *)in._grid);
      Field inFT(in._grid);
      fft.FFT_all_dim(inFT, in, FFT::forward);
      inFT = inFT*momKernel;
      fft.FFT_all_dim(out, inFT, FFT::backward);
    }
    static void FreePropagator(const Field &in, Field &out, RealD m)
    {
      Field momKernel(in._grid);
      MomentumSpacePropagator(momKernel, m);
      FreePropagator(in, out, momKernel);
    }
  };
  template <class S, unsigned int N>
-  class ScalarMatrixImplTypes {
+  class ScalarAdjMatrixImplTypes {
  public:
    typedef S Simd;
    typedef QCD::SU<N> Group;
    template <typename vtype>
-    using iImplField = iScalar<iScalar<iMatrix<vtype, N> > >;
+    using iImplField   = iScalar<iScalar<iMatrix<vtype, N>>>;
    template <typename vtype>
    using iImplComplex = iScalar<iScalar<iScalar<vtype>>>;
    typedef iImplField<Simd>   SiteField;
    typedef SiteField          SitePropagator;
    typedef iImplComplex<Simd> SiteComplex;
-    typedef iImplField<Simd> SiteField;
+    typedef Lattice<SiteField>   Field;
-    
+    typedef Lattice<SiteComplex> ComplexField;
-    
+    typedef Field                FermionField;
-    typedef Lattice<SiteField> Field;
+    typedef Field                PropagatorField;
-    
+
-    static inline void generate_momenta(Field& P, GridParallelRNG& pRNG){
+    static inline void generate_momenta(Field& P, GridParallelRNG& pRNG) {
-      gaussian(pRNG, P);
+      Group::GaussianFundamentalLieAlgebraMatrix(pRNG, P);
    }
-    
+
-    static inline Field projectForce(Field& P){return P;}
+    static inline Field projectForce(Field& P) {return P;}
-    
+
-    static inline void update_field(Field& P, Field& U, double ep){
+    static inline void update_field(Field& P, Field& U, double ep) {
      U += P*ep;
    }
-    
+
-    static inline RealD FieldSquareNorm(Field& U){
+    static inline RealD FieldSquareNorm(Field& U) {
-      return (TensorRemove(- sum(trace(U*U))*0.5).real());
+      return (TensorRemove(sum(trace(U*U))).real());
    }
-    
+
    static inline void HotConfiguration(GridParallelRNG &pRNG, Field &U) {
-      gaussian(pRNG, U);
+      Group::GaussianFundamentalLieAlgebraMatrix(pRNG, U);
    }
-    
+
    static inline void TepidConfiguration(GridParallelRNG &pRNG, Field &U) {
-      gaussian(pRNG, U);
+      Group::GaussianFundamentalLieAlgebraMatrix(pRNG, U, 0.01);
    }
-    
+
    static inline void ColdConfiguration(GridParallelRNG &pRNG, Field &U) {
-      U = 1.0;
+      U = zero;
    }
-    
+
  };
-  
+
-  
+
  typedef ScalarImplTypes<vReal> ScalarImplR;
  typedef ScalarImplTypes<vRealF> ScalarImplF;
  typedef ScalarImplTypes<vRealD> ScalarImplD;
  typedef ScalarImplTypes<vComplex> ScalarImplCR;
  typedef ScalarImplTypes<vComplexF> ScalarImplCF;
  typedef ScalarImplTypes<vComplexD> ScalarImplCD;
  // Hardcoding here the size of the matrices
  typedef ScalarAdjMatrixImplTypes<vComplex,  QCD::Nc> ScalarAdjImplR;
  typedef ScalarAdjMatrixImplTypes<vComplexF, QCD::Nc> ScalarAdjImplF;
  typedef ScalarAdjMatrixImplTypes<vComplexD, QCD::Nc> ScalarAdjImplD;
  template <int Colours > using ScalarNxNAdjImplR = ScalarAdjMatrixImplTypes<vComplex,   Colours >;
  template <int Colours > using ScalarNxNAdjImplF = ScalarAdjMatrixImplTypes<vComplexF,  Colours >;
  template <int Colours > using ScalarNxNAdjImplD = ScalarAdjMatrixImplTypes<vComplexD,  Colours >;
-  //} 
+  //}
-} 
+}
 #endif
@@ -6,10 +6,7 @@
  Copyright (C) 2015
-Author: Azusa Yamaguchi <ayamaguc@staffmail.ed.ac.uk>
+  Author: Guido Cossu <guido.cossu@ed.ac.uk>
 Author: Peter Boyle <paboyle@ph.ed.ac.uk>
 Author: neo <cossu@post.kek.jp>
 Author: paboyle <paboyle@ph.ed.ac.uk>
  This program is free software; you can redistribute it and/or modify
  it under the terms of the GNU General Public License as published by
@@ -30,55 +27,122 @@ directory
  *************************************************************************************/
 /*  END LEGAL */
-#ifndef SCALAR_ACTION_H
+#ifndef SCALAR_INT_ACTION_H
-#define SCALAR_ACTION_H
+#define SCALAR_INT_ACTION_H
 // Note: this action can completely absorb the ScalarAction for real float fields
 // use the scalarObjs to generalise the structure
 namespace Grid {
  // FIXME drop the QCD namespace everywhere here
-  
+
-  template <class Impl>
+  template <class Impl, int Ndim >
  class ScalarInteractionAction : public QCD::Action<typename Impl::Field> {
  public:
    INHERIT_FIELD_TYPES(Impl);
  private:
    RealD mass_square;
    RealD lambda;
  public:
    ScalarAction(RealD ms, RealD l) : mass_square(ms), lambda(l){};
-    virtual std::string LogParameters(){
+
    typedef typename Field::vector_object vobj;
    typedef CartesianStencil<vobj,vobj> Stencil;
    SimpleCompressor<vobj> compressor;
    int npoint = 2*Ndim;
    std::vector<int> directions;//    = {0,1,2,3,0,1,2,3};  // forcing 4 dimensions
    std::vector<int> displacements;//  = {1,1,1,1, -1,-1,-1,-1};
  public:
    ScalarInteractionAction(RealD ms, RealD l) : mass_square(ms), lambda(l), displacements(2*Ndim,0), directions(2*Ndim,0){
      for (int mu = 0 ; mu < Ndim; mu++){
 		directions[mu]         = mu; directions[mu+Ndim]    = mu;
 		displacements[mu]      =  1; displacements[mu+Ndim] = -1;
      }
    }
    virtual std::string LogParameters() {
      std::stringstream sstream;
      sstream << GridLogMessage << "[ScalarAction] lambda      : " << lambda      << std::endl;
      sstream << GridLogMessage << "[ScalarAction] mass_square : " << mass_square << std::endl;
      return sstream.str();
    }
-    
+
-    virtual std::string action_name(){return "ScalarAction";}
+    virtual std::string action_name() {return "ScalarAction";}
-    
+
-    virtual void refresh(const Field &U,
+    virtual void refresh(const Field &U, GridParallelRNG &pRNG) {}
-			 GridParallelRNG &pRNG){};  // noop as no pseudoferms
+
    virtual RealD S(const Field &p) {
-      return (mass_square * 0.5 + QCD::Nd) * ScalarObs<Impl>::sumphisquared(p) +
+      assert(p._grid->Nd() == Ndim);
-	(lambda / 24.) * ScalarObs<Impl>::sumphifourth(p) +
+      static Stencil phiStencil(p._grid, npoint, 0, directions, displacements);
-	ScalarObs<Impl>::sumphider(p);
+      phiStencil.HaloExchange(p, compressor);
      Field action(p._grid), pshift(p._grid), phisquared(p._grid);
      phisquared = p*p;
      action = (2.0*Ndim + mass_square)*phisquared - lambda/24.*phisquared*phisquared;
      for (int mu = 0; mu < Ndim; mu++) {
 	//  pshift = Cshift(p, mu, +1);  // not efficient, implement with stencils
 	parallel_for (int i = 0; i < p._grid->oSites(); i++) {
 	  int permute_type;
 	  StencilEntry *SE;
 	  vobj temp2;
 	  const vobj *temp, *t_p;
 	  SE = phiStencil.GetEntry(permute_type, mu, i);
 	  t_p  = &p._odata[i];
 	  if ( SE->_is_local ) {
 	    temp = &p._odata[SE->_offset];
 	    if ( SE->_permute ) {
 	      permute(temp2, *temp, permute_type);
 	      action._odata[i] -= temp2*(*t_p) + (*t_p)*temp2;
 	    } else {
 	      action._odata[i] -= (*temp)*(*t_p) + (*t_p)*(*temp);
 	    }
 	  } else {
 	    action._odata[i] -= phiStencil.CommBuf()[SE->_offset]*(*t_p) + (*t_p)*phiStencil.CommBuf()[SE->_offset];
 	  }
 	}
 	//  action -= pshift*p + p*pshift;
      }
      // NB the trace in the algebra is normalised to 1/2
      // minus sign coming from the antihermitian fields
      return -(TensorRemove(sum(trace(action)))).real();
    };
-    
+
-    virtual void deriv(const Field &p,
+    virtual void deriv(const Field &p, Field &force) {
-		       Field &force) {
+      assert(p._grid->Nd() == Ndim);
-      Field tmp(p._grid);
+      force = (2.0*Ndim + mass_square)*p - lambda/12.*p*p*p;
-      Field p2(p._grid);
+      // move this outside
-      ScalarObs<Impl>::phisquared(p2, p);
+      static Stencil phiStencil(p._grid, npoint, 0, directions, displacements);
-      tmp = -(Cshift(p, 0, -1) + Cshift(p, 0, 1));
+      phiStencil.HaloExchange(p, compressor);
      for (int mu = 1; mu < QCD::Nd; mu++) tmp -= Cshift(p, mu, -1) + Cshift(p, mu, 1);
-      force=+(mass_square + 2. * QCD::Nd) * p + (lambda / 6.) * p2 * p + tmp;
+      //for (int mu = 0; mu < QCD::Nd; mu++) force -= Cshift(p, mu, -1) + Cshift(p, mu, 1);
-    };
+      for (int point = 0; point < npoint; point++) {
 	parallel_for (int i = 0; i < p._grid->oSites(); i++) {
 	  const vobj *temp;
 	  vobj temp2;
 	  int permute_type;
 	  StencilEntry *SE;
 	  SE = phiStencil.GetEntry(permute_type, point, i);
 	  if ( SE->_is_local ) {
 	    temp = &p._odata[SE->_offset];
 	    if ( SE->_permute ) {
 	      permute(temp2, *temp, permute_type);
 	      force._odata[i] -= temp2;
 	    } else {
 	      force._odata[i] -= *temp;
 	    }
 	  } else {
 	    force._odata[i] -= phiStencil.CommBuf()[SE->_offset];
 	  }
 	}
      }
    }
  };
-} // Grid
+}  // namespace Grid
-#endif // SCALAR_ACTION_H
+#endif  // SCALAR_INT_ACTION_H
@@ -207,6 +207,12 @@ using GenericHMCRunnerTemplate = HMCWrapperTemplate<Implementation, Integrator,
 typedef HMCWrapperTemplate<ScalarImplR, MinimumNorm2, ScalarFields>
    ScalarGenericHMCRunner;
 typedef HMCWrapperTemplate<ScalarAdjImplR, MinimumNorm2, ScalarMatrixFields>
    ScalarAdjGenericHMCRunner;
 template <int Colours> 
 using ScalarNxNAdjGenericHMCRunner = HMCWrapperTemplate < ScalarNxNAdjImplR<Colours>, MinimumNorm2, ScalarNxNMatrixFields<Colours> >;
 }  // namespace QCD
 }  // namespace Grid
@@ -76,7 +76,7 @@ struct HMCparameters: Serializable {
  template < class ReaderClass > 
  void initialize(Reader<ReaderClass> &TheReader){
-  	std::cout << "Reading HMC\n";
+  	std::cout << GridLogMessage << "Reading HMC\n";
  	read(TheReader, "HMC", *this);
  }
@@ -253,6 +253,7 @@ class HMCResourceManager {
  template<class T, class... Types>
  void AddObservable(Types&&... Args){
    ObservablesList.push_back(std::unique_ptr<T>(new T(std::forward<Types>(Args)...)));
    ObservablesList.back()->print_parameters();
  }
  std::vector<HmcObservable<typename ImplementationPolicy::Field>* > GetObservables(){
@@ -297,4 +298,4 @@ private:
 }
 }
-#endif  // HMC_RESOURCE_MANAGER_H
+#endif  // HMC_RESOURCE_MANAGER_H
@@ -102,7 +102,7 @@ class ILDGHmcCheckpointer : public BaseHmcCheckpointer<Implementation> {
    FieldMetaData header;
    IldgReader _IldgReader;
    _IldgReader.open(config);
-    _IldgReader.readConfiguration(config,U,header);  // format from the header
+    _IldgReader.readConfiguration(U,header);  // format from the header
    _IldgReader.close();
    std::cout << GridLogMessage << "Read ILDG Configuration from " << config
@@ -62,7 +62,10 @@ class Representations {
 typedef Representations<FundamentalRepresentation> NoHirep;
 typedef Representations<EmptyRep<typename ScalarImplR::Field> > ScalarFields;
-  //typedef Representations<EmptyRep<typename ScalarMatrixImplR::Field> > ScalarMatrixFields;
+typedef Representations<EmptyRep<typename ScalarAdjImplR::Field> > ScalarMatrixFields;
 template < int Colours> 
 using ScalarNxNMatrixFields = Representations<EmptyRep<typename ScalarNxNAdjImplR<Colours>::Field> >;
 // Helper classes to access the elements
 // Strips the first N parameters from the tuple
@@ -108,7 +108,7 @@ void WilsonFlow<Gimpl>::evolve_step_adaptive(typename Gimpl::GaugeField &U, Real
    if (maxTau - taus < epsilon){
        epsilon = maxTau-taus;
    }
-    std::cout << GridLogMessage << "Integration epsilon : " << epsilon << std::endl;
+    //std::cout << GridLogMessage << "Integration epsilon : " << epsilon << std::endl;
    GaugeField Z(U._grid);
    GaugeField Zprime(U._grid);
    GaugeField tmp(U._grid), Uprime(U._grid);
@@ -138,10 +138,10 @@ void WilsonFlow<Gimpl>::evolve_step_adaptive(typename Gimpl::GaugeField &U, Real
    // adjust integration step
    taus += epsilon;
-    std::cout << GridLogMessage << "Adjusting integration step with distance: " << diff << std::endl;
+    //std::cout << GridLogMessage << "Adjusting integration step with distance: " << diff << std::endl;
    epsilon = epsilon*0.95*std::pow(1e-4/diff,1./3.);
-    std::cout << GridLogMessage << "New epsilon : " << epsilon << std::endl;
+    //std::cout << GridLogMessage << "New epsilon : " << epsilon << std::endl;
 }
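A note on the step-size update kept in the hunk above: a minimal sketch of the same rule as free functions, assuming the tolerance 1e-4, the safety factor 0.95 and the exponent 1/3 that appear in `evolve_step_adaptive`. The names `next_epsilon` and `clamp_to_end` are hypothetical and are not part of Grid.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: rescale the step by safety * (tolerance / diff)^(1/3),
// mirroring epsilon = epsilon*0.95*std::pow(1e-4/diff,1./3.) in the diff.
double next_epsilon(double epsilon, double diff,
                    double tolerance = 1e-4, double safety = 0.95) {
  return epsilon * safety * std::pow(tolerance / diff, 1.0 / 3.0);
}

// Hypothetical helper: shorten the step so the flow never passes maxTau,
// mirroring the "if (maxTau - taus < epsilon)" guard in the diff.
double clamp_to_end(double epsilon, double taus, double maxTau) {
  return (maxTau - taus < epsilon) ? (maxTau - taus) : epsilon;
}
```

A distance estimate eight times the tolerance halves the step before the safety factor is applied, while a distance equal to the tolerance only applies the safety factor.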
@@ -166,7 +166,6 @@ void WilsonFlow<Gimpl>::smear(GaugeField& out, const GaugeField& in) const {
    out = in;
    for (unsigned int step = 1; step <= Nstep; step++) {
        auto start = std::chrono::high_resolution_clock::now();
        std::cout << GridLogMessage << "Evolution time :"<< tau(step) << std::endl;
        evolve_step(out);
        auto end = std::chrono::high_resolution_clock::now();
        std::chrono::duration<double> diff = end - start;
@@ -191,7 +190,7 @@ void WilsonFlow<Gimpl>::smear_adaptive(GaugeField& out, const GaugeField& in, Re
    unsigned int step = 0;
    do{
        step++;
-        std::cout << GridLogMessage << "Evolution time :"<< taus << std::endl;
+        //std::cout << GridLogMessage << "Evolution time :"<< taus << std::endl;
        evolve_step_adaptive(out, maxTau);
        std::cout << GridLogMessage << "[WilsonFlow] Energy density (plaq) : "
            << step << "  "
@@ -0,0 +1,188 @@
    /*************************************************************************************
    Grid physics library, www.github.com/paboyle/Grid
    Copyright (C) 2015
 Author: Azusa Yamaguchi <ayamaguc@staffmail.ed.ac.uk>
 Author: Peter Boyle <paboyle@ph.ed.ac.uk>
    This program is free software; you can redistribute it and/or modify
    it under the terms of the GNU General Public License as published by
    the Free Software Foundation; either version 2 of the License, or
    (at your option) any later version.
    This program is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    GNU General Public License for more details.
    You should have received a copy of the GNU General Public License along
    with this program; if not, write to the Free Software Foundation, Inc.,
    51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
    See the full license in the file "LICENSE" in the top level distribution directory
    *************************************************************************************/
    /*  END LEGAL */
 //#include <Grid/Grid.h>
 using namespace Grid;
 using namespace Grid::QCD;
 template <class Gimpl> 
 class FourierAcceleratedGaugeFixer  : public Gimpl {
  public:
  INHERIT_GIMPL_TYPES(Gimpl);
  typedef typename Gimpl::GaugeLinkField GaugeMat;
  typedef typename Gimpl::GaugeField GaugeLorentz;
  static void GaugeLinkToLieAlgebraField(const std::vector<GaugeMat> &U,std::vector<GaugeMat> &A) {
    for(int mu=0;mu<Nd;mu++){
      Complex cmi(0.0,-1.0);
      A[mu] = Ta(U[mu]) * cmi;
    }
  }
  static void DmuAmu(const std::vector<GaugeMat> &A,GaugeMat &dmuAmu) {
    dmuAmu=zero;
    for(int mu=0;mu<Nd;mu++){
      dmuAmu = dmuAmu + A[mu] - Cshift(A[mu],mu,-1);
    }
  }  
  static void SteepestDescentGaugeFix(GaugeLorentz &Umu,Real & alpha,int maxiter,Real Omega_tol, Real Phi_tol,bool Fourier=false) {
    GridBase *grid = Umu._grid;
    Real org_plaq      =WilsonLoops<Gimpl>::avgPlaquette(Umu);
    Real org_link_trace=WilsonLoops<Gimpl>::linkTrace(Umu); 
    Real old_trace = org_link_trace;
    Real trG;
    std::vector<GaugeMat> U(Nd,grid);
                 GaugeMat dmuAmu(grid);
    for(int i=0;i<maxiter;i++){
      for(int mu=0;mu<Nd;mu++) U[mu]= PeekIndex<LorentzIndex>(Umu,mu);
      if ( Fourier==false ) { 
 	trG = SteepestDescentStep(U,alpha,dmuAmu);
      } else { 
 	trG = FourierAccelSteepestDescentStep(U,alpha,dmuAmu);
      }
      for(int mu=0;mu<Nd;mu++) PokeIndex<LorentzIndex>(Umu,U[mu],mu);
      // Monitor progress and convergence test 
      // infrequently to minimise cost overhead
      if ( i %20 == 0 ) { 
 	Real plaq      =WilsonLoops<Gimpl>::avgPlaquette(Umu);
 	Real link_trace=WilsonLoops<Gimpl>::linkTrace(Umu); 
 	if (Fourier) 
 	  std::cout << GridLogMessage << "Fourier Iteration "<<i<< " plaq= "<<plaq<< " dmuAmu " << norm2(dmuAmu)<< std::endl;
 	else 
 	  std::cout << GridLogMessage << " Iteration "<<i<< " plaq= "<<plaq<< " dmuAmu " << norm2(dmuAmu)<< std::endl;
 	Real Phi  = 1.0 - old_trace / link_trace ;
 	Real Omega= 1.0 - trG;
 	std::cout << GridLogMessage << " Iteration "<<i<< " Phi= "<<Phi<< " Omega= " << Omega<< " trG " << trG <<std::endl;
 	if ( (Omega < Omega_tol) && ( ::fabs(Phi) < Phi_tol) ) {
 	  std::cout << GridLogMessage << "Converged ! "<<std::endl;
 	  return;
 	}
 	old_trace = link_trace;
      }
    }
  };
  static Real SteepestDescentStep(std::vector<GaugeMat> &U,Real & alpha, GaugeMat & dmuAmu) {
    GridBase *grid = U[0]._grid;
    std::vector<GaugeMat> A(Nd,grid);
    GaugeMat g(grid);
    GaugeLinkToLieAlgebraField(U,A);
    ExpiAlphaDmuAmu(A,g,alpha,dmuAmu);
    Real vol = grid->gSites();
    Real trG = TensorRemove(sum(trace(g))).real()/vol/Nc;
    SU<Nc>::GaugeTransform(U,g);
    return trG;
  }
  static Real FourierAccelSteepestDescentStep(std::vector<GaugeMat> &U,Real & alpha, GaugeMat & dmuAmu) {
    GridBase *grid = U[0]._grid;
    Real vol = grid->gSites();
    FFT theFFT((GridCartesian *)grid);
    LatticeComplex  Fp(grid);
    LatticeComplex  psq(grid); psq=zero;
    LatticeComplex  pmu(grid); 
    LatticeComplex   one(grid); one = Complex(1.0,0.0);
    GaugeMat g(grid);
    GaugeMat dmuAmu_p(grid);
    std::vector<GaugeMat> A(Nd,grid);
    GaugeLinkToLieAlgebraField(U,A);
    DmuAmu(A,dmuAmu);
    theFFT.FFT_all_dim(dmuAmu_p,dmuAmu,FFT::forward);
    //////////////////////////////////
    // Work out Fp = psq_max/ psq...
    //////////////////////////////////
    std::vector<int> latt_size = grid->GlobalDimensions();
    std::vector<int> coor(grid->_ndimension,0);
    for(int mu=0;mu<Nd;mu++) {
      Real TwoPiL =  M_PI * 2.0/ latt_size[mu];
      LatticeCoordinate(pmu,mu);
      pmu = TwoPiL * pmu ;
      psq = psq + 4.0*sin(pmu*0.5)*sin(pmu*0.5); 
    }
    Complex psqMax(16.0);
    Fp =  psqMax*one/psq;
    /*
    static int once;
    if ( once == 0 ) { 
      std::cout << " Fp " << Fp <<std::endl;
      once ++;
      }*/
    pokeSite(TComplex(1.0),Fp,coor);
    dmuAmu_p  = dmuAmu_p * Fp; 
    theFFT.FFT_all_dim(dmuAmu,dmuAmu_p,FFT::backward);
    GaugeMat ciadmam(grid);
    Complex cialpha(0.0,-alpha);
    ciadmam = dmuAmu*cialpha;
    SU<Nc>::taExp(ciadmam,g);
    Real trG = TensorRemove(sum(trace(g))).real()/vol/Nc;
    SU<Nc>::GaugeTransform(U,g);
    return trG;
  }
  static void ExpiAlphaDmuAmu(const std::vector<GaugeMat> &A,GaugeMat &g,Real & alpha, GaugeMat &dmuAmu) {
    GridBase *grid = g._grid;
    Complex cialpha(0.0,-alpha);
    GaugeMat ciadmam(grid);
    DmuAmu(A,dmuAmu);
    ciadmam = dmuAmu*cialpha;
    SU<Nc>::taExp(ciadmam,g);
  }  
 };
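A note on the Fourier acceleration factor used in `FourierAccelSteepestDescentStep`: the sketch below evaluates Fp = psq_max / psq for a single momentum mode, with psq the sum over directions of 4 sin^2(p_mu/2), p_mu = 2 pi n_mu / L_mu, and the zero mode set to 1 as the `pokeSite` call does. The function name `fourier_accel_factor` is hypothetical, and psq_max is taken as 4 per dimension, which is an assumption that reproduces the hard-coded 16.0 only in four dimensions.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper, not part of Grid: acceleration factor for the mode
// with integer coordinates n on a lattice of extents latt_size.
double fourier_accel_factor(const std::vector<int> &n,
                            const std::vector<int> &latt_size) {
  const double pi = std::acos(-1.0);
  double psq = 0.0;
  bool zero_mode = true;
  for (std::size_t mu = 0; mu < n.size(); ++mu) {
    if (n[mu] != 0) zero_mode = false;
    const double p = 2.0 * pi * n[mu] / latt_size[mu];  // lattice momentum
    const double s = std::sin(0.5 * p);
    psq += 4.0 * s * s;
  }
  if (zero_mode) return 1.0;  // zero mode is left unaccelerated
  const double psq_max = 4.0 * static_cast<double>(n.size());
  return psq_max / psq;
}
```

Low-momentum modes get the largest factor, which is how the acceleration speeds up the slowly converging long-wavelength components; the highest mode is left with a factor of 1.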
@@ -716,8 +716,7 @@ template<typename GaugeField,typename GaugeMat>
    for (int a = 0; a < AdjointDimension; a++) {
      generator(a, Ta);
-      auto tmp = - 2.0 * (trace(timesI(Ta) * in)) * scale;// 2.0 for the normalization of the trace in the fundamental rep
+      pokeColour(h_out, - 2.0 * (trace(timesI(Ta) * in)) * scale, a);
      pokeColour(h_out, tmp, a);
    }
  }
@@ -65,10 +65,12 @@ Hdf5Reader::Hdf5Reader(const std::string &fileName)
                      Hdf5Type<unsigned int>::type());
 }
-void Hdf5Reader::push(const std::string &s)
+bool Hdf5Reader::push(const std::string &s)
 {
  group_ = group_.openGroup(s);
  path_.push_back(s);
  return true;
 }
 void Hdf5Reader::pop(void)
@@ -54,7 +54,7 @@ namespace Grid
  public:
    Hdf5Reader(const std::string &fileName);
    virtual ~Hdf5Reader(void) = default;
-    void push(const std::string &s);
+    bool push(const std::string &s);
    void pop(void);
    template <typename U>
    void readDefault(const std::string &s, U &output);
@@ -701,9 +701,28 @@ namespace Optimization {
  //Integer Reduce
  template<>
  inline Integer Reduce<Integer, __m256i>::operator()(__m256i in){
-    // FIXME unimplemented
+    __m128i ret;
-    printf("Reduce : Missing integer implementation -> FIX\n");
+#if defined (AVX2)
-    assert(0);
+    // AVX2 horizontal adds within upper and lower halves of register; use
    // SSE to add upper and lower halves for result.
    __m256i v1, v2;
    __m128i u1, u2;
    v1  = _mm256_hadd_epi32(in, in);
    v2  = _mm256_hadd_epi32(v1, v1);
    u1  = _mm256_castsi256_si128(v2);      // lower half
    u2  = _mm256_extracti128_si256(v2, 1); // upper half
    ret = _mm_add_epi32(u1, u2);
 #else
    // No AVX horizontal add; extract upper and lower halves of register & use
    // SSE intrinsics.
    __m128i u1, u2, u3;
    u1  = _mm256_extractf128_si256(in, 0); // lower half
    u2  = _mm256_extractf128_si256(in, 1); // upper half
    u3  = _mm_add_epi32(u1, u2);
    u1  = _mm_hadd_epi32(u3, u3);
    ret = _mm_hadd_epi32(u1, u1);
 #endif
    return _mm_cvtsi128_si32(ret);
  }
 }
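A note on the AVX2 branch above: the two `_mm256_hadd_epi32` calls followed by one add of the two 128-bit halves can be checked with a scalar model. This is a sketch only; `Vec8`, `hadd_epi32_model` and `reduce_model` are hypothetical names, and plain signed addition stands in for the wrapping 32-bit adds of the intrinsics.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using Vec8 = std::array<std::int32_t, 8>;  // eight 32-bit lanes of a __m256i

// Scalar model of _mm256_hadd_epi32(a, b): within each 128-bit half,
// adjacent pairs of a fill the two low lanes and adjacent pairs of b
// fill the two high lanes.
Vec8 hadd_epi32_model(const Vec8 &a, const Vec8 &b) {
  Vec8 r{};
  for (int half = 0; half < 2; ++half) {
    const int o = 4 * half;
    r[o + 0] = a[o + 0] + a[o + 1];
    r[o + 1] = a[o + 2] + a[o + 3];
    r[o + 2] = b[o + 0] + b[o + 1];
    r[o + 3] = b[o + 2] + b[o + 3];
  }
  return r;
}

// Two horizontal adds leave each half's sum in its lane 0; adding lane 0 of
// the lower half to lane 0 of the upper half gives the full reduction.
std::int32_t reduce_model(const Vec8 &in) {
  const Vec8 v1 = hadd_epi32_model(in, in);
  const Vec8 v2 = hadd_epi32_model(v1, v1);
  return v2[0] + v2[4];
}
```

The model shows why a final cross-half add is needed: horizontal adds never mix the lower and upper 128-bit halves of the register.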
@@ -543,6 +543,24 @@ namespace Optimization {
     u512d conv; conv.v = v1;
     return conv.f[0];
  }
  //Integer Reduce
  template<>
  inline Integer Reduce<Integer, __m512i>::operator()(__m512i in){
    // No full vector reduce, use AVX to add upper and lower halves of register
    // and perform AVX reduction.
    __m256i v1, v2, v3;
    __m128i u1, u2, ret;
    v1  = _mm512_castsi512_si256(in);       // lower half
    v2  = _mm512_extracti32x8_epi32(in, 1); // upper half
    v3  = _mm256_add_epi32(v1, v2);
    v1  = _mm256_hadd_epi32(v3, v3);
    v2  = _mm256_hadd_epi32(v1, v1);
    u1  = _mm256_castsi256_si128(v2);       // lower half
    u2  = _mm256_extracti128_si256(v2, 1);  // upper half
    ret = _mm_add_epi32(u1, u2);
    return _mm_cvtsi128_si32(ret);
  }
 #else
  //Complex float Reduce
  template<>
@@ -570,9 +588,7 @@ namespace Optimization {
  //Integer Reduce
  template<>
  inline Integer Reduce<Integer, __m512i>::operator()(__m512i in){
-    // FIXME unimplemented
+    return _mm512_reduce_add_epi32(in);
    printf("Reduce : Missing integer implementation -> FIX\n");
    assert(0);
  }
 #endif
@@ -401,9 +401,7 @@ namespace Optimization {
  //Integer Reduce
  template<>
  inline Integer Reduce<Integer, __m512i>::operator()(__m512i in){
-    // FIXME unimplemented
+    return _mm512_reduce_add_epi32(in);
    printf("Reduce : Missing integer implementation -> FIX\n");
    assert(0);
  }
@@ -1,13 +1,14 @@
-    /*************************************************************************************
+/*************************************************************************************
-    Grid physics library, www.github.com/paboyle/Grid 
+    Grid physics library, www.github.com/paboyle/Grid
    Source file: ./lib/simd/Grid_neon.h
    Copyright (C) 2015
-Author: Peter Boyle <paboyle@ph.ed.ac.uk>
+    Author: Nils Meyer <nils.meyer@ur.de>
-Author: neo <cossu@post.kek.jp>
+    Author: Peter Boyle <paboyle@ph.ed.ac.uk>
    Author: neo <cossu@post.kek.jp>
    This program is free software; you can redistribute it and/or modify
    it under the terms of the GNU General Public License as published by
@@ -26,19 +27,25 @@ Author: neo <cossu@post.kek.jp>
    See the full license in the file "LICENSE" in the top level distribution directory
    *************************************************************************************/
    /*  END LEGAL */
 //----------------------------------------------------------------------
 /*! @file Grid_neon.h
  @brief Optimization libraries for NEON (ARM) instructions set ARMv8
-  Experimental - Using intrinsics - DEVELOPING! 
+/*
  ARMv8 NEON intrinsics layer by
  Nils Meyer <nils.meyer@ur.de>,
  University of Regensburg, Germany
  SFB/TRR55
 */
 // Time-stamp: <2015-07-10 17:45:09 neo>
 //----------------------------------------------------------------------
 #ifndef GEN_SIMD_WIDTH
 #define GEN_SIMD_WIDTH 16u
 #endif
 #include "Grid_generic_types.h"
 #include <arm_neon.h>
-// ARMv8 supports double precision
+namespace Grid {
 namespace Optimization {
  template<class vtype>
@@ -46,16 +53,20 @@ namespace Optimization {
    float32x4_t f;
    vtype v;
  };
  union u128f {
    float32x4_t v;
    float f[4];
  };
  union u128d {
    float64x2_t v;
-    double f[4];
+    double f[2];
  };
-  
+  // half precision
  union u128h {
    float16x8_t v;
    uint16_t f[8];
  };
  struct Vsplat{
    //Complex float
    inline float32x4_t operator()(float a, float b){
@@ -64,31 +75,31 @@ namespace Optimization {
    }
    // Real float
    inline float32x4_t operator()(float a){
-      return vld1q_dup_f32(&a);
+      return vdupq_n_f32(a);
    }
    //Complex double
-    inline float32x4_t operator()(double a, double b){
+    inline float64x2_t operator()(double a, double b){
-      float tmp[4]={(float)a,(float)b,(float)a,(float)b};
+      double tmp[2]={a,b};
-      return vld1q_f32(tmp);
+      return vld1q_f64(tmp);
    }
-    //Real double
+    //Real double // N:tbc
-    inline float32x4_t operator()(double a){
+    inline float64x2_t operator()(double a){
-      return vld1q_dup_f32(&a);
+      return vdupq_n_f64(a);
    }
-    //Integer
+    //Integer // N:tbc
    inline uint32x4_t operator()(Integer a){
-      return vld1q_dup_u32(&a);
+      return vdupq_n_u32(a);
    }
  };
  struct Vstore{
-    //Float 
+    //Float
    inline void operator()(float32x4_t a, float* F){
      vst1q_f32(F, a);
    }
    //Double
-    inline void operator()(float32x4_t a, double* D){
+    inline void operator()(float64x2_t a, double* D){
-      vst1q_f32((float*)D, a);
+      vst1q_f64(D, a);
    }
    //Integer
    inline void operator()(uint32x4_t a, Integer* I){
@@ -97,54 +108,54 @@ namespace Optimization {
  };
-  struct Vstream{
+  struct Vstream{ // N:equivalents to _mm_stream_p* in NEON?
-    //Float
+    //Float // N:generic
    inline void operator()(float * a, float32x4_t b){
-    
+      memcpy(a,&b,4*sizeof(float));
    }
-    //Double
+    //Double // N:generic
-    inline void operator()(double * a, float32x4_t b){
+    inline void operator()(double * a, float64x2_t b){
-  
+      memcpy(a,&b,2*sizeof(double));
    }
  };
  // Nils: Vset untested; not used currently in Grid at all;
  // git commit 4a8c4ccfba1d05159348d21a9698028ea847e77b
  struct Vset{
-    // Complex float 
+    // Complex float // N:ok
    inline float32x4_t operator()(Grid::ComplexF *a){
-      float32x4_t foo;
+      float tmp[4]={a[1].imag(),a[1].real(),a[0].imag(),a[0].real()};
-      return foo;
+      return vld1q_f32(tmp);
    }
-    // Complex double 
+    // Complex double // N:ok
-    inline float32x4_t operator()(Grid::ComplexD *a){
+    inline float64x2_t operator()(Grid::ComplexD *a){
-      float32x4_t foo;
+      double tmp[2]={a[0].imag(),a[0].real()};
-      return foo;
+      return vld1q_f64(tmp);
    }
-    // Real float 
+    // Real float // N:ok
    inline float32x4_t operator()(float *a){
-      float32x4_t foo;
+      float tmp[4]={a[3],a[2],a[1],a[0]};
-      return foo;
+      return vld1q_f32(tmp);
    }
-    // Real double
+    // Real double // N:ok
-    inline float32x4_t operator()(double *a){
+    inline float64x2_t operator()(double *a){
-      float32x4_t foo;
+      double tmp[2]={a[1],a[0]};
-      return foo;
+      return vld1q_f64(tmp);
    }
-    // Integer
+    // Integer // N:ok
    inline uint32x4_t operator()(Integer *a){
-      uint32x4_t foo;
+      return vld1q_dup_u32(a);
-      return foo;
    }
  };
  // N:leaving as is
  template <typename Out_type, typename In_type>
  struct Reduce{
    //Need templated class to overload output type
    //General form must generate error if compiled
-    inline Out_type operator()(In_type in){
+      inline Out_type operator()(In_type in){
      printf("Error, using wrong Reduce function\n");
      exit(1);
      return 0;
@@ -184,26 +195,98 @@ namespace Optimization {
    }
  };
  struct MultRealPart{
    inline float32x4_t operator()(float32x4_t a, float32x4_t b){
      float32x4_t re = vtrn1q_f32(a, a);
      return vmulq_f32(re, b);
    }
    inline float64x2_t operator()(float64x2_t a, float64x2_t b){
      float64x2_t re = vzip1q_f64(a, a);
      return vmulq_f64(re, b);
    }
  };
  struct MaddRealPart{
    inline float32x4_t operator()(float32x4_t a, float32x4_t b, float32x4_t c){
      float32x4_t re = vtrn1q_f32(a, a);
      return vfmaq_f32(c, re, b);
    }
    inline float64x2_t operator()(float64x2_t a, float64x2_t b, float64x2_t c){
      float64x2_t re = vzip1q_f64(a, a);
      return vfmaq_f64(c, re, b);
    }
  };
  struct Div{
    // Real float
    inline float32x4_t operator()(float32x4_t a, float32x4_t b){
      return vdivq_f32(a, b);
    }
    // Real double
    inline float64x2_t operator()(float64x2_t a, float64x2_t b){
      return vdivq_f64(a, b);
    }
  };
  struct MultComplex{
    // Complex float
    inline float32x4_t operator()(float32x4_t a, float32x4_t b){
-      float32x4_t foo;
+
-      return foo;
+      float32x4_t r0, r1, r2, r3, r4;
      // a = ar ai Ar Ai
      // b = br bi Br Bi
      // collect real/imag part, negate bi and Bi
      r0 = vtrn1q_f32(b, b);       //  br  br  Br  Br
      r1 = vnegq_f32(b);           // -br -bi -Br -Bi
      r2 = vtrn2q_f32(b, r1);      //  bi -bi  Bi -Bi
      // the fun part
      r3 = vmulq_f32(r2, a);       //  bi*ar -bi*ai ...
      r4 = vrev64q_f32(r3);        // -bi*ai  bi*ar ...
      // fma(a,b,c) = a+b*c
      return vfmaq_f32(r4, r0, a); //  ar*br-ai*bi ai*br+ar*bi ...
      // no fma, use mul and add
      //float32x4_t r5;
      //r5 = vmulq_f32(r0, a);
      //return vaddq_f32(r4, r5);
    }
    // Complex double
    inline float64x2_t operator()(float64x2_t a, float64x2_t b){
-      float32x4_t foo;
+
-      return foo;
+      float64x2_t r0, r1, r2, r3, r4;
      // b = br bi
      // collect real/imag part, negate bi
      r0 = vtrn1q_f64(b, b);       //  br  br
      r1 = vnegq_f64(b);           // -br -bi
      r2 = vtrn2q_f64(b, r1);      //  bi -bi
      // the fun part
      r3 = vmulq_f64(r2, a);       //  bi*ar -bi*ai
      r4 = vextq_f64(r3,r3,1);     // -bi*ai  bi*ar
      // fma(a,b,c) = a+b*c
      return vfmaq_f64(r4, r0, a); //  ar*br-ai*bi ai*br+ar*bi
      // no fma, use mul and add
      //float64x2_t r5;
      //r5 = vmulq_f64(r0, a);
      //return vaddq_f64(r4, r5);
    }
  };
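The lane comments in `MultComplex` can be checked without NEON hardware. The sketch below is an illustration only: `V4`, `trn1`, `trn2`, `rev64`, `neg`, `mul`, `fma4` and `mult_complex` are hypothetical scalar stand-ins that model the lane behaviour of `vtrn1q_f32`, `vtrn2q_f32`, `vrev64q_f32`, `vnegq_f32`, `vmulq_f32` and `vfmaq_f32`. They are not part of Grid and do not call the intrinsics.

```cpp
#include <array>
#include <cassert>
#include <cmath>

// Scalar stand-ins for the NEON lane operations used in MultComplex.
// These are assumptions for illustration, not the intrinsics themselves.
using V4 = std::array<float, 4>;

// vtrn1q_f32(a, b) -> a0 b0 a2 b2
inline V4 trn1(const V4 &a, const V4 &b) { return V4{{a[0], b[0], a[2], b[2]}}; }
// vtrn2q_f32(a, b) -> a1 b1 a3 b3
inline V4 trn2(const V4 &a, const V4 &b) { return V4{{a[1], b[1], a[3], b[3]}}; }
// vrev64q_f32(a) -> a1 a0 a3 a2 (swap inside each 64-bit half)
inline V4 rev64(const V4 &a) { return V4{{a[1], a[0], a[3], a[2]}}; }
// vnegq_f32(a) -> -a0 -a1 -a2 -a3
inline V4 neg(const V4 &a) { return V4{{-a[0], -a[1], -a[2], -a[3]}}; }
// vmulq_f32(a, b) -> lane-wise product
inline V4 mul(const V4 &a, const V4 &b) {
  return V4{{a[0] * b[0], a[1] * b[1], a[2] * b[2], a[3] * b[3]}};
}
// vfmaq_f32(acc, x, y) -> acc + x*y per lane
inline V4 fma4(const V4 &acc, const V4 &x, const V4 &y) {
  V4 r;
  for (int i = 0; i < 4; ++i) r[i] = std::fma(x[i], y[i], acc[i]);
  return r;
}

// Same operation sequence as the float32x4_t overload of MultComplex.
inline V4 mult_complex(const V4 &a, const V4 &b) {
  V4 r0 = trn1(b, b);    //  br  br  Br  Br
  V4 r1 = neg(b);        // -br -bi -Br -Bi
  V4 r2 = trn2(b, r1);   //  bi -bi  Bi -Bi
  V4 r3 = mul(r2, a);    //  bi*ar -bi*ai ...
  V4 r4 = rev64(r3);     // -bi*ai  bi*ar ...
  return fma4(r4, r0, a); // ar*br-ai*bi  ai*br+ar*bi ...
}

// Self-check: (1+2i)(5+6i) = -7+16i and (3+4i)(7+8i) = -11+52i.
inline void check_mult_complex() {
  V4 a{{1.0f, 2.0f, 3.0f, 4.0f}};
  V4 b{{5.0f, 6.0f, 7.0f, 8.0f}};
  V4 r = mult_complex(a, b);
  assert(r[0] == -7.0f && r[1] == 16.0f);
  assert(r[2] == -11.0f && r[3] == 52.0f);
}
```

Calling `check_mult_complex()` exercises both complex lanes. The inputs are small integers, so the comparison is exact in single precision.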
  struct Mult{
    // Real float
    inline float32x4_t mac(float32x4_t a, float32x4_t b, float32x4_t c){
-      return vaddq_f32(vmulq_f32(b,c),a);
+      //return vaddq_f32(vmulq_f32(b,c),a);
      return vfmaq_f32(a, b, c);
    }
    inline float64x2_t mac(float64x2_t a, float64x2_t b, float64x2_t c){
-      return vaddq_f64(vmulq_f64(b,c),a);
+      //return vaddq_f64(vmulq_f64(b,c),a);
      return vfmaq_f64(a, b, c);
    }
    inline float32x4_t operator()(float32x4_t a, float32x4_t b){
      return vmulq_f32(a,b);
@@ -221,89 +304,275 @@ namespace Optimization {
  struct Conj{
    // Complex single
    inline float32x4_t operator()(float32x4_t in){
-      return in;
+      // ar ai br bi -> ar -ai br -bi
      float32x4_t r0, r1;
      r0 = vnegq_f32(in);        // -ar -ai -br -bi
      r1 = vrev64q_f32(r0);      // -ai -ar -bi -br
      return vtrn1q_f32(in, r1); //  ar -ai  br -bi
    }
    // Complex double
-    //inline float32x4_t operator()(float32x4_t in){
+    inline float64x2_t operator()(float64x2_t in){
-    // return 0;
+
-    //}
+      float64x2_t r0, r1;
      r0 = vextq_f64(in, in, 1);    //  ai  ar
      r1 = vnegq_f64(r0);           // -ai -ar
      return vextq_f64(r0, r1, 1);  //  ar -ai
    }
    // do not define for integer input
  };
  struct TimesMinusI{
    //Complex single
    inline float32x4_t operator()(float32x4_t in, float32x4_t ret){
-      return in;
+      // ar ai br bi -> ai -ar bi -br
      float32x4_t r0, r1;
      r0 = vnegq_f32(in);        // -ar -ai -br -bi
      r1 = vrev64q_f32(in);      //  ai  ar  bi  br
      return vtrn1q_f32(r1, r0); //  ai -ar  bi -br
    }
    //Complex double
-    //inline float32x4_t operator()(float32x4_t in, float32x4_t ret){
+    inline float64x2_t operator()(float64x2_t in, float64x2_t ret){
-    //  return in;
+      // a ib -> b -ia
-    //}
+      float64x2_t tmp;
-
+      tmp = vnegq_f64(in);
-
+      return vextq_f64(in, tmp, 1);
    }
  };
  struct TimesI{
    //Complex single
    inline float32x4_t operator()(float32x4_t in, float32x4_t ret){
-      //need shuffle
+      // ar ai br bi -> -ai ar -bi br
-      return in;
+      float32x4_t r0, r1;
      r0 = vnegq_f32(in);        // -ar -ai -br -bi
      r1 = vrev64q_f32(r0);      // -ai -ar -bi -br
      return vtrn1q_f32(r1, in); // -ai  ar -bi  br
    }
    //Complex double
-    //inline float32x4_t operator()(float32x4_t in, float32x4_t ret){
+    inline float64x2_t operator()(float64x2_t in, float64x2_t ret){
-    //  return 0;
+      // a ib -> -b ia
-    //}
+      float64x2_t tmp;
      tmp = vnegq_f64(in);
      return vextq_f64(tmp, in, 1);
    }
  };
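The single-precision `Conj`, `TimesMinusI` and `TimesI` shuffles differ only in which operand is negated and in the argument order of the final `vtrn1q_f32`. A minimal sketch, assuming the same hypothetical scalar stand-ins for the intrinsics (`V4`, `trn1`, `rev64`, `neg`; none of these names exist in Grid), makes the three results easy to compare:

```cpp
#include <array>
#include <cassert>

// Hypothetical scalar models of the NEON lane operations (illustration only).
using V4 = std::array<float, 4>;

// vtrn1q_f32(a, b) -> a0 b0 a2 b2
inline V4 trn1(const V4 &a, const V4 &b) { return V4{{a[0], b[0], a[2], b[2]}}; }
// vrev64q_f32(a) -> a1 a0 a3 a2
inline V4 rev64(const V4 &a) { return V4{{a[1], a[0], a[3], a[2]}}; }
// vnegq_f32(a) -> -a0 -a1 -a2 -a3
inline V4 neg(const V4 &a) { return V4{{-a[0], -a[1], -a[2], -a[3]}}; }

// Conj: ar ai br bi -> ar -ai br -bi
inline V4 conj4(const V4 &in) {
  V4 r0 = neg(in);      // -ar -ai -br -bi
  V4 r1 = rev64(r0);    // -ai -ar -bi -br
  return trn1(in, r1);  //  ar -ai  br -bi
}

// TimesMinusI: ar ai br bi -> ai -ar bi -br
inline V4 times_minus_i(const V4 &in) {
  V4 r0 = neg(in);      // -ar -ai -br -bi
  V4 r1 = rev64(in);    //  ai  ar  bi  br
  return trn1(r1, r0);  //  ai -ar  bi -br
}

// TimesI: ar ai br bi -> -ai ar -bi br
inline V4 times_i(const V4 &in) {
  V4 r0 = neg(in);      // -ar -ai -br -bi
  V4 r1 = rev64(r0);    // -ai -ar -bi -br
  return trn1(r1, in);  // -ai  ar -bi  br
}
```

For the input `1+2i, 3+4i` this gives `1-2i, 3-4i` for the conjugate, `2-1i, 4-3i` for multiplication by `-i`, and `-2+1i, -4+3i` for multiplication by `i`.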
  struct Permute{
    static inline float32x4_t Permute0(float32x4_t in){ // N:ok
      // AB CD -> CD AB
      return vextq_f32(in, in, 2);
    };
    static inline float32x4_t Permute1(float32x4_t in){ // N:ok
      // AB CD -> BA DC
      return vrev64q_f32(in);
    };
    static inline float32x4_t Permute2(float32x4_t in){ // N:not used by Boyle
      return in;
    };
    static inline float32x4_t Permute3(float32x4_t in){ // N:not used by Boyle
      return in;
    };
    static inline float64x2_t Permute0(float64x2_t in){ // N:ok
      // AB -> BA
      return vextq_f64(in, in, 1);
    };
    static inline float64x2_t Permute1(float64x2_t in){ // N:not used by Boyle
      return in;
    };
    static inline float64x2_t Permute2(float64x2_t in){ // N:not used by Boyle
      return in;
    };
    static inline float64x2_t Permute3(float64x2_t in){ // N:not used by Boyle
      return in;
    };
  };
  struct Rotate{
    static inline float32x4_t rotate(float32x4_t in,int n){ // N:ok
      switch(n){
      case 0: // AB CD -> AB CD
        return tRotate<0>(in);
        break;
      case 1: // AB CD -> BC DA
        return tRotate<1>(in);
        break;
      case 2: // AB CD -> CD AB
        return tRotate<2>(in);
        break;
      case 3: // AB CD -> DA BC
        return tRotate<3>(in);
        break;
      default: assert(0);
      }
    }
    static inline float64x2_t rotate(float64x2_t in,int n){ // N:ok
      switch(n){
      case 0: // AB -> AB
        return tRotate<0>(in);
        break;
      case 1: // AB -> BA
        return tRotate<1>(in);
        break;
      default: assert(0);
      }
    }
 // working, but no restriction on n
 //    template<int n> static inline float32x4_t tRotate(float32x4_t in){ return vextq_f32(in,in,n); };
 //    template<int n> static inline float64x2_t tRotate(float64x2_t in){ return vextq_f64(in,in,n); };
 // restriction on n
    template<int n> static inline float32x4_t tRotate(float32x4_t in){ return vextq_f32(in,in,n%4); };
    template<int n> static inline float64x2_t tRotate(float64x2_t in){ return vextq_f64(in,in,n%2); };
  };
  struct PrecisionChange {
    static inline float16x8_t StoH (const float32x4_t &a,const float32x4_t &b) {
      float16x4_t h = vcvt_f16_f32(a);
      return vcvt_high_f16_f32(h, b);
    }
    static inline void  HtoS (float16x8_t h,float32x4_t &sa,float32x4_t &sb) {
      sb = vcvt_high_f32_f16(h);
      // there is no direct conversion from the lower half of float16x8_t to float32x4_t
      // vextq_f16 not supported by clang 3.8 / 4.0 / arm clang
      //float16x8_t h1 = vextq_f16(h, h, 4); // correct, but not supported by clang
      // workaround for clang
      uint32x4_t h1u = reinterpret_cast<uint32x4_t>(h);
      float16x8_t h1 = reinterpret_cast<float16x8_t>(vextq_u32(h1u, h1u, 2));
      sa = vcvt_high_f32_f16(h1);
    }
    static inline float32x4_t DtoS (float64x2_t a,float64x2_t b) {
      float32x2_t s = vcvt_f32_f64(a);
      return vcvt_high_f32_f64(s, b);
    }
    static inline void StoD (float32x4_t s,float64x2_t &a,float64x2_t &b) {
      b = vcvt_high_f64_f32(s);
      // there is no direct conversion from lower float32x4_t to float64x2_t
      float32x4_t s1 = vextq_f32(s, s, 2);
      a = vcvt_high_f64_f32(s1);
    }
    static inline float16x8_t DtoH (float64x2_t a,float64x2_t b,float64x2_t c,float64x2_t d) {
      float32x4_t s1 = DtoS(a, b);
      float32x4_t s2 = DtoS(c, d);
      return StoH(s1, s2);
    }
    static inline void HtoD (float16x8_t h,float64x2_t &a,float64x2_t &b,float64x2_t &c,float64x2_t &d) {
      float32x4_t s1, s2;
      HtoS(h, s1, s2);
      StoD(s1, a, b);
      StoD(s2, c, d);
    }
  };
  //////////////////////////////////////////////
  // Exchange support
  struct Exchange{
    static inline void Exchange0(float32x4_t &out1,float32x4_t &out2,float32x4_t in1,float32x4_t in2){
      // in1: ABCD -> out1: ABEF
      // in2: EFGH -> out2: CDGH
      // z: CDAB
      float32x4_t z = vextq_f32(in1, in1, 2);
      // out1: ABEF
      out1 = vextq_f32(z, in2, 2);
      // z: GHEF
      z = vextq_f32(in2, in2, 2);
      // out2: CDGH
      out2 = vextq_f32(in1, z, 2);
    };
    static inline void Exchange1(float32x4_t &out1,float32x4_t &out2,float32x4_t in1,float32x4_t in2){
      // in1: ABCD -> out1: AECG
      // in2: EFGH -> out2: BFDH
      out1 = vtrn1q_f32(in1, in2);
      out2 = vtrn2q_f32(in1, in2);
    };
    static inline void Exchange2(float32x4_t &out1,float32x4_t &out2,float32x4_t in1,float32x4_t in2){
      assert(0);
      return;
    };
    static inline void Exchange3(float32x4_t &out1,float32x4_t &out2,float32x4_t in1,float32x4_t in2){
      assert(0);
      return;
    };
    // double precision
    static inline void Exchange0(float64x2_t &out1,float64x2_t &out2,float64x2_t in1,float64x2_t in2){
      // in1: AB -> out1: AC
      // in2: CD -> out2: BD
      out1 = vzip1q_f64(in1, in2);
      out2 = vzip2q_f64(in1, in2);
    };
    static inline void Exchange1(float64x2_t &out1,float64x2_t &out2,float64x2_t in1,float64x2_t in2){
      assert(0);
      return;
    };
    static inline void Exchange2(float64x2_t &out1,float64x2_t &out2,float64x2_t in1,float64x2_t in2){
      assert(0);
      return;
    };
    static inline void Exchange3(float64x2_t &out1,float64x2_t &out2,float64x2_t in1,float64x2_t in2){
      assert(0);
      return;
    };
  };
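The `ABCD`/`EFGH` comments in the single-precision `Exchange0` and `Exchange1` can be reproduced with a scalar model of `vextq_f32`, `vtrn1q_f32` and `vtrn2q_f32`. The helpers below (`V4`, `ext`, `trn1`, `trn2`, `exchange0`, `exchange1`) are hypothetical names introduced for this sketch. The real `vextq_f32` takes its shift as a compile-time constant, whereas this model accepts a runtime `int`.

```cpp
#include <array>
#include <cassert>

// Hypothetical scalar models of the NEON lane operations (illustration only).
using V4 = std::array<float, 4>;

// vextq_f32(a, b, n): lanes n..3 of a followed by lanes 0..n-1 of b
inline V4 ext(const V4 &a, const V4 &b, int n) {
  V4 r;
  for (int i = 0; i < 4; ++i) r[i] = (i + n < 4) ? a[i + n] : b[i + n - 4];
  return r;
}
// vtrn1q_f32(a, b) -> a0 b0 a2 b2
inline V4 trn1(const V4 &a, const V4 &b) { return V4{{a[0], b[0], a[2], b[2]}}; }
// vtrn2q_f32(a, b) -> a1 b1 a3 b3
inline V4 trn2(const V4 &a, const V4 &b) { return V4{{a[1], b[1], a[3], b[3]}}; }

// in1: ABCD, in2: EFGH -> out1: ABEF, out2: CDGH
inline void exchange0(V4 &out1, V4 &out2, const V4 &in1, const V4 &in2) {
  V4 z = ext(in1, in1, 2);  // CDAB
  out1 = ext(z, in2, 2);    // ABEF
  z = ext(in2, in2, 2);     // GHEF
  out2 = ext(in1, z, 2);    // CDGH
}

// in1: ABCD, in2: EFGH -> out1: AECG, out2: BFDH
inline void exchange1(V4 &out1, V4 &out2, const V4 &in1, const V4 &in2) {
  out1 = trn1(in1, in2);
  out2 = trn2(in1, in2);
}
```

Labelling the lanes `0..3` and `4..7` in place of `A..D` and `E..H` shows the exchanged halves directly: `exchange0` yields `0 1 4 5` and `2 3 6 7`, while `exchange1` yields `0 4 2 6` and `1 5 3 7`.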
  //////////////////////////////////////////////
  // Some Template specialization
  template < typename vtype > 
    void permute(vtype &a, vtype b, int perm) {
  }; 
  //Complex float Reduce
  template<>
  inline Grid::ComplexF Reduce<Grid::ComplexF, float32x4_t>::operator()(float32x4_t in){
-    return 0;
+    float32x4_t v1; // two complex
    v1 = Optimization::Permute::Permute0(in);
    v1 = vaddq_f32(v1,in);
    u128f conv;    conv.v=v1;
    return Grid::ComplexF(conv.f[0],conv.f[1]);
  }
  //Real float Reduce
  template<>
  inline Grid::RealF Reduce<Grid::RealF, float32x4_t>::operator()(float32x4_t in){
-    float32x2_t high = vget_high_f32(in);
-    float32x2_t low = vget_low_f32(in);
-    float32x2_t tmp = vadd_f32(low, high);
-    float32x2_t sum = vpadd_f32(tmp, tmp);
-    return vget_lane_f32(sum,0);
+    return vaddvq_f32(in);
  }
-  
+
-  
+
  //Complex double Reduce
-  template<>
+  template<> // N:by Boyle
  inline Grid::ComplexD Reduce<Grid::ComplexD, float64x2_t>::operator()(float64x2_t in){
-    return 0;
+    u128d conv; conv.v = in;
    return Grid::ComplexD(conv.f[0],conv.f[1]);
  }
-  
+
  //Real double Reduce
  template<>
  inline Grid::RealD Reduce<Grid::RealD, float64x2_t>::operator()(float64x2_t in){
-    float64x2_t sum = vpaddq_f64(in, in);
-    return vgetq_lane_f64(sum,0);
+    return vaddvq_f64(in);
  }
  //Integer Reduce
  template<>
  inline Integer Reduce<Integer, uint32x4_t>::operator()(uint32x4_t in){
    // FIXME unimplemented
-   printf("Reduce : Missing integer implementation -> FIX\n");
+    printf("Reduce : Missing integer implementation -> FIX\n");
    assert(0);
  }
 }
 //////////////////////////////////////////////////////////////////////////////////////
-// Here assign types 
+// Here assign types
 namespace Grid {
 // typedef Optimization::vech SIMD_Htype; // Reduced precision type
  typedef float16x8_t  SIMD_Htype; // Half precision type
  typedef float32x4_t  SIMD_Ftype; // Single precision type
  typedef float64x2_t  SIMD_Dtype; // Double precision type
  typedef uint32x4_t   SIMD_Itype; // Integer type
@@ -312,13 +581,6 @@ namespace Grid {
  inline void prefetch_HINT_T0(const char *ptr){};
  // Gpermute function
  template < typename VectorSIMD > 
    inline void Gpermute(VectorSIMD &y,const VectorSIMD &b, int perm ) {
    Optimization::permute(y.v,b.v,perm);
  }
  // Function name aliases
  typedef Optimization::Vsplat   VsplatSIMD;
  typedef Optimization::Vstore   VstoreSIMD;
@@ -326,16 +588,19 @@ namespace Grid {
  typedef Optimization::Vstream  VstreamSIMD;
  template <typename S, typename T> using ReduceSIMD = Optimization::Reduce<S,T>;
- 
+
  // Arithmetic operations
  typedef Optimization::Sum         SumSIMD;
  typedef Optimization::Sub         SubSIMD;
  typedef Optimization::Div         DivSIMD;
  typedef Optimization::Mult        MultSIMD;
  typedef Optimization::MultComplex MultComplexSIMD;
  typedef Optimization::MultRealPart MultRealPartSIMD;
  typedef Optimization::MaddRealPart MaddRealPartSIMD;
  typedef Optimization::Conj        ConjSIMD;
  typedef Optimization::TimesMinusI TimesMinusISIMD;
  typedef Optimization::TimesI      TimesISIMD;
-}
+}
@@ -374,6 +374,84 @@ namespace Optimization {
    // Complex float
    FLOAT_WRAP_2(operator(), inline)
  };
 #define USE_FP16
  struct PrecisionChange {
    static inline vech StoH (const vector4float &a, const vector4float &b) {
      vech ret;
      std::cout << GridLogError << "QPX single to half precision conversion not yet supported." << std::endl;
      assert(0);
      return ret;
    }
    static inline void  HtoS (vech h, vector4float &sa, vector4float &sb) {
      std::cout << GridLogError << "QPX half to single precision conversion not yet supported." << std::endl;
      assert(0);
    }
    static inline vector4float DtoS (vector4double a, vector4double b) {
      vector4float ret;
      std::cout << GridLogError << "QPX double to single precision conversion not yet supported." << std::endl;
      assert(0);
      return ret;
    }
    static inline void StoD (vector4float s, vector4double &a, vector4double &b) {
      std::cout << GridLogError << "QPX single to double precision conversion not yet supported." << std::endl;
      assert(0);
    }
    static inline vech DtoH (vector4double a, vector4double b, 
                             vector4double c, vector4double d) {
      vech ret;
      std::cout << GridLogError << "QPX double to half precision conversion not yet supported." << std::endl;
      assert(0);
      return ret;
    }
    static inline void HtoD (vech h, vector4double &a, vector4double &b, 
                                     vector4double &c, vector4double &d) {
      std::cout << GridLogError << "QPX half to double precision conversion not yet supported." << std::endl;
      assert(0);
    }
  };
  //////////////////////////////////////////////
  // Exchange support
 #define FLOAT_WRAP_EXCHANGE(fn) \
  static inline void fn(vector4float &out1, vector4float &out2, \
                        vector4float in1,  vector4float in2) \
  { \
    vector4double out1d, out2d, in1d, in2d; \
    in1d  = Vset()(in1);   \
    in2d  = Vset()(in2);   \
    fn(out1d, out2d, in1d, in2d); \
    Vstore()(out1d, out1); \
    Vstore()(out2d, out2); \
  }
  struct Exchange{
    // double precision
    static inline void Exchange0(vector4double &out1, vector4double &out2,
                                 vector4double in1,  vector4double in2) {
      out1 = vec_perm(in1, in2, vec_gpci(0145));
      out2 = vec_perm(in1, in2, vec_gpci(02367));
    }
    static inline void Exchange1(vector4double &out1, vector4double &out2,
                                 vector4double in1,  vector4double in2) {
      out1 = vec_perm(in1, in2, vec_gpci(0426));
      out2 = vec_perm(in1, in2, vec_gpci(01537));
    }
    static inline void Exchange2(vector4double &out1, vector4double &out2,
                                 vector4double in1,  vector4double in2) {
      assert(0);
    }
    static inline void Exchange3(vector4double &out1, vector4double &out2,
                                 vector4double in1,  vector4double in2) {
      assert(0);
    }
    // single precision
    FLOAT_WRAP_EXCHANGE(Exchange0);
    FLOAT_WRAP_EXCHANGE(Exchange1);
    FLOAT_WRAP_EXCHANGE(Exchange2);
    FLOAT_WRAP_EXCHANGE(Exchange3);
  };
  struct Permute{
    //Complex double
@@ -497,15 +575,19 @@ namespace Optimization {
  //Integer Reduce
  template<>
-  inline Integer Reduce<Integer, int>::operator()(int in){
+  inline Integer Reduce<Integer, veci>::operator()(veci in){
-    // FIXME unimplemented
+    Integer a = 0;
-    printf("Reduce : Missing integer implementation -> FIX\n");
+    for (unsigned int i = 0; i < W<Integer>::r; ++i)
-    assert(0);
+    {
        a += in.v[i];
    }
    return a;
  }
 }
 ////////////////////////////////////////////////////////////////////////////////
 // Here assign types
 typedef Optimization::vech         SIMD_Htype;  // Half precision type
 typedef Optimization::vector4float SIMD_Ftype;  // Single precision type
 typedef vector4double              SIMD_Dtype; // Double precision type
 typedef Optimization::veci         SIMD_Itype; // Integer type
@@ -570,9 +570,9 @@ namespace Optimization {
  //Integer Reduce
  template<>
  inline Integer Reduce<Integer, __m128i>::operator()(__m128i in){
-    // FIXME unimplemented
+    __m128i v1 = _mm_hadd_epi32(in, in);
-   printf("Reduce : Missing integer implementation -> FIX\n");
+    __m128i v2 = _mm_hadd_epi32(v1, v1);
-    assert(0);
+    return _mm_cvtsi128_si32(v2);
  }
 }
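The SSE integer reduction above relies on two horizontal adds leaving the full sum in every lane. A scalar sketch of that reasoning follows. `I4`, `hadd` and `reduce_int` are hypothetical names for this illustration, and `hadd` models the lane behaviour of `_mm_hadd_epi32` without calling it.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical scalar model of a 4 x 32-bit integer vector (illustration only).
using I4 = std::array<std::int32_t, 4>;

// _mm_hadd_epi32(a, b) -> a0+a1, a2+a3, b0+b1, b2+b3
inline I4 hadd(const I4 &a, const I4 &b) {
  return I4{{a[0] + a[1], a[2] + a[3], b[0] + b[1], b[2] + b[3]}};
}

// Same steps as Reduce<Integer, __m128i>: two horizontal adds, then lane 0.
inline std::int32_t reduce_int(const I4 &in) {
  I4 v1 = hadd(in, in);  // s01 s23 s01 s23
  I4 v2 = hadd(v1, v1);  // every lane now holds s01 + s23
  return v2[0];          // models _mm_cvtsi128_si32
}
```

With the lane values `30, 60, 90, 120` the first step produces `90, 210, 90, 210` and the second leaves `300` in every lane, so `reduce_int` returns `300`.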
@@ -53,7 +53,7 @@ directory
 #if defined IMCI
 #include "Grid_imci.h"
 #endif
-#ifdef NEONv8
+#ifdef NEONV8
 #include "Grid_neon.h"
 #endif
 #if defined QPX
@@ -751,8 +751,8 @@ inline Grid_simd<std::complex<R>, V> toComplex(const Grid_simd<R, V> &in) {
  conv.v = in.v;
  for (int i = 0; i < Rsimd::Nsimd(); i += 2) {
-    assert(conv.s[i + 1] ==
+    assert(conv.s[i + 1] == conv.s[i]);  
-           conv.s[i]);  // trap any cases where real was not duplicated
+    // trap any cases where real was not duplicated
    // indicating the SIMD grids of real and imag assignment did not correctly
    // match
    conv.s[i + 1] = 0.0;  // zero imaginary parts
@@ -32,8 +32,11 @@ Author: paboyle <paboyle@ph.ed.ac.uk>
 namespace Grid {
 int LebesgueOrder::UseLebesgueOrder;
 #ifdef KNL
 std::vector<int> LebesgueOrder::Block({8,2,2,2});
-
+#else
 std::vector<int> LebesgueOrder::Block({2,2,2,2});
 #endif
 LebesgueOrder::IndexInteger LebesgueOrder::alignup(IndexInteger n){
  n--;           // 1000 0011 --> 1000 0010
  n |= n >> 1;   // 1000 0010 | 0100 0001 = 1100 0011
@@ -51,8 +54,31 @@ LebesgueOrder::LebesgueOrder(GridBase *_grid)
  if ( Block[0]==0) ZGraph();
  else if ( Block[1]==0) NoBlocking();
  else CartesianBlocking();
 }
  if (0) {
    std::cout << "Thread Interleaving"<<std::endl;
    ThreadInterleave();
  } 
 }
 void LebesgueOrder::ThreadInterleave(void)
 {
  std::vector<IndexInteger> reorder = _LebesgueReorder;
  std::vector<IndexInteger> throrder;
  int vol = _LebesgueReorder.size();
  int threads = GridThread::GetThreads();
  int blockbits=3;
  int blocklen = 8;
  int msk      = 0x7;
  for(int t=0;t<threads;t++){
    for(int ss=0;ss<vol;ss++){
       if ( ( ss >> blockbits) % threads == t ) { 
         throrder.push_back(reorder[ss]);
       }
    }
  }
  _LebesgueReorder = throrder;
 }
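`ThreadInterleave` regroups the reorder table so that blocks of `1 << blockbits` consecutive entries are dealt round-robin to threads, and each thread's blocks end up contiguous. The standalone sketch below shows that reordering on plain integers. `thread_interleave` and its explicit `threads` and `blockbits` parameters are assumptions made for illustration: the member function reads these from `GridThread::GetThreads()` and a hard-coded `blockbits=3`.

```cpp
#include <cassert>
#include <vector>

// Deal blocks of (1 << blockbits) consecutive entries round-robin to
// `threads` threads, then concatenate each thread's entries in thread order.
// Hypothetical free-function version of the loop in ThreadInterleave.
inline std::vector<int> thread_interleave(const std::vector<int> &reorder,
                                          int threads, int blockbits) {
  std::vector<int> throrder;
  throrder.reserve(reorder.size());
  const int vol = static_cast<int>(reorder.size());
  for (int t = 0; t < threads; t++) {
    for (int ss = 0; ss < vol; ss++) {
      // Block index of entry ss is ss >> blockbits; thread t owns every
      // block whose index is congruent to t modulo the thread count.
      if ((ss >> blockbits) % threads == t) {
        throrder.push_back(reorder[ss]);
      }
    }
  }
  return throrder;
}
```

For eight entries `0..7`, two threads and `blockbits = 1`, the blocks are `{0,1} {2,3} {4,5} {6,7}`. Thread 0 takes the first and third block and thread 1 the other two, giving `0 1 4 5 2 3 6 7`. The double loop costs `threads * vol` iterations, which is acceptable for a one-off setup step but would be worth replacing with a single pass if it were called often.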
 void LebesgueOrder::NoBlocking(void) 
 {
  std::cout<<GridLogDebug<<"Lexicographic : no cache blocking"<<std::endl;
@@ -70,6 +70,8 @@ namespace Grid {
 		  std::vector<IndexInteger> & xi,
 		  std::vector<IndexInteger> &dims);
    void ThreadInterleave(void);
  private:
    std::vector<IndexInteger> _LebesgueReorder;
@@ -285,7 +285,7 @@ class CartesianStencil { // Stencil runs along coordinate axes only; NO diagonal
  {
    int dimension    = _directions[point];
    int displacement = _distances[point];
-
+    
    int fd = _grid->_fdimensions[dimension];
    int rd = _grid->_rdimensions[dimension];
@@ -156,11 +156,18 @@ class iScalar {
  // convert from a something to a scalar via constructor of something arg
  template <class T, typename std::enable_if<!isGridTensor<T>::value, T>::type * = nullptr>
-    strong_inline iScalar<vtype> operator=(T arg) {
+  strong_inline iScalar<vtype> operator=(T arg) {
    _internal = arg;
    return *this;
  }
  // Convert elements
  template <class ttype>
  strong_inline iScalar<vtype> operator=(iScalar<ttype> &&arg) {
    _internal = arg._internal;
    return *this;
  }
  friend std::ostream &operator<<(std::ostream &stream,const iScalar<vtype> &o) {
    stream << "S {" << o._internal << "}";
    return stream;
@@ -80,8 +80,11 @@ template<class vtype, int N> inline iVector<vtype, N> Exponentiate(const iVector
      mat iQ2 = arg*arg*alpha*alpha;
      mat iQ3 = arg*iQ2*alpha;   
      // sign in c0 from the conventions on the Ta
-      c0 = -imag( trace(iQ3) ) * one_over_three;  
+      scalar imQ3, reQ2;
-      c1 = -real( trace(iQ2) ) * one_over_two;
+      imQ3 = imag( trace(iQ3) );
      reQ2 = real( trace(iQ2) );
      c0 = -imQ3 * one_over_three;  
      c1 = -reQ2 * one_over_two;
      // Cayley Hamilton checks to machine precision, tested
      tmp = c1 * one_over_three;
@@ -36,6 +36,7 @@ using namespace Grid::QCD;
 int main (int argc, char ** argv)
 {
 #ifdef HAVE_LIME
  Grid_init(&argc,&argv);
  std::cout <<GridLogMessage<< " main "<<std::endl;
@@ -96,4 +97,5 @@ int main (int argc, char ** argv)
  std::cout <<GridLogMessage<< "norm2 Gauge Diff = "<<norm2(Umu_diff)<<std::endl;
  Grid_finalize();
 #endif
 }
@@ -36,6 +36,7 @@ using namespace Grid::QCD;
 int main (int argc, char ** argv)
 {
 #ifdef HAVE_LIME
  Grid_init(&argc,&argv);
@@ -112,4 +113,5 @@ int main (int argc, char ** argv)
  std::cout<<GridLogMessage << "calculated link trace " <<l*LinkTraceScale<<std::endl;
  Grid_finalize();
 #endif
 }
@@ -183,8 +183,6 @@ void IntTester(const functor &func)
 {
  typedef Integer  scal;
  typedef vInteger vec;
  GridSerialRNG          sRNG;
  sRNG.SeedFixedIntegers(std::vector<int>({45,12,81,9}));
  int Nsimd = vec::Nsimd();
@@ -287,6 +285,50 @@ void ReductionTester(const functor &func)
 }
 template<class reduced,class scal, class vec,class functor > 
 void IntReductionTester(const functor &func)
 {
  int Nsimd = vec::Nsimd();
  std::vector<scal> input1(Nsimd);
  std::vector<scal> input2(Nsimd);
  reduced result(0);
  reduced reference(0);
  reduced tmp;
  std::vector<vec,alignedAllocator<vec> > buf(3);
  vec & v_input1 = buf[0];
  vec & v_input2 = buf[1];
  for(int i=0;i<Nsimd;i++){
    input1[i] = (i + 1) * 30;
    input2[i] = (i + 1) * 20;
  }
  merge<vec,scal>(v_input1,input1);
  merge<vec,scal>(v_input2,input2);
  func.template vfunc<reduced,vec>(result,v_input1,v_input2);
  for(int i=0;i<Nsimd;i++) {
    func.template sfunc<reduced,scal>(tmp,input1[i],input2[i]);
    reference+=tmp;
  }
  std::cout<<GridLogMessage << " " << func.name()<<std::endl;
  int ok=0;
  if ( reference-result != 0 ){
    std::cout<<GridLogMessage<< "*****" << std::endl;
    std::cout<<GridLogMessage<< reference-result << " " <<reference<< " " << result<<std::endl;
    ok++;
  }
  if ( ok==0 ) {
    std::cout<<GridLogMessage << " OK!" <<std::endl;
  }
  assert(ok==0);
 }
 class funcPermute {
 public:
@@ -691,6 +733,7 @@ int main (int argc, char ** argv)
  IntTester(funcPlus());
  IntTester(funcMinus());
  IntTester(funcTimes());
  IntReductionTester<Integer, Integer, vInteger>(funcReduce());
  std::cout<<GridLogMessage << "==================================="<<  std::endl;
  std::cout<<GridLogMessage << "Testing precisionChange            "<<  std::endl;
@@ -1,6 +1,6 @@
    /*************************************************************************************
-    Grid physics library, www.github.com/paboyle/Grid 
+    Grid physics library, www.github.com/paboyle/Grid
    Source file: ./tests/Test_stencil.cc
@@ -33,9 +33,8 @@ using namespace std;
 using namespace Grid;
 using namespace Grid::QCD;
-int main (int argc, char ** argv)
-{
-  Grid_init(&argc,&argv);
+int main(int argc, char ** argv) {
+  Grid_init(&argc, &argv);
  //  typedef LatticeColourMatrix Field;
  typedef LatticeComplex Field;
@@ -47,7 +46,7 @@ int main (int argc, char ** argv)
  std::vector<int> mpi_layout  = GridDefaultMpi();
  double volume = latt_size[0]*latt_size[1]*latt_size[2]*latt_size[3];
-    
+
  GridCartesian Fine(latt_size,simd_layout,mpi_layout);
  GridRedBlackCartesian rbFine(latt_size,simd_layout,mpi_layout);
  GridParallelRNG       fRNG(&Fine);
@@ -55,14 +54,14 @@ int main (int argc, char ** argv)
  //  fRNG.SeedFixedIntegers(std::vector<int>({45,12,81,9});
  std::vector<int> seeds({1,2,3,4});
  fRNG.SeedFixedIntegers(seeds);
-  
+
  Field Foo(&Fine);
  Field Bar(&Fine);
  Field Check(&Fine);
  Field Diff(&Fine);
  LatticeComplex lex(&Fine);
-  lex = zero;  
+  lex = zero;
  random(fRNG,Foo);
  gaussian(fRNG,Bar);
@@ -98,7 +97,7 @@ int main (int argc, char ** argv)
 	  Fine.oCoorFromOindex(ocoor,o);
 	  ocoor[dir]=(ocoor[dir]+disp)%Fine._rdimensions[dir];
 	}
-	
+
 	SimpleCompressor<vobj> compress;
 	myStencil.HaloExchange(Foo,compress);
@@ -147,7 +146,7 @@ int main (int argc, char ** argv)
 		      <<") " <<check<<" vs "<<bar<<std::endl;
 	  }
-	 
+
 	}}}}
 	if (nrm > 1.0e-4) {
@@ -187,16 +186,15 @@ int main (int argc, char ** argv)
 	  Fine.oCoorFromOindex(ocoor,o);
 	  ocoor[dir]=(ocoor[dir]+disp)%Fine._rdimensions[dir];
 	}
-	
+
 	SimpleCompressor<vobj> compress;
 	Bar = Cshift(Foo,dir,disp);
 	if ( disp & 0x1 ) {
 	  ECheck.checkerboard = Even;
 	  OCheck.checkerboard = Odd;
-	} else { 
+	} else {
 	  ECheck.checkerboard = Odd;
 	  OCheck.checkerboard = Even;
 	}
@@ -213,7 +211,7 @@ int main (int argc, char ** argv)
 	    permute(OCheck._odata[i],EFoo._odata[SE->_offset],permute_type);
 	  else if (SE->_is_local)
 	    OCheck._odata[i] = EFoo._odata[SE->_offset];
-	  else 
+	  else
 	    OCheck._odata[i] = EStencil.CommBuf()[SE->_offset];
 	}
 	OStencil.HaloExchange(OFoo,compress);
@@ -222,18 +220,18 @@ int main (int argc, char ** argv)
 	  StencilEntry *SE;
 	  SE = OStencil.GetEntry(permute_type,0,i);
 	  //	  std::cout << "ODD source "<< i<<" -> " <<SE->_offset << " "<< SE->_is_local<<std::endl;
-	  
+
 	  if ( SE->_is_local && SE->_permute )
 	    permute(ECheck._odata[i],OFoo._odata[SE->_offset],permute_type);
 	  else if (SE->_is_local)
 	    ECheck._odata[i] = OFoo._odata[SE->_offset];
-	  else 
+	  else
 	    ECheck._odata[i] = OStencil.CommBuf()[SE->_offset];
 	}
-	
+
 	setCheckerboard(Check,ECheck);
 	setCheckerboard(Check,OCheck);
-	
+
 	Real nrmC = norm2(Check);
 	Real nrmB = norm2(Bar);
 	Diff = Check-Bar;
@@ -256,10 +254,10 @@ int main (int argc, char ** argv)
 	  diff =norm2(ddiff);
 	  if ( diff > 0){
 	    std::cout <<"Coor (" << coor[0]<<","<<coor[1]<<","<<coor[2]<<","<<coor[3] <<") "
-		      <<"shift "<<disp<<" dir "<< dir 
+		      <<"shift "<<disp<<" dir "<< dir
 		      << "  stencil impl " <<check<<" vs cshift impl "<<bar<<std::endl;
 	  }
-	 
+
 	}}}}
 	if (nrm > 1.0e-4) exit(-1);
@@ -73,7 +73,7 @@ int main (int argc, char ** argv)
  std::vector<LatticeColourMatrix> U(4,&Fine);
-  NerscField header;
+  FieldMetaData header;
  std::string file("./ckpoint_lat.4000");
  NerscIO::readConfiguration(Umu,header,file);
@@ -90,7 +90,7 @@ int main (int argc, char ** argv)
  std::vector<LatticeColourMatrix> U(4,&Fine);
-  NerscField header;
+  FieldMetaData header;
  std::string file("./ckpoint_lat.4000");
  NerscIO::readConfiguration(Umu,header,file);
@@ -28,212 +28,6 @@ Author: Peter Boyle <paboyle@ph.ed.ac.uk>
    /*  END LEGAL */
 #include <Grid/Grid.h>
 using namespace Grid;
 using namespace Grid::QCD;
 template <class Gimpl> 
 class FourierAcceleratedGaugeFixer  : public Gimpl {
  public:
  INHERIT_GIMPL_TYPES(Gimpl);
  typedef typename Gimpl::GaugeLinkField GaugeMat;
  typedef typename Gimpl::GaugeField GaugeLorentz;
  static void GaugeLinkToLieAlgebraField(const std::vector<GaugeMat> &U,std::vector<GaugeMat> &A) {
    for(int mu=0;mu<Nd;mu++){
 //      ImplComplex cmi(0.0,-1.0);
      Complex cmi(0.0,-1.0);
      A[mu] = Ta(U[mu]) * cmi;
    }
  }
  static void DmuAmu(const std::vector<GaugeMat> &A,GaugeMat &dmuAmu) {
    dmuAmu=zero;
    for(int mu=0;mu<Nd;mu++){
      dmuAmu = dmuAmu + A[mu] - Cshift(A[mu],mu,-1);
    }
  }  
  static void SteepestDescentGaugeFix(GaugeLorentz &Umu,Real & alpha,int maxiter,Real Omega_tol, Real Phi_tol) {
    GridBase *grid = Umu._grid;
    Real org_plaq      =WilsonLoops<Gimpl>::avgPlaquette(Umu);
    Real org_link_trace=WilsonLoops<Gimpl>::linkTrace(Umu); 
    Real old_trace = org_link_trace;
    Real trG;
    std::vector<GaugeMat> U(Nd,grid);
                 GaugeMat dmuAmu(grid);
    for(int i=0;i<maxiter;i++){
      for(int mu=0;mu<Nd;mu++) U[mu]= PeekIndex<LorentzIndex>(Umu,mu);
      //trG = SteepestDescentStep(U,alpha,dmuAmu);
      trG = FourierAccelSteepestDescentStep(U,alpha,dmuAmu);
      for(int mu=0;mu<Nd;mu++) PokeIndex<LorentzIndex>(Umu,U[mu],mu);
      // Monitor progress and convergence test 
      // infrequently to minimise cost overhead
      if ( i %20 == 0 ) { 
 	Real plaq      =WilsonLoops<Gimpl>::avgPlaquette(Umu);
 	Real link_trace=WilsonLoops<Gimpl>::linkTrace(Umu); 
 	std::cout << GridLogMessage << " Iteration "<<i<< " plaq= "<<plaq<< " dmuAmu " << norm2(dmuAmu)<< std::endl;
 	Real Phi  = 1.0 - old_trace / link_trace ;
 	Real Omega= 1.0 - trG;
 	std::cout << GridLogMessage << " Iteration "<<i<< " Phi= "<<Phi<< " Omega= " << Omega<< " trG " << trG <<std::endl;
 	if ( (Omega < Omega_tol) && ( ::fabs(Phi) < Phi_tol) ) {
 	  std::cout << GridLogMessage << "Converged ! "<<std::endl;
 	  return;
 	}
 	old_trace = link_trace;
      }
    }
  };
  static Real SteepestDescentStep(std::vector<GaugeMat> &U,Real & alpha, GaugeMat & dmuAmu) {
    GridBase *grid = U[0]._grid;
    std::vector<GaugeMat> A(Nd,grid);
    GaugeMat g(grid);
    GaugeLinkToLieAlgebraField(U,A);
    ExpiAlphaDmuAmu(A,g,alpha,dmuAmu);
    Real vol = grid->gSites();
    Real trG = TensorRemove(sum(trace(g))).real()/vol/Nc;
    SU<Nc>::GaugeTransform(U,g);
    return trG;
  }
  static Real FourierAccelSteepestDescentStep(std::vector<GaugeMat> &U,Real & alpha, GaugeMat & dmuAmu) {
    GridBase *grid = U[0]._grid;
    Real vol = grid->gSites();
    FFT theFFT((GridCartesian *)grid);
    LatticeComplex  Fp(grid);
    LatticeComplex  psq(grid); psq=zero;
    LatticeComplex  pmu(grid); 
    LatticeComplex   one(grid); one = Complex(1.0,0.0);
    GaugeMat g(grid);
    GaugeMat dmuAmu_p(grid);
    std::vector<GaugeMat> A(Nd,grid);
    GaugeLinkToLieAlgebraField(U,A);
    DmuAmu(A,dmuAmu);
    theFFT.FFT_all_dim(dmuAmu_p,dmuAmu,FFT::forward);
    //////////////////////////////////
    // Work out Fp = psq_max/ psq, where
    // psq = Sum_mu 4 sin^2(p_mu/2) and p_mu = 2 pi n_mu / L_mu
    //////////////////////////////////
    std::vector<int> latt_size = grid->GlobalDimensions();
    std::vector<int> coor(grid->_ndimension,0);
    for(int mu=0;mu<Nd;mu++) {
      Real TwoPiL =  M_PI * 2.0/ latt_size[mu];
      LatticeCoordinate(pmu,mu);
      pmu = TwoPiL * pmu ;
      psq = psq + 4.0*sin(pmu*0.5)*sin(pmu*0.5); 
    }
    // Each term 4 sin^2(p_mu/2) is at most 4, so psq_max = 4*Nd = 16 for Nd=4
    Complex psqMax(16.0);
    Fp =  psqMax*one/psq;
    /*
    static int once;
    if ( once == 0 ) { 
      std::cout << " Fp " << Fp <<std::endl;
      once ++;
      }*/
    // coor is the origin (zero momentum mode), where psq=0; overwrite the divergent Fp with 1
    pokeSite(TComplex(1.0),Fp,coor);
    dmuAmu_p  = dmuAmu_p * Fp; 
    theFFT.FFT_all_dim(dmuAmu,dmuAmu_p,FFT::backward);
    GaugeMat ciadmam(grid);
    Complex cialpha(0.0,-alpha);
    ciadmam = dmuAmu*cialpha;
    SU<Nc>::taExp(ciadmam,g);
    Real trG = TensorRemove(sum(trace(g))).real()/vol/Nc;
    SU<Nc>::GaugeTransform(U,g);
    return trG;
  }
  static void ExpiAlphaDmuAmu(const std::vector<GaugeMat> &A,GaugeMat &g,Real & alpha, GaugeMat &dmuAmu) {
    GridBase *grid = g._grid;
    Complex cialpha(0.0,-alpha);
    GaugeMat ciadmam(grid);
    DmuAmu(A,dmuAmu);
    ciadmam = dmuAmu*cialpha;
    SU<Nc>::taExp(ciadmam,g);
  }  
 /*
  ////////////////////////////////////////////////////////////////
  // NB The FT for fields living on links has an extra phase in it
  // Could add these to the FFT class as a later task since this code
  // might be reused elsewhere ????
  ////////////////////////////////////////////////////////////////
  static void InverseFourierTransformAmu(FFT &theFFT,const std::vector<GaugeMat> &Ap,std::vector<GaugeMat> &Ax) {
    GridBase * grid = theFFT.Grid();
    std::vector<int> latt_size = grid->GlobalDimensions();
    ComplexField  pmu(grid);
    ComplexField  pha(grid);
    GaugeMat      Apha(grid);
    Complex ci(0.0,1.0);
    for(int mu=0;mu<Nd;mu++){
      Real TwoPiL =  M_PI * 2.0/ latt_size[mu];
      LatticeCoordinate(pmu,mu);
      pmu = TwoPiL * pmu ;
      pha = exp(pmu *  (0.5 *ci)); // e(ipmu/2) since Amu(x+mu/2)
      Apha = Ap[mu] * pha;
      theFFT.FFT_all_dim(Apha,Ax[mu],FFT::backward);
    }
  }
  static void FourierTransformAmu(FFT & theFFT,const std::vector<GaugeMat> &Ax,std::vector<GaugeMat> &Ap) {
    GridBase * grid = theFFT.Grid();
    std::vector<int> latt_size = grid->GlobalDimensions();
    ComplexField  pmu(grid);
    ComplexField  pha(grid);
    Complex ci(0.0,1.0);
    // Sign convention for FFTW calls:
    // A(x)= Sum_p e^ipx A(p) / V
    // A(p)= Sum_x e^-ipx A(x)
    for(int mu=0;mu<Nd;mu++){
      Real TwoPiL =  M_PI * 2.0/ latt_size[mu];
      LatticeCoordinate(pmu,mu);
      pmu = TwoPiL * pmu ;
      pha = exp(-pmu *  (0.5 *ci)); // e(-ipmu/2) since Amu(x+mu/2)
      theFFT.FFT_all_dim(Ax[mu],Ap[mu],FFT::backward);
      Ap[mu] = Ap[mu] * pha;
    }
  }
 */
 };
 int main (int argc, char ** argv)
 {
  std::vector<int> seeds({1,2,3,4});
@@ -264,22 +58,24 @@ int main (int argc, char ** argv)
  std::cout<< "*****************************************************************" <<std::endl;
  LatticeGaugeField   Umu(&GRID);
  LatticeGaugeField   Urnd(&GRID);
  LatticeGaugeField   Uorg(&GRID);
  LatticeColourMatrix   g(&GRID); // Gauge xform
  SU3::ColdConfiguration(pRNG,Umu); // Unit gauge
  Uorg=Umu;
  Urnd=Umu;
  SU3::RandomGaugeTransform(pRNG,Urnd,g); // Random gauge transform of the unit gauge
  SU3::RandomGaugeTransform(pRNG,Umu,g); // Random gauge transform of the unit gauge
  Real plaq=WilsonLoops<PeriodicGimplR>::avgPlaquette(Umu);
  std::cout << " Initial plaquette "<<plaq << std::endl;
  Real alpha=0.1;
  FourierAcceleratedGaugeFixer<PeriodicGimplR>::SteepestDescentGaugeFix(Umu,alpha,10000,1.0e-10, 1.0e-10);
  Umu = Urnd;
  FourierAcceleratedGaugeFixer<PeriodicGimplR>::SteepestDescentGaugeFix(Umu,alpha,10000,1.0e-12, 1.0e-12,false);
  plaq=WilsonLoops<PeriodicGimplR>::avgPlaquette(Umu);
  std::cout << " Final plaquette "<<plaq << std::endl;
@@ -288,14 +84,28 @@ int main (int argc, char ** argv)
  std::cout << " Norm Difference "<< norm2(Uorg) << std::endl;
-  //  std::cout<< "*****************************************************************" <<std::endl;
+  std::cout<< "*****************************************************************" <<std::endl;
-  //  std::cout<< "* Testing Fourier accelerated fixing                            *" <<std::endl;
+  std::cout<< "* Testing Fourier accelerated fixing                            *" <<std::endl;
-  //  std::cout<< "*****************************************************************" <<std::endl;
+  std::cout<< "*****************************************************************" <<std::endl;
  Umu=Urnd;
  FourierAcceleratedGaugeFixer<PeriodicGimplR>::SteepestDescentGaugeFix(Umu,alpha,10000,1.0e-12, 1.0e-12,true);
-  //  std::cout<< "*****************************************************************" <<std::endl;
+  plaq=WilsonLoops<PeriodicGimplR>::avgPlaquette(Umu);
-  //  std::cout<< "* Testing non-unit configuration                                *" <<std::endl;
+  std::cout << " Final plaquette "<<plaq << std::endl;
  //  std::cout<< "*****************************************************************" <<std::endl;
  std::cout<< "*****************************************************************" <<std::endl;
  std::cout<< "* Testing non-unit configuration                                *" <<std::endl;
  std::cout<< "*****************************************************************" <<std::endl;
  SU3::HotConfiguration(pRNG,Umu); // Hot start: random, non-unit gauge
  plaq=WilsonLoops<PeriodicGimplR>::avgPlaquette(Umu);
  std::cout << " Initial plaquette "<<plaq << std::endl;
  FourierAcceleratedGaugeFixer<PeriodicGimplR>::SteepestDescentGaugeFix(Umu,alpha,10000,1.0e-12, 1.0e-12,true);
  plaq=WilsonLoops<PeriodicGimplR>::avgPlaquette(Umu);
  std::cout << " Final plaquette "<<plaq << std::endl;
  Grid_finalize();
@@ -336,7 +336,7 @@ int main(int argc, char **argv) {
      std::cout << GridLogMessage << "norm cMmat : " << norm2(cMat)
                << std::endl;
-      cMat = expMat(cMat, ComplexD(1.0, 0.0));
+      cMat = expMat(cMat,1.0);// ComplexD(1.0, 0.0));
      std::cout << GridLogMessage << "norm expMat: " << norm2(cMat)
                << std::endl;
      peekSite(cm, cMat, mysite);
@@ -67,7 +67,7 @@ int main (int argc, char ** argv)
  LatticeFermion    err(FGrid);
  LatticeGaugeField Umu(UGrid); 
-  NerscField header;
+  FieldMetaData header;
  std::string file("./ckpoint_lat.400");
  NerscIO::readConfiguration(Umu,header,file);
@@ -133,8 +133,8 @@ int main (int argc, char ** argv)
  int Nconv;
  RealD eresid = 1.0e-6;
-  ImplicitlyRestartedLanczos<LatticeComplex> IRL(HermOp,X,Nk,Nm,eresid,Nit);
+  ImplicitlyRestartedLanczos<LatticeComplex> IRL(HermOp,X,Nk,Nk,Nm,eresid,Nit);
-  ImplicitlyRestartedLanczos<LatticeComplex> ChebyIRL(HermOp,Cheby,Nk,Nm,eresid,Nit);
+  ImplicitlyRestartedLanczos<LatticeComplex> ChebyIRL(HermOp,Cheby,Nk,Nk,Nm,eresid,Nit);
  LatticeComplex src(grid); gaussian(RNG,src);
  {
@@ -1,368 +0,0 @@
 /*******************************************************************************
 Grid physics library, www.github.com/paboyle/Grid
 Source file: tests/hadrons/Test_hadrons.hpp
 Copyright (C) 2017
 Author: Andrew Lawson <andrew.lawson1991@gmail.com>
 This program is free software; you can redistribute it and/or modify
 it under the terms of the GNU General Public License as published by
 the Free Software Foundation; either version 2 of the License, or
 (at your option) any later version.
 This program is distributed in the hope that it will be useful,
 but WITHOUT ANY WARRANTY; without even the implied warranty of
 MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 GNU General Public License for more details.
 You should have received a copy of the GNU General Public License along
 with this program; if not, write to the Free Software Foundation, Inc.,
 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
 See the full license in the file "LICENSE" in the top level distribution
 directory.
 *******************************************************************************/
 #include <Grid/Hadrons/Application.hpp>
 using namespace Grid;
 using namespace Hadrons;
 /*******************************************************************************
 * Macros to reduce code duplication.
 ******************************************************************************/
 // Useful definitions
 #define ZERO_MOM "0. 0. 0. 0."
 #define INIT_INDEX(s, n) (std::string(s) + "_" + std::to_string(n))
 #define ADD_INDEX(s, n) (s + "_" + std::to_string(n))
 #define LABEL_3PT(s, t1, t2) ADD_INDEX(INIT_INDEX(s, t1), t2)
 #define LABEL_4PT(s, t1, t2, t3) ADD_INDEX(ADD_INDEX(INIT_INDEX(s, t1), t2), t3)
 #define LABEL_4PT_NOISE(s, t1, t2, t3, nn) ADD_INDEX(ADD_INDEX(ADD_INDEX(INIT_INDEX(s, t1), t2), t3), nn)
 // Wall source/sink macros
 #define NAME_3MOM_WALL_SOURCE(t, mom) ("wall_" + std::to_string(t) + "_" + mom)
 #define NAME_WALL_SOURCE(t) NAME_3MOM_WALL_SOURCE(t, ZERO_MOM)
 #define NAME_POINT_SOURCE(pos) ("point_" + pos)
 #define MAKE_3MOM_WALL_PROP(tW, mom, propName, solver)\
 {\
    std::string srcName = NAME_3MOM_WALL_SOURCE(tW, mom);\
    makeWallSource(application, srcName, tW, mom);\
    makePropagator(application, propName, srcName, solver);\
 }
 #define MAKE_WALL_PROP(tW, propName, solver)\
        MAKE_3MOM_WALL_PROP(tW, ZERO_MOM, propName, solver)
 // Sequential source macros
 #define MAKE_SEQUENTIAL_PROP(tS, qSrc, mom, propName, solver)\
 {\
    std::string srcName = ADD_INDEX(qSrc + "_seq", tS);\
    makeSequentialSource(application, srcName, qSrc, tS, mom);\
    makePropagator(application, propName, srcName, solver);\
 }
 // Point source macros
 #define MAKE_POINT_PROP(pos, propName, solver)\
 {\
    std::string srcName = NAME_POINT_SOURCE(pos);\
    makePointSource(application, srcName, pos);\
    makePropagator(application, propName, srcName, solver);\
 }
 /*******************************************************************************
 * Functions for propagator construction.
 ******************************************************************************/
 /*******************************************************************************
 * Name: makePointSource
 * Purpose: Construct point source and add to application module.
 * Parameters: application - main application that stores modules.
 *             srcName     - name of source module to create.
 *             pos         - Position of point source.
 * Returns: None.
 ******************************************************************************/
 inline void makePointSource(Application &application, std::string srcName,
                            std::string pos)
 {
    // If the source already exists, don't make the module again.
    if (!(Environment::getInstance().hasModule(srcName)))
    {
        MSource::Point::Par pointPar;
        pointPar.position = pos;
        application.createModule<MSource::Point>(srcName, pointPar);
    }
 }
 /*******************************************************************************
 * Name: makeSequentialSource
 * Purpose: Construct sequential source and add to application module.
 * Parameters: application - main application that stores modules.
 *             srcName     - name of source module to create.
 *             qSrc        - Input quark for sequential inversion.
 *             tS          - sequential source timeslice.
 *             mom         - momentum insertion (default is zero).
 * Returns: None.
 ******************************************************************************/
 inline void makeSequentialSource(Application &application, std::string srcName,
                                 std::string qSrc, unsigned int tS,
                                 std::string mom = ZERO_MOM)
 {
    // If the source already exists, don't make the module again.
    if (!(Environment::getInstance().hasModule(srcName)))
    {
        MSource::SeqGamma::Par seqPar;
        seqPar.q   = qSrc;
        seqPar.tA  = tS;
        seqPar.tB  = tS;
        seqPar.mom = mom;
        seqPar.gamma = Gamma::Algebra::GammaT;
        application.createModule<MSource::SeqGamma>(srcName, seqPar);
    }
 }
 /*******************************************************************************
 * Name: makeWallSource
 * Purpose: Construct wall source and add to application module.
 * Parameters: application - main application that stores modules.
 *             srcName     - name of source module to create.
 *             tW          - wall source timeslice.
 *             mom         - momentum insertion (default is zero).
 * Returns: None.
 ******************************************************************************/
 inline void makeWallSource(Application &application, std::string srcName,
                           unsigned int tW, std::string mom = ZERO_MOM)
 {
    // If the source already exists, don't make the module again.
    if (!(Environment::getInstance().hasModule(srcName)))
    {
        MSource::Wall::Par wallPar;
        wallPar.tW  = tW;
        wallPar.mom = mom;
        application.createModule<MSource::Wall>(srcName, wallPar);
    }
 }
 /*******************************************************************************
 * Name: makeWallSink
 * Purpose: Wall sink smearing of a propagator.
 * Parameters: application - main application that stores modules.
 *             propName    - name of input propagator.
 *             wallName    - name of smeared propagator.
 *             mom         - momentum insertion (default is zero).
 * Returns: None.
 ******************************************************************************/
 inline void makeWallSink(Application &application, std::string propName,
                         std::string wallName, std::string mom = ZERO_MOM)
 {
    // If the propagator has already been smeared, don't smear it again.
    // Temporarily removed, strategy for sink smearing likely to change.
    /*if (!(Environment::getInstance().hasModule(wallName)))
    {
        MSink::Wall::Par wallPar;
        wallPar.q   = propName;
        wallPar.mom = mom;
        application.createModule<MSink::Wall>(wallName, wallPar);
    }*/
 }
 /*******************************************************************************
 * Name: makePropagator
 * Purpose: Construct propagator from a named source and add to application module.
 * Parameters: application - main application that stores modules.
 *             propName    - name of propagator module to create.
 *             srcName     - name of source module to use.
 *             solver      - name of solver module to use.
 * Returns: None.
 ******************************************************************************/
 inline void makePropagator(Application &application, std::string &propName,
                           std::string &srcName, std::string &solver)
 {
    // If the propagator already exists, don't make the module again.
    if (!(Environment::getInstance().hasModule(propName)))
    {
        Quark::Par         quarkPar;
        quarkPar.source = srcName;
        quarkPar.solver = solver;
        application.createModule<Quark>(propName, quarkPar);
    }
 }
 /*******************************************************************************
 * Name: makeLoop
 * Purpose: Use noise source and inversion result to make loop propagator, then 
 *          add to application module.
 * Parameters: application - main application that stores modules.
 *             propName    - name of propagator module to create.
 *             srcName     - name of noise source module to use.
 *             resName     - name of inversion result on given noise source.
 * Returns: None.
 ******************************************************************************/
 inline void makeLoop(Application &application, std::string &propName,
                     std::string &srcName, std::string &resName)
 {
    // If the loop propagator already exists, don't make the module again.
    if (!(Environment::getInstance().hasModule(propName)))
    {
        MLoop::NoiseLoop::Par loopPar;
        loopPar.q   = resName;
        loopPar.eta = srcName;
        application.createModule<MLoop::NoiseLoop>(propName, loopPar);
    }
 }
 /*******************************************************************************
 * Contraction module creation.
 ******************************************************************************/
 /*******************************************************************************
 * Name: mesonContraction
 * Purpose: Create meson contraction module and add to application module.
 * Parameters: application - main application that stores modules.
 *             npt         - specify n-point correlator (for labelling).
 *             q1          - quark propagator 1.
 *             q2          - quark propagator 2.
 *             label       - unique label to construct module name.
 *             mom         - momentum to project (default is zero)
 *             gammas      - gamma insertions at source and sink.
 * Returns: None.
 ******************************************************************************/
 inline void mesonContraction(Application &application, unsigned int npt, 
                             std::string &q1, std::string &q2,
                             std::string &label, 
                             std::string mom = ZERO_MOM,
                             std::string gammas = "<Gamma5 Gamma5>")
 {
    std::string modName = std::to_string(npt) + "pt_" + label;
    if (!(Environment::getInstance().hasModule(modName)))
    {
        MContraction::Meson::Par mesPar;
        mesPar.output = std::to_string(npt) + "pt/" + label;
        mesPar.q1 = q1;
        mesPar.q2 = q2;
        mesPar.mom = mom;
        mesPar.gammas = gammas;
        application.createModule<MContraction::Meson>(modName, mesPar);
    }
 }
 /*******************************************************************************
 * Name: gamma3ptContraction
 * Purpose: Create gamma3pt contraction module and add to application module.
 * Parameters: application - main application that stores modules.
 *             npt         - specify n-point correlator (for labelling).
 *             q1          - quark propagator 1.
 *             q2          - quark propagator 2.
 *             q3          - quark propagator 3.
 *             label       - unique label to construct module name.
 *             gamma       - gamma insertions between q2 and q3.
 * Returns: None.
 ******************************************************************************/
 inline void gamma3ptContraction(Application &application, unsigned int npt, 
                                std::string &q1, std::string &q2,
                                std::string &q3, std::string &label, 
                                Gamma::Algebra gamma = Gamma::Algebra::Identity)
 {
    std::string modName = std::to_string(npt) + "pt_" + label;
    if (!(Environment::getInstance().hasModule(modName)))
    {
        MContraction::Gamma3pt::Par gamma3ptPar;
        gamma3ptPar.output = std::to_string(npt) + "pt/" + label;
        gamma3ptPar.q1 = q1;
        gamma3ptPar.q2 = q2;
        gamma3ptPar.q3 = q3;
        gamma3ptPar.gamma = gamma;
        application.createModule<MContraction::Gamma3pt>(modName, gamma3ptPar);
    }
 }
 /*******************************************************************************
 * Name: weakContraction[Eye,NonEye]
 * Purpose: Create Weak Hamiltonian contraction module for Eye/NonEye topology
 *          and add to application module.
 * Parameters: application - main application that stores modules.
 *             npt         - specify n-point correlator (for labelling).
 *             q1          - quark propagator 1.
 *             q2          - quark propagator 2.
 *             q3          - quark propagator 3.
 *             q4          - quark propagator 4.
 *             label       - unique label to construct module name.
 * Returns: None.
 ******************************************************************************/
 #define HW_CONTRACTION(top) \
 inline void weakContraction##top(Application &application, unsigned int npt,\
                                 std::string &q1, std::string &q2, \
                                 std::string &q3, std::string &q4, \
                                 std::string &label)\
 {\
    std::string modName = std::to_string(npt) + "pt_" + label;\
    if (!(Environment::getInstance().hasModule(modName)))\
    {\
        MContraction::WeakHamiltonian##top::Par weakPar;\
        weakPar.output = std::to_string(npt) + "pt/" + label;\
        weakPar.q1 = q1;\
        weakPar.q2 = q2;\
        weakPar.q3 = q3;\
        weakPar.q4 = q4;\
        application.createModule<MContraction::WeakHamiltonian##top>(modName, weakPar);\
    }\
 }
 HW_CONTRACTION(Eye)    // weakContractionEye
 HW_CONTRACTION(NonEye) // weakContractionNonEye
 /*******************************************************************************
 * Name: disc0Contraction
 * Purpose: Create contraction module for 4pt Weak Hamiltonian + current
 *          disconnected topology for neutral mesons and add to application 
 *          module.
 * Parameters: application - main application that stores modules.
 *             q1          - quark propagator 1.
 *             q2          - quark propagator 2.
 *             q3          - quark propagator 3.
 *             q4          - quark propagator 4.
 *             label       - unique label to construct module name.
 * Returns: None.
 ******************************************************************************/
 inline void disc0Contraction(Application &application, 
                             std::string &q1, std::string &q2,
                             std::string &q3, std::string &q4,
                             std::string &label)
 {
    std::string modName = "4pt_" + label;
    if (!(Environment::getInstance().hasModule(modName)))
    {
        MContraction::WeakNeutral4ptDisc::Par disc0Par;
        disc0Par.output = "4pt/" + label;
        disc0Par.q1 = q1;
        disc0Par.q2 = q2;
        disc0Par.q3 = q3;
        disc0Par.q4 = q4;
        application.createModule<MContraction::WeakNeutral4ptDisc>(modName, disc0Par);
    }
 }
 /*******************************************************************************
 * Name: discLoopContraction
 * Purpose: Create contraction module for disconnected loop and add to
 *          application module.
 * Parameters: application - main application that stores modules.
 *             q_loop      - loop quark propagator.
 *             modName     - unique module name.
 *             gamma       - gamma matrix to use in contraction.
 * Returns: None.
 ******************************************************************************/
 inline void discLoopContraction(Application &application,
                                std::string &q_loop, std::string &modName,
                                Gamma::Algebra gamma = Gamma::Algebra::Identity)
 {
    if (!(Environment::getInstance().hasModule(modName)))
    {
        MContraction::DiscLoop::Par discPar;
        discPar.output = "disc/" + modName;
        discPar.q_loop = q_loop;
        discPar.gamma  = gamma;
        application.createModule<MContraction::DiscLoop>(modName, discPar);
    }
 }
@@ -65,6 +65,10 @@ int main(int argc, char *argv[])
    // set fermion boundary conditions to be periodic space, antiperiodic time.
    std::string boundary = "1 1 1 -1";
    // sink
    MSink::Point::Par sinkPar;
    sinkPar.mom = "0 0 0";
    application.createModule<MSink::ScalarPoint>("sink", sinkPar);
    for (unsigned int i = 0; i < flavour.size(); ++i)
    {
        // actions
@@ -115,15 +119,15 @@ int main(int argc, char *argv[])
            }
            // propagators
-            Quark::Par quarkPar;
+            MFermion::GaugeProp::Par quarkPar;
            quarkPar.solver = "CG_" + flavour[i];
            quarkPar.source = srcName;
-            application.createModule<Quark>(qName[i], quarkPar);
+            application.createModule<MFermion::GaugeProp>(qName[i], quarkPar);
            for (unsigned int mu = 0; mu < Nd; ++mu)
            {
                quarkPar.source = seqName[i][mu];
                seqName[i][mu]  = "Q_" + flavour[i] + "-" + seqName[i][mu];
-                application.createModule<Quark>(seqName[i][mu], quarkPar);
+                application.createModule<MFermion::GaugeProp>(seqName[i][mu], quarkPar);
            }
        }
@@ -136,7 +140,7 @@ int main(int argc, char *argv[])
            mesPar.q1     = qName[i];
            mesPar.q2     = qName[j];
            mesPar.gammas = "all";
-            mesPar.mom    = "0. 0. 0. 0.";
+            mesPar.sink   = "sink";
            application.createModule<MContraction::Meson>("meson_Z2_"
                                                          + std::to_string(t)
                                                          + "_"
@@ -155,7 +159,7 @@ int main(int argc, char *argv[])
            mesPar.q1     = qName[i];
            mesPar.q2     = seqName[j][mu];
            mesPar.gammas = "all";
-            mesPar.mom    = "0. 0. 0. 0.";
+            mesPar.sink   = "sink";
            application.createModule<MContraction::Meson>("3pt_Z2_"
                                                          + std::to_string(t)
                                                          + "_"
@@ -1,342 +0,0 @@
 /*******************************************************************************
 Grid physics library, www.github.com/paboyle/Grid
 Source file: tests/hadrons/Test_hadrons_rarekaon.cc
 Copyright (C) 2017
 Author: Andrew Lawson <andrew.lawson1991@gmail.com>
 This program is free software; you can redistribute it and/or modify
 it under the terms of the GNU General Public License as published by
 the Free Software Foundation; either version 2 of the License, or
 (at your option) any later version.
 This program is distributed in the hope that it will be useful,
 but WITHOUT ANY WARRANTY; without even the implied warranty of
 MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 GNU General Public License for more details.
 You should have received a copy of the GNU General Public License along
 with this program; if not, write to the Free Software Foundation, Inc.,
 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
 See the full license in the file "LICENSE" in the top level distribution
 directory.
 *******************************************************************************/
 #include "Test_hadrons.hpp"
 using namespace Grid;
 using namespace Hadrons;
 enum quarks
 {
   light   = 0,
   strange = 1,
   charm   = 2  
 };
 int main(int argc, char *argv[])
 {
    // parse command line //////////////////////////////////////////////////////
    std::string configStem;
    if (argc < 2)
    {
        std::cerr << "usage: " << argv[0] << " <configuration filestem> [Grid options]";
        std::cerr << std::endl;
        std::exit(EXIT_FAILURE);
    }
    configStem = argv[1];
    // initialization //////////////////////////////////////////////////////////
    Grid_init(&argc, &argv);
    HadronsLogError.Active(GridLogError.isActive());
    HadronsLogWarning.Active(GridLogWarning.isActive());
    HadronsLogMessage.Active(GridLogMessage.isActive());
    HadronsLogIterative.Active(GridLogIterative.isActive());
    HadronsLogDebug.Active(GridLogDebug.isActive());
    LOG(Message) << "Grid initialized" << std::endl;
    // run setup ///////////////////////////////////////////////////////////////
    Application              application;
    std::vector<double>       mass    = {.01, .04, .2};
    std::vector<std::string>  flavour = {"l", "s", "c"};
    std::vector<std::string>  solvers = {"CG_l", "CG_s", "CG_c"};
    std::string               kmom    = "0. 0. 0. 0.";
    std::string               pmom    = "1. 0. 0. 0.";
    std::string               qmom    = "-1. 0. 0. 0.";
    std::string               mqmom   = "1. 0. 0. 0.";
    std::vector<unsigned int> tKs     = {0};
    unsigned int              dt_pi   = 16;
    std::vector<unsigned int> tJs     = {8};
    unsigned int              n_noise = 1;
    unsigned int              nt      = 32;
    bool                      do_disconnected(false);
    // Global parameters.
    Application::GlobalPar globalPar;
    globalPar.trajCounter.start    = 1500;
    globalPar.trajCounter.end      = 1520;
    globalPar.trajCounter.step     = 20;
    globalPar.seed                 = "1 2 3 4";
    globalPar.genetic.maxGen       = 1000;
    globalPar.genetic.maxCstGen    = 200;
    globalPar.genetic.popSize      = 20;
    globalPar.genetic.mutationRate = .1;
    application.setPar(globalPar);
    // gauge field
    if (configStem == "None")
    {
        application.createModule<MGauge::Unit>("gauge");
    }
    else
    {
        MGauge::Load::Par gaugePar;
        gaugePar.file = configStem;
        application.createModule<MGauge::Load>("gauge", gaugePar);
    }
    // set fermion boundary conditions to be periodic space, antiperiodic time.
    std::string boundary = "1 1 1 -1";
    for (unsigned int i = 0; i < flavour.size(); ++i)
    {
        // actions
        MAction::DWF::Par actionPar;
        actionPar.gauge = "gauge";
        actionPar.Ls    = 16;
        actionPar.M5    = 1.8;
        actionPar.mass  = mass[i];
        actionPar.boundary = boundary;
        application.createModule<MAction::DWF>("DWF_" + flavour[i], actionPar);
        // solvers
        // RBPrecCG -> CG
        MSolver::RBPrecCG::Par solverPar;
        solverPar.action   = "DWF_" + flavour[i];
        solverPar.residual = 1.0e-8;
        application.createModule<MSolver::RBPrecCG>(solvers[i],
                                                    solverPar);
    }
    // Create noise propagators for loops.
    std::vector<std::string> noiseSrcs;
    std::vector<std::vector<std::string>> noiseRes;
    std::vector<std::vector<std::string>> noiseProps;
    if (n_noise > 0)
    {
        MSource::Z2::Par noisePar;
        noisePar.tA = 0;
        noisePar.tB = nt - 1;
        std::string loop_stem = "loop_";
        noiseRes.resize(flavour.size());
        noiseProps.resize(flavour.size());
        for (unsigned int nn = 0; nn < n_noise; ++nn)
        {
            std::string eta = INIT_INDEX("noise", nn);
            application.createModule<MSource::Z2>(eta, noisePar);
            noiseSrcs.push_back(eta);
            for (unsigned int f = 0; f < flavour.size(); ++f)
            {
                std::string loop_prop = INIT_INDEX(loop_stem + flavour[f], nn);
                std::string loop_res  = loop_prop + "_res";
                makePropagator(application, loop_res, eta, solvers[f]);
                makeLoop(application, loop_prop, eta, loop_res);
                noiseRes[f].push_back(loop_res);
                noiseProps[f].push_back(loop_prop);
            }
        }
    }
    // Translate rare kaon decay across specified timeslices.
    for (unsigned int i = 0; i < tKs.size(); ++i)
    {
        // Zero-momentum wall source propagators for kaon and pion.
        unsigned int tK     = tKs[i];
        unsigned int tpi    = (tK + dt_pi) % nt;
        std::string q_Kl_0  = INIT_INDEX("Q_l_0", tK);
        std::string q_pil_0 = INIT_INDEX("Q_l_0", tpi);
        MAKE_WALL_PROP(tK, q_Kl_0, solvers[light]);
        MAKE_WALL_PROP(tpi, q_pil_0, solvers[light]);
        // Wall sources for kaon and pion with momentum insertion. If either
        // p or k is zero, or p = k, re-use the existing name to avoid
        // duplicating a propagator.
        std::string q_Ks_k  = INIT_INDEX("Q_Ks_k", tK);
        std::string q_Ks_p  = INIT_INDEX((kmom == pmom) ? "Q_Ks_k" : "Q_Ks_p", tK);
        std::string q_pil_k = INIT_INDEX((kmom == ZERO_MOM) ? "Q_l_0" : "Q_l_k", tpi);
        std::string q_pil_p = INIT_INDEX((pmom == kmom) ? q_pil_k : ((pmom == ZERO_MOM) ? "Q_l_0" : "Q_l_p"), tpi);
        MAKE_3MOM_WALL_PROP(tK, kmom, q_Ks_k, solvers[strange]);
        MAKE_3MOM_WALL_PROP(tK, pmom, q_Ks_p, solvers[strange]);
        MAKE_3MOM_WALL_PROP(tpi, kmom, q_pil_k, solvers[light]);
        MAKE_3MOM_WALL_PROP(tpi, pmom, q_pil_p, solvers[light]);
        /***********************************************************************
         * CONTRACTIONS: pi and K 2pt contractions with mom = p, k.
         **********************************************************************/
        // Wall-Point
        std::string PW_K_k = INIT_INDEX("PW_K_k", tK);
        std::string PW_K_p = INIT_INDEX("PW_K_p", tK);
        std::string PW_pi_k = INIT_INDEX("PW_pi_k", tpi);
        std::string PW_pi_p = INIT_INDEX("PW_pi_p", tpi);
        mesonContraction(application, 2, q_Kl_0, q_Ks_k, PW_K_k, kmom);
        mesonContraction(application, 2, q_Kl_0, q_Ks_p, PW_K_p, pmom);
        mesonContraction(application, 2, q_pil_k, q_pil_0, PW_pi_k, kmom);
        mesonContraction(application, 2, q_pil_p, q_pil_0, PW_pi_p, pmom);
        // Wall-Wall, to be done - requires modification of meson module.
        /***********************************************************************
         * CONTRACTIONS: 3pt Weak Hamiltonian, C & W (non-Eye type) classes.
         **********************************************************************/
        std::string HW_CW_k = LABEL_3PT("HW_CW_k", tK, tpi);
        std::string HW_CW_p = LABEL_3PT("HW_CW_p", tK, tpi);
        weakContractionNonEye(application, 3, q_Kl_0, q_Ks_k, q_pil_k, q_pil_0, HW_CW_k);
        weakContractionNonEye(application, 3, q_Kl_0, q_Ks_p, q_pil_p, q_pil_0, HW_CW_p);
        /***********************************************************************
         * CONTRACTIONS: 3pt sd insertion.
         **********************************************************************/
        // Note: eventually will use wall sink smeared q_Kl_0 instead.
        std::string sd_k = LABEL_3PT("sd_k", tK, tpi);
        std::string sd_p = LABEL_3PT("sd_p", tK, tpi);
        gamma3ptContraction(application, 3, q_Kl_0, q_Ks_k, q_pil_k, sd_k);
        gamma3ptContraction(application, 3, q_Kl_0, q_Ks_p, q_pil_p, sd_p);
        for (unsigned int nn = 0; nn < n_noise; ++nn)
        {
            /*******************************************************************
             * CONTRACTIONS: 3pt Weak Hamiltonian, S and E (Eye type) classes.
             ******************************************************************/
            // Note: eventually will use wall sink smeared q_Kl_0 instead.
            for (unsigned int f = 0; f < flavour.size(); ++f)
            {
                if ((f != strange) || do_disconnected)
                {
                    std::string HW_SE_k = LABEL_3PT("HW_SE_k_" + flavour[f], tK, tpi);
                    std::string HW_SE_p = LABEL_3PT("HW_SE_p_" + flavour[f], tK, tpi);
                    std::string loop_q  = noiseProps[f][nn];
                    // Use the Eye-type labels defined above, not the non-Eye HW_CW_* ones.
                    weakContractionEye(application, 3, q_Kl_0, q_Ks_k, q_pil_k, loop_q, HW_SE_k);
                    weakContractionEye(application, 3, q_Kl_0, q_Ks_p, q_pil_p, loop_q, HW_SE_p);
                }
            }
        }
        // Perform separate contractions for each t_J position.
        for (unsigned int j = 0; j < tJs.size(); ++j)
        {
            // Sequential sources for current insertions. Local for now,
            // gamma_0 only.
            unsigned int tJ = (tJs[j] + tK) % nt;
            MSource::SeqGamma::Par seqPar;
            std::string q_KlCl_q   = LABEL_3PT("Q_KlCl_q", tK, tJ);
            std::string q_KsCs_mq  = LABEL_3PT("Q_KsCs_mq", tK, tJ);
            std::string q_pilCl_q  = LABEL_3PT("Q_pilCl_q", tpi, tJ);
            std::string q_pilCl_mq = LABEL_3PT("Q_pilCl_mq", tpi, tJ);
            MAKE_SEQUENTIAL_PROP(tJ, q_Kl_0, qmom, q_KlCl_q, solvers[light]);
            MAKE_SEQUENTIAL_PROP(tJ, q_Ks_k, mqmom, q_KsCs_mq, solvers[strange]);
            MAKE_SEQUENTIAL_PROP(tJ, q_pil_p, qmom, q_pilCl_q, solvers[light]);
            MAKE_SEQUENTIAL_PROP(tJ, q_pil_0, mqmom, q_pilCl_mq, solvers[light]);
            /*******************************************************************
             * CONTRACTIONS: pi and K 3pt contractions with current insertion.
             ******************************************************************/
            // Wall-Point
            std::string C_PW_Kl   = LABEL_3PT("C_PW_Kl", tK, tJ);
            std::string C_PW_Ksb  = LABEL_3PT("C_PW_Ksb", tK, tJ);
            std::string C_PW_pilb = LABEL_3PT("C_PW_pilb", tK, tJ);
            std::string C_PW_pil  = LABEL_3PT("C_PW_pil", tK, tJ);
            mesonContraction(application, 3, q_KlCl_q, q_Ks_k, C_PW_Kl, pmom);
            mesonContraction(application, 3, q_Kl_0, q_KsCs_mq, C_PW_Ksb, pmom);
            mesonContraction(application, 3, q_pil_0, q_pilCl_q, C_PW_pilb, kmom);
            mesonContraction(application, 3, q_pilCl_mq, q_pil_p, C_PW_pil, kmom);
            // Wall-Wall, to be done.
            /*******************************************************************
             * CONTRACTIONS: 4pt contractions, C & W classes.
             ******************************************************************/
            std::string CW_Kl   = LABEL_4PT("CW_Kl", tK, tJ, tpi);
            std::string CW_Ksb  = LABEL_4PT("CW_Ksb", tK, tJ, tpi);
            std::string CW_pilb = LABEL_4PT("CW_pilb", tK, tJ, tpi);
            std::string CW_pil  = LABEL_4PT("CW_pil", tK, tJ, tpi);
            weakContractionNonEye(application, 4, q_KlCl_q, q_Ks_k, q_pil_p, q_pil_0, CW_Kl);
            weakContractionNonEye(application, 4, q_Kl_0, q_KsCs_mq, q_pil_p, q_pil_0, CW_Ksb);
            weakContractionNonEye(application, 4, q_Kl_0, q_Ks_k, q_pilCl_q, q_pil_0, CW_pilb);
            weakContractionNonEye(application, 4, q_Kl_0, q_Ks_k, q_pil_p, q_pilCl_mq, CW_pil);
            /*******************************************************************
             * CONTRACTIONS: 4pt contractions, sd insertions.
             ******************************************************************/
            // Note: eventually will use wall sink smeared q_Kl_0/q_KlCl_q instead.
            std::string sd_Kl   = LABEL_4PT("sd_Kl", tK, tJ, tpi);
            std::string sd_Ksb  = LABEL_4PT("sd_Ksb", tK, tJ, tpi);
            std::string sd_pilb = LABEL_4PT("sd_pilb", tK, tJ, tpi);
            gamma3ptContraction(application, 4, q_KlCl_q, q_Ks_k, q_pil_p, sd_Kl);
            gamma3ptContraction(application, 4, q_Kl_0, q_KsCs_mq, q_pil_p, sd_Ksb);
            gamma3ptContraction(application, 4, q_Kl_0, q_Ks_k, q_pilCl_q, sd_pilb);
            // Sequential sources for each noise propagator.
            for (unsigned int nn = 0; nn < n_noise; ++nn)
            {
                std::string loop_stem = "loop_";
                // Contraction required for each quark flavour - alternatively
                // drop the strange loop if not performing disconnected
                // contractions or neglecting H_W operators Q_3 -> Q_10.
                for (unsigned int f = 0; f < flavour.size(); ++f)
                {
                    if ((f != strange) || do_disconnected)
                    {
                        std::string eta      = noiseSrcs[nn];
                        std::string loop_q   = noiseProps[f][nn];
                        std::string loop_qCq = LABEL_3PT(loop_stem + flavour[f], tJ, nn);
                        std::string loop_qCq_res = loop_qCq + "_res";
                        MAKE_SEQUENTIAL_PROP(tJ, noiseRes[f][nn], qmom, 
                                             loop_qCq_res, solvers[f]);
                        makeLoop(application, loop_qCq, eta, loop_qCq_res);
                        /*******************************************************
                         * CONTRACTIONS: 4pt contractions, S & E classes.
                         ******************************************************/
                        // Note: eventually will use wall sink smeared q_Kl_0/q_KlCl_q instead.
                        std::string SE_Kl   = LABEL_4PT_NOISE("SE_Kl", tK, tJ, tpi, nn);
                        std::string SE_Ksb  = LABEL_4PT_NOISE("SE_Ksb", tK, tJ, tpi, nn);
                        std::string SE_pilb = LABEL_4PT_NOISE("SE_pilb", tK, tJ, tpi, nn);
                        std::string SE_loop = LABEL_4PT_NOISE("SE_loop", tK, tJ, tpi, nn);
                        weakContractionEye(application, 4, q_KlCl_q, q_Ks_k, q_pil_p, loop_q, SE_Kl);
                        weakContractionEye(application, 4, q_Kl_0, q_KsCs_mq, q_pil_p, loop_q, SE_Ksb);
                        weakContractionEye(application, 4, q_Kl_0, q_Ks_k, q_pilCl_q, loop_q, SE_pilb);
                        weakContractionEye(application, 4, q_Kl_0, q_Ks_k, q_pil_p, loop_qCq, SE_loop);
                        /*******************************************************
                         * CONTRACTIONS: 4pt contractions, pi0 disconnected 
                         * loop.
                         ******************************************************/
                        std::string disc0 = LABEL_4PT_NOISE("disc0", tK, tJ, tpi, nn);
                        disc0Contraction(application, q_Kl_0, q_Ks_k, q_pilCl_q, loop_q, disc0);
                        /*******************************************************
                         * CONTRACTIONS: Disconnected loop.
                         ******************************************************/
                        std::string discLoop = "disc_" + loop_qCq;
                        discLoopContraction(application, loop_qCq, discLoop);
                    }
                }
            }
        }
    }
    // execution
    std::string par_file_name = "rarekaon_000_100_tK0_tpi16_tJ8_noloop_mc0.2.xml";
    application.saveParameterFile(par_file_name);
    application.run();
    // epilogue
    LOG(Message) << "Grid is finalizing now" << std::endl;
    Grid_finalize();
    return EXIT_SUCCESS;
 }
@@ -63,6 +63,10 @@ int main(int argc, char *argv[])
    MSource::Point::Par ptPar;
    ptPar.position = "0 0 0 0";
    application.createModule<MSource::Point>("pt", ptPar);
    // sink
    MSink::Point::Par sinkPar;
    sinkPar.mom = "0 0 0";
    application.createModule<MSink::ScalarPoint>("sink", sinkPar);
    // set fermion boundary conditions to be periodic space, antiperiodic time.
    std::string boundary = "1 1 1 -1";
@@ -86,31 +90,31 @@ int main(int argc, char *argv[])
                                                    solverPar);
        // propagators
-        Quark::Par quarkPar;
+        MFermion::GaugeProp::Par quarkPar;
        quarkPar.solver = "CG_" + flavour[i];
        quarkPar.source = "pt";
-        application.createModule<Quark>("Qpt_" + flavour[i], quarkPar);
+        application.createModule<MFermion::GaugeProp>("Qpt_" + flavour[i], quarkPar);
        quarkPar.source = "z2";
-        application.createModule<Quark>("QZ2_" + flavour[i], quarkPar);
+        application.createModule<MFermion::GaugeProp>("QZ2_" + flavour[i], quarkPar);
    }
    for (unsigned int i = 0; i < flavour.size(); ++i)
    for (unsigned int j = i; j < flavour.size(); ++j)
    {
        MContraction::Meson::Par mesPar;
-        mesPar.output = "mesons/pt_" + flavour[i] + flavour[j];
+        mesPar.output  = "mesons/pt_" + flavour[i] + flavour[j];
-        mesPar.q1     = "Qpt_" + flavour[i];
+        mesPar.q1      = "Qpt_" + flavour[i];
-        mesPar.q2     = "Qpt_" + flavour[j];
+        mesPar.q2      = "Qpt_" + flavour[j];
-        mesPar.gammas = "all";
+        mesPar.gammas  = "all";
-        mesPar.mom    = "0. 0. 0. 0.";
+        mesPar.sink    = "sink";
        application.createModule<MContraction::Meson>("meson_pt_"
                                                      + flavour[i] + flavour[j],
                                                      mesPar);
-        mesPar.output = "mesons/Z2_" + flavour[i] + flavour[j];
+        mesPar.output  = "mesons/Z2_" + flavour[i] + flavour[j];
-        mesPar.q1     = "QZ2_" + flavour[i];
+        mesPar.q1      = "QZ2_" + flavour[i];
-        mesPar.q2     = "QZ2_" + flavour[j];
+        mesPar.q2      = "QZ2_" + flavour[j];
-        mesPar.gammas = "all";
+        mesPar.gammas  = "all";
-        mesPar.mom    = "0. 0. 0. 0.";
+        mesPar.sink    = "sink";
        application.createModule<MContraction::Meson>("meson_Z2_"
                                                      + flavour[i] + flavour[j],
                                                      mesPar);
@@ -0,0 +1,193 @@
 /*************************************************************************************
 Grid physics library, www.github.com/paboyle/Grid
 Source file: ./tests/Test_hmc_WilsonFermionGauge.cc
 Copyright (C) 2016
 Author: Guido Cossu <guido.cossu@ed.ac.uk>
 This program is free software; you can redistribute it and/or modify
 it under the terms of the GNU General Public License as published by
 the Free Software Foundation; either version 2 of the License, or
 (at your option) any later version.
 This program is distributed in the hope that it will be useful,
 but WITHOUT ANY WARRANTY; without even the implied warranty of
 MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 GNU General Public License for more details.
 You should have received a copy of the GNU General Public License along
 with this program; if not, write to the Free Software Foundation, Inc.,
 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
 See the full license in the file "LICENSE" in the top level distribution directory
 *************************************************************************************/
 /*  END LEGAL */
 #include <Grid/Grid.h>
 namespace Grid {
 class ScalarActionParameters : Serializable {
 public:
  GRID_SERIALIZABLE_CLASS_MEMBERS(ScalarActionParameters,
    double, mass_squared,
    double, lambda);
  template <class ReaderClass>
  ScalarActionParameters(Reader<ReaderClass>& Reader){
    read(Reader, "ScalarAction", *this);
  }
 };
 }
 using namespace Grid;
 using namespace Grid::QCD;
 template <class Impl>
 class MagMeas : public HmcObservable<typename Impl::Field> {
 public:
  typedef typename Impl::Field Field;
  typedef typename Impl::Simd::scalar_type Trace;
  void TrajectoryComplete(int traj,
                          Field &U,
                          GridSerialRNG &sRNG,
                          GridParallelRNG &pRNG) {
    int def_prec = std::cout.precision();
    std::cout << std::setprecision(std::numeric_limits<Real>::digits10 + 1);
    std::cout << GridLogMessage
              << "m= " << TensorRemove(trace(sum(U))) << std::endl;
    std::cout << GridLogMessage
              << "m^2= " << TensorRemove(trace(sum(U)*sum(U))) << std::endl;
    std::cout << GridLogMessage
              << "phi^2= " << TensorRemove(sum(trace(U*U))) << std::endl;
    std::cout.precision(def_prec);
  }
 private:
 };
 template <class Impl>
 class MagMod: public ObservableModule<MagMeas<Impl>, NoParameters>{
  typedef ObservableModule<MagMeas<Impl>, NoParameters> ObsBase;
  using ObsBase::ObsBase; // for constructors
  // acquire resource
  virtual void initialize(){
    this->ObservablePtr.reset(new MagMeas<Impl>());
  }
 public:
  MagMod(): ObsBase(NoParameters()){}
 };
 int main(int argc, char **argv) {
  typedef Grid::JSONReader       Serialiser;
  Grid_init(&argc, &argv);
  int threads = GridThread::GetThreads();
  // here make a routine to print all the relevant information on the run
  std::cout << GridLogMessage << "Grid is setup to use " << threads << " threads" << std::endl;
  // Typedefs to simplify notation
  constexpr int Ncolours    = 2;
  constexpr int Ndimensions = 3;
  typedef ScalarNxNAdjGenericHMCRunner<Ncolours> HMCWrapper;  // Uses the default minimum norm, real scalar fields
  typedef ScalarAdjActionR<Ncolours, Ndimensions> ScalarAction;
  //::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
  HMCWrapper TheHMC;
  TheHMC.ReadCommandLine(argc, argv);
  if (TheHMC.ParameterFile.empty()){
    std::cout << "Input file not specified. "
              << "Use the --ParameterFile option on the command line.\nAborting."
              << std::endl;
    exit(1);
  }
  Serialiser Reader(TheHMC.ParameterFile);
  // Grid from the command line
  GridModule ScalarGrid;
  if (GridDefaultLatt().size() != Ndimensions){
    std::cout << "Incorrect dimension of the grid.\nExpected dim=" << Ndimensions << std::endl;
    exit(1);
  }
  if (GridDefaultMpi().size() != Ndimensions){
    std::cout << "Incorrect dimension of the MPI grid.\nExpected dim=" << Ndimensions << std::endl;
    exit(1);
  }
  ScalarGrid.set_full(new GridCartesian(GridDefaultLatt(),GridDefaultSimd(Ndimensions, vComplex::Nsimd()),GridDefaultMpi()));
  ScalarGrid.set_rb(new GridRedBlackCartesian(ScalarGrid.get_full()));
  TheHMC.Resources.AddGrid("scalar", ScalarGrid);
  std::cout << "Lattice size : " << GridDefaultLatt() << std::endl;
  // Checkpointer definition
  CheckpointerParameters CPparams(Reader);
  TheHMC.Resources.LoadBinaryCheckpointer(CPparams);
  RNGModuleParameters RNGpar(Reader);
  TheHMC.Resources.SetRNGSeeds(RNGpar);
  // Construct observables
  typedef MagMod<HMCWrapper::ImplPolicy> MagObs;
  TheHMC.Resources.AddObservable<MagObs>();
  /////////////////////////////////////////////////////////////
  // Collect actions, here use more encapsulation
  // Scalar action in adjoint representation
  ScalarActionParameters SPar(Reader);
  ScalarAction Saction(SPar.mass_squared, SPar.lambda);
  // Collect actions
  ActionLevel<ScalarAction::Field, ScalarNxNMatrixFields<Ncolours>> Level1(1);
  Level1.push_back(&Saction);
  TheHMC.TheAction.push_back(Level1);
  /////////////////////////////////////////////////////////////
  TheHMC.Parameters.initialize(Reader);
  TheHMC.Run();
  Grid_finalize();
 }  // main
 /* Examples for input files
 JSON
 {
    "Checkpointer": {
    "config_prefix": "ckpoint_scalar_lat",
    "rng_prefix": "ckpoint_scalar_rng",
    "saveInterval": 1,
    "format": "IEEE64BIG"
    },
    "RandomNumberGenerator": {
    "serial_seeds": "1 2 3 4 6",
    "parallel_seeds": "6 7 8 9 11"
    },
    "ScalarAction":{
      "mass_squared": 0.5,
      "lambda": 0.1
    },
    "HMC":{
    "StartTrajectory": 0,
    "Trajectories": 100,
    "MetropolisTest": true,
    "NoMetropolisUntil": 10,
    "StartingType": "HotStart",
    "MD":{
        "name": "MinimumNorm2",
        "MDsteps": 15,
        "trajL": 2.0
    }
    }
 }
 XML example not provided yet
 */
@@ -516,7 +516,7 @@ int main (int argc, char ** argv)
  LatticeColourMatrix U(UGrid);
  LatticeColourMatrix zz(UGrid);
-  NerscField header;
+  FieldMetaData header;
  std::string file("./ckpoint_lat.4000");
  NerscIO::readConfiguration(Umu,header,file);