MultiRHS solver improvements with slice operations moved into lattice and sped up.

Block solver requires a lot of performance work.
MultiRHS is working and I am starting to optimise it. Block does not work, although I thought it already did; puzzled.
2026-07-30 15:33:29 +01:00 · 2017-04-18 10:51:55 +01:00 · 2017-04-17 10:50:19 +01:00 · 2017-04-16 23:40:00 +01:00 · 2017-04-15 12:27:28 +01:00 · 2017-04-15 10:57:21 +01:00
93 changed files with 3100 additions and 1276 deletions
@@ -7,7 +7,7 @@ cache:
 matrix:
  include:
    - os:        osx
-      osx_image: xcode7.2
+      osx_image: xcode8.3
      compiler: clang
    - compiler: gcc
      addons:
@@ -73,8 +73,6 @@ before_install:
    - if [[ "$TRAVIS_OS_NAME" == "linux" ]] && [[ "$CC" == "clang" ]]; then export LD_LIBRARY_PATH="${GRIDDIR}/clang/lib:${LD_LIBRARY_PATH}"; fi
    - if [[ "$TRAVIS_OS_NAME" == "osx" ]]; then brew update; fi
    - if [[ "$TRAVIS_OS_NAME" == "osx" ]]; then brew install libmpc; fi
-    - if [[ "$TRAVIS_OS_NAME" == "osx" ]]; then brew install openmpi; fi
-    - if [[ "$TRAVIS_OS_NAME" == "osx" ]] && [[ "$CC" == "gcc" ]]; then brew install gcc5; fi
    
 install:
    - export CC=$CC$VERSION
@@ -92,15 +90,14 @@ script:
    - cd build
    - ../configure --enable-precision=single --enable-simd=SSE4 --enable-comms=none
    - make -j4 
-    - ./benchmarks/Benchmark_dwf --threads 1
+    - ./benchmarks/Benchmark_dwf --threads 1 --debug-signals
    - echo make clean
    - ../configure --enable-precision=double --enable-simd=SSE4 --enable-comms=none
    - make -j4
-    - ./benchmarks/Benchmark_dwf --threads 1
+    - ./benchmarks/Benchmark_dwf --threads 1 --debug-signals
    - echo make clean
-    - if [[ "$TRAVIS_OS_NAME" == "linux" ]]; then export CXXFLAGS='-DMPI_UINT32_T=MPI_UNSIGNED -DMPI_UINT64_T=MPI_UNSIGNED_LONG'; fi
-    - ../configure --enable-precision=single --enable-simd=SSE4 --enable-comms=mpi-auto
-    - make -j4
+    - if [[ "$TRAVIS_OS_NAME" == "linux" ]]; then ../configure --enable-precision=single --enable-simd=SSE4 --enable-comms=mpi-auto CXXFLAGS='-DMPI_UINT32_T=MPI_UNSIGNED -DMPI_UINT64_T=MPI_UNSIGNED_LONG'; fi
+    - if [[ "$TRAVIS_OS_NAME" == "linux" ]]; then make -j4; fi
    - if [[ "$TRAVIS_OS_NAME" == "linux" ]]; then mpirun.openmpi -n 2 ./benchmarks/Benchmark_dwf --threads 1 --mpi 2.1.1.1; fi


@@ -1,6 +1,28 @@
 TODO:
 ---------------

+Peter's work list:
+
+-- Remove DenseVector, DenseMatrix; Use Eigen instead. <-- started
+-- Merge high precision reduction into develop         <-- done
+-- Precision conversion and sort out localConvert      <-- 
+-- Physical propagator interface
+
+-- multiRHS DWF; benchmark on Cori/BNL for comms elimination
+   -- slice* linalg routines for multiRHS, BlockCG        <-- started
+
+-- Profile CG, BlockCG, etc... Flop count/rate
+-- Binary I/O speed up & x-strips
+-- Half-precision comms                                <-- started
+-- GaugeFix into central location
+-- FFTfix in sensible place
+-- Multigrid Wilson and DWF, compare to other Multigrid implementations
+-- quaternions                 -- Might not need
+
+
+-- Conserved currents
+
+-----
 * Forces; the UdSdU  term in gauge force term is half of what I think it should
  be. This is a consequence of taking ONLY the first term in:

@@ -21,16 +43,8 @@ TODO:
  This means we must double the force in the Test_xxx_force routines, and is the origin of the factor of two.
  This 2x is applied by hand in the fermion routines and in the Test_rect_force routine.

-
-Policies:
-
-* Link smearing/boundary conds; Policy class based implementation ; framework more in place
-
 * Support different boundary conditions (finite temp, chem. potential ... )

-* Support different fermion representations? 
-  - contained entirely within the integrator presently
-
 - Sign of force term.

 - Reversibility test.
@@ -41,11 +55,6 @@ Policies:

 - Audit oIndex usage for cb behaviour

- Rectangle gauge actions.
-  Iwasaki,
-  Symanzik,
-  ... etc...
-
 - Prepare multigrid for HMC. - Alternate setup schemes.

 - Support for ILDG --- ugly, not done
@@ -55,9 +64,11 @@ Policies:
 - FFTnD ?

 - Gparity; hand opt use template specialisation elegance to enable the optimised paths ?
+
 - Gparity force term; Gparity (R)HMC.
- Random number state save restore
+
 - Mobius implementation clean up to remove #if 0 stale code sequences
+
 - CG -- profile carefully, kernel fusion, whole CG performance measurements.

 ================================================================
@@ -90,6 +101,7 @@ Insert/Extract
 Not sure of status of this -- reverify. Things are working nicely now though.

 * Make the Tensor types and Complex etc... play more nicely.
+
  - TensorRemove is a hack, come up with a long term rationalised approach to Complex vs. Scalar<Scalar<Scalar<Complex > > >
    QDP forces use of "toDouble" to get back to non tensor scalar. This role is presently taken TensorRemove, but I
    want to introduce a syntax that does not require this.
@@ -112,6 +124,8 @@ Not sure of status of this -- reverify. Things are working nicely now though.
 RECENT
 ---------------

+  - Support different fermion representations? -- DONE
+  - contained entirely within the integrator presently
  - Clean up HMC                                                             -- DONE
  - LorentzScalar<GaugeField> gets Gauge link type (cleaner).                -- DONE
  - Simplified the integrators a bit.                                        -- DONE
@@ -123,6 +137,26 @@ RECENT
  - Parallel io improvements                                  -- DONE
  - Plaquette and link trace checks into nersc reader from the Grid_nersc_io.cc test. -- DONE

+
+DONE:
+- MultiArray -- MultiRHS done
+- ConjugateGradientMultiShift -- DONE
+- MCR                         -- DONE
+- Remez -- Mike or Boost?     -- DONE
+- Proto (ET)                  -- DONE
+- uBlas                       -- DONE ; Eigen
+- Potentially Useful Boost libraries -- DONE ; Eigen
+- Aligned allocator; memory pool -- DONE
+- Multiprecision              -- DONE
+- Serialization               -- DONE
+- Regex -- Not needed
+- Tokenize -- Why?
+
+- Random number state save restore -- DONE
+- Rectangle gauge actions. -- DONE
+  Iwasaki,
+  Symanzik,
+  ... etc...
 Done: Cayley, Partial , ContFrac force terms.

 DONE
@@ -207,6 +241,7 @@ Done
 FUNCTIONALITY: it pleases me to keep track of things I have done (keeps me arguably sane)
 ======================================================================================================

+* Link smearing/boundary conds; Policy class based implementation ; framework more in place -- DONE
 * Command line args for geometry, simd, etc. layout. Is it necessary to have -- DONE
  user pass these? Is this a QCD specific?

@@ -66,7 +66,8 @@ int main (int argc, char ** argv)

    Vec tsum; tsum = zero;

-    GridParallelRNG          pRNG(&Grid);      pRNG.SeedRandomDevice();
+    GridParallelRNG          pRNG(&Grid);      
+    pRNG.SeedFixedIntegers(std::vector<int>({56,17,89,101}));

    std::vector<double> stop(threads);
    Vector<Vec> sum(threads);
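The hunk above replaces `SeedRandomDevice()` with `SeedFixedIntegers(...)`, which presumably makes benchmark runs repeatable. The sketch below illustrates that idea with the C++ standard library only; `drawFixed` is a hypothetical helper, and `std::seed_seq` with `std::mt19937` is a stand-in, not Grid's `GridParallelRNG`.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical helper (not part of Grid): seed an engine from a fixed list of
// integers and draw n raw values from it.
std::vector<std::mt19937::result_type> drawFixed(const std::vector<int> &seeds, int n) {
  assert(n >= 0);
  std::seed_seq seq(seeds.begin(), seeds.end());  // deterministic seed expansion
  std::mt19937 engine(seq);
  std::vector<std::mt19937::result_type> out(static_cast<std::size_t>(n));
  for (auto &v : out) v = engine();               // same seeds give the same stream
  return out;
}
```

Calling `drawFixed` twice with the same seed list returns identical vectors, whereas an engine seeded from `std::random_device` would normally differ from run to run.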
@@ -65,7 +65,7 @@ int main (int argc, char ** argv)

      uint64_t Nloop=NLOOP;

-      //      GridParallelRNG          pRNG(&Grid);      pRNG.SeedRandomDevice();
+      //      GridParallelRNG          pRNG(&Grid);      pRNG.SeedFixedIntegers(std::vector<int>({45,12,81,9}));

      LatticeVec z(&Grid); //random(pRNG,z);
      LatticeVec x(&Grid); //random(pRNG,x);
@@ -100,7 +100,7 @@ int main (int argc, char ** argv)
      int vol = latt_size[0]*latt_size[1]*latt_size[2]*latt_size[3];
      GridCartesian     Grid(latt_size,simd_layout,mpi_layout);

-      //      GridParallelRNG          pRNG(&Grid);      pRNG.SeedRandomDevice();
+      //      GridParallelRNG          pRNG(&Grid);      pRNG.SeedFixedIntegers(std::vector<int>({45,12,81,9}));

      LatticeVec z(&Grid); //random(pRNG,z);
      LatticeVec x(&Grid); //random(pRNG,x);
@@ -138,7 +138,7 @@ int main (int argc, char ** argv)

      GridCartesian     Grid(latt_size,simd_layout,mpi_layout);

-      //      GridParallelRNG          pRNG(&Grid);      pRNG.SeedRandomDevice();
+      //      GridParallelRNG          pRNG(&Grid);      pRNG.SeedFixedIntegers(std::vector<int>({45,12,81,9}));

      LatticeVec z(&Grid); //random(pRNG,z);
      LatticeVec x(&Grid); //random(pRNG,x);
@@ -173,7 +173,7 @@ int main (int argc, char ** argv)
      uint64_t Nloop=NLOOP;
      GridCartesian     Grid(latt_size,simd_layout,mpi_layout);

-      //      GridParallelRNG          pRNG(&Grid);      pRNG.SeedRandomDevice();
+      //      GridParallelRNG          pRNG(&Grid);      pRNG.SeedFixedIntegers(std::vector<int>({45,12,81,9}));
      LatticeVec z(&Grid); //random(pRNG,z);
      LatticeVec x(&Grid); //random(pRNG,x);
      LatticeVec y(&Grid); //random(pRNG,y);
@@ -51,7 +51,7 @@ int main (int argc, char ** argv)
  std::vector<int> seeds({1,2,3,4});
  GridParallelRNG          pRNG(&Grid);
  pRNG.SeedFixedIntegers(seeds);
-  //  pRNG.SeedRandomDevice();
+  //  pRNG.SeedFixedIntegers(std::vector<int>({45,12,81,9}));

  typedef typename ImprovedStaggeredFermionR::FermionField FermionField; 
  typename ImprovedStaggeredFermionR::ImplParams params; 
@@ -55,7 +55,7 @@ int main (int argc, char ** argv)
      std::vector<int> latt_size  ({lat*mpi_layout[0],lat*mpi_layout[1],lat*mpi_layout[2],lat*mpi_layout[3]});
      int vol = latt_size[0]*latt_size[1]*latt_size[2]*latt_size[3];
      GridCartesian     Grid(latt_size,simd_layout,mpi_layout);
-      //      GridParallelRNG          pRNG(&Grid);      pRNG.SeedRandomDevice();
+      //      GridParallelRNG          pRNG(&Grid);      pRNG.SeedFixedIntegers(std::vector<int>({45,12,81,9}));

      LatticeColourMatrix z(&Grid);// random(pRNG,z);
      LatticeColourMatrix x(&Grid);// random(pRNG,x);
@@ -88,7 +88,7 @@ int main (int argc, char ** argv)
      int vol = latt_size[0]*latt_size[1]*latt_size[2]*latt_size[3];

      GridCartesian     Grid(latt_size,simd_layout,mpi_layout);
-      //      GridParallelRNG          pRNG(&Grid);      pRNG.SeedRandomDevice();
+      //      GridParallelRNG          pRNG(&Grid);      pRNG.SeedFixedIntegers(std::vector<int>({45,12,81,9}));

      LatticeColourMatrix z(&Grid); //random(pRNG,z);
      LatticeColourMatrix x(&Grid); //random(pRNG,x);
@@ -119,7 +119,7 @@ int main (int argc, char ** argv)
      int vol = latt_size[0]*latt_size[1]*latt_size[2]*latt_size[3];

      GridCartesian     Grid(latt_size,simd_layout,mpi_layout);
-      //      GridParallelRNG          pRNG(&Grid);      pRNG.SeedRandomDevice();
+      //      GridParallelRNG          pRNG(&Grid);      pRNG.SeedFixedIntegers(std::vector<int>({45,12,81,9}));

      LatticeColourMatrix z(&Grid); //random(pRNG,z);
      LatticeColourMatrix x(&Grid); //random(pRNG,x);
@@ -150,7 +150,7 @@ int main (int argc, char ** argv)
      int vol = latt_size[0]*latt_size[1]*latt_size[2]*latt_size[3];

      GridCartesian     Grid(latt_size,simd_layout,mpi_layout);
-      //      GridParallelRNG          pRNG(&Grid);      pRNG.SeedRandomDevice();
+      //      GridParallelRNG          pRNG(&Grid);      pRNG.SeedFixedIntegers(std::vector<int>({45,12,81,9}));

      LatticeColourMatrix z(&Grid); //random(pRNG,z);
      LatticeColourMatrix x(&Grid); //random(pRNG,x);
@@ -69,7 +69,7 @@ int main (int argc, char ** argv)
  std::vector<int> seeds({1,2,3,4});
  GridParallelRNG          pRNG(&Grid);
  pRNG.SeedFixedIntegers(seeds);
-  //  pRNG.SeedRandomDevice();
+  //  pRNG.SeedFixedIntegers(std::vector<int>({45,12,81,9}));

  LatticeFermion src   (&Grid); random(pRNG,src);
  LatticeFermion result(&Grid); result=zero;
@@ -83,6 +83,19 @@ case ${ac_LAPACK} in
        AC_DEFINE([USE_LAPACK],[1],[use LAPACK]);;
 esac

+############### FP16 conversions
+AC_ARG_ENABLE([fp16],
+    [AC_HELP_STRING([--enable-fp16=yes|no], [enable fp16 comms])], 
+    [ac_FP16=${enable_fp16}], [ac_FP16=no])
+case ${ac_FP16} in
+    no)
+        ;;
+    yes)
+        AC_DEFINE([USE_FP16],[1],[conversion to fp16]);;
+    *)
+	;;
+esac
+
 ############### MKL
 AC_ARG_ENABLE([mkl],
    [AC_HELP_STRING([--enable-mkl=yes|no|prefix], [enable Intel MKL for LAPACK & FFTW])],
@@ -179,16 +192,16 @@ case ${ax_cv_cxx_compiler_vendor} in
        SIMD_FLAGS='-msse4.2';;
      AVX)
        AC_DEFINE([AVX1],[1],[AVX intrinsics])
-        SIMD_FLAGS='-mavx';;
+        SIMD_FLAGS='-mavx -mf16c';;
      AVXFMA4)
        AC_DEFINE([AVXFMA4],[1],[AVX intrinsics with FMA4])
-        SIMD_FLAGS='-mavx -mfma4';;
+        SIMD_FLAGS='-mavx -mfma4 -mf16c';;
      AVXFMA)
        AC_DEFINE([AVXFMA],[1],[AVX intrinsics with FMA3])
-        SIMD_FLAGS='-mavx -mfma';;
+        SIMD_FLAGS='-mavx -mfma -mf16c';;
      AVX2)
        AC_DEFINE([AVX2],[1],[AVX2 intrinsics])
-        SIMD_FLAGS='-mavx2 -mfma';;
+        SIMD_FLAGS='-mavx2 -mfma -mf16c';;
      AVX512)
        AC_DEFINE([AVX512],[1],[AVX512 intrinsics])
        SIMD_FLAGS='-mavx512f -mavx512pf -mavx512er -mavx512cd';;
@@ -321,7 +334,7 @@ AM_CONDITIONAL(BUILD_COMMS_NONE,  [ test "${comms_type}X" == "noneX" ])
 ############### RNG selection
 AC_ARG_ENABLE([rng],[AC_HELP_STRING([--enable-rng=ranlux48|mt19937|sitmo],\
 	            [Select Random Number Generator to be used])],\
-	            [ac_RNG=${enable_rng}],[ac_RNG=ranlux48])
+	            [ac_RNG=${enable_rng}],[ac_RNG=sitmo])

 case ${ac_RNG} in
     ranlux48)
@@ -401,6 +414,7 @@ AC_CONFIG_FILES(tests/hadrons/Makefile)
 AC_CONFIG_FILES(tests/hmc/Makefile)
 AC_CONFIG_FILES(tests/solver/Makefile)
 AC_CONFIG_FILES(tests/qdpxx/Makefile)
+AC_CONFIG_FILES(tests/testu01/Makefile)
 AC_CONFIG_FILES(benchmarks/Makefile)
 AC_CONFIG_FILES(extras/Makefile)
 AC_CONFIG_FILES(extras/Hadrons/Makefile)
@@ -46,7 +46,7 @@ Author: Peter Boyle <paboyle@ph.ed.ac.uk>
 #include <Grid/algorithms/iterative/ConjugateGradientMixedPrec.h>

 // Lanczos support
-#include <Grid/algorithms/iterative/MatrixUtils.h>
+//#include <Grid/algorithms/iterative/MatrixUtils.h>
 #include <Grid/algorithms/iterative/ImplicitlyRestartedLanczos.h>
 #include <Grid/algorithms/CoarsenedMatrix.h>
 #include <Grid/algorithms/FFT.h>
@@ -425,7 +425,7 @@ namespace Grid {
 	A[p]=zero;
      }

-      GridParallelRNG  RNG(Grid()); RNG.SeedRandomDevice();
+      GridParallelRNG  RNG(Grid()); RNG.SeedFixedIntegers(std::vector<int>({55,72,19,17,34}));
      Lattice<iScalar<CComplex> > val(Grid()); random(RNG,val);

      Complex one(1.0);
@@ -0,0 +1,366 @@
+/*************************************************************************************
+
+Grid physics library, www.github.com/paboyle/Grid
+
+Source file: ./lib/algorithms/iterative/BlockConjugateGradient.h
+
+Copyright (C) 2017
+
+Author: Azusa Yamaguchi <ayamaguc@staffmail.ed.ac.uk>
+Author: Peter Boyle <paboyle@ph.ed.ac.uk>
+
+This program is free software; you can redistribute it and/or modify
+it under the terms of the GNU General Public License as published by
+the Free Software Foundation; either version 2 of the License, or
+(at your option) any later version.
+
+This program is distributed in the hope that it will be useful,
+but WITHOUT ANY WARRANTY; without even the implied warranty of
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+GNU General Public License for more details.
+
+You should have received a copy of the GNU General Public License along
+with this program; if not, write to the Free Software Foundation, Inc.,
+51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
+
+See the full license in the file "LICENSE" in the top level distribution
+directory
+*************************************************************************************/
+/*  END LEGAL */
+#ifndef GRID_BLOCK_CONJUGATE_GRADIENT_H
+#define GRID_BLOCK_CONJUGATE_GRADIENT_H
+
+
+namespace Grid {
+
+//////////////////////////////////////////////////////////////////////////
+// Block conjugate gradient. Dimension zero should be the block direction
+//////////////////////////////////////////////////////////////////////////
+template <class Field>
+class BlockConjugateGradient : public OperatorFunction<Field> {
+ public:
+
+  typedef typename Field::scalar_type scomplex;
+
+  const int blockDim = 0;
+
+  int Nblock;
+  bool ErrorOnNoConverge;  // throw an assert when the CG fails to converge.
+                           // Defaults true.
+  RealD Tolerance;
+  Integer MaxIterations;
+  Integer IterationsToComplete; //Number of iterations the CG took to finish. Filled in upon completion
+  
+  BlockConjugateGradient(RealD tol, Integer maxit, bool err_on_no_conv = true)
+    : Tolerance(tol),
+    MaxIterations(maxit),
+    ErrorOnNoConverge(err_on_no_conv){};
+
+void operator()(LinearOperatorBase<Field> &Linop, const Field &Src, Field &Psi) 
+{
+  int Orthog = 0; // First dimension is block dim
+  Nblock = Src._grid->_fdimensions[Orthog];
+
+  std::cout<<GridLogMessage<<" Block Conjugate Gradient : Orthog "<<Orthog<<" Nblock "<<Nblock<<std::endl;
+
+  Psi.checkerboard = Src.checkerboard;
+  conformable(Psi, Src);
+
+  Field P(Src);
+  Field AP(Src);
+  Field R(Src);
+  
+  Eigen::MatrixXcd m_pAp    = Eigen::MatrixXcd::Identity(Nblock,Nblock);
+  Eigen::MatrixXcd m_pAp_inv= Eigen::MatrixXcd::Identity(Nblock,Nblock);
+  Eigen::MatrixXcd m_rr     = Eigen::MatrixXcd::Zero(Nblock,Nblock);
+  Eigen::MatrixXcd m_rr_inv = Eigen::MatrixXcd::Zero(Nblock,Nblock);
+
+  Eigen::MatrixXcd m_alpha      = Eigen::MatrixXcd::Zero(Nblock,Nblock);
+  Eigen::MatrixXcd m_beta   = Eigen::MatrixXcd::Zero(Nblock,Nblock);
+
+  // Initial residual computation & set up
+  std::vector<RealD> residuals(Nblock);
+  std::vector<RealD> ssq(Nblock);
+
+  sliceNorm(ssq,Src,Orthog);
+  RealD sssum=0;
+  for(int b=0;b<Nblock;b++) sssum+=ssq[b];
+
+  sliceNorm(residuals,Src,Orthog);
+  for(int b=0;b<Nblock;b++){ assert(std::isnan(residuals[b])==0); }
+
+  sliceNorm(residuals,Psi,Orthog);
+  for(int b=0;b<Nblock;b++){ assert(std::isnan(residuals[b])==0); }
+
+  // Initial search dir is guess
+  Linop.HermOp(Psi, AP);
+  
+
+  /************************************************************************
+   * Block conjugate gradient (Stephen Pickles, thesis 1995, pp 71, O'Leary 1980)
+   ************************************************************************
+   * O'Leary : R = B - A X
+   * O'Leary : P = M R ; preconditioner M = 1
+   * O'Leary : alpha = PAP^{-1} RMR
+   * O'Leary : beta  = RMR^{-1}_old RMR_new
+   * O'Leary : X=X+Palpha
+   * O'Leary : R_new=R_old-AP alpha
+   * O'Leary : P=MR_new+P beta
+   */
+
+  R = Src - AP;  
+  P = R;
+  sliceInnerProductMatrix(m_rr,R,R,Orthog);
+
+  GridStopWatch sliceInnerTimer;
+  GridStopWatch sliceMaddTimer;
+  GridStopWatch MatrixTimer;
+  GridStopWatch SolverTimer;
+  SolverTimer.Start();
+
+  int k;
+  for (k = 1; k <= MaxIterations; k++) {
+
+    RealD rrsum=0;
+    for(int b=0;b<Nblock;b++) rrsum+=real(m_rr(b,b));
+
+    std::cout << GridLogIterative << "\titeration "<<k<<" rr_sum "<<rrsum<<" ssq_sum "<< sssum
+	      <<" / "<<std::sqrt(rrsum/sssum) <<std::endl;
+
+    MatrixTimer.Start();
+    Linop.HermOp(P, AP);
+    MatrixTimer.Stop();
+
+    // Alpha
+    sliceInnerTimer.Start();
+    sliceInnerProductMatrix(m_pAp,P,AP,Orthog);
+    sliceInnerTimer.Stop();
+    m_pAp_inv = m_pAp.inverse();
+    m_alpha   = m_pAp_inv * m_rr ;
+
+    // Psi, R update
+    sliceMaddTimer.Start();
+    sliceMaddMatrix(Psi,m_alpha, P,Psi,Orthog);     // add alpha *  P to psi
+    sliceMaddMatrix(R  ,m_alpha,AP,  R,Orthog,-1.0);// subtract alpha * AP from resid
+    sliceMaddTimer.Stop();
+
+    // Beta
+    m_rr_inv = m_rr.inverse();
+    sliceInnerTimer.Start();
+    sliceInnerProductMatrix(m_rr,R,R,Orthog);
+    sliceInnerTimer.Stop();
+    m_beta = m_rr_inv *m_rr;
+
+    // Search update
+    sliceMaddTimer.Start();
+    sliceMaddMatrix(AP,m_beta,P,R,Orthog);
+    sliceMaddTimer.Stop();
+    P= AP;
+
+    /*********************
+     * convergence monitor
+     *********************
+     */
+    RealD max_resid=0;
+    for(int b=0;b<Nblock;b++){
+      RealD rr = real(m_rr(b,b))/ssq[b];
+      if ( rr > max_resid ) max_resid = rr;
+    }
+    
+    if ( max_resid < Tolerance*Tolerance ) { 
+
+      SolverTimer.Stop();
+
+      std::cout << GridLogMessage<<"BlockCG converged in "<<k<<" iterations"<<std::endl;
+      for(int b=0;b<Nblock;b++){
+	std::cout << GridLogMessage<< "\t\tblock "<<b<<" resid "<< std::sqrt(real(m_rr(b,b))/ssq[b])<<std::endl;
+      }
+      std::cout << GridLogMessage<<"\tMax residual is "<<std::sqrt(max_resid)<<std::endl;
+
+      Linop.HermOp(Psi, AP);
+      AP = AP-Src;
+      std::cout << GridLogMessage <<"\tTrue residual is " << std::sqrt(norm2(AP)/norm2(Src)) <<std::endl;
+
+      std::cout << GridLogMessage << "Time Breakdown "<<std::endl;
+      std::cout << GridLogMessage << "\tElapsed    " << SolverTimer.Elapsed()     <<std::endl;
+      std::cout << GridLogMessage << "\tMatrix     " << MatrixTimer.Elapsed()     <<std::endl;
+      std::cout << GridLogMessage << "\tInnerProd  " << sliceInnerTimer.Elapsed() <<std::endl;
+      std::cout << GridLogMessage << "\tMaddMatrix " << sliceMaddTimer.Elapsed()  <<std::endl;
+	    
+      IterationsToComplete = k;
+      return;
+    }
+
+  }
+  std::cout << GridLogMessage << "BlockConjugateGradient did NOT converge" << std::endl;
+
+  if (ErrorOnNoConverge) assert(0);
+  IterationsToComplete = k;
+}
+};
+
+
+//////////////////////////////////////////////////////////////////////////
+// multiRHS conjugate gradient. Dimension zero should be the block direction
+//////////////////////////////////////////////////////////////////////////
+template <class Field>
+class MultiRHSConjugateGradient : public OperatorFunction<Field> {
+ public:
+
+  typedef typename Field::scalar_type scomplex;
+
+  const int blockDim = 0;
+
+  int Nblock;
+  bool ErrorOnNoConverge;  // throw an assert when the CG fails to converge.
+                           // Defaults true.
+  RealD Tolerance;
+  Integer MaxIterations;
+  Integer IterationsToComplete; //Number of iterations the CG took to finish. Filled in upon completion
+  
+   MultiRHSConjugateGradient(RealD tol, Integer maxit, bool err_on_no_conv = true)
+    : Tolerance(tol),
+    MaxIterations(maxit),
+    ErrorOnNoConverge(err_on_no_conv){};
+
+void operator()(LinearOperatorBase<Field> &Linop, const Field &Src, Field &Psi) 
+{
+  int Orthog = 0; // First dimension is block dim
+  Nblock = Src._grid->_fdimensions[Orthog];
+
+  std::cout<<GridLogMessage<<"MultiRHS Conjugate Gradient : Orthog "<<Orthog<<" Nblock "<<Nblock<<std::endl;
+
+  Psi.checkerboard = Src.checkerboard;
+  conformable(Psi, Src);
+
+  Field P(Src);
+  Field AP(Src);
+  Field R(Src);
+  
+  std::vector<ComplexD> v_pAp(Nblock);
+  std::vector<RealD> v_rr (Nblock);
+  std::vector<RealD> v_rr_inv(Nblock);
+  std::vector<RealD> v_alpha(Nblock);
+  std::vector<RealD> v_beta(Nblock);
+
+  // Initial residual computation & set up
+  std::vector<RealD> residuals(Nblock);
+  std::vector<RealD> ssq(Nblock);
+
+  sliceNorm(ssq,Src,Orthog);
+  RealD sssum=0;
+  for(int b=0;b<Nblock;b++) sssum+=ssq[b];
+
+  sliceNorm(residuals,Src,Orthog);
+  for(int b=0;b<Nblock;b++){ assert(std::isnan(residuals[b])==0); }
+
+  sliceNorm(residuals,Psi,Orthog);
+  for(int b=0;b<Nblock;b++){ assert(std::isnan(residuals[b])==0); }
+
+  // Initial search dir is guess
+  Linop.HermOp(Psi, AP);
+
+  R = Src - AP;  
+  P = R;
+  sliceNorm(v_rr,R,Orthog);
+
+  GridStopWatch sliceInnerTimer;
+  GridStopWatch sliceMaddTimer;
+  GridStopWatch sliceNormTimer;
+  GridStopWatch MatrixTimer;
+  GridStopWatch SolverTimer;
+
+  SolverTimer.Start();
+  int k;
+  for (k = 1; k <= MaxIterations; k++) {
+
+    RealD rrsum=0;
+    for(int b=0;b<Nblock;b++) rrsum+=real(v_rr[b]);
+
+    std::cout << GridLogIterative << "\titeration "<<k<<" rr_sum "<<rrsum<<" ssq_sum "<< sssum
+	      <<" / "<<std::sqrt(rrsum/sssum) <<std::endl;
+
+    MatrixTimer.Start();
+    Linop.HermOp(P, AP);
+    MatrixTimer.Stop();
+
+    // Alpha
+    //    sliceInnerProductVectorTest(v_pAp_test,P,AP,Orthog);
+    sliceInnerTimer.Start();
+    sliceInnerProductVector(v_pAp,P,AP,Orthog);
+    sliceInnerTimer.Stop();
+    for(int b=0;b<Nblock;b++){
+      //      std::cout << " "<< v_pAp[b]<<" "<< v_pAp_test[b]<<std::endl;
+      v_alpha[b] = v_rr[b]/real(v_pAp[b]);
+    }
+
+    // Psi, R update
+    sliceMaddTimer.Start();
+    sliceMaddVector(Psi,v_alpha, P,Psi,Orthog);     // add alpha *  P to psi
+    sliceMaddVector(R  ,v_alpha,AP,  R,Orthog,-1.0);// subtract alpha * AP from resid
+    sliceMaddTimer.Stop();
+
+    // Beta
+    for(int b=0;b<Nblock;b++){
+      v_rr_inv[b] = 1.0/v_rr[b];
+    }
+    sliceNormTimer.Start();
+    sliceNorm(v_rr,R,Orthog);
+    sliceNormTimer.Stop();
+    for(int b=0;b<Nblock;b++){
+      v_beta[b] = v_rr_inv[b] *v_rr[b];
+    }
+
+    // Search update
+    sliceMaddTimer.Start();
+    sliceMaddVector(P,v_beta,P,R,Orthog);
+    sliceMaddTimer.Stop();
+
+    /*********************
+     * convergence monitor
+     *********************
+     */
+    RealD max_resid=0;
+    for(int b=0;b<Nblock;b++){
+      RealD rr = v_rr[b]/ssq[b];
+      if ( rr > max_resid ) max_resid = rr;
+    }
+    
+    if ( max_resid < Tolerance*Tolerance ) { 
+
+      SolverTimer.Stop();
+
+      std::cout << GridLogMessage<<"MultiRHS solver converged in " <<k<<" iterations"<<std::endl;
+      for(int b=0;b<Nblock;b++){
+	std::cout << GridLogMessage<< "\t\tBlock "<<b<<" resid "<< std::sqrt(v_rr[b]/ssq[b])<<std::endl;
+      }
+      std::cout << GridLogMessage<<"\tMax residual is "<<std::sqrt(max_resid)<<std::endl;
+
+      Linop.HermOp(Psi, AP);
+      AP = AP-Src;
+      std::cout <<GridLogMessage << "\tTrue residual is " << std::sqrt(norm2(AP)/norm2(Src)) <<std::endl;
+
+      std::cout << GridLogMessage << "Time Breakdown "<<std::endl;
+      std::cout << GridLogMessage << "\tElapsed    " << SolverTimer.Elapsed()     <<std::endl;
+      std::cout << GridLogMessage << "\tMatrix     " << MatrixTimer.Elapsed()     <<std::endl;
+      std::cout << GridLogMessage << "\tInnerProd  " << sliceInnerTimer.Elapsed() <<std::endl;
+      std::cout << GridLogMessage << "\tNorm       " << sliceNormTimer.Elapsed() <<std::endl;
+      std::cout << GridLogMessage << "\tMaddMatrix " << sliceMaddTimer.Elapsed()  <<std::endl;
+
+
+      IterationsToComplete = k;
+      return;
+    }
+
+  }
+  std::cout << GridLogMessage << "MultiRHSConjugateGradient did NOT converge" << std::endl;
+
+  if (ErrorOnNoConverge) assert(0);
+  IterationsToComplete = k;
+}
+};
+
+
+
+}
+#endif
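The `MultiRHSConjugateGradient` class added above runs one independent CG recurrence per slice, sharing the matrix application and stopping when the worst relative residual drops below the tolerance. A minimal stand-alone sketch of that structure follows, assuming a small dense real symmetric positive-definite matrix in place of `Linop.HermOp` and plain `std::vector` columns in place of lattice slices. The names `applyMatrix`, `dot` and `multiRhsCG` are hypothetical and not part of Grid. The skip for an exactly zero residual is an addition of this sketch, not something the class above does.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;
using Mat = std::vector<Vec>;  // dense, row-major; stand-in for Linop.HermOp

Vec applyMatrix(const Mat &A, const Vec &x) {
  Vec y(A.size(), 0.0);
  for (std::size_t i = 0; i < A.size(); ++i)
    for (std::size_t j = 0; j < x.size(); ++j) y[i] += A[i][j] * x[j];
  return y;
}

double dot(const Vec &a, const Vec &b) {
  double s = 0.0;
  for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
  return s;
}

// Solve A psi[b] = src[b] for every b in lockstep, with a separate alpha and
// beta per right-hand side. Returns the iteration count, or -1 on failure.
int multiRhsCG(const Mat &A, const std::vector<Vec> &src,
               std::vector<Vec> &psi, double tol, int maxit) {
  const std::size_t nb = src.size();
  std::vector<Vec> r(nb), p(nb);
  std::vector<double> rr(nb), ssq(nb);
  for (std::size_t b = 0; b < nb; ++b) {
    ssq[b] = dot(src[b], src[b]);
    assert(ssq[b] > 0.0);                    // relative residual needs |src| > 0
    const Vec ap = applyMatrix(A, psi[b]);   // residual of the initial guess
    r[b] = src[b];
    for (std::size_t i = 0; i < r[b].size(); ++i) r[b][i] -= ap[i];
    p[b] = r[b];
    rr[b] = dot(r[b], r[b]);
  }
  for (int k = 1; k <= maxit; ++k) {
    double max_resid = 0.0;
    for (std::size_t b = 0; b < nb; ++b) {
      if (rr[b] == 0.0) continue;            // already exact; avoid 0/0
      const Vec ap = applyMatrix(A, p[b]);
      const double alpha = rr[b] / dot(p[b], ap);
      for (std::size_t i = 0; i < p[b].size(); ++i) {
        psi[b][i] += alpha * p[b][i];        // solution update
        r[b][i] -= alpha * ap[i];            // residual update
      }
      const double rr_new = dot(r[b], r[b]);
      const double beta = rr_new / rr[b];
      for (std::size_t i = 0; i < p[b].size(); ++i)
        p[b][i] = r[b][i] + beta * p[b][i];  // new search direction
      rr[b] = rr_new;
      max_resid = std::max(max_resid, rr[b] / ssq[b]);
    }
    if (max_resid < tol * tol) return k;     // same test as the class above
  }
  return -1;
}
```

As a usage example, calling `multiRhsCG(A, src, psi, 1e-10, 50)` with a 3x3 positive-definite `A`, two sources and zero initial guesses should leave each `psi[b]` satisfying `A psi[b] = src[b]` to within the tolerance. The block variant in the same file differs in that the per-column scalars become Nblock x Nblock matrices built from slice inner products, so the right-hand sides share search directions.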
@@ -78,18 +78,12 @@ class ConjugateGradient : public OperatorFunction<Field> {
    cp = a;
    ssq = norm2(src);

-    std::cout << GridLogIterative << std::setprecision(4)
-              << "ConjugateGradient: guess " << guess << std::endl;
-    std::cout << GridLogIterative << std::setprecision(4)
-              << "ConjugateGradient:   src " << ssq << std::endl;
-    std::cout << GridLogIterative << std::setprecision(4)
-              << "ConjugateGradient:    mp " << d << std::endl;
-    std::cout << GridLogIterative << std::setprecision(4)
-              << "ConjugateGradient:   mmp " << b << std::endl;
-    std::cout << GridLogIterative << std::setprecision(4)
-              << "ConjugateGradient:  cp,r " << cp << std::endl;
-    std::cout << GridLogIterative << std::setprecision(4)
-              << "ConjugateGradient:     p " << a << std::endl;
+    std::cout << GridLogIterative << std::setprecision(4) << "ConjugateGradient: guess " << guess << std::endl;
+    std::cout << GridLogIterative << std::setprecision(4) << "ConjugateGradient:   src " << ssq << std::endl;
+    std::cout << GridLogIterative << std::setprecision(4) << "ConjugateGradient:    mp " << d << std::endl;
+    std::cout << GridLogIterative << std::setprecision(4) << "ConjugateGradient:   mmp " << b << std::endl;
+    std::cout << GridLogIterative << std::setprecision(4) << "ConjugateGradient:  cp,r " << cp << std::endl;
+    std::cout << GridLogIterative << std::setprecision(4) << "ConjugateGradient:     p " << a << std::endl;

    RealD rsq = Tolerance * Tolerance * ssq;

@@ -99,8 +93,7 @@ class ConjugateGradient : public OperatorFunction<Field> {
    }

    std::cout << GridLogIterative << std::setprecision(4)
-              << "ConjugateGradient: k=0 residual " << cp << " target " << rsq
-              << std::endl;
+              << "ConjugateGradient: k=0 residual " << cp << " target " << rsq << std::endl;

    GridStopWatch LinalgTimer;
    GridStopWatch MatrixTimer;
@@ -145,19 +138,20 @@ class ConjugateGradient : public OperatorFunction<Field> {
        RealD resnorm = sqrt(norm2(p));
        RealD true_residual = resnorm / srcnorm;

-        std::cout << GridLogMessage
-                  << "ConjugateGradient: Converged on iteration " << k << std::endl;
-        std::cout << GridLogMessage << "Computed residual " << sqrt(cp / ssq)
-                  << " true residual " << true_residual << " target "
-                  << Tolerance << std::endl;
-        std::cout << GridLogMessage << "Time elapsed: Iterations "
-                  << SolverTimer.Elapsed() << " Matrix  "
-                  << MatrixTimer.Elapsed() << " Linalg "
-                  << LinalgTimer.Elapsed();
-        std::cout << std::endl;
+        std::cout << GridLogMessage << "ConjugateGradient Converged on iteration " << k << std::endl;
+        std::cout << GridLogMessage << "\tComputed residual " << sqrt(cp / ssq)<<std::endl;
+	std::cout << GridLogMessage << "\tTrue residual " << true_residual<<std::endl;
+	std::cout << GridLogMessage << "\tTarget " << Tolerance << std::endl;
+
+        std::cout << GridLogMessage << "Time breakdown "<<std::endl;
+	std::cout << GridLogMessage << "\tElapsed    " << SolverTimer.Elapsed() <<std::endl;
+	std::cout << GridLogMessage << "\tMatrix     " << MatrixTimer.Elapsed() <<std::endl;
+	std::cout << GridLogMessage << "\tLinalg     " << LinalgTimer.Elapsed() <<std::endl;

        if (ErrorOnNoConverge) assert(true_residual / Tolerance < 10000.0);
+
 	IterationsToComplete = k;	
+
        return;
      }
    }
@@ -30,6 +30,7 @@ Author: paboyle <paboyle@ph.ed.ac.uk>
 #define GRID_IRL_H

 #include <string.h> //memset
+
 #ifdef USE_LAPACK
 void LAPACK_dstegr(char *jobz, char *range, int *n, double *d, double *e,
                   double *vl, double *vu, int *il, int *iu, double *abstol,
@@ -37,8 +38,9 @@ void LAPACK_dstegr(char *jobz, char *range, int *n, double *d, double *e,
                   double *work, int *lwork, int *iwork, int *liwork,
                   int *info);
 #endif
-#include "DenseMatrix.h"
-#include "EigenSort.h"
+
+#include <Grid/algorithms/densematrix/DenseMatrix.h>
+#include <Grid/algorithms/iterative/EigenSort.h>

 namespace Grid {

@@ -1088,8 +1090,6 @@ static void Lock(DenseMatrix<T> &H, 	// Hess mtx
 		 int dfg,
 		 bool herm)
 {	
-
-
  //ForceTridiagonal(H);

  int M = H.dim;
@@ -1121,7 +1121,6 @@ static void Lock(DenseMatrix<T> &H, 	// Hess mtx

  AH = Hermitian(QQ)*AH;
  AH = AH*QQ;
-	

  for(int i=con;i<M;i++){
    for(int j=con;j<M;j++){
@@ -1,453 +0,0 @@
-    /*************************************************************************************
-
-    Grid physics library, www.github.com/paboyle/Grid 
-
-    Source file: ./lib/algorithms/iterative/Matrix.h
-
-    Copyright (C) 2015
-
-Author: Peter Boyle <paboyle@ph.ed.ac.uk>
-
-    This program is free software; you can redistribute it and/or modify
-    it under the terms of the GNU General Public License as published by
-    the Free Software Foundation; either version 2 of the License, or
-    (at your option) any later version.
-
-    This program is distributed in the hope that it will be useful,
-    but WITHOUT ANY WARRANTY; without even the implied warranty of
-    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
-    GNU General Public License for more details.
-
-    You should have received a copy of the GNU General Public License along
-    with this program; if not, write to the Free Software Foundation, Inc.,
-    51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
-
-    See the full license in the file "LICENSE" in the top level distribution directory
-    *************************************************************************************/
-    /*  END LEGAL */
-#ifndef MATRIX_H
-#define MATRIX_H
-
-#include <cstdlib>
-#include <string>
-#include <cmath>
-#include <vector>
-#include <iostream>
-#include <iomanip>
-#include <complex>
-#include <typeinfo>
-#include <Grid/Grid.h>
-
-
-/** Sign function **/
-template <class T> T sign(T p){return ( p/abs(p) );}
-
-/////////////////////////////////////////////////////////////////////////////////////////////////////////
-///////////////////// Hijack STL containers for our wicked means /////////////////////////////////////////
-/////////////////////////////////////////////////////////////////////////////////////////////////////////
-template<class T> using Vector = Vector<T>;
-template<class T> using Matrix = Vector<Vector<T> >;
-
-template<class T> void Resize(Vector<T > & vec, int N) { vec.resize(N); }
-
-template<class T> void Resize(Matrix<T > & mat, int N, int M) { 
-  mat.resize(N);
-  for(int i=0;i<N;i++){
-    mat[i].resize(M);
-  }
-}
-template<class T> void Size(Vector<T> & vec, int &N) 
-{ 
-  N= vec.size();
-}
-template<class T> void Size(Matrix<T> & mat, int &N,int &M) 
-{ 
-  N= mat.size();
-  M= mat[0].size();
-}
-template<class T> void SizeSquare(Matrix<T> & mat, int &N) 
-{ 
-  int M; Size(mat,N,M);
-  assert(N==M);
-}
-template<class T> void SizeSame(Matrix<T> & mat1,Matrix<T> &mat2, int &N1,int &M1) 
-{ 
-  int N2,M2;
-  Size(mat1,N1,M1);
-  Size(mat2,N2,M2);
-  assert(N1==N2);
-  assert(M1==M2);
-}
-
-//*****************************************
-//*	(Complex) Vector operations	*
-//*****************************************
-
-/**Conj of a Vector **/
-template <class T> Vector<T> conj(Vector<T> p){
-	Vector<T> q(p.size());
-	for(int i=0;i<p.size();i++){q[i] = conj(p[i]);}
-	return q;
-}
-
-/** Norm of a Vector**/
-template <class T> T norm(Vector<T> p){
-	T sum = 0;
-	for(int i=0;i<p.size();i++){sum = sum + p[i]*conj(p[i]);}
-	return abs(sqrt(sum));
-}
-
-/** Norm squared of a Vector **/
-template <class T> T norm2(Vector<T> p){
-	T sum = 0;
-	for(int i=0;i<p.size();i++){sum = sum + p[i]*conj(p[i]);}
-	return abs((sum));
-}
-
-/** Sum elements of a Vector **/
-template <class T> T trace(Vector<T> p){
-	T sum = 0;
-	for(int i=0;i<p.size();i++){sum = sum + p[i];}
-	return sum;
-}
-
-/** Fill a Vector with constant c **/
-template <class T> void Fill(Vector<T> &p, T c){
-	for(int i=0;i<p.size();i++){p[i] = c;}
-}
-/** Normalize a Vector **/
-template <class T> void normalize(Vector<T> &p){
-	T m = norm(p);
-	if( abs(m) > 0.0) for(int i=0;i<p.size();i++){p[i] /= m;}
-}
-/** Vector by scalar **/
-template <class T, class U> Vector<T> times(Vector<T> p, U s){
-	for(int i=0;i<p.size();i++){p[i] *= s;}
-	return p;
-}
-template <class T, class U> Vector<T> times(U s, Vector<T> p){
-	for(int i=0;i<p.size();i++){p[i] *= s;}
-	return p;
-}
-/** inner product of a and b = conj(a) . b **/
-template <class T> T inner(Vector<T> a, Vector<T> b){
-	T m = 0.;
-	for(int i=0;i<a.size();i++){m = m + conj(a[i])*b[i];}
-	return m;
-}
-/** sum of a and b = a + b **/
-template <class T> Vector<T> add(Vector<T> a, Vector<T> b){
-	Vector<T> m(a.size());
-	for(int i=0;i<a.size();i++){m[i] = a[i] + b[i];}
-	return m;
-}
-/** sum of a and b = a - b **/
-template <class T> Vector<T> sub(Vector<T> a, Vector<T> b){
-	Vector<T> m(a.size());
-	for(int i=0;i<a.size();i++){m[i] = a[i] - b[i];}
-	return m;
-}
-
-/** 
- *********************************
- *	Matrices	         *
- *********************************
- **/
-
-template<class T> void Fill(Matrix<T> & mat, T&val) { 
-  int N,M;
-  Size(mat,N,M);
-  for(int i=0;i<N;i++){
-  for(int j=0;j<M;j++){
-    mat[i][j] = val;
-  }}
-}
-
-/** Transpose of a matrix **/
-Matrix<T> Transpose(Matrix<T> & mat){
-  int N,M;
-  Size(mat,N,M);
-  Matrix C; Resize(C,M,N);
-  for(int i=0;i<M;i++){
-  for(int j=0;j<N;j++){
-    C[i][j] = mat[j][i];
-  }} 
-  return C;
-}
-/** Set Matrix to unit matrix **/
-template<class T> void Unity(Matrix<T> &mat){
-  int N;  SizeSquare(mat,N);
-  for(int i=0;i<N;i++){
-    for(int j=0;j<N;j++){
-      if ( i==j ) A[i][j] = 1;
-      else        A[i][j] = 0;
-    } 
-  } 
-}
-/** Add C * I to matrix **/
-template<class T>
-void PlusUnit(Matrix<T> & A,T c){
-  int dim;  SizeSquare(A,dim);
-  for(int i=0;i<dim;i++){A[i][i] = A[i][i] + c;} 
-}
-
-/** return the Hermitian conjugate of matrix **/
-Matrix<T> HermitianConj(Matrix<T> &mat){
-
-  int dim; SizeSquare(mat,dim);
-
-  Matrix<T> C; Resize(C,dim,dim);
-
-  for(int i=0;i<dim;i++){
-    for(int j=0;j<dim;j++){
-      C[i][j] = conj(mat[j][i]);
-    } 
-  } 
-  return C;
-}
-
-/** return diagonal entries as a Vector **/
-Vector<T> diag(Matrix<T> &A)
-{
-  int dim; SizeSquare(A,dim);
-  Vector<T> d; Resize(d,dim);
-
-  for(int i=0;i<dim;i++){
-    d[i] = A[i][i];
-  }
-  return d;
-}
-
-/** Left multiply by a Vector **/
-Vector<T> operator *(Vector<T> &B,Matrix<T> &A)
-{
-  int K,M,N; 
-  Size(B,K);
-  Size(A,M,N);
-  assert(K==M);
-  
-  Vector<T> C; Resize(C,N);
-
-  for(int j=0;j<N;j++){
-    T sum = 0.0;
-    for(int i=0;i<M;i++){
-      sum += B[i] * A[i][j];
-    }
-    C[j] =  sum;
-  }
-  return C; 
-}
-
-/** return 1/diagonal entries as a Vector **/
-Vector<T> inv_diag(Matrix<T> & A){
-  int dim; SizeSquare(A,dim);
-  Vector<T> d; Resize(d,dim);
-  for(int i=0;i<dim;i++){
-    d[i] = 1.0/A[i][i];
-  }
-  return d;
-}
-/** Matrix Addition **/
-inline Matrix<T> operator + (Matrix<T> &A,Matrix<T> &B)
-{
-  int N,M  ; SizeSame(A,B,N,M);
-  Matrix C; Resize(C,N,M);
-  for(int i=0;i<N;i++){
-    for(int j=0;j<M;j++){
-      C[i][j] = A[i][j] +  B[i][j];
-    } 
-  } 
-  return C;
-} 
-/** Matrix Subtraction **/
-inline Matrix<T> operator- (Matrix<T> & A,Matrix<T> &B){
-  int N,M  ; SizeSame(A,B,N,M);
-  Matrix C; Resize(C,N,M);
-  for(int i=0;i<N;i++){
-  for(int j=0;j<M;j++){
-    C[i][j] = A[i][j] -  B[i][j];
-  }}
-  return C;
-} 
-
-/** Matrix scalar multiplication **/
-inline Matrix<T> operator* (Matrix<T> & A,T c){
-  int N,M; Size(A,N,M);
-  Matrix C; Resize(C,N,M);
-  for(int i=0;i<N;i++){
-  for(int j=0;j<M;j++){
-    C[i][j] = A[i][j]*c;
-  }} 
-  return C;
-} 
-/** Matrix Matrix multiplication **/
-inline Matrix<T> operator* (Matrix<T> &A,Matrix<T> &B){
-  int K,L,N,M;
-  Size(A,K,L);
-  Size(B,N,M); assert(L==N);
-  Matrix C; Resize(C,K,M);
-
-  for(int i=0;i<K;i++){
-    for(int j=0;j<M;j++){
-      T sum = 0.0;
-      for(int k=0;k<N;k++) sum += A[i][k]*B[k][j];
-      C[i][j] =sum;
-    }
-  }
-  return C; 
-} 
-/** Matrix Vector multiplication **/
-inline Vector<T> operator* (Matrix<T> &A,Vector<T> &B){
-  int M,N,K;
-  Size(A,N,M);
-  Size(B,K); assert(K==M);
-  Vector<T> C; Resize(C,N);
-  for(int i=0;i<N;i++){
-    T sum = 0.0;
-    for(int j=0;j<M;j++) sum += A[i][j]*B[j];
-    C[i] =  sum;
-  }
-  return C; 
-} 
-
-/** Some version of Matrix norm **/
-/*
-inline T Norm(){ // this is not a usual L2 norm
-    T norm = 0;
-    for(int i=0;i<dim;i++){
-      for(int j=0;j<dim;j++){
-	norm += abs(A[i][j]);
-    }}
-    return norm;
-  }
-*/
-
-/** Some version of Matrix norm **/
-template<class T> T LargestDiag(Matrix<T> &A)
-{
-  int dim ; SizeSquare(A,dim); 
-
-  T ld = abs(A[0][0]);
-  for(int i=1;i<dim;i++){
-    T cf = abs(A[i][i]);
-    if(abs(cf) > abs(ld) ){ld = cf;}
-  }
-  return ld;
-}
-
-/** Look for entries on the leading subdiagonal that are smaller than 'small' **/
-template <class T,class U> int Chop_subdiag(Matrix<T> &A,T norm, int offset, U small)
-{
-  int dim; SizeSquare(A,dim);
-  for(int l = dim - 1 - offset; l >= 1; l--) {             		
-    if((U)abs(A[l][l - 1]) < (U)small) {
-      A[l][l-1]=(U)0.0;
-      return l;
-    }
-  }
-  return 0;
-}
-
-/** Look for entries on the leading subdiagonal that are smaller than 'small' **/
-template <class T,class U> int Chop_symm_subdiag(Matrix<T> & A,T norm, int offset, U small) 
-{
-  int dim; SizeSquare(A,dim);
-  for(int l = dim - 1 - offset; l >= 1; l--) {
-    if((U)abs(A[l][l - 1]) < (U)small) {
-      A[l][l - 1] = (U)0.0;
-      A[l - 1][l] = (U)0.0;
-      return l;
-    }
-  }
-  return 0;
-}
-/**Assign a submatrix to a larger one**/
-template<class T>
-void AssignSubMtx(Matrix<T> & A,int row_st, int row_end, int col_st, int col_end, Matrix<T> &S)
-{
-  for(int i = row_st; i<row_end; i++){
-    for(int j = col_st; j<col_end; j++){
-      A[i][j] = S[i - row_st][j - col_st];
-    }
-  }
-}
-
-/**Get a square submatrix**/
-template <class T>
-Matrix<T> GetSubMtx(Matrix<T> &A,int row_st, int row_end, int col_st, int col_end)
-{
-  Matrix<T> H; Resize(row_end - row_st,col_end-col_st);
-
-  for(int i = row_st; i<row_end; i++){
-  for(int j = col_st; j<col_end; j++){
-    H[i-row_st][j-col_st]=A[i][j];
-  }}
-  return H;
-}
-  
- /**Assign a submatrix to a larger one NB remember Vector Vectors are transposes of the matricies they represent**/
-template<class T>
-void AssignSubMtx(Matrix<T> & A,int row_st, int row_end, int col_st, int col_end, Matrix<T> &S)
-{
-  for(int i = row_st; i<row_end; i++){
-  for(int j = col_st; j<col_end; j++){
-    A[i][j] = S[i - row_st][j - col_st];
-  }}
-}
-  
-/** compute b_i A_ij b_j **/ // surprised no Conj
-template<class T> T proj(Matrix<T> A, Vector<T> B){
-  int dim; SizeSquare(A,dim);
-  int dimB; Size(B,dimB);
-  assert(dimB==dim);
-  T C = 0;
-  for(int i=0;i<dim;i++){
-    T sum = 0.0;
-    for(int j=0;j<dim;j++){
-      sum += A[i][j]*B[j];
-    }
-    C +=  B[i]*sum; // No conj?
-  }
-  return C; 
-}
-
-
-/*
- *************************************************************
- *
- * Matrix Vector products
- *
- *************************************************************
- */
-// Instead make a linop and call my CG;
-
-/// q -> q Q
-template <class T,class Fermion> void times(Vector<Fermion> &q, Matrix<T> &Q)
-{
-  int M; SizeSquare(Q,M);
-  int N; Size(q,N); 
-  assert(M==N);
-
-  times(q,Q,N);
-}
-
-/// q -> q Q
-template <class T> void times(multi1d<LatticeFermion> &q, Matrix<T> &Q, int N)
-{
-  GridBase *grid = q[0]._grid;
-  int M; SizeSquare(Q,M);
-  int K; Size(q,K); 
-  assert(N<M);
-  assert(N<K);
-  Vector<Fermion> S(N,grid );
-  for(int j=0;j<N;j++){
-    S[j] = zero;
-    for(int k=0;k<N;k++){
-      S[j] = S[j] +  q[k]* Q[k][j]; 
-    }
-  }
-  for(int j=0;j<q.size();j++){
-    q[j] = S[j];
-  }
-}
-#endif
@@ -1,75 +0,0 @@
-    /*************************************************************************************
-
-    Grid physics library, www.github.com/paboyle/Grid 
-
-    Source file: ./lib/algorithms/iterative/MatrixUtils.h
-
-    Copyright (C) 2015
-
-Author: Peter Boyle <paboyle@ph.ed.ac.uk>
-
-    This program is free software; you can redistribute it and/or modify
-    it under the terms of the GNU General Public License as published by
-    the Free Software Foundation; either version 2 of the License, or
-    (at your option) any later version.
-
-    This program is distributed in the hope that it will be useful,
-    but WITHOUT ANY WARRANTY; without even the implied warranty of
-    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
-    GNU General Public License for more details.
-
-    You should have received a copy of the GNU General Public License along
-    with this program; if not, write to the Free Software Foundation, Inc.,
-    51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
-
-    See the full license in the file "LICENSE" in the top level distribution directory
-    *************************************************************************************/
-    /*  END LEGAL */
-#ifndef GRID_MATRIX_UTILS_H
-#define GRID_MATRIX_UTILS_H
-
-namespace Grid {
-
-  namespace MatrixUtils { 
-
-    template<class T> inline void Size(Matrix<T>& A,int &N,int &M){
-      N=A.size(); assert(N>0);
-      M=A[0].size();
-      for(int i=0;i<N;i++){
-	assert(A[i].size()==M);
-      }
-    }
-
-    template<class T> inline void SizeSquare(Matrix<T>& A,int &N)
-    {
-      int M;
-      Size(A,N,M);
-      assert(N==M);
-    }
-
-    template<class T> inline void Fill(Matrix<T>& A,T & val)
-    { 
-      int N,M;
-      Size(A,N,M);
-      for(int i=0;i<N;i++){
-      for(int j=0;j<M;j++){
-	A[i][j]=val;
-      }}
-    }
-    template<class T> inline void Diagonal(Matrix<T>& A,T & val)
-    { 
-      int N;
-      SizeSquare(A,N);
-      for(int i=0;i<N;i++){
-	A[i][i]=val;
-      }
-    }
-    template<class T> inline void Identity(Matrix<T>& A)
-    {
-      Fill(A,0.0);
-      Diagonal(A,1.0);
-    }
-
-  };
-}
-#endif
@@ -1,15 +0,0 @@
- ConjugateGradientMultiShift
- MCR
-
- Potentially Useful Boost libraries
-
- MultiArray
- Aligned allocator; memory pool
- Remez -- Mike or Boost?
- Multiprecision
- quaternians
- Tokenize
- Serialization
- Regex
- Proto (ET)
- uBlas
@@ -1,122 +0,0 @@
-#include <math.h>
-#include <stdlib.h>
-#include <vector>
-
-struct Bisection {
-
-static void get_eig2(int row_num,std::vector<RealD> &ALPHA,std::vector<RealD> &BETA, std::vector<RealD> & eig)
-{
-  int i,j;
-  std::vector<RealD> evec1(row_num+3);
-  std::vector<RealD> evec2(row_num+3);
-  RealD eps2;
-  ALPHA[1]=0.;
-  BETHA[1]=0.;
-  for(i=0;i<row_num-1;i++) {
-    ALPHA[i+1] = A[i*(row_num+1)].real();
-    BETHA[i+2] = A[i*(row_num+1)+1].real();
-  }
-  ALPHA[row_num] = A[(row_num-1)*(row_num+1)].real();
-  bisec(ALPHA,BETHA,row_num,1,row_num,1e-10,1e-10,evec1,eps2);
-  bisec(ALPHA,BETHA,row_num,1,row_num,1e-16,1e-16,evec2,eps2);
-
-  // Do we really need to sort here?
-  int begin=1;
-  int end = row_num;
-  int swapped=1;
-  while(swapped) {
-    swapped=0;
-    for(i=begin;i<end;i++){
-      if(mag(evec2[i])>mag(evec2[i+1]))	{
-	swap(evec2+i,evec2+i+1);
-	swapped=1;
-      }
-    }
-    end--;
-    for(i=end-1;i>=begin;i--){
-      if(mag(evec2[i])>mag(evec2[i+1]))	{
-	swap(evec2+i,evec2+i+1);
-	swapped=1;
-      }
-    }
-    begin++;
-  }
-
-  for(i=0;i<row_num;i++){
-    for(j=0;j<row_num;j++) {
-      if(i==j) H[i*row_num+j]=evec2[i+1];
-      else H[i*row_num+j]=0.;
-    }
-  }
-}
-
-static void bisec(std::vector<RealD> &c,   
-		  std::vector<RealD> &b,
-		  int n,
-		  int m1,
-		  int m2,
-		  RealD eps1,
-		  RealD relfeh,
-		  std::vector<RealD> &x,
-		  RealD &eps2)
-{
-  std::vector<RealD> wu(n+2);
-
-  RealD h,q,x1,xu,x0,xmin,xmax; 
-  int i,a,k;
-
-  b[1]=0.0;
-  xmin=c[n]-fabs(b[n]);
-  xmax=c[n]+fabs(b[n]);
-  for(i=1;i<n;i++){
-    h=fabs(b[i])+fabs(b[i+1]);
-    if(c[i]+h>xmax) xmax= c[i]+h;
-    if(c[i]-h<xmin) xmin= c[i]-h;
-  }
-  xmax *=2.;
-
-  eps2=relfeh*((xmin+xmax)>0.0 ? xmax : -xmin);
-  if(eps1<=0.0) eps1=eps2;
-  eps2=0.5*eps1+7.0*(eps2);
-  x0=xmax;
-  for(i=m1;i<=m2;i++){
-    x[i]=xmax;
-    wu[i]=xmin;
-  }
-
-  for(k=m2;k>=m1;k--){
-    xu=xmin;
-    i=k;
-    do{
-      if(xu<wu[i]){
-	xu=wu[i];
-	i=m1-1;
-      }
-      i--;
-    }while(i>=m1);
-    if(x0>x[k]) x0=x[k];
-    while((x0-xu)>2*relfeh*(fabs(xu)+fabs(x0))+eps1){
-      x1=(xu+x0)/2;
-
-      a=0;
-      q=1.0;
-      for(i=1;i<=n;i++){
-	q=c[i]-x1-((q!=0.0)? b[i]*b[i]/q:fabs(b[i])/relfeh);
-	if(q<0) a++;
-      }
-      //			printf("x1=%e a=%d\n",x1,a);
-      if(a<k){
-	if(a<m1){
-	  xu=x1;
-	  wu[m1]=x1;
-	}else {
-	  xu=x1;
-	  wu[a+1]=x1;
-	  if(x[a]>x1) x[a]=x1;
-	}
-      }else x0=x1;
-    }
-    x[k]=(x0+xu)/2;
-  }
-}
-}
@@ -1 +0,0 @@
-
@@ -177,9 +177,11 @@ public:
    // Global addressing
    ////////////////////////////////////////////////////////////////
    void GlobalIndexToGlobalCoor(int gidx,std::vector<int> &gcoor){
+      assert(gidx< gSites());
      Lexicographic::CoorFromIndex(gcoor,gidx,_gdimensions);
    }
    void LocalIndexToLocalCoor(int lidx,std::vector<int> &lcoor){
+      assert(lidx<lSites());
      Lexicographic::CoorFromIndex(lcoor,lidx,_ldimensions);
    }
    void GlobalCoorToGlobalIndex(const std::vector<int> & gcoor,int & gidx){
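Note on the hunk above: the two added assertions guard the index-to-coordinate decode against out-of-range indices. A minimal sketch of that idea follows, assuming a lexicographic layout in which dimension 0 runs fastest. The names `siteCount` and `coorFromIndex` are hypothetical stand-ins for illustration and are not Grid's `Lexicographic::CoorFromIndex`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: number of sites in a box with the given extents.
inline int siteCount(const std::vector<int> &dims) {
  int n = 1;
  for (int d : dims) n *= d;
  return n;
}

// Hypothetical stand-in for a lexicographic decode.
// Assumption: dimension 0 is the fastest-running index.
inline std::vector<int> coorFromIndex(int idx, const std::vector<int> &dims) {
  // Same kind of guard as the assertions added in the diff:
  // an index at or beyond the site count would decode to a wrapped coordinate.
  assert(idx >= 0 && idx < siteCount(dims));
  std::vector<int> coor(dims.size());
  for (std::size_t d = 0; d < dims.size(); d++) {
    coor[d] = idx % dims[d];  // position along dimension d
    idx /= dims[d];           // strip that dimension off
  }
  return coor;
}
```

Without the guard, an index equal to the site count would silently decode to the origin, which is the kind of error the added assertions are likely meant to catch early.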
@@ -206,7 +206,7 @@ void CartesianCommunicator::Init(int *argc, char ***argv) {
      sprintf(shm_name,"/Grid_mpi3_shm_%d_%d",GroupRank,r);

      shm_unlink(shm_name);
-      int fd=shm_open(shm_name,O_RDWR|O_CREAT,0660);
+      int fd=shm_open(shm_name,O_RDWR|O_CREAT,0666);
      if ( fd < 0 ) {	perror("failed shm_open");	assert(0);      }
      ftruncate(fd, size);

@@ -226,7 +226,7 @@ void CartesianCommunicator::Init(int *argc, char ***argv) {
    
      sprintf(shm_name,"/Grid_mpi3_shm_%d_%d",GroupRank,r);

-      int fd=shm_open(shm_name,O_RDWR,0660);
+      int fd=shm_open(shm_name,O_RDWR,0666);
      if ( fd<0 ) {	perror("failed shm_open");	assert(0);      }

      void * ptr =  mmap(NULL,size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
@@ -30,6 +30,8 @@ Author: paboyle <paboyle@ph.ed.ac.uk>
 #ifndef GRID_LATTICE_REDUCTION_H
 #define GRID_LATTICE_REDUCTION_H

+#include <Grid/Eigen/Dense>
+
 namespace Grid {
 #ifdef GRID_WARN_SUBOPTIMAL
 #warning "Optimisation alert all these reduction loops are NOT threaded "
@@ -38,120 +40,123 @@ namespace Grid {
    ////////////////////////////////////////////////////////////////////////////////////////////////////
    // Deterministic Reduction operations
    ////////////////////////////////////////////////////////////////////////////////////////////////////
-  template<class vobj> inline RealD norm2(const Lattice<vobj> &arg){
-    ComplexD nrm = innerProduct(arg,arg);
-    return std::real(nrm); 
+template<class vobj> inline RealD norm2(const Lattice<vobj> &arg){
+  ComplexD nrm = innerProduct(arg,arg);
+  return std::real(nrm); 
+}
+
+// Double inner product
+template<class vobj>
+inline ComplexD innerProduct(const Lattice<vobj> &left,const Lattice<vobj> &right) 
+{
+  typedef typename vobj::scalar_type scalar_type;
+  typedef typename vobj::vector_typeD vector_type;
+  scalar_type  nrm;
+  
+  GridBase *grid = left._grid;
+  
+  std::vector<vector_type,alignedAllocator<vector_type> > sumarray(grid->SumArraySize());
+  
+  parallel_for(int thr=0;thr<grid->SumArraySize();thr++){
+    int nwork, mywork, myoff;
+    GridThread::GetWork(left._grid->oSites(),thr,mywork,myoff);
+    
+    decltype(innerProductD(left._odata[0],right._odata[0])) vnrm=zero; // private to thread; sub summation
+    for(int ss=myoff;ss<mywork+myoff; ss++){
+      vnrm = vnrm + innerProductD(left._odata[ss],right._odata[ss]);
+    }
+    sumarray[thr]=TensorRemove(vnrm) ;
  }
+  
+  vector_type vvnrm; vvnrm=zero;  // sum across threads
+  for(int i=0;i<grid->SumArraySize();i++){
+    vvnrm = vvnrm+sumarray[i];
+  } 
+  nrm = Reduce(vvnrm);// sum across simd
+  right._grid->GlobalSum(nrm);
+  return nrm;
+}
+ 
+template<class Op,class T1>
+inline auto sum(const LatticeUnaryExpression<Op,T1> & expr)
+  ->typename decltype(expr.first.func(eval(0,std::get<0>(expr.second))))::scalar_object
+{
+  return sum(closure(expr));
+}

-    template<class vobj>
-    inline ComplexD innerProduct(const Lattice<vobj> &left,const Lattice<vobj> &right) 
-    {
-      typedef typename vobj::scalar_type scalar_type;
-      typedef typename vobj::vector_type vector_type;
-      scalar_type  nrm;
-
-      GridBase *grid = left._grid;
-
-      std::vector<vector_type,alignedAllocator<vector_type> > sumarray(grid->SumArraySize());
-      for(int i=0;i<grid->SumArraySize();i++){
-	sumarray[i]=zero;
-      }
-
-      parallel_for(int thr=0;thr<grid->SumArraySize();thr++){
-	int nwork, mywork, myoff;
-	GridThread::GetWork(left._grid->oSites(),thr,mywork,myoff);
-	
-	decltype(innerProduct(left._odata[0],right._odata[0])) vnrm=zero; // private to thread; sub summation
-        for(int ss=myoff;ss<mywork+myoff; ss++){
-	  vnrm = vnrm + innerProduct(left._odata[ss],right._odata[ss]);
-	}
-	sumarray[thr]=TensorRemove(vnrm) ;
-      }
-      
-      vector_type vvnrm; vvnrm=zero;  // sum across threads
-      for(int i=0;i<grid->SumArraySize();i++){
-	vvnrm = vvnrm+sumarray[i];
-      } 
-      nrm = Reduce(vvnrm);// sum across simd
-      right._grid->GlobalSum(nrm);
-      return nrm;
-    }
-
-    template<class Op,class T1>
-      inline auto sum(const LatticeUnaryExpression<Op,T1> & expr)
-      ->typename decltype(expr.first.func(eval(0,std::get<0>(expr.second))))::scalar_object
-    {
-      return sum(closure(expr));
-    }
-
-    template<class Op,class T1,class T2>
-      inline auto sum(const LatticeBinaryExpression<Op,T1,T2> & expr)
+template<class Op,class T1,class T2>
+inline auto sum(const LatticeBinaryExpression<Op,T1,T2> & expr)
      ->typename decltype(expr.first.func(eval(0,std::get<0>(expr.second)),eval(0,std::get<1>(expr.second))))::scalar_object
-    {
-      return sum(closure(expr));
-    }
-
-
-    template<class Op,class T1,class T2,class T3>
-      inline auto sum(const LatticeTrinaryExpression<Op,T1,T2,T3> & expr)
-      ->typename decltype(expr.first.func(eval(0,std::get<0>(expr.second)),
-				 eval(0,std::get<1>(expr.second)),
-				 eval(0,std::get<2>(expr.second))
-				 ))::scalar_object
-    {
-      return sum(closure(expr));
-    }
-
-    template<class vobj>
-    inline typename vobj::scalar_object sum(const Lattice<vobj> &arg){
-
-      GridBase *grid=arg._grid;
-      int Nsimd = grid->Nsimd();
-
-      std::vector<vobj,alignedAllocator<vobj> > sumarray(grid->SumArraySize());
-      for(int i=0;i<grid->SumArraySize();i++){
-	sumarray[i]=zero;
-      }
-
-      parallel_for(int thr=0;thr<grid->SumArraySize();thr++){
-	int nwork, mywork, myoff;
-	GridThread::GetWork(grid->oSites(),thr,mywork,myoff);
-	
-	vobj vvsum=zero;
-        for(int ss=myoff;ss<mywork+myoff; ss++){
-	  vvsum = vvsum + arg._odata[ss];
-	}
-	sumarray[thr]=vvsum;
-      }
-      
-      vobj vsum=zero;  // sum across threads
-      for(int i=0;i<grid->SumArraySize();i++){
-	vsum = vsum+sumarray[i];
-      } 
-
-      typedef typename vobj::scalar_object sobj;
-      sobj ssum=zero;
-
-      std::vector<sobj>               buf(Nsimd);
-      extract(vsum,buf);
-
-      for(int i=0;i<Nsimd;i++) ssum = ssum + buf[i];
-      arg._grid->GlobalSum(ssum);
-
-      return ssum;
+{
+  return sum(closure(expr));
+}
+
+
+template<class Op,class T1,class T2,class T3>
+inline auto sum(const LatticeTrinaryExpression<Op,T1,T2,T3> & expr)
+  ->typename decltype(expr.first.func(eval(0,std::get<0>(expr.second)),
+				      eval(0,std::get<1>(expr.second)),
+				      eval(0,std::get<2>(expr.second))
+				      ))::scalar_object
+{
+  return sum(closure(expr));
+}
+
+template<class vobj>
+inline typename vobj::scalar_object sum(const Lattice<vobj> &arg)
+{
+  GridBase *grid=arg._grid;
+  int Nsimd = grid->Nsimd();
+  
+  std::vector<vobj,alignedAllocator<vobj> > sumarray(grid->SumArraySize());
+  for(int i=0;i<grid->SumArraySize();i++){
+    sumarray[i]=zero;
+  }
+  
+  parallel_for(int thr=0;thr<grid->SumArraySize();thr++){
+    int nwork, mywork, myoff;
+    GridThread::GetWork(grid->oSites(),thr,mywork,myoff);
+    
+    vobj vvsum=zero;
+    for(int ss=myoff;ss<mywork+myoff; ss++){
+      vvsum = vvsum + arg._odata[ss];
    }
+    sumarray[thr]=vvsum;
+  }
+  
+  vobj vsum=zero;  // sum across threads
+  for(int i=0;i<grid->SumArraySize();i++){
+    vsum = vsum+sumarray[i];
+  } 
+  
+  typedef typename vobj::scalar_object sobj;
+  sobj ssum=zero;
+  
+  std::vector<sobj>               buf(Nsimd);
+  extract(vsum,buf);
+  
+  for(int i=0;i<Nsimd;i++) ssum = ssum + buf[i];
+  arg._grid->GlobalSum(ssum);
+  
+  return ssum;
+}


+//////////////////////////////////////////////////////////////////////////////////////////////////////////////
+// sliceSum, sliceInnerProduct, sliceAxpy, sliceNorm etc...
+//////////////////////////////////////////////////////////////////////////////////////////////////////////////

 template<class vobj> inline void sliceSum(const Lattice<vobj> &Data,std::vector<typename vobj::scalar_object> &result,int orthogdim)
 {
+  ///////////////////////////////////////////////////////
+  // FIXME precision promoted summation
+  // may be important for correlation functions
+  // But easily avoided by using double precision fields
+  ///////////////////////////////////////////////////////
  typedef typename vobj::scalar_object sobj;
  GridBase  *grid = Data._grid;
  assert(grid!=NULL);

-  // FIXME
-  // std::cout<<GridLogMessage<<"WARNING ! SliceSum is unthreaded "<<grid->SumArraySize()<<" threads "<<std::endl;
-
  const int    Nd = grid->_ndimension;
  const int Nsimd = grid->Nsimd();
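Note on the hunk above: `innerProduct` now accumulates through `innerProductD` and `vector_typeD`, which appears to promote the running sum to double precision, and the new FIXME in `sliceSum` flags the same concern for slice sums. The sketch below illustrates why promotion matters, using plain `float` data. `sumSingle` and `sumPromoted` are hypothetical names, not Grid functions, and the behaviour described assumes IEEE single-precision arithmetic without extended intermediate precision.

```cpp
#include <vector>

// Accumulate in single precision: once the running sum reaches 2^24,
// adding 1.0f no longer changes it (assumes IEEE float arithmetic).
inline float sumSingle(const std::vector<float> &v) {
  float acc = 0.0f;
  for (float x : v) acc += x;
  return acc;
}

// Accumulate the same single-precision data in a double-precision sum.
inline double sumPromoted(const std::vector<float> &v) {
  double acc = 0.0;
  for (float x : v) acc += static_cast<double>(x);
  return acc;
}
```

With the input `{16777216.0f, 1.0f, 1.0f, 1.0f, 1.0f}` the promoted sum is exactly 16777220, while the single-precision accumulator typically stays at 16777216 because each increment is rounded away.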

@@ -163,23 +168,31 @@ template<class vobj> inline void sliceSum(const Lattice<vobj> &Data,std::vector<
  int rd=grid->_rdimensions[orthogdim];

  std::vector<vobj,alignedAllocator<vobj> > lvSum(rd); // will locally sum vectors first
-  std::vector<sobj> lsSum(ld,zero); // sum across these down to scalars
-  std::vector<sobj> extracted(Nsimd);     // splitting the SIMD
+  std::vector<sobj> lsSum(ld,zero);                    // sum across these down to scalars
+  std::vector<sobj> extracted(Nsimd);                  // splitting the SIMD

-  result.resize(fd); // And then global sum to return the same vector to every node for IO to file
+  result.resize(fd); // And then global sum to return the same vector to every node 
  for(int r=0;r<rd;r++){
    lvSum[r]=zero;
  }

-  std::vector<int>  coor(Nd);  
+  int e1=    grid->_slice_nblock[orthogdim];
+  int e2=    grid->_slice_block [orthogdim];
+  int stride=grid->_slice_stride[orthogdim];

  // sum over reduced dimension planes, breaking out orthog dir
+  // Parallel over orthog direction
+  parallel_for(int r=0;r<rd;r++){

-  for(int ss=0;ss<grid->oSites();ss++){
-    Lexicographic::CoorFromIndex(coor,ss,grid->_rdimensions);
-    int r = coor[orthogdim];
-    lvSum[r]=lvSum[r]+Data._odata[ss];
-  }  
+    int so=r*grid->_ostride[orthogdim]; // base offset for start of plane 
+
+    for(int n=0;n<e1;n++){
+      for(int b=0;b<e2;b++){
+	int ss= so+n*stride+b;
+	lvSum[r]=lvSum[r]+Data._odata[ss];
+      }
+    }
+  }

  // Sum across simd lanes in the plane, breaking out orthog dir.
  std::vector<int> icoor(Nd);
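Note on the hunk above: the per-site coordinate decode is replaced by a strided walk over each plane, which makes the loop over `r` independent and therefore safe to thread. The sketch below shows the same traversal on a plain array. It assumes a lexicographic layout with dimension 0 fastest, and that `block`, `nblock`, and `stride` play the roles of `_slice_block`, `_slice_nblock`, and `_slice_stride` (with `_ostride` equal to `block`). The function name and signature are hypothetical.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for the strided plane sum in the diff.
// Assumption: lexicographic layout, dimension 0 fastest.
inline std::vector<double> sliceSumStrided(const std::vector<double> &data,
                                           const std::vector<int> &dims,
                                           int orthog) {
  const int nd = static_cast<int>(dims.size());
  assert(orthog >= 0 && orthog < nd);

  int block = 1;   // sites contiguous below the orthogonal dimension
  for (int d = 0; d < orthog; d++) block *= dims[d];
  int nblock = 1;  // number of such blocks above the orthogonal dimension
  for (int d = orthog + 1; d < nd; d++) nblock *= dims[d];

  const int rd = dims[orthog];
  const int stride = block * rd;  // distance between blocks of the same plane
  assert(static_cast<int>(data.size()) == stride * nblock);

  std::vector<double> result(rd, 0.0);
  for (int r = 0; r < rd; r++) {  // planes are independent: could be threaded
    const int so = r * block;     // base offset for the start of plane r
    for (int n = 0; n < nblock; n++) {
      for (int b = 0; b < block; b++) {
        result[r] += data[so + n * stride + b];
      }
    }
  }
  return result;
}
```

Each plane writes only to its own `result[r]`, so no reduction across threads is needed in this step, which is presumably why the diff parallelises over `r` rather than over sites.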
@@ -214,10 +227,304 @@ template<class vobj> inline void sliceSum(const Lattice<vobj> &Data,std::vector<

    result[t]=gsum;
  }
+}

+template<class vobj>
+static void sliceInnerProductVector( std::vector<ComplexD> & result, const Lattice<vobj> &lhs,const Lattice<vobj> &rhs,int orthogdim) 
+{
+  typedef typename vobj::vector_type   vector_type;
+  typedef typename vobj::scalar_type   scalar_type;
+  GridBase  *grid = lhs._grid;
+  assert(grid!=NULL);
+  conformable(grid,rhs._grid);
+
+  const int    Nd = grid->_ndimension;
+  const int Nsimd = grid->Nsimd();
+
+  assert(orthogdim >= 0);
+  assert(orthogdim < Nd);
+
+  int fd=grid->_fdimensions[orthogdim];
+  int ld=grid->_ldimensions[orthogdim];
+  int rd=grid->_rdimensions[orthogdim];
+
+  std::vector<vector_type,alignedAllocator<vector_type> > lvSum(rd); // will locally sum vectors first
+  std::vector<scalar_type > lsSum(ld,scalar_type(0.0));                    // sum across these down to scalars
+  std::vector<iScalar<scalar_type> > extracted(Nsimd);                  // splitting the SIMD
+
+  result.resize(fd); // And then global sum to return the same vector to every node for IO to file
+  for(int r=0;r<rd;r++){
+    lvSum[r]=zero;
+  }
+
+  int e1=    grid->_slice_nblock[orthogdim];
+  int e2=    grid->_slice_block [orthogdim];
+  int stride=grid->_slice_stride[orthogdim];
+
+  parallel_for(int r=0;r<rd;r++){
+
+    int so=r*grid->_ostride[orthogdim]; // base offset for start of plane 
+
+    for(int n=0;n<e1;n++){
+      for(int b=0;b<e2;b++){
+	int ss= so+n*stride+b;
+	vector_type vv = TensorRemove(innerProduct(lhs._odata[ss],rhs._odata[ss]));
+	lvSum[r]=lvSum[r]+vv;
+      }
+    }
+  }
+
+  // Sum across simd lanes in the plane, breaking out orthog dir.
+  std::vector<int> icoor(Nd);
+  for(int rt=0;rt<rd;rt++){
+
+    iScalar<vector_type> temp; 
+    temp._internal = lvSum[rt];
+    extract(temp,extracted);
+
+    for(int idx=0;idx<Nsimd;idx++){
+
+      grid->iCoorFromIindex(icoor,idx);
+
+      int ldx =rt+icoor[orthogdim]*rd;
+
+      lsSum[ldx]=lsSum[ldx]+extracted[idx]._internal;
+
+    }
+  }
+  
+  // sum over nodes.
+  scalar_type gsum;
+  for(int t=0;t<fd;t++){
+    int pt = t/ld; // processor plane
+    int lt = t%ld;
+    if ( pt == grid->_processor_coor[orthogdim] ) {
+      gsum=lsSum[lt];
+    } else {
+      gsum=scalar_type(0.0);
+    }
+
+    grid->GlobalSum(gsum);
+
+    result[t]=gsum;
+  }
+}
+template<class vobj>
+static void sliceNorm (std::vector<RealD> &sn,const Lattice<vobj> &rhs,int Orthog) 
+{
+  typedef typename vobj::scalar_object sobj;
+  typedef typename vobj::scalar_type scalar_type;
+  typedef typename vobj::vector_type vector_type;
+  
+  int Nblock = rhs._grid->GlobalDimensions()[Orthog];
+  std::vector<ComplexD> ip(Nblock);
+  sn.resize(Nblock);
+  
+  sliceInnerProductVector(ip,rhs,rhs,Orthog);
+  for(int ss=0;ss<Nblock;ss++){
+    sn[ss] = real(ip[ss]);
+  }
+};
+
+
+template<class vobj>
+static void sliceMaddVector(Lattice<vobj> &R,std::vector<RealD> &a,const Lattice<vobj> &X,const Lattice<vobj> &Y,
+			    int orthogdim,RealD scale=1.0) 
+{    
+  typedef typename vobj::scalar_object sobj;
+  typedef typename vobj::scalar_type scalar_type;
+  typedef typename vobj::vector_type vector_type;
+  typedef typename vobj::tensor_reduced tensor_reduced;
+  
+  GridBase *grid  = X._grid;
+
+  int Nsimd  =grid->Nsimd();
+  int Nblock =grid->GlobalDimensions()[orthogdim];
+
+  int fd     =grid->_fdimensions[orthogdim];
+  int ld     =grid->_ldimensions[orthogdim];
+  int rd     =grid->_rdimensions[orthogdim];
+
+  int e1     =grid->_slice_nblock[orthogdim];
+  int e2     =grid->_slice_block [orthogdim];
+  int stride =grid->_slice_stride[orthogdim];
+
+  std::vector<int> icoor;
+
+  for(int r=0;r<rd;r++){
+
+    int so=r*grid->_ostride[orthogdim]; // base offset for start of plane 
+
+    vector_type    av;
+
+    for(int l=0;l<Nsimd;l++){
+      grid->iCoorFromIindex(icoor,l);
+      int ldx =r+icoor[orthogdim]*rd;
+      scalar_type *as =(scalar_type *)&av;
+      as[l] = scalar_type(a[ldx])*scale;
+    }
+
+    tensor_reduced at; at=av;
+
+    parallel_for_nest2(int n=0;n<e1;n++){
+      for(int b=0;b<e2;b++){
+	int ss= so+n*stride+b;
+	R._odata[ss] = at*X._odata[ss]+Y._odata[ss];
+      }
+    }
+  }
+};
+
+
+/*
+template<class vobj>
+static void sliceMaddVectorSlow (Lattice<vobj> &R,std::vector<RealD> &a,const Lattice<vobj> &X,const Lattice<vobj> &Y,
+			     int Orthog,RealD scale=1.0) 
+{    
+  // FIXME: Implementation is slow
+  // Best base the linear combination by constructing a 
+  // set of vectors of size grid->_rdimensions[Orthog].
+  typedef typename vobj::scalar_object sobj;
+  typedef typename vobj::scalar_type scalar_type;
+  typedef typename vobj::vector_type vector_type;
+  
+  int Nblock = X._grid->GlobalDimensions()[Orthog];
+  
+  GridBase *FullGrid  = X._grid;
+  GridBase *SliceGrid = makeSubSliceGrid(FullGrid,Orthog);
+  
+  Lattice<vobj> Xslice(SliceGrid);
+  Lattice<vobj> Rslice(SliceGrid);
+  // If we based this on Cshift it would work for spread out
+  // but it would be even slower
+  for(int i=0;i<Nblock;i++){
+    ExtractSlice(Rslice,Y,i,Orthog);
+    ExtractSlice(Xslice,X,i,Orthog);
+    Rslice = Rslice + Xslice*(scale*a[i]);
+    InsertSlice(Rslice,R,i,Orthog);
+  }
+};
+
+template<class vobj>
+static void sliceInnerProductVectorSlow( std::vector<ComplexD> & vec, const Lattice<vobj> &lhs,const Lattice<vobj> &rhs,int Orthog) 
+  {
+    // FIXME: Implementation is slow
+    // Look at localInnerProduct implementation,
+    // and do inside a site loop with block strided iterators
+    typedef typename vobj::scalar_object sobj;
+    typedef typename vobj::scalar_type scalar_type;
+    typedef typename vobj::vector_type vector_type;
+    typedef typename vobj::tensor_reduced scalar;
+    typedef typename scalar::scalar_object  scomplex;
+  
+    int Nblock = lhs._grid->GlobalDimensions()[Orthog];
+
+    vec.resize(Nblock);
+    std::vector<scomplex> sip(Nblock);
+    Lattice<scalar> IP(lhs._grid); 
+
+    IP=localInnerProduct(lhs,rhs);
+    sliceSum(IP,sip,Orthog);
+  
+    for(int ss=0;ss<Nblock;ss++){
+      vec[ss] = TensorRemove(sip[ss]);
+    }
+  }
+*/
+
+//////////////////////////////////////////////////////////////////////////////////////////
+// FIXME: Implementation is slow
+// If we based this on Cshift it would work for spread out
+// but it would be even slower
+//
+// Repeated extract slice is inefficient
+//
+// Better to base the linear combination on a 
+// set of vectors of size grid->_rdimensions[Orthog].
+//////////////////////////////////////////////////////////////////////////////////////////
+
+inline GridBase         *makeSubSliceGrid(const GridBase *BlockSolverGrid,int Orthog)
+{
+  int NN    = BlockSolverGrid->_ndimension;
+  int nsimd = BlockSolverGrid->Nsimd();
+  
+  std::vector<int> latt_phys(0);
+  std::vector<int> simd_phys(0);
+  std::vector<int>  mpi_phys(0);
+  
+  for(int d=0;d<NN;d++){
+    if( d!=Orthog ) { 
+      latt_phys.push_back(BlockSolverGrid->_fdimensions[d]);
+      simd_phys.push_back(BlockSolverGrid->_simd_layout[d]);
+      mpi_phys.push_back(BlockSolverGrid->_processors[d]);
+    }
+  }
+  return (GridBase *)new GridCartesian(latt_phys,simd_phys,mpi_phys); 
 }


+template<class vobj>
+static void sliceMaddMatrix (Lattice<vobj> &R,Eigen::MatrixXcd &aa,const Lattice<vobj> &X,const Lattice<vobj> &Y,int Orthog,RealD scale=1.0) 
+{    
+  typedef typename vobj::scalar_object sobj;
+  typedef typename vobj::scalar_type scalar_type;
+  typedef typename vobj::vector_type vector_type;
+
+  int Nblock = X._grid->GlobalDimensions()[Orthog];
+  
+  GridBase *FullGrid  = X._grid;
+  GridBase *SliceGrid = makeSubSliceGrid(FullGrid,Orthog);
+  
+  Lattice<vobj> Xslice(SliceGrid);
+  Lattice<vobj> Rslice(SliceGrid);
+  
+  for(int i=0;i<Nblock;i++){
+    ExtractSlice(Rslice,Y,i,Orthog);
+    for(int j=0;j<Nblock;j++){
+      ExtractSlice(Xslice,X,j,Orthog);
+      Rslice = Rslice + Xslice*(scale*aa(j,i));
+    }
+    InsertSlice(Rslice,R,i,Orthog);
+  }
+};
+
+template<class vobj>
+static void sliceInnerProductMatrix(  Eigen::MatrixXcd &mat, const Lattice<vobj> &lhs,const Lattice<vobj> &rhs,int Orthog) 
+{
+  // FIXME: Implementation is slow
+  // Not sure of best solution.. think about it
+  typedef typename vobj::scalar_object sobj;
+  typedef typename vobj::scalar_type scalar_type;
+  typedef typename vobj::vector_type vector_type;
+  
+  GridBase *FullGrid  = lhs._grid;
+  GridBase *SliceGrid = makeSubSliceGrid(FullGrid,Orthog);
+  
+  int Nblock = FullGrid->GlobalDimensions()[Orthog];
+  
+  Lattice<vobj> Lslice(SliceGrid);
+  Lattice<vobj> Rslice(SliceGrid);
+  
+  mat = Eigen::MatrixXcd::Zero(Nblock,Nblock);
+  
+  for(int i=0;i<Nblock;i++){
+    ExtractSlice(Lslice,lhs,i,Orthog);
+    for(int j=0;j<Nblock;j++){
+      ExtractSlice(Rslice,rhs,j,Orthog);
+      mat(i,j) = innerProduct(Lslice,Rslice);
+    }
+  }
+#undef FORCE_DIAG
+#ifdef FORCE_DIAG
+  for(int i=0;i<Nblock;i++){
+    for(int j=0;j<Nblock;j++){
+      if ( i != j ) mat(i,j)=0.0;
+    }
+  }
+#endif
+  return;
 }
+
+} /*END NAMESPACE GRID*/
 #endif
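
The loops in `sliceMaddMatrix` and `sliceInnerProductMatrix` above reduce to small dense linear algebra over the `Orthog` index. What follows is a minimal sketch of that arithmetic on plain vectors, not Grid code. `Slices`, `Dense`, `slice_madd_matrix` and `slice_inner_product_matrix` are hypothetical names, each slice is a flat `std::vector`, and the inner product is assumed to conjugate its left argument.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

using Cplx   = std::complex<double>;
using Slices = std::vector<std::vector<Cplx>>;  // hypothetical: Slices[i] is slice i along Orthog
using Dense  = std::vector<std::vector<Cplx>>;  // small Nblock x Nblock matrix

// R_i = Y_i + scale * sum_j a[j][i] * X_j  (same index order as aa(j,i) in the diff)
Slices slice_madd_matrix(const Dense &a, const Slices &X, const Slices &Y, double scale = 1.0)
{
  assert(X.size() == Y.size() && a.size() == X.size());
  const std::size_t nblock = X.size();
  Slices R = Y;  // start from Y, as ExtractSlice(Rslice,Y,i,Orthog) does
  for (std::size_t i = 0; i < nblock; ++i) {
    for (std::size_t j = 0; j < nblock; ++j) {
      const Cplx coeff = scale * a[j][i];
      for (std::size_t s = 0; s < X[j].size(); ++s) {
        R[i][s] += X[j][s] * coeff;
      }
    }
  }
  return R;
}

// mat[i][j] = sum_s conj(L_i[s]) * R_j[s]
Dense slice_inner_product_matrix(const Slices &L, const Slices &R)
{
  assert(L.size() == R.size());
  const std::size_t nblock = L.size();
  Dense mat(nblock, std::vector<Cplx>(nblock, Cplx(0.0, 0.0)));
  for (std::size_t i = 0; i < nblock; ++i) {
    for (std::size_t j = 0; j < nblock; ++j) {
      for (std::size_t s = 0; s < L[i].size(); ++s) {
        mat[i][j] += std::conj(L[i][s]) * R[j][s];
      }
    }
  }
  return mat;
}
```

Both routines visit every slice `Nblock` times, which mirrors the repeated `ExtractSlice` calls that the FIXME comments flag as slow.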

@@ -30,12 +30,19 @@ Author: paboyle <paboyle@ph.ed.ac.uk>
 #define GRID_LATTICE_RNG_H

 #include <random>
+
+#ifdef RNG_SITMO
 #include <Grid/sitmo_rng/sitmo_prng_engine.hpp>
+#endif 
+
+#if defined(RNG_SITMO)
+#define RNG_FAST_DISCARD
+#else 
+#undef  RNG_FAST_DISCARD
+#endif

 namespace Grid {

-  //http://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.800-90Ar1.pdf ?
-
  //////////////////////////////////////////////////////////////
  // Allow the RNG state to be less dense than the fine grid
  //////////////////////////////////////////////////////////////
@@ -65,120 +72,139 @@ namespace Grid {

      multiplicity = multiplicity *fine->_rdimensions[fd] / coarse->_rdimensions[d]; 
    }
-
    return multiplicity;
  }

-  // Wrap seed_seq to give common interface with random_device
-  // Should rather wrap random_device and have a generate
-  class fixedSeed {
-  public:
-
-    typedef std::seed_seq::result_type result_type;
-
-    std::seed_seq src;
-    
-    template<class int_type> fixedSeed(const std::vector<int_type> &seeds) : src(seeds.begin(),seeds.end()) {};
-
-    template< class RandomIt > void generate( RandomIt begin, RandomIt end ) {
-      src.generate(begin,end);
-    }
-
-  };
-
-
-  class deviceSeed {
-  public:
-
-    std::random_device rd;
-
-    typedef std::random_device::result_type result_type;
-    
-    deviceSeed(void) : rd(){};
-
-    template< class RandomIt > void generate( RandomIt begin, RandomIt end ) {
-      for(RandomIt it=begin; it!=end;it++){
-	*it = rd();
-      }
-    }
-  };
-
  // real scalars are one component
-  template<class scalar,class distribution,class generator> void fillScalar(scalar &s,distribution &dist,generator & gen)
+  template<class scalar,class distribution,class generator> 
+  void fillScalar(scalar &s,distribution &dist,generator & gen)
  {
    s=dist(gen);
  }
-  template<class distribution,class generator> void fillScalar(ComplexF &s,distribution &dist, generator &gen)
+  template<class distribution,class generator> 
+  void fillScalar(ComplexF &s,distribution &dist, generator &gen)
  {
    s=ComplexF(dist(gen),dist(gen));
  }
-  template<class distribution,class generator> void fillScalar(ComplexD &s,distribution &dist,generator &gen)
+  template<class distribution,class generator> 
+  void fillScalar(ComplexD &s,distribution &dist,generator &gen)
  {
    s=ComplexD(dist(gen),dist(gen));
  }
  
  class GridRNGbase {
-
  public:
-
-    int _seeded;
    // One generator per site.
    // Uniform and Gaussian distributions from these generators.
 #ifdef RNG_RANLUX
-    typedef uint64_t      RngStateType;
    typedef std::ranlux48 RngEngine;
+    typedef uint64_t      RngStateType;
    static const int RngStateCount = 15;
-#elif RNG_MT19937 
+#endif 
+#ifdef RNG_MT19937 
    typedef std::mt19937 RngEngine;
    typedef uint32_t     RngStateType;
    static const int     RngStateCount = std::mt19937::state_size;
-#elif RNG_SITMO
+#endif
+#ifdef RNG_SITMO
    typedef sitmo::prng_engine 	RngEngine;
    typedef uint64_t    	RngStateType;
    static const int    	RngStateCount = 4;
 #endif
-    std::vector<RngEngine>                             _generators;
-    std::vector<std::uniform_real_distribution<RealD>> _uniform;
-    std::vector<std::normal_distribution<RealD>>       _gaussian;
-    std::vector<std::discrete_distribution<int32_t>>   _bernoulli;

-    void GetState(std::vector<RngStateType> & saved,int gen) {
+    std::vector<RngEngine>                             _generators;
+    std::vector<std::uniform_real_distribution<RealD> > _uniform;
+    std::vector<std::normal_distribution<RealD> >       _gaussian;
+    std::vector<std::discrete_distribution<int32_t> >   _bernoulli;
+    std::vector<std::uniform_int_distribution<uint32_t> > _uid;
+
+    ///////////////////////
+    // support for parallel init
+    ///////////////////////
+#ifdef RNG_FAST_DISCARD
+    static void Skip(RngEngine &eng)
+    {
+      /////////////////////////////////////////////////////////////////////////////////////
+      // Skip by 2^40 elements between successive lattice sites
+      // This skips by about 10^12.
+      // Consider quenched updating; likely never exceeding a rate of 1000 sweeps
+      // per second on any machine. This gives us of order 10^9 seconds, or roughly
+      // 30 years, of skip ahead.
+      // For HMC unlikely to go at faster than a solve per second, and 
+      // tens of seconds per trajectory so this is clean in all reasonable cases,
+      // and margin of safety is orders of magnitude.
+      // We could hack Sitmo to skip in the higher order words of state if necessary
+      /////////////////////////////////////////////////////////////////////////////////////
+      uint64_t skip = 0x1; skip = skip<<40;
+      eng.discard(skip);
+    } 
+#endif
+    static RngEngine Reseed(RngEngine &eng)
+    {
+      std::vector<uint32_t> newseed;
+      std::uniform_int_distribution<uint32_t> uid;
+      return Reseed(eng,newseed,uid);
+    }
+    static RngEngine Reseed(RngEngine &eng,std::vector<uint32_t> & newseed,
+			    std::uniform_int_distribution<uint32_t> &uid)
+    {
+      const int reseeds=4;
+      
+      newseed.resize(reseeds);
+      for(int i=0;i<reseeds;i++){
+	newseed[i] = uid(eng);
+      }
+      std::seed_seq sseq(newseed.begin(),newseed.end());
+      return RngEngine(sseq);
+    }    
+
+    void GetState(std::vector<RngStateType> & saved,RngEngine &eng) {
      saved.resize(RngStateCount);
      std::stringstream ss;
-      ss<<_generators[gen];
+      ss<<eng;
      ss.seekg(0,ss.beg);
      for(int i=0;i<RngStateCount;i++){
 	ss>>saved[i];
      }
    }
-    void SetState(std::vector<RngStateType> & saved,int gen){
+    void GetState(std::vector<RngStateType> & saved,int gen) {
+      GetState(saved,_generators[gen]);
+    }
+    void SetState(std::vector<RngStateType> & saved,RngEngine &eng){
      assert(saved.size()==RngStateCount);
      std::stringstream ss;
      for(int i=0;i<RngStateCount;i++){
 	ss<< saved[i]<<" ";
      }
      ss.seekg(0,ss.beg);
-      ss>>_generators[gen];
+      ss>>eng;
    }
+    void SetState(std::vector<RngStateType> & saved,int gen){
+      SetState(saved,_generators[gen]);
+    }
+    void SetEngine(RngEngine &Eng, int gen){
+      _generators[gen]=Eng;
+    }
+    void GetEngine(RngEngine &Eng, int gen){
+      Eng=_generators[gen];
+    }
+    template<class source> void Seed(source &src, int gen)
+    {
+      _generators[gen] = RngEngine(src);
+    }    
  };

  class GridSerialRNG : public GridRNGbase {
  public:

-    // FIXME ... do we require lockstep draws of randoms 
-    // from all nodes keeping seeds consistent.
-    // place a barrier/broadcast in the fill routine
-
    GridSerialRNG() : GridRNGbase() {
      _generators.resize(1);
      _uniform.resize(1,std::uniform_real_distribution<RealD>{0,1});
      _gaussian.resize(1,std::normal_distribution<RealD>(0.0,1.0) );
      _bernoulli.resize(1,std::discrete_distribution<int32_t>{1,1});
-      _seeded=0;
+      _uid.resize(1,std::uniform_int_distribution<uint32_t>() );
    }

-
-
    template <class sobj,class distribution> inline void fill(sobj &l,std::vector<distribution> &dist){

      typedef typename sobj::scalar_type scalar_type;
@@ -191,7 +217,7 @@ namespace Grid {
      for(int idx=0;idx<words;idx++){
 	fillScalar(buf[idx],dist[0],_generators[0]);
      }
-      
+
      CartesianCommunicator::BroadcastWorld(0,(void *)&l,sizeof(l));

    };
@@ -250,28 +276,18 @@ namespace Grid {
      CartesianCommunicator::BroadcastWorld(0,(void *)&l,sizeof(l));
    }

-    template<class source> void Seed(source &src)
-    {
-      _generators[0] = RngEngine(src);
-      _seeded=1;
-    }    
-    void SeedRandomDevice(void){
-      deviceSeed src;
-      Seed(src);
-    }
    void SeedFixedIntegers(const std::vector<int> &seeds){
      CartesianCommunicator::BroadcastWorld(0,(void *)&seeds[0],sizeof(int)*seeds.size());
-      fixedSeed src(seeds);
-      Seed(src);
+      std::seed_seq src(seeds.begin(),seeds.end());
+      Seed(src,0);
    }
-
  };

  class GridParallelRNG : public GridRNGbase {
  public:
-
    GridBase *_grid;
    int _vol;
+  public:

    int generator_idx(int os,int is){
      return is*_grid->oSites()+os;
@@ -285,15 +301,9 @@ namespace Grid {
      _uniform.resize(_vol,std::uniform_real_distribution<RealD>{0,1});
      _gaussian.resize(_vol,std::normal_distribution<RealD>(0.0,1.0) );
      _bernoulli.resize(_vol,std::discrete_distribution<int32_t>{1,1});
-      _seeded=0;
+      _uid.resize(_vol,std::uniform_int_distribution<uint32_t>() );
    }

-
-
-    //FIXME implement generic IO and create state save/restore
-    //void SaveState(const std::string<char> &file);
-    //void LoadState(const std::string<char> &file);
-
    template <class vobj,class distribution> inline void fill(Lattice<vobj> &l,std::vector<distribution> &dist){

      typedef typename vobj::scalar_object scalar_object;
@@ -306,7 +316,6 @@ namespace Grid {
      int     osites=_grid->oSites();
      int words=sizeof(scalar_object)/sizeof(scalar_type);

-
      parallel_for(int ss=0;ss<osites;ss++){

 	std::vector<scalar_object> buf(Nsimd);
@@ -329,104 +338,114 @@ namespace Grid {
      }
    };

-    // This loop could be made faster to avoid the Ahmdahl by
-    // i)  seed generators on each timeslice, for x=y=z=0;
-    // ii) seed generators on each z for x=y=0
-    // iii)seed generators on each y,z for x=0
-    // iv) seed generators on each y,z,x 
-    // made possible by physical indexing.
-    template<class source> void Seed(source &src)
-    {
+    void SeedFixedIntegers(const std::vector<int> &seeds){

-      typedef typename source::result_type seed_t;
-      std::uniform_int_distribution<seed_t> uid;
+      // Everyone generates the same seed_seq based on input seeds
+      CartesianCommunicator::BroadcastWorld(0,(void *)&seeds[0],sizeof(int)*seeds.size());

-      int numseed=4;
-      int gsites = _grid->_gsites;
-      std::vector<seed_t> site_init(numseed);
+      std::seed_seq source(seeds.begin(),seeds.end());
+
+      RngEngine master_engine(source);
+
+#ifdef RNG_FAST_DISCARD
+      ////////////////////////////////////////////////
+      // Skip ahead through a single stream.
+      // Applicable to SITMO and other hash based/crypto RNGs
+      // Should be applicable to Mersenne Twister, but the C++11
+      // MT implementation does not implement fast discard even though
+      // in principle this is possible
+      ////////////////////////////////////////////////
      std::vector<int> gcoor;
+      int rank,o_idx,i_idx;

+      // Everybody loops over global volume.
+      for(int gidx=0;gidx<_grid->_gsites;gidx++){

-      // Master RngEngine
-      std::vector<seed_t> master_init(numseed);  src.generate(master_init.begin(),master_init.end());
-      _grid->Broadcast(0,(void *)&master_init[0],sizeof(seed_t)*numseed);
-      fixedSeed master_seed(master_init);
-      RngEngine master_engine(master_seed);
-
-      // Per node RngEngine
-      std::vector<seed_t> node_init(numseed);
-      for(int r=0;r<_grid->ProcessorCount();r++) {
-
-	std::vector<seed_t> rank_init(numseed);
-	for(int i=0;i<numseed;i++) rank_init[i] = uid(master_engine);
-
-	std::cout << GridLogMessage << "SeedSeq for rank "<<r;
-	for(int i=0;i<numseed;i++) std::cout<<" "<<rank_init[i];
-	std::cout <<std::endl;
-
-	if ( r==_grid->ThisRank() ) { 
-	  for(int i=0;i<numseed;i++) node_init[i] = rank_init[i];
-	}
-
-      }
-
-      ////////////////////////////////////////////////////
-      // Set up a seed_seq wrapper with these 8 words
-      // and draw for each site within node.
-      ////////////////////////////////////////////////////
-      fixedSeed node_seed(node_init);
-      RngEngine node_engine(node_seed);
-
-      for(int gidx=0;gidx<gsites;gidx++){
-	int rank,o_idx,i_idx;
+	Skip(master_engine); // Skip to next RNG sequence

+	// Where is it?
 	_grid->GlobalIndexToGlobalCoor(gidx,gcoor);
 	_grid->GlobalCoorToRankIndex(rank,o_idx,i_idx,gcoor);

+	// If this is one of mine we take it
 	if( rank == _grid->ThisRank() ){
 	  int l_idx=generator_idx(o_idx,i_idx);
-	  for(int i=0;i<numseed;i++)  site_init[i] = uid(node_engine);
-	  fixedSeed site_seed(site_init);
-	  _generators[l_idx] = RngEngine(site_seed);
+	  _generators[l_idx] = master_engine;
+	}
+
+      }
+#else 
+      ////////////////////////////////////////////////////////////////
+      // Machine and thread decomposition dependent seeding is efficient
+      // and maximally parallel; but NOT reproducible from machine to machine. 
+      // Not ideal, but fastest way to reseed all nodes.
+      ////////////////////////////////////////////////////////////////
+      {
+	// Obtain one Reseed per processor
+	int Nproc = _grid->ProcessorCount();
+	std::vector<RngEngine> seeders(Nproc);
+	int me= _grid->ThisRank();
+	for(int p=0;p<Nproc;p++){
+	  seeders[p] = Reseed(master_engine);
+	}
+	master_engine = seeders[me];
+      }
+
+      {
+	// Obtain one reseeded generator per thread
+	int Nthread = GridThread::GetThreads();
+	std::vector<RngEngine> seeders(Nthread);
+	for(int t=0;t<Nthread;t++){
+	  seeders[t] = Reseed(master_engine);
+	}
+
+	parallel_for(int t=0;t<Nthread;t++) {
+	  // set up one per local site in threaded fashion
+	  std::vector<uint32_t> newseeds;
+	  std::uniform_int_distribution<uint32_t> uid;	
+	  for(int l=0;l<_grid->lSites();l++) {
+	    if ( (l%Nthread)==t ) {
+	      _generators[l] = Reseed(seeders[t],newseeds,uid);
+	    }
+	  }
 	}
      }
-      _seeded=1;
-    }    
-    void SeedRandomDevice(void){
-      deviceSeed src;
-      Seed(src);
+#endif
    }
-    void SeedFixedIntegers(const std::vector<int> &seeds){
-      CartesianCommunicator::BroadcastWorld(0,(void *)&seeds[0],sizeof(int)*seeds.size());
-      fixedSeed src(seeds);
-      Seed(src);
+    ////////////////////////////////////////////////////////////////////////
+    // Support for rigorous tests of RNGs
+    // Return uniform random uint32_t from requested site generator
+    ////////////////////////////////////////////////////////////////////////
+    uint32_t GlobalU01(int gsite){
+
+      uint32_t the_number;
+
+      // who
+      std::vector<int> gcoor;
+      int rank,o_idx,i_idx;
+      _grid->GlobalIndexToGlobalCoor(gsite,gcoor);
+      _grid->GlobalCoorToRankIndex(rank,o_idx,i_idx,gcoor);
+
+      // draw
+      int l_idx=generator_idx(o_idx,i_idx);
+      if( rank == _grid->ThisRank() ){
+	the_number = _uid[l_idx](_generators[l_idx]);
+      }
+      
+      // share & return
+      _grid->Broadcast(rank,(void *)&the_number,sizeof(the_number));
+      return the_number;
    }

  };

-  template <class vobj> inline void random(GridParallelRNG &rng,Lattice<vobj> &l){
-    rng.fill(l,rng._uniform);
-  }
+  template <class vobj> inline void random(GridParallelRNG &rng,Lattice<vobj> &l)   { rng.fill(l,rng._uniform);  }
+  template <class vobj> inline void gaussian(GridParallelRNG &rng,Lattice<vobj> &l) { rng.fill(l,rng._gaussian); }
+  template <class vobj> inline void bernoulli(GridParallelRNG &rng,Lattice<vobj> &l){ rng.fill(l,rng._bernoulli);}

-  template <class vobj> inline void gaussian(GridParallelRNG &rng,Lattice<vobj> &l){
-    rng.fill(l,rng._gaussian);
-  }
-  
-  template <class vobj> inline void bernoulli(GridParallelRNG &rng,Lattice<vobj> &l){
-    rng.fill(l,rng._bernoulli);
-  }
-
-  template <class sobj> inline void random(GridSerialRNG &rng,sobj &l){
-    rng.fill(l,rng._uniform);
-  }
-  
-  template <class sobj> inline void gaussian(GridSerialRNG &rng,sobj &l){
-    rng.fill(l,rng._gaussian);
-  }
-  
-  template <class sobj> inline void bernoulli(GridSerialRNG &rng,sobj &l){
-    rng.fill(l,rng._bernoulli);
-  }
+  template <class sobj> inline void random(GridSerialRNG &rng,sobj &l)   { rng.fill(l,rng._uniform  ); }
+  template <class sobj> inline void gaussian(GridSerialRNG &rng,sobj &l) { rng.fill(l,rng._gaussian ); }
+  template <class sobj> inline void bernoulli(GridSerialRNG &rng,sobj &l){ rng.fill(l,rng._bernoulli); }

 }
 #endif
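
The `RNG_FAST_DISCARD` branch of `SeedFixedIntegers` above cuts one master stream into per-site substreams, so each site's generator depends only on the seeds and the global site index and not on the rank layout. Below is a minimal sketch of that idea using `std::mt19937_64` as a stand-in engine. `seed_sites_by_skip` is a hypothetical name, and the caller is expected to pass a small `skip`, because the standard engine's `discard` cost grows with its argument, unlike the 2^40 stride used with Sitmo.

```cpp
#include <cassert>
#include <cstdint>
#include <random>
#include <vector>

// Build one engine per global site from a single master stream.
// In the diff each rank runs the same loop and keeps only the sites it owns;
// this sketch keeps them all.
std::vector<std::mt19937_64> seed_sites_by_skip(const std::vector<int> &seeds,
                                                int global_sites,
                                                std::uint64_t skip)
{
  assert(global_sites >= 0 && skip > 0);
  std::seed_seq source(seeds.begin(), seeds.end());
  std::mt19937_64 master(source);

  std::vector<std::mt19937_64> site_engines;
  site_engines.reserve(static_cast<std::size_t>(global_sites));
  for (int g = 0; g < global_sites; ++g) {
    master.discard(skip);            // advance to the next substream
    site_engines.push_back(master);  // site g starts from the current master state
  }
  return site_engines;
}
```

Site `g` therefore starts `(g+1)*skip` draws into the master sequence, whichever rank owns it. The substreams stay disjoint only while no site draws more than `skip` numbers.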
@@ -1,4 +1,4 @@
-    /*************************************************************************************
+/*************************************************************************************

    Grid physics library, www.github.com/paboyle/Grid 

@@ -359,7 +359,7 @@ void localConvert(const Lattice<vobj> &in,Lattice<vvobj> &out)


 template<class vobj>
-void InsertSlice(Lattice<vobj> &lowDim,Lattice<vobj> & higherDim,int slice, int orthog)
+void InsertSlice(const Lattice<vobj> &lowDim,Lattice<vobj> & higherDim,int slice, int orthog)
 {
  typedef typename vobj::scalar_object sobj;

@@ -401,7 +401,7 @@ void InsertSlice(Lattice<vobj> &lowDim,Lattice<vobj> & higherDim,int slice, int
 }

 template<class vobj>
-void ExtractSlice(Lattice<vobj> &lowDim, Lattice<vobj> & higherDim,int slice, int orthog)
+void ExtractSlice(Lattice<vobj> &lowDim,const Lattice<vobj> & higherDim,int slice, int orthog)
 {
  typedef typename vobj::scalar_object sobj;

@@ -444,7 +444,7 @@ void ExtractSlice(Lattice<vobj> &lowDim, Lattice<vobj> & higherDim,int slice, in


 template<class vobj>
-void InsertSliceLocal(Lattice<vobj> &lowDim, Lattice<vobj> & higherDim,int slice_lo,int slice_hi, int orthog)
+void InsertSliceLocal(const Lattice<vobj> &lowDim, Lattice<vobj> & higherDim,int slice_lo,int slice_hi, int orthog)
 {
  typedef typename vobj::scalar_object sobj;

@@ -110,8 +110,8 @@ public:
  friend std::ostream& operator<< (std::ostream& stream, Logger& log){

    if ( log.active ) {
-      stream << log.background()<< std::setw(10) << std::left << log.topName << log.background()<< " : ";
-      stream << log.colour() << std::setw(14) << std::left << log.name << log.background() << " : ";
+      stream << log.background()<< std::setw(8) << std::left << log.topName << log.background()<< " : ";
+      stream << log.colour() << std::setw(10) << std::left << log.name << log.background() << " : ";
      if ( log.timestamp ) {
 	StopWatch.Stop();
 	GridTime now = StopWatch.Elapsed();
@@ -491,10 +491,15 @@ static inline void writeRNGState(GridSerialRNG &serial,GridParallelRNG &parallel
 #ifdef RNG_RANLUX
    header.floating_point = std::string("UINT64");
    header.data_type      = std::string("RANLUX48");
-#else
+#endif
+#ifdef RNG_MT19937
    header.floating_point = std::string("UINT32");
    header.data_type      = std::string("MT19937");
 #endif
+#ifdef RNG_SITMO
+    header.floating_point = std::string("UINT64");
+    header.data_type      = std::string("SITMO");
+#endif

  truncate(file);
  offset = writeHeader(header,file);
@@ -522,10 +527,15 @@ static inline void readRNGState(GridSerialRNG &serial,GridParallelRNG & parallel
 #ifdef RNG_RANLUX
  assert(format == std::string("UINT64"));
  assert(data_type == std::string("RANLUX48"));
-#else
+#endif
+#ifdef RNG_MT19937
  assert(format == std::string("UINT32"));
  assert(data_type == std::string("MT19937"));
 #endif
+#ifdef RNG_SITMO
+  assert(format == std::string("UINT64"));
+  assert(data_type == std::string("SITMO"));
+#endif

  // depending on datatype, set up munger;
  // munger is a function of <floating point, Real, data_type>
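
The header strings above only tag the saved words by engine type. The words themselves come from `GetState` and `SetState` in the RNG header, which stream the engine through a `std::stringstream`. Below is a minimal sketch of that round trip for `std::mt19937`. `get_state` and `set_state` are hypothetical free functions, and unlike the diff they read words until the stream is exhausted instead of assuming a fixed `RngStateCount`.

```cpp
#include <cassert>
#include <cstdint>
#include <random>
#include <sstream>
#include <vector>

// Serialise the engine's textual state into integer words.
std::vector<std::uint64_t> get_state(const std::mt19937 &eng)
{
  std::stringstream ss;
  ss << eng;  // standard engines write whitespace-separated decimal integers
  std::vector<std::uint64_t> words;
  std::uint64_t w = 0;
  while (ss >> w) {
    words.push_back(w);
  }
  return words;
}

// Rebuild the engine from words produced by get_state.
void set_state(const std::vector<std::uint64_t> &words, std::mt19937 &eng)
{
  std::stringstream ss;
  for (std::uint64_t w : words) {
    ss << w << " ";
  }
  ss >> eng;
  assert(!ss.fail());  // a short or malformed word list would fail here
}
```

A restored engine continues the sequence exactly where the saved one stopped, which is the property a checkpoint written by `writeRNGState` relies on.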
@@ -170,7 +170,6 @@ void CayleyFermion5D<Impl>::Mooee       (const FermionField &psi, FermionField &
  lower[0]   =-mass*lower[0];
  M5D(psi,psi,chi,lower,diag,upper);
 }
-
 template<class Impl>
 void CayleyFermion5D<Impl>::MooeeDag    (const FermionField &psi, FermionField &chi)
 {
@@ -192,7 +191,7 @@ void CayleyFermion5D<Impl>::MooeeDag    (const FermionField &psi, FermionField &
      lower[s]=-cee[s-1];
    }
  }
-  // Conjugate the terms ?
+  // Conjugate the terms 
  for (int s=0;s<Ls;s++){
    diag[s] =conjugate(diag[s]);
    upper[s]=conjugate(upper[s]);
@@ -219,14 +218,22 @@ void CayleyFermion5D<Impl>::MeooeDag5D    (const FermionField &psi, FermionField
  int Ls=this->Ls;
  std::vector<Coeff_t> diag =bs;
  std::vector<Coeff_t> upper=cs;
-  std::vector<Coeff_t> lower=cs;
-  upper[Ls-1]=-mass*upper[Ls-1];
-  lower[0]   =-mass*lower[0];
-  // Conjugate the terms ?
+  std::vector<Coeff_t> lower=cs; 
+
  for (int s=0;s<Ls;s++){
-    diag[s] =conjugate(diag[s]);
-    upper[s]=conjugate(upper[s]);
-    lower[s]=conjugate(lower[s]);
+    if ( s== 0 ) {
+      upper[s] = cs[s+1];
+      lower[s] =-mass*cs[Ls-1];
+    } else if ( s==(Ls-1) ) { 
+      upper[s] =-mass*cs[0];
+      lower[s] = cs[s-1];
+    } else { 
+      upper[s] = cs[s+1];
+      lower[s] = cs[s-1];
+    }
+    upper[s] = conjugate(upper[s]);
+    lower[s] = conjugate(lower[s]);
+    diag[s]  = conjugate(diag[s]);
  }
  M5Ddag(psi,psi,Din,lower,diag,upper);
 }
@@ -313,7 +320,7 @@ void CayleyFermion5D<Impl>::MDeriv  (GaugeField &mat,const FermionField &U,const
    this->DhopDeriv(mat,U,Din,dag);
  } else {
    //      U d/du [D_w D5]^dag V = U D5^dag d/du DW^dag Y // implicit adj on U in call
-    Meooe5D(U,Din);
+    MeooeDag5D(U,Din);
    this->DhopDeriv(mat,Din,V,dag);
  }
 };
@@ -328,7 +335,7 @@ void CayleyFermion5D<Impl>::MoeDeriv(GaugeField &mat,const FermionField &U,const
    this->DhopDerivOE(mat,U,Din,dag);
  } else {
    //      U d/du [D_w D5]^dag V = U D5^dag d/du DW^dag Y // implicit adj on U in call
-      Meooe5D(U,Din);
+      MeooeDag5D(U,Din);
      this->DhopDerivOE(mat,Din,V,dag);
  }
 };
@@ -343,7 +350,7 @@ void CayleyFermion5D<Impl>::MeoDeriv(GaugeField &mat,const FermionField &U,const
    this->DhopDerivEO(mat,U,Din,dag);
  } else {
    //      U d/du [D_w D5]^dag V = U D5^dag d/du DW^dag Y // implicit adj on U in call
-    Meooe5D(U,Din);
+    MeooeDag5D(U,Din);
    this->DhopDerivEO(mat,Din,V,dag);
  }
 };
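
The `MeooeDag5D` change above replaces "conjugate the forward coefficients in place" with "take each coefficient from the neighbouring `s`, then conjugate". Below is a small dense-matrix check of that rule, offered as a sketch under stated assumptions: spin projectors are dropped, and the forward operator is assumed to use `upper = lower = cs` with `-mass` on the wrap-around links, as the removed lines did. `cyclic_tridiag` and `max_dagger_mismatch` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

using Cplx   = std::complex<double>;
using Coeffs = std::vector<Cplx>;
using Dense  = std::vector<std::vector<Cplx>>;

// Row s holds diag[s] at column s, upper[s] at column s+1 and lower[s] at
// column s-1, with both neighbours wrapping cyclically.
Dense cyclic_tridiag(const Coeffs &lower, const Coeffs &diag, const Coeffs &upper)
{
  const std::size_t Ls = diag.size();
  assert(Ls >= 3);  // keeps the three entries of each row in distinct columns
  Dense A(Ls, Coeffs(Ls, Cplx(0.0, 0.0)));
  for (std::size_t s = 0; s < Ls; ++s) {
    A[s][s]                 = diag[s];
    A[s][(s + 1) % Ls]      = upper[s];
    A[s][(s + Ls - 1) % Ls] = lower[s];
  }
  return A;
}

// Largest |B[s][t] - conj(A[t][s])|, where A is built from the forward
// coefficients and B from the coefficients of the new MeooeDag5D loop.
double max_dagger_mismatch(const Coeffs &bs, const Coeffs &cs, double mass)
{
  const std::size_t Ls = bs.size();
  assert(cs.size() == Ls && Ls >= 3);

  // Forward operator (assumed): upper = lower = cs, mass on the wrap-around links.
  Coeffs diag = bs, upper = cs, lower = cs;
  upper[Ls - 1] = -mass * upper[Ls - 1];
  lower[0]      = -mass * lower[0];
  const Dense A = cyclic_tridiag(lower, diag, upper);

  // Dagger: coefficients indexed by the neighbouring s, then conjugated.
  Coeffs dag_diag(Ls), dag_upper(Ls), dag_lower(Ls);
  for (std::size_t s = 0; s < Ls; ++s) {
    dag_upper[s] = std::conj((s == Ls - 1) ? -mass * cs[0]      : cs[s + 1]);
    dag_lower[s] = std::conj((s == 0)      ? -mass * cs[Ls - 1] : cs[s - 1]);
    dag_diag[s]  = std::conj(bs[s]);
  }
  const Dense B = cyclic_tridiag(dag_lower, dag_diag, dag_upper);

  double worst = 0.0;
  for (std::size_t s = 0; s < Ls; ++s) {
    for (std::size_t t = 0; t < Ls; ++t) {
      worst = std::max(worst, std::abs(B[s][t] - std::conj(A[t][s])));
    }
  }
  return worst;
}
```

With `s`-dependent complex `cs`, as in the ZMobius case, conjugating the forward coefficients without shifting the index would leave off-diagonal entries in the wrong rows. The neighbour-indexed form reproduces the conjugate transpose exactly in this toy.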
@@ -194,7 +194,9 @@ template void CayleyFermion5D< A >::M5Ddag(const FermionField &psi,const Fermion
 template void CayleyFermion5D< A >::MooeeInv    (const FermionField &psi, FermionField &chi); \
 template void CayleyFermion5D< A >::MooeeInvDag (const FermionField &psi, FermionField &chi);

-#define CAYLEY_DPERP_CACHE
+#undef  CAYLEY_DPERP_DENSE
+#define  CAYLEY_DPERP_CACHE
 #undef  CAYLEY_DPERP_LINALG
+#define CAYLEY_DPERP_VEC

 #endif
@@ -181,6 +181,18 @@ void CayleyFermion5D<Impl>::MooeeInvDag (const FermionField &psi, FermionField &
  assert(psi.checkerboard == psi.checkerboard);
  chi.checkerboard=psi.checkerboard;

+  std::vector<Coeff_t> ueec(Ls);
+  std::vector<Coeff_t> deec(Ls);
+  std::vector<Coeff_t> leec(Ls);
+  std::vector<Coeff_t> ueemc(Ls);
+  std::vector<Coeff_t> leemc(Ls);
+  for(int s=0;s<ueec.size();s++){
+    ueec[s] = conjugate(uee[s]);
+    deec[s] = conjugate(dee[s]);
+    leec[s] = conjugate(lee[s]);
+    ueemc[s]= conjugate(ueem[s]);
+    leemc[s]= conjugate(leem[s]);
+  }
  MooeeInvCalls++;
  MooeeInvTime-=usecond();

@@ -192,25 +204,25 @@ void CayleyFermion5D<Impl>::MooeeInvDag (const FermionField &psi, FermionField &
    chi[ss]=psi[ss];
    for (int s=1;s<Ls;s++){
                            spProj5m(tmp,chi[ss+s-1]);
-      chi[ss+s] = psi[ss+s]-uee[s-1]*tmp;
+      chi[ss+s] = psi[ss+s]-ueec[s-1]*tmp;
    }
    // U_m^{-\dagger} 
    for (int s=0;s<Ls-1;s++){
                                   spProj5p(tmp,chi[ss+s]);
-      chi[ss+Ls-1] = chi[ss+Ls-1] - ueem[s]*tmp;
+      chi[ss+Ls-1] = chi[ss+Ls-1] - ueemc[s]*tmp;
    }

    // L_m^{-\dagger} D^{-dagger}
    for (int s=0;s<Ls-1;s++){
      spProj5m(tmp,chi[ss+Ls-1]);
-      chi[ss+s] = (1.0/dee[s])*chi[ss+s]-(leem[s]/dee[Ls-1])*tmp;
+      chi[ss+s] = (1.0/deec[s])*chi[ss+s]-(leemc[s]/deec[Ls-1])*tmp;
    }	
-    chi[ss+Ls-1]= (1.0/dee[Ls-1])*chi[ss+Ls-1];
+    chi[ss+Ls-1]= (1.0/deec[Ls-1])*chi[ss+Ls-1];
  
    // Apply L^{-dagger}
    for (int s=Ls-2;s>=0;s--){
      spProj5p(tmp,chi[ss+s+1]);
-      chi[ss+s] = chi[ss+s] - lee[s]*tmp;
+      chi[ss+s] = chi[ss+s] - leec[s]*tmp;
    }
  }

@@ -39,20 +39,17 @@ namespace QCD {
  /*
   * Dense matrix versions of routines
   */
-
-  /*
 template<class Impl>
 void CayleyFermion5D<Impl>::MooeeInvDag (const FermionField &psi, FermionField &chi)
 {
  this->MooeeInternal(psi,chi,DaggerYes,InverseYes);
 }
-  
 template<class Impl>
 void CayleyFermion5D<Impl>::MooeeInv(const FermionField &psi, FermionField &chi)
 {
  this->MooeeInternal(psi,chi,DaggerNo,InverseYes);
 }
-  */
+
 template<class Impl>
 void CayleyFermion5D<Impl>::MooeeInternal(const FermionField &psi, FermionField &chi,int dag, int inv)
 {
@@ -126,9 +123,20 @@ void CayleyFermion5D<Impl>::MooeeInternal(const FermionField &psi, FermionField
  }
 }

+#ifdef CAYLEY_DPERP_DENSE
+INSTANTIATE_DPERP(GparityWilsonImplF);
+INSTANTIATE_DPERP(GparityWilsonImplD);
+INSTANTIATE_DPERP(WilsonImplF);
+INSTANTIATE_DPERP(WilsonImplD);
+INSTANTIATE_DPERP(ZWilsonImplF);
+INSTANTIATE_DPERP(ZWilsonImplD);
+
 template void CayleyFermion5D<GparityWilsonImplF>::MooeeInternal(const FermionField &psi, FermionField &chi,int dag, int inv);
 template void CayleyFermion5D<GparityWilsonImplD>::MooeeInternal(const FermionField &psi, FermionField &chi,int dag, int inv);
 template void CayleyFermion5D<WilsonImplF>::MooeeInternal(const FermionField &psi, FermionField &chi,int dag, int inv);
 template void CayleyFermion5D<WilsonImplD>::MooeeInternal(const FermionField &psi, FermionField &chi,int dag, int inv);
+template void CayleyFermion5D<ZWilsonImplF>::MooeeInternal(const FermionField &psi, FermionField &chi,int dag, int inv);
+template void CayleyFermion5D<ZWilsonImplD>::MooeeInternal(const FermionField &psi, FermionField &chi,int dag, int inv);
+#endif

 }}
@@ -48,17 +48,18 @@ void CayleyFermion5D<Impl>::M5D(const FermionField &psi,
 				std::vector<Coeff_t> &diag,
 				std::vector<Coeff_t> &upper)
 {
+  Coeff_t one(1.0);
  int Ls=this->Ls;
  for(int s=0;s<Ls;s++){
    if ( s==0 ) {
      axpby_ssp_pminus(chi,diag[s],phi,upper[s],psi,s,s+1);
-      axpby_ssp_pplus (chi,1.0,chi,lower[s],psi,s,Ls-1);
+      axpby_ssp_pplus (chi,one,chi,lower[s],psi,s,Ls-1);
    } else if ( s==(Ls-1)) { 
      axpby_ssp_pminus(chi,diag[s],phi,upper[s],psi,s,0);
-      axpby_ssp_pplus (chi,1.0,chi,lower[s],psi,s,s-1);
+      axpby_ssp_pplus (chi,one,chi,lower[s],psi,s,s-1);
    } else {
      axpby_ssp_pminus(chi,diag[s],phi,upper[s],psi,s,s+1);
-      axpby_ssp_pplus(chi,1.0,chi,lower[s],psi,s,s-1);
+      axpby_ssp_pplus(chi,one,chi,lower[s],psi,s,s-1);
    }
  }
 }
@@ -70,17 +71,18 @@ void CayleyFermion5D<Impl>::M5Ddag(const FermionField &psi,
 				   std::vector<Coeff_t> &diag,
 				   std::vector<Coeff_t> &upper)
 {
+  Coeff_t one(1.0);
  int Ls=this->Ls;
  for(int s=0;s<Ls;s++){
    if ( s==0 ) {
      axpby_ssp_pplus (chi,diag[s],phi,upper[s],psi,s,s+1);
-      axpby_ssp_pminus(chi,1.0,chi,lower[s],psi,s,Ls-1);
+      axpby_ssp_pminus(chi,one,chi,lower[s],psi,s,Ls-1);
    } else if ( s==(Ls-1)) { 
      axpby_ssp_pplus (chi,diag[s],phi,upper[s],psi,s,0);
-      axpby_ssp_pminus(chi,1.0,chi,lower[s],psi,s,s-1);
+      axpby_ssp_pminus(chi,one,chi,lower[s],psi,s,s-1);
    } else {
      axpby_ssp_pplus (chi,diag[s],phi,upper[s],psi,s,s+1);
-      axpby_ssp_pminus(chi,1.0,chi,lower[s],psi,s,s-1);
+      axpby_ssp_pminus(chi,one,chi,lower[s],psi,s,s-1);
    }
  }
 }
@@ -88,62 +90,68 @@ void CayleyFermion5D<Impl>::M5Ddag(const FermionField &psi,
 template<class Impl>
 void CayleyFermion5D<Impl>::MooeeInv    (const FermionField &psi, FermionField &chi)
 {
+  Coeff_t one(1.0);
+  Coeff_t czero(0.0);
  chi.checkerboard=psi.checkerboard;
  int Ls=this->Ls;
  // Apply (L^{\prime})^{-1}
-  axpby_ssp (chi,1.0,psi,     0.0,psi,0,0);      // chi[0]=psi[0]
+  axpby_ssp (chi,one,psi,     czero,psi,0,0);      // chi[0]=psi[0]
  for (int s=1;s<Ls;s++){
-    axpby_ssp_pplus(chi,1.0,psi,-lee[s-1],chi,s,s-1);// recursion Psi[s] -lee P_+ chi[s-1]
+    axpby_ssp_pplus(chi,one,psi,-lee[s-1],chi,s,s-1);// recursion Psi[s] -lee P_+ chi[s-1]
  }
  // L_m^{-1} 
  for (int s=0;s<Ls-1;s++){ // Chi[ee] = 1 - sum[s<Ls-1] -leem[s]P_- chi
-    axpby_ssp_pminus(chi,1.0,chi,-leem[s],chi,Ls-1,s);
+    axpby_ssp_pminus(chi,one,chi,-leem[s],chi,Ls-1,s);
  }
  // U_m^{-1} D^{-1}
  for (int s=0;s<Ls-1;s++){
    // Chi[s] + 1/d chi[s] 
-    axpby_ssp_pplus(chi,1.0/dee[s],chi,-ueem[s]/dee[Ls-1],chi,s,Ls-1);
+    axpby_ssp_pplus(chi,one/dee[s],chi,-ueem[s]/dee[Ls-1],chi,s,Ls-1);
  }	
-  axpby_ssp(chi,1.0/dee[Ls-1],chi,0.0,chi,Ls-1,Ls-1); // Modest avoidable 
+  axpby_ssp(chi,one/dee[Ls-1],chi,czero,chi,Ls-1,Ls-1); // Modest avoidable 
  
  // Apply U^{-1}
  for (int s=Ls-2;s>=0;s--){
-    axpby_ssp_pminus (chi,1.0,chi,-uee[s],chi,s,s+1);  // chi[Ls]
+    axpby_ssp_pminus (chi,one,chi,-uee[s],chi,s,s+1);  // chi[Ls]
  }
 }

 template<class Impl>
 void CayleyFermion5D<Impl>::MooeeInvDag (const FermionField &psi, FermionField &chi)
 {
+  Coeff_t one(1.0);
+  Coeff_t czero(0.0);
  chi.checkerboard=psi.checkerboard;
  int Ls=this->Ls;
  // Apply (U^{\prime})^{-dagger}
-  axpby_ssp (chi,1.0,psi,     0.0,psi,0,0);      // chi[0]=psi[0]
+  axpby_ssp (chi,one,psi,     czero,psi,0,0);      // chi[0]=psi[0]
  for (int s=1;s<Ls;s++){
-    axpby_ssp_pminus(chi,1.0,psi,-uee[s-1],chi,s,s-1);
+    axpby_ssp_pminus(chi,one,psi,-conjugate(uee[s-1]),chi,s,s-1);
  }
  // U_m^{-\dagger} 
  for (int s=0;s<Ls-1;s++){
-    axpby_ssp_pplus(chi,1.0,chi,-ueem[s],chi,Ls-1,s);
+    axpby_ssp_pplus(chi,one,chi,-conjugate(ueem[s]),chi,Ls-1,s);
  }
  // L_m^{-\dagger} D^{-dagger}
  for (int s=0;s<Ls-1;s++){
-    axpby_ssp_pminus(chi,1.0/dee[s],chi,-leem[s]/dee[Ls-1],chi,s,Ls-1);
+    axpby_ssp_pminus(chi,one/conjugate(dee[s]),chi,-conjugate(leem[s]/dee[Ls-1]),chi,s,Ls-1);
  }	
-  axpby_ssp(chi,1.0/dee[Ls-1],chi,0.0,chi,Ls-1,Ls-1); // Modest avoidable 
+  axpby_ssp(chi,one/conjugate(dee[Ls-1]),chi,czero,chi,Ls-1,Ls-1); // Modest avoidable 
  
  // Apply L^{-dagger}
  for (int s=Ls-2;s>=0;s--){
-    axpby_ssp_pplus (chi,1.0,chi,-lee[s],chi,s,s+1);  // chi[Ls]
+    axpby_ssp_pplus (chi,one,chi,-conjugate(lee[s]),chi,s,s+1);  // chi[Ls]
  }
 }


 #ifdef CAYLEY_DPERP_LINALG
-  INSTANTIATE(WilsonImplF);
-  INSTANTIATE(WilsonImplD);
-  INSTANTIATE(GparityWilsonImplF);
-  INSTANTIATE(GparityWilsonImplD);
+  INSTANTIATE_DPERP(WilsonImplF);
+  INSTANTIATE_DPERP(WilsonImplD);
+  INSTANTIATE_DPERP(GparityWilsonImplF);
+  INSTANTIATE_DPERP(GparityWilsonImplD);
+  INSTANTIATE_DPERP(ZWilsonImplF);
+  INSTANTIATE_DPERP(ZWilsonImplD);
 #endif

 }
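Review note: the change from `-uee[s-1]` to `-conjugate(uee[s-1])` (and likewise for `ueem`, `leem`, `dee`, `lee`) in `MooeeInvDag` only matters once `Coeff_t` is complex, which the new `ZWilsonImpl` instantiations suggest. The following is a minimal sketch, not Grid code: it uses a 2x2 unit lower-triangular matrix with a complex off-diagonal entry, and the names `Mat2`, `lower_inverse` and `dagger` are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <complex>

using C = std::complex<double>;
using Mat2 = std::array<std::array<C, 2>, 2>;  // hypothetical 2x2 helper type

// Inverse of the unit lower-triangular matrix [[1, 0], [l, 1]],
// which is [[1, 0], [-l, 1]].
Mat2 lower_inverse(C l) {
  Mat2 r{};
  r[0][0] = 1.0;
  r[0][1] = 0.0;
  r[1][0] = -l;
  r[1][1] = 1.0;
  return r;
}

// Hermitian conjugate: transpose and complex-conjugate every entry.
Mat2 dagger(const Mat2 &m) {
  Mat2 r{};
  for (int i = 0; i < 2; i++)
    for (int j = 0; j < 2; j++) r[i][j] = std::conj(m[j][i]);
  return r;
}
```

The off-diagonal entry of `dagger(lower_inverse(l))` is `-conj(l)`, not `-l`. For real coefficients the two coincide, which is presumably why the unconjugated form worked before.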
@@ -35,7 +35,8 @@ Author: paboyle <paboyle@ph.ed.ac.uk>


 namespace Grid {
-namespace QCD {  /*
+namespace QCD {  
+  /*
   * Dense matrix versions of routines
   */
 template<class Impl>
@@ -58,6 +58,7 @@ Author: Peter Boyle <pabobyle@ph.ed.ac.uk>
 #include <Grid/qcd/action/fermion/DomainWallFermion.h>
 #include <Grid/qcd/action/fermion/MobiusFermion.h>
 #include <Grid/qcd/action/fermion/ZMobiusFermion.h>
+#include <Grid/qcd/action/fermion/SchurDiagTwoKappa.h>
 #include <Grid/qcd/action/fermion/ScaledShamirFermion.h>
 #include <Grid/qcd/action/fermion/MobiusZolotarevFermion.h>
 #include <Grid/qcd/action/fermion/ShamirZolotarevFermion.h>
@@ -40,10 +40,10 @@ ImprovedStaggeredFermionStatic::displacements({1, 1, 1, 1, -1, -1, -1, -1, 3, 3,
 // Constructor and gauge import
 /////////////////////////////////

+
 template <class Impl>
-ImprovedStaggeredFermion<Impl>::ImprovedStaggeredFermion(GaugeField &_Uthin, GaugeField &_Ufat, GridCartesian &Fgrid,
-							 GridRedBlackCartesian &Hgrid, RealD _mass,
-							 RealD _c1, RealD _c2,RealD _u0,
+ImprovedStaggeredFermion<Impl>::ImprovedStaggeredFermion(GridCartesian &Fgrid, GridRedBlackCartesian &Hgrid, 
+							 RealD _mass,
 							 const ImplParams &p)
    : Kernels(p),
      _grid(&Fgrid),
@@ -52,9 +52,6 @@ ImprovedStaggeredFermion<Impl>::ImprovedStaggeredFermion(GaugeField &_Uthin, Gau
      StencilEven(&Hgrid, npoint, Even, directions, displacements),  // source is Even
      StencilOdd(&Hgrid, npoint, Odd, directions, displacements),  // source is Odd
      mass(_mass),
-      c1(_c1),
-      c2(_c2),
-      u0(_u0),
      Lebesgue(_grid),
      LebesgueEvenOdd(_cbgrid),
      Umu(&Fgrid),
@@ -65,9 +62,29 @@ ImprovedStaggeredFermion<Impl>::ImprovedStaggeredFermion(GaugeField &_Uthin, Gau
      UUUmuOdd(&Hgrid) ,
      _tmp(&Hgrid)
 {
-  // Allocate the required comms buffer
+}
+
+template <class Impl>
+ImprovedStaggeredFermion<Impl>::ImprovedStaggeredFermion(GaugeField &_Uthin, GaugeField &_Ufat, GridCartesian &Fgrid,
+							 GridRedBlackCartesian &Hgrid, RealD _mass,
+							 RealD _c1, RealD _c2,RealD _u0,
+							 const ImplParams &p)
+  : ImprovedStaggeredFermion(Fgrid,Hgrid,_mass,p)
+{
+  c1=_c1;
+  c2=_c2;
+  u0=_u0;
  ImportGauge(_Uthin,_Ufat);
 }
+template <class Impl>
+ImprovedStaggeredFermion<Impl>::ImprovedStaggeredFermion(GaugeField &_Uthin,GaugeField &_Utriple, GaugeField &_Ufat, GridCartesian &Fgrid,
+							 GridRedBlackCartesian &Hgrid, RealD _mass,
+							 const ImplParams &p)
+  : ImprovedStaggeredFermion(Fgrid,Hgrid,_mass,p)
+{
+  ImportGaugeSimple(_Utriple,_Ufat);
+}
+

  ////////////////////////////////////////////////////////////
  // Momentum space propagator should be 
@@ -86,6 +103,34 @@ void ImprovedStaggeredFermion<Impl>::ImportGauge(const GaugeField &_Uthin)
  ImportGauge(_Uthin,_Uthin);
 };
 template <class Impl>
+void ImprovedStaggeredFermion<Impl>::ImportGaugeSimple(const GaugeField &_Utriple,const GaugeField &_Ufat) 
+{
+  /////////////////////////////////////////////////////////////////
+  // Trivial import: phases, fattening and the like are assumed pre-applied
+  /////////////////////////////////////////////////////////////////
+  GaugeLinkField U(GaugeGrid());
+
+  for (int mu = 0; mu < Nd; mu++) {
+
+    U = PeekIndex<LorentzIndex>(_Utriple, mu);
+    PokeIndex<LorentzIndex>(UUUmu, U, mu );
+
+    U = adj( Cshift(U, mu, -3));
+    PokeIndex<LorentzIndex>(UUUmu, -U, mu+4 );
+
+    U = PeekIndex<LorentzIndex>(_Ufat, mu);
+    PokeIndex<LorentzIndex>(Umu, U, mu);
+
+    U = adj( Cshift(U, mu, -1));
+    PokeIndex<LorentzIndex>(Umu, -U, mu+4);
+
+  }
+  pickCheckerboard(Even, UmuEven,  Umu);
+  pickCheckerboard(Odd,  UmuOdd ,  Umu);
+  pickCheckerboard(Even, UUUmuEven,UUUmu);
+  pickCheckerboard(Odd,  UUUmuOdd, UUUmu);
+}
+template <class Impl>
 void ImprovedStaggeredFermion<Impl>::ImportGauge(const GaugeField &_Uthin,const GaugeField &_Ufat) 
 {
  GaugeLinkField U(GaugeGrid());
@@ -112,7 +112,16 @@ class ImprovedStaggeredFermion : public StaggeredKernels<Impl>, public ImprovedS
 			   RealD _c1=9.0/8.0, RealD _c2=-1.0/24.0,RealD _u0=1.0,
 			   const ImplParams &p = ImplParams());

+  ImprovedStaggeredFermion(GaugeField &_Uthin, GaugeField &_Utriple, GaugeField &_Ufat, GridCartesian &Fgrid,
+			   GridRedBlackCartesian &Hgrid, RealD _mass,
+			   const ImplParams &p = ImplParams());
+
+  ImprovedStaggeredFermion(GridCartesian &Fgrid, GridRedBlackCartesian &Hgrid, RealD _mass,
+			   const ImplParams &p = ImplParams());
+
+
  // DoubleStore impl dependent
+  void ImportGaugeSimple(const GaugeField &_Utriple, const GaugeField &_Ufat);
  void ImportGauge(const GaugeField &_Uthin, const GaugeField &_Ufat);
  void ImportGauge(const GaugeField &_Uthin);
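Review note: the constructors above now share one grid-only constructor through C++11 constructor delegation. A delegating mem-initializer has to be the only entry in the initializer list, which is why `c1`, `c2` and `u0` moved from the list into the constructor body. The sketch below illustrates the pattern with a hypothetical `Operator` struct; it is not the Grid class.

```cpp
#include <cassert>

// Hypothetical stand-in for a fermion operator.
struct Operator {
  double mass;
  double c1 = 0.0, c2 = 0.0;
  bool gauge_imported = false;

  // Common constructor: sets up everything shared by all entry points.
  explicit Operator(double m) : mass(m) {}

  // Delegating constructor. The delegation must be the only mem-initializer,
  // so the extra coefficients are assigned in the body instead.
  Operator(double m, double c1_, double c2_) : Operator(m) {
    c1 = c1_;
    c2 = c2_;
    gauge_imported = true;  // stands in for a call such as ImportGauge(...)
  }
};
```

One thing to check in the diff itself: the constructor taking `_Utriple` delegates to the common constructor and never assigns `c1`, `c2` or `u0`, so those members appear to be left unset on that path.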

@@ -0,0 +1,102 @@
+    /*************************************************************************************
+
+    Grid physics library, www.github.com/paboyle/Grid 
+
+    Source file: SchurDiagTwoKappa.h
+
+    Copyright (C) 2017
+
+Author: Christoph Lehner
+Author: Peter Boyle <paboyle@ph.ed.ac.uk>
+
+    This program is free software; you can redistribute it and/or modify
+    it under the terms of the GNU General Public License as published by
+    the Free Software Foundation; either version 2 of the License, or
+    (at your option) any later version.
+
+    This program is distributed in the hope that it will be useful,
+    but WITHOUT ANY WARRANTY; without even the implied warranty of
+    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+    GNU General Public License for more details.
+
+    You should have received a copy of the GNU General Public License along
+    with this program; if not, write to the Free Software Foundation, Inc.,
+    51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
+
+    See the full license in the file "LICENSE" in the top level distribution directory
+    *************************************************************************************/
+    /*  END LEGAL */
+#ifndef  _SCHUR_DIAG_TWO_KAPPA_H
+#define  _SCHUR_DIAG_TWO_KAPPA_H
+
+namespace Grid {
+
+  // This is specific to (Z)mobius fermions
+  template<class Matrix, class Field>
+    class KappaSimilarityTransform {
+  public:
+    INHERIT_IMPL_TYPES(Matrix);
+    std::vector<Coeff_t> kappa, kappaDag, kappaInv, kappaInvDag;
+
+    KappaSimilarityTransform (Matrix &zmob) {
+      for (int i=0;i<(int)zmob.bs.size();i++) {
+	Coeff_t k = 1.0 / ( 2.0 * (zmob.bs[i] *(4 - zmob.M5) + 1.0) );
+	kappa.push_back( k );
+	kappaDag.push_back( conj(k) );
+	kappaInv.push_back( 1.0 / k );
+	kappaInvDag.push_back( 1.0 / conj(k) );
+      }
+    }
+
+  template<typename vobj>
+    void sscale(const Lattice<vobj>& in, Lattice<vobj>& out, Coeff_t* s) {
+    GridBase *grid=out._grid;
+    out.checkerboard = in.checkerboard;
+    assert(grid->_simd_layout[0] == 1); // should be fine for ZMobius for now
+    int Ls = grid->_rdimensions[0];
+    parallel_for(int ss=0;ss<grid->oSites();ss++){
+      vobj tmp = s[ss % Ls]*in._odata[ss];
+      vstream(out._odata[ss],tmp);
+    }
+  }
+
+  RealD sscale_norm(const Field& in, Field& out, Coeff_t* s) {
+    sscale(in,out,s);
+    return norm2(out);
+  }
+
+  virtual RealD M       (const Field& in, Field& out) { return sscale_norm(in,out,&kappa[0]);   }
+  virtual RealD MDag    (const Field& in, Field& out) { return sscale_norm(in,out,&kappaDag[0]);}
+  virtual RealD MInv    (const Field& in, Field& out) { return sscale_norm(in,out,&kappaInv[0]);}
+  virtual RealD MInvDag (const Field& in, Field& out) { return sscale_norm(in,out,&kappaInvDag[0]);}
+
+  };
+
+  template<class Matrix,class Field>
+    class SchurDiagTwoKappaOperator :  public SchurOperatorBase<Field> {
+  public:
+    KappaSimilarityTransform<Matrix, Field> _S;
+    SchurDiagTwoOperator<Matrix, Field> _Mat;
+
+    SchurDiagTwoKappaOperator (Matrix &Mat): _S(Mat), _Mat(Mat) {};
+
+    virtual  RealD Mpc      (const Field &in, Field &out) {
+      Field tmp(in._grid);
+
+      _S.MInv(in,out);
+      _Mat.Mpc(out,tmp);
+      return _S.M(tmp,out);
+
+    }
+    virtual  RealD MpcDag   (const Field &in, Field &out){
+      Field tmp(in._grid);
+
+      _S.MDag(in,out);
+      _Mat.MpcDag(out,tmp);
+      return _S.MInvDag(tmp,out);
+    }
+  };
+
+}
+
+#endif
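Review note: `SchurDiagTwoKappaOperator::Mpc` applies `MInv`, then the wrapped operator, then `M`, i.e. a similarity transform by a per-slice diagonal scaling. The sketch below reproduces only the scaling step on a flat `std::vector`, assuming (as the `ss % Ls` indexing in `sscale` implies) that the fifth-dimension index runs fastest. The free functions `sscale` and `inverse` here are simplified, hypothetical stand-ins.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

using C = std::complex<double>;

// Multiply site ss by the coefficient of its s-slice, s = ss % Ls.
std::vector<C> sscale(const std::vector<C> &in, const std::vector<C> &coeff) {
  const std::size_t Ls = coeff.size();
  assert(Ls > 0 && in.size() % Ls == 0);
  std::vector<C> out(in.size());
  for (std::size_t ss = 0; ss < in.size(); ss++) out[ss] = coeff[ss % Ls] * in[ss];
  return out;
}

// Element-wise inverse of the coefficients, the analogue of kappaInv.
std::vector<C> inverse(const std::vector<C> &k) {
  std::vector<C> r;
  for (const C &x : k) r.push_back(C(1.0) / x);
  return r;
}
```

Scaling by `inverse(kappa)` and then by `kappa` returns the input up to rounding, which is the property that makes the wrapped operator similar to the original one.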
@@ -27,8 +27,11 @@ Author: paboyle <paboyle@ph.ed.ac.uk>
    *************************************************************************************/
    /*  END LEGAL */
 #include <Grid.h>
+
+#ifdef AVX512
 #include <simd/Intel512common.h>
 #include <simd/Intel512avx.h>
+#endif

 // Interleave operations from two directions
 // This looks just like a 2 spin multiply and reuse same sequence from the Wilson
@@ -302,7 +305,7 @@ Author: paboyle <paboyle@ph.ed.ac.uk>
  VRDUP(Chi_00,T0)           VIDUP(Chi_00,Chi_00)	          \
   VRDUP(Chi_10,T1)           VIDUP(Chi_10,Chi_10)		  \
   VMUL(Z00,Chi_00,Z1)        VMUL(Z10,Chi_10,Z2)		  \
-   VSHUFMEM(3,%r8,Z00)	      VSHUFMEM(3,%r9,Z10)		  \    
+   VSHUFMEM(3,%r8,Z00)	      VSHUFMEM(3,%r9,Z10)		  \
   VMUL(Z00,Chi_00,Z3)        VMUL(Z10,Chi_10,Z4)		  \
   VSHUFMEM(6,%r8,Z00)	      VSHUFMEM(6,%r9,Z10)		  \
   VMUL(Z00,Chi_00,Z5)        VMUL(Z10,Chi_10,Z6)		  \
@@ -584,7 +587,6 @@ void StaggeredKernels<Impl>::DhopSiteAsm(StencilImpl &st, LebesgueOrder &lo,
 					 int sU, const FermionField &in, FermionField &out) 
 {
  assert(0);
-
 };


@@ -902,9 +904,17 @@ template <> void StaggeredKernels<StaggeredImplD>::DhopSiteAsm(StencilImpl &st,
 #endif
 }

+#define KERNEL_INSTANTIATE(CLASS,FUNC,IMPL)			    \
+  template void CLASS<IMPL>::FUNC(StencilImpl &st, LebesgueOrder &lo,	\
+				  DoubledGaugeField &U,			\
+				  DoubledGaugeField &UUU,		\
+				  SiteSpinor *buf, int LLs,		\
+				  int sU, const FermionField &in, FermionField &out);

-FermOpStaggeredTemplateInstantiate(StaggeredKernels);
-FermOpStaggeredVec5dTemplateInstantiate(StaggeredKernels);
+KERNEL_INSTANTIATE(StaggeredKernels,DhopSiteAsm,StaggeredImplD);
+KERNEL_INSTANTIATE(StaggeredKernels,DhopSiteAsm,StaggeredImplF);
+KERNEL_INSTANTIATE(StaggeredKernels,DhopSiteAsm,StaggeredVec5dImplD);
+KERNEL_INSTANTIATE(StaggeredKernels,DhopSiteAsm,StaggeredVec5dImplF);

 }}

@@ -299,7 +299,24 @@ void StaggeredKernels<Impl>::DhopSiteDepthHand(StencilImpl &st, LebesgueOrder &l

 }

-FermOpStaggeredTemplateInstantiate(StaggeredKernels);
-FermOpStaggeredVec5dTemplateInstantiate(StaggeredKernels);
+#define DHOP_SITE_HAND_INSTANTIATE(IMPL)				\
+  template void StaggeredKernels<IMPL>::DhopSiteHand(StencilImpl &st, LebesgueOrder &lo, \
+						     DoubledGaugeField &U,DoubledGaugeField &UUU, \
+						     SiteSpinor *buf, int LLs, \
+						     int sU, const FermionField &in, FermionField &out, int dag);
+
+#define DHOP_SITE_DEPTH_HAND_INSTANTIATE(IMPL)				\
+  template void StaggeredKernels<IMPL>::DhopSiteDepthHand(StencilImpl &st, LebesgueOrder &lo, DoubledGaugeField &U, \
+							  SiteSpinor *buf, int sF, \
+							  int sU, const FermionField &in, SiteSpinor &out,int threeLink) ;
+DHOP_SITE_HAND_INSTANTIATE(StaggeredImplD);
+DHOP_SITE_HAND_INSTANTIATE(StaggeredImplF);
+DHOP_SITE_HAND_INSTANTIATE(StaggeredVec5dImplD);
+DHOP_SITE_HAND_INSTANTIATE(StaggeredVec5dImplF);
+
+DHOP_SITE_DEPTH_HAND_INSTANTIATE(StaggeredImplD);
+DHOP_SITE_DEPTH_HAND_INSTANTIATE(StaggeredImplF);
+DHOP_SITE_DEPTH_HAND_INSTANTIATE(StaggeredVec5dImplD);
+DHOP_SITE_DEPTH_HAND_INSTANTIATE(StaggeredVec5dImplF);

 }}
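Review note: both staggered kernel files replace the blanket `FermOpStaggered...TemplateInstantiate` macros with one explicit instantiation per member function and implementation. The sketch below shows the same pattern in isolation; `Kernels`, `ImplF`, `ImplD` and `SCALE_INSTANTIATE` are hypothetical names, and the member is deliberately trivial.

```cpp
#include <cassert>

// Hypothetical implementation policies.
struct ImplF { static double factor() { return 0.5; } };
struct ImplD { static double factor() { return 2.0; } };

// The member is only declared in the class, as it would be in a header.
template <class Impl>
struct Kernels {
  double scale(double x);
};

// Definition, as it would appear in a separate .cc file.
template <class Impl>
double Kernels<Impl>::scale(double x) {
  return x * Impl::factor();
}

// One macro per member function: only this function is instantiated,
// and only for the implementations listed below.
#define SCALE_INSTANTIATE(IMPL) template double Kernels<IMPL>::scale(double x)

SCALE_INSTANTIATE(ImplF);
SCALE_INSTANTIATE(ImplD);
```

Instantiating per function lets each translation unit emit only the members it actually defines, at the cost of repeating the signature in the macro.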
@@ -114,8 +114,8 @@ class NerscHmcRunnerTemplate {
    */
    //////////////
    NoSmearing<Gimpl> SmearingPolicy;
-    typedef MinimumNorm2<GaugeField, NoSmearing<Gimpl>, RepresentationsPolicy >
-        IntegratorType;  // change here to change the algorithm
+    // change here to change the algorithm
+    typedef MinimumNorm2<GaugeField, NoSmearing<Gimpl>, RepresentationsPolicy >  IntegratorType;  
    IntegratorParameters MDpar(40, 1.0);
    IntegratorType MDynamics(UGrid, MDpar, TheAction, SmearingPolicy);

@@ -54,7 +54,7 @@ THE SOFTWARE.

 #define GRID_MACRO_EMPTY()

-#define GRID_MACRO_EVAL(...)     GRID_MACRO_EVAL1024(__VA_ARGS__)
+#define GRID_MACRO_EVAL(...)     GRID_MACRO_EVAL64(__VA_ARGS__)
 #define GRID_MACRO_EVAL1024(...) GRID_MACRO_EVAL512(GRID_MACRO_EVAL512(__VA_ARGS__))
 #define GRID_MACRO_EVAL512(...)  GRID_MACRO_EVAL256(GRID_MACRO_EVAL256(__VA_ARGS__))
 #define GRID_MACRO_EVAL256(...)  GRID_MACRO_EVAL128(GRID_MACRO_EVAL128(__VA_ARGS__))
@@ -377,8 +377,8 @@ namespace Optimization {
      b0 = _mm256_extractf128_si256(b,0);
      a1 = _mm256_extractf128_si256(a,1);
      b1 = _mm256_extractf128_si256(b,1);
-      a0 = _mm_mul_epi32(a0,b0);
-      a1 = _mm_mul_epi32(a1,b1);
+      a0 = _mm_mullo_epi32(a0,b0);
+      a1 = _mm_mullo_epi32(a1,b1);
      return _mm256_set_m128i(a1,a0);
 #endif
 #if defined (AVX2)
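Review note: the switch from `_mm_mul_epi32` to `_mm_mullo_epi32` in the hunk above changes which lanes are multiplied. `_mm_mul_epi32` multiplies only the low 32-bit integer of each 64-bit element and returns two 64-bit products, while `_mm_mullo_epi32` multiplies all four 32-bit lanes and keeps the low 32 bits of each. The sketch below emulates both in portable scalar C++ so it compiles without SSE4.1 flags; the array types and function names are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using Lanes32 = std::array<int32_t, 4>;  // four 32-bit lanes of a 128-bit register
using Lanes64 = std::array<int64_t, 2>;  // the same register viewed as two 64-bit lanes

// Scalar emulation of _mm_mul_epi32: lanes 0 and 2 only, full 64-bit products.
Lanes64 mul_epi32(const Lanes32 &a, const Lanes32 &b) {
  Lanes64 r;
  r[0] = static_cast<int64_t>(a[0]) * b[0];
  r[1] = static_cast<int64_t>(a[2]) * b[2];
  return r;
}

// Scalar emulation of _mm_mullo_epi32: all four lanes, low 32 bits kept.
Lanes32 mullo_epi32(const Lanes32 &a, const Lanes32 &b) {
  Lanes32 r;
  for (int i = 0; i < 4; i++)
    r[i] = static_cast<int32_t>(static_cast<uint32_t>(a[i]) * static_cast<uint32_t>(b[i]));
  return r;
}
```

For an element-wise integer multiply, which is what the surrounding functor is for, only the `mullo` form touches lanes 1 and 3.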
@@ -470,7 +470,52 @@ namespace Optimization {
      return in;
    };
  };
-
+#define USE_FP16
+  struct PrecisionChange {
+    static inline __m256i StoH (__m256 a,__m256 b) {
+      __m256i h;
+#ifdef USE_FP16
+      __m128i ha = _mm256_cvtps_ph(a,0);
+      __m128i hb = _mm256_cvtps_ph(b,0);
+      h =(__m256i) _mm256_castps128_ps256((__m128)ha);
+      h =(__m256i) _mm256_insertf128_ps((__m256)h,(__m128)hb,1);
+#else 
+      assert(0);
+#endif
+      return h;
+    }
+    static inline void  HtoS (__m256i h,__m256 &sa,__m256 &sb) {
+#ifdef USE_FP16
+      sa = _mm256_cvtph_ps((__m128i)_mm256_extractf128_ps((__m256)h,0));
+      sb = _mm256_cvtph_ps((__m128i)_mm256_extractf128_ps((__m256)h,1));
+#else 
+      assert(0);
+#endif
+    }
+    static inline __m256 DtoS (__m256d a,__m256d b) {
+      __m128 sa = _mm256_cvtpd_ps(a);
+      __m128 sb = _mm256_cvtpd_ps(b);
+      __m256 s = _mm256_castps128_ps256(sa);
+      s = _mm256_insertf128_ps(s,sb,1);
+      return s;
+    }
+    static inline void StoD (__m256 s,__m256d &a,__m256d &b) {
+      a = _mm256_cvtps_pd(_mm256_extractf128_ps(s,0));
+      b = _mm256_cvtps_pd(_mm256_extractf128_ps(s,1));
+    }
+    static inline __m256i DtoH (__m256d a,__m256d b,__m256d c,__m256d d) {
+      __m256 sa,sb;
+      sa = DtoS(a,b);
+      sb = DtoS(c,d);
+      return StoH(sa,sb);
+    }
+    static inline void HtoD (__m256i h,__m256d &a,__m256d &b,__m256d &c,__m256d &d) {
+      __m256 sa,sb;
+      HtoS(h,sa,sb);
+      StoD(sa,a,b);
+      StoD(sb,c,d);
+    }
+  };
  struct Exchange{
    // 3210 ordering
    static inline void Exchange0(__m256 &out1,__m256 &out2,__m256 in1,__m256 in2){
@@ -675,6 +720,7 @@ namespace Optimization {
 //////////////////////////////////////////////////////////////////////////////////////
 // Here assign types 

+  typedef __m256i SIMD_Htype;  // Half precision type
  typedef __m256  SIMD_Ftype; // Single precision type
  typedef __m256d SIMD_Dtype; // Double precision type
  typedef __m256i SIMD_Itype; // Integer type
@@ -235,11 +235,9 @@ namespace Optimization {
    inline void mac(__m512 &a, __m512 b, __m512 c){         
       a= _mm512_fmadd_ps( b, c, a);                         
    }
-
    inline void mac(__m512d &a, __m512d b, __m512d c){
      a= _mm512_fmadd_pd( b, c, a);                   
    }                                             
-
    // Real float
    inline __m512 operator()(__m512 a, __m512 b){
      return _mm512_mul_ps(a,b);
@@ -342,7 +340,52 @@ namespace Optimization {
    };

  };
-
+#define USE_FP16
+  struct PrecisionChange {
+    static inline __m512i StoH (__m512 a,__m512 b) {
+      __m512i h;
+#ifdef USE_FP16
+      __m256i ha = _mm512_cvtps_ph(a,0);
+      __m256i hb = _mm512_cvtps_ph(b,0);
+      h =(__m512i) _mm512_castps256_ps512((__m256)ha);
+      h =(__m512i) _mm512_insertf64x4((__m512d)h,(__m256d)hb,1);
+#else
+      assert(0);
+#endif
+      return h;
+    }
+    static inline void  HtoS (__m512i h,__m512 &sa,__m512 &sb) {
+#ifdef USE_FP16
+      sa = _mm512_cvtph_ps((__m256i)_mm512_extractf64x4_pd((__m512d)h,0));
+      sb = _mm512_cvtph_ps((__m256i)_mm512_extractf64x4_pd((__m512d)h,1));
+#else
+      assert(0);
+#endif
+    }
+    static inline __m512 DtoS (__m512d a,__m512d b) {
+      __m256 sa = _mm512_cvtpd_ps(a);
+      __m256 sb = _mm512_cvtpd_ps(b);
+      __m512 s = _mm512_castps256_ps512(sa);
+      s =(__m512) _mm512_insertf64x4((__m512d)s,(__m256d)sb,1);
+      return s;
+    }
+    static inline void StoD (__m512 s,__m512d &a,__m512d &b) {
+      a = _mm512_cvtps_pd((__m256)_mm512_extractf64x4_pd((__m512d)s,0));
+      b = _mm512_cvtps_pd((__m256)_mm512_extractf64x4_pd((__m512d)s,1));
+    }
+    static inline __m512i DtoH (__m512d a,__m512d b,__m512d c,__m512d d) {
+      __m512 sa,sb;
+      sa = DtoS(a,b);
+      sb = DtoS(c,d);
+      return StoH(sa,sb);
+    }
+    static inline void HtoD (__m512i h,__m512d &a,__m512d &b,__m512d &c,__m512d &d) {
+      __m512 sa,sb;
+      HtoS(h,sa,sb);
+      StoD(sa,a,b);
+      StoD(sb,c,d);
+    }
+  };
  // On extracting face: Ah Al , Bh Bl -> Ah Bh, Al Bl
  // On merging buffers: Ah,Bh , Al Bl -> Ah Al, Bh, Bl
  // The operation is its own inverse
@@ -539,7 +582,9 @@ namespace Optimization {
 //////////////////////////////////////////////////////////////////////////////////////
 // Here assign types 

-  typedef __m512 SIMD_Ftype;  // Single precision type
+
+  typedef __m512i SIMD_Htype;  // Half precision type
+  typedef __m512  SIMD_Ftype;  // Single precision type
  typedef __m512d SIMD_Dtype; // Double precision type
  typedef __m512i SIMD_Itype; // Integer type

@@ -279,6 +279,101 @@ namespace Optimization {
  
  #undef timesi

+  struct PrecisionChange {
+    static inline vech StoH (const vecf &a,const vecf &b) {
+      vech ret;
+#ifdef USE_FP16
+      vech *ha = (vech *)&a;
+      vech *hb = (vech *)&b;
+      const int nf = W<float>::r;
+      //      VECTOR_FOR(i, nf,1){ ret.v[i]    = ( (uint16_t *) &a.v[i])[1] ; }
+      //      VECTOR_FOR(i, nf,1){ ret.v[i+nf] = ( (uint16_t *) &b.v[i])[1] ; }
+      VECTOR_FOR(i, nf,1){ ret.v[i]    = ha->v[2*i+1]; }
+      VECTOR_FOR(i, nf,1){ ret.v[i+nf] = hb->v[2*i+1]; }
+#else
+      assert(0);
+#endif
+      return ret;
+    }
+    static inline void  HtoS (vech h,vecf &sa,vecf &sb) {
+#ifdef USE_FP16
+      const int nf = W<float>::r;
+      const int nh = W<uint16_t>::r;
+      vech *ha = (vech *)&sa;
+      vech *hb = (vech *)&sb;
+      VECTOR_FOR(i, nf, 1){ sb.v[i]= sa.v[i] = 0; }
+      //      VECTOR_FOR(i, nf, 1){ ( (uint16_t *) (&sa.v[i]))[1] = h.v[i];}
+      //      VECTOR_FOR(i, nf, 1){ ( (uint16_t *) (&sb.v[i]))[1] = h.v[i+nf];}
+      VECTOR_FOR(i, nf, 1){ ha->v[2*i+1]=h.v[i]; }
+      VECTOR_FOR(i, nf, 1){ hb->v[2*i+1]=h.v[i+nf]; }
+#else
+      assert(0);
+#endif
+    }
+    static inline vecf DtoS (vecd a,vecd b) {
+      const int nd = W<double>::r;
+      const int nf = W<float>::r;
+      vecf ret;
+      VECTOR_FOR(i, nd,1){ ret.v[i]    = a.v[i] ; }
+      VECTOR_FOR(i, nd,1){ ret.v[i+nd] = b.v[i] ; }
+      return ret;
+    }
+    static inline void StoD (vecf s,vecd &a,vecd &b) {
+      const int nd = W<double>::r;
+      VECTOR_FOR(i, nd,1){ a.v[i] = s.v[i] ; }
+      VECTOR_FOR(i, nd,1){ b.v[i] = s.v[i+nd] ; }
+    }
+    static inline vech DtoH (vecd a,vecd b,vecd c,vecd d) {
+      vecf sa,sb;
+      sa = DtoS(a,b);
+      sb = DtoS(c,d);
+      return StoH(sa,sb);
+    }
+    static inline void HtoD (vech h,vecd &a,vecd &b,vecd &c,vecd &d) {
+      vecf sa,sb;
+      HtoS(h,sa,sb);
+      StoD(sa,a,b);
+      StoD(sb,c,d);
+    }
+  };
+
+  //////////////////////////////////////////////
+  // Exchange support
+  struct Exchange{
+
+    template <typename T,int n>
+    static inline void ExchangeN(vec<T> &out1,vec<T> &out2,vec<T> &in1,vec<T> &in2){
+      const int w = W<T>::r;
+      unsigned int mask = w >> (n + 1);
+      //      std::cout << " Exchange "<<n<<" nsimd "<<w<<" mask 0x" <<std::hex<<mask<<std::dec<<std::endl;
+      VECTOR_FOR(i, w, 1) {	
+	int j1 = i&(~mask);
+	if  ( (i&mask) == 0 ) { out1.v[i]=in1.v[j1];}
+	else                  { out1.v[i]=in2.v[j1];}
+	int j2 = i|mask;
+	if  ( (i&mask) == 0 ) { out2.v[i]=in1.v[j2];}
+	else                  { out2.v[i]=in2.v[j2];}
+      }      
+    }
+    template <typename T>
+    static inline void Exchange0(vec<T> &out1,vec<T> &out2,vec<T> &in1,vec<T> &in2){
+      ExchangeN<T,0>(out1,out2,in1,in2);
+    };
+    template <typename T>
+    static inline void Exchange1(vec<T> &out1,vec<T> &out2,vec<T> &in1,vec<T> &in2){
+      ExchangeN<T,1>(out1,out2,in1,in2);
+    };
+    template <typename T>
+    static inline void Exchange2(vec<T> &out1,vec<T> &out2,vec<T> &in1,vec<T> &in2){
+      ExchangeN<T,2>(out1,out2,in1,in2);
+    };
+    template <typename T>
+    static inline void Exchange3(vec<T> &out1,vec<T> &out2,vec<T> &in1,vec<T> &in2){
+      ExchangeN<T,3>(out1,out2,in1,in2);
+    };
+  };
+
+
  //////////////////////////////////////////////
  // Some Template specialization
  #define perm(a, b, n, w)\
@@ -403,6 +498,7 @@ namespace Optimization {
 //////////////////////////////////////////////////////////////////////////////////////
 // Here assign types 

+  typedef Optimization::vech SIMD_Htype; // Reduced precision type
  typedef Optimization::vecf SIMD_Ftype; // Single precision type
  typedef Optimization::vecd SIMD_Dtype; // Double precision type
  typedef Optimization::veci SIMD_Itype; // Integer type
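Review note: the generic `ExchangeN` loop and the reworked SSE4 `Exchange1` shuffles should produce the same lane pattern. The sketch below restates the `ExchangeN` index logic for a fixed width of four lanes, using characters so the result can be read against the `ACEG`-style comments in the SSE4 hunk. The function name `exchange` and the fixed width are assumptions made for illustration.

```cpp
#include <array>
#include <cassert>

using Vec4 = std::array<char, 4>;  // hypothetical 4-lane vector of labels

// Same index logic as ExchangeN, specialised to w = 4 lanes.
template <int n>
void exchange(Vec4 &out1, Vec4 &out2, const Vec4 &in1, const Vec4 &in2) {
  const unsigned int w = 4;
  const unsigned int mask = w >> (n + 1);  // block size being swapped
  for (unsigned int i = 0; i < w; i++) {
    const unsigned int j1 = i & (~mask);   // source lane with the mask bit cleared
    const unsigned int j2 = i | mask;      // source lane with the mask bit set
    const bool from_first = (i & mask) == 0;
    out1[i] = from_first ? in1[j1] : in2[j1];
    out2[i] = from_first ? in1[j2] : in2[j2];
  }
}
```

With inputs `ABCD` and `EFGH`, `n = 0` yields `ABEF` and `CDGH`, and `n = 1` yields `AECG` and `BFDH`.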
@@ -66,6 +66,10 @@ namespace Optimization {
  template <> struct W<Integer> {
    constexpr static unsigned int r = GEN_SIMD_WIDTH/4u;
  };
+  template <> struct W<uint16_t> {
+    constexpr static unsigned int c = GEN_SIMD_WIDTH/4u;
+    constexpr static unsigned int r = GEN_SIMD_WIDTH/2u;
+  };
  
  // SIMD vector types
  template <typename T>
@@ -73,8 +77,9 @@ namespace Optimization {
    alignas(GEN_SIMD_WIDTH) T v[W<T>::r];
  };

-  typedef vec<float>   vecf;
-  typedef vec<double>  vecd;
-  typedef vec<Integer> veci;
+  typedef vec<float>     vecf;
+  typedef vec<double>    vecd;
+  typedef vec<uint16_t>  vech; // half precision comms
+  typedef vec<Integer>   veci;
  
 }}
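Review note: the generic `StoH` keeps element `2*i+1` of each float when the vector is viewed as `uint16_t`, which on a little-endian machine is the upper 16 bits of the IEEE single. That is a truncation to sign, 8 exponent bits and 7 mantissa bits, not the IEEE half format produced by the `_mm*_cvtps_ph` paths. The sketch below shows the same truncation with shifts, so it does not depend on byte order; the function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Keep the upper 16 bits of an IEEE single: sign, 8 exponent bits, 7 mantissa bits.
uint16_t float_to_upper16(float f) {
  uint32_t bits;
  std::memcpy(&bits, &f, sizeof(bits));
  return static_cast<uint16_t>(bits >> 16);
}

// Rebuild a float with the dropped 16 mantissa bits set to zero.
float upper16_to_float(uint16_t h) {
  const uint32_t bits = static_cast<uint32_t>(h) << 16;
  float f;
  std::memcpy(&f, &bits, sizeof(f));
  return f;
}
```

Because the low mantissa bits are simply dropped, the round trip truncates toward zero with a relative error below 2^-7, and values whose low 16 bits are already zero survive exactly.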
@@ -125,7 +125,6 @@ namespace Optimization {
      f[2] = a.v2;
      f[3] = a.v3;
    }
-
    //Double
    inline void operator()(double *d, vector4double a){
      vec_st(a, 0, d);
@@ -38,6 +38,7 @@ Author: neo <cossu@post.kek.jp>

 #include <pmmintrin.h>

+
 namespace Grid {
 namespace Optimization {

@@ -328,6 +329,56 @@ namespace Optimization {
    };
  };

+  
+#define _my_alignr_epi32(a,b,n) _mm_alignr_epi8(a,b,(n*4)%16)
+#define _my_alignr_epi64(a,b,n) _mm_alignr_epi8(a,b,(n*8)%16)
+
+  struct PrecisionChange {
+    static inline __m128i StoH (__m128 a,__m128 b) {
+#ifdef USE_FP16
+      __m128i ha = _mm_cvtps_ph(a,0);
+      __m128i hb = _mm_cvtps_ph(b,0);
+      __m128i h =(__m128i) _mm_shuffle_ps((__m128)ha,(__m128)hb,_MM_SELECT_FOUR_FOUR(1,0,1,0));
+#else
+      __m128i h = (__m128i)a;
+      assert(0);
+#endif
+      return h;
+    }
+    static inline void  HtoS (__m128i h,__m128 &sa,__m128 &sb) {
+#ifdef USE_FP16
+      sa = _mm_cvtph_ps(h); 
+      h =  (__m128i)_my_alignr_epi32((__m128i)h,(__m128i)h,2);
+      sb = _mm_cvtph_ps(h);
+#else
+      assert(0);
+#endif
+    }
+    static inline __m128 DtoS (__m128d a,__m128d b) {
+      __m128 sa = _mm_cvtpd_ps(a);
+      __m128 sb = _mm_cvtpd_ps(b);
+      __m128 s = _mm_shuffle_ps(sa,sb,_MM_SELECT_FOUR_FOUR(1,0,1,0));
+      return s;
+    }
+    static inline void StoD (__m128 s,__m128d &a,__m128d &b) {
+      a = _mm_cvtps_pd(s);
+      s = (__m128)_my_alignr_epi32((__m128i)s,(__m128i)s,2);
+      b = _mm_cvtps_pd(s);
+    }
+    static inline __m128i DtoH (__m128d a,__m128d b,__m128d c,__m128d d) {
+      __m128 sa,sb;
+      sa = DtoS(a,b);
+      sb = DtoS(c,d);
+      return StoH(sa,sb);
+    }
+    static inline void HtoD (__m128i h,__m128d &a,__m128d &b,__m128d &c,__m128d &d) {
+      __m128 sa,sb;
+      HtoS(h,sa,sb);
+      StoD(sa,a,b);
+      StoD(sb,c,d);
+    }
+  };
+
  struct Exchange{
    // 3210 ordering
    static inline void Exchange0(__m128 &out1,__m128 &out2,__m128 in1,__m128 in2){
@@ -335,8 +386,10 @@ namespace Optimization {
      out2= _mm_shuffle_ps(in1,in2,_MM_SELECT_FOUR_FOUR(3,2,3,2));
    };
    static inline void Exchange1(__m128 &out1,__m128 &out2,__m128 in1,__m128 in2){
-      out1= _mm_shuffle_ps(in1,in2,_MM_SELECT_FOUR_FOUR(2,0,2,0));
-      out2= _mm_shuffle_ps(in1,in2,_MM_SELECT_FOUR_FOUR(3,1,3,1));
+      out1= _mm_shuffle_ps(in1,in2,_MM_SELECT_FOUR_FOUR(2,0,2,0)); /*ACEG*/
+      out2= _mm_shuffle_ps(in1,in2,_MM_SELECT_FOUR_FOUR(3,1,3,1)); /*BDFH*/
+      out1= _mm_shuffle_ps(out1,out1,_MM_SELECT_FOUR_FOUR(3,1,2,0)); /*AECG*/
+      out2= _mm_shuffle_ps(out2,out2,_MM_SELECT_FOUR_FOUR(3,1,2,0)); /*BFDH*/
    };
    static inline void Exchange2(__m128 &out1,__m128 &out2,__m128 in1,__m128 in2){
      assert(0);
@@ -383,14 +436,9 @@ namespace Optimization {
      default: assert(0);
      }
    }
-  
-#ifndef _mm_alignr_epi64
-#define _mm_alignr_epi32(a,b,n) _mm_alignr_epi8(a,b,(n*4)%16)
-#define _mm_alignr_epi64(a,b,n) _mm_alignr_epi8(a,b,(n*8)%16)
-#endif 

-    template<int n> static inline __m128  tRotate(__m128  in){ return (__m128)_mm_alignr_epi32((__m128i)in,(__m128i)in,n); };
-    template<int n> static inline __m128d tRotate(__m128d in){ return (__m128d)_mm_alignr_epi64((__m128i)in,(__m128i)in,n); };
+    template<int n> static inline __m128  tRotate(__m128  in){ return (__m128)_my_alignr_epi32((__m128i)in,(__m128i)in,n); };
+    template<int n> static inline __m128d tRotate(__m128d in){ return (__m128d)_my_alignr_epi64((__m128i)in,(__m128i)in,n); };

  };
  //////////////////////////////////////////////
@@ -450,7 +498,8 @@ namespace Optimization {
 //////////////////////////////////////////////////////////////////////////////////////
 // Here assign types 

-  typedef __m128 SIMD_Ftype;  // Single precision type
+  typedef __m128i SIMD_Htype;  // Half precision type
+  typedef __m128  SIMD_Ftype;  // Single precision type
  typedef __m128d SIMD_Dtype; // Double precision type
  typedef __m128i SIMD_Itype; // Integer type

@@ -2,7 +2,7 @@

 Grid physics library, www.github.com/paboyle/Grid

-Source file: ./lib/simd/Grid_vector_types.h
+Source file: ./lib/simd/Grid_vector_type.h

 Copyright (C) 2015

@@ -358,16 +358,12 @@ class Grid_simd {
  {
    if       (n==3) {
      Optimization::Exchange::Exchange3(out1.v,out2.v,in1.v,in2.v);
-      //      std::cout << " Exchange3 "<< out1<<" "<< out2<<" <- " << in1 << " "<<in2<<std::endl;
    } else if(n==2) {
      Optimization::Exchange::Exchange2(out1.v,out2.v,in1.v,in2.v);
-      //      std::cout << " Exchange2 "<< out1<<" "<< out2<<" <- " << in1 << " "<<in2<<std::endl;
    } else if(n==1) {
      Optimization::Exchange::Exchange1(out1.v,out2.v,in1.v,in2.v);
-      //      std::cout << " Exchange1 "<< out1<<" "<< out2<<" <- " << in1 << " "<<in2<<std::endl;
    } else if(n==0) { 
      Optimization::Exchange::Exchange0(out1.v,out2.v,in1.v,in2.v);
-      //      std::cout << " Exchange0 "<< out1<<" "<< out2<<" <- " << in1 << " "<<in2<<std::endl;
    }
  }

@@ -415,7 +411,6 @@ template <class S, class V, IfNotComplex<S> = 0>
 inline Grid_simd<S, V> rotate(Grid_simd<S, V> b, int nrot) {
  nrot = nrot % Grid_simd<S, V>::Nsimd();
  Grid_simd<S, V> ret;
-  //    std::cout << "Rotate Real by "<<nrot<<std::endl;
  ret.v = Optimization::Rotate::rotate(b.v, nrot);
  return ret;
 }
@@ -423,7 +418,6 @@ template <class S, class V, IfComplex<S> = 0>
 inline Grid_simd<S, V> rotate(Grid_simd<S, V> b, int nrot) {
  nrot = nrot % Grid_simd<S, V>::Nsimd();
  Grid_simd<S, V> ret;
-  //    std::cout << "Rotate Complex by "<<nrot<<std::endl;
  ret.v = Optimization::Rotate::rotate(b.v, 2 * nrot);
  return ret;
 }
@@ -431,14 +425,12 @@ template <class S, class V, IfNotComplex<S> =0>
 inline void rotate( Grid_simd<S,V> &ret,Grid_simd<S,V> b,int nrot)
 {
  nrot = nrot % Grid_simd<S,V>::Nsimd();
-  //    std::cout << "Rotate Real by "<<nrot<<std::endl;
  ret.v = Optimization::Rotate::rotate(b.v,nrot);
 }
 template <class S, class V, IfComplex<S> =0> 
 inline void rotate(Grid_simd<S,V> &ret,Grid_simd<S,V> b,int nrot)
 {
  nrot = nrot % Grid_simd<S,V>::Nsimd();
-  //    std::cout << "Rotate Complex by "<<nrot<<std::endl;
  ret.v = Optimization::Rotate::rotate(b.v,2*nrot);
 }

@@ -698,7 +690,6 @@ inline Grid_simd<S, V> innerProduct(const Grid_simd<S, V> &l,
                                    const Grid_simd<S, V> &r) {
  return conjugate(l) * r;
 }
-
 template <class S, class V>
 inline Grid_simd<S, V> outerProduct(const Grid_simd<S, V> &l,
                                    const Grid_simd<S, V> &r) {
@@ -758,6 +749,67 @@ typedef Grid_simd<std::complex<float>, SIMD_Ftype> vComplexF;
 typedef Grid_simd<std::complex<double>, SIMD_Dtype> vComplexD;
 typedef Grid_simd<Integer, SIMD_Itype> vInteger;

+// Half precision; no arithmetic support
+typedef Grid_simd<uint16_t, SIMD_Htype>               vRealH;
+typedef Grid_simd<std::complex<uint16_t>, SIMD_Htype> vComplexH;
+
+inline void precisionChange(vRealF    *out,vRealD    *in,int nvec)
+{
+  assert((nvec&0x1)==0);
+  for(int m=0;m*2<nvec;m++){
+    int n=m*2;
+    out[m].v=Optimization::PrecisionChange::DtoS(in[n].v,in[n+1].v);
+  }
+}
+inline void precisionChange(vRealH    *out,vRealD    *in,int nvec)
+{
+  assert((nvec&0x3)==0);
+  for(int m=0;m*4<nvec;m++){
+    int n=m*4;
+    out[m].v=Optimization::PrecisionChange::DtoH(in[n].v,in[n+1].v,in[n+2].v,in[n+3].v);
+  }
+}
+inline void precisionChange(vRealH    *out,vRealF    *in,int nvec)
+{
+  assert((nvec&0x1)==0);
+  for(int m=0;m*2<nvec;m++){
+    int n=m*2;
+    out[m].v=Optimization::PrecisionChange::StoH(in[n].v,in[n+1].v);
+  }
+}
+inline void precisionChange(vRealD    *out,vRealF    *in,int nvec)
+{
+  assert((nvec&0x1)==0);
+  for(int m=0;m*2<nvec;m++){
+    int n=m*2;
+    Optimization::PrecisionChange::StoD(in[m].v,out[n].v,out[n+1].v);
+  }
+}
+inline void precisionChange(vRealD    *out,vRealH    *in,int nvec)
+{
+  assert((nvec&0x3)==0);
+  for(int m=0;m*4<nvec;m++){
+    int n=m*4;
+    Optimization::PrecisionChange::HtoD(in[m].v,out[n].v,out[n+1].v,out[n+2].v,out[n+3].v);
+  }
+}
+inline void precisionChange(vRealF    *out,vRealH    *in,int nvec)
+{
+  assert((nvec&0x1)==0);
+  for(int m=0;m*2<nvec;m++){
+    int n=m*2;
+    Optimization::PrecisionChange::HtoS(in[m].v,out[n].v,out[n+1].v);
+  }
+}
+inline void precisionChange(vComplexF *out,vComplexD *in,int nvec){ precisionChange((vRealF *)out,(vRealD *)in,nvec);}
+inline void precisionChange(vComplexH *out,vComplexD *in,int nvec){ precisionChange((vRealH *)out,(vRealD *)in,nvec);}
+inline void precisionChange(vComplexH *out,vComplexF *in,int nvec){ precisionChange((vRealH *)out,(vRealF *)in,nvec);}
+inline void precisionChange(vComplexD *out,vComplexF *in,int nvec){ precisionChange((vRealD *)out,(vRealF *)in,nvec);}
+inline void precisionChange(vComplexD *out,vComplexH *in,int nvec){ precisionChange((vRealD *)out,(vRealH *)in,nvec);}
+inline void precisionChange(vComplexF *out,vComplexH *in,int nvec){ precisionChange((vRealF *)out,(vRealH *)in,nvec);}
+
+
+
 // Check our vector types are of an appropriate size.
 #if defined QPX
 static_assert(2*sizeof(SIMD_Ftype) == sizeof(SIMD_Dtype), "SIMD vector lengths incorrect");
@@ -56,11 +56,11 @@ class iScalar {
  typedef vtype element;
  typedef typename GridTypeMapper<vtype>::scalar_type scalar_type;
  typedef typename GridTypeMapper<vtype>::vector_type vector_type;
+  typedef typename GridTypeMapper<vtype>::vector_typeD vector_typeD;
  typedef typename GridTypeMapper<vtype>::tensor_reduced tensor_reduced_v;
-  typedef iScalar<tensor_reduced_v> tensor_reduced;
  typedef typename GridTypeMapper<vtype>::scalar_object recurse_scalar_object;
+  typedef iScalar<tensor_reduced_v> tensor_reduced;
  typedef iScalar<recurse_scalar_object> scalar_object;
-
  // substitutes a real or complex version with same tensor structure
  typedef iScalar<typename GridTypeMapper<vtype>::Complexified> Complexified;
  typedef iScalar<typename GridTypeMapper<vtype>::Realified> Realified;
@@ -77,8 +77,12 @@ class iScalar {
  iScalar<vtype> & operator= (const iScalar<vtype> &copyme) = default;
  iScalar<vtype> & operator= (iScalar<vtype> &&copyme) = default;
  */
-  iScalar(scalar_type s)
-      : _internal(s){};  // recurse down and hit the constructor for vector_type
+
+  //  template<int N=0>
+  //  iScalar(EnableIf<isSIMDvectorized<vector_type>, vector_type> s) : _internal(s){};  // recurse down and hit the constructor for vector_type
+
+  iScalar(scalar_type s) : _internal(s){};  // recurse down and hit the constructor for vector_type
+
  iScalar(const Zero &z) { *this = zero; };

  iScalar<vtype> &operator=(const Zero &hero) {
@@ -134,33 +138,28 @@ class iScalar {
  strong_inline const vtype &operator()(void) const { return _internal; }

  // Type casts meta programmed, must be pure scalar to match TensorRemove
-  template <class U = vtype, class V = scalar_type, IfComplex<V> = 0,
-            IfNotSimd<U> = 0>
+  template <class U = vtype, class V = scalar_type, IfComplex<V> = 0, IfNotSimd<U> = 0>
  operator ComplexF() const {
    return (TensorRemove(_internal));
  };
-  template <class U = vtype, class V = scalar_type, IfComplex<V> = 0,
-            IfNotSimd<U> = 0>
+  template <class U = vtype, class V = scalar_type, IfComplex<V> = 0, IfNotSimd<U> = 0>
  operator ComplexD() const {
    return (TensorRemove(_internal));
  };
  //  template<class U=vtype,class V=scalar_type,IfComplex<V> = 0,IfNotSimd<U> =
  //  0> operator RealD    () const { return(real(TensorRemove(_internal))); }
-  template <class U = vtype, class V = scalar_type, IfReal<V> = 0,
-            IfNotSimd<U> = 0>
+  template <class U = vtype, class V = scalar_type, IfReal<V> = 0,IfNotSimd<U> = 0>
  operator RealD() const {
    return TensorRemove(_internal);
  }
-  template <class U = vtype, class V = scalar_type, IfInteger<V> = 0,
-            IfNotSimd<U> = 0>
+  template <class U = vtype, class V = scalar_type, IfInteger<V> = 0, IfNotSimd<U> = 0>
  operator Integer() const {
    return Integer(TensorRemove(_internal));
  }

  // convert from a something to a scalar via constructor of something arg
-  template <class T, typename std::enable_if<!isGridTensor<T>::value, T>::type
-                         * = nullptr>
-  strong_inline iScalar<vtype> operator=(T arg) {
+  template <class T, typename std::enable_if<!isGridTensor<T>::value, T>::type * = nullptr>
+    strong_inline iScalar<vtype> operator=(T arg) {
    _internal = arg;
    return *this;
  }
@@ -193,6 +192,7 @@ class iVector {
  typedef vtype element;
  typedef typename GridTypeMapper<vtype>::scalar_type scalar_type;
  typedef typename GridTypeMapper<vtype>::vector_type vector_type;
+  typedef typename GridTypeMapper<vtype>::vector_typeD vector_typeD;
  typedef typename GridTypeMapper<vtype>::tensor_reduced tensor_reduced_v;
  typedef typename GridTypeMapper<vtype>::scalar_object recurse_scalar_object;
  typedef iScalar<tensor_reduced_v> tensor_reduced;
@@ -305,6 +305,7 @@ class iMatrix {
  typedef vtype element;
  typedef typename GridTypeMapper<vtype>::scalar_type scalar_type;
  typedef typename GridTypeMapper<vtype>::vector_type vector_type;
+  typedef typename GridTypeMapper<vtype>::vector_typeD vector_typeD;
  typedef typename GridTypeMapper<vtype>::tensor_reduced tensor_reduced_v;
  typedef typename GridTypeMapper<vtype>::scalar_object recurse_scalar_object;

@@ -29,51 +29,109 @@ Author: Peter Boyle <paboyle@ph.ed.ac.uk>
 #ifndef GRID_MATH_INNER_H
 #define GRID_MATH_INNER_H
 namespace Grid {
-    ///////////////////////////////////////////////////////////////////////////////////////
-    // innerProduct Scalar x Scalar -> Scalar
-    // innerProduct Vector x Vector -> Scalar
-    // innerProduct Matrix x Matrix -> Scalar
-    ///////////////////////////////////////////////////////////////////////////////////////
-    template<class sobj> inline RealD norm2(const sobj &arg){
-      typedef typename sobj::scalar_type scalar;
-      decltype(innerProduct(arg,arg)) nrm;
-      nrm = innerProduct(arg,arg);
-      RealD ret = real(nrm);
-      return ret;
-    }
+  ///////////////////////////////////////////////////////////////////////////////////////
+  // innerProduct Scalar x Scalar -> Scalar
+  // innerProduct Vector x Vector -> Scalar
+  // innerProduct Matrix x Matrix -> Scalar
+  ///////////////////////////////////////////////////////////////////////////////////////
+  template<class sobj> inline RealD norm2(const sobj &arg){
+    auto nrm = innerProductD(arg,arg);
+    RealD ret = real(nrm);
+    return ret;
+  }
+  //////////////////////////////////////
+  // If single precision, promote to double and sum the two halves
+  //////////////////////////////////////

-    template<class l,class r,int N> inline
-    auto innerProduct (const iVector<l,N>& lhs,const iVector<r,N>& rhs) -> iScalar<decltype(innerProduct(lhs._internal[0],rhs._internal[0]))>
-    {
-        typedef decltype(innerProduct(lhs._internal[0],rhs._internal[0])) ret_t;
-        iScalar<ret_t> ret;
-	ret=zero;
-        for(int c1=0;c1<N;c1++){
-            ret._internal += innerProduct(lhs._internal[c1],rhs._internal[c1]);
-        }
-        return ret;
+inline ComplexD innerProductD(const ComplexF &l,const ComplexF &r){  return innerProduct(l,r); }
+inline ComplexD innerProductD(const ComplexD &l,const ComplexD &r){  return innerProduct(l,r); }
+inline RealD    innerProductD(const RealD    &l,const RealD    &r){  return innerProduct(l,r); }
+inline RealD    innerProductD(const RealF    &l,const RealF    &r){  return innerProduct(l,r); }
+
+inline vComplexD innerProductD(const vComplexD &l,const vComplexD &r){  return innerProduct(l,r); }
+inline vRealD    innerProductD(const vRealD    &l,const vRealD    &r){  return innerProduct(l,r); }
+inline vComplexD innerProductD(const vComplexF &l,const vComplexF &r){  
+  vComplexD la,lb;
+  vComplexD ra,rb;
+  Optimization::PrecisionChange::StoD(l.v,la.v,lb.v);
+  Optimization::PrecisionChange::StoD(r.v,ra.v,rb.v);
+  return innerProduct(la,ra) + innerProduct(lb,rb); 
+}
+inline vRealD innerProductD(const vRealF &l,const vRealF &r){  
+  vRealD la,lb;
+  vRealD ra,rb;
+  Optimization::PrecisionChange::StoD(l.v,la.v,lb.v);
+  Optimization::PrecisionChange::StoD(r.v,ra.v,rb.v);
+  return innerProduct(la,ra) + innerProduct(lb,rb); 
+}
+
+  template<class l,class r,int N> inline
+  auto innerProductD (const iVector<l,N>& lhs,const iVector<r,N>& rhs) -> iScalar<decltype(innerProductD(lhs._internal[0],rhs._internal[0]))>
+  {
+    typedef decltype(innerProductD(lhs._internal[0],rhs._internal[0])) ret_t;
+    iScalar<ret_t> ret;
+    ret=zero;
+    for(int c1=0;c1<N;c1++){
+      ret._internal += innerProductD(lhs._internal[c1],rhs._internal[c1]);
    }
-    template<class l,class r,int N> inline
-    auto innerProduct (const iMatrix<l,N>& lhs,const iMatrix<r,N>& rhs) -> iScalar<decltype(innerProduct(lhs._internal[0][0],rhs._internal[0][0]))>
-    {
-        typedef decltype(innerProduct(lhs._internal[0][0],rhs._internal[0][0])) ret_t;
-        iScalar<ret_t> ret;
-        iScalar<ret_t> tmp;
-	ret=zero;
-        for(int c1=0;c1<N;c1++){
-        for(int c2=0;c2<N;c2++){
-	  ret._internal+=innerProduct(lhs._internal[c1][c2],rhs._internal[c1][c2]);
-        }}
-        return ret;
-    }
-    template<class l,class r> inline
-    auto innerProduct (const iScalar<l>& lhs,const iScalar<r>& rhs) -> iScalar<decltype(innerProduct(lhs._internal,rhs._internal))>
-    {
-        typedef decltype(innerProduct(lhs._internal,rhs._internal)) ret_t;
-        iScalar<ret_t> ret;
-        ret._internal = innerProduct(lhs._internal,rhs._internal);
-        return ret;
+    return ret;
+  }
+  template<class l,class r,int N> inline
+  auto innerProductD (const iMatrix<l,N>& lhs,const iMatrix<r,N>& rhs) -> iScalar<decltype(innerProductD(lhs._internal[0][0],rhs._internal[0][0]))>
+  {
+    typedef decltype(innerProductD(lhs._internal[0][0],rhs._internal[0][0])) ret_t;
+    iScalar<ret_t> ret;
+    iScalar<ret_t> tmp;
+    ret=zero;
+    for(int c1=0;c1<N;c1++){
+    for(int c2=0;c2<N;c2++){
+      ret._internal+=innerProductD(lhs._internal[c1][c2],rhs._internal[c1][c2]);
+    }}
+    return ret;
+  }
+  template<class l,class r> inline
+  auto innerProductD (const iScalar<l>& lhs,const iScalar<r>& rhs) -> iScalar<decltype(innerProductD(lhs._internal,rhs._internal))>
+  {
+    typedef decltype(innerProductD(lhs._internal,rhs._internal)) ret_t;
+    iScalar<ret_t> ret;
+    ret._internal = innerProductD(lhs._internal,rhs._internal);
+    return ret;
+  }
+  //////////////////////
+  // Keep same precision
+  //////////////////////
+  template<class l,class r,int N> inline
+  auto innerProduct (const iVector<l,N>& lhs,const iVector<r,N>& rhs) -> iScalar<decltype(innerProduct(lhs._internal[0],rhs._internal[0]))>
+  {
+    typedef decltype(innerProduct(lhs._internal[0],rhs._internal[0])) ret_t;
+    iScalar<ret_t> ret;
+    ret=zero;
+    for(int c1=0;c1<N;c1++){
+      ret._internal += innerProduct(lhs._internal[c1],rhs._internal[c1]);
    }
+    return ret;
+  }
+  template<class l,class r,int N> inline
+  auto innerProduct (const iMatrix<l,N>& lhs,const iMatrix<r,N>& rhs) -> iScalar<decltype(innerProduct(lhs._internal[0][0],rhs._internal[0][0]))>
+  {
+    typedef decltype(innerProduct(lhs._internal[0][0],rhs._internal[0][0])) ret_t;
+    iScalar<ret_t> ret;
+    iScalar<ret_t> tmp;
+    ret=zero;
+    for(int c1=0;c1<N;c1++){
+    for(int c2=0;c2<N;c2++){
+      ret._internal+=innerProduct(lhs._internal[c1][c2],rhs._internal[c1][c2]);
+    }}
+    return ret;
+  }
+  template<class l,class r> inline
+  auto innerProduct (const iScalar<l>& lhs,const iScalar<r>& rhs) -> iScalar<decltype(innerProduct(lhs._internal,rhs._internal))>
+  {
+    typedef decltype(innerProduct(lhs._internal,rhs._internal)) ret_t;
+    iScalar<ret_t> ret;
+    ret._internal = innerProduct(lhs._internal,rhs._internal);
+    return ret;
+  }

 }
 #endif
@@ -53,6 +53,7 @@ namespace Grid {
  public:
    typedef typename T::scalar_type scalar_type;
    typedef typename T::vector_type vector_type;
+    typedef typename T::vector_typeD vector_typeD;
    typedef typename T::tensor_reduced tensor_reduced;
    typedef typename T::scalar_object scalar_object;
    typedef typename T::Complexified Complexified;
@@ -67,6 +68,7 @@ namespace Grid {
  public:
    typedef RealF scalar_type;
    typedef RealF vector_type;
+    typedef RealD vector_typeD;
    typedef RealF tensor_reduced ;
    typedef RealF scalar_object;
    typedef ComplexF Complexified;
@@ -77,6 +79,7 @@ namespace Grid {
  public:
    typedef RealD scalar_type;
    typedef RealD vector_type;
+    typedef RealD vector_typeD;
    typedef RealD tensor_reduced;
    typedef RealD scalar_object;
    typedef ComplexD Complexified;
@@ -87,6 +90,7 @@ namespace Grid {
  public:
    typedef ComplexF scalar_type;
    typedef ComplexF vector_type;
+    typedef ComplexD vector_typeD;
    typedef ComplexF tensor_reduced;
    typedef ComplexF scalar_object;
    typedef ComplexF Complexified;
@@ -97,6 +101,7 @@ namespace Grid {
  public:
    typedef ComplexD scalar_type;
    typedef ComplexD vector_type;
+    typedef ComplexD vector_typeD;
    typedef ComplexD tensor_reduced;
    typedef ComplexD scalar_object;
    typedef ComplexD Complexified;
@@ -107,6 +112,7 @@ namespace Grid {
  public:
    typedef Integer scalar_type;
    typedef Integer vector_type;
+    typedef Integer vector_typeD;
    typedef Integer tensor_reduced;
    typedef Integer scalar_object;
    typedef void Complexified;
@@ -118,6 +124,7 @@ namespace Grid {
  public:
    typedef RealF  scalar_type;
    typedef vRealF vector_type;
+    typedef vRealD vector_typeD;
    typedef vRealF tensor_reduced;
    typedef RealF  scalar_object;
    typedef vComplexF Complexified;
@@ -128,6 +135,7 @@ namespace Grid {
  public:
    typedef RealD  scalar_type;
    typedef vRealD vector_type;
+    typedef vRealD vector_typeD;
    typedef vRealD tensor_reduced;
    typedef RealD  scalar_object;
    typedef vComplexD Complexified;
@@ -138,6 +146,7 @@ namespace Grid {
  public:
    typedef ComplexF  scalar_type;
    typedef vComplexF vector_type;
+    typedef vComplexD vector_typeD;
    typedef vComplexF tensor_reduced;
    typedef ComplexF  scalar_object;
    typedef vComplexF Complexified;
@@ -148,6 +157,7 @@ namespace Grid {
  public:
    typedef ComplexD  scalar_type;
    typedef vComplexD vector_type;
+    typedef vComplexD vector_typeD;
    typedef vComplexD tensor_reduced;
    typedef ComplexD  scalar_object;
    typedef vComplexD Complexified;
@@ -158,6 +168,7 @@ namespace Grid {
  public:
    typedef  Integer scalar_type;
    typedef vInteger vector_type;
+    typedef vInteger vector_typeD;
    typedef vInteger tensor_reduced;
    typedef  Integer scalar_object;
    typedef void Complexified;
@@ -241,7 +252,8 @@ namespace Grid {
  template<typename T>
  class isSIMDvectorized{
    template<typename U>
-    static typename std::enable_if< !std::is_same< typename GridTypeMapper<typename getVectorType<U>::type>::scalar_type,   typename GridTypeMapper<typename getVectorType<U>::type>::vector_type>::value, char>::type test(void *);
+    static typename std::enable_if< !std::is_same< typename GridTypeMapper<typename getVectorType<U>::type>::scalar_type,   
+      typename GridTypeMapper<typename getVectorType<U>::type>::vector_type>::value, char>::type test(void *);

    template<typename U>
    static double test(...);
@@ -311,8 +311,8 @@ void Grid_init(int *argc,char ***argv)
    std::cout<<GridLogMessage<<std::endl;
    std::cout<<GridLogMessage<<"Performance:"<<std::endl;
    std::cout<<GridLogMessage<<std::endl;
-    std::cout<<GridLogMessage<<"  --comms-isend   : Asynchronous MPI calls; several dirs at a time "<<std::endl;    
-    std::cout<<GridLogMessage<<"  --comms-sendrecv: Synchronous MPI calls; one dirs at a time "<<std::endl;    
+    std::cout<<GridLogMessage<<"  --comms-concurrent : Asynchronous MPI calls; several dirs at a time "<<std::endl;    
+    std::cout<<GridLogMessage<<"  --comms-sequential : Synchronous MPI calls; one dir at a time "<<std::endl;    
    std::cout<<GridLogMessage<<"  --comms-overlap : Overlap comms with compute "<<std::endl;    
    std::cout<<GridLogMessage<<std::endl;
    std::cout<<GridLogMessage<<"  --dslash-generic: Wilson kernel for generic Nc"<<std::endl;    
@@ -457,5 +457,6 @@ void Grid_debug_handler_init(void)

  sigaction(SIGFPE,&sa,NULL);
  sigaction(SIGKILL,&sa,NULL);
+  sigaction(SIGILL,&sa,NULL);
 }
 }
@@ -0,0 +1,35 @@
+#!/bin/bash
+fn=$1
+
+grep "double zmobius_" $fn |
+awk 'BEGIN{ m["zmobius_b_coeff"]=0; m["zmobius_c_coeff"]=1; }{ val[m[substr($2,0,15)]][substr($2,17)+0]=$4; }END{
+
+    ls=length(val[0])/2;
+
+    print "ls = " ls
+
+    bmc=-111;
+
+    for (s=0;s<ls;s++) {
+      br[s] = val[0][2*s + 0];
+      bi[s] = val[0][2*s + 1];
+      cr[s] = val[1][2*s + 0];
+      ci[s] = val[1][2*s + 1];
+
+      t=br[s] - cr[s];
+      if (bmc == -111)
+        bmc=t;
+      else if (bmc != t)
+        print "Warning: b-c is not constant!";
+
+      omegar[s] = (-1.0 + 2.0* br[s])/(4.0*bi[s]**2.0 + (1.0 - 2.0* br[s])**2);
+      omegai[s] = - 2.0* bi[s]/(4.0*bi[s]**2.0 + (1.0 - 2.0* br[s])**2);
+    }
+
+    print "b-c = " bmc
+
+    for (s=0;s<ls;s++) {
+      printf( "omega.push_back( std::complex<double>(%.15g,%.15g) );\n",omegar[s],omegai[s]);
+    }
+
+}'
@@ -54,8 +54,8 @@ int main (int argc, char ** argv)
  GridSerialRNG     sRNGa;
  GridSerialRNG     sRNGb;

-  pRNGa.SeedRandomDevice();
-  sRNGa.SeedRandomDevice();
+  pRNGa.SeedFixedIntegers(std::vector<int>({45,12,81,9}));
+  sRNGa.SeedFixedIntegers(std::vector<int>({45,12,81,9}));
  
  std::string rfile("./ckpoint_rng.4000");
  NerscIO::writeRNGState(sRNGa,pRNGa,rfile);
@@ -41,7 +41,7 @@ int main (int argc, char ** argv)

  GridCartesian        Fine(latt_size,simd_layout,mpi_layout);

-  GridParallelRNG      FineRNG(&Fine);  FineRNG.SeedRandomDevice();
+  GridParallelRNG      FineRNG(&Fine);  FineRNG.SeedFixedIntegers(std::vector<int>({45,12,81,9}));

  LatticeComplex U(&Fine);
  LatticeComplex ShiftU(&Fine);
@@ -125,7 +125,7 @@ template<class scal, class vec,class functor >
 void Tester(const functor &func)
 {
  GridSerialRNG          sRNG;
-  sRNG.SeedRandomDevice();
+  sRNG.SeedFixedIntegers(std::vector<int>({45,12,81,9}));
  
  int Nsimd = vec::Nsimd();

@@ -184,7 +184,7 @@ void IntTester(const functor &func)
  typedef Integer  scal;
  typedef vInteger vec;
  GridSerialRNG          sRNG;
-  sRNG.SeedRandomDevice();
+  sRNG.SeedFixedIntegers(std::vector<int>({45,12,81,9}));

  int Nsimd = vec::Nsimd();

@@ -242,7 +242,7 @@ template<class reduced,class scal, class vec,class functor >
 void ReductionTester(const functor &func)
 {
  GridSerialRNG          sRNG;
-  sRNG.SeedRandomDevice();
+  sRNG.SeedFixedIntegers(std::vector<int>({45,12,81,9}));
  
  int Nsimd = vec::Nsimd();

@@ -308,18 +308,23 @@ public:
  int n;
  funcExchange(int _n) { n=_n;};
  template<class vec>    void operator()(vec &r1,vec &r2,vec &i1,vec &i2) const { exchange(r1,r2,i1,i2,n);}
-  template<class scal>   void apply(std::vector<scal> &r1,std::vector<scal> &r2,std::vector<scal> &in1,std::vector<scal> &in2)  const { 
+  template<class scal>   void apply(std::vector<scal> &r1,
+				    std::vector<scal> &r2,
+				    std::vector<scal> &in1,
+				    std::vector<scal> &in2)  const 
+  { 
    int sz=in1.size();
-
-    
    int msk = sz>>(n+1);

-    int j1=0;
-    int j2=0;
-    for(int i=0;i<sz;i++) if ( (i&msk) == 0 ) r1[j1++] = in1[ i ];
-    for(int i=0;i<sz;i++) if ( (i&msk) == 0 ) r1[j1++] = in2[ i ];
-    for(int i=0;i<sz;i++) if ( (i&msk)  ) r2[j2++] = in1[ i ];
-    for(int i=0;i<sz;i++) if ( (i&msk)  ) r2[j2++] = in2[ i ];
+    for(int i=0;i<sz;i++) {
+      int j1 = i&(~msk);
+      int j2 = i|msk;
+      if  ( (i&msk) == 0 ) { r1[i]=in1[j1];}
+      else                 { r1[i]=in2[j1];}
+
+      if  ( (i&msk) == 0 ) { r2[i]=in1[j2];}
+      else                 { r2[i]=in2[j2];}
+    }      
  }
  std::string name(void) const { return std::string("Exchange"); }
 };
@@ -343,7 +348,7 @@ template<class scal, class vec,class functor >
 void PermTester(const functor &func)
 {
  GridSerialRNG          sRNG;
-  sRNG.SeedRandomDevice();
+  sRNG.SeedFixedIntegers(std::vector<int>({45,12,81,9}));
  
  int Nsimd = vec::Nsimd();

@@ -409,7 +414,7 @@ template<class scal, class vec,class functor >
 void ExchangeTester(const functor &func)
 {
  GridSerialRNG          sRNG;
-  sRNG.SeedRandomDevice();
+  sRNG.SeedFixedIntegers(std::vector<int>({45,12,81,9}));
  
  int Nsimd = vec::Nsimd();

@@ -454,8 +459,8 @@ void ExchangeTester(const functor &func)

  std::cout<<GridLogMessage << " " << func.name() << " " <<func.n <<std::endl;

-  //  for(int i=0;i<Nsimd;i++) std::cout << " i "<<i<<" "<<reference1[i]<<" "<<result1[i]<<std::endl;
-  //  for(int i=0;i<Nsimd;i++) std::cout << " i "<<i<<" "<<reference2[i]<<" "<<result2[i]<<std::endl;
+  //for(int i=0;i<Nsimd;i++) std::cout << " i "<<i<<" ref "<<reference1[i]<<" res "<<result1[i]<<std::endl;
+  //for(int i=0;i<Nsimd;i++) std::cout << " i "<<i<<" ref "<<reference2[i]<<" res "<<result2[i]<<std::endl;

  for(int i=0;i<Nsimd;i++){
    int found=0;
@@ -465,7 +470,7 @@ void ExchangeTester(const functor &func)
 	//	std::cout << " i "<<i<<" j "<<j<<" "<<reference1[j]<<" "<<result1[i]<<std::endl;
      }
    }
-    assert(found==1);
+    //    assert(found==1);
  }
  for(int i=0;i<Nsimd;i++){
    int found=0;
@@ -475,15 +480,24 @@ void ExchangeTester(const functor &func)
 	//	std::cout << " i "<<i<<" j "<<j<<" "<<reference2[j]<<" "<<result2[i]<<std::endl;
      }
    }
-    assert(found==1);
+    //    assert(found==1);
  }

+  /*
+  for(int i=0;i<Nsimd;i++){
+    std::cout << " i "<< i
+	      <<" result1  "<<result1[i]
+	      <<" result2  "<<result2[i]
+	      <<" test1  "<<test1[i]
+	      <<" test2  "<<test2[i]
+	      <<" input1 "<<input1[i]
+	      <<" input2 "<<input2[i]<<std::endl;
+  }
+  */
  for(int i=0;i<Nsimd;i++){
    assert(test1[i]==input1[i]);
    assert(test2[i]==input2[i]);
-  }//    std::cout << " i "<< i<<" test1"<<test1[i]<<" "<<input1[i]<<std::endl;
-    //    std::cout << " i "<< i<<" test2"<<test2[i]<<" "<<input2[i]<<std::endl;
-  //  }
+  }
 }


@@ -678,5 +692,69 @@ int main (int argc, char ** argv)
  IntTester(funcMinus());
  IntTester(funcTimes());

+  std::cout<<GridLogMessage << "==================================="<<  std::endl;
+  std::cout<<GridLogMessage << "Testing precisionChange            "<<  std::endl;
+  std::cout<<GridLogMessage << "==================================="<<  std::endl;
+  {
+    GridSerialRNG          sRNG;
+    sRNG.SeedFixedIntegers(std::vector<int>({45,12,81,9}));
+    const int Ndp = 16;
+    const int Nsp = Ndp/2;
+    const int Nhp = Ndp/4;
+    std::vector<vRealH,alignedAllocator<vRealH> > H (Nhp);
+    std::vector<vRealF,alignedAllocator<vRealF> > F (Nsp);
+    std::vector<vRealF,alignedAllocator<vRealF> > FF(Nsp);
+    std::vector<vRealD,alignedAllocator<vRealD> > D (Ndp);
+    std::vector<vRealD,alignedAllocator<vRealD> > DD(Ndp);
+    for(int i=0;i<16;i++){
+      random(sRNG,D[i]);
+    }
+    // Double to Single
+    precisionChange(&F[0],&D[0],Ndp);
+    precisionChange(&DD[0],&F[0],Ndp);
+    std::cout << GridLogMessage<<"Double to single";
+    for(int i=0;i<Ndp;i++){
+      //      std::cout << "DD["<<i<<"] = "<< DD[i]<<" "<<D[i]<<" "<<DD[i]-D[i] <<std::endl; 
+      DD[i] = DD[i] - D[i];
+      decltype(innerProduct(DD[0],DD[0])) nrm;
+      nrm = innerProduct(DD[i],DD[i]);
+      auto tmp = Reduce(nrm);
+      //      std::cout << tmp << std::endl;
+      assert( tmp < 1.0e-14 ); 
+    }
+    std::cout <<" OK ! "<<std::endl;
+
+    // Double to Half
+#ifdef USE_FP16
+    std::cout << GridLogMessage<< "Double to half" ;
+    precisionChange(&H[0],&D[0],Ndp);
+    precisionChange(&DD[0],&H[0],Ndp);
+    for(int i=0;i<Ndp;i++){
+      //      std::cout << "DD["<<i<<"] = "<< DD[i]<<" "<<D[i]<<" "<<DD[i]-D[i]<<std::endl; 
+      DD[i] = DD[i] - D[i];
+      decltype(innerProduct(DD[0],DD[0])) nrm;
+      nrm = innerProduct(DD[i],DD[i]);
+      auto tmp = Reduce(nrm);
+      //      std::cout << tmp << std::endl;
+      assert( tmp < 1.0e-3 ); 
+    }
+    std::cout <<" OK ! "<<std::endl;
+
+    std::cout << GridLogMessage<< "Single to half";
+    // Single to Half
+    precisionChange(&H[0] ,&F[0],Nsp);
+    precisionChange(&FF[0],&H[0],Nsp);
+    for(int i=0;i<Nsp;i++){
+      //      std::cout << "FF["<<i<<"] = "<< FF[i]<<" "<<F[i]<<" "<<FF[i]-F[i]<<std::endl; 
+      FF[i] = FF[i] - F[i];
+      decltype(innerProduct(FF[0],FF[0])) nrm;
+      nrm = innerProduct(FF[i],FF[i]);
+      auto tmp = Reduce(nrm);
+      //      std::cout << tmp << std::endl;
+      assert( tmp < 1.0e-3 ); 
+    }
+    std::cout <<" OK ! "<<std::endl;
+#endif
+  }
  Grid_finalize();
 }
@@ -52,7 +52,7 @@ int main (int argc, char ** argv)
  GridRedBlackCartesian rbFine(latt_size,simd_layout,mpi_layout);
  GridParallelRNG       fRNG(&Fine);

-  //  fRNG.SeedRandomDevice();
+  //  fRNG.SeedFixedIntegers(std::vector<int>({45,12,81,9}));
  std::vector<int> seeds({1,2,3,4});
  fRNG.SeedFixedIntegers(seeds);
  
@@ -49,7 +49,7 @@ int main (int argc, char ** argv)
  GridCartesian         Fine  (latt_size,simd_layout,mpi_layout);
  GridRedBlackCartesian RBFine(latt_size,simd_layout,mpi_layout,mask,1);

-  GridParallelRNG      FineRNG(&Fine);  FineRNG.SeedRandomDevice();
+  GridParallelRNG      FineRNG(&Fine);  FineRNG.SeedFixedIntegers(std::vector<int>({45,12,81,9}));

  LatticeComplex U(&Fine);
  LatticeComplex ShiftU(&Fine);
@@ -49,7 +49,7 @@ int main (int argc, char ** argv)
  GridCartesian         Fine  (latt_size,simd_layout,mpi_layout);
  GridRedBlackCartesian RBFine(latt_size,simd_layout,mpi_layout,mask,1);

-  GridParallelRNG      FineRNG(&Fine);  FineRNG.SeedRandomDevice();
+  GridParallelRNG      FineRNG(&Fine);  FineRNG.SeedFixedIntegers(std::vector<int>({45,12,81,9}));

  LatticeComplex err(&Fine);
  LatticeComplex U(&Fine);
@@ -41,7 +41,7 @@ int main (int argc, char ** argv)

  GridCartesian        Fine(latt_size,simd_layout,mpi_layout);

-  GridParallelRNG      FineRNG(&Fine);  FineRNG.SeedRandomDevice();
+  GridParallelRNG      FineRNG(&Fine);  FineRNG.SeedFixedIntegers(std::vector<int>({45,12,81,9}));

  LatticeComplex U(&Fine);
  LatticeComplex ShiftU(&Fine);
@@ -148,11 +148,13 @@ class FourierAcceleratedGaugeFixer  : public Gimpl {
    Complex psqMax(16.0);
    Fp =  psqMax*one/psq;

+    /*
    static int once;
    if ( once == 0 ) { 
      std::cout << " Fp " << Fp <<std::endl;
      once ++;
-    }
+      }*/
+
    pokeSite(TComplex(1.0),Fp,coor);

    dmuAmu_p  = dmuAmu_p * Fp; 
@@ -245,7 +245,7 @@ int main(int argc, char *argv[])
  GridCartesian Grid(latt_size,simd_layout,mpi_layout);
  GridSerialRNG sRNG;
  
-  sRNG.SeedRandomDevice();
+  sRNG.SeedFixedIntegers(std::vector<int>({45,12,81,9}));
  
  std::cout << GridLogMessage << "======== Test algebra" << std::endl;
  createTestAlgebra();
@@ -50,7 +50,7 @@ int main (int argc, char ** argv)
  GridParallelRNG          pRNG(&Grid);
  //  std::vector<int> seeds({1,2,3,4});
  //  pRNG.SeedFixedIntegers(seeds);
-  pRNG.SeedRandomDevice();
+  pRNG.SeedFixedIntegers(std::vector<int>({45,12,81,9}));

  typedef typename GparityWilsonFermionR::FermionField FermionField;

@@ -86,7 +86,7 @@ int main(int argc, char** argv) {

  // Projectors 
  GridParallelRNG gridRNG(grid);
-  gridRNG.SeedRandomDevice();
+  gridRNG.SeedFixedIntegers(std::vector<int>({45,12,81,9}));
  SU3Adjoint::LatticeAdjMatrix Gauss(grid);
  SU3::LatticeAlgebraVector ha(grid);
  SU3::LatticeAlgebraVector hb(grid);
@@ -89,8 +89,8 @@ int main(int argc, char **argv) {
      GridSerialRNG SerialRNG;
      GridSerialRNG SerialRNG1;

-      FineRNG.SeedRandomDevice();
-      SerialRNG.SeedRandomDevice();
+      FineRNG.SeedFixedIntegers(std::vector<int>({45,12,81,9}));
+      SerialRNG.SeedFixedIntegers(std::vector<int>({45,12,81,9}));

      std::cout << "SerialRNG" << SerialRNG._generators[0] << std::endl;

@@ -43,10 +43,10 @@ int main (int argc, char ** argv)

  std::vector<int> seeds({1,2,3,4});

-  GridSerialRNG             sRNG;   sRNG.SeedRandomDevice();
+  GridSerialRNG             sRNG;   sRNG.SeedFixedIntegers(std::vector<int>({45,12,81,9}));
  GridSerialRNG            fsRNG;  fsRNG.SeedFixedIntegers(seeds);

-  GridParallelRNG           pRNG(&Grid);   pRNG.SeedRandomDevice();
+  GridParallelRNG           pRNG(&Grid);   pRNG.SeedFixedIntegers(std::vector<int>({45,12,81,9}));
  GridParallelRNG          fpRNG(&Grid);  fpRNG.SeedFixedIntegers(seeds);

  SpinMatrix rnd  ; 
@@ -51,7 +51,7 @@ int main (int argc, char ** argv)
  std::vector<int> seeds({1,2,3,4});
  GridParallelRNG          pRNG(&Grid);
  pRNG.SeedFixedIntegers(seeds);
-  //  pRNG.SeedRandomDevice();
+  //  pRNG.SeedFixedIntegers(std::vector<int>({45,12,81,9}));

  typedef typename ImprovedStaggeredFermionR::FermionField FermionField; 
  typedef typename ImprovedStaggeredFermionR::ComplexField ComplexField; 
@@ -2,11 +2,10 @@

    Grid physics library, www.github.com/paboyle/Grid 

-    Source file: ./tests/Test_wilson_even_odd.cc
+    Source file: ./tests/Test_wilson_tm_even_odd.cc

    Copyright (C) 2015

-Author: Peter Boyle <paboyle@ph.ed.ac.uk>
 Author: paboyle <paboyle@ph.ed.ac.uk>

    This program is free software; you can redistribute it and/or modify
@@ -62,7 +61,7 @@ int main (int argc, char ** argv)
  GridParallelRNG          pRNG(&Grid);
  //  std::vector<int> seeds({1,2,3,4});
  //  pRNG.SeedFixedIntegers(seeds);
-  pRNG.SeedRandomDevice();
+  pRNG.SeedFixedIntegers(std::vector<int>({45,12,81,9}));

  LatticeFermion src   (&Grid); random(pRNG,src);
  LatticeFermion phi   (&Grid); random(pRNG,phi);
@@ -89,8 +88,8 @@ int main (int argc, char ** argv)
  }

  RealD mass=0.1;
-  RealD mu  = 0.1;
-  WilsonTMFermionR Dw(Umu,Grid,RBGrid,mass,mu);
+
+  WilsonFermionR Dw(Umu,Grid,RBGrid,mass);

  LatticeFermion src_e   (&RBGrid);
  LatticeFermion src_o   (&RBGrid);
@@ -207,7 +206,7 @@ int main (int argc, char ** argv)
  pickCheckerboard(Odd ,phi_o,phi);
  RealD t1,t2;

-  SchurDiagMooeeOperator<WilsonTMFermionR,LatticeFermion> HermOpEO(Dw);
+  SchurDiagMooeeOperator<WilsonFermionR,LatticeFermion> HermOpEO(Dw);
  HermOpEO.MpcDagMpc(chi_e,dchi_e,t1,t2);
  HermOpEO.MpcDagMpc(chi_o,dchi_o,t1,t2);

@@ -2,10 +2,11 @@

    Grid physics library, www.github.com/paboyle/Grid 

-    Source file: ./tests/Test_wilson_tm_even_odd.cc
+    Source file: ./tests/Test_wilson_even_odd.cc

    Copyright (C) 2015

+Author: Peter Boyle <paboyle@ph.ed.ac.uk>
 Author: paboyle <paboyle@ph.ed.ac.uk>

    This program is free software; you can redistribute it and/or modify
@@ -61,7 +62,7 @@ int main (int argc, char ** argv)
  GridParallelRNG          pRNG(&Grid);
  //  std::vector<int> seeds({1,2,3,4});
  //  pRNG.SeedFixedIntegers(seeds);
-  pRNG.SeedRandomDevice();
+  pRNG.SeedFixedIntegers(std::vector<int>({45,12,81,9}));

  LatticeFermion src   (&Grid); random(pRNG,src);
  LatticeFermion phi   (&Grid); random(pRNG,phi);
@@ -88,8 +89,8 @@ int main (int argc, char ** argv)
  }

  RealD mass=0.1;
-
-  WilsonFermionR Dw(Umu,Grid,RBGrid,mass);
+  RealD mu  = 0.1;
+  WilsonTMFermionR Dw(Umu,Grid,RBGrid,mass,mu);

  LatticeFermion src_e   (&RBGrid);
  LatticeFermion src_o   (&RBGrid);
@@ -206,7 +207,7 @@ int main (int argc, char ** argv)
  pickCheckerboard(Odd ,phi_o,phi);
  RealD t1,t2;

-  SchurDiagMooeeOperator<WilsonFermionR,LatticeFermion> HermOpEO(Dw);
+  SchurDiagMooeeOperator<WilsonTMFermionR,LatticeFermion> HermOpEO(Dw);
  HermOpEO.MpcDagMpc(chi_e,dchi_e,t1,t2);
  HermOpEO.MpcDagMpc(chi_o,dchi_o,t1,t2);

@@ -0,0 +1,287 @@
+    /*************************************************************************************
+
+    Grid physics library, www.github.com/paboyle/Grid 
+
+    Source file: ./tests/Test_dwf_even_odd.cc
+
+    Copyright (C) 2015
+
+Author: Peter Boyle <paboyle@ph.ed.ac.uk>
+Author: paboyle <paboyle@ph.ed.ac.uk>
+
+    This program is free software; you can redistribute it and/or modify
+    it under the terms of the GNU General Public License as published by
+    the Free Software Foundation; either version 2 of the License, or
+    (at your option) any later version.
+
+    This program is distributed in the hope that it will be useful,
+    but WITHOUT ANY WARRANTY; without even the implied warranty of
+    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+    GNU General Public License for more details.
+
+    You should have received a copy of the GNU General Public License along
+    with this program; if not, write to the Free Software Foundation, Inc.,
+    51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
+
+    See the full license in the file "LICENSE" in the top level distribution directory
+    *************************************************************************************/
+    /*  END LEGAL */
+#include <Grid/Grid.h>
+
+using namespace std;
+using namespace Grid;
+using namespace Grid::QCD;
+
+template<class d>
+struct scal {
+  d internal;
+};
+
+  Gamma::Algebra Gmu [] = {
+    Gamma::Algebra::GammaX,
+    Gamma::Algebra::GammaY,
+    Gamma::Algebra::GammaZ,
+    Gamma::Algebra::GammaT
+  };
+
+
+int main (int argc, char ** argv)
+{
+  Grid_init(&argc,&argv);
+
+  int threads = GridThread::GetThreads();
+  std::cout<<GridLogMessage << "Grid is set up to use "<<threads<<" threads"<<std::endl;
+
+
+  const int Ls=10;
+  GridCartesian         * UGrid   = SpaceTimeGrid::makeFourDimGrid(GridDefaultLatt(), GridDefaultSimd(Nd,vComplex::Nsimd()),GridDefaultMpi());
+  GridCartesian         * FGrid   = SpaceTimeGrid::makeFiveDimGrid(Ls,UGrid);
+  GridRedBlackCartesian * UrbGrid = SpaceTimeGrid::makeFourDimRedBlackGrid(UGrid);
+  GridRedBlackCartesian * FrbGrid = SpaceTimeGrid::makeFiveDimRedBlackGrid(Ls,UGrid);
+
+  std::vector<int> seeds4({1,2,3,4});
+  std::vector<int> seeds5({5,6,7,8});
+
+  GridParallelRNG          RNG4(UGrid);  RNG4.SeedFixedIntegers(seeds4);
+  GridParallelRNG          RNG5(FGrid);  RNG5.SeedFixedIntegers(seeds5);
+
+  LatticeFermion src   (FGrid); random(RNG5,src);
+  LatticeFermion phi   (FGrid); random(RNG5,phi);
+  LatticeFermion chi   (FGrid); random(RNG5,chi);
+  LatticeFermion result(FGrid); result=zero;
+  LatticeFermion    ref(FGrid);    ref=zero;
+  LatticeFermion    tmp(FGrid);    tmp=zero;
+  LatticeFermion    err(FGrid);    err=zero;
+  LatticeGaugeField Umu(UGrid); random(RNG4,Umu);
+  std::vector<LatticeColourMatrix> U(4,UGrid);
+
+  // Only one non-zero (x): every direction with nn>0 is zeroed below
+  Umu=zero;
+  for(int nn=0;nn<Nd;nn++){
+    random(RNG4,U[nn]);
+    if ( nn>0 ) 
+      U[nn]=zero;
+    PokeIndex<LorentzIndex>(Umu,U[nn],nn);
+  }
+
+  RealD mass=0.1;
+  RealD M5  =1.8;
+  std::vector < std::complex<double>  > omegas;
+#if 0
+  for(int i=0;i<Ls;i++){
+    double imag = 0.;
+    if (i==0) imag=1.;
+    if (i==Ls-1) imag=-1.;
+    std::complex<double> temp (0.25+0.01*i, imag*0.01);
+    omegas.push_back(temp);
+  }
+#else
+  omegas.push_back( std::complex<double>(1.45806438985048,-0) );
+  omegas.push_back( std::complex<double>(1.18231318389348,-0) );
+  omegas.push_back( std::complex<double>(0.830951166685955,-0) );
+  omegas.push_back( std::complex<double>(0.542352409156791,-0) );
+  omegas.push_back( std::complex<double>(0.341985020453729,-0) );
+  omegas.push_back( std::complex<double>(0.21137902619029,-0) );
+  omegas.push_back( std::complex<double>(0.126074299502912,-0) );
+  omegas.push_back( std::complex<double>(0.0990136651962626,-0) );
+  omegas.push_back( std::complex<double>(0.0686324988446592,0.0550658530827402) );
+  omegas.push_back( std::complex<double>(0.0686324988446592,-0.0550658530827402) );
+#endif
+
+  ZMobiusFermionR Ddwf(Umu, *FGrid, *FrbGrid, *UGrid, *UrbGrid, mass, M5, omegas,1.,0.);
+//  DomainWallFermionR Ddwf(Umu,*FGrid,*FrbGrid,*UGrid,*UrbGrid,mass,M5);
+
+  LatticeFermion src_e (FrbGrid);
+  LatticeFermion src_o (FrbGrid);
+  LatticeFermion r_e   (FrbGrid);
+  LatticeFermion r_o   (FrbGrid);
+  LatticeFermion r_eo  (FGrid);
+  LatticeFermion r_eeoo(FGrid);
+
+  std::cout<<GridLogMessage<<"=========================================================="<<std::endl;
+  std::cout<<GridLogMessage<<"= Testing that Meo + Moe + Moo + Mee = Munprec "<<std::endl;
+  std::cout<<GridLogMessage<<"=========================================================="<<std::endl;
+
+  pickCheckerboard(Even,src_e,src);
+  pickCheckerboard(Odd,src_o,src);
+
+  Ddwf.Meooe(src_e,r_o);  std::cout<<GridLogMessage<<"Applied Meo"<<std::endl;
+  Ddwf.Meooe(src_o,r_e);  std::cout<<GridLogMessage<<"Applied Moe"<<std::endl;
+  setCheckerboard(r_eo,r_o);
+  setCheckerboard(r_eo,r_e);
+
+  Ddwf.Mooee(src_e,r_e);  std::cout<<GridLogMessage<<"Applied Mee"<<std::endl;
+  Ddwf.Mooee(src_o,r_o);  std::cout<<GridLogMessage<<"Applied Moo"<<std::endl;
+  setCheckerboard(r_eeoo,r_e);
+  setCheckerboard(r_eeoo,r_o);
+
+  r_eo=r_eo+r_eeoo;
+  Ddwf.M(src,ref);  
+
+  //  std::cout<<GridLogMessage << r_eo<<std::endl;
+  //  std::cout<<GridLogMessage << ref <<std::endl;
+
+  err= ref - r_eo;
+  std::cout<<GridLogMessage << "EO norm diff   "<< norm2(err)<< " "<<norm2(ref)<< " " << norm2(r_eo) <<std::endl;
+    
+  LatticeComplex cerr(FGrid);
+  cerr = localInnerProduct(err,err);
+  //  std::cout<<GridLogMessage << cerr<<std::endl;
+
+  std::cout<<GridLogMessage<<"=============================================================="<<std::endl;
+  std::cout<<GridLogMessage<<"= Test MooeeDagger is the dagger of Mooee by requiring                "<<std::endl;
+  std::cout<<GridLogMessage<<"=  < phi | Mooee | chi > * = < chi | Mooee^dag| phi>  "<<std::endl;
+  std::cout<<GridLogMessage<<"=============================================================="<<std::endl;
+
+  LatticeFermion chi_e   (FrbGrid);
+  LatticeFermion chi_o   (FrbGrid);
+
+  LatticeFermion dchi_e  (FrbGrid);
+  LatticeFermion dchi_o  (FrbGrid);
+
+  LatticeFermion phi_e   (FrbGrid);
+  LatticeFermion phi_o   (FrbGrid);
+
+  LatticeFermion dphi_e  (FrbGrid);
+  LatticeFermion dphi_o  (FrbGrid);
+
+  pickCheckerboard(Even,chi_e,chi);
+  pickCheckerboard(Odd ,chi_o,chi);
+  pickCheckerboard(Even,phi_e,phi);
+  pickCheckerboard(Odd ,phi_o,phi);
+
+  Ddwf.Mooee(chi_e,dchi_o);
+  Ddwf.Mooee(chi_o,dchi_e);
+  Ddwf.MooeeDag(phi_e,dphi_o);
+  Ddwf.MooeeDag(phi_o,dphi_e);
+
+  ComplexD pDce = innerProduct(phi_e,dchi_e);
+  ComplexD pDco = innerProduct(phi_o,dchi_o);
+  ComplexD cDpe = innerProduct(chi_e,dphi_e);
+  ComplexD cDpo = innerProduct(chi_o,dphi_o);
+
+
+  std::cout<<GridLogMessage <<"e "<<pDce<<" "<<cDpe <<std::endl;
+  std::cout<<GridLogMessage <<"o "<<pDco<<" "<<cDpo <<std::endl;
+
+  std::cout<<GridLogMessage <<"pDce - conj(cDpo) "<< pDce-conj(cDpo) <<std::endl;
+  std::cout<<GridLogMessage <<"pDco - conj(cDpe) "<< pDco-conj(cDpe) <<std::endl;
+
+  std::cout<<GridLogMessage<<"=============================================================="<<std::endl;
+  std::cout<<GridLogMessage<<"= Test Ddagger is the dagger of D by requiring                "<<std::endl;
+  std::cout<<GridLogMessage<<"=  < phi | Deo | chi > * = < chi | Deo^dag| phi>  "<<std::endl;
+  std::cout<<GridLogMessage<<"=============================================================="<<std::endl;
+  
+
+  pickCheckerboard(Even,chi_e,chi);
+  pickCheckerboard(Odd ,chi_o,chi);
+  pickCheckerboard(Even,phi_e,phi);
+  pickCheckerboard(Odd ,phi_o,phi);
+
+  Ddwf.Meooe(chi_e,dchi_o);
+  Ddwf.Meooe(chi_o,dchi_e);
+  Ddwf.MeooeDag(phi_e,dphi_o);
+  Ddwf.MeooeDag(phi_o,dphi_e);
+
+  pDce = innerProduct(phi_e,dchi_e);
+  pDco = innerProduct(phi_o,dchi_o);
+  cDpe = innerProduct(chi_e,dphi_e);
+  cDpo = innerProduct(chi_o,dphi_o);
+
+  std::cout<<GridLogMessage <<"e "<<pDce<<" "<<cDpe <<std::endl;
+  std::cout<<GridLogMessage <<"o "<<pDco<<" "<<cDpo <<std::endl;
+
+  std::cout<<GridLogMessage <<"pDce - conj(cDpo) "<< pDce-conj(cDpo) <<std::endl;
+  std::cout<<GridLogMessage <<"pDco - conj(cDpe) "<< pDco-conj(cDpe) <<std::endl;
+
+  std::cout<<GridLogMessage<<"=============================================================="<<std::endl;
+  std::cout<<GridLogMessage<<"= Test MeeInv Mee = 1                                         "<<std::endl;
+  std::cout<<GridLogMessage<<"=============================================================="<<std::endl;
+
+  pickCheckerboard(Even,chi_e,chi);
+  pickCheckerboard(Odd ,chi_o,chi);
+
+  Ddwf.Mooee(chi_e,src_e);
+  Ddwf.MooeeInv(src_e,phi_e);
+
+  Ddwf.Mooee(chi_o,src_o);
+  Ddwf.MooeeInv(src_o,phi_o);
+  
+  setCheckerboard(phi,phi_e);
+  setCheckerboard(phi,phi_o);
+
+  err = phi-chi;
+  std::cout<<GridLogMessage << "norm diff   "<< norm2(err)<< std::endl;
+
+  std::cout<<GridLogMessage<<"=============================================================="<<std::endl;
+  std::cout<<GridLogMessage<<"= Test MeeInvDag MeeDag = 1                                   "<<std::endl;
+  std::cout<<GridLogMessage<<"=============================================================="<<std::endl;
+
+  pickCheckerboard(Even,chi_e,chi);
+  pickCheckerboard(Odd ,chi_o,chi);
+
+  Ddwf.MooeeDag(chi_e,src_e);
+  Ddwf.MooeeInvDag(src_e,phi_e);
+
+  Ddwf.MooeeDag(chi_o,src_o);
+  Ddwf.MooeeInvDag(src_o,phi_o);
+  
+  setCheckerboard(phi,phi_e);
+  setCheckerboard(phi,phi_o);
+
+  err = phi-chi;
+  std::cout<<GridLogMessage << "norm diff   "<< norm2(err)<< std::endl;
+
+  std::cout<<GridLogMessage<<"=============================================================="<<std::endl;
+  std::cout<<GridLogMessage<<"= Test MpcDagMpc is Hermitian              "<<std::endl;
+  std::cout<<GridLogMessage<<"=============================================================="<<std::endl;
+  
+  random(RNG5,phi);
+  random(RNG5,chi);
+  pickCheckerboard(Even,chi_e,chi);
+  pickCheckerboard(Odd ,chi_o,chi);
+  pickCheckerboard(Even,phi_e,phi);
+  pickCheckerboard(Odd ,phi_o,phi);
+  RealD t1,t2;
+
+
+  SchurDiagMooeeOperator<ZMobiusFermionR,LatticeFermion> HermOpEO(Ddwf);
+  HermOpEO.MpcDagMpc(chi_e,dchi_e,t1,t2);
+  HermOpEO.MpcDagMpc(chi_o,dchi_o,t1,t2);
+
+  HermOpEO.MpcDagMpc(phi_e,dphi_e,t1,t2);
+  HermOpEO.MpcDagMpc(phi_o,dphi_o,t1,t2);
+
+  pDce = innerProduct(phi_e,dchi_e);
+  pDco = innerProduct(phi_o,dchi_o);
+  cDpe = innerProduct(chi_e,dphi_e);
+  cDpo = innerProduct(chi_o,dphi_o);
+
+  std::cout<<GridLogMessage <<"e "<<pDce<<" "<<cDpe <<std::endl;
+  std::cout<<GridLogMessage <<"o "<<pDco<<" "<<cDpo <<std::endl;
+
+  std::cout<<GridLogMessage <<"pDco - conj(cDpo) "<< pDco-conj(cDpo) <<std::endl;
+  std::cout<<GridLogMessage <<"pDce - conj(cDpe) "<< pDce-conj(cDpe) <<std::endl;
+  
+  Grid_finalize();
+}
@@ -115,8 +115,8 @@ int main (int argc, char ** argv)
  RNG.SeedFixedIntegers(seeds);


-  RealD alpha = 1.0;
-  RealD beta  = 0.03;
+  RealD alpha = 1.2;
+  RealD beta  = 0.1;
  RealD mu    = 0.0;
  int order = 11;
  ChebyshevLanczos<LatticeComplex> Cheby(alpha,beta,mu,order);
@@ -131,10 +131,9 @@ int main (int argc, char ** argv)
  const int Nit= 10000;

  int Nconv;
-  RealD eresid = 1.0e-8;
+  RealD eresid = 1.0e-6;

  ImplicitlyRestartedLanczos<LatticeComplex> IRL(HermOp,X,Nk,Nm,eresid,Nit);
-
  ImplicitlyRestartedLanczos<LatticeComplex> ChebyIRL(HermOp,Cheby,Nk,Nm,eresid,Nit);

  LatticeComplex src(grid); gaussian(RNG,src);
@@ -145,9 +144,9 @@ int main (int argc, char ** argv)
  }
  
  {
-    //    std::vector<RealD>          eval(Nm);
-    //    std::vector<LatticeComplex> evec(Nm,grid);
-    //    ChebyIRL.calc(eval,evec,src, Nconv);
+    std::vector<RealD>          eval(Nm);
+    std::vector<LatticeComplex> evec(Nm,grid);
+    ChebyIRL.calc(eval,evec,src, Nconv);
  }

  Grid_finalize();
@@ -54,8 +54,8 @@ int main (int argc, char ** argv)

  std::vector<int> seeds({1,2,3,4});

-  GridParallelRNG          RNG5(FGrid);  RNG5.SeedRandomDevice();
-  GridParallelRNG          RNG4(UGrid);  RNG4.SeedRandomDevice();
+  GridParallelRNG          RNG5(FGrid);  RNG5.SeedFixedIntegers(std::vector<int>({45,12,81,9}));
+  GridParallelRNG          RNG4(UGrid);  RNG4.SeedFixedIntegers(std::vector<int>({45,12,81,9}));
  
  FermionField phi        (FGrid); gaussian(RNG5,phi);
  FermionField Mphi       (FGrid); 
@@ -50,7 +50,7 @@ int main (int argc, char ** argv)
  std::vector<int> seeds({1,2,3,4});

  GridParallelRNG          pRNG(&Grid);
-  pRNG.SeedRandomDevice();
+  pRNG.SeedFixedIntegers(std::vector<int>({45,12,81,9}));

  LatticeGaugeField U(&Grid);

@@ -50,7 +50,7 @@ int main (int argc, char ** argv)
  std::vector<int> seeds({1,2,3,4});

  GridParallelRNG          pRNG(&Grid);
-  pRNG.SeedRandomDevice();
+  pRNG.SeedFixedIntegers(std::vector<int>({45,12,81,9}));

  LatticeGaugeField U(&Grid);

@@ -50,7 +50,7 @@ int main (int argc, char ** argv)
  std::vector<int> seeds({1,2,3,4});

  GridParallelRNG          pRNG(&Grid);
-  pRNG.SeedRandomDevice();
+  pRNG.SeedFixedIntegers(std::vector<int>({45,12,81,9}));

  LatticeFermion phi        (&Grid); gaussian(pRNG,phi);
  LatticeFermion Mphi       (&Grid); 
@@ -50,7 +50,7 @@ int main (int argc, char ** argv)
  std::vector<int> seeds({1,2,3,4});

  GridParallelRNG          pRNG(&Grid);
-  pRNG.SeedRandomDevice();
+  pRNG.SeedFixedIntegers(std::vector<int>({45,12,81,9}));

  LatticeFermion phi        (&Grid); gaussian(pRNG,phi);
  LatticeFermion Mphi       (&Grid); 
@@ -50,7 +50,7 @@ int main (int argc, char ** argv)
  std::vector<int> seeds({1,2,3,4});

  GridParallelRNG          pRNG(&Grid);
-  pRNG.SeedRandomDevice();
+  pRNG.SeedFixedIntegers(std::vector<int>({45,12,81,9}));

  LatticeFermion phi        (&Grid); gaussian(pRNG,phi);
  LatticeFermion Mphi       (&Grid); 
@@ -282,8 +282,8 @@ double calc_grid_p(Grid::QCD::LatticeGaugeField & Umu)
  Grid::QCD::LatticeColourMatrix tmp(UGrid); 
  tmp = Grid::zero;

-  Grid::QCD::PokeIndex<Grid::QCD::LorentzIndex>(Umu,tmp,2);
-  Grid::QCD::PokeIndex<Grid::QCD::LorentzIndex>(Umu,tmp,3);
+  Grid::QCD::PokeIndex<LorentzIndex>(Umu,tmp,2);
+  Grid::QCD::PokeIndex<LorentzIndex>(Umu,tmp,3);

  Grid::QCD::WilsonGaugeActionR Wilson(beta); // Just take beta = 1.0
  
@@ -311,7 +311,7 @@ double calc_grid_r_dir(Grid::QCD::LatticeGaugeField & Umu)

  std::vector<Grid::QCD::LatticeColourMatrix> U(4,UGrid);
  for(int mu=0;mu<Nd;mu++){
-    U[mu] = Grid::PeekIndex<Grid::QCD::LorentzIndex>(Umu,mu);
+    U[mu] = Grid::PeekIndex<LorentzIndex>(Umu,mu);
  }

  Grid::QCD::LatticeComplex rect(UGrid);
@@ -322,7 +322,7 @@ double calc_grid_r_dir(Grid::QCD::LatticeGaugeField & Umu)
  for(int nu=0;nu<Grid::QCD::Nd;nu++){
    if ( mu!=nu ) {

-      Grid::QCD::WilsonLoops<Grid::QCD::LatticeGaugeField>::traceDirRectangle(rect,U,mu,nu);
+      Grid::QCD::ColourWilsonLoops::traceDirRectangle(rect,U,mu,nu);
      trect = Grid::sum(rect);
      crect = Grid::TensorRemove(trect);
      std::cout<< "mu/nu = "<<mu<<"/"<<nu<<" ; rect = "<<crect/vol/2.0/3.0<<std::endl;
@@ -344,10 +344,10 @@ double calc_grid_r_dir(Grid::QCD::LatticeGaugeField & Umu)
 	//           __ ___ 
 	//          |    __ |
 	Stap = 
-	  Grid::Cshift(Grid::QCD::CovShiftForward (U[mu],mu,
-		       Grid::QCD::CovShiftForward (U[nu],nu,
-		       Grid::QCD::CovShiftBackward(U[mu],mu,
-                       Grid::QCD::CovShiftBackward(U[mu],mu,
+	  Grid::Cshift(Grid::QCD::PeriodicBC::CovShiftForward (U[mu],mu,
+		       Grid::QCD::PeriodicBC::CovShiftForward (U[nu],nu,
+		       Grid::QCD::PeriodicBC::CovShiftBackward(U[mu],mu,
+                       Grid::QCD::PeriodicBC::CovShiftBackward(U[mu],mu,
 		       Grid::Cshift(adj(U[nu]),nu,-1))))) , mu, 1);

 	TrStap = Grid::trace (U[mu]*Stap);
@@ -361,10 +361,10 @@ double calc_grid_r_dir(Grid::QCD::LatticeGaugeField & Umu)
 	//              __ 
 	//          |__ __ |

-	Stap = Grid::Cshift(Grid::QCD::CovShiftForward (U[mu],mu,
-		            Grid::QCD::CovShiftBackward(U[nu],nu,
-   		            Grid::QCD::CovShiftBackward(U[mu],mu,
-                            Grid::QCD::CovShiftBackward(U[mu],mu, U[nu])))) , mu, 1);
+	Stap = Grid::Cshift(Grid::QCD::PeriodicBC::CovShiftForward (U[mu],mu,
+		            Grid::QCD::PeriodicBC::CovShiftBackward(U[nu],nu,
+   		            Grid::QCD::PeriodicBC::CovShiftBackward(U[mu],mu,
+                            Grid::QCD::PeriodicBC::CovShiftBackward(U[mu],mu, U[nu])))) , mu, 1);

 	TrStap = Grid::trace (U[mu]*Stap);

@@ -375,10 +375,10 @@ double calc_grid_r_dir(Grid::QCD::LatticeGaugeField & Umu)
 	//           __ 
 	//          |__ __ |

-	Stap = Grid::Cshift(Grid::QCD::CovShiftBackward(U[nu],nu,
-		            Grid::QCD::CovShiftBackward(U[mu],mu,
-                            Grid::QCD::CovShiftBackward(U[mu],mu,
-   		            Grid::QCD::CovShiftForward(U[nu],nu,U[mu])))) , mu, 1);
+	Stap = Grid::Cshift(Grid::QCD::PeriodicBC::CovShiftBackward(U[nu],nu,
+		            Grid::QCD::PeriodicBC::CovShiftBackward(U[mu],mu,
+                            Grid::QCD::PeriodicBC::CovShiftBackward(U[mu],mu,
+   		            Grid::QCD::PeriodicBC::CovShiftForward(U[nu],nu,U[mu])))) , mu, 1);

 	TrStap = Grid::trace (U[mu]*Stap);

@@ -390,10 +390,10 @@ double calc_grid_r_dir(Grid::QCD::LatticeGaugeField & Umu)
 	//           __ ___ 
 	//          |__    |

-	Stap = Grid::Cshift(Grid::QCD::CovShiftForward (U[nu],nu,
-		            Grid::QCD::CovShiftBackward(U[mu],mu,
-                            Grid::QCD::CovShiftBackward(U[mu],mu,
-                            Grid::QCD::CovShiftBackward(U[nu],nu,U[mu])))) , mu, 1);
+	Stap = Grid::Cshift(Grid::QCD::PeriodicBC::CovShiftForward (U[nu],nu,
+		            Grid::QCD::PeriodicBC::CovShiftBackward(U[mu],mu,
+                            Grid::QCD::PeriodicBC::CovShiftBackward(U[mu],mu,
+                            Grid::QCD::PeriodicBC::CovShiftBackward(U[nu],nu,U[mu])))) , mu, 1);


 	TrStap = Grid::trace (U[mu]*Stap);
@@ -412,12 +412,12 @@ double calc_grid_r_dir(Grid::QCD::LatticeGaugeField & Umu)
 	 * Make staple for loops centered at coor of link ; this one is ok.     //     |
 	 */
 	//	Stap = 
-	//	  Grid::Cshift(Grid::QCD::CovShiftForward(U[nu],nu,U[nu]),mu,1)* // ->||
-	//	  Grid::adj(Grid::QCD::CovShiftForward(U[nu],nu,Grid::QCD::CovShiftForward(U[nu],nu,U[mu]))) ;
-	Stap = Grid::Cshift(Grid::QCD::CovShiftForward(U[nu],nu,
-		            Grid::QCD::CovShiftForward(U[nu],nu,
-                            Grid::QCD::CovShiftBackward(U[mu],mu,
-                            Grid::QCD::CovShiftBackward(U[nu],nu,  Grid::Cshift(adj(U[nu]),nu,-1))))) , mu, 1);
+	//	  Grid::Cshift(Grid::QCD::PeriodicBC::CovShiftForward(U[nu],nu,U[nu]),mu,1)* // ->||
+	//	  Grid::adj(Grid::QCD::PeriodicBC::CovShiftForward(U[nu],nu,Grid::QCD::PeriodicBC::CovShiftForward(U[nu],nu,U[mu]))) ;
+	Stap = Grid::Cshift(Grid::QCD::PeriodicBC::CovShiftForward(U[nu],nu,
+		            Grid::QCD::PeriodicBC::CovShiftForward(U[nu],nu,
+                            Grid::QCD::PeriodicBC::CovShiftBackward(U[mu],mu,
+                            Grid::QCD::PeriodicBC::CovShiftBackward(U[nu],nu,  Grid::Cshift(adj(U[nu]),nu,-1))))) , mu, 1);
 	  
 	TrStap = Grid::trace (U[mu]*Stap);
 	SumTrStap += TrStap;
@@ -433,10 +433,10 @@ double calc_grid_r_dir(Grid::QCD::LatticeGaugeField & Umu)
 	//      |  | 
 	//       -- 

-	Stap = Grid::Cshift(Grid::QCD::CovShiftBackward(U[nu],nu,
-		            Grid::QCD::CovShiftBackward(U[nu],nu,
-                            Grid::QCD::CovShiftBackward(U[mu],mu,
-                            Grid::QCD::CovShiftForward (U[nu],nu,U[nu])))) , mu, 1);
+	Stap = Grid::Cshift(Grid::QCD::PeriodicBC::CovShiftBackward(U[nu],nu,
+		            Grid::QCD::PeriodicBC::CovShiftBackward(U[nu],nu,
+                            Grid::QCD::PeriodicBC::CovShiftBackward(U[mu],mu,
+                            Grid::QCD::PeriodicBC::CovShiftForward (U[nu],nu,U[nu])))) , mu, 1);

 	TrStap = Grid::trace (U[mu]*Stap);
 	trect = Grid::sum(TrStap);
@@ -460,10 +460,10 @@ double calc_grid_r_dir(Grid::QCD::LatticeGaugeField & Umu)
 	Grid::QCD::LatticeColourMatrix tmp(UGrid);
 	
 	// 2 (mu)x1(nu)
-	left_2=  Grid::QCD::CovShiftForward(U[mu],mu,U[mu]);   // Umu(x) Umu(x+mu)
+	left_2=  Grid::QCD::PeriodicBC::CovShiftForward(U[mu],mu,U[mu]);   // Umu(x) Umu(x+mu)
 	tmp=Grid::Cshift(U[nu],mu,2);                          // Unu(x+2mu)

-	upper_l=  Grid::QCD::CovShiftForward(tmp,nu,Grid::adj(left_2)); //  Unu(x+2mu) Umu^dag(x+mu+nu) Umu^dag(x+nu) 
+	upper_l=  Grid::QCD::PeriodicBC::CovShiftForward(tmp,nu,Grid::adj(left_2)); //  Unu(x+2mu) Umu^dag(x+mu+nu) Umu^dag(x+nu) 
 	//                 __ __ 
 	//              =       |
 	
@@ -533,9 +533,9 @@ double calc_grid_r_dir(Grid::QCD::LatticeGaugeField & Umu)
 	//   _
 	//  | |
 	//  | |
-	Grid::QCD::LatticeColourMatrix up2= Grid::QCD::CovShiftForward(U[nu],nu,U[nu]);
+	Grid::QCD::LatticeColourMatrix up2= Grid::QCD::PeriodicBC::CovShiftForward(U[nu],nu,U[nu]);

-	upper_l= Grid::QCD::CovShiftForward(Grid::Cshift(up2,mu,1),nu,Grid::Cshift(adj(U[mu]),nu,1));
+	upper_l= Grid::QCD::PeriodicBC::CovShiftForward(Grid::Cshift(up2,mu,1),nu,Grid::Cshift(adj(U[mu]),nu,1));
 	ds_U= upper_l*Grid::adj(up2);

 	RectPlaq_d = Grid::trace(U[mu]*ds_U);
@@ -555,7 +555,7 @@ double calc_grid_r_dir(Grid::QCD::LatticeGaugeField & Umu)
   downer_l=           |  
               (x)<----V                 
 */    
-	down_l= Grid::adj(Grid::QCD::CovShiftForward(U[mu],mu,up2)); //downer_l
+	down_l= Grid::adj(Grid::QCD::PeriodicBC::CovShiftForward(U[mu],mu,up2)); //downer_l
 /*
                     ^     |
   down_staple  =    |     V 
@@ -616,9 +616,9 @@ void check_grid_r_staple(Grid::QCD::LatticeGaugeField & Umu)
    // Vol as for each site
    Grid::RealD RectScale(1.0/vol/12.0/6.0/3.0); 

-    Grid::QCD::WilsonLoops<Grid::QCD::LatticeGaugeField>::RectStaple(staple,Umu,mu);
+    Grid::QCD::ColourWilsonLoops::RectStaple(staple,Umu,mu);
    
-    link = Grid::QCD::PeekIndex<Grid::QCD::LorentzIndex>(Umu,mu);
+    link = Grid::QCD::PeekIndex<LorentzIndex>(Umu,mu);

    Traced = Grid::trace( link*staple) * RectScale;
    Grid::QCD::TComplex Tp = Grid::sum(Traced);
@@ -655,9 +655,9 @@ void check_grid_p_staple(Grid::QCD::LatticeGaugeField & Umu)
    // Vol as for each site
    Grid::RealD Scale(1.0/vol/12.0/2.0/3.0); 

-    Grid::QCD::WilsonLoops<Grid::QCD::LatticeGaugeField>::Staple(staple,Umu,mu);
+    Grid::QCD::ColourWilsonLoops::Staple(staple,Umu,mu);
    
-    link = Grid::QCD::PeekIndex<Grid::QCD::LorentzIndex>(Umu,mu);
+    link = Grid::QCD::PeekIndex<LorentzIndex>(Umu,mu);

    Traced = Grid::trace( link*staple) * Scale;
    Grid::QCD::TComplex Tp = Grid::sum(Traced);
@@ -0,0 +1,364 @@
+    /*************************************************************************************
+
+    Grid physics library, www.github.com/paboyle/Grid 
+
+    Source file: ./tests/qdpxx/Test_qdpxx_munprec.cc
+
+    Copyright (C) 2015
+
+Author: Azusa Yamaguchi <ayamaguc@staffmail.ed.ac.uk>
+Author: paboyle <paboyle@ph.ed.ac.uk>
+
+    This program is free software; you can redistribute it and/or modify
+    it under the terms of the GNU General Public License as published by
+    the Free Software Foundation; either version 2 of the License, or
+    (at your option) any later version.
+
+    This program is distributed in the hope that it will be useful,
+    but WITHOUT ANY WARRANTY; without even the implied warranty of
+    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+    GNU General Public License for more details.
+
+    You should have received a copy of the GNU General Public License along
+    with this program; if not, write to the Free Software Foundation, Inc.,
+    51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
+
+    See the full license in the file "LICENSE" in the top level distribution directory
+    *************************************************************************************/
+    /*  END LEGAL */
+#include <Grid/Grid.h>
+
+double mq=0.1;
+
+typedef Grid::QCD::StaggeredImplR::FermionField FermionField;
+typedef Grid::QCD::LatticeGaugeField GaugeField;
+
+void make_gauge     (GaugeField & lat, FermionField &src);
+void calc_grid      (GaugeField & Uthin, GaugeField & Utriple,GaugeField & Ufat, FermionField &src, FermionField &res,int dag);
+void calc_chroma    (GaugeField & lat,GaugeField & uthin,GaugeField & ufat, FermionField &src, FermionField &res,int dag);
+
+#include <chroma.h>
+#include <actions/ferm/invert/syssolver_linop_cg_array.h>
+#include <actions/ferm/invert/syssolver_linop_aggregate.h>
+
+namespace Chroma { 
+
+
+class ChromaWrapper {
+public:
+  
+  typedef multi1d<LatticeColorMatrix> U;
+  typedef LatticeStaggeredFermion T4;
+  
+  static void ImportGauge(GaugeField & gr,
+			  QDP::multi1d<QDP::LatticeColorMatrix> & ch) 
+  {
+    Grid::QCD::LorentzColourMatrix LCM;
+    Grid::Complex cc;
+    QDP::ColorMatrix cm;
+    QDP::Complex c;
+
+    std::vector<int> x(4);
+    QDP::multi1d<int> cx(4);
+    std::vector<int> gd= gr._grid->GlobalDimensions();
+
+    for (x[0]=0;x[0]<gd[0];x[0]++){
+    for (x[1]=0;x[1]<gd[1];x[1]++){
+    for (x[2]=0;x[2]<gd[2];x[2]++){
+    for (x[3]=0;x[3]<gd[3];x[3]++){
+      cx[0] = x[0];
+      cx[1] = x[1];
+      cx[2] = x[2];
+      cx[3] = x[3];
+      Grid::peekSite(LCM,gr,x);
+
+      for(int mu=0;mu<4;mu++){
+	for(int i=0;i<3;i++){
+	for(int j=0;j<3;j++){
+	  cc = LCM(mu)()(i,j);
+	  c = QDP::cmplx(QDP::Real(real(cc)),QDP::Real(imag(cc)));
+	  QDP::pokeColor(cm,c,i,j);
+	}}
+	QDP::pokeSite(ch[mu],cm,cx);
+      }
+
+    }}}}
+  }
+
+  static void ExportGauge(GaugeField & gr,
+			  QDP::multi1d<QDP::LatticeColorMatrix> & ch) 
+  {
+    Grid::QCD::LorentzColourMatrix LCM;
+    Grid::Complex cc;
+    QDP::ColorMatrix cm;
+    QDP::Complex c;
+
+    std::vector<int> x(4);
+    QDP::multi1d<int> cx(4);
+    std::vector<int> gd= gr._grid->GlobalDimensions();
+
+    for (x[0]=0;x[0]<gd[0];x[0]++){
+    for (x[1]=0;x[1]<gd[1];x[1]++){
+    for (x[2]=0;x[2]<gd[2];x[2]++){
+    for (x[3]=0;x[3]<gd[3];x[3]++){
+      cx[0] = x[0];
+      cx[1] = x[1];
+      cx[2] = x[2];
+      cx[3] = x[3];
+
+      for(int mu=0;mu<4;mu++){
+	for(int i=0;i<3;i++){
+	for(int j=0;j<3;j++){
+	  cm = QDP::peekSite(ch[mu],cx);
+	  c  = QDP::peekColor(cm,i,j);
+	  cc = Grid::Complex(toDouble(real(c)),toDouble(imag(c)));
+	  LCM(mu)()(i,j)= cc;
+	}}
+      }
+      Grid::pokeSite(LCM,gr,x);
+
+    }}}}
+  }
+
+  
+  static void ImportFermion(FermionField & gr,
+			    QDP::LatticeStaggeredFermion & ch  ) 
+  {
+    Grid::QCD::ColourVector F;
+    Grid::Complex c;
+
+
+    std::vector<int> x(4);
+    QDP::multi1d<int> cx(4);
+    std::vector<int> gd= gr._grid->GlobalDimensions();
+
+    for (x[0]=0;x[0]<gd[0];x[0]++){
+    for (x[1]=0;x[1]<gd[1];x[1]++){
+    for (x[2]=0;x[2]<gd[2];x[2]++){
+    for (x[3]=0;x[3]<gd[3];x[3]++){
+      cx[0] = x[0];
+      cx[1] = x[1];
+      cx[2] = x[2];
+      cx[3] = x[3];
+
+      Grid::peekSite(F,gr,x);
+      QDP::ColorVector cv;
+      for(int j=0;j<3;j++){
+	QDP::Complex cc;
+	c  = F()()(j) ;
+	cc = QDP::cmplx(QDP::Real(real(c)),QDP::Real(imag(c)));
+	pokeColor(cv,cc,j);
+      }
+      QDP::StaggeredFermion cF;
+      pokeSpin(cF,cv,0);
+      QDP::pokeSite(ch,cF,cx);
+    }}}}
+  }
+  static void ExportFermion(FermionField & gr,
+			    QDP::LatticeStaggeredFermion & ch  ) 
+  {
+    Grid::QCD::ColourVector F;
+    Grid::Complex c;
+
+    std::vector<int> x(4);
+    QDP::multi1d<int> cx(4);
+    std::vector<int> gd= gr._grid->GlobalDimensions();
+
+    for (x[0]=0;x[0]<gd[0];x[0]++){
+    for (x[1]=0;x[1]<gd[1];x[1]++){
+    for (x[2]=0;x[2]<gd[2];x[2]++){
+    for (x[3]=0;x[3]<gd[3];x[3]++){
+      cx[0] = x[0];
+      cx[1] = x[1];
+      cx[2] = x[2];
+      cx[3] = x[3];
+
+      QDP::StaggeredFermion cF = QDP::peekSite(ch,cx);
+      for(int j=0;j<3;j++){
+	QDP::ColorVector cS=QDP::peekSpin(cF,0);
+	QDP::Complex cc=QDP::peekColor(cS,j);
+	c = Grid::Complex(QDP::toDouble(QDP::real(cc)), 
+			  QDP::toDouble(QDP::imag(cc)));
+	F()()(j) = c;
+      }
+      Grid::pokeSite(F,gr,x);
+    }}}}
+  }
+
+  static Handle< Chroma::EvenOddLinearOperator<T4,U,U> >  GetLinOp (U &u,U &u_fat,U &u_triple)
+  {
+    QDP::Real _mq(mq);
+    QDP::multi1d<int> bcs(QDP::Nd);
+
+    bcs[0] = bcs[1] = bcs[2] = bcs[3] = 1;
+
+    Chroma::AsqtadFermActParams p; 
+    p.Mass = _mq; 
+    p.u0 = Real(1.0);
+
+
+    Chroma::Handle<Chroma::FermBC<T4,U,U> > fbc(new Chroma::SimpleFermBC< T4, U, U >(bcs));
+    Chroma::Handle<Chroma::CreateFermState<T4,U,U> > cfs( new Chroma::CreateSimpleFermState<T4,U,U>(fbc));
+    Chroma::AsqtadFermAct S_f(cfs,p);
+    Chroma::Handle< Chroma::FermState<T4,U,U> >  ffs(  S_f.createState(u) );
+    u_fat   =ffs.cast<AsqtadConnectStateBase>()->getFatLinks();
+    u_triple=ffs.cast<AsqtadConnectStateBase>()->getTripleLinks();
+    return S_f.linOp(ffs);
+  }
+
+};
+}
+
+int main (int argc,char **argv )
+{
+
+  /********************************************************
+   * Setup QDP
+   *********************************************************/
+  Chroma::initialize(&argc,&argv);
+  Chroma::WilsonTypeFermActs4DEnv::registerAll(); 
+
+  /********************************************************
+   * Setup Grid
+   *********************************************************/
+  Grid::Grid_init(&argc,&argv);
+  Grid::GridCartesian * UGrid   = Grid::QCD::SpaceTimeGrid::makeFourDimGrid(Grid::GridDefaultLatt(), 
+									    Grid::GridDefaultSimd(Grid::QCD::Nd,Grid::vComplex::Nsimd()),
+									    Grid::GridDefaultMpi());
+  
+  std::vector<int> gd = UGrid->GlobalDimensions();
+  QDP::multi1d<int> nrow(QDP::Nd);
+  for(int mu=0;mu<4;mu++) nrow[mu] = gd[mu];
+
+  QDP::Layout::setLattSize(nrow);
+  QDP::Layout::create();
+
+  GaugeField uthin  (UGrid);
+  GaugeField ufat   (UGrid);
+  GaugeField utriple(UGrid);
+  FermionField    src(UGrid);
+  FermionField    res_chroma(UGrid);
+  FermionField    res_grid  (UGrid);
+  
+
+  {
+
+    std::cout << "*****************************"<<std::endl;
+    std::cout << "Staggered Action "            <<std::endl;
+    std::cout << "*****************************"<<std::endl;
+
+    make_gauge(uthin,src);
+
+    for(int dag=0;dag<2;dag++) {
+
+      std::cout << "Dag =  "<<dag<<std::endl;
+      
+      calc_chroma(uthin,utriple,ufat,src,res_chroma,dag);
+
+      // Remove the normalisation of Chroma Gauge links ??
+      std::cout << "Norm of chroma Asqtad multiply "<<Grid::norm2(res_chroma)<<std::endl;
+      calc_grid  (uthin,utriple,ufat,src,res_grid,dag);
+
+      std::cout << "Norm of thin gauge "<< Grid::norm2(uthin) <<std::endl;
+      std::cout << "Norm of fat  gauge "<< Grid::norm2(ufat) <<std::endl;
+
+      std::cout << "Norm of Grid Asqtad multiply "<<Grid::norm2(res_grid)<<std::endl;
+      
+      /*
+      std::cout << " site 0 of Uthin  "<<uthin._odata[0] <<std::endl;
+      std::cout << " site 0 of Utriple"<<utriple._odata[0] <<std::endl;
+      std::cout << " site 0 of Ufat   "<<ufat._odata[0] <<std::endl;
+
+      std::cout << " site 0 of Grid   "<<res_grid._odata[0] <<std::endl;
+      std::cout << " site 0 of Chroma "<<res_chroma._odata[0] <<std::endl;
+      */
+
+      res_chroma=res_chroma - res_grid;
+      std::cout << "Norm of difference "<<Grid::norm2(res_chroma)<<std::endl;
+    }
+  }
+
+  std::cout << "Finished test "<<std::endl;
+
+  Chroma::finalize();
+}
+
+void calc_chroma(GaugeField & lat, GaugeField &uthin, GaugeField &ufat, FermionField &src, FermionField &res,int dag)
+{
+  typedef QDP::LatticeStaggeredFermion T;
+  typedef QDP::multi1d<QDP::LatticeColorMatrix> U;
+  
+  U u(4);
+  U ut(4);
+  U uf(4);
+
+  //  Chroma::HotSt(u);
+  Chroma::ChromaWrapper::ImportGauge(lat,u) ;
+
+  QDP::LatticeStaggeredFermion  check;
+  QDP::LatticeStaggeredFermion  result;
+  QDP::LatticeStaggeredFermion  tmp;
+  QDP::LatticeStaggeredFermion  psi;
+
+  Chroma::ChromaWrapper::ImportFermion(src,psi);
+
+  auto linop =Chroma::ChromaWrapper::GetLinOp(u,uf,ut);
+
+  Chroma::ChromaWrapper::ExportGauge(uthin,ut) ;
+  Chroma::ChromaWrapper::ExportGauge(ufat ,uf) ;
+
+  enum Chroma::PlusMinus isign;
+  if ( dag ) {
+    isign=Chroma::MINUS;
+  } else {
+    isign=Chroma::PLUS;
+  }
+
+  std::cout << "Calling Chroma Linop "<< std::endl;
+  linop->evenEvenLinOp(tmp,psi,isign); check[rb[0]] = tmp;
+  linop->oddOddLinOp  (tmp,psi,isign); check[rb[1]] = tmp;
+  linop->evenOddLinOp(tmp,psi,isign) ; check[rb[0]]+= tmp;
+  linop->oddEvenLinOp(tmp,psi,isign) ; check[rb[1]]+= tmp;
+
+  Chroma::ChromaWrapper::ExportFermion(res,check) ;
+}
+
+
+void make_gauge(GaugeField & Umu,FermionField &src)
+{
+  using namespace Grid;
+  using namespace Grid::QCD;
+
+  std::vector<int> seeds4({1,2,3,4});
+
+  Grid::GridCartesian         * UGrid   = (Grid::GridCartesian *) Umu._grid;
+  Grid::GridParallelRNG          RNG4(UGrid);  RNG4.SeedFixedIntegers(seeds4);
+  Grid::QCD::SU3::HotConfiguration(RNG4,Umu);
+  Grid::gaussian(RNG4,src);
+}
+
+void calc_grid(GaugeField & Uthin, GaugeField & Utriple, GaugeField & Ufat, FermionField &src, FermionField &res,int dag)
+{
+  using namespace Grid;
+  using namespace Grid::QCD;
+
+  Grid::GridCartesian         * UGrid   = (Grid::GridCartesian *) Uthin._grid;
+  Grid::GridRedBlackCartesian * UrbGrid = Grid::QCD::SpaceTimeGrid::makeFourDimRedBlackGrid(UGrid);
+
+  Grid::QCD::ImprovedStaggeredFermionR Dstag(Uthin,Utriple,Ufat,*UGrid,*UrbGrid,mq*2.0);
+
+  std::cout << Grid::GridLogMessage <<" Calling Grid staggered multiply "<<std::endl;
+
+  if ( dag ) 
+    Dstag.Mdag(src,res);  
+  else 
+    Dstag.M(src,res);  
+
+  res = res ; // Convention mismatch to Chroma
+  return;
+} 
+
+
+
+
+
@@ -89,7 +89,7 @@ int main(int argc, char** argv) {
  GridStopWatch CGTimer;

  SchurDiagMooeeOperator<DomainWallFermionR, LatticeFermion> HermOpEO(Ddwf);
-  ConjugateGradient<LatticeFermion> CG(1.0e-8, 10000, 0);// switch off the assert
+  ConjugateGradient<LatticeFermion> CG(1.0e-5, 10000, 0);// switch off the assert

  CGTimer.Start();
  CG(HermOpEO, src_o, result_o);
@@ -73,7 +73,7 @@ int main (int argc, char ** argv)
  DomainWallFermionR Ddwf(Umu,*FGrid,*FrbGrid,*UGrid,*UrbGrid,mass,M5);

  MdagMLinearOperator<DomainWallFermionR,LatticeFermion> HermOp(Ddwf);
-  ConjugateGradient<LatticeFermion> CG(1.0e-8,10000);
+  ConjugateGradient<LatticeFermion> CG(1.0e-6,10000);
  CG(HermOp,src,result);

  Grid_finalize();
@@ -0,0 +1,119 @@
+    /*************************************************************************************
+
+    Grid physics library, www.github.com/paboyle/Grid 
+
+    Source file: ./tests/Test_wilson_cg_unprec.cc
+
+    Copyright (C) 2015
+
+Author: Azusa Yamaguchi <ayamaguc@staffmail.ed.ac.uk>
+Author: Peter Boyle <paboyle@ph.ed.ac.uk>
+
+    This program is free software; you can redistribute it and/or modify
+    it under the terms of the GNU General Public License as published by
+    the Free Software Foundation; either version 2 of the License, or
+    (at your option) any later version.
+
+    This program is distributed in the hope that it will be useful,
+    but WITHOUT ANY WARRANTY; without even the implied warranty of
+    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+    GNU General Public License for more details.
+
+    You should have received a copy of the GNU General Public License along
+    with this program; if not, write to the Free Software Foundation, Inc.,
+    51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
+
+    See the full license in the file "LICENSE" in the top level distribution directory
+    *************************************************************************************/
+    /*  END LEGAL */
+#include <Grid/Grid.h>
+#include <Grid/algorithms/iterative/BlockConjugateGradient.h>
+
+using namespace std;
+using namespace Grid;
+using namespace Grid::QCD;
+
+template<class d>
+struct scal {
+  d internal;
+};
+
+  Gamma::Algebra Gmu [] = {
+    Gamma::Algebra::GammaX,
+    Gamma::Algebra::GammaY,
+    Gamma::Algebra::GammaZ,
+    Gamma::Algebra::GammaT
+  };
+
+int main (int argc, char ** argv)
+{
+  typedef typename ImprovedStaggeredFermion5DR::FermionField FermionField; 
+  typedef typename ImprovedStaggeredFermion5DR::ComplexField ComplexField; 
+  typename ImprovedStaggeredFermion5DR::ImplParams params; 
+
+  const int Ls=4;
+
+  Grid_init(&argc,&argv);
+
+  std::vector<int> latt_size   = GridDefaultLatt();
+  std::vector<int> simd_layout = GridDefaultSimd(Nd,vComplex::Nsimd());
+  std::vector<int> mpi_layout  = GridDefaultMpi();
+
+  GridCartesian         * UGrid   = SpaceTimeGrid::makeFourDimGrid(GridDefaultLatt(), GridDefaultSimd(Nd,vComplex::Nsimd()),GridDefaultMpi());
+  GridRedBlackCartesian * UrbGrid = SpaceTimeGrid::makeFourDimRedBlackGrid(UGrid);
+  GridCartesian         * FGrid   = SpaceTimeGrid::makeFiveDimGrid(Ls,UGrid);
+  GridRedBlackCartesian * FrbGrid = SpaceTimeGrid::makeFiveDimRedBlackGrid(Ls,UGrid);
+
+  std::vector<int> seeds({1,2,3,4});
+  GridParallelRNG pRNG(UGrid );  pRNG.SeedFixedIntegers(seeds);
+  GridParallelRNG pRNG5(FGrid);  pRNG5.SeedFixedIntegers(seeds);
+
+  FermionField src(FGrid); random(pRNG5,src);
+  FermionField result(FGrid); result=zero;
+  RealD nrm = norm2(src);
+
+  LatticeGaugeField Umu(UGrid); SU3::HotConfiguration(pRNG,Umu);
+
+  RealD mass=0.01;
+  ImprovedStaggeredFermion5DR Ds(Umu,Umu,*FGrid,*FrbGrid,*UGrid,*UrbGrid,mass);
+  MdagMLinearOperator<ImprovedStaggeredFermion5DR,FermionField> HermOp(Ds);
+
+  ConjugateGradient<FermionField> CG(1.0e-8,10000);
+  BlockConjugateGradient<FermionField> BCG(1.0e-8,10000);
+  MultiRHSConjugateGradient<FermionField> mCG(1.0e-8,10000);
+
+  std::cout << GridLogMessage << "************************************************************************ "<<std::endl;
+  std::cout << GridLogMessage << " Calling 4d CG "<<std::endl;
+  std::cout << GridLogMessage << "************************************************************************ "<<std::endl;
+  ImprovedStaggeredFermionR Ds4d(Umu,Umu,*UGrid,*UrbGrid,mass);
+  MdagMLinearOperator<ImprovedStaggeredFermionR,FermionField> HermOp4d(Ds4d);
+  FermionField src4d(UGrid); random(pRNG,src4d);
+  FermionField result4d(UGrid); result4d=zero;
+  CG(HermOp4d,src4d,result4d);
+  std::cout << GridLogMessage << "************************************************************************ "<<std::endl;
+
+
+  std::cout << GridLogMessage << "************************************************************************ "<<std::endl;
+  std::cout << GridLogMessage << " Calling 5d CG for "<<Ls <<" right hand sides" <<std::endl;
+  std::cout << GridLogMessage << "************************************************************************ "<<std::endl;
+  result=zero;
+  CG(HermOp,src,result);
+  std::cout << GridLogMessage << "************************************************************************ "<<std::endl;
+
+  std::cout << GridLogMessage << "************************************************************************ "<<std::endl;
+  std::cout << GridLogMessage << " Calling multiRHS CG for "<<Ls <<" right hand sides" <<std::endl;
+  std::cout << GridLogMessage << "************************************************************************ "<<std::endl;
+  result=zero;
+  mCG(HermOp,src,result);
+  std::cout << GridLogMessage << "************************************************************************ "<<std::endl;
+
+  std::cout << GridLogMessage << "************************************************************************ "<<std::endl;
+  std::cout << GridLogMessage << " Calling Block CG for "<<Ls <<" right hand sides" <<std::endl;
+  std::cout << GridLogMessage << "************************************************************************ "<<std::endl;
+  result=zero;
+  BCG(HermOp,src,result);
+  std::cout << GridLogMessage << "************************************************************************ "<<std::endl;
+
+
+  Grid_finalize();
+}
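Reviewer note: the test above runs the same Ls right-hand sides through plain CG, the multi-RHS CG and the block CG. As a rough illustration of the multi-RHS idea only, the sketch below advances several independent systems in lockstep, with each right-hand side keeping its own alpha and beta. It is a standalone toy under stated assumptions, not Grid's MultiRHSConjugateGradient: the dense `Mat` type and the names `multi_rhs_cg`, `apply`, `dot` and `max_residual` are hypothetical and do not appear in the diff.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Vec   = std::vector<double>;
using Mat   = std::vector<Vec>;   // hypothetical dense stand-in for the Hermitian operator
using Block = std::vector<Vec>;   // Block[k] is the k-th right-hand side

// y = A x for a dense symmetric positive-definite matrix.
Vec apply(const Mat& A, const Vec& x) {
  Vec y(A.size(), 0.0);
  for (std::size_t i = 0; i < A.size(); ++i)
    for (std::size_t j = 0; j < x.size(); ++j) y[i] += A[i][j] * x[j];
  return y;
}

double dot(const Vec& a, const Vec& b) {
  double s = 0.0;
  for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
  return s;
}

// Solve A x_k = b_k for every k in lockstep. Each right-hand side has its own
// alpha, beta and residual norm; one that has converged is simply skipped.
// Returns the iteration at which all systems converged, or max_iter otherwise.
int multi_rhs_cg(const Mat& A, const Block& b, Block& x, double tol, int max_iter) {
  const std::size_t nrhs = b.size();
  Block r = b, p = b;                        // x starts at zero, so r = p = b
  std::vector<double> rr(nrhs), stop(nrhs);
  x.assign(nrhs, Vec());
  for (std::size_t k = 0; k < nrhs; ++k) {
    x[k].assign(b[k].size(), 0.0);
    rr[k]   = dot(r[k], r[k]);
    stop[k] = tol * tol * rr[k];             // relative stopping condition per RHS
  }
  for (int it = 1; it <= max_iter; ++it) {
    bool all_converged = true;
    for (std::size_t k = 0; k < nrhs; ++k) {
      if (rr[k] <= stop[k]) continue;        // this RHS is already done
      const Vec Ap = apply(A, p[k]);
      const double alpha = rr[k] / dot(p[k], Ap);
      for (std::size_t i = 0; i < Ap.size(); ++i) {
        x[k][i] += alpha * p[k][i];
        r[k][i] -= alpha * Ap[i];
      }
      const double rr_new = dot(r[k], r[k]);
      const double beta   = rr_new / rr[k];
      for (std::size_t i = 0; i < Ap.size(); ++i) p[k][i] = r[k][i] + beta * p[k][i];
      rr[k] = rr_new;
      if (rr[k] > stop[k]) all_converged = false;
    }
    if (all_converged) return it;
  }
  return max_iter;
}

// Largest true residual |b_k - A x_k| over all right-hand sides.
double max_residual(const Mat& A, const Block& b, const Block& x) {
  double worst = 0.0;
  for (std::size_t k = 0; k < b.size(); ++k) {
    const Vec Ax = apply(A, x[k]);
    double s = 0.0;
    for (std::size_t i = 0; i < Ax.size(); ++i) s += (b[k][i] - Ax[i]) * (b[k][i] - Ax[i]);
    worst = std::max(worst, std::sqrt(s));
  }
  return worst;
}
```

A block CG differs in that the scalars alpha and beta become small dense matrices coupling the right-hand sides, which this sketch deliberately leaves out.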
@@ -0,0 +1,82 @@
+    /*************************************************************************************
+
+    Grid physics library, www.github.com/paboyle/Grid 
+
+    Source file: ./tests/Test_wilson_cg_unprec.cc
+
+    Copyright (C) 2015
+
+Author: Azusa Yamaguchi <ayamaguc@staffmail.ed.ac.uk>
+Author: Peter Boyle <paboyle@ph.ed.ac.uk>
+
+    This program is free software; you can redistribute it and/or modify
+    it under the terms of the GNU General Public License as published by
+    the Free Software Foundation; either version 2 of the License, or
+    (at your option) any later version.
+
+    This program is distributed in the hope that it will be useful,
+    but WITHOUT ANY WARRANTY; without even the implied warranty of
+    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+    GNU General Public License for more details.
+
+    You should have received a copy of the GNU General Public License along
+    with this program; if not, write to the Free Software Foundation, Inc.,
+    51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
+
+    See the full license in the file "LICENSE" in the top level distribution directory
+    *************************************************************************************/
+    /*  END LEGAL */
+#include <Grid/Grid.h>
+#include <Grid/algorithms/iterative/BlockConjugateGradient.h>
+
+using namespace std;
+using namespace Grid;
+using namespace Grid::QCD;
+
+template<class d>
+struct scal {
+  d internal;
+};
+
+  Gamma::Algebra Gmu [] = {
+    Gamma::Algebra::GammaX,
+    Gamma::Algebra::GammaY,
+    Gamma::Algebra::GammaZ,
+    Gamma::Algebra::GammaT
+  };
+
+int main (int argc, char ** argv)
+{
+  typedef typename ImprovedStaggeredFermionR::FermionField FermionField; 
+  typedef typename ImprovedStaggeredFermionR::ComplexField ComplexField; 
+  typename ImprovedStaggeredFermionR::ImplParams params; 
+
+  Grid_init(&argc,&argv);
+
+  std::vector<int> latt_size   = GridDefaultLatt();
+  std::vector<int> simd_layout = GridDefaultSimd(Nd,vComplex::Nsimd());
+  std::vector<int> mpi_layout  = GridDefaultMpi();
+  GridCartesian               Grid(latt_size,simd_layout,mpi_layout);
+  GridRedBlackCartesian     RBGrid(latt_size,simd_layout,mpi_layout);
+
+  std::vector<int> seeds({1,2,3,4});
+  GridParallelRNG          pRNG(&Grid);  pRNG.SeedFixedIntegers(seeds);
+
+  FermionField src(&Grid); random(pRNG,src);
+  RealD nrm = norm2(src);
+  FermionField result(&Grid); result=zero;
+  LatticeGaugeField Umu(&Grid); SU3::HotConfiguration(pRNG,Umu);
+
+  double volume=1;
+  for(int mu=0;mu<Nd;mu++){
+    volume=volume*latt_size[mu];
+  }  
+  
+  RealD mass=0.1;
+  ImprovedStaggeredFermionR Ds(Umu,Umu,Grid,RBGrid,mass);
+
+  MdagMLinearOperator<ImprovedStaggeredFermionR,FermionField> HermOp(Ds);
+  CG(HermOp,src,result);
+
+  Grid_finalize();
+}
@@ -43,7 +43,7 @@ Gamma::Algebra Gmu[] = {Gamma::Algebra::GammaX, Gamma::Algebra::GammaY, Gamma::A
 int main(int argc, char** argv) {
  Grid_init(&argc, &argv);

-  const int Ls = 16;
+  const int Ls = 10;

  GridCartesian* UGrid = SpaceTimeGrid::makeFourDimGrid(
      GridDefaultLatt(), GridDefaultSimd(Nd, vComplex::Nsimd()),
@@ -80,11 +80,27 @@ int main(int argc, char** argv) {
  RealD mass = 0.01;
  RealD M5 = 1.8;
  std::vector < std::complex<double>  > omegas;
+#if 0
  for(int i=0;i<Ls;i++){
-  	std::complex<double> temp (0.25+0.00*i, 0.0+0.00*i);
- 	 omegas.push_back(temp);
+    double imag = 0.;
+    if (i==0) imag=1.;
+    if (i==Ls-1) imag=-1.;
+    std::complex<double> temp (0.25+0.01*i, imag*0.01);
+    omegas.push_back(temp);
  }
-//  DomainWallFermionR Ddwf(Umu, *FGrid, *FrbGrid, *UGrid, *UrbGrid, mass, M5);
+#else
+  omegas.push_back( std::complex<double>(1.45806438985048,-0) );
+  omegas.push_back( std::complex<double>(1.18231318389348,-0) );
+  omegas.push_back( std::complex<double>(0.830951166685955,-0) );
+  omegas.push_back( std::complex<double>(0.542352409156791,-0) );
+  omegas.push_back( std::complex<double>(0.341985020453729,-0) );
+  omegas.push_back( std::complex<double>(0.21137902619029,-0) );
+  omegas.push_back( std::complex<double>(0.126074299502912,-0) );
+  omegas.push_back( std::complex<double>(0.0990136651962626,-0) );
+  omegas.push_back( std::complex<double>(0.0686324988446592,0.0550658530827402) );
+  omegas.push_back( std::complex<double>(0.0686324988446592,-0.0550658530827402) );
+#endif
+
  ZMobiusFermionR Ddwf(Umu, *FGrid, *FrbGrid, *UGrid, *UrbGrid, mass, M5, omegas,1.,0.);

  LatticeFermion src_o(FrbGrid);
@@ -0,0 +1,3 @@
+AM_LDFLAGS += -L$(LIBRARY_PATH) -ltestu01 -lprobdist -lmylib -lm
+AM_CXXFLAGS += -I$(C_INCLUDE_PATH)
+include Make.inc
@@ -0,0 +1,175 @@
+    /*************************************************************************************
+
+    Grid physics library, www.github.com/paboyle/Grid 
+
+    Source file: ./tests/Test_smallcrush.cc
+
+    Copyright (C) 2015
+
+Author: Peter Boyle <paboyle@ph.ed.ac.uk>
+
+    This program is free software; you can redistribute it and/or modify
+    it under the terms of the GNU General Public License as published by
+    the Free Software Foundation; either version 2 of the License, or
+    (at your option) any later version.
+
+    This program is distributed in the hope that it will be useful,
+    but WITHOUT ANY WARRANTY; without even the implied warranty of
+    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+    GNU General Public License for more details.
+
+    You should have received a copy of the GNU General Public License along
+    with this program; if not, write to the Free Software Foundation, Inc.,
+    51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
+
+    See the full license in the file "LICENSE" in the top level distribution directory
+    *************************************************************************************/
+    /*  END LEGAL */
+#include <Grid/Grid.h>
+
+using namespace std;
+using namespace Grid;
+using namespace Grid::QCD;
+
+// Wrap Grid's parallel RNG for testU01
+#undef BIG_CRUSH             // BigCrush battery (long running)
+#define MIDDLE_CRUSH         // Crush battery (currently enabled)
+#undef SMALL_CRUSH           // SmallCrush battery (quickest)
+#undef TEST_RNG_STANDALONE   // Test serial RNGs in isolation
+
+extern "C" { 
+#include "TestU01.h"
+}
+
+std::vector<std::ranlux48>      EngineRanlux;
+std::vector<std::mt19937>       EngineMT;
+
+#include <Grid/sitmo_rng/sitmo_prng_engine.hpp>
+std::vector<sitmo::prng_engine> EngineSitmo;
+
+std::uniform_int_distribution<uint32_t> uid;
+
+uint32_t GetU01Ranlux(void) {
+  return uid(EngineRanlux[0]);
+};
+uint32_t GetU01MT(void) {
+  return uid(EngineMT[0]);
+};
+uint32_t GetU01Sitmo(void) {
+  return uid(EngineSitmo[0]);
+};
+
+typedef Grid::GridRNGbase::RngEngine RngEngine;
+
+struct TestRNG { 
+public:
+  static GridParallelRNG *pRNG;
+  static GridSerialRNG *sRNG;
+  static GridBase *_grid;
+  static RngEngine Eng;
+  static uint64_t site;
+  static uint64_t gsites;
+  static char *name;
+
+  static void Init(GridParallelRNG *_pRNG,GridSerialRNG *_sRNG,GridBase *grid) {
+    pRNG = _pRNG;
+    sRNG = _sRNG;
+    _grid= grid;
+    gsites= grid->_gsites;
+    site = 0;
+  }
+  static uint32_t GetU01(void) { 
+    uint32_t ret_val;
+    ret_val = pRNG->GlobalU01(site);
+    site=(site+1)%gsites;
+    return ret_val;
+  }
+};
+
+GridParallelRNG *TestRNG::pRNG;
+GridSerialRNG   *TestRNG::sRNG;
+GridBase        *TestRNG::_grid;
+RngEngine        TestRNG::Eng;
+uint64_t         TestRNG::site;
+uint64_t         TestRNG::gsites;
+
+#ifdef RNG_SITMO
+char * TestRNG::name = (char *)"Grid_Sitmo";
+#endif
+#ifdef RNG_RANLUX
+char * TestRNG::name = (char *)"Grid_ranlux48";
+#endif
+#ifdef RNG_MT19937
+char * TestRNG::name = (char *)"Grid_mt19937";
+#endif
+
+int main (int argc, char ** argv)
+{
+  Grid_init(&argc,&argv);
+
+  std::vector<int> latt_size   = GridDefaultLatt();
+  std::vector<int> simd_layout = GridDefaultSimd(4,vComplex::Nsimd());
+  std::vector<int> mpi_layout  = GridDefaultMpi();
+     
+  GridCartesian     Grid(latt_size,simd_layout,mpi_layout);
+
+  std::vector<int> seeds({1,2,3,4});
+  std::seed_seq seq(seeds.begin(),seeds.end());
+
+  EngineRanlux.push_back(std::ranlux48(seq));
+  EngineMT.push_back(std::mt19937(seq));
+  EngineSitmo.push_back(sitmo::prng_engine(seq));
+
+  std::cout << GridLogMessage<< "Initialising Grid RNGs "<<std::endl; 
+  GridParallelRNG           pRNG(&Grid);   
+  pRNG.SeedFixedIntegers(std::vector<int>({43,12,7019,9}));
+  GridSerialRNG           sRNG;
+  sRNG.SeedFixedIntegers(std::vector<int>({102,12,99,15}));
+  std::cout << GridLogMessage<< "Initialised Grid RNGs "<<std::endl; 
+
+  TestRNG::Init(&pRNG,&sRNG,&Grid);
+  std::cout << GridLogMessage<< "Grid RNG's are "<< std::string(TestRNG::name) <<std::endl; 
+
+  unif01_Gen * gen;
+
+#ifdef TEST_RNG_STANDALONE
+  std::cout << GridLogMessage<< "Testing Standalone Ranlux" <<std::endl; 
+  gen = unif01_CreateExternGenBits ((char *)"GridRanlux",GetU01Ranlux);
+  bbattery_SmallCrush (gen);
+  unif01_DeleteExternGenBits(gen);
+  std::cout << GridLogMessage<< "Testing Standalone Ranlux is complete" <<std::endl; 
+
+  std::cout << GridLogMessage<< "Testing Standalone Mersenne Twister" <<std::endl; 
+  gen = unif01_CreateExternGenBits ((char *)"GridMT",GetU01MT);
+  bbattery_SmallCrush (gen);
+  unif01_DeleteExternGenBits(gen);
+  std::cout << GridLogMessage<< "Testing Standalone Mersenne Twister is complete" <<std::endl; 
+
+  std::cout << GridLogMessage<< "Testing Standalone Sitmo" <<std::endl; 
+  gen = unif01_CreateExternGenBits ((char *)"GridSitmo",GetU01Sitmo);
+  bbattery_SmallCrush (gen);
+  unif01_DeleteExternGenBits(gen);
+  std::cout << GridLogMessage<< "Testing Standalone Sitmo is complete" <<std::endl; 
+#endif
+
+#ifdef BIG_CRUSH
+  std::cout << GridLogMessage<< "Testing Grid BigCrush for "<< std::string(TestRNG::name) <<std::endl; 
+  gen = unif01_CreateExternGenBits(TestRNG::name,TestRNG::GetU01);
+  bbattery_BigCrush (gen);
+  std::cout << GridLogMessage<< "Testing Grid BigCrush "<< std::string(TestRNG::name)<<" is complete" <<std::endl; 
+#endif
+#ifdef MIDDLE_CRUSH
+  std::cout << GridLogMessage<< "Testing Grid Crush for "<< std::string(TestRNG::name) <<std::endl; 
+  gen = unif01_CreateExternGenBits(TestRNG::name,TestRNG::GetU01);
+  bbattery_Crush (gen);
+  std::cout << GridLogMessage<< "Testing Grid Crush "<< std::string(TestRNG::name)<<" is complete" <<std::endl; 
+#endif
+#ifdef SMALL_CRUSH
+  std::cout << GridLogMessage<< "Testing Grid SmallCrush for "<< std::string(TestRNG::name) <<std::endl; 
+  gen = unif01_CreateExternGenBits(TestRNG::name,TestRNG::GetU01);
+  bbattery_SmallCrush (gen);
+  std::cout << GridLogMessage<< "Testing Grid SmallCrush "<< std::string(TestRNG::name)<<" is complete" <<std::endl; 
+#endif
+  Grid_finalize();
+}
+
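Reviewer note: in the test above, `unif01_CreateExternGenBits` is handed a bare function with no user-data argument, which seems to be why `TestRNG` keeps its state in static members and walks the sites round-robin via `site=(site+1)%gsites`. The sketch below shows that adaptor pattern in isolation, assuming one `std::mt19937` per site as a stand-in for the parallel RNG. `SiteRNGs` and `RoundRobin` are hypothetical names, the simple `base_seed + s` seeding is for illustration only, and no TestU01 call is made.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical stand-in for a parallel RNG: one independent engine per site.
struct SiteRNGs {
  std::vector<std::mt19937> engines;
  SiteRNGs(std::size_t nsites, std::uint32_t base_seed) {
    for (std::size_t s = 0; s < nsites; ++s) {
      // Illustrative seeding only; a real code would use a better scheme.
      const auto seed = static_cast<std::mt19937::result_type>(base_seed + s);
      engines.push_back(std::mt19937(seed));
    }
  }
  std::uint32_t draw(std::size_t site) {
    return static_cast<std::uint32_t>(engines[site]());
  }
};

// Adaptor exposing the per-site engines through a plain function pointer.
// All state is static because the callback signature carries no context.
struct RoundRobin {
  static SiteRNGs*   rng;
  static std::size_t site;

  static void init(SiteRNGs* r) { rng = r; site = 0; }

  // Matches unsigned int (*)(void): returns one draw from the current site,
  // then advances to the next site, wrapping at the end.
  static unsigned int next() {
    const unsigned int v = rng->draw(site);
    site = (site + 1) % rng->engines.size();
    return v;
  }
};

SiteRNGs*   RoundRobin::rng  = nullptr;
std::size_t RoundRobin::site = 0;
```

Interleaving the sites this way means a statistical battery sees correlations between neighbouring site streams as well as within one stream, at the cost of the adaptor being neither re-entrant nor thread-safe.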
Author	SHA1	Message	Date
paboyle	8e161152e4	MultiRHS solver improvements with slice operations moved into lattice and sped up. Block solver requires a lot of performance work.	2017-04-18 10:51:55 +01:00
paboyle	3141ebac10	MultiRHS working, starting to optimise. Block doesn't and I thought it already was; puzzled.	2017-04-17 10:50:19 +01:00
paboyle	7ede696126	Non compile of tests fixed	2017-04-16 23:40:00 +01:00
paboyle	bf516c3b81	higher precision reduction variables in norm and inner product	2017-04-15 12:27:28 +01:00
paboyle	441a52ee5d	First cut at higher precision reduction	2017-04-15 10:57:21 +01:00
paboyle	a8db024c92	Cleaning up the dense matrix and lanczos sector	2017-04-15 08:54:11 +01:00
paboyle	a9c22d5f43	Verbose removal	2017-04-14 14:38:49 +01:00
paboyle	3ca41458a3	Fix to no USE_FP16 case	2017-04-14 14:20:54 +01:00
Peter Boyle	951be75292	Half precision conversion working on AVX512 now too	2017-04-13 17:35:11 +01:00
Peter Boyle	b9113ed310	Patches for knl	2017-04-13 12:02:12 -04:00
paboyle	42fb49d3fd	Merge branch 'develop' of https://github.com/paboyle/Grid into develop	2017-04-13 14:12:47 +01:00
paboyle	2a54c9aaab	Merge branch 'feature/block-cg' into develop	2017-04-13 14:12:24 +01:00
paboyle	0957378679	Fixing conditional ugly way	2017-04-13 13:47:56 +01:00
paboyle	2ed6c76fc5	Getting multiline if then fi working	2017-04-13 13:43:13 +01:00
paboyle	d3b9a7fa14	F16c apparently requires AVX, even if the 128 bit are used. Seems odd.	2017-04-13 13:19:11 +01:00
paboyle	75ea306ce9	Another try at travis	2017-04-13 13:05:32 +01:00
paboyle	4226c633c4	Default to FP16 off again	2017-04-13 12:51:39 +01:00
paboyle	5a4eafbf7e	.travis	2017-04-13 12:50:43 +01:00
paboyle	eb8e26018b	Travis update for macos	2017-04-13 12:35:11 +01:00
paboyle	db5ea001a3	Update to use Xcode 8.3 since -mfp16 causes SIGILL	2017-04-13 12:22:40 +01:00
paboyle	2846f079e5	Predicate tests on fp16 being enabled	2017-04-13 12:08:05 +01:00
paboyle	1d502e4ed6	FP16 optional compile time	2017-04-13 11:55:24 +01:00
paboyle	73cdf0fffe	Drop f16c from SSE because of a macos compile error on travis	2017-04-13 11:23:41 +01:00
paboyle	1c25773319	Trap illegal instructions	2017-04-13 10:51:40 +01:00
paboyle	c38400b26f	Trap signals	2017-04-13 10:35:20 +01:00
paboyle	9c3065b860	Debug flags off again	2017-04-13 10:01:32 +01:00
paboyle	94eb829d08	Align cast fixed for __mm128i gcc complained	2017-04-13 08:40:44 +01:00
paboyle	68392ddb5b	Exchange in generic Precision change in AVX, SSE, AVX512, Generic. QPX still to do.	2017-04-13 08:38:12 +01:00
paboyle	cb6b81ae82	Half precision conversion	2017-04-12 19:32:37 +01:00
portelli	8ef4300412	spurious .dirstamp files removed	2017-04-10 17:00:22 +01:00
portelli	98a24ebf31	The macro “magics” is very intensive for the preprocessor in the measurement code which has numerous serialisable classes. Reducing the number of serialisable fields to 64 (instead of 1024) helps a lot, this is enough for now and can be extended trivially if needed in the future.	2017-04-10 16:58:54 +01:00
paboyle	b12dc89d26	Commenting and clean up	2017-04-10 20:38:20 +09:00
paboyle	d80d802f9d	MultiRHS solver test	2017-04-10 00:12:12 +09:00
paboyle	3d99b09dba	Start of blockCG	2017-04-09 23:42:10 +09:00
paboyle	db5f6d3ae3	Verbose fix	2017-04-09 23:41:30 +09:00
paboyle	683550f116	Const args improvement	2017-04-09 23:41:04 +09:00
paboyle	55d0329624	Merge branch 'develop' of https://github.com/paboyle/Grid into develop	2017-04-07 11:08:14 +09:00
paboyle	86aaa35294	Christoph needs SchurDiagTwoKappa which is mobius specific.	2017-04-07 11:07:40 +09:00
Guido Cossu	172d3dc93a	Correcting names in tests	2017-04-05 16:24:04 +01:00
paboyle	5592f7b8c1	Creation mode better implementation	2017-04-05 02:35:34 +09:00
paboyle	35da4ece0b	UID fix	2017-04-05 02:18:15 +09:00
paboyle	061b15b9e9	Merge branch 'feature/sitmo-skipahead' into develop	2017-04-05 01:24:49 +09:00
paboyle	561426f6eb	Clean up	2017-04-02 23:13:48 +09:00
paboyle	83f6fab8fa	Big/Small crush test, and fast SITMO rng init, faster but not ideal MT and Ranlux init.	2017-04-02 12:10:51 +09:00
paboyle	0fade84ab2	No random device	2017-04-02 00:29:40 +09:00
paboyle	9dc7ca4c3b	Sitmo fast init	2017-04-02 00:28:22 +09:00
paboyle	935d82f5b1	sanity checks	2017-04-02 00:27:28 +09:00
paboyle	9cbcdd65d7	No random device seed	2017-04-02 00:26:57 +09:00
paboyle	f18f5ed926	Drop random device	2017-04-02 00:26:26 +09:00
paboyle	d1d63a4f2d	sitmo default	2017-04-02 00:26:05 +09:00
paboyle	7e5faa0f34	Multiple RNGs	2017-04-02 00:25:44 +09:00
paboyle	6af459cae4	Christoph's coefficients.	2017-03-31 17:07:43 +09:00
paboyle	1c4bc7ed38	Debugged staggered conventions	2017-03-31 14:41:48 +09:00
paboyle	93ea5d9468	Pretty code	2017-03-30 15:00:03 +09:00
paboyle	1ec5d32369	Chulwoo's test to zmobius helped me shake out	2017-03-30 13:45:13 +09:00
paboyle	9fd23faadf	Pretty layout	2017-03-30 13:44:45 +09:00
paboyle	10e4fa0dc8	Template instantiation improvements	2017-03-30 13:44:25 +09:00
paboyle	c4aca1dde4	Conjugate coefficients on adjoint	2017-03-30 13:44:05 +09:00
paboyle	b9e8ea3aaa	conjugate coefficient on the dagger	2017-03-30 13:43:13 +09:00
paboyle	077aa728b9	Fix the ZMobius (I think)	2017-03-30 13:42:09 +09:00
paboyle	a8d83d886e	Macro controls	2017-03-30 13:31:34 +09:00
paboyle	7fd46eeec4	Trailing whitespace removal	2017-03-30 13:31:10 +09:00
paboyle	e0c4eeb3ec	Compiles again	2017-03-30 13:30:45 +09:00
paboyle	cb9a297a0a	Chulwoo's Zmobius test	2017-03-30 13:30:25 +09:00
paboyle	2b115929dc	Small AVX512 asm ifdef patch	2017-03-29 18:51:23 +09:00
paboyle	5c6571dab1	Merge branch 'feature/bgq-asm' into develop	2017-03-29 18:48:55 +09:00