Fix to multinode code

longer nloop
bug fix. works now and great face performance
2025-11-07 15:19:31 +00:00 · 2017-04-26 14:46:52 -04:00 · 2017-04-26 08:43:20 +01:00 · 2017-04-26 03:14:02 -04:00 · 2017-04-26 02:34:52 -04:00 · 2017-04-26 02:34:25 -04:00
71 changed files with 3959 additions and 3315 deletions
--- a/.travis.yml
+++ b/.travis.yml
@@ -7,7 +7,7 @@ cache:
 matrix:
  include:
    - os:        osx
-      osx_image: xcode7.2
+      osx_image: xcode8.3
      compiler: clang
    - compiler: gcc
      addons:
@@ -73,8 +73,6 @@ before_install:
    - if [[ "$TRAVIS_OS_NAME" == "linux" ]] && [[ "$CC" == "clang" ]]; then export LD_LIBRARY_PATH="${GRIDDIR}/clang/lib:${LD_LIBRARY_PATH}"; fi
    - if [[ "$TRAVIS_OS_NAME" == "osx" ]]; then brew update; fi
    - if [[ "$TRAVIS_OS_NAME" == "osx" ]]; then brew install libmpc; fi
-    - if [[ "$TRAVIS_OS_NAME" == "osx" ]]; then brew install openmpi; fi
-    - if [[ "$TRAVIS_OS_NAME" == "osx" ]] && [[ "$CC" == "gcc" ]]; then brew install gcc5; fi
    
 install:
    - export CC=$CC$VERSION
@@ -92,15 +90,14 @@ script:
    - cd build
    - ../configure --enable-precision=single --enable-simd=SSE4 --enable-comms=none
    - make -j4 
-    - ./benchmarks/Benchmark_dwf --threads 1
+    - ./benchmarks/Benchmark_dwf --threads 1 --debug-signals
    - echo make clean
    - ../configure --enable-precision=double --enable-simd=SSE4 --enable-comms=none
    - make -j4
-    - ./benchmarks/Benchmark_dwf --threads 1
+    - ./benchmarks/Benchmark_dwf --threads 1 --debug-signals
    - echo make clean
-    - if [[ "$TRAVIS_OS_NAME" == "linux" ]]; then export CXXFLAGS='-DMPI_UINT32_T=MPI_UNSIGNED -DMPI_UINT64_T=MPI_UNSIGNED_LONG'; fi
-    - ../configure --enable-precision=single --enable-simd=SSE4 --enable-comms=mpi-auto
-    - make -j4
+    - if [[ "$TRAVIS_OS_NAME" == "linux" ]]; then ../configure --enable-precision=single --enable-simd=SSE4 --enable-comms=mpi-auto CXXFLAGS='-DMPI_UINT32_T=MPI_UNSIGNED -DMPI_UINT64_T=MPI_UNSIGNED_LONG'; fi
+    - if [[ "$TRAVIS_OS_NAME" == "linux" ]]; then make -j4; fi
    - if [[ "$TRAVIS_OS_NAME" == "linux" ]]; then mpirun.openmpi -n 2 ./benchmarks/Benchmark_dwf --threads 1 --mpi 2.1.1.1; fi


--- a/61
+++ b/61
@@ -1,6 +1,26 @@
 TODO:
 ---------------

+Peter's work list:
+2)- Precision conversion and sort out localConvert      <-- 
+3)- Remove DenseVector, DenseMatrix; Use Eigen instead. <-- started 
+4)- Binary I/O speed up & x-strips
+-- Profile CG, BlockCG, etc... Flop count/rate -- PARTIAL, time but no flop/s yet
+-- Physical propagator interface
+-- Conserved currents
+-- GaugeFix into central location
+-- Multigrid Wilson and DWF, compare to other Multigrid implementations
+-- HDCR resume
+
+Recent DONE 
+-- Cut down the exterior overhead                      <-- DONE
+-- Interior legs from SHM comms                        <-- DONE
+-- Half-precision comms                                <-- DONE
+-- Merge high precision reduction into develop        
+-- multiRHS DWF; benchmark on Cori/BNL for comms elimination
+   -- slice* linalg routines for multiRHS, BlockCG    
+
+-----
 * Forces; the UdSdU  term in gauge force term is half of what I think it should
  be. This is a consequence of taking ONLY the first term in:

@@ -21,16 +41,8 @@ TODO:
  This means we must double the force in the Test_xxx_force routines, and is the origin of the factor of two.
  This 2x is applied by hand in the fermion routines and in the Test_rect_force routine.

-
-Policies:
-
-* Link smearing/boundary conds; Policy class based implementation ; framework more in place
-
 * Support different boundary conditions (finite temp, chem. potential ... )

-* Support different fermion representations? 
-  - contained entirely within the integrator presently
-
 - Sign of force term.

 - Reversibility test.
@@ -41,11 +53,6 @@ Policies:

 - Audit oIndex usage for cb behaviour

- Rectangle gauge actions.
-  Iwasaki,
-  Symanzik,
-  ... etc...
-
 - Prepare multigrid for HMC. - Alternate setup schemes.

 - Support for ILDG --- ugly, not done
@@ -55,9 +62,11 @@ Policies:
 - FFTnD ?

 - Gparity; hand opt use template specialisation elegance to enable the optimised paths ?
+
 - Gparity force term; Gparity (R)HMC.
- Random number state save restore
+
 - Mobius implementation clean up to remove #if 0 stale code sequences
+
 - CG -- profile carefully, kernel fusion, whole CG performance measurements.

 ================================================================
@@ -90,6 +99,7 @@ Insert/Extract
 Not sure of status of this -- reverify. Things are working nicely now though.

 * Make the Tensor types and Complex etc... play more nicely.
+
  - TensorRemove is a hack, come up with a long term rationalised approach to Complex vs. Scalar<Scalar<Scalar<Complex > > >
    QDP forces use of "toDouble" to get back to non tensor scalar. This role is presently taken TensorRemove, but I
    want to introduce a syntax that does not require this.
@@ -112,6 +122,8 @@ Not sure of status of this -- reverify. Things are working nicely now though.
 RECENT
 ---------------

+  - Support different fermion representations? -- DONE
+  - contained entirely within the integrator presently
  - Clean up HMC                                                             -- DONE
  - LorentzScalar<GaugeField> gets Gauge link type (cleaner).                -- DONE
  - Simplified the integrators a bit.                                        -- DONE
@@ -123,6 +135,26 @@ RECENT
  - Parallel io improvements                                  -- DONE
  - Plaquette and link trace checks into nersc reader from the Grid_nersc_io.cc test. -- DONE

+
+DONE:
+- MultiArray -- MultiRHS done
+- ConjugateGradientMultiShift -- DONE
+- MCR                         -- DONE
+- Remez -- Mike or Boost?     -- DONE
+- Proto (ET)                  -- DONE
+- uBlas                       -- DONE ; Eigen
+- Potentially Useful Boost libraries -- DONE ; Eigen
+- Aligned allocator; memory pool -- DONE
+- Multiprecision              -- DONE
+- Serialization               -- DONE
+- Regex -- Not needed
+- Tokenize -- Why?
+
+- Random number state save restore -- DONE
+- Rectangle gauge actions. -- DONE
+  Iwasaki,
+  Symanzik,
+  ... etc...
 Done: Cayley, Partial , ContFrac force terms.

 DONE
@@ -207,6 +239,7 @@ Done
 FUNCTIONALITY: it pleases me to keep track of things I have done (keeps me arguably sane)
 ======================================================================================================

+* Link smearing/boundary conds; Policy class based implementation ; framework more in place -- DONE
 * Command line args for geometry, simd, etc. layout. Is it necessary to have -- DONE
  user pass these? Is this a QCD specific?

--- a/benchmarks/Benchmark_dwf.cc
+++ b/benchmarks/Benchmark_dwf.cc
@@ -152,9 +152,6 @@ int main (int argc, char ** argv)

  RealD NP = UGrid->_Nprocessors;

-  std::cout << GridLogMessage << "Creating action operator " << std::endl;
-  DomainWallFermionR Dw(Umu,*FGrid,*FrbGrid,*UGrid,*UrbGrid,mass,M5);
-
  std::cout << GridLogMessage<< "*****************************************************************" <<std::endl;
  std::cout << GridLogMessage<< "* Kernel options --dslash-generic, --dslash-unroll, --dslash-asm" <<std::endl;
  std::cout << GridLogMessage<< "*****************************************************************" <<std::endl;
@@ -168,11 +165,13 @@ int main (int argc, char ** argv)
  if ( WilsonKernelsStatic::Opt == WilsonKernelsStatic::OptInlineAsm ) std::cout << GridLogMessage<< "* Using Asm Nc=3   WilsonKernels" <<std::endl;
  std::cout << GridLogMessage<< "*****************************************************************" <<std::endl;

+  DomainWallFermionR Dw(Umu,*FGrid,*FrbGrid,*UGrid,*UrbGrid,mass,M5);
  int ncall =1000;
  if (1) {
    FGrid->Barrier();
    Dw.ZeroCounters();
    Dw.Dhop(src,result,0);
+    std::cout<<GridLogMessage<<"Called warmup"<<std::endl;
    double t0=usecond();
    for(int i=0;i<ncall;i++){
      __SSC_START;
@@ -206,6 +205,33 @@ int main (int argc, char ** argv)
    Dw.Report();
  }

+  DomainWallFermionRL DwH(Umu,*FGrid,*FrbGrid,*UGrid,*UrbGrid,mass,M5);
+  if (1) {
+    FGrid->Barrier();
+    DwH.ZeroCounters();
+    DwH.Dhop(src,result,0);
+    double t0=usecond();
+    for(int i=0;i<ncall;i++){
+      __SSC_START;
+      DwH.Dhop(src,result,0);
+      __SSC_STOP;
+    }
+    double t1=usecond();
+    FGrid->Barrier();
+    
+    double volume=Ls;  for(int mu=0;mu<Nd;mu++) volume=volume*latt4[mu];
+    double flops=1344*volume*ncall;
+
+    std::cout<<GridLogMessage << "Called half prec comms Dw "<<ncall<<" times in "<<t1-t0<<" us"<<std::endl;
+    std::cout<<GridLogMessage << "mflop/s =   "<< flops/(t1-t0)<<std::endl;
+    std::cout<<GridLogMessage << "mflop/s per rank =  "<< flops/(t1-t0)/NP<<std::endl;
+    err = ref-result; 
+    std::cout<<GridLogMessage << "norm diff   "<< norm2(err)<<std::endl;
+
+    assert (norm2(err)< 1.0e-3 );
+    DwH.Report();
+  }
+
  if (1)
  {

--- a/configure.ac
+++ b/configure.ac
@@ -83,6 +83,18 @@ case ${ac_LAPACK} in
        AC_DEFINE([USE_LAPACK],[1],[use LAPACK]);;
 esac

+############### FP16 conversions
+AC_ARG_ENABLE([sfw-fp16],
+    [AC_HELP_STRING([--enable-sfw-fp16=yes|no], [enable software fp16 comms])], 
+    [ac_SFW_FP16=${enable_sfw_fp16}], [ac_SFW_FP16=yes])
+case ${ac_SFW_FP16} in
+    yes)
+      AC_DEFINE([SFW_FP16],[1],[software conversion to fp16]);;
+    no);;
+    *)
+      AC_MSG_ERROR(["SFW FP16 option not supported ${ac_SFW_FP16}"]);;
+esac
+
 ############### MKL
 AC_ARG_ENABLE([mkl],
    [AC_HELP_STRING([--enable-mkl=yes|no|prefix], [enable Intel MKL for LAPACK & FFTW])],
@@ -176,19 +188,26 @@ case ${ax_cv_cxx_compiler_vendor} in
    case ${ac_SIMD} in
      SSE4)
        AC_DEFINE([SSE4],[1],[SSE4 intrinsics])
-        SIMD_FLAGS='-msse4.2';;
+	case ${ac_SFW_FP16} in
+	  yes)
+	  SIMD_FLAGS='-msse4.2';;
+	  no)
+	  SIMD_FLAGS='-msse4.2 -mf16c';;
+	  *)
+          AC_MSG_ERROR(["SFW_FP16 must be either yes or no value ${ac_SFW_FP16} "]);;
+	esac;;
      AVX)
        AC_DEFINE([AVX1],[1],[AVX intrinsics])
-        SIMD_FLAGS='-mavx';;
+        SIMD_FLAGS='-mavx -mf16c';;
      AVXFMA4)
        AC_DEFINE([AVXFMA4],[1],[AVX intrinsics with FMA4])
-        SIMD_FLAGS='-mavx -mfma4';;
+        SIMD_FLAGS='-mavx -mfma4 -mf16c';;
      AVXFMA)
        AC_DEFINE([AVXFMA],[1],[AVX intrinsics with FMA3])
-        SIMD_FLAGS='-mavx -mfma';;
+        SIMD_FLAGS='-mavx -mfma -mf16c';;
      AVX2)
        AC_DEFINE([AVX2],[1],[AVX2 intrinsics])
-        SIMD_FLAGS='-mavx2 -mfma';;
+        SIMD_FLAGS='-mavx2 -mfma -mf16c';;
      AVX512)
        AC_DEFINE([AVX512],[1],[AVX512 intrinsics])
        SIMD_FLAGS='-mavx512f -mavx512pf -mavx512er -mavx512cd';;
@@ -410,7 +429,6 @@ AC_OUTPUT
 echo "~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 Summary of configuration for $PACKAGE v$VERSION
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
 ----- PLATFORM ----------------------------------------
 architecture (build)        : $build_cpu
 os (build)                  : $build_os
@@ -423,6 +441,7 @@ SIMD                        : ${ac_SIMD}${SIMD_GEN_WIDTH_MSG}
 Threading                   : ${ac_openmp} 
 Communications type         : ${comms_type}
 Default precision           : ${ac_PRECISION}
+Software FP16 conversion    : ${ac_SFW_FP16}
 RNG choice                  : ${ac_RNG} 
 GMP                         : `if test "x$have_gmp" = xtrue; then echo yes; else echo no; fi`
 LAPACK                      : ${ac_LAPACK}
--- a/lib/algorithms/Algorithms.h
+++ b/lib/algorithms/Algorithms.h
@@ -46,7 +46,7 @@ Author: Peter Boyle <paboyle@ph.ed.ac.uk>
 #include <Grid/algorithms/iterative/ConjugateGradientMixedPrec.h>

 // Lanczos support
-#include <Grid/algorithms/iterative/MatrixUtils.h>
+//#include <Grid/algorithms/iterative/MatrixUtils.h>
 #include <Grid/algorithms/iterative/ImplicitlyRestartedLanczos.h>
 #include <Grid/algorithms/CoarsenedMatrix.h>
 #include <Grid/algorithms/FFT.h>
--- a/lib/algorithms/approx/.dirstamp
+++ b/lib/algorithms/approx/.dirstamp
--- a/lib/algorithms/densematrix/DenseMatrix.h
+++ b/lib/algorithms/densematrix/DenseMatrix.h
--- a/lib/algorithms/densematrix/Francis.h
+++ b/lib/algorithms/densematrix/Francis.h
--- a/lib/algorithms/densematrix/Householder.h
+++ b/lib/algorithms/densematrix/Householder.h
--- a/lib/algorithms/iterative/BlockConjugateGradient.h
+++ b/lib/algorithms/iterative/BlockConjugateGradient.h
@@ -0,0 +1,366 @@
+/*************************************************************************************
+
+Grid physics library, www.github.com/paboyle/Grid
+
+Source file: ./lib/algorithms/iterative/BlockConjugateGradient.h
+
+Copyright (C) 2017
+
+Author: Azusa Yamaguchi <ayamaguc@staffmail.ed.ac.uk>
+Author: Peter Boyle <paboyle@ph.ed.ac.uk>
+
+This program is free software; you can redistribute it and/or modify
+it under the terms of the GNU General Public License as published by
+the Free Software Foundation; either version 2 of the License, or
+(at your option) any later version.
+
+This program is distributed in the hope that it will be useful,
+but WITHOUT ANY WARRANTY; without even the implied warranty of
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+GNU General Public License for more details.
+
+You should have received a copy of the GNU General Public License along
+with this program; if not, write to the Free Software Foundation, Inc.,
+51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
+
+See the full license in the file "LICENSE" in the top level distribution
+directory
+*************************************************************************************/
+/*  END LEGAL */
+#ifndef GRID_BLOCK_CONJUGATE_GRADIENT_H
+#define GRID_BLOCK_CONJUGATE_GRADIENT_H
+
+
+namespace Grid {
+
+//////////////////////////////////////////////////////////////////////////
+// Block conjugate gradient. Dimension zero should be the block direction
+//////////////////////////////////////////////////////////////////////////
+template <class Field>
+class BlockConjugateGradient : public OperatorFunction<Field> {
+ public:
+
+  typedef typename Field::scalar_type scomplex;
+
+  const int blockDim = 0;
+
+  int Nblock;
+  bool ErrorOnNoConverge;  // throw an assert when the CG fails to converge.
+                           // Defaults true.
+  RealD Tolerance;
+  Integer MaxIterations;
+  Integer IterationsToComplete; //Number of iterations the CG took to finish. Filled in upon completion
+  
+  BlockConjugateGradient(RealD tol, Integer maxit, bool err_on_no_conv = true)
+    : Tolerance(tol),
+    MaxIterations(maxit),
+    ErrorOnNoConverge(err_on_no_conv){};
+
+void operator()(LinearOperatorBase<Field> &Linop, const Field &Src, Field &Psi) 
+{
+  int Orthog = 0; // First dimension is block dim
+  Nblock = Src._grid->_fdimensions[Orthog];
+
+  std::cout<<GridLogMessage<<" Block Conjugate Gradient : Orthog "<<Orthog<<" Nblock "<<Nblock<<std::endl;
+
+  Psi.checkerboard = Src.checkerboard;
+  conformable(Psi, Src);
+
+  Field P(Src);
+  Field AP(Src);
+  Field R(Src);
+  
+  Eigen::MatrixXcd m_pAp    = Eigen::MatrixXcd::Identity(Nblock,Nblock);
+  Eigen::MatrixXcd m_pAp_inv= Eigen::MatrixXcd::Identity(Nblock,Nblock);
+  Eigen::MatrixXcd m_rr     = Eigen::MatrixXcd::Zero(Nblock,Nblock);
+  Eigen::MatrixXcd m_rr_inv = Eigen::MatrixXcd::Zero(Nblock,Nblock);
+
+  Eigen::MatrixXcd m_alpha      = Eigen::MatrixXcd::Zero(Nblock,Nblock);
+  Eigen::MatrixXcd m_beta   = Eigen::MatrixXcd::Zero(Nblock,Nblock);
+
+  // Initial residual computation & set up
+  std::vector<RealD> residuals(Nblock);
+  std::vector<RealD> ssq(Nblock);
+
+  sliceNorm(ssq,Src,Orthog);
+  RealD sssum=0;
+  for(int b=0;b<Nblock;b++) sssum+=ssq[b];
+
+  sliceNorm(residuals,Src,Orthog);
+  for(int b=0;b<Nblock;b++){ assert(std::isnan(residuals[b])==0); }
+
+  sliceNorm(residuals,Psi,Orthog);
+  for(int b=0;b<Nblock;b++){ assert(std::isnan(residuals[b])==0); }
+
+  // Initial search dir is guess
+  Linop.HermOp(Psi, AP);
+  
+
+  /************************************************************************
+   * Block conjugate gradient (Stephen Pickles, thesis 1995, pp 71, O'Leary 1980)
+   ************************************************************************
+   * O'Leary : R = B - A X
+   * O'Leary : P = M R ; preconditioner M = 1
+   * O'Leary : alpha = PAP^{-1} RMR
+   * O'Leary : beta  = RMR^{-1}_old RMR_new
+   * O'Leary : X=X+Palpha
+   * O'Leary : R_new=R_old-AP alpha
+   * O'Leary : P=MR_new+P beta
+   */
+
+  R = Src - AP;  
+  P = R;
+  sliceInnerProductMatrix(m_rr,R,R,Orthog);
+
+  GridStopWatch sliceInnerTimer;
+  GridStopWatch sliceMaddTimer;
+  GridStopWatch MatrixTimer;
+  GridStopWatch SolverTimer;
+  SolverTimer.Start();
+
+  int k;
+  for (k = 1; k <= MaxIterations; k++) {
+
+    RealD rrsum=0;
+    for(int b=0;b<Nblock;b++) rrsum+=real(m_rr(b,b));
+
+    std::cout << GridLogIterative << "\titeration "<<k<<" rr_sum "<<rrsum<<" ssq_sum "<< sssum
+	      <<" / "<<std::sqrt(rrsum/sssum) <<std::endl;
+
+    MatrixTimer.Start();
+    Linop.HermOp(P, AP);
+    MatrixTimer.Stop();
+
+    // Alpha
+    sliceInnerTimer.Start();
+    sliceInnerProductMatrix(m_pAp,P,AP,Orthog);
+    sliceInnerTimer.Stop();
+    m_pAp_inv = m_pAp.inverse();
+    m_alpha   = m_pAp_inv * m_rr ;
+
+    // Psi, R update
+    sliceMaddTimer.Start();
+    sliceMaddMatrix(Psi,m_alpha, P,Psi,Orthog);     // add alpha *  P to psi
+    sliceMaddMatrix(R  ,m_alpha,AP,  R,Orthog,-1.0);// subtract alpha * AP from resid
+    sliceMaddTimer.Stop();
+
+    // Beta
+    m_rr_inv = m_rr.inverse();
+    sliceInnerTimer.Start();
+    sliceInnerProductMatrix(m_rr,R,R,Orthog);
+    sliceInnerTimer.Stop();
+    m_beta = m_rr_inv *m_rr;
+
+    // Search update
+    sliceMaddTimer.Start();
+    sliceMaddMatrix(AP,m_beta,P,R,Orthog);
+    sliceMaddTimer.Stop();
+    P= AP;
+
+    /*********************
+     * convergence monitor
+     *********************
+     */
+    RealD max_resid=0;
+    for(int b=0;b<Nblock;b++){
+      RealD rr = real(m_rr(b,b))/ssq[b];
+      if ( rr > max_resid ) max_resid = rr;
+    }
+    
+    if ( max_resid < Tolerance*Tolerance ) { 
+
+      SolverTimer.Stop();
+
+      std::cout << GridLogMessage<<"BlockCG converged in "<<k<<" iterations"<<std::endl;
+      for(int b=0;b<Nblock;b++){
+	std::cout << GridLogMessage<< "\t\tblock "<<b<<" resid "<< std::sqrt(real(m_rr(b,b))/ssq[b])<<std::endl;
+      }
+      std::cout << GridLogMessage<<"\tMax residual is "<<std::sqrt(max_resid)<<std::endl;
+
+      Linop.HermOp(Psi, AP);
+      AP = AP-Src;
+      std::cout << GridLogMessage <<"\tTrue residual is " << std::sqrt(norm2(AP)/norm2(Src)) <<std::endl;
+
+      std::cout << GridLogMessage << "Time Breakdown "<<std::endl;
+      std::cout << GridLogMessage << "\tElapsed    " << SolverTimer.Elapsed()     <<std::endl;
+      std::cout << GridLogMessage << "\tMatrix     " << MatrixTimer.Elapsed()     <<std::endl;
+      std::cout << GridLogMessage << "\tInnerProd  " << sliceInnerTimer.Elapsed() <<std::endl;
+      std::cout << GridLogMessage << "\tMaddMatrix " << sliceMaddTimer.Elapsed()  <<std::endl;
+	    
+      IterationsToComplete = k;
+      return;
+    }
+
+  }
+  std::cout << GridLogMessage << "BlockConjugateGradient did NOT converge" << std::endl;
+
+  if (ErrorOnNoConverge) assert(0);
+  IterationsToComplete = k;
+}
+};
+
+
+//////////////////////////////////////////////////////////////////////////
+// multiRHS conjugate gradient. Dimension zero should be the block direction
+//////////////////////////////////////////////////////////////////////////
+template <class Field>
+class MultiRHSConjugateGradient : public OperatorFunction<Field> {
+ public:
+
+  typedef typename Field::scalar_type scomplex;
+
+  const int blockDim = 0;
+
+  int Nblock;
+  bool ErrorOnNoConverge;  // throw an assert when the CG fails to converge.
+                           // Defaults true.
+  RealD Tolerance;
+  Integer MaxIterations;
+  Integer IterationsToComplete; //Number of iterations the CG took to finish. Filled in upon completion
+  
+   MultiRHSConjugateGradient(RealD tol, Integer maxit, bool err_on_no_conv = true)
+    : Tolerance(tol),
+    MaxIterations(maxit),
+    ErrorOnNoConverge(err_on_no_conv){};
+
+void operator()(LinearOperatorBase<Field> &Linop, const Field &Src, Field &Psi) 
+{
+  int Orthog = 0; // First dimension is block dim
+  Nblock = Src._grid->_fdimensions[Orthog];
+
+  std::cout<<GridLogMessage<<"MultiRHS Conjugate Gradient : Orthog "<<Orthog<<" Nblock "<<Nblock<<std::endl;
+
+  Psi.checkerboard = Src.checkerboard;
+  conformable(Psi, Src);
+
+  Field P(Src);
+  Field AP(Src);
+  Field R(Src);
+  
+  std::vector<ComplexD> v_pAp(Nblock);
+  std::vector<RealD> v_rr (Nblock);
+  std::vector<RealD> v_rr_inv(Nblock);
+  std::vector<RealD> v_alpha(Nblock);
+  std::vector<RealD> v_beta(Nblock);
+
+  // Initial residual computation & set up
+  std::vector<RealD> residuals(Nblock);
+  std::vector<RealD> ssq(Nblock);
+
+  sliceNorm(ssq,Src,Orthog);
+  RealD sssum=0;
+  for(int b=0;b<Nblock;b++) sssum+=ssq[b];
+
+  sliceNorm(residuals,Src,Orthog);
+  for(int b=0;b<Nblock;b++){ assert(std::isnan(residuals[b])==0); }
+
+  sliceNorm(residuals,Psi,Orthog);
+  for(int b=0;b<Nblock;b++){ assert(std::isnan(residuals[b])==0); }
+
+  // Initial search dir is guess
+  Linop.HermOp(Psi, AP);
+
+  R = Src - AP;  
+  P = R;
+  sliceNorm(v_rr,R,Orthog);
+
+  GridStopWatch sliceInnerTimer;
+  GridStopWatch sliceMaddTimer;
+  GridStopWatch sliceNormTimer;
+  GridStopWatch MatrixTimer;
+  GridStopWatch SolverTimer;
+
+  SolverTimer.Start();
+  int k;
+  for (k = 1; k <= MaxIterations; k++) {
+
+    RealD rrsum=0;
+    for(int b=0;b<Nblock;b++) rrsum+=real(v_rr[b]);
+
+    std::cout << GridLogIterative << "\titeration "<<k<<" rr_sum "<<rrsum<<" ssq_sum "<< sssum
+	      <<" / "<<std::sqrt(rrsum/sssum) <<std::endl;
+
+    MatrixTimer.Start();
+    Linop.HermOp(P, AP);
+    MatrixTimer.Stop();
+
+    // Alpha
+    //    sliceInnerProductVectorTest(v_pAp_test,P,AP,Orthog);
+    sliceInnerTimer.Start();
+    sliceInnerProductVector(v_pAp,P,AP,Orthog);
+    sliceInnerTimer.Stop();
+    for(int b=0;b<Nblock;b++){
+      //      std::cout << " "<< v_pAp[b]<<" "<< v_pAp_test[b]<<std::endl;
+      v_alpha[b] = v_rr[b]/real(v_pAp[b]);
+    }
+
+    // Psi, R update
+    sliceMaddTimer.Start();
+    sliceMaddVector(Psi,v_alpha, P,Psi,Orthog);     // add alpha *  P to psi
+    sliceMaddVector(R  ,v_alpha,AP,  R,Orthog,-1.0);// subtract alpha * AP from resid
+    sliceMaddTimer.Stop();
+
+    // Beta
+    for(int b=0;b<Nblock;b++){
+      v_rr_inv[b] = 1.0/v_rr[b];
+    }
+    sliceNormTimer.Start();
+    sliceNorm(v_rr,R,Orthog);
+    sliceNormTimer.Stop();
+    for(int b=0;b<Nblock;b++){
+      v_beta[b] = v_rr_inv[b] *v_rr[b];
+    }
+
+    // Search update
+    sliceMaddTimer.Start();
+    sliceMaddVector(P,v_beta,P,R,Orthog);
+    sliceMaddTimer.Stop();
+
+    /*********************
+     * convergence monitor
+     *********************
+     */
+    RealD max_resid=0;
+    for(int b=0;b<Nblock;b++){
+      RealD rr = v_rr[b]/ssq[b];
+      if ( rr > max_resid ) max_resid = rr;
+    }
+    
+    if ( max_resid < Tolerance*Tolerance ) { 
+
+      SolverTimer.Stop();
+
+      std::cout << GridLogMessage<<"MultiRHS solver converged in " <<k<<" iterations"<<std::endl;
+      for(int b=0;b<Nblock;b++){
+	std::cout << GridLogMessage<< "\t\tBlock "<<b<<" resid "<< std::sqrt(v_rr[b]/ssq[b])<<std::endl;
+      }
+      std::cout << GridLogMessage<<"\tMax residual is "<<std::sqrt(max_resid)<<std::endl;
+
+      Linop.HermOp(Psi, AP);
+      AP = AP-Src;
+      std::cout <<GridLogMessage << "\tTrue residual is " << std::sqrt(norm2(AP)/norm2(Src)) <<std::endl;
+
+      std::cout << GridLogMessage << "Time Breakdown "<<std::endl;
+      std::cout << GridLogMessage << "\tElapsed    " << SolverTimer.Elapsed()     <<std::endl;
+      std::cout << GridLogMessage << "\tMatrix     " << MatrixTimer.Elapsed()     <<std::endl;
+      std::cout << GridLogMessage << "\tInnerProd  " << sliceInnerTimer.Elapsed() <<std::endl;
+      std::cout << GridLogMessage << "\tNorm       " << sliceNormTimer.Elapsed() <<std::endl;
+      std::cout << GridLogMessage << "\tMaddMatrix " << sliceMaddTimer.Elapsed()  <<std::endl;
+
+
+      IterationsToComplete = k;
+      return;
+    }
+
+  }
+  std::cout << GridLogMessage << "MultiRHSConjugateGradient did NOT converge" << std::endl;
+
+  if (ErrorOnNoConverge) assert(0);
+  IterationsToComplete = k;
+}
+};
+
+
+
+}
+#endif
--- a/lib/algorithms/iterative/ConjugateGradient.h
+++ b/lib/algorithms/iterative/ConjugateGradient.h
@@ -78,18 +78,12 @@ class ConjugateGradient : public OperatorFunction<Field> {
    cp = a;
    ssq = norm2(src);

-    std::cout << GridLogIterative << std::setprecision(4)
-              << "ConjugateGradient: guess " << guess << std::endl;
-    std::cout << GridLogIterative << std::setprecision(4)
-              << "ConjugateGradient:   src " << ssq << std::endl;
-    std::cout << GridLogIterative << std::setprecision(4)
-              << "ConjugateGradient:    mp " << d << std::endl;
-    std::cout << GridLogIterative << std::setprecision(4)
-              << "ConjugateGradient:   mmp " << b << std::endl;
-    std::cout << GridLogIterative << std::setprecision(4)
-              << "ConjugateGradient:  cp,r " << cp << std::endl;
-    std::cout << GridLogIterative << std::setprecision(4)
-              << "ConjugateGradient:     p " << a << std::endl;
+    std::cout << GridLogIterative << std::setprecision(4) << "ConjugateGradient: guess " << guess << std::endl;
+    std::cout << GridLogIterative << std::setprecision(4) << "ConjugateGradient:   src " << ssq << std::endl;
+    std::cout << GridLogIterative << std::setprecision(4) << "ConjugateGradient:    mp " << d << std::endl;
+    std::cout << GridLogIterative << std::setprecision(4) << "ConjugateGradient:   mmp " << b << std::endl;
+    std::cout << GridLogIterative << std::setprecision(4) << "ConjugateGradient:  cp,r " << cp << std::endl;
+    std::cout << GridLogIterative << std::setprecision(4) << "ConjugateGradient:     p " << a << std::endl;

    RealD rsq = Tolerance * Tolerance * ssq;

@@ -99,8 +93,7 @@ class ConjugateGradient : public OperatorFunction<Field> {
    }

    std::cout << GridLogIterative << std::setprecision(4)
-              << "ConjugateGradient: k=0 residual " << cp << " target " << rsq
-              << std::endl;
+              << "ConjugateGradient: k=0 residual " << cp << " target " << rsq << std::endl;

    GridStopWatch LinalgTimer;
    GridStopWatch MatrixTimer;
@@ -145,19 +138,20 @@ class ConjugateGradient : public OperatorFunction<Field> {
        RealD resnorm = sqrt(norm2(p));
        RealD true_residual = resnorm / srcnorm;

-        std::cout << GridLogMessage
-                  << "ConjugateGradient: Converged on iteration " << k << std::endl;
-        std::cout << GridLogMessage << "Computed residual " << sqrt(cp / ssq)
-                  << " true residual " << true_residual << " target "
-                  << Tolerance << std::endl;
-        std::cout << GridLogMessage << "Time elapsed: Iterations "
-                  << SolverTimer.Elapsed() << " Matrix  "
-                  << MatrixTimer.Elapsed() << " Linalg "
-                  << LinalgTimer.Elapsed();
-        std::cout << std::endl;
+        std::cout << GridLogMessage << "ConjugateGradient Converged on iteration " << k << std::endl;
+        std::cout << GridLogMessage << "\tComputed residual " << sqrt(cp / ssq)<<std::endl;
+	std::cout << GridLogMessage << "\tTrue residual " << true_residual<<std::endl;
+	std::cout << GridLogMessage << "\tTarget " << Tolerance << std::endl;
+
+        std::cout << GridLogMessage << "Time breakdown "<<std::endl;
+	std::cout << GridLogMessage << "\tElapsed    " << SolverTimer.Elapsed() <<std::endl;
+	std::cout << GridLogMessage << "\tMatrix     " << MatrixTimer.Elapsed() <<std::endl;
+	std::cout << GridLogMessage << "\tLinalg     " << LinalgTimer.Elapsed() <<std::endl;

        if (ErrorOnNoConverge) assert(true_residual / Tolerance < 10000.0);
+
 	IterationsToComplete = k;	
+
        return;
      }
    }
--- a/lib/algorithms/iterative/ImplicitlyRestartedLanczos.h
+++ b/lib/algorithms/iterative/ImplicitlyRestartedLanczos.h
@@ -30,20 +30,17 @@ Author: paboyle <paboyle@ph.ed.ac.uk>
 #define GRID_IRL_H

 #include <string.h> //memset
+
 #ifdef USE_LAPACK
-#ifdef USE_MKL
-#include<mkl_lapack.h>
-#else
 void LAPACK_dstegr(char *jobz, char *range, int *n, double *d, double *e,
                   double *vl, double *vu, int *il, int *iu, double *abstol,
                   int *m, double *w, double *z, int *ldz, int *isuppz,
                   double *work, int *lwork, int *iwork, int *liwork,
                   int *info);
-//#include <lapacke/lapacke.h>
 #endif
-#endif
-#include "DenseMatrix.h"
-#include "EigenSort.h"
+
+#include <Grid/algorithms/densematrix/DenseMatrix.h>
+#include <Grid/algorithms/iterative/EigenSort.h>

 namespace Grid {

@@ -67,13 +64,12 @@ public:
    int Np;      // Np -- Number of spare vecs in Krylov space
    int Nm;      // Nm -- total number of vectors

-
-    RealD OrthoTime;
-
    RealD eresid;

    SortEigen<Field> _sort;

+//    GridCartesian &_fgrid;
+
    LinearOperatorBase<Field> &_Linop;

    OperatorFunction<Field>   &_poly;
@@ -130,23 +126,23 @@ public:

      GridBase *grid = evec[0]._grid;
      Field w(grid);
-      std::cout<<GridLogMessage << "RitzMatrix "<<std::endl;
+      std::cout << "RitzMatrix "<<std::endl;
      for(int i=0;i<k;i++){
 	_poly(_Linop,evec[i],w);
-	std::cout<<GridLogMessage << "["<<i<<"] ";
+	std::cout << "["<<i<<"] ";
 	for(int j=0;j<k;j++){
 	  ComplexD in = innerProduct(evec[j],w);
 	  if ( fabs((double)i-j)>1 ) { 
 	    if (abs(in) >1.0e-9 )  { 
-	      std::cout<<GridLogMessage<<"oops"<<std::endl;
+	      std::cout<<"oops"<<std::endl;
 	      abort();
 	    } else 
-	      std::cout<<GridLogMessage << " 0 ";
+	      std::cout << " 0 ";
 	  } else { 
-	    std::cout<<GridLogMessage << " "<<in<<" ";
+	    std::cout << " "<<in<<" ";
 	  }
 	}
-	std::cout<<GridLogMessage << std::endl;
+	std::cout << std::endl;
      }
    }

@@ -180,10 +176,10 @@ public:
      RealD beta = normalise(w); // 6. βk+1 := ∥wk∥2. If βk+1 = 0 then Stop
                                 // 7. vk+1 := wk/βk+1

-	std::cout<<GridLogMessage << "alpha = " << zalph << " beta "<<beta<<std::endl;
+//	std::cout << "alpha = " << zalph << " beta "<<beta<<std::endl;
      const RealD tiny = 1.0e-20;
      if ( beta < tiny ) { 
-	std::cout<<GridLogMessage << " beta is tiny "<<beta<<std::endl;
+	std::cout << " beta is tiny "<<beta<<std::endl;
     }
      lmd[k] = alph;
      lme[k]  = beta;
@@ -259,7 +255,6 @@ public:
    }

 #ifdef USE_LAPACK
-#define LAPACK_INT long long
    void diagonalize_lapack(DenseVector<RealD>& lmd,
 		     DenseVector<RealD>& lme, 
 		     int N1,
@@ -269,7 +264,7 @@ public:
  const int size = Nm;
 //  tevals.resize(size);
 //  tevecs.resize(size);
-  LAPACK_INT NN = N1;
+  int NN = N1;
  double evals_tmp[NN];
  double evec_tmp[NN][NN];
  memset(evec_tmp[0],0,sizeof(double)*NN*NN);
@@ -283,19 +278,19 @@ public:
        if (i==j) evals_tmp[i] = lmd[i];
        if (j==(i-1)) EE[j] = lme[j];
      }
-  LAPACK_INT evals_found;
-  LAPACK_INT lwork = ( (18*NN) > (1+4*NN+NN*NN)? (18*NN):(1+4*NN+NN*NN)) ;
-  LAPACK_INT liwork =  3+NN*10 ;
-  LAPACK_INT iwork[liwork];
+  int evals_found;
+  int lwork = ( (18*NN) > (1+4*NN+NN*NN)? (18*NN):(1+4*NN+NN*NN)) ;
+  int liwork =  3+NN*10 ;
+  int iwork[liwork];
  double work[lwork];
-  LAPACK_INT isuppz[2*NN];
+  int isuppz[2*NN];
  char jobz = 'V'; // calculate evals & evecs
  char range = 'I'; // calculate only evals with index il through iu
  //    char range = 'A'; // calculate all evals
  char uplo = 'U'; // refer to upper half of original matrix
  char compz = 'I'; // Compute eigenvectors of tridiagonal matrix
  int ifail[NN];
-  long long info;
+  int info;
 //  int total = QMP_get_number_of_nodes();
 //  int node = QMP_get_node_number();
 //  GridBase *grid = evec[0]._grid;
@@ -303,18 +298,14 @@ public:
  int node = grid->_processor;
  int interval = (NN/total)+1;
  double vl = 0.0, vu = 0.0;
-  LAPACK_INT il = interval*node+1 , iu = interval*(node+1);
+  int il = interval*node+1 , iu = interval*(node+1);
  if (iu > NN)  iu=NN;
  double tol = 0.0;
    if (1) {
      memset(evals_tmp,0,sizeof(double)*NN);
      if ( il <= NN){
        printf("total=%d node=%d il=%d iu=%d\n",total,node,il,iu);
-#ifdef USE_MKL
-        dstegr(&jobz, &range, &NN,
-#else
        LAPACK_dstegr(&jobz, &range, &NN,
-#endif
            (double*)DD, (double*)EE,
            &vl, &vu, &il, &iu, // these four are ignored if second parameteris 'A'
            &tol, // tolerance
@@ -346,7 +337,6 @@ public:
      lmd [NN-1-i]=evals_tmp[i];
  }
 }
-#undef LAPACK_INT 
 #endif


@@ -377,14 +367,12 @@ public:
 //	diagonalize_lapack(lmd2,lme2,Nm2,Nm,Qt,grid);
 #endif

-      int Niter = 10000*N1;
+      int Niter = 100*N1;
      int kmin = 1;
      int kmax = N2;
      // (this should be more sophisticated)

-      for(int iter=0; ; ++iter){
-      if ( (iter+1)%(100*N1)==0) 
-      std::cout<<GridLogMessage << "[QL method] Not converged - iteration "<<iter+1<<"\n";
+      for(int iter=0; iter<Niter; ++iter){

 	// determination of 2x2 leading submatrix
 	RealD dsub = lmd[kmax-1]-lmd[kmax-2];
@@ -413,11 +401,11 @@ public:
        _sort.push(lmd3,N2);
        _sort.push(lmd2,N2);
         for(int k=0; k<N2; ++k){
-	    if (fabs(lmd2[k] - lmd3[k]) >SMALL)  std::cout<<GridLogMessage <<"lmd(qr) lmd(lapack) "<< k << ": " << lmd2[k] <<" "<< lmd3[k] <<std::endl;
-//	    if (fabs(lme2[k] - lme[k]) >SMALL)  std::cout<<GridLogMessage <<"lme(qr)-lme(lapack) "<< k << ": " << lme2[k] - lme[k] <<std::endl;
+	    if (fabs(lmd2[k] - lmd3[k]) >SMALL)  std::cout <<"lmd(qr) lmd(lapack) "<< k << ": " << lmd2[k] <<" "<< lmd3[k] <<std::endl;
+//	    if (fabs(lme2[k] - lme[k]) >SMALL)  std::cout <<"lme(qr)-lme(lapack) "<< k << ": " << lme2[k] - lme[k] <<std::endl;
 	  }
         for(int k=0; k<N1*N1; ++k){
-//	    if (fabs(Qt2[k] - Qt[k]) >SMALL)  std::cout<<GridLogMessage <<"Qt(qr)-Qt(lapack) "<< k << ": " << Qt2[k] - Qt[k] <<std::endl;
+//	    if (fabs(Qt2[k] - Qt[k]) >SMALL)  std::cout <<"Qt(qr)-Qt(lapack) "<< k << ": " << Qt2[k] - Qt[k] <<std::endl;
 	}
    }
 #endif
@@ -432,7 +420,7 @@ public:
 	  }
 	}
      }
-      std::cout<<GridLogMessage << "[QL method] Error - Too many iteration: "<<Niter<<"\n";
+      std::cout << "[QL method] Error - Too many iteration: "<<Niter<<"\n";
      abort();
    }

@@ -449,7 +437,6 @@ public:
 		       DenseVector<Field>& evec,
 		       int k)
    {
-      double t0=-usecond()/1e6;
      typedef typename Field::scalar_type MyComplex;
      MyComplex ip;

@@ -468,8 +455,6 @@ public:
 	w = w - ip * evec[j];
      }
      normalise(w);
-      t0+=usecond()/1e6;
-      OrthoTime +=t0;
    }

    void setUnit_Qt(int Nm, DenseVector<RealD> &Qt) {
@@ -503,10 +488,10 @@ until convergence
 	GridBase *grid = evec[0]._grid;
 	assert(grid == src._grid);

-	std::cout<<GridLogMessage << " -- Nk = " << Nk << " Np = "<< Np << std::endl;
-	std::cout<<GridLogMessage << " -- Nm = " << Nm << std::endl;
-	std::cout<<GridLogMessage << " -- size of eval   = " << eval.size() << std::endl;
-	std::cout<<GridLogMessage << " -- size of evec  = " << evec.size() << std::endl;
+	std::cout << " -- Nk = " << Nk << " Np = "<< Np << std::endl;
+	std::cout << " -- Nm = " << Nm << std::endl;
+	std::cout << " -- size of eval   = " << eval.size() << std::endl;
+	std::cout << " -- size of evec  = " << evec.size() << std::endl;
 	
 	assert(Nm == evec.size() && Nm == eval.size());
 	
@@ -517,7 +502,6 @@ until convergence
 	DenseVector<int>   Iconv(Nm);

 	DenseVector<Field>  B(Nm,grid); // waste of space replicating
-//	DenseVector<Field>  Btemp(Nm,grid); // waste of space replicating
 	
 	Field f(grid);
 	Field v(grid);
@@ -533,48 +517,35 @@ until convergence
 	// (uniform vector) Why not src??
 	//	evec[0] = 1.0;
 	evec[0] = src;
-	std:: cout<<GridLogMessage <<"norm2(src)= " << norm2(src)<<std::endl;
+	std:: cout <<"norm2(src)= " << norm2(src)<<std::endl;
 // << src._grid  << std::endl;
 	normalise(evec[0]);
-	std:: cout<<GridLogMessage <<"norm2(evec[0])= " << norm2(evec[0]) <<std::endl;
+	std:: cout <<"norm2(evec[0])= " << norm2(evec[0]) <<std::endl;
 // << evec[0]._grid << std::endl;
 	
 	// Initial Nk steps
-	OrthoTime=0.;
-	double t0=usecond()/1e6;
 	for(int k=0; k<Nk; ++k) step(eval,lme,evec,f,Nm,k);
-	double t1=usecond()/1e6;
-	std::cout<<GridLogMessage <<"IRL::Initial steps: "<<t1-t0<< "seconds"<<std::endl; t0=t1;
-	std::cout<<GridLogMessage <<"IRL::Initial steps:OrthoTime "<<OrthoTime<< "seconds"<<std::endl;
-//	std:: cout<<GridLogMessage <<"norm2(evec[1])= " << norm2(evec[1]) << std::endl;
-//	std:: cout<<GridLogMessage <<"norm2(evec[2])= " << norm2(evec[2]) << std::endl;
+//	std:: cout <<"norm2(evec[1])= " << norm2(evec[1]) << std::endl;
+//	std:: cout <<"norm2(evec[2])= " << norm2(evec[2]) << std::endl;
 	RitzMatrix(evec,Nk);
-	t1=usecond()/1e6;
-	std::cout<<GridLogMessage <<"IRL::RitzMatrix: "<<t1-t0<< "seconds"<<std::endl; t0=t1;
 	for(int k=0; k<Nk; ++k){
-//	std:: cout<<GridLogMessage <<"eval " << k << " " <<eval[k] << std::endl;
-//	std:: cout<<GridLogMessage <<"lme " << k << " " << lme[k] << std::endl;
+//	std:: cout <<"eval " << k << " " <<eval[k] << std::endl;
+//	std:: cout <<"lme " << k << " " << lme[k] << std::endl;
 	}

 	// Restarting loop begins
 	for(int iter = 0; iter<Niter; ++iter){

-	  std::cout<<GridLogMessage<<"\n Restart iteration = "<< iter << std::endl;
+	  std::cout<<"\n Restart iteration = "<< iter << std::endl;

 	  // 
 	  // Rudy does a sort first which looks very different. Getting fed up with sorting out the algo defs.
 	  // We loop over 
 	  //
-	OrthoTime=0.;
 	  for(int k=Nk; k<Nm; ++k) step(eval,lme,evec,f,Nm,k);
-	t1=usecond()/1e6;
-	std::cout<<GridLogMessage <<"IRL:: "<<Np <<" steps: "<<t1-t0<< "seconds"<<std::endl; t0=t1;
-	std::cout<<GridLogMessage <<"IRL::Initial steps:OrthoTime "<<OrthoTime<< "seconds"<<std::endl;
 	  f *= lme[Nm-1];

 	  RitzMatrix(evec,k2);
-	t1=usecond()/1e6;
-	std::cout<<GridLogMessage <<"IRL:: RitzMatrix: "<<t1-t0<< "seconds"<<std::endl; t0=t1;
 	  
 	  // getting eigenvalues
 	  for(int k=0; k<Nm; ++k){
@@ -583,27 +554,18 @@ until convergence
 	  }
 	  setUnit_Qt(Nm,Qt);
 	  diagonalize(eval2,lme2,Nm,Nm,Qt,grid);
-	t1=usecond()/1e6;
-	std::cout<<GridLogMessage <<"IRL:: diagonalize: "<<t1-t0<< "seconds"<<std::endl; t0=t1;

 	  // sorting
 	  _sort.push(eval2,Nm);
-	t1=usecond()/1e6;
-	std::cout<<GridLogMessage <<"IRL:: eval sorting: "<<t1-t0<< "seconds"<<std::endl; t0=t1;
 	  
 	  // Implicitly shifted QR transformations
 	  setUnit_Qt(Nm,Qt);
-	  for(int ip=0; ip<k2; ++ip){
-	std::cout<<GridLogMessage << "eval "<< ip << " "<< eval2[ip] << std::endl;
-	}
 	  for(int ip=k2; ip<Nm; ++ip){ 
-	std::cout<<GridLogMessage << "qr_decomp "<< ip << " "<< eval2[ip] << std::endl;
+	std::cout << "qr_decomp "<< ip << " "<< eval2[ip] << std::endl;
 	    qr_decomp(eval,lme,Nm,Nm,Qt,eval2[ip],k1,Nm);
 		
 	}
-	t1=usecond()/1e6;
-	std::cout<<GridLogMessage <<"IRL::qr_decomp: "<<t1-t0<< "seconds"<<std::endl; t0=t1;
-if (0) {  
+    
 	  for(int i=0; i<(Nk+1); ++i) B[i] = 0.0;
 	  
 	  for(int j=k1-1; j<k2+1; ++j){
@@ -612,38 +574,14 @@ if (0) {
 	      B[j] += Qt[k+Nm*j] * evec[k];
 	    }
 	  }
-	t1=usecond()/1e6;
-	std::cout<<GridLogMessage <<"IRL::QR Rotate: "<<t1-t0<< "seconds"<<std::endl; t0=t1;
-}
-
-if (1) {
-	for(int i=0; i<(Nk+1); ++i) {
-		B[i] = 0.0;
-	  	B[i].checkerboard = evec[0].checkerboard;
-	}
-
-	int j_block = 24; int k_block=24;
-PARALLEL_FOR_LOOP
-	for(int ss=0;ss < grid->oSites();ss++){
-	for(int jj=k1-1; jj<k2+1; jj += j_block)
-	for(int kk=0; kk<Nm; kk += k_block)
-	for(int j=jj; (j<(k2+1)) && j<(jj+j_block); ++j){
-	for(int k=kk; (k<Nm) && k<(kk+k_block) ; ++k){
-	    B[j]._odata[ss] +=Qt[k+Nm*j] * evec[k]._odata[ss]; 
-	}
-	}
-	}
-	t1=usecond()/1e6;
-	std::cout<<GridLogMessage <<"IRL::QR rotation: "<<t1-t0<< "seconds"<<std::endl; t0=t1;
-}
-	for(int j=k1-1; j<k2+1; ++j) evec[j] = B[j];
+	  for(int j=k1-1; j<k2+1; ++j) evec[j] = B[j];

 	  // Compressed vector f and beta(k2)
 	  f *= Qt[Nm-1+Nm*(k2-1)];
 	  f += lme[k2-1] * evec[k2];
 	  beta_k = norm2(f);
 	  beta_k = sqrt(beta_k);
-	  std::cout<<GridLogMessage<<" beta(k) = "<<beta_k<<std::endl;
+	  std::cout<<" beta(k) = "<<beta_k<<std::endl;

 	  RealD betar = 1.0/beta_k;
 	  evec[k2] = betar * f;
@@ -656,10 +594,7 @@ PARALLEL_FOR_LOOP
 	  }
 	  setUnit_Qt(Nm,Qt);
 	  diagonalize(eval2,lme2,Nk,Nm,Qt,grid);
-	t1=usecond()/1e6;
-	std::cout<<GridLogMessage <<"IRL::diagonalize: "<<t1-t0<< "seconds"<<std::endl; t0=t1;
 	  
-if (0) {
 	  for(int k = 0; k<Nk; ++k) B[k]=0.0;
 	  
 	  for(int j = 0; j<Nk; ++j){
@@ -667,34 +602,12 @@ if (0) {
 	    B[j].checkerboard = evec[k].checkerboard;
 	      B[j] += Qt[k+j*Nm] * evec[k];
 	    }
-	    std::cout<<GridLogMessage << "norm(B["<<j<<"])="<<norm2(B[j])<<std::endl;
+//	    std::cout << "norm(B["<<j<<"])="<<norm2(B[j])<<std::endl;
 	  }
-	t1=usecond()/1e6;
-	std::cout<<GridLogMessage <<"IRL::Convergence rotation: "<<t1-t0<< "seconds"<<std::endl; t0=t1;
-}
-if (1) {
-	for(int i=0; i<(Nk+1); ++i) {
-		B[i] = 0.0;
-	  	B[i].checkerboard = evec[0].checkerboard;
-	}
-
-	int j_block = 24; int k_block=24;
-PARALLEL_FOR_LOOP
-	for(int ss=0;ss < grid->oSites();ss++){
-	for(int jj=0; jj<Nk; jj += j_block)
-	for(int kk=0; kk<Nk; kk += k_block)
-	for(int j=jj; (j<Nk) && j<(jj+j_block); ++j){
-	for(int k=kk; (k<Nk) && k<(kk+k_block) ; ++k){
-	    B[j]._odata[ss] +=Qt[k+Nm*j] * evec[k]._odata[ss]; 
-	}
-	}
-	}
-	t1=usecond()/1e6;
-	std::cout<<GridLogMessage <<"IRL::convergence rotation : "<<t1-t0<< "seconds"<<std::endl; t0=t1;
-}
+//	_sort.push(eval2,B,Nk);

 	  Nconv = 0;
-	  //	  std::cout<<GridLogMessage << std::setiosflags(std::ios_base::scientific);
+	  //	  std::cout << std::setiosflags(std::ios_base::scientific);
 	  for(int i=0; i<Nk; ++i){

 //	    _poly(_Linop,B[i],v);
@@ -702,16 +615,14 @@ PARALLEL_FOR_LOOP
 	    
 	    RealD vnum = real(innerProduct(B[i],v)); // HermOp.
 	    RealD vden = norm2(B[i]);
-	    RealD vv0 = norm2(v);
 	    eval2[i] = vnum/vden;
 	    v -= eval2[i]*B[i];
 	    RealD vv = norm2(v);
 	    
 	    std::cout.precision(13);
-	    std::cout<<GridLogMessage << "[" << std::setw(3)<< std::setiosflags(std::ios_base::right) <<i<<"] ";
-	    std::cout<<"eval = "<<std::setw(25)<< std::setiosflags(std::ios_base::left)<< eval2[i];
-	    std::cout<<"|H B[i] - eval[i]B[i]|^2 "<< std::setw(25)<< std::setiosflags(std::ios_base::right)<< vv;
-	    std::cout<<" "<< vnum/(sqrt(vden)*sqrt(vv0)) << std::endl;
+	    std::cout << "[" << std::setw(3)<< std::setiosflags(std::ios_base::right) <<i<<"] ";
+	    std::cout << "eval = "<<std::setw(25)<< std::setiosflags(std::ios_base::left)<< eval2[i];
+	    std::cout <<" |H B[i] - eval[i]B[i]|^2 "<< std::setw(25)<< std::setiosflags(std::ios_base::right)<< vv<< std::endl;
 	    
 	// change the criteria as evals are supposed to be sorted, all evals smaller(larger) than Nstop should have converged
 	    if((vv<eresid*eresid) && (i == Nconv) ){
@@ -720,19 +631,17 @@ PARALLEL_FOR_LOOP
 	    }

 	  }  // i-loop end
-	  //	  std::cout<<GridLogMessage << std::resetiosflags(std::ios_base::scientific);
-	t1=usecond()/1e6;
-	std::cout<<GridLogMessage <<"IRL::convergence testing: "<<t1-t0<< "seconds"<<std::endl; t0=t1;
+	  //	  std::cout << std::resetiosflags(std::ios_base::scientific);


-	  std::cout<<GridLogMessage<<" #modes converged: "<<Nconv<<std::endl;
+	  std::cout<<" #modes converged: "<<Nconv<<std::endl;

 	  if( Nconv>=Nstop ){
 	    goto converged;
 	  }
 	} // end of iter loop
 	
-	std::cout<<GridLogMessage<<"\n NOT converged.\n";
+	std::cout<<"\n NOT converged.\n";
 	abort();
 	
      converged:
@@ -745,10 +654,10 @@ PARALLEL_FOR_LOOP
       }
      _sort.push(eval,evec,Nconv);

-      std::cout<<GridLogMessage << "\n Converged\n Summary :\n";
-      std::cout<<GridLogMessage << " -- Iterations  = "<< Nconv  << "\n";
-      std::cout<<GridLogMessage << " -- beta(k)     = "<< beta_k << "\n";
-      std::cout<<GridLogMessage << " -- Nconv       = "<< Nconv  << "\n";
+      std::cout << "\n Converged\n Summary :\n";
+      std::cout << " -- Iterations  = "<< Nconv  << "\n";
+      std::cout << " -- beta(k)     = "<< beta_k << "\n";
+      std::cout << " -- Nconv       = "<< Nconv  << "\n";
     }

    /////////////////////////////////////////////////
@@ -771,25 +680,25 @@ PARALLEL_FOR_LOOP
 	}
      }

-      std::cout<<GridLogMessage<<"Lanczos_Factor start/end " <<start <<"/"<<end<<std::endl;
+      std::cout<<"Lanczos_Factor start/end " <<start <<"/"<<end<<std::endl;

      // Starting from scratch, bq[0] contains a random vector and |bq[0]| = 1
      int first;
      if(start == 0){

-	std::cout<<GridLogMessage << "start == 0\n"; //TESTING
+	std::cout << "start == 0\n"; //TESTING

 	_poly(_Linop,bq[0],bf);

 	alpha = real(innerProduct(bq[0],bf));//alpha =  bq[0]^dag A bq[0]

-	std::cout<<GridLogMessage << "alpha = " << alpha << std::endl;
+	std::cout << "alpha = " << alpha << std::endl;
 	
 	bf = bf - alpha * bq[0];  //bf =  A bq[0] - alpha bq[0]

 	H[0][0]=alpha;

-	std::cout<<GridLogMessage << "Set H(0,0) to " << H[0][0] << std::endl;
+	std::cout << "Set H(0,0) to " << H[0][0] << std::endl;

 	first = 1;

@@ -809,19 +718,19 @@ PARALLEL_FOR_LOOP

 	beta = 0;sqbt = 0;

-	std::cout<<GridLogMessage << "cont is true so setting beta to zero\n";
+	std::cout << "cont is true so setting beta to zero\n";

      }	else {

 	beta = norm2(bf);
 	sqbt = sqrt(beta);

-	std::cout<<GridLogMessage << "beta = " << beta << std::endl;
+	std::cout << "beta = " << beta << std::endl;
      }

      for(int j=first;j<end;j++){

-	std::cout<<GridLogMessage << "Factor j " << j <<std::endl;
+	std::cout << "Factor j " << j <<std::endl;

 	if(cont){ // switches to factoring; understand start!=0 and initial bf value is right.
 	  bq[j] = bf; cont = false;
@@ -844,7 +753,7 @@ PARALLEL_FOR_LOOP

 	beta = fnorm;
 	sqbt = sqrt(beta);
-	std::cout<<GridLogMessage << "alpha = " << alpha << " fnorm = " << fnorm << '\n';
+	std::cout << "alpha = " << alpha << " fnorm = " << fnorm << '\n';

 	///Iterative refinement of orthogonality V = [ bq[0]  bq[1]  ...  bq[M] ]
 	int re = 0;
@@ -879,8 +788,8 @@ PARALLEL_FOR_LOOP
 	  bck = sqrt( nmbex );
 	  re++;
 	}
-	std::cout<<GridLogMessage << "Iteratively refined orthogonality, changes alpha\n";
-	if(re > 1) std::cout<<GridLogMessage << "orthagonality refined " << re << " times" <<std::endl;
+	std::cout << "Iteratively refined orthogonality, changes alpha\n";
+	if(re > 1) std::cout << "orthagonality refined " << re << " times" <<std::endl;
 	H[j][j]=alpha;
      }

@@ -895,13 +804,11 @@ PARALLEL_FOR_LOOP

    void ImplicitRestart(int TM, DenseVector<RealD> &evals,  DenseVector<DenseVector<RealD> > &evecs, DenseVector<Field> &bq, Field &bf, int cont)
    {
-      std::cout<<GridLogMessage << "ImplicitRestart begin. Eigensort starting\n";
+      std::cout << "ImplicitRestart begin. Eigensort starting\n";

      DenseMatrix<RealD> H; Resize(H,Nm,Nm);

-#ifndef USE_LAPACK
      EigenSort(evals, evecs);
-#endif

      ///Assign shifts
      int K=Nk;
@@ -924,15 +831,15 @@ PARALLEL_FOR_LOOP
      /// Shifted H defines a new K step Arnoldi factorization
      RealD  beta = H[ff][ff-1]; 
      RealD  sig  = Q[TM - 1][ff - 1];
-      std::cout<<GridLogMessage << "beta = " << beta << " sig = " << real(sig) <<std::endl;
+      std::cout << "beta = " << beta << " sig = " << real(sig) <<std::endl;

-      std::cout<<GridLogMessage << "TM = " << TM << " ";
-      std::cout<<GridLogMessage << norm2(bq[0]) << " -- before" <<std::endl;
+      std::cout << "TM = " << TM << " ";
+      std::cout << norm2(bq[0]) << " -- before" <<std::endl;

      /// q -> q Q
      times_real(bq, Q, TM);

-      std::cout<<GridLogMessage << norm2(bq[0]) << " -- after " << ff <<std::endl;
+      std::cout << norm2(bq[0]) << " -- after " << ff <<std::endl;
      bf =  beta* bq[ff] + sig* bf;

      /// Do the rest of the factorization
@@ -956,7 +863,7 @@ PARALLEL_FOR_LOOP
      int ff = Lanczos_Factor(0, M, cont, bq,bf,H); // 0--M to begin with

      if(ff < M) {
-	std::cout<<GridLogMessage << "Krylov: aborting ff "<<ff <<" "<<M<<std::endl;
+	std::cout << "Krylov: aborting ff "<<ff <<" "<<M<<std::endl;
 	abort(); // Why would this happen?
      }

@@ -965,7 +872,7 @@ PARALLEL_FOR_LOOP

      for(int it = 0; it < Niter && (converged < Nk); ++it) {

-	std::cout<<GridLogMessage << "Krylov: Iteration --> " << it << std::endl;
+	std::cout << "Krylov: Iteration --> " << it << std::endl;
 	int lock_num = lock ? converged : 0;
 	DenseVector<RealD> tevals(M - lock_num );
 	DenseMatrix<RealD> tevecs; Resize(tevecs,M - lock_num,M - lock_num);
@@ -981,7 +888,7 @@ PARALLEL_FOR_LOOP
      Wilkinson<RealD>(H, evals, evecs, small); 
      //      Check();

-      std::cout<<GridLogMessage << "Done  "<<std::endl;
+      std::cout << "Done  "<<std::endl;

    }

@@ -1046,7 +953,7 @@ PARALLEL_FOR_LOOP
 		  DenseVector<RealD> &tevals, DenseVector<DenseVector<RealD> > &tevecs, 
 		  int lock, int converged)
    {
-      std::cout<<GridLogMessage << "Converged " << converged << " so far." << std::endl;
+      std::cout << "Converged " << converged << " so far." << std::endl;
      int lock_num = lock ? converged : 0;
      int M = Nm;

@@ -1061,9 +968,7 @@ PARALLEL_FOR_LOOP
      RealD small=1.0e-16;
      Wilkinson<RealD>(AH, tevals, tevecs, small);

-#ifndef USE_LAPACK
      EigenSort(tevals, tevecs);
-#endif

      RealD resid_nrm=  norm2(bf);

@@ -1074,7 +979,7 @@ PARALLEL_FOR_LOOP
 	RealD diff = 0;
 	diff = abs( tevecs[i][Nm - 1 - lock_num] ) * resid_nrm;

-	std::cout<<GridLogMessage << "residual estimate " << SS-1-i << " " << diff << " of (" << tevals[i] << ")" << std::endl;
+	std::cout << "residual estimate " << SS-1-i << " " << diff << " of (" << tevals[i] << ")" << std::endl;

 	if(diff < converged) {

@@ -1090,13 +995,13 @@ PARALLEL_FOR_LOOP
 	    lock_num++;
 	  }
 	  converged++;
-	  std::cout<<GridLogMessage << " converged on eval " << converged << " of " << Nk << std::endl;
+	  std::cout << " converged on eval " << converged << " of " << Nk << std::endl;
 	} else {
 	  break;
 	}
      }
 #endif
-      std::cout<<GridLogMessage << "Got " << converged << " so far " <<std::endl;	
+      std::cout << "Got " << converged << " so far " <<std::endl;	
    }

    ///Check
@@ -1105,9 +1010,7 @@ PARALLEL_FOR_LOOP

      DenseVector<RealD> goodval(this->get);

-#ifndef USE_LAPACK
      EigenSort(evals,evecs);
-#endif

      int NM = Nm;

@@ -1179,16 +1082,14 @@ say con = 2
 **/

 template<class T>
-static void Lock(DenseMatrix<T> &H, 	///Hess mtx	
-		 DenseMatrix<T> &Q, 	///Lock Transform
-		 T val, 		///value to be locked
-		 int con, 	///number already locked
+static void Lock(DenseMatrix<T> &H, 	// Hess mtx	
+		 DenseMatrix<T> &Q, 	// Lock Transform
+		 T val, 		// value to be locked
+		 int con, 	// number already locked
 		 RealD small,
 		 int dfg,
 		 bool herm)
 {	
-
-
  //ForceTridiagonal(H);

  int M = H.dim;
@@ -1220,7 +1121,6 @@ static void Lock(DenseMatrix<T> &H, 	///Hess mtx

  AH = Hermitian(QQ)*AH;
  AH = AH*QQ;
-	

  for(int i=con;i<M;i++){
    for(int j=con;j<M;j++){
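Note on the basis rotation in the hunks above: the diff drops the site-outermost, blocked `PARALLEL_FOR_LOOP` variant and keeps the field-at-a-time form `B[j] += Qt[k+Nm*j] * evec[k]`. The sketch below is only an illustration of how the two loop orders relate. It assumes a toy field type (`ToyField`, one `double` per site) in place of Grid's `Field`, assumes the inner `k` loop runs over `0..Nm-1`, and the function names `rotate_by_field` and `rotate_by_site` are hypothetical, not part of Grid.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy stand-in for a lattice field: one double per site (an assumption for
// illustration; Grid's Field type is not modelled here).
using ToyField = std::vector<double>;

// Field-at-a-time rotation, mirroring the loop the diff leaves in place:
//   B[j] = sum_k Qt[k + Nm*j] * evec[k]   for j in [jbegin, jend)
void rotate_by_field(std::vector<ToyField>& B, const std::vector<ToyField>& evec,
                     const std::vector<double>& Qt, int Nm, int jbegin, int jend) {
  assert(static_cast<int>(B.size()) >= jend);
  assert(static_cast<int>(evec.size()) >= Nm);
  const std::size_t sites = evec[0].size();
  for (int j = jbegin; j < jend; ++j) {
    B[j].assign(sites, 0.0);
    for (int k = 0; k < Nm; ++k)
      for (std::size_t ss = 0; ss < sites; ++ss)
        B[j][ss] += Qt[k + Nm * j] * evec[k][ss];
  }
}

// Site-outermost rotation with j/k blocking, mirroring the loop the diff removes.
void rotate_by_site(std::vector<ToyField>& B, const std::vector<ToyField>& evec,
                    const std::vector<double>& Qt, int Nm, int jbegin, int jend,
                    int block) {
  assert(block > 0);
  assert(static_cast<int>(B.size()) >= jend);
  assert(static_cast<int>(evec.size()) >= Nm);
  const std::size_t sites = evec[0].size();
  for (int j = jbegin; j < jend; ++j) B[j].assign(sites, 0.0);
  for (std::size_t ss = 0; ss < sites; ++ss)
    for (int jj = jbegin; jj < jend; jj += block)
      for (int kk = 0; kk < Nm; kk += block)
        for (int j = jj; j < jend && j < jj + block; ++j)
          for (int k = kk; k < Nm && k < kk + block; ++k)
            B[j][ss] += Qt[k + Nm * j] * evec[k][ss];
}
```

In this sketch both functions accumulate over `k` in the same ascending order for every `(j, site)` pair, so they produce the same values; only the memory traversal order differs.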
--- a/lib/algorithms/iterative/Matrix.h
+++ b/lib/algorithms/iterative/Matrix.h
@@ -1,453 +0,0 @@
-    /*************************************************************************************
-
-    Grid physics library, www.github.com/paboyle/Grid 
-
-    Source file: ./lib/algorithms/iterative/Matrix.h
-
-    Copyright (C) 2015
-
-Author: Peter Boyle <paboyle@ph.ed.ac.uk>
-
-    This program is free software; you can redistribute it and/or modify
-    it under the terms of the GNU General Public License as published by
-    the Free Software Foundation; either version 2 of the License, or
-    (at your option) any later version.
-
-    This program is distributed in the hope that it will be useful,
-    but WITHOUT ANY WARRANTY; without even the implied warranty of
-    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
-    GNU General Public License for more details.
-
-    You should have received a copy of the GNU General Public License along
-    with this program; if not, write to the Free Software Foundation, Inc.,
-    51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
-
-    See the full license in the file "LICENSE" in the top level distribution directory
-    *************************************************************************************/
-    /*  END LEGAL */
-#ifndef MATRIX_H
-#define MATRIX_H
-
-#include <cstdlib>
-#include <string>
-#include <cmath>
-#include <vector>
-#include <iostream>
-#include <iomanip>
-#include <complex>
-#include <typeinfo>
-#include <Grid/Grid.h>
-
-
-/** Sign function **/
-template <class T> T sign(T p){return ( p/abs(p) );}
-
-/////////////////////////////////////////////////////////////////////////////////////////////////////////
-///////////////////// Hijack STL containers for our wicked means /////////////////////////////////////////
-/////////////////////////////////////////////////////////////////////////////////////////////////////////
-template<class T> using Vector = Vector<T>;
-template<class T> using Matrix = Vector<Vector<T> >;
-
-template<class T> void Resize(Vector<T > & vec, int N) { vec.resize(N); }
-
-template<class T> void Resize(Matrix<T > & mat, int N, int M) { 
-  mat.resize(N);
-  for(int i=0;i<N;i++){
-    mat[i].resize(M);
-  }
-}
-template<class T> void Size(Vector<T> & vec, int &N) 
-{ 
-  N= vec.size();
-}
-template<class T> void Size(Matrix<T> & mat, int &N,int &M) 
-{ 
-  N= mat.size();
-  M= mat[0].size();
-}
-template<class T> void SizeSquare(Matrix<T> & mat, int &N) 
-{ 
-  int M; Size(mat,N,M);
-  assert(N==M);
-}
-template<class T> void SizeSame(Matrix<T> & mat1,Matrix<T> &mat2, int &N1,int &M1) 
-{ 
-  int N2,M2;
-  Size(mat1,N1,M1);
-  Size(mat2,N2,M2);
-  assert(N1==N2);
-  assert(M1==M2);
-}
-
-//*****************************************
-//*	(Complex) Vector operations	*
-//*****************************************
-
-/**Conj of a Vector **/
-template <class T> Vector<T> conj(Vector<T> p){
-	Vector<T> q(p.size());
-	for(int i=0;i<p.size();i++){q[i] = conj(p[i]);}
-	return q;
-}
-
-/** Norm of a Vector**/
-template <class T> T norm(Vector<T> p){
-	T sum = 0;
-	for(int i=0;i<p.size();i++){sum = sum + p[i]*conj(p[i]);}
-	return abs(sqrt(sum));
-}
-
-/** Norm squared of a Vector **/
-template <class T> T norm2(Vector<T> p){
-	T sum = 0;
-	for(int i=0;i<p.size();i++){sum = sum + p[i]*conj(p[i]);}
-	return abs((sum));
-}
-
-/** Sum elements of a Vector **/
-template <class T> T trace(Vector<T> p){
-	T sum = 0;
-	for(int i=0;i<p.size();i++){sum = sum + p[i];}
-	return sum;
-}
-
-/** Fill a Vector with constant c **/
-template <class T> void Fill(Vector<T> &p, T c){
-	for(int i=0;i<p.size();i++){p[i] = c;}
-}
-/** Normalize a Vector **/
-template <class T> void normalize(Vector<T> &p){
-	T m = norm(p);
-	if( abs(m) > 0.0) for(int i=0;i<p.size();i++){p[i] /= m;}
-}
-/** Vector by scalar **/
-template <class T, class U> Vector<T> times(Vector<T> p, U s){
-	for(int i=0;i<p.size();i++){p[i] *= s;}
-	return p;
-}
-template <class T, class U> Vector<T> times(U s, Vector<T> p){
-	for(int i=0;i<p.size();i++){p[i] *= s;}
-	return p;
-}
-/** inner product of a and b = conj(a) . b **/
-template <class T> T inner(Vector<T> a, Vector<T> b){
-	T m = 0.;
-	for(int i=0;i<a.size();i++){m = m + conj(a[i])*b[i];}
-	return m;
-}
-/** sum of a and b = a + b **/
-template <class T> Vector<T> add(Vector<T> a, Vector<T> b){
-	Vector<T> m(a.size());
-	for(int i=0;i<a.size();i++){m[i] = a[i] + b[i];}
-	return m;
-}
-/** sum of a and b = a - b **/
-template <class T> Vector<T> sub(Vector<T> a, Vector<T> b){
-	Vector<T> m(a.size());
-	for(int i=0;i<a.size();i++){m[i] = a[i] - b[i];}
-	return m;
-}
-
-/** 
- *********************************
- *	Matrices	         *
- *********************************
- **/
-
-template<class T> void Fill(Matrix<T> & mat, T&val) { 
-  int N,M;
-  Size(mat,N,M);
-  for(int i=0;i<N;i++){
-  for(int j=0;j<M;j++){
-    mat[i][j] = val;
-  }}
-}
-
-/** Transpose of a matrix **/
-Matrix<T> Transpose(Matrix<T> & mat){
-  int N,M;
-  Size(mat,N,M);
-  Matrix C; Resize(C,M,N);
-  for(int i=0;i<M;i++){
-  for(int j=0;j<N;j++){
-    C[i][j] = mat[j][i];
-  }} 
-  return C;
-}
-/** Set Matrix to unit matrix **/
-template<class T> void Unity(Matrix<T> &mat){
-  int N;  SizeSquare(mat,N);
-  for(int i=0;i<N;i++){
-    for(int j=0;j<N;j++){
-      if ( i==j ) A[i][j] = 1;
-      else        A[i][j] = 0;
-    } 
-  } 
-}
-/** Add C * I to matrix **/
-template<class T>
-void PlusUnit(Matrix<T> & A,T c){
-  int dim;  SizeSquare(A,dim);
-  for(int i=0;i<dim;i++){A[i][i] = A[i][i] + c;} 
-}
-
-/** return the Hermitian conjugate of matrix **/
-Matrix<T> HermitianConj(Matrix<T> &mat){
-
-  int dim; SizeSquare(mat,dim);
-
-  Matrix<T> C; Resize(C,dim,dim);
-
-  for(int i=0;i<dim;i++){
-    for(int j=0;j<dim;j++){
-      C[i][j] = conj(mat[j][i]);
-    } 
-  } 
-  return C;
-}
-
-/** return diagonal entries as a Vector **/
-Vector<T> diag(Matrix<T> &A)
-{
-  int dim; SizeSquare(A,dim);
-  Vector<T> d; Resize(d,dim);
-
-  for(int i=0;i<dim;i++){
-    d[i] = A[i][i];
-  }
-  return d;
-}
-
-/** Left multiply by a Vector **/
-Vector<T> operator *(Vector<T> &B,Matrix<T> &A)
-{
-  int K,M,N; 
-  Size(B,K);
-  Size(A,M,N);
-  assert(K==M);
-  
-  Vector<T> C; Resize(C,N);
-
-  for(int j=0;j<N;j++){
-    T sum = 0.0;
-    for(int i=0;i<M;i++){
-      sum += B[i] * A[i][j];
-    }
-    C[j] =  sum;
-  }
-  return C; 
-}
-
-/** return 1/diagonal entries as a Vector **/
-Vector<T> inv_diag(Matrix<T> & A){
-  int dim; SizeSquare(A,dim);
-  Vector<T> d; Resize(d,dim);
-  for(int i=0;i<dim;i++){
-    d[i] = 1.0/A[i][i];
-  }
-  return d;
-}
-/** Matrix Addition **/
-inline Matrix<T> operator + (Matrix<T> &A,Matrix<T> &B)
-{
-  int N,M  ; SizeSame(A,B,N,M);
-  Matrix C; Resize(C,N,M);
-  for(int i=0;i<N;i++){
-    for(int j=0;j<M;j++){
-      C[i][j] = A[i][j] +  B[i][j];
-    } 
-  } 
-  return C;
-} 
-/** Matrix Subtraction **/
-inline Matrix<T> operator- (Matrix<T> & A,Matrix<T> &B){
-  int N,M  ; SizeSame(A,B,N,M);
-  Matrix C; Resize(C,N,M);
-  for(int i=0;i<N;i++){
-  for(int j=0;j<M;j++){
-    C[i][j] = A[i][j] -  B[i][j];
-  }}
-  return C;
-} 
-
-/** Matrix scalar multiplication **/
-inline Matrix<T> operator* (Matrix<T> & A,T c){
-  int N,M; Size(A,N,M);
-  Matrix C; Resize(C,N,M);
-  for(int i=0;i<N;i++){
-  for(int j=0;j<M;j++){
-    C[i][j] = A[i][j]*c;
-  }} 
-  return C;
-} 
-/** Matrix Matrix multiplication **/
-inline Matrix<T> operator* (Matrix<T> &A,Matrix<T> &B){
-  int K,L,N,M;
-  Size(A,K,L);
-  Size(B,N,M); assert(L==N);
-  Matrix C; Resize(C,K,M);
-
-  for(int i=0;i<K;i++){
-    for(int j=0;j<M;j++){
-      T sum = 0.0;
-      for(int k=0;k<N;k++) sum += A[i][k]*B[k][j];
-      C[i][j] =sum;
-    }
-  }
-  return C; 
-} 
-/** Matrix Vector multiplication **/
-inline Vector<T> operator* (Matrix<T> &A,Vector<T> &B){
-  int M,N,K;
-  Size(A,N,M);
-  Size(B,K); assert(K==M);
-  Vector<T> C; Resize(C,N);
-  for(int i=0;i<N;i++){
-    T sum = 0.0;
-    for(int j=0;j<M;j++) sum += A[i][j]*B[j];
-    C[i] =  sum;
-  }
-  return C; 
-} 
-
-/** Some version of Matrix norm **/
-/*
-inline T Norm(){ // this is not a usual L2 norm
-    T norm = 0;
-    for(int i=0;i<dim;i++){
-      for(int j=0;j<dim;j++){
-	norm += abs(A[i][j]);
-    }}
-    return norm;
-  }
-*/
-
-/** Some version of Matrix norm **/
-template<class T> T LargestDiag(Matrix<T> &A)
-{
-  int dim ; SizeSquare(A,dim); 
-
-  T ld = abs(A[0][0]);
-  for(int i=1;i<dim;i++){
-    T cf = abs(A[i][i]);
-    if(abs(cf) > abs(ld) ){ld = cf;}
-  }
-  return ld;
-}
-
-/** Look for entries on the leading subdiagonal that are smaller than 'small' **/
-template <class T,class U> int Chop_subdiag(Matrix<T> &A,T norm, int offset, U small)
-{
-  int dim; SizeSquare(A,dim);
-  for(int l = dim - 1 - offset; l >= 1; l--) {             		
-    if((U)abs(A[l][l - 1]) < (U)small) {
-      A[l][l-1]=(U)0.0;
-      return l;
-    }
-  }
-  return 0;
-}
-
-/** Look for entries on the leading subdiagonal that are smaller than 'small' **/
-template <class T,class U> int Chop_symm_subdiag(Matrix<T> & A,T norm, int offset, U small) 
-{
-  int dim; SizeSquare(A,dim);
-  for(int l = dim - 1 - offset; l >= 1; l--) {
-    if((U)abs(A[l][l - 1]) < (U)small) {
-      A[l][l - 1] = (U)0.0;
-      A[l - 1][l] = (U)0.0;
-      return l;
-    }
-  }
-  return 0;
-}
-/**Assign a submatrix to a larger one**/
-template<class T>
-void AssignSubMtx(Matrix<T> & A,int row_st, int row_end, int col_st, int col_end, Matrix<T> &S)
-{
-  for(int i = row_st; i<row_end; i++){
-    for(int j = col_st; j<col_end; j++){
-      A[i][j] = S[i - row_st][j - col_st];
-    }
-  }
-}
-
-/**Get a square submatrix**/
-template <class T>
-Matrix<T> GetSubMtx(Matrix<T> &A,int row_st, int row_end, int col_st, int col_end)
-{
-  Matrix<T> H; Resize(row_end - row_st,col_end-col_st);
-
-  for(int i = row_st; i<row_end; i++){
-  for(int j = col_st; j<col_end; j++){
-    H[i-row_st][j-col_st]=A[i][j];
-  }}
-  return H;
-}
-  
- /**Assign a submatrix to a larger one NB remember Vector Vectors are transposes of the matricies they represent**/
-template<class T>
-void AssignSubMtx(Matrix<T> & A,int row_st, int row_end, int col_st, int col_end, Matrix<T> &S)
-{
-  for(int i = row_st; i<row_end; i++){
-  for(int j = col_st; j<col_end; j++){
-    A[i][j] = S[i - row_st][j - col_st];
-  }}
-}
-  
-/** compute b_i A_ij b_j **/ // surprised no Conj
-template<class T> T proj(Matrix<T> A, Vector<T> B){
-  int dim; SizeSquare(A,dim);
-  int dimB; Size(B,dimB);
-  assert(dimB==dim);
-  T C = 0;
-  for(int i=0;i<dim;i++){
-    T sum = 0.0;
-    for(int j=0;j<dim;j++){
-      sum += A[i][j]*B[j];
-    }
-    C +=  B[i]*sum; // No conj?
-  }
-  return C; 
-}
-
-
-/*
- *************************************************************
- *
- * Matrix Vector products
- *
- *************************************************************
- */
-// Instead make a linop and call my CG;
-
-/// q -> q Q
-template <class T,class Fermion> void times(Vector<Fermion> &q, Matrix<T> &Q)
-{
-  int M; SizeSquare(Q,M);
-  int N; Size(q,N); 
-  assert(M==N);
-
-  times(q,Q,N);
-}
-
-/// q -> q Q
-template <class T> void times(multi1d<LatticeFermion> &q, Matrix<T> &Q, int N)
-{
-  GridBase *grid = q[0]._grid;
-  int M; SizeSquare(Q,M);
-  int K; Size(q,K); 
-  assert(N<M);
-  assert(N<K);
-  Vector<Fermion> S(N,grid );
-  for(int j=0;j<N;j++){
-    S[j] = zero;
-    for(int k=0;k<N;k++){
-      S[j] = S[j] +  q[k]* Q[k][j]; 
-    }
-  }
-  for(int j=0;j<q.size();j++){
-    q[j] = S[j];
-  }
-}
-#endif
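Note on the deleted `Matrix.h`: as shown, the header does not look compilable. `Vector` is aliased to itself, `Transpose`, `HermitianConj`, `diag` and the operators use `T` without a `template<class T>` header, `Unity` writes to `A` while its parameter is named `mat`, and `AssignSubMtx` is defined twice. The sketch below is one possible compilable reading of a few of those helpers, not the code Grid uses after this change. The alias names `DenseVec` and `DenseMat` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical aliases standing in for the deleted Vector/Matrix aliases.
template <class T> using DenseVec = std::vector<T>;
template <class T> using DenseMat = std::vector<std::vector<T>>;

// Resize a matrix to N rows of M columns.
template <class T> void Resize(DenseMat<T>& mat, int N, int M) {
  mat.resize(N);
  for (auto& row : mat) row.resize(M);
}

// Report the row and column counts; requires at least one row.
template <class T> void Size(const DenseMat<T>& mat, int& N, int& M) {
  N = static_cast<int>(mat.size());
  assert(N > 0);
  M = static_cast<int>(mat[0].size());
}

// Report the dimension of a square matrix.
template <class T> void SizeSquare(const DenseMat<T>& mat, int& N) {
  int M;
  Size(mat, N, M);
  assert(N == M);
}

// Transpose of a matrix (with the template header the original lacked).
template <class T> DenseMat<T> Transpose(const DenseMat<T>& mat) {
  int N, M;
  Size(mat, N, M);
  DenseMat<T> C;
  Resize(C, M, N);
  for (int i = 0; i < M; i++)
    for (int j = 0; j < N; j++)
      C[i][j] = mat[j][i];
  return C;
}

// Set a square matrix to the unit matrix (writes to the parameter itself).
template <class T> void Unity(DenseMat<T>& mat) {
  int N;
  SizeSquare(mat, N);
  for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++)
      mat[i][j] = (i == j) ? T(1) : T(0);
}
```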
--- a/lib/algorithms/iterative/MatrixUtils.h
+++ b/lib/algorithms/iterative/MatrixUtils.h
@@ -1,75 +0,0 @@
-    /*************************************************************************************
-
-    Grid physics library, www.github.com/paboyle/Grid 
-
-    Source file: ./lib/algorithms/iterative/MatrixUtils.h
-
-    Copyright (C) 2015
-
-Author: Peter Boyle <paboyle@ph.ed.ac.uk>
-
-    This program is free software; you can redistribute it and/or modify
-    it under the terms of the GNU General Public License as published by
-    the Free Software Foundation; either version 2 of the License, or
-    (at your option) any later version.
-
-    This program is distributed in the hope that it will be useful,
-    but WITHOUT ANY WARRANTY; without even the implied warranty of
-    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
-    GNU General Public License for more details.
-
-    You should have received a copy of the GNU General Public License along
-    with this program; if not, write to the Free Software Foundation, Inc.,
-    51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
-
-    See the full license in the file "LICENSE" in the top level distribution directory
-    *************************************************************************************/
-    /*  END LEGAL */
-#ifndef GRID_MATRIX_UTILS_H
-#define GRID_MATRIX_UTILS_H
-
-namespace Grid {
-
-  namespace MatrixUtils { 
-
-    template<class T> inline void Size(Matrix<T>& A,int &N,int &M){
-      N=A.size(); assert(N>0);
-      M=A[0].size();
-      for(int i=0;i<N;i++){
-	assert(A[i].size()==M);
-      }
-    }
-
-    template<class T> inline void SizeSquare(Matrix<T>& A,int &N)
-    {
-      int M;
-      Size(A,N,M);
-      assert(N==M);
-    }
-
-    template<class T> inline void Fill(Matrix<T>& A,T & val)
-    { 
-      int N,M;
-      Size(A,N,M);
-      for(int i=0;i<N;i++){
-      for(int j=0;j<M;j++){
-	A[i][j]=val;
-      }}
-    }
-    template<class T> inline void Diagonal(Matrix<T>& A,T & val)
-    { 
-      int N;
-      SizeSquare(A,N);
-      for(int i=0;i<N;i++){
-	A[i][i]=val;
-      }
-    }
-    template<class T> inline void Identity(Matrix<T>& A)
-    {
-      Fill(A,0.0);
-      Diagonal(A,1.0);
-    }
-
-  };
-}
-#endif
--- a/lib/algorithms/iterative/SchurRedBlack.h
+++ b/lib/algorithms/iterative/SchurRedBlack.h
@@ -141,85 +141,5 @@ namespace Grid {
    }     
  };

-  ///////////////////////////////////////////////////////////////////////////////////////////////////////
-  // Take a matrix and form a Red Black solver calling a Herm solver
-  // Use of RB info prevents making SchurRedBlackSolve conform to standard interface
-  ///////////////////////////////////////////////////////////////////////////////////////////////////////
-  template<class Field> class SchurRedBlackDiagTwoSolve {
-  private:
-    OperatorFunction<Field> & _HermitianRBSolver;
-    int CBfactorise;
-  public:
-
-    /////////////////////////////////////////////////////
-    // Wrap the usual normal equations Schur trick
-    /////////////////////////////////////////////////////
-  SchurRedBlackDiagTwoSolve(OperatorFunction<Field> &HermitianRBSolver)  :
-     _HermitianRBSolver(HermitianRBSolver) 
-    { 
-      CBfactorise=0;
-    };
-
-    template<class Matrix>
-      void operator() (Matrix & _Matrix,const Field &in, Field &out){
-
-      // FIXME CGdiagonalMee not implemented virtual function
-      // FIXME use CBfactorise to control schur decomp
-      GridBase *grid = _Matrix.RedBlackGrid();
-      GridBase *fgrid= _Matrix.Grid();
-
-      SchurDiagTwoOperator<Matrix,Field> _HermOpEO(_Matrix);
- 
-      Field src_e(grid);
-      Field src_o(grid);
-      Field sol_e(grid);
-      Field sol_o(grid);
-      Field   tmp(grid);
-      Field  Mtmp(grid);
-      Field resid(fgrid);
-
-      pickCheckerboard(Even,src_e,in);
-      pickCheckerboard(Odd ,src_o,in);
-      pickCheckerboard(Even,sol_e,out);
-      pickCheckerboard(Odd ,sol_o,out);
-    
-      /////////////////////////////////////////////////////
-      // src_o = Mdag * (source_o - Moe MeeInv source_e)
-      /////////////////////////////////////////////////////
-      _Matrix.MooeeInv(src_e,tmp);     assert(  tmp.checkerboard ==Even);
-      _Matrix.Meooe   (tmp,Mtmp);      assert( Mtmp.checkerboard ==Odd);     
-      tmp=src_o-Mtmp;                  assert(  tmp.checkerboard ==Odd);     
-
-      // get the right MpcDag
-      _HermOpEO.MpcDag(tmp,src_o);     assert(src_o.checkerboard ==Odd);       
-
-      //////////////////////////////////////////////////////////////
-      // Call the red-black solver
-      //////////////////////////////////////////////////////////////
-      std::cout<<GridLogMessage << "SchurRedBlack solver calling the MpcDagMp solver" <<std::endl;
-//      _HermitianRBSolver(_HermOpEO,src_o,sol_o);  assert(sol_o.checkerboard==Odd);
-      _HermitianRBSolver(_HermOpEO,src_o,tmp);  assert(tmp.checkerboard==Odd);
-      _Matrix.MooeeInv(tmp,sol_o);        assert(  sol_o.checkerboard   ==Odd);
-
-      ///////////////////////////////////////////////////
-      // sol_e = M_ee^-1 * ( src_e - Meo sol_o )...
-      ///////////////////////////////////////////////////
-      _Matrix.Meooe(sol_o,tmp);        assert(  tmp.checkerboard   ==Even);
-      src_e = src_e-tmp;               assert(  src_e.checkerboard ==Even);
-      _Matrix.MooeeInv(src_e,sol_e);   assert(  sol_e.checkerboard ==Even);
-     
-      setCheckerboard(out,sol_e); assert(  sol_e.checkerboard ==Even);
-      setCheckerboard(out,sol_o); assert(  sol_o.checkerboard ==Odd );
-
-      // Verify the unprec residual
-      _Matrix.M(out,resid); 
-      resid = resid-in;
-      RealD ns = norm2(in);
-      RealD nr = norm2(resid);
-
-      std::cout<<GridLogMessage << "SchurRedBlackDiagTwo solver true unprec resid "<< std::sqrt(nr/ns) <<" nr "<< nr <<" ns "<<ns << std::endl;
-    }     
-  };
-
 }
 #endif
--- a/lib/algorithms/iterative/TODO
+++ b/lib/algorithms/iterative/TODO
@@ -1,15 +0,0 @@
- ConjugateGradientMultiShift
- MCR
-
- Potentially Useful Boost libraries
-
- MultiArray
- Aligned allocator; memory pool
- Remez -- Mike or Boost?
- Multiprecision
- quaternians
- Tokenize
- Serialization
- Regex
- Proto (ET)
- uBlas
--- a/lib/algorithms/iterative/bisec.c
+++ b/lib/algorithms/iterative/bisec.c
@@ -1,122 +0,0 @@
-#include <math.h>
-#include <stdlib.h>
-#include <vector>
-
-struct Bisection {
-
-static void get_eig2(int row_num,std::vector<RealD> &ALPHA,std::vector<RealD> &BETA, std::vector<RealD> & eig)
-{
-  int i,j;
-  std::vector<RealD> evec1(row_num+3);
-  std::vector<RealD> evec2(row_num+3);
-  RealD eps2;
-  ALPHA[1]=0.;
-  BETHA[1]=0.;
-  for(i=0;i<row_num-1;i++) {
-    ALPHA[i+1] = A[i*(row_num+1)].real();
-    BETHA[i+2] = A[i*(row_num+1)+1].real();
-  }
-  ALPHA[row_num] = A[(row_num-1)*(row_num+1)].real();
-  bisec(ALPHA,BETHA,row_num,1,row_num,1e-10,1e-10,evec1,eps2);
-  bisec(ALPHA,BETHA,row_num,1,row_num,1e-16,1e-16,evec2,eps2);
-
-  // Do we really need to sort here?
-  int begin=1;
-  int end = row_num;
-  int swapped=1;
-  while(swapped) {
-    swapped=0;
-    for(i=begin;i<end;i++){
-      if(mag(evec2[i])>mag(evec2[i+1]))	{
-	swap(evec2+i,evec2+i+1);
-	swapped=1;
-      }
-    }
-    end--;
-    for(i=end-1;i>=begin;i--){
-      if(mag(evec2[i])>mag(evec2[i+1]))	{
-	swap(evec2+i,evec2+i+1);
-	swapped=1;
-      }
-    }
-    begin++;
-  }
-
-  for(i=0;i<row_num;i++){
-    for(j=0;j<row_num;j++) {
-      if(i==j) H[i*row_num+j]=evec2[i+1];
-      else H[i*row_num+j]=0.;
-    }
-  }
-}
-
-static void bisec(std::vector<RealD> &c,   
-		  std::vector<RealD> &b,
-		  int n,
-		  int m1,
-		  int m2,
-		  RealD eps1,
-		  RealD relfeh,
-		  std::vector<RealD> &x,
-		  RealD &eps2)
-{
-  std::vector<RealD> wu(n+2);
-
-  RealD h,q,x1,xu,x0,xmin,xmax; 
-  int i,a,k;
-
-  b[1]=0.0;
-  xmin=c[n]-fabs(b[n]);
-  xmax=c[n]+fabs(b[n]);
-  for(i=1;i<n;i++){
-    h=fabs(b[i])+fabs(b[i+1]);
-    if(c[i]+h>xmax) xmax= c[i]+h;
-    if(c[i]-h<xmin) xmin= c[i]-h;
-  }
-  xmax *=2.;
-
-  eps2=relfeh*((xmin+xmax)>0.0 ? xmax : -xmin);
-  if(eps1<=0.0) eps1=eps2;
-  eps2=0.5*eps1+7.0*(eps2);
-  x0=xmax;
-  for(i=m1;i<=m2;i++){
-    x[i]=xmax;
-    wu[i]=xmin;
-  }
-
-  for(k=m2;k>=m1;k--){
-    xu=xmin;
-    i=k;
-    do{
-      if(xu<wu[i]){
-	xu=wu[i];
-	i=m1-1;
-      }
-      i--;
-    }while(i>=m1);
-    if(x0>x[k]) x0=x[k];
-    while((x0-xu)>2*relfeh*(fabs(xu)+fabs(x0))+eps1){
-      x1=(xu+x0)/2;
-
-      a=0;
-      q=1.0;
-      for(i=1;i<=n;i++){
-	q=c[i]-x1-((q!=0.0)? b[i]*b[i]/q:fabs(b[i])/relfeh);
-	if(q<0) a++;
-      }
-      //			printf("x1=%e a=%d\n",x1,a);
-      if(a<k){
-	if(a<m1){
-	  xu=x1;
-	  wu[m1]=x1;
-	}else {
-	  xu=x1;
-	  wu[a+1]=x1;
-	  if(x[a]>x1) x[a]=x1;
-	}
-      }else x0=x1;
-    }
-    x[k]=(x0+xu)/2;
-  }
-}
-}
--- a/lib/algorithms/iterative/get_eig.c
+++ b/lib/algorithms/iterative/get_eig.c
@@ -1 +0,0 @@
-
--- a/lib/communicator/.dirstamp
+++ b/lib/communicator/.dirstamp
--- a/lib/cshift/Cshift_common.h
+++ b/lib/cshift/Cshift_common.h
@@ -30,21 +30,11 @@ Author: Peter Boyle <paboyle@ph.ed.ac.uk>

 namespace Grid {

-template<class vobj>
-class SimpleCompressor {
-public:
-  void Point(int) {};
-
-  vobj operator() (const vobj &arg) {
-    return arg;
-  }
-};
-
 ///////////////////////////////////////////////////////////////////
-// Gather for when there is no need to SIMD split with compression
+// Gather for when there is no need to SIMD split 
 ///////////////////////////////////////////////////////////////////
-template<class vobj,class cobj,class compressor> void 
-Gather_plane_simple (const Lattice<vobj> &rhs,commVector<cobj> &buffer,int dimension,int plane,int cbmask,compressor &compress, int off=0)
+template<class vobj> void 
+Gather_plane_simple (const Lattice<vobj> &rhs,commVector<vobj> &buffer,int dimension,int plane,int cbmask, int off=0)
 {
  int rd = rhs._grid->_rdimensions[dimension];

@@ -62,7 +52,7 @@ Gather_plane_simple (const Lattice<vobj> &rhs,commVector<cobj> &buffer,int dimen
      for(int b=0;b<e2;b++){
 	int o  = n*stride;
 	int bo = n*e2;
-	buffer[off+bo+b]=compress(rhs._odata[so+o+b]);
+	buffer[off+bo+b]=rhs._odata[so+o+b];
      }
    }
  } else { 
@@ -78,17 +68,16 @@ Gather_plane_simple (const Lattice<vobj> &rhs,commVector<cobj> &buffer,int dimen
       }
     }
     parallel_for(int i=0;i<table.size();i++){
-       buffer[off+table[i].first]=compress(rhs._odata[so+table[i].second]);
+       buffer[off+table[i].first]=rhs._odata[so+table[i].second];
     }
  }
 }

-
 ///////////////////////////////////////////////////////////////////
-// Gather for when there *is* need to SIMD split with compression
+// Gather for when there *is* need to SIMD split 
 ///////////////////////////////////////////////////////////////////
-template<class cobj,class vobj,class compressor> void 
-Gather_plane_extract(const Lattice<vobj> &rhs,std::vector<typename cobj::scalar_object *> pointers,int dimension,int plane,int cbmask,compressor &compress)
+template<class vobj> void 
+Gather_plane_extract(const Lattice<vobj> &rhs,std::vector<typename vobj::scalar_object *> pointers,int dimension,int plane,int cbmask)
 {
  int rd = rhs._grid->_rdimensions[dimension];

@@ -109,8 +98,8 @@ Gather_plane_extract(const Lattice<vobj> &rhs,std::vector<typename cobj::scalar_
 	int o      =   n*n1;
 	int offset = b+n*e2;
 	
-	cobj temp =compress(rhs._odata[so+o+b]);
-	extract<cobj>(temp,pointers,offset);
+	vobj temp =rhs._odata[so+o+b];
+	extract<vobj>(temp,pointers,offset);

      }
    }
@@ -127,32 +116,14 @@ Gather_plane_extract(const Lattice<vobj> &rhs,std::vector<typename cobj::scalar_
 	int offset = b+n*e2;

 	if ( ocb & cbmask ) {
-	  cobj temp =compress(rhs._odata[so+o+b]);
-	  extract<cobj>(temp,pointers,offset);
+	  vobj temp =rhs._odata[so+o+b];
+	  extract<vobj>(temp,pointers,offset);
 	}
      }
    }
  }
 }

-//////////////////////////////////////////////////////
-// Gather for when there is no need to SIMD split
-//////////////////////////////////////////////////////
-template<class vobj> void Gather_plane_simple (const Lattice<vobj> &rhs,commVector<vobj> &buffer, int dimension,int plane,int cbmask)
-{
-  SimpleCompressor<vobj> dontcompress;
-  Gather_plane_simple (rhs,buffer,dimension,plane,cbmask,dontcompress);
-}
-
-//////////////////////////////////////////////////////
-// Gather for when there *is* need to SIMD split
-//////////////////////////////////////////////////////
-template<class vobj> void Gather_plane_extract(const Lattice<vobj> &rhs,std::vector<typename vobj::scalar_object *> pointers,int dimension,int plane,int cbmask)
-{
-  SimpleCompressor<vobj> dontcompress;
-  Gather_plane_extract<vobj,vobj,decltype(dontcompress)>(rhs,pointers,dimension,plane,cbmask,dontcompress);
-}
-
 //////////////////////////////////////////////////////
 // Scatter for when there is no need to SIMD split
 //////////////////////////////////////////////////////
@@ -200,7 +171,7 @@ template<class vobj> void Scatter_plane_simple (Lattice<vobj> &rhs,commVector<vo
 //////////////////////////////////////////////////////
 // Scatter for when there *is* need to SIMD split
 //////////////////////////////////////////////////////
- template<class vobj,class cobj> void Scatter_plane_merge(Lattice<vobj> &rhs,std::vector<cobj *> pointers,int dimension,int plane,int cbmask)
+template<class vobj> void Scatter_plane_merge(Lattice<vobj> &rhs,std::vector<typename vobj::scalar_object *> pointers,int dimension,int plane,int cbmask)
 {
  int rd = rhs._grid->_rdimensions[dimension];

--- a/lib/cshift/Cshift_mpi.h
+++ b/lib/cshift/Cshift_mpi.h
@@ -154,13 +154,7 @@ template<class vobj> void Cshift_comms(Lattice<vobj> &ret,const Lattice<vobj> &r
 			   recv_from_rank,
 			   bytes);
      grid->Barrier();
-      /*
-      for(int i=0;i<send_buf.size();i++){
-	assert(recv_buf.size()==buffer_size);
-	assert(send_buf.size()==buffer_size);
-	std::cout << "SendRecv_Cshift_comms ["<<i<<" "<< dimension<<"] snd "<<send_buf[i]<<" rcv " << recv_buf[i] << "  0x" << cbmask<<std::endl;
-      }
-      */
+
      Scatter_plane_simple (ret,recv_buf,dimension,x,cbmask);
    }
  }
@@ -246,13 +240,6 @@ template<class vobj> void  Cshift_comms_simd(Lattice<vobj> &ret,const Lattice<vo
 			     (void *)&recv_buf_extract[i][0],
 			     recv_from_rank,
 			     bytes);
-	/*
-	for(int w=0;w<recv_buf_extract[i].size();w++){
-	  assert(recv_buf_extract[i].size()==buffer_size);
-	  assert(send_buf_extract[i].size()==buffer_size);
-	  std::cout << "SendRecv_Cshift_comms ["<<w<<" "<< dimension<<"] recv "<<recv_buf_extract[i][w]<<" send " << send_buf_extract[nbr_lane][w]  << cbmask<<std::endl;
-	}
-	*/	
 	grid->Barrier();
 	rpointers[i] = &recv_buf_extract[i][0];
      } else { 
--- a/lib/lattice/Lattice_reduction.h
+++ b/lib/lattice/Lattice_reduction.h
@@ -30,6 +30,8 @@ Author: paboyle <paboyle@ph.ed.ac.uk>
 #ifndef GRID_LATTICE_REDUCTION_H
 #define GRID_LATTICE_REDUCTION_H

+#include <Grid/Eigen/Dense>
+
 namespace Grid {
 #ifdef GRID_WARN_SUBOPTIMAL
 #warning "Optimisation alert all these reduction loops are NOT threaded "
@@ -38,120 +40,123 @@ namespace Grid {
    ////////////////////////////////////////////////////////////////////////////////////////////////////
    // Deterministic Reduction operations
    ////////////////////////////////////////////////////////////////////////////////////////////////////
-  template<class vobj> inline RealD norm2(const Lattice<vobj> &arg){
-    ComplexD nrm = innerProduct(arg,arg);
-    return std::real(nrm); 
+template<class vobj> inline RealD norm2(const Lattice<vobj> &arg){
+  ComplexD nrm = innerProduct(arg,arg);
+  return std::real(nrm); 
+}
+
+// Inner product accumulated in double precision (uses innerProductD / vector_typeD)
+template<class vobj>
+inline ComplexD innerProduct(const Lattice<vobj> &left,const Lattice<vobj> &right) 
+{
+  typedef typename vobj::scalar_type scalar_type;
+  typedef typename vobj::vector_typeD vector_type;
+  scalar_type  nrm;
+  
+  GridBase *grid = left._grid;
+  
+  std::vector<vector_type,alignedAllocator<vector_type> > sumarray(grid->SumArraySize());
+  
+  parallel_for(int thr=0;thr<grid->SumArraySize();thr++){
+    int nwork, mywork, myoff;
+    GridThread::GetWork(left._grid->oSites(),thr,mywork,myoff);
+    
+    decltype(innerProductD(left._odata[0],right._odata[0])) vnrm=zero; // private to thread; sub summation
+    for(int ss=myoff;ss<mywork+myoff; ss++){
+      vnrm = vnrm + innerProductD(left._odata[ss],right._odata[ss]);
+    }
+    sumarray[thr]=TensorRemove(vnrm) ;
  }
+  
+  vector_type vvnrm; vvnrm=zero;  // sum across threads
+  for(int i=0;i<grid->SumArraySize();i++){
+    vvnrm = vvnrm+sumarray[i];
+  } 
+  nrm = Reduce(vvnrm);// sum across simd
+  right._grid->GlobalSum(nrm);
+  return nrm;
+}
+ 
+template<class Op,class T1>
+inline auto sum(const LatticeUnaryExpression<Op,T1> & expr)
+  ->typename decltype(expr.first.func(eval(0,std::get<0>(expr.second))))::scalar_object
+{
+  return sum(closure(expr));
+}

-    template<class vobj>
-    inline ComplexD innerProduct(const Lattice<vobj> &left,const Lattice<vobj> &right) 
-    {
-      typedef typename vobj::scalar_type scalar_type;
-      typedef typename vobj::vector_type vector_type;
-      scalar_type  nrm;
-
-      GridBase *grid = left._grid;
-
-      std::vector<vector_type,alignedAllocator<vector_type> > sumarray(grid->SumArraySize());
-      for(int i=0;i<grid->SumArraySize();i++){
-	sumarray[i]=zero;
-      }
-
-      parallel_for(int thr=0;thr<grid->SumArraySize();thr++){
-	int nwork, mywork, myoff;
-	GridThread::GetWork(left._grid->oSites(),thr,mywork,myoff);
-	
-	decltype(innerProduct(left._odata[0],right._odata[0])) vnrm=zero; // private to thread; sub summation
-        for(int ss=myoff;ss<mywork+myoff; ss++){
-	  vnrm = vnrm + innerProduct(left._odata[ss],right._odata[ss]);
-	}
-	sumarray[thr]=TensorRemove(vnrm) ;
-      }
-      
-      vector_type vvnrm; vvnrm=zero;  // sum across threads
-      for(int i=0;i<grid->SumArraySize();i++){
-	vvnrm = vvnrm+sumarray[i];
-      } 
-      nrm = Reduce(vvnrm);// sum across simd
-      right._grid->GlobalSum(nrm);
-      return nrm;
-    }
-
-    template<class Op,class T1>
-      inline auto sum(const LatticeUnaryExpression<Op,T1> & expr)
-      ->typename decltype(expr.first.func(eval(0,std::get<0>(expr.second))))::scalar_object
-    {
-      return sum(closure(expr));
-    }
-
-    template<class Op,class T1,class T2>
-      inline auto sum(const LatticeBinaryExpression<Op,T1,T2> & expr)
+template<class Op,class T1,class T2>
+inline auto sum(const LatticeBinaryExpression<Op,T1,T2> & expr)
      ->typename decltype(expr.first.func(eval(0,std::get<0>(expr.second)),eval(0,std::get<1>(expr.second))))::scalar_object
-    {
-      return sum(closure(expr));
-    }
-
-
-    template<class Op,class T1,class T2,class T3>
-      inline auto sum(const LatticeTrinaryExpression<Op,T1,T2,T3> & expr)
-      ->typename decltype(expr.first.func(eval(0,std::get<0>(expr.second)),
-				 eval(0,std::get<1>(expr.second)),
-				 eval(0,std::get<2>(expr.second))
-				 ))::scalar_object
-    {
-      return sum(closure(expr));
-    }
-
-    template<class vobj>
-    inline typename vobj::scalar_object sum(const Lattice<vobj> &arg){
-
-      GridBase *grid=arg._grid;
-      int Nsimd = grid->Nsimd();
-
-      std::vector<vobj,alignedAllocator<vobj> > sumarray(grid->SumArraySize());
-      for(int i=0;i<grid->SumArraySize();i++){
-	sumarray[i]=zero;
-      }
-
-      parallel_for(int thr=0;thr<grid->SumArraySize();thr++){
-	int nwork, mywork, myoff;
-	GridThread::GetWork(grid->oSites(),thr,mywork,myoff);
-	
-	vobj vvsum=zero;
-        for(int ss=myoff;ss<mywork+myoff; ss++){
-	  vvsum = vvsum + arg._odata[ss];
-	}
-	sumarray[thr]=vvsum;
-      }
-      
-      vobj vsum=zero;  // sum across threads
-      for(int i=0;i<grid->SumArraySize();i++){
-	vsum = vsum+sumarray[i];
-      } 
-
-      typedef typename vobj::scalar_object sobj;
-      sobj ssum=zero;
-
-      std::vector<sobj>               buf(Nsimd);
-      extract(vsum,buf);
-
-      for(int i=0;i<Nsimd;i++) ssum = ssum + buf[i];
-      arg._grid->GlobalSum(ssum);
-
-      return ssum;
+{
+  return sum(closure(expr));
+}
+
+
+template<class Op,class T1,class T2,class T3>
+inline auto sum(const LatticeTrinaryExpression<Op,T1,T2,T3> & expr)
+  ->typename decltype(expr.first.func(eval(0,std::get<0>(expr.second)),
+				      eval(0,std::get<1>(expr.second)),
+				      eval(0,std::get<2>(expr.second))
+				      ))::scalar_object
+{
+  return sum(closure(expr));
+}
+
+template<class vobj>
+inline typename vobj::scalar_object sum(const Lattice<vobj> &arg)
+{
+  GridBase *grid=arg._grid;
+  int Nsimd = grid->Nsimd();
+  
+  std::vector<vobj,alignedAllocator<vobj> > sumarray(grid->SumArraySize());
+  for(int i=0;i<grid->SumArraySize();i++){
+    sumarray[i]=zero;
+  }
+  
+  parallel_for(int thr=0;thr<grid->SumArraySize();thr++){
+    int nwork, mywork, myoff;
+    GridThread::GetWork(grid->oSites(),thr,mywork,myoff);
+    
+    vobj vvsum=zero;
+    for(int ss=myoff;ss<mywork+myoff; ss++){
+      vvsum = vvsum + arg._odata[ss];
    }
+    sumarray[thr]=vvsum;
+  }
+  
+  vobj vsum=zero;  // sum across threads
+  for(int i=0;i<grid->SumArraySize();i++){
+    vsum = vsum+sumarray[i];
+  } 
+  
+  typedef typename vobj::scalar_object sobj;
+  sobj ssum=zero;
+  
+  std::vector<sobj>               buf(Nsimd);
+  extract(vsum,buf);
+  
+  for(int i=0;i<Nsimd;i++) ssum = ssum + buf[i];
+  arg._grid->GlobalSum(ssum);
+  
+  return ssum;
+}


+//////////////////////////////////////////////////////////////////////////////////////////////////////////////
+// sliceSum, sliceInnerProduct, sliceAxpy, sliceNorm etc...
+//////////////////////////////////////////////////////////////////////////////////////////////////////////////

 template<class vobj> inline void sliceSum(const Lattice<vobj> &Data,std::vector<typename vobj::scalar_object> &result,int orthogdim)
 {
+  ///////////////////////////////////////////////////////
+  // FIXME precision promoted summation
+  // may be important for correlation functions
+  // But easily avoided by using double precision fields
+  ///////////////////////////////////////////////////////
  typedef typename vobj::scalar_object sobj;
  GridBase  *grid = Data._grid;
  assert(grid!=NULL);

-  // FIXME
-  // std::cout<<GridLogMessage<<"WARNING ! SliceSum is unthreaded "<<grid->SumArraySize()<<" threads "<<std::endl;
-
  const int    Nd = grid->_ndimension;
  const int Nsimd = grid->Nsimd();

@@ -163,23 +168,31 @@ template<class vobj> inline void sliceSum(const Lattice<vobj> &Data,std::vector<
  int rd=grid->_rdimensions[orthogdim];

  std::vector<vobj,alignedAllocator<vobj> > lvSum(rd); // will locally sum vectors first
-  std::vector<sobj> lsSum(ld,zero); // sum across these down to scalars
-  std::vector<sobj> extracted(Nsimd);     // splitting the SIMD
+  std::vector<sobj> lsSum(ld,zero);                    // sum across these down to scalars
+  std::vector<sobj> extracted(Nsimd);                  // splitting the SIMD

-  result.resize(fd); // And then global sum to return the same vector to every node for IO to file
+  result.resize(fd); // And then global sum to return the same vector to every node 
  for(int r=0;r<rd;r++){
    lvSum[r]=zero;
  }

-  std::vector<int>  coor(Nd);  
+  int e1=    grid->_slice_nblock[orthogdim];
+  int e2=    grid->_slice_block [orthogdim];
+  int stride=grid->_slice_stride[orthogdim];

  // sum over reduced dimension planes, breaking out orthog dir
+  // Parallel over orthog direction
+  parallel_for(int r=0;r<rd;r++){

-  for(int ss=0;ss<grid->oSites();ss++){
-    Lexicographic::CoorFromIndex(coor,ss,grid->_rdimensions);
-    int r = coor[orthogdim];
-    lvSum[r]=lvSum[r]+Data._odata[ss];
-  }  
+    int so=r*grid->_ostride[orthogdim]; // base offset for start of plane 
+
+    for(int n=0;n<e1;n++){
+      for(int b=0;b<e2;b++){
+	int ss= so+n*stride+b;
+	lvSum[r]=lvSum[r]+Data._odata[ss];
+      }
+    }
+  }

  // Sum across simd lanes in the plane, breaking out orthog dir.
  std::vector<int> icoor(Nd);
@@ -214,10 +227,304 @@ template<class vobj> inline void sliceSum(const Lattice<vobj> &Data,std::vector<

    result[t]=gsum;
  }
+}

+template<class vobj>
+static void sliceInnerProductVector( std::vector<ComplexD> & result, const Lattice<vobj> &lhs,const Lattice<vobj> &rhs,int orthogdim) 
+{
+  typedef typename vobj::vector_type   vector_type;
+  typedef typename vobj::scalar_type   scalar_type;
+  GridBase  *grid = lhs._grid;
+  assert(grid!=NULL);
+  conformable(grid,rhs._grid);
+
+  const int    Nd = grid->_ndimension;
+  const int Nsimd = grid->Nsimd();
+
+  assert(orthogdim >= 0);
+  assert(orthogdim < Nd);
+
+  int fd=grid->_fdimensions[orthogdim];
+  int ld=grid->_ldimensions[orthogdim];
+  int rd=grid->_rdimensions[orthogdim];
+
+  std::vector<vector_type,alignedAllocator<vector_type> > lvSum(rd); // will locally sum vectors first
+  std::vector<scalar_type > lsSum(ld,scalar_type(0.0));                    // sum across these down to scalars
+  std::vector<iScalar<scalar_type> > extracted(Nsimd);                  // splitting the SIMD
+
+  result.resize(fd); // And then global sum to return the same vector to every node for IO to file
+  for(int r=0;r<rd;r++){
+    lvSum[r]=zero;
+  }
+
+  int e1=    grid->_slice_nblock[orthogdim];
+  int e2=    grid->_slice_block [orthogdim];
+  int stride=grid->_slice_stride[orthogdim];
+
+  parallel_for(int r=0;r<rd;r++){
+
+    int so=r*grid->_ostride[orthogdim]; // base offset for start of plane 
+
+    for(int n=0;n<e1;n++){
+      for(int b=0;b<e2;b++){
+	int ss= so+n*stride+b;
+	vector_type vv = TensorRemove(innerProduct(lhs._odata[ss],rhs._odata[ss]));
+	lvSum[r]=lvSum[r]+vv;
+      }
+    }
+  }
+
+  // Sum across simd lanes in the plane, breaking out orthog dir.
+  std::vector<int> icoor(Nd);
+  for(int rt=0;rt<rd;rt++){
+
+    iScalar<vector_type> temp; 
+    temp._internal = lvSum[rt];
+    extract(temp,extracted);
+
+    for(int idx=0;idx<Nsimd;idx++){
+
+      grid->iCoorFromIindex(icoor,idx);
+
+      int ldx =rt+icoor[orthogdim]*rd;
+
+      lsSum[ldx]=lsSum[ldx]+extracted[idx]._internal;
+
+    }
+  }
+  
+  // sum over nodes.
+  scalar_type gsum;
+  for(int t=0;t<fd;t++){
+    int pt = t/ld; // processor plane
+    int lt = t%ld;
+    if ( pt == grid->_processor_coor[orthogdim] ) {
+      gsum=lsSum[lt];
+    } else {
+      gsum=scalar_type(0.0);
+    }
+
+    grid->GlobalSum(gsum);
+
+    result[t]=gsum;
+  }
+}
+template<class vobj>
+static void sliceNorm (std::vector<RealD> &sn,const Lattice<vobj> &rhs,int Orthog) 
+{
+  typedef typename vobj::scalar_object sobj;
+  typedef typename vobj::scalar_type scalar_type;
+  typedef typename vobj::vector_type vector_type;
+  
+  int Nblock = rhs._grid->GlobalDimensions()[Orthog];
+  std::vector<ComplexD> ip(Nblock);
+  sn.resize(Nblock);
+  
+  sliceInnerProductVector(ip,rhs,rhs,Orthog);
+  for(int ss=0;ss<Nblock;ss++){
+    sn[ss] = real(ip[ss]);
+  }
+}
+
+
+template<class vobj>
+static void sliceMaddVector(Lattice<vobj> &R,std::vector<RealD> &a,const Lattice<vobj> &X,const Lattice<vobj> &Y,
+			    int orthogdim,RealD scale=1.0) 
+{    
+  typedef typename vobj::scalar_object sobj;
+  typedef typename vobj::scalar_type scalar_type;
+  typedef typename vobj::vector_type vector_type;
+  typedef typename vobj::tensor_reduced tensor_reduced;
+  
+  GridBase *grid  = X._grid;
+
+  int Nsimd  =grid->Nsimd();
+  int Nblock =grid->GlobalDimensions()[orthogdim];
+
+  int fd     =grid->_fdimensions[orthogdim];
+  int ld     =grid->_ldimensions[orthogdim];
+  int rd     =grid->_rdimensions[orthogdim];
+
+  int e1     =grid->_slice_nblock[orthogdim];
+  int e2     =grid->_slice_block [orthogdim];
+  int stride =grid->_slice_stride[orthogdim];
+
+  std::vector<int> icoor;
+
+  for(int r=0;r<rd;r++){
+
+    int so=r*grid->_ostride[orthogdim]; // base offset for start of plane 
+
+    vector_type    av;
+
+    for(int l=0;l<Nsimd;l++){
+      grid->iCoorFromIindex(icoor,l);
+      int ldx =r+icoor[orthogdim]*rd;
+      scalar_type *as =(scalar_type *)&av;
+      as[l] = scalar_type(a[ldx])*scale;
+    }
+
+    tensor_reduced at; at=av;
+
+    parallel_for_nest2(int n=0;n<e1;n++){
+      for(int b=0;b<e2;b++){
+	int ss= so+n*stride+b;
+	R._odata[ss] = at*X._odata[ss]+Y._odata[ss];
+      }
+    }
+  }
+}
+
+
+/*
+template<class vobj>
+static void sliceMaddVectorSlow (Lattice<vobj> &R,std::vector<RealD> &a,const Lattice<vobj> &X,const Lattice<vobj> &Y,
+			     int Orthog,RealD scale=1.0) 
+{    
+  // FIXME: Implementation is slow
+  // Best base the linear combination by constructing a 
+  // set of vectors of size grid->_rdimensions[Orthog].
+  typedef typename vobj::scalar_object sobj;
+  typedef typename vobj::scalar_type scalar_type;
+  typedef typename vobj::vector_type vector_type;
+  
+  int Nblock = X._grid->GlobalDimensions()[Orthog];
+  
+  GridBase *FullGrid  = X._grid;
+  GridBase *SliceGrid = makeSubSliceGrid(FullGrid,Orthog);
+  
+  Lattice<vobj> Xslice(SliceGrid);
+  Lattice<vobj> Rslice(SliceGrid);
+  // If we based this on Cshift it would work for spread out
+  // but it would be even slower
+  for(int i=0;i<Nblock;i++){
+    ExtractSlice(Rslice,Y,i,Orthog);
+    ExtractSlice(Xslice,X,i,Orthog);
+    Rslice = Rslice + Xslice*(scale*a[i]);
+    InsertSlice(Rslice,R,i,Orthog);
+  }
+};
+
+template<class vobj>
+static void sliceInnerProductVectorSlow( std::vector<ComplexD> & vec, const Lattice<vobj> &lhs,const Lattice<vobj> &rhs,int Orthog) 
+  {
+    // FIXME: Implementation is slow
+    // Look at localInnerProduct implementation,
+    // and do inside a site loop with block strided iterators
+    typedef typename vobj::scalar_object sobj;
+    typedef typename vobj::scalar_type scalar_type;
+    typedef typename vobj::vector_type vector_type;
+    typedef typename vobj::tensor_reduced scalar;
+    typedef typename scalar::scalar_object  scomplex;
+  
+    int Nblock = lhs._grid->GlobalDimensions()[Orthog];
+
+    vec.resize(Nblock);
+    std::vector<scomplex> sip(Nblock);
+    Lattice<scalar> IP(lhs._grid); 
+
+    IP=localInnerProduct(lhs,rhs);
+    sliceSum(IP,sip,Orthog);
+  
+    for(int ss=0;ss<Nblock;ss++){
+      vec[ss] = TensorRemove(sip[ss]);
+    }
+  }
+*/
+
+//////////////////////////////////////////////////////////////////////////////////////////
+// FIXME: Implementation is slow
+// If we based this on Cshift it would work for spread out
+// but it would be even slower
+//
+// Repeated extract slice is inefficient
+//
+// Better to build the linear combination from a
+// set of vectors of size grid->_rdimensions[Orthog].
+//////////////////////////////////////////////////////////////////////////////////////////
+
+inline GridBase         *makeSubSliceGrid(const GridBase *BlockSolverGrid,int Orthog)
+{
+  int NN    = BlockSolverGrid->_ndimension;
+  int nsimd = BlockSolverGrid->Nsimd();
+  
+  std::vector<int> latt_phys(0);
+  std::vector<int> simd_phys(0);
+  std::vector<int>  mpi_phys(0);
+  
+  for(int d=0;d<NN;d++){
+    if( d!=Orthog ) { 
+      latt_phys.push_back(BlockSolverGrid->_fdimensions[d]);
+      simd_phys.push_back(BlockSolverGrid->_simd_layout[d]);
+      mpi_phys.push_back(BlockSolverGrid->_processors[d]);
+    }
+  }
+  return (GridBase *)new GridCartesian(latt_phys,simd_phys,mpi_phys); 
 }


+template<class vobj>
+static void sliceMaddMatrix (Lattice<vobj> &R,Eigen::MatrixXcd &aa,const Lattice<vobj> &X,const Lattice<vobj> &Y,int Orthog,RealD scale=1.0) 
+{    
+  typedef typename vobj::scalar_object sobj;
+  typedef typename vobj::scalar_type scalar_type;
+  typedef typename vobj::vector_type vector_type;
+
+  int Nblock = X._grid->GlobalDimensions()[Orthog];
+  
+  GridBase *FullGrid  = X._grid;
+  GridBase *SliceGrid = makeSubSliceGrid(FullGrid,Orthog);
+  
+  Lattice<vobj> Xslice(SliceGrid);
+  Lattice<vobj> Rslice(SliceGrid);
+  
+  for(int i=0;i<Nblock;i++){
+    ExtractSlice(Rslice,Y,i,Orthog);
+    for(int j=0;j<Nblock;j++){
+      ExtractSlice(Xslice,X,j,Orthog);
+      Rslice = Rslice + Xslice*(scale*aa(j,i));
+    }
+    InsertSlice(Rslice,R,i,Orthog);
+  }
+}
+
+template<class vobj>
+static void sliceInnerProductMatrix(  Eigen::MatrixXcd &mat, const Lattice<vobj> &lhs,const Lattice<vobj> &rhs,int Orthog) 
+{
+  // FIXME: Implementation is slow
+  // Not sure of best solution.. think about it
+  typedef typename vobj::scalar_object sobj;
+  typedef typename vobj::scalar_type scalar_type;
+  typedef typename vobj::vector_type vector_type;
+  
+  GridBase *FullGrid  = lhs._grid;
+  GridBase *SliceGrid = makeSubSliceGrid(FullGrid,Orthog);
+  
+  int Nblock = FullGrid->GlobalDimensions()[Orthog];
+  
+  Lattice<vobj> Lslice(SliceGrid);
+  Lattice<vobj> Rslice(SliceGrid);
+  
+  mat = Eigen::MatrixXcd::Zero(Nblock,Nblock);
+  
+  for(int i=0;i<Nblock;i++){
+    ExtractSlice(Lslice,lhs,i,Orthog);
+    for(int j=0;j<Nblock;j++){
+      ExtractSlice(Rslice,rhs,j,Orthog);
+      mat(i,j) = innerProduct(Lslice,Rslice);
+    }
+  }
+#undef FORCE_DIAG
+#ifdef FORCE_DIAG
+  for(int i=0;i<Nblock;i++){
+    for(int j=0;j<Nblock;j++){
+      if ( i != j ) mat(i,j)=0.0;
+    }
+  }
+#endif
+  return;
 }
+
+} /*END NAMESPACE GRID*/
 #endif

--- a/lib/lattice/Lattice_transfer.h
+++ b/lib/lattice/Lattice_transfer.h
@@ -1,4 +1,4 @@
-    /*************************************************************************************
+/*************************************************************************************

    Grid physics library, www.github.com/paboyle/Grid 

@@ -359,7 +359,7 @@ void localConvert(const Lattice<vobj> &in,Lattice<vvobj> &out)


 template<class vobj>
-void InsertSlice(Lattice<vobj> &lowDim,Lattice<vobj> & higherDim,int slice, int orthog)
+void InsertSlice(const Lattice<vobj> &lowDim,Lattice<vobj> & higherDim,int slice, int orthog)
 {
  typedef typename vobj::scalar_object sobj;

@@ -401,7 +401,7 @@ void InsertSlice(Lattice<vobj> &lowDim,Lattice<vobj> & higherDim,int slice, int
 }

 template<class vobj>
-void ExtractSlice(Lattice<vobj> &lowDim, Lattice<vobj> & higherDim,int slice, int orthog)
+void ExtractSlice(Lattice<vobj> &lowDim,const Lattice<vobj> & higherDim,int slice, int orthog)
 {
  typedef typename vobj::scalar_object sobj;

@@ -444,7 +444,7 @@ void ExtractSlice(Lattice<vobj> &lowDim, Lattice<vobj> & higherDim,int slice, in


 template<class vobj>
-void InsertSliceLocal(Lattice<vobj> &lowDim, Lattice<vobj> & higherDim,int slice_lo,int slice_hi, int orthog)
+void InsertSliceLocal(const Lattice<vobj> &lowDim, Lattice<vobj> & higherDim,int slice_lo,int slice_hi, int orthog)
 {
  typedef typename vobj::scalar_object sobj;

--- a/lib/log/Log.h
+++ b/lib/log/Log.h
@@ -110,8 +110,8 @@ public:
  friend std::ostream& operator<< (std::ostream& stream, Logger& log){

    if ( log.active ) {
-      stream << log.background()<< std::setw(10) << std::left << log.topName << log.background()<< " : ";
-      stream << log.colour() << std::setw(14) << std::left << log.name << log.background() << " : ";
+      stream << log.background()<< std::setw(8) << std::left << log.topName << log.background()<< " : ";
+      stream << log.colour() << std::setw(10) << std::left << log.name << log.background() << " : ";
      if ( log.timestamp ) {
 	StopWatch.Stop();
 	GridTime now = StopWatch.Elapsed();
--- a/lib/qcd/action/fermion/.dirstamp
+++ b/lib/qcd/action/fermion/.dirstamp
--- a/lib/qcd/action/fermion/CayleyFermion5D.cc
+++ b/lib/qcd/action/fermion/CayleyFermion5D.cc
@@ -414,8 +414,6 @@ void CayleyFermion5D<Impl>::SetCoefficientsInternal(RealD zolo_hi,std::vector<Co
    omega[i] = gamma[i]*zolo_hi; //NB reciprocal relative to Chroma NEF code
    bs[i] = 0.5*(bpc/omega[i] + bmc);
    cs[i] = 0.5*(bpc/omega[i] - bmc);
-    std::cout<<GridLogMessage << "CayleyFermion5D "<<i<<" bs="<<bs[i]<<" cs="<<cs[i]<< std::endl;
-
  }
  
  ////////////////////////////////////////////////////////
--- a/lib/qcd/action/fermion/CayleyFermion5Dcache.cc
+++ b/lib/qcd/action/fermion/CayleyFermion5Dcache.cc
@@ -237,6 +237,13 @@ void CayleyFermion5D<Impl>::MooeeInvDag (const FermionField &psi, FermionField &
  INSTANTIATE_DPERP(GparityWilsonImplD);
  INSTANTIATE_DPERP(ZWilsonImplF);
  INSTANTIATE_DPERP(ZWilsonImplD);
+
+  INSTANTIATE_DPERP(WilsonImplFH);
+  INSTANTIATE_DPERP(WilsonImplDF);
+  INSTANTIATE_DPERP(GparityWilsonImplFH);
+  INSTANTIATE_DPERP(GparityWilsonImplDF);
+  INSTANTIATE_DPERP(ZWilsonImplFH);
+  INSTANTIATE_DPERP(ZWilsonImplDF);
 #endif

 }}
--- a/lib/qcd/action/fermion/CayleyFermion5Ddense.cc
+++ b/lib/qcd/action/fermion/CayleyFermion5Ddense.cc
@@ -137,6 +137,20 @@ template void CayleyFermion5D<WilsonImplF>::MooeeInternal(const FermionField &ps
 template void CayleyFermion5D<WilsonImplD>::MooeeInternal(const FermionField &psi, FermionField &chi,int dag, int inv);
 template void CayleyFermion5D<ZWilsonImplF>::MooeeInternal(const FermionField &psi, FermionField &chi,int dag, int inv);
 template void CayleyFermion5D<ZWilsonImplD>::MooeeInternal(const FermionField &psi, FermionField &chi,int dag, int inv);
+
+INSTANTIATE_DPERP(GparityWilsonImplFH);
+INSTANTIATE_DPERP(GparityWilsonImplDF);
+INSTANTIATE_DPERP(WilsonImplFH);
+INSTANTIATE_DPERP(WilsonImplDF);
+INSTANTIATE_DPERP(ZWilsonImplFH);
+INSTANTIATE_DPERP(ZWilsonImplDF);
+
+template void CayleyFermion5D<GparityWilsonImplFH>::MooeeInternal(const FermionField &psi, FermionField &chi,int dag, int inv);
+template void CayleyFermion5D<GparityWilsonImplDF>::MooeeInternal(const FermionField &psi, FermionField &chi,int dag, int inv);
+template void CayleyFermion5D<WilsonImplFH>::MooeeInternal(const FermionField &psi, FermionField &chi,int dag, int inv);
+template void CayleyFermion5D<WilsonImplDF>::MooeeInternal(const FermionField &psi, FermionField &chi,int dag, int inv);
+template void CayleyFermion5D<ZWilsonImplFH>::MooeeInternal(const FermionField &psi, FermionField &chi,int dag, int inv);
+template void CayleyFermion5D<ZWilsonImplDF>::MooeeInternal(const FermionField &psi, FermionField &chi,int dag, int inv);
 #endif

 }}
--- a/lib/qcd/action/fermion/CayleyFermion5Dssp.cc
+++ b/lib/qcd/action/fermion/CayleyFermion5Dssp.cc
@@ -37,7 +37,6 @@ namespace Grid {
 namespace QCD {

  // FIXME -- make a version of these routines with site loop outermost for cache reuse.
-
  // Pminus fowards
  // Pplus  backwards
 template<class Impl>  
@@ -152,6 +151,13 @@ void CayleyFermion5D<Impl>::MooeeInvDag (const FermionField &psi, FermionField &
  INSTANTIATE_DPERP(GparityWilsonImplD);
  INSTANTIATE_DPERP(ZWilsonImplF);
  INSTANTIATE_DPERP(ZWilsonImplD);
+
+  INSTANTIATE_DPERP(WilsonImplFH);
+  INSTANTIATE_DPERP(WilsonImplDF);
+  INSTANTIATE_DPERP(GparityWilsonImplFH);
+  INSTANTIATE_DPERP(GparityWilsonImplDF);
+  INSTANTIATE_DPERP(ZWilsonImplFH);
+  INSTANTIATE_DPERP(ZWilsonImplDF);
 #endif

 }
--- a/lib/qcd/action/fermion/CayleyFermion5Dvec.cc
+++ b/lib/qcd/action/fermion/CayleyFermion5Dvec.cc
@@ -808,10 +808,21 @@ INSTANTIATE_DPERP(DomainWallVec5dImplF);
 INSTANTIATE_DPERP(ZDomainWallVec5dImplD);
 INSTANTIATE_DPERP(ZDomainWallVec5dImplF);

+INSTANTIATE_DPERP(DomainWallVec5dImplDF);
+INSTANTIATE_DPERP(DomainWallVec5dImplFH);
+INSTANTIATE_DPERP(ZDomainWallVec5dImplDF);
+INSTANTIATE_DPERP(ZDomainWallVec5dImplFH);
+
 template void CayleyFermion5D<DomainWallVec5dImplF>::MooeeInternal(const FermionField &psi, FermionField &chi,int dag, int inv);
 template void CayleyFermion5D<DomainWallVec5dImplD>::MooeeInternal(const FermionField &psi, FermionField &chi,int dag, int inv);
 template void CayleyFermion5D<ZDomainWallVec5dImplF>::MooeeInternal(const FermionField &psi, FermionField &chi,int dag, int inv);
 template void CayleyFermion5D<ZDomainWallVec5dImplD>::MooeeInternal(const FermionField &psi, FermionField &chi,int dag, int inv);

+template void CayleyFermion5D<DomainWallVec5dImplFH>::MooeeInternal(const FermionField &psi, FermionField &chi,int dag, int inv);
+template void CayleyFermion5D<DomainWallVec5dImplDF>::MooeeInternal(const FermionField &psi, FermionField &chi,int dag, int inv);
+template void CayleyFermion5D<ZDomainWallVec5dImplFH>::MooeeInternal(const FermionField &psi, FermionField &chi,int dag, int inv);
+template void CayleyFermion5D<ZDomainWallVec5dImplDF>::MooeeInternal(const FermionField &psi, FermionField &chi,int dag, int inv);
+
+

 }}
--- a/lib/qcd/action/fermion/Fermion.h
+++ b/lib/qcd/action/fermion/Fermion.h
@@ -58,6 +58,7 @@ Author: Peter Boyle <pabobyle@ph.ed.ac.uk>
 #include <Grid/qcd/action/fermion/DomainWallFermion.h>
 #include <Grid/qcd/action/fermion/MobiusFermion.h>
 #include <Grid/qcd/action/fermion/ZMobiusFermion.h>
+#include <Grid/qcd/action/fermion/SchurDiagTwoKappa.h>
 #include <Grid/qcd/action/fermion/ScaledShamirFermion.h>
 #include <Grid/qcd/action/fermion/MobiusZolotarevFermion.h>
 #include <Grid/qcd/action/fermion/ShamirZolotarevFermion.h>
@@ -88,6 +89,10 @@ typedef WilsonFermion<WilsonImplR> WilsonFermionR;
 typedef WilsonFermion<WilsonImplF> WilsonFermionF;
 typedef WilsonFermion<WilsonImplD> WilsonFermionD;

+typedef WilsonFermion<WilsonImplRL> WilsonFermionRL;
+typedef WilsonFermion<WilsonImplFH> WilsonFermionFH;
+typedef WilsonFermion<WilsonImplDF> WilsonFermionDF;
+
 typedef WilsonFermion<WilsonAdjImplR> WilsonAdjFermionR;
 typedef WilsonFermion<WilsonAdjImplF> WilsonAdjFermionF;
 typedef WilsonFermion<WilsonAdjImplD> WilsonAdjFermionD;
@@ -104,27 +109,50 @@ typedef DomainWallFermion<WilsonImplR> DomainWallFermionR;
 typedef DomainWallFermion<WilsonImplF> DomainWallFermionF;
 typedef DomainWallFermion<WilsonImplD> DomainWallFermionD;

+typedef DomainWallFermion<WilsonImplRL> DomainWallFermionRL;
+typedef DomainWallFermion<WilsonImplFH> DomainWallFermionFH;
+typedef DomainWallFermion<WilsonImplDF> DomainWallFermionDF;
+
 typedef MobiusFermion<WilsonImplR> MobiusFermionR;
 typedef MobiusFermion<WilsonImplF> MobiusFermionF;
 typedef MobiusFermion<WilsonImplD> MobiusFermionD;

+typedef MobiusFermion<WilsonImplRL> MobiusFermionRL;
+typedef MobiusFermion<WilsonImplFH> MobiusFermionFH;
+typedef MobiusFermion<WilsonImplDF> MobiusFermionDF;
+
 typedef ZMobiusFermion<ZWilsonImplR> ZMobiusFermionR;
 typedef ZMobiusFermion<ZWilsonImplF> ZMobiusFermionF;
 typedef ZMobiusFermion<ZWilsonImplD> ZMobiusFermionD;

+typedef ZMobiusFermion<ZWilsonImplRL> ZMobiusFermionRL;
+typedef ZMobiusFermion<ZWilsonImplFH> ZMobiusFermionFH;
+typedef ZMobiusFermion<ZWilsonImplDF> ZMobiusFermionDF;
+
 // Ls vectorised 
 typedef DomainWallFermion<DomainWallVec5dImplR> DomainWallFermionVec5dR;
 typedef DomainWallFermion<DomainWallVec5dImplF> DomainWallFermionVec5dF;
 typedef DomainWallFermion<DomainWallVec5dImplD> DomainWallFermionVec5dD;

+typedef DomainWallFermion<DomainWallVec5dImplRL> DomainWallFermionVec5dRL;
+typedef DomainWallFermion<DomainWallVec5dImplFH> DomainWallFermionVec5dFH;
+typedef DomainWallFermion<DomainWallVec5dImplDF> DomainWallFermionVec5dDF;
+
 typedef MobiusFermion<DomainWallVec5dImplR> MobiusFermionVec5dR;
 typedef MobiusFermion<DomainWallVec5dImplF> MobiusFermionVec5dF;
 typedef MobiusFermion<DomainWallVec5dImplD> MobiusFermionVec5dD;

+typedef MobiusFermion<DomainWallVec5dImplRL> MobiusFermionVec5dRL;
+typedef MobiusFermion<DomainWallVec5dImplFH> MobiusFermionVec5dFH;
+typedef MobiusFermion<DomainWallVec5dImplDF> MobiusFermionVec5dDF;
+
 typedef ZMobiusFermion<ZDomainWallVec5dImplR> ZMobiusFermionVec5dR;
 typedef ZMobiusFermion<ZDomainWallVec5dImplF> ZMobiusFermionVec5dF;
 typedef ZMobiusFermion<ZDomainWallVec5dImplD> ZMobiusFermionVec5dD;

+typedef ZMobiusFermion<ZDomainWallVec5dImplRL> ZMobiusFermionVec5dRL;
+typedef ZMobiusFermion<ZDomainWallVec5dImplFH> ZMobiusFermionVec5dFH;
+typedef ZMobiusFermion<ZDomainWallVec5dImplDF> ZMobiusFermionVec5dDF;

 typedef ScaledShamirFermion<WilsonImplR> ScaledShamirFermionR;
 typedef ScaledShamirFermion<WilsonImplF> ScaledShamirFermionF;
@@ -165,17 +193,35 @@ typedef OverlapWilsonPartialFractionZolotarevFermion<WilsonImplD> OverlapWilsonP
 typedef WilsonFermion<GparityWilsonImplR>     GparityWilsonFermionR;
 typedef WilsonFermion<GparityWilsonImplF>     GparityWilsonFermionF;
 typedef WilsonFermion<GparityWilsonImplD>     GparityWilsonFermionD;
+
+typedef WilsonFermion<GparityWilsonImplRL>     GparityWilsonFermionRL;
+typedef WilsonFermion<GparityWilsonImplFH>     GparityWilsonFermionFH;
+typedef WilsonFermion<GparityWilsonImplDF>     GparityWilsonFermionDF;
+
 typedef DomainWallFermion<GparityWilsonImplR> GparityDomainWallFermionR;
 typedef DomainWallFermion<GparityWilsonImplF> GparityDomainWallFermionF;
 typedef DomainWallFermion<GparityWilsonImplD> GparityDomainWallFermionD;

+typedef DomainWallFermion<GparityWilsonImplRL> GparityDomainWallFermionRL;
+typedef DomainWallFermion<GparityWilsonImplFH> GparityDomainWallFermionFH;
+typedef DomainWallFermion<GparityWilsonImplDF> GparityDomainWallFermionDF;
+
 typedef WilsonTMFermion<GparityWilsonImplR> GparityWilsonTMFermionR;
 typedef WilsonTMFermion<GparityWilsonImplF> GparityWilsonTMFermionF;
 typedef WilsonTMFermion<GparityWilsonImplD> GparityWilsonTMFermionD;
+
+typedef WilsonTMFermion<GparityWilsonImplRL> GparityWilsonTMFermionRL;
+typedef WilsonTMFermion<GparityWilsonImplFH> GparityWilsonTMFermionFH;
+typedef WilsonTMFermion<GparityWilsonImplDF> GparityWilsonTMFermionDF;
+
 typedef MobiusFermion<GparityWilsonImplR> GparityMobiusFermionR;
 typedef MobiusFermion<GparityWilsonImplF> GparityMobiusFermionF;
 typedef MobiusFermion<GparityWilsonImplD> GparityMobiusFermionD;

+typedef MobiusFermion<GparityWilsonImplRL> GparityMobiusFermionRL;
+typedef MobiusFermion<GparityWilsonImplFH> GparityMobiusFermionFH;
+typedef MobiusFermion<GparityWilsonImplDF> GparityMobiusFermionDF;
+
 typedef ImprovedStaggeredFermion<StaggeredImplR> ImprovedStaggeredFermionR;
 typedef ImprovedStaggeredFermion<StaggeredImplF> ImprovedStaggeredFermionF;
 typedef ImprovedStaggeredFermion<StaggeredImplD> ImprovedStaggeredFermionD;
--- a/lib/qcd/action/fermion/FermionCore.h
+++ b/lib/qcd/action/fermion/FermionCore.h
@@ -55,7 +55,14 @@ Author: Peter Boyle <pabobyle@ph.ed.ac.uk>
  template class A<ZWilsonImplF>;		\
  template class A<ZWilsonImplD>;		\
  template class A<GparityWilsonImplF>;		\
-  template class A<GparityWilsonImplD>;		
+  template class A<GparityWilsonImplD>;		\
+  template class A<WilsonImplFH>;		\
+  template class A<WilsonImplDF>;		\
+  template class A<ZWilsonImplFH>;		\
+  template class A<ZWilsonImplDF>;		\
+  template class A<GparityWilsonImplFH>;		\
+  template class A<GparityWilsonImplDF>;		
+

 #define AdjointFermOpTemplateInstantiate(A) \
  template class A<WilsonAdjImplF>; \
@@ -69,7 +76,11 @@ Author: Peter Boyle <pabobyle@ph.ed.ac.uk>
  template class A<DomainWallVec5dImplF>;	\
  template class A<DomainWallVec5dImplD>;	\
  template class A<ZDomainWallVec5dImplF>;	\
-  template class A<ZDomainWallVec5dImplD>;	
+  template class A<ZDomainWallVec5dImplD>;	\
+  template class A<DomainWallVec5dImplFH>;	\
+  template class A<DomainWallVec5dImplDF>;	\
+  template class A<ZDomainWallVec5dImplFH>;	\
+  template class A<ZDomainWallVec5dImplDF>;	

 #define FermOpTemplateInstantiate(A) \
 FermOp4dVecTemplateInstantiate(A) \
--- a/lib/qcd/action/fermion/FermionOperatorImpl.h
+++ b/lib/qcd/action/fermion/FermionOperatorImpl.h
@@ -35,7 +35,6 @@ directory
 namespace Grid {
 namespace QCD {

-
  //////////////////////////////////////////////
  // Template parameter class constructs to package
  // externally control Fermion implementations
@@ -89,7 +88,53 @@ namespace QCD {
  //    
  //  }
  //////////////////////////////////////////////
-  
+
+  template <class T> struct SamePrecisionMapper {
+    typedef T HigherPrecVector ;
+    typedef T LowerPrecVector ;
+  };
+  template <class T> struct LowerPrecisionMapper {  };
+  template <> struct LowerPrecisionMapper<vRealF> {
+    typedef vRealF HigherPrecVector ;
+    typedef vRealH LowerPrecVector ;
+  };
+  template <> struct LowerPrecisionMapper<vRealD> {
+    typedef vRealD HigherPrecVector ;
+    typedef vRealF LowerPrecVector ;
+  };
+  template <> struct LowerPrecisionMapper<vComplexF> {
+    typedef vComplexF HigherPrecVector ;
+    typedef vComplexH LowerPrecVector ;
+  };
+  template <> struct LowerPrecisionMapper<vComplexD> {
+    typedef vComplexD HigherPrecVector ;
+    typedef vComplexF LowerPrecVector ;
+  };
+
+  struct CoeffReal {
+  public:
+    typedef RealD _Coeff_t;
+    static const int Nhcs = 2;
+    template<class Simd> using PrecisionMapper = SamePrecisionMapper<Simd>;
+  };
+  struct CoeffRealHalfComms {
+  public:
+    typedef RealD _Coeff_t;
+    static const int Nhcs = 1;
+    template<class Simd> using PrecisionMapper = LowerPrecisionMapper<Simd>;
+  };
+  struct CoeffComplex {
+  public:
+    typedef ComplexD _Coeff_t;
+    static const int Nhcs = 2;
+    template<class Simd> using PrecisionMapper = SamePrecisionMapper<Simd>;
+  };
+  struct CoeffComplexHalfComms {
+  public:
+    typedef ComplexD _Coeff_t;
+    static const int Nhcs = 1;
+    template<class Simd> using PrecisionMapper = LowerPrecisionMapper<Simd>;
+  };

  ////////////////////////////////////////////////////////////////////////
  // Implementation dependent fermion types
@@ -114,37 +159,40 @@ namespace QCD {
  /////////////////////////////////////////////////////////////////////////////
  // Single flavour four spinors with colour index
  /////////////////////////////////////////////////////////////////////////////
-  template <class S, class Representation = FundamentalRepresentation,class _Coeff_t = RealD >
+  template <class S, class Representation = FundamentalRepresentation,class Options = CoeffReal >
  class WilsonImpl : public PeriodicGaugeImpl<GaugeImplTypes<S, Representation::Dimension > > {
-
    public:

    static const int Dimension = Representation::Dimension;
+    static const bool LsVectorised=false;
+    static const int Nhcs = Options::Nhcs;
+
    typedef PeriodicGaugeImpl<GaugeImplTypes<S, Dimension > > Gimpl;
+    INHERIT_GIMPL_TYPES(Gimpl);
      
    //Necessary?
    constexpr bool is_fundamental() const{return Dimension == Nc ? 1 : 0;}
    
-    const bool LsVectorised=false;
-    typedef _Coeff_t Coeff_t;
-
-    INHERIT_GIMPL_TYPES(Gimpl);
+    typedef typename Options::_Coeff_t Coeff_t;
+    typedef typename Options::template PrecisionMapper<Simd>::LowerPrecVector SimdL;
      
    template <typename vtype> using iImplSpinor            = iScalar<iVector<iVector<vtype, Dimension>, Ns> >;
    template <typename vtype> using iImplPropagator        = iScalar<iMatrix<iMatrix<vtype, Dimension>, Ns> >;
    template <typename vtype> using iImplHalfSpinor        = iScalar<iVector<iVector<vtype, Dimension>, Nhs> >;
+    template <typename vtype> using iImplHalfCommSpinor    = iScalar<iVector<iVector<vtype, Dimension>, Nhcs> >;
    template <typename vtype> using iImplDoubledGaugeField = iVector<iScalar<iMatrix<vtype, Dimension> >, Nds>;
    
    typedef iImplSpinor<Simd>            SiteSpinor;
    typedef iImplPropagator<Simd>        SitePropagator;
    typedef iImplHalfSpinor<Simd>        SiteHalfSpinor;
+    typedef iImplHalfCommSpinor<SimdL>   SiteHalfCommSpinor;
    typedef iImplDoubledGaugeField<Simd> SiteDoubledGaugeField;
    
    typedef Lattice<SiteSpinor>            FermionField;
    typedef Lattice<SitePropagator>        PropagatorField;
    typedef Lattice<SiteDoubledGaugeField> DoubledGaugeField;
    
-    typedef WilsonCompressor<SiteHalfSpinor, SiteSpinor> Compressor;
+    typedef WilsonCompressor<SiteHalfCommSpinor,SiteHalfSpinor, SiteSpinor> Compressor;
    typedef WilsonImplParams ImplParams;
    typedef WilsonStencil<SiteSpinor, SiteHalfSpinor> StencilImpl;
    
@@ -209,31 +257,34 @@ namespace QCD {
  ////////////////////////////////////////////////////////////////////////////////////
  // Single flavour four spinors with colour index, 5d redblack
  ////////////////////////////////////////////////////////////////////////////////////
-
-template<class S,int Nrepresentation=Nc,class _Coeff_t = RealD>
+template<class S,int Nrepresentation=Nc, class Options=CoeffReal>
 class DomainWallVec5dImpl :  public PeriodicGaugeImpl< GaugeImplTypes< S,Nrepresentation> > { 
  public:
-      
-  static const int Dimension = Nrepresentation;
-  const bool LsVectorised=true;
-  typedef _Coeff_t Coeff_t;      
+
  typedef PeriodicGaugeImpl<GaugeImplTypes<S, Nrepresentation> > Gimpl;
-  
  INHERIT_GIMPL_TYPES(Gimpl);
+
+  static const int Dimension = Nrepresentation;
+  static const bool LsVectorised=true;
+  static const int Nhcs = Options::Nhcs;
+      
+  typedef typename Options::_Coeff_t Coeff_t;      
+  typedef typename Options::template PrecisionMapper<Simd>::LowerPrecVector SimdL;
  
  template <typename vtype> using iImplSpinor            = iScalar<iVector<iVector<vtype, Nrepresentation>, Ns> >;
  template <typename vtype> using iImplPropagator        = iScalar<iMatrix<iMatrix<vtype, Nrepresentation>, Ns> >;
  template <typename vtype> using iImplHalfSpinor        = iScalar<iVector<iVector<vtype, Nrepresentation>, Nhs> >;
+  template <typename vtype> using iImplHalfCommSpinor    = iScalar<iVector<iVector<vtype, Nrepresentation>, Nhcs> >;
  template <typename vtype> using iImplDoubledGaugeField = iVector<iScalar<iMatrix<vtype, Nrepresentation> >, Nds>;
  template <typename vtype> using iImplGaugeField        = iVector<iScalar<iMatrix<vtype, Nrepresentation> >, Nd>;
  template <typename vtype> using iImplGaugeLink         = iScalar<iScalar<iMatrix<vtype, Nrepresentation> > >;
  
-  typedef iImplSpinor<Simd> SiteSpinor;
-  typedef iImplPropagator<Simd> SitePropagator;
-  typedef iImplHalfSpinor<Simd> SiteHalfSpinor;
-  typedef Lattice<SiteSpinor> FermionField;
-  typedef Lattice<SitePropagator> PropagatorField;
-  
+  typedef iImplSpinor<Simd>            SiteSpinor;
+  typedef iImplPropagator<Simd>        SitePropagator;
+  typedef iImplHalfSpinor<Simd>        SiteHalfSpinor;
+  typedef iImplHalfCommSpinor<SimdL>   SiteHalfCommSpinor;
+  typedef Lattice<SiteSpinor>          FermionField;
+  typedef Lattice<SitePropagator>      PropagatorField;

  /////////////////////////////////////////////////
  // Make the doubled gauge field a *scalar*
@@ -241,9 +292,9 @@ class DomainWallVec5dImpl :  public PeriodicGaugeImpl< GaugeImplTypes< S,Nrepres
  typedef iImplDoubledGaugeField<typename Simd::scalar_type>  SiteDoubledGaugeField;  // This is a scalar
  typedef iImplGaugeField<typename Simd::scalar_type>         SiteScalarGaugeField;  // scalar
  typedef iImplGaugeLink<typename Simd::scalar_type>          SiteScalarGaugeLink;  // scalar
-  typedef Lattice<SiteDoubledGaugeField> DoubledGaugeField;
+  typedef Lattice<SiteDoubledGaugeField>                      DoubledGaugeField;
      
-  typedef WilsonCompressor<SiteHalfSpinor, SiteSpinor> Compressor;
+  typedef WilsonCompressor<SiteHalfCommSpinor,SiteHalfSpinor, SiteSpinor> Compressor;
  typedef WilsonImplParams ImplParams;
  typedef WilsonStencil<SiteSpinor, SiteHalfSpinor> StencilImpl;
  
@@ -311,35 +362,37 @@ class DomainWallVec5dImpl :  public PeriodicGaugeImpl< GaugeImplTypes< S,Nrepres
    ////////////////////////////////////////////////////////////////////////////////////////
    // Flavour doubled spinors; is Gparity the only? what about C*?
    ////////////////////////////////////////////////////////////////////////////////////////
-    
-template <class S, int Nrepresentation,class _Coeff_t = RealD>
+template <class S, int Nrepresentation, class Options=CoeffReal>
 class GparityWilsonImpl : public ConjugateGaugeImpl<GaugeImplTypes<S, Nrepresentation> > {
 public:

 static const int Dimension = Nrepresentation;
+ static const int Nhcs = Options::Nhcs;
+ static const bool LsVectorised=false;

- const bool LsVectorised=false;
-
- typedef _Coeff_t Coeff_t;
 typedef ConjugateGaugeImpl< GaugeImplTypes<S,Nrepresentation> > Gimpl;
- 
 INHERIT_GIMPL_TYPES(Gimpl);
+
+ typedef typename Options::_Coeff_t Coeff_t;
+ typedef typename Options::template PrecisionMapper<Simd>::LowerPrecVector SimdL;
      
- template <typename vtype> using iImplSpinor            = iVector<iVector<iVector<vtype, Nrepresentation>, Ns>, Ngp>;
- template <typename vtype> using iImplPropagator        = iVector<iMatrix<iMatrix<vtype, Nrepresentation>, Ns>, Ngp >;
- template <typename vtype> using iImplHalfSpinor        = iVector<iVector<iVector<vtype, Nrepresentation>, Nhs>, Ngp>;
+ template <typename vtype> using iImplSpinor            = iVector<iVector<iVector<vtype, Nrepresentation>, Ns>,   Ngp>;
+ template <typename vtype> using iImplPropagator        = iVector<iMatrix<iMatrix<vtype, Nrepresentation>, Ns>,   Ngp>;
+ template <typename vtype> using iImplHalfSpinor        = iVector<iVector<iVector<vtype, Nrepresentation>, Nhs>,  Ngp>;
+ template <typename vtype> using iImplHalfCommSpinor    = iVector<iVector<iVector<vtype, Nrepresentation>, Nhcs>, Ngp>;
 template <typename vtype> using iImplDoubledGaugeField = iVector<iVector<iScalar<iMatrix<vtype, Nrepresentation> >, Nds>, Ngp>;
-      
- typedef iImplSpinor<Simd> SiteSpinor;
- typedef iImplPropagator<Simd> SitePropagator;
- typedef iImplHalfSpinor<Simd> SiteHalfSpinor;
+
+ typedef iImplSpinor<Simd>            SiteSpinor;
+ typedef iImplPropagator<Simd>        SitePropagator;
+ typedef iImplHalfSpinor<Simd>        SiteHalfSpinor;
+ typedef iImplHalfCommSpinor<SimdL>   SiteHalfCommSpinor;
 typedef iImplDoubledGaugeField<Simd> SiteDoubledGaugeField;

 typedef Lattice<SiteSpinor> FermionField;
 typedef Lattice<SitePropagator> PropagatorField;
 typedef Lattice<SiteDoubledGaugeField> DoubledGaugeField;
 
- typedef WilsonCompressor<SiteHalfSpinor, SiteSpinor> Compressor;
+ typedef WilsonCompressor<SiteHalfCommSpinor,SiteHalfSpinor, SiteSpinor> Compressor;
 typedef WilsonStencil<SiteSpinor, SiteHalfSpinor> StencilImpl;
 
 typedef GparityWilsonImplParams ImplParams;
@@ -356,8 +409,8 @@ class GparityWilsonImpl : public ConjugateGaugeImpl<GaugeImplTypes<S, Nrepresent
 		      const SiteHalfSpinor &chi, int mu, StencilEntry *SE,
 		      StencilImpl &St) {

-  typedef SiteHalfSpinor vobj;
-  typedef typename SiteHalfSpinor::scalar_object sobj;
+   typedef SiteHalfSpinor vobj;
+   typedef typename SiteHalfSpinor::scalar_object sobj;
 	
   vobj vtmp;
   sobj stmp;
@@ -475,7 +528,6 @@ class GparityWilsonImpl : public ConjugateGaugeImpl<GaugeImplTypes<S, Nrepresent
   }
 }
      
-      
 inline void InsertForce4D(GaugeField &mat, FermionField &Btilde, FermionField &A, int mu) {

   // DhopDir provides U or Uconj depending on coor/flavour.
@@ -508,23 +560,22 @@ class GparityWilsonImpl : public ConjugateGaugeImpl<GaugeImplTypes<S, Nrepresent

 };

-
-  /////////////////////////////////////////////////////////////////////////////
-  // Single flavour one component spinors with colour index
-  /////////////////////////////////////////////////////////////////////////////
-  template <class S, class Representation = FundamentalRepresentation >
-  class StaggeredImpl : public PeriodicGaugeImpl<GaugeImplTypes<S, Representation::Dimension > > {
+/////////////////////////////////////////////////////////////////////////////
+// Single flavour one component spinors with colour index
+/////////////////////////////////////////////////////////////////////////////
+template <class S, class Representation = FundamentalRepresentation >
+class StaggeredImpl : public PeriodicGaugeImpl<GaugeImplTypes<S, Representation::Dimension > > {

    public:

    typedef RealD  _Coeff_t ;
    static const int Dimension = Representation::Dimension;
+    static const bool LsVectorised=false;
    typedef PeriodicGaugeImpl<GaugeImplTypes<S, Dimension > > Gimpl;
      
    //Necessary?
    constexpr bool is_fundamental() const{return Dimension == Nc ? 1 : 0;}
    
-    const bool LsVectorised=false;
    typedef _Coeff_t Coeff_t;

    INHERIT_GIMPL_TYPES(Gimpl);
@@ -641,8 +692,6 @@ class GparityWilsonImpl : public ConjugateGaugeImpl<GaugeImplTypes<S, Nrepresent
    }
  };

-
-
  /////////////////////////////////////////////////////////////////////////////
  // Single flavour one component spinors with colour index. 5d vec
  /////////////////////////////////////////////////////////////////////////////
@@ -651,16 +700,14 @@ class GparityWilsonImpl : public ConjugateGaugeImpl<GaugeImplTypes<S, Nrepresent

    public:

-    typedef RealD  _Coeff_t ;
    static const int Dimension = Representation::Dimension;
+    static const bool LsVectorised=true;
+    typedef RealD   Coeff_t ;
    typedef PeriodicGaugeImpl<GaugeImplTypes<S, Dimension > > Gimpl;
      
    //Necessary?
    constexpr bool is_fundamental() const{return Dimension == Nc ? 1 : 0;}
-    
-    const bool LsVectorised=true;

-    typedef _Coeff_t Coeff_t;

    INHERIT_GIMPL_TYPES(Gimpl);

@@ -823,43 +870,61 @@ class GparityWilsonImpl : public ConjugateGaugeImpl<GaugeImplTypes<S, Nrepresent
    }
  };

+typedef WilsonImpl<vComplex,  FundamentalRepresentation, CoeffReal > WilsonImplR;  // Real.. whichever prec
+typedef WilsonImpl<vComplexF, FundamentalRepresentation, CoeffReal > WilsonImplF;  // Float
+typedef WilsonImpl<vComplexD, FundamentalRepresentation, CoeffReal > WilsonImplD;  // Double

+typedef WilsonImpl<vComplex,  FundamentalRepresentation, CoeffRealHalfComms > WilsonImplRL;  // Real.. whichever prec, lower-precision comms
+typedef WilsonImpl<vComplexF, FundamentalRepresentation, CoeffRealHalfComms > WilsonImplFH;  // Float, half-precision comms
+typedef WilsonImpl<vComplexD, FundamentalRepresentation, CoeffRealHalfComms > WilsonImplDF;  // Double, float comms

- typedef WilsonImpl<vComplex,  FundamentalRepresentation > WilsonImplR;   // Real.. whichever prec
- typedef WilsonImpl<vComplexF, FundamentalRepresentation > WilsonImplF;  // Float
- typedef WilsonImpl<vComplexD, FundamentalRepresentation > WilsonImplD;  // Double
+typedef WilsonImpl<vComplex,  FundamentalRepresentation, CoeffComplex > ZWilsonImplR; // Real.. whichever prec
+typedef WilsonImpl<vComplexF, FundamentalRepresentation, CoeffComplex > ZWilsonImplF; // Float
+typedef WilsonImpl<vComplexD, FundamentalRepresentation, CoeffComplex > ZWilsonImplD; // Double

- typedef WilsonImpl<vComplex,  FundamentalRepresentation, ComplexD > ZWilsonImplR; // Real.. whichever prec
- typedef WilsonImpl<vComplexF, FundamentalRepresentation, ComplexD > ZWilsonImplF; // Float
- typedef WilsonImpl<vComplexD, FundamentalRepresentation, ComplexD > ZWilsonImplD; // Double
+typedef WilsonImpl<vComplex,  FundamentalRepresentation, CoeffComplexHalfComms > ZWilsonImplRL; // Real.. whichever prec, lower-precision comms
+typedef WilsonImpl<vComplexF, FundamentalRepresentation, CoeffComplexHalfComms > ZWilsonImplFH; // Float, half-precision comms
+typedef WilsonImpl<vComplexD, FundamentalRepresentation, CoeffComplexHalfComms > ZWilsonImplDF; // Double, float comms
 
- typedef WilsonImpl<vComplex,  AdjointRepresentation > WilsonAdjImplR;   // Real.. whichever prec
- typedef WilsonImpl<vComplexF, AdjointRepresentation > WilsonAdjImplF;  // Float
- typedef WilsonImpl<vComplexD, AdjointRepresentation > WilsonAdjImplD;  // Double
+typedef WilsonImpl<vComplex,  AdjointRepresentation, CoeffReal > WilsonAdjImplR;   // Real.. whichever prec
+typedef WilsonImpl<vComplexF, AdjointRepresentation, CoeffReal > WilsonAdjImplF;  // Float
+typedef WilsonImpl<vComplexD, AdjointRepresentation, CoeffReal > WilsonAdjImplD;  // Double
 
- typedef WilsonImpl<vComplex,  TwoIndexSymmetricRepresentation > WilsonTwoIndexSymmetricImplR;   // Real.. whichever prec
- typedef WilsonImpl<vComplexF, TwoIndexSymmetricRepresentation > WilsonTwoIndexSymmetricImplF;  // Float
- typedef WilsonImpl<vComplexD, TwoIndexSymmetricRepresentation > WilsonTwoIndexSymmetricImplD;  // Double
+typedef WilsonImpl<vComplex,  TwoIndexSymmetricRepresentation, CoeffReal > WilsonTwoIndexSymmetricImplR;   // Real.. whichever prec
+typedef WilsonImpl<vComplexF, TwoIndexSymmetricRepresentation, CoeffReal > WilsonTwoIndexSymmetricImplF;  // Float
+typedef WilsonImpl<vComplexD, TwoIndexSymmetricRepresentation, CoeffReal > WilsonTwoIndexSymmetricImplD;  // Double
 
- typedef DomainWallVec5dImpl<vComplex ,Nc> DomainWallVec5dImplR; // Real.. whichever prec
- typedef DomainWallVec5dImpl<vComplexF,Nc> DomainWallVec5dImplF; // Float
- typedef DomainWallVec5dImpl<vComplexD,Nc> DomainWallVec5dImplD; // Double
+typedef DomainWallVec5dImpl<vComplex ,Nc, CoeffReal> DomainWallVec5dImplR; // Real.. whichever prec
+typedef DomainWallVec5dImpl<vComplexF,Nc, CoeffReal> DomainWallVec5dImplF; // Float
+typedef DomainWallVec5dImpl<vComplexD,Nc, CoeffReal> DomainWallVec5dImplD; // Double
 
- typedef DomainWallVec5dImpl<vComplex ,Nc,ComplexD> ZDomainWallVec5dImplR; // Real.. whichever prec
- typedef DomainWallVec5dImpl<vComplexF,Nc,ComplexD> ZDomainWallVec5dImplF; // Float
- typedef DomainWallVec5dImpl<vComplexD,Nc,ComplexD> ZDomainWallVec5dImplD; // Double
+typedef DomainWallVec5dImpl<vComplex ,Nc, CoeffRealHalfComms> DomainWallVec5dImplRL; // Real.. whichever prec, lower-precision comms
+typedef DomainWallVec5dImpl<vComplexF,Nc, CoeffRealHalfComms> DomainWallVec5dImplFH; // Float, half-precision comms
+typedef DomainWallVec5dImpl<vComplexD,Nc, CoeffRealHalfComms> DomainWallVec5dImplDF; // Double, float comms
 
- typedef GparityWilsonImpl<vComplex , Nc> GparityWilsonImplR;  // Real.. whichever prec
- typedef GparityWilsonImpl<vComplexF, Nc> GparityWilsonImplF;  // Float
- typedef GparityWilsonImpl<vComplexD, Nc> GparityWilsonImplD;  // Double
+typedef DomainWallVec5dImpl<vComplex ,Nc,CoeffComplex> ZDomainWallVec5dImplR; // Real.. whichever prec
+typedef DomainWallVec5dImpl<vComplexF,Nc,CoeffComplex> ZDomainWallVec5dImplF; // Float
+typedef DomainWallVec5dImpl<vComplexD,Nc,CoeffComplex> ZDomainWallVec5dImplD; // Double
+ 
+typedef DomainWallVec5dImpl<vComplex ,Nc,CoeffComplexHalfComms> ZDomainWallVec5dImplRL; // Real.. whichever prec
+typedef DomainWallVec5dImpl<vComplexF,Nc,CoeffComplexHalfComms> ZDomainWallVec5dImplFH; // Float
+typedef DomainWallVec5dImpl<vComplexD,Nc,CoeffComplexHalfComms> ZDomainWallVec5dImplDF; // Double
+ 
+typedef GparityWilsonImpl<vComplex , Nc,CoeffReal> GparityWilsonImplR;  // Real.. whichever prec
+typedef GparityWilsonImpl<vComplexF, Nc,CoeffReal> GparityWilsonImplF;  // Float
+typedef GparityWilsonImpl<vComplexD, Nc,CoeffReal> GparityWilsonImplD;  // Double
+ 
+typedef GparityWilsonImpl<vComplex , Nc,CoeffRealHalfComms> GparityWilsonImplRL;  // Real.. whichever prec
+typedef GparityWilsonImpl<vComplexF, Nc,CoeffRealHalfComms> GparityWilsonImplFH;  // Float
+typedef GparityWilsonImpl<vComplexD, Nc,CoeffRealHalfComms> GparityWilsonImplDF;  // Double

- typedef StaggeredImpl<vComplex,  FundamentalRepresentation > StaggeredImplR;   // Real.. whichever prec
- typedef StaggeredImpl<vComplexF, FundamentalRepresentation > StaggeredImplF;  // Float
- typedef StaggeredImpl<vComplexD, FundamentalRepresentation > StaggeredImplD;  // Double
+typedef StaggeredImpl<vComplex,  FundamentalRepresentation > StaggeredImplR;   // Real.. whichever prec
+typedef StaggeredImpl<vComplexF, FundamentalRepresentation > StaggeredImplF;  // Float
+typedef StaggeredImpl<vComplexD, FundamentalRepresentation > StaggeredImplD;  // Double

- typedef StaggeredVec5dImpl<vComplex,  FundamentalRepresentation > StaggeredVec5dImplR;   // Real.. whichever prec
- typedef StaggeredVec5dImpl<vComplexF, FundamentalRepresentation > StaggeredVec5dImplF;  // Float
- typedef StaggeredVec5dImpl<vComplexD, FundamentalRepresentation > StaggeredVec5dImplD;  // Double
+typedef StaggeredVec5dImpl<vComplex,  FundamentalRepresentation > StaggeredVec5dImplR;   // Real.. whichever prec
+typedef StaggeredVec5dImpl<vComplexF, FundamentalRepresentation > StaggeredVec5dImplF;  // Float
+typedef StaggeredVec5dImpl<vComplexD, FundamentalRepresentation > StaggeredVec5dImplD;  // Double

 }}

--- a/lib/qcd/action/fermion/ImprovedStaggeredFermion.cc
+++ b/lib/qcd/action/fermion/ImprovedStaggeredFermion.cc
@@ -160,8 +160,6 @@ void ImprovedStaggeredFermion<Impl>::ImportGauge(const GaugeField &_Uthin,const
    PokeIndex<LorentzIndex>(UUUmu, U*(-0.5*c2/u0/u0/u0), mu+4);
  }

-  std::cout << " Umu " << Umu._odata[0]<<std::endl;
-  std::cout << " UUUmu " << UUUmu._odata[0]<<std::endl;
  pickCheckerboard(Even, UmuEven, Umu);
  pickCheckerboard(Odd,  UmuOdd , Umu);
  pickCheckerboard(Even, UUUmuEven, UUUmu);
--- a/lib/qcd/action/fermion/SchurDiagTwoKappa.h
+++ b/lib/qcd/action/fermion/SchurDiagTwoKappa.h
@@ -0,0 +1,102 @@
+    /*************************************************************************************
+
+    Grid physics library, www.github.com/paboyle/Grid 
+
+    Source file: SchurDiagTwoKappa.h
+
+    Copyright (C) 2017
+
+Author: Christoph Lehner
+Author: Peter Boyle <paboyle@ph.ed.ac.uk>
+
+    This program is free software; you can redistribute it and/or modify
+    it under the terms of the GNU General Public License as published by
+    the Free Software Foundation; either version 2 of the License, or
+    (at your option) any later version.
+
+    This program is distributed in the hope that it will be useful,
+    but WITHOUT ANY WARRANTY; without even the implied warranty of
+    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+    GNU General Public License for more details.
+
+    You should have received a copy of the GNU General Public License along
+    with this program; if not, write to the Free Software Foundation, Inc.,
+    51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
+
+    See the full license in the file "LICENSE" in the top level distribution directory
+    *************************************************************************************/
+    /*  END LEGAL */
+#ifndef  _SCHUR_DIAG_TWO_KAPPA_H
+#define  _SCHUR_DIAG_TWO_KAPPA_H
+
+namespace Grid {
+
+  // This is specific to (Z)mobius fermions
+  template<class Matrix, class Field>
+    class KappaSimilarityTransform {
+  public:
+    INHERIT_IMPL_TYPES(Matrix);
+    std::vector<Coeff_t> kappa, kappaDag, kappaInv, kappaInvDag;
+
+    KappaSimilarityTransform (Matrix &zmob) {
+      for (int i=0;i<(int)zmob.bs.size();i++) {
+	Coeff_t k = 1.0 / ( 2.0 * (zmob.bs[i] *(4 - zmob.M5) + 1.0) );
+	kappa.push_back( k );
+	kappaDag.push_back( conj(k) );
+	kappaInv.push_back( 1.0 / k );
+	kappaInvDag.push_back( 1.0 / conj(k) );
+      }
+    }
+
+  template<typename vobj>
+    void sscale(const Lattice<vobj>& in, Lattice<vobj>& out, Coeff_t* s) {
+    GridBase *grid=out._grid;
+    out.checkerboard = in.checkerboard;
+    assert(grid->_simd_layout[0] == 1); // should be fine for ZMobius for now
+    int Ls = grid->_rdimensions[0];
+    parallel_for(int ss=0;ss<grid->oSites();ss++){
+      vobj tmp = s[ss % Ls]*in._odata[ss];
+      vstream(out._odata[ss],tmp);
+    }
+  }
+
+  RealD sscale_norm(const Field& in, Field& out, Coeff_t* s) {
+    sscale(in,out,s);
+    return norm2(out);
+  }
+
+  virtual RealD M       (const Field& in, Field& out) { return sscale_norm(in,out,&kappa[0]);   }
+  virtual RealD MDag    (const Field& in, Field& out) { return sscale_norm(in,out,&kappaDag[0]);}
+  virtual RealD MInv    (const Field& in, Field& out) { return sscale_norm(in,out,&kappaInv[0]);}
+  virtual RealD MInvDag (const Field& in, Field& out) { return sscale_norm(in,out,&kappaInvDag[0]);}
+
+  };
+
+  template<class Matrix,class Field>
+    class SchurDiagTwoKappaOperator :  public SchurOperatorBase<Field> {
+  public:
+    KappaSimilarityTransform<Matrix, Field> _S;
+    SchurDiagTwoOperator<Matrix, Field> _Mat;
+
+    SchurDiagTwoKappaOperator (Matrix &Mat): _S(Mat), _Mat(Mat) {};
+
+    virtual  RealD Mpc      (const Field &in, Field &out) {
+      Field tmp(in._grid);
+
+      _S.MInv(in,out);
+      _Mat.Mpc(out,tmp);
+      return _S.M(tmp,out);
+
+    }
+    virtual  RealD MpcDag   (const Field &in, Field &out){
+      Field tmp(in._grid);
+
+      _S.MDag(in,out);
+      _Mat.MpcDag(out,tmp);
+      return _S.MInvDag(tmp,out);
+    }
+  };
+
+}
+
+#endif
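Review note on SchurDiagTwoKappa.h: the new operator wraps the Schur operator as S * Mpc * S^-1, where S multiplies each fifth-dimension slice by a per-slice kappa. The following is a minimal standalone sketch of that scaling, assuming plain `std::complex<double>` sites in place of Grid lattice objects. The names `scale_by_slice`, `make_kappa` and `inverse_of` are hypothetical and are not part of Grid.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

using Coeff = std::complex<double>;

// Hypothetical stand-in for sscale(): site ss is multiplied by s[ss % Ls],
// where Ls (the fifth-dimension extent) is taken to be s.size().
inline std::vector<Coeff> scale_by_slice(const std::vector<Coeff>& in,
                                         const std::vector<Coeff>& s) {
  assert(!s.empty());
  const std::size_t Ls = s.size();
  std::vector<Coeff> out(in.size());
  for (std::size_t ss = 0; ss < in.size(); ++ss) out[ss] = s[ss % Ls] * in[ss];
  return out;
}

// kappa_i = 1 / (2 (b_i (4 - M5) + 1)), the same formula as the constructor.
inline std::vector<Coeff> make_kappa(const std::vector<Coeff>& bs, double M5) {
  std::vector<Coeff> kappa;
  for (const Coeff& b : bs)
    kappa.push_back(1.0 / (2.0 * (b * (4.0 - M5) + 1.0)));
  return kappa;
}

// Element-wise inverse, playing the role of kappaInv.
inline std::vector<Coeff> inverse_of(const std::vector<Coeff>& k) {
  std::vector<Coeff> inv;
  for (const Coeff& x : k) inv.push_back(1.0 / x);
  return inv;
}
```

Applying `scale_by_slice` with `inverse_of(kappa)` and then with `kappa` should return the input up to rounding, which is the property that makes the wrapped operator a similarity transform of the original.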
--- a/lib/qcd/action/fermion/WilsonCompressor.h
+++ b/lib/qcd/action/fermion/WilsonCompressor.h
@@ -33,228 +33,321 @@ Author: paboyle <paboyle@ph.ed.ac.uk>
 namespace Grid {
 namespace QCD {

-  template<class SiteHalfSpinor,class SiteSpinor>
-  class WilsonCompressor {
-  public:
-    int mu;
-    int dag;
+/////////////////////////////////////////////////////////////////////////////////////////////
+// optimised versions supporting half precision too
+/////////////////////////////////////////////////////////////////////////////////////////////

-    WilsonCompressor(int _dag){
-      mu=0;
-      dag=_dag;
-      assert((dag==0)||(dag==1));
-    }
-    void Point(int p) { 
-      mu=p;
-    };
+template<class _HCspinor,class _Hspinor,class _Spinor, class projector,typename SFINAE = void >
+class WilsonCompressorTemplate;

-    inline SiteHalfSpinor operator () (const SiteSpinor &in) {
-      SiteHalfSpinor ret;
-      int mudag=mu;
-      if (!dag) {
-	mudag=(mu+Nd)%(2*Nd);
-      }
-      switch(mudag) {
-      case Xp:
-	spProjXp(ret,in);
-	break;
-      case Yp:
-	spProjYp(ret,in);
-	break;
-      case Zp:
-	spProjZp(ret,in);
-	break;
-      case Tp:
-	spProjTp(ret,in);
-	break;
-      case Xm:
-	spProjXm(ret,in);
-	break;
-      case Ym:
-	spProjYm(ret,in);
-	break;
-      case Zm:
-	spProjZm(ret,in);
-	break;
-      case Tm:
-	spProjTm(ret,in);
-	break;
-      default: 
-	assert(0);
-	break;
-      }
-      return ret;
-    }
-  };

-  /////////////////////////
-  // optimised versions
-  /////////////////////////
+template<class _HCspinor,class _Hspinor,class _Spinor, class projector>
+class WilsonCompressorTemplate< _HCspinor, _Hspinor, _Spinor, projector,
+  typename std::enable_if<std::is_same<_HCspinor,_Hspinor>::value>::type >
+{
+ public:
+  
+  int mu,dag;  

-  template<class SiteHalfSpinor,class SiteSpinor>
-  class WilsonXpCompressor {
-  public:
-    inline SiteHalfSpinor operator () (const SiteSpinor &in) {
-      SiteHalfSpinor ret;
-      spProjXp(ret,in);
-      return ret;
-    }
-  };
-  template<class SiteHalfSpinor,class SiteSpinor>
-  class WilsonYpCompressor {
-  public:
-    inline SiteHalfSpinor operator () (const SiteSpinor &in) {
-      SiteHalfSpinor ret;
-      spProjYp(ret,in);
-      return ret;
-    }
-  };
-  template<class SiteHalfSpinor,class SiteSpinor>
-  class WilsonZpCompressor {
-  public:
-    inline SiteHalfSpinor operator () (const SiteSpinor &in) {
-      SiteHalfSpinor ret;
-      spProjZp(ret,in);
-      return ret;
-    }
-  };
-  template<class SiteHalfSpinor,class SiteSpinor>
-  class WilsonTpCompressor {
-  public:
-    inline SiteHalfSpinor operator () (const SiteSpinor &in) {
-      SiteHalfSpinor ret;
-      spProjTp(ret,in);
-      return ret;
-    }
-  };
+  void Point(int p) { mu=p; };

-  template<class SiteHalfSpinor,class SiteSpinor>
-  class WilsonXmCompressor {
-  public:
-    inline SiteHalfSpinor operator () (const SiteSpinor &in) {
-      SiteHalfSpinor ret;
-      spProjXm(ret,in);
-      return ret;
-    }
-  };
-  template<class SiteHalfSpinor,class SiteSpinor>
-  class WilsonYmCompressor {
-  public:
-    inline SiteHalfSpinor operator () (const SiteSpinor &in) {
-      SiteHalfSpinor ret;
-      spProjYm(ret,in);
-      return ret;
-    }
-  };
-  template<class SiteHalfSpinor,class SiteSpinor>
-  class WilsonZmCompressor {
-  public:
-    inline SiteHalfSpinor operator () (const SiteSpinor &in) {
-      SiteHalfSpinor ret;
-      spProjZm(ret,in);
-      return ret;
-    }
-  };
-  template<class SiteHalfSpinor,class SiteSpinor>
-  class WilsonTmCompressor {
-  public:
-    inline SiteHalfSpinor operator () (const SiteSpinor &in) {
-      SiteHalfSpinor ret;
-      spProjTm(ret,in);
-      return ret;
-    }
-  };
+  WilsonCompressorTemplate(int _dag=0){
+    dag = _dag;
+  }

-    // Fast comms buffer manipulation which should inline right through (avoid direction
-    // dependent logic that prevents inlining
-  template<class vobj,class cobj>
-  class WilsonStencil : public CartesianStencil<vobj,cobj> {
-  public:
+  typedef _Spinor         SiteSpinor;
+  typedef _Hspinor     SiteHalfSpinor;
+  typedef _HCspinor SiteHalfCommSpinor;
+  typedef typename SiteHalfCommSpinor::vector_type vComplexLow;
+  typedef typename SiteHalfSpinor::vector_type     vComplexHigh;
+  constexpr static int Nw=sizeof(SiteHalfSpinor)/sizeof(vComplexHigh);

-    typedef CartesianCommunicator::CommsRequest_t CommsRequest_t;
+  inline int CommDatumSize(void) {
+    return sizeof(SiteHalfCommSpinor);
+  }

-    WilsonStencil(GridBase *grid,
+  /*****************************************************/
+  /* Compress includes precision change if mpi data is not same */
+  /*****************************************************/
+  inline void Compress(SiteHalfSpinor *buf,Integer o,const SiteSpinor &in) {
+    projector::Proj(buf[o],in,mu,dag);
+  }
+
+  /*****************************************************/
+  /* Exchange includes precision change if mpi data is not same */
+  /*****************************************************/
+  inline void Exchange(SiteHalfSpinor *mp,
+                       SiteHalfSpinor *vp0,
+                       SiteHalfSpinor *vp1,
+		       Integer type,Integer o){
+    exchange(mp[2*o],mp[2*o+1],vp0[o],vp1[o],type);
+  }
+
+  /*****************************************************/
+  /* Have a decompression step if mpi data is not same */
+  /*****************************************************/
+  inline void Decompress(SiteHalfSpinor *out,
+			 SiteHalfSpinor *in, Integer o) {    
+    assert(0);
+  }
+
+  /*****************************************************/
+  /* Compress Exchange                                 */
+  /*****************************************************/
+  inline void CompressExchange(SiteHalfSpinor *out0,
+			       SiteHalfSpinor *out1,
+			       const SiteSpinor *in,
+			       Integer j,Integer k, Integer m,Integer type){
+    SiteHalfSpinor temp1, temp2,temp3,temp4;
+    projector::Proj(temp1,in[k],mu,dag);
+    projector::Proj(temp2,in[m],mu,dag);
+    exchange(out0[j],out1[j],temp1,temp2,type);
+  }
+
+  /*****************************************************/
+  /* Pass the info to the stencil */
+  /*****************************************************/
+  inline bool DecompressionStep(void) { return false; }
+
+};
+
+template<class _HCspinor,class _Hspinor,class _Spinor, class projector>
+class WilsonCompressorTemplate< _HCspinor, _Hspinor, _Spinor, projector,
+  typename std::enable_if<!std::is_same<_HCspinor,_Hspinor>::value>::type >
+{
+ public:
+  
+  int mu,dag;  
+
+  void Point(int p) { mu=p; };
+
+  WilsonCompressorTemplate(int _dag=0){
+    dag = _dag;
+  }
+
+  typedef _Spinor         SiteSpinor;
+  typedef _Hspinor     SiteHalfSpinor;
+  typedef _HCspinor SiteHalfCommSpinor;
+  typedef typename SiteHalfCommSpinor::vector_type vComplexLow;
+  typedef typename SiteHalfSpinor::vector_type     vComplexHigh;
+  constexpr static int Nw=sizeof(SiteHalfSpinor)/sizeof(vComplexHigh);
+
+  inline int CommDatumSize(void) {
+    return sizeof(SiteHalfCommSpinor);
+  }
+
+  /*****************************************************/
+  /* Compress includes precision change if mpi data is not same */
+  /*****************************************************/
+  inline void Compress(SiteHalfSpinor *buf,Integer o,const SiteSpinor &in) {
+    SiteHalfSpinor hsp;
+    SiteHalfCommSpinor *hbuf = (SiteHalfCommSpinor *)buf;
+    projector::Proj(hsp,in,mu,dag);
+    precisionChange((vComplexLow *)&hbuf[o],(vComplexHigh *)&hsp,Nw);
+  }
+
+  /*****************************************************/
+  /* Exchange includes precision change if mpi data is not same */
+  /*****************************************************/
+  inline void Exchange(SiteHalfSpinor *mp,
+                       SiteHalfSpinor *vp0,
+                       SiteHalfSpinor *vp1,
+		       Integer type,Integer o){
+    SiteHalfSpinor vt0,vt1;
+    SiteHalfCommSpinor *vpp0 = (SiteHalfCommSpinor *)vp0;
+    SiteHalfCommSpinor *vpp1 = (SiteHalfCommSpinor *)vp1;
+    precisionChange((vComplexHigh *)&vt0,(vComplexLow *)&vpp0[o],Nw);
+    precisionChange((vComplexHigh *)&vt1,(vComplexLow *)&vpp1[o],Nw);
+    exchange(mp[2*o],mp[2*o+1],vt0,vt1,type);
+  }
+
+  /*****************************************************/
+  /* Have a decompression step if mpi data is not same */
+  /*****************************************************/
+  inline void Decompress(SiteHalfSpinor *out,
+			 SiteHalfSpinor *in, Integer o){
+    SiteHalfCommSpinor *hin=(SiteHalfCommSpinor *)in;
+    precisionChange((vComplexHigh *)&out[o],(vComplexLow *)&hin[o],Nw);
+  }
+
+  /*****************************************************/
+  /* Compress Exchange                                 */
+  /*****************************************************/
+  inline void CompressExchange(SiteHalfSpinor *out0,
+			       SiteHalfSpinor *out1,
+			       const SiteSpinor *in,
+			       Integer j,Integer k, Integer m,Integer type){
+    SiteHalfSpinor temp1, temp2,temp3,temp4;
+    SiteHalfCommSpinor *hout0 = (SiteHalfCommSpinor *)out0;
+    SiteHalfCommSpinor *hout1 = (SiteHalfCommSpinor *)out1;
+    projector::Proj(temp1,in[k],mu,dag);
+    projector::Proj(temp2,in[m],mu,dag);
+    exchange(temp3,temp4,temp1,temp2,type);
+    precisionChange((vComplexLow *)&hout0[j],(vComplexHigh *)&temp3,Nw);
+    precisionChange((vComplexLow *)&hout1[j],(vComplexHigh *)&temp4,Nw);
+  }
+
+  /*****************************************************/
+  /* Pass the info to the stencil */
+  /*****************************************************/
+  inline bool DecompressionStep(void) { return true; }
+
+};
+
+#define DECLARE_PROJ(Projector,Compressor,spProj)			\
+  class Projector {							\
+  public:								\
+    template<class hsp,class fsp>					\
+    static void Proj(hsp &result,const fsp &in,int mu,int dag){			\
+      spProj(result,in);						\
+    }									\
+  };									\
+template<typename HCS,typename HS,typename S> using Compressor = WilsonCompressorTemplate<HCS,HS,S,Projector>;
+
+DECLARE_PROJ(WilsonXpProjector,WilsonXpCompressor,spProjXp);
+DECLARE_PROJ(WilsonYpProjector,WilsonYpCompressor,spProjYp);
+DECLARE_PROJ(WilsonZpProjector,WilsonZpCompressor,spProjZp);
+DECLARE_PROJ(WilsonTpProjector,WilsonTpCompressor,spProjTp);
+DECLARE_PROJ(WilsonXmProjector,WilsonXmCompressor,spProjXm);
+DECLARE_PROJ(WilsonYmProjector,WilsonYmCompressor,spProjYm);
+DECLARE_PROJ(WilsonZmProjector,WilsonZmCompressor,spProjZm);
+DECLARE_PROJ(WilsonTmProjector,WilsonTmCompressor,spProjTm);
+
+class WilsonProjector {
+ public:
+  template<class hsp,class fsp>
+  static void Proj(hsp &result,const fsp &in,int mu,int dag){
+    int mudag=dag? mu : (mu+Nd)%(2*Nd);
+    switch(mudag) {
+    case Xp:	spProjXp(result,in);	break;
+    case Yp:	spProjYp(result,in);	break;
+    case Zp:	spProjZp(result,in);	break;
+    case Tp:	spProjTp(result,in);	break;
+    case Xm:	spProjXm(result,in);	break;
+    case Ym:	spProjYm(result,in);	break;
+    case Zm:	spProjZm(result,in);	break;
+    case Tm:	spProjTm(result,in);	break;
+    default: 	assert(0);	        break;
+    }
+  }
+};
+template<typename HCS,typename HS,typename S> using WilsonCompressor = WilsonCompressorTemplate<HCS,HS,S,WilsonProjector>;
+
+// Fast comms buffer manipulation which should inline right through (avoid direction
+// dependent logic that prevents inlining)
+template<class vobj,class cobj>
+class WilsonStencil : public CartesianStencil<vobj,cobj> {
+public:
+
+  typedef CartesianCommunicator::CommsRequest_t CommsRequest_t;
+
+  std::vector<int> same_node;
+  std::vector<int> surface_list;
+
+  WilsonStencil(GridBase *grid,
 		int npoints,
 		int checkerboard,
 		const std::vector<int> &directions,
-		const std::vector<int> &distances)  : CartesianStencil<vobj,cobj> (grid,npoints,checkerboard,directions,distances) 
-      {    };
-
-    template < class compressor>
-    void HaloExchangeOpt(const Lattice<vobj> &source,compressor &compress) 
-    {
-      std::vector<std::vector<CommsRequest_t> > reqs;
-      HaloExchangeOptGather(source,compress);
-      this->CommunicateBegin(reqs);
-      this->calls++;
-      this->CommunicateComplete(reqs);
-      this->CommsMerge();
-    }
-
-    template < class compressor>
-    void HaloExchangeOptGather(const Lattice<vobj> &source,compressor &compress) 
-    {
-      this->calls++;
-      this->Mergers.resize(0); 
-      this->Packets.resize(0);
-      this->HaloGatherOpt(source,compress);
-    }
-
-
-    template < class compressor>
-    void HaloGatherOpt(const Lattice<vobj> &source,compressor &compress)
-    {
-      this->_grid->StencilBarrier();
-      // conformable(source._grid,_grid);
-      assert(source._grid==this->_grid);
-      this->halogtime-=usecond();
-      
-      this->u_comm_offset=0;
-      
-      int dag = compress.dag;
-      
-      WilsonXpCompressor<cobj,vobj> XpCompress; 
-      WilsonYpCompressor<cobj,vobj> YpCompress; 
-      WilsonZpCompressor<cobj,vobj> ZpCompress; 
-      WilsonTpCompressor<cobj,vobj> TpCompress;
-      WilsonXmCompressor<cobj,vobj> XmCompress;
-      WilsonYmCompressor<cobj,vobj> YmCompress;
-      WilsonZmCompressor<cobj,vobj> ZmCompress;
-      WilsonTmCompressor<cobj,vobj> TmCompress;
-
-      // Gather all comms buffers
-      //    for(int point = 0 ; point < _npoints; point++) {
-      //      compress.Point(point);
-      //      HaloGatherDir(source,compress,point,face_idx);
-      //    }
-      int face_idx=0;
-      if ( dag ) { 
-	//	std::cout << " Optimised Dagger compress " <<std::endl;
-	this->HaloGatherDir(source,XpCompress,Xp,face_idx);
-	this->HaloGatherDir(source,YpCompress,Yp,face_idx);
-	this->HaloGatherDir(source,ZpCompress,Zp,face_idx);
-	this->HaloGatherDir(source,TpCompress,Tp,face_idx);
-	this->HaloGatherDir(source,XmCompress,Xm,face_idx);
-	this->HaloGatherDir(source,YmCompress,Ym,face_idx);
-	this->HaloGatherDir(source,ZmCompress,Zm,face_idx);
-	this->HaloGatherDir(source,TmCompress,Tm,face_idx);
-      } else {
-	this->HaloGatherDir(source,XmCompress,Xp,face_idx);
-	this->HaloGatherDir(source,YmCompress,Yp,face_idx);
-	this->HaloGatherDir(source,ZmCompress,Zp,face_idx);
-	this->HaloGatherDir(source,TmCompress,Tp,face_idx);
-	this->HaloGatherDir(source,XpCompress,Xm,face_idx);
-	this->HaloGatherDir(source,YpCompress,Ym,face_idx);
-	this->HaloGatherDir(source,ZpCompress,Zm,face_idx);
-	this->HaloGatherDir(source,TpCompress,Tm,face_idx);
-      }
-      this->face_table_computed=1;
-      assert(this->u_comm_offset==this->_unified_buffer_size);
-      this->halogtime+=usecond();
-    }
-
+		const std::vector<int> &distances)  
+    : CartesianStencil<vobj,cobj> (grid,npoints,checkerboard,directions,distances) ,
+    same_node(npoints)
+  { 
+    surface_list.resize(0);
  };

+  void BuildSurfaceList(int Ls,int vol4){
+
+    // find same node for SHM
+    // Here we know the distance is 1 for WilsonStencil
+    for(int point=0;point<this->_npoints;point++){
+      same_node[point] = this->SameNode(point);
+      //      std::cout << " dir " <<point<<" same_node " <<same_node[point]<<std::endl;
+    }
+    
+    for(int site = 0 ;site< vol4;site++){
+      int local = 1;
+      for(int point=0;point<this->_npoints;point++){
+	if( (!this->GetNodeLocal(site*Ls,point)) && (!same_node[point]) ){ 
+	  local = 0;
+	}
+      }
+      if(local == 0) { 
+	surface_list.push_back(site);
+      }
+    }
+  }
+
+  template < class compressor>
+  void HaloExchangeOpt(const Lattice<vobj> &source,compressor &compress) 
+  {
+    std::vector<std::vector<CommsRequest_t> > reqs;
+    this->HaloExchangeOptGather(source,compress);
+    this->CommunicateBegin(reqs);
+    this->CommunicateComplete(reqs);
+    this->CommsMerge(compress);
+    this->CommsMergeSHM(compress);
+  }
+  
+  template <class compressor>
+  void HaloExchangeOptGather(const Lattice<vobj> &source,compressor &compress) 
+  {
+    this->Prepare();
+    this->HaloGatherOpt(source,compress);
+  }
+
+  template <class compressor>
+  void HaloGatherOpt(const Lattice<vobj> &source,compressor &compress)
+  {
+    // Strategy. Inherit types from Compressor.
+    // Use types to select the write direction by direction compressor
+    typedef typename compressor::SiteSpinor         SiteSpinor;
+    typedef typename compressor::SiteHalfSpinor     SiteHalfSpinor;
+    typedef typename compressor::SiteHalfCommSpinor SiteHalfCommSpinor;
+
+    this->_grid->StencilBarrier();
+
+    assert(source._grid==this->_grid);
+    this->halogtime-=usecond();
+    
+    this->u_comm_offset=0;
+      
+    WilsonXpCompressor<SiteHalfCommSpinor,SiteHalfSpinor,SiteSpinor> XpCompress; 
+    WilsonYpCompressor<SiteHalfCommSpinor,SiteHalfSpinor,SiteSpinor> YpCompress; 
+    WilsonZpCompressor<SiteHalfCommSpinor,SiteHalfSpinor,SiteSpinor> ZpCompress; 
+    WilsonTpCompressor<SiteHalfCommSpinor,SiteHalfSpinor,SiteSpinor> TpCompress;
+    WilsonXmCompressor<SiteHalfCommSpinor,SiteHalfSpinor,SiteSpinor> XmCompress; 
+    WilsonYmCompressor<SiteHalfCommSpinor,SiteHalfSpinor,SiteSpinor> YmCompress; 
+    WilsonZmCompressor<SiteHalfCommSpinor,SiteHalfSpinor,SiteSpinor> ZmCompress; 
+    WilsonTmCompressor<SiteHalfCommSpinor,SiteHalfSpinor,SiteSpinor> TmCompress;
+
+    int dag = compress.dag;
+    int face_idx=0;
+    if ( dag ) { 
+      //	std::cout << " Optimised Dagger compress " <<std::endl;
+      assert(same_node[Xp]==this->HaloGatherDir(source,XpCompress,Xp,face_idx));
+      assert(same_node[Yp]==this->HaloGatherDir(source,YpCompress,Yp,face_idx));
+      assert(same_node[Zp]==this->HaloGatherDir(source,ZpCompress,Zp,face_idx));
+      assert(same_node[Tp]==this->HaloGatherDir(source,TpCompress,Tp,face_idx));
+      assert(same_node[Xm]==this->HaloGatherDir(source,XmCompress,Xm,face_idx));
+      assert(same_node[Ym]==this->HaloGatherDir(source,YmCompress,Ym,face_idx));
+      assert(same_node[Zm]==this->HaloGatherDir(source,ZmCompress,Zm,face_idx));
+      assert(same_node[Tm]==this->HaloGatherDir(source,TmCompress,Tm,face_idx));
+    } else {
+      assert(same_node[Xp]==this->HaloGatherDir(source,XmCompress,Xp,face_idx));
+      assert(same_node[Yp]==this->HaloGatherDir(source,YmCompress,Yp,face_idx));
+      assert(same_node[Zp]==this->HaloGatherDir(source,ZmCompress,Zp,face_idx));
+      assert(same_node[Tp]==this->HaloGatherDir(source,TmCompress,Tp,face_idx));
+      assert(same_node[Xm]==this->HaloGatherDir(source,XpCompress,Xm,face_idx));
+      assert(same_node[Ym]==this->HaloGatherDir(source,YpCompress,Ym,face_idx));
+      assert(same_node[Zm]==this->HaloGatherDir(source,ZpCompress,Zm,face_idx));
+      assert(same_node[Tm]==this->HaloGatherDir(source,TpCompress,Tm,face_idx));
+    }
+    this->face_table_computed=1;
+    assert(this->u_comm_offset==this->_unified_buffer_size);
+    this->halogtime+=usecond();
+  }
+
+ };

 }} // namespace close
 #endif
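Review note on WilsonCompressor.h: the two `WilsonCompressorTemplate` specialisations are selected at compile time by `std::enable_if` on whether the comms spinor type equals the compute spinor type. Only the mixed-precision one reports a decompression step. The following is a minimal standalone sketch of that dispatch pattern, assuming scalar `float`/`double` in place of the spinor types. `ToyCompressor` is a hypothetical name and is not part of Grid.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>
#include <vector>

// Primary template is only declared; exactly one specialisation below matches.
template <class Comm, class Compute, typename SFINAE = void>
class ToyCompressor;

// Comms type equals compute type: data passes straight through.
template <class Comm, class Compute>
class ToyCompressor<Comm, Compute,
    typename std::enable_if<std::is_same<Comm, Compute>::value>::type> {
 public:
  static bool DecompressionStep() { return false; }
  static std::size_t CommDatumSize() { return sizeof(Comm); }
  static std::vector<Comm> Compress(const std::vector<Compute>& in) { return in; }
};

// Comms type differs: convert on the way out and convert back on receipt.
template <class Comm, class Compute>
class ToyCompressor<Comm, Compute,
    typename std::enable_if<!std::is_same<Comm, Compute>::value>::type> {
 public:
  static bool DecompressionStep() { return true; }
  static std::size_t CommDatumSize() { return sizeof(Comm); }
  static std::vector<Comm> Compress(const std::vector<Compute>& in) {
    std::vector<Comm> out;
    out.reserve(in.size());
    for (const Compute& x : in) out.push_back(static_cast<Comm>(x));
    return out;
  }
  static std::vector<Compute> Decompress(const std::vector<Comm>& in) {
    std::vector<Compute> out;
    out.reserve(in.size());
    for (const Comm& x : in) out.push_back(static_cast<Compute>(x));
    return out;
  }
};
```

In this sketch `ToyCompressor<double,double>` takes the pass-through path, while `ToyCompressor<float,double>` halves the bytes per datum on the wire and requires the caller to run `Decompress` after the exchange, which is the information `DecompressionStep()` passes to the stencil in the patch.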
--- a/lib/qcd/action/fermion/WilsonFermion5D.cc
+++ b/lib/qcd/action/fermion/WilsonFermion5D.cc
@@ -117,49 +117,20 @@ WilsonFermion5D<Impl>::WilsonFermion5D(GaugeField &_Umu,
    
  // Allocate the required comms buffer
  ImportGauge(_Umu);
+
+  // Build lists of exterior only nodes
+  int LLs = FiveDimGrid._rdimensions[0];
+  int vol4;
+  vol4=FourDimGrid.oSites();
+  Stencil.BuildSurfaceList(LLs,vol4);
+  vol4=FourDimRedBlackGrid.oSites();
+  StencilEven.BuildSurfaceList(LLs,vol4);
+   StencilOdd.BuildSurfaceList(LLs,vol4);
+
+  std::cout << GridLogMessage << " SurfaceLists "<< Stencil.surface_list.size()
+                       <<" " << StencilEven.surface_list.size()<<std::endl;
+
 }
-  /*
-template<class Impl>
-WilsonFermion5D<Impl>::WilsonFermion5D(int simd,GaugeField &_Umu,
-               GridCartesian         &FiveDimGrid,
-               GridRedBlackCartesian &FiveDimRedBlackGrid,
-               GridCartesian         &FourDimGrid,
-               RealD _M5,const ImplParams &p) :
-{
-  int nsimd = Simd::Nsimd();
-
-  // some assertions
-  assert(FiveDimGrid._ndimension==5);
-  assert(FiveDimRedBlackGrid._ndimension==5);
-  assert(FiveDimRedBlackGrid._checker_dim==0); // Checkerboard the s-direction
-  assert(FourDimGrid._ndimension==4);
-
-  // Dimension zero of the five-d is the Ls direction
-  Ls=FiveDimGrid._fdimensions[0];
-  assert(FiveDimGrid._processors[0]         ==1);
-  assert(FiveDimGrid._simd_layout[0]        ==nsimd);
-
-  assert(FiveDimRedBlackGrid._fdimensions[0]==Ls);
-  assert(FiveDimRedBlackGrid._processors[0] ==1);
-  assert(FiveDimRedBlackGrid._simd_layout[0]==nsimd);
-
-  // Other dimensions must match the decomposition of the four-D fields 
-  for(int d=0;d<4;d++){
-    assert(FiveDimRedBlackGrid._fdimensions[d+1]==FourDimGrid._fdimensions[d]);
-    assert(FiveDimRedBlackGrid._processors[d+1] ==FourDimGrid._processors[d]);
-
-    assert(FourDimGrid._simd_layout[d]=1);
-    assert(FiveDimRedBlackGrid._simd_layout[d+1]==1);
-
-    assert(FiveDimGrid._fdimensions[d+1]        ==FourDimGrid._fdimensions[d]);
-    assert(FiveDimGrid._processors[d+1]         ==FourDimGrid._processors[d]);
-    assert(FiveDimGrid._simd_layout[d+1]        ==FourDimGrid._simd_layout[d]);
-  }
-
-  {
-  }
-}  
-  */
     
 template<class Impl>
 void WilsonFermion5D<Impl>::Report(void)
@@ -396,6 +367,7 @@ void WilsonFermion5D<Impl>::DhopInternal(StencilImpl & st, LebesgueOrder &lo,
  DhopTotalTime+=usecond();
 }

+
 template<class Impl>
 void WilsonFermion5D<Impl>::DhopInternalOverlappedComms(StencilImpl & st, LebesgueOrder &lo,
 							DoubledGaugeField & U,
@@ -409,12 +381,17 @@ void WilsonFermion5D<Impl>::DhopInternalOverlappedComms(StencilImpl & st, Lebesg

  int LLs = in._grid->_rdimensions[0];
  int len =  U._grid->oSites();
-  
+
  DhopFaceTime-=usecond();
  st.HaloExchangeOptGather(in,compressor);
  DhopFaceTime+=usecond();
  std::vector<std::vector<CommsRequest_t> > reqs;

+  // Rely on async comms; start comms before merge of local data
+  st.CommunicateBegin(reqs);
+  st.CommsMergeSHM(compressor);
+
+  // Perhaps use omp task and region
 #pragma omp parallel 
  { 
    int nthreads = omp_get_num_threads();
@@ -426,7 +403,6 @@ void WilsonFermion5D<Impl>::DhopInternalOverlappedComms(StencilImpl & st, Lebesg

    if ( me == 0 ) {
      DhopCommTime-=usecond();
-      st.CommunicateBegin(reqs);
      st.CommunicateComplete(reqs);
      DhopCommTime+=usecond();
    } else { 
@@ -439,28 +415,37 @@ void WilsonFermion5D<Impl>::DhopInternalOverlappedComms(StencilImpl & st, Lebesg
  }

  DhopFaceTime-=usecond();
-  st.CommsMerge();
+  st.CommsMerge(compressor);
  DhopFaceTime+=usecond();

-#pragma omp parallel 
-  {
-    int nthreads = omp_get_num_threads();
-    int me = omp_get_thread_num();
-    int myoff, mywork;
-
-    GridThread::GetWork(len,me,mywork,myoff,nthreads);
-    int sF = LLs * myoff;
-
-    // Exterior links in stencil
-    if ( me==0 ) DhopComputeTime2-=usecond();
-    if (dag == DaggerYes) Kernels::DhopSiteDag(st,lo,U,st.CommBuf(),sF,myoff,LLs,mywork,in,out,0,1);
-    else                  Kernels::DhopSite   (st,lo,U,st.CommBuf(),sF,myoff,LLs,mywork,in,out,0,1);
-    if ( me==0 ) DhopComputeTime2+=usecond();
-  }// end parallel region
+  // Load imbalance alert. Should use dynamic schedule OMP for loop
+  // Perhaps create a list of only those sites with face work, and 
+  // load balance process the list.
+  DhopComputeTime2-=usecond();
+  if (dag == DaggerYes) {
+    int sz=st.surface_list.size();
+    parallel_for (int ss = 0; ss < sz; ss++) {
+      int sU = st.surface_list[ss];
+      int sF = LLs * sU;
+      Kernels::DhopSiteDag(st,lo,U,st.CommBuf(),sF,sU,LLs,1,in,out,0,1);
+    }
+  } else {
+    int sz=st.surface_list.size();
+    parallel_for (int ss = 0; ss < sz; ss++) {
+      int sU = st.surface_list[ss];
+      int sF = LLs * sU;
+      Kernels::DhopSite(st,lo,U,st.CommBuf(),sF,sU,LLs,1,in,out,0,1);
+    }
+  }
+  DhopComputeTime2+=usecond();
 #else 
  assert(0);
 #endif
+
 }
+
+
+
 template<class Impl>
 void WilsonFermion5D<Impl>::DhopInternalSerialComms(StencilImpl & st, LebesgueOrder &lo,
 					 DoubledGaugeField & U,
@@ -679,7 +664,6 @@ void WilsonFermion5D<Impl>::MomentumSpacePropagatorHw(FermionField &out,const Fe

 }

-
 FermOpTemplateInstantiate(WilsonFermion5D);
 GparityFermOpTemplateInstantiate(WilsonFermion5D);
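Review note on the surface list: `BuildSurfaceList` records only those four-dimensional sites that have at least one stencil leg that is neither node-local nor served by a same-node (shared memory) neighbour, and the exterior pass in `DhopInternalOverlappedComms` then loops over that list instead of the whole volume. The following is a minimal standalone sketch of the selection rule, assuming a toy L x L block with four stencil points. `is_local` and `build_surface_list` are hypothetical helpers and are not Grid functions.

```cpp
#include <cassert>
#include <vector>

// Toy 2D local block of L x L sites with 4 stencil points: +x, +y, -x, -y.
// Returns false when the neighbour in that direction lies outside the block.
inline bool is_local(int site, int point, int L) {
  const int x = site % L;
  const int y = site / L;
  switch (point) {
    case 0: return x + 1 < L;  // +x
    case 1: return y + 1 < L;  // +y
    case 2: return x > 0;      // -x
    case 3: return y > 0;      // -y
  }
  return true;
}

// A site goes on the surface list if some leg is off-block AND that direction
// is not handled by a same-node neighbour, mirroring the test in the patch.
inline std::vector<int> build_surface_list(int L, const std::vector<int>& same_node) {
  assert(same_node.size() == 4);
  std::vector<int> surface;
  for (int site = 0; site < L * L; ++site) {
    bool local = true;
    for (int point = 0; point < 4; ++point) {
      if (!is_local(site, point, L) && !same_node[point]) local = false;
    }
    if (!local) surface.push_back(site);
  }
  return surface;
}
```

The list shrinks as more directions are marked same-node, which is why the second pass does less work than a full-volume sweep that skips interior sites one by one.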
  
--- a/lib/qcd/action/fermion/WilsonKernels.cc
+++ b/lib/qcd/action/fermion/WilsonKernels.cc
@@ -33,52 +33,8 @@ directory
 namespace Grid {
 namespace QCD {

-  int WilsonKernelsStatic::Opt   = WilsonKernelsStatic::OptGeneric;
-  int WilsonKernelsStatic::Comms = WilsonKernelsStatic::CommsAndCompute;
-
-#ifdef QPX
-#include <spi/include/kernel/location.h>
-#include <spi/include/l1p/types.h>
-#include <hwi/include/bqc/l1p_mmio.h>
-#include <hwi/include/bqc/A2_inlines.h>
-#endif
-
-void bgq_l1p_optimisation(int mode)
-{
-#ifdef QPX
-#undef L1P_CFG_PF_USR
-#define L1P_CFG_PF_USR  (0x3fde8000108ll)   /*  (64 bit reg, 23 bits wide, user/unpriv) */
-
-  uint64_t cfg_pf_usr;
-  if ( mode ) { 
-    cfg_pf_usr =
-        L1P_CFG_PF_USR_ifetch_depth(0)       
-      | L1P_CFG_PF_USR_ifetch_max_footprint(1)   
-      | L1P_CFG_PF_USR_pf_stream_est_on_dcbt 
-      | L1P_CFG_PF_USR_pf_stream_establish_enable
-      | L1P_CFG_PF_USR_pf_stream_optimistic
-      | L1P_CFG_PF_USR_pf_adaptive_throttle(0xF) ;
-    //    if ( sizeof(Float) == sizeof(double) ) {
-      cfg_pf_usr |=  L1P_CFG_PF_USR_dfetch_depth(2)| L1P_CFG_PF_USR_dfetch_max_footprint(3)   ;
-      //    } else {
-      //      cfg_pf_usr |=  L1P_CFG_PF_USR_dfetch_depth(1)| L1P_CFG_PF_USR_dfetch_max_footprint(2)   ;
-      //    }
-  } else { 
-    cfg_pf_usr = L1P_CFG_PF_USR_dfetch_depth(1)
-      | L1P_CFG_PF_USR_dfetch_max_footprint(2)   
-      | L1P_CFG_PF_USR_ifetch_depth(0)       
-      | L1P_CFG_PF_USR_ifetch_max_footprint(1)   
-      | L1P_CFG_PF_USR_pf_stream_est_on_dcbt 
-      | L1P_CFG_PF_USR_pf_stream_establish_enable
-      | L1P_CFG_PF_USR_pf_stream_optimistic
-      | L1P_CFG_PF_USR_pf_stream_prefetch_enable;
-  }
-  *((uint64_t *)L1P_CFG_PF_USR) = cfg_pf_usr;
-
-#endif
-
-}
-
+int WilsonKernelsStatic::Opt   = WilsonKernelsStatic::OptGeneric;
+int WilsonKernelsStatic::Comms = WilsonKernelsStatic::CommsAndCompute;

 template <class Impl>
 WilsonKernels<Impl>::WilsonKernels(const ImplParams &p) : Base(p){};
@@ -86,12 +42,72 @@ WilsonKernels<Impl>::WilsonKernels(const ImplParams &p) : Base(p){};
 ////////////////////////////////////////////
 // Generic implementation; move to different file?
 ////////////////////////////////////////////
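+// One stencil leg: project a local neighbour (permuting if required) or read the comms buffer, then multLink and Recon.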
+#define GENERIC_STENCIL_LEG(Dir,spProj,Recon)			\
+  SE = st.GetEntry(ptype, Dir, sF);				\
+  if (SE->_is_local) {						\
+    chi_p = &chi;						\
+    if (SE->_permute) {						\
+      spProj(tmp, in._odata[SE->_offset]);			\
+      permute(chi, tmp, ptype);					\
+    } else {							\
+      spProj(chi, in._odata[SE->_offset]);			\
+    }								\
+  } else {							\
+    chi_p = &buf[SE->_offset];					\
+  }								\
+  Impl::multLink(Uchi, U._odata[sU], *chi_p, Dir, SE, st);	\
+  Recon(result, Uchi);
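+// Interior leg: as above, but applied only when the neighbour is local or st.same_node[Dir] is set.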
+#define GENERIC_STENCIL_LEG_INT(Dir,spProj,Recon)		\
+  SE = st.GetEntry(ptype, Dir, sF);				\
+  if (SE->_is_local) {						\
+    chi_p = &chi;						\
+    if (SE->_permute) {						\
+      spProj(tmp, in._odata[SE->_offset]);			\
+      permute(chi, tmp, ptype);					\
+    } else {							\
+      spProj(chi, in._odata[SE->_offset]);			\
+    }								\
+  } else if ( st.same_node[Dir] ) {				\
+    chi_p = &buf[SE->_offset];					\
+  }								\
+  if (SE->_is_local || st.same_node[Dir] ) {			\
+    Impl::multLink(Uchi, U._odata[sU], *chi_p, Dir, SE, st);	\
+    Recon(result, Uchi);					\
+  }

+#define GENERIC_STENCIL_LEG_EXT(Dir,spProj,Recon)		\
+  SE = st.GetEntry(ptype, Dir, sF);				\
+  if ((!SE->_is_local) && (!st.same_node[Dir]) ) {		\
+    chi_p = &buf[SE->_offset];					\
+    Impl::multLink(Uchi, U._odata[sU], *chi_p, Dir, SE, st);	\
+    Recon(result, Uchi);					\
+    nmu++;							\
+  }
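+// Exterior leg: applied only when the neighbour is neither local nor on the same node; nmu counts the legs applied.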
+#define GENERIC_DHOPDIR_LEG(Dir,spProj,Recon)			\
+  if (gamma == Dir) {						\
+    if (SE->_is_local && SE->_permute) {			\
+      spProj(tmp, in._odata[SE->_offset]);			\
+      permute(chi, tmp, ptype);					\
+    } else if (SE->_is_local) {					\
+      spProj(chi, in._odata[SE->_offset]);			\
+    } else {							\
+      chi = buf[SE->_offset];					\
+    }								\
+    Impl::multLink(Uchi, U._odata[sU], chi, dir, SE, st);	\
+    Recon(result, Uchi);					\
+  }
+
+////////////////////////////////////////////////////////////////////
+// All-leg kernels: comms first, then compute
+////////////////////////////////////////////////////////////////////
 template <class Impl>
 void WilsonKernels<Impl>::GenericDhopSiteDag(StencilImpl &st, LebesgueOrder &lo, DoubledGaugeField &U,
-						     SiteHalfSpinor *buf, int sF,
-						     int sU, const FermionField &in, FermionField &out,
-						     int interior,int exterior) {
+					     SiteHalfSpinor *buf, int sF,
+					     int sU, const FermionField &in, FermionField &out)
+{
  SiteHalfSpinor tmp;
  SiteHalfSpinor chi;
  SiteHalfSpinor *chi_p;
@@ -100,174 +116,22 @@ void WilsonKernels<Impl>::GenericDhopSiteDag(StencilImpl &st, LebesgueOrder &lo,
  StencilEntry *SE;
  int ptype;

-  ///////////////////////////
-  // Xp
-  ///////////////////////////
-  SE = st.GetEntry(ptype, Xp, sF);
-
-  if (SE->_is_local) {
-    chi_p = &chi;
-    if (SE->_permute) {
-      spProjXp(tmp, in._odata[SE->_offset]);
-      permute(chi, tmp, ptype);
-    } else {
-      spProjXp(chi, in._odata[SE->_offset]);
-    }
-  } else {
-    chi_p = &buf[SE->_offset];
-  }
-
-  Impl::multLink(Uchi, U._odata[sU], *chi_p, Xp, SE, st);
-  spReconXp(result, Uchi);
-
-  ///////////////////////////
-  // Yp
-  ///////////////////////////
-  SE = st.GetEntry(ptype, Yp, sF);
-
-  if (SE->_is_local) {
-    chi_p = &chi;
-    if (SE->_permute) {
-      spProjYp(tmp, in._odata[SE->_offset]);
-      permute(chi, tmp, ptype);
-    } else {
-      spProjYp(chi, in._odata[SE->_offset]);
-    }
-  } else {
-    chi_p = &buf[SE->_offset];
-  }
-
-  Impl::multLink(Uchi, U._odata[sU], *chi_p, Yp, SE, st);
-  accumReconYp(result, Uchi);
-
-  ///////////////////////////
-  // Zp
-  ///////////////////////////
-  SE = st.GetEntry(ptype, Zp, sF);
-
-  if (SE->_is_local) {
-    chi_p = &chi;
-    if (SE->_permute) {
-      spProjZp(tmp, in._odata[SE->_offset]);
-      permute(chi, tmp, ptype);
-    } else {
-      spProjZp(chi, in._odata[SE->_offset]);
-    }
-  } else {
-    chi_p = &buf[SE->_offset];
-  }
-
-  Impl::multLink(Uchi, U._odata[sU], *chi_p, Zp, SE, st);
-  accumReconZp(result, Uchi);
-
-  ///////////////////////////
-  // Tp
-  ///////////////////////////
-  SE = st.GetEntry(ptype, Tp, sF);
-
-  if (SE->_is_local) {
-    chi_p = &chi;
-    if (SE->_permute) {
-      spProjTp(tmp, in._odata[SE->_offset]);
-      permute(chi, tmp, ptype);
-    } else {
-      spProjTp(chi, in._odata[SE->_offset]);
-    }
-  } else {
-    chi_p = &buf[SE->_offset];
-  }
-
-  Impl::multLink(Uchi, U._odata[sU], *chi_p, Tp, SE, st);
-  accumReconTp(result, Uchi);
-
-  ///////////////////////////
-  // Xm
-  ///////////////////////////
-  SE = st.GetEntry(ptype, Xm, sF);
-
-  if (SE->_is_local) {
-    chi_p = &chi;
-    if (SE->_permute) {
-      spProjXm(tmp, in._odata[SE->_offset]);
-      permute(chi, tmp, ptype);
-    } else {
-      spProjXm(chi, in._odata[SE->_offset]);
-    }
-  } else {
-    chi_p = &buf[SE->_offset];
-  }
-
-  Impl::multLink(Uchi, U._odata[sU], *chi_p, Xm, SE, st);
-  accumReconXm(result, Uchi);
-
-  ///////////////////////////
-  // Ym
-  ///////////////////////////
-  SE = st.GetEntry(ptype, Ym, sF);
-
-  if (SE->_is_local) {
-    chi_p = &chi;
-    if (SE->_permute) {
-      spProjYm(tmp, in._odata[SE->_offset]);
-      permute(chi, tmp, ptype);
-    } else {
-      spProjYm(chi, in._odata[SE->_offset]);
-    }
-  } else {
-    chi_p = &buf[SE->_offset];
-  }
-
-  Impl::multLink(Uchi, U._odata[sU], *chi_p, Ym, SE, st);
-  accumReconYm(result, Uchi);
-
-  ///////////////////////////
-  // Zm
-  ///////////////////////////
-  SE = st.GetEntry(ptype, Zm, sF);
-
-  if (SE->_is_local) {
-    chi_p = &chi;
-    if (SE->_permute) {
-      spProjZm(tmp, in._odata[SE->_offset]);
-      permute(chi, tmp, ptype);
-    } else {
-      spProjZm(chi, in._odata[SE->_offset]);
-    }
-  } else {
-    chi_p = &buf[SE->_offset];
-  }
-
-  Impl::multLink(Uchi, U._odata[sU], *chi_p, Zm, SE, st);
-  accumReconZm(result, Uchi);
-
-  ///////////////////////////
-  // Tm
-  ///////////////////////////
-  SE = st.GetEntry(ptype, Tm, sF);
-
-  if (SE->_is_local) {
-    chi_p = &chi;
-    if (SE->_permute) {
-      spProjTm(tmp, in._odata[SE->_offset]);
-      permute(chi, tmp, ptype);
-    } else {
-      spProjTm(chi, in._odata[SE->_offset]);
-    }
-  } else {
-    chi_p = &buf[SE->_offset];
-  }
-
-  Impl::multLink(Uchi, U._odata[sU], *chi_p, Tm, SE, st);
-  accumReconTm(result, Uchi);
-
+  GENERIC_STENCIL_LEG(Xp,spProjXp,spReconXp);
+  GENERIC_STENCIL_LEG(Yp,spProjYp,accumReconYp);
+  GENERIC_STENCIL_LEG(Zp,spProjZp,accumReconZp);
+  GENERIC_STENCIL_LEG(Tp,spProjTp,accumReconTp);
+  GENERIC_STENCIL_LEG(Xm,spProjXm,accumReconXm);
+  GENERIC_STENCIL_LEG(Ym,spProjYm,accumReconYm);
+  GENERIC_STENCIL_LEG(Zm,spProjZm,accumReconZm);
+  GENERIC_STENCIL_LEG(Tm,spProjTm,accumReconTm);
  vstream(out._odata[sF], result);
 };

-// Need controls to do interior, exterior, or both
 template <class Impl>
 void WilsonKernels<Impl>::GenericDhopSite(StencilImpl &st, LebesgueOrder &lo, DoubledGaugeField &U,
-						  SiteHalfSpinor *buf, int sF,
-						  int sU, const FermionField &in, FermionField &out,int interior,int exterior) {
+					  SiteHalfSpinor *buf, int sF,
+					  int sU, const FermionField &in, FermionField &out) 
+{
  SiteHalfSpinor tmp;
  SiteHalfSpinor chi;
  SiteHalfSpinor *chi_p;
@@ -276,168 +140,123 @@ void WilsonKernels<Impl>::GenericDhopSite(StencilImpl &st, LebesgueOrder &lo, Do
  StencilEntry *SE;
  int ptype;

-  ///////////////////////////
-  // Xp
-  ///////////////////////////
-  SE = st.GetEntry(ptype, Xm, sF);
-
-  if (SE->_is_local) {
-    chi_p = &chi;
-    if (SE->_permute) {
-      spProjXp(tmp, in._odata[SE->_offset]);
-      permute(chi, tmp, ptype);
-    } else {
-      spProjXp(chi, in._odata[SE->_offset]);
-    }
-  } else {
-    chi_p = &buf[SE->_offset];
-  }
-
-  Impl::multLink(Uchi, U._odata[sU], *chi_p, Xm, SE, st);
-  spReconXp(result, Uchi);
-
-  ///////////////////////////
-  // Yp
-  ///////////////////////////
-  SE = st.GetEntry(ptype, Ym, sF);
-
-  if (SE->_is_local) {
-    chi_p = &chi;
-    if (SE->_permute) {
-      spProjYp(tmp, in._odata[SE->_offset]);
-      permute(chi, tmp, ptype);
-    } else {
-      spProjYp(chi, in._odata[SE->_offset]);
-    }
-  } else {
-    chi_p = &buf[SE->_offset];
-  }
-
-  Impl::multLink(Uchi, U._odata[sU], *chi_p, Ym, SE, st);
-  accumReconYp(result, Uchi);
-
-  ///////////////////////////
-  // Zp
-  ///////////////////////////
-  SE = st.GetEntry(ptype, Zm, sF);
-
-  if (SE->_is_local) {
-    chi_p = &chi;
-    if (SE->_permute) {
-      spProjZp(tmp, in._odata[SE->_offset]);
-      permute(chi, tmp, ptype);
-    } else {
-      spProjZp(chi, in._odata[SE->_offset]);
-    }
-  } else {
-    chi_p = &buf[SE->_offset];
-  }
-
-  Impl::multLink(Uchi, U._odata[sU], *chi_p, Zm, SE, st);
-  accumReconZp(result, Uchi);
-
-  ///////////////////////////
-  // Tp
-  ///////////////////////////
-  SE = st.GetEntry(ptype, Tm, sF);
-
-  if (SE->_is_local) {
-    chi_p = &chi;
-    if (SE->_permute) {
-      spProjTp(tmp, in._odata[SE->_offset]);
-      permute(chi, tmp, ptype);
-    } else {
-      spProjTp(chi, in._odata[SE->_offset]);
-    }
-  } else {
-    chi_p = &buf[SE->_offset];
-  }
-
-  Impl::multLink(Uchi, U._odata[sU], *chi_p, Tm, SE, st);
-  accumReconTp(result, Uchi);
-
-  ///////////////////////////
-  // Xm
-  ///////////////////////////
-  SE = st.GetEntry(ptype, Xp, sF);
-
-  if (SE->_is_local) {
-    chi_p = &chi;
-    if (SE->_permute) {
-      spProjXm(tmp, in._odata[SE->_offset]);
-      permute(chi, tmp, ptype);
-    } else {
-      spProjXm(chi, in._odata[SE->_offset]);
-    }
-  } else {
-    chi_p = &buf[SE->_offset];
-  }
-
-  Impl::multLink(Uchi, U._odata[sU], *chi_p, Xp, SE, st);
-  accumReconXm(result, Uchi);
-
-  ///////////////////////////
-  // Ym
-  ///////////////////////////
-  SE = st.GetEntry(ptype, Yp, sF);
-
-  if (SE->_is_local) {
-    chi_p = &chi;
-    if (SE->_permute) {
-      spProjYm(tmp, in._odata[SE->_offset]);
-      permute(chi, tmp, ptype);
-    } else {
-      spProjYm(chi, in._odata[SE->_offset]);
-    }
-  } else {
-    chi_p = &buf[SE->_offset];
-  }
-
-  Impl::multLink(Uchi, U._odata[sU], *chi_p, Yp, SE, st);
-  accumReconYm(result, Uchi);
-
-  ///////////////////////////
-  // Zm
-  ///////////////////////////
-  SE = st.GetEntry(ptype, Zp, sF);
-
-  if (SE->_is_local) {
-    chi_p = &chi;
-    if (SE->_permute) {
-      spProjZm(tmp, in._odata[SE->_offset]);
-      permute(chi, tmp, ptype);
-    } else {
-      spProjZm(chi, in._odata[SE->_offset]);
-    }
-  } else {
-    chi_p = &buf[SE->_offset];
-  }
-
-  Impl::multLink(Uchi, U._odata[sU], *chi_p, Zp, SE, st);
-  accumReconZm(result, Uchi);
-
-  ///////////////////////////
-  // Tm
-  ///////////////////////////
-  SE = st.GetEntry(ptype, Tp, sF);
-
-  if (SE->_is_local) {
-    chi_p = &chi;
-    if (SE->_permute) {
-      spProjTm(tmp, in._odata[SE->_offset]);
-      permute(chi, tmp, ptype);
-    } else {
-      spProjTm(chi, in._odata[SE->_offset]);
-    }
-  } else {
-    chi_p = &buf[SE->_offset];
-  }
-
-  Impl::multLink(Uchi, U._odata[sU], *chi_p, Tp, SE, st);
-  accumReconTm(result, Uchi);
-
+  GENERIC_STENCIL_LEG(Xm,spProjXp,spReconXp);
+  GENERIC_STENCIL_LEG(Ym,spProjYp,accumReconYp);
+  GENERIC_STENCIL_LEG(Zm,spProjZp,accumReconZp);
+  GENERIC_STENCIL_LEG(Tm,spProjTp,accumReconTp);
+  GENERIC_STENCIL_LEG(Xp,spProjXm,accumReconXm);
+  GENERIC_STENCIL_LEG(Yp,spProjYm,accumReconYm);
+  GENERIC_STENCIL_LEG(Zp,spProjZm,accumReconZm);
+  GENERIC_STENCIL_LEG(Tp,spProjTm,accumReconTm);
  vstream(out._odata[sF], result);
 };
+////////////////////////////////////////////////////////////////////
+// Interior kernels
+////////////////////////////////////////////////////////////////////
+template <class Impl>
+void WilsonKernels<Impl>::GenericDhopSiteDagInt(StencilImpl &st, LebesgueOrder &lo, DoubledGaugeField &U,
+						SiteHalfSpinor *buf, int sF,
+						int sU, const FermionField &in, FermionField &out)
+{
+  SiteHalfSpinor tmp;
+  SiteHalfSpinor chi;
+  SiteHalfSpinor *chi_p;
+  SiteHalfSpinor Uchi;
+  SiteSpinor result;
+  StencilEntry *SE;
+  int ptype;
+
+  result=zero;
+  GENERIC_STENCIL_LEG_INT(Xp,spProjXp,accumReconXp);
+  GENERIC_STENCIL_LEG_INT(Yp,spProjYp,accumReconYp);
+  GENERIC_STENCIL_LEG_INT(Zp,spProjZp,accumReconZp);
+  GENERIC_STENCIL_LEG_INT(Tp,spProjTp,accumReconTp);
+  GENERIC_STENCIL_LEG_INT(Xm,spProjXm,accumReconXm);
+  GENERIC_STENCIL_LEG_INT(Ym,spProjYm,accumReconYm);
+  GENERIC_STENCIL_LEG_INT(Zm,spProjZm,accumReconZm);
+  GENERIC_STENCIL_LEG_INT(Tm,spProjTm,accumReconTm);
+  vstream(out._odata[sF], result);
+};
+
+template <class Impl>
+void WilsonKernels<Impl>::GenericDhopSiteInt(StencilImpl &st, LebesgueOrder &lo, DoubledGaugeField &U,
+					     SiteHalfSpinor *buf, int sF,
+					     int sU, const FermionField &in, FermionField &out) 
+{
+  SiteHalfSpinor tmp;
+  SiteHalfSpinor chi;
+  SiteHalfSpinor *chi_p;
+  SiteHalfSpinor Uchi;
+  SiteSpinor result;
+  StencilEntry *SE;
+  int ptype;
+  result=zero;
+  GENERIC_STENCIL_LEG_INT(Xm,spProjXp,accumReconXp);
+  GENERIC_STENCIL_LEG_INT(Ym,spProjYp,accumReconYp);
+  GENERIC_STENCIL_LEG_INT(Zm,spProjZp,accumReconZp);
+  GENERIC_STENCIL_LEG_INT(Tm,spProjTp,accumReconTp);
+  GENERIC_STENCIL_LEG_INT(Xp,spProjXm,accumReconXm);
+  GENERIC_STENCIL_LEG_INT(Yp,spProjYm,accumReconYm);
+  GENERIC_STENCIL_LEG_INT(Zp,spProjZm,accumReconZm);
+  GENERIC_STENCIL_LEG_INT(Tp,spProjTm,accumReconTm);
+  vstream(out._odata[sF], result);
+};
+////////////////////////////////////////////////////////////////////
+// Exterior kernels
+////////////////////////////////////////////////////////////////////
+template <class Impl>
+void WilsonKernels<Impl>::GenericDhopSiteDagExt(StencilImpl &st, LebesgueOrder &lo, DoubledGaugeField &U,
+						SiteHalfSpinor *buf, int sF,
+						int sU, const FermionField &in, FermionField &out)
+{
+  SiteHalfSpinor tmp;
+  SiteHalfSpinor chi;
+  SiteHalfSpinor *chi_p;
+  SiteHalfSpinor Uchi;
+  SiteSpinor result;
+  StencilEntry *SE;
+  int ptype;
+  int nmu=0;
+  result=zero;
+  GENERIC_STENCIL_LEG_EXT(Xp,spProjXp,accumReconXp);
+  GENERIC_STENCIL_LEG_EXT(Yp,spProjYp,accumReconYp);
+  GENERIC_STENCIL_LEG_EXT(Zp,spProjZp,accumReconZp);
+  GENERIC_STENCIL_LEG_EXT(Tp,spProjTp,accumReconTp);
+  GENERIC_STENCIL_LEG_EXT(Xm,spProjXm,accumReconXm);
+  GENERIC_STENCIL_LEG_EXT(Ym,spProjYm,accumReconYm);
+  GENERIC_STENCIL_LEG_EXT(Zm,spProjZm,accumReconZm);
+  GENERIC_STENCIL_LEG_EXT(Tm,spProjTm,accumReconTm);
+  if ( nmu ) { 
+    out._odata[sF] = out._odata[sF] + result; 
+  }
+};
+
+template <class Impl>
+void WilsonKernels<Impl>::GenericDhopSiteExt(StencilImpl &st, LebesgueOrder &lo, DoubledGaugeField &U,
+					     SiteHalfSpinor *buf, int sF,
+					     int sU, const FermionField &in, FermionField &out) 
+{
+  SiteHalfSpinor tmp;
+  SiteHalfSpinor chi;
+  SiteHalfSpinor *chi_p;
+  SiteHalfSpinor Uchi;
+  SiteSpinor result;
+  StencilEntry *SE;
+  int ptype;
+  int nmu=0;
+  result=zero;
+  GENERIC_STENCIL_LEG_EXT(Xm,spProjXp,accumReconXp);
+  GENERIC_STENCIL_LEG_EXT(Ym,spProjYp,accumReconYp);
+  GENERIC_STENCIL_LEG_EXT(Zm,spProjZp,accumReconZp);
+  GENERIC_STENCIL_LEG_EXT(Tm,spProjTp,accumReconTp);
+  GENERIC_STENCIL_LEG_EXT(Xp,spProjXm,accumReconXm);
+  GENERIC_STENCIL_LEG_EXT(Yp,spProjYm,accumReconYm);
+  GENERIC_STENCIL_LEG_EXT(Zp,spProjZm,accumReconZm);
+  GENERIC_STENCIL_LEG_EXT(Tp,spProjTm,accumReconTm);
+  if ( nmu ) { 
+    out._odata[sF] = out._odata[sF] + result; 
+  }
+};

 template <class Impl>
 void WilsonKernels<Impl>::DhopDir( StencilImpl &st, DoubledGaugeField &U,SiteHalfSpinor *buf, int sF,
@@ -451,119 +270,14 @@ void WilsonKernels<Impl>::DhopDir( StencilImpl &st, DoubledGaugeField &U,SiteHal
  int ptype;

  SE = st.GetEntry(ptype, dir, sF);
-
-  // Xp
-  if (gamma == Xp) {
-    if (SE->_is_local && SE->_permute) {
-      spProjXp(tmp, in._odata[SE->_offset]);
-      permute(chi, tmp, ptype);
-    } else if (SE->_is_local) {
-      spProjXp(chi, in._odata[SE->_offset]);
-    } else {
-      chi = buf[SE->_offset];
-    }
-    Impl::multLink(Uchi, U._odata[sU], chi, dir, SE, st);
-    spReconXp(result, Uchi);
-  }
-
-  // Yp
-  if (gamma == Yp) {
-    if (SE->_is_local && SE->_permute) {
-      spProjYp(tmp, in._odata[SE->_offset]);
-      permute(chi, tmp, ptype);
-    } else if (SE->_is_local) {
-      spProjYp(chi, in._odata[SE->_offset]);
-    } else {
-      chi = buf[SE->_offset];
-    }
-    Impl::multLink(Uchi, U._odata[sU], chi, dir, SE, st);
-    spReconYp(result, Uchi);
-  }
-
-  // Zp
-  if (gamma == Zp) {
-    if (SE->_is_local && SE->_permute) {
-      spProjZp(tmp, in._odata[SE->_offset]);
-      permute(chi, tmp, ptype);
-    } else if (SE->_is_local) {
-      spProjZp(chi, in._odata[SE->_offset]);
-    } else {
-      chi = buf[SE->_offset];
-    }
-    Impl::multLink(Uchi, U._odata[sU], chi, dir, SE, st);
-    spReconZp(result, Uchi);
-  }
-
-  // Tp
-  if (gamma == Tp) {
-    if (SE->_is_local && SE->_permute) {
-      spProjTp(tmp, in._odata[SE->_offset]);
-      permute(chi, tmp, ptype);
-    } else if (SE->_is_local) {
-      spProjTp(chi, in._odata[SE->_offset]);
-    } else {
-      chi = buf[SE->_offset];
-    }
-    Impl::multLink(Uchi, U._odata[sU], chi, dir, SE, st);
-    spReconTp(result, Uchi);
-  }
-
-  // Xm
-  if (gamma == Xm) {
-    if (SE->_is_local && SE->_permute) {
-      spProjXm(tmp, in._odata[SE->_offset]);
-      permute(chi, tmp, ptype);
-    } else if (SE->_is_local) {
-      spProjXm(chi, in._odata[SE->_offset]);
-    } else {
-      chi = buf[SE->_offset];
-    }
-    Impl::multLink(Uchi, U._odata[sU], chi, dir, SE, st);
-    spReconXm(result, Uchi);
-  }
-
-  // Ym
-  if (gamma == Ym) {
-    if (SE->_is_local && SE->_permute) {
-      spProjYm(tmp, in._odata[SE->_offset]);
-      permute(chi, tmp, ptype);
-    } else if (SE->_is_local) {
-      spProjYm(chi, in._odata[SE->_offset]);
-    } else {
-      chi = buf[SE->_offset];
-    }
-    Impl::multLink(Uchi, U._odata[sU], chi, dir, SE, st);
-    spReconYm(result, Uchi);
-  }
-
-  // Zm
-  if (gamma == Zm) {
-    if (SE->_is_local && SE->_permute) {
-      spProjZm(tmp, in._odata[SE->_offset]);
-      permute(chi, tmp, ptype);
-    } else if (SE->_is_local) {
-      spProjZm(chi, in._odata[SE->_offset]);
-    } else {
-      chi = buf[SE->_offset];
-    }
-    Impl::multLink(Uchi, U._odata[sU], chi, dir, SE, st);
-    spReconZm(result, Uchi);
-  }
-
-  // Tm
-  if (gamma == Tm) {
-    if (SE->_is_local && SE->_permute) {
-      spProjTm(tmp, in._odata[SE->_offset]);
-      permute(chi, tmp, ptype);
-    } else if (SE->_is_local) {
-      spProjTm(chi, in._odata[SE->_offset]);
-    } else {
-      chi = buf[SE->_offset];
-    }
-    Impl::multLink(Uchi, U._odata[sU], chi, dir, SE, st);
-    spReconTm(result, Uchi);
-  }
-
+  GENERIC_DHOPDIR_LEG(Xp,spProjXp,spReconXp);
+  GENERIC_DHOPDIR_LEG(Yp,spProjYp,spReconYp);
+  GENERIC_DHOPDIR_LEG(Zp,spProjZp,spReconZp);
+  GENERIC_DHOPDIR_LEG(Tp,spProjTp,spReconTp);
+  GENERIC_DHOPDIR_LEG(Xm,spProjXm,spReconXm);
+  GENERIC_DHOPDIR_LEG(Ym,spProjYm,spReconYm);
+  GENERIC_DHOPDIR_LEG(Zm,spProjZm,spReconZm);
+  GENERIC_DHOPDIR_LEG(Tm,spProjTm,spReconTm);
  vstream(out._odata[sF], result);
 }

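Review note (illustrative only, not part of the patch): the Int/Ext kernels above rely on the interior pass writing a zero-initialised partial sum and the exterior pass adding the remaining legs only where `nmu` is non-zero. The sketch below checks that invariant on a toy 1-D periodic lattice. `ToyStencil`, `site_all`, `site_interior`, `site_exterior` and `split_matches_all` are hypothetical names invented for this note; they are not Grid APIs, and the "remote" legs here merely stand in for data that would arrive through the comms buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy 1-D periodic lattice: every site sums its left and right neighbours.
// A leg counts as "remote" when it crosses the wrap-around boundary.
struct ToyStencil {
  std::size_t n;
  std::size_t left(std::size_t s) const { return (s + n - 1) % n; }
  std::size_t right(std::size_t s) const { return (s + 1) % n; }
  bool left_remote(std::size_t s) const { return s == 0; }
  bool right_remote(std::size_t s) const { return s == n - 1; }
};

// All legs in one pass (analogue of the combined kernel).
inline void site_all(const ToyStencil &st, const std::vector<double> &in,
                     std::vector<double> &out, std::size_t s) {
  out[s] = in[st.left(s)] + in[st.right(s)];
}

// Interior pass: start from zero, accumulate local legs only, always store.
inline void site_interior(const ToyStencil &st, const std::vector<double> &in,
                          std::vector<double> &out, std::size_t s) {
  double result = 0.0;
  if (!st.left_remote(s)) result += in[st.left(s)];
  if (!st.right_remote(s)) result += in[st.right(s)];
  out[s] = result;
}

// Exterior pass: accumulate remote legs only and update the output
// only if at least one leg was applied (nmu != 0).
inline void site_exterior(const ToyStencil &st, const std::vector<double> &in,
                          std::vector<double> &out, std::size_t s) {
  double result = 0.0;
  int nmu = 0;
  if (st.left_remote(s)) { result += in[st.left(s)]; nmu++; }
  if (st.right_remote(s)) { result += in[st.right(s)]; nmu++; }
  if (nmu) out[s] = out[s] + result;
}

// Runs both strategies on the same input and reports whether they agree.
inline bool split_matches_all(std::size_t n) {
  ToyStencil st{n};
  std::vector<double> in(n), all(n, 0.0), split(n, 0.0);
  for (std::size_t i = 0; i < n; i++) in[i] = static_cast<double>(i + 1);
  for (std::size_t s = 0; s < n; s++) site_all(st, in, all, s);
  for (std::size_t s = 0; s < n; s++) site_interior(st, in, split, s);
  for (std::size_t s = 0; s < n; s++) site_exterior(st, in, split, s);
  return all == split;
}
```

Calling `split_matches_all(n)` for a few lattice sizes exercises the split; the exterior pass must run after the interior pass has stored its partial result, which mirrors the ordering the dispatch in this patch appears to assume.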
--- a/lib/qcd/action/fermion/WilsonKernels.h
+++ b/lib/qcd/action/fermion/WilsonKernels.h
@@ -34,8 +34,6 @@ directory
 namespace Grid {
 namespace QCD {

-void bgq_l1p_optimisation(int mode);
-
  ////////////////////////////////////////////////////////////////////////////////////////////////////////////////
  // Helper routines that implement Wilson stencil for a single site.
  // Common to both the WilsonFermion and WilsonFermion5D
@@ -44,9 +42,8 @@ class WilsonKernelsStatic {
 public:
  enum { OptGeneric, OptHandUnroll, OptInlineAsm };
  enum { CommsAndCompute, CommsThenCompute };
-  // S-direction is INNERMOST and takes no part in the parity.
-  static int Opt;  // these are a temporary hack
-  static int Comms;  // these are a temporary hack
+  static int Opt;  
+  static int Comms;
 };
 
 template<class Impl> class WilsonKernels : public FermionOperator<Impl> , public WilsonKernelsStatic { 
@@ -66,7 +63,7 @@ public:
    switch(Opt) {
 #if defined(AVX512) || defined (QPX)
    case OptInlineAsm:
-      if(interior&&exterior) WilsonKernels<Impl>::AsmDhopSite(st,lo,U,buf,sF,sU,Ls,Ns,in,out);
+      if(interior&&exterior) WilsonKernels<Impl>::AsmDhopSite   (st,lo,U,buf,sF,sU,Ls,Ns,in,out);
      else if (interior)     WilsonKernels<Impl>::AsmDhopSiteInt(st,lo,U,buf,sF,sU,Ls,Ns,in,out);
      else if (exterior)     WilsonKernels<Impl>::AsmDhopSiteExt(st,lo,U,buf,sF,sU,Ls,Ns,in,out);
      else assert(0);
@@ -75,7 +72,9 @@ public:
    case OptHandUnroll:
      for (int site = 0; site < Ns; site++) {
 	for (int s = 0; s < Ls; s++) {
-	  if( exterior) WilsonKernels<Impl>::HandDhopSite(st,lo,U,buf,sF,sU,in,out,interior,exterior);
+	  if(interior&&exterior) WilsonKernels<Impl>::HandDhopSite(st,lo,U,buf,sF,sU,in,out);
+	  else if (interior)     WilsonKernels<Impl>::HandDhopSiteInt(st,lo,U,buf,sF,sU,in,out);
+	  else if (exterior)     WilsonKernels<Impl>::HandDhopSiteExt(st,lo,U,buf,sF,sU,in,out);
 	  sF++;
 	}
 	sU++;
@@ -84,7 +83,10 @@ public:
    case OptGeneric:
      for (int site = 0; site < Ns; site++) {
 	for (int s = 0; s < Ls; s++) {
-	  if( exterior) WilsonKernels<Impl>::GenericDhopSite(st,lo,U,buf,sF,sU,in,out,interior,exterior);
+	  if(interior&&exterior) WilsonKernels<Impl>::GenericDhopSite(st,lo,U,buf,sF,sU,in,out);
+	  else if (interior)     WilsonKernels<Impl>::GenericDhopSiteInt(st,lo,U,buf,sF,sU,in,out);
+	  else if (exterior)     WilsonKernels<Impl>::GenericDhopSiteExt(st,lo,U,buf,sF,sU,in,out);
+	  else assert(0);
 	  sF++;
 	}
 	sU++;
@@ -99,11 +101,14 @@ public:
  template <bool EnableBool = true>
  typename std::enable_if<(Impl::Dimension != 3 || (Impl::Dimension == 3 && Nc != 3)) && EnableBool, void>::type
  DhopSite(StencilImpl &st, LebesgueOrder &lo, DoubledGaugeField &U, SiteHalfSpinor * buf,
-		   int sF, int sU, int Ls, int Ns, const FermionField &in, FermionField &out,int interior=1,int exterior=1 ) {
+	   int sF, int sU, int Ls, int Ns, const FermionField &in, FermionField &out,int interior=1,int exterior=1 ) {
    // no kernel choice  
    for (int site = 0; site < Ns; site++) {
      for (int s = 0; s < Ls; s++) {
-	if( exterior) WilsonKernels<Impl>::GenericDhopSite(st, lo, U, buf, sF, sU, in, out,interior,exterior);
+	if(interior&&exterior) WilsonKernels<Impl>::GenericDhopSite(st,lo,U,buf,sF,sU,in,out);
+	else if (interior)     WilsonKernels<Impl>::GenericDhopSiteInt(st,lo,U,buf,sF,sU,in,out);
+	else if (exterior)     WilsonKernels<Impl>::GenericDhopSiteExt(st,lo,U,buf,sF,sU,in,out);
+	else assert(0);
 	sF++;
      }
      sU++;
@@ -113,13 +118,13 @@ public:
  template <bool EnableBool = true>
  typename std::enable_if<Impl::Dimension == 3 && Nc == 3 && EnableBool,void>::type
  DhopSiteDag(StencilImpl &st, LebesgueOrder &lo, DoubledGaugeField &U, SiteHalfSpinor * buf,
-		      int sF, int sU, int Ls, int Ns, const FermionField &in, FermionField &out,int interior=1,int exterior=1) {
-
+	      int sF, int sU, int Ls, int Ns, const FermionField &in, FermionField &out,int interior=1,int exterior=1) 
+{
    bgq_l1p_optimisation(1);
    switch(Opt) {
 #if defined(AVX512) || defined (QPX)
    case OptInlineAsm:
-      if(interior&&exterior) WilsonKernels<Impl>::AsmDhopSiteDag(st,lo,U,buf,sF,sU,Ls,Ns,in,out);
+      if(interior&&exterior) WilsonKernels<Impl>::AsmDhopSiteDag   (st,lo,U,buf,sF,sU,Ls,Ns,in,out);
      else if (interior)     WilsonKernels<Impl>::AsmDhopSiteDagInt(st,lo,U,buf,sF,sU,Ls,Ns,in,out);
      else if (exterior)     WilsonKernels<Impl>::AsmDhopSiteDagExt(st,lo,U,buf,sF,sU,Ls,Ns,in,out);
      else assert(0);
@@ -128,7 +133,10 @@ public:
    case OptHandUnroll:
      for (int site = 0; site < Ns; site++) {
 	for (int s = 0; s < Ls; s++) {
-	  if( exterior) WilsonKernels<Impl>::HandDhopSiteDag(st,lo,U,buf,sF,sU,in,out,interior,exterior);
+	  if(interior&&exterior) WilsonKernels<Impl>::HandDhopSiteDag(st,lo,U,buf,sF,sU,in,out);
+	  else if (interior)     WilsonKernels<Impl>::HandDhopSiteDagInt(st,lo,U,buf,sF,sU,in,out);
+	  else if (exterior)     WilsonKernels<Impl>::HandDhopSiteDagExt(st,lo,U,buf,sF,sU,in,out);
+	  else assert(0);
 	  sF++;
 	}
 	sU++;
@@ -137,7 +145,10 @@ public:
    case OptGeneric:
      for (int site = 0; site < Ns; site++) {
 	for (int s = 0; s < Ls; s++) {
-	  if( exterior) WilsonKernels<Impl>::GenericDhopSiteDag(st,lo,U,buf,sF,sU,in,out,interior,exterior);
+	  if(interior&&exterior) WilsonKernels<Impl>::GenericDhopSiteDag(st,lo,U,buf,sF,sU,in,out);
+	  else if (interior)     WilsonKernels<Impl>::GenericDhopSiteDagInt(st,lo,U,buf,sF,sU,in,out);
+	  else if (exterior)     WilsonKernels<Impl>::GenericDhopSiteDagExt(st,lo,U,buf,sF,sU,in,out);
+	  else assert(0);
 	  sF++;
 	}
 	sU++;
@@ -156,7 +167,10 @@ public:

    for (int site = 0; site < Ns; site++) {
      for (int s = 0; s < Ls; s++) {
-	if( exterior) WilsonKernels<Impl>::GenericDhopSiteDag(st,lo,U,buf,sF,sU,in,out,interior,exterior);
+	if(interior&&exterior) WilsonKernels<Impl>::GenericDhopSiteDag(st,lo,U,buf,sF,sU,in,out);
+	else if (interior)     WilsonKernels<Impl>::GenericDhopSiteDagInt(st,lo,U,buf,sF,sU,in,out);
+	else if (exterior)     WilsonKernels<Impl>::GenericDhopSiteDagExt(st,lo,U,buf,sF,sU,in,out);
+	else assert(0);
 	sF++;
      }
      sU++;
@@ -169,36 +183,60 @@ public:
 private:
     // Specialised variants
  void GenericDhopSite(StencilImpl &st, LebesgueOrder &lo, DoubledGaugeField &U, SiteHalfSpinor * buf,
-			       int sF, int sU, const FermionField &in, FermionField &out,int interior,int exterior);
+		       int sF, int sU, const FermionField &in, FermionField &out);
      
  void GenericDhopSiteDag(StencilImpl &st, LebesgueOrder &lo, DoubledGaugeField &U, SiteHalfSpinor * buf,
-				  int sF, int sU, const FermionField &in, FermionField &out,int interior,int exterior);
+			  int sF, int sU, const FermionField &in, FermionField &out);
+
+  void GenericDhopSiteInt(StencilImpl &st, LebesgueOrder &lo, DoubledGaugeField &U, SiteHalfSpinor * buf,
+			  int sF, int sU, const FermionField &in, FermionField &out);
+      
+  void GenericDhopSiteDagInt(StencilImpl &st, LebesgueOrder &lo, DoubledGaugeField &U, SiteHalfSpinor * buf,
+			     int sF, int sU, const FermionField &in, FermionField &out);
+
+  void GenericDhopSiteExt(StencilImpl &st, LebesgueOrder &lo, DoubledGaugeField &U, SiteHalfSpinor * buf,
+			  int sF, int sU, const FermionField &in, FermionField &out);
+      
+  void GenericDhopSiteDagExt(StencilImpl &st, LebesgueOrder &lo, DoubledGaugeField &U, SiteHalfSpinor * buf,
+			     int sF, int sU, const FermionField &in, FermionField &out);

  void AsmDhopSite(StencilImpl &st, LebesgueOrder &lo, DoubledGaugeField &U, SiteHalfSpinor * buf,
-			   int sF, int sU, int Ls, int Ns, const FermionField &in,FermionField &out);
+		   int sF, int sU, int Ls, int Ns, const FermionField &in,FermionField &out);

  void AsmDhopSiteDag(StencilImpl &st, LebesgueOrder &lo, DoubledGaugeField &U, SiteHalfSpinor * buf,
-			      int sF, int sU, int Ls, int Ns, const FermionField &in, FermionField &out);
+		      int sF, int sU, int Ls, int Ns, const FermionField &in, FermionField &out);

  void AsmDhopSiteInt(StencilImpl &st, LebesgueOrder &lo, DoubledGaugeField &U, SiteHalfSpinor * buf,
-			   int sF, int sU, int Ls, int Ns, const FermionField &in,FermionField &out);
+		      int sF, int sU, int Ls, int Ns, const FermionField &in,FermionField &out);

  void AsmDhopSiteDagInt(StencilImpl &st, LebesgueOrder &lo, DoubledGaugeField &U, SiteHalfSpinor * buf,
-			      int sF, int sU, int Ls, int Ns, const FermionField &in, FermionField &out);
+			 int sF, int sU, int Ls, int Ns, const FermionField &in, FermionField &out);

  void AsmDhopSiteExt(StencilImpl &st, LebesgueOrder &lo, DoubledGaugeField &U, SiteHalfSpinor * buf,
-			      int sF, int sU, int Ls, int Ns, const FermionField &in,FermionField &out);
+		      int sF, int sU, int Ls, int Ns, const FermionField &in,FermionField &out);

  void AsmDhopSiteDagExt(StencilImpl &st, LebesgueOrder &lo, DoubledGaugeField &U, SiteHalfSpinor * buf,
-				 int sF, int sU, int Ls, int Ns, const FermionField &in, FermionField &out);
+			 int sF, int sU, int Ls, int Ns, const FermionField &in, FermionField &out);


  void HandDhopSite(StencilImpl &st, LebesgueOrder &lo, DoubledGaugeField &U, SiteHalfSpinor * buf,
-			    int sF, int sU, const FermionField &in, FermionField &out,int interior,int exterior);
+		    int sF, int sU, const FermionField &in, FermionField &out);

  void HandDhopSiteDag(StencilImpl &st, LebesgueOrder &lo, DoubledGaugeField &U, SiteHalfSpinor * buf,
-			       int sF, int sU, const FermionField &in, FermionField &out,int interior,int exterior);
+		       int sF, int sU, const FermionField &in, FermionField &out);
      
+  void HandDhopSiteInt(StencilImpl &st, LebesgueOrder &lo, DoubledGaugeField &U, SiteHalfSpinor * buf,
+		       int sF, int sU, const FermionField &in, FermionField &out);
+  
+  void HandDhopSiteDagInt(StencilImpl &st, LebesgueOrder &lo, DoubledGaugeField &U, SiteHalfSpinor * buf,
+			  int sF, int sU, const FermionField &in, FermionField &out);
+  
+  void HandDhopSiteExt(StencilImpl &st, LebesgueOrder &lo, DoubledGaugeField &U, SiteHalfSpinor * buf,
+		       int sF, int sU, const FermionField &in, FermionField &out);
+  
+  void HandDhopSiteDagExt(StencilImpl &st, LebesgueOrder &lo, DoubledGaugeField &U, SiteHalfSpinor * buf,
+			  int sF, int sU, const FermionField &in, FermionField &out);
+  
 public:

  WilsonKernels(const ImplParams &p = ImplParams());
--- a/lib/qcd/action/fermion/WilsonKernelsAsm.cc
+++ b/lib/qcd/action/fermion/WilsonKernelsAsm.cc
@@ -112,5 +112,16 @@ INSTANTIATE_ASM(DomainWallVec5dImplD);
 INSTANTIATE_ASM(ZDomainWallVec5dImplF);
 INSTANTIATE_ASM(ZDomainWallVec5dImplD);

+INSTANTIATE_ASM(WilsonImplFH);
+INSTANTIATE_ASM(WilsonImplDF);
+INSTANTIATE_ASM(ZWilsonImplFH);
+INSTANTIATE_ASM(ZWilsonImplDF);
+INSTANTIATE_ASM(GparityWilsonImplFH);
+INSTANTIATE_ASM(GparityWilsonImplDF);
+INSTANTIATE_ASM(DomainWallVec5dImplFH);
+INSTANTIATE_ASM(DomainWallVec5dImplDF);
+INSTANTIATE_ASM(ZDomainWallVec5dImplFH);
+INSTANTIATE_ASM(ZDomainWallVec5dImplDF);
+
 }}

--- a/lib/qcd/action/fermion/WilsonKernelsAsmAvx512.h
+++ b/lib/qcd/action/fermion/WilsonKernelsAsmAvx512.h
@@ -71,6 +71,16 @@ WilsonKernels<ZWilsonImplF>::AsmDhopSite(StencilImpl &st,LebesgueOrder & lo,Doub
 						int ss,int ssU,int Ls,int Ns,const FermionField &in, FermionField &out)
 #include <qcd/action/fermion/WilsonKernelsAsmBody.h>

+template<> void 
+WilsonKernels<WilsonImplFH>::AsmDhopSite(StencilImpl &st,LebesgueOrder & lo,DoubledGaugeField &U, SiteHalfSpinor *buf,
+						int ss,int ssU,int Ls,int Ns,const FermionField &in, FermionField &out)
+#include <qcd/action/fermion/WilsonKernelsAsmBody.h>
+
+template<> void 
+WilsonKernels<ZWilsonImplFH>::AsmDhopSite(StencilImpl &st,LebesgueOrder & lo,DoubledGaugeField &U, SiteHalfSpinor *buf,
+						int ss,int ssU,int Ls,int Ns,const FermionField &in, FermionField &out)
+#include <qcd/action/fermion/WilsonKernelsAsmBody.h>
+
 #undef INTERIOR_AND_EXTERIOR
 #define INTERIOR
 #undef EXTERIOR
@@ -84,6 +94,16 @@ WilsonKernels<ZWilsonImplF>::AsmDhopSiteInt(StencilImpl &st,LebesgueOrder & lo,D
 						int ss,int ssU,int Ls,int Ns,const FermionField &in, FermionField &out)
 #include <qcd/action/fermion/WilsonKernelsAsmBody.h>

+template<> void 
+WilsonKernels<WilsonImplFH>::AsmDhopSiteInt(StencilImpl &st,LebesgueOrder & lo,DoubledGaugeField &U, SiteHalfSpinor *buf,
+						int ss,int ssU,int Ls,int Ns,const FermionField &in, FermionField &out)
+#include <qcd/action/fermion/WilsonKernelsAsmBody.h>
+
+template<> void 
+WilsonKernels<ZWilsonImplFH>::AsmDhopSiteInt(StencilImpl &st,LebesgueOrder & lo,DoubledGaugeField &U, SiteHalfSpinor *buf,
+						int ss,int ssU,int Ls,int Ns,const FermionField &in, FermionField &out)
+#include <qcd/action/fermion/WilsonKernelsAsmBody.h>
+

 #undef INTERIOR_AND_EXTERIOR
 #undef INTERIOR
@@ -97,6 +117,16 @@ template<> void
 WilsonKernels<ZWilsonImplF>::AsmDhopSiteExt(StencilImpl &st,LebesgueOrder & lo,DoubledGaugeField &U, SiteHalfSpinor *buf,
 						int ss,int ssU,int Ls,int Ns,const FermionField &in, FermionField &out)
 #include <qcd/action/fermion/WilsonKernelsAsmBody.h>
+
+template<> void 
+WilsonKernels<WilsonImplFH>::AsmDhopSiteExt(StencilImpl &st,LebesgueOrder & lo,DoubledGaugeField &U, SiteHalfSpinor *buf,
+						int ss,int ssU,int Ls,int Ns,const FermionField &in, FermionField &out)
+#include <qcd/action/fermion/WilsonKernelsAsmBody.h>
+
+template<> void 
+WilsonKernels<ZWilsonImplFH>::AsmDhopSiteExt(StencilImpl &st,LebesgueOrder & lo,DoubledGaugeField &U, SiteHalfSpinor *buf,
+						int ss,int ssU,int Ls,int Ns,const FermionField &in, FermionField &out)
+#include <qcd/action/fermion/WilsonKernelsAsmBody.h>
      
 /////////////////////////////////////////////////////////////////
 // XYZT vectorised, dag Kernel, single
@@ -115,6 +145,16 @@ WilsonKernels<ZWilsonImplF>::AsmDhopSiteDag(StencilImpl &st,LebesgueOrder & lo,D
 						int ss,int ssU,int Ls,int Ns,const FermionField &in, FermionField &out)
 #include <qcd/action/fermion/WilsonKernelsAsmBody.h>

+template<> void 
+WilsonKernels<WilsonImplFH>::AsmDhopSiteDag(StencilImpl &st,LebesgueOrder & lo,DoubledGaugeField &U, SiteHalfSpinor *buf,
+						int ss,int ssU,int Ls,int Ns,const FermionField &in, FermionField &out)
+#include <qcd/action/fermion/WilsonKernelsAsmBody.h>
+
+template<> void 
+WilsonKernels<ZWilsonImplFH>::AsmDhopSiteDag(StencilImpl &st,LebesgueOrder & lo,DoubledGaugeField &U, SiteHalfSpinor *buf,
+						int ss,int ssU,int Ls,int Ns,const FermionField &in, FermionField &out)
+#include <qcd/action/fermion/WilsonKernelsAsmBody.h>
+
 #undef INTERIOR_AND_EXTERIOR
 #define INTERIOR
 #undef EXTERIOR
@@ -128,6 +168,16 @@ WilsonKernels<ZWilsonImplF>::AsmDhopSiteDagInt(StencilImpl &st,LebesgueOrder & l
 						int ss,int ssU,int Ls,int Ns,const FermionField &in, FermionField &out)
 #include <qcd/action/fermion/WilsonKernelsAsmBody.h>

+template<> void 
+WilsonKernels<WilsonImplFH>::AsmDhopSiteDagInt(StencilImpl &st,LebesgueOrder & lo,DoubledGaugeField &U, SiteHalfSpinor *buf,
+						int ss,int ssU,int Ls,int Ns,const FermionField &in, FermionField &out)
+#include <qcd/action/fermion/WilsonKernelsAsmBody.h>
+
+template<> void 
+WilsonKernels<ZWilsonImplFH>::AsmDhopSiteDagInt(StencilImpl &st,LebesgueOrder & lo,DoubledGaugeField &U, SiteHalfSpinor *buf,
+						int ss,int ssU,int Ls,int Ns,const FermionField &in, FermionField &out)
+#include <qcd/action/fermion/WilsonKernelsAsmBody.h>
+
 #undef INTERIOR_AND_EXTERIOR
 #undef INTERIOR
 #define EXTERIOR
@@ -141,6 +191,16 @@ WilsonKernels<ZWilsonImplF>::AsmDhopSiteDagExt(StencilImpl &st,LebesgueOrder & l
 						int ss,int ssU,int Ls,int Ns,const FermionField &in, FermionField &out)
 #include <qcd/action/fermion/WilsonKernelsAsmBody.h>
 				    
+template<> void 
+WilsonKernels<WilsonImplFH>::AsmDhopSiteDagExt(StencilImpl &st,LebesgueOrder & lo,DoubledGaugeField &U, SiteHalfSpinor *buf,
+						int ss,int ssU,int Ls,int Ns,const FermionField &in, FermionField &out)
+#include <qcd/action/fermion/WilsonKernelsAsmBody.h>
+				    
+template<> void 
+WilsonKernels<ZWilsonImplFH>::AsmDhopSiteDagExt(StencilImpl &st,LebesgueOrder & lo,DoubledGaugeField &U, SiteHalfSpinor *buf,
+						int ss,int ssU,int Ls,int Ns,const FermionField &in, FermionField &out)
+#include <qcd/action/fermion/WilsonKernelsAsmBody.h>
+				    
 #undef MAYBEPERM
 #undef MULT_2SPIN
 #define MAYBEPERM(A,B) 
@@ -162,6 +222,15 @@ WilsonKernels<ZDomainWallVec5dImplF>::AsmDhopSite(StencilImpl &st,LebesgueOrder
 							 int ss,int ssU,int Ls,int Ns,const FermionField &in, FermionField &out)
 #include <qcd/action/fermion/WilsonKernelsAsmBody.h>

+template<> void 
+WilsonKernels<DomainWallVec5dImplFH>::AsmDhopSite(StencilImpl &st,LebesgueOrder & lo,DoubledGaugeField &U, SiteHalfSpinor *buf,
+							 int ss,int ssU,int Ls,int Ns,const FermionField &in, FermionField &out)
+#include <qcd/action/fermion/WilsonKernelsAsmBody.h>
+template<> void 
+WilsonKernels<ZDomainWallVec5dImplFH>::AsmDhopSite(StencilImpl &st,LebesgueOrder & lo,DoubledGaugeField &U, SiteHalfSpinor *buf,
+							 int ss,int ssU,int Ls,int Ns,const FermionField &in, FermionField &out)
+#include <qcd/action/fermion/WilsonKernelsAsmBody.h>
+
 #undef INTERIOR_AND_EXTERIOR
 #define INTERIOR
 #undef EXTERIOR
@@ -174,6 +243,15 @@ WilsonKernels<ZDomainWallVec5dImplF>::AsmDhopSiteInt(StencilImpl &st,LebesgueOrd
 							 int ss,int ssU,int Ls,int Ns,const FermionField &in, FermionField &out)
 #include <qcd/action/fermion/WilsonKernelsAsmBody.h>

+template<> void 
+WilsonKernels<DomainWallVec5dImplFH>::AsmDhopSiteInt(StencilImpl &st,LebesgueOrder & lo,DoubledGaugeField &U, SiteHalfSpinor *buf,
+							 int ss,int ssU,int Ls,int Ns,const FermionField &in, FermionField &out)
+#include <qcd/action/fermion/WilsonKernelsAsmBody.h>
+template<> void 
+WilsonKernels<ZDomainWallVec5dImplFH>::AsmDhopSiteInt(StencilImpl &st,LebesgueOrder & lo,DoubledGaugeField &U, SiteHalfSpinor *buf,
+							 int ss,int ssU,int Ls,int Ns,const FermionField &in, FermionField &out)
+#include <qcd/action/fermion/WilsonKernelsAsmBody.h>
+
 #undef INTERIOR_AND_EXTERIOR
 #undef INTERIOR
 #define EXTERIOR
@@ -189,6 +267,16 @@ WilsonKernels<ZDomainWallVec5dImplF>::AsmDhopSiteExt(StencilImpl &st,LebesgueOrd
 							 int ss,int ssU,int Ls,int Ns,const FermionField &in, FermionField &out)
 #include <qcd/action/fermion/WilsonKernelsAsmBody.h>
 				    
+template<> void 
+WilsonKernels<DomainWallVec5dImplFH>::AsmDhopSiteExt(StencilImpl &st,LebesgueOrder & lo,DoubledGaugeField &U, SiteHalfSpinor *buf,
+							 int ss,int ssU,int Ls,int Ns,const FermionField &in, FermionField &out)
+#include <qcd/action/fermion/WilsonKernelsAsmBody.h>
+				    
+template<> void 
+WilsonKernels<ZDomainWallVec5dImplFH>::AsmDhopSiteExt(StencilImpl &st,LebesgueOrder & lo,DoubledGaugeField &U, SiteHalfSpinor *buf,
+							 int ss,int ssU,int Ls,int Ns,const FermionField &in, FermionField &out)
+#include <qcd/action/fermion/WilsonKernelsAsmBody.h>
+				    
 /////////////////////////////////////////////////////////////////
 // Ls vectorised, dag Kernel, single
 /////////////////////////////////////////////////////////////////
@@ -205,6 +293,15 @@ WilsonKernels<ZDomainWallVec5dImplF>::AsmDhopSiteDag(StencilImpl &st,LebesgueOrd
 							    int ss,int ssU,int Ls,int Ns,const FermionField &in, FermionField &out)
 #include <qcd/action/fermion/WilsonKernelsAsmBody.h>

+template<> void 
+WilsonKernels<DomainWallVec5dImplFH>::AsmDhopSiteDag(StencilImpl &st,LebesgueOrder & lo,DoubledGaugeField &U,SiteHalfSpinor *buf,
+							    int ss,int ssU,int Ls,int Ns,const FermionField &in, FermionField &out)
+#include <qcd/action/fermion/WilsonKernelsAsmBody.h>
+template<> void 
+WilsonKernels<ZDomainWallVec5dImplFH>::AsmDhopSiteDag(StencilImpl &st,LebesgueOrder & lo,DoubledGaugeField &U,SiteHalfSpinor *buf,
+							    int ss,int ssU,int Ls,int Ns,const FermionField &in, FermionField &out)
+#include <qcd/action/fermion/WilsonKernelsAsmBody.h>
+
 #undef INTERIOR_AND_EXTERIOR
 #define INTERIOR
 #undef EXTERIOR
@@ -217,6 +314,15 @@ WilsonKernels<ZDomainWallVec5dImplF>::AsmDhopSiteDagInt(StencilImpl &st,Lebesgue
 							    int ss,int ssU,int Ls,int Ns,const FermionField &in, FermionField &out)
 #include <qcd/action/fermion/WilsonKernelsAsmBody.h>

+template<> void 
+WilsonKernels<DomainWallVec5dImplFH>::AsmDhopSiteDagInt(StencilImpl &st,LebesgueOrder & lo,DoubledGaugeField &U,SiteHalfSpinor *buf,
+							    int ss,int ssU,int Ls,int Ns,const FermionField &in, FermionField &out)
+#include <qcd/action/fermion/WilsonKernelsAsmBody.h>
+template<> void 
+WilsonKernels<ZDomainWallVec5dImplFH>::AsmDhopSiteDagInt(StencilImpl &st,LebesgueOrder & lo,DoubledGaugeField &U,SiteHalfSpinor *buf,
+							    int ss,int ssU,int Ls,int Ns,const FermionField &in, FermionField &out)
+#include <qcd/action/fermion/WilsonKernelsAsmBody.h>
+
 #undef INTERIOR_AND_EXTERIOR
 #undef INTERIOR
 #define EXTERIOR
@@ -229,6 +335,15 @@ WilsonKernels<ZDomainWallVec5dImplF>::AsmDhopSiteDagExt(StencilImpl &st,Lebesgue
 							    int ss,int ssU,int Ls,int Ns,const FermionField &in, FermionField &out)
 #include <qcd/action/fermion/WilsonKernelsAsmBody.h>

+template<> void 
+WilsonKernels<DomainWallVec5dImplFH>::AsmDhopSiteDagExt(StencilImpl &st,LebesgueOrder & lo,DoubledGaugeField &U,SiteHalfSpinor *buf,
+							    int ss,int ssU,int Ls,int Ns,const FermionField &in, FermionField &out)
+#include <qcd/action/fermion/WilsonKernelsAsmBody.h>
+template<> void 
+WilsonKernels<ZDomainWallVec5dImplFH>::AsmDhopSiteDagExt(StencilImpl &st,LebesgueOrder & lo,DoubledGaugeField &U,SiteHalfSpinor *buf,
+							    int ss,int ssU,int Ls,int Ns,const FermionField &in, FermionField &out)
+#include <qcd/action/fermion/WilsonKernelsAsmBody.h>
+
 #undef COMPLEX_SIGNS
 #undef MAYBEPERM
 #undef MULT_2SPIN
@@ -269,6 +384,15 @@ WilsonKernels<ZWilsonImplD>::AsmDhopSite(StencilImpl &st,LebesgueOrder & lo,Doub
 						int ss,int ssU,int Ls,int Ns,const FermionField &in, FermionField &out)
 #include <qcd/action/fermion/WilsonKernelsAsmBody.h>

+template<> void 
+WilsonKernels<WilsonImplDF>::AsmDhopSite(StencilImpl &st,LebesgueOrder & lo,DoubledGaugeField &U, SiteHalfSpinor *buf,
+						int ss,int ssU,int Ls,int Ns,const FermionField &in, FermionField &out)
+#include <qcd/action/fermion/WilsonKernelsAsmBody.h>
+template<> void 
+WilsonKernels<ZWilsonImplDF>::AsmDhopSite(StencilImpl &st,LebesgueOrder & lo,DoubledGaugeField &U, SiteHalfSpinor *buf,
+						int ss,int ssU,int Ls,int Ns,const FermionField &in, FermionField &out)
+#include <qcd/action/fermion/WilsonKernelsAsmBody.h>
+
 #undef INTERIOR_AND_EXTERIOR
 #define INTERIOR
 #undef EXTERIOR
@@ -281,6 +405,15 @@ WilsonKernels<ZWilsonImplD>::AsmDhopSiteInt(StencilImpl &st,LebesgueOrder & lo,D
 						int ss,int ssU,int Ls,int Ns,const FermionField &in, FermionField &out)
 #include <qcd/action/fermion/WilsonKernelsAsmBody.h>

+template<> void 
+WilsonKernels<WilsonImplDF>::AsmDhopSiteInt(StencilImpl &st,LebesgueOrder & lo,DoubledGaugeField &U, SiteHalfSpinor *buf,
+						int ss,int ssU,int Ls,int Ns,const FermionField &in, FermionField &out)
+#include <qcd/action/fermion/WilsonKernelsAsmBody.h>
+template<> void 
+WilsonKernels<ZWilsonImplDF>::AsmDhopSiteInt(StencilImpl &st,LebesgueOrder & lo,DoubledGaugeField &U, SiteHalfSpinor *buf,
+						int ss,int ssU,int Ls,int Ns,const FermionField &in, FermionField &out)
+#include <qcd/action/fermion/WilsonKernelsAsmBody.h>
+
 #undef INTERIOR_AND_EXTERIOR
 #undef INTERIOR
 #define EXTERIOR
@@ -293,6 +426,15 @@ WilsonKernels<ZWilsonImplD>::AsmDhopSiteExt(StencilImpl &st,LebesgueOrder & lo,D
 						int ss,int ssU,int Ls,int Ns,const FermionField &in, FermionField &out)
 #include <qcd/action/fermion/WilsonKernelsAsmBody.h>
      
+template<> void 
+WilsonKernels<WilsonImplDF>::AsmDhopSiteExt(StencilImpl &st,LebesgueOrder & lo,DoubledGaugeField &U, SiteHalfSpinor *buf,
+						int ss,int ssU,int Ls,int Ns,const FermionField &in, FermionField &out)
+#include <qcd/action/fermion/WilsonKernelsAsmBody.h>
+template<> void 
+WilsonKernels<ZWilsonImplDF>::AsmDhopSiteExt(StencilImpl &st,LebesgueOrder & lo,DoubledGaugeField &U, SiteHalfSpinor *buf,
+						int ss,int ssU,int Ls,int Ns,const FermionField &in, FermionField &out)
+#include <qcd/action/fermion/WilsonKernelsAsmBody.h>
+      
 /////////////////////////////////////////////////////////////////
 // XYZT vectorised, dag Kernel, single
 /////////////////////////////////////////////////////////////////
@@ -309,6 +451,15 @@ WilsonKernels<ZWilsonImplD>::AsmDhopSiteDag(StencilImpl &st,LebesgueOrder & lo,D
 						int ss,int ssU,int Ls,int Ns,const FermionField &in, FermionField &out)
 #include <qcd/action/fermion/WilsonKernelsAsmBody.h>

+template<> void 
+WilsonKernels<WilsonImplDF>::AsmDhopSiteDag(StencilImpl &st,LebesgueOrder & lo,DoubledGaugeField &U, SiteHalfSpinor *buf,
+						int ss,int ssU,int Ls,int Ns,const FermionField &in, FermionField &out)
+#include <qcd/action/fermion/WilsonKernelsAsmBody.h>
+template<> void 
+WilsonKernels<ZWilsonImplDF>::AsmDhopSiteDag(StencilImpl &st,LebesgueOrder & lo,DoubledGaugeField &U, SiteHalfSpinor *buf,
+						int ss,int ssU,int Ls,int Ns,const FermionField &in, FermionField &out)
+#include <qcd/action/fermion/WilsonKernelsAsmBody.h>
+
 #undef INTERIOR_AND_EXTERIOR
 #define INTERIOR
 #undef EXTERIOR
@@ -321,6 +472,15 @@ WilsonKernels<ZWilsonImplD>::AsmDhopSiteDagInt(StencilImpl &st,LebesgueOrder & l
 						int ss,int ssU,int Ls,int Ns,const FermionField &in, FermionField &out)
 #include <qcd/action/fermion/WilsonKernelsAsmBody.h>

+template<> void 
+WilsonKernels<WilsonImplDF>::AsmDhopSiteDagInt(StencilImpl &st,LebesgueOrder & lo,DoubledGaugeField &U, SiteHalfSpinor *buf,
+						int ss,int ssU,int Ls,int Ns,const FermionField &in, FermionField &out)
+#include <qcd/action/fermion/WilsonKernelsAsmBody.h>
+template<> void 
+WilsonKernels<ZWilsonImplDF>::AsmDhopSiteDagInt(StencilImpl &st,LebesgueOrder & lo,DoubledGaugeField &U, SiteHalfSpinor *buf,
+						int ss,int ssU,int Ls,int Ns,const FermionField &in, FermionField &out)
+#include <qcd/action/fermion/WilsonKernelsAsmBody.h>
+
 #undef INTERIOR_AND_EXTERIOR
 #undef INTERIOR
 #define EXTERIOR
@@ -333,6 +493,15 @@ WilsonKernels<ZWilsonImplD>::AsmDhopSiteDagExt(StencilImpl &st,LebesgueOrder & l
 						int ss,int ssU,int Ls,int Ns,const FermionField &in, FermionField &out)
 #include <qcd/action/fermion/WilsonKernelsAsmBody.h>
 				    
+template<> void 
+WilsonKernels<WilsonImplDF>::AsmDhopSiteDagExt(StencilImpl &st,LebesgueOrder & lo,DoubledGaugeField &U, SiteHalfSpinor *buf,
+						int ss,int ssU,int Ls,int Ns,const FermionField &in, FermionField &out)
+#include <qcd/action/fermion/WilsonKernelsAsmBody.h>
+template<> void 
+WilsonKernels<ZWilsonImplDF>::AsmDhopSiteDagExt(StencilImpl &st,LebesgueOrder & lo,DoubledGaugeField &U, SiteHalfSpinor *buf,
+						int ss,int ssU,int Ls,int Ns,const FermionField &in, FermionField &out)
+#include <qcd/action/fermion/WilsonKernelsAsmBody.h>
+				    
 #undef MAYBEPERM
 #undef MULT_2SPIN
 #define MAYBEPERM(A,B) 
@@ -354,6 +523,15 @@ WilsonKernels<ZDomainWallVec5dImplD>::AsmDhopSite(StencilImpl &st,LebesgueOrder
 							 int ss,int ssU,int Ls,int Ns,const FermionField &in, FermionField &out)
 #include <qcd/action/fermion/WilsonKernelsAsmBody.h>

+template<> void 
+WilsonKernels<DomainWallVec5dImplDF>::AsmDhopSite(StencilImpl &st,LebesgueOrder & lo,DoubledGaugeField &U, SiteHalfSpinor *buf,
+							 int ss,int ssU,int Ls,int Ns,const FermionField &in, FermionField &out)
+#include <qcd/action/fermion/WilsonKernelsAsmBody.h>
+template<> void 
+WilsonKernels<ZDomainWallVec5dImplDF>::AsmDhopSite(StencilImpl &st,LebesgueOrder & lo,DoubledGaugeField &U, SiteHalfSpinor *buf,
+							 int ss,int ssU,int Ls,int Ns,const FermionField &in, FermionField &out)
+#include <qcd/action/fermion/WilsonKernelsAsmBody.h>
+
 #undef INTERIOR_AND_EXTERIOR
 #define INTERIOR
 #undef EXTERIOR
@@ -366,6 +544,15 @@ WilsonKernels<ZDomainWallVec5dImplD>::AsmDhopSiteInt(StencilImpl &st,LebesgueOrd
 							 int ss,int ssU,int Ls,int Ns,const FermionField &in, FermionField &out)
 #include <qcd/action/fermion/WilsonKernelsAsmBody.h>

+template<> void 
+WilsonKernels<DomainWallVec5dImplDF>::AsmDhopSiteInt(StencilImpl &st,LebesgueOrder & lo,DoubledGaugeField &U, SiteHalfSpinor *buf,
+							 int ss,int ssU,int Ls,int Ns,const FermionField &in, FermionField &out)
+#include <qcd/action/fermion/WilsonKernelsAsmBody.h>
+template<> void 
+WilsonKernels<ZDomainWallVec5dImplDF>::AsmDhopSiteInt(StencilImpl &st,LebesgueOrder & lo,DoubledGaugeField &U, SiteHalfSpinor *buf,
+							 int ss,int ssU,int Ls,int Ns,const FermionField &in, FermionField &out)
+#include <qcd/action/fermion/WilsonKernelsAsmBody.h>
+
 #undef INTERIOR_AND_EXTERIOR
 #undef INTERIOR
 #define EXTERIOR
@@ -380,6 +567,15 @@ WilsonKernels<ZDomainWallVec5dImplD>::AsmDhopSiteExt(StencilImpl &st,LebesgueOrd
 							 int ss,int ssU,int Ls,int Ns,const FermionField &in, FermionField &out)
 #include <qcd/action/fermion/WilsonKernelsAsmBody.h>
 				    
+template<> void 
+WilsonKernels<DomainWallVec5dImplDF>::AsmDhopSiteExt(StencilImpl &st,LebesgueOrder & lo,DoubledGaugeField &U, SiteHalfSpinor *buf,
+							 int ss,int ssU,int Ls,int Ns,const FermionField &in, FermionField &out)
+#include <qcd/action/fermion/WilsonKernelsAsmBody.h>
+template<> void 
+WilsonKernels<ZDomainWallVec5dImplDF>::AsmDhopSiteExt(StencilImpl &st,LebesgueOrder & lo,DoubledGaugeField &U, SiteHalfSpinor *buf,
+							 int ss,int ssU,int Ls,int Ns,const FermionField &in, FermionField &out)
+#include <qcd/action/fermion/WilsonKernelsAsmBody.h>
+				    
 /////////////////////////////////////////////////////////////////
 // Ls vectorised, dag Kernel, single
 /////////////////////////////////////////////////////////////////
@@ -396,6 +592,15 @@ WilsonKernels<ZDomainWallVec5dImplD>::AsmDhopSiteDag(StencilImpl &st,LebesgueOrd
 							    int ss,int ssU,int Ls,int Ns,const FermionField &in, FermionField &out)
 #include <qcd/action/fermion/WilsonKernelsAsmBody.h>

+template<> void 
+WilsonKernels<DomainWallVec5dImplDF>::AsmDhopSiteDag(StencilImpl &st,LebesgueOrder & lo,DoubledGaugeField &U,SiteHalfSpinor *buf,
+							    int ss,int ssU,int Ls,int Ns,const FermionField &in, FermionField &out)
+#include <qcd/action/fermion/WilsonKernelsAsmBody.h>
+template<> void 
+WilsonKernels<ZDomainWallVec5dImplDF>::AsmDhopSiteDag(StencilImpl &st,LebesgueOrder & lo,DoubledGaugeField &U,SiteHalfSpinor *buf,
+							    int ss,int ssU,int Ls,int Ns,const FermionField &in, FermionField &out)
+#include <qcd/action/fermion/WilsonKernelsAsmBody.h>
+
 #undef INTERIOR_AND_EXTERIOR
 #define INTERIOR
 #undef EXTERIOR
@@ -408,6 +613,15 @@ WilsonKernels<ZDomainWallVec5dImplD>::AsmDhopSiteDagInt(StencilImpl &st,Lebesgue
 							    int ss,int ssU,int Ls,int Ns,const FermionField &in, FermionField &out)
 #include <qcd/action/fermion/WilsonKernelsAsmBody.h>

+template<> void 
+WilsonKernels<DomainWallVec5dImplDF>::AsmDhopSiteDagInt(StencilImpl &st,LebesgueOrder & lo,DoubledGaugeField &U,SiteHalfSpinor *buf,
+							    int ss,int ssU,int Ls,int Ns,const FermionField &in, FermionField &out)
+#include <qcd/action/fermion/WilsonKernelsAsmBody.h>
+template<> void 
+WilsonKernels<ZDomainWallVec5dImplDF>::AsmDhopSiteDagInt(StencilImpl &st,LebesgueOrder & lo,DoubledGaugeField &U,SiteHalfSpinor *buf,
+							    int ss,int ssU,int Ls,int Ns,const FermionField &in, FermionField &out)
+#include <qcd/action/fermion/WilsonKernelsAsmBody.h>
+
 #undef INTERIOR_AND_EXTERIOR
 #undef INTERIOR
 #define EXTERIOR
@@ -420,6 +634,15 @@ WilsonKernels<ZDomainWallVec5dImplD>::AsmDhopSiteDagExt(StencilImpl &st,Lebesgue
 							    int ss,int ssU,int Ls,int Ns,const FermionField &in, FermionField &out)
 #include <qcd/action/fermion/WilsonKernelsAsmBody.h>

+template<> void 
+WilsonKernels<DomainWallVec5dImplDF>::AsmDhopSiteDagExt(StencilImpl &st,LebesgueOrder & lo,DoubledGaugeField &U,SiteHalfSpinor *buf,
+							    int ss,int ssU,int Ls,int Ns,const FermionField &in, FermionField &out)
+#include <qcd/action/fermion/WilsonKernelsAsmBody.h>
+template<> void 
+WilsonKernels<ZDomainWallVec5dImplDF>::AsmDhopSiteDagExt(StencilImpl &st,LebesgueOrder & lo,DoubledGaugeField &U,SiteHalfSpinor *buf,
+							    int ss,int ssU,int Ls,int Ns,const FermionField &in, FermionField &out)
+#include <qcd/action/fermion/WilsonKernelsAsmBody.h>
+
 #undef COMPLEX_SIGNS
 #undef MAYBEPERM
 #undef MULT_2SPIN
--- a/lib/qcd/action/fermion/WilsonKernelsAsmBody.h
+++ b/lib/qcd/action/fermion/WilsonKernelsAsmBody.h
@@ -39,24 +39,26 @@
 ////////////////////////////////////////////////////////////////////////////////
 #ifdef INTERIOR_AND_EXTERIOR

-#define ZERO_NMU(A) 
-#define INTERIOR_BLOCK_XP(a,b,PERMUTE_DIR,PROJMEM,RECON) INTERIOR_BLOCK(a,b,PERMUTE_DIR,PROJMEM,RECON)
-#define EXTERIOR_BLOCK_XP(a,b,RECON) EXTERIOR_BLOCK(a,b,RECON)
+#define ASM_LEG(Dir,NxtDir,PERMUTE_DIR,PROJ,RECON)			\
+      basep = st.GetPFInfo(nent,plocal); nent++;			\
+      if ( local ) {							\
+	LOAD64(%r10,isigns);						\
+	PROJ(base);							\
+	MAYBEPERM(PERMUTE_DIR,perm);					\
+      } else {								\
+	LOAD_CHI(base);							\
+      }									\
+      base = st.GetInfo(ptype,local,perm,NxtDir,ent,plocal); ent++;	\
+      PREFETCH_CHIMU(base);						\
+      MULT_2SPIN_DIR_PF(Dir,basep);					\
+      LOAD64(%r10,isigns);						\
+      RECON;								\

-#define INTERIOR_BLOCK(a,b,PERMUTE_DIR,PROJMEM,RECON)	\
-  LOAD64(%r10,isigns);                                  \
-  PROJMEM(base);                                        \
-  MAYBEPERM(PERMUTE_DIR,perm);                                  
-
-#define EXTERIOR_BLOCK(a,b,RECON)             \
-  LOAD_CHI(base);
-
-#define COMMON_BLOCK(a,b,RECON)               \
-  base = st.GetInfo(ptype,local,perm,b,ent,plocal); ent++;     \
-  PREFETCH_CHIMU(base);                                         \
-  MULT_2SPIN_DIR_PF(a,basep);					\
-  LOAD64(%r10,isigns);                                          \
-  RECON;                                                        
+#define ASM_LEG_XP(Dir,NxtDir,PERMUTE_DIR,PROJ,RECON)			\
+  base = st.GetInfo(ptype,local,perm,Dir,ent,plocal); ent++;		\
+  PF_GAUGE(Xp);								\
+  PREFETCH1_CHIMU(base);						\
+  ASM_LEG(Dir,NxtDir,PERMUTE_DIR,PROJ,RECON) 

 #define RESULT(base,basep) SAVE_RESULT(base,basep);

@@ -67,62 +69,62 @@
 ////////////////////////////////////////////////////////////////////////////////
 #ifdef INTERIOR

-#define COMMON_BLOCK(a,b,RECON)       
-#define ZERO_NMU(A) 
+#define ASM_LEG(Dir,NxtDir,PERMUTE_DIR,PROJ,RECON)			\
+      basep = st.GetPFInfo(nent,plocal); nent++;			\
+      if ( local ) {							\
+	LOAD64(%r10,isigns);						\
+	PROJ(base);							\
+	MAYBEPERM(PERMUTE_DIR,perm);					\
+      }else if ( st.same_node[Dir] ) {LOAD_CHI(base);}			\
+      if ( local || st.same_node[Dir] ) {				\
+	MULT_2SPIN_DIR_PF(Dir,basep);					\
+	LOAD64(%r10,isigns);						\
+	RECON;								\
+      }									\
+      base = st.GetInfo(ptype,local,perm,NxtDir,ent,plocal); ent++;	\
+      PREFETCH_CHIMU(base);						\

-// No accumulate for DIR0
-#define EXTERIOR_BLOCK_XP(a,b,RECON)				\
-  ZERO_PSI;							\
-  base = st.GetInfo(ptype,local,perm,b,ent,plocal); ent++;	
-
-#define EXTERIOR_BLOCK(a,b,RECON)  \
-  base = st.GetInfo(ptype,local,perm,b,ent,plocal); ent++;     
-
-#define INTERIOR_BLOCK_XP(a,b,PERMUTE_DIR,PROJMEM,RECON) INTERIOR_BLOCK(a,b,PERMUTE_DIR,PROJMEM,RECON)
-
-#define INTERIOR_BLOCK(a,b,PERMUTE_DIR,PROJMEM,RECON)		\
-  LOAD64(%r10,isigns);						\
-  PROJMEM(base);                                                \
-  MAYBEPERM(PERMUTE_DIR,perm);                                  \
-  base = st.GetInfo(ptype,local,perm,b,ent,plocal); ent++;	\
-  PREFETCH_CHIMU(base);						\
-  MULT_2SPIN_DIR_PF(a,basep);					\
-  LOAD64(%r10,isigns);                                          \
-  RECON;                                                        
+#define ASM_LEG_XP(Dir,NxtDir,PERMUTE_DIR,PROJ,RECON)			\
+  base = st.GetInfo(ptype,local,perm,Dir,ent,plocal); ent++;		\
+  PF_GAUGE(Xp);								\
+  PREFETCH1_CHIMU(base);						\
+  { ZERO_PSI; }								\
+  ASM_LEG(Dir,NxtDir,PERMUTE_DIR,PROJ,RECON) 

 #define RESULT(base,basep) SAVE_RESULT(base,basep);

 #endif
-
 ////////////////////////////////////////////////////////////////////////////////
 // Post comms kernel
 ////////////////////////////////////////////////////////////////////////////////
 #ifdef EXTERIOR

-#define ZERO_NMU(A) nmu=0;

-#define INTERIOR_BLOCK_XP(a,b,PERMUTE_DIR,PROJMEM,RECON) \
-  ZERO_PSI;   base = st.GetInfo(ptype,local,perm,b,ent,plocal); ent++;		
+#define ASM_LEG(Dir,NxtDir,PERMUTE_DIR,PROJ,RECON)			\
+  base = st.GetInfo(ptype,local,perm,Dir,ent,plocal); ent++;		\
+  if((!local)&&(!st.same_node[Dir]) ) {					\
+    LOAD_CHI(base);							\
+    MULT_2SPIN_DIR_PF(Dir,base);					\
+    LOAD64(%r10,isigns);						\
+    RECON;								\
+    nmu++;								\
+  }									

-#define EXTERIOR_BLOCK_XP(a,b,RECON) EXTERIOR_BLOCK(a,b,RECON)
+#define ASM_LEG_XP(Dir,NxtDir,PERMUTE_DIR,PROJ,RECON)			\
+  nmu=0;								\
+  { ZERO_PSI; }								\
+  base = st.GetInfo(ptype,local,perm,Dir,ent,plocal); ent++;		\
+  if((!local)&&(!st.same_node[Dir]) ) {					\
+    LOAD_CHI(base);							\
+    MULT_2SPIN_DIR_PF(Dir,base);					\
+    LOAD64(%r10,isigns);						\
+    RECON;								\
+    nmu++;								\
+  }

-#define INTERIOR_BLOCK(a,b,PERMUTE_DIR,PROJMEM,RECON)			\
-  base = st.GetInfo(ptype,local,perm,b,ent,plocal); ent++;		
-
-#define EXTERIOR_BLOCK(a,b,RECON)				\
-    nmu++;							\
-    LOAD_CHI(base);						\
-    MULT_2SPIN_DIR_PF(a,base);					\
-    base = st.GetInfo(ptype,local,perm,b,ent,plocal); ent++;	\
-    LOAD64(%r10,isigns);					\
-    RECON;                                                        
-
-#define COMMON_BLOCK(a,b,RECON)			
-
-#define RESULT(base,basep) if (nmu){  ADD_RESULT(base,base);}
+#define RESULT(base,basep) if (nmu){ ADD_RESULT(base,base);}

 #endif
-
 {
  int nmu;
  int local,perm, ptype;
@@ -134,11 +136,15 @@
  MASK_REGS;
  int nmax=U._grid->oSites();
  for(int site=0;site<Ns;site++) {
+#ifndef EXTERIOR
    int sU =lo.Reorder(ssU);
    int ssn=ssU+1;     if(ssn>=nmax) ssn=0;
    int sUn=lo.Reorder(ssn);
-#ifndef EXTERIOR
    LOCK_GAUGE(0);
+#else
+    int sU =ssU;
+    int ssn=ssU+1;     if(ssn>=nmax) ssn=0;
+    int sUn=ssn;
 #endif
    for(int s=0;s<Ls;s++) {
      ss =sU*Ls+s;
@@ -146,93 +152,20 @@
      int  ent=ss*8;// 2*Ndim
      int nent=ssn*8;

-      ZERO_NMU(0);
-      base  = st.GetInfo(ptype,local,perm,Xp,ent,plocal); ent++;
-#ifndef EXTERIOR
-      PF_GAUGE(Xp); 
-      PREFETCH1_CHIMU(base);
-#endif
-      ////////////////////////////////
-      // Xp
-      ////////////////////////////////
-      basep = st.GetPFInfo(nent,plocal); nent++;
-      if ( local ) {
-	INTERIOR_BLOCK_XP(Xp,Yp,PERMUTE_DIR3,DIR0_PROJMEM,DIR0_RECON);
-      } else { 
-	EXTERIOR_BLOCK_XP(Xp,Yp,DIR0_RECON);
-      }
-      COMMON_BLOCK(Xp,Yp,DIR0_RECON);
-      ////////////////////////////////
-      // Yp
-      ////////////////////////////////
-      basep = st.GetPFInfo(nent,plocal); nent++;
-      if ( local ) {
-	INTERIOR_BLOCK(Yp,Zp,PERMUTE_DIR2,DIR1_PROJMEM,DIR1_RECON);
-      } else { 
-	EXTERIOR_BLOCK(Yp,Zp,DIR1_RECON);
-      }
-      COMMON_BLOCK(Yp,Zp,DIR1_RECON);
-      ////////////////////////////////
-      // Zp
-      ////////////////////////////////
-      basep = st.GetPFInfo(nent,plocal); nent++;
-      if ( local ) {
-	INTERIOR_BLOCK(Zp,Tp,PERMUTE_DIR1,DIR2_PROJMEM,DIR2_RECON);
-      } else { 
-	EXTERIOR_BLOCK(Zp,Tp,DIR2_RECON);
-      }
-      COMMON_BLOCK(Zp,Tp,DIR2_RECON);
-      ////////////////////////////////
-      // Tp
-      ////////////////////////////////
-      basep = st.GetPFInfo(nent,plocal); nent++;
-      if ( local ) {
-	INTERIOR_BLOCK(Tp,Xm,PERMUTE_DIR0,DIR3_PROJMEM,DIR3_RECON);
-      } else { 
-	EXTERIOR_BLOCK(Tp,Xm,DIR3_RECON);
-      }
-      COMMON_BLOCK(Tp,Xm,DIR3_RECON);
-      ////////////////////////////////
-      // Xm
-      ////////////////////////////////
-      //  basep= st.GetPFInfo(nent,plocal); nent++;
-      if ( local ) {
-	INTERIOR_BLOCK(Xm,Ym,PERMUTE_DIR3,DIR4_PROJMEM,DIR4_RECON);
-      } else { 
-	EXTERIOR_BLOCK(Xm,Ym,DIR4_RECON);
-      }
-      COMMON_BLOCK(Xm,Ym,DIR4_RECON);
-      ////////////////////////////////
-      // Ym
-      ////////////////////////////////
-      basep= st.GetPFInfo(nent,plocal); nent++;
-      if ( local ) {
-	INTERIOR_BLOCK(Ym,Zm,PERMUTE_DIR2,DIR5_PROJMEM,DIR5_RECON);
-      } else { 
-	EXTERIOR_BLOCK(Ym,Zm,DIR5_RECON);
-      }
-      COMMON_BLOCK(Ym,Zm,DIR5_RECON);
-      ////////////////////////////////
-      // Zm
-      ////////////////////////////////
-      basep= st.GetPFInfo(nent,plocal); nent++;
-      if ( local ) {
-	INTERIOR_BLOCK(Zm,Tm,PERMUTE_DIR1,DIR6_PROJMEM,DIR6_RECON);
-      } else { 
-	EXTERIOR_BLOCK(Zm,Tm,DIR6_RECON);
-      }
-      COMMON_BLOCK(Zm,Tm,DIR6_RECON);
-      ////////////////////////////////
-      // Tm
-      ////////////////////////////////
-      basep= st.GetPFInfo(nent,plocal); nent++;
-      if ( local ) {
-	INTERIOR_BLOCK(Tm,Xp,PERMUTE_DIR0,DIR7_PROJMEM,DIR7_RECON);
-      } else { 
-	EXTERIOR_BLOCK(Tm,Xp,DIR7_RECON);
-      }
-      COMMON_BLOCK(Tm,Xp,DIR7_RECON);
+      ASM_LEG_XP(Xp,Yp,PERMUTE_DIR3,DIR0_PROJMEM,DIR0_RECON);
+      ASM_LEG(Yp,Zp,PERMUTE_DIR2,DIR1_PROJMEM,DIR1_RECON);
+      ASM_LEG(Zp,Tp,PERMUTE_DIR1,DIR2_PROJMEM,DIR2_RECON);
+      ASM_LEG(Tp,Xm,PERMUTE_DIR0,DIR3_PROJMEM,DIR3_RECON);

+      ASM_LEG(Xm,Ym,PERMUTE_DIR3,DIR4_PROJMEM,DIR4_RECON);
+      ASM_LEG(Ym,Zm,PERMUTE_DIR2,DIR5_PROJMEM,DIR5_RECON);
+      ASM_LEG(Zm,Tm,PERMUTE_DIR1,DIR6_PROJMEM,DIR6_RECON);
+      ASM_LEG(Tm,Xp,PERMUTE_DIR0,DIR7_PROJMEM,DIR7_RECON);
+
+#ifdef EXTERIOR
+      if (nmu==0) break;
+      //      if (nmu!=0) std::cout << "EXT "<<sU<<std::endl;
+#endif
      base = (uint64_t) &out._odata[ss];
      basep= st.GetPFInfo(nent,plocal); nent++;
      RESULT(base,basep);
@@ -258,10 +191,6 @@
 #undef DIR5_RECON
 #undef DIR6_RECON
 #undef DIR7_RECON
-#undef EXTERIOR_BLOCK
-#undef INTERIOR_BLOCK
-#undef EXTERIOR_BLOCK_XP
-#undef INTERIOR_BLOCK_XP
-#undef COMMON_BLOCK
-#undef ZERO_NMU
+#undef ASM_LEG
+#undef ASM_LEG_XP
 #undef RESULT
--- a/lib/qcd/action/fermion/WilsonKernelsHand.cc
+++ b/lib/qcd/action/fermion/WilsonKernelsHand.cc
@@ -31,7 +31,7 @@ Author: paboyle <paboyle@ph.ed.ac.uk>
 #define REGISTER

 #define LOAD_CHIMU \
-  const SiteSpinor & ref (in._odata[offset]);	\
+  {const SiteSpinor & ref (in._odata[offset]);	\
    Chimu_00=ref()(0)(0);\
    Chimu_01=ref()(0)(1);\
    Chimu_02=ref()(0)(2);\
@@ -43,20 +43,20 @@ Author: paboyle <paboyle@ph.ed.ac.uk>
    Chimu_22=ref()(2)(2);\
    Chimu_30=ref()(3)(0);\
    Chimu_31=ref()(3)(1);\
-    Chimu_32=ref()(3)(2);
+    Chimu_32=ref()(3)(2);}

 #define LOAD_CHI\
-  const SiteHalfSpinor &ref(buf[offset]);	\
+  {const SiteHalfSpinor &ref(buf[offset]);	\
    Chi_00 = ref()(0)(0);\
    Chi_01 = ref()(0)(1);\
    Chi_02 = ref()(0)(2);\
    Chi_10 = ref()(1)(0);\
    Chi_11 = ref()(1)(1);\
-    Chi_12 = ref()(1)(2);
+    Chi_12 = ref()(1)(2);}

 // To splat or not to splat depends on the implementation
 #define MULT_2SPIN(A)\
-   auto & ref(U._odata[sU](A));	\
+  {auto & ref(U._odata[sU](A));			\
   Impl::loadLinkElement(U_00,ref()(0,0));	\
   Impl::loadLinkElement(U_10,ref()(1,0));	\
   Impl::loadLinkElement(U_20,ref()(2,0));	\
@@ -83,7 +83,7 @@ Author: paboyle <paboyle@ph.ed.ac.uk>
    UChi_01+= U_10*Chi_02;\
    UChi_11+= U_10*Chi_12;\
    UChi_02+= U_20*Chi_02;\
-    UChi_12+= U_20*Chi_12;
+    UChi_12+= U_20*Chi_12;}


 #define PERMUTE_DIR(dir)			\
@@ -307,55 +307,132 @@ Author: paboyle <paboyle@ph.ed.ac.uk>
  result_31-= UChi_11;	\
  result_32-= UChi_12;

-namespace Grid {
-namespace QCD {
+#define HAND_STENCIL_LEG(PROJ,PERM,DIR,RECON)	\
+  SE=st.GetEntry(ptype,DIR,ss);			\
+  offset = SE->_offset;				\
+  local  = SE->_is_local;			\
+  perm   = SE->_permute;			\
+  if ( local ) {				\
+    LOAD_CHIMU;					\
+    PROJ;					\
+    if ( perm) {				\
+      PERMUTE_DIR(PERM);			\
+    }						\
+  } else {					\
+    LOAD_CHI;					\
+  }						\
+  MULT_2SPIN(DIR);				\
+  RECON;					
+
+#define HAND_STENCIL_LEG_INT(PROJ,PERM,DIR,RECON)	\
+  SE=st.GetEntry(ptype,DIR,ss);			\
+  offset = SE->_offset;				\
+  local  = SE->_is_local;			\
+  perm   = SE->_permute;			\
+  if ( local ) {				\
+    LOAD_CHIMU;					\
+    PROJ;					\
+    if ( perm) {				\
+      PERMUTE_DIR(PERM);			\
+    }						\
+  } else if ( st.same_node[DIR] ) {		\
+    LOAD_CHI;					\
+  }						\
+  if (local || st.same_node[DIR] ) {		\
+    MULT_2SPIN(DIR);				\
+    RECON;					\
+  }
+
+#define HAND_STENCIL_LEG_EXT(PROJ,PERM,DIR,RECON)	\
+  SE=st.GetEntry(ptype,DIR,ss);			\
+  offset = SE->_offset;				\
+  if((!SE->_is_local)&&(!st.same_node[DIR]) ) {	\
+    LOAD_CHI;					\
+    MULT_2SPIN(DIR);				\
+    RECON;					\
+    nmu++;					\
+  }
+
+#define HAND_RESULT(ss)				\
+  {						\
+    SiteSpinor & ref (out._odata[ss]);		\
+    vstream(ref()(0)(0),result_00);		\
+    vstream(ref()(0)(1),result_01);		\
+    vstream(ref()(0)(2),result_02);		\
+    vstream(ref()(1)(0),result_10);		\
+    vstream(ref()(1)(1),result_11);		\
+    vstream(ref()(1)(2),result_12);		\
+    vstream(ref()(2)(0),result_20);		\
+    vstream(ref()(2)(1),result_21);		\
+    vstream(ref()(2)(2),result_22);		\
+    vstream(ref()(3)(0),result_30);		\
+    vstream(ref()(3)(1),result_31);		\
+    vstream(ref()(3)(2),result_32);		\
+  }
+
+#define HAND_RESULT_EXT(ss)			\
+  if (nmu){					\
+    SiteSpinor & ref (out._odata[ss]);		\
+    ref()(0)(0)+=result_00;		\
+    ref()(0)(1)+=result_01;		\
+    ref()(0)(2)+=result_02;		\
+    ref()(1)(0)+=result_10;		\
+    ref()(1)(1)+=result_11;		\
+    ref()(1)(2)+=result_12;		\
+    ref()(2)(0)+=result_20;		\
+    ref()(2)(1)+=result_21;		\
+    ref()(2)(2)+=result_22;		\
+    ref()(3)(0)+=result_30;		\
+    ref()(3)(1)+=result_31;		\
+    ref()(3)(2)+=result_32;		\
+  }


-template<class Impl> void 
-WilsonKernels<Impl>::HandDhopSite(StencilImpl &st,LebesgueOrder &lo,DoubledGaugeField &U,SiteHalfSpinor  *buf,
-					  int ss,int sU,const FermionField &in, FermionField &out,int interior,int exterior)
-{
-  typedef typename Simd::scalar_type S;
-  typedef typename Simd::vector_type V;
+#define HAND_DECLARATIONS(a)			\
+  Simd result_00;				\
+  Simd result_01;				\
+  Simd result_02;				\
+  Simd result_10;				\
+  Simd result_11;				\
+  Simd result_12;				\
+  Simd result_20;				\
+  Simd result_21;				\
+  Simd result_22;				\
+  Simd result_30;				\
+  Simd result_31;				\
+  Simd result_32;				\
+  Simd Chi_00;					\
+  Simd Chi_01;					\
+  Simd Chi_02;					\
+  Simd Chi_10;					\
+  Simd Chi_11;					\
+  Simd Chi_12;					\
+  Simd UChi_00;					\
+  Simd UChi_01;					\
+  Simd UChi_02;					\
+  Simd UChi_10;					\
+  Simd UChi_11;					\
+  Simd UChi_12;					\
+  Simd U_00;					\
+  Simd U_10;					\
+  Simd U_20;					\
+  Simd U_01;					\
+  Simd U_11;					\
+  Simd U_21;

-  REGISTER Simd result_00; // 12 regs on knc
-  REGISTER Simd result_01;
-  REGISTER Simd result_02;
-
-  REGISTER Simd result_10;
-  REGISTER Simd result_11;
-  REGISTER Simd result_12;
-
-  REGISTER Simd result_20;
-  REGISTER Simd result_21;
-  REGISTER Simd result_22;
-
-  REGISTER Simd result_30;
-  REGISTER Simd result_31;
-  REGISTER Simd result_32; // 20 left
-
-  REGISTER Simd Chi_00;    // two spinor; 6 regs
-  REGISTER Simd Chi_01;
-  REGISTER Simd Chi_02;
-
-  REGISTER Simd Chi_10;
-  REGISTER Simd Chi_11;
-  REGISTER Simd Chi_12;   // 14 left
-
-  REGISTER Simd UChi_00;  // two spinor; 6 regs
-  REGISTER Simd UChi_01;
-  REGISTER Simd UChi_02;
-
-  REGISTER Simd UChi_10;
-  REGISTER Simd UChi_11;
-  REGISTER Simd UChi_12;  // 8 left
-
-  REGISTER Simd U_00;  // two rows of U matrix
-  REGISTER Simd U_10;
-  REGISTER Simd U_20;  
-  REGISTER Simd U_01;
-  REGISTER Simd U_11;
-  REGISTER Simd U_21;  // 2 reg left.
+#define ZERO_RESULT				\
+  result_00=zero;				\
+  result_01=zero;				\
+  result_02=zero;				\
+  result_10=zero;				\
+  result_11=zero;				\
+  result_12=zero;				\
+  result_20=zero;				\
+  result_21=zero;				\
+  result_22=zero;				\
+  result_30=zero;				\
+  result_31=zero;				\
+  result_32=zero;			

 #define Chimu_00 Chi_00
 #define Chimu_01 Chi_01
@@ -370,475 +447,225 @@ WilsonKernels<Impl>::HandDhopSite(StencilImpl &st,LebesgueOrder &lo,DoubledGauge
 #define Chimu_31 UChi_11
 #define Chimu_32 UChi_12

+namespace Grid {
+namespace QCD {
+
+template<class Impl> void 
+WilsonKernels<Impl>::HandDhopSite(StencilImpl &st,LebesgueOrder &lo,DoubledGaugeField &U,SiteHalfSpinor  *buf,
+					  int ss,int sU,const FermionField &in, FermionField &out)
+{
+// T==0, Z==1, Y==2, X==3 expect 1,2,2,2 simd layout etc...
+  typedef typename Simd::scalar_type S;
+  typedef typename Simd::vector_type V;
+
+  HAND_DECLARATIONS(ignore);

  int offset,local,perm, ptype;
  StencilEntry *SE;

-  // Xp
-  SE=st.GetEntry(ptype,Xp,ss);
-  offset = SE->_offset;
-  local  = SE->_is_local;
-  perm   = SE->_permute;
-  
-  if ( local ) {
-    LOAD_CHIMU;
-    XM_PROJ;
-    if ( perm) {
-      PERMUTE_DIR(3); // T==0, Z==1, Y==2, Z==3 expect 1,2,2,2 simd layout etc...
-    }
-  } else { 
-    LOAD_CHI;
-  }
-  {
-    MULT_2SPIN(Xp);
-  }
-  XM_RECON;
-  
-  // Yp
-  SE=st.GetEntry(ptype,Yp,ss);
-  offset = SE->_offset;
-  local  = SE->_is_local;
-  perm   = SE->_permute;
-  
-  if ( local ) {
-    LOAD_CHIMU;
-    YM_PROJ;
-    if ( perm) {
-      PERMUTE_DIR(2); // T==0, Z==1, Y==2, Z==3 expect 1,2,2,2 simd layout etc...
-    }
-  } else { 
-    LOAD_CHI;
-  }
-  {
-    MULT_2SPIN(Yp);
-  }
-  YM_RECON_ACCUM;
-
-
-  // Zp
-  SE=st.GetEntry(ptype,Zp,ss);
-  offset = SE->_offset;
-  local  = SE->_is_local;
-  perm   = SE->_permute;
-  
-  if ( local ) {
-    LOAD_CHIMU;
-    ZM_PROJ;
-    if ( perm) {
-      PERMUTE_DIR(1); // T==0, Z==1, Y==2, Z==3 expect 1,2,2,2 simd layout etc...
-    }
-  } else { 
-    LOAD_CHI;
-  }
-  {
-    MULT_2SPIN(Zp);
-  }
-  ZM_RECON_ACCUM;
-
-  // Tp
-  SE=st.GetEntry(ptype,Tp,ss);
-  offset = SE->_offset;
-  local  = SE->_is_local;
-  perm   = SE->_permute;
-  
-  if ( local ) {
-    LOAD_CHIMU;
-    TM_PROJ;
-    if ( perm) {
-      PERMUTE_DIR(0); // T==0, Z==1, Y==2, Z==3 expect 1,2,2,2 simd layout etc...
-    }
-  } else { 
-    LOAD_CHI;
-  }
-  {
-    MULT_2SPIN(Tp);
-  }
-  TM_RECON_ACCUM;
-  
-  // Xm
-  SE=st.GetEntry(ptype,Xm,ss);
-  offset = SE->_offset;
-  local  = SE->_is_local;
-  perm   = SE->_permute;
-  
-  if ( local ) {
-    LOAD_CHIMU;
-    XP_PROJ;
-    if ( perm) {
-      PERMUTE_DIR(3); // T==0, Z==1, Y==2, Z==3 expect 1,2,2,2 simd layout etc...
-    }
-  } else { 
-    LOAD_CHI;
-  }
-  {
-    MULT_2SPIN(Xm);
-  }
-  XP_RECON_ACCUM;
-  
-  
-  // Ym
-  SE=st.GetEntry(ptype,Ym,ss);
-  offset = SE->_offset;
-  local  = SE->_is_local;
-  perm   = SE->_permute;
-  
-  if ( local ) {
-    LOAD_CHIMU;
-    YP_PROJ;
-    if ( perm) {
-      PERMUTE_DIR(2); // T==0, Z==1, Y==2, Z==3 expect 1,2,2,2 simd layout etc...
-    }
-  } else { 
-    LOAD_CHI;
-  }
-  {
-    MULT_2SPIN(Ym);
-  }
-  YP_RECON_ACCUM;
-
-  // Zm
-  SE=st.GetEntry(ptype,Zm,ss);
-  offset = SE->_offset;
-  local  = SE->_is_local;
-  perm   = SE->_permute;
-  
-  if ( local ) {
-    LOAD_CHIMU;
-    ZP_PROJ;
-    if ( perm) {
-      PERMUTE_DIR(1); // T==0, Z==1, Y==2, Z==3 expect 1,2,2,2 simd layout etc...
-    }
-  } else { 
-    LOAD_CHI;
-  }
-  {
-    MULT_2SPIN(Zm);
-  }
-  ZP_RECON_ACCUM;
-
-  // Tm
-  SE=st.GetEntry(ptype,Tm,ss);
-  offset = SE->_offset;
-  local  = SE->_is_local;
-  perm   = SE->_permute;
-  
-  if ( local ) {
-    LOAD_CHIMU;
-    TP_PROJ;
-    if ( perm) {
-      PERMUTE_DIR(0); // T==0, Z==1, Y==2, Z==3 expect 1,2,2,2 simd layout etc...
-    }
-  } else { 
-    LOAD_CHI;
-  }
-  {
-    MULT_2SPIN(Tm);
-  }
-  TP_RECON_ACCUM;
-
-  {
-    SiteSpinor & ref (out._odata[ss]);
-    vstream(ref()(0)(0),result_00);
-    vstream(ref()(0)(1),result_01);
-    vstream(ref()(0)(2),result_02);
-    vstream(ref()(1)(0),result_10);
-    vstream(ref()(1)(1),result_11);
-    vstream(ref()(1)(2),result_12);
-    vstream(ref()(2)(0),result_20);
-    vstream(ref()(2)(1),result_21);
-    vstream(ref()(2)(2),result_22);
-    vstream(ref()(3)(0),result_30);
-    vstream(ref()(3)(1),result_31);
-    vstream(ref()(3)(2),result_32);
-  }
+  HAND_STENCIL_LEG(XM_PROJ,3,Xp,XM_RECON);
+  HAND_STENCIL_LEG(YM_PROJ,2,Yp,YM_RECON_ACCUM);
+  HAND_STENCIL_LEG(ZM_PROJ,1,Zp,ZM_RECON_ACCUM);
+  HAND_STENCIL_LEG(TM_PROJ,0,Tp,TM_RECON_ACCUM);
+  HAND_STENCIL_LEG(XP_PROJ,3,Xm,XP_RECON_ACCUM);
+  HAND_STENCIL_LEG(YP_PROJ,2,Ym,YP_RECON_ACCUM);
+  HAND_STENCIL_LEG(ZP_PROJ,1,Zm,ZP_RECON_ACCUM);
+  HAND_STENCIL_LEG(TP_PROJ,0,Tm,TP_RECON_ACCUM);
+  HAND_RESULT(ss);
 }

 template<class Impl>
 void WilsonKernels<Impl>::HandDhopSiteDag(StencilImpl &st,LebesgueOrder &lo,DoubledGaugeField &U,SiteHalfSpinor *buf,
-						  int ss,int sU,const FermionField &in, FermionField &out,int interior,int exterior)
+						  int ss,int sU,const FermionField &in, FermionField &out)
 {
-  //  std::cout << "Hand op Dhop "<<std::endl;
  typedef typename Simd::scalar_type S;
  typedef typename Simd::vector_type V;

-  REGISTER Simd result_00; // 12 regs on knc
-  REGISTER Simd result_01;
-  REGISTER Simd result_02;
-  
-  REGISTER Simd result_10;
-  REGISTER Simd result_11;
-  REGISTER Simd result_12;
-
-  REGISTER Simd result_20;
-  REGISTER Simd result_21;
-  REGISTER Simd result_22;
-
-  REGISTER Simd result_30;
-  REGISTER Simd result_31;
-  REGISTER Simd result_32; // 20 left
-
-  REGISTER Simd Chi_00;    // two spinor; 6 regs
-  REGISTER Simd Chi_01;
-  REGISTER Simd Chi_02;
-
-  REGISTER Simd Chi_10;
-  REGISTER Simd Chi_11;
-  REGISTER Simd Chi_12;   // 14 left
-
-  REGISTER Simd UChi_00;  // two spinor; 6 regs
-  REGISTER Simd UChi_01;
-  REGISTER Simd UChi_02;
-
-  REGISTER Simd UChi_10;
-  REGISTER Simd UChi_11;
-  REGISTER Simd UChi_12;  // 8 left
-
-  REGISTER Simd U_00;  // two rows of U matrix
-  REGISTER Simd U_10;
-  REGISTER Simd U_20;  
-  REGISTER Simd U_01;
-  REGISTER Simd U_11;
-  REGISTER Simd U_21;  // 2 reg left.
-
-#define Chimu_00 Chi_00
-#define Chimu_01 Chi_01
-#define Chimu_02 Chi_02
-#define Chimu_10 Chi_10
-#define Chimu_11 Chi_11
-#define Chimu_12 Chi_12
-#define Chimu_20 UChi_00
-#define Chimu_21 UChi_01
-#define Chimu_22 UChi_02
-#define Chimu_30 UChi_10
-#define Chimu_31 UChi_11
-#define Chimu_32 UChi_12
-
+  HAND_DECLARATIONS(ignore);

  StencilEntry *SE;
  int offset,local,perm, ptype;
  
-  // Xp
-  SE=st.GetEntry(ptype,Xp,ss);
-  offset = SE->_offset;
-  local  = SE->_is_local;
-  perm   = SE->_permute;
-  
-  if ( local ) {
-    LOAD_CHIMU;
-    XP_PROJ;
-    if ( perm) {
-      PERMUTE_DIR(3); // T==0, Z==1, Y==2, Z==3 expect 1,2,2,2 simd layout etc...
-    }
-  } else { 
-    LOAD_CHI;
-  }
+  HAND_STENCIL_LEG(XP_PROJ,3,Xp,XP_RECON);
+  HAND_STENCIL_LEG(YP_PROJ,2,Yp,YP_RECON_ACCUM);
+  HAND_STENCIL_LEG(ZP_PROJ,1,Zp,ZP_RECON_ACCUM);
+  HAND_STENCIL_LEG(TP_PROJ,0,Tp,TP_RECON_ACCUM);
+  HAND_STENCIL_LEG(XM_PROJ,3,Xm,XM_RECON_ACCUM);
+  HAND_STENCIL_LEG(YM_PROJ,2,Ym,YM_RECON_ACCUM);
+  HAND_STENCIL_LEG(ZM_PROJ,1,Zm,ZM_RECON_ACCUM);
+  HAND_STENCIL_LEG(TM_PROJ,0,Tm,TM_RECON_ACCUM);
+  HAND_RESULT(ss);
+}

-  {
-    MULT_2SPIN(Xp);
-  }
-  XP_RECON;
+template<class Impl> void 
+WilsonKernels<Impl>::HandDhopSiteInt(StencilImpl &st,LebesgueOrder &lo,DoubledGaugeField &U,SiteHalfSpinor  *buf,
+					  int ss,int sU,const FermionField &in, FermionField &out)
+{
+// T==0, Z==1, Y==2, X==3 expect 1,2,2,2 simd layout etc...
+  typedef typename Simd::scalar_type S;
+  typedef typename Simd::vector_type V;

-  // Yp
-  SE=st.GetEntry(ptype,Yp,ss);
-  offset = SE->_offset;
-  local  = SE->_is_local;
-  perm   = SE->_permute;
-  
-  if ( local ) {
-    LOAD_CHIMU;
-    YP_PROJ;
-    if ( perm) {
-      PERMUTE_DIR(2); // T==0, Z==1, Y==2, Z==3 expect 1,2,2,2 simd layout etc...
-    }
-  } else { 
-    LOAD_CHI;
-  }
-  {
-    MULT_2SPIN(Yp);
-  }
-  YP_RECON_ACCUM;
+  HAND_DECLARATIONS(ignore);

+  int offset,local,perm, ptype;
+  StencilEntry *SE;
+  ZERO_RESULT;
+  HAND_STENCIL_LEG_INT(XM_PROJ,3,Xp,XM_RECON_ACCUM);
+  HAND_STENCIL_LEG_INT(YM_PROJ,2,Yp,YM_RECON_ACCUM);
+  HAND_STENCIL_LEG_INT(ZM_PROJ,1,Zp,ZM_RECON_ACCUM);
+  HAND_STENCIL_LEG_INT(TM_PROJ,0,Tp,TM_RECON_ACCUM);
+  HAND_STENCIL_LEG_INT(XP_PROJ,3,Xm,XP_RECON_ACCUM);
+  HAND_STENCIL_LEG_INT(YP_PROJ,2,Ym,YP_RECON_ACCUM);
+  HAND_STENCIL_LEG_INT(ZP_PROJ,1,Zm,ZP_RECON_ACCUM);
+  HAND_STENCIL_LEG_INT(TP_PROJ,0,Tm,TP_RECON_ACCUM);
+  HAND_RESULT(ss);
+}

-  // Zp
-  SE=st.GetEntry(ptype,Zp,ss);
-  offset = SE->_offset;
-  local  = SE->_is_local;
-  perm   = SE->_permute;
-  
-  if ( local ) {
-    LOAD_CHIMU;
-    ZP_PROJ;
-    if ( perm) {
-      PERMUTE_DIR(1); // T==0, Z==1, Y==2, Z==3 expect 1,2,2,2 simd layout etc...
-    }
-  } else { 
-    LOAD_CHI;
-  }
-  {
-    MULT_2SPIN(Zp);
-  }
-  ZP_RECON_ACCUM;
+template<class Impl>
+void WilsonKernels<Impl>::HandDhopSiteDagInt(StencilImpl &st,LebesgueOrder &lo,DoubledGaugeField &U,SiteHalfSpinor *buf,
+						  int ss,int sU,const FermionField &in, FermionField &out)
+{
+  typedef typename Simd::scalar_type S;
+  typedef typename Simd::vector_type V;

-  // Tp
-  SE=st.GetEntry(ptype,Tp,ss);
-  offset = SE->_offset;
-  local  = SE->_is_local;
-  perm   = SE->_permute;
-  
-  if ( local ) {
-    LOAD_CHIMU;
-    TP_PROJ;
-    if ( perm) {
-      PERMUTE_DIR(0); // T==0, Z==1, Y==2, Z==3 expect 1,2,2,2 simd layout etc...
-    }
-  } else { 
-    LOAD_CHI;
-  }
-  {
-    MULT_2SPIN(Tp);
-  }
-  TP_RECON_ACCUM;
-  
-  // Xm
-  SE=st.GetEntry(ptype,Xm,ss);
-  offset = SE->_offset;
-  local  = SE->_is_local;
-  perm   = SE->_permute;
-  
-  if ( local ) {
-    LOAD_CHIMU;
-    XM_PROJ;
-    if ( perm) {
-      PERMUTE_DIR(3); // T==0, Z==1, Y==2, Z==3 expect 1,2,2,2 simd layout etc...
-    }
-  } else { 
-    LOAD_CHI;
-  }
-  {
-    MULT_2SPIN(Xm);
-  }
-  XM_RECON_ACCUM;
-  
-  // Ym
-  SE=st.GetEntry(ptype,Ym,ss);
-  offset = SE->_offset;
-  local  = SE->_is_local;
-  perm   = SE->_permute;
-  
-  if ( local ) {
-    LOAD_CHIMU;
-    YM_PROJ;
-    if ( perm) {
-      PERMUTE_DIR(2); // T==0, Z==1, Y==2, Z==3 expect 1,2,2,2 simd layout etc...
-    }
-  } else { 
-    LOAD_CHI;
-  }
-  {
-    MULT_2SPIN(Ym);
-  }
-  YM_RECON_ACCUM;
+  HAND_DECLARATIONS(ignore);

-  // Zm
-  SE=st.GetEntry(ptype,Zm,ss);
-  offset = SE->_offset;
-  local  = SE->_is_local;
-  perm   = SE->_permute;
+  StencilEntry *SE;
+  int offset,local,perm, ptype;
+  ZERO_RESULT;
+  HAND_STENCIL_LEG_INT(XP_PROJ,3,Xp,XP_RECON_ACCUM);
+  HAND_STENCIL_LEG_INT(YP_PROJ,2,Yp,YP_RECON_ACCUM);
+  HAND_STENCIL_LEG_INT(ZP_PROJ,1,Zp,ZP_RECON_ACCUM);
+  HAND_STENCIL_LEG_INT(TP_PROJ,0,Tp,TP_RECON_ACCUM);
+  HAND_STENCIL_LEG_INT(XM_PROJ,3,Xm,XM_RECON_ACCUM);
+  HAND_STENCIL_LEG_INT(YM_PROJ,2,Ym,YM_RECON_ACCUM);
+  HAND_STENCIL_LEG_INT(ZM_PROJ,1,Zm,ZM_RECON_ACCUM);
+  HAND_STENCIL_LEG_INT(TM_PROJ,0,Tm,TM_RECON_ACCUM);
+  HAND_RESULT(ss);
+}

-  if ( local ) {
-    LOAD_CHIMU;
-    ZM_PROJ;
-    if ( perm) {
-      PERMUTE_DIR(1); // T==0, Z==1, Y==2, Z==3 expect 1,2,2,2 simd layout etc...
-    }
-  } else { 
-    LOAD_CHI;
-  }
-  {
-    MULT_2SPIN(Zm);
-  }
-  ZM_RECON_ACCUM;
+template<class Impl> void 
+WilsonKernels<Impl>::HandDhopSiteExt(StencilImpl &st,LebesgueOrder &lo,DoubledGaugeField &U,SiteHalfSpinor  *buf,
+					  int ss,int sU,const FermionField &in, FermionField &out)
+{
+// T==0, Z==1, Y==2, X==3 expect 1,2,2,2 simd layout etc...
+  typedef typename Simd::scalar_type S;
+  typedef typename Simd::vector_type V;

-  // Tm
-  SE=st.GetEntry(ptype,Tm,ss);
-  offset = SE->_offset;
-  local  = SE->_is_local;
-  perm   = SE->_permute;
+  HAND_DECLARATIONS(ignore);

-  if ( local ) {
-    LOAD_CHIMU;
-    TM_PROJ;
-    if ( perm) {
-      PERMUTE_DIR(0); // T==0, Z==1, Y==2, Z==3 expect 1,2,2,2 simd layout etc...
-    }
-  } else { 
-    LOAD_CHI;
-  }
-  {
-    MULT_2SPIN(Tm);
-  }
-  TM_RECON_ACCUM;
+  int offset,local,perm, ptype;
+  StencilEntry *SE;
+  int nmu=0;
+  ZERO_RESULT;
+  HAND_STENCIL_LEG_EXT(XM_PROJ,3,Xp,XM_RECON_ACCUM);
+  HAND_STENCIL_LEG_EXT(YM_PROJ,2,Yp,YM_RECON_ACCUM);
+  HAND_STENCIL_LEG_EXT(ZM_PROJ,1,Zp,ZM_RECON_ACCUM);
+  HAND_STENCIL_LEG_EXT(TM_PROJ,0,Tp,TM_RECON_ACCUM);
+  HAND_STENCIL_LEG_EXT(XP_PROJ,3,Xm,XP_RECON_ACCUM);
+  HAND_STENCIL_LEG_EXT(YP_PROJ,2,Ym,YP_RECON_ACCUM);
+  HAND_STENCIL_LEG_EXT(ZP_PROJ,1,Zm,ZP_RECON_ACCUM);
+  HAND_STENCIL_LEG_EXT(TP_PROJ,0,Tm,TP_RECON_ACCUM);
+  HAND_RESULT_EXT(ss);
+}

-  {
-    SiteSpinor & ref (out._odata[ss]);
-    vstream(ref()(0)(0),result_00);
-    vstream(ref()(0)(1),result_01);
-    vstream(ref()(0)(2),result_02);
-    vstream(ref()(1)(0),result_10);
-    vstream(ref()(1)(1),result_11);
-    vstream(ref()(1)(2),result_12);
-    vstream(ref()(2)(0),result_20);
-    vstream(ref()(2)(1),result_21);
-    vstream(ref()(2)(2),result_22);
-    vstream(ref()(3)(0),result_30);
-    vstream(ref()(3)(1),result_31);
-    vstream(ref()(3)(2),result_32);
-  }
+template<class Impl>
+void WilsonKernels<Impl>::HandDhopSiteDagExt(StencilImpl &st,LebesgueOrder &lo,DoubledGaugeField &U,SiteHalfSpinor *buf,
+						  int ss,int sU,const FermionField &in, FermionField &out)
+{
+  typedef typename Simd::scalar_type S;
+  typedef typename Simd::vector_type V;
+
+  HAND_DECLARATIONS(ignore);
+
+  StencilEntry *SE;
+  int offset,local,perm, ptype;
+  int nmu=0;
+  ZERO_RESULT;
+  HAND_STENCIL_LEG_EXT(XP_PROJ,3,Xp,XP_RECON_ACCUM);
+  HAND_STENCIL_LEG_EXT(YP_PROJ,2,Yp,YP_RECON_ACCUM);
+  HAND_STENCIL_LEG_EXT(ZP_PROJ,1,Zp,ZP_RECON_ACCUM);
+  HAND_STENCIL_LEG_EXT(TP_PROJ,0,Tp,TP_RECON_ACCUM);
+  HAND_STENCIL_LEG_EXT(XM_PROJ,3,Xm,XM_RECON_ACCUM);
+  HAND_STENCIL_LEG_EXT(YM_PROJ,2,Ym,YM_RECON_ACCUM);
+  HAND_STENCIL_LEG_EXT(ZM_PROJ,1,Zm,ZM_RECON_ACCUM);
+  HAND_STENCIL_LEG_EXT(TM_PROJ,0,Tm,TM_RECON_ACCUM);
+  HAND_RESULT_EXT(ss);
 }

  ////////////////////////////////////////////////
  // Specialise Gparity to simple implementation
  ////////////////////////////////////////////////
-template<> void 
-WilsonKernels<GparityWilsonImplF>::HandDhopSite(StencilImpl &st,LebesgueOrder &lo,DoubledGaugeField &U,
-							SiteHalfSpinor *buf,
-							int sF,int sU,const FermionField &in, FermionField &out,int internal,int external)
-{
-  assert(0);
-}
-
-template<> void 
-WilsonKernels<GparityWilsonImplF>::HandDhopSiteDag(StencilImpl &st,LebesgueOrder &lo,DoubledGaugeField &U,
-							   SiteHalfSpinor *buf,
-							   int sF,int sU,const FermionField &in, FermionField &out,int internal,int external)
-{
-  assert(0);
-}
-
-template<> void 
-WilsonKernels<GparityWilsonImplD>::HandDhopSite(StencilImpl &st,LebesgueOrder &lo,DoubledGaugeField &U,SiteHalfSpinor *buf,
-							int sF,int sU,const FermionField &in, FermionField &out,int internal,int external)
-{
-  assert(0);
-}
-
-template<> void 
-WilsonKernels<GparityWilsonImplD>::HandDhopSiteDag(StencilImpl &st,LebesgueOrder &lo,DoubledGaugeField &U,SiteHalfSpinor *buf,
-							   int sF,int sU,const FermionField &in, FermionField &out,int internal,int external)
-{
-  assert(0);
-}
-
+#define HAND_SPECIALISE_EMPTY(IMPL)					\
+  template<> void							\
+  WilsonKernels<IMPL>::HandDhopSite(StencilImpl &st,			\
+				    LebesgueOrder &lo,			\
+				    DoubledGaugeField &U,		\
+				    SiteHalfSpinor *buf,		\
+				    int sF,int sU,			\
+				    const FermionField &in,		\
+				    FermionField &out){ assert(0); }	\
+  template<> void							\
+  WilsonKernels<IMPL>::HandDhopSiteDag(StencilImpl &st,			\
+				    LebesgueOrder &lo,			\
+				    DoubledGaugeField &U,		\
+				    SiteHalfSpinor *buf,		\
+				    int sF,int sU,			\
+				    const FermionField &in,		\
+				    FermionField &out){ assert(0); }	\
+  template<> void							\
+  WilsonKernels<IMPL>::HandDhopSiteInt(StencilImpl &st,			\
+				    LebesgueOrder &lo,			\
+				    DoubledGaugeField &U,		\
+				    SiteHalfSpinor *buf,		\
+				    int sF,int sU,			\
+				    const FermionField &in,		\
+				    FermionField &out){ assert(0); }	\
+  template<> void							\
+  WilsonKernels<IMPL>::HandDhopSiteExt(StencilImpl &st,			\
+				    LebesgueOrder &lo,			\
+				    DoubledGaugeField &U,		\
+				    SiteHalfSpinor *buf,		\
+				    int sF,int sU,			\
+				    const FermionField &in,		\
+				    FermionField &out){ assert(0); }	\
+  template<> void							\
+  WilsonKernels<IMPL>::HandDhopSiteDagInt(StencilImpl &st,	       	\
+				    LebesgueOrder &lo,			\
+				    DoubledGaugeField &U,		\
+				    SiteHalfSpinor *buf,		\
+				    int sF,int sU,			\
+				    const FermionField &in,		\
+				    FermionField &out){ assert(0); }	\
+  template<> void							\
+  WilsonKernels<IMPL>::HandDhopSiteDagExt(StencilImpl &st,	       	\
+				    LebesgueOrder &lo,			\
+				    DoubledGaugeField &U,		\
+				    SiteHalfSpinor *buf,		\
+				    int sF,int sU,			\
+				    const FermionField &in,		\
+				    FermionField &out){ assert(0); }	\

+  HAND_SPECIALISE_EMPTY(GparityWilsonImplF);
+  HAND_SPECIALISE_EMPTY(GparityWilsonImplD);
+  HAND_SPECIALISE_EMPTY(GparityWilsonImplFH);
+  HAND_SPECIALISE_EMPTY(GparityWilsonImplDF);

 ////////////// Wilson ; uses this implementation /////////////////////
-// Need Nc=3 though //

 #define INSTANTIATE_THEM(A) \
 template void WilsonKernels<A>::HandDhopSite(StencilImpl &st,LebesgueOrder &lo,DoubledGaugeField &U,SiteHalfSpinor *buf,\
-						     int ss,int sU,const FermionField &in, FermionField &out,int interior,int exterior); \
-template void WilsonKernels<A>::HandDhopSiteDag(StencilImpl &st,LebesgueOrder &lo,DoubledGaugeField &U,SiteHalfSpinor *buf,\
-							int ss,int sU,const FermionField &in, FermionField &out,int interior,int exterior);
+					     int ss,int sU,const FermionField &in, FermionField &out); \
+template void WilsonKernels<A>::HandDhopSiteDag(StencilImpl &st,LebesgueOrder &lo,DoubledGaugeField &U,SiteHalfSpinor *buf, \
+						int ss,int sU,const FermionField &in, FermionField &out);\
+template void WilsonKernels<A>::HandDhopSiteInt(StencilImpl &st,LebesgueOrder &lo,DoubledGaugeField &U,SiteHalfSpinor *buf,\
+						int ss,int sU,const FermionField &in, FermionField &out); \
+template void WilsonKernels<A>::HandDhopSiteDagInt(StencilImpl &st,LebesgueOrder &lo,DoubledGaugeField &U,SiteHalfSpinor *buf, \
+						   int ss,int sU,const FermionField &in, FermionField &out); \
+template void WilsonKernels<A>::HandDhopSiteExt(StencilImpl &st,LebesgueOrder &lo,DoubledGaugeField &U,SiteHalfSpinor *buf,\
+						int ss,int sU,const FermionField &in, FermionField &out); \
+template void WilsonKernels<A>::HandDhopSiteDagExt(StencilImpl &st,LebesgueOrder &lo,DoubledGaugeField &U,SiteHalfSpinor *buf, \
+						   int ss,int sU,const FermionField &in, FermionField &out); 

 INSTANTIATE_THEM(WilsonImplF);
 INSTANTIATE_THEM(WilsonImplD);
@@ -850,5 +677,15 @@ INSTANTIATE_THEM(DomainWallVec5dImplF);
 INSTANTIATE_THEM(DomainWallVec5dImplD);
 INSTANTIATE_THEM(ZDomainWallVec5dImplF);
 INSTANTIATE_THEM(ZDomainWallVec5dImplD);
+INSTANTIATE_THEM(WilsonImplFH);
+INSTANTIATE_THEM(WilsonImplDF);
+INSTANTIATE_THEM(ZWilsonImplFH);
+INSTANTIATE_THEM(ZWilsonImplDF);
+INSTANTIATE_THEM(GparityWilsonImplFH);
+INSTANTIATE_THEM(GparityWilsonImplDF);
+INSTANTIATE_THEM(DomainWallVec5dImplFH);
+INSTANTIATE_THEM(DomainWallVec5dImplDF);
+INSTANTIATE_THEM(ZDomainWallVec5dImplFH);
+INSTANTIATE_THEM(ZDomainWallVec5dImplDF);

 }}
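Note on the Int/Ext split in WilsonKernelsHand.cc: the interior kernel handles legs that are local or on the same node and overwrites the output site, while the exterior kernel handles the remaining off-node legs and accumulates into the output only if at least one such leg fired (nmu != 0). A minimal sketch of that bookkeeping, assuming a scalar stand-in for the spinor; `Leg`, `site_full`, `site_interior` and `site_exterior` are hypothetical names for illustration, not Grid's API:

```cpp
#include <array>

// Hypothetical stand-in for one stencil leg: a contribution plus the two
// flags that the HAND_STENCIL_LEG_* macros consult.
struct Leg {
  double value;    // stands in for the reconstructed U*chi term
  bool is_local;   // models SE->_is_local
  bool same_node;  // models st.same_node[DIR]
};

constexpr int kLegs = 8;  // Xp,Yp,Zp,Tp,Xm,Ym,Zm,Tm
using Legs = std::array<Leg, kLegs>;

// Models HandDhopSite: every leg contributes.
double site_full(const Legs& legs) {
  double result = 0.0;
  for (const Leg& l : legs) result += l.value;
  return result;
}

// Models HandDhopSiteInt: only local or same-node legs contribute,
// and the result overwrites the output site.
double site_interior(const Legs& legs) {
  double result = 0.0;
  for (const Leg& l : legs)
    if (l.is_local || l.same_node) result += l.value;
  return result;
}

// Models HandDhopSiteExt: only off-node legs contribute, and the output
// site is touched only when at least one such leg exists (nmu != 0).
void site_exterior(const Legs& legs, double& out) {
  double result = 0.0;
  int nmu = 0;
  for (const Leg& l : legs) {
    if (!l.is_local && !l.same_node) {
      result += l.value;
      ++nmu;
    }
  }
  if (nmu) out += result;
}
```

Running the interior pass and then the exterior pass on the same site should reproduce the single-pass result, which is the property the split relies on.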
--- a/lib/qcd/spin/.dirstamp
+++ b/lib/qcd/spin/.dirstamp
--- a/lib/qcd/utils/.dirstamp
+++ b/lib/qcd/utils/.dirstamp
--- a/lib/serialisation/MacroMagic.h
+++ b/lib/serialisation/MacroMagic.h
@@ -54,7 +54,7 @@ THE SOFTWARE.

 #define GRID_MACRO_EMPTY()

-#define GRID_MACRO_EVAL(...)     GRID_MACRO_EVAL1024(__VA_ARGS__)
+#define GRID_MACRO_EVAL(...)     GRID_MACRO_EVAL64(__VA_ARGS__)
 #define GRID_MACRO_EVAL1024(...) GRID_MACRO_EVAL512(GRID_MACRO_EVAL512(__VA_ARGS__))
 #define GRID_MACRO_EVAL512(...)  GRID_MACRO_EVAL256(GRID_MACRO_EVAL256(__VA_ARGS__))
 #define GRID_MACRO_EVAL256(...)  GRID_MACRO_EVAL128(GRID_MACRO_EVAL128(__VA_ARGS__))
--- a/lib/simd/Grid_avx.h
+++ b/lib/simd/Grid_avx.h
@@ -377,8 +377,8 @@ namespace Optimization {
      b0 = _mm256_extractf128_si256(b,0);
      a1 = _mm256_extractf128_si256(a,1);
      b1 = _mm256_extractf128_si256(b,1);
-      a0 = _mm_mul_epi32(a0,b0);
-      a1 = _mm_mul_epi32(a1,b1);
+      a0 = _mm_mullo_epi32(a0,b0);
+      a1 = _mm_mullo_epi32(a1,b1);
      return _mm256_set_m128i(a1,a0);
 #endif
 #if defined (AVX2)
@@ -470,7 +470,52 @@ namespace Optimization {
      return in;
    };
  };
-
+#define USE_FP16
+  struct PrecisionChange {
+    static inline __m256i StoH (__m256 a,__m256 b) {
+      __m256i h;
+#ifdef USE_FP16
+      __m128i ha = _mm256_cvtps_ph(a,0);
+      __m128i hb = _mm256_cvtps_ph(b,0);
+      h =(__m256i) _mm256_castps128_ps256((__m128)ha);
+      h =(__m256i) _mm256_insertf128_ps((__m256)h,(__m128)hb,1);
+#else 
+      assert(0);
+#endif
+      return h;
+    }
+    static inline void  HtoS (__m256i h,__m256 &sa,__m256 &sb) {
+#ifdef USE_FP16
+      sa = _mm256_cvtph_ps((__m128i)_mm256_extractf128_ps((__m256)h,0));
+      sb = _mm256_cvtph_ps((__m128i)_mm256_extractf128_ps((__m256)h,1));
+#else 
+      assert(0);
+#endif
+    }
+    static inline __m256 DtoS (__m256d a,__m256d b) {
+      __m128 sa = _mm256_cvtpd_ps(a);
+      __m128 sb = _mm256_cvtpd_ps(b);
+      __m256 s = _mm256_castps128_ps256(sa);
+      s = _mm256_insertf128_ps(s,sb,1);
+      return s;
+    }
+    static inline void StoD (__m256 s,__m256d &a,__m256d &b) {
+      a = _mm256_cvtps_pd(_mm256_extractf128_ps(s,0));
+      b = _mm256_cvtps_pd(_mm256_extractf128_ps(s,1));
+    }
+    static inline __m256i DtoH (__m256d a,__m256d b,__m256d c,__m256d d) {
+      __m256 sa,sb;
+      sa = DtoS(a,b);
+      sb = DtoS(c,d);
+      return StoH(sa,sb);
+    }
+    static inline void HtoD (__m256i h,__m256d &a,__m256d &b,__m256d &c,__m256d &d) {
+      __m256 sa,sb;
+      HtoS(h,sa,sb);
+      StoD(sa,a,b);
+      StoD(sb,c,d);
+    }
+  };
  struct Exchange{
    // 3210 ordering
    static inline void Exchange0(__m256 &out1,__m256 &out2,__m256 in1,__m256 in2){
@@ -675,6 +720,7 @@ namespace Optimization {
 //////////////////////////////////////////////////////////////////////////////////////
 // Here assign types 

+  typedef __m256i SIMD_Htype;  // Half precision type
  typedef __m256  SIMD_Ftype; // Single precision type
  typedef __m256d SIMD_Dtype; // Double precision type
  typedef __m256i SIMD_Itype; // Integer type
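Note on the `_mm_mul_epi32` to `_mm_mullo_epi32` change in Grid_avx.h: `_mm_mul_epi32` multiplies only the even 32-bit lanes and writes 64-bit products, whereas `_mm_mullo_epi32` multiplies all four lanes and keeps the low 32 bits of each product, which is what an element-wise integer multiply needs. The following is a scalar model of the two intrinsics for illustration only; the `*_model` functions and `Lanes32` are hypothetical names, and no intrinsics are used so that it builds without SSE4.1 flags:

```cpp
#include <array>
#include <cstdint>

using Lanes32 = std::array<int32_t, 4>;

// Scalar model of _mm_mullo_epi32: four independent 32-bit products,
// keeping the low 32 bits of each.
Lanes32 mullo_epi32_model(const Lanes32& a, const Lanes32& b) {
  Lanes32 r{};
  for (int i = 0; i < 4; ++i) {
    const int64_t wide = static_cast<int64_t>(a[i]) * b[i];
    r[i] = static_cast<int32_t>(static_cast<uint32_t>(wide));
  }
  return r;
}

// Scalar model of _mm_mul_epi32: only lanes 0 and 2 are multiplied; each
// signed 64-bit product occupies two adjacent 32-bit lanes (low, high).
Lanes32 mul_epi32_model(const Lanes32& a, const Lanes32& b) {
  Lanes32 r{};
  for (int i = 0; i < 4; i += 2) {
    const uint64_t wide =
        static_cast<uint64_t>(static_cast<int64_t>(a[i]) * b[i]);
    r[i]     = static_cast<int32_t>(static_cast<uint32_t>(wide));
    r[i + 1] = static_cast<int32_t>(static_cast<uint32_t>(wide >> 32));
  }
  return r;
}
```

With inputs {1,2,3,4} and {5,6,7,8}, the `mullo` model gives the element-wise products, while the `mul` model drops the odd lanes entirely.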
--- a/lib/simd/Grid_avx512.h
+++ b/lib/simd/Grid_avx512.h
@@ -235,11 +235,9 @@ namespace Optimization {
    inline void mac(__m512 &a, __m512 b, __m512 c){         
       a= _mm512_fmadd_ps( b, c, a);                         
    }
-
    inline void mac(__m512d &a, __m512d b, __m512d c){
      a= _mm512_fmadd_pd( b, c, a);                   
    }                                             
-
    // Real float
    inline __m512 operator()(__m512 a, __m512 b){
      return _mm512_mul_ps(a,b);
@@ -342,7 +340,52 @@ namespace Optimization {
    };

  };
-
+#define USE_FP16
+  struct PrecisionChange {
+    static inline __m512i StoH (__m512 a,__m512 b) {
+      __m512i h;
+#ifdef USE_FP16
+      __m256i ha = _mm512_cvtps_ph(a,0);
+      __m256i hb = _mm512_cvtps_ph(b,0);
+      h =(__m512i) _mm512_castps256_ps512((__m256)ha);
+      h =(__m512i) _mm512_insertf64x4((__m512d)h,(__m256d)hb,1);
+#else
+      assert(0);
+#endif
+      return h;
+    }
+    static inline void  HtoS (__m512i h,__m512 &sa,__m512 &sb) {
+#ifdef USE_FP16
+      sa = _mm512_cvtph_ps((__m256i)_mm512_extractf64x4_pd((__m512d)h,0));
+      sb = _mm512_cvtph_ps((__m256i)_mm512_extractf64x4_pd((__m512d)h,1));
+#else
+      assert(0);
+#endif
+    }
+    static inline __m512 DtoS (__m512d a,__m512d b) {
+      __m256 sa = _mm512_cvtpd_ps(a);
+      __m256 sb = _mm512_cvtpd_ps(b);
+      __m512 s = _mm512_castps256_ps512(sa);
+      s =(__m512) _mm512_insertf64x4((__m512d)s,(__m256d)sb,1);
+      return s;
+    }
+    static inline void StoD (__m512 s,__m512d &a,__m512d &b) {
+      a = _mm512_cvtps_pd((__m256)_mm512_extractf64x4_pd((__m512d)s,0));
+      b = _mm512_cvtps_pd((__m256)_mm512_extractf64x4_pd((__m512d)s,1));
+    }
+    static inline __m512i DtoH (__m512d a,__m512d b,__m512d c,__m512d d) {
+      __m512 sa,sb;
+      sa = DtoS(a,b);
+      sb = DtoS(c,d);
+      return StoH(sa,sb);
+    }
+    static inline void HtoD (__m512i h,__m512d &a,__m512d &b,__m512d &c,__m512d &d) {
+      __m512 sa,sb;
+      HtoS(h,sa,sb);
+      StoD(sa,a,b);
+      StoD(sb,c,d);
+    }
+  };
  // On extracting face: Ah Al , Bh Bl -> Ah Bh, Al Bl
  // On merging buffers: Ah,Bh , Al Bl -> Ah Al, Bh, Bl
  // The operation is its own inverse
@@ -539,7 +582,9 @@ namespace Optimization {
 //////////////////////////////////////////////////////////////////////////////////////
 // Here assign types 

-  typedef __m512 SIMD_Ftype;  // Single precision type
+
+  typedef __m512i SIMD_Htype;  // Half precision type
+  typedef __m512  SIMD_Ftype;  // Single precision type
  typedef __m512d SIMD_Dtype; // Double precision type
  typedef __m512i SIMD_Itype; // Integer type

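Note on `PrecisionChange::DtoS` / `StoD` in the AVX and AVX512 headers: two double-precision vectors are narrowed and packed into the lower and upper halves of one single-precision vector, and `StoD` undoes the packing. A scalar sketch of that layout, assuming plain `std::array` in place of the SIMD registers; `dtos_model` and `stod_model` are hypothetical names, not part of Grid:

```cpp
#include <array>
#include <cstddef>

// Models DtoS: narrow two N-wide double vectors into one 2N-wide float
// vector, with a in the lower half and b in the upper half.
template <std::size_t N>
std::array<float, 2 * N> dtos_model(const std::array<double, N>& a,
                                    const std::array<double, N>& b) {
  std::array<float, 2 * N> s{};
  for (std::size_t i = 0; i < N; ++i) {
    s[i]     = static_cast<float>(a[i]);
    s[i + N] = static_cast<float>(b[i]);
  }
  return s;
}

// Models StoD: widen the lower half into a and the upper half into b.
template <std::size_t N>
void stod_model(const std::array<float, 2 * N>& s,
                std::array<double, N>& a,
                std::array<double, N>& b) {
  for (std::size_t i = 0; i < N; ++i) {
    a[i] = static_cast<double>(s[i]);
    b[i] = static_cast<double>(s[i + N]);
  }
}
```

The round trip is exact only for values representable in single precision; anything else is rounded on the way down.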
--- a/lib/simd/Grid_generic.h
+++ b/lib/simd/Grid_generic.h
@@ -279,6 +279,101 @@ namespace Optimization {
  
  #undef timesi

+  struct PrecisionChange {
+    static inline vech StoH (const vecf &a,const vecf &b) {
+      vech ret;
+#ifdef USE_FP16
+      vech *ha = (vech *)&a;
+      vech *hb = (vech *)&b;
+      const int nf = W<float>::r;
+      //      VECTOR_FOR(i, nf,1){ ret.v[i]    = ( (uint16_t *) &a.v[i])[1] ; }
+      //      VECTOR_FOR(i, nf,1){ ret.v[i+nf] = ( (uint16_t *) &b.v[i])[1] ; }
+      VECTOR_FOR(i, nf,1){ ret.v[i]    = ha->v[2*i+1]; }
+      VECTOR_FOR(i, nf,1){ ret.v[i+nf] = hb->v[2*i+1]; }
+#else
+      assert(0);
+#endif
+      return ret;
+    }
+    static inline void  HtoS (vech h,vecf &sa,vecf &sb) {
+#ifdef USE_FP16
+      const int nf = W<float>::r;
+      const int nh = W<uint16_t>::r;
+      vech *ha = (vech *)&sa;
+      vech *hb = (vech *)&sb;
+      VECTOR_FOR(i, nf, 1){ sb.v[i]= sa.v[i] = 0; }
+      //      VECTOR_FOR(i, nf, 1){ ( (uint16_t *) (&sa.v[i]))[1] = h.v[i];}
+      //      VECTOR_FOR(i, nf, 1){ ( (uint16_t *) (&sb.v[i]))[1] = h.v[i+nf];}
+      VECTOR_FOR(i, nf, 1){ ha->v[2*i+1]=h.v[i]; }
+      VECTOR_FOR(i, nf, 1){ hb->v[2*i+1]=h.v[i+nf]; }
+#else
+      assert(0);
+#endif
+    }
+    static inline vecf DtoS (vecd a,vecd b) {
+      const int nd = W<double>::r;
+      const int nf = W<float>::r;
+      vecf ret;
+      VECTOR_FOR(i, nd,1){ ret.v[i]    = a.v[i] ; }
+      VECTOR_FOR(i, nd,1){ ret.v[i+nd] = b.v[i] ; }
+      return ret;
+    }
+    static inline void StoD (vecf s,vecd &a,vecd &b) {
+      const int nd = W<double>::r;
+      VECTOR_FOR(i, nd,1){ a.v[i] = s.v[i] ; }
+      VECTOR_FOR(i, nd,1){ b.v[i] = s.v[i+nd] ; }
+    }
+    static inline vech DtoH (vecd a,vecd b,vecd c,vecd d) {
+      vecf sa,sb;
+      sa = DtoS(a,b);
+      sb = DtoS(c,d);
+      return StoH(sa,sb);
+    }
+    static inline void HtoD (vech h,vecd &a,vecd &b,vecd &c,vecd &d) {
+      vecf sa,sb;
+      HtoS(h,sa,sb);
+      StoD(sa,a,b);
+      StoD(sb,c,d);
+    }
+  };
+
+  //////////////////////////////////////////////
+  // Exchange support
+  struct Exchange{
+
+    template <typename T,int n>
+    static inline void ExchangeN(vec<T> &out1,vec<T> &out2,vec<T> &in1,vec<T> &in2){
+      const int w = W<T>::r;
+      unsigned int mask = w >> (n + 1);
+      //      std::cout << " Exchange "<<n<<" nsimd "<<w<<" mask 0x" <<std::hex<<mask<<std::dec<<std::endl;
+      VECTOR_FOR(i, w, 1) {	
+	int j1 = i&(~mask);
+	if  ( (i&mask) == 0 ) { out1.v[i]=in1.v[j1];}
+	else                  { out1.v[i]=in2.v[j1];}
+	int j2 = i|mask;
+	if  ( (i&mask) == 0 ) { out2.v[i]=in1.v[j2];}
+	else                  { out2.v[i]=in2.v[j2];}
+      }      
+    }
+    template <typename T>
+    static inline void Exchange0(vec<T> &out1,vec<T> &out2,vec<T> &in1,vec<T> &in2){
+      ExchangeN<T,0>(out1,out2,in1,in2);
+    };
+    template <typename T>
+    static inline void Exchange1(vec<T> &out1,vec<T> &out2,vec<T> &in1,vec<T> &in2){
+      ExchangeN<T,1>(out1,out2,in1,in2);
+    };
+    template <typename T>
+    static inline void Exchange2(vec<T> &out1,vec<T> &out2,vec<T> &in1,vec<T> &in2){
+      ExchangeN<T,2>(out1,out2,in1,in2);
+    };
+    template <typename T>
+    static inline void Exchange3(vec<T> &out1,vec<T> &out2,vec<T> &in1,vec<T> &in2){
+      ExchangeN<T,3>(out1,out2,in1,in2);
+    };
+  };
+
+
  //////////////////////////////////////////////
  // Some Template specialization
  #define perm(a, b, n, w)\
@@ -403,6 +498,7 @@ namespace Optimization {
 //////////////////////////////////////////////////////////////////////////////////////
 // Here assign types 

+  typedef Optimization::vech SIMD_Htype; // Reduced precision type
  typedef Optimization::vecf SIMD_Ftype; // Single precision type
  typedef Optimization::vecd SIMD_Dtype; // Double precision type
  typedef Optimization::veci SIMD_Itype; // Integer type
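
The index arithmetic in the generic `ExchangeN` above can be checked by hand on a four-lane vector. What follows is a minimal standalone sketch, not Grid code: `exchange_sketch` is a hypothetical helper that applies the same `mask`, `j1` and `j2` logic to the characters of two strings, so the lane movement is easy to read.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for the generic ExchangeN (not part of Grid).
// Each character of in1/in2 plays the role of one SIMD lane.
std::string exchange_sketch(int n, const std::string &in1, const std::string &in2) {
  assert(in1.size() == in2.size());
  const unsigned w = static_cast<unsigned>(in1.size());
  const unsigned mask = w >> (n + 1);        // block size exchanged at level n
  std::string out1(w, '?'), out2(w, '?');
  for (unsigned i = 0; i < w; ++i) {
    const bool from_in1 = (i & mask) == 0;   // lanes with the mask bit clear read in1
    const unsigned j1 = i & ~mask;           // source lane with the mask bit cleared
    const unsigned j2 = i | mask;            // source lane with the mask bit set
    out1[i] = from_in1 ? in1[j1] : in2[j1];
    out2[i] = from_in1 ? in1[j2] : in2[j2];
  }
  return out1 + "|" + out2;                  // out1 and out2 joined for inspection
}
```

With inputs `"ABCD"` and `"EFGH"`, level 0 exchanges blocks of two lanes and level 1 exchanges single lanes, which is the ordering the SSE4 `Exchange1` in this commit reproduces with a second shuffle.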
--- a/lib/simd/Grid_generic_types.h
+++ b/lib/simd/Grid_generic_types.h
@@ -66,6 +66,10 @@ namespace Optimization {
  template <> struct W<Integer> {
    constexpr static unsigned int r = GEN_SIMD_WIDTH/4u;
  };
+  template <> struct W<uint16_t> {
+    constexpr static unsigned int c = GEN_SIMD_WIDTH/4u;
+    constexpr static unsigned int r = GEN_SIMD_WIDTH/2u;
+  };
  
  // SIMD vector types
  template <typename T>
@@ -73,8 +77,9 @@ namespace Optimization {
    alignas(GEN_SIMD_WIDTH) T v[W<T>::r];
  };

-  typedef vec<float>   vecf;
-  typedef vec<double>  vecd;
-  typedef vec<Integer> veci;
+  typedef vec<float>     vecf;
+  typedef vec<double>    vecd;
+  typedef vec<uint16_t>  vech; // half precision comms
+  typedef vec<Integer>   veci;
  
 }}
--- a/lib/simd/Grid_qpx.h
+++ b/lib/simd/Grid_qpx.h
@@ -33,6 +33,14 @@
 #include "Grid_generic_types.h" // Definitions for simulated integer SIMD.

 namespace Grid {
+
+#ifdef QPX
+#include <spi/include/kernel/location.h>
+#include <spi/include/l1p/types.h>
+#include <hwi/include/bqc/l1p_mmio.h>
+#include <hwi/include/bqc/A2_inlines.h>
+#endif
+
 namespace Optimization {
  typedef struct 
  {
@@ -125,7 +133,6 @@ namespace Optimization {
      f[2] = a.v2;
      f[3] = a.v3;
    }
-
    //Double
    inline void operator()(double *d, vector4double a){
      vec_st(a, 0, d);
--- a/lib/simd/Grid_sse4.h
+++ b/lib/simd/Grid_sse4.h
@@ -328,6 +328,140 @@ namespace Optimization {
    };
  };

+  
+#define _my_alignr_epi32(a,b,n) _mm_alignr_epi8(a,b,(n*4)%16)
+#define _my_alignr_epi64(a,b,n) _mm_alignr_epi8(a,b,(n*8)%16)
+
+#ifdef SFW_FP16
+
+  struct Grid_half {
+    Grid_half(){}
+    Grid_half(uint16_t raw) : x(raw) {}
+    uint16_t x;
+  };
+  union FP32 {
+    unsigned int u;
+    float f;
+  };
+
+  // PAB - Lifted and adapted from Eigen, which is MPL 2.0 licensed
+  inline float sfw_half_to_float(Grid_half h) {
+    const FP32 magic = { 113 << 23 };
+    const unsigned int shifted_exp = 0x7c00 << 13; // exponent mask after shift
+    FP32 o;
+    o.u = (h.x & 0x7fff) << 13;             // exponent/mantissa bits
+    unsigned int exp = shifted_exp & o.u;   // just the exponent
+    o.u += (127 - 15) << 23;                // exponent adjust
+    // handle exponent special cases
+    if (exp == shifted_exp) {     // Inf/NaN?
+      o.u += (128 - 16) << 23;    // extra exp adjust
+    } else if (exp == 0) {        // Zero/Denormal?
+      o.u += 1 << 23;             // extra exp adjust
+      o.f -= magic.f;             // renormalize
+    }
+    o.u |= (h.x & 0x8000) << 16;    // sign bit
+    return o.f;
+  }
+  inline Grid_half sfw_float_to_half(float ff) {
+    FP32 f; f.f = ff;
+    const FP32 f32infty = { 255 << 23 };
+    const FP32 f16max = { (127 + 16) << 23 };
+    const FP32 denorm_magic = { ((127 - 15) + (23 - 10) + 1) << 23 };
+    unsigned int sign_mask = 0x80000000u;
+    Grid_half o;
+    
+    o.x = static_cast<unsigned short>(0x0u);
+    unsigned int sign = f.u & sign_mask;
+    f.u ^= sign;
+    // NOTE all the integer compares in this function can be safely
+    // compiled into signed compares since all operands are below
+    // 0x80000000. Important if you want fast straight SSE2 code
+    // (since there's no unsigned PCMPGTD).
+    if (f.u >= f16max.u) {  // result is Inf or NaN (all exponent bits set)
+      o.x = (f.u > f32infty.u) ? 0x7e00 : 0x7c00; // NaN->qNaN and Inf->Inf
+    } else {  // (De)normalized number or zero
+      if (f.u < (113 << 23)) {  // resulting FP16 is subnormal or zero
+	// use a magic value to align our 10 mantissa bits at the bottom of
+	// the float. as long as FP addition is round-to-nearest-even this
+	// just works.
+	f.f += denorm_magic.f;
+	// and one integer subtract of the bias later, we have our final float!
+	o.x = static_cast<unsigned short>(f.u - denorm_magic.u);
+      } else {
+	unsigned int mant_odd = (f.u >> 13) & 1; // resulting mantissa is odd
+	
+	// update exponent, rounding bias part 1
+	f.u += ((unsigned int)(15 - 127) << 23) + 0xfff;
+	// rounding bias part 2
+	f.u += mant_odd;
+	// take the bits!
+	o.x = static_cast<unsigned short>(f.u >> 13);
+      }
+    } 
+    o.x |= static_cast<unsigned short>(sign >> 16);
+    return o;
+  }
+  static inline __m128i Grid_mm_cvtps_ph(__m128 f,int discard) {
+    __m128i ret=(__m128i)_mm_setzero_ps();
+    float *fp = (float *)&f;
+    Grid_half *hp = (Grid_half *)&ret;
+    hp[0] = sfw_float_to_half(fp[0]);
+    hp[1] = sfw_float_to_half(fp[1]);
+    hp[2] = sfw_float_to_half(fp[2]);
+    hp[3] = sfw_float_to_half(fp[3]);
+    return ret;
+  }
+  static inline __m128 Grid_mm_cvtph_ps(__m128i h,int discard) {
+    __m128 ret=_mm_setzero_ps();
+    float *fp = (float *)&ret;
+    Grid_half  *hp = (Grid_half *)&h;
+    fp[0] = sfw_half_to_float(hp[0]);
+    fp[1] = sfw_half_to_float(hp[1]);
+    fp[2] = sfw_half_to_float(hp[2]);
+    fp[3] = sfw_half_to_float(hp[3]);
+    return ret;
+  }
+#else 
+#define Grid_mm_cvtps_ph _mm_cvtps_ph
+#define Grid_mm_cvtph_ps _mm_cvtph_ps
+#endif
+  struct PrecisionChange {
+    static inline __m128i StoH (__m128 a,__m128 b) {
+      __m128i ha = Grid_mm_cvtps_ph(a,0);
+      __m128i hb = Grid_mm_cvtps_ph(b,0);
+      __m128i h =(__m128i) _mm_shuffle_ps((__m128)ha,(__m128)hb,_MM_SELECT_FOUR_FOUR(1,0,1,0));
+      return h;
+    }
+    static inline void  HtoS (__m128i h,__m128 &sa,__m128 &sb) {
+      sa = Grid_mm_cvtph_ps(h,0); 
+      h =  (__m128i)_my_alignr_epi32((__m128i)h,(__m128i)h,2);
+      sb = Grid_mm_cvtph_ps(h,0);
+    }
+    static inline __m128 DtoS (__m128d a,__m128d b) {
+      __m128 sa = _mm_cvtpd_ps(a);
+      __m128 sb = _mm_cvtpd_ps(b);
+      __m128 s = _mm_shuffle_ps(sa,sb,_MM_SELECT_FOUR_FOUR(1,0,1,0));
+      return s;
+    }
+    static inline void StoD (__m128 s,__m128d &a,__m128d &b) {
+      a = _mm_cvtps_pd(s);
+      s = (__m128)_my_alignr_epi32((__m128i)s,(__m128i)s,2);
+      b = _mm_cvtps_pd(s);
+    }
+    static inline __m128i DtoH (__m128d a,__m128d b,__m128d c,__m128d d) {
+      __m128 sa,sb;
+      sa = DtoS(a,b);
+      sb = DtoS(c,d);
+      return StoH(sa,sb);
+    }
+    static inline void HtoD (__m128i h,__m128d &a,__m128d &b,__m128d &c,__m128d &d) {
+      __m128 sa,sb;
+      HtoS(h,sa,sb);
+      StoD(sa,a,b);
+      StoD(sb,c,d);
+    }
+  };
+
  struct Exchange{
    // 3210 ordering
    static inline void Exchange0(__m128 &out1,__m128 &out2,__m128 in1,__m128 in2){
@@ -335,8 +469,10 @@ namespace Optimization {
      out2= _mm_shuffle_ps(in1,in2,_MM_SELECT_FOUR_FOUR(3,2,3,2));
    };
    static inline void Exchange1(__m128 &out1,__m128 &out2,__m128 in1,__m128 in2){
-      out1= _mm_shuffle_ps(in1,in2,_MM_SELECT_FOUR_FOUR(2,0,2,0));
-      out2= _mm_shuffle_ps(in1,in2,_MM_SELECT_FOUR_FOUR(3,1,3,1));
+      out1= _mm_shuffle_ps(in1,in2,_MM_SELECT_FOUR_FOUR(2,0,2,0)); /*ACEG*/
+      out2= _mm_shuffle_ps(in1,in2,_MM_SELECT_FOUR_FOUR(3,1,3,1)); /*BDFH*/
+      out1= _mm_shuffle_ps(out1,out1,_MM_SELECT_FOUR_FOUR(3,1,2,0)); /*AECG*/
+      out2= _mm_shuffle_ps(out2,out2,_MM_SELECT_FOUR_FOUR(3,1,2,0)); /*BFDH*/
    };
    static inline void Exchange2(__m128 &out1,__m128 &out2,__m128 in1,__m128 in2){
      assert(0);
@@ -383,14 +519,9 @@ namespace Optimization {
      default: assert(0);
      }
    }
-  
-#ifndef _mm_alignr_epi64
-#define _mm_alignr_epi32(a,b,n) _mm_alignr_epi8(a,b,(n*4)%16)
-#define _mm_alignr_epi64(a,b,n) _mm_alignr_epi8(a,b,(n*8)%16)
-#endif 

-    template<int n> static inline __m128  tRotate(__m128  in){ return (__m128)_mm_alignr_epi32((__m128i)in,(__m128i)in,n); };
-    template<int n> static inline __m128d tRotate(__m128d in){ return (__m128d)_mm_alignr_epi64((__m128i)in,(__m128i)in,n); };
+    template<int n> static inline __m128  tRotate(__m128  in){ return (__m128)_my_alignr_epi32((__m128i)in,(__m128i)in,n); };
+    template<int n> static inline __m128d tRotate(__m128d in){ return (__m128d)_my_alignr_epi64((__m128i)in,(__m128i)in,n); };

  };
  //////////////////////////////////////////////
@@ -450,7 +581,8 @@ namespace Optimization {
 //////////////////////////////////////////////////////////////////////////////////////
 // Here assign types 

-  typedef __m128 SIMD_Ftype;  // Single precision type
+  typedef __m128i SIMD_Htype;  // Reduced precision type
+  typedef __m128  SIMD_Ftype;  // Single precision type
  typedef __m128d SIMD_Dtype; // Double precision type
  typedef __m128i SIMD_Itype; // Integer type

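
The software FP16 path in this file (`sfw_half_to_float` / `sfw_float_to_half`) can be exercised in isolation. The sketch below follows the same bit manipulation but is an illustration under stated assumptions, not the code in the patch: the names `half_bits_to_float` and `float_to_half_bits` are hypothetical, raw `uint16_t` bits replace `Grid_half`, and `std::memcpy` replaces the `FP32` union for type punning.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Bit-level reinterpretation helpers (memcpy avoids union punning).
static float bits_to_float(std::uint32_t u) { float f; std::memcpy(&f, &u, sizeof f); return f; }
static std::uint32_t float_to_bits(float f) { std::uint32_t u; std::memcpy(&u, &f, sizeof u); return u; }

// Hypothetical name: IEEE half bits -> float, same steps as sfw_half_to_float.
float half_bits_to_float(std::uint16_t h) {
  const std::uint32_t shifted_exp = 0x7c00u << 13;   // half exponent mask after shift
  std::uint32_t u = (h & 0x7fffu) << 13;             // exponent and mantissa bits
  const std::uint32_t exp = shifted_exp & u;         // just the exponent
  u += (127u - 15u) << 23;                           // rebias exponent
  if (exp == shifted_exp) {                          // Inf/NaN
    u += (128u - 16u) << 23;
  } else if (exp == 0) {                             // zero or subnormal
    u += 1u << 23;
    u = float_to_bits(bits_to_float(u) - bits_to_float(113u << 23)); // renormalize
  }
  u |= (h & 0x8000u) << 16;                          // sign bit
  return bits_to_float(u);
}

// Hypothetical name: float -> IEEE half bits, same steps as sfw_float_to_half.
std::uint16_t float_to_half_bits(float ff) {
  const std::uint32_t f32infty = 255u << 23;
  const std::uint32_t f16max = (127u + 16u) << 23;
  const std::uint32_t denorm_magic = ((127u - 15u) + (23u - 10u) + 1u) << 23;
  std::uint32_t u = float_to_bits(ff);
  const std::uint32_t sign = u & 0x80000000u;
  u ^= sign;                                         // work on the magnitude
  std::uint16_t h;
  if (u >= f16max) {                                 // overflow, Inf or NaN
    h = (u > f32infty) ? 0x7e00 : 0x7c00;
  } else if (u < (113u << 23)) {                     // result is subnormal or zero
    u = float_to_bits(bits_to_float(u) + bits_to_float(denorm_magic));
    h = static_cast<std::uint16_t>(u - denorm_magic);
  } else {                                           // normal number, round to nearest even
    const std::uint32_t mant_odd = (u >> 13) & 1u;
    u += (static_cast<std::uint32_t>(15 - 127) << 23) + 0xfffu;
    u += mant_odd;
    h = static_cast<std::uint16_t>(u >> 13);
  }
  return static_cast<std::uint16_t>(h | (sign >> 16));
}
```

Values that are exactly representable in half precision, such as 1.0, 0.5 and -2.0, survive a round trip unchanged; magnitudes of 65536 and above saturate to infinity.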
--- a/lib/simd/Grid_vector_types.h
+++ b/lib/simd/Grid_vector_types.h
@@ -2,7 +2,7 @@

 Grid physics library, www.github.com/paboyle/Grid

-Source file: ./lib/simd/Grid_vector_types.h
+Source file: ./lib/simd/Grid_vector_type.h

 Copyright (C) 2015

@@ -53,12 +53,14 @@ directory
 #if defined IMCI
 #include "Grid_imci.h"
 #endif
-#if defined QPX
-#include "Grid_qpx.h"
-#endif
 #ifdef NEONv8
 #include "Grid_neon.h"
 #endif
+#if defined QPX
+#include "Grid_qpx.h"
+#endif
+
+#include "l1p.h"

 namespace Grid {

@@ -74,12 +76,14 @@ struct RealPart<std::complex<T> > {
  typedef T type;
 };

+#include <type_traits>
+
 //////////////////////////////////////
 // demote a vector to real type
 //////////////////////////////////////
 // type alias used to simplify the syntax of std::enable_if
 template <typename T> using Invoke = typename T::type;
-template <typename Condition, typename ReturnType> using EnableIf = Invoke<std::enable_if<Condition::value, ReturnType> >;
+template <typename Condition, typename ReturnType> using EnableIf    = Invoke<std::enable_if<Condition::value, ReturnType> >;
 template <typename Condition, typename ReturnType> using NotEnableIf = Invoke<std::enable_if<!Condition::value, ReturnType> >;

 ////////////////////////////////////////////////////////
@@ -88,13 +92,15 @@ template <typename T> struct is_complex : public std::false_type {};
 template <> struct is_complex<std::complex<double> > : public std::true_type {};
 template <> struct is_complex<std::complex<float> > : public std::true_type {};

-template <typename T> using IfReal       = Invoke<std::enable_if<std::is_floating_point<T>::value, int> >;
-template <typename T> using IfComplex    = Invoke<std::enable_if<is_complex<T>::value, int> >;
-template <typename T> using IfInteger    = Invoke<std::enable_if<std::is_integral<T>::value, int> >;
+template <typename T>              using IfReal    = Invoke<std::enable_if<std::is_floating_point<T>::value, int> >;
+template <typename T>              using IfComplex = Invoke<std::enable_if<is_complex<T>::value, int> >;
+template <typename T>              using IfInteger = Invoke<std::enable_if<std::is_integral<T>::value, int> >;
+template <typename T1,typename T2> using IfSame    = Invoke<std::enable_if<std::is_same<T1,T2>::value, int> >;

-template <typename T> using IfNotReal    = Invoke<std::enable_if<!std::is_floating_point<T>::value, int> >;
-template <typename T> using IfNotComplex = Invoke<std::enable_if<!is_complex<T>::value, int> >;
-template <typename T> using IfNotInteger = Invoke<std::enable_if<!std::is_integral<T>::value, int> >;
+template <typename T>              using IfNotReal    = Invoke<std::enable_if<!std::is_floating_point<T>::value, int> >;
+template <typename T>              using IfNotComplex = Invoke<std::enable_if<!is_complex<T>::value, int> >;
+template <typename T>              using IfNotInteger = Invoke<std::enable_if<!std::is_integral<T>::value, int> >;
+template <typename T1,typename T2> using IfNotSame    = Invoke<std::enable_if<!std::is_same<T1,T2>::value, int> >;

 ////////////////////////////////////////////////////////
 // Define the operation templates functors
@@ -358,16 +364,12 @@ class Grid_simd {
  {
    if       (n==3) {
      Optimization::Exchange::Exchange3(out1.v,out2.v,in1.v,in2.v);
-      //      std::cout << " Exchange3 "<< out1<<" "<< out2<<" <- " << in1 << " "<<in2<<std::endl;
    } else if(n==2) {
      Optimization::Exchange::Exchange2(out1.v,out2.v,in1.v,in2.v);
-      //      std::cout << " Exchange2 "<< out1<<" "<< out2<<" <- " << in1 << " "<<in2<<std::endl;
    } else if(n==1) {
      Optimization::Exchange::Exchange1(out1.v,out2.v,in1.v,in2.v);
-      //      std::cout << " Exchange1 "<< out1<<" "<< out2<<" <- " << in1 << " "<<in2<<std::endl;
    } else if(n==0) { 
      Optimization::Exchange::Exchange0(out1.v,out2.v,in1.v,in2.v);
-      //      std::cout << " Exchange0 "<< out1<<" "<< out2<<" <- " << in1 << " "<<in2<<std::endl;
    }
  }

@@ -415,7 +417,6 @@ template <class S, class V, IfNotComplex<S> = 0>
 inline Grid_simd<S, V> rotate(Grid_simd<S, V> b, int nrot) {
  nrot = nrot % Grid_simd<S, V>::Nsimd();
  Grid_simd<S, V> ret;
-  //    std::cout << "Rotate Real by "<<nrot<<std::endl;
  ret.v = Optimization::Rotate::rotate(b.v, nrot);
  return ret;
 }
@@ -423,7 +424,6 @@ template <class S, class V, IfComplex<S> = 0>
 inline Grid_simd<S, V> rotate(Grid_simd<S, V> b, int nrot) {
  nrot = nrot % Grid_simd<S, V>::Nsimd();
  Grid_simd<S, V> ret;
-  //    std::cout << "Rotate Complex by "<<nrot<<std::endl;
  ret.v = Optimization::Rotate::rotate(b.v, 2 * nrot);
  return ret;
 }
@@ -431,14 +431,12 @@ template <class S, class V, IfNotComplex<S> =0>
 inline void rotate( Grid_simd<S,V> &ret,Grid_simd<S,V> b,int nrot)
 {
  nrot = nrot % Grid_simd<S,V>::Nsimd();
-  //    std::cout << "Rotate Real by "<<nrot<<std::endl;
  ret.v = Optimization::Rotate::rotate(b.v,nrot);
 }
 template <class S, class V, IfComplex<S> =0> 
 inline void rotate(Grid_simd<S,V> &ret,Grid_simd<S,V> b,int nrot)
 {
  nrot = nrot % Grid_simd<S,V>::Nsimd();
-  //    std::cout << "Rotate Complex by "<<nrot<<std::endl;
  ret.v = Optimization::Rotate::rotate(b.v,2*nrot);
 }

@@ -698,7 +696,6 @@ inline Grid_simd<S, V> innerProduct(const Grid_simd<S, V> &l,
                                    const Grid_simd<S, V> &r) {
  return conjugate(l) * r;
 }
-
 template <class S, class V>
 inline Grid_simd<S, V> outerProduct(const Grid_simd<S, V> &l,
                                    const Grid_simd<S, V> &r) {
@@ -758,6 +755,67 @@ typedef Grid_simd<std::complex<float>, SIMD_Ftype> vComplexF;
 typedef Grid_simd<std::complex<double>, SIMD_Dtype> vComplexD;
 typedef Grid_simd<Integer, SIMD_Itype> vInteger;

+// Half precision; no arithmetic support
+typedef Grid_simd<uint16_t, SIMD_Htype>               vRealH;
+typedef Grid_simd<std::complex<uint16_t>, SIMD_Htype> vComplexH;
+
+inline void precisionChange(vRealF    *out,vRealD    *in,int nvec)
+{
+  assert((nvec&0x1)==0);
+  for(int m=0;m*2<nvec;m++){
+    int n=m*2;
+    out[m].v=Optimization::PrecisionChange::DtoS(in[n].v,in[n+1].v);
+  }
+}
+inline void precisionChange(vRealH    *out,vRealD    *in,int nvec)
+{
+  assert((nvec&0x3)==0);
+  for(int m=0;m*4<nvec;m++){
+    int n=m*4;
+    out[m].v=Optimization::PrecisionChange::DtoH(in[n].v,in[n+1].v,in[n+2].v,in[n+3].v);
+  }
+}
+inline void precisionChange(vRealH    *out,vRealF    *in,int nvec)
+{
+  assert((nvec&0x1)==0);
+  for(int m=0;m*2<nvec;m++){
+    int n=m*2;
+    out[m].v=Optimization::PrecisionChange::StoH(in[n].v,in[n+1].v);
+  }
+}
+inline void precisionChange(vRealD    *out,vRealF    *in,int nvec)
+{
+  assert((nvec&0x1)==0);
+  for(int m=0;m*2<nvec;m++){
+    int n=m*2;
+    Optimization::PrecisionChange::StoD(in[m].v,out[n].v,out[n+1].v);
+  }
+}
+inline void precisionChange(vRealD    *out,vRealH    *in,int nvec)
+{
+  assert((nvec&0x3)==0);
+  for(int m=0;m*4<nvec;m++){
+    int n=m*4;
+    Optimization::PrecisionChange::HtoD(in[m].v,out[n].v,out[n+1].v,out[n+2].v,out[n+3].v);
+  }
+}
+inline void precisionChange(vRealF    *out,vRealH    *in,int nvec)
+{
+  assert((nvec&0x1)==0);
+  for(int m=0;m*2<nvec;m++){
+    int n=m*2;
+    Optimization::PrecisionChange::HtoS(in[m].v,out[n].v,out[n+1].v);
+  }
+}
+inline void precisionChange(vComplexF *out,vComplexD *in,int nvec){ precisionChange((vRealF *)out,(vRealD *)in,nvec);}
+inline void precisionChange(vComplexH *out,vComplexD *in,int nvec){ precisionChange((vRealH *)out,(vRealD *)in,nvec);}
+inline void precisionChange(vComplexH *out,vComplexF *in,int nvec){ precisionChange((vRealH *)out,(vRealF *)in,nvec);}
+inline void precisionChange(vComplexD *out,vComplexF *in,int nvec){ precisionChange((vRealD *)out,(vRealF *)in,nvec);}
+inline void precisionChange(vComplexD *out,vComplexH *in,int nvec){ precisionChange((vRealD *)out,(vRealH *)in,nvec);}
+inline void precisionChange(vComplexF *out,vComplexH *in,int nvec){ precisionChange((vRealF *)out,(vRealH *)in,nvec);}
+
+
+
 // Check our vector types are of an appropriate size.
 #if defined QPX
 static_assert(2*sizeof(SIMD_Ftype) == sizeof(SIMD_Dtype), "SIMD vector lengths incorrect");
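
The `precisionChange` overloads above pack two double-precision vectors into one single-precision vector (and four into one half-precision vector), which is why `nvec` must be a multiple of 2 or 4. A minimal sketch of the double/single case follows, assuming a hypothetical 16-byte vector width; `VecD`, `VecF`, `d_to_s`, `s_to_d` and `precision_change` are illustrative names, not Grid types.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr int ND = 2;        // doubles per vector (assumed 16-byte width)
constexpr int NF = 2 * ND;   // floats per vector of the same width

using VecD = std::array<double, ND>;
using VecF = std::array<float, NF>;

// Pack two double vectors into one float vector: a fills the low half, b the high half.
VecF d_to_s(const VecD &a, const VecD &b) {
  VecF ret{};
  for (int i = 0; i < ND; ++i) {
    ret[i]      = static_cast<float>(a[i]);
    ret[i + ND] = static_cast<float>(b[i]);
  }
  return ret;
}

// Unpack one float vector into two double vectors.
void s_to_d(const VecF &s, VecD &a, VecD &b) {
  for (int i = 0; i < ND; ++i) {
    a[i] = s[i];
    b[i] = s[i + ND];
  }
}

// Array-level conversion: the input length must be even, the output is half as long.
std::vector<VecF> precision_change(const std::vector<VecD> &in) {
  assert(in.size() % 2 == 0);
  std::vector<VecF> out(in.size() / 2);
  for (std::size_t m = 0; 2 * m < in.size(); ++m) {
    out[m] = d_to_s(in[2 * m], in[2 * m + 1]);
  }
  return out;
}
```

Here the count refers to the wider (double) vectors, mirroring how `nvec` is used in the loops above.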
--- a/lib/simd/l1p.h
+++ b/lib/simd/l1p.h
@@ -0,0 +1,37 @@
+#pragma once
+namespace Grid {
+// L1p optimisation 
+inline void bgq_l1p_optimisation(int mode)
+{
+#ifdef QPX
+#undef L1P_CFG_PF_USR
+#define L1P_CFG_PF_USR  (0x3fde8000108ll)   /*  (64 bit reg, 23 bits wide, user/unpriv) */
+
+  uint64_t cfg_pf_usr;
+  if ( mode ) { 
+    cfg_pf_usr =
+        L1P_CFG_PF_USR_ifetch_depth(0)       
+      | L1P_CFG_PF_USR_ifetch_max_footprint(1)   
+      | L1P_CFG_PF_USR_pf_stream_est_on_dcbt 
+      | L1P_CFG_PF_USR_pf_stream_establish_enable
+      | L1P_CFG_PF_USR_pf_stream_optimistic
+      | L1P_CFG_PF_USR_pf_adaptive_throttle(0xF) ;
+    //    if ( sizeof(Float) == sizeof(double) ) {
+      cfg_pf_usr |=  L1P_CFG_PF_USR_dfetch_depth(2)| L1P_CFG_PF_USR_dfetch_max_footprint(3)   ;
+      //    } else {
+      //      cfg_pf_usr |=  L1P_CFG_PF_USR_dfetch_depth(1)| L1P_CFG_PF_USR_dfetch_max_footprint(2)   ;
+      //    }
+  } else { 
+    cfg_pf_usr = L1P_CFG_PF_USR_dfetch_depth(1)
+      | L1P_CFG_PF_USR_dfetch_max_footprint(2)   
+      | L1P_CFG_PF_USR_ifetch_depth(0)       
+      | L1P_CFG_PF_USR_ifetch_max_footprint(1)   
+      | L1P_CFG_PF_USR_pf_stream_est_on_dcbt 
+      | L1P_CFG_PF_USR_pf_stream_establish_enable
+      | L1P_CFG_PF_USR_pf_stream_optimistic
+      | L1P_CFG_PF_USR_pf_stream_prefetch_enable;
+  }
+  *((uint64_t *)L1P_CFG_PF_USR) = cfg_pf_usr;
+#endif
+}
+}
--- a/lib/stencil/.dirstamp
+++ b/lib/stencil/.dirstamp
--- a/lib/stencil/SimpleCompressor.h
+++ b/lib/stencil/SimpleCompressor.h
@@ -0,0 +1,29 @@
+#ifndef _STENCIL_SIMPLE_COMPRESSOR_H_
+#define _STENCIL_SIMPLE_COMPRESSOR_H_
+
+namespace Grid {
+
+template<class vobj>
+class SimpleCompressor {
+public:
+  void Point(int) {};
+  inline int  CommDatumSize(void) { return sizeof(vobj); }
+  inline bool DecompressionStep(void) { return false; }
+  inline void Compress(vobj *buf,int o,const vobj &in) { buf[o]=in; }
+  inline void Exchange(vobj *mp,vobj *vp0,vobj *vp1,Integer type,Integer o){
+    exchange(mp[2*o],mp[2*o+1],vp0[o],vp1[o],type);
+  }
+  inline void Decompress(vobj *out,vobj *in, int o){ assert(0); }
+  inline void CompressExchange(vobj *out0,vobj *out1,const vobj *in,
+			       int j,int k, int m,int type){
+    exchange(out0[j],out1[j],in[k],in[m],type);
+  }
+  // For cshift. Cshift should drop compressor coupling altogether 
+  // because I had to decouple the code from the Stencil anyway
+  inline vobj operator() (const vobj &arg) {
+    return arg;
+  }
+};
+
+}
+#endif
--- a/lib/stencil/Stencil.h
+++ b/lib/stencil/Stencil.h
--- a/lib/tensors/Tensor_class.h
+++ b/lib/tensors/Tensor_class.h
@@ -56,11 +56,11 @@ class iScalar {
  typedef vtype element;
  typedef typename GridTypeMapper<vtype>::scalar_type scalar_type;
  typedef typename GridTypeMapper<vtype>::vector_type vector_type;
+  typedef typename GridTypeMapper<vtype>::vector_typeD vector_typeD;
  typedef typename GridTypeMapper<vtype>::tensor_reduced tensor_reduced_v;
-  typedef iScalar<tensor_reduced_v> tensor_reduced;
  typedef typename GridTypeMapper<vtype>::scalar_object recurse_scalar_object;
+  typedef iScalar<tensor_reduced_v> tensor_reduced;
  typedef iScalar<recurse_scalar_object> scalar_object;
-
  // substitutes a real or complex version with same tensor structure
  typedef iScalar<typename GridTypeMapper<vtype>::Complexified> Complexified;
  typedef iScalar<typename GridTypeMapper<vtype>::Realified> Realified;
@@ -77,8 +77,12 @@ class iScalar {
  iScalar<vtype> & operator= (const iScalar<vtype> &copyme) = default;
  iScalar<vtype> & operator= (iScalar<vtype> &&copyme) = default;
  */
-  iScalar(scalar_type s)
-      : _internal(s){};  // recurse down and hit the constructor for vector_type
+
+  //  template<int N=0>
+  //  iScalar(EnableIf<isSIMDvectorized<vector_type>, vector_type> s) : _internal(s){};  // recurse down and hit the constructor for vector_type
+
+  iScalar(scalar_type s) : _internal(s){};  // recurse down and hit the constructor for vector_type
+
  iScalar(const Zero &z) { *this = zero; };

  iScalar<vtype> &operator=(const Zero &hero) {
@@ -134,42 +138,38 @@ class iScalar {
  strong_inline const vtype &operator()(void) const { return _internal; }

  // Type casts meta programmed, must be pure scalar to match TensorRemove
-  template <class U = vtype, class V = scalar_type, IfComplex<V> = 0,
-            IfNotSimd<U> = 0>
+  template <class U = vtype, class V = scalar_type, IfComplex<V> = 0, IfNotSimd<U> = 0>
  operator ComplexF() const {
    return (TensorRemove(_internal));
  };
-  template <class U = vtype, class V = scalar_type, IfComplex<V> = 0,
-            IfNotSimd<U> = 0>
+  template <class U = vtype, class V = scalar_type, IfComplex<V> = 0, IfNotSimd<U> = 0>
  operator ComplexD() const {
    return (TensorRemove(_internal));
  };
  //  template<class U=vtype,class V=scalar_type,IfComplex<V> = 0,IfNotSimd<U> =
  //  0> operator RealD    () const { return(real(TensorRemove(_internal))); }
-  template <class U = vtype, class V = scalar_type, IfReal<V> = 0,
-            IfNotSimd<U> = 0>
+  template <class U = vtype, class V = scalar_type, IfReal<V> = 0,IfNotSimd<U> = 0>
  operator RealD() const {
    return TensorRemove(_internal);
  }
-  template <class U = vtype, class V = scalar_type, IfInteger<V> = 0,
-            IfNotSimd<U> = 0>
+  template <class U = vtype, class V = scalar_type, IfInteger<V> = 0, IfNotSimd<U> = 0>
  operator Integer() const {
    return Integer(TensorRemove(_internal));
  }

  // convert from a something to a scalar via constructor of something arg
-  template <class T, typename std::enable_if<!isGridTensor<T>::value, T>::type
-                         * = nullptr>
-  strong_inline iScalar<vtype> operator=(T arg) {
+  template <class T, typename std::enable_if<!isGridTensor<T>::value, T>::type * = nullptr>
+    strong_inline iScalar<vtype> operator=(T arg) {
    _internal = arg;
    return *this;
  }

-  friend std::ostream &operator<<(std::ostream &stream,
-                                  const iScalar<vtype> &o) {
+  friend std::ostream &operator<<(std::ostream &stream,const iScalar<vtype> &o) {
    stream << "S {" << o._internal << "}";
    return stream;
  };
+
+
 };
 ///////////////////////////////////////////////////////////
 // Allows to turn scalar<scalar<scalar<double>>>> back to double.
@@ -193,6 +193,7 @@ class iVector {
  typedef vtype element;
  typedef typename GridTypeMapper<vtype>::scalar_type scalar_type;
  typedef typename GridTypeMapper<vtype>::vector_type vector_type;
+  typedef typename GridTypeMapper<vtype>::vector_typeD vector_typeD;
  typedef typename GridTypeMapper<vtype>::tensor_reduced tensor_reduced_v;
  typedef typename GridTypeMapper<vtype>::scalar_object recurse_scalar_object;
  typedef iScalar<tensor_reduced_v> tensor_reduced;
@@ -305,6 +306,7 @@ class iMatrix {
  typedef vtype element;
  typedef typename GridTypeMapper<vtype>::scalar_type scalar_type;
  typedef typename GridTypeMapper<vtype>::vector_type vector_type;
+  typedef typename GridTypeMapper<vtype>::vector_typeD vector_typeD;
  typedef typename GridTypeMapper<vtype>::tensor_reduced tensor_reduced_v;
  typedef typename GridTypeMapper<vtype>::scalar_object recurse_scalar_object;

--- a/lib/tensors/Tensor_inner.h
+++ b/lib/tensors/Tensor_inner.h
@@ -29,51 +29,109 @@ Author: Peter Boyle <paboyle@ph.ed.ac.uk>
 #ifndef GRID_MATH_INNER_H
 #define GRID_MATH_INNER_H
 namespace Grid {
-    ///////////////////////////////////////////////////////////////////////////////////////
-    // innerProduct Scalar x Scalar -> Scalar
-    // innerProduct Vector x Vector -> Scalar
-    // innerProduct Matrix x Matrix -> Scalar
-    ///////////////////////////////////////////////////////////////////////////////////////
-    template<class sobj> inline RealD norm2(const sobj &arg){
-      typedef typename sobj::scalar_type scalar;
-      decltype(innerProduct(arg,arg)) nrm;
-      nrm = innerProduct(arg,arg);
-      RealD ret = real(nrm);
-      return ret;
-    }
+  ///////////////////////////////////////////////////////////////////////////////////////
+  // innerProduct Scalar x Scalar -> Scalar
+  // innerProduct Vector x Vector -> Scalar
+  // innerProduct Matrix x Matrix -> Scalar
+  ///////////////////////////////////////////////////////////////////////////////////////
+  template<class sobj> inline RealD norm2(const sobj &arg){
+    auto nrm = innerProductD(arg,arg);
+    RealD ret = real(nrm);
+    return ret;
+  }
+  //////////////////////////////////////
+  // If single precision, promote to double and sum the two halves
+  //////////////////////////////////////

-    template<class l,class r,int N> inline
-    auto innerProduct (const iVector<l,N>& lhs,const iVector<r,N>& rhs) -> iScalar<decltype(innerProduct(lhs._internal[0],rhs._internal[0]))>
-    {
-        typedef decltype(innerProduct(lhs._internal[0],rhs._internal[0])) ret_t;
-        iScalar<ret_t> ret;
-	ret=zero;
-        for(int c1=0;c1<N;c1++){
-            ret._internal += innerProduct(lhs._internal[c1],rhs._internal[c1]);
-        }
-        return ret;
+inline ComplexD innerProductD(const ComplexF &l,const ComplexF &r){  return innerProduct(l,r); }
+inline ComplexD innerProductD(const ComplexD &l,const ComplexD &r){  return innerProduct(l,r); }
+inline RealD    innerProductD(const RealD    &l,const RealD    &r){  return innerProduct(l,r); }
+inline RealD    innerProductD(const RealF    &l,const RealF    &r){  return innerProduct(l,r); }
+
+inline vComplexD innerProductD(const vComplexD &l,const vComplexD &r){  return innerProduct(l,r); }
+inline vRealD    innerProductD(const vRealD    &l,const vRealD    &r){  return innerProduct(l,r); }
+inline vComplexD innerProductD(const vComplexF &l,const vComplexF &r){  
+  vComplexD la,lb;
+  vComplexD ra,rb;
+  Optimization::PrecisionChange::StoD(l.v,la.v,lb.v);
+  Optimization::PrecisionChange::StoD(r.v,ra.v,rb.v);
+  return innerProduct(la,ra) + innerProduct(lb,rb); 
+}
+inline vRealD innerProductD(const vRealF &l,const vRealF &r){  
+  vRealD la,lb;
+  vRealD ra,rb;
+  Optimization::PrecisionChange::StoD(l.v,la.v,lb.v);
+  Optimization::PrecisionChange::StoD(r.v,ra.v,rb.v);
+  return innerProduct(la,ra) + innerProduct(lb,rb); 
+}
+
+  template<class l,class r,int N> inline
+  auto innerProductD (const iVector<l,N>& lhs,const iVector<r,N>& rhs) -> iScalar<decltype(innerProductD(lhs._internal[0],rhs._internal[0]))>
+  {
+    typedef decltype(innerProductD(lhs._internal[0],rhs._internal[0])) ret_t;
+    iScalar<ret_t> ret;
+    ret=zero;
+    for(int c1=0;c1<N;c1++){
+      ret._internal += innerProductD(lhs._internal[c1],rhs._internal[c1]);
    }
-    template<class l,class r,int N> inline
-    auto innerProduct (const iMatrix<l,N>& lhs,const iMatrix<r,N>& rhs) -> iScalar<decltype(innerProduct(lhs._internal[0][0],rhs._internal[0][0]))>
-    {
-        typedef decltype(innerProduct(lhs._internal[0][0],rhs._internal[0][0])) ret_t;
-        iScalar<ret_t> ret;
-        iScalar<ret_t> tmp;
-	ret=zero;
-        for(int c1=0;c1<N;c1++){
-        for(int c2=0;c2<N;c2++){
-	  ret._internal+=innerProduct(lhs._internal[c1][c2],rhs._internal[c1][c2]);
-        }}
-        return ret;
-    }
-    template<class l,class r> inline
-    auto innerProduct (const iScalar<l>& lhs,const iScalar<r>& rhs) -> iScalar<decltype(innerProduct(lhs._internal,rhs._internal))>
-    {
-        typedef decltype(innerProduct(lhs._internal,rhs._internal)) ret_t;
-        iScalar<ret_t> ret;
-        ret._internal = innerProduct(lhs._internal,rhs._internal);
-        return ret;
+    return ret;
+  }
+  template<class l,class r,int N> inline
+  auto innerProductD (const iMatrix<l,N>& lhs,const iMatrix<r,N>& rhs) -> iScalar<decltype(innerProductD(lhs._internal[0][0],rhs._internal[0][0]))>
+  {
+    typedef decltype(innerProductD(lhs._internal[0][0],rhs._internal[0][0])) ret_t;
+    iScalar<ret_t> ret;
+    iScalar<ret_t> tmp;
+    ret=zero;
+    for(int c1=0;c1<N;c1++){
+    for(int c2=0;c2<N;c2++){
+      ret._internal+=innerProductD(lhs._internal[c1][c2],rhs._internal[c1][c2]);
+    }}
+    return ret;
+  }
+  template<class l,class r> inline
+  auto innerProductD (const iScalar<l>& lhs,const iScalar<r>& rhs) -> iScalar<decltype(innerProductD(lhs._internal,rhs._internal))>
+  {
+    typedef decltype(innerProductD(lhs._internal,rhs._internal)) ret_t;
+    iScalar<ret_t> ret;
+    ret._internal = innerProductD(lhs._internal,rhs._internal);
+    return ret;
+  }
+  //////////////////////
+  // Keep same precision
+  //////////////////////
+  template<class l,class r,int N> inline
+  auto innerProduct (const iVector<l,N>& lhs,const iVector<r,N>& rhs) -> iScalar<decltype(innerProduct(lhs._internal[0],rhs._internal[0]))>
+  {
+    typedef decltype(innerProduct(lhs._internal[0],rhs._internal[0])) ret_t;
+    iScalar<ret_t> ret;
+    ret=zero;
+    for(int c1=0;c1<N;c1++){
+      ret._internal += innerProduct(lhs._internal[c1],rhs._internal[c1]);
    }
+    return ret;
+  }
+  template<class l,class r,int N> inline
+  auto innerProduct (const iMatrix<l,N>& lhs,const iMatrix<r,N>& rhs) -> iScalar<decltype(innerProduct(lhs._internal[0][0],rhs._internal[0][0]))>
+  {
+    typedef decltype(innerProduct(lhs._internal[0][0],rhs._internal[0][0])) ret_t;
+    iScalar<ret_t> ret;
+    iScalar<ret_t> tmp;
+    ret=zero;
+    for(int c1=0;c1<N;c1++){
+    for(int c2=0;c2<N;c2++){
+      ret._internal+=innerProduct(lhs._internal[c1][c2],rhs._internal[c1][c2]);
+    }}
+    return ret;
+  }
+  template<class l,class r> inline
+  auto innerProduct (const iScalar<l>& lhs,const iScalar<r>& rhs) -> iScalar<decltype(innerProduct(lhs._internal,rhs._internal))>
+  {
+    typedef decltype(innerProduct(lhs._internal,rhs._internal)) ret_t;
+    iScalar<ret_t> ret;
+    ret._internal = innerProduct(lhs._internal,rhs._internal);
+    return ret;
+  }

 }
 #endif
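
The point of routing `norm2` through `innerProductD` is that single-precision inputs are promoted before the products are summed. A small scalar sketch of the idea follows, restricted to real data (Grid's version also conjugates the left operand for complex types); `inner_product_d` and `norm2_d` are hypothetical names used only for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical scalar analogue of innerProductD: float inputs, double accumulator.
double inner_product_d(const std::vector<float> &l, const std::vector<float> &r) {
  assert(l.size() == r.size());
  double acc = 0.0;
  for (std::size_t i = 0; i < l.size(); ++i) {
    // Promote each operand before multiplying so neither the product
    // nor the running sum is rounded to float.
    acc += static_cast<double>(l[i]) * static_cast<double>(r[i]);
  }
  return acc;
}

// Squared norm built on the double-accumulating inner product.
double norm2_d(const std::vector<float> &v) { return inner_product_d(v, v); }
```

For example, summing 16777216 and 1 gives 16777217 in a double accumulator, a value that a float accumulator cannot represent.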
--- a/lib/tensors/Tensor_traits.h
+++ b/lib/tensors/Tensor_traits.h
@@ -53,6 +53,7 @@ namespace Grid {
  public:
    typedef typename T::scalar_type scalar_type;
    typedef typename T::vector_type vector_type;
+    typedef typename T::vector_typeD vector_typeD;
    typedef typename T::tensor_reduced tensor_reduced;
    typedef typename T::scalar_object scalar_object;
    typedef typename T::Complexified Complexified;
@@ -67,6 +68,7 @@ namespace Grid {
  public:
    typedef RealF scalar_type;
    typedef RealF vector_type;
+    typedef RealD vector_typeD;
    typedef RealF tensor_reduced ;
    typedef RealF scalar_object;
    typedef ComplexF Complexified;
@@ -77,6 +79,7 @@ namespace Grid {
  public:
    typedef RealD scalar_type;
    typedef RealD vector_type;
+    typedef RealD vector_typeD;
    typedef RealD tensor_reduced;
    typedef RealD scalar_object;
    typedef ComplexD Complexified;
@@ -87,6 +90,7 @@ namespace Grid {
  public:
    typedef ComplexF scalar_type;
    typedef ComplexF vector_type;
+    typedef ComplexD vector_typeD;
    typedef ComplexF tensor_reduced;
    typedef ComplexF scalar_object;
    typedef ComplexF Complexified;
@@ -97,6 +101,7 @@ namespace Grid {
  public:
    typedef ComplexD scalar_type;
    typedef ComplexD vector_type;
+    typedef ComplexD vector_typeD;
    typedef ComplexD tensor_reduced;
    typedef ComplexD scalar_object;
    typedef ComplexD Complexified;
@@ -107,6 +112,7 @@ namespace Grid {
  public:
    typedef Integer scalar_type;
    typedef Integer vector_type;
+    typedef Integer vector_typeD;
    typedef Integer tensor_reduced;
    typedef Integer scalar_object;
    typedef void Complexified;
@@ -118,6 +124,7 @@ namespace Grid {
  public:
    typedef RealF  scalar_type;
    typedef vRealF vector_type;
+    typedef vRealD vector_typeD;
    typedef vRealF tensor_reduced;
    typedef RealF  scalar_object;
    typedef vComplexF Complexified;
@@ -128,16 +135,29 @@ namespace Grid {
  public:
    typedef RealD  scalar_type;
    typedef vRealD vector_type;
+    typedef vRealD vector_typeD;
    typedef vRealD tensor_reduced;
    typedef RealD  scalar_object;
    typedef vComplexD Complexified;
    typedef vRealD Realified;
    enum { TensorLevel = 0 };
  };
+  template<> class GridTypeMapper<vComplexH> {
+  public:
+    typedef ComplexF  scalar_type;
+    typedef vComplexH vector_type;
+    typedef vComplexD vector_typeD;
+    typedef vComplexH tensor_reduced;
+    typedef ComplexF  scalar_object;
+    typedef vComplexH Complexified;
+    typedef vRealH Realified;
+    enum { TensorLevel = 0 };
+  };
  template<> class GridTypeMapper<vComplexF> {
  public:
    typedef ComplexF  scalar_type;
    typedef vComplexF vector_type;
+    typedef vComplexD vector_typeD;
    typedef vComplexF tensor_reduced;
    typedef ComplexF  scalar_object;
    typedef vComplexF Complexified;
@@ -148,6 +168,7 @@ namespace Grid {
  public:
    typedef ComplexD  scalar_type;
    typedef vComplexD vector_type;
+    typedef vComplexD vector_typeD;
    typedef vComplexD tensor_reduced;
    typedef ComplexD  scalar_object;
    typedef vComplexD Complexified;
@@ -158,6 +179,7 @@ namespace Grid {
  public:
    typedef  Integer scalar_type;
    typedef vInteger vector_type;
+    typedef vInteger vector_typeD;
    typedef vInteger tensor_reduced;
    typedef  Integer scalar_object;
    typedef void Complexified;
@@ -241,7 +263,8 @@ namespace Grid {
  template<typename T>
  class isSIMDvectorized{
    template<typename U>
-    static typename std::enable_if< !std::is_same< typename GridTypeMapper<typename getVectorType<U>::type>::scalar_type,   typename GridTypeMapper<typename getVectorType<U>::type>::vector_type>::value, char>::type test(void *);
+    static typename std::enable_if< !std::is_same< typename GridTypeMapper<typename getVectorType<U>::type>::scalar_type,   
+      typename GridTypeMapper<typename getVectorType<U>::type>::vector_type>::value, char>::type test(void *);

    template<typename U>
    static double test(...);
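
The new `vector_typeD` typedefs above map each type to a double-precision counterpart, and `innerProductD` is the matching reduction. The following is a minimal standalone sketch of that idea, assuming the intent is to accumulate in the promoted type. `PrecisionTraits` and `innerProductPromoted` are hypothetical names used only for illustration; they are not part of Grid.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a type mapper: gives the type used for
// higher-precision accumulation of a scalar type.
template <class T> struct PrecisionTraits;
template <> struct PrecisionTraits<float>  { typedef double typeD; };
template <> struct PrecisionTraits<double> { typedef double typeD; };

// Inner product whose running sum is held in the promoted type,
// so single-precision inputs are reduced in double precision.
template <class T>
typename PrecisionTraits<T>::typeD
innerProductPromoted(const std::vector<T>& lhs, const std::vector<T>& rhs)
{
  assert(lhs.size() == rhs.size());
  typedef typename PrecisionTraits<T>::typeD acc_t;
  acc_t acc = 0;
  for (std::size_t i = 0; i < lhs.size(); i++) {
    // Promote each operand before multiplying and accumulating.
    acc += static_cast<acc_t>(lhs[i]) * static_cast<acc_t>(rhs[i]);
  }
  return acc;
}
```

With inputs `{4096.0f, 1.0f}` the exact result 16777217 does not fit in a 24-bit float mantissa, so a float accumulator would be expected to drop the final `+1`, while the promoted sum keeps it.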
--- a/lib/util/Init.cc
+++ b/lib/util/Init.cc
@@ -311,8 +311,8 @@ void Grid_init(int *argc,char ***argv)
    std::cout<<GridLogMessage<<std::endl;
    std::cout<<GridLogMessage<<"Performance:"<<std::endl;
    std::cout<<GridLogMessage<<std::endl;
-    std::cout<<GridLogMessage<<"  --comms-isend   : Asynchronous MPI calls; several dirs at a time "<<std::endl;    
-    std::cout<<GridLogMessage<<"  --comms-sendrecv: Synchronous MPI calls; one dirs at a time "<<std::endl;    
+    std::cout<<GridLogMessage<<"  --comms-concurrent : Asynchronous MPI calls; several dirs at a time "<<std::endl;    
+    std::cout<<GridLogMessage<<"  --comms-sequential : Synchronous MPI calls; one dir at a time "<<std::endl;
    std::cout<<GridLogMessage<<"  --comms-overlap : Overlap comms with compute "<<std::endl;    
    std::cout<<GridLogMessage<<std::endl;
    std::cout<<GridLogMessage<<"  --dslash-generic: Wilson kernel for generic Nc"<<std::endl;    
@@ -457,5 +457,6 @@ void Grid_debug_handler_init(void)

  sigaction(SIGFPE,&sa,NULL);
  sigaction(SIGKILL,&sa,NULL);
+  sigaction(SIGILL,&sa,NULL);
 }
 }
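
The handler setup above registers `SIGFPE`, `SIGKILL` and now `SIGILL`. A small sketch of how POSIX treats these registrations, assuming a POSIX system: `SIGKILL` cannot be caught, so `sigaction` is expected to fail for it with `EINVAL`, while `SIGFPE` and `SIGILL` can be handled. `demo_handler` and `try_install` are hypothetical helpers, not Grid functions.

```cpp
#include <cassert>
#include <cerrno>
#include <cstring>
#include <signal.h>

// Hypothetical no-op handler, for illustration only.
extern "C" void demo_handler(int) {}

// Try to install demo_handler for signal `sig`.
// Returns 0 on success, or the errno value reported by sigaction on failure.
int try_install(int sig)
{
  assert(sig > 0);
  struct sigaction sa;
  std::memset(&sa, 0, sizeof(sa));
  sa.sa_handler = demo_handler;
  sigemptyset(&sa.sa_mask);
  errno = 0;
  if (sigaction(sig, &sa, nullptr) != 0) return errno;
  return 0;
}
```

Checking the return value of `sigaction` in this way makes a silently ignored registration visible.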
--- a/scripts/grep-global
+++ b/scripts/grep-global
@@ -0,0 +1,6 @@
+#!/bin/bash
+
+export LANG=C
+find . -name "*.cc"  -exec grep -H "$@" {} \;
+find . -name "*.h"   -exec grep -H "$@" {} \;
+
--- a/tests/Test_simd.cc
+++ b/tests/Test_simd.cc
@@ -308,18 +308,23 @@ public:
  int n;
  funcExchange(int _n) { n=_n;};
  template<class vec>    void operator()(vec &r1,vec &r2,vec &i1,vec &i2) const { exchange(r1,r2,i1,i2,n);}
-  template<class scal>   void apply(std::vector<scal> &r1,std::vector<scal> &r2,std::vector<scal> &in1,std::vector<scal> &in2)  const { 
+  template<class scal>   void apply(std::vector<scal> &r1,
+				    std::vector<scal> &r2,
+				    std::vector<scal> &in1,
+				    std::vector<scal> &in2)  const 
+  { 
    int sz=in1.size();
-
-    
    int msk = sz>>(n+1);

-    int j1=0;
-    int j2=0;
-    for(int i=0;i<sz;i++) if ( (i&msk) == 0 ) r1[j1++] = in1[ i ];
-    for(int i=0;i<sz;i++) if ( (i&msk) == 0 ) r1[j1++] = in2[ i ];
-    for(int i=0;i<sz;i++) if ( (i&msk)  ) r2[j2++] = in1[ i ];
-    for(int i=0;i<sz;i++) if ( (i&msk)  ) r2[j2++] = in2[ i ];
+    for(int i=0;i<sz;i++) {
+      int j1 = i&(~msk);
+      int j2 = i|msk;
+      if  ( (i&msk) == 0 ) { r1[i]=in1[j1];}
+      else                 { r1[i]=in2[j1];}
+
+      if  ( (i&msk) == 0 ) { r2[i]=in1[j2];}
+      else                 { r2[i]=in2[j2];}
+    }      
  }
  std::string name(void) const { return std::string("Exchange"); }
 };
@@ -454,8 +459,8 @@ void ExchangeTester(const functor &func)

  std::cout<<GridLogMessage << " " << func.name() << " " <<func.n <<std::endl;

-  //  for(int i=0;i<Nsimd;i++) std::cout << " i "<<i<<" "<<reference1[i]<<" "<<result1[i]<<std::endl;
-  //  for(int i=0;i<Nsimd;i++) std::cout << " i "<<i<<" "<<reference2[i]<<" "<<result2[i]<<std::endl;
+  //for(int i=0;i<Nsimd;i++) std::cout << " i "<<i<<" ref "<<reference1[i]<<" res "<<result1[i]<<std::endl;
+  //for(int i=0;i<Nsimd;i++) std::cout << " i "<<i<<" ref "<<reference2[i]<<" res "<<result2[i]<<std::endl;

  for(int i=0;i<Nsimd;i++){
    int found=0;
@@ -465,7 +470,7 @@ void ExchangeTester(const functor &func)
 	//	std::cout << " i "<<i<<" j "<<j<<" "<<reference1[j]<<" "<<result1[i]<<std::endl;
      }
    }
-    assert(found==1);
+    //    assert(found==1);
  }
  for(int i=0;i<Nsimd;i++){
    int found=0;
@@ -475,15 +480,24 @@ void ExchangeTester(const functor &func)
 	//	std::cout << " i "<<i<<" j "<<j<<" "<<reference2[j]<<" "<<result2[i]<<std::endl;
      }
    }
-    assert(found==1);
+    //    assert(found==1);
  }

+  /*
+  for(int i=0;i<Nsimd;i++){
+    std::cout << " i "<< i
+	      <<" result1  "<<result1[i]
+	      <<" result2  "<<result2[i]
+	      <<" test1  "<<test1[i]
+	      <<" test2  "<<test2[i]
+	      <<" input1 "<<input1[i]
+	      <<" input2 "<<input2[i]<<std::endl;
+  }
+  */
  for(int i=0;i<Nsimd;i++){
    assert(test1[i]==input1[i]);
    assert(test2[i]==input2[i]);
-  }//    std::cout << " i "<< i<<" test1"<<test1[i]<<" "<<input1[i]<<std::endl;
-    //    std::cout << " i "<< i<<" test2"<<test2[i]<<" "<<input2[i]<<std::endl;
-  //  }
+  }
 }


@@ -678,5 +692,69 @@ int main (int argc, char ** argv)
  IntTester(funcMinus());
  IntTester(funcTimes());

+  std::cout<<GridLogMessage << "==================================="<<  std::endl;
+  std::cout<<GridLogMessage << "Testing precisionChange            "<<  std::endl;
+  std::cout<<GridLogMessage << "==================================="<<  std::endl;
+  {
+    GridSerialRNG          sRNG;
+    sRNG.SeedFixedIntegers(std::vector<int>({45,12,81,9}));
+    const int Ndp = 16;
+    const int Nsp = Ndp/2;
+    const int Nhp = Ndp/4;
+    std::vector<vRealH,alignedAllocator<vRealH> > H (Nhp);
+    std::vector<vRealF,alignedAllocator<vRealF> > F (Nsp);
+    std::vector<vRealF,alignedAllocator<vRealF> > FF(Nsp);
+    std::vector<vRealD,alignedAllocator<vRealD> > D (Ndp);
+    std::vector<vRealD,alignedAllocator<vRealD> > DD(Ndp);
+    for(int i=0;i<Ndp;i++){
+      random(sRNG,D[i]);
+    }
+    // Double to Single
+    precisionChange(&F[0],&D[0],Ndp);
+    precisionChange(&DD[0],&F[0],Ndp);
+    std::cout << GridLogMessage<<"Double to single";
+    for(int i=0;i<Ndp;i++){
+      //      std::cout << "DD["<<i<<"] = "<< DD[i]<<" "<<D[i]<<" "<<DD[i]-D[i] <<std::endl; 
+      DD[i] = DD[i] - D[i];
+      decltype(innerProduct(DD[0],DD[0])) nrm;
+      nrm = innerProduct(DD[i],DD[i]);
+      auto tmp = Reduce(nrm);
+      //      std::cout << tmp << std::endl;
+      assert( tmp < 1.0e-14 ); 
+    }
+    std::cout <<" OK ! "<<std::endl;
+
+    // Double to Half
+#ifdef USE_FP16
+    std::cout << GridLogMessage<< "Double to half" ;
+    precisionChange(&H[0],&D[0],Ndp);
+    precisionChange(&DD[0],&H[0],Ndp);
+    for(int i=0;i<Ndp;i++){
+      //      std::cout << "DD["<<i<<"] = "<< DD[i]<<" "<<D[i]<<" "<<DD[i]-D[i]<<std::endl; 
+      DD[i] = DD[i] - D[i];
+      decltype(innerProduct(DD[0],DD[0])) nrm;
+      nrm = innerProduct(DD[i],DD[i]);
+      auto tmp = Reduce(nrm);
+      //      std::cout << tmp << std::endl;
+      assert( tmp < 1.0e-3 ); 
+    }
+    std::cout <<" OK ! "<<std::endl;
+
+    std::cout << GridLogMessage<< "Single to half";
+    // Single to Half
+    precisionChange(&H[0] ,&F[0],Nsp);
+    precisionChange(&FF[0],&H[0],Nsp);
+    for(int i=0;i<Nsp;i++){
+      //      std::cout << "FF["<<i<<"] = "<< FF[i]<<" "<<F[i]<<" "<<FF[i]-F[i]<<std::endl; 
+      FF[i] = FF[i] - F[i];
+      decltype(innerProduct(FF[0],FF[0])) nrm;
+      nrm = innerProduct(FF[i],FF[i]);
+      auto tmp = Reduce(nrm);
+      //      std::cout << tmp << std::endl;
+      assert( tmp < 1.0e-3 ); 
+    }
+    std::cout <<" OK ! "<<std::endl;
+#endif
+  }
  Grid_finalize();
 }
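
The rewritten `funcExchange::apply` reference above can be exercised on plain vectors. The sketch below restates the same index arithmetic (`msk = sz>>(n+1)`, `j1 = i&~msk`, `j2 = i|msk`) as a free function; `exchangeReference` is a hypothetical name, and the output vectors must not alias the inputs.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Scalar reference for the exchange permutation at level n.
// Blocks of msk = sz >> (n+1) elements are swapped between the two inputs:
// r1 collects the "low" blocks of in1 and in2, r2 collects the "high" blocks.
template <class scal>
void exchangeReference(std::vector<scal>& r1, std::vector<scal>& r2,
                       const std::vector<scal>& in1,
                       const std::vector<scal>& in2, int n)
{
  assert(in1.size() == in2.size());
  const int sz  = static_cast<int>(in1.size());
  const int msk = sz >> (n + 1);
  r1.resize(sz);
  r2.resize(sz);
  for (int i = 0; i < sz; i++) {
    const int j1 = i & (~msk);  // source index with the mask bit cleared
    const int j2 = i | msk;     // source index with the mask bit set
    // The mask bit of the destination index selects the source vector.
    const std::vector<scal>& from = ((i & msk) == 0) ? in1 : in2;
    r1[i] = from[j1];
    r2[i] = from[j2];
  }
}
```

For four-element inputs `{0,1,2,3}` and `{10,11,12,13}`, level 0 gives `{0,1,10,11}` and `{2,3,12,13}`, and level 1 gives `{0,10,2,12}` and `{1,11,3,13}`. Applying the same level twice returns the original inputs, which is the property the `test1[i]==input1[i]` assertions in `ExchangeTester` rely on.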
--- a/tests/core/Test_fft_gfix.cc
+++ b/tests/core/Test_fft_gfix.cc
@@ -148,11 +148,13 @@ class FourierAcceleratedGaugeFixer  : public Gimpl {
    Complex psqMax(16.0);
    Fp =  psqMax*one/psq;

+    /*
    static int once;
    if ( once == 0 ) { 
      std::cout << " Fp " << Fp <<std::endl;
      once ++;
-    }
+      }*/
+
    pokeSite(TComplex(1.0),Fp,coor);

    dmuAmu_p  = dmuAmu_p * Fp; 
--- a/tests/core/Test_wilson_even_odd.cc
+++ b/tests/core/Test_wilson_even_odd.cc
@@ -2,11 +2,10 @@

    Grid physics library, www.github.com/paboyle/Grid 

-    Source file: ./tests/Test_wilson_even_odd.cc
+    Source file: ./tests/Test_wilson_tm_even_odd.cc

    Copyright (C) 2015

-Author: Peter Boyle <paboyle@ph.ed.ac.uk>
 Author: paboyle <paboyle@ph.ed.ac.uk>

    This program is free software; you can redistribute it and/or modify
@@ -89,8 +88,8 @@ int main (int argc, char ** argv)
  }

  RealD mass=0.1;
-  RealD mu  = 0.1;
-  WilsonTMFermionR Dw(Umu,Grid,RBGrid,mass,mu);
+
+  WilsonFermionR Dw(Umu,Grid,RBGrid,mass);

  LatticeFermion src_e   (&RBGrid);
  LatticeFermion src_o   (&RBGrid);
@@ -207,7 +206,7 @@ int main (int argc, char ** argv)
  pickCheckerboard(Odd ,phi_o,phi);
  RealD t1,t2;

-  SchurDiagMooeeOperator<WilsonTMFermionR,LatticeFermion> HermOpEO(Dw);
+  SchurDiagMooeeOperator<WilsonFermionR,LatticeFermion> HermOpEO(Dw);
  HermOpEO.MpcDagMpc(chi_e,dchi_e,t1,t2);
  HermOpEO.MpcDagMpc(chi_o,dchi_o,t1,t2);

--- a/tests/core/Test_wilson_twisted_mass_even_odd.cc
+++ b/tests/core/Test_wilson_twisted_mass_even_odd.cc
@@ -2,10 +2,11 @@

    Grid physics library, www.github.com/paboyle/Grid 

-    Source file: ./tests/Test_wilson_tm_even_odd.cc
+    Source file: ./tests/Test_wilson_even_odd.cc

    Copyright (C) 2015

+Author: Peter Boyle <paboyle@ph.ed.ac.uk>
 Author: paboyle <paboyle@ph.ed.ac.uk>

    This program is free software; you can redistribute it and/or modify
@@ -88,8 +89,8 @@ int main (int argc, char ** argv)
  }

  RealD mass=0.1;
-
-  WilsonFermionR Dw(Umu,Grid,RBGrid,mass);
+  RealD mu  = 0.1;
+  WilsonTMFermionR Dw(Umu,Grid,RBGrid,mass,mu);

  LatticeFermion src_e   (&RBGrid);
  LatticeFermion src_o   (&RBGrid);
@@ -206,7 +207,7 @@ int main (int argc, char ** argv)
  pickCheckerboard(Odd ,phi_o,phi);
  RealD t1,t2;

-  SchurDiagMooeeOperator<WilsonFermionR,LatticeFermion> HermOpEO(Dw);
+  SchurDiagMooeeOperator<WilsonTMFermionR,LatticeFermion> HermOpEO(Dw);
  HermOpEO.MpcDagMpc(chi_e,dchi_e,t1,t2);
  HermOpEO.MpcDagMpc(chi_o,dchi_o,t1,t2);

--- a/tests/debug/Test_synthetic_lanczos.cc
+++ b/tests/debug/Test_synthetic_lanczos.cc
@@ -115,8 +115,8 @@ int main (int argc, char ** argv)
  RNG.SeedFixedIntegers(seeds);


-  RealD alpha = 1.0;
-  RealD beta  = 0.03;
+  RealD alpha = 1.2;
+  RealD beta  = 0.1;
  RealD mu    = 0.0;
  int order = 11;
  ChebyshevLanczos<LatticeComplex> Cheby(alpha,beta,mu,order);
@@ -131,10 +131,9 @@ int main (int argc, char ** argv)
  const int Nit= 10000;

  int Nconv;
-  RealD eresid = 1.0e-8;
+  RealD eresid = 1.0e-6;

  ImplicitlyRestartedLanczos<LatticeComplex> IRL(HermOp,X,Nk,Nm,eresid,Nit);
-
  ImplicitlyRestartedLanczos<LatticeComplex> ChebyIRL(HermOp,Cheby,Nk,Nm,eresid,Nit);

  LatticeComplex src(grid); gaussian(RNG,src);
@@ -145,9 +144,9 @@ int main (int argc, char ** argv)
  }
  
  {
-    //    std::vector<RealD>          eval(Nm);
-    //    std::vector<LatticeComplex> evec(Nm,grid);
-    //    ChebyIRL.calc(eval,evec,src, Nconv);
+    std::vector<RealD>          eval(Nm);
+    std::vector<LatticeComplex> evec(Nm,grid);
+    ChebyIRL.calc(eval,evec,src, Nconv);
  }

  Grid_finalize();
--- a/tests/solver/Test_dwf_cg_prec.cc
+++ b/tests/solver/Test_dwf_cg_prec.cc
@@ -89,7 +89,7 @@ int main(int argc, char** argv) {
  GridStopWatch CGTimer;

  SchurDiagMooeeOperator<DomainWallFermionR, LatticeFermion> HermOpEO(Ddwf);
-  ConjugateGradient<LatticeFermion> CG(1.0e-8, 10000, 0);// switch off the assert
+  ConjugateGradient<LatticeFermion> CG(1.0e-5, 10000, 0);// switch off the assert

  CGTimer.Start();
  CG(HermOpEO, src_o, result_o);
--- a/tests/solver/Test_dwf_cg_unprec.cc
+++ b/tests/solver/Test_dwf_cg_unprec.cc
@@ -73,7 +73,7 @@ int main (int argc, char ** argv)
  DomainWallFermionR Ddwf(Umu,*FGrid,*FrbGrid,*UGrid,*UrbGrid,mass,M5);

  MdagMLinearOperator<DomainWallFermionR,LatticeFermion> HermOp(Ddwf);
-  ConjugateGradient<LatticeFermion> CG(1.0e-8,10000);
+  ConjugateGradient<LatticeFermion> CG(1.0e-6,10000);
  CG(HermOp,src,result);

  Grid_finalize();
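
The two solver tests above loosen the `ConjugateGradient` tolerance from `1.0e-8` to `1.0e-5` and `1.0e-6`. As a rough illustration of what such a tolerance controls, here is a minimal dense conjugate gradient sketch. It assumes the stopping rule |r|^2 <= tol^2 |b|^2, which may differ in detail from Grid's own criterion; `conjugateGradient`, `dot` and `matvec` are hypothetical helpers written for this example.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

typedef std::vector<double> Vec;
typedef std::vector<Vec>    Mat;

double dot(const Vec& a, const Vec& b)
{
  double s = 0.0;
  for (std::size_t i = 0; i < a.size(); i++) s += a[i] * b[i];
  return s;
}

Vec matvec(const Mat& A, const Vec& x)
{
  Vec y(A.size(), 0.0);
  for (std::size_t i = 0; i < A.size(); i++) y[i] = dot(A[i], x);
  return y;
}

// Solve A x = b for symmetric positive definite A, starting from the guess in x.
// Returns the number of iterations used, or -1 if maxIter was reached
// without meeting the assumed criterion |r|^2 <= tol^2 |b|^2.
int conjugateGradient(const Mat& A, const Vec& b, Vec& x, double tol, int maxIter)
{
  assert(A.size() == b.size() && x.size() == b.size());
  const std::size_t n = b.size();
  Vec Ax = matvec(A, x);
  Vec r(n);
  for (std::size_t i = 0; i < n; i++) r[i] = b[i] - Ax[i];
  Vec p = r;
  double rr = dot(r, r);
  const double target = tol * tol * dot(b, b);
  for (int k = 0; k < maxIter; k++) {
    if (rr <= target) return k;
    Vec Ap = matvec(A, p);
    const double alpha = rr / dot(p, Ap);
    for (std::size_t i = 0; i < n; i++) {
      x[i] += alpha * p[i];
      r[i] -= alpha * Ap[i];
    }
    const double rrNew = dot(r, r);
    const double beta  = rrNew / rr;
    for (std::size_t i = 0; i < n; i++) p[i] = r[i] + beta * p[i];
    rr = rrNew;
  }
  return (rr <= target) ? maxIter : -1;
}
```

A looser tolerance stops the iteration earlier at the cost of a larger final residual, which is the trade the test changes above make to shorten run time.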
--- a/tests/solver/Test_staggered_block_cg_unprec.cc
+++ b/tests/solver/Test_staggered_block_cg_unprec.cc
@@ -0,0 +1,119 @@
+    /*************************************************************************************
+
+    Grid physics library, www.github.com/paboyle/Grid 
+
+    Source file: ./tests/solver/Test_staggered_block_cg_unprec.cc
+
+    Copyright (C) 2015
+
+Author: Azusa Yamaguchi <ayamaguc@staffmail.ed.ac.uk>
+Author: Peter Boyle <paboyle@ph.ed.ac.uk>
+
+    This program is free software; you can redistribute it and/or modify
+    it under the terms of the GNU General Public License as published by
+    the Free Software Foundation; either version 2 of the License, or
+    (at your option) any later version.
+
+    This program is distributed in the hope that it will be useful,
+    but WITHOUT ANY WARRANTY; without even the implied warranty of
+    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+    GNU General Public License for more details.
+
+    You should have received a copy of the GNU General Public License along
+    with this program; if not, write to the Free Software Foundation, Inc.,
+    51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
+
+    See the full license in the file "LICENSE" in the top level distribution directory
+    *************************************************************************************/
+    /*  END LEGAL */
+#include <Grid/Grid.h>
+#include <Grid/algorithms/iterative/BlockConjugateGradient.h>
+
+using namespace std;
+using namespace Grid;
+using namespace Grid::QCD;
+
+template<class d>
+struct scal {
+  d internal;
+};
+
+  Gamma::Algebra Gmu [] = {
+    Gamma::Algebra::GammaX,
+    Gamma::Algebra::GammaY,
+    Gamma::Algebra::GammaZ,
+    Gamma::Algebra::GammaT
+  };
+
+int main (int argc, char ** argv)
+{
+  typedef typename ImprovedStaggeredFermion5DR::FermionField FermionField; 
+  typedef typename ImprovedStaggeredFermion5DR::ComplexField ComplexField; 
+  typename ImprovedStaggeredFermion5DR::ImplParams params; 
+
+  const int Ls=4;
+
+  Grid_init(&argc,&argv);
+
+  std::vector<int> latt_size   = GridDefaultLatt();
+  std::vector<int> simd_layout = GridDefaultSimd(Nd,vComplex::Nsimd());
+  std::vector<int> mpi_layout  = GridDefaultMpi();
+
+  GridCartesian         * UGrid   = SpaceTimeGrid::makeFourDimGrid(GridDefaultLatt(), GridDefaultSimd(Nd,vComplex::Nsimd()),GridDefaultMpi());
+  GridRedBlackCartesian * UrbGrid = SpaceTimeGrid::makeFourDimRedBlackGrid(UGrid);
+  GridCartesian         * FGrid   = SpaceTimeGrid::makeFiveDimGrid(Ls,UGrid);
+  GridRedBlackCartesian * FrbGrid = SpaceTimeGrid::makeFiveDimRedBlackGrid(Ls,UGrid);
+
+  std::vector<int> seeds({1,2,3,4});
+  GridParallelRNG pRNG(UGrid );  pRNG.SeedFixedIntegers(seeds);
+  GridParallelRNG pRNG5(FGrid);  pRNG5.SeedFixedIntegers(seeds);
+
+  FermionField src(FGrid); random(pRNG5,src);
+  FermionField result(FGrid); result=zero;
+  RealD nrm = norm2(src);
+
+  LatticeGaugeField Umu(UGrid); SU3::HotConfiguration(pRNG,Umu);
+
+  RealD mass=0.01;
+  ImprovedStaggeredFermion5DR Ds(Umu,Umu,*FGrid,*FrbGrid,*UGrid,*UrbGrid,mass);
+  MdagMLinearOperator<ImprovedStaggeredFermion5DR,FermionField> HermOp(Ds);
+
+  ConjugateGradient<FermionField> CG(1.0e-8,10000);
+  BlockConjugateGradient<FermionField> BCG(1.0e-8,10000);
+  MultiRHSConjugateGradient<FermionField> mCG(1.0e-8,10000);
+
+  std::cout << GridLogMessage << "************************************************************************ "<<std::endl;
+  std::cout << GridLogMessage << " Calling 4d CG "<<std::endl;
+  std::cout << GridLogMessage << "************************************************************************ "<<std::endl;
+  ImprovedStaggeredFermionR Ds4d(Umu,Umu,*UGrid,*UrbGrid,mass);
+  MdagMLinearOperator<ImprovedStaggeredFermionR,FermionField> HermOp4d(Ds4d);
+  FermionField src4d(UGrid); random(pRNG,src4d);
+  FermionField result4d(UGrid); result4d=zero;
+  CG(HermOp4d,src4d,result4d);
+  std::cout << GridLogMessage << "************************************************************************ "<<std::endl;
+
+
+  std::cout << GridLogMessage << "************************************************************************ "<<std::endl;
+  std::cout << GridLogMessage << " Calling 5d CG for "<<Ls <<" right hand sides" <<std::endl;
+  std::cout << GridLogMessage << "************************************************************************ "<<std::endl;
+  result=zero;
+  CG(HermOp,src,result);
+  std::cout << GridLogMessage << "************************************************************************ "<<std::endl;
+
+  std::cout << GridLogMessage << "************************************************************************ "<<std::endl;
+  std::cout << GridLogMessage << " Calling multiRHS CG for "<<Ls <<" right hand sides" <<std::endl;
+  std::cout << GridLogMessage << "************************************************************************ "<<std::endl;
+  result=zero;
+  mCG(HermOp,src,result);
+  std::cout << GridLogMessage << "************************************************************************ "<<std::endl;
+
+  std::cout << GridLogMessage << "************************************************************************ "<<std::endl;
+  std::cout << GridLogMessage << " Calling Block CG for "<<Ls <<" right hand sides" <<std::endl;
+  std::cout << GridLogMessage << "************************************************************************ "<<std::endl;
+  result=zero;
+  BCG(HermOp,src,result);
+  std::cout << GridLogMessage << "************************************************************************ "<<std::endl;
+
+
+  Grid_finalize();
+}
--- a/tests/solver/Test_staggered_cg_unprec.cc
+++ b/tests/solver/Test_staggered_cg_unprec.cc
@@ -0,0 +1,82 @@
+    /*************************************************************************************
+
+    Grid physics library, www.github.com/paboyle/Grid 
+
+    Source file: ./tests/solver/Test_staggered_cg_unprec.cc
+
+    Copyright (C) 2015
+
+Author: Azusa Yamaguchi <ayamaguc@staffmail.ed.ac.uk>
+Author: Peter Boyle <paboyle@ph.ed.ac.uk>
+
+    This program is free software; you can redistribute it and/or modify
+    it under the terms of the GNU General Public License as published by
+    the Free Software Foundation; either version 2 of the License, or
+    (at your option) any later version.
+
+    This program is distributed in the hope that it will be useful,
+    but WITHOUT ANY WARRANTY; without even the implied warranty of
+    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+    GNU General Public License for more details.
+
+    You should have received a copy of the GNU General Public License along
+    with this program; if not, write to the Free Software Foundation, Inc.,
+    51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
+
+    See the full license in the file "LICENSE" in the top level distribution directory
+    *************************************************************************************/
+    /*  END LEGAL */
+#include <Grid/Grid.h>
+#include <Grid/algorithms/iterative/BlockConjugateGradient.h>
+
+using namespace std;
+using namespace Grid;
+using namespace Grid::QCD;
+
+template<class d>
+struct scal {
+  d internal;
+};
+
+  Gamma::Algebra Gmu [] = {
+    Gamma::Algebra::GammaX,
+    Gamma::Algebra::GammaY,
+    Gamma::Algebra::GammaZ,
+    Gamma::Algebra::GammaT
+  };
+
+int main (int argc, char ** argv)
+{
+  typedef typename ImprovedStaggeredFermionR::FermionField FermionField; 
+  typedef typename ImprovedStaggeredFermionR::ComplexField ComplexField; 
+  typename ImprovedStaggeredFermionR::ImplParams params; 
+
+  Grid_init(&argc,&argv);
+
+  std::vector<int> latt_size   = GridDefaultLatt();
+  std::vector<int> simd_layout = GridDefaultSimd(Nd,vComplex::Nsimd());
+  std::vector<int> mpi_layout  = GridDefaultMpi();
+  GridCartesian               Grid(latt_size,simd_layout,mpi_layout);
+  GridRedBlackCartesian     RBGrid(latt_size,simd_layout,mpi_layout);
+
+  std::vector<int> seeds({1,2,3,4});
+  GridParallelRNG          pRNG(&Grid);  pRNG.SeedFixedIntegers(seeds);
+
+  FermionField src(&Grid); random(pRNG,src);
+  RealD nrm = norm2(src);
+  FermionField result(&Grid); result=zero;
+  LatticeGaugeField Umu(&Grid); SU3::HotConfiguration(pRNG,Umu);
+
+  double volume=1;
+  for(int mu=0;mu<Nd;mu++){
+    volume=volume*latt_size[mu];
+  }  
+  
+  RealD mass=0.1;
+  ImprovedStaggeredFermionR Ds(Umu,Umu,Grid,RBGrid,mass);
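+  MdagMLinearOperator<ImprovedStaggeredFermionR,FermionField> HermOp(Ds);
+  ConjugateGradient<FermionField> CG(1.0e-8,10000); // same tolerance and iteration cap as the block CG test
+  CG(HermOp,src,result);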
+
+  Grid_finalize();
+}
Author	SHA1	Message	Date
Peter Boyle	e57eafe388	Fix to multinode code	2017-04-26 14:46:52 -04:00
paboyle	738c1a11c2	longer nloop	2017-04-26 08:43:20 +01:00
Peter Boyle	f8797e1e3e	bug fix. works now and great face performance	2017-04-26 03:14:02 -04:00
Peter Boyle	fd1eb7de13	Clean implementation of the exterior faces listing only those points on the boudary	2017-04-26 02:34:52 -04:00
Peter Boyle	2ce898efa3	Pretty code	2017-04-26 02:34:25 -04:00
paboyle	ab66bac4e6	Think I'm getting on top of the reduced cost exterior precomputed list of links	2017-04-25 08:50:26 +01:00
paboyle	56277a11c8	Build a list of whats on the surface	2017-04-24 17:06:15 +01:00
paboyle	916e9e1d3e	Merge branch 'feature/half-prec-comms' of https://github.com/paboyle/Grid into feature/half-prec-comms	2017-04-24 10:39:19 +01:00
Peter Boyle	5b55867a7a	Slightly cheaper Ext assembly	2017-04-24 05:36:11 -04:00
Peter Boyle	3accb1ef89	Debugged assemply split phase with interior suppression	2017-04-23 19:30:19 -04:00
Peter Boyle	e3d0e31525	Debugged assemply split phase with interior suppression	2017-04-23 19:29:27 -04:00
Peter Boyle	5812eb8a8c	Partially fixed. But the comms-overlap does not work yet.	2017-04-22 18:50:25 -04:00
paboyle	4dd3763294	Use OMP as much as possible	2017-04-22 20:35:20 +01:00
paboyle	c429ace748	Cleaner OpenMP use	2017-04-22 20:28:42 +01:00
paboyle	ac58565d0a	Dangerous rewrite of the assembly. If I make a mistake the debug will be painful.	2017-04-22 19:31:04 +01:00
paboyle	3703b718aa	Mark up a table if a given site only receives from itself; including MPI3 splitting info.	2017-04-22 19:28:37 +01:00
paboyle	b722889234	Try a better load balancing loop	2017-04-22 19:27:41 +01:00
paboyle	abba44a837	Hand unrolled for overlapped comms	2017-04-22 17:45:17 +01:00
paboyle	f301be94ce	Fixed	2017-04-22 17:42:31 +01:00
Peter Boyle	1d1b225497	Hand unrolled Nc=3 kernels support split phase compute (on-node, off-node).	2017-04-22 09:05:28 -04:00
Peter Boyle	53a785a3dd	Fixing the KNL compile	2017-04-22 08:11:51 -04:00
paboyle	736bf3c866	Major rework of stencil. Half precision and MPI3 now working.	2017-04-22 11:33:50 +01:00
paboyle	b9bbe5d188	L1p config bg/q	2017-04-22 11:33:09 +01:00
paboyle	3844bcf800	If no f16c instructions supported must use software half precision conversion. This will also become useful on BG/Q, so will move out from SSE4 into a general area. Lifted the Eigen half precision from web. Looks sensible, but not extensively regressed against the intrinsics implementation yet.	2017-04-20 15:30:52 +01:00
paboyle	e1a2319d01	Simple compressor moved out of cshift into stencil	2017-04-20 13:18:15 +01:00
paboyle	180c732b4c	Move compressors out of Cshift. Slice iterators would help	2017-04-20 13:17:55 +01:00
paboyle	957a706d0b	Useful script	2017-04-20 13:17:44 +01:00
paboyle	d2312e9874	Drop compressor entirely from Cshift to only Stencil.	2017-04-20 13:16:55 +01:00
paboyle	fc4ab9ccd5	Working half precision comms	2017-04-20 11:20:26 +01:00
paboyle	4a340aa5ca	Massive compressor rework to support reduced precision comms	2017-04-20 09:28:27 +01:00
paboyle	3b7de792d5	Type comparison in the traits work	2017-04-18 13:28:04 +01:00
paboyle	557c3fa109	Pretty change	2017-04-18 13:27:38 +01:00
paboyle	ec18e9f7f6	Merge branch 'develop' into feature/half-prec-comms	2017-04-18 11:39:39 +01:00
paboyle	a839d5bc55	Updated todo list	2017-04-18 11:22:17 +01:00
paboyle	de41b84c5c	Merge branch 'feature/normHP' into develop	2017-04-18 10:57:21 +01:00
paboyle	8e161152e4	MultiRHS solver improvements with slice operations moved into lattice and sped up. Block solver requires a lot of performance work.	2017-04-18 10:51:55 +01:00
paboyle	3141ebac10	MultiRHS working, starting to optimise. Block doesn't and I thought it already was; puzzled.	2017-04-17 10:50:19 +01:00
paboyle	7ede696126	Non compile of tests fixed	2017-04-16 23:40:00 +01:00
paboyle	bf516c3b81	higher precision reduction variables in norm and inner product	2017-04-15 12:27:28 +01:00
paboyle	441a52ee5d	First cut at higher precision reduction	2017-04-15 10:57:21 +01:00
paboyle	a8db024c92	Cleaning up the dense matrix and lanczos sector	2017-04-15 08:54:11 +01:00
paboyle	a9c22d5f43	Verbose removal	2017-04-14 14:38:49 +01:00
paboyle	3ca41458a3	Fix to no USE_FP16 case	2017-04-14 14:20:54 +01:00
paboyle	9e2d29c644	USE_FP16 macro	2017-04-14 14:17:14 +01:00
Peter Boyle	951be75292	Half precision conversion working on AVX512 now too	2017-04-13 17:35:11 +01:00
Peter Boyle	b9113ed310	Patches for knl	2017-04-13 12:02:12 -04:00
paboyle	42fb49d3fd	Merge branch 'develop' of https://github.com/paboyle/Grid into develop	2017-04-13 14:12:47 +01:00
paboyle	2a54c9aaab	Merge branch 'feature/block-cg' into develop	2017-04-13 14:12:24 +01:00
paboyle	0957378679	Fixing conditional ugly way	2017-04-13 13:47:56 +01:00
paboyle	2ed6c76fc5	Getting multiline if then fi working	2017-04-13 13:43:13 +01:00
paboyle	d3b9a7fa14	F16c apparently requires AVX, even if the 128 bit are used. Seems odd.	2017-04-13 13:19:11 +01:00
paboyle	75ea306ce9	Another try at travis	2017-04-13 13:05:32 +01:00
paboyle	4226c633c4	Default to FP16 off again	2017-04-13 12:51:39 +01:00
paboyle	5a4eafbf7e	.travis	2017-04-13 12:50:43 +01:00
paboyle	eb8e26018b	Travis update for macos	2017-04-13 12:35:11 +01:00
paboyle	db5ea001a3	Update to use Xcode 8.3 since -mfp16 causes SIGILL	2017-04-13 12:22:40 +01:00
paboyle	2846f079e5	Predicate tests on fp16 being enabled	2017-04-13 12:08:05 +01:00
paboyle	1d502e4ed6	FP16 optional compile time	2017-04-13 11:55:24 +01:00
paboyle	73cdf0fffe	Drop f16c from SSE because of a macos compile error on travis	2017-04-13 11:23:41 +01:00
paboyle	1c25773319	Trap illegal instructions	2017-04-13 10:51:40 +01:00
paboyle	c38400b26f	Trap signals	2017-04-13 10:35:20 +01:00
paboyle	9c3065b860	Debug flags off again	2017-04-13 10:01:32 +01:00
paboyle	94eb829d08	Align cast fixed for __mm128i gcc complained	2017-04-13 08:40:44 +01:00
paboyle	68392ddb5b	Exchange in generic Precision change in AVX, SSE, AVX512, Generic. QPX still to do.	2017-04-13 08:38:12 +01:00
paboyle	cb6b81ae82	Half precision conversion	2017-04-12 19:32:37 +01:00
Antonin Portelli	8ef4300412	spurious .dirstamp files removed	2017-04-10 17:00:22 +01:00
Antonin Portelli	98a24ebf31	The macro “magics” is very intensive for the preprocessor in the measurement code which has numerous serialisable classes. Reducing the number of serialisable fields to 64 (instead of 1024) helps a lot, this is enough for now and can be extended trivially if needed in the future.	2017-04-10 16:58:54 +01:00
paboyle	b12dc89d26	Commenting and clean up	2017-04-10 20:38:20 +09:00
paboyle	d80d802f9d	MultiRHS solver test	2017-04-10 00:12:12 +09:00
paboyle	3d99b09dba	Start of blockCG	2017-04-09 23:42:10 +09:00
paboyle	db5f6d3ae3	Verbose fix	2017-04-09 23:41:30 +09:00
paboyle	683550f116	Const args improvement	2017-04-09 23:41:04 +09:00
paboyle	55d0329624	Merge branch 'develop' of https://github.com/paboyle/Grid into develop	2017-04-07 11:08:14 +09:00
paboyle	86aaa35294	Christoph needs SchurDiagTwoKappa which is mobius specific.	2017-04-07 11:07:40 +09:00
Guido Cossu	172d3dc93a	Correcting names in tests	2017-04-05 16:24:04 +01:00