Adding pdf generation capabilities

2025-06-16 23:07:05 +01:00 · 2018-03-15 17:01:59 +00:00
parent 7c51dc07f1
commit 38f93f7e12
60 changed files with 5593 additions and 112 deletions
--- a/_pages/docs/API/coordinates.md
+++ b/_pages/docs/API/coordinates.md
@ -0,0 +1,34 @@
+---
+title : "API Documentation"
+author_profile: false
+excerpt: "Coordinates"
+header:
+  overlay_color: "#5DADE2"
+permalink: /docs/API/coordinates.html
+sidebar:
+  nav : docs
+---
+
+The Grid is define on a N-dimensional set of integer coordinates. 
+
+The maximum dimension is eight, and indexes in this space make use of the `Coordinate` class.
+The coordinate class shares a similar interface to `std::vector<int>`, but contains all data within the
+object, and has a fixed maximum length (template parameter).
+
+**Example**:
+
+```c++
+const int Nd=4;
+Coordinate point(Nd);
+
+for(int i=0;i<Nd;i++) 
+   point[i] = 1;
+
+std::cout<< point <<std::endl;
+point.resize(3);
+std::cout<< point <<std::endl;
+```
+
+This enables the coordinates to be manipulated without heap allocation or thread contention,
+and avoids introducing STL functions into GPU code, but does so at the expense of introducing
+a maximum dimensionality. This limit is easy to change (`lib/util/Coordinate.h`).
--- a/_pages/docs/API/grids.md
+++ b/_pages/docs/API/grids.md
@ -0,0 +1,90 @@
+---
+title : "API Documentation"
+author_profile: false
+excerpt: "Grids"
+header:
+  overlay_color: "#5DADE2"
+permalink: /docs/API/grids.html
+sidebar:
+  nav : docs
+---
+
+A `Grid` object defines the geometry of a global cartesian array, and through inheritance
+provides access to message passing decomposition, the local lattice, and the message passing primitives.
+
+The constructor requires parameters to indicate how the spatial (and temporal) indices
+are decomposed across MPI tasks and SIMD lanes of the vector length.
+We use a partial vectorisation transformation, must select
+which space-time dimensions participate in SIMD vectorisation.
+The Lattice containers are defined to have opaque internal layout, hiding this layout transformation.
+
+We define GridCartesian and GridRedBlackCartesian which both inherit from `GridBase`:
+
+```c++
+class GridCartesian        : public GridBase 
+class GridRedBlackCartesian: public GridBase 
+```
+
+The simplest Cartesian Grid constructor distributes across `MPI_COMM_WORLD`:
+
+```c++
+/////////////////////////////////////////////////////////////////////////
+// Construct from comm world
+/////////////////////////////////////////////////////////////////////////
+GridCartesian(const Coordinate &dimensions,
+  const Coordinate &simd_layout,
+  const Coordinate &processor_grid);
+```
+
+A second constructor will create a child communicator from a previously declared Grid.
+This allows to subdivide the processor grid, and also to define lattices of differing dimensionalities and sizes,
+useful for both Chiral fermions, lower dimensional operations, and multigrid:
+
+```c++
+/////////////////////////////////////////////////////////////////////////
+// Constructor takes a parent grid and possibly subdivides communicator.
+/////////////////////////////////////////////////////////////////////////
+GridCartesian(const Coordinate &dimensions,
+        const Coordinate &simd_layout,
+        const Coordinate &processor_grid,
+        const GridCartesian &parent,int &split_rank);
+```
+
+The Grid object provides much `internal` functionality to map a lattice site to 
+a node and lexicographic index. These are not needed by code interfacing
+to the data parallel layer.
+
+**Example** (`tests/solver/Test_split_grid.cc`):
+
+```c++
+  const int Ls=8;
+
+  ////////////////////////////////////////////
+  // Grids distributed across full machine
+  // pick up default command line args
+  ////////////////////////////////////////////
+  Grid_init(&argc,&argv);
+
+  Coordinate latt_size   = GridDefaultLatt();
+  Coordinate simd_layout = GridDefaultSimd(Nd,vComplex::Nsimd());
+  Coordinate mpi_layout  = GridDefaultMpi();
+
+  GridCartesian * UGrid = SpaceTimeGrid::makeFourDimGrid(GridDefaultLatt(), 
+                            GridDefaultSimd(Nd,vComplex::Nsimd()),GridDefaultMpi());
+  GridCartesian         * FGrid   = SpaceTimeGrid::makeFiveDimGrid(Ls,UGrid);
+  GridRedBlackCartesian * rbGrid  = SpaceTimeGrid::makeFourDimRedBlackGrid(UGrid);
+  GridRedBlackCartesian * FrbGrid = SpaceTimeGrid::makeFiveDimRedBlackGrid(Ls,UGrid);
+
+  /////////////////////////////////////////////
+  // Split into N copies of 1^4 mpi communicators
+  /////////////////////////////////////////////
+  Coordinate mpi_split (mpi_layout.size(),1);
+  GridCartesian * SGrid = new GridCartesian(GridDefaultLatt(),
+                                GridDefaultSimd(Nd,vComplex::Nsimd()),
+                                mpi_split,
+                                *UGrid); 
+
+  GridCartesian         * SFGrid   = SpaceTimeGrid::makeFiveDimGrid(Ls,SGrid);
+  GridRedBlackCartesian * SrbGrid  = SpaceTimeGrid::makeFourDimRedBlackGrid(SGrid);
+  GridRedBlackCartesian * SFrbGrid = SpaceTimeGrid::makeFiveDimRedBlackGrid(Ls,SGrid);
+```
--- a/_pages/docs/API/introduction.md
+++ b/_pages/docs/API/introduction.md
@ -0,0 +1,32 @@
+---
+title : "API Documentation"
+author_profile: false
+excerpt: "Data parallel API"
+header:
+  overlay_color: "#5DADE2"
+permalink: /docs/API/introduction.html
+sidebar:
+  nav : docs
+---
+
+
+Data parallel array indices are divided into two types. 
+
+* Internal indices, such as complex, colour, spin degrees of freedom 
+* spatial (space-time) indices.
+
+The ranges of all internal degrees are determined by template parameters, 
+and known at compile time. The ranges of spatial indices are dynamic, run time
+values and the Cartesian structure information is contained and accessed via `Grid` objects.
+
+Grid objects are the controlling entity for the decomposition of a distributed `Lattice`
+array across MPI tasks, nodes, SIMD lanes, accelerators. Threaded loops are used
+as appropriate on host code.
+
+Data parallel operations can only be performed between Lattice objects constructed
+from the same Grid pointer. These are called `conformable` operations.
+
+We will focus initially on the internal indices as these are the building blocks assembled
+in Lattice container classes. Every Lattice container class constructor requires a Grid object 
+pointer. 
+
--- a/_pages/docs/API/lattice_containers.md
+++ b/_pages/docs/API/lattice_containers.md
@ -0,0 +1,560 @@
+---
+title : "API Documentation"
+author_profile: false
+excerpt: "Lattice containers"
+header:
+  overlay_color: "#5DADE2"
+permalink: /docs/API/lattice_containers.html
+sidebar:
+  nav : docs
+---
+
+{% include toc icon="gears" title="Contents" %}
+
+Lattice objects may be constructed to contain the local portion of a distribued array of any tensor type.
+For performance reasons the tensor type uses a vector `Real` or `Complex` as the fundamental datum.
+
+Every lattice requires a `GridBase` object pointer to be provided in its constructor. Memory is allocated
+at construction time. If a Lattice is passed a RedBlack grid, it allocates
+half the storage of the full grid, and may either store the red or black checkerboard. The Lattice object
+will automatically track through assignments which checkerboard it refers to.
+For example, shifting a Even checkerboard by an odd distance produces an Odd result field.
+
+Struct of array objects are defined, and used in the template parameters to the lattice class.
+
+**Example** (`lib/qcd/QCD.h`):
+
+```c++
+       template<typename vtype> using iSpinMatrix = iScalar<iMatrix<iScalar<vtype>, Ns> >;
+       typedef iSpinMatrix<ComplexF>           SpinMatrixF; //scalar
+       typedef iSpinMatrix<vComplexF>          vSpinMatrixF;//vectorised
+       typedef Lattice<vSpinMatrixF>           LatticeSpinMatrixF;
+```
+
+The full range of QCD relevant lattice objects is given below.
+
+|-----------|------------|----------|-----------|---------------|---------------------------------|--------------------------|
+|   Lattice |    Lorentz |    Spin  |    Colour | scalar_type   |   Field                         |      Synonym             |
+|-----------|------------|----------|-----------|---------------|---------------------------------|--------------------------|
+|`Vector`   | `Scalar`   | `Scalar` |  `Scalar` |  `RealD`      | `LatticeRealD`                  |   N/A                    |
+|`Vector`   | `Scalar`   | `Scalar` |  `Scalar` |  `ComplexD`   | `LatticeComplexD`               |   N/A                    |
+|`Vector`   | `Scalar`   | `Scalar` |  `Matrix` |  `ComplexD`   | `LatticeColourMatrixD`          |   `LatticeGaugeLink`     |
+|`Vector`   | `Vector`   | `Scalar` |  `Matrix` |  `ComplexD`   | `LatticeLorentzColourMatrixD`   |   `LatticeGaugeFieldD`   |
+|`Vector`   | `Scalar`   | `Vector` |  `Vector` |  `ComplexD`   | `LatticeSpinColourVectorD`      |   `LatticeFermionD`      |
+|`Vector`   | `Scalar`   | `Vector` |  `Vector` |  `ComplexD`   | `LatticeHalfSpinColourVectorD`  |   `LatticeHalfFermionD`  |
+|`Vector`   | `Scalar`   | `Matrix` |  `Matrix` |  `ComplexD`   | `LatticeSpinColourMatrixD`      |   `LatticePropagatorD`   |
+|-----------|------------|----------|-----------|---------------|---------------------------------|--------------------------|
+
+Additional single precison variants are defined with the suffix `F`.
+Other lattice objects can be defined using the sort of typedef's shown above if needed.
+
+### Opaque containers
+
+The layout within the container is complicated to enable maximum opportunity for vectorisation, and 
+is opaque from the point of view of the API definition. The key implementation observation is that
+so long as data parallel operations are performed and adjacent SIMD lanes correspond to well separated
+lattice sites, then identical operations are performed on all SIMD lanes and enable good vectorisation.
+
+Because the layout is opaque, import and export routines from naturally ordered x,y,z,t arrays
+are provided (`lib/lattice/Lattice_transfer.h`):
+
+```c++
+    unvectorizeToLexOrdArray(std::vector<sobj> &out, const Lattice<vobj> &in);
+    vectorizeFromLexOrdArray(std::vector<sobj> &in , Lattice<vobj> &out);
+```
+
+The Lexicographic order of data in the external vector fields is defined by (`lib/util/Lexicographic.h`):
+
+```c++
+    Lexicographic::IndexFromCoor(const Coordinate &lcoor, int &lex,Coordinate *local_dims);
+```
+
+This ordering is $$x + L_x * y + L_x*L_y*z + L_x*L_y*L_z *t$$
+
+Peek and poke routines are provided to perform single site operations. These operations are
+extremely low performance and are not intended for algorithm development or performance critical code.
+
+The following are "collective" operations and involve communication between nodes. All nodes receive the same
+result by broadcast from the owning node:
+
+```c++
+    void peekSite(sobj &s,const Lattice<vobj> &l,const Coordinate &site);
+    void pokeSite(const sobj &s,Lattice<vobj> &l,const Coordinate &site);
+```
+
+The following are executed independently by each node:
+
+```c++
+    void peekLocalSite(sobj &s,const Lattice<vobj> &l,Coordinate &site);
+    void pokeLocalSite(const sobj &s,Lattice<vobj> &l,Coordinate &site);
+```
+
+Lattices of one tensor type may be transformed into lattices of another tensor type by
+peeking and poking specific indices in a data parallel manner:
+
+```c++
+    template<int Index,class vobj> // Vector data parallel index peek
+    auto PeekIndex(const Lattice<vobj> &lhs,int i);
+
+    template<int Index,class vobj> // Matrix data parallel index peek
+    auto PeekIndex(const Lattice<vobj> &lhs,int i,int j);
+
+    template<int Index,class vobj>   // Vector poke
+    void PokeIndex(Lattice<vobj> &lhs,const Lattice<> & rhs,int i)
+
+    template<int Index,class vobj>   // Matrix poke
+    void PokeIndex(Lattice<vobj> &lhs,const Lattice<> & rhs,int i,int j)
+```
+
+The inconsistent capitalisation on the letter P is due to an obscure bug in g++ that has not to
+our knowledge been fixed in any version. The bug was reported in 2016.
+
+### Global Reduction operations
+
+Reduction operations for any lattice field are provided. The result is identical on each computing node
+that is part of the relevant Grid communicator:
+
+```c++
+  template<class vobj> 
+  RealD norm2(const Lattice<vobj> &arg);
+
+  template<class vobj> 
+  ComplexD innerProduct(const Lattice<vobj> &left,const Lattice<vobj> &right);
+
+  template<class vobj> 
+  vobj sum(const Lattice<vobj> &arg)
+```
+
+### Site local reduction operations
+
+Internal indices may be reduced, site by site, using the following routines:
+
+```c++
+  template<class vobj>
+  auto localNorm2 (const Lattice<vobj> &rhs)
+
+  template<class vobj>
+  auto localInnerProduct (const Lattice<vobj> &lhs,const Lattice<vobj> &rhs) 
+```
+
+### Outer product
+
+A site local outer product is defined:
+
+```c++
+  template<class ll,class rr>
+  auto outerProduct (const Lattice<ll> &lhs,const Lattice<rr> &rhs)
+```
+
+### Slice operations
+
+Slice operations are defined to operate on one lower dimension than the full lattice. The omitted dimension
+is the parameter orthogdim:
+
+```c++
+  template<class vobj> 
+  void sliceSum(const Lattice<vobj> &Data,
+                std::vector<typename vobj::scalar_object> &result,
+ 	        int orthogdim);
+
+  template<class vobj> 
+  void sliceInnerProductVector( std::vector<ComplexD> & result, 
+	                        const Lattice<vobj> &lhs,
+	 			const Lattice<vobj> &rhs,
+				int orthogdim); 
+
+  template<class vobj> 
+  void sliceNorm (std::vector<RealD> &sn,
+       		  const Lattice<vobj> &rhs,
+		  int orthogdim);
+```
+
+### Data parallel expression template engine
+
+The standard arithmetic operators and some data parallel library functions are implemented site by site
+on lattice types. 
+
+Operations may only ever combine lattice objects that have been constructed from the **same** grid pointer.
+
+**Example**:
+
+```c++
+    LatticeFermionD A(&grid);
+    LatticeFermionD B(&grid);
+    LatticeFermionD C(&grid);
+    
+    A = B - C;
+```
+
+Such operations are said to be **conformable** and are the lattice are guaranteed to have the same dimensions
+and both MPI and SIMD decomposition because they are based on the same grid object. The conformability check
+is lightweight and simply requires the same grid pointers be passed to the lattice objects. The data members
+of the grid objects are not compared.
+
+Conformable lattice fields may be combined with appropriate scalar types in expressions. The implemented
+rules follow those already documented for the tensor types. 
+
+
+
+
+### Unary operators and functions
+
+The following sitewise unary operations are defined:
+
+|-----------------------|---------------------------------------------|
+| Operation             |    Description                              |
+|-----------------------|---------------------------------------------|
+|`operator-`            |  negate                                     |
+|`adj`                  |  Hermitian conjugate                        |
+|`conjugate`            |  complex conjugate                          |
+|`trace`                |  sitewise trace                             |
+|`transpose`            |  sitewise transpose                         |
+|`Ta`                   |  take traceles anti Hermitian part          |
+|`ProjectOnGroup`       |  reunitarise or orthogonalise               |
+|`real`                 |  take the real part                         |
+|`imag`                 |  take the imaginary part                    |
+|`toReal`               |  demote complex to real                     |
+|`toComplex`            |  promote real to complex                    |
+|`timesI`               |  elementwise +i mult (0 is not multiplied)  |
+|`timesMinusI`          |  elementwise -i mult (0 is not multiplied)  |
+|`abs`                  |  elementwise absolute value                 |
+|`sqrt`                 |     elementwise square root                 |
+|`rsqrt`                |     elementwise reciprocal square root      |
+|`sin`                  |     elementwise sine                        |
+|`cos`                  |     elementwise cosine                      |
+|`asin`                 |     elementwise inverse sine                |
+|`acos`                 |     elementwise inverse cosine              |
+|`log`                  |     elementwise logarithm                   |
+|`exp`                  |     elementwise exponentiation              |
+|`operator!`            |     Logical negation of integer field       |
+|`Not`                  |     Logical negation of integer field       |
+|-----------------------|---------------------------------------------|
+
+The following sitewise applied functions with additional parameters are:
+
+```c++
+  template<class obj> Lattice<obj> pow(const Lattice<obj> &rhs_i,RealD y);
+
+  template<class obj> Lattice<obj> mod(const Lattice<obj> &rhs_i,Integer y);
+
+  template<class obj> Lattice<obj> div(const Lattice<obj> &rhs_i,Integer y);
+
+  template<class obj> Lattice<obj> 
+  expMat(const Lattice<obj> &rhs_i, RealD alpha, Integer Nexp = DEFAULT_MAT_EXP);
+```
+
+### Binary operators
+
+The following binary operators are defined:
+
+```
+  operator+
+  operator-
+  operator*
+  operator/
+```
+
+Logical are defined on LatticeInteger types:
+
+```
+  operator&
+  operator|
+  operator&&
+  operator||
+```
+
+### Ternary operator, logical operatons and where
+
+Within the data parallel level of the API the only way to perform operations
+that are differentiated between sites is use predicated execution.
+
+The predicate takes the form of a `LatticeInteger` which is confromable with both
+the `iftrue` and `iffalse` argument:
+
+```c++
+  template<class vobj,class iobj> void where(const Lattice<iobj> &pred,
+                                                   Lattice<vobj> &iftrue,
+                                                   Lattice<vobj> &iffalse);
+```
+This plays the data parallel analogue of the C++ ternary operator:
+
+```c++
+     a = b ? c : d;
+```
+
+In order to create the predicate in a coordinate dependent fashion it is often useful
+to use the lattice coordinates. 
+
+The `LatticeCoordinate` function:
+
+```c++
+    template<class iobj> LatticeCoordinate(Lattice<iobj> &coor,int dir);
+```
+
+fills an `Integer` field with the coordinate in the N-th dimension.
+A usage example is given
+
+**Example**:
+
+```c++
+        int dir  =3;
+        int block=4;
+        LatticeInteger coor(FineGrid);
+
+	LatticeCoordinate(coor,dir);
+	
+	result = where(mod(coor,block)==(block-1),x,z);
+```
+
+(Other usage cases of LatticeCoordinate include the generation of plane wave momentum phases.)
+
+### Site local fused operations
+
+The biggest limitation of expression template engines is that the optimisation 
+visibility is a single assignment statement in the original source code.
+
+There is no scope for loop fusion between multiple statements.
+Multi-loop fusion gives scope for greater cache locality.
+
+Two primitives for hardware aware parallel loops are provided.
+These will operate directly on the site objects which are expanded by a factor
+of the vector length (in our struct of array datatypes). 
+
+Since the mapping of sites
+to data lanes is opaque, these vectorised loops
+are *only* appropriate for optimisation of site local operations.
+
+### View objects
+
+Due to an obscure aspect of the way that Nvidia handle device C++11 lambda functions,
+it is necessary to disable the indexing of a Lattice object.
+
+Rather, a reference to a lattice object must be first obtained.
+
+The reference is copyable to a GPU, and is able to be indexed on either accelerator code,
+or by host code.
+
+In order to prevent people developing code that dereferences Lattice objects in a way that
+works on CPU compilation, but fails on GPU compilation, we have decided to remove the ability
+to index a lattice object on CPU code. 
+
+As a result of Nvidia's constraints, all accesses to lattice objects are required to be made
+through a View object. 
+
+In the following, the type is `LatticeView<vobj>`, however it is wise to use the C++11 auto keyword
+to avoid naming the type. See code examples below.
+
+
+### thread_loops
+
+The first parallel primitive is the thread_loop
+
+**Example**:
+
+```c++
+  LatticeField r(grid); 
+  LatticeField x(grid);
+  LatticeField p(grid); 
+  LatticeField mmp(grid);
+  auto r_v = r.View();  
+  auto x_v = x.View();
+  auto p_v = p.View(); 
+  auto mmp_v = mmp.View();
+  thread_loop(s , r_v, {
+    r_v[s] = r_v[s]   - a * mmp_v[s];
+    x_v[s] = x_v[s]   + a*p_v[s];
+    p_v[s] = p_v[s]*b + r_v[s];
+  });
+```
+
+### accelerator_loops
+
+The second parallel primitive is an accelerated_loop
+
+**Example**:
+
+```c++
+  LatticeField r(grid); 
+  LatticeField x(grid);
+  LatticeField p(grid); 
+  LatticeField mmp(grid);
+  auto r_v = r.View();  
+  auto x_v = x.View();
+  auto p_v = p.View(); 
+  auto mmp_v = mmp.View();
+  accelerator_loop(s , r_v, {
+    r_v[s] = r_v[s]   - a * mmp_v[s];
+    x_v[s] = x_v[s]   + a*p_v[s];
+    p_v[s] = p_v[s]*b + r_v[s];
+  });
+```
+
+
+### Cshift 
+
+Site shifting operations are provided using the Cshift function:
+
+```c++
+  template<class vobj> 
+  Lattice<vobj> Cshift(const Lattice<vobj> &rhs,int dimension,int shift)
+```
+
+This shifts the whole vector by any distance shift in the appropriate dimension.
+
+For the avoidance of doubt on direction conventions,a positive shift moves the 
+lattice site $$x_mu = 1$$ in the rhs to $$x_mu = 0$$ in the result.
+
+**Example** (`benchmarks/Benchmark_wilson.cc`):
+
+```c++
+  { // Naive wilson implementation
+    ref = Zero();
+    for(int mu=0;mu<Nd;mu++){
+
+      tmp = U[mu]*Cshift(src,mu,1);
+      {
+	auto ref_v = ref.View();
+	auto tmp_v = tmp.View();
+	for(int i=0;i<ref_v.size();i++){
+	  ref_v[i]+= tmp_v[i] - Gamma(Gmu[mu])*tmp_v[i]; ;
+	}
+      }
+
+      tmp =adj(U[mu])*src;
+      tmp =Cshift(tmp,mu,-1);
+      {
+	auto ref_v = ref.View();
+	auto tmp_v = tmp.View();
+	for(int i=0;i<ref_v.size();i++){
+	  ref_v[i]+= tmp_v[i] + Gamma(Gmu[mu])*tmp_v[i]; ;
+	}
+      }
+    }
+  }
+```
+
+### CovariantCshift 
+
+Covariant Cshift operations are provided for common cases of boundary condition. These may be further optimised
+in future:
+
+```c++
+  template<class covariant,class gauge> 
+  Lattice<covariant> CovShiftForward(const Lattice<gauge> &Link, int mu,
+			   	     const Lattice<covariant> &field);
+
+  template<class covariant,class gauge> 
+  Lattice<covariant> CovShiftBackward(const Lattice<gauge> &Link, int mu,
+			              const Lattice<covariant> &field);
+```
+
+### Boundary conditions
+
+The covariant shift routines occur in namespaces PeriodicBC and ConjugateBC. The correct covariant shift
+for the boundary condition is passed into the gauge actions and wilson loops via an
+"Impl" template policy class.
+
+The relevant staples, plaquettes, and loops are formed by using the provided method:
+
+```c++
+    Impl::CovShiftForward
+    Impl::CovShiftBackward
+```
+
+etc... This makes physics code transform appropriately with externally supplied rules about
+treating the boundary.
+
+**Example** (`lib/qcd/util/WilsonLoops.h`):
+
+```c++
+  static void dirPlaquette(GaugeMat &plaq, const std::vector<GaugeMat> &U,
+                           const int mu, const int nu) {
+    // ___
+    //|   |
+    //|<__|
+    plaq = Gimpl::CovShiftForward(U[mu],mu,
+                    Gimpl::CovShiftForward(U[nu],nu,
+			Gimpl::CovShiftBackward(U[mu],mu,
+			   Gimpl::CovShiftIdentityBackward(U[nu], nu))));
+  }
+```
+
+### Inter-grid transfer operations
+
+Transferring between different checkerboards of the same global lattice:
+
+```c++
+  template<class vobj> void pickCheckerboard(int cb,Lattice<vobj> &half,const Lattice<vobj> &full);
+  template<class vobj> void setCheckerboard(Lattice<vobj> &full,const Lattice<vobj> &half);
+```
+
+These are used to set up Schur red-black decomposed solvers, for example.
+
+Multi-grid projection between a fine and coarse grid:
+
+```c++
+ template<class vobj,class CComplex,int nbasis>
+ void blockProject(Lattice<iVector<CComplex,nbasis > > &coarseData,
+                   const             Lattice<vobj>   &fineData,
+                   const std::vector<Lattice<vobj> > &Basis);
+```
+
+Multi-grid promotion to a finer grid:
+
+```c++
+  template<class vobj,class CComplex,int nbasis>
+  void blockPromote(const Lattice<iVector<CComplex,nbasis > > &coarseData,
+                    Lattice<vobj>   &fineData,
+                    const std::vector<Lattice<vobj> > &Basis)
+```
+
+Support for sub-block Linear algebra:
+
+```c++
+  template<class vobj,class CComplex>
+  void blockZAXPY(Lattice<vobj> &fineZ,
+                  const Lattice<CComplex> &coarseA,
+                  const Lattice<vobj> &fineX,
+                  const Lattice<vobj> &fineY)
+
+  template<class vobj,class CComplex>
+  void blockInnerProduct(Lattice<CComplex> &CoarseInner,
+                         const Lattice<vobj> &fineX,
+                         const Lattice<vobj> &fineY)
+
+  template<class vobj,class CComplex>
+  void blockNormalise(Lattice<CComplex> &ip,Lattice<vobj> &fineX)
+
+  template<class vobj>
+  void blockSum(Lattice<vobj> &coarseData,const Lattice<vobj> &fineData)
+
+  template<class vobj,class CComplex>
+  void blockOrthogonalise(Lattice<CComplex> &ip,std::vector<Lattice<vobj> > &Basis)
+```
+
+Conversion between different SIMD layouts:
+
+```c++
+  template<class vobj,class vvobj>
+  void localConvert(const Lattice<vobj> &in,Lattice<vvobj> &out)
+```
+
+Slices between grid of dimension N and grid of dimentions N+1:
+
+```c++
+  template<class vobj>
+  void InsertSlice(const Lattice<vobj> &lowDim,Lattice<vobj> & higherDim,int slice, int orthog)
+
+  template<class vobj>
+  void ExtractSlice(Lattice<vobj> &lowDim,const Lattice<vobj> & higherDim,int slice, int orthog)
+```
+
+Growing a lattice by a multiple factor, with periodic replication:
+
+```c++
+  template<class vobj>
+  void Replicate(Lattice<vobj> &coarse,Lattice<vobj> & fine)
+```
+
+That latter is useful to, for example, pre-thermalise a smaller volume and then grow the volume in HMC.
+It was written while debugging G-parity boundary conditions.
--- a/_pages/docs/API/random_number_generators.md
+++ b/_pages/docs/API/random_number_generators.md
@ -0,0 +1,107 @@
+---
+title : "API Documentation"
+author_profile: false
+excerpt: "Random number generators"
+header:
+  overlay_color: "#5DADE2"
+permalink: /docs/API/random_number_generators.html
+sidebar:
+  nav : docs
+---
+
+Grid provides three configure time options for random the number generator engine.
+
+* `sitmo`
+* `ranlux48`
+* `mt19937`
+
+The selection is controlled by the `--enable-rng=<option>` flag.
+
+Sitmo is the default Grid RNG and is recommended. It is a hash based RNG that is cryptographically secure and has 
+
+#. passed the BigCrush tests
+
+#. can Skip forward an arbitrary distance (up to 2^256) in O(1) time
+
+We use Skip functionality to place each site in an independent well separated stream.
+The Skip was trivially parallelised, important in a many core node,
+and gives very low overhead parallel RNG initialisation.
+
+Our implementation of parallel RNG
+
+* Has passed the BigCrush tests **drawing once from each site RNG** in a round robin fashion.
+
+This test is applied in `tests/testu01/Test_smallcrush.cc`
+
+The interface is as follows::
+
+```c++
+  class GridSerialRNG { 
+    GridSerialRNG();
+    void SeedFixedIntegers(const std::vector<int> &seeds);
+  }
+
+  class GridParallelRNG {
+    GridParallelRNG(GridBase *grid);
+    void SeedFixedIntegers(const std::vector<int> &seeds);
+  }
+
+  template <class vobj> void random(GridParallelRNG &rng,Lattice<vobj> &l)   { rng.fill(l,rng._uniform);  }
+  template <class vobj> void gaussian(GridParallelRNG &rng,Lattice<vobj> &l) { rng.fill(l,rng._gaussian); }
+
+  template <class sobj> void random(GridSerialRNG &rng,sobj &l)   { rng.fill(l,rng._uniform  ); }
+  template <class sobj> void gaussian(GridSerialRNG &rng,sobj &l) { rng.fill(l,rng._gaussian ); }
+```
+
+* Serial RNG's are used to assign scalar fields. 
+
+* Parallel RNG's are used to assign lattice fields and must subdivide the field grid (need not be conformable).
+
+It is the API users responsibility to initialise, manage, save and restore these RNG state for their algorithm.
+In particular there is no single globally managed RNG state. 
+
+Input/Output routines are provided for saving and restoring RNG states.
+
+`lib/parallelIO/BinaryIO.h`:
+
+```c++
+
+  ////////////////////////////////////////////////////////////////////////////
+  // Read a RNG;  use IOobject and lexico map to an array of state 
+  ////////////////////////////////////////////////////////////////////////////
+  static void readRNG(GridSerialRNG &serial,
+			     GridParallelRNG &parallel,
+			     std::string file,
+			     Integer offset,
+			     uint32_t &nersc_csum,
+			     uint32_t &scidac_csuma,
+			     uint32_t &scidac_csumb)
+  ////////////////////////////////////////////////////////////////////////////
+  // Write a RNG; lexico map to an array of state and use IOobject
+  ////////////////////////////////////////////////////////////////////////////
+  static void writeRNG(GridSerialRNG &serial,
+			      GridParallelRNG &parallel,
+			      std::string file,
+			      Integer offset,
+			      uint32_t &nersc_csum,
+			      uint32_t &scidac_csuma,
+			      uint32_t &scidac_csumb)
+
+lib/parallelIO/NerscIO.h::
+
+  void writeRNGState(GridSerialRNG &serial,GridParallelRNG &parallel,std::string file);
+
+  void readRNG(GridSerialRNG &serial,
+	       GridParallelRNG &parallel,
+	       std::string file,
+   	       Integer offset,
+ 	       uint32_t &nersc_csum,
+	       uint32_t &scidac_csuma,
+	       uint32_t &scidac_csumb);
+```
+
+**Example**:
+
+```c++
+  NerscIO::writeRNGState(sRNG,pRNG,rfile);
+```
--- a/_pages/docs/API/tensor_classes.md
+++ b/_pages/docs/API/tensor_classes.md
@ -0,0 +1,443 @@
+---
+title : "API Documentation"
+author_profile: false
+excerpt: "Tensor classes"
+header:
+  overlay_color: "#5DADE2"
+permalink: /docs/API/tensor_classes.html
+sidebar:
+  nav : docs
+---
+
+The Tensor data structures are built up from fundamental
+scalar matrix and vector classes:
+
+```c++
+    template<class vobj      > class iScalar { private: vobj _internal ; }
+    template<class vobj,int N> class iVector { private: vobj _internal[N] ; }
+    template<class vobj,int N> class iMatrix { private: vobj _internal[N] ; }
+```
+
+These are template classes and can be passed a fundamental scalar or vector type, or
+nested to form arbitrarily complicated tensor products of indices. All mathematical expressions
+are defined to operate recursively, index by index.
+
+Presently the constants
+
+* `Nc`
+* `Nd`
+
+are globally predefined. However, this is planned for changed in future and policy classes
+for different theories (e.g. QCD, QED, SU2 etc...) will contain these constants and enable multiple
+theories to coexist more naturally.
+
+Arbitrary tensor products of fundamental scalar, vector
+and matrix objects may be formed in principle by the basic Grid code.
+
+For Lattice field theory, we define types according to the following tensor
+product structure ordering. The suffix "D" indicates either double types, and
+replacing with "F" gives the corresponding single precision type.
+
+|Lattice  | Lorentz  |  Spin     |   Colour | scalar_type |  Field                   |
+|---------|----------|-----------|----------|-------------|--------------------------|
+|Scalar   | Scalar   |  Scalar   | Scalar   |  RealD      |   RealD                  |
+|Scalar   | Scalar   |  Scalar   | Scalar   |  ComplexD   |   ComplexD               |
+|Scalar   | Scalar   |  Scalar   | Matrix   |  ComplexD   |   ColourMatrixD          |
+|Scalar   | Vector   |  Scalar   | Matrix   |  ComplexD   |   LorentzColourMatrixD   |
+|Scalar   | Scalar   |  Vector   | Vector   |  ComplexD   |   SpinColourVectorD      |
+|Scalar   | Scalar   |  Vector   | Vector   |  ComplexD   |   HalfSpinColourVectorD  |
+|Scalar   | Scalar   |  Matrix   | Matrix   |  ComplexD   |   SpinColourMatrixD      |
+|---------|----------|-----------|----------|-------------|--------------------------|
+
+The types are implemented via a recursive tensor nesting system.
+
+**Example** we declare:
+
+```c++
+  template<typename vtype>
+  using iLorentzColourMatrix = iVector<iScalar<iMatrix<vtype, Nc> >, Nd > ;
+
+  typedef iLorentzColourMatrix<ComplexD > LorentzColourMatrixD;
+```
+
+**Example** we declare:
+
+```c++
+  template<typename vtype>
+  using iLorentzColourMatrix = iVector<iScalar<iMatrix<vtype, Nc> >, Nd > ;
+
+  typedef iLorentzColourMatrix<ComplexD > LorentzColourMatrixD;
+```
+
+Arbitrarily deep tensor nests may be formed. Grid uses a positional and numerical rule to associate indices for contraction
+in the Einstein summation sense.
+
+|----------------|---------|------------|
+| Symbolic name  | Number  | Position   |
+|----------------|---------|------------|
+|`LorentzIndex`  | 0       | left       |
+|`SpinIndex`     | 1       | middle     |
+|`ColourIndex`   | 2       | right      |
+|----------------|---------|------------|
+
+The conventions are that the index ordering left to right are: Lorentz, Spin, Colour. A scalar type (either real
+or complex, single or double precision) is be provided to the innermost structure.
+
+
+### Tensor arithmetic rules (`lib/tensors/Tensor_arith.h`)
+
+Arithmetic rules are defined on these types
+
+The multiplication operator follows the natural multiplication
+table for each index, index level by index level.
+
+`Operator *`
+
+|----|----|----|----|
+| x  |  S |  V |  M |
+|----|----|----|----|
+| S  | S  | V  | M  |
+| V  | S  | S  | V  |
+| M  | M  | V  | M  |
+|----|----|----|----|
+
+The addition and subtraction rules disallow a scalar to be added to a vector,
+and vector to be added to matrix. A scalar adds to a matrix on the diagonal.
+
+`Operator +` and `Operator -`
+
+|----|----|----|----|
+| +/-|  S |  V |  M |
+|----|----|----|----|
+| S  | S  | -  | M  |
+| V  | -  | V  | -  |
+| M  | M  | -  | M  |
+|----|----|----|----|
+
+
+The rules for a nested objects are recursively inferred level by level from basic rules of multiplication
+addition and subtraction for scalar/vector/matrix. Legal expressions can only be formed between objects
+with the same number of nested internal indices. All the Grid QCD datatypes have precisely three internal
+indices, some of which may be trivial scalar to enable expressions to be formed.
+
+Arithmetic operations are possible where the left or right operand is a scalar type.
+
+**Example**:
+
+```c++
+    LatticeColourMatrixD U(grid);
+    LatticeColourMatrixD Udag(grid);
+
+    Udag = adj(U);
+
+    RealD unitary_err = norm2(U*adj(U) - 1.0);
+```
+
+Will provide a measure of how discrepant from unitarity the matrix U is.
+
+### Internal index manipulation (`lib/tensors/Tensor_index.h`)
+
+General code can access any specific index by number with a peek/poke semantic:
+
+```c++
+   // peek index number "Level" of a vector index
+   template<int Level,class vtype>  auto peekIndex (const vtype &arg,int i);
+
+   // peek index number "Level" of a vector index
+   template<int Level,class vtype>  auto peekIndex (const vtype &arg,int i,int j);
+
+   // poke index number "Level" of a vector index
+   template<int Level,class vtype>
+   void pokeIndex (vtype &pokeme,arg,int i)
+
+   // poke index number "Level" of a matrix index
+   template<int Level,class vtype>
+   void pokeIndex (vtype &pokeme,arg,int i,int j)
+```
+
+**Example**:
+
+```c++
+    for (int mu = 0; mu < Nd; mu++) {
+      U[mu] = PeekIndex<LorentzIndex>(Umu, mu);
+    }
+```
+
+Similar to the QDP++ package convenience routines are provided to access specific elements of
+vector and matrix internal index types by physics name or meaning aliases for the above routines
+with the appropriate index constant.
+
+* `peekColour`
+* `peekSpin`
+* `peekLorentz`
+
+and
+
+* `pokeColour`
+* `pokeSpin`
+* `pokeLorentz`
+
+For example, we often store Gauge Fields with a Lorentz index, but can split them into
+polarisations in relevant pieces of code.
+
+**Example**:
+
+```c++
+    for (int mu = 0; mu < Nd; mu++) {
+      U[mu] = peekLorentz(Umu, mu);
+    }
+```
+
+For convenience, direct access as both an l-value and an r-value is provided by the parenthesis operator () on each of the Scalar, Vector and Matrix classes.
+For example one may write
+
+**Example**:
+
+```c++
+  ColourMatrix A, B;
+  A()()(i,j) = B()()(j,i);
+```
+
+bearing in mind that empty parentheses are need to address a scalar entry in the tensor index nest.
+
+The first (left) empty parentheses move past the (scalar) Lorentz level in the tensor nest, and the second
+(middle) empty parantheses move past the (scalar) spin level. The (i,j) index the colour matrix.
+
+Other examples are easy to form for the many cases, and should be obvious to the reader.
+This form of addressing is convenient and saves peek, modifying, poke
+multiple temporary objects when both spin and colour indices are being accessed.
+There are many cases where multiple lines of code are required with a peek/poke semantic which are
+easier with direct l-value and r-value addressing.
+
+
+### Matrix operations
+
+Transposition and tracing specific internal indices are possible using:
+
+```c++
+  template<int Level,class vtype>  
+  auto traceIndex (const vtype &arg)
+
+  template<int Level,class vtype>  
+  auto transposeIndex (const vtype &arg)
+```
+
+These may be used as
+
+**Example**:
+
+```c++
+  LatticeColourMatrixD Link(grid);
+  ComplexD link_trace = traceIndex<ColourIndex> (Link);
+```
+
+Again, convenience aliases for QCD naming schemes are provided via       
+
+* `traceColour`
+* `traceSpin`
+
+* `transposeColour`
+* `transposeSpin`
+
+**Example**:
+
+```c++
+  ComplexD link_trace = traceColour (Link);
+```
+
+The operations only makes sense for matrix and scalar internal indices.
+
+The trace and transpose over all indices is also defined for matrix and scalar types:
+
+```c++
+   template<class vtype,int N> 
+   auto trace(const iMatrix<vtype,N> &arg) -> iScalar
+
+   template<class vtype,int N> 
+   auto transpose(const iMatrix<vtype,N> &arg  ) -> iMatrix
+```
+
+Similar functions are:
+
+* `conjugate`
+* `adjoint`
+
+The traceless anti-Hermitian part is taken with:
+
+```c++
+    template<class vtype,int N> iMatrix<vtype,N> 
+    Ta(const iMatrix<vtype,N> &arg)
+```
+
+Reunitarisation (or reorthogonalisation) is enabled by:
+
+```c++
+    template<class vtype,int N> iMatrix<vtype,N> 
+    ProjectOnGroup(const iMatrix<vtype,N> &arg)
+```
+
+**Example**:
+
+```c++
+  LatticeColourMatrixD Mom(grid);
+  LatticeColourMatrixD TaMom(grid);
+  TaMom  = Ta(Mom);
+```
+
+### Querying internal index structure
+
+Templated code may find it useful to use query functions on the Grid datatypes they are provided.
+For example general Serialisation and I/O code can inspect the nature of a type a routine has been
+asked to read from disk, or even generate descriptive type strings:
+
+```c++
+      ////////////////////////////////////////////////////
+      // Support type queries on template params:
+      ////////////////////////////////////////////////////
+      // int _ColourScalar  =  isScalar<ColourIndex,vobj>();
+      // int _ColourVector  =  isVector<ColourIndex,vobj>();
+      // int _ColourMatrix  =  isMatrix<ColourIndex,vobj>();
+      template<int Level,class vtype>  int isScalar(void)
+      template<int Level,class vtype>  int isVector(void)
+      template<int Level,class vtype>  int isMatrix(void)
+```
+
+**Example** (`lib/parallelIO/IldgIO.h`):
+
+```c++
+  template<class vobj> std::string ScidacRecordTypeString(int &colors, int &spins, int & typesize,int &datacount) { 
+
+  /////////////////////////////////////////
+  // Encode a generic tensor as a string
+  /////////////////////////////////////////
+
+  typedef typename getPrecision<vobj>::real_scalar_type stype;
+
+  int _ColourN       = indexRank<ColourIndex,vobj>();
+  int _ColourScalar  =  isScalar<ColourIndex,vobj>();
+  int _ColourVector  =  isVector<ColourIndex,vobj>();
+  int _ColourMatrix  =  isMatrix<ColourIndex,vobj>();
+
+  int _SpinN       = indexRank<SpinIndex,vobj>();
+  int _SpinScalar  =  isScalar<SpinIndex,vobj>();
+  int _SpinVector  =  isVector<SpinIndex,vobj>();
+  int _SpinMatrix  =  isMatrix<SpinIndex,vobj>();
+
+  int _LorentzN       = indexRank<LorentzIndex,vobj>();
+  int _LorentzScalar  =  isScalar<LorentzIndex,vobj>();
+  int _LorentzVector  =  isVector<LorentzIndex,vobj>();
+  int _LorentzMatrix  =  isMatrix<LorentzIndex,vobj>();
+
+  std::stringstream stream;
+
+  stream << "GRID_";
+  stream << ScidacWordMnemonic<stype>();
+
+  if ( _LorentzVector )   stream << "_LorentzVector"<<_LorentzN;
+  if ( _LorentzMatrix )   stream << "_LorentzMatrix"<<_LorentzN;
+
+  if ( _SpinVector )   stream << "_SpinVector"<<_SpinN;
+  if ( _SpinMatrix )   stream << "_SpinMatrix"<<_SpinN;
+
+  if ( _ColourVector )   stream << "_ColourVector"<<_ColourN;
+  if ( _ColourMatrix )   stream << "_ColourMatrix"<<_ColourN;
+
+  if ( _ColourScalar && _LorentzScalar && _SpinScalar )   stream << "_Complex";
+
+  typesize = sizeof(typename vobj::scalar_type);
+
+  if ( _ColourMatrix ) typesize*= _ColourN*_ColourN;
+  else                 typesize*= _ColourN;
+
+  if ( _SpinMatrix )   typesize*= _SpinN*_SpinN;
+  else                 typesize*= _SpinN;
+
+  };
+```
+
+### Inner and outer products
+
+We recursively define (`tensors/Tensor_inner.h`), ultimately returning scalar in all indices:
+
+```c++
+  /////////////////////////////////////////////////////////////////////////
+  // innerProduct Scalar x Scalar -> Scalar
+  // innerProduct Vector x Vector -> Scalar
+  // innerProduct Matrix x Matrix -> Scalar
+  /////////////////////////////////////////////////////////////////////////
+  template<class l,class r>       
+  auto innerProductD (const iScalar<l>& lhs,const iScalar<r>& rhs)
+
+  template<class l,class r,int N> 
+  auto innerProductD (const iVector<l,N>& lhs,const iVector<r,N>& rhs)
+
+  template<class l,class r,int N> 
+  auto innerProductD (const iMatrix<l,N>& lhs,const iMatrix<r,N>& rhs)
+
+  template<class l,class r>       
+  auto innerProduct (const iScalar<l>& lhs,const iScalar<r>& rhs)
+
+  template<class l,class r,int N> 
+  auto innerProduct (const iVector<l,N>& lhs,const iVector<r,N>& rhs)
+
+  template<class l,class r,int N> 
+  auto innerProduct (const iMatrix<l,N>& lhs,const iMatrix<r,N>& rhs)
+```
+
+The sum is always performed in double precision for the `innerProductD` variant.
+
+We recursively define (`tensors/Tensor_outer.h`):
+
+```c++
+  /////////////////////////////////////////////////////////////////////////
+  // outerProduct Scalar x Scalar -> Scalar
+  //              Vector x Vector -> Matrix
+  /////////////////////////////////////////////////////////////////////////
+  template<class l,class r> 
+  auto outerProduct (const iScalar<l>& lhs,const iScalar<r>& rhs)
+
+  template<class l,class r,int N> 
+  auto outerProduct (const iVector<l,N>& lhs,const iVector<r,N>& rhs)
+```
+
+### Functions of Tensor
+
+The following unary functions are defined, which operate element by element on a tensor 
+data structure:
+
+```c++
+  sqrt();
+  rsqrt();
+  sin();
+  cos();
+  asin();
+  acos();
+  log();
+  exp();
+  abs();
+  Not();
+  toReal();
+  toComplex();
+```
+
+Element wise functions are defined for::
+
+```c++
+  div(tensor,Integer);
+  mod(tensor,Integer);
+  pow(tensor,RealD);
+```
+
+Matrix exponentiation (as opposed to element wise exponentiation is implemented via power series in::
+
+```c++
+    Exponentiate(const Tensor &r  ,RealD alpha, Integer Nexp = DEFAULT_MAT_EXP)
+```
+
+the exponentiation is distributive across vector indices (i.e. proceeds component by component for a `LorentzColourMatrix`).
+
+Determinant is similar::
+
+```c++
+    iScalar Determinant(const Tensor &r )
+```
--- a/_pages/docs/API/vectorisation.md
+++ b/_pages/docs/API/vectorisation.md
@ -0,0 +1,25 @@
+---
+title : "API Documentation"
+author_profile: false
+excerpt: "Vectorisation"
+header:
+  overlay_color: "#5DADE2"
+permalink: /docs/API/vectorisation.html
+sidebar:
+  nav : docs
+---
+
+
+Internally, Grid defines a portable abstraction SIMD vectorisation, via the following types:
+
+* `vRealF`
+
+* `vRealD`
+
+* `vComplexF`
+
+* `vComplexD`
+
+These have the usual range of arithmetic operators and functions acting upon them. They do not form
+part of the API, but are mentioned to (partially) explain the need for controlling the
+layout transformation in lattice objects.