gives 8x (!) speed up on Haswell laptop vs. standard CG for 8 RHS solves.
166 iterations vs. 537 iterations so algorithmic gain + 2x in flop rate gain.
Better than a slap in the face with a wet kipper.
Gives O(32 x 8 x 18*8*8) chunk size on configuration I/O.
At 150KB should be getting close to packet sizes and 4MB filesystem
block sizes that are reasonably (!?) performant. We shall see once I move
this off my laptop and over to BNL and time it.
This will also become useful on BG/Q, so will move out from SSE4 into a general area.
Lifted the Eigen half precision from web. Looks sensible, but not extensively regressed
against the intrinsics implementation yet.
a HUGE amount of difficult to resolve and understand conflicts .
Wholesale formatting, reordering functions etc... in a central file like Tensor_class
or Grid_vector_types while others are also editing without making substantial functionality
changes creates pain.
4d linalg impls of the 5d hopping terms (and inverse)
Cache friendly loop orderings of the above
Dense matrix stored and apply to the above
-- Switch to Ls vectorised, and use dense matrix approach for the MooeeInv
and rotate/shift of the Mooee M5D routines.
of the 5D cayley form chiral fermions for the 5d matrix. With Ls entirely
in the vector direction, s-hopping terms involve rotations.
The serial dependence of the LDU inversion for Mobius and 4d even odd
checkerboarding is removed by simply applying Ls^2 operations (vectorised
many ways) as a dense matrix operation.
This should give similar throughput but high flops (non-compulsory flops)
but enable use of the KNL cache friendly kernels throughout the code.
Ls is still constrained to be a multiple of Nsimd, which is as much as 8 for AVX512
with single precision.
site and s loop into the kernels. This will save on function call overhead and
guarantee L2 prefetching strategy is right since OMP can't distribute the
sub-chunks of work.
between std::real and Grid::real shouldn't have been necessary and I don't
know why only the icpc v16.0 on babbage hits it.
May need a longer term rename of Grid::real or some careful EnableIf work.
permutes as rotates of length 2, and make any rotate active
over any subset of lane bits. This is hard, and requires general
permute; current intrinsics mean this is only really possible for specific
case by case encodings as presently performed. Intel could produce a general
permute.. would help. IBM did it in VMX.
turned up problems on the BlueWaters Cray.
Gets 75MB/s from home filesystem on parallel configuration read. Need to make the RNG IO parallel,
and also to look at aggregating bigger writes for the parallel write.
Not sure what the home filesystem is.
Not really shaken out to my satisfaction though as I want more tests done, so don't declare as working.
But committing my current while I try a few experimentals.
class, changing the nature of covariant Cshifts used in
plaquettes, rectangles and staples.
As a result same code is used for the plaq and rect action independent of the BC type.
Should probably isolate the BC in a separate class that Gimpl takes as a template param.
Do the same with smearing policies.
This would then allow composition of BC with smearing etc....
Conclusion: c++11 distributions not thread safe and must us distinct dist as well as distinct engine
per site. Makes sense when you think of box muller. Also added a reset of dist on fill to ensure
repro across checkpoints.
Can save and restore RNG state via new (serial) I/O routines in a NERSC header style file.
Store a Parallel (one per site) and a single serial RNG file.
Thought I had already committed these.
Believe I have got the Gparity fermion force working.
* tests/Test_gpdwf_force.cc -- correctly predicts dS for two flavour pseudofermion
based on a small dt update of U field.
* tests/Test_hmc_EODWFRatio_Gparity.cc -- ran 1 trajectory on 8^4 with dH=0.21.
Need to accumulate a full plaquette log to believe fully which will take some hours of run time.
loops are non threadable. Will need to write a kernel for these instead and drive them with a lookup table
to make a look sufficiently simple to thread.
This is really annoying -- it is very hard to thread the loops with the index
recursion on buffer offset in the red-black case. Must think of a good threading
solution here.
preparing to pass integrator, smearing, bc's as policy classes to hmc.
Propose to unify "integrator" and integrator algorithm in a base/derived
way to override step. Want to read through ForceGradient to ensure
that abstraction covers the force gradient case.
to a little figure out what Guido had done & why -- but there is a neat saving of force
evaluations across the nesting time boundary making use of linearity of the leapP in dt.
I cleaned up the printing, reduced the volume of code, in the process sharing printing
between all integrators. Placed an assert that the total integration time for all integrators
must match at end of trajectory.
Have now verified e-dH = 1 for nested integrators in Wilson/Wilson runs with both
Omelyan and with Leapfrog so substantial confidence gained.
Number of IO MPI tasks can be varied by selecting which
dimensions use parallel IO and which dimensions use Serial send to boss
I/O.
Thus can neck down from, say 1024 nodes = 4x4x8x8 to {1,8,32,64,128,256,1024} nodes
doing the I/O.
Interpolates nicely between ALL nodes write their data, a single boss per time-plane
in processor space [old UKQCD fortran code did this], and a single node doing all I/O.
Not sure I have the transfer sizes big enough and am not overly convinced fstream
is guaranteed to not give buffer inconsistencies unless I set streambuf size to zero.
Practically it has worked on 8 tasks, 2x1x2x2 writing /cloning NERSC configurations
on my MacOS + OpenMPI and Clang environment.
It is VERY easy to switch to pwrite at a later date, and also easy to send x-strips around from
each node in order to gather bigger chunks at the syscall level.
That would push us up to the circa 8x 18*4*8 == 4KB size write chunk, and by taking, say, x/y non
parallel we get to 16MB contiguous chunks written in multi 4KB transactions
per IOnode in 64^3 lattices for configuration I/O.
I suspect this is fine for system performance.
derived relationship. Have Text/Binary/Xml versions of
Reader & Writer.
Any new Reader/Writer class inheriting the interface can give object serialisation
to any desired format now.
new file: lib/serialisation/BaseIO.h
modified: lib/serialisation/BinaryIO.h
modified: lib/serialisation/Serialisation.h
modified: lib/serialisation/TextIO.h
modified: lib/serialisation/XmlIO.h
The test uses the Xml, Binary and Text formats as well as cout << Object.
Test_serialisation has an example of *code* *free* object serialisation
to both ostream and to XML using macro magic.
Implementing TextReader/TextWriter, YAML, JSON etc.. should be trivial
and we can use configure time options to select the default "Reader" typedef.
Present done with
"using XMLPolicy::Reader"
to pick up the default serialisation strategy.
Azusa is working hard on the rectangle term and we'll hopefully start reproducing plaquettes
from RBC-UKQCD parameters soon !
My new laptop is pretty warm and is starting to groan ;)
So valence sector looks ok.
FermionOperatorImpl.h provides the policy classes.
Expect HMC will introduce a smearing policy and a fermion representation change policy template
param. Will also probably need multi-precision work.
* HMC is running even-odd and non-checkerboarded (checked 4^4 wilson fermion/wilson gauge).
There appears to be a bug in the multi-level integrator -- <e-dH> passes with single level but
not with multi-level.
In any case there looks to be quite a bit to clean up.
This is the "const det" style implementation that is not appropriate yet for clover since
it assumes that Mee is indept of the gauge fields. Easily fixed in future.
So valence sector looks ok.
FermionOperatorImpl.h provides the policy classes.
Expect HMC will introduce a smearing policy and a fermion representation change policy template
param. Will also probably need multi-precision work.
* HMC is running even-odd and non-checkerboarded (checked 4^4 wilson fermion/wilson gauge).
There appears to be a bug in the multi-level integrator -- <e-dH> passes with single level but
not with multi-level.
In any case there looks to be quite a bit to clean up.
This is the "const det" style implementation that is not appropriate yet for clover since
it assumes that Mee is indept of the gauge fields. Easily fixed in future.
Allows multi-precision work and paves the way for alternate BC's and such like
allowing for example G-parity which is important for K pipi programme.
In particular, can drive an extra flavour index into the fermion fields
using template types.
Allows multi-precision work and paves the way for alternate BC's and such like
allowing for example G-parity which is important for K pipi programme.
In particular, can drive an extra flavour index into the fermion fields
using template types.
Allows multi-precision work and paves the way for alternate BC's and such like
allowing for example G-parity which is important for K pipi programme.
In particular, can drive an extra flavour index into the fermion fields
using template types.
Example of multiple levels in the WilsonFermion hmc test.
Merge remote-tracking branch 'upstream/master'
Conflicts:
lib/qcd/hmc/HMC.h
lib/qcd/hmc/integrators/Integrator.h
lib/qcd/hmc/integrators/Integrator_algorithm.h
tests/Test_simd.cc
Example of multiple levels in the WilsonFermion hmc test.
Merge remote-tracking branch 'upstream/master'
Conflicts:
lib/qcd/hmc/HMC.h
lib/qcd/hmc/integrators/Integrator.h
lib/qcd/hmc/integrators/Integrator_algorithm.h
tests/Test_simd.cc
Example of multiple levels in the WilsonFermion hmc test.
Merge remote-tracking branch 'upstream/master'
Conflicts:
lib/qcd/hmc/HMC.h
lib/qcd/hmc/integrators/Integrator.h
lib/qcd/hmc/integrators/Integrator_algorithm.h
tests/Test_simd.cc
exp(-DeltaH) = 1 now, and plaquette is sensible. Will reproduce an old Wilson Gauge
Wilson Fermion SCRI plaquette with precision in mass matching shortly.
exp(-DeltaH) = 1 now, and plaquette is sensible. Will reproduce an old Wilson Gauge
Wilson Fermion SCRI plaquette with precision in mass matching shortly.
exp(-DeltaH) = 1 now, and plaquette is sensible. Will reproduce an old Wilson Gauge
Wilson Fermion SCRI plaquette with precision in mass matching shortly.
6000 matmuls CG unprec
2000 matmuls CG prec (4000 eo muls)
1050 matmuls PGCR on 16^3 x 32 x 8 m=.01
Substantial effort on timing and logging infrastructure
6000 matmuls CG unprec
2000 matmuls CG prec (4000 eo muls)
1050 matmuls PGCR on 16^3 x 32 x 8 m=.01
Substantial effort on timing and logging infrastructure
6000 matmuls CG unprec
2000 matmuls CG prec (4000 eo muls)
1050 matmuls PGCR on 16^3 x 32 x 8 m=.01
Substantial effort on timing and logging infrastructure
connect the "Real" default precisoin to a configure flag.
Have RealF, RealD and Real types, where Real is compile target dependent single/double,
RealF is single and RealD is double etc..
connect the "Real" default precisoin to a configure flag.
Have RealF, RealD and Real types, where Real is compile target dependent single/double,
RealF is single and RealD is double etc..
connect the "Real" default precisoin to a configure flag.
Have RealF, RealD and Real types, where Real is compile target dependent single/double,
RealF is single and RealD is double etc..
but non-identity matrix
l1 0 0 0 ....
0 l2 0 0 ....
0 0 l3 0 ...
. . .
. . .
. . .
And apply the multishift CG to it. Sum the poles and residues.
Insist that this be the same as the exactly taken square root
where l1,l2,l3 >= 0.
but non-identity matrix
l1 0 0 0 ....
0 l2 0 0 ....
0 0 l3 0 ...
. . .
. . .
. . .
And apply the multishift CG to it. Sum the poles and residues.
Insist that this be the same as the exactly taken square root
where l1,l2,l3 >= 0.
but non-identity matrix
l1 0 0 0 ....
0 l2 0 0 ....
0 0 l3 0 ...
. . .
. . .
. . .
And apply the multishift CG to it. Sum the poles and residues.
Insist that this be the same as the exactly taken square root
where l1,l2,l3 >= 0.
in RedBlack MpcDagMpc, Unprec MdagM and Schur red black solver for
each of.
DomainWallFermion
MobiusFermion
MobiusZolotarevFermion
ScaledShamirFermion
ScaledShamirZolotarevFermion
in RedBlack MpcDagMpc, Unprec MdagM and Schur red black solver for
each of.
DomainWallFermion
MobiusFermion
MobiusZolotarevFermion
ScaledShamirFermion
ScaledShamirZolotarevFermion
in RedBlack MpcDagMpc, Unprec MdagM and Schur red black solver for
each of.
DomainWallFermion
MobiusFermion
MobiusZolotarevFermion
ScaledShamirFermion
ScaledShamirZolotarevFermion
Tanh/Zolo * (Cayley/PartFrac/ContFrac) * (Mobius/Shamir/Wilson)
Approx Representation Kernel.
All are done with space-time taking part in checkerboarding, Ls uncheckerboarded
Have only so far tested the Domain Wall limit of mobius, and at that only checked
that it
i) Inverts
ii) 5dim DW == Ls copies of 4dim D2
iii) MeeInv Mee == 1
iv) Meo+Mee+Moe+Moo == M unprec.
v) MpcDagMpc is hermitan
vi) Mdag is the adjoint of M between stochastic vectors.
That said, the RB schur solve, RB MpcDagMpc solve, Unprec solve
all converge and the true residual becomes small; so pretty good tests.
Tanh/Zolo * (Cayley/PartFrac/ContFrac) * (Mobius/Shamir/Wilson)
Approx Representation Kernel.
All are done with space-time taking part in checkerboarding, Ls uncheckerboarded
Have only so far tested the Domain Wall limit of mobius, and at that only checked
that it
i) Inverts
ii) 5dim DW == Ls copies of 4dim D2
iii) MeeInv Mee == 1
iv) Meo+Mee+Moe+Moo == M unprec.
v) MpcDagMpc is hermitan
vi) Mdag is the adjoint of M between stochastic vectors.
That said, the RB schur solve, RB MpcDagMpc solve, Unprec solve
all converge and the true residual becomes small; so pretty good tests.
Tanh/Zolo * (Cayley/PartFrac/ContFrac) * (Mobius/Shamir/Wilson)
Approx Representation Kernel.
All are done with space-time taking part in checkerboarding, Ls uncheckerboarded
Have only so far tested the Domain Wall limit of mobius, and at that only checked
that it
i) Inverts
ii) 5dim DW == Ls copies of 4dim D2
iii) MeeInv Mee == 1
iv) Meo+Mee+Moe+Moo == M unprec.
v) MpcDagMpc is hermitan
vi) Mdag is the adjoint of M between stochastic vectors.
That said, the RB schur solve, RB MpcDagMpc solve, Unprec solve
all converge and the true residual becomes small; so pretty good tests.
not even SU(3) for now) gauge field. Convergence history is correctly indepdendent of decomposition
on 1,2,4,8,16 mpi tasks.
Found a couple of simd bugs which required fixed and enhanced the Grid_simd.cc test suite.
Implemented the Mdag, M, MdagM, Meooe Mooee schur type stuff in the wilson dop.
not even SU(3) for now) gauge field. Convergence history is correctly indepdendent of decomposition
on 1,2,4,8,16 mpi tasks.
Found a couple of simd bugs which required fixed and enhanced the Grid_simd.cc test suite.
Implemented the Mdag, M, MdagM, Meooe Mooee schur type stuff in the wilson dop.
not even SU(3) for now) gauge field. Convergence history is correctly indepdendent of decomposition
on 1,2,4,8,16 mpi tasks.
Found a couple of simd bugs which required fixed and enhanced the Grid_simd.cc test suite.
Implemented the Mdag, M, MdagM, Meooe Mooee schur type stuff in the wilson dop.
cut at Conjugate gradient. Also copied in Remez, Zolotarev, Chebyshev from
Mike Clark, Tony Kennedy and my BFM package respectively since we know we will
need these. I wanted the structure of
algorithms/approx
algorithms/iterative
etc.. to start taking shape.
cut at Conjugate gradient. Also copied in Remez, Zolotarev, Chebyshev from
Mike Clark, Tony Kennedy and my BFM package respectively since we know we will
need these. I wanted the structure of
algorithms/approx
algorithms/iterative
etc.. to start taking shape.
cut at Conjugate gradient. Also copied in Remez, Zolotarev, Chebyshev from
Mike Clark, Tony Kennedy and my BFM package respectively since we know we will
need these. I wanted the structure of
algorithms/approx
algorithms/iterative
etc.. to start taking shape.
icpc compiles happy on MacOSX both with -xCOMMON-AV512 and native AVX
gcc-5 does not compile happy; can work around by renaming lattice peek/poke/transpose/trace templates
relative to tensor ones, but gcc goes into a recursive template instantiation due to
matching error. I think this is a gcc bug and have filed a report https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66153
icpc compiles happy on MacOSX both with -xCOMMON-AV512 and native AVX
gcc-5 does not compile happy; can work around by renaming lattice peek/poke/transpose/trace templates
relative to tensor ones, but gcc goes into a recursive template instantiation due to
matching error. I think this is a gcc bug and have filed a report https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66153
icpc compiles happy on MacOSX both with -xCOMMON-AV512 and native AVX
gcc-5 does not compile happy; can work around by renaming lattice peek/poke/transpose/trace templates
relative to tensor ones, but gcc goes into a recursive template instantiation due to
matching error. I think this is a gcc bug and have filed a report https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66153
Disassembling output shows ugly sequences in the permute sector. Could comparatively benchmark with and without
the if-else structure to see how much I'm losing.
Drops to 9GF as it falls out of cache. Moving to Lebesgue ordering should help there. Substantive progress.
Disassembling output shows ugly sequences in the permute sector. Could comparatively benchmark with and without
the if-else structure to see how much I'm losing.
Drops to 9GF as it falls out of cache. Moving to Lebesgue ordering should help there. Substantive progress.
and the cshift based implementation. Managed to reduce the volume of code in this
sector a little, but consolidation would be good, perhaps taking common
logic out into simple helper functions
and the cshift based implementation. Managed to reduce the volume of code in this
sector a little, but consolidation would be good, perhaps taking common
logic out into simple helper functions
a machine. If a "fixedSeed" is used, randoms should be reproducible across different machine
decomposition since the generators are physically indexed and assigned in lexico ordering.
a machine. If a "fixedSeed" is used, randoms should be reproducible across different machine
decomposition since the generators are physically indexed and assigned in lexico ordering.
assignment. LatticeCoordinate helper to get global (reduced) coordinate.
Some more work of similar type perhaps needed, but the bulk of the required
structure for masked array assignment is now in place.
assignment. LatticeCoordinate helper to get global (reduced) coordinate.
Some more work of similar type perhaps needed, but the bulk of the required
structure for masked array assignment is now in place.
since testing requirements now go beyond what a single Grid_main.cc can do.
Will need a more organised src tree for this and will require substantial reorg of build system.
since testing requirements now go beyond what a single Grid_main.cc can do.
Will need a more organised src tree for this and will require substantial reorg of build system.
- Add `layout` based and user-defined class names to `<body>` element for added CSS hooks. [#526](https://github.com/mmistakes/minimal-mistakes/pull/526)
- Add simplified Chinese localized UI text. [#532](https://github.com/mmistakes/minimal-mistakes/pull/532)
### Bug Fixes
- Remove duplicate include of `base_path` in category-list.html [#522](https://github.com/mmistakes/minimal-mistakes/pull/522)
- Make ["honeypot" `input`](https://github.com/mmistakes/minimal-mistakes/commit/06a8249a69a37dddda7e2a5bfbe32056c1a9a607) in Staticman comment form less obvious to spam bots
- Add padding to `.highlight` code blocks to better [align `overflow` scrollbar](https://github.com/mmistakes/minimal-mistakes/commit/e4abec0a6f7f8cff72505ca0754615df294fd5b3) to the bottom.
- Add additional image options for Twitter card social sharing meta tags. [#466](https://github.com/mmistakes/minimal-mistakes/pull/466)
- Add structured data markup for Staticman comments. [#458](https://github.com/mmistakes/minimal-mistakes/issues/458)
### Bug Fixes
- Format `og:locale` tag with `_` instead of `-`. [#462](https://github.com/mmistakes/minimal-mistakes/issues/462)
### Maintenance
- Add note to docs about using `url: http://localhost:4000` when working locally.
- Support static-based commenting via [Staticman](https://staticman.net/) for sites hosted with GitHub Pages. [#424](https://github.com/mmistakes/minimal-mistakes/issues/424)
- Re-enabled Jekyll plugins in `_config.yml` in case they aren't autoloaded in `Gemfile`. [#417](https://github.com/mmistakes/minimal-mistakes/issues/417)
### Enhancements
- Fallback to `site.github.url` for use in `{{ base_path }}` when `site.url` is `nil`.
- Replace Sass and Autoprefixer `npm` build scripts with [Jekyll's built-in asset support](https://jekyllrb.com/docs/assets/). [#333](https://github.com/mmistakes/minimal-mistakes/issues/333)
### Maintenance
- Document `site.repository` and its role with [`github-metadata`](https://github.com/jekyll/github-metadata) gem.
- Add sample [archive page with content](https://mmistakes.github.io/minimal-mistakes/archive-layout-with-content/) for testing styles on demo site.
- Add support for configurable feed URL to use a service like FeedBurner instead of linking directly to `feed.xml` in `<head>` and the site footer. [#378](https://github.com/mmistakes/minimal-mistakes/issues/378), [#379](https://github.com/mmistakes/minimal-mistakes/pull/379), [#406](https://github.com/mmistakes/minimal-mistakes/pull/406)
- Enable image popup on < 500px wide screens. [#385](https://github.com/mmistakes/minimal-mistakes/issues/385)
- Indicate the relationship between component URLs in a paginated series by applying `rel="prev"` and `rel="next"` to pages that use `site.paginator`. [#253](https://github.com/mmistakes/minimal-mistakes/issues/253)
- Improve link posts in archive listings. [#276](https://github.com/mmistakes/minimal-mistakes/issues/276)
- Remove window width "magic number" from sticky sidebar check in `main.js` for improved flexibility. [#375](https://github.com/mmistakes/minimal-mistakes/pull/375)
### Bug Fixes
- Fix author override conditional where a missing `authors.yml` would show broken sidebar content. Defaults to `site.author`. [#376](https://github.com/mmistakes/minimal-mistakes/pull/376)
- Add support for [header overlay images](https://mmistakes.github.io/minimal-mistakes/docs/layouts/#header-overlay) for Open Graph images. [#358](https://github.com/mmistakes/minimal-mistakes/pull/358)
### Bug Fixes
- Fix `Person` typo Schema.org type [#358](https://github.com/mmistakes/minimal-mistakes/pull/358)
### Maintenance
- Update `github-pages` gem and dependencies.
- Remove `minutes_read` to avoid awkward reading time wording [#356](https://github.com/mmistakes/minimal-mistakes/issues/356)
- Remove `cursor: pointer` that appears on white-space surrounding author side list items and links. [#354](https://github.com/mmistakes/minimal-mistakes/pull/354)
### Maintenance
- Add contributing information to `README.md`. [#357](https://github.com/mmistakes/minimal-mistakes/issues/357)
- Fix error `Liquid Exception: divided by 0 in _includes/archive-single.html, included in _layouts/single.html` caused by null `words_per_minute` in `_config.yml`. [#345](https://github.com/mmistakes/minimal-mistakes/pull/345)
- Improve text alignment of masthead, hero overlay, page footer to be flush left and remove awkward white-space gaps. [#342](https://github.com/mmistakes/minimal-mistakes/issues/342)
- Add support for image captions in Magnific Popup overlays via the [`gallery`](https://mmistakes.github.io/minimal-mistakes/docs/helpers/#gallery) helper. [#334](https://github.com/mmistakes/minimal-mistakes/issues/334)
- Fix missing category/tag links in post footer due to possible conflict with `site.tags` and `site.categories`. [#329](https://github.com/mmistakes/minimal-mistakes/issues/329#issuecomment-222375568)
- Fix `Liquid Exception: undefined method 'gsub' for nil:NilClass in _layouts/single.html` error when `page.title` is null. `<h1>` element is now conditional if `title: ` is not set for a `page` or collection item. [#312](https://github.com/mmistakes/minimal-mistakes/issues/312)
### Maintenance
- Remove duplicate `fa-twitter` and `fa-twitter-square` classes from `_utilities.scss`. [#302](https://github.com/mmistakes/minimal-mistakes/issues/302)
- Document installing additional Jekyll gem dependencies when using `gem "jekyll"` instead of `gem "github-pages"` to avoid any errors on run. [#305](https://github.com/mmistakes/minimal-mistakes/issues/305)
- Add translation key for "Recent Posts" used in home page `index.html`. [#316](https://github.com/mmistakes/minimal-mistakes/pull/316)
### Maintenance
- Small fix to avoid underlying the whitespace between icons and related text when hovering. [#303](https://github.com/mmistakes/minimal-mistakes/pull/303)
- SEO author bug. If `twitter.username` is set and `author.twitter` is `nil` bad things happen. [#289](https://github.com/mmistakes/minimal-mistakes/issues/289)
- Explain how to use `nav_list` helper in [documentation](https://mmistakes.github.io/minimal-mistakes/docs/helpers/#navigation-list).
- Reduce left/right padding on smaller screens to increase width of main content column.
### Bug Fixes
- Fix alignment issues with related posts [#273](https://github.com/mmistakes/minimal-mistakes/issues/273) and "Follow" button in author profile [#274](https://github.com/mmistakes/minimal-mistakes/issues/274).
- Rebuilt the entire theme: layouts, includes, stylesheets, scripts, you name it.
- Refreshed the look and feel while staying true to the original design of the theme (author sidebar/main content).
- Replaced grid system with [Susy](http://susy.oddbird.net/).
- Replaced Grunt tasks with `npm` scripts.
- Removed Google Fonts and replaced with system fonts to improve performance (they can be [added back](https://mmistakes.github.io/minimal-mistakes/docs/stylesheets/) if desired)
- Increased the amount of sample posts, sample pages, and sample collections to throughly test the theme and edge-cases.
- Moved all sample content and assets out of `master` to keep it as clean as possible for forking.
- Added new layouts for `splash` pages, archives for [`jekyll-archives`](https://github.com/jekyll/jekyll-archives) if enabled, and [`compress.html`](https://github.com/penibelst/jekyll-compress-html) to improve performance.
- Added taxonomy links to posts (tags and categories).
- Added optional "reading time" meta data.
- Improved Liquid used for Twitter Cards and Open Graph data in `<head>`.
- Improved `gallery` include helper and added `feature_row` for use with splash page layout.
- Added Keybase.io, author web URI, and Bitbucket optional links to sidebar.
- Add `feed.xml` link to footer.
- Added a [UI text data file](https://mmistakes.github.io/minimal-mistakes/docs/ui-text/) to easily change all text found in the theme.
- Added LinkedIn to optional social share buttons.
- Added Facebook, Google+, and custom commenting options in addition to Disqus.
- Add Soundcloud, YouTube ([#95](https://github.com/mmistakes/minimal-mistakes/pull/95)), Flickr ([#119](https://github.com/mmistakes/minimal-mistakes/pull/119)), and Weibo ([#116](https://github.com/mmistakes/minimal-mistakes/pull/116)) icons for use in author sidebar.
- Fix typos in posts and documentation and remove references to Less
- Include note about Octopress gem being optional
- Post author override support extended to the Atom feed ([#71](https://github.com/mmistakes/minimal-mistakes/pull/71))
- Only include email address in feed if specified in `_config.yml` or author `_data`
- Wrap all page content in `#main` to harmonize article and post index styles ([#86](https://github.com/mmistakes/minimal-mistakes/issues/86))
- Include new sample feature images for posts and pages
- Table of contents improvements: fix collapse toggle, indent nested elements, show on small screens, and create an `_include` for reusing in posts and pages.
- Include note about running Jekyll with `bundle exec` when using Bundler
- Fix home page path in top navigation
- Remove Google Authorship ([#120](https://github.com/mmistakes/minimal-mistakes/issues/120))
- Remove duplicate author content that displayed in `div.article-author-bottom`
- Reworked top navigation to be a better experience on small screens. Nav items now display vertically when the menu button is tapped, revealing links with larger touch targets.
- Fix top navigation bug issue ([#10](https://github.com/mmistakes/minimal-mistakes/issues/10)) for real this time. Remember to clear your floats kids.
- Removed [Typeplate](http://typeplate.com/) styles. Was [causing issues with newer versions of Less](https://github.com/typeplate/typeplate.github.io/issues/108) and is no longer maintained.
### Enhancements
- Added [image attribution](http://mmistakes.github.io/minimal-mistakes/theme-setup/#feature-images) for post and page feature images.
**Data parallel C++ mathematical object library.**
License: GPL v2.
Last update Nov 2016.
_Please do not send pull requests to the `master` branch which is reserved for releases._
### Compilers
Intel ICPC v16.0.3 and later
Clang v3.5 and later (need 3.8 and later for OpenMP)
GCC v4.9.x (recommended)
GCC v6.3 and later
### Important:
Some versions of GCC appear to have a bug under high optimisation (-O2, -O3).
The safety of these compiler versions cannot be guaranteed at this time. Follow Issue 100 for details and updates.
GCC v5.x
GCC v6.1, v6.2
### Bug report
_To help us tracking and solving more efficiently issues with Grid, please report problems using the issue system of GitHub rather than sending emails to Grid developers._
When you file an issue, please go though the following checklist:
1. Check that the code is pointing to the `HEAD` of `develop` or any commit in `master` which is tagged with a version number.
2. Give a description of the target platform (CPU, network, compiler). Please give the full CPU part description, using for example `cat /proc/cpuinfo | grep 'model name' | uniq` (Linux) or `sysctl machdep.cpu.brand_string` (macOS) and the full output the `--version` option of your compiler.
3. Give the exact `configure` command used.
4. Attach `config.log`.
5. Attach `grid.config.summary`.
6. Attach the output of `make V=1`.
7. Describe the issue and any previous attempt to solve it. If relevant, show how to reproduce the issue using a minimal working example.
## License
The MIT License (MIT)
### Description
This library provides data parallel C++ container classes with internal memory layout
that is transformed to map efficiently to SIMD architectures. CSHIFT facilities
are provided, similar to HPF and cmfortran, and user control is given over the mapping of
array indices to both MPI tasks and SIMD processing elements.
Copyright (c) 2016 Michael Rose
* Identically shaped arrays then be processed with perfect data parallelisation.
* Such identically shaped arrays are called conformable arrays.
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The transformation is based on the observation that Cartesian array processing involves
identical processing to be performed on different regions of the Cartesian array.
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
The library will both geometrically decompose into MPI tasks and across SIMD lanes.
Local vector loops are parallelised with OpenMP pragmas.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
Data parallel array operations can then be specified with a SINGLE data parallel paradigm, but
optimally use MPI, OpenMP and SIMD parallelism under the hood. This is a significant simplification
for most programmers.
The layout transformations are parametrised by the SIMD vector length. This adapts according to the architecture.
Presently SSE4 (128 bit) AVX, AVX2, QPX (256 bit), IMCI, and AVX512 (512 bit) targets are supported (ARM NEON on the way).
These are presented as `vRealF`, `vRealD`, `vComplexF`, and `vComplexD` internal vector data types. These may be useful in themselves for other programmers.
The corresponding scalar types are named `RealF`, `RealD`, `ComplexF` and `ComplexD`.
MPI, OpenMP, and SIMD parallelism are present in the library.
Please see https://arxiv.org/abs/1512.03487 for more detail.
### Quick start
First, start by cloning the repository:
``` bash
git clone https://github.com/paboyle/Grid.git
```
Then enter the cloned directory and set up the build system:
``` bash
cd Grid
./bootstrap.sh
```
Now you can execute the `configure` script to generate makefiles (here from a build directory):
where `--enable-precision=` set the default precision,
`--enable-simd=` set the SIMD type, `--enable-
comms=`, and `<path>` should be replaced by the prefix path where you want to
install Grid. Other options are detailed in the next section, you can also use `configure
--help` to display them. Like with any other program using GNU autotool, the
`CXX`, `CXXFLAGS`, `LDFLAGS`, ... environment variables can be modified to
customise the build.
Finally, you can build, check, and install Grid:
``` bash
make; make check; make install
```
To minimise the build time, only the tests at the root of the `tests` directory are built by default. If you want to build tests in the sub-directory `<subdir>` you can execute:
``` bash
make -C tests/<subdir> tests
```
If you want to build all the tests at once just use `make tests`.
### Build configuration options
- `--prefix=<path>`: installation prefix for Grid.
- `--with-gmp=<path>`: look for GMP in the UNIX prefix `<path>`
- `--with-mpfr=<path>`: look for MPFR in the UNIX prefix `<path>`
- `--with-fftw=<path>`: look for FFTW in the UNIX prefix `<path>`
- `--enable-lapack[=<path>]`: enable LAPACK support in Lanczos eigensolver. A UNIX prefix containing the library can be specified (optional).
- `--enable-mkl[=<path>]`: use Intel MKL for FFT (and LAPACK if enabled) routines. A UNIX prefix containing the library can be specified (optional).
- `--enable-numa`: enable NUMA first touch optimisation
- `--enable-simd=<code>`: setup Grid for the SIMD target `<code>` (default: `GEN`). A list of possible SIMD targets is detailed in a section below.
- `--enable-gen-simd-width=<size>`: select the size (in bytes) of the generic SIMD vector type (default: 32 bytes).
- `--enable-precision={single|double}`: set the default precision (default: `double`).
- `--enable-precision=<comm>`: Use `<comm>` for message passing (default: `none`). A list of possible SIMD targets is detailed in a section below.
- `--enable-rng={sitmo|ranlux48|mt19937}`: choose the RNG (default: `sitmo `).
- `--disable-timers`: disable system dependent high-resolution timers.
| `mpi3l[-auto]` | MPI communications using MPI 3 shared memory and leader model |
| `shmem ` | Cray SHMEM communications |
For the MPI interfaces the optional `-auto` suffix instructs the `configure` scripts to determine all the necessary compilation and linking flags. This is done by extracting the informations from the MPI wrapper specified in the environment variable `MPICXX` (if not specified `configure` will scan though a list of default names). The `-auto` suffix is not supported by the Cray environment wrapper scripts. Use the standard versions instead.
### Possible SIMD types
The following options can be use with the `--enable-simd=` option to target different SIMD instruction sets:
- We currently support AVX512 only for the Intel compiler. Support for GCC and clang will appear in future versions of Grid when the AVX512 support within GCC and clang will be more advanced.
- For BG/Q only [bgclang](http://trac.alcf.anl.gov/projects/llvm-bgq) is supported. We do not presently plan to support more compilers for this platform.
- BG/Q performances are currently rather poor. This is being investigated for future versions.
- The vector size for the `GEN` target can be specified with the `configure` script option `--enable-gen-simd-width`.
### Build setup for Intel Knights Landing platform
The following configuration is recommended for the Intel Knights Landing platform:
``` bash
../configure --enable-precision=double\
--enable-simd=KNL \
--enable-comms=mpi-auto \
--with-gmp=<path> \
--with-mpfr=<path> \
--enable-mkl \
CXX=icpc MPICXX=mpiicpc
```
where `<path>` is the UNIX prefix where GMP and MPFR are installed. If you are working on a Cray machine that does not use the `mpiicpc` wrapper, please use:
undefined_wpm :"Parâmetro words_per_minute não definido em _config.yml"
comment_form_info :"O seu endereço email não será publicado. Os campos obrigatórios estão assinalados"
comment_form_comment_label :"Comentário"
comment_form_md_info :"Markdown é suportado."
comment_form_name_label :"Nome"
comment_form_email_label :"Endereço Email"
comment_form_website_label :"Site (opcional)"
comment_btn_submit :"Sumbeter Comentário"
comment_btn_submitted :"Submetido"
comment_success_msg :"Obrigado pelo seu comentário! Será visível no site logo que aprovado."
comment_error_msg :"Lamento, ocorreu um erro na sua submissão. Por favor verifique se todos os campos obrigatórios estão corretamente preenchidos e tente novamente."
loading_label :"A carregar..."
# Brazilian Portuguese
pt-BR:
page :"Página"
pagination_previous :"Anterior"
pagination_next :"Próxima"
breadcrumb_home_label :"Home"
breadcrumb_separator :"/"
toc_label :"Nesta página"
ext_link_label :"Link direto"
less_than :"meno que"
minute_read :"minutos de leitura"
share_on_label :"Compartilhe em"
meta_label :
tags_label :"Tags:"
categories_label :"Categorias:"
date_label :"Atualizado em:"
comments_label :"Deixe um comentário"
comments_title :
more_label :"Aprenda Mais"
related_label :"Você Talvez Goste Também"
follow_label :"Acompanhe em"
feed_label :"Feed"
powered_by :"Feito com"
website_label :"Site"
email_label :"Email"
recent_posts :"Postagens recentes"
undefined_wpm :"Parâmetro indefinido em word_per_minute no _config.yml"
comment_form_info :
comment_form_comment_label :
comment_form_md_info :
comment_form_name_label :
comment_form_email_label :
comment_form_website_label :
comment_btn_submit :
comment_btn_submitted :
comment_success_msg :
comment_error_msg :
loading_label :
pt-PT:
<<:*DEFAULT_PT
# Italian
# -----------------
it:&DEFAULT_IT
page :"Pagina"
pagination_previous :"Precedente"
pagination_next :"Prossima"
breadcrumb_home_label :"Home"
breadcrumb_separator :"/"
toc_label :"Indice della pagina"
ext_link_label :"Link"
less_than :"meno di"
minute_read :"minuto/i di lettura"
share_on_label :"Condividi"
meta_label :
tags_label :"Tags:"
categories_label :"Categorie:"
date_label :"Aggiornato:"
comments_label :"Scrivi un commento"
comments_title :
more_label :"Scopri di più"
related_label :"Potrebbe Piacerti Anche"
follow_label :"Segui:"
feed_label :"Feed"
powered_by :"Powered by"
website_label :"Website"
email_label :"Email"
recent_posts :"Articoli Recenti"
undefined_wpm :"Parametro words_per_minute non definito in _config.yml"
comment_form_info :"Il tuo indirizzo email non sarà pubblicato. Sono segnati i campi obbligatori"
comment_form_comment_label :"Commenta"
comment_form_md_info :"Il linguaggio Markdown è supportato"
comment_form_name_label :"Nome"
comment_form_email_label :"Indirizzo email"
comment_form_website_label :"Sito Web (opzionale)"
comment_btn_submit :"Invia commento"
comment_btn_submitted :"Inviato"
comment_success_msg :"Grazie per il tuo commento! Verrà visualizzato nel sito una volta che sarà approvato."
comment_error_msg :"C'è stato un errore con il tuo invio. Assicurati che tutti i campi richiesti siano stati completati e riprova."
loading_label :"Caricamento..."
it-IT:
<<:*DEFAULT_IT
# Chinese (zh-CN Chinese - China)
# -----------------
zh:&DEFAULT_ZH
page :"页面"
pagination_previous :"向前"
pagination_next :"向后"
breadcrumb_home_label :"首页"
breadcrumb_separator :"/"
toc_label :"在本页上"
ext_link_label :"直接链接"
less_than :"少于"
minute_read :"分钟 阅读"
share_on_label :"分享"
meta_label :
tags_label :"标签:"
categories_label :"分类:"
date_label :"最新的:"
comments_label :"留下评论"
comments_title :"评论"
more_label :"了解更多"
related_label :"猜您还喜欢"
follow_label :"关注:"
feed_label :"Feed"
powered_by :"Powered by"
website_label :"网站"
email_label :"Email"
recent_posts :"最新文章"
undefined_wpm :"Undefined parameter words_per_minute at _config.yml"
comment_form_info :"Your email address will not be published. Required fields are marked"
comment_form_comment_label :"Comment"
comment_form_md_info :"Markdown is supported."
comment_form_name_label :"Name"
comment_form_email_label :"Email address"
comment_form_website_label :"Website (optional)"
comment_btn_submit :"Submit Comment"
comment_btn_submitted :"Submitted"
comment_success_msg :"Thanks for your comment! It will show on the site once it has been approved."
comment_error_msg :"Sorry, there was an error with your submission. Please make sure all required fields have been completed and try again."
<div class="notice--danger align-center" style="margin: 0;">You are using an <strong>outdated</strong> browser. Please <a href="http://browsehappy.com/">upgrade your browser</a> to improve your experience.</div>
showAlert('{{ site.data.ui-text[site.locale].comment_success_msg | default: "Thanks for your comment! It will show on the site once it has been approved." }}');
showAlert('{{ site.data.ui-text[site.locale].comment_error_msg | default: "Sorry, there was an error with your submission. Please make sure all required fields have been completed and try again." }}');
{% include comment.html index=forloop.index email=email name=name url=url date=date message=message %}
{% endfor %}
{% endif %}
</div>
<!-- End static comments -->
<!-- Start new comment form -->
<h4class="page__comments-title">{{ site.data.ui-text[site.locale].comments_label | default: "Leave a Comment" }}</h4>
<pclass="small">{{ site.data.ui-text[site.locale].comment_form_info | default: "Your email address will not be published. Required fields are marked" }} <spanclass="required">*</span></p>
<metaname="twitter:image"content="{% if page.header.image contains "://"%}{{page.header.image}}{%else%}{{page.header.image|prepend:"/images/"|prepend:base_path}}{%endif%}">
{% else %}
<metaname="twitter:card"content="summary">
{% if page.header.teaser %}
<metaname="twitter:image"content="{% if page.header.teaser contains "://"%}{{page.header.teaser}}{%else%}{{page.header.teaser|prepend:"/images/"|prepend:base_path}}{%endif%}">
<metaproperty="og:image"content="{% if page.header.image contains "://"%}{{page.header.image}}{%else%}{{page.header.image|prepend:"/images/"|prepend:base_path}}{%endif%}">
{% elsif page.header.overlay_image %}
<metaproperty="og:image"content="{% if page.header.overlay_image contains "://"%}{{page.header.overlay_image}}{%else%}{{page.header.overlay_image|prepend:"/images/"|prepend:base_path}}{%endif%}">
{% elsif page.header.teaser %}
<metaproperty="og:image"content="{% if page.header.teaser contains "://"%}{{page.header.teaser}}{%else%}{{page.header.teaser|prepend:"/images/"|prepend:base_path}}{%endif%}">
{% elsif site.og_image %}
<metaproperty="og:image"content="{% if site.og_image contains "://"%}{{site.og_image}}{%else%}{{site.og_image|prepend:"/images/"|prepend:base_path}}{%endif%}">
This library provides data parallel C++ container classes with internal memory layout
that is transformed to map efficiently to SIMD architectures. CSHIFT facilities
are provided, similar to HPF and cmfortran, and user control is given over the mapping of
array indices to both MPI tasks and SIMD processing elements.
* Identically shaped arrays then be processed with perfect data parallelisation.
* Such identically shapped arrays are called conformable arrays.
The transformation is based on the observation that Cartesian array processing involves
identical processing to be performed on different regions of the Cartesian array.
The library will both geometrically decompose into MPI tasks and across SIMD lanes.
Local vector loops are parallelised with OpenMP pragmas.
Data parallel array operations can then be specified with a SINGLE data parallel paradigm, but
optimally use MPI, OpenMP and SIMD parallelism under the hood. This is a significant simplification
for most programmers.
The layout transformations are parametrised by the SIMD vector length. This adapts according to the architecture.
Presently SSE4 (128 bit) AVX, AVX2 (256 bit) and IMCI and AVX512 (512 bit) targets are supported (ARM NEON and BG/Q QPX on the way).
These are presented as `vRealF`, `vRealD`, `vComplexF`, and `vComplexD` internal vector data types. These may be useful in themselves for other programmers.
The corresponding scalar types are named `RealF`, `RealD`, `ComplexF` and `ComplexD`.
MPI, OpenMP, and SIMD parallelism are present in the library.
Please see [this paper](https://arxiv.org/abs/1512.03487) for more detail.
The multiplication operator follows the natural multiplication
table for each index, index level by index level.
`Operator *`
|----|----|----|----|
| x | S | V | M |
|----|----|----|----|
| S | S | V | M |
| V | S | S | V |
| M | M | V | M |
|----|----|----|----|
The addition and subtraction rules disallow a scalar to be added to a vector,
and vector to be added to matrix. A scalar adds to a matrix on the diagonal.
`Operator +` and `Operator -`
|----|----|----|----|
| +/-| S | V | M |
|----|----|----|----|
| S | S | - | M |
| V | - | V | - |
| M | M | - | M |
|----|----|----|----|
The rules for a nested objects are recursively inferred level by level from basic rules of multiplication
addition and subtraction for scalar/vector/matrix. Legal expressions can only be formed between objects
with the same number of nested internal indices. All the Grid QCD datatypes have precisely three internal
indices, some of which may be trivial scalar to enable expressions to be formed.
Arithmetic operations are possible where the left or right operand is a scalar type.
**Example**:
```c++
LatticeColourMatrixD U(grid);
LatticeColourMatrixD Udag(grid);
Udag = adj(U);
RealD unitary_err = norm2(U*adj(U) - 1.0);
```
Will provide a measure of how discrepant from unitarity the matrix U is.
### Internal index manipulation (`lib/tensors/Tensor_index.h`)
General code can access any specific index by number with a peek/poke semantic:
```c++
// peek index number "Level" of a vector index
template<int Level,class vtype> auto peekIndex (const vtype &arg,int i);
// peek index number "Level" of a vector index
template<int Level,class vtype> auto peekIndex (const vtype &arg,int i,int j);
// poke index number "Level" of a vector index
template<int Level,class vtype>
void pokeIndex (vtype &pokeme,arg,int i)
// poke index number "Level" of a matrix index
template<int Level,class vtype>
void pokeIndex (vtype &pokeme,arg,int i,int j)
```
**Example**:
```c++
for (int mu = 0; mu < Nd; mu++) {
U[mu] = PeekIndex<LorentzIndex>(Umu, mu);
}
```
Similar to the QDP++ package convenience routines are provided to access specific elements of
vector and matrix internal index types by physics name or meaning aliases for the above routines
with the appropriate index constant.
* `peekColour`
* `peekSpin`
* `peekLorentz`
and
* `pokeColour`
* `pokeSpin`
* `pokeLorentz`
For example, we often store Gauge Fields with a Lorentz index, but can split them into
polarisations in relevant pieces of code.
**Example**:
```c++
for (int mu = 0; mu < Nd; mu++) {
U[mu] = peekLorentz(Umu, mu);
}
```
For convenience, direct access as both an l-value and an r-value is provided by the parenthesis operator () on each of the Scalar, Vector and Matrix classes.
For example one may write
**Example**:
```c++
ColourMatrix A, B;
A()()(i,j) = B()()(j,i);
```
bearing in mind that empty parentheses are need to address a scalar entry in the tensor index nest.
The first (left) empty parentheses move past the (scalar) Lorentz level in the tensor nest, and the second
(middle) empty parantheses move past the (scalar) spin level. The (i,j) index the colour matrix.
Other examples are easy to form for the many cases, and should be obvious to the reader.
This form of addressing is convenient and saves peek, modifying, poke
multiple temporary objects when both spin and colour indices are being accessed.
There are many cases where multiple lines of code are required with a peek/poke semantic which are
easier with direct l-value and r-value addressing.
### Matrix operations
Transposition and tracing specific internal indices are possible using:
__To help us tracking and solving more efficiently issues with Grid, please report problems using the [issue system of GitHub](https://github.com/paboyle/Grid/issues) rather than sending emails to Grid developers.__
We also suggest to have a brief look at the [closed issues pages](https://github.com/paboyle/Grid/issues?q=is%3Aissue+is%3Aclosed) and check whether the problem has been addressed already.
{% capture notice-text %}
1. Check that the code is pointing to the `HEAD` of `develop` or any commit in `master` which is tagged with a version number.
2. Give a description of the target platform (CPU, network, compiler). Please give the full CPU part description, using for example `cat /proc/cpuinfo | grep 'model name' | uniq` (Linux) or `sysctl machdep.cpu.brand_string` (macOS) and the full output the `--version` option of your compiler.
3. Give the exact `configure` command used.
4. Attach `config.log`.
5. Attach `grid.config.summary`.
6. Attach the output of `make V=1`.
7. Describe the issue and any previous attempt to solve it. If relevant, show how to reproduce the issue using a minimal working example.
{% endcapture %}
<div class="notice--warning">
<h4>When you file an issue, please go though the following checklist:</h4>
| `mpi` | MPI communications using [MPI-3 shared memory](https://software.intel.com/sites/default/files/managed/eb/54/An_Introduction_to_MPI-3.pdf) |
| `mpi-auto` | MPI communications with compiler CXX but clone flags from MPICXX |
For the MPI interfaces the optional `-auto` suffix instructs the `configure` scripts to determine all the necessary compilation and linking flags. This is done by extracting the informations from the MPI wrapper specified in the environment variable `MPICXX` (if not specified `configure` will scan though a list of default names). The `-auto` suffix is not supported by the Cray environment wrapper scripts. Use the standard wrappers ( `CXX=CC` ) set up by Cray `PrgEnv` modules instead.
### Shared memory communications support
The following options can be use with the `--enable-shm=` option to target different shared memory behaviours (default `shmopen`):
| `shmnone` | uses anonymous spaces, use only for 1 MPI rank per node |
| `shmopen` | uses `shm_open` to allocate a shared memory space for inter socket communications. Uses a unique file name in `/dev/shm` associated to the user. |
| `hugetlbfs` | optional [libhugetlbfs](https://github.com/libhugetlbfs/libhugetlbfs) support to map the shared memory allocation into huge 2M pages |
### Other flags
`--enable-shmpath=<path>` to select the shared memory map base path for [libhugetlbfs](https://github.com/libhugetlbfs/libhugetlbfs).
Grid is intended to support performance portability across a many of platforms ranging from single processors
to message passing CPU clusters and accelerated computing nodes.
The library provides data parallel C++ container classes with internal memory layout that is transformed to map efficiently to SIMD architectures. CSHIFT facilities are provided, similar to HPF and cmfortran, and user control is given over the mapping of array indices to both MPI tasks and SIMD processing elements.
Identically shaped arrays then be processed with perfect data parallelisation.
Such identically shaped arrays are called conformable arrays.
The transformation is based on the observation that Cartesian array processing involves identical processing to be performed on different regions of the Cartesian array.
The library will both geometrically decompose into MPI tasks and across SIMD lanes. Local vector loops are parallelised with OpenMP pragmas.
Data parallel array operations can then be specified with a SINGLE data parallel paradigm, but optimally use MPI, OpenMP and SIMD parallelism under the hood. This is a significant simplification for most programmers.
The two broad optimisation targets are:
* MPI, OpenMP, and SIMD parallelism
Presently SSE4, ARM NEON (128 bits) AVX, AVX2, QPX (256 bits), and AVX512 (512 bits) targets are supported
with aggressive use of architecture vectorisation intrinsic functions.
* MPI between nodes with and data parallel offload to GPU's.
For the latter generic C++ code is used both on the host and on the GPU, with a common vectorisation
granularity.
### Accelerator memory model
For accelerator targets it is assumed that heap allocations can be shared between the CPU
and the accelerator. This corresponds to lattice fields having their memory allocated with
*cudaMallocManaged* with Nvidia GPU's.
Grid does not assume that stack or data segments share a common address space with an accelerator.
* This constraint presently rules out porting Grid to AMD GPU's which do not support managed memory.
* At some point in the future a cacheing strategy may be implemented to enable running on AMD GPU's
The information included in this page has been updated on *March 2018* and it is valid for the [release version 0.7.0](https://github.com/paboyle/Grid/tree/release/v0.7.0).
{% include toc icon="gears" title="Contents" %}
### Building for the Intel Knights Landing
The following configuration is recommended for the [Intel Knights Landing](http://ark.intel.com/products/codename/48999/Knights-Landing) platform:
``` text
../configure --enable-precision=double\
--enable-simd=KNL \
--enable-comms=mpi-auto \
--with-gmp=<path> \
--with-mpfr=<path> \
--enable-mkl \
CXX=icpc MPICXX=mpiicpc
```
where `<path>` is the UNIX prefix where GMP and MPFR are installed. If you are working on a Cray machine that does not use the `mpiicpc` wrapper, please use:
``` text
../configure --enable-precision=double\
--enable-simd=KNL \
--enable-comms=mpi \
--with-gmp=<path> \
--with-mpfr=<path> \
--enable-mkl \
CXX=CC CC=cc
```
### Building for the Intel Haswell
The following configuration is recommended for the [Intel Haswell platform](https://ark.intel.com/products/codename/42174/Haswell):
```bash
../configure --enable-precision=double\
--enable-simd=AVX2 \
--enable-comms=mpi-auto \
--enable-mkl \
CXX=icpc MPICXX=mpiicpc
```
The MKL flag enables use of BLAS and FFTW from the Intel Math Kernels Library.
If gmp and mpfr are NOT in standard places (`/usr/`) these flags may be needed:
```bash
--with-gmp=<path> \
--with-mpfr=<path>
```
where `<path>` is the UNIX prefix where GMP and MPFR are installed.
If you are working on a Cray machine that does not use the `mpiicpc` wrapper, please use:
```bash
../configure --enable-precision=double\
--enable-simd=AVX2 \
--enable-comms=mpi \
--enable-mkl \
CXX=CC CC=cc
```
If using the Intel MPI library, threads should be pinned to NUMA domains using:
```bash
export I_MPI_PIN=1
```
This is the default.
### Building for the Intel Skylake
The following configuration is recommended for the [Intel Skylake platform](https://ark.intel.com/products/codename/37572/Skylake):
```bash
../configure --enable-precision=double\
--enable-simd=AVX512 \
--enable-comms=mpi-auto \
--enable-mkl \
CXX=mpiicpc
```
The MKL flag enables use of BLAS and FFTW from the Intel Math Kernels Library.
If gmp and mpfr are NOT in standard places (`/usr/`) these flags may be needed:
```bash
--with-gmp=<path> \
--with-mpfr=<path> \
```
where `<path>` is the UNIX prefix where GMP and MPFR are installed.
If you are working on a Cray machine that does not use the `mpiicpc` wrapper, please use:
``` bash
../configure --enable-precision=double\
--enable-simd=AVX512 \
--enable-comms=mpi \
--enable-mkl \
CXX=CC CC=cc
```
If using the Intel MPI library, threads should be pinned to NUMA domains using:
```bash
export I_MPI_PIN=1
```
This is the default.
### Building for the AMD Epyc
The [AMD EPYC](https://www.amd.com/en/products/epyc) is a multichip module comprising 32 cores spread over four distinct chips each with 8 cores.
So, even with a single socket node there is a quad-chip module. Dual socket nodes with 64 cores total
are common. Each chip within the module exposes a separate NUMA domain.
There are four NUMA domains per socket and we recommend one MPI rank per NUMA domain.
MPI-3 is recommended with the use of four ranks per socket,
and 8 threads per rank.
The following configuration is recommended for the AMD EPYC platform:
```bash
../configure --enable-precision=double\
--enable-simd=AVX2 \
--enable-comms=mpi3 \
CXX=mpicxx
```
If `gmp` and `mpfr` are NOT in standard places (`/usr/`) these flags may be needed::
```bash
--with-gmp=<path> \
--with-mpfr=<path>
```
where `<path>` is the UNIX prefix where GMP and MPFR are installed.
Using MPICH and g++ v4.9.2, best performance can be obtained using explicit GOMP_CPU_AFFINITY flags for each MPI rank.
This can be done by invoking MPI on a wrapper script omp_bind.sh to handle this.
It is recommended to run 8 MPI ranks on a single dual socket AMD EPYC, with 8 threads per rank using MPI3 and
{% include toc icon="gears" title="Quick-start" %}
Please send all pull requests to the `develop` branch.
{: .notice--warning}
## Requirements
### Required libraries
* [GMP](https://gmplib.org/) is the GNU Multiple Precision Library (RHMC support).
* [MPFR](http://www.mpfr.org/) is a C library for multiple-precision floating-point computations with correct rounding (RHMC support).
* [Eigen](http://eigen.tuxfamily.org/index.php?title=Main_Page): bootstrapping GRID downloads and uses for internal dense matrix (non-QCD operations) the Eigen library.
Grid optionally uses:
* [HDF5](https://support.hdfgroup.org/HDF5/) for structured data I/O
* [LIME](http://usqcd-software.github.io/c-lime/) for ILDG and SciDAC file format support.
* [FFTW](http://www.fftw.org/) either generic version or via the Intel MKL library.
* [LAPACK](http://www.netlib.org/lapack/) either generic version or Intel MKL library.
### Compilers
* Intel ICPC v17 and later
* Clang v3.5 and later (need 3.8 and later for OpenMP)
* GCC v4.9.x
* GCC v6.3 and later (recommended)
**Important:**
Some versions of GCC appear to have a bug under high optimisation (-O2, -O3).
The safety of these compiler versions cannot be guaranteed at this time. Follow [Issue 100](https://github.com/paboyle/Grid/issues/100) for details and updates.
GCC v5.x, v6.1, v6.2
## Quick start
First, start by cloning the repository:
``` bash
git clone https://github.com/paboyle/Grid.git
```
Then enter the cloned directory and set up the build system:
```bash
cd Grid
./bootstrap.sh
```
Now you can execute the `configure` script to generate makefiles as in this example (here from a build directory):
sets the **default precision**. Since this is largely a benchmarking convenience, it is anticipated that the default precision may be removed in future implementations, and that explicit type selection be made at all points. Naturally, most code will be type templated in any case.::
selects whether to use MPI communication (mpi) or no communication (none).
``` bash
--prefix=<path>
```
should be passed the prefix path where you want to install Grid.
Other options are detailed in the next section, you can also use
```bash
configure --help
```
to display them.
Like with any other program using GNU autotool, the
```bash
CXX, CXXFLAGS, LDFLAGS, ...
```
environment variables can be modified to customise the build.
Finally, you can build and install Grid:
``` bash
make
make install #this is optional
```
To minimise the build time, only the tests at the root of the `tests` directory are built by default. If you want to build tests in the sub-directory `<subdir>` you can execute:
``` bash
make -C tests/<subdir> tests
```
If you want to build all the tests at once just use `make tests`.
## Build configuration options
A full list of configurations options is available with the `./configure --help` command:
* `--prefix=<path>`: installation prefix for Grid.
* `--with-gmp=<path>`: look for GMP in the UNIX prefix `<path>`
* `--with-mpfr=<path>`: look for MPFR in the UNIX prefix `<path>`
* `--with-fftw=<path>`: look for FFTW in the UNIX prefix `<path>`
* `--with-hdf5=<path>`: look for HDF5 in the UNIX prefix `<path>`
* `--with-lime=<path>`: look for the C-LIME library in the UNIX prefix `<path>`
* `--enable-sfw-fp16=<yes|no`: Enable software FP16 communications support (default `yes`)
* `--enable-lapack[=<path>]`: enable LAPACK support in Lanczos eigensolver. A UNIX prefix containing the library can be specified (optional).
* `--enable-mkl[=<path>]`: use Intel MKL for FFT (and LAPACK if enabled) routines. A UNIX prefix containing the library can be specified (optional).
* `--enable-numa`: enable [numa first touch policy](http://queue.acm.org/detail.cfm?id=2513149) optimization (default `no`)
* `--enable-simd=<code>`: setup Grid for the SIMD target `<code>` (default: `GEN`). [List of possible SIMD targets](/Grid/docs/simd_targets/).
* `--enable-gen-simd-width=<size>`: select the size (in bytes) of the generic SIMD vector type (default: 32 bytes).
* `--enable-precision={single|double}`: set the default precision (default: `double`).
* `--enable-comms=<comm>`: Use `<comm>` for message passing (default: `none`). [List of possible comm targets](/Grid/docs/comm_interfaces/).
* `--enable-shm=<shm>`: Use `<shm>` for shared memory behaviour (default: `shmopen`). [List of possible shm targets](/Grid/docs/comm_interfaces/).
* `--enable-shmpath=<path>`: Select `<path>` for the shared memory mmap base path for libhugetlbfs.
* `--enable-rng={sitmo|ranlux48|mt19937}` choose the RNG (default: `sitmo`).
* `--disable-timers`: disable system dependent high-resolution timers.
* `--enable-chroma`: enable Chroma regression tests. A compiled version of Chroma is assumed to be present.
* `--enable-doxygen-doc`: enable the Doxygen documentation generation (build with `make doxygen-doc`)
{% if site.option=="web" %}
More details on the *Getting started* menu entries on the left.
These are few suggestions in order to get the best performance on the Intel Knights Landing (KNL).
### Bind the memory allocation to the MCDRAM NUMA node
The KNL has two memory systems, the DDR4 (~90 GFlops/s) and the High Bandwidth Memory (MCDRAM, ~400 Gflops/s).
Each of the two memory system is attached to a different [NUMA context](https://software.intel.com/en-us/articles/optimizing-applications-for-numa).
On a KNL node the command `numactl --hardware` will report which NUMA context is connected to the faster MCDRAM.
A typical report looks like this
``` text
node 0 size: 98178 MB
node 0 free: 92899 MB
node 1 cpus:
node 1 size: 16384 MB
node 1 free: 15926 MB
```
In this case the node 1 is related to the 16GB MCDRAM (this is the typical situation on KNLs)
To bind the memory allocation to NUMA node 1 use
``` bash
numactl --membind 1 ./your-executable
```
### Controlling threading
The number of threads can be set in GRID at runtime by the flag
``` text
--threads <#threads>
```
A finer control can be achieved using the environment variable `KMP_HW_SUBSETS` (or the deprecated `KMP_PLACE_THREADS`).
From the [Intel developer guide](https://software.intel.com/en-us/node/684313):
>The KMP_HW_SUBSETS variable controls the hardware resource that will be used by the program. This variable specifies the number of sockets to use, how many cores to use per socket and how many threads to assign per core. For example, on Intel® Xeon Phi™ coprocessors, while each coprocessor can take up to four threads, specifying fewer than four threads per core may result in a better performance. While specifying two threads per core often yields better performance than one thread per core, specifying three or four threads per core may or may not improve the performance. This variable enables you to conveniently measure the performance of up to four threads per core.
A typical setting for the best performance on a single node is to use **62 cores with 1 threads per code**, on the bash shell this is set by
``` bash
export KMP_HW_SUBSETS=62c,1t
```
### Using the optimised Wilson Dslash kernels
Beside the generic implementation using stencils, GRID has optimised version of the Dslash kernels (for Wilson and DWF fermions).
Flags at runtime can be used for the optimised paths
| `SKL` | [Intel Skylake with AVX512 support](https://ark.intel.com/products/codename/37572/Skylake) |
| `BGQ` | Blue Gene/Q |
#### Notes (Jan 2018):
* We currently support AVX512 for the Intel compiler and GCC (KNL and SKL target). Support for clang will appear in future
versions of Grid when the AVX512 support in the compiler is more advanced.
* For BG/Q only [bgclang](http://trac.alcf.anl.gov/projects/llvm-bgq) is supported. We do not presently plan to support more compilers for this platform.
* BG/Q performances are currently rather poor. This is being investigated for future versions.
* The vector size for the `GEN` target can be specified with the `configure` script option `--enable-gen-simd-width`.
Travis will test the compilation workflow for single and double precision version on the following compilers for each commit:
- clang 3.7.0 on Ubuntu 14.04
- clang 3.8.0 on Ubuntu 14.04
- gcc 5.4.1 on Ubuntu 14.04
- gcc 4.9.4 on Ubuntu 14.04.1
- Apple LLVM version 8.1.0 (clang-802.0.42) on OSX (x86_64-apple-darwin16.5.0)
Due to the limitations of the Travis virtual machines, the archictecture is limited to SSE4 and few tests.
May 2017: a new server using [TeamCity](https://www.jetbrains.com/teamcity/specials/teamcity/teamcity.html?gclid=CjwKEAjwutXIBRDV7-SDvdiNsUoSJACIlTqlygt_V8-PqWvjV23oAj8wf2suNmct9-sFfplBFYctzBoCnTvw_wcB&gclsrc=aw.ds.ds&dclid=COOh9rPt6dMCFYOmUQodkpwLfQ) is being setup for extensive testing of every commit.
Expected format of the filename is ```<config prefix>.<integer>``` for the configuration and ```<rng file prefix>.<integer>```
* ```--Trajectories <integer>```
Number of trajectories in this run, excluding the thermalization steps. Default: 1.
* ```--Thermalizations <integers>```
Default: 10
* ```--ParameterFile <string>```
The filename for the input parameters deserialisation.
All of them, except the starting trajectory, can be overridden by the input file (but this behaviour can be easily changed by the user writing the source file).
Actions
======
Action names are defined in the file
lib/qcd/Actions.h
Gauge action names list:
```
WilsonGaugeActionR;
WilsonGaugeActionF;
WilsonGaugeActionD;
PlaqPlusRectangleActionR;
PlaqPlusRectangleActionF;
PlaqPlusRectangleActionD;
IwasakiGaugeActionR;
IwasakiGaugeActionF;
IwasakiGaugeActionD;
SymanzikGaugeActionR;
SymanzikGaugeActionF;
SymanzikGaugeActionD;
ConjugateWilsonGaugeActionR;
ConjugateWilsonGaugeActionF;
ConjugateWilsonGaugeActionD;
ConjugatePlaqPlusRectangleActionR;
ConjugatePlaqPlusRectangleActionF;
ConjugatePlaqPlusRectangleActionD;
ConjugateIwasakiGaugeActionR;
ConjugateIwasakiGaugeActionF;
ConjugateIwasakiGaugeActionD;
ConjugateSymanzikGaugeActionR;
ConjugateSymanzikGaugeActionF;
ConjugateSymanzikGaugeActionD;
```
Scalar action names list
```
ScalarActionR;
ScalarActionF;
ScalarActionD;
```
each of these action accept one single parameter at creation time (beta).
Example for creating a Symanzik action with beta=4.0
SymanzikGaugeActionR(4.0)
The suffixes R,F,D in the action names refer to the Real
(the precision is defined at compile time by the --enable-precision flag in the configure),
Float and Double, that force the precision of the action to be 32, 64 bit respectively.
excerpt: "Welcome to the Grid documentation pages"
header:
overlay_color: "#5DADE2"
#cta_label: "Download documentation"
#cta_url: "https://www.google.com"
sidebar:
nav : docs
permalink: /docs/
---
{% include base_path %}
We are currently working on the full documentation.
{% if site.option=="web" %}
Use the sidebar on the left to navigate.
{% endif %}
_May 2017 : The API description and Lattice Theories sections in the sidebar are work in progress_.
### Version history
* May 2017 [version 0.7.0](https://github.com/paboyle/Grid/tree/release/v0.7.0)
* November 2016 [version 0.6.0](https://github.com/paboyle/Grid/tree/release/v0.6.0)
Grid is primarily an application development interface (API) for structured Cartesian grid codes and written in C++11.
In particular it is aimed at Lattice Field Theory simulations in general gauge theories, but with a particular emphasis
on supporting SU(3) and U(1) gauge theories relevant to hadronic physics.
### Description
This library provides data parallel C++ container classes with internal memory layout
that is transformed to map efficiently to SIMD architectures. CSHIFT facilities
are provided, similar to HPF and cmfortran, and user control is given over the mapping of
array indices to both MPI tasks and SIMD processing elements.
* Identically shaped arrays then be processed with perfect data parallelisation.
* Such identically shaped arrays are called conformable arrays.
The transformation is based on the observation that Cartesian array processing involves
identical processing to be performed on different regions of the Cartesian array.
The library will both geometrically decompose into MPI tasks and across SIMD lanes.
Local vector loops are parallelised with OpenMP pragmas.
Data parallel array operations can then be specified with a SINGLE data parallel paradigm, but
optimally use MPI, OpenMP and SIMD parallelism under the hood. This is a significant simplification
for most programmers.
The layout transformations are parametrised by the SIMD vector length. This adapts according to the architecture.
Presently SSE4 (128 bit) AVX, AVX2, QPX (256 bit), IMCI, and AVX512 (512 bit) targets are supported (ARM NEON on the way).
These are presented as `vRealF`, `vRealD`, `vComplexF`, and `vComplexD` internal vector data types. These may be useful in themselves for other programmers.
The corresponding scalar types are named `RealF`, `RealD`, `ComplexF` and `ComplexD`.
MPI, OpenMP, and SIMD parallelism are present in the library.
Please see [this paper](https://arxiv.org/abs/1512.03487) for more detail.
### Who will use this library
As an application development interface *Grid* is primarily a programmers tool providing the
building blocks and primitives for constructing lattice gauge theory programmes.
Grid functionality includes:
* Data parallel primitives, similar to QDP++
* gauge and fermion actions
* solvers
* gauge and fermion force terms
* integrators and (R)HMC.
* parallel field I/O
* object serialisation (text, XML, JSON...)
Grid is intended to enable the rapid and easy development of code with reasonably competitive performance.
It is first and foremost a *library* to which people can programme, and develop new algorithms and measurements.
As such, it is very much hoped that peoples principle point of contact with Grid will be in
the wonderfully rich C++ language. Since import and export procedures are provided for the opaque lattice types
it should be possible to call Grid from other code bases.
Grid is most tightly coupled to the Hadrons package
developed principally by Antonin Portelli.
This package is entirely composed against the Grid data parallel interface.
Interfacing to other packages is also possible.
Several regression tests that combine Grid with Chroma are included in the Grid distribution.
Further, Grid has been successfully interfaced to
* The Columbia Physics System
* The MILC code
### Data parallel interface
Most users will wish to interact with Grid above the data parallel *Lattice* interface. At this level
a programme is simply written as a series of statements, addressing entire lattice objects.
Implementation details may be provided to explain how the code works, but are not strictly part of the API.
**Example**
For example, as an implementation detail, in a single programme multiple data (SPMD) message passing supercomputer the main programme is trivially replicated on each computing node. The data parallel operations are called *collectively* by all nodes. Any scalar values returned by the various reduction routines are the same on each node, resulting in (for example) the same decision being made by all nodes to terminate an iterative solver on the same iteration.
### Internal development
Internal developers may contribute to Grid at a level below the data parallel interface.
Specifically, development of new lattice Dirac operators, for example,
or any codes directly interacting with the
* Communicators
* Simd
* Tensor
* Stencil
will make use of facilities provided by to assist the creation of high performance code. The internal data layout complexities
will be exposed to some degree and the interfaces are subject to change without notice as HPC architectures change.
Since some of the internal implementation details are needed to explain the design strategy of grid these will be
documented, but labelled as *implementation dependent*
Reasonable endeavours will be made to preserve functionality where practical but no guarantees are made.
Some files were not shown because too many files have changed in this diff
Show More
Reference in New Issue
Block a user
Blocking a user prevents them from interacting with repositories, such as opening or commenting on pull requests or issues. Learn more about blocking a user.