# Grid benchmarks
This folder contains benchmarks for the Grid library. The benchmarks can be summarised as follows:

`Benchmark_Grid`
: This benchmark measures floating-point performance for various fermion matrices, as well as bandwidth for different operations. Measurements are performed for a fixed range of problem sizes.

`Benchmark_IO`
: Parallel I/O benchmark.
## TL;DR
Build and install Grid, all dependencies, and the benchmarks with

```sh
systems/<system>/bootstrap-env.sh <env_dir> # build dependencies, takes a long time
./build-grid.sh <env_dir> <config>          # build Grid
./build-benchmark.sh <env_dir> <config>     # build benchmarks
```
where `<env_dir>` is an arbitrary directory where every product will be stored, `<system>` is a sub-directory of `systems` containing system-specific scripts (an existing preset or your own), and finally `<config>` is the name of a build config in `systems/<system>/grid-config.json`. After a successful execution the benchmark binaries will be in `<env_dir>/prefix/gridbench_<config>`.
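For example, using the Tursa preset shipped in this repository (the config name `gpu` below is hypothetical, use a name actually defined in `systems/tursa/grid-config.json`):

```sh
systems/tursa/bootstrap-env.sh ./env  # deploy dependencies into ./env
./build-grid.sh ./env gpu             # build Grid with the "gpu" config
./build-benchmark.sh ./env gpu        # binaries land in ./env/prefix/gridbench_gpu
```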
## Environment setup
A complete runtime environment can be deployed using scripts from this repository. System-specific scripts are in the `systems` directory.

You should first deploy the environment for the specific system you are using, for example

```sh
systems/tursa/bootstrap-env.sh ./env
```

will deploy the relevant environment for the Tursa supercomputer in `./env`. This step might compile a large set of packages from source, and can take some time to complete.
After that, the environment directory (`./env` in the example above) will contain an `env.sh` file that needs to be sourced to activate the environment:

```sh
source ./env/env.sh
```

Additional scripts `env-*.sh` can be sourced afterwards to activate more specific environments; this should be done after sourcing `env.sh` as above.
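For example, assuming the preset provides a hypothetical `env-gpu.sh` script (the available `env-*.sh` scripts depend on the system):

```sh
source ./env/env.sh      # base environment, always sourced first
source ./env/env-gpu.sh  # hypothetical system-specific additions
```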
## Building the benchmarks
The environment directory contains a `grid-config.json` file specifying compilation flag configurations for Grid (please see Grid's repository for documentation). All entries have the form

```json
{
  "name": "foo",           // name of the configuration
  "env-script": "bar.sh",  // script to source before building
                           // (path relative to the environment directory)
  "commit": "...",         // Grid commit to use
                           // (anything that can be an argument of git checkout)
  "config-options": "...", // options to pass to the configure script
  "env": {                 // environment variables
    "VAR": "value"         // export VAR="value" before building
  }
}
```
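For illustration, a complete hypothetical entry could look as follows (the commit, configure options, and variable are placeholders, not a tested configuration):

```json
{
  "name": "gpu",
  "env-script": "env-gpu.sh",
  "commit": "develop",
  "config-options": "--enable-comms=mpi --enable-simd=GPU",
  "env": {
    "CXX": "nvcc"
  }
}
```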
Grid can then be built with

```sh
./build-grid.sh <env_dir> <config>
```

where `<env_dir>` is the environment directory and `<config>` is the build config name in `grid-config.json`. Similarly, the benchmarks can then be built with

```sh
./build-benchmark.sh <env_dir> <config>
```
## Running the benchmarks
After building the benchmarks as above you can find the binaries in `<env_dir>/prefix/gridbench_<config>`. Depending on the system selected, the environment directory might also contain batch script examples. More information about the benchmarks is provided below.
## Benchmark_Grid
This benchmark performs flop/s measurements for typical lattice QCD sparse matrices, as well as memory and inter-process bandwidth measurements using Grid routines. The benchmark command accepts any Grid flag (see the complete list with `--help`), as well as a `--json-out <file>` flag to save the measurement results in JSON to `<file>`. The benchmarks are performed on a fixed set of problem sizes, and the Grid flag `--grid` will be ignored.
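For instance, a run saving its measurements to `results.json` might look like this (the launcher, rank count, and `--mpi` geometry below are illustrative assumptions, not a recommended setup):

```sh
# 4 MPI ranks decomposed as 1x1x2x2 across the lattice
mpirun -np 4 <env_dir>/prefix/gridbench_<config>/Benchmark_Grid \
    --mpi 1.1.2.2 --json-out results.json
```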
The resulting metrics are as follows; all data size units are in base 2 (i.e. 1 kB = 1024 B).
### Memory bandwidth
One sub-benchmark measures the memory bandwidth using a lattice version of the `axpy` BLAS routine, in a similar fashion to the STREAM benchmark. The JSON entries under `"axpy"` have the form

```json
{
  "GBps": 215.80653375861607,   // bandwidth in GB/s/node
  "GFlops": 19.310041765757834, // FP performance (double precision)
  "L": 8,                       // local lattice size
  "size_MB": 3.0                // memory size in MB/node
}
```
A second sub-benchmark performs site-wise SU(4) matrix multiplication, and has a higher arithmetic intensity than the `axpy` one (although it is still memory-bound). The JSON entries under `"SU4"` have the form

```json
{
  "GBps": 394.76639187026865,  // bandwidth in GB/s/node
  "GFlops": 529.8464820758512, // FP performance (single precision)
  "L": 8,                      // local lattice size
  "size_MB": 6.0               // memory size in MB/node
}
```
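The JSON output is convenient to post-process with standard tools. As a sketch, assuming the `"axpy"` entries form an array (the exact layout may differ), the bandwidth for every problem size can be extracted with `jq`:

```sh
# print local lattice size and memory bandwidth for every axpy measurement
jq '.axpy[] | {L, GBps}' results.json
```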
### Inter-process bandwidth
This sub-benchmark measures the achieved bidirectional bandwidth in threaded halo exchange using routines in Grid. The exchange is performed in each direction on the MPI Cartesian grid which is parallelised across at least 2 processes. The resulting bandwidth is related to node-local transfers (inter-CPU, NVLink, ...) or network transfers depending on the MPI decomposition. The JSON entries under `"comms"` have the form

```json
{
  "L": 40,                      // local lattice size
  "bytes": 73728000,            // payload size in B/rank
  "dir": 2,                     // direction of the exchange, 8 possible directions
                                // (0: +x, 1: +y, ..., 5: -x, 6: -y, ...)
  "rate_GBps": {
    "error": 6.474271894240327, // standard deviation across measurements (GB/s/node)
    "max": 183.10546875,        // maximum measured bandwidth (GB/s/node)
    "mean": 175.21747026766676  // average measured bandwidth (GB/s/node)
  },
  "time_usec": 3135.055         // average transfer time (microseconds)
}
```
### Floating-point performance
This sub-benchmark measures the achieved floating-point performance using the Wilson fermion, domain-wall fermion, and staggered fermion sparse matrices from Grid. The best performances are recorded in the `"flops"` and `"results"` sections of the JSON output, e.g.

```json
{
  "Gflops_dwf4": 366.5251173474483,       // domain-wall in Gflop/s/node (single precision)
  "Gflops_staggered": 7.5982861018529455, // staggered in Gflop/s/node (single precision)
  "Gflops_wilson": 15.221839719288932,    // Wilson in Gflop/s/node (single precision)
  "L": 8                                  // local lattice size
}
```
Here "best" means across a number of different implementations of the routines. Please see the log of the benchmark for an additional breakdown. Finally, the JSON output contains a "comparison point", which is the average of the L=24 and L=32 best domain-wall performances.