Grid benchmarks

This folder contains benchmarks for the Grid library. The benchmarks can be summarised as follows

Benchmark_Grid: This benchmark measure floating point performances for various fermion matrices, as well as bandwidth measurement for different operations. Measurements are performed for a fixed range of problem sizes.
Benchmark_IO: Parallel I/O benchmark.

TL;DR

Build and install Grid, all dependencies, and the benchmark with

systems/<system>/bootstrap-env.sh <env_dir> # build dependencies, takes a long time
./build-grid.sh <env_dir> <config>          # build Grid
./build-benchmark.sh <env_dir> <config>     # build benchmarks

where <env_dir> is an arbitrary directory where every product will be stored, <system> is a sub-directory of systems containing system-specific scripts (an existing preset or your own), and finally <config> is the name of a build config in systems/<system>/grid-config.json. After a successful execution the benchmark binaries will be in <env_dir>/prefix/gridbench_<config>.

Environment setup

A complete runtime environnement can be deploy using scripts from this repository. System-specific scripts are in the systems directory.

You should first deploy the environment for the specific system you are using, for example

systems/tursa/bootstrap-env.sh ./env

will deploy the relevant environment for the Tursa supercomputer in ./env. This step might compile from source a large set of packages, and take some time to complete.

After that, the environment directory (./env in the example above) will contain a env.sh file that need to be sourced to activate the environment

source ./env/env.sh

Additional scripts env-*.sh can be sourced after to activate more specific environments, this should be done after sourcing env.sh as above.

Building the benchmarks

The environnement directory contains a grid-config.json file specifying compilation flag configurations for Grid (please see Grid's repository for documentation). All entries have the form

{
  "name": "foo",          // name of the configuration
  "env-script": "bar.sh", // script to source before building 
                          // (path relative to the environment directory)
  "commit": "...",        // Grid commit to use 
                          // (anything that can be an argument of git checkout)
  "config-options": "..." // options to pass to the configure script,
  "env" : {               // environment variables
    "VAR": "value"        // export VAR="value" before building
  }
}

Grid can then be built with

./build-grid.sh <env_dir> <config>

where <env_dir> is the environment directory and <config> is the build config name in grid-config.json. Similarly, the benchmarks can then be built with

./build-grid <env_dir> <config>

Running the benchmarks

After building the benchmarks as above you can find the binaries in <env_dir>/prefix/gridbench_<config>. Depending on the system selected, the environment directory might also contain batch script examples. More information about the benchmarks is provided below.

`Benchmark_Grid`

This benchmark performs flop/s measurement for typical lattice QCD sparse matrices, as well as memory and inter-process bandwidth measurement using Grid routines. The benchmark command accept any Grid flag (see complete list with --help), as well as a --json-out <file> flag to save the measurement results in JSON to <file>. The benchmarks are performed on a fix set of problem sizes, and the Grid flag --grid will be ignored.

The resulting metrics are as follows, all data size units are in base 2 (i.e. 1 kB = 1024 B).

Memory bandwidth

One sub-benchmark measure the memory bandwidth using a lattice version of the axpy BLAS routine, in a similar fashion to the STREAM benchmark. The JSON entries under "axpy" have the form

{
  "GBps": 215.80653375861607,   // bandwidth in GB/s/node
  "GFlops": 19.310041765757834, // FP performance (double precision)
  "L": 8,                       // local lattice volume
  "size_MB": 3.0                // memory size in MB/node
}

A second benchmark performs site-wise SU(4) matrix multiplication, and has a higher arithmetic intensity than the axpy one (although it is still memory-bound). The JSON entries under "SU4" have the form

{
  "GBps": 394.76639187026865,  // bandwidth in GB/s/node
  "GFlops": 529.8464820758512, // FP performance (single precision)
  "L": 8,                      // local lattice size
  "size_MB": 6.0               // memory size in MB/node
}

Inter-process bandwidth

This sub-benchmark measures the achieved bidirectional bandwidth in threaded halo exchange using routines in Grid. The exchange is performed in each direction on the MPI Cartesian grid which is parallelised across at least 2 processes. The resulting bandwidth is related to node-local transfers (inter-CPU, NVLink, ...) or network transfers depending on the MPI decomposition. he JSON entries under "comms" have the form

{
  "L": 40,                       // local lattice size
  "bytes": 73728000,             // payload size in B/rank
  "dir": 2,                      // direction of the exchange, 8 possible directions
                                 // (0: +x, 1: +y, ..., 5: -x, 6: -y, ...)
  "rate_GBps": {
    "error": 6.474271894240327,  // standard deviation across measurements (GB/s/node)
    "max": 183.10546875,         // maximum measured bandwidth (GB/s/node)
    "mean": 175.21747026766676   // average measured bandwidth (GB/s/node)
  },
  "time_usec": 3135.055          // average transfer time (microseconds)
}

Floating-point performances

This sub-benchmark measures the achieved floating-point performances using the Wilson fermion, domain-wall fermion, and staggered fermion sparse matrices from Grid. In the "flops" and "results" section of the JSON output are recorded the best performances, e.g.

{
  "Gflops_dwf4": 366.5251173474483,       // domain-wall in Gflop/s/node (single precision)
  "Gflops_staggered": 7.5982861018529455, // staggered in Gflop/s/node (single precision)
  "Gflops_wilson": 15.221839719288932,    // Wilson in Gflop/s/node (single precision)
  "L": 8                                  // local lattice size
}

Here "best" means across a number of different implementations of the routines. Please see the log of the benchmark for an additional breakdown. Finally, the JSON output contains a "comparison point", which is the average of the L=24 and L=32 best domain-wall performances.

6.6 KiB Raw Blame History