Grid benchmark documentation update

Antonin Portelli 2023-01-30 18:28:38 +00:00
parent 5f9abbb8d0
commit 5198bbe1cd


@ -6,6 +6,7 @@ The benchmarks can be summarised as follows
- `Benchmark_Grid`: This benchmark measures floating-point performance for various fermion
matrices, as well as bandwidth for different operations. Measurements are
performed for a fixed range of problem sizes.
- `Benchmark_IO`: Parallel I/O benchmark.
## TL;DR
Build and install Grid, all dependencies, and the benchmark with
@ -28,7 +29,7 @@ You should first deploy the environment for the specific system you are using, f
systems/tursa/bootstrap-env.sh ./env
```
will deploy the relevant environment for the [Tursa](https://www.epcc.ed.ac.uk/hpc-services/dirac-tursa-gpu) supercomputer in `./env`. This step might compile from source a large set
of packages, and take some time to complete.
After that, the environment directory (`./env` in the example above) will contain an `env.sh` file that needs to be sourced to activate the environment
```bash
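# activate the environment deployed in ./env (as in the example above)
source ./env/env.sh
```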
@ -66,4 +67,84 @@ where `<env_dir>` is the environment directory and `<config>` is the build confi
## Running the benchmarks
After building the benchmarks as above you can find the binaries in
`<env_dir>/prefix/gridbench_<config>`. Depending on the system selected, the environment
directory might also contain batch script examples. More information about the benchmarks
is provided below.
### `Benchmark_Grid`
This benchmark performs flop/s measurements for typical lattice QCD sparse matrices, as
well as memory and inter-process bandwidth measurements using Grid routines. The benchmark
command accepts any Grid flag (see the complete list with `--help`), as well as a
`--json-out <file>` flag to save the measurement results in JSON to `<file>`. The
benchmarks are performed on a fixed set of problem sizes, and the Grid flag `--grid` will
be ignored.
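For illustration, a typical invocation could look as follows (the launcher, rank count,
and flag values are system-dependent and purely indicative here):
```bash
# illustrative 4-rank run; adapt the launcher and MPI decomposition to your system
mpirun -np 4 ./Benchmark_Grid --mpi 1.1.1.4 --threads 8 --json-out results.json
```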
The resulting metrics are as follows; all data sizes are given in base-2 units
(i.e. 1 kB = 1024 B).
*Memory bandwidth*
One sub-benchmark measures the memory bandwidth using a lattice version of the `axpy` BLAS
routine, in a similar fashion to the STREAM benchmark. The JSON entries under `"axpy"`
have the form
```json
{
"GBps": 215.80653375861607, // bandwidth in GB/s/node
"GFlops": 19.310041765757834, // FP performance (double precision)
"L": 8, // local lattice volume
"size_MB": 3.0 // memory size in MB/node
}
```
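Assuming the results were saved with `--json-out results.json` and that `"axpy"` is an
array of such entries (schema assumed from the example above), the peak measured
bandwidth can be extracted with a tool like `jq`:
```bash
# maximum axpy bandwidth across all local lattice sizes (GB/s/node)
jq '[.axpy[].GBps] | max' results.json
```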
A second benchmark performs site-wise SU(4) matrix multiplication, and has a higher
arithmetic intensity than the `axpy` one (although it is still memory-bound).
The JSON entries under `"SU4"` have the form
```json
{
"GBps": 394.76639187026865, // bandwidth in GB/s/node
"GFlops": 529.8464820758512, // FP performance (single precision)
"L": 8, // local lattice size
"size_MB": 6.0 // memory size in MB/node
}
```
*Inter-process bandwidth*
This sub-benchmark measures the achieved bidirectional bandwidth in threaded halo exchange
using routines in Grid. The exchange is performed in each direction of the MPI Cartesian
grid that is parallelised across at least 2 processes. The resulting bandwidth is related
to node-local transfers (inter-CPU, NVLink, ...) or network transfers depending on the MPI
decomposition. The JSON entries under `"comms"` have the form
```json
{
"L": 40, // local lattice size
"bytes": 73728000, // payload size in B/rank
"dir": 2, // direction of the exchange, 8 possible directions
// (0: +x, 1: +y, ..., 5: -x, 6: -y, ...)
"rate_GBps": {
"error": 6.474271894240327, // standard deviation across measurements (GB/s/node)
"max": 183.10546875, // maximum measured bandwidth (GB/s/node)
"mean": 175.21747026766676 // average measured bandwidth (GB/s/node)
},
"time_usec": 3135.055 // average transfer time (microseconds)
}
```
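Under the same schema assumption as above, the mean bandwidth per direction at a given
local size can be tabulated with `jq`, e.g.
```bash
# mean halo-exchange bandwidth per direction for L=40 (GB/s/node)
jq '[.comms[] | select(.L == 40) | {dir, GBps: .rate_GBps.mean}]' results.json
```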
*Floating-point performance*
This sub-benchmark measures the achieved floating-point performance using the
Wilson fermion, domain-wall fermion, and staggered fermion sparse matrices from Grid.
The best performances are recorded in the `"flops"` and `"results"` sections of the
JSON output, e.g.
```json
{
"Gflops_dwf4": 366.5251173474483, // domain-wall in Gflop/s/node (single precision)
"Gflops_staggered": 7.5982861018529455, // staggered in Gflop/s/node (single precision)
"Gflops_wilson": 15.221839719288932, // Wilson in Gflop/s/node (single precision)
"L": 8 // local lattice size
}
```
Here "best" means across a number of different implementations of the routines. Please
see the log of the benchmark for an additional breakdown. Finally, the JSON output
contains a "comparison point", which is the average of the best domain-wall performances
at L=24 and L=32.
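As a minimal sketch, and assuming `"results"` is an array of entries of the form shown
above, this comparison point could be recomputed with `jq`:
```bash
# average of the best L=24 and L=32 domain-wall performances (Gflop/s/node)
jq '[.results[] | select(.L == 24 or .L == 32) | .Gflops_dwf4] | add / length' results.json
```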