Grid benchmark documentation update

The benchmarks can be summarised as follows:
- `Benchmark_Grid`: This benchmark measures floating-point performance for various fermion
  matrices, as well as bandwidth measurements for different operations. Measurements are
  performed for a fixed range of problem sizes.
- `Benchmark_IO`: Parallel I/O benchmark.

## TL;DR

Build and install Grid, all dependencies, and the benchmarks as follows.
You should first deploy the environment for the specific system you are using. For example,
```bash
systems/tursa/bootstrap-env.sh ./env
```
will deploy the relevant environment for the [Tursa](https://www.epcc.ed.ac.uk/hpc-services/dirac-tursa-gpu) supercomputer in `./env`. This step might compile a large set
of packages from source, and might take some time to complete.

After that, the environment directory (`./env` in the example above) will contain an
`env.sh` file that needs to be sourced to activate the environment:
```bash
source ./env/env.sh
```
where `<env_dir>` is the environment directory and `<config>` is the build configuration.

## Running the benchmarks
After building the benchmarks as above, you can find the binaries in
`<env_dir>/prefix/gridbench_<config>`. Depending on the system selected, the environment
directory might also contain batch script examples. More information about the benchmarks
is provided below.

### `Benchmark_Grid`
This benchmark performs flop/s measurements for typical lattice QCD sparse matrices, as
well as memory and inter-process bandwidth measurements using Grid routines. The benchmark
command accepts any Grid flag (see the complete list with `--help`), as well as a
`--json-out <file>` flag to save the measurement results in JSON to `<file>`. The
benchmarks are performed on a fixed set of problem sizes, and the Grid flag `--grid` will
be ignored.
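
As a sketch, a run writing its measurements to `results.json` might look as follows; the
MPI launcher, rank count, and `--mpi` process-grid layout are illustrative assumptions to
be adapted to your system:
```bash
# hypothetical 4-rank launch; any other Grid flag can be appended
mpirun -np 4 ./Benchmark_Grid --mpi 1.1.2.2 --json-out results.json
```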

The resulting metrics are as follows; all data size units are in base 2
(i.e. 1 kB = 1024 B).

*Memory bandwidth*

One sub-benchmark measures the memory bandwidth using a lattice version of the `axpy` BLAS
routine, in a similar fashion to the STREAM benchmark. The JSON entries under `"axpy"`
have the form
```json
{
  "GBps": 215.80653375861607,   // bandwidth in GB/s/node
  "GFlops": 19.310041765757834, // FP performance (double precision)
  "L": 8,                       // local lattice size
  "size_MB": 3.0                // memory size in MB/node
}
```
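
For quick inspection, these entries can be tabulated with `jq`; the sketch below assumes
the measurements were saved to `results.json` and sit in a top-level `"axpy"` array:
```bash
# print the local lattice size against the achieved bandwidth, one line per entry
jq -r '.axpy[] | "L=\(.L) \(.GBps) GB/s/node"' results.json
```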

A second benchmark performs site-wise SU(4) matrix multiplication, and has a higher
arithmetic intensity than the `axpy` one (although it is still memory-bound).
The JSON entries under `"SU4"` have the form
```json
{
  "GBps": 394.76639187026865,  // bandwidth in GB/s/node
  "GFlops": 529.8464820758512, // FP performance (single precision)
  "L": 8,                      // local lattice size
  "size_MB": 6.0               // memory size in MB/node
}
```
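
For the example entries above, the ratio of the two metrics makes this explicit: roughly
529.8 / 394.8 ≈ 1.3 flop/B for SU(4), against about 19.3 / 215.8 ≈ 0.09 flop/B for
`axpy`.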

*Inter-process bandwidth*

This sub-benchmark measures the achieved bidirectional bandwidth in threaded halo exchange
using routines in Grid. The exchange is performed in each direction on the MPI Cartesian
grid which is parallelised across at least 2 processes. The resulting bandwidth is related
to node-local transfers (inter-CPU, NVLink, ...) or network transfers depending on the MPI
decomposition. The JSON entries under `"comms"` have the form
```json
{
  "L": 40,           // local lattice size
  "bytes": 73728000, // payload size in B/rank
  "dir": 2,          // direction of the exchange, 8 possible directions
                     // (0: +x, 1: +y, ..., 4: -x, 5: -y, ...)
  "rate_GBps": {
    "error": 6.474271894240327, // standard deviation across measurements (GB/s/node)
    "max": 183.10546875,        // maximum measured bandwidth (GB/s/node)
    "mean": 175.21747026766676  // average measured bandwidth (GB/s/node)
  },
  "time_usec": 3135.055 // average transfer time (microseconds)
}
```
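
As with the other metrics, these entries can be filtered with `jq`; this sketch assumes a
top-level `"comms"` array in `results.json`:
```bash
# mean bidirectional bandwidth per direction for a given local lattice size
jq -r '.comms[] | select(.L == 40) | "dir \(.dir): \(.rate_GBps.mean) GB/s/node"' results.json
```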

*Floating-point performance*

This sub-benchmark measures the achieved floating-point performance using the
Wilson fermion, domain-wall fermion, and staggered fermion sparse matrices from Grid.
The best performances are recorded in the `"flops"` and `"results"` sections of the
JSON output, e.g.
```json
{
  "Gflops_dwf4": 366.5251173474483,       // domain-wall in Gflop/s/node (single precision)
  "Gflops_staggered": 7.5982861018529455, // staggered in Gflop/s/node (single precision)
  "Gflops_wilson": 15.221839719288932,    // Wilson in Gflop/s/node (single precision)
  "L": 8                                  // local lattice size
}
```

Here "best" means the best result across a number of different implementations of the
routines. Please see the log of the benchmark for an additional breakdown. Finally, the
JSON output contains a "comparison point", which is the average of the best L=24 and
L=32 domain-wall performances.
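
The comparison point can be reproduced from the individual entries; this sketch assumes
the best domain-wall figures sit in a top-level `"flops"` array in `results.json`:
```bash
# average of the L=24 and L=32 domain-wall results
jq '[.flops[] | select(.L == 24 or .L == 32) | .Gflops_dwf4] | add / length' results.json
```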