Grid benchmark documentation update
Parent: 5f9abbb8d0
Commit: 5198bbe1cd
@ -6,6 +6,7 @@ The benchmarks can be summarised as follows
- `Benchmark_Grid`: This benchmark measures floating-point performance for various fermion
  matrices, as well as bandwidth for different operations. Measurements are
  performed for a fixed range of problem sizes.
- `Benchmark_IO`: Parallel I/O benchmark.

## TL;DR
Build and install Grid, all dependencies, and the benchmark with
@ -28,7 +29,7 @@ You should first deploy the environment for the specific system you are using, f
systems/tursa/bootstrap-env.sh ./env
```
will deploy the relevant environment for the [Tursa](https://www.epcc.ed.ac.uk/hpc-services/dirac-tursa-gpu) supercomputer in `./env`. This step might compile a large set
of packages from source, and take some time to complete.

After that, the environment directory (`./env` in the example above) will contain an `env.sh` file that needs to be sourced to activate the environment
```bash
@ -66,4 +67,84 @@ where `<env_dir>` is the environment directory and `<config>` is the build confi

## Running the benchmarks
After building the benchmarks as above, you can find the binaries in
`<env_dir>/prefix/gridbench_<config>`. Depending on the system selected, the environment
directory might also contain example batch scripts. More information about the benchmarks
is provided below.

### `Benchmark_Grid`
This benchmark measures flop/s for typical lattice QCD sparse matrices, as
well as memory and inter-process bandwidth, using Grid routines. The benchmark
command accepts any Grid flag (see the complete list with `--help`), as well as a
`--json-out <file>` flag to save the measurement results in JSON format to `<file>`. The
benchmarks are performed on a fixed set of problem sizes, and the Grid flag `--grid` will
be ignored.

The resulting metrics are as follows. All data size units are in base 2
(i.e. 1 kB = 1024 B).

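As a small illustration of the base-2 convention, the sketch below converts between bytes and the MB unit used in the output. The helper names `bytes_to_mb` and `mb_to_bytes` are hypothetical and are not part of Grid or of the benchmark.

```python
# Base-2 unit helpers; the names are illustrative and not part of Grid.
KB = 1024        # 1 kB = 1024 B, as stated above
MB = 1024 * KB   # 1 MB = 1024 kB


def bytes_to_mb(n_bytes: float) -> float:
    """Convert a size in bytes to base-2 MB."""
    return n_bytes / MB


def mb_to_bytes(n_mb: float) -> float:
    """Convert a size in base-2 MB to bytes."""
    return n_mb * MB


if __name__ == "__main__":
    print(mb_to_bytes(3.0))       # size in bytes of a 3.0 MB buffer
    print(bytes_to_mb(73728000))  # size in MB of a 73728000 B payload
```

With this convention a 3.0 MB buffer is 3145728 B, and a 73728000 B payload is 70.3125 MB.
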
*Memory bandwidth*

One sub-benchmark measures the memory bandwidth using a lattice version of the `axpy` BLAS
routine, in a similar fashion to the STREAM benchmark. The JSON entries under `"axpy"`
have the form
```json
{
  "GBps": 215.80653375861607,   // bandwidth in GB/s/node
  "GFlops": 19.310041765757834, // FP performance (double precision)
  "L": 8,                       // local lattice volume
  "size_MB": 3.0                // memory size in MB/node
}
```

A second benchmark performs site-wise SU(4) matrix multiplication, and has a higher
arithmetic intensity than the `axpy` one (although it is still memory-bound).
The JSON entries under `"SU4"` have the form
```json
{
  "GBps": 394.76639187026865, // bandwidth in GB/s/node
  "GFlops": 529.8464820758512, // FP performance (single precision)
  "L": 8,                     // local lattice size
  "size_MB": 6.0              // memory size in MB/node
}
```

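The difference in arithmetic intensity can be estimated from the two sample entries by dividing the reported `GFlops` by the reported `GBps`. The sketch below does this in Python; the function name `flops_per_byte` is hypothetical, and because the bandwidth uses base-2 units the ratio is best read as a relative indicator rather than an exact flop/B figure.

```python
# Sample entries copied from the "axpy" and "SU4" examples above.
axpy = {"GBps": 215.80653375861607, "GFlops": 19.310041765757834,
        "L": 8, "size_MB": 3.0}
su4 = {"GBps": 394.76639187026865, "GFlops": 529.8464820758512,
       "L": 8, "size_MB": 6.0}


def flops_per_byte(entry: dict) -> float:
    """Ratio of reported GFlops to reported GBps.

    Hypothetical helper: a rough proxy for arithmetic intensity,
    not a quantity reported by the benchmark itself.
    """
    return entry["GFlops"] / entry["GBps"]


if __name__ == "__main__":
    print(f"axpy: {flops_per_byte(axpy):.3f}")
    print(f"SU4:  {flops_per_byte(su4):.3f}")
```

For the sample values the ratio is roughly 0.09 for `axpy` and 1.34 for `SU4`, consistent with the statement that the SU(4) kernel does more arithmetic per byte moved.
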
*Inter-process bandwidth*

This sub-benchmark measures the achieved bidirectional bandwidth in threaded halo exchange
using routines in Grid. The exchange is performed in each direction of the MPI Cartesian
grid that is parallelised across at least 2 processes. The resulting bandwidth is related
to node-local transfers (inter-CPU, NVLink, ...) or network transfers depending on the MPI
decomposition. The JSON entries under `"comms"` have the form
```json
{
  "L": 40,           // local lattice size
  "bytes": 73728000, // payload size in B/rank
  "dir": 2,          // direction of the exchange, 8 possible directions
                     // (0: +x, 1: +y, ..., 5: -x, 6: -y, ...)
  "rate_GBps": {
    "error": 6.474271894240327, // standard deviation across measurements (GB/s/node)
    "max": 183.10546875,        // maximum measured bandwidth (GB/s/node)
    "mean": 175.21747026766676  // average measured bandwidth (GB/s/node)
  },
  "time_usec": 3135.055         // average transfer time (microseconds)
}
```

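One simple way to judge how stable a bandwidth measurement is would be to compare the reported standard deviation to the mean. The sketch below does this for the sample entry; the helper names `relative_spread` and `is_noisy`, and the 10% threshold, are assumptions made for illustration and are not part of the benchmark.

```python
# Sample entry copied from the "comms" example above.
entry = {
    "L": 40,
    "bytes": 73728000,
    "dir": 2,
    "rate_GBps": {
        "error": 6.474271894240327,
        "max": 183.10546875,
        "mean": 175.21747026766676,
    },
    "time_usec": 3135.055,
}


def relative_spread(comms_entry: dict) -> float:
    """Standard deviation divided by the mean bandwidth (hypothetical helper)."""
    rate = comms_entry["rate_GBps"]
    return rate["error"] / rate["mean"]


def is_noisy(comms_entry: dict, threshold: float = 0.1) -> bool:
    """Flag entries whose relative spread exceeds an arbitrary threshold."""
    return relative_spread(comms_entry) > threshold


if __name__ == "__main__":
    print(f"relative spread: {relative_spread(entry):.1%}")
    print(f"noisy: {is_noisy(entry)}")
```

For the sample entry the spread is about 3.7% of the mean, so it would not be flagged under the assumed threshold.
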
*Floating-point performance*

This sub-benchmark measures the achieved floating-point performance using the
Wilson fermion, domain-wall fermion, and staggered fermion sparse matrices from Grid.
The best performances are recorded in the `"flops"` and `"results"` sections of the JSON
output, e.g.
```json
{
  "Gflops_dwf4": 366.5251173474483,       // domain-wall in Gflop/s/node (single precision)
  "Gflops_staggered": 7.5982861018529455, // staggered in Gflop/s/node (single precision)
  "Gflops_wilson": 15.221839719288932,    // Wilson in Gflop/s/node (single precision)
  "L": 8                                  // local lattice size
}
```
Here "best" means the best across a number of different implementations of the routines. Please
see the benchmark log for an additional breakdown. Finally, the JSON output
contains a "comparison point", which is the average of the L=24 and L=32 best
domain-wall performances.
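
The comparison point described above can be recomputed from the per-size results. The following is a minimal sketch, assuming the results are available as a list of objects with the `"L"` and `"Gflops_dwf4"` keys shown in the example; the function name `comparison_point` and the input layout are assumptions, and the numbers in the usage example are placeholders, not measurements.

```python
def comparison_point(flops_entries: list) -> float:
    """Average of the L=24 and L=32 `Gflops_dwf4` values.

    Hypothetical helper: `flops_entries` is assumed to be a list of
    dictionaries with the "L" and "Gflops_dwf4" keys.
    """
    by_l = {e["L"]: e["Gflops_dwf4"] for e in flops_entries}
    missing = [size for size in (24, 32) if size not in by_l]
    if missing:
        raise ValueError(f"missing entries for L = {missing}")
    return (by_l[24] + by_l[32]) / 2.0


if __name__ == "__main__":
    # Placeholder numbers for illustration only, not measurements.
    entries = [
        {"L": 16, "Gflops_dwf4": 50.0},
        {"L": 24, "Gflops_dwf4": 100.0},
        {"L": 32, "Gflops_dwf4": 200.0},
    ]
    print(comparison_point(entries))
```

With the placeholder inputs the function returns 150.0, the mean of the L=24 and L=32 values; entries for other lattice sizes are ignored.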