Grid benchmark documentation update

2023-01-30 18:28:38 +00:00
parent 5f9abbb8d0
commit 5198bbe1cd
1 changed files with 83 additions and 2 deletions
@@ -6,6 +6,7 @@ The benchmarks can be summarised as follows
 - `Benchmark_Grid`: This benchmark measures floating-point performance for various fermion
 matrices, as well as bandwidth measurement for different operations. Measurements are
 performed for a fixed range of problem sizes.
 - `Benchmark_IO`: Parallel I/O benchmark.
 ## TL;DR
 Build and install Grid, all dependencies, and the benchmark with
@@ -28,7 +29,7 @@ You should first deploy the environment for the specific system you are using, f
 systems/tursa/bootstrap-env.sh ./env
 ```
 will deploy the relevant environment for the [Tursa](https://www.epcc.ed.ac.uk/hpc-services/dirac-tursa-gpu) supercomputer in `./env`. This step might compile from source a large set
-of packages, and might take some time to complete.
+of packages, and take some time to complete.
 After that, the environment directory (`./env` in the example above) will contain an `env.sh` file that needs to be sourced to activate the environment
 ```bash
@@ -66,4 +67,84 @@ where `<env_dir>` is the environment directory and `<config>` is the build confi
 ## Running the benchmarks
 After building the benchmarks as above you can find the binaries in 
-`<env_dir>/prefix/gridbench_<config>`.
+`<env_dir>/prefix/gridbench_<config>`. Depending on the system selected, the environment
 directory might also contain batch script examples. More information about the benchmarks
 is provided below.
 ### `Benchmark_Grid`
 This benchmark performs flop/s measurement for typical lattice QCD sparse matrices, as
 well as memory and inter-process bandwidth measurement using Grid routines. The benchmark
 command accepts any Grid flag (see complete list with `--help`), as well as a 
 `--json-out <file>` flag to save the measurement results in JSON to `<file>`. The 
 benchmarks are performed on a fixed set of problem sizes, and the Grid flag `--grid` will
 be ignored.
 The resulting metrics are as follows; all data size units are in base 2 
 (i.e. 1 kB = 1024 B).
 *Memory bandwidth*
 One sub-benchmark measures the memory bandwidth using a lattice version of the `axpy` BLAS
 routine, in a similar fashion to the STREAM benchmark. The JSON entries under `"axpy"` 
 have the form
 ```json
 {
  "GBps": 215.80653375861607,   // bandwidth in GB/s/node
  "GFlops": 19.310041765757834, // FP performance (double precision)
  "L": 8,                       // local lattice size
  "size_MB": 3.0                // memory size in MB/node
 }
 ```
 A second benchmark performs site-wise SU(4) matrix multiplication, and has a higher
 arithmetic intensity than the `axpy` one (although it is still memory-bound). 
 The JSON entries under `"SU4"` have the form
 ```json
 {
  "GBps": 394.76639187026865,  // bandwidth in GB/s/node
  "GFlops": 529.8464820758512, // FP performance (single precision)
  "L": 8,                      // local lattice size
  "size_MB": 6.0               // memory size in MB/node
 }
 ```
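As a rough illustration of the arithmetic-intensity remark above, the sketch below divides the reported `GFlops` by the reported `GBps` for the two example entries. The helper name `reported_intensity` is hypothetical and not part of the benchmark. The ratio is taken in the units as reported (GB is base 2 here), so it is only an approximate flop-per-byte figure.

```python
# Values copied from the two example JSON entries shown above.
axpy = {"GBps": 215.80653375861607, "GFlops": 19.310041765757834, "L": 8, "size_MB": 3.0}
su4 = {"GBps": 394.76639187026865, "GFlops": 529.8464820758512, "L": 8, "size_MB": 6.0}


def reported_intensity(entry):
    """Ratio of reported GFlops to reported GBps (hypothetical helper).

    This is Gflop per GB in the units used by the JSON output, which
    approximates the arithmetic intensity in flop per byte.
    """
    return entry["GFlops"] / entry["GBps"]


axpy_ratio = reported_intensity(axpy)
su4_ratio = reported_intensity(su4)
print(f"axpy: {axpy_ratio:.3f}  SU4: {su4_ratio:.3f}")
```

With these two entries the SU(4) ratio comes out more than ten times larger than the `axpy` one.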
 *Inter-process bandwidth*
 This sub-benchmark measures the achieved bidirectional bandwidth in threaded halo exchange
 using routines in Grid. The exchange is performed in each direction of the MPI Cartesian
 grid that is parallelised across at least 2 processes. The resulting bandwidth is related
 to node-local transfers (inter-CPU, NVLink, ...) or network transfers depending on the MPI
 decomposition. The JSON entries under `"comms"` have the form
 ```json
 {
  "L": 40,                       // local lattice size
  "bytes": 73728000,             // payload size in B/rank
  "dir": 2,                      // direction of the exchange, 8 possible directions
                                 // (0: +x, 1: +y, ..., 4: -x, 5: -y, ...)
  "rate_GBps": {
    "error": 6.474271894240327,  // standard deviation across measurements (GB/s/node)
    "max": 183.10546875,         // maximum measured bandwidth (GB/s/node)
    "mean": 175.21747026766676   // average measured bandwidth (GB/s/node)
  },
  "time_usec": 3135.055          // average transfer time (microseconds)
 }
 ```
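Since `"bytes"` is given in bytes while the other sizes are in base-2 megabytes, a small conversion can help when comparing entries. The sketch below applies the base-2 convention stated earlier (1 kB = 1024 B) to the payload of the example entry. The helper name `bytes_to_mb` is hypothetical.

```python
def bytes_to_mb(n_bytes):
    """Convert a size in bytes to base-2 megabytes (1 MB = 1024 * 1024 B)."""
    return n_bytes / 1024**2


# Payload size copied from the example "comms" entry above.
payload_mb = bytes_to_mb(73728000)
print(f"payload: {payload_mb} MB/rank")
```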
 *Floating-point performance*
 This sub-benchmark measures the achieved floating-point performance using the 
 Wilson fermion, domain-wall fermion, and staggered fermion sparse matrices from Grid.
 The best performance figures are recorded in the `"flops"` and `"results"` section of 
 the JSON output, e.g.
 ```json
 {
  "Gflops_dwf4": 366.5251173474483,       // domain-wall in Gflop/s/node (single precision)
  "Gflops_staggered": 7.5982861018529455, // staggered in Gflop/s/node (single precision)
  "Gflops_wilson": 15.221839719288932,    // Wilson in Gflop/s/node (single precision)
  "L": 8                                  // local lattice size
 }
 ```
 Here "best" means the highest performance across the different implementations of the routines. Please
 see the log of the benchmark for an additional breakdown. Finally, the JSON output
 contains a "comparison point", which is the average of the L=24 and L=32 best
 domain-wall performances.
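
The comparison point described above can be recomputed by hand as a sanity check. The sketch below assumes the results are available as a list of entries shaped like the example above. That layout, the helper name `comparison_point`, and the L=24 and L=32 values are illustrative placeholders, not measurements.

```python
def comparison_point(results):
    """Average of the L=24 and L=32 best domain-wall figures (hypothetical helper)."""
    by_size = {entry["L"]: entry["Gflops_dwf4"] for entry in results}
    return (by_size[24] + by_size[32]) / 2


# Placeholder entries: the L=8 value is from the example above, the
# L=24 and L=32 values are made-up round numbers for illustration only.
results = [
    {"L": 8, "Gflops_dwf4": 366.5251173474483},
    {"L": 24, "Gflops_dwf4": 1000.0},
    {"L": 32, "Gflops_dwf4": 2000.0},
]

point = comparison_point(results)
print(f"comparison point: {point} Gflop/s/node")
```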