From 5198bbe1cd0a7e35455ffafd59b1d0bfbb45d73c Mon Sep 17 00:00:00 2001
From: Antonin Portelli
Date: Mon, 30 Jan 2023 18:28:38 +0000
Subject: [PATCH] Grid benchmark documentation update

---
 Grid/Readme.md | 85 ++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 83 insertions(+), 2 deletions(-)

diff --git a/Grid/Readme.md b/Grid/Readme.md
index 10d6bbd..3c2689d 100644
--- a/Grid/Readme.md
+++ b/Grid/Readme.md
@@ -6,6 +6,7 @@ The benchmarks can be summarised as follows
 - `Benchmark_Grid`: This benchmark measure floating point performances for various fermion
   matrices, as well as bandwidth measurement for different operations. Measurements are
   performed for a fixed range of problem sizes.
+- `Benchmark_IO`: Parallel I/O benchmark.
 
 ## TL;DR
 Build and install Grid, all dependencies, and the benchmark with
@@ -28,7 +29,7 @@ You should first deploy the environment for the specific system you are using, f
 systems/tursa/bootstrap-env.sh ./env
 ```
 will deploy the relevant environment for the [Tursa](https://www.epcc.ed.ac.uk/hpc-services/dirac-tursa-gpu)
 supercomputer in `./env`. This step might compile from source a large set
-of packages, and might take some time to complete.
+of packages, and take some time to complete.
 After that, the environment directory (`./env` in the example above) will contain a `env.sh` file that need
 to be sourced to activate the environment
 ```bash
@@ -66,4 +67,84 @@ where `` is the environment directory and `` is the build confi
 
 ## Running the benchmarks
 After building the benchmarks as above you can find the binaries in
-`<environment directory>/prefix/gridbench_<config>`.
\ No newline at end of file
+`<environment directory>/prefix/gridbench_<config>`. Depending on the system selected,
+the environment directory might also contain batch script examples. More information
+about the benchmarks is provided below.
+
+### `Benchmark_Grid`
+This benchmark performs flop/s measurements for typical lattice QCD sparse matrices, as
+well as memory and inter-process bandwidth measurements, using Grid routines. The
+benchmark command accepts any Grid flag (see the complete list with `--help`), as well
+as a `--json-out <file>` flag to save the measurement results in JSON format to `<file>`.
+The benchmarks are performed on a fixed set of problem sizes, and the Grid flag `--grid`
+will be ignored.
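+
+As an illustration, a single-node run could be launched as sketched below. This is a
+sketch only, not a tested recipe: the `mpirun` launcher, the rank count, the process
+decomposition, and the flag values are system-dependent (the batch script examples
+mentioned above show tested settings), and `Benchmark_Grid` is assumed to have been
+copied to the current directory from the install location given above. The `--mpi`,
+`--threads`, `--shm`, and `--accelerator-threads` options are standard Grid flags;
+`--json-out` is described in this section.
+```bash
+# Illustrative launch: 4 MPI ranks on one node, decomposing the lattice along t.
+#   --mpi 1.1.1.4           MPI Cartesian decomposition (x.y.z.t)
+#   --threads 8             OpenMP threads per rank
+#   --shm 2048              shared-memory segment size in MB
+#   --accelerator-threads 8 accelerator threads per block (relevant for GPU builds)
+mpirun -np 4 ./Benchmark_Grid --mpi 1.1.1.4 --threads 8 --shm 2048 \
+  --accelerator-threads 8 --json-out benchmark_grid.json
+```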
+
+The resulting metrics are described below; all data size units are in base 2
+(i.e. 1 kB = 1024 B).
+
+*Memory bandwidth*
+
+One sub-benchmark measures the memory bandwidth using a lattice version of the `axpy`
+BLAS routine, in a fashion similar to the STREAM benchmark. The JSON entries under
+`"axpy"` have the form
+```json
+{
+  "GBps": 215.80653375861607,   // bandwidth in GB/s/node
+  "GFlops": 19.310041765757834, // FP performance (double precision)
+  "L": 8,                       // local lattice size
+  "size_MB": 3.0                // memory size in MB/node
+}
+```
+
+A second sub-benchmark performs site-wise SU(4) matrix multiplication, and has a higher
+arithmetic intensity than the `axpy` one (although it is still memory-bound).
+The JSON entries under `"SU4"` have the form
+```json
+{
+  "GBps": 394.76639187026865,  // bandwidth in GB/s/node
+  "GFlops": 529.8464820758512, // FP performance (single precision)
+  "L": 8,                      // local lattice size
+  "size_MB": 6.0               // memory size in MB/node
+}
+```
+
+*Inter-process bandwidth*
+
+This sub-benchmark measures the achieved bidirectional bandwidth in a threaded halo
+exchange using Grid routines. The exchange is performed in each direction of the MPI
+Cartesian grid that is parallelised across at least 2 processes. Depending on the MPI
+decomposition, the resulting bandwidth corresponds to node-local transfers (inter-CPU,
+NVLink, ...) or to network transfers. The JSON entries under `"comms"` have the form
+```json
+{
+  "L": 40,           // local lattice size
+  "bytes": 73728000, // payload size in B/rank
+  "dir": 2,          // direction of the exchange, 8 possible directions
+                     // (0: +x, 1: +y, ..., 5: -x, 6: -y, ...)
+  "rate_GBps": {
+    "error": 6.474271894240327, // standard deviation across measurements (GB/s/node)
+    "max": 183.10546875,        // maximum measured bandwidth (GB/s/node)
+    "mean": 175.21747026766676  // average measured bandwidth (GB/s/node)
+  },
+  "time_usec": 3135.055 // average transfer time (microseconds)
+}
+```
+
+*Floating-point performance*
+
+This sub-benchmark measures the achieved floating-point performance using the Wilson
+fermion, domain-wall fermion, and staggered fermion sparse matrices from Grid. The best
+performances are recorded in the `"flops"` and `"results"` sections of the JSON output,
+e.g.
+```json
+{
+  "Gflops_dwf4": 366.5251173474483,       // domain-wall in Gflop/s/node (single precision)
+  "Gflops_staggered": 7.5982861018529455, // staggered in Gflop/s/node (single precision)
+  "Gflops_wilson": 15.221839719288932,    // Wilson in Gflop/s/node (single precision)
+  "L": 8                                  // local lattice size
+}
+```
+Here "best" means the maximum across a number of different implementations of each
+routine; please see the benchmark log for a more detailed breakdown. Finally, the JSON
+output contains a "comparison point", which is the average of the best L=24 and L=32
+domain-wall performances.
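+
+As a post-processing sketch, the file written by `--json-out` can be summarised from the
+command line. The paths below assume the layout documented above (lists of records under
+the top-level keys `"axpy"`, `"SU4"`, and `"comms"`, and the best performances under
+`"flops"`/`"results"`) and that `jq` is available; adapt them to the actual structure of
+your output file.
+```bash
+# memory bandwidth (GB/s/node) against local lattice size
+jq -r '.axpy[] | [.L, .GBps] | @tsv' benchmark_grid.json
+# mean halo-exchange bandwidth (GB/s/node) per direction
+jq -r '.comms[] | [.L, .dir, .rate_GBps.mean] | @tsv' benchmark_grid.json
+# best sparse-matrix performance (Gflop/s/node) per local lattice size
+jq -r '.flops.results[] | [.L, .Gflops_wilson, .Gflops_dwf4, .Gflops_staggered] | @tsv' \
+  benchmark_grid.json
+```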