tursa-energy-efficiency/README.md
2022-09-08 14:28:19 +01:00

# Grid energy-efficiency benchmarks on A100 GPUs
[![License: CC BY-NC 4.0](https://img.shields.io/badge/License-CC_BY--NC_4.0-lightgrey.svg)](https://creativecommons.org/licenses/by-nc/4.0/)
Supplemental data for the report ["Optimisation of lattice simulations energy efficiency"](https://doi.org/10.5281/zenodo.7057319).
## Data
At the root of the repository you can find the data associated with the report.
- `jobs.db`: an SQLite database containing the following data for all benchmark jobs, split into two tables, one per problem size.
| Field | Content |
|---------------|--------------------------------|
| `job_id` | Slurm job id |
| `start` | Start UNIX epoch |
| `end` | End UNIX epoch |
| `nodes` | Number of nodes |
| `clock_limit` | GPU clock limit (MHz) |
| `slot` | Job slot (A or B), cf. report |
| `smi_db` | Associated NVIDIA SMI database |
| `job_dir` | Job output directory |
- `smi-dmon-*.db`: NVIDIA SMI power monitoring SQLite database. The monitoring data for each benchmark is stored in the `clock_limit_<c>` table, where `<c>` is the GPU clock limit. Each table has the following structure:
| Field | Content |
|-------------|----------------------------|
| `sample` | Sample index |
| `timestamp` | Sample time (UTC) |
| `gpu` | GPU index |
| `power` | Power draw (W) |
| `temp_gpu` | GPU temperature |
| `temp_mem` | GPU memory temperature |
| `activity` | GPU activity |
| `memory` | GPU memory activity |
| `clock_mem` | GPU memory frequency (MHz) |
| `clock_gpu` | GPU frequency (MHz) |
- `rack-power.db`: ATOS BullSequana XH2000 power monitoring SQLite database. The data is separated into two tables, `run_220820` and `run_220822`, which correspond to the C0 and loc32 problem sizes, respectively. Each table has the following structure:
| Field | Content |
|-------------|-----------------------|
| `sample` | Sample index |
| `timestamp` | Sample time (UTC) |
| `rack_1` | Rack 1 power draw (W) |
| `rack_2` | Rack 2 power draw (W) |
| `rack_3` | Rack 3 power draw (W) |
| `rack_4` | Rack 4 power draw (W) |
- `c0.dat` & `loc32.dat`: tables in text form with processed results. The columns are described in comments at the beginning of each file.
- `c0-eps.dat` & `loc32-eps.dat`: tables in text form with epsilon-constraint energy-optimal GPU frequencies (cf. report).
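The per-GPU samples stored in the `smi-dmon-*.db` files can be aggregated directly in SQL. The Python sketch below is not part of the repository (whose scripts are shell-based); it only illustrates one way to read the documented schema. It assumes that `<c>` in the table name is an integer and that `power` is stored as a numeric column. The function name `mean_power_per_gpu` is hypothetical.

```python
import sqlite3


def mean_power_per_gpu(conn, clock_limit):
    """Return {gpu index: mean power draw in W} for one clock-limit table.

    Assumes the table is named clock_limit_<c> with <c> an integer,
    and has the `gpu` and `power` columns described above.
    """
    # Table names cannot be bound as SQL parameters, so the name is built
    # here; int() rejects anything that is not a plain number.
    table = f"clock_limit_{int(clock_limit)}"
    rows = conn.execute(
        f'SELECT gpu, AVG(power) FROM "{table}" GROUP BY gpu ORDER BY gpu'
    ).fetchall()
    return {gpu: avg for gpu, avg in rows}
```

To use it, open one of the monitoring databases with `sqlite3.connect(...)` and pass the connection together with a clock limit for which a table exists.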
## Analysis scripts
The results described above can be reproduced with the scripts at the root of the repository. These are shell scripts using standard UNIX tools, with the addition of [GNU datamash](https://www.gnu.org/software/datamash/) and [SQLite](https://www.sqlite.org/). All scripts print a usage help message when executed without arguments. The complete set of results can be reproduced by executing `full-analysis.sh`, which simply contains the commands below.
```bash
echo '-- make job DBs...'
./make-job-db.sh jobs.db size_C0 2-racks/size-C0
./make-job-db.sh jobs.db size_loc32 2-racks/size-loc32
echo '-- make result tables...'
./make-result-table.sh jobs.db size_C0 2-racks/rack-power.db run_220820 > c0.dat
./make-result-table.sh jobs.db size_loc32 2-racks/rack-power.db run_220822 > loc32.dat
echo '-- make eps-constraint tables...'
./make-perf-epsilon-table.sh c0.dat > c0-eps-perf.dat
./make-perf-epsilon-table.sh loc32.dat > loc32-eps-perf.dat
./make-power-epsilon-table.sh c0.dat > c0-eps-power.dat
./make-power-epsilon-table.sh loc32.dat > loc32-eps-power.dat
```
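After rebuilding `jobs.db`, a quick sanity check is to list each job with its wall-clock duration. The Python sketch below is an illustration only and is not part of the repository. It assumes that the second argument passed to `make-job-db.sh` (`size_C0`, `size_loc32`) is the name of the table created in `jobs.db`, and that `start` and `end` are stored as numbers. The function name `job_durations` is hypothetical.

```python
import sqlite3


def job_durations(conn, table):
    """Return (job_id, nodes, clock_limit, duration in seconds) per job.

    `table` is assumed to be one of the job tables, e.g. "size_C0".
    """
    # Table names cannot be bound as SQL parameters, so only accept
    # plain identifiers before interpolating the name into the query.
    if not table.isidentifier():
        raise ValueError(f"invalid table name: {table!r}")
    # "end" is an SQL keyword, so the column names are quoted.
    query = (
        'SELECT job_id, nodes, clock_limit, "end" - "start" '
        f'FROM "{table}" ORDER BY job_id'
    )
    return conn.execute(query).fetchall()
```

For example, `job_durations(sqlite3.connect("jobs.db"), "size_C0")` would return one tuple per job in that table, under the assumptions stated above.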
## Run data
The `2-racks` subdirectory is a complete copy of the run directory from the Tursa supercomputer. It is provided as-is and undocumented, although a number of scripts are commented. Many elements are specific to this cluster and require root access to several parts of the system. It is shared here for transparency, and as an example of how power monitoring can be automated for such studies.