From 2d8aff36fe46cd692634ceacbacf43308d204fb4 Mon Sep 17 00:00:00 2001
From: Peter Boyle
Date: Fri, 14 Jul 2017 22:52:16 +0100
Subject: [PATCH] Update README.md

---
 README.md | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/README.md b/README.md
index e0a9bb14..ea20d0ec 100644
--- a/README.md
+++ b/README.md
@@ -324,6 +324,17 @@ one rank per socket. If using the Intel MPI library, threads should be pinned to
 ```
 This is the default.
 
+**Expected Skylake Gold 6148 dual socket (single prec, single node 20+20 cores) performance using NUMA MPI mapping:**
+
+mpirun -n 2 benchmarks/Benchmark_dwf --grid 16.16.16.16 --mpi 2.1.1.1 --cacheblocking 2.2.2.2 --dslash-asm --shm 1024 --threads 18
+Average mflops/s per call per node (full): **498739** 4d vec
+Average mflops/s per call per node (full): **457786** 4d vec, fp16 comms
+Average mflops/s per call per node (full): **572645** 5d vec
+Average mflops/s per call per node (full): **721206** 5d vec, red black
+Average mflops/s per call per node (full): **634542** 4d vec, red black
+
+
+
 ### Build setup for AMD EPYC / RYZEN
 
 The AMD EPYC is a multichip module comprising 32 cores spread over four distinct chips each with 8 cores.
@@ -378,6 +389,17 @@ echo GOMP_CUP_AFFINITY $GOMP_CPU_AFFINITY $@
 ```
 
+Performance:
+
+**Expected AMD EPYC 7601 dual socket (single prec, single node 32+32 cores) performance using NUMA MPI mapping:**
+
+mpirun -np 8 ./omp_bind.sh ./Benchmark_dwf --threads 8 --mpi 2.2.2.1 --dslash-unroll --grid 16.16.16.16 --cacheblocking 4.4.4.4
+Average mflops/s per call per node (full): **420235** 4d vec
+Average mflops/s per call per node (full): **437617** 4d vec, fp16 comms
+Average mflops/s per call per node (full): **522988** 5d vec
+Average mflops/s per call per node (full): **588984** 5d vec, red black
+Average mflops/s per call per node (full): **508423** 4d vec, red black
+
 ### Build setup for BlueGene/Q
 
 To be written...
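
Note on the NUMA MPI mapping used in the EPYC run: the `./omp_bind.sh` wrapper pins each MPI rank's OpenMP threads to one group of cores so every rank stays on its local memory domain. A minimal sketch of that idea (hypothetical, not the repository's actual `omp_bind.sh`; it assumes OpenMPI, which exports `OMPI_COMM_WORLD_LOCAL_RANK`, and GNU OpenMP's `GOMP_CPU_AFFINITY`):

```shell
#!/bin/sh
# Illustrative per-rank binding wrapper (sketch only, not Grid's omp_bind.sh).
# Assumes OpenMPI exports OMPI_COMM_WORLD_LOCAL_RANK for the node-local rank;
# cores_per_rank=8 matches the 8-thread EPYC benchmark invocation above.
rank=${OMPI_COMM_WORLD_LOCAL_RANK:-0}
cores_per_rank=8
lo=$(( rank * cores_per_rank ))        # first core of this rank's block
hi=$(( lo + cores_per_rank - 1 ))      # last core of this rank's block
export GOMP_CPU_AFFINITY="$lo-$hi"     # GNU OpenMP pins threads to this range
echo "rank $rank -> GOMP_CPU_AFFINITY=$GOMP_CPU_AFFINITY"
# Launch the wrapped command (e.g. ./Benchmark_dwf ...) under this binding.
if [ $# -gt 0 ]; then exec "$@"; fi
```

Used as `mpirun -np 8 ./wrapper.sh ./Benchmark_dwf ...`, rank 3 would bind its threads to cores 24-31, keeping each rank within one NUMA domain of the multichip module.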