paboyle
|
e5fa3d887f
|
Compile on CUDA
|
2025-08-21 22:10:27 +01:00 |
|
Peter Boyle
|
fe0db53842
|
FFT offload to GPU and MUCH faster comms.
40x speed up on Frontier
|
2025-08-21 16:45:38 -04:00 |
|
paboyle
|
9e6a4a4737
|
Assertion updates to macros (mostly) with backtrace.
Wilson flow to include options for DBW2, Iwasaki, Symanzik.
View logging for data assurance
|
2025-08-07 15:48:38 +00:00 |
|
paboyle
|
41f344bbd3
|
Merge with Christoph GPT checksum debug
|
2025-07-15 03:06:09 +00:00 |
|
paboyle
|
bfae14d035
|
More flight logging
|
2025-06-27 06:07:34 +00:00 |
|
Peter Boyle
|
6ec5cee368
|
Preparing for compressed comms
|
2025-06-17 16:38:10 +02:00 |
|
paboyle
|
d418f78352
|
Making running on Aurora more debuggable
|
2025-05-23 20:58:16 +00:00 |
|
Peter Boyle
|
adc90d3a86
|
NVLINK GET/PUT on cuda aware mpi
|
2025-04-04 18:35:05 -04:00 |
|
paboyle
|
19f9378b98
|
Should work on Aurora now
|
2025-03-11 13:50:43 +00:00 |
|
paboyle
|
1d22841811
|
Working on aurora, GPT issue turned up is fixed
|
2025-03-06 03:20:18 +00:00 |
|
paboyle
|
1cc5f221f3
|
GET not put ordering is better as I know when I've got all MY data
|
2025-02-12 14:53:05 +00:00 |
|
paboyle
|
0baaddbe98
|
Pipeline mode commit on Aurora. 5+ TF/s on 16^3x32 per tile at 384 nodes.
More concurrency/fine grained scheduling is possible.
|
2025-02-04 19:27:26 +00:00 |
|
paboyle
|
8cf809e231
|
Best results on Aurora so far
|
2025-01-31 16:14:45 +00:00 |
|
paboyle
|
94019a922e
|
Significantly better performance on Aurora without using pipeline mode
|
2025-01-30 16:36:46 +00:00 |
|
paboyle
|
d6b2727f86
|
Pipeline mode getting better -- 2 nodes @ 10TF/s per node on Aurora
|
2025-01-29 09:22:21 +00:00 |
|
paboyle
|
febfe4e77f
|
Make my own reduction a configure flag
|
2024-10-15 14:32:35 +00:00 |
|
Peter Boyle
|
5c3ace7c3e
|
Merge branch 'develop' into feature/scidac-wp1
|
2024-04-30 05:26:06 -04:00 |
|
Peter Boyle
|
434c3e7f1d
|
We have a choice of GET or PUT across NVlink
|
2024-03-25 14:32:44 +00:00 |
|
Peter Boyle
|
b6ad1bafc7
|
Normal memory SendToRecvFrom asynchronous for use in general stencil code
|
2023-10-20 19:27:13 -04:00 |
|
Peter Boyle
|
2376156fbc
|
Merge branch 'develop' into feature/dirichlet
|
2023-03-27 21:33:50 -07:00 |
|
Peter Boyle
|
dd3bbb8fa2
|
Move the synchronise out to the stencil so one call instead of one call per packet
|
2023-03-27 17:27:45 -07:00 |
|
Peter Boyle
|
a11c12e2e7
|
Modifications for partial dirichlet BCs
|
2022-11-15 16:20:01 -05:00 |
|
Peter Boyle
|
1177b8f661
|
Merge branch 'develop' into feature/dirichlet
|
2022-08-31 19:05:57 -04:00 |
|
Peter Boyle
|
06d9ce1a02
|
Synch ranks on node here for GPU - GPU memcopy
|
2022-08-04 13:35:56 -04:00 |
|
Peter Boyle
|
8137cc7049
|
Always concurrent comms
|
2022-07-28 12:01:51 -04:00 |
|
Peter Boyle
|
2ab1af5754
|
Ensure no synchronize and not option dependent
|
2022-07-19 09:51:06 -07:00 |
|
Peter Boyle
|
f7217d12d2
|
World barrier for clock synch
|
2022-07-11 13:45:31 -04:00 |
|
Peter Boyle
|
7eb29cf529
|
MPI fix
|
2022-05-28 15:51:34 -07:00 |
|
Peter Boyle
|
3f31afa4fc
|
Clean up verbose
|
2022-05-24 18:18:51 -07:00 |
|
Peter Boyle
|
aab3bcb46f
|
Dirichlet first cut - wrong answers on dagger multiply.
Struggling to get a compute node so changing systems
|
2022-02-22 19:58:33 +00:00 |
|
Peter Boyle
|
135808dcfa
|
Less verbose
|
2021-12-07 16:24:24 -05:00 |
|
Peter Boyle
|
2bf3b4d576
|
Update to reduce memory footprint in benchmark test
|
2021-12-07 09:02:02 -08:00 |
|
Peter Boyle
|
16c2a99965
|
Overlap cudamemcpy - didn't set up stream right
|
2021-10-11 13:31:26 -07:00 |
|
Peter Boyle
|
c0d56a1c04
|
Perlmutter tune up
|
2021-09-22 06:02:34 -07:00 |
|
Peter Boyle
|
ca9816bfbb
|
Typo
|
2021-09-21 04:12:04 +02:00 |
|
Peter Boyle
|
109507888b
|
Option to force use of MPI over Nvlink
|
2021-09-21 00:53:25 +02:00 |
|
Peter Boyle
|
8195890640
|
Force MPI over NVLINK
|
2021-09-14 05:00:17 +01:00 |
|
Peter Boyle
|
cd99edcc5f
|
maxLocalNorm2()
|
2021-02-04 18:25:49 -05:00 |
|
Peter Boyle
|
d05ce01809
|
TOFU behaviour now optional THREAD_MULTIPLE or THREAD_SERIALIZED
|
2020-11-13 03:52:19 +01:00 |
|
Peter Boyle
|
a8309638d4
|
UVM check in MPI calls
|
2020-09-03 20:29:26 -04:00 |
|
Peter Boyle
|
0c3095e173
|
Comms buffers to device memory
|
2020-09-03 15:45:35 -04:00 |
|
Christoph Lehner
|
197612bc7a
|
fast cpu basisRotate and other small cleanups
|
2020-07-30 07:08:54 -04:00 |
|
nmeyer-ur
|
8726e94ea7
|
merge upstream develop
|
2020-07-07 20:26:47 +02:00 |
|
nmeyer-ur
|
1635c263ee
|
disable TOFU by default
|
2020-06-30 19:27:08 +02:00 |
|
nmeyer-ur
|
465856331a
|
switch back to serialized; wrong results on single too
|
2020-06-15 15:39:39 +02:00 |
|
nmeyer-ur
|
cc958aa9ed
|
switch back to standard MPI_init due to wrong results in Benchmark_wilson using comms-overlap
|
2020-06-15 14:21:38 +02:00 |
|
nmeyer-ur
|
4fedd8d29f
|
switch to MPI_THREAD_SERIALIZED instead of SINGLE
|
2020-05-27 14:08:34 +02:00 |
|
nmeyer-ur
|
9a86059761
|
symmetrize VLA and fixed size build messages
|
2020-05-20 20:05:42 +02:00 |
|
nmeyer-ur
|
b780b7b7a0
|
guard prevents multiple TOFU messages
|
2020-05-20 19:20:59 +02:00 |
|
nmeyer-ur
|
fc2e9850d3
|
temporarily enable TOFU by default when using A64FX or A64FXFIXEDSIZE
|
2020-05-11 13:25:02 +02:00 |
|