dbollweg
|
1514b4f137
|
slicesum_sycl passes test
|
2024-02-06 19:08:44 -05:00 |
|
|
91cf5ee312
|
Updated bench script
|
2024-02-06 23:45:10 +00:00 |
|
david clarke
|
0a6e2f42c5
|
small amount of cleanup
|
2024-02-06 16:32:07 -07:00 |
|
dbollweg
|
ab2de131bd
|
work towards sliceSum for sycl backend
|
2024-02-06 13:24:45 -05:00 |
|
|
5bfa88be85
|
Aurora MPI standalone benchmake and options that work well
|
2024-02-06 16:28:40 +00:00 |
|
Dennis Bollweg
|
5af8da76d7
|
Fix cuda compilation of Lattice_slicesum_gpu.h
|
2024-02-01 18:02:30 -05:00 |
|
Dennis Bollweg
|
b8b9dc952d
|
Async memcpy's and cleanup
|
2024-02-01 17:55:35 -05:00 |
|
Dennis Bollweg
|
79a6ed32d8
|
Use accelerator_for2d and DeviceSegmentedRecude to avoid kernel launch latencies
|
2024-02-01 16:41:03 -05:00 |
|
dbollweg
|
caa5f97723
|
Add sliceSum gpu using cub/hipcub
|
2024-01-31 16:50:06 -05:00 |
|
david clarke
|
4924b3209e
|
projectU3 yields a unitary matrix
|
2024-01-23 14:43:58 -07:00 |
|
Peter Boyle
|
eb702f581b
|
Running on 12 rhs on 18 nodes of frontier
|
2024-01-22 17:44:15 -05:00 |
|
Peter Boyle
|
3d13fd56c5
|
Precompute phases, save memory in hermitian
|
2024-01-22 17:43:35 -05:00 |
|
Peter Boyle
|
6f51b49ef8
|
Use stderr
|
2024-01-22 17:41:09 -05:00 |
|
Peter Boyle
|
addc638856
|
Fast localCopyRegion, blockProjectFast
|
2024-01-22 17:40:38 -05:00 |
|
david clarke
|
00f24f8765
|
already found some bugs in projection, still needs testing
|
2024-01-22 05:50:16 -07:00 |
|
david clarke
|
f5b3d582b0
|
first attempt at U3 projection
|
2024-01-22 02:49:40 -07:00 |
|
david clarke
|
981c93d67a
|
update Test_fatLinks to accept Naik
|
2024-01-21 21:09:19 -07:00 |
|
david clarke
|
c020b78e02
|
Merge branch 'develop' into hisq_fat_links
|
2024-01-21 20:21:08 -07:00 |
|
Peter Boyle
|
42ae36bc28
|
WOrking
|
2024-01-17 16:39:14 -05:00 |
|
Peter Boyle
|
c69f73ff9f
|
Working
|
2024-01-17 16:38:46 -05:00 |
|
Peter Boyle
|
ca5ae8a2e6
|
Revert to working.
|
2024-01-17 16:32:05 -05:00 |
|
Peter Boyle
|
d967eb53de
|
Working for first time
|
2024-01-17 16:31:12 -05:00 |
|
Peter Boyle
|
839f9f1bbe
|
Don't log memory by default
|
2024-01-17 16:25:50 -05:00 |
|
Peter Boyle
|
b754a152c6
|
Flag guard correctly
|
2024-01-17 16:25:28 -05:00 |
|
Peter Boyle
|
e07cb2b9de
|
Accelerator memory
|
2024-01-17 16:24:31 -05:00 |
|
Peter Boyle
|
a1f8bbb078
|
accelerator memory print
|
2024-01-17 16:24:09 -05:00 |
|
Peter Boyle
|
7909683f3b
|
MultiRHS
|
2024-01-17 16:21:07 -05:00 |
|
Peter Boyle
|
25f71913b7
|
MultiRHS coarse
|
2024-01-04 12:01:17 -05:00 |
|
Peter Boyle
|
34ddd2b7b1
|
MultiRHS coarse space
|
2024-01-04 12:00:53 -05:00 |
|
Peter Boyle
|
d5fd90b2f3
|
Add 48^3 rtest
|
2024-01-04 12:00:01 -05:00 |
|
Peter Boyle
|
b7c7000d0d
|
Don't need the numerical rounding tolerance in multigrid
|
2023-12-22 18:10:23 -05:00 |
|
Peter Boyle
|
551f6c4edd
|
Synchronise changes
|
2023-12-22 18:09:11 -05:00 |
|
Peter Boyle
|
defd814750
|
Speed up the coarsened matrix matrix evaluation.
It is block project limited.
Could be sped up with calls to Batched GEMM and a data layout change.
|
2023-12-22 18:07:03 -05:00 |
|
Peter Boyle
|
3d517bbd2a
|
Synchronise decouple from the launch
Speeds up multileg stencils
|
2023-12-22 18:06:13 -05:00 |
|
Peter Boyle
|
78ab955fec
|
Better padded cell exchange
|
2023-12-22 18:05:41 -05:00 |
|
Peter Boyle
|
dd13937bb6
|
Better opt face gather scatter
|
2023-12-22 18:03:38 -05:00 |
|
Peter Boyle
|
66a1b63aa9
|
Faster grid/blas layout change.
Halo exchange is now the only slow part.
Revisit
|
2023-12-21 20:50:18 -05:00 |
|
Peter Boyle
|
22c611bd1a
|
Delete temp file
|
2023-12-21 18:32:31 -05:00 |
|
Peter Boyle
|
c9bb1bf8ea
|
Passing new BLAs based
|
2023-12-21 18:31:17 -05:00 |
|
|
2a0d75bac2
|
Aurora files
|
2023-12-21 23:20:17 +00:00 |
|
Peter Boyle
|
9e489887cf
|
General coarse multiRHS move to BLAS implementation
|
2023-12-21 15:24:48 -05:00 |
|
Peter Boyle
|
9feb801bb9
|
Much simpler GPU implementation
|
2023-12-21 15:24:06 -05:00 |
|
Peter Boyle
|
c00b495933
|
Multigrid
|
2023-12-21 15:23:31 -05:00 |
|
Peter Boyle
|
d22eebe553
|
BLas options
|
2023-12-21 15:23:03 -05:00 |
|
Peter Boyle
|
8bcbd82680
|
BLAS based layout and implementation
|
2023-12-21 15:21:24 -05:00 |
|
Peter Boyle
|
dfa617c439
|
Batched SGEMM/DGEMM/ZGEMM/CGEMM
Hip, Cuda version and vanilla CPU
One MKL stub in comments, to be tested as different.
|
2023-12-21 14:01:18 -05:00 |
|
Peter Boyle
|
48d1f0df89
|
Optimised partially, working
|
2023-12-21 12:33:47 -05:00 |
|
Peter Boyle
|
b75cb7a12c
|
Blas batched partial implementation on Frontier only for now
|
2023-12-21 12:31:33 -05:00 |
|
Peter Boyle
|
332563e037
|
Debugged, reducing verbose
|
2023-12-21 12:30:57 -05:00 |
|
Peter Boyle
|
0cce97a4fe
|
verbosity only
|
2023-12-20 21:30:10 -05:00 |
|