Peter Boyle
|
94d721a20b
|
Comments on further topology discovery work
|
2018-09-11 04:20:04 +01:00 |
|
Peter Boyle
|
7bf82f5b37
|
Offload the face handling to GPU
|
2018-09-10 11:28:42 +01:00 |
|
Peter Boyle
|
f02c7ea534
|
Peer-to-peer setup on GPUs
|
2018-09-10 11:26:20 +01:00 |
|
Peter Boyle
|
bc503b60e6
|
Offloadable gather code
|
2018-09-10 11:21:25 +01:00 |
|
Peter Boyle
|
704ca162c1
|
Offloadable compression
|
2018-09-10 11:20:50 +01:00 |
|
Peter Boyle
|
b5329d8852
|
Protect against zero length loops giving a kernel call failure
|
2018-09-10 11:20:07 +01:00 |
|
Peter Boyle
|
f27b9347ff
|
Better unquiesce MPI coverage
|
2018-09-10 11:19:39 +01:00 |
|
Peter Boyle
|
b4967f0231
|
Verbose and error trapping cleaner
|
2018-09-09 14:28:02 +01:00 |
|
Peter Boyle
|
6d0f1aabb1
|
Fix the multi-node path
|
2018-09-09 14:27:37 +01:00 |
|
Peter Boyle
|
f4bfeb835d
|
Drop back to smaller Ls
|
2018-09-09 14:25:06 +01:00 |
|
Peter Boyle
|
394b7b6276
|
Verbose decrease
|
2018-09-09 14:24:46 +01:00 |
|
Peter Boyle
|
da17a015c7
|
Pack the stencil smaller for 128 bit access
|
2018-07-23 06:12:45 -04:00 |
|
Peter Boyle
|
1fd08c21ac
|
Make SIMD width a configure-time option for GPU
|
2018-07-23 06:10:55 -04:00 |
|
Peter Boyle
|
28db0631ff
|
Hack to force 128bit accesses
|
2018-07-23 06:10:27 -04:00 |
|
Peter Boyle
|
b35401b86b
|
Fix CUDA_ARCH. Need to simplify. See when new eigen release happens
|
2018-07-23 06:09:33 -04:00 |
|
Peter Boyle
|
a0714de8ec
|
Define vector length for GPU
|
2018-07-23 06:09:05 -04:00 |
|
Peter Boyle
|
21a1710b43
|
Verbose vector length
|
2018-07-23 06:08:39 -04:00 |
|
Peter Boyle
|
b2b5137d28
|
Finally starting to get decent performance on Volta
|
2018-07-13 12:06:18 -04:00 |
|
Peter Boyle
|
2cc07450f4
|
Fastest option for the dslash
|
2018-07-05 09:57:55 -04:00 |
|
Peter Boyle
|
c0e8bc9da9
|
Current version gets 250 - 320 GF/s on Volta on the target 12^4 volume.
|
2018-07-05 07:10:25 -04:00 |
|
Peter Boyle
|
b1265ae867
|
Prettify code
|
2018-07-05 07:08:06 -04:00 |
|
Peter Boyle
|
32bb85ea4c
|
Standard extractLane is fast
|
2018-07-05 07:07:30 -04:00 |
|
Peter Boyle
|
ca0607b6ef
|
Clearer kernel call meaning
|
2018-07-05 07:06:15 -04:00 |
|
Peter Boyle
|
19b527e83f
|
Better extract merge for GPU. Let the SIMD header files define the pointer type for
access. GPU redirects through builtin float2, double2 for complex
|
2018-07-05 07:05:13 -04:00 |
|
Peter Boyle
|
4730d4692a
|
Fast lane extract, saturates bandwidth on Volta for SU3 benchmarks
|
2018-07-05 07:03:33 -04:00 |
|
Peter Boyle
|
1bb456c0c5
|
Minor GPU vector width change
|
2018-07-05 07:02:04 -04:00 |
|
Peter Boyle
|
4b04ae3611
|
Printing improvement
|
2018-07-05 06:59:38 -04:00 |
|
Peter Boyle
|
2f776d51c6
|
GPU-specific benchmark saturates memory. Can enhance Grid to do this for expressions,
but it is a bit of (known) work.
|
2018-07-05 06:58:37 -04:00 |
|
paboyle
|
3a50afe7e7
|
GPU dslash updates
|
2018-06-27 22:32:21 +01:00 |
|
paboyle
|
f8e880b445
|
Loop for s and xyzt offload
|
2018-06-27 21:49:57 +01:00 |
|
paboyle
|
3e947527cb
|
Move looping over "s" and "site" into kernels for GPU optimisation
|
2018-06-27 21:29:43 +01:00 |
|
paboyle
|
31f65beac8
|
Move site and Ls looping into the kernels
|
2018-06-27 21:28:48 +01:00 |
|
paboyle
|
38e2a32ac9
|
Single SIMD lane operations for CUDA
|
2018-06-27 21:28:06 +01:00 |
|
paboyle
|
efa84ca50a
|
Keep CUDA 9.1 happy
|
2018-06-27 21:27:32 +01:00 |
|
paboyle
|
5e96d6d04c
|
Keep CUDA happy
|
2018-06-27 21:27:11 +01:00 |
|
paboyle
|
df30bdc599
|
CUDA happy
|
2018-06-27 21:26:49 +01:00 |
|
paboyle
|
7f45222924
|
Diagnostics on memory alloc fail
|
2018-06-27 21:26:20 +01:00 |
|
paboyle
|
dd891f5e3b
|
Use NVCC to suppress device Eigen
|
2018-06-27 21:25:17 +01:00 |
|
paboyle
|
6c97a6a071
|
Coalescing version of the kernel
|
2018-06-13 20:52:29 +01:00 |
|
paboyle
|
73bb2d5128
|
Ugly hack to speed up compile on GPU; we don't use the hand kernels on GPU anyway, so why compile them
|
2018-06-13 20:35:28 +01:00 |
|
paboyle
|
b710fec6ea
|
GPU code: first version of specialised kernel
|
2018-06-13 20:34:39 +01:00 |
|
paboyle
|
b2a8cd60f5
|
Doubled gauge field is useful
|
2018-06-13 20:27:47 +01:00 |
|
paboyle
|
867ee364ab
|
Explicit instantiation hooks
|
2018-06-13 20:27:12 +01:00 |
|
paboyle
|
25becc9324
|
GPU tweaks for benchmarking; really necessary?
|
2018-06-13 20:26:07 +01:00 |
|
paboyle
|
94d1ae4c82
|
Some prep work for GPU shared memory. Need to be careful, as we will try GPU direct
RDMA and inter-GPU memory sharing on Summit later
|
2018-06-13 20:24:06 +01:00 |
|
paboyle
|
2075b177ef
|
More careful treatment of CUDA_ARCH
|
2018-06-13 20:22:34 +01:00 |
|
paboyle
|
847c761ccc
|
Move software IEEE fp16 into a central location
|
2018-06-13 20:22:01 +01:00 |
|
paboyle
|
8287ed8383
|
New GPU vector targets
|
2018-06-13 20:21:35 +01:00 |
|
paboyle
|
e6be7416f4
|
Use managed memory
|
2018-06-13 20:14:00 +01:00 |
|
paboyle
|
26863b6d95
|
User Managed memory
|
2018-06-13 20:13:42 +01:00 |
|