Peter Boyle
adbdc4e65b
Half comms not working on GPU yet, so disable.
2018-09-11 05:15:22 +01:00
Peter Boyle
e4deea4b94
Weird bug appears with Vector<Vector<>>.
"fix" with std::vector<Vector<>>
Lies in the face table code. But think there is some latent problem.
Possibly in my allocator since it is caching, but could simplify or eliminate the caching
option and retest. One to look at later.
2018-09-11 04:36:57 +01:00
Peter Boyle
94d721a20b
Comments on further topology discovery work
2018-09-11 04:20:04 +01:00
Peter Boyle
7bf82f5b37
Offload the face handling to GPU
2018-09-10 11:28:42 +01:00
Peter Boyle
f02c7ea534
Peer-to-peer setup on GPUs
2018-09-10 11:26:20 +01:00
Peter Boyle
bc503b60e6
Offloadable gather code
2018-09-10 11:21:25 +01:00
Peter Boyle
704ca162c1
Offloadable compression
2018-09-10 11:20:50 +01:00
Peter Boyle
b5329d8852
Protect against zero length loops giving a kernel call failure
2018-09-10 11:20:07 +01:00
Peter Boyle
f27b9347ff
Better unquiesce MPI coverage
2018-09-10 11:19:39 +01:00
Peter Boyle
b4967f0231
Verbose and error trapping cleaner
2018-09-09 14:28:02 +01:00
Peter Boyle
6d0f1aabb1
Fix the multi-node path
2018-09-09 14:27:37 +01:00
Peter Boyle
f4bfeb835d
Drop back to smaller Ls
2018-09-09 14:25:06 +01:00
Peter Boyle
394b7b6276
Verbose decrease
2018-09-09 14:24:46 +01:00
Peter Boyle
da17a015c7
Pack the stencil smaller for 128 bit access
2018-07-23 06:12:45 -04:00
Peter Boyle
1fd08c21ac
Make SIMD width a configure-time option for GPU
2018-07-23 06:10:55 -04:00
Peter Boyle
28db0631ff
Hack to force 128bit accesses
2018-07-23 06:10:27 -04:00
Peter Boyle
b35401b86b
Fix CUDA_ARCH. Need to simplify. Revisit when the new Eigen release happens
2018-07-23 06:09:33 -04:00
Peter Boyle
a0714de8ec
Define vector length for GPU
2018-07-23 06:09:05 -04:00
Peter Boyle
21a1710b43
Verbose vector length
2018-07-23 06:08:39 -04:00
Peter Boyle
b2b5137d28
Finally starting to get decent performance on Volta
2018-07-13 12:06:18 -04:00
Peter Boyle
2cc07450f4
Fastest option for the dslash
2018-07-05 09:57:55 -04:00
Peter Boyle
c0e8bc9da9
Current version gets 250 - 320 GF/s on Volta on the target 12^4 volume.
2018-07-05 07:10:25 -04:00
Peter Boyle
b1265ae867
Prettify code
2018-07-05 07:08:06 -04:00
Peter Boyle
32bb85ea4c
Standard extractLane is fast
2018-07-05 07:07:30 -04:00
Peter Boyle
ca0607b6ef
Clearer kernel call meaning
2018-07-05 07:06:15 -04:00
Peter Boyle
19b527e83f
Better extract merge for GPU. Let the SIMD header files define the pointer type for access.
GPU redirects through builtin float2, double2 for complex
2018-07-05 07:05:13 -04:00
Peter Boyle
4730d4692a
Fast lane extract, saturates bandwidth on Volta for SU3 benchmarks
2018-07-05 07:03:33 -04:00
Peter Boyle
1bb456c0c5
Minor GPU vector width change
2018-07-05 07:02:04 -04:00
Peter Boyle
4b04ae3611
Printing improvement
2018-07-05 06:59:38 -04:00
Peter Boyle
2f776d51c6
GPU-specific benchmark saturates memory. Can enhance Grid to do this for expressions,
but that is a bit of (known) work.
2018-07-05 06:58:37 -04:00
paboyle
3a50afe7e7
GPU dslash updates
2018-06-27 22:32:21 +01:00
paboyle
f8e880b445
Loop for s and xyzt offload
2018-06-27 21:49:57 +01:00
paboyle
3e947527cb
Move looping over "s" and "site" into kernels for GPU optimisation
2018-06-27 21:29:43 +01:00
paboyle
31f65beac8
Move site and Ls looping into the kernels
2018-06-27 21:28:48 +01:00
paboyle
38e2a32ac9
Single SIMD lane operations for CUDA
2018-06-27 21:28:06 +01:00
paboyle
efa84ca50a
Keep CUDA 9.1 happy
2018-06-27 21:27:32 +01:00
paboyle
5e96d6d04c
Keep CUDA happy
2018-06-27 21:27:11 +01:00
paboyle
df30bdc599
CUDA happy
2018-06-27 21:26:49 +01:00
paboyle
7f45222924
Diagnostics on memory alloc fail
2018-06-27 21:26:20 +01:00
paboyle
dd891f5e3b
Use NVCC to suppress device Eigen
2018-06-27 21:25:17 +01:00
paboyle
6c97a6a071
Coalescing version of the kernel
2018-06-13 20:52:29 +01:00
paboyle
73bb2d5128
Ugly hack to speed up compile on GPU; we don't use the hand kernels on GPU anyway so why compile
2018-06-13 20:35:28 +01:00
paboyle
b710fec6ea
GPU code: first version of specialised kernel
2018-06-13 20:34:39 +01:00
paboyle
b2a8cd60f5
Doubled gauge field is useful
2018-06-13 20:27:47 +01:00
paboyle
867ee364ab
Explicit instantiation hooks
2018-06-13 20:27:12 +01:00
paboyle
25becc9324
GPU tweaks for benchmarking; really necessary?
2018-06-13 20:26:07 +01:00
paboyle
94d1ae4c82
Some prep work for GPU shared memory. Need to be careful, as will try GPU direct
RDMA and inter-GPU memory sharing on Summit later
2018-06-13 20:24:06 +01:00
paboyle
2075b177ef
CUDA_ARCH more careful treatment
2018-06-13 20:22:34 +01:00
paboyle
847c761ccc
Move software IEEE fp16 into central location
2018-06-13 20:22:01 +01:00
paboyle
8287ed8383
New GPU vector targets
2018-06-13 20:21:35 +01:00