mirror of https://github.com/paboyle/Grid.git synced 2024-11-09 23:45:36 +00:00
Commit Graph

3846 Commits

Author SHA1 Message Date
Peter Boyle
adbdc4e65b Half comms not working on GPU yet, so disable. 2018-09-11 05:15:22 +01:00
Peter Boyle
e4deea4b94 Weird bug appears with Vector<Vector<>>;
"fix" is to use std::vector<Vector<>> instead.

The bug lies in the face table code, but there seems to be some latent problem,
possibly in my allocator since it is caching. Could simplify or eliminate the caching
option and retest. One to look at later; a sketch of the workaround follows this entry.
2018-09-11 04:36:57 +01:00
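The nested-container workaround in the entry above lends itself to a small illustration. This is not Grid's face-table code: CachedAllocator and the Vector alias below are stand-ins assumed for the sketch, showing only the shape of the change (outer container on the default allocator, inner buffers on the caching one).

```
#include <cstdlib>
#include <vector>

// Stand-in for a caching allocator; Grid's real allocator is more involved.
template<class T>
struct CachedAllocator {
  using value_type = T;
  CachedAllocator() = default;
  template<class U> CachedAllocator(const CachedAllocator<U>&) {}
  T*   allocate(std::size_t n)       { return static_cast<T*>(std::malloc(n * sizeof(T))); }
  void deallocate(T* p, std::size_t) { std::free(p); }
};
template<class T, class U>
bool operator==(const CachedAllocator<T>&, const CachedAllocator<U>&) { return true;  }
template<class T, class U>
bool operator!=(const CachedAllocator<T>&, const CachedAllocator<U>&) { return false; }

// Stand-in for Grid's Vector<T>, which sits on the caching allocator.
template<class T> using Vector = std::vector<T, CachedAllocator<T>>;

int main() {
  // Problematic form reported in the commit:
  //   Vector<Vector<int>> faceTable;        // nested caching-allocator buffers
  // Workaround: keep the outer container on the default std::allocator so only
  // the innermost buffers go through the caching allocator.
  std::vector<Vector<int>> faceTable(4, Vector<int>(16, 0));
  return 0;
}
```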
Peter Boyle
94d721a20b Comments on further topology discovery work 2018-09-11 04:20:04 +01:00
Peter Boyle
7bf82f5b37 Offload the face handling to GPU 2018-09-10 11:28:42 +01:00
Peter Boyle
f02c7ea534 Peer-to-peer setup on GPUs 2018-09-10 11:26:20 +01:00
Peter Boyle
bc503b60e6 Offloadable gather code 2018-09-10 11:21:25 +01:00
Peter Boyle
704ca162c1 Offloadable compression 2018-09-10 11:20:50 +01:00
Peter Boyle
b5329d8852 Protect against zero length loops giving a kernel call failure 2018-09-10 11:20:07 +01:00
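A minimal sketch of the guard this commit describes, assuming the failure mode is a kernel launch with a zero grid dimension; loop_body and launch_loop are hypothetical names, not Grid's accelerator-loop machinery.

```
#include <cuda_runtime.h>

// Trivial stand-in for an offloaded loop body.
__global__ void loop_body(int n, double *out) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = 0.0;
}

// Guarded launch: a grid dimension of zero makes the kernel call fail
// (invalid configuration), so a zero-length loop must skip the launch.
inline void launch_loop(int n, double *out) {
  if (n <= 0) return;                               // nothing to do: no kernel call
  const int threads = 128;
  const int blocks  = (n + threads - 1) / threads;  // >= 1 once n > 0
  loop_body<<<blocks, threads>>>(n, out);
}
```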
Peter Boyle
f27b9347ff Better unquiesce MPI coverage 2018-09-10 11:19:39 +01:00
Peter Boyle
b4967f0231 Cleaner verbosity and error trapping 2018-09-09 14:28:02 +01:00
Peter Boyle
6d0f1aabb1 Fix the multi-node path 2018-09-09 14:27:37 +01:00
Peter Boyle
f4bfeb835d Drop back to smaller Ls 2018-09-09 14:25:06 +01:00
Peter Boyle
394b7b6276 Verbose decrease 2018-09-09 14:24:46 +01:00
Peter Boyle
da17a015c7 Pack the stencil smaller for 128-bit access 2018-07-23 06:12:45 -04:00
Peter Boyle
1fd08c21ac Make SIMD width a configure-time option for GPU 2018-07-23 06:10:55 -04:00
Peter Boyle
28db0631ff Hack to force 128-bit accesses 2018-07-23 06:10:27 -04:00
Peter Boyle
b35401b86b Fix CUDA_ARCH. Need to simplify; see when the new Eigen release happens 2018-07-23 06:09:33 -04:00
Peter Boyle
a0714de8ec Define vector length for GPU 2018-07-23 06:09:05 -04:00
Peter Boyle
21a1710b43 Verbose vector length 2018-07-23 06:08:39 -04:00
Peter Boyle
b2b5137d28 Finally starting to get decent performance on Volta 2018-07-13 12:06:18 -04:00
Peter Boyle
2cc07450f4 Fastest option for the dslash 2018-07-05 09:57:55 -04:00
Peter Boyle
c0e8bc9da9 Current version gets 250 - 320 GF/s on Volta on the target 12^4 volume. 2018-07-05 07:10:25 -04:00
Peter Boyle
b1265ae867 Prettify code 2018-07-05 07:08:06 -04:00
Peter Boyle
32bb85ea4c Standard extractLane is fast 2018-07-05 07:07:30 -04:00
Peter Boyle
ca0607b6ef Clearer kernel call meaning 2018-07-05 07:06:15 -04:00
Peter Boyle
19b527e83f Better extract/merge for GPU. Let the SIMD header files define the pointer type used for
access; the GPU path redirects through the built-in float2 and double2 types for complex (sketched below).
2018-07-05 07:05:13 -04:00
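A hedged sketch of the access pattern this commit points at: the SIMD type is addressed through the built-in float2/double2 so a lane can be extracted or merged with a single 128-bit access. vComplexD, the lane count, and the function signatures here are illustrative, not Grid's actual headers.

```
#include <cuda_runtime.h>

// Assumed stand-in for a Grid SIMD vector of complex<double> with four lanes;
// the real type and lane count come from the SIMD header files.
struct vComplexD { double2 v[4]; };

// Read or write one SIMD lane through the built-in double2 type, so each
// access is a single 16-byte load/store.
__device__ inline double2 extractLane(int lane, const vComplexD &in) {
  const double2 *p = reinterpret_cast<const double2*>(&in);
  return p[lane];
}
__device__ inline void insertLane(int lane, vComplexD &out, double2 val) {
  double2 *p = reinterpret_cast<double2*>(&out);
  p[lane] = val;
}
```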
Peter Boyle
4730d4692a Fast lane extract, saturates bandwidth on Volta for SU3 benchmarks 2018-07-05 07:03:33 -04:00
Peter Boyle
1bb456c0c5 Minor GPU vector width change 2018-07-05 07:02:04 -04:00
Peter Boyle
4b04ae3611 Printing improvement 2018-07-05 06:59:38 -04:00
Peter Boyle
2f776d51c6 GPU-specific benchmark saturates memory. Could enhance Grid to do this for expressions,
but that is a bit of (known) work.
2018-07-05 06:58:37 -04:00
paboyle
3a50afe7e7 GPU dslash updates 2018-06-27 22:32:21 +01:00
paboyle
f8e880b445 Loop for s and xyzt offload 2018-06-27 21:49:57 +01:00
paboyle
3e947527cb Move looping over "s" and "site" into kernels for GPU optimisation 2018-06-27 21:29:43 +01:00
paboyle
31f65beac8 Move site and Ls looping into the kernels 2018-06-27 21:28:48 +01:00
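The two commits above fold the Ls ("s") and 4d-site loops into the kernel itself, giving one GPU thread per (s, site) pair. A minimal sketch of that structure, with hypothetical names and a placeholder body in place of the real stencil:

```
#include <cuda_runtime.h>

// One thread per (s, site) pair: the fifth-dimension and site loops live inside
// the kernel rather than on the host. The body here is a placeholder copy.
__global__ void dslash_body(int Ls, int nsites, const double *in, double *out) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx >= Ls * nsites) return;
  int s    = idx % Ls;            // fifth-dimension index
  int site = idx / Ls;            // 4d site index
  out[site * Ls + s] = in[site * Ls + s];
}

inline void dslash_all_sites(int Ls, int nsites, const double *in, double *out) {
  int total = Ls * nsites;
  if (total == 0) return;         // see the zero-length guard above
  int threads = 128;
  int blocks  = (total + threads - 1) / threads;
  dslash_body<<<blocks, threads>>>(Ls, nsites, in, out);
}
```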
paboyle
38e2a32ac9 Single SIMD lane operations for CUDA 2018-06-27 21:28:06 +01:00
paboyle
efa84ca50a Keep CUDA 9.1 happy 2018-06-27 21:27:32 +01:00
paboyle
5e96d6d04c Keep CUDA happy 2018-06-27 21:27:11 +01:00
paboyle
df30bdc599 CUDA happy 2018-06-27 21:26:49 +01:00
paboyle
7f45222924 Diagnostics on memory alloc fail 2018-06-27 21:26:20 +01:00
paboyle
dd891f5e3b Use NVCC to suppress device Eigen 2018-06-27 21:25:17 +01:00
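A hedged sketch of the idea behind this commit: when nvcc compiles the file, keep Eigen host-only so its headers are not instantiated as device code. EIGEN_NO_CUDA is Eigen's own opt-out macro; whether Grid uses this exact guard is an assumption.

```
#if defined(__CUDACC__) || defined(__NVCC__)
// Compiling under nvcc: ask Eigen to stay host-only so no device code is emitted
// from its headers.
#ifndef EIGEN_NO_CUDA
#define EIGEN_NO_CUDA
#endif
#endif
#include <Eigen/Dense>
```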
paboyle
6c97a6a071 Coalescing version of the kernel 2018-06-13 20:52:29 +01:00
paboyle
73bb2d5128 Ugly hack to speed up compilation on GPU; we don't use the hand kernels on GPU anyway, so why compile them 2018-06-13 20:35:28 +01:00
paboyle
b710fec6ea GPU code: first version of the specialised kernel 2018-06-13 20:34:39 +01:00
paboyle
b2a8cd60f5 Doubled gauge field is useful 2018-06-13 20:27:47 +01:00
paboyle
867ee364ab Explicit instantiation hooks 2018-06-13 20:27:12 +01:00
paboyle
25becc9324 GPU tweaks for benchmarking; really necessary? 2018-06-13 20:26:07 +01:00
paboyle
94d1ae4c82 Some prep work for GPU shared memory. Need to be careful, as we will try GPU Direct
RDMA and inter-GPU memory sharing on Summit later
2018-06-13 20:24:06 +01:00
paboyle
2075b177ef More careful treatment of CUDA_ARCH 2018-06-13 20:22:34 +01:00
paboyle
847c761ccc Move sfw IEEE fp16 into central location 2018-06-13 20:22:01 +01:00
paboyle
8287ed8383 New GPU vector targets 2018-06-13 20:21:35 +01:00