adbdc4e65b
Half comms not working on GPU yet, so disable.
2018-09-11 05:15:22 +01:00
e4deea4b94
Weird bug appears with Vector<Vector<>>.
"fix" with std::vector<Vector<>>
Lies in the face table code. But think there is some latent problem.
Possibly in my allocator since it is caching, but could simplify or eliminate the caching
option and retest. One to look at later.
2018-09-11 04:36:57 +01:00
94d721a20b
Comments on further topology discovery work
2018-09-11 04:20:04 +01:00
7bf82f5b37
Offload the face handling to GPU
2018-09-10 11:28:42 +01:00
f02c7ea534
Peer-to-peer setup on GPUs
2018-09-10 11:26:20 +01:00
bc503b60e6
Offloadable gather code
2018-09-10 11:21:25 +01:00
704ca162c1
Offloadable compression
2018-09-10 11:20:50 +01:00
b5329d8852
Protect against zero length loops giving a kernel call failure
2018-09-10 11:20:07 +01:00
f27b9347ff
Better unquiesce MPI coverage
2018-09-10 11:19:39 +01:00
b4967f0231
Verbose and error trapping cleaner
2018-09-09 14:28:02 +01:00
6d0f1aabb1
Fix the multi-node path
2018-09-09 14:27:37 +01:00
f4bfeb835d
Drop back to smaller Ls
2018-09-09 14:25:06 +01:00
394b7b6276
Verbose decrease
2018-09-09 14:24:46 +01:00
da17a015c7
Pack the stencil smaller for 128 bit access
2018-07-23 06:12:45 -04:00
1fd08c21ac
Make SIMD width a configure-time option for GPU
2018-07-23 06:10:55 -04:00
28db0631ff
Hack to force 128bit accesses
2018-07-23 06:10:27 -04:00
b35401b86b
Fix CUDA_ARCH. Need to simplify. See when new eigen release happens
2018-07-23 06:09:33 -04:00
a0714de8ec
Define vector length for GPU
2018-07-23 06:09:05 -04:00
21a1710b43
Verbose vector length
2018-07-23 06:08:39 -04:00
b2b5137d28
Finally starting to get decent performance on Volta
2018-07-13 12:06:18 -04:00
2cc07450f4
Fastest option for the dslash
2018-07-05 09:57:55 -04:00
c0e8bc9da9
Current version gets 250 - 320 GF/s on Volta on the target 12^4 volume.
2018-07-05 07:10:25 -04:00
b1265ae867
Prettify code
2018-07-05 07:08:06 -04:00
32bb85ea4c
Standard extractLane is fast
2018-07-05 07:07:30 -04:00
ca0607b6ef
Clearer kernel call meaning
2018-07-05 07:06:15 -04:00
19b527e83f
Better extract merge for GPU. Let the SIMD header files define the pointer type for
access. GPU redirects through builtin float2, double2 for complex
2018-07-05 07:05:13 -04:00
4730d4692a
Fast lane extract, saturates bandwidth on Volta for SU3 benchmarks
2018-07-05 07:03:33 -04:00
1bb456c0c5
Minor GPU vector width change
2018-07-05 07:02:04 -04:00
4b04ae3611
Printing improvement
2018-07-05 06:59:38 -04:00
2f776d51c6
GPU-specific benchmark saturates memory. Can enhance Grid to do this for expressions,
but a bit of (known) work.
2018-07-05 06:58:37 -04:00
3a50afe7e7
GPU dslash updates
2018-06-27 22:32:21 +01:00
f8e880b445
Loop for s and xyzt offload
2018-06-27 21:49:57 +01:00
3e947527cb
Move looping over "s" and "site" into kernels for GPU optimisation
2018-06-27 21:29:43 +01:00
31f65beac8
Move site and Ls looping into the kernels
2018-06-27 21:28:48 +01:00
38e2a32ac9
Single SIMD lane operations for CUDA
2018-06-27 21:28:06 +01:00
efa84ca50a
Keep CUDA 9.1 happy
2018-06-27 21:27:32 +01:00
5e96d6d04c
Keep CUDA happy
2018-06-27 21:27:11 +01:00
df30bdc599
CUDA happy
2018-06-27 21:26:49 +01:00
7f45222924
Diagnostics on memory alloc fail
2018-06-27 21:26:20 +01:00
dd891f5e3b
Use NVCC to suppress device Eigen
2018-06-27 21:25:17 +01:00
6c97a6a071
Coalescing version of the kernel
2018-06-13 20:52:29 +01:00
73bb2d5128
Ugly hack to speed up compile on GPU; we don't use the hand kernels on GPU anyway so why compile
2018-06-13 20:35:28 +01:00
b710fec6ea
GPU code: first version of specialised kernel
2018-06-13 20:34:39 +01:00
b2a8cd60f5
Doubled gauge field is useful
2018-06-13 20:27:47 +01:00
867ee364ab
Explicit instantiation hooks
2018-06-13 20:27:12 +01:00
25becc9324
GPU tweaks for benchmarking; really necessary?
2018-06-13 20:26:07 +01:00
94d1ae4c82
Some prep work for GPU shared memory. Need to be careful, as will try GPU direct
RDMA and inter-GPU memory sharing on Summit later
2018-06-13 20:24:06 +01:00
2075b177ef
More careful treatment of CUDA_ARCH
2018-06-13 20:22:34 +01:00
847c761ccc
Move software IEEE fp16 into central location
2018-06-13 20:22:01 +01:00
8287ed8383
New GPU vector targets
2018-06-13 20:21:35 +01:00