| 
							
							
								 Peter Boyle | f02c7ea534 | Peer to peer on GPU's setup | 2018-09-10 11:26:20 +01:00 |  | 
			
				
					| 
							
							
								 Peter Boyle | bc503b60e6 | Offloadable gather code | 2018-09-10 11:21:25 +01:00 |  | 
			
				
					| 
							
							
								 Peter Boyle | 704ca162c1 | Offloadable compression | 2018-09-10 11:20:50 +01:00 |  | 
			
				
					| 
							
							
								 Peter Boyle | b5329d8852 | Protect against zero length loops giving a kernel call failure | 2018-09-10 11:20:07 +01:00 |  | 
			
				
					| 
							
							
								 Peter Boyle | f27b9347ff | Better unquiesce MPI coverage | 2018-09-10 11:19:39 +01:00 |  | 
			
				
					| 
							
							
								 Peter Boyle | b4967f0231 | Verbose and error trapping cleaner | 2018-09-09 14:28:02 +01:00 |  | 
			
				
					| 
							
							
								 Peter Boyle | 6d0f1aabb1 | Fix the multi-node path | 2018-09-09 14:27:37 +01:00 |  | 
			
				
					| 
							
							
								 Peter Boyle | f4bfeb835d | Drop back to smaller Ls | 2018-09-09 14:25:06 +01:00 |  | 
			
				
					| 
							
							
								 Peter Boyle | 394b7b6276 | Verbose decrease | 2018-09-09 14:24:46 +01:00 |  | 
			
				
					| 
							
							
								 Peter Boyle | da17a015c7 | Pack the stencil smaller for 128 bit access | 2018-07-23 06:12:45 -04:00 |  | 
			
				
					| 
							
							
								 Peter Boyle | 1fd08c21ac | Make SIMD width a configure-time option for GPU | 2018-07-23 06:10:55 -04:00 |  | 
			
				
					| 
							
							
								 Peter Boyle | 28db0631ff | Hack to force 128bit accesses | 2018-07-23 06:10:27 -04:00 |  | 
			
				
					| 
							
							
								 Peter Boyle | b35401b86b | Fix CUDA_ARCH. Need to simplify. See when new eigen release happens | 2018-07-23 06:09:33 -04:00 |  | 
			
				
					| 
							
							
								 Peter Boyle | a0714de8ec | Define vector length for GPU | 2018-07-23 06:09:05 -04:00 |  | 
			
				
					| 
							
							
								 Peter Boyle | 21a1710b43 | Verbose vector length | 2018-07-23 06:08:39 -04:00 |  | 
			
				
					| 
							
							
								 Peter Boyle | b2b5137d28 | Finally starting to get decent performance on Volta | 2018-07-13 12:06:18 -04:00 |  | 
			
				
					| 
							
							
								 Peter Boyle | 2cc07450f4 | Fastest option for the dslash | 2018-07-05 09:57:55 -04:00 |  | 
			
				
					| 
							
							
								 Peter Boyle | c0e8bc9da9 | Current version gets 250 - 320 GF/s on Volta on the target 12^4 volume. | 2018-07-05 07:10:25 -04:00 |  | 
			
				
					| 
							
							
								 Peter Boyle | b1265ae867 | Prettify code | 2018-07-05 07:08:06 -04:00 |  | 
			
				
					| 
							
							
								 Peter Boyle | 32bb85ea4c | Standard extractLane is fast | 2018-07-05 07:07:30 -04:00 |  | 
			
				
					| 
							
							
								 Peter Boyle | ca0607b6ef | Clearer kernel call meaning | 2018-07-05 07:06:15 -04:00 |  | 
			
				
					| 
							
							
								 Peter Boyle | 19b527e83f | Better extract merge for GPU. Let the SIMD header files define the pointer type for access. GPU redirects through builtin float2, double2 for complex | 2018-07-05 07:05:13 -04:00 |  | 
			
				
					| 
							
							
								 Peter Boyle | 4730d4692a | Fast lane extract, saturates bandwidth on Volta for SU3 benchmarks | 2018-07-05 07:03:33 -04:00 |  | 
			
				
					| 
							
							
								 Peter Boyle | 1bb456c0c5 | Minor GPU vector width change | 2018-07-05 07:02:04 -04:00 |  | 
			
				
					| 
							
							
								 Peter Boyle | 4b04ae3611 | Printing improvement | 2018-07-05 06:59:38 -04:00 |  | 
			
				
					| 
							
							
								 Peter Boyle | 2f776d51c6 | GPU-specific benchmark saturates memory. Can enhance Grid to do this for expressions, but a bit of (known) work. | 2018-07-05 06:58:37 -04:00 |  | 
			
				
					| 
							
							
								 paboyle | 3a50afe7e7 | GPU dslash updates | 2018-06-27 22:32:21 +01:00 |  | 
			
				
					| 
							
							
								 paboyle | f8e880b445 | Loop for s and xyzt offlow | 2018-06-27 21:49:57 +01:00 |  | 
			
				
					| 
							
							
								 paboyle | 3e947527cb | Move looping over "s" and "site" into kernels for GPU optimisation | 2018-06-27 21:29:43 +01:00 |  | 
			
				
					| 
							
							
								 paboyle | 31f65beac8 | Move site and Ls looping into the kernels | 2018-06-27 21:28:48 +01:00 |  | 
			
				
					| 
							
							
								 paboyle | 38e2a32ac9 | Single SIMD lane operations for CUDA | 2018-06-27 21:28:06 +01:00 |  | 
			
				
					| 
							
							
								 paboyle | efa84ca50a | Keep CUDA 9.1 happy | 2018-06-27 21:27:32 +01:00 |  | 
			
				
					| 
							
							
								 paboyle | 5e96d6d04c | Keep CUDA happy | 2018-06-27 21:27:11 +01:00 |  | 
			
				
					| 
							
							
								 paboyle | df30bdc599 | CUDA happy | 2018-06-27 21:26:49 +01:00 |  | 
			
				
					| 
							
							
								 paboyle | 7f45222924 | Diagnostics on memory alloc fail | 2018-06-27 21:26:20 +01:00 |  | 
			
				
					| 
							
							
								 paboyle | dd891f5e3b | Use NVCC to suppress device Eigen | 2018-06-27 21:25:17 +01:00 |  | 
			
				
					| 
							
							
								 paboyle | 6c97a6a071 | Coalescing version of the kernel | 2018-06-13 20:52:29 +01:00 |  | 
			
				
					| 
							
							
								 paboyle | 73bb2d5128 | Ugly hack to speed up compile on GPU; we don't use the hand kernels on GPU anyway so why compile | 2018-06-13 20:35:28 +01:00 |  | 
			
				
					| 
							
							
								 paboyle | b710fec6ea | GPU code: first version of specialised kernel | 2018-06-13 20:34:39 +01:00 |  | 
			
				
					| 
							
							
								 paboyle | b2a8cd60f5 | Doubled gauge field is useful | 2018-06-13 20:27:47 +01:00 |  | 
			
				
					| 
							
							
								 paboyle | 867ee364ab | Explicit instantiation hooks | 2018-06-13 20:27:12 +01:00 |  | 
			
				
					| 
							
							
								 paboyle | 25becc9324 | GPU tweaks for benchmarking; really necessary? | 2018-06-13 20:26:07 +01:00 |  | 
			
				
					| 
							
							
								 paboyle | 94d1ae4c82 | Some prep work for GPU shared memory. Need to be careful, as will try GPU direct RDMA and inter-GPU memory sharing on Summit later | 2018-06-13 20:24:06 +01:00 |  | 
			
				
					| 
							
							
								 paboyle | 2075b177ef | CUDA_ARCH more careful treatment | 2018-06-13 20:22:34 +01:00 |  | 
			
				
					| 
							
							
								 paboyle | 847c761ccc | Move sfw IEEE fp16 into central location | 2018-06-13 20:22:01 +01:00 |  | 
			
				
					| 
							
							
								 paboyle | 8287ed8383 | New GPU vector targets | 2018-06-13 20:21:35 +01:00 |  | 
			
				
					| 
							
							
								 paboyle | e6be7416f4 | Use managed memory | 2018-06-13 20:14:00 +01:00 |  | 
			
				
					| 
							
							
								 paboyle | 26863b6d95 | User Managed memory | 2018-06-13 20:13:42 +01:00 |  | 
			
				
					| 
							
							
								 paboyle | ebd730bd54 | Adding 2D loops | 2018-06-13 20:13:01 +01:00 |  | 
			
				
					| 
							
							
								 paboyle | 066be31a3b | Optional GPU target SIMD types; work in progress and trying experiments | 2018-06-13 20:07:55 +01:00 |  |