Peter Boyle 
							
						 
					 
					
						
						
							
						
						adbdc4e65b 
					 
					
						
						
							
							Half comms not working on GPU yet, so disable.  
						
						
						
						
					 
					
						2018-09-11 05:15:22 +01:00 
						 
				 
			
				
					
						
							
							
								Peter Boyle 
							
						 
					 
					
						
						
							
						
						e4deea4b94 
					 
					
						
						
							
							Weird bug appears with Vector<Vector<>>.
							"Fix" with std::vector<Vector<>>. Lies in the face table code, but I think there is some latent problem, possibly in my allocator since it is caching; could simplify or eliminate the caching option and retest. One to look at later.
						
						
					 
					
						2018-09-11 04:36:57 +01:00 
						 
				 
			
				
					
						
							
							
								Peter Boyle 
							
						 
					 
					
						
						
							
						
						94d721a20b 
					 
					
						
						
							
							Comments on further topology discovery work  
						
						
						
						
					 
					
						2018-09-11 04:20:04 +01:00 
						 
				 
			
				
					
						
							
							
								Peter Boyle 
							
						 
					 
					
						
						
							
						
						7bf82f5b37 
					 
					
						
						
							
							Offload the face handling to GPU  
						
						
						
						
					 
					
						2018-09-10 11:28:42 +01:00 
						 
				 
			
				
					
						
							
							
								Peter Boyle 
							
						 
					 
					
						
						
							
						
						f02c7ea534 
					 
					
						
						
							
							Peer-to-peer setup on GPUs
						
						
						
						
					 
					
						2018-09-10 11:26:20 +01:00 
						 
				 
			
				
					
						
							
							
								Peter Boyle 
							
						 
					 
					
						
						
							
						
						bc503b60e6 
					 
					
						
						
							
							Offloadable gather code  
						
						
						
						
					 
					
						2018-09-10 11:21:25 +01:00 
						 
				 
			
				
					
						
							
							
								Peter Boyle 
							
						 
					 
					
						
						
							
						
						704ca162c1 
					 
					
						
						
							
							Offloadable compression  
						
						
						
						
					 
					
						2018-09-10 11:20:50 +01:00 
						 
				 
			
				
					
						
							
							
								Peter Boyle 
							
						 
					 
					
						
						
							
						
						b5329d8852 
					 
					
						
						
							
							Protect against zero length loops giving a kernel call failure  
						
						
						
						
					 
					
						2018-09-10 11:20:07 +01:00 
						 
				 
			
				
					
						
							
							
								Peter Boyle 
							
						 
					 
					
						
						
							
						
						f27b9347ff 
					 
					
						
						
							
							Better unquiesce MPI coverage  
						
						
						
						
					 
					
						2018-09-10 11:19:39 +01:00 
						 
				 
			
				
					
						
							
							
								Peter Boyle 
							
						 
					 
					
						
						
							
						
						b4967f0231 
					 
					
						
						
							
							Verbose and error trapping cleaner  
						
						
						
						
					 
					
						2018-09-09 14:28:02 +01:00 
						 
				 
			
				
					
						
							
							
								Peter Boyle 
							
						 
					 
					
						
						
							
						
						6d0f1aabb1 
					 
					
						
						
							
							Fix the multi-node path  
						
						
						
						
					 
					
						2018-09-09 14:27:37 +01:00 
						 
				 
			
				
					
						
							
							
								Peter Boyle 
							
						 
					 
					
						
						
							
						
						f4bfeb835d 
					 
					
						
						
							
							Drop back to smaller Ls  
						
						
						
						
					 
					
						2018-09-09 14:25:06 +01:00 
						 
				 
			
				
					
						
							
							
								Peter Boyle 
							
						 
					 
					
						
						
							
						
						394b7b6276 
					 
					
						
						
							
							Verbose decrease  
						
						
						
						
					 
					
						2018-09-09 14:24:46 +01:00 
						 
				 
			
				
					
						
							
							
								Peter Boyle 
							
						 
					 
					
						
						
							
						
						da17a015c7 
					 
					
						
						
							
							Pack the stencil smaller for 128 bit access  
						
						
						
						
					 
					
						2018-07-23 06:12:45 -04:00 
						 
				 
			
				
					
						
							
							
								Peter Boyle 
							
						 
					 
					
						
						
							
						
						1fd08c21ac 
					 
					
						
						
							
							Make SIMD width a configure-time option for GPU
						
						
						
						
					 
					
						2018-07-23 06:10:55 -04:00 
						 
				 
			
				
					
						
							
							
								Peter Boyle 
							
						 
					 
					
						
						
							
						
						28db0631ff 
					 
					
						
						
							
							Hack to force 128bit accesses  
						
						
						
						
					 
					
						2018-07-23 06:10:27 -04:00 
						 
				 
			
				
					
						
							
							
								Peter Boyle 
							
						 
					 
					
						
						
							
						
						b35401b86b 
					 
					
						
						
							
							Fix CUDA_ARCH. Need to simplify. See when new Eigen release happens
						
						
						
						
					 
					
						2018-07-23 06:09:33 -04:00 
						 
				 
			
				
					
						
							
							
								Peter Boyle 
							
						 
					 
					
						
						
							
						
						a0714de8ec 
					 
					
						
						
							
							Define vector length for GPU  
						
						
						
						
					 
					
						2018-07-23 06:09:05 -04:00 
						 
				 
			
				
					
						
							
							
								Peter Boyle 
							
						 
					 
					
						
						
							
						
						21a1710b43 
					 
					
						
						
							
							Verbose vector length  
						
						
						
						
					 
					
						2018-07-23 06:08:39 -04:00 
						 
				 
			
				
					
						
							
							
								Peter Boyle 
							
						 
					 
					
						
						
							
						
						b2b5137d28 
					 
					
						
						
							
							Finally starting to get decent performance on Volta  
						
						
						
						
					 
					
						2018-07-13 12:06:18 -04:00 
						 
				 
			
				
					
						
							
							
								Peter Boyle 
							
						 
					 
					
						
						
							
						
						2cc07450f4 
					 
					
						
						
							
							Fastest option for the dslash  
						
						
						
						
					 
					
						2018-07-05 09:57:55 -04:00 
						 
				 
			
				
					
						
							
							
								Peter Boyle 
							
						 
					 
					
						
						
							
						
						c0e8bc9da9 
					 
					
						
						
							
							Current version gets 250 - 320 GF/s on Volta on the target 12^4 volume.  
						
						
						
						
					 
					
						2018-07-05 07:10:25 -04:00 
						 
				 
			
				
					
						
							
							
								Peter Boyle 
							
						 
					 
					
						
						
							
						
						b1265ae867 
					 
					
						
						
							
							Prettify code  
						
						
						
						
					 
					
						2018-07-05 07:08:06 -04:00 
						 
				 
			
				
					
						
							
							
								Peter Boyle 
							
						 
					 
					
						
						
							
						
						32bb85ea4c 
					 
					
						
						
							
							Standard extractLane is fast  
						
						
						
						
					 
					
						2018-07-05 07:07:30 -04:00 
						 
				 
			
				
					
						
							
							
								Peter Boyle 
							
						 
					 
					
						
						
							
						
						ca0607b6ef 
					 
					
						
						
							
							Clearer kernel call meaning  
						
						
						
						
					 
					
						2018-07-05 07:06:15 -04:00 
						 
				 
			
				
					
						
							
							
								Peter Boyle 
							
						 
					 
					
						
						
							
						
						19b527e83f 
					 
					
						
						
							
							Better extract merge for GPU. Let the SIMD header files define the pointer type for access. GPU redirects through builtin float2, double2 for complex.
						
						
					 
					
						2018-07-05 07:05:13 -04:00 
						 
				 
			
				
					
						
							
							
								Peter Boyle 
							
						 
					 
					
						
						
							
						
						4730d4692a 
					 
					
						
						
							
							Fast lane extract, saturates bandwidth on Volta for SU3 benchmarks  
						
						
						
						
					 
					
						2018-07-05 07:03:33 -04:00 
						 
				 
			
				
					
						
							
							
								Peter Boyle 
							
						 
					 
					
						
						
							
						
						1bb456c0c5 
					 
					
						
						
							
							Minor GPU vector width change  
						
						
						
						
					 
					
						2018-07-05 07:02:04 -04:00 
						 
				 
			
				
					
						
							
							
								Peter Boyle 
							
						 
					 
					
						
						
							
						
						4b04ae3611 
					 
					
						
						
							
							Printing improvement  
						
						
						
						
					 
					
						2018-07-05 06:59:38 -04:00 
						 
				 
			
				
					
						
							
							
								Peter Boyle 
							
						 
					 
					
						
						
							
						
						2f776d51c6 
					 
					
						
						
							
							GPU-specific benchmark saturates memory. Can enhance Grid to do this for expressions, but a bit of (known) work.
						
						
					 
					
						2018-07-05 06:58:37 -04:00 
						 
				 
			
				
					
						
							
							
								paboyle 
							
						 
					 
					
						
						
							
						
						3a50afe7e7 
					 
					
						
						
							
							GPU dslash updates  
						
						
						
						
					 
					
						2018-06-27 22:32:21 +01:00 
						 
				 
			
				
					
						
							
							
								paboyle 
							
						 
					 
					
						
						
							
						
						f8e880b445 
					 
					
						
						
							
							Loop for s and xyzt offload
						
						
						
						
					 
					
						2018-06-27 21:49:57 +01:00 
						 
				 
			
				
					
						
							
							
								paboyle 
							
						 
					 
					
						
						
							
						
						3e947527cb 
					 
					
						
						
							
							Move looping over "s" and "site" into kernels for GPU optimisation
						
						
						
						
					 
					
						2018-06-27 21:29:43 +01:00 
						 
				 
			
				
					
						
							
							
								paboyle 
							
						 
					 
					
						
						
							
						
						31f65beac8 
					 
					
						
						
							
							Move site and Ls looping into the kernels  
						
						
						
						
					 
					
						2018-06-27 21:28:48 +01:00 
						 
				 
			
				
					
						
							
							
								paboyle 
							
						 
					 
					
						
						
							
						
						38e2a32ac9 
					 
					
						
						
							
							Single SIMD lane operations for CUDA  
						
						
						
						
					 
					
						2018-06-27 21:28:06 +01:00 
						 
				 
			
				
					
						
							
							
								paboyle 
							
						 
					 
					
						
						
							
						
						efa84ca50a 
					 
					
						
						
							
							Keep CUDA 9.1 happy
						
						
						
						
					 
					
						2018-06-27 21:27:32 +01:00 
						 
				 
			
				
					
						
							
							
								paboyle 
							
						 
					 
					
						
						
							
						
						5e96d6d04c 
					 
					
						
						
							
							Keep CUDA happy  
						
						
						
						
					 
					
						2018-06-27 21:27:11 +01:00 
						 
				 
			
				
					
						
							
							
								paboyle 
							
						 
					 
					
						
						
							
						
						df30bdc599 
					 
					
						
						
							
							CUDA happy  
						
						
						
						
					 
					
						2018-06-27 21:26:49 +01:00 
						 
				 
			
				
					
						
							
							
								paboyle 
							
						 
					 
					
						
						
							
						
						7f45222924 
					 
					
						
						
							
							Diagnostics on memory alloc fail  
						
						
						
						
					 
					
						2018-06-27 21:26:20 +01:00 
						 
				 
			
				
					
						
							
							
								paboyle 
							
						 
					 
					
						
						
							
						
						dd891f5e3b 
					 
					
						
						
							
							Use NVCC to suppress device Eigen  
						
						
						
						
					 
					
						2018-06-27 21:25:17 +01:00 
						 
				 
			
				
					
						
							
							
								paboyle 
							
						 
					 
					
						
						
							
						
						6c97a6a071 
					 
					
						
						
							
							Coalescing version of the kernel  
						
						
						
						
					 
					
						2018-06-13 20:52:29 +01:00 
						 
				 
			
				
					
						
							
							
								paboyle 
							
						 
					 
					
						
						
							
						
						73bb2d5128 
					 
					
						
						
							
							Ugly hack to speed up compile on GPU; we don't use the hand kernels on GPU anyway, so why compile them
						
						
						
						
					 
					
						2018-06-13 20:35:28 +01:00 
						 
				 
			
				
					
						
							
							
								paboyle 
							
						 
					 
					
						
						
							
						
						b710fec6ea 
					 
					
						
						
							
							GPU code: first version of specialised kernel
						
						
						
						
					 
					
						2018-06-13 20:34:39 +01:00 
						 
				 
			
				
					
						
							
							
								paboyle 
							
						 
					 
					
						
						
							
						
						b2a8cd60f5 
					 
					
						
						
							
							Doubled gauge field is useful  
						
						
						
						
					 
					
						2018-06-13 20:27:47 +01:00 
						 
				 
			
				
					
						
							
							
								paboyle 
							
						 
					 
					
						
						
							
						
						867ee364ab 
					 
					
						
						
							
							Explicit instantiation hooks  
						
						
						
						
					 
					
						2018-06-13 20:27:12 +01:00 
						 
				 
			
				
					
						
							
							
								paboyle 
							
						 
					 
					
						
						
							
						
						25becc9324 
					 
					
						
						
							
							GPU tweaks for benchmarking; really necessary?  
						
						
						
						
					 
					
						2018-06-13 20:26:07 +01:00 
						 
				 
			
				
					
						
							
							
								paboyle 
							
						 
					 
					
						
						
							
						
						94d1ae4c82 
					 
					
						
						
							
							Some prep work for GPU shared memory. Need to be careful, as will try GPU direct RDMA and inter-GPU memory sharing on Summit later.
						
						
					 
					
						2018-06-13 20:24:06 +01:00 
						 
				 
			
				
					
						
							
							
								paboyle 
							
						 
					 
					
						
						
							
						
						2075b177ef 
					 
					
						
						
							
							More careful treatment of CUDA_ARCH
						
						
						
						
					 
					
						2018-06-13 20:22:34 +01:00 
						 
				 
			
				
					
						
							
							
								paboyle 
							
						 
					 
					
						
						
							
						
						847c761ccc 
					 
					
						
						
							
							Move software IEEE fp16 into a central location
						
						
						
						
					 
					
						2018-06-13 20:22:01 +01:00 
						 
				 
			
				
					
						
							
							
								paboyle 
							
						 
					 
					
						
						
							
						
						8287ed8383 
					 
					
						
						
							
							New GPU vector targets  
						
						
						
						
					 
					
						2018-06-13 20:21:35 +01:00