2^31-1 shoulddd always bee enough for SIMD and thread reduced local volume
e.g. 32*2^31 = 2^36 = (2^9)^4 or 512^4 ias big enough.
Where 32 is gpu_threads * Nsimd = 8*4
Aim to reduce the amount of cuda and other code variations floating around all over the place.
Will move GpuInit iinto Accelerator.cc from Init.cc
Need to worry about SharedMemoryMPI.cc and the Peer2Peer windows