mirror of https://github.com/paboyle/Grid.git
synced 2026-05-17 07:34:31 +01:00

Rewrite lattice GPU reduction to use CUB, hipCUB, and SYCL reduction

Replace the hand-rolled shared-memory reduction kernels (reduceBlock / reduceBlocks / reduceKernel) and the global device variable retirementCount with a unified CUB/hipCUB DeviceReduce::Reduce path for CUDA/HIP and sycl::reduction for SYCL. No small/large split is needed: both CUB and sycl::reduction handle arbitrary object sizes internally. The old implementations are preserved as sum_gpu_old / sumD_gpu_old etc. in the original files for regression testing on GPU hardware. Also adds CLAUDE.md with build, test, and architecture guidance.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@@ -0,0 +1,98 @@
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## What This Is

Grid is a data-parallel C++ library for lattice QCD. It provides SIMD-vectorised lattice containers, MPI-based domain decomposition, GPU acceleration (CUDA/HIP/SYCL), and a full suite of QCD algorithms including HMC.
## Build

Uses GNU Autotools. The bootstrap step only needs to run once (or after `configure.ac` changes).

```bash
./bootstrap.sh          # downloads Eigen 3.4.0, generates configure
mkdir build && cd build
../configure [options]
make -j$(nproc)
make check              # run root-level tests
make install
```

Key configure options:

| Option | Common values |
|--------|---------------|
| `--enable-simd=` | `AVX2`, `AVX512`, `KNL`, `A64FX`, `NEONv8`, `GPU` |
| `--enable-comms=` | `mpi-auto`, `mpi3-auto`, `none` |
| `--enable-accelerator=` | `cuda`, `hip`, `sycl` |
| `--enable-shm=` | `shmopen`, `hugetlbfs`, `nvlink` |
| `--enable-Nc=` | `3` (default), `2`, `4`, `5` |
| `--with-gmp=`, `--with-mpfr=`, `--with-fftw=`, `--with-lime=` | paths to libs |
| `--enable-hdf5`, `--enable-mkl`, `--enable-lapack` | optional features |
Platform recipes from `README.md`:

- **KNL**: `--enable-simd=KNL --enable-comms=mpi3-auto --enable-mkl`
- **Skylake/Haswell**: `--enable-simd=AVX512` or `AVX2` + `--enable-comms=mpi3-auto`
- **AMD EPYC**: `--enable-simd=AVX2 --enable-comms=mpi3`
- **A64FX (Fugaku)**: `--enable-simd=A64FX --enable-comms=mpi3 --enable-shm=shmget` (see `SVE_README.txt`)

Required external libs: GMP, MPFR, OpenSSL, zlib.
## Running Tests

```bash
# From build directory
make check                     # root-level tests (Test_simd, Test_cshift, etc.)
make -C tests/<subdir> tests   # build tests in a subdirectory
./tests/core/Test_simd         # run a single test binary directly
```

Test subdirectories and their focus: `core` (SIMD, stencil, comms), `solver` (CG, GMRES, eigensolvers), `hmc` (MD integrators), `forces` (fermion forces), `lanczos`, `IO`, `smearing`, `sp2n`, `debug`.
## Architecture

### Layer stack (bottom to top)

1. **SIMD layer** (`Grid/simd/`) — platform-specific intrinsics wrapped into `vRealF`, `vComplexD`, etc. The SIMD width and layout are compile-time constants controlled by `--enable-simd`.

2. **Tensor layer** (`Grid/tensors/`) — Lorentz/colour/spin tensor algebra built on top of SIMD types. `iMatrix`, `iVector`, `iScalar` templates compose into QCD types like `ColourMatrix`, `SpinColourVector`.

3. **Lattice layer** (`Grid/lattice/`) — `Lattice<T>` container: a site-local tensor replicated across a distributed Cartesian grid. All arithmetic is site-parallel and expression-template-fused.

4. **Cartesian/comms layer** (`Grid/cartesian/`, `Grid/communicator/`) — `GridCartesian` holds the MPI topology and local/global geometry. `Grid/cshift/` implements nearest-neighbour halo exchange; `Grid/stencil/` is the optimised multi-hop stencil used by Dirac operators.

5. **Algorithm layer** (`Grid/algorithms/`) — iterative solvers (CG, GMRES, BiCGSTAB, mixed-precision), eigensolvers (Lanczos, LAPACK), FFT, smearing.

6. **QCD layer** (`Grid/qcd/`) — gauge and fermion actions, HMC integrators, observables.
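The expression-template fusion in the lattice layer can be illustrated with a self-contained sketch. This is a simplified scalar stand-in, not Grid's actual `Lattice<T>` machinery: an arithmetic expression builds a lightweight expression tree, and the single loop in `operator=` evaluates it site by site, so no intermediate field is ever materialised.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CRTP base: lets operators accept any expression node generically.
template<class E> struct Expr {
  const E &self() const { return static_cast<const E &>(*this); }
};

// A "field" of one double per site (stand-in for a site-local tensor).
struct Field : Expr<Field> {
  std::vector<double> data;
  explicit Field(std::size_t n, double v = 0) : data(n, v) {}
  double eval(std::size_t i) const { return data[i]; }
  template<class E> Field &operator=(const Expr<E> &e) {
    for (std::size_t i = 0; i < data.size(); i++)
      data[i] = e.self().eval(i);   // single fused per-site loop
    return *this;
  }
};

// Expression nodes: record operands, evaluate lazily per site.
template<class A, class B> struct Add : Expr<Add<A, B>> {
  const A &a; const B &b;
  Add(const A &a_, const B &b_) : a(a_), b(b_) {}
  double eval(std::size_t i) const { return a.eval(i) + b.eval(i); }
};
template<class A, class B> struct Mul : Expr<Mul<A, B>> {
  const A &a; const B &b;
  Mul(const A &a_, const B &b_) : a(a_), b(b_) {}
  double eval(std::size_t i) const { return a.eval(i) * b.eval(i); }
};
template<class A, class B>
Add<A, B> operator+(const Expr<A> &a, const Expr<B> &b) { return Add<A, B>(a.self(), b.self()); }
template<class A, class B>
Mul<A, B> operator*(const Expr<A> &a, const Expr<B> &b) { return Mul<A, B>(a.self(), b.self()); }
```

Assigning `r = a + b * c` then performs one pass over the sites, computing `a[i] + b[i]*c[i]` directly at each site instead of allocating a temporary for `b * c`.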
### QCD subsystem (`Grid/qcd/`)

- `action/fermion/` — Wilson, Clover, DWF (Mobius), Staggered, twisted-mass, G-parity variants
- `action/gauge/` — Wilson gauge, Symanzik, Iwasaki, DBW2, plaquette+rect
- `representations/` — Fundamental, Adjoint, Two-index, Sp(2n)
- `hmc/` — Leapfrog, OMF2/OMF4 integrators; pseudofermion refreshment; Metropolis accept/reject
- `smearing/` — APE, Stout, HEX, gradient flow
- `observables/` — Polyakov loop, plaquette, topological charge
### GPU acceleration

GPU support is injected via macros (`accelerator_for`, `accelerator_for2dNB`). The `Grid/simd/` SIMD types map to scalar types in GPU device code; host code paths remain vectorised. Unified virtual memory is on by default (`--enable-unified=yes`); device-aware MPI (`--enable-accelerator-aware-mpi`) avoids device→host copies on transfers.
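The macro-injection idea can be sketched in host-only form. This is a hypothetical simplification, not Grid's actual macro: the loop body is written once, and the macro selects the execution backend at compile time. Only the serial host branch is shown here; under `GRID_CUDA`/`GRID_HIP` the same body would instead be wrapped in a device lambda and launched as a kernel.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical host-only rendering of the accelerator_for idea: iterate the
// body over `num` sites. A real accelerator build would replace this loop
// with a kernel launch over the same body.
#define accelerator_for_sketch(iter, num, ...)                  \
  for (std::size_t iter = 0; iter < (num); iter++) { __VA_ARGS__ }

// Example site-parallel operation: z = alpha*x + y, returning |z|^2.
// (On a GPU the norm accumulation would go through a proper reduction.)
double axpy_norm(double alpha, const double *x, const double *y,
                 double *z, std::size_t n) {
  double nrm = 0;
  accelerator_for_sketch(ss, n, {
    z[ss] = alpha * x[ss] + y[ss];
    nrm  += z[ss] * z[ss];
  });
  return nrm;
}
```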
### Memory and I/O

- `Grid/allocator/` — aligned/NUMA-aware allocators; caching allocator via `--enable-alloc-cache`
- `Grid/parallelIO/` — distributed parallel reader/writer for ILDG (via LIME), SciDAC, and native binary formats
- `Grid/serialisation/` — text, binary, HDF5, XML/JSON serialisation of arbitrary Grid objects
### HMC applications

`HMC/` contains production-ready HMC driver programmes (e.g. `Mobius2p1f.cc`, `DWF_plus_DSDR_nf2plus1_Shamir_Gparity.cc`). These are built separately from the library tests.
## Key Conventions

- **C++17** is required throughout.
- Template structure: most classes are templated on `<FImpl>` (fermion impl) or `<Gimpl>` (gauge impl), which encode the representation and precision. Instantiation is controlled by `--enable-fermion-instantiations`.
- The `RealD`/`RealF`/`ComplexD`/`ComplexF` typedefs are used everywhere; avoid raw `double`/`float`.
- Logging uses `Grid_log`, `Grid_error` macros (from `Grid/log/`); performance-critical paths use the `GRID_TRACE` / timer macros from `Grid/perfmon/`.
- Reductions across MPI ranks go through `GridBase::GlobalSum` / `GlobalMax`; never reduce with bare MPI calls inside library code.
@@ -31,6 +31,9 @@ Author: Christoph Lehner <christoph@lhnr.de>
 #if defined(GRID_SYCL)
 #include <Grid/lattice/Lattice_reduction_sycl.h>
 #endif
+#if defined(GRID_CUDA)||defined(GRID_HIP)||defined(GRID_SYCL)
+#include <Grid/lattice/Lattice_reduction_gpu_cub.h>
+#endif
 #include <Grid/lattice/Lattice_slicesum_core.h>
 
 NAMESPACE_BEGIN(Grid);
@@ -198,7 +198,7 @@ __global__ void reduceKernel(const vobj *lat, sobj *buffer, Iterator n) {
 // Possibly promote to double and sum
 /////////////////////////////////////////////////////////////////////////////////////////////////////////
 template <class vobj>
-inline typename vobj::scalar_objectD sumD_gpu_small(const vobj *lat, Integer osites)
+inline typename vobj::scalar_objectD sumD_gpu_small_old(const vobj *lat, Integer osites)
 {
   typedef typename vobj::scalar_objectD sobj;
   typedef decltype(lat) Iterator;
@@ -224,7 +224,7 @@ inline typename vobj::scalar_objectD sumD_gpu_small(const vobj *lat, Integer osi
 }
 
 template <class vobj>
-inline typename vobj::scalar_objectD sumD_gpu_large(const vobj *lat, Integer osites)
+inline typename vobj::scalar_objectD sumD_gpu_large_old(const vobj *lat, Integer osites)
 {
   typedef typename vobj::vector_type vector;
   typedef typename vobj::scalar_typeD scalarD;
@@ -244,13 +244,13 @@ inline typename vobj::scalar_objectD sumD_gpu_large(const vobj *lat, Integer osi
       buf[ss] = dat[ss*words+w];
     });
 
-    ret_p[w] = sumD_gpu_small(tbuf,osites);
+    ret_p[w] = sumD_gpu_small_old(tbuf,osites);
   }
   return ret;
 }
 
 template <class vobj>
-inline typename vobj::scalar_objectD sumD_gpu(const vobj *lat, Integer osites)
+inline typename vobj::scalar_objectD sumD_gpu_old(const vobj *lat, Integer osites)
 {
   typedef typename vobj::scalar_objectD sobj;
   sobj ret;
@@ -261,9 +261,9 @@ inline typename vobj::scalar_objectD sumD_gpu(const vobj *lat, Integer osites)
   int ok = getNumBlocksAndThreads(size, sizeof(sobj), numThreads, numBlocks);
 
   if ( ok ) {
-    ret = sumD_gpu_small(lat,osites);
+    ret = sumD_gpu_small_old(lat,osites);
   } else {
-    ret = sumD_gpu_large(lat,osites);
+    ret = sumD_gpu_large_old(lat,osites);
   }
   return ret;
 }
@@ -272,20 +272,20 @@ inline typename vobj::scalar_objectD sumD_gpu(const vobj *lat, Integer osites)
 // Return as same precision as input performing reduction in double precision though
 /////////////////////////////////////////////////////////////////////////////////////////////////////////
 template <class vobj>
-inline typename vobj::scalar_object sum_gpu(const vobj *lat, Integer osites)
+inline typename vobj::scalar_object sum_gpu_old(const vobj *lat, Integer osites)
 {
   typedef typename vobj::scalar_object sobj;
   sobj result;
-  result = sumD_gpu(lat,osites);
+  result = sumD_gpu_old(lat,osites);
   return result;
 }
 
 template <class vobj>
-inline typename vobj::scalar_object sum_gpu_large(const vobj *lat, Integer osites)
+inline typename vobj::scalar_object sum_gpu_large_old(const vobj *lat, Integer osites)
 {
   typedef typename vobj::scalar_object sobj;
   sobj result;
-  result = sumD_gpu_large(lat,osites);
+  result = sumD_gpu_large_old(lat,osites);
   return result;
 }
 
@@ -0,0 +1,184 @@
/*************************************************************************************

    Grid physics library, www.github.com/paboyle/Grid

    Source file: ./Grid/lattice/Lattice_reduction_gpu_cub.h

    Copyright (C) 2015-2024

Author: Peter Boyle <paboyle@ph.ed.ac.uk>

    This program is free software; you can redistribute it and/or modify
    it under the terms of the GNU General Public License as published by
    the Free Software Foundation; either version 2 of the License, or
    (at your option) any later version.

    This program is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    GNU General Public License for more details.

    You should have received a copy of the GNU General Public License along
    with this program; if not, write to the Free Software Foundation, Inc.,
    51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.

    See the full license in the file "LICENSE" in the top level distribution directory

*************************************************************************************/
/*  END LEGAL */
#pragma once

#if defined(GRID_CUDA)
#include <cub/cub.cuh>
#define gpucub cub
#define gpuError_t cudaError_t
#define gpuSuccess cudaSuccess
#elif defined(GRID_HIP)
#include <hipcub/hipcub.hpp>
#define gpucub hipcub
#define gpuError_t hipError_t
#define gpuSuccess hipSuccess
#endif

NAMESPACE_BEGIN(Grid);

/////////////////////////////////////////////////////////////////////////////////////////////////////////
// Unified lattice reduction using CUB (CUDA/HIP) and sycl::reduction (SYCL).
//
// Strategy: one accelerator_for pass per site to extract SIMD lanes and promote to sobjD,
// then a single library reduce over the sobjD array. No small/large split is needed:
// CUB DeviceReduce::Reduce and sycl::reduction both handle arbitrary object sizes by
// tuning block occupancy internally.
/////////////////////////////////////////////////////////////////////////////////////////////////////////

#if defined(GRID_CUDA) || defined(GRID_HIP)

template<class vobj>
inline typename vobj::scalar_objectD sumD_gpu(const vobj *lat, Integer osites)
{
  typedef typename vobj::scalar_object  sobj;
  typedef typename vobj::scalar_objectD sobjD;

  // Per-site: sum SIMD lanes (Reduce) and promote to double precision.
  deviceVector<sobjD> per_site(osites);
  sobjD *per_site_p = &per_site[0];

  accelerator_for(ss, osites, 1, {
      sobj tmp = Reduce(lat[ss]);
      sobjD tmpD; tmpD = tmp;
      per_site_p[ss] = tmpD;
  });

  // CUB global reduction over the sobjD array.
  sobjD zero; zeroit(zero);
  sobjD *d_out = static_cast<sobjD *>(acceleratorAllocDevice(sizeof(sobjD)));
  void  *d_temp = nullptr;
  size_t temp_bytes = 0;

  gpuError_t gpuErr;
  gpuErr = gpucub::DeviceReduce::Reduce(d_temp, temp_bytes, per_site_p, d_out,
                                        (int)osites, gpucub::Sum(), zero, computeStream);
  if (gpuErr != gpuSuccess) {
    std::cout << GridLogError << "Lattice_reduction_gpu_cub.h: DeviceReduce size query failed: "
              << gpuErr << std::endl;
    exit(EXIT_FAILURE);
  }

  d_temp = acceleratorAllocDevice(temp_bytes);

  gpuErr = gpucub::DeviceReduce::Reduce(d_temp, temp_bytes, per_site_p, d_out,
                                        (int)osites, gpucub::Sum(), zero, computeStream);
  if (gpuErr != gpuSuccess) {
    std::cout << GridLogError << "Lattice_reduction_gpu_cub.h: DeviceReduce failed: "
              << gpuErr << std::endl;
    exit(EXIT_FAILURE);
  }

  accelerator_barrier();

  sobjD result;
  acceleratorCopyFromDevice(d_out, &result, sizeof(sobjD));
  acceleratorFreeDevice(d_temp);
  acceleratorFreeDevice(d_out);
  return result;
}

// sumD_gpu_small and sumD_gpu_large are preserved as aliases for API compatibility.
template<class vobj>
inline typename vobj::scalar_objectD sumD_gpu_small(const vobj *lat, Integer osites)
{
  return sumD_gpu(lat, osites);
}

template<class vobj>
inline typename vobj::scalar_objectD sumD_gpu_large(const vobj *lat, Integer osites)
{
  return sumD_gpu(lat, osites);
}

template<class vobj>
inline typename vobj::scalar_object sum_gpu(const vobj *lat, Integer osites)
{
  typedef typename vobj::scalar_object sobj;
  sobj result;
  result = sumD_gpu(lat, osites);
  return result;
}

template<class vobj>
inline typename vobj::scalar_object sum_gpu_large(const vobj *lat, Integer osites)
{
  return sum_gpu(lat, osites);
}

#endif // GRID_CUDA || GRID_HIP

#if defined(GRID_SYCL)

// Accumulates in sobjD throughout, fixing the precision bug in the original
// Lattice_reduction_sycl.h which accumulated in sobj then converted at the end.
template<class vobj>
inline typename vobj::scalar_objectD sumD_gpu(const vobj *lat, Integer osites)
{
  typedef typename vobj::scalar_object  sobj;
  typedef typename vobj::scalar_objectD sobjD;

  sobjD identity; zeroit(identity);
  sobjD ret;      zeroit(ret);
  {
    sycl::buffer<sobjD, 1> abuff(&ret, {1});
    theGridAccelerator->submit([&](sycl::handler &cgh) {
      auto Reduction = sycl::reduction(abuff, cgh, identity, std::plus<>());
      cgh.parallel_for(sycl::range<1>{(size_t)osites},
                       Reduction,
                       [=](sycl::id<1> item, auto &sum) {
                         sobj s = Reduce(lat[item[0]]);
                         sobjD sd; sd = s;
                         sum += sd;
                       });
    });
  }
  return ret;
}

template<class vobj>
inline typename vobj::scalar_objectD sumD_gpu_small(const vobj *lat, Integer osites)
{
  return sumD_gpu(lat, osites);
}

template<class vobj>
inline typename vobj::scalar_objectD sumD_gpu_large(const vobj *lat, Integer osites)
{
  return sumD_gpu(lat, osites);
}

template<class vobj>
inline typename vobj::scalar_object sum_gpu(const vobj *lat, Integer osites)
{
  typedef typename vobj::scalar_object sobj;
  sobj result;
  result = sumD_gpu(lat, osites);
  return result;
}

template<class vobj>
inline typename vobj::scalar_object sum_gpu_large(const vobj *lat, Integer osites)
{
  return sum_gpu(lat, osites);
}

#endif // GRID_SYCL

NAMESPACE_END(Grid);
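CUB's `DeviceReduce::Reduce` uses the two-call convention seen in the file above: the first call with a null scratch pointer only reports the required `temp_bytes`, the caller allocates that much device memory, and the second call performs the reduction. A host-only sketch of that calling convention (hypothetical `device_reduce`, not the real CUB entry point):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in mimicking CUB's two-call shape: when temp is null the
// function only writes the scratch-size requirement and returns.
bool device_reduce(void *temp, std::size_t &temp_bytes,
                   const double *in, double *out, int n) {
  if (temp == nullptr) {            // phase 1: size query only
    temp_bytes = sizeof(double);    // trivial scratch need in this sketch
    return true;
  }
  double acc = 0;                   // phase 2: the actual reduction
  for (int i = 0; i < n; i++) acc += in[i];
  *out = acc;
  return true;
}

double reduce_two_phase(const std::vector<double> &v) {
  std::size_t bytes = 0;
  double out = 0;
  device_reduce(nullptr, bytes, v.data(), &out, (int)v.size());         // query
  std::vector<unsigned char> scratch(bytes);                            // allocate
  device_reduce(scratch.data(), bytes, v.data(), &out, (int)v.size());  // reduce
  return out;
}
```

The same pattern appears twice in `sumD_gpu` above: one `gpucub::DeviceReduce::Reduce` call to size `d_temp`, then a second to reduce into `d_out`.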
@@ -6,7 +6,7 @@ NAMESPACE_BEGIN(Grid);
 
 
 template <class vobj>
-inline typename vobj::scalar_objectD sumD_gpu_tensor(const vobj *lat, Integer osites)
+inline typename vobj::scalar_objectD sumD_gpu_tensor_old(const vobj *lat, Integer osites)
 {
   typedef typename vobj::scalar_object sobj;
   typedef typename vobj::scalar_objectD sobjD;
@@ -31,40 +31,40 @@ inline typename vobj::scalar_objectD sumD_gpu_tensor(const vobj *lat, Integer os
 }
 
 template <class vobj>
-inline typename vobj::scalar_objectD sumD_gpu_large(const vobj *lat, Integer osites)
+inline typename vobj::scalar_objectD sumD_gpu_large_old(const vobj *lat, Integer osites)
 {
-  return sumD_gpu_tensor(lat,osites);
+  return sumD_gpu_tensor_old(lat,osites);
 }
 template <class vobj>
-inline typename vobj::scalar_objectD sumD_gpu_small(const vobj *lat, Integer osites)
+inline typename vobj::scalar_objectD sumD_gpu_small_old(const vobj *lat, Integer osites)
 {
-  return sumD_gpu_large(lat,osites);
+  return sumD_gpu_large_old(lat,osites);
 }
 
 template <class vobj>
-inline typename vobj::scalar_objectD sumD_gpu(const vobj *lat, Integer osites)
+inline typename vobj::scalar_objectD sumD_gpu_old(const vobj *lat, Integer osites)
 {
-  return sumD_gpu_large(lat,osites);
+  return sumD_gpu_large_old(lat,osites);
 }
 
 /////////////////////////////////////////////////////////////////////////////////////////////////////////
 // Return as same precision as input performing reduction in double precision though
 /////////////////////////////////////////////////////////////////////////////////////////////////////////
 template <class vobj>
-inline typename vobj::scalar_object sum_gpu(const vobj *lat, Integer osites)
+inline typename vobj::scalar_object sum_gpu_old(const vobj *lat, Integer osites)
 {
   typedef typename vobj::scalar_object sobj;
   sobj result;
-  result = sumD_gpu(lat,osites);
+  result = sumD_gpu_old(lat,osites);
   return result;
 }
 
 template <class vobj>
-inline typename vobj::scalar_object sum_gpu_large(const vobj *lat, Integer osites)
+inline typename vobj::scalar_object sum_gpu_large_old(const vobj *lat, Integer osites)
 {
   typedef typename vobj::scalar_object sobj;
   sobj result;
-  result = sumD_gpu_large(lat,osites);
+  result = sumD_gpu_large_old(lat,osites);
   return result;
 }
 