1
0
mirror of https://github.com/paboyle/Grid.git synced 2026-06-06 04:04:36 +01:00

Compare commits

..

100 Commits

Author SHA1 Message Date
Peter Boyle 94d8ee4268 More info 2026-05-29 16:10:11 -04:00
Peter Boyle 3fac61fc55 benchmarks: add DhopEO benchmarks to Benchmark_staggered and Benchmark_staggeredF
Also add NaiveStaggeredFermionF to Benchmark_staggeredF, and fix bug where
NaiveStaggered timed loop was calling Ds.Dhop instead of Dn.Dhop.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-29 13:55:27 -04:00
Peter Boyle 42cd9eda71 Some improvements that should have been there if in synch with develop,
and also some staggered hdcg type work
2026-05-29 13:36:57 -04:00
Thomas Blum 34d8d003a8 staggered-hdcg: smoother shift tuning, CG baseline, Lanczos diagnostics
Smoother shift:
- Replace hard-coded mass^2 = 0.0025 with fine_lambda_max / divisor,
  measured at runtime via PowerMethod on the SchurStaggeredOperator.
- Current divisor = 200 (tunable); concentrates the O(8) CG polynomial
  zeros on the high-frequency end of the spectrum [shift, lambda_max],
  repairing the spectral leakage introduced at coarse-cell boundaries
  when the coarse-grid solution is promoted back to the fine grid.
- Add explanatory comment on the lego-block edge / covariant-derivative
  physics behind the high-mode smoothing requirement.

Chebyshev filter (IRL):
- Fix lambda_lo = 0.02 (was mass^2 * 0.5 = 0.00125).
  Tuning history logged in comments: lo=0.005 → 0/24 modes (T_70~53);
  lo=0.01 → 24/24 but 2 restarts; lo=0.02 → 24/24 in 1 restart.
- Reduce Nk/Nm from 48/96 to 24/48 (target 24 near-null modes only).
- Print Chebyshev filter parameters at run time.

CG baseline:
- Add sequential single-RHS CG loop before the HDCG solve to establish
  unpreconditioned iteration count and wall time for direct comparison.

ImplicitlyRestartedBlockLanczosCoarse:
- Print Ritz values before and after implicit shift at each restart.
- Print alpha/beta block-diagonal elements at each Lanczos step.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-05-28 16:43:23 -04:00
Thomas Blum 905651deaa Test_staggered_hdcg: fix GridParallelRNG and Lanczos grid bugs
- GridParallelRNG must be constructed on full (non-checkerboarded) UGrid,
  not UrbGrid; fill() recurses infinitely when _grid is checkerboarded.
- evec and c_srcs for ImplicitlyRestartedBlockLanczosCoarse must both be
  on f_grid (Coarse4d), not CoarseMrhs; calc_irbl asserts evec[0].Grid()
  == src[0].Grid().
- Switch subspace generation from CreateSubspaceChebyshevNew to
  CreateSubspace (CG inverse iteration), which requires no spectral
  bound tuning and adapts automatically to the matrix spectrum.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-28 11:41:41 -04:00
Thomas Blum 119308c42a Test_staggered_hdcg: add missing ImplicitlyRestartedBlockLanczos.h include
IRBLdiagonalisation, SortEigen, and LanczosType are defined in
ImplicitlyRestartedBlockLanczos.h, which must be included before
ImplicitlyRestartedBlockLanczosCoarse.h.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 20:55:51 -04:00
Thomas Blum 89a32799e3 mac-arm: align --enable-Sp=no with upstream config-command-mpi style
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 16:21:02 -04:00
Thomas Blum ce8b52749d Merge remote-tracking branch 'origin/develop' into feature/staggered-hdcg 2026-05-27 16:20:47 -04:00
Peter Boyle 86c7f29183 Config command update 2026-05-27 16:19:33 -04:00
Thomas Blum bbdc8e95f4 mac-arm: disable Sp, fermion-reps, gparity for faster dev builds
Reduces compile time significantly by skipping representations not
needed for the staggered HDCG work.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 16:19:28 -04:00
Thomas Blum 1284acf37a Merge remote-tracking branch 'origin/develop' into feature/staggered-hdcg 2026-05-27 16:19:19 -04:00
Peter Boyle b0c99f876e Configure on mac update 2026-05-27 16:16:55 -04:00
Peter Boyle bf5fcdc860 Ease of use for std::complex interchangable with thrust 2026-05-27 16:05:37 -04:00
Thomas Blum 520b90259d Add staggered HDCG multigrid test and mac-arm Homebrew build scripts
Test_staggered_hdcg.cc implements a two-level ADEF2 multigrid solver for
NaiveStaggeredFermion using SchurStaggeredOperator, following the mrhs
hermitian multigrid approach of arXiv:2409.03904. Uses a 33-point coarse
stencil (NextToNearestStencilGeometry4D) with nbasis=24, block={4,4,4,4},
and Chebyshev subspace generation with hi=5.0 (lambda_max ~4.6).

Also adds systems/mac-arm/sourceme-homebrew.sh and config-command-homebrew
for building Grid on Apple Silicon with Homebrew-installed dependencies.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 15:52:49 -04:00
Peter Boyle b58a1508fa Perlmutter cuda version update 2026-05-21 13:25:13 -07:00
Peter Boyle 4d527e81fa Remove hip specific files 2026-05-21 12:34:30 -04:00
Peter Boyle 7803580aa6 Lattice_reduction_gpu: demote timing logs to Debug, disable by default
skills/mpi-heterogeneous: add Bug Class 4 for Frontier GTL/libamdhip64 ABI mismatch

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-21 12:34:30 -04:00
Peter Boyle 32654db366 Test_planned_fft: fix PlannedFFT template parameter to use ::vector_object
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-21 12:34:30 -04:00
Peter Boyle cd340cfab3 tests: add Test_planned_fft exercising PlannedFFT<vobj>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-21 12:34:30 -04:00
Peter Boyle f32866b2ff tests/fft: remove PlanDestroy calls (FFT handles plans per-call)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-21 12:34:30 -04:00
Peter Boyle 1cd1dc091e FFT: add FFTbase, PlannedFFT; factor FFT_dim_execute free function
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-21 12:34:30 -04:00
Peter Boyle 0493656e86 debug: add Test_hipfft_repro — reproducer for hipFFT PARSE_ERROR on ROCm 7
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-21 12:34:30 -04:00
Peter Boyle 66fd504c4d tests/debug: add G=4 to hipfft fail reproducer
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-21 12:34:30 -04:00
Peter Boyle be4dd2b52f tests/debug: test hipMemset variant before cache is populated
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-21 12:34:30 -04:00
Peter Boyle 707d059766 tests/debug: extend hipfft fail reproducer with hipMemset and sync variants
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-21 12:34:30 -04:00
Peter Boyle f08c755ae6 FFT: use host stack buffer in PlanCreate, not deviceVector
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-21 12:34:30 -04:00
Peter Boyle dbbfdd4e4b tests/debug: add minimal hipfft ordering bug fail/pass pair
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-21 12:34:30 -04:00
Peter Boyle f967fb40bf tests/debug: test plan-before-malloc vs malloc-before-plan ordering
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-21 12:34:30 -04:00
Peter Boyle 74e0f846cb tests/debug: extend hipfft reproducer with Grid-realistic howmany and exec tests
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-21 12:34:30 -04:00
Peter Boyle 303a4d26e5 tests/debug: add minimal hipfft plan-creation reproducer
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-21 12:34:30 -04:00
Peter Boyle 119888653c FFT HIP: use hipfftCreate+hipfftMakePlanMany instead of hipfftPlanMany 2026-05-21 12:34:30 -04:00
Peter Boyle a9f42c08f9 FFT: pass nullptr for inembed/onembed in hipfftPlanMany to avoid HIPFFT_PARSE_ERROR 2026-05-21 12:34:30 -04:00
Peter Boyle e79adc9d31 FFT: cache plans per vobj type across calls
Plans are created lazily on the first FFT_dim call and reused for all
subsequent calls on the same FFT object.  PlanCreate<vobj>() can be
called explicitly to pre-warm the cache.  PlanDestroy() must be called
before switching to a different vobj type; the destructor cleans up any
live plans automatically.

Update Test_fft.cc and Test_fftf.cc to call PlanDestroy() between the
LatticeComplex and LatticeSpinMatrix sections that reuse the same FFT object.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-21 12:34:30 -04:00
Peter Boyle 5a9056cd93 Accelerator: lower default accelerator_threads from 16 to 8
Benchmark_dwf_fp32 on MI250X GCD: 1.7 TF/s at nt=8, ~300 GF/s at nt=16.
With Nsimd=8 (fp32, GEN_SIMD_WIDTH=64B), nt=8 gives exactly 64 threads =
one full AMD wavefront. Higher values double register demand per block and
hit a register-pressure cliff for stencil kernels.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-21 12:34:30 -04:00
Peter Boyle 012c36ab5a Accelerator: raise default accelerator_threads from 2 to 16 2026-05-21 12:34:30 -04:00
Peter Boyle 5c4574f9aa skills: add gpu-memory-performance.md
Documents the acceleratorThreads() default=2 trap, LambdaApply thread
mapping, coalescedRead/Write idiom, when to use __global__ vs
accelerator_for, and fused vs staged HBM access patterns.

Includes observed MI250X numbers from LatticePropagatorD reduction
(50 → 297 → 546 GB/s progression).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-21 12:34:30 -04:00
Peter Boyle a424775884 sumD_gpu_reduce_words: fuse pack+reduce into single packReduceKernel
Replace the two-kernel pack+reduce sequence with a single fused kernel
packReduceKernel<R> that reads R words of each vobj at offset 'base'
and accumulates directly into iVector<iScalar<scalarD>,R>, eliminating
the intermediate bundle buffer entirely.

HBM access per word-group drops from 3x (pack-read + pack-write +
reduce-read) to 1x.  Thread count comes from getNumBlocksAndThreads
(warpSize..256) rather than acceleratorThreads(), so occupancy is
correct regardless of the --accelerator-threads setting.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-21 12:34:30 -04:00
Peter Boyle d6b1388741 Modified repack 2026-05-21 12:34:30 -04:00
Peter Boyle 796c6cae4e Enable GRID_REDUCTION_TIMING unconditionally
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-21 12:34:30 -04:00
Peter Boyle 1a8064d6d9 Lattice_reduction_gpu: add GRID_REDUCTION_TIMING instrumentation
Uncomment #define GRID_REDUCTION_TIMING to enable per-phase timing output:

  sumD_gpu_reduce_words: pack time (accelerator_for) per R and base
  sumD_gpu_small:        reduceKernel+barrier time and D2H time separately
  sumD_gpu_large:        total wall time across all word groups

This lets us identify whether the large-type bottleneck is in the pack
kernel, the shared-memory reduction kernel, the barrier, or the D2H.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-21 12:34:30 -04:00
Peter Boyle 43648924c3 sumD_gpu_large: radix-12 word-bundle reduction replacing radix-1
Replace the word-by-word loop (one kernel launch per scalar word) with
sumD_gpu_reduce_words<R> which packs R consecutive vector_type words per
site into iVector<iScalar<vector>,R>, then calls the existing sumD_gpu_small
shared-memory kernel once for the whole bundle.

Dispatch: radix-12 first, radix-4 for the remainder < 12, radix-1 for
any final < 4 words.  For LatticePropagator (144 words = 12x12), this
reduces the kernel-launch count from 144 to 12 -- a 12x reduction.

Bundle::Nsimd() inherits from vector_type so sumD_gpu_small handles SIMD
lane extraction and double-precision promotion identically to the scalar
word case.  sizeof(Bundle::scalar_objectD) = R*16 <= 192 B; well within
sharedMemPerBlock on all supported devices.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-21 12:34:30 -04:00
Peter Boyle bf2140e74d Lattice_reduction_sycl: fix double-precision accumulation in sumD_gpu_tensor
Accumulate in sobjD throughout rather than accumulating in sobj and
converting the final sum. For float fields this matters: summing N floats
then casting loses O(N*eps_float) relative precision vs accumulating in
double from the start.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-21 12:34:30 -04:00
Peter Boyle a1119266c1 Revert to hand-rolled reduction; drop Lattice_reduction_gpu_cub.h
Remove the CUB/hipCUB direction entirely. Restore Lattice_reduction_gpu.h,
Lattice_reduction_sycl.h, and Lattice_reduction.h to the state before the
CUB rewrite (commit 969b0a39), recovering the original primary function names
(sumD_gpu_small, sumD_gpu_large, sumD_gpu, sum_gpu, sum_gpu_large) and the
hand-rolled shared-memory reduction kernel.

Delete Lattice_reduction_gpu_cub.h. Update Test_reduction to remove the
old/new comparison sections that depended on sum_gpu_old.

The lesson: CUB DeviceReduce is slower than the hand-rolled kernel for small
types, and the smem sizing problem for the extraction pass has no clean
solution within the accelerator_for abstraction. The right improvement is
a higher radix (12 then 4) in sumD_gpu_large, applied directly to the
existing hand-rolled kernel.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-21 12:34:30 -04:00
Peter Boyle a0f00c0eca sumD_gpu_direct: revert to per-lane write; CUB handles Nsimd*osites inputs
Benchmarking showed the shared-memory lane-summation approach (843d6497)
was slower than writing each SIMD lane individually and letting CUB reduce
the full nlanes = osites*Nsimd array. CUB's device reduce is more efficient
over the larger input than the smem overhead + serialised lane-0 summation.
The smem approach also required overriding acceleratorThreads() to avoid
the block-size sizing problem. Restore the simpler per-lane path.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-21 12:34:30 -04:00
Peter Boyle d358954a84 sumD_gpu_direct: shared-memory lane reduction with acceleratorThreads(1)
Set acceleratorThreads to 1 before the extraction kernel so that
dim3(nsimd,1,1) blocks give exactly one site group per block and
__shared__ sobjD smem[nsimd] is correctly sized without depending on
the runtime acceleratorThreads() value. threadIdx.x (acceleratorSIMTlane)
indexes the SIMD lane for coalesced reads; lane 0 sums smem[0..nsimd-1]
and writes one sobjD per site. CUB then reduces osites elements instead
of osites*nsimd, reducing both store traffic and CUB work by Nsimd.
acceleratorSynchronise() (warp-level) suffices since nsimd < warpSize.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-21 12:34:30 -04:00
Peter Boyle aee00bdfb5 sumD_gpu_direct: one thread per SIMD lane using extractLane
Replaces one thread per outer site calling Reduce() (sequential Nsimd-wide
loop) with one thread per lane calling extractLane() — O(1) per thread.
CUB now reduces over osites*Nsimd elements. Avoids serial lane reduction
but leaves the per-lane sobjD store stride as a known remaining concern.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-21 12:34:30 -04:00
Peter Boyle cf324b0fa1 Lattice_reduction_gpu_cub: define GRID_REDUCTION_TIMING in header
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-21 12:34:30 -04:00
Peter Boyle b314dc224d Lattice_reduction_gpu_cub: add GRID_REDUCTION_TIMING instrumentation
Guards accelerator_for and CUB DeviceReduce calls in sumD_gpu_direct
and sumD_gpu_large with #ifdef GRID_REDUCTION_TIMING to isolate where
time is spent in each path. Large path accumulates across all groups
and prints totals with words/nfull/rem context.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-21 12:34:30 -04:00
Peter Boyle 1bbd62498e Lattice_reduction_gpu_cub: replace WordBundle4 with iVector<iScalar<scalarD>,4>
WordBundle4 was redundant with Grid's existing tensor infrastructure.
iVector<iScalar<scalarD>,4> already provides accelerator_inline operator+,
zeroit(), and sycl::is_device_copyable — no new type needed.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-21 12:34:30 -04:00
Peter Boyle f3c3b1c04b Test_reduction: add timing benchmark for new vs old reduction paths
Reports us/call and GB/s for sum_gpu (CUB/sycl::reduction) and
sum_gpu_old (hand-rolled shared-memory) for each field type, with
5-call warmup and 100-call timed loop.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-21 12:34:30 -04:00
Peter Boyle 069f98b253 skills: HPC battle-hardening skill files for GPU+MPI correctness
Six skill files encoding expertise for making codebases robust on
problematic HPC systems, covering: correctness verification
(double-run, fingerprinting, flight recorder), hang diagnosis,
GPU runtime correctness (premature barrier, infinite poll),
MPI correctness on heterogeneous systems (device buffer aliasing,
AARCH64 PLT corruption, deterministic reductions),
compiler validation, and communication/computation overlap pipeline
design.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-21 12:34:30 -04:00
Peter Boyle dfd0503eae Test_reduction: use separate float and double grids
Float fields require a grid constructed with vComplexF::Nsimd(); using
a double grid causes grid->_gsites to undercount the sites in float
vobjF, making the constant-field expected value wrong.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-21 12:34:30 -04:00
Peter Boyle c629b2e87e Rename scalarNorm2 to squaredSum in Test_reduction.cc
The function computes |sum|^2 — the squared magnitude of an aggregate sum —
not a norm. squaredSum makes clear that squaring is applied to the sum, not
to individual site values before summing, distinguishing it from sumOfSquares
(the squared L2 norm).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-21 12:34:30 -04:00
Peter Boyle 7c8462abd1 Fix Zero() used on thrust::complex in WordBundle4 initialisation
Grid's Zero() sentinel is not assignable to thrust::complex<double>;
use scalarD(0) instead.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-21 12:34:30 -04:00
Peter Boyle 95a6a0bde7 Reinstate large/small dispatch in CUB reduction path; radix-4 word-bundle for large types
rocPRIM's DeviceReduce requires warpSize(64) threads each holding one element in shared
memory, so sizeof(T)*64 must fit in sharedMemPerBlock.  LatticePropagator::scalar_objectD
is 2304 bytes (64*2304 = 147 KB), exceeding the budget and triggering a compile-time
static_assert in limit_block_size.

Introduce sumD_gpu_direct (the original direct-CUB path, safe for small types) and a new
sumD_gpu_large that groups the vobj's vector_type words in bundles of 4, reducing each
bundle as WordBundle4<scalarD> (64 bytes, 64*64 = 4 KB — always within budget).  If
words % 4 != 0, the final partial bundle is zero-padded.  sumD_gpu dispatches at compile
time via if constexpr on sizeof(sobjD) > 512.

For LatticePropagator (144 words) this gives 36 CUB launches instead of 144.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-21 12:34:30 -04:00
Peter Boyle bba328fac5 Add Test_reduction to tests/debug
Tests the new CUB/hipCUB/SYCL lattice reduction (sum_gpu) against the
preserved hand-rolled implementation (sum_gpu_old) for LatticeComplexF/D,
LatticeColourMatrixF/D and LatticePropagatorF/D.

Part a) gaussian random field: checks that old and new agree to within
float/double roundoff tolerance.
Part b) constant field (= 1.0, identity-matrix init): verifies
innerProduct(sum, sum) = Ncomp * V^2 where Ncomp counts the nonzero
diagonal scalar components per site (1 / Nc / Ns*Nc respectively).

Make.inc is auto-generated by scripts/filelist on bootstrap and is not
tracked; the new .cc file is all that is needed.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-21 12:34:30 -04:00
Peter Boyle 41362349f3 Rewrite lattice GPU reduction to use CUB, hipCUB, and SYCL reduction
Replace hand-rolled shared-memory reduction kernels (reduceBlock/reduceBlocks/
reduceKernel) and the global device variable retirementCount with a unified
CUB/hipCUB DeviceReduce::Reduce path for CUDA/HIP and sycl::reduction for SYCL.
No small/large split is needed: both CUB and sycl::reduction handle arbitrary
object sizes internally.

Old implementations preserved as sum_gpu_old / sumD_gpu_old etc. in the
original files for regression testing on GPU hardware.

Also add CLAUDE.md with build, test, and architecture guidance.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-21 12:34:30 -04:00
Peter Boyle a5a04929fb Merge pull request #492 from giltirn/develop
Fixes to support CUDA > 13
2026-05-19 15:26:58 -04:00
Christopher Kelly 77b8657fcc Fixes to support CUDA > 13. Specifically, the CUDA header is no longer accidentally included within Grid's namespace, and the breaking change to cub::Sum() -> ::cuda::std::plus<>{} in CUDA-13 has been worked around 2026-05-19 12:22:14 -04:00
Peter Boyle f8b2eacf99 File list issue (Ed Bennets pull request?) 2026-05-15 12:57:42 -04:00
Peter Boyle 6140ac6864 Hip Happy 2026-05-15 12:13:01 -04:00
Peter Boyle c6c2834e03 Hip Happy 2026-05-15 11:30:29 -04:00
Peter Boyle 856545a1db Support ROCM 7.0.2 2026-05-15 11:30:29 -04:00
Peter Boyle e2d607f6c7 Merge pull request #490 from jdmaia/hip-guard-acceleratorfor2dNB
[HIP] Including kernel launch parameter guard on accelerator_for2dNB
2026-05-06 14:51:30 -04:00
Julio Maia 66da4e0657 Including guard on accelerator_for2dNB against invalid kernel configurations if GRID_HIP 2026-05-06 13:26:33 -05:00
Peter Boyle b37390bb5a 4 node usqcd run 2026-04-27 14:40:11 -07:00
Peter Boyle 829dc8cceb 32 node 2026-04-27 14:38:02 -07:00
Peter Boyle 13cc2c39f5 FOM run 2026-04-27 14:20:49 -07:00
Peter Boyle 66ea3b271c Merge branch 'develop' of https://github.com/paboyle/Grid into develop 2026-04-27 13:55:52 -07:00
Peter Boyle d293b58a20 384 node baseline run 2026-04-27 13:54:40 -07:00
Peter Boyle ce093b2bf3 rdtsc 2026-04-27 13:54:06 -07:00
Peter Boyle e4404efe5a Perlmutter compile update 2026-04-27 13:53:28 -07:00
Peter Boyle 5ce270f1de Adding Claude related files 2026-04-21 10:41:18 -04:00
Peter Boyle af43b067a0 New CLAUDE controllable visualiser 2026-04-10 11:23:25 -04:00
Quadro 34b44d1fee New file for animation in MD time direction 2026-04-02 13:55:38 -04:00
Peter Boyle 595ceaac37 Include grid header and make the ENABLE correct 2026-03-11 17:24:44 -04:00
Peter Boyle daf5834e8e Fixing incorrect PR about disable fermion instantiations 2026-03-11 17:05:46 -04:00
Peter Boyle 0d8658a039 Optimised 2026-03-05 06:06:32 -05:00
Peter Boyle 095e004d01 Setup change GCR 2026-03-05 06:06:32 -05:00
Peter Boyle 0acabee7f6 Modest change 2026-03-05 06:06:32 -05:00
Peter Boyle 76fbcffb60 Improvement to 16^3 hdcg 2026-03-05 06:06:32 -05:00
Peter Boyle a0a62d7ead Merge pull request #478 from vataspro/PolyakovUpstream
Spatial Polyakov Loop implementation
2026-02-24 20:45:42 -05:00
Peter Boyle c5038ea6a5 Merge pull request #483 from cmcknigh/bugfix/rocm7-rocblas-type-refactor
Adding a version check to handle rocBlas type refactor
2026-02-24 20:45:03 -05:00
Peter Boyle a5120903eb Merge pull request #486 from RChrHill/fix/sp4-fp32
Define Sp4 ProjectOnGeneralGroup for generic vtype
2026-02-24 20:44:08 -05:00
Peter Boyle 00b286a08a Merge pull request #488 from RChrHill/feature/additional-ET-traces
Add ET support for Lattice spin- and colour-traces
2026-02-24 20:43:45 -05:00
Peter Boyle 24a9759353 Merge pull request #485 from edbennett/skip-fermion-instantiations
Be able to skip compiling fermion instantiations altogether
2026-02-24 20:43:20 -05:00
edbennett 1b56f6f46d be able to skip compiling fermion instantiations altogether 2026-02-24 23:52:18 +00:00
Peter Boyle 2a8084d569 Subspace setup 2026-02-13 17:26:11 -05:00
Peter Boyle 6ff29f9d4f Alternate multigrids 2026-02-13 17:25:45 -05:00
RChHill c4d3e79193 Add ET support for Lattice spin- and colour-traces 2026-01-29 14:46:52 +00:00
Peter Boyle 7cd3f21e6b preserving a bunch of experiments on setup and g5 subspace doubling 2026-01-06 05:57:39 -05:00
RChHill b650b89682 Define Sp4 ProjectOnGeneralGroup for generic vtype 2025-11-19 13:26:52 +00:00
Allen McKnight 4304245c1b Merge branch 'develop' into bugfix/rocm7-rocblas-type-refactor 2025-11-04 08:50:11 -06:00
Your Name 1d1fd3bcaf adding a version check to handle rocblas type change 2025-10-02 15:24:24 -05:00
Alexis Provatas c646d91527 Fix names, protect against bad index values, clean docstrings 2025-05-01 10:52:00 +01:00
Alexis Provatas a2b98d82e1 remove obsolete spatial polyakov observable file 2025-05-01 10:52:00 +01:00
Alexis Provatas 7b9415c088 Move observable logger to Polyakov Loop file and fix docstring 2025-05-01 10:52:00 +01:00
Alexis Provatas cb7110f492 Add Spatial Polyakov Loop observable 2025-05-01 10:52:00 +01:00
Alexis Provatas 0c7af66490 Create Spatial Polyakov Observable Module 2025-05-01 10:52:00 +01:00
Alexis Provatas 496d1b914a Generalise Polyakov loop and overload for temporal direction 2025-05-01 10:52:00 +01:00
114 changed files with 10044 additions and 774 deletions
+115
View File
@@ -0,0 +1,115 @@
# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## What This Is
Grid is a data-parallel C++ library for lattice QCD. It provides SIMD-vectorised lattice containers, MPI-based domain decomposition, GPU acceleration (CUDA/HIP/SYCL), and a full suite of QCD algorithms including HMC.
## Build
Uses GNU Autotools. The bootstrap step only needs to run once (or after `configure.ac` changes).
```bash
./bootstrap.sh # downloads Eigen 3.4.0, generates configure
mkdir build && cd build
../configure [options]
make -j$(nproc)
make check # run root-level tests
make install
```
Key configure options:
| Option | Common values |
|--------|---------------|
| `--enable-simd=` | `AVX2`, `AVX512`, `KNL`, `A64FX`, `NEONv8`, `GPU` |
| `--enable-comms=` | `mpi-auto`, `mpi3-auto`, `none` |
| `--enable-accelerator=` | `cuda`, `hip`, `sycl` |
| `--enable-shm=` | `shmopen`, `hugetlbfs`, `nvlink` |
| `--enable-Nc=` | `3` (default), `2`, `4`, `5` |
| `--with-gmp=`, `--with-mpfr=`, `--with-fftw=`, `--with-lime=` | paths to libs |
| `--enable-hdf5`, `--enable-mkl`, `--enable-lapack` | optional features |
| `--enable-debug=yes` | adds `-g`, removes `-O3` |
Use `make V=1` for verbose compiler output (shows full flags; required for bug reports).
Platform recipes from `README.md`:
- **KNL**: `--enable-simd=KNL --enable-comms=mpi3-auto --enable-mkl`
- **Skylake/Haswell**: `--enable-simd=AVX512` or `AVX2` + `--enable-comms=mpi3-auto`
- **AMD EPYC**: `--enable-simd=AVX2 --enable-comms=mpi3`
- **A64FX (Fugaku)**: `--enable-simd=A64FX --enable-comms=mpi3 --enable-shm=shmget` (see `SVE_README.txt`)
Required external libs: GMP, MPFR, OpenSSL, zlib.
## Running Tests
```bash
# From build directory
make check # root-level tests (Test_simd, Test_cshift, etc.)
make -C tests/<subdir> tests # build tests in a subdirectory
./tests/core/Test_simd # run a single test binary directly
```
Test subdirectories and their focus: `core` (SIMD, stencil, comms), `solver` (CG, GMRES, eigensolvers), `hmc` (MD integrators), `forces` (fermion forces), `lanczos`, `IO`, `smearing`, `sp2n`, `debug`.
## Architecture
### Layer stack (bottom to top)
1. **SIMD layer** (`Grid/simd/`) — platform-specific intrinsics wrapped into `vRealF`, `vComplexD`, etc. The SIMD width and layout are compile-time constants controlled by `--enable-simd`.
2. **Tensor layer** (`Grid/tensors/`) — Lorentz/colour/spin tensor algebra built on top of SIMD types. `iMatrix`, `iVector`, `iScalar` templates compose into QCD types like `ColourMatrix`, `SpinColourVector`.
3. **Lattice layer** (`Grid/lattice/`) — `Lattice<T>` container: a site-local tensor replicated across a distributed Cartesian grid. All arithmetic is site-parallel and expression-template-fused.
4. **Cartesian/comms layer** (`Grid/cartesian/`, `Grid/communicator/`) — `GridCartesian` holds the MPI topology and local/global geometry. `Grid/cshift/` implements nearest-neighbour halo exchange; `Grid/stencil/` is the optimised multi-hop stencil used by Dirac operators.
5. **Algorithm layer** (`Grid/algorithms/`) — iterative solvers (CG, GMRES, BiCGSTAB, mixed-precision), eigensolvers (Lanczos, LAPACK), FFT, smearing.
6. **QCD layer** (`Grid/qcd/`) — gauge and fermion actions, HMC integrators, observables.
### QCD subsystem (`Grid/qcd/`)
- `action/fermion/` — Wilson, Clover, DWF (Mobius), Staggered, twisted-mass, G-parity variants
- `action/gauge/` — Wilson gauge, Symanzik, Iwasaki, DBW2, plaquette+rect
- `representations/` — Fundamental, Adjoint, Two-index, Sp(2n)
- `hmc/` — Leapfrog, OMF2/OMF4 integrators; pseudofermion refreshment; Metropolis accept/reject
- `smearing/` — APE, Stout, HEX, gradient flow
- `observables/` — Polyakov loop, plaquette, topological charge
### GPU acceleration
GPU support is injected via macros (`accelerator_for`, `accelerator_for2dNB`). The `Grid/simd/` SIMD types map to scalar on GPU device code; host code paths remain vectorised. Unified virtual memory is on by default (`--enable-unified=yes`); device-aware MPI (`--enable-accelerator-aware-mpi`) avoids device→host copies on transfers.
### Memory and I/O
- `Grid/allocator/` — aligned/NUMA-aware allocators; caching allocator via `--enable-alloc-cache`
- `Grid/parallelIO/` — distributed parallel reader/writer for ILDG (via LIME), SciDAC, and native binary formats
- `Grid/serialisation/` — text, binary, HDF5, XML/JSON serialisation of arbitrary Grid objects
### HMC applications
`HMC/` contains production-ready HMC driver programmes (e.g. `Mobius2p1f.cc`, `DWF_plus_DSDR_nf2plus1_Shamir_Gparity.cc`). These are built separately from the library tests.
## Key Conventions
- **C++17** is required throughout.
- Template structure: most classes are templated on `<_FImpl>` (fermion impl) or `<Gimpl>` (gauge impl), which encode the representation and precision. Instantiation is controlled by `--enable-fermion-instantiations`.
- The `RealD`/`RealF`/`ComplexD`/`ComplexF` typedefs are used everywhere; avoid raw `double`/`float`.
- Logging uses `Grid_log`, `Grid_error` macros (from `Grid/log/`); performance-critical paths use the `GRID_TRACE` / timer macros from `Grid/perfmon/`.
- Reductions across MPI ranks go through `GridBase::GlobalSum` / `GlobalMax`; never reduce with bare MPI calls inside library code.
## Skills
`skills/` contains seven user-invocable Claude Code skills encoding deep domain knowledge for HPC work in this repo. Invoke with `/skill-name` or ask Claude to use them by name:
| Skill | When to use |
|-------|-------------|
| `gpu-memory-performance` | Bandwidth/occupancy problems — `acceleratorThreads()` pitfalls, `coalescedRead/Write`, fused vs staged HBM patterns |
| `gpu-runtime-correctness` | Wrong answers, non-deterministic results, premature `q.wait()` returns |
| `communication-overlap` | Designing GPU+MPI overlap pipelines; replacing broken accelerator-aware MPI paths with host-staged 7-phase pipeline |
| `mpi-heterogeneous` | Collective hangs, buffer aliasing in `MPI_Sendrecv`, heterogeneous topology bugs |
| `hang-diagnosis` | Distinguishing ioctl hangs, infinite poll loops, collective deadlocks, and silent wrong-answer races |
| `correctness-verification` | Reproducibility checksums, double-wait testing, bisecting non-deterministic failures |
| `compiler-validation` | Confirming compiler/optimisation flags are safe before production runs |
+9 -7
View File
@@ -54,22 +54,24 @@ Version.h: version-cache
include Make.inc
include Eigen.inc
extra_sources+=$(WILS_FERMION_FILES)
extra_sources+=$(STAG_FERMION_FILES)
if BUILD_FERMION_INSTANTIATIONS
extra_sources+=$(WILS_FERMION_FILES)
extra_sources+=$(STAG_FERMION_FILES)
if BUILD_ZMOBIUS
extra_sources+=$(ZWILS_FERMION_FILES)
extra_sources+=$(ZWILS_FERMION_FILES)
endif
if BUILD_GPARITY
extra_sources+=$(GP_FERMION_FILES)
extra_sources+=$(GP_FERMION_FILES)
endif
if BUILD_FERMION_REPS
extra_sources+=$(ADJ_FERMION_FILES)
extra_sources+=$(TWOIND_FERMION_FILES)
extra_sources+=$(ADJ_FERMION_FILES)
extra_sources+=$(TWOIND_FERMION_FILES)
endif
if BUILD_SP
extra_sources+=$(SP_FERMION_FILES)
if BUILD_FERMION_REPS
extra_sources+=$(SP_TWOIND_FERMION_FILES)
extra_sources+=$(SP_TWOIND_FERMION_FILES)
endif
endif
endif
+305 -308
View File
@@ -1,6 +1,6 @@
/*************************************************************************************
Grid physics library, www.github.com/paboyle/Grid
Grid physics library, www.github.com/paboyle/Grid
Source file: ./lib/Cshift.h
@@ -65,17 +65,16 @@ public:
typedef hipfftDoubleComplex FFTW_scalar;
typedef hipfftHandle FFTW_plan;
static FFTW_plan fftw_plan_many_dft(int rank, int *n,int howmany,
FFTW_scalar *in, int *inembed,
int istride, int idist,
FFTW_scalar *out, int *onembed,
int ostride, int odist,
FFTW_scalar *in, int *inembed,
int istride, int idist,
FFTW_scalar *out, int *onembed,
int ostride, int odist,
int sign, unsigned flags) {
FFTW_plan p;
auto rv = hipfftPlanMany(&p,rank,n,n,istride,idist,n,ostride,odist,HIPFFT_Z2Z,howmany);
GRID_ASSERT(rv==HIPFFT_SUCCESS);
return p;
}
}
inline static void fftw_execute_dft(const FFTW_plan p,FFTW_scalar *in,FFTW_scalar *out, int sign) {
hipfftResult rv;
if ( sign == forward ) rv =hipfftExecZ2Z(p,in,out,HIPFFT_FORWARD);
@@ -83,29 +82,25 @@ public:
accelerator_barrier();
GRID_ASSERT(rv==HIPFFT_SUCCESS);
}
inline static void fftw_destroy_plan(const FFTW_plan p) {
hipfftDestroy(p);
}
inline static void fftw_destroy_plan(const FFTW_plan p) { hipfftDestroy(p); }
};
template<> struct FFTW<ComplexF> {
public:
static const int forward=FFTW_FORWARD;
static const int backward=FFTW_BACKWARD;
typedef hipfftComplex FFTW_scalar;
typedef hipfftHandle FFTW_plan;
typedef hipfftComplex FFTW_scalar;
typedef hipfftHandle FFTW_plan;
static FFTW_plan fftw_plan_many_dft(int rank, int *n,int howmany,
FFTW_scalar *in, int *inembed,
int istride, int idist,
FFTW_scalar *out, int *onembed,
int ostride, int odist,
FFTW_scalar *in, int *inembed,
int istride, int idist,
FFTW_scalar *out, int *onembed,
int ostride, int odist,
int sign, unsigned flags) {
FFTW_plan p;
auto rv = hipfftPlanMany(&p,rank,n,n,istride,idist,n,ostride,odist,HIPFFT_C2C,howmany);
GRID_ASSERT(rv==HIPFFT_SUCCESS);
return p;
}
}
inline static void fftw_execute_dft(const FFTW_plan p,FFTW_scalar *in,FFTW_scalar *out, int sign) {
hipfftResult rv;
if ( sign == forward ) rv =hipfftExecC2C(p,in,out,HIPFFT_FORWARD);
@@ -113,9 +108,7 @@ public:
accelerator_barrier();
GRID_ASSERT(rv==HIPFFT_SUCCESS);
}
inline static void fftw_destroy_plan(const FFTW_plan p) {
hipfftDestroy(p);
}
inline static void fftw_destroy_plan(const FFTW_plan p) { hipfftDestroy(p); }
};
#endif
@@ -126,53 +119,45 @@ public:
static const int backward=FFTW_BACKWARD;
typedef cufftDoubleComplex FFTW_scalar;
typedef cufftHandle FFTW_plan;
static FFTW_plan fftw_plan_many_dft(int rank, int *n,int howmany,
FFTW_scalar *in, int *inembed,
int istride, int idist,
FFTW_scalar *out, int *onembed,
int ostride, int odist,
FFTW_scalar *in, int *inembed,
int istride, int idist,
FFTW_scalar *out, int *onembed,
int ostride, int odist,
int sign, unsigned flags) {
FFTW_plan p;
cufftPlanMany(&p,rank,n,n,istride,idist,n,ostride,odist,CUFFT_Z2Z,howmany);
return p;
}
}
inline static void fftw_execute_dft(const FFTW_plan p,FFTW_scalar *in,FFTW_scalar *out, int sign) {
if ( sign == forward ) cufftExecZ2Z(p,in,out,CUFFT_FORWARD);
else cufftExecZ2Z(p,in,out,CUFFT_INVERSE);
accelerator_barrier();
}
inline static void fftw_destroy_plan(const FFTW_plan p) {
cufftDestroy(p);
}
inline static void fftw_destroy_plan(const FFTW_plan p) { cufftDestroy(p); }
};
template<> struct FFTW<ComplexF> {
public:
static const int forward=FFTW_FORWARD;
static const int backward=FFTW_BACKWARD;
typedef cufftComplex FFTW_scalar;
typedef cufftHandle FFTW_plan;
typedef cufftHandle FFTW_plan;
static FFTW_plan fftw_plan_many_dft(int rank, int *n,int howmany,
FFTW_scalar *in, int *inembed,
int istride, int idist,
FFTW_scalar *out, int *onembed,
int ostride, int odist,
FFTW_scalar *in, int *inembed,
int istride, int idist,
FFTW_scalar *out, int *onembed,
int ostride, int odist,
int sign, unsigned flags) {
FFTW_plan p;
cufftPlanMany(&p,rank,n,n,istride,idist,n,ostride,odist,CUFFT_C2C,howmany);
return p;
}
}
inline static void fftw_execute_dft(const FFTW_plan p,FFTW_scalar *in,FFTW_scalar *out, int sign) {
if ( sign == forward ) cufftExecC2C(p,in,out,CUFFT_FORWARD);
else cufftExecC2C(p,in,out,CUFFT_INVERSE);
accelerator_barrier();
}
inline static void fftw_destroy_plan(const FFTW_plan p) {
cufftDestroy(p);
}
inline static void fftw_destroy_plan(const FFTW_plan p) { cufftDestroy(p); }
};
#endif
@@ -183,313 +168,325 @@ public:
typedef fftw_complex FFTW_scalar;
typedef fftw_plan FFTW_plan;
static FFTW_plan fftw_plan_many_dft(int rank, int *n,int howmany,
FFTW_scalar *in, int *inembed,
int istride, int idist,
FFTW_scalar *out, int *onembed,
int ostride, int odist,
FFTW_scalar *in, int *inembed,
int istride, int idist,
FFTW_scalar *out, int *onembed,
int ostride, int odist,
int sign, unsigned flags) {
return ::fftw_plan_many_dft(rank,n,howmany,in,inembed,istride,idist,out,onembed,ostride,odist,sign,flags);
}
}
inline static void fftw_execute_dft(const FFTW_plan p,FFTW_scalar *in,FFTW_scalar *out, int sign) {
::fftw_execute_dft(p,in,out);
}
inline static void fftw_destroy_plan(const FFTW_plan p) {
::fftw_destroy_plan(p);
}
inline static void fftw_destroy_plan(const FFTW_plan p) { ::fftw_destroy_plan(p); }
};
template<> struct FFTW<ComplexF> {
public:
typedef fftwf_complex FFTW_scalar;
typedef fftwf_plan FFTW_plan;
static FFTW_plan fftw_plan_many_dft(int rank, int *n,int howmany,
FFTW_scalar *in, int *inembed,
int istride, int idist,
FFTW_scalar *out, int *onembed,
int ostride, int odist,
FFTW_scalar *in, int *inembed,
int istride, int idist,
FFTW_scalar *out, int *onembed,
int ostride, int odist,
int sign, unsigned flags) {
return ::fftwf_plan_many_dft(rank,n,howmany,in,inembed,istride,idist,out,onembed,ostride,odist,sign,flags);
}
}
inline static void fftw_execute_dft(const FFTW_plan p,FFTW_scalar *in,FFTW_scalar *out, int sign) {
::fftwf_execute_dft(p,in,out);
}
inline static void fftw_destroy_plan(const FFTW_plan p) {
::fftwf_destroy_plan(p);
}
inline static void fftw_destroy_plan(const FFTW_plan p) { ::fftwf_destroy_plan(p); }
};
#endif
#endif
class FFT {
private:
double flops;
double flops_call;
uint64_t usec;
public:
static const int forward=FFTW_FORWARD;
static const int backward=FFTW_BACKWARD;
double Flops(void) {return flops;}
double MFlops(void) {return flops/usec;}
double USec(void) {return (double)usec;}
struct FFTbase {
double flops;
double flops_call;
uint64_t usec;
GridCartesian *_grid;
FFT ( GridCartesian * grid )
{
flops=0;
usec =0;
};
~FFT ( void) {
// delete sgrid;
static const int forward = FFTW_FORWARD;
static const int backward = FFTW_BACKWARD;
double Flops(void) { return flops; }
double MFlops(void) { return flops / usec; }
double USec(void) { return (double)usec; }
FFTbase(GridCartesian *grid) : _grid(grid), flops(0), flops_call(0), usec(0) {}
};
// Barrel-shift gather, FFT execute, and insert. Called by both FFT and PlannedFFT.
// The caller is responsible for plan acquisition and destruction.
template<class vobj>
static void FFT_dim_execute(
Lattice<vobj> &result,
const Lattice<vobj> &source,
int dim, int sign,
typename FFTW<typename vobj::scalar_type>::FFTW_plan p,
GridCartesian *grid,
double &flops, double &flops_call, uint64_t &usec)
{
typedef typename vobj::scalar_type scalar;
typedef typename vobj::scalar_object sobj;
typedef typename vobj::scalar_type scalar_type;
typedef typename vobj::vector_type vector_type;
typedef typename FFTW<scalar>::FFTW_scalar FFTW_scalar;
const int Ndim = grid->Nd();
int L = grid->_ldimensions[dim];
int G = grid->_fdimensions[dim];
int Ncomp = sizeof(sobj) / sizeof(scalar);
int64_t Nlow = 1, Nhigh = 1;
for (int d = 0; d < dim; d++) Nlow *= grid->_ldimensions[d];
for (int d = dim+1; d < Ndim; d++) Nhigh *= grid->_ldimensions[d];
int64_t Nperp = Nlow * Nhigh;
deviceVector<scalar> pgbuf(Nperp * Ncomp * G);
scalar *pgbuf_v = &pgbuf[0];
int howmany = Ncomp * Nperp;
scalar div;
if (sign == FFTW_BACKWARD) div = 1.0 / G;
else if (sign == FFTW_FORWARD) div = 1.0;
else GRID_ASSERT(0);
double t_pencil = 0, t_fft = 0, t_copy = 0, t_shift = 0;
double t_total = -usecond();
result = source;
int pc = grid->_processor_coor[dim];
const Coordinate ldims = grid->_ldimensions;
const Coordinate rdims = grid->_rdimensions;
const Coordinate sdims = grid->_simd_layout;
const Coordinate processors = grid->_processors;
Coordinate pgdims(Ndim);
pgdims[0] = G;
for (int d = 0, dd = 1; d < Ndim; d++)
if (d != dim) pgdims[dd++] = ldims[d];
int64_t pgvol = 1;
for (int d = 0; d < Ndim; d++) pgvol *= pgdims[d];
const int Nsimd = vobj::Nsimd();
t_pencil = -usecond();
for (int p_idx = 0; p_idx < processors[dim]; p_idx++) {
t_copy -= usecond();
autoView(r_v, result, AcceleratorRead);
accelerator_for(idx, grid->oSites(), vobj::Nsimd(), {
#ifdef GRID_SIMT
{
int lane = acceleratorSIMTlane(Nsimd);
#else
for (int lane = 0; lane < Nsimd; lane++) {
#endif
Coordinate icoor, ocoor, pgcoor;
Lexicographic::CoorFromIndex(icoor, lane, sdims);
Lexicographic::CoorFromIndex(ocoor, idx, rdims);
pgcoor[0] = ocoor[dim] + icoor[dim]*rdims[dim] + ((pc+p_idx)%processors[dim])*L;
for (int d = 0, dd = 1; d < Ndim; d++)
if (d != dim) { pgcoor[dd] = ocoor[d] + icoor[d]*rdims[d]; dd++; }
int64_t pgidx;
Lexicographic::IndexFromCoor(pgcoor, pgidx, pgdims);
vector_type *from = (vector_type *)&r_v[idx];
scalar_type stmp;
for (int w = 0; w < Ncomp; w++) {
stmp = getlane(from[w], lane);
pgbuf_v[pgidx + w*pgvol] = stmp;
}
#ifdef GRID_SIMT
}
#else
}
#endif
});
t_copy += usecond();
if (p_idx != processors[dim] - 1) {
Lattice<vobj> temp(grid);
t_shift -= usecond();
temp = Cshift(result, dim, L); result = temp;
t_shift += usecond();
}
}
template<class vobj>
void FFT_dim_mask(Lattice<vobj> &result,const Lattice<vobj> &source,Coordinate mask,int sign){
t_pencil += usecond();
// vgrid=result.Grid();
// conformable(result.Grid(),vgrid);
// conformable(source.Grid(),vgrid);
const int Ndim = source.Grid()->Nd();
FFTW_scalar *in = (FFTW_scalar *)pgbuf_v;
FFTW_scalar *out = (FFTW_scalar *)pgbuf_v;
t_fft = -usecond();
FFTW<scalar>::fftw_execute_dft(p, in, out, sign);
t_fft += usecond();
flops_call = 5.0 * howmany * G * log2(G);
usec = t_fft;
flops = flops_call;
result = Zero();
double t_insert = -usecond();
{
autoView(r_v, result, AcceleratorWrite);
accelerator_for(idx, grid->oSites(), Nsimd, {
#ifdef GRID_SIMT
{
int lane = acceleratorSIMTlane(Nsimd);
#else
for (int lane = 0; lane < Nsimd; lane++) {
#endif
Coordinate icoor(Ndim), ocoor(Ndim), pgcoor(Ndim);
Lexicographic::CoorFromIndex(icoor, lane, sdims);
Lexicographic::CoorFromIndex(ocoor, idx, rdims);
pgcoor[0] = ocoor[dim] + icoor[dim]*rdims[dim] + pc*L;
for (int d = 0, dd = 1; d < Ndim; d++)
if (d != dim) { pgcoor[dd] = ocoor[d] + icoor[d]*rdims[d]; dd++; }
int64_t pgidx;
Lexicographic::IndexFromCoor(pgcoor, pgidx, pgdims);
vector_type *to = (vector_type *)&r_v[idx];
scalar_type stmp;
for (int w = 0; w < Ncomp; w++) {
stmp = pgbuf_v[pgidx + w*pgvol];
putlane(to[w], stmp, lane);
}
#ifdef GRID_SIMT
}
#else
}
#endif
});
}
result = result * div;
t_insert += usecond();
t_total += usecond();
std::cout << GridLogPerformance << " FFT took " << t_total/1.0e6 << " s" << std::endl;
std::cout << GridLogPerformance << " FFT pencil " << t_pencil/1.0e6 << " s" << std::endl;
std::cout << GridLogPerformance << " of which copy " << t_copy/1.0e6 << " s" << std::endl;
std::cout << GridLogPerformance << " of which shift" << t_shift/1.0e6 << " s" << std::endl;
std::cout << GridLogPerformance << " FFT kernels " << t_fft/1.0e6 << " s" << std::endl;
std::cout << GridLogPerformance << " FFT insert " << t_insert/1.0e6 << " s" << std::endl;
}
class FFT : public FFTbase {
public:
FFT(GridCartesian *grid) : FFTbase(grid) {}
~FFT() {}
template<class vobj>
void FFT_dim_mask(Lattice<vobj> &result, const Lattice<vobj> &source, Coordinate mask, int sign) {
const int Ndim = _grid->Nd();
Lattice<vobj> tmp = source;
for(int d=0;d<Ndim;d++){
if( mask[d] ) {
FFT_dim(result,tmp,d,sign);
tmp=result;
for (int d = 0; d < Ndim; d++) {
if (mask[d]) {
FFT_dim(result, tmp, d, sign);
tmp = result;
}
}
}
template<class vobj>
void FFT_all_dim(Lattice<vobj> &result,const Lattice<vobj> &source,int sign){
const int Ndim = source.Grid()->Nd();
Coordinate mask(Ndim,1);
FFT_dim_mask(result,source,mask,sign);
void FFT_all_dim(Lattice<vobj> &result, const Lattice<vobj> &source, int sign) {
Coordinate mask(_grid->Nd(), 1);
FFT_dim_mask(result, source, mask, sign);
}
template<class vobj>
void FFT_dim(Lattice<vobj> &result,const Lattice<vobj> &source,int dim, int sign){
const int Ndim = source.Grid()->Nd();
GridBase *grid = source.Grid();
conformable(result.Grid(),source.Grid());
void FFT_dim(Lattice<vobj> &result, const Lattice<vobj> &source, int dim, int sign) {
GRID_ASSERT(source.Grid() == _grid);
GRID_ASSERT(result.Grid() == _grid);
conformable(result.Grid(), source.Grid());
int L = grid->_ldimensions[dim];
int G = grid->_fdimensions[dim];
Coordinate layout(Ndim,1);
// Construct pencils
typedef typename vobj::scalar_object sobj;
typedef typename vobj::scalar_type scalar;
typedef typename vobj::scalar_type scalar_type;
typedef typename vobj::vector_type vector_type;
//std::cout << "CPU view" << std::endl;
typedef typename vobj::scalar_object sobj;
typedef typename FFTW<scalar>::FFTW_scalar FFTW_scalar;
typedef typename FFTW<scalar>::FFTW_plan FFTW_plan;
int Ncomp = sizeof(sobj)/sizeof(scalar);
int64_t Nlow = 1;
int64_t Nhigh = 1;
for(int d=0;d<dim;d++){
Nlow*=grid->_ldimensions[d];
}
for(int d=dim+1;d<Ndim;d++){
Nhigh*=grid->_ldimensions[d];
}
int64_t Nperp=Nlow*Nhigh;
deviceVector<scalar> pgbuf; // Layout is [perp][component][dim]
pgbuf.resize(Nperp*Ncomp*G);
scalar *pgbuf_v = &pgbuf[0];
int rank = 1; /* 1d transforms */
int n[] = {G}; /* 1d transforms of length G */
const int Ndim = _grid->Nd();
int G = _grid->_fdimensions[dim];
int Ncomp = sizeof(sobj) / sizeof(scalar);
int64_t Nperp = 1;
for (int d = 0; d < Ndim; d++)
if (d != dim) Nperp *= _grid->_ldimensions[d];
int n[] = {G};
int howmany = Ncomp * Nperp;
int odist,idist,istride,ostride;
idist = odist = G; /* Distance between consecutive FT's */
istride = ostride = 1; /* Distance between two elements in the same FT */
int *inembed = n, *onembed = n;
scalar div;
if ( sign == backward ) div = 1.0/G;
else if ( sign == forward ) div = 1.0;
else GRID_ASSERT(0);
double t_pencil=0;
double t_fft =0;
double t_total =-usecond();
// std::cout << GridLogPerformance<<"Making FFTW plan" << std::endl;
/*
*
*/
FFTW_plan p;
{
FFTW_scalar *in = (FFTW_scalar *)&pgbuf_v[0];
FFTW_scalar *out= (FFTW_scalar *)&pgbuf_v[0];
p = FFTW<scalar>::fftw_plan_many_dft(rank,n,howmany,
in,inembed,
istride,idist,
out,onembed,
ostride, odist,
sign,FFTW_ESTIMATE);
}
// Barrel shift and collect global pencil
// std::cout << GridLogPerformance<<"Making pencil" << std::endl;
Coordinate lcoor(Ndim), gcoor(Ndim);
double t_copy=0;
double t_shift=0;
t_pencil = -usecond();
result = source;
int pc = grid->_processor_coor[dim];
const Coordinate ldims = grid->_ldimensions;
const Coordinate rdims = grid->_rdimensions;
const Coordinate sdims = grid->_simd_layout;
Coordinate processors = grid->_processors;
Coordinate pgdims(Ndim);
pgdims[0] = G;
for(int d=0, dd=1;d<Ndim;d++){
if ( d!=dim ) pgdims[dd++] = ldims[d];
}
int64_t pgvol=1;
for(int d=0;d<Ndim;d++) pgvol*=pgdims[d];
const int Nsimd = vobj::Nsimd();
for(int p=0;p<processors[dim];p++) {
t_copy-=usecond();
autoView(r_v,result,AcceleratorRead);
accelerator_for(idx, grid->oSites(), vobj::Nsimd(), {
#ifdef GRID_SIMT
{
int lane=acceleratorSIMTlane(Nsimd); // buffer lane
#else
for(int lane=0;lane<Nsimd;lane++) {
#endif
Coordinate icoor;
Coordinate ocoor;
Coordinate pgcoor;
Lexicographic::CoorFromIndex(icoor,lane,sdims);
Lexicographic::CoorFromIndex(ocoor,idx,rdims);
pgcoor[0] = ocoor[dim] + icoor[dim]*rdims[dim] + ((pc+p)%processors[dim])*L;
for(int d=0,dd=1;d<Ndim;d++){
if ( d!=dim ) {
pgcoor[dd] = ocoor[d] + icoor[d]*rdims[d];
dd++;
}
}
// Map coordinates in lattice layout to FFTW index
int64_t pgidx;
Lexicographic::IndexFromCoor(pgcoor,pgidx,pgdims);
vector_type *from = (vector_type *)&r_v[idx];
scalar_type stmp;
for(int w=0;w<Ncomp;w++){
int64_t pg_idx = pgidx + w*pgvol;
stmp = getlane(from[w], lane);
pgbuf_v[pg_idx] = stmp;
}
#ifdef GRID_SIMT
}
#else
}
#endif
});
t_copy+=usecond();
if (p != processors[dim] - 1) {
Lattice<vobj> temp(grid);
t_shift-=usecond();
temp = Cshift(result,dim,L); result = temp;
t_shift+=usecond();
}
}
t_pencil += usecond();
FFTW_scalar *in = (FFTW_scalar *)pgbuf_v;
FFTW_scalar *out= (FFTW_scalar *)pgbuf_v;
t_fft = -usecond();
FFTW<scalar>::fftw_execute_dft(p,in,out,sign);
t_fft += usecond();
// performance counting
flops_call = 5.0*howmany*G*log2(G);
usec = t_fft;
flops= flops_call;
result = Zero();
double t_insert = -usecond();
{
autoView(r_v,result,AcceleratorWrite);
accelerator_for(idx,grid->oSites(),Nsimd,{
#ifdef GRID_SIMT
{
int lane=acceleratorSIMTlane(Nsimd); // buffer lane
#else
for(int lane=0;lane<Nsimd;lane++) {
#endif
Coordinate icoor(Ndim);
Coordinate ocoor(Ndim);
Coordinate pgcoor(Ndim);
Lexicographic::CoorFromIndex(icoor,lane,sdims);
Lexicographic::CoorFromIndex(ocoor,idx,rdims);
pgcoor[0] = ocoor[dim] + icoor[dim]*rdims[dim] + pc*L;
for(int d=0,dd=1;d<Ndim;d++){
if ( d!=dim ) {
pgcoor[dd] = ocoor[d] + icoor[d]*rdims[d];
dd++;
}
}
// Map coordinates in lattice layout to FFTW index
int64_t pgidx;
Lexicographic::IndexFromCoor(pgcoor,pgidx,pgdims);
vector_type *to = (vector_type *)&r_v[idx];
scalar_type stmp;
for(int w=0;w<Ncomp;w++){
int64_t pg_idx = pgidx + w*pgvol;
stmp = pgbuf_v[pg_idx];
putlane(to[w], stmp, lane);
}
#ifdef GRID_SIMT
}
#else
}
#endif
});
}
result = result*div;
t_insert +=usecond();
// destroying plan
deviceVector<scalar> dummy(2);
FFTW_scalar *buf = (FFTW_scalar *)&dummy[0];
FFTW_plan p = FFTW<scalar>::fftw_plan_many_dft(1, n, howmany,
buf, n, 1, G,
buf, n, 1, G,
sign, FFTW_ESTIMATE);
FFT_dim_execute(result, source, dim, sign, p, _grid, flops, flops_call, usec);
FFTW<scalar>::fftw_destroy_plan(p);
}
};
t_total +=usecond();
template<class vobj>
class PlannedFFT : public FFTbase {
private:
typedef typename vobj::scalar_type scalar;
typedef typename vobj::scalar_object sobj;
typedef typename vobj::vector_type vector_type;
typedef typename FFTW<scalar>::FFTW_scalar FFTW_scalar;
typedef typename FFTW<scalar>::FFTW_plan FFTW_plan;
std::cout <<GridLogPerformance<< " FFT took "<<t_total/1.0e6 <<" s" << std::endl;
std::cout <<GridLogPerformance<< " FFT pencil "<<t_pencil/1.0e6 <<" s" << std::endl;
std::cout <<GridLogPerformance<< " of which copy "<<t_copy/1.0e6 <<" s" << std::endl;
std::cout <<GridLogPerformance<< " of which shift"<<t_shift/1.0e6 <<" s" << std::endl;
std::cout <<GridLogPerformance<< " FFT kernels "<<t_fft/1.0e6 <<" s" << std::endl;
std::cout <<GridLogPerformance<< " FFT insert "<<t_insert/1.0e6 <<" s" << std::endl;
std::vector<FFTW_plan> forward_plans;
std::vector<FFTW_plan> backward_plans;
void PlanCreate() {
const int Ndim = _grid->Nd();
forward_plans.resize(Ndim);
backward_plans.resize(Ndim);
for (int d = 0; d < Ndim; d++) {
int G = _grid->_fdimensions[d];
int Ncomp = sizeof(sobj) / sizeof(scalar);
int64_t Nperp = 1;
for (int dd = 0; dd < Ndim; dd++)
if (dd != d) Nperp *= _grid->_ldimensions[dd];
int howmany = Ncomp * (int)Nperp;
int n[] = {G};
deviceVector<scalar> dummy(2);
FFTW_scalar *buf = (FFTW_scalar *)&dummy[0];
forward_plans[d] = FFTW<scalar>::fftw_plan_many_dft(1, n, howmany, buf, n, 1, G, buf, n, 1, G, FFTW_FORWARD, FFTW_ESTIMATE);
backward_plans[d] = FFTW<scalar>::fftw_plan_many_dft(1, n, howmany, buf, n, 1, G, buf, n, 1, G, FFTW_BACKWARD, FFTW_ESTIMATE);
}
}
void PlanDestroy() {
for (auto p : forward_plans) FFTW<scalar>::fftw_destroy_plan(p);
for (auto p : backward_plans) FFTW<scalar>::fftw_destroy_plan(p);
forward_plans.clear();
backward_plans.clear();
}
public:
PlannedFFT(GridCartesian *grid) : FFTbase(grid) { PlanCreate(); }
~PlannedFFT() { PlanDestroy(); }
void FFT_dim_mask(Lattice<vobj> &result, const Lattice<vobj> &source, Coordinate mask, int sign) {
const int Ndim = _grid->Nd();
Lattice<vobj> tmp = source;
for (int d = 0; d < Ndim; d++) {
if (mask[d]) {
FFT_dim(result, tmp, d, sign);
tmp = result;
}
}
}
void FFT_all_dim(Lattice<vobj> &result, const Lattice<vobj> &source, int sign) {
Coordinate mask(_grid->Nd(), 1);
FFT_dim_mask(result, source, mask, sign);
}
void FFT_dim(Lattice<vobj> &result, const Lattice<vobj> &source, int dim, int sign) {
GRID_ASSERT(source.Grid() == _grid);
GRID_ASSERT(result.Grid() == _grid);
GRID_ASSERT((int)forward_plans.size() == _grid->Nd());
conformable(result.Grid(), source.Grid());
FFTW_plan p = (sign == forward ? forward_plans : backward_plans)[dim];
FFT_dim_execute(result, source, dim, sign, p, _grid, flops, flops_call, usec);
}
};
+88 -17
View File
@@ -28,6 +28,7 @@ Author: Peter Boyle <pboyle@bnl.gov>
#pragma once
#ifdef GRID_HIP
#include <hip/hip_version.h>
#include <hipblas/hipblas.h>
#endif
#ifdef GRID_CUDA
@@ -255,17 +256,30 @@ public:
if ( OpB == GridBLAS_OP_N ) hOpB = HIPBLAS_OP_N;
if ( OpB == GridBLAS_OP_T ) hOpB = HIPBLAS_OP_T;
if ( OpB == GridBLAS_OP_C ) hOpB = HIPBLAS_OP_C;
#if defined(HIP_VERSION_MAJOR) && (HIP_VERSION_MAJOR >=7)
auto err = hipblasZgemmBatched(gridblasHandle,
hOpA,
hOpB,
m,n,k,
(hipblasDoubleComplex *) &alpha_p[0],
(hipblasDoubleComplex **)&Amk[0], lda,
(hipblasDoubleComplex **)&Bkn[0], ldb,
(hipblasDoubleComplex *) &beta_p[0],
(hipblasDoubleComplex **)&Cmn[0], ldc,
(hipDoubleComplex *) &alpha_p[0],
(hipDoubleComplex **)&Amk[0], lda,
(hipDoubleComplex **)&Bkn[0], ldb,
(hipDoubleComplex *) &beta_p[0],
(hipDoubleComplex **)&Cmn[0], ldc,
batchCount);
// std::cout << " hipblas return code " <<(int)err<<std::endl;
#else
auto err = hipblasZgemmBatched(gridblasHandle,
hOpA,
hOpB,
m,n,k,
(hipblasDoubleComplex *) &alpha_p[0],
(hipblasDoubleComplex **)&Amk[0], lda,
(hipblasDoubleComplex **)&Bkn[0], ldb,
(hipblasDoubleComplex *) &beta_p[0],
(hipblasDoubleComplex **)&Cmn[0], ldc,
batchCount);
#endif
// std::cout << " hipblas return code " <<(int)err<<" "<<__LINE__<<std::endl;
GRID_ASSERT(err==HIPBLAS_STATUS_SUCCESS);
#endif
#ifdef GRID_CUDA
@@ -503,17 +517,31 @@ public:
if ( OpB == GridBLAS_OP_N ) hOpB = HIPBLAS_OP_N;
if ( OpB == GridBLAS_OP_T ) hOpB = HIPBLAS_OP_T;
if ( OpB == GridBLAS_OP_C ) hOpB = HIPBLAS_OP_C;
#if defined(HIP_VERSION_MAJOR) && (HIP_VERSION_MAJOR >=7)
auto err = hipblasCgemmBatched(gridblasHandle,
hOpA,
hOpB,
m,n,k,
(hipblasComplex *) &alpha_p[0],
(hipblasComplex **)&Amk[0], lda,
(hipblasComplex **)&Bkn[0], ldb,
(hipblasComplex *) &beta_p[0],
(hipblasComplex **)&Cmn[0], ldc,
(hipComplex *) &alpha_p[0],
(hipComplex **)&Amk[0], lda,
(hipComplex **)&Bkn[0], ldb,
(hipComplex *) &beta_p[0],
(hipComplex **)&Cmn[0], ldc,
batchCount);
#else
auto err = hipblasCgemmBatched(gridblasHandle,
hOpA,
hOpB,
m,n,k,
(hipblasComplex *) &alpha_p[0],
(hipblasComplex **)&Amk[0], lda,
(hipblasComplex **)&Bkn[0], ldb,
(hipblasComplex *) &beta_p[0],
(hipblasComplex **)&Cmn[0], ldc,
batchCount);
#endif
// std::cout << " hipblas return code " <<(int)err<<" "<<__LINE__<<std::endl;
GRID_ASSERT(err==HIPBLAS_STATUS_SUCCESS);
#endif
#ifdef GRID_CUDA
@@ -550,6 +578,7 @@ public:
(void **)&Cmn[0], CUDA_C_32F, ldc,
batchCount, compute_precision, CUBLAS_GEMM_DEFAULT);
}
// std::cout << " hipblas return code " <<(int)err<<" "<<__LINE__<<std::endl;
GRID_ASSERT(err==CUBLAS_STATUS_SUCCESS);
#endif
#ifdef GRID_SYCL
@@ -720,6 +749,7 @@ public:
(float *) &beta_p[0],
(float **)&Cmn[0], ldc,
batchCount);
// std::cout << " hipblas return code " <<(int)err<<" "<<__LINE__<<std::endl;
GRID_ASSERT(err==HIPBLAS_STATUS_SUCCESS);
#endif
#ifdef GRID_CUDA
@@ -880,6 +910,7 @@ public:
(double *) &beta_p[0],
(double **)&Cmn[0], ldc,
batchCount);
// std::cout << " hipblas return code " <<(int)err<<" "<<__LINE__<<std::endl;
GRID_ASSERT(err==HIPBLAS_STATUS_SUCCESS);
#endif
#ifdef GRID_CUDA
@@ -1094,11 +1125,20 @@ public:
GRID_ASSERT(info.size()==batchCount);
#ifdef GRID_HIP
#if defined(HIP_VERSION_MAJOR) && (HIP_VERSION_MAJOR >=7)
auto err = hipblasZgetrfBatched(gridblasHandle,(int)n,
(hipblasDoubleComplex **)&Ann[0], (int)n,
(hipDoubleComplex **)&Ann[0], (int)n,
(int*) &ipiv[0],
(int*) &info[0],
(int)batchCount);
#else
auto err = hipblasZgetrfBatched(gridblasHandle,(int)n,
(hipblasDoubleComplex **)&Ann[0], (int)n,
(int*) &ipiv[0],
(int*) &info[0],
(int)batchCount);
#endif
// std::cout << " hipblas return code " <<(int)err<<" "<<__LINE__<<std::endl;
GRID_ASSERT(err==HIPBLAS_STATUS_SUCCESS);
#endif
#ifdef GRID_CUDA
@@ -1124,11 +1164,21 @@ public:
GRID_ASSERT(info.size()==batchCount);
#ifdef GRID_HIP
#if defined(HIP_VERSION_MAJOR) && (HIP_VERSION_MAJOR >=7)
auto err = hipblasCgetrfBatched(gridblasHandle,(int)n,
(hipblasComplex **)&Ann[0], (int)n,
(hipComplex **)&Ann[0], (int)n,
(int*) &ipiv[0],
(int*) &info[0],
(int)batchCount);
#else
auto err = hipblasCgetrfBatched(gridblasHandle,(int)n,
(hipblasComplex **)&Ann[0], (int)n,
(int*) &ipiv[0],
(int*) &info[0],
(int)batchCount);
#endif
// std::cout << " hipblas return code " <<(int)err<<" "<<__LINE__<<std::endl;
GRID_ASSERT(err==HIPBLAS_STATUS_SUCCESS);
#endif
#ifdef GRID_CUDA
@@ -1201,12 +1251,23 @@ public:
GRID_ASSERT(Cnn.size()==batchCount);
#ifdef GRID_HIP
#if defined(HIP_VERSION_MAJOR) && (HIP_VERSION_MAJOR >=7)
auto err = hipblasZgetriBatched(gridblasHandle,(int)n,
(hipblasDoubleComplex **)&Ann[0], (int)n,
(hipDoubleComplex **)&Ann[0], (int)n,
(int*) &ipiv[0],
(hipblasDoubleComplex **)&Cnn[0], (int)n,
(hipDoubleComplex **)&Cnn[0], (int)n,
(int*) &info[0],
(int)batchCount);
#else
auto err = hipblasZgetriBatched(gridblasHandle,(int)n,
(hipblasDoubleComplex **)&Ann[0], (int)n,
(int*) &ipiv[0],
(hipblasDoubleComplex **)&Cnn[0], (int)n,
(int*) &info[0],
(int)batchCount);
#endif
// std::cout << " hipblas return code " <<(int)err<<" "<<__LINE__<<std::endl;
GRID_ASSERT(err==HIPBLAS_STATUS_SUCCESS);
#endif
#ifdef GRID_CUDA
@@ -1235,12 +1296,22 @@ public:
GRID_ASSERT(Cnn.size()==batchCount);
#ifdef GRID_HIP
#if defined(HIP_VERSION_MAJOR) && (HIP_VERSION_MAJOR >=7)
auto err = hipblasCgetriBatched(gridblasHandle,(int)n,
(hipblasComplex **)&Ann[0], (int)n,
(hipComplex **)&Ann[0], (int)n,
(int*) &ipiv[0],
(hipblasComplex **)&Cnn[0], (int)n,
(hipComplex **)&Cnn[0], (int)n,
(int*) &info[0],
(int)batchCount);
#else
auto err = hipblasCgetriBatched(gridblasHandle,(int)n,
(hipblasComplex **)&Ann[0], (int)n,
(int*) &ipiv[0],
(hipblasComplex **)&Cnn[0], (int)n,
(int*) &info[0],
(int)batchCount);
#endif
// std::cout << " hipblas return code " <<(int)err<<" "<<__LINE__<<std::endl;
GRID_ASSERT(err==HIPBLAS_STATUS_SUCCESS);
#endif
#ifdef GRID_CUDA
+2 -2
View File
@@ -92,8 +92,8 @@ class TwoLevelCGmrhs
// Vector case
virtual void operator() (std::vector<Field> &src, std::vector<Field> &x)
{
// SolveSingleSystem(src,x);
SolvePrecBlockCG(src,x);
SolveSingleSystem(src,x);
// SolvePrecBlockCG(src,x);
}
////////////////////////////////////////////////////////////////////////////////////////////////////
@@ -288,17 +288,21 @@ public:
//----------------------------------------------------------------------
if ( Nm>Nk ) {
// Glog <<" #Apply shifted QR transformations "<<std::endl;
//int k2 = Nk+Nu;
Glog <<" #Ritz values of poly(A) before shift ["<<Nm<<" values]:"<<std::endl;
for(int i=0; i<Nm; ++i){
std::cout.precision(8);
std::cout << "[" << std::setw(4)<< std::setiosflags(std::ios_base::right) <<i<<"] ";
std::cout << "Rval = "<<std::setw(16)<< std::setiosflags(std::ios_base::left)<< eval2[i] << std::endl;
}
int k2 = Nk;
Eigen::MatrixXcd BTDM = Eigen::MatrixXcd::Identity(Nm,Nm);
Q = Eigen::MatrixXcd::Identity(Nm,Nm);
unpackHermitBlockTriDiagMatToEigen(lmd,lme,Nu,Nblock_m,Nm,Nm,BTDM);
for(int ip=Nk; ip<Nm; ++ip){
Glog << " ip "<<ip<<" / "<<Nm<<std::endl;
shiftedQRDecompEigen(BTDM,Nu,Nm,eval2[ip],Q);
}
@@ -326,11 +330,11 @@ public:
Qt = Eigen::MatrixXcd::Identity(Nm,Nm);
diagonalize(eval2,lmd2,lme2,Nu,Nk,Nm,Qt,grid);
_sort.push(eval2,Nk);
// Glog << "#Ritz value after shift: "<< std::endl;
Glog << "#Ritz values of poly(A) after shift ["<<Nk<<" values]:"<<std::endl;
for(int i=0; i<Nk; ++i){
// std::cout.precision(13);
// std::cout << "[" << std::setw(4)<< std::setiosflags(std::ios_base::right) <<i<<"] ";
// std::cout << "Rval = "<<std::setw(20)<< std::setiosflags(std::ios_base::left)<< eval2[i] << std::endl;
std::cout.precision(8);
std::cout << "[" << std::setw(4)<< std::setiosflags(std::ios_base::right) <<i<<"] ";
std::cout << "Rval = "<<std::setw(16)<< std::setiosflags(std::ios_base::left)<< eval2[i] << std::endl;
}
}
//----------------------------------------------------------------------
@@ -710,11 +714,17 @@ private:
//lme[0][L] = beta;
for (int u=0; u<Nu; ++u) {
// Glog << "norm2(w[" << u << "])= "<< norm2(w[u]) << std::endl;
GRID_ASSERT (!isnan(norm2(w[u])));
for (int k=L+u; k<R; ++k) {
// Glog <<" In block "<< b << "," <<" beta[" << u << "," << k-L << "] = " << lme[u][k] << std::endl;
}
}
// Diagnostic: print alpha (lmd) and beta (lme) block diagonals for this step
{
Glog << " blk b="<<std::setw(3)<<b<<" alpha:";
for (int u=0; u<Nu; ++u)
std::cout << " " << std::setw(10) << std::setprecision(6) << real(lmd[u][L+u]);
std::cout << " |beta|:";
for (int u=0; u<Nu; ++u)
std::cout << " " << std::setw(10) << std::setprecision(6) << abs(lme[u][L+u]);
std::cout << std::endl;
}
// Glog << "LinAlg done "<< std::endl;
+9 -6
View File
@@ -97,7 +97,7 @@ public:
RealD scale;
ConjugateGradient<FineField> CG(1.0e-3,400,false);
ConjugateGradient<FineField> CG(1.0e-4,2000,false);
FineField noise(FineGrid);
FineField Mn(FineGrid);
@@ -131,7 +131,10 @@ public:
RealD scale;
TrivialPrecon<FineField> simple_fine;
PrecGeneralisedConjugateResidualNonHermitian<FineField> GCR(0.001,30,DiracOp,simple_fine,12,12);
// PrecGeneralisedConjugateResidualNonHermitian<FineField> GCR(0.001,10,DiracOp,simple_fine,30,30);
// PrecGeneralisedConjugateResidualNonHermitian<FineField> GCR(0.001,10,DiracOp,simple_fine,12,12);
// PrecGeneralisedConjugateResidualNonHermitian<FineField> GCR(0.001,30,DiracOp,simple_fine,12,12);
PrecGeneralisedConjugateResidualNonHermitian<FineField> GCR(0.001,30,DiracOp,simple_fine,10,10);
FineField noise(FineGrid);
FineField src(FineGrid);
FineField guess(FineGrid);
@@ -146,16 +149,16 @@ public:
DiracOp.Op(noise,Mn); std::cout<<GridLogMessage << "noise ["<<b<<"] <n|Op|n> "<<innerProduct(noise,Mn)<<std::endl;
for(int i=0;i<2;i++){
for(int i=0;i<3;i++){
// void operator() (const Field &src, Field &psi){
#if 1
std::cout << GridLogMessage << " inverting on noise "<<std::endl;
if (i==0)std::cout << GridLogMessage << " inverting on noise "<<std::endl;
src = noise;
guess=Zero();
GCR(src,guess);
subspace[b] = guess;
#else
std::cout << GridLogMessage << " inverting on zero "<<std::endl;
if (i==0)std::cout << GridLogMessage << " inverting on zero "<<std::endl;
src=Zero();
guess = noise;
GCR(src,guess);
@@ -167,7 +170,7 @@ public:
}
DiracOp.Op(noise,Mn); std::cout<<GridLogMessage << "filtered["<<b<<"] <f|Op|f> "<<innerProduct(noise,Mn)<<std::endl;
DiracOp.Op(noise,Mn); std::cout<<GridLogMessage << "filtered["<<b<<"] <f|Op|f> "<<innerProduct(noise,Mn)<<" <f|OpDagOp|f>"<<norm2(Mn)<<std::endl;
subspace[b] = noise;
}
+2 -31
View File
@@ -589,7 +589,6 @@ double CartesianCommunicator::StencilSendToRecvFromPrepare(std::vector<CommsRequ
srq.PacketType = InterNodeRecv;
srq.bytes = rbytes;
srq.req = rrq;
srq.dest = from;
srq.host_buf = host_recv;
srq.device_buf = recv;
srq.tag = tag;
@@ -766,7 +765,7 @@ double CartesianCommunicator::StencilSendToRecvFromBegin(std::vector<CommsReques
srq.host_buf = NULL;
srq.device_buf = xmit;
srq.tag = -1;
srq.dest = from;
srq.dest = dest;
srq.commdir = dir;
list.push_back(srq);
}
@@ -818,40 +817,12 @@ void CartesianCommunicator::StencilSendToRecvFromComplete(std::vector<CommsReque
int ierr = MPI_Waitall(MpiRequests.size(),&MpiRequests[0],&status[0]); // Sends are guaranteed in order. No harm in not completing.
GRID_ASSERT(ierr==0);
}
// for(int r=0;r<nreq;r++){
// if ( list[r].PacketType==InterNodeRecv ) {
// acceleratorCopyToDeviceAsynch(list[r].host_buf,list[r].device_buf,list[r].bytes);
// }
// }
// Get global error count in comms receive
#define AUDIT_FLIGHT_RECORDER_ERRORS
#ifdef AUDIT_FLIGHT_RECORDER_ERRORS
uint64_t EC = FlightRecorder::CommsErrorCount();
if (EC) std::cerr << " global sum error count "<<EC<<std::endl;
this->GlobalSum(EC);
if (EC) {
for(int r=0;r<list.size();r++){
#ifdef GRID_CHECKSUM_COMMS
uint64_t rbytes_data = list[r].bytes - 8;
#else
uint64_t rbytes_data = list[r].bytes;
#endif
if (list[r].PacketType == InterNodeReceiveHtoD) {
std::cerr << " calling xor reduce "<<std::endl;
uint64_t csg = gpu_xor((uint64_t*)list[r].device_buf,rbytes_data/8);
uint64_t csh = cpu_xor((uint64_t*)list[r].host_buf,rbytes_data/8);
std::cerr << " Packet "<<r<<" Receive from " <<list[r].dest<<" host csum "<<csh<<" gpu csum "<<csg<<std::endl;
} if (list[r].PacketType == InterNodeXmitISend ) {
std::cerr << " calling xor reduce "<<std::endl;
uint64_t csg = gpu_xor((uint64_t*)list[r].device_buf,rbytes_data/8);
uint64_t csh = cpu_xor((uint64_t*)list[r].host_buf,rbytes_data/8);
std::cerr << " Packet "<<r<<" Send to " <<list[r].dest<<" host csum "<<csh<<" gpu csum "<<csg<<std::endl;
}
}
}
#endif
#ifdef GRID_CHECKSUM_COMMS
for(int r=0;r<list.size();r++){
if ( list[r].PacketType == InterNodeReceiveHtoD ) {
+172 -22
View File
@@ -197,12 +197,15 @@ __global__ void reduceKernel(const vobj *lat, sobj *buffer, Iterator n) {
/////////////////////////////////////////////////////////////////////////////////////////////////////////
// Possibly promote to double and sum
/////////////////////////////////////////////////////////////////////////////////////////////////////////
#undef GRID_REDUCTION_TIMING
template <class vobj>
inline typename vobj::scalar_objectD sumD_gpu_small(const vobj *lat, Integer osites)
inline typename vobj::scalar_objectD sumD_gpu_small(const vobj *lat, Integer osites)
{
typedef typename vobj::scalar_objectD sobj;
typedef decltype(lat) Iterator;
Integer nsimd= vobj::Nsimd();
Integer size = osites*nsimd;
@@ -211,41 +214,188 @@ inline typename vobj::scalar_objectD sumD_gpu_small(const vobj *lat, Integer osi
GRID_ASSERT(ok);
Integer smemSize = numThreads * sizeof(sobj);
// Move out of UVM
// Turns out I had messed up the synchronise after move to compute stream
// as running this on the default stream fools the synchronise
deviceVector<sobj> buffer(numBlocks);
sobj *buffer_v = &buffer[0];
sobj result;
#ifdef GRID_REDUCTION_TIMING
RealD t_kernel = -usecond();
#endif
reduceKernel<<< numBlocks, numThreads, smemSize, computeStream >>>(lat, buffer_v, size);
accelerator_barrier();
#ifdef GRID_REDUCTION_TIMING
t_kernel += usecond();
RealD t_d2h = -usecond();
#endif
acceleratorCopyFromDevice(buffer_v,&result,sizeof(result));
#ifdef GRID_REDUCTION_TIMING
t_d2h += usecond();
std::cout << GridLogDebug << " sumD_gpu_small"
<< " sizeof(sobj)=" << sizeof(sobj)
<< " blocks=" << numBlocks << " threads=" << numThreads
<< " kernel+barrier=" << t_kernel << " us"
<< " D2H=" << t_d2h << " us" << std::endl;
#endif
return result;
}
// Fused pack+reduce: reads R words of each vobj at word offset 'base',
// accumulates directly into iVector<iScalar<scalarD>,R> without staging
// through an intermediate bundle buffer. One HBM pass instead of three.
template <int R, class vobj, class sobj, class Iterator>
__device__ void packReduceBlocks(
const iScalar<typename vobj::vector_type> *idat,
sobj *g_odata, Iterator osites, int base, int words)
{
constexpr Iterator nsimd = vobj::Nsimd();
Iterator blockSize = blockDim.x;
extern __shared__ __align__(COALESCE_GRANULARITY) unsigned char shmem_pointer[];
sobj *sdata = (sobj *)shmem_pointer;
Iterator tid = threadIdx.x;
Iterator i = blockIdx.x * (blockSize * 2) + threadIdx.x;
Iterator gridSize = blockSize * 2 * gridDim.x;
sobj mySum = Zero();
while (i < osites * nsimd) {
Iterator lane = i % nsimd;
Iterator ss = i / nsimd;
sobj tmpD; zeroit(tmpD);
for (int k = 0; k < R; k++) {
auto w = extractLane(lane, idat[ss * words + base + k]);
iScalar<typename vobj::scalar_typeD> wd; wd = w;
tmpD._internal[k] = wd;
}
mySum += tmpD;
if (i + blockSize < osites * nsimd) {
lane = (i + blockSize) % nsimd;
ss = (i + blockSize) / nsimd;
sobj tmpD2; zeroit(tmpD2);
for (int k = 0; k < R; k++) {
auto w = extractLane(lane, idat[ss * words + base + k]);
iScalar<typename vobj::scalar_typeD> wd; wd = w;
tmpD2._internal[k] = wd;
}
mySum += tmpD2;
}
i += gridSize;
}
reduceBlock(sdata, mySum, tid);
if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}
template <int R, class vobj, class sobj, class Iterator>
__global__ void packReduceKernel(
const iScalar<typename vobj::vector_type> *idat,
sobj *buffer, Iterator osites, int base, int words)
{
Iterator blockSize = blockDim.x;
packReduceBlocks<R, vobj, sobj>(idat, buffer, osites, base, words);
if (gridDim.x > 1) {
const Iterator tid = threadIdx.x;
__shared__ bool amLast;
extern __shared__ __align__(COALESCE_GRANULARITY) unsigned char shmem_pointer[];
sobj *smem = (sobj *)shmem_pointer;
acceleratorFence();
if (tid == 0) {
unsigned int ticket = atomicInc(&retirementCount, gridDim.x);
amLast = (ticket == gridDim.x - 1);
}
acceleratorSynchroniseAll();
if (amLast) {
Iterator i = tid;
sobj mySum = Zero();
while (i < (Iterator)gridDim.x) {
mySum += buffer[i];
i += blockSize;
}
reduceBlock(smem, mySum, tid);
if (tid == 0) {
buffer[0] = smem[0];
retirementCount = 0;
}
}
}
}
template<int R, class vobj>
inline void sumD_gpu_reduce_words(const vobj *lat, Integer osites,
typename vobj::scalar_typeD *ret_p, int base)
{
typedef typename vobj::vector_type vector;
typedef typename vobj::scalar_typeD scalarD;
using BundleScalarD = iVector<iScalar<scalarD>, R>;
constexpr int Nsimd = vobj::Nsimd();
const int words = sizeof(vobj) / sizeof(vector);
const iScalar<vector> *idat = (const iScalar<vector> *)lat;
Integer size = (Integer)osites * Nsimd;
Integer numThreads, numBlocks;
int ok = getNumBlocksAndThreads(size, sizeof(BundleScalarD), numThreads, numBlocks);
GRID_ASSERT(ok);
Integer smemSize = numThreads * sizeof(BundleScalarD);
deviceVector<BundleScalarD> buffer(numBlocks);
BundleScalarD *buffer_v = &buffer[0];
BundleScalarD result;
#ifdef GRID_REDUCTION_TIMING
RealD t_kernel = -usecond();
#endif
packReduceKernel<R, vobj, BundleScalarD, Integer>
<<<numBlocks, numThreads, smemSize, computeStream>>>
(idat, buffer_v, osites, base, words);
accelerator_barrier();
#ifdef GRID_REDUCTION_TIMING
t_kernel += usecond();
RealD t_d2h = -usecond();
#endif
acceleratorCopyFromDevice(buffer_v, &result, sizeof(result));
#ifdef GRID_REDUCTION_TIMING
t_d2h += usecond();
std::cout << GridLogDebug << " sumD_gpu_reduce_words R=" << R
<< " base=" << base
<< " kernel=" << t_kernel << " D2H=" << t_d2h << " us" << std::endl;
#endif
for (int k = 0; k < R; k++)
ret_p[base + k] = TensorRemove(result._internal[k]);
}
template <class vobj>
inline typename vobj::scalar_objectD sumD_gpu_large(const vobj *lat, Integer osites)
{
typedef typename vobj::vector_type vector;
typedef typename vobj::scalar_typeD scalarD;
typedef typename vobj::scalar_objectD sobj;
sobj ret;
typedef typename vobj::vector_type vector;
typedef typename vobj::scalar_typeD scalarD;
typedef typename vobj::scalar_objectD sobjD;
const int words = sizeof(vobj) / sizeof(vector);
sobjD ret; zeroit(ret);
scalarD *ret_p = (scalarD *)&ret;
const int words = sizeof(vobj)/sizeof(vector);
deviceVector<vector> buffer(osites);
vector *dat = (vector *)lat;
vector *buf = &buffer[0];
iScalar<vector> *tbuf =(iScalar<vector> *) &buffer[0];
for(int w=0;w<words;w++) {
#ifdef GRID_REDUCTION_TIMING
RealD t_large = -usecond();
#endif
int w = 0;
while (w + 12 <= words) { sumD_gpu_reduce_words<12>(lat, osites, ret_p, w); w += 12; }
while (w + 4 <= words) { sumD_gpu_reduce_words< 4>(lat, osites, ret_p, w); w += 4; }
while (w < words) { sumD_gpu_reduce_words< 1>(lat, osites, ret_p, w); w += 1; }
#ifdef GRID_REDUCTION_TIMING
t_large += usecond();
std::cout << GridLogDebug << "sumD_gpu_large"
<< " sizeof(sobjD)=" << sizeof(sobjD)
<< " words=" << words << " total=" << t_large << " us" << std::endl;
#endif
accelerator_for(ss,osites,1,{
buf[ss] = dat[ss*words+w];
});
ret_p[w] = sumD_gpu_small(tbuf,osites);
}
return ret;
}
+17 -29
View File
@@ -6,28 +6,27 @@ NAMESPACE_BEGIN(Grid);
template <class vobj>
inline typename vobj::scalar_objectD sumD_gpu_tensor(const vobj *lat, Integer osites)
inline typename vobj::scalar_objectD sumD_gpu_tensor(const vobj *lat, Integer osites)
{
typedef typename vobj::scalar_object sobj;
typedef typename vobj::scalar_object sobj;
typedef typename vobj::scalar_objectD sobjD;
sobj identity; zeroit(identity);
sobj ret; zeroit(ret);
Integer nsimd= vobj::Nsimd();
{
sycl::buffer<sobj, 1> abuff(&ret, {1});
sobjD identity; zeroit(identity);
sobjD ret; zeroit(ret);
{
sycl::buffer<sobjD, 1> abuff(&ret, {1});
theGridAccelerator->submit([&](sycl::handler &cgh) {
auto Reduction = sycl::reduction(abuff,cgh,identity,std::plus<>());
cgh.parallel_for(sycl::range<1>{osites},
Reduction,
[=] (sycl::id<1> item, auto &sum) {
auto osite = item[0];
sum +=Reduce(lat[osite]);
});
auto Reduction = sycl::reduction(abuff, cgh, identity, std::plus<>());
cgh.parallel_for(sycl::range<1>{(size_t)osites},
Reduction,
[=](sycl::id<1> item, auto &sum) {
sobj s = Reduce(lat[item[0]]);
sobjD sd; sd = s;
sum += sd;
});
});
}
sobjD dret; convertType(dret,ret);
return dret;
return ret;
}
template <class vobj>
@@ -68,7 +67,8 @@ inline typename vobj::scalar_object sum_gpu_large(const vobj *lat, Integer osite
return result;
}
template<class Word> Word gpu_xor(Word *vec,uint64_t L)
template<class Word> Word svm_xor(Word *vec,uint64_t L)
{
Word identity; identity=0;
Word ret = 0;
@@ -86,18 +86,6 @@ template<class Word> Word gpu_xor(Word *vec,uint64_t L)
theGridAccelerator->wait();
return ret;
}
template<class Word> Word cpu_xor(Word *vec,uint64_t L)
{
Word csum=0;
for(uint64_t w=0;w<L;w++){
csum = csum ^ vec[w];
}
return csum;
}
template<class Word> Word svm_xor(Word *vec,uint64_t L)
{
gpu_xor(vec,L);
}
template<class Word> Word checksum_gpu(Word *vec,uint64_t L)
{
Word identity; identity=0;
+9 -3
View File
@@ -1,7 +1,6 @@
#pragma once
#if defined(GRID_CUDA)
#include <cub/cub.cuh>
#define gpucub cub
#define gpuError_t cudaError_t
@@ -57,8 +56,13 @@ inline void sliceSumReduction_cub_small(const vobj *Data,
//copy offsets to device
acceleratorCopyToDeviceAsynch(&offsets[0],d_offsets,sizeof(int)*(rd+1),computeStream);
#if defined(__CUDACC__) && (__CUDACC_VER_MAJOR__ >= 13)
#define GRID_CUB_SUM_OP ::cuda::std::plus<>{}
#else
#define GRID_CUB_SUM_OP ::gpucub::Sum()
#endif
gpuError_t gpuErr = gpucub::DeviceSegmentedReduce::Reduce(temp_storage_array, temp_storage_bytes, rb_p,d_out, rd, d_offsets, d_offsets+1, ::gpucub::Sum(), zero_init, computeStream);
gpuError_t gpuErr = gpucub::DeviceSegmentedReduce::Reduce(temp_storage_array, temp_storage_bytes, rb_p,d_out, rd, d_offsets, d_offsets+1, GRID_CUB_SUM_OP, zero_init, computeStream);
if (gpuErr!=gpuSuccess) {
std::cout << GridLogError << "Lattice_slicesum_gpu.h: Encountered error during gpucub::DeviceSegmentedReduce::Reduce (setup)! Error: " << gpuErr <<std::endl;
exit(EXIT_FAILURE);
@@ -82,11 +86,13 @@ inline void sliceSumReduction_cub_small(const vobj *Data,
});
//issue segmented reductions in computeStream
gpuErr = gpucub::DeviceSegmentedReduce::Reduce(temp_storage_array, temp_storage_bytes, rb_p, d_out, rd, d_offsets, d_offsets+1,::gpucub::Sum(), zero_init, computeStream);
gpuErr = gpucub::DeviceSegmentedReduce::Reduce(temp_storage_array, temp_storage_bytes, rb_p, d_out, rd, d_offsets, d_offsets+1, GRID_CUB_SUM_OP, zero_init, computeStream);
if (gpuErr!=gpuSuccess) {
std::cout << GridLogError << "Lattice_slicesum_gpu.h: Encountered error during gpucub::DeviceSegmentedReduce::Reduce! Error: " << gpuErr <<std::endl;
exit(EXIT_FAILURE);
}
#undef GRID_CUB_SUM_OP
acceleratorCopyFromDeviceAsynch(d_out,&lvSum[0],rd*sizeof(vobj),computeStream);
+2 -2
View File
@@ -51,8 +51,8 @@ Author: paboyle <paboyle@ph.ed.ac.uk>
#endif
#ifdef __x86_64__
#ifdef GRID_CUDA
accelerator_inline uint64_t __rdtsc(void) { return 0; }
accelerator_inline uint64_t __rdpmc(int ) { return 0; }
//accelerator_inline uint64_t __rdtsc(void) { return 0; }
//accelerator_inline uint64_t __rdpmc(int ) { return 0; }
#else
#include <x86intrin.h>
#endif
+18
View File
@@ -596,16 +596,32 @@ template<int Index,class vobj> inline vobj transposeColour(const vobj &lhs){
//////////////////////////////////////////
// Trace lattice and non-lattice
//////////////////////////////////////////
#define GRID_UNOP(name) name
#define GRID_DEF_UNOP(op, name) \
template <typename T1, typename std::enable_if<is_lattice<T1>::value||is_lattice_expr<T1>::value,T1>::type * = nullptr> \
inline auto op(const T1 &arg) ->decltype(LatticeUnaryExpression<GRID_UNOP(name),T1>(GRID_UNOP(name)(), arg)) \
{ \
return LatticeUnaryExpression<GRID_UNOP(name),T1>(GRID_UNOP(name)(), arg); \
}
template<int Index,class vobj>
inline auto traceSpin(const Lattice<vobj> &lhs) -> Lattice<decltype(traceIndex<SpinIndex>(vobj()))>
{
return traceIndex<SpinIndex>(lhs);
}
GridUnopClass(UnaryTraceSpin, traceIndex<SpinIndex>(a));
GRID_DEF_UNOP(traceSpin, UnaryTraceSpin);
template<int Index,class vobj>
inline auto traceColour(const Lattice<vobj> &lhs) -> Lattice<decltype(traceIndex<ColourIndex>(vobj()))>
{
return traceIndex<ColourIndex>(lhs);
}
GridUnopClass(UnaryTraceColour, traceIndex<ColourIndex>(a));
GRID_DEF_UNOP(traceColour, UnaryTraceColour);
template<int Index,class vobj>
inline auto traceSpin(const vobj &lhs) -> Lattice<decltype(traceIndex<SpinIndex>(lhs))>
{
@@ -617,6 +633,8 @@ inline auto traceColour(const vobj &lhs) -> Lattice<decltype(traceIndex<ColourIn
return traceIndex<ColourIndex>(lhs);
}
#undef GRID_UNOP
#undef GRID_DEF_UNOP
//////////////////////////////////////////
// Current types
//////////////////////////////////////////
+12
View File
@@ -103,6 +103,18 @@ class PolyakovMod: public ObservableModule<PolyakovLogger<Impl>, NoParameters>{
PolyakovMod(): ObsBase(NoParameters()){}
};
template < class Impl >
class SpatialPolyakovMod: public ObservableModule<SpatialPolyakovLogger<Impl>, NoParameters>{
typedef ObservableModule<SpatialPolyakovLogger<Impl>, NoParameters> ObsBase;
using ObsBase::ObsBase; // for constructors
// acquire resource
virtual void initialize(){
this->ObservablePtr.reset(new SpatialPolyakovLogger<Impl>());
}
public:
SpatialPolyakovMod(): ObsBase(NoParameters()){}
};
template < class Impl >
class TopologicalChargeMod: public ObservableModule<TopologicalCharge<Impl>, TopologyObsParameters>{
+42 -2
View File
@@ -2,11 +2,12 @@
Grid physics library, www.github.com/paboyle/Grid
Source file: ./lib/qcd/modules/polyakov_line.h
Source file: ./Grid/qcd/observables/polyakov_loop.h
Copyright (C) 2017
Copyright (C) 2025
Author: David Preti <david.preti@csic.es>
Author: Alexis Verney-Provatas <2414441@swansea.ac.uk>
This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
@@ -60,4 +61,43 @@ class PolyakovLogger : public HmcObservable<typename Impl::Field> {
}
};
template <class Impl>
class SpatialPolyakovLogger : public HmcObservable<typename Impl::Field> {
public:
// here forces the Impl to be of gauge fields
// if not the compiler will complain
INHERIT_GIMPL_TYPES(Impl);
// necessary for HmcObservable compatibility
typedef typename Impl::Field Field;
void TrajectoryComplete(int traj,
Field &U,
GridSerialRNG &sRNG,
GridParallelRNG &pRNG) {
// Save current numerical output precision
int def_prec = std::cout.precision();
// Assume that the dimensions are D=3+1
int Ndim = 3;
ComplexD polyakov;
// Iterate over the spatial directions and print the average spatial polyakov loop
// over them
for (int idx=0; idx<Ndim; idx++) {
polyakov = WilsonLoops<Impl>::avgPolyakovLoop(U, idx);
std::cout << GridLogMessage
<< std::setprecision(std::numeric_limits<Real>::digits10 + 1)
<< "Polyakov Loop in the " << idx << " spatial direction : [ " << traj << " ] "<< polyakov << std::endl;
}
// Return to original output precision
std::cout.precision(def_prec);
}
};
NAMESPACE_END(Grid);
+3 -3
View File
@@ -254,9 +254,9 @@ static void testGenerators(GroupName::Sp) {
}
}
template <int N>
static Lattice<iScalar<iScalar<iMatrix<vComplexD, N> > > >
ProjectOnGeneralGroup(const Lattice<iScalar<iScalar<iMatrix<vComplexD, N> > > > &Umu, GroupName::Sp) {
template <class vtype, int N>
static Lattice<iScalar<iScalar<iMatrix<vtype, N> > > >
ProjectOnGeneralGroup(const Lattice<iScalar<iScalar<iMatrix<vtype, N> > > > &Umu, GroupName::Sp) {
return ProjectOnSpGroup(Umu);
}
+26 -8
View File
@@ -177,25 +177,43 @@ public:
}
//////////////////////////////////////////////////
// average over all x,y,z the temporal loop
// average Polyakov loop in mu direction over all directions != mu
//////////////////////////////////////////////////
static ComplexD avgPolyakovLoop(const GaugeField &Umu) { //assume Nd=4
GaugeMat Ut(Umu.Grid()), P(Umu.Grid());
static ComplexD avgPolyakovLoop(const GaugeField &Umu, const int mu) { //assume Nd=4
// Protect against bad value of mu [0, 3]
if ((mu < 0 ) || (mu > 3)) {
std::cout << GridLogError << "Index is not an integer inclusively between 0 and 3." << std::endl;
exit(1);
}
// U_loop is U_{mu}
GaugeMat U_loop(Umu.Grid()), P(Umu.Grid());
ComplexD out;
int T = Umu.Grid()->GlobalDimensions()[3];
int X = Umu.Grid()->GlobalDimensions()[0];
int Y = Umu.Grid()->GlobalDimensions()[1];
int Z = Umu.Grid()->GlobalDimensions()[2];
Ut = peekLorentz(Umu,3); //Select temporal direction
P = Ut;
for (int t=1;t<T;t++){
P = Gimpl::CovShiftForward(Ut,3,P);
// Number of sites in mu direction
int N_mu = Umu.Grid()->GlobalDimensions()[mu];
U_loop = peekLorentz(Umu, mu); //Select direction
P = U_loop;
for (int t=1;t<N_mu;t++){
P = Gimpl::CovShiftForward(U_loop,mu,P);
}
RealD norm = 1.0/(Nc*X*Y*Z*T);
out = sum(trace(P))*norm;
return out;
}
}
/////////////////////////////////////////////////
// overload for temporal Polyakov loop
/////////////////////////////////////////////////
static ComplexD avgPolyakovLoop(const GaugeField &Umu) {
return avgPolyakovLoop(Umu, 3);
}
//////////////////////////////////////////////////
// average over traced single links
+8
View File
@@ -113,6 +113,14 @@ accelerator_inline RealD adj(const RealD & r){ return r; }
accelerator_inline ComplexD adj(const ComplexD& r){ return(conjugate(r)); }
accelerator_inline ComplexF adj(const ComplexF& r ){ return(conjugate(r)); }
#if defined(GRID_CUDA) || defined(GRID_HIP)
//Provide for convenience
accelerator_inline std::complex<double> conjugate(const std::complex<double>& r){ return(conj(r)); }
accelerator_inline std::complex<float> conjugate(const std::complex<float>& r) { return(conj(r)); }
accelerator_inline std::complex<double> adj(const std::complex<double>& r) { return(conj(r)); }
accelerator_inline std::complex<float> adj(const std::complex<float>& r) { return(conj(r)); }
#endif
accelerator_inline RealF real(const RealF & r){ return r; }
accelerator_inline RealD real(const RealD & r){ return r; }
accelerator_inline RealF real(const ComplexF & r){ return r.real(); }
+1 -1
View File
@@ -3,7 +3,7 @@
NAMESPACE_BEGIN(Grid);
int world_rank; // Use to control world rank for print guarding
int acceleratorAbortOnGpuError=1;
uint32_t accelerator_threads=2;
uint32_t accelerator_threads=8;
uint32_t acceleratorThreads(void) {return accelerator_threads;};
void acceleratorThreads(uint32_t t) {accelerator_threads = t;};
+16 -16
View File
@@ -96,7 +96,9 @@ void acceleratorInit(void);
#ifdef GRID_CUDA
NAMESPACE_END(Grid);
#include <cuda.h>
NAMESPACE_BEGIN(Grid);
#ifdef __CUDA_ARCH__
#define GRID_SIMT
@@ -432,22 +434,20 @@ accelerator_inline int acceleratorSIMTlane(int Nsimd) {
#define accelerator_for2dNB( iter1, num1, iter2, num2, nsimd, ... ) \
{ \
typedef uint64_t Iterator; \
auto lambda = [=] accelerator \
(Iterator iter1,Iterator iter2,Iterator lane ) mutable { \
{ __VA_ARGS__;} \
}; \
int nt=acceleratorThreads(); \
dim3 hip_threads(nsimd, nt, 1); \
dim3 hip_blocks ((num1+nt-1)/nt,num2,1); \
if(hip_threads.x * hip_threads.y * hip_threads.z <= 64){ \
hipLaunchKernelGGL(LambdaApply64,hip_blocks,hip_threads, \
0,computeStream, \
num1,num2,nsimd, lambda); \
} else { \
hipLaunchKernelGGL(LambdaApply,hip_blocks,hip_threads, \
0,computeStream, \
num1,num2,nsimd, lambda); \
if (num1*num2) { \
typedef uint64_t Iterator; \
auto lambda = [=] accelerator \
(Iterator iter1,Iterator iter2,Iterator lane ) mutable { \
{ __VA_ARGS__;} \
}; \
int nt=acceleratorThreads(); \
dim3 hip_threads(nsimd, nt, 1); \
dim3 hip_blocks ((num1+nt-1)/nt,num2,1); \
if(hip_threads.x * hip_threads.y * hip_threads.z <= 64){ \
LambdaApply64<<<hip_blocks,hip_threads,0,computeStream>>>(num1,num2,nsimd,lambda); \
} else { \
LambdaApply<<<hip_blocks,hip_threads,0,computeStream>>>(num1,num2,nsimd,lambda); \
} \
} \
}
-8
View File
@@ -47,7 +47,6 @@ int32_t FlightRecorder::CsumLoggingCounter;
int32_t FlightRecorder::NormLoggingCounter;
int32_t FlightRecorder::ReductionLoggingCounter;
uint64_t FlightRecorder::ErrorCounter;
uint64_t FlightRecorder::CommsErrorCounter;
std::vector<double> FlightRecorder::NormLogVector;
std::vector<double> FlightRecorder::ReductionLogVector;
@@ -119,10 +118,6 @@ void FlightRecorder::SetLoggingModeVerify(void)
ResetCounters();
LoggingMode = LoggingModeVerify;
}
uint64_t FlightRecorder::CommsErrorCount(void)
{
return CommsErrorCounter;
}
uint64_t FlightRecorder::ErrorCount(void)
{
return ErrorCounter;
@@ -317,7 +312,6 @@ void FlightRecorder::xmitLog(void *buf,uint64_t bytes)
if ( !ContinueOnFail ) GRID_ASSERT(0);
ErrorCounter++;
} else {
if ( PrintEntireLog ) {
std::cerr<<"FlightRecorder::XmitLog : VALID "<< XmitLoggingCounter <<" "<< std::hexfloat << _xor << " "<< XmitLogVector[XmitLoggingCounter] <<std::endl;
@@ -363,9 +357,7 @@ void FlightRecorder::recvLog(void *buf,uint64_t bytes,int rank)
if ( !ContinueOnFail ) GRID_ASSERT(0);
CommsErrorCounter++;
ErrorCounter++;
} else {
if ( PrintEntireLog ) {
std::cerr<<"FlightRecorder::RecvLog : VALID "<< RecvLoggingCounter <<" "<< std::hexfloat << _xor << " "<< RecvLogVector[RecvLoggingCounter] <<std::endl;
+1 -2
View File
@@ -9,8 +9,8 @@ class FlightRecorder {
LoggingModeRecord,
LoggingModeVerify
};
static int LoggingMode;
static uint64_t CommsErrorCounter;
static uint64_t ErrorCounter;
static const char * StepName;
static int32_t StepLoggingCounter;
@@ -39,7 +39,6 @@ class FlightRecorder {
static void Truncate(void);
static void ResetCounters(void);
static uint64_t ErrorCount(void);
static uint64_t CommsErrorCount(void);
static void xmitLog(void *,uint64_t bytes);
static void recvLog(void *,uint64_t bytes,int rank);
};
+6 -1
View File
@@ -24,7 +24,11 @@ See the full license in the file "LICENSE" in the top level distribution
directory
*************************************************************************************/
/* END LEGAL */
#include <Grid/Grid.h>
#include "disable_examples_without_instantiations.h"
#ifdef ENABLE_FERMION_INSTANTIATIONS
#include<Grid/Grid.h>
#if Nc == 3
#include <Grid/qcd/smearing/GaugeConfigurationMasked.h>
@@ -230,3 +234,4 @@ int main(int argc, char **argv)
#endif
} // main
#endif
+6 -3
View File
@@ -25,7 +25,11 @@ directory
*************************************************************************************/
/* END LEGAL */
#include <Grid/Grid.h>
#include "disable_examples_without_instantiations.h"
#ifdef ENABLE_FERMION_INSTANTIATIONS
#include<Grid/Grid.h>
#if Nc == 3
#include <Grid/qcd/smearing/GaugeConfigurationMasked.h>
@@ -231,5 +235,4 @@ int main(int argc, char **argv)
#endif
} // main
#endif
+6 -3
View File
@@ -24,7 +24,11 @@ See the full license in the file "LICENSE" in the top level distribution
directory
*************************************************************************************/
/* END LEGAL */
#include <Grid/Grid.h>
#include "disable_examples_without_instantiations.h"
#ifdef ENABLE_FERMION_INSTANTIATIONS
#include<Grid/Grid.h>
#if Nc == 3
#include <Grid/qcd/smearing/GaugeConfigurationMasked.h>
@@ -230,5 +234,4 @@ int main(int argc, char **argv)
#endif
} // main
#endif
+6 -3
View File
@@ -27,7 +27,11 @@ See the full license in the file "LICENSE" in the top level distribution
directory
*************************************************************************************/
/* END LEGAL */
#include <Grid/Grid.h>
#include "disable_examples_without_instantiations.h"
#ifdef ENABLE_FERMION_INSTANTIATIONS
#include<Grid/Grid.h>
int main(int argc, char **argv) {
using namespace Grid;
@@ -195,5 +199,4 @@ int main(int argc, char **argv) {
Grid_finalize();
} // main
#endif
+6 -3
View File
@@ -28,7 +28,11 @@ See the full license in the file "LICENSE" in the top level distribution
directory
*************************************************************************************/
/* END LEGAL */
#include <Grid/Grid.h>
#include "disable_examples_without_instantiations.h"
#ifdef ENABLE_FERMION_INSTANTIATIONS
#include<Grid/Grid.h>
#ifdef GRID_DEFAULT_PRECISION_DOUBLE
#define MIXED_PRECISION
@@ -449,5 +453,4 @@ int main(int argc, char **argv) {
Grid_finalize();
} // main
#endif
+6 -3
View File
@@ -28,7 +28,11 @@ See the full license in the file "LICENSE" in the top level distribution
directory
*************************************************************************************/
/* END LEGAL */
#include <Grid/Grid.h>
#include "disable_examples_without_instantiations.h"
#ifdef ENABLE_FERMION_INSTANTIATIONS
#include<Grid/Grid.h>
#ifdef GRID_DEFAULT_PRECISION_DOUBLE
#define MIXED_PRECISION
@@ -442,5 +446,4 @@ int main(int argc, char **argv) {
Grid_finalize();
} // main
#endif
+7 -1
View File
@@ -28,7 +28,11 @@ See the full license in the file "LICENSE" in the top level distribution
directory
*************************************************************************************/
/* END LEGAL */
#include <Grid/Grid.h>
#include "disable_examples_without_instantiations.h"
#ifdef ENABLE_FERMION_INSTANTIATIONS
#include<Grid/Grid.h>
using namespace Grid;
@@ -918,3 +922,5 @@ int main(int argc, char **argv) {
return 0;
#endif
} // main
#endif
+7 -1
View File
@@ -28,7 +28,11 @@ See the full license in the file "LICENSE" in the top level distribution
directory
*************************************************************************************/
/* END LEGAL */
#include <Grid/Grid.h>
#include "disable_examples_without_instantiations.h"
#ifdef ENABLE_FERMION_INSTANTIATIONS
#include<Grid/Grid.h>
using namespace Grid;
@@ -873,3 +877,5 @@ int main(int argc, char **argv) {
return 0;
#endif
} // main
#endif
+6 -3
View File
@@ -27,7 +27,11 @@ See the full license in the file "LICENSE" in the top level distribution
directory
*************************************************************************************/
/* END LEGAL */
#include <Grid/Grid.h>
#include "disable_examples_without_instantiations.h"
#ifdef ENABLE_FERMION_INSTANTIATIONS
#include<Grid/Grid.h>
int main(int argc, char **argv) {
using namespace Grid;
@@ -193,5 +197,4 @@ int main(int argc, char **argv) {
Grid_finalize();
} // main
#endif
+6 -3
View File
@@ -27,7 +27,11 @@ See the full license in the file "LICENSE" in the top level distribution
directory
*************************************************************************************/
/* END LEGAL */
#include <Grid/Grid.h>
#include "disable_examples_without_instantiations.h"
#ifdef ENABLE_FERMION_INSTANTIATIONS
#include<Grid/Grid.h>
NAMESPACE_BEGIN(Grid);
@@ -512,5 +516,4 @@ int main(int argc, char **argv) {
Grid_finalize();
} // main
#endif
+6 -3
View File
@@ -27,7 +27,11 @@ See the full license in the file "LICENSE" in the top level distribution
directory
*************************************************************************************/
/* END LEGAL */
#include <Grid/Grid.h>
#include "disable_examples_without_instantiations.h"
#ifdef ENABLE_FERMION_INSTANTIATIONS
#include<Grid/Grid.h>
int main(int argc, char **argv) {
using namespace Grid;
@@ -345,5 +349,4 @@ int main(int argc, char **argv) {
Grid_finalize();
} // main
#endif
+6 -3
View File
@@ -27,7 +27,11 @@ See the full license in the file "LICENSE" in the top level distribution
directory
*************************************************************************************/
/* END LEGAL */
#include <Grid/Grid.h>
#include "disable_examples_without_instantiations.h"
#ifdef ENABLE_FERMION_INSTANTIATIONS
#include<Grid/Grid.h>
NAMESPACE_BEGIN(Grid);
@@ -516,5 +520,4 @@ int main(int argc, char **argv) {
Grid_finalize();
} // main
#endif
+6 -3
View File
@@ -27,7 +27,11 @@ See the full license in the file "LICENSE" in the top level distribution
directory
*************************************************************************************/
/* END LEGAL */
#include <Grid/Grid.h>
#include "disable_examples_without_instantiations.h"
#ifdef ENABLE_FERMION_INSTANTIATIONS
#include<Grid/Grid.h>
NAMESPACE_BEGIN(Grid);
@@ -567,5 +571,4 @@ int main(int argc, char **argv) {
Grid_finalize();
} // main
#endif
+6 -3
View File
@@ -27,7 +27,11 @@ See the full license in the file "LICENSE" in the top level distribution
directory
*************************************************************************************/
/* END LEGAL */
#include <Grid/Grid.h>
#include "disable_examples_without_instantiations.h"
#ifdef ENABLE_FERMION_INSTANTIATIONS
#include<Grid/Grid.h>
int main(int argc, char **argv) {
using namespace Grid;
@@ -263,5 +267,4 @@ int main(int argc, char **argv) {
Grid_finalize();
} // main
#endif
+6 -3
View File
@@ -27,7 +27,11 @@ See the full license in the file "LICENSE" in the top level distribution
directory
*************************************************************************************/
/* END LEGAL */
#include <Grid/Grid.h>
#include "disable_examples_without_instantiations.h"
#ifdef ENABLE_FERMION_INSTANTIATIONS
#include<Grid/Grid.h>
int main(int argc, char **argv) {
using namespace Grid;
@@ -417,5 +421,4 @@ int main(int argc, char **argv) {
Grid_finalize();
} // main
#endif
+6 -3
View File
@@ -27,7 +27,11 @@ See the full license in the file "LICENSE" in the top level distribution
directory
*************************************************************************************/
/* END LEGAL */
#include <Grid/Grid.h>
#include "disable_examples_without_instantiations.h"
#ifdef ENABLE_FERMION_INSTANTIATIONS
#include<Grid/Grid.h>
NAMESPACE_BEGIN(Grid);
@@ -452,5 +456,4 @@ int main(int argc, char **argv) {
Grid_finalize();
} // main
#endif
+6 -3
View File
@@ -27,7 +27,11 @@ See the full license in the file "LICENSE" in the top level distribution
directory
*************************************************************************************/
/* END LEGAL */
#include <Grid/Grid.h>
#include "disable_examples_without_instantiations.h"
#ifdef ENABLE_FERMION_INSTANTIATIONS
#include<Grid/Grid.h>
NAMESPACE_BEGIN(Grid);
@@ -462,5 +466,4 @@ int main(int argc, char **argv) {
Grid_finalize();
} // main
#endif
+6 -3
View File
@@ -27,7 +27,11 @@ See the full license in the file "LICENSE" in the top level distribution
directory
*************************************************************************************/
/* END LEGAL */
#include <Grid/Grid.h>
#include "disable_examples_without_instantiations.h"
#ifdef ENABLE_FERMION_INSTANTIATIONS
#include<Grid/Grid.h>
@@ -264,5 +268,4 @@ int main(int argc, char **argv) {
Grid_finalize();
} // main
#endif
@@ -0,0 +1,16 @@
#include <Grid/Grid.h>
#pragma once
#ifndef ENABLE_FERMION_INSTANTIATIONS
#include <iostream>
int main(void) {
std::cout << "This build of Grid was configured to exclude fermion instantiations, "
<< "which this example relies on. "
<< "Please reconfigure and rebuild Grid with --enable-fermion-instantiations"
<< "to run this example."
<< std::endl;
return 1;
}
#endif
+5
View File
@@ -26,6 +26,9 @@ Author: paboyle <paboyle@ph.ed.ac.uk>
See the full license in the file "LICENSE" in the top level distribution directory
*************************************************************************************/
/* END LEGAL */
#include "disable_benchmarks_without_instantiations.h"
#ifdef ENABLE_FERMION_INSTANTIATIONS
#include <Grid/Grid.h>
using namespace Grid;
@@ -731,3 +734,5 @@ int main (int argc, char ** argv)
Grid_finalize();
}
#endif
+4
View File
@@ -20,6 +20,9 @@
See the full license in the file "LICENSE" in the top level distribution directory
*************************************************************************************/
/* END LEGAL */
#include "disable_benchmarks_without_instantiations.h"
#ifdef ENABLE_FERMION_INSTANTIATIONS
#include <Grid/Grid.h>
#ifdef GRID_CUDA
#define CUDA_PROFILE
@@ -439,3 +442,4 @@ void Benchmark(int Ls, Coordinate Dirichlet,bool sloppy)
GRID_ASSERT(norm2(src_e)<1.0e-4);
GRID_ASSERT(norm2(src_o)<1.0e-4);
}
#endif
+6
View File
@@ -20,6 +20,10 @@
See the full license in the file "LICENSE" in the top level distribution directory
*************************************************************************************/
/* END LEGAL */
#include "disable_benchmarks_without_instantiations.h"
#ifdef ENABLE_FERMION_INSTANTIATIONS
#include <Grid/Grid.h>
#ifdef GRID_CUDA
#define CUDA_PROFILE
@@ -439,3 +443,5 @@ void Benchmark(int Ls, Coordinate Dirichlet,bool sloppy)
GRID_ASSERT(norm2(src_e)<1.0e-4);
GRID_ASSERT(norm2(src_o)<1.0e-4);
}
#endif
@@ -20,6 +20,9 @@
See the full license in the file "LICENSE" in the top level distribution directory
*************************************************************************************/
/* END LEGAL */
#include "disable_benchmarks_without_instantiations.h"
#ifdef ENABLE_FERMION_INSTANTIATIONS
#include <Grid/Grid.h>
#ifdef GRID_CUDA
#define CUDA_PROFILE
@@ -385,3 +388,5 @@ int main (int argc, char ** argv)
Grid_finalize();
exit(0);
}
#endif
+4 -2
View File
@@ -26,6 +26,9 @@ Author: paboyle <paboyle@ph.ed.ac.uk>
See the full license in the file "LICENSE" in the top level distribution directory
*************************************************************************************/
/* END LEGAL */
#include "disable_benchmarks_without_instantiations.h"
#ifdef ENABLE_FERMION_INSTANTIATIONS
#include <Grid/Grid.h>
using namespace std;
@@ -238,5 +241,4 @@ void benchDw(std::vector<int> & latt4, int Ls, int threads,int report )
}
}
#endif
+5
View File
@@ -1,3 +1,7 @@
#include "disable_benchmarks_without_instantiations.h"
#ifdef ENABLE_FERMION_INSTANTIATIONS
#include <Grid/Grid.h>
#include <sstream>
using namespace std;
@@ -155,3 +159,4 @@ int main (int argc, char ** argv)
Grid_finalize();
}
#endif
+5
View File
@@ -20,6 +20,9 @@
See the full license in the file "LICENSE" in the top level distribution directory
*************************************************************************************/
/* END LEGAL */
#include "disable_benchmarks_without_instantiations.h"
#ifdef ENABLE_FERMION_INSTANTIATIONS
#include <Grid/Grid.h>
#ifdef GRID_CUDA
#define CUDA_PROFILE
@@ -129,3 +132,5 @@ int main (int argc, char ** argv)
Grid_finalize();
exit(0);
}
#endif
+5
View File
@@ -26,6 +26,9 @@ Author: paboyle <paboyle@ph.ed.ac.uk>
See the full license in the file "LICENSE" in the top level distribution directory
*************************************************************************************/
/* END LEGAL */
#include "disable_benchmarks_without_instantiations.h"
#ifdef ENABLE_FERMION_INSTANTIATIONS
#include <Grid/Grid.h>
using namespace std;
@@ -149,3 +152,5 @@ int main (int argc, char ** argv)
Grid_finalize();
}
#endif
+4 -2
View File
@@ -26,6 +26,9 @@ Author: paboyle <paboyle@ph.ed.ac.uk>
See the full license in the file "LICENSE" in the top level distribution directory
*************************************************************************************/
/* END LEGAL */
#include "disable_benchmarks_without_instantiations.h"
#ifdef ENABLE_FERMION_INSTANTIATIONS
#include <Grid/Grid.h>
using namespace std;
@@ -172,5 +175,4 @@ void benchDw(std::vector<int> & latt4, int Ls)
// Dw.Report();
}
#endif
+45 -11
View File
@@ -26,11 +26,13 @@ Author: paboyle <paboyle@ph.ed.ac.uk>
See the full license in the file "LICENSE" in the top level distribution directory
*************************************************************************************/
/* END LEGAL */
#include "disable_benchmarks_without_instantiations.h"
#ifdef ENABLE_FERMION_INSTANTIATIONS
#include <Grid/Grid.h>
using namespace std;
using namespace Grid;
;
int main (int argc, char ** argv)
{
@@ -94,19 +96,51 @@ int main (int argc, char ** argv)
RealD c2=-1.0/24.0;
RealD u0=1.0;
ImprovedStaggeredFermionD Ds(Umu,Umu,Grid,RBGrid,mass,c1,c2,u0,params);
std::cout<<GridLogMessage << "Calling Ds"<<std::endl;
NaiveStaggeredFermionD Dn(Umu,Grid,RBGrid,mass,c1,u0,params);
FermionField src_e(&RBGrid); pickCheckerboard(Even, src_e, src);
FermionField src_o(&RBGrid); pickCheckerboard(Odd, src_o, src);
FermionField res_e(&RBGrid); res_e = Zero();
int ncall=1000;
// ImprovedStaggered Dhop
for(int i=0;i<ncall;i++) Ds.Dhop(src,result,0);
double t0=usecond();
for(int i=0;i<ncall;i++){
Ds.Dhop(src,result,0);
}
for(int i=0;i<ncall;i++) Ds.Dhop(src,result,0);
double t1=usecond();
double flops=(16*(3*(6+8+8)) + 15*3*2)*volume*ncall; // == 66*16 + == 1146
std::cout<<GridLogMessage << "Called Ds"<<std::endl;
std::cout<<GridLogMessage << "norm result "<< norm2(result)<<std::endl;
std::cout<<GridLogMessage << "mflop/s = "<< flops/(t1-t0)<<std::endl;
double flops=(16*(3*(6+8+8)) + 15*3*2)*volume*ncall; // 1146 flops/site
std::cout<<GridLogMessage << ncall << " ImprovedStaggered Dhop calls in "<< (t1-t0)<<" us"<<std::endl;
std::cout<<GridLogMessage << "ImprovedStaggered Dhop mflop/s = "<< flops/(t1-t0)<<std::endl;
// ImprovedStaggered DhopEO
for(int i=0;i<ncall;i++) Ds.DhopEO(src_o,res_e,0);
t0=usecond();
for(int i=0;i<ncall;i++) Ds.DhopEO(src_o,res_e,0);
t1=usecond();
flops=(16*(3*(6+8+8)) + 15*3*2)*(volume/2)*ncall;
std::cout<<GridLogMessage << ncall << " ImprovedStaggered DhopEO calls in "<< (t1-t0)<<" us"<<std::endl;
std::cout<<GridLogMessage << "ImprovedStaggered DhopEO mflop/s = "<< flops/(t1-t0)<<std::endl;
// NaiveStaggered Dhop
for(int i=0;i<ncall;i++) Dn.Dhop(src,result,0);
t0=usecond();
for(int i=0;i<ncall;i++) Dn.Dhop(src,result,0);
t1=usecond();
flops=(8*(3*(6+8+8)) + 7*3*2)*volume*ncall;
std::cout<<GridLogMessage << ncall << " NaiveStaggered Dhop calls in "<< (t1-t0)<<" us"<<std::endl;
std::cout<<GridLogMessage << "NaiveStaggered Dhop mflop/s = "<< flops/(t1-t0)<<std::endl;
// NaiveStaggered DhopEO
for(int i=0;i<ncall;i++) Dn.DhopEO(src_o,res_e,0);
t0=usecond();
for(int i=0;i<ncall;i++) Dn.DhopEO(src_o,res_e,0);
t1=usecond();
flops=(8*(3*(6+8+8)) + 7*3*2)*(volume/2)*ncall;
std::cout<<GridLogMessage << ncall << " NaiveStaggered DhopEO calls in "<< (t1-t0)<<" us"<<std::endl;
std::cout<<GridLogMessage << "NaiveStaggered DhopEO mflop/s = "<< flops/(t1-t0)<<std::endl;
Grid_finalize();
}
#endif
+42 -14
View File
@@ -26,6 +26,9 @@ Author: paboyle <paboyle@ph.ed.ac.uk>
See the full license in the file "LICENSE" in the top level distribution directory
*************************************************************************************/
/* END LEGAL */
#include "disable_benchmarks_without_instantiations.h"
#ifdef ENABLE_FERMION_INSTANTIATIONS
#include <Grid/Grid.h>
using namespace std;
@@ -93,22 +96,47 @@ int main (int argc, char ** argv)
RealD c2=-1.0/24.0;
RealD u0=1.0;
ImprovedStaggeredFermionF Ds(Umu,Umu,Grid,RBGrid,mass,c1,c2,u0,params);
std::cout<<GridLogMessage << "Calling Ds"<<std::endl;
int ncall=1000;
for(int i=0;i<ncall;i++){
Ds.Dhop(src,result,0);
}
NaiveStaggeredFermionF Dn(Umu,Grid,RBGrid,mass,c1,u0,params);
FermionField src_e(&RBGrid); pickCheckerboard(Even, src_e, src);
FermionField src_o(&RBGrid); pickCheckerboard(Odd, src_o, src);
FermionField res_e(&RBGrid); res_e = Zero();
int ncall=10000;
// ImprovedStaggered Dhop
for(int i=0;i<ncall;i++) Ds.Dhop(src,result,0);
double t0=usecond();
for(int i=0;i<ncall;i++){
Ds.Dhop(src,result,0);
}
for(int i=0;i<ncall;i++) Ds.Dhop(src,result,0);
double t1=usecond();
double flops=(16*(3*(6+8+8)) + 15*3*2)*volume*ncall; // == 66*16 + == 1146
std::cout<<GridLogMessage << "Called Ds"<<std::endl;
std::cout<<GridLogMessage << "norm result "<< norm2(result)<<std::endl;
std::cout<<GridLogMessage << "mflop/s = "<< flops/(t1-t0)<<std::endl;
double flops=(16*(3*(6+8+8)) + 15*3*2)*volume*ncall; // 1146 flops/site
std::cout<<GridLogMessage << "ImprovedStaggered Dhop mflop/s = "<< flops/(t1-t0)<<std::endl;
// ImprovedStaggered DhopEO
for(int i=0;i<ncall;i++) Ds.DhopEO(src_o,res_e,0);
t0=usecond();
for(int i=0;i<ncall;i++) Ds.DhopEO(src_o,res_e,0);
t1=usecond();
flops=(16*(3*(6+8+8)) + 15*3*2)*(volume/2)*ncall;
std::cout<<GridLogMessage << "ImprovedStaggered DhopEO mflop/s = "<< flops/(t1-t0)<<std::endl;
// NaiveStaggered Dhop
for(int i=0;i<ncall;i++) Dn.Dhop(src,result,0);
t0=usecond();
for(int i=0;i<ncall;i++) Dn.Dhop(src,result,0);
t1=usecond();
flops=(8*(3*(6+8+8)) + 7*3*2)*volume*ncall;
std::cout<<GridLogMessage << "NaiveStaggered Dhop mflop/s = "<< flops/(t1-t0)<<std::endl;
// NaiveStaggered DhopEO
for(int i=0;i<ncall;i++) Dn.DhopEO(src_o,res_e,0);
t0=usecond();
for(int i=0;i<ncall;i++) Dn.DhopEO(src_o,res_e,0);
t1=usecond();
flops=(8*(3*(6+8+8)) + 7*3*2)*(volume/2)*ncall;
std::cout<<GridLogMessage << "NaiveStaggered DhopEO mflop/s = "<< flops/(t1-t0)<<std::endl;
Grid_finalize();
}
#endif
+176 -6
View File
@@ -26,6 +26,10 @@ Author: paboyle <paboyle@ph.ed.ac.uk>
See the full license in the file "LICENSE" in the top level distribution directory
*************************************************************************************/
/* END LEGAL */
#include "disable_benchmarks_without_instantiations.h"
#ifdef ENABLE_FERMION_INSTANTIATIONS
#include <Grid/Grid.h>
#include <Grid/algorithms/blas/BatchedBlas.h>
@@ -712,6 +716,161 @@ public:
return mflops_best;
}
static double NaiveStaggered(int L)
{
double mflops;
double mflops_best = 0;
double mflops_worst= 0;
std::vector<double> mflops_all;
///////////////////////////////////////////////////////
// Set/Get the layout & grid size
///////////////////////////////////////////////////////
int threads = GridThread::GetThreads();
Coordinate mpi = GridDefaultMpi(); GRID_ASSERT(mpi.size()==4);
Coordinate local({L,L,L,L});
Coordinate latt4({local[0]*mpi[0],local[1]*mpi[1],local[2]*mpi[2],local[3]*mpi[3]});
GridCartesian * TmpGrid = SpaceTimeGrid::makeFourDimGrid(latt4,
GridDefaultSimd(Nd,vComplex::Nsimd()),
GridDefaultMpi());
uint64_t NP = TmpGrid->RankCount();
uint64_t NN = TmpGrid->NodeCount();
NN_global=NN;
uint64_t SHM=NP/NN;
///////// Welcome message ////////////
std::cout<<GridLogMessage << "=================================================================================="<<std::endl;
std::cout<<GridLogMessage << "Benchmark NaiveStaggered on "<<L<<"^4 local volume "<<std::endl;
std::cout<<GridLogMessage << "* Global volume : "<<GridCmdVectorIntToString(latt4)<<std::endl;
std::cout<<GridLogMessage << "* ranks : "<<NP <<std::endl;
std::cout<<GridLogMessage << "* nodes : "<<NN <<std::endl;
std::cout<<GridLogMessage << "* ranks/node : "<<SHM <<std::endl;
std::cout<<GridLogMessage << "* ranks geom : "<<GridCmdVectorIntToString(mpi)<<std::endl;
std::cout<<GridLogMessage << "* Using "<<threads<<" threads"<<std::endl;
std::cout<<GridLogMessage << "=================================================================================="<<std::endl;
///////// Lattice Init ////////////
GridCartesian * FGrid = SpaceTimeGrid::makeFourDimGrid(latt4, GridDefaultSimd(Nd,vComplexF::Nsimd()),GridDefaultMpi());
GridRedBlackCartesian * FrbGrid = SpaceTimeGrid::makeFourDimRedBlackGrid(FGrid);
///////// RNG Init ////////////
std::vector<int> seeds4({1,2,3,4});
GridParallelRNG RNG4(FGrid); RNG4.SeedFixedIntegers(seeds4);
std::cout << GridLogMessage << "Initialised RNGs" << std::endl;
RealD mass=0.1;
RealD c1=9.0/8.0;
RealD c2=-1.0/24.0;
RealD u0=1.0;
typedef NaiveStaggeredFermionF Action;
typedef typename Action::FermionField Fermion;
typedef LatticeGaugeFieldF Gauge;
Gauge Umu(FGrid); SU<Nc>::HotConfiguration(RNG4,Umu);
typename Action::ImplParams params;
Action Ds(Umu,*FGrid,*FrbGrid,mass,c1,u0,params);
///////// Source preparation ////////////
Fermion src (FGrid); random(RNG4,src);
Fermion src_e (FrbGrid);
Fermion src_o (FrbGrid);
Fermion r_e (FrbGrid);
Fermion r_o (FrbGrid);
Fermion r_eo (FGrid);
{
pickCheckerboard(Even,src_e,src);
pickCheckerboard(Odd,src_o,src);
const int num_cases = 2;
std::string fmt("G/S/C ; G/O/C ; G/S/S ; G/O/S ");
controls Cases [] = {
{ StaggeredKernelsStatic::OptGeneric , StaggeredKernelsStatic::CommsAndCompute ,CartesianCommunicator::CommunicatorPolicyConcurrent },
{ StaggeredKernelsStatic::OptHandUnroll, StaggeredKernelsStatic::CommsAndCompute ,CartesianCommunicator::CommunicatorPolicyConcurrent },
{ StaggeredKernelsStatic::OptInlineAsm , StaggeredKernelsStatic::CommsAndCompute ,CartesianCommunicator::CommunicatorPolicyConcurrent }
};
for(int c=0;c<num_cases;c++) {
StaggeredKernelsStatic::Comms = Cases[c].CommsOverlap;
StaggeredKernelsStatic::Opt = Cases[c].Opt;
CartesianCommunicator::SetCommunicatorPolicy(Cases[c].CommsAsynch);
std::cout<<GridLogMessage << "=================================================================================="<<std::endl;
if ( StaggeredKernelsStatic::Opt == StaggeredKernelsStatic::OptGeneric ) std::cout << GridLogMessage<< "* Using GENERIC Nc StaggeredKernels" <<std::endl;
std::cout << GridLogMessage<< "* SINGLE precision "<<std::endl;
std::cout<<GridLogMessage << "=================================================================================="<<std::endl;
int nwarm = 10;
double t0=usecond();
FGrid->Barrier();
for(int i=0;i<nwarm;i++){
Ds.DhopEO(src_o,r_e,DaggerNo);
}
FGrid->Barrier();
double t1=usecond();
uint64_t no = 50;
uint64_t ni = 100;
// std::cout << GridLogMessage << " Estimate " << ncall << " calls per second"<<std::endl;
time_statistics timestat;
std::vector<double> t_time(no);
for(uint64_t i=0;i<no;i++){
t0=usecond();
for(uint64_t j=0;j<ni;j++){
Ds.DhopEO(src_o,r_e,DaggerNo);
}
t1=usecond();
t_time[i] = t1-t0;
}
FGrid->Barrier();
double volume=1; for(int mu=0;mu<Nd;mu++) volume=volume*latt4[mu];
double flops=((8*(3*(6+8+8)) + 7*3*2)*1.0*volume)/2;
double mf_hi, mf_lo, mf_err;
timestat.statistics(t_time);
mf_hi = flops/timestat.min*ni;
mf_lo = flops/timestat.max*ni;
mf_err= flops/timestat.min * timestat.err/timestat.mean;
mflops = flops/timestat.mean*ni;
mflops_all.push_back(mflops);
if ( mflops_best == 0 ) mflops_best = mflops;
if ( mflops_worst== 0 ) mflops_worst= mflops;
if ( mflops>mflops_best ) mflops_best = mflops;
if ( mflops<mflops_worst) mflops_worst= mflops;
std::cout<<GridLogMessage << std::fixed << std::setprecision(1)<<"Deo mflop/s = "<< mflops << " ("<<mf_err<<") " << mf_lo<<"-"<<mf_hi <<std::endl;
std::cout<<GridLogMessage << std::fixed << std::setprecision(1)<<"Deo mflop/s per rank "<< mflops/NP<<std::endl;
std::cout<<GridLogMessage << std::fixed << std::setprecision(1)<<"Deo mflop/s per node "<< mflops/NN<<std::endl;
std::cout<<GridLogMessage << std::fixed << std::setprecision(1)<<"Deo us per call "<< timestat.mean/ni<<std::endl;
}
std::cout<<GridLogMessage << "=================================================================================="<<std::endl;
std::cout<<GridLogMessage << L<<"^4 Deo Best mflop/s = "<< mflops_best << " ; " << mflops_best/NN<<" per node " <<std::endl;
std::cout<<GridLogMessage << L<<"^4 Deo Worst mflop/s = "<< mflops_worst<< " ; " << mflops_worst/NN<<" per node " <<std::endl;
std::cout<<GridLogMessage <<fmt << std::endl;
std::cout<<GridLogMessage ;
for(int i=0;i<mflops_all.size();i++){
std::cout<<mflops_all[i]/NN<<" ; " ;
}
std::cout<<std::endl;
}
std::cout<<GridLogMessage << "=================================================================================="<<std::endl;
return mflops_best;
}
static double Clover(int L)
{
double mflops;
@@ -883,6 +1042,7 @@ int main (int argc, char ** argv)
std::vector<double> clover;
std::vector<double> dwf4;
std::vector<double> staggered;
std::vector<double> naive_staggered;
int Ls=1;
if (do_dslash){
@@ -910,13 +1070,21 @@ int main (int argc, char ** argv)
staggered.push_back(result);
}
std::cout<<GridLogMessage << "=================================================================================="<<std::endl;
std::cout<<GridLogMessage << " Naive Staggered dslash 4D vectorised" <<std::endl;
std::cout<<GridLogMessage << "=================================================================================="<<std::endl;
for(int l=0;l<L_list.size();l++){
double result = Benchmark::NaiveStaggered(L_list[l]) ;
naive_staggered.push_back(result);
}
std::cout<<GridLogMessage << "=================================================================================="<<std::endl;
std::cout<<GridLogMessage << " Summary table Ls="<<Ls <<std::endl;
std::cout<<GridLogMessage << "=================================================================================="<<std::endl;
std::cout<<GridLogMessage << "L \t\t Clover \t\t DWF4 \t\t Staggered" <<std::endl;
std::cout<<GridLogMessage << "L \t\t Clover \t\t DWF4 \t\t Staggered \t\t Naive Staggered" <<std::endl;
for(int l=0;l<L_list.size();l++){
std::cout<<GridLogMessage << L_list[l] <<" \t\t "<< clover[l]<<" \t\t "<<dwf4[l] << " \t\t "<< staggered[l]<<std::endl;
std::cout<<GridLogMessage << L_list[l] <<" \t\t "<< clover[l]<<" \t\t "<<dwf4[l] << " \t\t "<< staggered[l]<<" \t\t "<<naive_staggered[l]<<std::endl;
}
std::cout<<GridLogMessage << "=================================================================================="<<std::endl;
}
@@ -926,14 +1094,14 @@ int main (int argc, char ** argv)
std::cout<<GridLogMessage << "=================================================================================="<<std::endl;
std::cout<<GridLogMessage << " Per Node Summary table Ls="<<Ls <<std::endl;
std::cout<<GridLogMessage << "=================================================================================="<<std::endl;
std::cout<<GridLogMessage << " L \t\t Clover\t\t DWF4\t\t Staggered (GF/s per node)" <<std::endl;
std::cout<<GridLogMessage << " L \t\t Clover\t\t DWF4\t\t Staggered \t\t NaiveStag \t|\t (GF/s per node)" <<std::endl;
fprintf(FP,"Per node summary table\n");
fprintf(FP,"\n");
fprintf(FP,"L , Wilson, DWF4, Staggered, GF/s per node\n");
fprintf(FP,"L , Wilson, DWF4, Staggered, NaiveStag\n");
fprintf(FP,"\n");
for(int l=0;l<L_list.size();l++){
std::cout<<GridLogMessage << L_list[l] <<" \t\t "<< clover[l]/NN<<" \t "<<dwf4[l]/NN<< " \t "<<staggered[l]/NN<<std::endl;
fprintf(FP,"%d , %.0f, %.0f, %.0f\n",L_list[l],clover[l]/NN/1000.,dwf4[l]/NN/1000.,staggered[l]/NN/1000.);
std::cout<<GridLogMessage << L_list[l] <<" \t\t "<< clover[l]/NN<<" \t "<<dwf4[l]/NN<< " \t "<<staggered[l]/NN<<" \t " <<naive_staggered[l]/NN<<std::endl;
fprintf(FP,"%d , %.0f, %.0f, %.0f, %.0f\n",L_list[l],clover[l]/NN/1000.,dwf4[l]/NN/1000.,staggered[l]/NN/1000.,naive_staggered[l]/NN/1000.);
}
fprintf(FP,"\n");
std::cout<<GridLogMessage << "=================================================================================="<<std::endl;
@@ -978,3 +1146,5 @@ int main (int argc, char ** argv)
Grid_finalize();
fclose(FP);
}
#endif
+5
View File
@@ -26,6 +26,9 @@ Author: paboyle <paboyle@ph.ed.ac.uk>
See the full license in the file "LICENSE" in the top level distribution directory
*************************************************************************************/
/* END LEGAL */
#include "disable_benchmarks_without_instantiations.h"
#ifdef ENABLE_FERMION_INSTANTIATIONS
#include <Grid/Grid.h>
using namespace std;
@@ -258,3 +261,5 @@ int main (int argc, char ** argv)
Grid_finalize();
}
#endif
+5
View File
@@ -19,6 +19,9 @@ Author: Richard Rollins <rprollins@users.noreply.github.com>
See the full license in the file "LICENSE" in the top level distribution directory
*************************************************************************************/
/* END LEGAL */
#include "disable_benchmarks_without_instantiations.h"
#ifdef ENABLE_FERMION_INSTANTIATIONS
#include <Grid/Grid.h>
using namespace std;
@@ -161,3 +164,5 @@ void bench_wilson_eo (
double flops = (single_site_flops * volume * ncall)/2.0;
std::cout << flops/(t1-t0) << "\t\t";
}
#endif
@@ -0,0 +1,16 @@
#include <Grid/Grid.h>
#pragma once
#ifndef ENABLE_FERMION_INSTANTIATIONS
#include <iostream>
int main(void) {
std::cout << "This build of Grid was configured to exclude fermion instantiations, "
<< "which this benchmark relies on. "
<< "Please reconfigure and rebuild Grid with --enable-fermion-instantiations"
<< "to run this benchmark."
<< std::endl;
return 1;
}
#endif
+9
View File
@@ -172,6 +172,12 @@ case ${ac_TRACING} in
esac
############### fermions
AC_ARG_ENABLE([fermion-instantiations],
[AS_HELP_STRING([--enable-fermion-instantiations=yes|no],[enable fermion instantiations])],
[ac_FERMION_REPS=${enable_fermion_instantiations}], [ac_FERMION_INSTANTIATIONS=yes])
AM_CONDITIONAL(BUILD_FERMION_INSTANTIATIONS, [ test "${ac_FERMION_INSTANTIATIONS}X" == "yesX" ])
AC_ARG_ENABLE([fermion-reps],
[AS_HELP_STRING([--enable-fermion-reps=yes|no],[enable extra fermion representation support])],
[ac_FERMION_REPS=${enable_fermion_reps}], [ac_FERMION_REPS=yes])
@@ -194,6 +200,9 @@ AM_CONDITIONAL(BUILD_ZMOBIUS, [ test "${ac_ZMOBIUS}X" == "yesX" ])
case ${ac_FERMION_REPS} in
yes) AC_DEFINE([ENABLE_FERMION_REPS],[1],[non QCD fermion reps]);;
esac
case ${ac_FERMION_INSTANTIATIONS} in
yes) AC_DEFINE([ENABLE_FERMION_INSTANTIATIONS],[1],[enable fermions]);;
esac
case ${ac_GPARITY} in
yes) AC_DEFINE([ENABLE_GPARITY],[1],[fermion actions with GPARITY BCs]);;
esac
+4 -2
View File
@@ -3,6 +3,9 @@
* without regression / tests being applied
*/
#include "disable_examples_without_instantiations.h"
#ifdef ENABLE_FERMION_INSTANTIATIONS
#include <Grid/Grid.h>
using namespace std;
@@ -310,5 +313,4 @@ int main (int argc, char ** argv)
Grid_finalize();
}
#endif
+4 -2
View File
@@ -3,6 +3,9 @@
* without regression / tests being applied
*/
#include "disable_examples_without_instantiations.h"
#ifdef ENABLE_FERMION_INSTANTIATIONS
#include <Grid/Grid.h>
using namespace std;
@@ -432,5 +435,4 @@ int main (int argc, char ** argv)
Grid_finalize();
}
#endif
+4 -2
View File
@@ -3,6 +3,9 @@
* without regression / tests being applied
*/
#include "disable_examples_without_instantiations.h"
#ifdef ENABLE_FERMION_INSTANTIATIONS
#include <Grid/Grid.h>
using namespace std;
@@ -535,5 +538,4 @@ int main (int argc, char ** argv)
Grid_finalize();
}
#endif
+4 -2
View File
@@ -3,6 +3,9 @@
* without regression / tests being applied
*/
#include "disable_examples_without_instantiations.h"
#ifdef ENABLE_FERMION_INSTANTIATIONS
#include <Grid/Grid.h>
using namespace std;
@@ -429,5 +432,4 @@ int main (int argc, char ** argv)
Grid_finalize();
}
#endif
@@ -0,0 +1,15 @@
#include <Grid/Grid.h>
#pragma once
#ifndef ENABLE_FERMION_INSTANTIATIONS
#include <iostream>
int main(void) {
std::cout << "This build of Grid was configured to exclude fermion instantiations, "
<< "which this example relies on. "
<< "Please reconfigure and rebuild Grid with --enable-fermion-instantiations"
<< "to run this example."
<< std::endl;
return 1;
}
#endif
+1 -1
View File
@@ -11,7 +11,7 @@ CCFILES=`find . -name '*.cc' -not -path '*/instantiation/*/*' -not -path '*/gamm
ZWILS_FERMION_FILES=` find . -name '*.cc' -path '*/instantiation/*' -path '*/instantiation/ZWilsonImpl*' `
WILS_FERMION_FILES=` find . -name '*.cc' -path '*/instantiation/*' -path '*/instantiation/WilsonImpl*' `
STAG_FERMION_FILES=` find . -name '*.cc' -path '*/instantiation/*' -path '*/instantiation/Staggered*' `
STAG_FERMION_FILES=` find . -name '*.cc' -path '*/instantiation/*' -path '*/instantiation/StaggeredImpl*' `
GP_FERMION_FILES=` find . -name '*.cc' -path '*/instantiation/*' -path '*/instantiation/Gparity*' `
ADJ_FERMION_FILES=` find . -name '*.cc' -path '*/instantiation/*' -path '*/instantiation/WilsonAdj*' `
TWOIND_FERMION_FILES=`find . -name '*.cc' -path '*/instantiation/*' -path '*/instantiation/WilsonTwoIndex*'`
+196
View File
@@ -0,0 +1,196 @@
---
name: communication-overlap
description: Design and implement communication/computation overlap pipelines for GPU+MPI codes — per-packet event tracking, host-staging through pinned memory, internode/intranode bandwidth separation, and the 7-phase pipeline pattern that replaces broken accelerator-aware MPI paths.
user-invocable: true
allowed-tools:
- Read
- Bash(grep -r)
---
# Communication/Computation Overlap Pipeline Design
## Why GPU-Direct MPI Is Often Not the Right Default
GPU-direct RDMA (passing GPU buffer pointers directly to MPI) is appealing because it eliminates explicit D2H/H2D copies. In practice on several leadership systems:
- **Bandwidth**: RDMA at 30% of wirespeed has been observed on Pontevecchio/Aurora. The overhead of staging through pinned host memory can be *lower* total latency than slow RDMA.
- **Correctness**: Device buffer aliasing in `MPI_Sendrecv` (see `mpi-heterogeneous.md`) makes direct GPU-to-GPU transfer unreliable.
- **Overlap**: Host-staging enables fine-grained overlap — each packet's D2H can be issued as a separate asynchronous event, and the corresponding MPI send can fire as soon as *that packet* arrives in host memory, not after all packets are ready.
The pipeline pattern below was developed to replace broken MPICH accelerator-aware paths. It achieves genuine computation/communication overlap by tracking per-packet GPU events.
## The 7-Phase Pipeline
Given a set of halo exchange operations (each identified by a `packet_index`):
### Phase 0: Prepare data on device
Pack halo data into contiguous GPU buffers. One buffer per direction/neighbour.
### Phase 1: Post receives + start D2H
Post all `MPI_Irecv` calls immediately (into pinned host buffers). Simultaneously, start asynchronous D2H copies for all send buffers:
```cpp
for (auto &pkt : send_packets) {
MPI_Irecv(pkt.host_recv_buf, pkt.bytes, MPI_BYTE,
pkt.src_rank, pkt.tag, comm, &pkt.recv_req);
acceleratorCopyFromDeviceAsync(pkt.device_send_buf,
pkt.host_send_buf,
pkt.bytes, &pkt.d2h_event);
}
```
The key: `pkt.d2h_event` is a per-packet GPU event (e.g. `cudaEvent_t`, `hipEvent_t`, or SYCL event). We can poll individual packet completion rather than waiting for all.
### Phase 2: Fire sends as D2H completes (packet by packet)
Poll packet D2H events. As each packet becomes ready in host memory, immediately fire the corresponding `MPI_Isend`. Also start intranode D2D copies at this point — these are deferred until now to avoid competing with the internode D2H on PCIe bandwidth:
```cpp
bool all_sent = false;
while (!all_sent) {
all_sent = true;
for (auto &pkt : send_packets) {
if (!pkt.sent && acceleratorEventIsComplete(pkt.d2h_event)) {
MPI_Isend(pkt.host_send_buf, pkt.bytes, MPI_BYTE,
pkt.dst_rank, pkt.tag, comm, &pkt.send_req);
pkt.sent = true;
start_intranode_copy(pkt); // now safe, D2H is done
}
if (!pkt.sent) all_sent = false;
}
}
```
### Phase 3: Poll receives + start H2D as each arrives
`MPI_Test` individual receive requests. As each completes, immediately start the H2D copy into device-resident halo buffer:
```cpp
bool all_recvd = false;
while (!all_recvd) {
all_recvd = true;
for (auto &pkt : recv_packets) {
if (!pkt.h2d_started) {
int flag = 0;
MPI_Test(&pkt.recv_req, &flag, MPI_STATUS_IGNORE);
if (flag) {
acceleratorCopyToDeviceAsync(pkt.host_recv_buf,
pkt.device_recv_buf,
pkt.bytes, &pkt.h2d_event);
pkt.h2d_started = true;
}
}
if (!pkt.h2d_started) all_recvd = false;
}
}
```
### Phase 4: Wait for all sends
```cpp
std::vector<MPI_Request> send_reqs;
for (auto &pkt : send_packets) send_reqs.push_back(pkt.send_req);
MPI_Waitall(send_reqs.size(), send_reqs.data(), MPI_STATUSES_IGNORE);
```
### Phase 5: Wait for all H2D copies
```cpp
for (auto &pkt : recv_packets) acceleratorEventWait(pkt.h2d_event);
```
### Phase 6: Run interior computation
The interior (non-halo) computation can run from Phase 1 onwards, overlapped with all of the above:
```cpp
// Launched in Phase 1, runs in parallel with the pipeline
accelerator_for(ss, interior_sites, ...) { compute_interior(ss); }
```
Synchronise with interior before using the full field:
```cpp
accelerator_barrier(); // interior kernel done
// Halo H2D is also complete (Phase 5 above)
// Now safe to use full field
```
## Per-Packet Event Tracking Data Structure
```cpp
struct Packet {
// Buffers
void *device_send_buf;
void *host_send_buf; // pinned
void *device_recv_buf;
void *host_recv_buf; // pinned
size_t bytes;
// MPI
int src_rank, dst_rank, tag;
MPI_Request send_req, recv_req;
// GPU events (one per packet, not one global barrier)
AcceleratorEvent d2h_event;
AcceleratorEvent h2d_event;
// State flags
bool sent = false;
bool h2d_started = false;
};
```
The critical design point: `d2h_event` and `h2d_event` are **per-packet**, not global. This allows the MPI send for packet 0 to fire while packet 1's D2H is still in progress.
## Internode vs Intranode Separation
PCIe (GPU-to-CPU) and NVLink/xGMI (GPU-to-GPU within a node) are separate bandwidth resources. They do not compete with each other, but they *do* compete with each other for transactions if both are active simultaneously.
Strategy: complete all internode D2H copies first (to maximise NIC injection bandwidth), then start intranode D2D copies (which use NVLink/xGMI and do not contend with PCIe for internode traffic):
```cpp
// In Phase 2: start intranode D2D only after D2H is confirmed complete
if (pkt.is_intranode && pkt.d2h_done) {
// Use peer access (cudaMemcpyPeerAsync / hipMemcpyPeerAsync)
// rather than staging through host for intranode
cudaMemcpyPeerAsync(pkt.peer_recv_buf, pkt.dst_device,
pkt.device_send_buf, pkt.src_device,
pkt.bytes, computeStream);
}
```
Grid reference: `Grid/communicator/Communicator_mpi3.cc` — search for `NVLINK_GET` and `ACCELERATOR_AWARE_MPI` conditional blocks.
## Pinned Memory Allocation
All host staging buffers must be pinned (page-locked) for async D2H/H2D:
```cpp
// CUDA
cudaMallocHost(&host_buf, bytes);
cudaFreeHost(host_buf);
// HIP
hipHostMalloc(&host_buf, bytes, hipHostMallocDefault);
hipHostFree(host_buf);
// SYCL
host_buf = sycl::malloc_host(bytes, *queue);
sycl::free(host_buf, *queue);
```
Pre-allocate at startup. Repeated `cudaMallocHost` in the hot path adds latency from the OS memory manager.
## Checksumming in the Pipeline
Insert checksum computation before D2H (on the GPU-resident data) and verification after H2D (on the received GPU-resident data). See `correctness-verification.md` for the checksum pattern. The salting (`packet_index + 1000 * tag`) detects packet transposition — critical for diagnosing MPI buffer aliasing bugs where two packets' contents are swapped.
## Smoke Test for a New System
Before running physics, validate the pipeline on a synthetic benchmark:
```cpp
// Send a buffer of known values, receive and check
// Run at multiple message sizes: 4KB, 64KB, 1MB, 16MB
// Run at multiple process counts: 2, 8, 64, 512
// Verify checksums on every packet
// Measure bandwidth: should be ≥ 80% of FDR/HDR/NDR peak for host-staged
```
Any bandwidth below 50% of theoretical, or any checksum failure, indicates a problem in the communication stack that must be resolved before production runs.
+154
View File
@@ -0,0 +1,154 @@
---
name: compiler-validation
description: Identify GPU compiler code generation bugs, distinguish them from hardware and runtime bugs, construct minimal reproducers, and validate correctness of generated assembly for performance-critical HPC kernels.
user-invocable: true
allowed-tools:
- Read
- Bash(grep -r)
- Bash(objdump)
---
# Compiler Validation for GPU HPC Codes
## Why Compiler Bugs Are Distinct
Compiler bugs have a unique diagnostic signature: they produce *deterministically wrong* results. The same input always produces the same wrong output. This distinguishes them from:
- Hardware bugs: usually stochastic (wrong answer sometimes, correct answer other times)
- Runtime bugs (premature barrier, buffer aliasing): often stochastic or history-dependent
- Race conditions: non-deterministic
**The determinism test**: run the same kernel 100 times with the same input. If the wrong answer is always the same wrong answer, suspect the compiler.
## The Minimal Reproducer Protocol
When a kernel produces wrong results, isolate the compiler as quickly as possible:
**Step 1: Eliminate the physics**. Reduce the failing kernel to the smallest possible computation that still exhibits the bug. Replace QCD fields with `double` arrays. Replace lattice operations with scalar arithmetic. The goal is a 20-line CUDA/HIP/SYCL file that any compiler engineer can compile and run.
**Step 2: Binary search over optimisation levels**. Compile at `-O0` (or equivalent). If the answer becomes correct, the bug is in an optimisation pass. Then test `-O1`, `-O2`, `-O3` individually to find which optimisation level introduces the bug.
```bash
# HIP example
hipcc -O0 minimal_repro.cc -o test_O0 && ./test_O0 # should be correct
hipcc -O1 minimal_repro.cc -o test_O1 && ./test_O1 # compare
hipcc -O2 minimal_repro.cc -o test_O2 && ./test_O2 # compare
```
**Step 3: Identify the optimisation pass**. For LLVM-based compilers (clang, hipcc, dpcpp, nvcc via ptxas):
```bash
# Disable individual optimisation passes:
hipcc -O2 -mllvm -disable-loop-unrolling minimal_repro.cc -o test
hipcc -O2 -fno-vectorize minimal_repro.cc -o test
hipcc -O2 -fno-slp-vectorize minimal_repro.cc -o test
```
**Step 4: Inspect the generated code**. For CUDA/HIP, use `--generate-line-info` and `cuobjdump` or `roc-obj-extract` to get annotated assembly:
```bash
# CUDA
nvcc -O2 --generate-line-info --keep minimal_repro.cu
cuobjdump --dump-ptx minimal_repro.o
# HIP/ROCm
hipcc -O2 --save-temps minimal_repro.cc
llvm-objdump -d minimal_repro.o
# SYCL/DPC++
icpx -O2 -fsycl -Xclang -ast-dump minimal_repro.cc 2>&1 | grep -A5 "suspicious_expr"
```
Look for: incorrect register spill/fill sequences, loop trip count miscalculation, vectorisation across iteration boundaries, incorrect address arithmetic.
## Known Compiler Bug Patterns in GPU Code
### Register Pressure / Spill Bugs
High register usage forces spills to local memory. Some compiler versions generate incorrect spill/fill code — the value is written to local memory but a stale register value is read back instead of the spilled value.
**Signature**: Wrong answer with high-register-count kernels; becomes correct when `--maxrregcount=N` forces lower register count (more spilling) or higher (`--maxrregcount=256`, fewer spills).
**Diagnostic**: Check register usage:
```bash
nvcc -O2 --ptxas-options=-v minimal_repro.cu 2>&1 | grep "registers"
hipcc -O2 --offload-arch=gfx90a --save-temps minimal_repro.cc
llvm-mc --arch=amdgcn minimal_repro.s 2>&1 | grep "VGPRs"
```
### Vectorisation Across Loop Boundaries
The compiler vectorises two successive loop iterations as a SIMD unit when they have a data dependency that the compiler has incorrectly determined does not exist.
**Signature**: Wrong answer that becomes correct when the loop body is extracted to a non-inlined function (disabling auto-vectorisation across iterations).
### Incorrect Constant Propagation
The compiler evaluates a compile-time expression incorrectly, substituting a wrong constant. Common in template-heavy code where `sizeof(T)` or `alignof(T)` is used in arithmetic that the compiler folds at compile time.
**Signature**: Wrong array index or wrong stride. Inspecting the generated assembly shows a literal constant where you expect a computed value.
## Stress Patterns for Compiler Validation
These patterns exercise the compiler in ways that commonly expose bugs:
```cpp
// 1. Aliased pointer write followed by immediate read
// (tests correct handling of write-after-write in register allocation)
__global__ void alias_stress(double *a, double *b, int n) {
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < n) {
a[i] = a[i] * 2.0;
b[i] = a[i] + 1.0; // must read the updated value, not the original
}
}
// 2. Mixed-precision accumulation
// (tests correct type promotion in FMA sequences)
__global__ void precision_stress(float *in, double *out, int n) {
double acc = 0.0;
for (int i = 0; i < n; i++) acc += (double)in[i];
*out = acc;
}
// 3. Large struct in shared memory
// (tests alignment and offset calculation for non-power-of-2-sized objects)
struct S { double x[3]; }; // sizeof = 24 bytes, not a power of 2
__global__ void struct_stress(S *in, S *out, int n) {
extern __shared__ S smem[];
int tid = threadIdx.x;
smem[tid] = in[tid];
__syncthreads();
out[tid] = smem[(tid + 1) % blockDim.x];
}
```
## Separating Compiler from Runtime/Hardware
When results are deterministically wrong:
| Test | Compiler bug | Runtime/hardware bug |
|---|---|---|
| Recompile at -O0 | Fixes it | No effect |
| Run on CPU (host code equivalent) | Fixes it | No effect |
| Reorder loop iterations | Changes wrong answer | No effect or different pattern |
| Different compiler version | Fixes or changes wrong answer | No effect |
| Different GPU of same model | Same wrong answer | Different or no error |
| Different GPU model | Fixes it (ISA-specific codegen bug) | May or may not fix |
## Reporting to Compiler Teams
A compiler bug report needs:
1. Minimal reproducer (< 50 lines)
2. Compiler version (`hipcc --version`, `nvcc --version`, `icpx --version`)
3. GPU model and driver version
4. Exact wrong and correct answers (hexfloat for reproducibility)
5. Which compile flags change the behaviour
6. Generated assembly for the correct and incorrect variants
File with: LLVM Bugzilla (for hipcc/clang/dpcpp backends), NVIDIA bug portal (nvcc/ptxas), or vendor-specific developer forum. The minimal reproducer is the single most important element — without it, compiler teams cannot prioritise.
## Pragmatic In-Production Workaround
When a compiler bug is confirmed but the fix is not yet available, the lowest-risk workaround is to mark the affected function with reduced optimisation:
```cpp
#pragma clang optimize off // clang/hipcc/dpcpp
void __attribute__((optimize("O0"))) affected_kernel_host_wrapper() { ... }
// For device code, use per-file compilation flags via CMake/Makefile
```
Document the workaround with a comment referencing the compiler bug report number so it can be removed when the compiler is updated.
+169
View File
@@ -0,0 +1,169 @@
---
name: correctness-verification
description: Implement application-level correctness verification for HPC codes on unreliable hardware — double-run pattern, deterministic reductions, per-packet checksums, and flight recorder step logging.
user-invocable: true
allowed-tools:
- Read
- Bash(grep -r)
---
# Correctness Verification Infrastructure for HPC Codes
## The Problem
Leadership computing facilities sometimes have hardware or firmware bugs below the level visible to application code. The accelerator runtime can return from `q.wait()` or `cudaDeviceSynchronize()` before work is actually complete, or silently produce wrong answers in DMA transfers. Standard testing does not catch these because they are non-deterministic and often topology-dependent (fail only at specific process counts or on specific node configurations).
The symptoms look like numerical instabilities, random MPI hangs, or wrong physics results — not like crashes. Without deliberate infrastructure, diagnosing root cause takes months.
## The Double-Run Pattern
The most reliable correctness check for non-deterministic hardware bugs is to run every computation twice and compare bit-identical fingerprints.
**Key constraint**: the second run must use a *deterministic* code path. Non-deterministic floating-point ordering (e.g. from MPI_Allreduce with different reduction trees on retry) produces false mismatches. See `mpi-heterogeneous.md` for how to make reductions deterministic.
```cpp
// Pseudocode: double-run a step and compare CRC fingerprints
void run_step_verified(State &state) {
state.save_checkpoint();
uint64_t crc_a = run_step_and_fingerprint(state);
state.restore_checkpoint();
uint64_t crc_b = run_step_and_fingerprint(state);
if (crc_a != crc_b) {
report_mismatch("step", crc_a, crc_b);
// Policy: abort, retry from checkpoint, or continue with alarm
}
}
```
**Fingerprinting**: XOR-fold a CRC32 over all floating-point data after each step. XOR is order-independent, so it works across distributed nodes without communication. For field data:
```cpp
uint64_t fingerprint(const double *data, size_t n) {
uint64_t acc = 0;
for (size_t i = 0; i < n; i++) {
uint64_t bits;
memcpy(&bits, &data[i], sizeof(bits));
acc ^= crc32(bits);
}
return acc;
}
```
On GPU, compute the XOR reduction on-device (avoids D2H transfer of the full field):
```cpp
// SYCL
uint64_t svm_xor(uint64_t *vec, uint64_t L) {
uint64_t ret = 0;
{ sycl::buffer<uint64_t,1> abuff(&ret, {1});
theGridAccelerator->submit([&](sycl::handler &cgh) {
auto R = sycl::reduction(abuff, cgh, uint64_t(0), std::bit_xor<>());
cgh.parallel_for(sycl::range<1>{L}, R,
[=](sycl::id<1> i, auto &sum) { sum ^= vec[i]; });
}); }
theGridAccelerator->wait();
return ret;
}
```
## Per-Packet Communication Checksums
Silent data corruption in MPI buffers (documented in MPICH with device-resident buffers; see `mpi-heterogeneous.md`) requires per-packet verification, not just end-to-end. The pattern:
1. Before packing a send buffer, compute a GPU-side checksum of the payload.
2. Append the checksum to the host staging buffer alongside the data.
3. After receiving and copying to device, recompute the checksum on-device and compare.
Salt each checksum with `packet_index + 1000 * mpi_tag` to detect transposition (packet A landing in packet B's slot):
```cpp
uint64_t salt = (uint64_t)packet_index + 1000ULL * mpi_tag;
checksum_send = checksum_gpu(payload_gpu, payload_words) ^ salt;
// ... transmit payload + checksum_send ...
checksum_recv = checksum_gpu(payload_gpu_recv, payload_words) ^ salt;
assert(checksum_recv == checksum_send);
```
Grid reference: `Grid/communicator/Communicator_mpi3.cc`, `#ifdef GRID_CHECKSUM_COMMS`.
## Flight Recorder: Step-Level Logging
Maintain a monotonic counter that names the current operation. On a hang, this is the only way to know *which* operation the process is stuck in without a debugger.
```cpp
struct FlightRecorder {
std::atomic<uint64_t> step_counter{0};
const char *step_name = "init";
void step_log(const char *name) {
step_name = name;
step_counter.fetch_add(1, std::memory_order_relaxed);
}
};
extern FlightRecorder gRecorder;
```
In Record mode, also store floating-point norms and communication checksums to vectors. In Verify mode, compare against stored values:
```cpp
void norm_log(double val) {
if (mode == Record) norm_log_vec.push_back(val);
if (mode == Verify) {
double expected = norm_log_vec[norm_counter];
if (val != expected) { // bit-exact for deterministic paths
std::cerr << "MISMATCH at step " << step_counter
<< " (" << step_name << "): "
<< std::hexfloat << val << " vs " << expected << "\n";
print_backtrace();
}
norm_counter++;
}
}
```
Grid reference: `Grid/util/FlightRecorder.h`, `Grid/util/FlightRecorder.cc`.
## Signal Handler for Hang Detection
Install a SIGHUP handler that dumps the current flight recorder state. This is async-safe only if the handler writes to a pre-allocated buffer using `write()` (not `printf`):
```cpp
static char hang_buf[4096];
static void sighup_handler(int) {
int n = snprintf(hang_buf, sizeof(hang_buf),
"rank=%d step=%llu name=%s\n",
mpi_rank,
(unsigned long long)gRecorder.step_counter.load(),
gRecorder.step_name);
write(STDERR_FILENO, hang_buf, n);
// Optional: call backtrace_symbols_fd (async-safe on Linux)
void *frames[64];
int depth = backtrace(frames, 64);
backtrace_symbols_fd(frames, depth, STDERR_FILENO);
}
// In main():
signal(SIGHUP, sighup_handler);
```
To diagnose a hang across all ranks: `kill -HUP $(pgrep my_app)` or via job scheduler.
## What to Verify at Each Step
| Data type | Fingerprint method | Frequency |
|---|---|---|
| Lattice fields | XOR of CRC32 over float64 words | Every algorithmic step |
| Communication buffers | GPU XOR reduction, salted | Every MPI operation |
| Scalar reductions | Bit-exact match of double | Every GlobalSum |
| Iteration counters | Exact integer match | Every solver iteration |
## When to Abort vs Continue
- **Abort immediately**: communication checksum mismatch (data is corrupt, continuing will silently propagate errors).
- **Log and continue**: norm mismatch in Verify mode if you need to map out which operations are unreliable.
- **Retry from checkpoint**: double-run mismatch when the underlying bug is non-deterministic (second retry will usually pass).
Track the mismatch rate over a production run. A rate above ~1/1000 steps indicates a systemic hardware issue that should be escalated to the facility.
+205
View File
@@ -0,0 +1,205 @@
---
name: gpu-memory-performance
description: Diagnose and fix GPU memory bandwidth and occupancy problems in Grid HPC kernels — acceleratorThreads() pitfalls, LambdaApply thread mapping, coalescedRead/Write idiom, when to use accelerator_for vs a hand-rolled __global__ kernel, and fused vs staged HBM access patterns.
user-invocable: true
allowed-tools:
- Read
- Bash(grep -r)
---
# GPU Memory Performance in Grid
## Nsimd on GPU builds
With `GEN_SIMD_WIDTH=64B` (the typical production setting), `Nsimd` is **not 1**:
| Scalar type | `sizeof` | `Nsimd = 64 / sizeof` |
|---|---|---|
| `ComplexD` | 16 B | **4** |
| `ComplexF` | 8 B | **8** |
| `RealD` | 8 B | **8** |
| `RealF` | 4 B | **16** |
So for `LatticePropagatorD` (scalar type `ComplexD`), `Nsimd=4` and the SIMD lane runs `threadIdx.x ∈ {0,1,2,3}`. True `Nsimd=1` scalar-GPU builds are the exception, not the rule.
## The acceleratorThreads() Trap
`acceleratorThreads()` is a runtime-settable global (default **8**) that controls the `blockDim.y` of every `accelerator_for` launch. It is NOT the SIMD width — it is the number of sites processed per block in the y-dimension.
```cpp
// Grid/threads/Accelerator.cc
uint32_t accelerator_threads = 8;
```
With `accelerator_for(ss, osites, nsimd, ...)`, the launch is:
```
dim3 threads(nsimd, acceleratorThreads(), 1)
dim3 blocks ((osites + acceleratorThreads() - 1) / acceleratorThreads(), 1, 1)
```
Total threads per block = `Nsimd × acceleratorThreads()`. With `GEN_SIMD_WIDTH=64B` and `Nsimd=8` (fp32 / ComplexF):
| `acceleratorThreads()` | threads/block | AMD wavefront | note |
|---|---|---|---|
| 2 (original default) | 16 | 25% | sub-wavefront, poor |
| 4 | 32 | 50% | half wavefront |
| **8 (current default)** | **64** | **100%** | **one full wavefront — Dslash sweet spot** |
| 16 | 128 | 200% | two wavefronts/block; register-pressure cliff for heavy kernels |
| 32 | 256 | 400% | severe register spill for stencil kernels |
**Why 8 and not higher?** Compute-heavy kernels like the Domain Wall Dslash carry many live registers per thread (spinors + gauge links + projections). Doubling `acceleratorThreads` from 8→16 doubles the register demand per block, which on AMD GFX90A triggers a hard occupancy cliff: `Benchmark_dwf_fp32` drops from 1.7 TF/s (nt=8) to ~300 GF/s (nt=16). The sweet spot is one full wavefront per block, which with `Nsimd=8` (fp32) means `nt=8`.
For fp64 work (`Nsimd=4`), `nt=8` gives 32 threads = half a wavefront (AMD pads to 64 with idle lanes). Kernels that are not register-limited (e.g. simple lattice arithmetic) can benefit from `--accelerator-threads 16` at runtime. Reduction kernels bypass `acceleratorThreads()` entirely via `getNumBlocksAndThreads`.
**Why was the default ever 2?** Before the `threadIdx.x`/`threadIdx.y` remap (see LambdaApply section below), the site index lived in `threadIdx.x` — the fast, coalescing dimension. Increasing `acceleratorThreads` widened the block in the *site* direction, so adjacent threads in a warp hit adjacent sites, each stride `sizeof(vobj)` apart in AoS memory — breaking coalescing. On early NVIDIA ports with `Nsimd≈8`, `nt=2` gave 16 threads = 50% of a 32-thread warp; NVIDIA recovers this via multiple concurrent blocks per SM, so occupancy was barely tolerable. AMD has no such multiplier when blocks are already sub-wavefront. After the remap put the site index in `threadIdx.y` and the SIMD lane in `threadIdx.x`, coalescing became independent of `acceleratorThreads`, removing the constraint.
**Diagnostic**: observed bandwidth << peak, kernel time >> expected from data volume. Check with `--accelerator-threads 32` at runtime. A large speedup confirms occupancy starvation.
**Fix options** (in order of preference):
1. Kernel needs its own thread count — use `getNumBlocksAndThreads` and launch a `__global__` kernel directly (see below).
2. Temporarily acceptable: set `--accelerator-threads 32` at the application level. Note this affects every `accelerator_for` site in the binary.
## LambdaApply Thread Mapping
`accelerator_for` and `accelerator_for2d` go through `LambdaApply`:
```cpp
// HIP/CUDA LambdaApply kernel:
uint64_t x = threadIdx.y + blockDim.y * blockIdx.x; // iter1 (site index)
uint64_t y = threadIdx.z + blockDim.z * blockIdx.y; // iter2
uint64_t z = threadIdx.x; // lane (SIMD lane)
Lambda(x, y, z);
```
`threadIdx.x` is the **fast** (lane) dimension — consecutive thread IDs within a warp/wavefront correspond to consecutive lane values on the **same** site, not consecutive sites.
With `GEN_SIMD_WIDTH=64B` and `Nsimd=4` (PropagatorD), a 64-thread AMD wavefront contains 64/4 = 16 sites, each processed by 4 lanes. Adjacent threads within the wavefront read different lanes of the same site — this is a broadcast pattern (hardware handles this efficiently), not a stride. The stride between consecutive *sites* (`sizeof(vobj)`) only appears between groups of `Nsimd` threads, spaced `Nsimd` apart in threadIdx.x — not between adjacent threads within a warp.
## coalescedRead / coalescedWrite
These are Grid's canonical way to read/write one SIMD lane from a vector type inside a `GRID_SIMT` kernel:
```cpp
// accelerator_for(ss, osites, Nsimd, {
// lane = acceleratorSIMTlane(Nsimd) = threadIdx.x ∈ {0..Nsimd-1}
auto scalar_val = coalescedRead(field[ss]); // extractLane(lane, field[ss])
coalescedWrite(field[ss], scalar_val); // insertLane(lane, field[ss], scalar_val)
```
For `vobj` aggregate types, `coalescedRead` calls `extractLane(lane, vobj)` which recurses through the tensor hierarchy and returns `vobj::scalar_object`.
For `vsimd` (raw SIMD vector) types, it casts to `scalar_type*` and indexes with `lane`.
## Coalescing the Iteration Structure
For an AoS input array where each site is `words` 16-byte elements, adjacent threads reading the same site's consecutive words achieve coalesced access:
```cpp
// Good: k varies across threads in a block → consecutive 16-byte reads
accelerator_for2d(k, R, ss, osites, Nsimd, {
coalescedWrite(out[ss]._internal[k],
coalescedRead(idat[ss * words + base + k]));
});
// dim3(Nsimd, nt, 1): threadIdx.y = k (consecutive words, coalesced)
// threadIdx.x = lane (SIMD sub-lane, coalesced for Nsimd>1)
```
```cpp
// Bad: each thread reads all R words of its site serially
accelerator_for(ss, osites, 1, {
Bundle b;
for (int k = 0; k < R; k++)
b._internal[k] = idat[ss * words + base + k]; // serial, not coalesced across threads
out[ss] = b; // bulk struct write
});
```
The bad pattern also accumulates a large struct in registers (192 bytes for R=12), increasing register pressure and reducing occupancy further.
## When to Use a __global__ Kernel Instead of accelerator_for
`accelerator_for` is correct for site-parallel work where `acceleratorThreads()` is tuned appropriately. Use a direct `__global__` kernel when:
- The kernel requires a **specific thread count** for correctness or performance (reductions, shared-memory algorithms).
- The optimal thread count depends on `sizeof(sobj)` and `sharedMemPerBlock`, not on a runtime global.
- You need the retirement-count pattern for cross-block final reduction.
Pattern: use `getNumBlocksAndThreads` to pick `numThreads` and `numBlocks`:
```cpp
Integer numThreads, numBlocks;
int ok = getNumBlocksAndThreads(n, sizeof(sobj), numThreads, numBlocks);
// starts at warpSize (32/64), doubles while 2*threads*sizeof(sobj) < sharedMemPerBlock
// gives 64256 threads/block → correct occupancy independent of acceleratorThreads()
Integer smemSize = numThreads * sizeof(sobj);
myKernel<<<numBlocks, numThreads, smemSize, computeStream>>>(args...);
```
Grid's `reduceKernel` uses this pattern and achieves ~400 GB/s on MI250X.
## Fused vs Staged HBM Access
A staged pack+reduce reads the data **three times**:
```
pack kernel: reads vobj array (N bytes), writes bundle buffer (N bytes)
reduce kernel: reads bundle buffer (N bytes), writes tiny result buffer
```
Total HBM: 3N bytes for N bytes of useful input.
A fused kernel reads the data **once**:
```
packReduceKernel: reads R words of vobj array (N bytes), reduces in-place
```
Total HBM: N bytes. Register pressure increases (R words held per thread) but the 3× HBM saving dominates for large objects.
The fused pattern in Grid's `sumD_gpu_reduce_words<R>`:
```cpp
template <int R, class vobj, class sobj, class Iterator>
__device__ void packReduceBlocks(
const iScalar<typename vobj::vector_type> *idat,
sobj *g_odata, Iterator osites, int base, int words)
{
// sobj = iVector<iScalar<scalarD>, R> (R double-precision scalars per site)
constexpr Iterator nsimd = vobj::Nsimd();
...
while (i < osites * nsimd) {
Iterator lane = i % nsimd;
Iterator ss = i / nsimd;
sobj tmpD; zeroit(tmpD);
for (int k = 0; k < R; k++) {
auto w = extractLane(lane, idat[ss * words + base + k]);
iScalar<typename vobj::scalar_typeD> wd; wd = w; // float→double promotion
tmpD._internal[k] = wd;
}
mySum += tmpD;
...
}
reduceBlock(sdata, mySum, tid);
}
```
Launched with `getNumBlocksAndThreads` → 128 threads/block for R=12 (`BundleScalarD`=192 B, sharedMem=64 KB) → correct occupancy without depending on `acceleratorThreads()`.
## Observed Numbers on MI250X (32^4 LatticePropagatorD, Nsimd=4, GEN_SIMD_WIDTH=64B)
| Configuration | pack µs/group | reduce µs/group | total µs | GB/s |
|---|---|---|---|---|
| acceleratorThreads=2 (8 threads/block), staged | 10,080 | 470 | 126,909 | 50 |
| acceleratorThreads=16 (64 threads/block), staged | 342 | 310 | 8,251 | 297 |
| acceleratorThreads=16, fused (128 threads/block via getNumBlocksAndThreads) | — | 349 | 4,584 | 546 |
The fused kernel at 349 µs/group reads 201 MB at 576 GB/s — 36% of MI250X HBM peak. The remaining gap from peak is the in-kernel serial loop over R=12 words and the 12 serial kernel launches.
## Quick Checklist When a Kernel Is Slow
1. **Check Nsimd**: `GEN_SIMD_WIDTH=64B` → Nsimd=4 (ComplexD), 8 (ComplexF). Total threads/block = `Nsimd × acceleratorThreads()`. With old default nt=2 and Nsimd=4: 8 threads = 12.5% of AMD wavefront.
2. Check threads per block: for `accelerator_for` kernels use `--accelerator-threads 32` and measure; a large speedup confirms occupancy starvation.
3. Check for bulk struct accumulation in registers (`Bundle b; for(...) b._internal[k] = ...;`). Replace with per-element writes via `coalescedWrite`.
4. Check for staged HBM access (pack → buffer → reduce). Count the passes; fuse if ≥ 2 passes over the same data.
5. For reduction kernels, always use `getNumBlocksAndThreads` rather than `accelerator_for` so thread count is independent of `acceleratorThreads()`.
+101
View File
@@ -0,0 +1,101 @@
---
name: gpu-runtime-correctness
description: Detect and work around GPU runtime correctness failures — premature completion signalling, infinite poll hangs, stale completion flags, and the double-wait diagnostic pattern. Covers CUDA, HIP/ROCm, and SYCL/Level Zero runtimes.
user-invocable: true
allowed-tools:
- Read
- Bash(grep -r)
---
# GPU Runtime Correctness
## The Completion Signalling Problem
GPU runtimes expose a synchronisation primitive — `cudaDeviceSynchronize()`, `hipDeviceSynchronize()`, `q.wait()` — that is supposed to block until all previously submitted GPU work is complete. On several production systems, this guarantee has been violated in two distinct ways:
### Failure Mode A: Premature Return
The wait returns before the GPU work is done. The subsequent CPU code reads stale data from the output buffer. This is the most dangerous failure because it looks like a numerical instability, not a crash. Results are wrong but the program exits normally.
**Identifying Premature Return**: Insert a second, independent wait immediately after the first. If a second `q.wait()` "fixes" incorrect results that appeared with a single `q.wait()`, the first wait was returning prematurely.
```cpp
// Diagnostic version — if this stabilises results, you have premature return
accelerator_barrier(); // first wait
accelerator_barrier(); // second wait (diagnostic)
```
Production fix: submit a trivially cheap no-op kernel after the real work and wait for it. The no-op kernel cannot complete until all previous commands in the queue are done (command queue ordering guarantee), so waiting for the no-op is a stronger barrier than waiting for the queue itself:
```cpp
// Lightweight fence kernel
template<class T>
__global__ void noop_kernel(T *p) { if (threadIdx.x == 0) (void)(*p); }
void strong_barrier(T *device_ptr) {
noop_kernel<<<1, 1, 0, computeStream>>>(device_ptr);
cudaStreamSynchronize(computeStream); // wait for the no-op
}
```
### Failure Mode B: Infinite Poll
The wait enters a polling loop that never terminates. The process consumes 100% CPU in a runtime library. The GPU has either stopped signalling progress entirely, or the completion flag is in a memory region that has become incoherent.
This is distinct from Failure Mode A: with premature return the CPU proceeds; with infinite poll the CPU is stuck.
**Identifying Infinite Poll**: `top` shows the MPI rank at 100% CPU. `perf top -p PID` or `strace -p PID` shows the process burning cycles inside the GPU runtime library (e.g. `libze_intel_gpu.so`, `libamdhip64.so`).
**Documented instances**:
- Intel Level Zero on Pontevecchio (Aurora): both premature return *and* infinite poll have been observed as independent bugs on the same system.
- The two failure modes can co-exist and have overlapping symptoms at the application level.
## Completion Signalling Architecture
Understanding why these bugs happen requires knowing how completion signalling works:
```
GPU command processor
→ signals completion by writing to a host-visible memory address
→ CPU runtime polls that address (or uses OS event notification via ioctl)
```
A premature return means the memory write happened before the actual work completed (e.g. the signal is on a different command stream that has not been serialised with the work stream). An infinite poll means the memory write never happens (hardware or driver bug preventing the signal from being written).
**Implication**: `accelerator_barrier()` is not an unconditional correctness guarantee on all production systems. Application-level verification (double-run, checksums) is necessary as a second line of defence.
## The Double-Wait Pattern in Practice
The double-wait is a pragmatic workaround when premature return is suspected but not yet confirmed. It adds latency but does not change correctness if the barrier is working properly, so it is safe to enable in production:
```cpp
#ifdef WORKAROUND_PREMATURE_BARRIER
#define accelerator_barrier() do { \
real_accelerator_barrier(); \
real_accelerator_barrier(); \
} while(0)
#endif
```
Monitor whether this changes observed behaviour. If double-wait eliminates wrong answers, you have confirmed premature return. If it does not help but inserting a no-op kernel does, the issue is with the wait primitive specifically, not with the underlying completion signal.
## SYCL/Level Zero Specifics
Level Zero (the backend for Intel GPU runtimes) separates command submission from synchronisation. A `q.wait()` should wait for all previously submitted command lists to retire. Documented bugs include:
- `q.wait()` returning before the associated fence in Level Zero has been signalled.
- `q.wait()` entering an `ioctl(i915, I915_GEM_WAIT)` call that never returns (kernel driver bug, not runtime bug).
The latter requires a node reboot and cannot be worked around in application code. Detect it by checking process state (`D` in `ps aux`) and the kernel function via `/proc/PID/wchan`.
## Stream Ordering and Compute Streams
All GPU work must be submitted to the *same* stream/queue if you rely on in-order execution guarantees. Mixing default stream and non-default streams invalidates ordering assumptions on some backends.
Grid uses `computeStream` (CUDA/HIP) or `theGridAccelerator` (SYCL) consistently throughout. If mixing Grid with third-party GPU code, ensure the third-party code is directed to the same stream, or insert explicit inter-stream barriers.
## Checklist for New GPU Code
1. Every kernel launch is followed by an `accelerator_barrier()` before reading device-side output on the host.
2. All device-to-host copies use an explicit stream synchronisation after the copy, not before.
3. If results are non-deterministic across runs, insert a second barrier and observe whether reproducibility improves.
4. For correctness-critical operations (reductions that will be compared against reference values), add the double-run checksum test from `correctness-verification.md`.
5. If the process hangs at 100% CPU in a runtime library function, this is a driver/runtime bug — there is no application-level fix beyond scheduling a node reboot.
+102
View File
@@ -0,0 +1,102 @@
---
name: hang-diagnosis
description: Diagnose and isolate process hangs on HPC systems — distinguishing kernel-level ioctl hangs, infinite poll loops, collective deadlocks, and GPU completion signalling failures using async-safe signal handlers and flight recorder step counters.
user-invocable: true
allowed-tools:
- Read
- Bash(grep -r)
- Bash(strace)
- Bash(gdb)
---
# Hang Diagnosis on HPC Systems
## Taxonomy of Hangs
Not all hangs are the same. Misidentifying the type leads to wrong mitigation. The four distinct classes encountered on production leadership systems:
### 1. Kernel-level ioctl hang (never returns)
The process is in `D` (uninterruptible sleep) state. `strace` shows it blocked in an `ioctl` syscall. The GPU device driver has entered an unrecoverable state.
**Diagnosis**: `ps aux | grep D` — the process shows `D` state. `cat /proc/PID/wchan` shows `i915_gem_wait_for_error` or similar.
**Resolution**: Only a driver reload or node reboot recovers it. Log the node identifier and request replacement from the facility scheduler.
### 2. Infinite poll loop (`q.wait()` or `cudaDeviceSynchronize()` never returns)
The process is in `R` (running) state, consuming 100% CPU. A polling loop inside the runtime is checking a completion flag that never becomes true, either because the hardware never sets it or because the flag is in a memory region not visible to the polling thread.
**Diagnosis**: `top` shows the rank at 100% CPU. `strace -p PID` shows repeated `futex` or `read` syscalls with zero-length results, or no syscalls at all (pure spinloop). `perf top -p PID` shows the process burning cycles in a single tight loop in a runtime library (e.g., `ze_intel_gpu.so`).
**Resolution**: The double-wait workaround — submit a trivially cheap kernel after the operation under test to act as a fence, then wait for the trivial kernel. See `gpu-runtime-correctness.md`.
### 3. Collective deadlock
One or more ranks are blocked in an MPI call, usually `MPI_Allreduce` or `MPI_Barrier`, while others are not. Root cause: a topology-dependent bug in the MPI library's collective algorithm where some ranks' contributions never arrive.
**Diagnosis**: Flight recorder step logs show some ranks at step N (inside the collective) while others are at step N+1 or stuck at step N with different `step_name` strings. The hung ranks will show `D` or `S` state in `ps`.
**Resolution**: Replace `MPI_Allreduce` with a deterministic point-to-point tree reduction. See `mpi-heterogeneous.md`.
### 4. Premature return from wait (silent wrong answer, not a hang)
The runtime returns from `q.wait()` before the GPU work is complete. The next operation reads stale data. This is not a hang — it manifests as a wrong answer or non-deterministic floating-point results. It is listed here because it is the most confusing failure mode: the code appears to run correctly and completes normally.
**Diagnosis**: Double-run with checksum (see `correctness-verification.md`). Insert a second `q.wait()` after the first and observe if results become reproducible. If inserting the second wait "fixes" wrong answers, the first wait was returning prematurely.
## Flight Recorder for Hang Localization
The most important diagnostic tool is knowing *which operation* a process is in when it hangs. Maintain a named step counter:
```cpp
// Call at the start of every major operation
FlightRecorder::StepLog("MPI_Allreduce::norm");
// ... do the operation ...
FlightRecorder::StepLog("MPI_Allreduce::done");
```
On SIGHUP, dump rank, step counter value, and step name to stderr in an async-safe manner:
```cpp
static void sighup_handler(int) {
char buf[256];
int n = snprintf(buf, sizeof(buf), "rank %d: step %llu '%s'\n",
comm_rank,
(unsigned long long)step_counter,
step_name);
write(2, buf, n);
// backtrace_symbols_fd is async-safe on Linux glibc
void *frames[32];
backtrace_symbols_fd(frames, backtrace(frames, 32), 2);
}
signal(SIGHUP, sighup_handler);
```
Broadcast SIGHUP to all ranks from outside the job:
```bash
# In a separate shell while the job is hung
squeue --job $JOBID -o "%i %N" | awk '{print $2}' | \
xargs -I{} ssh {} "pkill -SIGHUP -f my_application"
```
The step names from all ranks will reveal which collective operation has diverged.
## Distinguishing Driver Hang from MPI Hang
| Symptom | Driver hang | MPI hang |
|---|---|---|
| Process state | `D` (ioctl) or `R` (spinloop) | `S` (blocked in syscall) |
| `strace` | blocked `ioctl` or tight loop | blocked `recvmsg` / `read` |
| Scope | single rank / single node | subset of ranks, pattern-dependent |
| Recovery | reboot node | cancel job |
| Flight recorder | step name is a GPU operation | step name is a collective |
## Reducing Diagnostic Time
1. **Name every collective operation** in the flight recorder before calling it.
2. **Separate GPU work from MPI work** in the code so the step name unambiguously identifies which subsystem is hung.
3. **Log node identifiers** alongside step names so flaky nodes can be identified and blacklisted.
4. **Request flight recorder dumps from all ranks simultaneously** (SIGHUP broadcast) rather than attaching a debugger — attaching `gdb` to one rank of a hung MPI job usually deadlocks the debugger too.
## What Not to Do
- Do not `kill -9` a hung rank immediately — get the flight recorder dump first, otherwise diagnostic information is lost.
- Do not assume the first rank that prints an error is the faulty one — collective hangs are frequently caused by the *last* rank to arrive at the barrier.
- Do not use `MPI_Abort` in the hang handler — it may itself hang on some implementations. Use `_exit(1)` to force termination.
+182
View File
@@ -0,0 +1,182 @@
---
name: mpi-heterogeneous
description: Diagnose and work around MPI correctness bugs on heterogeneous (CPU+GPU) systems — device buffer aliasing in MPI_Sendrecv, AARCH64 PLT corruption from libfabric, topology-dependent allreduce hangs, mixed-ABI HIP runtime from wrong GTL library (Frontier/ROCm), and deterministic point-to-point reduction trees as a replacement for MPI_Allreduce.
user-invocable: true
allowed-tools:
- Read
- Bash(grep -r)
---
# MPI Correctness on Heterogeneous HPC Systems
## The Core Problem
MPI libraries were designed for CPU-resident buffers. When GPU-resident buffers are passed directly (GPU-aware MPI / GPU direct RDMA), several correctness assumptions break:
- **Buffer aliasing**: The MPI library may internally alias send/receive buffer addresses for `MPI_Sendrecv` in ways that are safe for CPU memory but wrong for GPU memory with different cache coherency rules.
- **RDMA bandwidth**: GPU direct RDMA on some fabrics operates at a fraction of peak wirespeed (documented at ~30% on Pontevecchio/Aurora), making host-staging mandatory for performance even when correctness is not an issue.
- **Collective tree topology**: `MPI_Allreduce` implementations may select reduction trees based on process count or communicator topology that expose rank-ordering bugs, causing hangs on some configurations but not others.
## Bug Class 1: Device Buffer Aliasing in MPI_Sendrecv
**Symptom**: `MPI_Sendrecv` with GPU-resident send and receive buffers produces wrong results. The received data matches neither the expected payload nor a host-staged copy. The failure is *deterministic* for a given problem size and process count, but *history-dependent* — earlier sends affect which alias is selected.
**Root cause**: The MPI library internally reuses GPU buffer addresses for temporary staging without proper device memory ordering. When the same physical GPU memory pages appear in both the send and receive paths, writes from one path corrupt the other.
**Diagnosis**:
1. Enable per-packet checksumming (see `correctness-verification.md`). If the checksum on the received packet does not match the sent checksum, the data was corrupted in transit.
2. Replace `MPI_Sendrecv` with separate `MPI_Isend` + `MPI_Irecv` + `MPI_Waitall`. If this fixes the problem, the bug is in the `MPI_Sendrecv` implementation's internal buffer handling.
3. Stage through host memory (`cudaMemcpy`/`hipMemcpy` to a host buffer, then `MPI_Sendrecv` on host buffers, then copy back). If this fixes the problem, confirms GPU-specific aliasing.
**Reported as**: MPICH issue #7302. Affects MPICH on Intel Pontevecchio (Aurora) with device-resident buffers.
**Workaround**: Do not use `MPI_Sendrecv` with GPU buffers. Use asynchronous send/receive pairs or host-staging. See `communication-overlap.md` for the full pipeline pattern.
## Bug Class 2: PLT Corruption on AARCH64 (libfabric)
**Symptom**: Application crashes or hangs on first `MPI_Comm_dup` call on AARCH64 systems (e.g. NVIDIA Grace/H200). Backtrace shows a bad instruction in the PLT (Procedure Linkage Table) for `MPI_Comm_dup` — specifically a `br x15` instruction that should instead be a proper trampoline.
**Root cause**: `libfabric`'s memory registration cache monitor patches PLT entries at runtime to intercept memory allocation calls. Its AARCH64 trampoline generation writes an incorrect instruction sequence, leaving `br x15` (branch to whatever happens to be in x15) in the PLT entry. The next call through that PLT entry executes garbage.
**Diagnosis**:
```bash
# Check if the PLT entry is corrupted
objdump -d /proc/PID/exe | grep -A5 "MPI_Comm_dup@plt"
# Look for "br x15" — this should be a proper stub, not a register branch
```
Or check the disassembly of the live process:
```bash
gdb -p PID -batch -ex "disassemble 'MPI_Comm_dup@plt'"
```
**Workaround**:
```bash
export FI_MR_CACHE_MONITOR=disabled
```
This prevents libfabric from patching PLT entries. It may reduce MR cache performance but restores correctness.
**Reported as**: libfabric issue #11451. Affects systems using AARCH64 + libfabric OFI provider (Cray Slingshot, AWS EFA) with memory registration cache enabled.
## Bug Class 3: Topology-Dependent Allreduce Hangs
**Symptom**: `MPI_Allreduce` hangs indefinitely on some node configurations but completes correctly on others. The failure correlates with process count (e.g. fails at 512 ranks, works at 256) or network topology (fails when crossing specific router boundaries).
**Root cause**: The MPI library's collective selection algorithm picks a reduction tree implementation that assumes symmetric participation from all ranks. A bug in one rank's contribution path (e.g. a GPU-side buffer not yet flushed when MPI reads it, due to premature barrier — see `gpu-runtime-correctness.md`) causes that rank to send wrong or incomplete data, and the tree-reduction protocol deadlocks waiting for data that never arrives correctly.
**Diagnosis**: Flight recorder step logging (see `hang-diagnosis.md`). SIGHUP broadcast to all ranks. Ranks that are hung will show step name `MPI_Allreduce::...`; ranks that completed will show the next step. The hung ranks are the *recipients* of the stale data, not necessarily the *cause*.
**Workaround — deterministic P2P reduction tree**:
Replace `MPI_Allreduce` with an explicit point-to-point binary tree reduction. This is slower for large communicators but:
1. Is immune to topology-dependent collective bugs.
2. Is deterministic in floating-point ordering (the tree is fixed, not chosen at runtime).
3. Makes the hang location explicit — each P2P operation is a named step in the flight recorder.
```cpp
// Binary tree reduction: rank 0 collects, then broadcasts
void GlobalSumP2P(double *data, int count, MPI_Comm comm) {
int rank, size;
MPI_Comm_rank(comm, &rank); MPI_Comm_size(comm, &size);
// Reduce phase: even ranks receive from odd neighbours
for (int stride = 1; stride < size; stride *= 2) {
if (rank % (2*stride) == 0) {
int partner = rank + stride;
if (partner < size) {
std::vector<double> tmp(count);
MPI_Recv(tmp.data(), count, MPI_DOUBLE, partner, 0, comm, MPI_STATUS_IGNORE);
for (int i = 0; i < count; i++) data[i] += tmp[i];
}
} else if (rank % stride == 0) {
int partner = rank - stride;
MPI_Send(data, count, MPI_DOUBLE, partner, 0, comm);
break;
}
}
// Broadcast phase
for (int stride = /* highest power of 2 <= size */; stride >= 1; stride /= 2) {
if (rank % (2*stride) == 0) {
int partner = rank + stride;
if (partner < size)
MPI_Send(data, count, MPI_DOUBLE, partner, 0, comm);
} else if (rank % stride == 0) {
int partner = rank - stride;
MPI_Recv(data, count, MPI_DOUBLE, partner, 0, comm, MPI_STATUS_IGNORE);
}
}
}
```
Grid reference: `USE_GRID_REDUCTION` macro in `Grid/communicator/Communicator_mpi3.cc`.
## Bug Class 4: Mixed HIP ABI from Wrong GTL Library (Frontier / ROCm)
**Symptom**: `HIPFFT_PARSE_ERROR` (error code 12) returned by `hipfftPlanMany` / `hipfftMakePlanMany` / `hipfftPlan1d` for FFT sizes G < 32, but G ≥ 32 succeeds. The failure only occurs with an empty rocFFT kernel cache (`~/.cache/rocfft`); a warm cache may mask it. Host-side operations and GPU kernels that do not invoke rocFFT JIT work correctly.
**Root cause — mixed HIP ABI**: rocFFT uses JIT compilation (via `libamd_comgr`) for small transforms (G < 32); for G ≥ 32 it uses pre-compiled device code bundled in the library, so the JIT path is never exercised. When two HIP runtime versions are loaded in the same process — e.g. `libamdhip64.so.7` (ROCm 7) and `libamdhip64.so.6` (ROCm 6) — the rocFFT JIT cannot complete successfully.
The hidden source of the old library is the Cray MPI GPU Transport Layer. On Frontier, `cray-mpich`'s `libmpi_gtl_hsa.so` may be compiled against `libamdhip64.so.6` (ROCm 6 ABI) even when the loaded ROCm module is 7.0.2. Because `LD_LIBRARY_PATH` picks up the GTL directory before the ROCm 7 library directory, `libamdhip64.so.6` is pulled in first, and both ABI versions end up resident in the process.
**Diagnosis**:
```bash
# Check which libamdhip64 versions are actually linked into your binary at runtime
ldd --verbose ./your_binary 2>&1 | grep amdhip
# Bad output — two different .so versions:
# libamdhip64.so.6 => /opt/rocm-6.4.2/lib/libamdhip64.so.6
# libamdhip64.so.7 => /opt/rocm-7.0.2/lib/libamdhip64.so.7
# Good output — only one:
# libamdhip64.so.7 => /opt/rocm-7.0.2/lib/libamdhip64.so.7
```
If two versions appear, the problem is the GTL/LD_LIBRARY_PATH ordering.
**Fix — correct module stack and LD_LIBRARY_PATH ordering (Frontier)**:
```bash
module load cce/21.0.0
module load cpe/26.03
module load rocm/7.0.2
# Prepend CRAY_LD_LIBRARY_PATH so the ROCm-7-aware GTL is found first
export LD_LIBRARY_PATH=$CRAY_LD_LIBRARY_PATH:$LD_LIBRARY_PATH
# Ensure ROCm 7 LLVM libs (needed by libamd_comgr JIT) are on the path
export LD_LIBRARY_PATH=/opt/rocm-7.0.2/lib/llvm/lib/:$LD_LIBRARY_PATH
```
The critical step is prepending `CRAY_LD_LIBRARY_PATH`: this ensures the GTL library built against the ROCm 7 ABI is resolved before any older version that may appear further down `LD_LIBRARY_PATH`. Without this step, a stale symlink or directory ordering can silently load the wrong `libmpi_gtl_hsa.so`.
**Reproducer**: `tests/debug/Test_hipfft_repro.cc` — standalone hipFFT test (no Grid headers) that sweeps G and howmany values matching realistic Grid lattice geometries. Compile with:
```bash
hipcc -o Test_hipfft_repro Test_hipfft_repro.cc -lhipfft
rm -rf ~/.cache/rocfft # empty cache required to trigger JIT path
./Test_hipfft_repro
```
**Reference**: `systems/WorkArounds.txt`, Frontier section — GPU mapping, XPMEM, and `FI_MR_CACHE_MONITOR=disabled` settings for Frontier are documented there.
**Systems affected**: Frontier (ORNL, MI250X). Likely applies to any Cray PE system where the loaded `cray-mpich` GTL was compiled against an older ROCm ABI than the runtime ROCm module. LumiG (CSC, MI250X) uses the same Cray PE and may exhibit the same issue.
## Compile-Time Guard Structure
Recommended macro structure to switch between the workaround paths:
```cpp
// In configure / CMake, expose as options:
// ACCELERATOR_AWARE_MPI — use GPU direct (fast, potentially broken)
// GRID_CHECKSUM_COMMS — per-packet checksums (overhead: ~5%)
// USE_GRID_REDUCTION — P2P tree instead of MPI_Allreduce (slower, deterministic)
// FI_MR_CACHE_MONITOR — libfabric PLT workaround (env var, not compile-time)
```
On a known-good system, enable `ACCELERATOR_AWARE_MPI` and disable the others. On a system with known bugs, disable `ACCELERATOR_AWARE_MPI` and enable `GRID_CHECKSUM_COMMS` + `USE_GRID_REDUCTION` as needed.
## Escalation Checklist
Before concluding a bug is in your code:
1. [ ] Can you reproduce with a minimal reproducer (two MPI ranks, no physics code)?
2. [ ] Does the failure rate correlate with buffer size, process count, or network route?
3. [ ] Does staging through host memory eliminate the failure?
4. [ ] Is the failure deterministic for a given input (same answer, always wrong) or stochastic?
5. [ ] Does the failure appear on a different MPI implementation (e.g. OpenMPI vs MPICH)?
Deterministic wrong answers that reproduce with minimal reproducers and disappear with host-staging are strong evidence of an MPI library bug. File with the MPI library issue tracker with the minimal reproducer.
+1 -1
View File
@@ -51,7 +51,7 @@ EOF
CMD="mpiexec -np 384 -ppn 12 -envall --hostfile nodefile \
../gpu_tile.sh \
$BINARY --mpi 4.4.4.6 --grid 64.64.64.96 \
--shm-mpi 0 --comms-overlap --shm 4096 --device-mem 32000 --accelerator-threads 32 --seconds 6000 --log Message --debug-stdout "
--shm-mpi 0 --comms-overlap --shm 4096 --device-mem 32000 --accelerator-threads 32 --seconds 6000 --log Message "
echo $CMD > command-line
env > environment
-62
View File
@@ -1,62 +0,0 @@
#!/bin/bash
#PBS -l select=256
#PBS -q run_next
##PBS -A LatticeQCD_aesp_CNDA
#PBS -A LatticeFlavor
#PBS -l walltime=06:00:00
#PBS -N reproBigJob
#PBS -k doe
#PBS -l filesystems=flare
#PBS -l filesystems=home
#export OMP_PROC_BIND=spread
#unset OMP_PLACES
# 56 cores / 6 threads ~9
export OMP_NUM_THREADS=6
export SYCL_PI_LEVEL_ZERO_USE_COPY_ENGINE=1
export SYCL_PI_LEVEL_ZERO_USE_COPY_ENGINE_FOR_D2D_COPY=1
export SYCL_PROGRAM_COMPILE_OPTIONS="-ze-opt-large-register-file"
export GRID_PRINT_ENTIRE_LOG=0
export GRID_CHECKSUM_RECV_BUF=1
export GRID_CHECKSUM_SEND_BUF=1
export MPIR_CVAR_CH4_IPC_GPU_HANDLE_CACHE=generic
export MPIR_CVAR_NOLOCAL=1
cd $PBS_O_WORKDIR
source ../sourceme.sh
cp $PBS_NODEFILE nodefile
DIR=reproBigJob.$PBS_JOBID
mkdir -p $DIR
cd $DIR
cp $PBS_NODEFILE nodefile
BINARY=../Test_dwf_mixedcg_prec
echo > pingjob <<EOF
while read node ;
do
echo ssh $node killall -HUP Test_dwf_mixedcg_prec
done < nodefile
EOF
CMD="mpiexec -np 3072 -ppn 12 -envall --hostfile nodefile \
../gpu_tile.sh \
$BINARY --mpi 8.8.8.6 --grid 128.128.128.288 \
--shm-mpi 0 --comms-overlap --shm 4096 --device-mem 32000 --accelerator-threads 32 --seconds 18000 --log Message --debug-stdout "
echo $CMD > command-line
env > environment
$CMD
grep Oops */Grid.stderr.* > failures.$PBS_JOBID
+78 -63
View File
@@ -1,76 +1,91 @@
Per node summary table
L , Wilson, DWF4, Staggered, NaiveStag
8 , 90, 933, 38, 23
12 , 403, 1688, 178, 113
16 , 188, 1647, 449, 295
24 , 947, 1574, 674, 553
32 , 931, 1371, 718, 643
Memory Bandwidth
Bytes, GB/s per node
6291456, 379.297050
100663296, 3754.674992
509607936, 6521.472413
1610612736, 8513.456479
3932160000, 9018.901766
GEMM
M, N, K, BATCH, GF/s per rank
16, 8, 16, 256, 0.564958
16, 16, 16, 256, 243.148058
16, 32, 16, 256, 440.346877
32, 8, 32, 256, 439.194136
32, 16, 32, 256, 847.334141
32, 32, 32, 256, 1430.892623
64, 8, 64, 256, 1242.756741
64, 16, 64, 256, 2196.689493
64, 32, 64, 256, 3697.458072
16, 8, 256, 256, 899.582627
16, 16, 256, 256, 1673.537756
16, 32, 256, 256, 2959.597089
32, 8, 256, 256, 1558.858630
32, 16, 256, 256, 2864.839445
32, 32, 256, 256, 4810.671254
64, 8, 256, 256, 2386.092942
64, 16, 256, 256, 4451.665937
64, 32, 256, 256, 5942.124095
8, 256, 16, 256, 799.867271
16, 256, 16, 256, 1584.624888
32, 256, 16, 256, 1949.422338
8, 256, 32, 256, 1389.417474
16, 256, 32, 256, 2668.344493
32, 256, 32, 256, 3234.162120
8, 256, 64, 256, 2150.925128
16, 256, 64, 256, 4012.488132
32, 256, 64, 256, 5154.785521
786432, 40.271620
12582912, 433.611792
63700992, 905.374321
201326592, 1114.979152
491520000, 1180.241898
Communications
Packet bytes, direction, GB/s per node
4718592, 1, 245.026198
4718592, 2, 251.180996
4718592, 3, 361.110977
4718592, 5, 247.898447
4718592, 6, 249.867523
4718592, 7, 359.033061
15925248, 1, 255.030946
15925248, 2, 264.453890
15925248, 3, 392.949183
15925248, 5, 256.040644
15925248, 6, 264.681896
15925248, 7, 392.102622
37748736, 1, 258.823333
37748736, 2, 268.181577
37748736, 3, 401.478191
37748736, 5, 258.995363
37748736, 6, 268.206586
37748736, 7, 400.397611
Per node summary table
GEMM
M, N, K, BATCH, GF/s per rank fp64
16, 8, 16, 4096, 693.316363
16, 12, 16, 4096, 657.277058
16, 16, 16, 4096, 711.992616
32, 8, 32, 4096, 821.084324
32, 12, 32, 4096, 1279.852719
32, 16, 32, 4096, 2647.096674
64, 8, 64, 4096, 2630.192325
64, 12, 64, 4096, 3338.071321
64, 16, 64, 4096, 3950.899281
16, 8, 256, 4096, 1638.362501
16, 12, 256, 4096, 2377.502234
16, 16, 256, 4096, 3048.328833
32, 8, 256, 4096, 2917.384276
32, 12, 256, 4096, 4103.085151
32, 16, 256, 4096, 5102.971860
64, 8, 256, 4096, 3222.258206
64, 12, 256, 4096, 4619.456391
64, 16, 256, 4096, 5847.916650
8, 256, 16, 4096, 1728.073337
12, 256, 16, 4096, 2356.653970
16, 256, 16, 4096, 2676.876038
8, 256, 32, 4096, 2611.531990
12, 256, 32, 4096, 3451.573106
16, 256, 32, 4096, 3966.915301
8, 256, 64, 4096, 3436.248737
12, 256, 64, 4096, 4539.497945
16, 256, 64, 4096, 5307.992323
GEMM
M, N, K, BATCH, GF/s per rank fp32
16, 8, 16, 4096, 499.017445
16, 12, 16, 4096, 731.543385
16, 16, 16, 4096, 958.800786
32, 8, 32, 4096, 1549.813550
32, 12, 32, 4096, 2147.907502
32, 16, 32, 4096, 2601.698596
64, 8, 64, 4096, 3785.446233
64, 12, 64, 4096, 5116.694843
64, 16, 64, 4096, 6109.345016
16, 8, 256, 4096, 1206.627737
16, 12, 256, 4096, 1809.699599
16, 16, 256, 4096, 2412.014053
32, 8, 256, 4096, 2406.114488
32, 12, 256, 4096, 3605.531907
32, 16, 256, 4096, 4798.444037
64, 8, 256, 4096, 4688.711196
64, 12, 256, 4096, 6990.696301
64, 16, 256, 4096, 9214.749925
8, 256, 16, 4096, 2596.307289
12, 256, 16, 4096, 3439.892562
16, 256, 16, 4096, 3907.201036
8, 256, 32, 4096, 3012.752067
12, 256, 32, 4096, 3904.217583
16, 256, 32, 4096, 4599.047092
8, 256, 64, 4096, 3721.999042
12, 256, 64, 4096, 5098.573927
16, 256, 64, 4096, 6159.080872
L , Wilson, DWF4, Staggered, GF/s per node
8 , 155, 1386, 50
12 , 694, 4208, 230
16 , 1841, 6675, 609
24 , 3934, 8573, 1641
32 , 5083, 9771, 3086
1 Memory Bandwidth Per node summary table
1 Per node summary table
2 L , Wilson, DWF4, Staggered, NaiveStag
3 8 , 90, 933, 38, 23
4 12 , 403, 1688, 178, 113
5 16 , 188, 1647, 449, 295
6 24 , 947, 1574, 674, 553
7 32 , 931, 1371, 718, 643
8 Memory Bandwidth
9 Bytes, GB/s per node
10 786432, 40.271620
11 Memory Bandwidth 12582912, 433.611792
12 Bytes, GB/s per node 63700992, 905.374321
13 6291456, 379.297050 201326592, 1114.979152
14 100663296, 3754.674992 491520000, 1180.241898
15 509607936, 6521.472413 Communications
16 1610612736, 8513.456479 Packet bytes, direction, GB/s per node
17 3932160000, 9018.901766 GEMM
18 GEMM M, N, K, BATCH, GF/s per rank fp64
M, N, K, BATCH, GF/s per rank
16, 8, 16, 256, 0.564958
16, 16, 16, 256, 243.148058
16, 32, 16, 256, 440.346877
32, 8, 32, 256, 439.194136
32, 16, 32, 256, 847.334141
32, 32, 32, 256, 1430.892623
64, 8, 64, 256, 1242.756741
64, 16, 64, 256, 2196.689493
64, 32, 64, 256, 3697.458072
16, 8, 256, 256, 899.582627
16, 16, 256, 256, 1673.537756
16, 32, 256, 256, 2959.597089
32, 8, 256, 256, 1558.858630
32, 16, 256, 256, 2864.839445
32, 32, 256, 256, 4810.671254
64, 8, 256, 256, 2386.092942
64, 16, 256, 256, 4451.665937
64, 32, 256, 256, 5942.124095
8, 256, 16, 256, 799.867271
16, 256, 16, 256, 1584.624888
32, 256, 16, 256, 1949.422338
8, 256, 32, 256, 1389.417474
16, 256, 32, 256, 2668.344493
32, 256, 32, 256, 3234.162120
8, 256, 64, 256, 2150.925128
16, 256, 64, 256, 4012.488132
32, 256, 64, 256, 5154.785521
Communications
Packet bytes, direction, GB/s per node
4718592, 1, 245.026198
4718592, 2, 251.180996
4718592, 3, 361.110977
19 4718592, 5, 247.898447 16, 8, 16, 4096, 693.316363
20 4718592, 6, 249.867523 16, 12, 16, 4096, 657.277058
21 4718592, 7, 359.033061 16, 16, 16, 4096, 711.992616
22 15925248, 1, 255.030946 32, 8, 32, 4096, 821.084324
23 15925248, 2, 264.453890 32, 12, 32, 4096, 1279.852719
15925248, 3, 392.949183
15925248, 5, 256.040644
15925248, 6, 264.681896
15925248, 7, 392.102622
37748736, 1, 258.823333
37748736, 2, 268.181577
37748736, 3, 401.478191
37748736, 5, 258.995363
37748736, 6, 268.206586
37748736, 7, 400.397611
Per node summary table
L , Wilson, DWF4, Staggered, GF/s per node
8 , 155, 1386, 50
12 , 694, 4208, 230
16 , 1841, 6675, 609
24 , 3934, 8573, 1641
32 , 5083, 9771, 3086
24 32, 16, 32, 4096, 2647.096674
25 64, 8, 64, 4096, 2630.192325
26 64, 12, 64, 4096, 3338.071321
27 64, 16, 64, 4096, 3950.899281
28 16, 8, 256, 4096, 1638.362501
29 16, 12, 256, 4096, 2377.502234
30 16, 16, 256, 4096, 3048.328833
31 32, 8, 256, 4096, 2917.384276
32 32, 12, 256, 4096, 4103.085151
33 32, 16, 256, 4096, 5102.971860
34 64, 8, 256, 4096, 3222.258206
35 64, 12, 256, 4096, 4619.456391
36 64, 16, 256, 4096, 5847.916650
37 8, 256, 16, 4096, 1728.073337
38 12, 256, 16, 4096, 2356.653970
39 16, 256, 16, 4096, 2676.876038
40 8, 256, 32, 4096, 2611.531990
41 12, 256, 32, 4096, 3451.573106
42 16, 256, 32, 4096, 3966.915301
43 8, 256, 64, 4096, 3436.248737
44 12, 256, 64, 4096, 4539.497945
45 16, 256, 64, 4096, 5307.992323
46 GEMM
47 M, N, K, BATCH, GF/s per rank fp32
48 16, 8, 16, 4096, 499.017445
49 16, 12, 16, 4096, 731.543385
50 16, 16, 16, 4096, 958.800786
51 32, 8, 32, 4096, 1549.813550
52 32, 12, 32, 4096, 2147.907502
53 32, 16, 32, 4096, 2601.698596
54 64, 8, 64, 4096, 3785.446233
55 64, 12, 64, 4096, 5116.694843
56 64, 16, 64, 4096, 6109.345016
57 16, 8, 256, 4096, 1206.627737
58 16, 12, 256, 4096, 1809.699599
59 16, 16, 256, 4096, 2412.014053
60 32, 8, 256, 4096, 2406.114488
61 32, 12, 256, 4096, 3605.531907
62 32, 16, 256, 4096, 4798.444037
63 64, 8, 256, 4096, 4688.711196
64 64, 12, 256, 4096, 6990.696301
65 64, 16, 256, 4096, 9214.749925
66 8, 256, 16, 4096, 2596.307289
67 12, 256, 16, 4096, 3439.892562
68 16, 256, 16, 4096, 3907.201036
69 8, 256, 32, 4096, 3012.752067
70 12, 256, 32, 4096, 3904.217583
71 16, 256, 32, 4096, 4599.047092
72 8, 256, 64, 4096, 3721.999042
73 12, 256, 64, 4096, 5098.573927
74 16, 256, 64, 4096, 6159.080872
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
+5 -5
View File
@@ -1,4 +1,3 @@
CLIME=`spack find --paths c-lime@2-3-9 | grep c-lime| cut -c 15-`
../../configure --enable-comms=mpi-auto \
--with-lime=$CLIME \
--enable-unified=no \
@@ -9,12 +8,13 @@ CLIME=`spack find --paths c-lime@2-3-9 | grep c-lime| cut -c 15-`
--disable-gparity \
--disable-fermion-reps \
--enable-simd=GPU \
--with-gmp=$OLCF_GMP_ROOT \
--with-mpfr=/opt/cray/pe/gcc/mpfr/3.1.4/ \
--with-gmp=$GMP \
--with-mpfr=$MPFR \
--with-openssl=$OPENSSL \
--disable-fermion-reps \
CXX=hipcc MPICXX=mpicxx \
CXXFLAGS="-fPIC -I${ROCM_PATH}/include/ -I${MPICH_DIR}/include -L/lib64 " \
LDFLAGS="-L/lib64 -L${ROCM_PATH}/lib -L${MPICH_DIR}/lib -lmpi -L${CRAY_MPICH_ROOTDIR}/gtl/lib -lmpi_gtl_hsa -lhipblas -lrocblas -lhipfft"
CXXFLAGS="-fPIC -I${ROCM_PATH}/include/ -I${MPICH_DIR}/include " \
LDFLAGS="-L${ROCM_PATH}/lib -L${MPICH_DIR}/lib -lmpi -lmpi_gtl_hsa -lhipblas -lrocblas -lhipfft -lamdhip64"
+10 -21
View File
@@ -1,25 +1,14 @@
echo spack
. /autofs/nccs-svm1_home1/paboyle/Crusher/Grid/spack/share/spack/setup-env.sh
. /autofs/nccs-svm1_home1/paboyle/spack/share/spack/setup-env.sh
module load cce/15.0.1
module load rocm/5.3.0
module load cray-fftw
module load craype-accel-amd-gfx90a
export CLIME=`spack find --paths c-lime | grep ^c-lime | awk '{print $2}' `
export MPFR=`spack find --paths mpfr | grep ^mpfr | awk '{print $2}' `
export OPENSSL=`spack find --paths openssl | grep openssl | awk '{print $2}' `
export GMP=`spack find --paths gmp | grep ^gmp | awk '{print $2}' `
#Ugly hacks to get down level software working on current system
export LD_LIBRARY_PATH=/opt/cray/libfabric/1.20.1/lib64/:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/opt/gcc/mpfr/3.1.4/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=`pwd`/:$LD_LIBRARY_PATH
ln -s /opt/rocm-6.0.0/lib/libamdhip64.so.6 .
#echo spack load c-lime
#spack load c-lime
#module load emacs
##module load PrgEnv-gnu
##module load cray-mpich
##module load cray-fftw
##module load craype-accel-amd-gfx90a
##export LD_LIBRARY_PATH=/opt/gcc/mpfr/3.1.4/lib:$LD_LIBRARY_PATH
#Hack for lib
##export LD_LIBRARY_PATH=`pwd`/:$LD_LIBRARY_PATH
module load cce/21.0.0
module load cpe/26.03
module load rocm/7.0.2
export LD_LIBRARY_PATH=$CRAY_LD_LIBRARY_PATH:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/opt/rocm-7.0.2/lib/llvm/lib/:$LD_LIBRARY_PATH
+4 -5
View File
@@ -1,17 +1,16 @@
DIR=`pwd`
PREFIX=$DIR/../Prequisites/install/
../../configure \
--enable-comms=mpi \
--enable-simd=GPU \
--enable-shm=nvlink \
--enable-gen-simd-width=64 \
--with-gmp=$GMP \
--with-mpfr=$MPFR \
--enable-accelerator=cuda \
--enable-setdevice \
--disable-accelerator-cshift \
--with-gmp=$PREFIX \
--disable-fermion-reps \
--disable-unified \
--disable-gparity \
CXX=nvcc \
LDFLAGS="-cudart shared " \
CXXFLAGS="-ccbin CC -gencode arch=compute_80,code=sm_80 -std=c++14 -cudart shared"
CXXFLAGS="-ccbin CC -gencode arch=compute_80,code=sm_80 -std=c++17 -cudart shared"
+39
View File
@@ -0,0 +1,39 @@
#!/bin/bash
##SBATCH -A m5294_g
#SBATCH -A mp13_g
#m3886_g
#SBATCH -C gpu
#SBATCH -q premium
#SBATCH -t 00:20
#SBATCH -c 32
#SBATCH -N 384
#SBATCH -n 1536
#SBATCH --ntasks-per-node=4
#SBATCH --gpus-per-task=1
#SBATCH --exclusive
#SBATCH --gpu-bind=none
export SLURM_CPU_BIND="cores"
export MPICH_GPU_SUPPORT_ENABLED=1
export MPICH_RDMA_ENABLED_CUDA=1
export MPICH_GPU_IPC_ENABLED=1
export MPICH_GPU_EAGER_REGISTER_HOST_MEM=0
export MPICH_GPU_NO_ASYNC_MEMCPY=0
#export MPICH_SMP_SINGLE_COPY_MODE=CMA
cat << EOF > select_gpu
#!/bin/bash
export GPU_MAP=(0 1 2 3)
export NUMA_MAP=( 0 1 2 3 )
export GPU=\$SLURM_LOCALID
export NUMA=\$SLURM_LOCALID
export CUDA_VISIBLE_DEVICES=\$GPU
exec numactl -m \$NUMA -N \$NUMA \$*
EOF
chmod +x ./select_gpu
export VOL=128.128.128.288
OPT="--comms-overlap --shm-mpi 0"
srun ./select_gpu ./benchmarks/Benchmark_dwf_fp32 --mpi 4.8.4.12 --grid $VOL --device-mem 16000 --accelerator-threads 8 --shm 2048 $OPT
+19 -6
View File
@@ -1,5 +1,6 @@
#!/bin/bash
#SBATCH -A m3886_g
#SBATCH -A m5294_g
#m3886_g
#SBATCH -C gpu
#SBATCH -q debug
#SBATCH -t 0:20:00
@@ -19,9 +20,21 @@ export MPICH_GPU_EAGER_REGISTER_HOST_MEM=0
export MPICH_GPU_NO_ASYNC_MEMCPY=0
#export MPICH_SMP_SINGLE_COPY_MODE=CMA
OPT="--comms-sequential --shm-mpi 1"
VOL=64.64.64.64
srun ./benchmarks/Benchmark_dwf_fp32 --mpi 2.2.1.1 --grid $VOL --accelerator-threads 8 --shm 2048 $OPT
#srun ./benchmarks/Benchmark_dwf_fp32 --mpi 2.1.1.4 --grid $VOL --accelerator-threads 8 --shm 2048 $OPT
#srun ./benchmarks/Benchmark_dwf_fp32 --mpi 1.1.1.8 --grid $VOL --accelerator-threads 8 --shm 2048 $OPT
cat << EOF > select_gpu
#!/bin/bash
export GPU_MAP=(0 1 2 3)
export NUMA_MAP=( 0 1 2 3 )
export GPU=\$SLURM_LOCALID
export NUMA=\$SLURM_LOCALID
export CUDA_VISIBLE_DEVICES=\$GPU
exec numactl -m \$NUMA -N \$NUMA \$*
EOF
chmod +x ./select_gpu
OPT="--comms-sequential --shm-mpi 0"
VOL=64.64.32.32
srun ./select_gpu ./benchmarks/Benchmark_dwf_fp32 --mpi 2.2.1.1 --grid $VOL --device-mem 16000 --accelerator-threads 8 --shm 2048 $OPT
OPT="--comms-overlap --shm-mpi 0"
srun ./select_gpu ./benchmarks/Benchmark_dwf_fp32 --mpi 2.2.1.1 --grid $VOL --device-mem 16000 --accelerator-threads 8 --shm 2048 $OPT
+4 -2
View File
@@ -1,4 +1,6 @@
export CRAY_ACCEL_TARGET=nvidia80
source /global/homes/p/pboyle/spack/share/spack/setup-env.sh
export MPFR=`spack find --paths mpfr | grep mpfr | cut -c 13-`
export GMP=`spack find --paths gmp | grep gmp | cut -c 12-`
module load PrgEnv-gnu cpe-cuda cudatoolkit/11.4
module load PrgEnv-gnu cpe-cuda cudatoolkit/12.0
+44
View File
@@ -0,0 +1,44 @@
#!/bin/bash
##SBATCH -A m5294_g
#SBATCH -A mp13_g
#m3886_g
#SBATCH -C gpu
#SBATCH -q premium
#SBATCH -t 00:10
#SBATCH -c 32
#SBATCH -N 32
#SBATCH -n 128
#SBATCH --ntasks-per-node=4
#SBATCH --gpus-per-task=1
#SBATCH --exclusive
#SBATCH --gpu-bind=none
export SLURM_CPU_BIND="cores"
export MPICH_GPU_SUPPORT_ENABLED=1
export MPICH_RDMA_ENABLED_CUDA=1
export MPICH_GPU_IPC_ENABLED=1
export MPICH_GPU_EAGER_REGISTER_HOST_MEM=0
export MPICH_GPU_NO_ASYNC_MEMCPY=0
#export MPICH_SMP_SINGLE_COPY_MODE=CMA
cat << EOF > select_gpu
#!/bin/bash
export GPU_MAP=(0 1 2 3)
export NUMA_MAP=( 0 1 2 3 )
export GPU=\$SLURM_LOCALID
export NUMA=\$SLURM_LOCALID
export CUDA_VISIBLE_DEVICES=\$GPU
exec numactl -m \$NUMA -N \$NUMA \$*
EOF
chmod +x ./select_gpu
OPT="--comms-overlap --shm-mpi 0"
#
# Local volume WAS 32.16.32.24
#
# 384 nodes
#srun ./select_gpu ./Test_dwf_mixedcg_prec --seconds 300 --grid 128.128.128.288 --mpi 4.8.4.12 --device-mem 16000 --accelerator-threads 8 --shm 2048 $OPT > job.log
# 32 nodes, same volume per node
srun ./select_gpu ./Test_dwf_mixedcg_prec --seconds 300 --grid 64.32.64.96 --mpi 2.2.2.4 --device-mem 16000 --accelerator-threads 8 --shm 2048 $OPT > job.log
+38
View File
@@ -0,0 +1,38 @@
#!/bin/bash
##SBATCH -A m5294_g
#SBATCH -A mp13_g
#m3886_g
#SBATCH -C gpu
#SBATCH -q premium
#SBATCH -t 00:10
#SBATCH -c 32
#SBATCH -N 384
#SBATCH -n 1536
#SBATCH --ntasks-per-node=4
#SBATCH --gpus-per-task=1
#SBATCH --exclusive
#SBATCH --gpu-bind=none
export SLURM_CPU_BIND="cores"
export MPICH_GPU_SUPPORT_ENABLED=1
export MPICH_RDMA_ENABLED_CUDA=1
export MPICH_GPU_IPC_ENABLED=1
export MPICH_GPU_EAGER_REGISTER_HOST_MEM=0
export MPICH_GPU_NO_ASYNC_MEMCPY=0
#export MPICH_SMP_SINGLE_COPY_MODE=CMA
cat << EOF > select_gpu
#!/bin/bash
export GPU_MAP=(0 1 2 3)
export NUMA_MAP=( 0 1 2 3 )
export GPU=\$SLURM_LOCALID
export NUMA=\$SLURM_LOCALID
export CUDA_VISIBLE_DEVICES=\$GPU
exec numactl -m \$NUMA -N \$NUMA \$*
EOF
chmod +x ./select_gpu
OPT="--comms-overlap --shm-mpi 0"
srun ./select_gpu ./Test_dwf_mixedcg_prec --seconds 300 --grid 128.128.128.288 --mpi 4.8.4.12 --device-mem 16000 --accelerator-threads 8 --shm 2048 $OPT > job.log
+39
View File
@@ -0,0 +1,39 @@
#!/bin/bash
#SBATCH -A m5294_g
#m3886_g
#SBATCH -C gpu
#SBATCH -q debug
#SBATCH -t 0:30:00
#SBATCH -c 32
#SBATCH -N 4
#SBATCH -n 16
#SBATCH --ntasks-per-node=4
#SBATCH --gpus-per-task=1
#SBATCH --exclusive
#SBATCH --gpu-bind=none
export SLURM_CPU_BIND="cores"
export MPICH_GPU_SUPPORT_ENABLED=1
export MPICH_RDMA_ENABLED_CUDA=1
export MPICH_GPU_IPC_ENABLED=1
export MPICH_GPU_EAGER_REGISTER_HOST_MEM=0
export MPICH_GPU_NO_ASYNC_MEMCPY=0
#export MPICH_SMP_SINGLE_COPY_MODE=CMA
cat << EOF > select_gpu
#!/bin/bash
export GPU_MAP=(0 1 2 3)
export NUMA_MAP=( 0 1 2 3 )
export GPU=\$SLURM_LOCALID
export NUMA=\$SLURM_LOCALID
export CUDA_VISIBLE_DEVICES=\$GPU
exec numactl -m \$NUMA -N \$NUMA \$*
EOF
chmod +x ./select_gpu
OPT="--comms-sequential --shm-mpi 0"
VOL=64.64.32.32
srun ./select_gpu ./benchmarks/Benchmark_usqcd --mpi 2.2.2.2 --grid $VOL --device-mem 16000 --accelerator-threads 8 --shm 2048 $OPT > usqcd.log
-12
View File
@@ -1,12 +0,0 @@
CXX=mpicxx CXXFLAGS=-I/opt/local/include LDFLAGS=-L/opt/local/lib/ ../../configure \
--enable-simd=GEN \
--enable-comms=mpi3 \
--enable-debug \
--enable-unified=yes \
--prefix /Users/peterboyle/QCD/BugHunt/install \
--with-lime=/Users/peterboyle/QCD/SciDAC/install/ \
--with-openssl=$BREW \
--disable-fermion-reps \
--disable-gparity \
--enable-debug
+20
View File
@@ -0,0 +1,20 @@
#!/usr/bin/env bash
# Configure Grid from a build directory two levels below the source root,
# e.g.: mkdir -p systems/mac-arm/build && cd systems/mac-arm/build && bash ../config-command-homebrew
#
# Prerequisites: source sourceme-homebrew.sh first.
CXX=mpicxx ../../configure \
--enable-simd=GEN \
--enable-comms=mpi-auto \
--enable-unified=yes \
--prefix="${HOME}/Grid-install" \
--with-lime="${CLIME}" \
--with-openssl="${OPENSSL}" \
--with-gmp="${GMP}" \
--with-mpfr="${MPFR}" \
--with-fftw="${FFTW}" \
--enable-Sp=no \
--disable-fermion-reps \
--disable-gparity \
--disable-debug
+4 -1
View File
@@ -3,7 +3,10 @@
CXX=mpicxx ../../configure \
--enable-simd=GEN \
--enable-comms=mpi-auto \
--enable-Sp=yes \
--enable-Sp=no \
--disable-fermion-reps \
--disable-gparity \
--with-fftw=$FFTW \
--enable-unified=yes \
--prefix /Users/peterboyle/QCD/vtk/Grid/install \
--with-lime=$CLIME \
+15
View File
@@ -0,0 +1,15 @@
#!/usr/bin/env bash
# Environment for building Grid on Apple Silicon Mac with Homebrew dependencies.
# Usage: source systems/mac-arm/sourceme-homebrew.sh
HOMEBREW=/opt/homebrew
export GMP=${HOMEBREW}/opt/gmp
export MPFR=${HOMEBREW}/opt/mpfr
export FFTW=${HOMEBREW}/opt/fftw
export OPENSSL=${HOMEBREW}/opt/openssl@3
export CLIME=/usr/local
export PATH="${HOMEBREW}/opt/open-mpi/bin:${HOMEBREW}/bin:${PATH}"
export LDFLAGS="-L${OPENSSL}/lib"
export CPPFLAGS="-I${OPENSSL}/include"
+11
View File
@@ -0,0 +1,11 @@
source /Users/peterboyle/QCD//Spack/spack//share/spack/setup-env.sh
export FFTW=`spack find --paths fftw | grep ^fftw | awk '{print $2}' `
#export HDF5=`spack find --paths hdf5+cxx | grep ^hdf5 | awk '{print $2}' `
export CLIME=`spack find --paths c-lime | grep ^c-lime | awk '{print $2}' `
export MPFR=`spack find --paths mpfr | grep ^mpfr | awk '{print $2}' `
export OPENSSL=`spack find --paths openssl | grep openssl | awk '{print $2}' `
export GMP=`spack find --paths gmp | grep ^gmp | awk '{print $2}' `
export LD_LIBRARY_PATH=$MPFR/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=$GMP/lib:$LD_LIBRARY_PATH
+4 -3
View File
@@ -25,6 +25,9 @@ Author: Peter Boyle <paboyle@ph.ed.ac.uk>
See the full license in the file "LICENSE" in the top level distribution directory
*************************************************************************************/
/* END LEGAL */
#include "disable_tests_without_instantiations.h"
#ifdef ENABLE_FERMION_INSTANTIATIONS
#include <Grid/Grid.h>
using namespace std;
@@ -273,8 +276,6 @@ void TestWhat(What & Ddwf,
err = phi-chi;
std::cout<<GridLogMessage << "norm diff "<< norm2(err)<< std::endl;
}
#endif
@@ -30,6 +30,9 @@ Author: Peter Boyle <paboyle@ph.ed.ac.uk>
* Reimplement the badly named "multigrid" lanczos as compressed Lanczos using the features
* in Grid that were intended to be used to support blocked Aggregates, from
*/
#include "disable_tests_without_instantiations.h"
#ifdef ENABLE_FERMION_INSTANTIATIONS
#include <Grid/Grid.h>
#include <Grid/algorithms/iterative/ImplicitlyRestartedLanczos.h>
#include <Grid/algorithms/iterative/LocalCoherenceLanczos.h>
@@ -256,3 +259,4 @@ int main (int argc, char ** argv) {
Grid_finalize();
}
#endif
+5
View File
@@ -25,6 +25,9 @@ Author: Peter Boyle <paboyle@ph.ed.ac.uk>
See the full license in the file "LICENSE" in the top level distribution directory
*************************************************************************************/
/* END LEGAL */
#include "disable_tests_without_instantiations.h"
#ifdef ENABLE_FERMION_INSTANTIATIONS
#include <Grid/Grid.h>
using namespace std;
@@ -237,3 +240,5 @@ int main (int argc, char ** argv)
Grid_finalize();
}
#endif
+6 -3
View File
@@ -25,6 +25,9 @@ Author: Peter Boyle <paboyle@ph.ed.ac.uk>
See the full license in the file "LICENSE" in the top level distribution directory
*************************************************************************************/
/* END LEGAL */
#include "disable_tests_without_instantiations.h"
#ifdef ENABLE_FERMION_INSTANTIATIONS
#include <Grid/Grid.h>
using namespace std;
@@ -82,7 +85,6 @@ int main (int argc, char ** argv)
Grid_init(&argc,&argv);
std::cout << GridLogMessage<<" in main(): Grid is initialised"<<std::endl;
const int Ls=12;
GridCartesian * UGrid = SpaceTimeGrid::makeFourDimGrid(GridDefaultLatt(), GridDefaultSimd(Nd,vComplexD::Nsimd()),GridDefaultMpi());
@@ -95,7 +97,6 @@ int main (int argc, char ** argv)
GridCartesian * FGrid_f = SpaceTimeGrid::makeFiveDimGrid(Ls,UGrid_f);
GridRedBlackCartesian * FrbGrid_f = SpaceTimeGrid::makeFiveDimRedBlackGrid(Ls,UGrid_f);
std::cout << GridLogMessage<<" in main(): making RNGs"<<std::endl;
std::vector<int> seeds4({1,2,3,4});
std::vector<int> seeds5({5,6,7,8});
GridParallelRNG RNG5(FGrid); RNG5.SeedFixedIntegers(seeds5);
@@ -154,7 +155,7 @@ int main (int argc, char ** argv)
time_t start = time(NULL);
UGrid->Broadcast(0,(void *)&start,sizeof(start));
FlightRecorder::ContinueOnFail = 1;
FlightRecorder::ContinueOnFail = 0;
FlightRecorder::PrintEntireLog = 0;
FlightRecorder::ChecksumComms = 0;
FlightRecorder::ChecksumCommsSend=0;
@@ -224,3 +225,5 @@ int main (int argc, char ** argv)
Grid_finalize();
}
#endif
+4
View File
@@ -25,6 +25,9 @@ Author: Peter Boyle <paboyle@ph.ed.ac.uk>
See the full license in the file "LICENSE" in the top level distribution directory
*************************************************************************************/
/* END LEGAL */
#include "disable_tests_without_instantiations.h"
#ifdef ENABLE_FERMION_INSTANTIATIONS
#include <Grid/Grid.h>
using namespace std;
@@ -118,3 +121,4 @@ int main (int argc, char ** argv)
Grid_finalize();
}
#endif
#endif
+4
View File
@@ -24,6 +24,8 @@ with this program; if not, write to the Free Software Foundation, Inc.,
See the full license in the file "LICENSE" in the top level distribution directory
*************************************************************************************/
#include "disable_tests_without_instantiations.h"
#ifdef ENABLE_FERMION_INSTANTIATIONS
#include <Grid/Grid.h>
#include <Grid/qcd/utils/A2Autils.h>
@@ -157,3 +159,5 @@ int main(int argc, char *argv[])
return EXIT_SUCCESS;
}
#endif
+321
View File
@@ -0,0 +1,321 @@
/*************************************************************************************
Grid physics library, www.github.com/paboyle/Grid
Source file: ./tests/core/Test_planned_fft.cc
Copyright (C) 2015
Author: Azusa Yamaguchi <ayamaguc@staffmail.ed.ac.uk>
Author: Peter Boyle <paboyle@ph.ed.ac.uk>
This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License along
with this program; if not, write to the Free Software Foundation, Inc.,
51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
See the full license in the file "LICENSE" in the top level distribution directory
*************************************************************************************/
/* END LEGAL */
#include <Grid/Grid.h>
using namespace Grid;
int main (int argc, char ** argv)
{
Grid_init(&argc,&argv);
int threads = GridThread::GetThreads();
std::cout<<GridLogMessage << "Grid is setup to use "<<threads<<" threads"<<std::endl;
Coordinate latt_size = GridDefaultLatt();
Coordinate simd_layout = GridDefaultSimd(Nd,vComplexD::Nsimd());
Coordinate mpi_layout = GridDefaultMpi();
int vol = 1;
for(int d=0;d<latt_size.size();d++) vol *= latt_size[d];
GridCartesian GRID(latt_size,simd_layout,mpi_layout);
GridRedBlackCartesian RBGRID(&GRID);
LatticeComplexD one(&GRID);
LatticeComplexD zz(&GRID);
LatticeComplexD C(&GRID);
LatticeComplexD Ctilde(&GRID);
LatticeComplexD Cref (&GRID);
LatticeComplexD Csav (&GRID);
LatticeComplexD coor(&GRID);
LatticeSpinMatrixD S(&GRID);
LatticeSpinMatrixD Stilde(&GRID);
Coordinate p({1,3,2,3});
one = ComplexD(1.0,0.0);
zz = ComplexD(0.0,0.0);
ComplexD ci(0.0,1.0);
std::cout<<"*************************************************"<<std::endl;
std::cout<<"Testing Fourier form of known plane wave "<<std::endl;
std::cout<<"*************************************************"<<std::endl;
C=Zero();
for(int mu=0;mu<4;mu++){
RealD TwoPiL = M_PI * 2.0/ latt_size[mu];
LatticeCoordinate(coor,mu);
C = C + (TwoPiL * p[mu]) * coor;
}
C = exp(C*ci);
Csav = C;
S=Zero();
S = S+C;
// PlannedFFT is templated on the lattice element type (vector_object), not the Lattice<> itself.
PlannedFFT<LatticeComplexD::vector_object> theFFT(&GRID);
PlannedFFT<LatticeSpinMatrixD::vector_object> theFFT_spin(&GRID);
Ctilde=C;
std::cout<<" Benchmarking PlannedFFT of LatticeComplex "<<std::endl;
theFFT.FFT_dim(Ctilde,Ctilde,0,FFTbase::forward); std::cout << theFFT.MFlops()<<" Mflops "<<std::endl;
theFFT.FFT_dim(Ctilde,Ctilde,1,FFTbase::forward); std::cout << theFFT.MFlops()<<" Mflops "<<std::endl;
theFFT.FFT_dim(Ctilde,Ctilde,2,FFTbase::forward); std::cout << theFFT.MFlops()<<" Mflops "<<std::endl;
theFFT.FFT_dim(Ctilde,Ctilde,3,FFTbase::forward); std::cout << theFFT.MFlops()<<" Mflops "<<std::endl;
TComplexD cVol;
cVol()()() = vol;
Cref=Zero();
pokeSite(cVol,Cref,p);
Cref=Cref-Ctilde;
std::cout << "diff scalar "<<norm2(Cref) << std::endl;
C=Csav;
theFFT.FFT_all_dim(Ctilde,C,FFTbase::forward);
theFFT.FFT_all_dim(Cref,Ctilde,FFTbase::backward);
std::cout << norm2(C) << " " << norm2(Ctilde) << " " << norm2(Cref)<< " vol " << vol<< std::endl;
Cref= Cref - C;
std::cout << " invertible check " << norm2(Cref)<<std::endl;
Stilde=S;
std::cout<<" Benchmarking PlannedFFT of LatticeSpinMatrix "<<std::endl;
theFFT_spin.FFT_dim(Stilde,Stilde,0,FFTbase::forward); std::cout << theFFT_spin.MFlops()<<" mflops "<<std::endl;
theFFT_spin.FFT_dim(Stilde,Stilde,1,FFTbase::forward); std::cout << theFFT_spin.MFlops()<<" mflops "<<std::endl;
theFFT_spin.FFT_dim(Stilde,Stilde,2,FFTbase::forward); std::cout << theFFT_spin.MFlops()<<" mflops "<<std::endl;
theFFT_spin.FFT_dim(Stilde,Stilde,3,FFTbase::forward); std::cout << theFFT_spin.MFlops()<<" mflops "<<std::endl;
SpinMatrixD Sp;
Sp = Zero(); Sp = Sp+cVol;
S=Zero();
pokeSite(Sp,S,p);
S= S-Stilde;
std::cout << "diff FT[SpinMat] "<<norm2(S) << std::endl;
std::vector<int> seeds({1,2,3,4});
GridSerialRNG sRNG; sRNG.SeedFixedIntegers(seeds);
GridParallelRNG pRNG(&GRID);
pRNG.SeedFixedIntegers(seeds);
LatticeGaugeFieldD Umu(&GRID);
SU<Nc>::ColdConfiguration(pRNG,Umu);
////////////////////////////////////////////////////
// Wilson test
////////////////////////////////////////////////////
{
LatticeFermionD src(&GRID); gaussian(pRNG,src);
LatticeFermionD tmp(&GRID);
LatticeFermionD ref(&GRID);
RealD mass=0.01;
WilsonFermionD Dw(Umu,GRID,RBGRID,mass);
Dw.M(src,tmp);
std::cout << "Dw src = " <<norm2(src)<<std::endl;
std::cout << "Dw tmp = " <<norm2(tmp)<<std::endl;
Dw.FreePropagator(tmp,ref,mass);
std::cout << "Dw ref = " <<norm2(ref)<<std::endl;
ref = ref - src;
std::cout << "Dw ref-src = " <<norm2(ref)<<std::endl;
}
////////////////////////////////////////////////////
// Dwf matrix — verify Fourier representation using PlannedFFT<LatticeFermionD>
////////////////////////////////////////////////////
{
std::cout<<"****************************************"<<std::endl;
std::cout<<"Testing Fourier representation of Ddwf"<<std::endl;
std::cout<<"****************************************"<<std::endl;
const int Ls=16;
const int sdir=0;
RealD mass=0.01;
RealD M5 =1.0;
Gamma G5(Gamma::Algebra::Gamma5);
GridCartesian * FGrid = SpaceTimeGrid::makeFiveDimGrid(Ls,&GRID);
GridRedBlackCartesian * FrbGrid = SpaceTimeGrid::makeFiveDimRedBlackGrid(Ls,&GRID);
DomainWallFermionD Ddwf(Umu,*FGrid,*FrbGrid,GRID,RBGRID,mass,M5);
GridParallelRNG RNG5(FGrid); RNG5.SeedFixedIntegers(seeds);
LatticeFermionD src5(FGrid); gaussian(RNG5,src5);
LatticeFermionD src5_p(FGrid);
LatticeFermionD result5(FGrid);
LatticeFermionD ref5(FGrid);
LatticeFermionD tmp5(FGrid);
Ddwf.M(src5,tmp5);
ref5 = tmp5;
PlannedFFT<LatticeFermionD::vector_object> theFFT5(FGrid);
theFFT5.FFT_dim(result5,tmp5,1,FFTbase::forward); tmp5 = result5;
std::cout<<"Fourier xformed Ddwf 1 "<<norm2(result5)<<std::endl;
theFFT5.FFT_dim(result5,tmp5,2,FFTbase::forward); tmp5 = result5;
std::cout<<"Fourier xformed Ddwf 2 "<<norm2(result5)<<std::endl;
theFFT5.FFT_dim(result5,tmp5,3,FFTbase::forward); tmp5 = result5;
std::cout<<"Fourier xformed Ddwf 3 "<<norm2(result5)<<std::endl;
theFFT5.FFT_dim(result5,tmp5,4,FFTbase::forward);
std::cout<<"Fourier xformed Ddwf 4 "<<norm2(result5)<<std::endl;
result5 = result5*ComplexD(::sqrt(1.0/vol),0.0);
std::cout<<"Fourier xformed Ddwf "<<norm2(result5)<<std::endl;
tmp5 = src5;
theFFT5.FFT_dim(src5_p,tmp5,1,FFTbase::forward); tmp5 = src5_p;
theFFT5.FFT_dim(src5_p,tmp5,2,FFTbase::forward); tmp5 = src5_p;
theFFT5.FFT_dim(src5_p,tmp5,3,FFTbase::forward); tmp5 = src5_p;
theFFT5.FFT_dim(src5_p,tmp5,4,FFTbase::forward); src5_p = src5_p*ComplexD(::sqrt(1.0/vol),0.0);
std::cout<<"Fourier xformed src5"<< norm2(src5)<<" -> "<<norm2(src5_p)<<std::endl;
Gamma::Algebra Gmu [] = {
Gamma::Algebra::GammaX,
Gamma::Algebra::GammaY,
Gamma::Algebra::GammaZ,
Gamma::Algebra::GammaT,
Gamma::Algebra::Gamma5
};
LatticeFermionD Kinetic(FGrid); Kinetic = Zero();
LatticeComplexD kmu(FGrid);
LatticeInteger scoor(FGrid);
LatticeComplexD sk (FGrid); sk = Zero();
LatticeComplexD sk2(FGrid); sk2= Zero();
LatticeComplexD W(FGrid); W= Zero();
LatticeComplexD one5(FGrid); one5 =ComplexD(1.0,0.0);
for(int mu=0;mu<Nd;mu++) {
LatticeCoordinate(kmu,mu+1);
RealD TwoPiL = M_PI * 2.0/ latt_size[mu];
kmu = TwoPiL * kmu;
sk2 = sk2 + 2.0*sin(kmu*0.5)*sin(kmu*0.5);
sk = sk + sin(kmu) *sin(kmu);
Kinetic = Kinetic + sin(kmu)*ci*(Gamma(Gmu[mu])*src5_p);
}
std::cout << " src5 "<<norm2(src5_p)<<std::endl;
std::cout << " Kinetic "<<norm2(Kinetic)<<std::endl;
W = one5 - M5 + sk2;
std::cout << " W "<<norm2(W)<<std::endl;
Kinetic = Kinetic + W * src5_p;
std::cout << " Kinetic "<<norm2(Kinetic)<<std::endl;
LatticeCoordinate(scoor,sdir);
tmp5 = Cshift(src5_p,sdir,+1);
tmp5 = (tmp5 - G5*tmp5)*0.5;
tmp5 = where(scoor==Integer(Ls-1),mass*tmp5,-tmp5);
Kinetic = Kinetic + tmp5;
tmp5 = Cshift(src5_p,sdir,-1);
tmp5 = (tmp5 + G5*tmp5)*0.5;
tmp5 = where(scoor==Integer(0),mass*tmp5,-tmp5);
Kinetic = Kinetic + tmp5;
std::cout<<"Momentum space Ddwf "<< norm2(Kinetic)<<std::endl;
std::cout<<"Stencil Ddwf "<< norm2(result5)<<std::endl;
result5 = result5 - Kinetic;
std::cout<<"diff "<< norm2(result5)<<std::endl;
GRID_ASSERT(norm2(result5)<1.0e-4);
}
////////////////////////////////////////////////////
// Dwf prop
////////////////////////////////////////////////////
{
std::cout<<"****************************************"<<std::endl;
std::cout << "Testing Ddwf Ht Mom space 4d propagator \n";
std::cout<<"****************************************"<<std::endl;
LatticeFermionD src(&GRID); gaussian(pRNG,src);
LatticeFermionD tmp(&GRID);
LatticeFermionD ref(&GRID);
LatticeFermionD diff(&GRID);
Coordinate point(4,0);
src=Zero();
SpinColourVectorD ferm; gaussian(sRNG,ferm);
pokeSite(ferm,src,point);
const int Ls=32;
GridCartesian * FGrid = SpaceTimeGrid::makeFiveDimGrid(Ls,&GRID);
GridRedBlackCartesian * FrbGrid = SpaceTimeGrid::makeFiveDimRedBlackGrid(Ls,&GRID);
RealD mass=0.01;
RealD M5 =0.8;
DomainWallFermionD Ddwf(Umu,*FGrid,*FrbGrid,GRID,RBGRID,mass,M5);
std::cout << " Solving by FFT and Feynman rules" <<std::endl;
bool fiveD = false;
Ddwf.FreePropagator(src,ref,mass,fiveD);
Gamma G5(Gamma::Algebra::Gamma5);
LatticeFermionD src5(FGrid); src5=Zero();
LatticeFermionD tmp5(FGrid);
LatticeFermionD result5(FGrid); result5=Zero();
LatticeFermionD result4(&GRID);
const int sdir=0;
tmp = (src + G5*src)*0.5; InsertSlice(tmp,src5, 0,sdir);
tmp = (src - G5*src)*0.5; InsertSlice(tmp,src5,Ls-1,sdir);
std::cout << " Solving by Conjugate Gradient (CGNE)" <<std::endl;
Ddwf.Mdag(src5,tmp5);
src5=tmp5;
MdagMLinearOperator<DomainWallFermionD,LatticeFermionD> HermOp(Ddwf);
ConjugateGradient<LatticeFermionD> CG(1.0e-8,10000);
CG(HermOp,src5,result5);
ExtractSlice(tmp,result5,0 ,sdir); result4 = (tmp-G5*tmp)*0.5;
ExtractSlice(tmp,result5,Ls-1,sdir); result4 = result4+(tmp+G5*tmp)*0.5;
std::cout << " Taking difference" <<std::endl;
std::cout << "Ddwf result4 "<<norm2(result4)<<std::endl;
std::cout << "Ddwf ref "<<norm2(ref)<<std::endl;
diff = ref - result4;
std::cout << "result - ref "<<norm2(diff)<<std::endl;
GRID_ASSERT(norm2(diff)<1.0e-4);
}
Grid_finalize();
}
+2 -2
View File
@@ -36,8 +36,8 @@ Author: Peter Boyle <paboyle@ph.ed.ac.uk>
using namespace std;
using namespace Grid;
gridblasHandle_t GridBLAS::gridblasHandle;
int GridBLAS::gridblasInit;
//gridblasHandle_t GridBLAS::gridblasHandle;
//int GridBLAS::gridblasInit;
///////////////////////
// Tells little dirac op to use MdagM as the .Op()
+30 -13
View File
@@ -128,6 +128,10 @@ int main (int argc, char ** argv)
typedef HermOpAdaptor<LatticeFermionD> HermFineMatrix;
HermFineMatrix FineHermOp(HermOpEO);
LatticeFermionD src(FrbGrid);
src = ComplexD(1.0);
PowerMethod<LatticeFermionD> PM; PM(HermOpEO,src);
////////////////////////////////////////////////////////////
///////////// Coarse basis and Little Dirac Operator ///////
////////////////////////////////////////////////////////////
@@ -150,7 +154,7 @@ int main (int argc, char ** argv)
std::cout << "**************************************"<<std::endl;
std::cout << "Create Subspace"<<std::endl;
std::cout << "**************************************"<<std::endl;
Aggregates.CreateSubspaceChebyshevNew(RNG5,HermOpEO,95.);
Aggregates.CreateSubspaceChebyshev(RNG5,HermOpEO,nbasis,35.,0.01,500);// <== last run
std::cout << "**************************************"<<std::endl;
std::cout << "Refine Subspace"<<std::endl;
@@ -185,7 +189,7 @@ int main (int argc, char ** argv)
std::cout << "**************************************"<<std::endl;
typedef HermitianLinearOperator<MultiGeneralCoarsenedMatrix_t,CoarseVector> MrhsHermMatrix;
Chebyshev<CoarseVector> IRLCheby(0.05,40.0,101); // 1 iter
Chebyshev<CoarseVector> IRLCheby(0.01,16.0,201); // 1 iter
MrhsHermMatrix MrhsCoarseOp (mrhs);
CoarseVector pm_src(CoarseMrhs);
@@ -193,10 +197,10 @@ int main (int argc, char ** argv)
PowerMethod<CoarseVector> cPM;
cPM(MrhsCoarseOp,pm_src);
int Nk=nrhs;
int Nm=Nk*3;
// int Nk=36;
// int Nm=144;
// int Nk=16;
// int Nm=Nk*3;
int Nk=32;
int Nm=128;
int Nstop=Nk;
int Nconv_test_interval=1;
@@ -210,7 +214,7 @@ int main (int argc, char ** argv)
nrhs,
Nk,
Nm,
1e-4,10);
1e-4,100);
int Nconv;
std::vector<RealD> eval(Nm);
@@ -231,8 +235,6 @@ int main (int argc, char ** argv)
std::cout << "**************************************"<<std::endl;
std::cout << " Recompute coarse evecs "<<std::endl;
std::cout << "**************************************"<<std::endl;
evec.resize(Nm,Coarse5d);
eval.resize(Nm);
for(int r=0;r<nrhs;r++){
random(CRNG,c_src[r]);
}
@@ -243,7 +245,7 @@ int main (int argc, char ** argv)
// Deflation guesser object
///////////////////////
std::cout << "**************************************"<<std::endl;
std::cout << " Reimport coarse evecs "<<std::endl;
std::cout << " Reimport coarse evecs "<<evec.size()<<" "<<eval.size()<<std::endl;
std::cout << "**************************************"<<std::endl;
MultiRHSDeflation<CoarseVector> MrhsGuesser;
MrhsGuesser.ImportEigenBasis(evec,eval);
@@ -252,9 +254,11 @@ int main (int argc, char ** argv)
// Extra HDCG parameters
//////////////////////////
int maxit=3000;
ConjugateGradient<CoarseVector> CG(2.0e-1,maxit,false);
RealD lo=2.0;
int ord = 9;
// ConjugateGradient<CoarseVector> CG(2.0e-1,maxit,false);
// ConjugateGradient<CoarseVector> CG(1.0e-2,maxit,false);
ConjugateGradient<CoarseVector> CG(5.0e-2,maxit,false);
RealD lo=0.2;
int ord = 7;
DoNothingGuesser<CoarseVector> DoNothing;
HPDSolver<CoarseVector> HPDSolveMrhs(MrhsCoarseOp,CG,DoNothing);
@@ -300,6 +304,19 @@ int main (int argc, char ** argv)
ConjugateGradient<LatticeFermionD> CGfine(1.0e-8,30000,false);
CGfine(HermOpEO, src, result);
}
{
std::cout << "**************************************"<<std::endl;
std::cout << "Calling MdagM CG"<<std::endl;
std::cout << "**************************************"<<std::endl;
LatticeFermion result(FGrid); result=Zero();
LatticeFermion src(FGrid); random(RNG5,src);
result=Zero();
MdagMLinearOperator<MobiusFermionD, LatticeFermionD> HermOp(Ddwf);
ConjugateGradient<LatticeFermionD> CGfine(1.0e-8,30000,false);
CGfine(HermOp, src, result);
}
#endif
Grid_finalize();
return 0;
+5 -2
View File
@@ -368,7 +368,10 @@ int main (int argc, char ** argv)
TrivialPrecon<CoarseVector> simple;
NonHermitianLinearOperator<LittleDiracOperator,CoarseVector> LinOpCoarse(LittleDiracOpPV);
// PrecGeneralisedConjugateResidualNonHermitian<CoarseVector> L2PGCR(1.0e-4, 100, LinOpCoarse,simple,10,10);
PrecGeneralisedConjugateResidualNonHermitian<CoarseVector> L2PGCR(3.0e-2, 100, LinOpCoarse,simple,10,10);
// PrecGeneralisedConjugateResidualNonHermitian<CoarseVector> L2PGCR(3.0e-2, 100, LinOpCoarse,simple,12,12); // 35 outer
// PrecGeneralisedConjugateResidualNonHermitian<CoarseVector> L2PGCR(5.0e-2, 100, LinOpCoarse,simple,12,12); // 36 outer, 12s
// PrecGeneralisedConjugateResidualNonHermitian<CoarseVector> L2PGCR(1.0e-1, 100, LinOpCoarse,simple,12,12); // 36 ; 11s
PrecGeneralisedConjugateResidualNonHermitian<CoarseVector> L2PGCR(3.0e-1, 100, LinOpCoarse,simple,12,12);
L2PGCR.Level(3);
c_res=Zero();
L2PGCR(c_src,c_res);
@@ -400,7 +403,7 @@ int main (int argc, char ** argv)
LinOpCoarse,
L2PGCR);
PrecGeneralisedConjugateResidualNonHermitian<LatticeFermion> L1PGCR(1.0e-8,1000,PVdagM,TwoLevelPrecon,16,16);
PrecGeneralisedConjugateResidualNonHermitian<LatticeFermion> L1PGCR(1.0e-8,100,PVdagM,TwoLevelPrecon,10,10);
L1PGCR.Level(1);
f_res=Zero();
@@ -0,0 +1,493 @@
/*************************************************************************************
Grid physics library, www.github.com/paboyle/Grid
Source file: ./tests/Test_padded_cell.cc
Copyright (C) 2023
Author: Peter Boyle <paboyle@ph.ed.ac.uk>
This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License along
with this program; if not, write to the Free Software Foundation, Inc.,
51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
See the full license in the file "LICENSE" in the top level distribution directory
*************************************************************************************/
/* END LEGAL */
#include <Grid/Grid.h>
#include <Grid/lattice/PaddedCell.h>
#include <Grid/stencil/GeneralLocalStencil.h>
#include <Grid/algorithms/iterative/PrecGeneralisedConjugateResidual.h>
#include <Grid/algorithms/iterative/PrecGeneralisedConjugateResidualNonHermitian.h>
#include <Grid/algorithms/iterative/BiCGSTAB.h>
using namespace std;
using namespace Grid;
template<class Matrix,class Field>
class PVdagMLinearOperator : public LinearOperatorBase<Field> {
Matrix &_Mat;
Matrix &_PV;
public:
PVdagMLinearOperator(Matrix &Mat,Matrix &PV): _Mat(Mat),_PV(PV){};
void OpDiag (const Field &in, Field &out) { assert(0); }
void OpDir (const Field &in, Field &out,int dir,int disp) { assert(0); }
void OpDirAll (const Field &in, std::vector<Field> &out){ assert(0); };
void Op (const Field &in, Field &out){
// std::cout << GridLogMessage<< "Op: PVdag M "<<std::endl;
Field tmp(in.Grid());
_Mat.M(in,tmp);
_PV.Mdag(tmp,out);
}
void AdjOp (const Field &in, Field &out){
// std::cout << GridLogMessage<<"AdjOp: Mdag PV "<<std::endl;
Field tmp(in.Grid());
_PV.M(in,tmp);
_Mat.Mdag(tmp,out);
}
void HermOpAndNorm(const Field &in, Field &out,RealD &n1,RealD &n2){
assert(0);
}
void HermOp(const Field &in, Field &out){
// std::cout <<GridLogMessage<< "HermOp: Mdag PV PVdag M"<<std::endl;
Field tmp(in.Grid());
Op(in,tmp);
AdjOp(tmp,out);
// std::cout << "HermOp done "<<norm2(out)<<std::endl;
}
};
template<class Matrix,class Field>
class MdagPVLinearOperator : public LinearOperatorBase<Field> {
Matrix &_Mat;
Matrix &_PV;
public:
MdagPVLinearOperator(Matrix &Mat,Matrix &PV): _Mat(Mat),_PV(PV){};
void OpDiag (const Field &in, Field &out) { assert(0); }
void OpDir (const Field &in, Field &out,int dir,int disp) { assert(0); }
void OpDirAll (const Field &in, std::vector<Field> &out){ assert(0); };
void Op (const Field &in, Field &out){
Field tmp(in.Grid());
// std::cout <<GridLogMessage<< "Op: PVdag M "<<std::endl;
_PV.M(in,tmp);
_Mat.Mdag(tmp,out);
}
void AdjOp (const Field &in, Field &out){
// std::cout <<GridLogMessage<< "AdjOp: Mdag PV "<<std::endl;
Field tmp(in.Grid());
_Mat.M(in,tmp);
_PV.Mdag(tmp,out);
}
void HermOpAndNorm(const Field &in, Field &out,RealD &n1,RealD &n2){
assert(0);
}
void HermOp(const Field &in, Field &out){
// std::cout << GridLogMessage<<"HermOp: PVdag M Mdag PV "<<std::endl;
Field tmp(in.Grid());
Op(in,tmp);
AdjOp(tmp,out);
// std::cout << "HermOp done "<<norm2(out)<<std::endl;
}
};
template<class Matrix,class Field>
class ShiftedPVdagMLinearOperator : public LinearOperatorBase<Field> {
Matrix &_Mat;
Matrix &_PV;
RealD shift;
public:
ShiftedPVdagMLinearOperator(RealD _shift,Matrix &Mat,Matrix &PV): shift(_shift),_Mat(Mat),_PV(PV){};
void OpDiag (const Field &in, Field &out) { assert(0); }
void OpDir (const Field &in, Field &out,int dir,int disp) { assert(0); }
void OpDirAll (const Field &in, std::vector<Field> &out){ assert(0); };
void Op (const Field &in, Field &out){
// std::cout << "Op: PVdag M "<<std::endl;
Field tmp(in.Grid());
_Mat.M(in,tmp);
_PV.Mdag(tmp,out);
out = out + shift * in;
}
void AdjOp (const Field &in, Field &out){
// std::cout << "AdjOp: Mdag PV "<<std::endl;
Field tmp(in.Grid());
_PV.M(tmp,out);
_Mat.Mdag(in,tmp);
out = out + shift * in;
}
void HermOpAndNorm(const Field &in, Field &out,RealD &n1,RealD &n2){ assert(0); }
void HermOp(const Field &in, Field &out){
// std::cout << "HermOp: Mdag PV PVdag M"<<std::endl;
Field tmp(in.Grid());
Op(in,tmp);
AdjOp(tmp,out);
}
};
template<class Fobj,class CComplex,int nbasis>
class MGPreconditionerSVD : public LinearFunction< Lattice<Fobj> > {
public:
using LinearFunction<Lattice<Fobj> >::operator();
typedef Aggregation<Fobj,CComplex,nbasis> Aggregates;
typedef typename Aggregation<Fobj,CComplex,nbasis>::FineField FineField;
typedef typename Aggregation<Fobj,CComplex,nbasis>::CoarseVector CoarseVector;
typedef typename Aggregation<Fobj,CComplex,nbasis>::CoarseMatrix CoarseMatrix;
typedef LinearOperatorBase<FineField> FineOperator;
typedef LinearFunction <FineField> FineSmoother;
typedef LinearOperatorBase<CoarseVector> CoarseOperator;
typedef LinearFunction <CoarseVector> CoarseSolver;
Aggregates & _FineToCoarse;
Aggregates & _CoarseToFine;
FineOperator & _FineOperator;
FineSmoother & _PreSmoother;
FineSmoother & _PostSmoother;
CoarseOperator & _CoarseOperator;
CoarseSolver & _CoarseSolve;
int level; void Level(int lv) {level = lv; };
MGPreconditionerSVD(Aggregates &FtoC,
Aggregates &CtoF,
FineOperator &Fine,
FineSmoother &PreSmoother,
FineSmoother &PostSmoother,
CoarseOperator &CoarseOperator_,
CoarseSolver &CoarseSolve_)
: _FineToCoarse(FtoC),
_CoarseToFine(CtoF),
_FineOperator(Fine),
_PreSmoother(PreSmoother),
_PostSmoother(PostSmoother),
_CoarseOperator(CoarseOperator_),
_CoarseSolve(CoarseSolve_),
level(1) { }
virtual void operator()(const FineField &in, FineField & out)
{
GridBase *CoarseGrid = _FineToCoarse.CoarseGrid;
// auto CoarseGrid = _CoarseOperator.Grid();
CoarseVector Csrc(CoarseGrid);
CoarseVector Csol(CoarseGrid);
FineField vec1(in.Grid());
FineField vec2(in.Grid());
std::cout<<GridLogMessage << "Calling PreSmoother " <<std::endl;
// std::cout<<GridLogMessage << "Calling PreSmoother input residual "<<norm2(in) <<std::endl;
double t;
// Fine Smoother
// out = in;
out = Zero();
t=-usecond();
_PreSmoother(in,out);
t+=usecond();
std::cout<<GridLogMessage << "PreSmoother took "<< t/1000.0<< "ms" <<std::endl;
// Update the residual
_FineOperator.Op(out,vec1); sub(vec1, in ,vec1);
// std::cout<<GridLogMessage <<"Residual-1 now " <<norm2(vec1)<<std::endl;
// Fine to Coarse
t=-usecond();
_FineToCoarse.ProjectToSubspace (Csrc,vec1);
t+=usecond();
std::cout<<GridLogMessage << "Project to coarse took "<< t/1000.0<< "ms" <<std::endl;
// Coarse correction
t=-usecond();
Csol = Zero();
_CoarseSolve(Csrc,Csol);
//Csol=Zero();
t+=usecond();
std::cout<<GridLogMessage << "Coarse solve took "<< t/1000.0<< "ms" <<std::endl;
// Coarse to Fine
t=-usecond();
// _CoarseOperator.PromoteFromSubspace(_Aggregates,Csol,vec1);
_CoarseToFine.PromoteFromSubspace(Csol,vec1);
add(out,out,vec1);
t+=usecond();
std::cout<<GridLogMessage << "Promote to this level took "<< t/1000.0<< "ms" <<std::endl;
// Residual
_FineOperator.Op(out,vec1); sub(vec1 ,in , vec1);
// std::cout<<GridLogMessage <<"Residual-2 now " <<norm2(vec1)<<std::endl;
// Fine Smoother
t=-usecond();
// vec2=vec1;
vec2=Zero();
_PostSmoother(vec1,vec2);
t+=usecond();
std::cout<<GridLogMessage << "PostSmoother took "<< t/1000.0<< "ms" <<std::endl;
add( out,out,vec2);
std::cout<<GridLogMessage << "Done " <<std::endl;
}
};
int main (int argc, char ** argv)
{
Grid_init(&argc,&argv);
const int Ls=16;
GridCartesian * UGrid = SpaceTimeGrid::makeFourDimGrid(GridDefaultLatt(), GridDefaultSimd(Nd,vComplex::Nsimd()),GridDefaultMpi());
GridRedBlackCartesian * UrbGrid = SpaceTimeGrid::makeFourDimRedBlackGrid(UGrid);
GridCartesian * FGrid = SpaceTimeGrid::makeFiveDimGrid(Ls,UGrid);
GridRedBlackCartesian * FrbGrid = SpaceTimeGrid::makeFiveDimRedBlackGrid(Ls,UGrid);
// Construct a coarsened grid
Coordinate clatt = GridDefaultLatt();
for(int d=0;d<clatt.size();d++){
clatt[d] = clatt[d]/2;
// clatt[d] = clatt[d]/4;
}
GridCartesian *Coarse4d = SpaceTimeGrid::makeFourDimGrid(clatt, GridDefaultSimd(Nd,vComplex::Nsimd()),GridDefaultMpi());;
GridCartesian *Coarse5d = SpaceTimeGrid::makeFiveDimGrid(1,Coarse4d);
std::vector<int> seeds4({1,2,3,4});
std::vector<int> seeds5({5,6,7,8});
std::vector<int> cseeds({5,6,7,8});
GridParallelRNG RNG5(FGrid); RNG5.SeedFixedIntegers(seeds5);
GridParallelRNG RNG4(UGrid); RNG4.SeedFixedIntegers(seeds4);
GridParallelRNG CRNG(Coarse5d);CRNG.SeedFixedIntegers(cseeds);
LatticeFermion src(FGrid); random(RNG5,src);
LatticeFermion result(FGrid); result=Zero();
LatticeFermion ref(FGrid); ref=Zero();
LatticeFermion tmp(FGrid);
LatticeFermion err(FGrid);
LatticeGaugeField Umu(UGrid);
FieldMetaData header;
std::string file("ckpoint_lat.4000");
NerscIO::readConfiguration(Umu,header,file);
RealD mass=0.01;
RealD M5=1.8;
DomainWallFermionD Ddwf(Umu,*FGrid,*FrbGrid,*UGrid,*UrbGrid,mass,M5);
DomainWallFermionD Dpv(Umu,*FGrid,*FrbGrid,*UGrid,*UrbGrid,1.0,M5);
const int nbasis = 30;
const int cb = 0 ;
NextToNearestStencilGeometry5D geom(Coarse5d);
std::cout<<GridLogMessage<<std::endl;
std::cout<<GridLogMessage<<"*******************************************"<<std::endl;
std::cout<<GridLogMessage<<std::endl;
typedef PVdagMLinearOperator<DomainWallFermionD,LatticeFermionD> PVdagM_t;
typedef MdagPVLinearOperator<DomainWallFermionD,LatticeFermionD> MdagPV_t;
typedef ShiftedPVdagMLinearOperator<DomainWallFermionD,LatticeFermionD> ShiftedPVdagM_t;
PVdagM_t PVdagM(Ddwf,Dpv);
MdagPV_t MdagPV(Ddwf,Dpv);
// ShiftedPVdagM_t ShiftedPVdagM(2.0,Ddwf,Dpv); // 355
// ShiftedPVdagM_t ShiftedPVdagM(1.0,Ddwf,Dpv); // 246
// ShiftedPVdagM_t ShiftedPVdagM(0.5,Ddwf,Dpv); // 183
// ShiftedPVdagM_t ShiftedPVdagM(0.25,Ddwf,Dpv); // 145
// ShiftedPVdagM_t ShiftedPVdagM(0.1,Ddwf,Dpv); // 134
// ShiftedPVdagM_t ShiftedPVdagM(0.1,Ddwf,Dpv); // 127 -- NULL space via inverse iteration
// ShiftedPVdagM_t ShiftedPVdagM(0.1,Ddwf,Dpv); // 57 -- NULL space via inverse iteration; 3 iterations
// ShiftedPVdagM_t ShiftedPVdagM(0.25,Ddwf,Dpv); // 57 , tighter inversion
// ShiftedPVdagM_t ShiftedPVdagM(0.25,Ddwf,Dpv); // nbasis 20 -- 49 iters
// ShiftedPVdagM_t ShiftedPVdagM(0.25,Ddwf,Dpv); // nbasis 20 -- 70 iters; asymmetric
// ShiftedPVdagM_t ShiftedPVdagM(0.25,Ddwf,Dpv); // 58; Loosen coarse, tighten fine
// ShiftedPVdagM_t ShiftedPVdagM(0.1,Ddwf,Dpv); // 56 ...
// ShiftedPVdagM_t ShiftedPVdagM(0.1,Ddwf,Dpv); // 51 ... with 24 vecs
// ShiftedPVdagM_t ShiftedPVdagM(0.1,Ddwf,Dpv); // 31 ... with 24 vecs and 2^4 blocking
// ShiftedPVdagM_t ShiftedPVdagM(0.1,Ddwf,Dpv); // 43 ... with 16 vecs and 2^4 blocking, sloppier
// ShiftedPVdagM_t ShiftedPVdagM(0.1,Ddwf,Dpv); // 35 ... with 20 vecs and 2^4 blocking
// ShiftedPVdagM_t ShiftedPVdagM(0.1,Ddwf,Dpv); // 35 ... with 20 vecs and 2^4 blocking, looser coarse
// ShiftedPVdagM_t ShiftedPVdagM(0.1,Ddwf,Dpv); // 64 ... with 20 vecs, Christoph setup, and 2^4 blocking, looser coarse
ShiftedPVdagM_t ShiftedPVdagM(0.01,Ddwf,Dpv); //
// Run power method on HOA??
PowerMethod<LatticeFermion> PM;
// PM(PVdagM,src);
// PM(MdagPV,src);
// Warning: This routine calls PVdagM.Op, not PVdagM.HermOp
typedef Aggregation<vSpinColourVector,vTComplex,nbasis> Subspace;
Subspace V(Coarse5d,FGrid,cb);
Subspace U(Coarse5d,FGrid,cb);
// Breeds right singular vectors with call to HermOp (V)
V.CreateSubspaceChebyshev(RNG5,PVdagM,
nbasis,
4000.0,0.003,
500);
// Breeds left singular vectors with call to HermOp (U)
// U.CreateSubspaceChebyshev(RNG5,PVdagM,
U.CreateSubspaceChebyshev(RNG5,MdagPV,
nbasis,
4000.0,0.003,
500);
typedef Aggregation<vSpinColourVector,vTComplex,2*nbasis> CombinedSubspace;
CombinedSubspace CombinedUV(Coarse5d,FGrid,cb);
for(int b=0;b<nbasis;b++){
CombinedUV.subspace[b] = V.subspace[b];
CombinedUV.subspace[b+nbasis] = U.subspace[b];
}
int bl, br;
std::cout <<" <V| PVdagM| V> " <<std::endl;
for(bl=0;bl<nbasis;bl++){
for(br=0;br<nbasis;br++){
PVdagM.Op(V.subspace[br],src);
std::cout <<bl<<" "<<br<<"\t"<<innerProduct(V.subspace[bl],src)<<std::endl;
}}
std::cout <<" <V| PVdagM| U> " <<std::endl;
for(bl=0;bl<nbasis;bl++){
for(br=0;br<nbasis;br++){
PVdagM.Op(U.subspace[br],src);
std::cout <<bl<<" "<<br<<"\t"<<innerProduct(V.subspace[bl],src)<<std::endl;
}}
std::cout <<" <U| PVdagM| V> " <<std::endl;
for(bl=0;bl<nbasis;bl++){
for(br=0;br<nbasis;br++){
PVdagM.Op(V.subspace[br],src);
std::cout <<bl<<" "<<br<<"\t"<<innerProduct(U.subspace[bl],src)<<std::endl;
}}
std::cout <<" <U| PVdagM| U> " <<std::endl;
for(bl=0;bl<nbasis;bl++){
for(br=0;br<nbasis;br++){
PVdagM.Op(U.subspace[br],src);
std::cout <<bl<<" "<<br<<"\t"<<innerProduct(U.subspace[bl],src)<<std::endl;
}}
typedef GeneralCoarsenedMatrix<vSpinColourVector,vTComplex,nbasis> LittleDiracOperatorV;
typedef LittleDiracOperatorV::CoarseVector CoarseVectorV;
typedef GeneralCoarsenedMatrix<vSpinColourVector,vTComplex,2*nbasis> LittleDiracOperator;
typedef LittleDiracOperator::CoarseVector CoarseVector;
V.Orthogonalise();
for(int b =0 ; b<nbasis;b++){
CoarseVectorV c_src (Coarse5d);
V.ProjectToSubspace (c_src,U.subspace[b]);
V.PromoteFromSubspace(c_src,src);
std::cout << " Completeness of U in V ["<< b<<"] "<< std::sqrt(norm2(src)/norm2(U.subspace[b]))<<std::endl;
}
CoarseVector c_src (Coarse5d);
CoarseVector c_res (Coarse5d);
CoarseVector c_proj(Coarse5d);
LittleDiracOperator LittleDiracOpPV(geom,FGrid,Coarse5d);
LittleDiracOpPV.CoarsenOperator(PVdagM,CombinedUV,CombinedUV);
std::cout<<GridLogMessage<<std::endl;
std::cout<<GridLogMessage<<"*******************************************"<<std::endl;
std::cout<<GridLogMessage<<std::endl;
std::cout<<GridLogMessage<<"Testing coarsened operator "<<std::endl;
Complex one(1.0);
c_src = one; // 1 in every element for vector 1.
blockPromote(c_src,err,CombinedUV.subspace);
LatticeFermion prom(FGrid);
prom=Zero();
for(int b=0;b<nbasis*2;b++){
prom=prom+CombinedUV.subspace[b];
}
std::cout<<GridLogMessage<<"c_src "<<norm2(c_src)<<std::endl;
std::cout<<GridLogMessage<<"prom "<<norm2(prom)<<std::endl;
PVdagM.Op(prom,tmp);
blockProject(c_proj,tmp,CombinedUV.subspace);
std::cout<<GridLogMessage<<" Called Big Dirac Op "<<norm2(tmp)<<std::endl;
LittleDiracOpPV.M(c_src,c_res);
std::cout<<GridLogMessage<<" Called Little Dirac Op c_src "<< norm2(c_src) << " c_res "<< norm2(c_res) <<std::endl;
std::cout<<GridLogMessage<<"Little dop : "<<norm2(c_res)<<std::endl;
std::cout<<GridLogMessage<<"Big dop in subspace : "<<norm2(c_proj)<<std::endl;
c_proj = c_proj - c_res;
std::cout<<GridLogMessage<<" ldop error: "<<norm2(c_proj)<<std::endl;
/**********
* Some solvers
**********
*/
///////////////////////////////////////
// Coarse grid solver test
///////////////////////////////////////
std::cout<<GridLogMessage<<"******************* "<<std::endl;
std::cout<<GridLogMessage<<" Coarse Grid Solve -- Level 3 "<<std::endl;
std::cout<<GridLogMessage<<"******************* "<<std::endl;
TrivialPrecon<CoarseVector> simple;
NonHermitianLinearOperator<LittleDiracOperator,CoarseVector> LinOpCoarse(LittleDiracOpPV);
// PrecGeneralisedConjugateResidualNonHermitian<CoarseVector> L2PGCR(1.0e-4, 100, LinOpCoarse,simple,10,10);
PrecGeneralisedConjugateResidualNonHermitian<CoarseVector> L2PGCR(1.0e-2, 10, LinOpCoarse,simple,20,20);
L2PGCR.Level(3);
c_res=Zero();
L2PGCR(c_src,c_res);
////////////////////////////////////////
// Fine grid smoother
////////////////////////////////////////
std::cout<<GridLogMessage<<"******************* "<<std::endl;
std::cout<<GridLogMessage<<" Fine Grid Smoother -- Level 2 "<<std::endl;
std::cout<<GridLogMessage<<"******************* "<<std::endl;
TrivialPrecon<LatticeFermionD> simple_fine;
// NonHermitianLinearOperator<PVdagM_t,LatticeFermionD> LinOpSmooth(PVdagM);
PrecGeneralisedConjugateResidualNonHermitian<LatticeFermionD> SmootherGCR(0.01,1,ShiftedPVdagM,simple_fine,16,16);
SmootherGCR.Level(2);
LatticeFermionD f_src(FGrid);
LatticeFermionD f_res(FGrid);
f_src = one; // 1 in every element for vector 1.
f_res=Zero();
SmootherGCR(f_src,f_res);
typedef MGPreconditionerSVD<vSpinColourVector, vTComplex,nbasis*2> TwoLevelMG;
TwoLevelMG TwoLevelPrecon(CombinedUV,CombinedUV,
PVdagM,
simple_fine,
SmootherGCR,
LinOpCoarse,
L2PGCR);
PrecGeneralisedConjugateResidualNonHermitian<LatticeFermion> L1PGCR(1.0e-8,1000,PVdagM,TwoLevelPrecon,20,20);
L1PGCR.Level(1);
f_res=Zero();
L1PGCR(f_src,f_res);
std::cout<<GridLogMessage<<std::endl;
std::cout<<GridLogMessage<<"*******************************************"<<std::endl;
std::cout<<GridLogMessage<<std::endl;
std::cout<<GridLogMessage << "Done "<< std::endl;
Grid_finalize();
return 0;
}
@@ -0,0 +1,492 @@
/*************************************************************************************
Grid physics library, www.github.com/paboyle/Grid
Source file: ./tests/Test_padded_cell.cc
Copyright (C) 2023
Author: Peter Boyle <paboyle@ph.ed.ac.uk>
This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License along
with this program; if not, write to the Free Software Foundation, Inc.,
51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
See the full license in the file "LICENSE" in the top level distribution directory
*************************************************************************************/
/* END LEGAL */
#include <Grid/Grid.h>
#include <Grid/lattice/PaddedCell.h>
#include <Grid/stencil/GeneralLocalStencil.h>
#include <Grid/algorithms/iterative/PrecGeneralisedConjugateResidual.h>
#include <Grid/algorithms/iterative/PrecGeneralisedConjugateResidualNonHermitian.h>
#include <Grid/algorithms/iterative/BiCGSTAB.h>
using namespace std;
using namespace Grid;
template<class Matrix,class Field>
class PVdagMLinearOperator : public LinearOperatorBase<Field> {
Matrix &_Mat;
Matrix &_PV;
public:
PVdagMLinearOperator(Matrix &Mat,Matrix &PV): _Mat(Mat),_PV(PV){};
void OpDiag (const Field &in, Field &out) { assert(0); }
void OpDir (const Field &in, Field &out,int dir,int disp) { assert(0); }
void OpDirAll (const Field &in, std::vector<Field> &out){ assert(0); };
void Op (const Field &in, Field &out){
// std::cout << GridLogMessage<< "Op: PVdag M "<<std::endl;
Field tmp(in.Grid());
_Mat.M(in,tmp);
_PV.Mdag(tmp,out);
}
void AdjOp (const Field &in, Field &out){
// std::cout << GridLogMessage<<"AdjOp: Mdag PV "<<std::endl;
Field tmp(in.Grid());
_PV.M(in,tmp);
_Mat.Mdag(tmp,out);
}
void HermOpAndNorm(const Field &in, Field &out,RealD &n1,RealD &n2){
HermOp(in,out);
ComplexD dot = innerProduct(in,out);
n1=real(dot);
n2=norm2(out);
}
void HermOp(const Field &in, Field &out){
// std::cout <<GridLogMessage<< "HermOp: Mdag PV PVdag M"<<std::endl;
Field tmp(in.Grid());
Op(in,tmp);
AdjOp(tmp,out);
// std::cout << "HermOp done "<<norm2(out)<<std::endl;
}
};
template<class Matrix,class Field>
class MdagPVLinearOperator : public LinearOperatorBase<Field> {
Matrix &_Mat;
Matrix &_PV;
public:
MdagPVLinearOperator(Matrix &Mat,Matrix &PV): _Mat(Mat),_PV(PV){};
void OpDiag (const Field &in, Field &out) { assert(0); }
void OpDir (const Field &in, Field &out,int dir,int disp) { assert(0); }
void OpDirAll (const Field &in, std::vector<Field> &out){ assert(0); };
void Op (const Field &in, Field &out){
Field tmp(in.Grid());
// std::cout <<GridLogMessage<< "Op: PVdag M "<<std::endl;
_PV.M(in,tmp);
_Mat.Mdag(tmp,out);
}
void AdjOp (const Field &in, Field &out){
// std::cout <<GridLogMessage<< "AdjOp: Mdag PV "<<std::endl;
Field tmp(in.Grid());
_Mat.M(in,tmp);
_PV.Mdag(tmp,out);
}
void HermOpAndNorm(const Field &in, Field &out,RealD &n1,RealD &n2){
ComplexD dot = innerProduct(in,out);
n1=real(dot);
n2=norm2(out);
}
void HermOp(const Field &in, Field &out){
// std::cout << GridLogMessage<<"HermOp: PVdag M Mdag PV "<<std::endl;
Field tmp(in.Grid());
Op(in,tmp);
AdjOp(tmp,out);
// std::cout << "HermOp done "<<norm2(out)<<std::endl;
}
};
template<class Matrix,class Field>
class ShiftedPVdagMLinearOperator : public LinearOperatorBase<Field> {
Matrix &_Mat;
Matrix &_PV;
RealD shift;
public:
ShiftedPVdagMLinearOperator(RealD _shift,Matrix &Mat,Matrix &PV): shift(_shift),_Mat(Mat),_PV(PV){};
void OpDiag (const Field &in, Field &out) { assert(0); }
void OpDir (const Field &in, Field &out,int dir,int disp) { assert(0); }
void OpDirAll (const Field &in, std::vector<Field> &out){ assert(0); };
void Op (const Field &in, Field &out){
// std::cout << "Op: PVdag M "<<std::endl;
Field tmp(in.Grid());
_Mat.M(in,tmp);
_PV.Mdag(tmp,out);
out = out + shift * in;
}
void AdjOp (const Field &in, Field &out){
// std::cout << "AdjOp: Mdag PV "<<std::endl;
Field tmp(in.Grid());
_PV.M(tmp,out);
_Mat.Mdag(in,tmp);
out = out + shift * in;
}
void HermOpAndNorm(const Field &in, Field &out,RealD &n1,RealD &n2){ assert(0); }
void HermOp(const Field &in, Field &out){
// std::cout << "HermOp: Mdag PV PVdag M"<<std::endl;
Field tmp(in.Grid());
Op(in,tmp);
AdjOp(tmp,out);
}
};
template<class Fobj,class CComplex,int nbasis>
class MGPreconditionerSVD : public LinearFunction< Lattice<Fobj> > {
public:
using LinearFunction<Lattice<Fobj> >::operator();
typedef Aggregation<Fobj,CComplex,nbasis> Aggregates;
typedef typename Aggregation<Fobj,CComplex,nbasis>::FineField FineField;
typedef typename Aggregation<Fobj,CComplex,nbasis>::CoarseVector CoarseVector;
typedef typename Aggregation<Fobj,CComplex,nbasis>::CoarseMatrix CoarseMatrix;
typedef LinearOperatorBase<FineField> FineOperator;
typedef LinearFunction <FineField> FineSmoother;
typedef LinearOperatorBase<CoarseVector> CoarseOperator;
typedef LinearFunction <CoarseVector> CoarseSolver;
Aggregates & _FineToCoarse;
Aggregates & _CoarseToFine;
FineOperator & _FineOperator;
FineSmoother & _PreSmoother;
FineSmoother & _PostSmoother;
CoarseOperator & _CoarseOperator;
CoarseSolver & _CoarseSolve;
int level; void Level(int lv) {level = lv; };
MGPreconditionerSVD(Aggregates &FtoC,
Aggregates &CtoF,
FineOperator &Fine,
FineSmoother &PreSmoother,
FineSmoother &PostSmoother,
CoarseOperator &CoarseOperator_,
CoarseSolver &CoarseSolve_)
: _FineToCoarse(FtoC),
_CoarseToFine(CtoF),
_FineOperator(Fine),
_PreSmoother(PreSmoother),
_PostSmoother(PostSmoother),
_CoarseOperator(CoarseOperator_),
_CoarseSolve(CoarseSolve_),
level(1) { }
virtual void operator()(const FineField &in, FineField & out)
{
GridBase *CoarseGrid = _FineToCoarse.CoarseGrid;
// auto CoarseGrid = _CoarseOperator.Grid();
CoarseVector Csrc(CoarseGrid);
CoarseVector Csol(CoarseGrid);
FineField vec1(in.Grid());
FineField vec2(in.Grid());
std::cout<<GridLogMessage << "Calling PreSmoother " <<std::endl;
// std::cout<<GridLogMessage << "Calling PreSmoother input residual "<<norm2(in) <<std::endl;
double t;
// Fine Smoother
// out = in;
out = Zero();
t=-usecond();
_PreSmoother(in,out);
t+=usecond();
std::cout<<GridLogMessage << "PreSmoother took "<< t/1000.0<< "ms" <<std::endl;
// Update the residual
_FineOperator.Op(out,vec1); sub(vec1, in ,vec1);
// std::cout<<GridLogMessage <<"Residual-1 now " <<norm2(vec1)<<std::endl;
// Fine to Coarse
t=-usecond();
_FineToCoarse.ProjectToSubspace (Csrc,vec1);
t+=usecond();
std::cout<<GridLogMessage << "Project to coarse took "<< t/1000.0<< "ms" <<std::endl;
// Coarse correction
t=-usecond();
Csol = Zero();
_CoarseSolve(Csrc,Csol);
//Csol=Zero();
t+=usecond();
std::cout<<GridLogMessage << "Coarse solve took "<< t/1000.0<< "ms" <<std::endl;
// Coarse to Fine
t=-usecond();
// _CoarseOperator.PromoteFromSubspace(_Aggregates,Csol,vec1);
_CoarseToFine.PromoteFromSubspace(Csol,vec1);
add(out,out,vec1);
t+=usecond();
std::cout<<GridLogMessage << "Promote to this level took "<< t/1000.0<< "ms" <<std::endl;
// Residual
_FineOperator.Op(out,vec1); sub(vec1 ,in , vec1);
// std::cout<<GridLogMessage <<"Residual-2 now " <<norm2(vec1)<<std::endl;
// Fine Smoother
t=-usecond();
// vec2=vec1;
vec2=Zero();
_PostSmoother(vec1,vec2);
t+=usecond();
std::cout<<GridLogMessage << "PostSmoother took "<< t/1000.0<< "ms" <<std::endl;
add( out,out,vec2);
std::cout<<GridLogMessage << "Done " <<std::endl;
}
};
int main (int argc, char ** argv)
{
Grid_init(&argc,&argv);
const int Ls=16;
GridCartesian * UGrid = SpaceTimeGrid::makeFourDimGrid(GridDefaultLatt(), GridDefaultSimd(Nd,vComplex::Nsimd()),GridDefaultMpi());
GridRedBlackCartesian * UrbGrid = SpaceTimeGrid::makeFourDimRedBlackGrid(UGrid);
GridCartesian * FGrid = SpaceTimeGrid::makeFiveDimGrid(Ls,UGrid);
GridRedBlackCartesian * FrbGrid = SpaceTimeGrid::makeFiveDimRedBlackGrid(Ls,UGrid);
// Construct a coarsened grid
Coordinate clatt = GridDefaultLatt();
for(int d=0;d<clatt.size();d++){
clatt[d] = clatt[d]/2;
// clatt[d] = clatt[d]/4;
}
GridCartesian *Coarse4d = SpaceTimeGrid::makeFourDimGrid(clatt, GridDefaultSimd(Nd,vComplex::Nsimd()),GridDefaultMpi());;
GridCartesian *Coarse5d = SpaceTimeGrid::makeFiveDimGrid(1,Coarse4d);
std::vector<int> seeds4({1,2,3,4});
std::vector<int> seeds5({5,6,7,8});
std::vector<int> cseeds({5,6,7,8});
GridParallelRNG RNG5(FGrid); RNG5.SeedFixedIntegers(seeds5);
GridParallelRNG RNG4(UGrid); RNG4.SeedFixedIntegers(seeds4);
GridParallelRNG CRNG(Coarse5d);CRNG.SeedFixedIntegers(cseeds);
LatticeFermion src(FGrid); random(RNG5,src);
LatticeFermion result(FGrid); result=Zero();
LatticeFermion ref(FGrid); ref=Zero();
LatticeFermion tmp(FGrid);
LatticeFermion err(FGrid);
LatticeGaugeField Umu(UGrid);
FieldMetaData header;
std::string file("ckpoint_lat.4000");
NerscIO::readConfiguration(Umu,header,file);
RealD mass=0.01;
RealD M5=1.8;
DomainWallFermionD Ddwf(Umu,*FGrid,*FrbGrid,*UGrid,*UrbGrid,mass,M5);
DomainWallFermionD Dpv(Umu,*FGrid,*FrbGrid,*UGrid,*UrbGrid,1.0,M5);
const int nbasis = 20;
const int cb = 0 ;
NextToNearestStencilGeometry5D geom(Coarse5d);
std::cout<<GridLogMessage<<std::endl;
std::cout<<GridLogMessage<<"*******************************************"<<std::endl;
std::cout<<GridLogMessage<<std::endl;
typedef PVdagMLinearOperator<DomainWallFermionD,LatticeFermionD> PVdagM_t;
typedef MdagPVLinearOperator<DomainWallFermionD,LatticeFermionD> MdagPV_t;
typedef ShiftedPVdagMLinearOperator<DomainWallFermionD,LatticeFermionD> ShiftedPVdagM_t;
PVdagM_t PVdagM(Ddwf,Dpv);
MdagPV_t MdagPV(Ddwf,Dpv);
// ShiftedPVdagM_t ShiftedPVdagM(2.0,Ddwf,Dpv); // 355
// ShiftedPVdagM_t ShiftedPVdagM(1.0,Ddwf,Dpv); // 246
// ShiftedPVdagM_t ShiftedPVdagM(0.5,Ddwf,Dpv); // 183
// ShiftedPVdagM_t ShiftedPVdagM(0.25,Ddwf,Dpv); // 145
// ShiftedPVdagM_t ShiftedPVdagM(0.1,Ddwf,Dpv); // 134
// ShiftedPVdagM_t ShiftedPVdagM(0.1,Ddwf,Dpv); // 127 -- NULL space via inverse iteration
// ShiftedPVdagM_t ShiftedPVdagM(0.1,Ddwf,Dpv); // 57 -- NULL space via inverse iteration; 3 iterations
// ShiftedPVdagM_t ShiftedPVdagM(0.25,Ddwf,Dpv); // 57 , tighter inversion
// ShiftedPVdagM_t ShiftedPVdagM(0.25,Ddwf,Dpv); // nbasis 20 -- 49 iters
// ShiftedPVdagM_t ShiftedPVdagM(0.25,Ddwf,Dpv); // nbasis 20 -- 70 iters; asymmetric
// ShiftedPVdagM_t ShiftedPVdagM(0.25,Ddwf,Dpv); // 58; Loosen coarse, tighten fine
// ShiftedPVdagM_t ShiftedPVdagM(0.1,Ddwf,Dpv); // 56 ...
// ShiftedPVdagM_t ShiftedPVdagM(0.1,Ddwf,Dpv); // 51 ... with 24 vecs
// ShiftedPVdagM_t ShiftedPVdagM(0.1,Ddwf,Dpv); // 31 ... with 24 vecs and 2^4 blocking
// ShiftedPVdagM_t ShiftedPVdagM(0.1,Ddwf,Dpv); // 43 ... with 16 vecs and 2^4 blocking, sloppier
// ShiftedPVdagM_t ShiftedPVdagM(0.1,Ddwf,Dpv); // 35 ... with 20 vecs and 2^4 blocking
// ShiftedPVdagM_t ShiftedPVdagM(0.1,Ddwf,Dpv); // 35 ... with 20 vecs and 2^4 blocking, looser coarse
// ShiftedPVdagM_t ShiftedPVdagM(0.1,Ddwf,Dpv); // 64 ... with 20 vecs, Christoph setup, and 2^4 blocking, looser coarse
ShiftedPVdagM_t ShiftedPVdagM(0.01,Ddwf,Dpv); //
// Run power method on HOA??
PowerMethod<LatticeFermion> PM;
// PM(PVdagM,src);
// PM(MdagPV,src);
// Warning: This routine calls PVdagM.Op, not PVdagM.HermOp
typedef Aggregation<vSpinColourVector,vTComplex,nbasis> Subspace;
Subspace V(Coarse5d,FGrid,cb);
Subspace U(Coarse5d,FGrid,cb);
// Breeds right singular vectors with call to HermOp (V)
V.CreateSubspace(RNG5,PVdagM,nbasis);
// Breeds left singular vectors with call to HermOp (U)
// U.CreateSubspaceChebyshev(RNG5,MdagPV,
U.CreateSubspace(RNG5,PVdagM,nbasis);
typedef Aggregation<vSpinColourVector,vTComplex,2*nbasis> CombinedSubspace;
CombinedSubspace CombinedUV(Coarse5d,FGrid,cb);
for(int b=0;b<nbasis;b++){
CombinedUV.subspace[b] = V.subspace[b];
CombinedUV.subspace[b+nbasis] = U.subspace[b];
}
int bl, br;
std::cout <<" <V| PVdagM| V> " <<std::endl;
for(bl=0;bl<nbasis;bl++){
for(br=0;br<nbasis;br++){
PVdagM.Op(V.subspace[br],src);
std::cout <<bl<<" "<<br<<"\t"<<innerProduct(V.subspace[bl],src)<<std::endl;
}}
std::cout <<" <V| PVdagM| U> " <<std::endl;
for(bl=0;bl<nbasis;bl++){
for(br=0;br<nbasis;br++){
PVdagM.Op(U.subspace[br],src);
std::cout <<bl<<" "<<br<<"\t"<<innerProduct(V.subspace[bl],src)<<std::endl;
}}
std::cout <<" <U| PVdagM| V> " <<std::endl;
for(bl=0;bl<nbasis;bl++){
for(br=0;br<nbasis;br++){
PVdagM.Op(V.subspace[br],src);
std::cout <<bl<<" "<<br<<"\t"<<innerProduct(U.subspace[bl],src)<<std::endl;
}}
std::cout <<" <U| PVdagM| U> " <<std::endl;
for(bl=0;bl<nbasis;bl++){
for(br=0;br<nbasis;br++){
PVdagM.Op(U.subspace[br],src);
std::cout <<bl<<" "<<br<<"\t"<<innerProduct(U.subspace[bl],src)<<std::endl;
}}
typedef GeneralCoarsenedMatrix<vSpinColourVector,vTComplex,nbasis> LittleDiracOperatorV;
typedef LittleDiracOperatorV::CoarseVector CoarseVectorV;
typedef GeneralCoarsenedMatrix<vSpinColourVector,vTComplex,2*nbasis> LittleDiracOperator;
typedef LittleDiracOperator::CoarseVector CoarseVector;
V.Orthogonalise();
for(int b =0 ; b<nbasis;b++){
CoarseVectorV c_src (Coarse5d);
V.ProjectToSubspace (c_src,U.subspace[b]);
V.PromoteFromSubspace(c_src,src);
std::cout << " Completeness of U in V ["<< b<<"] "<< std::sqrt(norm2(src)/norm2(U.subspace[b]))<<std::endl;
}
CoarseVector c_src (Coarse5d);
CoarseVector c_res (Coarse5d);
CoarseVector c_proj(Coarse5d);
LittleDiracOperator LittleDiracOpPV(geom,FGrid,Coarse5d);
LittleDiracOpPV.CoarsenOperator(PVdagM,CombinedUV,CombinedUV);
std::cout<<GridLogMessage<<std::endl;
std::cout<<GridLogMessage<<"*******************************************"<<std::endl;
std::cout<<GridLogMessage<<std::endl;
std::cout<<GridLogMessage<<"Testing coarsened operator "<<std::endl;
Complex one(1.0);
c_src = one; // 1 in every element for vector 1.
blockPromote(c_src,err,CombinedUV.subspace);
LatticeFermion prom(FGrid);
prom=Zero();
for(int b=0;b<nbasis*2;b++){
prom=prom+CombinedUV.subspace[b];
}
std::cout<<GridLogMessage<<"c_src "<<norm2(c_src)<<std::endl;
std::cout<<GridLogMessage<<"prom "<<norm2(prom)<<std::endl;
PVdagM.Op(prom,tmp);
blockProject(c_proj,tmp,CombinedUV.subspace);
std::cout<<GridLogMessage<<" Called Big Dirac Op "<<norm2(tmp)<<std::endl;
LittleDiracOpPV.M(c_src,c_res);
std::cout<<GridLogMessage<<" Called Little Dirac Op c_src "<< norm2(c_src) << " c_res "<< norm2(c_res) <<std::endl;
std::cout<<GridLogMessage<<"Little dop : "<<norm2(c_res)<<std::endl;
std::cout<<GridLogMessage<<"Big dop in subspace : "<<norm2(c_proj)<<std::endl;
c_proj = c_proj - c_res;
std::cout<<GridLogMessage<<" ldop error: "<<norm2(c_proj)<<std::endl;
/**********
* Some solvers
**********
*/
///////////////////////////////////////
// Coarse grid solver test
///////////////////////////////////////
std::cout<<GridLogMessage<<"******************* "<<std::endl;
std::cout<<GridLogMessage<<" Coarse Grid Solve -- Level 3 "<<std::endl;
std::cout<<GridLogMessage<<"******************* "<<std::endl;
TrivialPrecon<CoarseVector> simple;
NonHermitianLinearOperator<LittleDiracOperator,CoarseVector> LinOpCoarse(LittleDiracOpPV);
// PrecGeneralisedConjugateResidualNonHermitian<CoarseVector> L2PGCR(1.0e-4, 100, LinOpCoarse,simple,10,10);
PrecGeneralisedConjugateResidualNonHermitian<CoarseVector> L2PGCR(1.0e-2, 10, LinOpCoarse,simple,20,20);
L2PGCR.Level(3);
c_res=Zero();
L2PGCR(c_src,c_res);
////////////////////////////////////////
// Fine grid smoother
////////////////////////////////////////
std::cout<<GridLogMessage<<"******************* "<<std::endl;
std::cout<<GridLogMessage<<" Fine Grid Smoother -- Level 2 "<<std::endl;
std::cout<<GridLogMessage<<"******************* "<<std::endl;
TrivialPrecon<LatticeFermionD> simple_fine;
// NonHermitianLinearOperator<PVdagM_t,LatticeFermionD> LinOpSmooth(PVdagM);
PrecGeneralisedConjugateResidualNonHermitian<LatticeFermionD> SmootherGCR(0.01,1,ShiftedPVdagM,simple_fine,16,16);
SmootherGCR.Level(2);
LatticeFermionD f_src(FGrid);
LatticeFermionD f_res(FGrid);
f_src = one; // 1 in every element for vector 1.
f_res=Zero();
SmootherGCR(f_src,f_res);
typedef MGPreconditionerSVD<vSpinColourVector, vTComplex,nbasis*2> TwoLevelMG;
TwoLevelMG TwoLevelPrecon(CombinedUV,CombinedUV,
PVdagM,
simple_fine,
SmootherGCR,
LinOpCoarse,
L2PGCR);
PrecGeneralisedConjugateResidualNonHermitian<LatticeFermion> L1PGCR(1.0e-8,1000,PVdagM,TwoLevelPrecon,20,20);
L1PGCR.Level(1);
f_res=Zero();
L1PGCR(f_src,f_res);
std::cout<<GridLogMessage<<std::endl;
std::cout<<GridLogMessage<<"*******************************************"<<std::endl;
std::cout<<GridLogMessage<<std::endl;
std::cout<<GridLogMessage << "Done "<< std::endl;
Grid_finalize();
return 0;
}
@@ -0,0 +1,479 @@
/*************************************************************************************
Grid physics library, www.github.com/paboyle/Grid
Source file: ./tests/Test_padded_cell.cc
Copyright (C) 2023
Author: Peter Boyle <paboyle@ph.ed.ac.uk>
This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License along
with this program; if not, write to the Free Software Foundation, Inc.,
51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
See the full license in the file "LICENSE" in the top level distribution directory
*************************************************************************************/
/* END LEGAL */
#include <Grid/Grid.h>
#include <Grid/lattice/PaddedCell.h>
#include <Grid/stencil/GeneralLocalStencil.h>
#include <Grid/algorithms/iterative/PrecGeneralisedConjugateResidual.h>
#include <Grid/algorithms/iterative/PrecGeneralisedConjugateResidualNonHermitian.h>
#include <Grid/algorithms/iterative/BiCGSTAB.h>
using namespace std;
using namespace Grid;
template<class Matrix,class Field>
class PVdagMLinearOperator : public LinearOperatorBase<Field> {
Matrix &_Mat;
Matrix &_PV;
public:
PVdagMLinearOperator(Matrix &Mat,Matrix &PV): _Mat(Mat),_PV(PV){};
void OpDiag (const Field &in, Field &out) { assert(0); }
void OpDir (const Field &in, Field &out,int dir,int disp) { assert(0); }
void OpDirAll (const Field &in, std::vector<Field> &out){ assert(0); };
void Op (const Field &in, Field &out){
// std::cout << GridLogMessage<< "Op: PVdag M "<<std::endl;
Field tmp(in.Grid());
_Mat.M(in,tmp);
_PV.Mdag(tmp,out);
}
void AdjOp (const Field &in, Field &out){
// std::cout << GridLogMessage<<"AdjOp: Mdag PV "<<std::endl;
Field tmp(in.Grid());
_PV.M(in,tmp);
_Mat.Mdag(tmp,out);
}
void HermOpAndNorm(const Field &in, Field &out,RealD &n1,RealD &n2){
assert(0);
}
void HermOp(const Field &in, Field &out){
// std::cout <<GridLogMessage<< "HermOp: Mdag PV PVdag M"<<std::endl;
Field tmp(in.Grid());
Op(in,tmp);
AdjOp(tmp,out);
// std::cout << "HermOp done "<<norm2(out)<<std::endl;
}
};
template<class Matrix,class Field>
class MdagPVLinearOperator : public LinearOperatorBase<Field> {
Matrix &_Mat;
Matrix &_PV;
public:
MdagPVLinearOperator(Matrix &Mat,Matrix &PV): _Mat(Mat),_PV(PV){};
void OpDiag (const Field &in, Field &out) { assert(0); }
void OpDir (const Field &in, Field &out,int dir,int disp) { assert(0); }
void OpDirAll (const Field &in, std::vector<Field> &out){ assert(0); };
void Op (const Field &in, Field &out){
Field tmp(in.Grid());
// std::cout <<GridLogMessage<< "Op: PVdag M "<<std::endl;
_PV.M(in,tmp);
_Mat.Mdag(tmp,out);
}
void AdjOp (const Field &in, Field &out){
// std::cout <<GridLogMessage<< "AdjOp: Mdag PV "<<std::endl;
Field tmp(in.Grid());
_Mat.M(in,tmp);
_PV.Mdag(tmp,out);
}
void HermOpAndNorm(const Field &in, Field &out,RealD &n1,RealD &n2){
assert(0);
}
void HermOp(const Field &in, Field &out){
// std::cout << GridLogMessage<<"HermOp: PVdag M Mdag PV "<<std::endl;
Field tmp(in.Grid());
Op(in,tmp);
AdjOp(tmp,out);
// std::cout << "HermOp done "<<norm2(out)<<std::endl;
}
};
template<class Matrix,class Field>
class ShiftedPVdagMLinearOperator : public LinearOperatorBase<Field> {
Matrix &_Mat;
Matrix &_PV;
RealD shift;
public:
ShiftedPVdagMLinearOperator(RealD _shift,Matrix &Mat,Matrix &PV): shift(_shift),_Mat(Mat),_PV(PV){};
void OpDiag (const Field &in, Field &out) { assert(0); }
void OpDir (const Field &in, Field &out,int dir,int disp) { assert(0); }
void OpDirAll (const Field &in, std::vector<Field> &out){ assert(0); };
void Op (const Field &in, Field &out){
// std::cout << "Op: PVdag M "<<std::endl;
Field tmp(in.Grid());
_Mat.M(in,tmp);
_PV.Mdag(tmp,out);
out = out + shift * in;
}
void AdjOp (const Field &in, Field &out){
// std::cout << "AdjOp: Mdag PV "<<std::endl;
Field tmp(in.Grid());
_PV.M(tmp,out);
_Mat.Mdag(in,tmp);
out = out + shift * in;
}
void HermOpAndNorm(const Field &in, Field &out,RealD &n1,RealD &n2){ assert(0); }
void HermOp(const Field &in, Field &out){
// std::cout << "HermOp: Mdag PV PVdag M"<<std::endl;
Field tmp(in.Grid());
Op(in,tmp);
AdjOp(tmp,out);
}
};
template<class Fobj,class CComplex,int nbasis>
class MGPreconditionerSVD : public LinearFunction< Lattice<Fobj> > {
public:
using LinearFunction<Lattice<Fobj> >::operator();
typedef Aggregation<Fobj,CComplex,nbasis> Aggregates;
typedef typename Aggregation<Fobj,CComplex,nbasis>::FineField FineField;
typedef typename Aggregation<Fobj,CComplex,nbasis>::CoarseVector CoarseVector;
typedef typename Aggregation<Fobj,CComplex,nbasis>::CoarseMatrix CoarseMatrix;
typedef LinearOperatorBase<FineField> FineOperator;
typedef LinearFunction <FineField> FineSmoother;
typedef LinearOperatorBase<CoarseVector> CoarseOperator;
typedef LinearFunction <CoarseVector> CoarseSolver;
///////////////////////////////
// SVD is M = U S Vdag
//
// Define a subset of Vc and Uc in Complex_f,c matrix
// - these are the coarsening, non-square matrices
//
// Solve a coarse approx to
//
// M psi = eta
//
// via
//
// Uc^dag U S Vdag Vc Vc^dag psi = Uc^dag eta
//
// M_coarse Vc^dag psi = M_coarse psi_c = eta_c
//
///////////////////////////////
Aggregates & _U;
Aggregates & _V;
FineOperator & _FineOperator;
FineSmoother & _PreSmoother;
FineSmoother & _PostSmoother;
CoarseOperator & _CoarseOperator;
CoarseSolver & _CoarseSolve;
int level; void Level(int lv) {level = lv; };
MGPreconditionerSVD(Aggregates &U,
Aggregates &V,
FineOperator &Fine,
FineSmoother &PreSmoother,
FineSmoother &PostSmoother,
CoarseOperator &CoarseOperator_,
CoarseSolver &CoarseSolve_)
: _U(U),
_V(V),
_FineOperator(Fine),
_PreSmoother(PreSmoother),
_PostSmoother(PostSmoother),
_CoarseOperator(CoarseOperator_),
_CoarseSolve(CoarseSolve_),
level(1) { }
virtual void operator()(const FineField &in, FineField & out)
{
GridBase *CoarseGrid = _U.CoarseGrid;
// auto CoarseGrid = _CoarseOperator.Grid();
CoarseVector Csrc(CoarseGrid);
CoarseVector Csol(CoarseGrid);
FineField vec1(in.Grid());
FineField vec2(in.Grid());
std::cout<<GridLogMessage << "Calling PreSmoother " <<std::endl;
// std::cout<<GridLogMessage << "Calling PreSmoother input residual "<<norm2(in) <<std::endl;
double t;
// Fine Smoother
// out = in;
out = Zero();
t=-usecond();
_PreSmoother(in,out);
t+=usecond();
std::cout<<GridLogMessage << "PreSmoother took "<< t/1000.0<< "ms" <<std::endl;
// Update the residual
_FineOperator.Op(out,vec1); sub(vec1, in ,vec1);
// std::cout<<GridLogMessage <<"Residual-1 now " <<norm2(vec1)<<std::endl;
// Uc^dag U S Vdag Vc Vc^dag psi = Uc^dag eta
// Fine to Coarse
t=-usecond();
_U.ProjectToSubspace (Csrc,vec1);
t+=usecond();
std::cout<<GridLogMessage << "Project to coarse took "<< t/1000.0<< "ms" <<std::endl;
// Coarse correction
t=-usecond();
Csol = Zero();
_CoarseSolve(Csrc,Csol);
//Csol=Zero();
t+=usecond();
std::cout<<GridLogMessage << "Coarse solve took "<< t/1000.0<< "ms" <<std::endl;
// Coarse to Fine
t=-usecond();
// _CoarseOperator.PromoteFromSubspace(_Aggregates,Csol,vec1);
_V.PromoteFromSubspace(Csol,vec1);
add(out,out,vec1);
t+=usecond();
std::cout<<GridLogMessage << "Promote to this level took "<< t/1000.0<< "ms" <<std::endl;
// Residual
_FineOperator.Op(out,vec1); sub(vec1 ,in , vec1);
// std::cout<<GridLogMessage <<"Residual-2 now " <<norm2(vec1)<<std::endl;
// Fine Smoother
t=-usecond();
// vec2=vec1;
vec2=Zero();
_PostSmoother(vec1,vec2);
t+=usecond();
std::cout<<GridLogMessage << "PostSmoother took "<< t/1000.0<< "ms" <<std::endl;
add( out,out,vec2);
std::cout<<GridLogMessage << "Done " <<std::endl;
}
};
int main (int argc, char ** argv)
{
Grid_init(&argc,&argv);
const int Ls=16;
GridCartesian * UGrid = SpaceTimeGrid::makeFourDimGrid(GridDefaultLatt(), GridDefaultSimd(Nd,vComplex::Nsimd()),GridDefaultMpi());
GridRedBlackCartesian * UrbGrid = SpaceTimeGrid::makeFourDimRedBlackGrid(UGrid);
GridCartesian * FGrid = SpaceTimeGrid::makeFiveDimGrid(Ls,UGrid);
GridRedBlackCartesian * FrbGrid = SpaceTimeGrid::makeFiveDimRedBlackGrid(Ls,UGrid);
// Construct a coarsened grid
Coordinate clatt = GridDefaultLatt();
for(int d=0;d<clatt.size();d++){
clatt[d] = clatt[d]/2;
// clatt[d] = clatt[d]/4;
}
GridCartesian *Coarse4d = SpaceTimeGrid::makeFourDimGrid(clatt, GridDefaultSimd(Nd,vComplex::Nsimd()),GridDefaultMpi());;
GridCartesian *Coarse5d = SpaceTimeGrid::makeFiveDimGrid(1,Coarse4d);
std::vector<int> seeds4({1,2,3,4});
std::vector<int> seeds5({5,6,7,8});
std::vector<int> cseeds({5,6,7,8});
GridParallelRNG RNG5(FGrid); RNG5.SeedFixedIntegers(seeds5);
GridParallelRNG RNG4(UGrid); RNG4.SeedFixedIntegers(seeds4);
GridParallelRNG CRNG(Coarse5d);CRNG.SeedFixedIntegers(cseeds);
LatticeFermion src(FGrid); random(RNG5,src);
LatticeFermion result(FGrid); result=Zero();
LatticeFermion ref(FGrid); ref=Zero();
LatticeFermion tmp(FGrid);
LatticeFermion err(FGrid);
LatticeGaugeField Umu(UGrid);
FieldMetaData header;
std::string file("ckpoint_lat.4000");
NerscIO::readConfiguration(Umu,header,file);
RealD mass=0.01;
RealD M5=1.8;
DomainWallFermionD Ddwf(Umu,*FGrid,*FrbGrid,*UGrid,*UrbGrid,mass,M5);
DomainWallFermionD Dpv(Umu,*FGrid,*FrbGrid,*UGrid,*UrbGrid,1.0,M5);
const int nbasis = 60;
const int cb = 0 ;
NextToNearestStencilGeometry5D geom(Coarse5d);
std::cout<<GridLogMessage<<std::endl;
std::cout<<GridLogMessage<<"*******************************************"<<std::endl;
std::cout<<GridLogMessage<<std::endl;
typedef PVdagMLinearOperator<DomainWallFermionD,LatticeFermionD> PVdagM_t;
typedef MdagPVLinearOperator<DomainWallFermionD,LatticeFermionD> MdagPV_t;
typedef ShiftedPVdagMLinearOperator<DomainWallFermionD,LatticeFermionD> ShiftedPVdagM_t;
PVdagM_t PVdagM(Ddwf,Dpv);
MdagPV_t MdagPV(Ddwf,Dpv);
// ShiftedPVdagM_t ShiftedPVdagM(2.0,Ddwf,Dpv); // 355
// ShiftedPVdagM_t ShiftedPVdagM(1.0,Ddwf,Dpv); // 246
// ShiftedPVdagM_t ShiftedPVdagM(0.5,Ddwf,Dpv); // 183
// ShiftedPVdagM_t ShiftedPVdagM(0.25,Ddwf,Dpv); // 145
// ShiftedPVdagM_t ShiftedPVdagM(0.1,Ddwf,Dpv); // 134
// ShiftedPVdagM_t ShiftedPVdagM(0.1,Ddwf,Dpv); // 127 -- NULL space via inverse iteration
// ShiftedPVdagM_t ShiftedPVdagM(0.1,Ddwf,Dpv); // 57 -- NULL space via inverse iteration; 3 iterations
// ShiftedPVdagM_t ShiftedPVdagM(0.25,Ddwf,Dpv); // 57 , tighter inversion
// ShiftedPVdagM_t ShiftedPVdagM(0.25,Ddwf,Dpv); // nbasis 20 -- 49 iters
// ShiftedPVdagM_t ShiftedPVdagM(0.25,Ddwf,Dpv); // nbasis 20 -- 70 iters; asymmetric
// ShiftedPVdagM_t ShiftedPVdagM(0.25,Ddwf,Dpv); // 58; Loosen coarse, tighten fine
// ShiftedPVdagM_t ShiftedPVdagM(0.1,Ddwf,Dpv); // 56 ...
// ShiftedPVdagM_t ShiftedPVdagM(0.1,Ddwf,Dpv); // 51 ... with 24 vecs
// ShiftedPVdagM_t ShiftedPVdagM(0.1,Ddwf,Dpv); // 31 ... with 24 vecs and 2^4 blocking
// ShiftedPVdagM_t ShiftedPVdagM(0.1,Ddwf,Dpv); // 43 ... with 16 vecs and 2^4 blocking, sloppier
// ShiftedPVdagM_t ShiftedPVdagM(0.1,Ddwf,Dpv); // 35 ... with 20 vecs and 2^4 blocking
// ShiftedPVdagM_t ShiftedPVdagM(0.1,Ddwf,Dpv); // 35 ... with 20 vecs and 2^4 blocking, looser coarse
// ShiftedPVdagM_t ShiftedPVdagM(0.1,Ddwf,Dpv); // 64 ... with 20 vecs, Christoph setup, and 2^4 blocking, looser coarse
ShiftedPVdagM_t ShiftedPVdagM(0.01,Ddwf,Dpv); //
// Run power method on HOA??
PowerMethod<LatticeFermion> PM;
PM(PVdagM,src);
PM(MdagPV,src);
// Warning: This routine calls PVdagM.Op, not PVdagM.HermOp
typedef Aggregation<vSpinColourVector,vTComplex,nbasis> Subspace;
Subspace V(Coarse5d,FGrid,cb);
// Subspace U(Coarse5d,FGrid,cb);
// Breeds right singular vectors with call to HermOp
V.CreateSubspaceChebyshev(RNG5,PVdagM,
nbasis,
4000.0,0.003,
300);
// Breeds left singular vectors with call to HermOp
// U.CreateSubspaceChebyshev(RNG5,MdagPV,
// nbasis,
// 4000.0,0.003,
// 300);
// U.subspace=V.subspace;
// typedef Aggregation<vSpinColourVector,vTComplex,2*nbasis> CombinedSubspace;
// CombinedSubspace CombinedUV(Coarse5d,FGrid,cb);
// for(int b=0;b<nbasis;b++){
// CombinedUV.subspace[b] = V.subspace[b];
// CombinedUV.subspace[b+nbasis] = U.subspace[b];
// }
// typedef GeneralCoarsenedMatrix<vSpinColourVector,vTComplex,2*nbasis> LittleDiracOperator;
typedef GeneralCoarsenedMatrix<vSpinColourVector,vTComplex,nbasis> LittleDiracOperator;
typedef LittleDiracOperator::CoarseVector CoarseVector;
LittleDiracOperator LittleDiracOpPV(geom,FGrid,Coarse5d);
LittleDiracOpPV.CoarsenOperator(PVdagM,V,V);
std::cout<<GridLogMessage<<std::endl;
std::cout<<GridLogMessage<<"*******************************************"<<std::endl;
std::cout<<GridLogMessage<<std::endl;
std::cout<<GridLogMessage<<"Testing coarsened operator "<<std::endl;
CoarseVector c_src (Coarse5d);
CoarseVector c_res (Coarse5d);
CoarseVector c_proj(Coarse5d);
Complex one(1.0);
c_src = one; // 1 in every element for vector 1.
// blockPromote(c_src,err,CoarseToFine.subspace);
LatticeFermion prom(FGrid);
prom=Zero();
for(int b=0;b<nbasis;b++){
prom=prom+V.subspace[b];
}
std::cout<<GridLogMessage<<"c_src "<<norm2(c_src)<<std::endl;
std::cout<<GridLogMessage<<"prom "<<norm2(prom)<<std::endl;
PVdagM.Op(prom,tmp);
blockProject(c_proj,tmp,V.subspace);
std::cout<<GridLogMessage<<" Called Big Dirac Op "<<norm2(tmp)<<std::endl;
LittleDiracOpPV.M(c_src,c_res);
std::cout<<GridLogMessage<<" Called Little Dirac Op c_src "<< norm2(c_src) << " c_res "<< norm2(c_res) <<std::endl;
std::cout<<GridLogMessage<<"Little dop : "<<norm2(c_res)<<std::endl;
std::cout<<GridLogMessage<<"Big dop in subspace : "<<norm2(c_proj)<<std::endl;
c_proj = c_proj - c_res;
std::cout<<GridLogMessage<<" ldop error: "<<norm2(c_proj)<<std::endl;
/**********
* Some solvers
**********
*/
///////////////////////////////////////
// Coarse grid solver test
///////////////////////////////////////
std::cout<<GridLogMessage<<"******************* "<<std::endl;
std::cout<<GridLogMessage<<" Coarse Grid Solve -- Level 3 "<<std::endl;
std::cout<<GridLogMessage<<"******************* "<<std::endl;
TrivialPrecon<CoarseVector> simple;
NonHermitianLinearOperator<LittleDiracOperator,CoarseVector> LinOpCoarse(LittleDiracOpPV);
// PrecGeneralisedConjugateResidualNonHermitian<CoarseVector> L2PGCR(1.0e-4, 100, LinOpCoarse,simple,10,10);
PrecGeneralisedConjugateResidualNonHermitian<CoarseVector> L3PGCR(1.0e-4, 10, LinOpCoarse,simple,20,20);
L3PGCR.Level(3);
c_res=Zero();
L3PGCR(c_src,c_res);
////////////////////////////////////////
// Fine grid smoother
////////////////////////////////////////
std::cout<<GridLogMessage<<"******************* "<<std::endl;
std::cout<<GridLogMessage<<" Fine Grid Smoother -- Level 2 "<<std::endl;
std::cout<<GridLogMessage<<"******************* "<<std::endl;
TrivialPrecon<LatticeFermionD> simple_fine;
// NonHermitianLinearOperator<PVdagM_t,LatticeFermionD> LinOpSmooth(PVdagM);
PrecGeneralisedConjugateResidualNonHermitian<LatticeFermionD> SmootherGCR(0.01,1,ShiftedPVdagM,simple_fine,16,16);
SmootherGCR.Level(2);
LatticeFermionD f_src(FGrid);
LatticeFermionD f_res(FGrid);
f_src = one; // 1 in every element for vector 1.
f_res=Zero();
SmootherGCR(f_src,f_res);
// typedef MGPreconditionerSVD<vSpinColourVector, vTComplex,nbasis*2> TwoLevelMG;
typedef MGPreconditionerSVD<vSpinColourVector, vTComplex,nbasis> TwoLevelMG;
// TwoLevelMG TwoLevelPrecon(CombinedUV,CombinedUV,
TwoLevelMG TwoLevelPrecon(V,V,
PVdagM,
simple_fine,
SmootherGCR,
LinOpCoarse,
L3PGCR);
PrecGeneralisedConjugateResidualNonHermitian<LatticeFermion> L1PGCR(1.0e-8,1000,PVdagM,TwoLevelPrecon,16,16);
L1PGCR.Level(1);
f_res=Zero();
L1PGCR(f_src,f_res);
std::cout<<GridLogMessage<<std::endl;
std::cout<<GridLogMessage<<"*******************************************"<<std::endl;
std::cout<<GridLogMessage<<std::endl;
std::cout<<GridLogMessage << "Done "<< std::endl;
Grid_finalize();
return 0;
}

Some files were not shown because too many files have changed in this diff Show More