portelli/Grid - Grid - DiRAC Tursa git server

mirror of https://github.com/paboyle/Grid.git synced 2026-05-27 12:34:16 +01:00

Commit Graph

Author

SHA1

Message

Date

Peter Boyle

5a9056cd93

Accelerator: lower default accelerator_threads from 16 to 8

Benchmark_dwf_fp32 on MI250X GCD: 1.7 TF/s at nt=8, ~300 GF/s at nt=16.
With Nsimd=8 (fp32, GEN_SIMD_WIDTH=64B), nt=8 gives exactly 64 threads =
one full AMD wavefront. Higher values double register demand per block and
hit a register-pressure cliff for stencil kernels.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-05-21 12:34:30 -04:00
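
The thread-count arithmetic in the commit message above can be checked with a short sketch. This is not code from Grid: the helper names `nsimd`, `threads_per_block`, and `kWavefront` are hypothetical, and the sketch assumes that each SIMD lane holds one complex number (so a 64-byte vector of single-precision complex values has 8 lanes) and that a block's thread count is `nt` multiplied by the lane count.

```cpp
#include <cstddef>

// Hypothetical helper (not a Grid API): number of SIMD lanes, assuming each
// lane holds one complex number made of two Real components.
template <typename Real>
constexpr std::size_t nsimd(std::size_t simd_width_bytes) {
    return simd_width_bytes / (2 * sizeof(Real));
}

// Hypothetical helper: threads per block, assuming each of the nt accelerator
// threads is expanded across all SIMD lanes.
constexpr std::size_t threads_per_block(std::size_t nt, std::size_t lanes) {
    return nt * lanes;
}

// Wavefront size quoted in the commit message for the MI250X.
constexpr std::size_t kWavefront = 64;

// fp32 with a 64-byte SIMD width gives Nsimd = 8, as stated in the message.
static_assert(nsimd<float>(64) == 8, "fp32, 64B width: 8 complex lanes");

// nt = 8 fills exactly one wavefront.
static_assert(threads_per_block(8, nsimd<float>(64)) == kWavefront,
              "nt=8 should give one full wavefront");

// nt = 16 doubles the block to two wavefronts, which is the doubling of
// per-block register demand that the message refers to.
static_assert(threads_per_block(16, nsimd<float>(64)) == 2 * kWavefront,
              "nt=16 should give two wavefronts");
```

Under these assumptions the checks run at compile time, so the file compiling cleanly is the confirmation that nt=8 maps to 64 threads and nt=16 to 128.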

Peter Boyle

5c4574f9aa

skills: add gpu-memory-performance.md

Documents the acceleratorThreads() default=2 trap, LambdaApply thread
mapping, coalescedRead/Write idiom, when to use __global__ vs
accelerator_for, and fused vs staged HBM access patterns.

Includes observed MI250X numbers from LatticePropagatorD reduction
(50 → 297 → 546 GB/s progression).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-05-21 12:34:30 -04:00
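
To put the bandwidth progression quoted above in relative terms, the following sketch computes the step-by-step and overall improvement factors. The function `speedup` is a hypothetical helper; only the three GB/s figures come from the commit message, and the sketch makes no claim about which optimisation produced which step.

```cpp
// Hypothetical helper: ratio of a later bandwidth figure to an earlier one.
constexpr double speedup(double before_gbs, double after_gbs) {
    return after_gbs / before_gbs;
}

// Bandwidth figures (GB/s) quoted in the commit message for the
// LatticePropagatorD reduction on MI250X.
constexpr double kStage0 = 50.0;
constexpr double kStage1 = 297.0;
constexpr double kStage2 = 546.0;

constexpr double kFirstStep  = speedup(kStage0, kStage1);  // about 5.9x
constexpr double kSecondStep = speedup(kStage1, kStage2);  // about 1.8x
constexpr double kOverall    = speedup(kStage0, kStage2);  // about 10.9x
```

The first step accounts for most of the gain (roughly a factor of six), with the second step adding a little under a factor of two on top.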

2 Commits