Documents the acceleratorThreads() default=2 trap, LambdaApply thread
mapping, coalescedRead/Write idiom, when to use __global__ vs
accelerator_for, and fused vs staged HBM access patterns.
Includes observed MI250X bandwidth numbers from a LatticePropagatorD
reduction (50 → 297 → 546 GB/s progression).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>