skills: HPC battle-hardening skill files for GPU+MPI correctness

Six skill files encoding expertise for making codebases robust on problematic HPC systems, covering: correctness verification (double-run, fingerprinting, flight recorder), hang diagnosis, GPU runtime correctness (premature barrier, infinite poll), MPI correctness on heterogeneous systems (device buffer aliasing, AARCH64 PLT corruption, deterministic reductions), compiler validation, and communication/computation overlap pipeline design. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-07-18 08:03:27 +01:00 · 2026-05-18 12:10:44 -04:00
parent c0472aa0ec
commit c93b338bdd
6 changed files with 859 additions and 0 deletions
@@ -0,0 +1,196 @@
 ---
 name: communication-overlap
 description: Design and implement communication/computation overlap pipelines for GPU+MPI codes — per-packet event tracking, host-staging through pinned memory, internode/intranode bandwidth separation, and the 7-phase pipeline pattern that replaces broken accelerator-aware MPI paths.
 user-invocable: true
 allowed-tools:
  - Read
  - Bash(grep -r)
 ---
 # Communication/Computation Overlap Pipeline Design
 ## Why GPU-Direct MPI Is Often Not the Right Default
 GPU-direct RDMA (passing GPU buffer pointers directly to MPI) is appealing because it eliminates explicit D2H/H2D copies. In practice on several leadership systems:
 - **Bandwidth**: RDMA at 30% of wirespeed has been observed on Pontevecchio/Aurora. The overhead of staging through pinned host memory can be *lower* total latency than slow RDMA.
 - **Correctness**: Device buffer aliasing in `MPI_Sendrecv` (see `mpi-heterogeneous.md`) makes direct GPU-to-GPU transfer unreliable.
 - **Overlap**: Host-staging enables fine-grained overlap — each packet's D2H can be issued as a separate asynchronous event, and the corresponding MPI send can fire as soon as *that packet* arrives in host memory, not after all packets are ready.
 The pipeline pattern below was developed to replace broken MPICH accelerator-aware paths. It achieves genuine computation/communication overlap by tracking per-packet GPU events.
 ## The 7-Phase Pipeline
 Given a set of halo exchange operations (each identified by a `packet_index`):
 ### Phase 0: Prepare data on device
 Pack halo data into contiguous GPU buffers. One buffer per direction/neighbour.
 ### Phase 1: Post receives + start D2H
 Post all `MPI_Irecv` calls immediately (into pinned host buffers). Simultaneously, start asynchronous D2H copies for all send buffers:
 ```cpp
 for (auto &pkt : send_packets) {
    MPI_Irecv(pkt.host_recv_buf, pkt.bytes, MPI_BYTE,
               pkt.src_rank, pkt.tag, comm, &pkt.recv_req);
    acceleratorCopyFromDeviceAsync(pkt.device_send_buf,
                                   pkt.host_send_buf,
                                   pkt.bytes, &pkt.d2h_event);
 }
 ```
 The key: `pkt.d2h_event` is a per-packet GPU event (e.g. `cudaEvent_t`, `hipEvent_t`, or SYCL event). We can poll individual packet completion rather than waiting for all.
 ### Phase 2: Fire sends as D2H completes (packet by packet)
 Poll packet D2H events. As each packet becomes ready in host memory, immediately fire the corresponding `MPI_Isend`. Also start intranode D2D copies at this point — these are deferred until now to avoid competing with the internode D2H on PCIe bandwidth:
 ```cpp
 bool all_sent = false;
 while (!all_sent) {
    all_sent = true;
    for (auto &pkt : send_packets) {
        if (!pkt.sent && acceleratorEventIsComplete(pkt.d2h_event)) {
            MPI_Isend(pkt.host_send_buf, pkt.bytes, MPI_BYTE,
                       pkt.dst_rank, pkt.tag, comm, &pkt.send_req);
            pkt.sent = true;
            start_intranode_copy(pkt);  // now safe, D2H is done
        }
        if (!pkt.sent) all_sent = false;
    }
 }
 ```
 ### Phase 3: Poll receives + start H2D as each arrives
 `MPI_Test` individual receive requests. As each completes, immediately start the H2D copy into device-resident halo buffer:
 ```cpp
 bool all_recvd = false;
 while (!all_recvd) {
    all_recvd = true;
    for (auto &pkt : recv_packets) {
        if (!pkt.h2d_started) {
            int flag = 0;
            MPI_Test(&pkt.recv_req, &flag, MPI_STATUS_IGNORE);
            if (flag) {
                acceleratorCopyToDeviceAsync(pkt.host_recv_buf,
                                              pkt.device_recv_buf,
                                              pkt.bytes, &pkt.h2d_event);
                pkt.h2d_started = true;
            }
        }
        if (!pkt.h2d_started) all_recvd = false;
    }
 }
 ```
 ### Phase 4: Wait for all sends
 ```cpp
 std::vector<MPI_Request> send_reqs;
 for (auto &pkt : send_packets) send_reqs.push_back(pkt.send_req);
 MPI_Waitall(send_reqs.size(), send_reqs.data(), MPI_STATUSES_IGNORE);
 ```
 ### Phase 5: Wait for all H2D copies
 ```cpp
 for (auto &pkt : recv_packets) acceleratorEventWait(pkt.h2d_event);
 ```
 ### Phase 6: Run interior computation
 The interior (non-halo) computation can run from Phase 1 onwards, overlapped with all of the above:
 ```cpp
 // Launched in Phase 1, runs in parallel with the pipeline
 accelerator_for(ss, interior_sites, ...) { compute_interior(ss); }
 ```
 Synchronise with interior before using the full field:
 ```cpp
 accelerator_barrier();  // interior kernel done
 // Halo H2D is also complete (Phase 5 above)
 // Now safe to use full field
 ```
 ## Per-Packet Event Tracking Data Structure
 ```cpp
 struct Packet {
    // Buffers
    void *device_send_buf;
    void *host_send_buf;    // pinned
    void *device_recv_buf;
    void *host_recv_buf;    // pinned
    size_t bytes;
    // MPI
    int src_rank, dst_rank, tag;
    MPI_Request send_req, recv_req;
    // GPU events (one per packet, not one global barrier)
    AcceleratorEvent d2h_event;
    AcceleratorEvent h2d_event;
    // State flags
    bool sent = false;
    bool h2d_started = false;
 };
 ```
 The critical design point: `d2h_event` and `h2d_event` are **per-packet**, not global. This allows the MPI send for packet 0 to fire while packet 1's D2H is still in progress.
 ## Internode vs Intranode Separation
 PCIe (GPU-to-CPU) and NVLink/xGMI (GPU-to-GPU within a node) are separate bandwidth resources. They do not compete with each other, but they *do* compete with each other for transactions if both are active simultaneously.
 Strategy: complete all internode D2H copies first (to maximise NIC injection bandwidth), then start intranode D2D copies (which use NVLink/xGMI and do not contend with PCIe for internode traffic):
 ```cpp
 // In Phase 2: start intranode D2D only after D2H is confirmed complete
 if (pkt.is_intranode && pkt.d2h_done) {
    // Use peer access (cudaMemcpyPeerAsync / hipMemcpyPeerAsync)
    // rather than staging through host for intranode
    cudaMemcpyPeerAsync(pkt.peer_recv_buf, pkt.dst_device,
                         pkt.device_send_buf, pkt.src_device,
                         pkt.bytes, computeStream);
 }
 ```
 Grid reference: `Grid/communicator/Communicator_mpi3.cc` — search for `NVLINK_GET` and `ACCELERATOR_AWARE_MPI` conditional blocks.
 ## Pinned Memory Allocation
 All host staging buffers must be pinned (page-locked) for async D2H/H2D:
 ```cpp
 // CUDA
 cudaMallocHost(&host_buf, bytes);
 cudaFreeHost(host_buf);
 // HIP  
 hipHostMalloc(&host_buf, bytes, hipHostMallocDefault);
 hipHostFree(host_buf);
 // SYCL
 host_buf = sycl::malloc_host(bytes, *queue);
 sycl::free(host_buf, *queue);
 ```
 Pre-allocate at startup. Repeated `cudaMallocHost` in the hot path adds latency from the OS memory manager.
 ## Checksumming in the Pipeline
 Insert checksum computation before D2H (on the GPU-resident data) and verification after H2D (on the received GPU-resident data). See `correctness-verification.md` for the checksum pattern. The salting (`packet_index + 1000 * tag`) detects packet transposition — critical for diagnosing MPI buffer aliasing bugs where two packets' contents are swapped.
 ## Smoke Test for a New System
 Before running physics, validate the pipeline on a synthetic benchmark:
 ```cpp
 // Send a buffer of known values, receive and check
 // Run at multiple message sizes: 4KB, 64KB, 1MB, 16MB
 // Run at multiple process counts: 2, 8, 64, 512
 // Verify checksums on every packet
 // Measure bandwidth: should be ≥ 80% of FDR/HDR/NDR peak for host-staged
 ```
 Any bandwidth below 50% of theoretical, or any checksum failure, indicates a problem in the communication stack that must be resolved before production runs.
@@ -0,0 +1,154 @@
 ---
 name: compiler-validation
 description: Identify GPU compiler code generation bugs, distinguish them from hardware and runtime bugs, construct minimal reproducers, and validate correctness of generated assembly for performance-critical HPC kernels.
 user-invocable: true
 allowed-tools:
  - Read
  - Bash(grep -r)
  - Bash(objdump)
 ---
 # Compiler Validation for GPU HPC Codes
 ## Why Compiler Bugs Are Distinct
 Compiler bugs have a unique diagnostic signature: they produce *deterministically wrong* results. The same input always produces the same wrong output. This distinguishes them from:
 - Hardware bugs: usually stochastic (wrong answer sometimes, correct answer other times)
 - Runtime bugs (premature barrier, buffer aliasing): often stochastic or history-dependent
 - Race conditions: non-deterministic
 **The determinism test**: run the same kernel 100 times with the same input. If the wrong answer is always the same wrong answer, suspect the compiler.
 ## The Minimal Reproducer Protocol
 When a kernel produces wrong results, isolate the compiler as quickly as possible:
 **Step 1: Eliminate the physics**. Reduce the failing kernel to the smallest possible computation that still exhibits the bug. Replace QCD fields with `double` arrays. Replace lattice operations with scalar arithmetic. The goal is a 20-line CUDA/HIP/SYCL file that any compiler engineer can compile and run.
 **Step 2: Binary search over optimisation levels**. Compile at `-O0` (or equivalent). If the answer becomes correct, the bug is in an optimisation pass. Then test `-O1`, `-O2`, `-O3` individually to find which optimisation level introduces the bug.
 ```bash
 # HIP example
 hipcc -O0 minimal_repro.cc -o test_O0 && ./test_O0   # should be correct
 hipcc -O1 minimal_repro.cc -o test_O1 && ./test_O1   # compare
 hipcc -O2 minimal_repro.cc -o test_O2 && ./test_O2   # compare
 ```
 **Step 3: Identify the optimisation pass**. For LLVM-based compilers (clang, hipcc, dpcpp, nvcc via ptxas):
 ```bash
 # Disable individual optimisation passes:
 hipcc -O2 -mllvm -disable-loop-unrolling minimal_repro.cc -o test
 hipcc -O2 -fno-vectorize minimal_repro.cc -o test
 hipcc -O2 -fno-slp-vectorize minimal_repro.cc -o test
 ```
 **Step 4: Inspect the generated code**. For CUDA/HIP, use `--generate-line-info` and `cuobjdump` or `roc-obj-extract` to get annotated assembly:
 ```bash
 # CUDA
 nvcc -O2 --generate-line-info --keep minimal_repro.cu
 cuobjdump --dump-ptx minimal_repro.o
 # HIP/ROCm
 hipcc -O2 --save-temps minimal_repro.cc
 llvm-objdump -d minimal_repro.o
 # SYCL/DPC++
 icpx -O2 -fsycl -Xclang -ast-dump minimal_repro.cc 2>&1 | grep -A5 "suspicious_expr"
 ```
 Look for: incorrect register spill/fill sequences, loop trip count miscalculation, vectorisation across iteration boundaries, incorrect address arithmetic.
 ## Known Compiler Bug Patterns in GPU Code
 ### Register Pressure / Spill Bugs
 High register usage forces spills to local memory. Some compiler versions generate incorrect spill/fill code — the value is written to local memory but a stale register value is read back instead of the spilled value.
 **Signature**: Wrong answer with high-register-count kernels; becomes correct when `--maxrregcount=N` forces lower register count (more spilling) or higher (`--maxrregcount=256`, fewer spills).
 **Diagnostic**: Check register usage:
 ```bash
 nvcc -O2 --ptxas-options=-v minimal_repro.cu 2>&1 | grep "registers"
 hipcc -O2 --offload-arch=gfx90a --save-temps minimal_repro.cc
 llvm-mc --arch=amdgcn minimal_repro.s 2>&1 | grep "VGPRs"
 ```
 ### Vectorisation Across Loop Boundaries
 The compiler vectorises two successive loop iterations as a SIMD unit when they have a data dependency that the compiler has incorrectly determined does not exist.
 **Signature**: Wrong answer that becomes correct when the loop body is extracted to a non-inlined function (disabling auto-vectorisation across iterations).
 ### Incorrect Constant Propagation
 The compiler evaluates a compile-time expression incorrectly, substituting a wrong constant. Common in template-heavy code where `sizeof(T)` or `alignof(T)` is used in arithmetic that the compiler folds at compile time.
 **Signature**: Wrong array index or wrong stride. Inspecting the generated assembly shows a literal constant where you expect a computed value.
 ## Stress Patterns for Compiler Validation
 These patterns exercise the compiler in ways that commonly expose bugs:
 ```cpp
 // 1. Aliased pointer write followed by immediate read
 // (tests correct handling of write-after-write in register allocation)
 __global__ void alias_stress(double *a, double *b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        a[i] = a[i] * 2.0;
        b[i] = a[i] + 1.0;  // must read the updated value, not the original
    }
 }
 // 2. Mixed-precision accumulation
 // (tests correct type promotion in FMA sequences)
 __global__ void precision_stress(float *in, double *out, int n) {
    double acc = 0.0;
    for (int i = 0; i < n; i++) acc += (double)in[i];
    *out = acc;
 }
 // 3. Large struct in shared memory
 // (tests alignment and offset calculation for non-power-of-2-sized objects)
 struct S { double x[3]; };  // sizeof = 24 bytes, not a power of 2
 __global__ void struct_stress(S *in, S *out, int n) {
    extern __shared__ S smem[];
    int tid = threadIdx.x;
    smem[tid] = in[tid];
    __syncthreads();
    out[tid] = smem[(tid + 1) % blockDim.x];
 }
 ```
 ## Separating Compiler from Runtime/Hardware
 When results are deterministically wrong:
 | Test | Compiler bug | Runtime/hardware bug |
 |---|---|---|
 | Recompile at -O0 | Fixes it | No effect |
 | Run on CPU (host code equivalent) | Fixes it | No effect |
 | Reorder loop iterations | Changes wrong answer | No effect or different pattern |
 | Different compiler version | Fixes or changes wrong answer | No effect |
 | Different GPU of same model | Same wrong answer | Different or no error |
 | Different GPU model | Fixes it (ISA-specific codegen bug) | May or may not fix |
 ## Reporting to Compiler Teams
 A compiler bug report needs:
 1. Minimal reproducer (< 50 lines)
 2. Compiler version (`hipcc --version`, `nvcc --version`, `icpx --version`)
 3. GPU model and driver version
 4. Exact wrong and correct answers (hexfloat for reproducibility)
 5. Which compile flags change the behaviour
 6. Generated assembly for the correct and incorrect variants
 File with: LLVM Bugzilla (for hipcc/clang/dpcpp backends), NVIDIA bug portal (nvcc/ptxas), or vendor-specific developer forum. The minimal reproducer is the single most important element — without it, compiler teams cannot prioritise.
 ## Pragmatic In-Production Workaround
 When a compiler bug is confirmed but the fix is not yet available, the lowest-risk workaround is to mark the affected function with reduced optimisation:
 ```cpp
 #pragma clang optimize off  // clang/hipcc/dpcpp
 void __attribute__((optimize("O0"))) affected_kernel_host_wrapper() { ... }
 // For device code, use per-file compilation flags via CMake/Makefile
 ```
 Document the workaround with a comment referencing the compiler bug report number so it can be removed when the compiler is updated.
@@ -0,0 +1,169 @@
 ---
 name: correctness-verification
 description: Implement application-level correctness verification for HPC codes on unreliable hardware — double-run pattern, deterministic reductions, per-packet checksums, and flight recorder step logging.
 user-invocable: true
 allowed-tools:
  - Read
  - Bash(grep -r)
 ---
 # Correctness Verification Infrastructure for HPC Codes
 ## The Problem
 Leadership computing facilities sometimes have hardware or firmware bugs below the level visible to application code. The accelerator runtime can return from `q.wait()` or `cudaDeviceSynchronize()` before work is actually complete, or silently produce wrong answers in DMA transfers. Standard testing does not catch these because they are non-deterministic and often topology-dependent (fail only at specific process counts or on specific node configurations).
 The symptoms look like numerical instabilities, random MPI hangs, or wrong physics results — not like crashes. Without deliberate infrastructure, diagnosing root cause takes months.
 ## The Double-Run Pattern
 The most reliable correctness check for non-deterministic hardware bugs is to run every computation twice and compare bit-identical fingerprints.
 **Key constraint**: the second run must use a *deterministic* code path. Non-deterministic floating-point ordering (e.g. from MPI_Allreduce with different reduction trees on retry) produces false mismatches. See `mpi-heterogeneous.md` for how to make reductions deterministic.
 ```cpp
 // Pseudocode: double-run a step and compare CRC fingerprints
 void run_step_verified(State &state) {
    state.save_checkpoint();
    uint64_t crc_a = run_step_and_fingerprint(state);
    state.restore_checkpoint();
    uint64_t crc_b = run_step_and_fingerprint(state);
    if (crc_a != crc_b) {
        report_mismatch("step", crc_a, crc_b);
        // Policy: abort, retry from checkpoint, or continue with alarm
    }
 }
 ```
 **Fingerprinting**: XOR-fold a CRC32 over all floating-point data after each step. XOR is order-independent, so it works across distributed nodes without communication. For field data:
 ```cpp
 uint64_t fingerprint(const double *data, size_t n) {
    uint64_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t bits;
        memcpy(&bits, &data[i], sizeof(bits));
        acc ^= crc32(bits);
    }
    return acc;
 }
 ```
 On GPU, compute the XOR reduction on-device (avoids D2H transfer of the full field):
 ```cpp
 // SYCL
 uint64_t svm_xor(uint64_t *vec, uint64_t L) {
    uint64_t ret = 0;
    { sycl::buffer<uint64_t,1> abuff(&ret, {1});
      theGridAccelerator->submit([&](sycl::handler &cgh) {
        auto R = sycl::reduction(abuff, cgh, uint64_t(0), std::bit_xor<>());
        cgh.parallel_for(sycl::range<1>{L}, R,
          [=](sycl::id<1> i, auto &sum) { sum ^= vec[i]; });
      }); }
    theGridAccelerator->wait();
    return ret;
 }
 ```
 ## Per-Packet Communication Checksums
 Silent data corruption in MPI buffers (documented in MPICH with device-resident buffers; see `mpi-heterogeneous.md`) requires per-packet verification, not just end-to-end. The pattern:
 1. Before packing a send buffer, compute a GPU-side checksum of the payload.
 2. Append the checksum to the host staging buffer alongside the data.
 3. After receiving and copying to device, recompute the checksum on-device and compare.
 Salt each checksum with `packet_index + 1000 * mpi_tag` to detect transposition (packet A landing in packet B's slot):
 ```cpp
 uint64_t salt = (uint64_t)packet_index + 1000ULL * mpi_tag;
 checksum_send = checksum_gpu(payload_gpu, payload_words) ^ salt;
 // ... transmit payload + checksum_send ...
 checksum_recv = checksum_gpu(payload_gpu_recv, payload_words) ^ salt;
 assert(checksum_recv == checksum_send);
 ```
 Grid reference: `Grid/communicator/Communicator_mpi3.cc`, `#ifdef GRID_CHECKSUM_COMMS`.
 ## Flight Recorder: Step-Level Logging
 Maintain a monotonic counter that names the current operation. On a hang, this is the only way to know *which* operation the process is stuck in without a debugger.
 ```cpp
 struct FlightRecorder {
    std::atomic<uint64_t> step_counter{0};
    const char *step_name = "init";
    void step_log(const char *name) {
        step_name = name;
        step_counter.fetch_add(1, std::memory_order_relaxed);
    }
 };
 extern FlightRecorder gRecorder;
 ```
 In Record mode, also store floating-point norms and communication checksums to vectors. In Verify mode, compare against stored values:
 ```cpp
 void norm_log(double val) {
    if (mode == Record) norm_log_vec.push_back(val);
    if (mode == Verify) {
        double expected = norm_log_vec[norm_counter];
        if (val != expected) {  // bit-exact for deterministic paths
            std::cerr << "MISMATCH at step " << step_counter
                      << " (" << step_name << "): "
                      << std::hexfloat << val << " vs " << expected << "\n";
            print_backtrace();
        }
        norm_counter++;
    }
 }
 ```
 Grid reference: `Grid/util/FlightRecorder.h`, `Grid/util/FlightRecorder.cc`.
 ## Signal Handler for Hang Detection
 Install a SIGHUP handler that dumps the current flight recorder state. This is async-safe only if the handler writes to a pre-allocated buffer using `write()` (not `printf`):
 ```cpp
 static char hang_buf[4096];
 static void sighup_handler(int) {
    int n = snprintf(hang_buf, sizeof(hang_buf),
        "rank=%d step=%llu name=%s\n",
        mpi_rank,
        (unsigned long long)gRecorder.step_counter.load(),
        gRecorder.step_name);
    write(STDERR_FILENO, hang_buf, n);
    // Optional: call backtrace_symbols_fd (async-safe on Linux)
    void *frames[64];
    int depth = backtrace(frames, 64);
    backtrace_symbols_fd(frames, depth, STDERR_FILENO);
 }
 // In main():
 signal(SIGHUP, sighup_handler);
 ```
 To diagnose a hang across all ranks: `kill -HUP $(pgrep my_app)` or via job scheduler.
 ## What to Verify at Each Step
 | Data type | Fingerprint method | Frequency |
 |---|---|---|
 | Lattice fields | XOR of CRC32 over float64 words | Every algorithmic step |
 | Communication buffers | GPU XOR reduction, salted | Every MPI operation |
 | Scalar reductions | Bit-exact match of double | Every GlobalSum |
 | Iteration counters | Exact integer match | Every solver iteration |
 ## When to Abort vs Continue
 - **Abort immediately**: communication checksum mismatch (data is corrupt, continuing will silently propagate errors).
 - **Log and continue**: norm mismatch in Verify mode if you need to map out which operations are unreliable.
 - **Retry from checkpoint**: double-run mismatch when the underlying bug is non-deterministic (second retry will usually pass).
 Track the mismatch rate over a production run. A rate above ~1/1000 steps indicates a systemic hardware issue that should be escalated to the facility.
@@ -0,0 +1,101 @@
 ---
 name: gpu-runtime-correctness
 description: Detect and work around GPU runtime correctness failures — premature completion signalling, infinite poll hangs, stale completion flags, and the double-wait diagnostic pattern. Covers CUDA, HIP/ROCm, and SYCL/Level Zero runtimes.
 user-invocable: true
 allowed-tools:
  - Read
  - Bash(grep -r)
 ---
 # GPU Runtime Correctness
 ## The Completion Signalling Problem
 GPU runtimes expose a synchronisation primitive — `cudaDeviceSynchronize()`, `hipDeviceSynchronize()`, `q.wait()` — that is supposed to block until all previously submitted GPU work is complete. On several production systems, this guarantee has been violated in two distinct ways:
 ### Failure Mode A: Premature Return
 The wait returns before the GPU work is done. The subsequent CPU code reads stale data from the output buffer. This is the most dangerous failure because it looks like a numerical instability, not a crash. Results are wrong but the program exits normally.
 **Identifying Premature Return**: Insert a second, independent wait immediately after the first. If a second `q.wait()` "fixes" incorrect results that appeared with a single `q.wait()`, the first wait was returning prematurely.
 ```cpp
 // Diagnostic version — if this stabilises results, you have premature return
 accelerator_barrier();  // first wait
 accelerator_barrier();  // second wait (diagnostic)
 ```
 Production fix: submit a trivially cheap no-op kernel after the real work and wait for it. The no-op kernel cannot complete until all previous commands in the queue are done (command queue ordering guarantee), so waiting for the no-op is a stronger barrier than waiting for the queue itself:
 ```cpp
 // Lightweight fence kernel
 template<class T>
 __global__ void noop_kernel(T *p) { if (threadIdx.x == 0) (void)(*p); }
 void strong_barrier(T *device_ptr) {
    noop_kernel<<<1, 1, 0, computeStream>>>(device_ptr);
    cudaStreamSynchronize(computeStream);  // wait for the no-op
 }
 ```
 ### Failure Mode B: Infinite Poll
 The wait enters a polling loop that never terminates. The process consumes 100% CPU in a runtime library. The GPU has either stopped signalling progress entirely, or the completion flag is in a memory region that has become incoherent.
 This is distinct from Failure Mode A: with premature return the CPU proceeds; with infinite poll the CPU is stuck.
 **Identifying Infinite Poll**: `top` shows the MPI rank at 100% CPU. `perf top -p PID` or `strace -p PID` shows the process burning cycles inside the GPU runtime library (e.g. `libze_intel_gpu.so`, `libamdhip64.so`).
 **Documented instances**: 
 - Intel Level Zero on Pontevecchio (Aurora): both premature return *and* infinite poll have been observed as independent bugs on the same system.
 - The two failure modes can co-exist and have overlapping symptoms at the application level.
 ## Completion Signalling Architecture
 Understanding why these bugs happen requires knowing how completion signalling works:
 ```
 GPU command processor
  → signals completion by writing to a host-visible memory address
  → CPU runtime polls that address (or uses OS event notification via ioctl)
 ```
 A premature return means the memory write happened before the actual work completed (e.g. the signal is on a different command stream that has not been serialised with the work stream). An infinite poll means the memory write never happens (hardware or driver bug preventing the signal from being written).
 **Implication**: `accelerator_barrier()` is not an unconditional correctness guarantee on all production systems. Application-level verification (double-run, checksums) is necessary as a second line of defence.
 ## The Double-Wait Pattern in Practice
 The double-wait is a pragmatic workaround when premature return is suspected but not yet confirmed. It adds latency but does not change correctness if the barrier is working properly, so it is safe to enable in production:
 ```cpp
 #ifdef WORKAROUND_PREMATURE_BARRIER
 #define accelerator_barrier() do { \
    real_accelerator_barrier();    \
    real_accelerator_barrier();    \
 } while(0)
 #endif
 ```
 Monitor whether this changes observed behaviour. If double-wait eliminates wrong answers, you have confirmed premature return. If it does not help but inserting a no-op kernel does, the issue is with the wait primitive specifically, not with the underlying completion signal.
 ## SYCL/Level Zero Specifics
 Level Zero (the backend for Intel GPU runtimes) separates command submission from synchronisation. A `q.wait()` should wait for all previously submitted command lists to retire. Documented bugs include:
 - `q.wait()` returning before the associated fence in Level Zero has been signalled.
 - `q.wait()` entering an `ioctl(i915, I915_GEM_WAIT)` call that never returns (kernel driver bug, not runtime bug).
 The latter requires a node reboot and cannot be worked around in application code. Detect it by checking process state (`D` in `ps aux`) and the kernel function via `/proc/PID/wchan`.
 ## Stream Ordering and Compute Streams
 All GPU work must be submitted to the *same* stream/queue if you rely on in-order execution guarantees. Mixing default stream and non-default streams invalidates ordering assumptions on some backends.
 Grid uses `computeStream` (CUDA/HIP) or `theGridAccelerator` (SYCL) consistently throughout. If mixing Grid with third-party GPU code, ensure the third-party code is directed to the same stream, or insert explicit inter-stream barriers.
 ## Checklist for New GPU Code
 1. Every kernel launch is followed by an `accelerator_barrier()` before reading device-side output on the host.
 2. All device-to-host copies use an explicit stream synchronisation after the copy, not before.
 3. If results are non-deterministic across runs, insert a second barrier and observe whether reproducibility improves.
 4. For correctness-critical operations (reductions that will be compared against reference values), add the double-run checksum test from `correctness-verification.md`.
 5. If the process hangs at 100% CPU in a runtime library function, this is a driver/runtime bug — there is no application-level fix beyond scheduling a node reboot.
@@ -0,0 +1,102 @@
 ---
 name: hang-diagnosis
 description: Diagnose and isolate process hangs on HPC systems — distinguishing kernel-level ioctl hangs, infinite poll loops, collective deadlocks, and GPU completion signalling failures using async-safe signal handlers and flight recorder step counters.
 user-invocable: true
 allowed-tools:
  - Read
  - Bash(grep -r)
  - Bash(strace)
  - Bash(gdb)
 ---
 # Hang Diagnosis on HPC Systems
 ## Taxonomy of Hangs
 Not all hangs are the same. Misidentifying the type leads to wrong mitigation. The four distinct classes encountered on production leadership systems:
 ### 1. Kernel-level ioctl hang (never returns)
 The process is in `D` (uninterruptible sleep) state. `strace` shows it blocked in an `ioctl` syscall. The GPU device driver has entered an unrecoverable state.
 **Diagnosis**: `ps aux | grep D` — the process shows `D` state. `cat /proc/PID/wchan` shows `i915_gem_wait_for_error` or similar.
 **Resolution**: Only a driver reload or node reboot recovers it. Log the node identifier and request replacement from the facility scheduler.
 ### 2. Infinite poll loop (`q.wait()` or `cudaDeviceSynchronize()` never returns)
 The process is in `R` (running) state, consuming 100% CPU. A polling loop inside the runtime is checking a completion flag that never becomes true, either because the hardware never sets it or because the flag is in a memory region not visible to the polling thread.
 **Diagnosis**: `top` shows the rank at 100% CPU. `strace -p PID` shows repeated `futex` or `read` syscalls with zero-length results, or no syscalls at all (pure spinloop). `perf top -p PID` shows the process burning cycles in a single tight loop in a runtime library (e.g., `ze_intel_gpu.so`).
 **Resolution**: The double-wait workaround — submit a trivially cheap kernel after the operation under test to act as a fence, then wait for the trivial kernel. See `gpu-runtime-correctness.md`.
 ### 3. Collective deadlock
 One or more ranks are blocked in an MPI call, usually `MPI_Allreduce` or `MPI_Barrier`, while others are not. Root cause: a topology-dependent bug in the MPI library's collective algorithm where some ranks' contributions never arrive.
 **Diagnosis**: Flight recorder step logs show some ranks at step N (inside the collective) while others are at step N+1 or stuck at step N with different `step_name` strings. The hung ranks will show `D` or `S` state in `ps`.
 **Resolution**: Replace `MPI_Allreduce` with a deterministic point-to-point tree reduction. See `mpi-heterogeneous.md`.
 ### 4. Premature return from wait (silent wrong answer, not a hang)
 The runtime returns from `q.wait()` before the GPU work is complete. The next operation reads stale data. This is not a hang — it manifests as a wrong answer or non-deterministic floating-point results. It is listed here because it is the most confusing failure mode: the code appears to run correctly and completes normally.
 **Diagnosis**: Double-run with checksum (see `correctness-verification.md`). Insert a second `q.wait()` after the first and observe if results become reproducible. If inserting the second wait "fixes" wrong answers, the first wait was returning prematurely.
 ## Flight Recorder for Hang Localization
 The most important diagnostic tool is knowing *which operation* a process is in when it hangs. Maintain a named step counter:
 ```cpp
 // Call at the start of every major operation
 FlightRecorder::StepLog("MPI_Allreduce::norm");
 // ... do the operation ...
 FlightRecorder::StepLog("MPI_Allreduce::done");
 ```
 On SIGHUP, dump rank, step counter value, and step name to stderr in an async-safe manner:
 ```cpp
 static void sighup_handler(int) {
    char buf[256];
    int n = snprintf(buf, sizeof(buf), "rank %d: step %llu '%s'\n",
        comm_rank,
        (unsigned long long)step_counter,
        step_name);
    write(2, buf, n);
    // backtrace_symbols_fd is async-safe on Linux glibc
    void *frames[32];
    backtrace_symbols_fd(frames, backtrace(frames, 32), 2);
 }
 signal(SIGHUP, sighup_handler);
 ```
 Broadcast SIGHUP to all ranks from outside the job:
 ```bash
 # In a separate shell while the job is hung
 squeue --job $JOBID -o "%i %N" | awk '{print $2}' | \
  xargs -I{} ssh {} "pkill -SIGHUP -f my_application"
 ```
 The step names from all ranks will reveal which collective operation has diverged.
 ## Distinguishing Driver Hang from MPI Hang
 | Symptom | Driver hang | MPI hang |
 |---|---|---|
 | Process state | `D` (ioctl) or `R` (spinloop) | `S` (blocked in syscall) |
 | `strace` | blocked `ioctl` or tight loop | blocked `recvmsg` / `read` |
 | Scope | single rank / single node | subset of ranks, pattern-dependent |
 | Recovery | reboot node | cancel job |
 | Flight recorder | step name is a GPU operation | step name is a collective |
 ## Reducing Diagnostic Time
 1. **Name every collective operation** in the flight recorder before calling it.
 2. **Separate GPU work from MPI work** in the code so the step name unambiguously identifies which subsystem is hung.
 3. **Log node identifiers** alongside step names so flaky nodes can be identified and blacklisted.
 4. **Request flight recorder dumps from all ranks simultaneously** (SIGHUP broadcast) rather than attaching a debugger — attaching `gdb` to one rank of a hung MPI job usually deadlocks the debugger too.
 ## What Not to Do
 - Do not `kill -9` a hung rank immediately — get the flight recorder dump first, otherwise diagnostic information is lost.
 - Do not assume the first rank that prints an error is the faulty one — collective hangs are frequently caused by the *last* rank to arrive at the barrier.
 - Do not use `MPI_Abort` in the hang handler — it may itself hang on some implementations. Use `_exit(1)` to force termination.
@@ -0,0 +1,137 @@
 ---
 name: mpi-heterogeneous
 description: Diagnose and work around MPI correctness bugs on heterogeneous (CPU+GPU) systems — device buffer aliasing in MPI_Sendrecv, AARCH64 PLT corruption from libfabric, topology-dependent allreduce hangs, and deterministic point-to-point reduction trees as a replacement for MPI_Allreduce.
 user-invocable: true
 allowed-tools:
  - Read
  - Bash(grep -r)
 ---
 # MPI Correctness on Heterogeneous HPC Systems
 ## The Core Problem
 MPI libraries were designed for CPU-resident buffers. When GPU-resident buffers are passed directly (GPU-aware MPI / GPU direct RDMA), several correctness assumptions break:
 - **Buffer aliasing**: The MPI library may internally alias send/receive buffer addresses for `MPI_Sendrecv` in ways that are safe for CPU memory but wrong for GPU memory with different cache coherency rules.
 - **RDMA bandwidth**: GPU direct RDMA on some fabrics operates at a fraction of peak wirespeed (documented at ~30% on Pontevecchio/Aurora), making host-staging mandatory for performance even when correctness is not an issue.
 - **Collective tree topology**: `MPI_Allreduce` implementations may select reduction trees based on process count or communicator topology that expose rank-ordering bugs, causing hangs on some configurations but not others.
 ## Bug Class 1: Device Buffer Aliasing in MPI_Sendrecv
 **Symptom**: `MPI_Sendrecv` with GPU-resident send and receive buffers produces wrong results. The received data matches neither the expected payload nor a host-staged copy. The failure is *deterministic* for a given problem size and process count, but *history-dependent* — earlier sends affect which alias is selected.
 **Root cause**: The MPI library internally reuses GPU buffer addresses for temporary staging without proper device memory ordering. When the same physical GPU memory pages appear in both the send and receive paths, writes from one path corrupt the other.
 **Diagnosis**: 
 1. Enable per-packet checksumming (see `correctness-verification.md`). If the checksum on the received packet does not match the sent checksum, the data was corrupted in transit.
 2. Replace `MPI_Sendrecv` with separate `MPI_Isend` + `MPI_Irecv` + `MPI_Waitall`. If this fixes the problem, the bug is in the `MPI_Sendrecv` implementation's internal buffer handling.
 3. Stage through host memory (`cudaMemcpy`/`hipMemcpy` to a host buffer, then `MPI_Sendrecv` on host buffers, then copy back). If this fixes the problem, confirms GPU-specific aliasing.
 **Reported as**: MPICH issue #7302. Affects MPICH on Intel Pontevecchio (Aurora) with device-resident buffers.
 **Workaround**: Do not use `MPI_Sendrecv` with GPU buffers. Use asynchronous send/receive pairs or host-staging. See `communication-overlap.md` for the full pipeline pattern.
 ## Bug Class 2: PLT Corruption on AARCH64 (libfabric)
 **Symptom**: Application crashes or hangs on first `MPI_Comm_dup` call on AARCH64 systems (e.g. NVIDIA Grace/H200). Backtrace shows a bad instruction in the PLT (Procedure Linkage Table) for `MPI_Comm_dup` — specifically a `br x15` instruction that should instead be a proper trampoline.
 **Root cause**: `libfabric`'s memory registration cache monitor patches PLT entries at runtime to intercept memory allocation calls. Its AARCH64 trampoline generation writes an incorrect instruction sequence, leaving `br x15` (branch to whatever happens to be in x15) in the PLT entry. The next call through that PLT entry executes garbage.
 **Diagnosis**:
 ```bash
 # Check if the PLT entry is corrupted
 objdump -d /proc/PID/exe | grep -A5 "MPI_Comm_dup@plt"
 # Look for "br x15" — this should be a proper stub, not a register branch
 ```
 Or check the disassembly of the live process:
 ```bash
 gdb -p PID -batch -ex "disassemble 'MPI_Comm_dup@plt'"
 ```
 **Workaround**:
 ```bash
 export FI_MR_CACHE_MONITOR=disabled
 ```
 This prevents libfabric from patching PLT entries. It may reduce MR cache performance but restores correctness.
 **Reported as**: libfabric issue #11451. Affects systems using AARCH64 + libfabric OFI provider (Cray Slingshot, AWS EFA) with memory registration cache enabled.
 ## Bug Class 3: Topology-Dependent Allreduce Hangs
 **Symptom**: `MPI_Allreduce` hangs indefinitely on some node configurations but completes correctly on others. The failure correlates with process count (e.g. fails at 512 ranks, works at 256) or network topology (fails when crossing specific router boundaries).
 **Root cause**: The MPI library's collective selection algorithm picks a reduction tree implementation that assumes symmetric participation from all ranks. A bug in one rank's contribution path (e.g. a GPU-side buffer not yet flushed when MPI reads it, due to premature barrier — see `gpu-runtime-correctness.md`) causes that rank to send wrong or incomplete data, and the tree-reduction protocol deadlocks waiting for data that never arrives correctly.
 **Diagnosis**: Flight recorder step logging (see `hang-diagnosis.md`). SIGHUP broadcast to all ranks. Ranks that are hung will show step name `MPI_Allreduce::...`; ranks that completed will show the next step. The hung ranks are the *recipients* of the stale data, not necessarily the *cause*.
 **Workaround — deterministic P2P reduction tree**:
 Replace `MPI_Allreduce` with an explicit point-to-point binary tree reduction. This is slower for large communicators but:
 1. Is immune to topology-dependent collective bugs.
 2. Is deterministic in floating-point ordering (the tree is fixed, not chosen at runtime).
 3. Makes the hang location explicit — each P2P operation is a named step in the flight recorder.
 ```cpp
 // Binary tree reduction: rank 0 collects, then broadcasts
 void GlobalSumP2P(double *data, int count, MPI_Comm comm) {
    int rank, size;
    MPI_Comm_rank(comm, &rank); MPI_Comm_size(comm, &size);
    // Reduce phase: even ranks receive from odd neighbours
    for (int stride = 1; stride < size; stride *= 2) {
        if (rank % (2*stride) == 0) {
            int partner = rank + stride;
            if (partner < size) {
                std::vector<double> tmp(count);
                MPI_Recv(tmp.data(), count, MPI_DOUBLE, partner, 0, comm, MPI_STATUS_IGNORE);
                for (int i = 0; i < count; i++) data[i] += tmp[i];
            }
        } else if (rank % stride == 0) {
            int partner = rank - stride;
            MPI_Send(data, count, MPI_DOUBLE, partner, 0, comm);
            break;
        }
    }
    // Broadcast phase
    for (int stride = /* highest power of 2 <= size */; stride >= 1; stride /= 2) {
        if (rank % (2*stride) == 0) {
            int partner = rank + stride;
            if (partner < size)
                MPI_Send(data, count, MPI_DOUBLE, partner, 0, comm);
        } else if (rank % stride == 0) {
            int partner = rank - stride;
            MPI_Recv(data, count, MPI_DOUBLE, partner, 0, comm, MPI_STATUS_IGNORE);
        }
    }
 }
 ```
 Grid reference: `USE_GRID_REDUCTION` macro in `Grid/communicator/Communicator_mpi3.cc`.
 ## Compile-Time Guard Structure
 Recommended macro structure to switch between the workaround paths:
 ```cpp
 // In configure / CMake, expose as options:
 // ACCELERATOR_AWARE_MPI  — use GPU direct (fast, potentially broken)
 // GRID_CHECKSUM_COMMS    — per-packet checksums (overhead: ~5%)
 // USE_GRID_REDUCTION     — P2P tree instead of MPI_Allreduce (slower, deterministic)
 // FI_MR_CACHE_MONITOR    — libfabric PLT workaround (env var, not compile-time)
 ```
 On a known-good system, enable `ACCELERATOR_AWARE_MPI` and disable the others. On a system with known bugs, disable `ACCELERATOR_AWARE_MPI` and enable `GRID_CHECKSUM_COMMS` + `USE_GRID_REDUCTION` as needed.
 ## Escalation Checklist
 Before concluding a bug is in your code:
 1. [ ] Can you reproduce with a minimal reproducer (two MPI ranks, no physics code)?
 2. [ ] Does the failure rate correlate with buffer size, process count, or network route?
 3. [ ] Does staging through host memory eliminate the failure?
 4. [ ] Is the failure deterministic for a given input (same answer, always wrong) or stochastic?
 5. [ ] Does the failure appear on a different MPI implementation (e.g. OpenMPI vs MPICH)?
 Deterministic wrong answers that reproduce with minimal reproducers and disappear with host-staging are strong evidence of an MPI library bug. File with the MPI library issue tracker with the minimal reproducer.