mirror of https://github.com/paboyle/Grid.git synced 2026-05-21 01:24:16 +01:00

Peter Boyle c93b338bdd skills: HPC battle-hardening skill files for GPU+MPI correctness

Six skill files encoding expertise for making codebases robust on
problematic HPC systems, covering: correctness verification
(double-run, fingerprinting, flight recorder), hang diagnosis,
GPU runtime correctness (premature barrier, infinite poll),
MPI correctness on heterogeneous systems (device buffer aliasing,
AARCH64 PLT corruption, deterministic reductions),
compiler validation, and communication/computation overlap pipeline
design.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-05-18 12:10:44 -04:00

5.6 KiB


name: hang-diagnosis
description: Diagnose and isolate process hangs on HPC systems, distinguishing kernel-level ioctl hangs, infinite poll loops, collective deadlocks, and GPU completion signalling failures using async-safe signal handlers and flight recorder step counters.
user-invocable: true
allowed-tools: Read, Bash(grep -r), Bash(strace), Bash(gdb)

Hang Diagnosis on HPC Systems

Taxonomy of Hangs

Not all hangs are the same, and misidentifying the type leads to the wrong mitigation. Four distinct classes are encountered on production leadership-class systems:

1. Kernel-level ioctl hang (never returns)

The process is in D (uninterruptible sleep) state. strace shows it blocked in an ioctl syscall. The GPU device driver has entered an unrecoverable state.

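Diagnosis: ps -eo pid,stat,wchan:32,comm | awk '$2 ~ /^D/' lists the processes in D state (a bare grep D also matches every line that merely contains the letter D). cat /proc/PID/wchan shows i915_gem_wait_for_error or similar.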

Resolution: Only a driver reload or node reboot recovers it. Log the node identifier and request replacement from the facility scheduler.
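The process-state check can also be done programmatically, for example from a per-node monitoring helper. The following is a minimal sketch, assuming a Linux /proc filesystem; proc_state and classify are hypothetical helper names, and the hints returned by classify simply restate the D/R/S distinctions used in this document.

```cpp
#include <cassert>
#include <cstdio>
#include <string>
#include <sys/types.h>
#include <unistd.h>

// Hypothetical helper: returns the one-letter scheduler state of `pid`
// ('R', 'S', 'D', ...) or '?' if /proc/<pid>/stat cannot be read.
char proc_state(pid_t pid) {
    assert(pid > 0);
    char path[64];
    std::snprintf(path, sizeof(path), "/proc/%ld/stat", static_cast<long>(pid));
    std::FILE *f = std::fopen(path, "r");
    if (!f) return '?';
    char line[1024];
    const bool ok = std::fgets(line, sizeof(line), f) != nullptr;
    std::fclose(f);
    if (!ok) return '?';
    // Layout is "pid (comm) state ...". comm may itself contain spaces or
    // ')', so locate the last ')' instead of splitting on whitespace.
    const std::string s(line);
    const std::size_t close = s.rfind(')');
    if (close == std::string::npos || close + 2 >= s.size()) return '?';
    return s[close + 2];
}

// Hypothetical helper: maps a state letter to the hang class it suggests.
const char *classify(char state) {
    switch (state) {
        case 'D': return "uninterruptible sleep: suspect a kernel-level ioctl hang";
        case 'R': return "running: suspect a poll loop if CPU stays at 100%";
        case 'S': return "interruptible sleep: suspect a blocked MPI call";
        default:  return "unknown state";
    }
}
```

Calling classify(proc_state(pid)) for each rank's PID on a node gives a first triage; the state letter alone is only a hint and should be confirmed with strace or /proc/PID/wchan.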

2. Infinite poll loop (`q.wait()` or `cudaDeviceSynchronize()` never returns)

The process is in R (running) state, consuming 100% CPU. A polling loop inside the runtime is checking a completion flag that never becomes true, either because the hardware never sets it or because the flag is in a memory region not visible to the polling thread.

Diagnosis: top shows the rank at 100% CPU. strace -p PID shows repeated futex or read syscalls that return immediately with zero-length results, or no syscalls at all (a pure spin loop in user space). perf top -p PID shows the process burning cycles in a single tight loop inside a runtime library (e.g., libze_intel_gpu.so).

Resolution: The double-wait workaround — submit a trivially cheap kernel after the operation under test to act as a fence, then wait for the trivial kernel. See gpu-runtime-correctness.md.

3. Collective deadlock

One or more ranks are blocked in an MPI call, usually MPI_Allreduce or MPI_Barrier, while others are not. Root cause: a topology-dependent bug in the MPI library's collective algorithm where some ranks' contributions never arrive.

Diagnosis: Flight recorder step logs show some ranks at step N (inside the collective) while others have advanced to step N+1, or ranks at the same step N reporting different step_name strings. The hung ranks will show D or S state in ps, or R if the MPI library busy-polls for progress.

Resolution: Replace MPI_Allreduce with a deterministic point-to-point tree reduction. See mpi-heterogeneous.md.
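To illustrate what "deterministic tree reduction" means, here is a serial model of the combine order only. It is not MPI code and not the implementation described in mpi-heterogeneous.md; tree_reduce is a hypothetical name, and values[r] stands in for the contribution of rank r. In a real version, each inner-loop iteration would presumably correspond to one point-to-point receive on rank r matched by one send on rank r + stride.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Serial model of a binary-tree sum to rank 0.
// In the round with a given stride (1, 2, 4, ...), every rank r that is a
// multiple of 2*stride adds the partial sum held by rank r + stride, if
// that rank exists. Partners and operand order are fixed by rank number,
// so the floating-point result does not depend on message arrival order.
double tree_reduce(std::vector<double> values) {
    assert(!values.empty());
    const std::size_t n = values.size();
    for (std::size_t stride = 1; stride < n; stride *= 2) {
        for (std::size_t r = 0; r + stride < n; r += 2 * stride) {
            values[r] += values[r + stride];  // fixed partner, fixed order
        }
    }
    return values[0];
}
```

With five contributions the rounds combine (0,1) and (2,3), then (0,2), then (0,4). Broadcasting the result back down the same tree would give every rank a bit-identical value.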

4. Premature return from wait (silent wrong answer, not a hang)

The runtime returns from q.wait() before the GPU work is complete. The next operation reads stale data. This is not a hang — it manifests as a wrong answer or non-deterministic floating-point results. It is listed here because it is the most confusing failure mode: the code appears to run correctly and completes normally.

Diagnosis: Double-run with checksum (see correctness-verification.md). Insert a second q.wait() after the first and observe if results become reproducible. If inserting the second wait "fixes" wrong answers, the first wait was returning prematurely.
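The double-run check can be sketched as follows. This is a minimal host-side illustration, not the procedure from correctness-verification.md: fnv1a and reproducible are hypothetical names, and the op callable stands for "launch the kernel, wait, and copy the result to the host".

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// 64-bit FNV-1a over the raw bytes, so a difference in even the last
// floating-point bit changes the checksum.
std::uint64_t fnv1a(const void *data, std::size_t bytes) {
    const unsigned char *p = static_cast<const unsigned char *>(data);
    std::uint64_t h = 14695981039346656037ULL;  // FNV offset basis
    for (std::size_t i = 0; i < bytes; ++i) {
        h ^= p[i];
        h *= 1099511628211ULL;                  // FNV prime
    }
    return h;
}

using Op = std::function<void(const std::vector<double> &, std::vector<double> &)>;

// Runs `op` twice on identical input and reports whether the two outputs
// have the same checksum. A mismatch points at stale or racing data.
bool reproducible(const Op &op, const std::vector<double> &input) {
    assert(!input.empty());
    std::vector<double> out1(input.size()), out2(input.size());
    op(input, out1);
    op(input, out2);
    return fnv1a(out1.data(), out1.size() * sizeof(double)) ==
           fnv1a(out2.data(), out2.size() * sizeof(double));
}
```

If reproducible returns false with a single wait inside op and true once a second wait is added, that is the signature of a premature return described above.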

Flight Recorder for Hang Localization

The most important diagnostic tool is knowing which operation a process is in when it hangs. Maintain a named step counter:

// Call at the start of every major operation
FlightRecorder::StepLog("MPI_Allreduce::norm");
// ... do the operation ...
FlightRecorder::StepLog("MPI_Allreduce::done");
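For readers without access to the real recorder, a step counter of this kind might look like the sketch below. MiniRecorder is a hypothetical stand-in and makes no claim about how FlightRecorder::StepLog is actually implemented; the one design assumption is that step names are string literals, so only a pointer needs to be stored.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the FlightRecorder used above.
namespace MiniRecorder {

// Atomics so that a signal handler can read a consistent value of each.
std::atomic<std::uint64_t> step_counter{0};
std::atomic<const char *>  step_name{"(none)"};

// `name` must point to storage that outlives the run (e.g. a string
// literal): only the pointer is stored, so the hot path neither copies
// nor allocates.
inline void StepLog(const char *name) {
    assert(name != nullptr);
    step_name.store(name, std::memory_order_relaxed);
    step_counter.fetch_add(1, std::memory_order_relaxed);
}

}  // namespace MiniRecorder
```

Comparing step numbers across ranks is only meaningful if every rank calls StepLog the same number of times on the same code path, which is one reason to log collectives rather than rank-local work.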

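On SIGHUP, dump the rank, step counter value, and step name to stderr from a signal handler that avoids allocation and locking as far as possible:

#include <execinfo.h>  // backtrace, backtrace_symbols_fd
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

// comm_rank, step_counter and step_name are globals kept by the flight recorder.
static void sighup_handler(int) {
    char buf[256];
    // snprintf is not formally async-signal-safe; tolerated here because the process is already hung.
    int n = snprintf(buf, sizeof(buf), "rank %d: step %llu '%s'\n",
        comm_rank,
        (unsigned long long)step_counter,
        step_name);
    if (n >= (int)sizeof(buf)) n = sizeof(buf) - 1;  // clamp on truncation
    if (n > 0) write(2, buf, n);
    // backtrace_symbols_fd does not call malloc, but backtrace may load libgcc
    // on first use, so call backtrace once at start-up to preload it.
    void *frames[32];
    backtrace_symbols_fd(frames, backtrace(frames, 32), 2);
}

// Call once during start-up; a bare signal() statement at file scope does not compile.
static void install_sighup_handler() { signal(SIGHUP, sighup_handler); }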

Broadcast SIGHUP to all ranks from outside the job:

# In a separate shell while the job is hung
# -h drops the header line; scontrol expands a compressed list such as node[01-04]
scontrol show hostnames "$(squeue -j $JOBID -h -o %N)" | \
  xargs -I{} ssh {} "pkill -HUP -f my_application"

The step names from all ranks will reveal which collective operation has diverged.
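Once the dumps are collected, finding the diverged ranks is mechanical. The sketch below assumes the dump lines have already been parsed into a hypothetical RankStep record; lagging_ranks and names_diverge are illustrative names, not part of any existing tool.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// One parsed line of a flight recorder dump (hypothetical layout).
struct RankStep {
    int rank;
    std::uint64_t step;
    std::string name;
};

// Returns the ranks whose step counter is behind the furthest rank.
// These are the ranks still inside (or not yet at) the operation the
// others have already left.
std::vector<int> lagging_ranks(const std::vector<RankStep> &dump) {
    assert(!dump.empty());
    std::uint64_t front = 0;
    for (const RankStep &r : dump) front = std::max(front, r.step);
    std::vector<int> behind;
    for (const RankStep &r : dump) {
        if (r.step < front) behind.push_back(r.rank);
    }
    return behind;
}

// Ranks at the same step number should report the same step name;
// a mismatch means their control flow has diverged.
bool names_diverge(const std::vector<RankStep> &dump) {
    for (const RankStep &a : dump) {
        for (const RankStep &b : dump) {
            if (a.step == b.step && a.name != b.name) return true;
        }
    }
    return false;
}
```

The quadratic loop in names_diverge is acceptable for a one-off post-mortem over a few thousand ranks; sorting by step first would be the obvious refinement for larger jobs.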

Distinguishing Driver Hang from MPI Hang

| Symptom | Driver hang | MPI hang |
|---|---|---|
| Process state | `D` (ioctl) or `R` (spinloop) | `S` (blocked in syscall) |
| `strace` | blocked `ioctl` or tight loop | blocked `recvmsg` / `read` |
| Scope | single rank / single node | subset of ranks, pattern-dependent |
| Recovery | reboot node | cancel job |
| Flight recorder | step name is a GPU operation | step name is a collective |

Reducing Diagnostic Time

Name every collective operation in the flight recorder before calling it.
Separate GPU work from MPI work in the code so the step name unambiguously identifies which subsystem is hung.
Log node identifiers alongside step names so flaky nodes can be identified and blacklisted.
Request flight recorder dumps from all ranks simultaneously (SIGHUP broadcast) rather than attaching a debugger — attaching gdb to one rank of a hung MPI job usually deadlocks the debugger too.
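Logging the node identifier from a signal handler is easiest if the name is cached at start-up, since gethostname is not on the POSIX list of async-signal-safe functions. A minimal sketch, with cache_node_name and format_dump as hypothetical helper names and an illustrative output format:

```cpp
#include <cassert>
#include <cstdio>
#include <cstring>
#include <unistd.h>

// Filled once at start-up; the signal handler only reads it.
static char node_name[256] = "unknown";

// Call once early in main, before any hang can occur.
void cache_node_name() {
    if (gethostname(node_name, sizeof(node_name)) != 0) {
        std::strncpy(node_name, "unknown", sizeof(node_name));
    }
    // gethostname need not NUL-terminate when the name is truncated.
    node_name[sizeof(node_name) - 1] = '\0';
}

// Formats one dump line into `buf` and returns the number of bytes to
// pass to write(); the line is truncated if `buf` is too small.
int format_dump(char *buf, int size, int rank,
                unsigned long long step, const char *name) {
    assert(size > 0);
    int n = std::snprintf(buf, size, "node %s rank %d: step %llu '%s'\n",
                          node_name, rank, step, name);
    if (n < 0) return 0;
    return n < size ? n : size - 1;
}
```

With the node name on every line, dumps from many ranks can be grouped by node, which makes a single flaky node stand out.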

What Not to Do

Do not kill -9 a hung rank immediately — get the flight recorder dump first, otherwise diagnostic information is lost.
Do not assume the first rank that prints an error is the faulty one — collective hangs are frequently caused by the last rank to arrive at the barrier.
Do not use MPI_Abort in the hang handler — it may itself hang on some implementations. Use _exit(1) to force termination.
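A forced-termination handler along these lines might look like the following sketch. The choice of SIGTERM as the trigger and the names hang_exit_handler, install_hang_exit_handler and demo_forced_exit are assumptions for illustration; the demo forks a child so that the exit can be observed without killing the calling process.

```cpp
#include <cassert>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Last-resort handler: emit a fixed message and terminate immediately.
// write and _exit are on the POSIX async-signal-safe list; MPI_Abort,
// exit() and atexit handlers are deliberately not used.
extern "C" void hang_exit_handler(int) {
    static const char msg[] = "hang handler: forcing exit\n";
    ssize_t ignored = write(2, msg, sizeof(msg) - 1);
    (void)ignored;
    _exit(1);
}

void install_hang_exit_handler(int signo) {
    assert(signo > 0);
    struct sigaction sa = {};
    sa.sa_handler = hang_exit_handler;
    sigemptyset(&sa.sa_mask);
    sigaction(signo, &sa, nullptr);
}

// Demonstration: a child installs the handler and signals itself.
// Returns the child's exit status, or -1 if it did not exit normally.
int demo_forced_exit() {
    pid_t child = fork();
    if (child < 0) return -1;
    if (child == 0) {
        install_hang_exit_handler(SIGTERM);
        raise(SIGTERM);
        _exit(0);  // not reached if the handler ran
    }
    int status = 0;
    if (waitpid(child, &status, 0) < 0) return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Because _exit skips MPI finalization, the launcher will usually report the rank as failed and tear down the rest of the job, which is the intended outcome for a hung run.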
