Six skill files encoding expertise for making codebases robust on problematic HPC systems, covering: correctness verification (double-run, fingerprinting, flight recorder), hang diagnosis, GPU runtime correctness (premature barrier, infinite poll), MPI correctness on heterogeneous systems (device buffer aliasing, AARCH64 PLT corruption, deterministic reductions), compiler validation, and communication/computation overlap pipeline design. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
| name | description | user-invocable | allowed-tools |
|---|---|---|---|
| hang-diagnosis | Diagnose and isolate process hangs on HPC systems, distinguishing kernel-level ioctl hangs, infinite poll loops, collective deadlocks, and GPU completion signalling failures using async-safe signal handlers and flight recorder step counters. | true | |
Hang Diagnosis on HPC Systems
Taxonomy of Hangs
Not all hangs are the same. Misidentifying the type leads to wrong mitigation. The four distinct classes encountered on production leadership systems:
1. Kernel-level ioctl hang (never returns)
The process is in D (uninterruptible sleep) state. strace shows it blocked in an ioctl syscall. The GPU device driver has entered an unrecoverable state.
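Diagnosis: ps -eo pid,stat,wchan:32,comm | awk '$2 ~ /^D/' lists the processes in D state (a plain grep D also matches every line that merely contains the letter). cat /proc/PID/wchan shows i915_gem_wait_for_error or similar.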
Resolution: Only a driver reload or node reboot recovers it. Log the node identifier and request replacement from the facility scheduler.
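The state check can also be scripted, for example from a watchdog thread or a monitoring tool. This is a minimal sketch, assuming the Linux /proc/PID/stat layout in which the state letter follows the parenthesised command name. parse_stat_state and read_proc_state are hypothetical helper names, not part of any library.

```cpp
#include <fstream>
#include <string>

// Extract the state letter from one line of /proc/<pid>/stat.
// Layout: "pid (comm) state ...". comm may itself contain spaces and
// parentheses, so search for the LAST ')' instead of splitting on spaces.
// Returns '?' if the line is malformed.
char parse_stat_state(const std::string& stat_line) {
    const std::size_t close = stat_line.rfind(')');
    if (close == std::string::npos || close + 2 >= stat_line.size()) return '?';
    return stat_line[close + 2];
}

// Read the current state of a live process; returns '?' if it cannot be read.
char read_proc_state(int pid) {
    std::ifstream in("/proc/" + std::to_string(pid) + "/stat");
    std::string line;
    if (!std::getline(in, line)) return '?';
    return parse_stat_state(line);
}
```

A single D sample can be ordinary I/O. A rank that still reports D across several samples taken a few seconds apart is the candidate for the ioctl hang described here.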
2. Infinite poll loop (q.wait() or cudaDeviceSynchronize() never returns)
The process is in R (running) state, consuming 100% CPU. A polling loop inside the runtime is checking a completion flag that never becomes true, either because the hardware never sets it or because the flag is in a memory region not visible to the polling thread.
Diagnosis: top shows the rank at 100% CPU. strace -p PID shows repeated futex or read syscalls with zero-length results, or no syscalls at all (pure spinloop). perf top -p PID shows the process burning cycles in a single tight loop in a runtime library (e.g., libze_intel_gpu.so).
Resolution: The double-wait workaround — submit a trivially cheap kernel after the operation under test to act as a fence, then wait for the trivial kernel. See gpu-runtime-correctness.md.
3. Collective deadlock
One or more ranks are blocked in an MPI call, usually MPI_Allreduce or MPI_Barrier, while others are not. Root cause: a topology-dependent bug in the MPI library's collective algorithm where some ranks' contributions never arrive.
Diagnosis: Flight recorder step logs show some ranks at step N (inside the collective) while others are at step N+1 or stuck at step N with different step_name strings. The hung ranks will show D or S state in ps.
Resolution: Replace MPI_Allreduce with a deterministic point-to-point tree reduction. See mpi-heterogeneous.md.
4. Premature return from wait (silent wrong answer, not a hang)
The runtime returns from q.wait() before the GPU work is complete. The next operation reads stale data. This is not a hang — it manifests as a wrong answer or non-deterministic floating-point results. It is listed here because it is the most confusing failure mode: the code appears to run correctly and completes normally.
Diagnosis: Double-run with checksum (see correctness-verification.md). Insert a second q.wait() after the first and observe if results become reproducible. If inserting the second wait "fixes" wrong answers, the first wait was returning prematurely.
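The double-run comparison can be sketched as below. This is an illustration under stated assumptions, not the procedure from correctness-verification.md: fingerprint and reproducible are hypothetical names, and FNV-1a over the raw bytes is an assumed choice of checksum.

```cpp
#include <cstdint>
#include <cstring>
#include <functional>
#include <vector>

// FNV-1a over the raw bytes of the result buffer. The comparison is bitwise
// on purpose: a premature wait shows up as any differing bit between runs.
std::uint64_t fingerprint(const std::vector<double>& data) {
    std::uint64_t h = 14695981039346656037ull;  // FNV-1a 64-bit offset basis
    for (double d : data) {
        unsigned char bytes[sizeof(double)];
        std::memcpy(bytes, &d, sizeof(double));
        for (unsigned char b : bytes) {
            h ^= b;
            h *= 1099511628211ull;  // FNV-1a 64-bit prime
        }
    }
    return h;
}

// Run the same computation twice on identical input and compare fingerprints.
// `compute` stands in for "launch kernel, wait, copy the result to the host".
bool reproducible(const std::function<std::vector<double>()>& compute) {
    return fingerprint(compute()) == fingerprint(compute());
}
```

If reproducible only starts returning true after a second q.wait() is added inside compute, that points at the first wait returning early.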
Flight Recorder for Hang Localization
The most important diagnostic tool is knowing which operation a process is in when it hangs. Maintain a named step counter:
// Call at the start of every major operation
FlightRecorder::StepLog("MPI_Allreduce::norm");
// ... do the operation ...
FlightRecorder::StepLog("MPI_Allreduce::done");
On SIGHUP, dump rank, step counter value, and step name to stderr in an async-safe manner:
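#include <execinfo.h>  // backtrace, backtrace_symbols_fd
#include <signal.h>    // sigaction
#include <stdio.h>     // snprintf
#include <unistd.h>    // write

static void sighup_handler(int) {
    char buf[256];
    // snprintf is not on the POSIX async-signal-safe list; it is used here
    // only for simple integer and string conversions.
    int n = snprintf(buf, sizeof(buf), "rank %d: step %llu '%s'\n",
                     comm_rank,
                     (unsigned long long)step_counter,
                     step_name);
    if (n >= (int)sizeof(buf)) n = sizeof(buf) - 1;  // output was truncated
    if (n > 0) write(2, buf, n);
    // backtrace_symbols_fd does not call malloc, but the first backtrace()
    // call may load libgcc; call backtrace() once at startup to avoid that.
    void *frames[32];
    backtrace_symbols_fd(frames, backtrace(frames, 32), 2);
}
// Install during startup, from inside an initialization function:
struct sigaction sa = {};
sa.sa_handler = sighup_handler;
sigaction(SIGHUP, &sa, nullptr);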
Broadcast SIGHUP to all ranks from outside the job:
# In a separate shell while the job is hung
scontrol show hostnames "$(squeue -h -j "$JOBID" -o %N)" | \
xargs -I{} ssh {} "pkill -HUP -f my_application"
The step names from all ranks will reveal which collective operation has diverged.
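Once the dumps are collected, the comparison across ranks can be automated. This is a minimal sketch, assuming each rank printed one line in the "rank R: step N 'name'" format produced by the handler above. StepRecord, parse_dump_line and lagging_ranks are hypothetical names.

```cpp
#include <algorithm>
#include <cstdio>
#include <string>
#include <vector>

struct StepRecord {
    int rank;
    unsigned long long step;
    std::string name;
};

// Parse one dump line. Returns false for anything that is not a flight
// recorder line, such as a backtrace frame.
bool parse_dump_line(const std::string& line, StepRecord& out) {
    char name[128];
    int rank = 0;
    unsigned long long step = 0;
    // %127[^'] reads the step name up to (not including) the closing quote.
    if (std::sscanf(line.c_str(), "rank %d: step %llu '%127[^']'",
                    &rank, &step, name) != 3)
        return false;
    out = StepRecord{rank, step, name};
    return true;
}

// Ranks whose counter is behind the maximum have not progressed as far as
// the others; their step names show where they stopped.
std::vector<StepRecord> lagging_ranks(const std::vector<StepRecord>& all) {
    std::vector<StepRecord> behind;
    unsigned long long newest = 0;
    for (const StepRecord& r : all) newest = std::max(newest, r.step);
    for (const StepRecord& r : all)
        if (r.step < newest) behind.push_back(r);
    return behind;
}
```

This compares counters only. Ranks that share a counter value but report different step names would need a second pass that groups records by name.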
Distinguishing Driver Hang from MPI Hang
| Symptom | Driver hang | MPI hang |
|---|---|---|
| Process state | D (ioctl) or R (spinloop) | S (blocked in syscall) |
| strace | blocked ioctl or tight loop | blocked recvmsg / read |
| Scope | single rank / single node | subset of ranks, pattern-dependent |
| Recovery | reboot node | cancel job |
| Flight recorder | step name is a GPU operation | step name is a collective |
Reducing Diagnostic Time
- Name every collective operation in the flight recorder before calling it.
- Separate GPU work from MPI work in the code so the step name unambiguously identifies which subsystem is hung.
- Log node identifiers alongside step names so flaky nodes can be identified and blacklisted.
- Request flight recorder dumps from all ranks simultaneously (SIGHUP broadcast) rather than attaching a debugger: attaching gdb to one rank of a hung MPI job usually deadlocks the debugger too.
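For the node identifier, one option is to look it up once at startup and keep it in static storage, since gethostname is not on the POSIX async-signal-safe list and so should not be called from the dump handler. This is a minimal sketch; cache_node_id and node_id are hypothetical names, and the hostname is assumed to be an adequate node identifier at the facility.

```cpp
#include <cstring>
#include <unistd.h>  // gethostname (POSIX)

namespace {
// Static storage, so the signal handler can read it without any calls.
char g_node_id[256] = "unknown-node";
}

// Call once at startup (for example right after MPI_Init), never from the
// handler. Leaves the default in place if the lookup fails.
void cache_node_id() {
    char tmp[sizeof(g_node_id)] = {};
    // Pass one byte less than the buffer so the result is always terminated.
    if (gethostname(tmp, sizeof(tmp) - 1) == 0 && tmp[0] != '\0')
        std::memcpy(g_node_id, tmp, sizeof(tmp));
}

// Safe to call from a signal handler: returns a pointer to static storage.
const char* node_id() { return g_node_id; }
```

The handler can then emit node_id() next to the rank and step name with the same write call.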
What Not to Do
- Do not kill -9 a hung rank immediately. Get the flight recorder dump first; otherwise the diagnostic information is lost.
- Do not assume the first rank that prints an error is the faulty one. Collective hangs are frequently caused by the last rank to arrive at the barrier.
- Do not use MPI_Abort in the hang handler, since it may itself hang on some implementations. Use _exit(1) to force termination.
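A termination handler along those lines can be sketched as follows. The text does not name a trigger signal, so the signal number is left as a parameter. hang_exit_handler and install_hang_exit_handler are hypothetical names.

```cpp
#include <signal.h>
#include <unistd.h>

// Terminate the rank without touching MPI. _exit() is async-signal-safe;
// exit() is not, and it would also run atexit handlers and destructors that
// may block on the hung runtime.
extern "C" void hang_exit_handler(int) {
    static const char msg[] = "hang handler: forcing exit\n";
    ssize_t ignored = write(2, msg, sizeof(msg) - 1);
    (void)ignored;
    _exit(1);
}

// Install the handler for `signo`; returns true on success.
bool install_hang_exit_handler(int signo) {
    struct sigaction sa = {};
    sa.sa_handler = hang_exit_handler;
    sigemptyset(&sa.sa_mask);
    return sigaction(signo, &sa, nullptr) == 0;
}
```

A call such as install_hang_exit_handler(SIGUSR1) during startup would arm it. This does not help a rank in D state, because no signal is delivered until the blocked ioctl returns.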