mirror of
https://github.com/paboyle/Grid.git
synced 2026-05-23 18:44:17 +01:00
c93b338bdd
Six skill files encoding expertise for making codebases robust on problematic HPC systems, covering: correctness verification (double-run, fingerprinting, flight recorder), hang diagnosis, GPU runtime correctness (premature barrier, infinite poll), MPI correctness on heterogeneous systems (device buffer aliasing, AARCH64 PLT corruption, deterministic reductions), compiler validation, and communication/computation overlap pipeline design. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
103 lines
5.6 KiB
Markdown
---
name: hang-diagnosis
description: Diagnose and isolate process hangs on HPC systems, distinguishing kernel-level ioctl hangs, infinite poll loops, collective deadlocks, and GPU completion signalling failures using async-safe signal handlers and flight recorder step counters.
user-invocable: true
allowed-tools:
- Read
- Bash(grep -r)
- Bash(strace)
- Bash(gdb)
---

# Hang Diagnosis on HPC Systems

## Taxonomy of Hangs

Not all hangs are the same, and misidentifying the type leads to the wrong mitigation. Four distinct classes are encountered on production leadership systems:

### 1. Kernel-level ioctl hang (never returns)
The process is in `D` (uninterruptible sleep) state. `strace` shows it blocked in an `ioctl` syscall. The GPU device driver has entered an unrecoverable state.

**Diagnosis**: `ps -eo pid,stat,wchan:32,comm | awk '$2 ~ /^D/'` lists the processes in `D` state (a plain `ps aux | grep D` also matches every line that merely contains the letter D). `cat /proc/PID/wchan` shows the kernel function the process is blocked in, for example `i915_gem_wait_for_error` or similar.

**Resolution**: Only a driver reload or node reboot recovers it. Log the node identifier and request a replacement node from the facility scheduler.

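The process-state check can also be scripted by reading `/proc` directly. The following is a minimal sketch assuming a Linux `/proc` filesystem; the function names `proc_state_from_stat` and `probe_process` are illustrative and do not come from any existing tool.

```cpp
#include <cstddef>
#include <fstream>
#include <string>

// Extract the one-letter scheduler state from the contents of /proc/<pid>/stat.
// The second field (the command name) is wrapped in parentheses and may itself
// contain spaces or ')' characters, so search for the LAST ')' and take the
// first non-space character after it. Returns '?' if the line is malformed.
char proc_state_from_stat(const std::string &stat_line) {
  std::size_t close = stat_line.rfind(')');
  if (close == std::string::npos) return '?';
  std::size_t pos = stat_line.find_first_not_of(' ', close + 1);
  return pos == std::string::npos ? '?' : stat_line[pos];
}

// Read /proc/<pid>/stat and /proc/<pid>/wchan for one process (Linux only).
// wchan_out receives the first line of the wchan file, or stays empty if the
// file cannot be read. Returns '?' if the process does not exist.
char probe_process(int pid, std::string &wchan_out) {
  std::ifstream stat("/proc/" + std::to_string(pid) + "/stat");
  std::string line;
  std::getline(stat, line);
  std::ifstream wchan("/proc/" + std::to_string(pid) + "/wchan");
  wchan_out.clear();
  std::getline(wchan, wchan_out);
  return proc_state_from_stat(line);
}
```

A return value of `'D'` from `probe_process` together with a driver symbol in `wchan_out` points at this class of hang; `'R'` points at the spin-loop class described next.
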
### 2. Infinite poll loop (`q.wait()` or `cudaDeviceSynchronize()` never returns)
The process is in `R` (running) state, consuming 100% CPU. A polling loop inside the runtime is checking a completion flag that never becomes true, either because the hardware never sets it or because the flag is in a memory region not visible to the polling thread.

**Diagnosis**: `top` shows the rank at 100% CPU. `strace -p PID` shows repeated `futex` or `read` syscalls with zero-length results, or no syscalls at all (a pure spin loop). `perf top -p PID` shows the process burning cycles in a single tight loop in a runtime library (e.g., `libze_intel_gpu.so`).

**Resolution**: Use the double-wait workaround: submit a trivially cheap kernel after the operation under test to act as a fence, then wait for the trivial kernel. See `gpu-runtime-correctness.md`.

### 3. Collective deadlock
One or more ranks are blocked in an MPI call, usually `MPI_Allreduce` or `MPI_Barrier`, while others are not. Root cause: a topology-dependent bug in the MPI library's collective algorithm, in which some ranks' contributions never arrive.

**Diagnosis**: Flight recorder step logs show some ranks at step N (inside the collective) while others are at step N+1, or stuck at step N with different `step_name` strings. The hung ranks show `D` or `S` state in `ps`.

**Resolution**: Replace `MPI_Allreduce` with a deterministic point-to-point tree reduction. See `mpi-heterogeneous.md`.

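To illustrate what "deterministic tree reduction" means, here is an in-memory model of a binomial-tree sum to rank 0. It is a sketch only: `tree_reduce` is a hypothetical name, no MPI calls are made, and the scheme in `mpi-heterogeneous.md` may differ. In a real code each addition would be one matched `MPI_Send` / `MPI_Recv` pair, followed by sending the result back out from rank 0.

```cpp
#include <cstddef>
#include <vector>

// values[r] stands for the contribution held by rank r.
// At each level, rank r receives from rank r + stride and accumulates.
// The pairing depends only on rank numbers, so the order of floating-point
// additions is identical on every run, whatever the message arrival order.
double tree_reduce(std::vector<double> values) {
  const std::size_t n = values.size();
  if (n == 0) return 0.0;
  for (std::size_t stride = 1; stride < n; stride *= 2) {
    for (std::size_t r = 0; r + stride < n; r += 2 * stride) {
      values[r] += values[r + stride];  // "recv from r + stride, then add"
    }
  }
  return values[0];
}
```

For five ranks this evaluates `((v0 + v1) + (v2 + v3)) + v4` every time, which is what makes the result bit-reproducible.
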
### 4. Premature return from wait (silent wrong answer, not a hang)
The runtime returns from `q.wait()` before the GPU work is complete, so the next operation reads stale data. This is not a hang: it manifests as a wrong answer or non-deterministic floating-point results. It is listed here because it is the most confusing failure mode, since the code appears to run correctly and completes normally.

**Diagnosis**: Double-run with checksum (see `correctness-verification.md`). Insert a second `q.wait()` after the first and observe whether the results become reproducible. If the second wait "fixes" the wrong answers, the first wait was returning prematurely.

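A double-run check can be as simple as hashing the output bytes of two identical runs. The sketch below is one possible shape, not the procedure from `correctness-verification.md`: `fnv1a` and `double_run_matches` are hypothetical names, and `compute` is a placeholder for the GPU operation under test followed by its wait and a copy back to host memory.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// 64-bit FNV-1a hash over the raw bytes of a buffer. Any checksum that is
// sensitive to every bit would do; FNV-1a is used only because it is short.
std::uint64_t fnv1a(const void *data, std::size_t nbytes) {
  const unsigned char *p = static_cast<const unsigned char *>(data);
  std::uint64_t h = 14695981039346656037ULL;  // FNV offset basis
  for (std::size_t i = 0; i < nbytes; ++i) {
    h ^= p[i];
    h *= 1099511628211ULL;  // FNV prime
  }
  return h;
}

// Run `compute` twice on the same input and compare checksums of the outputs.
// Returns true when both runs produced bitwise-identical results.
template <class Compute>
bool double_run_matches(Compute compute, const std::vector<double> &input) {
  std::vector<double> first = compute(input);
  std::vector<double> second = compute(input);
  return first.size() == second.size() &&
         fnv1a(first.data(), first.size() * sizeof(double)) ==
             fnv1a(second.data(), second.size() * sizeof(double));
}
```

If `double_run_matches` returns false without the extra wait and true with it, that is the signature of a premature return.
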
## Flight Recorder for Hang Localization

The most important diagnostic is knowing *which operation* a process is in when it hangs. Maintain a named step counter:

```cpp
// Call at the start of every major operation
FlightRecorder::StepLog("MPI_Allreduce::norm");
// ... do the operation ...
FlightRecorder::StepLog("MPI_Allreduce::done");
```

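A recorder with this interface needs very little state. The following is a hypothetical minimal version, written only to show the idea; it is not the actual `FlightRecorder` implementation, and the member names `step_counter` and `step_name` are assumptions. Lock-free atomics are used so that a signal handler can read the values safely.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical minimal flight recorder: one counter and one name pointer.
struct FlightRecorder {
  static std::atomic<std::uint64_t> step_counter;
  static std::atomic<const char *> step_name;

  // `name` must outlive the call (pass a string literal): only the pointer
  // is stored, so no allocation or copy happens on the hot path.
  static void StepLog(const char *name) {
    step_name.store(name, std::memory_order_relaxed);
    step_counter.fetch_add(1, std::memory_order_release);
  }
};

std::atomic<std::uint64_t> FlightRecorder::step_counter{0};
std::atomic<const char *> FlightRecorder::step_name{"(none)"};
```

Storing only a pointer keeps each `StepLog` call cheap enough to place around every collective and every GPU wait.
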
On SIGHUP, dump the rank, step counter value, and step name to stderr in an async-signal-safe manner:

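```cpp
#include <csignal>
#include <cstdio>
#include <execinfo.h>
#include <unistd.h>

// Assumed to be defined elsewhere (MPI setup and the flight recorder).
extern int comm_rank;
extern volatile unsigned long long step_counter;
extern const char *volatile step_name;

static void sighup_handler(int) {
  char buf[256];
  // snprintf is not formally async-signal-safe; best effort in an already-hung process
  int n = snprintf(buf, sizeof(buf), "rank %d: step %llu '%s'\n",
                   comm_rank, (unsigned long long)step_counter, step_name);
  if (n >= (int)sizeof(buf)) n = (int)sizeof(buf) - 1;  // output was truncated
  if (n > 0) write(2, buf, n);
  // backtrace_symbols_fd writes to the fd without calling malloc on Linux glibc
  void *frames[32];
  backtrace_symbols_fd(frames, backtrace(frames, 32), 2);
}

void install_sighup_handler() {  // call once at startup
  void *warm[1];
  backtrace(warm, 1);  // first call may load libgcc; do it outside the handler
  signal(SIGHUP, sighup_handler);
}
```
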
Broadcast SIGHUP to all ranks from outside the job:
```bash
# In a separate shell while the job is hung.
# -h drops the squeue header; scontrol expands a compressed node list such as node[01-04].
scontrol show hostnames "$(squeue -h -j "$JOBID" -o %N)" | \
  xargs -I{} ssh {} "pkill -HUP -f my_application"
```

The step names from all ranks will reveal which collective operation has diverged.

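With thousands of ranks, reading the dumps by eye is impractical. The sketch below groups dump lines by step so that the odd ranks out stand out. It assumes the `rank R: step N 'name'` line format written by the handler above; `group_by_step` is a hypothetical helper name.

```cpp
#include <cstdio>
#include <map>
#include <string>
#include <utility>
#include <vector>

using StepKey = std::pair<unsigned long long, std::string>;

// Parse lines of the form "rank R: step N 'name'" and group the rank numbers
// by (step, name). Lines that do not match, such as backtrace output, are skipped.
std::map<StepKey, std::vector<int>> group_by_step(const std::vector<std::string> &lines) {
  std::map<StepKey, std::vector<int>> groups;
  for (const std::string &line : lines) {
    int rank = 0;
    unsigned long long step = 0;
    char name[128];
    // %127[^'] reads up to 127 characters that are not a closing quote.
    if (std::sscanf(line.c_str(), "rank %d: step %llu '%127[^']'",
                    &rank, &step, name) == 3) {
      groups[std::make_pair(step, std::string(name))].push_back(rank);
    }
  }
  return groups;
}
```

A healthy job yields a single group. Two or more groups identify both the diverged operation and the ranks that never reached it.
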
## Distinguishing Driver Hang from MPI Hang

| Symptom | Driver hang | MPI hang |
|---|---|---|
| Process state | `D` (ioctl) or `R` (spinloop) | `S` (blocked in syscall) |
| `strace` | blocked `ioctl` or tight loop | blocked `recvmsg` / `read` |
| Scope | single rank / single node | subset of ranks, pattern-dependent |
| Recovery | reboot node | cancel job |
| Flight recorder | step name is a GPU operation | step name is a collective |

## Reducing Diagnostic Time

1. **Name every collective operation** in the flight recorder before calling it.
2. **Separate GPU work from MPI work** in the code so the step name unambiguously identifies which subsystem is hung.
3. **Log node identifiers** alongside step names so flaky nodes can be identified and blacklisted.
4. **Request flight recorder dumps from all ranks simultaneously** (SIGHUP broadcast) rather than attaching a debugger. Attaching `gdb` to one rank of a hung MPI job usually deadlocks the debugger too.

## What Not to Do

- Do not `kill -9` a hung rank immediately. Get the flight recorder dump first; otherwise the diagnostic information is lost.
- Do not assume the first rank that prints an error is the faulty one. Collective hangs are frequently caused by the *last* rank to arrive at the barrier.
- Do not use `MPI_Abort` in the hang handler, because it may itself hang on some implementations. Use `_exit(1)` to force termination.