mirror of https://github.com/paboyle/Grid.git synced 2026-05-19 00:24:32 +01:00

Peter Boyle c93b338bdd skills: HPC battle-hardening skill files for GPU+MPI correctness

Six skill files encoding expertise for making codebases robust on
problematic HPC systems, covering: correctness verification
(double-run, fingerprinting, flight recorder), hang diagnosis,
GPU runtime correctness (premature barrier, infinite poll),
MPI correctness on heterogeneous systems (device buffer aliasing,
AARCH64 PLT corruption, deterministic reductions),
compiler validation, and communication/computation overlap pipeline
design.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-05-18 12:10:44 -04:00


name: correctness-verification
description: Implement application-level correctness verification for HPC codes on unreliable hardware: double-run pattern, deterministic reductions, per-packet checksums, and flight recorder step logging.
user-invocable: true
allowed-tools: Read, Bash(grep -r)

Correctness Verification Infrastructure for HPC Codes

The Problem

Leadership computing facilities sometimes have hardware or firmware bugs below the level visible to application code. The accelerator runtime can return from q.wait() or cudaDeviceSynchronize() before work is actually complete, and DMA transfers can silently deliver corrupt data. Standard testing does not catch these failures because they are non-deterministic and often topology-dependent (they appear only at specific process counts or on specific node configurations).

The symptoms look like numerical instabilities, random MPI hangs, or wrong physics results, not like crashes. Without deliberate infrastructure, diagnosing the root cause takes months.

The Double-Run Pattern

The most reliable correctness check for non-deterministic hardware bugs is to run every computation twice from the same starting state and require the two fingerprints to be bit-identical.

Key constraint: both runs must follow a deterministic code path. Non-deterministic floating-point ordering (e.g. from MPI_Allreduce choosing a different reduction tree on the second run) produces false mismatches. See mpi-heterogeneous.md for how to make reductions deterministic.
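To illustrate why summation order matters for bit-exact comparison, here is a minimal sketch. The helper name ordered_sum is hypothetical, and the sketch assumes the per-rank partial sums have already been gathered into one vector in rank order (for example with MPI_Allgather); it is not the method described in mpi-heterogeneous.md.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: sum per-rank partial results in a fixed order.
// Assumes every rank holds the same vector, ordered by rank index,
// so every rank (and every rerun) performs the identical additions.
inline double ordered_sum(const std::vector<double> &partials) {
    double acc = 0.0;
    for (std::size_t r = 0; r < partials.size(); ++r) {
        acc += partials[r];   // always left to right, rank 0 first
    }
    return acc;
}
```

Summing {0.1, 0.2, 0.3} and {0.3, 0.2, 0.1} with this function gives two results that differ in the last bit, which is exactly the kind of false mismatch a reordered reduction tree would introduce into a double-run comparison.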

// Pseudocode: double-run a step and compare CRC fingerprints
void run_step_verified(State &state) {
    state.save_checkpoint();
    
    uint64_t crc_a = run_step_and_fingerprint(state);
    state.restore_checkpoint();
    uint64_t crc_b = run_step_and_fingerprint(state);
    
    if (crc_a != crc_b) {
        report_mismatch("step", crc_a, crc_b);
        // Policy: abort, retry from checkpoint, or continue with alarm
    }
}
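The pseudocode above can be fleshed out as a small generic helper. This is a sketch under stated assumptions: State is cheaply copyable (the copy stands in for save_checkpoint/restore_checkpoint), the step callable advances the state and returns its fingerprint, and the retry policy is only one of the options listed in the comment above. The names StepResult and max_retries are illustrative.

```cpp
#include <cassert>
#include <cstdint>

enum class StepResult { Ok, Mismatch };

// Run one step twice from the same checkpoint and compare fingerprints.
// Assumptions: State is copyable; step_and_fingerprint(State&) advances the
// state in place and returns a 64-bit fingerprint of the result.
template <class State, class Step>
StepResult run_step_verified(State &state, Step step_and_fingerprint,
                             int max_retries = 1) {
    for (int attempt = 0; attempt <= max_retries; ++attempt) {
        const State checkpoint = state;                 // save
        const std::uint64_t fp_a = step_and_fingerprint(state);
        state = checkpoint;                             // restore
        const std::uint64_t fp_b = step_and_fingerprint(state);
        if (fp_a == fp_b) return StepResult::Ok;        // state holds run B
        state = checkpoint;                             // roll back, then retry
    }
    return StepResult::Mismatch;                        // caller decides: abort or alarm
}
```

On success the state is left as produced by the second run; on persistent mismatch it is rolled back to the checkpoint so the caller can abort or continue with an alarm from a known state.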

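Fingerprinting: XOR-fold a CRC32 over all floating-point data after each step. XOR is commutative and associative, so each rank can fingerprint its local data in any order without communication, and per-rank results can be XOR-combined if a single global value is needed. For field data:

#include <cstdint>
#include <cstring>
#include <zlib.h>   // assumption: zlib supplies crc32(); any CRC32 routine works

uint64_t fingerprint(const double *data, size_t n) {
    uint64_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        unsigned char bytes[sizeof(double)];
        memcpy(bytes, &data[i], sizeof(bytes));   // hash the bit pattern, not the numeric value
        acc ^= crc32(0L, bytes, sizeof(bytes));   // CRC32 of this word, folded in by XOR
    }
    return acc;
}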
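One limitation of a plain XOR fold is worth keeping in mind: it is blind to position, so two swapped values produce the same fingerprint and pairs of identical values cancel. If that matters, a possible variant mixes each word with its global index before folding. The sketch below is an assumption rather than Grid's method: it swaps the CRC32 for the splitmix64 finalizer (a well-known non-linear 64-bit mixer), and the names mix64, fingerprint_indexed and global_offset are illustrative.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// splitmix64 finalizer: a bijective, non-linear 64-bit mixing function.
inline std::uint64_t mix64(std::uint64_t z) {
    z ^= z >> 30; z *= 0xBF58476D1CE4E5B9ULL;
    z ^= z >> 27; z *= 0x94D049BB133111EBULL;
    z ^= z >> 31;
    return z;
}

// Position-sensitive XOR fingerprint. global_offset is the index of data[0]
// in the full (distributed) field, so per-rank results still XOR-combine
// into the same global value regardless of how the field is split.
inline std::uint64_t fingerprint_indexed(const double *data, std::size_t n,
                                         std::uint64_t global_offset = 0) {
    std::uint64_t acc = 0;
    for (std::size_t i = 0; i < n; ++i) {
        std::uint64_t bits;
        std::memcpy(&bits, &data[i], sizeof(bits));        // bit pattern of the double
        const std::uint64_t idx = global_offset + i;
        acc ^= mix64(bits ^ mix64(idx + 1));               // tie the value to its position
    }
    return acc;
}
```

The fold remains order-independent, so the on-device XOR reduction pattern still applies as long as the kernel has access to each element's global index.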

On GPU, compute the XOR reduction on-device (avoids D2H transfer of the full field):

// SYCL
uint64_t svm_xor(uint64_t *vec, uint64_t L) {
    uint64_t ret = 0;
    { sycl::buffer<uint64_t,1> abuff(&ret, {1});
      theGridAccelerator->submit([&](sycl::handler &cgh) {
        auto R = sycl::reduction(abuff, cgh, uint64_t(0), std::bit_xor<>());
        cgh.parallel_for(sycl::range<1>{L}, R,
          [=](sycl::id<1> i, auto &sum) { sum ^= vec[i]; });
      }); }
    theGridAccelerator->wait();
    return ret;
}

Per-Packet Communication Checksums

Silent data corruption in MPI buffers (documented in MPICH with device-resident buffers; see mpi-heterogeneous.md) requires per-packet verification, not just end-to-end. The pattern:

1. Before packing a send buffer, compute a GPU-side checksum of the payload.
2. Append the checksum to the host staging buffer alongside the data.
3. After receiving and copying to device, recompute the checksum on-device and compare.

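Salt each checksum with packet_index + 1000 * mpi_tag to detect transposition (packet A landing in packet B's slot). The salt is unique only while packet_index stays below 1000:

uint64_t salt = (uint64_t)packet_index + 1000ULL * mpi_tag;
// Sender: the salted checksum travels with the payload
checksum_send = checksum_gpu(payload_gpu, payload_words) ^ salt;
// ... transmit payload + checksum_send ...
// Receiver: recompute using the salt of the slot it expected to fill
checksum_recv = checksum_gpu(payload_gpu_recv, payload_words) ^ salt;
if (checksum_recv != checksum_send) {   // explicit test: assert() is compiled out under NDEBUG
    report_mismatch("packet", checksum_recv, checksum_send);
    abort();
}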

Grid reference: Grid/communicator/Communicator_mpi3.cc, #ifdef GRID_CHECKSUM_COMMS.
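The three steps and the salting can be sketched on the host side as follows. This is an illustration only, not the code in Communicator_mpi3.cc: the checksum is a plain XOR over 64-bit words computed on the CPU, the checksum is stored as one trailing word of the staging buffer, and the names checksum_words, packet_salt, pack_packet and verify_packet are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-in for the GPU-side checksum: XOR of all payload words.
inline std::uint64_t checksum_words(const std::uint64_t *w, std::size_t n) {
    std::uint64_t acc = 0;
    for (std::size_t i = 0; i < n; ++i) acc ^= w[i];
    return acc;
}

// Salt from the text: distinct per (packet, tag) while packet_index < 1000.
inline std::uint64_t packet_salt(std::uint64_t packet_index, std::uint64_t mpi_tag) {
    return packet_index + 1000ULL * mpi_tag;
}

// Sender: staging buffer = payload followed by one salted checksum word.
inline std::vector<std::uint64_t> pack_packet(const std::vector<std::uint64_t> &payload,
                                              std::uint64_t packet_index,
                                              std::uint64_t mpi_tag) {
    std::vector<std::uint64_t> buf(payload);
    buf.push_back(checksum_words(payload.data(), payload.size())
                  ^ packet_salt(packet_index, mpi_tag));
    return buf;
}

// Receiver: recompute with the salt of the slot it expected to fill.
inline bool verify_packet(const std::vector<std::uint64_t> &buf,
                          std::uint64_t expected_index,
                          std::uint64_t expected_tag) {
    if (buf.empty()) return false;
    const std::size_t n = buf.size() - 1;                 // payload words
    const std::uint64_t recomputed =
        checksum_words(buf.data(), n) ^ packet_salt(expected_index, expected_tag);
    return recomputed == buf[n];
}
```

Because the receiver derives the salt from the slot it is filling rather than from the received data, an intact packet that arrives in the wrong slot fails verification just like a corrupted one.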

Flight Recorder: Step-Level Logging

Maintain a monotonic counter that names the current operation. On a hang, this is the only way to know which operation the process is stuck in without a debugger.

struct FlightRecorder {
    std::atomic<uint64_t> step_counter{0};
    const char *step_name = "init";
    
    void step_log(const char *name) {
        step_name = name;
        step_counter.fetch_add(1, std::memory_order_relaxed);
    }
};
extern FlightRecorder gRecorder;

In Record mode, also store floating-point norms and communication checksums to vectors. In Verify mode, compare against stored values:

void norm_log(double val) {
    if (mode == Record) norm_log_vec.push_back(val);
    if (mode == Verify) {
        if (norm_counter >= norm_log_vec.size()) {   // more calls than were recorded
            std::cerr << "MISMATCH at step " << step_counter
                      << " (" << step_name << "): no recorded norm\n";
            return;
        }
        double expected = norm_log_vec[norm_counter];
        if (val != expected) {  // bit-exact for deterministic paths
            std::cerr << "MISMATCH at step " << step_counter
                      << " (" << step_name << "): "
                      << std::hexfloat << val << " vs " << expected << "\n";
            print_backtrace();
        }
        norm_counter++;
    }
}

Grid reference: Grid/util/FlightRecorder.h, Grid/util/FlightRecorder.cc.
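For readers who want to experiment outside Grid, the Record/Verify idea can be sketched as a small self-contained class. This is a hypothetical illustration, not the interface of Grid's FlightRecorder: the names LogMode and NormLog are made up here, and comparison is done on the raw bit pattern with memcmp so that the check is strictly bit-exact.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

enum class LogMode { Off, Record, Verify };

// Hypothetical record/verify log for floating-point norms.
struct NormLog {
    LogMode mode = LogMode::Off;
    std::vector<double> values;     // filled in Record mode
    std::size_t cursor = 0;         // next entry to compare in Verify mode
    std::size_t mismatches = 0;

    void begin(LogMode m) {
        mode = m;
        cursor = 0;
        mismatches = 0;
        if (m == LogMode::Record) values.clear();
    }

    // Returns true if the value was recorded or matched the recorded one.
    bool log(double val) {
        if (mode == LogMode::Record) { values.push_back(val); return true; }
        if (mode != LogMode::Verify) return true;
        // Bit-exact comparison; running past the recorded log is a mismatch.
        const bool ok = cursor < values.size() &&
                        std::memcmp(&val, &values[cursor], sizeof(double)) == 0;
        ++cursor;
        if (!ok) ++mismatches;
        return ok;
    }
};
```

A typical use is to call begin(LogMode::Record) before the first pass and begin(LogMode::Verify) before the rerun, then inspect mismatches afterwards.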

Signal Handler for Hang Detection

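Install a SIGHUP handler that dumps the current flight recorder state. Keep the handler as close to async-signal-safe as possible: format into a pre-allocated buffer and emit it with write(), not printf. Note that snprintf is not formally async-signal-safe; it is tolerated here because the handler is a last-resort diagnostic for a process that is already stuck:

static char hang_buf[4096];

static void sighup_handler(int) {
    int n = snprintf(hang_buf, sizeof(hang_buf),
        "rank=%d step=%llu name=%s\n",
        mpi_rank,
        (unsigned long long)gRecorder.step_counter.load(),
        gRecorder.step_name);
    if (n > 0) {
        // snprintf returns the untruncated length, so clamp before writing
        size_t len = (size_t)n < sizeof(hang_buf) ? (size_t)n : sizeof(hang_buf) - 1;
        write(STDERR_FILENO, hang_buf, len);
    }
    // Optional: backtrace_symbols_fd() does not call malloc, but in glibc the
    // first backtrace() call may (it loads libgcc lazily), so call backtrace()
    // once at startup before relying on it here.
    void *frames[64];
    int depth = backtrace(frames, 64);
    backtrace_symbols_fd(frames, depth, STDERR_FILENO);
}

// In main():
signal(SIGHUP, sighup_handler);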

To diagnose a hang across all ranks: run kill -HUP $(pgrep my_app) on every node of the job, or deliver the signal through the job scheduler.
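A stricter variant avoids snprintf inside the handler altogether and installs the handler with sigaction. The sketch below is an assumption-laden illustration for POSIX systems, not Grid's implementation: the globals g_step, g_name and g_dumps and the helpers format_u64, append_str and install_hang_handler are hypothetical, and it relies on the atomics being lock-free (typical on 64-bit platforms).

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <signal.h>
#include <unistd.h>

static std::atomic<std::uint64_t> g_step{0};          // current step number
static std::atomic<const char *> g_name{"init"};      // current step name (string literal)
static std::atomic<int> g_dumps{0};                   // how many dumps were emitted

// Write v in decimal into out (no terminator); returns the digit count (<= 20).
inline std::size_t format_u64(std::uint64_t v, char *out) {
    char tmp[20];
    std::size_t n = 0;
    do { tmp[n++] = static_cast<char>('0' + v % 10); v /= 10; } while (v != 0);
    for (std::size_t i = 0; i < n; ++i) out[i] = tmp[n - 1 - i];
    return n;
}

// Copy at most max_len characters of s into out; returns the count copied.
inline std::size_t append_str(const char *s, char *out, std::size_t max_len) {
    std::size_t n = 0;
    while (s[n] != '\0' && n < max_len) { out[n] = s[n]; ++n; }
    return n;
}

// Handler uses only a stack buffer, lock-free atomics and write().
static void hang_dump_handler(int) {
    char buf[128];
    std::size_t len = append_str("step=", buf, 5);
    len += format_u64(g_step.load(std::memory_order_relaxed), buf + len);
    len += append_str(" name=", buf + len, 6);
    len += append_str(g_name.load(std::memory_order_relaxed), buf + len, 64);
    buf[len++] = '\n';
    const ssize_t ignored = write(STDERR_FILENO, buf, len);
    (void)ignored;
    g_dumps.fetch_add(1, std::memory_order_relaxed);
}

// Install the handler for SIGHUP; returns true on success.
inline bool install_hang_handler() {
    struct sigaction sa;
    std::memset(&sa, 0, sizeof(sa));
    sa.sa_handler = hang_dump_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;      // do not make interrupted syscalls fail with EINTR
    return sigaction(SIGHUP, &sa, nullptr) == 0;
}
```

The MPI rank is omitted here to keep the sketch free of MPI; in practice it could be cached in a global integer at startup and formatted with the same format_u64 helper.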

What to Verify at Each Step

Data type	Fingerprint method	Frequency
Lattice fields	XOR of CRC32 over float64 words	Every algorithmic step
Communication buffers	GPU XOR reduction, salted	Every MPI operation
Scalar reductions	Bit-exact match of double	Every GlobalSum
Iteration counters	Exact integer match	Every solver iteration

When to Abort vs Continue

Abort immediately: communication checksum mismatch (data is corrupt, continuing will silently propagate errors).
Log and continue: norm mismatch in Verify mode if you need to map out which operations are unreliable.
Retry from checkpoint: double-run mismatch when the underlying bug is non-deterministic (the retry will usually pass).

Track the mismatch rate over a production run. A rate above ~1/1000 steps indicates a systemic hardware issue that should be escalated to the facility.
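Tracking the rate needs very little machinery. The following is a minimal sketch: the struct name MismatchTracker, the min_steps guard and the exact comparison are assumptions made for illustration, with the threshold taken from the ~1/1000 figure above.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical per-run mismatch counter with an escalation threshold.
struct MismatchTracker {
    std::uint64_t steps = 0;
    std::uint64_t mismatches = 0;
    double threshold = 1.0 / 1000.0;    // ~1 mismatch per 1000 steps

    void record(bool mismatch) {
        ++steps;
        if (mismatch) ++mismatches;
    }

    double rate() const {
        return steps ? static_cast<double>(mismatches) / static_cast<double>(steps) : 0.0;
    }

    // Only judge the rate once enough steps have been observed.
    bool should_escalate(std::uint64_t min_steps = 1000) const {
        return steps >= min_steps && rate() > threshold;
    }
};
```

Calling record() once per verified step and checking should_escalate() at checkpoint boundaries keeps the overhead negligible; the min_steps guard avoids raising an alarm on a single early mismatch.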
