mirror of https://github.com/paboyle/Grid.git synced 2026-05-19 08:34:32 +01:00
Grid/skills/correctness-verification.md
Peter Boyle c93b338bdd skills: HPC battle-hardening skill files for GPU+MPI correctness
Six skill files encoding expertise for making codebases robust on
problematic HPC systems, covering: correctness verification
(double-run, fingerprinting, flight recorder), hang diagnosis,
GPU runtime correctness (premature barrier, infinite poll),
MPI correctness on heterogeneous systems (device buffer aliasing,
AARCH64 PLT corruption, deterministic reductions),
compiler validation, and communication/computation overlap pipeline
design.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-18 12:10:44 -04:00


---
name: correctness-verification
description: Implement application-level correctness verification for HPC codes on unreliable hardware — double-run pattern, deterministic reductions, per-packet checksums, and flight recorder step logging.
user-invocable: true
allowed-tools:
- Read
- Bash(grep -r)
---
# Correctness Verification Infrastructure for HPC Codes
## The Problem
Leadership computing facilities sometimes have hardware or firmware bugs below the level visible to application code. The accelerator runtime can return from `q.wait()` or `cudaDeviceSynchronize()` before the work is actually complete, or DMA transfers can silently deliver wrong data. Standard testing does not catch these failures because they are non-deterministic and often topology-dependent (they appear only at specific process counts or on specific node configurations).
The symptoms look like numerical instabilities, random MPI hangs, or wrong physics results, not like crashes. Without deliberate infrastructure, diagnosing the root cause can take months.
## The Double-Run Pattern
The most reliable correctness check for non-deterministic hardware bugs is to run every computation twice and compare bit-identical fingerprints.
**Key constraint**: both runs must follow the same *deterministic* code path. Non-deterministic floating-point ordering (e.g. an `MPI_Allreduce` that uses a different reduction tree on the second run) changes the rounding and produces false mismatches. See `mpi-heterogeneous.md` for how to make reductions deterministic.
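To see why summation order matters for a bit-exact comparison, the sketch below adds the same three values in two different orders. The helper names (`sum_in_order`, `bits_of`, `demo_order_dependence`) are illustrative and not part of Grid; the example assumes IEEE 754 doubles and no `-ffast-math`.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Left-to-right sum: the rounded result depends on the order of the inputs.
double sum_in_order(const std::vector<double> &v) {
    double acc = 0.0;
    for (double x : v) acc += x;
    return acc;
}

// Raw bit pattern of a double, for bit-exact comparison.
uint64_t bits_of(double x) {
    static_assert(sizeof(uint64_t) == sizeof(double), "double must be 64 bits");
    uint64_t b;
    std::memcpy(&b, &x, sizeof(b));
    return b;
}

// Same values, two orders, different bits: a fingerprint would flag this
// as a mismatch even though no hardware fault occurred.
void demo_order_dependence() {
    const double a = sum_in_order({1e16, -1e16, 1.0});  // the 1.0 survives
    const double b = sum_in_order({1.0, -1e16, 1e16});  // the 1.0 is rounded away
    assert(bits_of(a) != bits_of(b));
}
```

A reduction whose order can change between the two runs behaves like the second call here, which is why the double-run comparison is only meaningful once the reduction order is pinned.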
```cpp
// Pseudocode: double-run a step and compare CRC fingerprints
void run_step_verified(State &state) {
    state.save_checkpoint();
    uint64_t crc_a = run_step_and_fingerprint(state);
    state.restore_checkpoint();
    uint64_t crc_b = run_step_and_fingerprint(state);
    if (crc_a != crc_b) {
        report_mismatch("step", crc_a, crc_b);
        // Policy: abort, retry from checkpoint, or continue with alarm
    }
}
```
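The pseudocode above can be made concrete with a toy state. This is a minimal sketch under stated assumptions: `State` is a plain `std::vector<double>`, copying the vector stands in for the checkpoint, and `xor_bits` is a simplified placeholder fingerprint. None of these names come from Grid.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <functional>
#include <vector>

// Toy state: a flat vector of doubles standing in for the lattice fields.
using State = std::vector<double>;

// Runs `step` twice from the same starting state and compares fingerprints.
// Returns true when both runs agree bit-for-bit.
bool run_step_verified(State &state,
                       const std::function<void(State &)> &step,
                       const std::function<uint64_t(const State &)> &fingerprint) {
    assert(step && fingerprint);
    const State checkpoint = state;             // save
    step(state);
    const uint64_t crc_a = fingerprint(state);
    state = checkpoint;                         // restore
    step(state);
    const uint64_t crc_b = fingerprint(state);
    return crc_a == crc_b;
}

// Placeholder fingerprint: XOR of the raw 64-bit words.
uint64_t xor_bits(const State &s) {
    uint64_t acc = 0;
    for (double x : s) {
        uint64_t b;
        std::memcpy(&b, &x, sizeof(b));
        acc ^= b;
    }
    return acc;
}
```

On return the state holds the result of the second run; what to do when the function returns `false` is the policy decision discussed later in this file.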
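**Fingerprinting**: XOR-fold a CRC32 over all floating-point data after each step. XOR is commutative and associative, so each rank can fingerprint its local data with no communication, and the per-rank values can be compared rank by rank or combined in any order. The same property means the fold cannot detect two words changing places; it is intended for comparing two runs of the same step. For field data: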
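```cpp
#include <cstddef>
#include <cstdint>
#include <zlib.h>   // crc32(); assumes zlib is available, any CRC32 routine will do

// XOR-fold of per-word CRC32 values; only the low 32 bits are ever set.
uint64_t fingerprint(const double *data, size_t n) {
    uint64_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        // Hash the raw bytes of each double so the comparison is bit-exact
        acc ^= crc32(0L, reinterpret_cast<const Bytef *>(&data[i]), sizeof(double));
    }
    return acc;
}
```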
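The order-independence claim can be checked directly: fingerprinting two chunks separately and XOR-ing the results gives the same value as fingerprinting the whole array. The sketch below is self-contained, so it uses a small bitwise CRC-32 instead of a library routine; `crc32_bytes`, `fingerprint_local` and `combine` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Bitwise CRC-32 (reflected polynomial 0xEDB88320), standing in for a library CRC.
static uint32_t crc32_bytes(const unsigned char *p, size_t len) {
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= p[i];
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

// Fingerprint of the data owned by one rank: XOR of per-word CRCs.
uint64_t fingerprint_local(const double *data, size_t n) {
    assert(data != nullptr || n == 0);
    uint64_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        unsigned char raw[sizeof(double)];
        std::memcpy(raw, &data[i], sizeof(raw));
        acc ^= crc32_bytes(raw, sizeof(raw));
    }
    return acc;
}

// Combining two partial fingerprints; the argument order does not matter.
uint64_t combine(uint64_t a, uint64_t b) { return a ^ b; }
```

In a distributed run the `combine` step would typically be a bitwise-XOR reduction across ranks (for example `MPI_Allreduce` with `MPI_BXOR`), or it can be skipped entirely by comparing each rank's fingerprint between the two runs locally.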
On GPU, compute the XOR reduction on-device (avoids D2H transfer of the full field):
```cpp
// SYCL
uint64_t svm_xor(uint64_t *vec, uint64_t L) {
    uint64_t ret = 0;
    {
        sycl::buffer<uint64_t, 1> abuff(&ret, {1});
        theGridAccelerator->submit([&](sycl::handler &cgh) {
            auto R = sycl::reduction(abuff, cgh, uint64_t(0), std::bit_xor<>());
            cgh.parallel_for(sycl::range<1>{L}, R,
                             [=](sycl::id<1> i, auto &sum) { sum ^= vec[i]; });
        });
    }
    theGridAccelerator->wait();
    return ret;
}
```
## Per-Packet Communication Checksums
Silent data corruption in MPI buffers (documented in MPICH with device-resident buffers; see `mpi-heterogeneous.md`) requires per-packet verification, not just end-to-end. The pattern:
1. Before packing a send buffer, compute a GPU-side checksum of the payload.
2. Append the checksum to the host staging buffer alongside the data.
3. After receiving and copying to device, recompute the checksum on-device and compare.
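Salt each checksum with `packet_index + 1000 * mpi_tag` to detect transposition (packet A landing in packet B's slot). The factor 1000 keeps salts distinct only while `packet_index` stays below 1000:
```cpp
uint64_t salt = (uint64_t)packet_index + 1000ULL * (uint64_t)mpi_tag;
// Sender: checksum the payload on-device and ship it with the data
uint64_t checksum_send = checksum_gpu(payload_gpu, payload_words) ^ salt;
// ... transmit payload + checksum_send ...
// Receiver: recompute on-device after the copy back to device memory
uint64_t checksum_recv = checksum_gpu(payload_gpu_recv, payload_words) ^ salt;
if (checksum_recv != checksum_send) {   // explicit check: assert() vanishes under NDEBUG
    std::cerr << "comms checksum mismatch: packet " << packet_index
              << " tag " << mpi_tag << "\n";
    MPI_Abort(MPI_COMM_WORLD, 1);
}
```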
Grid reference: `Grid/communicator/Communicator_mpi3.cc`, `#ifdef GRID_CHECKSUM_COMMS`.
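The three steps and the salt can be exercised on the host without MPI or a GPU. The sketch below assumes a staging layout of payload words followed by one checksum word, and uses a plain XOR of the payload as a stand-in for the on-device checksum; `checksum_words`, `packet_salt`, `pack` and `verify` are hypothetical names, not Grid's API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Host-side stand-in for the on-device checksum: XOR of all payload words.
uint64_t checksum_words(const uint64_t *words, size_t n) {
    uint64_t acc = 0;
    for (size_t i = 0; i < n; i++) acc ^= words[i];
    return acc;
}

// Salt as in the text; distinct per (packet_index, tag) while packet_index < 1000.
uint64_t packet_salt(uint64_t packet_index, uint64_t mpi_tag) {
    assert(packet_index < 1000);
    return packet_index + 1000ULL * mpi_tag;
}

// Sender: append the salted checksum to the staged payload.
std::vector<uint64_t> pack(const std::vector<uint64_t> &payload,
                           uint64_t packet_index, uint64_t mpi_tag) {
    std::vector<uint64_t> staged = payload;
    staged.push_back(checksum_words(payload.data(), payload.size())
                     ^ packet_salt(packet_index, mpi_tag));
    return staged;
}

// Receiver: recompute over the payload and compare with the trailing word.
bool verify(const std::vector<uint64_t> &staged,
            uint64_t packet_index, uint64_t mpi_tag) {
    if (staged.empty()) return false;
    const size_t n = staged.size() - 1;
    return (checksum_words(staged.data(), n) ^ packet_salt(packet_index, mpi_tag))
           == staged[n];
}
```

A packet verified under the wrong index fails even when its payload is intact, which is exactly the transposition case the salt is there to catch.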
## Flight Recorder: Step-Level Logging
Maintain a monotonic step counter together with the name of the current operation. On a hang, this is often the only way to know *which* operation the process is stuck in without attaching a debugger.
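```cpp
#include <atomic>
#include <cstdint>

struct FlightRecorder {
    std::atomic<uint64_t> step_counter{0};
    // Pass string literals only: the pointer is stored, not the characters,
    // and it may be read later from a signal handler.
    const char *step_name = "init";
    void step_log(const char *name) {
        step_name = name;
        step_counter.fetch_add(1, std::memory_order_relaxed);
    }
};
extern FlightRecorder gRecorder;
```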
In Record mode, also store floating-point norms and communication checksums to vectors. In Verify mode, compare against stored values:
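```cpp
// Needs <cstring> and <iostream>
void norm_log(double val) {
    if (mode == Record) norm_log_vec.push_back(val);
    if (mode == Verify) {
        if (norm_counter >= norm_log_vec.size()) {   // more calls than were recorded
            std::cerr << "MISMATCH: no recorded norm at step " << step_counter << "\n";
            return;
        }
        double expected = norm_log_vec[norm_counter];
        // Compare bit patterns: exact for deterministic paths, and NaN matches NaN
        if (std::memcmp(&val, &expected, sizeof(double)) != 0) {
            std::cerr << "MISMATCH at step " << step_counter
                      << " (" << step_name << "): "
                      << std::hexfloat << val << " vs " << expected << "\n";
            print_backtrace();
        }
        norm_counter++;
    }
}
```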
Grid reference: `Grid/util/FlightRecorder.h`, `Grid/util/FlightRecorder.cc`.
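The Record/Verify split can be tried in isolation with a small self-contained class. This is a sketch under assumptions, not Grid's `FlightRecorder`: the `NormLog` class, its `Mode` enum and its method names are invented for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

enum class Mode { Record, Verify };

class NormLog {
public:
    // Switching mode rewinds the cursor; entering Record discards the old log.
    void set_mode(Mode m) {
        mode_ = m;
        cursor_ = 0;
        if (m == Mode::Record) log_.clear();
    }

    // Record: store the value. Verify: compare bit-for-bit with the stored one.
    // Returns false on a mismatch or when Verify runs past the recorded log.
    bool log(double val) {
        if (mode_ == Mode::Record) { log_.push_back(val); return true; }
        assert(mode_ == Mode::Verify);
        if (cursor_ >= log_.size()) { mismatches_++; return false; }
        const double expected = log_[cursor_++];
        const bool ok = std::memcmp(&val, &expected, sizeof(double)) == 0;
        if (!ok) mismatches_++;
        return ok;
    }

    size_t mismatches() const { return mismatches_; }

private:
    Mode mode_ = Mode::Record;
    std::vector<double> log_;
    size_t cursor_ = 0;
    size_t mismatches_ = 0;
};
```

A typical use would be to record one trusted pass, switch to `Mode::Verify`, repeat the same sequence of operations, and inspect `mismatches()` at the end.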
## Signal Handler for Hang Detection
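Install a SIGHUP handler that dumps the current flight recorder state. A handler is async-signal-safe only if it restricts itself to calls such as `write()` on a pre-allocated buffer; `printf` is not safe. `snprintf` is not on the POSIX async-signal-safe list either, so the sketch below accepts that risk for brevity:
```cpp
#include <csignal>
#include <cstdio>
#include <execinfo.h>   // backtrace(), backtrace_symbols_fd() (glibc)
#include <unistd.h>

static char hang_buf[4096];
static void sighup_handler(int) {
    int n = snprintf(hang_buf, sizeof(hang_buf),
                     "rank=%d step=%llu name=%s\n",
                     mpi_rank,
                     (unsigned long long)gRecorder.step_counter.load(),
                     gRecorder.step_name);
    // snprintf returns the untruncated length, so clamp before writing
    if (n > (int)sizeof(hang_buf) - 1) n = (int)sizeof(hang_buf) - 1;
    if (n > 0) write(STDERR_FILENO, hang_buf, n);
    // Optional backtrace. backtrace_symbols_fd() does not call malloc, but the
    // first backtrace() call may load libgcc, hence the warm-up call in main().
    void *frames[64];
    int depth = backtrace(frames, 64);
    backtrace_symbols_fd(frames, depth, STDERR_FILENO);
}
// In main():
void *warm[1];
backtrace(warm, 1);               // load the unwinder outside signal context
signal(SIGHUP, sighup_handler);
```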
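To diagnose a hang: `kill -HUP $(pgrep my_app)` signals only the ranks on the local node. To reach every rank, send the signal through the job scheduler (for example `scancel --signal=HUP <jobid>` under Slurm).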
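If strict async-signal safety matters, the formatting can be done by hand so that the handler calls nothing but `write()`. The following is a minimal sketch; `format_u64`, `append_str`, `build_step_line` and `check_formatter` are hypothetical helpers, and the line format is only an example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Writes the decimal digits of `v` into buf (no terminator) and returns the
// number of characters written, or 0 if `cap` is too small. No libc calls.
size_t format_u64(char *buf, size_t cap, uint64_t v) {
    char tmp[20];                         // 2^64 - 1 has 20 decimal digits
    size_t n = 0;
    do { tmp[n++] = char('0' + v % 10); v /= 10; } while (v != 0);
    if (n > cap) return 0;
    for (size_t i = 0; i < n; i++) buf[i] = tmp[n - 1 - i];
    return n;
}

// Copies a NUL-terminated string into buf starting at `pos`, bounded by `cap`.
size_t append_str(char *buf, size_t cap, size_t pos, const char *s) {
    while (*s != '\0' && pos < cap) buf[pos++] = *s++;
    return pos;
}

// Builds "step=<n> name=<name>\n" and returns its length; safe in a handler.
size_t build_step_line(char *buf, size_t cap, uint64_t step, const char *name) {
    size_t pos = append_str(buf, cap, 0, "step=");
    pos += format_u64(buf + pos, cap - pos, step);
    pos = append_str(buf, cap, pos, " name=");
    pos = append_str(buf, cap, pos, name);
    return append_str(buf, cap, pos, "\n");
}

// Startup self-check, meant to run once in main(), never inside the handler.
void check_formatter() {
    char buf[32];
    const size_t n = build_step_line(buf, sizeof(buf), 7, "x");
    assert(n == 14 && std::memcmp(buf, "step=7 name=x\n", n) == 0);
}
```

Inside the handler, the returned length goes straight to `write(STDERR_FILENO, buf, n)` with a static buffer.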
## What to Verify at Each Step
| Data type | Fingerprint method | Frequency |
|---|---|---|
| Lattice fields | XOR of CRC32 over float64 words | Every algorithmic step |
| Communication buffers | GPU XOR reduction, salted | Every MPI operation |
| Scalar reductions | Bit-exact match of double | Every GlobalSum |
| Iteration counters | Exact integer match | Every solver iteration |
## When to Abort vs Continue
- **Abort immediately**: communication checksum mismatch (data is corrupt, continuing will silently propagate errors).
- **Log and continue**: norm mismatch in Verify mode if you need to map out which operations are unreliable.
- **Retry from checkpoint**: double-run mismatch when the underlying bug is non-deterministic (a retry will usually pass).
Track the mismatch rate over a production run. A rate above ~1/1000 steps indicates a systemic hardware issue that should be escalated to the facility.
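The policy list and the rate threshold above can be written down as a small dispatch function plus a counter. This is one possible encoding, with the enum and struct names (`Check`, `Action`, `MismatchRate`) invented for the example and the 1/1000 threshold taken from the text.

```cpp
#include <cassert>
#include <cstdint>

enum class Check  { CommsChecksum, VerifyNorm, DoubleRun };
enum class Action { Abort, LogAndContinue, RetryFromCheckpoint };

// Maps the kind of failed check to the response listed above.
Action action_for(Check failed) {
    switch (failed) {
        case Check::CommsChecksum: return Action::Abort;
        case Check::VerifyNorm:    return Action::LogAndContinue;
        case Check::DoubleRun:     return Action::RetryFromCheckpoint;
    }
    return Action::Abort;   // not reached for valid input; fail safe
}

// Running mismatch rate over a production run.
struct MismatchRate {
    uint64_t steps = 0;
    uint64_t mismatches = 0;

    void record(bool mismatch) {
        steps++;
        if (mismatch) mismatches++;
    }

    // True when mismatches / steps exceeds 1/1000, written without division.
    bool should_escalate() const {
        assert(mismatches <= steps);
        return mismatches * 1000 > steps;
    }
};
```

With so simple a rule, a single mismatch early in a run already exceeds the threshold, so a real deployment would probably also require a minimum number of steps before escalating.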