mirror of
https://github.com/paboyle/Grid.git
synced 2026-05-23 02:24:17 +01:00
skills: HPC battle-hardening skill files for GPU+MPI correctness
Six skill files encoding expertise for making codebases robust on problematic HPC systems, covering: correctness verification (double-run, fingerprinting, flight recorder), hang diagnosis, GPU runtime correctness (premature barrier, infinite poll), MPI correctness on heterogeneous systems (device buffer aliasing, AARCH64 PLT corruption, deterministic reductions), compiler validation, and communication/computation overlap pipeline design.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@@ -0,0 +1,154 @@
---
name: compiler-validation
description: Identify GPU compiler code generation bugs, distinguish them from hardware and runtime bugs, construct minimal reproducers, and validate correctness of generated assembly for performance-critical HPC kernels.
user-invocable: true
allowed-tools:
  - Read
  - Bash(grep -r)
  - Bash(objdump)
---

# Compiler Validation for GPU HPC Codes

## Why Compiler Bugs Are Distinct

Compiler bugs have a characteristic diagnostic signature: they produce *deterministically wrong* results. The same binary given the same input always produces the same wrong output. This distinguishes them from:

- Hardware bugs: usually stochastic (wrong answer sometimes, correct answer other times)
- Runtime bugs (premature barrier, buffer aliasing): often stochastic or history-dependent
- Race conditions: non-deterministic

**The determinism test**: run the same kernel 100 times with the same input and compare the outputs bit for bit. If the wrong answer is always the same wrong answer, suspect the compiler. Determinism is strong evidence rather than proof, so follow up with the tests in "Separating Compiler from Runtime/Hardware".
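
The determinism test can be automated on the host. The following is a minimal sketch, not part of the original skill file: `classify_determinism` and `bits_of` are hypothetical names, and `kernel` stands for any callable that launches the kernel under test and returns one representative result (for example a checksum of the output buffer).

```cpp
#include <cstdint>
#include <cstring>
#include <functional>
#include <set>
#include <string>

// Reinterpret a double as its 64-bit pattern so that comparisons are bitwise
// exact (this also distinguishes +0.0 from -0.0 and different NaN payloads).
static std::uint64_t bits_of(double x) {
    std::uint64_t u;
    std::memcpy(&u, &x, sizeof u);
    return u;
}

// Hypothetical helper: run `kernel` `runs` times (runs >= 1) and classify
// the outcome against the known-good `expected` value.
std::string classify_determinism(const std::function<double()>& kernel,
                                 double expected, int runs = 100) {
    std::set<std::uint64_t> seen;
    for (int r = 0; r < runs; ++r) seen.insert(bits_of(kernel()));
    if (seen.size() > 1) return "non-deterministic";           // race, runtime or hardware
    if (*seen.begin() == bits_of(expected)) return "correct";
    return "deterministically wrong";                           // suspect the compiler
}
```

A result of `"deterministically wrong"` is the cue to start the minimal reproducer protocol; `"non-deterministic"` points towards races, runtime bugs, or hardware instead.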

## The Minimal Reproducer Protocol

When a kernel produces wrong results, confirm or rule out the compiler as quickly as possible:

**Step 1: Eliminate the physics**. Reduce the failing kernel to the smallest possible computation that still exhibits the bug. Replace QCD fields with `double` arrays. Replace lattice operations with scalar arithmetic. The goal is a 20-line CUDA/HIP/SYCL file that any compiler engineer can compile and run.

**Step 2: Sweep the optimisation levels**. Compile at `-O0` (or equivalent). If the answer becomes correct, the bug is most likely in an optimisation pass (or in undefined behaviour in the source that only the optimiser exploits). Then test `-O1`, `-O2`, `-O3` in turn to find the lowest optimisation level that introduces the bug.

```bash
# HIP example
hipcc -O0 minimal_repro.cc -o test_O0 && ./test_O0   # should be correct
hipcc -O1 minimal_repro.cc -o test_O1 && ./test_O1   # compare
hipcc -O2 minimal_repro.cc -o test_O2 && ./test_O2   # compare
hipcc -O3 minimal_repro.cc -o test_O3 && ./test_O3   # compare
```
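
**Step 3: Identify the optimisation pass**. For clang-based compilers (clang, hipcc, dpcpp/icpx), disable one transformation at a time. nvcc does not accept these flags; with nvcc, vary the ptxas optimisation level instead (`-Xptxas -O0` up to `-Xptxas -O3`). Once a flag changes the result, LLVM's `-opt-bisect-limit=N` option (passed as `-mllvm -opt-bisect-limit=N`) can narrow the search further.

```bash
# Disable individual optimisations:
hipcc -O2 -fno-unroll-loops minimal_repro.cc -o test_nounroll
hipcc -O2 -fno-vectorize minimal_repro.cc -o test_novec
hipcc -O2 -fno-slp-vectorize minimal_repro.cc -o test_noslp
```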
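
**Step 4: Inspect the generated code**. For CUDA, compile with `--generate-line-info` and disassemble with `cuobjdump`. For HIP, `--save-temps` keeps the device assembly, and `roc-obj-extract` can pull the device code object out of an existing binary:

```bash
# CUDA: PTX is the intermediate form, SASS is what the GPU executes
nvcc -O2 --generate-line-info --keep -c minimal_repro.cu -o minimal_repro.o
cuobjdump --dump-ptx minimal_repro.o
cuobjdump --dump-sass minimal_repro.o
# HIP/ROCm: --save-temps leaves the device assembly in a .s file
# (the file name varies by ROCm version; it typically contains "amdgcn")
hipcc -O2 --offload-arch=gfx90a --save-temps -c minimal_repro.cc -o minimal_repro.o
ls *amdgcn*.s
# SYCL/DPC++: the AST dump shows the front end's view, not the generated code
icpx -O2 -fsycl -Xclang -ast-dump minimal_repro.cc 2>&1 | grep -A5 "suspicious_expr"
```

Look for: incorrect register spill/fill sequences, loop trip count miscalculation, vectorisation across iteration boundaries, incorrect address arithmetic.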

## Known Compiler Bug Patterns in GPU Code

### Register Pressure / Spill Bugs

High register usage forces spills to local memory. Some compiler versions generate incorrect spill/fill code: the value is written to local memory, but a stale register value is read back instead of the spilled value.

**Signature**: Wrong answer in high-register-count kernels that changes or disappears when the register budget changes, either lowered with `--maxrregcount=N` (more spilling, different spill code) or raised to the hardware maximum (`--maxrregcount=255`, fewer spills).

**Diagnostic**: Check register usage:

```bash
# CUDA: ptxas reports register count and spill stores/loads per kernel
nvcc -O2 --ptxas-options=-v -c minimal_repro.cu 2>&1 | grep -E "registers|spill"
# HIP/ROCm: the saved device assembly carries per-kernel resource comments
hipcc -O2 --offload-arch=gfx90a --save-temps -c minimal_repro.cc
grep -iE "vgpr|scratch" *amdgcn*.s
```

### Vectorisation Across Loop Boundaries

The compiler packs two successive loop iterations into one SIMD operation even though they have a data dependency, because its dependence analysis has incorrectly concluded that none exists.

**Signature**: Wrong answer that becomes correct when the loop body is extracted to a non-inlined function (which prevents auto-vectorisation across iterations) or when the file is compiled with `-fno-vectorize`.
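
As an illustration of the "extract to a non-inlined function" test, here is a host-side sketch under stated assumptions: the recurrence, the `REPRO_NOINLINE` macro, and the function names are hypothetical and not taken from any particular code base. The loop has a true loop-carried dependency, so a correct compiler must produce identical results for both variants.

```cpp
#include <cstddef>
#include <vector>

#if defined(__GNUC__) || defined(__clang__)
#define REPRO_NOINLINE __attribute__((noinline))
#else
#define REPRO_NOINLINE  // assumption: other compilers need their own spelling
#endif

// Loop body kept out of line so the optimiser cannot merge iterations i and i+1.
REPRO_NOINLINE double step(double prev, double x) { return 0.5 * prev + x; }

// Variant A: body outlined; y[i] depends on y[i-1].
std::vector<double> recurrence_outlined(const std::vector<double>& x) {
    std::vector<double> y(x.size());
    double prev = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        prev = step(prev, x[i]);
        y[i] = prev;
    }
    return y;
}

// Variant B: same recurrence with the body inline, exposed to the vectoriser.
std::vector<double> recurrence_inline(const std::vector<double>& x) {
    std::vector<double> y(x.size());
    double prev = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        prev = 0.5 * prev + x[i];
        y[i] = prev;
    }
    return y;
}
```

If the two variants disagree for the same input, the inline variant has been transformed illegally, and the pair makes a compact reproducer for the bug report.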

### Incorrect Constant Propagation

The compiler evaluates a compile-time expression incorrectly, substituting a wrong constant. Common in template-heavy code where `sizeof(T)` or `alignof(T)` is used in arithmetic that the compiler folds at compile time.

**Signature**: Wrong array index or wrong stride. Inspecting the generated assembly shows a literal constant where you expect a computed value, or a literal that differs from the value you work out by hand.
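
One way to cross-check a folded constant is to compute the same quantity at run time from real addresses. The sketch below is an assumption-laden illustration: `folded_offset` and `measured_offset` are hypothetical helpers, and the struct `S` mirrors the 24-byte struct used in the stress patterns.

```cpp
#include <cstddef>

struct S { double x[3]; };  // 24 bytes on platforms with 8-byte double

// Compile-time view: the compiler folds this expression to a literal.
template <typename T>
constexpr std::size_t folded_offset(std::size_t i) { return i * sizeof(T); }

// Run-time view: measure the byte offset between real array elements.
// The volatile index stops the compiler from folding the expression away.
// Valid for i < 8 (the size of the probe array).
template <typename T>
std::size_t measured_offset(std::size_t i) {
    static T probe[8];
    volatile std::size_t idx = i;
    const std::size_t k = idx;
    const char* base = reinterpret_cast<const char*>(&probe[0]);
    const char* elem = reinterpret_cast<const char*>(&probe[k]);
    return static_cast<std::size_t>(elem - base);
}

static_assert(sizeof(S) == 24, "layout assumption for this sketch");
static_assert(folded_offset<S>(5) == 120, "folded constant differs from hand calculation");
```

Comparing `measured_offset<S>(i)` with `folded_offset<S>(i)` inside the reproducer gives a self-checking test: a mismatch means the folded constant and the real memory layout disagree.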

## Stress Patterns for Compiler Validation

These patterns exercise the compiler in ways that commonly expose bugs:

```cpp
// 1. Aliased pointer write followed by immediate read
// (tests correct handling of the read-after-write dependency on a[i];
//  the pointers are not __restrict__, so the compiler must assume a and b may alias)
__global__ void alias_stress(double *a, double *b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        a[i] = a[i] * 2.0;
        b[i] = a[i] + 1.0;  // must read the updated value, not the original
    }
}

// 2. Mixed-precision accumulation
// (tests float-to-double promotion before each add; launch with <<<1, 1>>>,
//  because otherwise every thread repeats the loop and writes the same *out)
__global__ void precision_stress(const float *in, double *out, int n) {
    double acc = 0.0;
    for (int i = 0; i < n; i++) acc += (double)in[i];
    *out = acc;
}

// 3. Large struct in shared memory
// (tests alignment and offset calculation for non-power-of-2-sized objects;
//  launch one block of n threads with n * sizeof(S) bytes of dynamic shared memory)
struct S { double x[3]; };  // sizeof = 24 bytes, not a power of 2
__global__ void struct_stress(const S *in, S *out, int n) {
    extern __shared__ S smem[];
    int tid = threadIdx.x;
    if (tid < n) smem[tid] = in[tid];
    __syncthreads();
    if (tid < n) out[tid] = smem[(tid + 1) % blockDim.x];
}
```
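
Each stress kernel needs a trusted answer to compare against. The host-side references below are a sketch, not part of the original file: the `*_reference` names are hypothetical, and `struct_stress_reference` assumes a launch with a single block of `n` threads.

```cpp
#include <cstddef>
#include <vector>

struct S { double x[3]; };  // same layout as the device-side struct

// Host reference for alias_stress; a and b may point to the same storage.
void alias_stress_reference(double* a, double* b, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        a[i] = a[i] * 2.0;
        b[i] = a[i] + 1.0;  // reads the value stored on the previous line
    }
}

// Host reference for precision_stress: promote each float, then add in order.
double precision_stress_reference(const std::vector<float>& in) {
    double acc = 0.0;
    for (float v : in) acc += static_cast<double>(v);
    return acc;
}

// Host reference for struct_stress: each element receives its right-hand
// neighbour, wrapping around at the end of the block.
std::vector<S> struct_stress_reference(const std::vector<S>& in) {
    const std::size_t n = in.size();
    std::vector<S> out(n);
    for (std::size_t i = 0; i < n; ++i) out[i] = in[(i + 1) % n];
    return out;
}
```

Because these kernels only copy values or add them in a fixed order, the device output should match the host reference bit for bit, so the comparison can use exact equality rather than a tolerance.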

## Separating Compiler from Runtime/Hardware

When results are deterministically wrong:

| Test | Compiler bug | Runtime/hardware bug |
|---|---|---|
| Recompile at -O0 | Usually fixes it | Usually no effect |
| Run on CPU (host code equivalent) | Fixes it | Also fixes it if the fault is in the GPU or its runtime, so this mainly confirms the source is correct |
| Reorder loop iterations | Changes wrong answer | No effect or different pattern |
| Different compiler version | Fixes or changes wrong answer | No effect |
| Different GPU of same model | Same wrong answer | Different or no error |
| Different GPU model | May fix it (ISA-specific codegen bug) | May or may not fix |

## Reporting to Compiler Teams

A compiler bug report needs:
1. Minimal reproducer (< 50 lines)
2. Compiler version (`hipcc --version`, `nvcc --version`, `icpx --version`)
3. GPU model and driver version
4. Exact wrong and correct answers (hexfloat for reproducibility)
5. Which compile flags change the behaviour
6. Generated assembly for the correct and incorrect variants

File with: the LLVM issue tracker on GitHub (for clang and LLVM backends; the old LLVM Bugzilla is read-only), the vendor's issue tracker or developer forum (hipcc, dpcpp), or the NVIDIA bug portal (nvcc/ptxas). The minimal reproducer is the single most important element: without it, compiler teams cannot reproduce or prioritise the bug.
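
Item 4 asks for hexfloat output because decimal printing can hide a one-bit difference. A minimal sketch using the standard `%a` conversion follows; the helper names `to_hexfloat` and `from_hexfloat` are hypothetical.

```cpp
#include <cstdio>
#include <cstdlib>
#include <string>

// Format a double as a C99 hexadecimal floating-point literal, which
// represents every bit of the value exactly.
std::string to_hexfloat(double x) {
    char buf[64];
    std::snprintf(buf, sizeof buf, "%a", x);
    return buf;
}

// Parse a hexfloat string back to the identical double.
double from_hexfloat(const std::string& s) {
    return std::strtod(s.c_str(), nullptr);
}
```

Printing both the expected and the observed value with `to_hexfloat` lets the compiler team paste the exact bit patterns back into a test, and `from_hexfloat` recovers them without rounding.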

## Pragmatic In-Production Workaround

When a compiler bug is confirmed but the fix is not yet available, the lowest-risk workaround is to compile only the affected function with reduced optimisation:

```cpp
#pragma clang optimize off  // clang/hipcc/dpcpp: applies to functions defined below
void affected_kernel_host_wrapper() { /* affected code goes here */ }
#pragma clang optimize on   // restore optimisation for the rest of the file
// GCC host compilers: void __attribute__((optimize("O0"))) f();  clang uses __attribute__((optnone))
// For device code, use per-file compilation flags via CMake/Makefile
```

Document the workaround with a comment referencing the compiler bug report number so it can be removed when the compiler is updated.