1
0
mirror of https://github.com/paboyle/Grid.git synced 2026-05-19 00:24:32 +01:00
Files
Grid/skills/communication-overlap.md
T
Peter Boyle c93b338bdd skills: HPC battle-hardening skill files for GPU+MPI correctness
Six skill files encoding expertise for making codebases robust on
problematic HPC systems, covering: correctness verification
(double-run, fingerprinting, flight recorder), hang diagnosis,
GPU runtime correctness (premature barrier, infinite poll),
MPI correctness on heterogeneous systems (device buffer aliasing,
AARCH64 PLT corruption, deterministic reductions),
compiler validation, and communication/computation overlap pipeline
design.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-18 12:10:44 -04:00

7.6 KiB

name, description, user-invocable, allowed-tools
name description user-invocable allowed-tools
communication-overlap Design and implement communication/computation overlap pipelines for GPU+MPI codes — per-packet event tracking, host-staging through pinned memory, internode/intranode bandwidth separation, and the 7-phase pipeline pattern that replaces broken accelerator-aware MPI paths. true
Read
Bash(grep -r)

Communication/Computation Overlap Pipeline Design

Why GPU-Direct MPI Is Often Not the Right Default

GPU-direct RDMA (passing GPU buffer pointers directly to MPI) is appealing because it eliminates explicit D2H/H2D copies. In practice on several leadership systems:

  • Bandwidth: RDMA at 30% of wirespeed has been observed on Pontevecchio/Aurora. The overhead of staging through pinned host memory can be lower total latency than slow RDMA.
  • Correctness: Device buffer aliasing in MPI_Sendrecv (see mpi-heterogeneous.md) makes direct GPU-to-GPU transfer unreliable.
  • Overlap: Host-staging enables fine-grained overlap — each packet's D2H can be issued as a separate asynchronous event, and the corresponding MPI send can fire as soon as that packet arrives in host memory, not after all packets are ready.

The pipeline pattern below was developed to replace broken MPICH accelerator-aware paths. It achieves genuine computation/communication overlap by tracking per-packet GPU events.

The 7-Phase Pipeline

Given a set of halo exchange operations (each identified by a packet_index):

Phase 0: Prepare data on device

Pack halo data into contiguous GPU buffers. One buffer per direction/neighbour.

Phase 1: Post receives + start D2H

Post all MPI_Irecv calls immediately (into pinned host buffers). Simultaneously, start asynchronous D2H copies for all send buffers:

for (auto &pkt : send_packets) {
    MPI_Irecv(pkt.host_recv_buf, pkt.bytes, MPI_BYTE,
               pkt.src_rank, pkt.tag, comm, &pkt.recv_req);
    
    acceleratorCopyFromDeviceAsync(pkt.device_send_buf,
                                   pkt.host_send_buf,
                                   pkt.bytes, &pkt.d2h_event);
}

The key: pkt.d2h_event is a per-packet GPU event (e.g. cudaEvent_t, hipEvent_t, or SYCL event). We can poll individual packet completion rather than waiting for all.

Phase 2: Fire sends as D2H completes (packet by packet)

Poll packet D2H events. As each packet becomes ready in host memory, immediately fire the corresponding MPI_Isend. Also start intranode D2D copies at this point — these are deferred until now to avoid competing with the internode D2H on PCIe bandwidth:

bool all_sent = false;
while (!all_sent) {
    all_sent = true;
    for (auto &pkt : send_packets) {
        if (!pkt.sent && acceleratorEventIsComplete(pkt.d2h_event)) {
            MPI_Isend(pkt.host_send_buf, pkt.bytes, MPI_BYTE,
                       pkt.dst_rank, pkt.tag, comm, &pkt.send_req);
            pkt.sent = true;
            start_intranode_copy(pkt);  // now safe, D2H is done
        }
        if (!pkt.sent) all_sent = false;
    }
}

Phase 3: Poll receives + start H2D as each arrives

MPI_Test individual receive requests. As each completes, immediately start the H2D copy into device-resident halo buffer:

bool all_recvd = false;
while (!all_recvd) {
    all_recvd = true;
    for (auto &pkt : recv_packets) {
        if (!pkt.h2d_started) {
            int flag = 0;
            MPI_Test(&pkt.recv_req, &flag, MPI_STATUS_IGNORE);
            if (flag) {
                acceleratorCopyToDeviceAsync(pkt.host_recv_buf,
                                              pkt.device_recv_buf,
                                              pkt.bytes, &pkt.h2d_event);
                pkt.h2d_started = true;
            }
        }
        if (!pkt.h2d_started) all_recvd = false;
    }
}

Phase 4: Wait for all sends

std::vector<MPI_Request> send_reqs;
for (auto &pkt : send_packets) send_reqs.push_back(pkt.send_req);
MPI_Waitall(send_reqs.size(), send_reqs.data(), MPI_STATUSES_IGNORE);

Phase 5: Wait for all H2D copies

for (auto &pkt : recv_packets) acceleratorEventWait(pkt.h2d_event);

Phase 6: Run interior computation

The interior (non-halo) computation can run from Phase 1 onwards, overlapped with all of the above:

// Launched in Phase 1, runs in parallel with the pipeline
accelerator_for(ss, interior_sites, ...) { compute_interior(ss); }

Synchronise with interior before using the full field:

accelerator_barrier();  // interior kernel done
// Halo H2D is also complete (Phase 5 above)
// Now safe to use full field

Per-Packet Event Tracking Data Structure

struct Packet {
    // Buffers
    void *device_send_buf;
    void *host_send_buf;    // pinned
    void *device_recv_buf;
    void *host_recv_buf;    // pinned
    size_t bytes;
    
    // MPI
    int src_rank, dst_rank, tag;
    MPI_Request send_req, recv_req;
    
    // GPU events (one per packet, not one global barrier)
    AcceleratorEvent d2h_event;
    AcceleratorEvent h2d_event;
    
    // State flags
    bool sent = false;
    bool h2d_started = false;
};

The critical design point: d2h_event and h2d_event are per-packet, not global. This allows the MPI send for packet 0 to fire while packet 1's D2H is still in progress.

Internode vs Intranode Separation

PCIe (GPU-to-CPU) and NVLink/xGMI (GPU-to-GPU within a node) are separate bandwidth resources. They do not compete with each other, but they do compete with each other for transactions if both are active simultaneously.

Strategy: complete all internode D2H copies first (to maximise NIC injection bandwidth), then start intranode D2D copies (which use NVLink/xGMI and do not contend with PCIe for internode traffic):

// In Phase 2: start intranode D2D only after D2H is confirmed complete
if (pkt.is_intranode && pkt.d2h_done) {
    // Use peer access (cudaMemcpyPeerAsync / hipMemcpyPeerAsync)
    // rather than staging through host for intranode
    cudaMemcpyPeerAsync(pkt.peer_recv_buf, pkt.dst_device,
                         pkt.device_send_buf, pkt.src_device,
                         pkt.bytes, computeStream);
}

Grid reference: Grid/communicator/Communicator_mpi3.cc — search for NVLINK_GET and ACCELERATOR_AWARE_MPI conditional blocks.

Pinned Memory Allocation

All host staging buffers must be pinned (page-locked) for async D2H/H2D:

// CUDA
cudaMallocHost(&host_buf, bytes);
cudaFreeHost(host_buf);

// HIP  
hipHostMalloc(&host_buf, bytes, hipHostMallocDefault);
hipHostFree(host_buf);

// SYCL
host_buf = sycl::malloc_host(bytes, *queue);
sycl::free(host_buf, *queue);

Pre-allocate at startup. Repeated cudaMallocHost in the hot path adds latency from the OS memory manager.

Checksumming in the Pipeline

Insert checksum computation before D2H (on the GPU-resident data) and verification after H2D (on the received GPU-resident data). See correctness-verification.md for the checksum pattern. The salting (packet_index + 1000 * tag) detects packet transposition — critical for diagnosing MPI buffer aliasing bugs where two packets' contents are swapped.

Smoke Test for a New System

Before running physics, validate the pipeline on a synthetic benchmark:

// Send a buffer of known values, receive and check
// Run at multiple message sizes: 4KB, 64KB, 1MB, 16MB
// Run at multiple process counts: 2, 8, 64, 512
// Verify checksums on every packet
// Measure bandwidth: should be ≥ 80% of FDR/HDR/NDR peak for host-staged

Any bandwidth below 50% of theoretical, or any checksum failure, indicates a problem in the communication stack that must be resolved before production runs.