GPU Direct RDMA

Overview

LCI can send/receive/put/get GPU-resident buffers directly over RDMA, as long as the hardware and driver stack support GPU Direct RDMA.

This document summarizes the required steps, followed by some additional notes.

Requirements

1) Enable CUDA support at build time

Build LCI with CUDA support enabled:

cmake -DLCI_USE_CUDA=ON ...

Other CUDA-related options you may want to consider:

-DLCI_CUDA_ARCH=<your_cuda_architecture> # e.g., 80 for A100, 90 for H100
-DLCI_CUDA_STANDARD=<cuda_standard> # e.g., 20
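
For instance, a complete configure-and-build might look like the following sketch; the build directory and architecture value are placeholders for your environment:

cmake -S . -B build -DLCI_USE_CUDA=ON -DLCI_CUDA_ARCH=80
cmake --build build --parallel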

2) Ensure memory registration is GPU-aware

LCI needs to know that a buffer is device memory. You can satisfy this in two ways:

  • Explicitly register the buffer using lci::register_memory. LCI will detect that the buffer is GPU memory and register it accordingly.
  • Use the optional argument mr and set it to MR_DEVICE or MR_UNKNOWN when posting operations. MR_UNKNOWN lets LCI detect the memory type at runtime. Both approaches are shown in the examples below.

3) Pass GPU buffers to communication APIs

Simply pass GPU memory pointers to the send/recv/put/get APIs as you would host pointers.
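
For example, the gpu_buffer used in the examples below can be allocated with ordinary CUDA calls (a minimal sketch; msg_size is a placeholder):

// Allocate a device buffer; its pointer is passed to LCI like any host pointer.
void* gpu_buffer = nullptr;
cudaError_t err = cudaMalloc(&gpu_buffer, msg_size);
// (error handling omitted: check err == cudaSuccess)
...
cudaFree(gpu_buffer);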

Example

Sending a GPU buffer with automatic memory registration:

// If you know the buffer is GPU memory:
lci::status_t status = lci::post_send_x(rank, gpu_buffer, msg_size, tag, comp).mr(lci::MR_DEVICE)();
// If you are unsure about the memory type, use MR_UNKNOWN and let LCI detect it:
status = lci::post_send_x(rank, generic_buffer, msg_size, tag, comp).mr(lci::MR_UNKNOWN)();
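
The comp argument above is a completion object. A minimal sketch of waiting for completion, assuming LCI's sync-based completion API (alloc_sync, sync_test, and progress; consult the LCI documentation for the exact signatures):

// Create a sync completion object, post the send, and drive progress
// until the operation completes (inline-completion and error handling omitted).
lci::comp_t comp = lci::alloc_sync();
lci::post_send_x(rank, gpu_buffer, msg_size, tag, comp).mr(lci::MR_DEVICE)();
lci::status_t status;
while (!lci::sync_test(comp, &status)) {
  lci::progress();
}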

Alternatively, you can explicitly register a GPU buffer and use it in a send operation:

lci::mr_t mr = lci::register_memory(gpu_buffer, size);
lci::status_t status = lci::post_send_x(rank, gpu_buffer, msg_size, tag, comp).mr(mr)();
// You can also use part of the registered region for communication
status = lci::post_send_x(rank, gpu_buffer + offset, smaller_size, tag, comp).mr(mr)();
...
lci::deregister_memory(mr);
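
Memory registration is a comparatively expensive operation, so for long-lived buffers it is usually best to register once, reuse the same mr across many operations, and deregister only at teardown, as the snippet above does.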

The same applies to put/get operations:

// Register the GPU buffer and exchange the rmr_t with all processes
lci::mr_t mr = lci::register_memory(gpu_buffer, size);
lci::rmr_t rmr = lci::get_rmr(mr);
std::vector<lci::rmr_t> rmrs(lci::get_rank_n());  // one rmr per rank
lci::allgather(&rmr, rmrs.data(), sizeof(lci::rmr_t));
// Perform the put operation (get is analogous)
lci::status_t status = lci::post_put_x(rank, local_gpu_buffer, size, comp, remote_offset, rmrs[rank]).mr(mr)();
...
lci::deregister_memory(mr);

Refer to tests/unit/accelerator for a complete example of using LCI with GPU Direct RDMA.

Notes

  • Currently, LCI only supports NVIDIA GPUs; support for other vendors is planned.
  • It is challenging to make active messages GPU-direct because receive buffers are allocated by LCI on the host. If you need GPU-resident active-message receives, override the packet pool with a GPU-aware allocator; see the sketch below.
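
As an illustration, a GPU-aware allocator could back packet allocations with device memory, as in the hypothetical sketch below (plain CUDA calls; the interface shown is illustrative, not LCI's actual packet-pool customization API):

// Hypothetical GPU-aware allocator backing allocations with device memory.
struct gpu_allocator_t {
  void* allocate(size_t size) {
    void* ptr = nullptr;
    if (cudaMalloc(&ptr, size) != cudaSuccess) return nullptr;
    return ptr;
  }
  void deallocate(void* ptr) { cudaFree(ptr); }
};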