Overview
LCI can send/receive/put/get GPU-resident buffers directly over RDMA, as long as the hardware and driver stack support GPU Direct RDMA.
LCI supports both NVIDIA GPUs (via CUDA) and AMD GPUs (via HIP/ROCm).
This document summarizes the required steps and notes.
Requirements
1) Enable GPU support at build time
Build LCI with either CUDA or HIP support enabled (the two are mutually exclusive; see the note below):
For NVIDIA GPUs (CUDA):
cmake -DLCI_USE_CUDA=ON ...
Other CUDA-related options you may want to consider:
-DLCI_CUDA_ARCH=<your_cuda_architecture> # e.g., 80 for A100, 90 for H100
-DLCI_CUDA_STANDARD=<cuda_standard> # e.g., 20
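For example, a complete configure invocation for an A100 node might look like the following; the source and build paths are placeholders for your own checkout:

```shell
# Configure LCI with CUDA support for an NVIDIA A100 (compute capability 80).
# Adjust the -S/-B paths to your source tree and build directory.
cmake -S lci -B lci/build \
      -DLCI_USE_CUDA=ON \
      -DLCI_CUDA_ARCH=80 \
      -DLCI_CUDA_STANDARD=20
cmake --build lci/build
```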
For AMD GPUs (HIP/ROCm):
cmake -DLCI_USE_HIP=ON ...
Other HIP-related options you may want to consider:
-DLCI_HIP_ARCH=<your_hip_architecture> # e.g., gfx90a for MI250X, gfx942 for MI300A/MI300X/MI325X, gfx950 for MI350X/MI355X
-DLCI_HIP_STANDARD=<hip_standard> # e.g., 20
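Similarly, a complete configure invocation for an MI250X node might look like this (paths are placeholders):

```shell
# Configure LCI with HIP/ROCm support for an AMD MI250X (gfx90a).
# Adjust the -S/-B paths to your source tree and build directory.
cmake -S lci -B lci/build \
      -DLCI_USE_HIP=ON \
      -DLCI_HIP_ARCH=gfx90a \
      -DLCI_HIP_STANDARD=20
cmake --build lci/build
```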
Note: LCI_USE_CUDA and LCI_USE_HIP are mutually exclusive. Choose the GPU vendor matching your hardware at compile time. For heterogeneous clusters with different GPU vendors on different nodes, compile separate binaries for each node type.
2) Ensure memory registration is GPU-aware
LCI needs to know that a buffer is device memory. You can satisfy this in two ways:
- Explicitly register the buffer with lci::register_memory. LCI will detect that the buffer is GPU memory and register it accordingly.
- Set the optional argument mr to MR_DEVICE or MR_UNKNOWN when posting operations. MR_UNKNOWN lets LCI detect the memory type at runtime.
3) Pass GPU buffers to communication APIs
Just pass GPU memory pointers to send/recv/put/get APIs as usual.
Example
Sending a GPU buffer with automatic memory registration:
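A minimal sketch of the idea, assuming a CUDA device buffer and the builder-style post_send_x and MR_UNKNOWN described above; the completion helpers (alloc_sync, sync_wait) and the exact post_send_x argument order are assumed spellings that may differ in your LCI version, and error/retry handling is omitted:

```cpp
#include <cuda_runtime.h>
#include <lci.hpp>

// Send a GPU-resident buffer. MR_UNKNOWN tells LCI to detect the
// memory type (host vs. device) when the operation is posted.
void send_gpu_buffer(int peer_rank, size_t size) {
  void* d_buf = nullptr;
  cudaMalloc(&d_buf, size);                    // device-resident buffer

  lci::comp_t comp = lci::alloc_sync();        // assumed completion helper
  lci::post_send_x(peer_rank, d_buf, size, /*tag=*/0, comp)
      .mr(lci::MR_UNKNOWN)                     // let LCI detect device memory
      ();
  lci::sync_wait(comp, /*p_status=*/nullptr);  // assumed wait helper

  cudaFree(d_buf);
}
```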
Alternatively, you can explicitly register a GPU buffer and use it in a send operation:
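A sketch of explicit registration. register_memory and deregister_memory are the calls documented above; the completion helpers (alloc_sync, sync_wait) and the post_send_x argument order are assumed spellings:

```cpp
#include <cuda_runtime.h>
#include <lci.hpp>

// Explicitly register a device buffer before sending it.
void send_registered_gpu_buffer(int peer_rank, size_t size) {
  void* d_buf = nullptr;
  cudaMalloc(&d_buf, size);

  // LCI detects that d_buf is GPU memory and registers it accordingly.
  lci::mr_t mr = lci::register_memory(d_buf, size);

  lci::comp_t comp = lci::alloc_sync();        // assumed completion helper
  lci::post_send_x(peer_rank, d_buf, size, /*tag=*/0, comp)
      .mr(mr)                                  // use the explicit registration
      ();
  lci::sync_wait(comp, /*p_status=*/nullptr);  // assumed wait helper

  lci::deregister_memory(mr);
  cudaFree(d_buf);
}
```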
The same applies to put/get operations, with the additional step of exchanging remote memory region handles (rmr_t) between ranks:
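A sketch of a one-sided put into a peer's registered GPU buffer. register_memory, get_rmr, and allgather are the calls documented above; the post_put_x argument order and the completion helpers are assumed spellings, and error/retry handling is omitted:

```cpp
#include <cuda_runtime.h>
#include <lci.hpp>
#include <vector>

// Put a local GPU buffer into a peer's registered GPU buffer.
void put_gpu_buffer(int peer_rank, int nranks, size_t size) {
  void* d_buf = nullptr;
  cudaMalloc(&d_buf, size);

  lci::mr_t mr = lci::register_memory(d_buf, size);
  lci::rmr_t my_rmr = lci::get_rmr(mr);

  // Exchange remote memory region handles so every rank can target
  // every other rank's buffer.
  std::vector<lci::rmr_t> rmrs(nranks);
  lci::allgather(&my_rmr, rmrs.data(), sizeof(lci::rmr_t));

  lci::comp_t comp = lci::alloc_sync();        // assumed completion helper
  lci::post_put_x(peer_rank, d_buf, size, comp,
                  /*remote_disp=*/0, rmrs[peer_rank])
      .mr(mr)
      ();
  lci::sync_wait(comp, /*p_status=*/nullptr);  // assumed wait helper

  lci::deregister_memory(mr);
  cudaFree(d_buf);
}
```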
Refer to tests/unit/accelerator for a complete example of using LCI with GPU Direct RDMA.
Notes
- LCI supports both NVIDIA GPUs (CUDA) and AMD GPUs (HIP/ROCm), but they are mutually exclusive at build time.
- Making active messages GPU-direct is challenging because receive buffers are allocated by LCI on the host. If you need GPU-resident AM receives, override the packet pool with a GPU-aware allocator.