Overview
LCI can send/receive/put/get GPU-resident buffers directly over RDMA, as long as the hardware and driver stack support GPU Direct RDMA.
This document summarizes the required steps and notes.
Requirements
1) Enable CUDA support at build time
Build LCI with CUDA support enabled:
cmake -DLCI_USE_CUDA=ON ...
Other CUDA-related options you may want to consider:
-DLCI_CUDA_ARCH=<your_cuda_architecture> # e.g., 80 for A100, 90 for H100
-DLCI_CUDA_STANDARD=<cuda_standard> # e.g., 20
2) Ensure memory registration is GPU-aware
LCI needs to know that a buffer is device memory. You can satisfy this in two ways:
- Explicitly register the buffer using lci::register_memory. LCI will detect that the buffer is GPU memory and register it accordingly.
- Use the optional argument mr and set it to MR_DEVICE or MR_UNKNOWN when posting operations. MR_UNKNOWN lets LCI detect the memory type at runtime.
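The difference between an explicit hint and runtime detection can be pictured with a small stand-alone model. This is not LCI code and makes no claim about LCI internals: MemHint, MemKind, FakeRuntime, and resolve are hypothetical names invented for illustration. The sketch assumes that an explicit device hint is trusted as given, while an unknown hint triggers a pointer query. In a CUDA build such a query would typically go through something like cudaPointerGetAttributes; here it is stubbed with a lookup set so the sketch compiles without CUDA or LCI.

```cpp
#include <cassert>
#include <unordered_set>

// Hypothetical stand-ins for illustration only; these are NOT LCI types.
enum class MemHint { Host, Device, Unknown };
enum class MemKind { Host, Device };

// Stub for a runtime pointer query. The set of "device" addresses is faked;
// `queries` counts how often the (potentially costly) lookup is performed.
struct FakeRuntime {
  std::unordered_set<const void*> device_ptrs;
  int queries = 0;

  MemKind query(const void* p) {
    ++queries;
    return device_ptrs.count(p) ? MemKind::Device : MemKind::Host;
  }
};

// Resolve the memory kind of `p`: trust an explicit hint as given and fall
// back to the runtime query only when the hint is Unknown.
MemKind resolve(MemHint hint, const void* p, FakeRuntime& rt) {
  switch (hint) {
    case MemHint::Host:
      return MemKind::Host;
    case MemHint::Device:
      return MemKind::Device;
    case MemHint::Unknown:
      return rt.query(p);
  }
  return MemKind::Host;  // not reached; keeps compilers quiet
}
```

In this model, an explicit hint avoids the query but is only as correct as the caller's knowledge, whereas the unknown hint is safe for code paths that may receive either host or device pointers.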
3) Pass GPU buffers to communication APIs
Just pass GPU memory pointers to send/recv/put/get APIs as usual.
Example
Sending a GPU buffer with automatic memory registration:
[Listing not preserved. It referenced: post_send_x with its optional argument mr(mr_t mr_in); MR_DEVICE, a special mr_t value for device memory; MR_UNKNOWN, a special mr_t value for unknown memory, for which LCI detects the memory type automatically; and the type of the completion descriptor for a posted communication.]
Alternatively, you can explicitly register a GPU buffer and use it in a send operation:
...
lci::deregister_memory(mr);
[Identifiers referenced in the listing above: the mr resource type mr_t, and mr_t register_memory(void *address_in, size_t size_in).]
Same for put/get operations:
std::vector<lci::rmr_t> rmrs;
...
lci::deregister_memory(mr);
[Identifiers referenced in the listing above: post_put_x with its optional argument mr(mr_t mr_in); rmr_t get_rmr(mr_t mr_in); void allgather(const void *sendbuf_in, void *recvbuf_in, size_t size_in); and rmr_t, the type of a remote memory region.]
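The put/get listing declares a vector of rmr_t and references get_rmr and allgather. A plausible reading, not confirmed by the surviving text, is that each rank publishes the remote descriptor of its own registered buffer and gathers everyone else's into a table indexed by rank. The stand-alone model below only illustrates that buffer layout. FakeRmr, toy_allgather, and unpack_rmrs are hypothetical names, the "ranks" are simulated in one process, and nothing here calls LCI.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for a remote-memory descriptor; NOT lci::rmr_t.
struct FakeRmr {
  int owner_rank;
  std::uint64_t base_addr;
};

// Toy allgather: each simulated rank contributes one `size`-byte record.
// Returns the receive buffer every rank would end up with, in which bytes
// [r * size, (r + 1) * size) hold the record contributed by rank r.
std::vector<unsigned char> toy_allgather(
    const std::vector<const void*>& sendbufs, std::size_t size) {
  std::vector<unsigned char> recvbuf(sendbufs.size() * size);
  for (std::size_t r = 0; r < sendbufs.size(); ++r) {
    std::memcpy(recvbuf.data() + r * size, sendbufs[r], size);
  }
  return recvbuf;
}

// Unpack the gathered bytes into a rank-indexed table of descriptors.
std::vector<FakeRmr> unpack_rmrs(const std::vector<unsigned char>& recvbuf) {
  std::vector<FakeRmr> rmrs(recvbuf.size() / sizeof(FakeRmr));
  if (!rmrs.empty()) {
    std::memcpy(rmrs.data(), recvbuf.data(), rmrs.size() * sizeof(FakeRmr));
  }
  return rmrs;
}
```

The point of the model is the sizing rule: the receive side needs room for one record per rank, so a table of descriptors would be sized to the number of ranks before the exchange, and the entry at index r is the one to use when targeting rank r.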
Refer to tests/unit/accelerator for a complete example of using LCI with GPU Direct RDMA.
Notes
- Currently, LCI only supports NVIDIA GPUs. Other vendors will be supported in the future.
- Making active messages GPU-direct is challenging because LCI allocates their receive buffers on the host. If you need GPU-resident AM receives, overload the packet pool with a GPU-aware allocator.