LCI v2.0.0-dev
For Asynchronous Multithreaded Communication
Tutorial

Add LCI to your project

The following CMake code will add LCI to your project. It will first try to find LCI on your system. If it is not found, it will download and build LCI from GitHub.

# Try to find LCI externally
find_package(
  LCI
  CONFIG
  PATH_SUFFIXES
    lib/cmake
    lib64/cmake)
if(NOT LCI_FOUND)
  message(STATUS "Existing LCI installation not found. Try FetchContent.")
  include(FetchContent)
  FetchContent_Declare(
    lci
    GIT_REPOSITORY https://github.com/uiuc-hpc/lci.git
    GIT_TAG master)
  FetchContent_MakeAvailable(lci)
endif()
# Link LCI to your target
target_link_libraries(your_target PRIVATE LCI::lci)

Install LCI

Although CMake FetchContent is convenient, you may also want to install LCI on your systems.

Prerequisites

  • A Linux or macOS laptop or cluster.
  • A C++ compiler that supports C++17 or higher (GCC 8 or higher, Clang 5 or higher, etc.).
  • CMake 3.12 or higher.
  • A network backend that supports LCI. Currently, LCI supports:
    • libibverbs. Typically for Infiniband/RoCE.
    • libfabric. For Slingshot-11, Ethernet, shared memory (including laptop), and other networks.

Normal clusters should already have these installed (though you may need a few module load commands). If you are using a laptop, you may need to install cmake and libfabric manually.

  • For Ubuntu, you can install the prerequisites using:
    sudo apt-get install -y cmake libfabric-bin libfabric-dev
  • For macOS, you can install the prerequisites using:
    brew install cmake libfabric

With CMake

Install LCI on your laptop

git clone https://github.com/uiuc-hpc/lci.git
cd lci
mkdir build
cd build
cmake -DCMAKE_INSTALL_PREFIX=/path/to/install -DLCI_NETWORK_BACKENDS=ofi ..
make
make install

Install LCI on a cluster

git clone https://github.com/uiuc-hpc/lci.git
cd lci
mkdir build
cd build
cmake -DCMAKE_INSTALL_PREFIX=/path/to/install ..
make
make install

This is basically the same as installing LCI on your laptop, except that you don't need to install the prerequisites (they are already installed on the cluster), and you don't need to manually select the network backend (LCI will automatically select the best one for you).

Important CMake variables

  • LCI_DEBUG=[ON|OFF]: Enable/disable the debug mode (more assertions and logs). The default value is OFF.
  • LCI_NETWORK_BACKENDS=[ibv|ofi]: accepts multiple values separated by commas, serving as a hint for which network backend to use. If a backend indicated by this variable is found, LCI will use it. Otherwise, LCI will use whatever backend is found, with the priority ibv > ofi. The default value is ibv,ofi. Typically, you don't need to modify this variable: if libibverbs is present, it is likely the recommended backend to use.
    • ibv: libibverbs, typically for infiniband/RoCE.
    • ofi: libfabric, for all other networks (slingshot-11, ethernet, shared memory).
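
For example, to configure a debug build that uses only the libfabric backend, the CMake invocation might look like this (the install path is a placeholder):

cmake -DCMAKE_INSTALL_PREFIX=/path/to/install \
      -DLCI_DEBUG=ON \
      -DLCI_NETWORK_BACKENDS=ofi \
      ..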

With Spack

LCI can be installed using Spack. An LCI recipe is included in the official Spack repository. For the most up-to-date version, you can also use the recipe in the contrib/spack directory of the LCI repository. To install LCI using Spack, run the following commands:

# Assume you have already installed Spack
# Optional: add the LCI spack recipe from the LCI repository
git clone https://github.com/uiuc-hpc/lci.git
spack repo add lci/contrib/spack
spack install lci

Important Spack variables

  • debug: define the CMake variable LCI_DEBUG.
  • backend=[ibv|ofi]: define the CMake variable LCI_NETWORK_BACKENDS.
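
For example, the following command installs a debug build that uses the libibverbs backend (standard Spack variant syntax, using the variant names listed above):

spack install lci +debug backend=ibv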

Cluster-specific Installation Note

NCSA Delta


tl;dr: module load libfabric, then specify the cmake variable -DLCI_NETWORK_BACKENDS=ofi or the Spack variable backend=ofi when building LCI.

The only caveat is that you need to pass the -DLCI_NETWORK_BACKENDS=ofi option to CMake. This is because Delta has both libibverbs and libfabric installed, but only libfabric works.

No additional srun arguments are needed to run LCI applications. However, we have noticed that srun can break under some mysterious module loading conditions. In that case, just use srun --mpi=pmi2 instead.

SDSC Expanse


tl;dr: use srun --mpi=pmi2 to run LCI applications.

You don't need to do anything special to install LCI on Expanse. Just follow the instructions above.

NERSC Perlmutter


tl;dr: module load cray-pmi before running the CMake command. Add cray-pmi as a Spack external package and add default-pm=cray when building LCI with spack install.

LCI needs to find the Perlmutter-installed Cray-PMI library. Do module load cray-pmi and then run the CMake command to configure LCI. Make sure you see something like this in the output:

-- Found PMI: /opt/cray/pe/pmi/6.1.15/lib/libpmi.so
-- Found PMI2: /opt/cray/pe/pmi/6.1.15/lib/libpmi2.so

When building LCI with spack install, you need to first add cray-pmi as a Spack external package. Put the following code in ~/.spack/packages.yaml:

packages:
  cray-pmi:
    externals:
    - spec: cray-pmi@6.1.15
      modules:
      - cray-pmi/6.1.15
    buildable: false

Afterwards, you can use spack install lci default-pm=cray.

TACC Frontera


tl;dr: use ibrun and the MPI bootstrap backend in LCI.

Frontera recommends using its ibrun command to launch multi-node applications, and its ibrun is tightly coupled with its MPI installation. Therefore, the recommended way to run LCI applications on Frontera is to use the MPI bootstrap backend. You can do this by setting the CMake variable LCT_PMI_BACKEND_ENABLE_MPI=ON and linking LCI to the MPI library.

It is also possible to directly use its srun and the PMI2 bootstrap backend in LCI. However, we found that this method can result in very slow bootstrapping times for large numbers of ranks (>=512).
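
A minimal sketch of the recommended workflow, assuming an MPI module is loaded and the paths are placeholders:

module load <mpi-module>
cmake -DCMAKE_INSTALL_PREFIX=/path/to/install -DLCT_PMI_BACKEND_ENABLE_MPI=ON ..
make && make install
# launch with ibrun
ibrun ./lci_program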

Write LCI programs

Overview

LCI is a C++ library that is implemented in C++17 but its public header is compatible with C++11 and later versions. A C API is planned—please open an issue if you need it. The LCI API is structured into four core components:

  • Resource Management: Allocation and deallocation of LCI resources.
  • Communication Posting: Posting communication operations.
  • Completion Checking: Checking the status of posted operations.
  • Progress: Ensuring that pending communication can be moved forward.

All LCI APIs are defined in lci.hpp and wrapped in the lci namespace.

Objectified Flexible Functions (OFF)

All LCI functions have a _x variant that implements the Objectified Flexible Functions (OFF) idiom. OFF allows optional arguments to be specified in any order, similar to named arguments in Python.

// the default post_send function
auto ret = post_send(rank, buf, size, tag, comp);
// the OFF variants
ret = post_send_x(rank, buf, size, tag, comp).device(device)();
ret = post_send_x(rank, buf, size, tag, comp)
          .matching_policy(matching_policy_t::rank_only)
          .device(device)();

Internally, an OFF is implemented as a functor with a constructor for positional arguments and setter methods for optional ones. The () operator triggers the actual function call.
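
For illustration, a heavily simplified sketch of how such a functor could be declared is shown below (hypothetical and incomplete; the real post_send_x supports many more optional arguments):

struct post_send_x {
  // positional arguments are captured by the constructor
  post_send_x(int rank, void* local_buffer, size_t size, tag_t tag,
              comp_t local_comp);
  // each optional argument gets a chainable setter
  post_send_x&& device(device_t device_in);
  post_send_x&& matching_policy(matching_policy_t policy_in);
  // operator() triggers the actual communication call
  status_t operator()();
};

So post_send_x(rank, buf, size, tag, comp).device(device)() constructs the functor, overrides the optional device argument, and finally invokes the operation.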

Resource Management

As a beginner user, you may not need to worry about this part. The only functions you need to know are:

// Initialize the default LCI runtime.
// Similar to MPI_Init.
lci::g_runtime_init();
// Query the rank and total number of ranks.
// Similar to MPI_Comm_rank(MPI_COMM_WORLD) and MPI_Comm_size(MPI_COMM_WORLD).
int rank_me = lci::g_rank_me();
int rank_n = lci::g_rank_n();
// Finalize the default LCI runtime.
// Similar to MPI_Finalize.
lci::g_runtime_fina();

For advanced users, you may want to know the following functions/resources (not a complete list):

Runtime

Instead of using the global default runtime, you can create your own runtime. This is useful if you are writing a library that might be used in an application that already uses LCI.

// Create a new LCI runtime.
lci::runtime_t runtime = lci::alloc_runtime();
// Example: post a receive operation to a specific runtime.
lci::post_recv_x(rank, buf, size, tag, comp).runtime(runtime)();
// Free the LCI runtime.
lci::free_runtime(&runtime);

All LCI OFFs can take a runtime as an optional argument. If not specified, the default runtime will be used.

Device

A device represents a set of communication resources. Using separate devices per thread ensures thread isolation:

// Create a new LCI device.
lci::device_t device = lci::alloc_device();
// Example: post a receive operation to a specific device.
lci::post_recv_x(rank, buf, size, tag, comp).device(device)();
// Free the LCI device.
lci::free_device(&device);

Communication Posting

LCI supports:

  • Send-receive with tag matching.
  • Active message.
  • RMA (Remote Memory Access): put/get/put with notification.

For those who are not familiar with the terminology: an active message is a mechanism that allows the sender to send a message to the receiver and execute a callback function on the receiver side; the receiver does not need to post a receive. RMA is a mechanism that allows one process to read/write the memory of another process without involving the target process.

Below are the default function signatures of these operations:

using namespace lci;
status_t post_send(int rank, void* local_buffer, size_t size, tag_t tag, comp_t local_comp);
status_t post_recv(int rank, void* local_buffer, size_t size, tag_t tag, comp_t local_comp);
status_t post_am(int rank, void* local_buffer, size_t size, comp_t local_comp, rcomp_t remote_comp);
status_t post_put(int rank, void* local_buffer, size_t size, comp_t local_comp, uintptr_t remote_disp, rmr_t rmr);
status_t post_get(int rank, void* local_buffer, size_t size, comp_t local_comp, uintptr_t remote_disp, rmr_t rmr);

Where is put with notification? post_put can take an optional rcomp argument to specify a remote completion handler.

All functions here are non-blocking: they merely post the communication operation and return immediately. The actual communication may not be completed yet. A communication posting function returns a status_t object that can be in one of three states:

  • done: the operation is completed.
  • posted: the operation is posted but not completed yet.
  • retry: the operation cannot be posted due to a non-fatal error (typically, some resources are temporarily unavailable). You can retry the posting.

You can use the is_done/is_posted/is_retry methods to check the status of the operation.

Below is an example of how to post a send operation and block until it completes:

void blocking_send(int rank, void* buf, size_t size, tag_t tag) {
  comp_t sync = alloc_sync();
  status_t status;
  do {
    status = post_send(rank, buf, size, tag, sync);
  } while (status.is_retry());
  if (status.is_posted()) {
    while (sync_test(sync, &status) == false)
      progress();
  }
  assert(status.is_done());
  // At this point, the send operation is completed
  // and the status object contains the information about
  // the completed operation.
  free_comp(&sync);
}

The usage of synchronizer (sync) and progress will be explained in the next section.

Completion Checking

LCI offers several mechanisms to detect when a posted operation completes. These mechanisms are designed to support both synchronous and asynchronous programming styles.

Synchronizer

A synchronizer behaves similarly to an MPI request. You can wait for its completion using sync_wait or test it with sync_test.

comp_t sync = alloc_sync();
status_t status = post_send(rank, buf, size, tag, sync);
if (status.is_posted()) {
  while (sync_test(sync, nullptr) == false) {
    progress(); // call progress to ensure the operation can be completed
  }
  // Alternatively, you can use sync_wait
}
free_comp(&sync);

Optionally, a synchronizer can accept multiple completion signals before becoming ready. For example,

// allocate a synchronizer with a threshold of 2
comp_t sync = alloc_sync_x().threshold(2)();
for (...) {
  post_send(..., sync);
  post_recv(..., sync);
  // becomes ready after both the send and the recv complete
  sync_wait(sync, nullptr);
}

Completion Queue

Applications that expect a large number of asynchronous operations can use a completion queue. Completed operations are pushed into the queue and can be polled later:

comp_t cq = alloc_cq();
// can have many pending operations
status_t status = post_am(..., cq, ...);
// Later or in another thread:
status = cq_pop(cq);
if (status.is_done()) {
  // Process completed operation
  // status contains information about the completed operation
  // such as the rank, tag, and buffer.
}

Handler

You can also register a callback handler. This is useful for advanced users who want LCI to directly invoke a function upon completion.

void my_handler(status_t status) {
  free(status.get_buffer().base); // free the message buffer
}
// ...
comp_t handler = alloc_handler(my_handler);
status_t status = post_send(rank, buf, size, tag, handler);

The handler is automatically called once the operation completes locally.

Graph (Advanced)

For complex asynchronous flows, LCI also supports graph_t, a completion object that represents a Directed Acyclic Graph (DAG) of operations and callbacks. This is similar in spirit to CUDA Graphs and can be used to build efficient non-blocking collectives.

(Check the "Non-blocking Barrier" example for more details.)

Progress

LCI decouples communication progress from posting and completion operations. Unlike MPI, where progress is implicit, LCI uses an explicit progress() call.

This enables users to:

  • Call progress() periodically in application threads
  • Dedicate threads to call progress() continuously
  • Integrate progress into event loops

Basic Usage

// On default device
lci::progress();
// With specified device
lci::progress_x().device(device)();

Call progress() frequently to ensure pending operations complete. In multithreaded programs, you can call progress() on a specific device.
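
For example, a dedicated progress thread (the second pattern listed above) could be sketched as follows; stop_flag and device are application-managed names assumed for illustration, and <thread> and <atomic> are assumed to be included:

std::atomic<bool> stop_flag(false);
std::thread progress_thread([&]() {
  // keep driving communication on this device until shutdown
  while (!stop_flag.load(std::memory_order_relaxed)) {
    lci::progress_x().device(device)();
  }
});
// ... application threads post operations and check completions ...
stop_flag = true;
progress_thread.join();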

Other Materials

Check the examples and tests subdirectories for more example code.

Read the LCI paper to comprehensively understand the LCI interface and runtime design.

Check out the API documentation for more details.

Examples

Hello World

This example shows the LCI runtime lifecycle and the query of rank.

// Copyright (c) 2025 The LCI Project Authors
// SPDX-License-Identifier: MIT
#include <iostream>
#include <unistd.h>
#include "lci.hpp"
// This example shows the LCI runtime lifecycle and the query of rank.
int main(int argc, char** args)
{
  char hostname[64];
  gethostname(hostname, 64);
  // Initialize the global default runtime.
  lci::g_runtime_init();
  // After at least one runtime is active, we can query the rank and nranks.
  // rank is the id of the current process
  // nranks is the total number of the processes in the current job.
  std::cout << "Hello world from rank " << lci::get_rank_me() << " of "
            << lci::get_rank_n() << " on " << hostname << std::endl;
  // Finalize the global default runtime.
  lci::g_runtime_fina();
  return 0;
}

Example output:
$ lcrun -n 4 ./lci_hello_world
Hello world from rank 1 of 4 on <hostname>
Hello world from rank 3 of 4 on <hostname>
Hello world from rank 0 of 4 on <hostname>
Hello world from rank 2 of 4 on <hostname>

Hello World (Active Message)

This example shows the usage of basic communication operations (active message) and completion mechanisms (synchronizer and handler).

// Copyright (c) 2025 The LCI Project Authors
// SPDX-License-Identifier: MIT
#include <iostream>
#include <sstream>
#include <string>
#include <cassert>
#include <cstdlib>
#include "lci.hpp"
// This example shows the usage of basic communication operations (active
// message) and completion mechanisms (synchronizer and handler).
// Using a flag for simple termination detection.
bool received = false;
// Define the function to be triggered when the active message arrives.
void am_handler(lci::status_t status)
{
  // Get the active message payload.
  lci::buffer_t payload = status.get_buffer();
  std::string payload_str(static_cast<char*>(payload.base), payload.size);
  // The active message payload buffer is allocated by the LCI runtime using
  // `malloc` by default. The user is responsible for freeing it after use.
  free(payload.base);
  // Print the hello world message.
  std::ostringstream oss;
  oss << "Rank " << lci::get_rank_me() << " received active message from rank "
      << status.rank << ". Payload: " << payload_str << std::endl;
  std::cout << oss.str();
  // Set the received flag to true.
  received = true;
}
int main(int argc, char** args)
{
  // Initialize the global default runtime.
  lci::g_runtime_init();
  // We use "synchronizer" as the source side completion object.
  // It is similar to a MPI request, but has an optional argument `threshold`
  // to accept multiple signals before becoming ready.
  lci::comp_t sync = lci::alloc_sync();
  // Register the active message handler as the target side completion object.
  // Since handler/cq needs to be referenced by another process, we need to
  // register it into a remote completion handler.
  lci::comp_t handler = lci::alloc_handler(am_handler);
  lci::rcomp_t rcomp = lci::register_rcomp(handler);
  // Since all ranks register the rcomp, all ranks will automatically have a
  // symmetric view. We do not need to explicitly exchange them.
  // Put a barrier here to ensure all ranks have registered the handler.
  lci::barrier();
  if (lci::get_rank_me() == 0) {
    for (int target = 0; target < lci::get_rank_n(); ++target) {
      // Prepare the active message payload.
      std::string payload =
          "Hello from rank " + std::to_string(lci::get_rank_me());
      // Post the active message to the target rank.
      auto send_buf =
          const_cast<void*>(static_cast<const void*>(payload.data()));
      // Unlike MPI_Isend, LCI posting operation can return a status in one of
      // the three states:
      // 1. `retry`: the posting failed due to resource being temporarily busy,
      // and the user can retry.
      lci::status_t status;
      do {
        status = lci::post_am(target, send_buf, payload.size(), sync, rcomp);
      } while (status.is_retry());
      // 2. `posted`: the operation is posted, and the completion object will
      // be signaled.
      if (status.is_posted()) {
        while (lci::sync_test(sync, &status /* can be nullptr */) == false) {
          lci::progress();
        }
        assert(status.is_done());
      }
      // 3. `done`: the operation is completed, the completion object will not
      // be signaled, and the user can check the status.
      assert(status.is_done());
      // at this point, all fields in the status object are valid.
    }
  }
  // Wait for the active message to arrive.
  while (!received) {
    lci::progress();
  }
  // Clean up the completion objects.
  lci::free_comp(&handler);
  lci::free_comp(&sync);
  // Finalize the global default runtime.
  lci::g_runtime_fina();
  return 0;
}

Example output:
$ lcrun -n 4 ./lci_hello_world_am
Rank 1 received active message from rank 0. Payload: Hello from rank 0
Rank 2 received active message from rank 0. Payload: Hello from rank 0
Rank 3 received active message from rank 0. Payload: Hello from rank 0
Rank 0 received active message from rank 0. Payload: Hello from rank 0

Distributed Array

This example shows the usage of RMA (Remote Memory Access) operations to implement a distributed array.

// Copyright (c) 2025 The LCI Project Authors
// SPDX-License-Identifier: MIT
#include <iostream>
#include <vector>
#include <cassert>
#include "lci.hpp"
// This example shows the usage of the LCI one-sided RMA operations through the
// implementation of a simple distributed array.
template <typename T>
class distributed_array_t
{
 public:
  distributed_array_t(size_t size, T default_val) : m_size(size)
  {
    m_per_rank_size = size / lci::get_rank_n();
    m_local_start = lci::get_rank_me() * m_per_rank_size;
    m_data.resize(m_per_rank_size, default_val);
    m_rmrs.resize(lci::get_rank_n());
    // RMA operations allow users to directly read/write remote memory on other
    // processes. To enable this, the following steps are required:
    // (1) Each target process (i.e., the process whose memory will be accessed
    // remotely) must register its local memory region and obtain a
    // corresponding remote key (rmr).
    // (2) Each target process must then share its base address and rmr with
    // all other ranks. This is typically done using an allgather or similar
    // collective operation.
    // These steps ensure that every process has the information needed to
    // perform one-sided RMA operations (e.g., put/get) to any other process's
    // registered memory buffer.
    mr = lci::register_memory(m_data.data(), m_data.size() * sizeof(T));
    lci::rmr_t rmr = lci::get_rmr(mr);
    // exchange the memory registration information with other ranks
    lci::allgather(&rmr, m_rmrs.data(), sizeof(lci::rmr_t));
  }
  ~distributed_array_t()
  {
    // we need a barrier to ensure that all remote operations are completed
    lci::barrier();
    // deregister my memory buffer
    lci::deregister_memory(&mr);
  }
  // a blocking get operation
  T get(size_t index)
  {
    int target_rank = get_target_rank(index);
    size_t local_index = get_local_index(index);
    lci::status_t status;
    T value;
    do {
      status =
          lci::post_get(target_rank, &value, sizeof(T), lci::COMP_NULL_RETRY,
                        local_index * sizeof(T), m_rmrs[target_rank]);
    } while (status.is_retry());
    assert(status.is_done());
    return value;
  }
  // a blocking put operation
  void put(size_t index, const T& value)
  {
    int target_rank = get_target_rank(index);
    size_t local_index = get_local_index(index);
    lci::status_t status;
    do {
      status = lci::post_put_x(target_rank,
                               static_cast<void*>(const_cast<T*>(&value)),
                               sizeof(T), lci::COMP_NULL_RETRY,
                               local_index * sizeof(T), m_rmrs[target_rank])
                   .comp_semantic(lci::comp_semantic_t::network)();
    } while (status.is_retry());
    assert(status.is_done());
  }

 private:
  size_t m_size;
  size_t m_per_rank_size;
  size_t m_local_start;
  std::vector<T> m_data;
  // LCI memory registration information
  lci::mr_t mr;
  std::vector<lci::rmr_t> m_rmrs;
  int get_target_rank(size_t index) const
  {
    assert(index < m_size);
    return index / m_per_rank_size;
  }
  size_t get_local_index(size_t index) const
  {
    assert(index < m_size);
    return index % m_per_rank_size;
  }
};
void work(size_t size)
{
  distributed_array_t<int> darray(size, 0);
  int rank = lci::get_rank_me();
  int nranks = lci::get_rank_n();
  for (size_t i = rank; i < size; i += nranks) {
    darray.put(i, i);
  }
  for (size_t i = (rank + 1) % nranks; i < size; i += nranks) {
    int value = darray.get(i);
    assert(value == (int)i);
  }
}
int main(int argc, char** args)
{
  lci::g_runtime_init();
  const size_t size = 1000;
  work(size);
  lci::g_runtime_fina();
  return 0;
}

Example output:
# This example has no output.
$ lcrun -n 4 ./lci_distributed_array

Non-blocking Barrier

This example shows the usage of the completion graph and the send/recv operations to implement a non-blocking barrier.

// Copyright (c) 2025 The LCI Project Authors
// SPDX-License-Identifier: MIT
#include <unistd.h>
#include <cstdio>
#include "lci.hpp"
// This example shows the usage of the completion graph and the send/recv
// operations.
// create the graph according to the dissemination algorithm
lci::comp_t create_ibarrier_graph()
{
  int rank_me = lci::get_rank_me();
  int rank_n = lci::get_rank_n();
  lci::comp_t graph = lci::alloc_graph();
  // GRAPH_START is a special node that indicates the start of the graph.
  lci::graph_node_t old_node = lci::GRAPH_START;
  lci::graph_node_t dummy_node;
  // The dissemination algorithm contains log2(rank_n) rounds.
  // In each round, each rank sends a message to rank_me + 2^round and
  // receives a message from rank_me - 2^round.
  for (int jump = 1; jump < rank_n; jump *= 2) {
    int rank_to_recv = (rank_me - jump + rank_n) % rank_n;
    int rank_to_send = (rank_me + jump) % rank_n;
    // Define the communication operations for each round.
    // We cannot explicitly retry in the graph, so we set allow_retry to false.
    // The runtime will handle the retry using the internal backlog queues.
    auto recv =
        lci::post_recv_x(rank_to_recv, nullptr, 0, 0, graph).allow_retry(false);
    auto send =
        lci::post_send_x(rank_to_send, nullptr, 0, 0, graph).allow_retry(false);
    // Note that we do not trigger the operations here.
    // Instead, we make them nodes in the graph.
    auto recv_node = lci::graph_add_node_op(graph, recv);
    auto send_node = lci::graph_add_node_op(graph, send);
    // To make the code more readable, we use a dummy node to represent the end
    // of the round.
    if (jump * 2 >= rank_n) {
      // this is the last round
      // GRAPH_END is a special node that indicates the end of the graph.
      dummy_node = lci::GRAPH_END;
    } else {
      // we can make arbitrary functions nodes in the graph
      // The graph expects the function of a node to either return `done` or
      // `posted`. In the case of `done`, the runtime will immediately trigger
      // its children. In the case of `posted`, the runtime will do nothing and
      // the node will be considered pending until
      // `graph_node_mark_complete(node)` is called.
      dummy_node = lci::graph_add_node(
          graph, [](void*) -> lci::status_t { return lci::errorcode_t::done; });
    }
    // Specify the dependencies between the nodes.
    // Wait for the previous round to finish before starting this round.
    lci::graph_add_edge(graph, old_node, recv_node);
    lci::graph_add_edge(graph, old_node, send_node);
    // Wait for the send and recv operations to finish before moving to the
    // next round.
    lci::graph_add_edge(graph, recv_node, dummy_node);
    lci::graph_add_edge(graph, send_node, dummy_node);
    old_node = dummy_node;
  }
  return graph;
}
int main()
{
  // Initialize the global default runtime.
  lci::g_runtime_init();
  // create a graph describing the operations needed by the barrier
  lci::comp_t graph = create_ibarrier_graph();
  // the graph can be reused
  for (int i = 0; i < 3; ++i) {
    if (lci::get_rank_me() == 0) {
      // create some asymmetric delay
      sleep(1);
    }
    fprintf(stderr, "rank %d start barrier\n", lci::get_rank_me());
    // start executing those operations
    lci::graph_start(graph);
    // wait for the operations to finish
    while (lci::graph_test(graph).is_retry()) {
      lci::progress();
    }
    fprintf(stderr, "rank %d end barrier\n", lci::get_rank_me());
  }
  // free the graph
  lci::free_comp(&graph);
  // Finalize the global default runtime.
  lci::g_runtime_fina();
  return 0;
}

Example output:
$ lcrun -n 4 ./lci_nonblocking_barrier
rank 1 start barrier
rank 2 start barrier
rank 3 start barrier
rank 0 start barrier
rank 0 end barrier
rank 1 end barrier
rank 1 start barrier
rank 3 end barrier
rank 3 start barrier
rank 0 start barrier
rank 2 end barrier
rank 2 start barrier
rank 2 end barrier
rank 2 start barrier
rank 0 end barrier
rank 3 end barrier
rank 3 start barrier
rank 1 end barrier
rank 1 start barrier
rank 0 start barrier
rank 1 end barrier
rank 0 end barrier
rank 2 end barrier
rank 3 end barrier

Multithreaded Active Message Ping-pong

This example shows the usage of thread-local devices to speed up active message communication in a multithreaded environment.

// Copyright (c) 2025 The LCI Project Authors
// SPDX-License-Identifier: MIT
#include <iostream>
#include <thread>
#include <vector>
#include <cassert>
#include <chrono>
#include <atomic>
#include <cstring>
#include <cstdlib>
#include "lct.h"
#include "lci.hpp"
// This example shows the usage of thread-local devices.
const int nthreads = 4;
const int nmsgs = 1000;
const size_t msg_size = 8;
LCT_tbarrier_t thread_barrier;
std::atomic<int> thread_sequence_control(0);
void worker(int thread_id)
{
  int rank = lci::get_rank_me();
  int nranks = lci::get_rank_n();
  int peer_rank;
  if (nranks == 1) {
    peer_rank = rank;
  } else {
    peer_rank = (rank + nranks / 2) % nranks;
  }
  // allocate resources
  // device and rcomp allocation needs to be synchronized to ensure uniformity
  // across ranks.
  while (thread_sequence_control != thread_id) continue;
  lci::device_t device = lci::alloc_device();
  lci::comp_t cq = lci::alloc_cq();
  lci::rcomp_t rcomp = lci::register_rcomp(cq);
  if (++thread_sequence_control == nthreads) thread_sequence_control = 0;
  void* send_buf = malloc(msg_size);
  memset(send_buf, rank, msg_size);
  LCT_tbarrier_arrive_and_wait(thread_barrier);
  auto start = std::chrono::high_resolution_clock::now();
  if (nranks == 1 || rank < nranks / 2) {
    // sender
    for (int i = 0; i < nmsgs; i++) {
      // send a message
      lci::post_am_x(peer_rank, send_buf, msg_size, lci::COMP_NULL, rcomp)
          .device(device)
          .tag(thread_id)();
      // wait for an incoming message
      lci::status_t status;
      do {
        lci::progress_x().device(device)();
        status = lci::cq_pop(cq);
      } while (status.is_retry());
      if (status.tag != thread_id) {
        std::cerr << "thread_id: " << thread_id
                  << ", status.tag: " << status.tag << std::endl;
      }
      assert(status.tag == thread_id);
      lci::buffer_t recv_buf = status.get_buffer();
      assert(recv_buf.size == msg_size);
      for (size_t j = 0; j < msg_size; j++) {
        assert(((char*)recv_buf.base)[j] == peer_rank);
      }
      free(recv_buf.base);
    }
  } else {
    // receiver
    for (int i = 0; i < nmsgs; i++) {
      // wait for an incoming message
      lci::status_t status;
      do {
        lci::progress_x().device(device)();
        status = lci::cq_pop(cq);
      } while (status.is_retry());
      assert(status.tag == thread_id);
      lci::buffer_t recv_buf = status.get_buffer();
      assert(recv_buf.size == msg_size);
      for (size_t j = 0; j < msg_size; j++) {
        assert(((char*)recv_buf.base)[j] == peer_rank);
      }
      free(recv_buf.base);
      // send a message
      lci::post_am_x(peer_rank, send_buf, msg_size, lci::COMP_NULL, rcomp)
          .device(device)
          .tag(thread_id)();
    }
  }
  LCT_tbarrier_arrive_and_wait(thread_barrier);
  auto end = std::chrono::high_resolution_clock::now();
  if (thread_id == 0 && rank == 0) {
    std::cout << "pingpong_am_mt: " << std::endl;
    std::cout << "Number of threads: " << nthreads << std::endl;
    std::cout << "Number of messages: " << nmsgs << std::endl;
    std::cout << "Message size: " << msg_size << " bytes" << std::endl;
    std::cout << "Number of ranks: " << nranks << std::endl;
    double total_time_us =
        std::chrono::duration_cast<std::chrono::microseconds>(end - start)
            .count();
    double msg_rate_uni =
        1.0 * nmsgs * nthreads * (nranks + 1) / 2 / total_time_us;
    double bandwidth_uni = msg_rate_uni * msg_size;
    std::cout << "Total time: " << total_time_us / 1e6 << " s" << std::endl;
    std::cout << "Message rate: " << msg_rate_uni << " mmsg/s" << std::endl;
    std::cout << "Bandwidth: " << bandwidth_uni << " MB/s" << std::endl;
  }
  free(send_buf);
  // free resources
  while (thread_sequence_control != thread_id) {
    lci::progress_x().device(device)();
  }
  lci::free_comp(&cq);
  lci::free_device(&device);
  if (++thread_sequence_control == nthreads) thread_sequence_control = 0;
}
int main(int argc, char** args)
{
  // Initialize the global default runtime.
  // Here we use the *objectified flexible function* version of the
  // `g_runtime_init` operation and specify that the default device should not
  // be allocated.
  lci::g_runtime_init_x().alloc_default_device(false)();
  // After at least one runtime is active, we can query the rank and nranks.
  // rank is the id of the current process
  // nranks is the total number of the processes in the current job.
  assert(lci::get_rank_n() == 1 || lci::get_rank_n() % 2 == 0);
  // get a thread barrier
  thread_barrier = LCT_tbarrier_alloc(nthreads);
  // spawn the threads to do the pingpong
  if (nthreads == 1) {
    worker(0);
  } else {
    std::vector<std::thread> threads;
    for (int i = 0; i < nthreads; i++) {
      threads.push_back(std::thread(worker, i));
    }
    for (auto& thread : threads) {
      thread.join();
    }
  }
  // free the thread barrier
  LCT_tbarrier_free(&thread_barrier);
  // Finalize the global default runtime.
  lci::g_runtime_fina();
  return 0;
}

Example output (run on a laptop; performance may vary):
$ lcrun -n 4 ./lci_pingpong_am_mt
pingpong_am_mt:
Number of threads: 4
Number of messages: 1000
Message size: 8 bytes
Number of ranks: 4
Total time: 0.035286 s
Message rate: 0.283399 mmsg/s
Bandwidth: 2.26719 MB/s

Run LCI applications

In Quick Start, we have shown you how to run LCI applications using mpirun or srun. Here, we will discuss the bootstrapping process in more detail.

To successfully bootstrap LCI, the launcher (srun, mpirun, or lcrun) must match the bootstrapping backend used by LCI. Normally, LCI automatically selects the right bootstrapping backend based on the environment, so no special configuration is needed. However, if your application is launched as a collection of processes that all report rank 0, something went wrong.

Run LCI applications with lcrun

You do not need to do anything special to run LCI applications with lcrun. However, lcrun is a "toy" launcher that is not as scalable as srun or mpirun. It is mainly used for testing and debugging purposes.

If you ever encounter a problem with lcrun, you can remove the temporary folder ~/.tmp/lct_pmi_file-* and try again.

Run LCI applications with srun

LCI is shipped with a copy of the SLURM PMI1 and PMI2 client implementation, so normally you can use srun to run LCI applications without any extra configuration. You may need to explicitly enable pmi2 support with srun --mpi=pmi2.

On Cray systems, you may need to load the cray-pmi module before building LCI as srun on some Cray systems only supports Cray PMI.

Run LCI applications with mpirun

Because there are many different MPI implementations and no standard for how they implement mpirun, it is slightly more complicated to run LCI applications with mpirun. In such cases, the easiest way is to let LCI use MPI to bootstrap. You just need to set the CMake variable LCT_PMI_BACKEND_ENABLE_MPI=ON and link LCI to MPI.
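
A sketch of this approach (the install path is a placeholder, and MPI is assumed to be discoverable by CMake):

# configure LCI with the MPI bootstrap backend enabled
cmake -DCMAKE_INSTALL_PREFIX=/path/to/install -DLCT_PMI_BACKEND_ENABLE_MPI=ON ..
make && make install
# then launch as usual
mpirun -n 4 ./lci_program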

It is possible to directly use the PMI backend with mpirun, but you need to find the corresponding PMI client library and link LCI to it. Read the following section for more details.

More details

Bootstrapping backends

Specifically, LCI has six different bootstrapping backends:

  • pmi1: Process Management Interface version 1.
  • pmi2: Process Management Interface version 2.
  • pmix: Process Management Interface X.
  • mpi: Use MPI to bootstrap LCI.
  • file: LCI-specific bootstrapping backend with a shared file system and flock.
  • local: Just set rank_me to 0 and rank_n to 1.

pmi1, pmi2, and pmix are the recommended backends to use. They are the same backends used by MPI. The mpi backend is a fallback option if you cannot find the PMI client library. The file backend is a non-scalable bootstrapping backend mainly for testing and debugging purposes.

By default, the source code of LCI is shipped with a copy of the SLURM PMI1 and PMI2 client implementation, so pmi1 and pmi2 are always compiled. pmix will be compiled if the CMake configuration of LCI finds the PMIx client library. The mpi backend must be explicitly enabled by setting the CMake variable LCT_PMI_BACKEND_ENABLE_MPI=ON. The file and local backends are always compiled.

However, the SLURM PMI1 and PMI2 client implementation is not always the best option. For example, if you are using mpirun, you may want to use the PMI client library that comes with your MPI implementation. In this case, you need to find the corresponding PMI client library and link LCI to it. ldd $(which mpirun) will show you the PMI client library used by mpirun. Normally, MPICH uses hydra-pmi; Cray-MPICH uses cray-pmi; OpenMPI uses pmix. After finding the PMI client library, you can reconfigure LCI with the corresponding PMI client library through the PMI_ROOT, PMI2_ROOT, or PMIx_ROOT environment/cmake variables.

A CMake variable LCT_PMI_BACKEND_DEFAULT and an environment variable LCT_PMI_BACKEND can be used to set a list of backends to try in order (if they are compiled). The first one that works will be used. The default value is pmi1,pmi2,pmix,mpi,file,local.

You can use export LCT_LOG_LEVEL=info to monitor the bootstrapping procedure.
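
For example, to force a particular backend and watch the bootstrapping procedure (both environment variables are described above; the launcher invocation is illustrative):

export LCT_PMI_BACKEND=pmi2
export LCT_LOG_LEVEL=info
srun --mpi=pmi2 -n 2 ./lci_program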

Launchers

srun and mpirun should use one of the PMI backends (or mpi as a last resort). lcrun will use the file backend.

Depending on the SLURM configuration, srun may not enable PMI by default. In this case, you can explicitly enable one of the PMI services by using the --mpi option. srun --mpi=list -n 1 hostname will show you the available PMI services. You can confirm whether the PMI service has been enabled with srun env | grep PMI.

mpirun will use the PMI client library that comes with your MPI implementation. As mentioned above, you need to link LCI to the correct PMI client library.

Sometimes, lcrun may hang because of a previous failed run. In this case, you can remove the temporary folder ~/.tmp/lct_pmi_file-* and try again.

More about the file backend

The file backend allows you to launch multiple LCI processes individually without a launcher. This can significantly ease the debugging process.

For example, you can open two terminal windows and run the following commands in each window:

export LCT_PMI_BACKEND=file
export LCT_PMI_FILE_NRANKS=2
./lci_program
# or launch it with gdb
gdb ./lci_program

This will launch two LCI processes with rank 0 and rank 1.