LCI v2.0.0-dev
For Asynchronous Multithreaded Communication
Tutorial

Add LCI to your project

The following CMake code will add LCI to your project. It will first try to find LCI on your system. If it is not found, it will download and build LCI from GitHub.

# Try to find LCI externally
find_package(
  LCI
  CONFIG
  PATH_SUFFIXES
    lib/cmake
    lib64/cmake)
if(NOT LCI_FOUND)
  message(STATUS "Existing LCI installation not found. Try FetchContent.")
  include(FetchContent)
  FetchContent_Declare(
    lci
    GIT_REPOSITORY https://github.com/uiuc-hpc/lci.git
    GIT_TAG master)
  FetchContent_MakeAvailable(lci)
endif()
# Link LCI to your target
target_link_libraries(your_target PRIVATE LCI::lci)

Install LCI

Although CMake FetchContent is convenient, you may also want to install LCI on your systems.

Prerequisites

  • A Linux or macOS laptop or cluster.
  • A C++ compiler that supports C++17 or higher (GCC 8 or higher, Clang 5 or higher, etc.).
  • CMake 3.12 or higher.
  • A network backend that supports LCI. Currently, LCI supports:
    • libibverbs. Typically for Infiniband/RoCE.
    • libfabric. For Slingshot-11, Ethernet, shared memory (including laptop), and other networks.

Normal clusters should already have these installed (though you may need a few module load commands). If you are using a laptop, you may need to install cmake and libfabric manually.

  • For Ubuntu, you can install the prerequisites using:
    sudo apt-get install -y cmake libfabric-bin libfabric-dev
  • For macOS, you can install the prerequisites using:
    brew install cmake libfabric

With CMake

Install LCI on your laptop

git clone https://github.com/uiuc-hpc/lci.git
cd lci
mkdir build
cd build
cmake -DCMAKE_INSTALL_PREFIX=/path/to/install -DLCI_NETWORK_BACKENDS=ofi ..
make
make install

Install LCI on a cluster

git clone https://github.com/uiuc-hpc/lci.git
cd lci
mkdir build
cd build
cmake -DCMAKE_INSTALL_PREFIX=/path/to/install ..
make
make install

This is basically the same as installing LCI on your laptop, except that you don't need to install the prerequisites (they are already installed on the cluster), and you don't need to manually select the network backend (LCI will automatically select the best one for you).

Important CMake variables

  • LCI_DEBUG=[ON|OFF]: Enable/disable the debug mode (more assertions and logs). The default value is OFF.
  • LCI_NETWORK_BACKENDS=[ibv|ofi]: accepts multiple values separated by commas, serving as a hint for which network backend to use. If a backend indicated by this variable is found, LCI will use it. Otherwise, LCI will use whatever backend is found, with the priority ibv > ofi. The default value is ibv,ofi. Typically, you don't need to modify this variable: if libibverbs is present, it is likely the recommended backend to use.
    • ibv: libibverbs, typically for infiniband/RoCE.
    • ofi: libfabric, for all other networks (slingshot-11, ethernet, shared memory).
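
For example, to configure a debug build that uses only the libfabric backend, the CMake invocation might look like this (the install path is a placeholder):

cmake -DCMAKE_INSTALL_PREFIX=/path/to/install \
      -DLCI_DEBUG=ON \
      -DLCI_NETWORK_BACKENDS=ofi \
      ..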

With Spack

LCI can be installed using Spack. An LCI recipe is included in the official Spack repository. For the most up-to-date version, you can also use the recipe in the contrib/spack directory of the LCI repository. To install LCI using Spack, run the following commands:

# Assume you have already installed Spack
# Optional: add the LCI spack recipe from the LCI repository
git clone https://github.com/uiuc-hpc/lci.git
spack repo add lci/contrib/spack
spack install lci

Important Spack variables

  • debug: define the CMake variable LCI_DEBUG.
  • backend=[ibv|ofi]: define the CMake variable LCI_NETWORK_BACKENDS.
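
For example, the following command installs a debug build that uses the libibverbs backend (standard Spack variant syntax, using the variant names listed above):

spack install lci +debug backend=ibv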

Cluster-specific Installation Note

NCSA Delta


tl;dr: module load libfabric, then specify the cmake variable -DLCI_NETWORK_BACKENDS=ofi or the Spack variable backend=ofi when building LCI.

The only caveat is that you need to pass the -DLCI_NETWORK_BACKENDS=ofi option to CMake. This is because Delta has both libibverbs and libfabric installed, but only libfabric works.

No additional srun arguments are needed to run LCI applications. However, we have noticed that srun can break under some mysterious module loading conditions. In that case, just use srun --mpi=pmi2 instead.

SDSC Expanse


tl;dr: use srun --mpi=pmi2 to run LCI applications.

You don't need to do anything special to install LCI on Expanse. Just follow the instructions above.

NERSC Perlmutter


tl;dr: module load cray-pmi before running the CMake command. Add cray-pmi as a Spack external package and add default-pm=cray when building LCI with spack install.

LCI needs to find the Perlmutter-installed Cray-PMI library. Do module load cray-pmi and then run the CMake command to configure LCI. Make sure you see something like this in the output:

-- Found PMI: /opt/cray/pe/pmi/6.1.15/lib/libpmi.so
-- Found PMI2: /opt/cray/pe/pmi/6.1.15/lib/libpmi2.so

When building LCI with spack install, you need to first add cray-pmi as a Spack external package. Put the following code in ~/.spack/packages.yaml:

packages:
  cray-pmi:
    externals:
    - spec: cray-pmi@6.1.15
      modules:
      - cray-pmi/6.1.15
    buildable: false

Afterwards, you can use spack install lci default-pm=cray.

TACC Frontera


tl;dr: use ibrun and the MPI bootstrap backend in LCI.

Frontera recommends using its ibrun command to launch multi-node applications, and its ibrun is tightly coupled with its MPI installation. Therefore, the recommended way to run LCI applications on Frontera is to use the MPI bootstrap backend. You can do this by setting the CMake variable LCT_PMI_BACKEND_ENABLE_MPI=ON and linking LCI to the MPI library.

It is also possible to directly use its srun and the PMI2 bootstrap backend in LCI. However, we found that this method can result in very slow bootstrapping times for large numbers of ranks (>=512).
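
A minimal sketch of the recommended workflow, assuming an MPI module is loaded and the paths are placeholders:

module load <mpi-module>
cmake -DCMAKE_INSTALL_PREFIX=/path/to/install -DLCT_PMI_BACKEND_ENABLE_MPI=ON ..
make && make install
# launch with ibrun
ibrun ./lci_program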

Write LCI programs

Overview

LCI is a C++ library that is implemented in C++17 but its public header is compatible with C++11 and later versions. A C API is planned—please open an issue if you need it. The LCI API is structured into four core components:

  • Resource Management: Allocation and deallocation of LCI resources.
  • Communication Posting: Posting communication operations.
  • Completion Checking: Checking the status of posted operations.
  • Progress: Ensuring that pending communication can be moved forward.

All LCI APIs are defined in lci.hpp and wrapped in the lci namespace.

Objectified Flexible Functions (OFF)

All LCI functions have a _x variant that implements the Objectified Flexible Functions (OFF) idiom. OFF allows optional arguments to be specified in any order, similar to named arguments in Python.

// the default post_send function
auto ret = post_send(rank, buf, size, tag, comp);
// the OFF variants
ret = post_send_x(rank, buf, size, tag, comp).device(device)();
ret = post_send_x(rank, buf, size, tag, comp)
          .matching_policy(matching_policy_t::rank_only)
          .device(device)();

Internally, an OFF is implemented as a functor with a constructor for positional arguments and setter methods for optional ones. The () operator triggers the actual function call.
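
For illustration, a heavily simplified sketch of how such a functor could be declared is shown below (hypothetical and incomplete; the real post_send_x supports many more optional arguments):

struct post_send_x {
  // positional arguments are captured by the constructor
  post_send_x(int rank, void* local_buffer, size_t size, tag_t tag,
              comp_t local_comp);
  // each optional argument gets a chainable setter
  post_send_x&& device(device_t device_in);
  post_send_x&& matching_policy(matching_policy_t policy_in);
  // operator() triggers the actual communication call
  status_t operator()();
};

So post_send_x(rank, buf, size, tag, comp).device(device)() constructs the functor, overrides the optional device argument, and finally invokes the operation.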

Resource Management

As a beginner user, you may not need to worry about this part. The only functions you need to know are:

// Initialize the default LCI runtime.
// Similar to MPI_Init.
lci::g_runtime_init();
// Query the rank and total number of ranks.
// Similar to MPI_Comm_rank(MPI_COMM_WORLD) and MPI_Comm_size(MPI_COMM_WORLD).
int rank_me = lci::g_rank_me();
int rank_n = lci::g_rank_n();
// Finalize the default LCI runtime.
// Similar to MPI_Finalize.
lci::g_runtime_fina();

For advanced users, you may want to know the following functions/resources (not a complete list):

Runtime

Instead of using the global default runtime, you can create your own runtime. This is useful if you are writing a library that might be used in an application that already uses LCI.

// Create a new LCI runtime.
lci::runtime_t runtime = lci::alloc_runtime();
// Example: post a receive operation to a specific runtime.
lci::post_recv_x(rank, buf, size, tag, comp).runtime(runtime)();
// Free the LCI runtime.
lci::free_runtime(&runtime);

All LCI OFFs can take a runtime as an optional argument. If not specified, the default runtime will be used.

Device

A device represents a set of communication resources. Using separate devices per thread ensures thread isolation:

// Create a new LCI device.
lci::device_t device = lci::alloc_device();
// Example: post a receive operation to a specific device.
lci::post_recv_x(rank, buf, size, tag, comp).device(device)();
// Free the LCI device.
lci::free_device(&device);

Communication Posting

LCI supports:

  • Send-receive with tag matching.
  • Active message.
  • RMA (Remote Memory Access): put/get/put with notification.

For those who are not familiar with the terminology: an active message is a mechanism that allows the sender to send a message to the receiver and execute a callback function on the receiver side; the receiver does not need to post a receive. RMA is a mechanism that allows one process to read/write the memory of another process without involving the target process.

Below are the default function signatures of these operations:

using namespace lci;
status_t post_send(int rank, void* local_buffer, size_t size, tag_t tag, comp_t local_comp);
status_t post_recv(int rank, void* local_buffer, size_t size, tag_t tag, comp_t local_comp);
status_t post_am(int rank, void* local_buffer, size_t size, comp_t local_comp, rcomp_t remote_comp);
status_t post_put(int rank, void* local_buffer, size_t size, comp_t local_comp, uintptr_t remote_disp, rmr_t rmr);
status_t post_get(int rank, void* local_buffer, size_t size, comp_t local_comp, uintptr_t remote_disp, rmr_t rmr);

Where is put with notification? post_put can take an optional rcomp argument to specify a remote completion handler.

All functions here are non-blocking: they merely post the communication operation and return immediately. The actual communication may not be completed yet. A communication posting function returns a status_t object that can be in one of three states:

  • done: the operation is completed.
  • posted: the operation is posted but not completed yet.
  • retry: the operation cannot be posted due to a non-fatal error (typically, some resources are temporarily unavailable). You can retry the posting.

You can use the is_done/is_posted/is_retry methods to check the status of the operation.

Below is an example of how to post a send operation and block until it completes:

void blocking_send(int rank, void* buf, size_t size, tag_t tag) {
  comp_t sync = alloc_sync();
  status_t status;
  do {
    status = post_send(rank, buf, size, tag, sync);
  } while (status.is_retry());
  if (status.is_posted()) {
    while (sync_test(sync, &status) == false)
      progress();
  }
  assert(status.is_done());
  // At this point, the send operation is completed
  // and the status object contains the information about
  // the completed operation.
  free_comp(&sync);
}

The usage of synchronizer (sync) and progress will be explained in the next section.

Completion Checking

LCI offers several mechanisms to detect when a posted operation completes. These mechanisms are designed to support both synchronous and asynchronous programming styles.

Synchronizer

A synchronizer behaves similarly to an MPI request. You can wait for its completion using sync_wait or test it with sync_test.

comp_t sync = alloc_sync();
status_t status = post_send(rank, buf, size, tag, sync);
if (status.is_posted()) {
  while (sync_test(sync, nullptr) == false) {
    progress(); // call progress to ensure the operation can be completed
  }
  // Alternatively, you can use sync_wait
}
free_comp(&sync);

Optionally, a synchronizer can accept multiple completion signals before becoming ready. For example,

// allocate a synchronizer with a threshold of 2
comp_t sync = alloc_sync_x().threshold(2)();
for (...) {
  post_send(..., sync);
  post_recv(..., sync);
  // becomes ready after both the send and the recv complete
  sync_wait(sync, nullptr);
}

Completion Queue

Applications that expect a large number of asynchronous operations can use a completion queue. Completed operations are pushed into the queue and can be polled later:

comp_t cq = alloc_cq();
// can have many pending operations
status_t status = post_am(..., cq, ...);
// Later or in another thread:
status = cq_pop(cq);
if (status.is_done()) {
  // Process completed operation
  // status contains information about the completed operation
  // such as the rank, tag, and buffer.
}

Handler

You can also register a callback handler. This is useful for advanced users who want LCI to directly invoke a function upon completion.

void my_handler(status_t status) {
  free(status.get_buffer().base); // free the message buffer
}
// ...
comp_t handler = alloc_handler(my_handler);
status_t status = post_send(rank, buf, size, tag, handler);

The handler is automatically called once the operation completes locally.

Graph (Advanced)

For complex asynchronous flows, LCI also supports graph_t, a completion object that represents a Directed Acyclic Graph (DAG) of operations and callbacks. This is similar in spirit to CUDA Graphs and can be used to build efficient non-blocking collectives.

(Check the "Non-blocking Barrier" example for more details.)

Progress

LCI decouples communication progress from posting and completion operations. Unlike MPI, where progress is implicit, LCI uses an explicit progress() call.

This enables users to:

  • Call progress() periodically in application threads
  • Dedicate threads to call progress() continuously
  • Integrate progress into event loops

Basic Usage

// On default device
lci::progress();
// With specified device
lci::progress_x().device(device)();

Call progress() frequently to ensure pending operations complete. In multithreaded programs, you can call progress() on a specific device.
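
For example, a dedicated progress thread (the second pattern listed above) could be sketched as follows; stop_flag and device are application-managed names assumed for illustration, and <thread> and <atomic> are assumed to be included:

std::atomic<bool> stop_flag(false);
std::thread progress_thread([&]() {
  // keep driving communication on this device until shutdown
  while (!stop_flag.load(std::memory_order_relaxed)) {
    lci::progress_x().device(device)();
  }
});
// ... application threads post operations and check completions ...
stop_flag = true;
progress_thread.join();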

Other Materials

Check the examples and tests subdirectories for more example code.

Read the LCI paper to comprehensively understand the LCI interface and runtime design.

Check out the API documentation for more details.

Examples

Hello World

This example shows the LCI runtime lifecycle and the query of rank.

// Copyright (c) 2025 The LCI Project Authors
// SPDX-License-Identifier: MIT
#include <iostream>
#include <unistd.h>
#include "lci.hpp"
// This example shows the LCI runtime lifecycle and the query of rank.
int main(int argc, char** args)
{
  char hostname[64];
  gethostname(hostname, 64);
  // Initialize the global default runtime.
  lci::g_runtime_init();
  // After at least one runtime is active, we can query the rank and nranks.
  // rank is the id of the current process
  // nranks is the total number of the processes in the current job.
  std::cout << "Hello world from rank " << lci::get_rank_me() << " of "
            << lci::get_rank_n() << " on " << hostname << std::endl;
  // Finalize the global default runtime.
  lci::g_runtime_fina();
  return 0;
}

Example output:
$ lcrun -n 4 ./lci_hello_world
Hello world from rank 1 of 4 on <hostname>
Hello world from rank 3 of 4 on <hostname>
Hello world from rank 0 of 4 on <hostname>
Hello world from rank 2 of 4 on <hostname>

Hello World (Active Message)

This example shows the usage of basic communication operations (active message) and completion mechanisms (synchronizer and handler).

// Copyright (c) 2025 The LCI Project Authors
// SPDX-License-Identifier: MIT
#include <iostream>
#include <sstream>
#include <string>
#include <cassert>
#include <cstdlib>
#include "lci.hpp"
// This example shows the usage of basic communication operations (active
// message) and completion mechanisms (synchronizer and handler).
// Using a flag for simple termination detection.
bool received = false;
// Define the function to be triggered when the active message arrives.
void am_handler(lci::status_t status)
{
  // Get the active message payload.
  lci::buffer_t payload = status.get_buffer();
  std::string payload_str(static_cast<char*>(payload.base), payload.size);
  // The active message payload buffer is allocated by the LCI runtime using
  // `malloc` by default. The user is responsible for freeing it after use.
  free(payload.base);
  // Print the hello world message.
  std::ostringstream oss;
  oss << "Rank " << lci::get_rank_me() << " received active message from rank "
      << status.rank << ". Payload: " << payload_str << std::endl;
  std::cout << oss.str();
  // Set the received flag to true.
  received = true;
}
int main(int argc, char** args)
{
  // Initialize the global default runtime.
  lci::g_runtime_init();
  // We use "synchronizer" as the source side completion object.
  // It is similar to a MPI request, but has an optional argument `threshold`
  // to accept multiple signals before becoming ready.
  lci::comp_t sync = lci::alloc_sync();
  // Register the active message handler as the target side completion object.
  // Since handler/cq needs to be referenced by another process, we need to
  // register it into a remote completion handler.
  lci::comp_t handler = lci::alloc_handler(am_handler);
  lci::rcomp_t rcomp = lci::register_rcomp(handler);
  // Since all ranks register the rcomp, all ranks will automatically have a
  // symmetric view. We do not need to explicitly exchange them.
  // Put a barrier here to ensure all ranks have registered the handler.
  lci::barrier();
  if (lci::get_rank_me() == 0) {
    for (int target = 0; target < lci::get_rank_n(); ++target) {
      // Prepare the active message payload.
      std::string payload =
          "Hello from rank " + std::to_string(lci::get_rank_me());
      // Post the active message to the target rank.
      auto send_buf =
          const_cast<void*>(static_cast<const void*>(payload.data()));
      // Unlike MPI_Isend, LCI posting operation can return a status in one of
      // the three states:
      // 1. `retry`: the posting failed due to resource being temporarily busy,
      // and the user can retry.
      lci::status_t status;
      do {
        status = lci::post_am(target, send_buf, payload.size(), sync, rcomp);
      } while (status.is_retry());
      // 2. `posted`: the operation is posted, and the completion object will
      // be signaled.
      if (status.is_posted()) {
        while (lci::sync_test(sync, &status /* can be nullptr */) == false) {
          lci::progress();
        }
        assert(status.is_done());
      }
      // 3. `done`: the operation is completed, the completion object will not
      // be signaled, and the user can check the status.
      assert(status.is_done());
      // at this point, all fields in the status object are valid.
    }
  }
  // Wait for the active message to arrive.
  while (!received) {
    lci::progress();
  }
  // Clean up the completion objects.
  lci::free_comp(&handler);
  lci::free_comp(&sync);
  // Finalize the global default runtime.
  lci::g_runtime_fina();
  return 0;
}

Example output:
$ lcrun -n 4 ./lci_hello_world_am
Rank 1 received active message from rank 0. Payload: Hello from rank 0
Rank 2 received active message from rank 0. Payload: Hello from rank 0
Rank 3 received active message from rank 0. Payload: Hello from rank 0
Rank 0 received active message from rank 0. Payload: Hello from rank 0

Distributed Array

This example shows the usage of RMA (Remote Memory Access) operations to implement a distributed array.

// Copyright (c) 2025 The LCI Project Authors
// SPDX-License-Identifier: MIT
#include <iostream>
#include <vector>
#include <cassert>
#include "lci.hpp"
// This example shows the usage of the LCI one-sided RMA operations through the
// implementation of a simple distributed array.
template <typename T>
class distributed_array_t
{
 public:
  distributed_array_t(size_t size, T default_val) : m_size(size)
  {
    m_per_rank_size = size / lci::get_rank_n();
    m_local_start = lci::get_rank_me() * m_per_rank_size;
    m_data.resize(m_per_rank_size, default_val);
    m_rmrs.resize(lci::get_rank_n());
    // RMA operations allow users to directly read/write remote memory on other
    // processes. To enable this, the following steps are required:
    // (1) Each target process (i.e., the process whose memory will be accessed
    // remotely) must register its local memory region and obtain a
    // corresponding remote key (rmr).
    // (2) Each target process must then share its base address and rmr with
    // all other ranks. This is typically done using an allgather or similar
    // collective operation.
    // These steps ensure that every process has the information needed to
    // perform one-sided RMA operations (e.g., put/get) to any other process's
    // registered memory buffer.
    mr = lci::register_memory(m_data.data(), m_data.size() * sizeof(T));
    lci::rmr_t rmr = lci::get_rmr(mr);
    // exchange the memory registration information with other ranks
    lci::allgather(&rmr, m_rmrs.data(), sizeof(lci::rmr_t));
  }
  ~distributed_array_t()
  {
    // we need a barrier to ensure that all remote operations are completed
    lci::barrier();
    // deregister my memory buffer
    lci::deregister_memory(&mr);
  }
  // a blocking get operation
  T get(size_t index)
  {
    int target_rank = get_target_rank(index);
    size_t local_index = get_local_index(index);
    lci::status_t status;
    T value;
    do {
      status =
          lci::post_get(target_rank, &value, sizeof(T), lci::COMP_NULL_RETRY,
                        local_index * sizeof(T), m_rmrs[target_rank]);
    } while (status.is_retry());
    assert(status.is_done());
    return value;
  }
  // a blocking put operation
  void put(size_t index, const T& value)
  {
    int target_rank = get_target_rank(index);
    size_t local_index = get_local_index(index);
    lci::status_t status;
    do {
      status = lci::post_put_x(target_rank,
                               static_cast<void*>(const_cast<T*>(&value)),
                               sizeof(T), lci::COMP_NULL_RETRY,
                               local_index * sizeof(T), m_rmrs[target_rank])
                   .comp_semantic(lci::comp_semantic_t::network)();
    } while (status.is_retry());
    assert(status.is_done());
  }

 private:
  size_t m_size;
  size_t m_per_rank_size;
  size_t m_local_start;
  std::vector<T> m_data;
  // LCI memory registration information
  lci::mr_t mr;
  std::vector<lci::rmr_t> m_rmrs;
  int get_target_rank(size_t index) const
  {
    assert(index < m_size);
    return index / m_per_rank_size;
  }
  size_t get_local_index(size_t index) const
  {
    assert(index < m_size);
    return index % m_per_rank_size;
  }
};
void work(size_t size)
{
  distributed_array_t<int> darray(size, 0);
  int rank = lci::get_rank_me();
  int nranks = lci::get_rank_n();
  for (size_t i = rank; i < size; i += nranks) {
    darray.put(i, i);
  }
  for (size_t i = (rank + 1) % nranks; i < size; i += nranks) {
    int value = darray.get(i);
    assert(value == (int)i);
  }
}
int main(int argc, char** args)
{
  lci::g_runtime_init();
  const size_t size = 1000;
  work(size);
  lci::g_runtime_fina();
  return 0;
}

Example output:
# This example has no output.
$ lcrun -n 4 ./lci_distributed_array

Non-blocking Barrier

This example shows the usage of the completion graph and the send/recv operations to implement a non-blocking barrier.

// Copyright (c) 2025 The LCI Project Authors
// SPDX-License-Identifier: MIT
#include <unistd.h>
#include <cstdio>
#include "lci.hpp"
// This example shows the usage of the completion graph and the send/recv
// operations.
// create the graph according to the dissemination algorithm
lci::comp_t create_ibarrier_graph()
{
  int rank_me = lci::get_rank_me();
  int rank_n = lci::get_rank_n();
  lci::comp_t graph = lci::alloc_graph();
  // GRAPH_START is a special node that indicates the start of the graph.
  lci::graph_node_t old_node = lci::GRAPH_START;
  lci::graph_node_t dummy_node;
  // The dissemination algorithm contains log2(rank_n) rounds.
  // In each round, each rank sends a message to rank_me + 2^round and
  // receives a message from rank_me - 2^round.
  for (int jump = 1; jump < rank_n; jump *= 2) {
    int rank_to_recv = (rank_me - jump + rank_n) % rank_n;
    int rank_to_send = (rank_me + jump) % rank_n;
    // Define the communication operations for each round.
    // We cannot explicitly retry in the graph, so we set allow_retry to false.
    // The runtime will handle the retry using the internal backlog queues.
    auto recv =
        lci::post_recv_x(rank_to_recv, nullptr, 0, 0, graph).allow_retry(false);
    auto send =
        lci::post_send_x(rank_to_send, nullptr, 0, 0, graph).allow_retry(false);
    // Note that we do not trigger the operations here.
    // Instead, we make them nodes in the graph.
    auto recv_node = lci::graph_add_node_op(graph, recv);
    auto send_node = lci::graph_add_node_op(graph, send);
    // To make the code more readable, we use a dummy node to represent the end
    // of the round.
    if (jump * 2 >= rank_n) {
      // this is the last round
      // GRAPH_END is a special node that indicates the end of the graph.
      dummy_node = lci::GRAPH_END;
    } else {
      // we can make arbitrary functions nodes in the graph
      // The graph expects the function of a node to either return `done` or
      // `posted`. In the case of `done`, the runtime will immediately trigger
      // its children. In the case of `posted`, the runtime will do nothing and
      // the node will be considered pending until
      // `graph_node_mark_complete(node)` is called.
      dummy_node = lci::graph_add_node(
          graph, [](void*) -> lci::status_t { return lci::errorcode_t::done; });
    }
    // Specify the dependencies between the nodes.
    // Wait for the previous round to finish before starting this round.
    lci::graph_add_edge(graph, old_node, recv_node);
    lci::graph_add_edge(graph, old_node, send_node);
    // Wait for the send and recv operations to finish before moving to the
    // next round.
    lci::graph_add_edge(graph, recv_node, dummy_node);
    lci::graph_add_edge(graph, send_node, dummy_node);
    old_node = dummy_node;
  }
  return graph;
}
int main()
{
  // Initialize the global default runtime.
  lci::g_runtime_init();
  // create a graph describing the operations needed by the barrier
  lci::comp_t graph = create_ibarrier_graph();
  // the graph can be reused
  for (int i = 0; i < 3; ++i) {
    if (lci::get_rank_me() == 0) {
      // create some asymmetric delay
      sleep(1);
    }
    fprintf(stderr, "rank %d start barrier\n", lci::get_rank_me());
    // start executing those operations
    lci::graph_start(graph);
    // wait for the operations to finish
    while (lci::graph_test(graph).is_retry()) {
      lci::progress();
    }
    fprintf(stderr, "rank %d end barrier\n", lci::get_rank_me());
  }
  // free the graph
  lci::free_comp(&graph);
  // Finalize the global default runtime.
  lci::g_runtime_fina();
  return 0;
}

Example output:
$ lcrun -n 4 ./lci_nonblocking_barrier
rank 1 start barrier
rank 2 start barrier
rank 3 start barrier
rank 0 start barrier
rank 0 end barrier
rank 1 end barrier
rank 1 start barrier
rank 3 end barrier
rank 3 start barrier
rank 0 start barrier
rank 2 end barrier
rank 2 start barrier
rank 2 end barrier
rank 2 start barrier
rank 0 end barrier
rank 3 end barrier
rank 3 start barrier
rank 1 end barrier
rank 1 start barrier
rank 0 start barrier
rank 1 end barrier
rank 0 end barrier
rank 2 end barrier
rank 3 end barrier

Multithreaded Active Message Ping-pong

This example shows the usage of thread-local devices to speed up active message communication in a multithreaded environment.

// Copyright (c) 2025 The LCI Project Authors
// SPDX-License-Identifier: MIT
#include <iostream>
#include <thread>
#include <vector>
#include <cassert>
#include <chrono>
#include <atomic>
#include <cstring>
#include <cstdlib>
#include "lct.h"
#include "lci.hpp"
// This example shows the usage of thread-local devices.
const int nthreads = 4;
const int nmsgs = 1000;
const size_t msg_size = 8;
LCT_tbarrier_t thread_barrier;
std::atomic<int> thread_sequence_control(0);
void worker(int thread_id)
{
  int rank = lci::get_rank_me();
  int nranks = lci::get_rank_n();
  int peer_rank;
  if (nranks == 1) {
    peer_rank = rank;
  } else {
    peer_rank = (rank + nranks / 2) % nranks;
  }
  // allocate resources
  // device and rcomp allocation needs to be synchronized to ensure uniformity
  // across ranks.
  while (thread_sequence_control != thread_id) continue;
  lci::device_t device = lci::alloc_device();
  lci::comp_t cq = lci::alloc_cq();
  lci::rcomp_t rcomp = lci::register_rcomp(cq);
  if (++thread_sequence_control == nthreads) thread_sequence_control = 0;
  void* send_buf = malloc(msg_size);
  memset(send_buf, rank, msg_size);
  LCT_tbarrier_arrive_and_wait(thread_barrier);
  auto start = std::chrono::high_resolution_clock::now();
  if (nranks == 1 || rank < nranks / 2) {
    // sender
    for (int i = 0; i < nmsgs; i++) {
      // send a message
      lci::post_am_x(peer_rank, send_buf, msg_size, lci::COMP_NULL, rcomp)
          .device(device)
          .tag(thread_id)();
      // wait for an incoming message
      lci::status_t status;
      do {
        lci::progress_x().device(device)();
        status = lci::cq_pop(cq);
      } while (status.is_retry());
      if (status.tag != thread_id) {
        std::cerr << "thread_id: " << thread_id
                  << ", status.tag: " << status.tag << std::endl;
      }
      assert(status.tag == thread_id);
      lci::buffer_t recv_buf = status.get_buffer();
      assert(recv_buf.size == msg_size);
      for (size_t j = 0; j < msg_size; j++) {
        assert(((char*)recv_buf.base)[j] == peer_rank);
      }
      free(recv_buf.base);
    }
  } else {
    // receiver
    for (int i = 0; i < nmsgs; i++) {
      // wait for an incoming message
      lci::status_t status;
      do {
        lci::progress_x().device(device)();
        status = lci::cq_pop(cq);
      } while (status.is_retry());
      assert(status.tag == thread_id);
      lci::buffer_t recv_buf = status.get_buffer();
      assert(recv_buf.size == msg_size);
      for (size_t j = 0; j < msg_size; j++) {
        assert(((char*)recv_buf.base)[j] == peer_rank);
      }
      free(recv_buf.base);
      // send a message
      lci::post_am_x(peer_rank, send_buf, msg_size, lci::COMP_NULL, rcomp)
          .device(device)
          .tag(thread_id)();
    }
  }
  LCT_tbarrier_arrive_and_wait(thread_barrier);
  auto end = std::chrono::high_resolution_clock::now();
  if (thread_id == 0 && rank == 0) {
    std::cout << "pingpong_am_mt: " << std::endl;
    std::cout << "Number of threads: " << nthreads << std::endl;
    std::cout << "Number of messages: " << nmsgs << std::endl;
    std::cout << "Message size: " << msg_size << " bytes" << std::endl;
    std::cout << "Number of ranks: " << nranks << std::endl;
    double total_time_us =
        std::chrono::duration_cast<std::chrono::microseconds>(end - start)
            .count();
    double msg_rate_uni =
        1.0 * nmsgs * nthreads * (nranks + 1) / 2 / total_time_us;
    double bandwidth_uni = msg_rate_uni * msg_size;
    std::cout << "Total time: " << total_time_us / 1e6 << " s" << std::endl;
    std::cout << "Message rate: " << msg_rate_uni << " mmsg/s" << std::endl;
    std::cout << "Bandwidth: " << bandwidth_uni << " MB/s" << std::endl;
  }
  free(send_buf);
  // free resources
  while (thread_sequence_control != thread_id) {
    lci::progress_x().device(device)();
  }
  lci::free_comp(&cq);
  lci::free_device(&device);
  if (++thread_sequence_control == nthreads) thread_sequence_control = 0;
}
int main(int argc, char** args)
{
  // Initialize the global default runtime.
  // Here we use the *objectified flexible function* version of the
  // `g_runtime_init` operation and specify that the default device should not
  // be allocated.
  lci::g_runtime_init_x().alloc_default_device(false)();
  // After at least one runtime is active, we can query the rank and nranks.
  // rank is the id of the current process
  // nranks is the total number of the processes in the current job.
  assert(lci::get_rank_n() == 1 || lci::get_rank_n() % 2 == 0);
  // get a thread barrier
  thread_barrier = LCT_tbarrier_alloc(nthreads);
  // spawn the threads to do the pingpong
  if (nthreads == 1) {
    worker(0);
  } else {
    std::vector<std::thread> threads;
    for (int i = 0; i < nthreads; i++) {
      threads.push_back(std::thread(worker, i));
    }
    for (auto& thread : threads) {
      thread.join();
    }
  }
  // free the thread barrier
  LCT_tbarrier_free(&thread_barrier);
  // Finalize the global default runtime.
  lci::g_runtime_fina();
  return 0;
}

Example output (run on a laptop; performance may vary):
$ lcrun -n 4 ./lci_pingpong_am_mt
pingpong_am_mt:
Number of threads: 4
Number of messages: 1000
Message size: 8 bytes
Number of ranks: 4
Total time: 0.035286 s
Message rate: 0.283399 mmsg/s
Bandwidth: 2.26719 MB/s

Run LCI applications

In Quick Start, we have shown you how to run LCI applications using mpirun or srun. Here, we will discuss the bootstrapping process in more detail.

To successfully bootstrap LCI, the launcher (srun, mpirun, or lcrun) must match the bootstrapping backend used by LCI. Normally, LCI automatically selects the right bootstrapping backend based on the environment, so no special configuration is needed. However, if your application is launched as a collection of processes that all report rank 0, something went wrong.

Run LCI applications with lcrun

You do not need to do anything special to run LCI applications with lcrun. However, lcrun is a "toy" launcher that is not as scalable as srun or mpirun. It is mainly used for testing and debugging purposes.

If you ever encounter a problem with lcrun, you can remove the temporary folder ~/.tmp/lct_pmi_file-* and try again.

Run LCI applications with srun

LCI is shipped with a copy of the SLURM PMI1 and PMI2 client implementation, so normally you can use srun to run LCI applications without any extra configuration. You may need to explicitly enable pmi2 support with srun --mpi=pmi2.

On Cray systems, you may need to load the cray-pmi module before building LCI as srun on some Cray systems only supports Cray PMI.

Run LCI applications with mpirun

Because there are many different MPI implementations and no standard for how they implement mpirun, it is slightly more complicated to run LCI applications with mpirun. In such cases, the easiest way is to let LCI use MPI to bootstrap. You just need to set the CMake variable LCT_PMI_BACKEND_ENABLE_MPI=ON and link LCI to MPI.
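
A sketch of this approach (the install path is a placeholder, and MPI is assumed to be discoverable by CMake):

# configure LCI with the MPI bootstrap backend enabled
cmake -DCMAKE_INSTALL_PREFIX=/path/to/install -DLCT_PMI_BACKEND_ENABLE_MPI=ON ..
make && make install
# then launch as usual
mpirun -n 4 ./lci_program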

It is possible to directly use the PMI backend with mpirun, but you need to find the corresponding PMI client library and link LCI to it. Read the following section for more details.

More details

Bootstrapping backends

Specifically, LCI has six different bootstrapping backends:

  • pmi1: Process Management Interface version 1.
  • pmi2: Process Management Interface version 2.
  • pmix: Process Management Interface X.
  • mpi: Use MPI to bootstrap LCI.
  • file: LCI-specific bootstrapping backend with a shared file system and flock.
  • local: Just set rank_me to 0 and rank_n to 1.

pmi1, pmi2, and pmix are the recommended backends to use. They are the same backends used by MPI. The mpi backend is a fallback option if you cannot find the PMI client library. The file backend is a non-scalable bootstrapping backend mainly for testing and debugging purposes.

By default, the source code of LCI is shipped with a copy of the SLURM PMI1 and PMI2 client implementation, so pmi1 and pmi2 are always compiled. pmix will be compiled if the CMake configuration of LCI finds the PMIx client library. The mpi backend must be explicitly enabled by setting the CMake variable LCT_PMI_BACKEND_ENABLE_MPI=ON. The file and local backends are always compiled.

However, the SLURM PMI1 and PMI2 client implementation is not always the best option. For example, if you are using mpirun, you may want to use the PMI client library that comes with your MPI implementation. In this case, you need to find the corresponding PMI client library and link LCI to it. ldd $(which mpirun) will show you the PMI client library used by mpirun. Normally, MPICH uses hydra-pmi; Cray-MPICH uses cray-pmi; OpenMPI uses pmix. After finding the PMI client library, you can reconfigure LCI with the corresponding PMI client library through the PMI_ROOT, PMI2_ROOT, or PMIx_ROOT environment/cmake variables.

A CMake variable LCT_PMI_BACKEND_DEFAULT and an environment variable LCT_PMI_BACKEND can be used to set a list of backends to try in order (if they are compiled). The first one that works will be used. The default value is pmi1,pmi2,pmix,mpi,file,local.

You can use export LCT_LOG_LEVEL=info to monitor the bootstrapping procedure.
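
For example, to force a particular backend and watch the bootstrapping procedure (both environment variables are described above; the launcher invocation is illustrative):

export LCT_PMI_BACKEND=pmi2
export LCT_LOG_LEVEL=info
srun --mpi=pmi2 -n 2 ./lci_program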

Launchers

srun and mpirun should use one of the PMI backends (or mpi as a last resort). lcrun will use the file backend.

Depending on the SLURM configuration, srun may not enable PMI by default. In this case, you can explicitly enable one of the PMI services by using the --mpi option. srun --mpi=list -n 1 hostname will show you the available PMI services. You can confirm whether the PMI service has been enabled with srun env | grep PMI.

mpirun will use the PMI client library that comes with your MPI implementation. As mentioned above, you need to link LCI to the correct PMI client library.

Sometimes, lcrun may hang because of a previous failed run. In this case, you can remove the temporary folder ~/.tmp/lct_pmi_file-* and try again.

More about the file backend

The file backend allows you to launch multiple LCI processes individually without a launcher. This can significantly ease the debugging process.

For example, you can open two terminal windows and run the following commands in each window:

export LCT_PMI_BACKEND=file
export LCT_PMI_FILE_NRANKS=2
./lci_program
# or launch it with gdb
gdb ./lci_program

This will launch two LCI processes with rank 0 and rank 1.