1. Overview

This manual provides reference material for the Content-MPI (C-MPI) DHT.

C-MPI provides a key/value store for distributed computing over MPI.

2. Quick Start

The fastest way to get an overview of the provided features is to run:

./setup.sh
./configure --config-cache --enable-table-dense-1 \
                           --enable-tests \
                           --with-mpi=${HOME}/sfw/mpich2-1.2.1p1
make D=1 test_results

Then, just take a look at the test code and output to see how things work.

3. Use Cases

3.1. MPI Library

C-MPI can be used as an MPI library. In this mode, the user allocates some number of DHT nodes and DHT clients. The nodes start up and begin listening for requests. Each client enters the user-provided cmpi_client_code(), does work, and makes C-MPI calls.

More advanced user programs can link to libcmpi.a directly as well by imitating cmpi/client.c.

#include <stdio.h>
#include <string.h>
#include <cmpi.h>

// mpi_rank and mpi_size are assumed to be provided by the client framework.
void cmpi_client_code()
{
  char key[32], value[32];
  void* result;
  int length, rank;

  // Store a value under this rank's key.
  sprintf(key,   "key_%i",   mpi_rank);
  sprintf(value, "value_%i", mpi_rank);
  cmpi_put(key, value, strlen(value)+1);

  // Retrieve the next rank's value.
  rank = (mpi_rank+1)%mpi_size;
  sprintf(key, "key_%i", rank);
  cmpi_get(key, &result, &length);
  printf("result(%i): %s\n", length, (char*) result);
}

3.2. Cluster database

In this mode, the DHT runs as a distributed background process (cmpi-db) and the user connects to it via a cp-like tool (cmpi-cp).

Cluster mode operation

Commands executed on submit host:

#!/bin/sh

# Allocate 6 compute nodes
qsub -t 1-6 ...

# List node host names to file for mpiexec
qstat | something > hosts

# Create some initial conditions
create_initial > input

# Launch the DHT (12 processes on 6 nodes)
mpiexec.hydra -f hosts -n 12 bin/cmpi-db -n 6 &

# Spawn client scripts
# (Could use Falkon here?)
id=0
total=$( wc -l < hosts )
for host in $( cat hosts )
do
  ssh ${host} client_script.sh ${id} ${total} &
  id=$(( id + 1 ))
done

wait

client_script.sh:

#!/bin/bash
# (bash is required for the (( )) arithmetic loop below)

id=$1
nodes=$2

for (( i=0 ; i<10 ; i++ ))
do
  # Do some useful work
  do_work ${id} ${nodes} < input > output
  # Post the results to the DHT
  key=output_${id}_${i}
  cmpi-cp output dht://${key}
  # Read a neighbor's result
  key=output_$(( (id-1+nodes) % nodes ))_${i}
  cmpi-cp dht://${key} input
done

3.3. MPI-IO implementation

The CMPI-IO module is really only a sketch. Here’s the idea:

  1. C-MPI provides a patch for MPICH. The user applies the patch and recompiles MPICH.

  2. The user writes a normal MPI/MPI-IO program but provides cmpi:/ pathnames to trigger the CMPI-IO implementation.

  3. The user launches the program with mpiexec, allocating extra processes for the DHT.

  4. Calls like MPI_FILE_WRITE_AT() are translated by ROMIO and implemented ultimately by calls like ADIOI_CMPI_WriteContig(). (The full list of what CMPI-IO needs to implement is in ad_cmpi.h; it’s actually not that bad.) This method would be implemented via calls to cmpi_put()/cmpi_get(), as in the sketch below.
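
For illustration only, here is one way a contiguous write/read could map onto the DHT. The helper names and the key scheme (path plus offset) are assumptions for this sketch, not the actual ad_cmpi.h interface:

#include <stdio.h>
#include <cmpi.h>

// Hypothetical: store each contiguous region under its own key.
static void cmpi_io_write_contig(const char* path, long offset,
                                 void* buf, int length)
{
  char key[256];
  // One key per contiguous region: "<path>:<offset>".
  snprintf(key, sizeof(key), "%s:%ld", path, offset);
  cmpi_put(key, buf, length);
}

static void cmpi_io_read_contig(const char* path, long offset,
                                void** buf, int* length)
{
  char key[256];
  snprintf(key, sizeof(key), "%s:%ld", path, offset);
  cmpi_get(key, buf, length);   // result arrives in fresh storage
}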

3.4. FUSE module

A FUSE adapter could be built using techniques similar to those above. The user would first instantiate the cmpi-db. Then, the FUSE implementation would make calls using the driver interface, similar to the way that cmpi-cp does; a rough sketch follows.
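
As a rough sketch only: a read-only FUSE read callback backed by cmpi_get(). Apart from cmpi_get() and the standard FUSE API, everything here (the path-to-key mapping, the omitted error handling) is an assumption:

#define FUSE_USE_VERSION 26
#include <string.h>
#include <stdlib.h>
#include <fuse.h>
#include <cmpi.h>

// Read callback: would be registered in a struct fuse_operations
// passed to fuse_main().
static int cmpi_fuse_read(const char* path, char* buf, size_t size,
                          off_t offset, struct fuse_file_info* fi)
{
  void* result;
  int length;

  // Strip the leading '/' to form the DHT key (assumed mapping).
  cmpi_get((char*) path + 1, &result, &length);

  if (offset >= length)
  {
    free(result);
    return 0;
  }
  if (offset + (off_t) size > length)
    size = length - offset;
  memcpy(buf, (char*) result + offset, size);
  free(result);
  return size;
}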

3.5. MPI-RPC

3.5.1. Overview

The MPIRPC component allows the user to issue multiple asynchronous requests from a single thread, using a higher-level, event-driven RPC model.

3.5.2. Definitions

MPIRPC object

Created by an MPIRPC call; can be waited on. On call completion, it is passed to the proceed-function and contains the result of the call. Must be freed by the user with MPIRPC_Free().

proceed-function

A user function pointer passed into an MPIRPC call. Called by MPIRPC on call completion with the MPIRPC object. This allows the caller to make progress after the call completes and to obtain the result.

handler

A user function that is the target of an MPIRPC call. Called by MPIRPC on an incoming request.

MPIRPC_Call

MPIRPC_Call() creates an MPIRPC object and starts performing the RPC asynchronously. The MPIRPC object will contain the results of the call when complete. The arguments are:

MPIRPC_Node target

The target MPIRPC_Node, which is an MPI communicator and rank.

char* name

Remote function name. Copied into the MPIRPC object. Limited to MPIRPC_MAX_NAME (128) characters. May not be NULL.

char* args

Short NULL-terminated string for user control data arguments. Copied into the MPIRPC object. Limited to MPIRPC_MAX_ARGS (256) characters. May be NULL.

void* extras

Extra user state accessible by the proceed-function.

void (*proceed)(MPIRPC*)

The proceed-function.

The MPIRPC Object

The MPIRPC object contains:

MPIRPC_Node target

The target MPIRPC_Node

status

The status of the call: MPIRPC_STATUS_PROTO, MPIRPC_STATUS_CALLED, or MPIRPC_STATUS_RETURNED.

char[] name

Copy of the remote procedure name.

char[] args

Copy of the user argument string.

char* blob

Pointer to the user data blob.

int blob_length

Length of the user data blob.

void* result

Pointer to result data returned by remote procedure in fresh storage.

int result_length

Length of result data.

void* extras

Extra user pointer, useful for the proceed-function.

int unique

Internal uniquifier.

bool cancelled

Not yet used.

MPI_Request request[]

Internal MPI objects. Released by MPIRPC_Free().

Example

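A minimal sketch based on the definitions above. The header name, MPIRPC_Wait(), and the exact field access are assumptions:

#include <stdio.h>
#include <mpirpc.h>   // assumed header name

// Proceed-function: called by MPIRPC when the call completes.
void my_proceed(MPIRPC* rpc)
{
  printf("result(%i): %s\n", rpc->result_length, (char*) rpc->result);
}

void issue_call(MPIRPC_Node node)
{
  // Start the RPC asynchronously: call remote procedure "echo"
  // with argument string "hello" and no extra user state.
  MPIRPC* rpc = MPIRPC_Call(node, "echo", "hello", NULL, my_proceed);

  // ... issue more calls or do other useful work here ...

  MPIRPC_Wait(rpc);   // assumed name: the object "can be waited on"
  MPIRPC_Free(rpc);   // the user must free the object
}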

3.6. Usage notes

  • Handler routines: The handler must copy the incoming args if it wants to save them. The handler must return by calling MPIRPC_Return() or MPIRPC_Null(). Handlers can call into MPIRPC_Call() themselves, but the flow eventually returns to the original caller. A minimal handler sketch follows.
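
A minimal handler sketch; the handler signature and the arguments to MPIRPC_Return() are assumptions based on the definitions above:

#include <string.h>
#include <stdlib.h>
#include <mpirpc.h>   // assumed header name

void echo_handler(MPIRPC* rpc)
{
  // Copy the incoming args first: the handler must copy them
  // if it wants to save them.
  char* saved = strdup(rpc->args);

  // Every handler must finish with MPIRPC_Return() or MPIRPC_Null();
  // here the args are echoed back as the result (assumed signature).
  MPIRPC_Return(rpc, saved, strlen(saved)+1);
  free(saved);   // assumes MPIRPC_Return() copies the data
}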

4. C-MPI

4.1. Overview

C-MPI is intended to be an easy-to-use, MPI-based distributed key/value store.

4.2. API

The C-MPI API:

cmpi_init()

Initialize C-MPI. The user must first call MPI_Init() and MPIRPC_Init().

cmpi_put()

Post a key/value pair in C-MPI. Arguments:

char* key

NULL-terminated string key. Passed as the args argument to MPIRPC_Call_blob().

void* value

Variable-length user data. Passed as the blob argument to MPIRPC_Call_blob().

int length

Byte-length of value.

cmpi_update()

Update the value of a key/value pair in C-MPI. Arguments:

char* key

NULL-terminated string key. Passed as the args argument to MPIRPC_Call_blob().

void* value

Variable-length user data. Passed as the blob argument to MPIRPC_Call_blob().

int length

Byte-length of value.

int offset

The offset at which to begin overwriting. Offset+length may exceed the original value length. The key/value pair need not originally exist, but if it does not, the offset must be 0.

cmpi_get()

Extract the value of a key/value pair into fresh storage. Arguments:

char* key

NULL-terminated string key. Passed as the args argument to MPIRPC_Call_blob().

void** result

Will be set to the location of the extracted data in fresh storage.

int* length

Will be set to the length of the extracted data.
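
Putting the calls together: a minimal sketch of client code using put, update, and get. It is assumed to run inside a C-MPI client (e.g., from cmpi_client_code()) after MPI_Init(), MPIRPC_Init(), and cmpi_init():

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <cmpi.h>

void client_work(void)
{
  void* result;
  int length;

  // Store an initial value (8 bytes including the terminator).
  cmpi_put("config", "alpha=1", strlen("alpha=1")+1);

  // Overwrite 2 bytes at offset 6: "alpha=1" becomes "alpha=2".
  cmpi_update("config", "2", 2, 6);

  // Retrieve the value into fresh storage.
  cmpi_get("config", &result, &length);
  printf("config(%i): %s\n", length, (char*) result);
  free(result);   // assumed: the caller releases the fresh storage
}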

5. Configure, build, test

5.1. Configure

  1. Run ./setup.sh.

  2. Run ./configure.

    --with-mpi

    Mandatory. The location of your MPICH installation. E.g., --with-mpi=${HOME}/sfw/mpich2-1.2.1p1.

    --enable-table-*

    Mandatory. The underlying DHT algorithm. Current options:

    --enable-table-dense-1

    Simple dense table defined in src/dense-1. Uses monolithic communicator. Hashes keys and assigns them to nodes by modulus (sketched after this option list). Not really a DHT; doesn’t need neighbor tables. Great for debugging: simple and fast.

    --enable-table-kda-2A

    Kademlia implementation defined in src/kda-2, linked with src/kda-2/conn-A.c. Uses monolithic communicator.

    --enable-table-kda-2B

    Kademlia implementation defined in src/kda-2, linked with src/kda-2/conn-B.c. Uses MPI-2 dynamic processes.

    --enable-mode-*

    The node layout scheme (see [cmpi-modes]).

    --enable-tests

    Turn on support for tests. When compiling tests, be sure to use make D=1.

    --enable-cmpi-io

    Turn on support for the skeletal CMPI-IO component.
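
For reference, the dense-1 key-to-node mapping mentioned under --enable-table-dense-1 amounts to hashing the key and taking the modulus. The hash function used here (djb2) is illustrative only, not necessarily the one in src/dense-1:

#include <stdio.h>

// Illustrative string hash (djb2); an assumption for this sketch.
unsigned long hash_key(const char* key)
{
  unsigned long h = 5381;
  for ( ; *key != '\0'; key++)
    h = h * 33 + (unsigned char) *key;
  return h;
}

// Dense-table idea: every process can compute the owner of any key.
int key_to_node(const char* key, int num_nodes)
{
  return (int) (hash_key(key) % num_nodes);
}

int main(void)
{
  // With 4 DHT nodes, each key maps to a fixed rank in 0..3.
  printf("key_0 -> node %i\n", key_to_node("key_0", 4));
  return 0;
}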

5.2. Build

Each component defines its build behavior in a module.mk.in file. These are converted to module.mk by configure (or config.status). These components typically append to variables or define additional targets. The top-level Makefile includes all module.mk files and manages the whole build.

Functions like cmpi_get() are defined in multiple places (cmpi_dense.c, cmpi-kademlia.c). The choice of which to compile is made at configure time.

Useful arguments to make:

D=1

Turn on debugging output. Mandatory for many tests.

V=1

Turn on verbose build output. Useful when debugging build process.

-j

You can do make -j tests, but running the tests themselves concurrently is not recommended.

mpirpc

Build a stand-alone MPI-RPC lib.

cmpi

Build a stand-alone C-MPI lib.

tests

Build (but do not run) the test programs.

test/<module>/test-success.out

Run all the tests for <module> after all of the tests for its dependencies.

tags

Make an etags file based on the configuration choices.

5.3. Test

The tests are defined for each component in module.mk. For each test-*.c, a test-*.x executable is produced and launched (the launcher is mpiexec); assert()s and output parsing are used to confirm correctness. Output is collected in test-*.out. If the test is run from a test-*.zsh script, debugging output is collected and post-processed by the ZSH script. If the test fails, the output is moved to test-*.out.failed (so make does not consider it up to date).

make D=1 test_results

Build and run all the tests. Requires ./configure --enable-tests.

6. Components

The C-MPI components are outlined below.

cmpi

The C-MPI interface. Some reusable functionality is defined.

cmpi_*.c

The translation layer. Translates C-MPI calls into calls to a DHT implementation.

mode_*.c

C-MPI mode selection made at configure time.

mono

Given mpiexec -n 6 cmpi-db -n 4, creates 4 DHT nodes (ranks 0-3) and 2 clients (ranks 4-5). The clients will use all nodes as contacts. Ideal for the SMP case.

pairs

Given mpiexec -n 6 cmpi-db -n 3, creates 3 DHT nodes (ranks 0-2) and 3 clients (ranks 3-5). The clients connect to a single contact (0:3, 1:4, 2:5). Ideal for the cluster case: each physical node will have one node process and one client process, the client will only contact the local node.

node.c

Reusable C-MPI node. Generally, a C-MPI node is anything that calls into cmpi_init().

client.c

Reusable C-MPI client.

Generally, a C-MPI client is anything that calls into cmpi_init_client(). client.c calls into cmpi_client_code(), which is a convenient way to reuse the setup routines in client.c. See the tests for use of cmpi_client_code().

Other clients could be built that do different things, such as provide filesystem interfaces.

driver

Issue commands to a client over a stream interface.

cmpi-db

Instantiates several nodes and clients. The clients may be manipulated by a driver.

cmpi-cp

Acts as a "user" in the cluster database scenario above (section 3.2).

adts

Abstract data types: lists, hash tables, etc.

gossip

A logging library from Phil Carns.

7. Walk-throughs

7.1. SiCortex

How to run a C-MPI test on the ANL SiCortex.

  1. Check out source

    svn co https://c-mpi.svn.sourceforge.net/svnroot/c-mpi
  2. Setup

    ./setup.sh
  3. Configure

    ./configure --enable-table-dense-1 --enable-tests
  4. Build test

    make deps
    make -j 3 test/cmpi/test-manyputs.x
  5. Run the test: 256 nodes, 256 clients, 1000 insertions per client.

    srun -n 512 test/cmpi/test-manyputs.x -n 256 -p reps=1000