The Multicast/Reduction Network: A User's Guide to MRNet


Table of Contents

1. Introduction
2. Installing and Using MRNet
Supported Platforms and Compilers
System Requirements
Build Configuration
Compilation and Installation
Bugs, Questions and Comments
3. MRNet Components and Abstractions
MRNet Communicators
MRNet Streams
MRNet Filters
4. A Simple Example
The MRNet Interface
MRNet Instantiation
5. The MRNet C++ API Reference
Class Network
Class Communicator
Class Stream
6. MRNET Process-tree Topologies
7. Adding New Aggregation Filters
A. MRNET Format Strings
B. A Complete MRNet Example

List of Figures

4.1. MRNet Front-end Sample Code
4.2. MRNet Back-end Sample Code

Chapter 1. Introduction

MRNet is a customizable, high-throughput communication software system for parallel tools with a master/slave architecture. MRNet reduces the cost of these tools' activities by incorporating a tree of processes between the tool's front-end and back-ends. MRNet uses these internal processes to distribute many important tool activities, reducing analysis time and keeping tool front-end loads manageable.

MRNet-based tools send data between front-end and back-ends on logical flows of data called streams. MRNet internal processes use filters to synchronize and aggregate data sent to the tool's front-end. Using filters to manipulate data in parallel as it passes through the network, MRNet can efficiently compute averages, sums, and other more complex aggregations on back-end data.

Several features make MRNet especially well-suited as a general facility for building scalable parallel tools:

  • Flexible organization. MRNet does not dictate the organization of MRNet and tool processes. MRNet process organization is specified in a configuration file that can specify common network layouts like k-ary and k-nomial trees, or custom layouts tailored to the system(s) running the tool. For example, MRNet internal processes can be allocated to dedicated system nodes or co-located with tool back-end and application processes.
  • Scalable, flexible data aggregation. MRNet's built-in filters provide efficient computation of averages, sums, concatenation, and other common data reductions. Custom filters can be loaded dynamically into the network to perform tool-specific aggregation operations.
  • High-bandwidth communication. MRNet transfers data within the tool system using an efficient, packed binary representation. Zero-copy data paths are used whenever possible to reduce the cost of transferring data through internal processes.
  • Scalable multicast. As the number of back-ends increases, serialization when sending control requests limits the scalability of existing tools. MRNet supports efficient message multicast to reduce the cost of issuing control requests from the tool front-end to its back-ends.
  • Multiple concurrent data channels. MRNet supports multiple logical streams of data between tool components. Data aggregation and message multicast take place within the context of a data stream, and multiple operations (both upward and downward) can be active simultaneously.

Chapter 2. Installing and Using MRNet

For this discussion, $MRNET_ROOT is the location of the top-level directory of the MRNet distribution and $MRNET_ARCH is a string describing the platform (OS and architecture) as discovered by autoconf. For all instructions, it is assumed that the current working directory is $MRNET_ROOT.

Supported Platforms and Compilers

MRNet has been developed to be highly portable; there is no reason why it should not run properly on all common Unix-based platforms. This being said, we have successfully built and tested MRNet using GCC version 3 compilers on the following systems:

  • i686-pc-linux-gnu
  • rs6000-ibm-aix5.1.0.0
  • sparc-sun-solaris2.8

We are currently upgrading our build system to allow the use of native compilers where appropriate, for instance, xlc and xlC in AIX environments.

System Requirements

MRNet needs the following third-party tools for proper installation:

  • GNU make
  • flex
  • bison

Build Configuration

MRNet uses GNU autoconf to discover platform-specific configuration parameters. The script that performs this auto-configuration is called configure.

UNIX>  ./configure --help

shows all possible options of the command. Below, we display the MRNet-specific ones:


===Begin MRNET Compile Options:
  --with-cc                   Specify C compiler
  --with-cflags               Set C compiler flags
                                  - (ONLY USE WITH --with-cc)
  --with-cxx                  Specify C++ compiler
  --with-cxxflags             Set C++ compiler flags
                                  - (ONLY USE WITH --with-cxx)
  --with-ldflags              Set loader flags
  --with-libfl                Link line for flex library

For example,

UNIX> ./configure --with-cc=cc --with-cxx=c++ --with-libfl=/usr/local/lib/libfl.a

instructs the configure script to use cc for the C compiler, c++ for the C++ compiler and /usr/local/lib/libfl.a as the location of the flex library.

Compilation and Installation

To build the MRNet toolkit, type:

UNIX>  make

After a successful build, the following files will be present:

  • $MRNET_ROOT/lib/$MRNET_ARCH/libmrnet.a: MRNet API library
  • $MRNET_ROOT/bin/$MRNET_ARCH/mrnet_commnode: MRNet internal communication node (used internally)
  • $MRNET_ROOT/bin/$MRNET_ARCH/mrnet_confgen: MRNet topology file generator
  • $MRNET_ROOT/bin/$MRNET_ARCH/*_[FE,BE]: MRNet test front-end and back-end programs

Bugs, Questions and Comments

MRNet is maintained by Dorian Arnold at the University of Wisconsin. Comments and other feedback, whether positive or negative, are welcome.

Please report bugs to darnold@cs.wisc.edu.

The MRNet webpage is http://www.paradyn.org/mrnet/

Chapter 3. MRNet Components and Abstractions

MRNet has two main types of components: libmrnet.a, a library that is linked into a tool's front-end and back-end components, and mrnet_commnode, a program that runs on intermediate nodes interposed between the application front-end and back-ends. libmrnet.a exports an API (see Chapter 5, The MRNet C++ API Reference) that enables I/O interaction between the front-end and groups of back-ends via MRNet. The primary purpose of mrnet_commnode is to distribute data processing functionality across multiple computer hosts and to implement efficient and scalable group communications. The following sub-sections describe the lower-level components of the MRNet API in more detail.

MRNet Communicators

MRNet uses communicators to represent groups of network end-points. Like communicators in MPI, MRNet communicators provide a handle that identifies a set of end-points for point-to-point, multicast or broadcast communications. MPI applications typically have a non-hierarchical layout of potentially identical processes. In contrast, MRNet enforces a tree-like layout of all processes, rooted at the tool front-end. Accordingly, MRNet communicators are created and managed by the front-end, and communication is only allowed between a tool's front-end and its back-ends, i.e. back-ends cannot interact with each other directly via MRNet.

MRNet Streams

A stream is a logical channel that connects the front-end to the end-points of a communicator. All tool-level communication via MRNet must use these streams. Streams carry data packets downstream, from the front-end toward the back-ends, and upstream, from the back-ends toward the front-end. Upward streams are expected to carry data of a specific type, allowing data aggregation operations to be associated with a stream. The type is specified using a format string (see Appendix A, MRNET Format Strings) similar to those used in C formatted I/O primitives; for example, a packet whose data is described by the format string "%d %d %f %s" contains two integers followed by a float and then a character string. MRNet expands the standard specification to allow for specifiers that describe arrays of integers and floats.

MRNet Filters

Data aggregation is the process of merging multiple input data packets and transforming them into one or more output packets. Though it is not necessary for the aggregation to result in less data, or even different data, aggregations that reduce or modify data values are most common. MRNet uses data filters to aggregate data packets. Filters specify an operation to perform and the type of the data expected on the bound stream. Filter instances are bound to a stream at stream creation. MRNet uses two types of filters: synchronization filters and transformation filters. Synchronization filters organize data packets from downstream nodes into synchronized waves of data packets, while transformation filters operate on the synchronized data packets, yielding one or more output packets.

Filters operate on data flowing upstream in the network. Synchronization filters receive packets one at a time and do not output any packets until the specified synchronization criterion has been met. Transformation filters take the group of synchronized packets as input, perform some type of transformation on the data contained in the packets, and output one or more packets. A distinction between synchronization and transformation filters is that synchronization filters are independent of the packet data type, but transformation filters operate on packets of a specific type. Synchronization filters provide a mechanism to deal with the asynchronous arrival of packets from child nodes; the synchronizer collects packets and typically aligns them into waves, passing an entire wave onward at the same time. Therefore, synchronization filters do no data transformation and can operate on packets in a type-independent fashion. MRNet currently supports three synchronization modes:

  • Wait For All: wait for a packet from every child node,
  • Time Out: wait a specified time or until a packet has arrived from every child (whichever occurs first), or
  • Do Not Wait: output packets immediately.

Synchronization filters use one of these three criteria to determine when to return packets to the stream manager.
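The Wait For All mode described above can be illustrated with a small, self-contained sketch. This is not MRNet's implementation: the Packet struct and the WaitForAllSync class are hypothetical names introduced only for illustration, and the sketch assumes that each packet carries the index of the child it arrived from.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// Hypothetical packet type for illustration; not MRNet's packet class.
struct Packet {
    int child;     // index of the child node the packet arrived from
    double value;  // payload, ignored by the synchronizer (type-independent)
};

// Sketch of a "Wait For All" synchronizer: packets are buffered per child,
// and one packet from every child is released together as a wave.
class WaitForAllSync {
public:
    explicit WaitForAllSync(std::size_t num_children) : queues_(num_children) {}

    // Accepts one packet. Returns a complete wave if every child now has at
    // least one buffered packet; otherwise returns an empty vector.
    std::vector<Packet> push(const Packet& p) {
        assert(p.child >= 0 &&
               static_cast<std::size_t>(p.child) < queues_.size());
        queues_[static_cast<std::size_t>(p.child)].push_back(p);
        for (const auto& q : queues_) {
            if (q.empty()) return {};  // still waiting on at least one child
        }
        std::vector<Packet> wave;
        for (auto& q : queues_) {
            wave.push_back(q.front());  // oldest packet from each child
            q.pop_front();
        }
        return wave;
    }

private:
    std::vector<std::deque<Packet>> queues_;
};
```

A Time Out variant could additionally release a partial wave when a deadline expires, and Do Not Wait would simply forward each packet as it arrives.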

Transformation filters combine data from multiple packets by performing an aggregation that yields one or more new data packets. Since transformation filters are expected to perform computational operations on data packets, there is a type requirement for the data packets to be passed to this type of filter: the data format string of the stream's packets and the filter must be the same. Transformation operations must be synchronous, but can carry state from one transformation to the next using static storage structures. MRNet provides several transformation filters that should be of general use:

  • Basic scalar operations: min, max, sum and average on integers or floats.
  • Concatenation: operation that inputs n scalars and outputs a vector of length n of the same base type.
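As a rough sketch of what such transformation filters compute on one synchronized wave, consider the following standalone functions. All names here (ScalarPacket, Wave, filter_max, AvgPartial, and so on) are hypothetical and are not part of the MRNet API. The average is shown as a (sum, count) pair that is merged at each level and divided only at the end; this is one common way to keep averages correct when sub-trees have different sizes, and it is an assumption of this sketch rather than a statement about how MRNet's built-in average filter works.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical scalar packet; one wave holds one packet per child node.
struct ScalarPacket { double value; };
using Wave = std::vector<ScalarPacket>;

double filter_max(const Wave& w) {
    assert(!w.empty());
    double best = w.front().value;
    for (const ScalarPacket& p : w) best = std::max(best, p.value);
    return best;
}

double filter_min(const Wave& w) {
    assert(!w.empty());
    double best = w.front().value;
    for (const ScalarPacket& p : w) best = std::min(best, p.value);
    return best;
}

double filter_sum(const Wave& w) {
    double total = 0.0;
    for (const ScalarPacket& p : w) total += p.value;
    return total;
}

// Partial average carried upstream: merged level by level, divided once.
struct AvgPartial { double sum; std::size_t count; };

AvgPartial merge_avg(const std::vector<AvgPartial>& parts) {
    AvgPartial out{0.0, 0};
    for (const AvgPartial& p : parts) {
        out.sum += p.sum;
        out.count += p.count;
    }
    return out;
}

double finish_avg(const AvgPartial& p) {
    assert(p.count > 0);
    return p.sum / static_cast<double>(p.count);
}

// Concatenation: n input scalars become one vector of length n.
std::vector<double> filter_concat(const Wave& w) {
    std::vector<double> out;
    out.reserve(w.size());
    for (const ScalarPacket& p : w) out.push_back(p.value);
    return out;
}
```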

Chapter 7, Adding New Aggregation Filters describes facilities a tool developer may use to add new filters to the provided set.

Chapter 4. A Simple Example

The MRNet Interface

A complete description of the MRNet API is in Chapter 5, The MRNet C++ API Reference. This section offers a brief overview only. Using libmrnet.a, a tool can leverage a system of internal processes, instances of the mrnet_commnode program, as a communication substrate. After instantiation of the MRNet network (discussed in the section called “MRNet Instantiation”), the front-end and back-end processes are connected by the internal processes. The connection topology and host assignment of these processes are determined by a configuration file; thus the geometry of MRNet's process tree can be customized to suit the physical topology of the underlying hardware resources. While MRNet can generate a variety of standard topologies, users can easily specify their own topologies; see Chapter 6, MRNET Process-tree Topologies for further discussion.

The MRNet API, provided by libmrnet, contains network, end-point, communicator, and stream objects that a tool's front-end and back-end use for communication. The network object is used to instantiate the MRNet network and access end-point objects that represent available tool back-ends. The communicator object is a container for groups of end-points, and streams are used to send data to the end-points in a communicator.

Figure 4.1. MRNet Front-end Sample Code

   front_end_main(){
1.   MR_Network * net;
2.   MR_Communicator * comm;
3.   MR_Stream * stream;
4.   float result;
5.   net = new MR_Network(config_file);
6.   comm = net->get_broadcast_communicator( );
7.   stream = new MR_Stream(comm, FMAX_FIL);
8.   stream->send("%d", FLOAT_MAX_INIT);
9.   stream->recv("%f", &result);
   }

A simplified version of code from an example tool front-end is shown in Figure 4.1. After the variable definitions in lines 1-4, line 5 creates an instance of the MRNet network using the topology specification in config_file. In line 6, the newly created network object is queried for an auto-generated broadcast communicator that contains all available end-points. In line 7, this communicator is used to establish a stream for which the MRNet internal processes will use a filter that finds the maximum floating-point value of the data sent upstream. The front-end then might send one or more initialization messages to the back-ends; in our example code, line 8 broadcasts an integer initializer and line 9 awaits the single floating-point result.

Figure 4.2. MRNet Back-end Sample Code

   back_end_main(){
1.   MR_Stream * stream;
2.   int val;
3.   MR_Network::init_backend( );
4.   MR_Stream::recv("%d", &val, &stream);
5.   if(val == FLOAT_MAX_INIT){
6.      stream->send("%f", rand_float);
     }
   }

Figure 4.2 shows the code for the back-end that reciprocates the actions of the front-end. Each tool back-end first connects to the appropriate internal process, via the init_backend call in line 3. While the front-end makes a stream-specific recv call, the back-ends make a stream-anonymous recv that returns the integer sent by the front-end along with a stream object representing the stream that the front-end has established. Finally, each back-end sends a scalar floating point value upstream toward the front-end.
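The effect of the maximum filter in this example can be mimicked without MRNet by an in-memory tree in which each leaf stands for a back-end holding one float and each interior node stands for a process applying the filter to its children's results. The Node struct and reduce_max function below are hypothetical names for this sketch only; no processes or network connections are involved.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical in-memory stand-in for the process tree.
struct Node {
    float value;                 // meaningful only for leaves (back-ends)
    std::vector<Node> children;  // empty for back-ends
};

// Applies a "max" reduction level by level, as each internal process
// would on the values arriving from its children.
float reduce_max(const Node& n) {
    if (n.children.empty()) return n.value;
    float best = reduce_max(n.children.front());
    for (std::size_t i = 1; i < n.children.size(); ++i) {
        best = std::max(best, reduce_max(n.children[i]));
    }
    return best;
}
```

The value returned at the root corresponds to the single result the front-end receives in Figure 4.1; each interior node forwards only one value upstream regardless of how many back-ends lie beneath it.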

A complete example of MRNet code can be found in Appendix B, A Complete MRNet Example.

MRNet Instantiation

While conceptually simple, creating and connecting the internal processes is complicated by interactions with the various job scheduling systems. In the simplest environments, we can launch jobs manually using facilities like rsh or ssh. In more complex environments, it is necessary to submit all requests to a job management system. In this case, we are constrained by the operations provided by the job manager (and these vary from system to system). We currently support two modes of instantiating MRNet-based tools.

In the first mode of process instantiation, MRNet creates the internal and back-end processes, using the specified MRNet topology configuration to determine the hosts on which the components should be located. First, the front-end consults the configuration and uses rsh or ssh to create internal processes for the first level of the communication tree on the appropriate hosts. Upon instantiation, each newly created process establishes a network connection to the process that created it. The first activity on this connection is a message from parent to child containing the portion of the configuration relevant to that child. The child then uses this information to begin instantiation of the sub-tree rooted at that child. When a sub-tree has been established, the root of that sub-tree sends a report to its parent containing the end-points accessible via that sub-tree. Each internal node establishes its child processes and their respective connections sequentially. However, since the various processes are expected to run on different compute nodes, sub-trees in different branches of the network are created concurrently, maximizing the efficiency of network instantiation.
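The recursive structure of this first mode can be sketched as follows. This is an in-memory model only: TopoNode and instantiate are hypothetical names, no processes are launched, and the recursion here runs sequentially, whereas in the real system separate branches proceed concurrently on different hosts.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical topology description: the portion of the configuration
// that a parent hands to one child.
struct TopoNode {
    std::string host;
    std::vector<TopoNode> children;  // empty for a back-end
};

// Models one process: it "creates" its children in order, waits for each
// child's report, and returns the end-points reachable via its sub-tree.
std::vector<std::string> instantiate(const TopoNode& node) {
    if (node.children.empty()) {
        return {node.host};  // a back-end reports itself as an end-point
    }
    std::vector<std::string> endpoints;
    for (const TopoNode& child : node.children) {
        // A real implementation would start the child with rsh or ssh here
        // and send it this sub-configuration over the new connection.
        std::vector<std::string> report = instantiate(child);
        endpoints.insert(endpoints.end(), report.begin(), report.end());
    }
    return endpoints;
}
```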

In the second mode of process instantiation, MRNet relies on a process management system to create some or all of the MRNet processes. This mode accommodates tools that require their back-ends to create, monitor, and control the application processes. For example, IBM's POE uses environment variables to pass information, such as the process' rank within the application's global MPI communicator, to the MPI run-time library in each application process. In cases like this, MRNet cannot provide back-end processes with the environment necessary to start MPI application processes. As a result, MRNet creates its internal processes recursively as in the first instantiation mode, but does not instantiate any back-end processes. MRNet then starts the tool back-ends using the process management system to ensure they have the environment needed to create application processes successfully. When starting the back-ends, MRNet must provide them with the information needed to connect to the MRNet internal process tree, such as the leaf processes' host names and connection port numbers. This information is provided via the environment, using shared filesystems or other information services as available on the target system.
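As an illustration of passing connection information through the environment, the sketch below validates a host name and port number supplied as text. The ParentInfo struct, the parse_parent_info function, and the environment variable names mentioned afterwards are all hypothetical; the guide does not specify how MRNet encodes this information.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdlib>
#include <string>

// Hypothetical record of where a back-end should connect.
struct ParentInfo {
    std::string host;
    unsigned short port;
};

// Validates text values (for example, taken from environment variables)
// and fills in 'out'. Returns false if either value is missing or invalid.
bool parse_parent_info(const char* host, const char* port_text,
                       ParentInfo& out) {
    if (host == nullptr || port_text == nullptr || *host == '\0') {
        return false;
    }
    char* end = nullptr;
    errno = 0;
    const unsigned long port = std::strtoul(port_text, &end, 10);
    if (errno != 0 || end == port_text || *end != '\0') return false;
    if (port == 0 || port > 65535) return false;  // outside the port range
    out.host = host;
    out.port = static_cast<unsigned short>(port);
    return true;
}
```

A back-end might call this as parse_parent_info(std::getenv("TOOL_PARENT_HOST"), std::getenv("TOOL_PARENT_PORT"), info), where both variable names are invented for this example; std::getenv returns a null pointer for an unset variable, which the function rejects.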

Chapter 5. The MRNet C++ API Reference

All classes are included in the MRN namespace. For this discussion, we do not explicitly include reference to the namespace; for example, when we reference the class Network, we are implying the class MRN::Network.

In MRNet, there are four top-level classes: Network, EndPoint, Communicator, and Stream. The Network class contains primarily static methods that allow one to instantiate and destroy MRNet process trees and to query instantiated trees for information. Application back-ends are referred to as end-points and are encapsulated by objects of type EndPoint. The Communicator class is used to reference a group of EndPoints and can be used to establish MRNet Streams for unicast, multicast or broadcast communications via the MRNet infrastructure. The public members of these classes are detailed below.

Class Network

static int Network::new_Network(config_file, commnode_exe, backend_exe);
const char * config_file;
const char * commnode_exe;
const char * backend_exe;

Network::new_Network is a static method that is used to instantiate the MRNet process tree. config_file is the path to a configuration file that describes the desired process tree topology. commnode_exe is the path to the mrnet_commnode executable that should have been built at installation time, and backend_exe is the path to the executable to be used for the application's back-end processes. When this function returns without error, all MRNet internal processes and the application back-end processes will have been instantiated using rsh or ssh depending on the setting of the environment variable MRNET_RSH.

static void Network::delete_Network();

Network::delete_Network is used to tear down the MRNet process tree. When this function is called, each node in the MRNet configuration sends a control message to its immediate children informing them of the "delete network" request. After delivering this message, the process itself terminates. Note: if the application back-ends have not already terminated, invoking this method will cause them to terminate.

static void Network::print_error(error_str);
const char * error_str;

Network::print_error prints a message to stderr describing the last error encountered during an MRNet library call. It first prints the null-terminated string error_str, followed by a colon, then the actual error message, followed by a newline.

Class Communicator

static Communicator * Communicator::new_Communicator();

This function returns a pointer to a new Communicator object. The object initially contains no endpoints. Use Communicator::add_EndPoint() to populate the communicator.

static Communicator * Communicator::new_Communicator(orig_comm);
Communicator & orig_comm;

This function returns a pointer to a new Communicator object that initially contains the set of endpoints contained in orig_comm.

static Communicator * Communicator::get_BroadcastCommunicator();

This function returns a pointer to a default communicator containing all the endpoints available in the system. Multiple calls to this function return the same pointer to the broadcast communicator object created at network instantiation.

int Communicator::add_EndPoint(hostname, port);
const char * hostname;
unsigned short port;

This function is used to add a new EndPoint object to the set contained by the communicator. The original set of endpoints contained by the communicator is tested to see if it already contains the potentially new endpoint. If so, the function silently returns successfully. This function fails if there exists no endpoint defined by hostname:port. This function returns 0 on success, -1 on failure.

int Communicator::add_EndPoint(endpoint);
EndPoint & endpoint;

This function is similar to the add_EndPoint() above except that it takes an explicit EndPoint object instead of hostname and port parameters. Success and failure conditions are exactly as stated above. This function also returns 0 on success and -1 on failure.
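The success and failure rules for add_EndPoint() can be made concrete with a mock. MockCommunicator below is a hypothetical stand-in written for this sketch, not the MRNet Communicator class; it only models the documented return values, using a fixed set of known endpoints in place of a live network.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <utility>

// Hypothetical endpoint key: (hostname, port).
using EndPointKey = std::pair<std::string, unsigned short>;

class MockCommunicator {
public:
    // 'known' plays the role of the endpoints that exist in the network.
    explicit MockCommunicator(std::set<EndPointKey> known)
        : known_(std::move(known)) {}

    // Returns 0 on success (including when the endpoint is already a
    // member) and -1 if no endpoint is defined by hostname:port.
    int add_EndPoint(const std::string& hostname, unsigned short port) {
        const EndPointKey key(hostname, port);
        if (known_.count(key) == 0) return -1;
        members_.insert(key);  // inserting a duplicate leaves the set as is
        return 0;
    }

    unsigned int size() const {
        return static_cast<unsigned int>(members_.size());
    }

private:
    std::set<EndPointKey> known_;
    std::set<EndPointKey> members_;
};
```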

unsigned int Communicator::size();

This function returns the number of endpoints contained in the communicator.

const char * Communicator::get_HostName(idx);
unsigned int idx;

This function returns a character string identifying the hostname of the endpoint at position idx in the set contained by the communicator. A return value of NULL signals that idx is out of range.

unsigned short Communicator::get_Port(idx);
unsigned int idx;

This function returns an unsigned short identifying the connection port of the endpoint at position idx in the set contained by the communicator. A return value of NULL signals that idx is out of range.

unsigned int Communicator::get_Id(idx);
unsigned int idx;

This function returns an unsigned int that is used by MRNet to uniquely identify the endpoint at position idx in the set contained by the communicator. A return value of NULL signals that idx is out of range.

Class Stream

static Stream * Stream::new_Stream(comm, filter_id);
Communicator * comm;
unsigned int filter_id;

Stream::new_Stream creates an MRNet stream object attached to the endpoints specified by the comm argument. The second argument, filter_id, specifies the filtering operation to apply to data flowing upstream from the application back-ends toward the front-end.

static int Stream::recv(tag, buf, stream);
int * tag;
void * * buf;
Stream * * stream;

This non-blocking function is used to invoke a stream-anonymous receive operation. Any packet available (addressed to any stream) will be returned (in roughly FIFO ordering) via the output parameters passed in. tag will be filled in with the integer tag value that was passed by the corresponding Stream::send() operation. buf is an opaque structure that must be passed to the Stream::unpack() function described below. Finally, a pointer to the stream to which the packet was addressed will be returned in stream. A return value of -1 indicates an error, 0 indicates no packets were available, and 1 indicates success.

static int Stream::unpack(buf, format_str, ...);
char * buf;
const char * format_str;

This function operates similarly to C's sscanf. It takes a buf parameter that was returned by a previous call to Stream::recv(). format_str is a format string describing the datatypes expected in the packet returned by Stream::recv() (see Appendix A, MRNET Format Strings for a full description). On success, Stream::unpack() returns 0; on failure, -1.

int Stream::send(tag, format_str, ...);
int tag;
const char * format_str;

This function invokes a data output operation on the calling stream. tag is an integer identifier that is expected to classify the data in the packet to be transmitted across the stream. format_str is a format string describing the data in the packet (see Appendix A, MRNET Format Strings for a full description). On success, Stream::send() returns 0; on failure, -1.

int Stream::flush();

This function commits a flush of all packets currently buffered by the stream pending an output operation. A successful return indicates that all packets on the calling stream have been passed to the operating system kernel for network transmission.

int Stream::recv(tag, buf);
int *tag;
void * * buf;

This non-blocking function is used to invoke a stream-specific receive operation. Packets addressed to the calling stream will be returned in strictly FIFO ordering via the output parameters passed in. tag will be filled in with the integer tag value that was passed by the corresponding Stream::send() operation. buf is an opaque structure that must be passed to the Stream::unpack() function described above. A return value of -1 indicates an error, 0 indicates no packets were available, and 1 indicates success.
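Because both recv variants are non-blocking, callers typically need a loop that distinguishes the three return values. The sketch below shows one such loop against a mock; MockStream and drain are hypothetical names, and the mock's recv takes only a tag pointer rather than the full parameter list documented above.

```cpp
#include <cassert>
#include <deque>
#include <vector>

// Hypothetical stand-in that follows the documented return convention:
// 1 = a packet was returned, 0 = no packet available, -1 = error.
class MockStream {
public:
    void deliver(int tag) { pending_.push_back(tag); }
    void break_connection() { broken_ = true; }

    int recv(int* tag) {
        assert(tag != nullptr);
        if (broken_) return -1;
        if (pending_.empty()) return 0;
        *tag = pending_.front();  // strictly FIFO within one stream
        pending_.pop_front();
        return 1;
    }

private:
    std::deque<int> pending_;
    bool broken_ = false;
};

// Receives every packet that is currently available. Returns the number
// of packets handled, or -1 as soon as recv reports an error.
int drain(MockStream& stream, std::vector<int>& tags) {
    int handled = 0;
    int tag = 0;
    for (;;) {
        const int rc = stream.recv(&tag);
        if (rc < 0) return -1;
        if (rc == 0) return handled;  // nothing more to read right now
        tags.push_back(tag);
        ++handled;
    }
}
```

A tool that must wait for a result would call such a loop repeatedly, ideally doing other work or sleeping between calls instead of spinning.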

Chapter 6. MRNET Process-tree Topologies

MRNet allows a tool to specify a node allocation and process connectivity tailored to its computation and communication requirements and to the system running the tool. Choosing an appropriate MRNet configuration can be difficult due to the complexity of the tool's own activity and its interaction with the system. We briefly discuss the issues related to process layout, but because our current work focuses on tool scalability, a full treatment of optimal MRNet configurations is beyond the scope of this paper. The configurations we used for our experiments in Section 4 were chosen for their ability to show MRNet's effect on tool scalability. We anticipate future research will examine the issue of MRNet configurations in more detail.

When choosing the process configuration for an MRNet-based tool, there are two key issues to consider: whether the MRNet internal processes are co-located with the application processes under study, and how the internal processes are connected. Our primary measures of a configuration's quality are its: (1) latency for a single broadcast operation, measured from initiation by the front-end to the last receipt by a back-end; (2) latency for a single data aggregation operation, measured from initiation by the back-ends to receipt by the front-end; (3) throughput for streams of broadcasts and data aggregations; and (4) CPU utilization of the MRNet internal processes.
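When comparing candidate layouts, it can help to estimate how many hops separate the front-end from the back-ends and how many internal processes a balanced tree needs. The sketch below does this arithmetic for a k-ary tree under simplifying assumptions: every process has at most fanout children, back-ends are the leaves, and the tree is filled as evenly as possible. TreeShape and kary_shape are hypothetical names, and the hop count is only a rough proxy for the latency measures listed above.

```cpp
#include <cassert>

// Result of the estimate: hops from front-end to back-ends, and the
// number of internal (non-front-end, non-back-end) processes.
struct TreeShape {
    unsigned depth;
    unsigned long internal;
};

// Builds the tree bottom-up: each level above the back-ends needs
// ceil(nodes_below / fanout) processes, until the front-end can reach
// the remaining nodes directly.
TreeShape kary_shape(unsigned long backends, unsigned long fanout) {
    assert(backends >= 1 && fanout >= 2);
    TreeShape shape{1, 0};
    unsigned long level = backends;
    while (level > fanout) {
        level = (level + fanout - 1) / fanout;  // ceiling division
        shape.internal += level;
        ++shape.depth;
    }
    return shape;
}
```

For example, 64 back-ends with a fan-out of 4 give a depth of 3 with 20 internal processes (16 + 4), while a fan-out of 64 gives a depth of 1 with no internal processes but leaves the front-end handling all 64 connections itself.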

Chapter 7. Adding New Aggregation Filters

Appendix A. MRNET Format Strings

Following the % character introducing a conversion there may be a number of flag characters. u, h, l, and a are special modifiers meaning unsigned, short, long and array, respectively. The full set of conversions is:

  c       Matches a signed 8-bit character
  uc      Matches an unsigned 8-bit character
  ac      Matches an array of signed 8-bit characters
  auc     Matches an array of unsigned 8-bit characters
  hd      Matches a signed 16-bit decimal integer
  uhd     Matches an unsigned 16-bit decimal integer
  ahd     Matches an array of signed 16-bit decimal integers
  auhd    Matches an array of unsigned 16-bit decimal integers
  d       Matches a signed 32-bit decimal integer
  ud      Matches an unsigned 32-bit decimal integer
  ad      Matches an array of signed 32-bit decimal integers
  aud     Matches an array of unsigned 32-bit decimal integers
  ld      Matches a signed 64-bit decimal integer
  uld     Matches an unsigned 64-bit decimal integer
  ald     Matches an array of signed 64-bit decimal integers
  auld    Matches an array of unsigned 64-bit decimal integers
  f       Matches a 32-bit floating-point number
  af      Matches an array of 32-bit floating-point numbers
  lf      Matches a 64-bit floating-point number
  alf     Matches an array of 64-bit floating-point numbers
  s       Matches a null-terminated character string
  as      Matches an array of null-terminated character strings
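The conversions listed above follow a regular pattern: an optional a, then an optional u, then an optional h or l, then a base character. The sketch below decodes one conversion (written without the leading %) according to that pattern and rejects combinations that do not appear in the list. It is an independent illustration; Conversion, parse_conversion, and split_format are hypothetical names and do not reflect how MRNet parses format strings internally.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Decoded form of one conversion from the list above.
struct Conversion {
    bool is_array;
    bool is_unsigned;
    int bits;   // element width in bits; 0 for strings
    char kind;  // 'c', 'd', 'f', or 's'
};

// Parses a conversion such as "auld" or "lf". Returns false for any
// combination that is not in the list (for example "uf" or "hs").
bool parse_conversion(const std::string& spec, Conversion& out) {
    std::size_t i = 0;
    out = Conversion{false, false, 0, '\0'};
    if (i < spec.size() && spec[i] == 'a') { out.is_array = true; ++i; }
    if (i < spec.size() && spec[i] == 'u') { out.is_unsigned = true; ++i; }
    char width = '\0';
    if (i < spec.size() && (spec[i] == 'h' || spec[i] == 'l')) {
        width = spec[i];
        ++i;
    }
    if (i + 1 != spec.size()) return false;  // exactly one base char left
    out.kind = spec[i];
    switch (out.kind) {
        case 'c':
            if (width != '\0') return false;
            out.bits = 8;
            return true;
        case 'd':
            out.bits = (width == 'h') ? 16 : (width == 'l') ? 64 : 32;
            return true;
        case 'f':
            if (out.is_unsigned || width == 'h') return false;
            out.bits = (width == 'l') ? 64 : 32;
            return true;
        case 's':
            if (out.is_unsigned || width != '\0') return false;
            out.bits = 0;
            return true;
        default:
            return false;
    }
}

// Splits a whitespace-separated format string such as "%d %d %f %s" into
// conversions with the leading '%' removed. Returns false if any token
// lacks the '%' prefix.
bool split_format(const std::string& fmt, std::vector<std::string>& out) {
    std::istringstream in(fmt);
    std::string token;
    while (in >> token) {
        if (token.size() < 2 || token[0] != '%') return false;
        out.push_back(token.substr(1));
    }
    return true;
}
```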

Appendix B. A Complete MRNet Example