As the successor to SUNMOS [a] 
Introduction

MPICH
MPICH is a portable implementation of the Message Passing Interface (MPI) [3] standard developed jointly by Argonne National Laboratory and Mississippi State University. MPICH contains an abstract device interface (ADI) upon which a high-level message passing application programmer interface such as MPI can be implemented. The ADI performs four main functions [6]: sending and receiving, data transfer, queueing, and device-dependent functions.
Porting MPICH to an architecture such as the Paragon involves the creation of a new "device" that interacts with the ADI through a set of routines (see [5] for details) and handles. These handles are used to cache device-specific data to pass information between the device-independent and device-dependent layers of MPICH.
Puma and Portals
Puma is an operating system designed to provide a flexible, lightweight, high performance message passing environment for massively parallel computing [11]. Message passing in Puma is accomplished through the use of portals, which are structures that inform the kernel how and where incoming messages should be deposited. Each application is allotted a finite number of portals in a portal table, and each entry in the portal table has an associated memory descriptor which describes how the memory is arranged. Messages destined for a particular portal table entry are deposited according to the type of memory descriptor attached to it. Additionally, matching lists may be attached to a portal table entry in order to provide further selection criteria for messages destined for a particular portal. Each match list entry contains 64 match bits and 64 ignore bits. The ignore bits can be used to mask off insignificant match bits. These matching lists in turn have memory descriptors associated with them. There are four basic types of memory descriptors.
The most basic is the single block memory descriptor, which describes a single contiguous block of memory. Messages destined for a portal with a single block memory descriptor attached may be deposited anywhere within this single contiguous region.
A dynamic block memory descriptor describes a contiguous block of heap memory. The Puma kernel maintains a list of free memory blocks and a list of messages that have arrived. Messages destined for a portal with a dynamic block memory descriptor attached will be deposited in the first available space within this heap, and the message will be added to the end of an incoming message queue.
The independent block memory descriptor describes a table of possibly noncontiguous buffers. An independent block contains a buffer descriptor table, each entry of which describes a contiguous block of memory. A message destined for a portal with an independent block memory descriptor attached will be deposited in the first available buffer in the buffer descriptor table.
Finally, the combined block memory descriptor describes a logically contiguous but possibly physically discontiguous block of memory. This descriptor is almost identical to an independent block descriptor, but rather than depositing a message into a single buffer, a message destined for this descriptor will keep filling successive buffers in the buffer descriptor table until reception is complete.
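The difference between the independent block and combined block deposit rules can be sketched in a few lines. This is only an illustrative model, not Puma kernel code: the `Buffer` class, the function names, and the choice to reject a message when the first free buffer is too small are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Buffer:
    """One entry of a buffer descriptor table (hypothetical layout)."""
    capacity: int
    data: bytes = b""
    used: bool = False


def deposit_independent(table: List[Buffer], msg: bytes) -> int:
    """Place the whole message in the first available buffer.

    Returns the index of the buffer used, or -1 if no buffer is free or
    the first free buffer is too small (an assumed failure policy).
    """
    for i, buf in enumerate(table):
        if not buf.used:
            if len(msg) > buf.capacity:
                return -1
            buf.data, buf.used = msg, True
            return i
    return -1


def deposit_combined(table: List[Buffer], msg: bytes) -> List[int]:
    """Fill successive free buffers until the whole message is stored.

    Returns the indices of the buffers touched, or [] if the free buffers
    together cannot hold the message.
    """
    free = [i for i, b in enumerate(table) if not b.used]
    if sum(table[i].capacity for i in free) < len(msg):
        return []
    touched, offset = [], 0
    for i in free:
        if offset >= len(msg):
            break
        buf = table[i]
        buf.data = msg[offset:offset + buf.capacity]
        buf.used = True
        offset += len(buf.data)
        touched.append(i)
    return touched
```

With three 4-byte buffers, a 3-byte message deposited by the independent rule occupies only the first buffer, while a subsequent 6-byte message deposited by the combined rule spills across the remaining two.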
Each type of memory descriptor also has several configurable options regarding how to respond to incoming messages and how to progress through buffer lists. The following describes the options for each descriptor type:
Certain descriptors may also be overlaid so that the same memory region is accessed or manipulated by multiple match list entries or portals. For example, a match list entry may be overlaid onto another match list entry which has an independent block memory descriptor. Messages which are destined for either match list entry update the same independent block buffer table.
A match list is a linked list of match entries. When a message arrives for a portal with a match list attached, the kernel traverses the list, comparing the group, rank, and match bits in the incoming message header to those in each match list entry. When a match is found, the kernel then attempts to deposit the message into the associated memory descriptor.
There are three types of failure the kernel can experience at each match list entry. The kernel can fail because the matching criteria are not met, because the memory descriptor has no available buffer, or because the available buffer is too small to hold the incoming message. Each entry can specify the next successive entry to which the kernel should proceed upon encountering any of these three failures.
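The traversal and its three failure branches can be modelled as follows. This is a minimal sketch under stated assumptions: it compares only the 64 match bits (the group and rank comparison is omitted), it treats a set ignore bit as "do not compare this position", and the `MatchEntry` fields and `find_entry` function are hypothetical names rather than Puma structures.

```python
from dataclasses import dataclass
from typing import List, Optional

MASK64 = (1 << 64) - 1


@dataclass
class MatchEntry:
    match_bits: int
    ignore_bits: int                     # assumed: 1 bits are not compared
    free_bytes: int                      # 0 models "no available buffer"
    on_no_match: Optional[int] = None    # next entry index per failure type
    on_no_buffer: Optional[int] = None
    on_too_small: Optional[int] = None


def bits_match(entry: MatchEntry, incoming: int) -> bool:
    """Compare only the bit positions that are not masked off."""
    care = ~entry.ignore_bits & MASK64
    return (entry.match_bits & care) == (incoming & care)


def find_entry(entries: List[MatchEntry], incoming: int, length: int,
               start: int = 0) -> Optional[int]:
    """Walk the list, following the failure link chosen at each entry.

    Returns the index of the entry that accepts the message, or None if
    the traversal runs off the end of the list.
    """
    idx: Optional[int] = start
    visited = 0
    while idx is not None and visited < len(entries):
        entry = entries[idx]
        if not bits_match(entry, incoming):
            idx = entry.on_no_match
        elif entry.free_bytes == 0:
            idx = entry.on_no_buffer
        elif entry.free_bytes < length:
            idx = entry.on_too_small
        else:
            return idx
        visited += 1
    return None
```

A posted receive followed by a catchall entry that ignores every bit reproduces the behavior described above: a message that fails at the first entry for any of the three reasons falls through to the catchall.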
Many of the options used by the different memory descriptors require information contained in the incoming message. The send side is responsible for providing this information depending on the type of send. For example, a memory descriptor that is configured to send an acknowledgment back to the sender needs to know to which portal the acknowledgment needs to return and what the match bits should be. Similarly, a sender-managed memory descriptor needs to get the desired offset from the incoming message.
The following values are required for the short message protocol: send buffer, number of bytes, destination offset, destination group, destination rank, destination portal, and destination match bits. In addition, the following values are needed for long and synchronous message protocols: return match bits, return portal, return length, return offset, and user data (twelve bytes). The configuration of the destination portal and memory descriptor will determine how and if some of these values are used. For example, a message that is destined for a portal with a memory descriptor attached, but no match list, will ignore the destination match bits.
Point-to-Point Design and Implementation
A two-level protocol was decided upon at the outset to provide low latency for shorter messages and high bandwidth for larger ones. Figure 1 the message header and the message body. Each independent block memory descriptor contains only one buffer. A single block memory descriptor cannot be used because a single block can only save incoming data and not header information. A posted receive needs header information in order to distinguish between different protocols and to obtain message tag and source values should these be wildcarded.
The final two entries in the match list are used to catch and queue unexpected messages for each protocol. The first catchall entry has a dynamic block memory descriptor configured to save both the message header and the message body. The second catchall entry also has a dynamic block memory descriptor, but is configured to save only the message header. The second catchall entry is overlaid on top of the first catchall entry so that both entries use the same heap list structure, ensuring correct ordering for unexpected messages. The three bits for message type in the match bits are used to choose the catchall entry in which unexpected messages are buffered. For the short catchall entry, the match bits are configured to ignore all bits except the first three, which must be zero. The long catchall entry is likewise configured to ignore all bits except the first three, the third of which must be set. Messages with the first message type bit set are ready send messages, which must have a pre-posted receive. Consequently, these messages have no overflow buffer and will simply be discarded if there is no pre-posted receive. The second message type bit is used to distinguish between regular messages and reply messages which the receiver has requested. The match list entries for pre-posted receives are configured to ignore the ready send and long send message type bits.
Figure 2 illustrates the portals needed for sending messages. The first entry in the portal table contains a match list that is used for collecting acknowledgments from the receivers. Each acknowledgment is a message header with the result (saved header, saved body, or saved header and body) of the message reception contained in the first byte of the twelve-byte user data portion of the header. Each entry contains an independent block memory descriptor configured to save only the message header.
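The way the three message type bits steer an unexpected message to a catchall entry can be illustrated with a small model. The bit positions chosen here (ready send in bit 0, reply in bit 1, long in bit 2) and the convention that a set ignore bit means "do not compare" are assumptions for the sketch; the text fixes only the order of the three bits, not their position within the 64 match bits.

```python
MASK64 = (1 << 64) - 1

# Assumed positions of the three message type bits.
READY, REPLY, LONG = 0b001, 0b010, 0b100
TYPE_MASK = READY | REPLY | LONG


def matches(match_bits: int, ignore_bits: int, incoming: int) -> bool:
    """True when all non-ignored bit positions agree."""
    care = ~ignore_bits & MASK64
    return (match_bits & care) == (incoming & care)


# Both catchall entries ignore everything except the three type bits.
SHORT_CATCHALL = (0, MASK64 & ~TYPE_MASK)      # all three type bits zero
LONG_CATCHALL = (LONG, MASK64 & ~TYPE_MASK)    # only the long bit set

# A pre-posted receive ignores the ready send and long send type bits.
POSTED_IGNORE = READY | LONG


def route_unexpected(incoming: int) -> str:
    """Decide where a message with no pre-posted receive ends up."""
    if matches(*SHORT_CATCHALL, incoming):
        return "short"      # header and body saved in the heap
    if matches(*LONG_CATCHALL, incoming):
        return "long"       # only the header saved in the heap
    return "discard"        # e.g. ready send with no posted receive
```

In this model a ready send message matches neither catchall entry and is dropped, which mirrors the statement that ready sends have no overflow buffer.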
The second entry in the portal table contains a match list where each entry contains a single block memory descriptor configured to reply. A message destined for this type of read portal will cause the data in the appropriate buffer to be replied back to the original message's point of origin.
Because the match bits for both the read portal and the acknowledge portal must be unique for each send request, the first 32 match bits are set to the address of the device-independent handle associated with the send request, while the second 32 bits are set to the address of the device-dependent handle.
Figure 3 illustrates a short protocol send operation. The sender sends both a message header and user data to the receive portal at the destination. The destination match bits are set appropriately for the tag, the context identifier, rank within the communicator, and message type. The protocol type is also encoded in the user data portion of the message header. The short protocol send operation is complete once the kernel finishes delivering the message.
For the long send protocol (Figure 4), the sender uses an eager protocol where both the message header and the data are sent to the receiver. However, if there is no posted receive for a long message, only the header is saved, and the receiver must pull the message from the sender. After a message is sent, the sender waits for an acknowledgment. If a receive was pre-posted and the message was saved directly into the user buffer, the acknowledgment will indicate that both header and body were saved. If no receive was pre-posted, the message header will be saved in the dynamic heap and the acknowledgment will indicate that only the header was saved. The sender must then wait for the appropriate number of bytes to be read from the single block portal.
Figure 3. Short send protocol.
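The sender's reaction to a long send acknowledgment can be written as a small decision function. This is a sketch only: the numeric codes in `AckResult` are invented for illustration (the text names the three outcomes but not their encoding), and `decode_ack` and `long_send_next_step` are hypothetical helper names.

```python
from enum import Enum


class AckResult(Enum):
    # Numeric codes are assumptions; only the three outcomes are given.
    SAVED_HEADER = 1
    SAVED_BODY = 2
    SAVED_HEADER_AND_BODY = 3


def decode_ack(user_data: bytes) -> AckResult:
    """Read the reception result from the first byte of the user data."""
    if len(user_data) != 12:
        raise ValueError("user data must be exactly twelve bytes")
    return AckResult(user_data[0])


def long_send_next_step(ack: AckResult, nbytes: int):
    """Return (state, bytes the sender still expects to be pulled)."""
    if ack is AckResult.SAVED_HEADER_AND_BODY:
        # A receive was pre-posted; the data is already in the user buffer.
        return ("complete", 0)
    if ack is AckResult.SAVED_HEADER:
        # Only the header reached the heap; the receiver will pull the
        # body from the sender's single block read portal.
        return ("wait_for_pull", nbytes)
    # The long send description covers only the two cases above.
    raise ValueError("outcome not described for the long send protocol")
```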
To post a short protocol receive, a free receive match list entry is obtained and the necessary matching criteria are added. For short protocol receives, an independent block memory descriptor configured to save both header and body with no acknowledgment is attached to the entry. However, the entry is not activated until a search of the dynamic heap is performed. If there is an unexpected message stored in the dynamic heap that matches the receive that is being processed, the message is copied out of the heap and the space in the heap is freed. If the search of the heap is unsuccessful, the entry is activated. This operation must be atomic to ensure that the kernel doesn't deposit a message in the dynamic heap between the time the heap is searched and the entry is activated. Message arrival on the entry is signalled by an update of the bytes written to the memory descriptor. Necessary header information is extracted when the message arrives.
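The need for atomicity between the heap search and the activation of the entry can be shown with a toy model. Everything here is an assumption made for illustration: a Python lock stands in for whatever mechanism Puma uses to make the step atomic, matching is reduced to equality of match bits, and `ReceiveQueue` is a hypothetical name.

```python
import threading


class ReceiveQueue:
    """Toy model of one receive portal with an unexpected-message heap."""

    def __init__(self):
        # Stand-in for the real atomicity mechanism (an assumption).
        self._lock = threading.Lock()
        self.heap = []      # unexpected messages, in arrival order
        self.posted = []    # activated receive entries

    def deliver(self, match_bits, payload):
        """Kernel side: use a posted receive if one matches, else the heap."""
        with self._lock:
            for recv in self.posted:
                if recv["match_bits"] == match_bits and recv["payload"] is None:
                    recv["payload"] = payload
                    return "posted"
            self.heap.append((match_bits, payload))
            return "heap"

    def post_receive(self, match_bits):
        """Library side: search the heap, then activate, as one atomic step."""
        with self._lock:
            for i, (bits, payload) in enumerate(self.heap):
                if bits == match_bits:
                    del self.heap[i]    # copy out and free the heap space
                    return {"match_bits": bits, "payload": payload}
            recv = {"match_bits": match_bits, "payload": None}
            self.posted.append(recv)    # activate only after a failed search
            return recv
```

If the search and the activation were not covered by the same critical section, a message could arrive in the heap after the search but before activation and would never be matched by this receive.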
Figure 4. Long send protocol.
For the long protocol on the receive side, the match list entry is prepared in the same way as with the short protocol. However, the independent memory descriptor that is attached to the match list entry is configured to acknowledge the result back to the sender upon receipt of a message. Also, if a matching message header is found in the dynamic heap, the match bits are changed to accept a pulled message, and a message is sent to the sender to pull the data across. Figure 5 illustrates the long protocol where the message must be pulled by the receiver.
Figure 5. Long send read protocol.
The basic short and long protocols are extended to include an extra acknowledgment for synchronous messages. Synchronous acknowledgments are sent to the same portal as long send acknowledgments. For short synchronous messages, the return match bits are contained in the user data of the message header. In Figure 6, when a synchronous message is received, either by the posted receive or copied out of the heap, the receiver sends back a synchronous acknowledgment. In the long protocol, if the message is saved to a posted receive, the situation is similar to the short synchronous protocol. For long synchronous messages that are pulled, the extra acknowledgment is sent upon arrival of the pulled messages (Figure 7).
Collective Communications Design and Implementation
In the Puma MPICH ADI, the collective communication operations are mapped to the native Puma collective communications which are built on top of Puma portals. This section discusses the implementation of the Puma collective communications on top of portals as it relates to MPI collective communications.
The native Puma collective communications are primarily interested in high performance collective communication with contiguous data over the entire range of vector lengths. They make use of hybrid techniques developed at the University of Texas [1, 10, 2] to achieve this full range of performance. The hybrid techniques use the physical multi-dimensional nature of an interconnect to maximize bandwidth and minimize message contention for long messages. For short messages, the hybrids make use of logical multi-dimensional mappings within each physical dimension (dimensional rings) to form new near-optimal short message algorithms. For medium length messages, the hybrids evaluate whether to use a short or long message algorithm in each logical/physical dimension to gain the best performance.
Since the implementation on top of portals concerns itself primarily with the short and long building block algorithms, this section will restrict discussions to the short/long building block implementations. Once these implementations are optimized, the advantages of the hybrids can be incorporated directly.
Short Message Protocols
The best point-to-point short message algorithms embed a minimum spanning tree within the participating group of nodes in such a way as to enable the sending and receiving of contention free messages based on the structure of the tree. To support such a communication pattern, the Puma MPI collective communications use the structures as illustrated in Figure 8 .
Consider the collective operation MPI_Bcast() for short messages. Before the application enters main(), the match list and message heap are set up and initialized. The single block memory buffer is not present. When a child enters MPI_Bcast(), it checks whether the message has arrived in the message heap. If it hasn't, then the child sets up the single block buffer for the message that will be arriving. Upon entering MPI_Bcast(), the root node immediately begins sending to its children within the minimum spanning tree. It is possible that a parent may be sending before its children have reached MPI_Bcast(). If this is the case, then the broadcast message is placed in the child's message heap as soon as it arrives.
This implementation has advantages in that it both avoids unnecessary memory copies when a child enters MPI_Bcast() before the parent, and does not hold up the parent if a child is not ready. For longer messages, it may be desirable to avoid memory copies completely, in which case it is worthwhile to add an additional handshake between parent and child to make sure the child is ready before the parent sends. In this description, MPI_Bcast() was used as an example collective operation. Other MPI operations such as MPI_Scatter(), MPI_Gather(), MPI_Allgather(), etc., all have minimum spanning tree algorithms [1] that would make use of the same portal structures for short messages.
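To make the tree structure concrete, the following sketch computes each rank's parent and children for a broadcast. It uses a binomial tree rooted at the broadcasting rank, which is one common way to embed a spanning tree in a group of nodes; whether the Puma library uses exactly this embedding is not stated in the text, and `bcast_tree` is a hypothetical helper name.

```python
def bcast_tree(rank, size, root=0):
    """Return (parent, children) of `rank` in a binomial broadcast tree.

    Ranks are taken relative to `root`, so the root has relative rank 0
    and no parent. Children are listed in the order they would be sent to.
    """
    rel = (rank - root) % size
    parent = None
    mask = 1
    # The lowest set bit of the relative rank identifies the parent.
    while mask < size:
        if rel & mask:
            parent = ((rel - mask) + root) % size
            break
        mask <<= 1
    # Every smaller power of two below that bit names a potential child.
    children = []
    mask >>= 1
    while mask > 0:
        if rel + mask < size:
            children.append((rel + mask + root) % size)
        mask >>= 1
    return parent, children
```

For eight processes the root sends to ranks 4, 2, and 1 in turn, and every non-root rank appears as the child of exactly one parent, so each process receives the message once.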
Features that have not been implemented which are read-
[Figure 9: bucket algorithm steps for Proc 0, Proc 1, Proc 2, and Proc 3 at Time Step 1, Time Step 2, and Time Step 3]
Long Message Protocols
For long messages¹, versions of the "bucket" algorithm or "ring" algorithm have proven to be the most efficient, since the amount of data traversing the network is less than the amount transferred using a minimum spanning tree algorithm. Figure 9 illustrates the steps in a bucket algorithm for the MPI_Allgather() operation. At each time step, every process sends a piece of the message to one neighbor and receives a piece from another. Bucket algorithms are also natural for MPI_Reduce(), MPI_Reduce_scatter(), and MPI_Allreduce().
It is clear from Figure 9 that bucket algorithms are very lock-step in nature. As a result, it makes sense that a sending neighbor would synchronize with its receiving neighbor and then stream the pieces of the message into the waiting buffer. Figure 10 illustrates the portal structures necessary for supporting this mechanism. Each process sets up the receive buffer locally in the single block portal and sends a message to the sending neighbor announcing that it is ready to receive. In the meantime, it will wait for a ready-to-receive message from its receiving neighbor to arrive in its independent block portal². The ready-to-receive message will tell the process that a buffer is available at the receive neighbor and streaming data can begin. The process can watch the message counter on its single block portal to make sure it does not get ahead of the sending neighbor.
¹ Long messages refer to messages with lengths larger than, say, 10 Kbytes, depending on the bandwidth and latency measurements for a given architecture and the number of processes participating in the operation.
This long message portals implementation makes use of the lock-step characteristic of bucket algorithms to avoid memory copies, which are costly for long messages. It accomplishes this by synchronizing with its neighbors and then streaming data into the appropriate receive buffer. This implementation also reduces cost by eliminating the additional indirection through a match list.
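The lock-step data movement of a bucket allgather can be simulated in a single process. This is a sketch of the general ring algorithm only, with all processes modelled as lists in one program; it does not model portals, the ready-to-receive handshake, or streaming, and `bucket_allgather` is a hypothetical name.

```python
def bucket_allgather(pieces):
    """Simulate a ring ("bucket") allgather over len(pieces) processes.

    pieces[r] is the contribution of process r. The result is one buffer
    per process, each holding every piece in rank order.
    """
    p = len(pieces)
    buffers = [[None] * p for _ in range(p)]
    for r in range(p):
        buffers[r][r] = pieces[r]
    for step in range(p - 1):
        # Gather all sends first so the step behaves as if simultaneous.
        sends = []
        for r in range(p):
            # Forward the piece received in the previous step
            # (the process's own piece in step 0).
            idx = (r - step) % p
            sends.append(((r + 1) % p, idx, buffers[r][idx]))
        for dest, idx, data in sends:
            buffers[dest][idx] = data
    return buffers
```

Each process sends exactly one piece per step for p - 1 steps, so every piece crosses each link once.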
It is worth noting that by switching out the single block memory descriptor and replacing it with a combined block memory descriptor, this design would support MPI noncontiguous datatypes.
Implementation Issues
In the management of both the short and long message building blocks, issues with race conditions and dropped messages due to overlapping collective operations arise and must be dealt with. Since all portal structures are in user space, race conditions can occur between the kernel and the process when both access the same structures at the same time. Cooperation between the kernel and the libraries can ensure that the race conditions do not occur.
Using the structures above, it is possible for back-to-back "fan-in" minimum spanning tree operations to overlap and lose messages. This is because in "fan-in" algorithms, for instance in MPI_Gather(), the leaf nodes can send their first contribution, enter the second MPI_Gather(), and send their second contribution before the parent is ready for the second gather operation. One could use separate portals, sequence numbers, or some other form of matching criteria to ensure that overlapping collective operations are handled properly.
² If there are no overlapping bucket collective algorithms, then one could use the faster zero-length single block memory descriptor instead of an independent block memory descriptor, which saves a message header.
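Of the remedies mentioned for overlapping fan-in operations, sequence numbers are the simplest to illustrate. The sketch below is a hypothetical parent-side inbox that files each contribution under the sequence number of the collective call it belongs to; it is not the Puma implementation, and in a portals setting the sequence number would presumably be carried in the match bits.

```python
from collections import defaultdict


class GatherInbox:
    """Parent-side store keyed by the sequence number of each gather."""

    def __init__(self):
        self.slots = defaultdict(dict)    # seq -> {child: contribution}

    def deliver(self, seq, child, data):
        """Record a contribution, even if it is for a later gather."""
        self.slots[seq][child] = data

    def collect(self, seq, children):
        """Return contributions for gather `seq`, or None if incomplete."""
        got = self.slots.get(seq, {})
        if not all(child in got for child in children):
            return None
        got = self.slots.pop(seq, {})
        return [got[child] for child in children]
```

A leaf that races ahead into the second gather deposits its data under sequence number 1, so it cannot overwrite or be confused with the contribution still pending for sequence number 0.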
One Sided Communications
A proposal for one sided communications is currently under consideration by the MPI-2 Forum. One sided communications is an extension to the communications mechanisms of MPI allowing for remote memory access (RMA), where the transfer of data from the memory of one process to the memory of another process occurs with only the explicit involvement of one of these processes [9]. This proposal hopes to provide an interface for taking advantage of the opportunities for high performance RMA on those systems that have dedicated RMA hardware, such as the Cray T3D [4], systems with communications coprocessors, such as the Intel Paragon, and on shared memory multiprocessor systems. The current proposal contains functions for initialization, remote memory reads and writes, atomic memory updates, remote synchronization, and message handlers. The initialization and RMA access functions provide the basis for doing one sided communications.
The design of Puma provides for doing efficient RMA operations. Because portals allow for writing into and reading out of (using reply portals) the memory of a remote process without the process' explicit involvement, Puma has the capability to do RMA communications easily and efficiently. Cache coherency is maintained for all incoming messages.
The current proposal includes two functions for initialization. The first is MPI_RMA_init(), which exposes a window of memory to RMA communications and returns a communicator enabled for performing both RMA operations and normal MPI communications. The second is a routine for allocating special memory, which is provided for those platforms where a different type of memory must be used for RMA. Puma provides the capability to write into any memory in an application's address space, so the RMA allocation function is equivalent to the standard malloc() function.
The current proposal has four functions for RMA access: MPI_Put(), MPI_Get(), MPI_Iput(), and MPI_Iget(). When used with the communicator returned by MPI_RMA_init(), the put functions perform a remote write of the data supplied at the origin process into the exposed window at the target process. The get functions perform a remote memory read of the exposed window at the target process, depositing the data into a supplied buffer on the origin process. The non-blocking versions return a request handle that may be used with any of the normal MPI wait or test functions. These functions also contain an offset argument so that reads and writes can be initiated at an offset from the base of the window.
Figure 11 illustrates the portals used for RMA operations. Two portal table entries are used for RMA, one for puts and one for gets. In the MPI_RMA_init() function, the next available match list entry for the put portal is obtained. The first 32 match bits in this entry are set to the send context identifier in the RMA communicator. The second 32 match bits are set to a special tag value. A sender-managed single block memory descriptor referencing the RMA window is attached to the entry. Similarly, the next available match list entry for the get portal is obtained, and the match bits are set to the receive context on the RMA communicator and a special tag value. A sender-managed single block reply memory descriptor referencing the same RMA window is attached to the get match list entry.
Design and Implementation
The put functions send a message to the designated put portal on the target process with the destination match bits set to the send context of the communicator and the correct byte offset calculated from the offset arguments to the function. The put operation completes as soon as the message is sent. A blocking put requires no further action, while a non-blocking put must build a request handle that is immediately marked completed. Non-blocking puts are therefore somewhat slower than blocking puts. Put operations maintain pairwise ordering.
The get function begins by posting a receive for the reply message. Posting a receive is done exactly as if the receive were being posted for a normal MPI message, with a few exceptions. The match bits for this receive are set to the receive context of the communicator and the special tag value, in order to avoid mixing RMA communications with regular MPI communications. Instead of attaching an independent block memory descriptor to the match entry, a single block descriptor can be used. A message is then sent to the designated get portal on the target process with the destination match bits set to the receive context of the communicator and the correct byte offset calculated from the offset arguments to the function. In addition, the requested return portal is the local receive portal and the return match bits are set appropriately. The blocking version of get is implemented by calling the non-blocking version and waiting for the request to complete.
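The sender-managed offset semantics of the put and get portals can be sketched with an in-process model of an exposed window. The `Window` and `Request` classes and their method names are hypothetical and do not correspond to the proposed MPI-2 interface; no message passing is modelled, only the effect of the operations on the window memory.

```python
class Request:
    """Minimal request handle; `done` is what a wait or test would check."""

    def __init__(self, done=False, data=None):
        self.done = done
        self.data = data


class Window:
    """Model of the memory window exposed by the target process."""

    def __init__(self, size):
        self.mem = bytearray(size)

    def _check(self, offset, length):
        if offset < 0 or length < 0 or offset + length > len(self.mem):
            raise ValueError("access falls outside the exposed window")

    def put(self, offset, data):
        """Remote write at a sender-chosen offset from the window base."""
        self._check(offset, len(data))
        self.mem[offset:offset + len(data)] = data

    def iput(self, offset, data):
        """Non-blocking put: same write plus an already completed handle."""
        self.put(offset, data)
        return Request(done=True)

    def iget(self, offset, length):
        """Non-blocking get: the reply carries the requested bytes."""
        self._check(offset, length)
        return Request(done=True, data=bytes(self.mem[offset:offset + length]))

    def get(self, offset, length):
        """Blocking get: call the non-blocking version and wait on it."""
        req = self.iget(offset, length)
        assert req.done
        return req.data
```

As in the description above, the non-blocking put does the same work as the blocking put and additionally builds a handle, and the blocking get is a thin wrapper around the non-blocking one.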
Future Work
A concerted effort is being made to increase the performance of Puma to be at least that of its predecessor.
MPI has yet to be tested under the various coprocessor modes being developed, and some base functionality still needs to be implemented at the operating system level.
The combined block memory descriptor was designed to be used for operations with non-contiguous datatypes. However, the combined block has yet to be implemented. Consequently, non-contiguous datatypes are packed and unpacked into contiguous buffers. For one sided communications, each block in the datatype generates a separate message so that the offset can be used properly. Combined block memory descriptors will greatly reduce this cost.
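The two current workarounds for non-contiguous datatypes can be compared in a short sketch. A datatype is represented here simply as a list of (offset, length) blocks, which is an assumption made for illustration; `pack`, `unpack`, and `rma_messages` are hypothetical helper names.

```python
def pack(memory, blocks):
    """Copy each (offset, length) block into one contiguous buffer."""
    return b"".join(bytes(memory[off:off + n]) for off, n in blocks)


def unpack(memory, blocks, packed):
    """Scatter a contiguous buffer back into the described blocks."""
    pos = 0
    for off, n in blocks:
        memory[off:off + n] = packed[pos:pos + n]
        pos += n


def rma_messages(memory, blocks, target_base=0):
    """One (target offset, data) message per block, as for one sided puts."""
    return [(target_base + off, bytes(memory[off:off + n]))
            for off, n in blocks]
```

Point-to-point transfers pay for an extra copy on each side, while one sided transfers pay for one message per block; a combined block memory descriptor would let the kernel fill the blocks directly from a single message.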
For the ASCI/DOE TeraFLOPS machine, hybrid techniques will be incorporated into MPI collective operations in order to take advantage of the topology of the machine. In addition, once combined blocks are implemented, the collective operations will be modified to support noncontiguous datatypes.
Effort is nearly completed on a new ADI for MPICH [7]. The goal of this next generation ADI is to achieve lower latencies and remove as much overhead as possible, especially when handling messages with contiguous datatypes. Providing better support for multi-protocol devices and heterogeneous systems are additional goals. Work has already begun on moving this implementation to the new ADI.
