Recent developments in communication architectures for parallel machines have reduced communication overheads and latencies by over an order of magnitude compared to earlier proposals. This paper examines whether these techniques can carry over to clusters of workstations connected by an ATM network even though clusters use standard operating system software, are equipped with network interfaces optimized for stream communication, do not allow direct protected user-level access to the network, and use networks without reliable transmission or flow control.
Introduction
The shift from slow broadcast-based local area networks to high bandwidth switched network architectures is making the use of clusters of workstations¹ as platforms for parallel processing more and more attractive. While a number of software packages [5,6] already support parallel processing on today's workstations and networks, the communication performance is over two orders of magnitude inferior to state-of-the-art multiprocessors². As a result, only embarrassingly parallel applications (i.e., parallel applications that essentially never communicate) can make use of such environments. Networking technologies such as ATM [1] offer the opportunity to close the gap: for example, ATM cells are roughly the same size as messages on multiprocessors, it takes only a few microseconds to send or receive a cell, ATM switches can be configured to provide bisection bandwidths comparable to parallel machine networks, and routing latencies are on the order of microseconds. However, to date this communication potential has not been available at the application level.

1. The term cluster is used here to refer to collections of workstation-class machines interconnected by a low-latency high-bandwidth network.
2. This paper focuses exclusively on scalable multiprocessor architectures and specifically excludes bus-based shared-memory multiprocessors.

This paper and the described software are available at URL

From a purely technical point of view, the gap between clusters of workstations and multiprocessors is certainly closing and the distinction between the two types of systems is becoming blurred. Differences remain: in particular, the design and construction of multiprocessors allows better integration of all the components because they can be designed to fit together. In addition, the sharing of physical components such as power supplies, cooling and cabinets has the potential to reduce cost and to allow denser packaging.
While the debate over the significance of these technological differences is still open, it is becoming clear that the two approaches will yield qualitatively similar hardware systems. Indeed, it is possible to take a cluster of workstations and load system software making it look almost identical to a multiprocessor. This means that a continuous spectrum of platforms spanning the entire range from workstations on an Ethernet to state-of-the-art multiprocessors can become available, and that any distinction between multiprocessors and clusters will be more and more arbitrary from a technical point of view.

From a pragmatic point of view, however, significant differences are likely to remain. The most important attraction in using a cluster of workstations instead of a multiprocessor lies in the off-the-shelf availability of all its major hardware and software components. This means that all the components are readily available, they are familiar, and their cost is lower because of economies of scale leveraged across the entire workstation user community. Thus, even if from a technical point of view there is a continuous spectrum between clusters and multiprocessors, the use of off-the-shelf components in clusters will maintain differences.

In fact, the use of standard components in clusters raises the question whether these can be reasonably used for parallel processing. Recent advances in multiprocessor communication performance are principally due to a tighter integration of programming models, compilers, operating system functions, and hardware primitives. It is not clear whether these advances can be carried over to clusters or whether the use of standard components is squarely at odds with achieving the level of integration required to enable modern parallel programming models. Specifically, new communication architectures such as distributed shared memory, explicit remote memory access, and Active Messages reduced communication costs from hundreds or thousands of microseconds to just a few dozen precisely through the integration of all system components. These new communication architectures are designed such that network interfaces can implement common primitives directly in hardware, they allow the operating system to be moved out of the critical communication path without compromising protection, and they are well suited for high-level language implementation.

This paper examines whether the techniques developed to improve communication performance in multiprocessors, in particular Active Messages, can be carried over to clusters of workstations with standard networks and mostly standard system software. This paper assumes the current state-of-the-art technology in which clusters using ATM networks differ from multiprocessors in three major aspects¹: clusters use standard operating system software, which implies less coordination among individual nodes, in particular with respect to process scheduling and address translation; ATM networks do not provide the reliable delivery and flow control that are taken for granted in multiprocessor networks; and network interfaces for workstations optimize stream communication (e.g., TCP/IP) and are less well integrated into the overall architecture (e.g., they connect to the I/O bus instead of the memory bus).

1. A discussion of differences in fault isolation characteristics is beyond the scope of this paper.

In comparing communication on clusters and multiprocessors this paper makes two major contributions: first, it analyzes, in Section 2, the implications that the differences between clusters and multiprocessors have on the design of communication layers similar to those used in multiprocessors, and second, it describes, in Section 3, the design of an Active Messages prototype implementation on a collection of Sun workstations interconnected by an ATM network which yields application-to-application latencies on the order of 20µs.
The use of Active Messages in workstation clusters is briefly contrasted to other approaches in Section 4 and Section 5 concludes the paper.
Technical Issues
... dynamically reducing the transmission rate on connections which experience high cell loss rates. This works in these settings because, following the law of large numbers, contention in a wide area network does not tend to vary instantaneously and therefore the degree of contention observed in the recent past is a good predictor for contention in the near future.
As an illustration of the difficulties in a parallel computing setting, consider the implementation of a parallel sort. The most efficient parallel sort algorithms [3] are based on an alternation of local sorts on the nodes and permutation phases in which all nodes exchange data with all other nodes. These permutation phases serve to move the elements to be sorted "towards" their correct position. The communication patterns observed are highly dynamic and their characteristics depend to a large degree on the input data. If at any point the attempted data rate into a given node exceeds the link rate, then the output buffers at up-stream switches will start filling up. Because the communication patterns change very rapidly (essentially with every cell), it is futile to attempt to predict contention, and given the all-to-all communication pattern, the probability of internal contention among seemingly unrelated connections is high.
Beyond the problems caused by contention and the resulting retransmissions, the lack of a reliable delivery guarantee in ATM networks imposes a certain overhead on the communication primitives. Specifically, the sender must keep a copy of each cell sent until a corresponding acknowledgment is received, in case the cell must be retransmitted. This means that messages cannot be transferred directly between processor registers and the network interface (as is possible on the CM-5 [12]); rather, a memory copy must be made as well.
User-level access to the network interface
Recently, multiprocessor communication architectures have achieved a significant reduction of the communication overhead by eliminating the operating system from the critical path. In order not to compromise security, the network interface must offer some form of protection mechanism. In shared memory models, the memory management unit is extended to map remote memory into the local virtual user address space such that the operating system can enforce security by managing the address translation tables. Message-based network interfaces contain a node address translation table which maps the user's virtual node numbers onto the physical node address space. Again, the operating system enforces security by controlling the address translation, thereby preventing a process from sending a message to an arbitrary node. The current generation of message-based network interfaces only controls the destination node address and therefore requires that all processes of a parallel program run at the same time. The next generation adds the sending process id to each message, allowing the receiving network interface to discriminate between messages destined for the currently running process, which can retrieve these messages directly, and messages for dormant processes, which must be queued (typically by the operating system) for later retrieval.
In contrast, the network interfaces available for workstations do not yet incorporate any form of protection mechanism. Instead, the operating system must be involved in the sending and reception of every message.
The connection-based nature of ATM networks would in principle allow the design of a protection mechanism to limit the virtual circuits a user process has access to (the operating system would still control virtual circuit set-up). But because the architecture of the networking layers in current operating systems does not seem to be set up to allow user-level network interface access, it appears unlikely that network interfaces with these features will become commonplace soon. The challenge in any high-performance communication layer for clusters is, thus, to minimize the path through the kernel by judiciously coordinating the user-kernel interactions.
Coordination of system software across all communicating nodes
In almost all communication architectures the message reception logic is the critical performance bottleneck. In order to be able to handle incoming messages at full network bandwidth, the processing required for each arriving message must be minimized carefully. The trick used in multiprocessor systems to ensure rapid message handling is to constrain the sender to only send messages which are easy to handle.
In shared memory systems this is done by coordinating the address translation tables among all processing nodes such that the originating node can translate the virtual memory address of a remote access and directly place the corresponding physical memory address into the message. The set of communication primitives is small and fixed (e.g., read and write) and by forcing the sender to perform the complicated part of a remote memory access (namely the protection checks and the address translation) the handling of a request is relatively simple to implement. If the virtual address were sent, the receiving node could discover that the requested virtual memory location had been paged out to disk, with the result that the handling of the message would become rather involved.
In clusters the fact that the operating systems of the individual nodes are not nearly as coordinated contradicts the assumption that messages can always be consumed quickly upon arrival. In the case of Active Messages the destination process might have been suspended and cannot run the handler, and in a shared memory model the memory location requested might not be mapped. Although exact coordination is not possible without major changes to the operating system core, an implementation of either communication model is likely to be able to perform some coordination among nodes on its own and to influence the local operating system accordingly. This may allow the communication layer to assume that in the common case everything works out fine, but it must be able to handle the difficult cases as well.
Summary
Even though superficially a cluster of workstations appears to be technically comparable to a multiprocessor, the reality is that key characteristics are different and cause significant implementation difficulties: the very comparable raw hardware link bandwidths, bisection bandwidths, and routing latencies conceal the lack in clusters of flow control, reliability, user-level network access, and operating system coordination.
These shortcomings will inevitably result in lower communication performance; their quantitative effect on performance is evaluated in the next section, which presents a prototype implementation of Active Messages on a cluster of Sun workstations. However, the lack of flow control in ATM networks poses a fundamental problem: can catastrophic performance degradation occur due to significant cell loss in particular communication patterns?
SSAM: a SPARCstation Active Messages Prototype
The SSAM prototype implements the critical parts of an Active Messages communication architecture on a cluster of SPARCstations connected by an ATM network. The primary goal is to evaluate whether it is possible to provide a parallel programming environment on the cluster that is comparable to those found on multiprocessors. The prototype is primarily concerned with providing performance on par with parallel machines, while addressing the handicaps of ATM networks that have been identified in the previous section. In particular:
the prototype provides reliable communication to evaluate the cost of performing the necessary flow control and error checking in software, it minimizes the kernel intervention to determine the cost of providing protection in software, and the buffering is designed to tolerate arbitrary context switching on the nodes.
At this time only a limited experimental set-up (described below) is available, so the prototype can provide information neither on how cell losses due to contention within the network affect performance, nor on how the scheduling of processes can be coordinated to improve the overall performance of parallel applications.
Active Messages Communication Architecture
The basic communication primitive is a message with an associated small amount of computation (in the form of a handler) at the receiving end. Typically the first word of an Active Message points to the handler for that message. On message arrival, the computation on the node is interrupted and the handler is executed. The role of the handler is to get the message out of the network, by integrating it into the ongoing computation and/or by sending a reply message back. The buffering and scheduling provided by Active Messages are extremely primitive and thereby fast: the only buffering is that involved in actual transport and the only scheduling is that required to activate the handler. This is sufficient to support many higher-level abstractions, and more general buffering and scheduling can be easily constructed in layers above Active Messages when needed. This minimalist approach avoids paying a performance penalty for unneeded functionality.
In order to prevent deadlock and livelock, Active Messages restricts communication patterns to requests and replies, i.e., the handler of a request message is only allowed to send a reply message and a reply handler is not allowed to send further replies.
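The request/reply discipline described above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the `Node` class, the in-memory `inbox` queue standing in for the network, and the handler names `echo_request` and `echo_reply` are hypothetical and are not part of SSAM.

```python
from collections import deque

class Node:
    """Toy node: an in-memory inbox stands in for the network interface."""
    def __init__(self, name):
        self.name = name
        self.inbox = deque()
        self.log = []

    def send(self, dest, handler, data, kind):
        # The message carries a reference to the handler to run on arrival.
        dest.inbox.append((handler, self, data, kind))

    def poll(self):
        """Drain the inbox, running each handler to completion."""
        while self.inbox:
            handler, src, data, kind = self.inbox.popleft()
            reply = handler(self, data)
            if reply is None:
                continue
            if kind == "reply":
                # Reply handlers must not send further messages.
                raise RuntimeError("reply handlers may not send messages")
            reply_handler, reply_data = reply
            self.send(src, reply_handler, reply_data, "reply")

def echo_request(node, data):
    # Request handler: integrate the data, then answer with one reply.
    node.log.append(("request", data))
    return (echo_reply, data * 2)

def echo_reply(node, data):
    # Reply handler: consume the data and send nothing.
    node.log.append(("reply", data))
    return None

a, b = Node("a"), Node("b")
a.send(b, echo_request, 21, "request")
b.poll()   # b runs the request handler and sends the reply
a.poll()   # a runs the reply handler
```

Because a request can trigger at most one reply and a reply triggers nothing, every message chain in this sketch has length two, which is the property that rules out unbounded handler-to-handler traffic.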
SSAM functionality
The current implementation is geared towards the sending of small messages which fit into the payload of a single ATM cell. ... FIFO are moved into a buffer. SSAM then calls the appropriate handler for each message, passing as arguments the originating connection identifier, the address of the buffer holding the message, and the address of a buffer for a reply message. The handler processes the message and may send a reply message back by placing the data in the buffer provided and returning the address of the reply handler (or NULL if no reply is to be sent).
The current prototype does not use interrupts. Instead, the network is polled every time a message is sent. This means that as long as a process is sending messages it will also handle incoming ones. An explicit polling function is provided for program parts which do not communicate. Using interrupts is planned but not implemented yet.
Example: implementing a remote read with SSAM
The sample implementation of a split-phase remote double-word read is shown in Figure 2. The readDouble function increments a counter of outstanding reads, formats a request Active Message with the address to be read as well as information for the reply, and sends the message. The readDouble_h handler fetches the remote location and sends a reply back to the readDouble_rh reply handler, which stores the data into memory and decrements the counter. The originating processor waits for the completion of the read by busy-waiting on the counter at the end of readDouble. A split-phase read could be constructed easily by exposing the counter to the caller, who could proceed with computation after initiating the read and only wait on the counter when the data is required.
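Since the figure itself is not reproduced here, the following is a minimal simulation of the read described above, not the authors' code. The `Node` class, the `send` and `poll` helpers, and the dictionary used as memory are assumptions made for illustration; only the roles of the three functions follow the text.

```python
from collections import deque

class Node:
    def __init__(self):
        self.memory = {}        # address -> value, stands in for local memory
        self.inbox = deque()    # stands in for the receive FIFO
        self.outstanding = 0    # counter of reads still in flight

def send(dest, handler, src, args):
    dest.inbox.append((handler, src, args))

def poll(node):
    """Run the handler of every message waiting at this node."""
    while node.inbox:
        handler, src, args = node.inbox.popleft()
        reply = handler(node, args)
        if reply is not None:
            reply_handler, reply_args = reply
            send(src, reply_handler, node, reply_args)

def read_double_h(node, args):
    # Request handler: fetch the remote location and reply with its value.
    remote_addr, local_addr = args
    return (read_double_rh, (local_addr, node.memory[remote_addr]))

def read_double_rh(node, args):
    # Reply handler: store the value and decrement the counter.
    local_addr, value = args
    node.memory[local_addr] = value
    node.outstanding -= 1
    return None

def read_double_start(local, remote, remote_addr, local_addr):
    # Split-phase half: issue the request and return immediately.
    local.outstanding += 1
    send(remote, read_double_h, local, (remote_addr, local_addr))

def read_double(local, remote, remote_addr, local_addr):
    # Blocking version: busy-wait on the counter while polling.
    read_double_start(local, remote, remote_addr, local_addr)
    while local.outstanding > 0:
        poll(remote)   # the simulation must also give the remote node a turn
        poll(local)
    return local.memory[local_addr]

local, remote = Node(), Node()
remote.memory[0x100] = 3.25
value = read_double(local, remote, 0x100, 0x8)
```

Calling `read_double_start` directly and checking `local.outstanding` later gives the split-phase variant mentioned in the text, where the caller overlaps computation with the read.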
Experimental set-up
The experimental set-up used to evaluate the performance of the prototype SSAM implementation consists of a 60MHz SPARCstation-20 and a 25MHz SPARCstation-1+ running SunOS 4.1. The two machines are connected via Fore Systems SBA-100 ATM interfaces using a 140Mbit/s TAXI fiber. Note that the network interface used is much simpler and closer to multiprocessor NIs than most second-generation ATM interfaces available today. The only function performed in hardware, beyond simply moving cells onto/off the fiber, is checksum generation and checking for the ATM header and an AAL3/4 compatible payload. In particular, no DMA or segmentation and reassembly of multi-cell packets is provided.
SSAM implementation
The implementation of the SPARCstation ATM Active Messages layer consists of two parts: a device driver which is dynamically loaded into the kernel and a user-level library to be linked with applications using SSAM. The driver implements standard functionality to open and close the ATM device and it provides two paths to send and receive cells. The fast path described here consists of three trap instructions which lead directly to code for sending and receiving individual ATM cells. The traps are relatively generic and all functionality specific to Active Messages is in the user-level library, which also performs the flow control and buffer management. A conventional read/write system call interface is provided for comparison purposes and allows cells to be sent and received using a "pure" device driver approach.
The traps to send and receive cells consist of carefully crafted assembly language routines. Each routine is small (28 and 43 instructions for the send and receive traps, respectively) and uses available registers carefully. The register usage is simplified by the SPARC architecture's use of a circular register file, which provides a clean 8-register window for the trap. By interfacing from the program to the traps via a function call, arguments can be passed and another 8 registers become available to the trap.
The following paragraphs describe the critical parts of the SSAM implementation in more detail.
Flow-control
A simple sliding window flow control scheme is used to prevent overrun of the receive buffers and to detect cell losses. The window size is dimensioned to allow close to full bandwidth communication among pairs of processors.
In order to implement the flow control for a window of size w, each process pre-allocates memory to hold 4w cells for every other process with which it communicates. The algorithm to send a request message polls the receiver until a free window slot is available and then injects the cell into the network, saving it in one of the buffers as well in case it has to be retransmitted. Upon receipt of a request message, the message layer moves the cell into a buffer and, as soon as the corresponding process is running, calls the Active Message handler. If the handler issues a reply, it is sent and a copy is held in a buffer. If the handler does not generate a reply, an explicit acknowledgment is sent. Upon receipt of the reply or acknowledgment, the buffer holding the original request message can be reused. Note how the distinction between requests and replies made in Active Messages allows acknowledgments to be piggy-backed onto replies.
The recovery scheme used in case of lost or duplicate cells is standard, except that the reception of duplicate request messages may indicate lost replies which have to be retransmitted. It is important to realize that this flow control mechanism does not really attempt to minimize message losses due to congestion within the network: the lack of flow control in ATM networks effectively precludes any simple congestion avoidance scheme. Until larger test-beds become available and the ATM community agrees on how routers should handle buffer overflows, it seems futile to invest in more sophisticated flow control mechanisms. Nevertheless, the bursty nature of parallel computing communication patterns is likely to require some solution before the performance characteristics of an ATM network become as robust as those of multiprocessor networks.
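The window scheme and its recovery rule can be sketched as follows. This is an illustrative model, not the SSAM implementation: the `Sender` and `Receiver` classes, the sequence numbers, and the Python list used as the network are assumptions, and the receiver keeps every reply forever, whereas a real implementation would bound that storage by the window.

```python
class Sender:
    def __init__(self, window):
        self.window = window
        self.next_seq = 0
        self.unacked = {}          # seq -> saved copy of the request cell

    def can_send(self):
        return len(self.unacked) < self.window

    def send(self, payload, network):
        if not self.can_send():
            raise BlockingIOError("window full: poll for acknowledgments first")
        seq = self.next_seq
        self.next_seq += 1
        cell = (seq, payload)
        self.unacked[seq] = cell   # keep a copy in case of retransmission
        network.append(cell)
        return seq

    def on_ack(self, seq):
        # A reply or an explicit acknowledgment frees the request buffer.
        self.unacked.pop(seq, None)

    def retransmit(self, network):
        for seq in sorted(self.unacked):
            network.append(self.unacked[seq])

class Receiver:
    def __init__(self, handler):
        self.handler = handler
        self.replies = {}          # seq -> saved reply (None means plain ack)

    def on_cell(self, cell):
        seq, payload = cell
        if seq not in self.replies:
            # First arrival: run the handler exactly once.
            self.replies[seq] = self.handler(payload)
        # A duplicate request suggests the earlier reply was lost: resend it.
        return (seq, self.replies[seq])

sender = Sender(window=2)
receiver = Receiver(handler=lambda x: x + 1)
network = []
sender.send(10, network)
sender.send(20, network)
window_full = not sender.can_send()
network.pop()                                    # simulate loss of seq 1
first_acks = [receiver.on_cell(c) for c in network]
network.clear()
for seq, _ in first_acks:
    sender.on_ack(seq)
sender.retransmit(network)                       # only seq 1 is outstanding
second_acks = [receiver.on_cell(c) for c in network]
for seq, _ in second_acks:
    sender.on_ack(seq)
duplicate = receiver.on_cell((1, 20))            # replayed request
```

The reply doubles as the acknowledgment here, which mirrors the piggy-backing described in the text; a handler that returns `None` corresponds to the explicit acknowledgment case.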
User-kernel interface and buffer management
The streamlining of the user-kernel interface is the most important factor contributing to the performance of SSAM. In the prototype, the kernel preallocates all buffers for a process when the device is opened. The pages are then pinned to prevent page-outs and are mapped (using mmap) into the process's address space. After every message send, the user-level library chooses a buffer for the next message and places a pointer in an exported variable. The application program moves the message data into that buffer and passes the connection id and the handler address to SSAM, which finishes formatting the cell (adding the flow control and handler) and traps to the kernel. The trap passes the message offset within the buffer area and the connection id in registers to the kernel. Protection is ensured with simple masks to limit the connection id and offset ranges. A lookup maps the current process and connection ids to a virtual circuit. The kernel finally moves the cell into the output FIFO.
At the receiving end, the read-ATM kernel trap delivers a batch of incoming cells into a pre-determined shared memory buffer. The number of cells received is returned in a register. For each cell the kernel performs four tasks: it loads the first half of the cell into registers, uses the VCI to index into a table to obtain the address of the appropriate process's input buffer, moves the full cell into that buffer, and checks the integrity of the cell using three flag bits set by the NI in the last byte. Upon return from the trap, the SSAM library loops through all received cells checking the flow-control information, calling the appropriate handlers for request and reply messages, and sending explicit acknowledgments when needed.
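The mask-based protection on the send path can be modeled in a few lines. This is a sketch only: the mask values, the 64-byte slot spacing, the 8 KiB buffer area, and the function name `send_trap` are hypothetical choices for illustration, and the real trap is a short assembly routine rather than Python.

```python
CONN_MASK = 0x1F        # hypothetical: at most 32 connections per process
OFFSET_MASK = 0x1FC0    # hypothetical: 8 KiB buffer area, 64-byte cell slots
CELL_BYTES = 56         # cell size as transferred to the SBA-100

def send_trap(pid, conn_id, offset, buffers, vci_table, output_fifo):
    """Model of the send trap: mask the arguments, look up the circuit, push the cell."""
    # Masking forces both values into range instead of testing and rejecting,
    # so a wild argument can only ever name a slot inside the caller's own buffers.
    conn_id &= CONN_MASK
    offset &= OFFSET_MASK
    vci = vci_table.get((pid, conn_id))
    if vci is None:
        return False                      # no virtual circuit for this connection
    cell = bytes(buffers[pid][offset:offset + CELL_BYTES])
    output_fifo.append((vci, cell))
    return True

buffers = {7: bytearray(8192)}            # pinned per-process buffer area
buffers[7][64:69] = b"hello"
vci_table = {(7, 3): 42}                  # (process, connection) -> virtual circuit
fifo = []

ok = send_trap(7, 3, 64, buffers, vci_table, fifo)
no_circuit = send_trap(7, 4, 64, buffers, vci_table, fifo)
wild = send_trap(7, 3 | 0x100, 0x10040, buffers, vci_table, fifo)
```

The appeal of this design is that the kernel never has to validate an arbitrary user pointer: an offset and a small integer are cheap to constrain, which keeps the trap short.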
SSAM performance
The following paragraphs describe performance measurements of SSAM made with a number of synthetic benchmarks. The following terminology is used: overhead consists of the processor cycles spent preparing to send or receive a message, latency is the time from when a message send routine is called until the message is handled at the remote end, and bandwidth is the rate at which user data is transferred. The performance goal for SSAM is the fiber rate of 140Mbit/s, which transmits a cell every 3.14µs (53+2 bytes) for an ATM payload bandwidth of 15.2MB/s. Timings were taken with gettimeofday, which uses a microsecond-accurate clock and takes 9.5µs on the SS-20. The time breakdown for each trap was measured by commenting appropriate instructions out and is somewhat approximate due to the pipeline overlap occurring between successive instructions.
The write trap cost is broken down into 5 parts: the cost of the trap and return, the protection checks, overhead for fetching addresses, loading the cell into registers, and pushing the cell into the network interface. The SS-20 numbers show clearly that the fiber can be saturated by sending a cell at a time from user level. They also indicate that the majority of the cost (75%) lies in the access to the network interface across the Sbus. The cost of the trap itself is surprisingly low, even though it is the second largest item. In fact, it could be reduced slightly, as the current implementation adds a level of indirection in the trap dispatch to simplify the dynamic loading of the device driver.
The read trap is itemized similarly: the cost to trap and return, fetching the device register with the count of available cells, additional overhead for setting up addresses, loading the cell from the network interface, demultiplexing among processes, and storing the cell away. The total cost is shown for a trap which receives a single cell, as well as the per-cell cost for a trap which receives 16 cells. Here again the access to the device dominates due to the fact that each double-word load incurs the full latency of an Sbus access. The total time of 4.61µs on the SS-20 is longer than the fiber's cell time and will limit the achievable bandwidth to at most 68% of the fiber.
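The cell time, payload bandwidth, and the 68% receive limit quoted above follow from simple arithmetic, sketched below. The variable names are illustrative; the input figures are the ones stated in the text, and MB is taken to mean 10^6 bytes.

```python
FIBER_BITS_PER_S = 140e6        # TAXI fiber rate
CELL_BYTES_ON_FIBER = 53 + 2    # bytes occupied on the fiber per cell
PAYLOAD_BYTES = 48              # ATM payload per cell
READ_TRAP_US = 4.61             # measured single-cell read trap on the SS-20

# Time to transmit one cell, in microseconds.
cell_time_us = CELL_BYTES_ON_FIBER * 8 / FIBER_BITS_PER_S * 1e6

# Payload bandwidth in MB/s: bytes per microsecond equals 10^6 bytes per second.
payload_mb_per_s = PAYLOAD_BYTES / cell_time_us

# If receiving one cell takes longer than one cell time, the receiver caps
# the usable fraction of the fiber at cell_time / trap_time.
read_limit_fraction = cell_time_us / READ_TRAP_US

print(f"cell time        : {cell_time_us:.2f} us")
print(f"payload bandwidth: {payload_mb_per_s:.1f} MB/s")
print(f"receive limit    : {read_limit_fraction:.0%} of the fiber")
```

The same ratio explains why batching matters on the receive side: the per-cell cost of a 16-cell trap is lower than the single-cell cost, which raises the achievable fraction.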
The write-read trap first sends a cell and then receives a chunk of cells. This amortizes the cost of the trap across both functions and overlaps checking the cell count slightly with sending. The last item in the table shows the cost of a null system call for comparison purposes (a write to file descriptor -1 was used). It is clear that a system call approach would yield performance far inferior to the traps and would achieve only a fraction of the fiber bandwidth.
... check for the appropriateness of the file descriptor, transfer data between user space and an internal buffer using uiomove, and transfer data between the internal buffer and the FIFOs of the network interface.
The internal buffer is used because the data cannot be transferred directly between user space and the device using uiomove due to the fact that the device FIFOs are only word addressable. The use of an internal buffer also allows double-word accesses to the device FIFOs, which improves the access times considerably. Table 2 shows the costs for the various parts of the read and write system calls. The "syscall overhead" entries reflect the time taken for a read (respectively write) system call with an empty read (write) device driver routine. This measures the kernel overhead associated with these system calls. The "check fd, do uiomove" entry reflects the time spent in checking the validity of the file descriptor and performing the uiomove. In the case of a read, it also includes the time to check the device register holding the number of cells available in the input FIFO.
The "push/pull cell" entries reflect the time spent to transfer the contents of one cell between the internal buffer and the device FIFOs. The "write" and "read 1 cell" totals reflect the cost of the full system call, while the "read 0 cells" entry is the time taken for an unsuccessful poll, which includes the system call overhead, the file descriptor checks, and the reading of the receive-ready register.
The timings show clearly that the overhead of the read/write system call interface is prohibitive for small messages. For larger messages, however, it may well be a viable choice and it is more portable than the traps. The measurements show that supporting only single-cell Active Messages is not optimal. Longer messages are required to achieve peak bulk transfer rates: the one-cell-at-a-time prototype can yield up to 5.6MB/s. A simpler interface for shorter messages (e.g., with only 16 bytes of payload) might well be useful as well to accelerate the small requests and acknowledgments that are often found in higher-level protocols. Unfortunately, given that the trap cost is dominated by the network interface access time and that the SBA-100 requires all 56 bytes of a cell to be transferred by the processor, it is unlikely that a significant benefit can be realized.
... be improved to 9Mbytes/s by using the full ATM payload and simplifying the handling slightly.

Unresolved issues
The current SSAM prototype has no influence on the kernel's process scheduling. Given the current buffering scheme, the SSAM layer operation is not influenced by which process is running. The performance of applications, however, is likely to be highly influenced by the scheduling. How to best influence the scheduler in a semi-portable fashion requires further investigation. The most promising approach appears to be to use real-time thread scheduling priorities, such as are available in Solaris 2.
The amount of memory allocated by the SSAM prototype is somewhat excessive and, in fact, for simplicity, the current prototype uses twice as many buffers as strictly necessary. For example, assuming that a flow-control window of 32 cells is used, the kernel allocates and pins 8Kbytes of memory per process per connection. On a 64-node cluster with 10 parallel applications running, this represents 5Mbytes of memory per processor.
The number of preallocated buffers could be reduced without affecting peak bulk transfer rates by adjusting the flow control window size dynamically. The idea is that the first cell of a long message contains a flag which requests a larger window size from the receiver; a few extra buffers would be allocated for this purpose. The receiver grants the larger window to one sender at a time using the first acknowledgment cell of the bulk transfer. The larger window size remains in effect until the end of the long message. This scheme has two benefits: the request for a larger window is overlapped with the first few cells of the long message, and the receiver can prevent too many senders from transferring large data blocks simultaneously, which would be sub-optimal for the cache. However, fundamentally, it appears that memory (or, alternatively, low performance) is the price to pay for having neither flow control in the network nor coordinated process scheduling.
A more subtle problem having to do with the ATM payload alignment used by the SBA-100 interface will surface in the future: the 53 bytes of an ATM cell are padded by the SBA-100 to 56 bytes and the 48-byte payload starts with the 6th byte, i.e., it is only half-word aligned. The effect is that bulk transfer payload formats designed with the SBA-100 in mind (and supporting double-word moves of data between memory and the SBA-100) will clash with other network interfaces which double-word align the ATM payload.
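The memory figures in the example above can be reproduced with a short calculation. Several inputs are assumptions rather than facts from the text: a cluster size of 64 nodes, one connection per node per application, and a 64-byte slot per buffered cell are choices that happen to reproduce the quoted 8 Kbytes and 5 Mbytes.

```python
WINDOW_CELLS = 32          # flow-control window w from the example
BUFFERS_PER_WINDOW = 4     # 4w cells are pre-allocated per connection
SLOT_BYTES = 64            # assumption: each buffered cell occupies a 64-byte slot
NODES = 64                 # assumption: cluster size
APPLICATIONS = 10          # parallel applications sharing each processor

# Pinned memory for one connection of one process.
per_connection_bytes = WINDOW_CELLS * BUFFERS_PER_WINDOW * SLOT_BYTES

# Assumption: every application holds one connection per node in the cluster.
total_bytes = per_connection_bytes * NODES * APPLICATIONS

print(f"per connection: {per_connection_bytes // 1024} Kbytes")
print(f"per processor : {total_bytes / (1024 * 1024):.1f} Mbytes")
```

The product grows linearly in each factor, so halving the window or the number of buffers per window halves the pinned memory, which is the motivation for the dynamic window scheme discussed next.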
Summary
The prototype Active Messages implementation on a SPARCstation ATM cluster provides a preliminary demonstration that this communication architecture developed for multiprocessors can be adapted to the peculiarities of the workstation cluster. The performance achieved is roughly comparable to that of a multiprocessor such as the CM-5 (where the one-way latency is roughly w), but it is clear that without a network interface closer to the processor the performance gap cannot be closed. The time taken by the flow control and protection in software is surprisingly low (at least in comparison with the network interface access times). The cost, in effect, has been shifted to large pre-allocated and pinned buffers. While the prototype's memory usage is somewhat excessive, other schemes with comparable performance will also require large buffers.
Overall, SSAM's speed comes from a careful integration of all layers, from the language level to the kernel traps. The key issues are avoiding copies by having the application place the data directly where the kernel picks it up to move it into the device, and passing only easy-to-check information to the kernel (in particular, not passing an arbitrary virtual address).
... offered through per-segment notification flags in order to cause a file descriptor to become ready.
Finally, SSAM provides a reliable transport mechanism while the remote memory access primitives are unreliable and do not provide flow-control.
Table 4 compares the performance of the two approaches: Thekkath's implementation uses two DECstation 5000s interconnected by a Turbochannel version of the same Fore-100 ATM interface used for SSAM and performs a little worse than SSAM for data transfer and significantly worse for control transfer. The remote reads and writes are directly comparable in that they transfer the same payload per cell.
The performance of more traditional communication layers over an ATM network has been evaluated by Lin et al. [7] and shows over two orders of magnitude higher communication latencies than SSAM offers. Table 5 summarizes the best round-trip latencies and one-way bandwidths attained on Sun 4/690's and SPARCstation 2's connected by Fore SBA-100 interfaces without a switch. The millisecond scale reflects the costs of the traditional networking architecture used by these layers, although it is not clear why Fore's AAL/5 API is slower than the read/write system call interface described in §3.4.2. Note that a TCP/IP implementation with a well-optimized fast path should yield sub-millisecond latencies.
