Acronyms and Abbreviations

AAL5
API
Introduction
This study evaluates a high-speed local area network that can support multiple problem domains in distributed computing. Specifically, we evaluate commodity ATM [1] network components (e.g., switches and adapters) that will satisfy the performance requirements of RPC-based [2] as well as PVM-based [3] applications.
Asynchronous Transfer Mode (ATM) is the latest network technology that can offer scalability in network size and speed. Moreover, by using fast cell switching in hardware, it can control delay jitter, thereby providing integrated services with different quality-of-service (QoS) requirements. In parallel with the advances made in networking technology, there have been significant improvements in processor technology; current high-end workstations have processing power comparable to that of the processors in dedicated, tightly coupled supercomputers. Therefore, it is becoming increasingly desirable to use a cluster of workstations on a high-speed network as a loosely coupled supercomputer.
The goal of this study is to build a high-performance LAN that can serve both RPC-based and parallel applications. While RPC-based applications perform well in this cluster environment, the cluster would also allow parallel applications to use its idle cycles as a loosely coupled supercomputer. We seek to satisfy the performance requirements of both classes of application in terms of throughput, response time, and fairness. In addition, for parallel applications, optimization techniques are employed to achieve hardware speed in both network and processor.
The organization of this paper is as follows. Section 2 describes the components of this cluster computing environment. Section 3 presents results and analysis for generic as well as parallel distributed computing. Section 4 summarizes the study and describes future work.
The cluster computing environment
The major components of our cluster environment include: 1) AN2's fair queuing [4], 2) credit-based, ATM-level flow control [5], and 3) workstation kernel memory mapping [6] on 170 MHz DEC Alpha workstations.
AN2 Fair Queuing
The DEC AN2 is a 16 x 16, input-buffered, crossbar switch. It is internally nonblocking because there is at least one path between all non-conflicting input and output pairs. In order to avoid head-of-line (HOL) blocking [7] and achieve fair queuing, this switch uses per-VC buffer management and a parallel iterative matching (PIM) scheduling algorithm [4]. The fairness achieved is at an ATM cell granularity. The following three steps are iterated several times within one OC-3c time slot.
1. Each unmatched input port sends a request to all output ports for which it has cells to send.
2. Grants are issued, at random, by unmatched output ports to requesting input ports.
3. If the input port accepts the grant, it notifies the output port, which disqualifies the output port from participating in the next iteration.
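The three steps above can be sketched as a small software simulation. This is an illustrative sketch only, not the AN2 hardware implementation: the function name `pim_match`, the dictionary-based request format, the default iteration count, and the random choice an input makes among several grants are all assumptions introduced here.

```python
import random


def pim_match(requests, iterations=4, rng=None):
    """Sketch of parallel iterative matching for one cell time slot.

    requests: dict mapping input port -> set of output ports it has cells for.
    Returns a dict mapping each matched input port to its output port.
    """
    rng = rng or random.Random()
    match = {}       # input port -> output port
    taken = set()    # output ports that are already matched

    for _ in range(iterations):
        # Step 1: each unmatched input requests every output it has cells for.
        requesters = {}  # output port -> list of requesting inputs
        for inp, outs in requests.items():
            if inp not in match:
                for out in outs:
                    requesters.setdefault(out, []).append(inp)

        # Step 2: each unmatched output grants one request at random.
        grants = {}      # input port -> list of granting outputs
        for out, inps in requesters.items():
            if out not in taken:
                grants.setdefault(rng.choice(inps), []).append(out)
        if not grants:
            break        # no further matches are possible

        # Step 3: each input accepts one grant (random pick is an assumption);
        # the accepted output drops out of later iterations.
        for inp, outs in grants.items():
            out = rng.choice(outs)
            match[inp] = out
            taken.add(out)
    return match


if __name__ == "__main__":
    demo = {0: {0, 1}, 1: {0}, 2: {2}}
    print(pim_match(demo))
```

Every iteration that issues at least one grant adds at least one match, so a handful of iterations is enough for the matching to become maximal on a small switch; the result can differ from run to run because grants are random.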
[Figure: the request, grant, and accept steps of the matching algorithm.]
Credit-based flow control
As depicted in Figure 2, a sender is initialized with a credit balance equal to the buffer space at the receiver. The credit balance of a VC is decremented after a cell is sent.
When the receiver has successfully forwarded a cell, a credit is returned to the sender and its credit balance is incremented. Credits are piggybacked onto either a data cell or a null 
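The credit accounting described above can be illustrated with a minimal single-hop model. This is a sketch under simplifying assumptions, not the AN2 implementation: the class name `CreditedVC` and its methods are hypothetical, credits are returned instantly (no propagation delay), and the piggybacking of credits onto cells is not modeled.

```python
from collections import deque


class CreditedVC:
    """One virtual circuit over one hop with per-VC credit accounting."""

    def __init__(self, receiver_buffer_cells):
        # The sender starts with one credit per receiver buffer slot.
        self.credits = receiver_buffer_cells
        self.rx_buffer = deque()

    def send(self, cell):
        """Sender side: transmit only when a credit is available."""
        if self.credits == 0:
            return False          # the sender holds the cell; nothing is dropped
        self.credits -= 1         # one credit is consumed per cell sent
        self.rx_buffer.append(cell)
        return True

    def forward(self):
        """Receiver side: forward one cell downstream and return a credit."""
        if not self.rx_buffer:
            return None
        cell = self.rx_buffer.popleft()
        self.credits += 1         # credit returned to the sender
        return cell
```

Because a cell is sent only when a buffer slot is known to be free, the receiver buffer in this model can never overflow, which is the property that makes the scheme lossless.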
Kernel memory mapping
Previous studies demonstrated that, over generic networks, only very coarse-grained parallel applications can achieve reasonable speedup. While commodity network components, such as ATM switches, TCP/IP drivers, and ATM adapters with their device drivers, perform well for RPC applications, the protocol processing, multiple data copies, and operating system interrupt overheads severely penalize the communication aspects of parallel computing relative to the computation components. In order to achieve an acceptable cost/performance tradeoff for a wider spectrum of parallel applications, we strive to achieve hardware speed on network links as well as in workstation processors.
As mentioned earlier, we chose the Parallel Virtual Machine (PVM) run-time system for our parallel applications. The PVM software package allows a network of computers to appear as a single concurrent computational resource. We can reduce end-to-end latency by reducing the number of data copies necessary for a given message transfer.
The DEC ATM adapter, OTTO, performs ATM segmentation and reassembly (SAR) of data packets in host memory. The device driver properly encapsulates and checksums data packets pointed to by a memory address; the adapter then transfers fragments (48 bytes) of the packets using DMA from host memory to the network. Similarly, the payload of ATM cells from the network is DMA'ed directly into host memory to be reassembled. This design eliminates the store-and-forward latency normally experienced by adapters that carry out ATM SAR in adapter memory. To further improve end-to-end latency, the device driver provides the modified PVM library routines with the ability to map its kernel memory using the ioctl system call interface. Figure 4 depicts the buffer layout of the kernel mapping process.
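To make the 48-byte segmentation concrete, the following sketch counts and produces the cell payloads for one packet. It assumes AAL5 framing (an 8-byte trailer, with padding so that the total is a multiple of 48 bytes) and ignores any additional encapsulation header; the function names are hypothetical, and the trailer is written as zero bytes rather than a real length field and CRC-32.

```python
CELL_PAYLOAD = 48   # payload bytes carried by one ATM cell
AAL5_TRAILER = 8    # AAL5 trailer: UU, CPI, 2-byte length, 4-byte CRC-32


def aal5_cells(packet_len):
    """Number of cells needed for a packet of packet_len bytes."""
    return -(-(packet_len + AAL5_TRAILER) // CELL_PAYLOAD)  # ceiling division


def segment(packet):
    """Split a packet into 48-byte cell payloads (placeholder trailer)."""
    pad = -(len(packet) + AAL5_TRAILER) % CELL_PAYLOAD
    # The trailer is zero-filled here; a real driver fills in length and CRC.
    pdu = packet + bytes(pad) + bytes(AAL5_TRAILER)
    return [pdu[i:i + CELL_PAYLOAD] for i in range(0, len(pdu), CELL_PAYLOAD)]
```

For example, under these assumptions a 40-byte packet fits exactly into one cell, while a 41-byte packet needs two.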
Results and analysis
Throughput and response time
A two-node fan-in/fan-out configuration (Figure 6) is used to evaluate the RPC response-time performance. Using this configuration, we interconnected one UDP-based echo session and from one to three concurrent TCP-based TTCP sessions. We measured the round-trip times (RTT) of UDP packets of 64, 128, 256, 512, 1024, and 2048 bytes with from one to three background TTCP streams. With fair queuing in place, the theoretical delay for a UDP echo packet is defined as follows:
where n is the number of active sessions, and m is the number of cells per packet. This definition implies that active sessions are serviced fairly within the granularity of an ATM cell-time. We normalize our measured RTT to its theoretical upper-bound in order to put it in proper perspective, and we will refer to the normalized value as its relative RTT degradation. In addition, we define link efficiency as follows:
where Ti is the throughput achieved by the i-th TTCP session. Table 1 summarizes the results of our throughput evaluation using the three-node parking-lot configuration. Using the simulation parameters listed in Table 2, one TCP session can achieve 110 Mbps. Therefore, this topology represents an offered network load that is roughly six times the link capacity. We performed our simulation with and without credit-based flow control, and the results are plotted in Figure 11. As shown, credit-based flow control indeed provided a zero-loss ATM network.
Furthermore, we show that cell loss is very detrimental to TCP performance; a ~6% cell loss caused ~37% TCP packet loss in this scenario. As a result, a significant amount of bandwidth is wasted in transporting cells belonging to already corrupted packets as well as in their retransmission.
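The amplification from cell loss to packet loss follows from the fact that a packet is unusable if any one of its cells is lost. The sketch below illustrates this under the simplifying assumption that cells are lost independently with a fixed probability; that assumption is introduced here and is not part of the simulation described above, where losses are likely to be bursty, so the sketch is not an attempt to reproduce the reported figures.

```python
def packet_loss_probability(cell_loss, cells_per_packet):
    """Probability that a packet loses at least one of its cells.

    Assumes independent cell losses with probability cell_loss each.
    """
    return 1.0 - (1.0 - cell_loss) ** cells_per_packet


if __name__ == "__main__":
    for cells in (1, 4, 8, 32):
        print(cells, packet_loss_probability(0.06, cells))
```

Under this model the packet loss rate grows quickly with the number of cells per packet, which is why even a modest cell loss rate can waste a large share of the link on cells that belong to packets that will be discarded anyway.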
PVM performance
Using the pvm_send and pvm_recv routines, a ping-pong program was written to measure message round-trip times. We conducted tests with message sizes of 64, 128, 256, 512, 1024, 2048, 4096, and 8192 bytes, using the traditional as well as this ATM-based PVM system. Results of these measurements are plotted in Figure 9 and demonstrate that our implementation can offer significant performance gains, especially when the PVM messages are small.
Figure 9. Round-trip latencies of PVM applications using different transport mechanisms (horizontal axis: packet size in bytes).
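The structure of such a ping-pong measurement can be sketched as follows. This is a stand-in that uses a local socket pair and an echo thread in place of PVM, so its timings say nothing about the PVM or ATM results; the function names and the default round count are assumptions made for illustration.

```python
import socket
import threading
import time


def _recv_exact(sock, n):
    """Read exactly n bytes from a stream socket."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the connection")
        buf.extend(chunk)
    return bytes(buf)


def _echo(sock):
    """Echo everything received until the peer closes its end."""
    while True:
        data = sock.recv(65536)
        if not data:
            break
        sock.sendall(data)
    sock.close()


def ping_pong(sizes, rounds=100):
    """Return the mean round-trip time in seconds for each message size."""
    a, b = socket.socketpair()
    echo_thread = threading.Thread(target=_echo, args=(b,), daemon=True)
    echo_thread.start()
    results = {}
    try:
        for size in sizes:
            msg = bytes(size)
            start = time.perf_counter()
            for _ in range(rounds):
                a.sendall(msg)          # ping
                _recv_exact(a, size)    # pong
            results[size] = (time.perf_counter() - start) / rounds
    finally:
        a.close()
        echo_thread.join(timeout=1.0)
    return results


if __name__ == "__main__":
    for size, rtt in ping_pong([64, 128, 256, 512, 1024, 2048, 4096, 8192]).items():
        print(size, rtt)
```

Averaging over many rounds per size reduces the influence of timer resolution and scheduling noise on the small-message measurements.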
A scalable parallel molecular dynamics code has been compiled to run, without source code modification, on eight DEC Alphas interconnected via the DEC AN2 ATM switch. This model uses a mesoscopic approach between the strict continuum and atomistic levels. It is useful for quantifying processes that involve the manufacture of state-of-the-art microchips, where neither atomistic nor continuum approaches are applicable. Models of this sort require large (~1 megabyte) messages and can profit from the switched point-to-point network that ATM provides. Using this cluster environment, the speedup in calculation is achieved by a special PVM implementation that eliminates processing overhead. This, together with ATM's point-to-point switching technology, has more than doubled the efficiency of communication. However, since this application is coarse grained, there is no significant performance gain. We will continue to evaluate the performance characteristics of finer-grained parallel models.
Conclusions and Future Work
Our study has shown that ATM switches with per-VC buffer management and fair scheduling can provide optimal throughput, response time, and fairness for distributed computing. Because packets are fragmented for ATM adaptation, cell loss can be extremely damaging to TCP performance. Therefore, it is important that the switch has a functional flow-control mechanism. Our study demonstrated that the AN2 credit-based flow control can indeed provide a lossless environment.
While we have demonstrated in our tests that TCP works well with DEC's credit flow control, the processing overhead of the TCP/IP protocol stack introduces delays that are undesirable for distributed applications. We have reduced end-to-end latency by eliminating TCP/IP, since their function already exists in a flow-controlled ATM network. In addition, our modified OTTO driver allows PVM library routines to directly access kernel memory, thereby reducing the number of data copies necessary for a given message transfer. We compared the round-trip latencies measured both with and without the modifications using a simple ping-pong PVM application. Results of these measurements demonstrated that our implementation can offer significant performance gains, especially when the PVM messages are small. This approach opens up a broad range of possibilities in the area of parallel distributed computing, where the low cost/performance ratio could make this method competitive with today's massively parallel computers in many situations. Using this cost-effective cluster-computing environment, we will continue to evaluate the performance characteristics of a wide range of parallel models. In particular, we would like to apply these same optimizations to a cluster of Pentium-based personal computers with the FreeBSD Unix operating system and DEC's ATM PCI adapter cards.
