Abstract
Introduction
Recent advances in microprocessor performance and high-speed networks are making clusters an appealing vehicle for cost-effective parallel computing [1]. In particular, the emergence of chipset technology supporting symmetric multiprocessing (SMP) has made Standard High-Volume (SHV) servers, such as 4-way or 8-way x86-based SMP servers, attractive building blocks for high-speed computing. In addition, Gigabit Ethernet has recently become an ideal system area network (SAN) for SHV clusters because of its reliability, simplicity, and lower cost. With the support of such high-performance interconnection networks, multiple SHV servers can be connected to form a powerful supercomputing environment.
(The research was supported by Hong Kong Research Grants Council grant 10201701 and HKU CRGC grant 10203009.)
In the past, various fast messaging mechanisms for clusters have been proposed, such as AM [4].
For the rest of the paper, we first introduce the Directed Point communication model in Section 2. We then discuss the architecture of the DP-II communication subsystem in Section 3. The light-weight messaging techniques are discussed in Section 4. In Section 5, we describe the implementation details and the performance measurements using a 4-phase model. Finally, conclusions are given in Section 6.
Directed Point Abstraction Model
The communication traffic in a cluster is caused by inter-process communication among a group of cooperating processes, which reside on different nodes to solve a single task. Various communication patterns are used in algorithm design, such as point-to-point, pair-wise data exchange, broadcast tree, and total-exchange. A communication abstraction model can be used to describe the inter-process communication patterns during the algorithm design stage, as well as to guide the implementation of the primitive messaging operations or API of the underlying communication subsystem.
The Directed Point abstraction provides programmers with a virtual network topology among a group of communicating processes. The Directed Point abstraction model is based on a Directed Point graph (DPG). It allows users to statically depict the communication pattern and provides schemes to dynamically modify the pattern at execution time. All inter-process communication patterns can be described by a directed graph, where a directed edge connecting two endpoints represents a uni-directional communication channel between a source and a destination process. A formal definition of the DPG is given below:
Let DPG = (N, EP, NID, P, E), where N, EP, NID, P and E are:
N (Node set):
A subset of the integer set, representing the nodes in a cluster.
EP (Endpoint set):
A subset of the integer set, representing the endpoints of the directed edges.
P (Process set):
A subset of the power set of EP; each element of P represents all endpoints created by a communicating process in the cluster. For example, P_i represents all the endpoints created by process i. A process in a DPG is usually drawn as a circle, and each of its endpoints as a vertex inside the circle.
NID (Node Identification function):
NID is a function from P to N, giving the node in the cluster on which a process resides. For simplicity, we write NID(P_i) as NID_i. The restriction on NID is that for all distinct P_i, P_j in P: NID_i = NID_j implies P_i ∩ P_j = ∅. This property ensures that no two processes on the same node share the same endpoints.
E (Edge set):
A set of directed edges; each edge connects an endpoint of a source process to an endpoint of a destination process and represents a uni-directional communication channel.
The proposed model supports not only point-to-point communication but also other types of group operations. For example, an endpoint can be used as the root of a broadcast tree or as the destination point of a reduce operation. Below is a simple example to illustrate the usage of the DP abstraction model. Consider the DPG = (N, EP, NID, P, E) shown in Figure 1. From the function NID, we know that process 1 and process 2 are executed on node 1. There are four communication channels between these processes. For example, the channel <1,1> → <5,2> goes from endpoint 1 of process 1 to endpoint 5 of process 2. Endpoint 2 in P_1 is used to connect with P_2 and P_3.
Figure 1. A Simple Example of DP Graph
The DP graph provides a snapshot of the process-to-process communication.
The inter-process communication pattern can evolve by adding a new endpoint within a process, adding a new edge between two distinct endpoints in different processes, deleting an endpoint as well as the edges linked to it, or deleting an edge between different endpoints. With these operations, any run-time inter-process communication patterns can be modeled.
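To make the abstraction concrete, the following sketch shows one possible in-memory representation of a DP graph together with the edge operations described above. It is only an illustration of the model; the type and function names (dp_edge, dpg_add_edge, dpg_del_edge) are hypothetical and are not part of DP-II.

```c
/* A minimal sketch of a Directed Point graph (DPG) and two of its evolution
 * operations.  The names are hypothetical and only illustrate the model. */
#include <stdlib.h>

struct dp_edge {                  /* <src_ep, src_pid> -> <dst_ep, dst_pid> */
    int src_ep, src_pid;
    int dst_ep, dst_pid;
    struct dp_edge *next;
};

struct dp_graph {
    struct dp_edge *edges;        /* list of uni-directional channels       */
};

/* Add a directed edge (communication channel) between two endpoints. */
static int dpg_add_edge(struct dp_graph *g, int sep, int spid, int dep, int dpid)
{
    struct dp_edge *e = malloc(sizeof(*e));
    if (!e)
        return -1;
    e->src_ep = sep; e->src_pid = spid;
    e->dst_ep = dep; e->dst_pid = dpid;
    e->next = g->edges;
    g->edges = e;
    return 0;
}

/* Delete an edge; deleting an endpoint amounts to deleting every edge that
 * references it. */
static void dpg_del_edge(struct dp_graph *g, int sep, int spid, int dep, int dpid)
{
    struct dp_edge **p = &g->edges;
    while (*p) {
        struct dp_edge *e = *p;
        if (e->src_ep == sep && e->src_pid == spid &&
            e->dst_ep == dep && e->dst_pid == dpid) {
            *p = e->next;
            free(e);
            return;
        }
        p = &e->next;
    }
}
```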
DP-II Architecture
Based on the DP abstraction model, we design the DP-II communication subsystem. DP-II consists of three main layers: (1) the API Layer, (2) the Service Layer, and (3) the Network Interface Layer. Figure 2 shows an overview of the DP-II architecture.
Figure 2. The Architecture of DP-II
The API Layer implements the operations that users employ to program their communication code. To provide better programmability, the DP-II API preserves the syntax and semantics of the traditional UNIX I/O interface by associating each DP endpoint with a file descriptor, which is generated when the endpoint is created. All messaging operations for sending or receiving messages go through this file descriptor, and the communication endpoint is released by closing it. With the file descriptor, a process accesses the communication system via traditional I/O system calls. This style of interface is widely used in traditional UNIX I/O, such as the Socket interface, which reduces the burden of learning a new API.
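The file-descriptor style of the API can be illustrated with a short usage sketch. The paper does not list the exact DP-II calls, so dp_open() below is a hypothetical stand-in for endpoint creation; only the fd-based send/close pattern follows the description above.

```c
/* Usage sketch of the file-descriptor style interface described above.
 * dp_open() is a hypothetical stand-in for the DP-II call that creates an
 * endpoint and returns its file descriptor. */
#include <unistd.h>

extern int dp_open(int endpoint_id);      /* hypothetical endpoint creation */

int send_hello(void)
{
    char msg[] = "hello";

    int fd = dp_open(1);                  /* create endpoint 1, obtain an fd      */
    if (fd < 0)
        return -1;

    write(fd, msg, sizeof(msg));          /* send a message over the channel       */
    close(fd);                            /* closing the fd releases the endpoint  */
    return 0;
}
```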
The DP-II Service Layer realizes the DP abstraction model and is hardware independent. It is built from different components that provide services for passing messages from user space to the network hardware and for delivering incoming packets to the buffer of the receiving process.
The DP-II Network Interface Layer consists of network driver modules. Most of the driver modules are hardware dependent. Each of them is an individual kernel module that can be loaded into and unloaded from the system, and multiple network interfaces can be loaded at the same time. Currently, the network driver modules supported in DP-II include the Digital DEC 21140A Fast Ethernet, the Hamachi Gigabit Ethernet, and the FORE PCA-200E ATM. We have also developed a DP SHMEM module to support intra-node communication through shared memory. This modular design means that adding a new driver does not require recompiling the whole kernel source tree.
Light-Weight Messaging Techniques
DP-II is designed with the goals of achieving low communication latency and high bandwidth while minimizing resource usage. We propose three techniques, namely the directed message, the token buffer pool, and the light-weight messaging call. They reduce the protocol processing overhead, the network buffer management overhead, and the user-kernel space transition overhead.
In DP-II, we use the Hamachi Gigabit Ethernet NIC as the network interface. The Hamachi Gigabit Ethernet NIC uses a typical descriptor-based bus-master architecture [11]. It has two statically allocated, fixed-size descriptor rings, namely the transmit and receive descriptor rings. Figure 3 shows the messaging flow with respect to the different components in DP-II using such a descriptor-based network interface controller. The transmission unit of DP-II is called a Directed Message (DM). A DM packet consists of a header and a data portion called the container. The header is constructed at the DP service layer and consists of three fields: the target NID, the target DPID, and the length of the container. The simplicity of the DM packet requires only a very small packet processing time compared to other, more complex protocols. The NART (Network Address Resolution Table) is used for header construction during transmission.
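A DM header carrying the three fields above might be declared as in the following sketch. The field widths and ordering are assumptions made for illustration; the paper does not specify the on-wire layout.

```c
/* Sketch of a Directed Message (DM) header with the three fields described
 * above.  Field widths and ordering are assumptions for illustration only. */
#include <stdint.h>

struct dm_header {
    uint16_t target_nid;     /* destination node id                      */
    uint16_t target_dpid;    /* destination Directed Point (endpoint) id */
    uint32_t container_len;  /* length of the data portion in bytes      */
} __attribute__((packed));
```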
Buffer management affects communication performance. On the receive side, we maintain a token buffer pool (TBP). It is a fixed-size physical memory area dedicated to a single communication endpoint; it is allocated when the communication endpoint is opened and freed when the endpoint is closed. The unit of storage in the TBP is called a token buffer. It is a variable-length storage unit for storing an incoming DM packet, which reduces memory usage compared to fixed-length buffers. The TBP is directly accessible by both the kernel and the user process, so an incoming message can be used directly by the user program. When a packet arrives, an interrupt is raised by the network interface. The interrupt handler calls the MDR (Message Dispatcher Routine), which examines the packet header, locates a buffer in the TBP based on the information kept by DP-II, and copies the incoming message into the TBP. Since the TBP is accessible by both the kernel and the user process, no extra memory copy is needed to bring the message up to user space. DP-II allocates one TBP whenever a new DP endpoint is opened and requires no common dedicated system buffers for storing incoming messages. Thus, the memory resources of a server are efficiently utilized, and the amount of memory needed depends only on the number of endpoints created by the applications.
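The receive path described in this paragraph can be summarized by the following sketch. All names are hypothetical, and the real MDR additionally manages token buffer allocation and error cases; the sketch only shows the single-copy delivery into the TBP.

```c
/* Simplified sketch of the receive path: the Message Dispatcher Routine
 * copies an incoming DM packet straight into the endpoint's token buffer
 * pool.  All names are hypothetical illustrations. */
#include <stdint.h>
#include <stddef.h>
#include <string.h>

struct dp_endpoint {
    char   *tbp;        /* token buffer pool, mapped into kernel and user space */
    size_t  tbp_head;   /* next free offset inside the pool                     */
    size_t  tbp_size;   /* fixed pool size, set when the endpoint is opened     */
};

/* Hypothetical lookup of the endpoint that owns a given DPID. */
extern struct dp_endpoint *lookup_endpoint(uint16_t target_dpid);

/* Called from the NIC interrupt handler for every received DM packet. */
void mdr_dispatch(uint16_t target_dpid, const void *payload, uint32_t len)
{
    struct dp_endpoint *ep = lookup_endpoint(target_dpid);
    if (!ep || ep->tbp_head + len > ep->tbp_size)
        return;                      /* no such endpoint or pool is full: drop */

    /* Single copy into the TBP; because the pool is mapped into both kernel
     * and user space, the receiving process reads the message from the same
     * memory with no further copy. */
    memcpy(ep->tbp + ep->tbp_head, payload, len);
    ep->tbp_head += len;
}
```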
To reduce the overhead of crossing between user and kernel space, both send and receive operations in DP-II use light-weight messaging calls (LMC). An LMC provides a fast switch from user space to kernel space. It is implemented using the Intel x86 call gate. The use of LMC eliminates the cost of possible process rescheduling, context switching, and bottom-half processing that follow the return from a regular system call.
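Conceptually, an LMC is entered with a far call through an x86 call gate rather than a normal system-call trap. The following 32-bit sketch shows only the user-side invocation under that assumption; the gate selector value, the kernel-side gate descriptor setup, and argument passing are omitted, and none of the names correspond to the actual DP-II code.

```c
/* Conceptual 32-bit sketch of entering the kernel through an x86 call gate,
 * the mechanism behind the light-weight messaging call (LMC).  This is an
 * illustration, not the actual DP-II implementation. */
struct far_ptr {
    unsigned long  offset;      /* ignored: the gate supplies the entry point */
    unsigned short selector;    /* selector of the installed call gate        */
} __attribute__((packed));

static inline void lmc_invoke(unsigned short gate_selector)
{
    struct far_ptr target = { 0, gate_selector };

    /* A far call through the gate switches to ring 0 directly, avoiding the
     * rescheduling and bottom-half work of the normal system-call exit path. */
    __asm__ __volatile__("lcall *%0" : : "m"(target) : "memory");
}
```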
Performance Analysis

DP-II has been implemented to connect four Dell PowerEdge 6300 SMP servers, using Packet Engines' G-NIC II Gigabit Ethernet adapters. Each server has four Pentium III Xeon processors sharing 1 GB of memory and an 18 GB hard disk. Each processor has 512 KB of L2 cache and operates at 500 MHz. All servers run the Linux 2.2.5 kernel. Two G-NIC II Gigabit Ethernet adapters are used in each server to connect to the PowerRail 2200 Gigabit Ethernet switch for fault tolerance. Each server also has one Fast Ethernet connection to the campus LAN for external access. The PowerRail 2200 switch achieves a backplane capacity of 22 Gbps.
Latency and Bandwidth Tests
We evaluated the single-trip latency of the communication system for various message sizes. In all benchmark routines, the source and destination buffers were page-aligned for steady performance. The benchmark routines used the hardware time-stamp counter of the Intel processor, with a resolution within 100 ns, to time the operations. The round-trip latency test measured the ping-pong time between two communicating processes over two hundred iterations. The first and last 10% (in terms of execution time) were discarded; only the middle 80% of the timings was used to calculate the average. The single-trip latency is defined as the average round-trip time divided by 2.
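A sketch of such a ping-pong measurement using the processor's time-stamp counter is shown below. dp_send()/dp_recv() are hypothetical placeholders for the actual messaging calls, and the trimming of the first and last 10% of the timings is omitted for brevity.

```c
/* Ping-pong latency sketch using the x86 time-stamp counter (rdtsc).
 * dp_send()/dp_recv() are hypothetical stand-ins for the messaging calls. */
#include <stdint.h>

static inline uint64_t rdtsc(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

extern void dp_send(int fd, const void *buf, int len);  /* hypothetical */
extern void dp_recv(int fd, void *buf, int len);        /* hypothetical */

/* Returns the average single-trip latency in microseconds. */
double single_trip_us(int fd, void *buf, int len, double cpu_mhz)
{
    enum { ITER = 200 };

    uint64_t start = rdtsc();
    for (int i = 0; i < ITER; i++) {
        dp_send(fd, buf, len);           /* ping ...                          */
        dp_recv(fd, buf, len);           /* ... pong from the remote process  */
    }
    uint64_t cycles = rdtsc() - start;

    /* round-trip time / 2 = single-trip latency */
    return (double)cycles / cpu_mhz / ITER / 2.0;
}
```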
The bandwidth test measured the time to transmit 4 MB of data from one process to another, plus the time for the receiving process to send back a 4-byte acknowledgement. The measured time was then reduced by the single-trip latency of a 4-byte message, and the bandwidth was calculated as the number of bytes transferred divided by the resulting time. Figure 4 shows the latency results. DP-II achieves a single-trip latency of 18.35 µs for sending a 1-byte message over a back-to-back Gigabit Ethernet connection. The switch adds at least an extra 21 µs of delay, as it performs store-and-forward data transmission. Figure 5 shows the bandwidth results of TCP/IP and DP-II with a back-to-back connection. DP-II achieves a sustained bandwidth of 72.8 MB/s at a message size of 1504 bytes, while TCP/IP achieves only 36 MB/s.
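The bandwidth computation described above can be written out explicitly, where T_measured is the total measured time including the acknowledgement and L_4B is the single-trip latency of a 4-byte message:

```latex
% Bandwidth of the 4 MB transfer, after removing the acknowledgement latency.
\[
  \mathrm{Bandwidth} \;=\; \frac{4\,\mathrm{MB}}{T_{\mathrm{measured}} - L_{4\mathrm{B}}}
\]
```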
Performance Breakdowns
To help understand the performance results, we examine the communication cost of a single-trip data transfer using a 4-phase model. The 4-phase model consists of the following parameters:
l: The length of the message.
L_t: The single-trip communication latency.
T_startup: The start-up time of a send operation. It includes the time for the API wrapper, the time to switch from user space to kernel space, and the time to prepare the frame header.
T_send: The time spent in DP-II I/O operations (IOR) on the sender's side to copy the user-space buffer to the NIC's DMA buffer and to set up its descriptor for the NIC.
T_net: The network delay and the OS overhead. The network delay includes the time to copy the data from host memory to the NIC on the sender's side and from the NIC to host memory on the receiver's side. Generally, different parts of the network delay may overlap. The OS overhead includes the execution time of the interrupt handler in the OS kernel.
T_delivery: The message delivery time. It is the time to deliver an incoming message to the destination memory of the receiving process, which is mainly the execution time of the Message Dispatcher Routine.
Thus, to transmit a message of size l, the single-trip latency can be expressed by the following equation:
L_t(l) = T_startup + T_send(l) + T_net(l) + T_delivery(l)
Figure 6 shows the latency breakdown for sending a 1-byte message. Performance breakdowns on various x86-based PCs connected by 32-bit PCI Fast Ethernet NICs are reported for comparison. For all test cases, DP-II shows small overheads in handling the communication protocol and in the delay incurred in starting up the PCI bus and the NIC. On Gigabit Ethernet, the T_startup, T_send, T_net, and T_delivery times for transmitting a 1-byte message are 0.44, 0.7, 16.82, and 0.36 µs respectively. All machines achieved nearly the same network delay (T_net). The G-NIC II network interface did not incur a long delay despite its more complex hardware.
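Substituting the measured 1-byte components into the model gives a quick consistency check against the measured single-trip latency:

```latex
% 1-byte breakdown substituted into the 4-phase model (times in microseconds):
\[
  L_t(1) \;\approx\; T_{startup} + T_{send} + T_{net} + T_{delivery}
         \;=\; 0.44 + 0.7 + 16.82 + 0.36 \;=\; 18.32\ \mu\mathrm{s},
\]
% consistent with the measured 18.35 us single-trip latency.
```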
The K6-2 shows the largest delay in T_net, which could be caused by its special Socket-7 motherboard architecture. For the startup, send, and delivery phases, a faster CPU always achieves shorter latency in handling the communication protocol. The K6-2, with its larger L1 cache and 1 MB on-board L2 cache, can handle protocol execution faster than the other Intel x86-based PCs on Fast Ethernet. Overall, the faster 500 MHz Pentium III Xeon processors, the efficient PCI bus design, and the faster system bus of the PowerEdge help achieve much smaller overheads in the startup, send, and delivery phases.
For DP-II with a back-to-back Gigabit Ethernet connection, the measured T_startup, T_send, T_net, and T_delivery times for transmitting a 1500-byte message are 0.44, 5.23, 69.96, and 12.21 µs respectively. T_startup, T_send, and T_delivery involve the host processor; together they contribute 20.3% of the total messaging time for sending 1504 bytes. The major delay is still contributed by the host PCI bus and the Hamachi NIC. Even with the 64-bit 33 MHz PCI bus in the server, the PCI bus speed with its overhead appears slower than the full-duplex Gigabit Ethernet line rate.
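The quoted host-processor share follows directly from these measured components:

```latex
% Host-side share of the messaging time for the 1500-byte case (microseconds):
\[
  \frac{T_{startup} + T_{send} + T_{delivery}}{L_t}
  \;=\; \frac{0.44 + 5.23 + 12.21}{0.44 + 5.23 + 69.96 + 12.21}
  \;=\; \frac{17.88}{87.84} \;\approx\; 20.3\%
\]
```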
Conclusions
In this paper, we presented DP-II, a high-performance communication subsystem on Gigabit Ethernet based on the Directed Point model. We emphasize both high-performance communication and good programmability. With the performance breakdowns, we have shown that DP-II greatly reduces the software overheads. Our light-weight messaging mechanisms reduce CPU involvement while performing data communication on an SHV server. However, while the Gigabit network medium is able to transfer data with low latency and high bandwidth, the network delay (T_net) still contributes the major portion of the communication time in sending both short and long messages. We conclude that the current bottleneck in Gigabit Ethernet networking is the interface between the CPU and the NIC. The move from a 100 MHz PC system bus to a higher clock rate bus, as well as the move from a 64-bit 33 MHz PCI bus to a 64-bit 66 MHz PCI interface, could greatly improve communication performance in the future.
