ABSTRACT
Virtual Interface Architecture (VIA) is a light-weight protocol for protected user-level zero-copy communication. In spite of the promised high performance of VIA, previous MPI implementations for GigaNet's cLAN revealed low communication performance. The two main sources of this low performance are the discrepancy in the communication model between MPI and VIA and the multi-threading overhead. In this paper, we propose a new implementation of the Bulk Synchronous Parallel (BSP) programming library for VIA, called xBSP, to overcome these problems. To the best of our knowledge, xBSP is the first implementation of the BSP library for VIA. xBSP demonstrates that the selection of a proper library is important to exploit the features of light-weight protocols. Intensive use of Remote Direct Memory Access (RDMA) operations leads to high performance close to the native VIA performance with respect to round trip delay and bandwidth. By considering the effects of multi-threading, memory registration, and completion policy on performance, we obtained an efficient BSP implementation for cLAN, which was confirmed by experimental results.
Introduction
Even though the peak bandwidth of networks has increased rapidly over the years, the latency experienced by applications using these networks has decreased only modestly. The main reason for this disappointing performance is the high software overhead [1, 2, 3], which results mainly from context switching and data copying between the user and kernel spaces. To overcome these problems, many lightweight protocols have been proposed that move the protocol stack from the kernel to the user space [4, 5, 6, 7, 8, 9, 10].
One of these protocols is the Virtual Interface Architecture (VIA) [6], which was jointly proposed by Intel, Compaq, and Microsoft. The VIA specification describes a network architecture for protected user-level zero-copy communication. For application developers, VIA provides an interface called the Virtual Interface Provider Layer (VIPL).
Even though the VIPL can be used directly to develop applications, it is desirable to build popular programming libraries such as PVM [11], MPI [12], and BSPlib [13] on top of it for portability of programs. Two previous works, for example, are the MPI implementations for cLAN by MPI Software Technology (MPI/Pro) [14] and by Rice University [15]. Parallel programming libraries based on other communication protocols can be found in [16, 17, 18]. The authors of [14] described many implementation issues such as threading, long messages, and asynchronous incoming messages. In particular, they paid attention to the pre-posting constraint of VIA in implementing the asynchronous operations of MPI. The zero-copy strategy of VIA requires that the receiver be ready before the sender initiates its operation, which defines the pre-posting constraint. The results of these studies, however, are somewhat disappointing. Even though the half round trip time (RTT) of cLAN using VIPL is 8.21 µs in our system, that of MPI/Pro is more than five times longer. Furthermore, MPI/Pro achieved only 81.7 percent of the peak bandwidth of VIPL. This means that the MPI library could not be efficiently integrated with VIA.
There are two main causes for such low performance. The primary one is the discrepancy in the communication model between MPI and VIA. VIA does not assume any intermediate buffers because of its zero-copy policy, while various asynchronous operations of MPI require receiving queues. Therefore, the authors suggested the use of "unexpected queues" on the receiver side to handle asynchronous incoming messages. As a result, the implementation incurs more than one copy on the receiver side and requires flow control for the queue. Moreover, they did not use the Remote Direct Memory Access (RDMA) operation for small messages, because only large messages can amortize the overhead of exchanging the addresses of RDMA buffers. The second cause is the overhead due to multi-threading. Although delegating message handling to a thread separate from the computation thread seems a good way to structure the implementation, it suffers significant overhead from thread switching. The overhead due to multi-threading in our system is over ten microseconds, which is comparable to the round trip delay at the application level. This means that the multi-threading overhead negates the gain obtained by reducing the latency at the hardware level.
These two problems motivate us to implement another VIA-based parallel library. In this paper, we implement the BSPlib standard of the Bulk Synchronous Parallel (BSP) programming library. The BSP model [19] was first proposed as a computing model to bridge the gap between software and hardware for parallel processing. Afterwards, it became a viable programming model with BSPlib. The performance of the BSPlib library was shown to be better than that of MPICH with respect to throughput and predictability [20], which means that BSPlib is useful not only theoretically but also practically. Moreover, the study on BSP clusters [21] has demonstrated that the BSPlib library can be accelerated by rewriting the Fast Ethernet device driver so that it is optimized for the BSPlib operations. One of the main lessons of that study was that optimization with global knowledge about the transport layer and the parallel library promises higher performance. This perspective is also applicable to implementing parallel libraries using light-weight protocols. Indeed, BSPlib strongly resembles VIA operationally in memory registration, message passing communication, and direct remote memory access.
Our new implementation of BSPlib for cLAN is called express BSP (xBSP). To the best of our knowledge, xBSP is the first implementation of BSPlib for VIA. xBSP demonstrates that selecting a proper library is important in exploiting the features of light-weight protocols. Furthermore, we achieved performance close to that of the native VIPL by reducing the overheads due to multi-threading, memory registration, and flow control. xBSP also supports reliable communication by using the reliable delivery mode of VIA.
In the following two sections, we address key features of VIA required to implement the BSPlib library and discuss how well the library is matched with VIA. After that, we present experimental implementation alternatives to achieve the full performance of VIA. In sections 4 and 5, several benchmarks demonstrate the efficiency of xBSP, and we conclude our discussion in section 6.
VIA Features
In this section, we discuss VIA features that should be carefully considered for efficient implementation of BSPlib. They concern memory registration, communication mode, and descriptor processing.
Memory Registration
Communication buffers in the user space must be registered in order to eliminate data copying between the user space and the kernel space and to provide memory protection. The memory registration cost, however, is not negligible. For example, the Windows NT system experienced over 15 µs latency for messages smaller than 16 Kbytes [15], while the overhead in our Linux system ranged from 3 to 5 µs as shown in Table 1. Considering communication delay and copying overhead, it is important to reduce the registration overhead, especially for small messages.
Communication Mode
After communication buffers are registered, processes can transfer data between the registered buffers. VIA supports two communication modes. One is the traditional message passing mode, in which both the sender and the receiver participate in communication, satisfying the pre-posting constraint. The other is the one-sided RDMA mode. The procedure of the RDMA write operation is illustrated in Fig. 1. First, both processes register their buffers with their VIA device drivers, and process B informs process A of the address of its buffer by explicit message passing to avoid a protection violation. After that, process A initiates the operation by posting descriptors, and its device driver moves data from the user buffer to the network through DMA. When packets arrive at the target machine, the device driver there moves the data in the reverse direction, from the network into the registered buffer.
This RDMA operation has several advantages. First, it avoids the descriptor processing overhead in the target process, since it does not require any descriptor in the target process except when the initiator uses the immediate data field of the descriptor. Second, since only the VI-NIC of the target machine is involved in communication, the target process can continue without interruption. Finally, the initiator does not have to worry about flow control for the resources of the target machine. Therefore, we prefer the RDMA mode to the message passing mode.
Descriptor Processing Mode
When there are multiple VI-connections to a process, mechanisms like select() in the socket interface are needed. We can implement such mechanisms using the Completion Queue. Notifications of descriptor completion from multiple Receive Queues are directed to a single Completion Queue.
The Completion Queue can be managed by a dedicated communication thread or by the user thread itself. Dedicating a thread to managing the Completion Queue prevents the interruption of user threads in a clustered SMP environment, but it introduces the extra latency of thread switching. Alternatively, the user thread can receive messages directly, which avoids this multi-threading overhead at the expense of CPU time. Since we aim at low latency communication, the user thread itself manages the Completion Queue in xBSP.
BSPlib Implementation
Based on the previous discussion, we explain in this section how well the BSPlib library is matched with VIA and how the library is realized.
BSP-Registration
In a BSP program, a user can access data in a remote memory after registering a memory block with bsp_push_reg(void* ident, int nbytes). The registrations within a superstep take effect after the subsequent barrier synchronization, which is invoked by bsp_sync().
In the Oxford implementation [13] , each node keeps track of the sequence of registrations and maintains a mapping table between the unique block number and the associated local address: it does not require any explicit message exchange. When a process initiates a one-sided operation with this block number, the target process translates the number into its local address for the block. The main objective of this mechanism is to reduce unnecessary network traffic in the registration step. This low-cost dynamic registration is beneficial to implementing user-level libraries and applications with recursion.
Since the registration typically appears at the beginning of a program and rarely afterwards, it may be preferable to speed up ordinary communication operations at the expense of the registration. As discussed in section 2.2, the initiator of an RDMA operation must know the address of the remote buffer. In xBSP, each node registers its local buffer with the VI-NIC in bsp_push_reg() and exchanges the address in the barrier synchronization step. At the end of the synchronization, each node builds a mapping table between the local address and the corresponding remote addresses. Since each node knows the actual address of the global memory block, it can transfer data to the remote buffer directly using the RDMA operation, unlike the Oxford implementation.
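The mapping table built at the end of the synchronization can be pictured with a small sketch. The following is only an illustration under stated assumptions, not the actual xBSP data structure: the names xbsp_reg_table, xbsp_reg_add, and xbsp_remote_addr, as well as the two capacity constants, are hypothetical, and the caller is assumed to start from a zero-initialized table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define XBSP_MAX_NODES 8   /* assumption: sized for an eight-node cluster */
#define XBSP_MAX_REGS  16  /* assumption: arbitrary table capacity */

/* One registered block: the local address plus the address that the same
 * block has on every node, as learned during barrier synchronization. */
typedef struct {
    const void *local;                   /* address passed to bsp_push_reg() */
    uintptr_t   remote[XBSP_MAX_NODES];  /* per-node address of the block */
} xbsp_reg_entry;

typedef struct {
    xbsp_reg_entry entries[XBSP_MAX_REGS];
    int count;
} xbsp_reg_table;

/* Record a block once the addresses have been exchanged. Returns the slot
 * index, or -1 if the table is full. */
static int xbsp_reg_add(xbsp_reg_table *t, const void *local,
                        const uintptr_t *remote, int nprocs)
{
    assert(nprocs > 0 && nprocs <= XBSP_MAX_NODES);
    if (t->count == XBSP_MAX_REGS)
        return -1;
    xbsp_reg_entry *e = &t->entries[t->count];
    e->local = local;
    for (int i = 0; i < nprocs; i++)
        e->remote[i] = remote[i];
    return t->count++;
}

/* Translate (local dst, pid, offset) into the remote target address of an
 * RDMA write. Returns 0 if dst was never registered. */
static uintptr_t xbsp_remote_addr(const xbsp_reg_table *t, const void *dst,
                                  int pid, size_t offset)
{
    assert(pid >= 0 && pid < XBSP_MAX_NODES);
    for (int i = 0; i < t->count; i++)
        if (t->entries[i].local == dst)
            return t->entries[i].remote[pid] + offset;
    return 0;
}
```

With such a table, the initiator resolves the target address locally, so no translation step is needed on the target side when the data arrives.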
One-Sided Operation
A process can initiate a one-sided operation on the registered memory block. For example, bsp_hpput(int pid, void* src, void* dst, int offset, int nbytes) writes nbytes of data from the src buffer to the dst+offset address in the pid node; the written data is valid in the next superstep. The bsp_hpput() function maps exactly onto the RDMA write operation. As the initiator has the address information of the dst buffer after the registration step, it can transfer data to the dst buffer directly. The target process does not have to deal with flow control, descriptor posting, or incoming message handling. Furthermore, it is free from multi-threading overhead.
Consequently, the bsp_hpput() function can achieve delay and bandwidth close to those of VIPL.
One problem related to the RDMA write operation is how the target process learns of the arrival of a message. There are two possible solutions to this problem. One is to enforce the use of a descriptor that signals the end of a message (EOM). An RDMA write operation consumes a descriptor in the Receive Queue only when there is immediate data in the source descriptor. Hence, we can use this feature to mark the end of an RDMA message. When a message consists of n packets, the sender transfers n-1 packets and finishes the nth packet transfer with the EOM tag, while the receiver checks whether a descriptor is consumed and the returned value is EOM. This approach requires one descriptor per message, while traditional message passing requires n descriptors. The other approach is to send an additional control message to mark the end of a message. Even though this approach has more overhead than the first, it is preferable in the case of BSPlib. As the messages transferred in a superstep become available in the next superstep, there is no need to handle incoming messages immediately. Since cLAN supports reliable in-order delivery, the arrival of a packet implies the successful arrival of the preceding packets. Therefore, a series of EOM control messages in a superstep can be replaced by the last EOM control message, and that EOM message can be piggybacked on the barrier synchronization packet. In short, barrier synchronization can be used implicitly in place of EOM control messages to mark the end of transfers.
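The first approach, tagging only the last packet, can be sketched as follows. This is a minimal illustration that does not call VIPL: the rdma_packet structure, the EOM_TAG value, and both function names are hypothetical, and the maximum payload per packet is left as a parameter.

```c
#include <assert.h>
#include <stddef.h>

#define EOM_TAG 1u  /* hypothetical immediate-data value marking end of message */

typedef struct {
    size_t   offset;        /* byte offset of this packet within the message */
    size_t   length;        /* payload bytes carried by this packet */
    int      has_immediate; /* nonzero: carries immediate data, so it consumes
                               one descriptor in the target's Receive Queue */
    unsigned immediate;     /* EOM_TAG on the final packet, 0 otherwise */
} rdma_packet;

/* Split a message of nbytes into packets of at most max_payload bytes and
 * tag only the last one with EOM. Returns the number of packets written,
 * or 0 if the message is empty or out[] is too small. */
static size_t plan_rdma_message(size_t nbytes, size_t max_payload,
                                rdma_packet *out, size_t out_cap)
{
    assert(max_payload > 0);
    size_t n = (nbytes + max_payload - 1) / max_payload;
    if (n == 0 || n > out_cap)
        return 0;
    for (size_t i = 0; i < n; i++) {
        size_t off = i * max_payload;
        size_t left = nbytes - off;
        out[i].offset = off;
        out[i].length = left < max_payload ? left : max_payload;
        out[i].has_immediate = (i == n - 1);
        out[i].immediate = (i == n - 1) ? EOM_TAG : 0u;
    }
    return n;
}

/* Receive descriptors consumed at the target for one planned message. */
static size_t descriptors_consumed(const rdma_packet *pkts, size_t n)
{
    size_t used = 0;
    for (size_t i = 0; i < n; i++)
        used += pkts[i].has_immediate ? 1 : 0;
    return used;
}
```

For a message split into three packets, descriptors_consumed() reports one descriptor, against the three that message passing would need for the same transfer.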
Other Issues
The accumulated start-up costs of communication are significant if many small messages are outstanding on the network. This problem has already been discussed in other studies [20, 22] and can be overcome with a combining scheme. xBSP also combines small messages into a temporary buffer, since the copying overhead of small messages is smaller than the memory registration cost of VIA. This combining method increases the communication bandwidth at the cost of a small increase in round trip time. In addition, reordering messages helps to avoid serialization of message delivery [22], and we use a Latin square indexing order to schedule the destinations of messages. A Latin square is a p × p square in which all rows and columns are permutations of the integers 1 to p. In comparison, naive ordering distributes messages in the fixed index order implied by a loop such as for (j = 0; j < p; j++). As presented in Table 2, the reordering affects the performance for large messages; the speed-up factor increases with the message size. This result indicates that poor destination scheduling can decrease the performance of total exchange significantly.
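The difference between the two orderings can be made concrete with a short sketch. The text does not say which Latin square xBSP uses, so the cyclic square (pid + round) mod p below is an assumption chosen for simplicity; it numbers processes from 0 to p-1 rather than 1 to p, and all function names are hypothetical.

```c
#include <assert.h>

/* Destination for process `pid` in round `round` of a total exchange among
 * p processes, using the cyclic Latin square (pid + round) mod p. */
static int latin_dest(int pid, int round, int p)
{
    assert(p > 0 && pid >= 0 && pid < p && round >= 0 && round < p);
    return (pid + round) % p;
}

/* Naive order: every process walks destinations 0, 1, ..., p-1. */
static int naive_dest(int pid, int round, int p)
{
    (void)pid;
    (void)p;
    return round;
}

/* Largest number of senders that target the same node in one round. */
static int max_contention(int (*dest)(int, int, int), int round, int p)
{
    int hits[64] = {0};
    int worst = 0;
    assert(p > 0 && p <= 64);
    for (int pid = 0; pid < p; pid++) {
        int d = dest(pid, round, p);
        if (++hits[d] > worst)
            worst = hits[d];
    }
    return worst;
}
```

Under the naive order all p senders target the same node in every round, whereas the Latin square order gives each node exactly one incoming sender per round, which is the serialization the reordering is meant to avoid.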
Micro-benchmark Experiments
In this section, we demonstrate that BSPlib could be efficiently implemented on VIA through the experimental results with two micro-benchmarks: half round trip time and bandwidth. Our Linux cluster consists of eight nodes connected by an 8-port cLAN switch. Each node has dual Pentium III 550MHz processors with 256-Mbyte SDRAM and runs Redhat Linux 6.2 SMP version.
Preliminary Experiments
We tested a few implementation alternatives to achieve the full performance of VIA and observed the effects of completion policies and threading on the round trip delay.
Fig. 2. Effects of threading and completion policy
With polling, each process repeatedly checks whether the transaction has completed, while with blocking it waits for the completion of the transaction. In the threaded version, a communication thread is dedicated to receiving incoming messages while a user thread continues its computation. Fig. 2 shows that the single-threaded version using polling achieves a significant reduction of delay. However, it is wasteful to dedicate all of the CPU resources to polling, especially in the case of long message transfers. A tradeoff can be made by mixing both schemes: xBSP polls for a certain number of iterations, anticipating the completion of short message transfers, and then blocks. Based on these experiments, we chose the single-threaded version using the mixed policy.
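The mixed policy amounts to a bounded spin followed by a blocking wait. The sketch below is a minimal illustration, assuming the completion check and the blocking wait are supplied as callbacks; it does not use VIPL calls, and wait_mixed, the callback types, and the fake_transfer stand-in are all hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

typedef bool (*poll_fn)(void *ctx);   /* non-blocking: true once completed */
typedef void (*block_fn)(void *ctx);  /* returns only after completion */

typedef enum { DONE_BY_POLLING, DONE_BY_BLOCKING } completion_path;

/* Poll up to spin_limit times, then fall back to blocking. */
static completion_path wait_mixed(poll_fn poll, block_fn block, void *ctx,
                                  int spin_limit)
{
    assert(poll && block && spin_limit >= 0);
    for (int i = 0; i < spin_limit; i++)
        if (poll(ctx))
            return DONE_BY_POLLING;
    block(ctx);
    return DONE_BY_BLOCKING;
}

/* Stand-in for a transfer that completes after a given number of polls. */
typedef struct {
    int polls_until_done;
    int polls_seen;
} fake_transfer;

static bool fake_poll(void *ctx)
{
    fake_transfer *t = (fake_transfer *)ctx;
    return ++t->polls_seen >= t->polls_until_done;
}

static void fake_block(void *ctx)
{
    fake_transfer *t = (fake_transfer *)ctx;
    t->polls_seen = t->polls_until_done;  /* pretend we slept until done */
}
```

A short transfer finishes inside the spin loop and never pays the blocking cost, while a long transfer gives up the CPU after spin_limit checks. The right value of spin_limit depends on the system and is not given in the text.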
Half Round Trip Time and Bandwidth
With micro-benchmarks, we measured the half RTT and the bandwidth. To measure the half RTT, two processes send equal amounts of data back and forth repeatedly. We vary the message size from 4 bytes to 64 Kbytes and take the average over 1000 runs. The bandwidth is computed from the measured time to transfer 1 Mbyte of data at each message size.
The baseline is the performance of xBSP using the traditional message passing with a single thread. We change the communication mode of VIA from the message passing to RDMA and compare xBSP to VIPL and MPI/Pro. These benchmarks use the following configurations:
• VIPL-MP: VIPL using message passing (polling)
• VIPL-RDMA: VIPL using RDMA (polling)
• xBSP-MP: xBSP using message passing (mixed)
• xBSP-RDMA: xBSP using RDMA (mixed)
• MPI/Pro: MPI of MPI Software Technology
Fig. 3 and Fig. 4 show the experimental results of the round trip delay and bandwidth for the various configurations. For comparison, the results of MPI/Pro [14] are also presented. The VIPL versions show the minimum application-level latency, since they do not include any supplementary communication work such as registration and they use polling as the completion policy.
Comparing the two VIPL versions, we can estimate the overhead due to the pre-posting constraint, which includes descriptor posting and flow control. Even though the performance gap is not significant, the RDMA version consistently outperforms the MP version, and the experiments with xBSP show similar results.
According to Fig. 3, xBSP-RDMA is two times slower than VIPL-RDMA with 4-byte packets. The extra latency of xBSP-RDMA results mainly from the copying overhead of message combining and the blocking overhead of the mixed completion policy. In contrast, MPI/Pro is 8.8 times slower than VIPL-RDMA: on average, the latency of xBSP is at most half that of MPI/Pro for small messages. In terms of peak bandwidth, xBSP-RDMA achieves about 94% of the VIPL bandwidth while MPI/Pro achieves only 82%. Consequently, these results demonstrate that xBSP exploits VIA features more effectively than MPI/Pro.
Benchmark Experiments
Even though micro-benchmarks can be used for measuring the basic link properties, high performance of micro-benchmarks does not ensure the same performance benefit in real applications. To rigorously evaluate the performance, we measure the BSP cost parameters and then the execution times of several real applications.
BSP Cost Model
The BSP model simplifies a parallel machine into three components, a set of processors, an interconnection network, and a barrier synchronizer, which are parameterized as {p, g, l}. Parameter p represents the number of processors in the cluster, parameter g the gap between consecutive message sending operations, and parameter l the barrier synchronization latency. A BSP program consists of a sequence of supersteps separated by barrier synchronizations. In every superstep, each process performs local computation or exchanges messages, which become available in the next superstep. Hence, the execution time for superstep i is modeled by $w_i + g h_i + l$, where $w_i$ is the longest duration of local computation in the ith superstep and $h_i$ is the largest amount of packets exchanged by a process during this superstep.
In Table 3, the cost parameters of xBSP and the Oxford BSPlib implementation using UDP/IP for Fast Ethernet are compared. These parameters serve as a measure of the entire system under some non-trivial workload. The s parameter represents the instruction execution rate of each processor, taken from the average execution time of matrix multiplication and dot products. The minimum L value is taken as the average latency of a long sequence of bsp_sync() calls, while the maximum value is taken as the average latency of a long sequence of pairs of bsp_hpput() and bsp_sync() with a one-word message. The g parameter is a measure of the global network bandwidth, not the point-to-point bandwidth: a smaller g value means higher global bandwidth. With the shift communication pattern, each process sends data to its neighbor, and with total exchange, it sends data to every other process.
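As a worked instance of the cost formula, the following sketch sums the modeled time over the supersteps of a program. It is only an illustration: the type and function names are hypothetical, and w, g, h, and l are assumed to be expressed in mutually consistent units.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    double w;  /* longest local computation in the superstep */
    double h;  /* largest amount of data exchanged by any process */
} superstep;

/* Modeled time of one superstep: w + g*h + l. */
static double superstep_cost(superstep s, double g, double l)
{
    assert(g >= 0.0 && l >= 0.0);
    return s.w + g * s.h + l;
}

/* Modeled time of a whole program: the sum over its supersteps. */
static double program_cost(const superstep *steps, size_t n,
                           double g, double l)
{
    double total = 0.0;
    for (size_t i = 0; i < n; i++)
        total += superstep_cost(steps[i], g, l);
    return total;
}
```

With made-up values g = 2 and l = 5, two supersteps with (w, h) = (100, 10) and (50, 20) cost 100 + 20 + 5 = 125 and 50 + 40 + 5 = 95, for a total of 220. The barrier latency l is paid once per superstep, which is why a lower l matters most for programs with many short supersteps.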
xBSP-RDMA shows much lower synchronization latency and higher bandwidth (a shorter time interval between messages) than the others. xBSP-RDMA achieves a constant global bandwidth of about 381 Mbps and xBSP-MP achieves about 291 Mbps, while the performance of BSPlib-UDP/IP decreases when the number of nodes exceeds four. Thus xBSP shows good scalability, and the RDMA operations are well matched with the BSPlib interfaces.
Applications
In this section, we compare the BSPlib libraries with the following two applications.
• ES: application to solve a grid problem with a 300 × 300 matrix [23]
• LU: application to solve a linear equation using LU decomposition [24]
The execution time of the grid solver is presented in Fig. 5. The values above the bars represent the ratio of the sum of communication and synchronization times relative to xBSP-RDMA. In the grid solver program, each process exchanges data with its neighbors: the communication pattern is similar to the shift communication. Since ES spends most of its time (about 5.9 sec) in computation in the case of two nodes, the performance gap between xBSP-RDMA and xBSP-MP is not so great. In contrast, since the packet size transferred in a superstep is 2400 bytes, the
Conclusions
In this paper, we presented an efficient implementation of BSPlib for VIA called xBSP. xBSP demonstrates that BSPlib is more appropriate than MPI for exploiting the features of VIA. Furthermore, we achieved application performance similar to the native performance of VIPL by reducing the overheads associated with multi-threading, memory registration, and flow control.
Even though we paid attention only to implementing BSPlib, there are many possibilities for improving performance by relaxing the BSPlib semantics. In particular, we should reduce barrier synchronization costs by adopting mechanisms such as relaxed barrier synchronization [25] and zero-cost synchronization [26]. Currently, we are building a programming environment based on xBSP-RDMA for heterogeneous cluster systems that adopts a dynamic load balancing scheme.
