Abstract. This paper explores the computation and communication overlap capabilities enabled by the new CORE-Direct hardware capabilities introduced in the InfiniBand Network Interface Card (NIC) ConnectX-2. We use the latency-dominated nonblocking barrier algorithm in this study, and find that at a process count of 64, a contiguous time slot of about 80% of the nonblocking barrier time is available for computation. This time slot increases as the number of participating processes increases. In contrast, Central Processing Unit (CPU) based implementations provide a time slot of up to 30% of the nonblocking barrier time. This bodes well for the scalability of simulations employing offloaded collective operations. These capabilities can be used to reduce the effects of system noise, and, when using nonblocking collective operations, may also be used to hide the effects of application load imbalance.
Introduction
CPU clock speeds have remained essentially constant over the last several years. To keep up with the performance boosts expected by Moore's law, the number of CPU cores used in high-end systems is rapidly increasing. System size on the Top500 list [2] has changed rapidly, and in November 2009 the top ten systems averaged 134,893 cores, with five systems larger than 100,000 cores. This rapid increase in core count, and the associated increase in the number of compute threads used in a single job, increases the urgency of dealing with system characteristics that impede application scalability.
Scientific simulation codes frequently use collective communications such as broadcasts and data reductions. The ordered communication patterns used by high-performance implementations of collective algorithms present an application scalability challenge. This impediment is further magnified by application load imbalance and system activity, or system noise [15, 16], delaying the collective operations.
CORE-Direct functionality, recently added to the InfiniBand (IB) ConnectX-2 NICs by Mellanox Technologies [3], provides hardware support for offloading a sequence of data-dependent communications to the network. This functionality is well suited for supporting asynchronous Message Passing Interface (MPI) [13] collective operations. It provides hardware support for overlapping collective communications with application computation, which can be used to improve application scalability.
Related Work
Work to delegate communication management, both point-to-point and collective, to processing units other than the main CPU has already been done. A number of studies explored the benefits of NIC-based collective operations, including those described in References [17, 4, 6, 18, 14, 10]. Several analyses of NIC-based broadcast algorithms are available in References [17, 4, 18]. Generally, these all tend to use NIC-based packet forwarding as a means of improving performance of the broadcast operation. Some of the benefits of offloading barrier, reduce, and broadcast operations to the NIC are described in References [6, 5, 14], and [18]. These showed that barrier and reduce operations can benefit from reduced host involvement, efficient communication processing, and better tolerance to process skew. Keeping the data transfer paths relatively short in multi-stage communication patterns is appealing as it offers a favorable payback for moving the work to the network. Even though much research has been done in this area, many problems are still to be solved. As such, these techniques have not gained wide acceptance. Amongst all the previous NIC-based collective implementations, only the effort from Quadrics [1] has been largely deployed in the Elan3/4 interconnects, enabling its wide utilization in Quadrics-based clusters. In this work, we study the benefits of collective offload in another popular interconnect technology, InfiniBand.
An Overview of InfiniBand
A detailed description of the CORE-Direct support, and of how it is used to implement MPI collective operations in Open MPI [7], is given elsewhere [8]. In this section we provide a brief description of these, as well as very recent enhancements to the MPI support.
The InfiniBand Architecture (IBA) [11] defines a communication architecture from the switch-based network fabric to the transport layer communication interface for inter-processor communication. Processing nodes are connected as end-nodes to the fabric by Host Channel Adapters (HCAs). Fig. 1 illustrates the IBA specification of the IB Reliable Connection (RC) communication stack. Two processes communicate through a pair of IB queues that are created on the HCAs. This pair of send/receive queues is also referred to as a Queue Pair (QP). A communication operation is initiated by posting send, receive, read, write, or atomic work queue elements (WQE) to the QP. Completion of a WQE results in a completion queue event (CQE) being posted to a completion queue (CQ). The consumer obtains this event by polling the completion queue. Multiple QP's may share a CQ. Except for very special circumstances, RDMA write operations do not generate remote completion entries. Table 1 lists the barrier algorithm implemented using RC-QP's, including only communication operations in the critical path and omitting the asynchronous send completion, which is independent of MPI_Barrier completion. Posting network operations to the QP's and polling for completion both use the CPU. As a result, the CPU cost of this operation grows with the number of ranks participating in the collective operation, logarithmically for the recursive doubling barrier algorithm described here.
Table 2. Task list for a four-process recursive doubling barrier.
The general purpose CORE-Direct functionality introduced in ConnectX-2 aims to improve application scalability by moving the management of chained network operations to the network card, with CPU involvement in collective operations limited to the initiation and completion phases. HCA management of these operations makes it possible to overlap computation and communication, thereby reducing the performance impact of process skew and system noise on parallel applications using collective operations. The use of nonblocking collectives can further decrease the effects of process skew. To accomplish this, new hardware support for wait network tasks, Multiple Work Requests (MWR), and Management Queues (MQ's) is introduced. The following sub-section describes this new support and how it is used to implement MPI collective operations.
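The logarithmic growth mentioned above follows from the structure of recursive doubling: in each round a rank exchanges a message with the rank whose number differs in exactly one bit. The following is a minimal sketch of that schedule, assuming a power-of-two process count; the function name recursive_doubling_peers is illustrative and is not part of Open MPI or of the implementation described in this paper.

```python
def recursive_doubling_peers(rank: int, nprocs: int) -> list:
    """Return the peer of `rank` in each round of a recursive doubling barrier.

    Assumes nprocs is a power of two; other sizes need an extra
    pre/post exchange step that is not modeled in this sketch.
    """
    if nprocs < 1 or nprocs & (nprocs - 1):
        raise ValueError("this sketch only handles power-of-two process counts")
    if not 0 <= rank < nprocs:
        raise ValueError("rank out of range")
    peers = []
    distance = 1
    while distance < nprocs:
        # In this round, exchange with the rank that differs in one bit.
        peers.append(rank ^ distance)
        distance <<= 1
    return peers


if __name__ == "__main__":
    # Four-process example: each rank takes part in log2(4) = 2 rounds.
    for r in range(4):
        print(r, recursive_doubling_peers(r, 4))
```

With 64 processes each rank takes part in six rounds, so in a CPU-managed implementation the host posts and polls for six exchanges per barrier.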
New Hardware Capabilities
The IBA defines several communication tasks, including send, receive, read, write, and atomic tasks. CORE-Direct adds hardware support for a wait task. This takes as arguments a list of completion queues and the number of completion tasks to wait for, and can be used to order communications taking place using different QP's. Information on completed tasks consumed by a wait task may not be obtained from a completion queue, and must be inferred from QP completion ordering. For example, the HCA will poll a CQ for wait task completion of a receive on QP X. Data location can only be determined by tracking the order in which receive buffers are posted to the receive queue, and the receive completions associated with that queue.
The Multiple Work Request is a linked list of InfiniBand communication tasks which the driver posts, in order, to the queues specified by the individual work requests. These tasks include the send, receive, 0-byte RDMA write, and wait tasks. An MWR completion entry is posted after the task that is marked with the flag MQE_WR_FLAG_SIGNAL is processed by the HCA. The MWR may be used to chain a series of network tasks and, once posted, the HCAs progress the communication without using the central processing unit.
The Management Queues are used to handle MWR's. The driver supports two different types of MQs. One type is the hardware MQ (HW-MQ), and the other is a software construct used to post the ordered list of tasks in an MWR to multiple Reliably Connected QPs (RC-QP's), without using the MQ. We will refer to the latter as a software MQ (SW-MQ), even though all network processing is performed by hardware on the HCA. HW-MQ's and a unique MQ Completion Queue are created in a single step. Tasks are posted in order to the various QP's and the MQ, with no interleaving of individual tasks from different MWR's. Wait tasks are posted either to the QP specified in the task, or to the hardware MQ if the QP specified is NULL. A send/receive task that follows an associated wait task posted to the MQ results in two tasks being posted: a send/receive task is posted to the specified QP, and will not be processed until it is enabled by the send/receive-enable task posted to the MQ after the wait task.
MPI Collective Design
The new Management-Queue, Multiple Work Request, and the wait tasks are used with the preexisting IB functionality to implement offloaded, asynchronous collective operations. The queue structure used is displayed in Figure 2 .
Collective communications are managed on a per-communicator basis, to ensure independent progress. Each rank in the communicator uses a single MQ and an RC-QP for each rank with which it communicates. Receive completion is handled by the wait task, with no user-level access to the receive CQ. We keep track of the receive buffers posted to each QP to retrieve the data for subsequent send tasks. We do not use shared receive queues, so that we can identify the data source without the benefit of a CQE. Send tasks are completed asynchronously, out of the critical path, using a single send completion queue.
MPI collective operations are implemented using an interdependent sequence of network operations executed by each process in the communicator. Each process participating in a given collective operation executes a different sequence of network operations, with reduction operations also manipulating the data being transferred. These local communication patterns determine the MWR task list used by each process in the communicator. To avoid Receiver-Not-Ready Negative Acknowledgements (RNR-NACK) and the associated retransmission delays, we pre-post receive WQE's, and keep track of the receive buffers associated with each WQE for use by subsequent send tasks. To avoid the additional half network round trip latency, we do not include send completion wait tasks in the task list, as this is not necessary for MPI-level completion.
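Because receive completions consumed by a wait task never reach user level, the buffer holding arriving data has to be inferred from the order in which receives were posted to each QP. The following is a minimal sketch of such bookkeeping, assuming one dedicated receive queue per QP with in-order consumption of receive buffers; the class ReceiveTracker and its method names are hypothetical and do not correspond to the verbs API or to the Open MPI implementation.

```python
from collections import deque


class ReceiveTracker:
    """Tracks pre-posted receive buffers per QP in posting order (hypothetical helper)."""

    def __init__(self):
        # Maps a QP identifier to its posted buffers, oldest first.
        self._posted = {}

    def post_receive(self, qp, buffer):
        """Record that `buffer` was posted to the receive queue of `qp`."""
        self._posted.setdefault(qp, deque()).append(buffer)

    def consume_completions(self, qp, count):
        """Return the buffers filled by the next `count` receive completions on `qp`.

        Relies on receives completing in the order they were posted,
        which is why a shared receive queue cannot be used here.
        """
        queue = self._posted.get(qp, deque())
        if count > len(queue):
            raise RuntimeError("more completions than posted receives")
        return [queue.popleft() for _ in range(count)]
```

A wait task that waits for one receive on a given QP would correspond to a call such as consume_completions(qp, 1), whose result identifies the buffer to use in the following send task.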
We use pre-registered memory for task buffers, and large data collective operations are segmented to manage memory usage and allow for pipelining collective operations. Blocks of receive buffers are pre-posted to each QP, to avoid the performance degradation associated with RNR-NACKs.
The blocking and nonblocking barrier operations employ the recursive doubling algorithm described in [9]. Implementation details for the blocking and nonblocking operations are similar, with differences due to the different completion semantics. As an example, the communication pattern for the MPI_Barrier collective operation is given in Table 1. The user-defined task list (the MWR) corresponding to the four-process barrier algorithm described in Table 1 is given in Table 2. The wait events are used to ensure receive completion only. Table 3 describes the tasks posted to the queue pairs, and Table 4 describes the tasks posted to the management queue, for each of the four ranks participating in the barrier collective operation. As a further optimization, zero-byte data transfers can use an RDMA send operation with immediate data, which consumes a receive WQE that is never fetched.
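To make the structure of such a task list concrete, the following sketch builds one plausible ordering of send and wait tasks for a single rank in a recursive doubling barrier. It assumes a power-of-two process count, one send followed by one receive-completion wait per round, and a completion signal requested on the final task; the Task type and the function barrier_task_list are illustrative names, and the exact contents of Tables 2 through 4 may differ from this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    kind: str             # "send" or "wait" (wait for one receive completion)
    peer: int             # rank at the other end of the QP
    signal: bool = False  # request a completion entry once this task is processed


def barrier_task_list(rank, nprocs):
    """Build an illustrative ordered task list for one rank's barrier."""
    if nprocs < 1 or nprocs & (nprocs - 1):
        raise ValueError("this sketch only handles power-of-two process counts")
    tasks = []
    distance = 1
    while distance < nprocs:
        peer = rank ^ distance
        tasks.append(Task("send", peer))  # notify the peer for this round
        tasks.append(Task("wait", peer))  # hold later tasks until the peer's message arrives
        distance <<= 1
    if tasks:
        # Only the final task requests a completion entry; send
        # completions are deliberately not waited on.
        last = tasks[-1]
        tasks[-1] = Task(last.kind, last.peer, signal=True)
    return tasks
```

For rank 0 of four processes this yields send to 1, wait on 1, send to 2, wait on 2, with only the last wait flagged for completion, so the host needs to observe a single completion entry to know the barrier is done.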
Benchmark Results
In this section we focus primarily on studying the potential for overlapping computation with collective communications. We study the performance of the nonblocking MPIX_Ibarrier(), recently voted into the MPI-3 draft standard, and, for comparison, also present the most recent benchmark results for MPI_Barrier(). Since no user data is involved in barrier operations, the performance of a given algorithm is determined by network latency and the latency of the (MPI) software stack. As such, compared to other collective algorithms which also send user data over the network, the opportunities for overlap are relatively modest. However, the simplicity of the algorithm makes it a good first candidate for studying some of the newly developed CORE-Direct capabilities. We compare the performance characteristics of barrier operations that use the HCA to manage these operations with implementations in which the CPU is used to manage the barrier algorithms.
Experimental Setup
The performance measurements were all taken on an 8-node system of dual-socket, quad-core 3.00 GHz Intel Xeon X5472 processors with 32 gigabytes of memory. The system runs Red Hat Enterprise Linux Server 5.1, kernel version 2.6.18-53.el5, and uses a dual-port quad data rate ConnectX-2 NIC and switch running Mellanox firmware version 2.6.8000. This is a prerelease version of the firmware, and provides the first working implementation of the new Management Queue capability.
The prototype offloaded IB collectives are implemented within version 1.5 of the Open MPI code base, as a new collective module. To measure raw completion time, we loop over initiating the nonblocking collective operation and then waiting for operation completion. To measure the overlap characteristics of these collective operations, we modify the ideas introduced in the COMB [12] benchmark, adapting them for collective operations. Overlap is measured in two ways: (1) initiating the collective operation and then looping over (work loop, MPI_Test()) until the operation completes, making sure the busy loop does not increase overall operation completion time beyond that of the raw operation; and (2) initiating the collective operation, executing a work loop, and then waiting for operation completion, with the work loop starting at about 10% of the raw completion time and increasing in steps of 10% up to about 100% of the raw completion time. The work loop is created by looping over the "nop" asm instruction.
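The second measurement can be reduced to a single number: the largest work-loop fraction for which the total time stays at the raw completion time. The following is a minimal sketch of that post-processing step, assuming the timings have already been collected; the function name max_overlap_fraction, the 5% tolerance, and the sample values used in the comments are assumptions made for illustration and are not taken from the measurements reported here.

```python
def max_overlap_fraction(raw_time, samples, tolerance=0.05):
    """Largest contiguous work fraction that does not lengthen the operation.

    raw_time:  completion time with no work inserted.
    samples:   iterable of (work_fraction, total_time) pairs, for example
               work_fraction = 0.1, 0.2, ..., 1.0 of raw_time.
    tolerance: allowed relative increase over raw_time (an assumed
               threshold to absorb measurement jitter).
    """
    limit = raw_time * (1.0 + tolerance)
    best = 0.0
    for fraction, total in sorted(samples):
        if total > limit:
            break  # work beyond this point delays completion
        best = fraction
    return best


if __name__ == "__main__":
    # Hypothetical timings in microseconds, not measured data.
    samples = [(0.1, 32.0), (0.2, 32.1), (0.3, 32.5), (0.4, 36.0)]
    print(max_overlap_fraction(32.0, samples))
```

Stopping at the first sample that exceeds the limit keeps the reported fraction contiguous, which matches the intent of measuring a single compute time slice per collective operation.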
Discussion
MPI_Barrier() latency as a function of process count is presented in Figure 3. While results for the barrier algorithm have been presented previously [8], the current results are significantly improved at the larger process counts. This is because work has been done to reduce the number of queue pairs used in the algorithm, reducing the number of network context resources the HCA needs to manage. This gives the HCA more opportunities to cache such context data and increases performance. In addition, we use immediate data with RDMA write as a means of signaling data arrival, not actually fetching the receive work entry, as a further optimization. These changes reduce the barrier latency at 64 processes from 54 microseconds to 32. These changes are also used to implement the nonblocking barrier algorithm.
As the results show, over the range of communicator sizes used, the performance of the offloaded barrier is very good compared to the other three approaches used. However, at 64 processes, the CPU-based RDMA method outperforms the HW-MQ RDMA-based approach by polling for completion on a location in main memory, thus reducing the pressure put on the HCA's network context cache. It is expected that as the number of ranks participating in the barrier increases, the CPU-based approach will also be affected by the HCA caching effects.
The raw performance of the nonblocking barrier algorithm is shown in Figure 4, and is generally similar to that of the blocking barrier. A notable exception is the performance of the CPU progression with RDMA-based approach. This is mostly because the latter version polls the completion queue for receive completion, rather than polling on a location in main memory, as the blocking method does.
Two measures of overlap are used to study the ability to overlap collective communications with computation. Measuring the overlap by alternating between small work quanta and testing for collective completion provides an opportunity for both offload-based and CPU-based collective operations to maximize the amount of compute work that can be done. Figure 5 presents the overlap capabilities the nonblocking barrier provides during the recursive doubling algorithm, as a function of process count. Both HCA-based and CPU-based algorithms provide some degree of overlap opportunity, ranging from around 30-45% CPU availability at eight processes up to around 90% at 64 processes. While we measured up to 20% variability between the different collective algorithms, all show an opportunity for overlap, with the offload-based methods having a smoother profile as a function of process count. While high CPU availability is desirable, a combination of high availability and short collective operation duration is more desirable, since less compute time is then required to hide the barrier latency. One such measure could be CPU availability divided by collective operation duration. The MQ-based algorithms perform better using this measure.
From an application perspective, it is far more desirable to have a single compute time slice per collective operation, rather than many small time slices, as this makes it easier to do useful computation while hiding communication latency. Therefore, we measure how much work can be done after the collective operation is initiated and before waiting for it to complete, without impacting its overall completion time. The results are presented in Figure 6. As this figure shows, delegating the progression of the collective operations to the HCA makes the CPU available for relatively long periods of time. The MQ-RDMA based approach provides about 80% CPU availability before starting to impact the nonblocking barrier completion time. This tends to increase as the number of processes involved in the collective increases. On the other hand, the algorithms that rely on the CPU for progression provide at most 30% CPU availability at 64 processes before impacting the collective operation time. This result is not at all surprising, as the CORE-Direct functionality is designed with the goal of providing hardware support for overlapping computation and communications.
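The combined measure suggested above can be written down directly. The following is a minimal sketch, assuming availability is expressed as a fraction between 0 and 1 and duration in microseconds; the function name overlap_merit and the numbers in the example are hypothetical and are not results from this study.

```python
def overlap_merit(cpu_availability, duration_us):
    """CPU availability divided by collective duration (higher is better)."""
    if not 0.0 <= cpu_availability <= 1.0:
        raise ValueError("availability must be a fraction between 0 and 1")
    if duration_us <= 0:
        raise ValueError("duration must be positive")
    return cpu_availability / duration_us


if __name__ == "__main__":
    # Hypothetical values: equal availability, different durations.
    print(overlap_merit(0.9, 30.0))  # shorter collective scores higher
    print(overlap_merit(0.9, 45.0))
```

Under this measure two algorithms with the same availability are ranked by duration, so the one that needs less compute time to hide its latency scores higher.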
Conclusions
In this paper we have evaluated the performance characteristics of the latency sensitive blocking and nonblocking barrier algorithms, when using the CORE-Direct functionality provided by ConnectX-2. We have shown that this functionality provides support for well performing barrier operations, similar to that of highly optimized CPU based barrier operations. It also provides the necessary support for effectively overlapping computation and communications. The latter characteristic is an effective strategy for mitigating application effects of system noise, and, when using nonblocking collective operations, application load imbalance. As such, this capability is a key ingredient to improved scaling for applications using collective operations. Future studies will examine the overlap characteristics of collective operations involving user data, and reduction operations.
