Abstract: Scientists across a wide range of domains increasingly rely on computer simulation for their investigations. Such simulations often spend a majority of their run-times solving large systems of linear equations that require vast amounts of computational power and memory. It is hence critical to design solvers in a highly efficient and scalable manner. Hypre is a high performance, scalable software library that offers several optimized linear solver routines and preconditioners. In this paper, we study the characteristics of Hypre's Preconditioned Conjugate Gradient (PCG) solver algorithm. The PCG routine is known to spend a majority of its communication time in the MPI_Allreduce operation, which computes a global summation during the inner product operation. MPI_Allreduce is a blocking operation whose latency is often a limiting factor to the overall efficiency of the PCG solver routine and, correspondingly, the performance of simulations that rely on this solver. Hence, hiding the latency of the MPI_Allreduce operation is critical to scaling the PCG solver routine and improving the performance of many simulations.
I. INTRODUCTION
The fastest supercomputing systems currently offer sustained petaflop performance and allow scientists to scale their parallel applications to tens of thousands of processors. The Message Passing Interface (MPI) [1] has been a popular programming model for High Performance Computing applications for the last couple of decades. MPI defines a set of collective operations that are used to communicate data among a group of participating processes. Owing to their ease of use and portability, these operations are commonly used across various applications. The current MPI standard, version 2.2, defines the collective operations to be blocking, i.e., the application has to wait until the collective call completes. This limits the overall performance and scalability of various scientific parallel applications. In addition, blocking collective operations are also susceptible to system noise, which directly impacts the performance of parallel applications [2], [3]. This impact has spurred interest in the design of non-blocking collective communication operations, and the upcoming version, MPI-3, defines such operations.
InfiniBand is a popular switched interconnect standard being used by almost 41% of the Top500 Supercomputing systems [4] . Since InfiniBand is so widely used, efficient support of non-blocking collectives in MPI implementations on InfiniBand is critical. Mellanox has introduced network offload features in their ConnectX-2 [5] adapter. Using this feature, generic lists of communication tasks can be offloaded to the network interface [6] . Such an interface eliminates the need for the host processor to progress communication and provides a low-level mechanism that can be leveraged to design non-blocking collective communication algorithms. However, in order to leverage the full benefits of this low-level mechanism, MPI libraries must be designed in a highly efficient manner.
II. MOTIVATION
Application scientists are increasingly relying on large scale simulations for their scientific explorations. Such simulations enable the study of scenarios that are infeasible or impractical to study by experiment. Several such applications are known to rely heavily on popular solver routines, such as the Preconditioned Conjugate Gradient (PCG), to solve large systems of sparse linear equations. The efficiency of the solver routine is critical and strongly affects the overall run-times of scientific simulations. In this paper, we use Hypre [7], a high performance, scalable, open source library that implements several preconditioners and solver algorithms, including PCG. The PCG solver routine, when used with the diagonal scaling preconditioner, spends a considerable fraction of its communication time in the MPI_Allreduce collective operation while computing the global inner product. Hence, in order to improve the efficiency of the PCG solver, it is critical to design a highly efficient non-blocking variant of the MPI_Allreduce collective operation that can be used to hide the communication latency effectively.
A high performance implementation of a non-blocking interface for collective operations would ideally be expected to deliver near-perfect communication/computation overlap. While the benefits of non-blocking collectives are obvious at a high level, the real benefits offered by intelligent MPI designs are likely to be the key driver for acceptance of this interface by the application community. For example, to ensure high overlap capabilities, we must minimize the role of the host processors in progressing the collective operations. Simplistic designs of non-blocking collectives that require the CPU to progress the MPI library explicitly, e.g., by calling MPI_Test [8], offset much of the benefit of non-blocking communication. Similarly, if threads within the library are used for progression, the application performance can be affected by interrupt processing, thread scheduling and other such factors [9]. It is also critical for a non-blocking collective interface to ensure performance portability, i.e., the benefits of using non-blocking collectives should not be tightly coupled to system architecture and network speeds. In this context, real benefits of non-blocking collectives can only be achieved with corresponding network support [10], [11]. In order to extract the maximum benefit from non-blocking collectives, application developers may have to re-engineer their codes for communication/computation overlap. Such a co-design between the applications and the MPI libraries can potentially lead to significant improvements in application run-times.
In this paper, we leverage the CORE-Direct collective offload features that are available in the InfiniBand ConnectX-2 network adapter to design non-blocking algorithms for the MPI_Iallreduce operation. We also re-design the PCG solver routine to utilize our proposed MPI_Iallreduce designs to hide the latency of global reductions by overlapping them with the compute phases of PCG. Our studies show that we can improve the run-times of PCG by up to 21% when compared to the default PCG implementation in Hypre; about 16% of the overall benefit is due to overlapping the MPI_Iallreduce operations. We integrate our designs into the MVAPICH2 [12] software stack, which is a popular MPI implementation for InfiniBand, iWARP and RoCE technologies. MVAPICH2 is currently used by more than 1,840 organizations in 65 countries worldwide. We list the important contributions of this paper below:
1) We propose fully functional designs for the MPI_Iallreduce operation, which leverage the collective offload features offered by the ConnectX-2 network interface.
2) We study the various factors that could potentially affect the overlap capabilities of our network-offload-based MPI_Iallreduce operation.
3) Linear solvers typically spend a significant amount of their MPI time in global Allreduce operations. In this work, we re-design the PCG solver in Hypre to leverage our MPI_Iallreduce designs.
4) We show that our MPI_Iallreduce designs reduce the impact of system noise on the PCG solver.
III. BACKGROUND
In this section we give the necessary background information for our work.
A. InfiniBand and ConnectX-2 Network Interface
InfiniBand QDR network cards and switches can deliver 36 Gbps end-to-end bandwidth and about 1.0 to 1.5 μs latency. Along with all of the standard InfiniBand features, the ConnectX-2 [5] network adapter from Mellanox offers a new network offloading feature called CORE-Direct [13]. Using this feature, arbitrary lists of send, receive and wait operations can be created. These lists can then be posted to a work-request queue to be further processed by the network card. The network adapter executes these lists independently, which eliminates the need for the host processor to progress the communication tasks. Using such task-lists, non-blocking collective operations may be designed by upper-level libraries.
B. Offloading compute operations with ConnectX-2
Unlike collectives such as MPI_Bcast and MPI_Alltoall, which only transfer data, the MPI_Allreduce operation also performs basic math operations, such as MPI_MAX and MPI_SUM. In order to design MPI_Iallreduce in a truly non-blocking manner, it is desirable to offload the compute phases of the reduction operation to the network interface. Such a design could lead to higher overlap capabilities when compared to designs that require the host processors to intervene and perform the math operations. Apart from supporting communication offload, the ConnectX-2 interface also allows MPI libraries to create and post task-lists comprising calc-requests. The calc operation needs to be specified as a part of a send work request element. Upon execution, the result of the calc operation for the given set of operands is sent to the specified destination process. However, if the process that posts such an operation also requires the result, it is necessary to do a network loop-back operation to retrieve the data from the network interface. The current generation ConnectX-2 interface has the limitation that it can only support binary calc operations on scalar values. Solver routines commonly do global reduction operations on just one double, so it is possible for MPI libraries to leverage the ConnectX-2 feature for such applications. However, offloading reductions on vector data will require more advanced hardware support.
C. Algorithms for MPI Allreduce in MVAPICH2
State-of-the-art open-source MPI implementations, such as MPICH2 [14], Open-MPI [15] and MVAPICH2 [12], use optimized algorithms to improve the latency of blocking collective operations. MVAPICH2 implements multicore-aware, shared-memory-based algorithms for blocking collective operations. The processes that are within a compute node are grouped within a "shared memory communicator". One process per node is designated as a leader and participates in a "leader communicator" that contains leaders from all nodes. In the MVAPICH2 implementation of MPI_Allreduce, the processes first do a shared-memory-based reduction within each compute node to accumulate the data at the leader process. This is followed by an inter-leader reduction operation based on point-to-point MPI operations. MVAPICH2 uses the recursive-doubling algorithm to implement this step. Finally, the leader processes perform a shared-memory broadcast to complete the MPI_Allreduce.
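The three phases described above can be illustrated with a small single-process simulation. This is only a sketch under stated assumptions: the function name `two_level_allreduce_sum`, the contiguous rank-to-node mapping and the power-of-two number of leaders are hypothetical choices made for illustration, and no shared memory or MPI calls are involved.

```python
def two_level_allreduce_sum(values, ppn):
    """Simulate a two-level allreduce (sum) over len(values) ranks.

    values[r] is the contribution of rank r; ppn is the number of ranks per
    node. Ranks are assumed to be mapped to nodes in contiguous blocks
    (an assumption of this sketch, not a statement about MVAPICH2).
    """
    nranks = len(values)
    assert nranks % ppn == 0, "every node must hold exactly ppn ranks"
    num_nodes = nranks // ppn
    assert num_nodes & (num_nodes - 1) == 0, "sketch assumes power-of-two leaders"

    # Phase 1: intra-node reduction, accumulating the data at each node leader.
    leader = [sum(values[node * ppn:(node + 1) * ppn]) for node in range(num_nodes)]

    # Phase 2: inter-leader recursive doubling; leader i pairs with i XOR dist.
    dist = 1
    while dist < num_nodes:
        leader = [leader[i] + leader[i ^ dist] for i in range(num_nodes)]
        dist *= 2

    # Phase 3: intra-node broadcast of the leader's result to its node.
    return [leader[rank // ppn] for rank in range(nranks)]
```

For example, with 32 ranks and 8 ranks per node, every rank ends up holding the sum of all 32 contributions.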
D. Impact of System Noise
Several researchers have demonstrated the impact of system noise on the performance of parallel applications [2], [3]. The impact of noise is higher at larger scales, because the delays tend to get propagated across various tasks in the job. Hoefler et al. quantified the impact of noise on various host-based collective operations [2] and concluded that MPI_Allreduce based on the recursive-doubling algorithm is very sensitive to system noise. However, with network-based implementations, the network can independently execute the schedules, with little intervention from the host processors. Such designs have the potential to reduce the impact of system noise on the performance of applications.
E. Hypre
Hypre is an open-source, high performance and scalable package of parallel linear solvers and preconditioners. Hypre is designed to leverage the notion of conceptual interfaces, which expose the various solver routines to users in a modular fashion [16] . Such a design significantly eases the coding burden for application developers and may also be used to provide additional application information to the solver routines. The solvers in Hypre are robust, numerically stable and scalable [17] . Its object model is more generic and flexible when compared to many state-of-the-art solver packages [7] and it may also be used as a framework for algorithm development. In this paper, we focus specifically on the PCG solver routine, which uses the diagonal scaling preconditioner.
IV. DESIGNING OFFLOAD-BASED ALGORITHMS FOR MPI ALLREDUCE
We previously described the communication protocols that we use for small and large messages in MVAPICH2 with the CORE-Direct interface in [10], [11]. We pre-post buffers to minimize the latency of small messages and we rely on InfiniBand's Receiver-Not-Ready (RNR) feature for large messages. In this section, we discuss our designs for the network-offload-based algorithms for the non-blocking MPI_Iallreduce.
As discussed in Section III-B, with the calc-request, if a process also requires the result of the compute operation performed by the NIC, it is necessary to do a network loop-back operation. This may affect the latency of the network-offload-based MPI_Iallreduce operation. Also, since the network interface does not offer hardware-level tag-matching, it is necessary to use distinct InfiniBand Queue Pairs (QP) and a Completion Queue (CQ) for each pair of processes to ensure correct execution. We limit the size of these CQs to minimize the memory overhead of our designs. We also selectively poll only the specific CQs where we are expecting send/recv completions, instead of exhaustively polling all the available CQs. Such an approach allows us to control the polling overheads.
We now propose our network-offload based designs for both the recursive-doubling and the tree-based reduce-bcast algorithm and explore design-level choices to minimize the performance impact of the network loop-back operation.
A. Network Offload based Recursive-Doubling Algorithm
We describe the network-offload-based recursive-doubling algorithm in Figure 1. The recursive-doubling algorithm across N processes requires log(N) iterations to complete. At the start of the algorithm, each process copies its input buffer into an accumulator-buffer. In each iteration, every process computes the rank of the peer process and creates a task-list comprising: (i) a send-task to send the data in the accumulator-buffer to the peer, (ii) a recv-task to receive the peer's data, (iii) a calc-request to offload the compute operation to be performed on the two operands and (iv) a wait-task to wait for the calc operation to complete. The calc operation executes through the loop-back mechanism, through which the NIC can directly write the result of the compute operation into the accumulator-buffer. The recursive-doubling algorithm requires that every process has the result of the calc operation at the end of each iteration. Hence, a network-offload-based implementation of the recursive-doubling algorithm requires each process to do a loop-back operation at the end of every iteration of the algorithm. It is not possible to hide this overhead, because the next step of the algorithm cannot be started before the loop-back operation is completed. Since there are log(N) steps in a recursive-doubling algorithm, each process executes log(N) loop-back operations during the MPI_Iallreduce operation.
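The per-iteration task-lists and the loop-back count can be made concrete with a short host-side simulation. This is a sketch only: the tuple encoding of tasks and the names `rd_task_lists` and `simulate_rd_sum` are hypothetical, the peer is assumed to be `rank XOR 2^i` (the usual recursive-doubling pairing), N is assumed to be a power of two, and nothing here touches the CORE-Direct interface.

```python
def rd_task_lists(rank, nprocs):
    """Return, for each iteration, the ordered tasks one rank would post.

    Each task is a (kind, target) tuple; this encoding is illustrative only.
    """
    assert nprocs > 0 and nprocs & (nprocs - 1) == 0, "power-of-two N assumed"
    steps = nprocs.bit_length() - 1          # log2(N) iterations
    lists = []
    for i in range(steps):
        peer = rank ^ (1 << i)               # assumed peer for iteration i
        lists.append([
            ("send", peer),                  # (i)   send the accumulator to the peer
            ("recv", peer),                  # (ii)  receive the peer's data
            ("calc", rank),                  # (iii) calc-request, result looped back
            ("wait", rank),                  # (iv)  wait for the calc to complete
        ])
    return lists


def simulate_rd_sum(inputs):
    """Simulate the sum reduction and count loop-back operations per rank."""
    nprocs = len(inputs)
    acc = list(inputs)                       # copy input buffer into accumulator
    loopbacks = [0] * nprocs
    for i in range(nprocs.bit_length() - 1):
        before = list(acc)                   # values exchanged in this iteration
        for rank in range(nprocs):
            peer = rank ^ (1 << i)
            acc[rank] = before[rank] + before[peer]
            loopbacks[rank] += 1             # one loop-back per iteration
    return acc, loopbacks
```

With 16 ranks, each rank posts four task-lists and performs four loop-back operations, matching the log(N) count stated above.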
B. Network Offload based Reduce-Bcast Algorithm
The MPI_Allreduce operation can also be designed by executing MPI_Reduce followed by MPI_Bcast. In Figures 2(a), (b) and (c), we describe the steps involved in our network-offload-based reduce-bcast algorithm. Consider an intermediate process P_i of the generic k-nomial reduce tree. This process needs to handle receiving the data from k children processes, creating k − 1 calc operations to itself (through network loop-back) and 1 calc-send operation to its parent. As shown in Figure 2(a), for a binomial tree, an intermediate process is required to do only one calc-send operation to itself, but the communication tree may be tall and thin. However, as we increase the degree of the tree, the number of calc-send operations to self also increases, while the communication tree becomes shorter and denser. For example, with 16 processes, the reduction operation with degree=4 takes 8 steps to finish, whereas with degree=2 and 3, we only require 7 steps.
Unlike the recursive-doubling algorithm, a tree-based algorithm may also lead to an imbalanced communication pattern. Such an imbalance could help in overlapping the overhead of the loop-back operation at the intermediate processes, which was not possible with the recursive-doubling algorithm. For example, in Figure 2(a), process P_5 waits for data to arrive from process P_6. On receipt of this message, P_5 executes the calc-send operation to itself. At the same time, process P_7 also receives data from its child process and executes a calc-send task to P_5. If processes P_5 and P_7 are sufficiently synchronized, it is possible that by the time the data from P_7 arrives at P_5, P_5 is done with the calc-send to itself, hence potentially hiding the overhead of the loop-back operation. Once the reduce phase is done, the root of the reduce-tree broadcasts the final result across all the processes. This phase is also offloaded to the network interface, eliminating the need for any intervention from the host processors. Every process chains the task-lists corresponding to the reduce and the broadcast steps during the MPI_Iallreduce operation, which allows us to completely offload the reduce-bcast algorithm.
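The relationship between tree degree and the number of loop-back calc operations can be explored with a small simulation. This sketch makes a loud assumption: it uses a heap-style layout in which the children of rank r are ranks k*r+1 through k*r+k, which is a convenient choice for illustration and is not claimed to match the tree layout of Figure 2. The name `kary_reduce_bcast_sum` is hypothetical, and the step counts quoted in the text are not reproduced here.

```python
def kary_reduce_bcast_sum(inputs, k):
    """Simulate a tree-based reduce (sum) followed by a broadcast.

    Assumes a heap-style k-ary tree rooted at rank 0 (illustrative layout).
    Returns the per-rank results and, for every non-root rank that has
    children, the number of calc operations it sends to itself, counted as
    (number of children - 1) following the description in the text.
    """
    nranks = len(inputs)
    acc = list(inputs)
    children = [0] * nranks

    # Reduce phase: visit ranks from the leaves upward. In this layout a
    # parent always has a smaller rank than its children, so a rank has
    # received all of its children's data before it is visited.
    for rank in range(nranks - 1, 0, -1):
        parent = (rank - 1) // k
        acc[parent] += acc[rank]
        children[parent] += 1

    self_calcs = {
        rank: children[rank] - 1
        for rank in range(1, nranks)
        if children[rank] > 0
    }

    # Broadcast phase: the root's value reaches every rank.
    return [acc[0]] * nranks, self_calcs
```

Under this layout with 16 ranks, a fully populated intermediate rank performs one self-directed calc for degree 2 and three for degree 4, which illustrates why higher degrees increase the loop-back work per intermediate process.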
C. MPI Iallreduce Design Choices
A simple approach to designing a network-offload-based MPI_Iallreduce operation is to ignore the node-level topology and consider the communicator as a flat structure. However, such a design may suffer from a relatively high communication latency, because it does not consider the multi-core architecture and all the communication steps have to be performed through the network-offload channel. We refer to such a design as the "Flat" scheme.
Another option in designing a network-offload-based MPI_Iallreduce operation is to use the shared-memory channel for the intra-node communication and the network-offload channel for the inter-node transfers. We refer to such a design as the "Two-Level" scheme. We use the shared-memory-based reduce operation to implement the first intra-node reduction step, and the shared-memory-based broadcast operation to implement the final intra-node broadcast phase. We may choose to use either the network-offload-based recursive-doubling algorithm or the reduce-bcast algorithm for the inter-leader step of the MPI_Iallreduce operation. Since the shared-memory channel offers much better communication latency, such a design allows for lower communication latency for the network-offload-based MPI_Iallreduce operation. However, a limitation of such a design is that it requires a degree of synchronization between groups of participating processes. During the MPI_Wait step, the non-leaders are required to synchronize with their leader process to get the final result of the MPI_Iallreduce. Such a synchronization may not be very appealing, because the goal of a non-blocking collective interface is to hide such synchronization overheads. Also, since the final intra-node step requires copying the data from the leader processes to the rest of the non-leader processes, it may also evict the application's data from the caches. On the other hand, the "Flat" scheme may alleviate these problems, because it does not require any synchronization during the MPI_Wait operation. Additionally, since the NIC can directly write the final result of the compute operation into the user-buffer, this approach also eliminates the need for any additional copy operations and possible cache conflicts. We study the performance characteristics of these approaches in detail in Section VI.
Since each process creates the task-lists during the MPI_Iallreduce call, messages may arrive at destination processes before those processes have even started the collective. Hence it is necessary to carefully orchestrate this step so that we honor the data dependency that exists across different iterations of the collective algorithm. We previously proposed a solution to address a similar problem with the network-offload-based MPI_Ibcast operation [11]. In this paper, we extend the idea to handle the data dependency correctly for the network-offload-based MPI_Iallreduce operation.
V. DESIGNING PCG FOR OVERLAP
In this section, we first describe the basic PCG algorithm which is used in the Hypre software library. We then discuss our implementation of a common variant of the PCG algorithm, within Hypre. Finally, we propose our Overlap-PCG algorithm which uses non-blocking inner product operations to achieve communication/computation overlap.
A. Basic PCGSolve Algorithm
The PCG solver routine is commonly used to solve systems of linear equations of the form Ax = b, where A is symmetric and positive definite. The CG method is often used in combination with a preconditioning step, which generates a different matrix C, an approximation of A, so that Cy = z is easier to solve than Ax = b. In Figure 3, we include the pseudo-code for the PCG solver routine in the Hypre library. We observe that we do three calls to the inner product function in each iteration of the PCG solve routine to update the sdotp, gamma and i_prod variables. The inner product operation relies on the MPI_Allreduce operation on the MPI_COMM_WORLD communicator to calculate the global summation value. In each iteration of the PCG solve routine, we also call the Matvec function, which implements the boundary exchange and the local hypre_CSRMatrixMatvec operation. The boundary exchange phase uses MPI_Isend, MPI_Irecv and MPI_Waitall operations and is overlapped with the hypre_CSRMatrixMatvec function. We also observe that the loop in its current form has a strict data dependency and is not very amenable to overlapping the inner product functions. For example, the sdotp variable gets updated during an inner product operation and its value is used in the very next step. For the rest of this paper, we refer to this version of the PCG as "PCG-Algorithm1".

    x = initial guess, p = 0, beta = 0
    r = b - Ax
    Solve C * p = r
    gamma = inner-prod(r, p)
    while( not converged )
        Matvec(A, p, s)              /* s = A*p */
        sdotp = inner-prod(s, p)
        alpha = gamma/sdotp
        gamma_old = gamma
        x = x + alpha*p              /* X_Axpy */
        r = r - alpha*s              /* X_Axpy */
        Solve C * s = r              /* DiagScale */
        gamma = inner-prod(r, s)
        i_prod = inner-prod(r, r)
        if(i_prod / bi_prod)         /* Convergence Test */
            if(converged) break
        beta = gamma/gamma_old
        p = s + beta*p               /* P_Axpy */

Figure 3. PCG-Algorithm1: Basic Preconditioned Conjugate Gradient solver Algorithm in Hypre
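The PCG-Algorithm1 loop of Figure 3 can be sketched as runnable code to make the data dependency visible: each inner product result is consumed by the very next statement. The following is a minimal serial sketch, not Hypre's implementation. It assumes a small dense matrix stored as a list of rows, a diagonal-scaling preconditioner C = diag(A), and a relative-residual convergence test with a hypothetical tolerance `tol`, since the pseudo-code does not spell out the exact form of the test. The function names are illustrative.

```python
def inner_prod(u, v):
    # In a parallel solver this local sum would be followed by a global
    # reduction; in this serial sketch it is purely local.
    return sum(a * b for a, b in zip(u, v))


def matvec(A, p):
    # Dense matrix-vector product, s = A*p.
    return [inner_prod(row, p) for row in A]


def diag_scale(A, r):
    # Solve C * s = r with C = diag(A) (one division per element).
    return [r[i] / A[i][i] for i in range(len(r))]


def pcg_algorithm1(A, b, tol=1e-10, max_iter=1000):
    n = len(b)
    x = [0.0] * n                                   # initial guess
    r = [bi - axi for bi, axi in zip(b, matvec(A, x))]
    p = diag_scale(A, r)                            # Solve C * p = r
    gamma = inner_prod(r, p)
    bi_prod = inner_prod(b, b)
    if bi_prod == 0.0:
        return x, 0                                 # b = 0, so x = 0 solves it
    for iteration in range(1, max_iter + 1):
        s = matvec(A, p)
        sdotp = inner_prod(s, p)                    # result needed immediately
        alpha = gamma / sdotp
        gamma_old = gamma
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * si for ri, si in zip(r, s)]
        s = diag_scale(A, r)                        # Solve C * s = r
        gamma = inner_prod(r, s)
        i_prod = inner_prod(r, r)
        if i_prod / bi_prod < tol * tol:            # assumed form of the test
            return x, iteration
        beta = gamma / gamma_old
        p = [si + beta * pi for si, pi in zip(s, p)]
    return x, max_iter
```

For the 2x2 system with rows (4, 1) and (1, 3) and right-hand side (1, 2), the sketch converges to approximately (1/11, 7/11).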
B. PCGSolve Algorithm Variant
In Figure 6 , we discuss a variant of the PCG algorithm [18] . This version is inherently very similar to the basic PCG routine described in Figure 3 , and it has the same numerical stability. However, we also observe that this variant offers the flexibility to use the result of the inner product functions at a later point in time. For example, the result of the sdotp inner product is not needed until after the X Axpy routine is done. We refer to this version of the PCG routine as "PCG-Algorithm2". This variant of the PCG also requires a slightly modified version of the preconditioner routine. We describe the differences in the two In Hypre, PCGAlgorithm1 uses the DiagScale preconditioner, described in Figure 4 , which uses indirect addressing to access the elements of the A data array. For PCG-Algorithm2, we create the L and the L −T arrays at the start of the solver routine, and we use these arrays in the DiagInvScale preconditioner, described in Figure 5 . In our case, the matrix L is the squareroot of the diagonal elements of matrix A, and therefore L = L T . The DiagInvScale routine reads data sequentially from the A data array, which could lead to better cache behavior. Additionally, the DiagInvScale preconditioner involves floating-point multiplication operations, whereas the DiagScale routine requires floating-point division, which is computationally more expensive. We also observe that the gamma-inner product step in PCG-Algorithm2 reads data from the same vector t. In PCG-Algorithm1, we use vectors r and s to compute gamma. This may lead to fewer bus transactions and better cache behavior. Due to these factors, we expect PCG-Algorithm2 to perform better than PCGAlgorithm1, even though both of them use blocking global reductions.
C. PCG Algorithm with Overlap
We leverage PCG-Algorithm2 and re-design it to overlap the inner product operation with independent compute tasks, as described in Figure 7. We design a non-blocking interface for the inner product function, init-inner-prod, which can be used to initiate the inner product operation and return immediately. We can then perform some of the other independent compute tasks of the solver routine and wait for the completion of the inner product by using the wait-inner-prod routine. The init-inner-prod function initiates the non-blocking reduction through our network-offload-based MPI_Iallreduce operation. The wait-inner-prod function calls the MPI_Wait operation to wait on the corresponding non-blocking MPI_Iallreduce operation. In our proposed variant, we have overlapped each of the three inner products with either the DiagInvScale routine or the X_Axpy operation. For the rest of the paper, we refer to this algorithm as "PCG-Overlap".

    x = initial guess, p = 0, beta = 0
    p_prev = 0, w = 0, v = 0, t = 0
    r = b - Ax
    C = L.L^T
    t = L^-1 * r                     /* DiagInvScale */
    gamma = inner-prod(t, t)
    while( not converged )
        w = L^-T * t                 /* DiagInvScale */
        p = w + beta*p_prev          /* P_Axpy */
        s = A * p                    /* Matvec */
        sdotp = inner-prod(s, p)
        x = x + alpha*p_prev         /* X_Axpy */
        alpha = gamma/sdotp
        r = r - alpha*s              /* R_Axpy */
        i_prod = inner-prod(r, r)
        t = L^-1 * r                 /* DiagInvScale */
        gamma_old = gamma
        gamma = inner-prod(t, t)
        beta = gamma/gamma_old
        if(i_prod / bi_prod)         /* Convergence Test */
            if(converged) break;

Figure 6. PCG-Algorithm2: Modified Preconditioned Conjugate Gradient solver Algorithm

    x = initial guess, p = 0, beta = 0
    p_prev = 0, w = 0, v = 0, t = 0
    r = b - Ax
    C = L.L^T
    t = L^-1 * r                     /* DiagInvScale */
    gamma = init-inner-prod(t, t)
    while( not converged )
        w = L^-T * t                 /* DiagInvScale */
        gamma = wait-inner-prod(t, t)    /* finish gamma inner product */
        beta = gamma/gamma_old
        p = w + beta*p_prev          /* P_Axpy */
        s = A * p                    /* Matvec */
        init-inner-prod(s, p)        /* start sdotp inner product */
        x = x + alpha*p_prev         /* X_Axpy */
        sdotp = wait-inner-prod(s, p)    /* finish sdotp inner product */
        alpha = gamma/sdotp
        r = r - alpha*s              /* R_Axpy */
        init-inner-prod(r, r)        /* start i_prod inner product */
        t = L^-1 * r                 /* DiagInvScale */
        i_prod = wait-inner-prod(r, r)   /* finish i_prod inner product */
        gamma_old = gamma
        init-inner-prod(t, t)        /* start gamma inner product */
        if(i_prod / bi_prod)         /* Convergence Test */
            if(converged) break;

Figure 7. PCG-Overlap
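The init/wait split of the inner product can be emulated on a single machine to show the calling pattern. This is a sketch under loud assumptions: a background thread stands in for the network-offloaded reduction, no MPI is involved, and `wait_inner_prod` takes a handle rather than the two vectors used in the pseudo-code. The names `init_inner_prod`, `wait_inner_prod` and `overlapped_sdotp_step` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# A single worker thread stands in for the entity that progresses the
# reduction independently of the caller (an assumption of this sketch).
_pool = ThreadPoolExecutor(max_workers=1)


def init_inner_prod(u, v):
    """Start the inner product and return a handle immediately."""
    u_snapshot, v_snapshot = list(u), list(v)   # freeze the operands
    return _pool.submit(
        lambda: sum(a * b for a, b in zip(u_snapshot, v_snapshot))
    )


def wait_inner_prod(handle):
    """Block until the inner product started earlier has completed."""
    return handle.result()


def overlapped_sdotp_step(s, p, x, p_prev, alpha):
    """Overlap the sdotp inner product with the X_Axpy update."""
    handle = init_inner_prod(s, p)              # start sdotp inner product
    x_new = [xi + alpha * pi for xi, pi in zip(x, p_prev)]   # X_Axpy
    sdotp = wait_inner_prod(handle)             # finish sdotp inner product
    return sdotp, x_new
```

The same start, compute, wait pattern applies to the gamma and i_prod inner products, each paired with a DiagInvScale call.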
VI. EXPERIMENTAL EVALUATION

A. Experimental Setup
Each of our compute nodes has eight Intel Xeon cores running at 2.53 GHz with 12 MB L3 cache. The cores are organized as two sockets with four cores per socket. Each node also has 12 GB of memory and a Gen2 PCI-Express bus. The nodes are equipped with MT26428 QDR ConnectX-2 HCAs with PCI-Ex interfaces. We used a 171-port Mellanox QDR switch, with 11 leafs, each having 16 ports. Each node is connected to the switch using one QDR link. The HCAs as well as the switches use the latest firmware. The operating system used is Red Hat Enterprise Linux Server release 5.4 (Tikanga), with the 2.6.18-164.el5 kernel version. OFED version 1.5.1 is used on all machines, and the OpenSM version is 3.3.7.
B. Benchmark Suite
In this paper, we use modified versions of the OSU Micro-Benchmarks, which are a part of the MVAPICH2 software package. We measure the average latency of the different implementations of the MPI_Allreduce operation across various system sizes. We report communication latency averaged across all the processes, across 1,000 iterations and three different runs.
Overlap Benchmark: In this benchmark, we perform floating-point matrix-matrix operations by invoking the cblas_dgemm function supported by the Intel MKL Library (10.2.1.017), between the MPI_Iallreduce and the MPI_Wait operations. We measure the overall time required for completion, compute the GFLOPS rating for the given case and compare it against the theoretical peak FLOPS rating for our system. In the first experiment, we fix the message size, vary the matrix size N gradually between the values 10 and 3K, and measure the average throughput. We do a global barrier between two iterations to ensure that all the processes are synchronized at the start of an iteration.
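The GFLOPS rating mentioned above can be computed from the matrix size and the measured time. The sketch below assumes the conventional count of 2*N^3 floating-point operations for multiplying two N x N matrices; the text does not state which count was used, so this is an assumption, and the names `dgemm_gflops`, `fraction_of_peak` and the `peak_gflops` parameter are hypothetical.

```python
def dgemm_gflops(n, seconds):
    """GFLOPS for one N x N by N x N multiply, assuming 2*N^3 operations."""
    return (2.0 * n ** 3) / seconds / 1e9


def fraction_of_peak(n, seconds, peak_gflops):
    """Measured throughput as a fraction of an assumed peak rating."""
    return dgemm_gflops(n, seconds) / peak_gflops
```

A fraction close to 1.0 while a non-blocking collective is in flight would indicate that the collective takes few cycles away from the computation.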
C. Communication Latency
In Figure 8, we compare the latency of the default host-based, blocking recursive-doubling algorithm in MVAPICH2 with our proposed network-offload-based designs. We fix the message size of the global summation operation at 1 double and we vary the number of processes. We can see that the latency of our proposed network-offload-based designs is higher than that of the default implementation in MVAPICH2. We attribute this to the expensive loop-back operations that have to be performed with the network-offload designs.
As discussed in Section IV, for the recursive-doubling algorithm, every process needs to do a loop-back operation at the end of each iteration. With the tree-based approaches, this issue could be alleviated to an extent, because the number of loop-back operations per process will depend on the degree of the tree. However, the tree-based schemes also require the offload-based broadcast to complete.
From Figure 8 , we conclude that the latency of the "Flat" scheme is significantly higher than the rest of the designs. As discussed in Section IV-C, this is mainly because the Flat scheme is not multi-core-aware and does not use shared-memory channels for intra-node communication.
Among the "Two-Level" designs, we observe that the recursive-doubling (Offload-RD) and the tree-based reduce-bcast (Offload-Red-Bcast) designs perform similarly. For the tree-based approach, we also vary the degree of the reduce-tree, denoted by Offload-Red-Bcast-2 (for degree 2), Offload-Red-Bcast-3 (for degree 3) and Offload-Red-Bcast-4 (for degree 4). We observe that the reduce-bcast approach performs better when the degree of the reduce-tree is 2. As we increase the degree, the intermediate processes do more loop-back operations before they send the data to their parent processes. We also varied the degree of the tree for the offload-bcast algorithm and we observed that the best degree for this step is 4.
In Figures 9, 10 and 11, we further analyze the communication latency of our network-offload-based designs. We specifically measure the average time required for a process to return from the MPI_Iallreduce and the overhead of the MPI_Wait operations, without attempting any overlap. We can see that the average time required for the MPI_Iallreduce operation to return is lower with the tree-based designs than with the recursive-doubling algorithm. This is expected, because the size of the task-list posted by an intermediate process in the tree-based schemes is smaller than the task-list posted by any process in the recursive-doubling algorithm. Also, for this analysis, we consider the tree-based algorithm with degree 2, since it delivers better latency than the algorithm with higher degrees.
D. Overlap/Throughput Analysis
In this section, we use our throughput benchmark to study the impact on the throughput of the DGEMM operation when it is overlapped with different variants of the network-offload-based MPI_Iallreduce operation. In Figure 12(a), we compare the measured throughput of the CBLAS-DGEMM operation when it is overlapped with the MPI_Iallreduce based on the Two-level-Recursive-Doubling scheme, the Two-level-Reduce-Bcast algorithm, or the Flat-Recursive-Doubling scheme across 256 processes. We have estimated the peak throughput of the DGEMM operation on our system and we can observe that all the network-offload-based MPI_Iallreduce implementations achieve throughput that is very close to the peak. This implies that our proposed designs offer very good communication/computation overlap.
In Figure 12(b), we study the communication overhead of the MPI_Iallreduce operation, when overlapped with the DGEMM operation, for various DGEMM problem sizes. We can observe that the communication overheads of the two-level approaches increase as we increase the DGEMM problem size, while that of the Flat-Recursive-Doubling approach remains nearly constant. Also, if we replace the DGEMM operation with a simple compute loop that continuously updates a floating-point variable without significant cache accesses, we observe that the communication overheads of all Allreduce operations remain constant. As discussed in Section IV-C, the increased communication overheads with the DGEMM benchmark could be due to process skew, synchronization inside MPI_Wait and cache effects. Interestingly, we observed that the time required for the DGEMM operation to complete varies by as much as 10% across all the processors. With the Two-level schemes, this difference can add to the skew and lead to higher times inside the MPI_Wait operation. If a node-level leader process gets delayed in the DGEMM operation, all the non-leader processes in that node have to wait inside the MPI_Wait operation to synchronize with the leader, which leads to higher MPI_Wait times. However, with the Flat approach, every process does the same amount of work inside the MPI_Iallreduce operation. Within the MPI_Wait operation, processes only need to poll for network completion events, without having to synchronize with any other process. The Flat approach could potentially achieve better communication/computation overlap, because it does not introduce skew and may require fewer cache accesses. We also expect that the overall benefits of the Flat approach could be higher if next generation hardware designs can offer lower latency for executing the calc operations.
VII. PCG SOLVE PERFORMANCE A. Potential for Computation/Communication Overlap in PCG Routines
In Table I , we report the average time required to do the different operations for PCG-Algorithm1 and PCGAlgorithm2. For this experiment, we consider 256 processes, and 216,000 unknowns per process (-n 60 60 60). We can see that Matvec function accounts for most of the time. We also observe that the precond and the gamma-inner product steps in PCG-Algorithm2 are much faster, as discussed in Section V. Both the solvers run for 951 iterations. In each iteration of PCG-Algorithm2, we make two calls to the DiagInvScale function. In our proposed PCG-Overlap algorithm (Section V), we overlap both of these calls with the non-blocking inner product functions. From Table I , we expect about 1.8 msec of compute time to overlap between the call to MPI Iallreduce and the corresponding call to MPI Wait. From Section VI-C, we know that the latency of even the most expensive MPI Iallreduce operation is about 250μs, with 256 processes. Hence, there is enough potential for computation/communication overlap with the PCG-Overlap Algorithm.
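The overlap argument above is simple arithmetic, and the short Python sketch below spells it out. The helper name and the idealized assumption (perfect overlap, no MPI Wait overhead and no process skew) are ours; only the 1.8 msec compute window and the 250 μs latency come from the text.

```python
def overlap_headroom(compute_us: float, allreduce_latency_us: float) -> float:
    """How many times the reduction latency fits into the compute window."""
    return compute_us / allreduce_latency_us


# Figures quoted in the text for 256 processes.
compute_window_us = 1.8 * 1000.0  # about 1.8 msec of overlappable compute
worst_latency_us = 250.0          # most expensive MPI Iallreduce variant

headroom = overlap_headroom(compute_window_us, worst_latency_us)
# Idealized criterion (an assumption): the latency can be hidden whenever
# the compute window is at least as long as the reduction latency.
fully_hidden = compute_window_us >= worst_latency_us

print(f"headroom: {headroom:.1f}x, latency can be hidden: {fully_hidden}")
```

Under these assumptions the compute window is roughly seven times longer than the slowest reduction, which is why the overlap is expected to be feasible.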
B. PCG solver Run-Time Comparison
In this section, we study the run-times of the different algorithms for the PCG solver, across different system sizes and varying numbers of unknowns. PCG-Algorithm1 and PCG-Algorithm2 use the blocking inner product function, which relies on the regular MPI Allreduce operation; in MVAPICH2, this operation uses the recursive-doubling algorithm. Since our re-designed PCG algorithm in Figure 7 uses a non-blocking inner product function, we use one of our proposed MPI Iallreduce designs. We use a slightly adapted version of the ij.c driver program, which is available in Hypre, to invoke the PCG solver for the 27-point Laplace problem. The 27-point Laplace problem solves a Laplace-like problem with a 27-point stencil, i.e., each row has an average of 27 nonzeros. We vary the number of processes from 64 through 512 and study the run-times of the different PCG solver algorithms. In Figures 13, 14, 15 and 16, we vary the number of unknowns per process by using the (-n 40 40 40), (-n 60 60 60), (-n 80 80 80) and (-n 10 10 100) run-time options, respectively.
In Figure 13, we fix the number of unknowns of the Laplace problem by using the -n option as (-n 40 40 40), which leads to 64,000 unknowns per process. With 64 processes, the run-time of PCG-Algorithm1 is about 6.8 seconds, whereas that of PCG-Algorithm2 is about 6.1 seconds. Our proposed PCG algorithm, Overlap-PCG, which uses non-blocking inner products through our network offload based MPI Iallreduce operation, has a run-time of about 6 seconds for both the recursive-doubling and the reduce-bcast schemes. However, with the flat scheme, Overlap-PCG(Flat), the run-time is about 5.4 seconds, which is about 11.5% better than PCG-Algorithm2 and about 20.5% better than PCG-Algorithm1. With a higher number of processes, say 512, Overlap-PCG(Flat)'s run-time is about 11.1 seconds, whereas the default PCG-Algorithm1 requires about 14.2 seconds and the modified PCG-Algorithm2 takes about 12.2 seconds.
We observe a similar trend in Figure 14, with 216,000 unknowns per process (-n 60 60 60). With 64 processes, the Overlap-PCG(Flat) algorithm, which uses the flat MPI Iallreduce scheme, has a run-time of about 27.93 seconds, compared to PCG-Algorithm2's run-time of 33.3 seconds (an improvement of about 16.1%) and PCG-Algorithm1's run-time of 35.6 seconds (an improvement of about 21.6%). The run-time of Overlap-PCG(Flat) is also better than that of Overlap-PCG(Two-Level-RD) and Overlap-PCG(Two-level-Red-Bcast) by about 5%. With 512 processes, the Overlap-PCG(Flat) scheme delivers an improvement of about 5.7% when compared to PCG-Algorithm2 and about 13.6% when compared to PCG-Algorithm1.
As we further increase the number of unknowns per process to 512,000, in Figure 15, Overlap-PCG(Flat) does about 7% better than PCG-Algorithm2 and about 21.5% better than PCG-Algorithm1, with 64 processes. With 512 processes, Overlap-PCG(Flat) performs about 4.5% better than PCG-Algorithm2 and 13% better when compared to PCG-Algorithm1.
Based on these experiments, we observe that the benefits of overlapping the inner product are higher with the Flat MPI Iallreduce design than with the Two-Level-RD or the Two-Level-Red-Bcast schemes. This is consistent with the results we observed in Figure 12(b). Although the Flat scheme has very high communication latency, it could potentially deliver better overlap if there is enough compute to overlap. However, we also observe that the benefits are largest with 64 processes and appear to diminish as we increase the number of processes. This could be attributed to the fact that the latency of the Flat scheme continues to increase as we scale up the number of processes. We note that we could achieve higher efficiency with our proposed PCG-Overlap algorithm, if the network hardware provided improved support for offloading compute operations.
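As a quick check on the problem sizes, the Python sketch below derives the number of unknowns per process from each -n setting. It takes the per-process interpretation of -n used in this section at face value; the dictionary layout and helper name are illustrative assumptions, not part of the ij.c driver.

```python
from math import prod
from typing import Tuple


def unknowns_per_process(n: Tuple[int, int, int]) -> int:
    """Unknowns owned by one process for a local nx x ny x nz grid."""
    return prod(n)


# -n settings used for Figures 13 through 16, as listed in the text.
grids = {
    "Figure 13": (40, 40, 40),
    "Figure 14": (60, 60, 60),
    "Figure 15": (80, 80, 80),
    "Figure 16": (10, 10, 100),
}

if __name__ == "__main__":
    for figure, n in grids.items():
        local = unknowns_per_process(n)
        # Assumption: the per-process size stays fixed as processes are added,
        # so the global problem grows with the process count.
        for processes in (64, 512):
            print(figure, n, local, processes * local)
```

With a fixed per-process size, moving from 64 to 512 processes grows the global problem eightfold, so the run-time comparisons in this section are weak-scaling comparisons under that assumption.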
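The improvement percentages quoted for Figures 13 and 14 can be reproduced from the reported run-times with the usual relative-reduction formula. The Python sketch below does this; the helper name is ours, and because the run-times in the text are rounded ("about"), the recomputed values may differ from the quoted ones in the first decimal place.

```python
def improvement_pct(baseline_s: float, optimized_s: float) -> float:
    """Relative run-time reduction of the optimized run, in percent."""
    return 100.0 * (baseline_s - optimized_s) / baseline_s


# (label, baseline run-time in seconds, Overlap-PCG(Flat) run-time in seconds)
cases = [
    ("Fig. 13, 64 procs, vs PCG-Algorithm2", 6.1, 5.4),
    ("Fig. 13, 64 procs, vs PCG-Algorithm1", 6.8, 5.4),
    ("Fig. 13, 512 procs, vs PCG-Algorithm2", 12.2, 11.1),
    ("Fig. 13, 512 procs, vs PCG-Algorithm1", 14.2, 11.1),
    ("Fig. 14, 64 procs, vs PCG-Algorithm2", 33.3, 27.93),
    ("Fig. 14, 64 procs, vs PCG-Algorithm1", 35.6, 27.93),
]

if __name__ == "__main__":
    for label, baseline, optimized in cases:
        print(f"{label}: {improvement_pct(baseline, optimized):.1f}%")
```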
C. PCG solver Run-Time Analysis
In this section, we report the analysis of the run-time between application-level compute and MPI communication operations. In Figures 17(a), (b) and (c), we consider the [...] In Figure 17(a), we analyze the run-times across the different PCG and MPI Allreduce algorithms, as we keep the number of unknowns per process constant at 64,000. The PCG-Algorithm1 and PCG-Algorithm2 versions use blocking MPI Allreduce and we can see that the overhead of the reduction operation is higher in these cases. The reduction overheads are smaller with the Overlap-PCG-RedBcast and the Overlap-PCG-RD cases, implying that we are seeing benefits through our proposed network offload based designs. We also observe that the overhead of the reduction operation is negligible with the Overlap-PCG-Flat case, which seems to indicate that most of the time for reduction is effectively hidden. In Figures 17(b) and (c), we repeat the same study, as we increase the number of unknowns per process. We can see that with the Overlap-PCG-Flat scheme, the overhead to perform the reduction operation continues to remain significantly smaller than the other alternatives.
With better hardware support, we expect that the Overlap-PCG routine could achieve better efficiency by completely hiding the latency of the reduction operations.
D. Impact of System Noise on PCG Run-Times
In this section, we study the impact of system noise on the performance of PCG solver routines, by considering PCG-Algorithm2 and the Overlap-PCG algorithm based on the Flat MPI Iallreduce operation. We believe that this is a fair comparison, because our earlier set of experiments indicates that PCG-Algorithm2 performs better than PCG-Algorithm1, even though both of them use blocking MPI Allreduce operations, and that the Overlap-PCG algorithm achieved better speed-up with the Flat MPI Iallreduce operation. We rely on a simple daemon that performs matrix-matrix multiplication operations and can be used to inject noise with various durations and frequencies. We schedule this daemon on each core on all the nodes. In this experiment, we use 256 processes and we fix the number of unknowns per process for the PCG solver as 216,000. We vary the noise duration from about 50 μs to about 200 μs and the noise frequency between 20Hz and 1KHz. We expect the noise to affect the performance of the compute phases of both solver routines similarly. However, since PCG-Algorithm2 uses blocking host-based MPI Allreduce operations, its communication times could be affected to a greater extent than the Overlap-PCG algorithm, which uses the network-offload based MPI Iallreduce operation. This is mainly due to the fact that the host processors are not [...] In Figure 18, we compare the performance degradations of both the PCG algorithms, as we vary the noise duration and frequency. We can observe that the relative performance degradation of PCG-Algorithm2 is higher, when compared to the Overlap-PCG version, as the noise becomes longer and more frequent. For example, in the extreme case, the performance of PCG-Algorithm2 degrades by as much as 37%, whereas the Overlap-PCG version degrades by about 30%. We expect the effects of noise to be stronger at larger scales.
VIII. RELATED WORK
Improving computation and communication overlap in parallel applications has traditionally been a topic of great interest [19]. Sancho et al. [20] studied the benefits of using dedicated processors for progressing the global reduction operation and of overlapping the MPI Iallreduce operations in POP, a weather modeling application. Improving the efficiency of the Conjugate Gradient Solvers is a widely studied problem [21]. Hoefler et al. [22] tried to optimize the CG Solver using the CG method by Hestenes and Stiefel [23]. However, the authors noted that they were unable to resolve the data dependency necessary to overlap the global reduction operations. In our work, we leverage the PCG algorithm variants proposed by Demmel et al. [24] and extend their work to achieve communication/computation overlap through non-blocking implementations of the inner product operations, which use our network-offload based MPI Iallreduce operations.
Hoefler et al. proposed using host based techniques for designing non-blocking collective operations [8]. However, host based techniques offer limited performance portability and may not deliver complete overlap. Hemmert et al. demonstrated the benefits of using triggered operations and counting events provided by the Portals 4.0 message passing interface [25]. Additionally, Beckman et al. [26] studied the impact of noise on the performance of collectives by injecting noise. We use a subset of these parameters in our experiments. Graham et al. reported early experiences with the CORE-Direct software API [13]. Subramoni et al. proposed communication primitives for blocking collective operations with CORE-Direct [6]. Previously, we designed a scalable network offload based MPI Ialltoall implementation and demonstrated up to 23% improvement with a parallel 3D FFT library [10]. We have also proposed network-offload based designs for the MPI Ibcast operation and studied the benefits of achieving communication/computation overlap with the HPL benchmark [11]. In this paper, we propose efficient non-blocking designs for the MPI Allreduce operation that scale beyond 512 processes and study the benefits with Preconditioned Conjugate Gradient Solvers in the Hypre software library. We also study the benefits of using network based collectives in the presence of system noise. In our prior work, we observed that our network-offload based solutions offer significantly better performance benefits than host-based non-blocking solutions, such as libNBC [8]. Hence, in this work, we focus more on the different design choices for network offload based MPI Iallreduce and on understanding their behavior with the PCG solver algorithms.
IX. CONCLUSION
In this paper, we designed fully functional, scalable, non-blocking algorithms for global reductions utilizing network-offload technology. We showed that we are able to scale our designs to more than 512 processes and that we achieve near perfect communication/computation overlap. We also re-designed the PCG solver routine to leverage our proposed MPI Iallreduce operation to hide the latency of the global reduction operations. Our proposed Overlap-PCG algorithm does up to 21% better than the default PCG implementation in Hypre; about 16% of these benefits are derived through hiding the latency of the global reductions. All of our current work was based on the ConnectX-2 InfiniBand network interface from Mellanox. We believe that the benefits of our proposed approaches could be higher with better hardware support for offloading reductions to the network. In the future, we wish to explore the benefits of hiding the latency of the Allreduce operations with other solver routines in Hypre. It could also be interesting to study the benefits of our work with real scientific applications that use Hypre's solvers.
