In this paper we present an optimized GPU co-design of the Induced Dimension Reduction (IDR) algorithm for solving linear systems. Starting from a baseline implementation based on the generic BLAS routines from the MAGMA software library, we apply optimizations that are based on kernel fusion and kernel overlap. Runtime experiments are used to investigate the benefit of the distinct optimization techniques for different variants of the IDR algorithm. A comparison to the reference implementation reveals that the interplay between them can succeed in cutting the overall runtime by up to about one third.
INTRODUCTION
Krylov subspace iterative methods [20] are among the most popular methods for solving large sparse linear systems of the form Ax = b which arise in many scientific and engineering fields. Against the background of an increasing number of computer systems featuring hardware accelerators like GPUs [11, 2] , significant research focuses on how these methods can be designed to benefit from the computing power of the accelerators. Possible paths range from outsourcing individual computations to the device, to porting the complete algorithm to the accelerator. A straightforward way to use accelerators in Krylov subspace solvers is to offload all matrix and vector computations to the device using library functions. In many cases, this results in significant acceleration of the algorithm [8] . However, even larger improvements are often available when replacing library-based functions with applicationspecific kernels that keep data in local memory as much as possible. This concept of "kernel fusion" in particular pays off as algorithms working on sparse matrices are typically memory-bound, Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org. and merging multiple linear algebra routines into a single kernel reduces the pressure on the memory bandwidth [6] . Aside from that, the acceleration of the sparse matrix-vector product needed to generate the Krylov subspace [20] , which is often the computationally most expensive part of the algorithms, is often the subject of tuning. Finally, some algorithms allow for overlapping computation and communication by executing multiple kernels concurrently. This, however, not only requires the algorithm to contain computations that are not directly dependent and can be executed in a flexible order, but furthermore will only result in noticeable benefits if not all computations are memory-bound.
Co-HPC2015
Although Krylov subspace solvers are usually memory-bound when applied to sparse problems, they may contain parts that benefit from using concurrent kernel execution. Identifying these parts can be challenging, especially as overlapping kernels can conflict with kernel fusion, e.g., in case two kernels use the same data to update two distinct vectors. In this paper we address the interplay of these two optimization techniques when porting the Induced Dimension Reduction (IDR [22] ) algorithm to NVIDIA GPUs. We structure the rest of the paper as follows. First, in Sect. 2 we provide an overview of related work on strategies for accelerating Krylov subspace methods on GPUs. We then briefly review the IDR algorithm and its variants in Sect. 3. Section 4 and Sect. 5 contain the main contributions of this work:
• We discuss the GPU implementation of IDR and the possible optimization steps such as the fusion of multiple linear algebra operations into a single kernel, and overlapping different computations by running multiple GPU kernels in concurrent fashion.
• We investigate the trade-off between kernel fusion and kernel overlap for a memory-bound matrix addition for different levels of data reuse.
• We apply optimization steps expected to provide benefit to a reference implementation based on BLAS function calls.
• We analyze the performance benefit of the distinct optimization steps for the IDR implementation.
We conclude with a summary of the results and some ideas for future work in Sect. 6.
RELATED WORK
Several open-source software packages provide GPU-support for Krylov subspace solvers [15, 3, 17, 14] . These often provide significantly higher performance than native CPU implementations [8, 12, 13] . For the IDR algorithm, the benefits of GPU-acceleration were investigated in [10] . There is also literature available on the concept of merging multiple linear algebra operations into one single algorithm-specific kernel in order to reduce pressure on the device memory. In [9] the authors have shown that kernel fusion can be realized for BLAS1 and dense BLAS2 operations by using a source-to-source compiler. No automatic fusion of sparse linear algebra operations is addressed, however. In [23] the authors combine CUDA kernels in iterative sparse linear system solvers, but only consider kernels that provide the same functionality and have no dependencies among them. Explicit kernel coding was used in [4] , where the authors have shown how custom-designed kernels improve performance and energy-efficiency of a GPU implementation for the Conjugate Gradient iterative solver. In [6] this idea was transferred to the BiCGSTAB algorithm, and combined with the acceleration of the sparse matrix vector product. Also, a general model estimating the savings was introduced. In [25] the authors take a structured approach to improve performance and energy efficiency by way of kernel fusion. Kernel fusions are categorized into the classes "inner thread," "inner thread block," and "inter thread block," and their effects on performance and energy efficiency are investigated by using two general benchmarks. For sparse linear system solvers, a deeper analysis on the first category can be found in [5] , where a precise characterization of the kernels and the possibility of merging them into one single kernel is presented.
INDUCED DIMENSION REDUCTION
The Induced Dimension Reduction (IDR) algorithm is a robust framework for deriving iterative solvers for large nonsymmetric linear systems of equations. It is based on the Krylov subspace idea, and was first introduced by P. Sonneveld and M. B. Gijzen in [22] . Numerous variants exist to enhance the solver's convergence, stability, or parallelism level [21, 24, 19] . However, they all share the idea of exploiting the following central theorem [22] : THEOREM 1. (IDR) Let A be any matrix in C N×N , let v 0 be any nonzero vector in C N , and let G 0 be the complete Krylov space K N (A, v 0 ). Let P denote any (proper) subspace of C N such that P and G 0 do not share a nontrivial invariant subspace of A, and define the sequence G j , j = 1, 2, ... as
where the ω j are nonzero scalars. Then
One popular variant of Theorem 1 is IDR(s) which can be constructed by considering s independent, standard normally distributed, shadow vectors p 1 , p 2 , ..., p s to solve a smaller system of equations based on the iterative residuals [24] . The smaller system represents a set of polynomials that force the generated residuals to be in subspaces G j , thus enforcing the convergence of x i after, at most N, dimension-reduction steps.
An improved variant of the IDR algorithm is the biortho-variant including smoothing [24] . The approach uses the iteration residuals with the assumption that each residual is the first vector in the next reduced subspace G j+1 . Faster convergence is achieved by taking advantage of the biorthogonality property
where p i are orthonormal shadow vectors. A comprehensive collection of research efforts and a more detailed derivation of the algorithm can be found at [1] .
OPTIMIZING IDR ON GPUS
A GPU implementation for main loop of the IDR(s)-biortho algorithm enhanced with smoothing is provided in pseudocode in Fig. 1 . For convenience, we extracted the smoothing option into Fig. 2 . This implementation uses, exclusively, BLAS1 and BLAS2 functions as part of the MAGMA software library to express the algorithm. This BLAS-based implementation will also serve as a baseline to quantify the architecture-specific tuning in Sect. 5. Our co-design supports heterogeneous execution via a single interface of the MAGMA library which allows mapping the IDR algorithm to both CPU and GPU systems effectively.
The optimization steps we apply to this baseline implementation are the fusion of BLAS functions into algorithm-specific kernels, and the concurrent execution of GPU kernels. In general, these optimization steps are not independent, as overlapped kernels cannot be fused, and fusing two kernels into one removes the possibility of overlapping. Depending on the amount of data reused, the one or other may provide larger benefits. For a memory-bound algorithm, we use a small experiment to analyze this issue. More precisely, we compute two matrix sums, where the second sum partly reuses addends from the first sum, see Figure 3 . A parameter K is used to control how much of the data gets reused: For K = 0 the two sums are identical, for increasing K = 0 the amount of data reused decreases. Using two distinct kernels allows for concurrent execution using different streams. Also, in Figure 3 , both additions are merged into one single kernel, such that data gets reused. Figure 3 reports runtime and achieved memory bandwidth of both options for increasing K. Especially for the smaller problem size, we see some benefit of launching the two kernels in concurrent fashion (left-hand side of Figure 3 ). The fused kernel requires reading from distant memory locations, resulting in a lower memory bandwidth. As long as the addends do not overlap, executing two independent kernels (in concurrent fashion) gives better performance. If data is shared by the two sums, the fused kernel catches up. For 10% overlap (corresponding to reducing the memory transfers by 6%), the runtime of the two independent and overlapped kernels is matched. Completely overlapping the addends reduces the memory transfers by one third, resulting in 33% runtime improvement for this memory-bound operation. We conclude that for memory-bound algorithms, larger performance benefits can be expected from reusing data already read into local memory (kernel fusion) than from overlapping data reload with a previous computation (concurrent execution). Kernels sharing only scalar values, or being completely data-independent, benefit more from overlapping than from fusing, as the overhead of reading a scalar value multiple times is small compared to the benefits of more consecutive memory reads. We apply the two optimization techniques with respect to these premises.
For the kernel fusion, we use the classification proposed in [5] to sort the operations into groups depending on whether the input and output vectors can be mapped to the GPU thread block alignment. We then ignore all kernels that have no data dependency nor share any data beside scalars. Also, we do not consider fusing the sparse matrix vector product generating the Krylov subspace with any other operations. The motivation is that MAGMA supports multiple storage formats for sparse matrices, and we want to keep the flexibility of switching between the respective sparse matrix vector kernels. Although replacing the sequence of axpys to update U in the inner biorthogonalization (line 27 in Fig. 1 ) with a matrix vector product outside this loop does not require the implementation of an algorithm-specific kernel, it may be considered as kernel fusion at it combines a set of BLAS1 functions into one BLAS2 function. i f ( nrmr <= tolb ) r e t u r n ; // check convergence 54 } w h i l e ( numiter < maxiter ); Similarly, the solution approximation update in line 34 ( Fig. 1 ) can be handled by a matrix vector product outside the respective loop if no smoothing is used. Smoothing is essential for a monotone residual decay as can be seen by comparing left-and righthand side in Fig. 5 in the experimental evaluation in Sect. 5. Therefore, we keep smoothing as an option in our algorithm. Customdesigned kernels are required in the smoothing option (Fig. 2) . The computation of t = rs − r (line 2,3) as well as the computation of xs = xs−γ(xs−x) (line 8-10) can be realized with higher efficiency in one kernel, respectively. The baseline implementation also contains, in several places, a set of two consecutive dot products sharing one vector (line 46,47 in Fig. 1 , and line 4,5 in Fig. 2) .
The recently proposed merged dot product [6] allows for fusing these. For this case, the performance benefits not only come from reusing one vector, but also from combining the computationally expensive parallel reduction phases into one. We must emphasize that we do not apply all fusions that are possible, only those that turned out to be beneficial for performance, and do not conflict with kernel overlaps providing larger improvement. Similarly, we do not push concurrency of kernels to the limit. Running concurrent kernels requires switching between the GPU stream and careful synchronization. For runtime improvement, this overhead has to be compensated for by the kernel overlap. Figure 4 shows a dependency graph for one iteration of the IDR(s) algorithm with smoothing enabled. The graphs are colored to represent the different regions of the algorithm (e.g., loops and smoothing), this helps identify which steps are suitable for overlapping. Additional overlapping can be attained if the graph is constructed for s > 1 and multiple iterations. We now outline several key concepts of CUDA-enabled GPUs and the MAGMA library that enable effective 2-way and 3-way concurrency with few explicit synchronization points.
• The default stream is synchronous with respect to CPU and GPU. Any execution in this stream synchronizes all other streams, unless a stream is created with the cudaStreamNonBlocking flag.
• Asynchronous transfers require CPU page-locked memory.
• Concurrency requires operations to be handled by different streams -given that sufficient resources are available.
• Operations get added to streams in issue order; this enforces a synchronization signal between streams but not within them.
• MAGMA reduction operations that return a scalar value (e.g., dot and norm) are synchronous with respect to the CPU.
Based on these concepts, our optimized version makes use of three streams: one using cudaStreamNonBlocking flag, and two nondefault streams. All scalar values are allocated in page-locked memory, and the leading dimension of the matrices are aligned to 32 bytes. Reduction operations are used as implicit synchronization points, kernel-kernel operations are issued in breadth first order, and transfer-kernel-transfer operations are issued in depth first order. Another essential modification to permit maximum concurrency is to rearrange the structure of the loops by unrolling the loop until a dependency that requires a synchronization is hit. Both loop levels in the IDR(s) algorithm were unrolled using this strategy.
In the optimized version, we overlap the following routines and refer to the code shown in Fig. 1 in case smoothing is enabled. The first gemv (lines 4) and general solver (line 10) are taken out of the loop and overlapped with the final smoothing operation. This general solver was transformed into a triangular solver since the matrix M is lower triangular, and its transfer (line 8) is unrolled and overlapped with the first smoothing operation. The solution approximation of the next inner iteration (line 34) is handled concurrently with the biorthogonalization loop, and for cases with s > 1, the update of f as well. A 3-way overlap occurs after the biorthogonalization loop between the gemv (line 30), scalar transfer (line 32), and the update of U (line 27) using a gemv kernel. It is worth mentioning that residual and solution updates in the smoothing operation are overlapped as well as memory transfers of internal work arrays. Finally, the next iteration computations of lines 4,7,10 are executed in concurrent fashion with the current residual updates and convergence check (lines 51,53). After these modifications, the optimized IDR(s) variant requires explicit synchronization calls for the nonblocked stream at the end of both the shadow space and main loops. The remaining streams make use of implicit synchronization points provided by scalar reduction computations.
EXPERIMENTAL RESULTS

Experiment setup
The GPU results for this paper are obtained from a Tesla K40 GPU which belongs to the Kepler line of NVIDIA's hardware accelerators and has a theoretical peak performance of 1,682 GFlop/s (double precision). On the GPU, the 12 GB of main memory, accessed at a theoretical bandwidth of 288 GB/s, is sufficiently large to keep all the matrices and all the vectors needed in the iteration process. The baseline GPU results were obtained from using the iterative solver package of the MAGMA open source software library [14] linked to NVIDIA's cuBLAS [18] and cuSPARSE library [17] Table 1 : Key characteristics of the test matrices ordered according to their dimension.
7.0 [16] . For the experiments, we use a set of test matrices taken from the University of Florida matrix collection (UFMC, [7] ). The IDR algorithm does not require the system matrix to be symmetric. We selected a mix of symmetric and nonsymmetric matrices to cover a broad spectrum with respect to dimension and sparsity (see Table 1 for some key characteristics). The specific sparsity pattern and symmetry characteristics are not relevant for this performance evaluation as the implementation of the sparse matrix vector product remains outside the scope of the optimization process.
Convergence and Performance
To ensure the algorithm's correctness, in Fig. 5 we compare the convergence of the basic GPU implementation with the convergence of a MATLAB reference implementation taken from [1] . We analyze shadow space dimensions 1, 2, 4, and 8. The left-hand side of Fig. 5 shows the jagged convergence pattern for both implementations, which is characteristic for the basic IDR(s) algorithm. While already small rounding differences in the computation of dot products result in significant differences concerning the residual for a specific iteration count, the implementations have very similar average convergence rates. This becomes more obvious when enabling the smoothing option, see the right-hand side of Fig. 5 .
To quantify the overhead of using larger shadow space dimensions, in Table 2 we list the actual runtimes (and normalized to the shadow space dimension 1) needed by the baseline GPU implementation to execute 100 iterations. The solution of the small dense systems required for the minimization process results in a superlinear runtime increase for larger shadow space dimensions. We notice that the relative runtime increase is smaller when smoothing is enabled.
Next, we apply the optimization steps outlined in Sect. 4. Fig. 6 reports the runtime improvement obtained from of the distinct optimization levels of the GPU implementation over the CPU code. Again, we address both, the basic IDR(s) (left-hand side plots) and the IDR(s) using smoothing (right-hand side plots). The graphs in the first line illustrate the runtime improvement obtained from kernel fusion: we replaced the generic BLAS functions with the algorithm-specific kernels introduced in Sect. 4. Kernel fusion provides larger improvements for the smoothed IDR(s), as algorithmspecific kernels significantly reduce the memory transfers in the smoothing operation. Furthermore, we observe that the algorithmspecific kernels are in particular beneficial when working with small shadow space dimensions s. The bottom of Fig. 6 combines the kernel fusion with kernel overlap. As expected, concurrent kernel execution is in particular attractive when overlapping the loops for the distinct shadow spaces.
Furthermore, concurrent kernel execution provides larger benefits to the unsmoothed IDR(s). This is expected as combining the Table 2 : Absolute runtimes in seconds (and normalized to the runtime for shadow space dimension 1) for 100 iterations of the basic GPU implementation of IDR(s).
x-updates (line 34 in Figure 1 ) into a matrix vector product outside the shadow space loop allows to completely overlap this operation.
Looking at the interplay of the optimization steps, larger runtime reduction (up to 33%) can be achieved for smoothing enabled. As expected, the benefits are larger when targeting smaller test matrices, as for those the cost of the sparse matrix vector product is not necessarily dominating the algorithm.
SUMMARY
In this paper, we have shown how a GPU implementation of the IDR(s) algorithm and an enhanced variant featuring a smoothing step for better convergence properties can be accelerated by applying kernel fusion and the concept of overlapping kernels. A runtime analysis revealed that custom-designed kernels are particularly attractive for small shadow space dimensions, while kernel overlap provides significant benefits when running the smoothed variant in combination with large shadow space dimensions. Combining the optimization steps we succeed in cutting the overall runtime by up to about one third.
In future, we will look into operations that are not memory bound for a more comprehensive study on whether kernel fusion or kernel overlap should be preferred. The overall goal is a theoretical model that allows to identify the optimal choice depending on the algorithm and hardware characteristics.
