Abstract. We discuss an approach for solving sparse or dense banded linear systems Ax = b on a Graphics Processing Unit (GPU) card. The matrix A ∈ R N ×N is possibly nonsymmetric and moderately large; i.e., 10 000 ≤ N ≤ 500 000. The split and parallelize (SaP) approach seeks to partition the matrix A into diagonal sub-blocks A i , i = 1, . . . , P , which are independently factored in parallel. The solution may choose to consider or to ignore the matrices that couple the diagonal sub-blocks A i . This approach, along with the Krylov subspace-based iterative method that it preconditions, are implemented in a solver called SaP::GPU, which is compared in terms of efficiency with three commonly used sparse direct solvers: PARDISO, SuperLU, and MUMPS. SaP::GPU, which runs entirely on the GPU except several stages involved in preliminary row-column permutations, is robust and compares well in terms of efficiency with the aforementioned direct solvers. In a comparison against Intel's MKL, SaP::GPU also fares well when used to solve dense banded systems that are close to being diagonally dominant. SaP::GPU is publicly available and distributed as open source under a permissive BSD3 license.
1. Introduction. Previously used in niche applications and by a small group of enthusiasts, general purpose computing on graphics processing unit (GPU) cards has gained widespread popularity after the release in 2007 of the CUDA programming environment [35] . Owing also to the release of the OpenCL specification [40] in 2008, GPU computing has been rapidly adopted by numerous groups with computing needs originating in a broad spectrum of application areas. In several of these areas though, when compared to the library ecosystem enabling sequential and/or parallel computing on x86 chips, GPU computing library support continues to be spotty. This observation motivated an effort whose outcomes are reported in this paper, which is concerned with solving sparse linear systems of equations on the GPU.
Developing an approach and implementing parallel code for solving sparse linear systems is not trivial. This, and the relative novelty of GPU computing explain the scarcity of solutions for solving Ax = b on the GPU, when A ∈ R N ×N is possibly nonsymmetric, sparse, and moderately large; i.e., 10 000 ≤ N ≤ 500 000. An inventory of software solutions as of 2015 produced a short list of codes that solved Ax = b on the GPU: cuSOLVER [7] , Paralution [1], and SuperLU [16] , the latter focused on distributed memory architectures and leveraging GPU computing at the node level only. Several CPU multi-core approaches exist and are well established, see for instance [4, 43, 8, 16] . For a domain-specific application implemented on the GPU that calls for solving Ax = b, one alternative would be to fall back on one of these CPU-based solutions. This strategy usually impacts the overall performance of the algorithm due to the back-and-forth data movement across the PCI host-device interconnect, which in practice supports bandwidths of the order of 10 GB/s. Herein, the focus is not on this strategy. Instead, we are interested in carrying out the LU factorization on the GPU when the possibly nonsymmetric matrix A is sparse or dense banded with narrow bandwidth.
There are pros and cons to having a linear solver on the GPU. On the upside, since a parallel implementation of a LU factorization is memory bound, particularly for sparse systems, the GPU is attractive owing to its high bandwidths and relatively low latencies. At main-memory bandwidths of roughly 300 GB/s, the GPU is four to five times faster than a modern multicore CPU. On the downside, the irregular memory access patterns associated with sparse matrix factorization ablate this GPU-over-CPU advantage, which is further eroded by the intense logic and integer arithmetic requirements associated with existing algorithms. The approach discussed herein alleviates these two pitfalls by embracing a splitting strategy described for CPU-centric multicore and/or multi-node computing in [38] . Two successive rowcolumn permutations attempt to increase the diagonal dominance of the matrix and reduce its bandwidth, respectively. Ideally, the reordered matrix would be (i) diagonal dominant, and (ii) dense banded. If (i) is accomplished, no LU factorization row/column pivoting is necessary, thus avoiding tasks at which the GPU does not shine: logic and arithmetic operations. Additionally, if (ii) holds, coalesced memory access patterns associated with dense matrix operations can capitalize on the GPU's high bandwidth.
The overall solution strategy adopted herein solves Ax = b using a Krylovsubspace method and employs LU preconditioning with work-splitting and drop-off. Specifically, each outer Krylov-subspace iteration takes at least one preconditioner solve step that involves solvingÂy =b on the GPU, whereÂ ∈ R N ×N is a dense banded matrix obtained from A after a sequence of possibly two reordering stages that can include element drop-off. Regardless of whether A is sparse or not, the salient attribute of the approach is the casting of the preconditioning step as a dense linear algebra problem. Thus, a reordering process is employed to obtain a narrowband, denseÂ, which is subsequently LU-factored. For the reordering, a strategy that combines two stages, namely diagonal dominance boosting and bandwidth reduction, has yielded well balanced coefficient matrices that can be factored fast on the GPU leveraging a single instruction multiple data (SIMD)-friendly underlying data structure. The LU factorization relies on a splitting of the matrixÂ in several diagonal blocks that are factored independently and a correction process to account for the inter-diagonal block coupling. The implementation takes advantage of the GPU's deep memory hierarchy, its multi-SM layout, and its predilection for SIMD computation.
This paper is organized as follows. Section 2 summarizes the solution algorithm. The discussion covers first the work-splitting-based LU factorization of dense banded matrices. Subsequently, the Ax = b sparse case brings into focus strategies for matrix reordering. Section 3 summarizes aspects related to the GPU implementation of the solution approaches proposed. Results of a series of numerical experiments for both dense banded and sparse linear systems are reported in Section 4. Since reordering strategies play a pivotal role in the sparse linear system solution, we present benchmarking results in which we compared the reordering strategies adopted herein to established solutions/implementations. The paper concludes with a series of final remarks and a summary of lessons learned and directions of future work.
Description of the methodology.
2.1. The dense banded linear system case. Assume that the banded dense matrix A ∈ R N ×N has half-bandwidth K N . Following an approach discussed in [42, 38, 39] , we partition the banded matrix A into a block tridiagonal form with P diagonal blocks A i ∈ R Ni×Ni , where P i N i = N . For each partition i, let B i , i = 1, . . . , P − 1 and C i , i = 2, . . . , P be the super-and sub-diagonal coupling blocks, respectively -see Figure 2 .1. Each coupling block has dimension K × K for banded matrices with half-bandwidth K = max i,j,aij =0 |i − j|.
As illustrated in Fig. 2 .1, the banded matrix A is expressed as the product of a block diagonal matrix D and a so-called spike matrix S [42] . The latter is made up of identity diagonal blocks of dimension N i , and off-diagonal spike blocks, each having K columns. Specifically, Solving the linear system Ax = b is thus reduced to solving
Since D is block-diagonal, solving for the modified right-hand side g from (2.3) is trivially parallelizable, as the work is split across P processes, each charted to solve A i g i = b i , i = 1, . . . , P . Note that the same decoupling is manifest in Eq. (2.2) , and the work is spread over P processes.
The remaining question is how to solve quickly the linear system in (2.4) . This problem can be reduced to one of smaller size,Ŝx =ĝ. To that end, the spikes V i and W i , as well as the modified right-hand side g i and the unknown vectors x i in (2.4) are partitioned into their top K rows, the middle N i − 2K rows, and the bottom K rows:
A block-tridiagonal reduced system is obtained by excluding the middle partitions of the spike matrices as:
. .
where the linear system above, denotedŜx =ĝ, is of dimension 2K(P − 1) N ,
I M , i = 1, . . . , P − 1 (2.7b)
, i = 1, . . . , P − 1 .
Two strategies are proposed in [38] to solve (2.6): (i ) an exact reduction; and, (ii ) an approximate reduction, which sets N i ≡ 0 and M i ≡ 0 and results in a block diagonal matrixŜ. The solution approach adopted herein is based on (ii) and therefore each sub-system R ixi =ĝ i is solved independently using the following steps:
Note that a tilde was used to differentiate between the actual and approximate values x (t) i andx (b) i obtained upon dropping the N i and M i terms. An approximation of the solution of the original problem is finally obtained by solving independently and in parallel P systems using the available LU factorizations of the A i matrices:
Computational savings can be made by noting that if an LU factorization of the diagonal blocks A i is available, the bottom block of the right spike; i.e. V
(b)
i , can be obtained from (2.2a) using only the bottom K × K blocks of L and U. However, obtaining the top block of the left spike requires calculating the entire spike W i . An effective alternative is to perform an additional UL factorization of A i , in which case W (t) i can be obtained using only the top K × K blocks of the new U and L.
Next, note that the decision to set N i ≡ 0 and M i ≡ 0 relegates the resulting algorithm to preconditioner status. Embracing this path is justified by the following observation that although the dimension of the reduced linear system in (2.6) is smaller that that of the original problem, its half-bandwidth is at least three times larger. The memory footprint of exactly solving (2.6) is large, thus limiting the size of problems that can be tackled on the GPU. Specifically, at each recursive step, additional memory that is required to store the new reduced matrix cannot be deallocated until the global solution is fully recovered.
Finally, it becomes apparent that the quality of the preconditioner is correlated to neglecting the N i and M i terms. For the sake of this discussion, assume that the matrix A is diagonally dominant with a degree of diagonal dominance d ≥ 1; i.e., (2.11) |a
When d > 1, the elements of the left spikes W i decay in magnitude from top to bottom, while those of the right spikes V i decay from bottom to top [33] . This decay, which is more pronounced the larger the degree of diagonal dominance of A, justifies the approximation N i ≡ 0 and M i ≡ 0. However, note that having A be diagonal dominant, although desirable, it is not a prerequisite as demonstrated by numerical experiments reported herein. Truncating when d < 1 will lead to a preconditioner of lesser quality.
2.1.1. Nomenclature, solution strategies. Targeted for execution on the GPU, the methodology outlined above becomes the foundation of a parallel implementation called herein "split and parallelize" (SaP). The matrix A is split into block diagonal matrices A i , which are processed in parallel. The code implementing this strategy is called SaP::GPU. Several flavors of SaP::GPU can be envisioned. At one end of the spectrum, one solution path would implement the exact reduction, a strategy that is not considered herein. At the other end of the spectrum, SaP::GPU solves the block-diagonal linear system in 2.3 and for preconditioning purposes uses the approximation x ≈ g. In what follows, this will be called the decoupled approach, SaP::GPU-D. The middle ground is the approximate reduction, which sets N i ≡ 0 and M i ≡ 0. This will be called the coupled approach, SaP::GPU-C, owing to the coupling that occurs through the truncated spikes; i.e., V
Neither the coupled nor the decoupled paths qualify as direct solvers and SaP::GPU employs an outer Krylov subspace scheme to solve Ax = b. The solver uses BiCGStab( ) [46] and left-preconditioning, unless the matrix A is symmetric and positive definite, in which case the outer loop implements a conjugate gradient method [41] . SaP::GPU is open source and available at [2, 3] .
2.2. The sparse linear system case. The discussion focuses next on solving A s x = b, where A s ∈ R N ×N is assumed to be a sparse matrix. The salient attribute of the solution strategy is its fallback on the dense banded approach described in §2.1. Specifically, an aggressive row and column permutation process is employed to transform A s into a matrix A that has a large d and small K. Although the reordered matrix will remain sparse within the band, it will be regarded to be dense banded and LU-and/or UL-factored accordingly. For matrices A s that are either nonsymmetric or have low d, a first set of row permutations is applied as QA s x = Qb, to either maximize the number of nonzeros on the diagonal (maximum traversal search) [19] , or maximize the product of the absolute values of the diagonal entries [20, 21] . Both reordering algorithms are implemented using a depth first search with a look-ahead technique similar to the one in the Harwell Software Library (HSL) [4] .
While the purpose of the first reordering QA s is to render the permuted matrix diagonally "heavy", a second reordering seeks to reduce K by using the traditional Cuthill-McKee CM algorithm [14] . Since the diagonal entries should not be relocated, the second permutation is applied to the symmetric matrix QA s + A T s Q T . Following these two reorderings, the resulting matrix A is split to obtain A 1 through A P . A third CM reordering is then applied to each A i for further reduction of bandwidth. While straightforward to implement in SaP::GPU-D, this third stage reordering in SaP::GPU-C mandates computation of the entire spikes, an operation that can significantly increase the memory footprint and flop count of the numerical solution. Note that third stage reordering in SaP::GPU-C renders the UL factorization superfluous since computing only the top of a spike is insufficient.
If A i is diagonally dominant, the LU and/or UL factorization can be safely carried out without pivoting [24] . Adopting the strategy used in PARDISO [44] , we always perform factorizations of the diagonal blocks A i without pivoting but with pivot boosting. Specifically, if a pivot becomes smaller than a threshold value, it is boosted to a small, user controlled value . This yields a factorization of a slightly perturbed diagonal block, L i U i = A i + δA i , where δA i = O(u A ) and u is the unit roundoff [32] .
2.2.1. Brief comments on the reordering algorithms. SaP::GPU employs two reordering strategies, namely Diagonal Boosting (DB) and Cuthill-McKee (CM), possibly multiple times, to reduce K and increase the degree of diagonal dominance. DB is applied first at the matrix A s level, followed by CM applied at matrix level, and possibly followed by a set of P third-stage CM reorderings applied at the sub-matrix A i level. Diagonal Boosting. The DB algorithm seeks to improve diagonal dominance in A s and draws on a minimum bipartite perfect matching [12, 28, 11, 13, 17, 26] . There are several variants of the algorithm aimed at different outcomes, e.g., maximizing the absolute value of bottleneck, the sum, the product or other metrics that factor in the diagonal entries. As a proxy for diagonal dominance, SaP::GPU maximizes the absolute value of the product of all diagonal entries.
The algorithm that seeks to leverage GPU computing is as follows. Given a matrix {a ij } n×n , find a permutation σ that maximizes n i=1 |a iσi |. Denoting a i = max j |a ij | and noting that a i is an invariant of σ, then we are to minimize
(log a i − log |a iσi |) .
The reordering problem is reduced to minimum bipartite perfect matching in the following way: given a bipartite graph G C = (V R , V C , E), we define the weight c ij of the edge between nodes i ∈ V R and j ∈ V C as (2.12)
If we are able to find a minimum bipartite perfect matching σ such that c iσi is minimized, according to the process of reduction above, then
Bandwidth reduction. Whether QA s is sparse or not, there are P − 1 pairs of always dense spikes, each of dimension N i × K. They need to be stored unless one employs an LU and UL factorization of A i to retain only the appropriate bottom and top components. Large K values pose memory challenges; i.e., storing and data movement, that limit the size of the problems that can be tackled. Moreover, the spikes need to be computed by solving multiple right-hand side linear systems with A i coefficient matrices. There are 2K such systems for each of the P −1 pairs of spikes. Evidently, a low K is highly desirable. However, finding the lowest half-bandwidth K by symmetrically reordering a sparse matrix is NP-hard. The CM reordering provides simple and oftentimes effective heuristics to tackle this problem. Moreover, as the CM reordering yields symmetric permutations, it will not displace the "heavy" diagonal terms obtained during the DB step. However, to obtain a symmetric permutation, one has to start with a symmetric matrix. To this end, unless A is already symmetric and does not call for a DB step (which is the case, for instance, when A is symmetric positive definite), the matrix passed over for CM reordering is (A + A T )/2. Given a symmetric n × n matrix with m non-zero entries CM works on its adjacency matrix. CM first picks a random node and adds the node to the work list. Then the algorithm repeats sorting all its neighboring nodes with non-descending vertex degree and adding them until all vertices have been added and removed once from the work list. In other words, CM is essentially a BFS where neighboring vertices are visited in order from lowest to highest vertex degree. Third-stage reordering. The DB-CM reordering sequence yields diagonally-heavy matrices of smaller bandwidth. The band itself however can be very sparse. The purpose of the third-stage CM reordering is to further reduce the bandwidth within each A i and reduce the sparsity within the band. Consider, for instance, the matrix ANCF88950 that comes from structural dynamics [45] . It has 513 900 nonzeros, N = 88 950, and an average of 5.78 non-zero elements per row. After DB-CM reordering with no drop-off, the resulting banded matrix has a half-bandwidth K = 205. The band itself is very sparse with a fill-in of only 0.7% within the band. In its default solution, SaP::GPU constructs a block banded matrix where each diagonal block A i , obtained after the initial DB-CM reorderings, is allowed to have a different bandwidth. This is achieved using another CM pass, independently and in parallel for each A i . Applying this strategy to ANCF88950, using P = 16 partitions, the half bandwidth is reduced for all partitions to values no higher than K = 141, while the fill-in within the band becomes approximately 3%.
Note that this third-stage reordering does nothing to reduce the column-width of the spikes. However, it helps in two respects: a smaller memory footprint for the LU/UL factors, and less factorization effort. These are important side effects, since the LU/UL GPU factorization is currently done in-core considering A i to be dense within the band.
Brief implementation details.
3.1. Dense banded matrix factorization details. This subsection provides implementation details regarding how the P partitions A i are determined, how the banded matrix A is stored, and how the LU/UL steps are implemented on the GPU.
Number of partitions and partition size. The selection of P must strike a balance between two conflicting requirements. On the one hand, having a large P is attractive given that the LU/UL factorization of A i for i = 1, . . . , P can be done independently and simultaneously. On the other hand, this negatively impacts the quality of the resulting preconditioner, due to the approximations in evaluating the spikes corresponding to the coupling of the diagonal blocks A i and A i+1 . Since this adversely impacts the quality of the resulting preconditioner, a high P could lead to poor preconditioning and an increase in the number of iterations to convergence. In the current implementation, no attempt is made to automate this selection and some experimentation is required.
Given a P value, the size of the diagonal blocks A i is selected to achieve load balancing. The first P r partitions are of size N/P + 1, while the remaining are of size N/P , where N = P N/P + P r .
Matrix storage. For general dense banded matrices A i , we adopt a "tall and thin" storage in column-major order. All diagonal elements are stored in the K-th column. The rest of the elements are correspondingly distributed columnwise. This strategy, shown below for a matrix with N = 8 and K = 2, groups the operands of the LU/UL factorizations and allows coalesced memory accesses that can fully leverage the GPU's bandwidth. 
LU/UL factorizations. The solution strategy pursued calls for an LU and an optional UL factorization of each dense banded diagonal block A i . The implementation requires a certain level of synchronization since for each A i , the factorization, forward elimination, and backward substitution phases each consist of N i − 1 dependent steps that need to be choreographed. One aggravating factor is the GPU lack of native, low overhead, support for synchronization between threads running in different blocks.
The established GPU strategy for inter-block synchronization is "exit and launch a new kernel". This guarantees synchronization at the GPU-grid level at the cost of non-negligible overhead. In a trade-off between minimizing the overhead of kernel launches and maximizing the occupancy of the GPU, we established two execution paths: one for K < 64, the second one for larger bandwidths. As a side note, the threshold value of 64 was selected through numerical experimentation over a variety of problems and is controlled by the number of threads that can be organized in a block in CUDA [35] . For K < 64, the code was designed to reduce the kernel launch count. Instead of having N i −1 kernel launches, each completing a step of the factorization of A i = L i U i by updating entries in a (K + 1) × (K + 1) window of elements, a single kernel is launched to factor A i . It uses min(K 2 , 1024) threads per block and relies on lowoverhead stream-multiprocessor synchronization support within the block, without any need for global synchronization. In a so-called window-sliding method, at each step of the factorization; i.e., during the process of computing column entries in L and row entries of U, each thread updates a fixed number of A i entries. On current GPU hardware, this fixed number is between 1 and 4. Once all threads in the block complete their work, they are synchronized and the (K + 1) × (K + 1) window slides down by one row and to the right by one column. The value 4 is explained as follows. Assume that K = 63. Then, the sliding window has size 64 × 64. Since the twodimensional GPU thread block size is 1024 = 32 × 32, each thread will handle four entries of the window of focus.
For K ≥ 64, SaP uses multiple blocks of threads to update L and U entries. On the upside, there are more threads working on the window of focus. On the downside, there is overhead associated with leaving and reentering the kernel, a process that has the side effect of flushing the shared memory and registers. The window is larger than K × K, and it slides at a stride of eight; i.e., moves down by eight rows and to the right by eight columns upon exiting and reentering the LU factorization kernel. Use of registers and shared memory. If the user decides to employ a third-stage reordering, the coupling sub-blocks B i and C i are used to compute the entire spikes in a scheme that renders a UL factorization superfluous. Then, B i and C i are each first partitioned into sub-blocks of dimension L × K where L is at most 20. Each forward/backward sweep to get the spikes is unrolled, and in each iteration of the new loop, one entire sub-block, rather than a vector of length K, is calculated. To this end, the corresponding elements in the matrix A i are pre-fetched into shared memory and the entries of the sub-block are preloaded into registers. This strategy, in which all operations to calculate the spikes draw on registers and shared memory, leads to 50% to 70% improvement in performance when compared with an alternative that calculates the spike elements in a loop without leveraging the low latency/high bandwidth of the GPU register file and shared memory. Mixed Precision Strategy. The solution uses a mixed-precision implementation by falling back on single precision for the preconditioner and switching to double precision arithmetic in the outer BiCGStab(2) calculations. A battery of tests indicate that this strategy results in a 50% average reduction in time to solution when compared with an approach where all calculations are performed in double precision.
3.2. DB reordering implementation details. SaP::GPU organizes the DB algorithm into four stages, DB-S1 through DB-S4. Due to differences in the nature and degree of parallelism of these stages, DB implements a hybrid strategy; namely, it relies on GPU computing for DB-S1 and DB-S4 and on CPU computing for DB-S2 and DB-S3. A thorough discussion of the implementation is provided in [31] . Therein, a solution that kept the entire DB implementation on the GPU was discussed and deemed decisively slower than the hybrid strategy adopted here. DB-S1: form bipartite graph. This stage assembles a matrix that mirrors the structure of the original sparse matrix. The sparsity pattern of the input matrix is maintained and the values of its nonzero entries are modified according to Eq. (2.12). The stage is highly parallel and involves: (1) calculating for each row of the original matrix the max absolute value, and (2) updating each value to form the weighted bipartite graph.
DB-S2: find initial partial match. This stage is not mandatory but the availability of an initial partial match as a starting point for the next stage was found to considerably reduce the running time for the overall algorithm [31] . Like in [12] , after setting u i = min j c ij and v j = min i (c ij − u i ), we try to match as many pairs of nodes as possible. The matched nodes (i, j) should satisfy u i + v j = c ij . This yields augmenting paths of length one. This stage, which was implemented to execute in parallel, was compute intensive as it had to resolve scenarios where multiple column nodes would match the same row node. A CPU parallel implementation was found to be more suitable owing to intense integer arithmetic and control flow overhead.
DB-S3: find perfect match. Finding matches in a bipartite graph G C is equivalent to finding the shortest paths in an associated reduced graph. Omitting some of the details, the shortest path problem is tackled using Dijkstra's algorithm [18] , which is applied to all nodes i that are unmatched in the initial partial match obtained in DB-S2. This ensures that all row nodes, and therefore all column nodes, are eventually matched. The theoretical complexity of this stage is O(n · (m + n) · log n), where n and m are the dimension and number of nonzeros in the input matrix, respectively. However, thanks to the preprocessing DB-S2, actual run times for finding a perfect match are acceptable in all situations and this stage is the DB bottleneck only for about half of the matrices tested [31] . DB-S4: extract permutation and scaling factors. The matrix permutation can be obtained directly from the resulting perfect match: if the row node i was matched to the column node j then rows (or columns) i and j must be permuted. Optionally, scaling factors can be calculated and applied to rows and columns in order to bring the matrix to a so-called I-matrix form; i.e., a matrix with 1 or −1 on the diagonal and off-diagonal elements of absolute value less than 1, see [36] . This stage is highly parallelizable and amenable to GPU computing.
3.3.
CM reordering implementation details. The unordered CM algorithm, which draws on an approach described in [27] , is separated into three stages, CM-S1 through CM-S3. A high quality reordering calls for several BFS iterations, which are called herein "CM iterations". Just like the DB implementation, the CM solution (i) is hybrid -the overall algorithm leverages both CPU and GPU computing; and, (ii) it uses CPU-GPU unified memory, a recent CUDA feature [34] , to provide for a simple and transparent memory management process. The latter feature allows the CUDA runtime to transparently manage the CPU-GPU data migration as the computation switches back and forth between the CPU and GPU. Since no explicit, programmer initiated, data transfer is required, the code is cleaner and more concise. CM-S1: pre-processing. The first stage is implemented on the GPU to accomplish two objectives. First, it produces the data structure that is worked upon. As the input matrix A is not guaranteed to be symmetric, the sparse matrix structure for (A + A T )/2 is produced in anticipation of the subsequent two stages of the algorithm. Second, in order to avoid repetitively sorting the neighbors of a given node, the nodes with the same row indices are pre-sorted by ascending vertex degree of column index.
CM-S2: perform standard BFS. After experimenting with the implementation, the strategy adopted started from several nodes and in parallel performed what would be a traditional CM-S2 & CM-S3 combo. The alternative of considering one node only, namely the node with the smallest vertex degree, yields a second level BFS tree with fewer nodes. Eventually, the resulting BFS tree will likely be "tall and thin". Starting from several nodes and completing the reordering process for each of them increases the likelihood of avoiding a "bad" initial node. In practical terms, owing to the use of parallel computing, this strategy yields smaller bandwidths at a modest increase in computational overhead.
For each starting node, a standard BFS pass yields the levels of all nodes in the BFS tree. Since the order of nodes at the same level is not critical in this stage, parallel computing can help by concurrently visiting the neighbors of all nodes at the previous level. We use an outer loop to iterate over the levels, and in each iteration, depending on the number of nodes n p added in the previous iteration, we decide whether this iteration is executed on the GPU or CPU. The heuristics used are as follows: a kernel handles the iteration on the GPU only if n p ≥ 10. There are two notable implementation details. First, the CM iterations are executed sequentially. After each iteration, we select the node at the previous level with the lowest vertex degree which has not yet been selected yet. If no such nodes exist; i.e., all nodes at the last level have been selected as starting nodes in previous iterations, a random node which has not been considered is selected. Second, the CM iterations terminate either when the height of the BFS tree does not increase, or when the maximum number of nodes over all levels does not decrease compared with the candidate optimal found so far. This strategy is proposed in [37] with the caveat that we only consider the leaf with the minimum degree. From practical experience, these heuristics lead to an algorithm that for most matrices terminates within three CM iterations.
CM-S3: reorder nodes. The previous stage determines the level of each node. Roughly speaking, nodes are ordered in ascending order, from level 0 up to the maximum level m l and memory space can be pre-allocated for nodes at each level. Parallel computing is leveraged by observing that the order of nodes at level l depends only on the order of nodes at level l − 1. To that end, a pair of read/write pointers is set for each level, and except for level 0, the read/write pointers of each level will point to the starting position of the level's pre-allocated space. We say a thread "works on" level l if it reads nodes at level l and writes their neighbors that are at level l +1. Thus the execution thread working on level l will read and modify the read pointer of level l and the write pointer of level l + 1, and it will only read the write pointer of level l. Once the thread finishes reading all nodes at level l, it moves on to another level; otherwise it repeats checking whether or not the thread working on level l − 1 has written nodes which it has not processed by checking if the read pointer at level l lags the write pointer at level l. If yes, the thread working on level l processes these nodes, i.e., writes their neighbors with level l + 1, and goes back to checking again whether it has finished processing or not; otherwise, it spins and waits for the thread working on the previous level. Note that the parallelism in CM-S3 is rather coarse-grained and proved to be better suited for execution on the CPU.
3.4.
SaP::GPU-components and computational flow. In the absence of column/row reordering before the LU factorization and pivoting during the factorization, the SaP::GPU dense banded linear system solver is straightforward to implement. Upon partitioning A into diagonal blocks A i , each A i is subject to an LU factorization that requires an amount of time T LU . Next, in T BC time, the coupling block matrices B i and C i are extracted on the GPU. The V i and W i spikes are subsequently computed in an operation that requires T SP K time. Afterwards, in T LU rdcd time, the spikes are truncated and the steps outlined in Eq. (2.9) are taken to produce the intermediary valuesx
i . At this point, the pre-processing step is over and two sets of factorizations, for A i andR i , are available for preconditioning during the iterative phase of the solution. The amount of time spent iterating is T Kry , the iterative methods considered being BiCGStab(2) and conjugate gradient.
The sparse linear system solution is slightly more convoluted at the front end. A sequence of two permutations, DB requiring T DB and CM requiring T CM time, are carried out to increase the size of the diagonal elements and reduce bandwidth, respectively. An additional amount of time T Drop might be spent to drop off-diagonal elements in order to decrease the bandwidth of the reordered A matrix. Since the DB and CM reorderings are hybrid, T Dtransf is used to keep track of the overhead associated with moving data back and forth between the CPU and GPU during the reordering process. An amount of time T Asmbl is spent on the GPU in book-keeping required to turn the reordered sparse matrix into a dense banded matrix. The process described above is summarized in Fig. 3 .1. The boxes in gray are associated with the solution of a dense banded linear system. For a sparse linear system solve that uses a coupled approach; i.e., SaP::GPU-C, the total time is T T otSparse = T P repSp + T T otDense , where T P repSp = T DB + T CM + T Dtransf + T Drop + T Asmbl and T T otDense = T LU + T BC + T SP K + T LU rdcd + T Kry . For SaP::GPU-D, owing to the decoupled nature of the solution, T T otDense = T LU + T Kry , where T LU includes an CM process that reduces the bandwidth of each A i . The names introduced; i.e., T DB , T CM , T LU rdcd , etc., are referenced in the profiling study discussed in §4.3.1 and used ad verbum on the SaP::GPU web-page [3] to report profiling results for approximately 120 linear systems.
4. Numerical Experiments. The next three subsections summarize results from three numerical experiments concerned, in this order, with the solution of dense banded linear systems, sparse matrix reordering, and the solution of sparse linear systems. The subsection order is meant to emphasize that dense banded linear system solution and matrix reordering are two prerequisites for an effective sparse linear system implementation in SaP::GPU. The hardware/software setup for these numerical experiments is as follows. The GPU used was Tesla K20X [6, 5] . SaP::GPU uses CUDA 7.0 [35] , cusp [9] , and Thrust [25] . The CPU used was the 3GHz, 25 MB last level cache, Intel Xeon E5-2690v2. The node used hosted two such CPUs, which is the maximum possible for this type of chip, for a total of 20 cores executing up to 40 HTT threads. The two-CPU node was used to run Intel's MKL version 13.0.1, PARDISO [43] , MUMPS [8] , SuperLU [16] , and Harwell's MC60 and MC64 [4] . Unless otherwise stated, all times reported are in seconds and were obtained on a dedicated machine. In an attempt to avoid warm up overhead, the results reported represent averages that drew on multiple successive identical runs.
When reporting below the results of several numerical experiments, one legitimate question is whether it makes sense to compare performance results obtained on one GPU with results obtained on two multicore CPUs. The multicore CPU is not the fastest, as Intel chips with more cores are presently available. Additionally, the Intel chip's microarchitecture is not Haswell, which is more recent than the Ivy Bridge microarchitecture of the Xeon E5-2690v2. Likewise, on the GPU side, one could have used a Tesla K80 card, which has roughly four times more memory than K20x and twice its memory bandwidth. Moreover, price-wise, the K80 would have been closer to the cost of two CPUs than K20x is. Finally, Kepler is not the latest microarchitecture either, since Maxwell currently enjoys that status. We do not attempt to answer these questions and hope that the interested reader will modulate this study's conclusions by factoring in unavoidable CPU-GPU hardware differences. No claim is made herein of one architecture being superior since such a claim could be easily proved wrong by moving from algorithm to algorithm or from discipline to discipline. The sole and narrow purpose of this section is to report on how apt SaP::GPU is in tackling linear algebra tasks. To that end its performance is compared to that of established solutions running on CPUs and also of a recent GPU library.
Numerical experiments related to dense banded linear systems.
The discussion in this subsection draws on a subset of results reported in [29] and presents results pertaining to the influence on SaP's time to solution of the number of partitions P and of the diagonal dominance d of the coefficient matrix, as well as a comparison against Intel's MKL solver over a spectrum of problem dimensions N and half bandwidth values K.
4.1.1. Sensitivity with respect to P . The entire SaP::GPU solution for dense banded linear systems is implemented on the GPU. We first carried out a sensitivity analysis of the time to solution with respect to the number of partitions. The results are summarized in Fig. 4 .1. This behavior; i.e., relatively small gains after a threshold value of P , is typical. As a rule of thumb, some experimentation is necessary to find an optimal P value. Otherwise, a conservatively large value should be picked in the neighborhood of 50 or above. For SaP::GPU-D, larger values of P help with load balancing, particularly for GPUs with many stream multiprocessors. The same argument can be made for SaP::GPU-C, with the caveat that the spike truncation factor comes into play in a fashion that is modulated by the value of d.
It SaP::GPU-D and understand how changing P influences this distribution of the time to solution between the major implementation components. The results in Table 4 .1 provide this information as they compare the coupled and decoupled strategies in regards to the factorization times, D pre vs. C pre ; number of iterations in the Krylov solver, D it vs. C it ; amount of time spent iterating to find the solution at a level of accuracy of at least 10 −10 , D Kry vs. C Kry ; and the total times, D T ot vs. C T ot . These times are defined as D pre = T LU , C pre = T LU + T BC + T SP K + T LU rdcd , D T ot = D pre + D Kry , and C T ot = C pre + C Kry . Note that for SaP::GPU, quarters of number of iterations are reported. This is due to the fact that BiCGStab(2) contains three exits points during each iteration. Moving from one to the next roughly requires the same amount of effort, which justifies the adopted convention.
The number of iterations to convergence suggests that the quality of the coupledversion of the preconditioner is superior. Yet the price for getting this better preconditioner is higher and SaP::GPU-D ends up winning by taking as little as half the time required by SaP::GPU-C. When the same factorization is used multiple times, this conclusion could change since the metric that controls the performance would be D Kry and C Kry , or its number of iterations for convergence proxy. Also note that the return on increasing the number of partitions gradually fades away and for the coupled strategy there is no reason to go beyond P = 50. Table 4 .2 provide this information as they help answer the following question: can one still use a decoupled approach for matrices that are far from being diagonal dominant? The answer is yes, except in the most extreme case, when d = 0.06. Note that the number of iterations to convergence for the decoupled approach quickly recovers away from small values of d. In the end, the same 2× speedup factor is obtained virtually over the entire spectrum of d values.
Comparison with Intel's MKL over a spectrum of N and K.
This section summarizes results of a two-dimensional sweep over N and K. In this exercise, prompted by the results reported in Figs. 4.1 and 4.2, we fixed P = 50 and chose matrices for which d = 1. Each row in Table 4 .3 lists the value of N , which runs from 1000 to 1 000 000. Each column lists the dimension of half bandwidth K, which runs from 10 to 500. Each table row is split in three sub-rows: SaP::GPU-D results are reported in the first sub-row; SaP::GPU-C in the second sub-row; MKL in the third sub-row. All timings are in milliseconds. "OOM" stands for "out-of-memory" -a situation that arises when SaP::GPU exhausts during the solution of the linear system the GPU's 6 GB of global memory.
The results reported in Table 4 , 0)). Given that N assumes 10 values and K takes 6 values, "α" can be one of 60 tests. Since three (N, K) tests, namely (1 000 000, 200), (1 000 000, 500), and (500 000, 500), failed to solve in SaP, the sample population for the statistical study in Fig. 4.3 Table 4 . Table  4 .3.
by SaP two times faster than by MKL. The figure also shows that about 25% of the tests run, roughly, between three and six times faster in SaP. The red crosses in the figure represent outliers.
4.2.
Numerical experiments related to sparse matrix reorderings. When solving sparse linear systems, SaP reformulates the sparse problem as a dense banded linear system that is subsequently solved using SaP::GPU-C or SaP::GPU-D. Ideally, the "sparse-to-dense" transition yields a coefficient matrix that is diagonal heavy; i.e., has a large d, and has a small bandwidth K. Two matrix reorderings are applied in an attempt to meet these two objectives. The first one; i.e., the diagonal boosting reordering, is assessed in section §4.2.1. The second one; i.e., the bandwidth reduction reordering, is evaluated in §4.2.2.
Assessment of the diagonal boosting reordering solution.
The first set of results, summarized in Fig. 4.4 , correspond to an efficiency comparison between the hybrid CPU-GPU implementation of §3.2 and the Harwell Sparse Library (HSL) MC64 algorithm [4] . The hybrid implementation outperformed MC64 for 96 out of the 116 matrices selected from the Florida Sparse Matrix Collection [15] . The left pane in Fig. 4.4 Table 4 .3: Performance comparison, two-dimensional sweep over N and K for P = 50 and d = 1. For each value N , the three rows correspond to the SaP::GPU-D, SaP::GPU-C, and MKL solvers, respectively. the diagonal boosting reordering in test α. A relative speedup is computed as
values, which can be either positive or negative, are collected in a set S DB−MC64 which is used to generate the left box plot in Fig. 4 .10. The number of tests used to produce these statistical results was 116. Note that a positive value means that DB is faster than MC64, with the opposite outcome being the case for negative values of S DB−MC64 α . The median values for S DB−MC64 was 1.2423, which indicates that half of the 116 tests ran more than 2.3 times faster using the DB implementation. On average, it turns out that the larger the matrix, the faster the DB solution becomes. Indeed, as a case study, we analyzed a subset of larger matrices. The "large" attribute was defined in two ways: first, by considering the matrix size, and second, by considering the number of nonzero elements. For the 116 matrices considered, we picked the largest 24 of them; i.e., approximately the largest 20%. To this end, in the first case, we selected all matrices whose dimension was higher than N =150 000. In the second case, we selected all matrices whose number of nonzero elements was larger than 4 350 000. For large N , the median was 1.6255, while for matrices with many nonzero elements, the median was 1.7276. In other words, half of the large tests ran more than three times faster in DB. Finally, the statistical results in Fig. 4 .10 indicate that for large tests, with the exception of two outliers, there were no tests for which S DB−MC64 α was negative; i.e., with one exception, DB was faster. When all 116 tests were considered, MC64 was faster in several cases, with an outlier for which MC64 was four times faster than DB.
Two facts emerged at the end of this analysis. First, as discussed in [31] , the bottleneck in the diagonal boosting reordering was either the DB-S2 stage; i.e., finding the initial match, or the DB-S3 stage; i.e., finding a perfect match, with an approximately equal split among them. Secondly, the quality of the reordering turned out to be identical -almost all matrices displayed the same grand product of the diagonal entries regardless of whether the reordering was carried out using MC64 or DB.
Assessment of the bandwidth reduction solution.
The performance of the CM solution implemented in SaP was evaluated on a set of 125 sparse matrices from various applications. These matrices were the 116 used in the previous section plus several other matrices such as ANCF31770, ANCF88950, and NetANCF 40by40, etc., that arise in granular dynamics and the implicit integration of flexible multi-body dynamics [22, 23, 45] . Figure 4 .5 presents results of a statistical analysis that used a median-quartile method to compare (i) the half bandwidths of the matrices obtained by Harwell's MC60 and SaP's CM; and, (ii) the time to solution; i.e., time to complete a band-reducing reordering. For (i), the quantity reported is the relative difference between the resulting bandwidths,
where K MC60 and K CM are, respectively, the half bandwidths K of the matrices produced by MC60 and CM. For (ii), the metric used was identical to the one introduced in Eq. (4.1). Note that CM is superior when r K assumes large positive values, which are also desirable for the time-to-solution plot. As far as r K is concerned, the median value is 0%; i.e., out of 125 matrices, about half are better off being reordered by Harwell's MC60 with the other half being better off reordered by SaP's CM. On a positive side, the number of outliers for CM is higher, indicating that there is a propensity for CM to "win big". In terms of times to solution, MC60 is marginally faster than CM's hybrid CPU/GPU solution. Indeed, the median value of the performance metric is −0.1057; i.e., it takes half of the tests run with CM at least 1.076 times longer to complete the bandwidth reduction task. It is insightful to discuss what happens when this statistical analysis is controlled to only consider larger matrices. The results of this analysis are captured in Fig. 4 .6. Just like in section §4.2.1, the focus is on the largest 20% matrices, where "large" is understood to mean large matrix dimension N , and then separately, large number of nonzeros nnz. Incidentally, the cut-off value for the dimension was N =215 000, while for the number of nonzeros was nnz =7 800 000. When the statistical analysis included the 25 largest matrices based on size N , the median value for the half bandwidth metric r K was yet again 0.0%. The median value for time to solution changed however, from −0.1057 to 0.6964 to indicate that for half of these large tests SaP ran more than 1.6 times faster than the Harwell solution. Qualitatively, the same conclusions were reached when the 25 large matrices were selected on the grounds on nnz count. The median for r K was 0.4182%, which again suggested that the relative difference in the resulting bandwidth K yielded by CM and MC60 was practically negligible. The median time to solution was the same 0.6964. Note though that according to the results shown in Fig. 4.6 , there is no large-nnz test for which the Harwell implementation is faster than the CM. In fact, 25% of the large tests; i.e., about five tests, run at least three times faster in CM. Finally, it is worth pointing out the correlations between times to solutions and K values, on the one hand, and N and nnz, on the other hand. Herein, the correlation used is the Pearson product-moment correlation coefficient [10] . As a rule of thumb, a Pearson correlation coefficient of 0.01 to 0.19 suggests a negligible relationship, while a coefficient between 0.7 and 1.0 indicates a strong positive relationship. The correlation coefficient between the bandwidth and the dimension N of the matrix turns out to be small; i.e., 0.15 for MC60 and 0.16 for CM. Indeed, the fact that a matrix is large doesn't say much about what K value one can expect upon reordering this matrix. The correlation between the number of nonzeros and the amount of time to figure out the reordering is very high though. In other words, the larger the matrix size N , the longer the time to produce the reordering. For instance, the correlation coefficient was 0.91 for MC60 and 0.81 for CM. The same observation holds for the number of nonzeros entries: when there is a lot of them, the time to produce a reordering is large. The Pearson correlation coefficient is 0.71 for MC60 and 0.83 for CM. These correlation coefficients were obtained on a sample size of 125 matrices. Yet the same trends are manifest for the reduced set of 25 large matrices that we worked with. For instance, the correlation between dimension N and resulting K is very small at large N values: 0.04 for MC60 and 0.05 for CM. For the time to solution, the correlation coefficients with respect to N are 0.89 for MC60 and 0.76 for CM.
4.3.
Numerical experiments related to sparse linear systems. Figure 4 .7 plots statistical results that summarize how the time to solution; i.e., finding x in Ax = b, is spent in SaP::GPU. The raw data used in this analysis is available on-line [3] ; also, a discussion of exactly what it means to find the solution of the linear system is postponed for section §4.3.4. The labels used in the plot Fig. 4 .7 are inspired by the notation used in section §3.4 and Fig. 3.1 . Consider for instance the diagonal boosting reordering DB employed by SaP. In a statistical sense, the percent of time to solution spent in DB is represented using a median-quartile method to measure statistical spread. The raw data used to generate the DB box was obtained as follows. If a test "α" that runs to completion requires T DB α > 0 for DB completion, then this test will generate one data entry in an array of data subsequently used to produce the statistical result. The actual entry that is used is 100 × T is the total amount of time that test "α" takes for completion. In other words, the entry is the percent of time spent when solving this particular linear system for performing the diagonal boosting reordering. The bars for the K-reducing reordering (CM), for multiple data transfers between CPU and GPU (Dtrsf), etc., are similarly obtained. Not all bars in Fig. 4 .7 were generated using the same number of data entries; i.e., some tests contributed to some but not all bars. For instance, a symmetric positive definite linear system requires no DB step and such this test won't contribute an entry to the array of data used to determine the DB box in the figure. Of a batch of 85 tests that ran to completion with SaP, the sample population used to generate the bars is as follows: 85 data points for CM, Dtrsf, and Kry; 63 data points for DB; 60 for LU; 32 data points for Drop; and 9 data points for BC, SPK, and LUrdcd. These counts provide insights into the solution path adopted by SaP in solving the 85 linear systems. For instance, the coupled approach; i.e., the SPIKE method of [38] has been employed in the solution of nine of the 85 linear systems. The rest of them were used via SaP::GPU-D. Of 85 linear systems, 25 were most effectively solved by SaP resorting to diagonal preconditioning; i.e., after DB all the entries were dropped off except the heavy diagonal ones. Also, note that several of the linear systems considered were symmetric positive definite, from where the 60 points count for DB.
Profiling results.
A statistical analysis of the time spent in the Krylov-subspace component of the solution reveals that the median time was 55.84%. The median times for the other components of the solution are listed in the first row of data in Table 4 .4. The second row of data provides the median values when the Krylov-subspace component, which dwarfs most of the solution components is eliminated. In this case, the entry for DB, for instance, was obtained based on data points 100 × T is the time required to compute from scratch the preconditioner. The median values should be used in conjunction with the median-quartile boxplot of Fig. 4 .7 for the first row of data, and Fig. 4.8 for the second row of data. Consider, for instance, the results associated with the drop-off operation. In the Krylov-inclusive measurement, Drop has a median of 4.1%; i.e., half of the 32 tests which employed drop-off spent more than amount in performing the drop-off, while half were quicker. The spread is rather large and there are several outliers that suggest that a handful of tests require a very large amount of time be spent in the drop-off part of the solution. The results in Fig. 4 .7 and Table 4 .4 suggest where the optimization efforts should concentrate in the future. For instance, the time required for the CPU↔GPU data transfer is, in the overall picture, rather insignificant and as such a matter of small concern. Somewhat surprising, the amount of time spent in drop-off came out higher than anticipated, at least in relative terms. One caveat is that no effort was made to optimize this component of the solution. Instead, the effort went into optimizing the DB and CM solution components. This paid off, as matrix reordering in SaP, particularly for large matrices, is fast when compared to Harwell and it reached the point where the drop-off became a more significant bottleneck. Another unexpected observation was the relative small number of scenarios in which SaP::GPU-C was preferred over SaP::GPU-D; i.e., in which the SPIKE strategy [38] was employed. This observation, however, should not be generalized as it might very well be specific to the SaP implementation. Indeed, it simply states that in the current implementation, a large number of iterations associated with a less sophisticated preconditioner is preferred to a smaller count of expensive iterations associated with SaP::GPU-C. Out of a sample population of 85 tests, when invoked, the median number of iterations to solution in SaP::GPU-C was 6.75. Conversely, when SaP::GPU-D was preferred, the median count was 29.375 [3] .
4.3.2.
The impact of the third stage reordering. It is almost always the case that upon carrying out a CM reordering of a sparse matrix, the resulting A matrix has a small number of entries in the first and last rows. Yet, as the row index j increases, the number of nonzero in row j increases up to approximately j ≈ N/2. Thereafter, the nonzero count starts decreasing to reach small values towards j ≈ N . Overall, A has its K value dictated by the worst offender. Therefore, a partitioning of A into A i , i = 1, . . . , P would conservatively require that, for instance, A 1 and A P work with a large K most likely dictated by a sub-matrix such as A P/2 . Allowing each A i to have its own K i proved to lead to efficiency gains for two main reasons. First, in SaP::GPU-C it led to a reduction in the dimension of the spikes, since for each pair of coupling blocks B i and C i , the number of columns in the ensuing spikes was determined as the larger of the values K i and K i+1 . Second, SaP::GPU capitalizes on the observation that, since A i are independent and governed by their local K i , there is nothing to prevent a third reordering, which attempts to further reduce the bandwidth of A i . As it comes on the heels of the DB and CM reorderings, this is called a "third stage reordering" and is applied independently and preferably concurrently to the P sub-matrices A i . As illustrated in Table 4 .5, the decrease in local K i can be significant and it can lead to non-negligible speedups, see Table 4 .5: Examples of matrices where the third stage reordering (3 rd SR) reduced more significantly the block bandwidth K i for A i , i = 1, . . . , P .
4.3.3.
Comparison against state of the art. A set of 114 matrices, of which 105 are from the Florida matrix collection, is used herein to compare the robustness and time to solution of SaP::GPU, PARDISO, SuperLU, and MUMPS. This set of matrices was selected on the following basis: at least one of the four solvers can retrieve the solution x within 1% relative accuracy. For a sparse linear system Ax = b, this relative accuracy was measured as follows. An exact solution x was first chosen and then the right-hand side was set to b = Ax . Each sparse linear solver attempted to produce an approximation x of the solution x . If this approximation satisfied ||x−x || 2 /||x || 2 ≤ 0.01, then the solve was considered to have been successful. Given that SaP::GPU is an iterative solver, its initial guess is always Table 4 .6: Speed-up "SpdUp" values obtained upon embedding a third stage reordering step in the solution process, a decision that also changed the number of partitions P for best performance. When correlating the results reported to values provided in Table 4 .5, this table lists for each matrix A the largest of its K i values, i = 1, . . . , P .
in many instances the initial guess can be selected to be relatively close the actual solution, this situation is avoided here by choosing x far from the aforementioned initial guess. Specifically, x had its entries roughly distributed on a parabola starting from 1.0 as the first entry, approaching the value 400 at N/2, and decreasing to 1.0 for the N th and last entry of x . The statistical results reported in this section draw on raw data provided in the Appendix in Table A On the robustness side, SaP::GPU failed to solve 28 linear systems. In 23 cases, SaP ran out of GPU global memory. In the remaining five cases, SaP::GPU failed to converge. The rest of the solvers failed as follows: PARDISO 40 times, SuperLU 22 times, and MUMPS 35 times. These results should be qualified as follows. The GPU card had 6 GB of GDDR5-type memory. Given that in its current implementation SaP::GPU is an in-core solver, it does not swap data in and out of the GPU. Consequently, it ran 23 times against this memory-size hard constraint. This issue can be partially alleviated by considering a better GPU card. Indeed, there are cards that have as much as 24 GB of global memory, which still comes short of the 64 GB of RAM that PARDISO, SuperLU, and MUMPS could tap into. Secondly, the PARDISO, SuperLU, and MUMPS solvers were used with default setting. Adjusting parameters that control these solvers' solution process would likely increase their success rate.
Interestingly, for the 114 linear systems considered there was a perfect negative correlation between speed and robustness. PARDISO was the fastest, followed by MUMPS, then SaP, and finally SuperLU. Of the 57 linear systems solved both by SaP and PARDISO, SaP was faster 20 times. Of the 71 linear systems solved both by SaP and SuperLU, SaP was faster 38 times. Of the 60 linear systems solved both by SaP and MUMPS, SaP was faster 27 times. Of the 60 linear systems solved both by PARDISO and SuperLU, PARDISO was faster 60 times. Of the 57 linear systems solved both by SaP and MUMPS, PARDISO was faster 57 times. And finally, of the 64 linear systems solved both by SuperLU and MUMPS, SuperLU was faster 24 times.
We compare next the four solvers using a median-quartile method to measure are −1.4036, 0.0934, and −0.3242, respectively. These results suggest that when it finishes, PARDISO can be expected to be about two times faster than SaP. MUMPS is marginally faster than SaP, which on average can be expected to be only slightly faster than SuperLU. for which compared to PARDISO, SaP finishes significantly faster, four linear systems for which it is significantly faster than SuperLU, and four linear systems for which it is significantly faster than MUMPS. On the flip side, there are two tests where SaP runs slower than MUMPS and one test where it runs significantly slower then SuperLU.
The results also suggest that about 50% of the linear systems run in SaP in the range between "as fast as PARDISO or two to three times slower", 50% of the linear systems run in SaP in the range "between four times faster to four times slower then SuperLU". Relative to MUMPS, the situation is just like for SuperLU if only slightly shifted towards negative territory: the second and third quartile suggest that 50% of the linear systems run in SaP in the range "between three times faster to three times slower then MUMPS". Again, favorably for SaP, the last quartile is long and reaches well into high positive values. In other words, when it beats the competition, it beats it by a large margin.
4.3.4.
Comparison against another GPU solver. The same set of 114 matrices used in the comparison against PARDISO, SuperLU, and MUMPS was considered to compare SaP::GPU with the sparse direct QR solver in cuSOLVER library [7] . For cuSOLVER, the QR solver was run in two configurations: with or without the application of a reversed Cuthill-McKee (RCM) reordering before solving the system. RCM was optionally applied given that it can potentially reduce the QR factorization fill-in. cuSOLVER successfully solved 45 out of 114 systems when using either configuration. There are only three linear systems: ABACUS shell ud, ex11 and jan99jac120, which were successfully solved by cuSOLVER but not by SaP::GPU. Of the 42 systems solved both by SaP::GPU and cuSOLVER, cuSOLVER was faster than SaP::GPU in five cases. In all 69 systems cuSOLVER failed to solve, the implementation ran out of memory.
5. Conclusions and future work. This contribution discusses parallel strategies to (i) solve dense banded linear systems; (ii) solve sparse linear systems; and (iii) perform matrix reorderings for diagonal boosting and bandwidth reduction. The salient feature shared by these strategies is that they are designed to run in parallel on GPU cards. BSD3 open source implementations of all these strategies are available at [2, 3] as part of a software package called SaP. As far as the parallel solution of linear systems is concerned, the strategies discussed are in-core; i.e., there is no host-device, CPU-GPU, memory swapping, which somewhat limits the size of the problems that can be presently solved by SaP. Over a broad range of dense matrix sizes and bandwidths, SaP is likely to run two times faster than Intel's MKL. This conclusion should be modulated by hardware considerations and also the observation that the diagonal dominance of the dense banded matrix is a performance factor. On the sparse linear system side, the most surprising result was the robustness of SaP. Out of a set of 114 tests, most of them using matrices from the University of Florida sparse matrix collection, SaP failed only 28 times, of which 23 were "out-of-memory" failures owing to a 6 GB limit on the size of the GPU memory. In terms of performance, SaP was compared against PARDISO, MUMPS, and SuperLU. We noticed a perfect negative correlation between robustness and time to solution: the faster a solver, the less robust it was. In this context, PARDISO was the fastest, followed by MUMPS, SaP, and SuperLU. Surprisingly, the straight split-and-parallelize strategy, without the coupling involved in the SPIKE-type strategy, emerged as the more often solution approach adopted by SaP.
The implementation of SaP is somewhat peculiar in that the sparse solver builds on top of the dense banded one. The sparse-to-dense transition occurs via two reorderings: one that boosts the diagonal entires and one that reduces the matrix bandwidth. Herein, they were implemented as CPU/GPU hybrid solutions which were compared against Harwell's implementations and found to be twice as fast for the diagonal boosting reordering, and of comparable speed for the bandwidth reduction.
Many issues remain to be investigated at this point. First, given that more than 50% of the time to solution is spent in the iterative solver, it is worth consider the techniques analyzed in [30] , which sometimes double the flop rate in sparse matrix-vector multiplication operations upon changing the matrix storage scheme; i.e., moving from CSR to ELL or hybrid. Second, an out-of-core and/or multi-GPU implementation would enable SaP to handle larger problems while possibly reducing time to solution. Third, the CM bandwidth reduction strategy implemented is dated; spectral and/or hyper-graph partitioning for load balancing should lead to superior splitting of the coefficient matrix. Finally, as it stands, with the exception of parts of the matrix reordering, SaP is entirely a GPU solution. It would be worth investigating how the CPU can be involved in other phases of the implementation. Such an investigation would be well justified given the imminent tight integration of the CPU and GPU memories. their size N and number of non-zero elements nnz. 
