The problem of numerical solution of sparse matrix-based linear systems arises from many scientific applications. Iterative solvers and corresponding preconditioning techniques are usually adopted. Due to the the irregularity in memory access patterns and undeterministic branching behavior of sparse matrix-based algorithms, they present unique problems for emerging GPU-based platforms.
INTRODUCTION
Large sparse matrix-based linear systems arise in many scientific applications. Solving large sparse matrix-based linear systems efficiently is of crucial importance to various research tasks. In practice it usually requires applying iterative solvers [1] for solving such systems, together with accompanying preconditioners. As is pointed out in [1, 2] , preconditioners are usually decisive on the convergence rate of the solver. Factorization based preconditioners are most popular for general sparse matrices. Incomplete Cholesky factorization are used for symmetric definite positive matrices, while incomplete LU factorization are popular for 1 general, unsymmetric matrices. These algorithms usually result in a preconditioning process which is based on forward/backward substitutions and inherently sequential. Approximate inverse-based preconditioners feature high parallelism in preconditioning operations by direct approximation to the inverse of the matrix. With approximate inverse preconditioners, preconditioning operations becomes matrix-vector products, which has inherent parallelism.
Accelerated computing has emerged in recent years as a fundamental driving force for higher performance computing. Some of the highest ranking machines on the TOP500 [3] list are of a hybrid structure with accelerators. Although much effort has been put to dense matrix-based operations, few work exists for dealing with sparse matrices. Due to its irregular structure, sparse matrix-based algorithms usually involve codes with much more irregular memory accesses and more branches. They propose unique problems for parallelization on accelerators, which are usually throughput-oriented architectures which exploit data parallelism. This paper discusses generating approximate inverse preconditioners for sparse matrices. We apply parallelization and optimization techniques to this problem and implement it on CUDA computing environment. Experiments show that an effective speedup of 3x and 6x are achieved for approximate inverse preconditioners, compared with a parallel, CPU-based solution.
This paper is organized as follows. Section 2 will give a short introduction to the iterative methods and preconditioning techniques. In specific we use A-biconjugate based Approximate Inverse preconditioners to favor preconditioning process. CUDA environment is briefly introduced in Section 3. In Section 4 detailed design and optimization of the AINV preconditioner generation and GMRES algorithm is proposed. Experiment results in Section 5 includes performance of AINV preconditioner generation and GMRES application, including micro-benchmarks of kernels in AINV generation. Section 6 discusses related issues in migrating sparse matrix-based applications on CUDA. Section 7 concludes the paper.
476
Generating Approximate Inverse Preconditioners for Sparse Matrices using CUDA and GPGPU
ITERATIVE SOLVERS & APPROXIMATE INVERSE PRECONDITIONERS
For a linear system defined over a square, sparse matrix A and a generally dense right-hand side (RHS) b:
Iterative methods based on Krylov subspace are mostly used to solve it, such as GMRES, CG, etc [1] . The generation of Krylov subspace κ (r 0 ) = {r 0 , Ar 0 , ..., A m−1 r 0 } depends on the residue vector r 0 = b − Ax 0 , where x 0 is an initial guess, and matrix-vector products. Although definition of specific methods vary due to algorithmic differences or exploitation of specific matrix properties, matrix-vector products and inner products are the fundamental operations in Krylov-subspace methods. In this paper GMRES is used as the exemplary iterative method for optimization in CUDA. Other Krylov methods can also benefit in similar ways by applying the techniques developed in this paper.
The convergence rate of iterative methods are mostly affected by the preconditioners applied to the solver [1, 2] . Preconditioners are matrices denoted as M l and M r . By applying them to A, the effective system to be solved is: 
Preconditioned GMRES How to choose the most adequate preconditioner is subjected to specific matrix and computing environment. Usually the most popular and widely used are preconditioners based on incomplete factorization. For non-symmetric systems, an incomplete LU factorization is usually used:
Where L × U should approximate A to a good extent with L and U as lower and uppertriangular matrices respectively. When used for preconditioning, M l and M r are in effect substituted as L −1 and U −1 , and preconditioning operations are turned into triangular solves, i.e., forward and backward substitutions. One major drawback of factorization-based preconditioners is that this solving process is inherently sequential. Only limited speedup can be achieved by complex parallelization schemes, usually depending on multiple RHS's and supernodal approaches. For direct methods this may not pose a big problem since the factorization is used only once, but for iterative methods, the overhead for accessing factorized matrices grows in proportion to the iteration count required for convergence.
Approximate Inverse Preconditioners
Approximate inverse-based preconditioners tries to mitigate the parallelization problem by locating a good approximation to the inverse of matrix A, i.e., A −1 . In specific, the AINV algorithm [4] When used for preconditioning, M l and M r are equivalent to W T and ZD −1 , respectively. Accordingly, the preconditioning operations now are substituted to matrix-vector products, which can be parallelized easily.
The iteration governed by i from line 4 to 12 has carried dependencies and cannot be parallelized. Within each iteration governed by i, the two iterations governed by j can both be parallelized. The inner product on line 5 and 6 both involve two sparse vectors. The linear combination on line 9 and 10 involve updating a sparse vector, i.e., w j or z j with another, i.e., w i or z i . Value droppings on line 11 is adopted to avoid introducing too many fill-in's, i.e., non-zero elements introduced to w j or z j by linear combinations with w i or z i , respectively. The dropping criterion can be based on: (1) a predefined sparsity pattern, or (2) filtering out elements by their absolute values and/or fill-in count. If a predefined sparsity pattern is used for dropping values, line 11 in Algo. 2 is effectively replaced by:
Where S W and S Z are the predefined sparsity pattern for W and Z, respectively. Usually the transitive closure of the original matrix A serves as a good sparsity pattern guess, denoted A(l ) where l is the level of transitive closure. The most simple form is when l = 0, i.e., the original sparsity pattern of A. In this case, W T and Z will feature the sparsity pattern of the lower triangular and upper triangular of A, respectively. We denote the AINV preconditioner based on a predefined sparsity pattern of A(l ) as AINV-l .
When dynamic sparsity patterns are adopted, non-zero elements of large absolute values are kept to contain more valid information. Usually at most m fill-in's are allowed per column of W and Z: after fill-in's are introduced into w j and z j , on line 11, the m elements with the largest absolute values are kept. Also as an practical implementation, and they are afterwards sorted in ascending row indices. To ensure non-degenerative preconditioners, diagonal values of W and Z which are all 1's, are never dropped. We denote this dynamic fill-in position scheme as AINV-t(m), or AINV-t for short. In AINV-t(m), each column of W (or Z) actually contain m + 1 non-zeros, with m of them on nondiagonal positions and a diagonal element of value 1.
CUDA ENVIRONMENT AND GPU ARCHITECTURE
Graphics Processing Units (GPU) are throughput-oriented architectures which in recent years has become popular for general purposed computing [5] . Contrary to conventional CPU architectures, GPU is massively parallel with more than thousands of concurrent threads, and caching and branching support is limited compared with CPU. For programming models, user has to divide the task into explicit parallel threads, and different threads perform an elementary, logically identical task to different data.
480
In this work we are concerned with NVIDIA GPUs and accompanying CUDA programming environment [6] due to its popularity and good technical support. Fig. 1 gives an exemplary thread organization in the CUDA and an organization graph of NVIDIA Tesla C1060 processor, a standard NVIDIA Migration of custom algorithms onto CUDA requires conceptual mapping between CUDA threads/blocks and algorithmic components which feature datalevel parallelization, but certain hardware details exposed to programmers have to be considered for higher performance: (1) avoid divergent threads within a warp and control-flow instructions to (2) make use of limited ShM, (3) optimize memory bandwidth usage for coalescent accesses.
Journal of
Algorithms & Computational Technology Vol. 5 No. 3 481
AINV ON GPU -IMPLEMENTATION AND OPTIMIZATIONS
In this section we apply CUDA-based parallelization to AINV algorithms. AINV algorithm has inherent parallelism on the inner iterations: in Algo. 2, at iteration i, n -i+1 iterations on line 5 and 6 can be executed in parallel, and n -i iterations on line 9, 11 and 10, 11 can be executed in parallel. Given that n is usually very large, resulting in high parallelism. In this section we will firstly introduce the data structure of preconditioner, and analyze the CUDA parallelization of: (1) sparse vector inner products, (2) sparse vectorsparse vector updates, and (3) sorting and dropping algorithmic design in AINV-t.
Data Structure
For sparse matrix with a random non-zero pattern, Compressed Sparse Row (CSR) or Compressed Sparse Column (CSC) are most commonly used formats. ELLPACK format [7] is an format for bandwidth exploitation and its efficiency depends on the fact that number of non-zero elements (NNZ) per row/column does not differ much. In [8] ELLPACK and formats based on ELLPACK score highest in average for sparse matrix-vector product (SpMV) due to effectively
482
Generating Approximate Inverse Preconditioners for Sparse Matrices using CUDA and GPGPU exploitation of GPU memory bandwidth. Fig. 2 shows a sample sparse matrix with 16 rows and its storage scheme in ELLPACK. The column index information and value information are both stored in the shown scheme. Non-zero elements in each row of the sparse matrix are stored in a row in ELLPACK; ELLPACK store elements in a column-wise way. Each 2-digit hexadecimal value in ELLPACK denotes the actual memory position where the element is stored. Grey positions correspond to non-zero elements, while white positions are padding. When the address of the first element in ELLPACK format is aligned to 16-byte boundary, fully coalesced memory accesses is guaranteed if the same i-th non-zero element in each row are accessed simultaneously by CUDA threads, resulting in high GPU memory bandwidth usage efficiency. The limitation of ELLPACK is that when the variation in non-zero element distribution in rows is high, padding is severe and efficiency is low. In mitigation [8] also introduced Hybrid format (HYB), which is an optimized format based on ELLPACK by spilling non-zero elements into another storage format when too much padding is incurred.
In AINV, W and Z are stored in a column-wise manner. Since the sparsity pattern for transitive closure of A usually contains variable non-zero element count per column, we use CSC format to record W and Z in AINV -l. For AINV-t(m), i.e., at most m fill-in's per column, ELLPACK format is optimal due to that padding is minimal. Note that when used for preconditioning, Z should be processed to accommodate row-wise storage. In case that ELLPACK is not applicable for preconditioning, Hybrid format (HYB) defined in [8] is adopted.
Sparse Vector Inner Products
The parallelization of the lines on 4 ∼ 7 involves the inner product of a constant sparse vector, i.e., a i or c i , against n − i sparse column vectors in W or Z.
We map each CUDA thread to one of the n − i columns in W, and the j-th thread carries out the calculation the inner product of a i (or c i ) against the (i + j )-th column in W (or Z ). Note that j starts from 1. For each thread, it has to walk through a i and Wi and calculate products between elements with same row indices. Pseudo code is as follows: (Ai and Ax records indices and values in a i and Wi and Wx records those of w j ). sum = 0.;
i += 1; } else { j += 1; } } Each thread will walk through the common vector (a i or c i ) and the column it is assigned. Branches happen due to dynamic interplay of Ai and Wi. Values in Ai, Ax, Wi and Wx are fetched from global memory. Accesses to Ai and Ax can possibly be coalesced, but due to the CSC storage format, accesses to Wi and Wx won't be coalesced, potential penalty is:
Where α is the average NNZ per column of W. Since when memory accesses are fully uncoalesced, the effective bandwidth is 1/16 of the fully coalesced accesses, so the penalty is no higher than 16.
Optimizations. Since the sharing of a i between the threads, ShM (shared memory) in NVIDIA GPUs can be utilized to contain the information of a i , i.e., 
Sparse Vector-Sparse Vector Update
For AINV-l, the sparsity pattern for each column in W and Z is predefined. As indicated before, W and Z are already pre-allocated and they carry row index information as indicated by the predefined sparsity pattern. The computation for calculation of w j in Line 9 and Line 11 can be transformed as:
end end Where w p,q is the matrix entry of W at position (p, q), and S i is the predefined sparsity pattern of i-th column of W in AINV-l . In implementation, at iteration i, a CUDA thread is assigned to the calculation of w j based on w i , and the computation of the CUDA thread involves a loop as shown in Algo. 3. When a match in row indices is encountered, the corresponding element in w j , i.e., w j,k is updated with its counterpart in w i , i.e., w i,k . If an element in wi does not have matchings in w j , i.e., when (w i,k ∈ S i and w j,k ∈ S j ) holds, the fill-in is ignored.
For AINV-t(m), W and Z are recorded in ELLPACK format, and allocated to accommodate at most 2 × m + 1 non-zeros per column. This scheme is based on the fact that: (1) before fill-in process of w i into w j , w j includes at most m non-zero elements (diagonal element is implicit), and (2) after fill-in process, w i will contain at most 2m + 1 non-zero elements. A similar walking scheme to Algo. 3 is adopted. The only difference is that when an element inw i does not have its counterpart in w j , i.e., no element in w j has the same row index, its Optimizations. The optimization in both AINV-l and AINV-t involves loading the currently shared vector (w i or z i ) into ShM in advance. For AINV-l , this reduces the bandwidth requirement by half due to the CSC storage format. For AINV-t , the benefit is higher since the positions in w i accesses by simultaneous threads will be very far from each other in memory due to the ELLPACK format, with very low temporal locality. Fetching them into ShM in advance will reduce latency by 2 order of magnitude and brings potential improvement of 16 times.
Similar to sparse vector inner products, in the case that NNZ in w i or z i exceeds the size of buffer in ShM, an additional loop is introduced to process all the non-zero elements.
Locating & Sorting m Largest Elements
For AINV-t(m) , at the i-th iteration, one needs to: (1) locate the m-largest elements in absolute values in (n − i) rows, and (2) sort elements in each column in ascending column order to maintain integrity. The task for locating largest elements corresponds to the classical problem for selection [9] . In terms of
486
Generating Approximate Inverse Preconditioners for Sparse Matrices using CUDA and GPGPU (RowIndex, Value ) pair for each non-zero element, the selection process is based on the absolute value for Value, and the sorting process is based on RowIndex.
Due to the limited cache size on GPU and the large data set to be accessed with a verylow operation-count/data-fetched ratio, one would expect the performance of this process will be largely affected by how memory bandwidth is exploited. Irregular data access among threads will result in un-coalesced global memory access and hence detrimental to overall performance.
Note that the sorting work here involves sorting n − i short sequences at iteration i in Algo. 2 on GPU (other than sorting a long sequence), hence the parallelism here is data-level, rather than parallelism involved in parallel sorting algorithms which a vast body of literature focus on, such as [10] . Given that the sequences to be sorted are organized in ELLPACK format, simultaneous accesses by each CUDA thread will result in coalesced memory accesses when the same k-th element in each sequence is accessed.
Here we consider 3 popular families of algorithms: heap-based algorithms, Quick-Sort based algorithms, and Bubble-Sort based algorithms. Bubble-Sort is an example among sorting algorithms with O(n 2 ) complexity; similar methods include selection sort, etc. It is included due to: (1) fill-in control usually adopt m of small values, resulting in short sequences to be sorted for which Bubble-Sort might perform well, and (2) Bubble-Sort has the most regular memory access patterns: the offset in each column of data is the same for each thread, hence coalesced memory accesses are always guaranteed. Quick-Sort and Quick-Sort based selection algorithm, i.e., Quick-Select have good real-world performance for conventional CPUs but theoretically the worst-case of Quick-Sort is quadratic; when used for sorting, additional O(log n) storage is needed. Due to this we don't consider Quick-Sort for sorting m elements, and only use Quick-Select for finding m largest elements, in which spatial complexity is O(1). Quick-Sort and Quick-Select also has quite divergent memory access behavior, which is also hard to analyze for CUDAbased parallelization and it will results in few predictable coalesced memory accesses. Heap-Sort has a lower spatial locality than Quick-Sort and BubbleSort, but also a lower, i.e., better theoretical complexity bound of O(n log n) in the worst case. Its performance can be analyzed to a deeper extent than Quick-Sort/Quick-Select for both parallel sorting and finding large elements for AINV-t.
When heap structure is used for finding the m-largest among n elements in an array, we use a max/min-heap method listed in Algo. 4 Adjust max-heap; 7 end Algorithm 3: Finding m-largest elements in an array with n elements Algo. 3 has the lowest computational complexity among heap-based algorithms for selection. Other similar algorithms are possible with almost identical performance results, see Section 6 for discussions.
Both the building of heaps and adjustment to heaps require accessing a trail of elements from a certain root element in the heap structure. A sample accessed trail of elements in a heap from the root is shown in Fig. 4 . Effective bandwidth could witness degradation due to the divergence of accessed memory locations among CUDA threads in traversing the heap structure as shown in Fig. 4 .
Traversal of heap is the elementary heap operation for algorithms such as heap adjustment. A traversal begins at the root node of the heap and descends down towards a leaf node of the heap. All the siblings along the traversal trajectory are also accessed. In Fig. 4 the traversal is from root node (with index 0) to leaf node with index of 8, through nodes 1 and 3. Siblings that are also visited are node 2, 4 and 7. Another traversal may start with 0 and end at any node with index between 7 and 14. In the following parts in this section we analyze the timing of aforementioned algorithms based on their memory bandwidth usage behavior.
488
Generating Approximate Inverse Preconditioners for Sparse Matrices using CUDA and GPGPU 
General performance analysis of sorting/selection algorithms on GPU
In AINV-t we carry out n -i concurrent CUDA threads in the i-th iteration, and each thread carries out a sorting/selection algorithm for its corresponding column in W (or Z ). Due to that W and Z are stored in global memory space in ELLPACK format, accessing to same i-th non-zero in each column in W or Z results in coalesced memory accesses. When sorting/selection is carried out by each CUDA thread, whether coalesced memory access is guaranteed depends on both the adopted algorithm and runtime information. In this paper we use temporal complexity to denote the relationship between the actual execution time of the algorithm and the size of the problem. Theoretically this corresponds to the computational complexity of the algorithm, as shown in Tab. 1. But in the context of parallel sorting on GPU, temporal complexity may be higher than the theoretical computational complexity, due to that the process is mainly bandwidth bound and GPU memory bandwidth usage efficiency may be compromised by uncoalesced accesses.
Bubble-Sort related temporal complexity
For Bubble-Sort, each CUDA threads will always access the elements at the same position for each column. Hence the memory accesses to W and Z is always coalesced. The temporal complexity is consistent with computational complexity in Tab. 1. Our analysis is based on effective memory access amount for Bubble-Sort. Detailed analysis is shown in Tab. 2. Note that n = 2m + 1. Table 2 . Temporal complexity of Bubble-Sort. 
Journal of Algorithms & Computational Technology
Vol. 5 No. 3 489
Finding largest m elements Sorting m elements

Temporal Complexity
O m 2 2              O m 3 2 2             
Algorithm Computational Complexity Memory Complexity
Average Worst Bubble-Sort O (n 2 ) O (n 2 ) O (1) Quick-Sort O (n log n) O (n 2 ) O (log n) Heap-Sort O (n log n) O (n log n) O (1)
Temporal complexity related to heap-based algorithms
In heap-based algorithms usually the heap is traversed from the root node to a certain leaf node in the heap. Actual trajectory as shown in Fig. 4 depends on the specific data the heap contains. Since the binary tree of heap is usually organized in a 1-D array, this yields potential of uncoalesced memory accesses when each CUDA thread carries out heap-based algorithms on arrays containing different data. Higher temporal complexity will ensue due to the uncoalesced accesses. For AINV-t , a min-heap is created for each row at position 1 to m, and a max-heap is created at position from m + 1 to 2m + 1.
Since each column of W (and Z ) has the same number of non-zeros, the heap structure in each column is also the same.
For the temporal analysis, we start with the analysis to basic heap operation, i.e., heap adjustments, which fulfills the functionality of reinstalling the heap property when only the root node of the heap has been changed. In heap adjustments, the traversal of the heap is from the root of the heap to one of the leaf of the heap. A sample of accessed elements of a heap adjustment is shown in Fig. 4 .
Temporal complexity of heap adjustments
Since it is possible for different CUDA threads to follow divergent trajectory in heap adjustment, the effective bandwidth available to the adjustment process on the i-th level is 1/2 i-1 of the total bandwidth, given that 2 i-1 is no larger than 16. When i ≤ 5, i.e., 2 i-1 ≤ 16, the effective bandwidth is 1/16 of the achievable bandwidth. For a heap with L + 1 levels, the total wall-clock time for the process of adjustment for all threads would be proportional to:
The actual time for T should be the right handside multiplied by the actual time required for accessing both the index and value info for a non-zeroelement, which is a constant and equal to
The result size of (Index size of (Value GPU Memory B ) ) + a andwidth
Generating Approximate Inverse Preconditioners for Sparse Matrices using CUDA and GPGPU above implies that for very large heaps, i.e., GPU Memory Bandwidth L >> 5,
T is dominated bythe count of levels, i.e., T ∝ L.The constant for L is large, i.e., 32, due to uncoalesced accesses and low efficiency. But for AINV-t, L is normally not very big. When L < 5, T is proportional to 2 L , which in turn is the non-zero element count in the heap. This is higher than the theoretical computational complexity of heap adjustment, which is of order L.
Temporal complexity of building heaps
Building the heap involves traversing from the last non-leaf node to the root of the heap, and at each internal node a heap adjustment is carried out. So when L is small, heap building has the time complexity of:
Compared with the theoretical bound for building heap, which is O(n), this is higher by a factor of log n.
Temporal complexity of finding the m largest elements with Heap-Sort
In Algo. 3, firstly two heaps are built (one min-heap, one max-heap), with complexity of O (m log m) and O((n − m)log(n − m)), respectively. If termination check fails, i.e., there still exists an element in the min-heap which is larger than one element in the max-heap, the root of the two heaps are exchanged and they are both adjusted. When L is small, the adjustment of both heaps has the time complexity of O(m) + O(n − m) = O(n). The average iteration count before the smallest in min-heap is larger than the largest element in the max-heap is . Hence the temporal complexity for finding m largest elements is:
In the case of AINV-t , we reserve almost twice the space for both original non-zeros and extra fill-in's, hence n ≈ 2m. Eqn. 10 becomes: O(m 2 ) by omitting entry of lower order. On the finish of Algo. 5, the first m element in S contains the largest m element of S.
By comparison with the theoretical complexity for heap-based selection algorithm, which is of O(m log m), it can be concluded that when the heap is small (which is satisfied in AINV-t), the actual time required for selection process is of higher order in complexity, mainly due to the uncoalesced accesses.
Temporal complexity of sorting m elements with Heap-Sort
The sorting algorithm based on heaps is built upon heap-building and heapadjusting algorithms. The heap building process has a temporal complexity of O(m log m). The adjustment process has a time complexity of: O(m 2 ). Hence the time complexity of Heap-Sort for data-parallel execution on CUDA is:
Through analysis we can see that when used for AINV-t construction basedon ELLPACK format, heap-based algorithms face real world performance bottlenecks. Unlike that on CPU architecture where Heap-Sort falls short in both computational complexity constant and spatial locality, Heap-Sort tends to feature lower bandwidth usage especially for large sequences. The temporal complexity is higher than the theoretical computational complexity for heap operations. Heap-based algorithms better Bubble-Sort by having a smaller constant factor for the temporal complexity. Heap-Sort features a decrease of effective bandwidth which is linear to the heap size when the heap is small.
EXPERIMENTS
In this section the experimentresults for AINV preconditioner generation and preconditioned iterative solver is presented. To verify the theoretical analysis in previous section, various sorting kernels are tested for comparison between CPU and GPU. We compare AINV-0 (AINV-l with l = 0) and AINV-t(m) with m = 20 for performance evaluation of the optimization and speedup of CUDA algorithms over CPU-based ones. We also evaluate speedup of the performance of the GMRES iterations with AINV -based left & right preconditioners.
Basic Configurations
NVIDIA Tesla C1060 graphics card is used for GPU experiments. CUDA version 2.2 for Linux is used. For CPU experiments use a separate 2.83 GHz Intel Core2Quad CPU based machine as the baseline for comparison.
The matrix used for experiments are from a variety of scientific applications. Some are from [11] . Single-precision floating points are universally used in the experiments, including CPU versions of the algorithms.
Micro-benchmarks on parallel sorting
We designed micro-benchmarks for testing the sorting of n sequences with various sorting algorithms to verify the theoretical analysis in previous section. We generate random index-value pairs as the testing dataset to emulate the scenario in AINV-t generation. A CPU version is implemented as a reference. Data on GPU and CPU are ensured to be the same. B-S, Q-S and H-S represents Bubble-Sort, Quick-Sort and Heap-Sort, respectively.
CPU version of the micro-benchmarks uses OpenMP-based parallelization. Each CPU core is responsible for sorting 1/4 of all sequences (4 cores used). Due to the cache on CPU can effectively exploit spatial and temporal locality, we mapa continuous chunk of sequences to the same core.In memory, we choose to store the sequences in a 'row-wise' storage scheme, i.e., each sequence has its elements stored continuous in memory.This corresponds to storing the non-zero elements in W and Z in CSC format. Note that unlike AINV preconditioners, we use CSC for both AINV-l and AINV-t. On GPU version, these sequence are stored in a 'column-wise' manner, to mimic the matrix storage scheme in ELLPACK. Fig . 5 .a shows the time required for finding m largest element among 2m + 1 elements on CPU and GPU platforms with various algorithms. In all 102'400 sequences are generated. For m lower than 40, Heap-based selection algorithm on GPU performs better than other configurations. But it also shows a similar ascending trend to Bubble-Sort based selection on GPU. The distance between their curves in the figure is nearly constant, implying that the difference between the two is only in the coefficients, but not in order. This result is consistent with our time complexity analysis in previous chapter. With the growth of m, threads becomes more divergent in accessing ELLPACK format and effective bandwidth falls short.
Journal of Algorithms & Computational Technology
Vol. 5 No. 3 493
494
Generating Approximate Inverse Preconditioners for Sparse Matrices using CUDA and GPGPU Fill-in count under 100 are common in incomplete-factorization/AINV preconditioners. Large fill-in count are rarely used since it poses high storage and computational overhead. The microbenchmarkresults can serve as a guideline for optimizingspecific real-worldusage.
Preconditioner Generation in Overview
We implemented AINV -l (with L = 0) and AINV-t on both CPU and GPU platform. As mentioned before, on CPU platform, we use CSC format for matrices to achieve good spatial locality for both AINV-l and AINV-t preconditioners. The program is parallelized for the 2 inner loops for AINV generation using OpenMP. Intel ICC is used for compilation using '-fast' optimization option to allow possible vectorization operations.
We take the CPU-version of AINV generation as the baseline, and compare the performance of various GPU version of AINV algorithm. On GPU, CSC format is used for AINV -l and ELLPACK is used for AINV-t. e a n Figure 6 . Relative performance of AINV -l preconditioner generation using A's sparsity pattern.
We can see that in some cases, non-optimized GPU AINV-0 is on par with the performance of CPU-based parallel algorithm. Using ShM will introduce about 2.2 × speedup. This is common since for ordinary matrices, accesses to global memory is effectively reduced by 50%. Also extra speedup has mainly come from specific matrix structures: matrices with a more random sparsity pattern will have more divergent threads within the same wrap. Amortizing one source of potential serialization is achieved by using ShM instead of accessing global, off-chip memory. Fig. 7 show the relative performance of GPU-based AINV-t(m) algorithms, using a CPU-based parallel counterpart as an baseline. The CPU version uses Quick-Sort and Heap-Sort for fill-in control, i.e., finding m largest elements and the sorting them afterwards. GPU version includes: (1) using ShM for acceleration, and Heap-Sort for fill-in control, (2) not using ShM, while using Heap-Sort for fill-in control, and (3) using ShM but use Bubble-Sort for finding/sortingtasks.
The average speed-up is higher than that of AINV -l preconditioner, mainly due to: (1) high memory usage efficiency using ELLPACK format, and (2). Performance gain for using Heap-Sort instead of Bubble-Sort is 18% on average.
GMRES Solving Process
The CUDA-based GMRES framework is adopted from [12] . GMRES basedon modified Gram-Schimidt orthogonalization is implemented with Saad's flexible GMRES framework; Krylov subspace generation is based on code from [8] ; orthogonalization is implemented in CUBLAS.
496
Generating Approximate Inverse Preconditioners for Sparse Matrices using CUDA and GPGPU Here we mainlyevaluate the speedup of a preconditioned iteration by using the preconditioner generated bythe AINV -l or AINV-t(m) algorithms above. We use a restarted version of GMRES with restart set to 32.The relative average performance of each GMRES iteration is shown in Fig. 8 .
Performance of each GMRES iteration is influenced by: (1) SpMV performance for Krylov subspace generation, (2) preconditioning performance, in our situation, also SpMV-based, and (3) performance of local operations local to GMRES, such as orthogonalization. Krylov subspace bases generation and preconditioning depends on SpMV operations, which is a stable speedup over CPU version by 7 ∼ 9. Smaller matrices show lower speedup mainly due to the orthogonalization/linear combination process within GMRES is relatively slow compared with CPU, mainly due to: (1) CPU cache system is larger, providing more reuse across Krylov iterations, and (2) CUDA kernel starting incurs extra overhead for each it-eration/orthogonalization operation. For AINV -l, the GPU memory bandwidth usage level is on par with SpMV using original matrix A, mainly due to that W and Z generated by AINV -l has same non-zero element distribution as A. For AINV-t(m), since Z has to be transformed into row-wise storage format, W matrix related preconditioning has a higher memory bandwidth usage level than Z due to its minimal padding in ELLPACK format.
DISCUSSION
GMRES is a long-recurrence Krylov-subspace algorithm; at each iteration all the previously generated bases of Krylov subspace should be accessed during orthogonalization. For short-recurrent algorithms such as BiCG [1] , such a process is not present, but SpMV will be carried out twice as much as in 
Generating Approximate Inverse Preconditioners for
Sparse Matrices using CUDA and GPGPU GMRES. Due to the comparatively slower CUBLAS in orthogonalization compared with SpMV operations, short-recurrence based algorithms will have higher potential for speedup of CUDA compared with CPU implementation for the Krylov iterations. The convergence properties of AINV preconditioners are discussed in [4] and [2] . [2] also includes detailed comparison of AINV preconditioners and ILU preconditioners. The parallelization and the convergence properties of are two independent aspects for AINV pre-conditioners. In our paper we only consider the parallelization and optimization techniques of AINV preconditioners on GPU and CUDA platforms. Given the same preconditioner and configurations are used, GMRES will converge in the same manner on GPU platform as on and CPU platform. A discussion of convergence property of different configurations for preconditioners on GPU is a topic beyond this work and is considered as a future work.
[8] implemented SpMV operation using CUDA by matching sparse matrix formats to achieve effectively utilization of the bandwidth. It serves as the basic Krylov subspace basis generator in our work.
[13] implemented CG algorithm on graphics processors without using general purpose programming platform on GPUs; simple point Jacobian preconditioner was adopted and preconditioner construction/preconditioning only involves O(n) operations.The work in our paper serves as the first paper on implementation of complex preconditioning algorithms on GPU platforms. [14] use CG method and point Jacobian preconditioners. High bandwidth utilization is achieved on ATI GPUs. [15] used GMRES and ILU -based methods for general unsymmetric linear systems. ILU preconditioner is constructed only on diagonal blocks of the original matrices. Diagonal blocks are of size 32 × 32. Due to the small size of diagonal blocks and usually large size of the matrix, this methods yields much data level parallelism. Both Jacobian preconditioners and diagonal-block based preconditioners ignore the matrix information beyond diagonal parts, and both can be generated trivially. To our knowledge, our paper is the first work which explores sparse matrix preconditioning techniques which are both non-trivial and suitable to GPU computation.
Selection algorithms based on heap structure has many variants. In this paper, we choose a min/max heap variant. Other variants have similar or higher computational complexity. One variant is to use heap-sort: a max-heap of size n is built and each iteration the root of the heap is retrieved and replace with the last element in the heap. After m iterations, m largest elements are retrieved (in sorted order). This method has higher complexity so discarded. Another variant is to use a min-heap of size m:
