ABSTRACT
INTRODUCTION
Although the explicit inversion of matrices can often be by-passed by solving systems of linear equations, there are situations where the inverse of a matrix itself is of interest. Examples include earth sciences [1] , the matrix sign function method for spectral decomposition [2] , and other disciplines (see [3] for a detailed list). Here, we focus on the inversion of symmetric positive definite (SPD) matrices. This operation, like many other dense linear algebra computations, requires an important computational effort in terms of execution time and memory, which motivates the usage of graphics processors (GPUs) [4] , [5] , [6] .
In 2008, Bientinesi et al. published a review of algorithms for computing the inverse on an SPD matrix [7] . Their work describes also some implementation issues on a multi-core platform and conclude that Gaus-Jordan elimination (GJE) is a suitable algorithm for high performance parallel architectures. There are no previous works on the inversion of SPD matrices on GPUs.
In this paper we revisit the two different methods for computing the inverse of an SPD matrix, the traditional approach based on the Cholesky factorization, and the Gauss-Jordan elimination algorithm, more suitable for massively parallel architectures such as GPUs. Several high performance implementations of both methods on three different architectures (a multi-core CPU, a graphics processor, and a hybrid CPUGPU architecture) are presented and evaluated. Numerical experiments illustrate the efficiency attained by the GJE-based implementations on the target hybrid CPU-GPU platform.
The rest of the paper is structured as follows. In Section 2 we review the two approaches for computing the inverse of an SPD matrix. Several high performance implementations of each method are described and evaluated in Sections 3 and 4. Finally, a few concluding remarks and future work are offered in Section 5.
HIGH PERFORMANCE MATRIX INVERSION OF SYMMETRIC POSITIVE DEFINITE MATRICES Figure 1. Blocked Algorithm for Matrix Inversion of SPD Matrices via GJE (Variant 1).
Two different methods for the inversion of SPD matrices are evaluated in this section, the traditional approach that uses the Cholesky factorization, and the approach based on the Gauss-Jordan elimination method [8] .
Essentially, both algorithms perform the same operations but they exhibit different properties, with the GJE method being more appealing from the point of view of high performance and parallelization.
Matrix Inversion Based on the Cholesky Factorization
The traditional approach to compute the inverse of an SPD matrix 3. Obtain the inverse from the matrix-matrix
Exploiting the symmetry of A , the computational and storage costs of the algorithm are significantly reduced. In particular, the computational cost is 3 n floating-point arithmetic operations or flops (compared, e.g., with the 3 2n flops required to invert a non-symmetric matrix). Inplace inversion of the matrix (i.e., inversion using only the storage provided in A ), which only references the upper triangular part of the matrix, is possible. However, for performance,
A is stored as a full n n  matrix.
Matrix Inversion Based on the Gauss-Jordan Elimination Algorithm
The Gauss-Jordan elimination (GJE) algorithm is, in essence, a reordering of the computations performed by the traditional approach. Thus, it has the same computational cost.
Although the GJE requires the same computational cost, it sweeps through the matrix only once, reducing the number of memory accesses [7] . This is a very important property in order to obtain a high performance implementation in current architectures, where a single memory operation takes considerably more time than several flops. Moreover, most of the computations are cast in terms of matrixmatrix products, which is a highly parallel operation. Therefore, the GJE algorithm is a suitable method for the new massively parallel architectures (e.g., the GPUs). Alike the method based on the Cholesky factorization, the GJE can also exploit the symmetric structure of the matrix to reduce the number of flops. A careful implementation can also compute the inverse using only the memory provided in the initial matrix.
Two blocked variants of this algorithm are shown in Figures 1 and 2 . In both variants, the active block, 11 A , moves along the main diagonal. Only the upper part of the matrix is computed, and the elements below the main diagonal are not referenced. The algorithms are specified using the FLAME notation [9] [10]. There, ) ( m stands for the number of rows of its argument; ) ( TRIU returns the upper triangular part of a matrix; "⋆ " specifies blocks in the lower triangular part of the matrix, which are not referenced. We believe the rest of the notation is intuitive. Next to each operation, we provide the name of the BLAS kernel that is employed to perform the corresponding computation.
In the first variant (see Fig. 1 ), at a given iteration, the elements placed on the right of block 11 A are not updated/referenced. Eight operations are carried out, six of them corresponding to matrix-matrix products. However there are some features that will have a negative effect on the performance:  Data dependencies serialize the execution of most of the operations (e.g., four operations update the same block, 11 A ). Thus, concurrency is limited to the parallelism within each operation.
 All the operations, except the update of 00
A , involve only small blocks (taking into account that, for performance reasons, the block size b is chosen to be small compared with n). In fact, even 00
A is a small block during the first iterations of the loop. Figure 2 shows a second variant of the GJE algorithm, more suitable for parallel computing. In this variant, at each iteration, all the elements in the upper part of the matrix are updated. Although data dependencies among the operations still exist, in this version, two operations involve large blocks and concentrate most of the computations (considering that, for performance reasons, b is "small"), namely, the updates of blocks 00
A and 22 A .
In particular, both computations are performed as matrixmatrix products, which is well-known to be a highly parallel operation.
An additional advantage of the second variant is that it does not need any additional workspace, while the first variant utilizes an b n  workspace.
HIGH PERFORMANCE IMPLEMENTATIONS

Implementations Based on the Cholesky Factorization
The algorithm based on the Cholesky factorization for the computation of the inverse of an SPD matrix (see Section 2.1) is composed of three steps that must be executed sequentially. This implies that parallelism has to be extracted inside the execution of each step.
Three high performance implementations for this algorithm are described next. All of them compute the inverse in-place, reducing the memory requirements.
All the data transferences between the many-core GPU and the multi-core memory spaces, required by the GPUbased implementations, are performed using the synchronous transfer routine provided by CUDA.
Implementation on a multi-core CPU: the Intel
MKL library [11] offers kernels for the Cholesky factorization of an SPD matrix (routine potrf,
Step 1 in subsection 2.1), and the inversion using its triangular factors (routine potri, Steps 2 and 3 in subsection 2.1). The use of a multi-thread version of MKL offers parallelism and efficiency on the execution of both routines on a multi-core CPU.
2. Implementation on a many-core GPU: this implementation is similar to the previous one but the target architecture is now a GPU. One of our contributions in this work is the implementation of routines, gpu_potrf and gpu_potri for the GPU architecture. The first one computes the Cholesky factorization (Step 1), while gpu_potri can be used to obtain the matrix inverse from its Cholesky factors. The implementation of both routines makes intensive use of kernels from the nVidia CUBLAS library [12] . This implementation requires two matrix transfers between the CPU and the GPU memory spaces. Initially the matrix to invert is transferred to the GPU; the inverse is then computed in the GPU; and finally the inverse is retrieved back to the CPU.
Hybrid Implementation: this implementation
executes each operation on the most convenient processor with a minor increase of data transfers between them. In general, the CPU is faster than the GPU executing fine grain operations, while the GPU outperforms the CPU in the execution of large and highly parallel operations. Our algorithm is mainly composed of large operations suitable for the GPU, but it also performs a few fine grain operations, especially during the computation of the Cholesky factorization. In this implementation, the CPU and the GPU work jointly to compute the Cholesky factorization of the matrix; in particular, the large matrix-matrix products (symmetric rank-k updates) that appear during the Cholesky factorization are computed by the GPU, and the rest of the operations are executed on the CPU.
Implementations Based on the Gauss-Jordan Elimination Algorithm
In this subsection we describe five implementations for the GJE algorithm shown in Section 2.
1. Implementation on a multi-core CPU: two variants of the GJE algorithm were presented in Section 2.2. In both variants, most of the computations are cast in terms of matrix-matrix products. In particular, the operation that involves a higher number of flops is a symmetric rank-k update. The MKL library offers highly tuned implementations of this kernel as well as of the remaining operations present in algorithms , which is a relatively small block. Therefore, we cannot expect any significant benefit from computing them in the GPU. Nevertheless, the purpose of this is to reduce the amount of data transfers between the GPU and the CPU memory spaces.
3. Hybrid implementation: high performance can be expected from the GPU implementations as the algorithms mostly invoke kernels which are available in the CUBLAS library. However, some fine grain computations, like the update of block 11 A , can limit performance since they are not suitable for this architecture. The hybrid routine described in Figure 3 implements the 2 _ V BLK GJE algorithm which carries out the update of block 11 A on the CPU and the rest of the computations on the GPU. In this implementation, each operation is executed on the most convenient device, with a small overhead due to data transfers. This ensures optimal performance in each single operation while the data communication time remains small. Furthermore, both architectures perform computations concurrently when possible, reducing the computational time. Three transfers are needed per iteration, all involving a small block, 11 A . Those transfers enable the concurrent execution of operations on the CPU and on the GPU, adding a new level of parallelism and improving the resources utilization.
EXPERIMENTAL RESULTS
In this section we evaluate the implementations presented in Section 3 and compare their performance with that of the routines in LAPACK and BLAS.
All experiments were performed using IEEE single precision arithmetic. Results are shown for SPD matrices of dimension 1,000, 2,000, … , 15,000. Different algorithmic block sizes were tested (512, 256, 128, 64 and 32) but, for simplicity, only the performance obtained with the optimal algorithmic block size is shown. The platform employed in the experiments consists of four Intel Xeon X7550 processors (8 cores per processor) connected to an nVidia S2050 GPU via a PCI-Express 2.0 bus. Although the nVidia device is composed of 4 GPUs (nVidia C2050, Fermi), with up to 448 cores per GPU, only one of the GPUs is employed in the experiments. Further details on the hardware can be found in Table I .
Figure 3. Hybrid Concurrent Implementation for Matrix Inversion of SPD Matrices via GJE (Variant 2) in a Hybrid
Architecture Composed by a Multicore CPU and a GPU.
Kernels from the Intel MKL 11.0 multi-threaded implementation of BLAS are used for most of the computations on the multi-core, while routines from nVidia CUBLAS 3.2 are employed on the GPU. As the MKL library kernels do not scale well beyond 16 cores we omit the results obtained using 32 threads. Figure 4 shows the performance attained by the implementations based on the Cholesky factorization. The LAPACK implementation obtains the best performance for matrices of dimension up to 10,000, while implementations for the GPU clearly outperform LAPACK for larger matrices. The hybrid approach increases the performance notoriously, becoming nearly 2× faster than the traditional LAPACK implementation for matrices of dimension 15,000. Furthermore, while the performance of the LAPACK implementation peaks at a matrix dimension of 3,000 but then stabilizes around problem sizes of 6,000, the performance of the GPUbased implementations continues growing even for the larger matrices evaluated in the experiment. Thus, the use of a massively parallel architecture, like the GPU, improves the scalability of the implementation. Figure 5 shows the performance achieved by the different implementations based on the GJE algorithm. The best performance for small and medium matrices (up to 7,000) is again obtained with a CPU-based implementation. For larger matrices, the hybrid variant is the best implementation. Routines that only use the graphics processor deliver a poor performance, due to the number of operations in the algorithm that involve small blocks. Those operations are not well suited for the GPU architecture, so high performance cannot be expected from their execution. The consequence is that the execution time dedicated to perform small operations is too high, and hereby the performance of the overall process suffers. On the other hand, the multi-core efficiently executes the operations that involve small blocks so, despite this architecture is slower for the large operations, the overall performance is higher. The hybrid approach efficiently leverages both architectures, executing the small operations on the CPU and large operations on the GPU. This is why it reports the lowest execution time for medium and large matrices. Figure 6 shows the optimal block size for the best implementation on each architecture. The optimal algorithmic block size is a trade-off between the optimal block size for small and large operations. When the block size is reduced, more computations are performed in terms of large updates (basically matrix-matrix products), this also complicates the efficient use of the hierarchical memory, limiting the performance. In the GPU-based implementations, given the low performance attained by the small operations and the high performance attained for the large matrix-matrix products, the optimal block size is relatively small. The opposite happens for the multi-core implementation, where the difference between the performance attained by the large and the small operations is smaller. The hybrid approach is somewhere in the middle between the two other alternatives: the optimal block size is increased from 32 to 256. The objective here is to balance the execution time of the small operations on the multicore and the large operations on the GPU.
Finally, Figure 7 shows the performance of the LAPACK implementation and the best implementations for each architecture. It is important to notice that, for all the architectures, the best implementation is based on the GJE algorithm. Routine GJE_v2(GPU) outperforms LAPACK when the matrix dimension is larger than 9,000. With smaller matrices, the data transfer time and the operations with small blocks makes renders this code inefficient. The multi-core implementation, GJE_v2(CPU), offers a relevant behavior, and clearly outperforms LAPACK and GJE_v2(GPU). The fastest option for large matrices is the hybrid implementation based on the GJE algorithm; despite the communication overhead, it executes each operation on the most convenient device, and also performs concurrent operations on both devices, so it offers higher performance for matrices of dimension larger than 7,000. Finally, we point out the scalability of the GJE(HIB) variant, and, in general, of all the implementations based on the GJE algorithm.
CONCLUDING REMARKS
We have analyzed the inversion of SPD matrices. This operation appears in different scientific applications and features a high computational cost, which asks for the use of high performance architectures like multi-core CPUs or many-core graphics processors (GPUs). The study includes the evaluation of two different algorithms, the conventional one based on the Cholesky factorization, and the GJE algorithm, more suitable for parallel architectures.
Several implementations are presented for each algorithm and architecture. Most of the computations are executed invoking kernels from high performance libraries, Intel MKL library in the case of the multi-core CPU, and nVidia CUBLAS library for the GPU. Experimental results show that higher performance is attained from those routines based on the GJE algorithm. This algorithm exhibits a remarkable scalability in all its implementations. Three different hardware options are addressed: multi-core CPU, GPU, and a hybrid platform consisting of a multi-core CPU connected to a GPU. The best performance for large matrices is obtained with the hybrid implementation, where both architectures collaborate to obtain the matrix inverse. The main advantage of the hybrid implementation comes from the execution of each operation on the most convenient device, and the concurrent usage of both resources.
Future research to improve the proposed hybrid implementation will include:
 Overlap communication and computations using asynchronous transfers. A careful schedule of operations in both architectures will make possible to partly overlap communications and computations.  Use of different block sizes for each architecture.
In this case, the use of different block sizes for the CPU and the GPU will increase the productivity from both architectures.
 Use of multiple GPUs. This is specially interesting for the inversion of very large matrices.
 Use of other GPU kernels or libraries that outperform CUBLAS.
