ABSTRACT As an essential part of current mainstream computing systems, GPUs are not only powerful graphics engines but also highly parallel programmable processors. Collaboration between CPUs and GPUs is required to obtain high computing performance in multi-CPU and multi-GPU heterogeneous systems. It is challenging to develop new parallel algorithms on heterogeneous architectures with multiple CPUs and multiple GPUs for such purposes as communication, load balancing, memory spaces, and synchronization We present a parallel Cholesky block factorization algorithm for heterogeneous multi-CPU and multi-GPU architectures. First, a matrix is partitioned into different-sized blocks based on with the performance of the CPU and GPU. Then, a one-dimensional row block-cyclic distribution strategy is used to allocate row block data to every CPU and GPU to minimize communication. The computing task related to the definite row block will then be executed by the corresponding CPU or GPU. Experiments on a system with two CPUs and eight GPUs show good load balancing, parallelism, communication cost, and scalability.
I. INTRODUCTION
Parallelism will be increasingly important in future computing architectures. Microprocessor development will continue to focus on adding cores rather than increasing single-thread performance [1] , [2] . For the past few years, heterogeneous computing has undergone great development and is used in scientific and engineering computing. Currently, computer architects, programmers, and researchers no longer debate CPU versus GPU, but discuss how to design better paradigms in which the best features of CPUs and GPUs can be intelligently combined to achieve further computational gains [3] - [7] . Heterogeneous architectures combined with CPUs and GPUs have been used in most TOP500 and Green500 supercomputers [8] , [9] . Simultaneously, heterogeneous computing systems confront new challenges, especially in parallel development, task allocation, load balancing, and so on [10] , [11] . In multi-CPU and multi-GPU architectures, programmers must perform tedious and difficult work such as allocating tasks and estimating performance for every processor. Consequently, programmers will not want to accept and use heterogeneous parallel programming. Due to the vastly different architectures and programming models, traditional single CPU programming techniques may not work well and may even result in heavy overhead or error in the multi-CPU and multi-GPU system [12] , [13] . It is challenging to develop new parallel algorithms on heterogeneous architecture with multiple CPUs and multiple GPUs for various reasons, such as communication, load balancing, memory spaces, synchronization, etc. This paper presents a heterogeneous parallel Cholesky block factorization algorithm for multi-CPU and multi-GPU architecture. According to the performance of the CPUs and GPUs, the matrix is partitioned into different-sized blocks, and then, a onedimensional block-cyclic distribution strategy is used to minimize communication and achieve load balancing. The experiments on a system with two CPUs and eight GPUs show good load balancing, parallelism, communication cost and scalability.
The remainder of this paper is organized as follows: Section II reviews related works. Section III introduces the proposed algorithm for heterogeneous multi-CPU and multi-GPU architecture. Section IV shows the experimental results, followed by a brief conclusion in Section V.
II. RELATED WORK
Many problems, such as solving linear systems of equations, least squares fitting, and finding eigenvalues, can ultimately be reduced to matrix factorization. LAPACK and ScaLAPACK, the de facto software standards for high performance dense linear algebra computations, have been developed for shared-memory and distributed-memory architectures [14] . LAPACK [15] is a software library that uses block-partitioned algorithms for performing dense and banded linear algebra computations on vector and shared memory computers. ScaLAPACK [16] is a scalable software library developed for distributed memory concurrent computers, and it also runs on vector and shared memory computers. It is compatible with the LAPACK library. Their exploitation of parallelism is based on the availability of parallel BLAS. BLAS contains three level linear algebra subroutines such as dot-product, vector-matrix multiplication, and matrix-matrix multiplication. The parallel BLAS in LAPACK is based on the parallel block BLAS and assumes a blockcyclic data distribution. In contrast, the parallel BLAS implementation in ScaLAPACK relies internally on the BLAS, the PB-BLAS and the BLACS. CUBLAS [17] and CULA [18] have implemented the standard BLAS and LAPACK subroutine libraries, respectively, on multiple GPUs. However, they cannot be used in heterogeneous multi-CPU and multi-GPU systems directly. Elmroth et al. [19] present a Cholesky factorization algorithm to develop the parallelism of BLAS by increasing the level-3 computation. However, this method is not suitable for solving the fine granularity level parallelism and it does not support heterogeneous multi-CPU and multi-GPU systems. Fogué et al. [20] use the existing PLAPACK library with GPU-accelerated clusters. In their works, most of the computation-intensive work is processed by GPUs and the data are stored in GPU local memories to reduce communication. For tall and skinny matrices, Anderson et al. [21] developed a communication-avoiding QR factorization for special matrices and minimized the communication cost. StarPU also develops a dynamic load balancing framework to execute a sequential code on multicores and multiple GPUs in parallel, which has been applied to QR factorization [22] . Abdelfattah et al. [23] present a high performance Cholesky factorization run on GPU with shared memory. They discuss the performance of both the batch mode and the native mode. Kurzak et al. [24] also develop a batched Cholesky factorization for multi-GPU system. For sparse matrices, Rennich et al. [25] describe a supernodal Cholesky factorization algorithm. But they do not extend their algorithm to the multi-GPU configuration. On the other hand, static data distribution strategy is also used in heterogeneous systems to solve matrix factorization. Agullo et al. [26] make a deep analysis and comparison of static and dynamic Cholesky factorization algorithm on platforms consisting of GPUs and CPUs. Considering the heterogeneity of processors as well as memory locality, Lastovetsky and Reddy [27] develop a static data distribution for dense matrix factorization. Dong [28] presents a matrix factorization for multicores and multi-GPU systems. The heterogeneous algorithm with different tile sizes is designed to address processor heterogeneity and static data distribution is used to realize load balancing. However, this algorithm has not implemented the parallelism for left column computation. Considering load balancing, parallelism, communication cost and scalability, etc., we present a parallel Cholesky factorization algorithm for heterogeneous multi-CPU and multi-GPU architectures by using a onedimensional row block-cyclic distribution strategy to allocate data and computing to CPUs and GPUs statically. The algorithm has good performance on load balancing, parallelism, communication cost and scalability.
III. HETEROGENEOUS CHOLESKY PARALLEL BLOCK FACTORIZATION ALGORITHM
Because Cholesky factorization must be of applied to a symmetric positive definite matrix, only the lower triangular matrix has to be stored. Assuming an N × N matrix that can be divided into n × n sized blocks, we denote the equation A = LL T as shown in (1).
where A
(1) ij and L ij denote matrix blocks for any i and j, and L ii is a lower triangular matrix block for any i.
Similarly, for k = 1, · · · , n, we denote A (k) as (2) .
Then, the k-th step of the Cholesky block factorization task can be expressed using the following three operations:
to update the trailing submatrix and obtain A (k+1) . Choi et al. [29] developed a parallel Cholesky block factorization algorithm, but it cannot be used in heterogeneous systems with multiple CPUs and multiple GPUs. Our algorithm will use a one-dimensional row block-cyclic distribution strategy to minimize communication and achieve load balancing. 
A. ROW BLOCK-CYCLIC DATA DISTRIBUTION
Like block-cyclic data distribution, we use the onedimensional row block-cyclic data distribution strategy. Suppose the system contains p processors (CPUs and GPUs), and matrix A is divided into n × n blocks. Then, the i-th (i = 0, · · · , n − 1) row block will be distributed to processor j statically, where j = i mod p. Because the CPUs and GPUs perform differently, a single block size does not work well on multi-CPU and multi-GPU systems. We will partition the matrix into different-sized blocks. The smaller row blocks will be assigned to CPUs and the larger row blocks to GPUs to achieve load balancing to match their different computing capability. Fig. 1 shows an example in which a matrix A is partitioned into 9×9 blocks and distributed in a system with one CPU and two GPUs, where C denotes CPU, and G 1 and G 2 denote two different GPUs. Obviously, only the lower triangular part of matrix A must be stored. According to row block-cyclic distribution strategy, the smaller row blocks 1, 4 and 7 are allocated to the CPU, the larger row blocks 2, 5 and 8 to the first GPU, and the larger row blocks 3, 6 and 9 to the second GPU. Moreover, in a static row block-cyclic distribution strategy, the GPUs only store the corresponding row blocks in their local memory to minimize communication.
B. PARALLEL CHOLESKY ROW BLOCK FACTORIZATION ALGORITHM
With double-precision data types, for example, the subroutine DPOTF2 in LAPACK and the subroutines DTRSM, DGEMM, and DSYRK in BLAS are used to develop the heterogeneous parallel Cholesky row block algorithm. At the k-th iteration, A (k+1) ij and A (k) ij will be stored in the same memory unit. Hence, no extra memory is needed when updating the trailing submatrix. In addition, we will use A ij to express any A (k) ij in the algorithm. The subroutines DPOTF2, DTRSM, DGEMM and DSYRK are described as follows: 1) DPOTF2 is used to compute the Cholesky factorization of a symmetric positive definite matrix. If the input is a real symmetric positive definite matrix A kk , DPOTF2(A kk , L kk ) will output the lower triangular 
Algorithm 1 Heterogeneous Parallel
to update corresponding blocks. MPI_BARRIER(); } Assuming p to be the total number of processors (CPUs and GPUs) in the system, and the matrix A to be partitioned into n × n blocks in line with the performance of the CPUs and GPUs, we can then describe the heterogeneous parallel Cholesky row block algorithm as in Algorithm 1.
In the step where all processors P x (x = 1, · · · , p) update the corresponding blocks A x y , we will use the subroutines DSYRK and DGEMM. In detail, if y = x , P x will perform DSYRK (L x k , A x x ), to update A x x , i.e., A Fig. 2 illustrates the operations of the first iteration (k = 1) of the heterogeneous parallel Cholesky row block factorization for a matrix with 9 × 9 blocks. Here, C, G 1 , and G 2 will be expressed as P 1 , P 2 , and P 3 , respectively. (a) The row blocks 1, 4, and 7 are distributed to P 1 , the row blocks 2, 5, and 8 to P 2 , and the row blocks 3, 6, and 9 to P 3 . (b) The processor P 1 performs DPOTF2 (A 11 , L 11 ) to compute L 11 . (c) All processors 
to update the corresponding blocks A x y by using subroutines DSYRK and DGEMM, where P 1 updates the row blocks 4 and 7, i.e., VOLUME 6, 2018 A 4y (y = 2, 3, 4) and A 7y (y = 2, · · · , 7), P 2 updates the row blocks 2, 5, and 8, i.e., A 22 , A 5y (y = 2, · · · , 5) and A 8y (y = 2, · · · , 8), and P 3 updates the row blocks 3, 6, and 9, i.e., A 3y (y = 2, 3), A 6y (y = 2, · · · , 6), and A 9y (y = 2, · · · , 9).
IV. IMPLEMENTATION AND EVALUATION
For the experiment, the computing system is composed of two 8-core Intel Xeon E5-2640 CPUs and eight Tesla M2090 GPUs showed in Table 1 .
Since the system will use either CPUs or GPUs to compute different matrix blocks, we designed two different computational processes for CPUs and GPUs, respectively. It is necessary to allocate the smaller matrix blocks to CPUs and the larger matrix blocks to GPUs to achieve load balancing on the CPUs and GPUs. Assuming a system with same performance for CPUs and the same performance for GPUs, we estimated the sizes of the matrix blocks allocated to the CPUs and GPUs by using (3).
where B C and B G are the sizes of the matrix blocks allocated to CPUs and GPUs, respectively, and Perf C and Perf G denote the maximum performance of the CPUs and GPUs, respectively. In fact, the sizes of the matrix blocks can affect performance. For a smaller matrix, larger block sizes mean fewer blocks, which would reduce the parallel performance. Conversely, for a larger matrix, smaller block sizes mean more blocks, but smaller blocks cannot make the most of the performance of the GPUs. Hence, in our experiment, we will partition larger matrices into larger-sized blocks to take advantage of the computational performance of the GPUs, and smaller matrices into relatively smaller-sized blocks to obtain some balance between computational performance and parallelism. For convenience, when k(k = 1, · · · , 8) GPUs are selected to run our algorithm, corresponding k CPU cores will be assigned to match them in our experiment. For example, the system with 2 CPUs and 4 GPUs, simply denoted as 2C+4G in this paper, will create 16 processes, 4 of which run on the 4 CPU cores and 4 GPUs, and 12 of which run on the remain 12 CPU cores.
Compared with homogeneous parallel algorithm such as DPLASMA, PLAPACK, etc., heterogeneous parallel algorithm takes advantage of the computing power of the CPUs and GPUs. Fig. 3 compares the performance of our algorithm against the performance of the parallel Cholesky algorithm in PLAPACK [30] for different size single precision matrix and double precision matrix from 3N to 7N (N = 16384) in different system configurations. When the system consists of more CPU cores and fewer GPUs like 2C+1G configuration, heterogeneous parallel Cholesky block factorization algorithm shows significantly better performance against the Cholesky algorithm in PLAPACK. The main reason is that the former use the 15 CPU cores and 1 GPU to factorize matrix, but the latter only use the 1 GPU to factorize matrix. However, when the system is connected with more GPUs like 2C+8G, heterogeneous parallel Cholesky block factorization algorithm is only slightly better than the Cholesky algorithm in PLAPACK. Obviously, compared with 8 GPUs, the computing capability of 8 CPU cores is trivial.
A. COMMUNICATION COST AND PARALLELISM
At the k-th step, the communication cost is as follows:
1) Current processor P k (k = modp) broadcasts L kk to all processors P x after Cholesky A kk . 2) Every processor P x receives L kk , and then broadcasts
update the trailing submatrix. Obviously, all processors P x parallel perform kernel function DTRSM(L kk , A x k , L x k ) at step 2. In contrast, for the corresponding algorithm in [28] , only one processor performs kernel function DTRSM(L kk , A x k , L x k ) at this step. Communication traffic is almost the same for the two different algorithms.
B. LOAD BALANCING
Imbalance ratio proposed in [31] is used to evaluate load balancing quality in the experiment. Imbalance ratio is defined as:
where MaxLoad denotes the maximal amount of load of a processor, and AvgLoad denotes the average load of all the processors. The Imbalance ratio is greater than or equal to 1. The closer it is to 1, the better. Fig. 4 shows the Imbalance ratio of our parallel row block Cholesky factorization for double-precision matrices on different configurations. The size of the matrix is gradually increased from 1N = 16384 to 8N , and the number of CPUs/GPUs is correspondingly increased from 1CPU + 1GPU to 2CPUs + 8GPUs. Fig. 4 shows that all the imbalance ratios are less than 10%, and most of them are less than 5%. Given a particular processor performance of the CPUs and GPUs, the larger the matrix is, the better the balance performance is, with no evident differential. Because there are fewer blocks, the Imbalance ratio is relatively larger with smaller matrices and more GPUs such as 1N = 16384 and a 2CPUs + 8GPUs configuration. When the matrix size is increased or the block size is decreased, the Imbalance ratio will decrease. Communications between CPUs also influences the Imbalance ratio slightly.
C. SCALABILITY
First, scalability can be used to evaluate the program capability to solve potentially larger problems when more computing resources are available. In the experiment, when the number of CPUs and GPUs increased, we increased the input size accordingly to test system scalability. On the other hand, scalability can also be used to evaluate the speed-up of a program solving a specific problem if the program is given more computing resources. In this case, we fix a matrix size and gradually add more computing resources to the program attempting to solve it.
In presents the number of CPUs and GPUs, from 1CPU + 1GPU to 2CPUs + 8GPUs, in the experiment. The y-axis shows the different matrix sizes from 1N = 16384 to 8N . The zaxis is the parallel execution time in seconds on a logarithmic scale. When the value of x is fixed and processors along with y-axis are added gradually, the execution time decreases proportionally to the performance of processors on a logarithmic scale. Fig. 6 shows the execution time of the Cholesky parallel algorithm run on different CPU/GPU configurations and different-sized matrices in single precision and double precision, respectively. Fig. 7 shows the trends of the performance of our algorithm when the computing source is increased from 1CPU + 1GPU to 2CPUs + 8GPUs and the matrix size is increased from 1N to 8N simultaneously. Theoretically, if the time of Cholesky factorization is t 0 for an N × N matrix, then the execution time will be k 3 t 0 for a kN × kN matrix. That is, the execution time is proportional to the matrix size in a logarithmic scale. Fig. 7 shows that the two curves also follow a nearly straight line in a logarithmic scale when matrix size and the number of GPUs simultaneously increase proportionally. Fig. 7 also shows that the algorithm performs better when running on a larger matrix. Experiment results also show that communication and data exchange between CPUs can lead to performance degradation to some extent on 2 CPU configurations. 
V. CONCLUSION
In this paper, we present a heterogeneous parallel Cholesky row block factorization algorithm for multi-CPU and multi-GPU architectures. Considering factors such as communication cost, memory consistency, load balance, synchronization, etc., it is difficult to develop a new parallel algorithm for heterogeneous systems. In our method, a matrix is partitioned into different-sized blocks according to the available CPU and GPU performance; then, a one-dimensional row block-cyclic distribution strategy is used to allocate differentsized row block data to CPUs and GPUs statically. This approach makes it relatively easy to maintain load-balanced CPUs and GPUs and also minimizes the communication cost. In an experiment, we investigated the scalability and load balancing when the matrix size is increased from small to large and computing resources are added. Results show good performance on scalability, load balancing, and communication cost. Because static data allocation is used, the largest matrix size is constrained by the local memory capacity of the GPUs. If a data replacement strategy is used to solve this constraint, increasing communication costs resulting from data transfer will lead to a significant degradation of system performance. Furthermore, block size is related to the performance of the CPUs and GPUs. The adaptability and portability of the algorithm will be the subject of future work.
