The solution of large-scale dense nonsymmetric eigenvalue problems is required in many areas of scientific and engineering computing, such as vibration analysis of automobiles and analysis of electron diffraction patterns. In this study, we focus on the Hessenberg reduction step and consider accelerating it in a hybrid CPU-GPU computing environment. Considering that the Hessenberg reduction algorithm consists almost entirely of BLAS (Basic Linear Algebra Subprograms) operations, we propose three approaches for distributing the BLAS operations between the CPU and the GPU. Among them, the third approach, which assigns small-size BLAS operations to the CPU and distributes large-size BLAS operations between the CPU and the GPU in an optimal manner, was found to be consistently faster than the other two approaches. On a machine with an Intel Core i7 processor and an NVIDIA Tesla C1060 GPU, this approach achieved a 3.2 times speedup over the CPU-only case when computing the Hessenberg form of an 8,192 × 8,192 real matrix.
2 Algorithm for the dense nonsymmetric eigenvalue problem
Four steps for the solution of the nonsymmetric eigenvalue problem
Let A be an n × n real nonsymmetric matrix. The standard procedure for solving the eigenvalue problem $Ax = \lambda x$ consists of the following four steps [10]. Although the procedure given here is for real matrices, the procedure for complex matrices is almost the same.
Step 1. Hessenberg reduction The input matrix A is reduced to an upper Hessenberg matrix H by applying Householder transformations $Q_1, Q_2, \ldots, Q_{n-2}$ from left and right.
Step 2. Accumulation of Householder transformations The Householder transformations used in Step 1 are accumulated as an orthogonal matrix Q.
Step 3. Schur decomposition The Hessenberg matrix H is transformed into a block upper triangular matrix T with diagonal blocks of size at most 2 by applying orthogonal transformations from left and right (the QR algorithm). The orthogonal matrices used in the transformation are accumulated as an orthogonal matrix $P$.
The eigenvalues of A are given as the eigenvalues of the diagonal blocks of T .
Step 4. Computation of the eigenvectors The eigenvectors y of T are computed, and they are transformed into the eigenvectors of A by $x = QPy$.
The computational time of each step is shown in Fig. 1 for the n = 4,800 case. The computation is performed using LAPACK [4] on the Intel Core i7 920 (2.66 GHz) processor using four cores. It can be seen that steps 1, 3 and 4 occupy a large fraction of the computational time. In terms of computational work, step 1 usually requires much less work than step 3. However, in contrast to step 3, which can be performed almost entirely with the level-3 BLAS [8], step 1 requires both level-2 and level-3 BLAS operations. As is well known, the level-2 BLAS cannot reuse the matrix data, and its performance is usually bounded not by the peak FLOPS value of the CPU but by the memory throughput [10], [6]. This is the main reason why step 1 takes a long time.
Step 4, which occupies the largest time, is slow because the corresponding LAPACK routine is written without using level-2 or level-3 BLAS; the reason is that the computation requires various kinds of exception handling. To speed up this routine, an overall modification of the algorithm would be necessary.

[Figure 1: Execution time (sec.) of steps 1 through 4 for n = 4,800.]

In this paper, we focus on speeding up step 1 using the GPU. GPUs have much higher memory throughput than CPUs, so we can expect the level-2 BLAS to execute much faster on the GPU. As a preparation for porting step 1 to the GPU, we explain the algorithm of Hessenberg reduction in more detail in the rest of this section.
Hessenberg reduction by Householder transformations
Let us denote the submatrix of A consisting of rows $r_1$ through $r_2$ and columns $c_1$ through $c_2$ by $A_{r_1:r_2,\,c_1:c_2}$. Also, denote the identity matrix of order m by $I_m$ and the zero vector of length m by $0_m$. At the kth step of the Hessenberg reduction ($k = 1, \ldots, n-2$), we determine a Householder transformation $\tilde{Q}_k = I_{n-k} - \tau_k \tilde{v}_k \tilde{v}_k^T$ to annihilate all but the first element of the current $A_{k+1:n,\,k}$. The Householder transformation $Q_k$ in Eq. (1) is defined as $Q_k = I_k \oplus \tilde{Q}_k$, where $\oplus$ denotes the direct sum of two matrices.
Using these definitions of $\tilde{Q}_k$ and $Q_k$, the algorithm of Hessenberg reduction can be written as follows: for $k = 1, \ldots, n-2$, construct $\tilde{Q}_k$ from the current $A_{k+1:n,\,k}$ and update $A \leftarrow Q_k A Q_k$.
This is the basic algorithm for Hessenberg reduction. In this algorithm, most of the computations are done with the level-2 BLAS, namely, matrix-vector multiplication and rank-1 update of a matrix ($A - \tau v u^T$). However, these operations cannot reuse the matrix data, and their performance is usually limited by the memory throughput. To mitigate this problem, a blocked version of Hessenberg reduction has been proposed [9], as we will see below.
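To make the structure concrete, the following is a minimal CPU-only C sketch of this unblocked algorithm. It is a sketch under the assumption of column-major storage; the inner loops stand in for the DGEMV and DGER calls a real implementation would use.

#include <math.h>
#include <stdlib.h>

#define IDX(i, j, n) ((i) + (j) * (n))  /* column-major indexing */

/* Reduce the n x n matrix A to upper Hessenberg form in place. */
void hessenberg_unblocked(int n, double *A)
{
    double *v = malloc(n * sizeof(double));
    double *w = malloc(n * sizeof(double));
    for (int k = 0; k < n - 2; k++) {          /* 0-based step index */
        int m = n - k - 1;                     /* length of the column to annihilate */
        double *x = &A[IDX(k + 1, k, n)];
        /* Construct the reflector: v = x - alpha*e1, tau = 2/(v^T v),
           so that (I - tau*v*v^T) x = alpha*e1. */
        double norm = 0.0;
        for (int i = 0; i < m; i++) norm += x[i] * x[i];
        norm = sqrt(norm);
        if (norm == 0.0) continue;             /* column already annihilated */
        double alpha = (x[0] > 0.0) ? -norm : norm;  /* sign choice avoids cancellation */
        for (int i = 0; i < m; i++) v[i] = x[i];
        v[0] -= alpha;
        double vtv = 0.0;
        for (int i = 0; i < m; i++) vtv += v[i] * v[i];
        double tau = 2.0 / vtv;
        x[0] = alpha;                          /* the annihilated column */
        for (int i = 1; i < m; i++) x[i] = 0.0;
        /* Left update: A <- A - tau * v * (v^T A) on rows k+2:n of the trailing columns
           (a matrix-vector product followed by a rank-1 update). */
        for (int j = k + 1; j < n; j++) {
            double s = 0.0;
            for (int i = 0; i < m; i++) s += v[i] * A[IDX(k + 1 + i, j, n)];
            for (int i = 0; i < m; i++) A[IDX(k + 1 + i, j, n)] -= tau * s * v[i];
        }
        /* Right update: A <- A - tau * (A v) * v^T on all rows. */
        for (int i = 0; i < n; i++) {
            double s = 0.0;
            for (int j = 0; j < m; j++) s += A[IDX(i, k + 1 + j, n)] * v[j];
            w[i] = s;
        }
        for (int i = 0; i < n; i++)
            for (int j = 0; j < m; j++)
                A[IDX(i, k + 1 + j, n)] -= tau * w[i] * v[j];
    }
    free(v);
    free(w);
}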
Block algorithm for Hessenberg reduction
In this subsection, we give an overview of the blocked version of the Hessenberg reduction algorithm. The main idea is to accumulate several Householder transformations and apply them at once using matrix multiplications. More specifically, let L be the block size and consider the 1st through the Lth steps of Hessenberg reduction. Let $v_k = 0_k \oplus \tilde{v}_k$ and define an $n \times k$ matrix $V_k$ by

$V_k = \begin{pmatrix} v_1 & v_2 & \cdots & v_k \end{pmatrix}.$
Also, let $T_k$ (a $k \times k$ upper triangular matrix) and $Y_k$ (an $n \times k$ matrix) be matrices such that

$Q_1 Q_2 \cdots Q_k = I_n - V_k T_k V_k^T, \qquad (6)$

$Y_k = A V_k T_k.$
Eq. (6) is known as the compact-WY representation of the k Householder transformations [14]. Then, if we have $V_L$, $T_L$ and $Y_L$, we can apply the 1st through the Lth Householder transformations to A from left and right as follows:

$A \leftarrow A - Y_L V_L^T, \qquad (8)$

$A \leftarrow A - V_L T_L^T (V_L^T A). \qquad (9)$
Clearly, these operations can be done entirely with the level-3 BLAS, that is, matrix multiplication. Note also that in computing Eq. (9), only rows 2 through n are affected, due to the structure of $v_k$. The matrices $V_k$, $T_k$ and $Y_k$ can be generated recursively. First, given the 1st column of A, we can construct $\tau_1$ and $v_1$. Then we set

$V_1 = (v_1), \quad T_1 = (\tau_1), \quad Y_1 = \tau_1 A v_1.$
Now, assume that we are given $V_{k-1}$, $T_{k-1}$ and $Y_{k-1}$. Then we can update the kth column of A using Eqs. (8) and (9). From the updated column, we can compute $\tau_k$ and $v_k$, constructing $V_k$. Then $T_k$ can be computed using the update formula of the compact-WY representation as follows [14]:

$T_k = \begin{pmatrix} T_{k-1} & -\tau_k T_{k-1} V_{k-1}^T v_k \\ 0 & \tau_k \end{pmatrix}. \qquad (13)$
To compute $Y_k$, we use the following relationship:

$Y_k = A V_k T_k = \begin{pmatrix} Y_{k-1} & \tau_k (A v_k - Y_{k-1} V_{k-1}^T v_k) \end{pmatrix}. \qquad (14)$

Thus only the last column of $Y_k$ has to be computed at the kth step, and the dominant cost in computing it is the matrix-vector product $Av_k$.
In summary, in the block algorithm, the computation is divided into the following two phases.
(a) Update of columns 1 through L of A and construction of $V_L$, $T_L$ and $Y_L$. This is performed by repeating the following for $k = 1, 2, \ldots, L$:
(a1) Update the kth column of A from the right using Eq. (8).
(a2) Update the kth column of A from the left using Eq. (9).
(a3) Construct the Householder transformation $(\tau_k, v_k)$ from the updated column.
(a4) Compute the kth column of $T_k$ using Eq. (13).
(a5) Compute the kth column of $Y_k$ using Eq. (14), which requires the matrix-vector product $Av_k$.

(b) Update of columns L+1 through n of A. This consists of the update from the right by Eq. (8) (step (b1)) and the update from the left by Eq. (9) (step (b2)).
In this algorithm, phase (b), which accounts for more than 60% of the computational work, can be done entirely with the level-3 BLAS. However, the remaining part, phase (a), consists largely of level-2 BLAS operations, namely matrix-vector multiplications. This becomes a serious bottleneck, as can be seen from Fig. 1. We seek to accelerate both phases (a) and (b) using a GPU, with its high floating-point performance and large memory throughput.
CUDA and CUBLAS
To speed up Hessenberg reduction with the computing power of GPU, we use the CUDA (Compute Unified Device Architecture) environment [1] provided by NVIDIA Corp. CUDA is an integrated software development environment for GPGPU and consists of a compiler, mathematical libraries and other software.
The nvcc compiler
When doing matrix computations in the CUDA environment, we have two options. One is to use the nvcc compiler. In this case, the user writes the subroutines to be executed on the GPU in an extension of the C language. The nvcc compiler takes the main program and these subroutines as input and outputs executable code for both the CPU and the GPU. This gives the user a large degree of freedom in programming. On the other hand, the user has to write the GPU program from scratch, which is not easy, because various optimization techniques must be employed to exploit the potential of the GPU.
The CUBLAS
Another approach is to use the CUBLAS, which is a BLAS library for the GPU. The CUBLAS consists of routines to transfer data between the CPU memory and the GPU memory, and routines to perform basic linear algebra operations on data residing in the GPU memory. To use the CUBLAS, the user first transfers the necessary data to the GPU memory, performs several BLAS operations on that data, and copies the results back to the CPU memory (see Fig. 2). The CUBLAS is optimized for NVIDIA's GPUs, so the user can exploit the computing power of the GPU while minimizing the modification of the original program. In view of these advantages, we use the CUBLAS in this study.
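As an illustration, the following toy C program runs this three-stage workflow (send, compute, get back) for a single matrix-vector product. It is a minimal sketch written against the cublas_v2 interface of recent CUDA releases, which differs slightly from the original CUBLAS interface of the Tesla C1060 era; error checking is omitted.

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main(void)
{
    const int n = 1024;
    double *A = malloc(n * n * sizeof(double));   /* column-major */
    double *x = malloc(n * sizeof(double));
    double *y = malloc(n * sizeof(double));
    for (int i = 0; i < n * n; i++) A[i] = rand() / (double)RAND_MAX;
    for (int i = 0; i < n; i++)     x[i] = rand() / (double)RAND_MAX;

    cublasHandle_t handle;
    cublasCreate(&handle);

    /* Stage 1: transfer the data to the GPU memory. */
    double *dA, *dx, *dy;
    cudaMalloc((void **)&dA, n * n * sizeof(double));
    cudaMalloc((void **)&dx, n * sizeof(double));
    cudaMalloc((void **)&dy, n * sizeof(double));
    cublasSetMatrix(n, n, sizeof(double), A, n, dA, n);
    cublasSetVector(n, sizeof(double), x, 1, dx, 1);

    /* Stage 2: perform BLAS operations on the data residing on the GPU;
       here a single matrix-vector product y = A*x (level-2 BLAS). */
    const double one = 1.0, zero = 0.0;
    cublasDgemv(handle, CUBLAS_OP_N, n, n, &one, dA, n, dx, 1, &zero, dy, 1);

    /* Stage 3: copy the result back to the CPU memory. */
    cublasGetVector(n, sizeof(double), dy, 1, y, 1);

    printf("y[0] = %f\n", y[0]);
    cudaFree(dA); cudaFree(dx); cudaFree(dy);
    cublasDestroy(handle);
    free(A); free(x); free(y);
    return 0;
}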
[Figure 2: Data transfer between the main (CPU) memory and the GPU memory when using the CUBLAS.]

4 Implementation of Hessenberg reduction using CUBLAS
Basic principles for implementation
As we can see from Fig. 2, the CPU and the GPU have distinct memory spaces. The data transfer between these two memory spaces is usually very slow, achieving only a few gigabytes per second. Hence, to use the computing power of the GPU effectively, it is crucial to minimize the data transfer between CPU and GPU. From this viewpoint, it would be best to send all the data to the GPU at first and perform all the subsequent computations on the GPU. However, there are operations that cannot be performed by the CUBLAS. For example, the construction of a Householder reflector in step (a3) is not provided by the CUBLAS (or by BLAS in general). In that case, we have to bring the data back to the CPU, perform the necessary operation, and send the data back to the GPU. Also, from the viewpoint of utilizing the computing power of the system to the maximum extent, it is sometimes more effective to assign some of the work to the CPU or to distribute some of the work between the CPU and the GPU. Based on these observations, we consider three implementations in the following.
Implementation (I)
In this implementation, we try to minimize the amount of work assigned to the CPU. As can be seen from the algorithm presented in subsection 2.3, all the computations except for the construction of the Householder transformations can be done with BLAS. So we assign only the construction of the Householder transformations to the CPU. The schematic diagram of this implementation is shown in Fig. 3. At the beginning of the Hessenberg reduction, we send the matrix A to the GPU. At each step k, after updating the kth column of A using Eqs. (8) and (9), we bring the updated column to the CPU, construct the Householder transformation there, and send $\tau_k$ and $v_k$ back to the GPU. All other computations are performed on the GPU using the CUBLAS. Finally, the reduced matrix is returned to the CPU.
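In code, the per-step traffic of this implementation looks roughly as follows. This is an illustrative fragment, not our actual routine: house() stands for a hypothetical CPU helper that constructs $(\tau_k, v_k)$ as in subsection 2.2, and the host buffers col and v and the device buffer dv are assumed to be allocated elsewhere.

/* Step k (0-based) of implementation (I): only the reflector is built on the CPU. */
int m = n - k - 1;                                 /* length of the column below the diagonal */
cublasGetVector(m, sizeof(double), &dA[(k + 1) + k * n], 1, col, 1);  /* updated column to CPU */
double tau = house(m, col, v);                     /* hypothetical CPU helper: build tau_k, v_k */
cublasSetVector(m, sizeof(double), v, 1, dv, 1);   /* v_k back to GPU */
/* ... all remaining BLAS operations of step k run on the GPU via the CUBLAS ... */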
This implementation requires one send and one receive per step, and the data size of each transfer is O(n). Thus the total amount of data transfer is $O(n^2)$. Since the computational work of Hessenberg reduction is $O(n^3)$, we can expect the cost of data transfer to become negligible for large n.
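A back-of-the-envelope estimate illustrates this; the 3 GB/s transfer rate and the sustained arithmetic rate below are illustrative assumptions, while $(10/3)n^3$ is the standard flop count of Hessenberg reduction. For $n = 8{,}192$:

$\text{transfer} \approx \sum_{k=1}^{n-2} 2(n-k) \cdot 8\ \text{bytes} \approx 8n^2\ \text{bytes} \approx 0.54\ \text{GB} \;\Rightarrow\; \approx 0.2\ \text{s at 3 GB/s},$

$\text{compute} \approx (10/3)\,n^3 \approx 1.8 \times 10^{12}\ \text{flops} \;\Rightarrow\; \text{tens of seconds even at} \sim 10^{11}\ \text{flops/s}.$

Hence, at this size, the transfers account for only on the order of one percent of the total time.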
Implementation (II)
The second approach is to assign larger computational tasks to the GPU and smaller ones, including the construction of the Householder transformations, to the CPU. We consider this approach because the GPU generally cannot attain high performance on small data, due to an insufficient level of parallelism. There are several ways of dividing the computational tasks of Hessenberg reduction between CPU and GPU. Given the structure of the algorithm described in subsection 2.3, we adopt the following division:
• Phase (a), except for the computation of $Av_k$ in step (a5), is assigned to the CPU.
• Phase (b) and the computation of $Av_k$ in step (a5) are assigned to the GPU.
As in implementation (I), we send the whole matrix to the GPU at the beginning of the Hessenberg reduction. At each step k, the computation proceeds as follows:
• The CPU receives the kth column of A from the GPU, updates it as shown in steps (a1) and (a2), constructs the Householder transformation $(\tau_k, v_k)$ and sends $v_k$ to the GPU.
• The GPU computes $Av_k$ and sends the result to the CPU.
• The CPU updates $T_k$ and $Y_k$ using Eqs. (13) and (14) and sends the kth columns of $T_k$ and $Y_k$ to the GPU.
After these steps are repeated L times, the GPU updates columns L+1 through n of A as shown in steps (b1) and (b2). The total amount of data transfer is twice that of implementation (I), but performing the small-size matrix computations on the CPU may be advantageous.
Implementation (III)
In implementation (II), large-size BLAS operations, such as the computation of $Av_k$ and steps (b1) and (b2), are performed exclusively by the GPU, and the CPU is idle while these operations are executed. Implementation (III) tries to improve this situation. To this end, we introduce a division parameter r ($0 \le r \le 1$), which specifies the fraction of $A_{*,\,L+1:n}$ assigned to the CPU. More precisely, among the $n-L$ columns of $A_{*,\,L+1:n}$, the first $r(n-L)$ columns are allocated to the CPU, while the remaining columns are allocated to the GPU. In computing $Av_k$ and in steps (b1) and (b2), both the CPU and the GPU join the computation using the part of $A_{*,\,L+1:n}$ they own.
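For example, with n = 4,096 and L = 32 (the block size used in Section 5) and r = 6/32 (the optimal value found there for this n), the CPU owns the first

$r(n-L) = (6/32) \times 4{,}064 = 762$

columns of $A_{*,\,L+1:n}$, and the GPU owns the remaining 3,302 columns.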
At each step k, the computation proceeds as follows:
• The CPU receives the kth column of A from the GPU (unless it is already owned by the CPU), updates it as shown in steps (a1) and (a2), constructs the Householder transformation $(\tau_k, v_k)$ and sends $v_k$ to the GPU.
• Both the CPU and the GPU compute a partial result of $Av_k$ using the part of A they own.
• The CPU receives the partial result computed by the GPU and combines it with its own partial result to obtain the full $Av_k$; a sketch of this split product is given after this list.
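The following C function sketches one way to organize this split matrix-vector product, assuming the CPU owns the leading nc active columns (host array Ac) and the GPU owns the remaining ng columns (device array dAg, with the matching part of the input vector already in dvg). The BLAS calls (cblas_dgemv from a CPU BLAS and cublasDgemv from the CUBLAS v2 interface) are real; the surrounding names are illustrative. Because the CUBLAS call returns control to the CPU before the GPU finishes, the CPU-side product overlaps with the GPU-side one.

#include <cblas.h>
#include <cublas_v2.h>

/* y = [Ac | Ag] * [vc ; vg], with Ac (n x nc) on the CPU and dAg (n x ng)
   on the GPU.  All output buffers are allocated by the caller. */
void split_matvec(cublasHandle_t handle, int n, int nc, int ng,
                  const double *Ac, const double *dAg,
                  const double *vc, const double *dvg,
                  double *ycpu, double *dy, double *ygpu, double *y)
{
    const double one = 1.0, zero = 0.0;
    /* GPU partial product; the call is asynchronous, so the CPU work
       below overlaps with it. */
    cublasDgemv(handle, CUBLAS_OP_N, n, ng, &one, dAg, n, dvg, 1, &zero, dy, 1);
    /* CPU partial product on the columns owned by the host. */
    cblas_dgemv(CblasColMajor, CblasNoTrans, n, nc, 1.0, Ac, n, vc, 1, 0.0, ycpu, 1);
    /* Fetch the GPU partial result (this blocks until the GPU is done) and combine. */
    cublasGetVector(n, sizeof(double), dy, 1, ygpu, 1);
    for (int i = 0; i < n; i++) y[i] = ycpu[i] + ygpu[i];
}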
After these steps are repeated L times, both the CPU and the GPU execute steps (b1) and (b2) to update the part of $A_{*,\,L+1:n}$ they own. From this point on, the 1st through the Lth columns of A are no longer involved in the computation, so we redistribute the columns of A so that the ratio of active columns allocated to the CPU and the GPU remains $r : 1-r$.
In the next section, we compare the performance of these three implementations experimentally.
5 Performance evaluation
Computational environments
Based on the approaches given in the previous section, we wrote programs that perform Hessenberg reduction using the CPU and the GPU and evaluated their performance. We used the LAPACK routine DGEHRD (a subroutine for Hessenberg reduction) as a basis, translated it into the C language, and rewrote it with the CUBLAS so that it uses the GPU. For comparison, we also show the performance of the original routine DGEHRD using the CPU only.
As test matrices, we used real random matrices whose entries follow the uniform distribution on [0, 1]. The matrix size n was varied from 1,024 to 8,192, and the block size L was fixed at 32, which preliminary experiments showed to be optimal.
The computational environments are as shown in Table 1.
Performance of Hessenberg reduction
We show the execution times of Hessenberg reduction in Table 2 and Fig. 4. In the table, "CPU" means the execution time of the LAPACK routine DGEHRD using four cores of the CPU. "GPU1", "GPU2" and "GPU3" mean the execution times of implementations (I), (II) and (III) described in subsections 4.2, 4.3 and 4.4, respectively. "Speedup" means the speed relative to the CPU-only case. For implementation (III), we varied the division parameter r from 1/32 to 32/32 and chose the best value for each n; these values are denoted by $r_{\mathrm{opt}}$.

It is clear from the table that the use of the GPU cannot accelerate the computation when the matrix size is one or two thousand. One possible reason is that the data transfer between CPU and GPU, which requires $O(n^2)$ time, is not negligible compared with the $O(n^3)$ time for arithmetic operations. Another reason is that the CUBLAS is not efficient on small matrices. As the matrix size becomes larger, the speedup increases, and when n = 8,192, implementation (III) attains a 3.2 times speedup over the CPU-only case. It is also apparent that implementation (III) consistently outperforms implementations (I) and (II). This shows that it is more efficient to assign operations on small matrices to the CPU than to do as much work as possible on the GPU, and that the work distribution between CPU and GPU works effectively. Note that the optimal division parameter $r_{\mathrm{opt}}$ is larger than 1/2 for small n and decreases monotonically as n increases. This indicates that small-size BLAS operations run more efficiently on the CPU, while large-size BLAS operations run several times faster on the GPU than on the CPU. Given such performance characteristics, implementation (III) can achieve an optimal work distribution between CPU and GPU thanks to the division parameter.
Discussion
How to determine the optimal value of r In the above experiments, we determined the optimal value of the division parameter r by trying several values of r for each n and choosing the best one empirically. We believe this is acceptable because the procedure needs to be done only once, at installation time of the Hessenberg reduction program. An alternative approach would be to predict the execution time of the Hessenberg reduction program with some performance model and estimate an appropriate value of r in advance. In [16], a methodology is proposed for predicting the execution time of linear algebra programs based on performance models of the BLAS routines. With this approach, the time needed to tune the division parameter r could be greatly reduced. We are currently pursuing this approach.
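A sketch of this one-time empirical tuning loop is given below; time_hessenberg() is a hypothetical harness that runs implementation (III) with the given division parameter and returns the elapsed time in seconds.

/* One-time tuning at installation: measure the reduction for each
   candidate r = 1/32, ..., 32/32 and keep the fastest. */
double best_time = 1.0e30;
int best_num = 1;
for (int num = 1; num <= 32; num++) {
    double t = time_hessenberg(n, num / 32.0);  /* hypothetical timing harness */
    if (t < best_time) { best_time = t; best_num = num; }
}
/* best_num / 32.0 is the empirically optimal division parameter r_opt. */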
Effect of suboptimal choice of r To see how a suboptimal choice of r affects the performance, we plotted the execution time of implementation (III) as a function of r. Here, we fixed the matrix size n at 4,096 and varied r from 2/32 to 16/32. The result is shown in Fig. 5. The optimal value is r = 6/32 in this case; changing r to 2/32 or 10/32 increases the execution time by more than 10%. Thus it is important to optimize r to achieve the best performance.
Comparison with the CULA library Our implementations of the Hessenberg reduction are based on the CUBLAS. This approach has two advantages. First, we can exploit the performance of the optimized CUBLAS library. Second, it is easy to port a program developed in this way to another GPU or accelerator. However, in this approach, operations not supported by BLAS must be done by the CPU, so data transfer between CPU and GPU is necessary at each step of the algorithm. An alternative approach would be to perform all the operations on the GPU. To study how this approach compares with ours, we evaluated the performance of the CULA library, which adopts it. The execution times of CULA's Hessenberg reduction routine are shown in Table 3. Comparing Table 3 with Table 2, we see that our implementation (III) with the best division parameter $r_{\mathrm{opt}}$ is faster than the CULA routine when n ≥ 3,072. This is because our implementation uses both the CPU and the GPU efficiently. Note, however, that the difference is not large. This seems to be because our implementation requires data transfer between CPU and GPU more frequently than the CULA implementation does.

Table 3: Execution time (in seconds) of CULA's Hessenberg reduction routine.
Effect of speeding up step 1 In this paper, we singled out step 1 of the nonsymmetric eigenvalue solver, namely Hessenberg reduction, and showed how to speed up this step using the GPU. Because step 4 requires a longer time than step 1, it may appear that this speedup has a limited impact on the overall performance of the nonsymmetric eigensolver. However, there are cases where only the eigenvalues are needed [5]. In such cases, steps 2 and 4 are unnecessary, step 1 occupies most of the execution time, and the speedup achieved in this paper becomes more significant. Also, it should be possible to speed up step 4 on the GPU by modifying the current algorithm; once that is achieved, the speedup of step 1 will become more important.
Conclusion
In this paper, we proposed three approaches for accelerating Hessenberg reduction in a hybrid computing environment with a CPU and a GPU. Among them, the third approach, which assigns small-size BLAS operations to the CPU and distributes large-size BLAS operations between the CPU and the GPU in an optimal manner, was consistently faster than the other two approaches. On a platform with an Intel Core i7 processor and an NVIDIA Tesla C1060 GPU, the third approach achieved up to a 3.2 times speedup over the CPU-only case.
Our future work includes a detailed analysis of the performance results and the development of more efficient strategies for distributing the work between CPU and GPU. It also remains to speed up steps 3 and 4 of the nonsymmetric eigenvalue algorithm using the computing power of the GPU.
