ABSTRACT. In this paper, we introduce our work on accelerating a black oil simulator using GPU-based parallel iterative linear solvers. We develop iterative linear solvers and several commonly used preconditioners for NVIDIA Tesla GPUs and couple them with our in-house reservoir simulator. Numerical experiments show that our GPU-based black oil simulator runs around six times faster than a pure CPU-based simulator.
1. Introduction
For large scale reservoir simulation, especially when the number of grid blocks is in the millions, the running time of reservoir simulators can be very long. In our experience, the solution of the linear systems arising from reservoir simulation is the most time-consuming part. For the SPE 10 benchmark [CMFPM], for example, over 95% of the running time is spent on the solution of linear systems. It is clear that if the linear systems are solved efficiently, the whole simulation can be sped up.
GPUs are now much more powerful in floating-point computation than conventional CPUs [NVCUDAPG, CUDABPG]. They have become very popular and have been used in many scientific applications [ESMV, ISMV, ECMCP, GPILS]. In this paper, we introduce our work on accelerating a black oil simulator on GPUs. We develop a new sparse matrix-vector multiplication kernel [ESMV, ISMV] for NVIDIA GPUs, together with the related BLAS 1/2 subroutines. Based on these subroutines, seven Krylov subspace solvers [TLS, IMSLS, GPILS, ECMCP] are developed. Several commonly used preconditioners, such as the polynomial, block ILU(k), ILU(k), block ILUT and ILUT preconditioners [TLS, IMSLS] and domain decomposition preconditioners [RAS], are also developed. Our solvers and preconditioners are applied to our in-house black oil simulator, and the SPE 10 problem is chosen as a benchmark. The number of grid blocks in SPE 10 is over 1.1 million and the total number of unknowns is over 2.2 million. Numerical experiments show that the whole simulation runs around six times faster than our pure CPU-based simulator.
The layout of this paper is as follows. In §2, our new matrix format and sparse matrix-vector multiplication kernel are presented first, and then the linear solvers and preconditioners are introduced. In §3, numerical experiments are employed to test the efficiency of the GPU-based linear solvers and preconditioners.
2. Iterative Linear Solvers and Preconditioners
In this section, we introduce our sparse matrix-vector multiplication kernel first, and then the GPU-based linear solvers and preconditioners.
2.1. Sparse Matrix-Vector Multiplication Kernel. The matrix format used on GPUs is HEC (hybrid ELL and CSR), which was developed in [PPRS] and is illustrated in Figure 1. From this figure, we can see that an HEC matrix contains two submatrices, an ELL matrix and a CSR matrix. The ELL part consists of two arrays, one for the column indices and the other for the non-zeros; each row in these two arrays has the same length. The CSR part contains three arrays: the first stores the offset of each row, the second the column indices and the last the non-zeros. The ELL part is stored in column-major order and is aligned in GPU memory, which ensures that global memory accesses on NVIDIA Tesla GPUs are well coalesced [ESMV, ISMV, NVCUDAPG, CUDABPG]; in this case the data access speed for the ELL part is high. A disadvantage of the ELL format is that all rows must be padded to the length of the longest row, so a single row with many entries wastes memory. The CSR part is used to overcome this problem: the extra entries of such long rows are stored there.
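The figure can be summarized by the following data layout; this is only an illustrative sketch, and the field names below are ours, not those of [PPRS].

    struct HECMatrix {
        int     n;           // number of rows
        // ELL part: padded to ell_width entries per row, stored column by column,
        // so entry k of row i sits at index k * n + i (coalesced access on the GPU).
        int     ell_width;   // fixed number of stored entries per row
        int    *ell_cols;    // [ell_width * n] column indices (padding marked by -1)
        double *ell_vals;    // [ell_width * n] non-zero values (padding stored as 0.0)
        // CSR part: leftover entries of rows that are longer than ell_width.
        int    *csr_ptr;     // [n + 1] row offsets
        int    *csr_cols;    // column indices of the spilled entries
        double *csr_vals;    // values of the spilled entries
    };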
For an HEC-format matrix, the corresponding sparse matrix-vector multiplication kernel is described in Algorithm 1. This is a two-step algorithm. In the first step, the ELL part is calculated, where each CUDA thread [NVCUDAPG, CUDABPG, ESMV, ISMV] is responsible for one row. Then the CSR part is calculated. Other BLAS 2 subroutines are developed similarly. One BLAS 1 subroutine is described in Algorithm 2. For this algorithm, each CUDA thread calculates only one element. Other BLAS 1 subroutines are similar.

Algorithm 1 Sparse matrix-vector multiplication, y = Ax (HEC format)
1: for i = 1 : n do                                              ◃ Use one GPU kernel to deal with this loop
2:     the i-th thread calculates the i-th row of the ELL matrix;    ◃ Use one thread
3: end for
4: for i = 1 : n do                                              ◃ Use one GPU kernel to deal with this loop
5:     the i-th thread calculates the i-th row of the CSR matrix;    ◃ Use one thread
6: end for

Algorithm 2 BLAS 1 subroutine, y = αx + βy
1: for i = 1 : n do                                              ◃ Use one GPU kernel to deal with this loop
2:     the i-th thread calculates y(i) = αx(i) + βy(i);              ◃ Use one thread
3: end for
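A CUDA realization of Algorithms 1 and 2 could look roughly as follows, assuming the HEC layout sketched above (padding marked by a column index of -1); this is a simplified illustration, not the exact kernels of the package.

    // y = A*x for an HEC matrix: ELL part first, then the CSR remainder.
    __global__ void hec_spmv(int n, int ell_width,
                             const int *ell_cols, const double *ell_vals,
                             const int *csr_ptr, const int *csr_cols,
                             const double *csr_vals,
                             const double *x, double *y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;     // one thread per row
        if (i >= n) return;
        double sum = 0.0;
        // Step 1: ELL part; column-major storage gives coalesced loads across threads.
        for (int k = 0; k < ell_width; ++k) {
            int col = ell_cols[k * n + i];
            if (col >= 0)                                  // skip padding
                sum += ell_vals[k * n + i] * x[col];
        }
        // Step 2: CSR part; leftover entries of the long rows.
        for (int p = csr_ptr[i]; p < csr_ptr[i + 1]; ++p)
            sum += csr_vals[p] * x[csr_cols[p]];
        y[i] = sum;
    }

    // BLAS 1 kernel of Algorithm 2: y = alpha * x + beta * y, one thread per element.
    __global__ void axpby_kernel(int n, double alpha, const double *x,
                                 double beta, double *y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = alpha * x[i] + beta * y[i];
    }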
2.2. Iterative Linear Solvers. We consider the following linear system:
(2.1)    Ax = b,
where A is a nonsingular n × n matrix, b is the right-hand side and x is the unknown to be solved for. Several Krylov subspace linear solvers are described in [TLS, IMSLS]. From the descriptions of these solvers, we can see that they share the following common operations:
(2.2)    y = Ax,
(2.3)    y = αx + βy,
(2.4)    z = αx + βy,
(2.5)    α = ⟨x, y⟩,
(2.6)    ∥x∥ = ⟨x, x⟩^{1/2},
where A is a matrix, x, y and z are vectors, α and β are real numbers, and ⟨·, ·⟩ is the scalar product. These subroutines are simple variants of Algorithm 1 and Algorithm 2. With these BLAS 1/2 operations, the linear solvers can be developed in a straightforward way. Seven GPU-based Krylov subspace solvers are developed: GMRES, CG, BICGSTAB, GCR, CGS, ORTHOMIN and ORTHODIR [IMSLS, TLS]. The corresponding CPU-based versions are also developed.
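For illustration, here is a minimal host-side sketch of the (unpreconditioned) CG solver written purely in terms of these operations; spmv, axpby, dot and copy are our hypothetical wrappers around the GPU kernels above, not the actual interface of the package.

    #include <cmath>   // std::sqrt

    // CG assembled from SpMV, AXPY-type updates and dot products (sketch only).
    // r, p and q are preallocated device work vectors of length A.n.
    void cg(const HECMatrix &A, const double *b, double *x,
            double *r, double *p, double *q, int max_it, double tol)
    {
        spmv(A, x, r);                              // r = A x
        axpby(A.n, 1.0, b, -1.0, r);                // r = b - A x
        copy(A.n, r, p);                            // p = r
        double rho = dot(A.n, r, r);
        for (int it = 0; it < max_it && std::sqrt(rho) > tol; ++it) {
            spmv(A, p, q);                          // q = A p
            double alpha = rho / dot(A.n, p, q);
            axpby(A.n,  alpha, p, 1.0, x);          // x = x + alpha p
            axpby(A.n, -alpha, q, 1.0, r);          // r = r - alpha q
            double rho_new = dot(A.n, r, r);
            axpby(A.n, 1.0, r, rho_new / rho, p);   // p = r + beta p
            rho = rho_new;
        }
    }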
2.3. Preconditioners.
In practice, an equivalent preconditioned form of the linear system (2.1) is solved:
(2.7)    M^{-1}Ax = M^{-1}b,
where M is called a preconditioner (here, a left preconditioner). When choosing the preconditioner M, a general principle is that M should approximate A, in the sense that the product M^{-1}A approximates the identity matrix I. The condition number of M^{-1}A is then smaller than that of A, and the preconditioned system (2.7) is much easier to solve than the original system (2.1). Meanwhile, M should be easy to construct and the system with M should be easy to solve. When the spectral radius of N = I − A is less than 1, we have the Neumann expansion [TLS]
(2.8)    A^{-1} = (I − N)^{-1} = I + N + N^2 + N^3 + ⋯.
For any positive integer s, a Neumann polynomial preconditioner is defined by truncating this expansion:
(2.9)    M^{-1} = I + N + N^2 + ⋯ + N^s.
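Applying this preconditioner requires only s sparse matrix-vector products, since Nt = t − At. A minimal sketch in the spirit of the GPU subroutines above (the helper names spmv, axpby and copy are illustrative stand-ins, not the package's actual interface) is:

    // z = M^{-1} r = (I + N + N^2 + ... + N^s) r, with N = I - A.
    // t and w are preallocated work vectors of length A.n.
    void neumann_apply(const HECMatrix &A, const double *r, double *z,
                       int s, double *t, double *w)
    {
        copy(A.n, r, z);                     // z = r            (the N^0 term)
        copy(A.n, r, t);                     // t = N^0 r
        for (int k = 1; k <= s; ++k) {
            spmv(A, t, w);                   // w = A t
            axpby(A.n, -1.0, w, 1.0, t);     // t = t - A t = N t   (now t = N^k r)
            axpby(A.n,  1.0, t, 1.0, z);     // z = z + N^k r
        }
    }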
When applying this preconditioner, only matrix-vector multiplications are involved. A simple idea for constructing a preconditioner is to apply an LU factorization. However, for a given sparse matrix A, the exact factors L and U are usually much denser than the lower and upper triangular parts of A, respectively. Incomplete LU factorization is therefore applied instead. An ILU factorization computes a sparse lower triangular matrix L and a sparse upper triangular matrix U for a given matrix A. If the non-zero patterns of L and U are the same as those of the lower and upper triangular parts of A, respectively, we obtain the so-called ILU(0) preconditioner; higher order ILU(k) preconditioners are obtained similarly [IMSLS]. Another method is ILUT, which drops entries based on the numerical values of the fill-in elements [IMSLS, GPILS]; here L and U are controlled by a drop tolerance and the maximal number of fill-ins allowed in each row.
The triangular solves with L and U are sequential in nature. In this paper, block ILU(0) and block ILUT preconditioners are implemented; as the number of blocks increases, both preconditioners show better parallel performance. The matrix A is partitioned by METIS [METIS]. The lower and upper triangular systems are solved by a modified level schedule method [PPRS], which groups rows into levels that can be processed in parallel; a sketch of the level construction is given below.
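The following simplified host-side sketch illustrates the general level-scheduling idea, assuming the lower triangular factor L is stored in CSR format with its unit diagonal not stored; it shows the basic principle only, not the modified method of [PPRS].

    #include <algorithm>
    #include <vector>

    // Compute a level number for every row of a sparse lower triangular factor L.
    // Rows in the same level are mutually independent in the forward solve L*y = b,
    // so each level can be handled by one parallel GPU kernel launch.
    std::vector<int> build_levels(int n, const int *Lptr, const int *Lcols)
    {
        std::vector<int> level(n, 0);
        for (int i = 0; i < n; ++i) {
            int lev = 0;
            for (int p = Lptr[i]; p < Lptr[i + 1]; ++p)       // strictly lower entries
                lev = std::max(lev, level[Lcols[p]] + 1);     // row i waits for row Lcols[p]
            level[i] = lev;
        }
        return level;   // rows are then grouped by level before the GPU triangular solve
    }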
Cai et al. developed the restricted additive Schwarz (RAS) preconditioner for general sparse matrices [RAS]. The basic idea is to partition the original problem into several smaller problems and then to solve these smaller problems simultaneously. In this paper, the matrix is again partitioned by METIS [METIS]. The submatrices are extended (overlapped) according to the topology of the original matrix, and each smaller problem is solved by ILU(0) or ILUT; a sketch of one application of the preconditioner is given below.
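One application z = M^{-1} r of the RAS preconditioner can be sketched as follows; the Subdomain container and its local_solve member are our illustrative assumptions, standing in for the overlapped subdomain data and the local ILU(0)/ILUT solve.

    #include <algorithm>
    #include <cstddef>
    #include <functional>
    #include <vector>

    // Description of one overlapped (extended) subdomain.
    struct Subdomain {
        std::vector<int> rows;    // global indices of the rows in the extended subdomain
        std::vector<int> owned;   // owned[k] = 1 if rows[k] belongs to this subdomain, 0 if overlap
        std::function<std::vector<double>(const std::vector<double>&)> local_solve;  // local ILU solve
    };

    // z = M^{-1} r for the restricted additive Schwarz preconditioner.
    void ras_apply(const std::vector<Subdomain> &subs,
                   const std::vector<double> &r, std::vector<double> &z)
    {
        std::fill(z.begin(), z.end(), 0.0);
        for (const Subdomain &s : subs) {                    // subdomains are independent (parallel)
            std::vector<double> rs(s.rows.size());
            for (std::size_t k = 0; k < s.rows.size(); ++k)
                rs[k] = r[s.rows[k]];                        // restrict r to the extended subdomain
            std::vector<double> ys = s.local_solve(rs);      // local ILU(0)/ILUT forward/backward solve
            for (std::size_t k = 0; k < s.rows.size(); ++k)
                if (s.owned[k])                              // restricted: write only the owned rows
                    z[s.rows[k]] = ys[k];
        }
    }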
2.4. Package Structure. Figure 2 shows the basic structure of our linear solver package. The package has a multi-level structure. The bottom level is the infrastructure, where the memory management, communication, input, output and preprocessing modules are developed; these modules serve the whole package. The middle level contains the matrix and vector operations. The top level contains the solvers and preconditioners. The solvers and preconditioners are designed so that each of them is independent of the others. In this way the package is user-friendly: a user can choose the proper solver and preconditioner for an individual application, and a bug in one solver or preconditioner does not affect the others.

3. Numerical Results
In this section, numerical experiments are performed on our workstation with Intel Xeon X5570 CPUs and NVIDIA Tesla C2050/C2070 GPUs. The operating system is Fedora 13 x86_64 with CUDA Toolkit 4.0 and GCC 4.4. All CPU codes are compiled with the -O3 option. Double precision floating-point numbers are used throughout.

EXAMPLE 3.1. In this example, the matrix is taken from SPE 10 [CMFPM]. The dimension of the matrix is 2,188,851 and the number of non-zeros is 29,915,573. Three solvers are tested without any preconditioner, and the number of iterations is fixed at 20. Performance data is collected in Table 1. This example is designed to test the framework of our package. From Table 1, we can see that when no preconditioner is applied, the average speedup over the solvers is around 10.4, with a maximal speedup of 10.72 when the GMRES(40) solver is employed. The table also indicates that the BLAS 1/2 subroutines are efficient and that the whole framework of our package works well.

EXAMPLE 3.2. The matrix used in this example is the same as that in Example 3.1. Here the Neumann polynomial preconditioner is applied, and the order s of the polynomial is 8. The number of iterations is again 20. Performance data is collected in Table 2. This example is employed to test the performance of the developed sparse matrix-vector multiplication kernel, which is fundamental to a linear solver package. From Table 2, we can conclude that the performance of our sparse matrix-vector multiplication kernel is high; for this example, a maximal speedup of 11.67 is achieved and the average speedup is around 10.5.

EXAMPLE 3.3. Here only the GMRES(20) solver is employed. The preconditioner is block ILU(0) with different numbers of blocks. The matrix is the same as that in Example 3.1, and the termination criterion is 2e-2. Performance data is collected in Table 3. The combination of GMRES and ILU(0) is the most commonly used method in sequential reservoir simulation. Since the triangular solves of ILU(0) are sequential in nature, the preconditioner is hard to parallelize, and this example tests the parallel performance of our GPU-based block ILU(0) preconditioner. When the number of blocks is one, block ILU(0) reduces to the standard ILU(0). Though ILU(0) is sequential, the GPU version is still about 8.14 times faster than the CPU-based ILU(0). When we increase the number of blocks, the speedup increases, which means that block ILU(0) has better parallel performance; however, the number of iterations increases, too. For this matrix, we obtain an average speedup of 8.3 and, with 16 blocks, a maximal speedup of 9.44.
EXAMPLE 3.4. Here the block ILUT preconditioner is applied. All other settings are the same as those in Example 3.3. Performance data is collected in Table 4. The ILUT preconditioner is computed by dropping small elements of the lower and upper triangular factors. The non-zero patterns of L and U are less regular than those of ILU(0), which means that the data dependency is more complicated than in ILU(0). This is also reflected in Table 4: the speedup of block ILUT is lower than that of block ILU(0), with an average speedup of 4.2. However, comparing the data in Table 3 and Table 4, we find that block ILUT is better than block ILU(0) in terms of total running time and number of iterations. Block ILUT is also sensitive to the number of blocks: the number of iterations increases when the number of blocks increases.

EXAMPLE 3.5. The RAS preconditioner is tested. The solver is GMRES(20) and the matrix is the same as above. For the RAS preconditioner, the smaller problems are solved by ILU(0) and by ILUT. Data is collected in Tables 5 and 6.
From Tables 5 and 6, we find that RAS with ILU(0) has a better speedup than RAS with ILUT, but in terms of solution time and number of iterations, ILUT is still better. Since the subdomains are enlarged by the overlap, the data in Tables 5 and 6 shows that the number of iterations does not change much when we increase the number of blocks. This means that the RAS preconditioner is not as sensitive to the number of blocks as block ILU(0) and block ILUT, so we can increase the number of blocks to obtain better performance. With ILU(0) as the subdomain solver we obtain an average speedup of 8, and with ILUT an average speedup of 4.5.

Finally, the full SPE 10 problem is tested. SPE 10 is a standard benchmark for black oil simulators [CMFPM]. The problem is highly heterogeneous and was designed to be difficult to solve. The grid size for SPE 10 is 60 × 220 × 85, the number of unknowns is 2,188,851 and the number of non-zeros is 29,915,573. The simulated time period is 100 days and the solver is GMRES(20). Performance data is collected in Table 7. From Table 7, we can see that when block ILU(0) and RAS with ILU(0) are applied, the average speedup is around 6; that is, the black oil simulator runs about 6 times faster than the pure CPU simulator. When block ILUT and RAS with ILUT are applied, the average speedup is lower, only about 2.5, but the simulator is still 2.5 times faster than the pure CPU simulator. The parallel performance of these preconditioners is similar, but in terms of sequential performance, the ILUT-related preconditioners are much better than the ILU(0)-related preconditioners.
4. Conclusion
We have presented our work on accelerating a black oil simulator using GPU-based linear solvers and preconditioners. The numerical experiments show that these solvers and preconditioners are efficient: with them, the simulator runs around six times faster than the pure CPU-based version.
