A fine-grained block incomplete LU (FGBILU) factorization for solving large-scale block-sparse linear systems resulting from coupled PDE systems with n equations has been recently developed for massively parallel heterogeneous architectures, such as generalpurpose graphics processing units (GPGPUs). A straightforward one-sweep wavefront ordering is combined with element-wise block submatrix operations, allowing FGBILU to achieve low-overhead concurrent computation at O n 2 N 2 scale on a 3D PDE domain with a linear scale of N . Numerical experiments show that FGBILU is less efficient on smaller domains. Besides the inevitable performance penalty of a wavefront ordering, the index reconstruction by each concurrent computation thread causes considerable parallelism overhead. One way to reduce the overhead is to employ thread recycling along with CUDA inter-block synchronization. Dynamic parallelism is also attempted, although with no significant perforamnce benefit. The improved FGBILU is tested for a series of 3D PDE domains extracted from an incompressible Navier-Stokes solver called INCOMP3D. Results show that thread recycling can significantly reduce parallelism overhead and improve the performance of FGBILU on smaller domains.
Introduction
It is widely known that implicit time marching can significantly improve CFD simulation speed when the physical time scale is relatively large, as is often the case in incompressible flows. Also, in many complex physics processes governing equations are so stiff that an explicit method may never converge under any reasonable CFL number. One prime example is multiphase reactive flows, which, due to their incorporation of a large number of variables and equations for describing the complex thermodynamical and chemical processes, often result in highly stiff linear systems. Another example is highly stretched computational grids due to unusual boundary geometry, which tend to cause local stiffness. Implicit methods for a complex physical process usually result in a large block sparse linear system. Because direct solution of a large sparse matrix is prohibitively costly, the best option is usually an iterative method, whose efficiency is primarily determined by the quality of the preconditioner.
Recent development on hardware and software technologies of General Purpose Graphics Processing Unit (GPGPU) has greatly improved its potential for large-scale high performance computation (see, for example [1, 2, 3, 4, 5] ). Maturing programming frameworks have allowed GPGPU algorithm designs to gain more popularity. Particularly, the emergence of OpenACC [6] , a directive-based programming model closely resembling OpenMP, significantly reduces the obstacles of efficient GPGPU programming. With the assistance of convenient development tools many software engineering issues of porting CFD codes can be trivially resolved, including data management and efficient data exchange between multiple GPGPUs. Our previous attempts of porting CFD codes to GPGPU using OpenACC and GPGPU-aware MPI implementations have proven the advantages on portability and maintainability of this approach [2, 7] . More fundamental performance restrictions are encountered when porting implicit CFD solvers. Much of the difficulty is contributed by poor performance of the iterative linear solver, a core component of most implicit CFD solvers. Mathematically efficient solvers often involve highly sequential algorithms for preconditioning, which can be very difficulty to implement on fine-grained massively parallel architectures such as GPGPU.
Our study on implicit methods focus on an implicit 3D LES incompressible Navier-Stokes solver called IN3D [7] , which has been ported onto the GPU architecture using OpenACC [6] and GPGPU-aware MPI [8, 9] . The solver was based on an CPU-based version validated for a range of model problems [10, 11] . The 3D incompressible N-S equations are solved in a structured grid using the finite volume method (FVM). Time integration of the discrete equations is carried out by the artificial compressibility scheme [12] . Time-accurate simulation is achieved by employing a dual time stepping procedure (sub-iteration) at each physical time step. A general multi-block grid can be partitioned over a number of allowable processors. MPI is used to achieve parallel computation on a cluster. The original version of the solver has been used in the study of a wide variety of CFD problems, including unsteady aerodynamics [13, 14] , two-phase flows [15] , and human-induced contaminant transport [16, 17] . An immersed-boundary method [11] is incorporated to enable computations of flow about moving objects. A full GPU port has been implemented [7] . However, the performance of the linear solver is rather poor, as it is found to be extremely memory bound.
The solution is to refine granularity for the linear solver, whose benefit is two-fold. First, to obtain optimal speed of a GPGPU, its cores should be kept as busy as possible. Finer granularity allows the algorithm to be divided into smaller units, thus occupying more cores with the same total amount of work. Meanwhile, the amount of information each core processes is kept as little as possible. Unfortunately, fine granularity also introduces significant parallelism overhead. Particularly, the overhead associated with launching separate kernels for each wavefront and the overhead associated with reconstructing array indices for each concurrent thread, which scales with a rate of n 2 N 2 . Although index reconstructions are fast integer operations, such scaling would nevertheless cause considerable performance penalty.
It is possible to reduce this overhead by employing thread recycling. Instead of spawning new threads for different grid locations, each thread runs a loop to fetch new grid points to process, avoiding a large portion of the index construction that is constant. Because the size of a wavefront is usually larger than a CUDA block, inter-block synchronization must be manually managed to resolve data dependency. The most common method to achieve this is to use a global variable and atomic operation. Such approach is commonly referred to as "task-based parallelism" by Nvidia. In first-generation Nvidia GPGPUs, this is not recommended as atomic operations are rather inefficiently implemented, resulting in very long synchronization time if the number of streamline multiprocessors is large. However, recently
Figure 1: 7-point stencil on a regular 3D structure Nvidia GPGPUs have greatly improved the performance of atomic operations, making this approach much more desirable.
This paper is organized as follows. First, a brief introduction of BILU is given in Section 2, along with an enhanced FGBILU scheme. The scheme includes algorithms for the factorization, triangular solve and iterative correction processes. In Section 3, FGBILU algorithms with thread recycling are discussed in detail, including two methods of achieving CPU-free inter-block synchronization. The performance of the new scheme is studied in numerical experiments presented in Section 4, where FGBILU is employed as a block sparse linear system solver for an incompressible Navier-Stokes solver called INCOMP3D. The paper concludes with discussions on further improvements and potential applications of FGBILU scheme.
BILU on a regular 3D structured domain
The linear solver of IN3D is based on BILU(0), which stands for incomplete LU factorization with zero fill-ins for block-sparse linear systems. The "zero fill-ins" phrase is omitted from now on for clarity. Incomplete LU factorization was originally developed as approximate inversion schemes by matrix splitting for sparse matrices [18] . The block-wise extension arises naturally from coupled multi-component PDE systems and found to be a better choice for this kind of applications in general [19, 20] .
Consider discretization of a coupled PDE system with n unknowns on a structured 3D domain with a straightforward 7-point stencil, as shown in Figure 1 . 0 ≤ i < I, 0 ≤ j < J and 0 ≤ k < K are grid cell indexes and the corresponding dimensions of the domain. Implicit time integration on such a stencil usually results in a block-sparse matrix linear system Ax = b, where the left-hand-side coefficient matrix takes a form as shown in Figure 2 . Each element therein is a n × n submatrix. Each row corresponds to the discretization centered on a certain grid point. A typical non-boundary grid point (i, j, k) produces totally 7 block submatrices with the same subscript i, j, k, which is reflected in Figure 1 .
BILU with zero fill-in attempts to approximate A with LU , such that LU has the exact same zero pattern of A. BILU with zero fill-ins ignores all off-stencil multiples of the submatrices. As a result, the submatrices of the factorization can be found by a comparison of submatrices for a general grid point (i, j, k) on its 7-point stencil. Simple analysis [21] reveals that the factorization only need to be carried out for the diagonal blocks δ i,j,k , using 
Note that all d i,j,k can be safely assumed non-singular for most physical systems. Out-ofbounds array references are treated as zero. The BILU factorization can then be expressed as:
where D is the block-diagonal matrix of the factorized diagonals d i,j,k , calculated using Eq. 1. E is the strict lower block-diagonal of A and F is the strict upper block-diagonal of A.
The factorization leads to an approximation of the original linear system, which can be now be solved by two triangle sweeps:
which, if written in point-wise form, gives the following recursive formulation:
Finally, the approximated solutionx can be further improved by recursively solving a similar linear system for the residual:
until the residual R l = b − Ax l diminishes to a certain level. The subscript l is the iterative index. In a multi-block PDE solver, block coupling can be achieved by exchanging R l using message passing before each iteration.
Algorithm 1
Generation of wavefront ordering mapping 1:
Parallel BILU by wavefront ordering
Although the BILU factorization (Eq. 1) and the triangular solves (Eq. 2 and 3) are both in recursive forms, parallelism can be extracted for all grid points with a constant sum of subscripts p = i+j +k, as they form a data-independent subset. Such scheme of identifying a sequence of data-independent subsets ("levels") is called level scheduling. For a 3D structured domain, a level defined in this way forms a diagonal cross-section, called a hyperplane or a wavefront, which can be uniquely identified by p. The number of data-independent points on hyperplane p is written as N p . In a parallel algorithm, the grid points are processed in a wavefront ordering. Each grid point is assigned an index q, unique within its hyperplane, so that every grid point in the 3D domain can be uniquely identified by an index pair (p, q). A simple 3D traverse, as given in Algorithm 1, can be used to generate a list of hyperplane sizes N p , and a mapping W : (p, q) → (i, j, k), which is stored as a three-dimensional array.
Parallel factorization and triangular solve are given in Algorithm 2 and 3. In a wavefront scheme, the sequential outer loop resolves data dependency, while the parallel inner loop processes all the data point on a hyperplane concurrently. Such a parallel loop is designated as a "kernel", with index variables and their ranges listed inside parenthesis. One permutation of indices is the smallest unit for concurrent execution in a parallel implementation. For clarity, an implied mapping W is always carried out inside a wavefront scheme kernel.
Algorithm 2
Coarse-grained parallel BILU factorization 1:
Coarse-grained BILU factorization and triangular solve algorithms on CPU have been 
Fine-grained algorithms on GPGPU
Although GPGPU programming generally follows OpenMP-like shared-memory parallelism, some distinct challenges must be addressed. A GPGPU has hundreds of computation cores, each capable of executing one instance of the kernel code, which is called a "thread". GPGPU threads are logically organized as "workgroups". The size of a workgroup and the total number of workgroups can both be programmed. Within each workgroup, local shared memory can be used for data sharing. A certain number (fixed by hardware design) of threads, called a "warp", are executed in synchronized lock-step, much like a SIMD operation in CPU. The actual execution of warps is out-of-order, but explicit synchronization points can be imposed within a workgroup.
When implementing the BILU algorithms, the primary performance constraint is found to be memory boundedness [22] . If one grid point is mapped to one GPU thread, the coarse-grain factorization would require exceedingly large cache to achieve reasonable locality, which is not practical with current GPGPU hardware. Furthermore, the coarse-grained algorithm quickly becomes intractable as the number of PDE equations increases, because the amount of data per thread scales with n 2 . The solution to this problem is fine-grained algorithms that can map the calculation on the element (scalar) level. This cancels the per-thread scaling factor of n 2 , greatly easing the demand on cache memory. However, matrix operations have internal data dependencies, which must be addressed correctly and efficiently. For parallel paradigms with a moderate overhead, such as OpenMP, the benefit of extracting parallelism from matrix operation is often negated by heavy synchronization cost. Low-overhead solutions, such as SIMD operations, are often restricted by poor support of branching instructions and lack of local storage [23] . A GPGPU, on the other hand, combines the advantage of very low synchronization overhead (within a workgroup) and the capability of branching operations, which makes the element-level parallelism a much more practical approach.
The basic idea of FGBILU can be considered as a two-tier parallelism. BILU has two inherent granularity levels. As discussed in Section 2.1, concurrent processing of all grid point in a hyperplane allows coarse-grained parallelism. On the next tier, fine-grained parallelism can be achieved by vectorizing submatrix operations in a element-wise fashion. This twotier structure maps very well to GPGPU, which also has a well-defined two-tier structure. Concurrent processing of grid points can be mapped to multiple workgroups, while vectorized submatrix operations can be mapped to threads. Data dependency across hyperplanes can be implemented as sequential kernel launches, in the same way as OpenMP. It is possible to achieve synchronization across workgroup using other methods, which will be discussed in following sections.
In the descriptions that follow, grid location triplets (i, j, k) in subscripts of domain submatrix and vector variables, namely, α, β, γ, δ, ε, ζ, η, d, b,x and R, are simplified to reduce clustering in pseudocodes. Only the deviation from the center of the 7-point stencil is explicitly indicated. Any submatrix variable without the triplet is taken at the stencil center (i, j, k). 
d r,r ← 1/d r,r ; 6: } have superscripts.
In GPGPU programming, each thread is assigned a workgroup index I G and a thread index I T , both 0-based. All other indices for mathematics can be reconstructed with Algorithm 6, where integer division (\) is used extensively. Once q is calculated, (i, j, k) can be further determined by mapping W . For clarity, index reconstruction appears as a mapping from (I G , I T ) to (i, j, k, u, v, q w ) in the kernel header.
FGBILU factorization is given in Algorithm 4. While the calculation of the off-diagonal multiples and their sum can be embarrassingly parallel, the inversion of diagonals is a wellknown bottleneck of vectorizing BILU algorithms. In FGBILU, an in-place direct inversion algorithm based on the Gauss-Jordan elimination (GJE) is employed, whose general sequential pseudocode is given in Algorithm 5. Shared memory arrays is a temporary storage for resolving data dependency. Its first subscript ranges from 0 to 2, corresponding to the three spatial components. The second subscript q w , which is always implied in pseudocodes, gives the point index within a workgroup. The two superscripts are submatrix indices, each ranging from 0 to n − 1. A dot product is written as ·, · . Fortran-style colon operators are used to indicate the range of a certain dimension. Note that the use of s 0 and s 1 as read buffers for resolving data dependency in GJE. For each n × n submatrix this inversion algorithm generates roughly 2n 3 floating point operations. For comparison, a pure sequential algorithm would generate n 3 floating point operations. Because the parallel GJE uses n 2 threads per matrix, the computation time should be at the scale of 2n 3 /n 2 = 2n, which is much less than a sequential algorithm.
The triangular solves involve mostly matrix-vector multiplications. As shown in Algo- 
; 9: } rithm 7, the fine-grained algorithm further divides the dot product into element-wise operations, which refines the granularity by a factor of n. The thread-level synchronization with the assistance of a local shared memory array allows the merge of data-dependent operations.
During correction iterations the factorization d and the triangular solve procedure remain constant. The residual, R = b − Ax, must be calculated, which is then used as the RHS for the upcoming correction step. The residual takes a simple form (Eq. 4), which can be embarrassingly parallelized without level scheduling, as given in Algorithm 8.
Algorithm 8 FGBILU residual calculation in correction steps
1:
CUDA-specific Optimizations
The major concern of FGBILU is its use of wavefront ordering, which is always less efficient on smaller domains. However, there are methods to mitigate its performance impact. Specifically, we can try to lower the synchronization overhead. In this section, we introduce two attempts to improve the performance of FGBILU. The first the a feature called dynamic parallelism, recently introduced to CUDA devices with compute capability 3.5 or above. It allows kernel to be launched on a GPGPU directly, without any involvement of the CPU. The second is a CUDA programming technique called thread recycling, which allows a workgroup to "fetch" multiple grid points to work on. Combined with a lock-based synchronization scheme, this technique can achieve CPU-free synchronization and overhead reduction by eliminating redundant index reconstruction. To expose the technical details of these optimizations, the pseudocodes in this section will be written in a more detailed manner.
CPU-free inter-workgroup synchronization by dynamic parallelism (DP)
In FGBILU's two-tier parallelism, the data dependency of hyperplanes are resolved by synchronization across workgroups working on the hyperplane, which is traditionally achieved by separate kernel launches for each hyperplane, initiated from host (CPU) side. As arguments must be passed from CPU memory to GPU memory, host-side kernel launching is always associated with an overhead. If the kernel launching can avoid data exchange between CPU and GPU, this overhead can be largely eliminated. CUDA Dynamic Parallelism is introduced as a mechanism to launch new kernels (child kernel) within an active kernel (parent kernel). The launch is fully nested, so that data consistency is preserved for consecutive child kernels. This capability can be used to move the level scheduling from CPU to GPGPU completely. The migration from a traditional level scheduling using CPU kernel launches is trivial. First, the hyperplane size list N p needes to be copied to GPGPU. Then, a master control kernel with only one thread is launch, which carries out the kernel launches, in the exact same way as CPU kernel launches. By removing any intervention of the CPU during level scheduling, it may be able to reduce overheads and improve the overall performance of the wavefront scheme.
Thread recycling (TR)
Although implied in the pseudocodes, the index reconstruction is a significant overhead of the fine-grained parallel algorithms. In a one-to-one mapping, each thread must calculate the same grid location indecies (i, j, k) and the local matrix element index (u, v) for every matrix element. Instead, thread recycling establish a one-to-many mapping from a thread to elements in many grid points. In such approach, each GPGPU workgroup repeatedly "fetches" new grid points to process. One can observes that the local matrix element index (u, v) can be fixed while (i, j, k) is changed for new grid points. In fact, u, v and q w can all be fixed as the mapping to a matrix element is solely determined by the relative position of a thread within a workgroup. This can potentially save a significant amount of overhead on index reconstruction. To differentiate from traditional data parallelism, this programming approach is called task-based parallelism by Nvidia. remains active to process large amount of grid points without exiting. This requirement is in direct conflict with the wavefront scheme. To resolve data dependency of two consecutive hyperplanes, all GPGPU workgroups must be synchronized at the boundary of a hyhperplane. In standard GPGPU programming, synchronization across multiple workgroups is achieved by launching another kernel for the subsequent hyperplane. To resolve this conflict, a method for inter-workgroup synchronization without launching new kernels must be employed. GPGPU is not a flat collection of computation cores. Instead, a certain number of cores are physically organized in a streamline multiprocessor (SM). A GPGPU usually have a number of SMs. During kernel execution, each logical GPGPU workgroup is mapped to one SM and the mapping remains constant until that workgroup finishes. The execution of workgroups for each kernel launch is out-of-order, meaning that SMs work most independently on different workgroups. This hardware arrangement of two-level granularity directly corresponds to the workgroup arrangement in the programming model.
Atomic operations are globally sequential, which can provide a mechanism for creating a global barrier for inter-workgroup synchrnoization [24] . This is achieved by a global mutex variable to count the number of workgroups that reach the synchronization point. A CUDA C template of this approach is given in Algorithm 9, which is used by both the factorization and the forward solve. Backward solve use a slightly different version. Line 13 is replaced by the specific computation task. Assume that there are totally N sm active SMs working on the problem. At the beginning of the loop, the head thread (I T = 0) of each workgroup obtains a new batch, which contains L grid points, by adding L to the global index g_p_elm. Note that function atomicAdd() returns the value of the global mutex before it is added.
The wavefront ordering map is padded, so that N sm workgroups of purely fill-ins are placed at each hyperplane boundary. For example, if SM=3, and each workgroup process 4 data points, then three hyperplanes with 28, 36, 45 points would be padded in the way shown in Figure 3 . Note that each table cell corresponds to one workgroup batch. The mapping can then be flatten to a single long listŴ with only one index, facilitating the synchronization algorithm. N total is the total number of grid points, including both "real" points and fill-ins. If the head thread detects a fill-in, it adds 1 to the global mutex g_mutex. When all SMs finish adding to the global mutex, the global mutex is increased by N sm , so that the modulo operation should result in zero. An inter-workgroup synchronization is therefore achieved.
We also need to ensure that each SM has a one-to-one mapping to an active workgroup, so that, from a programmer's perspective, a workgroup is simply the logical equivalence of an SM. This is achieved by launching the kernel with exactly N sm workgroups, and allocating all available local shared memory on an SM to each workgroup. The latter technique effectively prevents two workgroups from being scheduled to the same SM.
Algorithm 9
Lock-based wavefront kernel with thread recycling 1: __device__ volatile int g_p_elm, g_mutex; // Both initialized to zero 2: __device__ void wavefront_kernel(...) 3: { 4: __shared__ volatile int q 0 ; 5:
for () { 7:
if (I T ==0) q 0 ←atomicAdd((int*)&g_p_elm,L); 8:
__syncthreads(); 9:
if (q 0 > N total ) break ; 10:Ŵ :
if (not a fill-in) { 12:
...do computation...
13:
} else if (I T ==0) { 14:
atomicAdd((int*)&g_mutex,1); 15:
while (g_mutex%N sm !=0) {} 16: } 17: __syncthreads(); 18: }
Numerical experiments
The fine-grained algorithms have been implemented in Fortran and C with OpenACC and CUDA support. All floating-point operations are in double precision. For baseline comparison, a sequential BILU code is run on one of 16 cores of an Opteron 6128. A 16-thread OpenMP version is also included as a benchmark of coarse-grained algorithms. The finegrained algorithms is tested on an Nvidia GTX Titan, which has 24 SMs and 12GB of global device memory. GTX Titan is a representative fourth-generation GPGPU, with a CUDA compute capability of 5.2. Due to improved atomic operations and warp scheduling, taskbased parallelism becomes a much more attractive approach to write GPGPU programs.
Verification of FGBILU in INCOMP3D
FGBILU has been implemented as the block-sparse linear solver in INCOMP3D, which is an implicit 3D incompressible Navier-Stokes solver validated for a range of model problems [10, 11] . The 3D incompressible Navier-Stokes equations are solved on a multi-block structured grid using finite volume methods. Time integration of the discrete equations is carried out by the artificial compressibility scheme [12] . Time-accurate simulation is achieved by employing a dual time stepping procedure (sub-iteration) at each physical time step. A general multiblock grid can be partitioned over a number of allowable processors. MPI is used to achieve parallel computation on a cluster. The original version of the solver has been used in the study of a wide variety of CFD problems, including unsteady aerodynamics [13, 14] , twophase flows [15] , and human-induced contaminant transport [16, 17] . An immersed-boundary method [11] is incorporated to enable computations of flow around moving objects, but is not employed in this study. An LES test case based on a dynamic stall study [25] is used for the verification. An illustration with exaggerated dimensions is given in Figure 4 . An SD7003 airfoil, with a chord length of 100mm and a span of 200mm, is placed in the center of the domain, which has a radius of around 1200mm. On the spanwise direction there is an additional 200mm of free flow space on each side of the airfoil, to allow enough room for finite-span effects. The Reynolds number based on the chord length is 10 6 , with a free-stream velocity of 0.1m/s. The O-type mesh has 23 million cells, divided into 1152 blocks. Static load balancing is achieved by mapping multiple blocks of various sizes to a processor. 32 computation nodes are used, each of which executes two computation processes. Each of the nodes has one Xeon E5645 and two M2050 GPU. Only two of the six cores of each E5645 is used for the CPU version of the code, so that the impact of the interconnection is similar for both CPU and GPU simulations. A qualitative comparison is given in Figure 5 , showing Q=4 isosurfaces for the same setting of the SD7003 case. As we can see, the results are virtually indistinguishable.
Scaling tests for different domain sizes
The primary concern of algorithms based on a wavefront scheme is synchronization overhead and performance penalty due to smaller wavefronts near domain corners. The total process time T varies greatly for different domain sizes. To evaluate an average process time for one grid point in the domain, a per-point process time is defined:
Test domains are extracted from INCOMP3D solving a series of simple steady-state 3D RANS channel flows in cubic domains, ranging from 15 3 (3.4K) to 55 3 (85K). The number of PDE unknowns is fixed at n = 6. Once the linear system is extracted it is solved in the standalone solver. To minimize the impact of random time measure error, we repeat each task (factorization or triangular solve) 100 times and take an average. Run times on CPU are given in Fig. 6 . The effects of insufficient work load caused by the wavefront scheme near the domain corners is clearly visible: sequential BILU shows almost constant t p for various domain sizes, while t p for wavefront algorithms gradually increases as the domain size decreases. The performance of the 16-thread OpenMP version is particularly poor for smaller domains, sometimes even slower than the sequential code running on one core, indicating heavy overhead incurred by a coarse-grained wavefront scheme. A coarsegrained wavefront-based implementation of BILU on a CPU remains inefficient for any domain with less than 25K grid points. Considering 25K is actually not a very small domain, such requirement is rather discouraging. Indeed, for multi-block implicit PDE solvers, sharedmemory parallelism (OpenMP) is less preferred than message-passing parallelism (MPI) when ILU-type preconditioners are used.
Run times on GPU are given in Fig. 7 . For larger domains, the performance by all three versions are mostly the same. This is very much expected, because t p is largely determined by floating point computation, which is the same for all three versions. It is very important to investigate the performance for smaller domains, as they are often encountered in multi-block PDE solvers.
From Fig. 7 we can see that TR version constantly performances better than the baseline version for smaller domains. This indicates a clear benefit of TR, which is reducing index reconstruction overhead. In fact, only the grid point indices (i, j, k) need to be updated each time a new grid point is fetched. The relative element location (u, v) is always constant for each thread.
We can also see that the DP version is slower than the baseline for smaller domains. The reason for DP to under-perform is straightforward. First of all, the overhead associated with kernel launches in the baseline version is not very significant. Note that kernel launches are asynchronized, meaning that the kernel launch overhead often overlaps with kernels that is still running. Because the computation load of all three FGBILU kernels is substantial, almost all of the kernel launch overhead can be masked by this asynchronized launching : t p as a function of grid size for GPU algorithms mechanism. Therefore, the actual impact of kernel launch overhead in the baseline version is minimal. On the other hand, DP is merely a mechanism to launch kernels from GPGPU. As DP requires an master thread to launch wavefront kernels, allowing slightly less computation resource for actual computation. For smaller domains, such loss of computation resource clearly outweights any benefits of CPU-free kernel launching, resulting in less efficient overall performance in the DP version. Apparently, a coarse-grained wavefront-based implementation of BILU, such as the OpenMP version on a CPU, is rather inefficient if many small domains are to be processed. On the other hand, wavefront scheme shows clear advantage when fine-grained algorithms are adopted on GPGPU. Among the three versions of FGBILU, the TR version provides an effective optimization for smaller domains, while the DP version shows no clear benefit for all domain sizes. Speedup numbers, calculated by comparing run times of FGBILU and 16-thread OpenMP version, are compiled in Table 1 . The DP version is omitted altogether, as it is unlikely to be adopted in an optimized version of FGBILU. Speedup numbers show the biggest difference for triangular solve. This is not surprising, because it has less floating point operations, which amplifies any improvement on overhead reduction. As the domain size grows, the speedup numbers quickly converge. There is very little difference between all three FGBILU versions beyond 30K.
Conclusion
The baseline FGBILU has been designed with two granularity levels, which are perfectly mapped to the two-tier parallelism model of CUDA. The first tier resolves data dependency among grid points, which is mapped to CUDA workgroups. The second tier resolves data dependency among elements of submatrices, which is mapped to threads within a workgroup. An efficient implementation of the second tier parallelism requires very low-overhead synchronization mechanism, which is readily provided by CUDA thread-level synchronization.
The baseline CUDA implementation of FGBILU performs very well for larger domains. However, the nature of wavefront scheme imposes inevitable performance loss near domains corners. This issues is particularly serious for smaller domains. As less computation work can be generated in smaller domains, the impact of overhead is magnified. The primary objective of improving performance of FGBILU on smaller domains is therefore determined by how the impact of overhead can be mitigated.
In this study, two CUDA-specific techniques are investigated. The first technique, CUDA dynamic parallelism, allows GPGPU to launch kernels without CPU involvement. However, the benefit of this technique is marginal at best, because the kernel launching overhead has been masked very well by the asynchronized launch mechanism. The extra involvement of a master control thread on GPGPU reduces the effective computation resource for floating point computation, causing overall performance loss. The second technique is called thread recycling, which allows each SM on GPGPU to remain mapped to one workgroup so that it continues to fetch grid points to process. A global lock based on atomic operations is adopted to achieve synchronization at the boundary of each hyperplanes. Because threads remain active all the time, a large portion of the index reconstruction can remain constant, avoiding repetitive calculation and improving overall performance on smaller domains.
Further improvement of FGBILU are being investigated. Currently, the wavefront scheme of FGBILU only applies to structured domains. With proper level scheduling techniques FGBILU can be modified to work with unstructured grids. Also, it may be possible to adopt FGBILU using vector processing operations in the latest CPUs. It will be interesting to see if manually optimized algorithms on CPU can lead to similar performance improvement.
