Abstract
Introduction
Stencil computation (SC) is a common kernel of a wide range of scientific and engineering applications such as Jacobi and multigrid solvers [1] . Examples include explicit time-integration methods for numerical solution of partial differential equations used in climate, weather and ocean modeling [2] , computational electromagnetics [3] and quantum dynamics [4] codes using the finite-difference time-domain (FDTD) method, multimedia/image-processing applications that perform smoothing and other neighbor pixel-based computations [5] , and certain cellular automata and seismic simulations [6] . Because of its significance, SC is included in a number of benchmark suites, such as SPEC [7] , HPFBECH [8] , and NAS Parallel Benchmarks [9] . for ∀R ∈ {R}
v(R) ← f({v(R′)|R′ ∈ neighbor(R)})
where f is a mapping function. A typical SC thus consists of doubly nested loops: The outer loop sweeps over all grid points R in the lattice to update ) (R v ; and the inner loop over ) (R R neighbor ∈ ′ computes )}) ( ({ R′ v f . SC may be classified according to the spatial pattern of ) (R neighbor as follows. First, the order of a stencil is defined as the distance between the central grid point R and the farthest grid points ′ R in ) (R neighbor along a certain axis. (In a finitedifference application, this order is identical to that of Taylor expansion.) Second, we define the size of a stencil as the cardinality | ) ( | R neighbor , i.e. the number of grid points ′ R in neighbor(R) including R itself. In diagonally dominant SC, the computation depends heavily on the local value ) (R v as well as some ) (R′ v values from the nearest-neighbor grid points. Figure 1 (a) shows a diagonally dominant, firstorder, 19-point SC in the 3-dimensional lattice Boltzmann method (LBM) flow simulation [10, 11] . SC in other applications is often much higher order involving distant neighbor grid points. For example, Fig.  1(b) shows a 6-th order, 25-point SC in a 2-dimenional lattice. Such stencil is widely used in high-order finitedifference calculations [12, 13] . Emerging parallel platforms comprised of multicore compute nodes potentially provide enormous computing power for SC. An example is a cluster of general microprocessors consisting of multiple cores such as Intel quadcore processor. Another example is Cell Broadband Engine (Cell BE) in PlayStation3 and QS22 blade [14] . It is a heterogeneous multicore chip consisting of a traditional microprocessor called power processing element (PPE) that controls eight singleinstruction multiple-data (SIMD) co-processing units called synergistic processing elements (SPEs) [15] .
There have been extensive efforts to optimize SC on multicore platforms. Williams et al. [16] have optimized the LBM application on single Cell BE blade. Traditional SC optimization techniques include tiling [17] and iteration skewing (if the iteration structure allows it) [1, [18] [19] [20] . Recently, Datta et al. have performed comprehensive optimization of SC with both cache-aware and cache-oblivious approaches [21] . These studies are mainly focused on low-order SC. Since both SC and multicore architectures have wide varieties as mentioned above, it is desirable to develop a unified parallelization framework and perform systematic performance optimization for various SCs (e.g., diagonally dominant vs. high-order) on multicore architectures. This paper presents our hierarchical SC parallelization framework that combines: (1) spatial decomposition based on message passing; (2) multithreading using critical section-free, dual representation; and (3) single-instruction multiple-data (SIMD) parallelism based on various code transformations. We employ SIMD transformations such as translocated statement fusion, vector composition via shuffle, and vectorized data layout reordering (e.g. matrix transpose), which are combined with traditional optimization techniques such as loop unrolling. We thereby parallelize two SCs of different characteristics-diagonally dominant (first order, 19-point) LBM for fluid flow simulation, and high-order (6th-order, 37-point) FDTD for seismic wave propagation. The two SC applications have been implemented on a Cell BE based platform (PlayStation3 cluster), dual Intel quadcore, and IBM BlueGene/L and P. Section 2 presents the hierarchical SC parallelization framework applied to LBM, followed by test results of inter-and intra-node scalability on a PlayStation3 cluster and IBM BlueGene/L and P, as well the effect of SIMD, multithreading and their combination on a PlayStation3 cluster. In section 3, we discuss the optimization of the seismic wave propagation code on a dual Intel quadcore platform. Section 4 contains conclusions drawn from our experiments.
Hierarchical Parallelization of Lattice Boltzmann Method
In the lattice Boltzmann method (LBM) for fluid flow simulation, variable ) (R v is a density function (DF) from which fluid density and velocity at R are computed. The LBM simulation consists of timingstepping iterations, where each step consists of two phases: (1) collision function that involves a large number of floating-point operations that are strictly local to each R; and (2) streaming function that contains no floating-point operation but solely memory copies between nearest-neighbor grid points.
Our hierarchical parallel Lattice Boltzmann Method (pLBM) algorithm parallelizes LBM based on spatial decomposition implemented with message passing, multithreading using critical section-free, dual representation, and SIMD parallelism, which maximally exposes concurrency and data locality. The following subsections describe these parallelization layers.
Spatial Decomposition Based on Message Passing
In the uppermost level of the hierarchical spatial decomposition in our pLBM algorithm, the total simulation space = {R} is decomposed into several spatial sub-domains i , where
and each domain is mapped onto a processor. Figure 2 We have implemented the spatial decomposition using the message passing interface (MPI) standard and have tested the inter-node scalability on the IBM BlueGene/L computer at the Lawrence Livermore National Laboratory and the IBM BlueGene/P computer at the Argonne National Laboratory. Figure  3(a) shows the running and communication times of the pLBM code on up to 212,992 BlueGene/L and 131,072 BlueGene/P processors. Here, we scale the number of lattice sites linearly with the number of processors: 128 3 P lattice sites on P processors. The computing time increases only slightly when P increases from 1 to 131,072 on BlueGene/P. The weak-scaling parallel efficiency is the running time on 1 processor divided by that on P processors, and it is 0.978 on 131,072 BlueGene/P processors. Nearly ideal efficiency is also evident on 212,992 IBM BlueGene/L processors in Fig.  3 (a). Spatial decomposition is thus highly effective in terms of inter-node scalability for SC up to 10 5 compute nodes. 3 P lattice sites on P processors. Despite the small bandwidth and large latency of the low-cost Gigabyte Ethernet switch, the parallel efficiency of the PlayStation3 cluster is respectable (0.704 on 8 consoles).
Based on Critical Section-Free, Dual Representation
On a cluster of PlayStation3 (PS3) consoles, pLBM is implemented based on hierarchical spatial decomposition: (1) inter-console parallelization with the upper-level spatial decomposition into domains based on message passing; and (2) intra-console parallelization with the lower-level spatial decomposition into interleaved rows of the lattice sites within each domain through multithreading. Inside each PS3 console, the main program running on the power processing element (PPE) spawns multiple threads to run on synergistic processing elements (SPEs). Data transfer between the main memory of PPE and the local storage of SPEs is handled by direct memory access (DMA).
For the collision function, 6 SPE programs are performed simultaneously with 6 threads created by PPE (due to the PS3 hardware restriction, only 6 SPEs out of 8 are available for user programming). For optimal load balancing, we adopt an interleaving schema shown in Fig. 4(a) , where the area enclosed by the dotted lines shows the computational task assigned to the first thread with thread ID 0, and chunk ID j is assigned to SPE with thread ID j mod N thread ,
(the number of threads N thread is 6). Since multiple threads may update a common lattice site, we adopt a double-layered DF consisting of two floatingpoint arrays DF0 and DF1 to avoid any critical section, as shown in Fig. 4(b) : The collision function transfers DFs from array DF0 to local store on SPE, updates the DFs, and subsequently copies it back to the array DF1. We have tested the intra-node (or multithreading) scalability of pLBM on each PlayStation3 console. 
Single-Instruction Multiple-Data Parallelism
Most modern computing platforms have incorporated single-instruction multiple-data (SIMD) extensions into their processors [22] to exploit the natural parallelism of applications if the data can be SIMDized (i.e., if a single instruction can simultaneously operate on a vector of consecutive data). On Cell BE, each vector contains four floating-point numbers that are operated concurrently, and thus the ideal speedup is 4. We maximally expose SIMD parallelism of pLBM using various code transformations. The following subsections describe our SIMD transformations: Translocated instruction fusion, vector composition via shuffle, and vectorized data layout reordering (e.g. matrix transpose).
Translocated Statement Fusion for SIMD
The key to SIMD parallelization is to identify SIMDizable statements, i.e., a set of statements that are independent and thus are executable concurrently, when the associate variables are packed into vectors.
Statements that satisfy this criterion can be transformed into a single SIMD statement regardless of the positions of the statements (e.g., in different loops) in the program. The translocation of statements and their fusion into a SIMD operation enhance the concurrency of the program at the instruction level. In the following subsection, we illustrate the use of such translocated statement fusion for SIMD parallelization in pLBM.
Here, the original code is doubly nested for loops, where the inner loop traverses the 18 nearest neighbor grid points in a cube to perform certain computation: 
Vector Composition for SIMD
The flexibility of composing new vectors is another useful feature of SIMD supported by shuffle and extract instructions. The shuffle(vec1, vec2, pattern) function is a byte-oriented operation that selects bytes according to the pattern (a control vector) from source vectors vec1 and vec2 and places them in a target vector [23] . The control vector entries are indices of bytes in the 32-byte concatenation of vec1 and vec2: To select byte n from vec1, we define the corresponding byte of the pattern as "0X0n", while from vec2 as "0X1n". Figure  7 illustrates the use of shuffle to generate a new vector
. By assigning uvec to vec1, vec2 and defining pattern0 as in Fig 7, we can select desired bytes of the uvec to generate uvec0. , where u [3] is set as NULL to fill the vector. We then define three patterns and perform three shuffle operations with different patterns to obtain three new vectors: , 1, 2 ). Subsequently, we multiply uveci (i = 0, 1, 2) by uvec, and put the results into pieqveci (i = 0, 1, 2), respectively. The pseudo-code of the above SIMDization is given below:
for(i=0;i<3;i++){ uveci=shuffle(uvec,uvec,patterni); pieqveci = mul(uvec,uveci); } The speedup of this SIMDization can be analyzed as follows. While the original program performs 3×3×1 = 9 operations, the SIMD program requires only 3. Thus the speedup is 3. Memory access is analyzed as well. While the original program accesses memory 3×3×3 = 27 times, the SIMD program only requires 1+3×1 = 4 memory accesses. Thus the memory access time is reduced by a factor of 7 by the SIMDization. The use of array of structure (AOS), i.e., the packing of 3 floats into a vector leaving the fourth element NULL, incurs some overhead, which is largely offset by the SIMD speedup.
It is highly beneficial to generate new data from existing vector registers instead of fetching them from cache or memory each time, and this greatly improves the data reuse ratio and reduces the memory access time. Data generation flexibility of SIMD via vector composition thus not only reduces computation but also reduces memory access.
Data Layout Reordering for SIMD
Data layout reordering is another SIMDizing technique facilitated by shuffle. By defining appropriate patterns, for instance, it is possible to transpose the rows and columns of an array efficiently. Figure 8 shows a transpose function for a 4×4 array, TRANSPOSE4*4(v0,v1,v2,v3), where each vi (i = 0,1,2,3) is a vector. The transpose procedure in Fig. 8 consists of two steps, both of which are shuffle operations. First, the 4×4 array is arranged as four vectors, where each row itself is a vector. We then define patterns for shuffle operation, and execute the set of four shuffles in Step 1 (Fig 8) , followed by another set of four shuffles in Step 2 (Fig 8) to obtain the transposed array. Since all the vectors are in the vector register, only four memory accesses are involved, making this SIMDized transpose highly efficient. We use this SIMDized transpose to parallelize pLBM. Here, the original code is a doubly nested for loops to do something like matrix multiplication:
The inner loop accesses an 18×3 array, v, column by column. The resulting stride memory accessing degrades the performance considerably and necessitates data layout reordering. Our SIMD solution resolves this problem in a two-step procedure: (1) transpose-v is transposed using divide-and-conquer, first dividing it into 5 small arrays and then using TRANSPOSE4*4() to transpose each sub-array; and (2) calculation-the transpose reduces the problem to the one described in section 2.3.1, which is SIMDized using the same transposed statement fusion schema. The pseudo-code of the above SIMDization of transpose is given below:
The speedup of the SIMDized transpose of an 18×3 array can be analyzed as follows. The original program requires 3×18×(1+2) = 162 operations, while the SIMD program operates 5×4×2 = 40. The speedup is thus 2.7. This is remarkable considering the overhead of the SIMDization. First, we transpose an array of 4×4 each time, while the real program only has 4×3 data. Also, the TRANSPOSE18*3 function in fact transposes 20×3 data instead of 18×3.
Using the techniques described above, we have SIMDized pLBM, where the SIMDization was only implemented to the collision function, which accounts for the most computing time of pLBM. We first test the SIMD effect with various problem sizes both for the collision part (which is fully SIMDized) and the whole pLBM application on a PS3, where the number of threads is set to one. Figure 9(a) shows the SIMD speedup of the collision function (i.e. the ratio of the running time without SIMD and that with SIMD). It is an ascending function of the problem size and reaches 3.72 for 2.61×105 grid points, which is close to the ideal speedup of 4 (SIMD efficiency is 0.930). Figure  9 (b) shows the SIMD effect for the whole pLBM application. The SIMD speedup of pLBM is again an ascending function of the problem size and reaches 3.12 for 2.614×105 grid points (SIMD efficiency 0.780). The SIMD optimization is thus highly effective for pLBM. Next, we test the SIMD speedup for different numbers of threads for the collision part and the whole pLBM application on a PS3 for 2.614×105 grid points. Figure 10(a) shows that the speedup for the collision (i.e. pure SIMD) part is nearly constant around 3.5 with varying number of threads. The SIMD optimization thus has very good scalability with different numbers of threads. However, for the whole pLBM application, the speedup is a decreasing function of the number of threads and reduces to 2.1 for 6 threads (i.e. the maximum number of threads on a PS3). This is because both communication and streaming-computation times do not decrease for larger numbers of threads. Consequently, the overall speedup for the whole pLBM is not as good as for the pure SIMD part. 
High-Order Seismic Stencil Computation on Dual Intel Quadcore
We have applied the same parallelization framework as described in section 2 to a finitedifference time-domain (FDTD) code for seismic wave propagation, which performs a large stencil computation with typical problem sizes of 1,024 3 grid points. This section describes some specific performance features of the FDTD code. The 3D stencil computation here is highly off-diagonal (6-th order) and involves 37 points, i.e., each grid point interacts with 12 other grid points in each of the x, y and z Cartesian directions. The original FDTD code is written with standard POSIX threads to run on x86 multicore systems, and we choose a dual Intel quadcore platform as a testbed. Spatial decomposition divides the system into 2D layers, and each thread works on its layer in the x direction, taking 12 inputs from the main array outputting a data point into a temporary array. Then the temporary array is run through again with a 12-point stencil and sent into the final array. For the y and z directions, the initial 3D array is transposed before being operated on, and the final array is then summed out in strides in order to return to the 3D arrangement.
Optimization Schema for Seismic Code
As mentioned above, the 37-point SC in the seismic code accesses 12 neighbor points in each Cartesian direction, in order to update the value of one grid point. To find the bottleneck of the code, we have run the code with the Intel VTune profiler on a dual Intel quadcore platform, and the result shows that the most time-consuming parts are nine computations and two stride memory accessing. Each of the nine computations (three for each of the x, y, z directions) contains nested for loops over central and neighbor grid points. The two stride memory accessing transpose the 3D array data to make the memory access in the y and z directions contiguous, respectively, for the computations.
We have applied the SIMDization approaches described in section 2.3 to the parallel seismic code. Here, we illustrate the procedure using a representative code segment shown below, which calculates one value from 12 nearest neighbors:
Our SIMD solution first unrolls the loop four times (we choose four for alignment), and then makes use of the translocated statement fusion and vector composition for SIMD as described in section 2.3. Prior to the for loop, we pack floats u[i] to u[i+11] to three vectors, uvec 0 , uvec 1 , and uvec 2 :
We also pack the constants a k (k = 1, 2, … , 6) (k= 0, 1, 2) . For the first loop, we multiply each utmp k with avec k (k=0, 1, 2) and store the product to vector sumvec. After calling horizontal addition hadd (as defined in Fig.  11(b) ) twice, we extract the first float value of sumvec with extract. Subsequently, the float value is multiplied by b to get a bD. Next, we load uvec 3 Memory access can also be analyzed as follows: The original program requires 15 memory accesses to get one bD, while the SIMDized program needs only one load, nine shuffle and one store to get four bDs-in each round, we reuse three vectors from the last round (see the last statement of the above pseudo-code), and consequently only one vector u 4 needs to be loaded. After obtaining four bDs, the SIMD code writes back it as a vector of
The memory access is thus reduced by a factor of over 5. In summary, this solution not only utilizes SIMD and loop unrolling but also enhances data reuse.
Performance Tests
We have tested the performance of SIMD parallelization for the seismic code on Intel dual quadcore Xeon with 2.33 GHz clock, which has 32 KB L1 data cache per core with shared 2×4 MB L2 cache and 16 GB DDR2 memory. The processor also includes 128-bit wide registers and a separate register for data movement. It also features Streaming SIMD Extension 2 (SSE2) and Streaming SIMD Extension 3 (SSE3) instructions to support SIMDization. Figure 12 (a) shows the running time (to perform one of the major for loops for the stencil computation in the x direction) of the FDTD simulation without and with SIMDization for different numbers of threads for 400 Obviously, SIMD optimization on the seismic code is not as effective as on pLBM, for which several reasons are conceivable. The first is the problem sizethe pLBM is tested with the largest size of 64 3 while the seismic code is with 400 3 . The large problem size has caused a high TLB (translation lookaside buffer) miss ratio of 0.2% on the Intel platform and thereby decreased the performance. The large problem size also causes large stride memory access when transposing the array in the y and z directions, which further degrades the performance of the whole program. Another reason is the hardware restriction. The Intel platform has only eight vector registers, which cannot feed 30 vectors required for each round of computation in the seismic code. This may cause frequent exchanges of data between vector registers and main memory, which greatly decreases the performance. This also explains the performance decrease when the whole x-direction computations are SIMDized instead of only one-third of it. On the contrary, on Cell BE based PlayStation3, there are 128 vector registers per SPE, which is more than enough to satisfy the needs of pLBM, and accordingly the SIMDized pLBM has achieved a better performance.
Conclusion
In summary, we have developed a hierarchical parallelization framework that combines spatial decomposition, multithreading with critical section-free, dual representation, and SIMDization using translocated statement fusion, vector composition and vectorized data layout transformation. For diagonally dominant stencil computations such as LBM, the SIMD parallelization has been found highly effective on multicore platforms such as a Cell BE based PlayStation3 cluster. However, for highly off-diagonal stencil computations such as the high-order FDTD code, the SIMDization has proved less effective on a Intel quadcore platform due to high TLB miss, large stride memory accessing, and lack of vector registers. This points out the need for further optimizations to increase the data locality of the high-order stencil computation.
