Finite Difference (FD) is a widely used method to solve Partial Differential Equations (PDEs). PDEs are the core of many simulations in different scientific fields, such as geophysics and astrophysics. The typical FD solver performs stencil computations over the entire computational domain, thus solving the differential operators. In general terms, a stencil computation consists of a weighted accumulation of the contributions of neighbor points along the Cartesian axes. Therefore, optimizing stencil computations is crucial for reducing the application execution time.
INTRODUCTION
Astrophysics [Brandenburg 2003], geophysics [McMechan 1989; Operto et al. 2006], quantum chemistry [Alonso et al. 2008; Castro et al. 2006], and oceanography [De Groot-Hedlin 2008; Kormann et al. 2008] are examples of scientific fields where large computer simulations are frequently carried out. These simulations have something in common, namely, the physical models are represented by Partial Differential Equations (PDEs), which are mainly solved by the Finite-Difference (FD) method. This method uses stencil computations to obtain values for discrete differential operators. Large-scale simulations may consume days of supercomputer time; further, most of this execution time is spent on the stencil computation if they are PDE+FD based. For instance, Reverse-Time Migration (RTM) is a seismic imaging technique from geophysics used in the oil and gas industry, where up to 80% of the RTM kernel execution time is spent on stencil computation [Araya-Polo et al. 2008].
The basic structure of stencil computation is that the central point accumulates the contribution of neighbor points in every axis of the Cartesian system. The number of neighbor points in every axis relates to the accuracy level of the stencil, where more neighbor points lead to higher accuracy. The stencil computation is then repeated for every point in the computational domain, thus solving the spatial differential operator.
Two inherent problems can be identified from the structure of the stencil computation.
-First, the noncontiguous memory access pattern. In order to compute the central point of the stencil, a set of neighbors has to be accessed. Some of these neighbor points are distant in the memory hierarchy, requiring many cycles of latency to be accessed [Kamil et al. 2005, 2006]. Furthermore, with increasing stencil order, it becomes more expensive to access the required data points. Hardware prefetchers can hide the latency issue, but only if the cache lines used do not exceed the number of streaming prefetchers.
-Second, the low computational-intensity and reuse ratios. After gathering the set of data points, just one central point is computed, and only the accessed data points in the sweep direction might be reused for the computation of the next central point [Frigo and Strumpen 2005]. Thus, some of the accessed data are not reused and the current hierarchical memory structures are poorly exploited.
In order to deal with these issues and improve the stencil computation, we introduce the Semi-stencil algorithm. This algorithm changes the structure of the stencil computation, yet it can be applied generally to most stencil-based problems. In essence, the Semi-stencil algorithm computes half of the axis contributions required by several central points in the same loop iteration. By accessing only the points required to compute half of the stencil axes, this algorithm reduces the number of neighboring points loaded per iteration. At the same time, the number of floating-point operations remains the same, but because the number of loads is reduced, the computation-to-access ratio is increased. This approach has already proven effective on the heterogeneous Cell/B.E. architecture [de la Cruz et al. 2009] for optimizing real scientific applications such as RTM [Araya-Polo et al. 2008]. The Cell/B.E. stencil implementation of RTM includes the Semi-stencil strategy, SIMDization of the code, and other pipeline optimizations. The single-SPE performance on an IBM BladeCenter QS22 is 12.44 GFlops for an 8th-order (25-point) stencil. This is 49% of the SPE peak performance (25.6 GFlops), compared to the 34% (8.61 GFlops) obtained with the naive implementation.
In the current work, we present a comprehensive study of the Semi-stencil strategy on homogeneous multicore architectures with hierarchical memories. Unlike other stencil optimization works [Datta et al. 2008, 2009, 2010], where a 7-point academic stencil (single Jacobi iteration) is tackled through auto-tuning environments, we evaluate 25- and 43-point stencils (ranging from academic cases up to sizes widely used in industry). Furthermore, the aim of this work is not to achieve the fastest stencil implementation on the chosen platforms, but to prove the soundness of the Semi-stencil algorithm when scientific codes are optimized. Hence, a future goal may be to include Semi-stencil in auto-tuners to take advantage of hardware-dependent optimization techniques such as register blocking, SIMDization, or cache bypassing.
The remainder of this article is organized as follows: Section 2 introduces the stencil algorithm analyzing its main characteristics and challenges. In Section 3, the principal techniques that help to cope with the stencil problem are reviewed. Section 4 introduces the novel algorithm, its internals, features, and considerations. Section 5 depicts the performance modeling methods available in the literature in order to characterize the performance of a specific stencil algorithm. In Section 6, we present the testbed platforms and the serial and parallel performance are evaluated for each algorithm with the Semi-stencil strategy. Finally, Section 7 presents the conclusions of this work.
THE STENCIL PROBLEM
The stencil computes the spatial differential operator, which is required to solve PDEs numerically in FD schemes. A multidimensional structured grid (often a huge 3D data structure) is traversed and the elements ($X^t_{i,j,k}$) are updated with weighted contributions. Figure 1(b) depicts a generic 3D stencil pattern, where the parameter $\ell$ represents the number of neighbors to be used in each direction of the Cartesian axes. Hence, the computation of an $X^t_{i,j,k}$ point at time step $t$ requires $\ell$ weighted neighbors in each axis direction at the previous time step ($t-1$). The domain contains interior points (the solution domain) and ghost points (outside of the solution domain). This operation is repeated for every point in the solution domain, finally approximating the spatial differential operator.
The first problem identified in this computation is the sparse memory access pattern. Data is stored in Z-major order (unit-stride dimension), and therefore accesses across the other two axes (X and Y) may be significantly more expensive latency-wise (see Figure 1(a)). The parameter $\ell$ has a direct impact on this problem; that is, the larger the $\ell$ value, the more neighbors on each axis have to be loaded to compute the $X^t_{i,j,k}$ point. The second problem to deal with is the low ratio of floating-point operations to data cache accesses, which is related to the poor data reuse. In this work, we use the FP/Cache metric to quantify these issues; this is the ratio of floating-point operations to the number of loads and stores issued to the L1 data cache for the computation of one $X^t$ point.
$$\mathrm{FP/Cache}_{Classical} = \frac{\text{Floating-Point Operations}}{\text{Data Cache Accesses}} = \frac{2 \times \text{Multiply-Add Instructions}}{X^{t-1}\,\text{Loads} + X^{t}\,\text{Stores}} = \frac{2 \times (2 \times dim \times \ell) + 1}{(2 \times (dim-1) \times \ell + 1) + 1} = \frac{4 \times dim \times \ell + 1}{2 \times dim \times \ell - 2 \times \ell + 2} \quad (1)$$
ALGORITHM 1: The classical stencil algorithm pseudocode for a 3D problem. $X^t$ is the space domain for time step $t$, where Z, X and Y define the dimensions (ordered from unit to least stride) of the datasets including ghost points (points outside of the solution domain). $\ell$ denotes the number of neighbors used for the central point contribution. $C_{Z_1 \ldots Z_\ell}$, $C_{X_1 \ldots X_\ell}$ and $C_{Y_1 \ldots Y_\ell}$ are the spatial discretization coefficients for each dimension, and $C_0$ is the self-contribution coefficient. $z_s$, $z_e$, $x_s$, $x_e$, $y_s$ and $y_e$ denote the area of $X^t$ where the stencil operator is computed. In order to compute the entire dataset, the stencil pseudocode must be called with these bounds spanning the whole solution domain. Notice that in this work the discretization coefficients are considered constant.

1: procedure STENCIL($X^t$, $X^{t-1}$, $z_s$, $z_e$, $x_s$, $x_e$, $y_s$, $y_e$)
2:   for $k = y_s$ to $y_e$ do
3:     for $j = x_s$ to $x_e$ do
4:       for $i = z_s$ to $z_e$ do
5:         $X^t_{i,j,k} = C_0 \times X^{t-1}_{i,j,k} + \sum_{r=1}^{\ell} \left[ C_{Z_r} (X^{t-1}_{i-r,j,k} + X^{t-1}_{i+r,j,k}) + C_{X_r} (X^{t-1}_{i,j-r,k} + X^{t-1}_{i,j+r,k}) + C_{Y_r} (X^{t-1}_{i,j,k-r} + X^{t-1}_{i,j,k+r}) \right]$
6:       end for
7:     end for
8:   end for
9: end procedure

Eq. (1) states this metric for the classical stencil computation shown in Algorithm 1. $dim$ is the number of dimensions of our PDE (3 in a 3D problem), where $2 \times \ell$ Multiply-Add instructions are required for each dimension. Furthermore, one extra multiplication must be considered for the self-contribution ($X^{t-1}_{i,j,k}$). The number of loads needed to compute a stencil point differs depending on the axis. The axes that are not the unit-stride dimension (X and Y) require $2 \times \ell$ loads at each loop iteration. However, the Z unit-stride dimension tends to require just one load, due to load reuse from preceding loop iterations. Finally, concluding the $X^t_{i,j,k}$ point computation requires one store to save the result.
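For concreteness, the following C sketch implements line 5 of Algorithm 1 in its factored form. The array layout and identifiers are illustrative (they are not taken from the Stencil Probe sources), and the ghost regions are assumed to be at least ℓ points wide so that all accesses stay in bounds.

#include <stddef.h>

/* Z is the unit-stride dimension; IDX linearizes (i, j, k). */
#define IDX(i, j, k, NZ, NX) \
    ((size_t)(k) * (size_t)(NX) * (size_t)(NZ) + (size_t)(j) * (size_t)(NZ) + (size_t)(i))

void stencil_classical(double *Xt, const double *Xtm1,
                       const double *CZ, const double *CX, const double *CY,
                       double C0, int l, int NZ, int NX, int NY,
                       int zs, int ze, int xs, int xe, int ys, int ye)
{
    for (int k = ys; k <= ye; k++)
        for (int j = xs; j <= xe; j++)
            for (int i = zs; i <= ze; i++) {
                double v = C0 * Xtm1[IDX(i, j, k, NZ, NX)];   /* self-contribution */
                for (int r = 1; r <= l; r++) {                /* factored axis terms */
                    v += CZ[r-1] * (Xtm1[IDX(i-r, j, k, NZ, NX)] + Xtm1[IDX(i+r, j, k, NZ, NX)]);
                    v += CX[r-1] * (Xtm1[IDX(i, j-r, k, NZ, NX)] + Xtm1[IDX(i, j+r, k, NZ, NX)]);
                    v += CY[r-1] * (Xtm1[IDX(i, j, k-r, NZ, NX)] + Xtm1[IDX(i, j, k+r, NZ, NX)]);
                }
                Xt[IDX(i, j, k, NZ, NX)] = v;                 /* one store per point */
            }
}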
As shown in Eq. (1), the FP/Cache ratio depends on the $dim$ and $\ell$ variables. Taking into account that $\ell$ is the only variable that may increase, the FP/Cache ratio tends to a factor of ≈3 flops per data cache (d-cache) access for 3D problems. This relatively low ratio, along with the results of previous works [Frigo and Strumpen 2005; McCalpin and Wonnacott 1999; Rivera and Tseng 2000], shows that the stencil computation is usually memory-bound. In conclusion, the execution time of the stencil computation is mainly dominated by the memory access latency.
These concerns force us to pay special attention to how data is accessed during the computation. It is crucial to improve the memory access pattern (by reducing the overall number of data transfers) and exploit the memory hierarchy as much as possible (by reducing the overall transfer latency). Section 3 reviews the main approaches that can be found in the literature to address these issues.
RELATED WORK
Most of the contributions for stencil computations can be divided into three distinct groups: space blocking, time blocking, and pipeline optimizations. Space and time blocking are related to tiling strategies widely used on multilevel cache hierarchy architectures. Pipeline optimizations refer to those techniques applied at the CPU pipeline level to improve instruction throughput (IPC). In addition, these three groups contain incremental optimization techniques which can be combined with techniques from the other groups.
Figure 2 categorizes these three groups by complexity of implementation (effort), benefit improvement (performance), and implementation tightness regarding hardware (dependency). Even though each algorithm property (effort, performance, and dependency) can differ widely depending on the code complexity and the underlying hardware, this classification can be treated as the big picture of the state-of-the-art. In this diagram, an interesting optimization algorithm would be classified in the top-left corner, whereas an inefficient one would be in the bottom-right corner. The next sections will review the advantages and disadvantages of these optimization methods.

Fig. 2. Characterization of different optimization schemes. This diagram sorts each optimization method according to three properties: the programming effort required to implement it, the performance boost from optimizing the code, and the dependency of the implementation with respect to the underlying hardware.
Space-Blocking
Space-blocking algorithms promote data reuse by traversing data in a specific order. Space blocking is especially useful when the dataset structure does not fit into the memory hierarchy. The most representative algorithms of this kind are as follows.
-Tiling or blocking techniques. Rivera and Tseng [2000] propose a generic blocking scheme for 3D stencil problems. The entire domain is divided into small blocks of size TI × TJ × TK which must fit into the cache. Rivera and Tseng showed that a good blocking scheme configuration can be achieved when a TI × TJ 2D block is set along the least-stride dimension. Later, Kamil et al. [2005] found that the best configuration is usually given when TI is equal to the grid size I, as shown in Figure 3. This traversal order reduces cache misses in the least-stride dimensions (X and Y), which may increase data locality and overall performance. Note that a search for the best block-size parameters (TI × TJ) must be performed for each problem size and architecture. A sketch of this scheme is given after this list.
-Circular queue. Datta et al. [2008] employ a separate queue data structure to perform the actual stencil calculations. This structure stores only as many 2D planes as are needed for the given stencil. After completing a plane, the pointer to the lowest read plane is moved to the new top read plane, thus making a circular queue. In general, the circular queue stores (t − 1) sets of planes, where t is the number of iterations performed.
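The following sketch illustrates the Rivera/Tseng blocking scheme on top of the classical kernel from Section 2. The parameter names TI and TJ follow the text, but the driver itself is a hypothetical reconstruction, not the original code: the unit-stride and middle dimensions are tiled while the least-stride Y dimension is streamed.

/* Rivera-style 2D space blocking (sketch): tile the Z and X loops with a
 * TI x TJ block so the working set of planes fits in cache, and stream the
 * full least-stride Y range per tile. Reuses stencil_classical from above. */
void stencil_rivera(double *Xt, const double *Xtm1,
                    const double *CZ, const double *CX, const double *CY,
                    double C0, int l, int NZ, int NX, int NY, int TI, int TJ)
{
    int zs = l, ze = NZ - l - 1;      /* interior bounds, ghost width = l */
    int xs = l, xe = NX - l - 1;
    int ys = l, ye = NY - l - 1;

    for (int jj = xs; jj <= xe; jj += TJ)
        for (int ii = zs; ii <= ze; ii += TI) {
            int je = (jj + TJ - 1 < xe) ? jj + TJ - 1 : xe;
            int ie = (ii + TI - 1 < ze) ? ii + TI - 1 : ze;
            /* full sweep along Y for this TI x TJ tile */
            stencil_classical(Xt, Xtm1, CZ, CX, CY, C0, l, NZ, NX, NY,
                              ii, ie, jj, je, ys, ye);
        }
}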
Time Blocking
Time-blocking algorithms perform loop unrolling over time-step sweeps to exploit the grid points as much as possible, and thus increase data reuse. Such techniques have shown some effectiveness on real infrastructures [ANAG 2012], but they require careful code design. However, due to the time-dependency scenario, these techniques may pose implementation issues when other tasks must be carried out between stencil sweeps. For instance, boundary condition computation, intra- and extra-node communication, or I/O may occur in many scientific applications during time-step updates. Time-blocking techniques can be divided into explicit and implicit algorithms.
-Time skewing. McCalpin and Wonnacott [1999] try to reuse data in cache as much as possible. As a result, memory transfers are reduced and execution stalls are minimized. Essentially, several cache blocks of size TI × TJ × TK are generated over the space dimensions of the grid and each of these blocks is unrolled over time, as shown in Figure 4(a). To keep the data dependencies of stencil computations, block computations must be executed in a specific order. This constraint makes the time-skewing algorithm only partially effective in parallel executions. However, time skewing has already been parallelized [Kamil et al. 2006; Wonnacott 2000] with reasonable results. Wonnacott devised the parallel version of the time-skewing algorithm by performing space cuts in the least-stride dimension. Each thread block is divided into three parallelepipeds, which are computed in a specific order to preserve data dependencies and enable parallel computation. In this algorithm, as in space tiling, a search for the best block-size parameters must be performed prior to the start of the computation.
-Cache oblivious. Frigo and Strumpen [2005] recursively cut the space-time domain into trapezoids (T1 and T2), where the stencil can be computed in an optimal way due to size constraints of the cache hierarchy. Cutting in space produces a left-side trapezoid (T1) where no point depends on the right-side trapezoid (T2), thus allowing T1 to be fully computed before T2. In addition, recursive cuts can be taken over the time dimension. In this cut, the original time region (t0, t1) is split into a lower T1 trapezoid (t0, tn) and an upper T2 trapezoid (tn, t1). As in the space cut, T1 does not depend on any point of T2, and T1 can be computed in advance. Cache oblivious has also been parallelized [Frigo and Strumpen 2006] by creating space cuts (either in the Z or Y dimensions) of inverted and noninverted trapezoids, which are computed in an order that preserves dependencies. A 1D sketch of the serial recursion follows this list.
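As a reference for the recursion just described, here is a compact 1D version of the cache-oblivious trapezoidal decomposition in the style of Frigo and Strumpen, for a 3-point (slope-1) stencil. The kernel() point-update function is assumed to be defined elsewhere.

/* Cache-oblivious 1D trapezoid walk: time span [t0,t1), space span [x0,x1)
 * at t0, with the left/right edges moving dx0/dx1 points per time step. */
extern void kernel(int t, int x);   /* single point update, defined elsewhere */

void walk1(int t0, int t1, int x0, int dx0, int x1, int dx1)
{
    int dt = t1 - t0;
    if (dt == 1) {                                   /* base case: one time step */
        for (int x = x0; x < x1; x++)
            kernel(t0, x);
    } else if (dt > 1) {
        if (2 * (x1 - x0) + (dx1 - dx0) * dt >= 4 * dt) {
            /* space cut: the left trapezoid T1 has no dependency on T2 */
            int xm = (2 * (x0 + x1) + (2 + dx0 + dx1) * dt) / 4;
            walk1(t0, t1, x0, dx0, xm, -1);          /* T1 first */
            walk1(t0, t1, xm, -1, x1, dx1);          /* then T2 */
        } else {
            /* time cut: lower trapezoid first, then the upper one */
            int s = dt / 2;
            walk1(t0, t0 + s, x0, dx0, x1, dx1);
            walk1(t0 + s, t1, x0 + dx0 * s, dx0, x1 + dx1 * s, dx1);
        }
    }
}

A 3D version follows the same pattern, applying the space cuts dimension by dimension before resorting to a time cut.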
Pipeline Optimizations
Low-level optimizations at CPU pipeline level include several well-known techniques. These techniques may be categorized into three groups: loop transformations, data access, and streaming optimizations. Loop unrolling, loop fission, and loop fusion are part of the loop transformations group. They can reduce the overall execution time of scientific codes by reducing register pressure and data dependency as well as improving temporal locality [Manjikian and Abdelrahman 1997; McKinley et al. 1996] . Data access optimizations include techniques such as software prefetching, software pipelining and register blocking [Allan et al. 1995; Callahan et al. 1991; Lam et al. 1991; Mowry and Gupta 1991; Rogers and Li 1992] . All of these relate to improving data access latency within the memory hierarchy (from register to main memory level). Finally, Symmetric MultiProcessing (SMP), Single-Instruction Multiple-Data (SIMD), or Multiple-Instruction Multiple-Data (MIMD) are some programming paradigms which are included in the streaming optimizations group. The multiprocessing term refers to the ability to execute multiple processes, threads, and instructions concurrently in a hardware system.
All the previous techniques have been successfully used in many computational fields to improve processing throughput and increase the IPC metric (Instructions Per Cycle). Furthermore, the performance of the codes can be improved significantly by combining several of them, although collateral effects (performance issues) can appear from time to time. Some modern compilers are able to automatically generate code that takes advantage of some of these techniques, thus relieving the developer of tedious implementation work. However, some pipeline optimizations may increase both instruction code size and register pressure. Therefore, they must be used carefully, since overuse can lead to register spilling, which would produce slower code due to additional saves and restores of register contents from the stack.
The stencil code representation in terms of arithmetic instructions is another issue to bear in mind. Depending on the hardware, one stencil codification may perform better than another. Table I shows two different stencil representations: factored and expanded. The former uses add and multiplication instructions, while the latter maps more naturally to fused multiply-add and multiplication instructions.

Table I. Floating-point instructions required by each stencil code representation.

                 Muls and Adds inst. based          Multiply-Add inst. based
Factored         3 × dim × ℓ + 1                    3 × dim × ℓ + 1
                 (Muls: dim × ℓ + 1,                (Muls: dim × ℓ + 1,
                  Adds: 2 × dim × ℓ)                 Adds: 2 × dim × ℓ)
Expanded         4 × dim × ℓ + 1                    2 × dim × ℓ + 1
                 (Muls: 2 × dim × ℓ + 1,            (MultiplyAdds: 2 × dim × ℓ,
                  Adds: 2 × dim × ℓ)                 Muls: 1)
Depending on the instruction set and the latency (in cycles) of each arithmetic instruction in the CPU pipeline, one specific representation (factored or expanded) may outperform the other. 
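The two codifications can be written in C as follows for a single 1D axis of order ℓ (function and variable names are ours, for illustration only). On FMA-capable hardware, a compiler can map each statement of the expanded loop body onto one multiply-add.

/* Factored codification: one add inside the parentheses plus one
 * multiply per coefficient; no direct fused multiply-add mapping. */
double stencil_point_factored(const double *x, const double *c,
                              double c0, int i, int l)
{
    double acc = c0 * x[i];                     /* 1 Mul (self-contribution) */
    for (int r = 1; r <= l; r++)
        acc += c[r-1] * (x[i-r] + x[i+r]);      /* Add + Mul + Add per r */
    return acc;
}

/* Expanded codification: two independent products per coefficient,
 * each of which maps onto a fused multiply-add instruction. */
double stencil_point_expanded(const double *x, const double *c,
                              double c0, int i, int l)
{
    double acc = c0 * x[i];                     /* 1 Mul (self-contribution) */
    for (int r = 1; r <= l; r++) {
        acc += c[r-1] * x[i-r];                 /* 1 MultiplyAdd (FMA) */
        acc += c[r-1] * x[i+r];                 /* 1 MultiplyAdd (FMA) */
    }
    return acc;
}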
THE SEMI-STENCIL ALGORITHM
The Semi-stencil algorithm involves noticeable changes to the structure as well as the memory access pattern of the stencil computation. This new computation structure (depicted in Figure 5 ) consists of two phases: forward and backward updates, which are described in detail in Section 4.1. Additionally, due to this new structure, three parts of code, called head, body, and tail, need to be developed to preserve the numerical soundness; these are described at the end of this section. To conclude this section, it will be shown how Semi-stencil is orthogonal and may be combined with any other optimization technique.
As shown in Section 2, there are two main issues in stencil computations: first, the low floating-point operation to data cache access ratio and, second, the high-latency memory access pattern, especially for the last two dimensions. Basically, the Semi-stencil algorithm tackles these two issues in the following ways.
-It improves data locality, since less data is required on each axis per loop iteration. This may be an important benefit for hierarchical cache architectures, where the noncontiguous axes (X and Y) of a 3D domain are more expensive latency-wise.
-The new memory access pattern reduces the total number of loads per inner loop iteration while keeping the same number of floating-point operations, thereby increasing data reuse and thus the FP/Cache ratio, relative to the classical estimate of Eq. (1).
As mentioned before, two phases (forward and backward updates) are needed to carry out the new memory access pattern of the Semi-stencil algorithm. These are the core of the algorithm, and the following section elaborates them.
Forward and Backward Updates
Line 5 of Algorithm 1 shows, in computational terms, the generic spatial operator for FD codes. This line of pseudocode updates the value of $X^t_{i,j,k}$ in one step (iteration $\{i, j, k\}$ of the loop). The essential idea of Semi-stencil is to break up this line into two phases, thereby partially updating more than one point per loop iteration. To carry out this parallel update, the factored add and mul operations ($c_i \times (x_{-i} + x_{+i})$) must be decomposed into multiply-add instructions ($c_i \times x_{-i} + c_i \times x_{+i}$) in order to split up the classical computation into forward and backward updates.
The forward update is the first contribution that an $X^t$ point receives at time step $t$. During this phase, when the $i$ iteration of the sweep axis is being computed, the $X^t_{i+\ell}$ point is updated with the $X^{t-1}$ rear contributions (as depicted in Figure 5(a)). Through this operation we obtain a precomputed $X'^t_{i+\ell}$ value. Recall that $\ell$ represents the number of neighbors used in each axis direction. Eq. (2) shows, in mathematical terms, a summary of the forward operations performed over the $X^t_{i+\ell}$ point in a one-dimensional (1D) problem,

$$X'^{t}_{i+\ell} = C_{\ell} \times X^{t-1}_{i} + C_{\ell-1} \times X^{t-1}_{i+1} + \cdots + C_{1} \times X^{t-1}_{i+\ell-1}, \quad (2)$$

where the prime character ($'$) denotes that the point has been partially computed and some contributions are still missing. In addition, only neighbor points of $X^{t-1}$ have been loaded and one $X^t$ point, $X'^t_{i+\ell}$, stored so far. In a second phase, called backward, the $X'^t_i$ point precomputed during the forward update is completed by adding in the self-contribution and the front axis contributions ($X^{t-1}_{i+1} \ldots X^{t-1}_{i+\ell}$):

$$X^{t}_{i} = X'^{t}_{i} + C_{0} \times X^{t-1}_{i} + C_{1} \times X^{t-1}_{i+1} + \cdots + C_{\ell} \times X^{t-1}_{i+\ell}. \quad (3)$$

Only two loads are required in this process to complete the backward computation, since most of the $t-1$ time-step points were loaded during the forward update of the $X^t_{i+\ell}$ computation and hence can be reused. The two loads required for this second phase are the $X^{t-1}_{i+\ell}$ point and the precomputed $X'^t_i$ value. This phase also needs one additional store to write the final $X^t_i$ value. Finally, in order to carry out both updates, $2 \times \ell + 1$ floating-point operations are issued in the inner loop ($2 \times \ell$ Multiply-Add instructions for the neighbor contributions and one multiplication for the self-contribution). Moreover, in particular scenarios the computation can be reused: if $\ell$ is a multiple of two, the $C_{\ell/2} \times X^{t-1}_{i+\ell/2}$ operation may be used by both the forward and backward updates when performing the $X^t_i$ and $X^t_{i+\ell}$ computations. These scenarios may lead to a slight reduction of the inner loop instructions and a further improvement in the execution time.
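The following is a minimal 1D sketch of the steady-state body of this scheme, assuming symmetric coefficients stored as C[0] = C_1, ..., C[l-1] = C_l (names are ours). Each iteration performs the forward update of Eq. (2) for point i+l and the backward update of Eq. (3) for point i; the head and tail boundary cases are deferred to Section 4.3.

/* 1D Semi-stencil body (Eqs. (2) and (3)). Assumes the X't values for the
 * first l interior points were already precomputed (head section). */
void semi_stencil_1d_body(double *Xt, const double *Xtm1,
                          const double *C, double C0, int l,
                          int start, int end)
{
    for (int i = start; i <= end; i++) {
        /* forward: precompute X't[i+l] with its rear contributions (Eq. 2) */
        double fwd = 0.0;
        for (int m = 0; m < l; m++)
            fwd += C[l-1-m] * Xtm1[i + m];   /* C_l*X[i] + ... + C_1*X[i+l-1] */
        Xt[i + l] = fwd;

        /* backward: complete X't[i] with self and front contributions (Eq. 3) */
        double bwd = Xt[i] + C0 * Xtm1[i];
        for (int r = 1; r <= l; r++)
            bwd += C[r-1] * Xtm1[i + r];     /* C_1*X[i+1] + ... + C_l*X[i+l] */
        Xt[i] = bwd;
    }
}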
Floating-Point Operations to Data Cache Access Ratio
For the purpose of comparing the classical and the Semi-stencil algorithms, we calculate the FP/Cache ratio for the latter algorithm as follows,

$$\mathrm{FP/Cache}_{Semi} = \frac{\text{Floating-Point Operations}}{X^{t-1}\,\text{Loads} + X^{t}\,\text{Loads} + X^{t}\,\text{Stores}} = \frac{4 \times dim \times \ell + 1}{(dim-1) \times \ell + 2 \times dim}, \quad (4)$$
where the total number of floating-point operations remains constant with respect to the classical stencil implementation (see Eq. (1)). Eq. (4) also shows that the $X^{t-1}$ loads issued to the L1 cache have decreased substantially, by almost a factor of 2. Nevertheless, the $X^t$ stores have increased and a new term ($X^t$ loads) has appeared in the equation due to the partial computations applied to the X and Y axes. Reducing the number of loads results in fewer cycles to compute the inner loop, as well as lower register pressure. The lower register use gives the compiler the opportunity to perform more aggressive low-level optimizations, such as loop unrolling, software pipelining, and software prefetching. However, as shown in Figure 5, the Semi-stencil algorithm updates two points per iteration, leading to double the number of $X^t$ stores. Depending on the architecture, this could result in a loss of performance. For instance, in caches with a write-allocate policy, a store miss produces a block-load action followed by a write-hit, which produces cache pollution and a pipeline stall. Nowadays, some cache hierarchy architectures implement cache-bypass techniques to address this issue.
Reviewing Eq. (4), we see that the new computation structure of Semi-stencil allows the FP/Cache ratio to increase to ≈5 flops per d-cache access for 3D problems. For $\ell$ in the range of 4 to 14, the FP/Cache ratio is a factor of between 1.3 and 1.7 times better than the classical stencil implementation, which has a clear effect on performance.
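As a quick sanity check of these figures, evaluating the forms of Eqs. (1) and (4) given above for a 3D, 25-point stencil ($dim = 3$, $\ell = 4$) yields

$$\mathrm{FP/Cache}_{Classical} = \frac{49}{18} \approx 2.72, \qquad \mathrm{FP/Cache}_{Semi} = \frac{49}{14} = 3.5,$$

an improvement factor of ≈1.29, at the lower end of the quoted 1.3-1.7 range; at $\ell = 14$ the same equations give $169/58 \approx 2.91$ versus $169/34 \approx 4.97$, that is, a factor of ≈1.7.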
Head, Body and Tail Computations
FD methods require what are called interior points (inside the solution domain) and ghost points (outside the solution domain). The new algorithm structure of Semi-stencil takes this feature into account to preserve numerical soundness.
To obtain consistent results with the two-phase update on border interior points, the algorithm must be split into three different sections: head, body, and tail. The head section updates the first interior points with the rear contributions (forward phase). In the body section, the interior points are updated with back-and-forth neighbor interior elements (both forward and backward phases are carried out). Finally, in the tail section, the last interior points of the axis are updated with the front contributions (backward phase). Figure 6 shows a 1D execution example of Semi-stencil where the three sections are clearly depicted.
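Putting the three sections together on one axis, a hypothetical driver (reusing semi_stencil_1d_body from Section 4.1) looks as follows. Interior points span [s, e], and the ghost regions are assumed to supply the out-of-domain neighbors; bounds and names are illustrative.

/* Head/body/tail sketch for one axis of the Semi-stencil algorithm. */
void semi_stencil_1d(double *Xt, const double *Xtm1,
                     const double *C, double C0, int l, int s, int e)
{
    /* head: forward-only update of the first l interior points,
     * accumulating their rear (lower-index) contributions */
    for (int p = s; p < s + l; p++) {
        double fwd = 0.0;
        for (int d = 1; d <= l; d++)
            fwd += C[d-1] * Xtm1[p - d];
        Xt[p] = fwd;
    }
    /* body: paired forward (point i+l) and backward (point i) phases */
    semi_stencil_1d_body(Xt, Xtm1, C, C0, l, s, e - l);
    /* tail: backward-only completion of the last l interior points */
    for (int p = e - l + 1; p <= e; p++) {
        double bwd = Xt[p] + C0 * Xtm1[p];
        for (int r = 1; r <= l; r++)
            bwd += C[r-1] * Xtm1[p + r];
        Xt[p] = bwd;
    }
}

Every interior point thus receives exactly one forward update (head or body) and one backward completion (body or tail), which preserves the numerical result of the classical operator.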
Orthogonal Algorithm
As stated in Section 3, the state-of-the-art in stencil computations has improved in recent years with the publication of several optimization techniques. Some of these can significantly increase the execution performance under specific circumstances. However, most of them cannot be combined due to traversing strategy incompatibilities. In other words, some state-of-the-art techniques share a particular data traversal method when the stencil operator is computed, for example, space-blocking or space-time-blocking based. Therefore, not all combinations of these techniques are feasible.
In contrast to the previous group of optimization algorithms, Semi-stencil specifies the manner in which the stencil operator is calculated and how data is accessed in the inner loop (see Algorithm 2). This kind of algorithm, which we term a structural-based optimization algorithm, is orthogonal and can easily be implemented without modifying the sweeping order of the computational domain. Thus, the Semi-stencil structure is complementary to traversing optimization algorithms such as Rivera, time skewing, or cache oblivious. At the same time, Semi-stencil can be combined with low-level pipeline optimizations such as loop-level transformations, software prefetching, or software pipelining.
Note that Semi-stencil can be applied to any axis of an n-dimensional stencil problem, from the unit-stride to the least-stride dimension. Our recent studies on 3D stencil problems have shown that the Semi-stencil algorithm is most suitable for the least-stride dimensions of 3D problems (X and Y in our Cartesian axes). There are four main reasons for this behavior. First, most modern compilers can take advantage of unit-stride accesses by reusing previous loads and fetching one value per iteration in a steady state. Second, the programmer may easily add pipeline optimizations that would improve performance for the unit-stride dimension in the same way that Semi-stencil does. Third, current architectures usually sport hardware prefetchers which especially help to reduce unit-stride access latency. And fourth, our memory access model for stencil computations (see the summary in Table II) has shown that a full-axis Semi-stencil implementation has a slightly higher penalty in $X^t$ loads and $X^t$ stores than a partial Semi-stencil implementation. Algorithms 4, 5, and 6 show the pseudocode implementation of the Rivera, time-skewing, and cache-oblivious algorithms, respectively, combined with a partial Semi-stencil implementation (detailed in Algorithms 2 and 3). All these implementations are freely available in the microbenchmark that accompanies the software bundle.

[Figure: structure of each stencil variant; dark boxes and numbers represent the updated points and their computation order, respectively. Data reuse for load operations is assumed for the unit-stride dimension (Z) through compiler optimizations.]
PERFORMANCE MODELING
The performance characterization of a kernel code is difficult but it may help us to forecast the maximum expected performance for a given architecture. Nevertheless, the modeling approach has a drawback: it relies heavily on the ability to capture an algorithm's behavior in an accurate fashion, which is not simple to achieve. Thus, we must focus on effectively modeling how a kernel performs and behaves on different platforms.
For stencil computations, a number of suitable performance models for finite-difference problems may be found in the literature. The next sections provide details of Roofline, a generic performance model for any scientific kernel, along with a stencil-computation-specific model for multilevel cache architectures. In this regard, we also acknowledge the works of Treibig and Hager [2009] and Treibig et al. [2010a]. Although these models are not covered in this work, they devised a simplified approach to understanding the performance of bandwidth-limited loop kernels which might predict stencil computation behavior.
The Roofline Model
An easy-to-understand and visual performance model is proposed through the Roofline model for floating-point computations, relating processor performance to off-chip memory traffic. To achieve such a goal, the term Operational Intensity (OI) is defined as the operations carried out per byte of memory traffic (Flops/Byte). The total bytes accessed are those that hit the main memory after being filtered by the cache hierarchy. Therefore, OI measures traffic between the caches and memory rather than between the processor and the caches. Thus, OI quantifies the DRAM bandwidth needed by a kernel on a particular architecture.
As shown in Figure 7, floating-point performance, OI, and memory performance can be tied together in a two-dimensional graph with this model. The horizontal line shows the peak floating-point performance of the given architecture; no kernel can exceed this since it is a hardware limit. The X axis depicts GFlops/byte, whereas the Y axis depicts GFlops/second. The 45° part of the graph represents the bytes-per-second metric, the ratio of (GFlops/second)/(GFlops/byte). As a consequence, a second line can be plotted, which gives the highest floating-point performance that the memory system can achieve for a given OI. Through these two lines the model is complete, and the performance limit can be obtained with the following formula.
Attainable GFlops/second = MIN(Peak Floating-Point Performance, Peak Memory Bandwidth × Operational Intensity)
The point where the two lines intersect is the peak computational performance and peak memory bandwidth. It is important to bear in mind that the Roofline limits are fixed for each architecture and do not depend on the kernel being characterized.
A point on the X axis can be found for any scientific kernel through its OI. Drawing a vertical line through that point, the maximum attainable performance for the tested architecture can be estimated from the intersection of this line and the roofline.
The Roofline sets an upper bound on a kernel's performance depending on its OI. If the vertical projection of the line intersects the flat part of the roof, the kernel performance is compute-bound, whereas it is memory-bound if it intersects the sloped part of the roof. In addition, the model shows what the authors call the ridge point, where the horizontal and the 45° rooflines meet. The x-coordinate of this point provides an estimate of the minimum OI required by a kernel to achieve the maximum performance on an architecture. A ridge point shifted to the right implies that only kernels with a high OI can achieve the maximum performance. On the other hand, a ridge point shifted to the left on the axis means that most kernels can potentially obtain the maximum performance. Hence, the ridge point is related to the difficulty for programmers and compilers of achieving peak performance on a given architecture.
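The formula above translates directly into code. The following snippet evaluates the bound for a hypothetical core; the peak and bandwidth numbers are made up for illustration and are not measurements from Table V.

#include <stdio.h>

/* Roofline bound: attainable GFlops/s = min(peak, bandwidth * OI). */
static double roofline(double peak_gflops, double bw_gbytes_s, double oi)
{
    double mem_bound = bw_gbytes_s * oi;
    return mem_bound < peak_gflops ? mem_bound : peak_gflops;
}

int main(void)
{
    /* hypothetical core: 10.6 GFlops/s peak, 7.0 GB/s stream bandwidth */
    for (double oi = 0.25; oi <= 4.0; oi *= 2.0)
        printf("OI %.2f Flops/byte -> %.2f GFlops/s attainable\n",
               oi, roofline(10.6, 7.0, oi));
    return 0;
}

For this hypothetical core the ridge point sits at 10.6/7.0 ≈ 1.5 Flops per byte: kernels below it are memory-bound, kernels above it compute-bound.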
Multilevel Cache Model
In order to properly predict the behavior of stencil computations in mainstream architectures, a flexible and accurate model for a wide range of stencil computations has been proposed [de la Cruz and Araya-Polo 2011]. In this work, the authors claim that the multilevel cache hierarchy has to be considered to forecast the behavior of stencil computations.
Only memory transactions between the CPU and memory subsystem are considered, and computation bottlenecks are assumed negligible due to the memory-bound aspect of stencil computation [Datta et al. 2009; de la Cruz et al. 2009] . Finally, interference between data and instructions at cache level is not taken into account.
The model is composed of a base model, a read and write miss policy, and additional special features such as a hardware prefetching predictor mechanism. Let the dataset dimensions be I × J × K, where J and K are the least-stride dimensions excluding ghost points. A 3D stencil computation of order $\ell$ requires $2 \times \ell + 1$ Z-X input planes in order to compute one Z-X output plane [de la Cruz et al. 2009]. The size of these planes (in words) differs depending on the operation being performed (read or write), and therefore alters the memory requirements to compute one k iteration of the sweep (outermost loop of Algorithm 1). Let P be the number of planes and S the plane size in words; the following terms for read and write operations are obtained:

$$Words = P_{read} \times S_{read} + P_{write} \times S_{write} \quad (5)$$

[Algorithm 6: pseudocode for the cache-oblivious implementation of the Semi-stencil algorithm, where the trapezoid to compute is defined by τ(t0, t1, z0, dz0, z1, dz1, x0, dx0, x1, dx1, y0, dy0, y1, dy1) and the recursion is driven by a CUTOFF volume threshold with space cuts in Y and X and a time cut.]

The stencil computation involves back-and-forth dataset transfers, which may produce conflicts depending on cache policies (write-through or write-back). For instance, during a write operation with write-back policy, a cache line allocation is produced (write allocate), which shares memory resources among input data and leads to cache pollution. Given an architecture with a memory hierarchy of n cache levels, it can be assumed that the total time ($T_{total}$) required to compute a stencil is based on the following formula,

$$T_{total} = \sum_{i=1}^{n} T_{Li} + T_{Memory}, \quad (6)$$
where $T_{Li}$ and $T_{Memory}$ are the times to access data in the Li cache level and in main memory, respectively. In order to estimate $T_{Li}$ and $T_{Memory}$, the number of cache line hits ($Hits_{Li}$) and cache line misses ($Misses_{Li}$) must be estimated. Depending on the level, the way to compute hits and misses differs due to hardware policies. To deal with this aspect, three memory hierarchy groups are established to calculate hits and misses: first (L1), intermediate (from L2 to Ln), and last (Memory). When the CPU issues a word Load instruction, the data, if present, is brought from the closest cache level (L1); otherwise, a miss is flagged and passed to the next level of the hierarchy, which will finally fetch an entire cache line containing the data. The remaining levels also follow this mechanism. Several values are required in the internal loop to compute a single point of the domain: grid points ($X^{t-1}$), weights ($C_{Z,X,Y,0}$), and indices ($i, j, k$). In the register bank, grid points and weights are kept in Floating-Point Registers (FPR), while indices use General-Purpose Registers (GPR). Leaving aside optimizations, grid points must be fetched in every sweep of the loop, whereas weights and indices might be partially reused. However, note that depending on the order of the stencil ($\ell$), the problem dimension ($dim$), and the available registers ($FPR_{free}$ and $GPR_{free}$), data reuse becomes burdensome, leading to register spilling and increased CPU-L1 traffic. To obtain such information for the first level, the register pressure ($Reg_{grid}$, $Reg_{weight}$ and $Reg_{index}$) required to compute the whole domain (I × J × K) is estimated by considering the available registers.
The number of hits ($Hits^{word}_{L1}$) is then estimated as the difference between the loads issued by the CPU and the misses triggered at the first level of the hierarchy, $Hits^{word}_{L1} = Loads^{word}_{CPU} - Misses^{word}_{L1}$. Let word be the data granularity and $Bw^{read}_{L1}$ the L1 read bandwidth; the L1 cost ($T_{L1}$) is calculated as the cost of transferring a word from L1 to the CPU ($T^{word}_{L1}$) times the number of hits in L1, $T_{L1} = T^{word}_{L1} \times Hits^{word}_{L1}$. Similarly, the cost of the last level ($T_{Memory}$) is obtained from the number of cache line misses at the last cache level times the cost ($T^{cacheline}_{Memory}$) of accessing a cache line in main memory.
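Once the per-level hit/miss counts and unit transfer costs are known, Eq. (6) can be evaluated mechanically. The sketch below shows this accumulation; all inputs are hypothetical placeholders, not values produced by the model itself.

/* Illustrative accumulation of Eq. (6): T_total = sum_i T_Li + T_Memory.
 * Each T_Li is (accesses served at level i) x (unit transfer cost). */
typedef struct {
    double hits;       /* accesses served at this level (words or lines) */
    double unit_cost;  /* cost per served access (cycles or seconds)     */
} cache_level_t;

double t_total(const cache_level_t *levels, int n_levels,
               double mem_misses, double mem_line_cost)
{
    double t = 0.0;
    for (int i = 0; i < n_levels; i++)
        t += levels[i].hits * levels[i].unit_cost;   /* T_L1 ... T_Ln */
    return t + mem_misses * mem_line_cost;           /* T_Memory      */
}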
Read Misses.
The model covers four cases of miss estimation (see Figure 8), which depend on the three distinct types of cache misses (compulsory, capacity, and conflict misses) [Temam et al. 1994]. Ordered from lowest to highest penalty, the four cases proposed by the model are as follows.
(1) In the best scenario, all the planes required by the sweep fit in the Li cache, and only compulsory misses occur; each k iteration needs to read only the planes not already resident, giving the lowest nplanes_Li value for Eq. (9).
(2) The second scenario occurs when all the required planes do not fit in Li, leading to capacity misses. However, the k-central plane, with a higher temporal reuse than the rest (1 versus $2 \times \ell + 1$), is probably not replaced from the cache by conflict misses. Hence, it might be partially reused in following iterations, giving nplanes_Li = P_read − 1 for Eq. (9).
(3) In the third scenario, it is also assumed that not all planes fit in Li. Nevertheless, in this case the k-central plane overwhelms a significant part of the Li cache (i.e., half the capacity), reducing the possibility of temporal reuse. Therefore, nplanes_Li = P_read planes must be read when traversing the Y axis of the stencil loop.
(4) Finally, in the worst scenario (upper bound), neither the planes nor the columns of the k-central plane fit in the cache. Due to capacity and conflict misses, all planes must be read at every k iteration (P_read), whereas the columns of the k-central plane must be read again at each j iteration (P_read − 1). This scenario gives nplanes_Li = 2 × P_read − 1.
The hardware prefetching mechanism is modeled in two steps. First, the number of planes that can be streamed at level i is bounded by pref_Li, the number of stream channels supported by that hierarchy level (Eq. (10)). Second, prefetched and nonprefetched cache line misses within planes are computed using Eq. (9) and its specific nplanes_Li value. Finally, T_Li is computed by adding the contribution of the two types of planes, using their respective hit ratios and transfer costs, to the base model rule for Li. In order to compute the specific bandwidths for streamed and nonstreamed planes, a modified version of the STREAM2 [de la Cruz and Araya-Polo 2011] benchmark is used.
EXPERIMENTS
In this section, experimental results of the Semi-stencil algorithm are presented. First, we introduce the Stencil Probe, an existing parameterized microbenchmark developed to test stencil codes. Second, the platforms used to conduct the experiments are described, and the tools employed to obtain the profiling information are introduced. Then, we prove that our FP/Cache model is consistent for classical and Semi-stencil codes. Finally, the performance results are shown and evaluated for each testbed architecture, including some multicore scalability results.
Stencil Microbenchmark
In order to evaluate the performance of our implementations, the Stencil Probe [Kamil et al. 2005] , a compact and self-contained serial microbenchmark, has been used. This microbenchmark was developed to explore the behavior of 3D stencil-based computations without having to modify any application code. Further, the Stencil Probe has been extended with more features to evaluate the new algorithms.
We made four changes to the microbenchmark. First, two Semi-stencil algorithms have been introduced: a full-axis (Z, X and Y) and a partial-axis (X and Y) version. Second, as well as the original stencil shape, four new cross-shaped stencils have been included for each algorithmic implementation. The currently available stencil sizes range from low to high order, $\ell = \{1, 2, 4, 7, 14\}$. Third, some straightforward pipeline optimizations related to the inner loop of the stencil computation have been coded. These optimizations include loop fusion and loop fission in two and three segments. Finally, the Stencil Probe microbenchmark has been extended to include the parallel implementation of each algorithm in order to analyze their multicore scalability; the shared-memory OpenMP API is used as the parallel programming paradigm. Some parameter combinations depend on the architecture and the algorithm; for example, block sizes are only used for Rivera and time skewing, whereas CUTOFF is only used for the cache-oblivious algorithm. Additionally, due to memory constraints on some platforms, the problem size may be reduced to $256^3$. It is worth noting that, with the new benchmark features, the parameter space to be explored becomes quite large. In order to bound the search area for the experiments, we set some parameters either as constants or within a reasonable range of values (e.g., problem size and blocking parameters). Table III shows a summary of all the parameters used in the microbenchmark.
Testbed Architectures
The following leading platforms, available through the PRACE project, were used to carry out our study for serial runs (detailed platform characteristics are presented in Table IV).
-Intel Nehalem. The Juropa supercomputer, hosted at Jülich Research Center (Germany), sports 4416 Intel Xeon Nehalem-EP processors.
-IBM POWER6. The SARA supercomputing center (The Netherlands) hosts Huygens, an IBM pSeries 575 clustered SMP system. The system consists of 104 nodes, each with 16 dual-core POWER6 processors.
-IBM Blue Gene/P. The Jugene supercomputer, hosted at Jülich Research Center, is based on the Blue Gene/P processor. Jugene reaches 1 PFlops of peak performance with 294912 PowerPC 450 cores.
-AMD Opteron. Louhi is a Cray XT4/XT5 supercomputer with 1012 XT4 and 852 XT5 compute nodes, hosted at the IT Center for Science in Finland.
Two sets of data have been collected for evaluation on these platforms: execution times and hardware counters. The former allows us to compare the performance of our implementations with the classical versions, while the latter gives an accurate profiling picture of the underlying events, such as CPU pipeline resource usage and cache hierarchy transfers. Some of the counters measured are floating-point operations; total cycles and instructions; load and store instructions issued; and cache misses, hits, and accesses for the L1, L2, and L3 levels. These counters permit the measurement of some interesting metrics such as GFlops, the FP/Cache ratio, and actual memory traffic, among others. The results of these metrics are presented along with the FP/Cache models in Section 6.3. Section 6.5 shows a comparison of the performance results.
In order to gather the hardware counter profiling, the following tools and frameworks have been deployed in the testbed systems.
-PapiEx (PAPI). PapiEx is a performance analysis tool designed to measure, transparently and passively, the hardware performance counters of an application using the PAPI [Mucci et al. 1999] framework.
-LikwidPerfCtr (Likwid). LikwidPerfCtr [Treibig et al. 2010b] is a lightweight command-line tool to measure hardware performance counters on Intel and AMD processors using the Linux msr module.
-hpmcount/libhpm (HPCT/HPM). hpmcount and libhpm provide hardware metric and resource utilization statistics after application execution. They are developed by IBM to support POWER-based systems.
The information gathered through these tools is used as reference data in the validation and evaluation process. However, some steps are necessary to ensure reliable and accurate data. First, tests must be executed a number of times in order to minimize collateral effects from the OS or other artifacts. Second, cache hierarchy must be cleared during each test repetition to avoid cache line hits due to hot cache effects. And finally, to prevent the pollution of load and store metrics in high-order stencils due to register spilling, the same value is used for all coefficient terms (C Z,X,Y ).
Data Cache Accesses
As Eqs. (1) and (4) convey, the proposed algorithm performs increasingly well in FP/Cache and data access terms with respect to the classical approach. Figure 9 (top) plots the three models: the classical and the two Semi-stencil versions (partial axis X-Y and full axis Z-X-Y). We see that the classical and the Semi-stencil models tend to factors of ≈3 and ≈5 flops per d-cache access, respectively. To evaluate the robustness of our results, we proceed to compare the model with real performance data. Unfortunately, when the measured data and the projected FP/Cache model are compared, we note a slight difference. The main reason for this is interference from other loads and stores not related to the $X^t$ and $X^{t-1}$ datasets; coefficient terms, loop control variables, and register spilling are all involved in this cache traffic increase. The larger the stencil length ($\ell$), the worse the traffic becomes.
In order to achieve an accurate comparison, a new metric must be defined. Taking into account that this issue affects any stencil computation and that the artifact should remain constant for a specific stencil size ($\ell$), it is sensible to compare data cache accesses between different implementations of the same stencil length. Therefore, results are compared based on the accesses in terms of gain or loss, using a reduction factor between FP/Cache metrics. This new metric is defined as $AccessesGain = (Accesses_{Classical} - Accesses_{Semi}) / Accesses_{Classical}$, and is measured as a percentage. A negative value signifies higher data cache traffic, while a positive percentage signals lower traffic, and hence better performance.
Figure 9 (bottom) shows the comparison of data cache accesses for the best-performance cases using the Semi-stencil and classical implementations. Examining the figure, we note that both data groups, the model and the real performance data, are quite close, and the experiments closely follow the reduction model curve. In addition, the Semi-stencil algorithm performs poorly for low-order stencils ($\ell = \{1, 2\}$), where the negative reduction factor represents an increase in traffic. This result is also in line with the theoretical FP/Cache ratio models presented in Figure 9 (top), where both Semi-stencil implementations perform worse than the classical code for $\ell \le 2$. Nevertheless, as expected, the Semi-stencil algorithm behaves better for medium- and high-order stencils due to the decrease in data cache traffic.
Operational Intensity
In order to demonstrate the robustness of the Semi-stencil algorithm, we applied the Roofline model on the Intel Nehalem and AMD Opteron architectures, where memory traffic counters (DRAM bytes accessed) were gathered using LikwidPerfCtr. We can then use the Roofline model to assess how far our empirical results are from the attainable peak performance.
To build the roof part of the model, the DRAM bandwidth and the theoretical peak performance were measured. The DRAM bandwidth measurements were conducted using a modified version of STREAM2 [de la Cruz and Araya-Polo 2011] to estimate stream and nonstream bandwidths. These bandwidths determine how far to the left or right the ridge point lies on the X axis. To obtain the Y-axis ceilings (theoretical peak performance), the processor frequency and the floating-point units of a single core were considered. The AMD Opteron and Intel Nehalem cores each have two floating-point units, one for multiplication and one for addition. Hence, applications must be balanced between multiplication and add instructions in order to reach the maximum peak performance. Table V shows the gathered ceiling data for both architectures.
To gain a better understanding of the Roofline model, the theoretical Operational Intensity (OI) for each stencil algorithm and size was also estimated. As described in Section 5.2, three OI groups are devised depending on the stencil's 3C's misses: only compulsory misses; compulsory + capacity misses; and compulsory + capacity + conflict misses. Compulsory misses set the upper bound, while compulsory + capacity + conflict misses set the lower bound on the X axis of the Roofline model. Table VI shows the estimated OIs using write-back and write-through policies.
Furthermore, we calculated the actual computational peak performance to provide a realistic bound on the Roofline ceilings. Such information was obtained by running the Stencil Probe microbenchmark several times using a small dataset as input ($32^3$), but without clearing the cache (warm-cache effect). Then, the fastest execution time was selected as the reference for each stencil algorithm and size. Using this technique, the optimal computational performance was obtained disregarding DRAM accesses and considering only traffic between the processor and the caches. Figure 10 combines all the gathered data in the Roofline model for the Intel Nehalem and AMD Opteron architectures to predict the attainable performance. The graphs are on a log2-log2 scale, where the Y axis is the attainable GFlops per second and the X axis is Flops per DRAM byte accessed. Memory bandwidths (stream and nonstream) and computational peaks (mul/add instruction balanced and imbalanced) determine the optimization regions within the graphs. Remember that the memory measurement is the steady-state bandwidth potential of the memory for only one core, and not the pin bandwidth of the DRAM chips. The model also suggests, through three regions, which optimizations would be appropriate. The darkest trapezoid suggests working mainly on computational optimizations, the lightest parallelogram suggests trying just memory optimizations, and the region in the corner suggests both types of optimizations. The vertical lines on the graphs show the bounds of the 3C's regions obtained for a 25-point stencil computation, and hence their theoretical OI limits. Additionally, the remaining horizontal lines depict the actual peak performance for each stencil size. Finally, the crosses and circles on the graphs mark the performance achieved by the naive and naive+Semi-stencil runs, respectively. Numbers shown next to the marks represent their stencil size.
The peak double-precision performance of the Intel Nehalem is the higher of the two architectures. However, Figure 10 (top) shows that this performance can only be achieved with an OI greater than 0.71 Flops per byte in the best scenario (stream bandwidth) and 1.72 Flops per byte in the worst scenario (nonstream bandwidth). Reviewing the achieved performances, we note that they all lie in the bottom part of the memory bandwidth region (the 45° nonstream line). This behavior may be a consequence of a front-side bus limitation and the ineffective work of the snoop filter, which carries coherency traffic and may consume half of the bus bandwidth. Despite the memory bandwidth limitation, the 25- and 43-point stencil computations of naive+Semi-stencil get close to the actual computational peak thanks to their higher OI.
Figure 10 (bottom) shows the Roofline model for the AMD Opteron architecture. In this architecture the ridge point of the model is at an OI of 1 Flop per byte for stream bandwidth and at 3.28 Flops per byte for nonstream bandwidth. The model clearly shows that the 7-and 13-point results of naive+Semi-stencil are already limited by peak memory and their performance could only be improved by increasing OI. On the other hand, naive results fall into the lightest region, indicating not so efficient memory access. Notice that if a vertical line is projected from each naive result, these lines meet the actual stencil peak performance approximately at the point that the diagonal roof part is reached. This behavior confirms the accuracy of the gathered data for the Roofline model. In addition, large stencil-size computations (25 and 43 point) of Semi-stencil lead the GFlops/second results despite only showing slightly better OI ratios with respect to their naive competitors. This metric gap is because OI depends on traffic between the caches and memory, whereas the Semi-stencil algorithm mostly reduces traffic between the processor and caches (see Eq. (4)).
Performance Evaluation and Analysis
This work not only claims that the Semi-stencil approach improves stencil computation performance, but also that it can be combined with other stencil optimization techniques to achieve even higher performance. Due to the sheer number of possible experiments (see Table III), the most relevant results across all testbed platforms are highlighted and analyzed in this section. All computations have been carried out in double precision, since this is what most scientific codes use. In order to support our claims, the results of our experiments are shown in four arrangements. First, the execution times of the classical and Semi-stencil versions are compared as the stencil size ($\ell$) and the problem size are varied. Second, hardware counter profilings are given and discussed for each algorithm. Third, a general view of algorithm speedups is outlined, and finally, performance results are grouped by varying blocking parameters (TI × TJ × TK). Figure 11 (top) clearly shows how efficiently the Semi-stencil algorithm performs as $\ell$ and the problem size ($N^3$) are varied. Here, all parameters have been fixed except the stencil length. Each pair of classical and Semi-stencil runs was conducted for a particular problem size, ranging from unusually small sizes ($128^3$ and $256^3$) to realistic cases ($512^3$, $768^3$ and $1024^3$). In order to obtain a fair and clear comparison, the reduction in time with respect to the classical algorithm is presented in Figure 11 (bottom).
As expected, the Semi-stencil versions perform worse for low-order stencil computations ($\ell \le 2$), where the elapsed time is slightly higher compared to the classical implementation. However, for medium- and high-order stencils ($\ell \ge 4$), our proposed algorithm performs extremely well. Nevertheless, as the problem size grows, the elapsed time difference becomes negligible for $\ell = 2$.
Considering precision and performance requirements, the stencil computation depends on the number of neighbor points: a larger number of neighbors provides higher-order results at an increased computational cost. Given the previous statement and the results shown in Figure 11, the Semi-stencil algorithm is a valid option when high precision is required in large scientific problems. Figures 12 and 13 show a detailed collection of hardware counters obtained by profiling on all four platforms. The performance statistics shown are: memory traffic (GBytes), L2/L3 cache misses, GFlops, and FP/Cache ratios. Each matrix column represents an algorithm implementation and each matrix row a specific stencil length ($\ell$). Generally, most of these show better performance when $\ell \ge 2$ (13-point). In terms of the FP/Cache ratio and cache misses, our proposed algorithm performs increasingly well with respect to the classical algorithm.
The stacked bar graphs in Figure 14 show the speedup of each implementation and the execution times in seconds. Data is grouped by platform, time steps, and stencil length. In these graphs, the baseline for each platform and set of parameters is the naive code using the classical algorithm (where the speedup is 1). We see that space- and time-blocking techniques enhance some of the performance results, ranging from 1.05 to 1.3×. However, using the same enhancements with Semi-stencil improves the performance even further, yielding aggregate speedups of up to 1.6× in some cases. The Blue Gene/P outperforms the other platforms in terms of speedup, especially for the Semi-stencil algorithm.
In order to show how important the blocking parameter is for space- and time-blocking algorithms, Figure 15 collects all the execution runs sorted by blocking sizes. Results are shown for each platform, varying the blocking parameters along the horizontal axis. As expected, the Rivera and time-skewing algorithms show performance variation when TI, TJ and TK are changed. In almost all cases, the best blocking performance is obtained when TI is left uncut [Kamil et al. 2005; Rivera and Tseng 2000]; the exception, strangely, is BG/P. This may be due to collateral cache line conflicts. The naive and cache-oblivious algorithms are plotted using their best configurations (tuned internal loop optimizations and CUTOFF parameter).
A summary of the performance results can be found in Table VII. This table shows the execution times and speedups for a stencil size of ℓ = 4. For three of the four platforms, the Semi-stencil versions are the best implementations. The speedups with respect to the naive code range from ≈1.20× on the Intel Nehalem to 1.60× on BG/P, which is the most favored architecture. On the Intel Nehalem, the classical cache-oblivious version slightly outperforms the Semi-stencil version; it is likely that the -fast option of the Intel compiler is quite aggressive and already optimizes stencil codes reasonably well. On the contrary, the AMD Opteron appears to have some kind of problem with time skewing, where the compiler (GCC v4.1 on this architecture) does not seem able to generate optimized binaries for this code. Finally, on both IBM platforms a substantial gain is observed. First, on BG/P a speedup of 1.64 is obtained, basically because Semi-stencil helps to mitigate cache line conflicts in the small L1/L2 caches (32kB and 1920B). Second, on the Power6 architecture, space- and time-blocking techniques do not have any substantial effect due to the large memory hierarchy (64kB, 4MB, and 32MB) and the sophisticated hardware prefetching system. However, Semi-stencil is able to improve the execution time by 25% due to the lower number of loads issued to the memory system.
SMP Performance
In this section, the parallel aspect of the algorithms is evaluated. The multiprocessing version of each stencil algorithm has been implemented using OpenMP pragmas. We have tested certain combinations of the algorithms on three different architectures: IBM POWER7, Intel Sandy Bridge, and the latest Xeon Phi (MIC). The parallel implementation of each stencil code follows a specific decomposition strategy. First, the naive parallelization cuts the least-stride dimension (NY) by the number of threads in order to preserve data locality; each thread computes a block of size NZ × NX × TK_SMP, where TK_SMP = NY/Threads. Second, in order not to disturb the performance of the Rivera base scheme, the serially tuned TJ block-size parameter is respected as far as possible when decomposing the problem: TJ_SMP = TJ if TJ × Threads ≤ NX, or TJ_SMP = NX/Threads otherwise; hence, each thread computes blocks of size NZ × TJ_SMP × NY. Finally, the time-blocking algorithms have been parallelized using thread synchronization to preserve the data dependencies between thread domain computations. Figure 16 shows the time-skewing [Kamil et al. 2006; Wonnacott 2000] and cache-oblivious [Frigo and Strumpen 2006] parallel strategies, as discussed in Section 3.2.
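As an illustration, the following is a minimal sketch of the naive decomposition just described; sweep_block is a hypothetical kernel callback, and the remainder handling for thread counts that do not divide NY is our own assumption.

```c
#include <omp.h>

/* Hypothetical per-block stencil kernel: sweeps the points
   [z0,z1) x [x0,x1) x [y0,y1) of the domain. */
void sweep_block(int z0, int z1, int x0, int x1, int y0, int y1);

/* Naive decomposition: cut the least-stride dimension NY across threads,
   so each thread owns a contiguous NZ x NX x TK_SMP slab. */
void naive_parallel_sweep(int NZ, int NX, int NY)
{
    #pragma omp parallel
    {
        int nth = omp_get_num_threads();
        int tid = omp_get_thread_num();
        int tk  = NY / nth;                         /* TK_SMP = NY/Threads */
        int y0  = tid * tk;
        int y1  = (tid == nth - 1) ? NY : y0 + tk;  /* last thread absorbs
                                                       any remainder       */
        sweep_block(0, NZ, 0, NX, y0, y1);
    }
}
```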
The IBM BladeCenter PS701 node is equipped with an 8-core 64-bit POWER7 processor running at 3.0 GHz. Each core contains four floating-point units able to perform fused multiply-add (FMA) instructions in double precision. These features give an impressive 192 GFlops of peak performance per socket. Additionally, a 32KB L1 data cache and a 256KB L2 cache are included on each core, and an on-chip 32MB L3 adaptive victim cache is shared among all eight cores. The POWER7 socket has eight DDR3-1066 RDIMM memory channels; each channel operates at 6.4 Gbps and can address up to 32GB of memory. The RDIMM capacity of the PS701 system examined in our study is 128GB (16 × 8GB).

Fig. 16. Thread decomposition strategies for time-blocking algorithms. Top: in time skewing, each thread block (size block) is divided into three parallelepipeds, executed in the following order at the same time by all threads: first the light gray (size base = 2·ℓ·(t − 1) + 1), second the white, and finally the dark gray area. Bottom: in cache oblivious, each thread block is divided into inverted and noninverted trapezoids (size l) and executed in this order.
IBM has vastly increased the parallel processing capabilities of POWER7 cores by enabling 4-way simultaneous multithreading (SMT). Therefore, up to 32 threads can run at the same time per chip, leading to a quasi-massively parallel processor. This high thread-level parallelism makes the POWER7 platform a very appealing target for the OpenMP versions of the Stencil Probe microbenchmark. The experiments were conducted using the IBM XL compiler with the -qsmp flag. Our PS701 node includes only one POWER7 chip; therefore, no NUMA-aware code is required, since all data is allocated on the only memory bank. Otherwise, the data should be initialized in parallel by all threads because of the first-touch page mapping policy; in this way, each dataset memory region would be pinned to its corresponding socket.

Table VIII shows the parallel results on the POWER7 architecture for a large problem (512 × 1024 × 1024), chosen to exploit the parallel capabilities of the chip (up to 32 threads). The table shows the execution time (total seconds, lower is better) and the scalability with respect to the 1-thread execution (in parentheses) for each algorithm, using the classical (top) and the Semi-stencil (bottom) implementations for varying numbers of OpenMP threads. Reviewing the results, we clearly observe that Semi-stencil enhances the global performance in almost all cases, except, partially, for cache oblivious. In some cases, performance is nearly doubled (1.68 and 1.83×) compared with the classical algorithm running the same number of threads and base algorithm (2 and 4 threads for the naive and time-skew implementations, respectively). Moreover, the Semi-stencil scalability results are mostly similar to those of the classical algorithm, except in the 8- and 16-thread runs, where a substantial degradation in scalability is observed.
Furthermore, the POWER7 architecture scales well considering that, despite supporting 32 threads, only 8 physical cores exist per chip: scalability reaches an impressive ≈16× for the naive case and ≈13× for the Semi-stencil case. This behavior shows that the 4-way SMT feature effectively maximizes core throughput.
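Although unnecessary on this single-socket node, the first-touch initialization mentioned above is easy to illustrate; the following is a minimal sketch under our own assumptions (the idx linearization macro and the array X are hypothetical):

```c
#include <stddef.h>

/* Hypothetical row-major linearization: y is the least-stride dimension. */
#define idx(z, x, y) ((size_t)(y) * NX * NZ + (size_t)(x) * NZ + (z))

/* First-touch placement: each thread initializes the slab it will later
   compute on, so the OS maps those pages to the thread's local memory
   bank. The schedule must match the one used by the compute sweep. */
void first_touch_init(double *X, int NZ, int NX, int NY)
{
    #pragma omp parallel for schedule(static)
    for (int y = 0; y < NY; y++)
        for (int x = 0; x < NX; x++)
            for (int z = 0; z < NZ; z++)
                X[idx(z, x, y)] = 0.0;
}
```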
This section concludes by looking at two Intel-based platforms: a representative of the Xeon family and an early-access system based on the Many Integrated Core (MIC) Knights Corner. The Xeon system is based on an E5-2670 Sandy Bridge with two sockets, where each socket holds 8 cores running at a clock speed of 2.6 GHz. The memory hierarchy has four levels: a 32KB (I) + 32KB (D) L1 per core, a 256KB L2 per core, a 20MB L3 shared among the cores within each socket, and a main memory of 32GB.
For our experiments on the Sandy Bridge, the hyperthreading capability was turned off. Additionally, only one of the sockets was used, since the microbenchmark does not take advantage of the NUMA configuration (memory banks associated with sockets). Otherwise, the testing on the Intel architectures followed the parallelism guidelines stated earlier. Regarding the software stack, the code was compiled and profiled with Intel tools; the only notable compiler options used were the -avx flag and the -DAVX macro, enabling the Advanced Vector Extensions (AVX) intrinsics.
The dataset sizes selected for the tests were 640³ on Sandy Bridge and 256 × 4096 × 256 on the Xeon MIC. These sizes are large enough to overwhelm the memory hierarchy while still providing a reasonable load balance among OpenMP threads. Table IX shows moderate scalability on Sandy Bridge (up to 6.27×) for Rivera, Rivera+Semi-stencil, oblivious, and oblivious+Semi-stencil. The naive version, however, yields the worst scalability; clearly, the current domain decomposition (cutting the least-stride dimension) is not the most appropriate approach for this case.
Also, on 8 cores the Semi-stencil algorithm reduces the execution time by only 3% in the worst case (Rivera) and by up to 18% in the best case (naive). Furthermore, the larger the number of threads, the smaller the advantage of the Semi-stencil strategy over the other algorithms. It is likely that the memory channels saturate early: once peak memory bandwidth is reached, linear scalability is no longer attainable for large thread counts.
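To make this saturation argument concrete, consider a back-of-envelope Roofline bound; this is our own illustration, assuming perfect cache reuse and per-point traffic of one 8-byte input load, one 8-byte result store, and one 8-byte write-allocate fill:

\[
P \;\le\; I \cdot B_{\mathrm{mem}},
\qquad
I \;\approx\; \frac{2(6\ell + 1)\ \text{flops}}{24\ \text{bytes}}
\;\approx\; 2\ \text{flops/byte} \quad (\ell = 4).
\]

Once the threads jointly sustain the socket memory bandwidth \(B_{\mathrm{mem}}\), adding more threads cannot raise the attainable performance \(P\), which is consistent with the flattening scalability observed here.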
The MIC Knights Corner beta 0 has 61 cores running at 1 GHz, each of them handling 4 threads. The memory hierarchy has three levels: a 32KB (I) + 32KB (D) L1 per core, a coherent 512KB L2 per core connected through a ring bus to the other L2 caches, and 8GB of main memory. The device is attached to a host system via a PCI Express bus. The MIC runtime environment offers new scheduling options for thread affinity, such as scatter, compact, and balanced. Tests have been carried out with all of them, keeping only the thread scheduling that offered the best performance (balanced in most cases). Also, MIC has two execution models: native (run from the MIC device) and offload (run from the host). The former is selected via the -mmic compiler flag and the -DMIC macro, which produce an executable targeted specifically at the MIC; the latter is based on specific language pragmas. Since all versions of the code were generated through macros, the native model was preferred: it was simpler to compile the benchmark with the MIC target flag than to change the OpenMP augmentation for a particular Intel pragma set.

Table X reports the MIC results; numbers represent the total time in seconds (scalability is shown in parentheses). Due to algorithmic constraints on time skewing (thread decomposition is performed on the least-stride dimension), those tests were conducted using a 256 × 256 × 4096 dataset to enable parallelization with a large number of threads.
It can be seen in Table X that the scalability of the Rivera+Semi-stencil algorithm reaches 93.8× in the 244-thread case, which corresponds to a 1.26× improvement over the plain Rivera algorithm for a 25-point stencil. Another noteworthy result is the combination of time skewing and Semi-stencil: this case offers the best execution time among all algorithms (0.426 seconds with 122 threads). However, it should be clarified that for this case the dataset size was rearranged and the time step was split in order to satisfy the parallelepipeds' size constraints (size base) for a decomposition with a large number of threads (see Figure 16 (top)). Summarizing, the Semi-stencil algorithm outperforms the classical implementation in all cases presented for the MIC platform except the oblivious algorithm, for reasons to be discussed in future work.
CONCLUSIONS AND FUTURE WORK
In this article a novel optimization technique for stencil-based computations is presented. This new algorithmic approach, called Semi-stencil, is especially well suited for scientific applications on both homogeneous and heterogeneous architectures and, in particular, for large stencil sizes.
The proposed technique is relevant because it deals with a number of well-known problems associated with stencil computations. The first bottleneck, caused by low data reuse, is the poor ratio of floating-point operations to data cache accesses (FP/Cache). Some techniques, like loop unrolling, software pipelining, or software prefetching, can help improve the performance of stencil codes by partially hiding dependency stalls. However, these techniques face a limiting factor, namely the number of registers available on the processor. The second identified problem is the memory access pattern. Common algorithms like Rivera or cache oblivious tackle this issue with partial success, reducing the overall transfer latency in cache hierarchy architectures. However, owing to its orthogonality, Semi-stencil outperforms these techniques on many architectures, whether implemented alone or combined with space- and time-blocking algorithms.
The contributions of Semi-stencil to these problems are the following (a minimal sketch illustrating both follows the list).
-Reducing the minimum working dataset, thus minimizing register pressure and the cache memory footprint. This effect becomes more pronounced for high-order stencils.
-Increasing spatial and temporal data locality, owing to a reduction in the number of loads at the cost of a slight increase in the number of stores. In homogeneous multicore architectures, the cache coherence system and the write-allocate policy (usually associated with write-back caches) may produce further memory traffic due to the extra store transactions. Despite the additional stores, the time gained from the load reduction outweighs the store penalty.
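The following 1D sketch is our own illustration of the principle, not the authors' full head/body/tail implementation; Y is assumed pre-zeroed, and only interior points ℓ ≤ i ≤ n − ℓ − 2 end up fully computed.

```c
/* Classical l-th order update: Y[i] = C[0]*X[i]
   + sum_{j=1..l} C[j]*(X[i-j] + X[i+j])  -> 2l+1 loads of X per output.
   Semi-stencil: each window of l loads feeds TWO partial updates,
   roughly halving the X loads per output at the cost of extra Y stores. */
void semi_stencil_1d(int n, int l, const double *C /* C[0..l] */,
                     const double *X, double *Y /* pre-zeroed */)
{
    for (int i = l; i < n; i++) {
        double back = 0.0, fwd = 0.0;
        for (int j = 1; j <= l; j++) {
            double v = X[i - j];          /* one shared load ...          */
            back += C[j] * v;             /* ... backward half of Y[i]    */
            fwd  += C[l + 1 - j] * v;     /* ... forward half of Y[i-l-1] */
        }
        Y[i] += back + C[0] * X[i];       /* center + backward half       */
        if (i - l - 1 >= 0)
            Y[i - l - 1] += fwd;          /* completes Y[i-l-1]           */
    }
}
```

Each input value is loaded once but contributes to two output points, which is precisely the source of the load reduction and of the extra store traffic noted above.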
The experimental results show that the best classical implementations of a common 25-point stencil are typically able to deliver up to 30% of peak performance. Under the same conditions, the Semi-stencil implementations achieve up to a 1.32× performance improvement. On multicore systems, the Semi-stencil algorithm has also shown excellent scalability on most of the analyzed platforms.
On the POWER7 architecture, scalability soars to 13.06× for the oblivious+Semi-stencil case when run with 32 threads. Furthermore, when only 4 threads are used, performance is substantially enhanced in most cases and the execution time almost halved with respect to the classical runs. We have also demonstrated the efficient parallel behavior of the Semi-stencil algorithm on novel architectures such as the Intel MIC, where scalability reaches 93.8× over 244 threads for certain test cases. Additionally, through the insight provided by the Roofline and multilevel-cache performance models, we have shown how the Semi-stencil algorithm behaves in terms of operational intensity and memory traffic.

Future work will focus on several aspects of this novel algorithm. First, the cost of each computational part (head, body, and tail) will be broken down with respect to the global performance. Second, the performance benefit of each forward and backward update will be analyzed in detail for each Semi-stencil axis. Third, on cache-based architectures with a write-allocate policy, a write miss may have a negative impact on stencil performance, due to the cache pollution produced by allocating a line for each store instruction issued. Recently, architectures with cache-bypass techniques have appeared, which minimize cache pollution when writing data to memory; our future efforts will also assess the performance of the Semi-stencil algorithm when combined with such bypassing. For instance, on the SSE/AVX technologies of x86 architectures, the vmovntpd/movntpd cache-bypass instructions can be employed for this purpose (illustrated in the sketch below). In contrast, on the Power ISA, the nontemporal dcbz/dclz instructions behave differently from the Intel ones: they only set the contents of a cache line to zero, without bringing it from main memory, eventually evicting the line. Nevertheless, the dcbtst instruction, which enables prefetching for stores, could partially compensate for the lack of a bypass instruction. This store-stream prefetching improves performance by requesting a cache block before it is actually needed by the program, so that later stores to the block do not suffer the additional delay of fetching it into the cache. Finally, the Semi-stencil strategy will also be evaluated on Lattice Boltzmann methods.
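As an illustration of the x86 cache-bypass path mentioned above, the following is a minimal sketch using AVX intrinsics (_mm256_stream_pd compiles to vmovntpd); the function and its structure are ours, not from the article, and dst and src are assumed 32-byte aligned.

```c
#include <immintrin.h>
#include <stddef.h>

/* Stream a result array to memory with non-temporal stores, avoiding the
   write-allocate cache pollution discussed above. */
void stream_out(double *dst, const double *src, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m256d v = _mm256_load_pd(&src[i]);  /* regular (temporal) load   */
        _mm256_stream_pd(&dst[i], v);         /* bypass the cache on store */
    }
    for (; i < n; i++)                        /* scalar remainder          */
        dst[i] = src[i];
    _mm_sfence();  /* order streaming stores before any later reads        */
}
```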
