Optimizing the performance of stencil algorithms has been the subject of intense research over the last two decades. Since many stencil schemes have low arithmetic intensity, most optimizations focus on increasing the temporal data access locality, thus reducing the data traffic through the main memory interface with the ultimate goal of decoupling from this bottleneck. There are, however, only a few approaches that explicitly leverage the shared cache feature of modern multicore chips. If every thread works on its private, separate cache block, the available cache space can become too small, and sufficient temporal locality may not be achieved. We propose a flexible multidimensional intratile parallelization method for stencil algorithms on multicore CPUs with a shared outer-level cache. This method leads to a significant reduction in the required cache space without adverse effects from hardware prefetching or TLB shortage. Our Girih framework includes an autotuner to select optimal parameter configurations on the target hardware. We conduct performance experiments on two contemporary Intel processors and compare with the state-of-the-art stencil frameworks Pluto and Pochoir, using four corner-case stencil schemes and a wide range of problem sizes. Girih shows substantial performance advantages and best arithmetic intensity at almost all problem sizes, especially on low-intensity stencils with variable coefficients. We study in detail the performance behavior at varying grid sizes using phenomenological performance modeling. Our analysis of energy consumption reveals that our method can save energy through reduced DRAM bandwidth usage even at a marginal performance gain. It is thus well suited for future architectures that will be strongly challenged by the cost of data movement, be it in terms of performance or energy consumption. Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org. 
Multidimensional Intratile Parallelization for Memory-Starved Stencil Computations

Contribution
We make several contributions in addition to [32] and [47] . We introduce a multidimensional intratile parallelization scheme using multicore wavefront diamond blocking for optimizing practically relevant stencil algorithms. Our improved scheme provides cache block sharing along the leading dimension (the dimension of most rapid index advance in a Cartesian ordering), which may result in better utilization of translation lookaside buffer (TLB) entries and hardware prefetching to the shared cache level. We introduce a novel Fixed-Execution to Data (FED) wavefront parallelization technique that reduces the data movement in the cache hierarchy of the processor by using tiling hyperplanes that are parallel to the time dimension. Our approach achieves hierarchical cache blocking by using large tiles in the shared cache level and fitting subsets of the tiles in the private caches of the threads, leading to cache blocks that span multiple cache domains.
In contrast to many temporal blocking approaches in the literature, our approach efficiently leverages the shared cache feature on modern processor chips. It also provides a controllable tradeoff between memory bandwidth per thread and frequency of synchronization to alleviate the bottleneck at the CPU or memory interface, as needed, when applying a stencil to a particular architecture. The results demonstrate a substantial reduction in cache block size and memory bandwidth requirements using four corner-case stencil types that represent a full range of practically important stencil computations. An efficient runtime system is used to dynamically schedule tiles to thread groups. We also develop a fine-grained synchronization scheme to coordinate the work of the thread groups and avoid race conditions. Finally, coupled with autotuning, our cache block sharing algorithm provides a rich set of runtime configurable options that allow architecture-friendly data access patterns for various setups.
Article Outline
We present the relevant background to our tiling scheme and modeling techniques in Section 2.
The motivation of our work shows the limitation of using separate cache blocks per thread in Section 3, using our models and experimental results. We introduce our novel multidimensional intratile parallelization approach and implementation details of our Girih framework in Section 4. In Section 5, we perform comprehensive experiments comparing the performance of our solution with several state-of-the-art stencil frameworks, using four "corner-case" stencil schemes. We also show the impact of cache block sharing on the performance, memory usage, and energy, and we show results for thread scaling. We elaborate on the related work in Section 6. Finally, we conclude and present the future work in Sections 7 and 8.
BACKGROUND
Cache Blocking
Typical stencil-based applications use working sets that are many orders of magnitude larger than a processor's cache memory. In the "naïve" approach, all grid cells are updated once, typically in lexicographic or any other convenient order, before the next update sweep starts. Each grid cell update involves loading the neighbor grid cells to perform the stencil computation. Spatial blocking can be used to reduce the data transfer to main memory to a theoretical minimum, but there is no in-cache data reuse between successive updates to any specific grid point. Consequently, a low in-memory flops/byte ratio is observed in this approach [10] .
Temporal blocking techniques allow more in-cache data reuse and thus reduce the memory bandwidth pressure. Updates to grid points are reordered across several time steps while the corresponding data is still in the cache. Wavefront blocking and diamond tiling are important temporal blocking techniques, so we describe them in more detail in the following.
Wavefront Temporal
Blocking. Different tiling techniques can be used in each grid dimension. We describe them using a 3-point stencil in one dimension (C syntax):
for(t=0; t<T; t++){ for(x=1; x<Nx-1; x++){
A[(t+1)%2][x] = W1*A[t%2][x] + W2*(A[t%2][x-1] + A[t%2][x+1]) }}
A graphical representation of this naïve update scheme is shown in Figure 1 (a). Colored cells represent recent updates, the darkest being the most recent. Wavefront-parallel updates were first described in [29] . The update order in this technique maximizes the reuse of the most recently visited grid cells, and can be leveraged for temporal blocking (see Figure 1(b) ). The space-time tile is traversed from left to right (in the directions of the arrows). The red arrows show the data dependence across time steps for each grid cell. The wavefront slope increases as the spatial order of the stencil increases, as the data dependence of the stencil operator extends in space. The wavefront blocking works only if the wavefront tile fits in the cache memory. Larger blocks in time lead to increased data reuse but also enlarge the cache size requirements.
The wavefront tile can be updated using a single thread per wavefront [34, 42] or multiple threads per wavefront [47] . Single-thread wavefronts require dedicated cache blocks per thread in the cache memory and do not utilize any shared cache hardware feature on the CPU chip. In contrast, explicitly multicore-aware wavefront tiling leverages the shared cache feature among the threads. Tile data is pipelined through a thread group that shares a cache level. This cache sharing reduces the cache memory size requirements and allows for larger cache blocks. As a result, the memory bandwidth pressure can be reduced compared to the single-thread wavefront. Figure 1 (c) shows the multithread wavefront variant proposed in [47] , where each thread's updates are denoted by a different symbol/color. Each thread is assigned to one or more consecutive time steps of the wavefront tile. To enable concurrent updates in the wavefront tile, the slope of the wavefront is increased to add spacing between the threads.
Using only the simple wavefront scheme in multiple dimensions would still require cache block sizes larger than typical cache sizes, at least in three-dimensional grids. Moreover, the wavefront scheme has a pipelined startup that makes parallelization more complex. Combining the wavefront blocking scheme with other tiling techniques is thus essential. Wavefront tiling is commonly combined with tiling approaches such as trapezoidal, parallelogram, or diamond tiling. This limits the tile size in other dimensions, such that the wavefront tile size fits in the desired cache memory. Since diamond tiling is proven to provide maximum data reuse of a loaded spatial cache block [35] ; we describe it in greater detail in the next section.
Wavefront blocking is ideal when the block size in time is fixed by a different factor, for example, by the tiling approach in a different spatial dimension. By pipelining the data through the cache memory, the wavefront method loads each element once to cache and performs several updates before evicting the results. On the other hand, loading blocks that are fixed in space and updating their elements by other tiling techniques allow less data reuse in the cache memory, given the same cache block size. 
Diamond Tiling.
Diamond tiling is a well-studied temporal blocking technique. The basic idea is shown in Figure 2 . Each diamond tile has dependencies to the two tiles below it ("to the past"), as indicated by the arrows. All diamonds that have their dependencies fulfilled can be updated in parallel; in the simplest case, this would be all diamonds in the same row. The tile slope depends on the stencil semibandwidth S = ±1/R, where R is the stencil radius. Diamond tiling has several advantages: it allows maximum data reuse [35] in a confined tile space (as opposed to streaming the data, as in wavefront tiling), shows more concurrency in the tiles update, and has a homogeneous tile shape, making the implementation simple.
Analytic and Phenomenological Performance Modeling
Analytic performance modeling is a valuable tool to gain insight into the performance properties of programs. The Roofline model is a well-known example, whose principles date back to the 1980s [6, 24, 28, 37] and which has received revived interest in the context of cache-based multicore processor architectures in recent years [48] . It predicts the performance of "steady-state" loops or loop nests on a CPU, assuming that one of two possible bottlenecks apply: the runtime is limited either by the execution of instructions (execution bottleneck) or by the required data transfers through the memory hierarchy (data bottleneck), whichever takes longer. "Steady state" in this context means that start-up and wind-down effects are ignored, and that the CPU executes the same mix of instructions with the same data requirements for a long time. The assumptions behind the Roofline model are fulfilled for many algorithms in computational science that are either very data bound or very compute bound, i.e., whenever the data transfer and execution times differ strongly. Unfortunately, stencil algorithms with temporal blocking optimizations do not fall into this category.
The ECM model [21, 40, 45] is an extension of the Roofline model. It does not assume perfect overlap between in-core and data transfer times but allows for overlapping and nonoverlapping parts in the in-core execution time, i.e., the time required to execute a number of loop iterations with data coming from the L1 cache. On x86 processors, the nonoverlapping part is serialized with all transfer times between adjacent memory hierarchy levels down to where the data originally resided. We briefly review the model in this section; details can be found in [40] .
The in-core execution time T core is set by the execution unit (pipeline) taking the largest number of cycles to execute the instructions for a given number of loop iterations. The nonoverlapping part of T core , called T nOL , consists of all cycles in which a LOAD instruction retires. As all data transfers are in units of cache lines (CLs), we usually consider one cache line's "worth of work." With a double-precision stencil code, one unit of work is eight iterations.
The time for all data transfers required to execute the work unit is the "transfer time." Latency is neglected altogether (although it can be brought into the model as a correction factor, see [25] ), so the cost for one CL transfer can be determined by the maximum bandwidth (cache lines per cycle) between adjacent memory hierarchy levels. For example, on the Intel Haswell architecture, one CL transfer takes one cycle between the L1 and L2 caches and two cycles between L2 and L3. Moving a 64-byte CL between memory and L3 takes 64bytes · f /b S cycles. Here f is the clock frequency of the CPU and b S is the fully saturated memory bandwidth.
If T data is the transfer time, T OL is the overlapping part of the core execution, and T nOL is the nonoverlapping part, then
T data is the sum of the data transfer times through the memory hierarchy. If, e.g., the data is in the L3 cache, T data = T L1L2 + T L2L3 . We use a shorthand notation for the cycle times in the model for executing a unit of work:
Adding up the contributions from T data andT nOL and applying Equation (1), one can predict the cycles for executing the loop with data from any given memory level. For predictions we use " " as the separator to present the information in a similar way as in the model. In the example above, this would be T ECM = {4 6 10 19} cy. Note that the ECM model as presented here assumes a fully inclusive cache hierarchy. It can be adapted to exclusive caches as well [45] . For multicore scalability, we assume that the performance is linear in the number of cores until the maximum bandwidth of a bottleneck data path is exhausted. On Intel processors, the memory bandwidth is the only bottleneck, so that the number of cores required to reach saturation is n s = T ECM /T L3Mem . The saturated performance is the Roofline prediction for memory-bound execution: P BW = I · b S , with I being the computational intensity.
Although it is possible in many situations to analyze the data transfer properties of a stencil code and construct the ECM model from first principles, in practice, one faces difficulties with this approach when looking at temporally blocked codes. The reason is that the model requires an accurate analysis of the data transfer volume in different cache levels, which becomes very difficult with small, nonrectangular block shapes (e.g., diamonds). In such situations, the model can only give very rough upper limits. However, it is still possible to apply the principles of the model in a phenomenological way by counting the data transfers via hardware performance monitoring. If the model thus constructed agrees with the performance measurements, this is an indication that the code utilizes the hardware in an optimal way, and especially that the low-level machine code produced by the compiler does not incur any significant overhead. This phenomenological modeling approach will be applied later when analyzing the performance of the Multi-threaded Wavefront Diamond blocking (MWD) code.
MOTIVATION: ANALYSIS OF SINGLE-THREADED WAVEFRONT
DIAMOND BLOCKING In this section, we study the performance of a highly efficient state-of-the-art temporal blocking scheme, Single-threaded Wavefront Diamond blocking (1WD). We develop an accurate model to estimate the cache block size requirements and the memory traffic of 1WD in order to show its limits. The model is validated using four corner-case stencils: a 7-point constant-coefficient stencil (see Listing 1); a 7-point variable-coefficient stencil with no coefficient symmetry; a second-orderin-time 25-point constant-coefficient stencil, which is frequently used in seismic imaging applications (see Listing 2); and a first-order-in-time 25-point variable-coefficient stencil with coefficient symmetry across each axis. These stencils are described in more detail in [32] . The purpose of this section is to demonstrate the shortcomings of dedicating a single thread per tile in contemporary processors, even with the best tiling techniques.
Testbed
We conducted all our experiments on two recent microprocessors: an Intel Ivy Bridge (10-core Xeon E5-2660v2, 2.2GHz, 25MB L3 cache, 40GB/s of applicable memory bandwidth) and an Intel Haswell (18-core Xeon E5-2699 v3, 2.3GHz, 45MB L3 cache, 50GB/s of applicable memory bandwidth). The "Turbo Mode" feature was disabled (i.e., the CPUs ran at their nominal clock speeds) to avoid performance fluctuations. The Haswell chip was set up in a standard configuration with "Cluster on Die" (CoD) mode disabled, meaning that the full chip was a single ccNUMA memory domain with 18 cores and a shared L3 cache. The documented clock slowdown feature with highly optimized AVX code [20] was not observed on this machine with any of our codes. Simultaneous multithreading (SMT) was enabled on both machines but not used in our tests because it would have adverse effects on the cache size per thread, which is crucial for the type of optimizations we are targeting. More details on the hardware architectures can be found in the vendor documents [26] .
Performance counter measurements were done with the likwid-perfctr tool from the LIKWID multicore tool suite [30] in version 4.0. The compiler versions used for different code variants are described in Section 5.
Single-Thread Wavefront Diamond Blocking
In Sections 2.1.1 and 2.1.2, we mentioned that wavefront blocking and diamond tiling techniques allow high data reuse in the cache memory. We find it reasonable to use these techniques over the Fig. 3 . Single-thread wavefront diamond tiling (1WD) in three dimensions, with wavefront traversal along the z-dimension and diamond tiling along the y-dimension (darker cells imply more recent updates, cf. [32] ). The diamond block shown on the upper right part (3D view) is updated by a single thread, whereas multiple independent diamonds (e.g., in the same row along y) can be updated by multiple threads in parallel.
outer dimensions (y and z) in three-dimensional grids. We prefer to leave the x-dimension intact, at reasonable grid sizes, for efficient hardware data prefetching, minimum TLB misses, and long inner loops that are suitable for efficient Single Instruction Multiple Data (SIMD) vectorization. In fact, all these techniques are employed in recent research by several groups. The work in [42] performs the diamond tiling along the y-axis and the wavefront blocking along the z-axis. [3] uses the Pluto framework with diamond tiling along the z-axis and wavefront blocking along the y-axis. As will be shown in Section 5, the autotuning of Pluto tiling parameters shows that longer loops along the x-axis achieve the best performance, which supports our argument.
The 1WD scheme used here works as shown in Figure 3 . The wavefront traverses along the z-axis while the diamond tiling is performed along the y-axis. Hence, each grid point in Figure 3 represents the full domain size along the x-dimension. Multiple diamonds along the y-axis can be updated in parallel. In [32] , we have introduced a cache block size and memory traffic model, which we will employ in the following to show that our basic assumptions about data reuse in the 1WD are correct, and to highlight its requirements and limits on contemporary processors.
Memory Data Traffic Verification
The two central parameters of the 1WD scheme are the diamond width D w and the wavefront tile width N F . We verify the correctness of the memory traffic and cache block size models of the 1WD scheme introduced in [32] by measuring the main memory data traffic per lattic-style update (LUP), i.e., the code balance, at varying block sizes per thread for the four corner-case stencils described above. The grid sizes are larger than the cache and fit in the main memory of the processor, which is typical in real applications. We minimize the cache block size in our experiments by using a unity wavefront tile width (N F = 1), where larger N F would increase the cache block size without decreasing the code balance. The experiments were run using a single thread of the 18-core Haswell processor, so the full 45MiB L3 cache was available for use. Since we measure the memory transfer rate here, we make sure that the results are representative by running each test for tens of seconds. The single-thread experiment allows us to test larger cache block sizes without having cache capacity misses. There is a strong agreement between our model and the empirical results when the cache block is small enough, showing that our implementation of the 1WD blocking scheme can achieve the theoretical memory traffic reductions. The measured code balance in Figure 4 starts to deviate from the model at cache blocks larger than about half the L3 cache size. This effect can be predicted from our cache block size model, considering the rule of thumb that half the cache size is usually usable for blocking [40] .
The results show that separate temporally blocked tiles per thread lead to starvation in the cache size requirement. For instance, the minimum diamond width (D W = 16) of the 25-point constantcoefficient stencil, in Figure 4 (c), requires a tile size of ∼3MiB/thread. To run all 18 cores of the processor with efficient temporal blocking, the processor would have to provide a minimum of 3 · 18 = 54MiB of cache (not even considering the rule of thumb), which is far beyond the available capacity. The variable-coefficient 25-point stencil, as shown in Figure 4 (d), has even more stringent requirements. The 7-point stencils (see Figures 4(a) and 4(b)) can fit 18 tiles in the cache memory, but the diamond width has to be limited, which allows for less data reuse. This may lead to the effect that the code balance cannot be reduced sufficiently to prevent main memory bandwidth saturation. We investigate these predictions in more detail in Section 5.
Our accurate cache block size model could be used to tune the tiling parameters for optimum performance. The largest tile size that fits in the usable cache size can be computed, making autotuning in this parameter obsolete. However, even an accurate model is insufficient to select the best tuning parameters if the target function (i.e., performance) is slowly varying. For example, Figure 4 (c) shows a case when a tile size larger than the usable cache memory (i.e., D W = 48 requires a ∼27MiB cache block) can still achieve better code balance than the maximum tile size that fits entirely in the cache memory (i.e., D W = 32 requires a ∼12.5MiB cache block). This observation shows the importance of autotuning even when an accurate model for the cache block size is available. The pragmatic solution is to employ model-assisted autotuning to narrow down the parameter search space, which we do.
APPROACH: MULTIDIMENSIONAL INTRATILE PARALLELIZATION
In this section, we describe our intratile multidimensional parallelization algorithm in detail and its wavefront diamond tiling implementation in a structured grid. We also describe the components of our open-source testbed framework, called Girih [15] .
Algorithm
We showed in Section 3 that temporal blocking with a single thread per cache block does not yield optimal results due to cache size restraints. To resolve this issues, we introduce an advanced cache block sharing scheme. It alleviates the pressure on the cache size and improves data reuse so that the ultimate goal of decoupling the computations from the main memory bandwidth becomes easier to achieve.
We introduce a (d + 1)-dimensional intratile parallelization algorithm for tiled d-dimensional grids. Each Cartesian dimension i of the tile is divided equally into T i chunks that can be updated concurrently. Additionally, to generalize the method beyond the standard case of scalar-valued quantities, the components of each grid cell may be divided into T c chunks, when at least T c equations per grid cell can be updated concurrently. The number of threads updating a tile ("thread group") is T c × d n=1 T n . Multiple thread groups (TGs) may exist to update tiles concurrently. The data reuse in the threads' private caches can be improved by making the boundaries of the subtiles parallel to the time dimension, where each thread would be reusing its local grid points across the time steps in the tile.
Our multidimensional cache block sharing algorithm has several advantages. First, it utilizes thread parallelism within a cache block, which leads to lower in-memory code balance because the shared cache blocks can be larger in space and time, allowing more in-cache updates. Consequently, it has less stringent requirements on the cache block size compared to other cache block sharing approaches, which require more space to provide sufficient concurrency. This will also be discussed in the experimental results (see Section 5) and the related work (see Section 6).
Second, cache block sharing along the leading dimension reduces the need for tiling along it, which may result in better utilization of the loaded TLB data and the hardware data prefetching to the last-level cache (LLC). Long, data-contiguous loops are utilized in the shared cache level, while each thread updates a block of the unit stride.
Finally, by parameterizing the T i size and using autotuning, our cache block sharing algorithm provides a rich set of runtime configurable options that allow architecture-friendly data access patterns for various setups. For example, the autotuner finds cases when sharing the leading dimension among multiple threads is beneficial to reduce the cache memory pressure, as will be shown in Section 5.
Girih Framework
The Girih framework consists of several components that allow the stencil code to run faster. It starts by running the autotuner component to find an efficient tiling and parallelization parameters configuration, as we describe in Section 4.2.2. After allocating and initializing the data arrays, the runtime system (described in Section 4.2.3) schedules the tiles to the available threads. The tile execution involves running the MWD implementation with the selected parameter set to complete the stencil computations in the tile; this is described in detail in the next section.
Multicore Wavefront Temporal
Blocking. In Figure 5 , we demonstrate a practical implementation of our approach using Multi-threaded Wavefront Diamond blocking (MWD). Diamond tiling is performed along the y-axis and wavefront blocking is performed along the z-axis. Figure 5 (a) shows the tile decomposition among the threads in three spatial dimensions at a single time step of the tile update. The thread group size is equal to the product of the thread counts in all dimensions. ) show two perspectives of the extruded diamond along the z-and y-axes, respectively. We use this particular variant of wavefront-diamond tiling because it uses provably optimal tiling techniques, as described earlier in Section 3. Slicing the x-axis into small tiles is known to be impractical as it would negatively affect the TLB misses, hardware prefetching, and control flow overhead. We replaced tiling along the x-axis by our intratile parallelization, although it may be useful to combine them in very large grid sizes. Finally, we chose the diamond to extend in the t-y plane (instead of t-z) to ensure optimal, contiguous (gap-free) data access without adverse effects from prefetchers and TLB shortage.
At each time step in the tile update, each thread performs updates of multiple grid points before proceeding to the next time step. All threads update the same number of grid points to maintain load balancing, and they synchronize at certain points to ensure correctness.
We present two wavefront parallelization schemes in Figures 5(c) and 5(e). The first assigns a fixed location for the thread in the wavefront tile, allowing the use of a simple relaxedsynchronization scheme in the thread group, where each thread has data dependencies to its left neighbor only. This wavefront approach has the disadvantage of pipelining the data across the cores through the shared cache, resulting in a large data transfer volume in the cache hierarchy. The alternative scheme in Figure 5 (e) addresses this issue by using Fixed-Execution to Data (FED), which updates each subset of grid points by a single thread, regardless of the wavefront tile location, allowing a wavefront cache block to span multiple cache domains. The limitation of this scheme is the more complex synchronization within the thread group, so we have to use an OpenMP barrier after each time step update in the FED wavefront. The FED wavefront scheme is useful for stencils with a large number of bytes per grid cell, such as the variable-coefficient and high-order stencils. It is particularly important for high-order stencils because almost no reuse is usually achieved between the time steps of the wavefront due to the steep temporal blocking slope.
The intratile parallelization schemes along the wavefront in Figures 5(c) and 5(e) allow an arbitrary number of threads to be assigned along the z-axis, with the cost of increasing the cache block size without increasing the data reuse. An arbitrary number of threads can also be assigned along the x-axis of the tile. The parallelization along the x-axis replaces the tiling along this dimension, when tiling would be useful to reduce the cache block size of the thread. The diamond tile, shown in Figure 5 (d), uses only one or two threads, as they are the only options to maintain load balance and assign the same data to each thread across time steps. Relaxed synchronization is used along the z-axis in the regular wavefront approach ( Figure 5(c) ), and the software subgroup barrier synchronizes threads along y-and x-axes for each subtile along the z-axis.
Listing 3 shows a reduced version of our MWD tiling implementation. The code synchronizes the thread group using an OpenMP barrier after each time step and uses the regular wavefront blocking strategy (described in Figure 5(c) ). The full kernel, the relaxed-synchronization variant, and the FED variant are available online in [15] . We use an OpenMP parallel region to spawn the threads of the thread group. The three-dimensional coordinates of the threads are calculated in lines 5 to 7. The tile is decomposed among the threads along the y-, x-, and z-axes in lines 10 to 18, 20 to 28, and 30 to 31, respectively.
The Girih framework code requires adding two entries about the stencil to be included in the list of available stencil computations. First, the loop body has to be appended in a macro, which is substituted for line 42 of Listing 3 during the compiler preprocessing step. Second, important information about the stencil operator, such as the shape and the semibandwidth, has to be appended in a stencil information list. Girih uses the stencil information to set up the parameterized MWD implementation such that it respects the data dependencies of the new stencil operator.
Autotuning.
Autotuning is a preprocessing step, conducted after the user has selected the problem details, e.g., the stencil type and grid size. The autotuner allocates and initializes the required arrays for its own use and deallocates them once it is completed. This saves significant tuning time when the grid array is very large. Figure 6 shows the details of our approach. It tunes several parameters: diamond width, wavefront tile width, threads along x-, y-, and z-axes. The autotuner starts with fixed user-selected parameters, then determines the feasible set of intratile thread dimensions by factorizing the available number of threads. It tests the performance of all valid thread group size (TGS) combinations. A local search hill-climbing algorithm is applied to tune the diamond and wavefront tile widths for each TGS case. The autotuner uses our cache block size model (see Section 3.3 and [32] ) and the processor's available cache size (specified by the user) to reduce the parameter search space. In particular, the autotuner uses the model to get the parameters of a tile that has smaller cache memory requirements than the available architecture's cache memory. Then, the autotuner employs the hill-climbing algorithm to test the performance at increasing tile sizes and stops when the measured performance starts to decline.
Selecting a proper test size, especially the number of time steps, for autotuning test cases is challenging. A very short test may be affected by the system jitter and other sources of noise, which produces a false indication of the achievable performance. A large test, on the other hand, may increase the tuning time and associated cost significantly. We use multiples of diamond rows to set the number of time steps. The runtime of a single diamond row has many dependencies. It relies on the grid size, stencil type, processor speed, tiling efficiency, and so forth, so an a priori setting of the test size is not practical. To resolve this issue, for each test case, we dynamically change the test size until "acceptable performance" is obtained: we repeat the test case with an increasing number of diamond rows. Once the performance variation between two repetitions falls within a certain threshold, we consider that the largest test case size produces "acceptable performance" and we use it to report the performance for this case. 
Runtime System.
We use a multiproducer multiconsumer First In First Out (FIFO) queue to perform the dynamic scheduling of tiles to thread groups. The queue holds the tiles to be updated; access to it is protected by a critical region. When a thread group updates a tile, it pushes the next tiles that have no more unmet dependencies into the queue. When a tile is assigned to a thread group, the runtime removes it from the queue. The cost of synchronization is negligible because each extruded diamond update involves updating millions of grid points.
Our implementation relies on OpenMP nested parallelization. The outer parallel region spawns one thread per thread group (we call them "group masters"). Tiles are scheduled at this level. The inner parallel region spawns the threads of each group to update the grid cells in the tile. Spawning the inner parallel region happens many times, in undefined order. As a result, threads are grouped in different subsets during runtime.
Grouping threads in different subsets does not affect the performance in Uniform Memory Access (UMA) setups. On the other hand, this issue has to be resolved in Non-Uniform Memory Access (NUMA) cases, e.g., when running on systems with multiple processors. In Girih, we use the Linux low-level interface of sched_setaffinity to set the thread affinity explicitly with minimal overhead.
EXPERIMENTAL RESULTS
We perform our experiments over the four stencils under consideration using Intel 10-core Ivy Bridge and 18-core Haswell processors, over a broad range of grid sizes (cubic domain). Our MWD 1 approach is faster than 1WD/CATS [42] , Pluto 2 [3, 4] , and Pochoir 3 [44] for all stencils on both architectures with most grid sizes. MWD is also the only approach that provides a significant improvement over an efficient spatial blocking implementation in all cases. To understand the strength of our approach in potentially memory-starved future processors, we study the impact of MWD cache block sharing over the performance, code balance, memory transfer volumes, and energy consumption. Finally, we show significant memory bandwidth and memory transfer volume savings.
Framework Setup
To ensure fair comparison, we investigated and used the best setup of the tested Pluto and Pochoir frameworks. We ran each test case twice and report the performance of the faster one, although the variation was minor. For Sections 5.2 and 5.3, we ran the experiments several times to measure different hardware counter groups. To demonstrate the stability of our results, we plot the reported performance of the repeated experiments in our performance figures. In most cases, the results are aggregated in a single point. At some small grid sizes, where the noise is relatively more significant, the performance points are stretched vertically. Each test was run for tens of seconds.
Pluto Setup.
The Pluto framework uses the polycc executable to perform the source-tosource transformation. We use the flags "--tile --parallel --pet --partlbtile" for efficient code generation. For the Intel compiler, we employed the flags "-O3 -xHost -ansi-alias -ipo -openmp," which are also the default configuration in the Pluto examples. The Intel C compiler V15 achieves a twofold speedup compared to version 12, mostly because of more effective SIMD vectorization.
The selected tiling transformations perform diamond tiling along the z-axis and parallelepiped tiling along the y-and x-axes. Since parameter tuning is essential to run the tiled code efficiently, we implemented a Python script to tune the parameters of each test case in the presented results. We performed brute-force parameter search in diverse setups (i.e., different stencils, processors, and grid sizes) to ensure fair and efficient tuning. We found that the parameter search space is convex, where only a single maximum exists in the tested processors and stencil kernels. In the results presented here, we use a recursive local search algorithm, in the same manner discussed earlier in the Girih autotuner for the MWD code, to tune the three tiling parameters.
Pochoir Setup.
Pochoir does not have performance-critical tuning parameters for tiling, as it relies on cache-oblivious algorithms. We used the same compiler options as in the Pochoir examples: "-O3 -funroll-loops -xHost -fno-alias -openmp."
Girih Setup.
The Intel compiler options "-O3 -xAVX -fno-alias -openmp" were used for Girih. Our autotuning code selects the diamond and wavefront tile widths and thread group parallelization parameters. The wavefront implementation parameter is selected according to the stencil type. A relaxed-synchronization wavefront scheme is used for the 7-point stencils, as this implementation has lower synchronization overhead. FED is not important here as sufficient data reuse is possible at reasonable wavefront width. On the other hand, FED is applied in the 25-point stencils to allow sufficient data reuse in the wavefront update.
We also tune the 1WD case separately. This is the closest implementation we have to the CATS2 algorithm of [42] .
Performance at Increasing Grid Size
In this section, we present a performance comparison among MWD, Pluto, Pochoir, 1WD/CATS, and optimal spatial blocking over a wide range of problem sizes (cubic domain) for the four stencils under consideration. For a better identification of relevant bottlenecks, we also show memory bandwidth and memory transfer volume per LUP (see Section 5.4 for a thorough code balance analysis).
Our major finding is that MWD is faster than 1WD, Pluto, and Pochoir for all stencils on both architectures with most problem sizes. MWD is also the only approach that provides a significant improvement over optimal spatial blocking in all cases. Especially for the high-order (25-point) stencils, it is the only efficient solution. In general, the Haswell CPU shows better speedups for temporal blocking versus optimal spatial blocking due to its large number of cores (18) , which leads to a low machine balance of 0.14bytes/flop (assuming fused multiply-add is not used; with FMA, the machine balance goes down further to 0.07bytes/flop). In contrast, the Ivy Bridge CPU only has 10 cores, a 5% lower clock speed, and a 20% lower memory bandwidth for a better machine balance of 0.23bytes/flop. A low machine balance generally indicates a more "bandwidth-starved" situation with a higher potential for temporal blocking techniques, although a quantitative analysis requires a more elaborate performance model. At large problem sizes, where boundary effects become negligible, it is possible to construct a phenomenological ECM performance model by combining measured data traffic per LUP in all memory levels with code composition characteristics such as the number of load instructions per LUP, as described in Section 2.2. Tables 1 and 2 summarize the comparison between the phenomenological models and the measurements. Details are discussed below.
Due to the large amount of data, we only discuss the results on the Intel Haswell platform in detail here; the data taken on the Ivy Bridge system can be found in the Online Appendix in the ACM DL. Additionally, in the Online Appendix, we have collected additional hardware counter measurements that substantiate some of our conclusions.
7-Point Stencil with Constant Coefficients.
The results for the 7-point stencil with constant coefficients are shown in Figures 7(a) , 7(b), and 7(c) for the Haswell CPU.
This stencil performs seven flops per LUP and has a minimum code balance for pure spatial blocking of 24bytes/LUP in double precision. The memory-bound maximum performance in the latter case on Haswell is thus (50GB/s)/(24bytes/LUP) ≈ 2.1GLUP/s. If nontemporal stores were used, the code balance would go down to 16bytes/LUP, leading to a maximum performance of about 3.1GLUP/s. However, this is the only stencil of the four that can profit from nontemporal stores in a significant way.
All temporal blocking variants outperform optimal spatial blocking by far. MWD and 1WD are consistently faster than Pluto and Pochoir for smaller problem sizes, with MWD taking a clear lead for larger problems. MWD also exhibits the lowest memory bandwidth and the lowest code balance (memory traffic per LUP). For most problem sizes, the autotuner selects a thread group size of two or six, except for very large and very small problems.
The 7-point stencil with constant coefficients uses 28 half-wide (16-byte) load instructions per 8 LUPs, which require 14 cycles to execute on both architectures. The required data traffic in the L2 cache is 40bytes/LUP at problem sizes below 700 and 56bytes/LUP above, since the L1 cache becomes too small to hold three consecutive rows of the source array. Hardware counter measurements confirm these estimates for MWD. On Haswell, this transfer takes seven cycles. The L3 cache traffic is hard to predict due to the blocking strategy; we use the measured value of 30bytes/LUP, leading to a transfer time of about four cycles. In memory, we observe a code balance of about 5bytes/LUP. With the chosen input data, the ECM model predicts a socket-level performance of about 10GLUP/s, with bandwidth saturation at 17 cores. Looking at the measured performance data, we see that the prediction is about 20% above the measurements. This deviation is expected because the ECM model is known to be overoptimistic near the saturation point. Overall, the ECM model describes the performance characteristics of the MWD code at larger problem sizes quite well, proving that the MWD code is operating at the limits of the hardware. This is also true for the other stencils described below. Further substantial performance improvements could only be expected from optimizations that reduce the amount of work.
7-Point Stencil with Variable Coefficients.
The results for the 7-point stencil with constant coefficients are shown in Figures 8(a), 8(b) , and 8(c) for Haswell. This stencil performs 13flops/LUP, but it has a higher pressure on memory for purely spatial blocking (80bytes/LUP instead of 24 without nontemporal stores). Hence, we expect more potential for temporal blocking and accordingly a higher speedup for MWD, which exhibits the lowest in-memory code balance of all. Indeed, the speedup of MWD compared to spatial blocking is 4.5× to 5.2× on the very bandwidth-starved Haswell.
Pochoir shows the lowest memory bandwidth at large problem sizes, but it achieves only 60% of the performance of spatial blocking. This shows that low memory bandwidth does not guarantee high performance. In this particular case, the cause can only be exceptionally slow low-level code, since the data traffic within the cache hierarchy (see the Online Appendix for more details) is on par with the other methods. However, even if the memory bandwidth were fully utilized (50GB/s instead of 22GB/s), the performance would not be competitive, as shown by the memory transfers per LUP.
The phenomenological ECM model for MWD predicts a performance of 3.8GLUP/s at large problem sizes for the full Haswell socket assuming perfect scalability. The deviation from the measurement is larger than for the constant-coefficient 7-point variant, because the large data traffic requirements lead to a breakdown of parallel efficiency due to cache size constraints.
25-Point Stencil with Constant Coefficients.
The results for the 25-point stencil with constant coefficients are shown in Figures 9(a) , 9(b), and 9(c) for Haswell. The stencil has a minimum code balance for optimal spatial blocking of 32bytes/LUP.
Due to its high ratio of 33flops/LUP and its large radius, this stencil poses a challenge for all temporal blocking schemes. Many layers of grid points have to be supplied by the caches, so the cache size needed for temporal blocking is large and the time needed for memory transfers is rather small in comparison to the cache access time [40] . Only MWD reduces the memory pressure significantly, since it can leverage the multithreaded wavefronts to reduce the need for cache size. As a consequence, only MWD is consistently faster than pure spatial blocking by a factor of 1.5× to 1.7× on Haswell. The other algorithms even exhibit a memory code balance larger than spatial blocking for most problem sizes.
With MWD, the compiler produces some register spilling, adding to the dominance of the in-core and in-cache contributions to the runtime. At large problem sizes, the ECM model predicts a fullsocket MWD performance of 2.5GLUP/s. This prediction is reasonably close to the measurements. This stencil performs 37flops/LUP but requires much more data due to the variable coefficients array (128bytes/LUP for optimal spatial blocking). Although it has a lower intensity compared to the 25-point constant-coefficient stencil, it poses the same cache size problems. Again, only MWD can reduce the code balance significantly, and it is the only scheme that is faster than spatial blocking at all, by 1.5× to 2×. Again, Pochoir shows low memory bandwidth; measurements show that it requires a massive amount of data traffic between the L1 and L2 cache for this stencil, which is a contributor (in addition to slow low-level code) to its extremely low performance. The phenomenological ECM model for MWD at large problem sizes yields socket-level estimates of 0.71GLUP/s on Haswell, which is in line with the measurements.
25-Point
MWD Tile Sharing Impact on Performance, Memory Transfer,
and Energy Consumption While the cache block sharing reduces the memory bandwidth requirements of the stencil codes, it increases the overhead by performing fine-grained synchronization among more threads. As a result, the autotuner would select the minimum thread group size that sufficiently decouples from the main memory bandwidth, when allowed to tune all the parameters.
In this section, we run the MWD approach using fixed thread group sizes to study their impact on the performance, memory bandwidth and transfer, and energy consumption. The autotuner selected the parallelization dimensions and tiling parameters in these experiments. We limit the energy consumption analysis to the Intel Ivy Bridge processor. The memory modules of the Haswell processor consume a similar amount of energy regardless of the memory bandwidth usage, making the energy consumption rate oblivious to our optimization techniques.
Since we observe similarities in the studied characteristics of the four stencils under investigation here, we describe a representative subset of these results. We use a range of cubic grid sizes, where each dimension is set to multiples of 64 in the Ivy Bridge processor and multiples of 128 in the Haswell processor. We describe the Haswell results of the 7-point constant-coefficient and the 25-point variable-coefficient stencils, to show MWD behavior at two extremes in the code balance. The energy consumption behavior is illustrated using the 25-point constant-coefficient stencil results on Ivy Bridge. For reference, additional performance results are included in the Online Appendix.
7-Point Stencil with Constant Coefficients on
Haswell. The Intel Haswell performance results using different thread group sizes are presented in Figure 11 (a). We also show the measured memory bandwidth in Figure 11 (b) and the memory transfer volumes normalized by the number of lattice site updates in Figure 11 (c). The 1WD implementation does not saturate the memory bandwidth at smaller grid sizes than 800 3 , as this stencil has a small bytes requirements and moderate code balance. When 1WD runs at grid sizes larger than 800 3 , the memory bandwidth saturates and the performance drops, as larger cache blocks cannot fit in the L3 cache memory. 2WD and larger thread group sizes are sufficient to decouple from the main memory bandwidth bottleneck at large grid sizes.
At small grid sizes, the synchronization overhead of 9WD and 18WD is significant compared to the computations, resulting in lower performance. As the grid size becomes larger, the synchronization cost of 9WD and 18WD becomes less significant, and all MWD variants achieve similar performance.
Larger thread group sizes save more cache memory as they keep a smaller number of tiles in the cache memory. The cache size saving provides space for larger diamond tiles, which increases the in-cache data reuse and decreases the main memory bandwidth and data traffic. Figures 11(b) and 11(c) show how larger thread group sizes consume less memory bandwidth and transfer less data per lattice update. This shows that our MWD approach is suitable for future processors, which are expected to be more starved for memory bandwidth. Hence, although the synchronization overhead of MWD impacts the performance on this particular processor, this is not a general result; a re-evaluation is needed on future architectures. Fortunately, it can be left to the autotuner to take such variations in hardware and software into account.
25-Point Stencil with Variable Coefficients on Haswell.
Haswell performance results using different thread group sizes are shown in Figure 12 .
The 25-point variable-coefficient stencil has high code balance and large cache block size requirements, so a large thread group size is necessary to decouple from the main memory bandwidth bottleneck. As shown in Figure 12 (a), 18WD achieves the best performance at most of the grid sizes. Other MWD variants saturate or nearly saturate the main memory bandwidth at larger grid sizes (Figure 12(b) ) and show a much higher code balance (Figure 12(c) ).
25-Point Stencil with Constant Coefficients on Ivy
Bridge. Ivy Bridge performance results using different thread group sizes are presented in Figure 13 (a), along with the measured memory bandwidth in Figure 13(b) . The memory transfer volumes in Figure 13 (c) and the energy estimates in Figures 13(d) , 13(e), and 13(f) are normalized by the number of lattice site updates.
1WD achieves the best performance up to the grid size of 512 3 before it starts saturating the memory bandwidth (Figure 13(b) ) and increasing its data transfer volume (Figure 13(c) ). Although 2WD achieves the best performance with grid sizes larger than 512 3 , all MWD variants show similar performance. This similarity is caused by negligible synchronization overhead among all MWD variants.
The energy consumption results in Figure 13 (f) show that the usual law of "faster code uses less energy," or "race-to-halt" as defined in [22] , is not always true. For instance, we observe that 10WD achieves the lowest total energy to solution in most cases, although it does not achieve the best performance among all MWD variants. This is because the DRAM energy consumption is proportionally correlated with a linear function of its bandwidth usage. The memory bandwidth saving of 10WD results in decreasing its DRAM energy consumption, as we observe in Figure 13 (e), while consuming the same CPU energy as other variants. We have to stress that these results, while being representative of a system with a relatively high DRAM power dissipation, are not representative and will have to be re-evaluated on later architectures. It is generally expected, however, that future HPC systems will show this very characteristic [27] .
Code Balance and Energy Consumption Analysis
The findings described in the previous section would not justify favoring the maximum thread group size over all other options on the Ivy Bridge processor. However, they show clearly that if the future moves toward more memory-bandwidth-starved systems and higher relative power dissipation in the memory subsystem, it should use algorithms that exhibit the lowest possible code balance. This view is corroborated by another observation in our data: the overall energy savings of temporal blocking versus standard spatial blocking are roughly accompanied by equivalent runtime savings, but when the energy consumption of CPU and DRAM are inspected separately, it is evident that this equivalence emerges from the mutual cancellation of two opposing effects: while the CPU energy is less strongly correlated with the code performance, the DRAM energy shows an overproportional reduction for temporal blocking.
This can be seen more clearly in Figure 14 , where we have measured the energy to solution with respect to the code balance for 5WD (as a consequence of setting different diamond tile sizes) for both 7-point stencils (the diagram for the 25-point stencil would only contain a single data point per set). In both cases, the DRAM energy varies much more strongly with changing code balance than the CPU energy. This was expected from the observations described above, but the CPU energy dependence is far from weak. Overall, there is an almost linear dependence of energy on code balance, making the latter a good indicator of the former. 
Thread Scaling Performance
All measurement results discussed so far were taken on a full CPU socket. In order to better understand the shortcomings and advantages of the different temporal blocking approaches, we present thread scaling results in this section. For each stencil, we show the scaling behavior of performance, memory bandwidth, and measured code balance at a fixed problem size on the 18-core Haswell CPU.
7-Point Constant-Coefficient Stencil.
The thread scaling results for the 7-point constantcoefficient stencil are shown in Figures 15(a), 15(b) , and 15(c). All temporally blocked variants except Pochoir show a roughly constant code balance with increasing thread count, but only MWD shows good scaling across the whole chip. Pochoir and 1WD clearly run into the bandwidth bottleneck; the limited scalability of Pluto is not caused by traffic issues and will be investigated. MWD shows a linearly rising memory bandwidth utilization, indicating bottleneck-free and balanced execution.
7-Point Variable-Coefficient Stencil.
The thread scaling results for the 7-point variablecoefficient stencil are shown in Figures 16(a), 16(b), and 16(c) . Again, MWD exhibits a constant, low code balance and good scaling. Starting at six threads, Pluto also shows constant code balance but on a 60% higher level. Since the memory bandwidth is roughly the same as with MWD, performance also scales at a much lower level. An interesting pattern can be observed with 1WD: at rising thread count the shared cache becomes too small to accommodate the required tiles for maintaining sufficient locality, leading to a steep increase in code balance beyond 10 cores. Since the memory bandwidth is already almost saturated at this point, performance starts to break down. This behavior was expected from the discussion in the earlier sections, but it is evident now that 1WD would be the best choice on a CPU with only 10 cores but with the same cache size. 1WD is also the only temporal blocking variant that is not decoupled from the memory bandwidth. In case of Pochoir, the data shows that the decoupling is due to very slow low-level code, as shown before.
25-Point Constant-Coefficient Stencil.
Thread scaling results for the 25-point constantcoefficient stencil are shown in Figures 17(a), 17(b), and 17(c) . Due to the massive cache size requirements of this stencil, only MWD is still able to decouple from the memory bandwidth. All other variants show strong saturation, or even a slowdown in case of 1WD beyond 10 threads, which is caused by the same cache size issues as with the 7-point variable-coefficient stencil. Again, 1WD would be the method of choice if the chip only had 10 cores but the same shared cache size.
25-Point Variable-Coefficient Stencil.
Thread scaling results for the 25-point variablecoefficient stencil are shown in Figures 18(a), 18(b) , and 18(c). Note that there is now only a single data point in each figure for MWD (at 18 threads). All variants except MWD exceed even the spatial blocking code balance beyond five threads and thus show strong performance saturation, with the exception of Pochoir, which again suffers from code quality issues. Even if one were able to accelerate the Pochoir code so that it could saturate the bandwidth, it would still end up at a lower performance level than all others (about 0.3GLUP/s).
RELATED WORK
The arguably most comprehensive overview on stencil operators and associated optimization techniques on modern processors was given in [9] . Since then, some new ideas about efficient cache reuse in stencil computations have come up, most of which are variants of temporal blocking. Spatial blocking alone, although it may reduce the code balance considerably, is not able to achieve decoupling from the memory bottleneck in many practically relevant situations. In the following, we restrict ourselves to prior work in stencil optimizations for standard cache-based microprocessors.
Several tiling techniques have been proposed in the literature that can be leveraged for temporal blocking. These include parallelogram tiling, split tiling, overlapped tiling, diamond tiling, and hexagonal tiling (see [36, 52] for reviews). Out of these, diamond tiling has emerged as a promising candidate for the stencil schemes and processor architectures under investigation here, and has been covered extensively, for example, in [3, 16, 17, 42] .
The wavefront technique is a further development of the hyperplane method as introduced in [29] . It can be combined with other tiling approaches as shown in [34, 42, 50] and in [47] . The latter work was the first to leverage shared caches in multicore CPUs for improved cache reuse in stencil algorithms.
Apart from the tiling technique, there is another criterion that can be used to classify cache optimization approaches: they are either cache oblivious (i.e., they require no prior knowledge of the hardware to find optimal parameter combinations [12, 41, 44] ) or cache-aware (i.e., they use either autotuning [9] or predictive cache size models [42] ).
Producing optimal code by hand is tedious and error-prone, so there is a strong tendency to develop software frameworks that can handle this process (semi-) automatically. Pluto [4] is a sourceto-source converter using the polyhedral model, CATS [42] is a library, Pochoir [44] leverages a domain-specific language together with a cache-oblivious algorithm, PATUS [8] employs a DSL and autotuning, and a DSL that uses split tiling is developed in [23] . [19] developed MODESTO, a stencil framework using models to arrive at predictions about the optimal transformations on the target system. Recently, a review of stencil optimization tools that use polyhedral model was published in [51] .
The emergence of energy and energy-related metrics as new optimization targets in HPC has sparked intense research on power issues in recent years. For instance, the realization that low energy and low time to solution may sometimes be opposing goals has led to activities in multiobjective autotuning [2, 18] . However, there is very little work that tries to connect power dissipation with code execution using simple synthetic models to gain insight without statistical or machine-learning components. [21] constructed a simple power model that can explain the main features of power scaling and energy to solution on standard multicore processors. Choi et al. [7] follow a slightly different approach by modeling the energy consumption of elementary operations such as floating-point operations and cache line transfers. In this work, we try to pick up some of those ideas to establish a connection between energy consumption on the CPU and in the DRAM with the performance and, more importantly, with the code balance of a stencil algorithm.
The cache block size reduction benefits of our multidimensional intratile parallelization are demonstrated further in a companion paper describing an application case study in applied electromagnetics [33] . The experiments show the advantage of our approach over intratile parallelization along a single dimension, especially for applications with high bytes-per-grid-cell requirements. One should note that MWD is best suited in cases where many time steps can be executed on the grid; in applications where the number of time steps is very limited, other approaches may be more successful. See [14] for a recent contribution.
We classify the prior work related to our MWD approach in two categories: using separate cache block per thread and utilizing cache block sharing.
Related Work Using Separate Cache Blocks per Thread
We describe the temporal blocking issues in using dedicated cache block per thread in Section 3. Insufficient data reuse is a consequence in these approaches, given the LLC size limitation, having to use long strides in the leading dimension, and the increasing number of cores in contemporary processors.
Pluto and CATS2 combine wavefront temporal blocking with diamond tiling. While we use CATS2 tiling choices along each dimension, Pluto performs diamond tiling along the z-axis, and parallelogram tiling along the y-and x-axes. Both methods are efficient in terms of minimizing the synchronization overhead among threads and in terms of data reuse. However, our approach can improve on this in several respects:
In order to expose sufficient parallelism, Pluto and CATS2 require a large domain size in the diamond tiling dimension. If this condition is not fulfilled, some threads will run idle with Pluto. CATS2 can in this case fall back to CATS1, which has less stringent requirements. Still, both approaches will be hard to adapt to emerging many-core architectures, where the cache size per thread will most probably decrease.
Neither Pluto nor CATS2 assumes a shared cache among the threads. As we have shown in Section 3, memory-bandwidth-starved stencil updates may easily run out of cache space and may thus be unable to decouple from the memory bottleneck. This issue is even more severe in view of the expected evolution of processor architectures toward lower memory bandwidth and cache size per thread, the Intel Xeon Phi coprocessor being a prominent example.
Related Work Utilizing Cache Block Sharing
The idea of utilizing cache block sharing by multiple cores was proposed in [47] . They use parallelogram tiling along the y-axis and multicore wavefront temporal blocking along the z-axis. This work was extended in [46, 49] . They use relaxed synchronization in the thread group and assign one thread group per cache domain. To maintain the intermediate values across parallelogram tile updates, data is copied to temporary storage to keep the intermediate time steps from the boundaries of the tiles. Their work alleviates the cache capacity limitation that appears when using separate cache blocks per thread. It has the advantage of reducing the total cache memory requirement by using fewer cache blocks while utilizing all the available threads. They also introduce the thread group concept, where each thread group shares a tile. This article improves on the cache block sharing concept to achieve further cache block saving, more data reuse, more concurrency, flexibility with parameterized tiling, and autotuning. The wavefront data is pipelined across the threads in the work of Wellein et al. [46, 47, 49] by assigning one or more time steps for each thread. A space is imposed between the working set of each thread to achieve concurrency while ensuring correctness. This requires extending the cache block size in the spatial dimension, without increasing the in-cache data reuse. Since each thread updates data from different time steps, an equal amount of work across the time steps of the wavefront is required. They achieve load balancing by using parallelogram tiling in the other dimension. This does not allow their work to explore more advanced tiling techniques, such as diamond tiling.
A more recent cache block sharing work, called jagged tiling, is proposed in [39] and extended in [38] . The polyhedral model is employed to generate code for intratile parallelization using the Pluto framework. They use multicore wavefront tiles, similar to our work along the z-axis, and a runtime system for fine-grained parallelization to schedule the work to threads. They show results for the three-dimensional 7-point constant-coefficient stencil, running experiments using a grid of fixed size 480 3 , which is friendly to the cache memory. They provide intratile concurrency along the diamond tile dimension by stretching the diamond tile along the space dimension, i.e., without increasing the data reuse in time.
Shrestha's work in [38, 39] has the advantage of using a source-to-source transformation framework to make their techniques more usable. However, by providing the diamond tiling and the intratile concurrency along the same dimension, they effectively revert to smaller diamond tile sizes and update groups of adjacent diamond tiles together. The only reuse in the thread group is at the boundary of the subtiles. In contrast, our MWD approach allows the thread group to share one large diamond tile, providing more in-cache data reuse. Finally, our work has the advantage of providing a parameterized thread group size and autotuning.
CONCLUSIONS
The performance of stencil computations tends to underutilize the compute resources of processors for memory-resident datasets unless proper temporal blocking optimizations are applied. In this work, we have proposed novel temporal blocking algorithms providing performance advantages over previously published state-of-the-art techniques. Our optimizations also show reduced memory bandwidth pressure, which makes them suitable for future systems that are expected to be more starved for main memory bandwidth.
To demonstrate the capability of our solution, we have performed extensive benchmark studies using four "corner case" stencils on two modern Intel server CPUs (Ivy Bridge and Haswell). Building on previous work about diamond tiling and wavefront temporal blocking, we show that separate cache blocks per thread, as used, for instance, in cache-oblivious algorithms, are often insufficient to decouple from the main memory bandwidth, especially for large problem sizes and stencils with variable coefficients. Our novel multidimensional intratile parallelization approach allows for a more efficient use of the cache size and exploits high intratile concurrency, which is very advantageous for stencils with high data traffic requirements. To this end, we introduce thread parallelism in the leading dimension instead of tiling it to reduce the working set in the threads' private caches. This enables a better utilization of hardware prefetching in the shared cache level, and it reduces the danger of TLB shortage. By making the intratile parallelism parallel to the time axis, we also maximize the data reuse in the private caches.
All algorithms were implemented in our open-source testbed framework Girih. Our parameterized implementation provides a controllable tradeoff between fine-grained synchronization and memory bandwidth pressure. An autotuning component determines optimal parameter combinations for the stencil and architecture at hand.
We have constructed accurate models for the cache size and data transfer requirements of our optimized implementations. The traffic model can predict the optimal code balance as a function of the stencil radius, the tile parameters, and the number of domain-sized streams ("arrays"), and was shown to be very accurate for the four stencils under investigation if the predicted cache block size fits into about half the shared cache size. We use the models to reduce the effort of the autotuner, but they are also of general value since they predict the expected code balance reduction of temporal blocking before an actual implementation is done.
Our approach achieves better performance than Pluto, Pochoir, 1WD/CATS2, and optimal spatial blocking with most grid sizes. Furthermore it is the only scheme that is consistently faster than optimal spatial blocking, especially for higher-order (long-range) stencils. Based on hardware counter measurements, we show that our code makes optimal use of the computational resources as predicted by a phenomenological performance model. By varying the thread group size, we have explored opportunities for reducing the memory bandwidth pressure significantly at the cost of a slight performance degradation. On systems where the energy consumption of the memory chips is significant, this leads to lower energy to solution at nonoptimal performance, providing a striking example where "race to halt" does not apply, i.e., where slower code is more energy efficient.
Since our temporal blocking solution provides lower code balance than previous work, it is, in our opinion, best suited for future, bandwidth-starved systems. The concepts behind our work may also be applied beyond the specific field of regular stencil codes, e.g., for unstructured grids.
FUTURE GENERALIZATIONS
Stencil computations arise in applications more general than those heretofore considered and must be ported to other architectures as well. We briefly consider several generalizations opened up by MWD.
Algorithmic Generalizations
Adaptive time stepping and checkpointing. Temporal blocking advances the solution domain several iterations at once, which poses a challenge to applications with adaptive time stepping (ATS). MWD can be modified to accommodate ATS by checkpointing the solution to nonvolatile storage, which is often desired at regular intervals even under uniformly safe time steps. The most suitable place for checkpointing is the middle of diamond tiles, where the entire solution domain can be stored at one iteration. This requires modifying the wavefront approach to store the middle time step of certain diamond rows in separate arrays. At the same time, the priority of tiles at earlier time steps would have to be adapted (increased).
Higher-order time stepping. MWD as presented here makes use of just two time levels of storage, but this does not rule out methods that are higher order in time. Temporal truncation errors can be expressed in terms of time derivatives of the unknown and powers of the time step. By repeated time differentiation, a time-dependent partial differential equation (PDE) provides an explicit formula for the time derivatives in terms of spatial derivatives. Therefore, at the price of an expanding spatial stencil, successive orders of temporal truncation error can be eliminated, a technique common in the seismic imaging industry.
Gauss-Seidel updates. While two-array Jacobi-style updates are considered above, another important variant is the Gauss-Seidel style, where the successive time iterations update the same solution array. Some Gauss-Seidel-style schemes can be accommodated by MWD in a straightforward manner, as it respects the data dependencies across time steps (i.e., iterations). However, the updates would not be ordered identically within each time step due to the spatial blocking performed by diamond tiling and the potential out-of-order scheduling of the diamond tiles in each row of diamonds. Using color-splitting, like red-black Gauss-Seidel, requires doubling the slope of the tiles (both the wavefront and the diamond) to respect the data dependence imposed by the redblack ordering in space. Spatial blocking can respect the red-black ordering, as each point update has dependence over the direct neighbors in the same time step.
Box and diamond stencils. MWD has been described here for star-shaped stencils, where the operator is a sum of one-dimensional operators along each axis. Other important stencil operators have shapes that extend diagonally, such as the 27-point box stencil operator. The major difference in the application of these stencils is the diagonal data dependence of the stencil operator. The current implementation of our approach can handle box and diamond stencil types because our tile shapes already account for data dependencies in the tile's corners and edges.
Pipelined solvers. Implicit solvers for linear systems arising from PDEs typically perform repeated matrix-vector products with a stencil-like matrix that is constant in time, but has variable coefficients in space (i.e., over the rows). Variants such as s-step methods or pipelined Krylov methods chain multiple matrix-vector products, which are equivalent to expanded stencils. The motivation for these pipelined methods is synchronization overhead reduction in distributed memory, not cache efficiency in shared memory, but MWD can still be exploited to improve the intranode performance of these extreme-scale Krylov solvers.
Architectural Generalizations
Deeper memory hierarchies. Future processors are anticipated to have deeper memory hierarchies [43] . MWD can be used with a runtime system to perform hierarchical blocking of the larger and slower memory levels. The runtime system would be invoked infrequently, as each MWD tile involves updating millions of lattice-site updates (LUPs). Cache-oblivious techniques [11, 13, 44] are good candidates for the runtime system, employing space-filling curves at the granularity of MWD tiles to provide automated hierarchical cache blocking for arbitrary memory levels with an optimal asymptotic lower bound on the data transfer. The recursive tessellation nature of diamond tiles makes them good candidates for space-filling-curve algorithms. The extruded MWD tile may be split along the z-axis using parallelepiped tiling to control the tile size and provide more concurrency. This would require multidimensional space-filling curves similar to [44] , with the difference of using diamond tiling along the y-axis, instead of split tiling in all dimensions.
Wide vector units. The long, unit-stride memory accesses provided by MWD tiles are ideally suited for SIMD execution. Moreover, our implementation provides array padding and aligned memory allocation to minimize vector loads across the cache lines from the L1 cache to the registers. Contiguous and aligned memory accesses make our tiles friendly for efficient vectorization by the compiler and other vectorization techniques, as in [5, 31] . The long unit-stride inner loops also result in less loop and vectorization prologue and epilogue overhead.
Perspective on integration with accelerators. Our generic framework could be adapted to the Graphics Processing Unit (GPU). The diamond tile shape enables the decomposition of the domain among CPUs and GPUs with a relaxed synchronization scheme, as we do in our distributedmemory setup. The hierarchical shape of the diamonds (i.e., each diamond can be split into four equal subdiamonds) allows setting different tile sizes for the CPUs and the GPUs while maintaining the tessellation of the subdomains. Each block can be updated by a Streaming Multiprocessor (SMX) of Nvidia GPUs. The threads of the SMX can be kept busy by exploiting the concurrency along the x-axis.
Intratile parallelization. MWD can be integrated with cache-aware stencil optimization techniques that perform explicit tiling, such as Pluto and Physis. For example, Pluto has options to perform diamond tiling and parallelepiped tiling along different dimensions. It may be possible to configure Pluto to perform diamond tiling along the y-axis and small parallelepiped tiling along the z-axis. The wavefront-diamond tiling can be achieved by updating consecutive tiles along the z-axis. These tiles can then be parallelized using our thread group concept through nested OpenMP parallelization.
Energy-saving potential. Modern processors allow dynamic power capping in both the DRAM and CPU resources (in current Intel CPUs this is implemented in the Running Average Power Limit (RAPL) infrastructure). MWD significantly reduces memory bandwidth requirements without increasing execution time. We propose to utilize these savings to reduce the frequency of the memory interface and increase the CPU frequency in order to stay under the node power cap. The CPU and DRAM power cap parameters can be included in our autotuning search space, rendering the autotuning process multiobjective. The current implementation of the autotuner employs our accurate cache block size model to prune the tile size search space. Phenomenological power models [21] may be used to reduce the search space in the power-relevant parameters.
