Stencil algorithms on regular lattices appear in many fields of computational science, and much effort has been put into optimized implementations. Such activities are usually not guided by performance models that provide estimates of expected speedup. Understanding the performance properties and bottlenecks by performance modeling enables a clear view on promising optimization opportunities. In this work we refine the recently developed Execution-Cache-Memory (ECM) model and use it to quantify the performance bottlenecks of stencil algorithms on a contemporary Intel processor. This includes applying the model to arrive at single-core performance and scalability predictions for typical "corner case" stencil loop kernels. Guided by the ECM model we accurately quantify the significance of "layer conditions," which are required to estimate the data traffic through the memory hierarchy, and study the impact of typical optimization approaches such as spatial blocking, strength reduction, and temporal blocking for their expected benefits. We also compare the ECM model to the widely known Roofline model.
INTRODUCTION AND RELATED WORK
Stencil computations on regular lattices are important in many computational science applications operating on regular lattices. The combination of low computational intensity and their iterative nature have made them a popular target for loop optimizations. Beyond established cache blocking techniques, leveraging SIMD capabilities [1] and an efficient shared memory parallelization [2] are subject to recent research. An overview about the state of the art in stencil optimizations can be found in a review by Datta et al. [3] .
In the compiler community modeling approaches based on cache usage have been employed for stencil optimizations more than a decade ago [4] . Modeling was also used to estimate the optimal performance of stencil codes and to validate the effectiveness of applied optimizations. One can roughly separate those efforts into Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org. ICS '15, June 8-11, 2015 , Newport Beach, CA, USA. two categories: "black box" modeling, which relies on statistical methods and machine learning to describe and predict performance behavior based on observed data such as hardware performance metrics [5] , and "white box" modeling, which uses simplified machine models to describe the interaction of code with the hardware from "first principles." Simple bottleneck-centric approaches based on memory bandwidth vs. computational peak performance have been in use for a long time [6, 7] . This line of thinking was popularized and refined by Williams et al. [8] , and is now known as the "Roofline model." The Roofline model is tremendously useful in getting first estimates of the "light speed" of a loop on a given architecture. Its central assumption is that data transfers through the memory hierarchy overlap with code execution on the core(s). This implies that data access latencies are assumed to be hidden by prefetching mechanisms. However, the Roofline model cannot describe relevant bottlenecks beyond memory bandwidth and peak performance. Especially in long-range stencils common in seismic imaging there is no single dominating bottleneck anymore and the influence of in-cache transfers on performance is significant. The recently introduced ECM model provides an accurate single-core performance and scaling prediction. It can be seen as an extension and refinement of the Roofline model for multicore processors. Like the Roofline model it focuses on resource utilization but considers contributions from the complete memory hierarchy. The ECM model was first introduced in [2] and later refined and combined with a multicore power model in [9] . It provides clear insights into the relevant contributions of the hardware model to the performance of a given loop code and therefore allows a clear identification of optimization opportunities. Based on the singlecore performance estimate it also predicts a performance saturation point. This paper refines the ECM model with clear rules for overlapping execution and data transfers and introduces an accessible notation to ease the usage and description of the model. The power of the model to validate and guide performance engineering efforts is demonstrated on the example of three relevant stencil code patterns.
The paper is structured as follows. In Sect. 2 we describe the architecture of the test machine used for measurements. Section 3 introduces and refines the ECM model on the example of simple streaming loop kernels. In Sections 4, 5, and 6 we apply the model to successively more complex stencils to show how it can identify bottlenecks, reveal optimization opportunities, and prevent misleading conclusions. The paper concludes in Sect. 7.
EXPERIMENTAL TESTBED
Unless noted otherwise, a standard two-socket server based on eight-core SandyBridge-EP (SNB) CPUs was used for the measurements. The core supports SSE and AVX SIMD instruction set extensions. Using floating-point arithmetic, each core can execute one multiply and one add instruction per cycle, leading to a peak performance of eight double precision (DP) or 16 single precision (SP) flops per cycle. Memory is connected via four DDR3-1600 memory channels per socket. The SNB core can sustain one fullwidth AVX load and one half-width AVX store per cycle. With SSE or scalar execution, these limits are changed: In both cases the core can sustain either one load and one store, or two loads per cycle, to the effect that many loops do not show a 4× speedup of core execution when going from scalar mode to AVX. The L2 cache sustains refills and evicts to and from L1 at 256 bits per cycle (half-duplex). A full 64-B cache line transfer thus takes two cycles. The L3 cache is segmented, with one segment per core. All segments are connected by a ring bus. Each segment has the same bandwidth capabilities as the L2 cache, i.e., it can sustain 256 bits per cycle (half-duplex) to and from L2. This means that the L3 cache is usually not a bandwidth bottleneck, which is an improvement compared to previous Intel processors. An overview of the specifications is given in Table 1 . In Sect. 5 we use a 3.0 GHz Intel Ivy Bridge CPU for comparison. The relevant differences will be described there. Measurements for the vector summation example were performed with LIKWID-bench [11] , and the Intel C Compiler (version 13.1.3) was used to compile the stencil codes. The clock frequency was fixed to 2.7 GHz.
THE ECM PERFORMANCE MODEL
The ECM model delivers a prediction of the number of CPU cycles required to execute a certain number of iterations n it of a given loop on a single core. Since the smallest amount of data that can be transferred between adjacent cache levels is usually a cache line (CL), n it is typically chosen to be "one cache line's worth of work." With a 64-B CL, this amounts to n it = 8 for a streaming kernel with double precision data and stride-one data access to all arrays (such as the STREAM benchmarks). Depending on the data types and access characteristics this number may vary, however.
Model input, construction, and assumptions
Some input data is required to construct the ECM model for a loop code on a given architecture. We consider two parts of the input separately: The time to execute the instructions on the processor core, assuming that there are no cache misses, and the time to transfer data between its initial location and the L1 cache. For the sake of clarity we review and refine the model, which was first introduced in [2, 9] , using simple streaming loops such as DAXPY (A(:)=A(:)+s*B(:)) before turning to more complex cases. 
In-core execution time
To determine the in-core execution time a simple model for instruction throughput capabilities of a microarchitecture is required. On Intel processors the port scheduler model can be used for this purpose. Figure 1 shows the port model for the Intel Sandy Bridge microarchitecture. We assume that all instructions in a loop are independently scheduled to the ports and executed out of program order within the execution units u i . The execution time is then set by the unit that takes the most cycles to execute the instructions:
An additional limitation that may be relevant is the overall maximum instruction throughput of four instructions per cycle on Intel microarchitectures. 1 For example, the loop body of the DAXPY operation
consists of two loads, one store, one add, and one multiply instruction regardless of SIMD vectorization. In addition there is typically one integer add, one integer compare, and one conditional branch in the body to implement the "loop mechanics." If the loop is unrolled, the relative share of these instructions is smaller and can usually be neglected. With AVX vectorization, one work unit (eight iterations) comprises four load, two store, two add, and two multiply instructions, which require four cycles to execute on the SNB core. In this case the load (and store) pipeline throughput is the bottleneck. In Fig. 2 we visualize this situation in a timeline diagram (cycles 1 to 4). Since there are no loop-carried dependencies in the code, the four-cycle prediction holds if the loop is sufficiently unrolled. In complex cases one can employ the Intel Architecture Code Analyzer (IACA) [12] to get an estimate of the loop body execution time under the throughput assumption.
Data transfers through the memory hierarchy
Data not present in the L1 cache must be fetched from lower levels of the memory hierarchy before it can be processed in CPU registers, and modified data must be evicted to make room for new cache lines. We call the time needed for these transfers the "transfer time." Again we assume the best possible case of perfect streaming, i.e., there are no latency effects and the cost for transferring one CL can be computed using the maximum bandwidth. For the SNB architecture, one CL transfer takes two cycles between adjacent cache levels. Getting a 64-B CL from memory to L3 or back takes 64 B · f /b S cycles, where f is the CPU clock speed and b S is the achievable memory bandwidth as measured by a suitable streaming benchmark.
Calculating the transfer times through the memory hierarchy requires an accurate account of the data volumes between all cache levels. For the DAXPY loop kernel discussed above this means that, due to the cache hierarchy being inclusive, three CLs must be transferred between each pair of adjacent memory levels: two for loading the data for a and b, and one for evicting modified data of a. This leads to transfer times of six cycles each between cache levels, and 13 cycles between memory and L3 cache (at a maximum socket bandwidth of 40 GB/s and a clock frequency of 2.7 GHz). If the "uncore" part of the chip, especially the L3 cache, has a clock speed that is set independently from the core clock, a correction factor can be applied to the L3-L2 cycle count.
Single-core prediction and examples
The in-core execution and transfer times described above must be put together to arrive at a prediction of single-thread execution time. Therefore we must specify which of the runtime contributions can overlap with each other. If T data is the transfer time (L1 downward, as described above), T OL is the part of the core execution that overlaps with the transfer time, and T nOL is the part that does not, then
There are two open questions here: (i) Which components of T core in (1) are overlapping? (ii) How is T data composed of the different data transfer contributions along the memory hierarchy?
The ECM model makes the following fundamental assumptions for single-core performance prediction: (1) All instructions in the loop are independently scheduled to the execution ports and executed out of program order within the execution units. The unit that takes the longest time executing its instructions is the bottleneck on the core level (see Sect. 3.1.1). If dependencies lead to pipeline stalls, these appear as additional cycles on the core level and count as overlapping time in the full model (see below). The maximum throughput capabilities of the microarchitecture are a further limitation. (2) Loads (more specifically, cycles in which loads are retired) do not overlap with any other data transfer in the memory hierarchy. All other instructions, and also pipeline bubbles, overlap perfectly with data transfers. This means that only the time needed for the loads to the L1 cache contributes to T nOL (store-bound loop kernels show significant overlap). (3) The transfer times up to the L1 cache are mutually non-overlapping, i.e., all cache line transfers across the memory hierarchy are serialized. Additional non-overlapping penalty cycles may be added in cases where the prefetchers cannot hide the memory latency completely.
The non-overlapping assumptions seem very pessimistic; they were motivated by the observation that the L1 cache on Intel processors can either communicate with the L2 cache or the registers, but not both in the same cycle. We will show below that they lead to the most accurate description of performance behavior. Due to this non-overlap assumption the ECM model does not provide an absolute upper performance bound. In contrast to the Roofline model it allows for performance measurements to exceed the model prediction.
In case of the DAXPY example on the Intel SNB core, these rules lead to the timeline diagram shown in Fig. 2 . In this particular case the overall execution time is dominated by data transfers, even if the data comes from the L1 cache.
In order to be able to discuss the ECM model more efficiently we introduce a shorthand notation that summarizes the most relevant information from timeline diagrams such as Fig. 2 . We write the predicted cycle counts for T OL and T nOL , separated by " ". After T nOL we attach the individual contributions for the different memory hierarchy levels (which make up T data ), separated by "|":
For example, the ECM model for DAXPY would be written as {4 4 | 6 | 6 | 13} cy. Cycle predictions for data sets fitting into any particular cache level can then be easily calculated from this representation by adding up the appropriate contributions from T data and T nOL , and finding the maximum of that number and T OL according to (3). For instance, the prediction for data in L2 will be max (4, 4 + 6) cy = 10 cy. As a shorthand notation for these predictions we use a similar format but with "⌉" as the delimiter. For DAXPY this would read as T ECM = {4 ⌉ 10 ⌉ 16 ⌉ 29} cy. As a further example that does not exhibit the strong data transfer dominance of DAXPY we pick a double precision vector summation:
We construct the ECM model for this kernel in four cases: naive, scalar, SSE, and AVX vectorization on a SNB core. In all but the "naive" case we assume that appropriate modulo unrolling has been applied (at least 3-way due to the 3-cycle add latency) on top of SIMD vectorization to allow for filling the add pipeline despite the register dependency. The "naive" code is scalar and does not use modulo unrolling, so the full latency must be paid for every add instruction. The results are shown in Table 2 . One can immediately see that the naive code is executing so slowly (T OL = 24 cy/CL due to the dependency on the add pipeline) that it does not "feel" the memory hierarchy even if the data is in memory. The scalar code is dominated by core execution down to the L3 cache and starts to degrade in performance only when the data comes from mem-ory. Furthermore, although SSE vectorization yields a significant speedup in L1 and L2 cache, the gain is only about 20% in memory. Finally, going from SSE to AVX pays off only if the data is in the L1 cache. These predictions are in line with the general observation that the performance of "slow," i.e., unoptimized code is largely independent of the position of data in the memory hierarchy.
Conclusions from the ECM model
The ECM model explains accurately why the theoretical bandwidth limits of all memory hierarchy levels except the L1 cache cannot be hit with a single thread. Even with a highly efficient, bandwidth-demanding loop kernel the non-overlapping contributions to T data from the higher memory levels prevent full bandwidth utilization. If multiple threads are used, bandwidth saturation is possible. See Sect. 3.1.5 for scalability predictions.
Once the execution time for the basic work unit is known from the ECM model, one can calculate the single-core performance by dividing work by time: P ECM = W /T . Any appropriate unit of "work" will do, be it flops, iterations, "lattice site updates," etc. In case of the scalar summation, the performance in flop/s on a SNB core is
Here we allow for a modification of the CPU clock speed away from the base frequency f 0 . At the nominal clock speed of f 0 = 2.7 GHz we get P( f 0 ) = {2.7 ⌉ 2.7 ⌉ 2.7 ⌉ 1.8} Gflop/s. The model shows clearly that the single-core performance changes with the clock frequency even if the data is in memory: P(1.6 GHz) = {1.6 ⌉ 1.6 ⌉ 1.6 ⌉ 1.2} Gflop/s. The larger the relative share of T L3Mem versus the other contributions, the weaker this dependence will be.
Chip-level scaling and comparison with Roof line
For describing the chip-level performance scaling we assume that the single-core performance scales linearly with the number of cores until a bottleneck is hit [13] . In case of the modern Intel processors the only bottleneck is the memory bandwidth, which means that an upper performance limit is given by the Roofline prediction for memory-bound execution: P BW = b S /B C , where B C is the code balance (inverse computational intensity) of the loop code and b S is the memory bandwidth (one may certainly consider other bottlenecks, such as a non-scalable L3 cache like on the Intel Westmere processors). The performance scaling for n cores is thus described by
if P mem ECM is the ECM model prediction if the data is in main memory. This expression allows us to draw a comparison between the ECM model and the Roofline model. The Roofline model considers only a single data transfer bottleneck. This could be main memory access but also any other memory hierarchy level. Constructing the Roofline model for any core count thus requires measured maximum bandwidth numbers for all memory levels and all core counts as input parameters, since none of these values can be predicted by the model. Due to the overlapping assumption of the Roofline model, the predicted single-core performance is then
Here, P core max is the maximum in-core sequential performance, i.e., with all data coming from the L1 cache (one may use the arithmetic peak performance here and treat the L1 cache as an additional • T core is large compared to all contributions from data transfers. Then, if n is large enough, the memory bottleneck may become relevant, but this is described in the same way by both models. One typical example would be a kernel that is absolutely dominated by long-latency operations such as divides or square roots, like the "divide triad" kernel analyzed in [9] .
• The loop code has single-core data transfer and execution characteristics that make its Roofline-based performance estimate coincide with the measured streaming benchmark performance used for obtaining the b S,i (n). We will show an example for this in Sect. 4.3.
If, however, in-core execution time and data transfer times are comparable but much larger than those of the streaming benchmark, the Roofline model is not able to describe the performance characteristics accurately. See Sect. 6 for an example. According to (6) , the performance will saturate at n S cores:
The model confirms the well-known effect that "slow" loop code needs more cores to reach saturation. For example, the AVX summation code has a predicted single-core performance of P mem ECM = 2.1 Gflop/s at nominal clock speed and it will saturate at three cores, while the naive, non-pipelined code with P mem ECM = 0.9 Gflop/s will require six. Hence, in this case a low code quality can be "healed" by using more resources. 2 This would not be possible at a clock speed of 1.6 GHz, since the slow code would then need ten cores to saturate, but the CPU only has eight.
CASE STUDY: 5-POINT STENCIL
We use the arguably most simple non-trivial stencil variant to employ the ECM model in a setting that is more complex than the pure streaming kernels shown earlier. All concepts, however, are generalizable to other stencils. One sweep (complete update of all points) of the 2D five-point Jacobi stencil looks as follows (double precision assumed):
Unless otherwise noted we assume that the array dimensions are large so that the arrays a and b do not fit in any cache. We adopt "lattice site updates per second" (LUP/s), i.e., scalar inner kernel iterations, as an appropriate performance metric.
Basic analysis
With AVX vectorization (4-wide vector registers in double precision), one unit of work (eight LUPs) comprises eight AVX loads, two AVX stores, six AVX adds, and two AVX multiply instructions (note that there will be no benefit from inner loop unrolling to enable re-use of data along the leading dimension). Hence, T nOL = 8 cy and T OL = 6 cy. The code is thus limited by the LOAD pipeline throughput at the core level.
The data transfer properties of the loop nest depend on the problem size in relation to the available cache sizes. The target array b causes a traffic of two CLs across the complete memory hierarchy due to the write-allocate transfer on every store miss. [3] . We call this the layer condition (LC). If the size of cache k is C k , it can be formulated as
The factor of 1/2 is a rule of thumb: we assume that half the cache size is available for storing layers. The number of layers is connected to the stencil radius in the outer dimension. If r is the radius, Table 3 show a summary of the ECM model timing and performance predictions on a SNB core. In column 5 we give the leading dimension below which the respective layer condition is satisfied. This number was obtained by solving (9) for N i and using the cache sizes of the SNB processor in Table 1 . Figure 3 shows the measured performance and the ECM model prediction for a large range of memory-bound problem sizes. After the problem drops out to main memory at N i ≈ 1145 (quadratic domain), three phases can be distinguished. These can be attributed to the layer condition being satisfied in the L2 cache, the L3 cache, and not at all, corresponding to rows 2-4 in Table 3 . An L1 layer condition would require a leading dimension of less than 683, which leads to an in-cache problem size if N i = N j . For each phase we also show the predicted code balance for different levels of the memory hierarchy. These results indicate that the Roofline The dashed line in Fig. 3 shows the ECM model prediction, which is in good agreement with the measurement in all three phases. The model fits well within a 10% margin, and it is even slightly below the measurement if the layer condition is fulfilled in L2 cache. We have observed this effect in many situations where the pressure on the memory hierarchy gets smaller when moving towards main memory (i.e., when the number of streams goes down). One may speculate that the non-overlapping assumption of the model is inaccurate in these cases; the deviation is generally small, however.
Single-core performance vs. problem size
Despite the performance fluctuations (especially in the phase with B C = 40 B/LUP), the actual memory code balance obtained by measuring the data volume with hardware performance events is within 5% of the predicted value.
Spatial blocking optimization
Spatial blocking (also called "loop tiling") is a well-known optimization for loop nests that aims at improving temporal data access locality. In the context of stencil solvers it amounts to rearranging the order of the stencil updates so that some layer condition is met, leading to a decrease in code balance. We can predict the expected gain of spatial blocking on the single core by setting up an ECM model for different blocking strategies. This is especially simple for the 2D Jacobi stencil as the layer condition (9) only depends on the leading dimension. One must conclude that it is sufficient to block the inner loop:
// inner block size = b_i for ( is =1; is <N i -1; is += b i ) for ( j =1; j <N j -1; ++ j ) for ( i = is ; i < min (
The layer condition (9) now takes a modified form:
Hence, the block size b i determines whether the layer condition will be satisfied in cache level k, and column 5 in Table 3 again pro- vides the relevant thresholds. Choosing b i appropriately is called "blocking for cache level k." Figure 4 shows performance vs. leading dimension for three block sizes and no blocking. These four cases (squares, circles, triangles, gray line) correspond to the four possible layer conditions in Table 3 , and the ECM models given there apply. The solid lines denote the ECM performance predictions. The dotted and dashed-dotted lines mark the Roofline model predictions derived from measured single-thread cache and memory bandwidths. 3 Since the Roofline model predicts a memorybound situation in all cases (independent of the blocking), there are only two performance levels: a lower level for B mem C = 40 B/LUP and a higher level for B mem C = 24 B/LUP. Although the Roofline model cannot distinguish between different blocking strategies, the best and worst case scenarios (L1 blocking and no blocking) are covered with useful accuracy. Note, however, that this is a coincidence: As the ECM model shows, the runtime contributions are rather evenly distributed across the memory hierarchy. The Roofline model works in this case because the benchmark used for taking the single-thread memory bandwidth limit has similar characteristics (in fact, the ECM model prediction for the STREAM COPY loop is {4 ⌉ 10 ⌉ 16 ⌉ 29} cy, and the number of data streams is the same as for the Jacobi 2D kernel with L1 blocking).
The data shows that the ECM model is able to describe the performance impact of spatial blocking quite accurately, and that the gain from moving a layer condition up in the memory hierarchy (i.e., satisfying it in a higher cache level) can be leveraged also for larger leading dimensions. In case of pure L1 blocking, however, the observed performance was at first lower than expected. A measurement of the memory data traffic revealed that the actual code balance was much higher than 24 B/LUP, which was caused by the hardware prefetcher: If the inner block size b i is small, the leading array dimension is only traversed partially, but the hardware prefetcher cannot foresee when the loop ends. It fetches cache lines from beyond the actual end of the short inner loop block. Those cannot be used before the next inner block is updated, they are thus 3 evicted prematurely and cause excess memory traffic. Additional blocking in the outer dimension was added to correct this problem:
for ( js =1; js <N j -1; js += b j ) for ( is =1; is <N i -1; is += b i ) for ( j = js ; j < min (N j -1 , js +b j -1); ++ j ) for ( i = is ; i < min (
In Fig. 5 we show the effect of the outer loop blocking with block size b j on the performance and on the measured in-memory code balance at a constant inner block size of b i = 800. It turns out that b j ≈ 300 alleviates the prefetcher problem: blocks of size b i × b j are updated consecutively along the leading dimension, and all excess cache lines can be used. Note that one can determine the excess traffic, and hence the prefetch distance, from the measured data volume at large b j . We have calculated the prefetch distance to be about 33 CL in our case, but this result is probably not generic since the details of the prefetch mechanism are unknown. Data overheads at block boundaries (of which eager prefetching is just the most striking example) may be incorporated into the model if the corresponding data volumes are known. Although all considered problem sizes are bigger than the L3 cache, the execution is far from memory bound. Even when blocking for the L1 cache (squares in Fig. 4) , the memory bandwidth as calculated from the product of performance and code balance is barely 17 GB/s out of a maximum of over 40 GB/s.
Since the code balance is already at its possible minimum of 24 B/LUP across all cache levels, spatial blocking cannot improve the single-core performance any further. The ECM model prediction in the first row of Table 3 shows that if one were able to reduce the core time by a factor of two (from 8 to 4 cycles) by some advanced optimization technique such as register blocking, singlecore performance would improve by a factor of 33/(33−4) = 1.14.
Multi-core scaling
When going from single-threaded to multi-threaded execution on a multicore chip, two architectural features impact the performance behavior: (i) the memory bandwidth bottleneck and (ii) the shared outer-level cache (if present). In Fig. 6 we show performance scaling across the cores of a SNB socket at N i = 1.2 × 10 6 . Two-dimensional L1 blocking was only used where necessary. In all cases (blocked or not), the j loop was parallelized with OpenMP and static loop scheduling. This ensures that the prefetcher problem described in the previous section remains solved, since each thread works on a consecutive string of blocks in the leading dimension. Figure 6a compares the ECM model and scaling predictions with measurements for the two extreme cases of no blocking (circles) and L1 blocking (stars). The general scaling behavior is described well, although there is a deviation in the vicinity of the saturation point, which we attribute to changes in the prefetching strategy when the memory interface is almost fully utilized [10] : For instance, cache lines are only prefetched into L3 and the prefetch distance is smaller. This leads to an additional latency penalty, which is not part of the ECM model (but could be included). The saturation level is slightly smaller than predicted by the code balance, but the deviation is well below 10%.
In Fig. 6b we study the influence of the block size on the scalability when blocking for the L3 cache. At b i = 2.3 × 10 5 (squares) the single-core speedup versus the non-blocked version (circles) is as expected, but performance saturates at a level that indicates a code balance of 40 B/LUP. Reducing the block size to b i = 1.1 × 10 5 shows no improvement on the single core but leads to much higher performance up to four cores. Beyond four cores, however, performance degrades and settles again at the same level as before. The reason for this behavior is the shared L3 cache, which requires a modified layer condition:
The n-dependent block size ensures that three layers per thread fit into the cache, and leads to the saturation performance expected from a code balance of 24 B/LUP (triangles). Note that block size adaptivity is not required when blocking for core-private caches, since their aggregate size scales with the number of cores.
Finally we need to comment on the influence of the chosen blocking strategy (L1, L2, or L3) on the scalability. Technically the ECM model predicts different saturation points for L3 blocking (4 cores) and L1 or L2 blocking (3 cores), as shown in the last column of Table 3 . In reality they display very similar saturation behavior for the Jacobi 2D kernel (see Fig. 6c ), although the general trend to saturate earlier with faster single-core code can certainly be observed. If maximum performance is the only significant metric, it is irrelevant in practice which blocking strategy is chosen as long as the in-memory code balance is at its minimum of 24 B/LUP. Saturating earlier, however, has some advantages: "expendable" cores can be put into a low-power mode, saving energy ("concurrency throttling"), or they can be used for other tasks such as asynchronous communication.
Note that the use of non-temporal stores (also called "streaming stores") may reduce the in-memory code balance from 24 B/LUP to 16 B/LUP by ignoring the cache hierarchy on a store miss, saving the write-allocate transfers. However, their implementation on Intel processors cannot be described by a simple throughput assumption. Therefore it is unclear as yet how to incorporate the non-temporal store instructions in the ECM model.
CASE STUDY: UXX STENCIL
The "uxx" stencil [14] is a part of a simulation code for dynamic rupture and earthquake wave propagation. The basic loop nest without any blocking optimizations looks as follows:
The code is used with either single or double precision, and it contains a divide operation in the inner loop. Without any guidance by performance modeling one would probably try to eliminate the divide (which is non-trivial) because "divides are expensive."
Analysis and ECM model
The loop code as generated by the compiler is quite complex, so we employ the IACA tool. At double precision (DP) the through- put analysis reports the divider to be the bottleneck on the core, causing a core time of T core = T OL = 84 cy per CL (eight iterations) due to the very slow 42-cycle throughput for the divide instruction (vdivpd). The non-overlapping part is T nOL = 38 cy, caused by the loads to the arrays xx, xy, xz, d1, and u1. With single precision (SP) the machine code does not contain any divide at all: the compiler uses the vrcpps instruction instead, which provides a lowprecision (11 bits) but fully vectorized and pipelined reciprocal, and employs a subsequent Newton-Raphson step and a multiply to perform a 22-bit divide [10] . This results again in T nOL = 38 cy as the load instructions are the same as with DP. The overlapping part is T OL = 35 cy, caused by add, multiply, and 16-byte move instructions. The code is close to the limit of four micro-ops per cycle on the SNB, which is why IACA reports a front-end bottleneck and an overall throughput of T core = 45 cy. Since front-end stalls overlap with the transfer time, the seven extra cycles have no significance and will be ignored in the following. Calculating the transfer time requires an analysis of relevant layer conditions. There are two types of layers in a 3D stencil code: one-dimensional layers along the leading dimension (rows along i of size N) and layers that span the extent of the inner twolevel loop nest (size N × N). At relevant problem sizes only the latter will matter, and we can safely ignore the "row conditions" since they will be automatically fulfilled in the L1 cache. The only data structures which are subject to layer conditions are thus xz and d1, because their stencil radius is larger than zero in the outer (k) dimension. The cache must hold as many layers as there are distinct outer index references, i.e., two for d1 (which is accessed in layers k and k − 1) and four for xz (which is accessed in layers k − 2 to k + 1). Hence, the layer condition for n threads in the shared L3 cache and blocking in the j dimension is
One may certainly consider higher cache levels or block the inner dimension as well, but we omit this discussion here since it would not provide new insights. With L3 blocking, the transfer time will be independent of the precision. Six consecutive data streams hit the memory interface per thread: one each for xx, xy, xz, and d1, and two for u1. The code balance in memory is thus B mem C = 24 B/LUP in SP and 48 B/LUP in DP. Assuming that the higher caches are too small to hold the layers, the L3 cache will be hit by ten streams per thread.
The analysis in the previous sections now enables us to construct the ECM model for the uxx stencil. The first two rows in Table 4 show the model and the cycle predictions for SP and DP. For comparison we also include a special DP version ("noDIV") in which the divide as substituted by a multiplication.
The ECM model yields the same cycle prediction per unit of work in main memory for all versions (104 cy), hence we expect a factor of two in performance between SP and DP. Also, all versions should saturate at four cores. Probably the most striking result is that the presence or absence of the divide does not make a difference as
despite the low-throughput divide. Going to great lengths in trying to eliminate it from the code will thus not be rewarded with performance improvements. These predictions are confirmed by the performance data shown for one SNB socket in Fig. 7a . There is actually a slight speedup for the "noDIV" version below saturation. Using the latency analysis in the IACA tool revealed that the critical code path through the loop body has a worst case execution time of 192 cy. Usually the throughput prediction is much closer to the measured runtime since the out of order logic can fill a lot of pipeline bubbles across the independent loop iterations. In this case, however, the critical path is so long that (13) is just barely satisfied (in fact there is a 10% deviation from the model on the single core). Removing the divide shortens the critical path so much that (13) can be easily met, and the ECM model prediction is very precise. For comparison we present the results on an Intel Ivy Bridge CPU (3.0 GHz, b S = 47 GB/s) in Fig. 7b . This architecture is quite similar to Sandy Bridge apart from the higher memory bandwidth and the faster divide throughput (28 cy instead of 42 cy for a double precision divide with AVX). The accuracy of the model is even better here because of the shorter critical path in double precision. The ECM model in this case is {56 38 | 20 | 20 | 25} cy, leading to a prediction of {56 ⌉ 58 ⌉ 78 ⌉ 103} cy. The divide instruction has no impact if the data does not fit into the L1 cache.
Optimization opportunities
One typical stencil optimization that goes beyond spatial blocking is temporal blocking, which can dramatically reduce the inmemory code balance by performing multiple updates on each grid point in cache before it gets evicted. The ECM model can provide upper limits to the expected benefit of temporal blocking. In case of uxx, optimal temporal blocking for the L3 cache would completely remove the memory transfer time of T L3Mem = 26 cy, resulting in a 24% (DP) or 33% (SP) speedup on the single core. The next bottleneck will then be the divide for DP (84 cy) and the L3 transfer time for SP (78 cy), respectively, but both of these are purely core-local.
The true potential of temporal blocking is thus not in the speedup on the single core but in the removal of the memory bandwidth bottleneck, allowing for scalable performance with an upper limit of over 2000 MLUP/s even for DP and still including the divide. Even in this case the benefit from temporal blocking would by far outweigh the speedup gained from removing the divide.
See [15] for a case study of temporal blocking with a 3D Jacobi smoother supported by an ECM model analysis.
CASE STUDY: 3D LONG-RANGE STEN-CIL
A large stencil radius in the outer grid dimension leads to a high pressure on the cache levels since they have to hold a large number of layers. As an example we consider a constant coefficient longrange stencil code in single precision with a radius of four in each dimension:
# pragma omp parallel for for ( int k =4; k < N -4; k ++) { for ( int j =4; j < N -4; j ++) { for ( int i =4; i < N -4; i ++)
The only array that is relevant for layer conditions here is V; the others (U and ROC) are accessed in linear order without potential for cache re-use.
Analysis and ECM model
The IACA tool reports an overlapping core time of T OL = 68 cy, which is mainly due to adds and frontend stalls, while the load port limitation is at T nOL = 62 cy (there are only 27 visible loads in the loop body, but due to some integer register spilling four more cycles are required). Since the stencil radius is 4 in all directions, the block size in the j direction must be chosen so that the cache can hold nine layers. The n-thread layer condition is
In this case, four data streams per thread will hit the memory and the code balance will be B mem c = 16 B/LUP. If the layer condition is only satisfied in L3, this cache will be hit by twelve streams per thread, which already indicates that the transfer time is significant for this kernel and that the main contribution to it does not come from main memory. Indeed the ECM model is {68 62 | 24 | 24 | 17} cy, which leads to a cycle prediction of {68 ⌉ 86 ⌉ 110 ⌉ 127} cy. Hence, only 17/127 ≈ 13% of the execution time is attributed to T L3Mem , and the code will just barely saturate at eight cores. If the layer condition is not satisfied at all, the memory code balance will rise to B mem c = 48 B/LUP. In Fig. 8 we show performance data on a SNB socket with and without blocking. At the chosen problem size of 480 3 , the layer condition is fulfilled without blocking for a single thread, which is why the single-core performance and prediction is the same in both cases. At two cores and above, however, blocking is required. The blocked code scales well almost up to the full socket, as predicted by the ECM model. This stencil provides a good example for a situation where the Roofline model fails to deliver a useful performance estimate. Due to the large in-core execution time, the loop is predicted to be corebound until the memory bandwidth is saturated, which happens at four cores (top data set in Fig. 8) . However, the other data transfer contributions are far from negligible, so the Roofline model is much too optimistic here due to its overlapping assumption.
Optimization opportunities
The blocking optimization has almost removed the memory bottleneck, and the ECM model predicts that temporal blocking, which would eliminate T L3Mem altogether, would only have a minor effect on the single core and the full socket. In order to get any performance improvement one would have to reduce the major contribution in T ECM mem , which is T nOL . If all core contributions including T nOL could be shrunk by 50%, the ECM model would be {34 31 | 24 | 24 | 17} cy and the prediction would be {34 ⌉ 55 ⌉ 79 ⌉ 96} cy. The single-core code would be accelerated by 33% and saturation would set in at six cores, making temporal blocking a viable optimization. Possible code transformations leading to a substantial improvement of T core would have to save on arithmetic operations as well as on loads. The semi-stencil algorithm [16] is one recent example.
SUMMARY
In this work we have reviewed the ECM performance model using simple streaming kernels and refined it to clearly state its assumptions regarding the overlap of different contributions to the single-core runtime of a loop kernel across the memory hierarchy. By introducing a short-hand notation we could ease the presentation of the model results and performance predictions. We have then applied the model to three stencil algorithms with different code characteristics on an Intel SandyBridge processor. Using a 2D Jacobi stencil as the initial example we were able to explain the single-core performance vs. problem size by accurately calculating the runtime contributions. The performance and scalability impact of spatial blocking for establishing "layer conditions" in different cache levels was quantified using the ECM model, and it was shown why outer loop blocking is required although the layer condition does not depend on the outer problem dimension. The performance of a more complex 3D stencil with an expensive divide operation in the inner loop was shown to be independent of the presence of the divide because of the dominance of transfer times. Temporal blocking was predicted to show a minor speedup on the single core but a major performance boost on the full chip even if the divide operation is left untouched. Finally we have used the model to show that a 3D long-range stencil cannot saturate the memory bandwidth of the CPU at minimum in-memory code balance because of in-cache contributions to the single-core runtime. Any optimization attempt would thus have to aim at reducing the in-cache runtime before more advanced techniques such as temporal blocking are applied.
It must be stressed that we have barely scratched the surface of the vast range of known stencil optimizations, but the basic line of thinking has been clearly demonstrated. We have also shown when and why Roofline model predictions fail to coincide with the ECM model. The ECM model is to our knowledge the only model that can usefully estimate the contributions to the single-core execution time of streaming loop kernels, of which stencil codes are just one example. It leads to a clear understanding of relevant bottlenecks and potential optimization approaches but can mostly be set up using "pencil and paper." Work is ongoing to build a simple tool that can construct the model from a high-level description of the code and the architecture, making it simpler to apply for the non-expert.
One basic limitation of the ECM model is its "streaming assumption:" The model is over-optimistic if the data transfers through the memory hierarchy are latency-bound, which is the case, e.g., in "pointer chasing" scenarios, or on processors for which hardware and/or software prefetch mechanisms cannot hide the latency cost. In addition, we do not know yet how to incorporate the cost of non-temporal stores. Finally we have observed the model to be too optimistic for the memory data transfer time of tight loops with low T core on processors with high-frequency memory interfaces. These shortcomings may be "fixed" by introducing additional latency penalties, but we consider such an approach undesirable, since it would introduce additional parameters just for making the predictions more accurate. The main goal of the model is, however, to understand trends and bottlenecks, for which a precise performance prediction is often not required.
