Abstract-As the cost of data movement increasingly dominates performance, developers of finite-volume and finite-difference solutions for partial differential equations (PDEs) are exploring novel higher-order stencils that increase numerical accuracy and computational intensity. This paper describes a new compiler reordering transformation applied to stencil operators that performs partial sums in buffers, and reuses the partial sums in computing multiple results. This optimization has multiple effects on improving stencil performance that are particularly important to higher-order stencils: exploits data reuse, reduces floating-point operations, and exposes efficient SIMD parallelism to backend compilers. We study the benefit of this optimization in the context of Geometric Multigrid (GMG), a widely used method to solve PDEs, using four different Jacobi smoothers built from 7-, 13-, 27-and 125-point stencils. We quantify performance, speedup, and numerical accuracy, and use the Roofline model to qualify our results. Ultimately, we obtain over 4× speedup on the smoothers themselves and up to a 3× speedup on the multigrid solver. Finally, we demonstrate that high-order multigrid solvers have the potential of reducing total data movement and energy by several orders of magnitude.
I. INTRODUCTION
Finite-Volume/Finite-Difference solutions for partial differential equations (PDE) have, at their base, computations of stencils. A stencil is a linear transformation of the form L(a) i = j α j a i+j . Here i, j are integer tuples, a is a multidimensional array of floating-point data that approximates on a rectangular grid the solution to some differential equation, and L(a) approximates the application of a differential operator to a function evaluated on a rectangular grid. The error in the approximation is proportional to some integer power p of the grid spacing (p is referred to as the order of accuracy of the discretization). Most prior work on optimizing stencil computations focus on lower-order methods (typically p = 2) where there is limited reuse of data and computations are notoriously memory bound. Much of this prior work reduces the amount of data movement by fusing multiple stencil sweeps through techniques like cache oblivious, time skewing, wavefront or overlapped tiling [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] .
For a given partial differential operator, the number of points required in a stencil typically increases as a polynomial in the order of accuracy. This is in contrast to the exponential dependence on p in the reduction of the error. Furthermore, larger stencils, while requiring larger numbers of floating-point operations, can be organized to require a comparable degree of main memory data movement as their lower-order counterparts. Thus, with processor architectures becoming more compute-intensive [13] , high-order schemes are increasingly important as they can achieve greater accuracy with less data movement. This property, combined with the need for computational scientists to minimize the memory capacity required to obtain a given level of error in their simulations, have been motivating the effort to increase the order of accuracy of stencil-based algorithms for PDE.
In this paper, we introduce and evaluate a novel compiler transformation which implements array common subexpression elimination [14] by recognizing and exploiting the high degree of symmetry in stencil coefficients and generating code that exposes efficient SIMD parallelism to backend compilers. By recognizing that stencil computations are associative and can be reordered, array common subexpression elimination identifies common floating point operations across loop iterations and reuses these calculations. In this paper, we decompose the three-dimensional stencils to a number of two-dimensional planes, compute partial sums of the elements, and then buffer these partial sums across iterations to derive the output values. This optimization exploits data reuse, thus improving memory hierarchy performance, while also eliminating redundant floating-point computation. Reducing floating-point computation is particularly valuable for large higher-order stencils, as they can be compute bound, and their computation can also stress the register capacity [15] .
To investigate the efficacy of this approach, we apply it to various discretizations of Poisson's equation, with orders of accuracy p ranging from 2 to 10. Numerical solutions to Poisson's equation are ubiquitous in simulations of a variety of physical problems including fluid dynamics, astrophysics, electromagnetics, and plasma physics. In addition, the stencils used here are good proxies for higher-order stencil operators from a broad range of PDE problems arising in computational science. We evaluate our approach in the context both of applying the operators in a stand-alone fashion, and within a Geometric Multigrid (GMG) method for solving Poisson's equation using these discretizations. We combine partial sums with the communication-avoiding and thread scheduling optimizations of prior work [16, 17] . Our compiler's ability to compose transformations allows it to take a higher-order smoother, remove its computation bottleneck and make it bandwidthlimited, enabaling it to then further apply bandwidth-reducing optimizations. Overall, this paper makes the following contributions:
• We describe the partial sum optimization within the CHiLL compiler [18] , which goes beyond related manual [19] [20] [21] and compiler optimizations [14, 15] by simultaneously addressing DRAM and cache bandwidth while reducing floating-point computation and facilitating SIMDization.
• Overall we demonstrate up to a 3× speedup for the entire multigrid solver. Our study is in contrast to much of the prior work in this area, which has been dominated by looking at the stencil operator in isolation.
• We demonstrate our 10 th order finite-volume method delivers roughly a 1000× increase in accuracy for every 8× increase in memory vs. 32,768× increase in memory for the 2 nd order (requisite memory ≈ error −3/order ) with only a factor of two difference in stencil performance.
II. STENCIL DEFINITIONS AND ACCURACY
We will consider a collection of test problems for stencil calculations all of which are discretizations of Poisson's equation
, Here f is some given function, and we want to solve for φ. For this study, we will assume that φ and f are periodic functions on [0, 1] d , i.e. φ(x) = φ(x + q) for all integer tuples q, and similarly for f . We compute approximate solutions on a rectangular lattice φ In this work, we examine the four representative stencils operators shown in Figure 1 . Points with the same color have the same value for the coefficient. Stencils with the high degree of symmetry shown here are a consequence of the use of centered-difference approximations on rectangular grids, which is ubiquitous in discretizations of Poisson's equation and other constant-coefficient, elliptic operators on such grids. The standard approximation for explicit integrations use either the well known 7-point [22] or 13-point stencils [23] , with secondand fourth-order accuracy, respectively. The sixth-order (27-point) and tenth-order (125-point stencils) in our study are what are known as Mehrstellen stencils. They achieve their stated accuracy only if the right-hand side is pre-processed.
Rather than solving L
where M is a stencil operator whose coefficients sum to 1. This is a one-time operation applied to the right-hand side, prior to applying whatever solution algorithm that is used (e.g. geometric multigrid (GMG) as is used here), and hence does not have significant impact on the performance of the solver. The 27-point stencil and the associated Mehrstellen correction stencil are classical, and can be found in [22] . The 125-point stencil and its associated Mehrstellen correction stencil are new, and will be published elsewhere [24] . For the purposes of this paper, it is only necessary to know the symmetries of the operator, which are summarized in Figure 1 . The accuracy claims for this method will be verified in Section V.
III. STENCIL REORDERING: PARTIAL SUMS
For almost all stencils, there is data and operation reuse between neighboring points. This reuse is more significant for higher-order stencils, which examine many more neighboring input points to compute each output point. In this paper, we develop a compiler transformation that exploits this reuse to reduce loads, while removing floating-point operations that are redundant across multiple output calculations. Thus, we improve both computation and memory costs of the higherorder stencils. The transformation recognizes that stencil computation can be reordered, and therefore computes and reuses partial results in partial sums.
In stencil computations, out-of-place updates are loop nest computations where the right-hand sides are read-only arrays per stencil sweep (e.g., Jacobi). In-place stencil computations read and write the same array each sweep (e.g., Gauss-Seidel). Stencils may be the sum of pairwise products of two or more arrays (variable-coefficient) or a weighted sum of a single array (constant-coefficient). The partial sum transformation described in this paper targets constant-coefficient, out-of-place stencils. This section illustrates the partial sum transformation, while details of the compiler implementation are presented in Section IV, and performance impact as well as interaction with other optimizations is described in Section V.
For an illustration of computing stencils via partial sums, consider the 9-point 2D stencil of Figure 2 Figure 2 (bottom), these points are the right edge for iteration j, i , the center for iteration j, i + 1 and the left edge for iteration j, i + 2 . Therefore, we employ an optimization that avoids loading all nine inputs, but instead loads only the right edge while reusing data from the previous two iterations of the i loop.
At iteration j, i the right edge,
are loaded. We capture the contribution these points make to the outputs at j, i , j, i + 1 and j, i + 2 by calculating the weighted sum of the loaded edge with coefficients corresponding to the right, left and center edge, as illustrated in Figure 3 . The compiler constructs an array of coefficients to be used in the partial sum transformation. If we visualize the array of coefficients as a box as in Figure 3 , with its entries corresponding to coefficients of the stencil, then the weighted sum of input array points with the for (j=0; j<N; j++) for (i=0; i<N; i++) right edge of the array of coefficients computes B0, the center computes B1 and the left edge computes B2. B0 is used to compute the output at j, i , while B1 and B2 are buffered for outputs at the next two iterations of the i loop.
As mentioned earlier, symmetries in stencil coefficients are ubiquitous, and we exploit symmetry to reduce floating-point operations in computing B0, B1 and B2. The colors in the array of coefficients in Figure 3 represent symmetry, with each color representing a unique coefficient. The code implementing partial sums in Figure 2 can now be explained using Figure 3 is loaded to r1. B0 is the weighted sum of r1 and r2 with coefficients w1 (blue) and w2 (green), respectively, and B1 with coefficients w3 (red) and w2 (green).
In a similar manner, for 3D stencils we compute partial sums of 2D planes instead of 1D lines. Figure 4 shows a 2D crosssection of a 3D stencil, and the symmetries present. Figure 5 shows simplified generated code for the 3D 27-point stencil. Three buffers are allocated, for the left, center and right planes; and three scalars are created for three unique coefficients in each 2D plane. In the general case, the number of buffers required is twice the radius of the stencil plus one (radius is 1 for 7-and 27-point stencils, 2 for 13-and 125-point stencils).
While a reference implementation of the 9-point stencil necessitates 11 floating-point operations (8 adds, 3 multiplies) to compute each output point, our approach requires only 9 floating-point operations (3 adds and 4 multiplies for the partial sums plus 2 adds to sum the symmetric partial results). The reduction in floating-point operations becomes more significant with the size and dimension of the stencil operator; for example, the 125-point stencil has 124 adds, but once optimized it has only 38, for a more than 3× reduction.
An additional advantage of this approach is that it results in code amenable to SIMD code generation, AVX and SSE for our target architectures. Intuitively, SIMDization is enabled because the calculations in the loop, including the partial sum calculations, have unit stride across the innermost i dimension. While we are unable to isolate the SIMDization benefits, we performed an experiment to illustrate the differences between the 125-point stencil code before and after the partial sum optimization. Using Intel Architecture Code Analyzer (IACA), we profiled both versions of the code on an Intel Westmere i5-540M platform. 1 In the partial-sum-optimized code, all floating-point adds use SIMD instructions, whereas only about two-thirds of the adds in the baseline code use SIMD instructions. However, L1 and shuffle bandwidth can limit the ultimate benefit from increased SIMDization.
Comparison with Array Common Subexpression Elimination. A compiler formulation of a related reordering transformation called array common subexpression elimination was described in [14] . Array common subexpression elimination is implemented using an abstraction called a tablet, which records the structure of the stencil inputs and their coefficients. To capture reuse, redundancy conditions, heuristics and benefit functions are used to generate subtablets of the tablet. The subtablets are used to compute partial sums which are reused via scalar temporary variables. This method of exploiting reuse through the subtablets is more complex but more general than the partial sums method. The subtablet construction allows for reuse of points other than a plane and enables handling of multi-statement stencils. However, exploiting reuse in scalar registers introduces a scalar dependence across loop iterations and inhibits instruction-level and SIMD parallelization by native backend compilers.
In contrast, the partial sums approach always picks the leading plane of points, and it buffers the computed partial results in vectors. Picking the leading plane and explicitly looking for symmetry to reduce redundant computation is simpler than using heuristics, redundancy conditions and cost functions to discover reuse and symmetry from subtablets. Further, partial sums avoids introducing dependences and generates code easily vectorized by the native backend compiler. The resulting code no longer has cross-iteration reuse in the innermost loop and thus avoids the data stream alignment problem which results in inefficient SIMD code, as described in [25] .
Interplay with communication-avoiding optimizations. The partial sum optimization reduces floating-point operations in compute-bound kernels to make them more memory bound. Once memory bound, communication-avoiding optimizations can be used to further improve performance, such as overlapped tiling (via larger ghost zones), loop fusion and wavefronts [8, 16, 17, 26] . Overlapped tiling reduces inter-processor communication. Loop fusion and wavefront computation reduce communication to DRAM by fusing multiple grid sweeps into one. Wavefronts are generated by loop skewing followed by loop permutation. Skewing breaks data dependences allowing permutation, and, the loop skew factor must increase with stencil radius. Wavefront exploits reuse but increases the working set, which may result in spills from faster caches. CHiLL can generate nested parallel OpenMP code to reduce the working set per thread. As will be empirically demonstrated in Section V, if used together, partial sums and communication- avoiding optimizations can achieve substantial performance gains for high-order stencils.
IV. COMPILER IMPLEMENTATION
The partial sum transformation described in the previous section has been implemented in the CHiLL polyhedral transformation and code generation framework [18] and extends previous communication-avoiding optimizations in CHiLL targeting GMG [16, 17] . Building new transformations into a polyhedral framework easily allows for composition of transformations as long as data dependences are not violated. We compose partial sums with loop transformations to target both computation and communication bottlenecks.
The input to CHiLL is a source code written in C or Fortran and a transformation recipe describing the set of transformations to be composed to optimize the provided source [27] . This recipe can be written by an expert programmer, or derived automatically by a compiler decision algorithm. A new script command for the partial sum transformation has been implemented in CHiLL, which identifies the stencil statement to which this transformation should be applied. In this paper, which explores the impact of different optimization strategies, the transformation recipes were manually written, extending the recipes for communication-avoiding optimization described in [16] to also invoke the partial sum transformation. This section describes the abstractions used by the compiler in the automation of the partial sum transformation.
We make the following assumptions about the input code to our framework. As is standard with polyhedral compiler frameworks, we require that all subscript expressions are affine, or linear combinations of the loop indices and loopinvariant variables. The partial sum optimization requires that the subscript expressions are separable, such that each dimension references just a single loop index. Partial Sum applies to out-of-place stencils. Out-of-place updates are loop nest computations where the right-hand sides are read-only matrices per stencil sweep (e.g. Jacobi). Currently we are limited to constant coefficient out-of-place stencils.
Background on Polyhedral Compiler Frameworks. In most representative applications, stencils are implemented as multidimensional loop nest computations (all stencils are 3D in this paper). In CHiLL, we represent this loop nest by an iteration space IS, which mathematically describes polyhedra corresponding to points in the 3D iteration space:
By convention, l 3 is the innermost loop of a 3D loop nest. It is standard to normalize iteration spaces to start at 0. Bounds constraints can be far more complex, but for simplicity of explanation, we show an upper bound that is a constant or variable. The compiler applies a series of affine transformations that map from the original iteration space to transformed iteration spaces. A separate dependence graph is used to determine the safety of the transformations to be applied. When all transformations are completed, polyhedra scanning is used to scan the iteration spaces and generate transformed loop nests [28] . The array index expressions are mapped to the new iteration space. In the following we will describe the partial sum transformation, which modifies both iteration spaces and statements.
Abstractions for Partial Sums. Examining the statement associated with a stencil computation, the compiler builds four abstractions to perform the partial sum optimization: (1) StencilPoints refers to the set of points that comprise the stencil, offset from a specific iteration in the 3D iteration space; (2) BB is the axis-aligned bounding box of StencilPoints, such that each dimension derives its lower and upper bound from the minimum and maximum values in that dimension; (3) Coeff is a 3D array the same size as BB to hold the coefficients for the points in the stencil; and, (4) Buffer is a set of arrays that are used to hold the partial sums in the generated code. This subsection describes how these abstractions are derived automatically by the compiler and used by the code generator to produce code analogous to the example in Figure 5 . For simplicity, we assume a unit stride for the loop iteration spaces, but extensions for non-unit-stride loops are straightforward. 
Deriving StencilPoints and BoundingBox (BB
Deriving Coefficients (Coeff). If StencilPoints and BB represent the identical volume, as in the 27-point stencil, we call this a full stencil. The star-shaped 7-point stencil operators of Figure 1 (a) are not full.A set difference mathematically determines the holes in the stencil.
The compiler creates an array Coeff, the same size as BB, to store the coefficients for the partial sum. For simplicity of explanation, we will assume Coeff is centered at 0, 0, 0 and allows negative indices. Points B = b 1 , b 2 , b 3 ∈ BB which belong to StencilHoles are set to zero in the array of coefficients, and others are assigned appropriate constant values. We can then rewrite the stencil computation at each output point I, using the following equations:
Deriving Partial Sums and Buffers. We form partial sums for subsets of BB, planes for a 3D stencil or similarly, lines for a 2D stencil as in Figure 2 . Plane k is the p 1 , p 2 plane inside the BB at p 3 = k. The BB can be partitioned into (ub 3 − lb 3 + 1) such planes (corresponding to the stencil radius plus 1); there is a corresponding plane in Coeff such that points in BB are array indices for elements in Coeff. We can compute an output point I as a sum of partial sums PS k at each plane k ∈ BB, which is staged in a buffer. There are (ub 3 − lb 3 + 1) buffers, each as wide as the trip count of the inner loop. Within the innermost loop, the values in the buffers are summed to compute the output. These rewrites of the stencil are captured as follows:
Exploiting Reuse in Partial Sums. The optimizations derived from partial sums recognize that loads associated with a plane have reuse in the third dimension. At iteration I, the rightmost plane Plane r ( I) where r = ub 3 is of particular importance. This is the leading plane in the bounding box of the stencil (in the direction of increasing third dimension), and it corresponds to the right edge of the 2D stencil described in section III. We load only this plane, and reuse it for other output points. Let We can now derive the partial sums from just the load of Plane r ( I). The following is computed
Exploiting Symmetry to Reduce Floating-Point Operations. Figure 4 shows that several coefficients in each plane of Coeff have the same value. This means multiple points in Plane r use the same coefficient to compute partial sums. For each unique coefficient, we sum the points in Plane r that use it. This sum is stored, and multiplied by the appropiate weight to calculate the partial sums, and does not have to be recomputed. In our current implementation we check for symmetry in BB about the axes and diagonals in a plane and across planes before implementing our floating-point reducing optimization. This technique can be easily extended to work with fewer degrees of symmetry. When all the symmetries are present, BB is a cube such that upper and lower bounds are the same, lb = -ub, and the coefficients in each p1, p2 plane are symmetric about the p2 and p3 axes and diagonals.
In any 2D plane of Coeff, the coordinates of unique coefficients are defined as UC = { p1, p2 : 0 ≤ p1 ≤ p2 ≤ ub}, corresponding to the colored octant in Figure 4( 3 ) } Peel off remaining iterations and use S0. coefficients in Coeff, like the 7-point stencil, we take care not to generate redundant factored terms. The following is computed ∀ k , 0 ≤ k ≤ (ub − lb):
Code Generation. Once the compiler has performed the rewriting steps described above, it must generate the transformed code. The steps of code generation are described in detail in Figure 6 . The compiler must create the buffer objects, and compute their values using the rewriting shown in Eqn.(2) directly above. It must also modify the original stencil statement to refer to the buffers rather than the input code. As we near the upper bound while iterating through the inner-loop, there will be no need to calculate all the partial sums, and we will go off the end of the allocated buffers. The figure shows the restricted iteration space, and peeling to address this. We also generate a code variant where IS' equals the original IS, which does extra computation, allocates buffers longer than the loop bound, but has no cleanup code. Both the code variants perform similarly. 
Overview of Evaluated Platforms

V. EXPERIMENTAL RESULTS
We now present performance results and analysis using our optimizing compiler technology. Additionally, we apply the Roofline performance model [13] to help quantify attainable performance. The Roofline Model uses bound and bottleneck analysis to represent architecture performance as a function of requisite data movement and computation. As such, it provides a nominal upper bound to attainable performance.
A. Evaluated Platforms
In this paper, we evaluate the benefits of our compiler technology using the systems detailed in Table I . On each platform, we used the installed Intel compiler with -O3 -fno-alias -fno-fnalias and either -xAVX or -msse3.
Edison is a Cray XC30 MPP at NERSC. Each node contains two 12-core Xeon Ivy Bridge chips each with four DDR3-1600 memory controllers and a 30MB L3 cache [29] . Each core implements the 4-way AVX SIMD instruction set and includes both a 32KB L1 and a 256B L2 cache. With a high flop:byte ratio (and exacerbated by TurboBoost), we expect this machine to be memory-limited for most operations without communication-avoiding optimizations. Note, Edison's memory was upgraded to DDR3-1866 subsequent to this paper.
Hopper is a Cray XE6 MPP at NERSC. Each node contains four 6-core Opteron chips each with two DDR3-1333 memory controllers and a 6MB L3 cache [30] . Each core uses the 2-way SSE3 SIMD instruction set and includes both a 64KB L1 and a 512KB L2 cache. Hopper's lower machine balance may result in the 125-point operator being compute-limited.
B. Evaluation Benchmark -miniGMG
Multigrid is a fast linear solver that uses an iterative and recursive approach to solve elliptic PDEs. Each iteration of a multigrid solve requires performing a V-Cycle. As shown in Figure 7 , a V-Cycle involves performing stencil operations (smooths) on progressively coarser (smaller) grids, solving a coarse grid problem, and then using that coarse-grid solution to correct the fine-grid solution. Unfortunately, as one may only apply the stencil four times during each smooth, there is limited data reuse.
This paper uses the miniGMG compact multigrid benchmark which is over 2000 lines of C and includes a dozen performance-critical routines [31] . Previous work studied some of the performance and productivity aspects of miniGMG [16, 26] . The benchmark creates a block structured 3D grid partitioned into subdomains (boxes) which are distributed among At each level, we apply four weighted Jacobi smooths using the relevant stencil shown in Figure 1 . Our experiments run a fixed 10 V-Cycles with a point relaxation bottom solver that performs 48 smooths.
To highlight the generality, productivity, and effectiveness of our compiler optimizations, we decompose the smooth operations into three separate loop nests: the first applies one of the stencils to an array and writes to the temporary array, the second reads the temporary array and forms either the Poisson or Helmholtz operator, while the third performs a weighted Jacobi update of the current solution. This structure is slightly different than the manually-optimized version of the code online [31] where a few combinations of stencil and smoother were manually fused.
We construct a manufactured solution:
and symbolically apply the Laplacian to it to find a nominal right-hand side f . We then apply the appropriate Mehrstellen correction and solve
We compare u h and u true under the l 2 norm to calculate error.
C. Stencil Performance
To highlight the memory bottleneck on modern multicore processors as well as differentiate ApplyOp (y = Ax) characteristics from smoother characteristics, Figure 8 presents ApplyOp (stencil in isolation) performance on the finest (256 3 ) grid using either the baseline implementation, or CHiLL with the Partial Sums optimization. We provide a memory bound based (blue circle) on the Roofline performance model [13] derived from the size of the arrays (including ghost zones) and the bandwidths of the target machines listed in Table I . As the compiler does not generate the movntpd cache-bypass instruction (verified with compiler flags), we assume 24 bytes per stencil. As the 13-and 125-point operators require a 2-deep ghost zone, their Roofline bound is slightly lower.
As shown in Figure 8 baseline performance is close to the Roofline bound for most cases. The use of the partial sums optimization to eliminate unnecessary floating-point adds and regularize the computation for efficient SIMDization significantly improved performance and ensured performance came close to the Roofline bound. Note, however, that the computeintensive 125-point stencil fell well below its Roofline bound.
D. Smoother Performance on the Fine Grid
The core operation in a multigrid solver is a smoother. It is applied multiple times in sequence on each level of the multigrid V-Cycle. As such, we may apply a wavefront or time-skewing technique to overcome the memory bandwidth limit for ApplyOp. In this paper we use a fixed 256 3 problem for all experiments. Although this ensures we may compare smoother performance when using different discretizations of the Laplacian (different stencils) it also implies that the result using a 125-point operator will be more accurate. Please note that unlike a simple stencil operation (y = Ax), a smoother ( x new = x+wD −1 (b−Ax) ) requires significantly more data movement including reading arrays for the right-hand side b and the inverse of the diagonal D −1 . As a result, our smoother is nominally memory-limited for all operators. Observe that the baseline implementation of the smoother is well below the Roofline bound in all cases. This is a result of the smoother being applied in two stages -application of the linear operator, and using that result to correct x. Enabling the operator "Fusion" optimization in CHiLL allows the compiler to automatically fuse these operations and eliminate the access to the intermediate temporary array. This significantly reduces data movement and allows the more memory-intensive 7-and 13-point smoother to reach their Roofline bounds. Unfortunately, the more compute-intensive 27-and 125-point stencils demand efficient SIMDization to reach their Roofline bounds. Enabling the "Partial Sums" optimization in CHiLL allows the compiler to automatically restructure these stencils to eliminate superfluous additions and regiment the computation for SIMDization by the backend compiler. The benefit is clear -a more than doubling of 125-point smoother performance and performance near the Roofline bound for all operators.
Naively, one may conclude that reaching the Roofline-bound represents the upper end of performance. However, this simply implies that a new set of algorithmic optimizations are required to further improve performance. As smooths are applied in sequence within the multigrid V-Cycle, it is possible, to view their execution as a quadruply nested loop. Manually reordering these loops is beneficial, but unproductive [26] . Conversely, our additions to CHiLL allow the compiler to automatically add ghost zones and restructure the loops into a communication-avoiding "wavefront" without loss of accuracy (the result is bit identical) or productivity. As seen in Figure 9 , our approach attains roughly a 2× performance boost for the 7-and 27-point smoother on Edison. (Note, "All Optimizations" include tuned nested OpenMP.) Whereas Edison is heavily memory-limited, Hopper is not. As such, the benefit of a communication-avoiding algorithm is limited on Hopper. Communication-avoiding 13-and 125-point smoothers suffer on two axes. First, generating wavefronts for these operators require skewing loops by the larger stencil radius. This larger skew factor increases the working set, increases cache pressure and makes it difficult to fit the working set in the fastest caches. Second, the 125-point operator is likely computebound on Hopper and nearly compute-bound on Edison. Thus, the potential benefit from communication-avoiding is small.
E. Smoother Performance Throughout the V-Cycle
Unlike simple explicit methods that only need to attain high performance for a stencil on a large grid, multigrid requires high performance on grids of exponentially varying size. In Figure 10 , we explore the performance of the 27-and 125-point smoothers on Edison as they operate on coarser (smaller) grids. Examining the baseline implementation for the 27-point operator shows the expected rise in performance when moving to the coarser grids, which nominally fit in ever lower levels of cache. Note, the first smooth at each level will inevitably read from DRAM. As such, high cache bandwidths only amortize this slow initial smooth. On small grids, efficient 12-way OpenMP multithreading becomes impossible and performance drops. The 125-point smoother sees a similar behavior but to a lesser degree as it is ultimately compute-limited.
As optimizations are enabled in CHiLL, we see the compiler can nearly sustain constant performance for the 256 3 , 128 3 , and 64
3 levels for the 27-point operator by automatically tuning for the optimal optimizations. Similarly, Table II shows the compiler continually shifts the set and parameterization of the optimizations employed for the 125-point smoother at each level of the V-Cycle. A manually-optimized implementation would likely only target the fine grid and would thus deliver lower performance on the coarse grids, while significantly increasing programmer overhead.
The partial sums optimization requires a two pass approach in which the first creates a few auxiliary results. The cost of this initial pass is amortized on large arrays but becomes an impediment on small arrays. Thus the benefit of partial sums decreases on the smaller grids. Figure 11 presents the performance of the miniGMG multigrid solver using either the existing Intel compiler (baseline) or our optimizing CHiLL compiler as a function of discretization and platform. Performance is expressed in millions of degrees of freedom solved per second (DOF/s). For this scalar problem, fine-grid cell is one degree of freedom. Thus, solving a 256 3 grid in 1 second would equate to 16.78 million DOF/s. Generally, the CHiLL compiler can provide an overall speedup of about 2× using all available optimizations. The attained speedup was a bit less on the 13-point operator as it did not benefit from partial sums and achieving high performance on a communication-avoiding wavefront is particularly challenging. Note, the labels indicate the overall solver speedup attained via CHiLL. The performance of the 10 th order 125-point solver is within a factor of 2 of the 2 nd order 7-point solver but provides nearly a million times lower error.
F. miniGMG Solver Performance and Error
When comparing raw performance (MStencil/s), results show that the 7-and 27-point operators were comparable with the 125-point operator, attaining about half the throughput. However, this only conveys half the story. As highlighted in Figure 12 , miniGMG with the Mehrstellen correction attains roughly 2,370× and 377,000× better accuracy for our test problem using the 27-or 125-point operators (respectively) than miniGMG attained using the 7-point operator. As we move to ever finer domains, the benefit increases. Thus, if the goal were to solve to a particular numerical error, using the Mehrstellen correction with these discretizations reduces total data movement by several orders of magnitude.
VI. RELATED WORK
In the past, operations on large structured grids could easily be bound by capacity misses in cache, leading to a variety of studies on blocking and tiling optimizations [19, [32] [33] [34] [35] [36] [37] . In recent years, numerous efforts have focused on increasing temporal locality by fusing multiple stencil sweeps through Fig. 12 . Error attained in miniGMG as a function of operator and grid size. Note, a 1GB vector represents a single 512 3 grid. Multigrid will require several of these grids. Observe that the 10 th order method delivers a threedigit increase in accuracy for every 8× increase in memory.
techniques like cache oblivious, time skewing, wavefront or overlapped tiling [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] 38] . In addition, domain-specific compilers have recently been developed for parallel code generation from a stylized stencil specification [39] [40] [41] or from a code excerpt [42] .
Prior work developed manual optimizations for multigrid that incorporated communication-avoiding optimizations such as fusion of operators, wavefront parallelism, and hierarchical threading [26] . In addition, lower-level optimizations such as explicit use of SIMD and prefetch intrinsics improved the performance of the inner loops. Subsequently, the communicationavoiding optimizations were implemented in an autotuning compiler [16, 17] . In this paper, we expand the compilerdirected optimizations to include partial sums. When optimizing a complex proxy application such as miniGMG, we have to optimize several stencil operators and how they work together. To the best of our knowledge, the DSL Halide [43] which focuses on image processing pipelines is the only other framework to do so.
Manual optimization of stencil computations has developed techniques such as semi-stencils to reduce loads [21] and using array common subexpression elimination after unrolling to reduce floating-point computations [19, 44] . In [19, 44] , the authors unroll the loops of the stencil computation to expose array common subexpressions and reorder computation to reduce floating point operations using a stencil-specific code generator. Our approach is automated, and does not rely on unrolling. As discussed in Section III, Deitz et al. describe an automated approach to common subexpression elimination in the ZPL compiler [14] . Polyhedral techniques for reconfigurable computing construct custom storage structures to exploit reuse [45] , but are limited to reuse between consecutive iterations and do not consider higher-order stencils.
Recent work reorders stencil computations from the direct specification of the stencil as an update from a set of input points [15] . Stencils are converted to a reduction and then loop shifting exposes register reuse of the same input, contributing to different output. Their approach does not reduce floating point computations.
The data layout transformation (DLT) was developed to eliminate the data stream alignment problem in generating SIMD code for stencil computations [25] . DLT transposes stencil inputs so that multiple computations can be performed in SIMD registers without the costly shifting of data across iterations of the innermost loop. Partial sums access aligned planes and buffer them in separate arrays; thus, we have addressed stream alignment without requiring the data transpose.
In summary, our compiler's ability to compose transformations allows it to take a higher-order smoother, remove its computation bottleneck with partial sums and make it bandwidthlimited, and then further apply DRAM bandwidth-reducing optimizations. No prior work has combined reducing floatingpoint operations with communication-avoiding optimization for higher-order stencils.
VII. CONCLUSIONS AND FUTURE WORK
High-order discretizations of the Laplacian often result in compute-intensive stencils that perform more than an order of magnitude more floating-point operations per point than the traditional 2 nd order discretization. The paradigm shift from compute-limited architectures to bandwidth-limited architectures has revitalized interest in these methods. In this paper, we explored several novel augmentations to the CHiLL compiler designed to improve the computational performance of these stencils. Using the miniGMG multigrid benchmark, we showed that the compiler could nearly quadruple the performance of the 125-point Jacobi smoother on Edison and reach the Roofline performance bound. Moreover, we showed the compiler could tune optimizations independently by platform and multigrid level.
To highlight the true potential of high-order methods, we examined the reduction in error as we increase resolution. We show that our 10 th order method using the compact 125-point stencil in conjunction with a Mehrstellen correction attains 10 th order accuracy and provides a 1000× increase in accuracy for every 8× increase in memory. This has the tremendous potential of reducing total data movement and total energy (across a supercomputer) by many orders of magnitude compared to the 2 nd order multigrid solver.
Future work will explore integration of these techniques into the HPGMG benchmark which implements a true distributed V-Cycle. This will demonstrate the benefit of our optimizations at large scale and will facilitate implementations on Xeon Phi and GPU-based platforms. In addition, we plan to leverage task based parallelism and explore tiling and threading strategies to target the newer platforms.
