We derive tight bounds on the cache misses for evaluation of explicit stencil operators on structured grids. Our lower bound is based on the isoperimetrical property of the discrete octahedron. Our upper bound is based on a good surface to volume ratio of a parallelepiped spanned by a reduced basis of the interference lattice of a grid. Measurements show that our algorithm typically reduces the number of cache misses by a factor of three, relative to a compiler optimized code. We show that stencil calculations on grids whose interference lattice have a short vector feature abnormally high numbers of cache misses. We call such grids unfavorable and suggest to avoid these in computations by appropriate padding. By direct measurements on a MIPS R10000 processor we show a good correlation between abnormally high numbers of cache misses and unfavorable three-dimensional grids.
Introduction
On modern computers the gap between access times to cache and to global memory amounts to several orders of magnitude, and is growing. As a result, improvement in usage of the memory hierarchy has become a significant source of enhancing application performance. Well-organized data traffic may improve performance of a program, without changing the actual amount of computation, by reducing the time the processor stalls waiting for data. Both data location and access patterns affect the amount of data movement in the program, and the effectiveness of the cache.
A number of techniques for improvements in usage of data caches have been developed in recent years. The techniques include improvements in data reuse (i.e. temporal locality) [3, 4, 5, 13] , improvements in data locality (i.e. spatial locality) [13] , and reductions in conflicts in data accesses [4, 5, 9, 10] . In practice, these techniques are implemented through code and data transformations such as array padding and loop unrolling, tiling, and fusing. Tight lower and upper bounds on memory hierarchy access complexity for FFT and matrix multiplication algorithms are given in [8] . However, questions concerning bounds on the number of cache misses and how closely current optimization techniques approach those bounds for stencil operators remain open.
In this paper we consider improvement of cache usgae through maximizing temporal locality in evaluations of explicit stencil operators on structured discretization grids. Our contribution if twofold. First, we prove lower and upper bounds on the number of cache misses for local operators on structured grids. Our lower bound (i.e. the number of unavoidable misses) is based on the discrete isoperimetric theorem. Our upper bound (i.e. the achievable number of misses) is based on a cache fitting algorithm which utilizes a special basis of the grid interference lattice. As shown by example, the lower bound can be achieved in some cases. The second contribution is the identification of grids unfavorable dimensions which cause significant increases in cache misses. We provide two characterizations of these unfavorable grids. The first one, derived experimentally, states that the product of all relevant grid dimensions is close to a multiple of half the cache size. The second characterization is that the grid interference lattice has a short vector.
Cache model and definitions
We consider a single-level, virtual-address-mapped, set-associative data cache memory, see [7] . The memory is organized in a sets of z lines of w words each. Hence, it can be characterized by the parameter triplet (a, z, w), and its size S equals a * z * w words. A cache with parameters (a, 1, w) is called fully associative, and with parameters (1, z, w) it is called direct-mapped.
The cache memory is used as a temporary fast storage of words used for processing. A word at virtual address A is fetched into a (a(A), z(A), w(A)) cache location, where w(A) = A mod w, z(A) = (a/w) mod z, and a(A) is determined according to a replacement policy (usually a variation of least recently used ). The replacement policy is not important within the scope of this paper.
If a word is fetched, then w − 1 neighboring words are fetched as well to fill the cache line completely. In practice, a, z, and w are often powers of 2 in order to simplify computation of the location in cache. For example, the MIPS R10000 processor for which we report some measurements in Section 6, has a cache with parameters (2,512,4), which makes S equal to 4K double precision words, or 32KB.
Our lower bound for the minimum number of cache misses that must be suffered during a stencil computation holds for any cache, including fully associative caches. The upper bound shows that a particular number of cache misses can be achieved by choosing a special sequence of computations. A cache miss is defined as a request for a word of data that is not present in the cache at the time of the request. A cache load is defined as an explicit request for a word of data for which no explicit request has been made previously (a cold load ), or whose residence in the cache has expired because of a cache load of another word of data into the exact same location in the cache (a replacement load ). The definitions of cold and replacement loads match those of cold and replacement cache misses, respectively [4] , and if w equals 1 they completely coincide.
If a piece of code features φ cache misses and µ cache loads, it can easily be shown that µ ≤ wφ. For a code with good spatial locality we typically have µ ≈ wφ. As can be shown by a simple example, no bound of the form φ ≤ cµ (c constant) can be derived for arbitrary code segments, but if the code implements a non-redundant stencil operation, we have φ ≤ |K|µ, where |K| is the total number of points within the stencil. This is shown as follows. Let the stencil operation be written as q(x) = Ku(x), with x ∈ Ω. Here Ω is the (not necessarily contiguous) point set on which array q is evaluated. Let Ω be the K-extension of Ω, which is the point set on which u must be defined in order to compute q at all points of Ω. The total distinct number of elements of u used is |Ω|. The number of cache misses φ does not exceed the total number of accesses to array u (may included repeated accesses to the same element), which equals |K||Ω|, so |Ω| ≤ |K||Ω|. Consequently, we have the following interval inequality:
≤ w, which can be used to bound the number of cache misses in terms of the number of cache loads.
A lower bound for cache loads for local operators
In this section we consider the following problem: for a given d-dimensional structured grid and a local stencil operator K, how many cache loads have to be incurred in order to compute q = Ku, where q and u are two arrays defined on the grid. We will provide a lower bound µ which asserts that, regardless of the order the grid points are visited for the computation of q, at least µ cache loads have to occur. In the next section we provide a cache fitting algorithm for the computation of q whose number of cache loads closely approaches the lower bound.
We use the following terminology to describe the operator K. The vectors k 1 , . . . ,k s defined such that q(x), the value of q at the grid point identified by the vector x, is a function of the values u(x + k 1 ), . . . ,u(x + k s ), are called stencil vectors. Locality of K means that the stencil vectors are contained in a cube {x| |x i | ≤ r, i = 1, . . . , d} (r is called the radius of K, and 2r+1 its diameter ). In this section we assume that K contains only the star stencil (i.e. the {0, e 1 , . . . , e d , −e 1 , . . . , −e d } stencil). A lower bound for cache loads for the star stencil will give us a lower bound for any stencil containing it.
Let q be computed in the K-interior R of a rectangular region (a grid ) G. We assume that computation of q is performed in a pointwise fashion, that is, at any grid point the value of q is computed completely before computation of the value of q at another point is started. In order to compute the value of q at a grid point x, the values of u in neighbor points of x must be loaded into the cache (a point y is a neighbor of x if y−x is a stencil vector of K). If x is a neighbor of y and u(y) has been loaded in cache to compute q(z) but is dropped from the cache before q(x) is computed, then u(y) must be reloaded, resulting in a replacement load associated with x.
To estimate the number of elements, ρ, of array u that must be replaced, we choose a partition of R into a disjoint union of grid regions R i , with R = ∪ k i=1 R i , in such a way that q is computed in all points of R i before it is computed at any point of R i+1 , see Figure 1 . Let B ij be the set of points in R j which are neighbors of R i . Since the star stencil is symmetrical, the B ij are neighbor points of B ji . Because any point of B ij can have at most 2d neighbors in B ji , we have the following inequalities: For computation of q in R i we have to replace at least ρ i values of u, where ρ i equals max
The total number of replaced values in the course of computing q on the entire grid will be at least ρ, where ρ equals B − kS, and B equals
Summing all the terms |B ij |, taking into account Equation (1) and the fact that |B ii | = 0, we get:
Let δR i be the exterior boundary of R i , that is, all points neighbor to R i not belonging to R i , and let δ i be the subset of the grid boundary D having neighbors only in
, where σ is specified below, and let ν equal max |R i |. Consequently, we have k ≥ |R|/ν def = V /ν, and
We subsequently choose σ in such a way that
for some t, where O(d, t) is the standard d-dimensional octahedron of radius t (see Appendix A). It follows from Equation 21, Appendix A, that t can be chosen in such a way that σ is less than 8d(2d + 1)S. Now the value of ν can be estimated using the isoperimetric property of the octahedron (see again Appendix A), namely: ν ≤ |O(d, t)|. Hence, we find
). This gives the following lower bound:
We also have V + |D| = |G| and |D| ≤ 2d|G|/l, where l is the smallest size of the grid. This gives the final lower bound µ for the total number of elements of u to be read into the cache:
In general, assuming that the cache associativity a is larger than the diameter of the operator K, the order of this lower bound can not be improved, as shows the following example (remember that our lower bound is valid for a cache with any associativity, including a fully associative cache). Let the spatial extents of a two-dimensional grid be n 1 and n 2 , respectively, with n 1 equal to kS and n 2 arbitrary, and perform calculations of the star stencil (i.e. r = 1) in the following order:
Since n 1 equals kS, all values of q and u having the same value of the second index are mapped into the same cache location within a set. Since a exceeds 2r + 1, none of the values required for the computation of q will be replaced in the cache, except those at a distance r around the line defined by i1 = i*S/a. The total number of elements of u read into the cache for execution of this loop nest will therefore be n 1 n 2 + (n 2 − 2)2r(ka − 1) − 4, which equals n 1 n 2 (1 − 2/n 1 + 2a(1 − 2/n 2 )/S). Similar examples in higher dimensions show that the order of our lower bound (Equation 7) can not be improved. 4 An upper bound for cache loads for local operators.
Cache fitting algorithm
In order to obtain an upper bound we present a cache fitting algorithm which has a small number of replacements. We find a set P of cache conflict-free indices of u and calculate Ku at the points of P . Then we tile the index space of u with P to minimize the total number of replacements. For the analysis we assume an cache associativity of one, which is the worst case for replacement loads. Let L be a set in the index space of u having the same image in cache as the index (0, . . . , 0), Figure 2 . L is a lattice in the sense that there is a generating set
We call this the interference lattice of u. It can be defined as the set of all vectors (i 1 , . . . , i d ) such that
In [4] this lattice is defined as the set of solutions to the cache miss equation. Let P be a fundamental parallelepiped of L * . For future reference we note that vol(P ) = det L = S. The second equality follows form the fact that L has a basis {v i } of the form: 
, and Let F be a face of P (see Figure 2) , and let v be a basis vector of L such that
. . contain all integer points of a pencil Q, with Q = {f + xv | f ∈ F, x is any number } for an appropriate value of g † . The values of q at the points of Q can be computed without replacing reusable values of u except at a distance of r or less from the boundary of Q. Let h 1 , . . . , h s be the signed projections along F of the stencil vectors of K onto v, and let h + and h − be the maximum and the minimum of the projections, respectively. We assume also that |h + − h − |/g < |v|a, meaning that the extent of P in the direction of v is big enough to allow to compute q on F without replacements. It may be impossible to satisfy this condition when the shortest vector in L is shorter than the diameter of K divided by the cache associativity. Lattices with short vectors are discussed in Section 6. The associated grids are called unfavorable grids.
The Cache Fitting Algorithm for computing q is as follows (see Figure 2) ; here K(R) is the set of points where u must be available in order to compute q in all points of R (i.e. the K-extension of R):
† Let I be a fundamental parallelepiped of the integer lattice in the subspace Y generated by F , and let e be an integer vector such that e and the basis of I generate 
In this algorithm the parameters Qmin, Qmax, kmin and kmax are determined such that the scanning face F sweeps out the entire grid. Whenever a point is not contained in the grid, it is simply skipped in the nest.
Since we defined the algorithm in such a way that the scanning face in the direction of v with step g passes through all integer points of Q, the values of q at all points inside Q will be computed. Figure 2 : The Interference Lattice. Cache fitting set F + kw, k ∈ Z, sweeps across pencil Q in the direction of v. Only values of u at points at a distance r or less from the pencil boundaries β 1 and β 2 will be replaced in the cache when K is evaluated inside of Q.
Replacements misses can occur only at points at a distance of r or less from the boundaries of the pencils. For each of these points at most s replacement need to take place, where s, the size of the stencil, is defined by s = |K| ≤ (2r + 1)
d . So the number of replacements will not exceed r(2r + 1)
d A, where A is the total surface area of all pencils. To minimize A we choose P so that Q has a good surface to volume ratio. Let P be the fundamental parallelepiped of a reduced basis of L. A basis
where c d is a constant which depends only on d ‡ . Let b 1 be the shortest vector of the basis, and let the eccentricity e of the basis be defined by e = max(||b i ||/||b 1 ||). If we define ∂P as the surface of P , we can derive an estimation for the surface-to-volume ratio of P :
where we twice used the Hadamard inequality: i ||b i || ≥ det L, and the abovementioned fact that L = S. The constant c
Since A does not exceed the surface area of all fundamental parallelepipeds covering the grid, the total number of these parallelepipeds (which equals |G|/ det L) gives us: A ≤ |∂P ||G|/ det L, so that the total number of replacements ρ can be bounded by: ρ ≤ r(2r + 1)
d |∂P ||G|/ det L. This, combined with Equation 11, gives an upper bound for the total number of elements to be loaded into the cache in the cache fitting algorithm:
where c ≤ f c d . In Appendix B we show that there are grids whose interference lattices feature f 's that are independent of S (provided that S is a prime power, which is true in most practical cases). For these lattices the relative gap between the upper bound (Equation 12) and the lower bound (Equation 7) of the previous section goes to zero as S increases. When the cache associativity exceeds the diameter of K, this gap can be closed. In that case a parallelepiped, built on a reduced basis of the interference lattice of the array indices with x d = 0, can be swept in the d th coordinate direction, similar to the example at the end of Section 3. In general, the cache fitting algorithm gives full cache utilization, in contrast to the algorithm for finding grid-aligned parallelepipeds devoid of interference lattice points, as proposed in [4] . See Table 2 , [4] , where the sizes of blocks without self interference are approximately 20% smaller than S.
Lower and upper bounds for multiple RHS arrays
In this section we consider the case where there multiple arrays involved in the computation of q. Let p be the number of arrays (we call these the RHS arrays), all having the same sizes, and let the stencil of each RHS array include the star stencil. This means, in particular, that for each boundary point of any region R i (see Figure 1 ) values of p RHS arrays are necessary § for computation of q in R i . Hence, we have to replace at least ρ i values, with ‡ Every lattice has a reduced basis. There is a polynomial algorithm to find a reduced basis with a constant
. § As in Section 3, we assume that computation of q is performed in a pointwise fashion. In this case elements of all RHS arrays have to be loaded into cache simultaneously, reducing the cache size effectively ρ i = max(p( j<i |B ij |) − S, 0) values of RHS arrays. Now we can repeat the arguments of Section 3, with |V | and |G| replaced by p|V | and p|G|, respectively, and S replaced by ⌈S/p⌉, to obtain the following lower bound for the number of cache loads for stencil computations with p RHS arrays:
In order to obtain an upper bound for cache loads for calculations with p RHS arrays, we assume that we are free to choose relative array offsets. Our upper bound is valid on the assumption that the stencil diameter divided by the cache associativity is smaller than the length of the longest lattice basis vector divided by p. Consider a stripwise tiling of the fundamental parallelepiped P for the lattice L, see Figure 3 . Each tile P i has the same size and shape. The size is determined by considering the longest edge vector v in the fundamental parallelepiped and dividing it into p equal pieces of size [(S/|F |)/p]||v||, so that each tile contains |F |[(S/|F |)/p] integer points, where |F | is the number of integer points in the face. The remainder part of the tiling is indicated by the shaded area. The reason why the longest edge vector is selected for subdivision is as follows. Since we use a reduced basis, the smallest angle between v and F is bounded from below, so the parallelepiped is always close to orthogonal. Therefore, subdividing the longest edge leads to tiles with the largest inscribed sphere, and thus the largest difference stencil fitting inside the tile.
Let {P i } be the parallelepipeds of the tiling, and let s i be the address offset of P i relative to P 1 (corresponding to the same RHS array). We assign one parallelepiped to each RHS array and choose starting addresses of the arrays, addr i , in such a way that images of tiles P i in the cache do not overlap: addr i = addr 1 + m i S + s i , where m 1 = s 1 = 0, and
. . , p. Sweeping through the pencil by units of tile P 1 in the direction of v we can compute Ku without any cache conflicts, except on the boundary of the pencils. The number of replacement loads of this algorithm can be estimated similarly to the number of replacement loads of a single-array algorithm, taking into account that for calculation of a value u at any point values of all p RHS arrays in the neighbor points may have to be in cache, thus reducing the effective cache size to [S/p]:
where c ′′ d is a constant which depends only on d, and e is the eccentricity of L.
by a factor of p. Non-pointwise computations may be performed if the operator K is separable, in the sense that it can be written as
In the case of separability of K the stencil operation can be split into a succession of independent operations, each involving an intermediate value of q and one RHS array. This would not require to load all p RHS arrays in cache at each point. Instead, it would suffice to write intermediate values of q into main memory, and then load them back into cache for completion of the computations. This results in a larger effective cache size, but more data to be loaded, so splitting the operation need not improve the total number of loads. + h -h -00 00 11 11 00 00 11 11 00 00 11 11 00 00 00 11 11 11 00 00 11 11
Figure 3: Tiling of a fundamental parallelepiped of a reduced basis of the lattice L. We assume that |h + − h − | ≤ a|v/p| (a is the cache associativity). The tiling effectively reduces the size of the parallelepiped by a factor of at most 2p (since x/p ≥ ⌊x/p⌋ ≥ x/(2p)), and increases the cost of a replacement in the cache per point of the boundary of the pencil by at most a factor of p, since elements of all p RHS arrays will be replaced at the same time.
Unfavorable array sizes
We have implemented our cache fitting algorithm and compared its actually measured number of cache misses with those of the compiler-optimized code for the corresponding naturally ordered loop nest on a MIPS R10000 processor (SGI Origin 2000). For comparison we chose a second order difference operator (the common 13-point star stencil) an a test set including three-dimensional grids of sizes 40 ≤ n 1 < 100, n 2 = 91, and n 3 = 100 (the value of the second dimension was chosen to show a typical picture; that of the third dimension is irrelevant). A plot of measured cache misses for both codes is shown in Figure 4 . The program was compiled with options "-O3 -LNO:prefetch=0," using the MIPSpro f77 compiler, version 7.3.1.1m. The prefetch flag disables the prefetching compiler optimization. Without this option the number of cache misses increases significantly, because the compiler does aggressive prefetching to try to reduce execution time.
The upper bounds for the cache misses from the previous sections would suggest that the number of replacement cache misses will increase in the cases where the interference lattice has a very short vector. Very short means that the length is smaller than the diameter of the operator divided by the cache associativity. In this case the self interference would increase significantly. This result suggests how to pad arrays to improve cache performance: the padding should be organized in such a way that the shortest vector in the lattice is not too short, though short enough to minimize the number of pencils (large index of scanning face F ). The sweeping is organized such that pencils are as wide as possible (i.e. the smallest total number of pencils), while avoiding-in the case of multiple RHS arrays-tiles that are thinner than the diameter of the stencil operator divided by the cache associativity.
To demonstrate these unfavorable grids we again choose the second order stencil and force computations in the nest to follow the natural order ¶ . Figure 5a shows the correlation between spikes in the number of cache misses and the presence of a very short vector in the lattice. We call these lattices unfavorable for cache utilization. Arrays having such lattices should be avoided on the target machine. When the shortest vector of the interference lattice is shorter than the diameter of the operator, the number of cache misses sharply increases. The application developer should avoid such unfavorable array sizes, and compilers should avoid the sizes using appropriate padding of array dimensions. Note that similar unfavorable cache effects have been mentioned in [1] . ¶ This forcing is accomplished by introducing a dependence through a Fortran subroutine that performs a circular shift of its arguments 
Conclusions and future work
We have demonstrated tight lower and upper bounds for cache misses for calculations of an explicit operator K on a structured grid. Our lower bound is valid in the general case of fully associative caches, and is based on a discrete isoperimetric theorem. Our upper bound is based on a cache fitting algorithm which uses the fundamental parallelepiped of a special basis of the interference lattice to fit the data in the cache. The upper bound assumes that the shortest vector in the interference lattice is not too short. We have shown that there are grids whose interference lattices have this property. We have also shown that the presence of a very short vector in the lattice correlates with fluctuations of actual cache misses for calculation of a second order explicit operator on three-dimensional grids. The fluctuations occur on grids with unfavorable sizes, i.e. on those whose product of the first two dimensions is (close to) a multiple of half the cache size.
Our results can be extended straightforwardly to implicit stencil computations (i.e. those of the form q ← K(q)) when the problem has a one-dimensional data dependence. Such a data dependence exists if computations of q at grid points can take place in an arbitrary order, except that there is a single index i for which q(x 1 , . . . , i, . . . , x d ) must be evaluated before q(x 1 , . . . , i + α, . . . , x d ) can be calculated (the constant α is either +1 or -1). Clearly, the lower bound is not affected by the implicitness of K. The previously derived upper bound can still be achieved by prescribing the proper visit order of points within each parallelepiped, of the scanning face direction within each pencil (positive or negative sweep direction), and of the visit order of subsequent pencils. This is always possible for a one-dimensional data dependency.
Our results can also be extended to arrays that store more than one word per grid point (tensor arrays). The lower bound of Section 3 for operations with multiple right hand sides immediately applies to tensor arrays. The upper bound of that section also applies, provided the tensor components can be stored as independent subarrays.
In a future study we plan to extend the results of this paper to more general implicit operators, to operators on unstructured grids, and to tensor arrays with restricted storage models. We intend to study more closely the dependence of cache misses on the size of the operator's stencil. We also plan to enhance the presented results by taking into account a secondary cache and TLB, and to formulate bounds for cache misses more directly than through the determination of cache loads.
shows that |δO(d, t)| ≤ (2d + 1)|δO(d, t − 1)| .
For the number of integer points in the simplex we have the following recurrence relation:
This can be used to prove that cf. [6] , Table 169 , see also [12] , Section 5:
From Equations 18 and 21 it follows that |O(d, t)| ≤ 2 d |S(d, t)|. Also, since δO(d, t − 1) contains at least two nonoverlapping simplices S(d − 1, t) and can be covered by 2 d such simplices, we see that
Hence, if |S(d − 1, t)| equals S, we have for d ≥ 2:
since from Equation 23 it follows that if |S(d − 1, t)| does not exceed S, then 1 + t/d does not exceed S 1/(d−1) . The isoperimetric inequality [12] , Theorem 2, asserts that the size of the boundary of a subset R in Z d is at least as big as the size of the standard sphere that contains |R| points . It is easy to see that any standard sphere is sandwiched between two standard octahedrons whose radii differ by one. Since the octahedron has the largest volume for a given fixed-size boundary, Inequality 25 is true for any lattice body with a boundary of size S.
Appendix B: The existence of grids with favorable lattices
In order to prove that for every cache of size S = p n , where p is a prime number, there are grids with interference lattices whose shortest vector has a length l greater than (S/f ) 1/d , with f independent of S, we show: a. For every dimensionality d there exists a lattice L of the same dimension whose basis has the form given in Equation 9 (Section 4), and whose shortest vector is sufficiently long, and b. a grid can be constructed that has L as its interference lattice.
The standard sphere, defined in [12] , is the integer point set of minimal surface area for any given number of interior points.
