When dense matrix computations are too large to fit in cache, previous research proposes tiling to reduce or eliminate capacity misses. This paper presents a new algorithm for choosing problem-size dependent tile sizes based on the cache size and cache line size for a direct-mapped cache. The algorithm eliminates both capacity and self-interference misses and reduces cross-interference misses. We measured simulated miss rates and execution times for our algorithm and two others on a variety of problem sizes and cache organizations. At higher set associativity, our algorithm does not always achieve the best performance. However, on direct-mapped caches, our algorithm improves simulated miss rates and measured execution times when compared with previous work.
Introduction
Due to the wide gap between processor and memory speed in current architectures, achieving good performance requires high cache efficiency. Compiler optimizations that improve data locality for uniprocessors are increasingly becoming a critical part of achieving good performance [CMT94]. One of the most well-known compiler optimizations is tiling (also known as blocking).
It combines strip-mining, loop permutation, and skewing to enable reused data to stay in the cache for each of its uses, i.e., accesses to reused data are moved closer together in the iteration space to eliminate capacity misses.
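As a minimal sketch of the transformation, the following Python code tiles a matrix multiply of the form Z(J,I) = Z(J,I) + X(K,I) * Y(J,K). The function names, the nested-list representation, and the loop order (KK, JJ, I, K, J) are assumptions made for illustration. Python lists do not reproduce Fortran's column-major layout, so the sketch shows only the iteration order, not the cache behavior.

```python
def matmul_untiled(x, y, n):
    """Reference version: z[j][i] accumulates x[k][i] * y[j][k] over k."""
    z = [[0] * n for _ in range(n)]
    for i in range(n):
        for k in range(n):
            r = x[k][i]  # invariant in the j loop
            for j in range(n):
                z[j][i] += r * y[j][k]
    return z


def matmul_tiled(x, y, n, tk, tj):
    """Tiled version: the K and J loops are strip-mined into tiles of
    tk by tj, and the tile-controlling loops are moved outermost so the
    same tj-by-tk block of y is reused on every iteration of i."""
    z = [[0] * n for _ in range(n)]
    for kk in range(0, n, tk):
        for jj in range(0, n, tj):
            for i in range(n):
                for k in range(kk, min(kk + tk, n)):
                    r = x[k][i]
                    for j in range(jj, min(jj + tj, n)):
                        z[j][i] += r * y[j][k]
    return z
```

Both versions compute the same result; only the order in which elements of y are touched differs.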
Much previous work focuses on how to do the loop nest restructuring step in tiling [CK92, CL95, IT88, GJG88, WL91, Wol89]. This work, however, ignores the effects of real cache characteristics, such as low associativity and cache line size, on the cache performance of tiled nests. Because of these factors, performance for a given problem size can vary wildly with tile size [LRW91].
In addition, performance can vary wildly when the same tile sizes are used on very similar problem sizes [LRW91, NJL94].
These results occur because low associativity causes interference misses in addition to capacity misses.
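The effect of associativity on interference can be illustrated with a small cache model. The following Python sketch is not the simulator used in this paper; the function `count_misses`, its byte-addressed trace, and the LRU replacement policy are assumptions chosen for illustration.

```python
from collections import OrderedDict


def count_misses(trace, cache_bytes, line_bytes, ways):
    """Count misses for a sequence of byte addresses in a set-associative
    cache with LRU replacement (ways=1 models a direct-mapped cache)."""
    num_sets = cache_bytes // (line_bytes * ways)
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for addr in trace:
        line = addr // line_bytes
        lines_in_set = sets[line % num_sets]
        if line in lines_in_set:
            lines_in_set.move_to_end(line)        # mark as most recently used
        else:
            misses += 1
            if len(lines_in_set) == ways:
                lines_in_set.popitem(last=False)  # evict least recently used
            lines_in_set[line] = True
    return misses


# Two addresses exactly one cache size apart map to the same set.
trace = [0, 8192] * 4
direct_mapped = count_misses(trace, 8192, 32, 1)  # 8: every access misses
two_way = count_misses(trace, 8192, 32, 2)        # 2: only the first touches miss
```

The two lines occupy a tiny fraction of the cache, so none of the direct-mapped misses after the first two are capacity misses; they are interference misses that a 2-way cache absorbs.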
Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association of Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission. SIGPLAN '95, La Jolla, CA USA © 1995
In this paper, we focus on how to choose the tile sizes given a tiled nest. As in previous research, our algorithm targets loop nests in which the reuse of a single array dominates. Given a problem size, the Tile Size Selection (TSS) algorithm selects a tile size that eliminates self-interference and capacity misses for the tiled array in a direct-mapped cache. It uses the data layout for a problem size, cache size, and cache line size to generate potential tile sizes. If the nest accesses other arrays or other parts of the same array, TSS selects a tile size that minimizes expected cross interferences between these accesses and for which the working set of the tile and other accesses fits in a fully associative LRU cache.
We present simulated miss rates and execution times for a variety of tiled nests that illustrate the effectiveness of the TSS algorithm. We compare these results to previous algorithms by Lam et al. [LRW91] and Esseghir [Ess93]. On average, TSS achieves better miss rates and performance on direct-mapped caches than previous algorithms because it selects rectangular tile sizes that use the majority of the cache. In some cases, it achieves significantly better performance. If the problem size is unknown at compile time, the additional overhead of computing problem-dependent tile sizes at runtime is negligible. We show that because TSS effectively uses the majority of the cache and its runtime overhead is negligible, copying is unnecessary and significantly degrades performance.
Section 2 compares our strategy to previous research. In Section 3, we briefly review the relevant terminology and features of caches, reuse, and tiling. Section 4 describes the tile size selection algorithm, TSS. It generates a selection of tile sizes without self-interference misses in a direct-mapped cache using the array size, the cache size, and the cache line size. It selects among these tile sizes to generate the largest tile size that fits in cache and that minimizes expected cross-interference misses from other accesses. Section 5 presents simulation and execution time results that demonstrate the efficacy of our approach and compares it to the work of Esseghir and Lam et al. [Ess93, LRW91].
Related Work
Several researchers describe methods for how to tile nests [BJWE92, CK92, CL95, IT88, GJG88, Wol89]. None of this work, however, addresses interference, cache replacement policies, cache line size, or spatial locality, which are important factors that determine performance for current machines.
More recent work has addressed some of these factors for selecting tile sizes [Ess93, LRW91] .
Esseghir selects tile sizes for a variety of tiled nests [Ess93]. His algorithm chooses the maximum number of complete columns that fit in the cache.
Previous research has focused on how to transform a nest into a tiled version to eliminate these capacity misses [CK92, CL95, IT88, GJG88, Wol89]. We assume as input a tiled nest produced by one of these methods and turn our attention to the selection of tile sizes for the nest. Table 1 illustrates these quantities for the version of tiled matrix multiply in Figure 1(b).
The largest tile with the most reuse on the I loop is the access to Y. We therefore target this reference to fit and stay in cache.
We want to choose TK and TJ such that the TK x TJ submatrix of Y will still be in the cache after each iteration of the I loop and there is enough room in the cache for the working set size of the I loop, TK x TJ + TK + TJ. Given a target array reference, we now show how to select a tile size for the reference.
[Table 1: reuse factor and footprint of each array reference, including Z(J,I), in tiled matrix multiply.]
Detecting and Eliminating Self Interference
In this section, we describe how to detect and eliminate self-interference misses when choosing a tile size. We compute a selection of tile sizes that exhibit no self interference and no capacity misses. Factors such as cross interference and working set size determine which size we select. We use the cache size, the line size, and the array column dimension. We only select tile sizes in which the column dimension is a multiple of the cache line size. The starting positions of the first and second set will differ by SetDiff = N - r1. The difference between subsequent sets will eventually become Gap = N mod SetDiff. The number of rows is determined by the point at which the difference changes from r1 to Gap. The algorithm for computing the number of rows for a column size which is a Euclidean remainder appears in Figure 4. The algorithm divides the cache into two sections: (1) the r1 gap at the end of the first set and (2) [...]
To take advantage of spatial locality, we choose column sizes that are multiples of the cache line size in terms of elements, CLS. We assume the start of an array is aligned on a cache line boundary.
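The candidate column sizes and their row counts can be sketched in Python as follows. This is not the algorithm of Figure 4: the row count is found by brute force, marking the cache locations each tile column occupies, and the function names are hypothetical. The sketch assumes a direct-mapped cache of cs elements, an array with column dimension n no larger than cs, a tile that starts at cache location 0, and col_size of at least 1.

```python
def euclidean_column_sizes(cs, n):
    """Return n followed by the nonzero remainders produced by running
    Euclid's algorithm on the cache size cs and the column dimension n."""
    sizes = [n]
    a, b = cs, n
    while b:
        a, b = b, a % b
        if b:
            sizes.append(b)
    return sizes


def max_rows_without_self_interference(cs, n, col_size):
    """Brute force: add tile columns (each col_size elements long and
    n elements apart in memory) until one maps onto a cache location
    already occupied by an earlier column of the same tile."""
    used = set()
    rows = 0
    while True:
        start = rows * n
        slots = {(start + i) % cs for i in range(col_size)}
        if len(slots) < col_size or used & slots:
            return rows
        used |= slots
        rows += 1


# Illustrative values: a 1024-element cache.
sizes_300 = euclidean_column_sizes(1024, 300)  # [300, 124, 52, 20, 12, 8, 4]
sizes_256 = euclidean_column_sizes(1024, 256)  # [256]: column divides the cache
rows_124 = max_rows_without_self_interference(1024, 300, 124)  # 7
```

For n = 300, the full column allows only 3 rows (900 elements), while the remainder 124 allows 7 rows (868 elements), so shorter columns do not necessarily waste cache.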
After we find the row size, we simply adjust colSize as follows. 
Minimizing Cross Interference
In this section, we compute worst case cross-interference misses for tiled nests. We use footprints to determine the amount of data accessed in the tile and choose a tile size that has a working set size that fits in the cache and minimizes the number of expected cross-interference misses.
Consider again matrix multiply from Figure 1 . On an iteration of the I loop (from Table 1 
The cross-interference rate, CIR, is thus [...]
We also minimize expected cross interference by selecting tile sizes such that the working set will fit in the cache, e.g., for matrix multiply the working set size constraint is TJ * TK + TJ + 1 * CLS < CS.
(Since X is register allocated, we reduce its footprint to CLS.)
The left-hand side of equation 4 is exactly the amount of cache needed for a fully associative LRU cache. In general, the working set is simply the sum of the footprints for the target loop which accesses the tile.
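The working set constraint for matrix multiply can be checked with a few lines of Python. The function names are hypothetical, and the numeric values (a 1024-element cache with 4-element lines) are illustrative assumptions rather than measurements from this paper.

```python
def working_set(footprints):
    """The working set is the sum of the footprints of all references
    in the target loop, measured in array elements."""
    return sum(footprints)


def mm_working_set(tj, tk, cls):
    """Matrix multiply: the tile itself (tj * tk), a tj-element footprint,
    and one cache line (1 * cls) for the register-allocated X."""
    return working_set([tj * tk, tj, 1 * cls])


def fits_in_cache(tj, tk, cls, cs):
    """Working set size constraint: TJ*TK + TJ + 1*CLS < CS."""
    return mm_working_set(tj, tk, cls) < cs


small_ok = fits_in_cache(124, 7, 4, 1024)  # 868 + 124 + 4 = 996, fits
too_big = fits_in_cache(300, 3, 4, 1024)   # 900 + 300 + 4 = 1204, does not fit
```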
We use the cross-interference rate and working set size constraint to differentiate between tile sizes generated by the algorithm described in Section 4.1. As the algorithm iterates, we select a new tile size without self interference if its working set size is larger than that of the previous tile size and still fits in cache, and the new size has a lower CIR than the previous size. (Section 5 demonstrates that these tile sizes result in lower miss rates for direct-mapped caches.)
If the tile we select does not meet these constraints, we decrease colSize by CLS until it does. Both this phase and the self-interference phase result in numerous gaps throughout the cache, rather than one large gap. Because cross-interfering arrays typically map all over the cache, multiple gaps minimize the expected interference, e.g., the columns of Z are more likely to map to one of many gaps rather than one larger gap.
Tile Size Selection Algorithm
The pseudo code for the TSS algorithm, which completely avoids self interference and capacity misses and minimizes expected cross interference, appears in Figure 5. We begin by selecting the column dimension, which is the maximum colSize, and determine the maximum rowSize without self interference.
In the pathological case, the column length evenly divides the cache and the algorithm terminates (this case often occurs with power of 2 array sizes). If the tile sizes are larger than the array or the bounds of the iteration space, there is no need to tile. In addition, if any of the dimensions of the tiles are larger than the bounds of the iteration space, the tiles are adjusted accordingly.
For simplicity, these checks are omitted from Figure 5 .
The while loop iterates, finding potential tile sizes without self interference using Euclidean numbers as candidates for column dimensions. After determining the number of rows that will not interfere for the given column size, the column size is adjusted to a multiple of the CLS. If any newly computed tile size has a larger working set size that is also less than CS and for which the CIR is less than that of the previous best tile size, we set this tile to be the best. If the initial tile size does not meet the working set size constraint and no subsequent tile size does either, we use the initial tile size, but reduce its column size by CLS until it meets the constraint.
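The selection step described above can be sketched in Python. This is one reading of the prose, not a transcription of Figure 5. The function `select_tile` is hypothetical; it takes the candidate (colSize, rowSize) pairs as input, initial tile first, and assumes each is already free of self interference. Rounding colSize down to a multiple of CLS is an assumption, and the working set and CIR functions are supplied by the caller.

```python
def select_tile(candidates, cs, cls, working_set, cir):
    """Pick the candidate whose working set is largest while still
    smaller than cs, replacing the current best only when the new
    candidate also has a lower cross-interference rate."""
    best = None
    best_ws = 0
    best_cir = 0.0
    for col, row in candidates:
        col -= col % cls          # assumed: round down to a multiple of cls
        if col == 0:
            continue
        ws = working_set(col, row)
        if ws >= cs:
            continue              # violates the working set constraint
        if best is None or (ws > best_ws and cir(col, row) < best_cir):
            best, best_ws, best_cir = (col, row), ws, cir(col, row)
    if best is None:
        # No candidate fits: shrink the initial tile's column by cls
        # until the working set constraint holds.
        col, row = candidates[0]
        col -= col % cls
        while col > cls and working_set(col, row) >= cs:
            col -= cls
        best = (col, row)
    return best


# Illustrative use with the matrix multiply working set and a
# stand-in CIR (NOT the paper's formula).
mm_ws = lambda col, row: col * row + col + 4
toy_cir = lambda col, row: 1.0 / (col * row)
chosen = select_tile([(300, 3), (124, 7), (52, 17)], 1024, 4, mm_ws, toy_cir)
```

With these illustrative inputs the initial tile (300, 3) exceeds the cache, (124, 7) fits with a working set of 996, and (52, 17) has a smaller working set, so `chosen` is (124, 7).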
Set Associativity
Set associativity does not affect the tile size picked by the algorithm for a particular cache size. As expected, increasing the set associativity usually decreases the miss rate on a particular tile size because more cross interferences, if they exist, are eliminated. Our results in Section 5 confirm this expected benefit from set associativity and illustrate that increasing set associativity causes the differences in miss rates between distinct tile sizes to become less extreme.
Translation Lookaside Buffer
A translation lookaside buffer (TLB) is a fast memory used for storing virtual to physical address mappings for the most recently referenced page entries. All addresses referenced in the CPU must be translated from virtual to physical before the search for an element is performed. If the mapping for the element is not in the TLB, a TLB miss occurs, causing the rest of the system to stall until the mapping completes. A TLB miss can take anywhere from 30 to more than 100 cycles, depending on the machine and the type of TLB miss.
To avoid TLB misses, tile sizes should enable the TLB to hold all the entries required for the tile. In general, the height (column in Fortran) of the tile should be much larger than the width of the tile. Since a TLB miss can cause a cache stall, ensuring that no TLB misses occur is more important than the cross and self interference constraints. The tile size therefore needs to be constrained such that the number of non-consecutive element accesses (i.e., rows) is smaller than the number of page [...]
We selected problem sizes of 256 x 256 to illustrate the pathological case, and 300 x 300 and 301 x 301 to illustrate the effect a small change in the problem size has on selected tile sizes and performance. We used these relatively small sizes in order to obtain timely simulation results. For execution results, we added a problem size of 550 x 550. We expect, and others have demonstrated, more dramatic improvements for larger array sizes.
Simulation Results for Tiled Kernel
We ran simulations on these programs using the Shade cache simulator. We used a variety of cache parameters: a cache size of 8K and 64K; set associativity of 1, 2, and 4; and a cache line size of 32, 64, and 128 [Col94]. Of these, we present cache parameters corresponding to the DEC Alpha Model 3000/400 (8K, 32 byte line, direct-mapped) and the RS/6000 Model 540 (64K, 128 byte line, 4-way) with variation in line size and associativity.
We also executed the kernels on these machines.
In Table 2 , we show simulated miss rates in an 8K cache for double precision arrays (16 byte) for the untiled algorithm and the version tiled with TSS. We present results for set assoclativities of 1, 2, and 4 and cache line sizes of 32 bytes and 128 bytes. TSS achieves significant improvement in the miss rates for most of these kernels. On average, it improves miss rates by a factor of 8.6. It improves SOR2D by a factor of 66.21 on a 8K, 4-way, 32
byte hne cache. The improvement for 32 byte lines is higher, a factor of 9.5, than that for 64 byte lines, 7,62, because the longer line sizes benefit these kernels all of which have good spatial locality. If the dramatic improvements of SOR2D are ignored, our algorlthm improves miss rates by an average of 2.3 (2.8 on 32 byte hnes and 1.8 on 128 byte hnes).
Two more trends for tiled kernels are evident in this [...] [Ess93]. We use the algorithms presented in their papers to compute the tile sizes for the different cache and data set.
LRW generates the largest square tiles without self interference.
Esseghir chooses the column length, N, for the column tile size and ⌊CS/N⌋ for the row size. For 1 dimensional tiling, simply choosing the correct number of complete columns of size N suffices and as a result comparisons are uninteresting. We therefore consider matrix multiply, SOR2D, and LUD2D since they are tiled in 2 dimensions and the algorithms usually produce different tile sizes. We compare the results for 8K and 64K caches separately since they have slightly different behaviors. Table 3 presents simulated miss rates for an 8K, 32 byte line cache with associativities of 1, 2, and 4 for TSS, LRW, Esseghir, and the untiled kernels. For each kernel, we present the selected tile size (the actual parameters to the tiled algorithm), the working set size (WSet), and the simulated miss rates. The working set size is presented to demonstrate the cache efficiency of the selected tile sizes. Notice that for SOR2D, LRW's tile sizes are not square as expected. We reduce the number of rows to account for the working set size of SOR2D.
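Esseghir's rule as summarized above is simple enough to state directly in Python. The function names are hypothetical, and the 1024-element cache is an illustrative assumption, not a configuration reported in the tables.

```python
def esseghir_tile(cs, n):
    """Column size is the full column length n; row size is the number
    of complete columns that fit in a cache of cs elements."""
    return n, cs // n


def spare_cache(cs, col_size, row_size):
    """Cache elements left over for all other data once the tile is resident."""
    return cs - col_size * row_size


col, row = esseghir_tile(1024, 300)  # (300, 3)
spare = spare_cache(1024, col, row)  # 124 elements remain
```

In this illustration the tile alone occupies 900 of the 1024 elements, leaving less than one 300-element column for the other references in the nest.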
8K Caches
LRW has lower miss rates than TSS on LUD2D and dramatically lower miss rates on SOR2D; TSS has lower miss rates on MM.
On average, TSS slightly improves miss rates by a factor of 1.03 over LRW, excluding arrays of size 256 x 256. (We exclude this case since padding is probably a better solution to pathological interference.) TSS has consistently lower simulated miss rates than Esseghir, on average by a factor of 6.66 (excluding arrays of size 256 x 256; including them, 5.34). For example, on SOR2D 301 x 301, 4-way, TSS improves miss rates by a factor of 16.6 over Esseghir. These results hold for larger line sizes as well.
TSS's lower simulated miss rates translate into better performance. Table 4 presents execution time results on the DEC Alpha (first level cache: 8K, direct-mapped, 32 byte line; second level cache: 512K, banked). We measured the execution times for TSS with and without computing the tile sizes at runtime. It made no measurable impact on performance.
The simulated miss rates and execution times for SOR2D 256 x 256 do not agree (LRW should be best), nor do the miss rates and execution times for LUD2D when TSS is compared to Esseghir (TSS should be significantly better). We believe these inconsistencies result from a combination of two factors: the difference in working set sizes and the Alpha's large second level cache (512K). TSS always has at least as good cache efficiency in terms of working set size as the other algorithms. Esseghir often uses too big a working set, resulting in interference.
LRW uses a small working set (often around 50%) because it is limited to square tiles. TSS may therefore get more benefit from the second level cache.
TSS always improves or matches execution time when compared to LRW or Esseghir's algorithm. On average, TSS improved over LRW by a factor of 1.12 when the pathological cases with array sizes of 256 x 256 are excluded. TSS improved over Esseghir on average by a more significant factor of 1.37, and by 2.01 on matrix multiply (again excluding array sizes of 256 x 256). Tables 5 and 6 show the same type of results as the previous two tables, but for variants of the RS/6000 organization (64K, 4-way, 128 byte line). The simulated miss rates and execution times showed more variations than those for the 8K cache. These inconsistencies probably result because the simulator uses an LRU replacement policy and the RS/6000 uses a quicker, but unpublished, replacement policy.
64K Caches
TSS still achieves lower miss rates more often (a factor of 1.19 without 256 x 256 arrays), but Esseghir's algorithm has lower miss rates for most of the 4-way simulated results. Esseghir's algorithm outperforms TSS and LRW in execution times on the RS/6000 (for TSS, by a factor of 1.03). This result probably stems from set associativity and efficiency of the working set size.
For this cache organization, Esseghir tends to encounter less interference because the working set sizes either fit in cache or are only slightly larger than the cache. When Esseghir encounters cross interference, the 4-way associativity is now more likely to overcome it through the cache, rather than through the tile sizes. LRW achieves lower miss rates than TSS by a factor of 1.17 (excluding 256 x 256 arrays), but most of this comes from SOR2D. When SOR2D is excluded, TSS has lower miss rates by a factor of 1.10 (excluding 256 x 256 arrays). TSS, however, continues to outperform LRW on the RS/6000, as it did on the Alpha, but by less. In both Tables 5 and 3, the square tile sizes use significantly less of the cache and are probably therefore not as effective.
Copying
To demonstrate that copying is unnecessary to achieve good performance, we compared execution times for tiled matrix multiply using tile sizes chosen by TSS to execution times for code generated by Esseghir's Tile-and-Copy algorithm [Ess93]. These results appear in Table 7. The execution times for TSS include computing the tile sizes at runtime. Tile sizes for Esseghir's code were calculated using the formula TS = <-S -2
where TS is the tile size. We also used Esseghir's algorithm 
