Abstract. Linear algebra codes contain data locality which can be exploited by tiling multiple loop nests. Several approaches to tiling have been suggested for avoiding conflict misses in low associativity caches. We propose a new technique based on intra-variable padding and compare its performance with existing techniques. Results show padding improves performance of matrix multiply by over 100% in some cases over a range of matrix sizes. Comparing the efficacy of different tiling algorithms, we discover rectangular tiles are slightly more efficient than square tiles. Overall, tiling improves performance from 0-250%. Copying tiles at run time proves to be quite effective.
Introduction
With processor speeds increasing faster than memory speeds, memory access latencies are becoming the key bottleneck for modern microprocessors. As a result, effectively exploiting data locality by keeping data in cache is vital for achieving good performance. Linear algebra codes, in particular, contain large amounts of reuse which may be exploited through tiling (also known as blocking). Tiling combines strip-mining and loop permutation to create small tiles of loop iterations which may be executed together to exploit data locality [4, 11, 26].
Due to hardware constraints, caches have limited set associativity, where memory addresses can only be mapped to one of k locations in a k-way associative cache. Conflict misses may occur when too many data items map to the same set of cache locations, causing cache lines to be flushed from cache before they may be reused, despite sufficient capacity in the overall cache. Figure 2 illustrates how the columns of a tile may overlap on the cache, preventing reuse. A number of compilation techniques have been developed to avoid conflict misses [6, 13, 23], either by carefully choosing tile sizes or by copying tiles to contiguous memory at run time. However, it is unclear which is the best approach for modern microprocessors.
We previously presented intra-variable padding, a compiler optimization for eliminating conflict misses by changing the size of array dimensions [20]. Unlike standard compiler transformations which restructure the computation performed by the program, padding modifies the program's data layout. We found intra-variable padding to be effective in eliminating conflict misses in a number of scientific computations. In this paper, we demonstrate intra-variable padding can also be useful for eliminating conflicts in tiled codes. For example, in Figure 2 padding the array column can change mappings to cache so that columns are better spaced on the cache, eliminating conflict misses.
Our contributions include:
- introducing padding to assist tiling
- new algorithm for calculating non-conflicting tile dimensions
- experimental comparisons based on matrix multiply and LU decomposition
We begin by reviewing previous algorithms for tiling, then discuss enhancements including padding. We provide experimental results and conclude with related work.
Background
We focus on copying tiles at run-time and carefully selecting tile sizes, two strategies studied previously for avoiding conflict misses in tiled codes. In the remaining sections, we refer to the cache size as C_s, the cache line size as L_s, and the column size of an array as Col_s. The dimensions of a tile are represented by H for height and W for width. All values are in units of the array element size.
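To make the notion of self-interference concrete, the following minimal sketch maps each element of an H x W tile to a slot of a direct-mapped cache of C_s elements and reports whether any two elements collide. It assumes column-major storage, a tile starting at address 0, and a cache line of a single element (L_s = 1). The function names are illustrative only and are not taken from any of the algorithms discussed in this paper.

```python
def tile_cache_slots(height, width, col_size, cache_size):
    """Cache slot of every element of a height x width tile.

    Assumes a column-major array whose columns hold col_size elements and a
    direct-mapped cache of cache_size elements (line size of one element).
    """
    return [(j * col_size + i) % cache_size
            for j in range(width)
            for i in range(height)]


def has_self_interference(height, width, col_size, cache_size):
    """True if two elements of the tile map to the same cache slot."""
    slots = tile_cache_slots(height, width, col_size, cache_size)
    return len(set(slots)) < len(slots)


if __name__ == "__main__":
    # C_s = 2048, Col_s = 300: six columns of height 300 fit side by side,
    # but a seventh column wraps around and overlaps the first one.
    print(has_self_interference(300, 6, 300, 2048))  # False
    print(has_self_interference(300, 7, 300, 2048))  # True
```

A brute-force check of this kind is far too slow for use inside a compiler, but it is a convenient way to validate tile sizes produced by a faster selection algorithm.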
Tile Size Selection
One method for avoiding conflict misses in tiled codes is to carefully select a tile size for the given array and cache size. (1) using C_s and Col_s as the initial heights. A complicated formula is presented for calculating non-conflicting tile widths, based on the gap between tile starting addresses and the number of tile columns which can fit in that gap.
Copying
An alternative method for avoiding conflict misses is to copy tiles to a buffer and modify code to use data directly from the buffer [13, 23]. Since data in the buffer is contiguous, self-interference is eliminated. However, performance is lost because tiles must be copied at run time. Overhead is low if tiles only need to be copied once, higher otherwise. Figure 3 shows how copying may be introduced into tiled matrix multiply. First, each tile of A(I,K) may be copied into a buffer BUF. Because tiles are invariant with respect to the J loop, they only need to be copied once outside the J loop.
It is also possible to copy other array sections to buffers. If buffers are adjacent, then cross-interference misses are also avoided. For instance, in Figure 3 the column accessed by array C(I,J) in the innermost loop is copied to BUF2 to eliminate interference between arrays C and A. Since the location of the column varies with the J loop, we must copy it on each iteration of the J loop, causing data in C to be copied multiple times [23]. In addition, the data in the buffer must be written back to C since the copied region is both read and written.
Whether copying more array sections is profitable depends on the frequency and expense of cross-interference.
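The tile-copying scheme described in this section can be sketched as follows. This is an illustrative Python version, not the code of Figure 3: it assumes square n x n matrices stored as lists of rows, copies only tiles of A (the copying of a column of C into a second buffer is omitted), and the function name and loop variable names are hypothetical.

```python
def tiled_matmul_copy(a, b, n, tile_h, tile_w):
    """Compute C = A * B, copying each tile of A into a contiguous buffer.

    a and b are n x n matrices given as lists of rows. The tile of A has
    tile_h rows (I direction) and tile_w columns (K direction).
    """
    c = [[0.0] * n for _ in range(n)]
    buf = [0.0] * (tile_h * tile_w)          # contiguous tile buffer
    for kk in range(0, n, tile_w):
        k_end = min(kk + tile_w, n)
        for ii in range(0, n, tile_h):
            i_end = min(ii + tile_h, n)
            h = i_end - ii                   # actual height of this tile
            # Copy the tile once, outside the J loop: it does not depend on J.
            for k in range(kk, k_end):
                for i in range(ii, i_end):
                    buf[(k - kk) * h + (i - ii)] = a[i][k]
            for j in range(n):
                for k in range(kk, k_end):
                    b_kj = b[k][j]
                    base = (k - kk) * h      # start of buffered tile column
                    for i in range(ii, i_end):
                        c[i][j] += buf[base + (i - ii)] * b_kj
    return c
```

Each element of A is copied exactly once, so the copying cost is proportional to n^2 while the multiply itself remains proportional to n^3.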
Tiling Improvements
We present two main improvements to existing tiling algorithms. First, we derive a more accurate method for calculating non-conflicting tile dimensions. Second, we integrate intra-variable padding with tiling to handle pathological array sizes.
Non-conflicting Tile Dimensions
We choose non-conflicting tile heights using the Euclidean GCD algorithm from Coleman and McKinley [6]. However, we compute tile widths using a simple recurrence. The recurrences for both height and width may be computed simultaneously using the recursive function ComputeTileSizes in Figure 4. The initial invocation is ComputeTileSizes(C_s, Col_s, 0, 1). W. This cost function favors square tiles over rectangular tiles with the same area; it is similar to that used by Coleman and McKinley [6].
Table 1 illustrates the sequence of H and W values computed by ComputeTileSizes at each invocation when C_s = 2048 and Col_s = 300. An important result is that each of the computed tile sizes is maximal in the sense that neither its height nor its width may be increased without causing conflicts. Moreover, ComputeTileSizes computes all maximal tile sizes. Note that at invocation 1, (H, W) is not a legal tile size since H = 2048 exceeds Col_s. In general, this can occur only at the first invocation, and a simple comparison with Col_s will prevent consideration of such tile sizes.
The formula used by ComputeTileSizes for finding non-conflicting tile widths is simpler than that of the Coleman and McKinley algorithm. In addition, it avoids occasionally incorrect W values that result from their algorithm.
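As a hedged illustration of such a recurrence, the sketch below pairs the Euclidean remainder sequence for heights with a width recurrence of the continued-fraction form W_next = floor(H / H_next) * W + W_prev. Figure 4 is not reproduced in this text, so the parameter order (H, H_next, W_prev, W) is an assumption chosen to match the initial invocation ComputeTileSizes(C_s, Col_s, 0, 1), and any adjustment for cache lines longer than one element is left out. The sketch also assumes a direct-mapped cache and 0 < Col_s < C_s.

```python
def compute_tile_sizes(h, h_next, w_prev=0, w=1):
    """Return the (height, width) pair produced at each invocation.

    Heights follow the Euclidean remainder sequence starting from
    (C_s, Col_s); widths follow W_next = (H // H_next) * W + W_prev.
    Assumed reconstruction; line size is taken to be one element.
    """
    tiles = [(h, w)]
    if h_next > 0:
        tiles += compute_tile_sizes(h_next, h % h_next,
                                    w, (h // h_next) * w + w_prev)
    return tiles


if __name__ == "__main__":
    # C_s = 2048, Col_s = 768: the first pair is taller than a column and
    # would be discarded; the widest remaining tile is 256 x 8.
    print(compute_tile_sizes(2048, 768))
    # [(2048, 1), (768, 2), (512, 3), (256, 8)]
```

For C_s = 2048 and Col_s = 768 the sketch yields no tile wider than 8, which agrees with the observation made for this column size in the discussion of padding.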
Padding
Our second improvement is to incorporate intra-variable padding with tiling. Previously we found memory access patterns common in linear algebra computations may lead to frequent conflict misses for certain pathological column sizes, particularly when we need to keep two columns in cache or prevent self-interference in rows [20]. Bailey [2] first noticed this effect and defined stride efficiency as a measure of how well strided accesses (e.g., row accesses) avoid conflicts. Empirically, we determined that these conflicts can be avoided through a small amount of intra-variable padding. In tiled codes a related problem arises, since we need to keep multiple tile columns/rows in cache.
Tiling heuristics:
ess: Largest number of non-conflicting columns (Esseghir)
lrw: Largest non-conflicting square (Lam, Rothberg, Wolf)
tss: Maximal non-conflicting rectangle (Coleman, McKinley)
euc: Maximal (accurate) non-conflicting rectangle (Rivera, Tseng)
wmc10: Square tile using 10% of cache (Wolf, Maydan, Chen)
lrwPad: lrw with padding
tssPad: tss with padding
eucPad: euc with padding
eucPrePad: euc with pre-copying to padded array
copyTile: Tiles of array A copied to contiguous buffer
copyTileCol: Tiles of array A and column of C copied to contiguous buffer
When ComputeTileSizes obtains tile sizes for pathological column sizes, though the resulting tile sizes are non-conflicting, overly "skinny" or "fat" (nonsquare) tiles result, which decrease the effectiveness of tiling. For example, if Col_s = 768 and C_s = 2048, ComputeTileSizes finds only the tile sizes shown in Table 2. The tile closest to a square is still much taller than it is wide. For this Col_s, any tile wider than 8 will cause conflicts. This situation is illustrated in Figure 2, in which the column size for the array on the left would result in interference with a tile as tall as shown. On the right we see how padding enables using better tile sizes.
Our padding extension is thus to consider pads of 0-8 elements, generating tile sizes by running ComputeTileSizes once for each padded column size. The column size with the best tile according to the cost model is selected. By substituting different cost models and tile size selection algorithms, we may also combine this padding method with the algorithms used by Lam, Rothberg, and Wolf and by Coleman and McKinley.
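The pad search can be sketched as below. Two parts of this example are assumptions rather than the method of the paper: the tile generator is an iterative Euclidean-style recurrence written for this sketch (line size of one element, direct-mapped cache), and the cost function 1/H + 1/W is only a stand-in that favors large, nearly square tiles, since the actual cost model is not legible in this text. All function names are hypothetical.

```python
def compute_tile_sizes(cache_size, col_size):
    """Non-conflicting (height, width) candidates no taller than a column."""
    tiles = []
    h, h_next, w_prev, w = cache_size, col_size, 0, 1
    while True:
        if h <= col_size:                    # skip tiles taller than a column
            tiles.append((h, w))
        if h_next == 0:
            break
        h, h_next, w_prev, w = (h_next, h % h_next,
                                w, (h // h_next) * w + w_prev)
    return tiles


def assumed_cost(h, w):
    """Stand-in cost model (an assumption): lower is better."""
    return 1.0 / h + 1.0 / w


def best_padded_tile(cache_size, col_size, max_pad=8):
    """Try pads of 0..max_pad elements; return (pad, height, width)."""
    if col_size <= 0 or col_size + max_pad > cache_size:
        raise ValueError("sketch assumes 0 < col_size + max_pad <= cache_size")
    best = None
    for pad in range(max_pad + 1):
        for h, w in compute_tile_sizes(cache_size, col_size + pad):
            candidate = (assumed_cost(h, w), pad, h, w)
            if best is None or candidate < best:
                best = candidate
    return best[1:]


if __name__ == "__main__":
    pad, h, w = best_padded_tile(2048, 768)
    print("pad", pad, "tile", h, "x", w)
```

Swapping `assumed_cost` or `compute_tile_sizes` for another cost model or selection algorithm leaves the search loop unchanged, which mirrors how the padding method can be combined with other tile size selection algorithms.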
Padding may even be applied in cases where changing column sizes is not possible. For example, arrays passed to subroutines cannot be padded without interprocedural analysis, since it is not known whether such arrays require preserving their storage order. In such cases the array can instead be pre-copied to a padded array; in many linear algebra codes the cost of this pre-copying is small compared to the cost of the actual computation. For instance, initially copying all of A to a padded array before executing the loop in Figure 1 adds only O(N^2) operations to an O(N^3) computation. We may therefore combine padding with tile size selection by either directly padding columns or by pre-copying.
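A minimal sketch of the pre-copying step follows, assuming an N x N array held in a flat column-major list; the function name and the flat-list representation are illustrative choices, not part of the original code. The copy touches each of the N^2 elements once and only changes the leading dimension from N to N + pad.

```python
def precopy_to_padded(a, n, pad):
    """Copy an n x n column-major array into one with padded columns.

    a is a flat list in which element (i, j) lives at index j * n + i.
    Returns (padded, ld), where element (i, j) lives at index j * ld + i
    and ld = n + pad is the new column size.
    """
    ld = n + pad
    padded = [0.0] * (ld * n)
    for j in range(n):
        for i in range(n):
            padded[j * ld + i] = a[j * n + i]
    return padded, ld
```

The tiled loop nest then indexes the padded copy with column size `ld`, so the tile size chosen for the padded column size applies without altering the caller's array.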
Experimental Evaluation

Evaluation Framework
To compare tiling heuristics we varied the matrix sizes for matrix multiplication (mult) from 100 to 400 and applied each heuristic under comparison. For each heuristic, performance on a Sun UltraSparc I and a DEC Alpha 21064 was measured. Both processors use a 16k direct-mapped Level 1 (L1) cache. In addition, several heuristics were applied to varying problem sizes of LU decomposition (lu). We also computed the percent cache utilization for several heuristics.
Fig. 5. Matrix multiplication: MFlops of tiling heuristics
Performance of mult
Tile Size Selection. We first consider heuristics which do not perform copying or padding. Ultra and Alpha megaflop rates of mult for these heuristics are graphed in Figure 5. The X-axis represents matrix size and the Y-axis gives Mflops. In the top graph we see that tiled versions usually outperform orig versions by 4 or more Mflops on the Ultra, improving performance by at least 20%.
We find that for sizes beginning around 200, ess and wmc10, the heuristics which do not attempt maximality of tile dimensions, obtain smaller improvements than euc, lrw, and tss, usually by a margin of at least 2 Mflops.
Performance of the latter three heuristics appears quite similar, except at the clusters of matrix sizes in which performance of all heuristics drops sharply. In these cases we see that euc does not fare nearly as badly, and that tss drops the most.
The lower graph gives the same data with respect to the Alpha. Behavior is similar, though variation in performance for individual heuristics increases, and ess is a competitive heuristic until matrix sizes exceed 250.
Lower ess performance indicates a cost model should determine tile heights instead of the array column size, as using the column size results in overly "skinny" tiles. Lower wmc10 performance underscores the need to better utilize the cache. tss would benefit from accuracy in computing tile widths. A prominent feature of both graphs is the gradual dip in performance of orig and ess beginning at 256. This occurs as matrix sizes exceed the Level 2 (L2) cache, indicating ess is also less effective in keeping data in the L2 cache than other heuristics.
The top graph in Figure 6 focuses on euc, lrw, and tss, giving percent Mflops improvements on the Ultra compared to orig. While all heuristics usually improve performance by about 25%-70%, we again observe clusters of matrix sizes in which performance drops sharply, occasionally resulting in degradations (negative improvement). euc does best in these cases, while lrw and especially tss do considerably worse. The lower graph shows results are similar on the Alpha, but the sudden drops in performance tend to be greater. Also, performance improvements are much larger beyond 256, indicating L2 cache misses are more costly on the Alpha.
Averaging over all problem sizes, euc, lrw, and tss improve performance on the Ultra by 42.1%, 38.0%, and 38.5%, respectively, and by 92.3%, 76.6%, and 76.4% on the Alpha. The advantage of euc over lrw indicates that using only square tiles is an unfavorable restriction. For instance, at problem size 256, ...
Padding. To avoid pathological problem sizes which hurt performance, we combine padding with tile size selection. Figure 7 compares euc with eucPad and eucPrePad. In both graphs, eucPad and eucPrePad improvements demonstrate that padding is successful in avoiding these cases. Moreover, the cost of pre-copying is acceptably small, with eucPrePad attaining improvements of 43.3% on the Ultra whereas eucPad improves performance by 45.5%. On the Alpha, eucPrePad averages 98.5% whereas eucPad averages 104.2%. Since pre-copying requires only O(N^2) instructions, the overhead becomes even less significant for problem sizes larger than 400. Improvements for lrwPad and tssPad, which do not appear in Figure 7, resemble those of eucPad. Both are slightly less effective, however. On average, lrwPad and tssPad improve performance on the Ultra by 44.3% and 43.8% respectively.
Copying Tiles. An alternative to padding is to copy tiles to a contiguous buffer. Figure 8 compares improvements from copyTile and copyTileCol with those of eucPad, the most effective noncopying heuristic. On the Ultra, copyTile is as stable as eucPad, and overall does slightly better, attaining an average improvement of 46.6%. Though copyTileCol is just as stable, overhead results in improvements consistently worse than both eucPad and copyTile, and the average improvement is only 38.1%. We find a different outcome on the Alpha, on which both copyTile and copyTileCol are superior to eucPad. This is especially true for larger matrix sizes, where copying overhead is less significant.
Summary. From the above results, we observe that tile size selection heuristics which compute maximal square or rectangular non-conflicting tiles are most effective. Also, padding can enable these heuristics to avoid pathological cases in which substantial performance drops are unavoidable. Moreover, we find copying tiles to be advantageous in mult.
Performance of lu
Tile Size Selection. We also compare the tiling heuristics euc and lrw on lu. Figure 9 gives percent Mflops improvements for euc and lrw. As with mult, on both the Ultra and the Alpha, large drops in performance occur at certain clusters of matrix sizes, and euc is again more effective in these cases. However, tiling overhead has a greater impact, leading to frequent degradations in performance until tiling improves both L1 and L2 cache performance at 256. As a result, overall improvements on the Ultra for euc and lrw are only 17.8% and 11.4%, respectively. On the Alpha, where performance is even worse for matrix sizes less than 256, overall improvements are 53.8% and 31.6% for euc and lrw.
Cache Utilization
Finally, cache utilization, computed as H * W / C_s, appears in Figure 11 for four heuristics. The top graph gives cache utilization for euc and lrw. Here, the X-axis again gives problem size while the Y-axis gives percent utilization for a 16k cache. We see that for lrw, which chooses only square tiles, utilization varies dramatically for different matrix sizes.
Related Work
Earlier models assumed fully-associative caches, but more recent techniques take limited associativity into account [10, 22].
Researchers began reexamining conflict misses after a study showed conflict misses can cause half of all cache misses and most intra-nest misses in scientific codes [18]. Data-layout transformations such as array transpose and padding have been shown to reduce conflict misses in the SPEC benchmarks when applied by hand [14]. Array transpose applied with loop permutation can improve parallelism and locality [5, 12, 19]. Array padding can also help eliminate conflict misses [1, 20, 21] when performed carefully.
Conclusions
The goal of compiler optimizations for data locality is to enable users to gain good performance without having to become experts in computer architecture. Tiling is a transformation which can be very powerful, but requires fairly good knowledge of the caches present in today's advanced microprocessors. In this paper, we have examined and improved a number of tiling heuristics. We show non-conflicting tile widths can be calculated using a simple recurrence, then demonstrate intra-variable padding can avoid problem spots in tiling. Experimental results on two architectures indicate large performance improvements are possible using compiler heuristics. By improving compiler techniques for automatic tiling, we allow users to obtain good performance without considering machine details. Scientists and engineers will benefit because it will be easier for them to take advantage of high performance computing.
