Loop tiling is an effective optimizing transformation to boost the memory performance of a program, especially for dense matrix scientific computations. The magnitude and stability of the achieved performance improvements are heavily dependent on the appropriate selection of tile sizes. Many existing tile selection algorithms try to find tile sizes which eliminate self-interference cache conflict misses, maximize cache utilization, and minimize cross-interference cache conflict misses. These techniques depend heavily on the actual layout of the arrays in memory. Array padding, an effective data layout optimization technique, is therefore incorporated by many algorithms to help loop tiling stabilize its effectiveness by avoiding ''pathological'' array sizes.
Introduction
As the speed of modern microprocessors has increased much faster than the memory speed, the memory traffic has become a key performance factor. As a result, effectively keeping reused data in the cache reduces the memory traffic and thus improves program performance. Cache performance is usually evaluated in terms of cache misses (or miss rate) which, according to the causes, can be classified as either compulsory misses or replacement misses [12] . A compulsory miss occurs if the first access to a block is not in cache. If a block in the cache is evicted and later retrieved, the miss is called a replacement miss. Replacement misses can be sub-categorized as capacity misses and conflict misses [15, 16] . Most compiler optimizations for improving cache performance have focused on reducing capacity misses, conflict misses, or both. A recent study showed that both capacity misses and conflict misses are equally important in determining the cache performance [24] .
Loop tiling [6, 8, 19, 21, 26, 32, 34, 38, 39 ] is a well-known compiler optimization that partitions the iteration space of a loop nest into tiles (or blocks) to avoid replacement misses of those array elements frequently referenced during the computation involving the tile. Early efforts have been to select the tile in such a way that its working set fits into the cache to eliminate capacity misses, and its size is maximized to minimize loop overhead [3] . Significant work has been done to quantify the size of the working set [7, 10, 29] .
Recent work takes conflict misses into account as well. With respect to tiling, conflict misses are classified as either self conflict misses, i.e., misses due to array elements of the same tile, or cross conflict misses, i.e., misses due to elements of different tiles. In addition to eliminating capacity misses and maximizing cache utilization, the tile is selected in such a way that there are no (or few) self conflict misses, and cross conflict misses are minimized [6, 8, 9, 21, 32, 37] . Some work has been done to quantify the total number of conflict misses [11] [12] [13] [14] 35] .
Unfortunately, the performance of a tiled program resulting from existing tiling heuristics shows a large amount of instability [28, 32] . Instability comes from the socalled pathological array sizes [2, 4, 10, 21] which result in poor choices of tile sizes. Array padding [1, 22, 23, 30, 31] is a compiler optimization that increases the array sizes and initial locations to avoid the pathological cases. It introduces space overhead but effectively stabilizes program performance. As a result, more recent research efforts have investigated the combination of both loop tiling and array padding in the hope that both magnitude and stability of performance improvements of tiled programs can be achieved at the same time [20, 28, 32] .
Example Figure 1 shows the original implementation of matrix multiplication and a version of tiled and padded program. The loop header do kk ¼ 1; n; w in the tiled version indicates that the loop counter kk starts from 1 and ends when it hits n with increments of w. The loop order for the original version is determined by applying loop order optimizations as proposed by McKinley et al. [25] . The iteration space spanned by k-loop and i-loop is tiled. As a result, a tile size h6w of array a is reused across iterations of the j-loop. Careful selection of the tile size will realize the data reuse and thus improve the program performance.
In this paper, a subset of published tile size selection algorithms are examined in a single framework-select from a set of candidate tile sizes the one that minimizes a particular cost model. These algorithms can be distinguished by their different choices of candidate tile sizes and cost models. The choices may affect the efficiency of an algorithm, the magnitude of the resulting performance improvement, and the stability of the improvement over different problem sizes.
Common performance factors in these choices are then identified. They are discussed in terms of criticality in precision to the effectiveness of performance optimization. Such information is crucial for any programr or compiler writer who wants to use loop tiling to improve program performance. In addition, the insights into the performance characteristics of previous algorithms can be used to improve existing tiling algorithms, or develop new algorithms such as the new tile size and pad size selection algorithm discussed in this paper. The new algorithm gives priority to performance factors differently, and uses different levels of approximations for these performance factors. Specifically, most previous algorithms search for tile sizes that eliminate self conflict misses, maximize cache utilization, and minimize cross conflict misses. In contrast, the new algorithm searches for tile sizes that eliminate TLB misses, have a reasonable cache utilization, and generate few conflict misses.
To validate the findings, the new algorithm and a set of populartile size selection algorithms are used to produce tiled codes which are then executed on real machines. Two benchmarks, matrix-matrix multiplication and LU factorization, are selected since all published tile size selection algorithms use them to demonstrate the effectiveness of their algorithms. Experimental results show that the new algorithm generates code with a performance and stability comparable to the best existing tiling algorithms, but at a much lower cost in terms of both space and execution time overheads.
In summary, the main contributions of the paper include
. A case study showing that more precise and costly models may not be justified by the resulting performance improvements since much simpler and faster models can achieve comparable effectiveness. . A new tile selection algorithm that selects tiles very fast, and effectively boosts and stabilizes performance while using only small pad sizes. . A discussion of several critical performance factors and possible approximation models.
The rest of the paper is organized as follows. Section 2 presents a more detailed description of the tested algorithms. Several performance factors are identified and discussed in Section 3. In addition, a new algorithm based on the findings is presented. A discussion of our experimental results and findings can be found in Section 4. Conclusions and future work are presented in the last section.
Tile size selection algorithms
Most tile size selection algorithms choose from a set of candidate tile sizes ðtÞ the one ðtÞ that minimizes a particular cost model, i.e.,
For simplicity, only square arrays of size n6n and rectangle tile sizes are considered in this paper. Since a rectangle tile size can be described in terms of its height ðhÞ and width ðwÞ, the tile size selection problem is reduced to determine the values of h and w that minimizes the cost model:
where t(fh6w j 1 h; w ng: ð3Þ
In the following, we refer to the cache size as C, cache block size as b, and cache associativity as a. All values are defined in terms of array elements, and the array is assumed to be stored in column-major order. Furthermore, we focus on the cases where the array size is too large to fit entirely in the cache ðC < n 2 Þ, but its single column can completely fit in ðn CÞ.
Many tile size selection algorithms concentrate on the tile sizes in t being nonconflicting. A tile size is non-conflicting if it does not generate any self conflict misses [32] . Whether or not a tile size is non-conflicting depends on the cache organization ðC; b; aÞ and the array size ðnÞ. As a result, the set of all non-conflicting tile sizes is denoted as nðC; b; a; nÞ. For algorithm efficiency, almost all algorithms using nonconflicting tile sizes restrict their candidates to the maximal ones. A non-conflicting tile size is maximal if neither its height nor its width may be increased without causing self conflict misses. The set of maximal non-conflicting tile sizes is in general much smaller than the set of all non-conflicting tile sizes, i.e., mðC; b; a; nÞ(nðC; b; a; nÞ: ð4Þ
These notations are summarized in Table 1 . Table 2 shows a number of tile size selection algorithms for matrix-matrix multiplication, extracted from the referenced papers. It can be seen that all presented algorithms differ in the set of candidate tile sizes and/or the cost models they use. In general, most algorithms search for the largest tile sizes that generate the least amount of capacity misses and conflict misses. Sometimes, high cache utilization and low cache misses may not be achieved simultaneously. As a result, we can consider each algorithm as a different way to approximate cache utilization and number of cache misses, and to weigh between these two quantities.
To find tile sizes that have few capacity misses, many algorithms restrict their candidate tile sizes to be the ones whose working set can entirely fit in the cache (e.g., h Ã w C). To model self conflict misses due to low associativity cache, some algorithms such as wmc and mhcf use the effective cache size ða Ã CÞ, while others explicitly find the non-conflicting tile sizes. Note that these non-conflicting tile sizes, by the definition, do not generate capacity misses, i.e.,
In summary, the surveyed algorithms tend to focus their attention on tile sizes that eliminate both capacity misses and self conflict misses. The minimization of cross A rectangle tile with height h and width w h6w
The selected tile size nðC; b; a; nÞ
The set of all non-conflicting tile sizes mðC; b; a; nÞ
The set of all maximal non-conflicting tile sizes D Pad size 
conflict misses and maximization of cache utilization are modeled in the cost function, usually in terms of the cross conflict miss rate. For algorithms using explicitly the non-conflicting tile sizes, mðC; b; a; nÞ is approximated differently. Algorithms moon and tli find the exact set of all nonconflicting tile sizes with respect to cache block size b. In contrast, algorithms such as tss and euc approximate the set by always using b ¼ 1. The approximation leads to a simple formula mðC; 1; 1; nÞ:fh i 6 minðw i ; nÞ : i ! 1g such that
The resulting algorithm is very fast, but the selected tile size is not guaranteed to be non-conflicting. In contrast, algorithms such as tli propose simulation-based methods to find the exact set for arbitrary b, which is computationally more expensive. Table 3 gives an example illustrating various approximations of mðC; b; a; nÞ. The desired tile shape ðh=wÞ has been explicitly specified in algorithms such as lrw, ess, and wmc. Both lrw and wmc search for square tiles ðh ¼ wÞ. In contrast, ess finds extremely tall tiles ðh ¼ nÞ. Tile shape can be implicitly favored through the cost model. A more detailed discussion can be found in Section 3.
Array padding extension
Loop tiling alone is in general effective in improving program performance. However, there are certain array sizes that will introduce severe cache conflict misses and deteriorate the performance improvement. Such array sizes are usually near the power of two. Array padding [1, 22, 23, 30, 31] is a data layout transformation that sets a dimension in an array to a new size to reduce the number of conflict misses. For example, the transformed code in Figure 1 has the leading dimension of its array a padded with pad size D. With the appropriate value of D for each array size n, array elements in a tile are mapped to the different cache lines, and therefore avoid the performance degradation due to severe conflict misses. In short, with the help of array padding, loop tiling is guaranteed to be effective across all problem sizes. We say it has stable performance improvement. From tile size selection perspective, the array sizes that cause severe conflict misses constrain the choice of candidate tile sizes when the non-conflictingness is a desired feature. Usually the set of candidate tile size for these array sizes are ill-shaped, i.e., either they are extremely long or extremely wide. It has been observed that ill-shaped tile sizes cannot boost program performance. As an example, consider the set of candidate tiles for euc when array size n is 127, as shown in Table 3 . It can be seen that there is no tile whose shape is near square, and therefore euc is constrained to select 124616 as its best choice. If the leading dimension of the array is increased with pad size D ¼ 5, then euc is able to find 61631 as its best choice.
Since mðC; b; a; nÞ is sensitive to the array size n, Rivera and Tseng [32] proposed an extension eucPad, which allows to increase the array size by at most eight elements in the hope of finding a ''better'' tile. By substituting different approximations for mðC; b; a; nÞ, we can evaluate the impact of the quality (preciseness) of approximations to mðC; b; a; nÞ on the tiling effectiveness.
Algorithm datPad [28] takes a very different approach. Unlike eucPad which simply uses padding to enlarge the set of tiles for selection, datPad finds the largest tile with a specified (program-dependent) shape, and then uses padding to eliminate self conflict misses. This algorithm considers different cache block sizes.
Performance factors and a new algorithm
Tile shape ðh=wÞ and cache utilization ðh Ã w=CÞ are two important performance factors considered by many algorithms, either implicitly through the cost model or explicitly through candidate tiles. In addition, many have observed that there is a ''tension'' between tile shape and cache utilization [8, 18, 21, 27, 32] . Extremely wide tiles may introduce severe TLB thrashing. On the other hand, extremely tall or square tiles may have low cache utilization.
For example, lrw will select the non-conflicting tile size 464 for case ðC; b; a; nÞ ¼ ð2048; 1; 1; 512Þ, which has a very low cache utilization of 0.78%. Algorithms such as ess, tss and euc allow rectangle tile shape in the hope to avoid the problem. For the same configuration, euc is able to select tile size 51264 instead and boosts the cache utilization up to 100%. However, if array size n is changed to 516, euc will select tile size 166127. Though the cache utilization is as high as 99.22%, this tile size requires 127 entries in the page table to keep it in the cache, given a page size of say 4 KB. In contrast, ess selects tile size 51663, an overly tall tile size. Though it will not introduce TLB thrashing, it has only 75.59% cache utilization. As a result, surveyed algorithms have different ways to weigh these performance factors.
Algorithms lrw, ess, and datPad give priority to tile shape over cache utilization. For datPad, the best tile shape is program dependent. For example, it determines that the best tile shape for matrix multiplication is square ðh=w ¼ 1Þ and the best tile shape for LU decomposition is h=w ¼ b. In contrast, euc suggests the same tile shape for both benchmarks because they are all linear algebra codes.
Tile shape is sometimes implicitly preferred through the cost model. For example, euc's cost model 1=h þ 1=w favors square tiles over non-square ones with the same area. For linear algebra codes, the memory access pattern within a tile is usually due to several array references. To model the effect of tile h6w interfering with two regions of sizes h61 and 16w, we can estimate the probability of cross conflict misses in terms of footprint overlap, as done in [8, 21] . As a result, euc's cost model h þ w=h Ã w is derived. That is, a ''good'' tile shape not only avoids TLB thrashing and low cache utilization but also minimizes cross conflict miss rate.
TLB thrashing have been explicitly considered in tss and a version of mhcf with multi-level cost functions. The complicated version of mhcf finds tile sizes that has no TLB misses, few cache misses, and minimize the (program dependent) cost model [26] . It can be done implicitly as well. For example, the constraints of h=w ¼ 1 and
may always be smaller than the page table size. We feel that TLB thrashing is a crucial performance factor since a TLB miss costs more than a cache miss and may cause cache stalls.
Finally, we want to mention that cache associativity has been taken into account by datPad but not other algorithms surveyed in this paper. Algorithm datPad uses cache associativity to adjust the effective cache size (i.e., a ¼ a À 1=a). In our target machines, we carefully chose one that has a 4-way set associative cache to evaluate the importance of this performance factor.
A new algorithm
Based on the discussion above, we propose in this paper a new algorithm newPad with the following three guidelines:
. The best tile size has few conflict misses and good cache utilization. . The best tile size eliminates TLB misses. . The padding choices should not be fixed a priori.
These guidelines reorder the priorities of performance factors. Specifically, instead of searching for tile sizes that eliminate self conflict misses, maximize cache utilization, and minimize cross conflict misses, as most previous algorithms do, the new algorithm looks for tile sizes that eliminate TLB misses, generate few conflict misses, and have a reasonable cache utilization. The new algorithm is greedy in that it increases pad size until a ''qualified'' tile size is found with the current pad size. If there are more than one such tile sizes, the algorithm will select the one that minimizes the cost model. The pseudo code of the algorithm is presented in Figure 2 , with parameters costðh6wÞ and goodðh6wÞ.
In Figure 3 , a possible formulation of goodðh6wÞ and costðh6wÞ is presented. A good tile size allows only part of TLB entries to be touched by the entire tile to avoid TLB thrashing. In a way it restricts the possible values of w. In addition, its area should be sufficiently large to guarantee a reasonable cache utilization. Finally, its shape should not be ''too far away'' from the ''optimal'' shape h=w ¼ b (explained below).
The cost model of newPad is similar to euc's cost model h þ w=h Ã w, but newPad takes into account the effect of large cache block size ðb > 1Þ. As a result, the cost model becomes h=b þ w=ðh=bÞw ¼ b=h þ b=w. A consequence of using the new cost model is that the optimal tile shape, instead of square, becomes h=w ¼ b, a rather long tile shape. Since tile shape implicitly estimates the number of cross conflict misses, bounding the tile shape and making sure it comes from mðC; 1; 1; n þ DÞ will introduce limited conflict misses.
Among all presented algorithms, newPad is closest to datPad. As a matter of fact, datPad can be considered as a specialization of newPad by providing more strict definitions of a good tile shape and cache utilization. Both of them do not fix the possible pad sizes and may therefore avoid cases where the fixed-amount pad choice strategy can only select ''bad'' tile sizes. Algorithm newPad considers TLB thrashing crucial, while datPad and eucPad do not. Table 4 summarizes the different characteristics of the algorithms tested in our experimental study. The final tile selections computed by these algorithms for the example in Table 3 are listed in Table 4 as well.
We do not claim that our formulation of goodðh6wÞ and costðh6wÞ is the optimal choice. However, we believe that our model reflects the importance of the different performance factor correctly. In addition, for any given tile size h6w, there always exists a pad size with no self conflict misses, for example, D ¼ iC þ h [20] . As a result, newPad will always terminate.
Experimental evaluation
To compare different tile size selection algorithms, the array size n was varied from 100 to 1,100 double-precision data elements with a step size of four elements. For each algorithm, two linear algebra benchmark kernels,namely matrix-matrix multiplication (mm) and LU-factorization without pivoting (lu), were executed on three target architectures with different cache organizations. The two benchmarks have been used by many published papers [8, 21, 28, 32, 36] to evaluate the effectiveness of their tile selection algorithms. For machines with multi-level caches (and thus multiple cache sizes), Rivera and Tseng [33] have found that nearly all the benefits can be achieved by simply targeting the first-level cache, provided that the next level cache is much larger. Therefore, the discussed algorithms assume tile size selection of the first-level data cache. Each kernel was compiled by SUN's SparcCompiler 5.0 f 77 compiler with the À O option, and executed in five runs. The minimum execution time, excluding the time for data initialization, was then reported. This way, we can minimize abrupt noises and file cache effects. The loop orders for the untiled versions were determined by applying loop order optimizations as proposed by McKinley et al. [25] . All experimental results are represented in terms of MFLOPS. The three target architectures are shown in Table 5 .
Performance and stability
Figures 4 and 5 plot the achieved MFLOPS rates (y-axes) for each of the algorithms and target machines for our 251 input array sizes ranging from 100 to 1,100 by a step of four (x-axes). We are concerned with the quality of each algorithm in terms of its performance improvement magnitude and stability. Table 7 summarizes the results by reporting average performance and the standard deviation over all problem sizes. The following observations can be made: . lrw and euc have similar performance improvement, which is better than ess.
. euc has a significant number of pathological cases which degrade performance improvement (note the downward peaks in euc). . Array padding can effectively stabilize the performance improvement (for instance, compare eucPad against euc). . Compared with tliPad, eucPad has similar performance improvement but slightly better stability. . newPad and datPad have similar performance improvement and stability. The stability of both algorithms is significantly better than eucPad and tliPad.
In summary, the above algorithms have similar performance improvement magnitude. While array padding helps stabilize the performance improvement, only datPad and newPad have a consistent behavior overall tested array sizes. 
Cost-benefit trade-offs
newPad and datPad have similar effectiveness in stabilizing performance improvement due to loop tiling, at the cost of padding dummy array elements. With respect to the space overhead due to array padding, an analysis reveals that on average newPad introduces slightly more space overhead than eucPad but much less than datPad, as shown in Table 6 . Compared to eucPad, the deviation by newPad is slightly higher. It is the price to be paid for ''unlimited'' padding.
Another dimension of the costs of a tile size selection algorithm is its execution time. Algorithms eucPad and newPad are based on the Euclidean GCD computation and are therefore fast to compute. datPad performs memory-based simulation to find the smallest pad size that ensures no self conflict misses in the selected tile. tliPad uses a clever ''simulation'' strategy by searching the tiles as a computation-based incremental process. Our data show that on average eucPad and newPad all take approximately 80 microseconds, tliPad takes 3.68 seconds, and datPad takes 0.057 seconds. That is, algorithms tliPad and datPad are executed several orders of magnitude (66 and 36, respectively) slower than the Euclidean GCD-based algorithms. The long execution time of tliPad is due to the fact that the algorithm does not know a priori the final tile and pad size, and enumerates all possible choices to determine the best one. In contrast, datPad has a predetermined tile shape and thus can quickly compute the largest tile size with this shape and reject inappropriate pad sizes.
The preliminary experimental data suggest that more precise and costly tile selection and array padding may not be justified by the resulting performance improvements since such improvements can also be achieved by simpler, more approximate and therefore cheaper models. The quality of a tile selection algorithm is determined by its ability to identify critical performance factors and the degree in which such factors need to be approximated through performance models.
Experimental results show that less strict definitions achieve comparable magnitude and stability of performance improvements with significantly smaller pad sizes.
Conclusions and future work
We have presented several algorithms combining tile selection and array padding. A new tile selection algorithm has been introduced. The new algorithm and other published algorithms were evaluated in terms of the magnitude and stability of the performance improvement, the space overhead introduced by padding, and the time for tile selection. The experiments showed that the new algorithm generates tiles of comparable performance and stability as the best of our tested algorithms, but has a significant lower space overhead (factor of 7) and selection time (up to three orders of magnitude). We have found that the cost-benefit balance is a key in designing such an effective, yet inexpensive tile selection algorithm. We have observed that more precise and costly tile selection and array padding may not be justified by the resulting performance improvement since such improvements may also be achieved by much simpler and cheaper models. A few performance factors for tile selection have been identified and discussed. We found that: (1) TLB thrashing effect is critical, (2) tile shape is important, but squareness is not critical, (3) tile area is important, but largest cache utilization is not critical, and (4) set associativity is not important. These observations hold for the two prevalent benchmark kernels mm and lu.
In the future, we are planning to perform experiments on a wider range of benchmark programs and underlying architectures. In addition, we are interested in investigating the impact of other program transformations such as tiling for multiple arrays and register tiling [5] on tile selection. Finally, we want to evaluate the possibility of automatic construction of effective, yet low-cost models [17] . 
