The study and understanding of memory hierarchy behavior is essential, as it is critical to current systems performance. The design of optimising environments and compilers, which allow the guidance of program transformation applications in order to improve cache performance with as little user intervention as possible, is particularly interesting. In this paper we introduce a fast analytical modelling technique that is suitable for arbitrary set-associative caches with LRU replacement policy, which overcomes weak points of other approaches found in the literature. The model was integrated in the Polaris parallelizing compiler, to allow automated analysis of loop-oriented scientific codes and to drive code optimizations. Results from detailed validations using well-known benchmarks show that the model predictions are usually very accurate and that the code optimizations proposed by the model are always, or nearly always, optimal.
Introduction
The widening disparity between processor and main-memory speeds makes memory hierarchy performance one of the most critical factors that influences global system performance. Programmers are aware of this and try to hand-tune it to their codes through a time-consuming trial and error process [1] , either intuitively, or using costly traditional techniques such as trace-driven simulations [2] , or profiling (e.g. using built-in hardware counters [3] ). These methods are not the most suitable ones, because they provide very little information on the reasons for a given memory hierarchy behavior and their time requirements are high. As a result, a lot of effort is made to automatically and efficiently guide this kind of optimization. Present compiler technology usually relies on limited scope heuristics and simple analyses that lack the necessary generality and precision. New general techniques for analysing cache performance are required to give accurate predictions of memory hierarchy behavior, improve its understanding, and to provide the information needed to propose improvements to the cache configuration or the code structure.
Analytical models based on the direct analysis of the source code seem to be a very reasonable approach to meet these challenges, as they overcome the limitations of the other strategies. Still, traditionally, they have had two important disadvantages with respect to the aforementioned techniques. One has been the lack of accuracy, as many models give rough approximations of the number of misses, rather than precise quantitative estimations. This is not surprising, since cache behavior is known to be highly unstable and sensitive to small changes in a large number of parameters [4, 5] : code structure, problem sizes, base addresses, cache layout, etc. Another drawback of the analytical models has been their limited scope, as, in some way, they all restrict either the code structure or the memory hierarchy they can analyse. Finally, only the construction of some models [6] [7] [8] [9] [10] has been systematized to the point of allowing their integration in frameworks that can apply them automatically. The resulting tools analyse cache performance and/or propose suitable program transformations in order to optimize cache behavior for the given input codes.
Our group has developed a fast analytical model [11] that makes accurate predictions of the performance of set-associative caches with LRU replacement policy for general loop-oriented codes. It is based on the generation of a set of what we denote as probabilistic miss equations (PMEs). These equations provide estimators for the number of misses generated by each reference in each loop. Unlike most of the other models found in the bibliography, which consider separately each loop nest, ours takes into consideration, the probability of hits due to reuse of data structures accessed in loops in different nests. This is very important, since most misses may occur between loop executions [12] . Although a conservative approach is applied in the current implementation, the analysis of real scientific codes without limiting the study to isolated loop-nests becomes feasible. It should also be mentioned that our model is the only one of this kind that is based on a probabilistic approach. This property enables it to predict memory system performance without knowing the base addresses of the data structures, something that no other model can do. If such addresses are available, the model uses them to improve the prediction. A final property of our model is its velocity, which enables it to be used for extensive searches in the optimization space of large codes.
In this paper we have unified the different cases for generating the miss equations exposed in [11] in a single expression. Furthermore, the model has been integrated in the Polaris parallelizing compiler framework [13] , to automatically make performance estimations and propose code optimizations. In this paper the model is validated, using both small kernels and codes from the SPECfp95 suite, and the feasibility of driving compiler optimizations that lead to optimal or almost optimal solutions using the analytical model is shown. In fact, we will see that the optimizations proposed by our tool are usually better than those of a current production compiler.
This paper is structured as follows:
The following section provides a detailed review of related work. Section 3 introduces the underlying modelling ideas on which PMEs are based. It explains how to derive both formulae that give the number of misses for each reference in a set of nested loops, and the miss probabilities generated by the accesses performed during the given reuse distances that such formulae require. Details about the structure and functionality of our tool, integrated in Polaris to analyse real codes and suggest code optimizations, can be found in Section 4. The next section compares the model predictions with the measurements of trace-driven simulations to evaluate its accuracy and speed. Section 6 focuses on automatic code optimization. Section 7 follows with our conclusions.
Related work
A number of analytical models based on the code structure have been developed, but they differ in distinct levels of coverage with respect to scope of applicability, type of results delivered, accuracy, execution times required for obtaining model results, or implementation in a compiler framework. For example, Ferrante et al. [14] consider an arbitrary degree of associativity, but they estimate the number of different lines that the references in a given loop-nest access rather than the real number of misses. This is often misleading, as conflict misses play a very important role that is not considered in that work. A model based on a code that considers all the kinds of misses and which can be applied to general numerical codes is presented in [5] , although it is restricted to direct-mapped and isolated perfect loop-nests. The miss formulae of this model are based on the calculation of the footprints on the cache of the different references of the loop. More recently, Ghosh et al. [7] and Harper et al. [8] have overcome some of these restrictions. The first paper introduces Cache Miss Equations (CMEs), a system of linear Diophantine equations where each solution corresponds to a potential cache miss. Although CMEs can be generated in a few seconds for standard benchmark programs and are very accurate, using them to try to predict or optimize the cache behavior seems to have heavy computing requirements. On the other hand, Harper et al [8] focus on the same class of nests as [5] and also base their approach on footprints. Their model is devoted to set-associative caches and has modeling times very similar to those of our tool. Tests show that the error of our model is usually between two and three times smaller than theirs, using the same example codes they propose.
All the works referenced above share a common limitation that we have overcome: their modelling is suitable for regular access patterns in isolated, perfectly nested loops, and they do not take into account the probability of hits in the reuse of data structures referenced in previous loops. This is a very important issue, as pointed out by McKinley and Temam [12] . In any case, our model is not the only one capable of handling non-perfect loop-nests and analysing a whole code rather than isolated loop-nests. Recently Vera and Xue [9] have reduced the computing requirements of the CMEs by applying statistical techniques and have extended the model to handle complete codes with regular computations. Table 1 compares the accuracy and speed of that model and ours using the cache configurations that they use in most of their evaluations. MMT is a matrix product with blocking in two dimensions that Vera and Xue use to compare their model with ours, while Tomcatv and Swim are two well-known SPECfp95 codes. We can see that accuracy is similar, while our model is much faster. Notice that the models must be applied hundreds or thousands of times to search in the optimization space for optimal block sizes, paddings, etc. So typical compiler applications will require computing times which are several orders of magnitude longer than those required by a single evaluation of the models. An approach based on Presburger formulae [10] also addresses the reuse in non-perfect loop-nests and gives exact predictions for small kernels that can include certain conditional statements and non-linear data layouts, while allowing a certain amount of symbolic analysis. Drawbacks also exist, as it can only handle modest levels of associativity and is very time-consuming, which currently reduces its applicability. Blieberger et al. [15] propose a model completely focused on symbolic evaluation and analysis which addresses set-associative caches. Their strategy Table 1 Comparison between Vera and Xue's EstimateMisses algorithm (subindex E) using a 95% confidence and an interval of 0.05 [9] , and our probabilistic model (subindex P ) for a 32 Kbyte cache with lines of 32 bytes and different degrees of associativity (K) The error (Err) is expressed as the absolute difference between the predicted and the measured miss ratios. Modeling times (Time, in ms) for both models were measured on a 933 MHz Pentium III.
involves instrumenting source codes that are then symbolically evaluated to generate symbolic trace-files which store all the references of the program as symbolic expressions and recurrences. The symbolic evaluation of this trace-file, based on the cache parameters, yields the final analytical cache hit function. Unfortunately it has only been evaluated with simple kernels, for which it yields very accurate results, and there is no information about its computing requirements. A very important advantage, exclusive of our model, is that it is able to predict the behavior of a code without knowing the base addresses of the data structures, which is a quite common situation. For this reason it is also the only one applicable to physically indexed caches or dynamically allocated data structures. Some other recently designed tools, such as Delphi [6] and SPLAT [16] analyse codes through both run-time profiling and analytic models. Delphi bases its model on stack distances, by generating stack histograms for each individual loop-nest. This restricts the accurate estimation to fully associative caches with LRU replacement policy (although it includes a probabilistic approach to handle general set-associative caches), and causes the loss of reuse information between different loop-nests. Delphi also includes a model for indirect accesses and a CPU model that we have used as a complement to our model in Section 6, to obtain real computing time estimations. The locality analysis in SPLAT uses a series of phases. In the volume phase the cache behavior is considered to be similar to that of an LRU fully associative cache, while in the interference phase the model only focuses on direct-mapped caches. However, it relies on the CMEs to estimate the behavior of set-associative caches. Both tools, Delphi and SPLAT, require computing times to perform their analysis which are typically several orders of magnitude larger than those of our tool. In both cases prediction errors are usually similar or larger than ours, but only consider codes in which the vast majority of the accesses are sequential, like tomcatv and swim. These kinds of codes are better suited to their strategy of modelling setassociative or direct-mapped caches as if they were fully associative. Experiments using the matrix product with blocking, a code which presents a larger variety of access patterns, show that large deviations appeared in the predictions of Delphi (we had no access to SPLAT), while our model generates very good predictions (see Section 5).
Model concepts
We classify misses as either cold or interference misses. When a line is accessed for the very first time, it gives place to a cold miss. All the remaining misses on that line take place when the processor tries to reuse the line, but the line has been replaced due to accesses to other lines that interfere with it in its cache set. For example, in a K-way associative cache with LRU replacement, the condition for a cache line to be replaced is that K, or more different lines mapped to its same cache set, be accessed since the last access to this line. Analysing the access patterns of the different references during a given portion of the program execution, provides knowledge of how many different lines have been accessed, as well as how many of them will be reused and which other lines will be accessed between those reuses. This way, miss equations that estimate the number of misses that such references generate can be constructed. Our model differs from all the related works in the bibliography in its probabilistic approach. It builds PMEs based on the estimation of miss probabilities for the line reuses. These probabilities depend on the cache area affected by the accesses performed between the consecutive accesses to each line.
Let us now consider an initial modelling scope to develop these ideas. This scope is a set of normalized nested loops, like the one in Fig. 1 , where several arrays are accessed through references indexed by affine functions of the enclosing loop variables. The references can be found in any nesting level, so imperfectly nested loops can be analysed. The number of iterations of each loop will be known at compile time and is the same in each execution of the loop. The code is executed in a machine with an arbitrary cache size, line size and degree of associativity. The only limitation regarding the cache is that the replacement policy must be LRU. Under these conditions, which depict the most common indexing scheme in scientific and engineering codes, a PME for each reference in each nesting level is generated following some simple ideas:
• Each loop enclosing a given reference gives place to a stride in the accesses of the reference (which may be zero) and to a fixed number of accesses following this stride. This way, a function to estimate the number of misses generated by the reference during execution of that loop can be built (see Section 3.1).
• The existence of several references to a given array gives place to reuses of those lines that are accessed by two or more of these references (group reuse). As a result, miss equations for loops in which these reuses occur will not reflect the total size of such loops, but rather the number of iterations between consecutive reuses. In order to simplify the analysis, our model only computes the reuse among refe- rences which are in translation, that is, their indices only differ in one or more of the added d constants. Most reuse in scientific codes comes from this kind of reference groups.
• The cache area affected by the accesses of a given reference during a number of executions of any of its enclosing loops can be computed either analytically or through simulation, or following a mixed approach. The probability that these accesses generate interferences that affect this specific reference, or others, can be derived from this cache area (see Section 3.2).
• Once these ideas are developed, it is clear that there is no reason to restrict our study to a single simple nest. Computing reuse distances and interference probabilities between references located in different nests has also been done, although a number of restrictions must be met (see [11] for more information).
The modelling requires data such as the size of the loops, the indices of the references and so on. Many times this data is not available at compile time. In these cases, code instrumentation could capture such data at run-time, which for example is the approach of Delphi [6] . As for data locations, the model can use them if they are available, but it can also predict the memory behavior using a completely probabilistic approach that does not use them. Our model cannot be applied to codes that include conditional structures.
In the following we discuss the PMEs and the miss probabilities that PMEs use in Sections 3.1 and 3.2, respectively. It is interesting to note that although this basic modelling scope allows the analysis of many programs, our modelling ideas are general enough to allow the expansion of this scope in a number of ways. For example, our model already supports some kinds of loop dependences (blocking) and loops with a different number of iterations in different executions (triangular loops).
Probabilistic miss equations
Our model associates a probabilistic miss estimator F Ri ðpÞ to each reference R and enclosing loop at nesting level i, which estimates the number of misses the reference generates during the execution of that loop. The estimator is a function of the miss probability p in the first access to a line of the data structure that R references during the execution of that loop. This probability is an input parameter for the estimator in this nesting level, because it depends on the access patterns and accesses in outer and/ or previous loops. If R is enclosed by loop i þ 1, F Ri ðpÞ is calculated in a PME in terms of F Rðiþ1Þ ðpÞ, as each iteration of the loop i involves the execution of the immediate inner loop i þ 1, and thus F Rðiþ1Þ ðpÞ includes the access pattern of the reference in that loop and the misses generated in it. On the other hand, the analysis of loop i provides p for any immediate inner loop i þ 1. Thus, the analysis of each reference starts in the innermost loop that contains it, and finishes in the outermost loop. In the innermost loop containing a reference, there is no inner loop to analyze that carries information on the access pattern of the reference. As a result, in the PME that generates the estimator for this loop, the F Rðiþ1Þ ðpÞ estimator really corresponds to one execution of the loop body, rather than to the execution of any inner loop. This way, in the PME that generates the estimator for the innermost loop, F Rðiþ1Þ ðpÞ is replaced by p, the probability of a miss in any single access. The input miss probability for the outermost loop of the code is one, as there are no portions of the data structure in the cache when the execution begins. Notice that the generation of a miss estimation per reference and loop facilitates the detection of those references and loops where data accesses become a serious bottleneck (hot spots).
Let N i be the number of iterations of the loop at nesting level i, and L Ri be the number of iterations in which R cannot reuse lines in this loop, then the estimator F Ri ðpÞ for this reference R is obtained by the PME
if R carries no reuse with other references. Itði; nÞ represents the memory regions accessed during n iterations of the loop at nesting level i, and P R (RegionsSet) is the interference probability that the access to region(s) RegionsSet generates on the accesses of R. In Section 3.2 we elaborate on the mapping of accessed regions to the interference probabilities they generate. This formula reflects that the miss probability for the L Ri loop iterations in which there can be no reuse in this loop, depends on the accesses and patterns in the outer loops (given by p), while the miss probability for the remaining iterations and their accesses is a function of the regions accessed during the portion of the program executed between those reuses, which is one iteration of this loop. The indices of R are affine functions of the enclosing loop variables, so there is a constant stride S Ri for this reference associated to the loop i. As a result, L Ri can be calculated as
where L s is the number of array elements a cache line holds. The formula estimates the number of accesses of R that cannot exploit either temporal or spatial locality, and it is equivalent to estimating the number of different lines that are accessed in N i iterations with stride S Ri . On the one hand, if the index of loop i does not index reference R, then S Ri ¼ 0 and L Ri ¼ 1. This is correct, as in that case there is a single iteration for cold accesses with respect to this loop (the first one) and all the remaining iterations are potential reuses due to temporal locality. On the other hand, if R is indexed by the loop variable, then S Ri > 0 and it accesses different data items in different iterations. In this case, the reuse of lines related to the execution of this loop is only possible by exploiting spatial locality. As we access L Ri different lines during the N i iterations of the loop with stride S Ri , then this locality can be exploited by the remaining N i À L Ri iterations. This way Eq. (2) captures both temporal and spatial locality, and in either case the reuse distance is always one iteration of the loop which is being considered, which justifies Eq. (1). The above formulae are applied when the reference carries no reuse with other references. When the reference belongs to a group of references in translation, they all exhibit the same strides, but may differ in the d constants of the indices. This gives rise to possible (spatial or temporal) systematic group reuse between them, which is taken into account when building the corresponding PMEs. Our model sorts the references in descending order of their base address in order to compare them and estimate the reuse distance, measured in loop iterations, of the lines accessed by these groups of references. The PMEs corresponding to these references are summationies, in which each term expresses the number of accesses that have a given reuse distance multiplied by the miss probability associated with that distance. This algorithm is explained in [11] .
Example. In order to illustrate our model, we will explain the modelling of the following simple loop, where the references have been numbered:
Let us consider for example a computer with 8-byte words and a 2-way 4 Kw cache with 4-word lines. Then, L s ¼ 4, and the PMEs for references R 2 , A(I+2), and R 3 , B(1,I), for loop I at nesting level 0 with N 0 ¼ 200 iterations, would be:
as R 2 has stride S R 2 0 ¼ 1 in the loop, which implies in Eq. (2) that L R 2 0 ¼ 50; and the stride for R 3 is S R 3 0 ¼ 250, resulting in L R 2 0 ¼ 200. Notice that L Ri is always the number of different lines the reference R accesses in the loop. Finally, this loop contains no other loops, so F Rðiþ1Þ ðpÞ is replaced in the PMEs by p, as we have explained.
The PME for reference R 1 , A(I), cannot be derived from Eq. (1) because it exhibits group reuse with respect to the reference R 2 . Although this kind of modeling is not explained in this paper, we illustrate it for this example with this simple reasoning. Its PME is F R 1 0 ðpÞ ¼ 100 Á ðP R 1 ðItð0; 1ÞÞÞ, because in 50 out of the 200 iterations of the loop it is accessing the line that R 2 accessed in the previous iteration, and in another 50 iterations it is accessing the same line it accessed in the previous iteration. This yields a total of 100 accesses whose reuse distance is Itð0; 1Þ, this is, one execution of the loop under consideration. In the remaining 100 iterations, R 1 is accessing the same line as R 2 and there are no intermediate accesses, so these accesses cannot result in misses. As a result, these iterations are not represented in the PME.
Miss probabilities
In Section 3.1 interference probability has been represented through the reuse distance between two consecutive accesses to the line accessed by the reference whose behavior is being studied. The process to calculate the miss probability associated to a given reuse distance has three steps: Access Pattern Description, Cache Impact Quantification, and Area Vectors Addition. The first one involves the description of the access pattern followed by the references inside the reuse distance. This is achieved by examining the indices of each dimension as well as the number of iterations during the reuse distance for each loop that controls these indices. Considering these two factors together, we will obtain the shape and size of the accessed region.
In the second step, Cache Impact Quantification, each region is analyzed in order to describe the way its access affects the cache. This involves calculating the cache footprint of the region and its contribution to the miss probability we are estimating. This is represented in our model by the interference area vectors. Given a data structure A and a K-way associative cache, we call S A ¼ S A 0 ; S A 1 ; . . . ; S A K is the area vector associated with the accesses to A during a given period of the program execution. The element in the ith position of the vector stands for the ratio of sets that have received K À i lines of this structure. The exception is S A 0 , as it is the ratio of sets that have received K or more lines. Thus, S A 0 is the probability that the accesses to array A produce an interference, as we are considering caches with LRU replacement policy. Notice that estimating the area vector of a given region just means counting how many different lines of that region fall in each cache set. A region has two kinds of interference area vectors associated to it. Self-interference area vectors contain the information required to estimate the miss probability on the reference(s) that access the region they depict. On the other hand, cross interference area vectors are generated by the accesses of the references to other regions. The calculation of both kinds of area vectors for a given access pattern is different only because a given line does not generate interferences with itself, thus the calculations are very similar.
There are a number of typical access patterns which account for the vast majority of the accesses in scientific codes. A previous work [17] , focused on non-automatic modeling of irregular codes, describes formulae and algorithms to estimate the area vectors associated to many of these patterns. The most common regular patterns found in loop-oriented scientific and engineering applications are sequential access and access to regions of the same size separated by a constant stride. Simulation must be used when we lack the analytical tools to calculate this impact.
In the following, we show the sequential access to n words as an example, in order to illustrate our area vector estimation approach. Its cross interference area vector, S s ðnÞ, can be estimated as:
where l ¼ maxfK; ðn þ L s À 1Þ=ðC s =KÞg is the average number of lines of this access that are placed in each set when it finishes. In this expression L s stands for line size and C s for cache size. The term L s À 1 added to n stands for the average extra words brought to the cache in the first and last accessed lines. The formula expresses that ðl À blcÞ Â 100% of the cache sets keep K À blc À 1 lines of this access and the remaining cache sets keep K À blc lines.
The self-interference area vector associated to this access, S sa ðnÞ, expresses the interference that the lines associated to this access generate on themselves.
Once the effect on the cache of the accesses to the different data structures has been calculated, a third step, Area Vectors Addition, is required to add them together, in order to obtain the global effect on the cache. If the relative position of the data structures is known, each area vector is scaled before its addition: the amount of overlapping between its region and the data structure associated to the reference whose PME is being calculated is expressed using a coefficient. Once the scaling has been done, the area vectors are added considering their area ratios as independent probabilities. Given two area vectors S U and S V , their addition, represented by the operator [, is calculated as
It should be mentioned that when the analysed references are in translation, the access pattern is sequential and the array base addresses are known, then an optimized algorithm which considers the pathological conflicts that may arise in such cases is applied.
Example. The PMEs calculated in the example in Section 3.1 must be completed with the corresponding miss probabilities. Both F R 1 0 ðpÞ and F R 2 0 ðpÞ depend on the miss probability on the line they access due to the regions accessed during one loop iteration. In each of the iterations in our example loop, two elements (separated by one) from A and one element from B are accessed. This access pattern to A is equivalent, from the point of view of the cache footprint, to an access to three consecutive elements. This way, both regions can be modeled as a sequential access. As the degree of associativity is K ¼ 2, the area vectors that depict these regions have three elements. The cross interference area vectors associated to these regions are S cross A Itð0;1Þ ¼ S s ð3Þ ¼ ð0; 1:5=512; 510:5=512Þ for array A, and S cross B Itð0;1Þ ¼ S s ð1Þ ¼ ð0; 1=512; 511=512Þ for matrix B, both calculated according to Eq. (3). The selfinterference area vectors for both data structures would be S self A Itð0;1Þ ¼ S sa ð3Þ ¼ S self B Itð0;1Þ ¼ S sa ð1Þ ¼ ð0; 0; 1Þ, as neither of these regions interferes with itself in the cache.
As explained, the last step in order to calculate the miss probabilities is to add the area vectors corresponding to the different data structures accessed during the considered period of the program execution. In our example, both P R 2 ðItð0; 1ÞÞ and P R 1 ðItð0; 1ÞÞ are the first element of the area vector S cross B Itð0;1Þ [ S self A Itð0;1Þ , which yields (0, 1/512, 511/512) according to Eq. (4). As a result, the final expression for the PMEs of the three references in our example code is F R 1 0 ðpÞ ¼ 0, F R 2 0 ðpÞ ¼ 50 Á p and F R 3 0 ðpÞ ¼ 200 Á p.
A compiler framework for analytical modelling
The final purpose of the modelling task described in the previous section is the integration of the model in a compiler environment. This allows the analysis of real scientific codes, both effectively and quickly, in such a way that code optimizations can be proposed. Our modelling technique has been embedded in the Polaris parallelizing compiler [13] , using its development infrastructure. The whole environment also includes the Delphi CPU model [6] . Thus, the computation cycles predicted by Delphi can be added to the stall cycles caused by the misses predicted by our model, in order to estimate real execution times. Memory hierarchies of arbitrary size, block size and associativity per memory level, and LRU replacement policy (which nowadays is by far the most common) can be considered to depict the behavior of memory systems. Basically, our framework takes as inputs the memory hierarchy parameters and a FORTRAN 77 program, and returns system performance results using our analytical models. It also provides timing results for each stage of the tool, as well as the desired level of monitoring capabilities (from a user-level up to a developerlevel).
The tool structure is depicted in Fig. 2 , and it consists of the following blocks:
• In the Cache Parameters configuration file, the user specifies for each level of the memory hierarchy: its size, line size (both expressed in number of words), degree of associativity and miss weight. This file also contains, among others, switches to enable/disable the code generation of the simulator and also the analyser shown in Fig. 2 . All parameters can also be specified on the command line.
• The Modelling Library is a set of C routines that collect, in a modular way, our analytical model described in the previous section.
• The Delphi CPU Modelling Library is used in those cases in which predictions involving the processor execution time are required.
• The Optimization Library analyses the input code and uses the predictions of the memory hierarchy behavior (and also the CPU, if required), and suggests code restructuring in order to reduce the total execution time. An example of optimization currently implemented is the selection of an optimal block size for the points in the program where a blocking pattern is detected.
• The Parser/Integrator is the integrative part of this framework. It analyses the source code, detects array reference patterns inside a loop, and uses the corresponding model from the Modelling Library. Note that the functionality of our performance model could be embedded in other compiler environments by changing this module.
Our tool provides both general and detailed information about cache behavior in the Modelling Results: global number of misses, misses per data structure, misses per loop (or set of loops). Thus, cache hot spots can be detected and localized to drive code optimizations. Optionally, the following files can be generated:
• The Analyser Code is a C program (with the appropriate calls to the Modelling Library) that obtains the modelling results of the input code (although the cache parameters and optionally, the data structures base addresses, can be altered). The aim of this code is to be used afterwards for testing purposes on any machine, independently of the Polaris environment.
• The Simulator Code is the C code of a trace-driven simulator that validates the analytical results obtained for the specific input code. As in the case of the analyser, it allows the modification of the memory hierarchy parameters and the data structures base addresses. Simulation is performed by an optimized library of functions (Simulation Library in Fig. 2 ) that has been extensively validated matching its results with those of the dineroIII simulator (a component of the WARTS toolset [1] ). • The Optimized Source Code is the input code after a source-to-source transformation. In case of a blocking transformation, currently implemented, for each blocking, the original block size is replaced by the optimal block size (computed by means of our analytical models).
Experimental results
The main validation of our model has been performed by comparing the number of misses it predicts with the values measured using trace-driven simulations. The code to perform these simulations can also be generated automatically by our tool (Simulator Code and Simulation Library in Fig. 2) . Thus, we have applied both approaches to a series of codes using a wide variety of combinations of cache parameters and data structure dimensions. Also, for each combination, several simulations (approximately twenty) were performed, changing the value of the data structure base addresses using a random generator.
Two kinds of metrics have been calculated for each cache parameter combination: D MR , which is based on the miss ratio (MR), and D NM , based on the number of misses. The first metric, D MR , is the average of the differences between the predicted and measured miss ratios obtained for each combination of array base addresses tried for each cache; D NM is the average error in the prediction of the number of misses in these trials, expressed as a percentage of the number of measured misses. Both the miss ratio differences and the percentage error in the prediction of the number of misses were taken in absolute value to calculate the averages, so that negative and positive values did not compensate each other. The standard deviation, r, of the number of measured misses in the simulations (expressed as a percentage of the average number of measured misses) is also taken into account to help understand the memory behavior of the algorithms.
Validation through synthetic codes
The validation was first performed using three kernels covering different access patterns and loop-nests:
(1) The synthetic code in Fig. 3 represents a type of code with several non-perfect nestings. A number of 216 parameter combinations (and more than 4000 simulations) were tried. (2) A matrix-matrix product with blocking (Fig. 4) was validated checking 2400 parameter combinations resulting in 16,320 simulations. (3) Finally, a system solver by forward substitution (shown in Fig. 5 ) that involves a triangular loop, has also been included; 480 parameter combinations were tried, requiring a total of 9600 simulations. Table 2 includes validation results for these codes. The D MR values are highly satisfactory, with small errors. The codes also perform very well on average in the pre-diction of the absolute number of misses, although the deviations are somewhat higher. This difference in the validation metrics means that the important deviations take place in the parameter combinations for which a small number of misses is obtained. That is, those in which the memory behavior analysis produces little optimization. A relatively small error in the prediction of the number of misses for these combinations generates very large values of D NM , which unbalance the average. For example, only 31.3% of the combinations for the code with non-perfect nestings have a D NM over 5%, and their average miss ratio D MR is 0.22%; while the remaining 68.7% of the combinations have an average miss ratio of 17.94%. Even more meaningful is the fact that if the parameter combinations with miss ratios smaller than 0.5% are not considered for the calculation of the average D NM (just those in which the memory hierarchy is already behaving very well), it drops to only 1.45%. Tables 3-5 show the validation results of some parameter combinations for the codes shown in Figs. 3-5 , respectively. In these tables and in what follows, C s stands for the cache size in Kwords, L s is the line size in words and K is the degree of associativity, while M, N , BJ and BK stand for the problem size (see the related codes). These tables also include average times in seconds, both for the trace-driven simulation used for the validation (T. sim.) and modelling (T. mod.), measured in an SGI Origin 200 server with R10000 processors at 180 MHz. We must recall that many efforts have been made to optimize our simulator, and these simulation times are Table 3 Validation and time results corresponding to the non-perfect nestings code much shorter than those we would obtain using more general and standard simulators. Even so, modelling is typically between three and six orders of magnitude faster. The exception is the forward substitution code, where the triangular loop halves the number of accesses and thus, the simulation time, while its irregularity forces our tool to perform a sampled simulation to estimate the area vector. Still, modelling is much faster than simulation. Large values of D NM arise in the non-perfect nestings code for some combinations with a very small miss ratio (almost 0%), where a prediction error of a small number of misses in absolute terms becomes a large relative error, as we have already explained. There is another general and important reason for this behavior of the model for these combinations. Due to the probabilistic nature of the model, the convergence of this kind of approach is not favoured by small problems where few misses are generated. An additional reason for some large deviations in codes with non-perfect nestings is the use of a conservative approach in the estimation of the number of misses for data structures that have been accessed in previous loops. This simple and quick strategy affects the accuracy of the prediction for codes with several non-perfect nestings, giving place to an overestimation.
Another interesting fact in the tables is that the parameter combinations with the greatest prediction errors are those that have a wide variation (high r) in their number of misses depending on the relative positions of the data structures. Moreover, such prediction deviations are almost always similar to or noticeably smaller than Table 4 Validation and time results corresponding to the matrix product code Table 5 Validation and time results corresponding to the forward substitution code the standard deviation of the real number of misses, making them a very reasonable prediction even when the relative error is large. Again, this effort can be explained by the probabilistic nature of the model: it tends to predict a number of misses close to the average number of misses generated through the different simulations, but the particular values generated in the different simulations may be far from this average value.
Validation through the SPECfp95 suite
In order to demonstrate the wide range of validity of our tool, we applied it to programs belonging to the SPECfp95 suite, as shown in Table 6 . In the case of the first two SPECfp95 programs we have analysed the whole code, while in the remaining codes we have chosen the most CPU time consuming routine. In the cases where such routine calls other routines, they have been inlined in the calling routine. Four benchmarks were not considered in our study, for different reasons: turb3d and apsi most demanding loops are very small and so they generate very few misses; fppp main routines contain many non-regular loops and conditional sentences; finally, wave5 has many indirect accesses, not suitable for our analysis techniques. Table 6 shows the resulting number of loops and references analyzed (Loops and Refs columns, respectively), as well as the percentage of the program execution time associated to them (% Ex. T.) and the validation results for each code. As in the previous experiments in Section 5.1, miss ratio errors (D MR ) are very good, with a maximum average value of only 0.12%. The average errors relative to the number of misses are somewhat larger, but still small and within the range of the corresponding r. Notice also that the largest relative deviations usually take place in the codes with the smallest miss ratios: memory hierarchy is working better and the small number of misses turns an error of a few misses into a large relative error. In any case, D NM exhibits excellent values. In fact, Fig. 6 proves the high precision of the model for some caches with different sizes and degrees of associativity. Benchmark su2cor-matmat has not been included in the figure because of its reduced number of misses, and the number of misses generated by tomcatv and swim have been scaled to fit in the plot. Tables 7 and 8 shows the miss ratio deviations D MR and modelling times for the same cache configurations, respectively. These times refer only to the model execution itself, so they do not include the time required by Polaris to read the file, parse Table 8 Modeling times for the suite codes for different cache configurations (in seconds) it, and so on. The modelling times are at least one order of magnitude faster than the simulation, and even with respect to the execution time. As an example, the tomcatv amd swim benchmarks require, respectively, 192 and 131 s to be executed using the reference input set, and the corresponding cache simulation is at least one order of magnitude slower.
Automatic code optimization
An additional set of experiments using our model to check the possibility of driving optimizations in real memory hierarchy systems is described below. The high degree of accuracy obtained in the validation experiments shown in the preceding section was a support for this idea.
The matrix-matrix product with blocking shown in Fig. 4 was chosen as a case study for that purpose. We tried to derive the optimal block sizes for four representative architectures by feeding our tool with the code and the parameters describing the memory hierarchy levels of such systems (see Table 9 ): a PC with a PentiumII at 450 MHz (PC pII), a Digital Personal Workstation 433au with a 21164 at 433 MHz (PW 433au), an SGI Origin 200 with R10000 processors at 180 Mhz (O200 R10K) and the nodes of a Fujitsu AP3000 multicomputer with UltraSparc-II processors at 300 MHz (AP U-II). Each memory hierarchy level i of these systems was specified to our model by means of C si , L si and K i , and a new parameter W i , the relative miss penalty. We have obtained the values of W i for the different memory levels of these machines either from the corresponding manuals or using microbenchmarking [18] . This way, the cost of a given block BJ Â BK in a system with L levels would be computed as
where function M provides the number of misses the model estimates for a given cache level and block. The code was later run on these machines, trying all the possible blocks whose sizes were divisors of the matrix dimensions in order to measure their execution times. Table 9 shows the difference between the execution times using the predicted optimal block and the best execution time expressed as a Table 9 Difference between predicted and real optimal block execution times expressed as a percentage of the execution time of the real optimal block (N stands for the matrix size) percentage of the latter. As we can see, the model proposes optimal or near optimal blocks in all the cases.
Optimal block selection
The good results achieved in the preceding set of experiments on real systems, together with the high speed of our model, confirmed that it was feasible to drive completely automatic optimizations using our tool. As an example we built a module to choose optimal block sizes (inside Optimization Library in Fig. 2 ). The estimation of the block size leading to the minimum execution time requires taking into account both the stalls due to cache misses and the computation cycles, which implies the need of a CPU model. As this is beyond the scope of our work, we took the CPU model of Delphi, as explained in Section 4. The CPU computation cycles predicted by Delphi were added to the stall cycles caused by the misses predicted by our model, to estimate the total number of execution cycles for a given code. The platform chosen for this set of experiments was an SGI Origin 200 for two reasons: the CPU parameters for the R10000 microprocessor required by Delphi were known; and a good compiler (MIPSpro version 7.3.1.1m), with many flags that allow control of the optimization and code generation process, was available.
The procedure for performing this experiment was the following: each code was first fed to the MIPSpro compiler in such a way that it decided where to apply blocking, as well as the size of the block, using the O3 optimization level. The resulting code was then analysed and rewritten by our framework, modifying the block sizes and the sentences related to them according to the predictions of our model. The set of block sizes evaluated by the tool for each dimension of size N were the original values proposed by the compiler as well as dN =ie for 0 < i 6 128 (discarding those values that differed in less than three units). It should also be mentioned that the model was run without taking into account the relative position of the data structures, just using pure probabilities. It is obvious that this execution mode is less accurate, as it has less information to perform the modelling. This conservative approach was taken because our algorithm was not running in the MIPSpro compiler, so it really had no information at all about where it was going to locate the different data structures.
The experiment was applied to all the codes in Table 6 , but the MIPSpro compiler only chose to apply blocking in three of them: tomcatv, swim and hydro2d-filter. We must point out that these codes can only benefit from reuse in the borders of the blocks, and not in the whole block, as is the case in the typical example of the matrix-matrix product. As a result, large improvements in the execution time cannot be expected by optimizing the size of such blocks. Moreover, the execution time of the loops modified in the last code is so small that it is very difficult to measure them alone meaningfully. This problem was overcome using the whole hydro2d program to make the measurements, just changing the block sizes in those places where the compiler chose to apply blocking. Table 10 shows the percentage of improvement in execution time obtained when applying our tool with respect to that obtained with the MIPSpro compiler using the O2 and O3 optimization levels. Excepting one case, our tool has always improved the execution times although, as expected, there are no big improvements, mainly due to the reasons previously mentioned. Nevertheless, we think that the improvements are quite good, taking into account that only one column and/or row of the blocks can be reused and that we are comparing it to the results of a good compiler.
There are several points that make it very difficult to compare our approach with the traditional algorithms for tile size selection [4, 19] . First, such algorithms only focus on the memory hierarchy behavior, while our approach is the first one that, as far as we know, also takes into account CPU time. This happened to play an essential role, as ignoring it would lead to the choice of very small blocks that have maximum reuse, but whose management requires computing overheads that are too high. Another important point is that the referenced algorithms can consider just the parameters of one cache level, while our approach takes into account all the levels of the memory hierarchy simultaneously, in order to make the selection. Furthermore, works in [4, 19] are specifically devoted to driving tile size optimization; in our case, tiling is just a particular application of a general purpose system for the prediction of the memory hierarchy behavior. Not less important is the fact that these algorithms look for block sizes that allow the keeping of the whole block in cache between reuses, but the blocks found in these codes only reuse their borders, so it is only interesting to keep the last row and/or column. Finally, these approaches only consider a block of one given matrix and pay little or no attention to its interaction with other data structures, which could even be other blocks generated by the dimension partitioning implied by blocking. Nevertheless, such interferences must be taken into account, particularly as the dimension tiling may simultaneously affect many arrays accessed in the same nesting (up to seven in these codes), giving place to many blocks interacting in the caches. This is no problem for our tool because it is oriented to a broader scope.
Conclusions
A fast and flexible analytical modelling technique to study the memory hierarchy performance has been validated and embedded in a compiler framework. The model is based on the concepts of a PME for each reference and nesting level, and the miss probability for an access to a given line. The miss probabilities are derived from the cache areas affected by the references between the reuses of a line, which are repre- sented by means of the unified notation of the area vectors. This produces many structural advantages. For example, it allows the estimation of the number of misses for each reference and each given loop. The area vectors also allow the study of the relative contribution of each different data structure or even reference to the miss probability of the accesses associated to any reference. Other advantages which are not so obvious have been illustrated here, such as the ability to take into account the probability of hits due to reuses of data accessed in previous loops, which enables the analysis of non-perfect nestings and even whole programs. The model can be easily extended, thanks to its modularity, by developing the PMEs and area vectors associated to each new access pattern we need to study. These calculations may be performed either analytically or through simulations that can be sampled to reduce their computing requirements. The systematization achieved in the model has allowed its implementation inside the Polaris parallelizing compiler, together with additional capabilities shown in Section 4. This has allowed the construction of a complete compiler framework, giving place to a powerful tool for system designers and programmers.
Validations performed through comparison with trace-driven simulations shows that although relatively large errors may arise in the number of predicted misses (when both the number of misses and the miss ratio are very small), such predictions are still very acceptable and the average accuracy of the model is very good. On the other hand, the time required by our model to make the predictions is very small in comparison with that required by simulation, and even in comparison with the compilation and execution time of the example codes. These properties make our model ideal to guide compiler optimizations. Thus, we applied it to choose optimal block sizes, as a case study, using the parameters of memory hierarchies of real systems. Successful results in a number of different hardware platforms were achieved and current commercial compilers were outperformed with real codes from the SPECfp95 suite.
