We present a novel compile-time method for determining the cache performance of the loop nests in a program. The cache-miss rates are produced by applying the reference string of a loop nest, determined during compilation, to an architecturally parameterized cache simulator. The obtained cache-miss rates correlate well with the performance of the loop nests on actual target machines. We also describe a heuristic that uses this method for compile-time optimization of loop ranges in iteration-space blocking. The results of the loop program optimizations are presented for different processor architectures, namely the IBM SP1 RS/6000, the SuperSPARC, and the Intel i860.
Introduction
The recent advances in processor technology, both in the number of cycles-per-instruction and in raw clock rate, have dramatically improved single-processor performance. However, modern processors perform at full speed only when programs exhibit enough data reference locality for data caches to sustain high hit rates. When such locality does not exist, the performance drops from the CPU's cycles-per-instruction speed to the much slower memory-reference access speed.
Cache performance is therefore an important consideration in the development of compiler optimization techniques. Given a description of a target computer and a program, the compiler should be able to optimize the generated code to take advantage of the cache architecture.
Loop nests are prime prospects for code optimization. Many loop restructuring optimizations influence cache performance. Examples are loop interchange, fusion, distribution, iteration-space blocking, and skewing (cf. [1, 2, 3]). One of the possible loop optimizations is to select the loop range that maximizes the cache performance.
Choosing too small a range causes the loop to incur large numbers of intrinsic misses (the addresses currently in cache are used again in the next set of ranges but are removed from cache due to the references to the subsequent rows of the current range). Too large a range causes a large increase in self-interference misses in the cache (the addresses currently in cache are used again in the following rows of the current range but are removed from cache by references to the subsequent elements of the current range) [1].
The processor performance on a code section is a key parameter in parallel task partitioning and scheduling (cf. [4, 5]). For example, in [4] the task granularity, defined as a function of the ratio of task communication to its computation, is the basis for determining optimal parallel schedules.
In this paper we describe a compile-time method for determining cache performance on a generalized loop range. We also present a heuristic for finding optimal loop ranges for such nests. Finally, we demonstrate results of applying this technique to several application codes and processors.
Previous Work on Cache Performance Prediction
Two components are needed to predict the performance of the cache: (i) the reference string of an application, and (ii) the actions of the cache over the string. Depending on how these components are created or approximated, current methods fall into the following three classes: (1) exact, (2) cache simulation, and (3) analytical models. The method presented in this paper belongs to a new class, full simulation. The relationships among these classes are shown in Figure 1.
Exact Methods generate the reference string by creating a synthetic mix or abbreviated form of the application code, executing it, and running it through the real cache (cf. [6]). Although the actions of the cache over the string are very fast (they are taken by the actual hardware), the method requires repetitive compilations and executions of the application codes; therefore it is too time-consuming to be used at compile-time.
Cache Simulation, described in [8, 9, 10], uses the reference string generated by an execution of an application and a model of a cache. The generation of the reference string from the program allows the inverse mapping, i.e., from a reference to the source code. Thus, the programmer can be informed of the number and types of cache misses caused by each source line. However, the cache model is significantly slower than the actual cache. The time for repeated compilation, instrumentation, and execution of the code, which would be required to search for optimal loop ranges, renders this method impractical at compile-time.
Analytical models approximate both the reference string and the actions of the cache. Instead of looking at individual references, these methods determine the number of distinct cache lines referenced in a loop nest based on the number of reference classes [1, 2] or array classes [7]. This number, combined with the number of cache lines and loop ranges, yields a crude approximation of cache misses. The method cannot accurately account for the effects of different replacement algorithms, context switches, set-associative cache mappings, or translation of virtual to physical addresses. It is also difficult to accurately attribute cache behavior to the source code of the loop nest. On the other hand, the analysis relies only on the syntax of the loop nests and formulas involving parameters of a cache, and therefore analytic methods are very fast and suitable for use at compile-time.
Full Simulation is introduced in this paper as a new class. Figure 1 shows the relationship of this class to the other classes. Full simulation differs from cache simulation in the way the reference strings are generated. The definition of the reference string is created during parsing of the source code, as opposed to generating the reference string by compiled code execution.
Both cache and full simulation can accurately account for different replacement algorithms and different cache organizations (direct, set-associative, etc.). Context-switch effects can be approximated by the periodic clearing of the current contents of the cache. Instead of physical addresses, both models use compile-time addresses that are changed during loading. However, as shown in [10], the probability of additional cache misses caused by loading is low, so the effect of this approximation on the miss rates is negligible.
The results of the full simulation are less accurate than those of cache simulation because the simulated reference string is just an approximation of the real one (e.g., assignment statements guarded by conditionals are assumed to always execute). In our experiments we show that these results are accurate enough for cache performance prediction. The advantage of the proposed method is that the compact representation of the reference string is created at compile-time immediately after parsing. This representation can be modified during optimization to simulate different loop ranges without the need for its regeneration.
In the following section we show how the reference string is defined from the source program and how it is used to obtain the loop ranges for which the cache performance is optimized. Section 3 describes the use of this method for loop optimization for different architectures, and provides the timing results of its application. Section 4 suggests future work.
Full Simulation Method
We describe here (i) a technique for generating the reference string from the source code at compile time, and (ii) a simulation model of a cache described in terms of basic cache parameters. In addition, we present a heuristic search algorithm for finding the loop ranges that optimize cache performance.
Reference String Generation
The parser of a source language can be extended to produce an expression that can be used to generate the reference string. As explained earlier, we focus on programs expressed as nests of loops. To keep the source code translation language-independent, we will discuss the parser action for a simple language, L, that
The range of a loop is defined at compile-time if it is either a constant or an expression over the control variables of outer loops with defined ranges. Because the described cache prediction method is intended for loop range optimization (see Section 2.3), we are interested only in an upper bound of the optimal loop range for the given cache. The upper limit can be estimated from the loop body and the cache line size (see below). Thus, without loss of generality, we can assume that the range of a loop can always be established at compile-time.
We define the reference space of a symbol s ∈ L as:
The presented translation can be done only if the indexing function involves only loop control variables as arguments and constants as coefficients. More general indexing expressions involve loop constants (i.e., variables which are not assigned to in the loop body). In such a case, however, the exact virtual address will not be known, so the simulation will be exact only for fully associative caches. For the case of indexing expressions with indirect mappings or with variables that are reassigned in the loop body, we do not produce any virtual addresses for the corresponding references. The justification of this treatment is that such references change drastically from one loop iteration to the next, so their effect on cache performance is usually not dependent on the loop range. Ignoring these kinds of references usually will not have a significant impact on the determination of the optimal loop ranges.
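As an illustration of the translation restricted to affine indexing functions, the following minimal sketch enumerates a reference string for a loop nest with constant ranges. The names `AffineRef` and `reference_string`, the linearized index form, and the example base addresses are assumptions made for this illustration; they are not the notation of the paper.

```python
from itertools import product
from typing import Iterator, NamedTuple, Sequence


class AffineRef(NamedTuple):
    """One array reference (hypothetical representation).

    address = base + elem_size * (const + sum(coeffs[k] * i[k]))
    """
    base: int               # assumed compile-time base address of the array
    elem_size: int          # bytes per array element
    coeffs: Sequence[int]   # one constant coefficient per loop control variable
    const: int = 0          # constant term of the linearized index


def reference_string(ranges: Sequence[int],
                     refs: Sequence[AffineRef]) -> Iterator[int]:
    """Yield virtual addresses in lexicographic order of the iteration space."""
    # product() varies the last (innermost) loop variable fastest,
    # which is the lexicographic order of the loop nest.
    for point in product(*(range(r) for r in ranges)):
        for ref in refs:
            index = ref.const + sum(c * i for c, i in zip(ref.coeffs, point))
            yield ref.base + ref.elem_size * index


# Example: for i in 0..1, for j in 0..2: s += a[i][j] * b[j]
# with a stored row-major (3 columns) and 8-byte elements.
a = AffineRef(base=0, elem_size=8, coeffs=(3, 1))
b = AffineRef(base=1000, elem_size=8, coeffs=(0, 1))
print(list(reference_string((2, 3), [a, b]))[:4])  # [0, 1000, 8, 1008]
```

Because the generator depends only on the loop ranges and the coefficients, a different loop range can be simulated by changing `ranges` without re-deriving the references.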
It is clear from the above definitions that the reference string is produced when the virtual address translation is applied to the subsequent successors of the mini-
Cache Model
As defined in [6], a cache is the first level of memory closest to a processor. Cache is a fast memory that at any execution instance represents some subset of the address space of a processor. Our cache model is characterized by a replacement algorithm and the following three parameters: L is the number of bytes per line of the cache, K is the number of lines per set, and N is the number of sets in the cache. The implementation of the cache model assumes that the lower log2(L) bits of an address select a byte within a cache line, and the next log2(N) bits select the set. The cache is initialized with all lines marked empty. Therefore, all initial references will cause a cache miss. This is not an unreasonable assumption for two reasons. First, the likelihood of reuse of the cache from one loop nest to another is usually very small. Second, the simulation is run sufficiently long so that the number of references to each cache line is large enough for the initial misses to be insignificant.
In one cache simulation cycle, the simulation engine obtains the lexicographically next element of the reference space of the loop nest to obtain the next virtual address. This address is then run through the cache algorithm, and the action of the cache is recorded.
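The cache model and the simulation cycle described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the class name `Cache`, the helper `miss_rate`, and the choice of LRU replacement are assumptions (the model in the text leaves the replacement algorithm as a parameter). L and N are assumed to be powers of two, so that integer division and modulo select the same bits as the log2(L) and log2(N) bit fields.

```python
from collections import OrderedDict
from typing import Iterable


class Cache:
    """Set-associative cache: L bytes per line, K lines per set, N sets.

    LRU replacement is an assumption of this sketch.
    """

    def __init__(self, line_bytes: int, lines_per_set: int, num_sets: int) -> None:
        self.L, self.K, self.N = line_bytes, lines_per_set, num_sets
        # All lines start empty, so every initial reference misses.
        self.sets = [OrderedDict() for _ in range(num_sets)]
        self.hits = 0
        self.misses = 0

    def access(self, address: int) -> bool:
        """Run one address through the cache; return True on a hit."""
        line = address // self.L      # drop the byte-within-line bits
        set_index = line % self.N     # next log2(N) bits select the set
        tag = line // self.N          # remaining bits identify the line
        lines = self.sets[set_index]
        if tag in lines:
            lines.move_to_end(tag)    # mark as most recently used
            self.hits += 1
            return True
        if len(lines) >= self.K:
            lines.popitem(last=False)  # evict the least recently used line
        lines[tag] = None
        self.misses += 1
        return False


def miss_rate(cache: Cache, addresses: Iterable[int]) -> float:
    """Apply a reference string to the cache and return its overall miss rate."""
    for address in addresses:
        cache.access(address)
    total = cache.hits + cache.misses
    return cache.misses / total if total else 0.0


# Direct-mapped, 64-byte cache: addresses 0 and 64 conflict in set 0.
print(miss_rate(Cache(16, 1, 4), [0, 8, 16, 64, 0]))  # 0.8
```

With K = 1 the model is direct-mapped and with N = 1 it is fully associative, so one parameterized class covers the organizations mentioned in the text.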
Heuristic Search
Generating a densely populated cache miss-rate curve via simulation is time-consuming, and therefore unsuitable for integration into a compilation system. However, theoretical and experimental evidence shows that the cache miss-rate curve as a function of a relevant range is S-shaped (see Figure 3, right).
The overall program performance is highest at the rightmost value of the loop range, i, to the left of the steep region approaching i_c. This range value corresponds to the maximum problem size exhibiting good cache performance. An abrupt change in cache performance takes place over a very small interval of loop range values, approximately 10. Taking advantage of the shape of the cache-miss rate curves, we developed a heuristic for finding the optimal range (see
loop kernels from some common applications. Secondly, we discuss the difference between the optimal loop ranges and the values determined by the heuristic. Finally, we present the results of applying the described method to loop optimizations on selected applications.
Applications
We used the following applications: Vector Product, which computes a double-precision vector inner product; Jacobi Iteration, for solving partial differential equations (PDEs); Ocean Model, which solves a continuity equation taken from a two-layer, linear unidirectional model of wind-driven circulation in a density-stratified ocean; and Shallow Water Model, which is a section of a weather prediction program. We believe that these codes are a representative sample for a large class of supercomputer applications. Loop bodies in these codes use affine indexes and loops are nested up to three levels deep.
Simulation versus Actual Performance
Figures 4, 5, and the right-hand graph in Figure 6 show densely sampled curves of cache simulation and benchmark performance results. The curves are normalized hit-rate curves for each application and normalized benchmark performance, where the hit rate is normalized against the best hit rate, and the benchmark is normalized against the best performance over the range. The size is defined as the number of array elements allocated per processor along a single dimension, and the plots are sampled at intervals of approximately 50. For one-dimensional problems, such as the ocean model and vector product benchmarks, the size corresponds to the raw number of data elements. For two-dimensional problems, such as the Jacobi iteration and shallow water model benchmarks, the size represents the square root of the number of elements allocated per processor.
There is a strong correlation between the performance on the target machine and the simulated cache hit rate. The densely sampled cache simulation results are compared with the densely sampled target performance data in Table 1. The Actual column in the table contains observed performance loop range thresholds, which, when exceeded, result in substantial performance degradation. The first Error(percent) column is the relative error between the simulated and measured performance loop range thresholds on the target architecture. The maximum (in magnitude) relative error observed is 17% for the shallow water model benchmark on the SuperSPARC. Such a large error is partially due to the relatively continuous decline of the performance of this benchmark and its quite low simulated miss rates. For other benchmarks, the relative error does not exceed 10%.
Table 1 also shows a comparison of actual range values with the heuristic results in the Heuristic column. The second Error(percent) column records the relative error (in percent) versus the actual value. The second Time(sec.) column shows the running time of the heuristic on a uniprocessor SUN Sparc-10 computer. The maximum time to evaluate the heuristic search is 50 seconds for the shallow water model benchmark with the SuperSPARC as a target. A positive error value in the table indicates that the heuristic picks a larger performance loop range threshold than the actual threshold. The positive error is usually small, not exceeding 8%. The largest negative error, 20%, is for the Jacobi iteration benchmark on the i860 because of the way the heuristic selects the threshold. For this application the miss rate increases with a sharp jump occurring around a problem size of 512.
Intentionally, the heuristic underestimates the threshold value by picking the left edge of the upward slope, because the cost of underestimating is a slight increase in loop control overhead, while overestimating can increase the miss-rate and degrade performance considerably. 
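The idea of locating the left edge of the steep region without sampling the whole curve can be sketched as a coarse scan followed by bisection. This is only one plausible search that is consistent with the description; it is not the heuristic of the paper. The function name `find_threshold`, the parameters `coarse_step` and `tolerance`, and the assumption that the miss-rate curve is non-decreasing are all hypothetical.

```python
from typing import Callable


def find_threshold(miss_rate: Callable[[int], float], lo: int, hi: int,
                   coarse_step: int = 64, tolerance: float = 0.02) -> int:
    """Return the largest range in [lo, hi] whose miss rate stays within
    `tolerance` of the miss rate at `lo`.

    Assumes (hypothetically) an S-shaped, non-decreasing miss-rate curve.
    `miss_rate(r)` would typically run one cache simulation for loop range r.
    """
    limit = miss_rate(lo) + tolerance
    good, bad = lo, None
    # Coarse phase: step right until the miss rate leaves the flat region.
    r = lo + coarse_step
    while r <= hi:
        if miss_rate(r) > limit:
            bad = r
            break
        good = r
        r += coarse_step
    if bad is None:
        if miss_rate(hi) <= limit:
            return hi          # the steep region was never reached
        bad = hi
    # Fine phase: bisect between the last good and the first bad sample.
    while bad - good > 1:
        mid = (good + bad) // 2
        if miss_rate(mid) > limit:
            bad = mid
        else:
            good = mid
    return good                # left edge: errs toward underestimating


# Synthetic curve with a sharp jump just after a range of 512.
print(find_threshold(lambda r: 0.05 if r <= 512 else 0.6, 16, 2048))  # 512
```

Returning the last sample that is still in the flat region matches the stated preference: an underestimate costs some loop control overhead, whereas an overestimate lands on the steep part of the curve.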
Results of Loop Optimizations
The next set of experiments focuses on predicting the performance of original and optimized loop nests in order to evaluate the usefulness of the optimization. We show that the heuristic can be used to quickly determine the iteration-space blocking factor. Some results for loop interchange are given in Table 1 and Figure 6 (see also [11]).
Blocking or tiling [1] is used to increase the locality of data references during the loop execution by adding additional levels of loops so that inner loops iterate over blocks of the original iteration space. The results of this optimization are shown in Figure 7. The original code shows an approximately 10 to 33% performance decrease when the number of elements to be processed is greater than 1048576. Using the loop range indicated by the heuristic, 1024, the blocked graph shows the improved performance.
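The loop transformation itself can be illustrated with a small sketch that blocks the inner dimension of a two-level nest. The function name `blocked_points` and the choice of blocking only the inner loop are assumptions for this example; the benchmark codes of the paper are not reproduced here.

```python
from typing import Iterator, Tuple


def blocked_points(n: int, m: int, block: int) -> Iterator[Tuple[int, int]]:
    """Visit the n-by-m iteration space with the inner loop blocked.

    Original nest:  for i in range(n): for j in range(m): body(i, j)
    """
    for jj in range(0, m, block):                    # added loop over blocks of j
        for i in range(n):                           # original outer loop
            for j in range(jj, min(jj + block, m)):  # inner loop within one block
                yield i, j


print(list(blocked_points(2, 4, 2)))
# [(0, 0), (0, 1), (1, 0), (1, 1), (0, 2), (0, 3), (1, 2), (1, 3)]
```

Every point of the original iteration space is still visited exactly once; only the order changes, so that the data touched inside one block can stay in the cache across the iterations of `i`. The transformation is legal only when the loop-carried dependences of the body permit the reordering.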
In another experiment, we took the original loop ordering of the Shallow Water application and compared it against an iteration-blocked version for various blocking factors. The optimum block size is the loop range determined by the heuristic simulating the cache performance of a non-contiguously allocated block. From Table 1, we would predict that the optimum range value is 90 for the SP1 architecture. This value was determined from a cache simulation that assumed that the arrays referenced inside the loops were contiguously allocated. When the code is blocked, the reference pattern generated by the blocked traversal will not follow a traversal of a contiguously allocated array. Static prediction methods, such as those proposed in [5] or [7], do not take this into consideration, and therefore they may not choose the proper blocking factor for a loop. We can correctly determine the optimum blocking factor for in-place execution by modifying the virtual addresses to represent the references generated by the traversal of a block within a larger array. This can be accomplished by offsetting the simulated block in the blocked iteration dimension. In our reference generation model this is done by adding a constant offset to f(v; i) for each blocked dimension. The resulting cache miss rate increases (i.e., i_c moves to the left), and the optimum loop range returned by the heuristic decreases to 60, shown in Table 1 as SP1 RS/6000 Shallow(n.c.).
Figure 7 shows the target execution performance improvement for blocked versions of the Shallow Water model benchmark over the original code. Three different problem sizes are shown, each executed with several different block sizes. The predicted blocking size of 60 yields an improvement within a few percent of the optimum performance improvement for each of the problem sizes.
Future Work
The current simulation system takes the source loop nest and represents it as a reference space with a total order. The reference string for each simulation is just the sequence of applications of the successor function in this space.
Instead of going through each point of a reference space, we could traverse just the points that could change the contents of the cache; these are exactly the points for which cache misses occur. Hence, instead of using a successor function, we can move to the lexicographically closest point of the reference space for which the current lines are overflowed or at which the range of one of the loop control variables is exhausted. Such an approach can speed up the simulation by a factor equal to the inverse of the miss rate. However, it requires that the indexing expressions be affine functions (otherwise, it is difficult to predict which future reference reaches beyond the current cache contents). The advantages and disadvantages of such an approach in comparison to the approach presented here are the subject of our current research.
Acknowledgements

