In numerical codes, the regular interleaved accesses that occur within do-loop nests induce cache interference phenomena that can severely degrade program performance. Cache interferences can significantly increase the volume of memory traffic and the amount of communication in uniprocessors and multiprocessors. In this paper, we identify cache interference phenomena and determine their causes and the conditions under which they occur. Based on these results, we derive a methodology for computing an analytical expression of cache misses for most classic loop nests, which can be used for precise performance analysis and prediction. We show that cache performance is unstable, because some unexpected parameters, such as array base addresses, can play a significant role in interference phenomena. We also show that the impact of cache interferences can be so high that the benefits of current data locality optimization techniques can be partially, if not totally, eradicated.
Introduction
As CPU cycle time decreases, main memory and network latencies rapidly increase in relative terms, and cache misses become very costly. Furthermore, the increasing issue rate of processors worsens the burden on caches. Moreover, most CPU chips are now being designed for integration into a massively parallel supercomputer or a parallel workstation, and therefore minimizing memory traffic, i.e. optimizing memory hierarchy utilization, is becoming critical. For all these reasons, optimizing cache behavior has become a major issue. To achieve such optimizations, many studies have been performed to understand the workings of cache memories and derive proper optimizations.
The first category of studies [?] relies on numerous simulations of representative codes, i.e. a collection of codes which corresponds to the average workload of a computer. Such simulations provide a good summary of average cache performance and some hints at the relationships between the different cache parameters (cache size, line size, set-associativity). Furthermore, they provide nearly exact indications on the behavior of cache memories for specific examples. A major problem inherent to such techniques is to find codes which are truly representative in terms of memory referencing.

A second category of studies [?] aims at building analytical models for synthesizing the behavior of cache under most circumstances. Such models provide better insight into the relationship between the different parameters. These models can also be used for performance prediction, avoiding numerous simulations. However, while such analytical models are more representative than simulation-based studies, they are generally less accurate. They are also intrinsically limited because they are not designed for understanding specific phenomena which occur within a cache. Modeling the global behavior of cache may be sufficient as far as trends are needed. However, for understanding the weaknesses of caches and deriving either software or hardware optimizations, more precise modeling is necessary.

This work was funded by the BRA Esprit III European Project APPARC, European Agency DGXIII.
Therefore, a good understanding of a program's reference pattern needs to be extracted. Some models have addressed this problem [?]. Although they are valuable tools, they still lack accuracy because they aim at characterizing the behavior of most programs. Therefore, they again sacrifice accuracy for generality and representativeness. However, there is a category of codes, numerical codes, that are emerging as some of the most demanding programs in terms of execution time and memory usage. Many architectures are targeted or at least tuned for such programs. The widespread use of numerical codes as benchmarks is a clear sign of their growing influence. Therefore, studying the cache behavior under numerical workloads is critical. However, numerical codes have specific properties in terms of memory addressing (spatial and temporal locality) which hardly allow them to fit in the classic framework of general models. Fricker et al. [?] developed a model for direct-mapped and set-associative caches that is dedicated to regular (and some irregular) numerical codes. This model takes into account the fact that references within numerical codes correspond to chunks of consecutive addresses which recur periodically. The main asset of the model is to show the behavior of cache under numerical workloads, and to allow dimensioning of cache parameters for such codes. So, while this effort allows a better understanding of cache behavior under such workloads, it does not help in unveiling hot-spot and irregular phenomena which are specific to numerical codes and alter the cache performance. Furthermore, numerical codes are actually made of a limited number of typical loop nests. Therefore, efforts should be concentrated on modeling and understanding the actual and most frequent types of loop nests. This problem is twofold: the first step is identifying these typical cases and consequently restricting problem hypotheses. The second and main step is deriving a model which encompasses the majority of such cases.
Porterfield [?] developed a model dedicated to numerical codes, which does examine specific pieces of code and determines their behavior on cache. However, mostly fully-associative caches have been considered. Because such caches are unlikely to experience interference phenomena, conclusions can hardly be derived for interferences in real caches.
An important step towards accurate evaluation of cache interferences has been made in [?], where blocked Matrix-Matrix multiply is carefully studied. Cross- and self-interference misses are evaluated, and a model for this algorithm is provided. This paper clearly unveils that interferences can severely alter locality exploitation. However, the model still lacks accuracy and is not capable of catching some of the specific phenomena and performance fluctuations induced by the mapping of direct-mapped caches. Furthermore, new parameters (such as array base addresses) which can play an important role in interference phenomena are not taken into account. Moreover, this model is dedicated to one particular example, while a methodology suitable to many numerical algorithms would be very useful.
Indeed, many powerful software optimization techniques for exploiting the locality of numerical codes have now been designed [?, ?, ?]. Although they stress the possible impact of cache interferences, neither a real evaluation of these phenomena nor a study of their frequency of occurrence has been performed yet. In the next sections, we will show that cache interferences can have a strong impact on performance and occur frequently. In section ?? the problem is defined and hypotheses are given. In section ??, the general method for computing the number of misses is indicated. In section ??, conclusions are drawn and further work is discussed.
Problem statement
The purpose of the paper is twofold. The first goal is to show that cache interferences are not infrequent, that they can have a significant impact on the performance of numerical loop nests, and that their conditions of occurrence can be determined even though cache interferences are highly irregular. Understanding and then eliminating such interferences would allow stable cache performance. Furthermore, cache line size is kept small because large line sizes are generally considered to bring more interferences. This notion is not completely true: when line size is large, compulsory misses are much less important, so that interference misses correspond to a larger portion of total misses. Therefore, numerical loop nests are more sensitive to cache interferences when line size is large, but such interferences are not necessarily more important. It follows that exploiting a larger line size requires a good understanding of cache interferences. The second and main goal of this paper is to introduce a method for estimating these cache interferences. General principles and main steps of the technique are indicated and illustrated with examples.
Cache architecture. Direct-mapped caches have been chosen for several reasons:
Direct-mapped caches are more sensitive to interferences than w-way associative caches. Therefore, they are more likely to benefit from studies and optimizations on that matter.
Since the replacement policy of direct-mapped caches is straightforward, computing interferences is easier in direct-mapped caches. However, we strongly believe the technique can be extended to w-way associative caches with moderate modifications.
Among the three newest processor chips (DEC Alpha, MIPS R4000, SuperSPARC), two (DEC Alpha, MIPS R4000) include a small (8-Kbyte) direct-mapped on-chip data cache. Since the clock frequency of such processors is very high, the cost of a cache miss is huge, making it critical to reduce the amount of interferences. The placement policy in the DEC Alpha, for example, is such that a data cache location can generally be determined from the virtual address of the data. Therefore, a study based on virtual addresses would accurately describe real cache behavior.
In the remainder of the paper, the cache size is indicated by $C_S$ and the line size by $L_S$. The unit size is 8 bytes, i.e. the size of a double-precision floating-point data item. In all experiments, cache size is equal to 8 Kbytes and line size is equal to 32 bytes (the characteristics of the DEC Alpha data cache), so $C_S = 1024$ and $L_S = 4$.

Codes. Let us now discuss which types of codes are considered. Since most data traffic in numerical codes occurs in do-loops, only these code constructs are examined. Only array references are considered, because it is probable that other variables would be stored in registers if they are frequently used, and otherwise they would induce minimal perturbations of cache behavior.
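As an illustration of these conventions, the following minimal Python sketch maps an element address (expressed in 8-byte units) to its location in a direct-mapped cache. The function names `cache_position` and `cache_line` are hypothetical and introduced only for this example; the parameter values are the ones stated above (cache size 1024 and line size 4, both in 8-byte units).

```python
CACHE_SIZE = 1024  # C_S: cache size in 8-byte elements (8 Kbytes)
LINE_SIZE = 4      # L_S: line size in 8-byte elements (32 bytes)


def cache_position(address, cache_size=CACHE_SIZE):
    """Element slot occupied in a direct-mapped cache by an element address."""
    return address % cache_size


def cache_line(address, cache_size=CACHE_SIZE, line_size=LINE_SIZE):
    """Index of the cache line that holds the element at `address`."""
    return (address % cache_size) // line_size


# Two addresses that differ by a multiple of the cache size share a cache line.
print(cache_line(0), cache_line(1024))  # prints "0 0"
print(cache_line(5), cache_line(1023))  # prints "1 255"
```

With these values the cache holds 1024 / 4 = 256 lines, so any two elements whose addresses differ by a multiple of 1024 compete for the same line.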
Loop Nests. A loop nest is composed of $n$ distinct loops, $j_i$ being the loop index of the $i$-th loop and $j_n$ the loop index of the innermost loop. Column-major storage is assumed, so, for example, the virtual address of array reference $A(j_1, j_2)$ is $a_0 + N j_1$ [...]. Another model hypothesis is that the boundaries of each loop index must be constant (after normalization, $0 \le j_i < N_i$) and the stride of all indices equal to 1. For any rectangular loop nest, the loop indices and array subscripts can be changed so as to satisfy these hypotheses. However, non-rectangular loops do not satisfy these hypotheses. Since this point is relatively restrictive, further developments of the model will mainly focus on including this kind of loop nest.
General principles
If the cache were fully-associative and replacement were optimal, cache misses would occur in two cases only. First, when data are loaded in cache for the first time; such misses are called compulsory misses. Second, when cache space is too small to store all loop nest data. Then, an element is flushed from cache each time a new element needs to be loaded; such misses are called capacity misses.
However, in direct-mapped caches (and in set-associative caches as well), cache misses can occur even though cache space is sufficient, because a data element can only be mapped into one specific cache location (or w locations in w-way associative caches). Therefore, such cache misses do not occur because of capacity conflicts, but because of mapping conflicts; they are called mapping misses [?].
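To make the distinction concrete, the sketch below uses a small trace-driven simulation of a direct-mapped cache. This is only an illustration under assumed parameters, not the analytical method developed in this paper; the helper names `count_misses` and `interleaved_trace` and the base addresses are hypothetical. Two 64-element arrays are accessed alternately, so the working set is far smaller than the cache, yet the miss count depends entirely on the relative base addresses.

```python
def count_misses(addresses, cache_size=1024, line_size=4):
    """Count misses of a direct-mapped cache over a trace of element addresses."""
    num_lines = cache_size // line_size
    resident = [None] * num_lines  # memory line currently held by each cache line
    misses = 0
    for addr in addresses:
        mem_line = addr // line_size
        slot = mem_line % num_lines
        if resident[slot] != mem_line:
            resident[slot] = mem_line
            misses += 1
    return misses


def interleaved_trace(base_a, base_b, n, repeats):
    """Alternate accesses A(j), B(j) for j = 0..n-1, repeated `repeats` times."""
    trace = []
    for _ in range(repeats):
        for j in range(n):
            trace.append(base_a + j)
            trace.append(base_b + j)
    return trace


# Base addresses differ by exactly the cache size: every access is a mapping miss.
print(count_misses(interleaved_trace(0, 1024, 64, 10)))  # prints 1280
# Shifting the second array by 64 elements leaves only the compulsory misses.
print(count_misses(interleaved_trace(0, 1088, 64, 10)))  # prints 32
```

In both runs the two arrays occupy only 128 of the 1024 element slots, so none of the misses in the first run can be attributed to capacity.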
The main effect of such unexpected misses is to degrade the spatial and temporal reuse of data. Interferences can either correspond to interferences of an array with itself (self-interferences), or with another array (cross-interferences). In the remainder of the paper, these two types of interferences are analyzed.
For self-interferences the principle is to study the mapping of the set of elements of an array to be reused, and check whether these elements overlap with themselves. If so, self-interferences occur, and estimating the degree of overlapping yields the number of additional memory accesses brought by self-interferences. In general, self-interferences mostly correspond to temporal interferences.
For cross-interferences, once the set of elements of an array to be reused is identified (taking into account self-interferences), the overlapping between these elements and elements of another array is determined. Knowing the number of times the two sets of elements overlap, and the amount of overlapping each time, is sufficient to compute the number of additional memory accesses brought by cross-interferences. Cross-interferences can correspond to either temporal or spatial interferences.
Self-interferences
Theoretical reuse set. Thanks to the subscript types considered, it is easy to identify where reuse due to self-dependences occurs. If coefficient $a_i = 0$ in the virtual address expression $a_0 + a_1 j_1 + \dots + a_n j_n$ of a reference to array $A$, then loop $i$ carries reuse. Let us define $l$ as the lowest (i.e. innermost) loop level where reuse occurs. On all loops $k$ with $k > l$, no reuse occurs. So, on each iteration of these loops, the array elements referenced are all distinct. This set of elements is called the reuse set. During each execution of the sub loop nest $j_n, \dots, j_{l+1}$, all elements of the reuse set are referenced.
Definition. For array reference $a_0 + a_1 j_1 + \dots + a_n j_n$, the reuse set can be defined if there exists a $k$ such that $a_k = 0$. The loop level of a reuse set is $l = \max \{k \mid a_k = 0\}$. The theoretical reuse set is equal to $RS(A)_l = \{a_0 + a_1 j_1 + \dots + a_n j_n;\ (0 \le j_i \le N_i - 1)_{i>l}\}$.
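The definition above can be sketched by brute-force enumeration. The code below is a minimal illustration, assuming the coefficients are given as a list `[a_1, ..., a_n]` and the loop bounds as `[N_1, ..., N_n]`; the function names and the example reference (a vector accessed as `X(j_2)` inside a two-deep nest) are hypothetical and not taken from the paper.

```python
from itertools import product


def reuse_level(coeffs):
    """Return l = max{k : a_k = 0} (1-based), or None if no loop carries reuse."""
    levels = [k for k, a in enumerate(coeffs, start=1) if a == 0]
    return max(levels) if levels else None


def theoretical_reuse_set(a0, coeffs, bounds, outer=None):
    """Addresses a_0 + sum(a_i * j_i) with 0 <= j_i <= N_i - 1 for all i > l.

    `outer` fixes the values of j_1..j_l (zeros by default).
    """
    l = reuse_level(coeffs)
    if l is None:
        return set()
    outer = outer if outer is not None else [0] * l
    base = a0 + sum(a * j for a, j in zip(coeffs[:l], outer))
    inner_coeffs = coeffs[l:]
    inner_ranges = [range(n) for n in bounds[l:]]
    return {base + sum(a * j for a, j in zip(inner_coeffs, idx))
            for idx in product(*inner_ranges)}


# X(j_2) in a nest (j_1, j_2): a_1 = 0, a_2 = 1, so reuse is carried by loop 1.
print(reuse_level([0, 1]))                                 # prints 1
print(sorted(theoretical_reuse_set(10, [0, 1], [100, 8])))  # prints [10, 11, ..., 17]
```

Enumeration is only practical for small loop bounds; it is meant to make the set in the definition tangible, not to replace the analytical treatment.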
Reuse can occur on different loop levels. However, reuse that is carried on loop levels higher than $l$ is at least one order of magnitude less important than the reuse on loop $l$. For [...] Consequently, the potential reuse on loop $j_4$ is approximately $N_4$ times more important than the potential reuse on loop $j_2$. Furthermore, the time interval between two reuses on loop $j_4$ is minimal, i.e. it is equal to one iteration of loop $j_4$, while $N_4 N_3$ iterations of loop $j_4$ are executed between two reuses on loop $j_2$. Therefore, not only is reuse scarcer on loop $j_2$, but the probability that it can be exploited is also much smaller.
That is why only the first reuse level is considered in general. However, when some coefficients consecutive to $a_l$ (i.e. $a_{l-1}, a_{l-2}, \dots$) are also equal to 0, reuse on these loop levels is considered to still be achieved (with respect to the reuse set, it is the same as if the boundary of loop $l$ were $N_l N_{l-1} N_{l-2} \cdots$ instead of $N_l$). Note that if $a_l = a_{l-1} = a_{l-2} = \dots = a_k = 0$, the reuse set on loop $k$ is the same as on loop $l$.
Size of the theoretical reuse set. Let us compute the number of cache lines corresponding to the theoretical reuse set assuming no self-interferences. First, let us determine the cache line stride of the reuse set of a reference $A$ (where reuse occurs on loop $l$). It is equal to $\min(1, \frac{\alpha}{L_S})$, where $\alpha = \min_{l+1 \le k \le n}(a_k)$. The term $\frac{\alpha}{L_S}$ is the ratio of the smallest coefficient (only considering loop levels defining the reuse set) to the line size. If this ratio is greater than 1, the stride is taken equal to 1, since at most one new cache line is referenced on each iteration. For example, the access [...]

Actual reuse set. Because self-interferences occur, not all elements of the reuse set can actually be reused. In direct-mapped caches, as soon as two elements of the reuse set compete for the same cache line, neither of the two elements can be reused: they are victims of self-interferences. The elements of the reuse set that are not victims of self-interferences belong to the actual reuse set. The actual reuse set is the set of cache lines of the theoretical reuse set where no self-interferences occur.
Characterizing self-interferences is equivalent to determining the actual reuse set. And determining the actual reuse set is equivalent to studying the mapping of the theoretical reuse set in cache.
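Studying the mapping of a theoretical reuse set in cache can be sketched by brute force, as below. This is an illustration only, not the level-by-level procedure of this paper: the function name `actual_reuse_set`, the 16 x 4 block, and the two leading dimensions are assumptions chosen for the example. A cache line is kept only when exactly one memory line of the reuse set maps to it.

```python
from collections import defaultdict


def actual_reuse_set(addresses, cache_size=1024, line_size=4):
    """Cache lines of a reuse set that are free of self-interferences."""
    num_lines = cache_size // line_size
    by_slot = defaultdict(set)  # cache line index -> memory lines mapped to it
    for addr in addresses:
        mem_line = addr // line_size
        by_slot[mem_line % num_lines].add(mem_line)
    return {slot for slot, mem_lines in by_slot.items() if len(mem_lines) == 1}


def block_addresses(leading_dim, n_rows=16, n_cols=4, a0=0):
    """Addresses of an n_rows x n_cols block stored in column-major order."""
    return {a0 + col * leading_dim + row
            for col in range(n_cols) for row in range(n_rows)}


# Leading dimension equal to the cache size: all columns collide, nothing is reusable.
print(len(actual_reuse_set(block_addresses(1024))))  # prints 0
# Padding the leading dimension by 16 elements removes every self-interference.
print(len(actual_reuse_set(block_addresses(1040))))  # prints 16
```

The block holds only 64 elements in both cases, so the difference comes purely from the mapping and not from capacity.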
To compute the overlapping within the theoretical reuse set of a reference $A$, the loops $n$ to $l+1$ are successively executed, starting with loop $n$. The cache lines used by loops $n, \dots, k+1$ that are not victims of interferences form a temporary reuse set called $RS(A)^{k+1}_l$. On loop level $k$, the interferences between $N_k$ such temporary reuse sets $RS(A)^{k+1}_l$ are determined, and the cache lines that are still not victims of interferences form the new temporary reuse set $RS(A)^k_l$.
1. Loop level $n$ (reuse occurs on loop $l$): on this loop level, the reuse set is $\{a_0 + a_1 j_1 + \dots + a_n j_n;\ 0 \le j_n < N_n\}$. Let $a^n_0 = a_0 + a_1 j_1 + \dots + a_{n-1} j_{n-1}$. A temporary reuse set $RS(A)^n_l$ corresponds to $\min(1, \frac{a_n}{L_S})\, N_n$ cache lines, starting at cache position $a^n_0 \bmod C_S$. If $C_S < \min(1, \frac{a_n}{L_S})\, N_n$, then capacity interferences occur, and the victim cache lines are removed from the temporary reuse set (cf. example on Matrix-Vector multiply below).
2. Loop level $n-1$: let $a^{n-1}_0 = a_0 + a_1 j_1 + \dots + a_{n-2} j_{n-2}$. The temporary reuse sets $RS(A)^n_l$ start at cache positions $(a^{n-1}_0 + a_{n-1} j_{n-1}) \bmod C_S$. By checking whether two such temporary reuse sets interfere (depending on their relative cache positions) and evaluating the amount of overlapping, it is possible to compute the number of cache lines that are not victims of interferences. These cache lines correspond to reuse set $RS(A)^{n-1}_l$ (cf. example on Matrix-Matrix multiply below).
3. All subsequent steps are identical to step 2. The process stops on loop level $l$.

This process is generally not too complex because few loop levels have to be considered (reuse sets of dimension 1 or 2, i.e. depending on 1 or 2 loop levels, correspond to a majority of cases). Furthermore, very often the layout of sets in cache is straightforward due to the way array elements are referenced (stride-one access to arrays), making it relatively easy to evaluate self-interferences and compute the actual reuse set.

This problem is well known in the domain of loop restructuring. It corresponds to capacity interferences rather than mapping interferences. The most classic method for dealing with it is to block the loop, so that the reuse set of X is smaller than the cache size. However, a less obvious and less well-known fact is that, even when the loop is blocked, interferences can still occur. Let us consider the blocked version of Matrix [...]
Matrix-Vector multiply

Cross-interferences
Two main cases of cross-interferences can occur between two references: either the difference between the corresponding two virtual addresses is constant (independent of loop indices), or it varies with the loop indices. These two cases must be distinguished because such cross-interferences are very different. In the first case, the two references are in translation: they always overlap and the amount of overlapping is constant. In the second case, the two references overlap only periodically and the amount of overlapping varies. So, detecting and estimating cross-interferences in the first case basically amounts to comparing the constant parameter of the two virtual addresses (it generally depends on array dimensions and base addresses), while in the second case, the relative movement of the two references must be analyzed.
The first type of interferences is called internal cross-interferences (because a set of references in translation constitutes a kind of class of references, and such cross-interferences then occur within a class), and the second type of interferences is called external cross-interferences (interferences between two references not belonging to the same class).
Estimating cross-interferences
Computing the impact of cross-interferences between two references amounts to estimating how much of the reuse of one reference is lost because of cross-interferences with the other reference.
So, let us consider two references $R_1$ and $R_2$, and compute the impact of $R_2$ on the reuse of $R_1$.
First, the set of elements $R_1$ can reuse must be estimated; it is the reuse set defined in section ??.
Second, the set of elements of $R_2$ that can interfere with $R_1$ must be estimated as well; this set is called the interference set. Because the reuse set is computed on a given loop level $l$, the interference set should be computed on the same loop level. Recall that below loop $l$, no reuse can occur for $R_1$. Therefore, the number of additional memory requests for $R_1$ due to cross-interferences with $R_2$, on each reutilization of the reuse set (i.e. on each iteration of loop $l$), is exactly equal to the number of cache lines used by both the reuse set and the interference set. This notion is fundamental in the computation of cross-interferences. Computing interferences this way makes it possible to abstract away time considerations, i.e. when interferences occur. It is sufficient to estimate the intersection between the set of cache lines corresponding to the reuse set and the interference set. It is important to note that, in the following sections, the reuse set considered is the actual reuse set; otherwise cross-interferences would be counted where self-interferences already occur, resulting in an overestimate of additional memory requests.
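The intersection described above can be sketched as a set operation on cache-line indices. The code below is a minimal illustration under assumed parameters ($C_S = 1024$, $L_S = 4$); the function names and the two address ranges are hypothetical and chosen only to give a small, checkable case.

```python
def cache_lines(addresses, cache_size=1024, line_size=4):
    """Set of cache-line indices touched by a collection of element addresses."""
    num_lines = cache_size // line_size
    return {(addr // line_size) % num_lines for addr in addresses}


def cross_interference_lines(reuse_lines, interfering_addresses,
                             cache_size=1024, line_size=4):
    """Number of cache lines shared by an actual reuse set and an interference set."""
    interference_lines = cache_lines(interfering_addresses, cache_size, line_size)
    return len(set(reuse_lines) & interference_lines)


# Reuse set of R1: 64 consecutive elements, cache lines 0..15, no self-interference.
reuse = cache_lines(range(0, 64))
# Interference set of R2: 64 consecutive elements mapped to cache lines 8..23.
interfering = range(1056, 1120)
print(cross_interference_lines(reuse, interfering))  # prints 8
```

In this example, each reutilization of the reuse set of R1 would incur 8 additional memory requests, one per shared cache line.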
The interference set. The definition of the theoretical interference set is the same as that of the theoretical reuse set.
Definition. For array reference $a_0 + a_1 j_1 + \dots + a_n j_n$, the theoretical interference set on loop level $l$ (this loop level is determined by the victim reuse set) is equal to $IS(A)_l = \{a_0 + a_1 j_1 + \dots + a_n j_n;\ (0 \le j_i \le N_i - 1)_{i>l}\}$.
Moreover, determining the actual interference set is done in much the same way as for the actual reuse set. The actual reuse set is the subset of cache lines of the theoretical reuse set where no self-interferences occur, while the actual interference set is simply the set of cache lines used by the theoretical interference set. So if a cache line of the theoretical interference set is a victim of self-interferences, this cache line is still counted in the actual interference set (while such a cache line is rejected from the actual reuse set). Intuitively, the actual interference set corresponds to the cache surface (the number of cache lines) used by the theoretical interference set. The amount of cross-interferences is directly correlated to the size of the actual interference set (the larger the set, the higher the probability of overlapping with the reuse set). The process for determining the actual interference set is the following:
1. Loop level $n$ (for the reuse set, reuse occurs on loop $l$): on this loop level, the interference set is $\{a_0 + a_1 j_1 + \dots + a_n j_n;\ 0 \le j_n < N_n\}$. Let $a^n_0 = a_0 + a_1 j_1 + \dots + a_{n-1} j_{n-1}$. A temporary interference set $IS(A)^n_l$ corresponds to $\min(1, \frac{a_n}{L_S})\, N_n$ cache lines, starting at cache position $a^n_0 \bmod C_S$. If $C_S < \min(1, \frac{a_n}{L_S})\, N_n$, then capacity interferences occur, and the cache lines used twice or more are only counted once.
2. Loop level $n-1$: let $a^{n-1}_0 = a_0 + a_1 j_1 + \dots + a_{n-2} j_{n-2}$. The temporary interference sets $IS(A)^n_l$ start at cache positions $(a^{n-1}_0 + a_{n-1} j_{n-1}) \bmod C_S$. So, by checking whether two such temporary interference sets interfere (depending on their relative cache positions) and evaluating the amount of overlapping, it is possible to compute the number of cache lines corresponding to the union of the two intervals. The union of the cache lines of all temporary interference sets $IS(A)^n_l$ is the temporary interference set $IS(A)^{n-1}_l$.
3. All subsequent steps are identical to step 2. The process stops on loop level $l$.

In contrast to the reuse set, it is preferable that overlapping occurs within the interference set, because the larger the overlapping within a theoretical interference set, the smaller the corresponding actual interference set, and the less likely the actual interference set is to overlap with cache lines of the actual reuse set. Therefore, the optimal case is obtained for $N_D \bmod C_S = 0$ (note that it corresponds to the worst case for the reuse set of D; cf. section ??).
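The effect of overlapping within an interference set can be checked by brute force. The sketch below simply counts the distinct cache lines touched by a column-major block, which is the union described in the steps above; it is an illustration under assumed values (a 16 x 4 block, $C_S = 1024$, $L_S = 4$), and the function name `interference_set_size` is hypothetical.

```python
def interference_set_size(a0, n_rows, n_cols, leading_dim,
                          cache_size=1024, line_size=4):
    """Number of distinct cache lines used by an n_rows x n_cols column-major block."""
    num_lines = cache_size // line_size
    used = set()
    for col in range(n_cols):
        for row in range(n_rows):
            addr = a0 + col * leading_dim + row
            used.add((addr // line_size) % num_lines)  # lines used twice count once
    return len(used)


# Leading dimension multiple of the cache size: columns overlap, smallest footprint.
print(interference_set_size(0, 16, 4, 1024))  # prints 4
# Padded leading dimension: columns are disjoint in cache, largest footprint.
print(interference_set_size(0, 16, 4, 1040))  # prints 16
```

The same leading dimension that minimizes the interference set (4 lines instead of 16) is the one for which the columns collide with one another, which is why it is the worst case when the block is a reuse set.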
Internal cross-interferences
Internal cross-interferences occur between two references in translation. Thanks to that property, the relative cache position between the reuse set of the victim reference and the interference set of the interfering reference is always the same. Therefore, if the reuse set is defined on loop level $l$ (reuse occurs on loop $l$), the total number of additional memory requests due to cross-interferences between the two references is equal to the total number of iterations of loop $l$ times the number of cache lines used by both the actual interference set and the actual reuse set.
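The statement above can be written compactly as follows. The notation is introduced here only for illustration and is not the paper's: $M_{\mathrm{int}}$ denotes the number of additional memory requests, $I_l$ the total number of iterations of loop $l$, $ARS$ and $AIS$ the actual reuse and interference sets, and $\mathcal{L}(\cdot)$ the set of cache lines a set occupies.

```latex
\[
  M_{\mathrm{int}}(R_1, R_2)
  \;=\;
  I_l \times
  \bigl|\, \mathcal{L}\bigl(ARS(R_1)\bigr) \cap \mathcal{L}\bigl(AIS(R_2)\bigr) \,\bigr|
\]
```

As a hypothetical numerical instance, if the two sets share 8 cache lines and loop $l$ executes 100 iterations in total, the estimate is $100 \times 8 = 800$ additional memory requests.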
The process for computing internal cross-interferences on a reference $R_1$ due to reference $R_2$ is the following: [...]

Spatial interferences. Note that internal cross-interferences correspond to spatial interferences only if the relative cache position of the actual reuse set and the actual interference set is smaller than the line size (in the above example, $(x^1_0 - x^2_0) \bmod C_S < L_S$). So, spatial interferences have a low probability of occurring. On the other hand, when such a case happens, little or no spatial reuse can be achieved, and no temporal reuse can be achieved either. These cases are extremely costly; they correspond to "ping-pong", i.e. when two arrays translate in cache and constantly compete for the same cache location.

Group-dependence reuse. In this paper, mostly reuse due to self-dependences (a reference reuses itself) is analyzed. However, the reuse due to group-dependences (a reference reuses elements of another reference) can also be significant, though it is in general less important than reuse due to self-dependences. Since the way internal cross-interferences perturb group-dependence reuse is original, the phenomenon is worth illustrating with an example.
Example from FLO52. Let us consider the following simple example extracted from a version of the Perfect Club code FLO52 [?]. Array XY is actually a temporary array used for scalar expansion. The leading dimension of arrays XY and X can be considered equal to $N_2$. In this case, there is no self-dependence on array X, but there is a group-dependence between references $X(j_1, j_2, 1)$ and $X(j_1 - 1, j_2, 1)$ which benefits array reference $X(j_1 - 1, j_2, 1)$. Let us call $R_1$ the reference $X(j_1 - 1, j_2, 1)$ and $R_2$ the reference $X(j_1, j_2, 1)$. Since reuse occurs on a given loop level (loop 1), it is possible to extend the definition of the reuse set to group-dependences (except that the reuse set is not reused by the reference itself). The reuse set of $R_1$ is $RS(X)^2_1 = \{X(j_1, 1, 1), \dots, X(j_1, N_2, 1)\}$.
The dependence distance between the two references is equal to $N_2$. It is assumed that $N_2 < C_S$, so that the reuse set of $R_1$ ($\frac{N_2}{L_S}$ cache lines) fits in cache. Now, array XY can have internal cross-interferences with reference $R_1$. However, the way such cross-interferences occur is not straightforward. The amount of overlapping is not equal to the number of cache lines used by both the reuse set of $R_1$ and the interference set of XY. Indeed, internal cross-interferences occur in a boolean way. Depending on the cache distance $(x_0 - xy_0) \bmod C_S$ between the interference set of XY and the reuse set of $R_1$, two cases can occur. Either $(x_0 - xy_0) \bmod C_S \in [C_S - (N_2 - 1), C_S]$, and the interference set of XY flushes no element of the reuse set of $R_1$ because a given cache line is used by $R_1$ after it has been used by XY (cf. figures ?? and ??). There is no additional memory request. Or $(x_0 - xy_0) \bmod C_S \in [0, N_2 - 1]$, and the interference set of XY flushes all elements of the reuse set of $R_1$ because a given cache line is used by $R_1$ before it is used by XY. Therefore, elements referenced by $X(j_1, j_2, 1)$ are flushed before they can be reused by $X(j_1 - 1, j_2, 1)$ (cf. figures ?? and ??; note also that for $(x_0 - xy_0) \bmod C_S = 0$ and $(x_0 - xy_0) \bmod C_S = N_2$, reference XY induces a ping-pong phenomenon with, respectively, reference $X(j_1, j_2, 1)$ and reference $X(j_1 - 1, j_2, 1)$). The total number of additional memory requests is equal to the number of cache lines of the reuse set of $X(j_1, j_2, 1)$ that would have been reused by $X(j_1 - 1, j_2, 1)$ [...]

Conclusions. Internal cross-interferences can be very significant because they occur on each reutilization of the reuse set. They are as frequent and can be as important as self-interferences. The above examples illustrate the fact that cross-interferences can vary significantly, though apparently randomly. Considering array base addresses is critical for detecting and estimating internal cross-interferences.
Intuitively, studying the relative cache positions of the two sets means that the reuse set is considered to be fixed, while the interference set is considered to be moving in cache.

Conclusions. In contrast to internal cross-interferences, external cross-interferences occur periodically and with varying importance. Therefore, detecting and estimating external cross-interferences is more difficult. Still, a precise estimate can be derived by studying such interferences over a period. It is possible to compute the period of interferences and the number of cross-interferences over a period. Then, the total number of external cross-interferences is equal to the number of periods times the number of external cross-interferences over one period. Though external cross-interferences operate in a more irregular manner than internal cross-interferences, they can still be very damaging.
External cross-interferences
General case
Interferences with multiple array references
The previous section provides a method for estimating how much an array can perturb the potential reuse of another array. Now, when the reuse of an array is perturbed by several different arrays, it is necessary to evaluate whether the interferences caused by these arrays are cumulative or redundant, so that total interferences can be evaluated. However, because it is unlikely that three arrays are mapped to the same cache line within a short time interval, redundant interferences are relatively infrequent (this assertion is discussed in [?, ?]). Therefore, considering multiple cross-interferences as cumulative is not a significant approximation in general.
However, there are some cases where redundant interferences can be significant. If two arrays in translation overlap, then the size of the union of their two actual interference sets can be much smaller than the sum of the sizes of their respective actual interference sets. Therefore, such cases need to be considered when internal cross-interferences are evaluated. Indeed, it frequently happens that within the same loop nest, references are in translation (i.e. the difference between their virtual addresses is constant). If it is assumed that all arrays in translation form a class, then the whole class can be considered together, since the movement of all these references is the same. The actual interference set of a class is the union of the actual interference sets of all references within the class. External and internal cross-interferences due to arrays of one such class should not be evaluated separately.

Conclusions. In general, the redundancy between several external cross-interferences is small enough to be ignored. Note that self-interferences and internal cross-interferences are estimated first to avoid redundancy between classes of interferences. Only redundancy due to arrays in translation needs to be evaluated carefully.
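The difference between the union and the sum of interference sets for a class of references in translation can be sketched as follows. The code is an illustration under assumed parameters ($C_S = 1024$, $L_S = 4$); the function names and the two references, which stand for accesses of the form `B(j)` and `B(j+4)` over 64 iterations, are hypothetical.

```python
def cache_lines(addresses, cache_size=1024, line_size=4):
    """Set of cache-line indices touched by a collection of element addresses."""
    num_lines = cache_size // line_size
    return {(addr // line_size) % num_lines for addr in addresses}


def class_interference_sizes(reference_address_sets, cache_size=1024, line_size=4):
    """Return (size of the union, sum of the sizes) of the interference sets of a class."""
    union = set()
    total = 0
    for addresses in reference_address_sets:
        lines = cache_lines(addresses, cache_size, line_size)
        total += len(lines)
        union |= lines
    return len(union), total


# Two references in translation, offset by one cache line (4 elements).
b_j = range(0, 64)        # addresses touched by B(j)
b_j_plus_4 = range(4, 68)  # addresses touched by B(j+4)
print(class_interference_sizes([b_j, b_j_plus_4]))  # prints (17, 32)
```

Treating the two references as cumulative would count 32 cache lines, whereas the class as a whole occupies only 17, which is why references in translation are better handled together.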
Translating arrays
Conclusions and Further work
The importance and frequency of occurrence of cache interferences are generally considered to be very irregular. These interference phenomena prevent optimum utilization of cache memory, and render cache performance unstable. The purpose of NUMODE is to understand, detect and quantify cache interferences. NUMODE is basically a framework and a method for computing the number of cache interferences within a given numerical loop nest. The different types of cache interferences are identified and separated into distinct classes. Then, for each class, it is shown how to evaluate the number of additional memory requests due to such interferences.
NUMODE can be used for determining a "glossary" of conflicting situations which can arise in numerical loop nests. Some of these situations have been illustrated in this paper. The occurrence and importance of these interferences have already been shown to be dependent on specific problem parameters, such as array base addresses or array dimensions.
Because analytical expressions of cache misses can be derived, this technique can be used for performance evaluation and prediction of a given algorithm. The influence of problem parameters on cache interferences can be extracted, and therefore the algorithm can be optimized so as to minimize cache conflicts. In many examples, it has been shown that simple software optimization techniques, such as array padding or changing array base addresses, can help deliver optimum performance by minimizing or even eliminating cache interferences. Current software optimization techniques could strongly benefit from such enhancements.
The goal of NUMODE is different from that of usual cache models. Rather than providing information on global cache behavior, the purpose of this model is to achieve a good comprehension of the phenomena happening in cache, spotting problems and unveiling their causes, so that software and hardware optimizations can be derived to reduce or eliminate cache conflicts.
Software optimization for cache interferences is one of our main research goals. First, using the algorithms presented in this paper, it is possible to systematically evaluate cache interferences at compile-time or at run-time. This information can be used to assist data locality optimizing algorithms [?, ?, ?] in the precise evaluation of the optimal block size. The theoretical values currently used have been shown to considerably lack precision [?]. Second, data locality optimizing algorithms are usually designed for local memories, i.e. they ignore the impact of cache conflicts. It is generally suggested [?, ?, ?] that copying can be used for eliminating cache conflicts. However, copying is a costly operation that cannot be used blindly [?]. The capacity to detect and estimate cache interferences makes it possible to determine when copying is useful and to apply it only then. Third, even when algorithms are not blocked, significant cache interferences occur. Reducing such conflicts by tuning problem parameters is another application. Fourth and last, though it seems possible to tune problem parameters for achieving efficient (interference-free) execution of a given loop nest, it is still a challenge either to make such transformations transparent to other loop nests within the same program, or to find good tradeoff values for all loop nests of a code.
