Abstract: Applications with regular patterns of memory access can experience high levels of cache conflict misses. In shared-memory multiprocessors, conflict misses can be increased significantly by the data transpositions required for parallelization. Techniques such as blocking, which are introduced within a single thread to improve locality, can result in yet more conflict misses. The tension between minimizing cache conflicts and the other transformations needed for efficient parallelization leads to complex optimization problems for parallelizing compilers.
I. Introduction
If the upward trend in processor clock frequencies during the last ten years is extrapolated over the next ten years, we will see clock frequencies increase by a factor of twenty during that period [1]. However, based on the current 7% per annum reduction in DRAM access times [2], memory latency can be expected to reduce by only 50% in the next ten years. This potential ten-fold increase in the distance to main memory has serious implications for the design of future cache-based memory hierarchies as well as for the architecture of memory devices themselves.
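The extrapolation above can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch: the 7% annual reduction and the twenty-fold clock increase come from the text, while the variable names are illustrative.

```python
# Back-of-the-envelope check of the ten-year extrapolation.
YEARS = 10
DRAM_ANNUAL_REDUCTION = 0.07   # 7% per annum reduction in DRAM access time
CLOCK_FACTOR = 20              # projected increase in clock frequency

# Fraction of today's DRAM access time that remains after ten years.
latency_fraction = (1 - DRAM_ANNUAL_REDUCTION) ** YEARS

# Growth of the memory access time when measured in processor clock cycles.
gap_growth = CLOCK_FACTOR * latency_fraction

print(f"DRAM access time after {YEARS} years: {latency_fraction:.2f} of today's")
print(f"Distance to memory in clock cycles grows by about {gap_growth:.1f}x")
```

Compounding a 7% reduction for ten years leaves a little under half of the original access time, which is where the roughly 50% reduction and the roughly ten-fold growth in distance to memory come from.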
Each block of main memory can be placed in exactly one set of blocks in cache. The chosen set is determined by the indexing function. Conventional caches typically extract a field of m bits from the address and use this to select one block from a set of $2^m$. Whilst easy to implement, this indexing function is not robust. The principal weakness is its susceptibility to repetitive conflict misses. Associativity can help to alleviate such conflicts, but is not an effective solution for repetitive and regular conflicts.
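A minimal sketch of the conventional index function makes the weakness concrete. The cache geometry (8 KB, 32-byte lines) matches the experiments reported later in the paper; the access pattern, a column walk through an array whose rows are 8 KB long, is a hypothetical example chosen to trigger the problem.

```python
LINE_BYTES = 32                                # line size
CACHE_BYTES = 8 * 1024                         # 8 KB direct-mapped cache
NUM_SETS = CACHE_BYTES // LINE_BYTES           # 256 sets
OFFSET_BITS = LINE_BYTES.bit_length() - 1      # 5 bits of line offset
INDEX_BITS = NUM_SETS.bit_length() - 1         # m = 8 index bits


def conventional_index(addr: int) -> int:
    """Select a set by extracting the m bits just above the line offset."""
    return (addr >> OFFSET_BITS) & (NUM_SETS - 1)


# Hypothetical pattern: consecutive accesses are 8 KB apart, for example
# walking down a column of a row-major array with 1024 doubles per row.
stride = 1024 * 8
addresses = [i * stride for i in range(16)]
sets_touched = {conventional_index(a) for a in addresses}

print(sets_touched)
```

All sixteen addresses select the same set, so in a direct-mapped cache each access evicts the line brought in by the previous one, and two-way or four-way associativity only postpones the problem.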
One of the best ways to control locality in dense matrix computations with large data structures is to use a tiled (or "blocked") algorithm. This is effectively a re-ordering of the iteration space which increases temporal locality. However, previous work has shown that the conflicts introduced by tiling can be a serious problem [3]. In practice, until now, this has meant that compilers which tile loop nests really ought to compute the maximal conflict-free tile size for given values of B, major array dimension N and cache capacity C. Often this will be too small to make it worthwhile tiling a loop, or perhaps the value of N will not be known at compile time. Gosh et al. [4] present a framework for analyzing cache misses in perfectly-nested loops with affine references. They develop a generic technique for determining optimum tile sizes, and methods for determining array padding sizes to avoid conflicts. These methods require solutions to sets of linear Diophantine equations and depend upon there being sufficient information at compile time to find such solutions.
Table I highlights the problem of conflict misses with reference to the SPEC95 benchmarks. The programs were compiled with the maximum optimization level and instrumented with the ATOM tool [5]. A data cache similar to the first-level cache of the Alpha 21164 microprocessor was simulated: 8 KB capacity, 32-byte lines, write-through and no-write-allocate. For each benchmark we simulated the first $2^{30}$ load operations. Because of the no-write-allocate feature, the tables below refer only to load operations. Table I shows the miss ratio for the following cache organizations: direct-mapped, two-way associative, column-associative [6], victim cache with four victim lines [7], and two-way skewed-associative [8], [9].
Of these schemes, only the two-way skewed-associative cache uses an unconventional indexing scheme, as proposed by its author. For comparison, the miss ratio of a fully-associative cache is shown in the penultimate column. The miss ratio difference between a direct-mapped cache and that of a fully-associative cache is shown in the right-most column of table I, and represents the direct-mapped conflict miss ratio (CMR) [2]. In the case of hydro2d and apsi some organizations exhibit lower miss ratios than a fully-associative cache, due to sub-optimality of LRU replacement in a fully-associative cache for these particular programs. Effectively, the direct-mapped conflict miss ratio represents the target reduction in miss ratio that we hope to achieve through improved indexing schemes. The other types of misses, compulsory and capacity, will remain unchanged by the use of randomized indexing schemes.
Table I. Cache miss ratios for direct-mapped (DM), 2-way set-associative (2W), column-associative (CA), victim cache (VC), 2-way skewed-associative (SA), and fully-associative (FA) organizations. Conflict miss ratio (CMR) is also shown.
As expected, the improvement of a 2-way set-associative cache over a direct-mapped cache is rather low. The column-associative cache provides a miss ratio similar to that of a two-way set-associative cache. Since the former has a lower access time but requires two cache probes to satisfy some hits, any choice between these two organizations should take into account implementation parameters such as access time and miss penalty. The victim cache removes many conflict misses and outperforms a four-way set-associative cache. Finally, the two-way skewed-associative cache offers the lowest miss ratio. Previous work has shown that it can be significantly more effective than a four-way conventionally-indexed set-associative cache [10].
In this paper we investigate the use of alternative index functions for reducing conflicts and discuss some practical implementation issues. Section II introduces the alternative index functions, and section III evaluates their conflict avoidance properties. In section IV we discuss a number of implementation issues, such as the effect of novel indexing functions on cache access time. Then, in section V, we evaluate the impact of the proposed indexing scheme on the performance of a dynamically-scheduled processor. Finally, in section VI, we draw conclusions from this study.
II. Alternative indexing functions
The aim of this paper is to show how alternative cache organizations can eliminate repetitive conflict misses. This is analogous to the problem of finding an efficient hashing function. For large secondary or tertiary caches it may be possible to use the virtual address mapping to adjust the location of pages in cache, as suggested by Bershad et al. [11], thus avoiding conflicts dynamically. However, for small first-level caches this effect can only be achieved by using an alternative cache index function.
In the field of interleaved memories it is well known that bank conflicts can be reduced by using bank selection functions other than the simple modulo-power-of-two. Lawrie and Vora proposed a scheme using prime-modulus functions [...] [17], [18]. These schemes each yield a more or less uniform distribution of requests to banks, with varying degrees of theoretical predictability and implementation cost. In principle each of these schemes could be used to construct a conflict-resistant cache by using them as the indexing function. However, in cache architectures two factors are critical: firstly, the chosen indexing function must have a logically simple implementation, and secondly, we would like to be able to guarantee good behavior on all regular address patterns, even those that are pathological under a conventional index function.
In the commercial domain, the IBM 3033 [...] The use of pseudo-random cache indexing has been suggested by other authors. For example, Smith [22] compared a pseudo-random placement against a set-associative placement. He concluded that random indexing had a small advantage in most cases, but that the advantages were not significant. In this paper we show that for certain workloads and cache organizations, the advantages can be very large.
Hashing the process ID with the address bits in order to index the cache was evaluated in a multiprogrammed environment by Agarwal in [23]. Results showed that this scheme could reduce the miss ratio.
Perhaps the most well-known alternative cache indexing scheme is the class of bitwise exclusive-OR functions proposed for the skewed-associative cache [8]. The bitwise XOR mapping computes each bit of the cache index as either one bit of the address or the XOR of two bits. Where two such mappings are required, different groups of bits are chosen for XOR-ing in each case. A two-way skewed-associative cache consists of two banks of the same size that are accessed simultaneously with two different hashing functions.
Not only does the associativity help to reduce conflicts but the skewed indexing functions help to prevent repetitive conflicts from occurring.
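The idea of skewing can be illustrated with a toy pair of XOR mappings. The two functions below are hypothetical and are not the functions defined in [8]; they are chosen only so that every index bit is the XOR of two address bits and the two banks pair the bits differently. The bank size (128 sets) assumes an 8 KB two-way cache with 32-byte lines.

```python
M = 7                      # index bits per bank
BANK_SETS = 1 << M         # 128 sets per bank (4 KB bank, 32-byte lines)
MASK = BANK_SETS - 1


def _fields(line_addr: int):
    """Split a line address into two adjacent M-bit fields."""
    return line_addr & MASK, (line_addr >> M) & MASK


def _rotl(v: int) -> int:
    """Rotate an M-bit value left by one position."""
    return ((v << 1) | (v >> (M - 1))) & MASK


def bank0_index(line_addr: int) -> int:
    low, high = _fields(line_addr)
    return low ^ high              # bit i = low[i] XOR high[i]


def bank1_index(line_addr: int) -> int:
    low, high = _fields(line_addr)
    return low ^ _rotl(high)       # bit i = low[i] XOR high[i-1]


# Two line addresses that collide in bank 0 ...
x = 0
y = (3 << M) | 3
print(bank0_index(x), bank0_index(y))
# ... are placed in different sets of bank 1.
print(bank1_index(x), bank1_index(y))
```

Because the two banks disagree about which lines collide, a pair of lines that conflict in one bank can usually coexist by occupying the other, which is the property the skewed organization relies on.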
The polynomial modulus function was first applied to cache indexing in [10]. It is best described by first considering the unsigned integer address A in terms of its binary representation $A = a_{n-1} 2^{n-1} + a_{n-2} 2^{n-2} + \cdots + a_0$. This is interpreted as the polynomial $A(x) = a_{n-1} x^{n-1} + a_{n-2} x^{n-2} + \cdots + a_0$ defined over the field GF(2). The binary representation of the m-bit cache index R is similarly defined by the GF(2) polynomial $R(x)$ of order less than m such that $A(x) = V(x)P(x) + R(x)$. Effectively, $R(x)$ is $A(x)$ modulo $P(x)$, where $P(x)$ is an irreducible polynomial of order m such that $x^i \bmod P(x)$ generates all polynomials of order lower than m. The polynomials that fulfil the previous requirements are called Ipoly polynomials. Rau showed how the computation of $R(x)$ can be accomplished by the vector-matrix product of the address and an $n \times m$ matrix H of single-bit coefficients derived from $P(x)$ [18]. In GF(2), this product is computed by a network of AND and XOR gates, and if the H-matrix is constant the AND gates can be omitted and the mapping then requires just m XOR gates with fan-in from 2 to n. In practice we may reduce the number of input address bits to the polynomial mapping function by ignoring some of the upper bits in A. This does not seriously degrade the quality of the mapping function.
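The following sketch computes the polynomial-modulus index in two ways: by long division over GF(2), and by the XOR network implied by the H-matrix, whose row i is $x^i \bmod P(x)$. The polynomial $x^8 + x^4 + x^3 + x^2 + 1$ is an assumption made for illustration (it is a well-known primitive polynomial of order 8); the paper does not state which polynomials were used. The 16-bit input width and the function names are likewise illustrative.

```python
POLY = 0x11D       # assumed P(x) = x^8 + x^4 + x^3 + x^2 + 1
N_BITS = 16        # number of line-address bits fed to the hash (assumed)


def gf2_mod(a: int, p: int) -> int:
    """Remainder of a(x) divided by p(x), with coefficients in GF(2)."""
    deg_p = p.bit_length() - 1
    while a.bit_length() - 1 >= deg_p:
        # Cancel the leading term of a(x) with a shifted copy of p(x).
        a ^= p << (a.bit_length() - 1 - deg_p)
    return a


def h_matrix(n_bits: int, p: int) -> list:
    """Row i holds the m-bit pattern of x^i mod P(x)."""
    return [gf2_mod(1 << i, p) for i in range(n_bits)]


def ipoly_index(line_addr: int, rows: list) -> int:
    """XOR together the rows selected by the set bits of the address."""
    index = 0
    for i, row in enumerate(rows):
        if (line_addr >> i) & 1:
            index ^= row
    return index


H = h_matrix(N_BITS, POLY)

# A stride of 256 lines maps every access to one set under conventional
# indexing; under the polynomial mapping the accesses are spread out.
strided_lines = [i * 256 for i in range(16)]
ipoly_sets = {ipoly_index(line, H) for line in strided_lines}
print(len(ipoly_sets))
```

Each bit of the index depends only on a fixed subset of address bits (one column of H), so in hardware the loop collapses into m XOR gates evaluated in parallel. Address bits above `N_BITS` are simply ignored, mirroring the remark that some upper bits may be dropped.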
Ipoly mapping functions have been studied previously in the context of stride-insensitive interleaved memories (see [17], [18]), and have certain provable characteristics of significant value for cache indexing. In [24] it was demonstrated that a skewed Ipoly cache indexing scheme shows a higher degree of conflict resistance than that exhibited by conventional set-associativity or other (non-Ipoly) XOR-based mapping functions. Overall, the skewed-associative cache using Ipoly mapping and a pure LRU replacement policy achieved a miss ratio within 1% of that achieved by a fully-associative cache. Given the advantage of an Ipoly function over the bitwise XOR function, all results presented in this paper use the Ipoly indexing scheme.
III. Evaluation of Conflict Resistance
The performance of both the integer and floating-point SPEC95 programs has been evaluated for column-associative, two-way set-associative (2W) and two-way skewed-associative organizations using Ipoly indexing functions. In all cases a single-level cache is assumed. The miss ratios of these configurations are shown in table II. Given a conventional indexing function, the direct-mapped (DM) and fully-associative (FA) cache organizations display respectively the lowest and the highest degrees of conflict-resistance of all possible cache architectures. As such they define the bounds within which novel indexing schemes should be evaluated. Their miss ratios are shown in the right-most two columns of table II.
The column-associative cache has access-time characteristics similar to a direct-mapped cache but has some degree of pseudo-associativity: each address can map to one of two locations in the cache, but initially only one is probed. The column labelled spl represents a cache which swaps data between the two locations to increase the percentage of hits on the first probe. It also uses a realistic pseudo-LRU replacement policy. The cache reported in the column labelled lru does not swap data between columns and uses an unrealistic pure LRU replacement policy [10].
It is to be expected that a two-way set-associative cache will be capable of eliminating many random conflicts. However, a conventionally-indexed set-associative cache is not able to eliminate pathological conflict behavior as it has limited associativity and a naive indexing function. The performance of a two-way set-associative cache can be improved by simply replacing the index function, whilst retaining all other characteristics. Conventional LRU replacement can still be used, as the indexing function has no impact on replacement for this cache organization. For two programs the two-way Ipoly cache has a lower miss ratio than a fully-associative cache. This is again due to the sub-optimality of LRU replacement in the fully-associative cache, and is a common anomaly in programs with negligible conflict misses.
The final cache organization shown in table II is the two-way skewed-associative cache proposed originally by Seznec [8]. In its original form it used two bitwise XOR indexing functions. Our version uses Ipoly [...] and a cache which uses an unrealistic pure LRU policy (labelled lru). This organization produces the lowest conflict miss ratio, down from 4.8% to 0.67% for SPECint, and from 12.61% to 0.07% for SPECfp.
It is striking that the performance improvement is dominated by three programs (tomcatv, swim and wave5). These effectively exhibit pathological conflict miss ratios under conventional indexing schemes. Studies by Olukotun et al. [25] have shown that the data cache miss ratio in tomcatv wastes 56% and 40% of available IPC in 6-way and 2-way superscalar processors respectively.
Tiling will often introduce extra cache conflicts, the elimination of which is not always possible through software. Now that we have alternative indexing functions that exhibit conflict avoidance properties, we can use these to avoid these induced conflicts. The effectiveness of Ipoly indexing for tiled loops was evaluated by simulating the cache behavior of a variety of tiled loop kernels. Here we present a small sample of results to illustrate the general outcome. Figures 1 and 2 show the miss ratios observed in two tiled matrix multiplication kernels where the original matrices were square and of dimensions 171 and 256 respectively. Tile sizes were varied from 2 × 2 up to 16 × 16 to show the effect of conflicts occurring in caches that are direct-mapped (a1), 2-way set-associative (a2), fully-associative (fa) and skewed 2-way Ipoly (Hp-Sk). The tiled working set divided by cache capacity measures the fraction of the cache occupied by a single tile. Cache capacity is 8 KBytes, with 32-byte lines.
For dimension 171 the miss ratio initially falls for all caches as tile size increases. This is due to increasing spatial locality, up to the point where self-conflicts begin to occur in the conventionally-indexed direct-mapped and two-way set-associative caches. The fully-associative cache suffers no self-conflicts and its miss ratio decreases monotonically to less than 1% at 50% loading. The behavior of the skewed 2-way Ipoly cache tracks the fully-associative cache closely. The qualitative difference between the Ipoly cache and a conventional two-way cache is clearly visible.
For dimension 256 the product array and the multiplicand array are positioned in memory so that cross-conflicts occur in addition to self-conflicts. Hence the direct-mapped and 2-way set-associative caches experience little spatial locality. However, the Ipoly cache is able to eliminate cross-conflicts as well as self-conflicts, and it again tracks the fully-associative cache.
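The self-conflict effect for dimension 256 can be reproduced in miniature by counting how many distinct cache sets the lines of a single tile occupy. This sketch is not the simulation used for Figures 1 and 2: it assumes double-precision elements, a tile placed at the origin of a row-major array aligned to address zero, a direct-mapped layout with 256 sets, and the illustrative polynomial $x^8 + x^4 + x^3 + x^2 + 1$, none of which are specified in the text.

```python
LINE_BYTES = 32    # line size from the text
NUM_SETS = 256     # 8 KB of 32-byte lines, treated as direct-mapped
ELEM_BYTES = 8     # assumption: double-precision elements
N = 256            # major array dimension from the text
T = 16             # largest tile edge from the text
POLY = 0x11D       # assumed Ipoly polynomial x^8 + x^4 + x^3 + x^2 + 1


def gf2_mod(a: int, p: int) -> int:
    """Remainder of a(x) divided by p(x), with coefficients in GF(2)."""
    deg_p = p.bit_length() - 1
    while a.bit_length() - 1 >= deg_p:
        a ^= p << (a.bit_length() - 1 - deg_p)
    return a


def tile_lines(n: int, t: int) -> set:
    """Line addresses touched by a t-by-t tile at the origin of an n-by-n array."""
    lines = set()
    for row in range(t):
        for col in range(t):
            lines.add(((row * n + col) * ELEM_BYTES) // LINE_BYTES)
    return lines


lines = tile_lines(N, T)
conventional_sets = {line % NUM_SETS for line in lines}
ipoly_sets = {gf2_mod(line, POLY) for line in lines}

print(len(lines), len(conventional_sets), len(ipoly_sets))
```

Under these assumptions the 64 lines of the tile fall into only 16 sets with conventional indexing, so the tile cannot stay resident even though it fills a quarter of the cache, whereas the polynomial mapping gives each line its own set.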
IV. Implementation Issues
The logic of the GF(2) polynomial modulus operation presented in section II defines a class of hash functions which compute the cache placement of an address by combining subsets of the address bits using XOR gates. This means that, for example, bit 0 of the cache index may be [...] Whilst this appears remarkably simple, there is more to consider than just the placement function. Firstly, the function itself uses address bits beyond the normal limit imposed by the typical minimum page size restriction. Secondly, the use of pseudo-random placement in a multi-level memory hierarchy has implications for the maintenance of Inclusion. In [24] we explain these two issues in more depth, and show how the virtual-real two-level cache hierarchy proposed by Wang et al. [26] provides a clean solution to both problems.
A cache memory access in a conventional organization normally computes its effective address by adding two registers, or a register plus a displacement. Ipoly indexing implies additional circuitry to compute the index from the effective address. This circuitry consists of several XOR gates that operate in parallel and therefore the total delay is just the delay of one gate. Each XOR gate has a number of inputs that depends on the particular polynomial being used. For the experiments reported in this paper the number of inputs is never higher than 5. The XOR gating required by the Ipoly mapping may increase the critical path length within the processor pipeline. However, any delay will be short since all bits of the index can be computed in parallel. Moreover, we show later that even if this additional delay induces a full cycle penalty in the cache access time, the Ipoly mapping provides a significant overall performance improvement.
Memory address prediction can also be used to avoid the penalty introduced by the XOR delay when it lengthens the critical path. Memory addresses have been shown to be highly predictable. For instance, in [27] it was shown that the addresses of about 75% of the dynamically executed memory instructions from the SPEC95 suite can be predicted with a simple tabular scheme which tracks the last address produced by a given instruction and its last stride. A similar scheme, that could be used to give an early prediction of the line that is likely to be accessed by a given load instruction, is outlined below.
The processor incorporates a table indexed by the instruction address. Each entry stores the last address and the predicted stride for some recently executed load instruction. In the fetch stage, this table is accessed with the program counter. In the decode stage, the predicted address is computed and the XOR functions are performed to compute the predicted cache line. This can be done in one cycle since the XOR can be performed in parallel with the computation of the most-significant bits of the effective address. When the instruction is subsequently issued to the memory unit it uses the predicted line number to access the cache in parallel with the actual address and line computation. If the predicted line turns out to be incorrect, the cache access is repeated with the actual address. Otherwise, the data provided by the speculative access can be loaded into the destination register.
A number of previous papers have suggested address prediction as a means to reduce memory latency [28], [29], [30], or to execute memory instructions and their dependent instructions speculatively [31], [27], [32]. In the case of a mis-speculation, a recovery mechanism similar to that used by branch prediction schemes is then used to squash mis-speculated instructions.
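The table-based scheme can be sketched as a small last-address-plus-stride predictor. The organization (1K untagged entries, a 2-bit saturating confidence counter, a prediction issued only when the counter's most-significant bit is set, and a stride that is replaced only once the counter has dropped below binary 10) follows the description given in section V. The class name, the assumption of 4-byte instructions when forming the table index, and the exact order of the counter and stride updates are illustrative choices, not details taken from the paper.

```python
class StridePredictor:
    """Hypothetical last-address + stride predictor with 2-bit confidence."""

    def __init__(self, entries: int = 1024):
        self.entries = entries
        self.last_addr = [0] * entries
        self.stride = [0] * entries
        self.confidence = [0] * entries      # 2-bit saturating counters

    def _slot(self, pc: int) -> int:
        # Untagged, direct-mapped table; assumes 4-byte instructions.
        return (pc >> 2) % self.entries

    def predict(self, pc: int):
        """Return a predicted address, or None if confidence is too low."""
        i = self._slot(pc)
        if self.confidence[i] >= 2:          # most-significant bit set
            return self.last_addr[i] + self.stride[i]
        return None

    def update(self, pc: int, actual_addr: int) -> None:
        """Train the entry with the address the load actually produced."""
        i = self._slot(pc)
        if self.last_addr[i] + self.stride[i] == actual_addr:
            self.confidence[i] = min(3, self.confidence[i] + 1)
        else:
            self.confidence[i] = max(0, self.confidence[i] - 1)
            if self.confidence[i] < 2:       # counter below binary 10
                self.stride[i] = actual_addr - self.last_addr[i]
        self.last_addr[i] = actual_addr      # always track the last address


predictor = StridePredictor()
pc = 0x4000
for addr in (1000, 1008, 1016, 1024):        # a load walking with stride 8
    predictor.update(pc, addr)
print(predictor.predict(pc))
```

In the pipeline described above, the predicted address would be passed through the index function during decode so that the cache can be probed with the predicted line while the real effective address is still being computed.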
V. Effect of Ipoly indexing on IPC
In order to verify the impact of polynomial mapping on realistic microprocessor architectures we have developed a parametric simulator for a four-way superscalar processor with out-of-order execution. Table III summarizes the functional units and their latencies used in these experiments. The reorder buffer contained 32 entries, and there were two separate physical register files (FP and Integer), each with 64 physical registers. The processor had a lockup-free data cache [33] that allowed 8 outstanding misses to different cache lines. Cache capacities of 8 KB and 16 KB were simulated with 2-way associativity and 32-byte lines. The cache was write-through and no-write-allocate. The cache had two ports, each with a two-cycle [...]
The memory address prediction scheme was implemented by a direct-mapped table with 1K entries, indexed by instruction address. To reduce cost the entries were not tagged, although this increases interference in the table. Each entry contained the last effective address of the most recent load instruction to index into that table entry, together with the last observed stride. In addition, each entry contained a 2-bit saturating counter to assign confidence to the prediction. Only when the most-significant bit of the counter is set would the prediction be considered correct. The address field was updated for each new reference regardless of the prediction. However, the stride field was updated only when the counter went below $10_2$, i.e. after two consecutive mispredictions.
Table IV shows the IPC and miss ratios for six configurations. All IPC averages are computed using an equally-weighted harmonic mean. The baseline configuration is an 8 KB cache with conventional indexing and no address prediction (np, 3rd column). The average IPC for this configuration is 1.27 from an average miss ratio of 16.53%. With Ipoly indexing the average miss ratio falls to 9.68%. If the XOR gates are not in the critical path IPC rises to 1.33 (nx, 5th column).
Conversely, if the XOR gates are in the critical path, and a one-cycle penalty in the cache access time is assumed, the resulting IPC is 1.29 (wx, 6th column). However, if memory address prediction is then introduced (wp, 7th column) IPC is the same as for a cache without the XOR gates in the critical path (nx). Hence, the memory address prediction scheme can offset the penalty introduced by the additional delay of the XOR gates when they are in the critical path, even under the conservative assumption that a whole cycle of latency is added to each load instruction. Finally, table IV also shows the performance of a 16 KB 2-way set-associative cache without Ipoly indexing (2nd column). Notice that the addition of Ipoly indexing to an 8 KB cache yields over 60% of the IPC increase that can be obtained by doubling the cache size.
These IPC measurements exhibit small absolute differences, but this is because the benefit of Ipoly indexing is perceived by only a small subset of the benchmark programs. Most programs in SPEC95 exhibit low conflict miss ratios. In fact the SPEC95 conflict miss ratio for an 8 KB 2-way set-associative cache is less than 4% for all programs except tomcatv, swim and wave5. The two penultimate rows of table IV show independent IPC averages for the benchmarks with high conflict miss ratios (Ave *), and those with low conflict miss ratios (Ave †). This highlights the ability of polynomial mapping to reduce the miss ratio and significantly boost the performance of problem cases. One can see that the polynomial mapping provides a significant 27% improvement in IPC for the three bad programs even if the XOR gates are in the critical path and memory address prediction is not used. With memory address prediction Ipoly indexing yields an IPC improvement of 33% compared with that of a conventional cache of the same capacity, and 16% higher than that of a conventional cache with twice the capacity.
Notice that the polynomial mapping scheme with prediction is even better than the organization without prediction where the XOR gates do not extend the critical path. This is due to the fact that the memory address prediction scheme reduces by one cycle the effective cache hit time when the predictions are correct, since the address computation is overlapped with the cache access (the computed address is used to verify that the prediction was correct). However, the main benefits observed come from the reduction in conflict misses. To isolate the different effects we have also simulated an organization with the memory address prediction scheme and conventional indexing for an 8 KB cache (wp, column 4). If we compare the IPC of this organization with that in column 3, we see that the benefits of the memory address prediction scheme due solely to the reduction of the hit time are almost negligible. This confirms that the improvement observed in the Ipoly indexing scheme with address prediction derives from the reduction in conflict misses.
The averages for the fifteen programs which exhibit low levels of conflict misses show a small (1.7%) deterioration in average IPC when Ipoly indexing is used and the XOR gates are in the critical path. This is due to a slight increase in the average hit time rather than an overall increase in miss ratio (which on average falls by 2%). For these programs the reduction in aggregated miss penalty does not outweigh the slight extension in critical path length.
VI. Conclusions
In this paper we have discussed the problem of cache conflict misses and surveyed the options for reducing or eliminating those conflicts. We have described pseudo-random indexing schemes based on polynomial modulus functions, and have shown them to be robust enough to virtually eliminate the repetitive cache conflicts caused by bad strides inherent in some SPEC95 benchmarks, as well as eliminating those introduced into an application by the tiling of loop nests.
We have highlighted the major implementation issues that arise from the use of such novel indexing schemes. For example, Ipoly indexing uses more address bits than a conventional cache to compute the cache index. Also, the use of different indexing functions at level-1 and level-2 caches results in the occasional eviction at level-1 simply to maintain Inclusion. We have explained that both of these problems can be solved using a two-level virtual-real cache hierarchy. Finally, we have proposed a memory address prediction scheme to avoid the penalty due to the small potential delay in the critical path introduced by the pseudo-random indexing function.
Detailed simulations of an out-of-order superscalar processor have demonstrated that programs with significant numbers of conflict misses in a conventional 8 KB 2-way set-associative cache perceive IPC improvements of 33% (with address prediction) or 27% (without address prediction). This is up to 16% higher than the IPC improvements obtained simply by doubling the cache capacity. Furthermore, from the programs we analyzed, those that do not experience significant conflict misses on average see only a 1.7% reduction in IPC when Ipoly indexing appears on the critical path for computing the effective address, and address prediction is used. If the indexing logic does not appear on the critical path no deterioration in overall average performance is experienced by those programs.
We believe the key contribution of pseudo-random indexing is the resulting predictability of cache behavior. In our experiments we found that Ipoly indexing reduces the standard deviation of miss ratios across SPEC95 from 18.49 to 5.16. This could be beneficial in real-time systems where unpredictable timing, caused by the possibility of pathological miss ratios, presents problems. If conflict misses are eliminated, the miss ratio depends solely on compulsory and capacity misses, which in general are easier to predict and control. Conflict avoidance could also be beneficial when iteration-space tiling is used to improve data locality.
VII. Acknowledgments
