Abstract-Using alternative cache indexing/hashing functions is a popular technique to reduce conflict misses by achieving a more uniform cache access distribution across the sets in the cache. Although various alternative hashing functions have been demonstrated to eliminate the worst-case conflict behavior, no study has really analyzed the pathological behavior of such hashing functions that often results in performance slowdown. In this paper, we present an in-depth analysis of the pathological behavior of cache hashing functions. Based on the analysis, we propose two new hashing functions, prime modulo and odd-multiplier displacement, that are resistant to pathological behavior and yet are able to eliminate the worst-case conflict behavior in the L2 cache. We show that these two schemes can be implemented in fast hardware using a set of narrow addition operations, with negligible fragmentation in the L2 cache. We evaluate the schemes on 23 memory intensive applications. For applications that have nonuniform cache accesses, both prime modulo and odd-multiplier displacement hashing achieve an average speedup of 1.27 compared to traditional hashing, without slowing down any of the 23 benchmarks. We also evaluate using odd-multiplier displacement function with multiple multipliers in conjunction with a skewed associative L2 cache. The skewed associative cache achieves a better average speedup at the cost of some pathological behavior that slows down four applications by up to 7 percent.
DESPITE the relatively large size and high associativity of the L2 cache, conflict misses are a significant performance bottleneck in many applications. Alternative cache indexing/hashing functions are used to reduce such conflicts by achieving a more uniform access distribution across the L1 cache sets [18], [4], [23], [22], L2 cache sets [19], or the main memory banks [15], [11], [16], [17], [20], [8], [7], [26], [12]. Although various alternative hashing functions have been demonstrated to eliminate the worst-case conflict behavior, few studies, if any, have analyzed the pathological behavior of such hashing functions, which often results in performance degradation.
This paper presents an in-depth analysis of the pathological behavior of hashing functions and proposes two hashing functions, prime modulo and odd-multiplier displacement, that are resistant to the pathological behavior and yet able to eliminate the worst-case conflict misses from the L2 cache. The number of cache sets in the prime modulo hashing is a prime number, while the odd-multiplier displacement hashing adds an offset, equal to an odd or prime number multiplied by the tag bits, to the index bits of an address to obtain a new cache index. The prime modulo hashing has been used in software hash tables [1] and in the Burroughs Scientific Processor [11]. A fast implementation of prime modulo hashing has only been proposed for Mersenne prime numbers [25]. Since Mersenne prime numbers are sparse (i.e., for most $n$, $2^n - 1$ is not prime), using Mersenne prime numbers significantly restricts the number of cache sets that can be implemented.
We present an implementation that solves the three main drawbacks of the prime modulo hashing when it is directly applied to cache indexing. First, we present a fast hardware mechanism that uses a set of narrow add operations in place of a true integer division for any number of cache sets. Second, by applying the prime modulo to the L2 cache, the fragmentation that results from not fully utilizing a power of two number of cache sets becomes negligible. Finally, we show an implementation where the latency of the prime modulo computation can be hidden by performing it in parallel with L1 accesses and by caching the partial computation in the TLB. We show that the prime modulo hashing has properties that make it resistant to the pathological behavior that plagues other alternative hashing functions, while, at the same time, enabling it to eliminate worst-case conflict misses.
Although the odd-multiplier displacement hashing lacks the theoretical superiority of the prime modulo hashing, it can perform just as well in practice when the multiplier is carefully selected. In addition, the odd-multiplier displacement hashing can easily be used in conjunction with a skewed associative cache that uses multiple hashing functions to further distribute the cache set accesses. In this case, a unique multiplier is used for each cache bank.
We use 23 memory-intensive applications from various sources and categorize them into two classes: one with nonuniform cache set accesses and the other with uniform cache set accesses. We found that, on the applications with nonuniform accesses, the prime modulo hashing achieves an average speedup of 1.27. It does not slow down any of the 23 applications except in one case by only 2 percent. The odd-multiplier displacement hashing achieves almost identical performance to the prime modulo hashing, but without slowing down any of the 23 benchmarks. Both methods outperform an XOR-based indexing function, which obtains an average speedup of 1.21 on applications with nonuniform cache set accesses.
Using multiple hashing functions, skewed associative caches are sometimes able to eliminate more misses than a cache that uses a single hashing function. However, we found that the use of a skewed associative cache introduces some pathological behavior that slows down some applications. A skewed associative cache with the XOR-based hashing [18], [4], [19] obtains an average speedup of 1.31 for benchmarks with nonuniform accesses, but slows down four applications by up to 9 percent. However, using the odd-multiplier displacement hashing that we propose in conjunction with a skewed associative cache, the average speedup improves to 1.35 and the worst-case slowdown improves to 7 percent.
The rest of the paper is organized as follows: Section 2 discusses the metrics and ideal properties of a hashing function, which help in understanding the pathological behavior. Section 3 discusses the proposed hashing functions: prime modulo and odd-multiplier displacement and their implementations. Section 4 describes the evaluation environment, while Section 5 discusses the results obtained. Section 6 lists the related work. Finally, Section 7 concludes the paper.
PROPERTIES OF HASHING FUNCTIONS
In this section, we will describe two metrics that are helpful in understanding the pathological behavior of hashing functions (Section 2.1) and two properties that a hashing function must have in order to avoid the pathological behavior (Section 2.2).
Metrics
The first metric that estimates the degree of the pathological behavior is balance, which describes how evenly distributed the addresses are over the cache sets. The other is concentration, which measures how evenly the sets are used over small intervals of accesses.
Let $n_{set}$ be the number of sets in the cache, which is a power of two. This implies that $\log_2 n_{set}$ bits from the address are used to obtain a set index. Let a sequence $\langle a_1, a_2, \ldots, a_m \rangle$ of block addresses $a_i$ denote $m$ cache accesses. Suppose $a_i \neq a_j$ for $i \neq j$, where $1 \le i, j \le m$, implying that each block address in the sequence is unique. Thus, the address sequence does not have any temporal reuse (we will return to this issue later). Let $H$ be a hashing function that maps each block address $a_i$ to set $H(a_i)$ in the cache. We denote the $\log_2 n_{set}$ index bits of $a_i$ with $x_i$ and the first $\log_2 n_{set}$ bits of the tag of $a_i$ with $t_i$, as shown in Fig. 1.
Balance describes how evenly distributed the addresses are over the sets in the cache. When good balance is not achieved, the hashing function would be ineffective and would cause conflict misses. To measure the balance, we use a formula suggested by Aho and Ullman [1]:

$$balance = \frac{\sum_{j=1}^{n_{set}} \frac{b_j \cdot (b_j + 1)}{2}}{\frac{m}{2 \cdot n_{set}} \cdot (m + 2 \cdot n_{set} - 1)} \tag{1}$$

where $b_j$ represents the total number of addresses that are mapped to set $j$. The term $\frac{b_j \cdot (b_j + 1)}{2}$ represents the weight of set $j$, equivalent to $1 + 2 + \ldots + b_j$. Thus, a set that has more addresses will have a larger weight. The numerator, $\sum_{j=1}^{n_{set}} \frac{b_j \cdot (b_j + 1)}{2}$, represents the sum of the weights of all sets. The denominator, $\frac{m}{2 \cdot n_{set}} \cdot (m + 2 \cdot n_{set} - 1)$, represents the sum of the weights of all sets, but assuming a perfectly random address distribution across all sets [1]. Thus, a balance value close to the ideal value of 1 represents a better address distribution across the sets.
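As a rough illustration of the balance metric, the following Python sketch evaluates it on a list of set indices, one per unique block address. The function name `balance` and the two synthetic traces are assumptions made for this example and are not taken from the paper.

```python
from collections import Counter

def balance(set_indices, n_set):
    """Balance of a trace of set indices (hypothetical helper)."""
    m = len(set_indices)
    counts = Counter(set_indices)
    # Sum of the set weights b_j * (b_j + 1) / 2; unused sets contribute 0.
    numerator = sum(b * (b + 1) / 2 for b in counts.values())
    # Sum of the weights expected under a perfectly random distribution.
    denominator = (m / (2 * n_set)) * (m + 2 * n_set - 1)
    return numerator / denominator

n_set = 16
even_trace = [i % n_set for i in range(1024)]          # every set used equally
stride2_trace = [(2 * i) % n_set for i in range(1024)]  # only even sets used

print(balance(even_trace, n_set))     # slightly below 1
print(balance(stride2_trace, n_set))  # close to 2
```

A perfectly even trace scores slightly below 1 here because the denominator models a random, not a perfectly even, distribution; the stride-2 trace uses only half of the sets and scores roughly twice as high.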
Concentration is a less straightforward measure and is intended to measure how evenly the sets are used over small intervals of accesses. It is possible to achieve the ideal balance for the entire address sequence $\langle a_1, a_2, \ldots, a_m \rangle$, yet conflicts can occur if, on smaller intervals, the balance is not achieved.

To measure concentration, we calculate the distance $d_i$ as the number of accesses to the cache that occur between two accesses to a particular set and calculate the standard deviation of these distances. More formally, $d_i$ for an address $a_i$ is the smallest positive integer such that $H(a_i) = H(a_{i+d_i})$. The concentration is equal to the standard deviation of the $d_i$ values. Noting that, in the ideal case and with the balance of 1, the average of the $d_i$ values is necessarily $n_{set}$,¹ our formula is

$$concentration = \sqrt{\frac{\sum_{i=1}^{m} (d_i - n_{set})^2}{m}} \tag{2}$$

Using standard deviation penalizes a hashing function not only for reaccessing a set after a small time period since its last access ($d_i < n_{set}$), but also for a large time period ($d_i > n_{set}$). The smaller the concentration is, the better the hashing function is. The concentration of an ideal hashing function is zero.
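The concentration metric can be sketched in Python as follows. This is only an illustration under two assumptions that the text does not spell out: accesses whose set is never accessed again are skipped (their distance is undefined), and the mean squared deviation is taken over the distances that do exist. The name `concentration` is hypothetical.

```python
import math

def concentration(set_indices, n_set):
    """Deviation of reaccess distances from n_set (hypothetical helper)."""
    last_seen = {}   # set index -> position of its most recent access
    distances = []
    for pos, s in enumerate(set_indices):
        if s in last_seen:
            distances.append(pos - last_seen[s])
        last_seen[s] = pos
    if not distances:
        return 0.0   # assumption: no set is reaccessed, nothing to penalize
    return math.sqrt(sum((d - n_set) ** 2 for d in distances) / len(distances))

n_set = 16
unit_stride = [i % n_set for i in range(1024)]       # each set reaccessed every 16
stride_two = [(2 * i) % n_set for i in range(1024)]  # each used set reaccessed every 8

print(concentration(unit_stride, n_set))  # 0.0
print(concentration(stride_two, n_set))   # 8.0
```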
In general, alternative hashing functions have mostly targeted the ideal balance, but not the ideal concentration. Achieving good concentration is vital in avoiding pathological behavior for applications with high temporal locality. If a set receives a burst of accesses to many distinct addresses, then the set suffers from conflict misses temporarily. If one of the addresses has temporal reuse, it may have been replaced from the cache by the time it is reaccessed, creating conflict misses.
Ideal Properties
In this section, we describe the properties that should be satisfied by an ideal hashing function. Most applications, even some irregular applications, often have strided access patterns. Given the common occurrence of these patterns, a hashing function that does not achieve the ideal balance and concentration will cause a pathological behavior. A pathological behavior arises when the balance or concentration of an alternative hashing function is worse than those of the traditional hashing function, often leading to slowdown.

Footnote 1. There are $m$ accesses spread ideally across $n_{set}$ sets, so the total distance between accesses is $m \cdot n_{set}$. Hence, the average over the $m$ accesses is $n_{set}$.

Let $s$ be a stride amount in the address sequence $\langle a_1, a_2, \ldots, a_m \rangle$, i.e., $a_{i+1} = a_i + s$, where $1 \le i < m$.

Property 1 (Ideal balance). For the modulo-based hashing where $H(a_i) = a_i \bmod n_{set}$, the ideal balance is achieved if and only if $\gcd(s, n_{set}) = 1$, as shown in [15]. For other hashing functions, such as XOR-based, the ideal balance condition is harder to formulate because the hashing function has various cases where the ideal balance is not achieved (Section 3.3).
Property 2 (Sequence invariance). A hashing function is sequence invariant if and only if, for any $a_i$ and any distance $d$,

$$H(a_i) = H(a_{i+d}) \text{ implies } H(a_{i+1}) = H(a_{i+d+1}).$$

The ideal concentration is achieved when both the ideal balance and sequence invariance are satisfied.² Therefore, the ideal concentration is not achieved when the sequence invariance is not achieved. The sequence invariance says that once a set is reaccessed, the sequence of set accesses will precisely follow the previous sequence. Moreover, when the sequence invariance is satisfied, all the distances between two accesses to the same set are equal to a constant $d$, indicating the absence of a burst of accesses to a single set for the strided access pattern. Furthermore, when the ideal balance is satisfied for the modulo hashing, then the constant $d$ is the average distance, or $d = x = n_{set}$.
It is possible that a hashing function satisfies the sequence invariance property in most, but not all, cases. Such a function can be said to have partial sequence invariance.
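One way to test sequence invariance on a finite trace is to check that every set is always followed by the same successor set, which is equivalent to the implication in Property 2 restricted to that trace. The sketch below is an illustration only; the function name and the sample traces are assumptions.

```python
def is_sequence_invariant(set_indices):
    """True if equal sets are always followed by equal sets in the trace."""
    successor = {}  # set index -> the set that followed it the first time
    for cur, nxt in zip(set_indices, set_indices[1:]):
        # setdefault records the first successor and returns it afterwards.
        if successor.setdefault(cur, nxt) != nxt:
            return False
    return True

# Modulo hashing with n_set = 16 and stride 3: H(a_i) = (3 * i) mod 16.
modulo_trace = [(3 * i) % 16 for i in range(100)]
# A made-up trace in which set 0 is followed by set 1 and later by set 2.
irregular_trace = [0, 1, 0, 2]

print(is_sequence_invariant(modulo_trace))     # True
print(is_sequence_invariant(irregular_trace))  # False
```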
In Section 3.3, we will show that the prime modulo hashing function satisfies both properties except for a very small number of cases, whereas other hashing functions do not always achieve Property 1 and 2 simultaneously. Bad concentration is a major source of the pathological behavior for alternative hashing functions.
HASHING FUNCTIONS BASED ON PRIME NUMBERS
In this section, we describe the prime modulo and odd-multiplier displacement hashing functions that we propose (Sections 3.1 and 3.2, respectively). We compare them against other hashing functions in Section 3.3.
The Prime Modulo Hashing Function
Prime modulo hashing functions, like any other modulo functions, can be expressed as $H(a_i) = a_i \bmod n_{set}$. The difference between prime modulo hashing and traditional hashing is that $n_{set}$ is a prime number instead of a power of two. The prime modulo hashing functions that have been used in software hash tables [1] and the BSP machine [11] have two major drawbacks. First, they are considered expensive to implement in hardware because performing a modulo operation with a prime number requires an integer division [11]. Second, since the number of sets in the physical memory ($n_{set\_phys}$) is likely a power of two, there are $\Delta = n_{set\_phys} - n_{set}$ sets that are wasted, causing fragmentation. For example, the fragmentation in the BSP is a nontrivial 6.3 percent. To minimize the fragmentation, $n_{set}$ is generally chosen to be the largest prime number that is smaller than $n_{set\_phys}$. Since we target the L2 cache, however, this fragmentation becomes negligible. Table 1 shows that the percentage of the sets that are wasted in an L2 cache is small for commonly used numbers of sets in the L2 cache. The fragmentation falls below 1 percent when there are 512 physical sets or more. This is due to the fact that there is always a prime number that is very close to a power of two. In cases where the wasted sets are not fabricated, fragmentation is no longer an issue.
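The choice of $n_{set}$ and the resulting fragmentation can be reproduced with a short Python sketch. It assumes that fragmentation is expressed as the percentage of physical sets left unused; the helper names are hypothetical and trial division is used only because the numbers involved are small.

```python
def is_prime(n):
    """Trial-division primality test, adequate for cache-sized numbers."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    f = 3
    while f * f <= n:
        if n % f == 0:
            return False
        f += 2
    return True

def largest_prime_below(n):
    """Largest prime strictly smaller than n (n must be at least 3)."""
    p = n - 1
    while not is_prime(p):
        p -= 1
    return p

def fragmentation(n_set_phys):
    """Return (n_set, delta, percent of physical sets wasted)."""
    n_set = largest_prime_below(n_set_phys)
    delta = n_set_phys - n_set
    return n_set, delta, 100.0 * delta / n_set_phys

for n_set_phys in (256, 512, 1024, 2048, 4096, 8192, 16384):
    n_set, delta, percent = fragmentation(n_set_phys)
    print(f"{n_set_phys:6d} physical sets -> n_set={n_set:6d}, "
          f"delta={delta}, wasted={percent:.2f}%")
```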
Utilizing number theory, we can compute the prime modulo function quickly without using an integer division. The foundation for the technique is taken from fast random number generators [14], [24]. Specifically, computing $2^k \cdot x \bmod m$ where $m$ is a Mersenne prime number (i.e., $m$ is one less than a power of two) can be performed using add operations without a multiplication or division operation.

Since we are interested in a prime number $n_{set}$ that is not necessarily a Mersenne prime, we extend the existing method and propose two methods of performing the prime modulo operation fast without any multiplication or division. The first method, iterative linear, needs recursive steps of shift, add, and subtract&select operations. The second method, polynomial, needs only one step of add and a subtract&select operation.

Subtract&select method. Computing the value of $x \bmod n_{set}$ is trivial if $x$ is small. Fig. 2 shows how this can be implemented in hardware. The values $x$, $x - n_{set}$, $x - 2 n_{set}$, $x - 3 n_{set}$, etc. are all fed as input into a selector which chooses the rightmost input that is not negative. To implement this method, the maximum value of $x$ should be known.
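The selector can be modeled in software as follows. This is a behavioral sketch, not a description of the circuit in Fig. 2; the function name and the `n_inputs` parameter are assumptions, and the error path stands in for the requirement that the maximum value of x be known in advance.

```python
def subtract_select(x, n_set, n_inputs=2):
    """Model of a selector fed with x, x - n_set, ..., x - (n_inputs-1)*n_set.

    Returns the rightmost non-negative input. Raises ValueError when the
    selector is too narrow for x, i.e. when the result would not be a
    valid residue modulo n_set.
    """
    candidates = [x - k * n_set for k in range(n_inputs)]
    non_negative = [c for c in candidates if c >= 0]
    if not non_negative or non_negative[-1] >= n_set:
        raise ValueError("x is out of range for this selector width")
    return non_negative[-1]

print(subtract_select(3334, 2039))               # 1295
print(subtract_select(5000, 2039, n_inputs=3))   # 922
```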
Iterative linear method. First, $a_i$ is represented as a linear function of $\Delta = n_{set\_phys} - n_{set}$. To see how this can be done, let $T_i$ and $x_i$ represent parts of the bits in $a_i$, as depicted in Fig. 1. Since $a_i = n_{set\_phys} \cdot T_i + x_i$, then

$$a_i = (n_{set} + \Delta) \cdot T_i + x_i \equiv \Delta \cdot T_i + x_i = a'_i \pmod{n_{set}} \tag{3}$$

Since $a'_i$ is much smaller than $a_i$, $a'_i \bmod n_{set}$ may be computed using the subtract&select method (Fig. 2). Moreover, although (3) contains a multiplication, since $\Delta$ is a very small integer (at most 9, see Table 1) for most cases, the multiplication can easily be converted to shift and add operations. For example, when $\Delta = 9$,

$$a'_i = 9 \cdot T_i + x_i = (T_i << 3) + T_i + x_i,$$

where << denotes a left shift operation. Finally, when $a'_i$ is still large, we can apply (3) iteratively to obtain $a''_i$, $a'''_i$, etc. The following theorem states the maximum number of iterations that need to be performed to compute a cache index using the iterative linear method.

Theorem 1. Assuming a 2-input subtract&select unit and given a $B$-bit address and a cache with block/line size of $L$, the number of iterations needed to compute $a_i \bmod n_{set}$ is at most:

$$\left\lceil \frac{B - \log_2 L - \log_2 n_{set}}{\log_2 n_{set\_phys} - \log_2 \Delta} \right\rceil.$$

When a subtract&select with $2^t + 2$ selector inputs is used in conjunction with the iterative linear method, the required number of iterations is at most:
Proof. In order to have $a_i \bmod n_{set} = a_i$, we need to have $a_i < n_{set}$. We will look at how the maximum range of $a_i$ can be increased using (3) while still avoiding the modulo operation. We express $a_i$ as $a_i = n_{set} \cdot x_1 + y_1$, where $y_1 < n_{set}$, and $a_i = n_{set\_phys} \cdot x_2 + y_2$, where $y_2 < n_{set\_phys}$. Now, using (3), we have $a_i \equiv \Delta \cdot x_2 + y_2 \pmod{n_{set}}$. In order to avoid the modulo computation, we need $\Delta \cdot x_2 + y_2 < n_{set}$. Hence,

$$x_2 < \frac{n_{set} - y_2}{\Delta}.$$

Since $y_2 < n_{set\_phys}$, it can be tackled separately by using the subtract&select method with only two inputs. Thus, $a_i < n_{set} \cdot \frac{n_{set\_phys}}{\Delta}$. Compared to the original range, where $a_i < n_{set}$, we have increased the range of $a_i$'s value by a factor of $\frac{n_{set\_phys}}{\Delta}$. Using induction, we can prove that, after $n$ iterations, the maximum range of $a_i$ such that no modulo computation is required is $a_i < n_{set} \cdot \left(\frac{n_{set\_phys}}{\Delta}\right)^n$. For a $B$-bit address and a cache with block/line size of $L$, the maximum value of $a_i$ is $2^{B - \log_2 L}$. Therefore, we need $n$ iterations such that $2^{B - \log_2 L} < n_{set} \cdot \left(\frac{n_{set\_phys}}{\Delta}\right)^n$. Taking the log of both sides, the inequality becomes:

$$B - \log_2 L < \log_2 n_{set} + n \cdot (\log_2 n_{set\_phys} - \log_2 \Delta).$$

Solving for $n$ yields the bound $\left\lceil \frac{B - \log_2 L - \log_2 n_{set}}{\log_2 n_{set\_phys} - \log_2 \Delta} \right\rceil$. Using a similar method, we can prove the case where there are $2^t + 2$ inputs into the subtract&select unit. $\square$
For example, for a 32-bit machine with $n_{set\_phys} = 2{,}048$ and a 64-byte cache line size, the prime modulo can be computed with only two iterations. However, with a 64-bit machine, it requires six iterations using a subtract&select with a 3-input selector, but requires three iterations with a 258-input selector.
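The iterative linear method can be sketched in Python for the 32-bit example (2,048 physical sets, 2,039 sets, 64-byte lines, so block addresses have 26 bits). The function name, its default parameters, and the loop condition (iterate until a 2-input subtract&select suffices) are assumptions chosen for illustration.

```python
def prime_mod_iterative(a, n_set_phys=2048, n_set=2039):
    """Compute a mod n_set by repeatedly applying (3).

    Returns (index, iterations). n_set_phys must be a power of two.
    """
    delta = n_set_phys - n_set
    k = n_set_phys.bit_length() - 1   # log2(n_set_phys)
    mask = n_set_phys - 1
    iterations = 0
    # Reduce until a 2-input subtract&select (a, a - n_set) is enough.
    while a >= 2 * n_set:
        T = a >> k                    # upper part of the address
        x = a & mask                  # lower log2(n_set_phys) bits
        a = delta * T + x             # for delta = 9: (T << 3) + T + x
        iterations += 1
    if a >= n_set:                    # 2-input subtract&select
        a -= n_set
    return a, iterations

index, iterations = prime_mod_iterative(2**26 - 1)
print(index, iterations)
```

For any 26-bit block address this sketch finishes within two reduction steps, which agrees with the two iterations stated for the 32-bit example.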
Polynomial method. Because, in some cases, the iterative linear method involves multiple iterations, we devise an algorithm to compute the prime modulo operation in one step. To achieve that, using the same method as in deriving (3), we first express $a_i$ as a polynomial function of $n_{set\_phys}$:

$$a_i = \sum_{j=0}^{n} t_{ij} \cdot n_{set\_phys}^{\,j},$$

where $t_{ij}$ consists of bit $\log_2 n_{set\_phys} \cdot j$ through bit $\log_2 n_{set\_phys} \cdot (j + 1) - 1$ of the block address bits of $a_i$ (so that $t_{i0} = x_i$). For example, $t_{i1}$ is shown as $t_i$ in Fig. 1. Substituting $n_{set\_phys}$ by $(n_{set} + \Delta)$, we obtain

$$a_i = \sum_{j=0}^{n} t_{ij} \cdot (n_{set} + \Delta)^j.$$

$(n_{set} + \Delta)^k$, where $k = 1, 2, \ldots, n$, can be expanded into

$$(n_{set} + \Delta)^k = n_{set}^k + k \cdot n_{set}^{k-1} \cdot \Delta + \cdots + k \cdot n_{set} \cdot \Delta^{k-1} + \Delta^k.$$

In the $(\bmod\ n_{set})$ space, any term that is a multiple of $n_{set}$ is equivalent to zero. Since only the last term is not zero, $(n_{set} + \Delta)^k \equiv \Delta^k \pmod{n_{set}}$. Therefore, we can express $a_i$ as a polynomial function of $\Delta$:

$$a_i \equiv a^*_i = \sum_{j=0}^{n} t_{ij} \cdot \Delta^j \pmod{n_{set}} \tag{4}$$

Note that $a^*_i$ is much smaller than $a_i$ and is, in general, small enough to derive the result of the prime modulo using the subtract&select method (Fig. 2).
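A software sketch of the polynomial method evaluates the right-hand side of (4) by slicing the block address into fields of $\log_2 n_{set\_phys}$ bits and weighting field $j$ by the $j$-th power of the set difference. The function name and defaults are assumptions; the final loop merely emulates a sufficiently wide subtract&select unit.

```python
def prime_mod_polynomial(a, n_set_phys=2048, n_set=2039):
    """Compute a mod n_set via the polynomial in delta from (4)."""
    delta = n_set_phys - n_set
    k = n_set_phys.bit_length() - 1   # field width, log2(n_set_phys)
    mask = n_set_phys - 1
    a_star = 0
    weight = 1                        # delta ** j for the current field
    while a:
        a_star += (a & mask) * weight  # t_ij * delta**j
        weight *= delta
        a >>= k
    # Emulate subtract&select: keep the smallest non-negative candidate.
    while a_star >= n_set:
        a_star -= n_set
    return a_star

# x = 5, t_1 = 7, t_2 = 3 gives 5 + 9*7 + 81*3 = 311.
example = 5 + (7 << 11) + (3 << 22)
print(prime_mod_polynomial(example))  # 311
```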
A special but restrictive case is when $n_{set}$ is a Mersenne prime number, in which case $\Delta = 1$. Then, (4) can be simplified further, leading up to the method given by Yang and Yang [25]:

$$a_i \equiv \sum_{j=0}^{n} t_{ij} \pmod{n_{set}}.$$

It is possible to use an $n_{set}$ that is equal to $n_{set\_phys} - 1$ but is not a prime number. In some cases, if $n_{set\_phys} - 1$ is not a prime number, it is a product of a few prime numbers (e.g.,
Thus, some of them may be good choices to handle small strided patterns. However, it is beyond the scope of this paper to evaluate such numbers.
Comparing the iterative linear and polynomial methods, the polynomial method allows smaller latency in computing the prime modulo when $\Delta$ is small, especially for 64-bit machines and a small number of sets in the cache. The iterative linear method is more desirable for a low hardware and power budget, or when $\Delta$ is large.
Hardware Implementation
To illustrate how the prime modulo indexing can be implemented in hardware, let us consider an L2 cache with 64-byte blocks, 2,048 ($= 2^{11}$) physical sets, and 2,039 ($= 2^{11} - 9$) sets. Therefore, 6 bits are used as the block offset, while the remaining address bits can be broken up into three components: the 11-bit $x$ ($x_{10}, x_9, x_8, \ldots, x_0$), the 11-bit $t_1$, and the remaining upper bits $t_2$. According to (4), the cache index can be calculated as $index = x + 9 \cdot t_1 + 81 \cdot t_2 \pmod{n_{set}}$. The binary representations of 9 and 81 are "1001" and "1010001," respectively. Therefore, Fig. 3a shows that the index can be calculated as the sum of six numbers. To simplify the computation further, the third number can be expressed as an addition of two numbers: one that consists of its three most significant bits and one that consists of the rest. The former number's prime modulo can be computed separately and, according to (4), is equal to 9 times the value of those three bits.
Furthermore, some of the numbers can be added instantly to fill in the bits that have zero values. For example, the fourth and the fifth numbers are combined into a single number. The resulting numbers are shown in Fig. 3b. There are only five numbers (A through E) that need to be added, with up to 11 bits each.

Fig. 4 shows how the prime modulo hashing computation fits with the overall system when the index is computed from the full address (Fig. 4a) or from part of the address (Fig. 4b). Fig. 4a shows how the new index is computed from the addition of the five numbers (A through E in Fig. 3b). A and B are directly taken from the address bits, while C, D, and E are obtained by wired permutation of the tag part of the address. The sum of the five numbers is then fed into a subtract&select unit. In order to keep the sum of the five numbers (A through E) and the number of inputs into the selector small, (4) can be used in the intermediate additions by converting any intermediate carry out (in the 11th most significant position) into a 9 and adding it to the intermediate addition results. This ensures that the overall sum can only be slightly larger than 2,039 and that only two inputs would be needed in the selector.

KHARBUTLI ET AL.: ELIMINATING CONFLICT MISSES USING PRIME NUMBER-BASED CACHE INDEXING 577

Fig. 4b illustrates how the prime modulo indexing is integrated in a real system with two optimizations. For illustration purposes, the figure assumes a virtually indexed physically tagged L1 cache, a physically indexed and tagged L2 cache, and 4KB page sizes. A physical address consists of the Physical Page Index (PPI) and the page offset. In the first optimization, we overlap the prime modulo computation of the L2 cache index with L1 cache accesses. On each L1 cache access, the prime modulo L2 cache index is computed. If the access results in an L1 cache miss, the new L2 cache index has been computed and is ready for use.
This completely hides the prime modulo computation latency, at the expense of higher power consumption.
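The carry-folding step described for Fig. 4a (turning a carry out of the 11-bit adder into an added 9) can be modeled in Python. This is a simplified sketch, not the five-number A through E datapath of Fig. 3b: it adds the three terms of the index expression directly, and all names are hypothetical.

```python
BITS = 11                      # log2 of 2,048 physical sets
DELTA = 9                      # 2,048 - 2,039
N_SET = (1 << BITS) - DELTA    # 2,039
MASK = (1 << BITS) - 1

def fold_add(acc, term):
    """Add two values, folding anything above bit 10 back in as a multiple of 9.

    Valid because 2**11 is congruent to 9 modulo 2,039, so the result stays
    congruent to acc + term while fitting in 11 bits.
    """
    s = acc + term
    while s >> BITS:
        s = (s & MASK) + DELTA * (s >> BITS)
    return s

def l2_index_folded(block_addr):
    """index = x + 9*t1 + 81*t2 (mod 2,039) using folded additions."""
    x = block_addr & MASK
    t1 = (block_addr >> BITS) & MASK
    t2 = block_addr >> (2 * BITS)
    acc = fold_add(x, 9 * t1)
    acc = fold_add(acc, 81 * t2)
    # The folded sum is below 2,048, so a 2-input subtract&select suffices.
    return acc - N_SET if acc >= N_SET else acc

print(l2_index_folded(2**26 - 1))
```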
In the second optimization, we reduce the computational complexity and power consumption of the prime modulo computation by separating the prime modulo computation of the PPI and the block address. On a TLB miss, $PPI \bmod n_{set}$ of the missed page is computed and, along with the address translation information, stored in the new TLB entry. This computation is not in the critical path of the TLB access and does not require modifications to the operating system's page table. To compute the L2 cache index, we simply add the precomputed PPI modulo with the page offset bits that are not part of the L2 block offset. In the figure, only 6 bits of the page offset need to be added to the 11-bit precomputed PPI modulo, followed by a 2-input subtract&select operation to obtain the L2 cache index. This is a very simple operation that can probably be performed in much less than one clock cycle with small power consumption.
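The TLB optimization can be sketched as two small functions. One assumption deserves emphasis: for the final addition to yield the exact block-address modulo, the value cached in the TLB entry is taken here to be the modulo of the PPI's contribution to the block address, that is, (PPI · 64) mod 2,039 for 4KB pages and 64-byte blocks. The text only says "PPI mod n_set", so this is one possible reading. The names `tlb_fill` and `l2_index` are hypothetical, and Python's `%` stands in for the hardware prime modulo unit.

```python
N_SET = 2039       # prime number of L2 sets
BLOCK_BITS = 6     # 64-byte blocks
PAGE_BITS = 12     # 4KB pages

def tlb_fill(ppi):
    """Done once per TLB miss, off the critical path (assumed form)."""
    return (ppi << (PAGE_BITS - BLOCK_BITS)) % N_SET

def l2_index(ppi_mod, page_offset):
    """Done per access: one narrow add plus a 2-input subtract&select."""
    s = ppi_mod + (page_offset >> BLOCK_BITS)   # 11-bit value + 6-bit value
    return s - N_SET if s >= N_SET else s

phys_addr = 0x12345678
ppi, offset = phys_addr >> PAGE_BITS, phys_addr & ((1 << PAGE_BITS) - 1)
print(l2_index(tlb_fill(ppi), offset))
```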
The Odd-Multiplier Displacement Hashing Function
In the odd-multiplier displacement hashing, the traditional modulo is performed after an offset is added to the original index $x_i$. The offset is the product of a multiplier $p$ and the tag $T_i$:

$$H(a_i) = (p \cdot T_i + x_i) \bmod n_{set}.$$
This new hashing function is based on hashing functions in Aho and Ullman [1] and is related to Raghavan and Hayes's RANDOM-H functions [15] , with the main difference being their use of nonconstant offset, which results in not satisfying the sequence invariance property. Fig. 5 shows the hardware implementation used in odd-multiplier displacement hashing for a 32-bit machine. The figure uses 9 as the multiplier. The multiplication between the tag and the multiplier is converted into a simple addition operation. Therefore, the index can be calculated with a truncated addition of three 11-bit numbers. In general, we can minimize the number of narrow truncated additions by choosing p with few 1s in its binary representation.
One advantage of the odd-multiplier displacement hashing function compared to the prime modulo hashing function is that the complexity of calculating the cache index in the odd-multiplier displacement hashing function is mostly independent of the machine size. This makes it trivial to implement in machines with 64-bit or larger addressing.
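A minimal Python sketch of odd-multiplier displacement hashing follows, using the multiplier 9 and 2,048 sets from the hardware example. The function name is hypothetical, and the stride used in the demonstration is chosen to be the worst case for traditional indexing.

```python
def odisp_index(a, p=9, n_set=2048):
    """Odd-multiplier displacement: (p * T + x) mod n_set, n_set a power of two."""
    k = n_set.bit_length() - 1   # log2(n_set)
    T = a >> k                   # tag part of the block address
    x = a & (n_set - 1)          # traditional index bits
    return (p * T + x) & (n_set - 1)   # truncated addition

# Block addresses with stride 2,048 all collide under traditional indexing.
addrs = [i * 2048 for i in range(2048)]
traditional_sets = {a & 2047 for a in addrs}
odisp_sets = {odisp_index(a) for a in addrs}
print(len(traditional_sets), len(odisp_sets))  # 1 2048
```

Because the multiplier is odd and the number of sets is a power of two, multiplying the tag by it permutes the set indices, which is why this stride spreads over all sets.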
Choosing the Multiplier in Odd-Multiplier Displacement Hashing
An important issue in the odd-multiplier displacement hashing function is how to find a good multiplier. Since we found that most odd numbers can achieve an ideal balance for most stride amounts, we will focus on discussing the concentration of such numbers. Fig. 6 shows the concentration achieved using different multipliers for strided accesses with stride amounts ranging from 1 to 8. Only small stride amounts are considered because they are more common than larger ones. Each of the eight subfigures represents a different stride amount. The x-axis shows the different prime or odd-multipliers investigated. Discontinuity in the figure indicates a concentration value larger than 500. The figure shows that the concentration is usually better (lower) for multipliers. Some multipliers suffer from bad concentration for certain stride amounts. For example, 5 has a poor concentration when the stride amount is 3, while 3 has a poor concentration when the stride amount is 5. A good multiplier should not only give good average concentration across all stride amounts, but also ensure good concentration for small stride amounts because they are more common than large stride amounts in practice. We found that the following multipliers excel in both criteria: 9, 21, 31, and 61.
Comparison of Various Hashing Functions
Before we compare our hashing functions with existing ones, we briefly overview existing hashing functions. The traditional hashing function $H_{trad}$ is a very simple modulo-based hashing function. It can be expressed as $H_{trad}(a_i) = x_i$ or, equivalently, as $H_{trad}(a_i) = a_i \bmod n_{set\_phys}$. Pseudorandom hashing randomizes accesses to cache sets. Examples of pseudorandom hashing are XOR-based hashing functions, which are by far the most extensively studied [16], [17], [22], [23], [26], [12], [18], [4], [19]. We choose one of the most prominent examples:

$$H_{XOR}(a_i) = t_i \oplus x_i,$$

where $\oplus$ represents the bitwise exclusive OR operator.

In a skewed associative cache, the cache itself is divided into banks and each bank is accessed using a different hashing function. Here, cache blocks that are mapped to the same set in one bank are most likely not to map to the same set in the other banks. Seznec and Bodin propose using an XOR hashing function in each bank after a circular shift is performed to the bits in $t_i$ [18], [4], [19]. The number of circular shifts performed differs between banks. This results in a form of a perfect shuffle. We propose using odd-multiplier displacement hashing functions with a skewed cache. To ensure interbank dispersion, a different multiplier for each bank is used.

Footnote 4. Our earlier work [9] refers to it as Prime Displacement Indexing because we originally only considered prime numbers for $p$, but found out that some odd numbers performed well.

Qualitative Comparison

Table 2 compares the various hashing functions based on when the ideal balance is achieved, whether they satisfy the sequence invariance property, whether a simple hardware implementation exists, and whether they place restrictions on the replacement algorithm. The major disadvantage of the traditional hashing is that it achieves the ideal balance only when the stride amount $s$ is odd, where $\gcd(s, n_{set}) = 1$. When the ideal balance is satisfied, however, it achieves the ideal concentration because it satisfies the sequence invariance property. Note that, for the frequent case of a unit stride ($\pm 1$), it has an ideal balance and concentration. Thus, any hashing functions that achieve less than the ideal balance or concentration with unit strides are prone to exhibit a pathological behavior.
XOR achieves the ideal balance on most stride amounts $s$, but always has less than the ideal concentration because it does not satisfy the sequence invariance property. This is because the sequence of set accesses is never repeated due to XOR-ing with different values in each subsequence. Thus, it is prone to pathological behavior. There are various cases where the XOR hashing does not achieve the ideal balance. One case is when $s = n_{set} - 1$. For example, with $s = 15$ and $n_{set} = 16$ (as in a 4-way 4KB cache with 64-byte lines), it will access sets $0, 15, 15, 15, \ldots$. Not only that, a stride of 3 or 5 will also fail to achieve the ideal balance because they are factors of 15. This makes XOR hashing a particularly bad choice for indexing the L1 cache.
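The stride-15 example can be checked with a few lines of Python. The sketch assumes the XOR function combines the index bits with the first $\log_2 n_{set}$ tag bits, as defined for $x_i$ and $t_i$ earlier; the function name is hypothetical.

```python
def xor_index(a, n_set):
    """XOR-based index: low tag bits XOR index bits (n_set a power of two)."""
    k = n_set.bit_length() - 1
    x = a & (n_set - 1)          # index bits
    t = (a >> k) & (n_set - 1)   # first log2(n_set) bits of the tag
    return t ^ x

n_set = 16
trace = [xor_index(15 * i, n_set) for i in range(8)]
print(trace)  # [0, 15, 15, 15, 15, 15, 15, 15]
```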
The prime modulo hashing (pMod) achieves the ideal balance and concentration except for very few cases. The ideal balance is achieved because $\gcd(s, n_{set}) = 1$ except when $s$ is a multiple of $n_{set}$. If $a_i \equiv a_{i+d} \pmod{p}$ for some $d$, then $a_i + s \equiv a_{i+d} + s \pmod{p}$, for any $s$. Therefore, with prime modulo hashing, the ideal concentration is achieved because $H(a_i) = H(a_{i+d})$ implies that $H(a_{i+1}) = H(a_i + s) = H(a_{i+d} + s) = H(a_{i+d+1})$ with the stride amount $s$. This makes the prime modulo hashing an ideal hashing function. As we have shown in Section 3.1, fast hardware implementations exist.

[Table 2 caption fragment: ... [18], [4], [19], and Our Skewed Associative with Odd-Multiplier Displacement (Skewed + oDisp)]

The odd-multiplier displacement hashing (oDisp) achieves an ideal balance with even strides and most odd strides. Although it does not satisfy the sequence invariance property, the distance between two accesses to the same set is almost always constant. That is, for all but one set in a single subsequence, $H(a_i) = H(a_{i+x})$ implies $H(a_{i+1}) = H(a_{i+x+1})$. Furthermore, $x = n_{set} - p$, where $p$ is the multiplier. Thus, it partially satisfies the sequence invariance.
Skewed associative caches do not guarantee the ideal balance or concentration, whether XOR-based hashing (Skewed) or odd-multiplier displacement hashing (Skewed+oDisp) is used. However, probabilistically, the accesses will be quite balanced since an address can be mapped to a different place in each bank. A disadvantage of skewed associative caches is the fact that they make it difficult to implement a least recently used (LRU) replacement policy and force using pseudo-LRU policies. The nonideal balance and concentration, together with the use of a pseudo-LRU replacement policy, make a skewed associative cache prone to pathological behavior, although it works well on average. Later, we will show that skewed caches do degrade the performance of some applications. Finally, although not described in Table 2 , some other hashing functions, such as all XOR-based functions and random-h [15] , [16] , [17] , [20] , [8] , [7] , [26] , [12] , are not sequence invariant and, therefore, do not achieve the ideal concentration. Fig. 7 shows the balance values of the four different hashing functions using a synthetic benchmark that produces only strided access patterns. The stride size is varied from 1 to 128. The maximum balance and concentration displayed in the vertical axes in the figures are truncated at 10 and 2,000, respectively, for easy comparison. Note that, since small strides are more likely to appear in practice, they are more important than large strides. The balance values for the traditional and prime modulo hashing functions follow the discussion in Section 3.3. In particular, the traditional hashing function suffers from bad balance values with even strides, but achieves a perfect balance with odd strides. The prime modulo achieves perfect balance, except when the stride is equal to n set , which, in this case, is 2.039. XOR and odd-multiplier displacement hashing also have the ideal balance with most strides. 
Both have various cases in which the stride size causes nonideal balance. However, the odd-multiplier displacement hashing achieves ideal balance for the smaller strides, making it, in practice, superior to the XOR hashing function.
Fig. 8 shows the concentration for the same range of stride amounts. As expected, the traditional hashing function suffers from very bad concentration with even strides, but achieves perfect concentration with odd strides. The XOR and odd-multiplier displacement hashing functions also suffer from bad concentration for many strides. This is because odd-multiplier displacement hashing is only partially sequence invariant. However, note that odd-multiplier displacement hashing achieves better concentration on small stride amounts compared to XOR hashing. We can also expect that, in an application with only small stride amounts, the odd-multiplier displacement algorithm will perform well. Since the prime modulo hashing is sequence invariant, it achieves ideal concentration except for the stride equal to $n_{set}$. Hence, for strided accesses, we can expect the prime modulo hashing to have the best performance among the four hashing functions.
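The stride behavior of the traditional and prime modulo functions can be illustrated with a small sketch. It does not compute the balance metric of Section 3.3; as a simpler proxy, it counts how many distinct sets a strided sequence of block addresses reaches. The set counts (2,048 physical sets and the prime 2,039) and the function name `sets_touched` are assumptions made for this illustration.

```python
from math import gcd

N_PHYS = 2048   # assumed number of physical sets (a power of two)
N_PRIME = 2039  # assumed prime number of sets (largest prime below 2048)

def sets_touched(stride, n_set, n_accesses):
    """Count distinct sets hit by block addresses 0, stride, 2*stride, ..."""
    return len({(i * stride) % n_set for i in range(n_accesses)})

for stride in (1, 2, 3, 8, 64):
    trad = sets_touched(stride, N_PHYS, N_PHYS)
    pmod = sets_touched(stride, N_PRIME, N_PRIME)
    # A strided sequence modulo n_set reaches n_set / gcd(stride, n_set) sets.
    assert trad == N_PHYS // gcd(stride, N_PHYS)
    assert pmod == N_PRIME  # every stride below the prime reaches all sets
    print(f"stride {stride:3d}: traditional {trad:4d} sets, prime modulo {pmod:4d} sets")
```

With a power-of-two number of sets, every even stride shares a factor with the set count and so leaves some sets unused, while odd strides reach all sets. With a prime number of sets, only strides that are multiples of the prime collapse onto a single set.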
Balance and Concentration
More importantly, the prime modulo hashing also achieves ideal concentration with odd strides, in the same way as the traditional hashing. Hence, we can expect that the prime modulo hashing is resistant to pathological behavior. The XOR hashing function may outperform the traditional hashing on average. However, it cannot match the ideal concentration of the traditional hashing with odd strides. Thus, it is prone to pathological behavior.
EVALUATION ENVIRONMENT
Applications. To evaluate the prime hashing functions, we use 23 memory-intensive applications from various sources: bzip2, gap, mcf, and parser from SPECint2000 [21]; applu, mgrid, swim, equake, and tomcatv from SPECfp2000 and SPECfp95 [21]; mst from Olden; bt, ft, lu, is, sp, and cg from NAS [13]; sparse from Sparsebench [6]; and tree from the University of Hawaii [3]. Irr is an iterative PDE solver used in CFD applications. Charmm is a well-known molecular dynamics code and moldyn is its kernel. Nbf is a kernel from the GROMOS molecular dynamics benchmarks, and euler is a 3D Euler equation solver from NASA.
We categorize the applications into two groups: a group where the histogram of the number of accesses to different sets is uniform and a group where it is not uniform. Let $f_1, f_2, \ldots, f_{n_{set}}$ represent the frequency of accesses to the sets $1, 2, \ldots, n_{set}$ in the L2 cache. An application is considered to have a nonuniform cache access behavior if the ratio $\mathrm{stdev}(f_i)/\bar{f}$ is greater than 0.5. Applications with nonuniform cache set accesses likely suffer from conflict misses and, hence, alternative hashing functions are expected to speed them up.
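The classification rule above can be sketched as follows. The function name `is_nonuniform` is hypothetical, and the use of the population standard deviation (rather than the sample standard deviation) is an assumption, since the text does not say which one is meant.

```python
from statistics import mean, pstdev

def is_nonuniform(set_access_counts, threshold=0.5):
    """Return True when stdev(f_i) / mean(f) exceeds the threshold.

    set_access_counts holds one access count per cache set. The population
    standard deviation is an assumption of this sketch.
    """
    avg = mean(set_access_counts)
    if avg == 0:
        return False  # no accesses at all: treat as uniform
    return pstdev(set_access_counts) / avg > threshold

# Hypothetical per-set access counts, for illustration only.
print(is_nonuniform([100, 100, 100, 100, 100, 100, 100, 100]))  # evenly spread
print(is_nonuniform([800, 0, 0, 0, 0, 0, 0, 0]))                # one hot set
```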
Among the 23 applications, we found that 30 percent of them (seven benchmarks) are nonuniform: bt, cg, ft, irr, mcf, sp, and tree. The L2 cache miss rates for all the applications are summarized in Table 3 .
Simulation Environment. The evaluation is performed using a cycle-accurate execution-driven simulation environment that supports a dynamic superscalar processor model [10]. Table 4 shows the parameters used for each component of the architecture. The latency of the prime modulo and odd-multiplier displacement computation is assumed to be less than three cycles and, therefore, can be fully overlapped with the L1 cache access latency and does not add to the L2 cache access time (Section 3.1.1).
Prime Numbers. The prime modulo function uses the prime number shown in Table 1. The odd-multiplier displacement function uses the multiplier 9 when it is used as a single hashing function. When it is used in conjunction with a skewed associative cache, the multipliers for the four cache banks are 9, 19, 31, and 37.
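As a rough software model of the index functions compared here, the sketch below treats an address as a block address and splits it into tag and index with a power-of-two set count. The set count of 2,048, the prime 2,039, and all function names are assumptions for illustration; the hardware described in Section 3.1 uses narrow additions rather than the division and multiplication written here.

```python
N_SET = 2048                        # assumed number of physical sets
P_MOD_PRIME = 2039                  # assumed prime number of sets
SKEW_MULTIPLIERS = (9, 19, 31, 37)  # per-bank multipliers listed in the text

def traditional_index(block_addr, n_set=N_SET):
    """Traditional hashing: the low-order index bits select the set."""
    return block_addr % n_set

def pmod_index(block_addr, prime=P_MOD_PRIME):
    """Prime modulo hashing: block address modulo a prime number of sets."""
    return block_addr % prime

def odisp_index(block_addr, multiplier=9, n_set=N_SET):
    """Odd-multiplier displacement: (multiplier * tag + index) mod n_set."""
    tag, index = divmod(block_addr, n_set)
    return (multiplier * tag + index) % n_set

def skewed_odisp_indices(block_addr):
    """One odd-multiplier displacement index per bank of a 4-bank skewed cache."""
    return tuple(odisp_index(block_addr, m) for m in SKEW_MULTIPLIERS)

# Blocks that are n_set apart collide under traditional hashing,
# but are spread out by the two proposed functions.
for addr in (0, 2048, 4096):
    print(addr, traditional_index(addr), pmod_index(addr), odisp_index(addr))
```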
EVALUATION
In this section, we present and discuss five sets of evaluation results. We present the impact of using different single hashing functions on the various types of cache misses (Section 5.1) and on the cache miss distribution across the cache sets (Section 5.2). Section 5.3 presents the impact of using multiple hashing functions in conjunction with a skewed associative L2 cache on the various types of cache misses. Section 5.4 shows the performance gain achieved using the different hashing functions for the 23 applications. Finally, Section 5.5 shows the impact of our prime hashing functions on the execution time for various cache configurations.
Single Hashing Function Schemes
Figs. 9 and 10 show the normalized number of cache misses in each application with nonuniform cache accesses and uniform cache accesses, respectively. Each figure compares the number of cache misses with different hashing functions: a traditional hashing function with 4-way associative L2 cache (Base), a traditional hashing function with an 8-way associative same-size L2 cache (8-way), the XOR hashing function (XOR), the prime modulo hashing function (pMod), and the odd-multiplier displacement hashing function (oDisp), as described in Section 3.3. The number of misses in each case is normalized to Base. All bars are divided into: cold misses (Cold), capacity misses (Capacity), and conflict misses (Conflict). The cold misses are computed as the sum of the first miss to each line that is accessed. The conflict misses are calculated as the number of misses eliminated if a fully associative L2 cache is used. The rest of the misses are categorized as capacity misses. Fig. 9 shows that, for nonuniform applications, conflict misses account for roughly 50 percent of the total L2 cache misses. Fig. 10 shows that even uniform applications suffer from a noticeable amount of conflict misses, averaging 17 percent of the total L2 cache misses.
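The three-way miss classification described above can be sketched with a small trace-driven model. This is a minimal illustration, not the simulator used in the evaluation: it assumes true LRU replacement, traditional indexing by block address, and a fully associative cache of the same total capacity as the reference point. The function names are hypothetical.

```python
from collections import OrderedDict

def count_misses(trace, n_sets, ways):
    """Misses of an LRU cache with n_sets sets of `ways` lines each."""
    sets = [OrderedDict() for _ in range(n_sets)]
    misses = 0
    for block in trace:
        lines = sets[block % n_sets]
        if block in lines:
            lines.move_to_end(block)       # hit: mark as most recently used
        else:
            misses += 1
            if len(lines) >= ways:
                lines.popitem(last=False)  # evict the least recently used line
            lines[block] = True
    return misses

def classify_misses(trace, n_sets, ways):
    """Split misses into cold, capacity, and conflict components."""
    total = count_misses(trace, n_sets, ways)
    cold = len(set(trace))                               # first miss to each line
    fully_assoc = count_misses(trace, 1, n_sets * ways)  # same capacity, one set
    conflict = total - fully_assoc   # misses removed by full associativity
    capacity = total - cold - conflict
    return {"cold": cold, "capacity": capacity, "conflict": conflict}

# Three blocks that all map to set 0 of a 4-set, 2-way cache.
print(classify_misses([0, 4, 8, 0, 4, 8], n_sets=4, ways=2))
```

In the example trace, the three blocks thrash a single 2-way set even though the cache has room for eight lines, so every non-cold miss is counted as a conflict miss.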
For nonuniform applications, Fig. 9 shows that increasing the L2 cache associativity to 8-way only reduces the number of misses marginally (10 percent on average). This is mainly because doubling the associativity while keeping the same cache size halves the number of sets. In turn, this doubles the number of addresses mapped to a set. Thus, increasing cache associativity without increasing the cache size is not an effective method to eliminate conflict misses. Comparing the alternative hashing functions, pMod and oDisp perform the best, eliminating a large majority of L2 conflict misses (70 percent on average), contributing to a 35 percent total L2 cache miss reduction. XOR is much less effective, on average only eliminating 25 percent of the L2 cache misses. Although XOR is certainly better than Base and 8-way, its nonideal balance for small strides and its nonideal concentration hurt its performance.
(Table note: Latencies correspond to contention-free conditions. RT stands for round-trip from the processor.)
For applications that have uniform cache accesses in Fig. 10, the same observation generally holds. pMod and oDisp noticeably reduce the number of conflict misses in charmm, euler, and mst.
Cache Misses Distribution
Fig. 11 shows the cache miss distribution over the cache sets for the nonuniform applications using three different hashing functions: traditional (Base), prime modulo (pMod), and odd-multiplier displacement (oDisp). The x-axis represents the cache sets, while the y-axis represents the number of cache misses associated with that set. The figure shows how the distribution changes as the result of applying the pMod and oDisp hashing functions.
The figure shows that bt and tree have the worst distribution of cache misses using traditional hashing and exhibit a much better distribution and, consequently, a large reduction in the number of misses when pMod or oDisp hashing is used. For tree, with the traditional hashing function, the vast majority of cache misses are concentrated in about 10 percent of the sets. This is due to an unbalanced distribution of cache accesses, causing some cache sets to be overutilized and, thus, to suffer from many conflict misses. By distributing the accesses more uniformly across the sets, pMod and oDisp are able to eliminate most of those misses. Other applications (mcf, sp, and ft) also have a bad distribution when traditional hashing is used, but significantly improved distributions when pMod or oDisp hashing functions are used.
Multiple Hashing Functions Schemes
Figs. 12 and 13 show the normalized number of misses in applications with nonuniform cache accesses and uniform cache accesses, respectively, when multiple hashing functions are used. Each figure compares the number of misses with different hashing functions: a traditional hashing function with 4-way associative L2 cache (Base), the prime modulo hashing that is the best single hashing function from Section 5.1 (pMod), the XOR-based skewed associative cache proposed by Seznec [19] (SKW), and the skewed associative cache with the odd-multiplier displacement function that we propose (skw+oDisp) as described in Section 3.3. The number of misses in each case is normalized to Base.
The skewed associative caches (SKW and skw+oDisp) are based on Seznec's design that uses four direct-mapped cache banks. The replacement policy is called Enhanced Not Recently Used (ENRU) [19] . The only difference between SKW and skw+oDisp is the hashing function used (XOR versus odd-multiplier displacement). We have also tried a different replacement policy, called NRUNRW (Not Recently Used Not Recently Written) [18] . We found that it gives similar results.
Figs. 12 and 13 show that pMod sometimes outperforms and is sometimes outperformed by SKW and skw+oDisp. Interestingly, when the skewed caches (SKW and skw+oDisp) outperform pMod, it is mostly due to a reduction in capacity misses rather than in conflict misses (see cg and mst). Indeed, in terms of conflict miss reduction, pMod appears to be superior. It is more effective in eliminating conflict misses in ft, mcf, charmm, and euler, but is less effective in bt. However, pMod consistently reduces or maintains the conflict misses, and never increases them. In contrast, SKW and skw+oDisp actually increase the conflict misses in seven applications (bzip2, charmm, is, mgrid, parser, sparse, and irr) by up to 19 percent. On average, however, skw+oDisp performs the best, followed by SKW and then followed closely by pMod. In terms of average execution time speedups over Base on nonuniform applications, skw+oDisp achieves 1.35, followed by SKW (1.31) and pMod (1.27), as we will see in Section 5.4. The extra performance of skewed caches for the applications with nonuniform accesses, however, comes at the expense of having pathological behavior in applications that have uniform accesses (Fig. 13).
Performance Improvement of the Different Schemes
Table 5 shows the performance improvement achieved for both the uniform and nonuniform applications using the five hashing functions: XOR, pMod, oDisp, SKW, and skw+oDisp. Speedups are calculated as the execution time improvement over traditional hashing (Base). In terms of average speedup, our skw+oDisp achieves the highest average (1.35) for nonuniform applications. SKW achieves an average speedup of 1.31, closely followed by pMod and oDisp, which achieve an average speedup of 1.27 each. The five hashing functions achieve their maximum improvement for tree. skw+oDisp achieves a speedup of 2.63. SKW, pMod, and oDisp achieve comparable speedups of 2.55, 2.34, and 2.32, respectively.
The higher performance gain achieved by the skewed caches comes at the expense of some pathological behavior in uniform applications. SKW slows down four applications (bzip2, charmm, parser, and sparse), by up to 9 percent in the case of sparse. Our skw+oDisp also slows down four applications (bzip2, mgrid, parser, and sparse), by up to 7 percent in the case of sparse, but achieves better average performance than SKW for both uniform and nonuniform applications. pMod only slows down one application (sparse) by 2 percent, while our proposed oDisp does not slow down any of the 23 applications.
Sensitivity to Cache Parameters
This section evaluates the impact of using the prime hashing functions on the execution times of applications using different cache configurations. We choose only the best single hashing function (pMod) and the best multiple hashing functions (skw+oDisp). Tables 6 and 7 summarize the speedups on various cache configurations for pMod and skw+oDisp over Base, respectively. The cache size was varied from 256 KBytes to 1,024 KBytes while keeping the associativity constant at 4-way. The cache associativity was also varied from 2-way to 8-way while keeping the cache size constant at 512 KBytes. Both tables present the minimum, average, and maximum speedups obtained for both uniform and nonuniform applications, and show the number of pathological cases that result. Pathological cases are defined as a slowdown of more than 1 percent compared to Base.
Table 6 shows that pMod produces good speedups for nonuniform applications on all cache configurations, with an average ranging from 1.15 to 1.27. These speedups are achieved without slowing down the uniform applications by more than 3 percent. As a result, the number of applications that exhibit pathological behavior is very limited (at most one case).
Table 7 shows that skw+oDisp produces higher speedups for nonuniform applications on all cache configurations, with an average ranging from 1.19 to 1.35. These average speedups are consistently higher than those of pMod, due to the ability of skw+oDisp to eliminate some capacity misses. However, the higher average speedups come at the expense of having many more pathological cases, mostly on uniform applications. Some applications exhibit a nontrivial slowdown of up to 10 percent due to an increase in conflict misses. Consequently, the number of applications that exhibit pathological behavior is significantly higher, ranging from three to six.
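The summary statistics reported in these tables can be sketched as follows. Several details are assumptions of this sketch rather than statements from the text: speedup is taken as base time divided by scheme time, "a slowdown of more than 1 percent" is read as a speedup below 0.99, and the average is an arithmetic mean. The function names and the execution times are hypothetical.

```python
def speedup(base_time, scheme_time):
    """Speedup over Base, assumed here to be base_time / scheme_time."""
    return base_time / scheme_time

def is_pathological(base_time, scheme_time, tolerance=0.01):
    """True when the scheme is slower than Base by more than the tolerance."""
    return speedup(base_time, scheme_time) < 1.0 - tolerance

def summarize(times):
    """times maps application -> (base_time, scheme_time).

    Returns (min speedup, mean speedup, max speedup, pathological count).
    """
    speedups = [speedup(b, s) for b, s in times.values()]
    pathological = sum(is_pathological(b, s) for b, s in times.values())
    return min(speedups), sum(speedups) / len(speedups), max(speedups), pathological

# Hypothetical execution times, for illustration only (not measured data).
example = {"app_a": (100.0, 50.0), "app_b": (100.0, 100.0), "app_c": (100.0, 110.0)}
print(summarize(example))
```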
RELATED WORK
Prior studies showed that alternative cache indexing/hashing functions are effective in reducing conflicts by achieving a more uniform access distribution across the L1 cache sets [18], [4], [23], [22], L2 cache sets [19], or the main memory banks [15], [11], [16], [17], [20], [8], [7], [26], [12]. Most of the prior hashing functions permute the accesses using some form of XOR operations. We found that XOR operations typically do not achieve an ideal concentration that is critical to avoiding pathological behavior under strided access patterns.
Although prime-based hashing has been proposed for software hash tables, as in [1] , its use in the memory subsystem has been very limited due to its hardware complexity that involves true integer division operations and fragmentation problems. Budnick and Kuck first suggested using a prime number of memory banks in parallel computers [5] , which was later developed into the Burroughs Scientific Processor [11] . Yang and Yang proposed using Mersenne prime modulo hashing for cache indexing for vector computation [25] . Since Mersenne prime numbers are sparse, e.g.,
$31\ (2^5 - 1),\ 127\ (2^7 - 1),\ 8{,}191\ (2^{13} - 1),\ 131{,}071\ (2^{17} - 1),\ \ldots,$
using them significantly restricts the number of cache sets that can be implemented. We derive a more general solution that does not assume Mersenne prime numbers and show that prime modulo hashing can be implemented fast on any number of cache sets. In addition, we present an odd-multiplier displacement hashing function that achieves comparable performance and robustness to the prime modulo hashing function.
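To illustrate why restricting the design to Mersenne primes is limiting, the sketch below finds, for each power-of-two physical set count, the largest prime below it and the fraction of sets left unused. Measuring fragmentation as (physical sets minus prime) divided by physical sets is an assumption of this sketch, and the helper names are hypothetical.

```python
def is_prime(n):
    """Trial-division primality test, adequate for small set counts."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

def largest_prime_below(limit):
    """Largest prime strictly less than limit (requires limit > 2)."""
    n = limit - 1
    while not is_prime(n):
        n -= 1
    return n

for k in range(5, 18):
    physical = 1 << k
    prime = largest_prime_below(physical)
    wasted = (physical - prime) / physical
    tag = "Mersenne" if prime == physical - 1 else ""
    print(f"2^{k:<2d} = {physical:6d}  prime = {prime:6d}  unused = {wasted:.3%}  {tag}")
```

Only a few exponents in this range yield a Mersenne prime, whereas a prime close to the physical set count exists for every size, which is consistent with the small fragmentation claimed for the general prime modulo scheme.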
Compiler optimizations have also targeted conflict misses by padding the data structures of a program. One example is the work by Bacon et al. [2] , who tried to find the optimal padding amount to reduce conflict misses in caches and TLBs in a loop nest. They tried to spread cache misses uniformly across loop iterations based on profiling information. Since the conflict behavior is often input dependent and determined at runtime, their approach has limited applicability.
CONCLUSIONS
Even though using alternative cache indexing/hashing functions is a popular technique to reduce conflict misses by achieving a more uniform cache access distribution across the sets in the cache, no prior study has really analyzed the pathological behavior of such hashing functions, which often results in performance degradation.
We presented an in-depth analysis of the pathological behavior of hashing functions and proposed two new hashing functions for the L2 cache that are resistant to the pathological behavior and yet are able to eliminate the worst-case conflict behavior. The prime modulo hashing uses a prime number of sets in the cache, while the odd-multiplier displacement hashing adds an offset that is equal to an odd/prime number multiplied by the tag bits to the index bits of an address to obtain a new cache index. These hashing techniques can be implemented in fast hardware that uses a set of narrow add operations in place of true integer division and multiplication. This implementation has negligible fragmentation for the L2 cache. We evaluated our techniques with 23 applications from various sources. For applications that have nonuniform cache accesses, both the prime modulo and odd-multiplier displacement hashing achieve an average speedup of 1.27, practically without slowing down any of the 23 benchmarks.
Although it lacks the theoretical superiority of the prime modulo, the odd-multiplier displacement hashing performs just as well in practice when the multiplier is carefully selected. In addition, the odd-multiplier displacement hashing can easily be used in conjunction with a skewed associative cache, which uses multiple hashing functions to further distribute the cache accesses across the sets in the cache. The odd-multiplier displacement hashing outperforms XOR-based hashing used in prior skewed associative caches. Although the skewed associative L2 cache with odd-multiplier displacement hashing eliminates fewer conflict misses than single hashing functions, it outperforms single hashing functions because it eliminates some capacity misses. It shows an average speedup of 1.35 for applications that have nonuniform cache accesses. However, it introduces some pathological behavior that slows down four applications by up to 7 percent. Therefore, an L2 cache with our prime modulo or odd-multiplier displacement hashing functions is a promising alternative.
Yan Solihin received the BS degree in computer science from Institut Teknologi Bandung in 1995, the MASc degree in computer engineering from Nanyang Technological University in 1997, and the MS and PhD degrees in computer science from the University of Illinois at Urbana-Champaign in 1999 and 2002. He is currently an assistant professor in the Department of Electrical and Computer Engineering at North Carolina State University. From 1999 to 2000, he was on an internship with the Parallel Architecture and Performance Team at Los Alamos National Laboratory. He has published more than 25 papers in computer architecture and image processing which cover chip multiprocessor systems, processing-in-memory architectures, performance modeling, and architecture support for security and software reliability. He has released Scaltool, a software for pinpointing parallel program scalability bottlenecks, and Fodex, a forensic document examination toolset. He was a recipient of a 2004 US National Science Foundation Faculty Early Career Award and 1997 AT&T Leadership Award. He is a member of the IEEE. More information can be found at http://www.cesr.ncsu.edu/solihin.
Jaejin Lee received the BS degree in physics from Seoul National University in 1991, the MS degree in computer science from Stanford University in 1995, and the PhD degree in computer science from the University of Illinois at Urbana-Champaign in 1999. He is an associate professor in the School of Computer Science and Engineering at Seoul National University, Korea, where he has been a faculty member since September 2002. Before joining Seoul National University, he was an assistant professor in the Computer Science and Engineering Department at Michigan State University. His research interests include compilers, computer architectures, and embedded computer systems. He is a member of the IEEE and the ACM. More information can be found at http://aces.snu.ac.kr/jlee.
