Using Prime Numbers for Cache Indexing to Eliminate Conflict Misses, HPCA by Mazen Kharbutli et al.
Using Prime Numbers for Cache Indexing
to Eliminate Conﬂict Misses
Mazen Kharbutli, Keith Irwin, Yan Solihin
Dept. of Electrical and Computer Engineering
North Carolina State University
{mmkharbu,kirwin,solihin}@eos.ncsu.edu

Jaejin Lee
School of Computer Science and Engineering
Seoul National University
jlee@cse.snu.ac.kr
Abstract
Using alternative cache indexing/hashing functions is a popular
technique to reduce conﬂict misses by achieving a more uniform
cache access distribution across the sets in the cache. Although
various alternative hashing functions have been demonstrated to
eliminate the worst case conﬂict behavior, no study has really an-
alyzed the pathological behavior of such hashing functions that
often result in performance slowdown. In this paper, we present
an in-depth analysis of the pathological behavior of cache hash-
ing functions. Based on the analysis, we propose two new hash-
ing functions: prime modulo and prime displacement that are re-
sistant to pathological behavior and yet are able to eliminate the
worst case conﬂict behavior in the L2 cache. We show that these
two schemes can be implemented in fast hardware using a set of
narrow add operations, with negligible fragmentation in the L2
cache. We evaluate the schemes on 23 memory intensive appli-
cations. For applications that have non-uniform cache accesses,
both prime modulo and prime displacement hashing achieve an
average speedup of 1.27 compared to traditional hashing, with-
out slowing down any of the 23 benchmarks. We also evaluate
using multiple prime displacement hashing functions in conjunc-
tion with a skewed associative L2 cache. The skewed associa-
tive cache achieves a better average speedup at the cost of some
pathological behavior that slows down four applications by up to
7%.
1. Introduction
Despite the relatively large size and high associativity of
the L2 cache, conﬂict misses are a signiﬁcant performance
bottleneck in many applications. Alternative cache index-
ing/hashing functions are used to reduce such conﬂicts by
achieving a more uniform access distribution across the L1
cache sets [18, 4, 22, 23], L2 cache sets [19], or the main
memory banks [14, 10, 15, 16, 20, 8, 7, 26, 11]. Although
various alternative hashing functions have been demonstrated to
eliminate the worst case conflict behavior, few studies, if any,
have analyzed the pathological behavior of such hashing functions,
which often results in performance degradation.

* This work was supported in part by North Carolina State University,
Seoul National University, the Korean Ministry of Education under
the BK21 program, and the Korean Ministry of Science and Technology
under the National Research Laboratory program.

This paper presents an in-depth analysis of the pathologi-
cal behavior of hashing functions and proposes two hash-
ing functions: prime modulo and prime displacement that
are resistant to the pathological behavior and yet are able to
eliminate the worst case conﬂict misses for the L2 cache.
The number of cache sets in the prime modulo hashing
is a prime number, while the prime displacement hashing
adds an offset, equal to a prime number multiplied by the
tag bits, to the index bits of an address to obtain a new
cache index. The prime modulo hashing has been used in
software hash tables [1] and in the Burroughs Scientific
Processor [10]. A fast implementation of prime modulo
hashing has only been proposed for Mersenne prime numbers [25].
Since Mersenne prime numbers are sparse (i.e., for most exponents
$k$, $2^k - 1$ is not prime), using Mersenne prime numbers
significantly restricts the number of cache sets that can be
implemented.
We present an implementation that solves the three main
drawbacks of prime modulo hashing when it is directly
applied to cache indexing. First, we present a fast hard-
ware mechanism that uses a set of narrow add operations
in place of a true integer division. Secondly, by applying
the prime modulo to the L2 cache, the fragmentation that
results from not fully utilizing a power of two number of
cache sets becomes negligible. Finally, we show an imple-
mentation where the latency of the prime modulo compu-
tation can be hidden by performing it in parallel with L1
accesses or by caching the partial computation in the TLB.
We show that the prime modulo hashing has properties that
make it resistant to the pathological behavior that plagues
other alternative hashing functions, while at the same time
enable it to eliminate worst case conﬂict misses.
Although the prime displacement hashing lacks the the-
oretical superiority of the prime modulo hashing, it can
perform just as well in practice when the prime number is
carefully selected. In addition, the prime displacement
hashing can easily be used in conjunction with a skewed
associative cache that uses multiple hashing functions to
further distribute the cache set accesses. In this case, a
unique prime number is used for each cache bank.
We use 23 memory intensive applications from various
sources, and categorize them into two classes: one with
non-uniform cache set accesses and the other with uniform
cache set accesses. We found that on the applications with
non-uniform accesses, the prime modulo hashing achieves
an average speedup of 1.27. It does not slow down any of
the 23 applications except in one case by only 2%. The
prime displacement hashing achieves almost identical performance
to the prime modulo hashing but without slowing down any of the
23 benchmarks. Both methods outperform an XOR based indexing
function, which obtains an
average speedup of 1.21 on applications with non-uniform
cache set accesses.
Using multiple hashing functions, skewed associative
caches sometimes are able to eliminate more misses than
using a single hashing function. However, we found that
the use of a skewed associative cache introduces some
pathological behavior that slows down some applications.
A skewed associative cache with the XOR-based hash-
ing [18, 4, 19] obtains an average speedup of 1.31 for
benchmarks with non-uniform accesses, but slows down
four applications by up to 9%. However, when using the
prime displacement hashing that we propose in conjunc-
tion with a skewed associative cache, the average speedup
improves to 1.35, and the worst case slowdown drops to
7%.
The rest of the paper is organized as follows. Section 2 discusses
the metrics and ideal properties of a hashing function, which help
in understanding the pathological behavior. Section 3 discusses the
proposed hashing functions:
prime modulo and prime displacement and their imple-
mentations. Section 4 describes the evaluation environ-
ment while Section 5 discusses the results obtained. Sec-
tion 6 lists the related work. Finally, Section 7 concludes
the paper.
2. Properties of Hashing Functions
In this section, we will describe two metrics that are
helpful in understanding pathological behavior of hashing
functions (Section 2.1), and two properties that a hashing
function must have in order to avoid the pathological be-
havior (Section 2.2).
2.1. Metrics
The ﬁrst metric that estimates the degree of the patholog-
ical behavior is balance, which describes how evenly dis-
tributed the addresses are over the cache sets. The other
is concentration, which measures how evenly the sets are
used over small intervals of accesses.
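[Figure 1 diagram: from the most to the least significant bits, an
address consists of the tag $T_i$, whose low-order $\log_2(n_{set})$
bits are $t_i$; the index $x_i$ ($\log_2(n_{set})$ bits); and the
block offset ($\log_2(blkSize)$ bits). The block address $a_i$
comprises the tag and the index.]
Figure 1: Components of a block address $a_i$.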
Let $n_{set}$ be the number of sets in the cache that is a
power of two. This implies that $\log_2(n_{set})$ bits from the
address are used to obtain a set index. Let a sequence
$a_0, a_1, \ldots, a_{N-1}$ of block addresses $a_i$ denote $N$ cache
accesses. Suppose $a_i \neq a_j$ for $i \neq j$, where
$0 \le i, j < N$, implying that each block address in the sequence
is unique. Thus, the address sequence does not have any temporal
reuse (we will return to this issue later). Let $H$ be a hashing
function that maps each block address $a_i$ to set $H(a_i)$ in the
cache. We denote the $\log_2(n_{set})$ index bits of $a_i$ with
$x_i$ and the first $\log_2(n_{set})$ bits of the tag of $a_i$ with
$t_i$, as shown in Figure 1.
Balance describes how evenly distributed the addresses
are over the sets in the cache. When good balance is not
achieved, the hashing function would be ineffective and
would cause conﬂict misses. To measure the balance we
use a formula suggested by Aho, et al. [1]:
$$\text{balance} = \frac{\sum_{j=1}^{n_{set}} b_j (b_j + 1)/2}{\frac{N}{2 n_{set}} \left(N + 2 n_{set} - 1\right)} \qquad (1)$$

where $b_j$ represents the total number of addresses that are
mapped to set $j$. $b_j (b_j + 1)/2$ represents the weight of the
set $j$, equivalent to $1 + 2 + \cdots + b_j$. Thus, a set that has
more addresses will have a larger weight. The numerator,
$\sum_{j=1}^{n_{set}} b_j (b_j + 1)/2$, represents the sum of the
weights of all sets. The denominator,
$\frac{N}{2 n_{set}} (N + 2 n_{set} - 1)$, represents the sum of the
weights of all sets, but assuming a perfectly random address
distribution across all sets. Thus, a lower balance value, with an
ideal value of 1, represents
better address distribution across the sets.
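
As an illustration of Equation 1, the following minimal Python sketch computes the balance of a hashing function over a list of distinct block addresses. The function name `balance`, the 16-set cache, and the two strided address sequences are assumptions chosen for the example, not values from the paper.

```python
from collections import Counter

def balance(addresses, hash_fn, n_set):
    """Balance metric of Equation 1 for a list of distinct block addresses."""
    n = len(addresses)
    counts = Counter(hash_fn(a) for a in addresses)
    # Weight of a set holding b addresses: 1 + 2 + ... + b = b(b+1)/2.
    numerator = sum(b * (b + 1) / 2 for b in counts.values())
    # Sum of the weights expected under a perfectly random distribution.
    denominator = (n / (2 * n_set)) * (n + 2 * n_set - 1)
    return numerator / denominator

N_SET = 16                                  # hypothetical number of sets
traditional = lambda a: a % N_SET           # traditional power-of-two indexing
unit_stride = list(range(1600))             # stride of 1
even_stride = list(range(0, 3200, 2))       # stride of 2, same number of accesses

print(balance(unit_stride, traditional, N_SET))
print(balance(even_stride, traditional, N_SET))
```

Under these assumptions the unit stride yields a balance close to 1, while the stride of 2 uses only half of the sets and yields a value close to 2.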
Concentration is a less straightforward measure and is intended
to measure how evenly the sets are used over small intervals of
accesses. It is possible to achieve the ideal balance for the
entire address sequence $a_0, a_1, \ldots, a_{N-1}$, yet conflicts
can occur if on smaller intervals the balance is not achieved.

To measure concentration, we calculate the distance $d_i$ as the
number of accesses to the cache that occur between two accesses to
a particular set and calculate the standard deviation of these
distances. More formally, $d_i$ for an address $a_i$ is the smallest
positive integer such that $H(a_i) = H(a_{i+d_i})$. The
concentration is equal to the standard deviation of the $d_i$'s.
Noting that in the ideal case, and with the balance of 1, the
average of the $d_i$'s is necessarily $n_{set}$, our formula is

$$\text{concentration} = \sqrt{\frac{\sum_{i=0}^{N-1} (d_i - n_{set})^2}{N}} \qquad (2)$$

Using standard deviation penalizes a hashing function not only for
re-accessing a set after a small time period since its last access
($d_i < n_{set}$), but also for a large time period
($d_i > n_{set}$). Similar to the balance, the smaller the
concentration is, the better the hashing function is. The
concentration of an ideal hashing function is zero.
In general, alternative hashing functions have mostly tar-
geted the ideal balance, but not the ideal concentration.
Achieving good concentration is vital to avoid the patho-
logical behavior for applications with high temporal local-
ity. Assume that a set receives a burst of accesses that con-
sist of distinct addresses, then, the set suffers from conﬂict
misses temporarily. If one of the addresses has temporal
reuse, it may have been replaced from the cache by the
time it is re-accessed, creating conﬂict misses.
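
The concentration metric can be sketched in the same way. The Python below is a simplified variant and not the paper's exact procedure: it assumes that only accesses with a later access to the same set contribute a distance, and it divides by the number of such distances rather than by $N$. The 16-set cache and the XOR hash (low tag bits XOR index bits) are illustrative assumptions.

```python
import math

def concentration(addresses, hash_fn, n_set):
    """Root-mean-square deviation of same-set reuse distances from n_set."""
    last_seen = {}      # set index -> position of its most recent access
    distances = []
    for position, addr in enumerate(addresses):
        s = hash_fn(addr)
        if s in last_seen:
            distances.append(position - last_seen[s])
        last_seen[s] = position
    if not distances:
        return 0.0
    return math.sqrt(sum((d - n_set) ** 2 for d in distances) / len(distances))

N_SET = 16                                   # hypothetical number of sets

def traditional(a):
    return a % N_SET

def xor_hash(a):
    x = a & (N_SET - 1)                      # index bits
    t = (a >> 4) & (N_SET - 1)               # low-order tag bits
    return t ^ x

unit_stride = list(range(160))
print(concentration(unit_stride, traditional, N_SET))
print(concentration(unit_stride, xor_hash, N_SET))
```

With a unit stride, every set is revisited after exactly 16 accesses under traditional indexing, so the sketch reports zero; the XOR hash permutes the order within each group of 16 blocks and therefore reports a non-zero value.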
2.2. Ideal Properties
In this section, we describe the properties that should be
satisﬁed by an ideal hashing function. Most applications,
even some irregularapplications, often have strided access
patterns. Given the common occurrence of these patterns,
a hashing function that does not achieve the ideal balance
and concentration will cause a pathological behavior. A
pathological behavior arises when the balance or concen-
tration of an alternative hashing function is worse than
those of the traditional hashing function, often leading to
slowdown.
Let $s$ be a stride amount in the address sequence
$a_0, a_1, \ldots, a_{N-1}$, i.e., $a_{i+1} = a_i + s$, where
$0 \le i < N$.
Property 1: Ideal balance. For the modulo based hashing
where $H(a_i) = a_i \bmod n_{set}$, the ideal balance is achieved
if and only if $\gcd(s, n_{set}) = 1$. For other hashing functions,
such as XOR-based, the ideal balance condition is harder to
formulate because the hashing function has various cases where the
ideal balance is not achieved (Section 3.3).
Property 2: Sequence invariance. A hashing function is
sequence invariant if and only if for any $a_i$,
$H(a_i) = H(a_{i+j})$ implies $H(a_{i+1}) = H(a_{i+j+1})$.
The ideal concentration is achieved when both the ideal
balance and sequence invariance are satisfied. Therefore,
the ideal concentration is not achieved when the sequence
invariance is not achieved. The sequence invariance says
that once a set is re-accessed, the sequence of set accesses
will precisely follow the previous sequence. Moreover, when the
sequence invariance is satisfied, all the distances between two
accesses to the same set are equal to a constant $d$, indicating
the absence of a burst of accesses to a single set for the strided
access pattern. Furthermore, when the ideal balance is satisfied
for the modulo hashing, then the constant $d$ is the average
distance, or $d = \bar{d} = n_{set}$.

(Footnote on the average distance used in Equation 2: There are $N$
accesses spread ideally across $n_{set}$ sets, so the total
distance between accesses is $N \times n_{set}$. Hence, the average
over the $N$ accesses is $n_{set}$.)

It is possible that a hashing function satisfies the sequence
invariance property in most, but not all, cases. Such a function
can be said to have partial sequence invariance.
In Section 3.3, we will show that prime modulo hashing
function satisﬁes both properties except for a very small
number of cases, whereas other hashing functions do not
always achieve Property 1 and 2 simultaneously. Bad con-
centration is a major source of the pathological behavior
for alternative hashing functions.
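
Property 1 can be checked numerically. The sketch below counts how many distinct sets a strided sequence touches under modulo indexing; the helper name `sets_touched`, the stride of 64, and the pair of set counts (2048 and 2039) are assumptions picked for illustration. For modulo hashing the count equals $n_{set} / \gcd(s, n_{set})$ once enough accesses have been made.

```python
from math import gcd

def sets_touched(stride, n_set, n_accesses):
    """Number of distinct sets hit by a strided sequence under a_i mod n_set."""
    return len({(i * stride) % n_set for i in range(n_accesses)})

STRIDE = 64          # hypothetical stride, in blocks
ACCESSES = 4096

for n_set in (2048, 2039):   # power-of-two versus prime number of sets
    touched = sets_touched(STRIDE, n_set, ACCESSES)
    print(n_set, gcd(STRIDE, n_set), touched)
```

With 2048 sets the stride shares a factor of 64 with the set count and only 32 sets are used, whereas with 2039 sets the gcd is 1 and every set is used.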
3. Hashing Functions Based on Prime Numbers
In this section, we describe the prime modulo and prime
displacement hashing functions that we propose (Sec-
tion 3.1 and 3.2, respectively). We compare them against
other hashing functions in Section 3.3.
3.1. Prime Modulo Hashing Function
Prime modulo hashing functions, like any other modulo
functions, can be expressed as $H(a_i) = a_i \bmod n_{set}$. The
difference between prime modulo hashing and traditional
hashing functions is that $n_{set}$ is a prime number instead
of a power of two. In general, $n_{set}$ is the largest prime
number that is smaller than a power of two. The prime
modulo hashing functions that have been used in software hash
tables [1] and the BSP machine [10] have two major drawbacks.
First, they are considered expensive to implement in hardware
because performing a modulo operation with a prime number requires
an integer division [10]. Second, since the number of sets in the
physical memory ($n_{set\_phys}$) is likely a power of two, there
are $\Delta = n_{set\_phys} - n_{set}$ sets that are wasted, causing
fragmentation. For example, the fragmentation in BSP is a
non-trivial 6.3%.
Since we target the L2 cache, however, this fragmentation
becomes negligible. Table 1 shows that the percentage of
the sets that are wasted in an L2 cache is small for com-
monly used numbers of the sets in the L2 cache. The frag-
mentation falls below 1% when there are 512 physical sets
or more. This is due to the fact that there is always a prime
number that is very close to a power of two.

$n_{set\_phys}$   $n_{set}$   Fragmentation (%)
256               251         1.95
512               509         0.59
1024              1021        0.29
2048              2039        0.44
4096              4093        0.07
8192              8191        0.01
16384             16381       0.02
Table 1: Prime modulo set fragmentation.
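
The entries of Table 1 can be reproduced with a short script. The sketch below assumes, as the text states, that the number of sets is the largest prime below the physical (power-of-two) number of sets; the helper names are hypothetical and trial division is used only because the numbers are small.

```python
def is_prime(n):
    """Trial-division primality test, adequate for cache-sized numbers."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

def largest_prime_below(n):
    """Largest prime strictly smaller than n (n > 2)."""
    candidate = n - 1
    while not is_prime(candidate):
        candidate -= 1
    return candidate

def fragmentation_percent(n_set_phys):
    """Return (n_set, percentage of physical sets left unused)."""
    n_set = largest_prime_below(n_set_phys)
    return n_set, 100.0 * (n_set_phys - n_set) / n_set_phys

for n_set_phys in (256, 512, 1024, 2048, 4096, 8192, 16384):
    n_set, frag = fragmentation_percent(n_set_phys)
    print(n_set_phys, n_set, f"{frag:.2f}%")
```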
Utilizing number theory, we can compute the prime modulo
function quickly without using an integer division. The
foundation for the technique is taken from fast random
number generators [13, 24]. Specifically, computing a product
modulo $N$, where $N$ is a Mersenne prime number (i.e., $N$ is
one less than a power of two), can be performed using add
operations without a multiplication or division operation.
Since we are interested in a prime number $n_{set}$ that is
not necessarily a Mersenne prime, we extend the existing
method and propose two methods of performing the prime
modulo operation fast without any multiplication and division.
The first method, iterative linear, needs recursive
steps of shift, add, and subtract&select operations. The
second method, polynomial, needs only one step of add,
and a subtract&select operation.
Subtract&select method. Computing the value of
$x \bmod n_{set}$ is trivial if $x$ is small. Figure 2 shows how
this can be implemented in hardware. $x$, $x - n_{set}$,
$x - 2 n_{set}$, etc. are all fed as input into a selector, which
chooses the rightmost input that is not negative. To implement
this method, the maximum value of $x$ should be known.
[Figure 2 diagram: adders compute $x - n_{set}$, $x - 2 n_{set}$,
...; a selector takes these together with $x$ and outputs
$x \bmod n_{set}$.]
Figure 2: Hardware implementation of the subtract&select method.
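
A software model of the subtract&select step might look as follows. This is only a sketch of the behavior described above: the function name, the `n_inputs` parameter, and the error raised when the selector is too narrow are assumptions for the example.

```python
def subtract_select(x, n_set, n_inputs):
    """Model of a selector fed with x, x - n_set, x - 2*n_set, ...

    The selector returns the last candidate that is not negative, which
    equals x mod n_set provided that x < n_inputs * n_set.
    """
    candidates = [x - k * n_set for k in range(n_inputs)]
    non_negative = [c for c in candidates if c >= 0]
    if not non_negative or non_negative[-1] >= n_set:
        raise ValueError("x is out of range for this selector width")
    return non_negative[-1]

print(subtract_select(2040, 2039, 2))   # a 2-input selector suffices here
print(subtract_select(5000, 2039, 3))   # a larger x needs a 3-input selector
```

The width check mirrors the remark that the maximum value of $x$ should be known when the selector is sized.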
Iterative linear method. First, $a_i$ is represented as a linear
function of $\Delta = n_{set\_phys} - n_{set}$. To see how this can
be done, let $T_i$ and $x_i$ represent parts of the bits in $a_i$ as
depicted in Figure 1. Since $a_i = n_{set\_phys} \times T_i + x_i$,
then

$$\begin{aligned}
H(a_i) &= (n_{set\_phys} \times T_i + x_i) \bmod n_{set} \\
       &= ((n_{set} + \Delta) \times T_i + x_i) \bmod n_{set} \\
       &= (\Delta \times T_i + x_i) \bmod n_{set} \qquad (3) \\
       &= a'_i \bmod n_{set}
\end{aligned}$$

Since $a'_i$ is much smaller than $a_i$, $a'_i \bmod n_{set}$ may be
computed using the subtract&select method (Figure 2).
Moreover, although Equation 3 contains a multiplication,
since $\Delta$ is a very small integer (at most 9, see Table 1)
for most cases, the multiplication can easily be converted
to shift and add operations. For example, when $\Delta = 9$,
$a_i \equiv (T_i \ll 3) + T_i + x_i \pmod{n_{set}}$, where $\ll$
denotes a left shift operation. Finally, when $a'_i$ is still
large, we can apply Equation 3 iteratively to obtain $a''_i$,
$a'''_i$, etc. The following theorem states the maximum number of
iterations that needs to be performed to compute a cache index
using the iterative linear method.
Theorem 1. Given a
* -bit address and a cache with
block/line size of
+ , the number of iterations needed to
compute
￿
￿ mod
￿
@
￿
￿
￿
0
￿ is,
,
*
￿
￿
￿
￿
￿
￿
￿
+
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
0
￿
￿
￿
￿
 
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
-
When a subtract&select with
￿
￿
Q
￿ selector inputs is used
in conjunction with the iterative linear method, the number
of iterations required becomes,
,
*
￿
￿
￿
￿
￿
￿
￿
+
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
2
￿
Q
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
0
￿
￿
￿
 
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
-
For example, for a 32-bit machine with $n_{set\_phys} = 2048$
and a 64-byte cache line size, the prime modulo can be
computed with only two iterations. However, with a 64-bit
machine, it requires 6 iterations using a subtract&select
with 3-input selector, but requires 3 iterations with a
258-input selector.
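
The iterative linear method can be modeled in a few lines. The sketch below applies Equation 3 repeatedly and finishes with a 2-input subtract&select; it assumes that the physical number of sets is a power of two and that $2 n_{set} \ge n_{set\_phys}$. The function name and the stopping rule are choices made for this example, and the multiplication by $\Delta$ stands in for the shift-and-add hardware.

```python
def iterative_linear_mod(block_addr, n_set_phys, n_set):
    """Return (block_addr mod n_set, iterations) by repeating Equation 3."""
    delta = n_set_phys - n_set
    index_bits = n_set_phys.bit_length() - 1     # log2(n_set_phys)
    value, iterations = block_addr, 0
    # Reduce until a 2-input subtract&select can finish the job.
    while value >= 2 * n_set:
        tag = value >> index_bits                # T
        index = value & (n_set_phys - 1)         # x
        value = delta * tag + index              # Delta*T + x
        iterations += 1
    if value >= n_set:                           # 2-input subtract&select
        value -= n_set
    return value, iterations

# Largest 26-bit block address (32-bit address, 64-byte lines).
print(iterative_linear_mod((1 << 26) - 1, 2048, 2039))
```

For the largest 26-bit block address this model needs two applications of Equation 3, in line with the 32-bit example in the text.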
Polynomial method. Because in some cases the iterative
linear method involves multiple iterations, we devise an
algorithm to compute the prime modulo operation in one
step. To achieve that, using the same method as in deriving
Equation 3, we first express $a_i$ as a polynomial function of
$n_{set\_phys}$:

$$H(a_i) = \left(x_i + t_{i,1}\, n_{set\_phys} + t_{i,2}\, n_{set\_phys}^2 + \cdots + t_{i,k}\, n_{set\_phys}^k\right) \bmod n_{set}$$

where $t_{i,j}$ consists of bit $\log_2(n_{set\_phys}) \times j$
through bit $\log_2(n_{set\_phys}) \times (j+1) - 1$ of the address
bits of $a_i$. For example, $t_{i,1}$ is shown as $t_i$ in Figure 1.
Substituting $n_{set\_phys}$ by $(n_{set} + \Delta)$, we obtain

$$H(a_i) = \left(x_i + t_{i,1} (n_{set} + \Delta) + t_{i,2} (n_{set} + \Delta)^2 + \cdots + t_{i,k} (n_{set} + \Delta)^k\right) \bmod n_{set}$$

$(n_{set} + \Delta)^j$, where $j = 1, 2, \ldots, k$, can be expanded
into

$$(n_{set} + \Delta)^j = \binom{j}{0} n_{set}^{j} + \binom{j}{1} n_{set}^{j-1} \Delta + \cdots + \binom{j}{j} \Delta^{j}$$

In the (mod $n_{set}$) space, any term that is a multiple of
$n_{set}$ is equivalent to zero. Since only the last term is not
zero, then $(n_{set} + \Delta)^j \equiv \Delta^j \pmod{n_{set}}$.
Therefore, we can express $a_i$ as a polynomial function of
$\Delta$:

$$H(a_i) = \left(x_i + t_{i,1} \Delta + t_{i,2} \Delta^2 + \cdots + t_{i,k} \Delta^k\right) \bmod n_{set} \qquad (4)$$

Note that the sum in parentheses is much smaller than $a_i$, and is
in general small enough to derive the result of the prime modulo
using the subtract&select method (Figure 2).
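
Equation 4 translates directly into a loop over the chunks of the block address. The following sketch is a software model under the assumption that the physical number of sets is a power of two; the function name is hypothetical, and the final subtraction loop merely stands in for the subtract&select stage.

```python
def polynomial_mod(block_addr, n_set_phys, n_set):
    """Evaluate Equation 4: x + t1*Delta + t2*Delta**2 + ... (mod n_set)."""
    delta = n_set_phys - n_set
    chunk_bits = n_set_phys.bit_length() - 1     # log2(n_set_phys)
    mask = n_set_phys - 1
    total, weight = 0, 1
    while block_addr:
        total += weight * (block_addr & mask)    # x, then t1, t2, ...
        weight *= delta                          # Delta**j for the next chunk
        block_addr >>= chunk_bits
    while total >= n_set:                        # stand-in for subtract&select
        total -= n_set
    return total

print(polynomial_mod(123456789, 2048, 2039))
print(polynomial_mod(123456789, 8192, 8191))     # Mersenne case, Delta = 1
```

When the number of sets is a Mersenne prime, $\Delta = 1$ and the loop reduces to a plain sum of the chunks.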
A special but restrictive case is when $n_{set}$ is a Mersenne
prime number, in which case $\Delta = 1$. Then, Equation 4 can
be simplified further, leading up to the same implementation
as in [25]:

$$H(a_i) = \left(x_i + t_{i,1} + t_{i,2} + \cdots + t_{i,k}\right) \bmod n_{set} \qquad (5)$$
It is possible to use $n_{set}$ that is equal to
$n_{set\_phys} - 1$ but not a prime number. Often, if
$n_{set\_phys} - 1$ is not a prime number, it is a product of two
prime numbers. Thus, it is at least a good choice for most stride
access patterns. However, it is beyond the scope of this paper to
evaluate such numbers.

Comparing the iterative linear and polynomial methods,
the polynomial method allows smaller latency in computing the
prime modulo when $\Delta$ is small, especially for 64-bit machines
and a small number of sets in the cache. The iterative linear
method is more desirable for low hardware and power budget, or
when $\Delta$ is large.
[Figure 3 diagram: in (a), the index (mod 2039) is the sum of $x$,
$9 \times t_1$, and $81 \times t_2$, written out as shifted copies
of the bits of $t_1$ and $t_2$; in (b), these are rearranged into
five numbers, A through E, that are added to form the index
(mod 2039).]
Figure 3: The initial components of index calculation (a),
and the components after optimizations (b).
3.1.1. Hardware Implementation
To illustrate how the prime modulo indexing can be implemented
in hardware, let us consider an L2 cache with 64 byte blocks and
2048 ($= 2^{11}$) number of physical sets and 2039 ($= 2^{11} - 9$)
number of sets. Therefore, 6 bits are used as the block offset,
while the remaining bits of a 32-bit address can be broken up
into three components: 11-bit $x$ ($x^0, x^1, \ldots, x^{10}$),
11-bit $t_1$ ($t_1^0, t_1^1, \ldots, t_1^{10}$), and the remaining
4-bit $t_2$ ($t_2^0, t_2^1, t_2^2, t_2^3$). According to
Equation 4, the cache index can be calculated as
$index = (x + 9 \times t_1 + 81 \times t_2) \bmod 2039$. The binary
representation of 9 and 81 are '1001' and '1010001', respectively.
Therefore, Figure 3a shows that the index can be calculated as the
sum of six numbers. This can be simplified further. The highest
three bits of the third number
($t_1^8 \times 2^{11} + t_1^9 \times 2^{12} + t_1^{10} \times 2^{13}$)
can be separated and, according to Equation 4, are equal to
$9 \times (t_1^8 \times 2^0 + t_1^9 \times 2^1 + t_1^{10} \times 2^2)$
in the modulo space. Furthermore, some of the numbers can be
added instantly to fill in the bits that have zero values. For
example, the fourth and the fifth numbers are combined
into a single number. The resulting numbers are shown in
Figure 3b. There are only five numbers (A through E) that
need to be added, with up to 11 bits each.
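
The arithmetic of this example can be checked with a small software model. The sketch below follows the bit layout described above (6-bit block offset, 11-bit $x$, 11-bit $t_1$, 4-bit $t_2$) and folds any bits above position 10 back in using $2^{11} \equiv 9 \pmod{2039}$. It does not reproduce the wired permutation into the numbers A through E; the function name and the folding loop are assumptions made for illustration.

```python
N_SET_PHYS = 2048
N_SET = 2039
DELTA = N_SET_PHYS - N_SET      # 9

def l2_index_32bit(addr):
    """Prime modulo L2 index of a 32-bit address with 64-byte blocks."""
    block = addr >> 6                    # drop the 6-bit block offset
    x = block & 0x7FF                    # 11-bit index
    t1 = (block >> 11) & 0x7FF           # next 11 bits of the tag
    t2 = (block >> 22) & 0xF             # remaining 4 bits of the tag
    total = x + 9 * t1 + 81 * t2         # Equation 4 with Delta = 9
    # Fold carries out of the low 11 bits: 2**11 is congruent to 9.
    while total >= N_SET_PHYS:
        total = (total & 0x7FF) + DELTA * (total >> 11)
    # 2-input subtract&select: the sum is now below 2048.
    return total - N_SET if total >= N_SET else total

print(l2_index_32bit(0xFFFFFFFF))
print(l2_index_32bit(0x12345678))
```

In hardware the products by 9 and 81 would be formed by wiring shifted copies of the bits, as Figure 3 indicates; plain multiplications are used here only for brevity.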
[Figure 4 diagram: the address is split into a 4-bit $t_2$, an
11-bit $t_1$, an 11-bit $x$, and a 6-bit block offset. A wired
permutation forms the numbers A through E, which feed an adder
followed by a 2-input subtract&select. The new index is computed
during the L1 access and is used by the L2 cache on an L1 miss.]
Figure 4: Hardware implementation of the prime modulo
hashing using the polynomial method for $n_{set\_phys} = 2048$ on
a 32-bit machine.
Figure 4 shows how the prime modulo hashing computa-
tion ﬁts with the overall system. The ﬁgure shows how the
new index is computed from the addition of the five numbers
(A through E in Figure 3b). A and B are directly taken
from the address bits, while C, D, and E are obtained by
wired permutation of the tag part of the address. The sum
of the ﬁve numbers is then fed into a subtract&select with
a 2-input selector. Only 2 inputs are needed by the selec-
tor because the maximum value of the addition can only
be slightly larger than 2039. The reason for this is that in
computing the addition, any intermediate carry out can be
converted to 9 and be added to other numbers (Equation 4).

The figure also shows that the prime modulo computation
can be overlapped with L1 cache accesses. On each L1
cache access, the prime modulo L2 cache index is com-
puted. If the access results in an L1 cache miss, the new
L2 cache index has been computed and is ready for use.
The prime modulo computation is simple enough that the
L2 access time is not likely impacted.
If a TLB is accessed in parallel with the L1 cache, the
prime modulo computation can be partially cached in the
TLB, simplifying the computation further. A physical ad-
dress consists of the page index and the page offset. To
compute the prime modulo L2 cache index, we can com-
pute the page index modulo independently from the page
offset modulo. On a TLB miss, the prime modulo of the
missed page index is computed and stored in the new TLB
entry. This computation is not in the critical path of TLB
access, and does not require modiﬁcations to the OS’ page
table. On an L1 miss, the pre-computed modulo of the
page index is added with the page offset bits that are not
part of L2 block offset. For example, if the page size is
4KB, then only $12 - 6 = 6$ bits of the page offset need
to be added to the 11-bit pre-computed modulo, followed
by a subtract&select operation to obtain the L2 cache in-
dex. This is a very simple operation that can probably be
performed in much less than one clock cycle.
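
The TLB-assisted scheme can be sketched as two small functions. This model assumes 4KB pages, 64-byte L2 blocks, and 2039 sets, and it assumes that the value cached in the TLB entry is the page's contribution to the block address reduced modulo the number of sets. The function names are hypothetical, and the `%` operator stands in for the off-critical-path computation performed on a TLB miss.

```python
PAGE_OFFSET_BITS = 12    # 4KB pages
BLOCK_OFFSET_BITS = 6    # 64-byte L2 blocks
N_SET = 2039

def modulo_for_tlb_entry(page_index):
    """Computed once per TLB miss and stored in the new TLB entry."""
    blocks_per_page_bits = PAGE_OFFSET_BITS - BLOCK_OFFSET_BITS
    return (page_index << blocks_per_page_bits) % N_SET

def l2_index_on_l1_miss(cached_modulo, page_offset):
    """Add the 6 page-offset bits above the block offset, then subtract&select."""
    total = cached_modulo + (page_offset >> BLOCK_OFFSET_BITS)
    return total - N_SET if total >= N_SET else total

addr = 0x12345678
cached = modulo_for_tlb_entry(addr >> PAGE_OFFSET_BITS)
print(l2_index_on_l1_miss(cached, addr & 0xFFF))
```

Because the cached value is below 2039 and at most 63 is added to it, a single conditional subtraction is enough on the L1 miss path.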
3.2. Prime Displacement Hashing Function
In the prime displacement hashing, the traditional modulo
is performed after an offset is added to the original index
$x_i$. The offset is the product of a number $p$ and the tag
$T_i$.

$$H(a_i) = (p \times T_i + x_i) \bmod n_{set} \qquad (6)$$
This new hashing function is based on hashing functions
in Aho, et al. [1], and is related to Raghavan’s RANDOM-
H functions [14], with the main difference being their use
of non-constant offset, resulting in not satisfying the se-
quence invariance property. Prime displacement hashing
functions can easily be implemented in hardware with a
narrow truncated addition if a prime number with few 1’s
in its binary representation is used as $p$ (see footnote 5).
One advantage of the prime displacement hashing function
compared to the prime modulo hashing function is that the
complexity of calculating the cache index in the prime
displacement hashing function is mostly independent of the
machine sizes. This makes it trivial to implement in
machines with 64-bit or larger addressing.

5 We had originally considered that a prime number would work best
for $p$, hence the name Prime Displacement was introduced. But,
technically an odd number modulo a power of two forms a modulo
multiplication group [17] and as such they are all invertible so
none of them are prime in a technical sense. In practice, it is
also not the case that prime numbers are necessarily better choices
for $p$ than ordinary odd numbers.
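
Equation 6 can be modeled as follows. The sketch assumes a power-of-two number of sets and uses $p = 17$ purely as a hypothetical value with few 1's in its binary representation (10001); the second function shows the corresponding narrow, truncated shift-and-add form. Both function names are assumptions for the example.

```python
def prime_displacement_index(block_addr, n_set, p):
    """Equation 6: (p * T + x) mod n_set, with n_set a power of two."""
    index_bits = n_set.bit_length() - 1
    x = block_addr & (n_set - 1)
    tag = block_addr >> index_bits
    return (p * tag + x) & (n_set - 1)

def prime_displacement_index_p17(block_addr, n_set):
    """Same mapping for p = 17, using only a narrow truncated addition."""
    index_bits = n_set.bit_length() - 1
    mask = n_set - 1
    x = block_addr & mask
    t = (block_addr >> index_bits) & mask    # only the low tag bits matter
    return ((t << 4) + t + x) & mask         # 17*t = (t << 4) + t

N_SET = 2048
# Blocks that all collide in set 0 under traditional indexing.
conflicting = [k * N_SET for k in range(N_SET)]
print(len({prime_displacement_index(a, N_SET, 17) for a in conflicting}))
```

Since only the low-order bits of the product survive the truncation, the width of the addition depends on the number of sets and not on the address width.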
3.3. Comparison of Various Hashing Functions
Before we compare our hashing functions with exist-
ing ones, we brieﬂy overview existing hashing functions.
The traditional hashing function is a very simple modulo based
hashing function. It can be expressed as $H(a_i) = x_i$, or
equivalently as $H(a_i) = a_i \bmod n_{set\_phys}$. Pseudo-random
hashing randomizes accesses to cache sets. Examples of the
pseudo-random hashing are XOR-based hashing functions, which are
by far the most extensively studied [15, 16, 23, 22, 26, 11, 18,
4, 19]. We choose one of the most prominent examples:
$H(a_i) = t_i \oplus x_i$, where $\oplus$ represents the bitwise
exclusive OR operator. In a skewed associative cache, the cache
itself is divided into banks and each bank is accessed using a
different hashing function. Here, cache blocks that are mapped
to the same set in one bank are most likely not to map to
the same set in the other banks. Seznec proposes using an
XOR hashing function in each bank after a circular shift
is performed to the bits in $t_i$ [18, 4, 19]. The number of
circular shifts performed differs between banks. This results
in a form of a perfect shuffle. We propose using the
prime displacement hashing functions in a skewed cache.
To ensure inter-bank dispersion, a different prime number
for each bank is used.
Table 2 compares the various hashing functions based on
the following:
• when the ideal balance is achieved,
• whether they satisfy the sequence invariance property,
• whether a simple hardware implementation exists, and
• whether they place restrictions on the replacement algorithm.
The major disadvantage of the traditional hashing is that
it achieves the ideal balance only when the stride amount
$s$ is odd, where $\gcd(s, n_{set}) = 1$. When the ideal balance
is satisfied, however, it achieves the ideal concentration
because it satisfies the sequence invariance property.
Note that for a common unit stride $s = 1$, it has the ideal
balance and concentration. Thus, any hashing functions that
achieve less than the ideal balance or concentration with
unit strides are bound to have a pathological behavior.
                          | Single Hashing Function                                                      | Multiple Hashing Functions
Characteristics           | Traditional | XOR     | pMod                                  | pDisp                    | Skewed | Skewed + pDisp
Ideal balance condition   | $s$ odd     | various | all $s$ except multiples of $n_{set}$ | most odd, all even $s$   | None   | None
Sequence invariant?       | Yes         | No      | Yes                                   | Partial                  | No     | No
Simple Hw Impl.           | Yes         | Yes     | Yes                                   | Yes                      | Yes    | Yes
Replacement Alg. Restriction | No       | No      | No                                    | No                       | Yes    | Yes
Table 2: Qualitative comparison of various hashing functions:
traditional, XOR, prime modulo (pMod), prime displacement (pDisp),
skewed associative cache with circular-shift and XOR (Skewed)
[18, 4, 19], and our skewed associative cache with prime
displacement (Skewed + pDisp).

XOR achieves the ideal balance on most stride amounts $s$,
but always has less than the ideal concentration because
it does not satisfy the sequence invariance property. This
is because the sequence of set accesses is never repeated
due to XOR-ing with different values in each subsequence.
Thus, it is prone to the pathological behavior. There are
various cases where the XOR hashing does not achieve the
ideal balance. One case is when $s = n_{set} - 1$. For example,
with $s = 15$ and $n_{set} = 16$ (as in a 4-way 4KB cache with
64 byte lines), it will access sets 0, 15, 15, 15, .... Not only
that, a stride of 3 or 5 will also fail to achieve the ideal
balance because they are factors of 15. This makes the
XOR a particularly bad choice for indexing the L1 cache.
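
The stride-15 example can be reproduced with a few lines of Python. The sketch assumes the XOR hash described in this section (low-order tag bits XOR-ed with the index bits) and a 16-set cache; the function name is hypothetical.

```python
N_SET = 16    # e.g. a 4-way 4KB cache with 64-byte lines

def xor_hash(block_addr):
    """XOR-based index: low-order tag bits XOR index bits."""
    x = block_addr & (N_SET - 1)             # index bits
    t = (block_addr >> 4) & (N_SET - 1)      # low-order tag bits
    return t ^ x

stride = N_SET - 1
sets = [xor_hash(i * stride) for i in range(16)]
print(sets)
print(len(set(sets)))    # number of distinct sets used by these 16 accesses
```

Each step of 15 decrements the index bits by one while incrementing the tag bits by one, so the two fields stay bitwise complements and their XOR remains 15 after the first access.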
The prime modulo hashing (pMod) achieves the ideal balance
and concentration except for very few cases. The
ideal balance is achieved because $\gcd(s, n_{set}) = 1$ except
when $s$ is a multiple of $n_{set}$. The ideal concentration is
achieved because $H(a_i) = H(a_{i+j})$ implies that
$H(a_{i+1}) = H(a_i + s) = H(a_{i+j} + s) = H(a_{i+j+1})$
with the stride amount $s$. This makes the prime modulo
hashing an ideal hashing function. As we have shown in
Section 3.1, fast hardware implementations exist.
The prime displacement hashing (pDisp) achieves an ideal
balance with even strides and most odd strides. Although
it does not satisfy the sequence invariance property, the
distance between two accesses to the same set is almost
always constant. That is, for all but one set in a single
subsequence, $H(a_i) = H(a_{i+d})$ implies
$H(a_{i+1}) = H(a_{i+d+1})$. Furthermore, the distance $d$
is fixed by the number of sets and by $p$, where $p$ is the
prime number used. Thus, it partially satisfies the
sequence invariance.
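The "almost always constant" distance can be observed numerically. The sketch below assumes that prime displacement computes the new index as (p × tag + index) mod n_set on block addresses, with p = 9 and 2048 sets as in the evaluation; using the full tag rather than a subset of its bits, and the helper names, are assumptions of this illustration.

```python
from collections import Counter


def pdisp(block_addr: int, p: int = 9, n_set: int = 2048) -> int:
    """Assumed prime displacement index: (p * tag + index) mod n_set."""
    tag, index = divmod(block_addr, n_set)
    return (p * tag + index) % n_set


def gaps_to_set(target: int, stride: int, n_accesses: int,
                p: int = 9, n_set: int = 2048) -> Counter:
    """Histogram of distances (in accesses) between hits to one set."""
    hits = [i for i in range(n_accesses)
            if pdisp(i * stride, p, n_set) == target]
    return Counter(b - a for a, b in zip(hits, hits[1:]))


if __name__ == "__main__":
    # Unit stride over 100 tags' worth of blocks, watching set 0.
    print(gaps_to_set(0, 1, 100 * 2048).most_common())
```

For unit stride in this sketch, all but one of the observed gaps between accesses to set 0 equal 2048 - 9 = 2039, which illustrates the partial sequence invariance described above for one specific case only.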
Skewed associative caches do not guarantee the ideal balance
or concentration, whether they use the XOR-based
hashing (Skewed), or the prime displacement hashing
(Skewed+pDisp). However, probabilistically, the accesses
will be quite balanced since an address can be mapped to
a different place in each bank. A disadvantage of skewed
associative caches is the fact that they make it difficult to
implement a least recently used (LRU) replacement policy
and force using pseudo-LRU policies. The non-ideal balance
and concentration, together with the use of a pseudo-LRU
replacement policy, make a skewed associative cache
prone to the pathological behavior, although it works well
on average. Later, we will show that skewed caches do
degrade the performance of some applications.
Finally, although not described in Table 2, some other
hashing functions, such as all XOR-based functions and
random-h [14, 15, 16, 20, 8, 7, 26, 11], are not sequence
invariant, and therefore do not achieve the ideal
concentration.
4. Evaluation Environment
Applications. To evaluate the prime hashing func-
tions, we use 23 memory-intensive applications from
various sources: bzip2, gap, mcf, and parser from
Specint2000 [21], applu, mgrid, swim, equake, and tom-
catv from Specfp2000 and Specfp95 [21], mst from
Olden, bt, ft, lu, is, sp, and cg from NAS [12], sparse
from Sparsebench [6], and tree from the University of
Hawaii [3]. Irr is an iterative PDE solver used in CFD
applications. Charmm is a well known molecular dynam-
ics code and moldyn is its kernel. Nbf is a kernel from the
GROMOS molecular dynamics benchmarks. And, euler is
a 3D Euler equation solver from NASA.
We categorize the applications into two groups: those
whose histogram of the number of accesses to the different
sets is uniform, and those whose histogram is not. Let
$f_0, f_1, \ldots, f_{n_{set}-1}$ represent the frequency of
accesses to the sets $0, 1, \ldots, n_{set}-1$ in the L2 cache.
An application is considered to have a non-uniform cache
access behavior if the ratio
$\mathrm{stdev}(f_i) / \mathrm{mean}(f_i)$
is greater than 0.5. Applications with
non-uniform cache set accesses likely suffer from conflict
misses, and hence alternative hashing functions are
expected to speed them up.
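The classification rule can be written as a few lines of code. This is a minimal sketch: the function name is hypothetical, and the use of the population standard deviation (rather than the sample standard deviation) is an assumption, since the text does not say which one is meant.

```python
from statistics import mean, pstdev


def is_non_uniform(freqs: list[int], threshold: float = 0.5) -> bool:
    """Classify per-set access counts as non-uniform.

    freqs[i] is the number of accesses observed to set i. The ratio of
    standard deviation to mean is compared with the 0.5 threshold.
    Population standard deviation is an assumption of this sketch.
    """
    avg = mean(freqs)
    if avg == 0:
        return False  # no accesses at all: nothing to classify
    return pstdev(freqs) / avg > threshold


if __name__ == "__main__":
    print(is_non_uniform([100] * 16))        # evenly spread accesses
    print(is_non_uniform([1600] + [0] * 15)) # all accesses in one set
```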
Among the 23 applications, we found that 30% of them (7
benchmarks) are non-uniform: bt, cg, ft, irr, mcf, sp, and
tree.
Simulation Environment. The evaluation is done using
an execution-driven simulation environment that supports
a dynamic superscalar processor model [9]. Table 3 shows
the parameters used for each component of the architec-
ture. The architecture is modeled cycle by cycle.
Prime Numbers. The prime modulo function uses the
prime number shown in Table 1. The prime displacement
function uses the number 9 when it is used as a single
hashing function. When used in conjunction with a skewed
associative cache, the numbers that are used for each of the
four cache banks are 9, 19, 31, and 37. Although the number
9 is not prime, it is selected for the reason explained in
Section 3.2.

PROCESSOR
  6-issue dynamic. 1.6 GHz. Int, fp, ld/st FUs: 4, 4, 2
  Pending ld, st: 8, 16. Branch penalty: 12 cycles
MEMORY
  L1 data: write-back, 16 KB, 2 way, 32-B line, 3-cycle hit RT
  L2 data: write-back, 512 KB, 4 way, 64-B line, 16-cycle hit RT
  RT memory latency: 243 cycles (row miss), 208 cycles (row hit)
  Memory bus: split-transaction, 8 B, 400 MHz, 3.2 GB/sec peak
  Dual channel DRAM. Each channel: 2 B, 800 MHz. Total: 3.2 GB/sec peak
  Random access time (tRAC): 45 ns
  Time from memory controller (tSystem): 60 ns

Table 3: Parameters of the simulated architecture. Latencies
correspond to contention-free conditions. RT stands
for round-trip from the processor.
5. Evaluation
In this section, we present and discuss six sets of evalu-
ation results. Section 5.1 shows the balance and concen-
tration for four different hashing functions. Section 5.2
presents the results of using a single hashing function for
the L2 cache, while Section 5.3 presents the results of
using multiple hashing functions in conjunction with the
skewed associative L2 cache. Section 5.4 provides an
overall comparison of the hashing functions. Section 5.5
shows the miss reduction achieved by the hashing functions.
Finally, Section 5.6 presents the cache miss distribution
of tree before and after applying the prime modulo hashing
function.
5.1. Balance and Concentration
Figure 5: Balance for the Traditional Hashing (a), Prime
Modulo Hashing (b), Prime Displacement Hashing (c),
and XOR Hashing (d). Each panel plots balance (vertical
axis, 0 to 10) against stride amount (horizontal axis,
1 to 2047).
Figure 5 shows the balance values of the four different
hashing functions using a synthetic benchmark that
produces only strided access patterns. The stride size is
varied from 1 to 2047. The maximum balance displayed on
the vertical axes in the figure is limited to 10 for easy
comparison. Note that small strides are more likely to
appear in practice. Therefore, they are more important
than large strides. The balance values for the traditional
and prime modulo hashing functions follow the discussion
in Section 3.3. In particular, the traditional hashing
function suffers from bad balance values with even strides,
but achieves perfect balance with odd strides. The prime
modulo achieves perfect balance, except when the stride
is equal to $n_{set}$, which in this case is 2039. XOR and
prime displacement hashing also have the ideal balance
with most strides. Both have various cases in which the
stride size causes non-ideal balance. The non-ideal balance
is clustered around the small strides with the XOR
function, whereas it is concentrated toward the middle for
the prime displacement function. Thus, in practice, the
prime displacement is superior to the XOR hashing function.

Figure 6: Concentration: Traditional Hashing (a), Prime
Modulo Hashing (b), Prime Displacement Hashing (c),
and XOR Hashing (d). Each panel plots concentration
(vertical axis, 0 to 100) against stride amount (horizontal
axis, 1 to 2047).
Figure 6 shows the concentration for the same range of
stride size. As expected, the traditional hashing function
suffers from very bad concentration with even strides, but
achieves perfect concentration with odd strides. The XOR
and prime displacement hashing functions also suffer from
bad concentration for many strides. This is because XOR is
not sequence invariant and prime displacement hashing is
only partially sequence invariant.
Since the prime modulo hashing is sequence invariant, it
achieves ideal concentration except for the stride equal
to $n_{set}$. Hence, for strided accesses, we can expect the
prime modulo hashing to have the best performance among
the four hashing functions. More importantly, the
prime modulo hashing also achieves ideal concentration
with odd strides the same way as the traditional hashing.
Hence, we can expect that the prime modulo hashing is
resistant to the pathological behavior. The XOR hashing
function may outperform the traditional hashing on average.
However, it cannot match the ideal concentration of
the traditional hashing with odd stride amounts. Thus it is
prone to the pathological behavior.
5.2. Single Hashing Function Schemes
Figure 7: Performance results of single hashing functions
for applications with non-uniform cache accesses. The bar
chart shows normalized execution times (0 to 1.2) of Base,
8way, XOR, pMod, and pDisp for bt, cg, ft, irr, mcf, sp, tree,
and their average (Avg). Each bar is divided into Busy,
Other Stalls, and Memory Stall.
Figures 7 and 8 show the execution time of each application
with non-uniform cache accesses and uniform cache
accesses, respectively. Each figure compares the execution
time with different hashing functions: a traditional
hashing function with a 4-way associative L2 cache (Base),
a traditional hashing function with an 8-way associative
same-size L2 cache (8-way), the XOR hashing function
(XOR), the prime modulo hashing function (pMod), and
the prime displacement hashing function (pDisp) as
described in Section 3.3. The execution time of each case is
normalized to Base. All bars are divided into: execution of
instructions (Busy), stall time due to various pipeline
hazards (Other Stalls), and stall time due to memory accesses
(Memory Stall).
Both Figures 7 and 8 show that increasing the L2 cache
associativity to 8-way only improves the execution time
marginally, with the exception of tree. This is mainly
because doubling the associativity while keeping the same
cache size halves the number of sets. In turn, this
doubles the number of addresses mapped to a set. Thus,
increasing cache associativity without increasing the cache
size is not an effective method to eliminate conflict misses.
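The arithmetic behind this argument is simple enough to spell out. The sketch below uses the L2 parameters from Table 3 (512 KB, 64-byte lines); the function name is hypothetical.

```python
def num_sets(cache_bytes: int, line_bytes: int, ways: int) -> int:
    """Number of sets in a set-associative cache of the given geometry."""
    return cache_bytes // (line_bytes * ways)


if __name__ == "__main__":
    l2_bytes = 512 * 1024
    line_bytes = 64
    for ways in (4, 8):
        sets = num_sets(l2_bytes, line_bytes, ways)
        # With fewer sets, proportionally more block addresses share each set.
        print(f"{ways}-way: {sets} sets")
```

Going from 4-way to 8-way at a fixed size halves the set count from 2048 to 1024, so each set gains twice the ways but also receives twice as many candidate addresses.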
Comparing XOR, pMod, and pDisp for applications
with non-uniform cache accesses in Figure 7, all three
improve execution time significantly, with both pMod and
pDisp achieving an average speedup of 1.27. It is clear
that pMod and pDisp perform the best, followed by
XOR. Although XOR is certainly better than Base, its
non-ideal balance for small strides and its non-ideal
concentration hurt its performance such that it cannot obtain
optimal speedups. For applications that have uniform cache
accesses in Figure 8, the same observation generally holds.
However, 8-way slows down mst by 2%, and XOR and
pMod slow down sparse by 2%. Some other applications
have slight speedups.
5.3. Multiple Hashing Functions Schemes
Figures 9 and 10 show the execution times of applications
with non-uniform cache accesses and uniform cache
accesses, respectively, when multiple hashing functions are
used. Each figure compares the execution time with
different hashing functions: a traditional hashing function
with a 4-way associative L2 cache (Base), the prime modulo
hashing that is the best single hashing function from
Section 5.2 (pMod), the XOR-based skewed associative cache
proposed by Seznec [19] (SKW), and the skewed associative
cache with the prime displacement function that we
propose (skw+pDisp) as described in Section 3.3. The
execution time of each case is normalized to Base.
The skewed associative caches (SKW and skw+pDisp) are
based on Seznec’s design that uses four direct-mapped
cache banks. The replacement policy is called Enhanced
Not Recently Used (ENRU) [19]. The only difference be-
tween SKW and skw+pDisp is the hashing function used
(XOR versus prime displacement). We have also tried a
different replacement policy called NRUNRW (Not Re-
cently Used Not Recently Written) [18]. We found that
it gives similar results.
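To make the skewed organization concrete, the following sketch computes the set index that one block address receives in each of the four banks when each bank uses prime displacement with its own number (9, 19, 31, and 37, as listed in Section 4). The bank size of 2048 sets (512 KB / 64-byte lines / 4 banks), the formula (p × tag + index) mod n_set, and the names are assumptions of this illustration.

```python
BANK_NUMBERS = (9, 19, 31, 37)  # one displacement number per bank
SETS_PER_BANK = 2048            # assumption: 512 KB / 64 B lines / 4 banks


def bank_indices(block_addr: int) -> list[int]:
    """Set index of a block address in each direct-mapped bank."""
    tag, index = divmod(block_addr, SETS_PER_BANK)
    return [(p * tag + index) % SETS_PER_BANK for p in BANK_NUMBERS]


if __name__ == "__main__":
    a = 0
    b = 2048 + 2039  # tag 1, index 2039
    print(bank_indices(a))
    print(bank_indices(b))
```

In this example the two blocks collide in the first bank but map to different sets in the other three, which is the property that lets a skewed cache spread conflicting addresses across banks.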
Figures 9 and 10 show that pMod sometimes outperforms
and is sometimes outperformed by SKW and skw+pDisp.
With cg and mst, only the skewed associative schemes
are able to obtain speedups. This indicates that there are
some misses that the strided access patterns cannot
account for, and this type of miss cannot be tackled by a
single hashing function. On average, skw+pDisp performs
the best, followed by SKW, and then closely followed by
pMod. The extra performance for the applications with
non-uniform accesses, however, comes at the expense of
having the pathological behavior for the applications that
have uniform accesses (Figure 10). For example, SKW
slows down six applications by up to 9% (bzip2, charmm,
is, parser, sparse, and irr), while skw+pDisp slows down
three applications by up to 7% (bzip2, mgrid, and sparse).
Figure 9: Performance results of multiple hashing functions
for applications with nonuniform cache accesses. The
bar chart shows normalized execution times (0 to 1.2) of
Base, pMod, SKW, and skw+pDisp for bt, cg, ft, irr, mcf, sp,
tree, and their average (Avg). Each bar is divided into Busy,
Other Stalls, and Memory Stall.

Figure 8: Performance results of single hashing functions for applications with uniform cache accesses.
The bar chart shows normalized execution times (0 to 1.2) of Base, 8way, XOR, pMod, and pDisp for
applu, bzip2, charmm, equake, euler, gap, is, lu, mgrid, moldyn, mst, nbf, parser, sparse, swim, tomcatv,
and their average (Avg). Each bar is divided into Busy, Other Stalls, and Memory Stall.
Figure 10: Performance results of multiple hashing functions for applications with uniform cache accesses.
The bar chart shows normalized execution times (0 to 1.2) of Base, pMod, SKW, and skw+pDisp for
applu, bzip2, charmm, equake, euler, gap, is, lu, mgrid, moldyn, mst, nbf, parser, sparse, swim, tomcatv,
and their average (Avg). Each bar is divided into Busy, Other Stalls, and Memory Stall.
Cache        Uniform Apps        Nonuniform Apps     Patho.
Hashing      min, avg, max       min, avg, max       Cases
XOR          0.98, 1.00, 1.01    1.00, 1.21, 2.09    1
pMod         0.98, 1.00, 1.05    1.00, 1.27, 2.34    1
pDisp        1.00, 1.01, 1.05    1.00, 1.27, 2.32    0
SKW          0.91, 1.00, 1.12    0.99, 1.31, 2.55    4
skw+pDisp    0.93, 1.01, 1.12    1.00, 1.35, 2.63    4

Table 4: Summary of the Performance Improvement of the
Different Cache Configurations
5.4. Overall Comparison
Table 4 summarizes the minimum, average, and maximum
speedups obtained by all the hashing functions evaluated.
It also shows the number of pathological behavior cases
that result in a slowdown of more than 1% when compared
to the traditional hashing function. In terms of average
speedups, skw+pDisp is the best among the multiple
hashing function schemes, while pMod is the best among the
single hashing function schemes. In terms of pathological
cases, however, the single hashing function schemes
perform better than the multiple hashing function schemes.
This is due to worse concentration and probably in part due
to the pseudo-LRU scheme used in skewed associative caches.
In summary, the prime modulo and prime displacement
hashing stand out as excellent hashing functions for L2
caches.
5.5. Miss Reduction
Figures 11 and 12 show the normalized number of
L2 cache misses for the 23 benchmarks using a traditional
hashing function (Base), the prime modulo hashing
function (pMod), the prime displacement hashing function
(pDisp), a skewed cache using prime displacement
(skw+pDisp), and a fully-associative cache of the same
size (FA). The number of misses is normalized to
Base.
Figure 11: Normalized number of misses for several
hashing functions for applications with non-uniform cache
accesses. The bar chart shows the normalized number of
misses (0 to 1.2) of Base, pMod, pDisp, skw+pDisp, and FA
for bt, cg, ft, irr, mcf, sp, tree, and their average (Avg).
Figure 11 shows that our proposed hashing functions are
on average able to remove more than 30% of the L2 cache
misses. In some applications such as bt and tree, they
eliminate nearly all the cache misses. Interestingly, skw+pDisp
is able to remove more cache misses than a fully associative
cache in cg, indicating that in some cases it can remove
some capacity misses.
Figure 12 shows that most of the applications that have
uniform cache accesses do not have conflict misses, and
thus do not benefit from a fully-associative cache,
except for charmm and euler. The figure highlights
pMod's and pDisp's resistance to pathological behavior.
pMod does not increase the cache misses even for applications
that already have uniform cache accesses due to its
ideal balance and concentration. Although pDisp does not
achieve an ideal concentration, its partial sequence invariance
property helps it to achieve a very good concentration,
and in practice it performs just as well as pMod. This
is in contrast to skw+pDisp, which increases the number
of misses by up to 20% in 6 applications (bzip2, mgrid,
parser, sparse, swim, tomcatv). This again shows that
skewed associative caches are prone to pathological
behavior, although they are able to remove some capacity
misses.

Figure 12: Normalized number of misses for several hashing functions for applications with uniform cache accesses.
The bar chart shows the normalized number of misses (0 to 1.2) of Base, pMod, pDisp, skw+pDisp, and FA for
applu, bzip2, charmm, equake, euler, gap, is, lu, mgrid, moldyn, mst, nbf, parser, sparse, swim, tomcatv,
and their average (Avg).
5.6. Cache Misses Distribution for tree
From previous ﬁgures, tree is the application where pMod
and pDisp perform the best, both in terms of miss reduc-
tion and speedup. Figure 13 shows the distribution of L2
cache misses over the cache sets for tree using both tra-
ditional hashing and prime modulo hashing. In the tradi-
tional hashing case, Figure 13a shows that the vast ma-
jority of cache misses in tree are concentrated in about
10% of the sets. This is due to an unbalanced distribu-
tion of cache accesses in tree, causing some cache sets
to be over-utilized and suffer from many conﬂict misses.
By distributing the accesses more uniformly across the
sets, pMod is able to eliminate most of those misses (Fig-
ure 13b).
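The effect described above can be illustrated with a synthetic access pattern. The sketch below does not use the actual trace of tree; it assumes a purely strided sequence with a power-of-two stride of 16 blocks, 2048 sets for traditional indexing, and 2039 sets for prime modulo indexing. The function name is hypothetical.

```python
from collections import Counter


def set_histogram(block_addrs, n_set: int) -> Counter:
    """Per-set access counts when the index is block address mod n_set."""
    return Counter(addr % n_set for addr in block_addrs)


if __name__ == "__main__":
    # Synthetic pattern (an assumption): 4096 accesses with a stride of 16 blocks.
    addrs = [i * 16 for i in range(4096)]
    for name, n_set in (("traditional", 2048), ("pMod", 2039)):
        hist = set_histogram(addrs, n_set)
        print(name, "sets used:", len(hist),
              "max per set:", max(hist.values()))
```

Under traditional indexing this pattern lands on only 128 of the 2048 sets with 32 accesses each, whereas prime modulo indexing spreads the same accesses over all 2039 sets with at most 3 accesses per set.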
Figure 13: Distribution of misses across the cache sets
for tree using (Base) hashing (a) and pMod hashing (b).
Each panel plots the number of cache misses (vertical
axis, 0 to 250000) for each cache set (horizontal axis,
1 to 2048).
6. Related Work
Prior studies showed that alternative cache
indexing/hashing functions are effective in reducing conflicts by
achieving a more uniform access distribution across the L1
cache sets [18, 4, 22, 23], L2 cache sets [19], or the main
memory banks [14, 10, 15, 16, 20, 8, 7, 26, 11]. Most
of the prior hashing functions permute the accesses using
some form of XOR operations. We found that XOR
operations typically do not achieve an ideal concentration that
is critical to avoiding pathological behavior under strided
access patterns.
Although prime-based hashing has been proposed for
software hash tables as in [1], its use in the memory
subsystem has been very limited due to its hardware complexity
that involves true integer division operations and
fragmentation problems. Budnick and Kuck first suggested using a
prime number of memory banks in parallel computers [5],
which was later developed into the Burroughs Scientific
Processor [10]. Yang and Yang proposed using Mersenne
prime modulo hashing for cache indexing for vector
computation [25]. Since Mersenne prime numbers (primes of
the form $2^k - 1$) are sparse, using them significantly
restricts the number of cache sets that can be implemented.
We derive a more general solution that does not assume
Mersenne prime numbers and show that prime modulo hashing
can be implemented fast on any number of cache sets. In
addition, we present a prime displacement hashing function
that achieves comparable performance and robustness to the
prime modulo hashing function.
Compiler optimizations have also targeted conflict misses
by padding the data structures of a program. One example
is the work by Bacon et al. [2], who tried to find the
optimal padding amount to reduce conflict misses in caches
and TLBs in a loop nest. They tried to spread cache misses
uniformly across loop iterations based on profiling
information. Since the conflict behavior is often input
dependent and determined at run time, their approach has limited
applicability.
7. Conclusions
Even though using alternative cache indexing/hashing
functions is a popular technique to reduce conflict misses
by achieving a more uniform cache access distribution
across the sets in the cache, no prior study has really
analyzed the pathological behavior of such hashing functions
that often result in performance degradation. We presented
an in-depth analysis of the pathological behavior
of hashing functions and proposed two new hashing
functions for the L2 cache that are resistant to the
pathological behavior and yet are able to eliminate the worst
case conflict behavior. The prime modulo hashing uses a
prime number of sets in the cache, while the prime
displacement hashing adds an offset that is equal to a prime
number multiplied by the tag bits to the index bits of an
address to obtain a new cache index. These hashing
techniques can be implemented in fast hardware that uses a
set of narrow add operations in place of true integer
division and multiplication. This implementation has negligible
fragmentation for the L2 cache. We evaluated our techniques
with 23 applications from various sources. For
applications that have non-uniform cache accesses, both the
prime modulo and prime displacement hashing achieve an
average speedup of 1.27, practically without slowing down
any of the 23 benchmarks.
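One way to see why a prime modulo can be computed with additions alone is the identity 2048 ≡ 9 (mod 2039): a block address written as 2048 × tag + index is congruent to 9 × tag + index, and 9 × tag is itself a shift and an add. The following software sketch folds an address this way; it illustrates the arithmetic only, under the assumption of 2048 physical sets and the prime 2039, and is not claimed to be the hardware design of Section 3.1. The function name is hypothetical.

```python
N_PHYS = 2048           # assumed number of physical sets (power of two)
N_PRIME = 2039          # prime number of sets actually used
DELTA = N_PHYS - N_PRIME  # 9, since 2048 is congruent to 9 modulo 2039


def pmod_by_folding(block_addr: int) -> int:
    """Compute block_addr mod 2039 using shifts, adds, and one subtract."""
    a = block_addr
    while a >= N_PHYS:
        tag, index = a >> 11, a & (N_PHYS - 1)
        # 9 * tag equals (tag << 3) + tag, so no multiplier is needed.
        a = (tag << 3) + tag + index
    if a >= N_PRIME:
        a -= N_PRIME
    return a


if __name__ == "__main__":
    for addr in (0, 2038, 2039, 2048, 123456789):
        print(addr, pmod_by_folding(addr), addr % N_PRIME)
```

Each folding step strictly shrinks the value while preserving its residue, so the loop ends with a value below 2048, and at most one final subtraction brings it into the range 0 to 2038.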
Although lacking the theoretical superiority of the prime
modulo, when the prime number is carefully selected, the
prime displacement hashing performs just as well in
practice. In addition, the prime displacement hashing can
easily be used in conjunction with a skewed associative
cache, which uses multiple hashing functions to further
distribute the cache accesses across the sets in the cache.
The prime displacement hashing outperforms XOR-based
hashing used in prior skewed associative caches. In some
cases, the skewed associative L2 cache with prime
displacement hashing is able to eliminate more misses
compared to using a single hashing function. It shows an
average speedup of 1.35 for applications that have non-uniform
cache accesses. However, it introduces some pathological
behavior that slows down four applications by up to 7%.
Therefore, an L2 cache with our prime modulo or prime
displacement hashing functions is a promising alternative.
References
[1] A. V. Aho and J. D. Ullman. Principles of Compiler Design, chapter
7.6, pages 434–8. Addison-Wesley, 1997.
[2] David F. Bacon, Jyh-Herng Chow, Dz-ching R. Ju, Kalyan
Muthukumar, and Vivek Sarkar. A Compiler Framework for
Restructuring Data Declarations to Enhance Cache and TLB
Effectiveness. In Proceedings of CASCON'94, pages 270–282, October
1994.
[3] J. E. Barnes. Treecode. Institute for Astronomy, University of
Hawaii. 1994. ftp://hubble.ifa.hawaii.edu/pub/barnes/treecode.
[4] F. Bodin and A. Seznec. Skewed-associativity improves perfor-
mance and enhances predictability. IEEE Transactions on Comput-
ers, 1997.
[5] P. Budnick and D. J. Kuck. Organization and use of parallel
memories. IEEE Transactions on Computers, pages 435–441, Dec
1971.
[6] J. Dongarra, V. Eijkhout, and H. van der Vorst. SparseBench: A
Sparse Iterative Benchmark.
http://www.netlib.org/benchmark/sparsebench.
[7] J. M. Frailong, W. Jalby, and J. Lenfant. Xor-schemes: a flexible data
organization in parallel memories. In Proceedings of the International
Conference on Parallel Processing, 1985.
[8] D. T. Harper III and J. R. Jump. Vector access performance in
parallel memories using a skewed storage scheme. In IEEE Transactions
on Computers, Dec. 1987.
[9] V. Krishnan and J. Torrellas. A Direct-Execution Framework for
Fast and Accurate Simulation of Superscalar Processors. In In-
ternational Conference on Parallel Architectures and Compilation
Techniques, pages 286–293, October 1998.
[10] D. H. Lawrie and C. R. Vora. The prime memory system for array
access. IEEE Transactions on Computers, 31(5), May 1982.
[11] W. F. Lin, S. K. Reinhardt, and D. C. Burger. Reducing dram laten-
cies with a highly integrated memory hierarchy design. In Proceed-
ings of the International Symposium on High-Performance Com-
puter Architecture (HPCA), 2001.
[12] NAS Parallel Benchmark.
http://www.nas.nasa.gov/pubs/techreports/ nasreports/nas-98-009/.
[13] W. H. Payne, J. R. Rabung, and T. P. Bogyo. Coding the lehmer
pseudo-random number generator. Communication of ACM, 12(2),
1969.
[14] R. Raghavan and J. Hayes. On randomly interleaved memories. In
Supercomputing, 1990.
[15] B. R. Rau. Pseudo-randomly interleaved memory. In
Proceedings of the 18th International Symposium on Computer Architecture
(ISCA), 1991.
[16] B. R. Rau, M. Schlansker, and D. Yen. The cydra 5 stride-
insensitive memory system. In Proceedings of the International
Conference on Parallel Processing, 1989.
[17] H. Riesel. Prime Numbers and Computer Methods for Factorization
2nd Ed., pages 270–2. Birkhäuser, 1994.
[18] A. Seznec. A case for two-way skewed associative caches. In Pro-
ceedings of the 20th International Symposium of Computer Archi-
tecture, 1993.
[19] A. Seznec. A new case for skewed-associativity. IRISA Technical
Report #1114, 1997.
[20] G. S. Sohi. Logical data skewing schemes for interleaved memo-
ries in vector processors. In University of Wisconsin-Madison Com-
puter Science Technical Report #753, 1988.
[21] Standard Performance Evaluation Corporation.
http://www.spec.org.
[22] N. Topham, A. Gonzalez, and J. Gonzalez. The design and perfor-
mance of a conﬂict-avoiding cache. In International Symposium on
Microarchitecture, pages 71–80, Dec. 1997.
[23] N. Topham, A. Gonzalez, and J. Gonzalez. Eliminating cache con-
ﬂict misses through xor-based placement functions. In Interna-
tional Conference on Supercomputing, 1997.
[24] P.-C. Wu. Multiplicative, congruential random number generators
with multiplier $\pm 2^{k_1} \pm 2^{k_2}$ and modulus $2^p - 1$. ACM Transactions
on Mathematical Software, 23(2):255–65, 1997.
[25] Q. Yang and L. W. Yang. A novel cache design for vector
processing. In Proceedings of the International Symposium on Computer
Architecture, pages 362–371, May 1992.
[26] Z. Zhang, Z. Zhu, and X. Zhang. A permutation-based page in-
terleaving scheme to reduce row-buffer conﬂicts and exploit data
locality. In Proceedings of the International Symposium on Mi-
croarchitecture, 2000.