Conflict Avoiding Caches Invite New Data Layout Optimizations by NOOTAERT, B et al.
Conflict-Avoiding Caches Invite New Data Layout Optimizations
Bavo Nootaert Hans Vandierendonck Koen De Bosschere
Dept. of Electronics and Information Systems
Ghent University, Sint-Pietersnieuwstraat 41, 9000 Gent, Belgium
E-mail: {bnootaer,hvdieren,kdb}@elis.UGent.be
Abstract
Cache performance can be seriously degraded by con-
flict misses, which occur when too many addresses in
the working set are mapped to the same sets of the
cache. Past research has investigated a myriad of techniques
to remove conflict misses. Software-driven optimizations
(e.g. padding and blocking) reorganize data or statements
in a program in order to improve locality. Hardware
optimizations include hashed indexing, in which the set
index is computed using an XOR-based hash function
instead of the conventional modulo indexing.
It has been repeatedly shown that both software and
hardware optimizations can effectively remove conflict
misses.
We confirm on a set of numerical kernels that hash-
ing removes most conflict misses and that it avoids
unusually high miss rates, which occur for patholog-
ical data layouts. Second, we show that data layout
optimizations such as intra-variable padding do not
consistently outperform caches with hashing. Further-
more, these optimizations provide only marginal im-
provements for caches with hashing.
Caches with hashed set index functions allow new
data layout optimizations which are meaningless in
modulo-indexed caches. This paper introduces base ad-
dress optimization and shows that the number of con-
flict misses can be reduced by over 15% on average
for numerical kernels. Optimizing the base address is
at least as powerful as intra-variable padding and it
removes additional misses when applied together with
intra-variable padding or blocking.
1. Introduction
Caches hide the ever-increasing latency of the main
memory by providing quick access to the most-recently
accessed data. By adding multiple levels of cache, and
by increasing cache size, very high latencies can be hid-
den. However, high cache miss rates remain an imped-
iment to high performance as much time can be wasted
waiting for the memory hierarchy.
Cache misses are traditionally classified as either
compulsory, capacity or conflict misses [5]. The conflict
misses constitute a significant part of the cache misses.
They can be severe (> 50%) in some situations [4, 18].
The root cause of conflict misses is a mismatch be-
tween the addresses in the working set of a program
and the way these addresses are mapped to the cache
sets. Conflict misses occur when too many addresses
are mapped to the same set. Data layouts with such
properties are called pathological layouts [9]. There
are generally speaking two ways to tackle this prob-
lem: (i) change the addresses in the working set or
(ii) change the mapping of the addresses to the cache
sets. The first is a typical software optimization, while
the latter is a pure hardware optimization.
Software optimizations to eliminate conflict misses
typically involve some data layout or code transforma-
tion. Software optimizations tackle stride patterns in
particular, because strided programs are much more
prone to mapping conflicts, as mapping conflicts nec-
essarily occur at the same rate for all stride elements.
Padding [11] avoids unfavorable strides by increasing
the distance between conflicting addresses, i.e. an un-
favorable stride is changed to a favorable one. Loop
transformations can improve temporal locality and de-
crease the number of non-unit strides [7, 8, 22]. The ad-
vantage of compiler transformations is that they need
only be applied when required, and can be tailored
to the specific program at hand. The analysis of a
program can be difficult however, and sometimes the
heuristics just fail [11].
Conflict-avoiding caches attempt to map all stride
patterns without conflicts [10], allowing acceptable
conflict miss rates for all data layouts. They rely
on hardware hashing of the address: each set index
bit is computed as the XOR of some of the address
bits [10, 20]. Such XOR-based hash functions were
shown to be very effective to eliminate conflicts in
caches [4, 17, 16, 18]. In practice, some pathological
data layouts may remain. These are further avoided
by the skewed-associative cache, which uses multiple
hash functions [13, 14].
Although both avenues are viable approaches to
eliminating conflict misses, each with its pros and cons,
little research has been performed to compare the two
methods or to investigate their interaction. It is not
known whether conflict-avoiding caches preclude data
layout optimizations or if data layout optimizations re-
main useful for conflict-avoiding caches. In this paper,
we compare the performance of intra-variable padding
to hardware hashing using a set of numerical kernels.
Our results show that hardware hashing schemes are
at least competitive with padding: every method removes
most conflicts on some kernel. Furthermore, when the
cache implements XOR-based hash functions, the usefulness
of padding is strongly reduced, as most conflict
misses are already avoided.
We also investigate the interaction of blocking and
hashing in the cache. Clearly, blocking is more pow-
erful as it reduces both capacity misses and conflict
misses. However, when conflict misses remain after ap-
plying blocking, then XOR-based hashing can remove
them.
The applicable data layout optimizations depend
strongly on the cache organization. Modulo-indexed
caches are sensitive to particular pathological strides,
so data layout optimizations must avoid these strides.
Caches implementing XOR-based hash functions are
insensitive to the stride, but they allow a new opti-
mization: base address optimization. This possibility
stems from the properties of the XOR-functions: by
XOR-ing high-order address bits with low-order ad-
dress bits to compute the set index, the mapping of
addresses to set indices becomes strongly dependent
on the exact base address of a data structure. In con-
trast, modulo-indexed caches incur the same number
of conflicts regardless of the higher-order address bits.
Absolute placement, rather than the internal structure
of the data, becomes crucial to exploiting the capabil-
ities of XOR-based hashing.
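
The contrast can be illustrated with a small sketch. It rests on assumptions that are ours, not the paper's: the hash is a polynomial hash with polynomial 515 (one of the polynomials listed later in Table 1) applied to the cache-line address, and the helper names `poly_mod` and `sets_touched` are hypothetical. The sketch counts how many distinct sets a strided sequence of cache lines occupies. With modulo indexing, moving the base by whole cache lines only rotates the set indices, so the count cannot change; an XOR-based index offers no such guarantee.

```python
NUM_SETS = 512   # direct mapped cache with 512 lines
POLY = 515       # x^9 + x + 1, encoded by substituting 2 for x (assumed hash)


def poly_mod(a: int, p: int) -> int:
    """Remainder of A(x) divided by P(x), coefficients in {0, 1}."""
    deg_p = p.bit_length() - 1
    while a.bit_length() - 1 >= deg_p:
        # Cancel the leading term of a with a shifted copy of p.
        a ^= p << (a.bit_length() - 1 - deg_p)
    return a


def modulo_index(line: int) -> int:
    """Conventional bit selection: the low-order bits of the line address."""
    return line % NUM_SETS


def xor_index(line: int) -> int:
    """XOR-based index: every index bit is an XOR of line-address bits."""
    return poly_mod(line, POLY)


def sets_touched(index_fn, base_line: int, stride_lines: int, count: int) -> int:
    """Distinct sets used by `count` cache lines spaced `stride_lines` apart."""
    return len({index_fn(base_line + i * stride_lines) for i in range(count)})


if __name__ == "__main__":
    # A stride of exactly one cache size is the worst case for modulo indexing.
    for base in (0, 1, 37, 4096):
        print(base,
              sets_touched(modulo_index, base, NUM_SETS, 8),
              sets_touched(xor_index, base, NUM_SETS, 8))
```

For base 0, the eight lines fall into a single set under modulo indexing and into eight different sets under the assumed XOR index.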
We analyze the performance potential of base ad-
dress optimization and show an average reduction of
conflict misses by 15%. Furthermore, we show that
base address optimization outperforms data layout op-
timizations such as padding. Applying both base ad-
dress optimization and padding yields marginal im-
provements. Finally, base address optimization yields
additional improvements over blocking.
2. Related work
The use of alternative mapping schemes in caches
was advocated in [12]. Different hash functions have
been presented by several authors. Some methods
based on prime numbers have been proposed in [23]
and [6] but the majority of work has focused on XOR-
based hash functions. XOR-based hash functions form
a class of functions with the basic property that ev-
ery bit of the set index is the exclusive-OR of some
of the address bits. Since only XOR-gates are needed,
implementation is particularly simple and fast. Good
performance results have been reported for these func-
tions [4, 17, 16, 18].
Polynomial hash functions [10] are a special sub-
class, defined by a polynomial with coefficients in
{0, 1}. A polynomial P(x) can be uniquely represented
as the integer obtained by substituting 2 for x. E.g.,
the polynomial x^8 + 1 is referred to as ‘257’. An address
A, corresponding to the polynomial A(x), is mapped to
the set index corresponding to A(x) mod P(x), i.e. the
remainder of the polynomial division of A(x) by P(x).
Polynomial hash functions are implemented in exactly
the same way as any XOR-based hash function, but
they generally need a larger number of inputs for the
XOR-gates [19].
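
As an illustration, the remainder A(x) mod P(x) can be computed with shifts and XORs. The sketch below also derives, for each set-index bit, the mask of address bits that are XOR-ed together, which shows why a polynomial hash is implemented like any other XOR-based function. The function names and the 16-bit address width are assumptions made for this example only.

```python
def poly_mod(a: int, p: int) -> int:
    """Remainder of A(x) divided by P(x), coefficients in {0, 1}."""
    deg_p = p.bit_length() - 1
    while a.bit_length() - 1 >= deg_p:
        a ^= p << (a.bit_length() - 1 - deg_p)
    return a


def xor_masks(p: int, addr_bits: int) -> list[int]:
    """masks[i] selects the address bits whose XOR forms set-index bit i."""
    index_bits = p.bit_length() - 1
    masks = [0] * index_bits
    for j in range(addr_bits):
        column = poly_mod(1 << j, p)        # image of address bit j alone
        for i in range(index_bits):
            if (column >> i) & 1:
                masks[i] |= 1 << j
    return masks


def xor_hash(a: int, masks: list[int]) -> int:
    """Evaluate the hash as one parity (XOR reduction) per index bit."""
    index = 0
    for i, mask in enumerate(masks):
        parity = bin(a & mask).count("1") & 1
        index |= parity << i
    return index


if __name__ == "__main__":
    masks = xor_masks(257, addr_bits=16)
    # For x^8 + 1, each index bit i XORs address bits i and i + 8.
    print([hex(m) for m in masks])
    print(poly_mod(0x1234, 257), xor_hash(0x1234, masks))
```

The number of set bits in each mask is the fan-in of the corresponding XOR gate, which is the quantity the preceding paragraph refers to.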
The hashing scheme that is traditionally used, se-
lecting some bits from the address, will be referred to
as ‘bit selection’ or ‘modulo-based hashing’.
A skewed-associative cache [13, 14] consists of sev-
eral banks. Every bank is accessed using a different
hash function. The idea is that addresses that conflict
under one hash function are less likely to conflict
under the others. This is called inter-bank dispersion.
How to select hash functions that maximize inter-bank
dispersion is described in [20].
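
Inter-bank dispersion can be made concrete with a short sketch. It assumes, for illustration only, a 2-way skewed-associative cache whose banks use the polynomial hashes 285 and 501 (the pair listed in Table 1) on the cache-line address; the helper names are hypothetical. Two lines that differ by the polynomial of bank 0 collide in bank 0, yet land in different locations in bank 1.

```python
BANK_POLYS = (285, 501)   # assumed per-bank polynomials (2-way skewed cache)


def poly_mod(a: int, p: int) -> int:
    """Remainder of A(x) divided by P(x), coefficients in {0, 1}."""
    deg_p = p.bit_length() - 1
    while a.bit_length() - 1 >= deg_p:
        a ^= p << (a.bit_length() - 1 - deg_p)
    return a


def bank_indices(line: int) -> tuple[int, ...]:
    """Location of a cache line in each bank of the skewed cache."""
    return tuple(poly_mod(line, p) for p in BANK_POLYS)


def conflicting_banks(line_a: int, line_b: int) -> list[int]:
    """Banks in which the two lines compete for the same location."""
    pairs = zip(bank_indices(line_a), bank_indices(line_b))
    return [bank for bank, (ia, ib) in enumerate(pairs) if ia == ib]


if __name__ == "__main__":
    a = 1000
    b = a ^ 285   # differs from a by P0(x), so bank 0 maps both lines alike
    print(bank_indices(a), bank_indices(b), conflicting_banks(a, b))
```

Because both lines still have a conflict-free location in bank 1, they can reside in the cache simultaneously.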
Several authors have tried to explain the effective-
ness of hashing by examining special working sets.
Rau [10] gives several properties of polynomial hashing
on stride patterns in the context of multi-banked mem-
ories. Vandierendonck and De Bosschere [20] charac-
terize XOR-based hash functions with minimal number
of conflicts for vector spaces. Although these studies
explain how hashing can improve the miss rate on some
working sets, they make assumptions on the placement
of the data in memory that allow some properties to be
proved mathematically, but that are unrealistic in practice.
Frailong et al. [3] defend these assumptions by arguing
that any working set may be decomposed into vector
spaces, on which their theory can be applied. We
found from experiments that this decomposition de-
pends heavily on the base address, supporting our case
that placement can have a significant impact on the
cache behavior.
Work similar to ours has been carried out by Bodin
and Seznec [1]. They investigated the effect of skewing
on performance predictability for a number of numer-
ical kernels, including some blocked algorithms. Our
approach differs in that we focus on data layout trans-
formations, and do not limit ourselves to either skewing
or bit-selection.
Tuning placement to improve cache performance has
been studied in [2, 15] for conventional modulo-based
caches.
3. Experimental Setup
The goal of this paper is to determine the interaction
between source code optimizations and cache enhance-
ments based on hashing. This section describes the
experimental setup: the evaluated kernels, the cache
enhancements and the source code optimizations.
3.1. Kernels
We have selected four numerical kernels that oper-
ate on matrices and measured the number of misses
through simulation.
1. Matrix multiplication. This kernel operates on
three matrices. It is structured so that all refer-
ences are accessed with unit stride by the enclosing
loop.
2. Transposed matrix multiplication. This variation
on the first kernel has one reference that is ac-
cessed with non-unit stride by the enclosing loop.
3. Gaussian elimination (without pivoting). This
kernel operates on only one matrix.
4. K23 (the 23rd Livermore Fortran Kernel). This
short kernel operates on six matrices, of which one
is accessed multiple times within each iteration.
A detailed structure of these kernels is given in Fig-
ure 1. All matrices are square, except those in K23, and
laid out in column-major order. Temporaries and coun-
ters are register-allocated. No additional optimizations
were performed by the compiler.
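
The paper does not describe its simulator, so the following is only a minimal sketch of how miss counts for such a kernel could be obtained: an LRU set-associative cache with a pluggable index function, driven by the address trace of the matrix multiplication kernel. The class and function names, the number of references counted per statement, and the placement of the matrices (A, B, C back to back, each occupying ld × S elements of 8 bytes) are assumptions made for the example.

```python
from collections import OrderedDict

LINE_BYTES = 32


class SetAssociativeCache:
    """LRU set-associative cache; ways=1 gives a direct mapped cache."""

    def __init__(self, num_lines=512, ways=1, index_fn=None):
        self.ways = ways
        self.num_sets = num_lines // ways
        # Default index function: conventional modulo indexing.
        self.index_fn = index_fn or (lambda line: line % self.num_sets)
        self.sets = [OrderedDict() for _ in range(self.num_sets)]
        self.accesses = 0
        self.misses = 0

    def access(self, byte_addr):
        line = byte_addr // LINE_BYTES
        lines_in_set = self.sets[self.index_fn(line)]
        self.accesses += 1
        if line in lines_in_set:
            lines_in_set.move_to_end(line)          # mark most recently used
        else:
            self.misses += 1
            if len(lines_in_set) >= self.ways:
                lines_in_set.popitem(last=False)    # evict least recently used
            lines_in_set[line] = True


def matmul_trace(size, ld, base=0, elem=8):
    """Byte addresses touched by the matrix multiplication kernel (assumed order)."""
    a, b, c = base, base + ld * size * elem, base + 2 * ld * size * elem

    def addr(matrix, i, j):                         # column-major, 1-based
        return matrix + ((j - 1) * ld + (i - 1)) * elem

    for j in range(1, size + 1):
        for i in range(1, size + 1):
            yield addr(c, i, j)                     # C(i,j)=0
        for k in range(1, size + 1):
            yield addr(b, k, j)                     # t=B(k,j)
            for i in range(1, size + 1):
                yield addr(c, i, j)                 # read C(i,j)
                yield addr(a, i, k)                 # read A(i,k)
                yield addr(c, i, j)                 # write C(i,j)


if __name__ == "__main__":
    cache = SetAssociativeCache(num_lines=512, ways=1)
    for address in matmul_trace(size=16, ld=16):
        cache.access(address)
    print(cache.misses / cache.accesses)
```

Passing a different `index_fn` (for example, a polynomial hash of the line address) turns the same model into an XOR-indexed cache.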
3.2. Cache Organizations and Enhancements
We simulated direct mapped caches, 2- and 4-way
set-associative caches (each with modulo- and XOR-based
hashing), and 2- and 4-way skewed-associative
caches with XOR-based skewing functions. The replacement
policies are LRU for the set-associative
caches, and ENRU [14] for the skewed-associative
caches. We chose irreducible polynomials for the XOR-based
hash functions. An overview of the organizations
is given in Table 1.

Table 1. Cache parameters.

Parameters
  Number of cache lines       512
  Cache line size             32 bytes
  Total cache size            16 kB
  Data per cache line         4

Polynomials
  Direct mapped               515
  2-way set-associative       285
  2-way skewed-associative    285 and 501
  4-way set-associative       131
  4-way skewed-associative    131, 157, 181 and 239
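
The parameters in Table 1 are mutually consistent, which the following sketch checks. It assumes that each polynomial acts on the cache-line address and that its degree equals the number of set-index bits; the 8-byte element size is inferred from 32-byte lines holding 4 data items. The names are hypothetical.

```python
NUM_LINES = 512
LINE_BYTES = 32
DATA_PER_LINE = 4

# (organization, associativity) -> polynomials, as listed in Table 1
POLYNOMIALS = {
    ("direct mapped", 1): [515],
    ("set-associative", 2): [285],
    ("skewed-associative", 2): [285, 501],
    ("set-associative", 4): [131],
    ("skewed-associative", 4): [131, 157, 181, 239],
}


def index_bits(ways: int) -> int:
    """Number of set-index bits for the given associativity."""
    num_sets = NUM_LINES // ways
    return num_sets.bit_length() - 1


def degree(poly: int) -> int:
    """Degree of a polynomial encoded by substituting 2 for x."""
    return poly.bit_length() - 1


if __name__ == "__main__":
    print("cache size (bytes):", NUM_LINES * LINE_BYTES)
    print("element size (bytes):", LINE_BYTES // DATA_PER_LINE)
    for (organization, ways), polys in POLYNOMIALS.items():
        print(organization, ways, index_bits(ways), [degree(p) for p in polys])
```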
3.3. Experiments
There are several characteristics of a program that
may affect cache behavior, but are independent of the
program semantics. We based our choice of experi-
ments on the following observations.
• In many applications the size of the matrix is not
equal to its leading dimension. E.g., a matrix is
sometimes allocated with predetermined dimen-
sions, while only a part of it is actually used. Or,
in blocked algorithms, a kernel essentially operates
on small submatrices of a large matrix. So the
leading dimension of a matrix is an important pa-
rameter for our study. It may seem more natural
to vary the matrix size, but this makes interpre-
tation of the data more difficult because the ratio
of the working set size to the average number of
misses is not constant.
• Varying the relative placement of the matrices
may change the amount of cross-interference. Us-
ing Monte-Carlo simulations, where the matrices
are placed at random, we can quantify the perfor-
mance improvements that can be made by inter-
variable padding.
• Intra-variable padding is very effective at eliminat-
ing conflicts [11, 21]. By increasing the dimensions
of a matrix, strides that lead to poor performance
are avoided.
• Blocking can reduce the miss rate significantly, but
it still depends greatly on the matrix dimensions.
Since there are fewer capacity misses, hashing might
have a larger impact.
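
To see why the leading dimension matters, consider walking along one row of a column-major matrix: consecutive elements are one leading dimension apart. The sketch below counts the sets such a walk occupies in a modulo-indexed direct mapped cache with the parameters of Table 1. The base address 0, the choice of row 1 and the function names are assumptions made for the example.

```python
NUM_SETS = 512
LINE_BYTES = 32
ELEMENT_BYTES = 8


def element_address(base: int, ld: int, i: int, j: int) -> int:
    """Byte address of A(i,j): 1-based, column-major, leading dimension ld."""
    return base + ((j - 1) * ld + (i - 1)) * ELEMENT_BYTES


def sets_used_by_row(ld: int, size: int = 80, row: int = 1, base: int = 0) -> int:
    """Distinct sets touched by one row of `size` elements (modulo indexing)."""
    sets = set()
    for j in range(1, size + 1):
        line = element_address(base, ld, row, j) // LINE_BYTES
        sets.add(line % NUM_SETS)
    return len(sets)


if __name__ == "__main__":
    print(sets_used_by_row(128))   # 16: the 80 elements share only 16 sets
    print(sets_used_by_row(129))   # 80: one padding row removes the conflicts
```

A leading dimension of 128 makes the stride a multiple of 32 cache lines, so the row folds onto 16 sets. Padding by a single row spreads the same 80 elements over 80 sets.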
Matrix multiplication

      dimension A(S,S), B(S,S), C(S,S)
      do j=1,S
        ! clear column j of C
        do i=1,S
          C(i,j)=0
        enddo
        do k=1,S
          t=B(k,j)
          ! unit-stride update of column j of C
          do i=1,S
            C(i,j)=C(i,j) + t*A(i,k)
          enddo
        enddo
      enddo

Transposed matrix multiplication

      dimension A(S,S), B(S,S), C(S,S)
      do j=1,S
        do i=1,S
          t=0
          ! B(j,k) is accessed with non-unit stride
          do k=1,S
            t=t+A(k,i)*B(j,k)
          enddo
          C(i,j)=t
        enddo
      enddo

Gaussian elimination

      dimension A(S,S)
      do j=1,S-1
        t1=A(j,j)
        do i=j+1,S
          t2=A(i,j)/t1
          A(i,j)=t2
          do k=j+1,S
            A(i,k)=A(i,k)-t2*A(j,k)
          enddo
        enddo
      enddo

K23

      parameter(jmax=7)
      dimension A(jmax,S),B(jmax,S), &
                R(jmax,S),U(jmax,S), &
                V(jmax,S),Z(jmax,S)
      do j=2,jmax-1
        do k=2,S-1
          t=A(k,j+1)*R(k,j)+A(k,j-1)*B(k,j) &
            +A(k+1,j)*U(k,j)+A(k-1,j)*V(k,j) &
            +Z(k,j)
          A(k,j)=A(k,j)+0.175*(t-A(k,j))
        enddo
      enddo

Figure 1. The four kernels. The matrices are stored in the same order as they are declared. Depending
on the experiment, the dimension may be larger than the size S.
Table 2. The five experiments.

Experiment   Variant      Gaps     Transformation
ld           lead. dim.   0        -
size         size         0        -
rnd          lead. dim.   random   -
ld+p         lead. dim.   0        padding
ld+b         lead. dim.   0        blocking

We selected five of the various possible combinations
for our experiments, labeled ld, size, rnd, ld+p and
ld+b, and given by Table 2.
Matrix (column) sizes and leading dimensions vary
from 80 to 150. When varying the leading dimension
(experiments ld, rnd, ld+p and ld+b), the size is
kept fixed at 80. Matrices are aligned to an 8-byte
boundary, so the elements are never distributed over
more than one cache line. In experiments size, ld,
ld+p and ld+b, there are no gaps between the ma-
trices, except for some space resulting from the addi-
tional rows when the leading dimension is larger than
the size. The matrices form a single data structure,
which is placed in memory as a whole. The order is fixed
as indicated in Figure 1. The data layout for matrix
multiplication is depicted in Figure 2. It is analogous
for the other kernels.
Figure 2. The data layout for matrix multipli-
cation used in the experiments.
For the caches with XOR-based hashing, we picked
50,000 random base addresses out of 1,000,000. Simu-
lating 4 base addresses suffices for modulo-based hash-
ing. For experiment rnd, we again simulated 50,000
placements, but this time we inserted random gaps be-
tween the matrices, besides selecting the base address
at random. We kept all addresses within the same
range as for the other experiments. The same experi-
ment was conducted for modulo-based hashing.
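
One plausible reading of why so few base addresses suffice for modulo-based hashing is that shifting the whole data structure by a multiple of the cache line size merely rotates the set indices, so only the offset within a 32-byte line (four possibilities under 8-byte alignment) can change the conflict pattern. The sketch below checks this property for a strided footprint and shows one way to draw the random base addresses. The unit of the base addresses (8-byte slots), the seed and the function names are assumptions.

```python
import random
from collections import Counter

NUM_SETS, LINE_BYTES, ELEMENT_BYTES = 512, 32, 8


def occupancy_signature(base: int, num_elements: int, stride_elements: int):
    """Sorted per-set line counts of a strided footprint (modulo indexing)."""
    lines = {(base + k * stride_elements * ELEMENT_BYTES) // LINE_BYTES
             for k in range(num_elements)}
    per_set = Counter(line % NUM_SETS for line in lines)
    return sorted(per_set.values())


def sample_bases(population: int = 1_000_000, k: int = 50_000, seed: int = 0):
    """Draw k distinct 8-byte-aligned base addresses (assumed unit: 8-byte slots)."""
    rng = random.Random(seed)
    return [slot * ELEMENT_BYTES for slot in rng.sample(range(population), k)]


if __name__ == "__main__":
    reference = occupancy_signature(8, 6400, 1)
    shifted = occupancy_signature(8 + 123 * LINE_BYTES, 6400, 1)
    print(reference == shifted)
    print(len(sample_bases()))
```

The same shortcut is not available for XOR-based hashing, because a shift of the base address is not a simple rotation of the hashed indices.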
In experiment ld+p, we allowed a padding of 0 to
3 rows, and selected the dimension where the miss rate
is lowest when averaged over all base addresses.
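
The selection rule can be written down directly. In the sketch below, `miss_rates_for` is a hypothetical callback that stands in for the simulator: given a leading dimension, it returns the miss rates observed over all simulated base addresses.

```python
from statistics import mean
from typing import Callable, Sequence


def best_padded_dimension(ld: int,
                          miss_rates_for: Callable[[int], Sequence[float]],
                          max_pad: int = 3) -> int:
    """Leading dimension in [ld, ld + max_pad] with the lowest miss rate,
    averaged over all base addresses."""
    candidates = range(ld, ld + max_pad + 1)
    return min(candidates, key=lambda d: mean(miss_rates_for(d)))


if __name__ == "__main__":
    # Made-up miss rates, for illustration only.
    fake = {128: [0.5, 0.6], 129: [0.1, 0.2], 130: [0.1, 0.1], 131: [0.3]}
    print(best_padded_dimension(128, lambda d: fake[d]))
```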
The matrices are too large to fit entirely in the cache.
Reuse is primarily located in the inner loops, which iterate
repeatedly over individual rows or columns. Blocking,
which reduces capacity misses, is studied in experiment
ld+b. We experimentally determined that a
blocking factor of 16 gives optimal results. A 16 × 16 tile
occupies 12.5% of the cache.
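
The blocked code itself is not listed in the paper. As a hedged sketch of the general technique, the following tiles all three loops of the matrix multiplication kernel with a blocking factor `tile`, and reproduces the footprint arithmetic for a 16 × 16 tile of 8-byte elements in a 16 kB cache. The loop order, the list-of-rows storage and the names are assumptions made for the example.

```python
TILE = 16
ELEMENT_BYTES = 8
CACHE_BYTES = 512 * 32


def tile_fraction(tile: int = TILE) -> float:
    """Fraction of the cache occupied by one tile x tile block."""
    return tile * tile * ELEMENT_BYTES / CACHE_BYTES


def matmul_blocked(a, b, n, tile=TILE):
    """C = A * B for n x n matrices (lists of rows), processed tile by tile."""
    c = [[0.0] * n for _ in range(n)]
    for jj in range(0, n, tile):
        for kk in range(0, n, tile):
            for ii in range(0, n, tile):
                # Work on one tile of C using one tile of A and one of B.
                for j in range(jj, min(jj + tile, n)):
                    for k in range(kk, min(kk + tile, n)):
                        t = b[k][j]
                        for i in range(ii, min(ii + tile, n)):
                            c[i][j] += t * a[i][k]
    return c


if __name__ == "__main__":
    print(tile_fraction())   # 0.125
    a = [[1, 2], [3, 4]]
    b = [[5, 6], [7, 8]]
    print(matmul_blocked(a, b, 2, tile=1))
```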
3.4. Metrics
We are primarily interested in minimizing the miss
rate (both conflict and capacity misses). When using
hashing, the base address has an important impact on
the miss rate. The sensitivity of the miss rate to the
base address is measured as the improvement that can
be made by optimizing the base address relative to the
average case, i.e., the sensitivity equals one minus the
minimum miss rate over all base addresses divided by
the average miss rate.
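
Written out, the metric is a one-line computation over the miss rates measured for the individual base addresses. The function name is ours.

```python
from statistics import mean


def sensitivity(miss_rates):
    """One minus the minimum miss rate divided by the average miss rate."""
    return 1.0 - min(miss_rates) / mean(miss_rates)


if __name__ == "__main__":
    # Made-up miss rates for three base addresses, for illustration only.
    print(sensitivity([0.10, 0.20, 0.30]))
```

A sensitivity of 0 means that no base address does better than the average, while a value close to 1 means that the best base address removes almost all misses of the average case.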
4. Results
We perform each of the 5 experiments on several
cache organizations (direct mapped, set-associative
and skewed-associative). This is repeated for every ker-
nel. The result of each experiment is a graph such as
Figure 3. Here, experiment ld is performed on a di-
rect mapped cache for the matrix multiplication kernel.
The horizontal axis shows the leading dimension and
the vertical axis shows the miss rate, averaged over
all base addresses. We collected the minimum, av-
erage and maximum miss rate, and the sensitivity in
this form. The data is further summarized into tables
such as Table 3, which show the results for each cache
organization, averaged over all leading dimensions (or
sizes).
4.1. Matrix multiplication
Table 3 provides the minimum, average and maxi-
mum miss rates and the sensitivity for matrix multi-
plication.
The miss rate for matrix multiplication is rather sta-
ble in all experiments, on all cache configurations (Ta-
ble 3). For ld on a direct mapped cache, all average
miss rates are between 8.6% and 13.0% for modulo-
based hashing and between 8.8% and 9.3% for XOR-
based hashing (see Figure 3).
Although the base address has some effect for XOR-based
hashing, the sensitivity is low compared to the
other kernels. This is because the arrays in the inner
loops are accessed with stride one, which is mapped
optimally by modulo-based hashing, and almost optimally
by XOR-based hashing (i.e. it shows short-term
equidistribution [10]). For the caches and matrix
sizes used in this paper, experiments show that individual
columns are mapped conflict-free by both hashing
schemes (not counting interference from other matrices).

Figure 3. The average miss rates for matrix
multiplication on a direct mapped cache.
[Plot not reproduced. Axes: leading dimension (80 to 150) and miss rate (%); curves: modulo, XOR.]

Random placement leads to a moderate increase in
sensitivity for modulo-based hashing (from 1.19% to
2.82% on a direct mapped cache). There is not much
variation in overlap of the cache footprints. Relative
placement is even less important for XOR-based hash-
ing.
Neither set-associativity nor skewed-associativity
improves the miss rate significantly. Skewing is even
a little harmful for size.
These results indicate that hashing is of little use if
the code has a high spatial locality.
The miss rate drops when blocking is applied, but
the sensitivity increases, up to the point where it is
comparable to that of the blocked versions of the other
kernels.
4.2. Transposed matrix multiplication and Gaussian elimination
Transposed matrix multiplication and Gaussian
elimination are quite similar. An overview of the miss
rates and sensitivity is given in Tables 4 and 5. The ma-
trix dimensions that have a very large miss rate are the
same for both kernels, as depicted in Figure 4. XOR-
ing reduces the miss rate on these dimensions, but fails
to give great performance improvements on average:
the largest difference in favor of XOR is only 0.7%
(for transposed matrix multiplication). Intra-variable
padding is more effective at eliminating these severe
miss rates, as can be seen in Figure 5. It achieves a
greater reduction on the modulo-indexed cache than
on the XOR-indexed cache, because the extremes in
the miss rate for the former contribute considerably to
the average.

Table 3. Miss rates and sensitivity for matrix multiplication (all values in %).

          Modulo                     XOR                        Skewed
  Exp.    Min.  Avg.  Max.  Sens.    Min.  Avg.  Max.  Sens.    Min.  Avg.  Max.  Sens.
Direct mapped
  ld       9.1   9.2   9.2   1.19     8.8   9.1   9.4   3.37
  size     9.0   9.0   9.0   0.07     8.9   9.0   9.1   0.95
  rnd      8.9   9.1  10.4   2.82     8.8   9.1   9.6   3.52
  ld+p     8.8   9.0   9.0   1.31     8.7   9.0   9.3   3.51
  ld+b     1.7   1.9   1.9   6.39     1.4   1.8   2.3  17.94
2-way associative
  ld       8.7   8.8   8.8   1.16     8.6   8.8   8.8   2.20     8.6   8.8   8.9   1.94
  size     8.4   8.4   8.4   0.01     8.4   8.4   8.5   0.09     8.5   8.5   8.5   0.33
  rnd      8.7   8.8   8.9   1.22     8.6   8.8   8.9   2.21     8.6   8.8   8.9   1.94
  ld+p     8.7   8.8   8.8   0.73     8.5   8.8   8.9   2.91     8.6   8.7   8.9   2.22
  ld+b     1.3   1.4   1.5   6.96     1.2   1.3   1.5  11.10     1.2   1.3   1.4   8.09
4-way associative
  ld       8.7   8.8   8.8   1.16     8.7   8.8   8.8   1.37     8.6   8.7   8.8   1.45
  size     8.4   8.4   8.4   0.01     8.4   8.4   8.4   0.01     8.4   8.4   8.4   0.05
  rnd      8.7   8.8   8.8   1.15     8.7   8.8   8.8   1.36     8.6   8.7   8.8   1.43
  ld+p     8.7   8.8   8.8   1.16     8.6   8.8   8.8   1.92     8.6   8.7   8.8   1.22
  ld+b     1.3   1.4   1.4   5.89     1.2   1.3   1.4   7.81     1.2   1.2   1.3   6.76

Table 4. Miss rates and sensitivity for transposed matrix multiplication (all values in %).

          Modulo                     XOR                        Skewed
  Exp.    Min.  Avg.  Max.  Sens.    Min.  Avg.  Max.  Sens.    Min.  Avg.  Max.  Sens.
Direct mapped
  ld      19.6  19.8  19.9   1.14    16.5  19.6  26.5  14.98
  size    22.0  22.0  22.0   0.06    19.1  22.4  28.2  14.51
  rnd     19.3  19.6  19.9   2.02    16.5  19.6  26.6  15.26
  ld+p    15.0  15.3  15.4   1.39    14.7  16.4  23.9  10.22
  ld+b     3.1   3.3   3.4   4.29     3.0   3.4   4.2  10.74
2-way associative
  ld      16.0  16.1  16.2   1.11    14.1  15.4  17.5   8.02    13.4  13.7  13.9   2.31
  size    17.2  17.2  17.2   0.04    15.7  17.1  19.3   7.89    13.6  13.8  14.0   1.19
  rnd     15.9  16.1  16.3   1.51    14.1  15.4  17.6   8.17    13.4  13.7  14.0   2.48
  ld+p    13.0  13.2  13.3   1.23    13.6  14.4  16.2   6.08    13.2  13.6  14.0   2.99
  ld+b     2.1   2.2   2.3   4.75     1.9   2.1   2.4   9.99     1.9   2.0   2.1   5.55
4-way associative
  ld      14.5  14.7  14.8   1.08    13.4  14.3  15.2   4.57    13.0  13.2  13.3   1.41
  size    14.8  14.8  14.8   0.03    13.8  14.6  15.8   4.53    12.7  12.8  12.8   0.11
  rnd     14.5  14.7  14.8   1.09    13.4  14.2  15.2   4.62    13.0  13.2  13.3   1.41
  ld+p    13.1  13.2  13.3   1.09    12.8  13.2  13.5   2.95    12.8  13.2  13.3   2.59
  ld+b     2.0   2.0   2.1   3.58     1.8   1.9   2.0   5.33     1.8   1.9   1.9   4.22

The sensitivity of XOR-based hashing is high, how-
ever, for both of these kernels: about 15% and 17% for
transposed matrix multiplication and Gaussian elim-
ination, respectively, on a direct mapped cache, and
8% and 14.6%, respectively, on a 2-way set-associative
cache. It depends heavily on the leading dimension,
as depicted in Figure 6 for Gaussian elimination. This
large variation in miss rate, if exploited, is the key to
performance improvements of XOR-based hashing over
modulo-based hashing.
The benefits of optimizing the base address in ld are
about equal to or larger than those of applying only
inter-variable padding on both kernels.
In contrast to matrix multiplication, these kernels
are significantly affected by associativity. Set-associativity
reduces the miss rate for Gaussian elimination
by about 5 percentage points and by about 4 percentage
points for transposed matrix multiplication. The main
cause of the improvement on average by skewing (about
a quarter fewer misses compared to set-associativity,
for Gaussian elimination) is the avoidance of matrix
dimensions that lead to a lot of conflicts: the curve in
Figure 5 for the skewed-associative cache lies close to
the one for the modulo-indexed cache, except for the
peaks at certain strides. On average, skewing and
padding yield about the same miss rate.
The blocked versions still show some sharp peaks for
modulo-based hashing, but not for XOR-based hashing
(Figure 7). The average miss rates are approximately
equal, although significantly lower than for the non-blocked
versions. Again changing the base address may
cause a higher or lower miss rate. This is also the case
for modulo-based hashing. Skewing increases the miss
rate for Gaussian elimination. This may be due to a
suboptimal blocking factor: it has been observed that
optimal blocking factors differ on a skewed associative
cache [1].

Table 5. Miss rates and sensitivity for Gaussian elimination (all values in %).

          Modulo                     XOR                        Skewed
  Exp.    Min.  Avg.  Max.  Sens.    Min.  Avg.  Max.  Sens.    Min.  Avg.  Max.  Sens.
Direct mapped
  ld      14.9  15.0  15.1   0.67    12.2  14.8  21.5  16.92
  size    20.0  20.0  20.1   0.35    17.6  20.3  26.0  12.91
  ld+p    11.1  11.2  11.3   0.80    11.0  12.3  18.5  10.28
  ld+b     1.3   1.4   1.4   5.75     1.1   1.4   2.3  19.43
2-way associative
  ld       9.8   9.9  10.0   1.15     8.5  10.0  12.5  14.64     6.9   7.3   8.0   5.89
  size    12.5  12.6  12.6   0.70    12.8  14.4  16.9  11.04     9.2   9.6  10.4   4.42
  ld+p     6.7   6.8   6.9   1.23     7.8   9.0  11.2  12.88     6.7   7.1   8.0   6.26
  ld+b     1.0   1.1   1.1   5.87     0.9   1.1   1.4  12.55     1.3   1.4   1.5   7.93
4-way associative
  ld       8.4   8.5   8.6   1.08     7.3   8.4   9.7   9.73     6.3   6.5   6.6   2.49
  size    10.0  10.1  10.1   0.64     9.6  11.0  12.8  10.56     7.9   8.0   8.1   1.52
  ld+p     6.9   6.9   7.0   1.29     6.8   7.2   7.8   5.84     6.3   6.4   6.6   2.49
  ld+b     1.0   1.0   1.1   5.82     0.9   1.0   1.1   8.87     0.9   1.0   1.0   7.15

Figure 4. Similarities in the cache behavior of
transposed matrix multiplication and Gaussian elimination.
[Plots not reproduced. Panels: (a) Transposed matrix multiplication, (b) Gaussian elimination. Axes: leading dimension (80 to 150) and miss rate (%); curves: modulo, XOR.]

Figure 5. Effect of padding on the average
miss rates for transposed matrix multiplication
on a 2-way associative cache.
[Plots not reproduced. Panels: (a) Without padding, (b) With padding. Axes: leading dimension (80 to 150) and miss rate (%); curves: modulo, XOR, skewed.]

4.3. K23
In contrast to the other kernels, random placement
now does sharply increase the sensitivity for modulo-based
hashing: from 0.73% to 9.32% (see Table 6). The
effect is less pronounced for XOR-based hashing: the
difference is only 2.4%. As can be seen in Figure 8, the
sensitivity depends heavily on the leading dimension
for ld, but not for rnd. Also notice that the worst case
for modulo has a larger miss rate than the one for XOR
(48% against 31% on a direct mapped cache). So XOR-ing
can, to some extent, replace inter-variable padding.
Combining the two might not be beneficial if the base
address of the entire data structure is not optimized:
the maximum miss rates are high for both ld and rnd.

Skewing yields a somewhat higher miss rate than
set-associativity. It is also more sensitive to variations
in the base address.

Table 6. Miss rates and sensitivity for K23 (all values in %).

          Modulo                     XOR                        Skewed
  Exp.    Min.  Avg.  Max.  Sens.    Min.  Avg.  Max.  Sens.    Min.  Avg.  Max.  Sens.
Direct mapped
  ld      16.0  16.1  16.3   0.73    15.2  16.5  20.1   6.97
  size    15.9  15.9  16.0   0.14    14.9  16.3  19.8   7.60
  rnd     15.0  16.6  48.0   9.32    15.0  16.6  31.4   9.33
  ld+p    15.4  15.5  15.8   0.78    15.0  15.7  18.7   3.99
  ld+b    15.7  15.9  16.2   1.18    15.2  16.3  20.0   6.27
2-way associative
  ld      15.4  15.5  15.7   0.73    15.1  15.4  16.4   2.17    15.1  15.5  16.6   2.82
  size    15.3  15.3  15.3   0.14    14.8  15.3  16.3   2.84    14.9  15.4  16.4   3.49
  rnd     15.0  15.4  37.6   2.25    15.0  15.4  19.6   2.26    15.0  15.5  17.1   3.05
  ld+p    15.2  15.3  15.5   0.83    15.1  15.3  16.3   1.77    15.0  15.5  16.6   2.77
  ld+b    15.3  15.4  15.7   1.07    15.1  15.3  16.1   1.80    15.1  15.4  16.3   2.28
4-way associative
  ld      15.1  15.2  15.4   0.72    15.1  15.2  15.5   0.90    15.1  15.2  15.6   1.02
  size    15.1  15.1  15.1   0.14    14.8  15.0  15.2   1.02    14.9  15.2  15.5   1.68
  rnd     15.0  15.2  17.3   0.93    15.0  15.2  15.9   0.93    15.0  15.2  15.6   1.20
  ld+p    15.1  15.2  15.4   0.74    15.0  15.2  15.6   1.31    15.0  15.2  15.7   1.26
  ld+b    15.1  15.2  15.4   0.93    15.1  15.2  15.6   0.91    15.1  15.2  15.6   0.94

Figure 6. The variation of the sensitivity for
Gaussian elimination on a direct mapped
cache and on a 2-way associative cache (experiment ld).
[Plots not reproduced. Panels: (a) Direct mapped (curves: modulo, XOR), (b) 2-way associative (curves: modulo, XOR, skewed). Axes: leading dimension (80 to 150) and sensitivity (%).]

Figure 7. The average miss rates for blocked
Gaussian elimination on direct mapped and
2-way associative caches.
[Plots not reproduced. Panels: (a) Direct mapped (curves: modulo, XOR), (b) 2-way associative (curves: modulo, XOR, skewed). Axes: leading dimension (80 to 150) and miss rate (%).]

Figure 8. The sensitivity for K23 on a direct
mapped cache.
[Plots not reproduced. Panels: (a) ld, (b) rnd. Axes: leading dimension (80 to 150) and sensitivity (%); curves: modulo, XOR.]

4.4. Comparison of XOR, modulo and skewing
The effect of XOR-ing depends greatly on the ker-
nel and the data layout. Pure padding is more power-
ful than pure XOR-ing, but they both reduce the miss
rate compared to pure modulo-based hashing. Com-
bining padding and XOR-ing yields no large additional
improvements, and in some cases makes the situation
worse. If the base address is optimized and padding
is not possible, either XOR-ing or skewing is the better
choice. Which hashing scheme performs best varies
across experiments and across kernels. Table 7 indicates
the best hashing scheme for each experiment
and for each cache organization based on the average
miss rate. XOR-based hashing proves beneficial for
most kernels and cache organizations when no program
transformations are applied (experiments ld and rnd).
When program transformations are applied, modulo-
based hashing is the best choice except for one case:
the matrix multiplication kernel favors skewing the 2-
way associative cache. A different picture arises when
we compare hashing schemes using the minimum miss
rate across base addresses (Table 8). In this case we
assume that the base address of the matrices is optimized.
Now, XOR-based hashing results in fewer conflict
misses than modulo-based hashing in most situations.
Furthermore, skewing with base address optimization
outperforms modulo-based hashing in all
experiments but ld+b.
5. Conclusion
We have studied the effect of XOR-based hashing on
several numerical kernels. XOR-based hashing avoids
the worst conflicting situations that occur for modulo-
based hashing. However, this does not translate
into a performance benefit on average. Skewing, which
combines two independent hash functions, outperforms both
modulo-based and XOR-based hashing on a 2-way
set-associative cache.
The miss rate depends heavily on the base address
in a cache with XOR-based hash function, allowing
for the optimization of the base address. If this op-
portunity is adequately exploited, XOR-based hashing
might ultimately result in a net performance benefit
over modulo-based hashing.
The impact of padding is diminished, and optimal
placement is required to further improve the miss
rate. If both techniques are combined, the result can
compete with the combination of padding and
conventional bit selection.
Because blocked algorithms have by nature a low
miss rate on average, XOR-ing has little room for im-
provement. It does avoid the severe conflicts that are
still present for bit selection.
The results presented in this paper indicate that
non-standard cache organizations can profit from pro-
gram transformations. Some of the well-known opti-
mizations may prove of little use, while new ones may
emerge. In particular, this paper shows that optimiz-
ing the base address has a large potential to remove
conflict misses when the cache uses XOR-based hash
functions.
Acknowledgments
We would like to thank the anonymous referees
for their helpful comments. Hans Vandierendonck
is a postdoctoral researcher of the Fund for Scien-
tific Research-Flanders (FWO). This research was also
funded by Ghent University.
Table 7. Best hashing scheme based on average miss rates. Row DM compares direct mapped
caches, rows 2SA and 4SA compare set-associative caches, and rows 2SK and 4SK compare set-
and skewed-associative caches.

      Mat. Mult.                Trans. Mat. Mult.         Gauss                     K23
      ld    rnd   ld+p  ld+b    ld    rnd   ld+p  ld+b    ld    rnd   ld+p  ld+b    ld    rnd   ld+p  ld+b
DM    XOR   TIE   TIE   XOR     XOR   TIE   MOD   MOD     XOR   -     MOD   TIE     MOD   TIE   MOD   MOD
2SA   TIE   TIE   TIE   XOR     XOR   XOR   MOD   XOR     MOD   -     MOD   TIE     XOR   TIE   TIE   XOR
2SK   SK    SK    SK    XOR     SK    SK    MOD   XOR     SK    -     TIE   TIE     XOR   TIE   TIE   XOR
4SA   TIE   TIE   TIE   XOR     XOR   XOR   TIE   XOR     XOR   -     MOD   TIE     TIE   TIE   TIE   TIE
4SK   TIE   TIE   TIE   SK      SK    SK    TIE   TIE     SK    -     SK    TIE     TIE   TIE   TIE   TIE

Table 8. Best hashing scheme based on minimal miss rates. Row DM compares direct mapped
caches, rows 2SA and 4SA compare set-associative caches, and rows 2SK and 4SK compare set- and
skewed-associative caches.

      Mat. Mult.                Trans. Mat. Mult.         Gauss                     K23
      ld    rnd   ld+p  ld+b    ld    rnd   ld+p  ld+b    ld    rnd   ld+p  ld+b    ld    rnd   ld+p  ld+b
DM    XOR   XOR   XOR   XOR     XOR   XOR   XOR   XOR     XOR   -     XOR   XOR     XOR   TIE   XOR   XOR
2SA   XOR   XOR   XOR   XOR     XOR   XOR   MOD   XOR     XOR   -     MOD   XOR     XOR   TIE   XOR   XOR
2SK   SK    SK    TIE   XOR     SK    SK    TIE   XOR     SK    -     SK    XOR     TIE   TIE   TIE   XOR
4SA   TIE   TIE   XOR   XOR     XOR   XOR   XOR   XOR     XOR   -     XOR   XOR     TIE   TIE   XOR   TIE
4SK   TIE   TIE   TIE   TIE     SK    SK    TIE   TIE     SK    -     SK    SK      TIE   TIE   TIE   TIE

References
[1] F. Bodin and A. Seznec. Skewed associativity
improves program performance and enhances predictability.
IEEE Trans. Comput., 46(5):530–544,
1997.
[2] B. Calder, C. Krintz, S. John, and T. Austin. Cache-
conscious data placement. In Proceedings of the
Eighth International Conference on Architectural Sup-
port for Programming Languages and Operating Sys-
tems, pages 139–149, 1998.
[3] J. M. Frailong, W. Jalby, and J. Lenfant. XOR-
schemes: a flexible data organization in parallel mem-
ories. In Proceedings 1985 International Conference
on Parallel Processing, pages 276–283, Aug. 1985.
[4] A. González, M. Valero, N. Topham, and J. M.
Parcerisa. Eliminating cache conflict misses through
XOR-based placement functions. In Proceedings of
the 11th international conference on Supercomputing,
pages 76–83, 1997.
[5] M. D. Hill and A. J. Smith. Evaluating associativity in
CPU caches. IEEE Trans. Comput., 38(12):1612–1630,
1989.
[6] M. Kharbutli, K. Irwin, Y. Solihin, and J. Lee. Using
prime numbers for cache indexing to eliminate conflict
misses. In Proceedings of the 10th International Sym-
posium on High Performance Computer Architecture,
pages 288–299, Feb. 2004.
[7] M. S. Lam, E. E. Rothberg, and M. E. Wolf. The
cache performance and optimizations of blocked al-
gorithms. In Proceedings of the Fourth International
Conference on Architectural Support for Programming
Languages and Operating Systems, pages 63–74, Apr.
1991.
[8] K. S. McKinley, S. Carr, and C.-W. Tseng. Improving
data locality with loop transformations. ACM Trans.
Program. Lang. Syst., 18(4):424–453, 1996.
[9] P. Michaud. A statistical model of skewed-
associativity. In Proceedings of the IEEE International
Symposium on Performance Analysis of Systems and
Software, pages 204–213, Mar. 2003.
[10] B. R. Rau. Pseudo-randomly interleaved memory. In
Proceedings of the 18th Annual International Sympo-
sium on Computer Architecture, pages 74–83, 1991.
[11] G. Rivera and C.-W. Tseng. Data transformations for
eliminating conflict misses. In Proceedings of the ACM
SIGPLAN 1998 conference on Programming language
design and implementation, pages 38–49, 1998.
[12] M. Schlansker, R. Shaw, and S. Sivaramakrishnan.
Randomization and associativity in the design of
placement-insensitive caches. Technical report, HP
Computer Systems Laboratory, June 1993.
[13] A. Seznec. A case for two-way skewed associative
caches. In Proceedings of the 20th Annual Interna-
tional Symposium on Computer Architecture, pages
169–178, May 1993.
[14] A. Seznec. A new case for skewed-associativity. Tech-
nical Report PI-1114, IRISA, July 1997.
[15] O. Temam, C. Fricker, and W. Jalby. Cache inter-
ference phenomena. In Proceedings of the 1994 ACM
SIGMETRICS conference on Measurement and mod-
eling of computer systems, pages 261–271, 1994.
[16] N. Topham and A. González. Randomized cache placement
for eliminating conflicts. IEEE Trans. Comput.,
48(2):185–192, 1999.
[17] N. Topham, A. González, and J. González. The design
and performance of a conflict-avoiding cache. In Pro-
ceedings of the 30th annual ACM/IEEE international
symposium on Microarchitecture, pages 71–80,
[18] H. Vandierendonck. Avoiding Mapping Conflicts in
Microprocessors. PhD thesis, Ghent University, Jan.
2004.
[19] H. Vandierendonck and K. De Bosschere. Evaluation
of the performance of polynomial set index functions.
In Workshop on Duplicating, Deconstructing and De-
bunking, held in conjunction with the 29th Interna-
tional Symposium on Computer Architecture, pages
31–41, Anchorage, May 2002.
[20] H. Vandierendonck and K. De Bosschere. XOR-based
hash functions. IEEE Trans. Comput., 54(7):800–812,
2005.
[21] X. Vera, J. Abella, J. Llosa, and A. González. An
accurate cost model for guiding data locality
transformations. ACM Trans. Program. Lang. Syst.,
27(5):946–987, 2005.
[22] M. E. Wolf and M. S. Lam. A data locality optimiz-
ing algorithm. In Proceedings of the ACM SIGPLAN
1991 conference on Programming language design and
implementation, pages 30–44, 1991.
[23] Q. Yang and W. LiPing. A novel cache design for
vector processing. In Proceedings of the 19th Annual
International Symposium on Computer Architecture,
pages 362–371, May 1992.
