Making Belady-Inspired Replacement Policies More Effective Using
  Expected Hit Count by Ghahani, Seyed Armin Vakil et al.
Seyed Armin Vakil Ghahani‡ Sara Mahdizadeh Shahri‡ Mohammad Bakhshalipour‡§
Pejman Lotfi-Kamran§ Hamid Sarbazi-Azad‡§
‡Department of Computer Engineering, Sharif University of Technology
§ School of Computer Science, Institute for Research in Fundamental Sciences (IPM)
Abstract
Memory-intensive workloads operate on massive amounts
of data that cannot be captured by last-level caches (LLCs)
of modern processors. Consequently, processors encounter
frequent off-chip misses, and hence, lose a significant perfor-
mance potential. One way to reduce the number of off-chip
misses is through using a well-behaved replacement policy
in the LLC. Existing processors employ a variation of the least
recently used (LRU) policy to determine a victim for replacement.
Unfortunately, there is a large gap between the effectiveness of LRU
and that of Belady’s MIN, which is the optimal replacement policy.
Belady’s MIN requires selecting the victim with the longest reuse
distance, and hence, is infeasible due to the need to know the future.
Consequently, Belady-inspired replacement policies use Belady’s MIN
to derive an indicator that helps them choose a victim for replacement.
In this work, we show that the indicator that is used in the
state-of-the-art Belady-inspired replacement policy is not de-
cisive in picking a victim in a considerable number of cases,
and hence, the policy has to rely on a standard metric (e.g.,
recency or frequency) to pick a victim, which is inefficient.
We observe that there exist strong correlations among the hit
counts of cache blocks in the same region of memory when
Belady’s MIN is the replacement policy. Taking advantage of
this observation, we propose an expected-hit-count indica-
tor for the memory regions and use it to improve the victim
selection mechanism of Belady-inspired replacement poli-
cies when the main indicator is not decisive. Our proposal
offers a 5.2% performance improvement over the baseline
LRU and outperforms Hawkeye, which is the state-of-the-art
replacement policy.
Keywords Memory System, Cache Replacement, Belady’s
MIN
1 Introduction
The ever-increasing expansion of datasets in memory-intensive
applications has resulted in massive working sets beyond what can be
captured by on-chip caches of modern processors
[10, 11, 15, 16, 21, 30, 36, 48]. As a result, processors executing
such applications encounter frequent data misses, losing significant
performance potential [5, 6, 7, 19, 20, 28, 29, 40, 41, 42]. Among the
data misses, which frequently happen at various levels of a modern
deep cache hierarchy, Last Level Cache (LLC) misses are of particular
importance because every LLC miss requires an access to off-chip DRAM
to fetch the data. Off-chip accesses significantly hurt system
performance due to the limited bandwidth [5, 12, 18, 19, 22, 32, 45]
and long latency [6, 7, 15, 16, 35, 41, 42] of DRAM accesses.
While LLC misses are inevitable due to large datasets of
applications, not all off-chip misses are capacity misses [10,
29, 30, 48]. One way to reduce the number of non-capacity
LLC misses is through a well-behaved replacement policy.
The replacement policy decides, out of all possible candidates,
which one should be evicted from the cache upon arrival of
a new block of data.
The optimal replacement policy is Belady’s MIN [9, 17].
Belady’s MIN replacement policy evicts the block of data that
is going to be referenced furthest in the future. As the
optimal replacement policy requires knowledge of the future,
it is impractical. As a result, most replacement policies use
various heuristics to evict a cache block (e.g., recency or
frequency). Unfortunately, there is a significant gap between
the effectiveness of Belady’s MIN and practical replacement
policies.
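The gap between a recency heuristic and Belady’s MIN can be illustrated on a single cache set. The following is a minimal sketch, not the evaluation setup of this paper: the function names `simulate_lru` and `simulate_min` are hypothetical, the trace is a list of block addresses mapped to one set, and every missing block is inserted (no bypassing).

```python
from collections import OrderedDict

def simulate_lru(trace, ways):
    """Count hits for one cache set managed with LRU."""
    stack = OrderedDict()  # least recently used entry comes first
    hits = 0
    for addr in trace:
        if addr in stack:
            hits += 1
            stack.move_to_end(addr)        # promote to most recently used
        else:
            if len(stack) == ways:
                stack.popitem(last=False)  # evict the least recently used
            stack[addr] = True
    return hits

def simulate_min(trace, ways):
    """Count hits for one cache set under Belady's MIN (needs the whole trace)."""
    # next_use[i] is the index of the next access to trace[i], or infinity.
    next_use = [float("inf")] * len(trace)
    last_seen = {}
    for i in range(len(trace) - 1, -1, -1):
        next_use[i] = last_seen.get(trace[i], float("inf"))
        last_seen[trace[i]] = i

    resident = {}  # block address -> index of its next use
    hits = 0
    for i, addr in enumerate(trace):
        if addr in resident:
            hits += 1
        elif len(resident) == ways:
            # Evict the block whose next reference is furthest in the future.
            victim = max(resident, key=resident.get)
            del resident[victim]
        resident[addr] = next_use[i]
    return hits

if __name__ == "__main__":
    cyclic = list("ABCABC")  # working set of 3 blocks in a 2-way set
    print(simulate_lru(cyclic, 2), simulate_min(cyclic, 2))
```

On a cyclic pattern that is slightly larger than the set, LRU thrashes while MIN retains part of the working set.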
While implementing Belady’s MIN as a whole is impractical,
a few replacement policies emulate (approximate) Belady’s
algorithm to choose a victim for replacement [23, 44].
Rajan and Ramaswamy [44] used extra storage, called
a Shepherd cache, to defer evictions until future
references determine which block, based on Belady’s MIN,
should be chosen as the victim. Unfortunately, this technique
requires a large storage to be truly effective. Jain and
Lin [23] benefited from the observation that with Belady’s
MIN, some load instructions are cache-friendly and some
arXiv:1808.05024v1 [cs.AR] 15 Aug 2018
[Figure 1 plot. Y-axis: replacements with no cache-averse block / all replacements (0% to 50%).]
Figure 1. The fraction of replacement decisions made by Hawkeye [23] when no block in the set has most-recently been
touched by a cache-averse load instruction.
others are cache-averse. Based on this observation, they pro-
posed Hawkeye that uses a minimal storage to emulate Be-
lady’s MIN for the purpose of determining whether a load
instruction is cache-friendly or cache-averse. Hawkeye uses
cache-friendliness/-averseness of load instructions to choose
a victim, i.e., evicting cache-averse blocks while maintaining
cache-friendly ones. The storage requirement of Hawkeye
is minimal because: (1) Hawkeye only stores the block ad-
dresses and not block data (as in Shepherd [44]) to determine
cache-friendliness of load instructions, and (2) Hawkeye
only emulates Belady’s MIN for a small number of sets in
the cache, leveraging the fact that load instructions behave
similarly on all cache sets.
While classifying load instructions into cache-friendly and
cache-averse is quite effective¹, this technique is useful only
if there are blocks that have most recently been accessed by
a cache-averse load instruction. The blocks that are accessed
by cache-averse load instructions are the prime candidates
for replacement. However, if no block has most recently been
accessed by a cache-averse load instruction, the replacement
policy must pick a victim using standard replacement
mechanisms, e.g., recency or frequency, which are
in many cases ineffective.
To show how frequently a Belady-inspired replacement
policy based on cache-friendly/-averse load instructions
finds no cache-averse block in a set, Figure 1 shows the
percentage of replacement decisions with Hawkeye when
no cache-averse block is in the set. The figure shows that
for 0.5% to 42.9% of all replacements (with an average of
15.1% across all benchmarks), Hawkeye must rely on traditional
replacement policies to pick a victim, which limits its
effectiveness. Consequently, a Belady-inspired replacement
policy based on cache-friendly/-averse load instructions cannot
benefit from Belady’s MIN in choosing a victim in a
considerable number of cases.
¹ Hawkeye, which is based on classifying load instructions into
cache-friendly/-averse, is the champion of the second cache
replacement championship (CRC-2) [1].
To address this limitation, this work makes the fundamental
observation that with Belady’s MIN replacement policy,
there is a strong correlation among hit counts of blocks in the
same memory region²; i.e., the hit counts of two blocks from
the same memory region before being evicted by Belady’s
MIN are correlated. Using this observation, we estimate the
hit counts of blocks in various memory regions by emulating
Belady’s MIN and use the expected hit count for replacement
when no cache-averse block is available in a set. Just like
Hawkeye, to estimate the expected hit count of a region, we only
need to store the block addresses and not block data. Moreover,
as we only need to estimate the hit count of memory
regions and not individual blocks, the storage overhead of
our proposal is insignificant.
In this paper, we make the following contributions:
• We show that a Belady-inspired replacement policy
that classifies load instructions into cache-friendly or
cache-averse is, in a considerable fraction of cases,
unable to make replacement decisions based on
the history of loads.
• We show that with Belady’s MIN replacement policy,
there is a strong correlation between the hit counts of
a cache block in two consecutive residencies in the
cache.
• Furthermore, we show that with Belady’s MIN replacement
policy, not only is there a strong correlation between
the hit counts of a cache block in two consecutive
residencies, but there is also a strong correlation among
the hit counts of blocks in the same memory region.
• Using these observations, we augment a Belady-inspired
replacement policy based on cache-friendly/-averse
load instructions with a small structure that tracks
the hit counts of various memory regions, and use the
region hit count in replacement decisions when no
cache-averse block exists in a set to improve the
victim selection quality.
• We use a simulation infrastructure to evaluate our
proposal in the context of both single and multi-core
processors. Our results show that our proposal offers
17.5% lower Misses Per Kilo Instructions (MPKI) and 5.2%
higher performance, as compared to the baseline LRU,
and outperforms all prior state-of-the-art replacement
policies.
² A memory region refers to a chunk of contiguous cache blocks in
memory, holding several kilobytes of data. In this paper, we consider
128 KB memory regions.
2 Motivation
In this section, we show that there is a strong correlation
between the hit counts of cache blocks in the same region
of memory when the replacement policy is Belady’s MIN.
We first show that there is a strong correlation among the
hit counts of a cache block in its recent residencies in the
cache when the replacement policy is Belady’s MIN. We then
extend the results to blocks in the same region of memory.
To measure the correlation among the hit counts of a cache
block in its various residencies in the cache, we predict the
number of hits of a cache block by averaging the number of
hits that the block experienced in its last four residencies in
the LLC. Figure 2 shows the difference between the actual
hit count and the prediction for several benchmarks. The
details of the methodology can be found in Section 4.
As Figure 2 shows, 30.6% to 94.4% of predictions are exactly
correct in various benchmarks. On average across all benchmarks,
64.6% of hit-count predictions are exact and 81.5% of
them differ from the actual hit count by at most one. Most of
the other predictions are also close to the actual hit count.
The figure clearly shows that there is a strong correlation
among the hit counts of a cache block in its recent residencies
in the LLC. The correlation is strong enough that the hit
count of a block can be predicted with high accuracy.
Not only is there a strong correlation among the hit counts
of a cache block in its recent residencies in the cache, but
there is also a strong correlation among the hit counts of
memory regions when the replacement policy is Belady’s
MIN. To measure the correlation among the hit counts of
memory regions, we predict the number of hits of a cache
block in each region by averaging the number of hits that
the last four evicted blocks in that region have experienced.
Figure 3 shows the difference between the actual hit count
and the prediction for several benchmarks.
As shown, 31.5% to 91.6% of predictions are exactly correct
in various benchmarks. On average, 62.6% of predictions are
correct and 78.9% of them differ from the actual hit count by
at most one. Comparing Figure 2 and Figure 3 reveals that
the accuracy of predicting a block’s hit count from the past hit
counts of the block itself is very close to the accuracy of
predicting it from the hit counts of blocks in the same region.
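The region-based predictor used in this measurement can be sketched as follows. This is an illustrative sketch only: the class name `RegionHitPredictor`, the use of byte addresses, and the rounding of the average to an integer are assumptions that the text does not specify; the 128 KB region size and the four-eviction history follow the description above.

```python
from collections import defaultdict, deque

REGION_BYTES = 128 * 1024  # region size considered in this paper
HISTORY = 4                # number of past evictions that are averaged

class RegionHitPredictor:
    """Predict a block's hit count from recently evicted blocks in its region."""

    def __init__(self):
        # region id -> hit counts of the last HISTORY blocks evicted from it
        self.history = defaultdict(lambda: deque(maxlen=HISTORY))

    @staticmethod
    def region_of(byte_addr):
        return byte_addr // REGION_BYTES

    def predict(self, byte_addr):
        """Return the predicted hit count, or None if the region has no history."""
        past = self.history.get(self.region_of(byte_addr))
        if not past:
            return None
        # Assumption: the average is rounded to the nearest integer
        # (Python's round() resolves exact .5 ties to the even value).
        return round(sum(past) / len(past))

    def record_eviction(self, byte_addr, hit_count):
        """Log the hit count a block accumulated before it was evicted."""
        self.history[self.region_of(byte_addr)].append(hit_count)
```

The per-block predictor of Figure 2 has the same shape; only the key changes from the region identifier to the block address.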
While these results are strong, in spirit, they are an
extension of the findings of prior work [49]. Wu et al. [49]
showed that the re-reference interval of cache blocks is
correlated with some signatures like program counter, memory
region, and instruction sequence. These signatures can be
used to predict the re-reference interval of a cache block.
Corroborating prior work, we observe that there is a
correlation between the memory region and re-reference interval
of cache blocks. Moreover, by extending the findings of prior
work, we show that there is a strong correlation between
memory regions and the hit counts of cache blocks. Note that prior
work concluded that the correlation between memory regions
and re-reference interval is not strong in a traditional
replacement policy. This work, however, shows that there is
a strong correlation between the memory regions and the
hit counts of cache blocks with Belady’s MIN. Taking
advantage of this strong correlation, we attempt to improve the
effectiveness of Belady-inspired replacement policies.
Many pieces of prior work [24, 25, 33, 37, 38, 39, 43, 47, 49]
enhanced replacement decisions by identifying and
evicting/bypassing dead blocks. A dead block is a cache block
with no further reference during its current residency in the
cache (i.e., a block with zero remaining hits). Our proposal
is a generalization of prior work: instead of just relying on
dead blocks, our proposal estimates the expected number of
hits of a cache block (zero or larger) and benefits from that
information in the decision-making process. We find that prior
proposals that employ a dead block predictor for making
replacement decisions fundamentally suffer from one or both of
the following problems: (1) whenever a live block is mistakenly
identified as dead, a cache miss is inevitable, and (2) whenever
all blocks in a set are predicted to be live, the replacement
policy is unable to effectively choose the best victim for
replacement.
3 Our Proposal
Taking advantage of the strong correlation of hit counts of
blocks in the same memory region, we aim to improve the
effectiveness of Belady-inspired replacement policies. To make
our additions to a Belady-inspired replacement policy
understandable, we first briefly introduce the main components of
a Belady-inspired replacement policy and then explain what
we add to it.
Replacement decisions with Belady’s MIN require knowl-
edge of future accesses, but to determine whether the current
access is a miss or a hit, having past accesses suffices. If a
cache access turns into a miss, however, Belady’s MIN re-
quires knowledge of future accesses to replace a cache block.
There are two strategies for implementing Belady-inspired
replacements: (1) directly using Belady’s MIN for replacement
decisions, and (2) using the hit/miss behavior under Belady’s MIN
to derive an indicator for replacement decisions. Both strategies
require extra storage in addition to the cache: the former
requires knowledge of future accesses to pick a replacement
candidate, and hence, needs to keep all the blocks
in a storage and defer the replacement decision to the future,
while the latter requires knowledge of past cache
accesses. Prior work [23] showed that to get reasonable results,
a Belady-inspired replacement policy requires extra
storage with enough capacity for eight times the number of
blocks in the cache.
Due to the massive size of the required extra storage, tech-
niques that directly emulate Belady’s MIN (e.g., Shepherd
Cache [44]) are not considered efficient. Aiming to build a
[Figure 2 plot. X-axis: benchmarks (astar_biglakes, bwaves, bzip, cactusADM, gemsfdtd, hmmer_nph, lbm, leslie3d, libquantum, mcf, soplex, sphinx, xalan, zeusmp, Average). Y-axis: fraction of predictions (0% to 100%). Legend (Actual Hit Count - Prediction): 0, 1, 2-3, 4-6, 7+.]
Figure 2. The difference between the actual and predicted hit count of cache blocks when the replacement policy is Belady’s
MIN. The predicted hit count of a cache block is the average hit count of the block in its last four residencies in the LLC.
[Figure 3 plot. X-axis: benchmarks (astar_biglakes, bwaves, bzip, cactusADM, gemsfdtd, hmmer_nph, lbm, leslie3d, libquantum, mcf, soplex, sphinx, xalan, zeusmp, Average). Y-axis: fraction of predictions (0% to 100%). Legend (Actual Hit Count - Prediction): 0, 1, 2-3, 4-6, 7+.]
Figure 3. The difference between the actual and predicted hit count of memory regions when the replacement policy is
Belady’s MIN. The predicted hit count of a memory region is the average hit count of the four most recently evicted cache
blocks in the region.
storage-efficient replacement policy, Jain and Lin [23] employ
two strategies: (1) they do not directly use Belady’s MIN
to derive replacement decisions; instead, they derive indicators
to be used in picking a victim for replacement, as this
enables them not to store block data, which is much larger
than the block address, and (2) they use sampling by relying
on indicators that can be trained using only a small number
of cache sets.
In this work, we take advantage of Belady’s MIN to
derive an indicator, i.e., the expected hit counts of memory
regions, and as such, do not need to store the block data in
the extra storage. Moreover, as memory regions are much
larger than cache blocks and span a large number of
cache sets, the expected hit counts of memory regions can
be estimated using only a small number of cache sets. Doing
so enables sampling while estimating the expected hit
counts of memory regions.
In this part, we review how a Belady-inspired replacement
policy decides if an access is a hit or a miss using Belady’s
MIN, and then show how we can measure and use the
expected hit counts of memory regions. We first explain the
algorithm for determining the hit/miss decision and then
describe the hardware implementation. We assume that the past
sequence of accesses to every set is recorded. As the goal
is to determine the hit/miss events, the recorded sequence
only includes block addresses and not block data. Upon a
new access to the set, we attempt to find the corresponding
address in the recorded sequence of addresses. If the address
is not found, this is the first access to the address, and hence,
is a miss. Otherwise, we need to examine the sequence of
accesses between the last and current access to this address.
We refer to the interval between two accesses to the same
address as the reuse interval. We need to find all the reuse
intervals of other accesses that fall within the reuse interval
of the current access. We then need to find the maximum
number of overlaps among such intervals, which we refer
to as the occupancy. If the occupancy reaches or exceeds
the size of the cache set (i.e., cache associativity), the current
access is a miss, and otherwise, a hit under Belady’s MIN.
To implement this algorithm in hardware, prior work [23]
suggested recording the occupancy of a set after every access
to the set. This means that we record both the address of the
access and the occupancy of the set after the access, as part
of the recorded sequence of accesses. For every new access
to the cache, the recorded sequence is searched to find an
access to the same address in the past. In case a match is
found, if the recorded occupancy of all the recorded accesses
after the match is less than the associativity of the cache, this
access is a hit, and otherwise, it is a miss. In case the access
is a hit, the occupancy number of all the accesses after the
match is incremented to indicate that the currently accessed
address was in the cache since last access.
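The occupancy-based hit/miss decision described above can be sketched in software as follows. This is a minimal sketch under stated assumptions, not the hardware design of prior work [23]: the class name `OccupancySet` is hypothetical, the recorded history is unbounded, and the checked window is taken to run from the previous access to the same address up to, but not including, the current access.

```python
class OccupancySet:
    """Decide hit/miss under Belady's MIN for one set from past accesses only."""

    def __init__(self, ways):
        self.ways = ways
        self.occupancy = []    # one occupancy entry per recorded access
        self.last_access = {}  # block address -> index of its latest access

    def access(self, addr):
        """Record an access; return True if MIN would have hit on it."""
        now = len(self.occupancy)
        prev = self.last_access.get(addr)
        hit = False
        if prev is not None:
            window = range(prev, now)
            # Hit only if the set was never full anywhere in the reuse interval.
            hit = all(self.occupancy[t] < self.ways for t in window)
            if hit:
                # The block stayed resident over the whole interval.
                for t in window:
                    self.occupancy[t] += 1
        self.occupancy.append(0)
        self.last_access[addr] = now
        return hit

if __name__ == "__main__":
    s = OccupancySet(ways=2)
    print([s.access(a) for a in "ABCABC"])
```

For the trace A B C A B C on a 2-way set, the sketch reports two hits, which matches the number of hits an optimal policy can obtain on that trace.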
As there are many repetitions in the sequence of past
addresses, to reduce the storage overhead, the occupancy
numbers and addresses are stored in two different structures.
For every set, there is an occupancy vector that stores the
sequence of occupancy numbers and an address cache for
storing the corresponding addresses. In addition to the ad-
dress, each entry in the address cache also has a pointer that
points to the last occurrence of this address in the occupancy
vector. Depending on what indicator the Belady-inspired
replacement policy trains for replacement decisions, there
might be other elements in the address cache (e.g., PC of the
load instruction).
In general, Belady’s MIN may require recording all the past
accesses to decide if the current access is a miss or a hit.
However, in practice, as shown by prior work [23], for each set,
recording eight times the associativity of the cache is enough
to decide if an access is a hit or a miss under Belady’s MIN in
most cases. Moreover, to reduce the storage requirement,
the aforementioned mechanism is only performed on a small
number of cache sets (e.g., one for every sixty-four sets) [43].
3.1 Design Overview
Our proposal, named Expected Hit Count (EHC), is built on
top of a Belady-inspired replacement policy and attempts to
exploit the expected-hit-count phenomenon. EHC requires
extending the tag storage of each block in the LLC with only
three bits. The storage unit, named Expected Further Hits, is
a count-down counter that shows the number of further hits
that the block is expected to have in its current residency in the
cache.
3.2 Selecting Victim
When the trained indicator of the baseline Belady-inspired
replacement policy is not decisive (e.g., all blocks in a set are
cache-friendly in Hawkeye), the replacement policy cannot
use load history for victim selection and instead inevitably
relies on a standard policy. We use the expected hit count
to improve the quality of victims in such cases. Many re-
cent replacement policies are based on Re-Reference Interval
Prediction (RRIP) [25] (e.g., [23, 27, 49]), so in this part, we
explain how to use the expected hit count to improve the vic-
tim selection mechanism of RRIP. We emphasize that EHC is
not limited to RRIP and can be used with other replacement
policies as well.
Table 1. Evaluation Parameters.
Parameter          Value
Processing Nodes   6-stage pipeline, 256-entry ROB
L1-D/I Caches      32 KB, 8-way, 4-cycle load-to-use
Private L2 Cache   256 KB, 8-way, 8-cycle access latency
Shared LLC         2 MB per core, 16-way, 20-cycle hit latency
Data Prefetcher    L1: next-line prefetcher; L2: PC-based stride prefetcher [4]

RRIP associates a number with each cache block, named
Re-Reference Prediction Value (RRPV), and replaces a block
with the highest RRPV. EHC updates the victim selection
mechanism based on “how many further hits do we expect
to get from each cache block?” Every time a new block
is brought into a set, its expected hit count is placed into the
Expected Further Hits counter in the cache. The value of the
expected hit count is empirically chosen to be one. For every
access to this block, the content of Expected Further Hits is
decremented. When a victim needs to be selected, we combine
Expected Further Hits, which determines how many further
hits we expect to see for this block, and RRPV to choose a
victim. We examine the value of ‘ExpectedFurtherHits − RRPV’
for all candidates and evict the block with the lowest value³.
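The selection rule above can be sketched as follows. The `Block` record and the function names are hypothetical and used only for illustration; saturating the count-down counter at zero is an assumption, while the score ‘ExpectedFurtherHits − RRPV’ and the first-candidate tie-break follow the text.

```python
from dataclasses import dataclass

@dataclass
class Block:
    tag: int
    rrpv: int                   # Re-Reference Prediction Value
    expected_further_hits: int  # count-down counter kept with the tag

def on_hit(block):
    """Consume one expected hit (assumption: the counter saturates at zero)."""
    block.expected_further_hits = max(0, block.expected_further_hits - 1)

def select_victim(blocks):
    """Return the index of the block with the lowest ExpectedFurtherHits - RRPV."""
    scores = [b.expected_further_hits - b.rrpv for b in blocks]
    # list.index returns the first minimum, i.e., ties pick the first block.
    return scores.index(min(scores))

if __name__ == "__main__":
    cache_set = [Block(0xA, 2, 3), Block(0xB, 5, 1), Block(0xC, 6, 2)]
    print(select_victim(cache_set))
```

A block with a high RRPV and few expected further hits obtains the lowest score, so both indicators push toward the same victim when they agree.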
4 Methodology
4.1 Simulation Infrastructure
We evaluate our proposal using the simulation framework
released by the Second Cache Replacement Championship
(CRC-2) [3]. Table 1 summarizes the key elements of our
methodology. We target both single-core and four-core pro-
cessors with a 2 MB per-core shared LLC. The processors
benefit from a non-inclusive cache hierarchy that employs
LRU as the default replacement policy. We report both cache
statistics and the end-to-end performance of the competing
policies.
We use SPEC CPU2006 [2] benchmarks for evaluating
the competing replacement policies. For the multi-program
workloads, we randomly choose a hundred combinations of
single-core programs and use them to evaluate the competing
policies on a four-core processor. For single-core evaluations,
we execute 4-billion instructions and use half of the instruc-
tions for warm-up and the rest for actual measurements. For
the multi-core evaluations, we execute 2-billion instructions
per core and use the first half for warm-up and the rest for
measurements.
4.2 Evaluated Methods
We evaluate the following replacement policies:
Baseline LRU. The well-known Least Recently Used replacement
policy is used as the baseline in our evaluations. It
keeps four bits per block in each set to establish the LRU
stack.
³ If more than one block has the lowest value, we pick the first one.
Dynamic RRIP (DRRIP) [25]. Each block has a 3-bit
storage unit named RRPV. Upon each hit, the RRPV of the
block is set to zero, and upon each miss, the block with the
maximum RRPV is evicted. If none of the blocks has the
maximum RRPV, the RRPV of all blocks is incremented.
This procedure is repeated until at least one block gets
the maximum RRPV. DRRIP uses set-dueling to choose an
insertion policy between Static RRIP (SRRIP) and Bimodal
RRIP (BRRIP). In SRRIP, all blocks are inserted with an RRPV of
maximum minus one. In BRRIP, the RRPV of the inserted block
gets the value of maximum minus one (with a probability
of 1/32) or maximum (with a probability of 31/32). Thirty-two
random sets emulate SRRIP and another thirty-two random
sets emulate BRRIP. The remaining sets follow the winner
of the duel. The total area overhead of DRRIP is 12 KB/48 KB
in single-core/four-core substrate.
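The RRPV handling described for DRRIP can be sketched as follows. This is a minimal illustration with hypothetical function names; it models one set as a list of RRPVs and leaves out the set-dueling logic that chooses between the two insertion policies.

```python
import random

RRPV_BITS = 3
RRPV_MAX = (1 << RRPV_BITS) - 1  # 7 for a 3-bit RRPV

def rrip_victim(rrpvs):
    """Age the set until some block reaches RRPV_MAX, then return its index."""
    while RRPV_MAX not in rrpvs:
        for way in range(len(rrpvs)):
            rrpvs[way] += 1
    return rrpvs.index(RRPV_MAX)

def on_hit(rrpvs, way):
    """A hit predicts a near re-reference, so the RRPV is reset to zero."""
    rrpvs[way] = 0

def srrip_insert_rrpv():
    """SRRIP inserts every block with RRPV = maximum minus one."""
    return RRPV_MAX - 1

def brrip_insert_rrpv(rng=random):
    """BRRIP inserts with maximum minus one with probability 1/32, else maximum."""
    return RRPV_MAX - 1 if rng.random() < 1 / 32 else RRPV_MAX

if __name__ == "__main__":
    rrpvs = [2, 5, 0, 5]
    victim = rrip_victim(rrpvs)
    rrpvs[victim] = srrip_insert_rrpv()
    print(victim, rrpvs)
```

Because BRRIP inserts most blocks with the maximum RRPV, newly inserted blocks are usually the next victims, which protects the resident working set from thrashing.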
SHiP [49]. The Signature-Based Hit Predictor replacement policy
relies on RRIP but attempts to distinguish the dead blocks
from the live ones, which will be re-referenced in the cache.
It uses the PC of the corresponding instruction to classify
blocks as dead or live. The replacement policy differs from
RRIP in that it sets the RRPV of a block during insertion based
on the dead block prediction. SHiP uses an LRU replacement
simulator to find the cache-friendliness/-averseness of
blocks, and stores the outcome in a dedicated structure.
In addition, SHiP uses two bits per block for storing
RRPV information. The total area overhead of SHiP is
39 KB/156 KB in single-core/four-core substrate.
Multiperspective. This replacement policy leverages
machine learning concepts to determine whether the
incoming block should bypass the cache or not. It also uses
a mechanism for sending feedback to assist in choosing
a victim. The machine learning mechanism used in this
method exploits several features to determine the output and
has a sampler unit to calculate the weight of each feature.
The sampler and other metadata structures used in this
mechanism occupy approximately 25.25 KB/95.5 KB storage
in a single-/four-core processor. The baseline replacement
of this algorithm is different in single-core and four-core
systems: in single-core systems, it uses PseudoLRU [26] as
the baseline replacement policy of the main cache, which has
3.75 KB hardware overhead; however, in four-core systems,
SRRIP [25] is the baseline replacement policy, with 32 KB
additional storage overhead. The total storage overhead
of this method is 29 KB/127.5 KB in single-core/four-core
substrate.
Hawkeye [23]. Hawkeye uses an optimal replacement
simulator, named OPTGen, to simulate Belady’s MIN and
classify load instructions (PCs) into cache-friendly or
cache-averse. OPTGen’s hardware overhead is 15.2 KB.
Moreover, Hawkeye uses three bits for the RRPV, which are
set upon each access based on the averseness or friendliness
of the PCs of incoming accesses. Upon each miss, Hawkeye
evicts a cache-averse block from the cache. If there is no
cache-averse block in the set, it chooses the oldest block in
the set as the victim. The total overhead of this replacement
policy is 30 KB/90 KB in single-core/four-core substrate.
EHC. EHC is implemented on top of Hawkeye [23]. This policy
changes the victim selection mechanism of Hawkeye
to efficiently select a victim when all cache blocks in a
given set are predicted to be cache-friendly. The total storage
overhead of EHC on top of Hawkeye is 12 KB per core.
5 Evaluation
5.1 Miss Reduction
Figure 4 shows the MPKI reduction of various policies over
the baseline LRU. As shown clearly, our proposal outper-
forms all previously-proposed replacement policies. On av-
erage, our proposal reduces the MPKI by 17.5%. The second
best policy is Hawkeye, which offers a 15.4% MPKI reduction.
5.2 Performance
Figure 5 compares the performance improvement of the evaluated
policies over the baseline LRU. We use Instructions
per Cycle (IPC) as the metric for performance. EHC offers the
highest performance improvement on average. The average
performance improvement is 5.19% across all workloads. The
second best policy is Hawkeye with an average performance
improvement of 4.76%.
Figure 6 compares the performance improvement of the
evaluated policies over the baseline LRU in a four-core sys-
tem. Again, EHC achieves the highest performance, outper-
forming all other replacement policies on average.
5.3 Why is EHC Effective?
To show that the expected hit count with a non-ideal re-
placement policy is correlated with the reciprocal of the
reuse distance, Figure 7 compares the quality of the chosen
replacement victims in Hawkeye and EHC. Every time a vic-
tim needs to be selected, we sort the blocks in the set and the
incoming block based on their reuse distance. Block 0 has the
longest and Block 16 has the shortest reuse distance. Ideally,
a replacement policy always picks Block 0. Comparing EHC
and Hawkeye reveals that the victims chosen by EHC have
higher quality as compared to those chosen by Hawkeye.
6 Related Works
The replacement policy is basically a prediction mechanism
to determine the best candidate among the residing blocks
in a set of the cache to be replaced with the incoming block.
Different replacement policies have different approaches to
perform this prediction, which can be classified as follows:
6.1 Static Prediction
This class of replacement policies predicts the best victim
candidate without considering the blocks’ unique features and
solely based on events such as hit, miss, or eviction. Therefore,
all victim selection decisions are only influenced by such
short-term events, which cannot distinguish different access
patterns. LRU and Pseudo LRU [26], for instance, perform
well in case of recency-friendly access patterns. However,
[Figure 4 plot. Y-axis: MPKI reduction over LRU (-20% to 70%). Legend: DRRIP, SHiP, SHiP++, Multiperspective, Hawkeye, EHC.]
Figure 4. MPKI reduction of the competing replacement policies over the baseline LRU.
[Figure 5 plot. X-axis: benchmarks (astar_biglakes, bwaves, bzip, cactusADM, gemsfdtd, hmmer_nph, lbm, leslie3d, libquantum, mcf, soplex, sphinx, xalan, zeusmp, GMean). Y-axis: normalized performance (0.96 to 1.14). Legend: DRRIP, SHiP, SHiP++, Multiperspective, Hawkeye, EHC.]
Figure 5. Performance improvement of the competing replacement policies over the baseline LRU.
[Figure 6 plot. X-axis: workload mixes (1 to 100). Y-axis: normalized performance (0.95 to 1.25). Legend: DRRIP, SHiP, SHiP++, Multiperspective, Hawkeye, EHC.]
Figure 6. Performance improvement of the competing replacement policies over the baseline LRU.
[Figure 7 plot. X-axis: Hawkeye and EHC bars per benchmark (astar, bwaves, bzip, cactus, gems, hmmer, lbm, leslie, libq, mcf, soplex, sphinx, xalan, zeusmp, Average). Y-axis: fraction of victims (0 to 1). Legend (index of the chosen victim): 0, 1, 2-8, 9-16.]
Figure 7. The quality of chosen victims in Hawkeye and EHC. Blocks in a set are sorted with the reuse distance: Block 0 has
the longest and Block 16 has the shortest reuse distance. Selecting a victim with the lower index is better.
they perform poorly on other types of access patterns: the
assumption that the block referenced furthest in the past
should be evicted can lead to cache pollution and cache
thrashing in some applications. RRIP [25] is effective for
streaming accesses but can degrade the performance of
applications that follow other patterns. Adapting these static
approaches to the requirements of each application is
therefore useful. Techniques like Set Dueling [43], which is
used in policies like DIP [43] and DRRIP [25], can improve
the performance of static approaches by selecting the
best-fitting approach to choose the victim. TADIP [24] extends
the static prediction concept to shared caches. Unfortunately,
adapting the replacement policy to the workload does not
close the large gap between the performance of this class of
policies and the optimal algorithm, as the winning policy
applies to all blocks regardless of their differing behavior.
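
To make the Set Dueling idea concrete, the following is a minimal sketch rather than the mechanism of any specific policy cited above. It assumes two generic candidate policies, "A" and "B", a few randomly chosen leader sets dedicated to each, and a saturating counter (called `psel` here) that follower sets consult. The class name, parameter names, and default sizes are all illustrative assumptions.

```python
import random


class SetDuelingSelector:
    """Picks between two policies, "A" and "B", using leader sets (illustrative sketch)."""

    def __init__(self, num_sets=1024, leaders_per_policy=32, psel_bits=10, seed=0):
        rng = random.Random(seed)
        # Draw disjoint leader sets for the two policies (assumed random placement).
        leaders = rng.sample(range(num_sets), 2 * leaders_per_policy)
        self.leaders_a = set(leaders[:leaders_per_policy])
        self.leaders_b = set(leaders[leaders_per_policy:])
        self.psel_max = (1 << psel_bits) - 1
        self.psel = self.psel_max // 2  # start at the midpoint

    def record_miss(self, set_index):
        # A miss in a leader set counts as a vote against that leader's policy.
        if set_index in self.leaders_a:
            self.psel = min(self.psel + 1, self.psel_max)
        elif set_index in self.leaders_b:
            self.psel = max(self.psel - 1, 0)

    def policy_for(self, set_index):
        # Leader sets always use their own policy.
        if set_index in self.leaders_a:
            return "A"
        if set_index in self.leaders_b:
            return "B"
        # Follower sets use whichever policy has missed less in its leaders.
        return "B" if self.psel > self.psel_max // 2 else "A"
```

Note that every follower set receives the same answer from `policy_for`, which illustrates the limitation discussed above: the winning policy is applied to all blocks regardless of their individual behavior.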
6.2 Dynamic Prediction
Dynamic prediction is based on the characteristics of a
block, such as its tag, PC, live distance, or age. These
replacement policies attempt to identify, categorize, and rank
blocks as they enter the cache. As a result, in case of a
conflict, the block predicted to be least likely to be
re-referenced is chosen as the victim. Learning the behavior
of each category, however, can be based on different
information and performed through different mechanisms.
Evicted-Address Filter (EAF) [46], for example, keeps track
of recently evicted blocks using a Bloom Filter. Some
replacement policies leverage a Dead Block Predictor (DBP),
which is a mechanism that predicts when a block becomes
dead, to evict dead blocks from the cache. Policies such as
cache-burst [39], SDBP [34], SHiP [49], and Leeway [14] are
based on DBPs and use various indicators to decide whether
a block is dead. PRP [13] uses a probabilistic approach to
estimate the probability of a hit for a block based on its
age. EVA [8] similarly uses theoretical analysis to calculate
the Economic Value Added (EVA) of each block based on the
block's age to determine whether it is worth keeping the
block in the cache.
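
As an illustration of tracking recently evicted blocks with a Bloom filter, here is a minimal sketch. It is not the EAF design from [46]: the filter size, the number of hash functions, the BLAKE2-based hashing, and the rule of clearing the filter after a fixed number of insertions are all assumptions chosen for brevity.

```python
import hashlib


class EvictedAddressFilter:
    """Bloom filter over recently evicted block addresses (illustrative sketch)."""

    def __init__(self, num_bits=1024, num_hashes=2, capacity=128):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.capacity = capacity            # insertions allowed before a reset (assumed)
        self.bits = bytearray(num_bits)     # one byte per bit, for clarity
        self.inserted = 0

    def _positions(self, addr):
        # Derive num_hashes bit positions from the address.
        for i in range(self.num_hashes):
            digest = hashlib.blake2b(f"{i}:{addr}".encode(), digest_size=8).digest()
            yield int.from_bytes(digest, "little") % self.num_bits

    def on_evict(self, addr):
        # Reset once the filter has absorbed `capacity` addresses, so that
        # only recent evictions are remembered.
        if self.inserted >= self.capacity:
            self.bits = bytearray(self.num_bits)
            self.inserted = 0
        for pos in self._positions(addr):
            self.bits[pos] = 1
        self.inserted += 1

    def recently_evicted(self, addr):
        # May return a false positive, but never a false negative
        # for an address inserted since the last reset.
        return all(self.bits[pos] for pos in self._positions(addr))
```

A replacement policy could query `recently_evicted` on a miss and treat a positive answer as a hint that the block was evicted too early; how that hint affects insertion priority is left out of this sketch.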
Multiperspective [27] does not categorize blocks based on
only one or two features; instead, it classifies blocks based
on many characteristics that may affect the prediction of
their reuse behavior. Policies such as RDP [31], Shepherd
Cache [44], and Hawkeye [23] emulate Belady's optimal policy
to reduce the gap between practical replacement policies and
Belady's MIN. RDP [31] uses an expensive content-addressable
memory for storing the data of each PC. Shepherd Cache [44]
requires storing the block data in addition to the block
addresses to emulate Belady's MIN, and hence, cannot
effectively use its storage to simulate Belady's algorithm.
Hawkeye uses Belady's MIN to derive an indicator that helps
choose victims. Consequently, it does not need to store the
block data, and hence, it benefits from a wider window of
past references over which to emulate Belady. We showed that
Hawkeye's indicator for choosing a victim is not decisive in a
considerable number of cases. To address this limitation, we
proposed a different indicator, the expected hit counts of
memory regions, to help choose the victim when the main
indicator is not decisive.
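
For reference, the gap between LRU and Belady's MIN that these policies try to narrow can be reproduced on a toy trace. The sketch below is an offline simulation, not a hardware mechanism: it assumes a single fully associative set, no bypassing, and full knowledge of the future trace. The function name `simulate` and its parameters are illustrative.

```python
def simulate(trace, capacity, policy):
    """Count misses for one fully associative set under "lru" or "min"."""
    # Precompute, for each access, the index of the next access to the same block.
    next_use = [0] * len(trace)
    last_seen = {}
    for i in range(len(trace) - 1, -1, -1):
        next_use[i] = last_seen.get(trace[i], float("inf"))
        last_seen[trace[i]] = i

    # block -> last-use time (LRU) or next-use time (MIN)
    cache = {}
    misses = 0
    for i, block in enumerate(trace):
        if block not in cache:
            misses += 1
            if len(cache) >= capacity:
                if policy == "lru":
                    # Evict the block referenced furthest in the past.
                    victim = min(cache, key=cache.get)
                else:
                    # Evict the block reused furthest in the future;
                    # ties (e.g., blocks never reused) are broken arbitrarily.
                    victim = max(cache, key=cache.get)
                del cache[victim]
        cache[block] = i if policy == "lru" else next_use[i]
    return misses
```

On the cyclic trace `a b c a b c a b c` with room for two blocks, `simulate(list("abcabcabc"), 2, "lru")` misses on all 9 accesses, while the `"min"` variant misses 6 times.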
References
[1] “Closing Remarks of the 2nd Cache Replacement Championship.”
[Online]. Available:
<http://crc2.ece.tamu.edu/submaterial/CRC2_closing.pptx>
[2] “SPEC - Standard Performance Evaluation Corporation.” [Online].
Available: <http://www.spec.org>
[3] “The 2nd Cache Replacement Championship.” [Online]. Available:
<http://crc2.ece.tamu.edu>
[4] J.-L. Baer and T.-F. Chen, “An Effective On-Chip Preloading Scheme
to Reduce Data Access Penalty,” in International Conference on Super-
computing (ICS), 1991.
[5] M. Bakhshalipour, P. Lotfi-Kamran, A. Mazloumi, F. Samandi,
M. Naderan, M. Modarressi, and H. Sarbazi-Azad, “Fast Data Delivery
for Many-Core Processors,” IEEE Transactions on Computers (TC), 2018.
[6] M. Bakhshalipour, P. Lotfi-Kamran, and H. Sarbazi-Azad, “An Efficient
Temporal Data Prefetcher for L1 Caches,” IEEE Computer Architecture
Letters (CAL), 2017.
[7] M. Bakhshalipour, P. Lotfi-Kamran, and H. Sarbazi-Azad, “Domino
Temporal Data Prefetcher,” in International Symposium on High Perfor-
mance Computer Architecture (HPCA), 2018.
[8] N. Beckmann and D. Sanchez, “Maximizing Cache Performance Un-
der Uncertainty,” in International Symposium on High Performance
Computer Architecture (HPCA), 2017.
[9] L. A. Belady, “A Study of Replacement Algorithms for a Virtual-Storage
Computer,” IBM Systems Journal, 1966.
[10] C. Bienia, S. Kumar, J. P. Singh, and K. Li, “The PARSEC Benchmark
Suite: Characterization and Architectural Implications,” in Parallel
Architectures and Compilation Techniques (PACT), 2008.
[11] A. Boroumand, S. Ghose, Y. Kim, R. Ausavarungnirun, E. Shiu,
R. Thakur, D. Kim, A. Kuusela, A. Knies, P. Ranganathan, and O. Mutlu,
“Google Workloads for Consumer Devices: Mitigating Data Movement
Bottlenecks,” in Architectural Support for Programming Languages and
Operating Systems (ASPLOS), 2018.
[12] A. Boroumand, S. Ghose, M. Patel, H. Hassan, B. Lucia, K. Hsieh,
K. T. Malladi, H. Zheng, and O. Mutlu, “LazyPIM: An Efficient Cache
Coherence Mechanism for Processing-in-Memory,” IEEE Computer
Architecture Letters (CAL), 2017.
[13] S. Das, T. M. Aamodt, and W. J. Dally, “Reuse Distance-Based Prob-
abilistic Cache Replacement,” ACM Transactions on Architecture and
Code Optimization (TACO), 2015.
[14] P. Faldu and B. Grot, “Leeway: Addressing Variability in Dead-Block
Prediction for Last-Level Caches,” in Parallel Architectures and Compi-
lation Techniques (PACT), 2017.
[15] M. Ferdman, A. Adileh, O. Kocberber, S. Volos, M. Alisafaee, D. Jevd-
jic, C. Kaynak, A. D. Popescu, A. Ailamaki, and B. Falsafi, “Clearing
the Clouds: A Study of Emerging Scale-Out Workloads on Modern
Hardware,” in Architectural Support for Programming Languages and
Operating Systems (ASPLOS), 2012.
[16] M. Ferdman, A. Adileh, O. Kocberber, S. Volos, M. Alisafaee, D. Jevdjic,
C. Kaynak, A. D. Popescu, A. Ailamaki, and B. Falsafi, “Quantifying
the Mismatch Between Emerging Scale-Out Applications and Modern
Processors,” ACM Transactions on Computer Systems (TOCS), 2012.
[17] J. Gecsei, D. R. Slutz, and I. L. Traiger, “Evaluation Techniques for
Storage Hierarchies,” IBM Systems Journal, 1970.
[18] N. Hardavellas, M. Ferdman, B. Falsafi, and A. Ailamaki, “Toward Dark
Silicon in Servers,” IEEE Micro, 2011.
[19] M. Hashemi, E. Ebrahimi, O. Mutlu, and Y. N. Patt, “Accelerating
Dependent Cache Misses with an Enhanced Memory Controller,” in
International Symposium on Computer Architecture (ISCA), 2016.
[20] M. Hashemi, O. Mutlu, and Y. N. Patt, “Continuous Runahead: Trans-
parent Hardware Acceleration for Memory Intensive Workloads,” in
International Symposium on Microarchitecture (MICRO), 2016.
[21] S. Huang, J. Huang, J. Dai, T. Xie, and B. Huang, “The HiBench Bench-
mark Suite: Characterization of the MapReduce-Based Data Analysis,”
in International Conference on Data Engineering Workshops (ICDEW),
2010.
[22] J. Huh, D. Burger, and S. W. Keckler, “Exploring the Design Space of
Future CMPs,” in Parallel Architectures and Compilation Techniques
(PACT), 2001.
[23] A. Jain and C. Lin, “Back to the Future: Leveraging Belady’s Algorithm
for Improved Cache Replacement,” in International Symposium on
Computer Architecture (ISCA), 2016.
[24] A. Jaleel, W. Hasenplaugh, M. Qureshi, J. Sebot, S. Steely Jr, and J. Emer,
“Adaptive Insertion Policies for Managing Shared Caches,” in Parallel
Architectures and Compilation Techniques (PACT), 2008.
[25] A. Jaleel, K. B. Theobald, S. C. Steely Jr, and J. Emer, “High Performance
Cache Replacement Using Re-Reference Interval Prediction (RRIP),” in
International Symposium on Computer Architecture (ISCA), 2010.
[26] D. A. Jiménez, “Insertion and Promotion for Tree-Based PseudoLRU
Last-Level Caches,” in International Symposium on Microarchitecture
(MICRO), 2013.
[27] D. A. Jiménez and E. Teran, “Multiperspective Reuse Prediction,” in
International Symposium on Microarchitecture (MICRO), 2017.
[28] T. S. Karkhanis and J. E. Smith, “A First-Order Superscalar Processor
Model,” in International Symposium on Computer Architecture (ISCA),
2004.
[29] M. Kayaalp, K. N. Khasawneh, H. A. Esfeden, J. Elwell, N. Abu-
Ghazaleh, D. Ponomarev, and A. Jaleel, “RIC: Relaxed Inclusion Caches
for Mitigating LLC Side-Channel Attacks,” in Design Automation Con-
ference (DAC), 2017.
[30] K. Keeton, D. A. Patterson, Y. Q. He, R. C. Raphael, and W. E. Baker,
“Performance Characterization of a Quad Pentium Pro SMP Using
OLTP Workloads,” in International Symposium on Computer Architec-
ture (ISCA), 1998.
[31] G. Keramidas, P. Petoumenos, and S. Kaxiras, “Cache Replacement
Based on Reuse-Distance Prediction,” in International Conference on
Computer Design (ICCD), 2007.
[32] T. Kgil, S. D’Souza, A. Saidi, N. Binkert, R. Dreslinski, T. Mudge, S. Rein-
hardt, and K. Flautner, “PicoServer: Using 3D Stacking Technology
to Enable a Compact Energy Efficient Chip Multiprocessor,” in Archi-
tectural Support for Programming Languages and Operating Systems
(ASPLOS), 2006.
[33] S. M. Khan, D. A. Jiménez, D. Burger, and B. Falsafi, “Using Dead Blocks
As a Virtual Victim Cache,” in Parallel Architectures and Compilation
Techniques (PACT), 2010.
[34] S. M. Khan, Y. Tian, and D. A. Jiménez, “Sampling Dead Block Predic-
tion for Last-Level Caches,” in International Symposium on Microarchi-
tecture (MICRO), 2010.
[35] F. Khorasani, H. A. Esfeden, N. Abu-Ghazaleh, and V. Sarkar, “In-
Register Parameter Caching for Dynamic Neural Nets with Virtual
Persistent Processor Specialization,” in International Symposium on
Microarchitecture (MICRO), 2018.
[36] F. Khorasani, H. A. Esfeden, A. Farmahini-Farahani, N. Jayasena, and
V. Sarkar, “RegMutex: Inter-Warp GPU Register Time-Sharing,” in
International Symposium on Computer Architecture (ISCA), 2018.
[37] A.-C. Lai and B. Falsafi, “Selective, Accurate, and Timely Self-
Invalidation Using Last-Touch Prediction,” in International Symposium
on Computer Architecture (ISCA), 2000.
[38] A.-C. Lai, C. Fide, and B. Falsafi, “Dead-Block Prediction & Dead-Block
Correlating Prefetchers,” in International Symposium on Computer
Architecture (ISCA), 2001.
[39] H. Liu, M. Ferdman, J. Huh, and D. Burger, “Cache Bursts: A New Ap-
proach for Eliminating Dead Blocks and Increasing Cache Efficiency,”
in International Symposium on Microarchitecture (MICRO), 2008.
[40] P. Lotfi-Kamran, B. Grot, M. Ferdman, S. Volos, O. Kocberber, J. Picorel,
A. Adileh, D. Jevdjic, S. Idgunji, E. Ozer, and B. Falsafi, “Scale-Out Pro-
cessors,” in International Symposium on Computer Architecture (ISCA),
2012.
[41] O. Mutlu, H. Kim, and Y. N. Patt, “Techniques for Efficient Process-
ing in Runahead Execution Engines,” in International Symposium on
Computer Architecture (ISCA), 2005.
[42] O. Mutlu, J. Stark, C. Wilkerson, and Y. N. Patt, “Runahead Execution:
An Alternative to Very Large Instruction Windows for Out-of-Order
Processors,” in International Symposium on High Performance Computer
Architecture (HPCA), 2003.
[43] M. K. Qureshi, A. Jaleel, Y. N. Patt, S. C. Steely, and J. Emer, “Adaptive
Insertion Policies for High Performance Caching,” in International
Symposium on Computer Architecture (ISCA), 2007.
[44] K. Rajan and G. Ramaswamy, “Emulating Optimal Replacement with
a Shepherd Cache,” in International Symposium on Microarchitecture
(MICRO), 2007.
[45] B. M. Rogers, A. Krishna, G. B. Bell, K. Vu, X. Jiang, and Y. Solihin,
“Scaling the Bandwidth Wall: Challenges in and Avenues for CMP
Scaling,” in International Symposium on Computer Architecture (ISCA),
2009.
[46] V. Seshadri, O. Mutlu, M. A. Kozuch, and T. C. Mowry, “The Evicted-
Address Filter: A Unified Mechanism to Address Both Cache Pollution
and Thrashing,” in Parallel Architectures and Compilation Techniques
(PACT), 2012.
[47] A. Vakil-Ghahani, S. Mahdizadeh-Shahri, M.-R. Lotfi-Namin,
M. Bakhshalipour, P. Lotfi-Kamran, and H. Sarbazi-Azad, “Cache
Replacement Policy Based on Expected Hit Count,” IEEE Computer
Architecture Letters (CAL), 2018.
[48] S. C. Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gupta, “The SPLASH-
2 Programs: Characterization and Methodological Considerations,” in
International Symposium on Computer Architecture (ISCA), 1995.
[49] C.-J. Wu, A. Jaleel, W. Hasenplaugh, M. Martonosi, S. C. Steely Jr, and
J. Emer, “SHiP: Signature-Based Hit Predictor for High Performance
Caching,” in International Symposium on Microarchitecture (MICRO),
2011.
