To Update or Not To Update?: Bandwidth-Efficient Intelligent Replacement
  Policies for DRAM Caches by Young, Vinson & Qureshi, Moinuddin K.
Vinson Young and Moinuddin K. Qureshi
Georgia Institute of Technology
{vyoung,moin}@gatech.edu
ABSTRACT
This paper investigates intelligent replacement policies for
improving the hit-rate of gigascale DRAM caches. Cache re-
placement policies are commonly used to improve the hit-rate
of on-chip caches. The most effective replacement policies
often require the cache to track and update per-line reuse state
to inform their decision. A fundamental challenge on DRAM
caches, however, is that stateful policies would require sig-
nificant bandwidth to maintain per-line DRAM cache state.
As such, DRAM cache replacement policies have primarily
been stateless policies, such as always-install or probabilistic
bypass. Unfortunately, we find that stateless policies are often
too coarse-grained and become ineffective at the size and
associativity of DRAM caches. Ideally, we want a replace-
ment policy that can obtain the hit-rate benefits of stateful
replacement policies, but keep the bandwidth-efficiency of
stateless policies.
We perform our study on a DRAM cache design similar
to the one used in Knights Landing, and find that tracking
per-line reuse state can enable an effective replacement pol-
icy that can mitigate the common thrashing patterns seen in
gigascale caches. We propose a stateful replacement/bypass
policy called RRIP Age-On-Bypass (RRIP-AOB) that tracks
reuse state for high-reuse lines, protects such lines by by-
passing other lines, and Ages the state On cache Bypass.
Unfortunately, such a stateful technique requires significant
bandwidth to update state. To this end, we propose Efficient
Tracking of Reuse (ETR). ETR makes state tracking efficient
by accurately tracking the state of only one line from a re-
gion, and using the state of that line to guide the replacement
decisions for other lines in that region. ETR reduces the band-
width for tracking the replacement state by 70%, and makes
stateful policies practical for DRAM caches. Our evaluations
with a 2GB DRAM cache show that our RRIP-AOB and
ETR techniques provide 18% speedup while needing less
than 1KB of SRAM.
1. INTRODUCTION
DRAM caches are important for enabling effective hetero-
geneous memory systems that can transparently provide the
bandwidth of high bandwidth memories [1], and the capacity
of high capacity memories [2, 3]. Designs for DRAM cache
organize the tag-store such that the tags can be kept in DRAM
(to reduce storage overheads) and yet the tags can also be
obtained with low latency and low bandwidth overheads [4,5].
For example, Intel’s Knights Landing product organizes its
DRAM cache as a direct-mapped cache with tags stored
alongside each data-line, so that one access can retrieve both
tag and data. This direct-mapped design has been shown to be
effective for enabling low latency and bandwidth-efficient tag
access [5]; however, such a direct-mapped design can have
significant conflict misses. One could consider increasing
associativity to improve hit-rate, but increasing associativity
also substantially increases bandwidth consumption and de-
grades performance for many workloads. Fortunately, cache
bypassing [6, 7, 8] offers a way to both improve hit-rate and
decrease bandwidth consumption, while still maintaining a
direct-mapped organization. We investigate the extent to
which an intelligent bypass policy can reduce conflict misses
for DRAM caches. We perform our evaluations on a direct-
mapped DRAM cache similar to the one used in KNL [4, 5].
We would like to use the most effective replacement poli-
cies to improve DRAM cache hit-rate. However, intelligent
replacement policies [8, 9, 10, 11] often require the cache to
track per-line state that needs to be updated on cache events.
On a DRAM cache, managing this per-line state is difficult
as tracking even 2 bits of state per line would require multi-
megabyte storage. As such, DRAM cache designs would
need to keep this state in the DRAM array, and spend off-chip
bandwidth to update state. Prior replacement policies pro-
posed for DRAM caches have avoided this per-line state with
stateless policies [4, 5, 6]. The DRAM cache in KNL [4, 5],
for example, employs an Always-Install policy. Along the
same lines, Chou et al. [6] propose a policy that bypasses the
cache with 90% probability (we call this policy 90%-Bypass).
However, such stateless policies often fail to capture the reuse
patterns commonly seen in large caches. We show how such
policies are often inadequate with an example.
Let us consider replacement policies for a common access
pattern where the workload has repeated accesses to high-
reuse data (labeled A) interspersed with accesses to low-reuse
data that is not re-referenced while it is in the cache (labeled
B), as shown in Figure 1(a). For the baseline Always-Install
policy, accesses to A will install A and enable subsequent
accesses to A to hit; however, accesses to low-reuse B will
evict A and cause the subsequent access to A to miss. In
this case, always-installing lines allows low-reuse B to evict
high-reuse A, and this results in degraded hit-rate and wasted
install bandwidth. For a 90%-Bypass policy, references to A
will install some A lines and marginally improve hit-rate, and
references to B will install only a few lines and marginally de-
grade hit-rate. In this case, 90%-Bypass offers some working-
set protection; however, it is indiscriminate in deciding which
lines to protect and may not achieve high hit-rate. Figure 1(b)
shows that such a probabilistic bypass policy has a poor performance
potential of only 3%. Ideally, we desire a bypass policy
that can remember and protect individual lines that have high
reuse (i.e., A), and bypass other lines (i.e., B). Figure 1(b)
shows that if we are able to formulate such a reuse-based
bypass policy while avoiding the bandwidth cost for state
update, we could achieve up to 20% speedup.
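The access pattern above can be reproduced on a single direct-mapped set. The following is a minimal sketch, not the simulator used in this paper: the trace (two accesses to A followed by one unique low-reuse line), the `hit_rate` helper, and the oracle policy that installs only A are illustrative assumptions.

```python
import random
from typing import Callable, List, Optional

def hit_rate(trace: List[str], should_install: Callable[[str], bool]) -> float:
    """Replay a trace on one direct-mapped set and return the hit-rate."""
    resident: Optional[str] = None
    hits = 0
    for line in trace:
        if line == resident:
            hits += 1
        elif should_install(line):
            resident = line          # evict the resident line, install the new one
        # otherwise the incoming line is bypassed and the resident line stays
    return hits / len(trace)

# Hypothetical trace: high-reuse line A, interleaved with lines B0, B1, ...
# that are never re-referenced.
trace: List[str] = []
for i in range(100):
    trace += ["A", "A", f"B{i}"]

rng = random.Random(0)               # fixed seed so the run is repeatable
always_install = hit_rate(trace, lambda line: True)
bypass_90 = hit_rate(trace, lambda line: rng.random() >= 0.9)
reuse_oracle = hit_rate(trace, lambda line: line == "A")

print(f"Always-Install: {always_install:.2f}")
print(f"90%-Bypass:     {bypass_90:.2f}")
print(f"Reuse oracle:   {reuse_oracle:.2f}")
```

Under these assumptions Always-Install hits on one access in three, because every B evicts A, whereas the oracle misses only on the first access to A. The probabilistic policy depends on the random draws and on which line happens to be installed, which is the indiscriminate behavior described above.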
Our approach to improving DRAM cache performance is
to (1) design a reuse-based bypass policy to improve DRAM
cache hit-rate, and to (2) reduce the bandwidth cost of state
update to further improve performance.
arXiv:1907.02167v1 [cs.AR] 4 Jul 2019
[Figure 1 graphic. Panel (a), Replacement Policies: Always-Install, 90%-Bypass, and Desired, applied to High-Reuse and Low-Reuse lines. Panel (b), Speedup Potential: Probabilistic Bypass versus Reuse-based Bypass; y-axis: speedup, 0% to 25%.]
Figure 1: (a) Always-Install, 90%-Bypass, and Desired replacement policies under mixed high-reuse low-reuse access
pattern. (b) Potential for speedup: Probabilistic Bypass [6], and Ideal Reuse-based Bypass with no state update cost.
In this paper, we use Re-Reference Interval Prediction
(RRIP) [9] as a representative example of a replacement pol-
icy that is designed to exploit reuse [8, 9, 10, 12, 13]. RRIP
requires that each line is equipped with metadata bits (two-bit
counter called RRPV) to track reuse. RRPV is set to 0 on
a hit, and the victim line is identified as a line that has an
RRPV of 3. If no line has an RRPV of 3, all counters
in the set are incremented and victim-selection is repeated.
While RRIP is effective for set-associative caches, it becomes
ill-defined for direct-mapped caches, as such a cache would
have only one counter in the set. Following the algorithm for
selecting a victim will always cause the resident line to get
evicted, even if the line had an RRPV of 0. Similarly, bypass-
ing the incoming line if the resident line has an RRPV=0 will
mean that such lines will never get evicted from the cache.
To enable reuse-based replacement policies for direct-
mapped DRAM caches, we propose a bypass version of RRIP,
which we call RRIP-AOB. The key mechanism in RRIP-AOB
is to Age the counters On cache Bypass. For example, if a
good victim cannot be found (no RRPV is 3), we bypass the
incoming line and age the reuse counter (increment RRPV).
After several bypass+age events, a resident line that is no
longer useful will have its RRPV reach 3 and become a can-
didate for eviction. This enables the cache to protect lines
that have had reuse via bypassing, but also provides a path
to eventually victimize cold lines. Our insight makes RRIP
(and other reuse-based policies) applicable to DRAM caches.
Another practical obstacle in implementing reuse-based
policies for DRAM caches is the high state update cost of
maintaining replacement state in DRAM. A straightforward
way of implementing RRIP-AOB in a DRAM cache is to extend
the tag-entry of the line to incorporate the bits for tracking
the replacement state of the line. However, this incurs
bandwidth overhead for updating the replacement
state: resetting the RRPV counter on a hit (promotion),
and incrementing the RRPV counter on a bypassing miss
(demotion). Note that these accesses for updating the replacement
state are not present in the baseline or in designs
that do bypassing without tracking per-line state. If we could
completely remove the state update cost, as in Ideal RRIP-AOB,
we could achieve up to 20% speedup. To reduce the state
update cost of maintaining per-line counters in DRAM, we
propose Efficient Tracking of Reuse (ETR).
ETR reduces the bandwidth consumed in performing up-
dates of the replacement state by doing the updates for only a
subset of the lines and using their replacement state to infer
the replacement state of the other lines. ETR is based on
two key properties that we observe in DRAM caches: Cores-
idency and Eviction-Locality. Coresidency indicates that at
any given time if a line is present, then several other lines
belonging to that 4KB region are also present in the cache.
Eviction-Locality indicates that when a line gets evicted from
the cache, the other coresident lines belonging to that region
tend to have replacement state similar to that of the line
being evicted. We show strong levels of coresidency and
eviction-locality with RRIP-AOB. ETR exploits
the properties of coresidency and eviction-locality to reduce
the updates for tracking the replacement state. Rather than
updating the replacement-state for all lines in the cache, ETR
simply updates the replacement-state for one of the coresi-
dent lines of the region, and uses the state of this line to guide
the replacement decisions of other coresident lines of the
region. ETR reduces the bandwidth overhead of state updates
by 70% and enables RRIP-AOB to achieve 18% speedup,
nearing Ideal RRIP-AOB performance. These benefits are
obtained with a storage overhead of less than 1KB SRAM.
Note: A cache implementing ETR still fundamen-
tally employs line-based replacement – it simply op-
portunistically exploits spatial locality when it exists
to reduce state update costs. We compare with alter-
native page-based [14] designs in Section 9.3, and
grouped-metadata [15] approaches in Section 9.2.
Overall our paper makes the following contributions:
Contribution-1: To our knowledge, this is the first paper
to investigate intelligent replacement / bypass policies for
direct-mapped DRAM caches. We propose a bypass version
of RRIP (RRIP-AOB) suitable for caches with limited asso-
ciativity. However, we find an effective replacement policy
for DRAM caches must optimize not only hit-rate but also
state update cost. We introduce two properties, coresidency
and eviction-locality, that can be exploited to reduce state
update cost for implementing intelligent replacement.
Contribution-2: We propose Efficient Tracking of Reuse
(ETR), a design that performs updates for only a subset of
lines and uses their state to guide the replacement decisions of
other lines. ETR reduces bandwidth overhead of updates by
70%, improves speedup to 18%, and requires only 512 bytes.
Contribution-3: We discuss how our concepts of RRIP-
AOB and ETR can be applied to enhanced policies that rely
on signature-information (SHiP [10]) for further speedup.
Contribution-4: We show that RRIP-AOB and ETR are
general techniques also applicable to set-associative imple-
mentations of DRAM caches.
[Figure 2 graphic. DRAM array and row buffer: Row Buffer = 32 x (72-byte TAD). Each TAD holds DATA (64B) and ECC bits (8B) and is transferred as an Address Data Burst of 4 x 18B. A single burst (18B) carries DATA (128b), SECDED ECC (9b), and unused ECC bits (7b); over 4 bursts this gives 28 bits per line for Tag/Metadata.]
Figure 2: Organization of the DRAM cache used in KNL. DRAM cache is organized at a linesize of 64 bytes, is direct-
mapped, and tags are kept with the data-line. On an access, the DRAM cache transfers 72 bytes using four bursts on an
18-byte bus (16-bytes for data + 2-bytes for ECC). We need only 9-bits for SECDED on 16-bytes of data, which leaves 7
unused ECC bits in each burst that can be used to store metadata (KNL utilizes these 28 unused ECC bits to store tags).
2. BACKGROUND AND MOTIVATION
We present the organization of our DRAM cache and dis-
cuss the storage and bandwidth constraints that make it chal-
lenging to apply intelligent replacement policies.
2.1 Organization of a DRAM Cache (KNL)
As the tag storage required for gigascale DRAM caches
is large, DRAM cache designs often store tags in DRAM
and intelligently organize their structure to enable efficient
tag-access. The baseline we use for this study is the direct-
mapped, tags-in-ECC organization used in Intel’s Knights
Landing (KNL) design [4, 5]. Figure 2 shows the organization
of the DRAM cache in KNL. The DRAM cache places
the tag information of each line in the unused bits of the ECC space
and streams out the data and tag (contained in ECC) on each
access. The tag information is used to determine cache hit
or miss. On a tag match, the data is available to service the
request immediately, without any additional latency. Thus, co-
locating the tag and data allows the DRAM cache access to be
serviced in just one DRAM request, which makes the cache
hit operation both low-latency and bandwidth-efficient [5].
Our goal is to increase the hit-rate of such DRAM caches. In
fact, the DRAM cache only uses about 8-10 bits from the un-
used 28 bits in the ECC space, so we have 18-20 bits per line
available for managing the DRAM cache intelligently. We
leverage these bits to build intelligent replacement policies.
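The bit budget quoted above follows from the burst layout in Figure 2. The arithmetic can be checked with a short sketch; the constant and variable names are illustrative, and the tag width is taken from the "about 8-10 bits" stated in the text rather than from a KNL specification.

```python
# Bit budget for one 72-byte tag-and-data transfer, using the figures
# quoted in the Figure 2 caption. Names are illustrative.
BURSTS_PER_ACCESS = 4
ECC_BYTES_PER_BURST = 2        # 18-byte bus = 16 bytes data + 2 bytes ECC
SECDED_BITS_PER_BURST = 9      # SECDED over 16 bytes (128 bits) of data

ecc_bits_per_burst = ECC_BYTES_PER_BURST * 8                    # 16
unused_per_burst = ecc_bits_per_burst - SECDED_BITS_PER_BURST   # 7
unused_per_line = unused_per_burst * BURSTS_PER_ACCESS          # 28

# The text states that tags use about 8-10 of these bits.
spare_bits = [unused_per_line - tag_bits for tag_bits in (8, 10)]

RRPV_BITS = 2                  # per-line state needed by RRIP
print(unused_per_line, spare_bits, all(s >= RRPV_BITS for s in spare_bits))
# prints: 28 [20, 18] True
```

Either way, the 2 bits of reuse state used later in the paper fit comfortably in the spare ECC bits.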
2.2 Replacement / Bypass Policies for 1-Way
Typically, cache replacement policies are discussed in the
context of a set associative cache, as the set contains multiple
lines and there is a choice of the line to evict. For a direct-
mapped cache, the set contains only one line, so if we want
to install, there is exactly one place the line can go, and we
do not have a choice in selecting the victim. However, we
could choose to bypass the line, so the binary choice for a
direct-mapped cache becomes, whether to evict the resident
line or to bypass the incoming line. We can improve the hit-
rate by making this binary decision intelligently. We explain
different replacement strategies for a direct-mapped cache.
Probabilistic Replacement: The simplest policy is to by-
pass the incoming line with a certain probability. For example,
Bandwidth-Aware Bypass (BAB) [6, 8] bypasses the incom-
ing line with 90% probability to reduce install bandwidth, as
long as hit-rate remains unaffected. Figure 5 shows that such
global bypassing policies are coarse-grained and miss out on
bypassing opportunities that exploit per-line information.
Recency-Based Replacement: LRU [16] installs incoming
lines with the highest priority, based on the heuristic that
recently-used lines are more likely to be re-used. On a direct-
mapped cache, LRU degenerates into an Always-Install de-
sign, as the incoming line is the most recent. Enhancements
of LRU, such as DIP [17], degenerate to probabilistic bypass.
Reuse-Based Replacement: Replacement policies that ex-
ploit reuse (also called re-reference or frequency) are resilient
to thrashing and scans [9, 18, 19]. Such policies can protect
the direct-mapped DRAM cache from thrashing when multi-
ple pages are mapped to the same set of the DRAM cache. We
discuss Re-Reference Interval Prediction (RRIP) [9] policy.
[Figure 3 graphic. State diagram over RRPV values 00, 01, 10, 11, with transitions labeled Hit, Install, Evict, and All_Counters < 3.]
Figure 3: Re-Reference Interval Prediction (RRIP).
Re-Reference Interval Prediction (RRIP) [9] is a thrash- and scan-resistant
replacement policy often used in last-level caches.
As shown in Figure 3, each line is equipped with a 2-bit
counter to track the Re-Reference Interval Prediction Value
(RRPV). On a hit to the line, the RRPV is Promoted to 0.
On a miss, the victim is found by searching from way 0 and
finding the first line in the set with RRPV of 3. If no such
line is found, the RRPV of all lines in the set is Demoted (i.e.,
incremented) and the search is repeated. Lines are installed
with RRPV=2 to protect the lines that were re-used.
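The promotion, victim search, and demotion steps described above can be sketched for one set as follows. This is a minimal illustration of the rules as stated in the text, not code from [9]; the function names and the list-of-counters representation are assumptions.

```python
from typing import List

RRPV_MAX = 3       # 2-bit counter: 0 = recently re-referenced, 3 = evict next
RRPV_INSTALL = 2   # new lines start here, per the text

def rrip_on_hit(rrpv: List[int], way: int) -> None:
    """Promote: a hit resets the line's RRPV to 0."""
    rrpv[way] = 0

def rrip_select_victim(rrpv: List[int]) -> int:
    """Scan from way 0 for RRPV == 3; if none is found, demote all and retry."""
    while True:
        for way, value in enumerate(rrpv):
            if value == RRPV_MAX:
                return way
        for way in range(len(rrpv)):
            rrpv[way] += 1

def rrip_on_miss(rrpv: List[int]) -> int:
    """Pick a victim, install the incoming line there, and return the way."""
    way = rrip_select_victim(rrpv)
    rrpv[way] = RRPV_INSTALL
    return way

# A 4-way set in which way 0 was recently hit.
rrpv = [0, 2, 1, 2]
victim = rrip_on_miss(rrpv)
print(victim, rrpv)   # prints: 1 [1, 2, 2, 3]
```

In the example no counter equals 3, so all counters are demoted once and way 1 becomes the victim while the recently hit way 0 survives. If the set holds a single counter, for example `rrip_on_miss([0])`, the same loop always returns way 0 regardless of the counter's starting value.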
Challenge in Using RRIP for Direct-Mapped Cache: Just
like other replacement policies based on reuse-information,
RRIP operates by comparing the counter values of multiple
candidates in the set. It becomes ill-defined for a direct-
mapped cache, where there is only one counter, which means
the resident line will always get evicted regardless of the past
behavior. Thus, for a direct-mapped cache RRIP degenerates
into always-install (or always-bypass if the incoming line
is bypassed unless the RRPV of the resident line equals 3).
We propose extensions that make reuse-based policies viable
for direct-mapped and two-way caches, and implementations
that reduce the cost of tracking the RRPV state for gigascale
DRAM caches. We discuss our solution after methodology.
3. METHODOLOGY
3.1 Framework and Configuration
We use USIMM [20], an x86 simulator with detailed mem-
ory system model. We extend USIMM to include a DRAM
cache. Table 1 shows the configuration used in our study.
We model a configuration similar to an Intel Knights Landing
(KNL) Sub-NUMA Cluster (one-eighth size). We assume a
four-level cache hierarchy (L1, L2, L3 being on-chip SRAM
caches and L4 being off-chip DRAM cache). All caches use
64B line size. We model a virtual memory system to perform
virtual to physical address translations. The L4 is a 2GB
DRAM cache [5,21], which is direct-mapped and places tags
with data in the unused ECC bits. The parameters of our
DRAM cache are based on HBM technology [1]. The main
memory is based on non-volatile memory and is assumed to have a latency
similar to PCM and 3D-XPoint [3,22,23,24,25,26,27]:
the read latency is 4X that of DRAM [2], and write bandwidth
is worse than read bandwidth. We perform evaluations with
DRAM-based memory in Section 8.5.
Table 1: System Config (KNL 1⁄8 Sub-NUMA Cluster)
Processors              8 cores; 3.0GHz, 2-wide OoO
Last-Level Cache        8MB, 16-way
DRAM Cache
  Capacity              2GB
  Bus Frequency         500MHz (DDR 1GHz)
  Configuration         4 channel, 128-bit bus
  Aggregate Bandwidth   64 GB/s
  tCAS-tRCD-tRP-tRAS    13-13-13-30 ns
Main Memory (PCM)
  Capacity              64GB
  Bus Frequency         1000MHz (DDR 2GHz)
  Configuration         1 channel, 64-bit bus
  Aggregate Bandwidth   16 GB/s
  tCAS-tRCD-tRP         13-128-8 ns
  tRAS-tWR              143-160 ns
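The aggregate bandwidth entries in Table 1 are consistent with the listed channel counts, bus widths, and data rates. A quick check, assuming that "DDR 1GHz" and "DDR 2GHz" denote 1 and 2 giga-transfers per second and that GB means 10^9 bytes (the helper name is illustrative):

```python
def peak_bandwidth_gb_per_s(channels: int, bus_bits: int,
                            transfers_per_s: float) -> float:
    """Peak bandwidth = channels x bus width in bytes x transfers per second."""
    return channels * (bus_bits / 8) * transfers_per_s / 1e9

# DRAM cache: 4 channels, 128-bit bus, 500MHz clock with DDR signalling.
dram_cache_bw = peak_bandwidth_gb_per_s(4, 128, 1e9)
# Main memory: 1 channel, 64-bit bus, 1000MHz clock with DDR signalling.
main_memory_bw = peak_bandwidth_gb_per_s(1, 64, 2e9)

print(dram_cache_bw, main_memory_bw)   # prints: 64.0 16.0
```

The DRAM cache therefore offers four times the peak bandwidth of main memory in this configuration.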
3.2 Workloads
We use a representative slice of 2 billion instructions selected
by PinPoints [28] from benchmark suites that include
SPEC 2006 [29], GAP [30], and HPC. For SPEC, we pick
a sample of high-intensity workloads that have at least two
misses per thousand instructions (MPKI). The evaluations execute
benchmarks in rate mode, where all eight cores execute
the same benchmark. In addition to rate-mode workloads,
we also evaluate 24 mixed workloads, which are created by
randomly choosing 8 of the 15 SPEC workloads that have at
least two MPKI. Table 2 shows L3 miss rates, and memory
footprints for the 8-core rate-mode workloads in our study.
We perform timing simulation until each benchmark in
a workload executes at least 2 billion instructions. We use
weighted speedup to measure aggregate performance of the
workload normalized to the baseline and report geometric
mean for the average speedup across all the 21 workloads (11
SPEC, 4 SPEC-mix, 5 GAP, 1 HPC). Note that to keep the
graphs readable we only use 4 mixed workloads for all of our
results. However, we provide key performance results for the
set of the remaining 20 mixes in Section 8.1.
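The reporting method above can be sketched as follows. The text does not spell out the weighted-speedup formula, so this sketch assumes the common definition (the sum over cores of IPC in the shared system divided by IPC when running alone); all IPC values below are hypothetical and are not measurements from this study.

```python
import math
from typing import Sequence

def weighted_speedup(ipc_shared: Sequence[float],
                     ipc_alone: Sequence[float]) -> float:
    """Sum over cores of shared-system IPC divided by stand-alone IPC."""
    return sum(shared / alone for shared, alone in zip(ipc_shared, ipc_alone))

def geometric_mean(values: Sequence[float]) -> float:
    """Geometric mean, computed in log space."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Hypothetical two-core workload.
ipc_alone = [1.0, 2.0]
baseline = weighted_speedup([0.5, 1.0], ipc_alone)   # always-install baseline
policy = weighted_speedup([0.6, 1.2], ipc_alone)     # candidate policy
normalized = policy / baseline                       # speedup over baseline

# Hypothetical normalized speedups of three workloads, averaged as in the text.
average = geometric_mean([normalized, 1.0, 1.5])
print(round(normalized, 3), round(average, 3))
```

The geometric mean is used for the average so that a 2x speedup on one workload and a 2x slowdown on another cancel out.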
Table 2: Workload Characteristics
Suite  Workload    L3 MPKI  Footprint
SPEC   soplex      35.3     1.8 GB
       leslie      22.1     623 MB
       libq        30.1     256 MB
       gcc         108.5    1.5 GB
       omnet       29.1     1.2 GB
       wrf         10.4     1.1 GB
       zeus        7.0      1.6 GB
       xalanc      7.4      1.5 GB
       mcf         101.1    13 GB
       milc        31.2     4.5 GB
       sphinx      15.0     146 MB
GAP    cc twitter  116.8    9.3 GB
       bc twitter  101.2    13.5 GB
       pr twitter  126.6    15.3 GB
       pr web      24.8     15.1 GB
       cc web      11.4     9.3 GB
HPC    nekbone     13.71    44 MB
4. RRIP: AGE-ON-BYPASS
If we want to use RRIP on direct-mapped DRAM caches,
we have to solve two issues: how do we formulate RRIP as a
bypassing policy suitable for caches with limited associativity,
and how can we mitigate the state update cost of maintaining
per-line reuse state in DRAM.
4.1 RRIP as a Bypassing Policy
We design a version of RRIP for limited-associativity
caches, called RRIP: Age-On-Bypass (RRIP-AOB). The key
insight in RRIP-AOB is to use the episode of cache bypassing
to age / update the RRPV information associated with the
line. Figure 4 shows an overview of our design. Like RRIP, RRIP-AOB
needs to track lines that have reuse, so RRIP-AOB
Promotes state (sets RRPV to 0) on a hit. RRIP-AOB can protect
these reused lines by bypassing when reuse has been
seen (bypass when RRPV is 0, 1, or 2). However, reused
lines can now stay stuck in high priority state. We need a
different mechanism to age older lines so that new lines can
eventually be installed. We choose to implement aging by
Demoting state (incrementing RRPV) when an incoming line
is bypassed. This allows lines to naturally age to RRPV of
3, and be evicted in favor of the incoming line. Similar to
RRIP, RRIP-AOB needs 2 bits per line to track RRPV. A
practical design must address where to store the RRPV bits
and address the bandwidth needed to track the per-line RRPV.
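The promote, bypass, and age-on-bypass rules described above can be sketched for a single direct-mapped set. This is a minimal sketch under stated assumptions, not the evaluated implementation: the class name `Set1Way` is hypothetical, an empty set is modeled as immediately replaceable, and new lines are assumed to be installed with RRPV=2 as in RRIP.

```python
from typing import Optional

RRPV_MAX = 3       # line is a candidate for eviction
RRPV_INSTALL = 2   # assumed install value, matching the RRIP description

class Set1Way:
    """One direct-mapped set: a single resident tag and its 2-bit RRPV."""

    def __init__(self) -> None:
        self.tag: Optional[str] = None
        self.rrpv: int = RRPV_MAX      # an empty set can be filled immediately

    def access(self, tag: str) -> str:
        if tag == self.tag:
            self.rrpv = 0              # Promotion on hit
            return "hit"
        if self.rrpv == RRPV_MAX:
            self.tag, self.rrpv = tag, RRPV_INSTALL
            return "install"           # the resident line, if any, is evicted
        self.rrpv += 1                 # Demotion: age on bypass
        return "bypass"

s = Set1Way()
# High-reuse line A interleaved with low-reuse lines B0, B1.
protected = [s.access(t) for t in ["A", "A", "B0", "A", "B1", "A"]]
# A stops being referenced while new lines keep arriving.
aging = [s.access(t) for t in ["C0", "C1", "C2", "C3"]]
print(protected)  # ['install', 'hit', 'bypass', 'hit', 'bypass', 'hit']
print(aging)      # ['bypass', 'bypass', 'bypass', 'install']
```

While A keeps hitting, each intervening miss is bypassed and A stays resident. Once A goes cold, three bypasses age it to RRPV=3 and the next miss replaces it, which is the path to eventual eviction that plain RRIP lacks on a direct-mapped cache.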
[Figure 4 graphic. State diagram over RRPV values 00, 01, 10, 11, with transitions labeled Hit (Promotion), Bypass (Demotion), Install, and Evict.]
Figure 4: Overview of RRIP: Age-On-Bypass (RRIP-
AOB). The transition from one state to another is accom-
plished with replacement-state update operation. Such
updates may consume significant bandwidth.
[Figure 5 graphic. Bar chart of speedup (y-axis 0.4 to 1.6; clipped bars labeled 1.93 and 2.27) for Bypass-90%, Bandwidth-Aware Bypass, RRIP-AOB, and Ideal RRIP-AOB. Workloads: sphinx, milc, nekbone, cc web, pr web, mcf, xalanc, pr twi, bc twi, cc twi, zeusmp, wrf, omnet, gcc, libq, leslie, soplex, mix1, mix2, mix3, mix4, Gmean.]
Figure 5: Speedup from different replacement policies over the baseline always-install direct-mapped DRAM cache.
(a) Bypass-90% causes 15% degradation, (b) Bandwidth-Aware Bypass provides 3% speedup, (c) RRIP-AOB that maintains
state in DRAM provides 13% speedup, and (d) Ideal RRIP-AOB with no state update cost provides 20% speedup.
4.2 Storing RRPV in DRAM
A straightforward way of incorporating RRIP into a DRAM
cache is to extend the tag-entry of the line to incorporate the
RRPV bits. We refer to this design as simply RRIP-AOB.
However, such a design incurs bandwidth overhead for updating
the replacement state. Note that these
accesses for updating the replacement state are not present
in the baseline or in designs that do bypassing without
tracking per-line state.
Alternatively, we can avoid the bandwidth of replacement
updates by storing the replacement state in a dedicated SRAM
array. Unfortunately, for our 2GB DRAM cache, maintaining
2 bits of RRPV per line would need 8MB of SRAM, which
is impractically large. We call this design Ideal RRIP-AOB.
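The 8MB figure follows directly from the cache geometry: a 2GB cache with 64B lines holds 2^25 lines, and 2 bits for each of them add up to 8MB. A short check, treating GB and MB as binary units (the variable names are illustrative):

```python
CACHE_BYTES = 2 * 2**30    # 2GB DRAM cache
LINE_BYTES = 64            # line size used by all caches in this study
RRPV_BITS = 2              # per-line state for RRIP-AOB

num_lines = CACHE_BYTES // LINE_BYTES        # 33,554,432 lines (2**25)
sram_bytes = num_lines * RRPV_BITS // 8      # total RRPV storage in bytes

print(num_lines, sram_bytes // 2**20)        # prints: 33554432 8
```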
[Figure 6 graphic. Bar chart of L4 read MPKI (y-axis ticks 1, 2, 4, 8, 16, 32, 64) for Always-Install and RRIP-AOB. Workloads: sphinx, milc, nekbone, cc web, pr web, mcf, xalanc, pr twi, bc twi, cc twi, zeusmp, wrf, omnet, gcc, libq, leslie, soplex, Average.]
Figure 6: MPKI of baseline DRAM cache and RRIP-
AOB. RRIP-AOB reduces misses by 10%.
4.3 Benefits from Reuse-Based Replacement
Intelligent replacement policies improve performance by
reducing cache misses. Figure 6 shows the Misses Per Thou-
sand Instructions (MPKI) for our baseline DRAM cache and
with RRIP-AOB. RRIP-AOB reduces misses by 10% on
average. However, the speedup from RRIP-AOB also de-
pends on bandwidth used in replacement-state updates.
Figure 5 shows the speedup from different bypassing poli-
cies implemented on our 2GB DRAM cache. Performance
numbers are normalized to the always-install policy. Indis-
criminately bypassing 90% of the lines (Bypass-90%) causes
a degradation of 15%. The adaptiveness of Bandwidth-Aware-
Bypass (BAB) [6,8] avoids slowdowns; however, the average
speedup is only 3%. With RRIP-AOB, the performance benefit
is 13%, whereas with Ideal RRIP-AOB the speedup
could be 20%. Thus, there is significant room for perfor-
mance improvement with reuse-based replacement policies.
Unfortunately, obtaining this benefit in a practical manner is
challenging as maintaining accurate per-line state in DRAM
requires significant bandwidth for state updates.
4.4 Dissecting BW of Replacement-Updates
To highlight the bandwidth differences between Always-Install
and RRIP-AOB, we show the bandwidth that each policy needs
to implement replacement. Always-Install
simply has install bandwidth, whereas
RRIP-AOB additionally needs bandwidth to promote and
demote state. Figure 7 shows the replacement bandwidth
of RRIP-AOB, normalized to the replacement bandwidth of
Always-Install. Of particular note, RRIP-AOB has the po-
tential to save 76% of the install bandwidth (due to bypass),
which can improve performance. However, it has overall
increased bandwidth consumption due to promotion and de-
motion. If we want to obtain most of the benefits of RRIP, we
must develop methods to reduce this bandwidth overhead.
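The normalization used for this comparison can be sketched as follows, assuming that each install, promotion, and demotion costs one DRAM cache access of equal weight. Only the install count reflects the text (a 76% saving leaves 24 installs per 100 baseline installs); the promotion and demotion counts are hypothetical placeholders, not measured values.

```python
def normalized_replacement_bw(installs: int, promotes: int, demotes: int,
                              baseline_installs: int) -> float:
    """Replacement traffic relative to Always-Install, one access per event."""
    return (installs + promotes + demotes) / baseline_installs

baseline_installs = 100   # Always-Install: every miss installs
installs = 24             # 76% of installs avoided by bypassing (from the text)
promotes = 60             # hypothetical number of state resets on hits
demotes = 70              # hypothetical number of state updates on bypasses

install_only = normalized_replacement_bw(installs, 0, 0, baseline_installs)
total = normalized_replacement_bw(installs, promotes, demotes, baseline_installs)
print(install_only, total)   # prints: 0.24 1.54
```

With counts of this kind, the install savings are more than offset by state updates, which is the qualitative effect described above.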
[Figure 7 graphic. Stacked bars of replacement bandwidth w.r.t. Always-Install (y-axis 0% to 250%), split into Install, Promote, and Demote. Workloads: sphinx, milc, nekbone, cc_web, pr_web, mcf, xalanc, pr_twi, bc_twi, cc_twi, zeusmp, wrf, omnet, gcc, libq, leslie, soplex, Amean.]
Figure 7: Replacement bandwidth (Install, Promote, De-
mote) of RRIP-AOB, normalized to replacement band-
width (Install) of Always-Install. RRIP-AOB reduces in-
stall bandwidth but incurs state update bandwidth.
4.5 Potential for Improvement
RRIP-AOB with state in DRAM is a practical design as
it does not require any SRAM overheads, and can be imple-
mented without any changes to the DRAM cache (the extra
bits for RRPV are taken from the unused ECC bits). However,
it achieves only about two-thirds of the speedup (13% versus 20%)
of Ideal RRIP-AOB with no state update costs. Ideally, we
would like to get speedup similar to Ideal RRIP-AOB, while
needing low SRAM cost similar to RRIP-AOB. The goal of
the next section is to develop such a solution.
RRIP-AOB simply suffers from high DRAM state update
cost. If we can find effective ways to mitigate this bandwidth
overhead, we can get most of the benefits at little cost. We
develop an insight that if we can do replacement updates in
an efficient manner for only a subset of the lines, then we can
reduce the bandwidth for replacement updates and still retain
most of the benefits.
5. EFFICIENT TRACKING OF REUSE
Demoting state on every cache bypass incurs significant
bandwidth overheads: even if we choose to bypass the line,
we still have to spend bandwidth to demote the replacement-
state. We can avoid state update costs if we have an effective
way to infer an RRPV state. Our design reduces the band-
width consumed in performing updates of the replacement
state by doing the updates for only a subset of the lines and
using their replacement state to infer the replacement state of
the other lines. Our solution is based on two key properties,
Coresidency and Eviction-Locality, which we describe next.
5.1 Insight: Coresidency and Eviction-Locality
Coresidency indicates that at any given time if a line is
present, then several other lines belonging to that region are
also present in the cache. Coresidency indicates that there is
some amount of spatial locality in the reference stream, even
if such spatial locality is not perfect. A 4KB region contains
64 lines each of 64 bytes. Therefore, the maximum number
of coresident lines for a region would be 63. Figure 8 shows
the level of coresidency for our workloads. In general, the
workloads have between 16 and 45 coresident lines. We note
that although this is lower than perfect spatial locality, there
are still a large number of lines coresident (even 4 coresident
lines can amortize 75% of the state update cost). This shows there
is potential for using one line to infer replacement state of
many coresident lines.
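The amortization argument can be made concrete: if a single state update stands in for n coresident lines, the fraction of updates avoided is 1 - 1/n. The sketch below is an idealized bound that assumes one representative update fully replaces n per-line updates; the function name is illustrative.

```python
def update_savings(coresident_lines: int) -> float:
    """Fraction of state updates avoided when one update serves n lines."""
    return 1 - 1 / coresident_lines

for n in (4, 16, 45, 64):
    print(n, f"{update_savings(n):.1%}")
# prints:
# 4 75.0%
# 16 93.8%
# 45 97.8%
# 64 98.4%
```

Most of the benefit is already available at modest levels of coresidency, so spatial locality does not have to be perfect for inference from one line to pay off.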
 
[Figure 8 graphic. Bar chart of the number of lines coresident on eviction (y-axis 0 to 64). Workloads: milc, cc_web, pr_web, mcf, pr_twi, bc_twi, cc_twi, zeusmp, wrf, gcc, libq, leslie, soplex, Amean.]
Figure 8: Coresidency in DRAM caches. Average num-
ber of coresident lines in a 4KB region on first line evicted
from a region (workloads with L4 MPKI>1).
Eviction-Locality indicates that when a line gets evicted
from the cache, the other coresident
lines belonging to that region tend to have replacement
state similar to that of the line being evicted. Figure 9 shows
the distribution of the RRPV of coresident lines, on an evic-
tion from L4 (on the first line evicted from a region).
[Figure 9 graphic. Stacked bars of coresident RRPV on first evict (y-axis 0% to 100%), split into RRPV=0, RRPV=1, RRPV=2, and RRPV=3. Workloads: milc, cc_web, pr_web, mcf, pr_twi, bc_twi, cc_twi, zeusmp, wrf, gcc, libq, leslie, soplex, Amean. Values printed above the bars, in the same order: 17, 46, 40, 30, 23, 18, 21, 31, 44, 27, 29, 45, 27, 30.]
Figure 9: Distribution of RRPV of coresident lines on
first line evicted from a 4KB region, for workloads with
L4 MPKI>1. Average number of coresident lines shown
above workloads. Eviction of one line indicates other
lines in a region are likely to be evicted soon (RRPV≥2).
Combined insight: On eviction, we typically observe
30 or more lines are coresident. In addition, we find the
coresident lines generally have similar RRPV state (77%
have RRPV≥2). Together, this means that if we maintain
accurate RRPV for just one of the coresident lines, then
we can infer RRPV state for the rest of the coresident lines
in the region with reasonable accuracy. Our solution is
based on exploiting this insight.
5.2 Insight: Update Only the Representative
We propose Efficient Tracking of Reuse (ETR) to reduce
bandwidth overheads of doing replacement updates in DRAM.
We implement ETR on top of RRIP-AOB as an example. ETR
exploits the properties of coresidency and eviction-locality.
Instead of updating replacement state for all the lines in a
region, ETR updates the state of only one Representative-Line
among all the coresident lines. The state of the Representative-
Line is then used to guide the replacement policy of the cores-
ident lines. The design of ETR consists of three parts: (1)
Selecting a Representative-Line in the region, (2) Keeping
accurate RRPV for only the Representative-Line, and (3) Using
the representative’s RRPV to infer coresident lines’ RRPV to
make bypass decisions.
[Figure 10 graphic. Access sequence: Page A, Page B, Page B, over Sets 0 to 4. (a) RRIP-AOB: Install A (BW cost 4), Bypass B (BW cost 5), Install B (BW cost 5); Sets 0 to 3 all age from A, R=2 to A, R=3 on the bypass. (b) ETR on RRIP-AOB: Install A (BW cost 4), Bypass B (BW cost 2), Install B (BW cost 5); only the first conflicting set (Set 0) ages to A, R=3 (update 1, avoid 3 updates), and the remaining sets follow its install decision. In both panels Sets 0 to 3 end with B, R=2, and Set 4 goes from B, R=2 to B, R=0.]
Figure 10: ETR’s representative-update and bypass-
decision following enables similar RRIP-AOB install pol-
icy, at reduced update bandwidth (dashed box = benefit).
To implement representative-update, we first need to pick
a stable representative line. Prior work finds the first access
to a region is relatively consistent [31]. If we maintain state
for just the first conflicting set in a region, we can maintain
good reuse information for the rest of the region without
incurring extra bandwidth costs. Figure 10 shows an example
of how ETR’s RRPV-inference (i.e., representative-update
and bypass-decision following) can be used to obtain similar
install-policy and hit-rate at reduced update cost.
If we first access 4 lines from region A at time 0, we install
region A with RRPV=2. If 5 lines from region B are then
accessed, Figure 10(b) shows that we can save bandwidth
by demoting only the state of the first conflicting set (set 0).
On the second access to region B, set 0 with its RRPV of 3
informs us that region A was not used recently. This
means that lines corresponding to region A have low reuse
and should be evicted in favor of installing region B. We can
then follow the region B install decision for the rest of the
lines. Such a policy ends up installing all of region B
and results in an install policy similar to the one obtained by
maintaining each line's state individually, as in Figure 10(a).
As such, we can keep a similar install policy and save update
bandwidth with representative state update and
bypass-decision following.
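The walkthrough above can be replayed with a minimal sketch that counts DRAM-cache writes (installs plus state updates) for both policies. Several details are assumptions made only for illustration and are not specified in the text: empty sets always install, a hit promotes the line to RRPV=0 at the cost of one write, and region B's table entry is cleared between the two passes to model it having aged out. The names `Line`, `access_region`, and `run` are hypothetical.

```python
from dataclasses import dataclass

MAX_RRPV = 3      # RRPV=3 marks the resident line as replaceable
INSTALL_RRPV = 2  # install value used in the walkthrough


@dataclass
class Line:
    region: str
    rrpv: int


def access_region(cache, region, sets, rbt=None):
    """Access one line of `region` in each set; return the writes spent.

    rbt=None models plain RRIP-AOB (every bypass demotes the resident line).
    A dict models ETR: only the first conflicting set decides and demotes.
    """
    writes = 0
    for s in sets:
        line = cache.get(s)
        if line is not None and line.region == region:  # hit: promote
            if line.rrpv != 0:
                line.rrpv = 0
                writes += 1
            continue
        if line is None:                                # assumption: empty set installs
            cache[s] = Line(region, INSTALL_RRPV)
            writes += 1
            continue
        if rbt is not None and region in rbt:           # follower set: reuse decision
            bypass = rbt[region]
        else:                                           # representative set: decide
            bypass = line.rrpv < MAX_RRPV
            if bypass:
                line.rrpv += 1                          # age on bypass
                writes += 1
            if rbt is not None:
                rbt[region] = bypass
        if not bypass:
            cache[s] = Line(region, INSTALL_RRPV)
            writes += 1
    return writes


def run(use_etr):
    """Replay: region A (4 lines), region B (5 lines), region B again."""
    cache = {}
    rbt = {} if use_etr else None
    costs = [access_region(cache, "A", range(4), rbt)]
    costs.append(access_region(cache, "B", range(5), rbt))
    if use_etr:
        rbt.clear()  # assumption: B's entry aged out between the two passes
    costs.append(access_region(cache, "B", range(5), rbt))
    return costs, cache


print("RRIP-AOB writes:", run(use_etr=False)[0])  # [4, 5, 5]
print("ETR writes:     ", run(use_etr=True)[0])   # [4, 2, 5]
```

Under these assumptions both variants end with region B installed in all five sets, while ETR spends 2 writes instead of 5 on the first pass over region B, matching the costs annotated in Figure 10.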
[Figure 11 plot: speedup (y-axis, 0.8 to 1.6) per workload (sphinx through soplex, mix1-mix4, Gmean) for RRIP-AOB, ETR on RRIP-AOB, and Ideal RRIP-AOB; bars exceeding the axis are labeled 1.93, 2.31, and 2.40.]
Figure 11: Performance of RRIP-AOB, ETR on RRIP-AOB, and an Ideal RRIP-AOB with no state update costs. Coor-
dinating bypass decisions with ETR reduces state update needs, and enables RRIP-AOB to obtain 18% speedup.
Structures for ETR: To implement representative state update
and bypass-decision following, ETR maintains a Recent-Bypass
Table (RBT), shown in Figure 12. The RBT tracks recently seen
regions (Region-ID) and the bypass decision made for them
(Last-Bypass-Decision). The RBT enables us to find the
representative first-conflicting-set in a region (as the first
conflicting set would miss in the RBT), keep just that set’s
RRPV up-to-date, and remember the first-conflicting-set’s
bypass decision to inform the bypass decision for the other
lines in the region (as the follower sets would hit in the RBT
and see the previous decision). We use a 128-entry RBT, which
requires <512B of SRAM (performance is relatively insensitive
to RBT sizing).
[Figure 12 diagram: Recent Bypass Table (RBT) with fields Region ID and Last Bypass Decision; example entries Page C = 0, Page A = 1, Page B = 0. On an RBT hit: 1. follow decision. On an RBT miss: 1. make new decision, 2. update RRIP state, 3. update RBT.]
Figure 12: Design of Recent-Bypass-Table to enforce
coordinated-bypass and coordinated-state-update. De-
motions only occur on first miss to a region.
Operation of ETR: On cache miss, we index into RBT with
Region-ID. If there is an RBT miss, we are currently access-
ing the representative first-conflicting-set in a region. In this
case, we should make a bypass decision based on its RRPV,
spend bandwidth to demote state if bypass was chosen, and
update the RBT so later accesses can make an informed by-
pass decision. Otherwise, if there is an RBT hit, the region
has been recently accessed and already had a bypass decision
made, so we should follow the Last Bypass Decision to keep
similar install policy and save on demotion bandwidth.
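The miss-handling flow described above can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the FIFO eviction of RBT entries, the use of a byte address shifted by 12 bits as the Region-ID (4KB regions), and the names `RecentBypassTable`, `region_id`, and `on_cache_miss` are all assumptions.

```python
from collections import OrderedDict

REGION_SHIFT = 12  # assumption: 4KB regions indexed from a byte address


def region_id(addr):
    """Map a byte address to its Region-ID."""
    return addr >> REGION_SHIFT


class RecentBypassTable:
    """Small table mapping Region-ID -> Last-Bypass-Decision."""

    def __init__(self, entries=128):
        self.entries = entries
        self.table = OrderedDict()

    def lookup(self, rid):
        """Return the stored decision, or None on an RBT miss."""
        return self.table.get(rid)

    def record(self, rid, bypass):
        """Store a decision; evict the oldest entry when full (assumed FIFO)."""
        if rid not in self.table and len(self.table) >= self.entries:
            self.table.popitem(last=False)
        self.table[rid] = bypass


def on_cache_miss(rbt, rid, resident_rrpv, max_rrpv=3):
    """Return (bypass, demote_resident) for a miss that conflicts with a resident line."""
    decision = rbt.lookup(rid)
    if decision is not None:
        return decision, False         # RBT hit: follow, spend no update bandwidth
    bypass = resident_rrpv < max_rrpv  # RBT miss: representative set decides
    rbt.record(rid, bypass)
    return bypass, bypass              # demote only if the representative bypasses


rbt = RecentBypassTable()
rid = region_id(0x12345000)
print(on_cache_miss(rbt, rid, resident_rrpv=2))  # (True, True): decide and demote
print(on_cache_miss(rbt, rid, resident_rrpv=2))  # (True, False): follow, no update
```

Only the first call for a region pays for a state update; later misses to the same region reuse the stored decision until the entry leaves the table.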
5.3 Impact on Bandwidth
ETR tries to reduce the bandwidth used for replacement
state updates (RRPV promotion and demotion) to improve
performance. To understand effectiveness of ETR, we divide
the bandwidth used for cache replacement into three parts:
installs, promotions, and demotions, and we normalize this
consumption to the baseline design that uses bandwidth only
for installs. Figure 13 shows the replacement bandwidth us-
age of base RRIP-AOB and ETR on RRIP-AOB, normalized
to Always-Install. ETR saves 70% of replacement state up-
date bandwidth. These bandwidth savings result in speedup.
[Figure 13 plot: replacement bandwidth w.r.t. Always-Install (y-axis, 0% to 250%) per workload (sphinx through soplex, Amean), stacked as Install, Promote, and Demote.]
Figure 13: Replacement and Install bandwidth consump-
tion of base RRIP-AOB [left] and ETR on RRIP-AOB
[right], normalized to Always-Install. ETR reduces 70%
of the bandwidth consumed in state update.
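The normalization used for this breakdown can be written out as a short sketch. The function names and the operation counts below are hypothetical placeholders chosen only to exercise the arithmetic; they are not measurements from the paper.

```python
def normalized_replacement_bw(installs, promotions, demotions, baseline_installs):
    """Express each component as a fraction of Always-Install's install traffic."""
    if baseline_installs <= 0:
        raise ValueError("baseline_installs must be positive")
    parts = {
        "install": installs / baseline_installs,
        "promote": promotions / baseline_installs,
        "demote": demotions / baseline_installs,
    }
    parts["total"] = sum(parts.values())
    return parts


def state_update_savings(base, improved):
    """Fraction of promote+demote bandwidth removed going from base to improved."""
    before = base["promote"] + base["demote"]
    after = improved["promote"] + improved["demote"]
    return 1.0 - after / before


# Hypothetical operation counts, for illustration only.
base = normalized_replacement_bw(80, 40, 60, baseline_installs=100)
improved = normalized_replacement_bw(80, 30, 20, baseline_installs=100)
print(f"state-update savings: {state_update_savings(base, improved):.0%}")
```

Installs are kept separate from promotions and demotions so that the savings figure reflects only state-update traffic, which is the quantity ETR targets.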
5.4 Impact on Performance
Figure 11 shows the performance of RRIP-AOB, ETR
on RRIP-AOB, and Ideal RRIP-AOB with no state update
costs. ETR on RRIP-AOB bridges 70% of the performance
gap between RRIP-AOB and Ideal to achieve 18% speedup,
while incurring negligible SRAM storage costs. Thus, our
solutions of AOB and ETR make it practical to apply reuse-
based policies to DRAM caches and get significant benefits
while incurring negligible storage overheads.
6. SIGNATURE-BASED POLICIES
Thus far, we have discussed AOB and ETR only in the
context of RRIP. However, AOB and ETR are actually general
techniques that enable formulating direct-mapped versions
of replacement policies, as well as reducing the bandwidth
needed to maintain replacement policy state. AOB and ETR
can make even state-of-the-art signature-based policies [10,
11, 12, 32, 33] suitable for DRAM caches. We show how, using
Signature-based Hit Predictor (SHiP) [10] as an example.
6.1 Operation of Conventional SHiP
SHiP works by observing and learning which signatures
correspond to low-reuse lines, and installing lines accessed by
those signatures at low priority, as shown in Figure 15. SHiP
maintains signatures (PC) and reuse-bit (R-Bit) for tracking
reuse of the signature. On eviction, the line increments or
decrements a counter in the Signature History Counter Table
(SHCT) based on the R-bit. On install, the SHCT decides if
the incoming line is installed with High-Priority (RRPV=2)
or Low-Priority (RRPV=3), based on signature.
[Figure 14 plot: speedup (y-axis, 0.8 to 1.6) per workload (sphinx through soplex, mix1-mix4, Gmean) for ETR on RRIP-AOB, ETR on SHiP-AOB, and Ideal SHiP-AOB; bars exceeding the axis are labeled 2.31, 2.40, and 2.36.]
Figure 14: Performance of ETR on RRIP-AOB, ETR on SHiP-AOB, and Ideal SHiP-AOB with no state update costs.
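[Figure 15 diagram: each cache line stores a signature (Sig = PC % (1<<12)) and an R bit, which train the SHCT. An incoming line whose SHCT counter is zero (000) gets an LP-Install (RRPV=3); a non-zero counter gives an HP-Install (RRPV=2).]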
Figure 15: Operation and Organization of SHiP.
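A minimal sketch of the SHCT training and install-priority lookup described in this subsection is shown below. The 12-bit signature and the zero/non-zero install rule follow Figure 15; the 3-bit counter width, the initial counter value, and the class and method names are assumptions made for illustration.

```python
SIG_BITS = 12
CTR_MAX = 7  # assumption: 3-bit saturating counters


class SHCT:
    """Signature History Counter Table indexed by a PC-derived signature."""

    def __init__(self, init=1):  # assumption: counters start weakly non-zero
        self.ctr = [init] * (1 << SIG_BITS)

    @staticmethod
    def signature(pc):
        """Sig = PC % (1 << 12), as in Figure 15."""
        return pc % (1 << SIG_BITS)

    def train_on_evict(self, sig, r_bit):
        """Increment if the evicted line was reused (R-bit set), else decrement."""
        if r_bit:
            self.ctr[sig] = min(CTR_MAX, self.ctr[sig] + 1)
        else:
            self.ctr[sig] = max(0, self.ctr[sig] - 1)

    def install_rrpv(self, sig):
        """Low-Priority (RRPV=3) if the counter is zero, else High-Priority (RRPV=2)."""
        return 3 if self.ctr[sig] == 0 else 2


shct = SHCT()
sig = SHCT.signature(0x401ABC)
shct.train_on_evict(sig, r_bit=False)  # line evicted without reuse
print(shct.install_rrpv(sig))          # 3: this signature now installs at low priority
```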
6.2 Adapting SHiP to Direct-Mapped Cache
Conventional SHiP design always installs the incoming
line, either with High-Priority or Low-Priority. Unfortunately,
with a direct-mapped cache, doing so will degenerate into the
Always-Install policy (baseline). We extend SHiP in the con-
text of direct-mapped caches using the option of bypassing
with SHiP-AOB. If the resident line has an RRPV=3, then the
incoming line is always installed. If the resident line has an
RRPV of less than 3, then we bypass the incoming line. How-
ever, we then demote the RRPV of the resident line only if
the incoming line had a High-Priority install, and we skip the
RRPV update for incoming lines with Low-Priority install.
Thus, the advantage of SHiP-AOB is that it can reduce the
bandwidth required to perform demotion when the incoming
line is predicted to have low reuse.
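The SHiP-AOB rules above reduce to a small decision function, sketched here for a direct-mapped set with a valid resident line. The function name and the tuple return format are hypothetical.

```python
def ship_aob_on_miss(resident_rrpv, incoming_high_priority):
    """Return (install, new_resident_rrpv, state_write_needed).

    new_resident_rrpv is None when the resident line is replaced.
    """
    if resident_rrpv == 3:
        return True, None, False               # resident is evictable: always install
    if incoming_high_priority:
        return False, resident_rrpv + 1, True  # bypass and age the resident line
    return False, resident_rrpv, False         # bypass, skip the RRPV update


print(ship_aob_on_miss(2, incoming_high_priority=True))   # (False, 3, True)
print(ship_aob_on_miss(2, incoming_high_priority=False))  # (False, 2, False)
print(ship_aob_on_miss(3, incoming_high_priority=False))  # (True, None, False)
```

The second case is where the bandwidth saving comes from: a bypass by a line predicted to have low reuse leaves the resident line's state untouched.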
6.3 Implementing SHiP for DRAM Cache
To implement this bypassing version of SHiP for DRAM
cache, we would need additional replacement metadata with
each line: 12-bit signature + 1 R-Bit, in addition to the 2 bits
for RRPV. Fortunately, these 15 bits of replacement metadata
can still fit in the 18-20 unused bits available in ECC space
of the KNL-Cache, so storage for the additional metadata for
SHiP is not a concern. The SHCT table still needs to be im-
plemented in SRAM (similar to conventional SHiP); however,
the SRAM overhead of the SHCT is only 1.5 kilobytes.
Note that updating the R-Bit occurs concurrently with the
Promote operation (setting RRPV to 0), so no additional
bandwidth is required for tracking the R-Bit. In addition, we
force an install 2% of the time to gather reuse information on
bypassed pages. For lines without a valid PC (e.g., writebacks),
we train using a single PC-less SHCT entry.
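The storage figures in this subsection can be checked with a few lines of arithmetic. The 28 spare ECC bits and the 8-10 baseline tag bits come from Section 8.3; the 3-bit SHCT counter width is an assumption (it is not stated in the text) that happens to reproduce the quoted 1.5 kilobytes.

```python
SIGNATURE_BITS = 12
R_BITS = 1
RRPV_BITS = 2
ECC_SPARE_BITS = 28          # unused ECC bits per line (Section 8.3)
BASELINE_TAG_BITS = (8, 10)  # tag + valid + dirty, low and high estimates

# Per-line replacement metadata needed by the bypassing version of SHiP.
ship_bits = SIGNATURE_BITS + R_BITS + RRPV_BITS

# Bits left over after the baseline tag entry: (worst case, best case).
free_bits = tuple(ECC_SPARE_BITS - b for b in reversed(BASELINE_TAG_BITS))

SHCT_ENTRIES = 1 << SIGNATURE_BITS
SHCT_COUNTER_BITS = 3        # assumption: counter width is not given in the text
shct_bytes = SHCT_ENTRIES * SHCT_COUNTER_BITS // 8

print(ship_bits, free_bits, shct_bytes / 1024)  # 15 (18, 20) 1.5
```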
6.4 Impact on Bandwidth
Figure 16 shows the bandwidth breakdown (in terms of
install, promotion, and demotion operations) for ETR on
RRIP-AOB, and ETR on SHiP-AOB, normalized to the band-
width consumed in doing installs for the always-install design.
ETR on SHiP-AOB achieves a further reduction in state-update
bandwidth compared to ETR on RRIP-AOB. ETR on SHiP-AOB avoids
demotion bandwidth by skipping the update when the incoming
line is predicted to have no reuse, which in turn reduces the
number of lines that need to be promoted back on a hit. In
particular, ETR on SHiP-AOB benefits workloads that have poor
spatial locality (e.g., mcf).
[Figure 16 plot: replacement bandwidth w.r.t. Always-Install (y-axis, 0% to 140%) per workload (sphinx through soplex, Amean), stacked as Install, Promote, and Demote.]
Figure 16: Bandwidth usage of ETR on RRIP-AOB [left]
and ETR on SHiP-AOB [right], normalized to Always-
Install. SHiP-AOB further reduces BW for state update.
6.5 Impact on Performance
Figure 14 shows the speedup of ETR on RRIP-AOB, ETR
on SHiP-AOB, and the Idealized version of SHiP-AOB with
no state update cost. Overall, ETR on SHiP-AOB achieves
21% speedup, achieving most of the 23% speedup with the
Ideal design (which incurs impractical storage overheads).
7. TOWARDS SET-ASSOCIATIVE DESIGNS
We evaluate our solutions in the context of a direct-mapped
cache, but our designs and insights can be made applicable to
set-associative caches. A recent proposal, ACCORD [34], tries
to make DRAM caches set-associative to improve hit rate,
albeit at the expense of bandwidth and latency [35, 36, 37]. We
compare with the recently proposed associative cache design,
and show that ETR has higher potential as it improves both
hit-rate and bandwidth. Nonetheless, ETR on RRIP-AOB
can be used in conjunction with set-associative designs for
DRAM caches for even better performance.
7.1 ACCORD: Predictable Associativity
DRAM caches that store tag with data [4, 5] are known to
be difficult to design for associativity–if there are multiple
possible locations for a line, you may need several accesses to
look up the possible locations for the line.
[Figure 17 plot: speedup (y-axis, 0.8 to 1.6) per workload (sphinx through soplex, mix1-mix4, Gmean) for ACCORD, ETR on RRIP-AOB, and ACCORD + ETR on RRIP-AOB; bars exceeding the axis are labeled 2.31 and 2.40.]
Figure 17: Performance of set-associative ACCORD, ETR on RRIP-AOB, and ACCORD with ETR on RRIP-AOB.
A recent work, ACCORD [34], proposes to modify replacement
policy to make it easier to locate lines with simple way
prediction methods.
For example, it proposes Probabilistic Way-Steering (PWS)
that biases install to a preferred way 85% of the time based on
address, as shown in Figure 18(a). It also proposes Ganged
Way-Steering (GWS) that steers subsequent installs to a re-
gion into the same way, to enable per-region last-way-seen
way-prediction to be accurate, shown in Figure 18(b).
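The two steering policies just described can be sketched for a 2-way cache as follows. The 85% bias comes from the text; deriving the preferred way from the low bit of the region ID, the dictionary used to remember each region's last install way, and all function names are assumptions made for illustration.

```python
import random

PWS_BIAS = 0.85  # from the text: preferred way is chosen 85% of the time


def preferred_way(region_id):
    """Assumption: the low bit of the region ID names the preferred way."""
    return region_id & 1


def pws_select(region_id, rng):
    """Probabilistic Way-Steering: usually install into the preferred way."""
    pref = preferred_way(region_id)
    return pref if rng.random() < PWS_BIAS else 1 - pref


def gws_select(region_id, recent_way, rng):
    """Ganged Way-Steering: reuse the way of the region's previous install."""
    if region_id not in recent_way:
        recent_way[region_id] = pws_select(region_id, rng)
    return recent_way[region_id]


rng = random.Random(0)
recent_way = {}
ways = [gws_select(7, recent_way, rng) for _ in range(64)]
print(len(set(ways)))  # 1: every install to the region lands in the same way
```

Because all lines of a region share one way under GWS, remembering the last way seen for a region is enough to predict where its other lines reside.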
Figure 17 and Figure 20 show that ACCORD provides 10%
speedup: it reduces misses by 15% but can cost extra
bandwidth (due to way-mispredictions and looking
up multiple ways on a miss). Meanwhile, ETR on RRIP-AOB
provides a higher 18% speedup, as it reduces misses by 10%
while simultaneously reducing DRAM cache bandwidth con-
sumption. However, these ideas are not direct competitors.
Associativity enables storing of multiple conflicting lines, and
bypassing enables reducing install bandwidth while maintain-
ing hit-rate. We design a solution that gets benefits of both.
[Figure 18 diagram: a 2-way cache (Way 0, Way 1) holding lines of regions A and B. (a) Probabilistic Way-Steering: prefer Way 0 85% of the time, Way 1 15%. (b) Ganged Way-Steering: lines B1-B3 follow the way chosen for B0.]
Figure 18: ACCORD enables low-latency associativity
for DRAM cache, by modifying install policy to make it
easier to predict which way a line is in.
7.2 Combining ACCORD and Bypassing
We want to obtain the hit-rate benefits associativity of-
fers; however, we must do so while maintaining ACCORD’s
biased-install or we will face increased bandwidth cost to
locate lines (due to frequent way-mispredictions).
To obtain benefits of both associativity and bypassing, we
combine the two by viewing associativity as a way-selection
policy and using a tiered decision. Figure 19 shows our tiered
decision tree: ACCORD is first used to select the way, and
then ETR is used to determine bypass policy.
For the first install to a region (Region Miss in ACCORD-
RIT and ETR-RBT), we use ACCORD PWS to select which
way to attempt install. This will have a biased probability to
install into a preferred-way, so we can maintain similar way
prediction accuracy. We subsequently use RRIP-AOB to de-
cide install or bypass into this particular way, and correspond-
ingly update the state. This enables us to keep ACCORD’s
flexibility to use both ways and maintain high way prediction
accuracy, as well as add RRIP’s thrash-resistant replacement.
This combination of the two enables high way prediction
accuracy (ACCORD) and bandwidth-efficient state update
and bypass (ETR on RRIP-AOB).
[Figure 19 diagram: decision tree for a line install. Step 1, select way with ACCORD: on a region miss (RIT/RBT) use Probabilistic Way Selection (PWS); on a region hit use Way-Selection Following (GWS). Step 2, decide bypass with ETR on RRIP-AOB: on a region miss use Bypass-Decision & State-Update (RRIP); on a region hit use Bypass-Decision Following (ETR).]
Figure 19: We first use ACCORD to select which way to
attempt install, then use RRIP-AOB to decide bypass.
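The tiered decision can be sketched as a single function. Merging the RIT and RBT into one per-region table is a simplifying assumption (Figure 19 shows them being consulted together), and the function name, argument layout, and `FixedRng` stub are hypothetical.

```python
PWS_BIAS = 0.85
MAX_RRPV = 3


def tiered_install(region_id, resident_rrpv_by_way, preferred, table, rng):
    """Return (way, bypass, demote_resident) for a missing line.

    table maps region_id -> (way, bypass); it stands in for both the RIT
    and the RBT (assumption: one merged structure).
    """
    if region_id in table:
        way, bypass = table[region_id]           # region hit: follow both decisions
        return way, bypass, False
    # Region miss: PWS picks the way, then RRIP-AOB decides on that way only.
    way = preferred if rng.random() < PWS_BIAS else 1 - preferred
    bypass = resident_rrpv_by_way[way] < MAX_RRPV
    table[region_id] = (way, bypass)
    return way, bypass, bypass                   # demote only on a fresh bypass


class FixedRng:
    """Deterministic stand-in for random.Random, used only in this example."""

    def random(self):
        return 0.5


table = {}
print(tiered_install(10, [2, 3], 0, table, FixedRng()))  # (0, True, True)
print(tiered_install(10, [2, 3], 0, table, FixedRng()))  # (0, True, False)
```

The first miss to a region pays for way selection and a possible demotion; later misses to the same region reuse both the way and the bypass decision.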
7.3 Effectiveness: Miss-rate and Speedup
Such a tiered decision process allows the bypass-policy
ETR on RRIP-AOB to continue to obtain its thrash and
scan-resistance, and bandwidth benefits. And, it enables
ACCORD’s associativity to utilize both ways to improve hit-
rate, at minor cost to DRAM cache bandwidth. Figure 20
shows the read MPKI of ACCORD, RRIP-AOB, and the
combination of the two. ACCORD reduces miss-rate by 15%,
RRIP-AOB reduces miss-rate by 10%, and the combination
of the two reduces miss-rate by 20%.
[Figure 20 plot: L4 RD MPKI (y-axis, log scale from 1 to 64) per workload (sphinx through soplex, Average) for Always-Install, ACCORD, RRIP-AOB, and ACCORD + RRIP-AOB.]
Figure 20: L4 Read-Miss-Per-Kilo-Instruction of
Always-Install, ACCORD, RRIP-AOB, and ACCORD +
RRIP-AOB. Combination enables 20% miss reduction.
Figure 17 shows the performance of ACCORD, ETR on
RRIP-AOB, and the combination of the two. ETR on RRIP-AOB
improves both hit-rate and bandwidth, whereas ACCORD
improves hit-rate at the cost of bandwidth. The
combination of ACCORD and ETR on RRIP-AOB enables
20% speedup, and shows that the concepts developed in ETR
and RRIP-AOB are also applicable to set-associative DRAM
cache designs. We discuss the potential for improving other
set-associative designs in Section 8.7.
8. RESULTS AND DISCUSSION
In this section we present sensitivity studies and storage
analysis. Due to space constraints, we limit these results to
ETR implemented on RRIP-AOB.
8.1 Multi-programmed Workloads
To show robustness of our proposal to multi-programmed
workloads, we evaluate over a larger set of 20 mix-application
workloads. Figure 21 shows that ETR provides 19% speedup
across 20 mixes, with no workloads experiencing slowdown.
[Figure 21 plot: speedup (y-axis, 0.9 to 1.5) for mix1 through mix20 and Gmean, for ETR and Ideal.]
Figure 21: Speedup of ETR on RRIP-AOB and Ideal
RRIP-AOB on multi-programmed workloads.
8.2 Impact on Energy and Power
Figure 22 shows DRAM cache + memory power, energy
consumption, and energy-delay-product (EDP) of a system
using ETR, normalized to baseline DRAM cache. We model
power and energy for stacked DRAM with [38,39], and model
power and energy for non-volatile memory with [27]. ETR
reduces DRAM cache energy by reducing install and state
update bandwidth, and provides lower main memory energy
by improving DRAM cache hit-rate. Overall, ETR reduces
energy consumption by 11% and EDP by 24%.
[Figure 22 plot: Speedup, Power, Energy, and EDP, normalized to baseline (y-axis, 0.7 to 1.3).]
Figure 22: Memory system energy of ETR on RRIP-
AOB. Intelligent replacement reduces energy usage.
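As a rough consistency check on the reported numbers, the energy-delay product can be estimated from the energy reduction and the speedup, taking delay as the inverse of speedup. Treating the average 18% speedup of ETR on RRIP-AOB as a single delay factor is a simplifying assumption; the function name is hypothetical.

```python
def normalized_edp(energy, speedup):
    """EDP relative to baseline, taking normalized delay as 1 / speedup."""
    return energy / speedup


# 11% energy reduction (this section) and 18% speedup (Section 5.4).
edp = normalized_edp(energy=1 - 0.11, speedup=1.18)
print(f"normalized EDP ~ {edp:.2f}")  # normalized EDP ~ 0.75
```

The estimate of roughly 0.75 is in line with the reported 24% EDP reduction.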
8.3 Storage Requirements
We analyze the SRAM storage overheads of ETR. ETR
requires only a 128-entry 4B-per-entry Recent-Bypass Table,
which needs 512B. Thus, our proposal can be easily built
with negligible overheads within the memory controller.
For DRAM storage overheads required for RRIP-AOB, we
use the fact that the DRAM cache has 28 unused bits in the
ECC, which can be used for tag and metadata (see Figure 2).
The baseline uses 8-10 bits for the tag, valid, and dirty bits.
RRIP requires just 2 bits for RRPV. The tag entry thus grows to
at most 12 bits, which fits in the 28 available bits.
8.4 Impact of Cache Size
Table 3 shows the speedup of ETR as the size of the DRAM
cache is varied from 1GB to 8GB. ETR on RRIP-AOB con-
tinues to provide significant speedup across different cache
sizes, ranging from 16.4% at 1GB to 13.5% at 8GB. As ex-
pected, when the cache size is increased, larger portions of the
workload fit in, and there is reduced scope for improvement.
Table 3: ETR Sensitivity to Cache Size
Cache Size    Avg. Speedup from ETR
1.0GB         16.4%
2.0GB         18.0%
4.0GB         17.4%
8.0GB         13.5%
8.5 Impact of Memory Type
We use a non-volatile main memory for our studies, but
our benefits are not limited to NVM-backed systems only.
We compare BEAR’s Bandwidth Aware Bypass [6] with
proposed ETR on DRAM-backed main memory in Table 4.
ETR outperforms BEAR by intelligently bypassing lines and
achieving better hit-rate and substantial bandwidth benefits.
Table 4: ETR on DRAM-backed Memory
Workloads     Bandwidth Aware Bypass    ETR
SPEC RATE     +7.4%                     +17.0%
SPEC MIX      +1.7%                     +16.0%
GAP           +4.8%                     +26.6%
GMEAN26       +5.7%                     +19.0%
8.6 Impact of Region Size
Table 5 shows the speedup of ETR as the region size is var-
ied from 1KB to 4KB. Region size of 4KB (matching smallest
OS page) provides best speedup of 18.0% as it amortizes the
most replacement-update costs.
Table 5: ETR Sensitivity to Region Size
Region Size    Avg. Speedup from ETR
1KB            17.8%
2KB            18.0%
4KB            18.0%
8.7 Impact on Other 2-Way Designs
We evaluate our design on the state-of-the-art set-associative
DRAM cache design ACCORD [34], but the intelligent re-
placement offered by RRIP-AOB can be useful for other
set-associative designs as well. To isolate the impact of in-
telligent replacement (without considering bandwidth costs
specific to each DRAM cache organization), Table 6 shows
the impact on average L4 misses when using different re-
placement policies for 1-way and 2-way caches, normalized
to the baseline 1-way always-install policy. As expected, our
RRIP-AOB achieves the highest miss reduction for 1-way
caches as it enables intelligent reuse-based replacement pol-
icy for 1-way caches. Additionally, RRIP-AOB also achieves
the highest miss reduction for 2-way caches. This is because
RRIP-AOB can intelligently decide to bypass in the case the
cache set is storing multiple useful lines.
Table 6: RRIP-AOB Impact on Misses for 1-2 Way L4
Replacement Policy                  Impact on Avg. L4 Misses
1-way Always-Install                -0.0%
1-way Probabilistic Bypass [6]      -1.6%
1-way RRIP-AOB                      -10.4%
2-way Random                        -14.6%
2-way LRU                           -15.5%
2-way RRIP                          -19.7%
2-way RRIP-AOB                      -26.6%
[Figure 23 plot: speedup (y-axis, 0.5 to 1.6) per workload (sphinx through soplex, mix1-mix4, Gmean) for Timber (2-way), Unison (4-way), and Direct-mapped + ETR on RRIP-AOB; one bar exceeding the axis is labeled 2.36.]
Figure 23: Speedup of line-based [40] and page-based [14] set-associative DRAM caches, rel. to baseline direct-mapped
DRAM cache [4, 5]. Proposed ETR on RRIP-AOB enables direct-mapped DRAM caches to obtain the hit-rate benefits
of intelligent cache replacement, without needing to pay additional bandwidth to maintain set-associative tags.
9. RELATED WORK
9.1 Replacement / Bypassing policies
Recency-based replacement policies [16, 41, 42] install
incoming lines at the highest priority, which degenerates into
the always-install baseline. Probabilistic replacement policies
[17, 43] become the probabilistic bypass [8] shown in Figure 5.
Frequency-based replacement [18, 19, 44, 45, 46] and
reuse-based replacement [7, 8, 9, 47, 48] try to predict and
keep the most-frequently used lines in the cache. We design a
bypassing version of RRIP, RRIP-AOB, and implement ETR on our
RRIP-AOB as an example of this class of policies, but our ETR
scheme can be easily used to reduce the update cost of other
frequency-based and reuse-based replacement algorithms.
Signature-based replacement [10, 11, 12, 32, 33] attempts to
predict line reuse based on signatures (e.g., PC). We develop a
bypassing version of SHiP, called SHiP-AOB, to show how to
implement signature-based replacement on caches with low
associativity.
9.2 Line-based DRAM Caches
In our study, we use the DRAM cache organization used in
Intel’s Knights-Landing [4] that is direct-mapped and stores
each tag next to its data as our baseline. This organization
is the commercial implementation of many research efforts
that store Tag-With-Data [5, 6, 49] to improve latency and
reduce bandwidth consumption. We compare with recent
enhancements in Figure 5 (90%-Bypass and BAB [6]).
Alternative designs such as Sim et al. [15] take a different
approach to storing tags via tag grouping. For such caches,
a tag-only line is placed along with data in the same row
buffer [15, 40, 50, 51, 52]. Such caches require separate ac-
cesses for tag and data, which can cost significant bandwidth
and latency. Timber is an enhancement that proposes to
mitigate tag lookup by using a tag-cache and exploiting spa-
tial locality (by co-locating tags and metadata from multiple
sets) [40]. We compare with Timber as a representative of
the grouped tag / metadata approach in Figure 23. Such
approaches can enable associativity, but pay substantial band-
width to access and update tags when the tag cache has poor
hit-rate, due to large footprint and poor spatial locality (e.g.,
mcf and pr twi). Our ETR on RRIP-AOB, on the other hand,
enables intelligent replacement without needing to access tags
separately, and outperforms such tag-grouped approaches.
9.3 Page-based DRAM Caches
An alternate approach to designing DRAM caches is to
use large-granularity caches to reduce tag and metadata
overhead, in hardware [14, 53] or software [54, 55, 56, 57]. The
reduction in tag requirements enables more space for
associativity and replacement metadata. Such large-granularity
caches often employ recency-based replacement [14, 54, 55]
or frequency-based replacement [56, 57] that would otherwise
be too expensive in line-granularity caches. We compare with
Unison cache [14] (a hardware-managed, 4-way, page-based
sectored cache with LRU replacement and separate tag lookup)
as a representative of page-based designs, in Figure 23. The
associativity and replacement that Unison offers enable it to
frequently outperform the baseline DRAM cache. However, the
large line size of Unison often prevents it from using a large
portion of the cache (e.g., pr twi and bc twi), and the separate
tag and data lookup often wastes significant bandwidth.
Our ETR on RRIP-AOB, on the other hand, enables direct-mapped
DRAM caches to obtain intelligent replacement without
sacrificing cache utilization or bandwidth efficiency, and to
outperform such page-based approaches.
10. CONCLUSION
This paper investigates improving hit-rate for direct-mapped
DRAM caches by utilizing reuse-based replacement policies.
We would like to use the most effective replacement policies
to improve DRAM cache hit-rate. Unfortunately, state-of-the-art
policies based on reuse are designed to compare multiple
counter values within the set to decide a replacement victim.
As such, these policies become ill-defined and inapplicable
for direct-mapped caches (they degenerate into either
always-install or always-bypass policies). To make reuse-based
policies, such as RRIP, applicable to direct-mapped
DRAM caches, we propose a bypass formulation of RRIP
called RRIP Age-On-Bypass (RRIP-AOB). RRIP-AOB leverages
the insight that the event of bypassing in a direct-mapped
cache should also age the reuse state of the resident line.
Similar to RRIP, RRIP-AOB needs per-line reuse counters.
Maintaining such reuse state in DRAM costs significant band-
width (promote on hit, demote on bypass). We investigate
methods to reduce bandwidth overhead of keeping reuse state
in DRAM. We propose Efficient Tracking of Reuse (ETR) to
reduce state update costs. ETR builds upon an observation
that, at any given time, many lines from a 4KB region are
coresident and have similar reuse state. If we select a
representative (e.g., the first conflicting set in a region) and
maintain accurate reuse state for just that line, we can use
that line’s reuse to infer the reuse of the remaining lines
without update cost.
ETR reduces the bandwidth for tracking replacement state by
70% while maintaining similar hit-rate. Our evaluations with
a 2GB DRAM cache show that ETR on RRIP-AOB provides
a speedup of 18.0% while incurring an SRAM cost of <1KB.
ETR on RRIP-AOB performs within 2% of an idealized de-
sign that does not incur bandwidth for state update.
11. REFERENCES
[1] J. Standard, “High bandwidth memory (hbm) dram,” JESD235, 2013.
[2] JEDEC, DDR4 SPEC (JESD79-4), 2013.
[3] Intel and Micron, “A revolutionary breakthrough in memory
technology,” 2015.
[4] A. Sodani, R. Gramunt, J. Corbal, H.-S. Kim, K. Vinod,
S. Chinthamani, S. Hutsell, R. Agarwal, and Y.-C. Liu, “Knights
landing: Second-generation intel xeon phi product,” IEEE Micro,
vol. 36, pp. 34–46, Mar 2016.
[5] M. K. Qureshi and G. H. Loh, “Fundamental latency trade-off in
architecting dram caches: Outperforming impractical sram-tags with a
simple and practical design,” in 2012 45th Annual IEEE/ACM
International Symposium on Microarchitecture, pp. 235–246, Dec
2012.
[6] C. Chou, A. Jaleel, and M. K. Qureshi, “Bear: Techniques for
mitigating bandwidth bloat in gigascale dram caches,” in Proceedings
of the 42nd Annual International Symposium on Computer
Architecture, ISCA ’15, (New York, NY, USA), pp. 198–210, ACM,
2015.
[7] M. Kharbutli and Y. Solihin, “Counter-based cache replacement and
bypassing algorithms,” IEEE Trans. Comput., vol. 57, pp. 433–447,
Apr. 2008.
[8] H. Gao and C. Wilkerson, “A dueling segmented lru replacement
algorithm with adaptive bypassing,” in JWAC 2010-1st JILP Workshop
on Computer Architecture Competitions: cache replacement
Championship, 2010.
[9] A. Jaleel, K. B. Theobald, S. C. Steely, Jr., and J. Emer, “High
performance cache replacement using re-reference interval prediction
(rrip),” in Proceedings of the 37th Annual International Symposium on
Computer Architecture, ISCA ’10, (New York, NY, USA), pp. 60–71,
ACM, 2010.
[10] C.-J. Wu, A. Jaleel, W. Hasenplaugh, M. Martonosi, S. C. Steely, Jr.,
and J. Emer, “Ship: Signature-based hit predictor for high
performance caching,” in Proceedings of the 44th Annual IEEE/ACM
International Symposium on Microarchitecture, MICRO-44, (New
York, NY, USA), pp. 430–441, ACM, 2011.
[11] V. Young, C.-C. Chou, A. Jaleel, and M. Qureshi, “Ship++: Enhancing
signature-based hit predictor for improved cache performance,” in The
2nd Cache Replacement Championship (CRC-2 Workshop in ISCA
2017), 2017.
[12] A. Jain and C. Lin, “Back to the future: Leveraging belady’s algorithm
for improved cache replacement,” in 2016 ACM/IEEE 43rd Annual
International Symposium on Computer Architecture (ISCA), pp. 78–89,
June 2016.
[13] D. A. Jiménez and E. Teran, “Multiperspective reuse prediction,” in
Proceedings of the 50th Annual IEEE/ACM International Symposium
on Microarchitecture, MICRO-50 ’17, (New York, NY, USA),
pp. 436–448, ACM, 2017.
[14] D. Jevdjic, G. H. Loh, C. Kaynak, and B. Falsafi, “Unison cache: A
scalable and effective die-stacked dram cache,” in Microarchitecture
(MICRO), 2014 47th Annual IEEE/ACM International Symposium on,
pp. 25–37, IEEE, 2014.
[15] J. Sim, G. H. Loh, V. Sridharan, and M. O’Connor, “Resilient
die-stacked dram caches,” in Proceedings of the 40th Annual
International Symposium on Computer Architecture, ISCA ’13, (New
York, NY, USA), pp. 416–427, ACM, 2013.
[16] W. A. Wong and J.-L. Baer, “Modified lru policies for improving
second-level cache behavior,” in High-Performance Computer
Architecture, 2000. HPCA-6. Proceedings. Sixth International
Symposium on, pp. 49–60, IEEE, 2000.
[17] M. K. Qureshi, A. Jaleel, Y. N. Patt, S. C. Steely, and J. Emer,
“Adaptive insertion policies for high performance caching,” in
Proceedings of the 34th Annual International Symposium on
Computer Architecture, ISCA ’07, (New York, NY, USA),
pp. 381–391, ACM, 2007.
[18] J. T. Robinson and M. V. Devarakonda, “Data cache management
using frequency-based replacement,” in Proceedings of the 1990 ACM
SIGMETRICS Conference on Measurement and Modeling of
Computer Systems, SIGMETRICS ’90, (New York, NY, USA),
pp. 134–142, ACM, 1990.
[19] M. K. Qureshi, D. Thompson, and Y. N. Patt, “The v-way cache:
demand-based associativity via global replacement,” in Computer
Architecture, 2005. ISCA’05. Proceedings. 32nd International
Symposium on, pp. 544–555, IEEE, 2005.
[20] N. Chatterjee, R. Balasubramonian, M. Shevgoor, S. Pugsley, A. Udipi,
A. Shafiee, K. Sudan, M. Awasthi, and Z. Chishti, “Usimm: the utah
simulated memory module,” University of Utah, Tech. Rep, 2012.
[21] A. Sodani, R. Gramunt, J. Corbal, H. S. Kim, K. Vinod,
S. Chinthamani, S. Hutsell, R. Agarwal, and Y. C. Liu, “Knights
landing: Second-generation intel xeon phi product,” IEEE Micro,
vol. 36, pp. 34–46, Mar 2016.
[22] J. Izraelevitz, J. Yang, L. Zhang, J. Kim, X. Liu, A. Memaripour, Y. J.
Soh, Z. Wang, Y. Xu, S. R. Dulloor, J. Zhao, and S. Swanson, “Basic
performance measurements of the intel optane DC persistent memory
module,” CoRR, vol. abs/1903.05714, 2019.
[23] Intel, “Fact sheet: New intel architectures and technologies target
expanded market opportunities,” 2018. Accessed: 2019-03-20.
[24] M. K. Qureshi, S. Gurumurthi, and B. Rajendran, “Phase change
memory: From devices to systems,” Synthesis Lectures on Computer
Architecture, vol. 6, no. 4, pp. 1–134, 2011.
[25] Y. Choi, I. Song, M.-H. Park, H. Chung, S. Chang, B. Cho, J. Kim,
Y. Oh, D. Kwon, J. Sunwoo, J. Shin, Y. Rho, C. Lee, M.-G. Kang,
J. Lee, Y. Kwon, S. Kim, J. Kim, Y.-J. Lee, Q. Wang, S. Cha, S. Ahn,
H. Horii, J. Lee, K. Kim, H. Joo, K. Lee, Y.-T. Lee, J. Yoo, and
G. Jeong, “A 20nm 1.8v 8gb pram with 40mb/s program bandwidth,”
in Solid-State Circuits Conference Digest of Technical Papers (ISSCC),
2012 IEEE International, pp. 46–48, Feb 2012.
[26] H. S. P. Wong, S. Raoux, S. Kim, J. Liang, J. P. Reifenberg,
B. Rajendran, M. Asheghi, and K. E. Goodson, “Phase change
memory,” Proceedings of the IEEE, vol. 98, pp. 2201–2227, Dec 2010.
[27] B. C. Lee, E. Ipek, O. Mutlu, and D. Burger, “Architecting phase
change memory as a scalable dram alternative,” in Proceedings of the
36th Annual International Symposium on Computer Architecture,
ISCA ’09, (New York, NY, USA), pp. 2–13, ACM, 2009.
[28] H. Patil, R. Cohn, M. Charney, R. Kapoor, A. Sun, and
A. Karunanidhi, “Pinpointing representative portions of large intel
itanium programs with dynamic instrumentation,” in
Microarchitecture, 2004. MICRO-37 2004. 37th International
Symposium on, pp. 81–92, Dec 2004.
[29] J. L. Henning, “Spec cpu2006 benchmark descriptions,” SIGARCH
Comput. Archit. News, vol. 34, pp. 1–17, Sept. 2006.
[30] S. Beamer, K. Asanovic, and D. A. Patterson, “The GAP benchmark
suite,” CoRR, vol. abs/1508.03619, 2015.
[31] S. Somogyi, T. F. Wenisch, A. Ailamaki, B. Falsafi, and A. Moshovos,
“Spatial memory streaming,” in Proceedings of the 33rd Annual
International Symposium on Computer Architecture, ISCA ’06,
(Washington, DC, USA), pp. 252–263, IEEE Computer Society, 2006.
[32] S. M. Khan, Y. Tian, and D. A. Jimenez, “Sampling dead block
prediction for last-level caches,” in Proceedings of the 2010 43rd
Annual IEEE/ACM International Symposium on Microarchitecture,
MICRO ’43, (Washington, DC, USA), pp. 175–186, IEEE Computer
Society, 2010.
[33] A. Jain and C. Lin, “Rethinking belady’s algorithm to accommodate
prefetching,” in 2018 ACM/IEEE 45th Annual International
Symposium on Computer Architecture (ISCA), June 2018.
[34] V. Young, C. Chou, A. Jaleel, and M. K. Qureshi, “Accord: Enabling
associativity for gigascale dram caches by coordinating way-install
and way-prediction,” in 2018 ACM/IEEE 45th Annual International
Symposium on Computer Architecture (ISCA), pp. 328–339, June
2018.
[35] A. Agarwal and S. D. Pudar, Column-associative caches: A technique
for reducing the miss rate of direct-mapped caches, vol. 21. ACM,
1993.
[36] B. Calder, D. Grunwald, and J. Emer, “Predictive sequential
associative cache,” in Proceedings of the 2nd IEEE Symposium on
High-Performance Computer Architecture, HPCA ’96, (Washington,
DC, USA), pp. 244–, IEEE Computer Society, 1996.
[37] D. H. Albonesi, “Selective cache ways: On-demand cache resource
allocation,” in Microarchitecture, 1999. MICRO-32. Proceedings.
32nd Annual International Symposium on, pp. 248–259, IEEE, 1999.
[38] K. Chandrasekar, C. Weis, B. Akesson, N. Wehn, and K. Goossens,
“System and circuit level power modeling of energy-efficient
3d-stacked wide i/o drams,” in Proceedings of the Conference on
Design, Automation and Test in Europe, DATE ’13, (San Jose, CA,
USA), pp. 236–241, EDA Consortium, 2013.
[39] K. T. Malladi, I. Shaeffer, L. Gopalakrishnan, D. Lo, B. C. Lee, and
M. Horowitz, “Rethinking dram power modes for energy
proportionality,” in Proceedings of the 2012 45th Annual IEEE/ACM
International Symposium on Microarchitecture, MICRO-45,
(Washington, DC, USA), pp. 131–142, IEEE Computer Society, 2012.
[40] J. Meza, J. Chang, H. Yoon, O. Mutlu, and P. Ranganathan, “Enabling
efficient and scalable hybrid memories using fine-granularity dram
cache management,” IEEE Computer Architecture Letters, vol. 11,
pp. 61–64, July 2012.
[41] D. A. Jiménez, “Insertion and promotion for tree-based pseudolru
last-level caches,” in Proceedings of the 46th Annual IEEE/ACM
International Symposium on Microarchitecture, pp. 284–296, ACM,
2013.
[42] Y. Smaragdakis, S. Kaplan, and P. Wilson, “Eelru: simple and
effective adaptive page replacement,” in ACM SIGMETRICS
Performance Evaluation Review, vol. 27, pp. 122–133, ACM, 1999.
[43] A. Jaleel, W. Hasenplaugh, M. Qureshi, J. Sebot, S. Steely, Jr., and
J. Emer, “Adaptive insertion policies for managing shared caches,” in
Proceedings of the 17th International Conference on Parallel
Architectures and Compilation Techniques, PACT ’08, (New York, NY,
USA), pp. 208–219, ACM, 2008.
[44] E. G. Hallnor and S. K. Reinhardt, “A fully associative
software-managed cache design,” in Proceedings of the 27th Annual
International Symposium on Computer Architecture, ISCA ’00, (New
York, NY, USA), pp. 107–116, ACM, 2000.
[45] E. J. O’Neil, P. E. O’Neil, and G. Weikum, “The lru-k page
replacement algorithm for database disk buffering,” in Proceedings of
the 1993 ACM SIGMOD International Conference on Management of
Data, SIGMOD ’93, (New York, NY, USA), pp. 297–306, ACM,
1993.
[46] D. Lee, J. Choi, J. H. Kim, S. H. Noh, S. L. Min, Y. Cho, and C. S.
Kim, “Lrfu: A spectrum of policies that subsumes the least recently
used and least frequently used policies,” IEEE Trans. Comput., vol. 50,
pp. 1352–1361, Dec. 2001.
[47] N. Duong, D. Zhao, T. Kim, R. Cammarota, M. Valero, and A. V.
Veidenbaum, “Improving cache management policies using dynamic
reuse distances,” in Microarchitecture (MICRO), 2012 45th Annual
IEEE/ACM International Symposium on, pp. 389–400, IEEE, 2012.
[48] G. Keramidas, P. Petoumenos, and S. Kaxiras, “Cache replacement
based on reuse-distance prediction,” in Computer Design, 2007. ICCD
2007. 25th International Conference on, pp. 245–250, IEEE, 2007.
[49] C. Chou, A. Jaleel, and M. K. Qureshi, “Candy: Enabling coherent
dram caches for multi-node systems,” in 2016 49th Annual IEEE/ACM
International Symposium on Microarchitecture (MICRO), pp. 1–13,
Oct 2016.
[50] G. H. Loh and M. D. Hill, “Efficiently enabling conventional block
sizes for very large die-stacked dram caches,” in Proceedings of the
44th Annual IEEE/ACM International Symposium on
Microarchitecture, MICRO-44, (New York, NY, USA), pp. 454–464,
ACM, 2011.
[51] C.-C. Huang and V. Nagarajan, “Atcache: reducing dram cache latency
via a small sram tag cache,” in Proceedings of the 23rd international
conference on Parallel architectures and compilation, pp. 51–60,
ACM, 2014.
[52] Z. Wang, D. A. Jiménez, T. Zhang, G. H. Loh, and Y. Xie, “Building
a low latency, highly associative dram cache with the buffered way
predictor,” in 2016 28th International Symposium on Computer
Architecture and High Performance Computing (SBAC-PAD),
pp. 109–117, Oct 2016.
[53] D. Jevdjic, S. Volos, and B. Falsafi, “Die-stacked dram caches for
servers: Hit ratio, latency, or bandwidth? have it all with footprint
cache,” in Proceedings of the 40th Annual International Symposium on
Computer Architecture, ISCA ’13, (New York, NY, USA),
pp. 404–415, ACM, 2013.
[54] Y. Lee, J. Kim, H. Jang, H. Yang, J. Kim, J. Jeong, and J. W. Lee, “A
fully associative, tagless dram cache,” in Proceedings of the 42nd
Annual International Symposium on Computer Architecture, ISCA ’15,
(New York, NY, USA), pp. 211–222, ACM, 2015.
[55] H. Jang, Y. Lee, J. Kim, Y. Kim, J. Kim, J. Jeong, and J. W. Lee,
“Efficient footprint caching for tagless dram caches,” in High
Performance Computer Architecture (HPCA), 2016 IEEE
International Symposium on, pp. 237–248, IEEE, 2016.
[56] G. H. Loh, N. Jayasena, J. Chung, S. K. Reinhardt, M. O’Connor, and
K. McGrath, “Challenges in heterogeneous die-stacked and off-chip
memory systems,” in 3rd Workshop on SoCs, Heterogeneous
Architectures and Workloads (SHAW-3), Feb 2012.
[57] X. Yu, C. J. Hughes, N. Satish, O. Mutlu, and S. Devadas, “Banshee:
Bandwidth-efficient dram caching via software/hardware cooperation,”
in Proceedings of the 50th Annual IEEE/ACM International
Symposium on Microarchitecture, MICRO-50 ’17, (New York, NY,
USA), pp. 1–14, ACM, 2017.
