TicToc: Enabling Bandwidth-Efficient DRAM Caching for both Hits and
  Misses in Hybrid Memory Systems by Young, Vinson et al.
Vinson Young†, Zeshan Chishti‡, and Moinuddin K. Qureshi†
†Georgia Institute of Technology
‡Intel Corporation
{vyoung,moin}@gatech.edu, zeshan.a.chishti@intel.com
ABSTRACT
This paper investigates bandwidth-efficient DRAM caching
for hybrid DRAM + 3D-XPoint memories. 3D-XPoint is
becoming a viable alternative to DRAM as it enables high-
capacity and non-volatile main memory systems. However,
3D-XPoint has several characteristics that limit it from out-
right replacing DRAM: 4-8x slower read, and even worse
writes. As such, effective DRAM caching in front of 3D-
XPoint is important to enable a high-capacity, low-latency,
and high-write-bandwidth memory. There are currently two
major approaches for DRAM cache design: (1) a Tag-Inside-
Cacheline (TIC) organization that optimizes for hits, by stor-
ing tag next to each line such that one access gets both tag and
data, and (2) a Tag-Outside-Cacheline (TOC) organization
that optimizes for misses, by storing tags from multiple data
lines together in a tag-line such that one access to a tag-line
gets information on several data-lines. Ideally, we would
like to have the low hit-latency of TIC designs, and the low
miss-bandwidth of TOC designs. To this end, we propose a
TicToc organization that provisions both TIC and TOC to get
the hit and miss benefits of both.
We find that naively combining both techniques actually
performs worse than TIC individually, because one has to
pay the bandwidth cost of maintaining both metadata. The
main contribution of this work is developing architectural
techniques to reduce bandwidth cost of accessing and main-
taining both TIC and TOC metadata. We find that most of
the update bandwidth is due to maintaining the TOC dirty
information. We propose a DRAM Cache Dirtiness Bit tech-
nique that carries DRAM cache dirty information to last-level
caches, to help prune repeated dirty-bit updates for known
dirty lines. We also propose a Preemptive Dirty Marking
technique that predicts which lines will be written and proac-
tively marks the dirty bit at install time, to help avoid the
initial dirty-bit update for dirty lines. To support PDM, we
develop a novel PC-based Write-Predictor to aid in marking
only write-likely lines. Our evaluations on a 4GB DRAM
cache in front of 3D-XPoint show that our TicToc organiza-
tion enables 10% speedup over the baseline TIC, nearing the
14% speedup possible with an idealized DRAM cache design
with 64MB of SRAM tags, while needing only 34KB SRAM.
1. INTRODUCTION
As memory systems scale, non-volatile memories or NVMs
(such as, 3D-XPoint [1]) are emerging as viable alternatives
to DRAM. NVMs offer the advantages of higher bit density
and the ability to retain data after power outages. However,
NVMs also have significant limitations that prevent them
from outright replacing DRAM in the memory hierarchy. For
example, 3D-XPoint is reported to have 4-8x slower read, and
even slower writes compared to DRAM [2]. As such, future
systems are likely to utilize hybrid memory systems [3,4,5,6]
consisting of both DRAM and 3D-XPoint. We focus on
the setup where DRAM is operated as a hardware-managed
cache for 3D-XPoint based main memory, since such a setup
enables applications to benefit from the lower latency and
higher write-bandwidth of DRAM and the higher capacity of
3D-XPoint without relying on any software or OS support.
Recently, there have been many works [7,8,9,10,11,12,13]
on architecting High Bandwidth Memory (HBM) [14] caches
in front of traditional DRAM main memory [15]. These
works target improving memory bandwidth by migrating data
between DRAM and HBM, and servicing most data at the
higher internal/bus bandwidth of HBM. These works are
effective due to HBM having dedicated higher-bandwidth
channels/interfaces compared to commodity DDRx DRAM.
We would like to utilize the insights learned from these works
to design effective DRAM caches in front of NVMs, such as
3D-XPoint. However, we note that there are significant dif-
ferences in setup and goals for a DRAM+3D-XPoint hybrid
memory as compared to a HBM+DDRx hybrid memory.
First, in a 3D-XPoint based hybrid memory, 3D-XPoint
and DRAM will be sharing the same channel interfaces [16].
Second, DRAM caches in front of 3D-XPoint target reducing
read latency and improving write bandwidth and endurance
of 3D-XPoint, by servicing most data at the lower latency
and higher write bandwidth of DRAM. An added complexity
is that the DRAM cache and 3D-XPoint are likely to sit
behind the same channel [17], as depicted in Figure 1(a).
Such a set-up enables a balanced configuration where every
channel has DRAM backing it. However, in such a channel-
sharing set-up, bandwidth needed for maintaining DRAM
cache state now comes directly at a cost to bus bandwidth
available for memory. As such, there is a renewed need
for bandwidth-efficient DRAM caches. We analyze prior
DRAM caching approaches, highlight cases of bandwidth-
inefficiency, and rigorously target the remaining bandwidth
overheads to develop a bandwidth-efficient DRAM cache
suitable for DRAM + 3D-XPoint systems.
We start with a baseline hit-optimized Tag-Inside-Cacheline
(TIC) DRAM cache design [7, 11, 18]. A TIC design orga-
nizes its DRAM cache as a direct-mapped cache with tags
stored inside each cacheline, such that one access can retrieve
both tag and data. TIC has good hit-latency, as it can ser-
vice cache hits in one DRAM access. However, TIC incurs
bandwidth overhead on cache misses as it needs to probe the
tag in DRAM in order to determine a miss. This approach
of trading miss-bandwidth for hit-latency has been proven
effective in situations where the cache has its own dedicated
access channel, such as the HBM+DRAM hybrid memory in
Intel’s Knights Landing [11]. However, in a channel-sharing
setup, the miss probe bandwidth directly consumes available
main memory bandwidth, resulting in bandwidth inefficiency.
As we show in Figure 1(b), there is a 14% performance gap
between TIC and an idealized Tag-In-SRAM approach.
[Figure 1 graphic: (a) CPU chip whose channels are each shared by DRAM and 3D-XPoint; (b) bar chart, y-axis: Speedup w.r.t. Tag Inside Cacheline (0.60 to 1.20).]
Figure 1: (a) Channel-Sharing Hybrid Memory, and (b) Performance of hit-optimized Tag-Inside-Cacheline (TIC) [7],
miss-optimized Tag-Outside-Cacheline (TOC) [8], and idealized Tag-In-SRAM, normalized to TIC.
arXiv:1907.02184v1 [cs.AR] 4 Jul 2019
An alternative approach to DRAM cache design is a miss-
optimized Tag-Outside-Cacheline (TOC) design [8, 12, 13].
A TOC design stores tags of multiple cachelines together in
a tag-only-line, such that one access to a tag-line can obtain
information for multiple cachelines at once. We can bring in
these bundles of tags as needed, and cache them in a small tag
cache (e.g., 32KB SRAM) [8]. If the tag cache has high hit-
rate, TOC can service most hits with one DRAM access, and
misses to clean lines without a DRAM access. However, if
the tag cache has low hit-rate, TOC may need two accesses to
service a hit, and one access to service a miss. As such, TOC
consumes lower bandwidth on misses than TIC; however, it
consumes higher bandwidth on hits due to separate tag and
data read. Overall, as shown in Figure 1(b), the TOC approach
performs worse than TIC due to bandwidth overheads.
We notice that TIC is good for hits, while TOC is good for
misses – one can perhaps combine both approaches to get
both good hit and miss bandwidth. Fortunately, it is cheap
to provision both metadata at once: TIC uses spare ECC
bits [11], and TOC needs to dedicate only 1.5% of DRAM
cache capacity to store metadata and a 32KB SRAM for a
metadata cache [8]. To decide when to use TOC or TIC, one
can employ a hit/miss predictor [7] that uses TIC for likely
hits and TOC for likely misses. We call this proposal, which
provisions both TIC and TOC metadata, TicToc. Unfortunately,
we find that naively combining TIC and TOC actually leads
to worse performance than TIC by itself. This is because
maintaining and updating TOC metadata bits consumes sig-
nificant DRAM bandwidth. In order for TicToc to be effective,
we need ways to reduce TOC maintenance bandwidth.
TOC incurs bandwidth overheads for the following three
cases: (i) tag-check on hits, (ii) tag-update on installs, and
(iii) dirty-bit-update on writebacks. Hit overhead is easily
mitigated by additionally storing TIC metadata in TicToc.
Tag updates are generally inexpensive because they occur at
miss time, and miss traffic usually has good spatial locality
and therefore a high metadata-cache hit-rate. Dirty-bit up-
dates, however, remain costly because they are carried out
when dirty lines are evicted from an earlier level of cache.
Such evictions have poor access locality and therefore low
metadata-cache hit-rates. Hence, we identify dirty bit updates
as the most significant bandwidth overhead for TicToc.
To reduce dirty data tracking costs for TOC, we target
the following two cases: initial write to a cache line, and
repeated writes to the same cache line. For repeated writes,
we propose to store a DRAM Cache Dirtiness bit alongside
the line in an earlier level of cache, to track the current dirty
status of the line in the DRAM cache. On a writeback to
DRAM cache, we only need to update the TOC if the line in
the DRAM cache has changed from clean to dirty. However,
many workloads write to lines only once. For such workloads,
we propose Preemptive Dirty Marking that predicts likely-to-
be-written cache lines and proactively marks those lines as
dirty in the TOC at install time. This avoids needing to update
dirty information at eviction time, thereby avoiding metadata-
cache misses. We develop a PC-based Write Predictor that is
92% accurate for our Preemptive Dirty Marking.
Even after solving for hit and miss bandwidth, when data
has poor reuse, installing lines and updating TOC tag can be-
come a major source of bandwidth overhead. To mitigate that
problem, we develop a Write-Aware Bypassing technique that
reduces install and tag-update bandwidth, without increasing
writes to write-constrained 3D-XPoint.
Overall our paper makes the following contributions:
Contribution-1: This paper evaluates and rigorously targets
the bandwidth overheads of prior DRAM-cache organiza-
tions. We find that we can combine two tag-storage methods
with a TicToc organization to obtain both good hit and good
miss path. However, such an approach suffers significant
bandwidth cost to maintain TOC dirty information on writes.
Contribution-2: We develop two techniques to reduce the
cost of tracking dirty information. DRAM Cache Dirtiness Bit
targets reducing cost of dirty-bit updates for repeated writes
to the same location, via maintaining DRAM cache dirty
information alongside the line in an earlier level of cache.
And, Preemptive Dirty Marking targets reducing cost of the
initial dirty-bit update to a location, via predicting which lines
are likely to be written to (with our Signature-based Write
Predictor) and preemptively setting the dirty-bit.
Contribution-3: To reduce install bandwidth while not in-
creasing 3D-XPoint write traffic, we develop a Write-Aware
Bypass technique. This technique bypasses most clean lines
by default to save install bandwidth. And, it installs most
dirty and predicted write-likely lines (to amortize metadata
updates) to buffer writes to write-constrained 3D-XPoint.
Overall, our proposed TicToc organization enables a 10%
speedup over TIC baseline, nearing the 14% speedup of
an idealized Tag-In-SRAM approach, while needing signifi-
cantly less SRAM storage (34 KB vs. 64 MB).
[Figure 2 graphic: DRAM array and row buffer with the access flow of each organization. (a) Tag-In-SRAM organization, full SRAM Tag+Dirty store (MBs): HIT: DATAOUT; MISS: READMEM. (b) Tag-Inside-Cacheline organization: HIT: DATAOUT; MISS: DATAOUT+READMEM. (c) Tag-Outside-Cacheline organization, SRAM metadata cache (KBs) backed by a Tag+Dirty store in DRAM: HIT: TAGFETCH+DATAOUT; MISS: TAGFETCH+READMEM.]
Figure 2: DRAM cache organization and flow for (a) idealized Tag-In-SRAM, (b) hit-latency-optimized Tag-Inside-
Cacheline (TIC) [7], and (c) miss-bandwidth-optimized Tag-Outside-Cacheline (TOC) [8].
2. BACKGROUND AND MOTIVATION
DRAM caches are important for enabling heterogeneous
memory systems to have the effective latency and bandwidth
of one memory technology, and the capacity of another; how-
ever, there are several challenges in designing DRAM caches.
A DRAM cache design has to balance multiple goals. First,
it should minimize the SRAM storage needed for DRAM
cache maintenance. Second, it should minimize cache hit
latency. Third, it should minimize miss latency. Fourth, it
should provide high hit-rate. Lastly, it should try to minimize
total bandwidth costs for maintaining DRAM cache state.
It is desirable to organize DRAM caches at the granularity
of a cache line to efficiently utilize cache capacity, and to
minimize the consumption of main memory bandwidth [10].
A key challenge in designing a large line-granularity cache is
deciding where to store the tag and dirty-bit metadata. For a
moderately-sized 4GB DRAM cache with 64B lines, there
would be 64 million lines. Even if each metadata required
8 bits (6 tag, 1 dirty, 1 valid bit), this would result in 64MB
storage for metadata. Next, we discuss the various options for
DRAM cache metadata management, and their implications
for SRAM storage cost and bandwidth consumption.
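The storage estimate above is plain arithmetic. The following is a minimal sketch of it, assuming binary units (4GB = 2^32 bytes); the helper name `metadata_bytes` is hypothetical and not part of the original design.

```python
# Worked arithmetic for the metadata-storage estimate.
# Assumption: binary units (1 GB = 2**30 bytes, 1 MB = 2**20 bytes).
GB = 2 ** 30
MB = 2 ** 20


def metadata_bytes(cache_bytes, line_bytes=64, bits_per_line=8):
    """Total metadata storage for a line-granularity cache (hypothetical helper)."""
    num_lines = cache_bytes // line_bytes
    return num_lines * bits_per_line // 8


num_lines = (4 * GB) // 64
print(num_lines // 2 ** 20, "Mi lines")                # 64 Mi lines
print(metadata_bytes(4 * GB) // MB, "MB of metadata")  # 64 MB of metadata
```

Even at one byte per line, the metadata is far larger than a typical on-chip SRAM budget, which is why the placement of metadata drives the rest of the design.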
Table 1: Bandwidth of DRAM Cache Organizations –
ρ denotes Metadata-Cache Miss Probability

Organization          SRAM       TIC       TOC
(SRAM Cost)           (>20MB)    (<1KB)    (~32KB)
Hit                   1          1         1 + ρ
Miss + Evict-Clean    0          1         0 + ρ
Miss + Evict-Dirty    1          1         1 + ρ
2.1 Tag In SRAM
A costly method to design high performance DRAM caches
is to simply maintain all of the tag and dirty bits in on-chip
SRAM, and query the on-chip SRAM metadata to determine
hit or miss, in a Tag-In-SRAM approach, shown in Figure 2(a).
Assuming 1-byte metadata per cache line, such an approach
would require 64MB of SRAM storage for a 4GB cache
(>20MB with sectoring [10, 19]). Table 1 shows the DRAM
bandwidth consumption for such an approach. SRAM meta-
data is queried first to determine hit or miss. A hit can be
serviced with one DRAM access to data. A miss can be
serviced without a DRAM access to data. However, in prepa-
ration for installing the newly accessed line, the cache would
need to perform an eviction of the resident line. If the resident
line were clean, the location could be directly overwritten.
However, if the resident line were dirty, the resident line
would need to be read before writeback to memory. Hence,
miss with eviction of a clean line costs 0 bandwidth, and
miss with eviction of a dirty line costs 1 bandwidth. Such a
design represents the minimum DRAM bandwidth needed
for DRAM cache maintenance, and an upper-bound for per-
formance. We aim to achieve Tag-In-SRAM performance at
low SRAM cost.
2.2 Tag Inside Cacheline
To reduce SRAM storage costs, one could store tags inside
each line in DRAM [7,11,18] in a Tag-Inside-Cacheline (TIC)
approach, shown in Figure 2(b). TIC optimizes for hit-latency
by using a direct-mapped design and storing tag inside each
data-line such that one access can retrieve both tag and data.
Direct-mapped organization enables the controller to know
which location to access, without waiting for tags.
Table 1 shows the bandwidth of such an approach. Hits
are serviced with one DRAM access that retrieves both tag
and data: in case of a tag match, the attached data can be
used to service the request. However, misses also need to
access tag in DRAM. As such, TIC is effective for hit-latency,
but consumes extra bandwidth on misses. This approach
of trading miss-bandwidth for hit-latency has been proven
effective in commercial products [11], and, as such, we use
the TIC organization [7] as our baseline.
Setup: We store metadata alongside data in unused ECC
bits similar to Intel’s Knights Landing [11]. TIC additionally
employs a small hit-miss predictor to guide when to access
cache+memory either in a parallel or serial manner (needs
<1KB SRAM storage overhead). We additionally include
bandwidth-reducing enhancements from Chou et al. [18],
such as DCP to reduce writeback probe.
2.3 Tag Outside Cacheline
Another option with reduced SRAM storage costs, is to
store metadata lines in a separate area of DRAM and bring
them in as needed in a Tag-Outside-Cacheline (TOC) [8, 12,
13] approach, shown in Figure 2(c). To determine hit or
miss, TOC first accesses a metadata line to get tag+dirty in-
formation for the requested data line, then routes the request
appropriately to DRAM cache or to memory. Of note, each of
these metadata lines actually stores tag+dirty information of
several adjacent data lines. An enhanced design [8] proposes
to cache the metadata lines in a small metadata cache to avoid
repeated accesses to the same metadata line, and would amor-
tize metadata lookup if there is spatial locality. Table 1 shows
the bandwidth consumption of such an approach. In case of a
metadata-cache hit, TOC performs similar to idealized Tag-
In-SRAM. For a metadata-cache miss, TOC spends additional
bandwidth to access the metadata. Overall, TOC has the po-
tential for reducing miss bandwidth, but it can suffer from
significant bandwidth overhead when the metadata-cache has
poor hit rate (due to poor spatial locality).
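To make the comparison in Table 1 concrete, the sketch below weights each row of the table by how often it occurs. It is only an illustration: the function name `expected_dram_accesses` is hypothetical, and the hit rate, dirty-eviction fraction, and ρ used in the example are assumed inputs, not measured values.

```python
def expected_dram_accesses(org, hit_rate, dirty_evict_frac, rho=0.0):
    """Expected DRAM-cache accesses per request, using the rows of Table 1.

    org              : "SRAM", "TIC", or "TOC"
    hit_rate         : fraction of requests that hit in the DRAM cache
    dirty_evict_frac : fraction of misses that evict a dirty line
    rho              : metadata-cache miss probability (only used by TOC)
    """
    costs = {
        "SRAM": {"hit": 1.0, "clean": 0.0, "dirty": 1.0},
        "TIC": {"hit": 1.0, "clean": 1.0, "dirty": 1.0},
        "TOC": {"hit": 1.0 + rho, "clean": 0.0 + rho, "dirty": 1.0 + rho},
    }
    if org not in costs:
        raise ValueError("unknown organization: " + org)
    miss_rate = 1.0 - hit_rate
    clean_miss = miss_rate * (1.0 - dirty_evict_frac)
    dirty_miss = miss_rate * dirty_evict_frac
    c = costs[org]
    return hit_rate * c["hit"] + clean_miss * c["clean"] + dirty_miss * c["dirty"]


# Illustrative inputs only: 50% hit rate, 20% of misses evict a dirty line.
for org, rho in (("SRAM", 0.0), ("TIC", 0.0), ("TOC", 0.1), ("TOC", 0.5)):
    print(org, rho, round(expected_dram_accesses(org, 0.5, 0.2, rho), 3))
```

Under these assumed inputs, TOC beats TIC only while ρ stays small, which matches the observation that TOC suffers when the metadata cache has a poor hit rate.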
Setup: We assume 1-byte metadata (6 tag, 1 dirty, 1 valid
bits), and 64 tags stored in each metadata entry. The metadata
are stored in a separate part of DRAM, consisting of 64MB
out of the 4GB DRAM capacity. Recently accessed metadata
are stored in a 512-entry metadata cache, which requires
32KB of SRAM. Note that the metadata cache is sized to
capture only spatial locality and not the working set of the
DRAM cache, which would need megabytes of SRAM.
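The sizes in this setup can be checked with a few lines of arithmetic. The sketch below assumes binary units and, as a simplification, treats the full 4GB as data lines; the variable names are illustrative and not from the original design.

```python
# Sizing check for the TOC setup (binary units assumed).
KB, MB, GB = 2 ** 10, 2 ** 20, 2 ** 30

LINE_BYTES = 64
TAGS_PER_ENTRY = 64            # 1-byte metadata each, so one entry fills a 64B line
CACHE_BYTES = 4 * GB
METADATA_CACHE_ENTRIES = 512

# Simplification: count every 64B line in the 4GB as a data line.
data_lines = CACHE_BYTES // LINE_BYTES
metadata_region = (data_lines // TAGS_PER_ENTRY) * LINE_BYTES
fraction = metadata_region / CACHE_BYTES

# SRAM needed by the metadata cache, and how much data its tags cover.
sram_bytes = METADATA_CACHE_ENTRIES * LINE_BYTES
reach_bytes = METADATA_CACHE_ENTRIES * TAGS_PER_ENTRY * LINE_BYTES

print(metadata_region // MB, "MB metadata region")    # 64 MB metadata region
print(round(fraction * 100, 2), "% of DRAM cache")    # 1.56 % of DRAM cache
print(sram_bytes // KB, "KB metadata cache")          # 32 KB metadata cache
print(reach_bytes // MB, "MB of data covered")        # 2 MB of data covered
```

The metadata cache covers only about 2MB of the 4GB DRAM cache at any moment, so it can capture spatial locality but not the working set.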
Optimizing for Latency: In the case of metadata-miss
in the metadata-cache, we want to avoid the latency for se-
rialized tag + cache-data access, as well as the latency for
serialized cache-data + memory access. We employ a direct-
mapped organization and a hit-miss predictor [7] for latency
and bandwidth considerations. If predicted hit, we access
tag + cache-data in parallel to save latency (direct-mapped
organization dictates only one possible location for data), and
serially access memory only if prediction is wrong to save
memory bandwidth. If predicted miss, we access tag + mem-
ory in parallel for latency, and serially access cache-data only
if prediction is wrong to save DRAM cache bandwidth.
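The access ordering described above can be summarized in a small decision function. This is a sketch of the described policy for the metadata-cache-miss case only; the function name `plan_accesses` and the string labels are hypothetical.

```python
def plan_accesses(predicted_hit, actual_hit):
    """Return (parallel_first_step, serialized_follow_up) on a metadata-cache miss.

    The first list is issued in parallel; the second list is issued afterwards,
    and only when the hit/miss prediction turns out to be wrong.
    """
    if predicted_hit:
        first = ["tag", "cache-data"]
        follow_up = [] if actual_hit else ["memory"]
    else:
        first = ["tag", "memory"]
        follow_up = ["cache-data"] if actual_hit else []
    return first, follow_up


for predicted in (True, False):
    for actual in (True, False):
        print(predicted, actual, plan_accesses(predicted, actual))
```

A correct prediction never needs the serialized follow-up step, so both latency and bandwidth costs are confined to mispredictions.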
2.4 Insight: Combine Metadata Approaches
A TIC approach has good hit-latency, but suffers from
extra miss bandwidth. Whereas, a TOC approach has good
miss bandwidth but incurs extra hit bandwidth. Our key
insight is that if one could use TIC for hits and TOC for
misses, then one could potentially achieve both good hit and
miss bandwidth.
We note that provisioning metadata for both TIC and
TOC is relatively inexpensive: TIC simply uses spare ECC
bits [11], and TOC needs to dedicate only ~1.5% of DRAM
cache capacity to store metadata lines and employs 32KB
SRAM for its metadata cache [8]. However, we need an
effective design that can use TIC for hits and TOC for misses.
We notice that we can use a hit/miss predictor [7] to help
guide when to use TIC or TOC. For predicted hits, we can
directly access the line with TIC. For predicted misses, we
can consult the metadata-line / metadata-cache in TOC to
help avoid miss probes. We call this proposal TicToc. Un-
fortunately, we find that naively combining both approaches
actually leads to worse performance than TIC individually.
This is because maintaining TOC tag and dirty bits consumes
substantial bandwidth. To complete our design, we need to
develop effective solutions to reduce maintenance bandwidth
for TOC. We discuss our methodology before the proposed design.
3. METHODOLOGY
3.1 Framework and Configuration
We use USIMM [20], an x86 simulator with detailed mem-
ory system model. We extend USIMM to include a DRAM
cache. Table 2 shows the configuration used in our study. We
assume a four-level cache hierarchy (L1, L2, L3 being on-
chip SRAM caches and L4 being off-chip DRAM cache). All
caches use 64B line size. We model a virtual memory system
to perform virtual to physical address translations. The base-
line L4 is a 4GB DRAM-cache [11], which is direct-mapped
and places tags with data in unused ECC bits. The parameters
of our DRAM cache are based on DDR4 DRAM technol-
ogy [15]. The main memory is based on 3D-XPoint [1,2,21]:
the read latency is ~6X, the write latency is ~24X that of
DRAM, and there are 64 rowbuffers each 256B in size.
Table 2: System Configuration

Processors               8 cores; 3.0GHz, 4-wide OoO
Last-Level Cache         8MB, 16-way

DRAM Cache
  Capacity               4GB
  Bus Frequency          1000MHz (DDR 2GHz)
  Configuration          1 channel, 64-bit bus, shared
  Aggregate Bandwidth    16 GB/s, shared with Memory
  tCAS-tRCD-tRP-tRAS     13-13-13-30 ns

Main Memory (3D XPoint)
  Capacity               64GB
  Bus Frequency          1000MHz (DDR 2GHz)
  Configuration          1 channel, 64-bit bus, shared
  Aggregate Bandwidth    16 GB/s, shared with DRAM
  tCAS-tRCD-tRP          4-80-0 ns
  tRAS-tWR               96-320 ns
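The 16 GB/s aggregate bandwidth in Table 2 follows directly from the bus width and data rate. A short check, assuming decimal GB/s (10^9 bytes per second) and illustrative variable names:

```python
# Peak bandwidth of the shared channel in Table 2.
BUS_WIDTH_BITS = 64
BUS_CLOCK_HZ = 1000 * 10 ** 6   # 1000MHz bus clock
TRANSFERS_PER_CLOCK = 2         # double data rate (DDR 2GHz)
LINE_BYTES = 64

bytes_per_second = (BUS_WIDTH_BITS // 8) * BUS_CLOCK_HZ * TRANSFERS_PER_CLOCK
line_transfer_ns = LINE_BYTES * 10 ** 9 // bytes_per_second

print(bytes_per_second / 10 ** 9, "GB/s")        # 16.0 GB/s
print(line_transfer_ns, "ns per 64B transfer")   # 4 ns per 64B transfer
```

Because DRAM and 3D-XPoint share this one channel, every 64B metadata or probe access occupies bus time that a memory access could otherwise use.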
3.2 Workloads
We use a representative slice of 2-billion instructions se-
lected by PinPoints [22], from benchmark suites that include
SPEC 2006 [23] and GAP [24]. For SPEC, we pick a subset
of high memory intensity workloads that have at least 2 L3
misses per thousand instructions (MPKI). The evaluations ex-
ecute benchmarks in rate mode, where all eight cores execute
the same benchmark. In addition to rate-mode workloads,
we also evaluate 21 mixed workloads, which are created by
randomly choosing 8 of the 17 SPEC workloads. Table 3
shows L3 miss rates, and memory footprints for the 8-core
rate-mode workloads in our study.
We perform timing simulation until each benchmark in
a workload executes at least 2 billion instructions. We use
weighted speedup to measure aggregate performance of the
workload normalized to the baseline and report geometric
mean for the average speedup across all the 17 workloads (11
SPEC, 2 GAP, 4 MIX). We provide key performance results
for additional 17 SPEC-mixed workloads in Section 6.4.
Table 3: Workload Characteristics

Suite    Workload      L3 MPKI    Footprint
SPEC     mcf           101.14     13.4 GB
SPEC     lbm           49.3       3.2 GB
SPEC     soplex        35.3       1.8 GB
SPEC     libq          30.1       256 MB
SPEC     gems          29.1       6.4 GB
SPEC     omnet         29.0       1.2 GB
SPEC     wrf           10.4       1.1 GB
SPEC     gcc           7.6        1.5 GB
SPEC     xalanc        7.4        1.5 GB
SPEC     zeus          7.0        1.6 GB
SPEC     cactus        6.5        2.6 GB
GAP      cc twitter    116.8      9.3 GB
GAP      pr twitter    126.6      15.3 GB
[Figure 3 graphic: TicToc metadata organization. An address (ADDR) first goes through hit/miss prediction. Predicted hits read the DRAM array through the row buffer, where each 64B data line carries ECC+TAG+D (8B) of TIC metadata (HIT: DATAOUT). Predicted misses consult TOC metadata through the metadata cache (MISS: READMEM+TAGFETCH).]
Figure 3: TicToc Metadata Organization queries hit/miss predictor to use TIC metadata for hits and TOC metadata
for misses. TicToc enables good hit latency, and good hit/miss bandwidth.
4. TICTOC DESIGN
DRAM caches need metadata to confirm if a line is cache
resident or not (tag bits), and if the resident line is the most
up-to-date copy (dirty bit). Tag-Inside-Cacheline (TIC) or-
ganizations are optimized for hits as one access gets both
metadata and data, but can suffer for misses as misses still
need to access DRAM for metadata. In contrast, Tag-Outside-
Cacheline (TOC) organizations are optimized for misses as
one metadata access gets residency and dirty information for
multiple lines; however, such approaches suffer from needing
to frequently query and update TOC metadata. Ideally, we
want the hit-path of TIC, the miss-path of TOC, all without
paying significant cost to access and maintain TOC metadata.
This section is organized as follows: we describe how to
provision and effectively use both TIC and TOC metadata in
a TicToc organization, describe how to reduce TOC metadata
maintenance costs, and show effectiveness of our design.
4.1 TicToc Metadata Organization
Figure 3 shows metadata organization of our TicToc de-
sign. TicToc provisions TIC metadata – tag-bits and dirty-bit
are stored inside the cacheline in unused ECC bits, simi-
lar to commercial implementation [11]. In addition, TicToc
provisions TOC metadata – metadata is stored in dedicated
metadata lines corresponding to 1.5% of DRAM capacity,
and cached as needed in a 32KB on-chip metadata cache.
While provisioning both TIC and TOC metadata is relatively
cheap, the complexity lies in utilizing TIC and TOC metadata
appropriately to save on bandwidth for both hits and misses.
4.1.1 TicToc Operation
Figure 3 shows the operation of TicToc. Ideally, we want to
use TIC metadata for hits and TOC metadata for misses. Our
key insight is that one can use hit/miss prediction [7, 25] to
help guide when to use which metadata. Hit/miss predictors
have been primarily used to hide the serialization latency that
can occur from waiting on last-level cache response before
sending memory access. They work by predicting which
cache accesses are likely to miss, and sending both cache and
memory requests in parallel to avoid serialization. We exploit
an effective hit/miss predictor [7] to guide TicToc to use TIC
metadata on likely-hit and TOC metadata on likely-miss. The
common result: a hit is serviced in one cache access (TIC
path), a miss with clean eviction directly goes to memory
(TOC path), and a miss with dirty eviction goes to cache
and memory (TIC path). An uncommon path of predict-hit
actual-miss incurs serialization latency and bandwidth cost
to access cache before memory. The other uncommon path
of predict-miss actual-hit incurs extra memory access due to
parallel lookup of cache and memory.
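The common and uncommon paths described above can be tabulated with a small function. This is a simplified reading of the text rather than the authors' implementation: the name `tictoc_accesses` is hypothetical, and the sketch does not count install writes or metadata-cache misses.

```python
def tictoc_accesses(predicted_hit, actual_hit, victim_dirty=False):
    """Return (dram_cache_accesses, memory_accesses, serialized) for one read.

    Simplifying assumptions: install writes and TOC metadata fetches are not
    counted, and victim_dirty only matters on an actual miss.
    """
    if predicted_hit:
        if actual_hit:
            return (1, 0, False)   # TIC path: one access returns tag and data
        return (1, 1, True)        # wasted probe, then a serialized memory read
    if actual_hit:
        return (1, 1, False)       # parallel lookup; the memory read is extra
    if victim_dirty:
        return (1, 1, False)       # read the dirty victim and read memory
    return (0, 1, False)           # TOC path: go directly to memory


print(tictoc_accesses(True, True))                       # (1, 0, False)
print(tictoc_accesses(False, False))                     # (0, 1, False)
print(tictoc_accesses(False, False, victim_dirty=True))  # (1, 1, False)
print(tictoc_accesses(True, False))                      # (1, 1, True)
print(tictoc_accesses(False, True))                      # (1, 1, False)
```

Only the two misprediction cases cost more than the idealized design, so the benefit depends on the accuracy of the hit/miss predictor.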
4.1.2 TicToc Effectiveness
To analyze the effectiveness of TicToc, Figure 4 and Figure 5
show the proportion of channel bandwidth being used for
useful operations, install operations, and assorted mainte-
nance operations, for baseline TIC and proposed TicToc.
Useful operations include 3D-XPoint Read and Write, and
DRAM Cache Hit and Writeback. Install operations refer to
cache installs, which are important for improving hit-rate but
cost bandwidth to write the line to DRAM. Lastly, Mainte-
nance operations refer to bandwidth-wasting operations used
to confirm a line is not resident: miss probes for TIC, and
accessing and updating TOC metadata for TOC.
[Figure 4 graphic: stacked bars of normalized bandwidth consumption (0.00 to 1.00) per workload: mcf, lbm, soplex, libq, gems, omnet, wrf, gcc, xalanc, zeus, cactus, cc twi, pr twi, Amean. Series: Useful BW, Install, Miss Probe.]
Figure 4: Breakdown of bus bandwidth consumption for
TIC organization [7]. Workloads with low hit-rate waste
significant bandwidth to confirm misses.
As expected, Figure 4 shows that TIC wastes bandwidth
probing the DRAM cache to confirm misses. The proposed
TicToc can utilize TOC to reduce such miss probes. How-
ever, Figure 5 shows that TicToc actually fares worse due to
needing bandwidth to maintain TOC tag and TOC dirty-bit.
TOC tag-updates happen when the workload misses on a
line and installs it. A large fraction of misses occur when a
workload is accessing many lines in a new page, so misses
generally have good spatial locality. In such cases, metadata-
accesses/updates are amortized with the small metadata cache.
TOC dirty-bit-updates, on the other hand, occur upon evic-
tion of a dirty line from an earlier level of cache. Eviction
generally has poor spatial and temporal locality, so updating
this information often takes significant bandwidth to read
then update the TOC dirty-bit. We need effective methods
that target reducing the cost of maintaining dirty information.
[Figure 5 graphic: stacked bars of normalized bandwidth consumption (0.00 to 1.00) per workload: mcf, lbm, soplex, libq, gems, omnet, wrf, gcc, xalanc, zeus, cactus, cc twi, pr twi, Amean. Series: Useful BW, Install, Tag-Update, Dirty-Update.]
Figure 5: Breakdown of bus bandwidth consumption for
proposed TicToc organization. Write-heavy workloads
waste significant bandwidth updating TOC dirty-bit.
[Figure 6 graphic: access timelines for TIC, TOC, TicToc, and TicToc (+ PDM) on (a) the write path (install, writeback, decoupled metadata maintenance) and (b) the miss + install path (miss/WB probe, memory read, install). Line states are written as (TOC dirty-bit | TIC dirty-bit), where C is clean and D is dirty; D|C is the “Predicted-Dirty” state. Annotations: marking predicted dirty lines at install saves metadata bandwidth, but overpredicting dirty lines increases miss bandwidth.]
Figure 6: Bandwidth for a typical (a) write path and (b) miss+install path. TicToc+PDM adds “Predicted-Dirty” state,
where TOC dirty-bit is installed as dirty but TIC dirty-bit is installed as clean. Installing lines in Pred-Dirty can (a)
save TOC dirty-bit update, but (b) increase miss cost. Using Pred-Dirty only for write-likely lines can save bandwidth
4.2 Reducing Dirty-Bit Tracking Costs
The main source of bandwidth overhead of TicToc is main-
taining the dirty-bit for TOC metadata. We need effective
methods to reduce the cost of tracking dirty information. We
explain the difficulty before describing our solution.
4.2.1 Understanding Dirty-bit Updates
The dirty-bit update procedure starts upon an eviction of a
dirty line from L3. First, we need to check the tag/dirty-bit
line in the destination location of the DRAM cache to see
if we can overwrite it (i.e., we must first evict a dirty tag-
mismatched line). The common case is that the line evicted
from L3 is resident in L4, as shown in Figure 6(a). Chou et al. [18]
propose to eliminate the tag-check for this common case by
maintaining a DRAM Cache Presence bit (DCP) alongside
every line in L3. The DCP informs us that the same line is in
both L3 and L4 – if the bit is set, the destination location has
the same tag and can be directly overwritten (note that this op-
timization is included in our baseline). Second, the DRAM
cache will then write the dirty line to the DRAM cache. Third,
the cache will need to update any pertinent tag and dirty-bit
metadata. The tag-update for TIC and TOC is uncommon, as
typically L3 to L4 writebacks will hit. The dirty-bit-update
for TIC is sent along with L4 install, so it does not incur band-
width overhead. However, Figure 6(a)[TOC,TicToc] shows
that the dirty-bit update for TOC often needs to be separately
queried and potentially updated. This TOC dirty-bit update
is TicToc’s main source of bandwidth overhead.
The overhead of dirty-bit updates is comprised of two parts:
repeated TOC dirty-bit checks for already-dirty lines, and the
initial TOC dirty-bit update to mark clean-to-dirty transition.
We target these two scenarios with two techniques.
4.2.2 Reducing Repeated TOC Dirty-bit Checks
We have an insight that if we also knew the dirty state of
the corresponding line in the DRAM cache, we can avoid the
need to check the TOC dirty-bit. Instead, we can check (and
update) the TOC dirty-bit only if the dirty status changes.
DRAM Cache Dirtiness: To enable this optimization, we
propose to additionally store a DRAM Cache Dirtiness bit
(DCD) alongside the DCP [18] next to each line in the L3
cache. The DCP stores information that the current L3 line is
also resident in L4. Meanwhile, the DCD will additionally
store the dirty status of that L4 line. We set the DCD on
read of a dirty line from L4. On a DRAM cache write, we
check both the DCP and DCD. If both DCD and DCP are
set, we know the line is resident and already dirty in the TOC
metadata – tag and dirty-bit will be unchanged and we do not
need to fetch TOC. Hence, DCP reduces tag checks when tag
will not be modified, and DCD reduces dirty-bit checks when
dirty-bit will not be modified.
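The DCP/DCD check on a writeback reduces to a two-bit test. The sketch below illustrates that rule under stated assumptions; the class `L3Line` and the functions `fill_from_l4` and `needs_toc_fetch` are hypothetical names, not part of the original design.

```python
from dataclasses import dataclass


@dataclass
class L3Line:
    dcp: bool = False  # DRAM Cache Presence: the line is also resident in L4
    dcd: bool = False  # DRAM Cache Dirtiness: the L4 copy is already dirty


def fill_from_l4(l4_line_dirty):
    """Build the L3 state for a line read from the DRAM cache (L4)."""
    return L3Line(dcp=True, dcd=l4_line_dirty)


def needs_toc_fetch(line):
    """On an L3-to-L4 writeback, skip the TOC fetch only if both bits are set."""
    return not (line.dcp and line.dcd)


print(needs_toc_fetch(fill_from_l4(True)))    # False: tag and dirty-bit unchanged
print(needs_toc_fetch(fill_from_l4(False)))   # True: clean-to-dirty update needed
print(needs_toc_fetch(L3Line()))              # True: residency is unknown
```

The saving applies only to lines that were already dirty in L4 when read, which is why workloads that write each line once see little benefit from DCD alone.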
Figure 7 shows that DCD reduces dirty-bit checks for many
workloads that repeatedly write to the same lines (e.g., omnet,
soplex). However, several workloads (e.g., zeusmp)
are write-heavy and write to most lines only once; we
want to reduce dirty-bit updates for those workloads as well.
4.2.3 Reducing Initial TOC Dirty-bit Update
For workloads that write to each line only once, our insight is
that if we preemptively mark the dirty bit in the TOC at
install time, we can avoid even the initial TOC clean-to-dirty
update that would have occurred at L3 eviction time. We call
this approach Preemptive Dirty Marking (PDM).
Preemptive Dirty Marking: Figure 6 shows the typical
write and miss+install bandwidth for TicToc and for a variant that
preemptively marks the TOC dirty-bit. Figure 6(a)[TicToc] shows that
a typical write path needs 4 accesses: a normal line
incurs a clean install, a write, and a TOC dirty-bit read and write.
Figure 6(a)[TicToc+PDM] shows that PDM can limit writes
to 2 accesses. We add a new dirty state of “Predicted-Dirty,”
where TOC dirty-bit is marked as dirty but TIC dirty-bit is
marked as clean. If we install lines in “Predicted-Dirty,” the
TOC dirty-bit is set at install time, and even the initial TOC
clean-to-dirty update can be avoided.
However, while early marking can save bandwidth on
writes, PDM incurs a different problem on the miss path.
Figure 6(b)[TicToc] shows a typical miss+install path needs
2 accesses: TOC metadata informs residence and dirtiness so
miss+install can be accomplished with a memory read and a
DRAM cache install. However, Figure 6(b)[TicToc+PDM]
shows that PDM can increase miss+install to 3 accesses. For
instance, if an otherwise clean line has been preemptively
marked as dirty in the TOC dirty-bit, we would read the
DRAM cache line in preparation for an eviction of a dirty
line, thereby adding an extra DRAM read. Note that the
Predicted-Dirty state does not cause extra memory writebacks,
as the miss/wb probe will find the TIC dirty-bit, and
write back only if the data is dirty. Thus, being aggressive in
marking lines as “Predicted-Dirty” will save write bandwidth,
but it can come at the cost of increasing miss bandwidth.
[Figure 7 plot: y-axis Speedup (0.20 to 1.60); x-axis workloads mcf, lbm, soplex, libq, gems, omnet, wrf, gcc, xalanc, zeus, cactus, cc twi, pr twi, mix1 to mix4, Gmean, + no mcf; clipped bar labels .30, .33; series TOC, TicToc, TicToc+DCD, TicToc+PDM, Tag-In-SRAM]
Figure 7: Speedup of TOC, proposed TicToc, TicToc with DRAM Cache Dirtiness bit, TicToc with Preemptive Dirty
Marking (PDM), and ideal Tag-In-SRAM, normalized to TIC. TicToc+PDM performs near ideal for most workloads.
Ideally, we want to avoid write costs by installing write-likely
lines as “Predicted-Dirty”, and avoid increased miss
costs by installing write-unlikely lines as clean. However, if
we install a write-likely line as clean, it will pay the cost of
updating the TOC dirty bit on its first write. Conversely, if we
install a write-unlikely line as “Predicted-Dirty,” it will pay an
increased miss cost. Hence, the performance of Preemptive Dirty Marking is
contingent on good classification of write-likely and write-unlikely
lines at install time, to avoid both the TOC dirty-bit
update and the TIC miss-probe bandwidth.
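The access counts quoted above for the write path and the miss+install path can be tallied in a short sketch. It is a simplified accounting that uses only the counts stated in the text; the names `InstallState`, `write_path_accesses`, and `miss_install_accesses` are hypothetical, and effects such as metadata-cache hits are ignored.

```python
from enum import Enum


class InstallState(Enum):
    CLEAN = "clean"                      # TOC dirty-bit = 0, TIC dirty-bit = 0
    PREDICTED_DIRTY = "predicted-dirty"  # TOC dirty-bit = 1, TIC dirty-bit = 0


def write_path_accesses(state: InstallState) -> int:
    """DRAM cache accesses for installing a line and later writing it once."""
    accesses = 2  # the install plus the data write
    if state is InstallState.CLEAN:
        accesses += 2  # TOC dirty-bit read and TOC dirty-bit write
    return accesses


def miss_install_accesses(victim_toc_dirty: bool) -> int:
    """Accesses for a miss+install, given the victim's TOC dirty-bit."""
    accesses = 2  # memory read plus DRAM cache install
    if victim_toc_dirty:
        accesses += 1  # read the victim line to consult its TIC dirty-bit
    return accesses


print(write_path_accesses(InstallState.CLEAN))            # 4
print(write_path_accesses(InstallState.PREDICTED_DIRTY))  # 2
print(miss_install_accesses(victim_toc_dirty=False))      # 2
print(miss_install_accesses(victim_toc_dirty=True))       # 3
```

The two functions make the trade-off explicit: marking a line Predicted-Dirty saves two accesses if the line is written, but costs one extra access at eviction time if it never is.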
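[Figure 8 diagram: each incoming line's signature, Sig = PC%(1<<10), indexes a table of counters (CTR); a zero counter (000) gives Predict-Clean (Init Dirty=0), a non-zero counter gives Predict-Dirty (Init Dirty=1); sampled cache lines store Sig and a written-to bit (W)]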
Figure 8: Signature-based Write Predictor learns which
sigs correspond to eventual write, to aid PDM technique.
Write Predictor: For accurate write-classification for PDM,
we develop a Signature-based Write Predictor (SWP) to predict
the likelihood that an incoming line will be written. SWP employs
sampling-based, PC-indexed prediction, inspired by SHiP [26, 27].
Figure 8 shows structures and operation of SWP. SWP con-
sists of write-behavior observation, learning, and prediction.
Observation is accomplished by maintaining signature
(installing-PC in this case) and a written-to bit inside the
metadata of each line (10 bits additional metadata for the 1%
sampled lines, stored in TOC-metadata). Signature is set at
install-time, and written-to bit is updated on first write to line.
On eviction of such a sampled line, we get the information
that this PC installed a line that was either written-to or never
written-to in its lifetime in the cache.
Learning is then accomplished by storing observed write-
behavior into a PC-indexed table of saturating 3-bit counters.
On eviction of a line that has the written-to bit set, the counter
corresponding to installing-PC is incremented. On eviction
of a line that does not have written-to bit set, the counter
corresponding to installing-PC is decremented. This counter
table becomes a PC-indexed table of write-behavior.
Prediction is then simple – on install, the installing-PC is
used to index into the counter-table to provide a write-likely
or write-unlikely prediction. If the counter is non-zero, this
PC has seen write behavior and the incoming line should
be installed in “Predicted-Dirty” state to avoid initial TOC
clean-to-dirty update. If the counter is zero, then this PC has
not seen much write behavior and the incoming line should
be installed as clean to avoid miss/wb probes.
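The observation, learning, and prediction steps above can be sketched in a few lines. This is an illustrative model rather than the authors' implementation: the class and method names are hypothetical, the signature width is a parameter (Figure 8 gives Sig = PC%(1<<10)), counters are assumed to start at zero, and the 1% line sampling is left to the caller.

```python
class SignatureWritePredictor:
    """Table of saturating counters indexed by an installing-PC signature."""

    def __init__(self, sig_bits: int = 10, ctr_bits: int = 3):
        self.num_entries = 1 << sig_bits
        self.ctr_max = (1 << ctr_bits) - 1      # 3-bit counters saturate at 7
        self.counters = [0] * self.num_entries  # assumption: start at zero

    def signature(self, pc: int) -> int:
        """Observation: the signature stored with a sampled line at install."""
        return pc % self.num_entries

    def train_on_eviction(self, sig: int, written_to: bool) -> None:
        """Learning: update the counter when a sampled line is evicted."""
        if written_to:
            self.counters[sig] = min(self.counters[sig] + 1, self.ctr_max)
        else:
            self.counters[sig] = max(self.counters[sig] - 1, 0)

    def predict_dirty(self, pc: int) -> bool:
        """Prediction: a non-zero counter means install as Predicted-Dirty."""
        return self.counters[self.signature(pc)] != 0
```

A caller would store `signature(pc)` and a written-to bit with each sampled line at install, call `train_on_eviction` when that line leaves the cache, and consult `predict_dirty` on every install.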
Accuracy of Write Predictor: Effectiveness of PDM is con-
tingent on good classification of write-likely (dirty) lines to
reduce dirty-bit update cost, and write-unlikely (clean) lines
to reduce miss-probe cost. Figure 9 shows the fraction of
lines that are predicted clean or dirty, and actually clean or
dirty. On average, SWP predicts clean and dirty with 92%
accuracy, and enables PDM to save most dirty-update and
miss-probe bandwidth.
[Figure 9 plot: y-axis Write-Pred Acc (%) (0 to 100); x-axis workloads mcf, lbm, soplex, libq, gems, omnet, wrf, gcc, xalanc, zeus, cactus, cc twi, pr twi, Gmean; series PDirty,ADirty; PClean,ADirty; PClean,AClean; PDirty,AClean]
Figure 9: Accuracy of Write Prediction (P=predicted,
A=actual). Low PClean/ADirty and PDirty/AClean rate
reflects accurate write-behavior prediction.
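As a small worked illustration of how the four categories in Figure 9 combine into one accuracy number, the sketch below treats PDirty,ADirty and PClean,AClean as correct predictions. The function name is hypothetical and the counts in the usage line are made-up values, not results from the paper.

```python
def prediction_accuracy(pd_ad: int, pc_ad: int, pc_ac: int, pd_ac: int) -> float:
    """Fraction of lines whose predicted state (P) matches the actual state (A)."""
    total = pd_ad + pc_ad + pc_ac + pd_ac
    if total == 0:
        raise ValueError("no lines observed")
    correct = pd_ad + pc_ac  # predicted dirty and was dirty, or predicted clean and was clean
    return correct / total


# Hypothetical counts, for illustration only.
accuracy = prediction_accuracy(pd_ad=40, pc_ad=10, pc_ac=45, pd_ac=5)
```

The two mispredicted categories map onto the two costs discussed earlier: PClean,ADirty lines pay the TOC dirty-bit update, and PDirty,AClean lines pay the extra miss probe.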
4.2.4 Effectiveness of Dirty-Tracking Optimizations
Performance: Figure 7 shows the speedup of TOC, our
TicToc, TicToc with DCD, TicToc with PDM, and idealized
Tag-In-SRAM, normalized to the TIC approach. TOC performs
poorly due to its poor metadata-cache hit-rate, for a 30% slowdown.
Our TicToc reduces hit bandwidth, for a 22% slowdown.
Adding the DRAM Cache Dirtiness bit reduces dirty-bit tracking
for repeated writes to the same lines, bringing performance to
parity with TIC (0% speedup). Adding
Preemptive Dirty Marking reduces the initial dirty-bit update
without incurring extra miss bandwidth, thanks to the accurate
Write Predictor. Notably, TicToc+PDM achieves near-idealized
Tag-In-SRAM performance for most workloads, for a 10%
speedup when the worst-case mcf is excluded from the average.
A few workloads still see a performance gap to ideal. We analyze
bandwidth consumption to gain insight into the problem.
[Figure 10 plot: y-axis Normalized Bandwidth Consumption (0.00 to 1.00); x-axis workloads mcf, lbm, soplex, libq, gems, omnet, wrf, gcc, xalanc, zeus, cactus, cc twi, pr twi, Amean; series Useful BW, Install, Tag-Update, Dirty-Update]
Figure 10: Breakdown of bus bandwidth for dirty-optimized TicToc. Dirty-bit updates are greatly reduced.
[Figure 11 plot: y-axis Normalized Bandwidth Consumption (0.00 to 1.00); x-axis workloads mcf, lbm, soplex, libq, gems, omnet, wrf, gcc, xalanc, zeus, cactus, cc twi, pr twi, Amean; series Useful BW, Install, Metadata]
Figure 11: Breakdown of bus bandwidth for dirty-optimized TicToc w/ Write-Aware Bypassing. Installs are mitigated.
Bandwidth: Figure 10 shows the bandwidth breakdown
of TicToc + dirty-bit optimizations. Overall, our approach
eliminates nearly all of the TOC dirty-bit update bandwidth
(decreased fraction from 10% to 0.8%) and frees up band-
width for useful reads and writes. However, we note that
installing lines and updating the TOC-tag now becomes the
primary source of DRAM cache bandwidth overhead. We
target this overhead next.
5. REDUCING INSTALL BANDWIDTH
WITH WRITE-AWARE BYPASS
When data has poor reuse, installing lines and updating
TOC metadata wastes bandwidth. In fact, in such cases, em-
ploying a DRAM cache could actually hurt performance, as
the line install and tag maintenance operations needlessly
steal bus bandwidth from memory accesses. Figure 13 shows
the performance of a setup without a DRAM cache, normal-
ized to a setup with a TIC DRAM cache. We note that there
are multiple workloads (e.g., pr twi and cc twi), for which
“no DRAM cache” performs better than TIC. While one can
avoid this degradation by disabling the DRAM cache at boot-
time, doing so would then hurt the cache-friendly workloads.
Therefore, we need effective mechanisms to reduce the cost
of unnecessary installs.
Insight – Write-Aware Bypassing: Prior work has pro-
posed cache bypassing [18, 28, 29] to avoid unnecessary in-
stalls. On an L3 miss, one can bypass the DRAM cache
and install the line only in L1/L2/L3 caches, thereby saving
the DRAM cache install bandwidth. However, such bypass-
ing must be done selectively and carefully, otherwise it may
increase writes to 3D-XPoint memory, and degrade perfor-
mance, endurance, and power.
5.1 Design of Write-Aware Bypassing
Figure 12 shows our Write-Aware Bypassing policy. We
start with the default 90%-bypass policy proposed in [18],
which bypasses 90% of all installs. While such aggressive
bypassing was shown to work well for an HBM+DDR hybrid
memory [18], we note that it can increase write traffic to
the write-constrained 3D-XPoint memory. To address this
problem, we add write awareness to the bypass policy. We
augment the default bypass policy with a write-allocate con-
dition, which requires that dirty L3 evictions would always
install DRAM cache lines. Thus, the DRAM cache would act
as a write buffer for 3D-XPoint memory. Unfortunately, the
drawback of such an approach is that installing DRAM cache
lines at the time of L3 evictions may result in significant tag-
update costs. L3 evictions often have poor spatial locality, so
TOC tag updates carried out at L3 eviction time exhibit poor
metadata cache hit rates and incur extra DRAM accesses.
To amortize the TOC tag-update cost of our write-allocate
policy, we propose Preemptive Write-Allocate, whereby we
also always-install write-likely lines (predicted with SWP).
Preemptive Write-Allocate enables our write-allocate installs
to happen at L3 miss time. Such installs have higher spa-
tial locality, resulting in more metadata cache hits and more
effective amortization of TOC metadata updates.
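The install decision described in this subsection can be summarized as a small policy function. This is a sketch under stated assumptions: the function name and parameters are hypothetical, the 90% bypass is modeled as an independent random draw, and `predicted_dirty` stands in for the SWP outcome.

```python
import random


def should_install(is_writeback: bool, predicted_dirty: bool,
                   bypass_prob: float = 0.9, rng=random.random) -> bool:
    """Decide whether a DRAM cache miss installs the line or bypasses the cache."""
    if is_writeback:
        # Write-allocate: dirty L3 evictions always install, so the DRAM
        # cache acts as a write buffer for 3D-XPoint.
        return True
    if predicted_dirty:
        # Preemptive Write-Allocate: install write-likely lines at L3 miss time.
        return True
    # Write-unlikely demand miss: bypass with probability bypass_prob.
    return rng() >= bypass_prob
```

Passing `rng` as a parameter keeps the sketch deterministic under test; a hardware implementation would more likely use a simple counter or pseudo-random source.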
[Figure 12 diagram: a Demand Miss consults the Write-Predictor; Pred-Clean lines take 90%-Bypass (save install/tag BW), Pred-Dirty lines take Always-Install (reduce 3D-XPoint writes), which is Preemptive Write-Allocate; a Writeback Miss takes Always-Install]
Figure 12: Write-Aware Bypass. Reduce install band-
width by bypassing most write-unlikely lines. Reduce 3D-
XPoint writes by installing write-likely lines.
[Figure 13 plot: y-axis Speedup (0.60 to 2.20); x-axis workloads mcf, lbm, soplex, libq, gems, omnet, wrf, gcc, xalanc, zeus, cactus, cc twi, pr twi, mix1 to mix4, Gmean, + no mcf; clipped bar labels .44, .54, 2.2; series No DRAM Cache, TicToc, + 90% bypass, + Write-Allocate, + Preemptive Write-Allocate]
Figure 13: Speedup of a no-DRAM-cache configuration, proposed TicToc organization, adding 90%-bypass, adding
Write-Allocate, and adding Preemptive Write-Allocate, relative to TIC approach.
5.2 Effectiveness of Write-Aware Bypassing
Bandwidth: To understand the effectiveness of our install
and metadata-update reducing optimizations, we show the
bandwidth breakdown of our approach in Figure 11. Over-
all, we find that install-reducing optimizations can eliminate
nearly all of the install bandwidth overheads and leave much
more bandwidth for useful reads and writes. In total, the
combination of our cache-bandwidth-reducing optimizations
improves the fraction of bandwidth going to useful operations
(servicing reads / writes) from 70% to 90% on average.
Performance: Figure 13 shows the performance of TicToc
with dirty-optimizations, TicToc with 90%-bypass, TicToc
with 90%-bypass and write-allocate, and TicToc with 90%-
bypass and preemptive write-allocate, relative to TIC.
TicToc with dirty-bit optimizations does well for most
workloads for an average 4.2% speedup, but can suffer for
workloads with poor spatial locality and low hit-rate (e.g.,
mcf ). TicToc with 90%-bypass reduces install and TOC tag-
update cost to improve speedup to 16.7%. Notably, the perfor-
mance degradation for mcf has been substantially mitigated.
TicToc with 90%-bypass and write-allocate enables effective
write-buffering to improve speedup to 20.6%. Finally, Tic-
Toc with 90%-bypass and preemptive write-allocate further
amortizes TOC metadata-update (e.g., useful for zeusmp and
pr twi) to improve speedup to 23.2%.
5.3 Putting it all together
Overall, our proposed techniques target all forms of DRAM
cache maintenance bandwidth to achieve a bandwidth-efficient
(>90% of channel bandwidth to useful operations) and low
SRAM storage overhead (34KB) DRAM cache organization:
TicToc improves hit and miss bandwidth, the DRAM Cache
Dirtiness bit and Preemptive Dirty Marking reduce
dirty-bit-tracking bandwidth, and Write-Aware Bypass reduces
install and tag-tracking bandwidth. Our TicToc with dirty-bit
and install bandwidth reducing optimizations enables 23.2%
speedup at the cost of only 34KB of SRAM.
6. RESULTS AND DISCUSSION
In this section we present sensitivity studies and storage
analysis. Due to space constraints, we limit results to TicToc
with dirty-bit optimizations.
6.1 Storage Requirements
We analyze the SRAM storage requirements of our TicToc
organization. TicToc requires structures from its component
TIC and TOC organizations. Inheriting from TIC, we need
~1KB for PC-based hit/miss prediction [7], and 1 bit along-
side each L3 line for DRAM Cache Presence bit to avoid
tag-check for writes to resident lines [18]. Inheriting from
TOC, we need 32KB for a metadata cache [8].
Specific to TicToc, to implement our dirty-bit optimizations,
we need 1 bit alongside each L3 line for DRAM
Cache Dirtiness, and ~1KB for our Signature-based Write
Predictor (512 entries of 3-bit counters with 9-bit PC tag).
Our bypassing optimizations do not require additional space.
In total, TicToc needs 34KB SRAM storage in the memory
controller, with 2 bits alongside each L3 line.
Table 4: Storage Requirements of TicToc
TicToc Component                   SRAM Storage
Hit-Miss Predictor [7]             1 KB
DRAM Cache Presence [18]           1 bit / L3 line
Metadata Cache [8]                 32 KB
DRAM Cache Dirtiness               1 bit / L3 line
Signature-based Write Predictor    1 KB
TicToc (total)                     34 KB + 2 bits / L3 line
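The arithmetic behind these totals can be checked with a short calculation. It uses only the sizes quoted in this subsection; the function names are hypothetical, and the per-L3-line bits are left out because they live in the L3 cache rather than the memory controller.

```python
KB = 1024


def swp_table_bytes(entries: int = 512, counter_bits: int = 3, tag_bits: int = 9) -> int:
    """Bytes for the write predictor: entries x (counter + PC tag) bits."""
    return entries * (counter_bits + tag_bits) // 8


def tictoc_sram_bytes() -> int:
    """Memory-controller SRAM: hit/miss predictor + metadata cache + SWP."""
    hit_miss_predictor = 1 * KB   # ~1KB, inherited from TIC
    metadata_cache = 32 * KB      # inherited from TOC
    return hit_miss_predictor + metadata_cache + swp_table_bytes()


print(swp_table_bytes())         # 768 bytes, which the text rounds to ~1KB
print(tictoc_sram_bytes() / KB)  # 33.75, which the text rounds to 34KB
```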
6.2 Sensitivity to Metadata-Cache Size
The largest SRAM component of our TicToc proposal is
the TOC metadata cache. Table 5 shows performance sen-
sitivity of our TicToc organization to metadata-cache sizing.
We show average speedup of TicToc with dirty-bit optimiza-
tions, when employing metadata-caches with sizes ranging
from 8KB to 64KB. The dirty-bit tracking optimizations enable
TOC approaches to be much more effective with small
metadata-cache sizes, as the metadata caches do not need to be
sized to handle writeback traffic that has poor spatial locality.
Table 5: Sensitivity to Metadata Cache Sizing
Num. Entries    TicToc    TicToc (no mcf)
128 (8KB)       -3.0%     +1.9%
256 (16KB)      +1.5%     +7.0%
512 (32KB)      +4.2%     +10.0%
1024 (64KB)     +4.8%     +10.7%
6.3 Impact of Channel-Sharing
DRAM and 3D-XPoint are likely to be behind the same
channel to maximize the bandwidth out of each physical pin,
as shown in Figure 1. Figure 14 shows the system perfor-
mance of a channel-shared system (two channels of TIC
DRAM cache + 3D-XPoint), normalized over previously as-
sumed dedicated-channel systems (one channel of 2x TIC
DRAM cache, and one channel of 2x 3D-XPoint). We find
that channel-shared systems enable more balanced channel
bandwidth usage due to each channel having a DRAM cache.
For example, under a high DRAM cache hit-rate, a channel-shared
system would be able to utilize all channels, whereas
a dedicated-channel system would only be able to use the
half of the channels that employ DRAM caches. Such channel-shared
approaches enable up to 40% speedup compared to
the traditional dedicated-channel setups.
[Figure 14 plot: y-axis Speedup (0.8 to 1.6); x-axis workloads mcf, lbm, soplex, libq, gems, omnet, wrf, gcc, xalanc, zeus, cactus, cc twi, pr twi, mix1 to mix4, Gmean]
Figure 14: Speedup of Channel-Shared Hybrid Mem-
ory, over Dedicated-Channel Hybrid Memory. Channel-
sharing enables up to 40% speedup.
6.4 Multi-programmed Workloads
To show robustness of our proposal to multi-programmed
workloads, we conduct evaluations over a larger set of 17
mix-application workloads. Figure 15 shows that our dirty-
optimized TicToc organization provides 11% speedup across
17 mixes, with no workloads experiencing slowdown.
[Figure 15 plot: y-axis Speedup (0.9 to 1.3); x-axis workloads mix1 to mix17, Gmean; series TicToc (dirty-opt), Idealized Tag-In-SRAM]
Figure 15: Speedup of TicToc with dirty-bit optimiza-
tions, and idealized Tag-In-SRAM, on mixed workloads.
6.5 Performance Gap to DRAM-only Solution
In order to quantify the remaining performance opportunity,
we compare our TicToc DRAM cache + 3D-XPoint solution
with an expensive DRAM-only solution having the same
DRAM main memory capacity as the 3D-XPoint capacity
in our setup. Note that this DRAM-only solution will cost
substantially (4–8x) more than a hybrid DRAM+3D-XPoint
memory. Figure 16 shows performance results normalized to
TIC. TicToc’s bandwidth-efficient DRAM caching enables
3D-XPoint to perform within 13% of the expensive DRAM-
only solution.
[Figure 16 plot: y-axis Speedup (0.8 to 2.0); x-axis workloads mcf, lbm, soplex, libq, gems, omnet, wrf, gcc, xalanc, zeus, cactus, cc twi, pr twi, mix1 to mix4, Gmean; clipped bar labels 2.20, 2.44; series TicToc (dirty-opt,bypass), DRAM Only]
Figure 16: Speedup of TicToc (dirty-opt, bypassing) and
DRAM-only solution, relative to TIC cache + 3D-XPoint.
6.6 Considerations for Associativity
Our TicToc implementation uses a direct-mapped organi-
zation to avoid the latency of serialized tag and data lookup
(TOC explanation in Section 2.3). There have been works that
can avoid the serialized tag lookup for associative TIC [30]
and associative TOC [13] designs via scalable way-prediction
methods. As such, we find associativity an orthogonal issue
for our proposal; techniques such as [30] can be incorporated
into our design.
7. RELATED WORK
7.1 Line-based DRAM Caches
In our work, we utilize and combine the two major types of
line-granularity DRAM cache designs: Tag-Inside-Cacheline
(TIC) and Tag-Outside-Cacheline (TOC) approaches.
TIC designs [7, 11, 18, 30, 31, 32] organize their cache
as direct-mapped and store tag inside the cacheline, such
that one access can retrieve both tag and data. Such ap-
proaches are optimized for hits, but pay bandwidth to confirm
misses [7]. BEAR [18] proposes several enhancements to
reduce bandwidth cost of cache maintenance: we include its
DRAM Cache Presence that targets reducing write probe in
our baseline TIC design, we compare with Bandwidth-Aware
Bypass with 90%-bypass in Figure 13, but we do
not include Neighboring Tag Cache as current implementations
cannot obtain the neighboring tag for free [11]. Such
hit-latency optimized approaches have been proven effective
in industrial application with Intel Knights Landing prod-
uct [11]; as such, we perform all of our experiments with
BEAR as our baseline. We use BEAR as our TIC component
of TicToc, and improve upon TIC miss-bandwidth inefficien-
cies to enable a scalable bandwidth-efficient DRAM cache.
TOC designs [8, 9, 12, 33, 34] store tags in a separate area
of the DRAM cache and fetch them as needed. The earli-
est forms of such caches were highly associative and would
need a serial tag then data lookup [9]. Some enhancements
used tag-prefetching [33] or way-prediction [34] to avoid this
serialized tag lookup. Others used direct-mapped organiza-
tion [8,12] to avoid serialized tag lookup, with one employing
a tag cache [8] to reduce the bandwidth of tag lookup as well.
Figure 7 shows that TIMBER [8], a direct-mapped TOC de-
sign with tag cache, performs well for misses but can perform
poorly due to high bandwidth cost to update metadata. We
use TIMBER as our TOC component of TicToc, and improve
upon TOC metadata-bandwidth inefficiencies to enable a
scalable bandwidth-efficient DRAM cache.
7.2 Page-based DRAM Caches
An alternate approach to designing DRAM-caches is to
use large-granularity caches to amortize tag and metadata
overhead, in hardware [10, 13] or software [35, 36, 37, 38].
Hardware-only: Hardware-based approaches store tags
either in SRAM [10, 19] or in DRAM [13]. The Tag-In-
SRAM proposals typically use sector caching [19] to reduce
the overall tag requirements, and fit them all in megabytes
of SRAM [10]. However, the storage for these approaches
is still typically quite large, and they have the penalty of
poor cache utilization when not all lines in a page are used.
For comparison, we show what these cache organizations can
achieve with the line-granularity Tag-In-SRAM organization
in Figure 7. Our proposed TicToc achieves close to this upper
bound with much less SRAM storage (34KB vs. >20MB).
Alternatively, there are Tag-In-DRAM proposals that store
metadata in DRAM, and fetch the tags as needed [13]. These
approaches need to spend bandwidth to access and update
metadata information in a separate area of DRAM cache. As
such, these approaches often have similar bandwidth over-
heads and performance to the TOC component of TicToc.
And, they have the penalty of poor cache utilization when
not all lines in a page are used. For comparison, we show
what these cache organizations can achieve with the line-
granularity TOC organization in Figure 7. Our proposed
TicToc, as well as the baseline TIC, outperforms such TOC
organizations. Nonetheless, our dirty-tracking optimizations
are general and can be applied to improve metadata update
cost for such caches as well.
Software-supported: Software-supported DRAM cache
approaches maintain mapping and metadata information in-
side page tables [35, 36, 37, 38], and use various heuristics
to determine when to install pages. The benefit to such ap-
proaches is that they do not need to pay additional bandwidth
to access tags. The shortcomings of such an approach are
two-fold. First, the migration granularity is fixed to the size
of an OS page, which can cause overfetch problems, as well
as poor DRAM cache utilization when not all of the page
is useful. Second, such approaches require both hardware
and software support, and can be difficult to deploy without
cooperation between multiple vendors. We do not perform
comparison with such works as these approaches are out of
scope (i.e., break our design goal of OS-transparency).
7.3 Two Level Memories
Other hybrid memory approaches attempt to get the capac-
ity of both memories, and instead initiate hardware-managed
line or page swaps to enable most data to be serviced at the
lower-latency or higher-bandwidth memory [39,40,41,42,43].
These approaches have various tracking overheads and effec-
tiveness. However, we note there is a fundamental difference
from caching. On eviction of an unmodified line/page, caches
can simply drop the clean line/page – whereas, swap-based
approaches need to always write back the evicted line/page.
Such swaps incur extra writes that could otherwise have been
avoided. For our target DRAM + 3D-XPoint configuration,
these extra swapping-induced writes would cost performance,
endurance, and power when writing to write-constrained 3D-
XPoint. The added capacity benefits (3-12%) obtained from
such swapping are unlikely to make up the difference. Hence,
we do not take a swapping-based / two-level memory ap-
proach for this work. We do not perform comparison with
such works as these approaches are out of scope (i.e., break
our design goal of write-efficiency).
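The difference in eviction writes between caching and swapping can be made concrete with a toy count. This is only an illustration of the argument above; the function name and the example eviction sequence are hypothetical.

```python
def eviction_writes(evicted_dirty_flags, swap_based: bool) -> int:
    """Count writes to the slower memory caused by a sequence of evictions."""
    flags = list(evicted_dirty_flags)
    if swap_based:
        # A swap must always write the evicted line/page back.
        return len(flags)
    # A cache can silently drop clean lines and write back only dirty ones.
    return sum(1 for dirty in flags if dirty)


# Hypothetical eviction sequence: three clean lines and one dirty line.
evictions = [False, False, True, False]
print(eviction_writes(evictions, swap_based=False))  # 1
print(eviction_writes(evictions, swap_based=True))   # 4
```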
7.4 On Reducing Dirty-bit Tracking
Tracking dirty-bit or most-recent-copy of cacheline effi-
ciently with low SRAM storage costs is a known difficult
problem. Many works limit the number of lines that can be
kept dirty [25, 44], to reduce SRAM storage costs needed
to track dirty lines. Other approaches are more extreme and
make the cache clean-only by always writing through [45,46].
However, for our work, we target a DRAM + 3D-XPoint
system, which is often constrained by 3D-XPoint write bandwidth.
Such mostly-clean caching techniques, which limit
the fraction of the DRAM cache that can be dirty, hamper the
ability of the DRAM cache to act as an effective write buffer
for 3D-XPoint. This write limit can cause corresponding
degradation in performance, endurance, and power.
Our approach, on the other hand, does not impose any
limitation on which lines of the DRAM cache can be kept
dirty. Instead, we fundamentally target dirty-bit update cost
with architectural techniques. Our DRAM Cache Dirtiness
and Preemptive Dirty Marking techniques reduce over 90% of
the bandwidth cost to track dirty information, while needing
only 34KB of SRAM storage.
7.5 DRAM + NVM Hybrid Memories
There has been a long line of work on hybrid DRAM +
NVM systems [4, 5, 6]. These works typically try to use
DRAM to hide the 4-8x read latency and poor write char-
acteristics of NVM (e.g., low write bandwidth, high power
consumption, low write endurance). Our work follows on this
line of research. We develop a scalable (low 34KB SRAM
cost) and bandwidth-efficient DRAM cache design, and add
an NVM-specific Write-Aware Bypassing policy that specifically targets
hiding NVM’s poor write-relative-to-read characteristics.
8. CONCLUSION
This paper investigates bandwidth-efficient DRAM caching
for hybrid DRAM + 3D-XPoint memories. Effective DRAM
caching in front of 3D-XPoint is critical to enabling a mem-
ory system that has the apparent high-capacity of 3D-XPoint,
and the low-latency and high-write-bandwidth of DRAM.
There are currently two major approaches for DRAM cache
design: (1) a Tag-Inside-Cacheline (TIC) organization that
optimizes for hits, by storing tag next to each line such that
one access gets both tag and data, and (2) a Tag-Outside-
Cacheline (TOC) organization that optimizes for misses, by
storing tags from multiple data lines together in a tag-line
such that one access to a tag-line gets information on several
data-lines. Ideally, we would like to have the low hit-latency
of TIC designs, and the low miss-bandwidth of TOC designs.
To this end, we propose a TicToc organization that provisions
both TIC and TOC to get the hit and miss benefits of both.
However, we find that naively combining both techniques
actually performs worse than TIC individually, because one
has to pay the bandwidth cost of maintaining both metadata.
We find that the majority of update bandwidth is due to maintaining
the TOC dirty information. We propose a DRAM Cache
Dirtiness Bit that helps prune repeated dirty-bit updates for
known dirty lines. We propose a Preemptive Dirty Marking
technique that predicts which lines will be written and proactively
marks the dirty bit at install time, to help avoid even the
initial dirty-bit update for dirty lines. To support PDM, we
develop a novel PC-based Write-Predictor to aid in marking
only write-likely lines. Our evaluations on a 4GB DRAM
cache in front of 3D-XPoint show that our TicToc organiza-
tion enables 10% speedup over the baseline TIC, nearing the
14% speedup possible with an idealized DRAM cache design
with 64MB of SRAM tags, while needing only 34KB SRAM.
9. REFERENCES
[1] Intel and Micron, “A revolutionary breakthrough in memory
technology,” 2015.
[2] J. Izraelevitz, J. Yang, L. Zhang, J. Kim, X. Liu, A. Memaripour, Y. J.
Soh, Z. Wang, Y. Xu, S. R. Dulloor, J. Zhao, and S. Swanson, “Basic
performance measurements of the intel optane DC persistent memory
module,” CoRR, vol. abs/1903.05714, 2019.
[3] A. Ilkbahar, “Intel© optane™ dc persistent memory operating modes
explained,” 2018. Accessed: 2019-03-20.
[4] M. K. Qureshi, V. Srinivasan, and J. A. Rivers, “Scalable high
performance main memory system using phase-change memory
technology,” in Proceedings of the 36th Annual International
Symposium on Computer Architecture, ISCA ’09, (New York, NY,
USA), pp. 24–33, ACM, 2009.
[5] G. Dhiman, R. Ayoub, and T. Rosing, “Pdram: A hybrid pram and
dram main memory system,” in 2009 46th ACM/IEEE Design
Automation Conference, pp. 664–669, July 2009.
[6] A. Bivens, P. Dube, M. Franceschini, J. Karidis, L. Lastras, and
M. Tsao, “Architectural design for next generation heterogeneous
memory systems,” in Memory Workshop (IMW), 2010 IEEE
International, pp. 1–4, IEEE, 2010.
[7] M. K. Qureshi and G. H. Loh, “Fundamental latency trade-off in
architecting dram caches: Outperforming impractical sram-tags with a
simple and practical design,” in 2012 45th Annual IEEE/ACM
International Symposium on Microarchitecture, pp. 235–246, Dec
2012.
[8] J. Meza, J. Chang, H. Yoon, O. Mutlu, and P. Ranganathan, “Enabling
efficient and scalable hybrid memories using fine-granularity dram
cache management,” IEEE Computer Architecture Letters, vol. 11,
pp. 61–64, July 2012.
[9] G. H. Loh and M. D. Hill, “Efficiently enabling conventional block
sizes for very large die-stacked dram caches,” in Proceedings of the
44th Annual IEEE/ACM International Symposium on
Microarchitecture, MICRO-44, (New York, NY, USA), pp. 454–464,
ACM, 2011.
[10] D. Jevdjic, S. Volos, and B. Falsafi, “Die-stacked dram caches for
servers: Hit ratio, latency, or bandwidth? have it all with footprint
cache,” in Proceedings of the 40th Annual International Symposium on
Computer Architecture, ISCA ’13, (New York, NY, USA),
pp. 404–415, ACM, 2013.
[11] A. Sodani, R. Gramunt, J. Corbal, H.-S. Kim, K. Vinod,
S. Chinthamani, S. Hutsell, R. Agarwal, and Y.-C. Liu, “Knights
landing: Second-generation intel xeon phi product,” IEEE Micro,
vol. 36, pp. 34–46, Mar 2016.
[12] J. Sim, G. H. Loh, V. Sridharan, and M. O’Connor, “Resilient
die-stacked dram caches,” in Proceedings of the 40th Annual
International Symposium on Computer Architecture, ISCA ’13, (New
York, NY, USA), pp. 416–427, ACM, 2013.
[13] D. Jevdjic, G. H. Loh, C. Kaynak, and B. Falsafi, “Unison cache: A
scalable and effective die-stacked dram cache,” in Microarchitecture
(MICRO), 2014 47th Annual IEEE/ACM International Symposium on,
pp. 25–37, IEEE, 2014.
[14] JEDEC Standard, “High bandwidth memory (hbm) dram,” JESD235, 2013.
[15] JEDEC, DDR4 SPEC (JESD79-4), 2013.
[16] ArsTechnica, “Intel’s crazy-fast 3d xpoint optane memory heads for
ddr slots (but with a catch),” 2018. Accessed: 2019-01-23.
[17] M. Arafa, B. Fahim, S. Kottapalli, A. Kumar, L. P. Looi, S. Mandava,
A. Rudoff, I. M. Steiner, B. Valentine, G. Vedaraman, and S. Vora,
“Cascade lake: Next generation intel xeon scalable processor,” IEEE
Micro, vol. 39, pp. 29–36, March 2019.
[18] C. Chou, A. Jaleel, and M. K. Qureshi, “Bear: Techniques for
mitigating bandwidth bloat in gigascale dram caches,” in Proceedings
of the 42nd Annual International Symposium on Computer
Architecture, ISCA ’15, (New York, NY, USA), pp. 198–210, ACM,
2015.
[19] J. B. Rothman and A. J. Smith, “Sector cache design and performance,”
in Proceedings 8th International Symposium on Modeling, Analysis
and Simulation of Computer and Telecommunication Systems (Cat.
No.PR00728), pp. 124–133, Aug 2000.
[20] N. Chatterjee, R. Balasubramonian, M. Shevgoor, S. Pugsley, A. Udipi,
A. Shafiee, K. Sudan, M. Awasthi, and Z. Chishti, “Usimm: the utah
simulated memory module,” University of Utah, Tech. Rep, 2012.
[21] Intel, “Fact sheet: New intel architectures and technologies target
expanded market opportunities,” 2018. Accessed: 2019-03-20.
[22] H. Patil, R. Cohn, M. Charney, R. Kapoor, A. Sun, and
A. Karunanidhi, “Pinpointing representative portions of large intel
itanium programs with dynamic instrumentation,” in
Microarchitecture, 2004. MICRO-37 2004. 37th International
Symposium on, pp. 81–92, Dec 2004.
[23] J. L. Henning, “Spec cpu2006 benchmark descriptions,” SIGARCH
Comput. Archit. News, vol. 34, pp. 1–17, Sept. 2006.
[24] S. Beamer, K. Asanovic, and D. A. Patterson, “The GAP benchmark
suite,” CoRR, vol. abs/1508.03619, 2015.
[25] J. Sim, G. H. Loh, H. Kim, M. O’Connor, and M. Thottethodi, “A
mostly-clean dram cache for effective hit speculation and
self-balancing dispatch,” in Microarchitecture (MICRO), 2012 45th
Annual IEEE/ACM International Symposium on, pp. 247–257, IEEE,
2012.
[26] C.-J. Wu, A. Jaleel, W. Hasenplaugh, M. Martonosi, S. C. Steely, Jr.,
and J. Emer, “Ship: Signature-based hit predictor for high
performance caching,” in Proceedings of the 44th Annual IEEE/ACM
International Symposium on Microarchitecture, MICRO-44, (New
York, NY, USA), pp. 430–441, ACM, 2011.
[27] V. Young, C.-C. Chou, A. Jaleel, and M. Qureshi, “Ship++: Enhancing
signature-based hit predictor for improved cache performance,” in The
2nd Cache Replacement Championship (CRC-2 Workshop in ISCA
2017), 2017.
[28] M. Kharbutli and Y. Solihin, “Counter-based cache replacement and
bypassing algorithms,” IEEE Trans. Comput., vol. 57, pp. 433–447,
Apr. 2008.
[29] H. Gao and C. Wilkerson, “A dueling segmented lru replacement
algorithm with adaptive bypassing,” in JWAC 2010-1st JILP Workshop
on Computer Architecture Competitions: cache replacement
Championship, 2010.
[30] V. Young, C. Chou, A. Jaleel, and M. K. Qureshi, “Accord: Enabling
associativity for gigascale dram caches by coordinating way-install
and way-prediction,” in 2018 ACM/IEEE 45th Annual International
Symposium on Computer Architecture (ISCA), pp. 328–339, June
2018.
[31] C. Chou, A. Jaleel, and M. K. Qureshi, “Candy: Enabling coherent
dram caches for multi-node systems,” in 2016 49th Annual IEEE/ACM
International Symposium on Microarchitecture (MICRO), pp. 1–13,
Oct 2016.
[32] V. Young, P. J. Nair, and M. K. Qureshi, “Dice: Compressing dram
caches for bandwidth and capacity,” in ISCA ’17, (New York, NY,
USA), pp. 627–638, ACM, 2017.
[33] C.-C. Huang and V. Nagarajan, “Atcache: reducing dram cache latency
via a small sram tag cache,” in Proceedings of the 23rd international
conference on Parallel architectures and compilation, pp. 51–60,
ACM, 2014.
[34] Z. Wang, D. A. Jiménez, T. Zhang, G. H. Loh, and Y. Xie, “Building
a low latency, highly associative dram cache with the buffered way
predictor,” in 2016 28th International Symposium on Computer
Architecture and High Performance Computing (SBAC-PAD),
pp. 109–117, Oct 2016.
[35] Y. Lee, J. Kim, H. Jang, H. Yang, J. Kim, J. Jeong, and J. W. Lee, “A
fully associative, tagless dram cache,” in Proceedings of the 42nd
Annual International Symposium on Computer Architecture, ISCA ’15,
(New York, NY, USA), pp. 211–222, ACM, 2015.
[36] H. Jang, Y. Lee, J. Kim, Y. Kim, J. Kim, J. Jeong, and J. W. Lee,
“Efficient footprint caching for tagless dram caches,” in High
Performance Computer Architecture (HPCA), 2016 IEEE
International Symposium on, pp. 237–248, IEEE, 2016.
[37] G. H. Loh, N. Jayasena, J. Chung, S. K. Reinhardt, M. O’Connor, and
K. McGrath, “Challenges in heterogeneous die-stacked and off-chip
memory systems,” in 3rd Workshop on SoCs, Heterogeneous
Architectures and Workloads (SHAW-3), Feb 2012.
[38] X. Yu, C. J. Hughes, N. Satish, O. Mutlu, and S. Devadas, “Banshee:
Bandwidth-efficient dram caching via software/hardware cooperation,”
in Proceedings of the 50th Annual IEEE/ACM International
Symposium on Microarchitecture, MICRO-50 ’17, (New York, NY,
USA), pp. 1–14, ACM, 2017.
[39] C. Chou, A. Jaleel, and M. K. Qureshi, “Cameo: A two-level memory
organization with capacity of main memory and flexibility of
hardware-managed cache,” in Proceedings of the 47th Annual
IEEE/ACM International Symposium on Microarchitecture,
MICRO-47, (Washington, DC, USA), pp. 1–12, IEEE Computer
Society, 2014.
[40] J. Sim, A. R. Alameldeen, Z. Chishti, C. Wilkerson, and H. Kim,
“Transparent hardware management of stacked dram as part of
memory,” in Proceedings of the 47th Annual IEEE/ACM International
Symposium on Microarchitecture, MICRO-47, (Washington, DC,
USA), pp. 13–24, IEEE Computer Society, 2014.
[41] J. H. Ryoo, M. R. Meswani, A. Prodromou, and L. K. John, “Silc-fm:
Subblocked interleaved cache-like flat memory organization,” in 2017
IEEE International Symposium on High Performance Computer
Architecture (HPCA), pp. 349–360, Feb 2017.
[42] A. Prodromou, M. Meswani, N. Jayasena, G. Loh, and D. M. Tullsen,
“Mempod: A clustered architecture for efficient and scalable migration
in flat address space multi-level memories,” in 2017 IEEE
International Symposium on High Performance Computer Architecture
(HPCA), pp. 433–444, Feb 2017.
[43] A. Kokolis, “Pageseer: Using page walks to trigger page swaps in
hybrid memory systems,” 2019 IEEE International Symposium on
High Performance Computer Architecture (HPCA), pp. 596–608,
2019.
[44] C. Huang, R. Kumar, M. Elver, B. Grot, and V. Nagarajan, “C3d:
Mitigating the numa bottleneck via coherent dram caches,” in 2016
49th Annual IEEE/ACM International Symposium on
Microarchitecture (MICRO), pp. 1–12, Oct 2016.
[45] I. Singh, A. Shriraman, W. W. L. Fung, M. O’Connor, and T. M.
Aamodt, “Cache coherence for GPU architectures,” in 19th IEEE
International Symposium on High Performance Computer Architecture,
HPCA 2013, Shenzhen, China, February 23-27, 2013, 2013.
[46] V. Young, A. Jaleel, E. Bolotin, E. Ebrahimi, D. Nellans, and O. Villa,
“Combining hw/sw mechanisms to improve numa performance of
multi-gpu systems,” in MICRO ’18, October 2018.
