Tiered-Latency DRAM: Enabling Low-Latency Main Memory at Low Cost by Lee, Donghyuk et al.
Donghyuk Lee1,2 Yoongu Kim2 Vivek Seshadri3,2
Jamie Liu4,2 Lavanya Subramanian5,2 Onur Mutlu6,2
1NVIDIA Research 2Carnegie Mellon University
3Microsoft Research India 4Google 5Intel Labs 6ETH Zürich
This paper summarizes the idea of Tiered-Latency DRAM (TL-DRAM),
which was published in HPCA 2013 [73], and examines the work’s
significance and future potential. The capacity and cost-per-bit of
DRAM have historically scaled to satisfy the needs of increasingly
large and complex computer systems. However, DRAM latency has remained
almost constant, making memory latency the performance bottleneck in
today’s systems. We observe that the high access latency is not
intrinsic to DRAM, but a trade-off made to decrease the cost per bit.
To mitigate the high area overhead of DRAM sensing structures,
commodity DRAMs connect many DRAM cells to each sense amplifier
through a wire called a bitline. These bitlines have a high parasitic
capacitance due to their long length, and this bitline capacitance is
the dominant source of DRAM latency. Specialized low-latency DRAMs use
shorter bitlines with fewer cells, but have a higher cost-per-bit due
to greater sense amplifier area overhead.
To achieve both low latency and low cost per bit, we introduce
Tiered-Latency DRAM (TL-DRAM). In TL-DRAM, each long bitline is split
into two shorter segments by an isolation transistor, allowing one of
the two segments to be accessed with the latency of a short-bitline
DRAM without incurring a high cost per bit. We propose mechanisms that
use the low-latency segment as a hardware-managed or software-managed
cache. Our evaluations show that our proposed mechanisms improve both
performance and energy efficiency for both single-core and
multiprogrammed workloads.
Tiered-Latency DRAM has inspired several other works on reducing DRAM
latency with little to no architectural modification [20, 21, 22, 24,
37, 38, 68, 72, 116, 117, 118].
1. Problem: High DRAM Latency
Primarily due to its low cost per bit, DRAM has long been
the substrate of choice for architecting main memory sub-
systems. In fact, DRAM’s cost per bit has been decreasing
at a rapid rate as DRAM process technology scales to inte-
grate ever more DRAM cells into the same die area. As a
result, each successive generation of DRAM has enabled in-
creasingly larger-capacity main memory subsystems at low
cost.
In stark contrast to the continued scaling of cost per bit, the
latency of DRAM has remained almost constant. During the
same 11-year interval in which DRAM’s cost per bit decreased
by a factor of 16, DRAM latency (as measured by the tRCD and
tRC timing constraints; see footnote 1) decreased by only 30.5% and 26.3% [6,
47], respectively, as shown in Figure 1. From the perspective
of the processor, an access to DRAM takes hundreds of cycles
– time during which the processor may be stalled, waiting for
DRAM [3, 34, 48, 92, 93, 96]. This wasted time due to stalling
on DRAM leads to large performance degradation.
[Figure 1 plot: DRAM capacity (Gb) and latency (ns; tRCD and tRC) for
SDR-200 (2000), DDR-400 (2003), DDR2-800 (2006), DDR3-1066 (2008), and
DDR3-1333 (2011).]
Figure 1: Change in DRAM capacity and latency over time [6,
47,100,111]. Reproduced from [73].
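To make the gap between the two trends concrete, the short sketch below converts the figures quoted above (a 16x drop in cost per bit versus 30.5% and 26.3% drops in tRCD and tRC over the same 11 years) into comparable improvement factors. The compound annual rate is an illustrative assumption about how to spread each change over the interval; it is not a figure from the paper.

```python
# Figures quoted in the text: over the same 11-year interval, cost per bit
# fell by a factor of 16, while tRCD and tRC fell by 30.5% and 26.3%.
YEARS = 11
COST_PER_BIT_FACTOR = 16.0
LATENCY_REDUCTION = {"tRCD": 0.305, "tRC": 0.263}


def improvement_factor(fractional_reduction: float) -> float:
    """Convert a fractional reduction (e.g. 0.305) into an 'N times better' factor."""
    return 1.0 / (1.0 - fractional_reduction)


def annualized_rate(factor: float, years: int) -> float:
    """Compound annual improvement implied by a total factor (illustrative assumption)."""
    return factor ** (1.0 / years) - 1.0


latency_factors = {name: improvement_factor(r) for name, r in LATENCY_REDUCTION.items()}

if __name__ == "__main__":
    print(f"cost per bit: {COST_PER_BIT_FACTOR:.2f}x total, "
          f"{annualized_rate(COST_PER_BIT_FACTOR, YEARS):.1%} per year")
    for name, factor in latency_factors.items():
        print(f"{name}: {factor:.2f}x total, "
              f"{annualized_rate(factor, YEARS):.1%} per year")
```

Both latency factors stay below 1.5x, compared with 16x for cost per bit.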
2. Key Observations and Our Goal
Bitline: Dominant Source of Latency. In DRAM, each bit is represented
as electrical charge in a capacitor-based cell. The small size of this
capacitor necessitates the use of an auxiliary structure, called a
sense amplifier, to (1) detect the small amount of charge held by the
cell and (2) amplify it to a full digital logic value. A sense
amplifier is approximately one hundred times larger than a cell [107].
To amortize this large size, each sense amplifier is connected to many
DRAM cells through a wire called a bitline (see footnote 2).
Every bitline has an associated parasitic capacitance, whose value is
proportional to the length of the bitline. Unfortunately, the
parasitic capacitance slows down DRAM operation for two reasons.
First, it increases the latency of the sense amplifiers. When the
parasitic capacitance is large, a cell cannot quickly create a voltage
perturbation on the bitline that can be easily detected by the sense
amplifier. Second, the capacitance increases the latency of charging
and precharging the bitlines.
Footnote 1: The overall DRAM latency can be decomposed into individual DRAM
timing constraints. Two of the most important timing constraints are tRCD
(row-to-column delay) and tRC (row-cycle time).
Footnote 2: We refer the reader to our prior works for a detailed background on
DRAM architecture and operation [21, 22, 23, 24, 37, 38, 54, 56, 57, 58, 59, 60, 68,
69, 71, 72, 73, 75, 76, 99, 103, 116, 117].
arXiv:1805.03048v1 [cs.AR] 4 May 2018
Although the cell and the bitline must be restored to their
quiescent voltages during and after an access to a cell, such a
procedure takes much longer when the parasitic capacitance
of the bitline is large. Due to these two reasons, and based on
a detailed latency breakdown discussed in Section 3.1 of our
HPCA 2013 paper [73], we conclude that long bitlines are the
dominant source of DRAM latency [44, 72, 73, 90, 91, 122].
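The first of the two reasons above can be illustrated with the standard charge-sharing relation: when a cell storing VDD is connected to a bitline precharged to VDD/2, the resulting bitline perturbation is (VDD/2) * C_cell / (C_cell + C_bitline). The sketch below is a minimal illustration of that relation only. The supply voltage and both capacitance values are hypothetical placeholders, not parameters from the paper, and the bitline capacitance is simply taken as proportional to the number of cells, as the text states.

```python
# All three constants are illustrative assumptions, not values from the paper.
VDD_VOLTS = 1.5                # assumed supply voltage
C_CELL_FF = 24.0               # assumed cell capacitance, in femtofarads
C_BITLINE_PER_CELL_FF = 0.2    # assumed bitline capacitance added per attached cell


def bitline_capacitance_ff(cells_on_bitline: int) -> float:
    """Parasitic bitline capacitance, modeled as proportional to bitline length."""
    if cells_on_bitline < 1:
        raise ValueError("a bitline needs at least one cell")
    return cells_on_bitline * C_BITLINE_PER_CELL_FF


def sensing_perturbation_volts(cells_on_bitline: int) -> float:
    """Voltage change on a bitline precharged to VDD/2 after sharing charge
    with a cell that stores VDD (ideal charge sharing, no leakage)."""
    c_bitline = bitline_capacitance_ff(cells_on_bitline)
    return (VDD_VOLTS / 2.0) * C_CELL_FF / (C_CELL_FF + c_bitline)


if __name__ == "__main__":
    for cells in (32, 128, 512):
        print(f"{cells:3d} cells per bitline: "
              f"perturbation = {sensing_perturbation_volts(cells) * 1000:.0f} mV")
```

Under these assumed values, a shorter bitline gives the sense amplifier a larger initial signal to detect, which is the qualitative effect the text describes.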
Latency vs. Cost Trade-Off. The bitline length is a key design
parameter that exposes the important trade-off between latency and die
size (cost). Short bitlines (i.e., bitlines connected to only a few
cells) constitute a small electrical load (parasitic capacitance),
which leads to low latency. However, they require more sense
amplifiers for a given DRAM capacity (Figure 2a), which leads to a
large die size. In contrast, long bitlines have high latency and a
small die size (Figure 2b). As a result, neither of these two
approaches can optimize for both latency and cost per bit.
[Figure 2 panels: (a) Latency-optimized architecture (short bitlines,
each with its own sense amplifiers); (b) Cost-optimized architecture
(one long bitline per set of sense amplifiers); (c) Our proposed
architecture (a long bitline split by an isolation transistor).]
Figure 2: DRAM latency and cost optimization, and our pro-
posal (TL-DRAM). Reproduced from [73].
Figure 3 shows the trade-off between DRAM latency and die size by
plotting the latency (tRCD and tRC) and the die size for different
values of cells per bitline. Existing DRAM architectures are either
(1) optimized for die size (e.g., commodity DDR3 [86, 111]) and are
thus low cost but high latency; or (2) optimized for latency (e.g.,
RLDRAM [85], FCRAM [112]) and are thus low latency but (very) high
cost.
[Figure 3 plot: normalized die size (lower is cheaper) versus latency
in ns (lower is faster), for tRCD and tRC. Points are labeled with
cells-per-bitline and die size in mm2: 16 (492), 32 (276), 64 (168),
128 (114), 256 (87), 512 (73.5). RLDRAM, FCRAM, and DDR3 are marked on
both curves.]
Figure 3: Bitline length: latency vs. die size. Reproduced
from [73].
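The die sizes labeled in Figure 3 can be checked against a simple two-term model, die size = fixed + per_bitline / n, where n is the number of cells per bitline. The sketch below fits the two constants from the first and last labeled points and then evaluates the model at every point. Reading the fixed term as area that does not depend on bitline length and the second term as sense amplifier area that is amortized over n cells is an interpretation offered here, not a statement from the paper.

```python
# (cells per bitline, die size in mm^2), as labeled in Figure 3.
FIGURE3_POINTS = [(16, 492.0), (32, 276.0), (64, 168.0),
                  (128, 114.0), (256, 87.0), (512, 73.5)]


def fit_two_term_model(points):
    """Solve die = fixed + per_bitline / n using the first and last points."""
    (n_first, die_first), (n_last, die_last) = points[0], points[-1]
    per_bitline = (die_first - die_last) / (1.0 / n_first - 1.0 / n_last)
    fixed = die_last - per_bitline / n_last
    return fixed, per_bitline


def die_size_mm2(cells_per_bitline, fixed, per_bitline):
    """Die size predicted by the two-term model."""
    return fixed + per_bitline / cells_per_bitline


FIXED_MM2, PER_BITLINE_MM2 = fit_two_term_model(FIGURE3_POINTS)

if __name__ == "__main__":
    print(f"fixed = {FIXED_MM2:.1f} mm^2, per_bitline = {PER_BITLINE_MM2:.1f} mm^2")
    for n, labeled in FIGURE3_POINTS:
        predicted = die_size_mm2(n, FIXED_MM2, PER_BITLINE_MM2)
        print(f"n = {n:3d}: labeled {labeled:6.1f}, model {predicted:6.1f}")
```

With these six labels the fit gives fixed = 60 and per_bitline = 6912, and the model reproduces every labeled die size. The ratio between the 32-cell and 512-cell die sizes is about 3.76, which agrees with the normalized short-bitline die size in Table 1.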
The goal of our HPCA 2013 paper [73] is to design a new
DRAM architecture to approximate the best of both worlds
(i.e., low latency and low cost), based on our key observation
that long bitlines are the dominant source of DRAM latency.
3. Tiered-Latency DRAM
To achieve the latency advantage of short bitlines and the
cost advantage of long bitlines, we propose the Tiered-Latency
DRAM (TL-DRAM) architecture, which is shown in Figures 2c
and 4a. The key idea of TL-DRAM is to divide the long bitline
into two shorter segments using an isolation transistor: the
near segment (connected directly to the sense amplifier) and
the far segment (connected through the isolation transistor).
[Figure 4 panels: (a) Organization (far segment, isolation transistor,
near segment, sense amplifiers); (b) Near segment access (isolation
transistor off; the bitline sees only C_NEAR); (c) Far segment access
(isolation transistor on; the bitline sees C_NEAR and C_FAR).]
Figure 4: TL-DRAM: accessing the near segment and the far
segment. Adapted from [73].
The primary role of the isolation transistor is to electrically
decouple the two segments from each other. This changes the effective
bitline length (and also the effective bitline capacitance) as seen by
the cell and sense amplifier. Correspondingly, the latency to access a
cell also changes, albeit differently depending on whether the cell is
in the near or the far segment.
When accessing a cell in the near segment, the isolation transistor is
turned off, disconnecting the far segment (Figure 4b). Since the cell
and the sense amplifier see only the reduced bitline capacitance of
the shortened near segment, they can drive the bitline voltage more
easily. As a result, the bitline voltage is restored more quickly,
and, thus, the latency (tRC) for the near segment is significantly
reduced. On the other hand, when accessing a cell in the far segment,
the isolation transistor is turned on to connect the entire length of
the bitline to the sense amplifier. In this case, the isolation
transistor acts like a resistor inserted between the two segments
(Figure 4c) and limits how quickly charge flows to the far segment.
Because the far segment capacitance is charged more slowly, it takes
longer for the far segment voltage to be restored, and, thus, the
latency (tRC) is increased for cells in the far segment.
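The access rule described above can be written down as a small decision function: the row being accessed determines the segment, the state of the isolation transistor, and how much of the bitline the sense amplifier has to drive. This is a minimal sketch under a hypothetical row numbering in which row 0 is the row closest to the sense amplifier. The names `SegmentAccess` and `plan_access` are illustrative and do not come from the paper.

```python
from dataclasses import dataclass

CELLS_PER_BITLINE = 512  # bitline length used in the text


@dataclass(frozen=True)
class SegmentAccess:
    segment: str               # "near" or "far"
    isolation_on: bool         # state of the isolation transistor during the access
    bitline_cells_driven: int  # cells' worth of bitline seen by the sense amplifier


def plan_access(row: int, near_cells: int,
                total_cells: int = CELLS_PER_BITLINE) -> SegmentAccess:
    """Decide how a row is accessed, assuming rows [0, near_cells) sit next to
    the sense amplifier (hypothetical numbering)."""
    if not 1 <= near_cells < total_cells:
        raise ValueError("near segment must hold between 1 and total_cells - 1 cells")
    if not 0 <= row < total_cells:
        raise ValueError("row index out of range")
    if row < near_cells:
        # Isolation transistor off: only the near segment loads the bitline.
        return SegmentAccess("near", False, near_cells)
    # Isolation transistor on: the whole bitline is connected, and the far
    # segment is charged through the transistor's resistance.
    return SegmentAccess("far", True, total_cells)


if __name__ == "__main__":
    print(plan_access(row=5, near_cells=32))
    print(plan_access(row=200, near_cells=32))
```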
Sensitivity to Segment Length. The lengths of the two segments are
determined by where the isolation transistor is placed on the bitline.
Assuming that the number of cells per bitline is fixed at 512 cells,
the near segment length can range from as short as a single cell to as
long as 511 cells. We perform circuit-level simulations to determine
how the latency of each segment changes with the number of cells in
the segment. Figures 5a and 5b plot the latencies of the near and far
segments as a function of their length, respectively. For reference,
the rightmost bars in each figure are the latencies of an unsegmented
long bitline whose length is 512 cells. From these figures, we draw
three conclusions. First, the shorter the near segment, the lower its
latencies (tRCD and tRC). This is expected since a shorter near
segment has a lower effective bitline capacitance, allowing it to be
driven to target voltages more quickly. Second, the longer the far
segment, the lower the far segment’s tRCD. Recall from our previous
discussion that the far segment’s tRCD depends on how quickly the near
segment (not the far segment) can be driven. A longer far segment
implies a shorter near segment (lower capacitance), which is why tRCD
decreases for the far segment. Third, the shorter the far segment, the
smaller its tRC. The far segment’s tRC is determined by how quickly it
reaches the full voltage (VDD or 0). Regardless of the length of the
far segment or the near segment, the current that trickles into it
through the isolation transistor does not change significantly.
Therefore, a shorter far segment (lower capacitance) reaches the full
voltage more quickly.
[Figure 5 plots: tRCD and tRC latency (ns) versus segment length in
cells. (a) Cell in near segment: near segment lengths 1, 2, 4, 8, 16,
32, 64, 128, 256, plus a 512-cell reference. (b) Cell in far segment:
far segment lengths 511, 510, 508, 504, 496, 480, 448, 384, 256, plus
a 512-cell reference.]
Figure 5: Latency analysis. Reproduced from [73].
[Figure 6 plots: bitline voltage (0.50VDD to VDD) over 0 to 40 ns
during activation, for the near segment (TL-DRAM), far segment
(TL-DRAM), long bitline, and short bitline. (a) Cell in near segment
(128 cells), marking tRCDnear and tRASnear. (b) Cell in far segment
(384 cells), marking tRCDfar and tRASfar.]
Figure 6: Activation: bitline voltage. Reproduced from [73].
[Figure 7 plots: bitline voltage (0.50VDD to VDD) over 0 to 20 ns
during precharging, for the near segment, far segment, long bitline,
and short bitline. (a) Cell in near segment, marking tRPnear. (b) Cell
in far segment, marking tRPfar.]
Figure 7: Precharging. Reproduced from [73].
Latency Analysis (Circuit Evaluation). We model TL-DRAM in detail
using SPICE simulations. Simulation parameters are mostly derived from
a publicly available 55nm DDR3 2Gb process technology file [107] which
includes information such as cell and bitline capacitances and
resistances, physical floorplanning, and transistor dimensions.
Transistor device characteristics were derived from [98] and scaled to
agree with [107]. Figures 6 and 7 show the bitline voltages during
activation and precharging, respectively. The x-axis origin (time 0)
in the two figures corresponds to when the subarray receives the
ACTIVATE or PRECHARGE command, respectively. In addition to the
voltages of the segmented bitline (near and far segments), the figures
also show the voltages of two unsegmented bitlines (short and long)
for reference.
First, during an access to a cell in the near segment (Figure 6a), the
far segment is disconnected and is floating (hence its voltage is not
shown). The bitline starts at 1/2 VDD. Due to the reduced bitline
capacitance of the near segment, its voltage increases almost as
quickly as the voltage of a short bitline (the two curves are
overlapped) during sensing & amplification. Since the near segment
voltage reaches 0.75VDD and VDD (the threshold and restored states,
respectively) quickly, its tRCD and tRAS, respectively, are
significantly reduced compared to a long bitline. Second, during an
access to a cell in the far segment (Figure 6b), we can indeed verify
that the voltages of the near and the far segments increase at
different rates due to the resistance of the isolation transistor, as
previously explained. Compared to a long bitline, while the near
segment voltage reaches 0.75VDD more quickly, the far segment voltage
reaches VDD more slowly. As a result, tRCD for the far segment is
reduced while its tRAS is increased.
While precharging the bitline after accessing a cell in the
near segment (Figure 7a), the near segment reaches 0.5VDD
quickly due to the smaller capacitance, almost as quickly as
the short bitline (the two curves are overlapped). On the other
hand, precharging the bitline after accessing a cell in the far
segment (Figure 7b) takes longer compared to the long-bitline
baseline. As a result, tRP is reduced for the near segment and
increased for the far segment.
Summary (Latency, Power, and Die-Area). Table 1 summarizes the
latency, power, and die area characteristics of TL-DRAM compared to
short-bitline and long-bitline DRAMs, estimated using circuit-level
SPICE simulation [98] and power/area models from Rambus [107].
Compared to commodity DRAM (long bitlines), which incurs high latency
(tRC) for all cells, TL-DRAM offers significantly reduced latency
(tRC) for cells in the near segment, while increasing the latency for
cells in the far segment due to the additional resistance of the
isolation transistor. In DRAM, a large fraction of the power is
consumed by the bitlines. Since the near segment in TL-DRAM has a
lower capacitance, it also consumes less power. On the other hand,
accessing the far segment requires toggling the isolation transistors,
leading to increased power consumption. Mainly due to additional
isolation transistors, TL-DRAM increases die area by 3% compared to
commodity DRAM. Section 4 of our HPCA 2013 paper [73] includes
detailed circuit-level analyses of TL-DRAM, along with detailed area,
latency, and power estimations.
                            Short Bitline   Long Bitline    Segmented Bitline (Figure 2c)
                            (Figure 2a)     (Figure 2b)
                            Unsegmented     Unsegmented     Near            Far
Length (Cells)              32              512             32              480
Latency (tRC)               Low (23.1ns)    High (52.5ns)   Low (23.1ns)    Higher (65.8ns)
Normalized Power            Low (0.51)      High (1.00)     Low (0.51)      Higher (1.49)
Normalized Die-Size (Cost)  High (3.76)     Lower (1.00)    Low (1.03), near and far combined
Table 1: Latency, power, and die area comparison. Adapted
from [73].
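A quick way to read Table 1 is to ask what fraction of accesses must be served by the near segment before TL-DRAM's average tRC drops below that of an unsegmented long bitline. The sketch below computes a weighted average of the near and far tRC values from the table. It is a first-order estimate only: it assumes every access pays a full tRC and it ignores the cost of moving data between segments, row-buffer hits, and queuing effects.

```python
# tRC values taken from Table 1 (32-cell near segment, 480-cell far segment).
T_RC_NEAR_NS = 23.1
T_RC_FAR_NS = 65.8
T_RC_LONG_BITLINE_NS = 52.5


def average_trc_ns(near_fraction: float) -> float:
    """Average tRC when `near_fraction` of accesses go to the near segment."""
    if not 0.0 <= near_fraction <= 1.0:
        raise ValueError("near_fraction must be between 0 and 1")
    return near_fraction * T_RC_NEAR_NS + (1.0 - near_fraction) * T_RC_FAR_NS


def break_even_near_fraction() -> float:
    """Near-segment access fraction at which TL-DRAM matches the long bitline."""
    return (T_RC_FAR_NS - T_RC_LONG_BITLINE_NS) / (T_RC_FAR_NS - T_RC_NEAR_NS)


if __name__ == "__main__":
    print(f"break-even near fraction: {break_even_near_fraction():.1%}")
    for fraction in (0.0, 0.5, 0.9, 1.0):
        print(f"near fraction {fraction:.0%}: average tRC = {average_trc_ns(fraction):.1f} ns")
```

Under this simplified model the break-even point is roughly 31% of accesses, and a 90% near-segment fraction gives an average tRC of about 27.4 ns.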
4. Leveraging TL-DRAM
TL-DRAM enables the design of many new memory man-
agement policies that exploit the asymmetric latency charac-
teristics of the near and the far segments. Section 5 of our
HPCA 2013 paper [73] describes four mechanisms that take
advantage of TL-DRAM. Here, we describe two approaches
in particular.
In the rst approach, the memory controller uses the near
segment as a hardware-managed cache for the far segment.
In our HPCA 2013 paper [73], we discuss three policies for
managing the near segment cache. The three policies dier
in deciding when a row in the far segment is cached into the
near segment and when the row is evicted. In addition, we
propose a new data transfer mechanism (Inter-Segment Data
Transfer) that eciently migrates data between the segments
by taking advantage of the fact that the bitline is a bus con-
nected to the cells in both segments. By using this technique,
the data from the source row can be transferred to the destina-
tion row over the bitlines at very low latency (additional 4ns
over tRC ).3 Furthermore, this Inter-Segment Data Transfer
happens exclusively within a DRAM bank without utilizing
the DRAM channel, allowing concurrent accesses to other
banks.
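The hardware-managed cache can be sketched as a small row-granularity cache kept by the memory controller. The example below uses plain LRU replacement, which is a stand-in chosen for brevity and is not one of the three policies from the paper. It also assumes that a miss is served from the far segment and pays the extra 4 ns of an inter-segment transfer on top of the far-segment tRC from Table 1, and it ignores write-back of evicted rows. The class name and latency accounting are illustrative.

```python
from collections import OrderedDict

T_RC_NEAR_NS = 23.1       # Table 1, near segment
T_RC_FAR_NS = 65.8        # Table 1, far segment
TRANSFER_EXTRA_NS = 4.0   # "additional 4ns over tRC" for an inter-segment transfer


class NearSegmentCache:
    """LRU cache of far-segment rows held in the near segment (illustrative)."""

    def __init__(self, near_rows: int = 32):
        if near_rows < 1:
            raise ValueError("near segment needs at least one row")
        self.near_rows = near_rows
        self._cached = OrderedDict()  # far row id -> None, least recent first
        self.hits = 0
        self.misses = 0

    def access(self, far_row: int) -> float:
        """Return the modeled latency (ns) of accessing `far_row`."""
        if far_row in self._cached:
            self._cached.move_to_end(far_row)  # mark as most recently used
            self.hits += 1
            return T_RC_NEAR_NS
        self.misses += 1
        if len(self._cached) >= self.near_rows:
            self._cached.popitem(last=False)   # evict the least recently used row
        self._cached[far_row] = None
        # Assumption: the miss is served from the far segment and the row is
        # copied into the near segment with one inter-segment transfer.
        return T_RC_FAR_NS + TRANSFER_EXTRA_NS


if __name__ == "__main__":
    cache = NearSegmentCache(near_rows=2)
    trace = [10, 10, 11, 10, 12, 11]
    total_ns = sum(cache.access(row) for row in trace)
    print(f"hits={cache.hits} misses={cache.misses} total={total_ns:.1f} ns")
```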
In the second approach, the near segment capacity is exposed to the
OS, enabling the OS to use the full DRAM capacity. We propose two
concrete mechanisms, one where the memory controller uses an
additional layer of indirection to map frequently-accessed pages to
the near segment, and another where the OS uses static/dynamic
profiling to directly map frequently-accessed pages to the near
segment.
Footnote 3: A later work, RowClone [116], takes advantage of this property to
enable bulk copy and initialization completely within DRAM.
In both approaches, the accesses to pages that are mapped to the near
segment are served faster and with lower power than in conventional
DRAM, resulting in improved system performance and energy efficiency.
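The profiling-based variant of the second approach can be illustrated with a static profile: count accesses per page in a trace and place the most frequently accessed pages in the near segment. This is a minimal sketch of that idea only. The function names, the trace format (a list of page numbers), and the tie-breaking rule are assumptions made for illustration, not the paper's mechanism.

```python
from collections import Counter


def pick_near_segment_pages(access_trace, near_capacity_pages):
    """Return the set of pages to map to the near segment: the most frequently
    accessed pages in the trace, ties broken by lower page number."""
    if near_capacity_pages < 0:
        raise ValueError("capacity must be non-negative")
    counts = Counter(access_trace)
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return {page for page, _ in ranked[:near_capacity_pages]}


def near_access_fraction(access_trace, near_pages):
    """Fraction of accesses in the trace that fall on pages in `near_pages`."""
    if not access_trace:
        return 0.0
    served_near = sum(1 for page in access_trace if page in near_pages)
    return served_near / len(access_trace)


if __name__ == "__main__":
    trace = [1, 1, 1, 2, 2, 3, 4, 4, 4, 4]
    near_pages = pick_near_segment_pages(trace, near_capacity_pages=2)
    print(sorted(near_pages), near_access_fraction(trace, near_pages))
```

A dynamic variant would recompute the profile periodically and migrate pages as the hot set changes.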
We refer the reader to Section 5 of our HPCA 2013 pa-
per [73] for a full description of use cases for TL-DRAM. Note
that a very wide variety of techniques developed for cache
management [105,115,119,120,132] can be adopted to manage
the near segment in TL-DRAM.
5. Performance and Power Evaluation
Section 8 of our HPCA 2013 paper [73] provides a detailed evaluation
of all of the above approaches to leverage TL-DRAM. Here, we present
the evaluation results for only the first approach, in which the near
segment is used as a hardware-managed cache managed under our best
policy (Benefit-Based Caching), to demonstrate the advantages of our
TL-DRAM substrate.
Methodology. To evaluate our mechanism, we use Ra-
mulator [56, 110], an open-source DRAM simulator, which is
integrated into an in-house processor simulator. The released
version of Ramulator [110] provides a model for TL-DRAM,
which we hope future works use and build upon. A detailed
methodology can be found in Section 7 of our HPCA 2013
paper [73].
Performance & Power Analysis. Figure 8 shows the average performance
improvement and power efficiency of our proposed mechanism over the
baseline with conventional DRAM, on 1-, 2- and 4-core systems. As
described in Section 3, the access latency and power consumption are
significantly lower for near segment accesses, but higher for far
segment accesses, compared to accesses in a conventional DRAM. We
observe that a large fraction (over 90% on average) of requests hit in
the rows cached in the near segment, thereby accessing the near
segment with low latency and low power consumption. As a result,
TL-DRAM achieves significant performance improvements of
12.8%/12.3%/11.0%, and power savings of 23.6%/26.4%/28.6% in
1-/2-/4-core systems, respectively.
[Figure 8 plots: (a) IPC improvement (0% to 15%) and (b) power
reduction (0% to 30%), each versus core count (number of channels):
1 (1-ch), 2 (2-ch), 4 (4-ch).]
Figure 8: IPC improvement and power consumption of TL-
DRAM. Adapted from [73].
Sensitivity to Near Segment Capacity. The number of rows in the near
segment presents a trade-off, since increasing the near segment’s size
increases its capacity but also increases its access latency. Figure 9
shows the performance improvement of our proposed mechanisms over the
baseline as we vary the near segment size. Initially, performance
improves as the number of rows in the near segment increases, since
more data can be cached. However, increasing the number of rows in the
near segment beyond 32 reduces the performance benefit due to the
increased capacitance and hence the higher near segment access
latencies.
Figure 9: Effect of varying near segment capacity. Reproduced
from [73].
Other Results. In our HPCA 2013 paper [73], we pro-
vide a detailed analysis of how timing parameters and power
consumption vary when varying the near segment length
(Sections 4 and 6.3 of [73], respectively). We also provide a
comprehensive evaluation of the mechanisms we build on
top of the TL-DRAM substrate for both single- and multi-core
systems (Section 8 of [73]).
6. Related Work
To our knowledge, our HPCA 2013 paper [73] is the first to i) enable
latency heterogeneity in DRAM without significantly increasing the
DRAM cost per bit, and ii) propose hardware/software mechanisms that
leverage this latency heterogeneity to improve system performance. We
make the following major contributions.
A Cost-Efficient Low-Latency DRAM. Based on the key observation that
long internal wires (bitlines) are the dominant source of DRAM
latency, our HPCA 2013 paper [73] proposes a new DRAM architecture
called Tiered-Latency DRAM (TL-DRAM). To our knowledge this is the
first work to enable low-latency DRAM without significantly increasing
the DRAM cost per bit. By adding a single isolation transistor to each
bitline, we carve out a region within a DRAM chip, called the near
segment, which is fast and energy-efficient. This comes at a modest
overhead of 3% increase in DRAM die-area. While there are two prior
approaches to reduce DRAM latency (using short bitlines [85, 112],
adding an SRAM cache in DRAM [32, 36, 39, 142]), both of these
approaches significantly increase die-area due to additional sense
amplifiers or additional area for an SRAM cache, as we evaluate in our
full paper [73]. Compared to these prior approaches, TL-DRAM is a much
more cost-effective architecture for achieving low latency.
There are many recent works that reduce overall memory
access latency by modifying DRAM, the DRAM-controller
interface, and DRAM controllers. These works enable more
parallelism and bandwidth [22, 60, 71, 116], reduce refresh
counts [50, 51, 52, 53, 75, 76, 103, 134], accelerate bulk opera-
tions [23,114,116,117,118], accelerate computation in the logic
layer of 3D-stacked DRAM [1,2,7,8,33,35,40,41,55,77,101,141],
enable better communication between CPU and other devices
through DRAM [69], leverage process variation and tempera-
ture dependency in DRAM [20, 21, 24, 70, 72], leverage design-
induced variation in DRAM [68], leverage DRAM access pat-
terns [37, 38, 123], reduce write-related latencies by better
designing DRAM and DRAM control policies [26,66,113], and
reduce overall queuing latencies in DRAM by better schedul-
ing memory requests [29, 30, 31, 34, 42, 43, 49, 58, 59, 65, 87, 88,
89, 94, 95, 121, 126, 127, 133]. Our proposal is orthogonal to all
of these approaches and can be applied in conjunction with
them to achieve higher latency and energy benefits.
Inter-Segment Data Transfer. By implementing latency heterogeneity
within a DRAM subarray, TL-DRAM enables efficient data transfer
between the fast and slow segments by utilizing the bitlines as a wide
bus. This mechanism takes advantage of the fact that both the source
and destination cells share the same bitlines. Furthermore, this
inter-segment migration happens only within a DRAM bank and does not
utilize the DRAM channel, thereby allowing concurrent accesses to
other banks over the channel. This inter-segment data transfer enables
fast and efficient movement of data within DRAM, which in turn enables
efficient ways of taking advantage of latency heterogeneity.
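The idea that shared bitlines act as a wide bus can be shown with a small behavioral model. In the sketch below, activating a row while the bitlines are precharged latches that row into the sense amplifiers, and activating a second row while the sense amplifiers are still driving the bitlines overwrites the second row with the latched data. The two-activation command sequence is an assumption about how such a transfer could be issued. The model captures data movement only and says nothing about timing or circuit behavior.

```python
class Subarray:
    """Behavioral model of rows that share one set of bitlines and sense amplifiers."""

    def __init__(self, rows: int, row_bits: int):
        self.cells = [[0] * row_bits for _ in range(rows)]
        self.sense_amps = None  # None means the bitlines are precharged

    def activate(self, row: int) -> None:
        if self.sense_amps is None:
            # Normal activation: the sense amplifiers sense and latch the row.
            self.sense_amps = list(self.cells[row])
        else:
            # The bitlines are already driven to full voltage by the sense
            # amplifiers, so the newly connected cells take on the latched data.
            self.cells[row] = list(self.sense_amps)

    def precharge(self) -> None:
        self.sense_amps = None


def inter_segment_transfer(subarray: Subarray, src_row: int, dst_row: int) -> None:
    """Copy src_row to dst_row over the bitlines, without using the DRAM channel
    (assumed sequence: precharge, activate source, activate destination, precharge)."""
    subarray.precharge()
    subarray.activate(src_row)
    subarray.activate(dst_row)
    subarray.precharge()


if __name__ == "__main__":
    sub = Subarray(rows=4, row_bits=8)
    sub.cells[3] = [1, 0, 1, 1, 0, 0, 1, 0]  # a row in the far segment
    inter_segment_transfer(sub, src_row=3, dst_row=0)
    print(sub.cells[0])
```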
Other works that leverage latency heterogeneity in DRAM do not usually
provide any efficient mechanism for migrating data between segments of
different latency. For example, Son et al. [124] propose a low-latency
DRAM architecture that has different fast (short bitline) and slow
(long bitline) subarrays in DRAM. This approach provides a significant
benefit only if latency-critical data is already allocated to the
low-latency regions (the low-latency subarrays). Therefore, the
overall memory system performance is very sensitive to the page
placement policy, and the system cannot easily adapt to changes in the
access latency of pages. In contrast, our new inter-segment data
transfer mechanism enables efficient relocation of pages, leading to
efficient dynamic page placement and relocation based on the
dynamically determined latency criticality of each page. Several more
recent works [23, 114, 116, 117] take advantage of our concept of an
inter-segment data transfer mechanism to perform page
copy/initialization and bulk bitwise operations completely within a
DRAM chip.
7. Potential Long-Term Impact
Tolerating High DRAM Latency by Enabling New Layers in the Memory
Hierarchy. Today, there is a large latency cliff between the on-chip
last level cache and off-chip DRAM, leading to a large performance
fall-off when applications start missing in the last level cache. By
introducing an additional fast layer (the near segment) within the
DRAM itself, TL-DRAM smooths this latency cliff.
Note that many recent works add a DRAM cache or create heterogeneous
main memories [25, 28, 62, 63, 74, 81, 82, 83, 102, 106, 108, 109,
138, 140] to smooth the latency cliff between the last level cache and
a longer-latency non-volatile main memory, e.g., phase-change memory
[62, 63, 64, 83, 84, 104, 106, 137, 139], STT-MRAM [61, 83, 97, 135],
or RRAM/memristors [27, 125, 136], or to exploit the advantages of
multiple different types of memories to optimize for multiple metrics.
Our approach is similar at a high level (i.e., to reduce the latency
cliff at low cost by taking advantage of heterogeneity), yet we
introduce the new low-latency layer within DRAM itself instead of
adding a completely separate device. Tiered-Latency DRAM can also be
used as a fast DRAM cache.
Applicability to Future Memory Devices. We show the benefits of
TL-DRAM’s asymmetric latencies. Considering that most memory devices
adopt a similar cell organization (i.e., a two-dimensional cell array
and row/column bus connections), our approach of reducing the
electrical load of connecting to a bus (bitline) to achieve low access
latency can be applicable to other memory devices. Furthermore, the
idea of performing inter-segment data transfer can also potentially be
applied to other memory devices, regardless of the memory technology.
For example, we believe it is promising to examine similar approaches
for emerging memory technologies like phase-change memory [62, 63, 64,
83, 84, 104, 106, 137, 139], STT-MRAM [61, 83, 97, 135], or
RRAM/memristors [27, 125, 136], as well as NAND flash memory
technology [9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 78, 79, 80,
81].
New Research Opportunities. The TL-DRAM substrate creates new
opportunities by enabling mechanisms that can leverage the latency
heterogeneity offered by the substrate. We briefly describe three
directions, but we believe that there are many new possibilities.
• New ways of leveraging TL-DRAM: TL-DRAM is a substrate
that can be utilized for many applications. Although we
describe two major ways of leveraging TL-DRAM in our
HPCA 2013 paper [73], we believe there are more ways
to leverage the TL-DRAM substrate both in hardware and
software. For instance, new mechanisms could be devised
to detect data that is latency critical (e.g., data that causes
many threads to become serialized [31, 45, 46, 130, 131] or
data that belongs to threads that are more latency-sensitive
or important [4, 5, 29, 58, 59, 65, 67, 126, 127, 128, 129, 133])
or could become latency critical in the near future and
allocate/prefetch such data into the near segment.
• Opening up new design spaces with multiple tiers: TL-DRAM
can be easily extended to have multiple latency tiers by
adding more isolation transistors to the bitlines, provid-
ing more latency asymmetry. Our HPCA 2013 paper [73]
provides an analysis of the latency of a TL-DRAM design
with three tiers, showing the spread in latency for three
tiers. This enables new mechanisms both in hardware and
software that can allocate data appropriately to different
tiers based on their access characteristics such as locality,
criticality, priority, etc.
• Inspiring new ways of architecting latency heterogeneity within
DRAM: To our knowledge, TL-DRAM is the first to enable latency
heterogeneity within DRAM, which is a significant modification to the
existing DRAM architecture. We believe that this could inspire
research on other possible ways of architecting latency heterogeneity
within DRAM [20, 21, 24, 37, 38, 68, 70, 72] or other memory devices.
Note that recent works published after our HPCA 2013 paper clearly
exploit this promising direction proposed by our paper [20, 21, 24,
37, 38, 68, 70, 72, 116].
Acknowledgments
We thank Saugata Ghose for his dedicated effort in the
preparation of this article. Many thanks to Uksong Kang, Hak-
soo Yu, Churoo Park, Jung-Bae Lee, and Joo Sun Choi from
Samsung, and Brian Hirano from Oracle, for their helpful
comments. We thank the reviewers for their feedback. We
acknowledge the support of our industrial partners: AMD,
HP Labs, IBM, Intel, Oracle, Qualcomm, and Samsung. This
research was also partially supported by grants from the
NSF (grants 0953246 and 1212962), GSRC, and the Intel URO
Memory Hierarchy Program.
References
[1] J. Ahn, S. Hong, S. Yoo, O. Mutlu, and K. Choi, “A Scalable Processing-in-Memory
Accelerator for Parallel Graph Processing,” in ISCA, 2015.
[2] J. Ahn, S. Yoo, O. Mutlu, and K. Choi, “PIM-Enabled Instructions: A Low-
Overhead, Locality-Aware Processing-in-Memory Architecture,” in ISCA, 2015.
[3] A. Ailamaki, D. J. DeWitt, M. D. Hill, and D. A. Wood, “DBMSs on a Modern
Processor: Where Does Time Go?” in VLDB, 1999.
[4] R. Ausavarungnirun, K. K.-W. Chang, L. Subramanian, G. H. Loh, and O. Mutlu,
“Staged memory scheduling: achieving high performance and scalability in het-
erogeneous systems,” in ISCA, 2012.
[5] R. Ausavarungnirun, S. Ghose, O. Kayiran, G. H. Loh, C. R. Das, M. T. Kandemir,
and O. Mutlu, “Exploiting Inter-Warp Heterogeneity to Improve GPGPU Perfor-
mance,” in PACT, 2015.
[6] S. Borkar and A. A. Chien, “The future of microprocessors,” in CACM, 2011.
[7] A. Boroumand et al., “Google Workloads for Consumer Devices: Mitigating Data
Movement Bottlenecks,” in ASPLOS, 2018.
[8] A. Boroumand, S. Ghose, B. Lucia, K. Hsieh, K. Malladi, H. Zheng, and
O. Mutlu, “LazyPIM: An Efficient Cache Coherence Mechanism for
Processing-in-Memory,” in IEEE CAL, 2016.
[9] Y. Cai, S. Ghose, E. F. Haratsch, Y. Luo, and O. Mutlu, “Error Characterization,
Mitigation, and Recovery in Flash-Memory-Based Solid-State Drives,” in Proceed-
ings of the IEEE, 2017.
[10] Y. Cai, S. Ghose, E. F. Haratsch, Y. Luo, and O. Mutlu, “Error Characteri-
zation, Mitigation, and Recovery in Flash Memory Based Solid-State Drives,”
arXiv:1706.08642 [cs.AR], 2017.
[11] Y. Cai, S. Ghose, E. F. Haratsch, Y. Luo, and O. Mutlu, “Errors in Flash-Memory-
Based Solid-State Drives: Analysis, Mitigation, and Recovery,” arXiv:1711.11427
[cs.AR], 2017.
[12] Y. Cai, S. Ghose, Y. Luo, K. Mai, O. Mutlu, and E. F. Haratsch, “Vulnerabilities in
MLC NAND Flash Memory Programming: Experimental Analysis, Exploits, and
Mitigation Techniques,” in HPCA, 2017.
[13] Y. Cai, E. F. Haratsch, O. Mutlu, and K. Mai, “Error patterns in MLC NAND flash
memory: Measurement, characterization, and analysis,” in DATE, 2012.
[14] Y. Cai, G. Yalcin, O. Mutlu, E. F. Haratsch, A. Cristal, O. S. Unsal, and K. Mai,
“Flash Correct-and-Refresh: Retention-Aware Error Management for Increased
Flash Memory Lifetime,” in ICCD, 2012.
[15] Y. Cai, E. F. Haratsch, O. Mutlu, and K. Mai, “Threshold Voltage Distribution in
MLC NAND Flash Memory: Characterization, Analysis, and Modeling,” in DATE,
2013.
[16] Y. Cai, Y. Luo, S. Ghose, and O. Mutlu, “Read Disturb Errors in MLC NAND Flash
Memory: Characterization, Mitigation, and Recovery,” in DSN, 2015.
[17] Y. Cai, Y. Luo, E. Haratsch, K. Mai, and O. Mutlu, “Data Retention in MLC NAND
Flash Memory: Characterization, Optimization, and Recovery,” in HPCA, 2015.
[18] Y. Cai, O. Mutlu, E. F. Haratsch, and K. Mai, “Program Interference in MLC
NAND Flash Memory: Characterization, Modeling, and Mitigation,” in ICCD,
2013.
[19] Y. Cai, G. Yalcin, O. Mutlu, E. F. Haratsch, O. Unsal, A. Cristal, and K. Mai,
“Neighbor-cell Assisted Error Correction for MLC NAND Flash Memories,” in
SIGMETRICS, 2014.
[20] K. K. Chang, “Understanding and Improving Latency of DRAM-Based Memory
Systems,” Ph.D. dissertation, Carnegie Mellon University, 2017.
[21] K. K. Chang, A. Kashyap, H. Hassan, S. Ghose, K. Hsieh, D. Lee, T. Li,
G. Pekhimenko, S. Khan, and O. Mutlu, “Understanding Latency Variation in Modern
DRAM Chips: Experimental Characterization, Analysis, and Optimization,” in
SIGMETRICS, 2016.
[22] K. K. Chang, D. Lee, Z. Chishti, A. Alameldeen, C. Wilkerson, Y. Kim, and
O. Mutlu, “Improving DRAM Performance by Parallelizing Refreshes with
Accesses,” in HPCA, 2014.
[23] K. K. Chang, P. J. Nair, S. Ghose, D. Lee, M. K. Qureshi, and O. Mutlu, “Low-Cost
Inter-Linked Subarrays (LISA): Enabling Fast Inter-Subarray Data Movement in
DRAM,” in HPCA, 2016.
[24] K. K. Chang, A. G. Yaglikci, A. Agrawal, N. Chatterjee, S. Ghose, A. Kashyap,
H. Hassan, D. Lee, M. O’Connor, and O. Mutlu, “Understanding Reduced-Voltage
Operation in Modern DRAM Devices: Experimental Characterization, Analysis,
and Mechanisms,” in SIGMETRICS, 2017.
[25] N. Chatterjee, M. Shevgoor, R. Balasubramonian, A. Davis, Z. Fang, R. Illikkal,
and R. Iyer, “Leveraging Heterogeneity in DRAM Main Memories to Accelerate
Critical Word Access,” in MICRO, 2012.
[26] N. Chatterjee, N. Muralimanohar, R. Balasubramonian, A. Davis, and N. P. Jouppi,
“Staged Reads: Mitigating the Impact of DRAM Writes on DRAM Reads,” in
HPCA, 2012.
[27] L. Chua, “Memristor—The Missing Circuit Element,” TCT, 1971.
[28] G. Dhiman et al., “PDRAM: A hybrid PRAM and DRAM main memory system,”
in DAC, 2009.
[29] E. Ebrahimi, C. J. Lee, O. Mutlu, and Y. N. Patt, “Fairness via Source Throttling: A
Congurable and High-performance Fairness Substrate for Multi-core Memory
Systems,” in ASPLOS, 2010.
[30] E. Ebrahimi, C. J. Lee, O. Mutlu, and Y. N. Patt, “Prefetch-aware shared resource
management for multi-core systems,” in ISCA, 2011.
[31] E. Ebrahimi, R. Miftakhutdinov, C. Fallin, C. J. Lee, J. A. Joao, O. Mutlu, and Y. N.
Patt, “Parallel Application Memory Scheduling,” in MICRO, 2011.
[32] Enhanced Memory Systems, “Enhanced SDRAM SM2604,” 2002.
[33] M. Gao and C. Kozyrakis, “HRL: Efficient and flexible reconfigurable logic for
near-data processing,” in HPCA, 2016.
[34] S. Ghose, H. Lee, and J. F. Martínez, “Improving Memory Scheduling via
Processor-Side Load Criticality Information,” in ISCA, 2013.
[35] Q. Guo et al., “3D-Stacked Memory-Side Acceleration: Accelerator and System
Design,” in WoNDP, 2013.
[36] C. A. Hart, “CDRAM in a Unified Memory Architecture,” in Compcon, 1994.
[37] H. Hassan, G. Pekhimenko, N. Vijaykumar, V. Seshadri, D. Lee, O. Ergin, and
O. Mutlu, “ChargeCache: Reducing DRAM Latency by Exploiting Row Access
Locality,” in HPCA, 2016.
[38] H. Hassan, N. Vijaykumar, S. Khan, S. Ghose, K. Chang, G. Pekhimenko, D. Lee,
O. Ergin, and O. Mutlu, “SoftMC: A Flexible and Practical Open-Source
Infrastructure for Enabling Experimental DRAM Studies,” in HPCA, 2017.
[39] H. Hidaka, Y. Matsuda, M. Asakura, and K. Fujishima, “The Cache DRAM
Architecture: A DRAM with an On-Chip Cache Memory,” in IEEE Micro, 1990.
[40] K. Hsieh, S. Khan, N. Vijaykumar, K. K. Chang, A. Boroumand, S. Ghose, and
O. Mutlu, “Accelerating Pointer Chasing in 3D-Stacked Memory: Challenges,
Mechanisms, Evaluation,” in ICCD, 2016.
[41] K. Hsieh, E. Ebrahimi, G. Kim, N. Chatterjee, M. O’Connor, N. Vijaykumar,
O. Mutlu, and S. W. Keckler, “Transparent Offloading and Mapping (TOM):
Enabling Programmer-Transparent Near-Data Processing in GPU Systems,” in
ISCA, 2016.
[42] I. Hur and C. Lin, “Adaptive History-Based Memory Schedulers,” in MICRO, 2004.
[43] E. Ipek et al., “Self Optimizing Memory Controllers: A Reinforcement Learning
Approach,” in ISCA, 2008.
[44] JEDEC, “DDR3 SDRAM STANDARD,” http://www.jedec.org/
standards-documents/docs/jesd-79-3d, 2010.
[45] J. A. Joao et al., “Utility-Based Acceleration of Multithreaded Applications on
Asymmetric CMPs,” in ISCA, 2013.
[46] J. A. Joao, M. A. Suleman et al., “Bottleneck identification and scheduling in
multithreaded applications,” in ASPLOS, 2012.
[47] T. S. Jung, “Memory technology and solutions roadmap,” http://www.sec.co.kr/
images/corp/ir/irevent/techforum_01.pdf, 2005.
[48] S. Kanev, J. P. Darago, K. Hazelwood, P. Ranganathan, T. Moseley, G.-Y. Wei, and
D. Brooks, “Profiling a Warehouse-Scale Computer,” in ISCA, 2015.
[49] D. Kaseridis, J. Stuecheli, and L. K. John, “Minimalist Open-Page: A DRAM Page-
Mode Scheduling Policy for the Many-Core Era,” in MICRO, 2011.
[50] S. Khan et al., “Detecting and Mitigating Data-Dependent DRAM Failures by
Exploiting Current Memory Content,” in MICRO, 2017.
[51] S. Khan, D. Lee, and O. Mutlu, “PARBOR: An Efficient System-Level Technique
to Detect Data-Dependent Failures in DRAM,” in DSN, 2016.
[52] S. Khan, D. Lee, Y. Kim, A. R. Alameldeen, C. Wilkerson, and O. Mutlu, “The
Ecacy of Error Mitigation Techniques for DRAM Retention Failures: A Com-
parative Experimental Study,” in SIGMETRICS, 2014.
[53] S. Khan, C. Wilkerson, D. Lee, A. R. Alameldeen, and O. Mutlu, “A Case for
Memory Content-Based Detection and Mitigation of Data-Dependent Failures
in DRAM,” in IEEE CAL, 2016.
[54] J. S. Kim, M. Patel, H. Hassan, and O. Mutlu, “The DRAM Latency PUF: Quickly
Evaluating Physical Unclonable Functions by Exploiting the Latency–Reliability
Tradeoff in Modern DRAM Devices,” in HPCA, 2018.
[55] J. S. Kim, D. Senol, H. Xin, D. Lee, S. Ghose, M. Alser, H. Hassan, O. Ergin,
C. Alkan, and O. Mutlu, “GRIM-Filter: Fast Seed Location Filtering in DNA Read
Mapping Using Processing-in-Memory Technologies,” BMC Genomics, 2018.
[56] Y. Kim, W. Yang, and O. Mutlu, “Ramulator: A Fast and Extensible DRAM
Simulator,” in IEEE CAL, 2015.
[57] Y. Kim, R. Daly, J. Kim, C. Fallin, J. H. Lee, D. Lee, C. Wilkerson, K. Lai, and
O. Mutlu, “Flipping Bits in Memory Without Accessing Them: An Experimental
Study of DRAM Disturbance Errors,” in ISCA, 2014.
[58] Y. Kim, D. Han, O. Mutlu, and M. Harchol-Balter, “ATLAS: A scalable and high-
performance scheduling algorithm for multiple memory controllers,” in HPCA,
2010.
[59] Y. Kim, M. Papamichael, O. Mutlu, and M. Harchol-Balter, “Thread Cluster
Memory Scheduling: Exploiting Differences in Memory Access Behavior,” in MICRO,
2010.
[60] Y. Kim, V. Seshadri, D. Lee, J. Liu, and O. Mutlu, “A Case for Exploiting Subarray-
Level Parallelism (SALP) in DRAM,” in ISCA, 2012.
[61] E. Kultursay, M. Kandemir, A. Sivasubramaniam, and O. Mutlu, “Evaluating
STT-RAM as an energy-efficient main memory alternative,” in ISPASS, 2013.
[62] B. Lee, P. Zhou, J. Yang, Y. Zhang, B. Zhao, E. Ipek, O. Mutlu, and D. Burger,
“Phase-Change Technology and the Future of Main Memory,” in IEEE Micro, 2010.
[63] B. C. Lee, E. Ipek, O. Mutlu, and D. Burger, “Architecting Phase Change Memory
As a Scalable DRAM Alternative,” in ISCA, 2009.
[64] B. C. Lee, E. Ipek, O. Mutlu, and D. Burger, “Phase Change Memory Architecture
and the Quest for Scalability,” in CACM, 2010.
[65] C. J. Lee, O. Mutlu, V. Narasiman, and Y. N. Patt, “Prefetch-Aware DRAM
Controllers,” in MICRO, 2008.
[66] C. J. Lee, E. Ebrahimi, V. Narasiman, O. Mutlu, and Y. N. Patt, “DRAM-Aware
Last-Level Cache Writeback: Reducing Write-Caused Interference in Memory
Systems,” Univ. of Texas at Austin, High Performance Systems Group, Tech. Rep.
TR-HPS-2010-002, 2010.
[67] C. J. Lee, V. Narasiman, O. Mutlu, and Y. N. Patt, “Improving Memory Bank-Level
Parallelism in the Presence of Prefetching,” in MICRO, 2009.
[68] D. Lee, S. Khan, L. Subramanian, S. Ghose, R. Ausavarungnirun, G. Pekhimenko,
V. Seshadri, and O. Mutlu, “Design-Induced Latency Variation in Modern DRAM
Chips: Characterization, Analysis, and Latency Reduction Mechanisms,” in
SIGMETRICS, 2017.
[69] D. Lee, L. Subramanian, R. Ausavarungnirun, J. Choi, and O. Mutlu, “Decoupled
Direct Memory Access: Isolating CPU and IO Traffic by Leveraging a
Dual-Data-Port DRAM,” in PACT, 2015.
[70] D. Lee, “Reducing DRAM Latency at Low Cost by Exploiting Heterogeneity,”
Ph.D. dissertation, Carnegie Mellon University, 2016.
[71] D. Lee, S. Ghose, G. Pekhimenko, S. Khan, and O. Mutlu, “Simultaneous Multi-
Layer Access: Improving 3D-Stacked Memory Bandwidth at Low Cost,” in ACM
TACO, 2016.
[72] D. Lee, Y. Kim, G. Pekhimenko, S. Khan, V. Seshadri, K. Chang, and O. Mutlu,
“Adaptive-Latency DRAM: Optimizing DRAM Timing for the Common-Case,”
in HPCA, 2015.
[73] D. Lee, Y. Kim, V. Seshadri, J. Liu, L. Subramanian, and O. Mutlu, “Tiered-Latency
DRAM: A Low Latency and Low Cost DRAM Architecture,” in HPCA, 2013.
[74] Y. Li, S. Ghose, J. Choi, J. Sun, H. Wang, and O. Mutlu, “Utility-Based Hybrid
Memory Management,” in CLUSTER, 2017.
[75] J. Liu, B. Jaiyen, Y. Kim, C. Wilkerson, and O. Mutlu, “An Experimental Study of
Data Retention Behavior in Modern DRAM Devices: Implications for Retention
Time Proling Mechanisms,” in ISCA, 2013.
[76] J. Liu, B. Jaiyen, R. Veras, and O. Mutlu, “RAIDR: Retention-Aware Intelligent
DRAM Refresh,” in ISCA, 2012.
[77] Z. Liu, I. Calciu, M. Herlihy, and O. Mutlu, “Concurrent Data Structures for Near-
Memory Computing,” in SPAA, 2017.
[78] Y. Luo, S. Ghose, Y. Cai, E. F. Haratsch, and O. Mutlu, “Enabling Accurate and
Practical Online Flash Channel Modeling for Modern MLC NAND Flash
Memory,” JSAC, 2016.
[79] Y. Luo, Y. Cai, S. Ghose, J. Choi, and O. Mutlu, “WARM: Improving NAND flash
memory lifetime with write-hotness aware retention management,” in MSST,
2015.
[80] Y. Luo, S. Ghose, Y. Cai, E. F. Haratsch, and O. Mutlu, “HeatWatch: Improving 3D
NAND Flash Memory Device Reliability by Exploiting Self-Recovery and
Temperature Awareness,” in HPCA, 2018.
[81] Y. Luo, S. Govindan, B. Sharma, M. Santaniello, J. Meza, A. Kansal, J. Liu,
B. Khessib, K. Vaid, and O. Mutlu, “Characterizing Application Memory Error
Vulnerability to Optimize Datacenter Cost via Heterogeneous-Reliability
Memory,” in DSN, 2014.
[82] J. Meza et al., “Enabling Efficient and Scalable Hybrid Memories Using
Fine-Granularity DRAM Cache Management,” in IEEE CAL, 2012.
[83] J. Meza et al., “A Case for Efficient Hardware-Software Cooperative Management
of Storage and Memory,” in WEED, 2013.
[84] J. Meza, J. Li, and O. Mutlu, “A case for small row buffers in non-volatile main
memories,” in ICCD, 2012.
[85] Micron, “RLDRAM 2 and 3 Specifications,” http://www.micron.com/products/
dram/rldram-memory.
[86] Y. Moon et al., “1.2V 1.6Gb/s 56nm 6F² 4Gb DDR3 SDRAM with hybrid-I/O sense
amplifier and segmented sub-array architecture,” ISSCC, 2009.
[87] T. Moscibroda and O. Mutlu, “Memory Performance Attacks: Denial of Memory
Service in Multi-Core Systems,” in USENIX Security, 2007.
[88] J. Mukundan and J. F. Martínez, “MORSE: Multi-Objective Reconfigurable
Self-Optimizing Memory Scheduler,” in HPCA, 2012.
[89] S. P. Muralidhara, L. Subramanian, O. Mutlu, M. Kandemir, and T. Moscibroda,
“Reducing Memory Interference in Multicore Systems via Application-aware
Memory Channel Partitioning,” in MICRO, 2011.
[90] O. Mutlu, “Memory Scaling: A Systems Architecture Perspective,” in IMW, 2013.
[91] O. Mutlu, “Main Memory Scaling: Challenges and Solution Directions,” in More
than Moore Technologies for Next Generation Computer Design. Springer, 2015.
[92] O. Mutlu, J. Stark, C. Wilkerson, and Y. N. Patt, “Runahead execution: An
effective alternative to large instruction windows,” in IEEE Micro, 2003.
[93] O. Mutlu, H. Kim, and Y. N. Patt, “Techniques for Efficient Processing in
Runahead Execution Engines,” in ISCA, 2005.
[94] O. Mutlu and T. Moscibroda, “Stall-Time Fair Memory Access Scheduling for
Chip Multiprocessors,” in MICRO, 2007.
[95] O. Mutlu and T. Moscibroda, “Parallelism-Aware Batch Scheduling: Enhancing
both Performance and Fairness of Shared DRAM Systems,” in ISCA, 2008.
[96] O. Mutlu, J. Stark, C. Wilkerson, and Y. N. Patt, “Runahead Execution: An
Alternative to Very Large Instruction Windows for Out-of-Order Processors,” in
HPCA, 2003.
[97] H. Naeimi, C. Augustine, A. Raychowdhury, S.-L. Lu, and J. Tschanz, “STT-RAM
Scaling and Retention Failure,” Intel Technology Journal, 2013.
[98] S. Narasimha et al., “High performance 45-nm SOI technology with enhanced
strain, porous low-k BEOL, and immersion lithography,” in IEDM, 2006.
[99] M. Patel, J. S. Kim, and O. Mutlu, “The Reach Profiler (REAPER): Enabling the
Mitigation of DRAM Retention Failures via Profiling at Aggressive Conditions,”
in ISCA, 2017.
[100] D. A. Patterson, “Latency lags bandwith,” in Commun. ACM, 2004.
[101] A. Pattnaik, X. Tang, A. Jog, O. Kayiran, A. K. Mishra, M. T. Kandemir, O. Mutlu,
and C. R. Das, “Scheduling Techniques for GPU Architectures with Processing-
in-Memory Capabilities,” in PACT, 2016.
[102] S. Phadke and S. Narayanasamy, “MLP aware heterogeneous memory system,”
in DATE, 2011.
[103] M. Qureshi, D.-H. Kim, S. Khan, P. Nair, and O. Mutlu, “AVATAR: A Variable-
Retention-Time (VRT) Aware Refresh for DRAM Systems,” in DSN, 2015.
[104] M. K. Qureshi, J. Karidis, M. Franceschini, V. Srinivasan, L. Lastras, and B. Abali,
“Enhancing Lifetime and Security of PCM-based Main Memory with Start-gap
Wear Leveling,” in MICRO, 2009.
[105] M. K. Qureshi, D. N. Lynch, O. Mutlu, and Y. N. Patt, “A case for MLP-aware
cache replacement,” in ISCA, 2006.
[106] M. K. Qureshi, V. Srinivasan, and J. A. Rivers, “Scalable High Performance Main
Memory System Using Phase-change Memory Technology,” in ISCA, 2009.
[107] Rambus, “DRAM Power Model,” http://www.rambus.com/energy, 2010.
[108] L. E. Ramos et al., “Page placement in hybrid memory systems,” in ICS, 2011.
[109] J. Ren, J. Zhao, S. Khan, J. Choi, Y. Wu, and O. Mutlu, “ThyNVM: Enabling
Software-Transparent Crash Consistency in Persistent Memory Systems,” in
MICRO, 2015.
[110] SAFARI Research Group, “Ramulator – GitHub Repository,” https://github.com/
CMU-SAFARI/ramulator.
[111] Samsung, “DRAM Data Sheet,” http://www.samsung.com/global/business/
semiconductor/product.
[112] Y. Sato et al., “Fast Cycle RAM (FCRAM); a 20-ns random row access, pipe-lined
operating DRAM,” in VLSIC, 1998.
[113] V. Seshadri, A. Bhowmick, O. Mutlu, P. Gibbons, M. Kozuch, and T. Mowry, “The
Dirty-Block Index,” in ISCA, 2014.
[114] V. Seshadri, K. Hsieh, A. Boroumand, D. Lee, M. Kozuch, O. Mutlu, P. Gibbons,
and T. Mowry, “Fast Bulk Bitwise AND and OR in DRAM,” in IEEE CAL, 2015.
[115] V. Seshadri, O. Mutlu, M. A. Kozuch, and T. C. Mowry, “The evicted-address filter:
A unified mechanism to address both cache pollution and thrashing,” in PACT,
2012.
[116] V. Seshadri et al., “RowClone: Fast and Energy-Efficient In-DRAM Bulk Data
Copy and Initialization,” in MICRO, 2013.
[117] V. Seshadri, D. Lee, T. Mullins, H. Hassan, A. Boroumand, J. Kim, M. A. Kozuch,
O. Mutlu, P. B. Gibbons, and T. C. Mowry, “Ambit: In-memory Accelerator for
Bulk Bitwise Operations Using Commodity DRAM Technology,” in MICRO, 2017.
[118] V. Seshadri, T. Mullins, A. Boroumand, O. Mutlu, P. B. Gibbons, M. A. Kozuch,
and T. C. Mowry, “Gather-Scatter DRAM: In-DRAM Address Translation to
Improve the Spatial Locality of Non-Unit Strided Accesses,” in MICRO, 2015.
[119] V. Seshadri, S. Yedkar, H. Xin, O. Mutlu, P. B. Gibbons, M. A. Kozuch, and T. C.
Mowry, “Mitigating Prefetcher-Caused Pollution Using Informed Caching
Policies for Prefetched Blocks,” TACO, 2015.
[120] A. Seznec, “A Case for Two-Way Skewed-Associative Caches,” in ISCA, 1993.
[121] J. Shao and B. T. Davis, “A Burst Scheduling Access Reordering Mechanism,” in
HPCA, 2007.
[122] S. M. Sharroush et al., “Dynamic random-access memories without sense
amplifiers,” in Elektrotechnik & Informationstechnik, 2012.
[123] W. Shin, J. Yang, J. Choi, and L.-S. Kim, “NUAT: A Non-Uniform Access Time
Memory Controller,” in HPCA, 2014.
[124] Y. H. Son, O. Seongil, Y. Ro, J. W. Lee, and J. H. Ahn, “Reducing Memory Access
Latency with Asymmetric DRAM Bank Organizations,” in ISCA, 2013.
[125] D. B. Strukov, G. S. Snider, D. R. Stewart, and R. S. Williams, “The Missing
Memristor Found,” Nature, 2008.
[126] L. Subramanian, D. Lee, V. Seshadri, H. Rastogi, and O. Mutlu, “The Blacklisting
Memory Scheduler: Achieving high performance and fairness at low cost,” in
ICCD, 2014.
[127] L. Subramanian, D. Lee, V. Seshadri, H. Rastogi, and O. Mutlu, “BLISS: Balancing
Performance, Fairness and Complexity in Memory Access Scheduling,” in TPDS,
2016.
[128] L. Subramanian, V. Seshadri, A. Ghosh, S. Khan, and O. Mutlu, “The Application
Slowdown Model: Quantifying and Controlling the Impact of Inter-Application
Interference at Shared Caches and Main Memory,” in MICRO, 2015.
[129] L. Subramanian, V. Seshadri, Y. Kim, B. Jaiyen, and O. Mutlu, “MISE: Providing
Performance Predictability and Improving Fairness in Shared Main Memory
Systems,” in HPCA, 2013.
[130] M. A. Suleman, O. Mutlu, M. K. Qureshi, and Y. N. Patt, “Accelerating critical
section execution with asymmetric multi-core architectures,” in ASPLOS, 2009.
[131] M. A. Suleman, O. Mutlu, J. A. Joao, Khubaib, and Y. N. Patt, “Data Marshaling
for Multi-core Architectures,” in ISCA, 2010.
[132] G. Tyson, M. Farrens, J. Matthews, and A. R. Pleszkun, “A Modified Approach to
Data Cache Management,” in MICRO, 1995.
[133] H. Usui, L. Subramanian, K. K.-W. Chang, and O. Mutlu, “DASH: Deadline-Aware
High-Performance Memory Scheduler for Heterogeneous Systems with
Hardware Accelerators,” in ACM TACO, 2016.
[134] R. Venkatesan, S. Herr, and E. Rotenberg, “Retention-aware placement in DRAM
(RAPID): software methods for quasi-non-volatile DRAM,” in HPCA, 2006.
[135] J. Wang, X. Dong, and Y. Xie, “Enabling High-performance LPDDRx-compatible
MRAM,” in ISLPED, 2014.
[136] H. S. P. Wong, H. Y. Lee, S. Yu, Y. S. Chen, Y. Wu, P. S. Chen, B. Lee, F. T. Chen,
and M. J. Tsai, “Metal–Oxide RRAM,” in Proceedings of the IEEE, 2012.
[137] H.-S. P. Wong, S. Raoux, S. Kim, J. Liang, J. P. Reifenberg, B. Rajendran,
M. Asheghi, and K. E. Goodson, “Phase Change Memory,” Proc. IEEE, 2010.
[138] H. Yoon et al., “Row Buffer Locality Aware Caching Policies for Hybrid
Memories,” in ICCD, 2012.
[139] H. Yoon, J. Meza, N. Muralimanohar, N. P. Jouppi, and O. Mutlu, “Efficient Data
Mapping and Buffering Techniques for Multilevel Cell Phase-Change Memories,”
in ACM TACO, 2014.
[140] X. Yu, C. J. Hughes, N. Satish, O. Mutlu, and S. Devadas, “Banshee:
Bandwidth-efficient DRAM Caching via Software/Hardware Cooperation,” in MICRO, 2017.
[141] D. Zhang, N. Jayasena, A. Lyashevsky, J. L. Greathouse, L. Xu, and
M. Ignatowski, “TOP-PIM: Throughput-oriented Programmable Processing in Memory,”
in HPCA, 2014.
[142] Z. Zhang, Z. Zhu, and X. Zhang, “Cached DRAM for ILP Processor Memory
Access Latency Reduction,” in IEEE Micro, 2001.
