Heterogeneous-Reliability Memory: Exploiting Application-Level Memory
  Error Tolerance by Luo, Yixin et al.
Heterogeneous-Reliability Memory:
Exploiting Application-Level Memory Error Tolerance
Yixin Luo1 Sriram Govindan2 Bikash Sharma3,2 Mark Santaniello3,2 Justin Meza3,1
Aman Kansal2 Jie Liu2 Badriddine Khessib2 Kushagra Vaid2 Onur Mutlu4,1
1Carnegie Mellon University 2Microsoft Corporation 3Facebook 4ETH Zürich
This paper summarizes our work on characterizing appli-
cation memory error vulnerability to optimize datacenter cost
via Heterogeneous-Reliability Memory (HRM), which was published
in DSN 2014 [104], and examines the work’s significance
and future potential. Memory devices represent a key compo-
nent of datacenter total cost of ownership (TCO), and techniques
used to reduce errors that occur on these devices increase this
cost. Existing approaches to providing reliability for memory
devices pessimistically treat all data as equally vulnerable to
memory errors. Our key insight is that there exists a diverse
spectrum of tolerance to memory errors in new data-intensive
applications, and that traditional one-size-fits-all memory reliability
techniques are inefficient in terms of cost. For example,
we found that while traditional error protection increases mem-
ory system cost by 12.5%, some applications can achieve 99.00%
availability on a single server with a large number of memory
errors without any error protection. This presents an opportu-
nity to greatly reduce server hardware cost by provisioning the
right amount of memory reliability for different applications.
Toward this end, in our DSN 2014 paper [104], we make three
main contributions to enable highly-reliable servers at low dat-
acenter cost. First, we develop a new methodology to quantify
the tolerance of applications to memory errors. Second, using
our methodology, we perform a case study of three new data-
intensive workloads (an interactive web search application, an
in-memory key–value store, and a graph mining framework)
to identify new insights into the nature of application memory
error vulnerability. Third, based on our insights, we propose
several new hardware/software heterogeneous-reliability mem-
ory system designs to lower datacenter cost while achieving
high reliability and discuss their trade-offs. We show that our
new techniques can reduce server hardware cost by 4.7% while
achieving 99.90% single server availability.
We believe the notion of HRM opens up a sea of opportunities
in optimizing memory system and overall system cost, reliability,
efficiency, and performance in a manner that is aware of
applications’ tolerance to memory errors. Thus, our paper just
scratches the surface of a large HRM exploration space, which
we hope future works will undertake in various novel ways, in
a wide variety of systems, ranging from datacenters to mobile
and embedded systems.
1. Introduction
A warehouse-scale datacenter consists of many thousands
of machines running a diverse set of applications, and com-
prises the foundation of the modern web [4, 151]. While such
datacenters are vital to the operation of companies such as
Facebook, Google, Microsoft, and Yahoo!, reducing the cost
of such large-scale deployments of machines poses a significant
challenge to these and other companies. Recently, the
need for reduced datacenter cost has driven companies to
examine more energy-efficient server designs [38] and build
their datacenter installations in cold environments to reduce
cooling costs [49, 59] or use built-in power plants to reduce
electricity supply costs [139].
There are two main components of the total cost of owner-
ship (TCO) of a datacenter [4]: (1) capital costs (those associ-
ated with server hardware) and (2) operational costs (those
associated with providing electricity and cooling). Recent
studies have shown that capital costs can account for the
majority (e.g., around 57% in [4]) of datacenter TCO, and
thus represent the main impediment for reducing datacenter
TCO. In addition, this component of datacenter TCO is only
expected to increase going forward as companies adopt more
ecient cooling and power supply techniques.
Of the dominant component of datacenter TCO (capital
costs associated with server hardware), the cost of server pro-
cessors and memory represents the key component—around
60% in modern servers [77]. Furthermore, the cost of the
memory in today’s servers is comparable to that of the pro-
cessors [77], and is likely to exceed processor cost for data-
intensive applications such as web search and social media
services, which use in-memory caching to improve response
time [54, 127, 128, 129, 159] (e.g., a popular key–value store,
Memcached, has been used at Google and Facebook [54, 127]
for this purpose).
Exacerbating the cost of memory in modern servers is the
use of memory devices (such as dynamic random access mem-
ory, or DRAM) that provide error detection and correction.
This cost arises from two components: (1) quality assurance
testing performed by memory vendors to ensure devices sold
to customers are of a high enough caliber and (2) additional
memory capacity for error detection and correction. Device
testing has been shown to account for an increasing fraction
of the cost of memory for DRAM [2, 33]. The cost of addi-
tional memory capacity, on the other hand, depends on the
technique used to provide error detection and correction.
Table 1 compares several common memory error detection
and correction techniques in terms of which types of errors
they are able to detect/correct and the additional amount of
capacity/logic they require (which, for DRAM devices, whose
design is ercely cost-driven [120, 121], is proportional to
cost). Techniques range from the relatively low-cost (and
widely employed) parity, SEC-DED (single error correction,
double error detection), Chipkill [35], and DEC-TED (dou-
ble error correction, triple error detection), all of which use
dierent error-correcting codes (ECC) to detect and correct
a small number of bits or chip errors, to the more expen-
sive RAIM [111] and Mirroring [53] techniques that replicate
some (or all) of memory to tolerate the failure of an entire
DRAM dual in-line memory module (DIMM). The additional
cost of memory with high error-tolerance can be significant
(e.g., 12.5% of the total memory capacity for SEC-DED and
Chipkill, and as high as 125% for Mirroring).
Table 1: Memory error detection and correction techniques.
“X/Y Z” means a technique can detect/correct X out of every
Y failures of Z. n represents the parity of any odd number of
bits between 1 and 63. Adapted from [104].

Technique        Error Detection (Correction)    Added Capacity   Added Logic
Parity           n/64 bits (None)                1.6%             Low
SEC-DED          2/64 bits (1/64 bits)           12.5%            Low
DEC-TED          3/64 bits (2/64 bits)           23.4%            Low
Chipkill [35]    2/8 chips (1/8 chips)           12.5%            High
RAIM [111]       1/5 modules (1/5 modules)       40.6%            High
Mirroring [53]   2/8 chips (1/2 modules)         125.0%           Low
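The added-capacity figures for the bit-level codes follow from the number of check bits stored per 64 data bits. The short sketch below reproduces that arithmetic. The SEC-DED value of eight check bits is the one stated later in this paper; the single parity bit per 64 data bits is our assumption, inferred from the 1.6% entry, and the function names are illustrative only.

```python
# Check bits per 64 data bits. SEC-DED (8) is stated in the text; the parity
# value (1) is an assumption inferred from the 1.6% table entry.
CHECK_BITS_PER_64 = {"Parity": 1, "SEC-DED": 8}


def added_capacity_pct(check_bits: int, data_bits: int = 64) -> float:
    """Extra capacity needed, as a percentage of the data capacity."""
    return 100.0 * check_bits / data_bits


def check_bit_share_pct(check_bits: int, data_bits: int = 64) -> float:
    """Share of the total (data + check) capacity occupied by check bits."""
    return 100.0 * check_bits / (data_bits + check_bits)


for name, bits in CHECK_BITS_PER_64.items():
    print(f"{name}: +{added_capacity_pct(bits):.1f}% capacity, "
          f"{check_bit_share_pct(bits):.1f}% of stored bits are check bits")
```

The second function shows why removing a code saves slightly less than its added-capacity figure: the saving is measured against the larger, protected capacity.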
Yet even with well-tested and error-tolerant memory devices,
recent studies from the field have observed a rising rate
of memory error occurrences [55, 72, 116, 120, 123, 141, 146].
This trend presents an increasing challenge for ensuring high
performance and high reliability in future systems, as memory
errors can be detrimental to both. In terms of performance,
existing error detection and correction techniques incur a
slowdown on each memory access due to their additional
circuitry [55, 92] and up to an additional 10% slowdown due
to techniques that operate DRAM at a slower speed to reduce
the chances of random bit flips due to electrical interference
in higher-density devices that pack more and more cells per
square nanometer [148]. In addition, whenever an error is detected
or corrected on modern hardware, the processor raises
an interrupt that must be serviced by the system firmware
(e.g., BIOS), incurring up to 100 µs of latency (roughly 2000×
the latency of a typical 50 ns memory access [58]) and
leading to unpredictable slowdowns and sometimes even
system hangs [116].
In terms of reliability, memory errors can cause an ap-
plication to slow down, hang, crash, or produce incorrect
results [40]. Software-level techniques such as the retire-
ment of regions of memory with errors [55, 76, 116, 118, 150]
have been proposed to reduce the occurrence of memory
error correction events and prevent correctable errors from
turning into uncorrectable errors over time. Hardware-level
techniques, such as those listed in Table 1, are used to detect
and correct errors without software intervention (but with
additional hardware cost). All of these techniques are applied
homogeneously to memory systems in a one-size-fits-all
manner.
Our goal in our DSN 2014 paper [104] is to (1) understand
how tolerant dierent data-intensive applications and dif-
ferent memory regions of each application are to memory
errors, and (2) design a new memory system organization
that matches hardware reliability to the error tolerance of
the application and the memory region in order to reduce
system cost. The main idea of our approach is to classify
applications and memory regions based on their memory
error tolerance, and map applications and memory regions
to heterogeneous-reliability memory (HRM) system designs
managed cooperatively between hardware and software to
reduce system cost. We make the following contributions:
1. A new methodology to quantify the tolerance of appli-
cations and their memory regions to memory errors. Our
approach measures the eect of memory errors on appli-
cation correctness and quanties an application’s ability
to mask or recover from memory errors.
2. A comprehensive characterization of the memory error
tolerance of three data-intensive workloads: an interactive
web search application [104,138], an in-memory key–value
store [34, 104], and a graph mining framework [103, 104].
We nd that there exists an order of magnitude dierence
in memory error tolerance across these three applications.
We also nd that there exists an order of magnitude dier-
ence in memory error tolerance across dierent memory
regions of each application.
3. An exploration of the design space of a family of
new memory system organizations, called heterogeneous-
reliability memory, which combines a heterogeneous mix
of reliability techniques that leverage application and mem-
ory region error tolerance to reduce system cost. We show
that an example use of our techniques reduces server hard-
ware cost by 4.7%, while achieving 99.90% single server
availability, based on a preliminary evaluation of an exam-
ple HRM system.
2. Characterizing Memory Error Tolerance
We characterize three commonly-used data-intensive ap-
plications to quantify their tolerance to memory errors:
• WebSearch [138], an interactive web search application,
• Memcached [34], an in-memory key-value store, and
• GraphLab [103], a graph mining framework.
We run these three applications in real production systems,
and sample hundreds to tens of thousands of unique memory
addresses for each application.
2.1. Characterization Methodology
To understand how tolerant different data-intensive applications
are to memory errors, our characterization consists of
three components: (1) characterizing the outcomes of mem-
ory errors on an application based on how they propagate
through an application’s code and data, (2) characterizing
how safe or unsafe it is for memory errors to occur in different
regions of an application’s data, and (3) determining how
amenable an application’s data is to recovery in the event of
an error. We describe the implementation of each component
in detail in Sections III and IV of our DSN 2014 paper [104].
We characterize an application’s vulnerability to a memory
error based on its behavior after a memory error is introduced
(we assume for the moment that no error detection or cor-
rection is being performed). Figure 1 shows a taxonomy of
memory error outcomes. Our taxonomy is mutually exclusive
(no two outcomes occur simultaneously) and exhaustive (it
captures all possible outcomes). At a high level, a memory
error may be either (1) masked by an overwrite, in which
case it is never detected and causes no change in application
behavior; or (2) consumed by the application. In the case
that an error is consumed by the application, it may either
(2.1) be masked by application logic, in which case it is never
detected and causes no change in application behavior; (2.2)
cause the application to generate an incorrect response; or
(2.3) cause the application or system to crash.
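[Figure 1 diagram: a memory error is either (1) masked by overwrite or (2) consumed by the application; a consumed error is (2.1) masked by logic, (2.2) leads to an incorrect response, or (2.3) leads to a system/application crash. The diagram groups the outcomes under "Correct Result" and "Incorrect Result".]
Figure 1: Memory error outcomes. Reproduced from [104].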
When we refer to the tolerance of an application to memory
errors, we mean the likelihood that an error occurring in some
data results in outcome (1) or (2.1). Conversely, when we
refer to the vulnerability of an application to memory errors,
we mean the likelihood that an error occurring in some data
results in outcome (2.2) or (2.3).
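As an illustration of these two definitions, the following sketch computes tolerance and vulnerability from a log of observed outcomes. The enum and function names, as well as the sample log, are hypothetical and chosen only for this example; they do not come from the original study.

```python
from collections import Counter
from enum import Enum


class Outcome(Enum):
    """The four leaf outcomes of the taxonomy in Figure 1."""
    MASKED_BY_OVERWRITE = "1"
    MASKED_BY_LOGIC = "2.1"
    INCORRECT_RESPONSE = "2.2"
    CRASH = "2.3"


# Outcomes (1) and (2.1) count as tolerated errors.
TOLERATED = {Outcome.MASKED_BY_OVERWRITE, Outcome.MASKED_BY_LOGIC}


def tolerance(outcomes):
    """Fraction of injected errors that end in outcome (1) or (2.1)."""
    counts = Counter(outcomes)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no outcomes recorded")
    return sum(counts[o] for o in TOLERATED) / total


def vulnerability(outcomes):
    """Fraction of injected errors that end in outcome (2.2) or (2.3)."""
    return 1.0 - tolerance(outcomes)


# Hypothetical log of ten injected errors (made-up data for illustration).
observed = ([Outcome.MASKED_BY_OVERWRITE] * 6 + [Outcome.MASKED_BY_LOGIC] * 2
            + [Outcome.INCORRECT_RESPONSE] + [Outcome.CRASH])
print(f"tolerance={tolerance(observed):.2f}, "
      f"vulnerability={vulnerability(observed):.2f}")
```

Because the taxonomy is mutually exclusive and exhaustive, the two quantities always sum to one.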
We have three design goals when implementing our
methodology for quantifying application memory error tol-
erance. First, due to the sporadic and inconsistent nature
of memory errors in the eld [65, 72, 100, 116, 135, 141, 145,
146, 147], we want to design a framework that emulates the
occurrence of a memory error in an application’s data in a
controlled manner. Second, we want an efficient way to
measure how an application accesses its data. Third, we want
our framework to be easily adaptable to other workloads or
system congurations.
Figure 2 shows a flow diagram illustrating the five steps
involved in our error emulation framework. We assume that
the application under examination has already been run out-
side of the framework and its expected output without any
memory errors has been recorded. The framework proceeds
as follows. (1) We start the application under the error injec-
tion framework. Our memory error emulation framework
is described in Section IV of our DSN 2014 paper [104]. (2)
We use software debuggers (see footnote 1) to inject the desired number and
types of memory errors. (3) We initiate the connection of a
client and start executing the desired workload. (4) Through-
out the course of the application’s execution, we check to
see if the machine has crashed; if it has, we log this outcome
and proceed to step (1) to begin testing once again. (5) If
the application nishes its workload, we check to see if its
output matches the expected results; if the output does not
match the expected results, we log this outcome and proceed
to step (1) to test again. Each run injects a particular pattern
of errors into the application. We can run this framework as
many times as needed to test an application with different
patterns of injected errors.
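The five steps can be sketched as a small in-process simulation. This is not the debugger-based framework used in the study: the toy workload, the byte-array "memory", and all names below are assumptions made for illustration, and the sketch injects one single-bit flip per run.

```python
import random

CLEAN = bytes(range(16))  # hypothetical initial memory image


def toy_app(memory: bytearray) -> int:
    """Hypothetical workload: checksum over the first half of memory.

    The second half is never read, so errors there are never consumed.
    A set top bit in byte 0 stands in for a corruption that crashes the app.
    """
    if memory[0] & 0x80:
        raise RuntimeError("simulated crash")
    return sum(memory[: len(memory) // 2])


def run_campaign(app, clean: bytes, trials: int, seed: int = 0) -> dict:
    rng = random.Random(seed)
    expected = app(bytearray(clean))           # golden output, recorded beforehand
    log = {"correct": 0, "incorrect": 0, "crash": 0}
    for _ in range(trials):
        memory = bytearray(clean)              # (1) (re)start the application
        addr = rng.randrange(len(memory))      # (2) inject one single-bit error
        memory[addr] ^= 1 << rng.randrange(8)
        try:
            result = app(memory)               # (3) run the client workload
        except Exception:
            log["crash"] += 1                  # (4) crash: log it and restart
            continue
        # (5) compare the output with the expected result
        log["correct" if result == expected else "incorrect"] += 1
    return log


print(run_campaign(toy_app, CLEAN, trials=200))
```

Each loop iteration corresponds to one pass through Figure 2; repeating the loop with different seeds or injection patterns yields the outcome counts from which tolerance is estimated.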
[Figure 2 flow diagram: Start, then (1) (Re)Start App, (2) Inject Errors (Soft/Hard), (3) Run Client Workload, (4) App Crash? (YES: repeat from step 1; NO: continue), (5) Compare Result with Expected Output, then repeat.]
Figure 2: Memory error emulation framework. Reproduced
from [104].
There are two main types of memory errors: (1) soft
or transient errors and (2) hard or recurring errors (see footnote 2). Soft
memory errors occur at random due to charged particle
emissions from chip packaging or the atmosphere [110].
Hard memory errors may occur from physical device defects
or wearout [55, 141, 146], and are influenced by environmental
factors such as humidity, temperature, and utilization
[141, 144, 147]. Hard errors typically affect multiple
bits (for example, large memory regions and entire DRAM
chips have been shown to fail [55, 146, 147]). Our charac-
terization covers single-bit soft and hard errors. For a de-
tailed background on DRAM, we refer the reader to prior
works [24, 25, 26, 27, 51, 52, 66, 71, 72, 73, 74, 75, 83, 84, 85, 86, 87,
99, 100, 131, 142, 143].
Footnote 1: WinDBG [119] in Windows and GDB [42] in Linux.
Footnote 2: Recent studies [62, 64, 65, 72, 100, 135] examined the effects of intermittent
and access-pattern dependent errors, which are increasingly common as
DRAM technology scales down to smaller technology nodes [120].
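The difference between single-bit soft and hard errors can be made concrete with a small emulated memory. In this sketch, which is our own simplification and not the study's implementation, a soft error is a one-time bit flip that disappears when the location is overwritten, while a hard error is modeled as a stuck-at bit that persists across writes. The class and method names are hypothetical.

```python
class EmulatedMemory:
    """Byte-addressable memory with optional stuck-at (hard) faults."""

    def __init__(self, size: int):
        self._data = bytearray(size)
        self._stuck = {}  # addr -> (bit mask, value the bit is stuck at)

    def write(self, addr: int, value: int) -> None:
        self._data[addr] = value & 0xFF

    def read(self, addr: int) -> int:
        value = self._data[addr]
        if addr in self._stuck:
            mask, stuck_value = self._stuck[addr]
            # A stuck bit overrides whatever was last written.
            value = (value | mask) if stuck_value else (value & ~mask)
        return value

    def inject_soft(self, addr: int, bit: int) -> None:
        """Transient error: flip the stored bit once."""
        self._data[addr] ^= 1 << bit

    def inject_hard(self, addr: int, bit: int, stuck_value: int) -> None:
        """Recurring error: the bit always reads back as stuck_value."""
        self._stuck[addr] = (1 << bit, stuck_value)


mem = EmulatedMemory(4)
mem.write(0, 0)
mem.inject_soft(0, 1)
corrupted = mem.read(0)      # flipped bit is visible
mem.write(0, 0)
after_rewrite = mem.read(0)  # overwrite masks the soft error

mem.inject_hard(1, 0, 1)
mem.write(1, 0)
still_faulty = mem.read(1)   # the hard error survives the write
print(corrupted, after_rewrite, still_faulty)
```

The soft error here is masked by overwrite (outcome 1 in the taxonomy), whereas the hard error keeps reappearing, which is one reason the two error types can lead to different application-level outcomes.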
2.2. Key Findings
We summarize two of the most important findings from
our characterization below. We briefly list four other findings
in Section 2.3, and describe all six of our findings in detail in
Section V-B of our DSN 2014 paper [104].
Finding 1: Error Tolerance Varies Across Applications.
Figure 3(a) plots the probability of each of the three
evaluated applications crashing due to the occurrence of
single-bit soft or hard errors in their memory (we call this
application-level memory error vulnerability). For cases where
the application does not crash, Figure 3(b) plots the rate of
incorrect results per billion application queries under the
same conditions. We draw two key observations from these
results.
[Figure 3 plots: (a) Application vulnerability: probability of crash (%), 0 to 15, for WebSearch, Memcached, and GraphLab under soft and hard errors; (b) Application incorrectness: number of incorrect results per billion queries, log scale from 1E+0 to 1E+8, for the same applications and error types.]
Figure 3: Inter-application variations in vulnerability to
single-bit soft and hard memory errors for the three applications
in terms of (a) probability of crash and (b) frequency
of incorrect results. Reproduced from [104].
First, there exists a signicant variance in vulnerability
among the three applications both in terms of crash proba-
bility and in terms of incorrect result rate, which varies by
up to six orders of magnitude. Second, these characteristics
may dier depending on whether errors are soft or hard (for
example, the number of incorrect results for WebSearch dif-
fers by over two orders of magnitude between soft and hard
errors, with hard errors being more problematic). We there-
fore conclude that memory reliability techniques that treat
all applications similarly are inecient because there exists
signicant variation in error tolerance among applications.
Finding 2: Error Tolerance Varies Within an Appli-
cation. Figure 4(a) plots the probability of each of the three
applications crashing due to the occurrence of single-bit soft
or hard errors in dierent regions of their memory address
space. Figure 4(b) plots the rate of incorrect results per billion
queries under the same conditions, for cases where a crash
did not occur.
We make two observations from Figure 4. First, for some
memory regions, the probability of an error leading to a crash
is much lower than for others (for example, in WebSearch, the
probability of a hard error leading to a crash in the heap or
private memory regions is much lower than in the stack memory
region). Second, even in the presence of memory errors,
some regions of some applications are still able to tolerate
memory errors (perhaps at reduced correctness). This may be
acceptable for applications such as WebSearch that aggregate
results from several servers before presenting them to the
user, in which case the likelihood of the user being exposed
to an error is much lower than the reported probabilities. We
therefore conclude that memory reliability techniques that
treat all memory regions within an application similarly are
inefficient because there exists significant variance in the error
tolerance among different memory regions.
[Figure 4 plots: (a) Memory region vulnerability: probability of crash (%), 0 to 14, for the private, heap, and stack regions under soft and hard errors; (b) Memory region incorrectness: number of incorrect results per billion queries, log scale from 1E+0 to 1E+8, for the same regions and error types.]
Figure 4: Memory region variations in vulnerability to single-bit
soft and hard memory errors for the applications in terms
of (a) probability of crash and (b) frequency of incorrect results.
Reproduced from [104].
2.3. Other Findings
In Section V-B of our DSN 2014 paper [104], we discuss four
other ndings that we make based on our characterization
data. These ndings focus on the memory error tolerance of
WebSearch, which we nd to be representative of the behavior
of all three of our characterized applications. In particular,
we nd that:
• More severe failures (i.e., failures that lead to system down-
times) due to memory errors tend to crash the application
or system quickly, while less severe failures tend to gener-
ate incorrect results periodically.
• Some memory regions are safer than others. This indicates
that either an application’s access pattern or computational
operations on different memory regions can be
the dominant factor to mask a majority of memory errors.
• More severe errors mainly decrease correctness, as opposed
to increasing an application’s probability of crashing.
• Data recoverability varies across memory regions. For
data-intensive applications like WebSearch, software-only
memory error tolerance techniques are a promising direc-
tion for enabling reliable system designs.
3. Heterogeneous-Reliability Memory
Based on the ndings from our experimental charac-
terization, we propose heterogeneous-reliability memory
(HRM), a software/hardware cooperative framework that em-
ploys dierent levels of memory reliability within a single
main memory subsystem to optimize datacenter cost based
on the memory error tolerance level of applications and their
memory regions. We examine three dimensions, and their
benets and trade-os in the design space, for systems with
heterogeneous reliability memory: (1) hardware techniques
to detect and correct errors, (2) software responses to errors,
and (3) the granularity at which dierent techniques are used.
Table 2 lists the techniques we considered in each of the
dimensions along with their potential benefits and trade-offs.
Using WebSearch as an example application, we evaluate
and compare ve example design points (three non-HRM
systems, and two HRM systems):
• Typical Server (non-HRM): A baseline configuration resembling
a typical server deployed in a modern datacenter.
All memory is homogeneously protected using SEC-DED
ECC.
• Consumer PC (non-HRM): Consumer PCs typically
have no hardware protection against memory errors, re-
ducing both their cost and reliability.
• Detect&Recover (HRM): Based on our observation that
some memory regions are safer than others, we consider
an HRM system design that, for the private region, uses
parity in hardware to detect errors and responds by cor-
recting them with a clean copy of data from disk in soft-
ware (Par+R, parity and recovery), and uses neither error
detection nor correction for the rest of its data.
• Less-Tested (L; non-HRM): Testing increases both the
cost and average reliability of memory devices [120, 121,
131]. This system examines the implications of using less-
thoroughly-tested memory throughout the entire memory
system.
• Detect&Recover/L (HRM): This system evaluates the De-
tect&Recover design with less-tested memory. ECC is used
in the private region and Par+R in the heap to compensate
for the reduced reliability of the less-tested memory.
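The Par+R idea used in the Detect&Recover designs can be illustrated with a minimal sketch: a parity bit per 64-bit word detects an error, and software restores the word from a clean copy. The list standing in for the on-disk copy, the class name, and the word size are assumptions made for this example; a real system would detect the error in the memory controller and recover through the OS.

```python
WORD_MASK = (1 << 64) - 1


def parity(word: int) -> int:
    """Parity bit of a 64-bit word: 1 if the number of set bits is odd."""
    return bin(word & WORD_MASK).count("1") & 1


class ParityProtectedRegion:
    """Par+R sketch: parity detection plus software recovery from a clean copy."""

    def __init__(self, clean_copy):
        self._backing = list(clean_copy)   # stands in for the clean copy on disk
        self.words = list(clean_copy)      # the (possibly faulty) in-memory data
        self.parity_bits = [parity(w) for w in self.words]
        self.recoveries = 0

    def read(self, index: int) -> int:
        word = self.words[index]
        if parity(word) != self.parity_bits[index]:  # error detected
            word = self._backing[index]              # reload the clean copy
            self.words[index] = word
            self.recoveries += 1
        return word


region = ParityProtectedRegion([0x0123456789ABCDEF, 42])
region.words[0] ^= 1 << 5          # emulate a single-bit error
restored = region.read(0)          # detected by parity, then recovered
print(hex(restored), region.recoveries)
```

Parity only detects an odd number of flipped bits, so a double-bit error in the same word would pass unnoticed in this sketch. Recovery also presumes that a clean copy exists, which is why the design applies it only to regions whose data can be reloaded.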
Section VI-A of our DSN 2014 paper [104] discusses (1) the
metrics we use to evaluate the benefits and costs of the designs,
and (2) the memory error model we use to examine
the effectiveness of the five designs. We refer the reader to
Section VI-A in [104] for details.
Our evaluation illustrates the inefficiencies of traditional
homogeneous approaches to memory system reliability, as
well as the benefits of heterogeneous-reliability memory system
designs. Figure 5 shows the cost savings and single
server availability for our five evaluated design points. We
observe from the figure that the two highlighted example
HRM design points (in orange color), which leverage our
heterogeneous-reliability memory system design, both can
achieve our target single server availability of 99.90% while
reducing server hardware cost by 2.9% and 4.7% respectively.
We therefore conclude that heterogeneous-reliability memory
system designs can enable systems to achieve both high cost
savings and high single server availability/reliability at the
same time.
[Figure 5 bar charts: server HW cost savings (%) and single server availability (%) for the design points Typical Server, Consumer PC, Detect&Recover, Less-Tested (L), and Detect&Recover/L. Values printed on the bars: cost savings 3.3, 2.9, 8.1, 4.7; availability 99.55, 99.93, 97.78, 99.90.]
Figure 5: Comparison of server hardware cost savings and
single server availability for the five design points. Results
extracted from [104]. Orange bars indicate HRM designs.
Section VI of our DSN 2014 paper [104] contains a de-
tailed analysis of HRM, including (1) memory cost savings
(Section VI-B of [104]), (2) the expected crash and incorrect
query frequency for each configuration (Section VI-B
of [104]), (3) the maximum number of tolerable errors per
month for each application to achieve a reliability target (Sec-
tion VI-B of [104]), and (4) a discussion of hardware/software
support for and feasibility of HRM (Section VI-C of [104]).
Table 2: Heterogeneous reliability design dimensions, example techniques, and their potential benefits and trade-offs.
Adapted from [104].

Design dimension: Example hardware techniques
Technique                           Benefits                                          Trade-offs
No detection/correction             No associated overheads (low cost)                Unpredictable crashes and silent data corruption
Parity                              Relatively low cost with detection capability     No hardware correction capability
SEC-DED/DEC-TED                     Tolerate common single-/double-bit errors         Increased cost and memory access latency
Chipkill [35]                       Tolerate single-DRAM-chip errors                  Increased cost and memory access latency
Mirroring [53]                      Tolerate memory module failure                    100% capacity overhead
Less-Tested DRAM                    Saved testing cost during manufacturing           Increased error rates

Design dimension: Example software responses
Technique                           Benefits                                          Trade-offs
Consume errors in application       Simple, no performance overhead                   Unpredictable crashes and data corruption
Automatically restart application   Can prevent unpredictable application behavior    May make little progress if error is frequent
Retire memory pages                 Low overhead, effective for repeating errors      Reduces memory space (usually very little)
Conditionally consume errors        Flexible, software vulnerability-aware            Memory management overhead to make decision
Software correction                 Tolerates detectable memory errors                Usually has performance overheads

Design dimension: Usage granularity
Technique                           Benefits                                          Trade-offs
Physical machine                    Simple, uniform usage across memory space         Costly depending on technique used
Virtual machine                     More fine-grained, flexible management            Host OS is still vulnerable to memory errors
Application                         Manageable by the OS                              Does not leverage different region tolerance
Memory region                       Manageable by the OS                              Does not leverage different page tolerance
Memory page                         Manageable by the OS                              Does not leverage different data object tolerance
Cache line                          Most fine-grained management                      Large management overhead; software changes

We summarize the key empirical findings here:
• Our two example HRM designs, Detect&Recover and Detect&Recover/L,
reduce memory costs by 9.7% and 15.5%,
respectively, compared to the cost of the Typical Server
system, which does not use HRM.
• The two example HRM designs limit the number of crashes
to 3 and 4 per server per month, respectively, and limit the
incorrect query frequency to 9 and 12 per million queries,
respectively.
• Without any error detection/correction, two out of our
three evaluated applications (WebSearch and Memcached)
are able to achieve 99.00% single server availability.
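To put the availability targets in perspective, the sketch below converts an availability percentage into a monthly downtime budget and into the number of crashes that budget can absorb. The 30-day month and the 10-minute recovery time per crash are assumptions chosen for illustration; they are not taken from this summary.

```python
MINUTES_PER_MONTH = 30 * 24 * 60  # assumption: a 30-day month


def downtime_budget_minutes(availability_pct: float) -> float:
    """Minutes per month a server may be down at the given availability."""
    return MINUTES_PER_MONTH * (1.0 - availability_pct / 100.0)


def max_crashes_per_month(availability_pct: float, recovery_minutes: float) -> int:
    """Whole crashes that fit in the downtime budget.

    recovery_minutes is a hypothetical per-crash recovery time.
    """
    return int(downtime_budget_minutes(availability_pct) // recovery_minutes)


for target in (99.90, 99.00):
    print(f"{target:.2f}% availability: "
          f"{downtime_budget_minutes(target):.1f} min/month of downtime, "
          f"{max_crashes_per_month(target, recovery_minutes=10)} crashes "
          f"at an assumed 10 min per crash")
```

Under these assumptions, a 99.90% target leaves about 43 minutes of downtime per month and a 99.00% target about 432 minutes, which shows how much more error-induced downtime the lower target can absorb.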
We therefore conclude that heterogeneous-reliability mem-
ory system designs can enable systems to achieve both high
cost savings and high single server availability/reliability
at the same time. We believe that there is significant opportunity
in many data-intensive applications for reducing
server hardware cost while achieving high single server avail-
ability/reliability using our heterogeneous-reliability design
methodology.
4. Related Work
To our knowledge, our DSN 2014 paper [104] is the first
to (1) perform a comprehensive analysis of memory error
vulnerability for data-intensive datacenter applications across
a range of different memory error types; (2) propose the idea of
heterogeneous reliability memory, which consists of multiple
memory types with different levels of reliability and error
handling mechanisms; and (3) evaluate the cost-effectiveness
of different heterogeneous-reliability memory organizations
with hardware/software cooperation. We discuss related re-
search in memory error vulnerability and DRAM architecture
below, categorizing the works into six broad classes: (1) mem-
ory errors in datacenters, (2) characterizing application error
tolerance, (3) hardware-based memory reliability techniques,
(4) software-based memory reliability techniques, (5) exploit-
ing application error tolerance, and (6) heterogeneous (hy-
brid) memory architectures.
Studies of Memory Errors. Various works [92, 116, 141,
145, 146, 147] have studied error rates of DRAM devices
deployed in production datacenters, examining failures
across a large sample size. These works note that memory
errors occur frequently in datacenters, and are induced by a
number of error sources. In particular, one of these studies
empirically demonstrates the increased memory errors and
increased memory cost to tolerate these errors in large-scale
datacenters [116]. A recent work [48] examines how vari-
ous hardware and software techniques to detect and mitigate
errors introduce signicant performance degradation in pro-
duction datacenters. This work shows that for WebSearch,
software error handling techniques can induce a performance
overhead of 3746× [48]. These studies motivate the need for
a low-overhead, cost-eective approach to memory reliabil-
ity, and motivate us to further explore hardware–software
cooperative techniques such as HRM.
There are several studies that characterize various sources
of errors in DRAM at a ne granularity. Many of these works
observe how specic factors aect DRAM errors, analyz-
ing the impact of temperature [37, 86] and hard errors [55].
A large number of works study errors through controlled
experiments, usually using FPGA-based DRAM testing in-
frastructures like SoftMC [51], to investigate errors due to
retention time [51, 62, 63, 64, 65, 99, 100, 131, 135], disturbance
from neighboring DRAM cells [60, 70, 72, 120], latency vari-
ation across/within DRAM chips [21, 23, 25, 82, 83, 86], and
supply voltage [23, 27]. None of these works study memory
errors in a system with heterogeneous-reliability memory.
Classifying Application Error Tolerance. Error injec-
tion techniques based on hardware watchpoints [92, 112], bi-
nary instrumentation [89], and architectural simulation [93]
have been used to investigate the impact of memory errors
on application behavior, including execution times, applica-
tion/system crashes, and output correctness. These works
study a range of applications including SPEC CPU benchmarks,
web servers, databases, and scientific applications. In
general, these works conclude that not all memory errors
cause application/system crashes and many memory errors
can be tolerated with minimal difference in the application
outputs. We generalize this observation to data-intensive
applications, and leverage it to reduce datacenter TCO. Re-
cent work [149] develops a Markov-chain model for the error
tolerance of HPC applications. Approximate computing tech-
niques [8, 39, 57], where the precision of program output can
be relaxed to achieve better performance or energy efficiency,
offer further opportunities for leveraging the error-tolerance
of application data, though these typically require very care-
ful changes to the program source code.
Hardware-Based Memory Reliability Techniques.
There are various ECC techniques for memory, and we list
the most dominant ones in Table 1. Using eight bits, SEC-DED
can correct a single bit flip and detect up to two bit
flips out of every 64 bits. DEC-TED is a generalization of
SEC-DED that uses fourteen bits to correct two and detect
three flipped bits out of every 64 bits. Chipkill [35] improves
reliability by interleaving error detection and correction data
among multiple DRAM chips. RAIM [111] is able to tolerate
entire DIMMs failing by storing detection and correction data
across multiple DIMMs. Virtualized ECC [155] maps ECC to
software-visible locations in memory so that software can
decide what ECC protection to use. While Virtualized ECC
can help reduce the DRAM hardware cost of memory reliability,
it requires modification to the processor’s memory
management unit and cache(s).
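To make the SEC-DED storage cost quoted above concrete, the following minimal sketch derives the eight check bits per 64 data bits and the resulting 12.5% capacity overhead. It assumes SEC-DED is realized as an extended Hamming code; the function names are ours and are introduced only for illustration.

```python
def sec_ded_check_bits(data_bits: int) -> int:
    """Check bits needed by an extended Hamming (SEC-DED) code."""
    r = 1
    # Single-error correction requires 2^r >= data_bits + r + 1, so that
    # every single-bit error (and the no-error case) has a unique syndrome.
    while (1 << r) < data_bits + r + 1:
        r += 1
    # One additional overall-parity bit adds double-error detection.
    return r + 1


def capacity_overhead(data_bits: int, check_bits: int) -> float:
    """Extra storage as a fraction of the protected data."""
    return check_bits / data_bits


bits = sec_ded_check_bits(64)
print(bits, capacity_overhead(64, bits))  # prints: 8 0.125
```

The 0.125 result corresponds to the ninth chip commonly added to an eight-chip rank on an ECC DIMM.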
Recent works propose new hardware-based techniques to
tolerate soft and hard memory errors efficiently. We break
these down into four categories: (1) Tolerating soft errors:
BambooECC [67] proposes a new single-tier ECC family
that enables adaptive graceful downgrade of ECC capabilities.
CleanECC [45] provides both high memory reliability and
flexible memory access granularity by using fine-grained error
detection and coarse-grained error correction. XED [126]
uses in-DRAM ECC to reduce the overhead of double Chip-
kill. (2) Tolerating hard errors: ArchShield [124] proposes
an architectural framework to identify and tolerate hard er-
rors caused by DRAM cell failures. Citadel [125] proposes
to tolerate large-granularity failures, such as row/bank fail-
ures, by replacing them with spares. Other works propose to
identify and mitigate potentially recurring memory errors by
page offlining [116], online testing [65, 131], and multi-rate
refresh [99, 135]. (3) Reducing memory cost: FrugalECC [68]
proposes a new flexible granularity compression to reduce
the redundancy and energy consumption of ECC. Morphable
ECC [29] proposes to reduce DRAM refresh overhead by re-
ducing ECC strength to 6-bit ECC when the DRAM is in idle
mode. (4) End-to-end memory error protection: AIECC [69]
provides end-to-end protection for clock, control, command,
and address (CCCA) signals in addition to data signals.
Software-Based Memory Reliability Techniques. Previous
works (e.g., [55, 116, 140, 150]) show that the OS retiring
memory pages after a certain number of errors can eliminate
up to 96.8% of detected memory errors. While these tech-
niques improve system reliability, they still require costly
ECC hardware for detecting and identifying memory pages
with errors. Other works attempt to reduce the impact of
memory errors on system reliability by writing more reliable
software [7], modifying the OS memory allocator [132], or
using a compiler to generate a more error-tolerant version of
the program [5,22]. Other algorithmic solutions (e.g., memory
bounds checks [88], watchdog timers [88], and checkpoint
recovery [30,31,32,90,91,95,96,97,153,154]) can also be used
to improve resilience to memory errors.
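The page retirement idea described above can be sketched as a per-page error counter with a retirement threshold. This is a minimal illustration, not the mechanism of any cited work: the class name, the threshold parameter, and the page identifiers are all hypothetical, and error detection itself is assumed to come from ECC hardware, as noted above.

```python
from collections import Counter


class PageRetirementPolicy:
    """Retire a physical page once it accumulates `threshold` detected errors."""

    def __init__(self, threshold: int):
        self.threshold = threshold
        self.error_counts = Counter()   # detected errors per page
        self.retired = set()            # pages no longer handed out

    def report_error(self, page: int) -> bool:
        """Record one detected error; return True if the page is retired now."""
        if page in self.retired:
            return False                # already retired, nothing to do
        self.error_counts[page] += 1
        if self.error_counts[page] >= self.threshold:
            self.retired.add(page)
            return True
        return False


policy = PageRetirementPolicy(threshold=2)
policy.report_error(7)                  # first error: page stays in use
policy.report_error(7)                  # second error: page is retired
print(sorted(policy.retired))           # prints: [7]
```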
Li et al. [98] propose to deploy software-based ECC in
an in-memory key-value store, and show that it incurs low
performance overhead. Recent works [149,161] improve upon
traditional RAIM-3 and use selective replication to reduce
unnecessary memory redundancy. SDECC [46, 47] proposes
to use strong error detection in the hardware, while tolerating
hard memory errors and recovering from soft errors in the
software.
Exploiting Application Error Tolerance. Flikker [102]
proposes a technique to trade off DRAM reliability for energy
savings. It relies on the programmer to separate application
data into vulnerable or tolerant data. Less reliable mobile
DIMMs have been proposed [109, 156] as a replacement for
ECC DIMMs in servers to improve energy efficiency. Recent
work [128] shows that RAMCloud can recover 35 GB of
data from a failed server in 1.6 seconds using log-structured
storage.
Heterogeneous (Hybrid) Memory Architectures. Var-
ious recent works (e.g., [1, 6, 28, 36, 44, 94, 101, 113, 114, 133,
134, 136, 137, 157, 158, 160]) explore the use of heterogeneous
memory architectures, consisting of multiple different types
of memories. These works are mainly concerned with either
mitigating the overheads of emerging memory technologies
or improving performance and power efficiency. They do not
investigate the use of multiple devices with different error
correction capabilities. CREAM [107] and Odd-ECC [108] develop
low-cost techniques to provide flexible provisioning of
memory error correction capabilities. Recent works [3, 152]
apply our heterogeneous reliability idea to processor caches
to achieve better cost-reliability trade-offs.
5. Significance and Long-Term Impact
We believe that our DSN 2014 paper [104] will have long-
term impact for three major reasons. First, it emphasizes
and aims to solve the increasing cost of ensuring memory
reliability as the error rates of memory devices continue to
grow, which is a major trend as memory technology scales
to smaller technology nodes [120, 121]. Second, it tackles
memory system cost in datacenters, which is a problem that
we expect will be increasingly important in the future. Third,
it proposes a novel framework that uses hardware–software
co-design to improve memory system reliability as well as
cost, thereby hopefully inspiring future works to exploit soft-
ware characteristics to improve system reliability and reduce
system cost (and other important metrics).
Increasing Memory Error Rate. As DRAM scales to
smaller process technology nodes, the reliability of DRAM
continues to degrade [55, 61, 116, 120, 121, 122, 123, 141, 145,
146, 147]. For example, recent works 1) show the existence
of disturbance errors in commodity DRAM chips operating
in the field [72, 120]; 2) experimentally demonstrate the
increasing importance of retention-related failures in modern
DRAM devices [51, 62, 63, 64, 65, 99, 100, 131, 135];
3) examine the trade-off between DRAM reliability and latency
[21, 23, 25, 27, 51, 66, 82, 83, 86]; and 4) advocate, including
in a paper co-written by the Samsung and Intel memory de-
sign teams [61], for the use of in-DRAM error correcting
codes to overcome the reliability challenges [61, 135]. As a result
of decreasing DRAM reliability, maintaining the effective
error rate at the levels we have today can (1) increase DRAM
cost due to decreased yield, expensive quality assurance tests,
and/or extra capacity for storing stronger error-correcting
codes; or (2) reduce DRAM performance due to frequent
error correction and logging. All of these solutions might
make DRAM technology scaling more difficult and less appealing
[120, 121, 122]. Our paper proposes a solution that
enables the use of DRAM with higher error rates while still
achieving reasonable application reliability, which can enable
much more efficient scaling of DRAM to smaller technology
nodes in the future.
Other memory technologies such as NAND flash memory
[9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 43, 105, 106, 115],
phase-change memory (PCM) [79, 80, 81, 114, 117, 136] and
STT-MRAM [78, 114] also show a similar decreasing trend
in their reliability with process technology scaling and the
advent of multi-level cell (MLC) technology [121]. For example,
like DRAM, NAND flash memory suffers from retention
errors [9, 10, 11, 13, 14, 15], cell-to-cell program interference
errors [9, 10, 11, 14, 17, 19], and read disturb errors
[9, 10, 11, 14, 20]. Additionally, NAND flash memory
suffers from program/erase cycling errors [14, 18], and programming
errors [12, 105, 130]. PCM suffers from endurance
issues [79, 81, 136] and resistance drift [56]. HRM can be applied
to these memory technologies with slight modifications
to enable reliable high-density non-volatile devices in the
future.
Increasing Datacenter Cost. Recent studies have shown
that capital costs can account for the majority (e.g., around
57% in [4]) of datacenter TCO (total cost of ownership). As
part of the cost of a server, the cost of the memory is com-
parable to that of the processors [77], and is likely to exceed
processor cost and become the dominant cost for servers
running data-intensive applications such as web search and
social media services [34, 103, 138]. As future datacenters
grow in scale, datacenter TCO will become an increasingly
important factor in system design. Our paper demonstrates
a way of optimizing datacenter TCO by reducing the cost of
the memory system. The cost savings can be significant due
to the increasing scale of such datacenters [50], which we hope
will make our proposed technique more important in the future.
Hardware–Software Co-Design. Our solution,
heterogeneous-reliability memory, utilizes hardware–
software cooperative design to reduce system cost. Our DSN
2014 paper [104] demonstrates the benefits of exploiting
application characteristics to improve overall system design.
For example, it shows that a significant number of errors can
be corrected in software by reloading a clean copy of the data
from storage. This motivates us to rethink the placement of
different functionalities (such as error detection and error
correction) across different system components and across
software versus hardware to improve the cost–reliability
trade-off.
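The software correction path mentioned above, reloading a clean copy of the data from storage, can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation from our DSN 2014 paper: the class and method names are hypothetical, a CRC computed with Python's zlib.crc32 stands in for whatever detection mechanism the system provides (e.g., hardware parity), and the data is assumed to be unmodified since it was loaded, so that the copy on storage is still valid.

```python
import zlib


class RecoverableRegion:
    """In-memory data that is detected in memory and corrected from storage."""

    def __init__(self, load_from_storage):
        self._load = load_from_storage        # callable returning clean bytes
        self._data = bytearray(self._load())
        self._crc = zlib.crc32(self._data)    # stand-in for error detection
        self.recoveries = 0

    def read(self) -> bytes:
        if zlib.crc32(self._data) != self._crc:
            # Error detected: correct it in software by reloading the clean copy.
            self._data = bytearray(self._load())
            self.recoveries += 1
        return bytes(self._data)

    def inject_bit_flip(self, byte_index: int, bit: int) -> None:
        """Emulate a memory error for testing purposes."""
        self._data[byte_index] ^= 1 << bit


clean_copy = b"read-only index shard"
region = RecoverableRegion(lambda: clean_copy)
region.inject_bit_flip(0, 3)
print(region.read() == clean_copy, region.recoveries)  # prints: True 1
```

The sketch applies only to data that has a valid copy in persistent storage; data modified in memory would need a different recovery strategy.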
Our DSN 2014 paper [104] has started a community discus-
sion [50] on the feasibility of solving the problem of memory
reliability by exploiting application memory error tolerance
in the future, inspiring reporters to ask the question: “How
good does memory need to be?” We hope that our characterization
results and mechanisms will continue to
inspire future works that can provide efficient and extensive
characterization/estimation of application-level memory error
tolerance [41], which can make our proposed technique
applicable to a broader set of applications.
Two example works that build on ours include Odd-ECC
[108] and CREAM [107]. Odd-ECC provides a mechanism
to enable different levels of fault tolerance for the data
stored in a commodity DRAM module. Odd-ECC maps the
ECC bits to a memory address aligned with the data so that
the memory controller can access both the data and the ECC
bits efficiently. CREAM provides a mechanism to dynamically
adjust the tradeoff between memory capacity/bandwidth used
for ECC bits and fault tolerance within an ECC DRAM module.
CREAM proposes several data layouts that reduce page
faults and improve memory performance significantly when
strong fault tolerance is not needed.
6. Conclusion
In our DSN 2014 paper [104], we develop a new method-
ology to quantify the tolerance of applications to memory
errors. Using this methodology, we perform a case study
of three new data-intensive workloads that show, among
other new insights, that there exists a diverse spectrum of
memory error tolerance both within and across these appli-
cations. Based on this observation, we introduce the idea of
heterogeneous-reliability memory (HRM), which combines
multiple different memories that have different reliability
characteristics and error correction capabilities. We propose
new hardware/software heterogeneous-reliability memory
system designs, and evaluate them to show that (1) the
one-size-fits-all approach to reliability in modern servers is
inefficient in terms of cost, and (2) heterogeneous-reliability
systems can achieve the benefits of both low cost and high
single server availability/reliability. We hope that our techniques
can enable the use of lower-cost memory devices to reduce
the server hardware cost of datacenters, and that our analyses
will spur future research on heterogeneous-reliability mem-
ory systems. As DRAM technology scales into small feature
sizes and becomes less reliable and memory cost becomes
more important in datacenters in the future, we hope that
our findings and ideas will inspire more research to improve
the cost–reliability trade-off in memory systems. We believe
different HRM designs can be employed to optimize other key
trade-offs and target metrics (e.g., performance vs. energy
consumption) in modern systems. Our DSN 2014 paper just
scratches the surface of a large amount of research and design
space to be explored.
Acknowledgments
We thank Saugata Ghose for his dedicated effort in the
preparation of this article. We thank the anonymous review-
ers and the members of SAFARI research group for feedback.
We acknowledge the support of Microsoft and Samsung. This
research was partially supported by the Intel Science and
Technology Center for Cloud Computing and the NSF (grants
0953246, 1065112, and 1212962).
References
[1] N. Agarwal and T. F. Wenisch, “Thermostat: Application-Transparent Page Man-
agement for Two-Tiered Main Memory,” in ASPLOS, 2017.
[2] Z. Al-Ars, “DRAM Fault Analysis and Test Generation,” Ph.D. dissertation, Delft,
2005.
[3] S. Arslan, H. R. Topcuoglu, M. T. Kandemir, and O. Tosun, “Performance and
Energy Efficient Asymmetrically Reliable Caches for Multicore Architectures,”
in IPDPSW, 2015.
[4] L. A. Barroso et al., The Datacenter as a Computer: An Introduction to the Design
of Warehouse-Scale Machines. Morgan & Claypool Publishers, 2009.
[5] A. Benso et al., “A C/C++ Source-to-Source Compiler for Dependable Applica-
tions,” in DSN, 2000.
[6] S. Bock, B. R. Childers, R. Melhem, and D. Mossé, “Concurrent Migration of
Multiple Pages in Software-Managed Hybrid Main Memory,” in ICCD, 2016.
[7] C. Borchert et al., “Generative Software-Based Memory Error Detection and Cor-
rection for Operating System Data Structures,” in DSN, 2013.
[8] J. Bornholt et al., “Uncertain<T>: A First-order Type for Uncertain Data,” in
ASPLOS, 2014.
[9] Y. Cai, S. Ghose, E. F. Haratsch, Y. Luo, and O. Mutlu, “Error Characterization,
Mitigation, and Recovery in Flash-Memory-Based Solid-State Drives,” Proc. IEEE,
Sep. 2017.
[10] Y. Cai, S. Ghose, E. F. Haratsch, Y. Luo, and O. Mutlu, “Error Characteri-
zation, Mitigation, and Recovery in Flash Memory Based Solid-State Drives,”
arXiv:1706.08642 [cs.AR], 2017.
[11] Y. Cai, S. Ghose, E. F. Haratsch, Y. Luo, and O. Mutlu, “Errors in Flash-Memory-
Based Solid-State Drives: Analysis, Mitigation, and Recovery,” arXiv:1711.11427
[cs.AR], 2017.
[12] Y. Cai, S. Ghose, Y. Luo, K. Mai, O. Mutlu, and E. F. Haratsch, “Vulnerabilities in
MLC NAND Flash Memory Programming: Experimental Analysis, Exploits, and
Mitigation Techniques,” in HPCA, 2017.
[13] Y. Cai, Y. Luo, E. F. Haratsch, K. Mai, and O. Mutlu, “Data Retention in MLC
NAND Flash Memory: Characterization, Optimization, and Recovery,” in HPCA,
2015.
[14] Y. Cai et al., “Error Patterns in MLC NAND Flash Memory: Measurement, Char-
acterization, and Analysis,” in DATE, 2012.
[15] Y. Cai et al., “Flash Correct-and-Refresh: Retention-Aware Error Management
for Increased Flash Memory Lifetime,” in ICCD, 2012.
[16] Y. Cai et al., “Error Analysis and Retention-Aware Error Management for NAND
Flash Memory,” ITJ, 2013.
[17] Y. Cai et al., “Program Interference in MLC NAND Flash Memory: Characteriza-
tion, Modeling, and Mitigation,” in ICCD, 2013.
[18] Y. Cai et al., “Threshold Voltage Distribution in MLC NAND Flash Memory:
Characterization, Analysis, and Modeling,” in DATE, 2013.
[19] Y. Cai et al., “Neighbor-Cell Assisted Error Correction for MLC NAND Flash
Memories,” in SIGMETRICS, 2014.
[20] Y. Cai, Y. Luo, S. Ghose, and O. Mutlu, “Read Disturb Errors in MLC NAND Flash
Memory: Characterization, Mitigation, and Recovery,” in DSN, 2015.
[21] K. Chandrasekar, S. Goossens, C. Weis, M. Koedam, B. Akesson, N. Wehn, and
K. Goossens, “Exploiting Expendable Process-Margins in DRAMs for Run-Time
Performance Optimization,” in DATE, 2014.
[22] J. Chang et al., “Automatic Instruction-Level Software-Only Recovery,” in DSN,
2006.
[23] K. K. Chang, “Understanding and Improving the Latency of DRAM-Based Mem-
ory Systems,” Ph.D. dissertation, Carnegie Mellon Univ., 2017.
[24] K. K. Chang, D. Lee, Z. Chishti, A. Alameldeen, C. Wilkerson, Y. Kim, and
O. Mutlu, “Improving DRAM Performance by Parallelizing Refreshes With Ac-
cesses,” in HPCA, 2014.
[25] K. K. Chang, A. Kashyap, H. Hassan, S. Ghose, K. Hsieh, D. Lee, T. Li, G. Pekhi-
menko, S. Khan, and O. Mutlu, “Understanding Latency Variation in Modern
DRAM Chips: Experimental Characterization, Analysis, and Optimization,” in
SIGMETRICS, 2016.
[26] K. K. Chang, P. J. Nair, D. Lee, S. Ghose, M. K. Qureshi, and O. Mutlu, “Low-Cost
Inter-Linked Subarrays (LISA): Enabling Fast Inter-Subarray Data Movement in
DRAM,” in HPCA, 2016.
[27] K. K. Chang, A. G. Yaglikci, A. Agrawal, N. Chatterjee, S. Ghose, A. Kashyap,
H. Hassan, D. Lee, M. O’Connor, and O. Mutlu, “Understanding Reduced-Voltage
Operation in Modern DRAM Devices: Experimental Characterization, Analysis,
and Mechanisms,” in SIGMETRICS, 2017.
[28] N. Chatterjee et al., “Leveraging Heterogeneity in DRAM Main Memories to Ac-
celerate Critical Word Access,” in MICRO, 2012.
[29] C. Chou, P. Nair, and M. K. Qureshi, “Reducing Refresh Power in Mobile Devices
with Morphable ECC,” in DSN, 2015.
[30] K. Constantinides, O. Mutlu, and T. Austin, “Online Design Bug Detection: RTL
Analysis, Flexible Mechanisms, and Evaluation,” in MICRO, 2008.
[31] K. Constantinides, O. Mutlu, T. Austin, and V. Bertacco, “Software-Based Online
Detection of Hardware Defects: Mechanisms, Architectural Support, and Evaluation,”
in MICRO, 2007.
[32] K. Constantinides, O. Mutlu, T. Austin, and V. Bertacco, “A Flexible Software-
Based Framework for Online Detection of Hardware Defects,” TC, 2009.
[33] CST, Inc., “Memory Test Background,” http://tinyurl.com/m7c3wf7. 2000.
[34] Danga Interactive, “Memcached,” http://memcached.org/.
[35] T. J. Dell, “A White Paper on the Benefits of Chipkill-Correct ECC for PC Server
Main Memory,” IBM Microelectronics Division, 1997.
[36] S. R. Dulloor, A. Roy, Z. Zhao, N. Sundaram, N. Satish, R. Sankaran, J. Jackson,
and K. Schwan, “Data Tiering in Heterogeneous Memory Systems,” in EuroSys,
2016.
[37] N. El-Sayed, I. A. Stefanovici, G. Amvrosiadis, A. A. Hwang, and B. Schroeder,
“Temperature Management in Data Centers: Why Some (Might) Like It Hot,” in
SIGMETRICS, 2012.
[38] E. M. Elnozahy et al., “Energy-Efficient Server Clusters,” in PACS, 2003.
[39] H. Esmaeilzadeh et al., “Architecture Support for Disciplined Approximate Pro-
gramming,” in ASPLOS, 2012.
[40] D. Fiala et al., “Detection and Correction of Silent Data Corruption for Large-
scale High-performance Computing,” in SC, 2012.
[41] N. Foutris et al., “Versatile Architecture-level Fault Injection Framework for Reliability
Evaluation: A First Report,” in IOLTS, 2014.
[42] Free Software Foundation, Inc., “GDB: The GNU Project Debugger,”
http://www.sourceware.org/gdb/.
[43] A. Fukami, S. Ghose, Y. Luo, Y. Cai, and O. Mutlu, “Improving the Reliability of
Chip-Off Forensic Analysis of NAND Flash Memory Devices,” Digital Investigation,
Mar 2017.
[44] K. Gai, M. Qiu, H. Zhao, and L. Qiu, “Smart Energy-Aware Data Allocation for
Heterogeneous Memory,” in HPCC, 2016.
[45] S.-L. Gong, M. Rhu, J. Kim, J. Chung, and M. Erez, “Clean-ECC: High reliability
ECC for Adaptive Granularity Memory System,” in MICRO, 2015.
[46] M. Gottscho, I. Alam, C. Schoeny, L. Dolecek, and P. Gupta, “Low-Cost Memory
Fault Tolerance for IoT Devices,” CASES/TECS, 2017.
[47] M. Gottscho, C. Schoeny, L. Dolecek, and P. Gupta, “Software-defined error-correcting
codes,” in DSN Workshop, 2016.
[48] M. Gottscho, M. Shoaib, S. Govindan, B. Sharma, D. Wang, and P. Gupta, “Mea-
suring the Impact of Memory Errors on Application Performance,” CAL, 2017.
[49] S. Grundberg et al., “For Data Center, Google Goes for the Cold,”
http://tinyurl.com/ml55nh5. 2011.
[50] R. Harris, “How good does memory need to be?” http://www.zdnet.com/how-
good-does-memory-need-to-be-7000031853/. 2014.
[51] H. Hassan, N. Vijaykumar, S. Khan, S. Ghose, K. Chang, G. Pekhimenko, D. Lee,
O. Ergin, and O. Mutlu, “SoftMC: A Flexible and Practical Open-Source Infras-
tructure for Enabling Experimental DRAM Studies,” in HPCA, 2017.
[52] H. Hassan, G. Pekhimenko, N. Vijaykumar, V. Seshadri, D. Lee, O. Ergin, and
O. Mutlu, “ChargeCache: Reducing DRAM Latency by Exploiting Row Access
Locality,” in HPCA, 2016.
[53] D. Henderson et al., “POWER7 System RAS,” 2012.
[54] Q. Huang, K. Birman, R. van Renesse, W. Lloyd, S. Kumar, and H. C. Li, “An
Analysis of Facebook Photo Caching,” in SOSP, 2013.
[55] A. A. Hwang, I. Stefanovici, and B. Schroeder, “Cosmic Rays Don’t Strike Twice:
Understanding the Nature of DRAM Errors and the Implications for System De-
sign,” in ASPLOS, 2012.
[56] D. Ielmini et al., “Physical Interpretation, Modeling and Impact on Phase Change
Memory (PCM) Reliability of Resistance Drift Due to Chalcogenide Structural
Relaxation,” in IEDM, 2007.
[57] Intel Corp., “iACT,” http://www.github.com/IntelLabs/iACT.
[58] JEDEC Solid State Technology Association, “JEDEC Standard: DDR3 SDRAM,
JESD79-3C,” 2008.
[59] P. Jobin, “Cloud Computing Shifting to Cooler Climates,” http://tinyurl.com/
mfrlrtl, 2012.
[60] M. Jung, C. C. Rheinländer, C. Weis, and N. Wehn, “Reverse Engineering of
DRAMs: Row Hammer with Crosshair,” in MEMSYS, 2016.
[61] U. Kang et al., “Co-Architecting Controllers and DRAM to Enhance DRAM Pro-
cess Scaling,” in The Memory Forum, 2014.
[62] S. Khan, D. Lee, and O. Mutlu, “PARBOR: An Efficient System-Level Technique
to Detect Data-Dependent Failures in DRAM,” in DSN, 2016.
[63] S. Khan, C. Wilkerson, D. Lee, A. R. Alameldeen, and O. Mutlu, “A Case for
Memory Content-Based Detection and Mitigation of Data-Dependent Failures
in DRAM,” CAL, 2016.
[64] S. Khan, C. Wilkerson, Z. Wang, A. R. Alameldeen, D. Lee, and O. Mutlu, “De-
tecting and Mitigating Data-Dependent DRAM Failures by Exploiting Current
Memory Content,” in MICRO, 2017.
[65] S. M. Khan et al., “The Efficacy of Error Mitigation Techniques for DRAM Retention
Failures: A Comparative Experimental Study,” in SIGMETRICS, 2014.
[66] J. S. Kim, M. Patel, H. Hassan, and O. Mutlu, “The DRAM Latency PUF: Quickly
Evaluating Physical Unclonable Functions by Exploiting the Latency–Reliability
Tradeoff in Modern DRAM Devices,” in HPCA, 2018.
[67] J. Kim, M. Sullivan, and M. Erez, “Bamboo ECC: Strong, Safe, and Flexible Codes
for Reliable Computer Memory,” in HPCA, 2015.
[68] J. Kim, M. Sullivan, S.-L. Gong, and M. Erez, “Frugal ECC: Efficient and Versatile
Memory Error Protection Through Fine-Grained Compression,” in SC, 2015.
[69] J. Kim, M. Sullivan, S. Lym, and M. Erez, “All-Inclusive ECC: Thorough End-to-
End Protection for Reliable Computer Memory,” in ISCA, 2016.
[70] Y. Kim, “Architectural Techniques to Enhance DRAM Scaling,” Ph.D. dissertation,
Carnegie Mellon Univ., 2015.
[71] Y. Kim et al., “A Case for Exploiting Subarray-Level Parallelism (SALP) in
DRAM,” in ISCA, 2012.
[72] Y. Kim et al., “Flipping Bits in Memory Without Accessing Them: An Experimen-
tal Study of DRAM Disturbance Errors,” in ISCA, 2014.
[73] Y. Kim, D. Han, O. Mutlu, and M. Harchol-Balter, “ATLAS: A Scalable and High-
Performance Scheduling Algorithm for Multiple Memory Controllers,” in HPCA,
2010.
[74] Y. Kim, M. Papamichael, O. Mutlu, and M. Harchol-Balter, “Thread Cluster Memory
Scheduling: Exploiting Differences in Memory Access Behavior,” in MICRO,
2010.
[75] Y. Kim, W. Yang, and O. Mutlu, “Ramulator: A Fast and Extensible DRAM Simu-
lator,” CAL, 2015.
[76] A. Kleen, “mcelog: Memory Error Handling in User Space,” http://www.
halobates.de/lk10-mcelog.pdf.
[77] C. Kozyrakis et al., “Server Engineering Insights for Large-Scale Online Services,”
IEEE Micro, 2010.
[78] E. Kultursay et al., “Evaluating STT-RAM As An Energy-Efficient Main Memory
Alternative,” in ISPASS, 2013.
[79] B. C. Lee et al., “Architecting Phase Change Memory as a Scalable DRAM Alter-
native,” in ISCA, 2009.
[80] B. C. Lee et al., “Phase Change Memory Architecture and The Quest for Scalabil-
ity,” CACM, 2010.
[81] B. C. Lee et al., “Phase Change Technology and the Future of Main Memory,”
IEEE Micro, 2010.
[82] D. Lee, “Reducing DRAM Energy at Low Cost by Exploiting Heterogeneity,” Ph.D.
dissertation, Carnegie Mellon Univ., 2016.
[83] D. Lee, S. Khan, L. Subramanian, S. Ghose, R. Ausavarungnirun, G. Pekhimenko,
V. Seshadri, and O. Mutlu, “Design-Induced Latency Variation in Modern DRAM
Chips: Characterization, Analysis, and Latency Reduction Mechanisms,” in SIG-
METRICS, 2017.
[84] D. Lee et al., “Tiered-Latency DRAM: A Low Latency and Low Cost DRAM Ar-
chitecture,” in HPCA, 2013.
[85] D. Lee, S. Ghose, G. Pekhimenko, S. Khan, and O. Mutlu, “Simultaneous Multi-
Layer Access: Improving 3D-Stacked Memory Bandwidth at Low Cost,” TACO,
2016.
[86] D. Lee, Y. Kim, G. Pekhimenko, S. Khan, V. Seshadri, K. Chang, and O. Mutlu,
“Adaptive-Latency DRAM: Optimizing DRAM Timing for the Common-Case,”
in HPCA, 2015.
[87] D. Lee, L. Subramanian, R. Ausavarungnirun, J. Choi, and O. Mutlu, “Decoupled
Direct Memory Access: Isolating CPU and IO Traffic by Leveraging a Dual-Data-Port
DRAM,” in PACT, 2015.
[88] L. Leem et al., “ERSA: Error Resilient System Architecture for Probabilistic Ap-
plications,” in DATE, 2010.
[89] D. Li et al., “Classifying Soft Error Vulnerabilities in Extreme-Scale Scientific
Applications Using a Binary Instrumentation Tool,” in SC, 2012.
[90] M.-L. Li et al., “Trace-Based Microarchitecture-level Diagnosis of Permanent
Hardware Faults,” in DSN, 2008.
[91] M.-L. Li et al., “Understanding the Propagation of Hard Errors to Software and
Implications for Resilient System Design,” in ASPLOS, 2008.
[92] X. Li et al., “A Realistic Evaluation of Memory Hardware Errors and Software
System Susceptibility,” in USENIX ATC, 2010.
[93] X. Li et al., “Application-Level Correctness and Its Impact on Fault Tolerance,” in
HPCA, 2007.
[94] Y. Li, S. Ghose, J. Choi, J. Sun, H. Wang, and O. Mutlu, “Utility-Based Hybrid
Memory Management,” in CLUSTER, 2017.
[95] Y. Li, S. Makar, and S. Mitra, “CASP: Concurrent Autonomous Chip Self-Test
Using Stored Test Patterns,” in DATE, 2008.
[96] Y. Li, O. Mutlu, D. S. Gardner, and S. Mitra, “Concurrent Autonomous Self-Test
for Uncore Components in System-on-Chips,” in VTS, 2010.
[97] Y. Li, O. Mutlu, and S. Mitra, “Operating System Scheduling for Efficient Online
Self-Test in Robust Systems,” in ICCAD, 2009.
[98] Y. Li, H. Wang, X. Zhao, H. Sun, and T. Zhang, “Applying Software-based Mem-
ory Error Correction for In-Memory Key-Value Store: Case Studies on Mem-
cached and RAMCloud,” in MemSys, 2016.
[99] J. Liu et al., “RAIDR: Retention-Aware Intelligent DRAM Refresh,” in ISCA, 2012.
[100] J. Liu et al., “An Experimental Study of Data Retention Behavior in Modern
DRAM Devices: Implications for Retention Time Profiling Mechanisms,” in ISCA,
2013.
[101] L. Liu, H. Yang, Y. Li, M. Xie, L. Li, and C. Wu, “Memos: A Full Hierarchy Hybrid
Memory Management Framework,” in ICCD, 2016.
[102] S. Liu et al., “Flikker: Saving DRAM Refresh-Power Through Critical Data Parti-
tioning,” in ASPLOS, 2011.
[103] Y. Low et al., “Distributed GraphLab: A Framework for Machine Learning and
Data Mining in the Cloud,” PVLDB, 2012.
[104] Y. Luo et al., “Characterizing Application Memory Error Vulnerability to Opti-
mize Datacenter Cost via Heterogeneous-Reliability Memory,” in DSN, 2014.
[105] Y. Luo, S. Ghose, Y. Cai, E. F. Haratsch, and O. Mutlu, “Enabling Accurate and
Practical Online Flash Channel Modeling for Modern MLC NAND Flash Mem-
ory,” JSAC, 2016.
[106] Y. Luo, S. Ghose, Y. Cai, E. F. Haratsch, and O. Mutlu, “HeatWatch: Improving 3D
NAND Flash Memory Device Reliability by Exploiting Self-Recovery and Tem-
perature Awareness,” in HPCA, 2018.
[107] Y. Luo, S. Ghose, T. Li, S. Govindan, B. Sharma, B. Kelly, A. Boroumand,
and O. Mutlu, “Using ECC DRAM to Adaptively Increase Memory Capacity,”
arXiv:1706.08870 [cs.AR], 2017.
[108] A. Malek, E. Vasilakis, V. Papaefstathiou, P. Trancoso, and I. Sourdis, “Odd-ECC:
On-Demand DRAM Error Correcting Codes,” in MemSys, 2017.
[109] K. T. Malladi et al., “Towards Energy-proportional Datacenter Memory with Mo-
bile DRAM,” in ISCA, 2012.
[110] T. C. May et al., “Alpha-Particle-Induced Soft Errors in Dynamic Memories,” IEEE
T-ED, 1979.
[111] P. J. Meaney et al., “IBM zEnterprise Redundant Array of Independent Memory
Subsystem,” IBM JRD, 2012.
[112] A. Messer et al., “Susceptibility of Commodity Systems and Software to Memory
Soft Errors,” IEEE TC, 2004.
[113] J. Meza et al., “Enabling efficient and scalable hybrid memories using fine-granularity
DRAM cache management,” CAL, 2012.
[114] J. Meza et al., “A Case for Efficient Hardware/Software Cooperative Management
of Storage and Memory,” in WEED, 2013.
[115] J. Meza et al., “A Large Scale Study of Flash Errors in the Field,” in SIGMETRICS,
2015.
[116] J. Meza et al., “Revisiting Memory Errors in Large-Scale Production Data Centers:
Analysis and Modeling of New Trends from the Field,” in DSN, 2015.
[117] J. Meza, J. Li, and O. Mutlu, “Evaluating Row Buffer Locality in Future Non-Volatile
Main Memories,” Carnegie Mellon Univ., SAFARI Research Group, Tech.
Rep. TR-SAFARI-2012-002, 2012.
[118] Microsoft Corp., “Predictive Failure Analysis (PFA),” http://tinyurl.com/n34z657.
[119] Microsoft Corp., “Windows Debugging,” http://tinyurl.com/l6zsqzv.
[120] O. Mutlu, “The RowHammer Problem and Other Issues We May Face as Memory
Becomes Denser,” in DATE, 2017.
[121] O. Mutlu, “Memory Scaling: A Systems Architecture Perspective,” in IMW, 2013.
[122] O. Mutlu, “Memory Scaling: A Systems Architecture Perspective,” in MEMCON,
2013.
[123] O. Mutlu and L. Subramanian, “Research Problems and Opportunities in Memory
Systems,” SUPERFRI, 2014.
[124] P. J. Nair, D.-H. Kim, and M. K. Qureshi, “ArchShield: Architectural Framework
for Assisting DRAM Scaling by Tolerating High Error Rates,” in MICRO, 2013.
[125] P. J. Nair, D. A. Roberts, and M. K. Qureshi, “Citadel: Efficiently Protecting
Stacked Memory from TSV and Large Granularity Failures,” TACO, vol. 12, no. 4,
p. 49, 2016.
[126] P. J. Nair, V. Sridharan, and M. K. Qureshi, “XED: Exposing On-Die Error Detec-
tion Information for Strong Memory Reliability,” in ISCA, 2016.
[127] R. Nishtala et al., “Scaling Memcache at Facebook,” in NSDI, 2013.
[128] D. Ongaro, S. M. Rumble, R. Stutsman, J. Ousterhout, and M. Rosenblum, “Fast
Crash Recovery in RAMCloud,” in SOSP, 2011.
[129] J. Ousterhout et al., “The Case for RAMCloud,” Communications of the ACM,
vol. 54, no. 7, pp. 121–130, 2011.
[130] T. Parnell, N. Papandreou, T. Mittelholzer, and H. Pozidis, “Modelling of the
Threshold Voltage Distributions of Sub-20nm NAND Flash Memory,” in GLOBE-
COM, 2014.
[131] M. Patel, J. Kim, and O. Mutlu, “The Reach Profiler (REAPER): Enabling the Mitigation
of DRAM Retention Failures via Profiling at Aggressive Conditions,” in
ISCA, 2017.
[132] K. Pattabiraman et al., “Samurai: Protecting Critical Data in Unsafe Languages,”
in EuroSys, 2008.
[133] A. J. Peña and P. Balaji, “Toward the Efficient Use of Multiple Explicitly Managed
Memory Subsystems,” in CLUSTER, 2014.
[134] S. Phadke et al., “MLP Aware Heterogeneous Memory System,” in DATE, 2011.
[135] M. Qureshi et al., “AVATAR: A Variable-Retention-Time (VRT) Aware Refresh
for DRAM Systems,” in DSN, 2015.
[136] M. K. Qureshi et al., “Scalable High Performance Main Memory System Using
Phase-Change Memory Technology,” in ISCA, 2009.
[137] L. E. Ramos, E. Gorbatov, and R. Bianchini, “Page Placement in Hybrid Memory
Systems,” in ICS, 2011.
[138] V. J. Reddi et al., “Web Search Using Mobile Cores: Quantifying and Mitigating
the Price of Efficiency,” in ISCA, 2010.
[139] A. C. Riekstin et al., “No More Electrical Infrastructure: Towards Fuel Cell Pow-
ered Data Centers,” in HotPower, 2013.
[140] H. Schirmeier et al., “RAMpage: Graceful Degradation Management for Memory
Errors in Commodity Linux Servers,” in PRDC, 2011.
[141] B. Schroeder et al., “DRAM Errors in the Wild: A Large-Scale Field Study,” in
SIGMETRICS Performance, 2009.
[142] V. Seshadri, D. Lee, T. Mullins, H. Hassan, A. Boroumand, J. Kim, M. A. Kozuch,
O. Mutlu, P. B. Gibbons, and T. C. Mowry, “Ambit: In-Memory Accelerator for
Bulk Bitwise Operations Using Commodity DRAM Technology,” in MICRO, 2017.
[143] V. Seshadri et al., “RowClone: Fast and Efficient In-DRAM Copy and
Initialization of Bulk Data,” in MICRO, 2013.
[144] T. Siddiqua et al., “Analysis and Modeling of Memory Errors from Large-scale
Field Data Collection,” in SELSE, 2013.
[145] V. Sridharan, N. DeBardeleben, S. Blanchard, K. B. Ferreira, J. Stearley, J. Shalf,
and S. Gurumurthi, “Memory Errors in Modern Systems: The Good, The Bad,
and the Ugly,” in ASPLOS, 2015.
[146] V. Sridharan et al., “A Study of DRAM Failures in the Field,” in SC, 2012.
[147] V. Sridharan et al., “Feng Shui of Supercomputer Memory: Positional Effects in
DRAM and SRAM Faults,” in SC, 2013.
[148] J. Stuecheli et al., “Elastic refresh: Techniques to mitigate refresh penalties in
high density memory,” in MICRO, 2010.
[149] O. Subasi, G. Yalcin, F. Zyulkyarov, O. Unsal, and J. Labarta, “Designing and
Modelling Selective Replication for Fault-tolerant HPC Applications,” in CCGrid,
2017.
[150] D. Tang et al., “Assessment of the Effect of Memory Page Retirement on System
RAS Against Hardware Faults,” in DSN, 2006.
[151] A. Verma, L. Pedrosa, M. Korupolu, D. Oppenheimer, E. Tune, and J. Wilkes,
“Large-Scale Cluster Management at Google with Borg,” in EuroSys, 2015.
[152] J. Wang, X. Zhu, Y. Liu, J. Zhang, M. Wu, W. Zhang, and K. Qiu, “Heterogeneous
Energy-Efficient Cache Design in Warehouse Scale Computers,” in CF, 2015.
[153] Y.-M. Wang, Y. Huang, K.-P. Vo, P.-Y. Chung, and C. Kintala, “Checkpointing and
Its Applications,” in FTCS, 1995.
[154] X. Xu et al., “Understanding Soft Error Propagation Using Efficient Vulnerability-
Driven Fault Injection,” in DSN, 2012.
[155] D. H. Yoon et al., “Virtualized and Flexible ECC for Main Memory,” in ASPLOS,
2010.
[156] D. H. Yoon et al., “BOOM: Enabling Mobile Memory Based Low-power Server
DIMMs,” in ISCA, 2012.
[157] H. Yoon et al., “Row Buffer Locality Aware Caching Policies for Hybrid
Memories,” in ICCD, 2012.
[158] X. Yu, C. J. Hughes, N. Satish, O. Mutlu, and S. Devadas, “Banshee: Bandwidth-
Efficient DRAM Caching via Software/Hardware Cooperation,” in MICRO, 2017.
[159] Y. Yue, “Caching in Datacenters,” https://twitter.github.io/pelikan/2016/04/03/
caching-in-datacenters.html, 2012.
[160] W. Zhang and T. Li, “Exploring Phase Change Memory and 3D Die-Stacking
for Power/Thermal Friendly, Fast and Durable Memory Architectures,” in PACT,
2009.
[161] R. Zheng and M. C. Huang, “Redundant Memory Array Architecture for Efficient
Selective Protection,” in ISCA, 2017.
