Workload-Aware DRAM Error Prediction using Machine Learning by Mukhanov, Lev et al.
Lev Mukhanov∗, Konstantinos Tovletoglou†, Hans Vandierendonck‡,
Dimitrios S. Nikolopoulos§(*) and Georgios Karakonstantis¶
School Of Electronics, Electrical Engineering And Computer Science, ECIT, Queen’s University Belfast, UK
Email: ∗l.mukhanov@qub.ac.uk, †ktovletoglou01@qub.ac.uk, ‡h.vandierendonck@qub.ac.uk,
§d.nikolopoulos@qub.ac.uk ¶g.karakonstantis@qub.ac.uk
§(*) The present affiliation: Department of Computer Science, Virginia Polytechnic Institute and State University, USA
Abstract—The aggressive scaling of technology may have
helped to meet the growing demand for higher memory capacity
and density, but has also made DRAM cells more prone to
errors. Such a reality triggered a lot of interest in modeling
DRAM behavior for either predicting the errors in advance or
for adjusting DRAM circuit parameters to achieve a better trade-off
between energy efficiency and reliability. Existing modeling
efforts have studied the impact of a few operating parameters
and temperature on DRAM reliability using custom FPGA
setups; however, they neglected the combined effect of
workload-specific features that can be systematically investigated only on
a real system.
In this paper, we present the results of our study on workload-
dependent DRAM error behavior within a real server considering
various operating parameters, such as the refresh rate, voltage
and temperature. We show that the rate of single- and multi-bit
errors may vary across workloads by 8x, indicating that program
inherent features can affect DRAM reliability significantly. Based
on this observation, we extract 249 features, such as the memory
access rate, the rate of cache misses, the memory reuse time and
data entropy, from various compute-intensive, caching and ana-
lytics benchmarks. We apply several supervised learning methods
to construct the DRAM error behavior model for 72 server-
grade DRAM chips using the memory operating parameters
and extracted program inherent features. Our results show that,
with an appropriate choice of program features and supervised
learning method, the rate of single- and multi-bit errors can be
predicted for a specific DRAM module with an average error
of less than 10.5 %, as opposed to the 2.9x estimation error
obtained for a conventional workload-unaware error model. Our
model enables designers to predict DRAM errors in advance in
less than a second and to study the impact of any workload and
applied software optimizations on DRAM reliability.
I. INTRODUCTION
The worsening of parametric variations in deep nanometer
technologies and aggressive scaling of circuit parameters for
low power operation made memory cells more prone to errors,
the number of which may vary significantly across different
The manifestation of such errors, which depends on various
factors [23], [24], [39], [75], [79]–[81] related to circuit
parameters, temperature, system architecture and
workloads, threatens the availability of computing systems and
the quality of service of sensitive storage components in data
centers [67] and supercomputers [5], [20], [72]. The increased
risks have triggered a few research studies on the prediction of
DRAM errors in advance [4], [18], [35], [38], [58], [66],
[89]. However, these studies were performed only for DRAM
operating under nominal circuit parameters and typical en-
vironmental conditions. Moreover, even though they tried to
consider other workload/architecture related factors, this was
limited due to the constrained access to only specific features,
like percentage of utilized memory, average CPU utilization
and hardware characteristics [44]. The joint consideration of
more features may reveal new non-linear behaviors that cannot
be captured by linear regression models [44] or traditional
workload-agnostic statistical models [31]. In addition, all these
studies lacked an adequate number of samples because of
the rare manifestation of errors for DRAM operating under
nominal circuit parameters, which may result in contradictory
observations [44], [67].
In the past, there have been several experimental studies
that tried to predict the error behavior of DRAM operating
under non-nominal circuit parameters [39], such as the refresh
period (TREFP ) and the supply voltage (VDD), and even under
various temperatures [19], [27], [39], [52], [53]. However, the
main goal of these studies was to improve DRAM performance
and energy efficiency by scaling TREFP or VDD [25], [76],
rather than to model DRAM errors. Although some of these
works have indicated that certain program features,
such as the pattern of data stored in memory [1], [3], [17],
[40], [61], [63], [78], [85], [87], [88], may change the
number of manifested errors, none of them attempted to jointly
consider the impact of DRAM circuit parameters and various
program inherent features on DRAM reliability. Besides the
data pattern, program inherent features encapsulate features
that can be extracted using hardware performance counters, e.g.
the processor utilization, the rates of memory accesses and cache
misses, and IPC. Performance counters have been used in the past
for power and performance modelling [7], [49], [51], but were
never used for DRAM reliability modeling in conjunction with
various circuit parameters and temperature. Modeling the joint
impact of such a wide range of features requires a novel
experimental framework implemented on a real system with a
complete software stack. This framework, unlike the custom
FPGA setups used in prior studies [22], [40], [63], should
be capable of running real workloads under different DRAM
temperatures and provide a mechanism to measure errors and
hardware performance counters.
arXiv:2003.12448v1 [cs.DC] 17 Mar 2020
The main goal of this work is to systematically investigate
the effect of various program inherent features on DRAM
reliability and develop a DRAM error model that takes into
consideration the combined effect of these features, as well as
the reliability variation across chips, DRAM circuit parameters
and temperature. This model enables designers to predict
DRAM errors based on a few workload-specific features for a
given set of DRAM circuit parameters and temperature. Such a
prediction does not require long-running DRAM characteriza-
tion campaigns that may take hours or even days on complex
experimental setups. The error behavioral model facilitates:
i) evaluating how error-prone specific workloads are; ii)
evaluating the implicit impact of applied software optimizations
(e.g. compiler optimizations or thread-level parallelism) on DRAM
reliability; iii) predicting maintenance cycles, as targeted by
recent works [20], [44]; iv) guiding the adjustment of
DRAM circuit parameters for saving energy [41], [63].
Our contributions can be summarized as follows:
• We develop a novel experimental framework for char-
acterizing DRAMs under relaxed refresh period and
lowered supply voltage within a state-of-the-art 64-bit
ARM based server. In order to experiment under different
DRAM temperatures, we implement a thermal testbed
that allows us to fine tune the temperature of each DIMM
on the server.
• We perform a characterization of 72 server-grade DRAM
chips under scaled refresh period and lowered supply
voltage running compute-intensive, caching, and analytics
benchmarks. Our study shows that the rate of single- and
multi-bit errors may vary across workloads and DRAM
chips by 8× and 188×, respectively.
• To quantify the dynamically changing data and access
patterns of a running program, we introduce new metrics,
namely the DRAM reuse time and the data entropy. We
extract these program features along with 247 features
measured using hardware performance counters during
the execution of each workload and correlate them with
DRAM errors, identifying features that are more likely
to affect DRAM reliability.
• We apply three different Machine Learning methods
to train a workload-aware DRAM error model based
on the extracted program features, DRAM circuit pa-
rameters and temperature. In particular, we investigate
the accuracy of Support Vector Machines (SVM), K-
nearest neighbors (KNN) and Random Decision Forests
(RDF). We compare these models on 4 different DRAM
devices considering various sets of program features used
for training. Our study shows that the highest accuracy
of DRAM error estimates is achieved by KNN, which
enables us to predict DRAM error rates within 300 ms
with an average error that does not exceed 10.2 %, as
opposed to the 2.9× estimation error obtained for a
conventional workload-unaware error model. We make
the DRAM error behavioral model (KNN-based) publicly
available, which will be periodically updated based on
new characterization results [50].
Fig. 1: Interaction between workloads and DRAM; internal
structure of DRAM.
II. BACKGROUND
A. DRAM Basics
DRAM is an essential component in any modern computing
system, used to realize the memory subsystem. Besides the
data caches, the memory subsystem includes several channels
(Memory Channel Units, MCUs) which are used to transfer
data and commands between the processor and DRAM. Each
channel is connected to a number of Dual In-line Memory
Modules (DIMMs). A DIMM usually has two ranks that
contain DRAM chips. Within each chip, DRAM cells are
organized into banks, which are two-dimensional arrays that
can be accessed in parallel based on rows and columns (see
Figure 1 on the right). The basic storage element of a DIMM
is a cell, consisting of a transistor and a capacitor. When a
row of cells is accessed, the peripheral circuitry of a DIMM
senses the data stored in this row via amplifiers and sends it
to the processor.
B. DRAM Error Behavior: Main Operating Parameters
The main drawback of the DRAM technology is the limited
retention time [39] of a cell’s charge. To avoid any error
induced by the charge leakage, DRAM employs an Auto-
Refresh mechanism that recharges the cells in the array pe-
riodically [39]. Conventionally, all DDR technologies adopt
a refresh period, TREFP , of 64 ms for refreshing each
cell. Another critical parameter that affects DRAM power and
reliability behavior is the supply voltage, VDD. Similar to
TREFP , VDD of DRAM chips is chosen conservatively by
vendors to ensure that each chip operates correctly under a
wide range of conditions. In addition to the above circuit
parameters, one of the main environmental conditions that
affect DRAM reliability is temperature (TEMPDRAM ). In
fact, it has been reported that the retention time of DRAM
cells decreases exponentially with increasing temperature [19].
C. DRAM Error Behavior: Workload-Dependent Parameters
The use of DRAM depends on executed instructions that
access the memory in a certain way. In particular, the data
read and written by a program (data pattern) from/to memory
and the order in which the program refers to this data (access
pattern) vary across workloads. Note that access pattern also
encapsulates the rate of memory accesses and the average time
between accesses to DRAM cells.
Fig. 2: The rate of single-bit errors per 64-bit word (WER)
when running memcached, backprop and the random
micro-benchmark for DRAM operating under 2.283 s TREFP and
lowered VDD (1.428 V) at 70◦C (2-hour run, 8 threads;
x-axis: time in minutes, y-axis: WER ×10^-7).
Previous studies have demonstrated that the data pattern of a
running program may affect DRAM errors [27], [39]. Meanwhile,
the frequency of read and write accesses (i.e. the memory access
pattern) may reduce the number of manifested errors, since
each read/write naturally refreshes DRAM [1], [78]. We
demonstrate such accesses in Figure 1, where the 3.load and n.load
instructions from the t-th workload refresh data in the m4 DRAM
row. By contrast, if a row is accessed many times, then some cells
from neighbouring rows may leak charge due to the DRAM
cell-to-cell interference [32]. This effect has been exploited
widely for "row hammer" attacks [55], [84]. Specifically,
data in the m3 and m5 DRAM rows (see Figure 1) may
be compromised when the m4 row is accessed too often.
Thus, by increasing the memory access frequency to the
same row, we reduce the number of errors manifested in
this particular row, while inducing errors in neighbouring
rows due to the DRAM cell-to-cell interference. Accordingly,
inherent program features that change memory data and
access patterns of a running workload may have an important
effect on DRAM reliability. However, to the best of our
knowledge, none of the previous studies have systematically
investigated the combined effect of data and access pattern
on DRAM reliability under relaxed DRAM parameters and
varying DIMM temperature.
Failing to identify the combined effect of program features
on real server deployments may limit or nullify the efficacy
of existing approaches. For example, several previous stud-
ies have proposed fine-grained methods to control DRAM
parameters based on the retention time measured for each
cell [40], [62]. To measure the retention time, authors use
micro-benchmarks that implement the worst-case data pat-
tern manifesting errors in the vast majority of error-prone
memory locations [3], [19], [22], [27]. However, our study
shows that real applications may trigger errors in many more
memory locations than the conventional data pattern micro-
benchmarks. Figure 2 depicts the rate of single-bit errors
per 64-bit word (WER) observed for DRAM operating un-
der relaxed parameters when running two different bench-
marks (memcached and backprop), and the most stressful
data pattern micro-benchmark (the random data pattern micro-
benchmark [39]). We see that the WER incurred by backprop
is 3.5× higher than the rate observed for random. As a result,
the cell retention time measured using this data pattern micro-
benchmark may be inaccurate, which, in turn, may lead to
uncertain hardware behavior or even hardware crashes when
the proposed methods are applied in practice. On the other
hand, the proposed methods may be too pessimistic about the
retention time and thus ineffective, since real applications,
such as memcached, may trigger errors in fewer memory
locations than the micro-benchmark. These results indicate
that designers should take into account the combined effect
of workload-dependent factors on DRAM reliability when
designing error mitigation techniques.
D. DIMM-to-DIMM Variation
Apart from the above circuit and workload-dependent pa-
rameters, DRAM reliability may vary across DIMMs from
different vendors [29], [39], and even across DIMMs manufactured
by the same vendor. This variation is due to the
manufacturing process [31] and the internal design of DRAM
modules, such as true-anti cell organization [39], address
scrambling [29], [83] and the remapping of faulty cells [28].
Our study indicates that the rate of single-bit errors per 64-bit
word may vary by 188× across different DRAM chips.
E. Challenges
According to the above discussion, there are various cross-layer
parameters at the circuit (e.g. VDD, TREFP ), micro-architecture
(i.e. cache organization and DRAM architecture) and
application (i.e. data and DRAM access patterns) layers, which
in combination with environmental parameters (i.e. the DRAM
temperature) can significantly influence DRAM reliability.
Predicting potential failures early in the design or operation
cycle by considering all the combined cross-layer effects is an
extremely challenging problem.
III. DRAM ERROR PREDICTION
A. Mathematical Formulation of the Problem
Let us assume that a workload, having a specific set of
program features (Ftrs = (f1, f2, ..., fK) where fi is the i-
th feature), allocates data on a DRAM device (Dev) when
this device operates under TREFP and VDD at a certain
temperature (TEMPDRAM ). Then, to predict a target DRAM
error metric Merr for this workload, we need to model a
prediction function (M ) such that:
$$M_{err} = M(Ftrs, Dev, T_{REFP}, V_{DD}, TEMP_{DRAM}) \qquad (1)$$
It is evident that building such a model is extremely chal-
lenging due to the number of possible parameter combinations.
To address this challenge, we propose to use a supervised
Machine Learning (ML) technique, since we believe it is hard
to find an analytical model that predicts DRAM error behavior
accurately considering the DIMM-to-DIMM variation and all
the parameters.
Num. of corrupted bits | Type of errors         | Abbreviation
1                      | corrected              | CE
> 1                    | uncorrected/detected   | UE
> 2                    | uncorrected/undetected | SDC
TABLE I: Types of DRAM errors that can be corrected or
detected with ECC SECDED.
B. ML Models
In our study, we investigate the accuracy of the following
Machine Learning models: Support Vector Machines (SVM),
K-nearest neighbors algorithm (KNN) and Random Decision
Forests (RDF). These models have a high accuracy for both
linear and non-linear prediction problems [15]. We use the
scikit-learn library to implement the models [68].
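As a rough illustration of how such a model maps operating parameters and program features to an error rate, the sketch below implements a from-scratch K-nearest-neighbors regressor in plain Python. It is a stand-in for illustration and not the scikit-learn implementation used in the study; the feature layout, the value of k, the lack of feature scaling and all sample values are assumptions.

```python
import math

def knn_predict(train_x, train_y, query, k=3):
    """Predict a target (e.g. WER) as the mean over the k nearest training samples."""
    # Euclidean distance between the query and every training sample.
    dists = sorted(
        (math.dist(x, query), y) for x, y in zip(train_x, train_y)
    )
    nearest = dists[:k]
    return sum(y for _, y in nearest) / len(nearest)

# Hypothetical samples: (TREFP in s, VDD in V, temperature in C, Treuse in s).
train_x = [
    (0.618, 1.428, 50.0, 0.2),
    (1.173, 1.428, 50.0, 0.2),
    (2.283, 1.428, 50.0, 0.2),
    (2.283, 1.428, 60.0, 0.2),
]
# Made-up WER values, not measurements from the paper.
train_y = [1e-9, 1e-8, 1e-7, 4e-7]

predicted = knn_predict(train_x, train_y, (2.283, 1.428, 55.0, 0.2), k=2)
```

Because the raw features have very different ranges, the temperature dominates the distance in this toy example; a practical model would normally normalize the features first.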
C. DRAM Error Metrics
There are several types of errors that may manifest in
DRAM chips [2], [57], [64], [71]. Vendors implement special
hardware (ECC, Error Correction Codes) in server-grade chips
to automatically correct such errors. In this study, we use
hardware that supports ECC SECDED, which is implemented
in the majority of commercial servers. There are three types of
memory errors that may occur when ECC SECDED is enabled
(see Table I): single-bit errors (or correctable errors, CE);
detected errors where more than one bit in a 64-bit word is
corrupted (or uncorrectable errors, UE); and errors where more
than 2 bits are corrupted per word, which are not corrected
and not detected by ECC. The last type of error manifests as
so-called Silent Data Corruption (SDC), since such errors are
invisible to hardware.
Correctable errors: To characterize DRAM in terms of
CEs, we measure the rate of single-bit errors per 64-bit word, WER,
for the amount of memory used by an application as:
$$WER = \frac{N_{CE}}{MEM_{SIZE}} \qquad (2)$$
where NCE is the number of unique 64-bit word locations
where CEs have manifested and MEMSIZE is the size (in
64-bit words) of memory allocated by the application. WER
shows the probability of a word being erroneous regardless of
the size of memory allocated by the application.
Uncorrectable Errors: To characterize DRAM in terms of
UEs, we estimate the probability of a UE triggered by a
running application as:
$$P_{UE} = \frac{N_{UE}}{N_{EXP}} \qquad (3)$$
where NUE is the number of experiments with the application
that resulted in a UE, and NEXP is the total number of
experiments with the application.
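The two metrics defined in Equations (2) and (3) can be computed directly from an error log and a list of experiment outcomes. The following is a minimal sketch; the function names, the log format and the sample values are hypothetical, and the 8 GB allocation is interpreted here as 8 GiB.

```python
def word_error_rate(ce_word_addresses, mem_size_words):
    """WER = N_CE / MEM_SIZE, with N_CE counted over unique 64-bit word locations."""
    n_ce = len(set(ce_word_addresses))
    return n_ce / mem_size_words

def ue_probability(experiment_had_ue):
    """P_UE = N_UE / N_EXP over a list of per-experiment booleans."""
    n_exp = len(experiment_had_ue)
    n_ue = sum(1 for had_ue in experiment_had_ue if had_ue)
    return n_ue / n_exp

# 8 GiB of allocated memory expressed in 64-bit (8-byte) words.
mem_size_words = (8 * 2**30) // 8
# Hypothetical CE log: the same word reported twice counts once.
ce_log = [0x1000, 0x1000, 0x2008, 0x7FF0]
wer = word_error_rate(ce_log, mem_size_words)
p_ue = ue_probability([False, True, False, False])
```

Counting unique word locations matters because a single weak cell may be reported many times during one run.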
D. Program Inherent Features
To investigate software-level factors that may affect DRAM
reliability, we extract the following program features.
The DRAM Reuse Time: The DRAM reuse time (Treuse)
is the average time between memory accesses to the same
64-bit word (or a DRAM location). This metric is important
for our study, as memory accesses inherently refresh the
stored charge [1], [78], while Treuse denotes the average
period between accesses to the DRAM cells, and thus, the
average refresh period of cells incurred by memory accesses.
If Treuse < TREFP for a running program, then the number of
DRAM errors induced by the charge leakage will decrease. We
estimate Treuse by averaging the DRAM reuse time over all
memory accesses, i.e., $T_{reuse} = \frac{\sum_{i=1}^{N_{mem}} T^{i}_{reuse}}{N_{mem}}$, where $T^{i}_{reuse}$
is the reuse time for the i-th memory access instruction with
reference to some address. In turn, we calculate $T^{i}_{reuse}$ as:
$$T^{i}_{reuse} = CPI \times D^{i}_{reuse} \qquad (4)$$
In this equation, CPI is the average number of clock cycles
per instruction measured for an entire program, and $D^{i}_{reuse}$ is the
number of instructions executed since the last reference to the
address accessed by the i-th instruction. We extract $D^{i}_{reuse}$ using
a dynamic binary instrumentation tool, DynamoRIO [8]. We
validated Treuse estimates using micro-benchmarks where we
can control and measure Treuse for specific memory accesses,
and found that the approximation is accurate.
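The averaging in Equation (4) can be sketched over an instruction-level access trace as follows. The trace format, the function name and the handling of first-time references (skipped here) are assumptions; the optional division by a clock frequency to obtain seconds is also an assumption, since Equation (4) as written yields a value in clock cycles.

```python
def average_reuse_time(trace, cpi, clock_hz=None):
    """Average T_reuse over a trace of (instruction_index, address) accesses.

    For each access, D_reuse is the number of instructions executed since the
    previous reference to the same address, and T_reuse_i = CPI * D_reuse_i.
    First-time references have no previous access and are skipped (assumption).
    The result is in cycles, or in seconds when clock_hz is given.
    """
    last_seen = {}
    reuse_times = []
    for instr_index, address in trace:
        if address in last_seen:
            d_reuse = instr_index - last_seen[address]
            reuse_times.append(cpi * d_reuse)
        last_seen[address] = instr_index
    if not reuse_times:
        return None
    mean_cycles = sum(reuse_times) / len(reuse_times)
    return mean_cycles / clock_hz if clock_hz else mean_cycles

# Hypothetical trace: (dynamic instruction count, accessed 64-bit word address).
trace = [(0, 0xA0), (10, 0xB0), (40, 0xA0), (100, 0xB0), (120, 0xA0)]
t_reuse_cycles = average_reuse_time(trace, cpi=2.0)
```

In this toy trace the reuse distances are 40, 90 and 80 instructions, so with a CPI of 2 the per-access reuse times are 80, 180 and 160 cycles.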
The Data Entropy: To quantify the varying data patterns
(DPs) stored in memory across different time instances, we
introduce a new metric, the DP entropy, HDP . To estimate
HDP , we profile all workloads with DynamoRIO and take
samples of the data for each write memory access that is
ultimately stored in DRAM. We then estimate HDP based
on the sampled data as:
$$H_{DP} = -\sum_{i=0}^{2^{32}-1} P(x_i) \times \log_2(P(x_i)); \qquad P(x_i) = \frac{N_{WR}(x_i)}{N_{TOTWR}} \qquad (5)$$
where NWR(xi) is the number of write operations with data
xi in a word and NTOTWR is the total number of writes.
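Equation (5) is the Shannon entropy of the distribution of written values. A minimal sketch of the computation is given below, assuming the sampled write data is available as a list of 32-bit integers (the word width is inferred from the summation bound); the function name and sample values are hypothetical.

```python
import math
from collections import Counter

def data_pattern_entropy(written_words):
    """Shannon entropy (bits) of the sampled write values: H = -sum p * log2(p)."""
    counts = Counter(written_words)
    total = sum(counts.values())
    # Values that never occur contribute nothing, so only observed values are summed.
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

# Hypothetical samples of 32-bit values captured on write accesses.
uniform = [0x00000000, 0xFFFFFFFF, 0xAAAAAAAA, 0x55555555]
constant = [0xDEADBEEF] * 4

h_uniform = data_pattern_entropy(uniform)    # four equally likely values
h_constant = data_pattern_entropy(constant)  # a single repeated value
```

A workload that writes the same value repeatedly has zero entropy, while one that writes many distinct values with equal frequency approaches the maximum of 32 bits.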
Performance Counters: Another important parameter that
may affect DRAM reliability is the number of memory
accesses executed per cycle, as the cell-to-cell interference
grows with the rate of memory accesses [47], [64]. We
measure this number, along with 247 program metrics, such as
L1/L2/memory accesses (writes and reads) per cycle, IPC
and the SoC utilization, using existing hardware performance
counters (perf ) to investigate the potential effect of other
architecture-level parameters on DRAM error behavior.
E. Data Collection
To collect data for training of the ML models, we run a
set of representative benchmarks (workloads) under varying
DRAM operating parameters, such as TREFP , VDD and
temperature, and measure WER and PUE , as shown in
Figure 3. We additionally run each benchmark to collect all
the inherent program features using DynamoRIO and the perf
tool (Profiling phase). Then, we combine collected program
inherent features with the WER or PUE measurements.
Fig. 3: Overview of the data collection and validation processes.
Fig. 4: WER of each benchmark under
2.283 s TREFP and 1.428 V VDD at 50◦C
(x-axis: time in minutes, y-axis: WER ×10^-7).
Fig. 5: X-Gene2 with the custom thermal adapters.
Fig. 6: Temperature controller board.
F. Accuracy Evaluation of ML Models
We evaluate the accuracy of the ML models using the cross-validation
technique [33] by partitioning the collected data
into a test set and a training set. We use the Leave-One-
Out [54] partitioning as shown in Figure 3. In particular, for
each benchmark we create a test set that consists of samples
taken only for a specific benchmark, whereas the training
set contains all other samples. We train the model (Training
phase) and test (Testing phase) its prediction accuracy for each
pair of training and testing sets (see Figure 3). Finally, we
average the prediction accuracy over all testing experiments,
the number of which is equivalent to the total number of
benchmarks.
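The partitioning described above, in which all samples of one benchmark are held out at a time, can be sketched as follows. The sample layout, the function names, the mean-of-training-set placeholder model and the use of mean relative error as the accuracy measure are assumptions made for illustration; the WER values are made up.

```python
def leave_one_workload_out(samples):
    """Yield (workload, training_set, test_set) for every workload in samples.

    Each sample is a dict with at least a "workload" key.
    """
    workloads = sorted({s["workload"] for s in samples})
    for held_out in workloads:
        train = [s for s in samples if s["workload"] != held_out]
        test = [s for s in samples if s["workload"] == held_out]
        yield held_out, train, test

def mean_relative_error(predicted, measured):
    """Average of |predicted - measured| / measured over paired values."""
    errors = [abs(p - m) / m for p, m in zip(predicted, measured)]
    return sum(errors) / len(errors)

# Hypothetical data set: made-up WER values for three workloads.
samples = [
    {"workload": "backprop", "trefp": 2.283, "wer": 4.0e-7},
    {"workload": "backprop", "trefp": 1.173, "wer": 1.0e-7},
    {"workload": "memcached", "trefp": 2.283, "wer": 1.0e-7},
    {"workload": "srad", "trefp": 2.283, "wer": 3.0e-7},
]

fold_errors = []
for held_out, train, test in leave_one_workload_out(samples):
    # Placeholder model: predict the mean WER of the training set.
    baseline = sum(s["wer"] for s in train) / len(train)
    predictions = [baseline] * len(test)
    fold_errors.append(mean_relative_error(predictions, [s["wer"] for s in test]))

# Final score: average over folds, one fold per workload.
average_error = sum(fold_errors) / len(fold_errors)
folds = list(leave_one_workload_out(samples))
```

Holding out a whole workload, rather than individual samples, checks whether the model generalizes to a program it has never seen.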
IV. EXPERIMENTAL SETUP
To enable DRAM characterization, we developed a unique
experimental setup which we discuss in this section.
A. Experimental Framework
The basis of our experimental framework is a state-of-
the-art commodity 64-bit ARMv8-based server, the X-Gene2
Server-on-a-Chip. The X-Gene2 SoC consists of eight 64-
bit ARMv8 cores running at 2.4GHz. The X-Gene2 has four
DDR3 Memory Controller Units (MCUs). In our campaign,
we are experimenting with 4 Micron DDR3 8GB DIMMs at
1866 MHz [45], with one DIMM per MCU. In total, we are
characterizing 72 chips of 4Gb x8 DDR3 [46], since each
DIMM includes 16 and 2 DRAM chips for data storage and
ECC, respectively.
DRAM Thermal Testbed on a Server. To perform the
experiments under controlled temperatures, we implement a
temperature-controlled testbed using heating elements [22] for
DRAMs on a server. Figure 5 shows the X-Gene2 board with
four DIMMs fitted with our custom adapters. Each adapter
consists of a resistive element, with thermally conductive tape
transferring the heat of the element to all the chips in a
DIMM in a uniform way, and a thermocouple to measure the
temperature. The temperature of each element is controlled
by a controller board, as shown in Figure 6, which contains a
Raspberry Pi 3 [16] and four closed-loop PID controllers [9].
B. DRAM Parameters and Error Accounting
The X-Gene2 provides access to a separate light-weight in-
telligent processor (SLIMpro), which is a special management
core that is used to boot the system and provide access to the
on-board sensors to measure the temperature and the power of
the SoC and DRAM. The SLIMpro also reports all memory
errors corrected or detected by SECDED ECC to the Linux
kernel, providing information about the DIMM, bank, rank,
row and column in which the error occurred. Finally, SLIMpro
allows the configuration of the parameters of the MCUs, such
as TREFP and VDD. Specifically, TREFP may be changed
from the nominal 64 ms to 2.283 s, which is the maximum
on the X-Gene2 server. The server runs a fully-fledged OS
based on CentOS 7 with the default Linux kernel 4.3.0 for
ARMv8 and support for 64KB pages.
C. Benchmarks
In our study, we use Rodinia and Parsec benchmark suites,
specifically the backprop, nw, srad, kmeans and fmm bench-
marks, which represent a variety of compute-intensive algo-
rithms [6], [11]. To evaluate how parallelism and processing
power affect the characterization, we run these benchmarks
with 1 and 8 threads. To investigate the effect of popular
caching and analytics workloads on DRAM reliability, we
run the memcached benchmark [60], the pagerank algorithm
(pagerank), the betweenness centrality algorithm (bc) and the
breadth-first search algorithm (bfs) [69], [74]. Finally, we run
each benchmark allocating 8 GB of data to exclude the effect
of the data size factor on DRAM errors.
V. DRAM CHARACTERIZATION AND DATA COLLECTION
In this section, we characterize DRAM error behavior when
running real workloads under lowered VDD, different levels of
TREFP and the selected DRAM temperature range.
Threads   | nw    | srad | backprop | kmeans | fmm
1 thread  | 10.93 | 2.82 | 1.61     | 0.17   | 8.88
8 threads | 4.06  | 1.89 | 1.10     | 0.50   | 2.41
Threads   | memcached | pagerank | bfs  | bc
8 threads | 0.09      | 0.48     | 0.61 | 0.56
TABLE II: The average DRAM reuse time (in seconds).
Temperature. We characterize DRAM at three temperature
levels: 50◦C, 60◦C and 70◦C. We use this temperature range
to follow previous studies [39] and the DIMM specifica-
tion [45], in which the vendor reports the maximum operating
temperature of 70◦C. Note that this temperature range is
common for dense server environments [40], [48], [53].
DRAM Circuit Parameters. We experimentally determine
the lowest operating DRAM VDD to be 1.428 V, below which the
circuitry of the DRAM is likely to stop working. We execute
all the benchmarks with the memory operating under the
minimum VDD (1.428 V ) discovered in our experiments; how-
ever, the benchmarks have not manifested errors for DRAM
operating at 50◦C. Moreover, we discover only a few CEs by
running benchmarks at 60◦C and 70◦C. Thus, reducing VDD
from the nominal 1.5 V down to 1.428 V (or by 5%) has a
negligible effect on DRAM reliability.
The maximum power gain is achieved when both TREFP
and VDD are scaled. To achieve this gain, in the rest of this
paper, we set the minimum VDD (1.428 V ) and run all the
benchmarks under different TREFP .
A. Correctable errors
In our experiments with all the benchmarks for DRAM
operating under scaled TREFP and VDD, we encounter only
CEs at 50◦C and 60◦C, and no UEs or SDCs.
Previously, it was discovered that the memory cell leakage
may change over time due to a phenomenon called variable
retention time (VRT) [65]. As a result, DRAM error behavior
may vary across runs of the same application, and thus, it is
essential to run each application several times until a target
DRAM error metric converges to a specific value. To this end,
we run each application for 2 hours with DRAM operating
under the maximum TREFP (i.e. 2.283 s) and lowered VDD
(1.428 V ) at 50◦C. Figure 4 shows how the rate (WER) of
single-bit errors detected in 64-bit words changes over time
for each benchmark. Note that labels with abbreviation (par)
correspond to the parallel version of the compute-intensive
benchmarks. We see that after 2-hour runs WER converges to
a stable value for each benchmark: the average change in
the WER for the last 10 minutes of each experiment does
not exceed 3 % at 50◦C. We observe the same results for
DRAM operating at 60◦C. These observations imply that 120
minutes is sufficient for identifying the vast majority of error-prone
memory locations and characterizing DRAM behavior
when running a specific benchmark.
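The convergence criterion above (a change in WER of no more than 3 % over the last 10 minutes of a run) can be checked with a small helper such as the one below. How the change was computed in the study is not specified, so the relative-change definition, the function names and the sample series are assumptions.

```python
def relative_change_in_window(timestamps_min, wer_values, window_min=10):
    """Relative change of WER over the last window_min minutes of a run."""
    end_time = timestamps_min[-1]
    # First sample that falls inside the trailing window.
    start_index = next(
        i for i, t in enumerate(timestamps_min) if t >= end_time - window_min
    )
    start, end = wer_values[start_index], wer_values[-1]
    return abs(end - start) / end

def has_converged(timestamps_min, wer_values, window_min=10, threshold=0.03):
    """True when WER changed by no more than threshold in the trailing window."""
    return relative_change_in_window(timestamps_min, wer_values, window_min) <= threshold

# Hypothetical cumulative WER samples taken every 10 minutes of a 2-hour run.
times = list(range(10, 130, 10))
wer = [1.0e-7, 1.6e-7, 2.0e-7, 2.3e-7, 2.5e-7, 2.65e-7,
       2.75e-7, 2.82e-7, 2.87e-7, 2.91e-7, 2.94e-7, 2.96e-7]
converged = has_converged(times, wer)
```

With these made-up values the series has flattened by the end of the run, whereas the same check applied to only the first 30 minutes would still report a large change.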
WER: Further, we investigate how WER varies across
benchmarks when DRAM operates under different TREFP at
50◦C and 60◦C. We run benchmarks for DRAM operating
under 0.618 s, 1.173 s, 1.727 s, 2.283 s TREFP and lowered
VDD. Figure 7 illustrates how WER changes with scaling
TREFP at 50◦C and 60◦C. Our first observation is that WER
varies across benchmarks significantly; for example, the difference
reaches almost 8× for memcached and backprop(par)
when DRAM operates under 0.618 s at 70◦C. Our second
observation is that the benchmark that incurs the highest
WER may change with the DRAM temperature and TREFP ;
for example, the highest WER is obtained for srad when
DRAM operates under 1.173 s at 50◦C, while for DRAM
operating under the same TREFP at 70◦C the highest WER
is observed for the bfs benchmark. This indicates that DRAM
operational and environmental parameters may change the
ratio of WER between different workloads which is hard
to capture with an analytical model. Our third observation is
that WER grows exponentially with TREFP (see Figure 7f).
Finally, we see that the WER incurred by the parallel version
of some benchmarks differs from the WER obtained for the
single-threaded version of these benchmarks. For example,
the WER measured for backprop is almost 30 % greater
than the WER obtained for backprop(par) when DRAM
operates under 2.283 s TREFP at 50◦C and 60◦C. The
same difference is also observed in the case of the srad
benchmark. Importantly, parallel and single-threaded versions
of the same workload have different memory access scenarios,
but a similar data pattern. Thus, these observations imply that
the memory access pattern of a running program may also
significantly affect DRAM error behavior.
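One way to check the exponential growth of WER with TREFP noted in the third observation is to fit a straight line to ln(WER) as a function of TREFP. The sketch below uses the four TREFP values from the text, but the WER values are synthetic placeholders generated from a known exponential; they are not measurements from the paper. The function name is hypothetical.

```python
import math


def fit_exponential(trefp_s, wer):
    """Least-squares fit of ln(WER) = ln(a) + b * TREFP; returns (a, b)."""
    if len(trefp_s) != len(wer) or len(trefp_s) < 2:
        raise ValueError("need two or more paired samples")
    ys = [math.log(w) for w in wer]  # requires strictly positive WER values
    n = len(trefp_s)
    mean_x = sum(trefp_s) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in trefp_s)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(trefp_s, ys))
    b = sxy / sxx
    a = math.exp(mean_y - b * mean_x)
    return a, b


# TREFP values (seconds) are taken from the text.
trefp = [0.618, 1.173, 1.727, 2.283]
# Synthetic WER values, NOT measured data: an exact exponential with b = 3.0.
synthetic_wer = [1e-10 * math.exp(3.0 * t) for t in trefp]
a, b = fit_exponential(trefp, synthetic_wer)
```

On real measurements, a good linear fit in log space would support the exponential trend, and the slope b would summarize how quickly WER grows per second of added refresh period.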
To investigate the difference in WER between parallel and
single-threaded benchmarks, we calculate Treuse for each
workload, as shown in Table II. We see that the Treuse
of the parallel backprop and srad is lower than the Treuse
estimated for the single-threaded versions of backprop and srad,
respectively. It follows that, in the case of backprop and srad,
the parallel benchmarks implicitly refresh data in memory
more frequently than the single-threaded benchmarks do, by
generating more accesses to the same regions of memory per
cycle. As a result, we observe a lower error rate for these parallel
benchmarks. By contrast, in the case of kmeans, the parallel
version has a higher Treuse (0.50 s) than the serial version
(0.17 s), due to better data locality in caches obtained for the
parallel kmeans. Accordingly, the parallel version generates
fewer references to the same memory per cycle than
the single-threaded version does, resulting in a higher Treuse and
therefore a higher WER. Lastly, among all workloads, memcached
has both the lowest WER and the lowest Treuse for DRAM
operating under different TREFP and temperatures,
which confirms that there is a correlation between
Treuse and DRAM error behavior.
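To make the notion of reuse time concrete, the sketch below computes an average reuse time from an access trace. It is an illustration only: the precise definition of Treuse is given in Section III, and here it is assumed to be the mean time between consecutive accesses to the same memory location. The trace format and the function name are hypothetical.

```python
def average_reuse_time(trace):
    """Mean gap (seconds) between consecutive accesses to the same location.

    trace: iterable of (timestamp_seconds, location_id), sorted by time.
    Returns None when no location is accessed more than once.
    """
    last_seen = {}
    total_gap = 0.0
    gap_count = 0
    for timestamp, location in trace:
        if location in last_seen:
            total_gap += timestamp - last_seen[location]
            gap_count += 1
        last_seen[location] = timestamp
    return total_gap / gap_count if gap_count else None


# Toy trace: location A is reused after 0.5 s twice, location B after 0.6 s.
toy_trace = [(0.0, "A"), (0.1, "B"), (0.5, "A"), (0.7, "B"), (1.0, "A")]
toy_treuse = average_reuse_time(toy_trace)
```

Under this assumed definition, a workload that touches the same locations more often per unit time yields a lower value, which matches the argument above that frequent accesses act as implicit refreshes.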
To investigate how WER varies across different DIMMs
and ranks, we group all the collected errors by
source DIMM/rank. Figure 8 shows WER measured on
different DIMMs and ranks when DRAM operates under
2.283 s TREFP at 50◦C. We see that WER varies across
DIMMs/ranks by up to 188×; in particular, the WER values incurred
by the bc benchmark on DIMM2/rank0 and DIMM3/rank1
are 1.75 × 10−7 and 9.31 × 10−10, respectively. Therefore, to
enable accurate DRAM error predictions, a model should take
into consideration the error behavior of a specific DIMM.
[Figure 7: bar charts of WER per benchmark (backprop, backprop(par), kmeans, kmeans(par), nw, nw(par), srad, srad(par), fmm, fmm(par), pagerank, bfs, bc, memcached). Panels: (a) 50◦C, TREFP = 0.618 s and 1.173 s; (b) 50◦C, TREFP = 1.727 s and 2.283 s; (c) 60◦C, TREFP = 0.618 s and 1.173 s; (d) 60◦C, TREFP = 1.727 s and 2.283 s; (e) 70◦C, TREFP = 0.618 s and 1.173 s; (f) average WER versus TREFP (sec) at 50◦C and 60◦C.]
Fig. 7: WER for DRAM operating under 0.618 s, 1.173 s, 1.727 s, 2.283 s at 50◦C(a,b), 60◦C(c,d) and 70◦C(e). The WER
averaged over all benchmarks for DRAM operating at 50◦C and 60◦C(f)
[Figure 8: bar chart of WER per benchmark, with one bar per DIMM/rank (DIMM0/rank0 through DIMM3/rank1).]
Fig. 8: WER per DIMM/rank obtained for DRAM operating under 2.283 s TREFP at 50◦C.
[Figure 9: (a) bar chart of PUE per benchmark and on average for TREFP = 1.450 s, 1.727 s and 2.283 s; (b) bar chart of the probability to obtain a UE per DIMM/rank, with bar labels in DIMM/rank order: DIMM0/rank0 0.02, DIMM0/rank1 0.24, DIMM1/rank0 <0.01, DIMM1/rank1 <0.01, DIMM2/rank0 0.67, DIMM2/rank1 0.05, DIMM3/rank0 <0.01, DIMM3/rank1 0.]
Fig. 9: a) PUE and b) The probability to obtain a UE on a specific DIMM/rank when DRAM operates under 1.450 s, 1.727 s
and 2.283 s TREFP at 70◦C.
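The 188× spread across DIMMs/ranks quoted for the bc benchmark follows directly from the two WER values given in the text. The short sketch below reproduces that arithmetic; the helper name and the dictionary layout are hypothetical, and only the two values stated in the text are used.

```python
def variation_factor(wer_by_location):
    """Ratio of the highest to the lowest non-zero WER across DIMM/rank locations."""
    nonzero = [w for w in wer_by_location.values() if w > 0]
    if not nonzero:
        raise ValueError("no non-zero WER values")
    return max(nonzero) / min(nonzero)


# WER of the bc benchmark at 2.283 s TREFP and 50 C, as stated in the text.
bc_wer = {"DIMM2/rank0": 1.75e-7, "DIMM3/rank1": 9.31e-10}
print(f"{variation_factor(bc_wer):.0f}x")  # prints 188x
```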
B. Uncorrectable Errors and System Crashes
In our experiments with DRAM operating at 50◦C and
60◦C, we have discovered no Silent Data Corruptions (SDCs)
or uncorrectable errors (UEs). However, we encounter UEs
and system crashes when raising the DRAM temperature to
70◦C and scaling TREFP up to 1.45 s under lowered VDD.
Note that in our framework, any UE triggered by the Linux
kernel or a user-level program, once detected by ECC, will
result in a system crash.
Figure 9a shows PUE, the likelihood of observing a UE,
measured across all benchmarks for DRAM operating under
1.450 s, 1.727 s and 2.283 s TREFP and lowered VDD at 70◦C.
To estimate this probability, we repeat each 2-hour experiment
with a specific benchmark 10 times. We see that PUE varies
significantly across benchmarks for DRAM operating under
1.450 s TREFP; it reaches 0.8 for fmm(par), whereas it
equals 0 for memcached and pagerank. We also observe that
PUE is greater than 0 only for the parallel compute-intensive
benchmarks, while it is 0 for all the single-threaded
benchmarks except for srad. The PUE averaged over all benchmarks
for DRAM operating under 1.450 s TREFP is below 0.4.
However, when we increase TREFP to 1.727 s, the
likelihood of crashing averaged over benchmarks grows by
2.15× (see Figure 9a). Moreover, for DRAM operating under
this TREFP, there is no benchmark with PUE = 0. Finally,
all the benchmarks trigger UEs in 100% of the experiments
when we use the maximum TREFP (2.283 s) at 70◦C. These
results show that TREFP and the DRAM temperature have a
dominant effect on the likelihood of a UE.
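The estimate of PUE from repeated runs can be sketched as a simple frequency. The text states that each experiment is repeated 10 times and that PUE reaches 0.8 for fmm(par) at 1.450 s TREFP; the per-run outcomes below are a hypothetical arrangement consistent with that value, not recorded data, and the function name is an assumption.

```python
def estimate_pue(run_crashed):
    """Fraction of repeated runs that ended with an uncorrectable error.

    run_crashed: list of booleans, one per repetition of the experiment.
    """
    if not run_crashed:
        raise ValueError("need at least one run")
    return sum(run_crashed) / len(run_crashed)


# Hypothetical outcomes for 10 repetitions: 8 crashes give PUE = 0.8.
fmm_par_runs = [True] * 8 + [False] * 2
fmm_par_pue = estimate_pue(fmm_par_runs)
```

With only 10 repetitions the estimate moves in steps of 0.1, so small differences in PUE between benchmarks should be read with that granularity in mind.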
Figure 9b depicts the probability to obtain a UE on a
specific DIMM/rank when ECC detects a UE. We see that
the vast majority of UEs are triggered by DIMM0/rank1 and
DIMM2/rank0, while DIMM3/rank1 does not trigger UEs at all.
Thus, DRAM reliability varies significantly from DIMM to
DIMM not only in terms of WER but also in the probability to
obtain a UE. Importantly, we have discovered no SDCs when
running experiments under different TREFP at 50◦C, 60◦C,
and 70◦C.
VI. ACCURACY EVALUATION OF ML MODELS
In this section, we present the results of the feature selection
process and accuracy evaluation of ML models.
A. Feature selection
The accuracy of an ML model strongly depends on the set of
features used for training the model. If the model is trained
using features that are not correlated with the metric
that we aim to predict, then the model may overestimate the
significance of some features [12]. As a result, the model
achieves a low prediction accuracy. To identify those
features that may affect DRAM reliability, we extract 249
program features, including Treuse (the average memory reuse
time) and HDP (the data entropy, see Section III), for each
benchmark, and correlate them with both the WER and PUE
metrics.
WER: We build the correlation of WER and program features
using the combined measurements taken under different
levels of TREFP (0.618 s, 1.173 s, 1.727 s, 2.283 s) at 50◦C,
60◦C and 70◦C, where we observe no UEs or system crashes.
To identify and quantify any dependency between program
features and the DRAM error metrics formally, we use
Spearman’s rank correlation coefficient (rs). This correlation
coefficient allows us to detect both linear and non-linear
monotonic relationships [56]. Coefficient values lie in the range [−1,+1],
in which −1 or +1 occurs when there is a perfect monotonic
relationship between two variables.
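Spearman's rank correlation coefficient can be computed as the Pearson correlation of the ranks of the two variables. The sketch below is a self-contained illustration with average ranks for ties; it is not the authors' analysis code, and the function names are hypothetical.

```python
def average_ranks(values):
    """Ranks starting at 1; tied values receive the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman's rs: Pearson correlation computed on average ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two or more paired observations")
    return pearson(average_ranks(xs), average_ranks(ys))
```

For example, a feature x and a metric y = x³ are not linearly related, yet their ranks agree exactly, so rs equals +1; reversing the order of y gives −1.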
[Figure 10: scatter plot of rs(WER) on the Y-axis versus rs(PUE) on the X-axis for the program features; labeled points: MCU accesses, Memory accesses, IPC, wait cycles, CPU utilization, HDP, Treuse.]
Fig. 10: rs for 249 program features and WER and PUE.
Figure 10 shows the correlation coefficients for 249
program features and WER on the Y-axis, whereas
the correlation coefficients for these features and PUE are
shown on the X-axis. We see that the number of memory
accesses per cycle is highly correlated with WER, as rs is
above 0.57, indicating a positive direction of the correlation;
in other words, WER grows with the number of accesses
per cycle. We also observe that the group of performance
indicators that reflects the number of issued memory read
and write commands per cycle in different MCUs is also
highly correlated with WER. However, the number of such
commands is determined by the number of memory read and
write instructions executed by the processor per cycle.
Another inherent program feature that is strongly correlated
with WER is wait cycles (rs is 0.4). This feature reflects the
ratio of the number of cycles spent on waiting for data to
the total number of program cycles. Nonetheless, wait cycles
is implicitly determined by the number of memory accesses
per clock cycle, as it encapsulates idle cycles due to memory
access stalls which explains its correlation with WER.
We attribute the correlation of the memory access rate
and WER to disturbance errors induced by cell-to-cell
interference [47], [64]. Previously, it was shown that, if a row
is accessed many times, then some cells in neighbouring
rows may leak charge quickly [32]. Thus, by accessing the
memory at a high rate, we increase the probability of
interference errors for DRAM operating under scaled
TREFP and VDD. By contrast, under a higher memory access
rate, each cell may be implicitly refreshed more frequently,
resulting in a lower WER. However, this effect occurs only
for those benchmarks in which Treuse < TREFP. Therefore, a
high memory access rate may have negative or positive effects
on DRAM reliability, depending on Treuse and TREFP.
Notably, Treuse is greater than the maximum TREFP (2.283 s)
available on our platform for almost 30 % of the benchmarks.
Thus, Treuse does not have any effect on DRAM error
behavior in these benchmarks. This lack of an effect explains
why Treuse (rs is 0.23) is less correlated with WER than the
rate of memory accesses.
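The condition Treuse < TREFP discussed above can be expressed as a small check. The sketch below is illustrative: the two kmeans reuse times and the maximum TREFP come from the text, while the third entry is a hypothetical benchmark added only to show the case where implicit refresh cannot help. The function names are assumptions.

```python
MAX_TREFP_S = 2.283  # maximum TREFP available on the platform, from the text


def can_be_implicitly_refreshed(t_reuse_s, trefp_s):
    """True when data is typically re-accessed before the refresh period elapses."""
    return t_reuse_s < trefp_s


def fraction_unaffected(t_reuse_by_benchmark, trefp_s=MAX_TREFP_S):
    """Share of benchmarks whose reuse time is too long for implicit refresh to help."""
    if not t_reuse_by_benchmark:
        raise ValueError("no benchmarks given")
    unaffected = [
        name
        for name, t_reuse in t_reuse_by_benchmark.items()
        if not can_be_implicitly_refreshed(t_reuse, trefp_s)
    ]
    return len(unaffected) / len(t_reuse_by_benchmark)


t_reuse = {
    "kmeans": 0.17,             # seconds, from the text
    "kmeans(par)": 0.50,        # seconds, from the text
    "hypothetical_slow": 3.0,   # made-up value for illustration only
}
share = fraction_unaffected(t_reuse)
```

Applied to the full set of measured reuse times, the same computation would reproduce the "almost 30 %" share of benchmarks for which Treuse exceeds the maximum TREFP.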
Our experiments show that HDP, which reflects the data
pattern of a running application, is also correlated with WER,
as the rs is 0.39 (see Figure 10). Although it is higher than
the rs obtained for Treuse, it is 51 % lower than the rs
observed for the memory access rate.
The probability of a UE: Similar to WER, we discover
a correlation between PUE and the memory access rate, the
number of issued memory read and write commands per cycle
in different MCUs, HDP, and wait cycles. However, the level
of this correlation is lower than in the case of WER; for
example, the rs for the memory access rate and PUE is 0.43,
which is 35 % less than the same rs for WER. It is noteworthy
that, unlike previous studies, which have indicated a strong
impact of Treuse or HDP [27], [77], we obtain the highest rs
for the memory access rate among all the program features
when correlating them with the WER and PUE metrics.
Implication: Thus, our study indicates that the memory
access rate has a major effect on DRAM reliability, which
is stronger than the effect of the data content stored in DRAM
and the average DRAM reuse time.
B. Accuracy evaluation
WER: We start our evaluation campaign by applying SVM,
KNN and RDF models to predict WER using 3 different
input sets of parameters (see Table III), which consist of
different combinations of program features, TREFP and the
DRAM temperature (TEMPDRAM). Note that we investigate
different input sets, as it is known that the accuracy of an
ML model depends on the input parameters that are chosen
for training [13]. We build the first two input sets using the
program features that are strongly correlated with DRAM error
behavior. In the third set of input parameters, we include all the
collected program features, to investigate the model accuracy
when all the available parameters are provided to the model.
Input set   Parameters
1           TEMPDRAM, TREFP, wait cycles, memory accesses, HDP, Treuse
2           TEMPDRAM, TREFP, wait cycles, memory accesses
3           TEMPDRAM, TREFP, all program features
TABLE III: Input feature sets used for training
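The input sets of Table III and a KNN-style prediction can be sketched as follows. This is a minimal illustration under stated assumptions: the feature key names are hypothetical labels for the parameters in Table III, the predictor is a plain K-nearest-neighbors regressor with Euclidean distance and uniform averaging, and the actual library, hyperparameters and feature scaling used in the study are not given in this section.

```python
import math

# Hypothetical key names for the parameters listed in Table III.
# Input set 3 ("all program features") is represented by None.
INPUT_SETS = {
    1: ["temp_dram", "trefp", "wait_cycles", "memory_accesses", "hdp", "t_reuse"],
    2: ["temp_dram", "trefp", "wait_cycles", "memory_accesses"],
    3: None,
}


def select_features(sample, input_set):
    """Return the feature vector of `sample` (a dict) for the given input set."""
    names = INPUT_SETS[input_set] or sorted(sample)
    return [sample[name] for name in names]


def knn_predict(train_x, train_y, query, k=3):
    """Average target of the k training points closest to `query` (Euclidean)."""
    if not 1 <= k <= len(train_x):
        raise ValueError("k must be between 1 and the number of training points")
    nearest = sorted(
        range(len(train_x)), key=lambda i: math.dist(train_x[i], query)
    )[:k]
    return sum(train_y[i] for i in nearest) / k
```

In practice the features would need to be brought to comparable scales before computing distances, since otherwise a parameter with a large numeric range dominates the neighbor search; that step is omitted here for brevity.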
Figure 11 (a,b,c) shows the mean percentage error (MPE)
of WER estimates provided by SVM, KNN and RDF per
DIMM/rank for all three sets of input parameters. We see that
the minimum error of WER estimates averaged over all the
DIMMs and ranks is achieved when we use the first set of
input parameters for SVM (16.3 %) and KNN (10.1 %), while
the average errors incurred by the second input set for SVM and
KNN are 17.0 % and 10.2 %, respectively. Thus, adding
HDP and Treuse to the input parameter set only slightly
increases the accuracy of the two models. This implies that the
memory access rate has the strongest impact on DRAM error
behavior among all the program features, which is consistent
with the results of the feature selection process.
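The error metric reported in this evaluation can be sketched as below. The exact formula is not spelled out in this section, so the sketch assumes that the mean percentage error is the mean of the absolute relative differences between predicted and measured values, expressed in percent. The function name and the example values are hypothetical.

```python
def mean_percentage_error(measured, predicted):
    """Mean of |predicted - measured| / measured, in percent (assumed definition)."""
    if len(measured) != len(predicted) or not measured:
        raise ValueError("need equally long, non-empty sequences")
    total = sum(abs(p - m) / m for m, p in zip(measured, predicted))
    return 100.0 * total / len(measured)


# Made-up values for illustration: each prediction is off by 10 %.
example_mpe = mean_percentage_error([1.0e-7, 2.0e-7], [1.1e-7, 1.8e-7])
```

A relative metric of this kind is a natural choice here because WER spans several orders of magnitude across TREFP settings, so absolute errors would be dominated by the largest values.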
Notably, if we train SVM and KNN using all the collected
program features for each workload, then the average MPE
grows up to 29.3 % (SVM) and 12.3 % (KNN). We explain
this by overfitting of the model which happens when we train it
using all the available program features, including those that
do not affect DRAM reliability. In other words, the models
may overestimate the significance of some features when we
train the model using all the features, which results in a low
prediction accuracy obtained for the third set [12].
Interestingly, in contrast to SVM and KNN, RDF provides
the lowest accuracy of WER estimates (the error is 21.4 %)
when the first input set is used. Moreover, this model
demonstrates the highest accuracy (the error is 12.9 %) when all the
available program features are used for training and testing.
Nonetheless, this accuracy is lower than the best accuracy
achieved by KNN when the first input set is used. Furthermore,
the maximum error of WER estimates per application is
about 55 % when we use the third input set for the RDF
model, see Figure 11f (the fmm benchmark). Meanwhile, the
average error of WER estimates provided by SVM and KNN
per application does not exceed 30 % and 24 %, respectively,
when we use the first input set. Thus, we may conclude that
RDF has the lowest accuracy among the considered models
when predicting WER.
The probability of a UE: Figure 12 depicts the mean
percentage error of PUE estimates averaged over all
benchmarks and DIMMs. Similar to our experiments with WER,
we see that the first set incurs the lowest error (12.3 %) when
we use SVM, while the average error obtained by this model
for the second and third sets is above 15 %. However, KNN
and RDF demonstrate the lowest average error when we use
the second input feature set. Notably, this error is only 4.1 %
and 5.5 % for KNN and RDF, respectively, which is up to
3× lower than the lowest error (12.3 %) achieved by SVM.
To conclude, our study shows that the highest accuracy
of WER estimates is achieved by the K-nearest neighbors
algorithm (KNN) when we train it using the first input set
of parameters (i.e. the memory access rate, wait cycles, HDP
and Treuse, TEMPDRAM , TREFP and VDD). The highest
accuracy of PUE estimates is also demonstrated by KNN when
we use the second input set, which contains only the memory
access rate, wait cycles, TEMPDRAM and TREFP .
C. Workload-Aware Modeling vs Conventional Modeling
Many studies have proposed to model DRAM errors to
investigate either hardware design efficiency [41] or software
fault tolerance [36], [37], [42], [43]. However, all those studies
use constant DRAM error rates extracted on real DRAMs
when running data pattern micro-benchmarks [3], [19],
[22], [40], [62]. Our model can be used to improve those
studies and proposed techniques by considering workload-aware
DRAM error behavior. For example, Figure 13 depicts the
measured WER over all DIMMs when DRAM is operating
under 0.618 s TREFP at 70◦C for the lulesh benchmark and a
data pattern micro-benchmark that implements a random data
pattern [27]. This figure also shows the WER that has
been predicted by the KNN-based DRAM error behavioral
model. In this experiment, we use two versions of lulesh
to illustrate the implicit effect of compiler optimizations on
DRAM reliability: the benchmark compiled with −O2 (default
optimizations) and −F (aggressive optimizations). We see
that the model correctly predicts the WER incurred by both
versions of the benchmark; the error is less than 3 %. Such a
high accuracy enables us to correctly predict the difference in
WER between these benchmarks, which is about 29 %. At the
same time, we see that the random micro-benchmark incurs
a WER that is 2.9× higher than the WER obtained for lulesh.
Thus, conventional DRAM error modeling based
on constant rates may be inaccurate and lead to incorrect
conclusions about the effectiveness of applied techniques.
Moreover, the vast majority of research studies have
considered only hardware-level techniques to mitigate errors for
DRAM operating under scaled parameters [3], [19], [22], [62], which
introduce additional power and chip area overheads. However,
as we see, even compiler optimizations may implicitly affect
DRAM error behavior. To systematically study the effect
of compiler optimizations, it is essential to build a model,
since such a study may take months or even years if it is
conducted using DRAM characterization campaigns. By contrast,
our models predict DRAM errors within 300 ms, which opens
new avenues for research.
[Figure 11: bar charts of the error of WER estimates (%) for input sets 1, 2 and 3. Panels (a) SVM, (b) KNN and (c) RDF show the error per DIMM/rank (DIMM0/rank0 through DIMM3/rank1, plus the average); panels (d) SVM, (e) KNN and (f) RDF show the error per application.]
Fig. 11: The average error of WER estimates per DIMM/rank: a) SVM, b) KNN and c) RDF. The average error of WER
estimates per application: d) SVM, e) KNN and f) RDF.
[Figure 12: bar chart of the error of PUE estimates (%) for SVM, KNN and RDF with input sets 1, 2 and 3.]
Fig. 12: The error of PUE estimates averaged over applications and DIMMs.
[Figure 13: bar chart of WER for lulesh(O2), lulesh(O2) predicted, lulesh(F), lulesh(F) predicted and the data-pattern micro-benchmark.]
Fig. 13: The measured and predicted WER for lulesh and the random micro-benchmark (TREFP is 0.618 s, 70◦C).
VII. RELATED WORK
Scaling of TREFP and VDD: Many studies [26], [30],
[40], [59], [62], [85] tried to improve DRAM performance and
energy efficiency by adopting a low refresh period for "weak"
cells. The main idea of such an approach is to split memory
cells into groups based on their retention time and relax the
refresh rate for those groups where cells have small leakage.
Other works [1], [14], [17] suggested skipping refresh operations
for those memory segments that have been implicitly refreshed
by memory accesses. Several studies [3], [21] proposed to
extend this technique and selectively refresh only rows with
valid data allocated by running applications or the OS. Chang
et al. [10] provided the results of their study on reduced-voltage
operation in DDR3L memory devices. However, even
though the latest study [29] tried to capture the effect of
varying data patterns on DRAM reliability when running real
applications, all these studies ignored the combined effect of
data and memory access patterns on DRAM errors. To the best
of our knowledge, no previous work has systematically
investigated the combined impact of these patterns on memory
errors on a real server. Understanding such an impact is
crucial for facilitating the co-design of software and hardware
techniques to improve DRAM energy efficiency. Other
research studies proposed various fine-grained schemes to reduce
the number of refresh operations and thus improve DRAM
energy efficiency [17], [21], [34], [73], [82], [86]. Although
some of these studies utilize workload-inherent features, such
as the memory reuse time, they are orthogonal to our work.
Predictive maintenance and statistical prediction of er-
rors: Considerable research has been done on statistical pre-
diction of different types of hardware faults, including DRAM
errors, in supercomputers [4], [18], [35], [38], [44], [58],
[66], [89]. The majority of these studies proposed different
techniques, based either on rules [38] or Machine Learning
[66], for prediction of failures that may happen in various
hardware components using history of errors. Other research
studies tried to systematically investigate factors, including
workload-dependent factors, that may affect DRAMs in data
centers and supercomputers [44], [67], [70], [72]. Nonetheless,
all these studies tried to predict errors for hardware operating
under nominal parameters.
Hardware error prediction becomes extremely important in
production lines for identifying maintenance cycles or faulty
components (predictive maintenance) [18], [67]. However, any
study of failures for hardware operating under nominal
parameters may require years [18], while a reliability
characterization of hardware that operates under relaxed parameters
is much faster. In our future research, we aim to investigate
how characterization and modeling of errors for DRAM
operating under relaxed parameters can be applied to identify
maintenance cycles or any abnormal hardware behavior.
VIII. CONCLUSION
In this work, we present the results of a study on char-
acterization and prediction of the error behavior for DRAM
operating under scaled parameters within a real server. Our
results indicate that the rate of single- and multi-bit errors may
vary across workloads and DRAM chips by 8× and 188×,
respectively. We quantify the effect of inherent program fea-
tures that may significantly affect DRAM errors by correlating
249 features extracted from various benchmarks with DRAM
errors. We train three ML models to predict DRAM failure
rates and compare the accuracy of the models using different
sets of program features. We demonstrate that, with the correct
choice of program features and an ML model, the word-error-
rate for single-bit failures and the likelihood of a system crash
triggered by uncorrectable errors can be predicted for a specific
DRAM device with an average error of less than 10.5 %.
ACKNOWLEDGMENT
This work was funded by the H2020 Framework Program
of the European Union through the UniServer Project (Grant
Agreement 688540, http://www.uniserver2020.eu) and Opre-
Comp project (Grant Agreement 732631, http://oprecomp.eu).
We are grateful to Dr. Philip Hodgers (ECIT) for providing
the thermal testbed.
REFERENCES
[1] Aditya Agrawal, Prabhat Jain, Amin Ansari, and Josep Torrellas. Refrint:
Intelligent refresh to minimize power in on-chip multiprocessor cache
hierarchies. In 19th IEEE International Symposium on HPCA ’13, pages
400–411, 2013.
[2] Zaid Al-Ars, Said Hamdioui, and Ad J van de Goor. Effects of bit line
coupling on the faulty behavior of DRAMs. In 22nd IEEE VLSI Test
Symposium, 2004. Proceedings., pages 117–122, April 2004.
[3] Seungjae Baek, Sangyeun Cho, and Rami Melhem. Refresh Now and
Then. IEEE Transactions on Computers, 63(12):3114–3126, Dec 2014.
[4] Elisabeth Baseman, Nathan DeBardeleben, Kurt Ferreira, Scott Levy,
Steven Raasch, Vilas Sridharan, Taniya Siddiqua, and Qiang Guan.
Improving DRAM Fault Characterization through Machine Learning. In
2016 46th Annual IEEE/IFIP International Conference on Dependable
Systems and Networks Workshop (DSN-W), pages 250–253, June 2016.
[5] Leonardo Bautista-Gomez, Ferad Zyulkyarov, Osman Unsal, and Simon
McIntosh-Smith. Unprotected Computing: A Large-Scale Study of
DRAM Raw Error Rate on a Supercomputer. In SC16: International
Conference for High Performance Computing, Networking, Storage and
Analysis, pages 645–655, Nov 2016.
[6] Christian Bienia, Sanjeev Kumar, Jaswinder Pal Singh, and Kai Li.
The PARSEC Benchmark Suite: Characterization and Architectural
Implications. In Proceedings of the 17th International Conference on
Parallel Architectures and Compilation Techniques, PACT ’08, pages
72–81, New York, NY, USA, 2008. ACM.
[7] D. Brooks, V. Tiwari, and M. Martonosi. Wattch: a framework for
architectural-level power analysis and optimizations. In Proceedings
of 27th International Symposium on Computer Architecture (IEEE Cat.
No.RS00201), pages 83–94, June 2000.
[8] Derek Bruening and Timothy Garnett. Tutorial: Building Dynamic
Instrumentation Tools with DynamoRIO. In Proceedings of the 9th
Annual IEEE/ACM International Symposium on Code Generation and
Optimization, CGO ’11, pages xxi–, Washington, DC, USA, 2011. IEEE
Computer Society.
[9] Carel. ir33 universale electronic control. http://www.carel.com/product/
ir33-universale/, 2012. [Online; accessed 6-June-2019].
[10] Kevin K. Chang, A. Giray Yağlıkçı, Saugata Ghose, Aditya Agrawal,
Niladrish Chatterjee, Abhijith Kashyap, Donghyuk Lee, Mike O’Connor,
Hasan Hassan, and Onur Mutlu. Understanding Reduced-Voltage
Operation in Modern DRAM Devices: Experimental Characterization,
Analysis, and Mechanisms. Proc. ACM Meas. Anal. Comput. Syst.,
1(1):10:1–10:42, June 2017.
[11] Shuai Che, Michael Boyer, Jiayuan Meng, David Tarjan, Jeremy W.
Sheaffer, Sang-Ha Lee, and Kevin Skadron. Rodinia: A Benchmark
Suite for Heterogeneous Computing. In Proceedings of the 2009
IEEE International Symposium on Workload Characterization (IISWC),
IISWC ’09, pages 44–54, Washington, DC, USA, 2009. IEEE Computer
Society.
[12] Pedro Domingos. A Few Useful Things to Know About Machine
Learning. Commun. ACM, 55(10):78–87, October 2012.
[13] James Dougherty, Ron Kohavi, and Mehran Sahami. Supervised and
Unsupervised Discretization of Continuous Features. In Proceedings
of the Twelfth International Conference on International Conference on
Machine Learning, ICML’95, pages 194–202, San Francisco, CA, USA,
1995. Morgan Kaufmann Publishers Inc.
[14] Philip G Emma, William R Reohr, and Mesut Meterelliyoz. Rethinking
Refresh: Increasing Availability and Reducing Power in DRAM for
Cache Applications. IEEE Micro, 28(6):47–56, Nov 2008.
[15] Manuel Fernández-Delgado, Eva Cernadas, Senén Barro, and Dinani
Amorim. Do we need hundreds of classifiers to solve real world
classification problems? J. Mach. Learn. Res., 15(1):3133–3181, January
2014.
[16] Raspberry Pi Foundation. Raspberry Pi 3 Model B. https://www.
raspberrypi.org/, 2016. [Online; accessed 6-June-2019].
[17] Mrinmoy Ghosh and Hsien-Hsin S Lee. Smart Refresh: An Enhanced
Memory Controller Design for Reducing Energy in Conventional and
3D Die-Stacked DRAMs. In 40th Annual IEEE/ACM International
Symposium on Microarchitecture (MICRO 2007), pages 134–145, Dec
2007.
[18] Ioana Giurgiu, Jacint Szabo, Dorothea Wiesmann, and John Bird.
Predicting DRAM Reliability in the Field with Machine Learning. In
Proceedings of the 18th ACM/IFIP/USENIX Middleware Conference:
Industrial Track, Middleware ’17, pages 15–21, New York, NY, USA,
2017. ACM.
[19] Takeshi Hamamoto, Soichi Sugiura, and Shizuo Sawada. On the
retention time distribution of dynamic random access memory (DRAM).
IEEE Transactions on Electron Devices, 45(6):1300–1309, Jun 1998.
[20] Andy A. Hwang, Ioan A. Stefanovici, and Bianca Schroeder. Cosmic
Rays Don’t Strike Twice: Understanding the Nature of DRAM
Errors and the Implications for System Design. In Proceedings of
the Seventeenth International Conference on Architectural Support for
Programming Languages and Operating Systems, ASPLOS XVII, pages
111–122, New York, NY, USA, 2012. ACM.
[21] Ciji Isen and Lizy John. ESKIMO: Energy Savings Using Semantic
Knowledge of Inconsequential Memory Occupancy for DRAM Subsystem.
In Proceedings of the 42nd Annual IEEE/ACM International
Symposium on Microarchitecture, MICRO 42, pages 337–346, New
York, NY, USA, 2009. ACM.
[22] Matthias Jung, Deepak M. Mathew, Carl Christian Rheinlander,
Christian Weis, and Norbert Wehn. A Platform to Analyze DDR3 DRAM’s
Power and Retention Time. IEEE Design & Test, 34(4):52–59, 2017.
[23] G. Karakonstantis, A. Chatterjee, and K. Roy. Containing the nanometer
pandora-box: Cross-layer design techniques for variation aware low
power systems. IEEE Journal on Emerging and Selected Topics in
Circuits and Systems, 1(1):19–29, March 2011.
[24] G. Karakonstantis and K. Roy. Voltage over-scaling: A cross-layer
design perspective for energy efficient systems. In 2011 20th European
Conference on Circuit Theory and Design (ECCTD), pages 548–551,
Aug 2011.
[25] G. Karakonstantis, K. Tovletoglou, L. Mukhanov, H. Vandierendonck,
D. S. Nikolopoulos, P. Lawthers, P. Koutsovasilis, M. Maroudas, C. D.
Antonopoulos, C. Kalogirou, N. Bellas, S. Lalis, S. Venugopal,
A. Prat-Pérez, A. Lampropulos, M. Kleanthous, A. Diavastos, Z. Hadjilambrou,
P. Nikolaou, Y. Sazeides, P. Trancoso, G. Papadimitriou, M. Kaliorakis,
A. Chatzidimitriou, D. Gizopoulos, and S. Das. An energy-efficient and
error-resilient server ecosystem exceeding conservative scaling limits. In
2018 Design, Automation Test in Europe Conference Exhibition (DATE),
pages 1099–1104, March 2018.
[26] Yasunao Katayama, Eric J Stuckey, Sumio Morioka, and Zhao Wu.
Fault-tolerant refresh power reduction of drams for quasi-nonvolatile
data retention. In Defect and Fault Tolerance in VLSI Systems, 1999.
DFT ’99. International Symposium on, pages 311–318, Nov 1999.
[27] Samira Khan, Donghyuk Lee, Yoongu Kim, Alaa R. Alameldeen,
Chris Wilkerson, and Onur Mutlu. The Efficacy of Error Mitigation
Techniques for DRAM Retention Failures: A Comparative Experimental
Study. SIGMETRICS Perform. Eval. Rev., 42(1):519–532, June 2014.
[28] Samira Khan, Donghyuk Lee, and Onur Mutlu. PARBOR: An Efficient
System-Level Technique to Detect Data-Dependent Failures in DRAM.
In 2016 46th Annual IEEE/IFIP International Conference on Dependable
Systems and Networks (DSN), pages 239–250, June 2016.
[29] Samira Khan, Chris Wilkerson, Zhe Wang, Alaa R. Alameldeen,
Donghyuk Lee, and Onur Mutlu. Detecting and Mitigating Data-
dependent DRAM Failures by Exploiting Current Memory Content. In
Proceedings of the 50th Annual IEEE/ACM International Symposium on
Microarchitecture, MICRO-50 ’17, pages 27–40, New York, NY, USA,
2017. ACM.
[30] Joohee Kim and M. C. Papaefthymiou. Block-based multiperiod dy-
namic memory design for low data-retention power. IEEE Transactions
on Very Large Scale Integration (VLSI) Systems, 11(6):1006–1018, Dec
2003.
[31] Kinam Kim and Jooyoung Lee. A New Investigation of Data Retention
Time in Truly Nanoscaled DRAMs. IEEE Electron Device Letters,
30(8):846–848, Aug 2009.
[32] Yoongu Kim, Ross Daly, Jeremie Kim, Chris Fallin, Ji Hye Lee,
Donghyuk Lee, Chris Wilkerson, Konrad Lai, and Onur Mutlu. Flipping
bits in memory without accessing them: An experimental study of
DRAM disturbance errors. In 2014 ACM/IEEE 41st International
Symposium on Computer Architecture (ISCA), pages 361–372, June
2014.
[33] Ron Kohavi. A Study of Cross-validation and Bootstrap for Accuracy
Estimation and Model Selection. In Proceedings of the 14th Interna-
tional Joint Conference on Artificial Intelligence - Volume 2, IJCAI’95,
pages 1137–1143, San Francisco, CA, USA, 1995. Morgan Kaufmann
Publishers Inc.
[34] Jagadish B. Kotra, Narges Shahidi, Zeshan A. Chishti, and Mahmut T.
Kandemir. Hardware-Software Co-design to Mitigate DRAM Refresh
Overheads: A Case for Refresh-Aware Process Scheduling. SIGOPS
Oper. Syst. Rev., 51(2):723–736, April 2017.
[35] Zhiling Lan, Jiexing Gu, Ziming Zheng, Rajeev Thakur, and Susan
Coghlan. A Study of Dynamic Meta-learning for Failure Prediction
in Large-scale Systems. J. Parallel Distrib. Comput., 70(6):630–643,
June 2010.
[36] Xin Li, Michael C. Huang, Kai Shen, and Lingkun Chu. A Realistic
Evaluation of Memory Hardware Errors and Software System Suscep-
tibility. In Proceedings of the 2010 USENIX Conference on USENIX
Annual Technical Conference, USENIXATC’10, pages 6–6, Berkeley,
CA, USA, 2010. USENIX Association.
[37] Xuanhua Li and Donald Yeung. Application-Level Correctness and Its
Impact on Fault Tolerance. In Proceedings of the 2007 IEEE 13th
International Symposium on High Performance Computer Architecture,
HPCA ’07, pages 181–192, Washington, DC, USA, 2007. IEEE Com-
puter Society.
[38] Yinglung Liang, Yanyong Zhang, Anand Sivasubramaniam, Morris Jette,
and Ramendra Sahoo. BlueGene/L Failure Analysis and Prediction
Models. In International Conference on Dependable Systems and
Networks (DSN’06), pages 425–434, June 2006.
[39] Jamie Liu, Ben Jaiyen, Yoongu Kim, Chris Wilkerson, and Onur Mutlu.
An Experimental Study of Data Retention Behavior in Modern DRAM
Devices: Implications for Retention Time Profiling Mechanisms. In
Proceedings of the 40th Annual International Symposium on Computer
Architecture, ISCA ’13, pages 60–71, New York, NY, USA, 2013. ACM.
[40] Jamie Liu, Ben Jaiyen, Richard Veras, and Onur Mutlu. RAIDR:
Retention-Aware Intelligent DRAM Refresh. In Proceedings of the 39th
Annual International Symposium on Computer Architecture, ISCA ’12,
pages 1–12, Washington, DC, USA, 2012. IEEE Computer Society.
[41] Song Liu, Karthik Pattabiraman, Thomas Moscibroda, and Benjamin G.
Zorn. Flikker: Saving DRAM refresh-power through critical data partitioning.
SIGPLAN Not., 46(3):213–224, March 2011.
[42] Y. Luo, S. Govindan, B. Sharma, M. Santaniello, J. Meza, A. Kansal,
J. Liu, B. Khessib, K. Vaid, and O. Mutlu. Characterizing ap-
plication memory error vulnerability to optimize datacenter cost via
heterogeneous-reliability memory. In 2014 44th Annual IEEE/IFIP
International Conference on Dependable Systems and Networks, pages
467–478, June 2014.
[43] Alan Messer, Philippe Bernadat, Guangrui Fu, Deqing Chen, Zoran
Dimitrijevic, David Lie, Durga Devi Mannaru, Alma Riska, and Dejan
Milojicic. Susceptibility of commodity systems and software to memory
soft errors. IEEE Transactions on Computers, 53(12):1557–1568, Dec
2004.
[44] Justin Meza, Qiang Wu, Sanjeev Kumar, and Onur Mutlu. Revisiting
Memory Errors in Large-Scale Production Data Centers: Analysis and
Modeling of New Trends from the Field. In Proceedings of the 2015 45th
Annual IEEE/IFIP International Conference on Dependable Systems and
Networks, DSN ’15, pages 415–426, Washington, DC, USA, 2015. IEEE
Computer Society.
[45] Micron Technology. MT18JSF1G72AZ-1G9 - 8GB.
https://www.micron.com/products/dram-modules/udimm/part-catalog/
mt18jsf1g72az-1g9, 2015. [Online; accessed 6-June-2019].
[46] Micron Technology. MT40A512M8. https://www.micron.com/products/
dram/ddr4-sdram/part-catalog/mt40a512m8hx-083e, 2015. [Online; ac-
cessed 6-June-2019].
[47] Dong-Sun Min, Dong-Il Seo, Jehwan You, Sooin Cho, Daeje Chin, and
Y. E. Park. Wordline coupling noise reduction techniques for scaled
DRAMs. In Digest of Technical Papers, 1990 Symposium on VLSI
Circuits, pages 81–82, June 1990.
[48] Lauri Minas and Brad Ellison. The Problem of
Power Consumption in Servers. http://www.drdobbs.com/
the-problem-of-power-consumption-in-serv/215800830. [Online;
accessed 6-June-2019].
[49] L. Mukhanov, D. S. Nikolopoulos, and B. R. de Supinski. Alea: Fine-grain
energy profiling with basic block sampling. In 2015 International
Conference on Parallel Architecture and Compilation (PACT), pages 87–
98, Oct 2015.
[50] Lev Mukhanov. DFault. https://github.com/lmukhanov/DFault, 2019.
[Online; accessed 21-September-2019].
[51] Lev Mukhanov, Pavlos Petoumenos, Zheng Wang, Nikos Parasyris,
Dimitrios S. Nikolopoulos, Bronis R. De Supinski, and Hugh Leather.
Alea: A fine-grained energy profiling tool. ACM Trans. Archit. Code
Optim., 14(1):1:1–1:25, March 2017.
[52] Lev Mukhanov, Konstantinos Tovletoglou, Dimitrios S. Nikolopoulos,
and Georgios Karakonstantis. Characterization of HPC workloads on
an ARMv8 based server under relaxed DRAM refresh and thermal stress.
In Proceedings of the 18th International Conference on Embedded
Computer Systems: Architectures, Modeling, and Simulation, SAMOS
’18, pages 230–235, New York, NY, USA, 2018. ACM.
[53] Lev Mukhanov, Konstantinos Tovletoglou, Dimitrios S Nikolopoulos,
and Georgios Karakonstantis. DRAM Characterization under Relaxed
Refresh Period Considering System Level Effects within a Commodity
Server. In 2018 IEEE 24th International Symposium on On-Line Testing
And Robust System Design (IOLTS), pages 236–239, July 2018.
[54] Sayan Mukherjee, Partha Niyogi, Tomaso Poggio, and Ryan Rifkin.
Learning theory: stability is sufficient for generalization and necessary
and sufficient for consistency of empirical risk minimization. Advances
in Computational Mathematics, 25(1):161–193, Jul 2006.
[55] Onur Mutlu. The RowHammer Problem and Other Issues We May
Face As Memory Becomes Denser. In Proceedings of the Conference
on Design, Automation & Test in Europe, DATE ’17, pages 1116–1121,
Leuven, Belgium, 2017. European Design and
Automation Association.
[56] Jerome L. Myers and Arnold D. Well. Research Design & Statistical
Analysis. Routledge, 1st edition, June 1995.
[57] Yoshinobu Nakagome, M Aoki, S Ikenaga, M Horiguchi, S Kimura,
Y Kawamoto, and K Itoh. The impact of data-line interference noise on
DRAM scaling. IEEE Journal of Solid-State Circuits, 23(5):1120–1127,
Oct 1988.
[58] Bin Nie, Ji Xue, Saurabh Gupta, Tirthak Patel, Christian Engelmann,
Evgenia Smirni, and Devesh Tiwari. Machine Learning Models for
GPU Error Prediction in a Large Scale HPC System. In 48th Annual
IEEE/IFIP International Conference on Dependable Systems and Net-
works, DSN 2018, Luxembourg City, Luxembourg, June 25-28, 2018,
pages 95–106, 2018.
[59] Taku Ohsawa, Koji Kai, and Kazuaki Murakami. Optimizing the
DRAM refresh count for merged DRAM/logic LSIs. In Proceedings of the
1998 International Symposium on Low Power Electronics and Design,
ISLPED ’98, pages 82–87, New York, NY, USA, 1998. ACM.
[60] Tapti Palit, Yongming Shen, and Michael Ferdman. Demystifying Cloud
Benchmarking. In 2016 IEEE International Symposium on Performance
Analysis of Systems and Software (ISPASS), pages 122–132, April 2016.
[61] Minesh Patel, Jeremie S. Kim, and Onur Mutlu. The Reach Profiler
(REAPER): Enabling the Mitigation of DRAM Retention Failures via
Profiling at Aggressive Conditions. SIGARCH Comput. Archit. News,
45(2):255–268, June 2017.
[62] Moinuddin K Qureshi, Dae-Hyun Kim, Samira Khan, Prashant J Nair,
and Onur Mutlu. AVATAR: A Variable-Retention-Time (VRT) Aware
Refresh for DRAM Systems. In 2015 45th Annual IEEE/IFIP Interna-
tional Conference on Dependable Systems and Networks, pages 427–
437, June 2015.
[63] Arnab Raha, Hrishikesh Jayakumar, Soubhagya Sutar, and Vijay Raghu-
nathan. Quality-aware Data Allocation in Approximate DRAM. In
Proceedings of the 2015 International Conference on Compilers, Archi-
tecture and Synthesis for Embedded Systems, CASES ’15, pages 89–98,
Piscataway, NJ, USA, 2015. IEEE Press.
[64] Michael Redeker, Bruce F Cockburn, and Duncan G Elliott. An
investigation into crosstalk noise in DRAM structures. In Proceedings of
the 2002 IEEE International Workshop on Memory Technology, Design
and Testing (MTDT2002), pages 123–129, 2002.
[65] Phillip J Restle, JW Park, and Brian F Lloyd. DRAM variable retention
time. In 1992 International Technical Digest on Electron Devices
Meeting, pages 807–810, Dec 1992.
[66] Ramendra K Sahoo, Adam J Oliner, Irina Rish, Manish Gupta, José E
Moreira, Sheng Ma, Ricardo Vilalta, and Anand Sivasubramaniam. Criti-
cal Event Prediction for Proactive Management in Large-scale Computer
Clusters. In Proceedings of the Ninth ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining, KDD ’03, pages
426–435, New York, NY, USA, 2003. ACM.
[67] Bianca Schroeder, Eduardo Pinheiro, and Wolf-Dietrich Weber. DRAM
Errors in the Wild: A Large-scale Field Study. In Proceedings of the
Eleventh International Joint Conference on Measurement and Modeling
of Computer Systems, SIGMETRICS ’09, pages 193–204, New York,
NY, USA, 2009. ACM.
[68] Scikit-learn. scikit-learn. https://scikit-learn.org/stable/, 2019. [Online;
accessed 21-September-2019].
[69] Julian Shun and Guy E. Blelloch. Ligra: A Lightweight Graph Process-
ing Framework for Shared Memory. SIGPLAN Not., 48(8):135–146,
February 2013.
[70] V. Sridharan, J. Stearley, N. DeBardeleben, S. Blanchard, and S. Gurumurthi.
Feng Shui of supercomputer memory: positional effects in
DRAM and SRAM faults. In SC ’13: Proceedings of the International
Conference on High Performance Computing, Networking, Storage and
Analysis, pages 1–11, Nov 2013.
[71] Vilas Sridharan, Nathan DeBardeleben, Sean Blanchard, Kurt B. Fer-
reira, Jon Stearley, John Shalf, and Sudhanva Gurumurthi. Memory Er-
rors in Modern Systems: The Good, The Bad, and The Ugly. SIGARCH
Comput. Archit. News, 43(1):297–310, March 2015.
[72] Vilas Sridharan and Dean Liberty. A Study of DRAM Failures in
the Field. In Proceedings of the International Conference on High
Performance Computing, Networking, Storage and Analysis, SC ’12,
pages 76:1–76:11, Los Alamitos, CA, USA, 2012. IEEE Computer
Society Press.
[73] Jeffrey Stuecheli, Dimitris Kaseridis, Hillery C. Hunter, and Lizy K.
John. Elastic Refresh: Techniques to Mitigate Refresh Penalties in High
Density Memory. In Proceedings of the 2010 43rd Annual IEEE/ACM
International Symposium on Microarchitecture, MICRO ’43, pages 375–
384, Washington, DC, USA, 2010. IEEE Computer Society.
[74] Jiawen Sun, Hans Vandierendonck, and Dimitrios S. Nikolopoulos.
GraphGrind: Addressing load imbalance of graph partitioning. In
Proceedings of the International Conference on Supercomputing, ICS
’17, pages 16:1–16:10, New York, NY, USA, 2017. ACM.
[75] A. Teman, G. Karakonstantis, R. Giterman, P. Meinerzhagen, and
A. Burg. Energy versus data integrity trade-offs in embedded high-density
logic compatible dynamic memories. In 2015 Design, Automation & Test
in Europe Conference & Exhibition (DATE), pages 489–494,
March 2015.
[76] K. Tovletoglou, L. Mukhanov, G. Karakonstantis, A. Chatzidimitriou,
G. Papadimitriou, M. Kaliorakis, D. Gizopoulos, Z. Hadjilambrou,
Y. Sazeides, A. Lampropulos, S. Das, and P. Vo. Measuring and
exploiting guardbands of server-grade ARMv8 CPU cores and DRAMs. In
2018 48th Annual IEEE/IFIP International Conference on Dependable
Systems and Networks Workshops (DSN-W), pages 6–9, June 2018.
[77] Konstantinos Tovletoglou, Lev Mukhanov, Dimitrios S Nikolopoulos,
and Georgios Karakonstantis. Shimmer: Implementing a Heterogeneous-
Reliability DRAM Framework on a Commodity Server. IEEE Computer
Architecture Letters, pages 1–1, 2019.
[78] Konstantinos Tovletoglou, Dimitrios S Nikolopoulos, and Georgios
Karakonstantis. Relaxing DRAM refresh rate through access pattern
scheduling: A case study on stencil-based algorithms. In 2017 IEEE
23rd International Symposium on On-Line Testing and Robust System
Design (IOLTS), pages 45–50, July 2017.
[79] I. Tsiokanos, L. Mukhanov, and G. Karakonstantis. Low-power
variation-aware cores based on dynamic data-dependent bitwidth truncation.
In 2019 Design, Automation & Test in Europe Conference & Exhibition
(DATE), pages 698–703, March 2019.
[80] I. Tsiokanos, L. Mukhanov, D. S. Nikolopoulos, and G. Karakonstantis.
Minimization of timing failures in pipelined designs via path shaping
and operand truncation. In 2018 IEEE 24th International Symposium
on On-Line Testing And Robust System Design (IOLTS), pages 171–176,
July 2018.
[81] Ioannis Tsiokanos, Lev Mukhanov, Dimitrios S. Nikolopoulos, and
Georgios Karakonstantis. Variation-aware pipelined cores through path
shaping and dynamic cycle adjustment: Case study on a floating-point
unit. In Proceedings of the International Symposium on Low Power
Electronics and Design, ISLPED ’18, pages 52:1–52:6, New York, NY,
USA, 2018. ACM.
[82] Alejandro Valero, Salvador Petit, Julio Sahuquillo, David R. Kaeli, and
José Duato. A Reuse-based Refresh Policy for Energy-aware eDRAM
Caches. Microprocess. Microsyst., 39(1):37–48, February 2015.
[83] Ad J Van De Goor and Ivo Schanstra. Address and data scrambling:
causes and impact on memory tests. In Electronic Design, Test and
Applications, 2002. Proceedings. The First IEEE International Workshop
on, pages 128–136, 2002.
[84] Victor van der Veen, Yanick Fratantonio, Martina Lindorfer, Daniel
Gruss, Clementine Maurice, Giovanni Vigna, Herbert Bos, Kaveh
Razavi, and Cristiano Giuffrida. Drammer: Deterministic Rowhammer
Attacks on Mobile Platforms. In Proceedings of the 2016 ACM SIGSAC
Conference on Computer and Communications Security, CCS ’16, pages
1675–1689, New York, NY, USA, 2016. ACM.
[85] Ravi K Venkatesan, Stephen Herr, and Eric Rotenberg. Retention-aware
placement in DRAM (RAPID): software methods for quasi-non-volatile
DRAM. In The Twelfth International Symposium on High-Performance
Computer Architecture, 2006., pages 155–165, Feb 2006.
[86] Shibo Wang, Mahdi Nazm Bojnordi, Xiaochen Guo, and Engin Ipek.
Content Aware Refresh: Exploiting the Asymmetry of DRAM Retention
Errors to Reduce the Refresh Frequency of Less Vulnerable Data. IEEE
Transactions on Computers, 68(3):362–374, March 2019.
[87] Yaohua Wang, Arash Tavakkol, Lois Orosa, Saugata Ghose, Nika Man-
souri Ghiasi, Minesh Patel, Jeremie S Kim, Hasan Hassan, Mohammad
Sadrosadati, and Onur Mutlu. Reducing DRAM latency via charge-level-
aware look-ahead partial restoration. In 2018 51st Annual IEEE/ACM
International Symposium on Microarchitecture (MICRO), pages 298–
311. IEEE, 2018.
[88] Ying Wang, Yinhe Han, Cheng Wang, Huawei Li, and Xiaowei Li.
RADAR: A Case for Retention-aware DRAM Assembly and Repair
in Future FGR DRAM Memory. In Proceedings of the 52nd Annual
Design Automation Conference, DAC ’15, pages 19:1–19:6, New York,
NY, USA, 2015. ACM.
[89] Li Yu, Ziming Zheng, Zhiling Lan, and Susan Coghlan. Practical Online
Failure Prediction for Blue Gene/P: Period-based vs Event-driven. In
Proceedings of the 2011 IEEE/IFIP 41st International Conference on
Dependable Systems and Networks Workshops, DSNW ’11, pages 259–
264, Washington, DC, USA, 2011. IEEE Computer Society.
