Common-case optimized memory hierarchy for data centers and HPC systems by Jian, Xun
© 2017 Xun Jian
COMMON-CASE OPTIMIZED MEMORY HIERARCHY FOR DATA CENTERS AND HPC SYSTEMS




Submitted in partial fulfillment of the requirements
for the degree of Doctor of Philosophy in Electrical and Computer Engineering
in the Graduate College of the
University of Illinois at Urbana-Champaign, 2017
Urbana, Illinois
Doctoral Committee:
Associate Professor Rakesh Kumar, Chair
Professor Josep Torrellas
Associate Professor Pavan Hanumolu
Dr. Vilas Sridharan, AMD
ABSTRACT
The memory hierarchy is predicted to consume as much as 40% to 70% of total
system power in future data centers and high performance computing (HPC)
systems; as such, it is time to rethink memory system designs. Conventional
memory system designs often seek to provide uniform performance across time
and space. While this design approach is simple, which benefits hardware
implementation, the overheads of uncommon operations often dictate overall
memory energy and performance. This dissertation explores, across the
memory hierarchy, common-case optimized memory design, which reduces
overall overheads by reducing common overheads at the cost of increasing
uncommon overheads. For latency-optimized on-chip caches,
which require long-latency correction to ensure reliable accesses during low
voltage execution, this dissertation reduces common-case correction latency
by proposing architectural techniques such as correction prediction, at the
cost of increasing uncommon-case correction latency due to occasional opera-
tions such as misprediction recovery. For bandwidth-optimized 3D DRAMs,
which are power-hungry due to high access frequency, this dissertation de-
scribes new data layout and power management policies to improve overall
memory access energy-efficiency at the cost of increasing uncommon-case ac-
cess latencies. For capacity-optimized server main memory, which contains
100’s to 1000’s of memory chips to provide high capacity and, therefore,
needs expensive fault-tolerance, this dissertation proposes an adaptive
architecture that minimizes energy when memory contains no or only minor
faults,
at the cost of increasing energy as faults slowly accumulate. Finally, for
emerging density-optimized NVRAMs, which suffer from very high random
bit error rates, this dissertation describes a server memory architecture that
reuses the redundant memory budgeted for memory chip failure protection
to accelerate the expensive bit error correction before memory chip(s) fail,
at the cost of increasing correction overheads after memory chip(s) fail.
To my best friend and wife, Hannah Jian, for sharing in my joys, providing
unwavering support during the tough times, and putting up with me during
countless late nights at work. Without your loving kindness and
encouragement over the years, I would not be where I am today.
ACKNOWLEDGMENTS
I thank my advisor, Rakesh Kumar, for his constant willingness to share his
wisdom and experience with me. For patiently guiding me and helping me
grow over the years. For setting a good, hardworking, and passionate role
model to follow. And for driving me to always strive to be better, teaching
me to think big, and believing in me even when I did not. I thank Vilas
Sridharan for his insightful help, counsel, and encouragement over the years.
I thank Josep Torrellas and Pavan Hanumolu for their invaluable feedback
on my research. I thank all my current and former labmates, especially John
Sartori and Henry Duwe, for asking tough questions and holding intense
discussions with me over research. Finally, I thank God for providing me the
strength, wisdom, the people in my life, as well as everything else that was
required for all the work that goes into this dissertation.
TABLE OF CONTENTS
CHAPTER 1 INTRODUCTION
CHAPTER 2 ADAPTIVE RELIABILITY CHIPKILL CORRECT (ARCC)
2.1 Introduction
2.2 Background and Related Work
2.3 Motivation
2.4 Adaptive Reliability Chipkill Correct
2.5 Methodology
2.6 Results
2.7 Conclusion
CHAPTER 3 MULTI-LINE ERROR CORRECTION
3.1 Motivation
3.2 Multi-line ECC
3.3 Methodology
3.4 Results
3.5 Conclusion
CHAPTER 4 VERY LARGE ECC WORD (VLEW) ARCHITECTURE FOR HIGH-DENSITY NVRAM-BASED SERVER MAIN MEMORY
4.1 Introduction
4.2 Background
4.3 Problem
4.4 VLEW Memory Architecture
4.5 Methodology
4.6 Results
4.7 Related Work and Generality
4.8 Conclusion
CHAPTER 5 CORRECTION PREDICTION
5.1 Introduction
5.2 Background and Related Work
5.3 Motivation
5.4 Correction Prediction for L1 Caches
5.5 Implementation, Coverage, and Overheads
5.6 Methodology
5.7 Experimental Results
5.8 Conclusion
CHAPTER 6 ERROR PATTERN TRANSFORMATION
6.1 Introduction
6.2 Motivation
6.3 Error Pattern Transformation
6.4 Methodology
6.5 Results
6.6 Related Work
6.7 Conclusion
CHAPTER 7 PARITY HELIX
7.1 Introduction
7.2 Die-stacked DRAM Background
7.3 Motivation
7.4 Parity Helix
7.5 Methodology
7.6 Results
7.7 Discussions
7.8 Conclusion
CHAPTER 8 UNDERSTANDING AND OPTIMIZING POWER CONSUMPTION IN MEMORY NETWORKS
8.1 Introduction
8.2 Background
8.3 Analyzing Power Characteristics of Memory Networks
8.4 I/O Power Control Mechanisms
8.5 Network-Unaware Management
8.6 Network-Aware Management
8.7 Discussion
8.8 Conclusion
CHAPTER 9 SUMMARY

CHAPTER 1 INTRODUCTION




The information and communication technology (ICT) industry, which
generates trillions of dollars in annual revenue and consumes over 10% of the
world’s total electricity [1], plays a major role in the world economy and
environment. The ICT industry has also been critical in achieving scientific
progress by enabling advances in genomics, medicine, material science, etc.
The proliferation of IoT (Internet of Things) devices such as smartphones,
wearables, and sensors is expected to further drive rapid expansion of the
ICT industry. Continued growth of the ICT industry is contingent upon the
continued scaling of (A) data centers, the workhorses of the digital universe
that store, process, and serve information to and from end-user devices, and
(B) high performance computing (HPC) systems that analyze vast data to
produce scientific and engineering breakthroughs.
The future scaling of data centers and HPC systems is challenging, however.
Data centers in the United States consume 2% of total electricity in the
country [2]; to put this into perspective, this is over half of the
electricity consumed by home lighting in the United States [3]. Scaling up
the performance of data centers to keep up with the rapid growth of
smartphones and IoT [1, 4] will become unsustainable without achieving
commensurate improvement in energy efficiency. Similarly, top HPC systems
today already consume ∼10 megawatts; at the current level of energy
efficiency, scaling up the performance of an HPC system to exascale, the next
milestone in supercomputing, requires 100s of megawatts, which is sufficient
to power 100,000’s of homes; such a high power requirement is again
unsustainable.
Memory is predicted to become a major source of power consumption in
future data centers and HPC systems. In data centers, the server memory
capacity requirement has been rapidly increasing due to application-level trends

Figure 1.1: Growth in memory

Figure 1.2: DDR4 power [5].
be seen in the per-processor memory size of memory-optimized servers in
major cloud data centers (see Figure 1.1); per-processor memory size of
these servers is currently several terabytes. Terabytes of memory consume
large amounts of power, approaching and even surpassing that of the server
processor (see Figure 1.2). In HPC systems, on the other hand, memory
bandwidth requirements have been increasing due to increasing core count
per processor; emerging exascale supercomputers will likely require 4TB/s of
DRAM memory bandwidth within the processor package [6, 7]. High mem-
ory bandwidth comes at the cost of high power consumption, as shown in
Figure 1.3. Providing 4TB/s of in-package DRAM memory bandwidth is
estimated to consume 160W of peak power [8], which is more than half of
the predicted 200-300W total power budget for an entire node in an exascale
system [6, 7]. The off-chip main memory for each node in an exascale system
is estimated to consume another 70W of peak power [6], further aggravating
the problem of high memory power consumption in future exascale systems.
In addition, aggressive voltage scaling has been widely proposed to tackle the
high power consumption problem of the processor. During low voltage ex-
ecution, on-chip memories also determine overall processor energy-efficiency
because voltage scaling for memory is more challenging than for logic [9].
This thesis argues that a major source of energy overhead in today’s mem-
ory systems is due to providing uniform performance across time and space.
As an example of uniformity across time, memory systems today provide
nearly identical performance on the first day of their lifetime, when server
memory is fault-free, as on the last day of their lifetime, when faults have
occurred.
As an example of uniformity across space, every chip in a typical memory
system has the same bandwidth and latency regardless of how frequently the

Figure 1.3: Memory power consumption in future exascale supercomputers.
proach provides the benefit of being simple by repeating the same actions for
all scenarios, which in turn benefits hardware implementation. The down-
side, however, is that rare memory operations end up dictating the overall
energy-efficiency of the memory system. In the past, such energy overheads
were acceptable due to the low memory power consumption relative to the
system power. However, as the memory system is on track to becoming a
major source of power consumption in future data centers and HPC systems,
it is time to revisit memory design approaches.
This thesis explores the benefits of a common-case optimized memory hi-
erarchy that foregoes memory performance uniformity to improve energy
efficiency by optimizing common-case memory operations without also
optimizing, or sometimes even at the expense of, uncommon memory
operations. Optimizing the common-case memory operations provides significant
overall energy savings since they are much more frequent than uncommon
memory operations. The dissertation is organized as follows:
Chapter 2 addresses the problem of server memory systems paying high
energy overheads to ensure reliable memory operations due to containing
many (e.g., 100’s to 1000’s) memory chips; it proposes an adaptive memory
architecture that dynamically increases memory error correction strength as
faults accumulate in memory to improve memory energy efficiency in the
common-case of having no or minor faults in memory.
Chapter 3 answers the following question: given an adaptive memory
architecture that can reconfigure memory error correction as faults occur
(e.g., the one described in Chapter 2), how should memory error correction
be designed differently from that in a conventional static memory
architecture to maximize energy savings? By exploiting the observation that
error correction (as opposed to detection) is only performed in the presence
of a fault, which occurs infrequently in present-day DRAM, this chapter
proposes a novel memory error correction scheme for fault-free memory
regions that significantly improves memory energy efficiency at the cost of
aggressively increasing error correction latency.
Chapter 4 looks at the problem of how to design common-case optimized
server main memory when error rates are high. Because a high error rate
requires frequent correction, the architecture in Chapter 3 is inadequate, as
it relies on error correction being infrequent in present-day DRAM.
Emerging high-density memories, such as NVRAMs, have very high bit error
rates (BER). Tolerating high BER at low memory redundancy requires the use
of very large error correcting code (ECC) words, similar to the ECC words
used to tolerate bit errors in storage systems. However, very large ECC
words (e.g., > 30X larger than ECC words typically used for DRAM) are
very power hungry when used in main memory because they require accessing
more (e.g., > 30X) data to detect/correct errors. The key observation for
building efficient server main memory with very large ECC words is that most
cachelines are not affected by any chip-level faults during typical system
lifetime. The proposed very large ECC word (VLEW) server memory
architecture reuses the redundant memory budgeted for chip failures to
opportunistically correct bit errors before memory chips fail, speeding up
common-case memory accesses for high-density NVRAMs with high BER.
Chapter 5 addresses the problem that during low voltage operations, on-
chip SRAM caches can significantly affect overall processor energy efficiency
due to frequent errors that require expensive long-latency correction. The
proposed Correction Prediction effectively hides the long correction latencies
by using a weak correction mechanism to predict the corrected cache word
value at low latency and recovering from misprediction by instruction replay if
the predicted cache word value is found to be incorrect by a reliable error
correction that takes longer to complete. By reducing the correction latency
for common-case cache accesses with no or single error, Correction Predic-
tion improves overall performance despite incurring expensive misprediction
recovery overheads for uncommon cache accesses with multi-bit errors.
Chapter 6 builds on Chapter 5 to reduce the long multi-error correction
latency as well. Multi-bit error correction is typically performed via a slow
iterative algorithm that corrects one error at a time. On the other hand,
there are many low-overhead parallel error correction algorithms that can
correct multiple errors in parallel at low latency/area/power overheads in the
common-case, but unfortunately do not guarantee correction of all multi-bit
errors. To provide low-latency multi-bit error correction, Chapter 6 describes
how to effectively exploit low-overhead parallel correction algorithms to cor-
rect multi-bit error patterns in the common-case.
Chapter 7 explores how to architect 3D DRAM as efficient and reliable
in-package high-bandwidth memory. Due to the 3D nature of die-stacked
DRAMs, 3D DRAMs can develop faults in different physical dimensions.
By sharing minimal redundancy across physical dimensions to protect them
against failures, the proposed Parity Helix protection for multi-dimensional
memory improves energy-efficiency in the common-case of having no or only
minor faults in memory, at the expense of higher correction latency overheads
compared to baselines with dedicated protection for each physical dimension.
Chapter 8 explores power management for memory networks, a recently
proposed architecture to provide scalable-capacity high-bandwidth off-chip
memory. By adaptively tuning the performance of the different memory
nodes in a memory network depending on the node’s frequency of access, the
proposed network-aware memory power management significantly improves
energy efficiency compared to the conventional approach of providing uniform

CHAPTER 2 ADAPTIVE RELIABILITY CHIPKILL CORRECT (ARCC)





This chapter addresses the problem of server memory systems paying high
energy overheads to ensure reliable memory operations due to containing
many (e.g., 100’s to 1000’s) memory chips; it proposes an adaptive memory
architecture that dynamically increases memory error correction strength as faults
accumulate in memory to improve memory energy efficiency in the common-
case of having no or minor faults in memory.
2.1 Introduction
Chipkill correct is an advanced type of error correction in memory that sig-
nificantly improves the reliability of memory by allowing continued memory
operation in the event of device-level faults in memory. Large-scale studies
show that chipkill correct reduces the Detectable Uncorrectable Error (DUE)
rate of memory by 4X [10] to 36X [11] compared to Single Error Correct Dou-
ble Error Detect (SECDED). As a result, chipkill correct memory systems
have become very popular among HPC systems and high end servers with
large memory capacities. As the amount of memory in servers continues to
increase, we envision that the adoption of chipkill correct memory systems
will become even more widespread in order to maintain the same level of
DUE and silent data corruption (SDC) rates in memory.
Currently, there is a strong tradeoff between power and reliability among
different chipkill correct solutions. Commercial chipkill correct solutions,
such as single chipkill correct double chipkill detect (SCCDCD) [12] and
Double Chip Sparing [13, 14], can detect up to two failed memory devices
per rank; however, they require accessing 36 memory devices per memory re-
quest. On the other hand, a weaker solution that can only detect and correct
up to a single failed device requires accessing only 18 memory devices; be-
cause only half as many devices are accessed per request, significant memory
power can be saved.
In this research, we aim to improve on the present power and reliability
tradeoff of chipkill correct memory solutions. We observe that all exist-
ing chipkill correct solutions have a fixed level of protection strength from
the start regardless of the age of the memory system; however, due to the
low occurrence rate of faults in modern Dynamic Random Access Memory
(DRAM) devices, our calculations, based on a large-scale field study of over
160,000 Dual-Inline Memory Modules (DIMMs) [11], show that only a very
small fraction of memory chips, on average, experience any type of fault
in a typical operational lifespan of 5 to 7 years [15]. Therefore, instead of a
fixed worst case design, we propose an average case design where the memory
system begins with low strength of protection, which consumes low power,
and only upgrades to higher strength(s) of protection, which consumes high
power, on a page-by-page basis as the pages become affected by faults. We
call this optimization for chipkill correct solutions Adaptive Reliability
Chipkill Correct (ARCC). By increasing the chipkill correct strength
of faulty pages at the end of every memory scrub, which can be performed
once every few hours [11], ARCC offers reliability similar to always using a
strong chipkill correct solution for all the pages.
In this research, we focus on applying ARCC to commercial chipkill correct
solutions. Our evaluation shows that ARCC reduces the power consumption
of memory by 36% when applied to commercially available chipkill correct
solutions, while keeping the same storage overhead and maintaining simi-
lar reliability. We will also briefly describe how to apply ARCC to newly
proposed chipkill correct solutions such as LOT-ECC and VECC.
We make the following contributions:
1. The concept of applying a weaker but more energy-efficient ECC for
regions in the main memory that are fault-free, and dynamically in-
creasing ECC strength of a region in the main memory after detecting
faults in the memory region.
2. An efficient implementation of the concept, where adjacent smaller
codewords combine to form larger codewords after faults are detected in
a page, which allows ECC strength to be increased without increasing
the storage overhead.
3. Comparative evaluation of the power, performance, and reliability of
the above implementation relative to commercial chipkill correct solu-
tions. Our experiments show that ARCC reduces memory power by
36% when applied to commercial chipkill correct solutions with negli-
gible degradation to reliability.
2.2 Background and Related Work
Chipkill correct memory systems are designed to guarantee error correction
and detection even in the event of a complete device failure in a rank, which is
a group of memory devices needed to serve a memory request. SCCDCD [12]
and double chip sparing [13, 14] are two popular commercial chipkill correct
solutions. SCCDCD can correct one failed device and detect up to two failed
devices. Double chip sparing can correct up to two failed devices as long as
the two faults do not occur before one of them is first detected. Both solu-
tions rely on symbol-based linear block codes to perform error detection and
correction. In a symbol-based linear block code, each codeword is composed
of multiple symbols, which are groups of bits; the symbols are categorized
into data symbols and check symbols, which are the redundant information
needed for error detection and correction [16]. The larger the number of
check symbols per codeword, the higher the strength of error detection and
correction. For example, with two check symbols per codeword, a bad sym-
bol in the codeword can be detected and corrected [17]; however, when there
are only two check symbols per codeword, if two bad symbols exist in the
same codeword, the error can go undetected. With four check symbols per
codeword, depending on the exact type of code employed, the second bad
symbol in the codeword can either be detected or even be corrected [16].
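The relationship between the number of check symbols and the achievable protection can be illustrated with a short sketch. It assumes a maximum distance separable code such as Reed-Solomon (an assumption; the text above only specifies a symbol-based linear block code), for which r check symbols give a minimum distance of d = r + 1, and it applies the standard condition that t symbol errors can be corrected while e >= t are detected whenever t + e + 1 <= d. The function name is hypothetical.

```python
def protection_options(check_symbols: int) -> list[tuple[int, int]]:
    """List (correctable, detectable) symbol-error pairs for an MDS code.

    Assumption: the code is maximum distance separable (e.g., Reed-Solomon),
    so its minimum distance is d = check_symbols + 1. A code with distance d
    can correct t errors while detecting e >= t errors if t + e + 1 <= d.
    """
    d = check_symbols + 1
    options = []
    for t in range(d):
        e = d - 1 - t  # largest detectable error count for this t
        if e >= t:
            options.append((t, e))
    return options


for r in (2, 3, 4):
    print(r, protection_options(r))
```

Under these assumptions, two check symbols allow single symbol correction only, three allow single symbol correct with double symbol detect, and four allow the second bad symbol to be either detected or corrected, which matches the cases described above.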
Figure 2.1 illustrates how commercially available chipkill correct solutions
store the symbols of each codeword. They store each symbol of a codeword
in a different device in the rank. As a result, even in the event of a complete
device failure, only a single symbol is lost per codeword; the lost symbol
can be recovered using the remaining data symbols and check symbols in the
codeword. Both SCCDCD and double chip sparing use four check symbols
per codeword. Although only three check symbols are required to provide
single symbol correct and double symbol detect, SCCDCD uses a somewhat
inefficient encoding such that all four check symbols are needed to provide
the same level of protection [16]. On the other hand, double chip sparing
uses a more efficient encoding where only three check symbols are required
to provide single symbol correct and double symbol detect. When a bad
symbol is detected, the bad symbol is remapped to a spare symbol, the fourth
symbol. This allows double chip sparing to correct up to two bad symbols
per codeword, as long as the second bad symbol does not occur before the
first has been detected.
Figure 2.1: Commercial Chipkill Correct Implementation: Each symbol is
stored in a different device in the rank. The “D” boxes represent data
devices while the “R” boxes represent redundant devices.
Because each symbol in a codeword has to be stored in its own DRAM
device, as illustrated by Figure 2.1, there have to be as many redundant
devices as there are check symbols per codeword. In order to keep the ratio of
the number of redundant devices to regular data devices in a rank low, the
number of data devices in the rank is chosen to be large. To keep the
storage overhead the same as SECDED, commercially available chipkill correct
solutions use 32 data symbols and four check symbols per codeword,
resulting in a storage overhead of 12.5%; this translates to a rank with a
total of 36 devices. Since such a large number of devices (36 compared to
only 9 for SECDED) have to be accessed per memory request, commercial
chipkill correct solutions consume high power.
2.3 Motivation
There is a fundamental tradeoff between power and reliability in chipkill
correct solutions, when the storage overhead is held constant. Increasing
the number of error correcting code (ECC) bits per word improves the error
correction/detection capability of the codeword; however, this increases the
storage overhead of the ECC bits. In order to keep the storage overhead the
same while increasing the number of ECC bits per word, the number of data
bits in the word has to be increased as well; this in turn increases memory
power consumption since more devices have to be accessed per memory
request. Consider for example a memory configuration consisting of a single
channel with two ranks and 36 devices per rank. If we were to reduce the
number of check symbols per codeword from four to two, the size of a rank
can be reduced from 36 to 18 without affecting the storage overhead. Our
motivational experiments using quad-core multiprogrammed SPEC bench-
marks show that having a rank size of 18 instead of 36 reduces memory
power consumption by 36.7% on average. However, the downside of using
only two redundant check symbols is that it only guarantees the detection
of a single bad symbol per codeword; the reliability that it provides is
significantly worse than using four redundant check symbols per codeword, as
do commercial chipkill correct solutions, which guarantee detection of up
to two bad symbols per codeword. ARCC is an optimization that seeks
to improve the power-reliability tradeoff between a stronger chipkill correct
solution with more ECC bits per codeword and a weaker chipkill correct so-
lution with fewer ECC bits per codeword by offering similar reliability as the
former while consuming similar power as the latter.
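The rank-size arithmetic above can be checked with a few lines of code. The sketch below only restates the numbers given in the text (32 data and four check symbols versus 16 data and two check symbols, one symbol per device); the function and variable names are hypothetical.

```python
def rank_layout(data_symbols: int, check_symbols: int) -> dict:
    """Rank size and storage overhead when every symbol of a codeword
    is stored in its own DRAM device."""
    return {
        "devices_per_rank": data_symbols + check_symbols,
        "storage_overhead": check_symbols / data_symbols,
    }


strong = rank_layout(data_symbols=32, check_symbols=4)  # commercial chipkill correct
weak = rank_layout(data_symbols=16, check_symbols=2)    # two check symbols per codeword

# Both configurations keep the same storage overhead, but the weaker code
# accesses half as many devices per memory request.
print(strong)
print(weak)
```

Both layouts have a storage overhead of 12.5%, while the number of devices accessed per request drops from 36 to 18.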
Intuitively, pages with a fault can benefit greatly from double symbol
detection/correction per codeword, because if an additional bad symbol occurs in
a codeword that already contains a bad symbol, the second bad symbol can
still be detected/corrected via double symbol detection/correction. On the
flip-side, if a page is completely fault-free, the added value of double symbol
detection/correction compared to only single symbol detection/correction is
significantly smaller; for these pages, using two check symbols per codeword might
suffice. Our reliability analysis in Section 2.4.6 confirms this intuition; it
shows that adaptively upgrading codewords from single symbol protection
to double symbol protection as the codewords become affected by faults in-
curs negligible reliability degradation compared to always applying double
symbol protection to every codeword.
Meanwhile, large field studies have shown that only 2.95% [11] to 8% [10]
of DIMMs suffer any type of fault per year. Also, most of these faults affect
a small fraction of the DIMM (such as the single-bit and row faults). By
considering the different types of faults studied in [11] and making the
worst-case assumption that each type of device-level fault considered in the
work results in every memory location under the device-level circuitry
becoming corrupted, we calculated the average fraction of 4 KB physical pages
in a memory channel that contain one or more faulty locations. The channel
consists of two ranks with 36 devices per rank. Figure 2.2 shows that the
fraction of pages with faults is just a few percent during most of the
lifetime of the memory channel, even for a worst-case failure rate that is 4X
as high as what was measured in [11]. Because fault-free pages can be
protected using only two instead of four check symbols per codeword and
because most pages are fault-free, an adaptive approach that provides weaker
protection for a page in the beginning and only upgrades the protection
strength when the page contains a fault can lead to substantial power savings
with similar reliability as always providing stronger protection.
Figure 2.2: Faulty memory vs. time: Average fraction of 4 KB pages that
have been affected by faults vs. time.
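To give a feel for this kind of estimate, the sketch below computes the expected fraction of pages in a rank touched by at least one fault. It is not the calculation used for the figure: it assumes faults arrive as independent Poisson processes and that each fault of a given type corrupts a fixed share of the pages in the rank. The function name and the example inputs are hypothetical placeholders, not rates taken from [11].

```python
import math


def faulty_page_fraction(years, devices_per_rank, fault_types):
    """Expected fraction of pages in a rank hit by at least one fault.

    fault_types: iterable of (rate_per_device_year, page_fraction) pairs,
    where page_fraction is the share of the rank's pages that one fault of
    that type is assumed to corrupt. Independent Poisson arrivals are an
    assumption of this sketch.
    """
    # Rate (per year) at which any given page is hit by a new fault.
    hit_rate = devices_per_rank * sum(rate * frac for rate, frac in fault_types)
    return 1.0 - math.exp(-hit_rate * years)


# Placeholder inputs for illustration only; they are NOT measured fault rates.
example_faults = [(1e-3, 1e-4), (1e-4, 1e-2)]
for year in (1, 3, 5, 7):
    print(year, faulty_page_fraction(year, 36, example_faults))
```

With real per-type fault rates and footprints substituted for the placeholders, the same function yields a curve of faulty-page fraction against time.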
2.4 Adaptive Reliability Chipkill Correct
ARCC reactively increases the strength of protection of every codeword in
a page when a fault is detected in the page by doubling the number of
check symbols per codeword. Conceptually, ARCC does so by joining two
codewords stored in two separate memory channels into a single large
codeword, which, therefore, has twice the number of check symbols but the same
storage overhead as the smaller codewords. In this section, we describe how
to apply ARCC to commercial chipkill correct solutions to provide similar
reliability as always using four check symbols per codeword while incurring
the low memory power consumption of using only two check symbols per
codeword.
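The per-page adaptation can be sketched as a small state tracker. This is a minimal illustration under stated assumptions, not the hardware design: the class and method names are hypothetical, the mode names borrow the relaxed/upgraded terminology introduced in the next paragraph, and the 18 devices per channel follow from 16 data plus two check symbols per codeword.

```python
from enum import Enum


class Mode(Enum):
    RELAXED = 2   # two check symbols per codeword
    UPGRADED = 4  # four check symbols per codeword


class PageModes:
    """Tracks the protection mode of each physical page (hypothetical)."""

    DEVICES_PER_CHANNEL = 18  # 16 data + 2 check symbols, one per device

    def __init__(self, num_pages: int):
        # Every page starts with the weaker, lower-power protection.
        self.mode = [Mode.RELAXED] * num_pages

    def end_of_scrub(self, pages_with_errors):
        """Upgrade every page in which the scrub detected an error."""
        for page in pages_with_errors:
            self.mode[page] = Mode.UPGRADED

    def devices_accessed(self, page: int) -> int:
        """Number of devices touched by one request to this page."""
        channels = 1 if self.mode[page] is Mode.RELAXED else 2
        return channels * self.DEVICES_PER_CHANNEL


pages = PageModes(num_pages=4)
pages.end_of_scrub([2])  # the scrub found an error in page 2
print([pages.devices_accessed(p) for p in range(4)])
```

Only the page flagged by the scrub pays the cost of accessing both channels; all other pages continue to access a single channel.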
We refer to a physical page where there are four check symbols per code-
word as an upgraded page and a page where there are only two check symbols
per codeword as a relaxed page. The top half of Figure 2.3 shows the data
layout of a relaxed page. In the figure, there are two memory channels in the
memory system, where each memory channel can serve a memory request
for a 64B line independently. We consider a common memory configura-
tion where each physical page contains 4 KB of data, which is equivalent to
sixty-four 64B lines. Physical address mapping policies for systems with mul-
tiple memory controllers often map adjacent 64B lines to different memory
channels to reduce the latency of accessing adjacent lines in memory; this
is reflected in Figure 2.3, where alternate lines belong to alternate memory
channels (X and Y). Each 64B line, in turn, consists of multiple codewords;
in the example in Figure 2.3, each line consists of four codewords, which are
delimited by the horizontal lines in the figure; each codeword is composed of
16 data symbols and two check symbols, which are represented by the shaded
region. Since each symbol maps to a different DRAM device in commercial
chipkill correct solutions, the 18 symbols of a codeword in a relaxed page are
stored across 18 DRAM devices controlled by the same memory controller.
Figure 2.3: Memory layout: Data layout of a physical page in relaxed and
upgraded chipkill correct modes. The letters “X” and “Y” appended after
each line number indicate to which channel the line belongs. Each shaded
rectangle represents a check symbol in a codeword. The line with the cross
contains a fault, which causes the page to be upgraded.
When an error is detected during memory scrubbing, ARCC increases
the protection strength of the page with the error by increasing the number of
check symbols per codeword from two to four. To double the number of
check symbols per codeword without increasing the check-to-data symbol
ratio (which would increase storage overhead), ARCC combines two adjacent 64B
lines, each stored in a separate channel, in a page into a single 128B line,
referred to as an upgraded line from now on, where each codeword in the
128B line contains four check symbols and 32 data symbols. To convert a
relaxed page into an upgraded page, all lines in the relaxed page are read out
to compute the new codewords in each line and are stored back to memory
afterward. Note that since the two 64B lines in each upgraded line belong to
two separate memory channels, the entire upgraded line can be read out in the
time it takes to read a single 64B line by accessing the two memory channels
in parallel. The bottom half of Figure 2.3 illustrates one way of combining
two adjacent 64B lines into an upgraded line, where each symbol maintains
its original size and the number of codewords per upgraded line is the same
as the number of codewords per line under the relaxed mode. An alternative
design is to reduce the size of each symbol by half, and as a result, to double
the number of codewords per upgraded line. This flexibility is important
since different symbol sizes require different types of Error Detection and
Correction (EDAC) controllers. By providing this flexibility, ARCC provides
freedom in choosing the EDAC controller to use for the upgraded line.
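The claim that joining two codewords doubles the check symbols without changing the storage overhead can be checked with a few lines of arithmetic. The sketch below is illustrative only; the class and function names (CodewordLayout, upgrade) are hypothetical and do not come from the ARCC implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CodewordLayout:
    """Hypothetical description of one chipkill codeword (one symbol per DRAM device)."""
    data_symbols: int
    check_symbols: int

    @property
    def devices(self) -> int:
        # Each symbol maps to a different DRAM device.
        return self.data_symbols + self.check_symbols

    @property
    def storage_overhead(self) -> float:
        # Ratio of check symbols to data symbols.
        return self.check_symbols / self.data_symbols


def upgrade(layout: CodewordLayout) -> CodewordLayout:
    """Join two codewords from two channels into one larger codeword."""
    return CodewordLayout(2 * layout.data_symbols, 2 * layout.check_symbols)


RELAXED = CodewordLayout(data_symbols=16, check_symbols=2)
UPGRADED = upgrade(RELAXED)

print(RELAXED.devices, RELAXED.storage_overhead)    # devices touched and overhead, relaxed
print(UPGRADED.devices, UPGRADED.storage_overhead)  # devices touched and overhead, upgraded
```

The upgraded layout touches twice as many devices per access, which is why only faulty pages pay the higher access power.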
2.4.1 Implementation Details
This section describes the different modifications needed to support ARCC
as well as the associated overheads.
2.4.2 Page Table
Each physical page entry and the corresponding TLB entry is modified to
contain an additional 1-bit flag to indicate the chipkill correct strength, re-
laxed or upgraded, the page currently operates in. The value of the flag is
updated at the end of a memory scrub. If the memory scrubber detects an
error in a physical page, the chipkill correct strength of the physical page will
be upgraded. To upgrade a page affected by faults, only the page itself needs
to be accessed to recalculate each upgraded line in the page; pages without
faults are not affected. When an upgraded physical page is accessed, both
64B lines in each upgraded line will be accessed.
We assume that the operating system is started up in the upgraded mode
for every page. After the page table has been populated, a memory scrub
is immediately performed to identify the fault-free pages, which are then
set to relaxed mode.
2.4.3 Memory Scrubbing
ARCC upgrades the chipkill correct strength of a page after faults are de-
tected in a page during memory scrubbing. Our reliability analysis in Section
2.4.6 assumes an ideal memory scrubber that is capable of detecting all faults
at the end of each memory scrub. A conventional memory scrubber which
simply reads out and writes back the memory content during each scrub may
leave many hidden stuck-at-1 or stuck-at-0 faults undetected. Therefore, to
adhere closely to such an ideal memory scrubber, we modified a conventional
memory scrubber to execute the following steps:
1. Read a line and store its value aside.
2. Write all 0’s to the line location in memory and then read the location
in memory to see if only 0’s are returned. If true, go to step 3. If false,
a stuck-at-1 fault may be present; go to step 4 and upgrade the page
afterward.
3. Write all 1’s to the line and then read the line to see if only 1’s are
returned. If true, go to step 4. If false, a stuck-at-0 fault may be
present; go to step 4 and upgrade the page afterward.
4. Correct any errors in the original content of the line and write the line
back to memory.
Optionally, in order to reduce the overhead of alternating between reads
and writes during a memory scrub, steps 1 to 4 can be performed in batches
for multiple consecutive lines at a time.
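The four scrubbing steps above can be sketched as follows. This is a toy software model under stated assumptions, not the hardware scrubber: FaultyMemory, scrub_line, scrub_page, and the correct hook are hypothetical names, and stuck bits are modeled as bit masks that override whatever was written.

```python
class FaultyMemory:
    """Toy line-addressable memory; stuck bits ignore writes (illustrative assumption)."""

    def __init__(self, num_lines, line_bits=512):
        self.mask = (1 << line_bits) - 1
        self.lines = [0] * num_lines
        self.stuck_at_1 = [0] * num_lines  # per-line bit mask of stuck-at-1 cells
        self.stuck_at_0 = [0] * num_lines  # per-line bit mask of stuck-at-0 cells

    def write(self, idx, value):
        self.lines[idx] = value & self.mask

    def read(self, idx):
        value = self.lines[idx] | self.stuck_at_1[idx]
        return value & ~self.stuck_at_0[idx] & self.mask


def scrub_line(mem, idx, correct=lambda value: value):
    """Return True if a stuck-at fault may be present in line idx.

    correct stands in for the ECC decoder (hypothetical hook; identity here).
    """
    original = mem.read(idx)             # step 1: read and set the value aside
    faulty = False
    mem.write(idx, 0)                    # step 2: all 0's test
    if mem.read(idx) != 0:
        faulty = True                    # possible stuck-at-1; skip to step 4
    else:
        mem.write(idx, mem.mask)         # step 3: all 1's test
        if mem.read(idx) != mem.mask:
            faulty = True                # possible stuck-at-0
    mem.write(idx, correct(original))    # step 4: correct and write back
    return faulty


def scrub_page(mem, first_line, lines_per_page=64):
    """Scrub every line of a page; the page is upgraded if any line is faulty."""
    results = [scrub_line(mem, i) for i in range(first_line, first_line + lines_per_page)]
    return any(results)
```

Note that a stuck-at-1 cell that happens to store a 1 is invisible to a plain read and write-back scrub, which is why the explicit all-0's and all-1's tests are needed.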
Although memory scrubbing is more expensive in ARCC (three reads and three
writes per line instead of one read and one write, due to the additional
all 0’s and all 1’s tests) compared to conventional memory scrubbing, the
performance overhead of memory scrubbing is still negligible since memory
scrubbing takes a few seconds per memory channel while it is performed once
every few hours [11]. Consider, for example, a 128-bit wide memory channel
with 4 GB of 667 MHz DDR2 memory. Accessing the entire memory content takes
$4 \cdot 1024^3 \cdot 8 / 128 / (667 \cdot 10^6) \approx 0.4$ s.
A memory scrub required by ARCC takes $0.4 \cdot 6 = 2.4$ s per memory scrub.
Assuming a memory scrub rate of once every four hours, $2.4 / (4 \cdot 3600)$ results
in only a 0.0167% reduction in maximum effective memory bandwidth.
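The scrub overhead estimate above can be reproduced directly. The sketch below is a back-of-the-envelope check; the function name full_pass_seconds is hypothetical, and the factor of six is our reading of the four scrub steps as three reads plus three writes per line.

```python
def full_pass_seconds(capacity_bytes, channel_width_bits, transfer_rate_hz):
    """Time to touch every bit of memory once over one channel."""
    return capacity_bytes * 8 / channel_width_bits / transfer_rate_hz


one_pass = full_pass_seconds(4 * 1024**3, 128, 667e6)  # about 0.4 s
arcc_scrub = 6 * one_pass                              # 3 reads + 3 writes per line
scrub_period = 4 * 3600                                # one scrub every four hours
bandwidth_loss = arcc_scrub / scrub_period

print(f"one pass: {one_pass:.3f} s")
print(f"ARCC scrub: {arcc_scrub:.2f} s")
print(f"bandwidth loss: {bandwidth_loss:.4%}")
```

The small difference from the 0.0167% quoted in the text comes from rounding the single-pass time to 0.4 s before multiplying.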
2.4.4 Last Level Cache
The LLC needs to be modified in order to accommodate both the relaxed
64B lines and the upgraded 128B lines in the LLC simultaneously; during a
write to memory, both sub-lines of an upgraded line need to be written back
to memory at the same time in order to update all four check symbols in each
codeword in the upgraded line. One way to accommodate both 64B and 128B
lines in the LLC is to implement the LLC as a sectored cache [18]. However,
since the sectored cache can degrade the effective size of the cache when
there is low spatial locality in the applications, we propose an alternative
LLC design to accommodate the two different cacheline sizes.
We observe that since the physical addresses of the two sub-lines in an
upgraded line are consecutive, the two sub-lines will be mapped to two adja-
cent sets in a conventional LLC with 64B cachelines. We propose including
an additional bit to the tag of each cacheline to indicate whether or not the
cacheline belongs to an upgraded line. When an upgraded line is brought
into the LLC, the flag is set to one. When a line is selected for eviction,
its flag is checked to see if it is a sub-line of an upgraded line; if it is, the
second sub-line of the same upgraded line can be found in the adjacent set
as the line with the same tag. In order to prevent a sub-line from being
forcefully evicted due to the lack of reuse in the second sub-line, the LLC
cache replacement policy uses the recency of the most recently used sub-line
as the recency value of both sub-lines for eviction selection.
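A minimal sketch of the adjacent-set lookup and the shared recency value follows. It assumes a power-of-two number of sets, a conventional index taken from the low-order line-address bits, and 128B-aligned upgraded lines, so that the two sub-lines differ only in address bit 6; all function names are hypothetical.

```python
LINE_BYTES = 64


def set_index(addr, num_sets):
    """Conventional set index from the low-order line-address bits."""
    return (addr // LINE_BYTES) % num_sets


def tag(addr, num_sets):
    return addr // (LINE_BYTES * num_sets)


def sibling_addr(addr):
    """Other 64B sub-line of the same 128B-aligned upgraded line (assumption)."""
    return addr ^ LINE_BYTES


def effective_last_use(last_use, addr, is_upgraded):
    """Recency used for eviction; sub-lines share the most recent timestamp.

    last_use maps a line address to the time of its last access. Both sub-lines
    of an upgraded line are assumed resident, since they are filled together.
    """
    line = addr - addr % LINE_BYTES
    if not is_upgraded:
        return last_use[line]
    return max(last_use[line], last_use[sibling_addr(line)])


def pick_victim(candidates, last_use, upgraded_lines):
    """LRU victim selection among lines of one set, using the shared recency."""
    return min(
        candidates,
        key=lambda a: effective_last_use(last_use, a, a in upgraded_lines),
    )
```

With this policy, a rarely touched sub-line is not chosen as the victim as long as its sibling is still being reused.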
The main overhead in the cache is due to the fact that cache replacement
requires a second tag access to find the recency value of the other sub-line
in an upgraded line. The performance overhead of this second tag access is
small because LLC replacements are required only by LLC misses, which are
less frequent than LLC hits. In addition, the latency of the second tag access
is much smaller than that of the memory access due to the LLC miss. In our
experiments, we modified the cache implementation such that every cache
replacement takes twice as long and did not observe any noticeable effect on
performance.
2.4.5 Memory Controller
The two sub-lines in each upgraded line have to be read from and written
to memory at the same time in order to provide error detection/correction.
One design is to logically partition the memory queue of each controller into
two, one for the sub-lines of upgraded lines and one for the relaxed lines. The
sub-line queue maintains a strict FIFO ordering to ensure that the pairing
of the sub-lines in each queue is always correct. The memory controllers can
then alternately issue requests from the queue for the sub-lines and the
queue for the regular 64B lines.
An alternative design is to augment each memory queue entry with a new
flag with multiple bits. The first bit of the flag is set to 1 to indicate that
the line is a sub-line of an upgraded line. When the first bit is set to 1, the
remaining bits in the flag serve as a pointer to the physical queue entry in the
other memory channel where the second sub-line resides. When a sub-line
is at the head of the memory access queue, memory access in the queue is
stalled until the second sub-line is found. The corresponding sub-line in the
second memory channel is to be found via the pointer and then promoted
to the head of its memory access queue so that the pair of sub-lines can be
issued together.
Due to the large number of devices per rank (36), commercial chipkill
correct solutions require two physical memory channels, each controlling 18
devices, to operate in lockstep as a single logical channel [19]. As a result,
a single EDAC controller targeting codewords with four check symbols is
sufficient for every pair of memory controllers. However, ARCC requires an
additional EDAC controller for each memory controller to target codewords
with two check symbols.
Figure 2.4: Reliability comparison: Comparison between the SDC rate of
simultaneous double error detection (DED) as provided by commercial
SCCDCD and that of the reduced double error detection (ARCC DED) as
provided by applying ARCC to commercial SCCDCD.
2.4.6 Reliability Analysis
ARCC does not degrade the DUE rate of commercial chipkill correct solu-
tions. When ARCC is applied to SCCDCD, ARCC does not degrade the
DUE rate since ARCC always guarantees correction of a single bad symbol
in a codeword, just as SCCDCD. ARCC also does not degrade the DUE
rate when it is applied to double chip sparing, which corrects up to two bad
symbols per codeword. This is because just like ARCC, double chip sparing
cannot correct the second bad symbol unless the first bad symbol has been
detected before the second bad symbol occurs (see Section 2.2).
In the ARCC implementation described in Section 2.4, each codeword
contains only two check symbols at the beginning, which only guarantees the
detection of a single bad symbol. After an error is detected in a page (which
results in at most one bad symbol per codeword), ARCC reactively increases
the number of check symbols per codeword to four by doubling the size of
each codeword in the page in order to be able to detect two bad symbols
per codeword. However, it is possible for a second bad symbol to occur in
the same codeword before the first bad symbol is detected; such errors may
not be detected. On the other hand, commercial chipkill correct solutions
constantly allocate four check symbols per codeword, and, therefore, always
guarantee the detection of two bad symbols. However, the probability of two
or more faults in two or more different devices affecting the same codeword
and occurring within the same scrub period of a few hours is small.
To understand the extent of degradation in detection reliability when ap-
plying ARCC to commercial chipkill, we used the chipkill correct reliability
models given in [20]. The fault rates that were used as inputs to the models
are taken from a recent large field study on DRAM errors [11], and include
those of lane, device, bank, column, and row faults. A memory scrub period
of four hours was assumed, which is consistent with the memory scrub period
used in [11]. The memory configuration of the baseline commercial chipkill
correct solution consists of a memory channel with two ranks, with 36 de-
vices per rank, and, therefore, a total of 72 DRAM devices. We used the
double chipkill correct/detect model provided in our technical report [20] to
calculate the error detection reliability of the baseline SCCDCD. Since SC-
CDCD+ARCC may not detect a second bad symbol in a codeword unless
it occurs after the first bad symbol in the codeword has been detected, its
error detection reliability may be conservatively modeled as the error
correction reliability of double chip sparing. Therefore, we used the double
chip sparing reliability model in [20] to calculate the error detection reliabil-
ity of SCCDCD+ARCC. To report the SDC rate, we converted the reliability
output to the SDC rate. When calculating the SDC rate, we assume that
all DIMMs in a machine are to be replaced as soon as the first undetectable
error occurs in the machine, so that the same faulty machine does not con-
tribute multiple SDCs. To validate the results of the reliability models, we
also performed Monte Carlo simulations (details in [20]). Figure 2.4 shows
the number of SDCs in 1000 machine-years calculated using the output from
the reliability model for both the SCCDCD baseline and SCCDCD+ARCC
for different intended lifespans of a machine and for different factors of the
failure rate. The figure shows that the increase to the SDC rate of SC-
CDCD+ARCC over SCCDCD alone is insignificant.
2.5 Methodology
Table 2.1: Memory Configurations
Name      Tech  I/O  Chan  Ranks/Chan  Rank Size (devices)
Baseline  DDR2  X4   2     1           36
ARCC      DDR2  X8   2     2           18
Table 2.2: Processor Microarchitecture
SS Width    IQ Size    Phys Regs       LSQ Size
2           16         72FP/72INT      32LQ/32SQ
L1 D$, I$   L1 Assoc   L1 lat.         L2$
32 kB       2          1 cycle         1MB
L2 Assoc    L2 lat.    Cacheline Size  L2 MSHR
16          10 cycles  64B             240
Table 2.1 summarizes the memory configurations for comparison against
commercial chipkill correct solutions. The commercial chipkill correct baseline
simulated consists of a single memory channel with two ranks per channel
and 36 devices per rank. Due to the large number of devices per rank, both
the burst length and the I/O width of each device must be small to satisfy
the chosen cacheline size of 64B. As such, DDR2 X4 devices are chosen for
the SCCDCD baseline. The corresponding memory configuration for ARCC
consists of two memory channels with two ranks per channel and 18 devices
per rank. Both configurations have the same total number of devices. In
order to provide the same output granularity with 18 devices per rank as
that of the baseline with 36 devices per rank, we increased the I/O width of each
device from X4 to X8 for ARCC. DRAMsim [21] was used to model memory
power and timing. The DRAM device parameters are taken from MICRON
datasheets [22]. We assume that there are two 4 KB pages per row in mem-
ory. For the row buffer and physical address mapping policies, we used the
closed page policy and the high performance mapping policy, respectively,
as provided by DRAMsim. Meanwhile, we simulated a quad-core processor
running 12 mixed SPEC workloads for 2 billion cycles using GEM5 [23], a
full system simulator. Table 2.2 describes the CPU microarchitecture while
Table 2.3 shows the workloads that were used.
We used the following methodology to study the power and performance
degradation due to upgraded pages in ARCC as faults develop over time:
1. We estimated the power and performance overhead associated with
each type of device-level fault, introduced in Section 2.4.6 for the mem-
ory configuration summarized in Table 2.1 by setting the fraction of
memory affected by that type of fault to upgraded mode and repeating
















Table 2.4: Fault Modeling Details
Fault Type  Fraction of Pages Upgraded
Lane        100%: It causes both ranks per channel to be upgraded.
Device      1/2: It causes 1 out of the 2 ranks to be upgraded.
Subbank     1/16: It causes 1 out of the 8 banks in a single rank to be upgraded.
Column      1/32: It causes half of the pages in a single bank to be upgraded.
2. Using Monte Carlo simulations, we simulate a 7 year lifespan for 10000
memory channels to capture when and what type of device-level faults
occur during the 7 simulated years in the 10000 channels. The fault
rates from [11] were used.
3. For each recorded fault type in each simulated memory channel, we
added the overhead associated with that fault to the performance and
power of that channel, starting from the time that the fault occurs.
4. For each year X in the intended lifetime of a channel, we averaged
the power and performance of the 10000 memory channels from the
beginning of the first year to the end of year X to provide an estimate




When ARCC is applied to commercial chipkill correct solutions, which re-
quire 36 devices to be accessed per memory request, significant power reduc-
tion can be achieved as ARCC requires only 18 devices to be accessed for
fault-free pages and 36 devices to be accessed for pages with faults (see Ta-
ble 2.1). Figure 2.5 shows the DRAM power and performance improvement
from applying ARCC to commercial chipkill correct solutions when there
are no faults in memory. Performance of a mixed workload is reported as
the sum of the instructions per cycle (IPCs) of all the benchmarks in the
workload. On average, ARCC reduces power consumption by 36.7% and
improves performance by 5.9%. The power benefits across the workloads
are relatively uniform due to the fact that ARCC saves the same amount of
power on every memory access regardless of the application characteristics. The
slight performance improvement is due to having twice the number of ranks
under ARCC, which increases the amount of rank-level parallelism. Differ-
ent benchmarks experience different performance benefits from the increased
rank-level parallelism, which explains the variation in performance.
Figure 2.5: Power and performance improvements.
Figure 2.6: Power consumption of a memory system with fault: Power
consumption of ARCC when there are different types of faults in memory,
normalized to when there is no fault.
Figure 2.6 shows the power consumption of ARCC in the presence of a
single device-level fault in memory normalized to when the memory is fault
free. Results are presented for different types of device-level faults. For
example, “1 Device Fault” represents the scenario where half of the pages
have been upgraded due to a device fault (Table 2.4). As expected, the power
consumption of ARCC increases in the presence of faults in memory. This
is because two adjacent 64B lines instead of one are required to be accessed
when a 128B line in an upgraded page is accessed. In the worst-case scenario,
when there is no spatial locality in the application, the second 64B line is
never useful to the workload. In this scenario, the power consumption
of accessing an upgraded page is twice that of accessing a normal page. As
shown in the “worst-case est.” bar in Figure 2.6, the worst-case memory
power increases by the fraction of pages in memory that are upgraded. In
reality, the second 64B line is useful for many workloads due to the presence
of spatial locality; as a result, the power overhead due to device-level faults
in memory is much smaller, as shown in the figure.
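The worst-case estimate described above can be written out as a short calculation using the upgraded-page fractions from Table 2.4. This is a sketch under the assumption that accesses are spread uniformly over all pages and that every access to an upgraded page costs exactly twice the power of an access to a relaxed page; the names below are hypothetical.

```python
from fractions import Fraction

# Fraction of pages upgraded per fault type, taken from Table 2.4.
UPGRADED_FRACTION = {
    "lane": Fraction(1),
    "device": Fraction(1, 2),
    "subbank": Fraction(1, 16),
    "column": Fraction(1, 32),
}


def worst_case_power(upgraded_fraction):
    """Access power normalized to fault-free memory, assuming no spatial locality."""
    relaxed_cost, upgraded_cost = 1, 2
    return (1 - upgraded_fraction) * relaxed_cost + upgraded_fraction * upgraded_cost


for fault, fraction in UPGRADED_FRACTION.items():
    print(f"{fault:8s} {float(worst_case_power(fraction)):.3f}x")
```

The expression simplifies to one plus the upgraded fraction, which matches the statement that worst-case power grows by the fraction of pages that are upgraded.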
Figure 2.7 shows the IPC of ARCC in the presence of a single device-level
fault in memory normalized to when the memory is free of faults. While
some workloads, such as Mix2 to Mix7, show a clear degradation in
performance when there are faults in memory, some others such as Mix1 and
Mix10 show performance improvement. We attribute this to the different
amount of spatial locality in the different workloads. Although applications
with low spatial locality suffer when two adjacent cachelines have to be
accessed for every memory access after upgrading faulty pages, applications
with high spatial locality on the other hand actually benefit from having to
access two adjacent cachelines because this acts like a useful prefetch. In
the worst-case scenario, when there is no spatial locality in the application
and the bandwidth is the bottleneck, ARCC can degrade performance by as
much as 50% in the presence of a lane fault. However, due to spatial locality
in the applications studied, there was negligible performance degradation on
average.
Figure 2.7: Performance of a memory system with fault: Performance of
ARCC when there are different types of faults in memory, normalized to
when there is no fault.
Figure 2.8: Power overhead of error correction: Average increase in power
consumption as a function of time compared to fault-free memory.
Figure 2.9: Performance overhead of error correction: Average decrease in
performance as a function of time compared to fault-free memory.
The fraction of pages that are faulty in a memory system increases with
time. Using the methodology described in Section 2.5, we calculated the
average power and performance degradation during different years in the
lifetime of the memory system in Table 2.1. The results are shown in Fig-
ures 2.8 and 2.9 respectively. The worst-case estimate curves in the figures
assume that there is no spatial locality in the application and, therefore, a
memory access to an upgraded page consumes twice as much power and reduces
effective bandwidth by half compared to a memory access to a relaxed
page. The figures show that the degradation both in terms of the worst-case
estimate and measured overheads is small. This is due to the fact that, on
average, only a tiny fraction of memory is affected by faults (see Figure 2.2).
In fact, the power benefit of ARCC, even at the end of 7 years at 4X the
memory fault rate reported in [11], is no less than 30%.
2.7 Conclusion
In this research, we propose ARCC, a novel optimization for existing chipkill
correct solutions that aims to provide the reliability of a high-strength
chipkill correct solution at the memory power overhead of a low-strength
chipkill correct solution without increasing the storage overhead. Based on
the observation that only a small fraction of memory experiences faults
over the typical operational lifespan of a memory system, ARCC begins
with a low chipkill correct strength and adaptively increases the chipkill
correct strength on a page-by-page basis as pages become affected by faults,
in order to take advantage of the power benefit of a weaker chipkill correct
solution for all fault-free pages while providing the reliability of a stronger
chipkill correct solution. We presented an efficient implementation of the
concept, where the number of check symbols per codeword is doubled after
faults are detected in a page by combining two adjacent codewords in two
different channels into a single large codeword without increasing the over-
all storage overhead. We performed a comparative evaluation of the power,
performance, and reliability of this implementation relative to commercial
chipkill correct solutions. Our experiments show that this implementation
reduces memory power by 36% when applied to commercial chipkill correct
solutions with negligible degradation to reliability. ARCC not only can be
used to reduce the power consumption of existing commercial chipkill correct
memories with four check symbols per codeword, but also provides an imple-
mentation for stronger chipkill correct solutions in the future, such as those





Chapter 2 explores how to design an efficient adaptive memory architecture
that reconfigures correction strength and, therefore, memory power consump-
tion, as faults accumulate in memory. This chapter explores the converse
problem: Given an adaptive memory architecture that can reconfigure mem-
ory error correction as faults accumulate, how should memory error correction
be designed differently to maximize energy savings?
In a static memory architecture, memory error correction performance
dictates system performance when memory faults occur because accessing
erroneous memory requires performing error correction. As such, memory
error correction in a static memory architecture needs to provide
high-performance correction to ensure that system performance does not degrade
drastically when faults occur in memory. In an adaptive memory architec-
ture, however, the performance of the initial memory error correction when
memory is fault-free can be decoupled from correction performance when
memory has become faulty; when memory is fault-free, having extremely
slow (e.g., > 1,000,000X slower than regular memory accesses) correction
performance does not impact system performance because correction is only
performed when accessing erroneous memory. As such, intuitively, ample
opportunities exist to trade off the much lower correction performance
requirement when memory is fault-free for improved memory energy efficiency.
Indeed, conventional server memory architectures pay significant energy
overheads to provide fast correction by allocating sufficient redundancy to
each cacheline so that each cacheline can be independently corrected with-
out accessing other cachelines. This chapter proposes Multi-line ECC, a novel
memory error correction that reduces the memory redundancy requirement
and, therefore, improves memory energy efficiency by forgoing fast error
correction; performing error correction requires fetching many (e.g., millions)
of cachelines under Multi-line ECC. Through Multi-line ECC, an adaptive
memory architecture (e.g., ARCC described in the previous chapter) can
significantly improve common-case memory energy efficiency, when memory is
fault-free, while only slightly degrading performance when memory becomes
faulty, by adaptively reconfiguring to a fast memory error correction when
faults occur.
3.1 Motivation
Server memory systems today are power-hungry because they access many
memory chips per memory request. A difficult challenge with reducing the
number of chips accessed per memory request is that when a cacheline is
stored across fewer chips, more data in the cacheline can become erroneous
when one of the memory chips fails, as illustrated in Figure 3.1. More errors
require more redundant data to correct; higher memory redundancy in turn
increases system cost because memory is expensive [24]. To access fewer
memory chips per memory request without increasing memory redundancy
requires that the same memory redundancy be able to correct more errors;
this needs a new type of memory error correction that requires less memory
redundancy to correct each error.
Reducing the memory redundancy for correcting each error requires
taking a close look at how error correction operates. Memory error correction
typically operates in three steps [25]. Step one is to detect whether the













Figure 3.1: When a cacheline is accessed from fewer chips, the cacheline
will suffer from more errors if a chip fails and, therefore, requires more
memory redundancy to correct its errors.
Figure 3.2: Error manifestation of a memory chip-level fault. D stands for
a memory device; CW stands for a codeword; X stands for errors.
in the cacheline the error(s) reside(s). Step three is to reconstruct the lost
data in the identified error locations. Both steps one and two require adding
redundant memory data, while step three can use the existing redundant
data from the previous steps to calculate the lost data in the error locations.
Every cacheline needs redundant data for error detection (i.e., for step one) to
verify the cacheline’s correctness before the cacheline can be safely consumed
by the processor. While typical server memory architectures also allocate for
every cacheline redundant data to localize errors (i.e., for step two), many
cachelines can actually share these error localization code bits because faults
typically occur in one chip at a time and errors due to one faulty chip always
appear in the same location across different cachelines (see Figure 3.2).
The benefit of many cachelines sharing the same error location code bits
is reducing the memory redundancy for correcting an error by reducing the
total number of needed error localization code bits; as discussed at the be-
ginning of this section, reducing the memory redundancy for correcting each
error enables each cacheline to be accessed from fewer chips without increas-
ing the overall memory redundancy. The downside is that because making
use of a code bit requires fetching all data that the code bit is computed from,
the more cachelines sharing the error localization code bits, the more cache-
lines the error localization step needs to fetch from memory and, therefore,
the slower the memory error correction. In comparison, the conventional ap-
proach of allocating dedicated error localization code bits for each cacheline
maximizes error correction performance since no additional cacheline needs
to be accessed for correction. Trading off correction performance for im-
proved memory energy efficiency is desirable for the common-case fault-free
memory because each cacheline only needs to perform error detection but
not correction as no errors are present. Section 3.2 proposes such a memory
error correction for fault-free memory, Multi-line ECC, which shares error
localization code bits across a vast number of cachelines to minimize memory
power for accessing fault-free memory.
3.2 Multi-line ECC
Multi-line ECC reduces the memory redundancy requirement for correcting
each error by sharing error localization code bits across a vast number of
cachelines. Under Multi-line ECC, each cacheline still contains its own error
detection code bits so that its correctness can be quickly verified on-the-fly
with the memory read request to access the cacheline. If a cacheline is de-
tected to be erroneous, Multi-line ECC uses the shared error localization
code bits to identify the error locations in the cacheline and then reuses
the same detection code bits in the cacheline to reconstruct the lost data
in the identified error locations. Because using the shared error localiza-
tion code bits to locate the error locations requires accessing all cachelines
sharing the same code bits, Multi-line ECC suffers from very slow error
correction performance. However, since error correction is only performed
when faults are present, slow error correction performance does not affect
system performance in the common-case fault-free memory. The adaptive
memory architecture in the previous chapter can use Multi-line ECC when
memory is fault-free to maximize energy savings and then reconfigure to a
high-performance but power-hungry memory error correction when
faults occur.
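The localization step can be illustrated with a toy model in which one shared checksum is kept per chip over all cachelines that share it. This is only a sketch: the additive 64-bit checksum is a simple stand-in chosen for illustration (not the checksum used by Multi-line ECC), the data layout is hypothetical, and the final reconstruction of the lost data is omitted.

```python
MODULUS = 1 << 64  # illustrative 8B additive checksum (stand-in, see lead-in)


def chip_checksums(lines):
    """Compute one checksum per chip over all cachelines sharing the code bits.

    lines is a list of cachelines; each cacheline is a list holding the integer
    slice of data that each chip contributes to that cacheline.
    """
    num_chips = len(lines[0])
    sums = [0] * num_chips
    for line in lines:
        for chip, word in enumerate(line):
            sums[chip] = (sums[chip] + word) % MODULUS
    return sums


def locate_faulty_chips(lines, stored_checksums):
    """Re-read every sharing cacheline and report chips whose checksum changed."""
    fresh = chip_checksums(lines)
    return [chip for chip, (new, old) in enumerate(zip(fresh, stored_checksums))
            if new != old]


# Four cachelines spread over nine chips (eight data chips plus one ECC chip).
lines = [[line * 100 + chip for chip in range(9)] for line in range(4)]
stored = chip_checksums(lines)   # kept aside while memory is fault-free
lines[1][3] ^= 0xFF              # chip 3 starts returning corrupted data
lines[2][3] ^= 0x0F
print(locate_faulty_chips(lines, stored))
```

The cost structure is visible in the code: detection needs only the accessed cacheline, whereas localization walks every cacheline that shares the checksum.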
3.2.1 Encoding and Layout
Multi-line ECC allocates the Reed-Solomon (RS) error correcting code (ECC)
bits to each cacheline. RS ECC typically requires 2B code bits to correct each
bad byte (i.e., a group of eight adjacent bits with at least one bad bit). Each
byte of RS code bits can also guarantee detection of one bad byte. In
addition, if which bytes are bad is known a priori, each byte of code bits can also



















































































Figure 3.3: Layout of data and code bits under Multi-line ECC.
protects each 64B cacheline with 8B of RS code bits; it stores the RS code
bits in one ECC chip per rank as shown in Figure 3.3. Since only a single
ECC chip is needed, each cacheline can be stored in/accessed from only eight
data chips, while maintaining the same (i.e., 1/8 = 12.5%) memory
redundancy typically found in modern server memory systems. Multi-line ECC
uses error detection checksums, such as the MSet-Add-Hash checksums [26],
as the shared error localization code bits. While the shared error detection
checksums can be computed from any amount of data, this chapter considers
an implementation that computes each checksum using all the data from a
single bank in a single memory chip. In this embodiment, the total storage
for all checksums is $8 \cdot 4 \cdot 9 \cdot 8 \cdot 8$ B = 18 KB for a large eight-channel system
with 4 ranks/channel, 9 chips/rank, and 8 banks/chip, assuming 8B per
checksum; since 18KB of storage is small, the checksums can be stored in
the memory controller on the processor. These checksums are referred to as
bank checksums in the rest of the chapter.
3.2.2 Read Requests
When reading a cacheline from memory, Multi-line ECC uses the cacheline’s
RS code bits to detect errors. Since there are 8B RS code bits per cacheline
and each memory chip contributes 64B/8=8B of data to each cacheline, the
8B RS code bits guarantee detection of one chip failure per rank. In addition,
the RS code bits can also probabilistically detect errors (as opposed to
guaranteeing detection of all errors) from multiple chip failures with probability
$1 - 256^{-S}$, where $S$ is how many bytes of RS code bits are computed from
the same data. For example, each byte of RS code bits can be independently
computed from 8B of data in each 64B cacheline; here, the probability of
not detecting multiple chip failures is $256^{-1} \approx 0.4\%$. However, when all 8B
of RS code bits are computed using the same 64B of data in each cacheline, the
probability of not detecting multiple chip failures reduces to $256^{-8} \approx 5 \cdot 10^{-20}$,
which is astronomically small.
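The two probabilities quoted above can be reproduced with a short sketch. It assumes the usual model behind such figures, namely that each code byte of corrupted data matches by chance with probability 1/256; the function name is hypothetical.

```python
from fractions import Fraction


def miss_probability(s):
    """Chance that S code bytes computed over the same data all match by accident."""
    return Fraction(1, 256 ** s)


for s in (1, 8):
    print(f"S={s}: miss probability ~ {float(miss_probability(s)):.2e}")
# S=1: miss probability ~ 3.91e-03
# S=8: miss probability ~ 5.42e-20
```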
If the cacheline is detected to be error-free via its RS code bits, the mem-
ory request completes without requiring error correction. This scenario rep-
resents the vast majority of memory accesses since most memory remains
fault-free throughout system lifetime (see Chapter 2). However, if the cache-
line is detected to be erroneous, the next step is to locate the errors in order
to correct the errors. Multi-line ECC uses the shared error localization code
bits to locate errors by fetching all the cachelines that reside in the same
memory bank as the erroneous cacheline to recalculate the bank checksums.
By comparing the 9 bank checksums newly calculated from the latest values in memory
against the 9 bank checksums maintained by the memory controller, Multi-line
ECC can identify which chip contains a faulty bank. By identifying
the faulty bank, Multi-line ECC labels all 8B of data that the faulty chip
contributes to the erroneous cacheline as errors and corrects the eight errors
through erasure correction by reusing the cacheline’s 8B RS code bits. In
the rare scenario where more than one bank checksum mismatches, which implies
a multi-chip failure where more than one chip is faulty, Multi-line ECC
cannot correct the error and thus reports the error as uncorrectable.
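The localization step described above can be sketched as follows. This is only an illustration under stated assumptions: a simple additive checksum modulo 2^64 stands in for the MSet-Add-Hash checksums, the bank is shrunk to four cachelines, and all names are hypothetical. The erasure correction that follows localization is not modeled.

```python
MASK = (1 << 64) - 1  # checksums are 8B wide in the text


def bank_checksum(slices):
    """Toy additive checksum over the 8B slices one chip stores in a bank."""
    total = 0
    for piece in slices:
        total = (total + int.from_bytes(piece, "little")) & MASK
    return total


def locate_faulty_chip(stored_checksums, bank_contents):
    """Return the index of the one chip whose recomputed checksum mismatches.

    Returns None when every checksum matches and raises when more than one
    chip mismatches, which the text treats as uncorrectable.
    """
    mismatches = [chip for chip, slices in enumerate(bank_contents)
                  if bank_checksum(slices) != stored_checksums[chip]]
    if len(mismatches) > 1:
        raise RuntimeError("uncorrectable: more than one chip mismatches")
    return mismatches[0] if mismatches else None


# Nine chips, four cachelines per (tiny) bank, 8B per chip per cacheline.
bank = [[bytes([chip, line, 0, 0, 0, 0, 0, 0]) for line in range(4)]
        for chip in range(9)]
stored = [bank_checksum(slices) for slices in bank]
bank[5][2] = b"\xff" * 8          # inject a fault into chip 5
faulty_chip = locate_faulty_chip(stored, bank)
print(faulty_chip)                # 5
```

Once the faulty chip is known, its 8B contribution to the erroneous cacheline would be treated as eight erasures for the cacheline's RS code bits.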
Multi-line ECC provides extremely slow error correction performance due
to needing to fetch all cachelines in a memory bank; this requires fetching
8GB/8/64 = 16.7 million cachelines assuming 1GB memory chips with 8
banks per chip. Assuming 2000MT/s memory chips, fetching all cachelines
in a memory bank takes 62ms, which is longer than a disk access (e.g.,
∼10ms). However, as long as this expensive error correction is performed
only once for each new memory fault, which occurs very infrequently, the
overall performance overhead over system lifetime due to taking several extra
disk access latencies to correct memory errors is negligible. To limit the
occasions of performing expensive memory error correction to only once per
memory fault, the memory controller can track how many memory pages in
the memory bank contain erroneous cachelines when fetching all the cachelines
in the memory bank to perform error correction. If few memory pages contain
errors, which implies that the memory bank suffers from a few faulty bits
or faulty rows, the faulty memory pages can be retired at low overhead
to prevent the faulty memory pages from getting accessed again and thus
prevent repeating expensive error correction for these faulty memory pages.
On the other hand, when many memory pages contain errors, which implies a
faulty bank or even faulty chip, memory error correction can be reconfigured
to provide higher correction performance (see Section 3.2.4).
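The cost of fetching an entire bank can be estimated with a back-of-the-envelope sketch. It assumes peak channel bandwidth with 8B of data per transfer and no scheduling overhead; these assumptions and the function names are illustrative only. Depending on whether 1GB is read as 2^30 or 10^9 bytes, the estimate lands at roughly 67ms or 62.5ms, the same order as the figure quoted above.

```python
def cachelines_per_bank(data_chips=8, chip_bytes=2**30,
                        banks_per_chip=8, line_bytes=64):
    """Cachelines that share one set of bank checksums."""
    return data_chips * chip_bytes // banks_per_chip // line_bytes


def bank_fetch_seconds(bank_bytes, transfers_per_second=2000e6,
                       bytes_per_transfer=8):
    """Time to stream bank_bytes at peak channel bandwidth (assumed)."""
    return bank_bytes / (transfers_per_second * bytes_per_transfer)


lines = cachelines_per_bank()
print(lines)                                  # 16777216, about 16.7 million
print(bank_fetch_seconds(lines * 64) * 1e3)   # roughly 67 ms with binary GB
print(bank_fetch_seconds(1e9) * 1e3)          # 62.5 ms with decimal GB
```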
3.2.3 Write Requests
Each write request requires recomputing and storing the code bits whose
data bits are modified by the write request. Recomputing the new RS code
bits for each write request is straightforward because the RS code bits are
dedicated to each cacheline; as such, the memory controller has all the data it
needs (i.e., the dirty cacheline being written to memory) to recompute the RS
code bits for each write request. Recomputing the bank checksums for each
write request, on the other hand, is challenging because each bank checksum
is computed from data across many cachelines; as a result, recomputing
the bank checksums for a write request requires subtracting from the bank
checksums the checksum values of the cacheline’s stale value, which is to be
overwritten in memory, and then adding to the bank checksum the checksum
values of the cacheline’s new/dirty value, which is to be written to memory.
The problem is that the cacheline’s stale value is no longer present in the
processor by the time of the write request because the stale value in the
processor is over-written when a cacheline first becomes dirty. The naive
approach to obtain the cacheline’s stale value for each write request is to
issue a memory read request to fetch the stale cacheline’s value from memory
prior to issuing the write request. However, issuing an extra memory
read request for every memory write request causes 100% memory bandwidth
overhead for writes, which is expensive.
To completely eliminate the write memory bandwidth overhead for recom-
puting the bank checksums, Multi-line ECC modifies the processor’s cache(s)
to first send to the memory controller the stale memory value of a cacheline,
instead of overwriting the stale value right away, when the cacheline becomes
dirty. Upon receiving the stale cacheline value, the memory controller can
subtract from the bank checksums the checksum values of the stale cacheline
value. Later on, when the dirty cacheline is evicted from the cache and needs
to be written back to memory, the memory controller can simply add to the
bank checksums the dirty cacheline’s checksum values and then write the
dirty cacheline to memory, without needing to fetch the stale cacheline value
from memory for any write request.
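The two-step checksum update described above can be sketched as follows. A toy additive checksum modulo 2^64 stands in for the actual bank checksum, and the class and method names are hypothetical; the point is only that subtracting the stale value early and adding the dirty value at eviction yields the same checksum as recomputing over the whole bank.

```python
MASK = (1 << 64) - 1


def to_int(piece):
    return int.from_bytes(piece, "little")


class BankChecksum:
    """Toy additive checksum for the 8B slices one chip holds in one bank."""

    def __init__(self, slices):
        self.value = sum(to_int(p) for p in slices) & MASK

    def on_line_becomes_dirty(self, stale):
        # The cache forwards the stale value before overwriting it.
        self.value = (self.value - to_int(stale)) & MASK

    def on_writeback(self, dirty):
        # At eviction only the new value is needed; no memory read is issued.
        self.value = (self.value + to_int(dirty)) & MASK


memory = [bytes([i] * 8) for i in range(4)]   # contents of one tiny bank
checksum = BankChecksum(memory)

checksum.on_line_becomes_dirty(memory[2])     # a store hits a clean line
memory[2] = b"\xaa" * 8                       # later, the dirty line is evicted
checksum.on_writeback(memory[2])

incremental_matches = checksum.value == BankChecksum(memory).value
print(incremental_matches)                    # True
```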
3.2.4 Deploying Multi-line ECC in ARCC
In the common-case when memory is fault-free, Multi-line ECC significantly
improves memory energy efficiency by accessing only nine chips per mem-
ory request, instead of 18 or 36 chips as do existing server memory systems
[27, 13], while requiring the same memory redundancy as existing memory
systems. The downside is that when memory becomes faulty, error correction
performance can be cripplingly slow due to needing to read an entire memory
bank worth of data. However, an adaptive memory architecture (e.g., ARCC
described in the previous chapter) that adaptively reconfigures memory error correction
can significantly benefit from Multi-line ECC’s energy efficiency when
memory is fault-free while avoiding Multi-line ECC’s performance overhead
when memory is faulty by switching to a higher performing error correction.
Recall that ARCC upgrades memory error correction strength by com-
bining two logically adjacent 64B cachelines, each with 8B redundancy, into
a single 128B cacheline with 16B redundancy. Upgrading memory error
correction strength when deploying Multi-line ECC in ARCC requires two
minor adjustments. First, because the bank checksums are computed at the
granularity of a memory bank, ARCC must upgrade an entire memory bank
at a time, instead of upgrading a single memory page at a time. Upgrading
an entire memory bank at a time also reduces the number of bits needed
to record which pages have been upgraded from one bit per page down to
one bit per memory bank; this enables all these history bits to fit within the
memory controller at small area overhead (e.g., 2Kb).
Figure 3.4: Upgraded cacheline when applying ARCC to Multi-line ECC.

The second adjustment is what to store using the 16B redundancy per
upgraded cacheline. The first 8B out of the 16B redundancy is used as spare
bits to permanently “correct”
the data values from the faulty bank by serving as the new location for the
faulty bank’s data; this functionally eliminates the faulty bank. The second
8B out of the 16B redundancy continues to store RS code bits, with the
slight difference that the 8B RS code bits are now computed from the entire
128B data from the 16 fault-free banks (i.e., 15 original data banks + 1 spare
bank), instead of from 64B of data. When reading an
upgraded cacheline, the 8B RS code bits in the upgraded cacheline will not
report errors, as they are computed using data from the remaining fault-free
banks; this allows the memory request to complete without invoking the
expensive error correction. In addition, the 8B RS code bits per upgraded
cacheline enable correction against a future memory-chip-level failure in the
same way as when memory is fault-free (see Section 3.2.2).
3.3 Methodology
We compare against two commercial server memory architectures, Chipkill-correct
[27], which tolerates a single chip failure, and Chipkill-correct+Sparing
[13], which tolerates multiple chip failures. We also compare against Non-
ECC memory, which does not provide any error correction. Table 3.1 sum-
marizes the evaluated memory configurations; all have equal usable memory
capacity and bandwidth.

Table 3.1: Evaluated Memory Configurations
Configuration             Chip Type  Chips/rank           Redundancy  Channel Configuration
Non-ECC Memory            X8         8                    0%          4 chan, 16 ranks/chan
Proposal                  X8         9 (8 data+1 ECC)     12.5%       4 chan, 16 ranks/chan
Chipkill-correct          X4         18 (16 data+2 ECC)   12.5%       4 chan, 8 ranks/chan
Chipkill-correct+Sparing  X4         36 (32 data+4 ECC)   12.5%       2 chan, 8 ranks/chan

Table 3.2: Processor Configuration
Core                 16 cores, 3GHz, 4-issue OOO, 168 ROB entries, 64B cache line size
L1 d-cache, i-cache  2-way, 64kB, 1 cycle
Private L2 cache     4-way, 256KB, 4 cycles
Shared LLC           32-way, 32MB, 14 cycles
Memory Controller    Degree two stride prefetcher, 512 buffer entries/channel

Table 3.3: Composition of Mixed Workloads
mixA  4 omnetpp, 4 mcf, 4 wrf, 4T ocean_cp
mixB  4 bwaves, 4 cactusADM, 4 wrf, 4T ocean_cp
mixC  4 omnetpp, 4 mcf, 4 astar, 4T radix
mixD  4 mcf, 4 GemsFDTD, 4T barnes, 4T radiosity
mixE  4 cactusADM, 4 bwaves, 4 sjeng, 4T fft
mixF  4 mcf, 4 omnetpp, 4 astar, 4T fft

When modeling Chipkill-correct+Sparing, which
accesses 128B data per memory request due to requiring 32 data chips per
rank, we increase LLC cacheline size from 64B to 128B. We use DRAMSim2
[21] to model memory power and performance. We assume the FR-FCFS
scheduling policy, and the close-page row policy, which closes a row when it
has no request in the command queue. We prioritize reads over writes and
interleave adjacent memory pages across banks, ranks, and then channels.
We use the timing and power of 8Gb 1600MT/s DDR3 chips [28].
We use GEM5 [23] to simulate a 16-core x86 processor (see Table 3.2).
We evaluate 16 threads per workload, seven single-application NASBench
[29] workloads and seven multi-programmed workloads (see Table 3.3 for
composition); only native and reference inputs are used. The workloads’
memory footprints range from 10GB to 35GB and are 17GB on average.
We fast forward each workload until all multi-threaded application(s) have
initialized and then by another 20 simulated seconds. We warm up the caches
by 20 simulated milliseconds and then evaluate the next 20ms via cycle-
accurate simulation. As workloads contain multi-threaded applications, we
measure throughput not by total instructions, but by FLOPs for workloads
with only FP benchmarks and otherwise by instructions that access LLC.
3.4 Results
Figure 3.5 shows the overheads in memory energy per instruction (EPI) of
the proposal (i.e., Multi-line ECC + ARCC) and the two baselines compared
to non-ECC memory.























Figure 3.5: Memory energy overhead compared to non-ECC memory.
Compared to the non-ECC memory system, which accesses eight chips per
memory request, the proposal incurs 12.5% memory energy overheads due
to adding one ECC chip per eight data chips. On the other hand, Chipkill-correct
and Chipkill-correct+Sparing incur 50% and 90% memory energy
overheads, respectively, due to accessing over 2X and 4X as many memory
chips per request. The proposal reduces memory energy
by 25% and 41% compared to Chipkill-correct and Chipkill-correct+Sparing,
respectively.
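The relative savings follow directly from the overheads over non-ECC memory quoted above. The sketch below makes the step explicit; the function name is hypothetical and the inputs are the rounded percentages from the text.

```python
def energy_reduction(proposal_overhead, baseline_overhead):
    """Fractional energy saved by the proposal relative to a baseline,
    with both overheads expressed relative to non-ECC memory."""
    return 1 - (1 + proposal_overhead) / (1 + baseline_overhead)


for name, overhead in (("Chipkill-correct", 0.50),
                       ("Chipkill-correct+Sparing", 0.90)):
    print(name, f"{energy_reduction(0.125, overhead):.0%}")
# Chipkill-correct 25%
# Chipkill-correct+Sparing 41%
```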
To better understand the memory energy overheads, Figure 3.6 shows the
breakdown in memory power consumption between background and active
power normalized to the total (background+active) power of non-ECC mem-
ory. Figure 3.6 shows that the proposal reduces both active and background
memory power compared to the baselines. The proposal reduces active memory
power because each memory request accesses fewer chips than under the
baselines.

Figure 3.6: Memory power normalized to non-ECC memory. “CC” stands for Chipkill-correct.

Figure 3.7: Performance normalized to non-ECC memory.

The proposal also reduces background memory power because during periods of low memory
utilization, when memory is often put into low power sleep modes, a memory
request under the proposal wakes up fewer chips than under the baselines.
Figure 3.7 shows the system performance of the proposal and the baselines
compared to non-ECC memory. While the average performance across all
workloads is nearly identical for the three memory systems, the relative
performance of individual workloads can vary significantly because
Chipkill-correct+Sparing accesses 128B of data per memory request, which is a form of
prefetching. For workloads with high spatial locality, prefetching improves
performance; however, for workloads with poor spatial locality, prefetching
reduces performance due to wasting memory bandwidth. As such, the performance
of Chipkill-correct+Sparing relative to the proposal and Chipkill-correct
can vary depending on the workload.
The proposal requires accessing twice as many chips when accessing memory
banks with faults than when accessing fault-free memory banks (see Section
3.2.4). For applications with poor spatial locality, accessing twice as
many chips, which means accessing 128B of data per memory request, can
reduce performance due to wasting memory bandwidth. To quantify this
performance overhead, we evaluate a pessimistic scenario where all memory
banks are faulty and, therefore, all memory requests must access 128B.
In addition, we add a 15-cycle latency penalty to each read request for performing
error detection at 128B, instead of 64B, data granularity. Finally,
recall that accessing more data per memory request can improve the perfor-
mance of applications with high spatial locality; as such, we compare against
an oracular non-ECC memory baseline that uses stride prefetching only for
workloads whose performance improves with prefetching. Figure 3.8 shows
the performance of the proposal compared to the oracular non-ECC memory
baseline.

Figure 3.8: Performance overhead of proposal when all memory banks are
faulty vs. fault-free non-ECC memory.

The performance overhead is only 4%.
An advantage of the baselines with dedicated error localization code bits
in each cacheline is that they can tolerate two faults occurring in two chips
simultaneously as long as each cacheline is affected by at most one fault.
However, the proposal can only tolerate two different chips developing faults
simultaneously as long as each memory bank is affected by at most one fault
because each memory bank of cachelines share the same error localization
code bits. As such, the proposal provides slightly lower error correction
coverage than the baselines when multiple chips develop faults at the
same time; however, if the second fault occurs after the first fault has been
discovered (e.g., via a memory scrub), the proposal can still tolerate the
second fault because the proposal can fix the first fault (e.g., either by retiring
the faulty memory page or upgrading a faulty memory bank, see Section
3.2.4) before the second fault occurs. We calculate the expected downtime
per year due to the above loss in correction coverage in a large memory
system with 3000 chips by assuming a memory scrub rate of once per day, a
server downtime of 24 hours when an uncorrectable memory error occurs, and
the memory fault pattern and rate in [30]. We find the downtime overhead to
be 0.00006 minutes per year. For comparison, the downtime target for
mission-critical servers is 12 minutes per year [31].
3.5 Conclusion
Chapter 2 explores how to design an efficient adaptive memory architecture
that reconfigures correction strength and memory power consumption
as faults accumulate in memory. To build upon the previous chapter, this
chapter explores how to redesign memory error correction for adaptive memory
architectures to maximize memory energy savings. By observing that
error correction is only performed when a fault is present, we propose
aggressively trading off error correction performance for improved memory energy
efficiency in the common case when memory is fault-free and thus does not
perform correction. Compared to a typical server memory system that allocates
dedicated error localization code bits to every cacheline to enable
high-performance on-the-fly error correction with each read request, the proposed
Multi-line ECC reduces the memory redundancy requirement and, therefore,
memory power consumption, by sharing error localization code bits across
a vast number of cachelines at the cost of needing to fetch millions of cachelines
to perform error correction. Multi-line ECC reduces memory energy by 25% to
41% compared to baseline memory systems with the same memory redundancy
and correction strength (e.g., Chipkill-correct or Chipkill-correct+Sparing).
Multi-line ECC greatly benefits an adaptive memory architecture, which can
enjoy the energy savings of Multi-line ECC in the common case when memory is
fault-free and then adaptively reconfigure to power-hungry high-performance
memory error correction when faults occur.
CHAPTER 4
VERY LARGE ECC WORD (VLEW)
ARCHITECTURE FOR HIGH-DENSITY
NVRAM-BASED SERVER MAIN MEMORY
This chapter looks at the problem of how to design common-case optimized
server main memory when error rates are high. Because high error rate
requires frequent correction, the architecture in Chapter 3 is inadequate as it
relies on error correction being infrequent in present-day DRAM. Emerging
high-density memories, such as NVRAMs, have very high bit error rates.
Tolerating high BER at low memory redundancy requires the use of very
large error correcting code (ECC) words, similar to the ECC words used to
tolerate bit errors in storage systems. However, very large ECC words (e.g.,
> 30X larger than ECC words typically used for DRAM) are very power
hungry when used in main memory because they require accessing more (e.g.,
> 30X) data to detect/correct errors. To build efficient server main memory
with very large ECC words, the key observation is that most cachelines are
not affected by any chip-level faults during typical system lifetime. The
proposed very large ECC word (VLEW) Server Memory architecture reuses
the redundant memory budgeted for chip failures to opportunistically correct
bit errors before memory chips fail to speed up common-case memory accesses
for high-density NVRAMs with high BER.
4.1 Introduction
Server memory capacity requirements have been rapidly increasing due to
big data, in-memory computing, and aggressive server virtualization, as can
be seen from the increasing per-processor memory size in memory-optimized
cloud servers (see Figure 4.1). Aggressive density scaling is needed to satisfy
future memory demands. Unfortunately, high-density memories often suffer
from high bit error rates (BER) [32, 33, 34, 35]. Servers today already pay






























Figure 4.1: Growing memory sizes in the cloud.
high-density memories will require more redundancy because they will need
to handle much higher BER on top of handling chip-level faults. This is
a severe problem for many emerging non-volatile random access memories
(NVRAMs) whose bit errors require strong online error correction, which
can require up to 75% redundancy when applied to random access
main memory (see Section 4.3).
It is well known in coding theory that a larger ECC word (i.e., a dataword
+ its code bits) requires less redundancy than a smaller ECC word to meet
the same acceptable uncorrectable error rate for the same bit error rate [36].
Storage systems, which also need to handle storage device failures and high
BER, use large ECC words for bit errors to obtain low total redundancy
(e.g., 16% to 25%, see Section 4.3). However, large ECC words cannot be
used easily in all contexts. Storage devices are block-addressable devices with
large access sizes, which naturally benefit from large ECC words. Memory
chips, however, are byte-addressable devices with small access sizes for good
random accessibility, making usage of large ECC words difficult. Naively
applying very large ECC words, like those in storage systems, to random
access main memory requires increasing access size by more than 30x (see
Section 4.3).
This chapter explores how to architect efficient server main memory with
very large ECC words that span many (e.g., KBs worth of) cachelines to
minimize the redundancy requirement to handle bit errors. When BER is
high, many (e.g., 10% of) accesses require correction, which causes high overheads
when fetching very large ECC words for every correction. We observe
that, while server main memory needs to handle chip failures due to the high
probability that at least one of its many chips fails during system lifetime, most
chips remain fault-free; also, redundant memory for handling NVRAM chip
failures can be as efficient as those for DRAM chip failures since BER is
orthogonal to chip fault rates as the former affects individual cells while the
latter affects chip-level logic/IO. As such, for memory regions yet unaffected
by chip faults, VLEW accelerates bit error correction by using redundant
memory budgeted for chip failure protection to opportunistically and selec-
tively correct data with few errors; chip failure protection is maintained since
this redundant memory is only reused to speed up, but not ensure, bit error
correction. Opportunistic correction enables practically the same read size
as random access main memory today. However, writes still transfer 3.5x
data (down from > 30x) to recompute code bits of large ECC words.
One intuitive approach to reduce the extra data movement for writes is
to embed in memory the computation of very large ECC words’ code bits,
which is a function of both new and old data. The challenge of computing
code bits in memory is that old memory data needs correction due to the high
BER; correcting old data in-memory is expensive due to embedding expensive
(see Section 4.3) correction logic of very large ECC words in-memory, while
correcting old data in the processor is also expensive due to reading old data
from and sending corrected old data back to memory. VLEW efficiently
computes code bits in-memory with minimal bandwidth overhead and no
correction in-memory by re-designing low-level caches and data encoding of
memory writes for NVRAM main memory; VLEW caches at low overhead
the old memory data of LLC cachelines when they first become dirty and
modifies writes to send bitwise sum of old and new data from processor to
memory. Overall, VLEW incurs 5.5% bandwidth overhead vs. today’s main
memory and requires only 20% total redundancy.
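The idea of sending the bitwise sum of old and new data can be illustrated with a toy linear code. The sketch below is an assumption-laden stand-in: a simple XOR parity over the words of one ECC word replaces the actual (much stronger) VLEW code, "bitwise sum" is read as XOR, and all names are hypothetical. It only shows the linearity property that lets code bits be updated from the delta alone.

```python
from functools import reduce


def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))


def parity(words):
    """Toy linear code: bitwise XOR of all words covered by one ECC word."""
    return reduce(xor_bytes, words)


words = [bytes([i] * 8) for i in range(1, 5)]   # data covered by one ECC word
code = parity(words)

old, new = words[1], b"\x5a" * 8
delta = xor_bytes(old, new)            # what the processor would send
updated_code = xor_bytes(code, delta)  # no need to reread the other words

words[1] = new
delta_update_ok = updated_code == parity(words)
print(delta_update_ok)                 # True
```

Because the update depends only on the delta, the memory side never needs a corrected copy of the old data to keep the code bits consistent in this toy model.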
We make the following contributions in this chapter:
• We are the first to explore NVRAM-based server main memory that tolerates
chip-level faults and high BER. We find that while tolerating many
bit errors at low redundancy requires very large ECC words, large ECC
words incur high bandwidth overheads.
• We are the first to explore server memory architectures with very large
ECC words, like those in storage systems, that span many (e.g., KBs
worth of) cachelines. Our VLEW server memory architecture accesses
memory randomly and efficiently by using redundant memory budgeted
for chip failure protection to opportunistically correct bit errors and
efficiently compute code bits in-memory.
• We show VLEW reduces memory energy by 46% and improves performance
by 25% vs. the best comparable (but still higher) redundancy
baseline based on extending prior works on DRAM.
4.2 Background
Server processors typically connect to many memory chips to gain access
to high memory capacity; they connect to memory chips through data pins
and command pins. Due to operating at high frequencies to provide high
bandwidth, data pins support a low degree of electric loading and, therefore,
can only connect to few memory chips. Since each processor data pin can
only connect to a few chips, connecting to many memory chips requires that
each memory chip only connects to few data pins; as such, memory chips
typically have few data pins, such as four and eight data pins in X4 and X8
memory chips, respectively. X4 chips typically output 4B per access, while
wider X8 chips output 8B per access. We refer to chips with 4B and 8B
access sizes as narrow and wide chips, respectively. Chips that connect to
the same command pins collectively form a memory channel; these chips are
placed on PCB board(s), called DIMMs, in a multiplexed fashion where each
processor pin connects to every DIMM in the channel.
To access memory, a group or rank of chips in each memory channel are ac-
cessed in lockstep. Accessing a rank, instead of a chip, per request minimizes
latency since transmitting a 64B cacheline via one chip takes N times as long
compared to N chips. It also reduces the complexity of the processor mem-
ory controller (MC) compared to individually accessing the 100’s to 1000’s of
memory chips per processor. Finally, server memory systems need to tolerate
memory chip-level faults (such as row, bank, and multi-bank faults within a
chip) due to containing many memory chips [30, 37, 12, 38, 39, 40]; accessing
a cacheline from a rank reduces the bandwidth overhead of chip-level fault
tolerance. When accessing a cacheline from a rank, a faulty chip affects only
a part of the cacheline; as such, only a small amount of redundant information
needs to be written to memory per write request to correct the erroneous
portion of the cacheline should a memory chip fail. However, if a cacheline
is accessed from a single chip, failure in one chip affects up to the entire
cacheline; correcting entire bad cachelines requires that a large amount of
redundant data, equal in size to at least an entire line, be written to memory
per write request - a high overhead.
While server main memory today is dominated by DRAM, many NVRAMs
capable of DRAM-like latency, such as PCM, ReRAM, and STT-RAM, are
emerging to either supplement or replace DRAM in server main memory
[41, 42, 43, 44]. One motivation is that memory refresh will become a power
bottleneck for DRAM [45]; NVRAMs do not require frequent refresh and can
thus reduce static memory power. NVRAMs can also provide higher density
than DRAM for the same feature size. For example, PCM can optionally
store multiple logical bits per memory cell [46] and can, therefore, provide
2X to 3X higher density. Another orthogonal density benefit for NVRAMs is
that they enable the crossbar array structure, which is ∼50% higher density
than the 1T1C (one transistor, one capacitor) DRAM array structure by requiring
only $4F^2$ per cell [44]. Finally, NVRAMs also enable persistent memory with
denser form factor than battery-backed DRAM.
We expect NVRAMs to have similar chip structure and system-level or-
ganization as DRAM since they are cell-level technologies, not memory I/O
technologies. Doing so also maximizes reuse of existing infrastructure and
standards and, therefore, facilitates NVRAMs’ adoption. DRAM-like NVRAM
chips and DIMMs have been manufactured [47, 48, 49, 50]. Many prior works
[42, 44, 43, 51, 52, 41, 53] and industry patents [54, 55, 56] on NVRAMs also
use DRAM-like chip structure and system-level organization.
Many high-density NVRAMs suffer from high bit error rates. Recent studies
from IBM report PCM BERs ranging from $2.5 \cdot 10^{-5}$ to $2 \cdot 10^{-4}$ for >= 1/hr
refresh rate [33, 32]. Studies from Micron report BER ranging from $2 \cdot 10^{-5}$
to $1.4 \cdot 10^{-4}$ for ReRAM [34, 57, 58]. Extrapolating from numbers reported
for a 1GB array in a study from Intel [59], we calculate $8 \cdot 10^{-5}$ BER for
STT-RAM with 0.25Hz refresh rate. In comparison, current DRAM cells
suffer from several orders of magnitude lower fault rates (see 28nm DRAM
in Figure 4.2). The main reason for the high BER for high-density NVRAMs
is that maintaining low BER for high density memory is difficult. At high

























Figure 4.2: NVRAM BER, DRAM cell fault rates, and commercial TLC
Flash BER [60].
DRAM in Figure 4.2) according to a recent Samsung report [35].
4.3 Problem
When incorporating high-density NVRAM chips, server main memory needs
both memory chip-level fault tolerance - a standard feature in server main
memory [37, 12, 38, 39, 40] today that may become even more important for
NVRAMs due to their potential use as persistent memory - and tolerance
against high BER (see Figure 4.2). The bit errors in many high-density
NVRAMs are dominated by random errors that can occur in any cell and,
therefore, need strong online many-error correction in the field [33, 61, 59,
62], like the bit errors in commercial Flash storage devices [63, 64]. For
example, prior works report PCM bit errors are dominated by resistance
drift [33, 32] - a random process [65, 61] that has even been used to process
probabilistic programs in memory [66]. Similarly, prior studies on STT-
RAM and ReRAM bit errors [59, 58, 57] report retention errors as a top
source of bit errors; retention errors in these NVRAMs cause random cells
to lose values at random times [59, 67, 68, 69, 70]. In all these cases, strong
online error correction is needed. While recent works have explored how to
architect reliable server memory using high cell fault rate DRAM [71, 35],
it is a different problem from how to architect reliable server memory from
NVRAMs with high bit error rate. In DRAM, bit errors are mostly confined
in a limited set of faulty cells that can be identified and permanently fixed at
manufacture time using static techniques such as remapping [35, 72, 73]; as












































Figure 4.4: ECC words for storage bit errors [63].
single) correction in the field [35, 71]. NVRAMs, on the other hand, need
much stronger online error correction.
Extending prior works targeting high cell fault rate DRAM, such as XED
[71] and a recent Samsung DRAM proposal [35], to NVRAMs requires high
redundancy due to needing much stronger ECCs. In the context of DRAM,
XED [71] requires 1/8 + 1/8 = 25% redundancy by using 1B code bits to
protect each 8B of data within each chip with SEC (single-error correction)
and uses a parity chip to tolerate chip-level faults; the Samsung DRAM
proposal [35] is the same except each SEC word is doubled from 8B to 16B
and, thus, requires 1/16 + 1/8 = 18.75% redundancy. Applying these prior
works to NVRAMs requires replacing SEC with a strong enough ECC, which
increases the total redundancy requirement1 up to 87.5%. Most of this is for
correcting bit errors (see Figure 4.3); for example, for XED, 10^-4 BER needs
6-EC (six-error correction) with 6B code bits, which require 6B/8B = 75%
redundancy.
1 For server main memory, we target having less than one cacheline with uncorrectable
bit error per 10^15 cachelines [33] and less than one cacheline with silent data corruption
(SDC) per 10^17 cachelines. We round up ECC bits to the nearest byte to facilitate memory
implementation [35].
Storage systems often also need to tolerate both storage-device-level faults
and high BER. Storage systems commonly use strong ECC (e.g., 12 to 41
error correction, see Figure 4.4) to tolerate bit errors and use RAID to tolerate
storage-device-level faults. Storage systems require very low redundancy,
however; as an example, assuming a RAID with eight data devices and one
parity device, where each device is an SLC Flash chip with 12-EC (12-error-
correction) or an MLC Flash chip with 41-EC [63], the aggregate redundancy
requirement is only 3.9%+1/8 = 16.4% or 12.5%+1/8=25%. Storage devices
enjoy such low redundancy because they are block-addressable devices with
large access sizes and thus naturally benefit from large ECC words (e.g.,
with 512B data, see Figure 4.4), which are well known in coding theory to
reduce redundancy requirements [36]. To illustrate the relationship between
redundancy and ECC word size, Figure 4.5 shows the total redundancy to
tolerate high BER and chip-level faults when increasing the size of ECC
words for bit errors; a rank of nine chips is assumed, where the ninth chip
is a parity chip to protect against chip-level faults. When each ECC word
contains 256B instead of 8B of data, the combined redundancy requirement
drops to <= 20.3%.
Increasing ECC word size in memory by 256B/8B = 32X is challenging,
however. One approach is to increase access size of each memory request
by 34.5X (32X of which are data bits and the remainder are the code bits);
this enables data to be reliably read from memory by fetching the very large
bit-error ECC words to reliably correct bit errors for every read. However,
increasing memory access size by > 34.5X for each read request essentially
forfeits random memory access capability and, therefore, can result in very
high (e.g., > 30X) bandwidth overheads for applications with low spatial
locality. This approach also increases the amount of data transferred per
write request by 34.5X because computing the new code bits when modifying
a part of a very large ECC word requires as input the old data in memory
(i.e., the data to be overwritten); reading old data from memory requires the
same error correction as reading regular data from memory.
A second approach is to embed the correction logic of the very large bit-
error ECC words into the memory chips themselves, so that the NVRAM
memory chips can appear to be free of bit errors to the processor to maintain
the same memory request size as today’s DRAM main memory. However,

























Figure 4.5: Large ECC words reduce redundancy (series: 2E-4, 1E-4, 5E-5, and 2.5E-5 BER).
(e.g., 13-EC over 256B data). For example, some Flash chips targeting em-
bedded applications have embedded ECC correction logic [64]; compared
to raw Flash chips, these Flash chips either suffer from lower performance
(e.g., 3x [74, 75]) when embedding strong ECC or lower (e.g., 16x [64]) den-
sity when embedding weaker ECC with lower performance overheads. Flash
chips with embedded correction logic also pay high (e.g., 66%) energy overheads
[74] and increase cost per bit [76]. One reason for the high cost is
that correction logic for very large and strong ECC words needs to solve large
systems of simultaneous equations [77]. A second reason is that the memory
manufacturing process is sub-optimal for implementing complex logic due to
having low transistor switching speed, high transistor switching energy, and
few (e.g., three [78]) metal layers. A third reason is that compared to using
a centralized logic controller chip for correction, replicating correction logic
in every memory chip multiplies the overheads.
For NVRAM chips, the overheads are even higher; here, performing correction
in memory also requires internally fetching large ECC words that are >= 34.5x
the access size of typical memory chips. To quantify this overhead, we use
NVSim [79] to model the read energy for a bank within a PCM chip; when
increasing the internal fetch size by 34.5x, the internal fetch read energy in-
creases by 14x and becomes 5x that of the total read energy (internal fetch
and external transfer energy) of a DDR3 DRAM chip [28]. Note that while
prior works on DRAM embed the correction logic of small SEC words in
memory [71, 35], SEC is a simple special case that does not require solving
systems of equations [77].
In short, efficient server main memory architectures with very large ECC







































Figure 4.6: Overview of VLEW architecture.
4.4 VLEW Memory Architecture
We propose Very Large ECC Word (VLEW) server random access main
memory architecture for high-density NVRAMs. VLEW uses very large ECC
words, each containing 256B of data in each memory chip, to tolerate bit er-
rors and, therefore, requires only 20% total memory redundancy, as described
in Section 4.3; Section 4.4.1 describes memory layout in VLEW. To preserve
small random memory accesses, VLEW opportunistically uses the redundant
memory budgeted for chip failure protection, which can remain the same (i.e.,
small and efficient) as for DRAM chips, to selectively correct cachelines with
few bit errors; Section 4.4.2 describes opportunistic correction. While oppor-
tunistic correction enables practically the same read size as today’s random
access main memory, writes still need to transfer 3.5x (down from > 30x) as
much data; writes remain expensive due to needing extra data movement to
compute code bits of very large ECC words. VLEW leverages processing-in-
memory (PIM) to reduce this data movement by computing these code bits
in memory at low cost without correction in memory; Section 4.4.3 describes
PIM for writes. Figure 4.6 overviews VLEW.
4.4.1 Memory Layout
The distinguishing feature of VLEW server memory architecture is applying
very large ECC words, like those in storage systems, to tolerate high BER
at low redundancy requirement for server random access main memory. The













Figure 4.7: Memory layout for wide chips.
for bit errors in Flash [80]; BCH requires t(⌈log2(k)⌉ + 1) code bits to correct t
bad bits when protecting k bits of data. In VLEW, the very large BCH ECC
words contain 256B of data (i.e., k = 256 · 8) and correct up to 13 errors
to tolerate up to 2 · 10^-4 BER at only 7.8% redundancy due to their large
size. Each very large ECC word resides in one chip; similar to [35, 71] and
commercial Flash chips [81, 82], we assume each row within a memory chip
contains dedicated storage for ECC bits for bit errors, as shown in Figure
4.7. As such, each very large ECC word in ranks with wide chips and narrow
chips spans 256/8 = 32 and 256/4 = 64 cachelines, respectively.
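The redundancy figures quoted in this chapter follow from the BCH code-bit formula above together with the byte rounding of footnote 1. The following is a minimal sketch of that arithmetic, not part of the evaluated design; the function names are hypothetical and chosen for illustration only.

```python
def bch_code_bytes(data_bytes: int, t: int) -> int:
    """Code bytes of a t-error-correcting BCH word, rounded up to whole bytes."""
    k = data_bytes * 8                       # data bits protected by the word
    ceil_log2_k = (k - 1).bit_length()       # integer form of ceil(log2(k))
    code_bits = t * (ceil_log2_k + 1)        # t(ceil(log2(k)) + 1) from the text
    return -(-code_bits // 8)                # round up to the nearest byte


def bit_error_redundancy(data_bytes: int, t: int) -> float:
    """Code bytes divided by data bytes for one bit-error ECC word."""
    return bch_code_bytes(data_bytes, t) / data_bytes


PARITY_CHIP = 1 / 8                          # one parity chip per eight data chips

vlew = bit_error_redundancy(256, 13)         # 13-EC over 256B of data
small = bit_error_redundancy(8, 6)           # 6-EC over 8B of data (XED-style word)
print(f"13-EC over 256B: {bch_code_bytes(256, 13)}B code, {vlew:.1%} redundancy")
print(f"with parity chip: {vlew + PARITY_CHIP:.1%} total")
print(f"6-EC over 8B: {small:.0%} (plus parity chip: {small + PARITY_CHIP:.1%})")
```

Under these assumptions the sketch yields 20B of code bits per 256B word, which is where the 7.8% bit-error redundancy and the roughly 20.3% total come from, and 6B per 8B word for the small-word alternative.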
To tolerate memory chip-level faults, VLEW uses the Reed-Solomon (RS)
ECC, which is also used to tolerate memory chip-level faults in DRAM server
main memory [12, 83, 84]. RS ECC typically requires 2B code bits to correct
each bad byte (i.e., a group of eight adjacent bits with at least one bad bit).
If the bad bytes are known a priori, only 1B of code bits is needed per bad
byte; this is known as erasure correction. Each chip-error ECC word contains
a 64B cacheline and 8B RS code bits, which are stored in parity chips. A
rank of wide chips and a rank of narrow chips needs one and two parity
chips, respectively, to keep the same memory channel width. Note that each
cacheline belongs to orthogonal RS and BCH ECC words; if all the bits of
the same cacheline were to construct both a RS and a BCH ECC word, the
two ECC words can only correct either bit errors or chip errors (e.g., four
bad bytes from a faulty narrow chip), but not both (e.g., a bit error plus four
bad bytes). Tolerating chip-level faults requires correcting both bit and byte
errors in each cacheline because millions of cachelines contain bit errors due
to high BER, such that many cachelines will contain both bad bits and bad








































Figure 4.8: Most memory banks are not affected by any chip-level fault
over system lifetime.
Finally, the processor MC stores a set of bank checksums, one for each
bank in each memory chip, to provide enhanced error detection for chip-level
faults. Each checksum is an 8B MSet-Add-Hash checksum [26, 85].
4.4.2 Opportunistically Correct Bit Errors via Redundant
Bits Budgeted for Chip Errors
We observe that the vast majority of memory in a memory system is not
affected by any chip-level fault during typical system lifetime. Figure 4.8
shows the average fraction of memory banks (where a memory bank consists
of the same bank in every chip of a rank) that are not affected by any
chip-level fault, such as row, bank, column, chip, or multi-chip faults; the
calculation assumes eight banks/chip, nine chips/rank, eight ranks/channel,
and the chip-level fault rate and pattern from [30]. It also pessimistically assumes
all chip-level faults are permanent. Figure 4.8 shows that 98.6% of memory
banks, on average, are not affected by chip-level faults.
To exploit the above observation, we propose using each memory bank’s re-
dundant memory budgeted for chip-level faults to opportunistically correct
cachelines with few bit errors before a chip-level fault affects the memory
bank. If opportunistic correction is successful, the memory request com-
pletes without fetching the very large ECC words and, therefore, accelerates
the performance of bit error correction. Expensive bit error correction using
very large ECC words, which requires fetching 33.5X to 68X more bits from
memory, is only needed when opportunistic error correction is unsuccess-
ful. Since the redundant memory budgeted for tolerating chip-errors in each
cacheline contains 8B of RS code bits, it can correct up to four bad bytes
and, therefore, four bit errors. Figure 4.9 shows that even under a high BER
of 2 · 10^-4, the vast majority of 72B (64B data + 8B redundant data) cachelines
contain no more than four bit errors; as such, opportunistic correction
of bit errors using the redundant memory budgeted for chip failures has the
potential to eliminate fetching the very large ECC words for most memory
requests.
The challenge with opportunistic correction is that the redundant memory
budgeted for chip failure protection in each cacheline can cause a high (e.g.,
3,000,000X higher than acceptable, see Section 4.4.4) SDC rate because it often
miscorrects cachelines with >= 5 bit errors. One way to avoid miscorrection
is to use the chip-error ECC words only to detect, but not correct, bit errors;
expensive correction is not needed if no error is detected. However, a sig-
nificant fraction (e.g., 3% to 10%, see Figure 4.9) of cachelines have error(s)
when bit error rate is high; using the very large ECC words for bit errors to
correct every cacheline with error(s) can result in high memory bandwidth
overhead (e.g., 33.5 · 10% = 335% or 68 · 10% = 680% memory bandwidth
overheads).
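To give a feel for how often a 72B cacheline contains errors, the following sketch evaluates a simple binomial model. It assumes independent, uniformly distributed bit errors, which is our simplifying assumption for illustration and need not match the error model behind Figure 4.9 exactly; the function names are hypothetical.

```python
from math import comb


def prob_exactly(n_bits: int, ber: float, k: int) -> float:
    """P(exactly k of n_bits are in error), assuming independent bit errors."""
    return comb(n_bits, k) * ber**k * (1 - ber) ** (n_bits - k)


def prob_at_least(n_bits: int, ber: float, k: int) -> float:
    """P(at least k of n_bits are in error) under the same assumption."""
    return 1.0 - sum(prob_exactly(n_bits, ber, i) for i in range(k))


CACHELINE_BITS = 72 * 8  # 64B data + 8B RS code bits

for ber in (2e-4, 1e-4, 5e-5, 2.5e-5):
    any_error = prob_at_least(CACHELINE_BITS, ber, 1)
    over_four = prob_at_least(CACHELINE_BITS, ber, 5)
    print(f"BER {ber:.1e}: P(>=1 error) = {any_error:.3f}, "
          f"P(>4 errors) = {over_four:.2e}")
```

Under this model, roughly one in ten cachelines has at least one bit error at 2 · 10^-4 BER, consistent with the upper end of the range quoted above, while cachelines with more than four bit errors are rare.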
To use the redundant memory budgeted for chip failure protection to op-
portunistically yet reliably correct bit errors, one needs to know whether
there are many bit errors in a cacheline to avoid correcting a cacheline with
many errors. Unfortunately, one cannot tell a priori how many errors there
are in a received cacheline. We observe that the number of corrections made
during error correction is indicative of how many errors are in the original
received cacheline. The higher the number of corrections (i.e., the number























Figure 4.9: Number of bit errors in cachelines.
[Figure 4.10 content: opportunistic correction turns received cacheline A B C D E F G H ... into A Y Z D P F G K ...; many corrections -> likely many errors -> reject opportunistic correction.]








Figure 4.10: Reliable opportunistic correction.
are more errors in the original cacheline. As such, VLEW uses the number
of corrections as an error severity indicator and rejects opportunistic correction
(see Figure 4.10) when it makes more corrections than a small cautionary
threshold (e.g., two), even though each chip-error ECC word can
correct more (i.e., up to four) errors.
Figure 4.11 shows the complete correction procedure. A read request com-
pletes if opportunistic correction makes no more than two corrections. Oth-
erwise, VLEW performs full strength correction, which fetches the very large
bit-error ECC words to correct the bit errors and then uses the redundant
memory per cacheline budgeted for chip failure protection to correct chip
errors. The method for correcting chip errors is slightly different for narrow
and wide chips. For narrow chips, each chip-error ECC word simply corrects
up to four bad bytes. For wide chips, VLEW identifies the faulty chip to
correct the eight bytes from the faulty chip as erasures. To identify a faulty
chip, VLEW uses the chip-error ECC word to check whether any errors remain
after bit error correction; the 8B RS code bits in each chip-error ECC
word guarantee detection of all errors left in a single chip. The read completes
if the check passes; otherwise, VLEW fetches all cachelines in the memory
bank to recompute bank checksums to compare with the bank checksums in
the MC (see Section 4.4.1) to identify the bad chip by a mismatching pair of
bank checksums.
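The decision at the top of the read procedure can be sketched as follows. This is an illustrative outline rather than the controller implementation: `RsResult`, `read_cacheline`, and the two callbacks are hypothetical names, and the full-strength path (fetching the very large ECC words, BCH correction, chip-error correction, and bank checksum comparison) is abstracted behind a single callback.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

CORRECTION_THRESHOLD = 2  # cautionary threshold from the text (e.g., two)


@dataclass
class RsResult:
    data: Optional[bytes]  # decoded cacheline, or None if RS decoding failed
    corrections: int       # number of byte corrections the RS decoder made


def read_cacheline(
    rs_decode: Callable[[], RsResult],
    full_correction: Callable[[], bytes],
) -> Tuple[bytes, str]:
    """Return (data, path), where path names the correction path taken."""
    result = rs_decode()
    # Accept the opportunistic result only if few corrections were needed.
    if result.data is not None and result.corrections <= CORRECTION_THRESHOLD:
        return result.data, "opportunistic"
    # Otherwise fall back to full-strength correction (hypothetical callback).
    return full_correction(), "full"


clean = read_cacheline(lambda: RsResult(b"ok", 1), lambda: b"fixed")
risky = read_cacheline(lambda: RsResult(b"??", 4), lambda: b"fixed")
print(clean, risky)
```

The second call illustrates the key point: a decode that needed four corrections is rejected even though the RS word could correct four bad bytes, because many corrections suggest that the original cacheline may have held more errors than the code can handle.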
Even under 2 · 10^-4 BER, only 1.8 · 10^-4 of read requests require expensive
full error correction that fetches 32 or 64 cachelines; this reduces the bandwidth
overhead due to full error correction to 0.6% and 1.2% for ranks with
wide and narrow chips, respectively. However, when a permanent chip-level
fault occurs, expensive full error correction may be frequently performed.
One way to avoid this repeating overhead after chip-level faults occur is
to retire memory affected by the fault, which is done in current systems



















Figure 4.11: Action flow for memory reads.
rectable errors. Another way is to permanently remap the content of the
faulty bank to a redundant bank in the memory bank and then re-encode all
the very large ECC words in the memory bank such that the 256B data in
each large ECC word consists of four adjacent 64B cachelines; this enables
each memory request to benefit from full bit error correction by accessing
only four, instead of 32, cachelines and fetching the code bits via another
burst-chop access.
4.4.3 Computing Code Bits in Memory
While opportunistic correction enables practically the same read size as to-
day’s random access main memory (e.g., down to only 0.6% − 1.2% larger,
on average), write requests still need to transfer 3.5X as much data for wide
chips (down from > 30X) and 6X for narrow chips (down from > 30X).
This is because each write requires recomputing the code bits of the very
large ECC words, which needs the old memory data as input as only part
of the ECC word is modified, and the new code bits need to be written to
memory; since these code bits are 20B/8B = 2.5X or 20B/4B = 5X as much
as the data bits modified per write request per chip, writing code bits incurs
2.5X/5X bandwidth overheads.
To reduce the extra data movement (see Figure 4.12) for write requests,


































Figure 4.13: A typical write in VLEW.
naive approach to compute code bits in-memory is to embed the complete
error correction logic in memory; while this works well in the context of prior
works that only need weak and small SEC words to tolerate bit errors [71, 35],
it incurs high energy, performance, and/or density overheads for very large
and strong ECC words (see Section 4.3). Instead, VLEW only embeds ECC
encoding, not correction, logic in memory. Since commonly used codes, such
as the BCH used for bit errors, are linear codes, encoding simply calculates
the right hand side of a system of linear equations where all left hand side
values are known [25], which is simple (see Section 4.4.4 for quantitative analysis);
this is far simpler than correction, which solves this very large system
of equations when many variables are unknown. Computing code bits in-
memory eliminates the 2.5X or 5X bandwidth overheads for sending them
from processor to memory, for wide and narrow chips, respectively.
The challenge with embedding encoding but not correction logic in mem-
ory is that encoding requires the old data to be overwritten; the old data in
memory often have bit errors and, therefore, require correction since com-
puting code bits using bad inputs directly causes SDC. Correcting old data
can require two overhead data movements - one to read the old data from
memory and one to send corrected old data back to memory - which together
cause 2X bandwidth overheads. The next two subsections describe two ideas
to reduce these extra data movements, where each reduces one of the two
extra movements, by rethinking data encoding of memory writes and design
of low-level caches for NVRAM main memory. Figure 4.13 shows a write
when combining both.
Send Old Data to Memory for Free by Writing Bitwise Sums Instead of
New Data
Instead of explicitly sending corrected old data from processor to memory,
VLEW embeds old data in new data by modifying each write request to
send the bitwise sum of new and old data to memory; this is the first work to
explore writing bitwise sum of new and old data from processor to memory,
to the best of our knowledge. When receiving a bitwise sum, each memory
chip internally calculates new data by subtracting (in a bitwise manner)
the old data2 it contains from the bitwise sum. Because bitwise sums and
differences are the same and linear codes such as BCH are constructed from
linear equations using bitwise arithmetic [25], each chip also uses the same
bitwise sum3 as the complete input for recomputing code bits in memory.
Specifically, each memory chip updates its data for each write request
by internally fetching x′ (old data) from an open row, XORing x′ with the
received bitwise sum, and writing the calculated x (new data) back to the
open row4, as shown in Figure 4.14. To update code bits, each chip calculates
f(x ⊕ x′), where f is the BCH encoding function, and uses the result to
update a code bits register ; because many values in the same row are part
of the same very large ECC word, the code bits register coalesces multiple
code bit updates to the same row into one. Since each very large ECC
word has 256B data, R/256 code bits registers are needed for each bank,
where R is row size in bytes. Later, when receiving a request to close a row,
each chip first updates the code bits in the open row using the bank’s dirty
code bits registers before closing the row. All changes incur pre-deterministic
2 Old data's errors can propagate to calculated new data. The MC may optionally
write new data to memory to scrub these errors by recording, via one bit per LLC
cacheline, which cachelines had a certain number of errors when fetched.
3 While bitwise sums can have errors due to chip-level faults (e.g., I/O pin fault),
chip-error ECC corrects these errors.
4 The ability to internally read-modify-write data is also in current and future DRAM
chips [91, 35].















Figure 4.14: In-memory actions for writes.
latencies and thus can be incorporated into the current deterministic memory
interfaces.
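The in-memory update relies only on the linearity of the code: f(x ⊕ x′) = f(x) ⊕ f(x′). The following sketch illustrates this with a small stand-in linear code (a polynomial remainder over GF(2)) rather than the 13-EC BCH code; the generator polynomial, the word size, and the `Chip` class are hypothetical and chosen only to keep the example short. For simplicity, the bitwise sum here spans the whole ECC word, with zeros where data is unchanged.

```python
GEN_POLY = 0x11D   # hypothetical degree-8 generator polynomial (illustration only)
CODE_BITS = 8


def encode(data: int, data_bits: int) -> int:
    """Code bits = (data * x^CODE_BITS) mod GEN_POLY over GF(2); linear in data."""
    reg = data << CODE_BITS
    for bit in range(data_bits + CODE_BITS - 1, CODE_BITS - 1, -1):
        if (reg >> bit) & 1:
            reg ^= GEN_POLY << (bit - CODE_BITS)
    return reg


class Chip:
    """Toy model of one ECC word in an open row plus its code bits register."""

    def __init__(self, data: int, data_bits: int):
        self.data_bits = data_bits
        self.data = data                       # x' held in the open row
        self.code = encode(data, data_bits)    # code bits stored in the row
        self.code_reg = 0                      # pending (coalesced) code update

    def write_sum(self, bitwise_sum: int) -> None:
        self.data ^= bitwise_sum                               # x = x' XOR sum
        self.code_reg ^= encode(bitwise_sum, self.data_bits)   # f(x XOR x')

    def close_row(self) -> None:
        self.code ^= self.code_reg             # apply coalesced update once
        self.code_reg = 0


chip = Chip(0xDEADBEEF, 32)
chip.write_sum(0xDEADBEEF ^ 0x12345678)        # processor sends new XOR old
chip.write_sum(0x12345678 ^ 0x0BADF00D)        # second write to the same word
chip.close_row()
print(hex(chip.data), chip.code == encode(chip.data, 32))
```

Note that the chip never receives the new data or the old data in the clear, and never decodes anything; it only XORs and encodes, and the two code updates are coalesced in the register before the row closes.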
Cache Old Memory Data Instead of Re-fetching Them from Memory
To reduce the overhead data movement of reading old data from memory to
compute the bitwise sum, we observe that the old data can also be obtained
from LLC because a cacheline must first be brought into LLC from memory
before the cacheline can become dirty; this requires sending the current data
in a LLC cacheline to MC when the cacheline receives a write-back from an
upper-level cache while still holding the same data as memory5. The MC
caches the old data until the dirty cacheline is written from LLC to MC;
this is the first work to explore caching old data when a cacheline becomes
dirty, to the best of our knowledge. Since caching old data is for improving
write performance, MC caches the received old data in a write buffer cache
(WBC ), which is often used to mitigate the long write latencies of NVRAMs
by improving write spatial locality [92, 93, 94]. We consider implementing
WBC in MC as a small SRAM cache. When LLC writes back/through a
cacheline to MC, MC searches WBC for its old data. If it is found, MC
replaces the old data found in WBC with a bitwise sum of the old and new
data; otherwise, MC inserts the new data into WBC. We need to add two
5 In a non-inclusive hierarchy where LLC may not contain the old data of dirty L1
cachelines, one can instead send the old data from L1 cache to MC when a valid L1
cacheline receives a write while still holding the same data as memory.
bits to each WBC entry to record what type of data (old data, new data, or
bitwise sum) the entry holds. When WBC evicts a bitwise sum, MC directly
writes it to memory; when WBC evicts new data, however, MC needs to first
read old data from memory before it can calculate and write bitwise sum to
memory.
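The three entry types and their transitions can be sketched as a small state machine. This is a simplified model under stated assumptions: capacity, associativity, and replacement are omitted, cacheline values are modeled as integers, and the class and method names are hypothetical. The handling of a write-back that hits an entry already holding a bitwise sum is our assumption; the text does not specify it.

```python
from enum import Enum
from typing import Callable, Dict, Optional, Tuple


class Kind(Enum):
    OLD = "old data"
    NEW = "new data"
    SUM = "bitwise sum"


class WriteBufferCache:
    def __init__(self) -> None:
        self.entries: Dict[int, Tuple[Kind, int]] = {}  # address -> (type, value)

    def insert_old(self, addr: int, old: int) -> None:
        """LLC forwards the clean copy when the cacheline is about to become dirty."""
        self.entries[addr] = (Kind.OLD, old)

    def write_back(self, addr: int, new: int) -> None:
        """LLC writes back/through a dirty cacheline to the MC."""
        entry = self.entries.get(addr)
        if entry is not None and entry[0] is Kind.OLD:
            self.entries[addr] = (Kind.SUM, entry[1] ^ new)  # old XOR new
        else:
            # Miss (or, by assumption, a hit on a non-OLD entry): keep new data.
            self.entries[addr] = (Kind.NEW, new)

    def evict(self, addr: int,
              read_old_from_memory: Callable[[int], int]) -> Optional[int]:
        """Return the bitwise sum to write to memory, or None if nothing to write."""
        kind, value = self.entries.pop(addr)
        if kind is Kind.SUM:
            return value                                  # no extra memory read
        if kind is Kind.NEW:
            return read_old_from_memory(addr) ^ value     # extra read needed
        return None                                       # only old data was cached


wbc = WriteBufferCache()
wbc.insert_old(0x40, 0b1010)
wbc.write_back(0x40, 0b0110)
wbc.write_back(0x80, 0b0011)                              # old data never cached
print(wbc.evict(0x40, lambda a: 0), wbc.evict(0x80, lambda a: 0b0101))
```

The two evictions show the benefit: the first writes the bitwise sum directly, while the second must first read the old data from memory.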
Caching the old data of every dirty LLC cacheline would require a large
WBC, however, which is expensive. To reduce the number of dirty LLC
cachelines and, therefore, the amount of old data to be cached in WBC,
we observe that the probability that a LLC cacheline is rewritten decreases
rapidly as its recency reduces [95]; as such, dirty LLC cachelines can be
written through (i.e., a dirty LLC cacheline’s data is sent to MC and then
the cacheline is made clean in the LLC) after their recency is older than
a threshold. To cap the bandwidth overhead of writing through cachelines
due to a cacheline becoming dirty again after write-through, one can add
one bit per LLC cacheline to record whether the cacheline has been written
through before to ensure writing-through each cacheline at most once. One
can further cap this bandwidth overhead to a tunable fraction of total write
bandwidth by sampling how often dirty LLC cachelines are rewritten with
respect to their recency values and then dynamically adjusting the recency
threshold for writing-through LLC cachelines. When this threshold is large,
LLC halts sending old data to WBC to avoid unnecessary thrashing in WBC.
4.4.4 Overheads
The SDC formula we use to calculate SDC rate due to opportunistically
correcting bit errors is the probability of having at least a threshold number
of errors required for miscorrection times the probability that a noncodeword
with this many errors is mistakenly “corrected” into a valid but unintended
ECC word; since each RS ECC word corrects up to four errors, the minimum
error threshold (e) is five. For a given BER, the first term of the SDC formula
can be obtained via simple combinatorial analysis. The second term is the
probability that a noncodeword with e errors happens to be <= t bytes
away from a valid ECC word, where t is how many errors to correct via the
ECC [25]; when each ECC word has r code bytes and n total bytes, this
























Figure 4.15: BCH encoder over 2048-bit data.
each ECC word (i.e., C(n, t) · 256^t) times the total number of possible ECC
words (i.e., 256^(n-r)) divided by the total number of possible ECC words +
noncodewords (i.e., 256^n). Using e = 5, t = 4, and BER = 2 · 10^-4 as inputs
to the SDC formula, SDC rate is 1.3 · 10^-7 * 2.4 · 10^-4 = 3.2 · 10^-11. When
only accepting correction results with <= 2 corrections, however, t = 2 and
e = 7; here, e = 9 - 2 = 7 because a noncodeword must be <= 2 bytes away
from a valid but unintended ECC word U to be miscorrected into U and the
intended ECC word is >= 9 bytes from any other valid ECC word due to
having 8B RS code bits. Using e = 7, t = 2, and BER = 2 · 10^-4 as inputs
to the SDC formula, SDC rate is 5.3 · 10^-11 * 4.1 · 10^-11 = 2.2 · 10^-21.
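As a sanity check, the second term of the SDC formula can be evaluated directly from the counting argument as it is stated above. The sketch below is a literal reading of that statement with a hypothetical function name. It reproduces the 2.4 · 10^-4 term for t = 4; the t = 2 term quoted in the text may depend on details of the formula that are not recoverable here, so it is not recomputed.

```python
from math import comb


def miscorrection_probability(n: int, r: int, t: int) -> float:
    """C(n, t) * 256^t * 256^(n - r) / 256^n for n total bytes and r code bytes."""
    words_near_each_codeword = comb(n, t) * 256**t
    valid_codewords = 256 ** (n - r)
    all_words = 256**n
    return words_near_each_codeword * valid_codewords / all_words


# 72B chip-error ECC word: 64B data + 8B RS code bits, correcting up to four bytes.
p_mis = miscorrection_probability(n=72, r=8, t=4)

# The first term (1.3e-7) is taken from the text, not recomputed here.
sdc_rate = 1.3e-7 * p_mis
print(f"second term = {p_mis:.2e}, SDC rate = {sdc_rate:.1e}")
```

The product comes out close to the 3.2 · 10^-11 quoted above; the small difference is within the rounding of the two factors.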
BCH code bits are computed in parallel via one XOR tree per code bit,
which enables simple memory-array-like layout via only two metal layers in
memory, as shown in Figure 4.15. Using CACTI [96], we calculate total area
to be 0.1 mm^2 when assuming only semi-global metal wires and that each
logic gate is equal to two SRAM cells in size (similar to [97]). We estimate
latency as 1.6ns using LSTP bulk transistor latency [98].
Under 2 · 10^-4 BER, 1/200 and 1.8/10000 of reads need multi-error RS
correction and BCH correction, respectively. We estimate latency of multi-error
RS correction to be 86ns and area to be 0.03 mm^2 based on a 3-EC (127,121)
RS ECC decoder synthesized in 32nm logic process [99]. We estimate latency
of the 13-EC BCH ECC to be 200ns and area to be 0.05 mm^2 based on a 32-EC
BCH decoder synthesized in 130nm logic process [100] and adjusting for
strength and size differences.
Table 4.1: Evaluated Memory Configurations
Configuration                  Chips/rank              Redundancy       Channel Config.
Ideal Baseline with X8 Chips   9 (8 data + 1 ECC)      6.25% + 12.5%    2 channels, 4 ranks/channel
Ideal Baseline with X4 Chips   18 (16 data + 2 ECC)    6.25% + 12.5%    2 channels, 2 ranks/channel
VLEW with X8 Chips             9 (8 data + 1 ECC)      7.81% + 12.5%    2 channels, 4 ranks/channel
VLEW with X4 Chips             18 (16 data + 2 ECC)    7.81% + 12.5%    2 channels, 2 ranks/channel
Baseline with X8 Chips         21 (16 data + 5 ECC)    31.25%           1 channel, 4 ranks/channel
Baseline with X4 Chips         40 (32 data + 8 ECC)    25%              1 channel, 2 ranks/channel
The MC stores an 8B bank checksum for each bank in each chip (see
Section 4.4.1). In a large eight-channel system with four ranks/channel, nine
chips/rank, and eight banks/chip, the total storage is 8 · 4 · 9 · 8 · 8 = 18KB.
Updating these checksums for writes requires no memory access overhead
because they require the same bitwise sums.
Sending old data from LLC to MC causes an overhead on-chip transfer
per memory write. Since writes are < 1/2 of memory accesses and on-
chip network bandwidth is 10-100X higher than memory [101, 102], on-chip
network bandwidth/area overhead is 0.5% to 5%.
4.5 Methodology
We evaluate an ideal NVRAM baseline, with DRAM-like errors that only
needs SEC for bit errors, and model it as the Samsung DRAM proposal,
which requires 18.75% total redundancy for bit errors and chip-level faults





































Figure 4.16: Redundancy when using RS alone (series: 2E-4, 1E-4, 5E-5, and 2.5E-5 BER).
Table 4.2: Mixed Workload Composition
mixA   4 omnetpp, 4 mcf, 4 wrf, 4T ocean_cp
mixB   4 bwaves, 4 cactusADM, 4 wrf, 4T ocean_cp
mixC   4 sjeng, 4 cactusADM, 4 radiosity, 4T radix
mixD   4 mcf, 4 GemsFDTD, 4T barnes, 4T radiosity
mixE   4 cactusADM, 4 bwaves, 4 sjeng, 4T fft
mixF   4 mcf, 4 omnetpp, 4 astar, 4T fft
mixG   4 GemsFDTD, 4 astar, 4 bwaves, 4T barnes
both bit errors in NVRAM and chip errors by combining Bamboo ECC [84],
which uses multi-error RS ECC for DRAM server memory, and mission-
critical DRAM protection [13, 103, 104], which spans each ECC word across
few (i.e., two to four) cachelines. The baseline protects each cacheline with
only one RS ECC word, instead of both BCH and RS ECC words, at the
cost of increasing redundancy (see Figure 4.16) since RS is sub-optimal for
bit errors. The baseline uses RS ECC words containing 128B data striped
across 32 narrow X4 chips, which achieves similar redundancy as the ideal
baseline (e.g., 25% vs. 18.75%). If the baseline uses wide X8 chips instead,
the minimum redundancy increases to 31.3%. We evaluate this for sensitivity
analysis; each ECC word in this implementation contains 128B data striped
across 16 chips. Table 4.1 summarizes the evaluated memory configurations;
all have equal usable memory capacity and bandwidth.
We evaluate 16 threads per workload, seven single-application NASBench
[29] workloads and seven multi-programmed workloads (see Table 4.2 for
composition); only native and reference inputs are used. The workloads’
memory footprints range from 10GB to 35GB and are 17GB on average.
We fast forward each workload until all multi-threaded application(s) have
initialized and then by another 20 simulated seconds. We warm up the
caches by 20 simulated milliseconds and then evaluate the next 20ms via
cycle-accurate simulation. Since all workloads contain multi-threaded appli-
cations, we measure throughput not by total instructions, but by FLOPs for
workloads with only FP benchmarks and by instructions that access LLC for
remaining workloads.
We simulate a 16-core processor (see Table 4.3) in GEM5 [23]. We evaluate
a WBC for both baselines and VLEW; we observe ∼25% better average per-
formance for both baselines and VLEW when using WBC, which mitigates
the long write latencies of NVRAMs by improving memory write spatial lo-
cality. We implement WBC as an eight-way cache with 1/16th the LLC size.
Table 4.3: Processor Configuration
Core                  16 cores, 3GHz, 4-issue OOO, 168 ROB entries, 64B cacheline
L1 d-cache, i-cache   2-way, 64KB, 1 cycle
Private L2 cache      4-way, 256KB, 4 cycles
Shared LLC            32-way, 32MB, 14 cycles
Memory Controller     Degree two stride prefetcher, 512 buffer entries/channel
Table 4.4: Evaluated tWR Increase (%) for VLEW
Chip     NAS.D         bt   is   mg   cg   lu   sp   ua   avg
Wide     Main           4   38    4   13    4    4    4    10
Wide     Sensitivity    8   89    8   28    8    8    9    23
Narrow   Main           4   65    4   13    4    4    4    14
Narrow   Sensitivity    9  172    8   28    9    9    9    35
Chip     Mixes          A    B    C    D    E    F    G   avg
Wide     Main          14    6   28   42   23   53   26    20
Wide     Sensitivity   26   10   39   80   31   95   29    44
Narrow   Main          12    6   27   24   18   50   16    27
Narrow   Sensitivity   30   12   65  101   50  133   58    64
WBC evicts all other dirty entries that are part of the same 4KB page to-
gether when evicting one such dirty entry. WBC searches these entries via
a small search cache with the same associativity and the same number of
entries. In the search cache, each tag points to a 4KB page, while each data
entry is a 64-bit vector to indicate which of the 64 cachelines of the 4KB page
are dirty in the WBC. The search cache is updated whenever WBC evicts or
inserts a dirty entry. To model the realistic baseline, we increase LLC cache-
line size from 64B to 128B as all its ECC words contain 128B data. To model
VLEW, we set 5% as the bandwidth overhead cap for write-through, which
is enforced by dynamically adjusting the recency threshold for write-through
(see Section 4.4.3); LLC halts sending old data to WBC if this threshold is
a quarter of the way down the recency stack.
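The search cache described above can be pictured as a map from 4KB page to a 64-bit dirty vector. The following is a minimal sketch of that bookkeeping only; it ignores the search cache's associativity and capacity, and the class and method names are hypothetical.

```python
from typing import Dict, List

PAGE_BYTES = 4096
LINE_BYTES = 64
LINES_PER_PAGE = PAGE_BYTES // LINE_BYTES  # 64 cachelines per 4KB page


class SearchCache:
    def __init__(self) -> None:
        self.dirty: Dict[int, int] = {}  # page number -> 64-bit dirty vector

    def mark_dirty(self, addr: int) -> None:
        """Record that the cacheline at addr is dirty in the WBC."""
        page, line = divmod(addr // LINE_BYTES, LINES_PER_PAGE)
        self.dirty[page] = self.dirty.get(page, 0) | (1 << line)

    def evict_page_of(self, addr: int) -> List[int]:
        """Return all dirty cacheline addresses in addr's page and clear them."""
        page = addr // PAGE_BYTES
        vector = self.dirty.pop(page, 0)
        return [page * PAGE_BYTES + i * LINE_BYTES
                for i in range(LINES_PER_PAGE) if (vector >> i) & 1]


sc = SearchCache()
for a in (0x1000, 0x1040, 0x2000):
    sc.mark_dirty(a)
print([hex(a) for a in sc.evict_page_of(0x1040)])
```

Evicting one dirty entry thus returns every dirty cacheline of the same page, which is what lets the WBC write them to memory together.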
To model memory, we use DRAMSim2 [21], assuming a row size of 256B
per chip, the FR-FCFS scheduling policy, and the close-page row policy,
which closes a row when it has no request in the command queue. We prior-
itize reads over writes and interleave adjacent memory pages across banks,
ranks, and then channels. Since datasheets of high-density NVRAM chips
are not currently available, similar to [42], we model them based on DRAM
chips; we use 8Gb 1600MT/s DDR3 chips [28] and modify tRCD and tWR
as 250ns and 1000ns, respectively, to reflect MLC-PCM read and write laten-
cies [52] and evaluate tRCD = 120ns and tWR = 450ns to reflect ReRAM

Figure 4.17: Normalized performance (series: Ideal,X8; VLEW,X8; Ideal,X4; VLEW,X4; Baseline).
we multiply total power by total bits divided by data bits.
VLEW can update more code bits than the ideal baseline due to using
very large ECC words, which may reduce write lifetime. We make up for
potential loss in write lifetime by slowing down writes [105]. We increase
tWR in VLEW (see Table 4.4) by determining on a per-workload basis the
number of writes to BCH code bits in NVRAM banks (but not in code bits
registers) vs. the total number of memory write requests. We use the write
endurance and latency relationship in [105]. For sensitivity analysis, we use
the worst-case relationship, where endurance is linear with latency [105].
4.6 Results
Figure 4.17 shows system throughput normalized to the ideal baseline with
wide chips. The baseline suffers 28.5% average performance loss vs. the ideal
baseline with wide chips because the former accesses 4X as many data chips
per request as the latter, so far fewer parallel accesses can be in flight
at the same time; having many parallel accesses in
flight is critical for hiding the long access latency of NVRAMs.
The performance of VLEW with wide chips, however, is very similar to
the ideal baseline with wide chips (i.e., only 1.9% lower, on average) because
VLEW accesses the same amount of data and number of chips per request
as the latter in the common case. mixD suffers the highest loss; this is
because RCC provides very low old data hit rate for mixD, which results in
high memory bandwidth overhead for write requests (see Figure 4.23). A few
workloads, however, have slight performance improvement because writing
through LLC cachelines improves the spatial locality of their memory writes.
Figure 4.18 shows the memory energy per instruction (EPI) normalized to

Figure 4.18: Normalized memory EPI (series: Ideal,X8; VLEW,X8; Ideal,X4; VLEW,X4; Baseline).
EPI compared to the ideal baseline due to two reasons. First, since the data
in each ECC word is spread across 32 chips (see Figure 4.16), the 128B data
in each request are accessed from 32 chips. In comparison, the ideal baseline
can access 128B data, or two adjacent 64B cachelines, through two consecu-
tive requests to the same eight chips. This means that under the baseline, up
to four times as many memory chips need to perform the expensive row acti-
vation for the same 128B of data, which increases memory dynamic energy.
Second, for low memory utilization workloads where memory chips are often
idle, four times as many chips have to be brought out of low power mode to
serve a memory request; this increases memory background energy. In com-
parison, VLEW with wide chips only pays 5.2% memory energy overhead,
on average. Overall, VLEW reduces memory energy by 46.3% and improves
performance by 38.4% compared to the baseline, despite the latter having
higher redundancy (25% vs. 20%).
For sensitivity analysis, Figure 4.19 shows the performance and energy of
the baseline with wide chips, which requires high redundancy (e.g., 31.3%).
It accesses twice as much data and twice as many chips per request as the
ideal baseline and, therefore, suffers from a high memory EPI overhead
(45.3%, on average). VLEW with wide chips reduces memory EPI by 25.0% and
improves performance by 8.5%, on average, vs. the baseline with wide chips.
Figure 4.20 shows dynamic energy savings are higher than background
energy savings. Since we model NVRAM chip power using DRAM chip
power and NVRAM has higher dynamic energy than DRAM [6], the actual
overall energy saving should be > 48%.
Figure 4.21 shows how the worst-case relationship between write latency
and endurance affects VLEW. On average, it causes 2.2% performance loss
and 3.2% memory EPI increase for wide chips; for narrow chips, they are 5.2%

Figure 4.20: Memory EPI savings breakdown.
(e.g., is.D and mixF), which are write-intensive and have low row hit rates
for writes.
Figure 4.22 shows average performance and memory EPI vs. the ideal baseline
with wide chips for faster NVRAM. VLEW with wide chips still provides
similar EPI savings (i.e., 46.7%, on average). However, the performance
improvement drops to 25% because the baseline's disadvantage of having fewer
parallel accesses in flight is less impactful.
Figure 4.23 shows the breakdown of bandwidth utilization in VLEW nor-
malized to total available bandwidth. Reading old data from memory to
calculate bitwise sums when old data misses in WBC accounts for 4.2% of all
accesses; it shows old data can be obtained effectively from LLC instead of

Figure 4.23: VLEW bandwidth utilization (series: Reads; Writes; Overhead write-through; Full correction; Read old data).
dynamically adjusting the recency threshold for write-throughs can
effectively limit this overhead. The overall bandwidth overhead is 5.5% vs.
the ideal baseline. In comparison, even under perfect spatial locality, the
baselines with narrow and wide chips incur 125%/(9/8) − 1 = 11% and
131%/(9/8) − 1 = 16.7% bandwidth overheads, respectively, vs. the ideal
baseline because they require a higher ratio of parity chips to data chips
per rank.
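The bandwidth-overhead arithmetic above can be reproduced with a short sketch. It assumes that the 9/8 term is the ideal baseline's ratio of total chips to data chips, and that the wide-chip figure of 131% is the rounded form of 131.25% (i.e., 31.25% redundancy); the function name is hypothetical.

```python
def bandwidth_overhead(redundancy, ideal_ratio=9 / 8):
    """Extra bandwidth vs. the ideal baseline, given a scheme's redundancy.

    redundancy  : parity-to-data ratio of the scheme (0.25 means 25%).
    ideal_ratio : total-to-data chip ratio of the ideal baseline (assumed 9/8).
    """
    return (1 + redundancy) / ideal_ratio - 1


narrow = bandwidth_overhead(0.25)    # baseline with narrow chips
wide = bandwidth_overhead(0.3125)    # baseline with wide chips (assumed 31.25%)
print(f"narrow: {narrow:.1%}, wide: {wide:.1%}")
```

Under these assumptions the two overheads come out to 1/9 and 1/6, matching the 11% and 16.7% quoted in the text.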
4.7 Related Work and Generality
By tolerating even higher BER, NVRAM chips can also serve as persistent
memory that retains data for a long time without refresh. We consider 10^-3
BER, which is the ReRAM BER one year after the last write/refresh [57] and
the PCM BER one week after the last write/refresh (extrapolating from [33]).
To tolerate the higher BER, VLEW adds code bits to strengthen bit-error ECC
words from 13-EC to 21-EC; VLEW strengthens chip-error ECC words to
opportunistically correct up to six errors by doubling their size, accessing
128B per request from the same number of chips per rank as before. While
the 16B RS ECC bits in the bigger chip-error ECC word can correct up to
eight errors, VLEW rejects correction results with more than six corrections
to reduce the SDC rate from 5 · 10^-13 to 3 · 10^-25. Compared to the baseline in
Section 4.6, which requires 44% and 31% redundancy to tolerate 10^-3 BER
for wide and narrow chips, respectively, VLEW requires only 25% redundancy
and provides similar memory EPI and performance improvements as
in Section 4.6; as such, the benefit of VLEW increases with increasing BER.
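The rejection rule for opportunistic correction can be sketched as follows. The sketch assumes 8-bit Reed-Solomon symbols, so that 16B of RS ECC bits correspond to 16 check symbols; it relies on the standard property that 2t check symbols correct up to t symbol errors. The function names are hypothetical, and the decoder itself is not modeled.

```python
def rs_max_correctable(check_symbols):
    # A Reed-Solomon code with 2t check symbols corrects up to t symbol errors.
    return check_symbols // 2


def accept_opportunistic(num_corrections, check_symbols=16, limit=6):
    """Decide whether to trust a decode result obtained with an unknown error count.

    num_corrections : number of symbols the decoder reported correcting.
    limit           : acceptance threshold (six in the text), kept below the
                      code's full capability to lower the miscorrection risk.
    """
    if num_corrections > rs_max_correctable(check_symbols):
        raise ValueError("decoder cannot report more corrections than t")
    return num_corrections <= limit
```

Results with seven or eight corrections are within the code's capability but are rejected, which trades a small amount of correction coverage for the lower SDC rate described above.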
VLEW uses the chip-error code bits to judiciously correct few errors when
the number of errors is unknown (i.e., during opportunistic correction) and
uses the same code bits to correct many errors when the total/maximum
number of errors is known (i.e., after bit error correction). While it is well
known in coding theory that the same code bits can be used in different
ways [25], this is the first work that applies this well-known property to
memory to dynamically adjust how many errors to correct using the same
code bits; this can be useful in many contexts. For example, in desktops and
laptops where memory chip-level fault tolerance is not required, the basic
approach to tolerate NVRAM bit errors is to use one BCH ECC word per
cacheline; this comes at the cost of 18.75% redundancy (assuming 2 · 10^-4
BER) per cacheline, which incurs expensive bandwidth overheads due to either
doubling the number of parity chips (for wide chips) or increasing the number
of data bus cycles (e.g., from 4 to 5). VLEW reduces these overheads at
the same memory redundancy. By dynamically adjusting how many errors
to correct, the same per-cacheline RS ECC bits in VLEW can correct few
errors opportunistically and aggressively correct more errors after performing
per-chip BCH ECC correction; the former reduces common-case bandwidth
overheads to only 12.5%, while the latter reduces the strength and, therefore,
redundancy required by BCH ECC.
Due to very low BER, modern server memory corrects both bit and chip
errors using a single ECC word, which can be viewed as a chip-error ECC
word reused to correct bit errors. However, server memory today always uses
the chip-error ECC word to correct bit errors at full strength; unlike VLEW,
it does not dynamically adjust how many errors to correct.
Moreover, VLEW is used in high BER settings where chip error correction
is frequently incapable of correcting bit errors.
Many prior works (e.g., [83, 106, 107]) on low-BER DRAM have used very
large EDC (error detecting code) words, such as checksums and bitwise
parity, which cannot correct any error on their own. These works target systems
with low BER (hence only using large EDC, but not ECC, words) and,
therefore, only need to provide efficient detection, but not efficient
correction, for memory read requests. For writes, very large EDC words
require far fewer code bits than very large ECC words. As such, prior works
on very large EDC words (e.g., [83, 107]) reduce the write bandwidth overhead
of very large EDC words by caching running code bits (e.g.,
f(x′)⊕f(x)⊕f(y′)⊕..., where x and y belong to the same EDC word) at the
expense of increasing worst-case write overhead; this is because f(x) can
enter the cache after f(x′) is evicted, which causes two writes to code bits.
This is unacceptable for very large ECC words, where the code bits can be up
to 5X the size of the data in each request; here, having two writes to code
bits per data write causes an unacceptable worst-case write bandwidth
overhead of 5 · 2 = 10X (i.e., 1000%).
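The idea of caching running code bits can be illustrated with a minimal sketch. It assumes the simplest linear code function, bitwise parity across the blocks of one EDC word (so f is the identity and the code bits are the XOR of all blocks); the block values are arbitrary illustrations, not data from the cited works.

```python
from functools import reduce
from operator import xor


def parity(blocks):
    # Bitwise parity of an EDC word: XOR of all of its data blocks.
    return reduce(xor, blocks, 0)


blocks = [0b1010, 0b0110, 0b1111]  # blocks x, y, z of one EDC word
code = parity(blocks)              # code bits currently stored in memory

# Accumulate f(x') ^ f(x) ^ f(y') ^ f(y) in a cached running value
# instead of writing the code bits once per data write.
running = 0
updates = {0: 0b0001, 1: 0b1000}   # block index -> new value (x', y')
for index, new_value in updates.items():
    running ^= blocks[index] ^ new_value
    blocks[index] = new_value

code ^= running                    # one write to the code bits covers both updates
assert code == parity(blocks)
```

If the old value f(x) only became available after the entry holding f(x′) had been evicted, the two halves of the delta would be applied in separate writes, which is the worst case described above.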
4.8 Conclusion
We are the first to explore how to architect NVRAM-based server main
memory that tolerates both high BER and memory chip failures. The proposed
VLEW server memory architecture tolerates bit errors using very large ECC
words, like those used in storage systems, while preserving the small access
size of today’s random access main memory. VLEW uses the redundant
memory budgeted for chip failure protection to opportunistically correct bit
errors to accelerate memory accesses; it computes code bits in memory at
low overhead to further accelerate writes. VLEW saves memory energy by
46%, improves performance by 25%, and reduces redundancy by 4% vs. the





This chapter addresses the problem that, during low-voltage operation,
on-chip SRAM caches can significantly affect overall processor energy efficiency
due to frequent errors that require expensive long-latency correction. The
proposed Correction Prediction effectively hides the long correction latencies
by using a weak correction mechanism to predict the corrected cache word
value with low latency and by recovering from misprediction via instruction
replay if the predicted cache word value is found to be incorrect by a reliable
error correction that takes longer to complete. By reducing the correction
latency for common-case cache accesses with no or single error, Correction
Prediction improves overall performance despite incurring expensive
misprediction recovery overheads for uncommon cache accesses with multi-bit errors.
5.1 Introduction
One simple, yet effective, technique to reduce the power of on-chip memories
(e.g., caches) is voltage scaling [108, 109, 110, 111, 112]. Reducing the supply
voltage results in significant reductions in static and dynamic power [111].
One major challenge of scaling the voltage of on-chip memories is maintaining
the desired reliability. Process variations can cause a super-exponentially
increasing fraction of memory cells to become faulty as the supply voltage
decreases [111, 113].
A flurry of recent work has been devoted to providing the desired cache
reliability at low supply voltages [111, 112, 97, 114, 115]. Some propose using
larger, and therefore stronger, memory cells (e.g., 8T and 10T SRAM cells
or cells with up-sized transistors) to prevent errors from occurring in the first
place. Unfortunately, these methods incur a high static overhead even for
nominal voltage operation, when cache reliability is high (see Section 5.6.1).
As a result, many have instead proposed using error correction to correct the
wrong values in memory cells as the cells become faulty due to voltage scaling.
A large number of error correction techniques have been proposed, spanning
from the use of error correction codes [112, 97, 115] to data remapping [111,
114].
However, error correction inevitably incurs a latency overhead, which may
be significant relative to the cache access time for error correction strong
enough to provide the desired reliability (see Section 5.3). This increase to
cache access time may lead to a significant degradation in performance and
energy (see Section 2.6) from either an increased clock cycle time [116] or
increased pipeline depth. An alternative supported by some processors [117]
is to speculatively execute on the instruction or data accessed from the cache
prior to performing error detection. A detected error corresponds to a
mis-speculation, in which case the appropriate instructions are squashed and
the pipeline stalls for correction. While this technique suffices for scenarios
where the bit-failure probability is low, it also incurs a high performance
overhead for scenarios where the bit-failure probability is high (e.g., when
the supply voltage is low) due to rampant mis-speculation leading to frequent
squashes and stalls (see Section 5.4.1).
We propose a novel scheme for hiding the latency overhead of a strong
error correction scheme used to ensure reliability. Our scheme, correction
prediction, uses a fast mechanism to accurately predict the results of strong
error correction. Subsequent pipeline stages execute using the predicted val-
ues. In parallel, the long latency strong error correction attempts to verify
the correctness of the predicted values. On a mis-prediction, i.e., when the
value produced by the correction predictor is not the same as the result of
the strong error correction, speculative instructions are squashed and the
pipeline is re-started. By allowing the logic core to execute on the predicted
data or instructions, one can effectively hide the latency of the slow, strong
error correction, even at very low supply voltages where cell failure is preva-
lent. In the context of hard faults in voltage scaled SRAM L1 caches, we
propose implementing the correction predictor using a fast, but weak error
correction mechanism that produces the same result as strong error correc-
tion mechanisms for most, but not all words. Our implementation, CP, is
based on a Correction Prediction Table (CPT) (details in Section 5.5) that
can accurately predict over 90% of cache word corrections.
We make the following contributions:
• We propose correction prediction, a scheme to reduce the latency of
strong error correction by predicting the output of strong error correc-
tion. The accuracy of prediction is verified by long latency strong error
correction even as the predicted value is used to continue execution.
• We present CP, a simple implementation of correction prediction where
a fast, weak correction precedes strong error correction and allows CP
to limit the mis-prediction rate to <0.1%. CP adds <10% area overhead
and <2.5% worst-case latency to a cache with strong error correction.
• We evaluate CP applied to three recently proposed strong error cor-
rection schemes—Hi-ECC [77], VS-ECC [97], and Bit-Fix [111]. Com-
pared to using the strong error correction technique alone, CP reduces
L1 cache access latency by 38%, 38%, and 52%, respectively. For a 2-
issue in-order core, this corresponds to a processor-wide energy savings
of 16%, 17%, and 21%.
5.2 Background and Related Work
5.2.1 SRAM Reliability at Low Voltages
Some SRAM cells in a cache are weaker than other cells in the cache due to
process variations. Although practically every cell in a cache (e.g., 99.999%
plus) is functional at a high supply voltage, more and more of these weaker
cells become faulty as the supply voltage drops. A faulty cell can experience
both read failures, where the wrong value is returned or the stored value is
toggled unintentionally, and write failures, where the value in the cell cannot
be toggled [111]. Since these faulty cells are due to permanent defects (e.g.,
dopant variation), they can be located using a number of built-in self-test
(BIST) routines [111, 97, 114]. Some faults at low voltages are due to soft
errors [118] that cannot be detected by BIST routines. However, the fraction
of such faults at low voltages is minuscule (smaller by over five orders of
magnitude at 650mV [112]).
Figure 5.1 shows the average fraction of the cells in a 65nm cache that
are faulty as a function of the supply voltage. The calculation is based on
Figure 5.1: The average fraction of bits and 32-bit words that are faulty in
a cache for 65nm SRAMs.
the SRAM failure probability reported in [113] and assumes that the faulty
cells are distributed randomly across the cache, which is consistent with
the assumption made in prior works [111, 112, 97]. We also calculated the
fraction of 32-bit words that are faulty; a word is faulty if it contains one or
more faulty bits. The results in Figure 5.1 show that nearly 30% of all words
require error correction when the supply voltage is scaled beyond 650mV,
motivating the need for efficient error correction algorithms.
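The word-level fault fraction follows directly from the bit-level fault fraction under the stated assumption that faulty cells are distributed randomly (independently) across the cache. A minimal sketch, with a hypothetical function name:

```python
def word_fault_fraction(bit_fault_fraction, bits_per_word=32):
    """Probability that a word contains at least one faulty bit,
    assuming each bit is faulty independently with the given probability."""
    return 1 - (1 - bit_fault_fraction) ** bits_per_word


for p in (0.02, 0.006, 0.002, 0.001):
    print(f"bit fault fraction {p}: word fault fraction {word_fault_fraction(p):.4f}")
```

For example, a bit fault fraction of 0.02 yields a word fault fraction of about 0.476, and 0.001 yields about 0.0315.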
5.2.2 Error Resilience Techniques in Caches
One straightforward technique to improve reliability is using larger transis-
tors to implement the SRAM cells [113]. Another technique is to use a more
fault tolerant SRAM implementation, such as an 8T or a 10T SRAM cell
[113]. The downside of these techniques is that they incur a significant area
and power overhead even when the processor is operating at high supply
voltage (see Table 5.2).
There have been several recent attempts at using strong error correcting
codes to implement a cache that operates reliably at low supply voltages while
incurring a low overhead at a high supply voltage. For example, FLAIR
[115] uses a combination of SECDED (single error correction double error
detection) codes and dual modular redundancy to correct errors in a cache
line. VS-ECC [97] proposes using a combination of SECDED and 4EC5ED
(four error correct five error detect). MS-ECC [112] proposes trading off
storage-overhead for decode latency by using Orthogonal Latin Square Codes
(OLSC) for multi-bit error correction. The downside of these techniques is
their high energy overhead at low voltages due to the high performance cost
of detection and correction (Section 2.6).
Finally, several works observe that since the location of the faulty cells
can be predetermined via offline testing (e.g., BIST), one can remap the
value of faulty cells elsewhere in the cache instead of using ECC to correct
errors. PADded Cache [119], for example, uses a fast, programmable address
decoder to remap cache lines into other sets and disables faulty physical lines.
Unfortunately, for low supply voltage operation, most cache lines would need
to be disabled (e.g., >99% of cache lines at 650mV). Bit-Fix [111] proposes
a simple remapping policy that uses dedicated bits per cache line to record
the bad bits in each cache line and remap their values to a different cache
line. Archipelago [114] uses a more sophisticated remapping policy that uses
a global fault map table to perform remapping at a finer granularity. The
primary downside of these techniques is the unavoidable latency increase for
every cache access due to data correction after data array access or map
look-up before the data access. This latency increase may have significant
performance and energy impact (Section 2.6).
5.2.3 Tolerating Error Detection and Correction Latency
One simple approach to account for the additional delay in the instruction-
fetch and data-load stages due to error correction without stalling the pro-
cessor pipeline is to slow down the overall core frequency; however, this can
lead to a significant performance degradation (Section 2.6) since the error
correction latency is often a significant fraction of the cache access latency.
Instead of slowing down the core frequency, Bonnoit et al. [116] propose using
additional pipeline stages to handle the error correction latency. However,
using pipeline deepening to hide correction latency can also lead to a degra-
dation in performance and energy efficiency (Section 2.6) because branch and
data hazards become more expensive. Also, load-dependent instructions are
stalled more frequently and for more cycles. Furthermore, since only low




[Plot data residue: supply voltage (mV) vs. average fraction of faulty bits and faulty 32-bit words]
mV      600         700         800         840         1000        1100        1200
Bits    0.02        0.006       0.002       0.001       0.000035    0.0000032   0.0000004
Words   0.47611686  0.17517028  0.06205511  0.03150892  0.00111939  0.0001024   1.28E-005
[Bar labels of the latency plot: BCH No-Error, BCH 1-Error, BCH Multi-bit Error, OLSC, Fault Map, Archipelago, 7-Modular Redundancy]

Figure 5.2: Latency of different error correction schemes normalized to the
access time of a 4-way set-associative 32KB cache at 65nm.
result in unwanted overhead for high voltage operation.
Bonnoit et al. [116] also propose avoiding the additional pipeline stages by
decoupling error detection from correction. They observe that error detection
typically incurs a shorter latency than error correction. Therefore, they pro-
pose reducing the clock frequency to accommodate only the error detection
latency, and then stalling the pipeline when errors are detected to wait for
the error correction. However, when error detection latency is still a large
fraction of cache access latency, this technique may result in a significant
reduction in clock frequency. In addition, this technique leads to frequent
stalling when the fraction of cache accesses with errors is high, which limits
its effectiveness for aggressive voltage scaling.
Some processors [117] attempt to hide the latency of error detection by
using speculation, whereby the word retrieved from the cache is sent directly
to the subsequent pipeline stages, prior to performing any error detection;
meanwhile, error detection takes place in the clock cycles following the cache
access. If errors are detected, the pipeline is flushed, an exception handler
performs the correction, and execution restarts from the erring instruction.
However, this technique is also ineffective at low voltages, where flushing




Error correction can be expensive when the number of errors that need to be
corrected is large (as may be the case for low Vmin L1 caches). Consider, for
example, 5-bit BCH-based error correction at 650mV for the SRAM failure
rates in Figure 5.1. BCH-based correction has been used for strong error cor-
rection in past works such as VS-ECC [97]. We calculated the decode latency
of the BCH code for the three scenarios (no error, one error, multi-bit errors)
assuming 32 data bits per codeword using the methodology in [120, 77] and
using the FO4 delay for the 65nm technology reported in [121] (more model-
ing details in Section 5.6.1). Figure 5.2 shows the latency values normalized
to the access latency of a 65nm four-way set-associative 32KB cache with
64-bit output granularity (see Section 5.6.1 for details). The figure shows
that even in the most optimistic scenario (i.e., codeword is error-free), the
decode latency of the BCH code is a significant fraction (50%) of the cache
access latency. The decode latency is 72% of the cache access latency for
single-bit errors and 647% of the cache access latency for multi-bit errors.
Other strong error correction techniques that provide comparable relia-
bility are expensive as well. Figure 5.2 shows the decode latency of a 7-
bit error-correcting OLSC code [112], and Bit-Fix, a fault-map-based tech-
nique [111]—the reported implementations provide the same reliability as the
5-bit BCH discussed above. Results show that the decode latency of OLSC
and the decoding and shifting latencies of Bit-Fix are also significant (41%
and 140% of the cache access latency respectively). Moreover, while the error
correction latency of the OLSC code is shorter than that of the BCH code,
it comes at a significant cost in terms of storage overhead. Similarly, while
the error correction latency of 7-modular redundancy is only 6% of that of
the cache access, the corresponding storage overhead is 600%.
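The 600% storage overhead of 7-modular redundancy follows from keeping six extra copies of every word, and its short latency comes from correction being a per-bit majority vote. A minimal sketch of both, with hypothetical function names and an arbitrary example word:

```python
def nmr_storage_overhead(n_copies):
    # N-modular redundancy stores N copies, i.e., N - 1 extra copies per word.
    return n_copies - 1


def majority_vote(copies, width=32):
    """Bitwise majority over an odd number of stored copies of a word."""
    result = 0
    for bit in range(width):
        ones = sum((copy >> bit) & 1 for copy in copies)
        if 2 * ones > len(copies):
            result |= 1 << bit
    return result


word = 0xDEADBEEF
# Seven copies, three of which contain faulty bits at various positions.
copies = [word] * 4 + [word ^ 0x1, word ^ 0x80000000, word ^ 0xFF]
print(hex(majority_vote(copies)), f"{nmr_storage_overhead(7):.0%} overhead")
```

With seven copies, each bit position tolerates up to three faulty copies.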
Our goal is to develop a technique that allows strong error correction to be
used for low Vmin caches without the prohibitive latency or storage overhead.
Toward this goal we employ correction prediction used in conjunction with
a strong error correction scheme.

Figure 5.3: Word error distribution at 650mV. 99% of words have two or
fewer errors. This suggests that a weak error correction mechanism
provides sufficient prediction for most cache accesses.
5.4 Correction Prediction for L1 Caches
Correction prediction for L1 caches feeds predicted values to the pipeline
while using strong error correction in parallel. When the predicted value is
correct (i.e., the word consumed by the subsequent pipeline stages is the
same as the output of the strong error correction), the latency of strong
error correction is avoided. On a mis-prediction (i.e., the word consumed
by the subsequent pipeline stages differs from the output of the strong error
correction), the instructions dependent on the consumed predicted word are
flushed and restarted using the output of the strong error correction. The
mis-prediction penalty is the larger of the squash/restart and strong error
correction penalties.
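The latency behavior described above can be captured by a simple first-order model. This model is an assumption of the sketch rather than the evaluation methodology of this chapter: every access pays the prediction latency, and a mis-predicted access additionally pays the larger of the squash/restart and strong correction penalties. All parameter names and example values are hypothetical.

```python
def expected_access_latency(t_predict, t_strong, t_squash, mispredict_rate):
    """Average latency (in cycles) seen by the pipeline under correction prediction.

    t_predict       : latency when the predicted value is used directly.
    t_strong        : strong error correction penalty.
    t_squash        : squash/restart penalty.
    mispredict_rate : fraction of accesses whose prediction is wrong.
    """
    penalty = max(t_squash, t_strong)  # the larger of the two penalties
    return t_predict + mispredict_rate * penalty


# Hypothetical example: 1-cycle prediction, 4-cycle strong correction,
# 6-cycle squash/restart, and a 0.1% mis-prediction rate.
print(expected_access_latency(1, 4, 6, 0.001))
```

The model makes explicit why a low mis-prediction rate matters: the penalty term is scaled by that rate, so the average stays close to the prediction latency.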
5.4.1 Key Idea
A correction predictor must be both accurate and fast. A correction predictor
must have high accuracy to be effective since a large fraction of words are
faulty at low voltages (e.g., over 30% at 650mV, see Figure 5.3). A high
mis-prediction rate will lead to frequent squashes. A correction predictor
must be fast since even an added cycle of latency may be prohibitive for L1

[Figure 5.4 legend: 650mV, 30% of words have faults; 700mV, 17.1% of words have faults; 765mV, 8.7% of words have faults]

Figure 5.4: Capacity overhead vs. latency tradeoff.
cache accesses.
A high accuracy correction predictor implementation predicts correctly in
high likelihood scenarios. Figure 5.3 shows that for voltage-scaled SRAMs, a
high likelihood scenario is where a word has zero, one, or two faults. An error
correction mechanism that can correct up to two errors would correct over
99% of words at 650mV. One fast implementation of such an error correction
mechanism is storing information about up to two faults in a table. Since the
table cannot store enough information to correct every word and since the
table itself can suffer faults, the table cannot correctly predict every access.
However, we show in Section 5.5 that it can correctly predict over 90% of all
accesses. By allowing the pipeline to speculatively execute using instructions
or data that have been predicted by the fast error correction mechanism, our
proposed error correction implementation, CP, can effectively provide error
correction latency similar to that of a fast error correction mechanism with
the same level of reliability as a long latency error correction mechanism.
Figure 5.4 shows the capacity overhead and the average error correction
latency of the CP implementation relative to other strong error correction

Figure 5.5: Correction prediction for an L1 cache access.
can correct 99.9987% of words in the cache for the word error rates shown in
the legend; see footnote 1). Each data point corresponds to a particular
implementation of a correction technique. For the BCH code, the average
correction latency is the average of the correction latency for zero errors,
one error, and more than one error, weighted by the frequency of occurrence
of these errors (Figure 5.3). Each CP design point uses BCH as the strong
error correction mechanism, while the fast correction mechanism is based on
a fault map and can correct up to zero, one, or two errors (corresponding to
zero, one, and two Map Units; see Section 5.5). The results show that CP
indeed provides latency similar to
the low latency correction techniques (e.g., N-modular redundancy) at the
capacity overhead of capacity-optimized techniques (e.g., BCH).
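The weighted average for the BCH code can be sketched as follows. The three normalized latencies (50%, 72%, and 647% of the cache access latency) are the values quoted earlier for 5-bit BCH; the error-count frequencies are an assumption of this sketch, derived from a binomial model over 32 independent data bits (ignoring check bits) rather than read from Figure 5.3.

```python
from math import comb


def error_count_probs(p_bit, n_bits=32):
    """Probabilities of zero, exactly one, and more than one faulty bit per word,
    assuming independent bit faults (a simplification)."""
    p_zero = (1 - p_bit) ** n_bits
    p_one = comb(n_bits, 1) * p_bit * (1 - p_bit) ** (n_bits - 1)
    return p_zero, p_one, 1 - p_zero - p_one


def average_bch_latency(p_bit, lat_none=0.50, lat_one=0.72, lat_multi=6.47):
    # Latencies are normalized to the cache access latency.
    p_zero, p_one, p_multi = error_count_probs(p_bit)
    return p_zero * lat_none + p_one * lat_one + p_multi * lat_multi


# Bit fault fraction chosen so that about 30% of 32-bit words are faulty.
p_bit = 1 - 0.70 ** (1 / 32)
print(f"average BCH latency: {average_bch_latency(p_bit):.2f}x cache access latency")
```

Even though multi-bit errors are rare under this model, their long correction latency contributes a substantial share of the average.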
5.4.2 Microarchitectural Support
Figure 5.5 details the logical flow of an L1 cache access using CP. When the
pipeline requests an L1 cache access, the cache performs the normal cache
tag and data array accesses. In parallel, the cache reads the Correction
Prediction Table entry (see Section 5.5) corresponding to the word being
accessed. The entry indicates whether to perform correction prediction or
not. To predict, the cache applies fast error correction and feeds the resulting
predicted value to the pipeline. After speculative execution begins using the
predicted value, the slow, strong error correction determines whether the
predicted value contains an error. If the predicted value does contain an
error, a mis-prediction has occurred and the pipeline squashes all dependent
instructions. The cache then returns the corrected value to the pipeline and
the pipeline restarts the instruction which initiated the cache request with
the correct value. When prediction is not performed, the cache applies strong
correction to the value returned from the data array and returns the correct
value to the pipeline.
Footnote 1: This means that each implementation has barely enough ECC resources to provide
the same reliability for the voltages shown in the legend as the cache has at 1.2V.
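The access flow above can be summarized in a short behavioral sketch. The function signature, the table entry format (a tuple of known-faulty bit positions), and the toy correction routines are all hypothetical stand-ins; in particular, the strong corrector is modeled as an oracle, and the toy fast corrector assumes a listed faulty bit always reads back inverted.

```python
def cp_access(raw_word, cpt_entry, fast_correct, strong_correct):
    """Return (word finally delivered to the pipeline, squash_required)."""
    if cpt_entry is None:
        # No prediction: the pipeline waits for strong correction to finish.
        return strong_correct(raw_word), False
    predicted = fast_correct(raw_word, cpt_entry)  # fed to the pipeline right away
    corrected = strong_correct(predicted)          # verification; overlaps execution in hardware
    return corrected, corrected != predicted       # a mismatch means a mis-prediction


# Toy stand-ins (hypothetical): a 4-bit word whose true value is known.
TRUE_WORD = 0b1011


def toy_fast_correct(word, faulty_bits):
    for bit in faulty_bits:
        word ^= 1 << bit  # flip each bit position recorded as faulty
    return word


def toy_strong_correct(word):
    return TRUE_WORD      # oracle standing in for a real strong decoder


raw = TRUE_WORD ^ 0b0100  # bit 2 reads back wrong
print(cp_access(raw, (2,), toy_fast_correct, toy_strong_correct))  # fault recorded in the table
print(cp_access(raw, (), toy_fast_correct, toy_strong_correct))    # fault missing from the table
print(cp_access(raw, None, toy_fast_correct, toy_strong_correct))  # prediction disabled
```

The three calls exercise the correct-prediction, mis-prediction, and no-prediction paths, respectively; only the second requires a squash.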
To support CP, the following changes need to be made to the processor
pipeline:
Cache Support Figure 5.6 depicts the additions the CP cache module re-
quires beyond a traditional cache. The fast, weak correction module contains
the Correction Prediction Table and associated logic (see Section 5.5). This
module determines whether prediction should be triggered and also generates
the predicted value. The slow, strong correction module uses the predicted
value to provide sufficient error correction to meet the reliability requirement.
It outputs both whether it detected an error within the predicted codeword
and also returns the corrected value (i.e., the corrected data bits from the
codeword) in case of an error. If the corrected value does not match the pre-
dicted value, then a mis-prediction occurred. The Pipeline Control module
then indicates that a squash is required. The Pipeline Control module also
stalls the pipeline when the fast, weak correction module does not predict
a value and the strong error correction module must perform long-latency
error correction.
We recognize that the additional logic inserted on the critical path of a
cache access may result in a reduction in operating frequency. We quantify
this overhead in Section 5.5.2 and study the sensitivity of benefits to this
overhead in Section 5.7.3.
Instruction Fetch Applying CP to the instruction cache requires support
for squashing each instruction dependent on a mis-predicted instruction and
restarting the corrected instruction in the decode stage. Also, the predicted
instructions may cause exceptions (e.g., illegal opcode or divide by zero ex-
ceptions). Such exceptions must be suppressed until strong correction com-
pletes. For the core used in our evaluations (see Section 5.6), our strong

Figure 5.6: Modifications to L1 caches. The critical path for the common
case of correction prediction is in bold.
sulting in potentially erroneous instruction bits propagating to the decode,
execute, and initial fetch stages. For this core, all exceptions must be sup-
pressed until after the execute stage, which would happen anyway to maintain
precise exception handling.
Data Memory Load The core idea behind CP is to allow computation to
continue speculatively while error correction is in progress. For a data memory
load, this means that the predicted result must be forwarded to any depen-
dent instructions within the pipeline. This avoids execution of dependent
instructions being stalled by the additional latency of error correction. To
allow continued forwarding of predicted data to dependent instructions, ad-
ditional pipeline stages must be added after the data memory stage. The
number of required additional stages is equal to the number of cycles it takes
to perform strong correction. These additional stages are dummy stages
through which instructions flow. These stages also support the forwarding of
predicted values to earlier stages. Note that adding these forwarding stages
to the data cache access has a significantly smaller impact on performance
than adding pipeline stages for error correction (Deep Pipelining). This is
because the dependent instructions are stalled waiting for the strong correc-
tion to complete in the latter case. For the core used in our evaluations (see
Section 5.6), at least two additional pipeline stages are needed after the data
memory stage. Mis-predictions are detected during the write-back stage. On
a mis-prediction, the value generated by strong error correction is written
back to the register file; all following instructions are squashed.
5.4.3 Tag Array Protection
Note that the above discussion assumes that correction is performed only on
the content of the data array, not the tag array. The tag access is assumed to
be robust at low voltages similar to previous related work [114, 111]. Since
the tag array is significantly smaller than the data array and is less latency
constrained, we assume a 10-T SRAM-based [113] implementation for the
tag array to guarantee robustness at low voltages.
5.5 Implementation, Coverage, and Overheads
For correction prediction to be beneficial, the underlying prediction mech-
anism must only add minimal latency to a regular cache access. As such,
our design goal is to limit the latency of correction prediction to the delay
of a single logic gate. Many prediction mechanisms with varying prediction
accuracies, latencies, and storage overheads are possible. Here we present
one such correction prediction mechanism and leave a full exploration of
correction prediction mechanisms as future work.
The proposed correction prediction mechanism is a fast but weak error
correction mechanism that duplicates a small number of faulty bits in the
cache. The number of duplicate bits is kept small so that they can be stored
in a small enough table, called the Correction Prediction Table or CPT, that
accessing the table is much faster than accessing the L1 cache. The CPT is
accessed in parallel with a L1 cache access. The fast CPT allows the duplicate
bits to be accessed and then processed before a L1 cache access completes;
as such, it allows the duplicate bits to correct the faulty bits in the cache
word at the cost of the delay of a single MUX, which decides what bit—a
regular data bit or a duplicate bit—to output per bit position to subsequent
pipeline stages.
The CPT is organized as follows. There is a CPT entry corresponding
to every consecutive four words (e.g., 128 bits) in the L1 cache. Each CPT
entry contains four predFlags and two Map Units (Figure 5.7). There is a
predFlag for each word that allows CP to avoid bad predictions, such as
when a CPT entry does not have enough Map Units to correct all faults
in the corresponding words or when a Map Unit is faulty. The predFlag
















[Figure 5.7 fields: predFlags (4 bits); Map Unit 0: valid (1 bit), location (7 bits), value (1 bit); Map Unit 1 (9 bits)]
Figure 5.7: Correction Prediction Table entry format. Each table entry
corresponds to four 32-bit words.
[Figure 5.8 labels: per-way bit location decoding and enable; way select; value fields of M.U. 0 and M.U. 1 for each way; output Word i Bit x]
Figure 5.8: Fast, weak error correction circuit for bit x of word i. M.U.
stands for Map Unit, V. stands for valid bit, and L. stands for location
field. Value propagation before data array access completes is in bold.
or to simply perform strong error correction and stall the pipeline when the
word is accessed. A Map Unit has three fields—valid, location, and value. The
valid bit indicates that the corresponding location and value fields are error-
free. The location field contains the location of one of the single-bit errors
within the 128 bits. The value bit contains the correct current value for the
corresponding cache bit. Note that both the valid bits and the predFlags
are vulnerable to errors; however, errors in the CPT only affect prediction
accuracy, not reliability.
Each CPT entry is populated as follows. Only a CPT entry’s value bits
are set and updated during regular cache accesses, while all other bits in the
CPT are set at runtime by a BIST routine that tests the L1 cache at the
target low voltage. The BIST routine first tests the two Map Units of the
CPT entry. If the routine finds any faulty bits in a Map Unit, the routine sets
the valid bit of the Map Unit to false. The routine then tests the 128 cache
bits that correspond to the CPT entry to identify as many faulty cache bit




Prediction        1 − (p · [1 − P(error)] + (1 − p) · P(error))    91%
Mis-prediction    p · P(error)                                      0.089%
locations as there are valid Map Units in the CPT entry. These faulty cache
bit locations are then recorded in the location field of the valid Map Units.
Next, for each of the four predFlags in the CPT entry, the routine sets a
predFlag to true if every faulty bit in the 32-bit cache word that corresponds
to the predFlag has a corresponding valid Map Unit. Finally, for each write
to the cache data array (data write or cache fill), the corresponding CPT
entry of written cache word must be read to update the value bits.
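The population steps above can be sketched as follows. This is only an illustration: `MapUnit`, `CPTEntry`, and `populate_entry` are hypothetical names, the fault lists stand in for the results of the BIST routine, and the sketch assumes that faulty bits are assigned to valid Map Units in ascending order of location, which the text does not specify.

```python
from dataclasses import dataclass, field

WORDS_PER_ENTRY = 4
BITS_PER_WORD = 32


@dataclass
class MapUnit:
    valid: bool = False
    location: int = 0  # index (0..127) of the duplicated cache bit
    value: int = 0     # duplicate bit, refreshed on every cache write


@dataclass
class CPTEntry:
    pred_flags: list = field(default_factory=lambda: [False] * WORDS_PER_ENTRY)
    map_units: list = field(default_factory=lambda: [MapUnit(), MapUnit()])


def populate_entry(faulty_cache_bits, map_unit_ok):
    """Build one CPT entry from (assumed) BIST results.

    faulty_cache_bits: sorted faulty bit indices within the 128 cache bits.
    map_unit_ok: two booleans, True if that Map Unit tested fault-free.
    """
    entry = CPTEntry()
    remaining = list(faulty_cache_bits)
    covered = set()
    for unit, ok in zip(entry.map_units, map_unit_ok):
        unit.valid = ok
        if ok and remaining:
            # Record one faulty cache bit location per valid Map Unit.
            unit.location = remaining.pop(0)
            covered.add(unit.location)
    for w in range(WORDS_PER_ENTRY):
        word_faults = {b for b in faulty_cache_bits if b // BITS_PER_WORD == w}
        # Predict only if every fault in this word has a valid Map Unit.
        entry.pred_flags[w] = word_faults <= covered
    return entry
```

For example, with faults at bits 5, 40, and 100 and two healthy Map Units, the first two faults are covered and only the word containing bit 100 has its predFlag cleared.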
Figure 5.8 shows the fast correction circuitry that uses entries from the
CPT to fix an error in a bit of the data word. The location field bits of the
two Map Units are decoded to determine which of the 128 bits are replaced
by the values stored in the Map Units. The Map Unit valid bit enables
this decoding. The least significant two bits of the word offset are used
to determine if the Map Unit points to Wordi. For every bit in the word
accessed from the cache (WordiBitx), a multiplexer is used to select between
the bit read out from the cache and the values stored in the Map Units. The
valid bit of a Map Unit enables the selection of its corresponding value field.
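A bit-level sketch of the selection logic may help. The function `fast_correct` is hypothetical, and the encoding of the 7-bit location field (upper two bits select the word, lower five bits select the bit within the word) is an assumption made for illustration; the hardware performs the same selection with decoders and multiplexers rather than a loop.

```python
def fast_correct(word, word_index, map_units):
    """Apply weak correction to one 32-bit word read from the data array.

    word:       32-bit value as read (possibly with faulty bits).
    word_index: position (0..3) of the word within its 128-bit group.
    map_units:  iterable of (valid, location, value), location in 0..127.
    """
    for valid, location, value in map_units:
        if not valid:
            continue                      # invalid Map Units never override
        if location // 32 != word_index:  # Map Unit points to another word
            continue
        bit = location % 32
        # MUX: take the duplicate bit instead of the cache bit.
        word = (word & ~(1 << bit)) | ((value & 1) << bit)
    return word & 0xFFFFFFFF
```

A Map Unit whose location falls in a different word leaves the accessed word untouched, which corresponds to the word-offset comparison described above.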
5.5.1 Detection and Correction Coverage
In order to evaluate the fast error correction mechanism we must calculate
what fraction of words it attempts to predict (the prediction rate) and of
those words how many are incorrectly predicted (the mis-prediction rate).
To calculate the prediction and mis-prediction rates, we first calculate
the probability of observing an error in the output of fast correction. An
error may exist in the output of fast correction if the total number of fault
bits among the four data words that share the same CPT entry exceeds
the number of correct Map Units in the entry. Note that since the Map
Units themselves have the same bit error probability as the words in the
L1 cache, one or more of the two correction units in a CPT entry may be
faulty. To calculate the probability that an error occurs at the output of the
fast correction circuitry, we observe that when the total number of faulty bits
among the four data words exceed the number of non-fault Map Units by one,
an error will appear when one of these four words is accessed. Therefore, the
probability that an error occurs when one of these four words is accessed given
that the error correction capability of the fast correction entry is exceeded
by one is 1/4. Similarly, the probability that an error will occur when one
of these four words are accessed given that the error correction capability of
the fast correction entry is exceeded by two is 2/4. The following equation
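summarizes the probabilities of all possible scenarios that cause an error in
the output of fast correction (each Map Unit has 9 bits; see Figure 5.7):
\[
P(error) = \sum_{i=0}^{2} \Big[ \binom{2}{i} \big(1-(1-p)^{9}\big)^{i} \big((1-p)^{9}\big)^{2-i} \sum_{j=3-i}^{128} \binom{128}{j}\, p^{j}(1-p)^{128-j} \cdot \min\big((j-2+i)/4,\ 1\big) \Big] \tag{5.1}
\]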
In the formula above, p is the fault probability of a single bit, i is the
number of faulty Map Units in the entry and j represents the total number
of bad bits in the four words with a total of 128 bits.
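The scenario sum can also be evaluated numerically. The sketch below is one reading of Equation 5.1 under stated assumptions: a Map Unit is treated as faulty if any of its bits is faulty, each Map Unit is assumed to have 9 bits (from Figure 5.7), and bit faults are assumed independent. The function name `p_error` and its parameters are hypothetical.

```python
from math import comb


def p_error(p, map_unit_bits=9, n_units=2, n_bits=128, n_words=4):
    """Probability that the output of fast correction contains an error.

    p: fault probability of a single bit (same for cache and CPT bits).
    """
    q = 1.0 - (1.0 - p) ** map_unit_bits  # assumed P(a Map Unit is faulty)
    total = 0.0
    for i in range(n_units + 1):          # i = number of faulty Map Units
        p_units = comb(n_units, i) * q**i * (1.0 - q) ** (n_units - i)
        capacity = n_units - i            # faults the entry can still fix
        inner = 0.0
        for j in range(capacity + 1, n_bits + 1):  # j = faulty cache bits
            p_bits = comb(n_bits, j) * p**j * (1.0 - p) ** (n_bits - j)
            # (j - capacity) uncorrected faults spread over four words.
            inner += p_bits * min((j - capacity) / n_words, 1.0)
        total += p_units * inner
    return total
```

Calling `p_error(0.011)` gives a value that can be compared against the P(error) implied by the calculated rates for 650mV; any difference would reflect the assumptions listed above rather than the original model.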
To determine the mis-prediction rate, i.e., the probability that the output
of fast correction is faulty, but prediction is still triggered, we note that it
is equal to the probability that the predFlag bit is faulty when the output
of fast correction is faulty. This probability is p · P(error). Prediction is
not attempted when the predFlag bit is faulty while the output of fast
correction is not faulty, even though it would have been beneficial to trigger
prediction. The probability of this occurring is p · [1 − P(error)]. Prediction
is also not attempted when a predFlag bit is fault-free while the output of fast
correction is faulty. In this case, the pipeline is correctly stalled until
strong correction completes. The probability of this occurring is (1 − p) ·
P(error). Table 5.1 lists calculated values for our cache operating at 650mV
where the bit failure rate is 0.011 (Figure 5.1). We note that we could
protect the CPT with Schmitt Trigger (ST) SRAM cells or increased supply
voltage. However, only 0.089% of cache accesses are mis-predicted, resulting
in an insignificant performance degradation. We argue that this is a good
tradeoff for not requiring an additional voltage rail or an unnecessary increase
in area (i.e., 100%) for the CPT.
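As a worked check of these expressions, the sketch below plugs in the 650mV numbers. The function `prediction_rates` is hypothetical, and P(error) is not taken from the text directly: it is back-computed from the reported 0.089% mis-prediction rate, which is an assumption of this example.

```python
def prediction_rates(p, p_err):
    """Return (prediction rate, mis-prediction rate).

    p:     fault probability of a single bit (including a predFlag bit).
    p_err: probability that the output of fast correction is faulty.
    """
    mispredict = p * p_err
    not_predicted_a = p * (1.0 - p_err)   # first non-prediction term
    not_predicted_b = (1.0 - p) * p_err   # second non-prediction term
    predict = 1.0 - (not_predicted_a + not_predicted_b)
    return predict, mispredict


p = 0.011
p_err = 0.00089 / p  # implied by the 0.089% mis-prediction rate (assumption)
predict, mispredict = prediction_rates(p, p_err)
```

With these inputs the prediction rate rounds to 91%, which agrees with the calculated value listed for 650mV.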
We also note that although our specific implementation of fast correction
leverages characteristics of the fault distribution (e.g., uncorrelated bit errors
where single errors are most common), this is not a requirement of fast error
correction. For example, if faults are correlated within a word, we could
increase the size of our value fields in the Map Units to improve the correct
speculation rate.
5.5.2 Latency Overhead
We model the CPT for our four-way set associative 32KB cache using CACTI
[122] to determine its latency. The table has a latency of 0.44ns in a 65nm
technology node operating at 1.2V. As shown in Figure 5.8, the critical path
of the fast correction circuit before the MUX that picks between a data bit
and a duplicate bit consists of the following: a 5-to-1 binary decoder, two
AND gates, a MUX to select between the outputs of the fast correction entries
that correspond to the two ways in the set, a MUX to select between the
two Map Units, and the three levels of inverters required for every location
to drive 32 bit slices. Using the FO4 delay of 65nm technology reported
in [121], these equate to a total delay of 0.2ns. In comparison, the access
latency of a four-way set-associative 32KB L1 cache (the associativity/size
of the L1 caches in our evaluations in Section 5.7) is 0.68ns. Therefore,
the delay of fast error correction circuitry prior to the MUX gate can be
effectively hidden by the latency difference between the L1 cache and the
CPT. Consequently, the MUX gate used to select between the data bit and
the duplicate bit is the only addition to the critical path of the cache access.
Following the methodology in [123], we estimate the MUX to be 0.5 FO4.
In our evaluations, we increase the clock period of the CP cores by 0.01ns at
nominal Vdd (1.2V) and 0.026ns at 650mV to account for these delays.
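The hiding argument is simple arithmetic on the three latencies quoted above. The sketch below uses those figures; the helper `hidden_behind_l1` is a hypothetical name, and the check assumes the CPT lookup and the decode logic are strictly sequential.

```python
CPT_ACCESS_NS = 0.44   # CPT latency from CACTI, 65nm at 1.2V
FAST_LOGIC_NS = 0.20   # decode/select logic ahead of the final MUX
L1_ACCESS_NS = 0.68    # four-way set-associative 32KB L1 access


def hidden_behind_l1(cpt_ns, logic_ns, l1_ns):
    """Return (is_hidden, slack_ns) for the pre-MUX correction path."""
    slack = l1_ns - (cpt_ns + logic_ns)
    return slack >= 0.0, slack


is_hidden, slack_ns = hidden_behind_l1(CPT_ACCESS_NS, FAST_LOGIC_NS, L1_ACCESS_NS)
```

Under this assumption the pre-MUX path finishes about 0.04ns before the data array, so only the final MUX lands on the critical path.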
5.5.3 Area Overhead
The area overhead of the fast correction technique is dominated by the Cor-
rection Prediction Table. We estimated using CACTI [122] that the area
of a Correction Prediction Table with two Map Units is 8.6% of the size of
a four-way set associative 32KB L1 cache. Depending on the desired
level of voltage scaling, fewer Map Units may be needed. For design points
with only a single Map Unit plus the four predFlag bits or with the four
predFlag bits only, the area overhead due to the Correction Prediction Table
normalized to cache area would be 5.1% and 1.6%, respectively.
We also estimated the area overhead of the fast correction circuit. As
shown in Figure 5.8, the fast error correction circuit that corrects a single bit
of the 32-bit word requires three MUXes, eight XOR gates, 12 AND gates,
and four 5-to-1 binary decoders, where each requires at most four AND gates,
for a total of 40 gates. The total number of gates required by the fast error
correction circuit is 40 · 32 = 1280 gates. Assuming that two such decoders
are needed to keep up with the issue width of the processor, this translates
to a maximum of 2560 gates, which incurs an area overhead of 1.3% compared
to a 32KB L1 cache.
5.5.4 Energy Overhead
We modeled the access energy of the CPT using CACTI [122] and determined
that an access to the table results in a 4% energy increase for every cache
access. We also estimated the energy consumed by the correction circuitry
(Figure 5.8) assuming an activity factor of 1 and the switching energy of a
65nm transistor given in [124]. This energy is 0.2% that of the access energy
of the 32KB L1 cache. Static power overheads are no worse than the area
overheads discussed in Section 5.5.3.
5.6 Methodology
In this section, we describe the methodology we used to evaluate CP for a
65nm technology node. Section 5.6.1 describes the strong error correction
baselines to which we apply CP. Section 5.6.2 describes the different designs
we evaluated. Section 5.6.3 describes our experimental details.
5.6.1 Strong Error Correction
CP can be applied to any strong error correction technique to reduce its
latency. We use CP in conjunction with three recently proposed strong error
correction schemes—Hi-ECC [77], VS-ECC [97], and Bit-Fix [111]. These
strong error correction methods vary in their area, capacity, and latency
overheads.
Our Hi-ECC implementation protects every word using a BCH code that is
capable of correcting five erroneous bits within a 59-bit codeword. We employ
an additional parity bit to detect a sixth error within the codeword. If zero
or one errors are detected, the evaluated BCH decoder only incurs the single
error correction latency, but not the much longer multi-bit correction latency.
Single-bit correction is used instead of multi-bit correction whenever
applicable. The multi-bit BCH correction is used if single-bit correction fails.
In our implementation, the ECC bits are stored in cache ways during low
voltage operation such that half the ways store data bits, while the other half
store ECC bits. The ECC cache ways store data during nominal operation
as in [112]. Our VS-ECC implementation, based on VS-ECC-fixed from [97],
uses seven bits for SECDED per cache word. At the same time, each cache
line contains four additional 20-bit extended ECC fields to accommodate a
5EC6ED BCH code for up to four words within the line.
The slow, strong correction module in Figure 5.6 contains a BCH decoder
implemented iteratively according to the modified Berlekamp-Massey algo-
rithm in [97]. With the addition of a modest amount of logic, this decoder
can detect errors and correct single bit errors with a small latency relative
to the latency of full 5-bit error correction [77]. Using the latency equations
from [120] and the 65nm technology parameters from [121], we calculate the
error detection latency for the 5EC6ED BCH code to be 0.34ns or 50% of
the access latency of a four-way set-associative 32KB L1 cache [122]. Single-
bit correction, as calculated from [77], takes 0.53ns or 78% of an L1 cache
access. Similarly, multi-bit correction requires 4.4ns or 648% of an L1 cache
access. Given the iterative BCH decoder used in [97] and Schmitt Trigger
10-T cell protection for the tag array, Hi-ECC has a total area overhead of
11.9%. VS-ECC, requiring additional static storage overhead, has a 41.4%
area overhead compared with our L1 cache.
The energy consumption of the BCH code used by both our Hi-ECC and
VS-ECC implementations depends on the number of errors in the input code-
word (i.e., zero, one, or more than one). At 650mV, an average of 11 bits are
bad per 1000 bits (Figure 5.1). At this bit error rate, the energy overhead of
the BCH decoding is calculated to be 0.86% that of the L1 access assuming











































































































[Figure 5.9 series: CP: Hi-ECC, CP: VS-ECC, CP: BitFix]
Figure 5.9: Impact of adding CP to different strong correction schemes at
650mV.
to be 0.3% that of an L1 access. Static power overheads are no worse than
the previously discussed area overheads.
Our Bit-Fix implementation is adapted from [111]—we assume access at
word granularity. Each access takes three cycles [111]. The decoding circuitry
has fewer than 26,000 transistors [111] or roughly 1.7% of a 32KB cache's data
array. Our Bit-Fix implementation also requires the 4.8% area overhead for
robust tag cells (10T-ST) resulting in a total area overhead of 6.5% of a 32KB
L1 cache. Using an activity factor of 1, our Bit-Fix circuitry has an energy
overhead of 1.2% the access energy of an L1 cache. Static power overheads
are no worse than the area overheads.
Figure 5.9 presents the latency, area, and energy impact of applying CP to
the above strong error correction schemes. In the worst case, CP increases
the latency of a cache access by up to 2.2% (e.g., when fast, weak correction
attempts to predict a word, but strong error correction determines that the
predicted word was incorrect). However, as shown in Table 5.1, most errors
can be predicted by fast error correction, allowing CP to reduce the average
latency of a cache access by 38% to 52% depending on the strong error
correction scheme. These benefits come at an area overhead of no more than
8%, a maximum dynamic energy overhead of 4.2%, and a maximum static
energy overhead of 8% at low Vdd.
Table 5.2 compares the latency, capacity, and area overheads of complete
CP schemes (including overheads from both CP and the specific strong error
Table 5.2: Latency, Capacity, and Area Overheads vs. an Ideal Cache
                          Nominal Voltage        Low Vdd (650mV)
Low Vdd Tolerance         Ave Lat    Capacity    Ave Lat    Capacity    Area
Technique                 Over.      Over.       Over.      Over.       Over.
CP + Hi-ECC               2%         0%          24%        100%        20%
CP + VS-ECC               2%         0%          24%        0%          50%
CP + BitFix               2%         0%          19%        33%         13%
10T ST Cell SRAM [125]    60%        0%          60%        0%          100%
correction scheme) to those of a strong circuit-level technique. At nominal
voltage, all CP schemes have significantly smaller latency and area overheads
compared to a 10T ST SRAM cell. The 10T ST SRAM [126] has significantly
better reliability at low voltage than 8T [127] and 10T [128], yet it still cannot
provide sufficient reliability for a 32KB cache compared with a strong error
correction technique such as Bit-Fix [111].
5.6.2 Design Points
As shown in Figure 5.9, applying CP reduces the average access latency of
a reliable, low Vdd L1 cache. At the processor level, this can result in sig-
nificant performance and energy benefits for those processors (e.g., in-order
cores) where the latency of cache access can significantly determine perfor-
mance. To quantify the processor benefits, we evaluate CP against three
design points presented in Table 5.3. The first baseline, Nominal Baseline,
is a seven-stage pipeline running at 2.68GHz at 1.2V. Note that cache ac-
cesses take two cycles for this baseline. The second baseline, Deep Pipe, has
additional stages (for both the L1 ICache and L1 DCache) to accommodate
the latency of error correction required during 650mV operation without af-
fecting the frequency of the pipeline during nominal (1.2V) operation. For
Hi-ECC and VS-ECC, error correction latency requires two cycles given the
single-bit correction latency described in Section 5.6.1, while Bit-Fix requires
three cycles. The frequency of the Deep Pipe baseline (and other baselines)
at 650mV is determined using the operating point pairs from [111] and as-
suming a linear voltage-frequency scaling (this is similar to the methodology





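                                   Nominal     Deep
                                   Baseline    Pipe        SEA         CP
Processor Frequency    Low Vdd     -           968 MHz     968 MHz     945 MHz
                       Nom Vdd     2.68 GHz    2.68 GHz    2.68 GHz    2.56 GHz
L1 $ Latency           Low Vdd     -           2.37 ns     2.37 ns     2.37 ns
                       Nom Vdd     0.68 ns     0.68 ns     0.68 ns     0.68 ns
L1 Singlebit           Hi-ECC      -           2 cycles    2 cycles    2 cycles
Correction Latency     VS-ECC      -           2 cycles    2 cycles    2 cycles
                       BitFix      -           3 cycles    3 cycles    3 cycles
L1 Multibit            Hi-ECC      -           13 cycles   13 cycles   13 cycles
Correction Latency     VS-ECC      -           13 cycles   13 cycles   13 cycles
                       Bit-Fix     -           3 cycles    3 cycles    3 cycles
Pipeline Stages        Hi-ECC      7           11          9           9
                       VS-ECC      7           11          9           9
                       Bit-Fix     7           13          10          10
L2 Latency                         10 ns       10 ns       10 ns       10 ns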
used in [129, 97]).
The third design point is Speculate on Every Access (SEA) where every
cache access is speculated upon. I.e., the uncorrected value is returned to
the pipeline and to the strong correction circuitry at the same time. This
design point represents the natural extension of [117] using hardware error
correction. This design will suffer from frequent squashing of speculative
instructions and more frequent stalls for long-latency correction. SEA also
requires the additional pipeline stages that CP needs for forwarding specu-
lative values. A fourth design point is SECDED+Disable where each word
is protected by an ECC that can correct up to one error (this is the same
SEC as used for our VS-ECC implementations). If more than one error is
identified, that word is disabled, requiring an access to the next level in the
cache hierarchy. Since SECDED requires more than one cycle of additional
latency, we implement it in an 11-stage pipeline. The final design point is CP.
It has a slightly lower frequency at nominal voltage than the other design
points and a slightly lower frequency than Deep Pipe and SEA at 650mV to
account for the addition of two MUXes to the critical path of a cache access
(see Figure 5.8).
Table 5.4: Processor Configuration
Number Cores                     1 in-order
Register File                    32 Int, 32 FP
Fetch/Decode/Issue Width         2/2/2
BTB Size                         4096 entries
RAS Size                         16 entries
Branch Predictor                 Tournament
ALUs/FPUs/MDUs                   2/1/1
Cache Line Size                  64B
L1 I$
  Nominal (1.2V, all designs)    32KB, 4-way
  Hi-ECC (650-840mV)             16KB, 2-way
  VS-ECC (650-840mV)             32KB, 4-way
  Bit-Fix (650-840mV)            24KB, 3-way
L1 D$
  Nominal (1.2V, all designs)    32KB, 4-way
  Hi-ECC (650-840mV)             16KB, 2-way
  VS-ECC (650-840mV)             32KB, 4-way
  Bit-Fix (650-840mV)            24KB, 3-way
L2 Unified $                     1MB, 8-way
Memory Configuration             2 GB of 1066 MHz DDR3
5.6.3 Experimental Setup
We evaluate CP over benchmarks from the Spec2000 [130] and Spec2006 [131]
benchmark suites executing on a 2-issue in-order core. Core microarchitec-
tural parameters were chosen to be similar to the ARM Cortex-A7 [132] and
are enumerated in Table 5.4. An aggressive branch predictor was chosen
to not unduly penalize Deep Pipe. Performance simulations were run using
Gem5 [23], fast forwarding for 1 billion instructions and executing for 1 bil-
lion instructions. Frequency for a given voltage was determined by the linear
scaling in [97]. Nominal dynamic and static power overheads were deter-
mined using the simulation results and McPAT [133]. Low supply voltage
dynamic power was scaled quadratically with respect to voltage [111]. Low
supply voltage static power was scaled cubically with respect to voltage [111].
The L2 and the main memory are not scaled (i.e., we model them to have
an absolute latency in terms of ns).
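The voltage scaling rules stated above translate into simple scale factors. The sketch below is a minimal illustration with a hypothetical helper `scale_power`; it applies only the quadratic and cubic voltage terms named in the text and does not model any additional frequency dependence.

```python
def scale_power(dynamic_nom, static_nom, v_low, v_nom=1.2):
    """Scale nominal dynamic and static power to a lower supply voltage.

    Dynamic power is scaled quadratically and static power cubically
    with respect to voltage.
    """
    ratio = v_low / v_nom
    return dynamic_nom * ratio**2, static_nom * ratio**3


# Scale factors for 650mV relative to the 1.2V nominal point.
dyn_factor, static_factor = scale_power(1.0, 1.0, v_low=0.65)
```

At 650mV these factors come to roughly 0.29 for dynamic power and 0.16 for static power relative to nominal.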
5.7 Experimental Results
In this section, we evaluate the impact of the reduced error correction latency







































































































































































































Figure 5.11: Energy for 650mV operation.
5.7.1 Low Voltage (650mV) Operation
Figure 5.10 shows the performance of the different schemes at 650mV com-
pared to the nominal baseline operating at 1.2V. For simplicity, we discuss
Hi-ECC results here, although the trends for VS-ECC and Bit-Fix are similar.
CP achieves a 22% performance benefit over the deepened pipeline design
in spite of its slightly lower frequency (945MHz vs 968MHz). This is due to a
reduced penalty for branch and data hazards since the number of pipeline
stages is smaller (9 vs. 11). Furthermore, the additional pipeline stages in CP
continue to forward to dependent instructions, while the additional pipeline
stages in the deepened pipeline design lead to stalls for the load-dependent
instructions. CP increases performance by 42% over the SECDED with word
disable scheme, which performs the worst of all correction schemes on aver-
age. When compared to Speculate on Every Access (SEA), CP increases
performance by 18%. There are two reasons for this performance improve-
ment. First, CP reduces the mis-prediction rate and thus the penalty from






































































































Figure 5.12: Performance during nominal voltage operation.
latency stalls for strong error correction because the fast, weak correction
in CP can correct errors within the data word, which may reduce multi-bit
errors to single-bit errors. Finally, CP achieves average performance within
13% of the ideal, zero-latency, zero-capacity-overhead correction. The slight
difference in performance is due to the slightly lower frequency of operation at
650mV and to stalls due to non-predictions and mis-predictions (see Table 5.1).
When compared against the SECDED+Disable design point, CP has a per-
formance benefit of 42% because SECDED requires additional pipeline stages
for both L1 caches and also incurs long-latency misses to the next levels of
the cache hierarchy.
Figure 5.11 shows the energy benefits that the error correction predic-
tion scheme can provide over the reduced frequency and pipeline deepening
schemes. The observed benefit of CP compared to the deepened pipeline
design is 19% and it occurs due to the 22% higher performance. CP has 16%
lower energy than SEA because of the increased performance from correct
prediction and decreased power consumption. All error correction techniques
show a reduction of average energy by 58% or more relative to the nominal
baseline. We report energy benefits for a core where logic and SRAMs are on
a single power rail. If logic and SRAMs are on different rails, energy benefits
will still be significant because SRAM energy would dominate at low supply
voltages.
5.7.2 Nominal Operation
We recognize that for many applications low power/energy operating points
are desirable, but they must not come at the price of hurting performance at
nominal operating points. We evaluated the performance of each design point
at the nominal voltage (1.2V). Figure 5.12 shows that CP has only a slight
performance degradation of 1.2% when operating at the nominal voltage
due to the addition of a MUX on the critical path of L1 cache accesses. This
shows that CP is not only an attractive design point for fixed voltage systems
running at low voltages (e.g., 650mV), but also for systems that support
DVFS to provide high performance at nominal operating points. In contrast,
pipeline deepening has an average performance degradation of over 15%.
The high performance degradation is due to two factors: (1) data and control
































































Figure 5.13: Voltage scaling sensitivity study.
hazards are more expensive due to the two additional pipeline stages, and (2) since
these stages cannot forward values until the access is complete, dependent
instructions have to stall. Finally, the reduced frequency design and SEA
have high performance at nominal operation. However, these designs may
have significant performance and energy overhead at low voltage operation
(Figure 5.10, Figure 5.11).
5.7.3 Sensitivity
The benefits provided by CP are dependent on several parameters, including
the voltage of operation, the latency impact of CP, and the fault rates at
a given voltage. In this section we perform a sensitivity analysis of these
parameters.
Our results in Section 5.7.1 were presented for 650mV. It is the lowest
voltage at which all the ECC bits for a 32-bit data word can fit within an-
other 32-bit word. In this section, we compare the performance of correction
prediction against other design points over higher supply voltages and corre-
sponding error rates. For Hi-ECC and VS-ECC, we select the weakest BCH
code which guarantees reliability at that voltage equivalent to the reliability
at nominal voltage. This means that the given supply voltage is the lowest
supply voltage at which the design can be run (e.g., the BCH decoder that is
required for 840mV operation would not be sufficient to operate at 700mV).
























































[Figure 5.14 series: CP: Hi-ECC 2x Lat, CP: Hi-ECC 4x Lat]
Figure 5.14: Fast correction table latency sensitivity study.
The CP error correction hardware is constant across all supply voltages.
Figure 5.13 shows the results. The results show that CP allows operation
over a large voltage range at a performance comparable to the ideal scheme.
Performance degradation of CP applied on top of Hi-ECC compared to the
ideal is only 13%, 7%, and 7% at 650mV, 700mV, and 840mV respectively.
The primary reason why CP performs comparably to the ideal is that a
large fraction of errors over a wide voltage range are predictable by the fast,
weak correction. We also observe that while CP provides better performance
than other design points over the entire range that was evaluated, the per-
formance advantage depends on the operating voltage. For example, while
CP has nearly identical performance to SEA at 840mV, the performance ad-
vantage increases to 18% at 650mV. This is because the number of errors is
considerably higher at 650mV than at 840mV. For SEA, this leads to a significantly
higher number of squashes and long-latency stalls. For CP, most of the
cache accesses continue to be predicted by the fast, weak correction scheme
at 650mV. Finally, CP continues to perform much better than reduced fre-
quency and pipeline deepened designs even at higher supply voltages (e.g.,
840mV). This is because the reduced frequency design has significantly lower
frequency even at 840mV (1.1GHz vs. 1.5GHz). Similarly, the deepened
pipeline not only suffers from the standard disadvantages of having a large
number of pipeline stages (increased cost of branch and data hazards), but
also has additional stalls due to the load-dependent instructions waiting for
































































Figure 5.15: Fault rate sensitivity study.
the error correction for the data cache access to complete.
The results in Section 5.7.1 correspond to a 2.5% latency overhead for
CP as modeled by the methodology in Section 5.5.2. Figure 5.14 shows the
performance of CP applied to Hi-ECC (the correction scheme CP performs
worst with) assuming its latency overhead is double and quadruple our calcu-
lated value. Despite the increased overhead (5% and 10%), CP still increases
performance by 16% and 12%, respectively.
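The overhead percentages can be related to the clock period with a short calculation. The sketch below assumes that the 2.5% figure corresponds to the 0.026ns clock period increase at 650mV applied to the 968 MHz low-Vdd clock; `clock_overhead` is a hypothetical helper.

```python
def clock_overhead(freq_mhz, added_ns):
    """Fraction by which `added_ns` lengthens a clock of `freq_mhz`."""
    period_ns = 1000.0 / freq_mhz
    return added_ns / period_ns


base = clock_overhead(968, 0.026)          # calculated CP overhead
doubled = clock_overhead(968, 2 * 0.026)   # "2x Lat" sensitivity point
quadrupled = clock_overhead(968, 4 * 0.026)  # "4x Lat" sensitivity point
```

Under this assumption the three values come to approximately 2.5%, 5%, and 10%, matching the overheads considered in the sensitivity study.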
Our results in Section 5.7.1 are based on the 65nm fault rate data from [113]
(note that prior work [97, 111] used 130nm fault rate data). We explored the
sensitivity of benefits to fault rates nearly 5x lower than [113]. Figure 5.15
shows that CP can still provide performance improvements of 14-20% for
a lower fault rate vs. voltage curve. These results (including results in
Figures 5.13, 5.14, and 5.15) attest to the generality of our approach.
The results reported above are for min-sized 6T SRAMs. Prior work
(e.g., [125]) has demonstrated that circuit-level techniques such as doubling
the size of transistors in 6T cells or using 8T cells may allow reduced volt-
age operation. However, CP can still yield performance benefits when used
with such circuit-level techniques. To quantify the benefits we rely on the
fault rate vs. voltage dataset from [125] and the voltage vs. frequency scaling
dataset from [134] (see footnote 2). When implemented on top of 6Tx2 SRAM cells,
Footnote 2: Unfortunately, the data set used for our main set of results in Section 5.7.1 did not
have fault rate vs. voltage data for upsized 6T or 8T SRAM cells. Additionally, the linear
voltage vs. frequency scaling assumption no longer applies for the low voltages allowed by
Figure 5.16: 6T 2x upsized sensitivity study using [125, 134].


























































Figure 5.17: 8T sensitivity study using [125, 134].
Figure 5.18: 6T sensitivity study using [125, 134].
CP can achieve a performance improvement of up to 13-20% as shown in
Figure 5.16. Such upsizing results in a 33% area increase and a doubling of
static power. By using an 8T SRAM cell, the additional static power can
be reduced. When implemented on top of 8T SRAM cells, CP provides up
to a 13-20% performance improvement as shown in Figure 5.17. For com-
pleteness, Figure 5.18 shows CP results when applied to 6T cells using these
datasets. At the lowest voltage where the reliability target can be met, these
results show a performance benefit of 12-19% for 6T transistors, similar to
the primary results presented in Section 5.7.1.
5.8 Conclusion
On-chip memories consume an increasingly large fraction of chip power. The
reliability of on-chip memories determines their voltage and, therefore, the
power these memories consume. Voltage scaling can be used to significantly
reduce the power consumed by on-chip memories and chips as a whole. How-
ever, aggressive voltage scaling leads to high error rates in on-chip memories
(e.g., caches). Strong error correction can be used to tolerate high error rates
in on-chip memories. However, such strong error correction may require sig-
nificant latency relative to the memory access itself. We propose correction
(Footnote 2, continued) 6Tx2 and 8T cells. Therefore, we use the non-linear voltage vs. frequency scaling data
from [134].
prediction, a scheme that reduces the latency of strong error correction by
using a fast mechanism to predict the result of strong error correction. We
present CP, a fast, weak correction mechanism that predicts the result of
strong error correction with a mis-prediction rate of <0.1%. This reduces
the effective access latency of a 32KB, four-way SRAM L1 cache by 38%-
52%. For a two-issue in-order core, CP provides 16%-21% energy reduction
compared with using a strong error correction scheme alone, while incurring




This chapter builds on Chapter 5 to reduce the long multi-error error correc-
tion latency as well. Multi-bit error correction is typically performed via a
slow iterative algorithm that corrects one error at a time. On the other hand,
there are many low-overhead parallel error correction algorithms that can
correct multiple errors in parallel at low latency, area, and power overheads in
the common case, but unfortunately do not guarantee correction of all multi-bit
errors. To provide low-latency multi-bit error correction, this chapter explores
how to effectively exploit the ability of low-overhead parallel correction
algorithms to correct multi-bit error patterns in the common case.
6.1 Introduction
Current and future process technologies face serious power challenges. A
well-known technique that effectively reduces processor power consumption
is supply voltage scaling. One major challenge for voltage scaling is that the
reliability of SRAM cells decreases with the supply voltage. As the supply
voltage decreases, a rapidly increasing fraction of SRAM cells become faulty
due to process variations [126, 125]. This problem becomes more pronounced
at smaller feature sizes (Figure 6.1) and is expected to get worse [113]. As
such, the reliability of SRAM cells, and not the reliability of logic circuits,
often determines the extent to which supply voltage can be reduced [111, 135,
9]. Although using more robust SRAM cell implementations with larger area
and leakage can improve the reliability of SRAMs at low supply voltages, a
significant fraction of these SRAM cells are still faulty at low voltages, as
shown in Figure 6.2.
Many recent works have investigated using an error correcting code, or






























Figure 6.1: SRAM failure probability w.r.t. voltage for different
process technologies [126, 125, 136]. Technologies with smaller feature
sizes exhibit higher fault rates.
[112, 97, 137, 115]. However, there is often a strong trade-off between er-
ror correction coverage and latency between different ECC schemes. For
example, Figure 6.3 shows that the four-bit-correcting BCH (127,64) ECC
can tolerate many factors higher bit failure rates than weaker ECC schemes
for the same coverage; however, this comes at the cost of incurring up to 20X
higher latency overhead (see footnote 1). Although stronger ECC schemes provide the high
error correction coverage needed to enable low supply voltage, their high er-
ror correction latencies make them unattractive for on-chip memories, where
latency is often critical.
We propose error pattern transformation, a novel low-latency microarchi-
tectural technique that allows on-chip memories to be scaled to voltages lower
than what has been previously possible. We observe that although many
ECCs only guarantee correction of a few memory errors, they can oppor-
tunistically correct more errors depending on the error pattern in the logical
words they protect. As such, we propose transforming the uncorrectable
error patterns in logical words into correctable ones by intelligently rear-
ranging the logical bit to physical bit mapping when storing a logical word
into a physical word in an on-chip memory. This improves on-chip memory
reliability and, therefore, leads to a reduction in the minimum voltage at
which on-chip memory can be run reliably.
Error pattern transformation (EPT) is a general technique that can be
Footnote 1: BCH decoder latency is taken from [120]; the calculation assumes an oracular BCH
decoder whose latency depends on the actual number of faults in a word, as opposed to





























Figure 6.2: 65nm SRAM failure probability w.r.t. voltage for different
SRAM cells [125]. Larger SRAM cells have lower but still significant fault rates
at low voltages.
applied on top of many prior works on improving on-chip memory reliability.
EPT can provide reliability benefits even in the presence of soft errors and
erratic faults. Our evaluations show that EPT reduces the minimum volt-
age at which on-chip memory can run by 70mV over the best low-latency
ECC baseline, leading to a 25.7% core-wide power reduction. Energy per
instruction is reduced by 15.7% compared to the best baseline.
6.2 Motivation
The key insight driving this work is that for many ECCs, the number
of errors that they can correct differs widely depending on the observed er-
ror pattern (i.e., the bit locations of the individual errors). Consider, for
example, a segmented ECC, which breaks a word into several independent
and equally sized segments, each with its own check bits for error correc-
tion [112]. The number of errors that a segmented ECC can correct depends
on how many errors are located in each segment, as illustrated in Figure 6.4.
Our second example ECC does not break a word into independent seg-
ments. An Orthogonal Latin Square Code, OLSC(128,64), protects 64 bits
of data with 128-64=64 bits of redundancy. OLSC(128,64) performs error
correction in multiple levels of majority voting, where each level consists of

























































Figure 6.3: ECC latency and coverage for 64-bit data words. Low-latency
ECCs typically provide lower coverages than long-latency ECCs.
Figure 6.4: Uncorrectable (above) and correctable (below) error patterns in
a segmented ECC that corrects one error per segment. Low-latency ECCs
can correct different numbers of errors for different error patterns.
for low-latency error correction. When there are too many errors in the in-
puts to one of the majority voters, the output of the voter can flip; this can
in turn flip the outputs of subsequent voting levels. As such, the number of
errors that the OLSC ECC can correct again depends on the error pattern.
OLSC(128,64) can correct up to 32 bad bits among the 128 bits for some spe-
cific error patterns; for all error patterns in general, however, OLSC(128,64)
only guarantees correction of up to four bad bits.
In fact, it is quite common for many low-latency ECCs (the latency-
coverage trade-offs of different ECCs are shown in Figure 6.3) to have some
error patterns for which more errors can be corrected than other error pat-
terns. Figure 6.5 shows the fraction of t-bit error patterns that are correctable
under different ECCs. Figure 6.5 considers 64-bit data words; it presents
the unsegmented OLSC(128,64) ECC as well as a segmented Hamming(7,4)


















































[Figure 6.5 legend: segmented Hamming(7,4), segmented OLSC(8,4), OLSC(128,64), BCH(127,64)]
Figure 6.5: Fraction of t-bit errors that are correctable by different ECCs.
Many t-bit errors are correctable even for large values of t.
a 64-bit data word into segments of k data bits with n − k check bits per
segment. Figure 6.5 shows that a significant fraction of t-bit error patterns
are correctable for a large range of t even though segmented Hamming(7,4),
segmented OLSC(8,4), and OLSC (128,64) guarantee correction of only 1, 1,
and 4 errors, respectively.
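Fractions of this kind can be estimated by random sampling. The following is a minimal sketch for the segmented Hamming(7,4) case only, not the evaluation code behind Figure 6.5. It assumes a 64-bit data word stored as 16 independent 7-bit segments, draws t distinct error positions uniformly at random, and counts a pattern as correctable when no segment holds more than one error. The function names, trial count, and seed are illustrative choices.

```python
import random

SEGMENT_BITS = 7      # Hamming(7,4): 4 data bits + 3 check bits per segment
NUM_SEGMENTS = 16     # assumed: 64 data bits / 4 data bits per segment
WORD_BITS = SEGMENT_BITS * NUM_SEGMENTS


def is_correctable(error_positions):
    """Return True if no segment contains more than one erroneous bit."""
    segments_hit = set()
    for pos in error_positions:
        seg = pos // SEGMENT_BITS
        if seg in segments_hit:
            return False
        segments_hit.add(seg)
    return True


def correctable_fraction(t, trials=20000, seed=0):
    """Monte Carlo estimate of the fraction of t-bit patterns that are correctable."""
    rng = random.Random(seed)
    hits = sum(
        is_correctable(rng.sample(range(WORD_BITS), t)) for _ in range(trials)
    )
    return hits / trials
```

Under these assumptions, every single-bit pattern is correctable, while no 17-bit pattern is, since 17 errors cannot be spread over 16 segments with at most one error each. Values of t in between yield fractions strictly between zero and one.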
We improve the coverage of low-latency ECCs by adaptively transforming
an uncorrectable BIST-detectable t-bit error pattern in a logical word into
one of the many correctable t-bit error patterns (the latency-coverage trade-
offs of different ECCs are shown in Figure 6.3). This allows low-latency
ECCs to obtain similar coverage as high-latency ECCs (e.g., the coverage of
the BCH(127,64) shown in Figure 6.5), while incurring only a small fraction
of the latter’s long latency overhead. The improved coverage allows on-
chip memories to be scaled to voltages lower than what has been previously
possible.
6.3 Error Pattern Transformation
Consider the proposed technique in the context of a cache. We observe that the
same physical fault pattern in a cache word can manifest as different error
patterns in the logical word presented to the ECC decoder depending on the
mapping of the logical bits in a cache word to the physical bits stored in the
physical cache word. For example, the logical bit to physical bit mapping
from a logical word to a physical cache word, or bit ordering, shown in the left
half and the right half of Figure 6.6 generates error vectors ‘01010000’ and
Figure 6.6: Possible logical to physical bit orderings. The error pattern in a
logical word depends on the bit ordering used to access the physical word.
‘10000001’, respectively, even though the physical fault pattern in the cache
word remains the same. We seek to transform uncorrectable error patterns
in the logical words into correctable error patterns by supporting multiple
bit orderings per physical cache word. In comparison, conventional cache
designs support only a single bit ordering, which maps bit x in a logical word
(e.g., logical bit 0) to the same bit x in the physical cache word (e.g., physical
bit 0).
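The effect of a bit ordering on the error vector can be illustrated with a small sketch. It does not reproduce the orderings of Figure 6.6; the 8-bit fault pattern, the rotation-by-three ordering, and the 4-bit segment size used to count errors are all hypothetical choices made for illustration.

```python
def logical_error_vector(physical_faults, ordering):
    """Error vector seen by the ECC decoder.

    ordering[i] is the physical bit that stores logical bit i.
    """
    return [physical_faults[p] for p in ordering]


def errors_per_segment(error_vector, segment_bits):
    """Count erroneous bits in each consecutive segment of the logical word."""
    return [
        sum(error_vector[s:s + segment_bits])
        for s in range(0, len(error_vector), segment_bits)
    ]


# Hypothetical 8-bit physical word with faulty cells at physical bits 1 and 3.
physical_faults = [0, 1, 0, 1, 0, 0, 0, 0]

identity = list(range(8))                   # conventional single ordering
rotated = [(i + 3) % 8 for i in range(8)]   # one alternative ordering

conventional = logical_error_vector(physical_faults, identity)
alternative = logical_error_vector(physical_faults, rotated)
```

With the identity ordering both errors fall in the first 4-bit segment, whereas the rotated ordering places one error in each segment. An ECC that corrects one error per segment would therefore fail under the first ordering and succeed under the second, for the same physical faults.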
When accessing a cache word with a known (BIST-detectable) physical
fault pattern, our proposal uses a pre-recorded bit ordering that generates
only correctable errors for the known physical fault pattern in the cache word;


















































Figure 6.7: The physical fault coverages of low-latency ECCs with different
numbers of bit orderings. With multiple bit ordering options, the
low-latency ECCs approach, and sometimes exceed, the slow BCH ECC in
coverage.
Under our proposal, a physical fault pattern in a cache word is uncorrectable
only if no bit ordering that generates only correctable error patterns for the
given physical fault pattern can be found; intuitively, the larger the number
of supported bit orderings, the less likely that a bit ordering that generates
only correctable error patterns cannot be found for a given physical fault
pattern.
Figure 6.7 shows the calculated ideal fault coverages for different ECCs
under different numbers of supported bit orderings (see footnote 2). Figure 6.7 shows that
the fraction of correctable fault patterns increases significantly as the number
of supported bit orderings increases. In fact, for a sufficient number of
available bit orderings (e.g., 32 to 256 bit orderings), coverage approaches, and
sometimes exceeds, that of the high-latency BCH ECC. This improved coverage can
be used to scale voltages lower than what has been previously possible
for cache memories.
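The ideal coverage estimate used for Figure 6.7 (one minus the n-th power of the uncorrectable fraction f(t), under the independence assumption stated in footnote 2) is easy to evaluate numerically. The sketch below is illustrative only; the value f(t) = 0.5 is a made-up input, not a measured one.

```python
def coverage_with_orderings(f_uncorrectable, n_orderings):
    """Ideal fraction of t-bit fault patterns correctable with n independent orderings.

    f_uncorrectable is f(t), the fraction of t-bit error patterns that a given
    ECC cannot correct under a single ordering.
    """
    return 1.0 - f_uncorrectable ** n_orderings


# Hypothetical example: half of the t-bit patterns are uncorrectable by themselves.
single = coverage_with_orderings(0.5, 1)   # one ordering
eight = coverage_with_orderings(0.5, 8)    # eight independent orderings
```

In this hypothetical case the coverage rises from 0.5 with one ordering to 1 - 0.5^8 = 0.99609375 with eight, which shows why even a modest number of orderings can close most of the gap.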
Our proposal requires addressing three main challenges. First, how to
identify the appropriate bit ordering for each cache word. Second, how to
record the identified bit orderings in a space-efficient manner. Third, how
to enact the appropriate bit ordering during cache accesses at low latency
overhead. We address these challenges in Sections 6.3.1 to 6.3.3. Finally,
Section 6.3.4 walks through examples of how to apply our proposal.
6.3.1 Bit Ordering Selection
Low voltage SRAM faults are largely hard faults that can be identified using
a BIST (Built-in Self Test) routine that is run either during post-manufacture
testing or at set intervals during processor lifetime [111, 97]. The following
describes a general approach, applicable to different ECCs, for selecting the
appropriate bit ordering that only generates correctable errors for a physical
fault pattern that has been identified in a cache word.
Footnote 2: Coverages are determined as follows. Let f(t) be the fraction of t-bit error patterns
that are uncorrectable by an ECC scheme; we obtained f(t) for each of the fast ECC
implementations via Monte Carlo simulations. Given that there are n ways to reorder the
logical bits stored in a physical word, a t-bit fault pattern is uncorrectable if the fault
pattern expresses uncorrectable error patterns under all n bit orderings. The probability
that a t-bit fault pattern is expressed as an uncorrectable error pattern under all n bit
orderings is $f(t)^n$, assuming that the n orderings are generated independently from one
another. The fraction of t-bit fault patterns that are correctable given n bit orderings is,
therefore, $1 - f(t)^n$.
The first step is to identify the physical fault vector of a cache word at
low supply voltage using BIST. Next, the fault vector is sent to the bit
reordering logic (described in detail in Section 6.3.3), which modifies the
ordering of the bits in the fault vector. The modified fault vector is then
fed into a correctability checker to check whether all possible error patterns
due to the modified fault vector are correctable. The correctability checker
is specific to an ECC. For example, for the segmented Hamming (7,4) ECC,
the correctability checker counts the number of ‘1’s in every segment of seven
adjacent bits in the fault vector under test; the checker asserts a fail signal if
any segment contains more than one ‘1’ and asserts a pass signal otherwise.
If the check passes, the tested bit ordering for the cache word is recorded
so that the ordering will be used to access that cache word for all future
accesses. Otherwise, the bit reordering logic checks another bit ordering of
the identified fault vector until one of the bit orderings passes or until all
available bit orderings have been tested; in the latter case, the cache word is
reported as having an uncorrectable fault pattern.
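The selection procedure above can be sketched in software as follows. This is a minimal illustration under stated assumptions, not the hardware described in this chapter: the checker implements the segmented Hamming(7,4) rule from the text, while `column_rotate` is a hypothetical stand-in for the bit reordering logic that rotates bit c of every segment by oid * c segment positions.

```python
SEGMENT_BITS = 7  # segmented Hamming(7,4): one correctable error per 7-bit segment


def hamming74_checker(fault_vector):
    """Pass only if every 7-bit segment contains at most one faulty bit."""
    for start in range(0, len(fault_vector), SEGMENT_BITS):
        if sum(fault_vector[start:start + SEGMENT_BITS]) > 1:
            return False
    return True


def column_rotate(fault_vector, oid):
    """Hypothetical reordering: move bit `col` of each segment by oid * col segments."""
    num_segments = len(fault_vector) // SEGMENT_BITS
    reordered = [0] * len(fault_vector)
    for seg in range(num_segments):
        for col in range(SEGMENT_BITS):
            dst_seg = (seg + oid * col) % num_segments
            reordered[dst_seg * SEGMENT_BITS + col] = fault_vector[seg * SEGMENT_BITS + col]
    return reordered


def select_ordering(fault_vector, num_orderings,
                    reorder=column_rotate, checker=hamming74_checker):
    """Return the first OID whose reordered fault vector passes, or None."""
    for oid in range(num_orderings):
        if checker(reorder(fault_vector, oid)):
            return oid
    return None  # report the word as having an uncorrectable fault pattern


# Two 7-bit segments; both faults start out in segment 0 (bits 0 and 1).
faults = [0] * 14
faults[0] = faults[1] = 1
chosen = select_ordering(faults, num_orderings=4)
```

Here ordering 0 (the identity) fails because segment 0 holds two faults, and ordering 1 passes because it moves the fault in column 1 into segment 1. A word with more faults than segments can never pass, so the loop exhausts all orderings and returns `None`.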
Figure 6.8 summarizes the steps described above. Note that all of the steps
only have to be performed once for each cache word (e.g., once during post-
manufacturing testing). They are not performed during normal operations
and, therefore, do not affect runtime performance. Assuming a 50s BIST
overhead to identify all fault patterns for a 2MB cache [97], our 32KB L1
cache requires 0.78s. Assuming the worst case of $2^8$ OID calculations per word
and two cycles per calculation (one cycle to generate a new bit ordering and
one cycle to check the new ordering), a 32KB L1 cache requires only 131K
cycles. Therefore, selection of OID bits would add a negligible amount of
time to the BIST routine.
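The 0.78s figure is consistent with scaling the BIST time linearly with cache capacity, which is an assumption made here for illustration (2MB is 64 times 32KB):

```latex
t_{\mathrm{BIST}}^{\mathrm{L1}}
  \approx 50\,\mathrm{s} \times \frac{32\,\mathrm{KB}}{2\,\mathrm{MB}}
  = \frac{50\,\mathrm{s}}{64}
  \approx 0.78\,\mathrm{s}
```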
6.3.2 Tracking Bit Orderings
To support $2^k$ different bit orderings per cache word, we allocate k bits per
physical cache word to record the chosen bit ordering for each physical cache
word. We call these k bits per cache word the Ordering ID or OID of the
physical cache word. We store the OIDs in the tag array where they are
accessed at the same time as the tags. Similar to prior work [97], a copy















Figure 6.8: Selecting the appropriate bit ordering for a cache word. A BIST
circuit generates the fault pattern of a physical word. Logical bit
reorderings are tried until one guarantees that the logical word will be
correctable.
switches to low voltage or after the chip is powered down.
6.3.3 Bit Reordering and Order Recovery
Prior to a write to a physical cache word, we reorder the bits in the logical
word to be written according to the recorded OID of the cache word. Con-
versely, after reading from a cache word, the earlier modification to the bit
ordering has to be reversed to recover the original logical word. Figure 6.9
details the steps needed for accessing a cache word. During a cache access,
the OID is read at the same time as the tag; however, bit reordering and or-
der recovery are on the critical paths of cache writes and reads, respectively.
It is critical that these two steps are implemented at very low latency.
To provide low latency bit reordering and order recovery, we evenly di-
vide each logical word into multiple small groups of bits and perform bit
reordering and order recovery on all groups in parallel; bits in each parallel
reordering group (PRG) can only be moved within the PRG but not across
PRGs. Intuitively, the smaller the PRG size, the higher the parallelism in the
reordering and order recovery logic, and therefore, the lower the latency of
the bit reordering and order recovery logic. However, the smaller the PRG




















Figure 6.9: Cache access action flow. Since bit reordering is on the critical
path of a cache access, it is critical that it have low latency.
fault coverage because it reduces the number of correctable error patterns
that can be exploited. As such, the size of the PRGs needs to be selected
based on the underlying ECCs. Section 6.3.4 walks through some examples
of PRG selection and sizing.
A straightforward way to reorder the bits in a PRG is to use a set of
static circuits, each designed to reorder bits within a PRG in a specific way.
However, this can lead to a large area overhead when supporting a large
number of bit orderings per cache word. Instead, we use barrel shifters as
a more scalable approach to reorder a PRG. A barrel shifter can generate a
different bit ordering within a PRG for a different shift distance input value.
In addition to providing good scalability, the barrel shifters can also reorder
the bits in a PRG at low latency since the PRGs are small (e.g., only 8 to
16 bits per PRG). Synthesis (details in Section 6.4.1) shows a barrel shifting
latency of less than 0.3 cycle, assuming 20 FO4s per cycle.
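A software model of per-PRG rotation may help make the write and read paths concrete. The sketch below is an assumption-laden illustration rather than the synthesized logic: it treats each PRG as a contiguous slice of the word purely for brevity, rotates left before a write, and rotates right by the same distance after a read. All function names are hypothetical.

```python
def rotate_left(bits, distance):
    """Model of a barrel shifter rotating a PRG left by `distance` positions."""
    d = distance % len(bits)
    return bits[d:] + bits[:d]


def rotate_right(bits, distance):
    """Inverse rotation, as used by the order recovery logic."""
    return rotate_left(bits, len(bits) - distance % len(bits))


def reorder_word(word_bits, prg_size, shift_distances):
    """Apply one rotation per PRG before writing the word to the cache."""
    out = []
    for g, start in enumerate(range(0, len(word_bits), prg_size)):
        out += rotate_left(word_bits[start:start + prg_size], shift_distances[g])
    return out


def recover_word(stored_bits, prg_size, shift_distances):
    """Undo the per-PRG rotations after reading the word from the cache."""
    out = []
    for g, start in enumerate(range(0, len(stored_bits), prg_size)):
        out += rotate_right(stored_bits[start:start + prg_size], shift_distances[g])
    return out


# Label the logical bits 0..15 so that their movement is visible.
word = list(range(16))
stored = reorder_word(word, prg_size=8, shift_distances=[3, 5])
restored = recover_word(stored, prg_size=8, shift_distances=[3, 5])
```

Because every PRG is rotated independently, all rotations can proceed in parallel, and recovery only needs the same shift distances that were used for the write.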
Figure 6.10 illustrates the details of how to implement the bit reordering
and order recovery logic using barrel shifters. The barrel shifters in the
bit reordering logic and order recovery logic differ simply by the direction of
rotation so that the actions taken by the former can be reversed by the latter.
The shift distance inputs to the barrel shifters are taken from the OID of a
cache word; this allows a logical word to be reordered/recovered according to
the OID of the cache word that stores the logical word. Note that the number



































































Figure 6.10: Bit reordering and order recovery logic. PRG stands for a
Parallel Reordering Group.
of each PRG; this number may or may not be equal to the size of an OID.
As such, each barrel shifter takes as its shift distance input a combination
of bits from the OID. Different barrel shifters take different combinations
of the bits from an input OID to reduce the correlation between different
PRGs since this correlation can reduce the effectiveness of bit reordering by
reducing the movement of the bits in a logical word relative to each other.
For our evaluation in Section 6.5, we choose the combinations of input bits
to the barrel shifters experimentally by picking the set of combinations that
provides the best empirical fault coverage out of ten randomly generated sets
of combinations.
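One possible way to derive shift distances from an OID, and to pick among randomly generated sets of bit combinations, is sketched below. The details are assumptions for illustration: the text does not specify how OID bits are combined, so this sketch simply concatenates a chosen subset of OID bits. The 7 PRGs, 8-bit OID, and 4-bit shift distance are hypothetical values chosen to resemble the 16-bit barrel shifters of the evaluation, and `coverage_fn` stands in for an empirical fault coverage measurement that is not shown.

```python
import random


def shift_distance(oid, bit_positions):
    """Concatenate the selected OID bits (first position is the MSB)."""
    distance = 0
    for pos in bit_positions:
        distance = (distance << 1) | ((oid >> pos) & 1)
    return distance


def random_combination_set(num_prgs, oid_bits, shift_bits, rng):
    """One candidate set: for each PRG, the OID bit positions feeding its shifter."""
    return [rng.sample(range(oid_bits), shift_bits) for _ in range(num_prgs)]


def pick_best(candidate_sets, coverage_fn):
    """Keep the candidate set that scores highest under coverage_fn."""
    return max(candidate_sets, key=coverage_fn)


# Ten randomly generated candidate sets, as in the selection described above.
rng = random.Random(0)
candidates = [
    random_combination_set(num_prgs=7, oid_bits=8, shift_bits=4, rng=rng)
    for _ in range(10)
]
```

Giving different PRGs different subsets of OID bits means that a change in the OID moves the PRGs by different amounts, which is the decorrelation effect the text describes.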
6.3.4 Application Examples
Error pattern transformation (or EPT) is a general technique that can be
applied on top of many prior works on improving on-chip memory reliability.









(a) A bit ordering that can generate









(b) A bit ordering that only
generates correctable errors for
segmented Hamming(7,4).
Figure 6.11: PRGs for segmented Hamming(7,4).
Each row corresponds to one segment within the access granularity. Green
and blue represent data and ECC bits, respectively. Red represents bits
mapped to faulty SRAM cells. All rows are accessed at once, are decoded
at once, and are combined to form one word.
Applying to Segmented ECCs
For segmented ECCs, we construct each PRG using logical bits from different
segments to move errors in segments with more errors into segments with
fewer errors; this PRG composition principle benefits all segmented ECCs.
Figure 6.11 shows the PRG composition for a 16-bit data word protected by
the segmented Hamming(7,4) ECC; each column of squares represents a PRG
while each row of squares represents a segment. Figure 6.11 also illustrates
how the composition of each PRG from bits in different segments improves























(c) Bit ordering after
2D reordering.
Figure 6.12: PRGs for OLSC(24,16). Green and blue represent data and
ECC bits, respectively. Red grid represents bits mapped to faulty SRAM
cells.
Applying to Unsegmented ECCs
For unsegmented ECCs, a good PRG composition may be different for dif-
ferent unsegmented ECC implementations. As such, we propose a two-stage
EPT that provides more freedom in bit reordering to better suit the specific
needs of different unsegmented ECCs. During the first stage, each PRG is
composed from bits in the same row so that the bits can be reordered along
the X-axis. During the second stage, each PRG is composed from bits in the
same column so the bits can be reordered along the Y-axis. Figure 6.12 shows
the PRG composition for a 16-bit data word protected by the OLSC(24,16)
ECC.
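The two-stage reordering can be modeled as row rotations followed by column rotations on a grid of bits. The sketch below is a hypothetical software model, not the layout of Figure 6.12: the grid dimensions and shift values are illustrative, and recovery simply undoes the two stages in reverse order.

```python
def rotate(seq, distance):
    """Rotate a list left by `distance` (negative values rotate right)."""
    d = distance % len(seq)
    return seq[d:] + seq[:d]


def reorder_2d(grid, row_shifts, col_shifts):
    """Stage 1: rotate each row (X-axis). Stage 2: rotate each column (Y-axis)."""
    rows = [rotate(r, s) for r, s in zip(grid, row_shifts)]
    cols = [rotate(list(c), s) for c, s in zip(zip(*rows), col_shifts)]
    return [list(r) for r in zip(*cols)]


def recover_2d(grid, row_shifts, col_shifts):
    """Undo the column rotations first, then the row rotations."""
    cols = [rotate(list(c), -s) for c, s in zip(zip(*grid), col_shifts)]
    rows = [list(r) for r in zip(*cols)]
    return [rotate(r, -s) for r, s in zip(rows, row_shifts)]


# Tiny 2x3 example with labeled bits so that their movement is visible.
grid = [[0, 1, 2],
        [3, 4, 5]]
shuffled = reorder_2d(grid, row_shifts=[1, 0], col_shifts=[0, 1, 0])
restored = recover_2d(shuffled, row_shifts=[1, 0], col_shifts=[0, 1, 0])
```

Combining the two axes lets a bit reach positions that a single level of rotation cannot, at the cost of a second level of barrel shifters.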
Section 6.6 provides more examples of application of EPT to prior works
on improving on-chip memory reliability.
6.4 Methodology
6.4.1 EPT-Based Cache Resilience Designs
To demonstrate the applicability of EPT to a variety of ECCs that can
be used to protect deeply voltage scaled on-chip memories, we apply EPT
to three different ECCs in the context of a 32KB L1 cache with a word
size of 64 bits - the segmented Hamming(7,4), segmented OLSC(8,4), and
OLSC(128,64). All three ECCs protect at the granularity of a 64-bit data
word, which is the word size of the evaluated L1 cache. For sensitivity anal-
ysis, we evaluate EPT using both 5-bit OIDs and 8-bit OIDs; we refer to
them as 5-bit EPT and 8-bit EPT, respectively. We apply EPT to segmented
Hamming(7,4) and segmented OLSC(8,4) by adding one level of 16-bit bar-
rel shifters and apply EPT to OLSC(128,64) by adding two levels of 16-bit
Table 6.1: Synthesized Decoder Latency and Area. Area is w.r.t. a 0.76 mm²
L1 Cache.









OLSC(128,64) 0.68 8.89% 6394
EPT:OLSC(128,64) 0.98 10.10% 9187
Table 6.2: Evaluated EPT Overheads
Baselines                          Latency    Usable Cache   Decoder Area   Metadata   Total Area
                                   (Cycles)   Size           Overhead       Overhead   Overhead
5-bit EPT+Segmented Hamming(7,4)   1          16KB           3.06%          0%         3.1%
5-bit EPT+Segmented OLSC(8,4)      1          16KB           3.2%           0%         3.2%
5-bit EPT+OLSC(128,64)             2          16KB           20.20%         0%         24.2%
8-bit EPT+Segmented Hamming(7,4)   1          16KB           3.06%          4%         7.1%
Table 6.3: Evaluated Baseline Overheads
Baselines              Latency            Usable Cache   Decoder Area   Metadata   Total Area
                       (Cycles)           Size           Overhead       Overhead   Overhead
Oracular BCH(127,64)   2 - 21 (3.2 avg)   100%           70%            0%         70%
VS-ECC                 1-7 (1.9 avg)      16KB           37%            4%         42%
MS-ECC                 1                  16KB           9.4%           0%         9.4%
SECDED                 1                  32KB           10.9%          0%         10.9%
8T SRAMs               0                  32KB           0%             0%         33.3%
WD–Original            1                  16KB           <1%            43.2%      44.2%
WD–IsoArea             1                  16KB           <1%            4%         5%
barrel shifters.
To accurately characterize the latency and area overheads of EPT, we
implemented these ECCs, both by themselves and with EPT, in RTL and
synthesized them using Synopsys Design Compiler [138] and the TSMC 65GP
standard cell library. Cadence SoC Encounter [139] was used for physical
layout. Area and latency values were verified using Encounter. Table 6.1
shows the latency in cycles and area overheads normalized to the evaluated
32KB cache (see Section 6.4.3); our evaluation assumes the 20FO4 cycle com-
monly used in prior works [111, 125]. Table 6.1 shows that all three ECCs
stay within one cycle of latency, with or without EPT. By itself, each level
of barrel shifters required by EPT only incurs < 0.3 cycle latency and 3%
area overhead. In particular, the two segmented ECCs both stay within 0.5
cycle of latency, with or without EPT. The low-latency overheads justify our
design choices in Section 6.3. In our evaluations, we model the latency
overhead of all of the ECCs in Table 6.1 as one cycle; the one exception
is EPT:OLSC(128,64), which we conservatively model as two cycles of latency.
Table 6.2 lists the EPT latency overheads used in our performance evaluations
and the EPT area overheads used in our power evaluations.
6.4.2 Baseline Cache Resilience Designs
To show that the proposed approach can lead to lower on-chip memory volt-
ages than what has been previously possible, we compare against seven base-
lines. We compare against the SECDED (72,64) ECC commonly used in ex-
isting processors [132]. We compare against the primary implementation of
MS-ECC [112], which uses the OLSC(128,64) ECC. We also compare against
the circuit-level technique of using robustified 8T SRAM cells, instead of the
usual 6T SRAM cells.
We compare against an Oracular BCH(127,64) ECC that guarantees cor-
rection of up to 10 bits of error per word. We consider this baseline oracular
because we optimistically assume a BCH decoder that incurs the minimum
latency for the exact number of errors currently present in a logical word.
The oracular decoder has the low-latency benefit of weaker BCH decoders
that correct fewer errors per word while having the high error coverage ben-
efit of always guaranteeing correction of 10 bits in a word. As such, this
oracular BCH baseline represents the best coverage and latency obtainable
by prior works that propose protecting caches with the BCH ECC, such as
Hi-ECC [77].
We compare against word disable (WD), the best resilience scheme presented
in [111] for L1 caches. WD was previously evaluated in the context
of caches with 512-bit access granularity; in this context, WD combines two
512-bit cache lines into an error-free logical line by dividing the two lines into
32 chunks and remapping data to fault-free chunks among the 32 available
chunks. When evaluating WD in the context of our L1 cache with a 64-bit
word size, we note that dividing two 64-bit words into 32 chunks can result
in a high storage overhead for recording the 32 chunks. As such, we evaluate
two versions of WD, WD–Original and WD–IsoArea; the former divides two
64-bit words into 32 chunks, while the latter divides the two 64-bit words
into fewer chunks such that the implementation requires the same number of
extra storage bits as an 8-bit OID EPT scheme.
We compare against a VS-ECC [97] implementation where each 64-bit data
word is divided into four 16-bit segments. The 64 bits of ECC are shared as
6-bit ECC chunks among the four segments. Each segment can have up to
four 6-bit ECC chunks allocated to it; each additional 6-bit chunk guarantees
correction of one additional error. For this VS-ECC implementation, we
assume that the four segments each have a dedicated decoder that corrects
up to a single error but shares two 4-bit correcting decoders; the single-bit
correcting decoders are sufficient for segments with a single error while the
multi-bit decoders are needed in case multiple errors are present in a segment.
We use two instead of four 4-bit correcting decoders to optimize the decoder
area for VS-ECC since the segments more commonly experience single-bit
errors than multi-bit errors in our evaluation.
We calculate the latency of SECDED according to [120], which conservatively
estimates decoder latency in FO4s by ignoring the wire delay. We calculate
the decoder area overhead of SECDED(72,64) by counting the number of gates
required according to [120] and assuming that each gate is equal in area to
two SRAM cells; the latter assumption is consistent with prior work [97].
When calculating the latency of Oracular BCH(127,64), we use the formula
in [120] to calculate the latency required to correct t-bit errors and weight
that calculated latency by the probability of encountering t bits of errors in
a word. For the decoder area overhead of Oracular BCH(127,64), we
assume it is the same as the area overhead of a constant latency BCH(127,64),
which we calculate according to [120]. We extrapolate the latencies and area
overheads of MS-ECC and WD from their respective papers. We follow the
methodology for the parallel BCH decoder in VS-ECC to calculate the la-
tency and decoder area overheads of our VS-ECC adaptation; we assume
that the latency overhead of VS-ECC when a cache word is accessed is de-
termined by the segment with the most errors among the four segments in
the cache word, since this segment takes the longest to correct.
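The two latency models described above can be written compactly. The notation is introduced here for illustration and is not taken from [120] or [97]: P(t) denotes the probability that a word contains t errors, L(t) the decoder latency for correcting t errors, and e_s the number of errors in segment s of a VS-ECC word.

```latex
\bar{L}_{\mathrm{BCH}} = \sum_{t} P(t)\, L(t),
\qquad
L_{\mathrm{VS}} = \max_{s \in \{1,2,3,4\}} L(e_s)
```

The first expression is the probability-weighted average latency of the oracular BCH decoder; the second states that a VS-ECC access completes only when its slowest segment has been corrected.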
6.4.3 Experimental Setup
In keeping with the context of this work, energy-efficient computing, we evaluate
a two-issue in-order core similar to the ARM Cortex-A7 [132]; the detailed
micro-architectural parameters are enumerated in Table 6.4. The baseline
core operating at nominal voltage (1.2V) uses no ECC for L1 accesses (see footnote 3). For
each design that requires correction, we deepen the pipeline to accommodate
the minimum correction latency of that design. The core uses a common volt-
age rail for both its L1 caches and the rest of the core; as such, the L1 caches
Footnote 3: Without any loss of generality, we can easily assume that some other default ECC is
used at the nominal voltage.
determine the minimum operating voltage of the entire core, in accordance
with many prior works [135, 111, 112, 140]. We only voltage scale the core,
but not the rest of the system, such as the memory system and lower-level
caches, which we assume are on different voltage rails. To model voltage scal-
ing, we scale frequency and leakage power with respect to voltage using the
detailed 70nm scaling given in [141]; we use the common quadratic scaling to
model dynamic power with respect to voltage. We model the SRAM failure
probabilities using data reported in [125]. During low voltage operations, we
account for the added cache access latency due to error correction by stalling
each load instruction by the number of cycles needed to perform error cor-
rection. We simulate seventeen benchmarks from the SPEC2000 [130] and
SPEC2006 [131] benchmark suites (see footnote 4) using Gem5 [23]; we fast-forward each
benchmark for 1 billion instructions and execute in detailed timing mode
for 1 billion instructions. We model processor power with McPAT [133] us-
ing simulation outputs from Gem5. We use Cacti [96] to model cache latency
and power.
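As a rough illustration of the quadratic scaling mentioned above, the following sketch computes dynamic power and dynamic energy per operation relative to nominal. It assumes the textbook relation P_dyn ∝ C·V²·f; the function names are hypothetical, and the frequency and leakage scaling taken from [141] is not reproduced here, so the frequency ratio is left as an input.

```python
def dynamic_power_ratio(v, v_nom, f, f_nom):
    """Dynamic power relative to nominal, assuming P_dyn ~ C * V^2 * f."""
    return (v / v_nom) ** 2 * (f / f_nom)


def dynamic_energy_ratio(v, v_nom):
    """Dynamic energy per operation relative to nominal.

    Energy per operation is power divided by frequency, so the
    frequency term cancels and only the V^2 term remains.
    """
    return (v / v_nom) ** 2


if __name__ == "__main__":
    V_NOM = 1.2  # nominal supply voltage from the setup above (volts)
    # Example: halving the voltage quarters the dynamic energy per operation.
    print(dynamic_energy_ratio(0.6, V_NOM))
```

Under this assumption, the dynamic energy saving depends only on the voltage ratio, which is why lowering Vmin is the main lever studied in this chapter.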
We choose a simple cache organization that is commonly used by prior
works [111, 112, 140] to evaluate the relative merits of EPT and the various
baselines. This simple cache organization stores the check bits of a data
cache word (i.e., a cache word assigned to storing data during low voltage
operation) in the same cache word in a different way in the same cache
set, as these two cache words are typically always accessed in parallel in L1
caches both with or without support for voltage scaling. Unlike an alternative
cache organization that uses one ECC way to protect multiple data ways, this
simple cache organization that allocates a dedicated ECC way for each data
way does not require expensive read-modify-write operations when updating
the ECC cache word. We do not investigate new approaches to store ECC
bits because it is orthogonal to the core idea of EPT. All the evaluated cache
resilience designs have almost equal amount of check bits as data bits in
each word, and thus all effectively utilize the available space in the dedicated
ECC way for each data way. The only exception is the SECDED(72,64)
ECC, which only provides 72 − 64 = 8 bits of check bits per 64-bit data
word. Instead of reserving half of the ways as ECC ways, we leave all four
ways as data ways and expand the size of each cache word statically to store
4We evaluated ammp, applu, apsi, art470, facerec, gromacs, lbm, libquantum, lucas,
mcf 2006, mesa, mgrid, milc, omnetpp, soplex, swim, and wupwise.
Table 6.4: Microarchitectural Parameters
Parameter                   Value
Core Type                   In-order, 7-Stage
Register File               32 Int, 32 FP
Fetch/Decode/Issue Width    2/2/2
BTB Size                    4096 entries
RAS Size                    16 entries
Branch Predictor            Tournament
ALUs/FPUs/MDUs              2/1/1
Cache Line Size             64B
L1 I$, Nominal (1.2V)       32KB, 4-way, 2-cycle
L1 I$, Vmin                 16KB, 2-way, 2-cycle
L1 D$, Nominal (1.2V)       32KB, 4-way, 2-cycle
L1 D$, Vmin                 16KB, 2-way, 2-cycle
L2 Unified $                1MB, 8-way, 20ns latency
Memory Configuration        1066 MHz DDR3
the check bits for SECDED(72,64).
Also similar to prior works [111, 140], the cache organization we evaluate
uses the larger but more reliable 10T cells [125] for the tag array,5 as the tag array is only a
small fraction of the total cache area. Similar to these prior works, we also use
the more reliable tag array to store any metadata needed for the evaluated
cache resilience schemes, such as the OID bits for EPT, the disable bits for
WD, and the bits to identify how many ECC chunks are allocated to each
segment for VS-ECC. Note that since half of the ways in the cache are used to
store ECC bits during low voltage operation, half of the tags in the tag array
are unused during low voltage operation. Therefore, we reuse the unused tag
bits to store the metadata during low voltage operation. When the metadata
bits do not completely fit in the unused tags (i.e., more than 5-bit OIDs),
we expand the size of the tag entries in the ECC ways to accommodate the
remaining metadata bits. The area overhead due to expanding the size of
the tag entries is given in Table 6.3 and Table 6.2. We do not study OIDs
smaller than 5 bits because they do not reduce area and latency overheads,
but do reduce coverage.
6.5 Results
In this section, we demonstrate that the proposed technique allows memory
to run at lower voltages than was previously possible for a given yield target.
5Our reliability calculations also account for the failure probability of the 10T cells.




















Figure 6.13: Yield for 32KB 4-way cache implemented in 6T SRAM cells.
Cache yield vs voltage for the various cache resilience schemes. The red line
depicts the 99.9% yield. (Legend: 5-bit EPT + OLSC(128,64); 5-bit EPT +
Segmented OLSC(8,4); 5-bit EPT + Segmented Hamming(7,4); 8-bit EPT +
Segmented Hamming(7,4). Horizontal axis: pBitFail, with ticks at 2.15e-02,
5.12e-03, 6.55e-04, 4.50e-05, and 3.09e-06.)
We also discuss the core-wide power and energy benefits this entails compared
to the different baselines.
6.5.1 Yield
We define cache yield as the fraction of caches where all of the data cache
words in the data ways (i.e., the two data ways in our evaluated four-way set-
associative caches) are functional and error-free, similar to prior works [111,
112].6 Figure 6.13 shows the cache yield at low supply voltage for the different
cache resilience designs. Figure 6.13 shows that for a 99.9% yield target,
which is commonly used in prior works [111, 112], 8-bit EPT + Segmented
Hamming(7,4) reduces the latter’s Vmin by 134mV. For the same yield target,
8-bit EPT + Segmented Hamming(7,4) reduces the Vmin of the two weakest
baselines in terms of Vmin (i.e., SECDED(72,64) and MS-ECC) by 70mV or
more.
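To make the yield definition above concrete, the following is a minimal analytical sketch for the simplest baseline, SECDED(72,64). It assumes that bit failures are independent with probability pBitFail, that a 72-bit word is usable if it contains at most one faulty bit, and that all 32KB of the cache hold data (as described for SECDED in Section 6.4.3); tag-array failures are ignored. The function names are hypothetical, and this is not the Monte Carlo flow used for the EPT designs.

```python
from math import comb


def word_ok_prob(n_bits, t, p_bit_fail):
    """Probability that an n-bit word contains at most t faulty bits."""
    return sum(
        comb(n_bits, k) * p_bit_fail**k * (1.0 - p_bit_fail) ** (n_bits - k)
        for k in range(t + 1)
    )


def cache_yield(n_words, n_bits, t, p_bit_fail):
    """Fraction of caches in which every word is correctable."""
    return word_ok_prob(n_bits, t, p_bit_fail) ** n_words


# Assumed geometry: 32KB of data in 64-bit words, each stored with 8 check bits.
N_WORDS = 32 * 1024 * 8 // 64

if __name__ == "__main__":
    # Bit failure probabilities taken from the axis of Figure 6.13.
    for p in (3.09e-06, 4.50e-05, 6.55e-04):
        print(f"pBitFail={p:.2e}  yield={cache_yield(N_WORDS, 72, 1, p):.4f}")
```

Because the per-word success probability is raised to the number of words, even a small per-word failure probability is amplified across the cache, which is why a 99.9% yield target constrains Vmin so tightly.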
After 8-bit EPT + Segmented Hamming(7,4), the next-closest resilience
design in terms of achieving the lowest Vmin is the 10-bit correcting Oracular
BCH(127,64) ECC (the strongest version of a capacity-efficient code whose
redundant bits fit in the disabled way). While 8-bit EPT + Segmented
Hamming(7,4) achieves only 1mV lower Vmin than Oracular BCH(127,64),
the former’s main advantage over the latter is having lower latency overhead;
8-bit EPT + Segmented Hamming(7,4) incurs a constant one cycle latency
(see Table 6.1), while the Oracular BCH(127,64) incurs 2 to 21 cycles of
6Cache word error rates are calculated analytically where possible (e.g., Oracular
BCH(127,64)). For the rest of the designs, in particular EPT, word error rates are
determined through Monte Carlo simulations.
latency overhead per cache access (3.2 cycles, on average). A second benefit of
8-bit EPT + Segmented Hamming(7,4) over Oracular BCH(127,64) is having
much lower decoder area. The area overhead of a BCH decoder increases
rapidly with the word size and the number of errors to correct [120]. 8-bit
EPT + Segmented Hamming(7,4) requires both minimal word size (i.e., only
seven bits per segment) and minimal correction strength (i.e., only corrects
one error per segment), as compared to BCH(127,64), which uses a word size
that is roughly 16X as large and corrects 10X as many errors per word.
After Oracular BCH(127,64), the next-closest baseline in Vmin is
VS-ECC. 8-bit EPT + Segmented Hamming(7,4) only reduces Vmin by 13mV
compared to VS-ECC. However, the main benefit of 8-bit EPT + Segmented
Hamming(7,4) over VS-ECC is again having lower latency overhead;
VS-ECC incurs one to seven cycles of latency overhead per cache access (1.9
cycles, on average). 8-bit EPT + Segmented Hamming(7,4) also incurs much
lower decoder area overheads, since VS-ECC uses a word size 8X as large
and corrects 4X as many errors per word.
After VS-ECC, the next-closest baseline in Vmin is WD–Original. 8-bit
EPT + Segmented Hamming(7,4) reduces the Vmin of WD–Original by 33mV.
However, WD–Original requires 4X as many metadata bits as 8-bit EPT,
which incurs a 43% overhead in total cache area. On the other hand, the al-
ternative WD–IsoArea implementation we evaluate, which requires the same
number of metadata bits as 8-bit EPT, only provides a Vmin of 769mV, which
is nearly 100mV higher than the best EPT solution.
The next-closest baseline in Vmin is a cache built using 8T SRAM cells.
Figure 6.13 shows that, although 8T SRAM cells provide a reduction
in Vmin, EPT solutions provide lower Vmin (e.g., 8-bit EPT + Segmented
Hamming(7,4) reduces Vmin by 59mV compared to 8T). However,
8T cells incur a high area (and thus power) overhead of 33% [125].
For sensitivity analysis, we also evaluated three 5-bit EPT design points.
Among these three designs, 5-bit EPT + Segmented Hamming(7,4) pro-
vides the highest yield; it provides higher yield than 5-bit EPT + Segmented
OLSC(8,4) because the former requires fewer check bits than the latter while
correcting the same number of errors per segment as the latter. Having
fewer bits per word reduces the average number of faulty bits per
word and thereby improves yield. On the other hand, 5-bit EPT + Segmented
OLSC(8,4) provides higher yield than 5-bit EPT + OLSC(128,64);
this is because the fraction of t-bit error patterns that are correctable by
OLSC(128,64) decreases more rapidly with respect to t than the fraction of
t-bit error patterns that are correctable by Segmented OLSC(8,4), as shown
in Figure 6.5. When the fraction of t-bit error patterns that are correctable
is small, it becomes less likely to find a bit ordering that always generates
one of the correctable t-bit error patterns in a logical word given a t-bit fault
pattern in a cache word.
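The role of the fraction of correctable t-bit error patterns can be illustrated for the segmented codes with a short counting sketch. It assumes that a segmented code corrects a word exactly when every segment contains at most one error, which holds for Hamming(7,4) and is taken here as a simplifying assumption for OLSC(8,4); OLSC(128,64) is not modeled. The function name is hypothetical.

```python
from math import comb


def correctable_fraction(num_segments, seg_bits, t):
    """Fraction of t-bit error patterns with at most one error per segment.

    A correctable pattern picks t distinct segments and one bit position
    inside each; the denominator counts all t-bit patterns in the word.
    """
    total = comb(num_segments * seg_bits, t)
    correctable = comb(num_segments, t) * seg_bits**t
    return correctable / total


if __name__ == "__main__":
    # A 64-bit data word is split into 16 segments in both codes.
    for t in range(1, 9):
        ham = correctable_fraction(16, 7, t)   # Segmented Hamming(7,4)
        olsc = correctable_fraction(16, 8, t)  # Segmented OLSC(8,4), assumed 1 error/segment
        print(f"t={t}  Hamming(7,4)={ham:.3f}  OLSC(8,4)={olsc:.3f}")
```

The larger this fraction is for a given t, the more likely it is that some bit ordering maps a t-bit fault pattern onto a correctable error pattern.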
All of the above results hold irrespective of the core microarchitecture (e.g.,
in-order vs. out-of-order, issue-width, etc.) and workloads. The next section
describes how the improved yield due to EPT translates into improved
power and performance characteristics for the chosen core microarchitecture
and workloads.
6.5.2 Power and Performance
Table 6.5 summarizes the core-wide power and performance of the EPT de-
signs and the baselines. While 8-bit EPT + Segmented Hamming(7,4) oper-
ates at the same Vmin and, therefore, similar power as Oracular BCH(127,64),
the former provides 21% higher IPC than the latter. This performance
improvement is due to having much lower latency overhead than Oracular
BCH(127,64) ECC. This shows that EPT is better than simply using a
stronger ECC, because of the latter's decoder latency and area overheads. Compared to 8-bit
EPT + Segmented Hamming(7,4), VS-ECC only increases Vmin by 13mV.
However, the higher decoder latency and area overhead of VS-ECC result
in a high EPI overhead. 8-bit EPT + Segmented Hamming(7,4) reduces EPI
by 40.2% compared to VS-ECC.
WD-Original and WD-IsoArea provide IPC similar to that of EPT since WD also
incurs only a constant one-cycle correction latency. However, the high area
overhead of WD-Original (i.e., 44.2%) results in a high power overhead; 8-
bit EPT + Segmented Hamming(7,4) reduces power by 20.5% compared to
WD-Original. Although WD-IsoArea has similar area overhead as 8-bit EPT
+ Segmented Hamming(7,4), it has a much higher Vmin; 8-bit EPT + Seg-
mented Hamming(7,4) reduces power by 31.1% over WD-IsoArea. MS-ECC
also incurs similar latency and area overheads as EPT. However, the Vmin
of MS-ECC is 70mV higher than that of 8-bit EPT + Segmented
Hamming(7,4). As a result, 8-bit EPT + Segmented Hamming(7,4) reduces
core-wide power by 25.7% compared to MS-ECC.
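The relative power reductions quoted in this paragraph can be checked against the entries of Table 6.5 with a short calculation. This sketch assumes that the fourth numeric column of Table 6.5 is core-wide power in percent; the helper name is hypothetical. Because the table entries are rounded, the recomputed values agree with the quoted ones only to within about 0.1 percentage points.

```python
def reduction_pct(ept_value, baseline_value):
    """Relative reduction of the EPT value with respect to a baseline, in percent."""
    return (1.0 - ept_value / baseline_value) * 100.0


# Power entries (%) read from Table 6.5.
EPT_8BIT_POWER = 19.1  # 8-bit EPT + Segmented Hamming(7,4)
BASELINE_POWER = {
    "WD-Original": 24.0,
    "WD-IsoArea": 27.7,
    "MS-ECC": 25.7,
}

if __name__ == "__main__":
    for name, power in BASELINE_POWER.items():
        print(f"{name}: {reduction_pct(EPT_8BIT_POWER, power):.1f}% lower power")
```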
Using robustified SRAM cells can provide significant Vmin reduction with-
out incurring any error correction latency overhead since they do not require
error correction; however, robustified SRAM cells incur significant area. Due
to the high area overhead as well as the higher Vmin of 8T cells, 8-bit EPT +
Segmented Hamming(7,4) reduces overall core power by 28.1% compared to
using 8T SRAM cells. Due to this power overhead, 8-bit EPT + Segmented
Hamming(7,4) has a 19.1% lower EPI than 8T cells. Note that unlike ar-
chitectural techniques where most of the circuits (e.g., decoders) needed for
low voltage operations can be power gated during nominal voltage opera-
tions, caches with robustified SRAM cells continue to incur high overheads
at nominal operation. For example, caches with the 8T SRAM cells incur a
12.0% core-wide EPI overhead at nominal voltage operation, whereas 8-bit
EPT + Segmented Hamming(7,4) incurs only a 7.8% EPI overhead at nom-
inal voltage (due to the overheads of the added correction pipeline stages).
The above results show that EPT is very effective at reducing the Vmin
and, therefore, power of low latency caches by increasing the coverage of an
ECC-based error resilience scheme. The results also show that EPT does not
have some of the limitations that hamper the efficacy of other comparable
techniques (e.g., high power and area overheads during nominal and low
voltage operation). This makes EPT a promising technique to allow deeply
voltage scaled processors and on-chip memories.
6.5.3 Additional Sensitivity Analysis
BIST-Undetectable Faults
EPT targets faulty SRAM cells that can be identified during BIST. How-
ever, some SRAM faults, such as soft errors, erratic faults, or aging related
faults, cannot be detected by BIST. EPT does not protect against these
faults because they cause errors to occur randomly in different locations. For
L1 caches, existing processors typically only detect these errors using parity
or protect against these errors using SECDED to guarantee correction of a
single random error per word [132]. One can also guarantee correction of a




Table 6.5: Core-wide power and performance of the EPT designs and the baselines.
Design                        Vmin (mV)  Freq (GHz)  IPC    Power (%)  EPI (%)
Oracular BCH(127,64)          671        0.931       1.035  24.3       50.5
VS-ECC                        683        0.960       0.854  22.4       54.8
8T SRAMs                      729        1.061       1.236  26.6       40.5
WD–IsoArea                    769        1.141       1.171  27.7       41.5
WD–Original                   703        1.006       1.228  24.0       38.9
SECDED                        823        1.256       0.879  34.5       62.6
Segmented Hamming(7,4)        804        1.210       1.147  31.5       45.4
MS-ECC                        740        1.083       1.191  25.7       39.8
EPT-Based Designs
5-bit EPT + OLSC(128,64)      697        0.993       0.918  22.0       48.3
5-bit EPT + Seg OLSC(8,4)     693        0.984       1.229  20.7       34.3
5-bit EPT + Seg Ham(7,4)      680        0.953       1.242  19.7       33.2
8-bit EPT + Seg Ham(7,4)      670        0.929       1.252  19.1       32.8
single random error per word in caches with EPT by adding an independent
layer of random error correcting code.
As an example, when storing a 64-bit dataword, besides protecting it with
EPT + Segmented Hamming(7,4), one can also protect it using an indepen-
dently calculated single-symbol-correcting (SSC) Reed-Solomon (RS) ECC
with 5-bit symbols; each group of four adjacent data bits in the 64-bit dataword is
mapped to a symbol in the SSC ECC when calculating the SSC ECC. When
a random error occurs in a cache word, EPT + Segmented Hamming(7,4)
may not be able to correct the segment in which the error resides. How-
ever, since only a single segment and, therefore, a single symbol contains the
random error, the erroneous segment is guaranteed to be correctable via the
SSC ECC if the random error is the only random error in the cache word.
The overhead required by this additional layer of random error correction is
small. The RS ECC requires 10 additional ECC bits,7 which fit within the
64 − (7 − 4) · 16 = 16 unused bits per ECC way of Segmented Hamming(7,4).
The area overhead of the RS ECC encoder [142] and a ROM-based decoder,
which stores the correctable syndromes and the error pattern corresponding
to each correctable syndrome, is only 1.8% of the L1 cache area. Our eval-
7For simplicity, we protect against known bit faults among these 10 bits per cache word
by only using EPT to move the faulty bits away from them but not storing Hamming(7,4)
ECC bits for them.




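Table 6.6: Performance, power, and energy for designs where the L2 is co-scaled with the L1s and the core.
Design                          Vmin (mV)  Freq (GHz)  IPC    Power (%)  EPI (%)
Oracular BCH(767,512)           701        1.002       0.956  28.0       58.5
MS-ECC                          758        1.119       0.996  29.0       52.0
8-bit EPT + Seg OLSC(128,64)    689        0.974       0.983  24.0       50.2
One ECC way for seven data ways
VS-ECC                          754        1.111       0.967  30.8       57.3
8-bit EPT + Seg BCH(80,64)      742        1.087       0.930  29.0       57.4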
uation shows that including this additional layer of SSC ECC to 8-bit EPT
+ Segmented Hamming(7,4) only increases the latter’s core-wide power by
6.8%. Since the SSC ECC only incurs correction latency when a
random error occurs [132], which is rare, the energy overhead of adding the
SSC ECC is minimal (i.e., 4.4%). In comparison to VS-ECC, the best prior
scheme that can provide built-in soft error protection, EPT + Segmented
Hamming(7,4) provides a 40% energy reduction and a 25% total cache area
reduction. Despite incurring a 4.4% energy overhead and a 2% cache area
overhead from SSC ECC, EPT’s relative benefits remain high.
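The bit budget behind the additional SSC layer can be worked through as follows. The sketch assumes that a single-symbol-correcting Reed-Solomon code needs two check symbols (minimum distance 3) and that a non-extended RS code over 5-bit symbols has at most 2^5 − 1 = 31 symbols per codeword; the constant and function names are hypothetical, and how each 4-bit group is padded to a 5-bit symbol is not specified in the text and is not modeled.

```python
DATA_BITS = 64         # data word size
SEG_DATA_BITS = 4      # data bits per Hamming(7,4) segment
SEG_CODE_BITS = 7      # total bits per Hamming(7,4) segment
ECC_WAY_WORD_BITS = 64 # bits available in the dedicated ECC way per data word
RS_SYMBOL_BITS = 5     # symbol size of the SSC Reed-Solomon code
RS_CHECK_SYMBOLS = 2   # assumption: SSC RS code with minimum distance 3

num_segments = DATA_BITS // SEG_DATA_BITS                           # 16 segments
hamming_check_bits = (SEG_CODE_BITS - SEG_DATA_BITS) * num_segments # 48 bits
unused_bits = ECC_WAY_WORD_BITS - hamming_check_bits                # 16 bits
rs_check_bits = RS_CHECK_SYMBOLS * RS_SYMBOL_BITS                   # 10 bits
rs_codeword_symbols = num_segments + RS_CHECK_SYMBOLS               # 18 symbols
rs_max_symbols = 2**RS_SYMBOL_BITS - 1                              # 31 symbols


def symbol_of_data_bit(bit_index):
    """Each group of four adjacent data bits maps to one RS symbol."""
    return bit_index // SEG_DATA_BITS


if __name__ == "__main__":
    print(unused_bits, rs_check_bits, rs_codeword_symbols, rs_max_symbols)
```

Since one segment corresponds to one symbol, a single random bit error corrupts exactly one symbol, which is the case an SSC code is guaranteed to correct.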
Note that in our evaluation in Sections 6.5.1 and 6.5.2, we dedicate all the
error correction resources of all evaluated resilience schemes to protecting
against BIST-detectable faults; in other words, additional overheads will be
required for all the evaluated baselines to guarantee correction of a single
random error per word while maintaining the same strength of protection
against BIST-detectable faults that the baselines enjoy in Sections 6.5.1 and
6.5.2. As such, although we do not evaluate adding random error protection
for the baselines, we expect the merits of EPT relative to the various baselines
to remain roughly the same as those in Sections 6.5.1 and 6.5.2 even in
scenarios where protection against random errors is needed.
L2 Caches
While EPT is primarily a technique to allow deeply voltage scaled on-chip
memories, it can also be viewed as a technique to reduce the latency of strong
error correction since it allows low-latency ECCs to attain the coverage of
high-latency ECCs (Sections 7.3 and 6.3). While latency is less of a concern for L2
caches, we demonstrate that EPT can still provide some benefits. Table 6.6
shows the performance, power, and energy evaluations for several designs
where the L2 is co-scaled with L1s and the core. For these designs, the L2
determines the Vmin. The values in the table include the power consumed
by the L2. BCH, MS-ECC, and EPT + Segmented OLSC(128,64) continue
to use an ECC way for each data way, which is consistent with our evalua-
tion of the L1 cache. Compared against MS-ECC, 8-bit EPT + Segmented
OLSC(128,64) reduces combined core and L2 power by 17%.
Since the capacity overhead requirement may be more stringent at the L2
level, we also demonstrate the benefits of EPT while disabling only one of
eight ways. For this capacity overhead requirement, we compare EPT against
a VS-ECC solution which has the same number of redundant bits as 8-bit
EPT. This VS-ECC solution breaks each cache line into four words each of
which can correct up to four faults. Across these four words, up to eight total
faults may be corrected, requiring 72 bits of redundancy. We do not allocate
any correction resources to BIST-undetectable faults. Results show that the
energy characteristics of EPT and VS-ECC are comparable. However, EPT
may still be preferable since VS-ECC decoders have a high area overhead: 10%
of the L2 cache area. The high decoder area overhead comes from implementing
four 4-bit correcting BCH decoders in a fully combinational manner (i.e.,
lowest latency BCH decoder). 8-bit EPT + Segmented BCH(80,64), on the
other hand, only incurs a 3.5% total area overhead.
28nm/32nm Technology Node
The performance, power, and energy benefits provided by fault tolerant
mechanisms for low-Vmin caches are strongly tied to the characteristics of
a particular technology process (i.e., how fault rates, frequency, and power
scale with voltage). In this section, we explore the benefits for a 28nm fault
rate [136]8 and 32nm frequency and power rates [9]. 8-bit EPT + Segmented
Hamming(7,4) has 17.7% lower power and 34.5% lower energy than VS-ECC
(the next-lowest power correction technique). 8-bit EPT + Segmented
Hamming(7,4) has 23.9% lower power and 11.8% lower energy than MS-ECC
(the next lowest energy correction technique). These 28nm/32nm results
8We could not find any 28nm or 32nm 8T SRAM fault rates.





Design                        Vmin (mV)  Freq (GHz)  IPC    Power (%)  EPI (%)
Oracular BCH(127,64)          674        0.326       0.977  15.5       44.6
VS-ECC                        688        0.344       0.896  14.7       43.5
WD–IsoArea                    738        0.415       1.157  17.3       32.9
WD–Original                   701        0.361       1.185  15.5       33.0
SECDED                        777        0.469       0.851  21.3       48.8
MS-ECC                        719        0.388       1.165  15.9       32.3
EPT-Based Designs
5-bit EPT + OLSC(128,64)      696        0.355       0.873  14.2       41.8
5-bit EPT + Seg OLSC(8,4)     693        0.351       1.186  13.3       29.3
5-bit EPT + Seg Ham(7,4)      682        0.336       1.186  12.6       28.9
8-bit EPT + Seg Ham(7,4)      673        0.324       1.202  12.1       28.5
demonstrate that EPT can provide voltage scaling benefits across technol-
ogy nodes.
6.6 Related Work
This section describes several additional related works. Parichute [137] ex-
plores how to protect low-level caches operating at low voltages using Turbo
codes. By themselves, Turbo codes under-utilize cache capacity because they
do not increment in size in powers of two. To better utilize cache capacity,
Parichute fills up the ECC ways by constructing additional check bits to
the original Turbo codes and, thereby, improves the strength of the original
Turbo codes as well. However, the latency overheads of Turbo codes are still
significant (e.g., > 4 cycles, on average [137]). In addition, Turbo codes can
correct a different number of errors depending on the error pattern in the
word. As such, EPT can be applied on top of Parichute and is, therefore,
orthogonal to Parichute.
Archipelago [114] avoids accessing faulty bits by remapping groups of log-
ical bits (called chunks) from one physical cache line to a second cache line,
called a sacrificial line. However, remapping chunks across lines requires sig-












Figure 6.14: Applying EPT to Archipelago [114]. Left: two chunks need to
be remapped while the sacrificial line only has room for one chunk. Right:
Only one chunk needs to be remapped after bit reordering.
nificant design complexity. For each remapped chunk, Archipelago records
the original location of the chunk, its corresponding sacrificial line, and the
corresponding position within the sacrificial line; this requires adding two ad-
ditional address translation tables to the critical path of cache access [114].
In addition, a complete cache access now has to wait for two cache line accesses
to complete. Since the sacrificial cache line is often not in the same
cache set as the data line, to allow both cache line accesses to complete
around the same time, the number of banks in the cache has to be dou-
bled from what is needed to meet the fetch and issue width [114]. This not
only increases cache area and latency, but also requires modifying the ac-
cess scheduling for the different cache banks to improve synchronization of
accesses between different banks. Cache access scheduling is further compli-
cated because a sacrificial line stores chunks from different data cache lines.
Therefore, on a write to a data cache line, Archipelago requires updating the
appropriate chunks within the corresponding sacrificial line using an expen-
sive read-modify-write operation. EPT only reorders bits within a word, not
across words, and, therefore, avoids all of the above complexities. Finally,
EPT can also be applied on top of Archipelago to improve its effectiveness
by aggregating faults in unused chunks, as illustrated in Figure 6.14. In this
way, EPT is orthogonal to Archipelago.
IPatch [143] protects caches operating at low voltages by reusing unused
spaces in other memory structures, such as the store queue, micro-op cache,
MSHR buffers, etc., to store redundant copies of words in the cache. IPatch
requires modifying a large number of processor components to make them
aware of faulty cache words; this may significantly complicate processor de-
sign and verification. In addition, these structures can only protect a small
fraction of the cache due to the smaller sizes of these memory structures
relative to a cache; as such, IPatch disables unprotected faulty cache words
in data lines [143]. Disabling a cache word in a data line leads to uncacheable
logical words, which can significantly impact performance. EPT, on the
other hand, restricts the required modifications to within a cache and does
not require disabling cache words in data lines. In addition, EPT can also be
applied to a cache along with IPatch and is, therefore, orthogonal to IPatch.
EPT bears some resemblance to bit-interleaving, which statically inter-
leaves adjacent physical bits across different segments of a segmented ECC.
For physically adjacent faults, bit-interleaving converts multiple adjacent bits
of errors in one segment into single-bit errors in different segments. Causes
of physically adjacent bit faults include large alpha particle strikes [144] and
complete DRAM chip failures [145]. Faulty SRAM cells at low voltages, how-
ever, are randomly distributed across a cache [111], not typically physically
adjacent to one another. As such, bit-interleaving does not improve the cov-
erage of faulty SRAM cells during low voltage operation, which EPT does by
adaptively modifying the bit ordering in each cache word according to the
identified fault pattern of the cache word.
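The contrast with static bit-interleaving can be seen in a small sketch. It assumes a simple round-robin interleaving in which physical bit i belongs to segment i mod 16; the function names and the example fault positions are hypothetical.

```python
from collections import Counter

NUM_SEGMENTS = 16  # assumed number of ECC segments per cache word


def segment_of_physical_bit(bit):
    """Static round-robin interleaving: adjacent bits go to different segments."""
    return bit % NUM_SEGMENTS


def errors_per_segment(faulty_bits):
    """Count how many faulty physical bits land in each segment."""
    return Counter(segment_of_physical_bit(b) for b in faulty_bits)


if __name__ == "__main__":
    burst = [40, 41, 42, 43]  # four physically adjacent faulty bits
    scattered = [3, 19]       # two non-adjacent faulty bits
    print(errors_per_segment(burst))      # one error in each of four segments
    print(errors_per_segment(scattered))  # both errors fall in the same segment
```

A burst of adjacent faults is spread so that each segment sees at most one error, but two scattered faults can still collide in one segment because the mapping is fixed. EPT instead chooses the bit ordering per cache word after the fault pattern is known.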
EPT also bears some resemblance to fault dispersion, which seeks to trans-
fer faults from a line with too many faults to lines with fewer faults, in off-chip
main memories [146, 147, 148, 149, 150, 151, 152]. In off-chip main memo-
ries, a line is typically striped across multiple memory chips/cards called a
rank, such that all chips/cards in a rank receive the same memory address
input and operate in lockstep to satisfy a single memory request. Exploiting
this architecture, prior works perform fault dispersion by physically swap-
ping the memory chips/cards or by reconfiguring the memory chips/cards
in a rank to access different intra-chip/card locations for the same address
presented to all the chips/cards in the rank. As such, these prior works differ
vastly in implementation from EPT, which targets on-chip memories. Fault
dispersion in the context of L1 caches equates to transferring logical bits be-
tween words in different sets, which requires accessing multiple cache words
per access to a faulty cache word; this can incur high latency overheads for
low voltage SRAM caches, where latencies are low and fault rates are high,
unlike off-chip main memories, where latencies are high and the fault rates
are low. Finally, unlike fault dispersion, EPT does not reduce the number of
errors in a cache word, but simply transforms the error pattern generated by
the cache word. As such, EPT can also be applied on top of fault dispersion,
and is, therefore, also orthogonal to fault dispersion.
6.7 Conclusion
This chapter presents error pattern transformation, a general technique for
low cost error correction which enables memory voltages to be scaled further
than prior works on error correction. We observed that although many ECCs
only guarantee correction of a small number of errors, they can actually correct
a large number of erroneous bits if these bits are in particular error
patterns. Since the same physical fault pattern in a cache word can mani-
fest as different error patterns depending on the ordering of the logical bits
stored in a physical cache word, EPT adaptively rearranges the logical bit to
physical bit mapping per word according to the known BIST-detectable fault
pattern in the physical word. The adaptive logical bit to physical bit map-
ping transforms many uncorrectable error patterns in the logical words into
correctable error patterns and, therefore, improves ECC error coverage and
reduces the minimum required voltage of operation. This reduces the mini-
mum voltage at which memory can run by 70mV over the best low-latency
ECC baseline leading to a 25.7% core-wide power reduction for an ARM
Cortex-A7-like core. Energy per instruction is reduced by 15.7% compared




This chapter explores how to architect 3D DRAM as efficient and reliable
in-package high-bandwidth memory. Due to the 3D nature of die-stacked
DRAMs, 3D DRAMs can develop faults in different physical dimensions.
By sharing minimal redundancy across physical dimensions to protect them
against failures, the proposed Parity Helix protection for multi-dimensional
memory improves energy efficiency in the common case of having no or only
minor faults in memory, at the expense of higher correction latency overheads
compared to baselines with dedicated protection for each physical dimension.
7.1 Introduction
Memory is fast becoming the power, and therefore, performance bottleneck
of modern and emerging computer systems. For example, memory consumes
30% of total power in HPC systems, on average, and up to 40% of total power
in data centers [153, 154]. If the conventional DDR3 memory technology
were deployed in future exascale systems, memory power consumption would
constitute 2.5X the total system’s power budget [153]. Many architecture
and technology level solutions have been proposed to improve the energy
efficiency of the memory system [154, 155, 156]. One promising solution is to
stack multiple DRAM dies on top of one another using through silicon vias
(TSVs) [153]. Die-stacked DRAMs improve memory bandwidth by 4-20X
and energy efficiency by 3X compared to the latest generation 2D memories,
such as DDR4 and GDDR5 memories [157, 158]. As a result, die-stacked
DRAMs may replace 2D DRAMs as the main building block of memory
systems in the near future.
Many recent studies show that DRAM faults are a common occurrence in
production systems [86, 11, 30, 159]. An uncorrectable error in memory can
lead to a system crash, which degrades both performance and availability. As
such, many supercomputers and data centers deploy memory error resilience
techniques to protect against DRAM faults [86, 11, 30, 159]. A large body
of recent works also explores how to provide efficient error resilience for 2D
DRAMs [16, 104, 160]. Die-stacked DRAMs will also need error resilience
techniques that provide similar reliability in order to be deployed in these
reliability-critical systems.
A fault mode that is commonly covered by memory error resilience tech-
niques for data centers and supercomputers, such as the well-known Chipkill
Correct [86, 11, 30], is the DRAM chip failure. Due to the large number of
DRAM chips in large-scale systems, DRAM chip failures are common. Chip-
kill Correct protects against the complete failure of DRAM chips in conven-
tional memory systems with 2D DRAMs. Several recent works have started
exploring error resilience for die-stacked DRAMs [161, 106, 162]. While prior
works protect against up to channel faults, DRAM die faults have not been
addressed.
One challenge with protecting against DRAM die faults and channel faults
in die-stacked DRAMs is that a DRAM die consists of a group of DRAM
banks that lie in a horizontal plane while a channel consists of a vertical
group of DRAM banks that span across multiple DRAM dies; as such, a
die fault affects a horizontal group of banks while a channel fault affects
a vertical group of banks. A straightforward resilience scheme to protect
against both fault modes is to deploy dedicated error correction resources to
protect against a fault along each physical dimension. However, maintaining
dedicated error correction resources for each physical dimension can incur
high power and capacity overheads.
This chapter describes Parity Helix, a general memory resilience technique
to efficiently protect a multi-dimensional memory system against single-
dimensional faults. We refer to a multi-dimensional memory system as a
memory system with a multi-dimensional collection of memory banks; for
example, a die-stacked DRAM can be considered as a three-dimensional col-
lection of banks. We refer to a single-dimensional fault as a fault that affects
only memory banks located within a subset of physical dimensions in a multi-
dimensional memory system; we refer to the subset of physical dimensions
that a fault affects collectively as a single logical fault dimension. Die faults
and channel faults in a die-stacked DRAM are examples of single dimensional
faults. Parity Helix is most effective for multi-dimensional memory systems
where each fault typically affects memory banks located only within a subset
of the physical dimensions in the memory system. Parity Helix is inspired by
the helix: a properly constructed helix can intersect a horizontal plane and
a vertical line segment in at most a single point in space. Parity Helix con-
structs every parity group (i.e., a parity line and the data lines it protects)
in a helix-like fashion to ensure that at most a single line per parity group
is affected even in the event of the complete failure of all the banks within a
single fault dimension.
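The helix intuition can be illustrated with a minimal sketch. It is a hypothetical diagonal layout, not the mapping used by Parity Helix itself: banks are indexed by (die, channel), and parity group g takes one bank from every die while stepping to the next channel on each die. The function names, the 4-die by 8-channel geometry, and the requirement that the number of dies not exceed the number of channels are all assumptions made for illustration.

```python
def helix_group(group, num_dies, num_channels):
    """Banks (die, channel) of one parity group, laid out along a diagonal."""
    if num_dies > num_channels:
        raise ValueError("this sketch assumes num_dies <= num_channels")
    return [(die, (group + die) % num_channels) for die in range(num_dies)]


def lines_hit(banks, fault_kind, index):
    """Number of lines of a group lost to a whole-die or whole-channel fault."""
    axis = 0 if fault_kind == "die" else 1
    return sum(1 for bank in banks if bank[axis] == index)


NUM_DIES, NUM_CHANNELS = 4, 8  # hypothetical stack geometry

# Worst case over every group and every single-dimensional fault.
worst = max(
    lines_hit(helix_group(g, NUM_DIES, NUM_CHANNELS), kind, i)
    for g in range(NUM_CHANNELS)
    for kind, limit in (("die", NUM_DIES), ("channel", NUM_CHANNELS))
    for i in range(limit)
)

if __name__ == "__main__":
    print(helix_group(0, NUM_DIES, NUM_CHANNELS))
    print(worst)
```

Because no two banks in a group share a die or a channel, a complete die fault or channel fault removes at most one line from any group, so a single parity line per group is enough to rebuild the lost line.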
When applying Parity Helix to die-stacked DRAMs, our evaluations show
that Parity Helix reduces memory energy per data access by 21%, on average,
and by up to 45% compared to the baseline scheme of protecting against
both die fault and channel fault by maintaining dedicated error correction
resources for each dimension. In addition, Parity Helix increases available
data capacity per die-stacked DRAM by 16.7% compared to this baseline
with a < 4% increase in uncorrectable error probability. Further, Parity
Helix incurs less than 1% performance overhead compared to prior schemes
that correct channel faults but not die faults. Finally, while our evaluations
focus on die-stacked DRAMs, Parity Helix can be easily applied to other
memory technologies.
We make the following contributions in this chapter:
• We propose Parity Helix, a technique to protect against any single-
dimensional fault in a multi-dimensional memory system.
• We evaluate Parity Helix in the context of die-stacked DRAMs; it is
the first scheme to protect against up to a complete DRAM die fault
in die-stacked DRAMs.
7.2 Die-stacked DRAM Background
A die-stacked DRAM, or simply a stack, consists of multiple DRAM dies
stacked on top of one another using TSVs, which provide both structural
support and communication. Since each stack consists of multiple DRAM
dies, a stack contains a large number of DRAM banks. The large number of











Figure 7.1: Die-stacked DRAM architecture.
illustrated in Figure 7.1. Each channel consists of the same DRAM banks
in every DRAM die in the stack;¹ the DRAM banks within a channel share
control and datapath with one another but not with banks in other channels.
Within a bank, DRAM cells are organized in rows and columns, identically
to conventional memory.
Unlike conventional DRAMs, where each memory line is sourced in parallel
from different DRAM banks in different DRAM dies, an entire memory line
is sourced from a single bank in a single DRAM die in a stack. This access
behavior resembles that of disks, where an entire OS page is typically sourced
from a single disk.
Because each line is sourced from a single DRAM bank in a stack, a single
fault in a stack can corrupt an entire memory line. As such, a disk-like
resilience approach, such as RAID, that can protect against entire line failures
is needed [165, 166]. Several recent works on error resilience for die-stacked
DRAMs have proposed RAID-like schemes for die-stacked DRAMs. Citadel
protects against up to a complete bank fault (i.e., it can protect against complete failure
of every line within a single bank) in a stack by constructing parity lines from
lines in different banks and then distributing the parity lines across all the
banks in the stack in a manner similar to RAID5 [106]. Sim et al. protect
against up to a complete channel fault by storing a duplicate of a line of one
channel in another channel, in a manner similar to RAID1 [161]. Jeon et
al. protect against up to a complete channel fault [162]; they construct parity lines
from lines in different channels and distribute the parity lines across all the
channels in the stack in a manner similar to RAID5.
¹ This organization is supported by all current die-stacked DRAM standards, including
HBM [163] and HMC [164]. HBM also supports a second organization where each channel
consists only of DRAM banks within the same DRAM die. Our work studies the first
organization due to its wider support.
[Table 7.1, surviving rows: [161, 162]: X X; Parity Helix: X X X]
In the context of data centers and supercomputers, large-scale field stud-
ies have observed DRAM die failures in 2D DRAMs [11, 30, 159]. Chipkill-
Correct, a memory error resilience technique commonly deployed in data
centers and supercomputers [86, 11, 30], is used to protect against up to a
complete DRAM die fault in conventional memory systems. For die-stacked
DRAM memory systems to achieve reliability similar to that of conventional memory
systems, die-stacked memory systems also need to be protected against
DRAM die faults. Die-stacked DRAMs may be even more susceptible to
DRAM die faults than 2D DRAMs. DRAM dies in a stack are much thin-
ner than 2D DRAM dies to allow the integration of TSVs [167]. The TSVs
are also made from different materials from the DRAM dies and, there-
fore, have different thermal expansion coefficients from the DRAM dies [167].
Both factors make DRAM dies in a stack more vulnerable to cracking under
thermal-mechanical stress [167, 168, 169]. In this chapter, we explore how to
efficiently protect against a DRAM die fault in addition to protecting against
faults covered by prior works. Table 7.1 summarizes the fault coverage of this















Figure 7.2: Left: a channel fault. Right: a DRAM die fault. Die-stacked
DRAMs are vulnerable to faults along different dimensions.
7.3 Motivation
As shown in Figure 7.1, a DRAM die in a stack contains a horizontal group of
DRAM banks while a channel contains a vertical group of DRAM banks. As
such, a DRAM die fault affects DRAM banks that lie in a different physical
dimension from banks affected by a channel fault, as illustrated in Figure
7.2. This makes protection against a DRAM die fault and against a channel
fault challenging. For example, RAID across all the channels in a stack (i.e.,
protecting the channels using a parity channel) does not protect against a
DRAM die fault because a DRAM die fault affects multiple channels in the
stack. Similarly, RAID across all the DRAM dies in a stack does not protect
against a channel fault because the fault affects multiple DRAM dies.
One potential solution is to adapt a stronger RAID implementation, such
as RAID6, to die-stacked DRAMs. RAID6 protects against up to two disk
failures in a group of disks [170]. There are numerous implementations of
RAID6, such as EVENODD, Row-Diagonal Parity, and Horizontal-Diagonal
Parity [171, 172, 173], but they all use two parity blocks to protect each data
block. RAID6 can be adapted to die-stacked DRAMs by protecting each data
line in a stack with two parity lines, where one parity line is constructed from
lines in different channels while the other parity line is constructed from lines
in different DRAM dies. In this scheme, the first parity line helps to protect
against a channel fault while the second parity line helps to protect against a
die fault. However, protecting each data line with two parity lines can incur








Figure 7.3: Conventional DRAMs can also experience faults along different
dimensions.
seek to provide similar reliability as the RAID6 adaptation while protecting
each data line with a single parity line.
Besides die-stacked DRAMs, other memory technologies also experience
faults along different dimensions. For example, many conventional memory
systems with 2D DRAMs are also multi-dimensional; a rank of chips and a
lane of chips (i.e., the same chip in different ranks of DRAM chips that share
the same I/O bus bits) in a conventional memory system occupy different
dimensions (see Figure 7.3). A lane of 2D DRAM chips in a conventional
memory system can fail together (e.g., due to a stuck-at-1 or stuck-at-0 I/O
pin failure [11]), while all the chips in the same rank can also fail together
(e.g., due to a Row-Hammer fault [174]) since these chips are controlled
by the same command signals. Our proposed solution to protect against
any single-dimensional fault in die-stacked DRAMs is easily generalizable to
other memory contexts, as will be discussed in Section 7.7.
7.4 Parity Helix
We propose Parity Helix to efficiently protect against faults that affect a sub-
set of physical dimensions in a multi-dimensional memory system by sharing
the same error correction resources across all dimensions to minimize the
overheads of the error correction resources. Parity Helix is similar to RAID5,
which uses ECC bits dedicated to each line to detect errors and to optionally
correct small errors but uses parity lines to correct large errors that affect




We define a sector as a group of row-wise adjacent line locations in memory.
A data sector contains data while a parity sector contains the bitwise XOR
of the contents of a set of data sectors. We divide memory into disjoint and
equally sized groups of sectors called partitions, such that each partition is
uniquely identifiable by N coordinates, each corresponding to one of the N
logical fault dimensions in memory. We call the union of the same sector in
every partition a memory slice; in other words, each sector within a memory
slice is uniquely identifiable by the N coordinates of the partition that the
sector belongs to. Within each memory slice, we construct stripes of memory,
each consisting of a set of data sectors and a single parity sector.
One key aspect of Parity Helix is the appropriate assignment of sectors to
stripes within a memory slice. Parity Helix assigns each sector to the kth
stripe in the sector’s memory slice, where:
$$k(c_1, c_2, \ldots, c_N) = (c_1 + c_2 + \cdots + c_N) \bmod p \tag{7.1}$$
In Equation 7.1, c1,c2,...,cN stand for the coordinates of the partition that
a sector belongs to while p stands for the protection strength. Specifically,
p represents the number of faulty adjacent partitions that Parity Helix can
correct. Equation 7.1 ensures that all coordinates that differ in a single
coordinate value by less than p evaluate to a different value of k. Note that
any group of p adjacent partitions along the same dimension always differ by
less than p in exactly one coordinate value among the N coordinate values;
as such, Equation 7.1 assigns all adjacent p sectors along the same dimension
into different stripes. This ensures that a fault that affects up to p partitions
in the same dimension will affect at most a single sector per stripe; such a
fault is, therefore, correctable by the parity in the single parity sector of each
stripe.
Because k in Equation 7.1 ranges from 0 to p − 1, there are p stripes per
memory slice. Therefore, the total number of sectors in a stripe is:
$$S_{stripe} = \frac{\text{memory slice size}}{p} = \frac{s_1 \cdot s_2 \cdots s_N}{p} \tag{7.2}$$
In Equation 7.2, si is the range of the coordinate values for the ith dimension.
Since there are p stripes per memory slice and Sstripe − 1 data sectors per
stripe, the total number of data sectors per memory slice is
$$S_{ds} = p \cdot (S_{stripe} - 1) \tag{7.3}$$
There is a parity sector in each stripe; so the capacity overhead of the
parity sectors in a memory slice is
$$\frac{\text{parity sectors per memory slice}}{S_{ds}} = \frac{p}{S_{ds}} = \frac{1}{S_{stripe} - 1} \tag{7.4}$$
Since p, the protection strength, should be no more than the size of the largest
dimension in the memory system, Equation 7.2 implies that Sstripe scales ap-
proximately with the average size of the dimensions raised to the power of
the total number of dimensions minus one; as such, the capacity overhead of
Parity Helix (given in Equation 7.4) scales inversely proportionally with the
average dimension size raised to the power of the total number of dimensions
minus one. This is significantly lower than the capacity overhead of pro-
tecting each dimension with dedicated parities, which scales approximately
proportionally with the total number of dimensions. Moreover, Parity Helix
requires updating a single parity when modifying a unit of data. In compar-
ison, protecting each dimension with a dedicated parity requires updating
N parities when modifying a unit of data, resulting in higher power and
performance overheads.
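As a worked illustration of Equations 7.2 through 7.4, the following minimal Python sketch computes the stripe size, the number of data sectors per memory slice, and the parity capacity overhead for a given set of dimension sizes. The function name `helix_parameters` and its argument names are illustrative assumptions, not part of the proposed hardware.

```python
from fractions import Fraction
from math import prod


def helix_parameters(dims, p):
    """Evaluate Equations 7.2-7.4.

    dims: sizes s_1..s_N of the logical fault dimensions (hypothetical argument)
    p:    protection strength
    """
    slice_size = prod(dims)              # sectors per memory slice
    if slice_size % p:
        raise ValueError("p must divide the memory slice size")
    s_stripe = slice_size // p           # Equation 7.2: sectors per stripe
    s_ds = p * (s_stripe - 1)            # Equation 7.3: data sectors per slice
    overhead = Fraction(p, s_ds)         # Equation 7.4: parity capacity overhead
    return s_stripe, s_ds, overhead


print(helix_parameters((4, 4), 4))   # (4, 12, Fraction(1, 3))
print(helix_parameters((8, 8), 8))   # (8, 56, Fraction(1, 7))
```

The second call corresponds to a stack with eight dies and eight channels and p = 8, for which the overhead is 1/7.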
We illustrate Parity Helix using an example memory system with a single
stack with four DRAM dies and four channels. Since our goal is to protect
against DRAM die faults and channel faults, which affect two different fault
dimensions, we also partition the sectors in a stack along two dimensions.
Since there are four DRAM dies and four channels in the example stack, the
coordinate values of each dimension range from zero to three. Since a fault
(e.g., a DRAM die fault or channel fault) in the example affects at most
four partitions in a stack, p should be set to four to protect against up to a
complete DRAM die or channel fault. The left half of Figure 7.4 illustrates a
memory slice in the example stack. The right half of Figure 7.4 illustrates the









Figure 7.4: We divide memory into partitions, and partitions into sectors.
The same sector in all partitions together form a memory slice. A helix-like
collection of sectors in a memory slice forms a stripe. Each color represents
a different stripe in the figure.
the inspiration of our work. Equation 7.1 maps all sectors that belong to the
same die (e.g., die c1) in a memory slice to different stripes and also maps
all sectors that belong to the same channel (e.g., channel c2) to different
stripes. As such, a single die fault or a single channel fault affects at most one
sector per stripe, as illustrated in Figure 7.5, and is, therefore, correctable
by the parity stored in the parity sector in each stripe.
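The property claimed for the four-die, four-channel example can be checked by brute force. The sketch below is a hedged illustration for the two-dimensional case only; the helper names `stripe_of` and `sectors_hit_per_stripe` are assumptions introduced here. It applies Equation 7.1 to every partition of one memory slice and counts how many sectors of each stripe a complete die fault or channel fault touches.

```python
from itertools import product


def stripe_of(coords, p):
    """Equation 7.1: stripe index of the partition at `coords`."""
    return sum(coords) % p


def sectors_hit_per_stripe(dims, p, fault_dim, fault_coord):
    """For one memory slice, count per stripe the sectors hit by a fault that
    disables every partition whose coordinate along `fault_dim` equals
    `fault_coord` (e.g., one whole die or one whole channel)."""
    hits = [0] * p
    for coords in product(*(range(s) for s in dims)):
        if coords[fault_dim] == fault_coord:
            hits[stripe_of(coords, p)] += 1
    return hits


# Coordinates are (die, channel) for the example stack.
print(sectors_hit_per_stripe((4, 4), 4, 0, 2))  # die 2 fails: [1, 1, 1, 1]
print(sectors_hit_per_stripe((4, 4), 4, 1, 0))  # channel 0 fails: [1, 1, 1, 1]
print(sectors_hit_per_stripe((4, 4), 2, 0, 2))  # p too small: [2, 2]
```

With p = 4 every stripe loses at most one sector, so a single parity sector per stripe suffices; with p = 2 a die fault removes two sectors from the same stripe and is no longer correctable.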
In a memory system with multiple stacks, each stack can either be pro-
tected independently or all the stacks can be protected as a single logical
stack. To illustrate the latter case, consider a memory system with n stacks,
each with c1 channels and c2 dies. All n stacks can be managed as a sin-
gle logical stack with n · c1 channels and c2 dies. By configuring protection
strength to be p = max(c1, c2), Parity Helix can protect against up to any single
complete channel or die fault among the n stacks.
7.4.2 Data Layout
Since any single DRAM fault (up to a complete DRAM die fault or a com-
plete channel fault) affects at most a single sector per stripe, selecting which
sector in a stripe to be the parity sector has no effect on the error correc-





















Figure 7.5: Left: a channel fault affects at most a single sector per stripe.
Right: a DRAM die fault affects at most a single sector per stripe. Thus,
the single parity sector per stripe protects against up to a complete channel
or die fault.
as the parity sector of the stripe for correctness. However, the selection of
which sector in a stripe to be the parity sector can have significant impact
on performance. A parity sector needs to be updated whenever a data line
(i.e., a line location in a data sector) is written. Updating the parity sector
requires a read-modify-write operation to the parity line (i.e., a line loca-
tion within the parity sector) that corresponds to the line being written; the
read-modify-write operation requires two line accesses. Therefore, updating
the parity sectors can potentially become a bandwidth bottleneck if they are
concentrated in a subset of the partitions in a stack.
To avoid bandwidth bottlenecks due to updating the parity sectors, we
evenly distribute the parity sectors across all partitions in a stack similar
to how RAID5 evenly distributes the parity blocks across all disks in a disk
array [165]. This can be achieved by rotating the parity sector over all
possible positions in a stripe. Since each sector in a stripe belongs to a
different partition, alternating which sector in each stripe should be the parity
sector evenly distributes the parity sectors across all partitions. This can be
implemented by letting the same stripe in each of the Sstripe memory slices of a
group use a different sector as its parity sector, as illustrated in Figure
7.6.
Distributing the parity sectors evenly across all partitions can complicate
the mapping of physical lines (i.e., lines in the OS physical address space)
[Figure 7.6 panels: Basic Layout, Optimized Layout; each shows stripe 0, stripe 1, ..., stripe Sstripe-1]
Figure 7.6: Parity layout. The parity sectors rotate over Sstripe memory
slices. This distributes the parity sectors evenly across all partitions to
eliminate bandwidth bottlenecks due to updating the parity sectors.
to line locations within the stack. Since each partition now stores a mix-
ture of both data and parities, one must avoid storing a physical line to a
parity sector. We describe a physical line to stack location mapping that dis-
tributes the parity sectors evenly across all partitions. When designing such
a mapping, we note that adjacent physical lines often exhibit spatial locality;
therefore, mapping adjacent physical lines to the same stripe improves the
locality of the parity lines in the memory buffer, and for resilience designs
that cache ECC bits in the processor, improves the locality of the parity lines
in the last-level cache as well. However, mapping adjacent physical lines to
different sectors in the same stripe can reduce row buffer hit rate since dif-
ferent sectors in the same stripe belong to different channels and dies and,
thus, different rows in memory. As such, we propose first filling up a single
sector with the most adjacent physical lines before filling up the stripe with
less adjacent lines to achieve both high row buffer hit rate and high spatial
locality for the parity sectors.
Figure 7.7 summarizes the steps needed to locate a data line and its parity
line in a stack given a physical line address A. To locate a data line given
a physical line address A, we first compute S, the sector address for the
line, by taking the first n − sector_width bits of address A, where n is the
total number of bits in address A and sector_width is the number of bits
needed to represent the sector size (i.e., the number of lines per sector);
the last sector_width bits of the line address, therefore, indicate which one
[Figure 7.7 flow: physical line address A yields sector address S; S/Sds gives slice#; S%(Sstripe*Sds) indexes the DMT (die#, channel# of A) and the PMT (die#, channel# of the parity sector); outputs are the stack location of A and the stack location of the parity line of A]
Figure 7.7: Steps for calculating the location of a physical line in a stack
and the location of the parity line that protects the physical line.
of the sector size number of lines within the sector stores the physical line.
Next, the memory slice that contains S is found through S/Sds. Recall that
a memory slice is the union of the same line locations in all the partitions
in a stack; as such, A is mapped to the (S/Sds)th sector in the appropriate
partition.
Finally, to determine the appropriate partition, which is uniquely identifi-
able by the die ID and channel ID of the partition, recall that the locations
of the parity sectors rotate within groups of Sstripe memory slices (see Figure
7.6); therefore, the locations of the parity sectors start repeating after every
Sstripe memory slices. As such, we record the channel ID and die ID for S in
a data mapping table (DMT) that is indexed by S mod (Sstripe · Sds). The
(S mod (Sstripe · Sds))th entry in the DMT stores the die ID and the channel
ID of S. To maximize the spatial locality of the parity sectors, the DMT
maps adjacent physical addresses into the same stripe by storing in adjacent
DMT entries die IDs and channel IDs that point to the same stripe (see the
paragraph below for details). Similarly, we use S mod (Sstripe · Sds) as an
index into a parity mapping table (PMT) to identify the die ID and channel
ID of the parity sector of the stripe to which A belongs. The DMT and PMT
are implemented in hardware in the memory controller for fast lookup.
Populating the DMT takes the following steps. First, compute the stripe
number k for all pairs of die ID and channel ID in a stack using Equation
7.1 and then sort the ID pairs by their respective k values from the least to
the greatest. Next, replicate the sorted list of ID pairs Sstripe − 1 times and
concatenate all Sstripe lists into a single list. Then, for the mth group of
Sstripe · p adjacent ID pairs within the single large list (where m ∈ [0, Sstripe − 1]),
remove the mth ID pair in each group of Sstripe adjacent ID pairs; these
removed ID pairs point to parity sectors. Finally, insert the remaining ID
pairs in the single list into the DMT in order, one ID pair per DMT entry.
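The table-population steps and the lookup flow of Figure 7.7 can be sketched in software as follows. This is a minimal illustration under stated assumptions rather than the hardware implementation: the names `build_tables` and `locate` are hypothetical, the PMT is filled by recording, for each DMT entry, the ID pair removed from the same stripe, and every stripe is assumed to contain exactly Sstripe partitions (which holds when p equals a dimension size, as in the examples in this chapter).

```python
from itertools import product


def build_tables(dims, p):
    """Return (dmt, pmt, s_stripe, s_ds) for one stack.

    dims: sizes of the fault dimensions, e.g. (dies, channels)
    p:    protection strength
    """
    def stripe_of(coords):                       # Equation 7.1
        return sum(coords) % p

    # Sort all ID pairs by stripe number (stable sort keeps ID order per stripe).
    pairs = sorted(product(*(range(s) for s in dims)), key=stripe_of)
    s_stripe = len(pairs) // p                   # Equation 7.2
    s_ds = p * (s_stripe - 1)                    # Equation 7.3
    dmt, pmt = [], []
    for m in range(s_stripe):                    # m-th copy of the list = m-th slice in a group
        for k in range(p):                       # k-th run of s_stripe pairs = stripe k
            stripe = pairs[k * s_stripe:(k + 1) * s_stripe]
            for j, pair in enumerate(stripe):
                if j == m:
                    continue                     # removed pair: parity sector of stripe k
                dmt.append(pair)                 # data sector location
                pmt.append(stripe[m])            # location of its parity sector
    return dmt, pmt, s_stripe, s_ds


def locate(sector_addr, dmt, pmt, s_stripe, s_ds):
    """Map sector address S to (slice number, data partition, parity partition)."""
    slice_no = sector_addr // s_ds
    idx = sector_addr % (s_stripe * s_ds)
    return slice_no, dmt[idx], pmt[idx]


dmt, pmt, s_stripe, s_ds = build_tables((4, 4), 4)
print(len(dmt))                                  # 48 entries = Sstripe * Sds
print(locate(13, dmt, pmt, s_stripe, s_ds))      # (1, (2, 2), (1, 3))
```

Because adjacent DMT entries come from the same run of the sorted list, adjacent sector addresses fall in the same stripe and therefore share a parity sector, which is the locality property the mapping is designed to provide.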
7.4.3 Error Correction
Similar to RAID, when a line encounters an error that is uncorrectable by
the line’s dedicated ECC, Parity Helix uses the parity line and other data
lines in the same stripe to reconstruct the erroneous line. While the parity
line of the erroneous data line can be found through the PMT, the other data
lines in the same stripe need to be identified using a separate table, which we
call the correction table or CT. The CT contains a total of p · Sstripe entries,
one for each of the p stripes in Sstripe memory slices. The ith entry in the CT
contains the partition coordinates of all the data sectors in a memory slice
that belong to the ith stripe in the memory slice.
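The reconstruction step itself is the standard RAID-style XOR. The following short sketch assumes lines are modeled as equal-length byte strings; the helper names `xor_lines` and `reconstruct` are hypothetical. It shows that XORing the parity line with the surviving data lines of a stripe recovers the lost line.

```python
from functools import reduce


def xor_lines(lines):
    """Bitwise XOR of equally sized byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*lines))


def reconstruct(surviving_lines, parity_line):
    """Recover the single missing line of a stripe from the rest of the stripe."""
    return xor_lines(list(surviving_lines) + [parity_line])


# Three 64B data lines in one stripe and their parity line.
data = [bytes([value] * 64) for value in (0x11, 0x22, 0x44)]
parity = xor_lines(data)

# Suppose the line in data[1] is lost to a die or channel fault.
recovered = reconstruct([data[0], data[2]], parity)
print(recovered == data[1])   # True
```

Recovery reads every other line in the stripe, which is why repeated correction of a permanent fault is costly.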
Reading out all lines in the stripe to correct an erroneous line can incur
significant performance and energy overheads. The problem of having to read
out multiple lines to perform error correction using a parity line is common
across all RAID-based error resilience solutions, which include prior works
such as [106] and [162]. This problem is especially of concern for permanent
DRAM faults, which require repeated error correction. Unfortunately, field
studies show that permanent faults are also the most common fault mode in
DRAMs [11, 30]. Since error correction via the parity lines is used only when
a data line encounters an error that is uncorrectable by its dedicated ECCs,
error correction via the parity lines is only needed for large granularity faults
such as the DRAM die fault or channel fault that can cause complete line
failures. To reduce the overheads of Parity Helix for error correction via the
parity lines, we propose gracefully degrading the available capacity of a faulty
stack by retiring (or disabling) faulty channels and dies. This prevents future
accesses to faulty channels and dies and thereby avoids frequent error correction via
parity lines; in addition, avoiding repeated accesses to faulty channels and
dies also alleviates the burden on error detection since repeatedly accessing
erroneous lines increases the chance of errors going undetected.
Retiring a DRAM die or channel reduces the size of the corresponding
partition dimension by one. Since the number of dies or channels in a
stack is reduced, the data previously stored in the stack can no longer com-
pletely fit in the stack and thus some of the data must be migrated outside
of the stack. This requires alerting the OS to migrate the content of the top
1/M of the OS pages currently mapped to the faulty stack, where M is the total
number of dies or channels per stack, to other stacks or to disk and then to
permanently retire the corresponding OS pages. This frees up a channel or die
worth of free space within the stack such that the memory controller of the
faulty stack can reorder in hardware the remaining data in the stack to com-
pletely avoid using the disabled die or channel. This requires recalculating
the contents of the DMT, PMT, and CT using the new dimensions via the
procedures described in Section 7.4.2 and then making minor adjustments
to the tables to completely avoid any access to the disabled DRAM die or
channel; the latter can be accomplished by simply increasing all coordinate
values in the table by one for all DRAM die IDs or channel IDs greater than
or equal to the disabled DRAM die or channel.
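The final table adjustment can be sketched as a simple ID remapping. The helper names `skip_disabled` and `remap_table` below are assumptions for illustration; table entries are modeled as coordinate tuples computed for the shrunken dimension.

```python
def skip_disabled(logical_id, disabled_id):
    """Shift IDs at or above the retired die or channel up by one."""
    return logical_id + 1 if logical_id >= disabled_id else logical_id


def remap_table(entries, dim, disabled_id):
    """Apply the shift to coordinate `dim` of every table entry."""
    return [
        tuple(skip_disabled(c, disabled_id) if i == dim else c
              for i, c in enumerate(entry))
        for entry in entries
    ]


# Seven logical die IDs after retiring physical die 3 out of eight.
print([skip_disabled(i, 3) for i in range(7)])     # [0, 1, 2, 4, 5, 6, 7]
print(remap_table([(2, 5), (3, 0), (6, 1)], 0, 3))  # [(2, 5), (4, 0), (7, 1)]
```

After the shift, no table entry refers to the retired die or channel.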
7.4.4 Hardware Overheads
Parity Helix requires adding new hardware structures to the processor to
implement the division and modulo operations in Figure 7.7. These are
divide-by-constant and modulo-by-constant operations, with the constants
being Sds and Sstripe · Sds, respectively. The area overhead of a modulo-
by-constant circuit is low [175, 161]; Qureshi et al. report that it requires
only a few hundred logic gates and incurs a two-cycle latency [175]. For the
divide-by-constant operation, Drane et al. report a 0.001mm2 area overhead
and 1ns delay for a non-approximate always-round-to-zero constant divider
synthesized using the 65nm processing technology [176].
In addition to adding the divide-by-constant and modulo-by-constant circuits
to the processor, Parity Helix also requires adding the DMT, PMT,
and the CT to the processor. Both the DMT and PMT require Sstripe · Sds
entries with ceil(log2(s1)) + ceil(log2(s2)) + ...+ ceil(log2(sN)) bits per entry.
The CT requires p · Sstripe entries with (Sstripe − 1) · (ceil(log2(s1)) + ceil(log2(s2)) +
... + ceil(log2(sN))) bits per entry. Finally, two sets of DMT, PMT, CT are
needed during the retirement process because both the mappings before and
after the retirement are needed during the retirement process. For our eval-
uation in Section 2.6 with 8GB stacks, the total size of the two sets of all
three tables is 2016 bytes per stack. Due to the small size of each individual
table, the latency overhead of accessing each table is only one cycle.
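The 2016-byte figure can be reproduced from the entry counts and widths given above. The sketch below is an arithmetic check, not a hardware description; it takes the CT to have p · Sstripe entries, as stated in Section 7.4.3, and the function name `table_bytes` is an assumption.

```python
from math import ceil, log2, prod


def table_bytes(dims, p, sets=2):
    """Total size in bytes of `sets` copies of the DMT, PMT, and CT."""
    s_stripe = prod(dims) // p                               # Equation 7.2
    s_ds = p * (s_stripe - 1)                                # Equation 7.3
    coord_bits = sum(ceil(log2(s)) for s in dims)            # bits per partition ID
    dmt_bits = s_stripe * s_ds * coord_bits                  # Sstripe * Sds entries
    pmt_bits = dmt_bits                                      # same shape as the DMT
    ct_bits = p * s_stripe * (s_stripe - 1) * coord_bits     # p * Sstripe entries
    return sets * (dmt_bits + pmt_bits + ct_bits) // 8


print(table_bytes((8, 8), 8))   # 2016
```

For eight dies and eight channels with p = 8, each of the three tables occupies 336 bytes, so two sets occupy 2016 bytes.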
Finally, to identify a faulty die or channel for graceful capacity degradation
in a faulty stack (see Section 7.4.3), we require a counter for each
channel and each die in the stack to count the number of errors encountered
in the channel or die. If the number of errors for a channel or die exceeds
a threshold, we perform read-write-read on all the lines in the channel or
die to detect whether errors exist in multiple² banks in the channel or die;
if so, the channel or die is marked as faulty. Assuming an 8-bit threshold
counter for each channel or die, and one bit per bank for each channel or die
to indicate whether the bank contains error(s), a total of 64B is required for
the evaluated system in Section 2.6.
7.5 Methodology
7.5.1 Baselines
We compare against three error resilience baselines. The first baseline is
similar to [153]; it protects each line with only dedicated ECC bits but not
with any parity line. We refer to this baseline as Non-RAID ECC. The second
baseline is similar to [162]; it protects against up to a complete channel fault in a stack by
using RAID5 across channels. We refer to this baseline as Channel Correct.
We evaluate these two baselines because the former minimizes capacity and
power overheads while the latter maximizes protection strength among all
prior works on error resilience for die-stacked DRAMs. The third baseline
is the simple RAID adaptation to die-stacked DRAMs that we described in
Section 7.3; it corrects up to a complete channel and a complete DRAM die
fault in a stack by protecting each data line with two parity lines. We refer
to this as Two-Parity RAID (TPR).
² If errors exist in a single bank, we disable the whole die (instead of a single bank) for
simplicity.
Table 7.2: Processor Microarchitecture
Core                   16 cores, 3GHz, 2-issue OOO
L1 d-cache, i-cache    2-way, 64kB, 1 cycle
Private L2 cache       8-way, 512kB, 3 cycles
Shared L3 cache        32-way, 32MB, 20 cycles
7.5.2 Microarchitecture and Workloads
We use GEM5 [23], a cycle-accurate simulator, to simulate a 16-core x86
processor. Table 7.2 lists the processor parameters; they are based on Intel
Xeon E7 processors. The cache access latencies in Table 7.2 are obtained
using CACTI [96] assuming the 32nm process technology. For generality, our
evaluation assumes that the processor implements memory error resilience;
our techniques and analysis apply directly when a logic die on a die-stacked
DRAM is used to implement error resilience [164]. For the RAID-based
resilience schemes, the processor caches the parity lines in the last-level cache
of the processor, similar to many prior works in memory error resilience
[16, 177, 106]. We use the caching scheme in [177], which not only reduces
the number of updates to the parity lines in memory, but also reduces the
number of reads to the stale values of data lines needed to calculate the
update to the shared parity lines.
We simulate 10 workloads in full system mode; the simulated OS is the
Linux 2.6.28.4 OS provided with GEM5. The workloads consist of six NAS
Parallel benchmarks [29], two SPLASH2X multi-threaded benchmarks [178],
a floating-point mixed workload and an integer mixed workload. Each mixed
workload consists of SPEC benchmarks and multi-threaded SPLASH2x bench-
marks. We fast-forward each workload in GEM5 functional simulation mode
until after the OS has started up and the parallel benchmarks have com-
pleted their initialization; we rely on the terminal output in the simulated
Linux OS to determine when the OS startup and workload initialization
have completed. After the fast-forwarding, we warm up the processor cache
in functional mode for 200ms of simulated time; the 200ms cache warmup
period is selected to fill up the 32MB last-level cache and turn over the cache
content many times. Finally, we simulate each workload in cycle-accurate
simulation mode for 15ms of simulated time and report the measurements
taken during this period. Table 7.3 lists the composition and fast-forward
distance of the chosen workloads, as well as the corresponding memory
characteristics such as memory read bandwidth, write bandwidth, and resident
set size.
Table 7.3: Workloads
Workload            Composition                                  Fast-forwarding distance   Read/Write BW (GB/s)   RSS (GB)
NAS:bt.D            16T                                          64.1s                      3.4 / 1.2              10
NAS:lu.D            16T                                          58.1s                      24 / 12                8.9
NAS:sp.D            16T                                          40.8s                      36 / 6.0               11
NAS:ua.D            16T                                          19.0s                      5.4 / 1.4              7.3
NAS:ft.C            16T                                          11.8s                      7.0 / 6.0              5.4
NAS:mg.C            16T                                          7.7s                       13 / 11                3.3
Splash2X:fft        16T                                          75.0s                      7.6 / 3.8              12
Splash2X:ocean_cp   16T                                          5.2                        11 / 8.9               3.5
mix INT             4T radix, 4 mcf, 4 omnetpp, 4 astar          8.0s                       7.2 / 2.1              12
mix FP              4T ocean_cp, 4 bwaves, 4 cactusADM, 4 wrf    43.7s                      21 / 8.5               12
7.5.3 Memory Modeling
We evaluate memory systems with two 8GB HBM stacks to accommodate
the memory footprint of all of our workloads. We evaluate ECC stacks, which
contain 12.5% more bits per line than non-ECC stacks for storing dedicated
per-line ECC bits [163]. We select the die, channel, and bank count of each
stack according to the top configuration available in the HBM specification
[163]. For the memory I/O frequency, however, we evaluate a low 500MHz
frequency; since Parity Helix incurs a bandwidth overhead for updating the
parity lines, evaluating a lower I/O frequency enables us to stress the stack
bandwidth utilization and, therefore, stress the performance overhead due to
the bandwidth overhead. To further stress the stack bandwidth utilization,
we assign the first contiguous half of the physical address space to the first
stack and the second half to the second stack instead of interleaving the
physical address space evenly across the two stacks. Our evaluation shows
that our workloads utilize up to 48% of the available bandwidth in the more
frequently accessed stack out of the two, and 27% on average, under the
Non-RAID ECC baseline; we also observed bursty periods within the evaluation
interval that utilize nearly 100% of the memory bandwidth. As such, we
believe our workloads sufficiently stress the memory bandwidth utilization.
We use CACTI-3DD [179] to model the stack power and timing similar to
[162]. We use the numbers reported in [180] to calculate the IO energy of
a stack. We use DRAMSim2 [21], a cycle-accurate DRAM simulator, to
model the timing of the stack. Table 7.4 summarizes the evaluated stack
parameters.
Table 7.4: Stack Configuration
Capacity/Dies/Channels       8GB/8/8
Channel Bandwidth            0.5GHz, 144-bit wide
Data line size               64B
Banks/Row size               16 per channel/2KB
MSHR/refresh (ms)            32 per channel/64
Activate/Precharge (nJ)      1.48/1.43
DRAM Read/Write (nJ)         6.87/6.87
TSV/IO transfer (nJ)         1.13/5.5
tCAS-tRCD-tRP-tRAS (ns)      18-18-18-26
We model memory controllers that prioritize reads over writes and use
the first-come-first-served policy and bank-round-robin policy as the intra-bank
and inter-bank memory scheduling policies, respectively. We evaluate
the open-page row-buffer policy. For the RAID-based resilience schemes, we
colocate eight adjacent physical lines in the same DRAM row (i.e., we set
the sector size, described in Section 7.4.2, to eight) to strike a good balance
between row buffer hit rate and parity line hit rate for these resilience
schemes. For Non-RAID ECC, which does not require updating the parity
lines, we co-locate 32 adjacent lines in the same row since each 2KB row
in the evaluated stacks can hold up to 32 64B lines; we interleave adjacent
groups of 32 lines across different banks and channels to maximize bank-level
parallelism for Non-RAID ECC.
When modeling Parity Helix, we apply Parity Helix to each stack individually.
For each stack, both the channel and die coordinate values range from
zero to seven (since there are eight channels and eight dies per stack in Table
7.4); since a channel fault or a die fault can affect up to eight partitions, we
set p to eight. As such, the capacity overhead is 1/7 (see Equation 7.4).
Table 7.5: Evaluated Fault Rates [30, 162]
Fault Type        Affects up to         Fault rate
Single-bit        one bit per line      0.030 FIT/Mb
Single-column     four bits per line    3.8E-3 FIT/Mb
Single-TSV        four bits per line    41 FIT/TSV
Single-bank       the whole line        9.4E-3 FIT/Mb
Single-channel    the whole line        4.8E-4 FIT/Mb
Single-die        the whole line        4.4E-4 FIT/Mb
7.5.4 Reliability Modeling
We use combinatorial analysis to calculate reliability. Similar to prior works
[161, 106, 162], we assume the fault rates remain constant over time and use
2D DRAM fault rates to approximate fault rates in die-stacked DRAMs since
no reliability data is currently publicly available for die-stacked DRAMs.
We use the DDR3 DRAM fault rates reported in [30]. We pessimistically
categorize all faults within a bank that can affect up to an entire line in a
bank (e.g., the single-row fault or single-line fault) as the single-bank fault for
simplicity. For Parity Helix, we directly map the 2D-DRAM multi-bank fault
rate in [30] to our 3D-DRAM die fault rate because both affect multiple banks
in a single DRAM die. Similarly, when modeling the channel fault rate, we
directly map the 2D-DRAM multi-rank fault rate in [30] to our 3D-DRAM
channel fault rate because both affect multiple banks in different DRAM dies
that share a common I/O connection. We use the average fraction of lines affected
by a bank, multi-bank, and lane fault reported in [11] to model the fractions
of lines affected by a bank, die, and channel fault, respectively, in a stack.
For the bank, channel, and die faults, which can affect up to an entire line,
we assume that half of the bits in each affected line are erroneous, similar to
[161]. Since [30] does not report any TSV fault rates, we use the TSV fault
rate used in [162]. Table 7.5 lists the fault rates used in our evaluation.
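As a rough illustration of how the per-megabit FIT rates in Table 7.5 translate into fault probabilities, the following Python sketch converts a FIT rate into the probability that a single 8GB stack develops at least one fault of a given type over seven years, under the constant-fault-rate assumption stated above. The function names, the 8GB = 65,536 Mb conversion, and the 365-day year are assumptions made for this example; it is a simplified stand-in, not the combinatorial model used in the evaluation.

```python
import math

HOURS_PER_YEAR = 24 * 365      # assumption: 365-day year
STACK_MBITS = 8 * 1024 * 8     # assumption: 8GB stack = 65,536 Mb

# Per-Mb FIT rates copied from Table 7.5
# (1 FIT = 1 failure per 1e9 device-hours).
FIT_PER_MBIT = {
    "single-bit": 0.030,
    "single-column": 3.8e-3,
    "single-bank": 9.4e-3,
    "single-channel": 4.8e-4,
    "single-die": 4.4e-4,
}


def fault_probability(fit_per_mbit, mbits, hours):
    """P(at least one fault) under a constant-rate (exponential) model."""
    rate_per_hour = fit_per_mbit * mbits / 1e9
    return 1.0 - math.exp(-rate_per_hour * hours)


def seven_year_probabilities():
    """Per-stack fault probability for each per-Mb fault type over 7 years."""
    hours = 7 * HOURS_PER_YEAR
    return {name: fault_probability(fit, STACK_MBITS, hours)
            for name, fit in FIT_PER_MBIT.items()}


if __name__ == "__main__":
    for name, p in seven_year_probabilities().items():
        print(f"{name:15s} {p:.4%}")
```

The single-TSV rate is omitted here because it is specified per TSV rather than per megabit, and the number of TSVs per stack is not restated in this section.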
Recall from Section 7.4.3 that Parity Helix can protect against the accu-
mulation of faults by retiring faulty dies or channels at the cost of reduced
available memory capacity. When modeling the reliability of Parity Helix,
our reliability calculation assumes that Parity Helix is allowed to retire up
to two channels or two dies or one of each since retiring more memory in a
stack may render the stack unusable for the intended applications. We make
the same assumption for TPR; similarly, we assume that Channel Correct
can retire up to two faulty channels (but not any dies since it cannot correct
against die faults). Our reliability calculation assumes that a patrol scrub
is performed once every 24 hours to ensure that faulty channels or dies are
detected and, therefore, retired in a timely manner; if more faults accumulate
during the same scrub period than the particular resilience scheme can correct
without performing retirement, our calculation assumes that the faults cause
an uncorrectable error.
Recall from Section 7.4 that RAID uses ECC bits dedicated to each line to
detect and/or correct errors and uses parity lines to correct large errors
that affect up to an entire line. We use the dedicated per line ECC proposed
by prior works to evaluate all resilience schemes. The evaluated dedicated
per line ECC uses 8B of redundancy for every 64B data line; this matches
the 1/8 = 12.5% ECC area overhead provisioned in an ECC stack in the
HBM specification [163]. The 8B of dedicated ECC bits is a synthesis of
different ECCs used in prior works [161, 106]. The 8B consists of a 32-bit
CRC computed over the 64B data line, a 16-bit SSCDSD ECC computed
over the 64B data line using 4-bit symbols, and 16 spare bits. The purpose
of the 32-bit CRC is to reliably detect errors that cannot be corrected by the
dedicated ECC, as proposed in [161]; also as proposed in [161], the address of
the line is folded into the CRC to detect address decoder errors. The 16-bit
SSCDSD ECC [162] (which is called the SBCDBD ECC in [153]) guarantees
correction of errors due to the single-cell fault and single-TSV fault in a
stack [162, 153]. The remaining 16 spare bits per line are used to protect
against the accumulation of data TSV faults in a stack over time, similar to
[106]. Since each data TSV supplies 4 bits per 64B line due to the 128-bit
channel data width (see Table 7.4), the 16 spare bits per line can replace up
to 16/4 = 4 bad data TSVs per channel. Similarly, we also use the spare bits
to protect against the accumulation of single bit and column faults. Since all
resilience schemes we evaluate use the same codes for error detection, Parity
Helix differs only in terms of error correction coverage, not error detection















































[Figure 7.8 plot; legend: Non-RAID ECC, Channel Correct, TPR, Parity Helix]
Figure 7.8: Probability of encountering uncorrectable memory error(s) in
systems with different aggregate stack sizes during seven years of operation.
Parity Helix reduces the probability of uncorrectable error(s) by over 120X
and 100X compared to Non-RAID ECC and Channel Correct, respectively.
the undetectable error probabilities.
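The bit budget of the dedicated per line ECC described above can be checked with a few lines of arithmetic. The Python sketch below uses only figures quoted in the text (64B lines, 8B of redundancy, a 128-bit channel data width, and the 32/16/16-bit split); the constant and function names are illustrative, and the sketch assumes each data TSV carries one bit per transfer.

```python
LINE_BYTES = 64             # data line size
ECC_BYTES = 8               # dedicated per-line redundancy
CHANNEL_WIDTH_BITS = 128    # channel data width (Table 7.4)

CRC_BITS = 32               # error detection, address folded in
SSCDSD_BITS = 16            # correction over 4-bit symbols
SPARE_BITS = 16             # sparing for accumulated faults


def ecc_budget():
    """Return the derived quantities quoted in the text."""
    total_bits = CRC_BITS + SSCDSD_BITS + SPARE_BITS
    # A 64B line needs 512 / 128 = 4 transfers, so each data TSV
    # supplies 4 bits of every line.
    bits_per_tsv = LINE_BYTES * 8 // CHANNEL_WIDTH_BITS
    return {
        "total_ecc_bits": total_bits,
        "overhead": ECC_BYTES / LINE_BYTES,
        "bits_per_tsv_per_line": bits_per_tsv,
        "replaceable_tsvs": SPARE_BITS // bits_per_tsv,
    }


if __name__ == "__main__":
    for key, value in ecc_budget().items():
        print(key, value)
```

The three fields add up to exactly the 64 redundancy bits available per line, which is why no further per-line codes can be added without exceeding the 12.5% overhead.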
7.6 Results
7.6.1 Reliability
Figure 7.8 presents the probability of experiencing uncorrectable error(s) in
systems with different aggregate physical sizes of die-stacked DRAMs. Parity
Helix reduces the probability of having uncorrectable error(s) by over 120X
and 100X compared to Non-RAID ECC and Channel Correct, respectively,
for all design points shown in Figure 7.8. This is comparable in magnitude to
the improvement achieved by Chipkill Correct over SECDED in conventional
memory systems (40X [11]). Parity Helix offers a remarkable reduction in
uncorrectable error probability because it can correct all individual faults
in a stack and only fails due to the accumulation of multiple faults; Non-RAID
ECC and Channel Correct, on the other hand, can fail in the presence of
a single fault in a stack (e.g., the DRAM die fault). Parity Helix incurs only
3.6% higher uncorrectable error probability than TPR, even though Parity Helix
requires only half as much capacity overhead as TPR because it shares the same
set of error correction resources across both dimensions. TPR incurs slightly
lower probability of uncorrectable error because it can protect against the
accumulation of more faults (e.g., both a die and a channel fault) occurring
within each 24-hour scrub period due to its higher redundancy than Parity
Helix. However, since the probability of multiple faults occurring within the
same 24-hour scrub period within the same stack is small, the difference
between the uncorrectable error probability of Parity Helix and TPR is also
small (i.e., 3.6%). This small difference can be further narrowed by scrubbing
more frequently than once per 24 hours.
Lower uncorrectable error probability reduces performance overhead from
checkpoint-restart and improves availability. Consider an HPC system with
64PB of memory [181], and assume a checkpoint interval of 4 hours and a 2
hour checkpoint-restart performance overhead for each uncorrectable error.
Also assume that the memory system consists entirely of die-stacked DRAMs.
Using the single stack uncorrectable error probabilities (i.e., the 8GB data
points in Figure 7.8), we calculate that Channel Correct incurs 11.6 hours of
performance overhead per day, while Parity Helix and TPR incur only 0.053
and 0.051 hours per day, respectively.
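The text does not spell out the formula behind these overhead figures. One simple expectation model that yields numbers of this magnitude is sketched below in Python: it treats stacks as independent, converts a per-stack seven-year uncorrectable error probability into an expected error count, and charges a fixed 2-hour cost per error. The function name, the binary PB/GB sizes, the 365-day year, and the example input probability are all assumptions for illustration; the model also ignores the 4-hour checkpoint interval.

```python
import math

PB = 2 ** 50   # assumption: binary petabyte
GB = 2 ** 30   # assumption: binary gigabyte


def checkpoint_overhead_hours_per_day(p_stack_7yr, memory_bytes, stack_bytes,
                                      restart_hours=2.0,
                                      lifetime_days=7 * 365):
    """Expected checkpoint-restart hours per day for the whole system.

    Assumes independent stacks, a constant uncorrectable-error rate, and a
    fixed restart cost per uncorrectable error (simplifying assumptions).
    """
    n_stacks = memory_bytes // stack_bytes
    # Convert the 7-year probability into an expected error count per stack.
    errors_per_stack = -math.log(1.0 - p_stack_7yr)
    errors_per_day = n_stacks * errors_per_stack / lifetime_days
    return errors_per_day * restart_hours


# Illustrative input only: 1.77e-3 is an assumed per-stack probability,
# not a value read from Figure 7.8.
example = checkpoint_overhead_hours_per_day(1.77e-3, 64 * PB, 8 * GB)

if __name__ == "__main__":
    print(f"{example:.1f} hours/day")
```

With this assumed input the sketch gives roughly 11.6 hours per day, which shows how strongly the system-level overhead depends on the per-stack probability when a system contains millions of stacks.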
7.6.2 Capacity Overhead
Figure 7.9 shows the amount of data capacity per stack (i.e., the number of
lines used for storing data lines instead of parity lines in a stack) of different
resilience schemes normalized to that of Non-RAID ECC for the evaluated
stacks. Figure 7.9 shows both the available memory capacity at the begin-
ning of a stack’s operation, when the stack is fault-free, and the average
available memory capacity at the end of the seven years of operation, when
some channels/dies have been retired due to developing faults. Parity Helix
increases the amount of available data capacity per stack by 16.7% compared
to TPR. Parity Helix incurs the same capacity overhead as Channel Correct,
which only protects up to a complete channel fault in a stack.
Figure 7.9 shows, however, that Parity Helix provides 12.5% lower avail-
able data capacity than Non-RAID ECC. As will be shown in Sections 7.6.3


































Figure 7.9: Available data capacity per stack. Parity Helix increases data
capacity per stack by 16.7% compared to TPR and provides the same data
capacity as Channel Correct.
Non-RAID ECC. On the other hand, Parity Helix protects against complete
DRAM die fault - a fault commonly covered in memory systems of datacen-
ters and supercomputers - while the Non-RAID ECC does not; Figure 7.8
shows that Parity Helix provides over 120X lower probability of experiencing
uncorrectable errors than Non-RAID ECC due to the former’s higher fault
coverage. As such, we believe Parity Helix is a useful and distinctive design
point from Non-RAID ECC.
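The fault-free capacity figures in this section follow from simple ratios. The Python sketch below reproduces them with exact fractions, using p = 8 partitions per stripe as set in Section 7.5.3. It assumes that Parity Helix and Channel Correct devote one of every eight partitions to parity and that TPR devotes two; the TPR figure is inferred from the statement that TPR needs twice the capacity overhead, and the names are illustrative.

```python
from fractions import Fraction

P = 8  # partitions per stripe (p = 8 in the evaluation)


def data_fraction(parity_partitions, p=P):
    """Fraction of lines left for data when `parity_partitions` of p hold parity."""
    return Fraction(p - parity_partitions, p)


non_raid = data_fraction(0)        # no parity lines
parity_helix = data_fraction(1)    # one parity partition per stripe
channel_correct = data_fraction(1)
tpr = data_fraction(2)             # assumption: two parity partitions per stripe

gain_over_tpr = parity_helix / tpr - 1             # 1/6, about 16.7%
loss_vs_non_raid = 1 - parity_helix / non_raid     # 1/8, i.e., 12.5%
overhead = (1 - parity_helix) / parity_helix       # 1/7 parity per data line

if __name__ == "__main__":
    print(float(gain_over_tpr), float(loss_vs_non_raid), overhead)
```

The 1/7 overhead is expressed relative to data capacity, whereas the 12.5% reduction is relative to the total capacity of the stack, which is why the two numbers differ.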
7.6.3 Memory Energy
Figure 7.10 shows the memory energy per program read/write access for
the different schemes. Compared to TPR, Parity Helix consumes 21% lower
memory energy per program access, on average, and up to 45% lower memory
energy per program access (for ft.C). This large difference in memory energy
per program access is due to the larger number of memory accesses needed to
update the parity lines for TPR. Figure 7.11 shows that compared to TPR,
Parity Helix incurs 24% fewer stack accesses per program access, on average,
and incurs up to 48%
fewer stack accesses per program access for ft.C. TPR requires more stack
accesses per program access than Parity Helix for three reasons.
First, TPR protects each line of data with two lines of parity, which ef-













































[Figure 7.10 plot; legend: Non-RAID ECC, Channel Correct, TPR, Parity Helix]
Figure 7.10: Stack energy per program access. Parity Helix reduces
memory energy per program access by 21% compared to TPR.
lines. Second, TPR incurs higher capacity overhead than Parity Helix (see
Section 7.6.2); this reduces the locality of the parity lines of TPR in the
last-level cache since each TPR parity line covers one fewer data line than
a Parity Helix parity line. Third, the two parity lines are constructed from
two orthogonal groups of data where one group consists of data lines from
different channels and the other group consists of data lines from different
DRAM dies; as such, the two parity lines have at most one data line in common.
Therefore, while one parity line can protect physically adjacent lines (e.g.,
with physical addresses A, A + 1, A + 2, ..., A + 6), the other parity line can
only protect more distant physical lines (e.g., with physical addresses
A, A + 6, ..., A + 30, etc.). This results in poorer cache locality of the
latter parity line. Parity Helix, however, only protects each line of data
with a single parity line and, therefore, enjoys high cache locality for the
parity lines.
Figure 7.10 shows that Parity Helix consumes roughly the same memory
energy per program read/write as Channel Correct. This is because Parity
Helix requires roughly the same number of stack accesses per application
access as Channel Correct, as shown in Figure 7.11. This is due to the fact
that both of these resilience schemes protect each data line with one parity
line and protect the same number of data lines per parity line.
7.6.4 Performance
Figure 7.12 presents the system throughput for the different resilience schemes.














































[Figure 7.11 plot; legend: Non-RAID ECC, Channel Correct, TPR, Parity Helix]
Figure 7.11: Number of stack accesses per program access. Parity Helix
incurs 18.4% fewer accesses per program access than TPR and incurs the
same number of accesses as Channel Correct.
second) as the metric of measurement. For mixed int, we measure through-
put using the number of committed store instructions per second.3 Figure
7.12 shows that the average performance degradations of the RAID-based
resilience schemes, Parity Helix, TPR, and Channel Correct, are 4.5%, 7.3%,
and 3.3%, respectively, compared to the Non-RAID ECC. TPR incurs the
highest performance degradation (62% higher degradation than Parity He-
lix) due to requiring a large number of overhead accesses to update the two
parity lines protecting each data line. Parity Helix incurs 1.0% performance
degradation compared to Channel Correct due to the added address decoding
latency (see Section 7.4.4).
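The relative figures quoted here and in the caption of Figure 7.12 can be cross-checked from the three average degradations. The Python sketch below takes Parity Helix at 4.5%, TPR at 7.3%, and Channel Correct at 3.3%, which is the assignment consistent with TPR having the highest degradation and with the throughput ratios in the Figure 7.12 caption. The dictionary and function names are illustrative.

```python
# Average throughput degradation relative to Non-RAID ECC (Section 7.6.4).
DEGRADATION = {
    "Parity Helix": 0.045,
    "TPR": 0.073,
    "Channel Correct": 0.033,
}


def relative_throughput(scheme, baseline):
    """Throughput of `scheme` as a fraction of the throughput of `baseline`."""
    return (1 - DEGRADATION[scheme]) / (1 - DEGRADATION[baseline])


# How much larger TPR's degradation is than Parity Helix's.
extra_degradation_tpr = DEGRADATION["TPR"] / DEGRADATION["Parity Helix"] - 1

ph_vs_tpr = relative_throughput("Parity Helix", "TPR")
ph_vs_cc = relative_throughput("Parity Helix", "Channel Correct")

if __name__ == "__main__":
    print(f"TPR degradation is {extra_degradation_tpr:.0%} higher")
    print(f"Parity Helix throughput: {ph_vs_tpr:.0%} of TPR, "
          f"{ph_vs_cc:.0%} of Channel Correct")
```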
The above results assume a fault-free memory system. Recall from Section
7.4.3 that when a partition encounters an error that the dedicated per line
ECC cannot correct, Parity Helix retires the channel or die that contains
the partition. On average, the total amount of time per 8GB stack spent on
reconstructing and copying data for retirement is only 7.4 milliseconds over
the seven-year lifetime of the stack. This low overhead is due to the high
bandwidth of die-stacked DRAMs; all 8GB of data in the evaluated 64GB/s
stack can be accessed in just 8/64 = 0.125 seconds (i.e., 125 milliseconds).
3We did not use the total number of instructions (including integer instructions) as the
























[Figure 7.12 plot; legend: Non-RAID ECC, Channel Correct, TPR, Parity Helix]
Figure 7.12: System throughput. The throughput of Parity Helix is 103%
and 99% that of TPR and Channel Correct, respectively.
7.7 Discussions
7.7.1 Applicability to Other Contexts
Parity Helix is applicable to other contexts where a multi-dimensional mem-
ory encounters single-dimensional faults. For example, applying Parity Helix
to a sub-ranked 2D DRAM memory system is straightforward. Recall from
Section 7.3 that 2D memory systems are vulnerable to multi-chip failures along
different dimensions: both multi-chip failures among the same chip in dif-
ferent ranks and multi-chip failures within a single rank. A sub-ranked 2D
DRAM memory system allows the option of accessing individual DRAM
chips in a rank [182, 156, 154]; as such, a DRAM chip in a sub-ranked mem-
ory system is logically equivalent to a single bank in a stack. The chips in
the same rank in a sub-ranked memory system are logically equivalent to a
DRAM die in a stack. The chips in the same lane (i.e., the same chip in
different ranks) in a sub-ranked memory system are logically equivalent to
all the banks in a channel in a stack. Due to the one-to-one correspondence
of a chip, rank, and lane in a sub-ranked memory system to a DRAM bank,
DRAM die, and channel in a stack, respectively, Parity Helix can be applied
directly to sub-ranked memory systems.
In the above discussions, single-dimensional faults develop along only two
dimensions for both die-stacked DRAMs (i.e., across channels and across
horizontal dies) and 2D DRAMs (i.e., across lanes and across ranks). How-
ever, as described in Section 7.4.1, Parity Helix is generalizable to even those





Figure 7.13: Parity Helix example when single-dimensional faults exist
along three dimensions.
trary number of dimension. Figure 7.13 shows the stripes within a memory
slice for memories where single-dimensional faults can occur along three dif-
ferent dimensions - X, Y, and Z dimensions. Figure 7.13 shows that each
stripe is still similar to a helix, with the difference that each ring in the helix
is translated along the X and Y dimensions with respect to other rings.
7.7.2 Comparison against DRAM Manufacturer Side
Techniques
Parity Helix is an architectural technique to enhance memory reliability.
Many known techniques at the manufacturer side can also improve mem-
ory reliability. However, Parity Helix complements these techniques. For
example, while defect testing and burn-in reduce early-life DRAM failures,
Parity Helix reduces DRAM lifetime failures. While DRAM circuit-level
improvements also enhance DRAM reliability, architectural-level error re-
silience techniques, such as Chipkill Correct, are still needed for conventional
2D DRAMs to achieve the additional reliability to satisfy the needs of data
centers and supercomputers; this is despite the fact that 2D DRAMs have
undergone many decades of circuit-level improvements. Similarly, we expect
Parity Helix to provide the additional reliability needed by mission-critical
systems with die-stacked DRAMs.
It may also be possible to redesign memory systems such that faults only
occur along one dimension. For example, one can design stacks using the
alternative organization where a channel consists of a horizontal group of
banks that lie within the same die; as such, both the channel fault and die
fault affect only the same dimension. However, redesigning memories comes
at the cost of losing many of the advantages of the original design. For












Figure 7.14: A horizontally partitioned stack (left) requires a different
internal wire routing per die. In comparison, all DRAM dies are identical in
a vertically partitioned stack (right).
example, the horizontal channel organization requires all banks in each die
in the stack to connect to a set of data/address TSVs dedicated to the die;
as shown in the left half of Figure 7.14, this requires a different wire routing
within different dies in the stack and, therefore, a different die mask per die
in the stack, which is expensive. Instead of requiring multiple DRAM die
masks, the horizontal channel organization can alternatively pay the cost
of connecting every bank in a die to every data/address TSV in the stack
and then later disconnecting the unwanted connections (e.g., by blowing
fuses). However, the DRAM process allows only a few metal layers [183];
this all-to-all connection complicates intra-die routing for stacks with a large
number of dies and channels. The more widely supported vertical channel
organization, in comparison, allows every bank to connect to only a single
set of data/address TSVs while requiring a single die mask (see right half of
Figure 7.14). Parity Helix gives the option of improving reliability without
modifying DRAM design.
7.8 Conclusion
Memory is fast becoming the power and performance bottleneck in current
and emerging computer systems. The die-stacked DRAM technology is a
promising solution to address the memory bottleneck. However, die-stacked
DRAMs pose new challenges for memory error resilience because they are
vulnerable to large-granularity faults along different physical dimensions. We
propose Parity Helix, a general low-overhead technique to protect against
single-dimensional faults in multi-dimensional memory systems that shares
the same error correction resources across all dimensions. Parity Helix is
most effective for memory systems where every fault can be mapped to a
single dimension such that each fault affects only one dimension at a time.
When applying Parity Helix to die-stacked DRAMs, our evaluation shows
that Parity Helix reduces memory energy per data access by 21%, on average,
and by up to 45% compared to the baseline scheme of maintaining dedicated
error correction resources for each dimension; Parity Helix also increases
available data capacity by 16.7% compared to this baseline. Compared to
protecting against only channel fault but not DRAM fault, the strongest





POWER CONSUMPTION IN MEMORY
NETWORKS
This chapter explores power management for memory networks, a recently
proposed architecture to provide scalable-capacity high-bandwidth off-chip
memory. By adaptively tuning the performance of the different memory
nodes in a memory network depending on the node’s frequency of access, the
proposed network-aware memory power management significantly improves
energy efficiency compared to the conventional approach of providing uniform
performance to each network node.
8.1 Introduction
Processor architectures for data centers and HPC systems have become in-
creasingly throughput-oriented and multi-core. As the number of cores per
processor increases, so does the amount of memory capacity and bandwidth
required per processor to match the core count. Conventionally, memory
systems scale in capacity by adding more memory modules to shared mem-
ory buses. However, multiple devices sharing a bus negatively impacts I/O
signal integrity [184] and thus limits memory capacity and bandwidth [185].
Instead of sharing multiple memory modules on the same memory bus,
a memory network [186, 187] connects one memory module to another via
a point-to-point (P2P) link dedicated to each pair of modules. By adding
memory modules via an expandable network of P2P links, a memory net-
work provides capacity scaling similar to or higher than that of conventional
high-capacity DDRx memories. However, by only connecting a minimum number
of (i.e., two) devices per link, P2P links improve signal integrity [184] and,
therefore, allow high I/O frequencies and thus bandwidth. For example, the Hybrid
Memory Cube (HMC) [164], an emerging memory network technology, in-
creases I/O frequency by up to 8X compared to DDR4, the latest generation
memory with a shared bus interface. Due to providing high bandwidth and
capacity, memory networks will likely benefit future data centers and HPC
systems.
In addition to high memory capacity and bandwidth, large-scale systems
also require high memory energy efficiency. Memory systems consume 25%
- 40% of the total data center power [154]. In a projection based on the
current level of power efficiency, memory power alone will consume 3.5X the
total power budget of future exascale systems [153].
In this chapter, we perform the first exploration to understand the power
characteristics of memory networks. We study different memory network
topologies and sizes and report a detailed breakdown of idle and active power
consumption among different memory components. We find that idle I/O
power is the biggest source of memory network power consumption. Subse-
quently, we study idle I/O power in more detail. We evaluate various well-
known I/O power control mechanisms such as DVFS, variable link width
(VWL), rapid on off (ROO), and their combinations. We adapt prior works
on memory power management to manage I/O power in memory networks.
Since no prior power management works have been proposed in the context of
memory networks, we refer to our adaptation of prior schemes collectively as
network-unaware management. We find that network-unaware management
reduces I/O power by 32% and 21%, on average, for big and small networks,
respectively. However, idle I/O power still remains the top power contribu-
tor. Consequently, we explore and propose novel network-aware management
policies; they provide 29% and 17% I/O power reduction for big and small
networks, respectively, compared to network-unaware management. This
chapter makes the following contributions:
• The first exploration to understand memory network power character-
istics. We identify idle I/O as the biggest power contributor.
• The evaluation of various circuit-level I/O power control mechanisms,
such as rapid on off, variable link width, DVFS, and combinations thereof,
and analysis of their relative effectiveness.
• The first adaptation of prior works on memory power management to
memory networks; average I/O power is reduced by 32% and 21% for



















Figure 8.1: Left: conventional memory systems. Right: memory networks.
of network topology (e.g., linear, tree, etc.) significantly affects the
effectiveness of power management.
• The first network-aware memory power management, which yields an-
other 29% and 17% average I/O power reduction for big and small
networks, respectively.
8.2 Background
Memory system capacity is increased by adding memory modules. Different
types of memory systems differ in terms of how memory modules are added
to the system.
8.2.1 Conventional Memory Systems
Conventional memory systems increase memory capacity by adding more
memory modules to shared memory buses, as shown in the left half of Figure
8.1. However, increasing the number of electrical devices on a bus reduces
signal integrity [184, 185]; this limits the number of memory modules that can
be reliably connected to a bus and, therefore, the maximum memory capacity
per bus. Having more memory buses per processor increases the maximum
capacity and bandwidth of a memory system, but requires more I/O pins
on the processor chip. Unfortunately, the number of I/O pins per processor
is limited [188] since I/O pins increase processor area and, therefore, cost,
which in turn limits the maximum capacity and bandwidth of conventional
memory systems. As such, conventional memory systems either provide high
bandwidth but low capacity (e.g., GDDRx memory with P2P data I/O pins
that allow up to 14 Gbps but cannot be shared) or high capacity but low
bandwidth (e.g., DDRx memory with data I/O pins that allow only up to
3.2 Gbps but can be shared).
8.2.2 Memory Network
In addition to memory, each memory module in a memory network also
contains buffering/routing logic to communicate with other memory modules
via high-speed P2P links. This allows a memory network to increase its
capacity via a network of P2P links, as shown in the right half of Figure
8.1. This expandable network of high-speed P2P links provides both high
bandwidth and high capacity memory.
A well-known example of a memory network module is the HMC.1 An
HMC consists of multiple DRAM dies stacked on top of a logic die using
TSVs (through-silicon-vias) [164]. The DRAM dies provide memory capacity,
while the logic die implements network routing logic and I/O circuitry [164].
An HMC can provide up to 30 Gbps [164] of I/O frequency; in comparison,
current DDR4 DIMMs, the latest memory modules found in conventional
memory systems, only support an I/O frequency of up to 3.2Gbps per lane
[190]. HMCs also improve memory energy per bit by 3X compared to DDR4
[157]. Due to the many benefits of HMCs, we explore memory networks in
the context of HMCs.
Figure 8.2 shows an example of an HMC network. A network of HMCs
communicates via unidirectional links and a packet-based protocol [164],
which are commonly used to support high-speed I/O communication; they
are also used by buffer-on-board memory systems [189], for example. We re-
fer to unidirectional links that send data away from and toward the processor
as request links and response links, respectively. A read request packet con-
sists of a single 16B flit (i.e., minimum traffic flow unit), while write request
and read response packets contain five flits, assuming 64B lines.
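The packet sizes above can be reproduced with a small helper. The Python sketch below assumes that every packet carries one 16B flit of header/tail overhead in addition to its data payload; that overhead figure is an assumption chosen to match the one-flit and five-flit packet sizes stated in the text, and the names are illustrative.

```python
import math

FLIT_BYTES = 16
LINE_BYTES = 64
HEADER_FLITS = 1   # assumption: one flit of header/tail overhead per packet


def packet_flits(payload_bytes):
    """Number of flits in a packet carrying `payload_bytes` of data."""
    return HEADER_FLITS + math.ceil(payload_bytes / FLIT_BYTES)


read_request = packet_flits(0)             # no data payload
write_request = packet_flits(LINE_BYTES)   # carries one 64B line
read_response = packet_flits(LINE_BYTES)   # carries one 64B line

if __name__ == "__main__":
    print(read_request, write_request, read_response)
```

Under this model a read costs one flit on the request links and five on the response links, while a write costs five flits on the request links, so read-heavy and write-heavy workloads load the two link directions differently.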
1As another example, a memory network module could be a few high bandwidth (e.g.,
GDDRx) memory chips attached to a buffer/router chip, such as a buffer-on-board chip
[189], which connects to other such memory network modules via P2P links.
[Figure 8.2 diagram: Processor, HMC, HMC, ..., HMC]







Figure 8.2: HMC network example.
8.3 Analyzing Power Characteristics of Memory
Networks
In this work, we perform the first exploration to understand the power char-
acteristics of memory networks.
8.3.1 Network Topologies
We evaluate minimally connected network topologies. For a given set of
networked memory modules, a minimally connected topology minimizes the
average and worst-case hop distances between the processor and its mem-
ory modules by connecting every available link to a new module, instead
of spending them on already connected modules. A minimally connected
topology is also acyclic and, therefore, does not require deadlock or livelock
avoidance logic.
The HMC standard supports high-radix HMCs with four full links (i.e.,
eight unidirectional links) and low-radix HMCs with two full links [164].
We evaluate networks consisting of only high-radix HMCs, of only low-radix
HMCs, and a mixture of both. Figure 8.3 shows the topologies we examine.
We evaluate the ternary tree, which minimizes network hop distance, using
high-radix HMCs. We evaluate the daisy chain using only low-radix HMCs
to minimize HMC area. For a mixture of both types of HMCs, we evalu-
ate the star topology which grows by adding rings of nodes equidistant from
the processor; for smaller network sizes, star offers the same hop distances
as the ternary tree while requiring fewer high-radix HMCs. We also evalu-
ate another mixed-HMC topology that scales in capacity by adding rows of
memory packages, similar to how DDRx DIMMs scale in capacity by adding
rows or ranks of DDRx memory packages, for potential ease of adoption; we











































[Figure 8.3 diagram; recoverable panel labels: Ternary tree, DDRx-like]
Figure 8.3: Topologies studied.
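To give a feel for how topology affects hop distance, the following Python sketch compares a daisy chain with a ternary tree for a given number of HMCs. It assumes the processor attaches to a single root HMC at hop distance one, that the tree is filled level by level, and that each tree node forwards to at most three children (one of its four full links faces the processor). The node ordering is an assumption and need not match the numbering in Figure 8.3.

```python
def chain_hops(n):
    """Hop distance from the processor to each HMC in a daisy chain."""
    return list(range(1, n + 1))


def ternary_tree_hops(n):
    """Hop distance to each HMC in a ternary tree filled level by level."""
    hops, level, width = [], 1, 1
    while len(hops) < n:
        take = min(width, n - len(hops))
        hops.extend([level] * take)
        level += 1
        width *= 3   # each node has up to three children
    return hops


def average(values):
    return sum(values) / len(values)


if __name__ == "__main__":
    for n in (5, 17):
        chain, tree = chain_hops(n), ternary_tree_hops(n)
        print(n, average(chain), max(chain), average(tree), max(tree))
```

For five modules the chain has average and worst-case distances of 3 and 5 hops, versus 2 and 3 hops for the tree; the gap widens as the network grows.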
8.3.2 HMC Modeling
We use the HMC power model in [191] to evaluate high-radix HMCs. In [191],
high-radix HMCs with 12.5Gbps I/O data rate per lane consume 13.4W of
peak power. Peak power is modeled in [191] by attributing 43%, 22%, and
35% of the peak power of an HMC to the peak power of the DRAM dies, the
logic part of the logic die (which we simply refer to as logic), and the I/O
links, respectively; for idle power, the DRAM dies consume 10% of their peak
power when idle, logic consumes 25% of its peak power when idle, while idle
I/O power (i.e., when not transmitting application data) is the same as active
I/O power. Idle and active I/O power are similar because high-speed links need
to continuously transmit data even when idle to maintain synchronization
between the transmitter and receiver [164, 192]. To evaluate low-radix HMCs,
we assume that memory peak power is proportional to memory bandwidth
and, therefore, assume the peak power of low-radix HMCs to be half of
13.4W; we also assume the same relative power breakdown as above.
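The power model above can be written out directly. The Python sketch below computes the peak and idle power of each component for a high-radix HMC and for a low-radix HMC modeled at half the peak power; all percentages and the 13.4W figure are taken from the text, while the names are illustrative.

```python
PEAK_W = 13.4   # peak power of a high-radix HMC
PEAK_SHARE = {"dram": 0.43, "logic": 0.22, "io": 0.35}
# Idle power as a fraction of each component's own peak power.
IDLE_FRACTION = {"dram": 0.10, "logic": 0.25, "io": 1.00}


def hmc_power(peak_w=PEAK_W):
    """Return (peak, idle) power per component in watts."""
    peak = {name: peak_w * share for name, share in PEAK_SHARE.items()}
    idle = {name: peak[name] * IDLE_FRACTION[name] for name in peak}
    return peak, idle


peak, idle = hmc_power()
idle_total = sum(idle.values())
io_share_of_idle = idle["io"] / idle_total

# Low-radix HMCs are modeled at half the peak power, same breakdown.
low_radix_peak, low_radix_idle = hmc_power(PEAK_W / 2)

if __name__ == "__main__":
    print(f"idle total: {idle_total:.2f} W, I/O share: {io_share_of_idle:.1%}")
```

Under these parameters a completely idle high-radix HMC draws about 6.0W, of which roughly 78% is I/O, because the links are the only component whose idle power equals its active power.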
Table 8.1: HMC DRAM Array Parameters
Capacity per HMC/vaults per HMC            4GB/32
Vault data rate/IO width/buffer entries    2Gbps/X32/16
Page policy/line address mapping           close/interleaved
tCL/tRCD/tRAS/tRP/tRRD/tWR (ns)            11/11/22/11/5/12
Table 8.2: Processor Microarchitecture
Core                   16 cores, 3GHz, 2-issue OOO,
                       64 ROB entries, 64B cache line size
L1 d-cache, i-cache    2-way, 64kB, 1 cycle
Private L2 cache       8-way, 512kB, 3 cycles
Shared L3 cache        32-way, 32MB, 20 cycles
We use DRAMSim2 and the parameters in Table 8.1 to model the performance
of DRAM array accesses and modify GEM5 to model I/O link performance. For
the I/O links, we assume that each link controller contains 128 buffer entries
and prioritizes reads over writes, as writes do not typically lie along the
critical path of execution. We model 3.2ns SERDES link latency. To model
routing within an HMC, we assume a pipelined router with a 0.64ns clock period
(the minimum transfer latency of a single flit over the evaluated links) and
a four-cycle latency.
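The 0.64ns figure is consistent with sending one 16B flit over a link at the 12.5Gbps per-lane rate quoted in Section 8.3.2, provided the link is 16 lanes wide. The lane count is an assumption made for this Python sketch and is not stated in this section; the names are illustrative.

```python
LANE_GBPS = 12.5       # per-lane data rate (Section 8.3.2)
LANES_PER_LINK = 16    # assumption: 16 lanes per unidirectional link
FLIT_BITS = 16 * 8     # one 16B flit

ROUTER_CYCLES = 4      # pipelined router depth
SERDES_NS = 3.2        # modeled SERDES link latency

# Gbps is bits per nanosecond, so the quotient is directly in nanoseconds.
flit_time_ns = FLIT_BITS / (LANE_GBPS * LANES_PER_LINK)
router_latency_ns = ROUTER_CYCLES * flit_time_ns

if __name__ == "__main__":
    print(f"flit transfer time: {flit_time_ns:.2f} ns")
    print(f"router latency:     {router_latency_ns:.2f} ns")
    print(f"SERDES latency:     {SERDES_NS:.2f} ns")
```

With these values the four-cycle router contributes 2.56ns per traversal, which is comparable to the 3.2ns SERDES latency.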
8.3.3 Processor and Workloads
We model a 16-core X86 processor in GEM5. Detailed architectural parame-
ters used for GEM5 simulation are listed in Table 8.2. Since different memory
channels are physically independent from one another and bandwidth utiliza-
tion is often uniformly distributed across channels by interleaving adjacent
memory across channels [193], we evaluate a single HMC channel with lit-
tle loss of generality; we leave the exploration of the power implications of
any potential inter-channel interactions to future work.
We evaluate seven HPC workloads and seven cloud computing workloads
using full-system simulation. The average memory footprint of all of our
workloads is 17GB. The HPC workloads include 16-threaded ua.D, lu.D,
bt.D, sp.D, cg.D, mg.D, and is.D from NASBench. The cloud workloads
are mixed application workloads each consisting of four applications, at least
one of which is a parallel SPLASH2X benchmark and the remainder of which
are SPEC2006 benchmarks; only native or reference inputs are used. Each
parallel application runs on four threads, while each SPEC application runs
as four independent instances. Table 8.3 details the workload composition
for the mixed workloads; the applications in each workload appear in the
order of their invocation, which determines memory allocation as we delay
the invocation of each subsequent application or instance by one simulated
second. We fast forward each workload until all multi-threaded application(s)
in each workload have completed their initialization, as indicated by the
simulated OS console output, and then by another 20 simulated seconds to
warm up the caches; the total fast-forward period is 73 seconds, on average,
and up to 218 seconds. We evaluate each workload in cycle-accurate mode
for the next 10ms of simulated time.
Table 8.3: Mixed Workload Composition
mixA    4 bwaves, 4 cactusADM, 4 wrf, ocean_cp
mixB    4 mcf, 4 GemsFDTD, 4T barnes, 4T radiosity
mixC    4 omnetpp, 4 mcf, 4 wrf, 4T ocean_cp
mixD    4 sjeng, 4 cactusADM, 4T radiosity, 4T fft
mixE    4 cactusADM, 4 sjeng, 4 wrf, 4T fft
mixF    4 cactusADM, 4 bwaves, 4 sjeng, 4T fft
mixG    4 mcf, 4 omnetpp, 4 astar, 4T fft
Since there is a wide range in the memory footprint of our workloads, each
workload is evaluated for a memory network whose size matches the memory
footprint of the workload. For our evaluations, we map the ith contiguous
4GB of physical pages (we use 4GB HMCs in our evaluation) to the ith
HMC in the network (see Figure 8.3 for the location of HMC i); therefore,
the average number of HMCs per workload is ⌈17/4⌉ = 5. Since memory
networks can support a large number of memory modules, we also perform a
big network study by mapping the ith contiguous 1GB to the ith HMC. Figure
8.4 shows the cumulative fraction of memory accesses by the ith gigabyte of
memory address space of each workload during the cycle-accurate simulation
interval; memory traffic distribution within the networks can be deduced
through Figures 8.3 and 8.4.
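The address-to-module mapping described above can be sketched as follows. This is a minimal illustration rather than the simulator's code: the function names `hmc_index` and `num_hmcs` are hypothetical, HMCs are assumed to be numbered from zero, and a gigabyte is assumed to mean 2^30 bytes.

```python
import math

GIB = 1 << 30  # assumed size of one gigabyte in bytes


def hmc_index(phys_addr: int, gb_per_hmc: int = 4) -> int:
    """Index of the HMC that holds phys_addr when the ith contiguous
    gb_per_hmc gigabytes of physical memory map to the ith HMC."""
    return phys_addr // (gb_per_hmc * GIB)


def num_hmcs(footprint_gb: float, gb_per_hmc: int = 4) -> int:
    """Number of HMCs needed to cover a workload's memory footprint."""
    return math.ceil(footprint_gb / gb_per_hmc)
```

With the 17GB average footprint, `num_hmcs(17)` gives the five modules of the small network study, while `num_hmcs(17, gb_per_hmc=1)` gives the 17 modules of the big network study.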
8.3.4 Key Findings
Figure 8.4: Workload memory access characteristics. (Workloads plotted:
ua.D, lu.D, bt.D, sp.D, cg.D, mg.D, is.D, mixA, mixB, mixC, mixD, mixE,
mixF, mixG.)
Figure 8.5: Average power breakdown of an HMC in a network. (Legend:
Idle I/O, Active I/O, Logic Leakage, Logic Dynamic, DRAM Leakage, DRAM
Dynamic.)
Figure 8.5 shows the average power consumption per memory module; “Idle
I/O” and “Active I/O” are calculated as total energy consumed by links
divided by time. The first key observation is that I/O power is the highest power
contributor in memory networks; I/O consumes, on average, 73% of the
memory network power. There are two reasons why I/O power dominates.
First, each memory request in a memory network accesses a DRAM die
once but traverses multiple links (see Figure 8.6), which contributes to I/O
having much higher power than DRAM. Second, even within a single module,
a major fraction of energy per memory access is due to I/O. For example,
moving data off package consumes roughly twice as much energy per bit as
moving data within package from DRAMs to the logic die [194]. As another
[Figure legend: small:daisychain, small:ternary tree, small:star,
small:DDRx-like, big:daisychain, big:ternary tree, big:star, big:DDRx-like]
Figure 8.7: Some major causes of high I/O power.
There are several reasons why I/O power is high even for a single memory
module. First, due to the high area cost of I/O pins, memory I/O width is
often many times narrower than the DRAM array data bus width. There-
fore, I/O must operate at much higher frequencies than the DRAM arrays to
match the bandwidth and thus consume high power. Another reason is that
significant power is required to maintain signal integrity for off-chip com-
munication. For example, transmitter output impedance must closely match
the characteristic impedance of the off-chip transmission channel (i.e., a PCB
trace) to minimize I/O signal reflection; impedance matching is typically im-
plemented by terminating the transmitter output using a similar impedance
(see Figure 8.7). Unfortunately, PCB traces typically have low characteristic
impedance (around 50Ω) to keep ohmic power loss low; the matching low
impedance termination results in a low impedance connection to ground and
thus high power consumption (e.g., (1V)²/(50Ω) = 20mW per lane assuming
1V signal voltage, or 20mW × 32 = 0.64W per HMC full link). Receivers also
require similar overheads.
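The termination power arithmetic above can be checked with a short calculation. This is only a sketch of the example in the text: the function name `termination_power_w` is hypothetical, and the factor of 32 per full link is taken directly from the example.

```python
def termination_power_w(v_signal: float = 1.0, z_ohm: float = 50.0) -> float:
    """Power dissipated in a matched termination: P = V^2 / R."""
    return v_signal ** 2 / z_ohm


per_lane_w = termination_power_w()   # 1V across 50 ohms
per_full_link_w = per_lane_w * 32    # factor of 32 as in the example above
```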
The second key observation is that when further breaking down power into
idle and active, idle I/O power is the highest power contributor in memory
networks; it accounts for 53% and 67% of total memory network power in
the small and big network studies, respectively (see black and orange points
in Figure 8.8). To understand the sources of high idle I/O power, we define
channel utilization as the bandwidth utilization of the full link that connects
the processor to a memory network and define link utilization as the aver-
age bandwidth utilization across all links in a network. Figure 8.9 shows
the channel and link utilizations of the different workloads under different
topologies. As expected, the fraction of total network power taken up by
idle I/O power increases when channel bandwidth utilization decreases; for
[Figure legend: small:daisychain, small:ternary tree, small:star,
small:DDRx-like, big:daisychain, big:ternary tree, big:star, big:DDRx-like]
Figure 8.9: Average channel and link utilization. (Legend: channel (chan)
and link series for small and big networks under the daisychain, ternary
tree, star, and DDRx-like topologies.)
is also the highest (see Figure 8.8). However, idle I/O power still accounts
for 50% of total memory network power even for mixB, which has 75% av-
erage channel utilization; in fact, the average channel utilization across our
evaluated workloads is high (43%). Idle I/O power remains high despite high
channel utilization because memory traffic attenuates across the network; as
such, average link utilization continues to be low (see the dotted lines in Fig-
ure 8.9) even when channel utilization is high (see data points without the
dotted lines).
Since idle I/O power accounts for over half of total memory network power,
we will explore idle I/O power management for memory networks in the rest
of the chapter.
8.4 I/O Power Control Mechanisms
Many circuit-level mechanisms exist to dynamically reduce I/O power during
low utilization, with different power and performance tradeoff characteristics.
Below we discuss the mechanisms we study.
8.4.1 Rapid On/Off Links
In conventional DRAMs, I/O is commonly turned off (i.e., put in an inacces-
sible low power state) when idle to reduce idle power. However, once turned
off, an I/O link needs to first be woken up before it can be accessed again,
leading to performance overheads. Conventional DRAMs typically require
10-25ns to wake up the I/O [190], depending on the off-state power. Both
the wakeup latency and off state power should ideally be zero. This is diffi-
cult for two reasons, however. First, to wake up a link, the link transmitter
must first resynchronize with the receiver before data can be reliably trans-
mitted. Second, increasing the current from the off state level (where current
i ≈ 0) to nominal operating level typically requires many nanoseconds since
current change is opposed by a back electromotive force (EMF) that is
proportional to the rate of current change (i.e., EMF = L di/dt); since I/O
operating current is high (due to the high operating power of I/O), di/dt is
also high.
To model rapid on/off (ROO) links, we assume 14ns wakeup latency [196] per
unidirectional² link and 1% power when off [196]. We also examine 20ns wakeup
latency [192] for sensitivity analysis.
8.4.2 Dynamic Voltage Frequency Scaling
Another mechanism to reduce I/O power during low utilization is to keep
the link on but reduce I/O bandwidth via dynamic voltage frequency scaling
(DVFS) [196]. Since dynamic power is ∝ CV²f, DVFS reduces dynamic
energy per bit in addition to idle power, unlike ROO, which only reduces
idle power. DVFS avoids the long wakeup latency of ROO; however, DVFS
increases SERDES latency since SERDES - the circuit that converts parallel
data input into a high-speed serial data stream for transmission - is clocked
by the I/O clock as well [196]. DVFS also requires significant latency to
adjust voltage since changing voltage requires a current draw that is proportional
to the rate of change (i.e., i = C dv/dt); to adjust voltage quickly
(e.g., small dt), a more heavy-weight on-chip voltage regulator capable of
supplying much higher current than is needed during regular operations is
² Each unidirectional HMC link can be individually power controlled [164], like the
unidirectional links in Infiniband and YARC switches [197].
required, which incurs area and power overheads. As such, on-chip voltage
regulators typically require long latency to adjust voltage (e.g., 0.5us [198]).
Due to the long voltage scaling latency, DVFS can incur higher queuing
latency overheads than ROO for a long burst of accesses.
We model link voltage switching latency as 0.5us. To ensure connectivity
during voltage scaling, we assume that the sixteen lanes per link are split
into two bundles of eight lanes each, where each bundle has separate voltage
rails such that DVFS is only applied to one bundle at a time. This leads to
up to 3us to complete voltage scaling for a link: 1 us to reduce operating link
width to half (see Section 8.4.3), 1 us to DVFS the two bundles, and 1 us
to resume full link width. We model the performance and power of DVFS
during steady-state operations using [196] by evaluating DVFS modes that
provide 100%, 80%, 50%, and 14% bandwidth and provide 0%, 30%, 65%,
and 92% power reduction, respectively; the modes are selected such that
each subsequent mode provides roughly an equal amount of total link power
reduction (e.g., 30%) as the previous mode. The lowest power mode operates
only a single bundle of eight lanes at the minimum I/O operating voltage
(i.e., Vmin).
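The DVFS model parameters above can be collected into a small sketch. The names `DVFS_MODES`, `dvfs_transition_us`, and `relative_power` are hypothetical; the numbers are those stated in the text, and the transition time assumes the two bundles are scaled one after the other.

```python
# (bandwidth fraction, power reduction fraction) for each modeled DVFS mode
DVFS_MODES = [(1.00, 0.00), (0.80, 0.30), (0.50, 0.65), (0.14, 0.92)]


def dvfs_transition_us(voltage_switch_us: float = 0.5,
                       width_change_us: float = 1.0,
                       bundles: int = 2) -> float:
    """Upper bound on the time to rescale a link's voltage."""
    narrow = width_change_us               # reduce link width to half
    rescale = bundles * voltage_switch_us  # scale each bundle in turn
    widen = width_change_us                # resume full link width
    return narrow + rescale + widen


def relative_power(mode_index: int) -> float:
    """Link power in a DVFS mode as a fraction of full power."""
    return 1.0 - DVFS_MODES[mode_index][1]
```

Under these assumptions `dvfs_transition_us()` reproduces the 3us figure, and the lowest mode draws about 8% of full link power while providing 14% of the bandwidth.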
8.4.3 Variable Width Links
Another way to reduce I/O power is to reduce the number of active I/O lanes;
we refer to links that can vary their number of active lanes as variable width
links (VWL). VWLs are used in Infiniband and YARC switches [197] and are
also supported by HMCs [164]. While VWL reduces I/O power at the cost
of reduced I/O bandwidth like DVFS, VWL provides distinct tradeoffs from
DVFS. VWL does not incur the long SERDES latency overheads of DVFS;
however, it provides less power reduction because it does not reduce dynamic
energy per bit. When modeling VWLs, we allow the number of active
lanes per link to be reduced from 16 down to 8, 4, or 1. We calculate the
power when l lanes are on as (l + 1)/(16 + 1) of a full power link, as I/O
clock power is similar to power of a lane [196]. We assume 1us latency for
changing the number of active lanes [197].
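The VWL power model above is simple enough to state directly in code. This is a minimal sketch; the names `vwl_relative_power`, `FULL_LANES`, and `VWL_WIDTHS` are hypothetical, and the extra "+1" term stands for the I/O clock, whose power the text takes to be similar to that of one lane.

```python
FULL_LANES = 16
VWL_WIDTHS = (16, 8, 4, 1)  # lane counts allowed in the model


def vwl_relative_power(active_lanes: int) -> float:
    """Link power with `active_lanes` lanes on, relative to a full power link."""
    if active_lanes not in VWL_WIDTHS:
        raise ValueError(f"unsupported link width: {active_lanes}")
    return (active_lanes + 1) / (FULL_LANES + 1)
```

Because of the clock term, the narrowest mode still draws 2/17 (about 12%) of full link power even though only one of sixteen lanes is active.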
8.5 Network-Unaware Management
Our first step in exploring how to utilize the above I/O power control
mechanisms to reduce memory network power is to adapt prior work on memory
power management to memory networks. Since no power management scheme
has previously been proposed in the context of a memory network, we refer
to our adaptation of prior work as network-unaware power management. We
will explore network-aware power management in Section 8.6.
In this study, we explore hardware power management techniques that
do not require software/OS assistance or cross-layer optimization. To man-
age VWL and DVFS links, we adapt [199], which uses hardware counters
to simultaneously estimate the aggregate queuing and transmission latency
overheads of read packets over a link for all possible link bandwidth config-
urations to adjust link bandwidth accordingly. To manage ROO links, we
incorporate aspects of [200] and [201]; [200] describes a hardware mechanism
to find the optimal power ROO mode for a given memory latency overhead
constraint, while [201] hides wakeup latency overheads for responses from
DRAM. Finally, to provide predictable worst-case performance overheads,
we limit average memory latency overhead by incorporating feedback control
and performance violation detection from [202].
Broadly, network-unaware management works as follows. It seeks to reduce
power while keeping the aggregate memory read latency overhead of the
network below an allowable memory slowdown, or AMS for short; a memory
network’s AMS is set as a user-tunable factor of α% times the network’s
aggregate memory read latency had all its links always operated in full power
mode, and thus is given in units of time (as opposed to being a unitless
fraction). To this end, each memory module uses a hardware counter to
track its aggregate DRAM read access latency. Each link controller contains
a hardware counter to measure the link’s actual aggregate read packet la-
tency and also counters to estimate what the link’s aggregate latency might
be had the link always operated at full power. Using these values, each mem-
ory module independently determines the appropriate amount of AMS the
module can have such that the network as a whole obeys the network-level
AMS as determined by the user-settable factor α. Each module calculates
its AMS periodically after each fixed time interval, referred to as an epoch;
Figure 8.10: Network-unaware management overview.
module divides the AMS among its various links; Section 8.5.1 details how
to obtain the AMS of each link at the end of each epoch. Each link controller
then sets its link to the lowest power mode whose latency overhead is less
than its AMS. Section 8.5.2 describes how a link sets its power mode accord-
ing to its AMS. Finally, each link controller periodically checks whether its
current latency overhead exceeds its AMS [202]; if a violation is detected, the
link switches to full power until the end of the epoch. Figure 8.10 summarizes
the above.
8.5.1 Obtaining a Link’s Allowable Memory Slowdown (AMS)
We refer to the estimated aggregate memory latency of a network or a mod-
ule in an epoch had it operated in full power as the network or module’s full
power epoch latency or FEL. Network-unaware management keeps aggregate
memory latency overhead in an epoch within α% times the network’s FEL
simply by requiring each module to independently keep its latency overhead
within α% times the module’s FEL. Specifically, network-unaware management
[...]
AMS_M(m, t+1) [...]    (8.1)
Above, FELm,t stands for the full power epoch latency of module m during
epoch t; AELm,t is the actual epoch latency of m during epoch t, which is
the actual measured aggregate latency of m throughout t; AMS_M(m, t + 1)
is module m’s AMS for the next epoch.
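Equation 8.1 can be read as a running slack budget. The following is a sketch of one form that is consistent with the definitions above and with the later references to the sums in Equation 8.1; it is an assumption, not a transcription. It assumes that the AMS for the next epoch is α% of the accumulated full power epoch latency minus the latency overhead already incurred, with both sums taken over all epochs so far. The summation index i is introduced here for illustration only.

```latex
\mathrm{AMS}_M(m,\, t+1) \;=\;
  \frac{\alpha}{100} \sum_{i=0}^{t} \mathrm{FEL}_{m,i}
  \;-\; \sum_{i=0}^{t} \left( \mathrm{AEL}_{m,i} - \mathrm{FEL}_{m,i} \right)
```

Under this reading, a module that exceeded its budget in earlier epochs receives less AMS in the next epoch, which is how feedback control would keep the long-run overhead near α% of the full power latency.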
Module m obtains AELm,t in Equation 8.1 by summing (A) m’s aggregate
DRAM array read latency and (B) aggregate link latency for read packets
(i.e., read request and read response packets) during t. (A) is the number
of reads to the module’s DRAM times DRAM access latency (e.g., 30ns
from Table 8.1). (B) is the sum of the latency of all read packets passing
through the links that connect the module upstream, which we refer to as the
module’s connectivity links; the link latency of each read packet is obtained
as the difference of the departure time and the arrival time of the last flit
of each read packet. Module m obtains FELm,t in the same way as AELm,t
except that (B) is estimated using a delay monitor and delay counter [199]
per link configured to assume the link always operates at full power. At the
[...] (AELm,t − FELm,t), respectively. Finally, each
connectivity link of m receives an equal portion of the module-level AMS.
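The bookkeeping described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, all latencies are in nanoseconds, and the per-packet link latencies are assumed to have been measured already by the link counters.

```python
from typing import Iterable


def actual_epoch_latency_ns(num_dram_reads: int,
                            dram_latency_ns: float,
                            read_packet_link_latencies_ns: Iterable[float]) -> float:
    """AEL of a module: (A) aggregate DRAM array read latency plus
    (B) aggregate latency of read packets over its connectivity links."""
    dram_part = num_dram_reads * dram_latency_ns
    link_part = sum(read_packet_link_latencies_ns)
    return dram_part + link_part


def per_link_ams_ns(module_ams_ns: float, num_connectivity_links: int) -> float:
    """Each connectivity link receives an equal portion of the module-level AMS."""
    return module_ams_ns / num_connectivity_links
```

The FEL would be computed by the same first function, with the measured link latencies replaced by the full-power estimates from the delay monitor and delay counter.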
8.5.2 Setting Link Power Mode According to the Link’s AMS
After receiving its AMS, each link controller sets its link power mode for the
next epoch as the minimum power mode whose predicted latency overhead
during the next epoch is lower than or equal to the link’s AMS; we refer to
the predicted latency overhead of operating a link at a particular low power
mode during the next epoch as the Future Latency Overhead (FLO) of
the given power mode of the given link; the FLO of a given link’s given power
mode is calculated as the estimated aggregate latency of operating the link at
the given power mode during the current epoch minus the link’s FEL during
the current epoch.
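The selection rule above can be sketched for a VWL link as follows. This is an illustration under assumptions, not the hardware logic: the function name `select_vwl_mode` is hypothetical, power modes are identified by their active lane count (fewer lanes means lower power), and the link falls back to full width when no mode fits within the AMS.

```python
from typing import Dict


def select_vwl_mode(est_latency_ns: Dict[int, float],
                    fel_ns: float,
                    ams_ns: float) -> int:
    """Return the lane count of the lowest power mode whose FLO fits in the AMS.

    est_latency_ns maps a lane count to the estimated aggregate latency of
    operating the link at that width during the current epoch; the FLO of a
    mode is that estimate minus the link's FEL.
    """
    eligible = [lanes for lanes, latency in est_latency_ns.items()
                if latency - fel_ns <= ams_ns]
    # Fewest lanes means lowest power; fall back to the widest mode if none fits.
    return min(eligible) if eligible else max(est_latency_ns)
```

For example, with estimates of 1000, 1040, 1200, and 3000 ns for 16, 8, 4, and 1 lanes and an FEL of 1000 ns, an AMS of 50 ns selects the 8-lane mode and an AMS of 250 ns selects the 4-lane mode.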
We use the same method to estimate the FLO of VWL and DVFS power
modes during an epoch since VWL and DVFS behave similarly. Each link
uses a pair of hardware latency counters from [199] (referred to as the de-
lay monitor and counter in [199]) to estimate the FLO of each available
VWL/DVFS power mode of the link. For each DVFS low power mode, we
also add to this estimated FLO the SERDES latency overhead of the power
mode multiplied by the number of read packets over the link during the
epoch.
Figure 8.11: Per HMC power under network-unaware management. (Legend: FP,
2.5% VWL, 5% VWL, 2.5% ROO, 5% ROO, 2.5% VWL+ROO, 5% VWL+ROO.)
idle interval histogram algorithm from [200]. The ROO power modes we
evaluate have idleness thresholds of 32ns, 128ns, 512ns, and 2048ns, where
each ROO power mode turns off a link after it has been idle for longer than
the power mode’s idleness threshold; the ROO power mode with 2048ns
threshold is considered the full power mode (i.e., a ROO link always turns
off after being idle for 2048ns). The idle interval histogram algorithm requires
an idle interval bucket for every ROO mode. At the end of each link idle
interval, the appropriate idle interval bucket is incremented. At the end
of the epoch, the algorithm calculates the predicted latency overhead of a
ROO mode by summing from the 32ns idle interval bucket to the bucket
whose idle interval corresponds to that ROO mode, and then multiplying
the sum by an estimated average latency overhead per wakeup. This average
latency is wakeup latency + wakeup latency * the average number of read
packet arrivals during wakeup; the latter number is estimated by periodically
sampling how many subsequent read packets arrive during an amount of
time equal to the wakeup latency after a periodically chosen read packet first
arrives.
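The final step of the ROO estimate can be sketched as follows. This is a simplified illustration: the function name `roo_predicted_overhead_ns` is hypothetical, and the number of wakeups is taken as an input because it comes from the idle interval bucket sum described above.

```python
def roo_predicted_overhead_ns(num_wakeups: int,
                              wakeup_latency_ns: float,
                              avg_arrivals_during_wakeup: float) -> float:
    """Predicted latency overhead of a ROO mode over one epoch.

    Each wakeup delays the packet that triggered it by the wakeup latency and
    is charged a further wakeup latency for each read packet that, on average,
    arrives while the link is still waking up.
    """
    per_wakeup_ns = (wakeup_latency_ns
                     + wakeup_latency_ns * avg_arrivals_during_wakeup)
    return num_wakeups * per_wakeup_ns
```

With the modeled 14ns wakeup latency and an average of 0.5 arrivals during wakeup, each wakeup is charged 21ns, so 100 wakeups in an epoch give a predicted overhead of 2100ns.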
We discovered in our experiments that waking up a request link can cause
significant queuing latency overhead in a later response link because read
response packets are much bigger (5X as many flits per packet, assuming
64B lines) than read request packets; for example, if a queue of N requests
delayed in a waking request link all have the same destination module, when
they eventually arrive at and then exit the DRAM array of the destination
module, they can translate to a queue that is effectively five times as long at
the destination module’s response link. To account for any potential latency
Figure 8.12: Average and maximum (error bar) performance overhead
under network-unaware management. (x-axis groups: daisychain, ternary tree,
star, DDRx-like, avg; bars for 2.5% and 5%.)
an additional value of wakeup latency * the average number of read packet
arrivals during wakeup when estimating the predicted latency overhead of
operating a request link under a given ROO power mode.
Finally, we estimate the FLO for VWL/DVFS+ROO links as the sum of
the VWL/DVFS power mode’s FLO and the ROO power mode’s FLO.
8.5.3 Evaluation
Figure 8.11 shows the average power per HMC under network-unaware power
management for VWL, ROO, and VWL+ROO links³ and also average power
for full power (FP) networks. On average across 336 comparisons (i.e., three
circuit-level I/O power reduction mechanisms * two values of α * four topolo-
gies * 14 workloads), network-unaware management yields 14% average over-
all power reduction for small networks. Average power reduction increases
to 24% for big networks; this is as expected since idle I/O power is higher in
big networks (see Figure 8.5). The corresponding idle I/O power reduction
is 32% and 21% for big and small networks, respectively. Figure 8.12 shows
the throughput⁴ overhead of network-unaware power management. The
maximum throughput degradations for α = 2.5% and α = 5% are 3.2% and
5.1%, respectively, which closely follow their respective α. This shows the
³ We will evaluate DVFS and DVFS+ROO links in Section 8.6.4.
⁴ Since every workload contains multi-threaded application(s), we use FLOPS for
workloads with floating-point applications only and use memory accesses per
second for the rest (seven total).
effectiveness of using memory latency overhead feedback control to curb over-
all system performance overheads; the occasional performance degradation
in excess of α is due to imperfect link latency overhead estimations using
counters.
Memory power management provides star and the DDRx-like topologies
with the highest power reduction relative to total network power. These
topologies allow many modules with cold memory ranges (see the flat line
segments in Figure 8.4) to go into very low link power modes; under daisy
chain, however, such modules still need to be frequently traversed to provide
access to modules with hot memory content. Ternary tree, on the other hand,
consists entirely of high radix HMCs, with high logic and DRAM leakage
power (see Figure 8.5); as such, idle I/O power reduction as a fraction of
total network power is less pronounced.
Overall, idle I/O power still consumes 44% and 57% of total power for the
small and big networks, respectively, and thus remains the top power contrib-
utor. One way to further reduce idle I/O power is to increase the network’s
AMS by increasing α. Figure 8.11 shows that power reduction through increasing
α is very modest, only 3% on average when α goes from 2.5% to
5%; meanwhile, Figure 8.12 shows that the average throughput degradation
almost doubles from 0.9% to 1.7%. While 0.9% to 1.7% average system-level
performance overheads may be acceptably small, they are not negligible. Fur-
ther increasing α only further degrades system-level performance for modest
gains in power reduction. As such, new techniques need to be explored for
further power reduction while minimizing performance overheads.
8.6 Network-Aware Management
One main problem with network-unaware management is that busier links
often operate in a lower power mode than links with lower utilization. The
left half of Figure 8.13 plots the fraction of total link hours (i.e., analogous
to machine hours) spent by links of different utilization levels for VWL links
under network-unaware management for big networks; 9% of total link hours
are spent by links with 0-1% utilization in 16-lane mode (i.e., the red segment
in “0-1%” bar for big networks); meanwhile, roughly the same total number
Figure 8.13: Left: Distribution of link hours spent in different VWL modes
(y-axis) by links of different utilizations (x-axis) under network-unaware
management. Right: Distribution under network-aware management.
(x-axis bins: 0-1%, 1-5%, 5-10%, 10-20%, 20-100%.)
the combined height of the orange segments in the last three bars for big
networks). Intuitively, a busier link should operate in a higher power mode
since it incurs more frequent and, therefore, higher total latency overhead
than a lower utilization link for operating at the same low power mode.
The counter-intuitive behavior arises because a more frequently accessed module
often generates more AMS than a less frequently accessed module; unfortunately,
under network-unaware management, the large amount of AMS generated
by a frequently accessed module is also assigned to that module, allowing its
links to often operate at equal or lower power modes than infrequently accessed
links. Ideally, one would like to see a link hour distribution like the one shown
in the right half of Figure 8.13 that increases the time low utilization (e.g.,
0-5% utilization) links spend in low power modes by decreasing the time high
utilization (10+% utilization) links spend in low power modes; network-aware
power management obtains this distribution.
Memory networks also provide new opportunities to reduce link power at
low performance cost. We observe that for ROO links, multiple links along
the access path can wake up at the same time, instead of one at a time, to
enable aggressive ROO modes while minimizing wake up overheads. We also
observe that latency overheads at a downstream link do not cause memory
latency overhead if an upstream response link is congested; had there been
no delay downstream, the packet would arrive at the congested upstream
response link sooner and wait longer in queue.
Network-aware management addresses and exploits the above problems
and opportunities. It builds on top of network-unaware management: it also
relies on Equation 8.1 to calculate network-level AMS and uses the same
hardware link counters for FLO estimation, etc. The difference is that in-
stead of each module independently obtaining its AMS, network-aware man-
agement intelligently redistributes the network-level AMS across the network
to ensure that busier links operate at no lower power modes than less busy
links; this is described in Section 8.6.1. Network-aware management also
completely hides the wakeup latency of response links, and aggressively sets
downstream response links to lower VWL/DVFS modes when upstream re-
sponse links are congested, as will be described in Section 8.6.2 and Section
8.6.3, respectively.
8.6.1 Network-Aware Slowdown Redistribution
We propose redistributing AMS across the network such that busier links
always operate at a higher or equal power mode than less busy links. This allows
the leftover AMS in the former to be transferred to the latter to enable
the latter to operate in lower power modes. There are two challenges with
network-aware slowdown redistribution. First, determining the relative uti-
lization of different links can be challenging since link utilization can change
even within an epoch. Second, even when the relative utilization of different
links is known, there are still numerous ways to set link power modes across
the network such that busier links also have higher or equal power
modes; an efficient and effective selection method needs to be identified.
To address the first challenge, we observe that memory traffic attenuates
across the network from memory modules closer to the processor to memory
modules farther away from the processor. This implies that among links of
the same type (i.e., request or response link), an upstream link always has
equal or higher utilization than its immediate downstream link. To address
the second challenge, we observe that a distributed algorithm can exploit the
fact that each module is aware of its neighbors in a memory network to make
topology-aware power mode decisions without having to first translate the
physical topology into a logical data structure. A distributed algorithm also
divides the computational and memory overheads over all modules, resulting
in low overheads per module.
Exploiting the above observations, we propose Iterative Slowdown Propa-
gation (ISP), a distributed message passing algorithm that distributes AMS
over the network through several iterations (our evaluations cap total iter-
ations at three). Each iteration consists of two steps - scatter and gather.
ISP scatter redistributes unused AMS; ISP gather collects unused AMS and
other helpful statistics and enforces that an upstream link always be set to
higher or equal power mode than downstream links of the same type. Figure
8.14 illustrates the direction of message passing for these two ISP steps. The
final CLS (see Section 8.5.1) at each link by the end of ISP is used to select
the power mode.
ISP Scatter
ISP scatter broadly works as follows. At the ith ISP iteration, only some
links may still benefit from receiving more AMS; we refer to a link previously
determined to potentially benefit from receiving more AMS as a slowdown re-
ceiving candidate (SRC). Each link L divides its unused AMS equally among
all downstream SRCs, or DSRCs, of the same type as L; since the latency
overhead of a low power mode is usually higher for busier links, equal AMS
distribution helps to converge to a global link power mode selection where
busier links select no lower power modes. We describe ISP scatter in detail
below.
At the beginning of ISP scatter, the head module contains the total network-
wide unused AMS (which is obtained by ISP gather of the previous iteration,
to be described in Section 8.6.1). The head module divides the unused AMS
by the total number of SRCs in the network to calculate a per candidate slowdown
(PCS ) value and passes the PCS to the link controllers of the head module’s
connectivity links.
If a link is not an SRC, upon receiving a PCS message, the link controller
simply passes the received PCS value to all downstream links of the same
type.
If a link is an SRC, upon receiving a PCS message, the link controller
increases the link’s AMS by the received PCS and selects (but does not yet
physically set) the lowest link power mode whose FLO is below the updated AMS.
Next, the link controller passes PCS + (AMS − FLO)/DSRC as the new
PCS value to each immediate downstream link of the same type; this effec-
tively evenly distributes to these downstream links any leftover AMS at an
upstream link after the upstream link selects its power mode. Lastly, the link
[Diagram: a Processor followed by a chain of HMCs (HMC, HMC, ..., HMC), shown
once for ISP Scatter and once for ISP Gather]
Figure 8.14: ISP message flow. Each HMC passes the same message packet
to each of its downstream/upstream neighbor(s) during ISP scatter/gather.
controller updates the link’s AMS as the FLO of the selected power mode
and also decides whether the link should be an SRC during the next iteration
of ISP scatter; it remains an SRC if the link has not already selected the lowest
available power mode and if PCS + AMS is at least a large fraction (e.g.,
25%) of the next lower power mode’s FLO.
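One ISP scatter pass over links of a single type can be sketched as below. This is an illustration under several assumptions, not the proposed hardware: the `ScatterLink` class, its field names, and the recursive traversal (standing in for message passing) are hypothetical; power modes are indexed so that index 0 is full power with zero FLO; a mode is taken to fit when its FLO is at most the AMS; the SRC test uses the received PCS and the updated AMS; and leftover AMS at a link with no downstream SRCs is kept as unused AMS for the next ISP gather.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ScatterLink:
    flo: List[float]        # FLO per mode; index 0 = full power, higher = lower power
    ams: float = 0.0        # AMS currently assigned to this link
    mode: int = 0           # selected (not yet physically set) power mode
    is_src: bool = True     # slowdown receiving candidate?
    dsrc: int = 0           # number of downstream SRCs, from ISP gather
    unused: float = 0.0     # leftover AMS that could not be passed downstream
    children: List["ScatterLink"] = field(default_factory=list)


def isp_scatter(link: ScatterLink, pcs: float, src_fraction: float = 0.25) -> None:
    """Deliver a per candidate slowdown (PCS) message to `link` and forward it."""
    if link.is_src:
        link.ams += pcs
        # Lowest power mode whose FLO fits within the updated AMS.
        link.mode = max((i for i, f in enumerate(link.flo) if f <= link.ams),
                        default=0)
        leftover = link.ams - link.flo[link.mode]
        link.ams = link.flo[link.mode]
        nxt = link.mode + 1
        link.is_src = (nxt < len(link.flo)
                       and pcs + link.ams >= src_fraction * link.flo[nxt])
        if link.dsrc > 0:
            pcs = pcs + leftover / link.dsrc   # spread leftover over downstream SRCs
            link.unused = 0.0
        else:
            link.unused = leftover
    for child in link.children:               # non-SRC links forward PCS unchanged
        isp_scatter(child, pcs, src_fraction)
```

For a two-link chain with FLOs of 0, 10, and 40 per mode and 30 units of network-wide unused AMS, the head computes a PCS of 15; the upstream link selects mode 1 and forwards a PCS of 20, and the downstream link also selects mode 1 with 10 units left unused.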
ISP Gather
ISP gather obtains for every link controller its DSRC value and also obtains
for the head module the network-wide unused AMS via parallel reduction
by passing messages upstream. The DSRC values are obtained via a parallel
prefix sum reduction over the network in which each leaf link controller passes
‘1’ if it is an SRC, ‘0’ otherwise, to the immediate upstream link controller
of the same type. Each non-leaf link controller accumulates the values in
the received messages, sets DSRC to this partial sum, increments the partial
sum by ‘1’ if the link is itself an SRC, and passes the updated sum upstream
similarly.
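The DSRC reduction above can be sketched as a recursive traversal over links of one type. This is a minimal illustration: the `GatherLink` class and the function `gather_dsrc` are hypothetical names, and recursion stands in for the upstream message passing.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class GatherLink:
    is_src: bool
    children: List["GatherLink"] = field(default_factory=list)
    dsrc: int = 0   # number of SRCs strictly downstream of this link


def gather_dsrc(link: GatherLink) -> int:
    """Set link.dsrc and return the value this link passes upstream."""
    # Accumulate the values received from the immediate downstream links.
    link.dsrc = sum(gather_dsrc(child) for child in link.children)
    # Count this link itself before passing the partial sum upstream.
    return link.dsrc + (1 if link.is_src else 0)
```

A leaf link has no downstream links, so its DSRC is zero and it passes up 1 or 0 depending only on whether it is an SRC, as described above.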
To obtain the network-wide unused AMS, the first ISP gather iteration cal-
culates the initial network-wide AMS generated during the previous epoch,
while each subsequent iteration calculates how much AMS is still unused after
the previous ISP scatter iteration. The head module is responsible for calculating
via Equation 8.1 the network-wide AMS generated during the previous
[...] FELm,t) sums in Equation 8.1. To update these sums using the AELm,t and
[...]
are obtained via simple parallel reduction sum operations over the AELm,t
and FELm,t values, respectively, across all modules in the network. For
subsequent iterations, ISP gather accumulates the unused AMS in the leaf
modules also by a simple parallel reduction sum across the network.
Finally, to ensure that an upstream link L’s power mode will be equal or
higher than the maximum power mode among all L’s downstream links of
the same type, each downstream link also passes its currently selected power
mode to its upstream link of the same type during ISP gather. If L’s selected
power mode is lower than that of any of its downstream links, L increases its power
mode to match the latter, updates CLS and passes upstream the difference
between the FLOs of the updated and the previous power mode selection
as unused AMS. Note that since the total amount of data passed from a
downstream module to its upstream module during ISP gather is small, each
module only needs to send a single 64B packet during ISP gather.
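The power mode adjustment just described can be illustrated as follows. This is a sketch under stated assumptions: power modes are encoded as integers where a larger value means a higher power mode, the `flo` table is a hypothetical per-mode list of FLO values, and the amount passed upstream is taken as the magnitude of the FLO difference. The CLS update is omitted because its form is not given in this section.

```python
def enforce_upstream_mode(own_mode, downstream_modes, flo):
    """Return (new_mode, unused_ams) for one upstream link.

    own_mode: the link's currently selected power mode (larger = higher power).
    downstream_modes: modes reported by same-type downstream links.
    flo: hypothetical table mapping each power mode to its FLO.
    """
    # The upstream link must not be in a lower mode than any downstream link.
    new_mode = max([own_mode, *downstream_modes])
    # The difference between the FLOs of the updated and previous selections
    # is passed upstream as unused AMS (magnitude assumed here).
    unused_ams = abs(flo[new_mode] - flo[own_mode])
    return new_mode, unused_ams
```

For instance, with the hypothetical table `flo = [40, 25, 10, 0]`, a link in mode 1 whose downstream links report modes 0 and 2 moves to mode 2 and passes 15 units upstream; a link already at or above its downstream modes passes 0.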
Utilizing the Leftover AMS from ISP
After the end of the last iteration of ISP gather, all leftover AMS in the
network is stored in the head module. Network-aware management utilizes
this unused AMS to prevent some links from switching to full power mode
due to an AMS violation during an epoch. On detecting a violation, a link
controller first requests some of the unused AMS recorded at the head
module. On receiving the AMS request message, the head module responds
with a portion of the unused AMS (e.g., 1/16th of the original unused
AMS), provided the unused AMS has not already been depleted by AMS request
messages earlier in the epoch. We allow each link to request up to a quarter
of the original unused AMS (e.g., by allowing a maximum of 0.25/(1/16) = 4
AMS requests per link per epoch). A link switches to full power mode if the
AMS request is denied.
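The request and grant policy can be sketched as a small bookkeeping class. The class and method names are hypothetical; the 1/16 grant size and the quarter-of-original per-link cap are the example values given in the text, and capping the final grant at whatever remains is an assumption of this sketch.

```python
class HeadModule:
    """Tracks the leftover AMS for one epoch (illustrative sketch only)."""

    def __init__(self, original_unused_ams, grant_fraction=1 / 16,
                 per_link_cap=0.25):
        self.remaining = original_unused_ams
        # Each grant is a fixed portion of the ORIGINAL unused AMS.
        self.grant = original_unused_ams * grant_fraction
        # 0.25 / (1/16) = 4 requests per link per epoch.
        self.max_requests_per_link = round(per_link_cap / grant_fraction)
        self.requests = {}  # link id -> granted requests this epoch

    def request_ams(self, link_id):
        """Return the granted AMS, or 0.0 if the request is denied
        (the requesting link then switches to full power mode)."""
        used = self.requests.get(link_id, 0)
        if used >= self.max_requests_per_link or self.remaining <= 0:
            return 0.0
        granted = min(self.grant, self.remaining)  # assumed: cap at remainder
        self.remaining -= granted
        self.requests[link_id] = used + 1
        return granted
```

With 1600 units of leftover AMS, each grant is 100 units; a single link receives at most four grants, and the pool is depleted after sixteen grants network-wide.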
8.6.2 Network-Aware Optimizations for ROO
Network-unaware management adapts [201] to reduce wakeup latency for
response links (see Section 8.5); while the DRAM array of a module is still
being accessed, the module proactively wakes up its response link (if the link
is off), instead of first waiting for the DRAM access to complete, to reduce
the performance overhead of waking up the response link. Since the DRAM
access latency is typically longer than link wakeup latency (e.g., 30ns vs.
14ns), the wakeup latency overhead of the response link of the module being
accessed can be completely hidden.
Network-aware management seeks to not only hide the response link wakeup
latency of the module under access, but also hide the wakeup latency of every
response link along the response path to the processor. It does so by again
ensuring that an upstream link is always in a power mode higher than or equal
to that of its downstream link(s). Specifically, under network-aware management,
a response link starts turning on either when the link’s module’s DRAM array
is being read or after one of its immediate downstream response links starts
turning on plus a wait interval; the interval is the sum of router latency and
the downstream link’s current SERDES and transmission latencies, which
are constant values during an epoch. A response link only turns off when
the link’s DRAM is not being read and when all immediate downstream re-
sponse links are off, which implies that the latter links also will not be soon
receiving packets from their respective downstream response links or their
modules’ DRAMs. Note that an upstream response link’s transmitter and
its immediate downstream response links’ receivers all reside on the same
module; as such, the on/off states of the immediate downstream links are readily
available.
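The turn-on and turn-off rules can be checked with a small timing sketch. The 30ns DRAM access latency and 14ns wakeup latency are the example values quoted earlier in this section; the router latency and the per-link SERDES plus transmission latencies are hypothetical inputs, and the function names are illustrative only.

```python
def wakeup_schedule(path_link_ns, dram_latency_ns=30.0, wakeup_ns=14.0,
                    router_ns=1.0):
    """path_link_ns: SERDES + transmission latency (ns) of each response
    link, ordered from the accessed module toward the processor.
    Returns one (start_on, ready, packet_arrival) tuple per link."""
    schedule = []
    start = 0.0                 # accessed module's link wakes when the read begins
    arrival = dram_latency_ns   # response reaches that link when the read completes
    for link_ns in path_link_ns:
        schedule.append((start, start + wakeup_ns, arrival))
        # The next upstream link starts turning on after the wait interval:
        # router latency + this link's SERDES and transmission latencies.
        start += router_ns + link_ns
        # The response packet needs the same time to reach the next link.
        arrival += link_ns + router_ns
    return schedule


def may_turn_off(dram_being_read, downstream_links_on):
    """A response link may turn off only if its module's DRAM is not being
    read and all immediate downstream response links are off."""
    return (not dram_being_read) and not any(downstream_links_on)
```

Because every upstream link is delayed by exactly the time the packet needs to reach it, each link is ready a constant `dram_latency_ns - wakeup_ns` before the packet arrives, which is why the wakeup latency stays hidden along the whole path in this model.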
Since network-aware management completely hides wakeup latency over-
heads for response links, it only considers the request links to be SRCs
during ISP for networks with only ROO links. For a similar reason, for
VWL/DVFS + ROO links, the head module assigns during ISP scatter more
(i.e., 3/4 of) unused AMS to the request links.
8.6.3 Network-Aware Optimizations for VWL/DVFS
When upstream response links are congested, network-aware management
ignores some of the latency overhead experienced by downstream links when
calculating the network-wide AMS at the end of an epoch. To do so, each
response link controller tracks for the current epoch the cumulative queuing
delay (QD) and the fraction of packets that are queued, or simply queuing
fraction (QF). A packet is conservatively considered as being queued if it
arrives behind at least three older packets according to the link’s full power
mode delay monitor. The reduction sum operation of the first ISP gather
iteration, which calculates how much network-wide AMS is available for the next
epoch, makes use of these statistics; during this operation, each response link
controller reduces the total downstream overhead (from both downstream re-
quest and response links) by the minimum of downstream overhead ∗ QF
and QD.
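The discount applied by each response link controller can be written directly from the rule above. The function names and the example numbers are illustrative assumptions; only the three-older-packets threshold and the min(downstream overhead ∗ QF, QD) rule come from the text.

```python
QUEUED_THRESHOLD = 3  # a packet counts as queued behind >= 3 older packets


def is_queued(older_packets_ahead):
    """Conservative queuing test applied per packet."""
    return older_packets_ahead >= QUEUED_THRESHOLD


def discounted_downstream_overhead(downstream_overhead, qf, qd):
    """Reduce the total downstream overhead (request + response links) by
    min(downstream_overhead * QF, QD) during the first ISP gather iteration.

    qf: fraction of this link's packets that were queued in the epoch.
    qd: cumulative queuing delay at this link in the epoch.
    """
    discount = min(downstream_overhead * qf, qd)
    return downstream_overhead - discount
```

For example, with a downstream overhead of 200 units, QF = 0.25, and QD = 80 units, the discount is min(50, 80) = 50 and the link forwards 150 units; with QF = 0.5 and QD = 60 the discount is bounded by QD, giving 140 units.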
8.6.4 Evaluation
Figure 8.15 shows the network-wide power reduction of network-aware man-
agement vs. network-unaware management. Overall power reduction over
network-unaware power management is, on average, 11% and 19% for small
and big networks, respectively. The corresponding I/O power reductions for
small and big networks are 17% and 29%, respectively. Figure 8.16 presents
the benchmark-level power reduction of network-aware and unaware man-
agement vs. full power networks, on average across all topologies; for read-
ability, it only reports big networks and α = 5%. Figure 8.16 shows that
network-aware management consistently yields higher power reduction for
every workload.
The left half of Figure 8.17 shows the average performance overhead of
network-aware management vs. unaware management. For α = 2.5% and
α = 5%, network-aware management incurs 0.2% and 0.3% average performance
penalties, respectively, compared to the latter. Closer inspection reveals
that some high utilization links under network-unaware management do not
[Plot axis labels: α = 2.5% and 5%; topologies daisychain, ternary tree, star, DDRx-like, avg.]
Figure 8.17: Left: Average performance overhead vs. network-unaware
management. Right: Maximum performance overhead vs. full power
networks.
links to low power mode are exceedingly high. Under network-aware power
management, the AMS generated by these links is redistributed to other links
that can utilize the AMS, resulting in a corresponding performance degrada-
tion. The right half of Figure 8.17 shows the maximum performance overhead
of network-aware management vs. full power. The maximum overhead over
all 672 comparisons is 5.9%.
We perform sensitivity analysis by evaluating DVFS links instead of VWL
links and evaluating ROO with a wakeup latency of 20ns instead of 14ns.
Figure 8.18 shows the network-wide power reduction and performance over-
heads of network-aware and network-unaware management relative to full
power networks for α = 5%. Under DVFS, both schemes yield less power
reduction for the same value of α% (e.g., < 5%) than VWL; this is due
to the high SERDES latency overheads at low voltage. As expected, the
power savings under both schemes for the longer 20ns ROO links are slightly
reduced. Meanwhile, network-aware power management provides 21% and
[Plot axis labels: topologies daisychain, ternary tree, star, DDRx-like, avg.]
Figure 8.18: Left: Average and maximum power reduction for DVFS and
20ns ROO links. Right: Average and maximum performance overheads.
and small networks, respectively, on average across DVFS, 20ns ROO, and
DVFS+20ns ROO links.
8.7 Discussion
8.7.1 Other I/O Power Reduction Strategies
An alternative approach for selecting link bandwidth (but not ROO modes) is
to statically reduce link bandwidth, as in the well-known fat-trees and
tapered-trees. We note that if traffic is evenly distributed across the
network (e.g., by interleaving adjacent pages across all modules), a hybrid
fat-tree and tapered-tree static bandwidth selection (footnote 5) does not
induce any queuing latency overheads. However, low-bandwidth links still take
longer to transmit a packet; since static selection cannot control the
aggregate packet transmission latency overheads, it provides only a single
static power and performance tradeoff point and incurs unpredictable
worst-case performance overheads. For example, for the VWL power/performance
model and big networks, static selection+interleaving incurs a 13% average
performance overhead, a 43% worst-case overhead, and a 30% average top-quarter
worst-case overhead over the four topologies and 14 workloads (i.e.,
4 ∗ 14 = 56 comparisons).
5. Let S(x) be the number of links with hop distance x and T be the total number of
links. A fat+tapered tree sets the bandwidth of a link with hop distance d to
$\frac{1}{S(d)}\left(1 - \sum_{i=1}^{d-1} S(i)/T\right)$ of the maximum bandwidth
(and raises it to the nearest available bandwidth option).
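The footnote's static selection rule can be evaluated numerically as below. This is only a sketch: the function names, the example topology, and the list of available bandwidth options are hypothetical, and hop distance 1 is assumed to denote the links closest to the processor.

```python
def static_bandwidth_fraction(d, links_per_hop):
    """Fraction of maximum bandwidth for a link at hop distance d.

    links_per_hop: dict mapping hop distance x (starting at 1) to S(x).
    Implements 1/S(d) * (1 - sum_{i=1}^{d-1} S(i)/T).
    """
    total_links = sum(links_per_hop.values())             # T
    closer = sum(links_per_hop[i] for i in range(1, d))   # links nearer than d
    return (1.0 / links_per_hop[d]) * (1.0 - closer / total_links)


def round_up_to_option(fraction, options):
    """Raise the fraction to the nearest available bandwidth option."""
    candidates = [o for o in options if o >= fraction]
    return min(candidates) if candidates else max(options)


# Hypothetical three-level ternary tree: 1, 3, and 9 links per hop (T = 13).
S = {1: 1, 2: 3, 3: 9}
fractions = {d: static_bandwidth_fraction(d, S) for d in S}
```

For this example the fractions are 1, 12/39, and 1/13 for hop distances 1, 2, and 3; with assumed options of 0.25, 0.5, and 1.0 of maximum bandwidth, they round up to 1.0, 0.5, and 0.25.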
In comparison, our network-aware management not only provides tunable
design points, but also provides higher benefit for the same performance. By
sweeping α values, we found that network-aware power management with α =
30% matches the average performance overhead above, but reduces overall
power by 15% compared to static selection; this is because our network-
aware power management allows contiguous memory pages to be mapped
within the same HMC, which consolidates accesses to fewer active HMCs
and allows more HMCs to go into low power modes. In addition, network-
aware management incurs only a 25% worst-case and a 20% average top-quarter
worst-case performance overhead vs. full power networks.
8.7.2 Related Work
Many prior works have studied power management for on-chip networks.
Nodes (i.e., cores) in on-chip networks need to communicate with one another
and, therefore, benefit from a small average distance between the network
nodes; as such, on-chip networks use topologies with many redundant links,
such as the well-known 2D mesh, which provide very different power manage-
ment challenges and opportunities from the minimally connected topologies
we study.
8.8 Conclusion
In this chapter, we perform the first exploration to understand the power
characteristics of memory networks. We identify idle I/O power as the high-
est power contributor. Subsequently, we study idle I/O power in more de-
tail. We evaluate well-known I/O power reduction techniques such as DVFS,
ROO, and VWL. We adapt existing works on memory power management
to memory networks and obtain 32% and 21% I/O power reduction for big
and small memory networks, respectively. Finally, we explore network-aware





Data centers and supercomputers are the workhorses of the digital universe
and modern science and engineering. However, the future scaling of these sys-
tems is challenging because they are very power hungry. In future data cen-
ters and supercomputers, the memory hierarchy will become a major source
of power consumption, accounting for up to 40% to 70% of total system power
due to growing memory capacity and bandwidth requirements. This thesis
argues that a major source of energy overhead in today’s memory systems is
due to providing uniform performance across time and space. Such a uniform
design approach provides the benefit of being simple by repeating the same
actions for all scenarios, which in turn benefits hardware implementation.
The downside, however, is that rare memory operations end up dictating
the overall energy-efficiency of the memory system. In the past, such en-
ergy overheads were acceptable due to the low memory power consumption
relative to the system power. However, as the memory system is on track
to becoming a major source of power consumption in future data centers
and HPC systems, it is time to revisit memory design approaches. This
thesis explores the benefits of a common-case optimized memory hierarchy
that foregoes memory performance uniformity to improve energy efficiency
by optimizing common-case memory operations without also optimizing, or
sometimes even at the expense of, uncommon memory operations.
Optimizing the common-case memory operations provides significant overall
energy savings since they are much more frequent than uncommon memory
operations. This dissertation demonstrates that architecting the memory
systems for the common case can provide significant energy savings across the
entire memory hierarchy, spanning latency-optimized on-chip SRAM caches,
bandwidth-optimized 3D DRAM, capacity-optimized main memory, as well
as emerging density-optimized NVRAMs.
REFERENCES
[1] M. P. Mills, “The cloud begins with coal,” August 2013,
https://www.tech-pundit.com/wp-
content/uploads/2013/07/Cloud Begins With Coal.pdf.
[2] A. Shehabi, S. J. Smith, D. A. Sartor, R. E. Brown, M. Herrlin, J. G.
Koomey, E. R. Masanet, N. Horner, I. L. Azevedo, and W. Lintner,
“United States data center energy usage report,” June 2016.
[3] “How much electricity is used for lighting in
the United States?” 2017. [Online]. Available:
https://www.eia.gov/tools/faqs/faq.php?id=99&t=3
[4] M. Alioto, Ed., Enabling the Internet of Things From Integrated Cir-
cuits to Integrated Systems. Springer International Publishing, 2017.
[5] HPE, “Overview of DDR4 memory in HPE Pro-
Liant Gen9 Servers with Intel Xeon E5-2600 v3,”
2017, https://h20195.www2.hpe.com/V2/getpdf.aspx/4AA6-
2997ENW.pdf?ver=4.1.
[6] T. Vijayaraghavan, Y. Eckert, G. H. Loh, M. J. Schulte, M. Ignatowski,
B. M. Beckmann, W. C. Brantley, J. L. Greathouse, W. Huang,
A. Karunanithi, O. Kayiran, M. Meswani, I. Paul, M. Poremba,
S. Raasch, S. K. Reinhardt, G. Sadowski, and V. Sridharan, “Design
and analysis of an APU for exascale computing,” in 2017 IEEE In-
ternational Symposium on High Performance Computer Architecture
(HPCA), Feb 2017, pp. 85–96.
[7] O. Villa, D. R. Johnson, M. O’Connor, E. Bolotin, D. Nellans, J. Luitjens,
N. Sakharnykh, P. Wang, P. Micikevicius, A. Scudiero, S. W.
Keckler, and W. J. Dally, “Scaling the power wall: A path to exascale,”
in SC14: International Conference for High Performance Computing,
Networking, Storage and Analysis, Nov 2014, pp. 830–841.
[8] N. Chatterjee, M. O’Connor, D. Lee, D. R. Johnson, S. W. Keckler,
M. Rhu, and W. J. Dally, “Architecting an energy-efficient DRAM
system for GPUs,” in 2017 IEEE International Symposium on High
Performance Computer Architecture (HPCA), Feb 2017, pp. 73–84.
[9] S. Vangal and S. Jain, Claremont: A Solar-Powered Near-
Threshold Voltage IA-32 Processor, P. P. Pande, A. Ganguly, and
K. Chakrabarty, Eds. Springer New York, 2013. [Online]. Available:
http://dx.doi.org/10.1007/978-1-4614-4975-1 9
[10] B. Schroeder, E. Pinheiro, and W.-D. Weber, “DRAM errors in the
wild: A large-scale field study,” SIGMETRICS, pp. 193–204, 2009.
[11] V. Sridharan and D. Liberty, “A study of DRAM failures in the field,”
in Proceedings of the International Conference on High Performance
Computing, Networking, Storage and Analysis, ser. SC ’12. Los
Alamitos, CA, USA: IEEE Computer Society Press, 2012. [Online].
Available: http://dl.acm.org/citation.cfm?id=2388996.2389100 pp.
76:1–76:11.
[12] AMD, “AMD, BIOS and Kernel Developer’s Guide
for AMD NPT Family 0Fh Processors,” 2009,
http://developer.amd.com/wordpress/media/2012/10/325591.pdf.
[13] Hewlett Packard Enterprise, “How memory RAS technologies can
enhance the uptime of HPE ProLiant servers,” 2016,
http://h20195.www2.hp.com/V2/GetPDF.aspx
[14] M. Ohmacht, R. A. Bergamaschi, S. Bhattacharya, A. Gara, M. E.
Giampapa, B. Gopalsamy, R. A. Haring, D. Hoenicke, D. J. Krolak,
J. A. Marcella, B. J. Nathanson, V. Salapura, and M. E. Wazlowski,
“Blue Gene/L compute chip: Memory and Ethernet subsystem,” IBM
Journal of Research and Development, vol. 49, no. 2.3, pp. 255–264,
March 2005.
[15] M. K. Qureshi, “Pay-as-you-go: Low-overhead hard-error correction
for phase change memories,” in Proceedings of the 44th Annual
IEEE/ACM International Symposium on Microarchitecture, ser.
MICRO-44. New York, NY, USA: ACM, 2011. [Online]. Available:
http://doi.acm.org/10.1145/2155620.2155658 pp. 318–328.
[16] D. H. Yoon and M. Erez, “Virtualized and flexible ECC for main mem-
ory,” in Proceedings of the Fifteenth Edition of ASPLOS on Architec-
tural Support for Programming Languages and Operating Systems, ser.
ASPLOS XV. New York, NY, USA: ACM, 2010, pp. 397–408.
[17] C. Chen and M. Hsiao, “Error-correcting codes for semiconductor mem-
ory applications: A state-of-the-art review,” IBM Journal of Research
and Development, 1984.
[18] J. Rothman and A. Smith, “Sector cache design and performance,”
in Modeling, Analysis and Simulation of Computer and Telecommuni-
cation Systems, 2000. Proceedings. 8th International Symposium on,
2000, pp. 124–133.
[19] A. N. Udipi, N. Muralimanohar, R. Balasubramonian, A. Davis, and
N. P. Jouppi, “LOT-ECC: Localized and tiered reliability mechanisms for
commodity memory systems,” ISCA, pp. 285–296, 2012.
[20] X. Jian, S. Blanchard, N. Debardeleben, V. Sridharan, and R. Ku-
mar, “Analyzing reliability of memory subsystems with double-chipkill
detect/correct,” Pacific Rim International Symposium on Dependable
Computing, 2013.
[21] “University of Maryland Memory System Simulator Manual,”
http://www.eng.umd.edu/~blj/dramsim/v1/download/ DRAMsim-
Manual.pdf.
[22] MICRON, “2Gb: x4, x8, x16 DDR2 SDRAM,” MICRON, 2006.
[23] N. Binkert, B. Beckmann, G. Black, S. K. Reinhardt, A. Saidi,
A. Basu, J. Hestness, D. R. Hower, T. Krishna, S. Sardashti,
R. Sen, K. Sewell, M. Shoaib, N. Vaish, M. D. Hill, and
D. A. Wood, “The gem5 simulator,” SIGARCH Comput. Archit.
News, vol. 39, no. 2, pp. 1–7, Aug. 2011. [Online]. Available:
http://doi.acm.org/10.1145/2024716.2024718
[24] A. Kansal, “Building a more efficient data center – from servers
to software,” 2013, https://www.microsoft.com/en-us/research/wp-
content/uploads/2013/02/BuildingMoreEfficientDC.pdf.
[25] S. Lin and D. J. Costello, Error Control Coding, Second Edition. Up-
per Saddle River, NJ, USA: Prentice-Hall, Inc., 2004.
[26] D. Clarke, S. Devadas, M. van Dijk, B. Gassend, and G. E. Suh, In-
cremental Multiset Hash Functions and Their Application to Memory
Integrity Checking. Berlin, Heidelberg: Springer Berlin Heidelberg,
2003, pp. 188–207.
[27] AMD, “BIOS and Kernel Developer’s Guide (BKDG) for
AMD Family 15h Models 00h-0Fh Processors,” 2013. [Online].
Available: http://support.amd.com/TechDocs/42301 15h Mod 00h-
0Fh BKDG.pdf




[29] “NAS Parallel Benchmarks,” http://www.nas.nasa.gov/publications/npb.html.
[30] V. Sridharan, J. Stearley, N. DeBardeleben, S. Blanchard, and
S. Gurumurthi, “Feng shui of supercomputer memory: Positional
effects in DRAM and SRAM faults,” in Proceedings of SC13: International
Conference for High Performance Computing, Networking, Storage
and Analysis, ser. SC ’13. New York, NY, USA: ACM, 2013.
[Online]. Available: http://doi.acm.org/10.1145/2503210.2503257 pp.
22:1–22:11.
[31] ITIC, “ITIC 2015 - 2016 global server hardware, server OS reliability
report,” 2015, http://www.lenovo.com/images/products/system-x/
pdfs/white-papers/itic 2015 reliability wp.pdf.
[32] W. S. Khwa, M. F. Chang, J. Y. Wu, M. H. Lee, T. H. Su, K. H. Yang,
T. F. Chen, T. Y. Wang, H. P. Li, M. BrightSky, S. Kim, H. L. Lung,
and C. Lam, “7.3 A resistance-drift compensation scheme to reduce
MLC PCM raw BER by over 100X for storage-class memory appli-
cations,” in 2016 IEEE International Solid-State Circuits Conference
(ISSCC), Jan 2016, pp. 134–135.
[33] A. Athmanathan, M. Stanisavljevic, N. Papandreou, H. Pozidis, and
E. Eleftheriou, “Multilevel-cell phase-change memory: A viable tech-
nology,” IEEE Journal on Emerging and Selected Topics in Circuits
and Systems, vol. 6, no. 1, pp. 87–100, March 2016.
[34] A. Calderoni, S. Sills, and N. Ramaswamy, “Performance comparison
of O-based and Cu-based ReRAM for high-density applications,” in
2014 IEEE 6th International Memory Workshop (IMW), May 2014,
pp. 1–4.
[35] S. Cha, S. O, H. Shin, S. Hwang, K. Park, S. J. Jang, J. S. Choi, G. Y.
Jin, Y. H. Son, H. Cho, J. H. Ahn, and N. S. Kim, “Defect analysis
and cost-effective resilience architecture for future DRAM devices,” in
2017 IEEE International Symposium on High Performance Computer
Architecture (HPCA), Feb 2017, pp. 61–72.
[36] S. Dolinar, D. Divsalar, and F. Pollara, “Code performance as a func-
tion of block size,” in The Telecommunications and Mission Operations
Progress Report, vol. 42-133, 1998, pp. 1–23.
[37] Hewlett Packard Enterprise, “HP advanced memory error detection technology,”
http://h20565.www2.hpe.com/hpsc/doc/public/display? do-
cId=emr na-c02878598&lang=en-us&cc=us.




[39] DELL, “Memory for Dell PowerEdge 12th Generation Servers,”
2012, http://www.dell.com/downloads/global/products/pedge/ pow-
eredge 12th generation server memory.pdf.
[40] T. J. Dell, “A White Paper on the Benefits of Chipkill Correct ECC
for PC Server Main Memory,” 1997, IBM Microelectronics Division.
[41] L. E. Ramos, E. Gorbatov, and R. Bianchini, “Page placement in hy-
brid memory systems,” in Proceedings of the International Conference
on Supercomputing, ser. ICS ’11. New York, NY, USA: ACM, 2011.
[Online]. Available: http://doi.acm.org/10.1145/1995896.1995911 pp.
85–95.
[42] B. C. Lee, E. Ipek, O. Mutlu, and D. Burger, “Architecting phase
change memory as a scalable DRAM alternative,” SIGARCH Comput.
Archit. News, vol. 37, no. 3, pp. 2–13, June 2009. [Online]. Available:
http://doi.acm.org/10.1145/1555815.1555758
[43] E. Kultursay, M. Kandemir, A. Sivasubramaniam, and O. Mutlu,
“Evaluating STT-RAM as an energy-efficient main memory alterna-
tive,” 2013 IEEE International Symposium on Performance Analysis
of Systems and Software (ISPASS), pp. 256–267, 2013.
[44] C. Xu, D. Niu, N. Muralimanohar, R. Balasubramonian, T. Zhang,
S. Yu, and Y. Xie, “Overcoming the challenges of crossbar resistive
memory architectures,” in 2015 IEEE 21st International Symposium
on High Performance Computer Architecture (HPCA), Feb 2015, pp.
476–488.
[45] J. Liu, B. Jaiyen, R. Veras, and O. Mutlu, “RAIDR: Retention-aware
intelligent DRAM refresh,” in Computer Architecture (ISCA), 2012
39th Annual International Symposium on, June 2012, pp. 1–12.
[46] M. K. Qureshi, M. M. Franceschini, L. A. Lastras-Montaño, and
J. P. Karidis, “Morphable memory system: A robust architecture for
exploiting multi-level phase change memories,” SIGARCH Comput.
Archit. News, vol. 38, no. 3, pp. 153–162, June 2010. [Online].
Available: http://doi.acm.org/10.1145/1816038.1815981
[47] A. Akel, A. M. Caulfield, T. I. Mollov, R. K. Gupta, and
S. Swanson, “Onyx: A protoype phase change memory storage
array,” in Proceedings of the 3rd USENIX Conference on Hot
Topics in Storage and File Systems, ser. HotStorage’11. Berke-
ley, CA, USA: USENIX Association, 2011. [Online]. Available:
http://dl.acm.org/citation.cfm?id=2002218.2002220 p. 2.
[48] R. Fackenthal, M. Kitagawa, W. Otsuka, K. Prall, D. Mills, K. Tsutsui,
J. Javanifard, K. Tedrow, T. Tsushima, Y. Shibahara, and G. Hush,
“19.7 A 16Gb ReRAM with 200MB/s write and 1GB/s read in 27nm technology,”
in 2014 IEEE International Solid-State Circuits Conference
Digest of Technical Papers (ISSCC), Feb 2014, pp. 338–339.
[49] L. Mearian, “Everspin ships first st-mram mem-




[50] Everspin, “DDR3 DRAM compatible MRAM - spin torque technology,”
https://www.everspin.com/ddr3-dram-compatible-mram-spin-torque-
technology.
[51] A. Prodromakis, N. Papandreou, E. Bougioukou, U. Egger, N. Toul-
garidis, T. Antonakopoulos, H. Pozidis, and E. Eleftheriou, “Controller
architecture for low-latency access to phase-change memory in open-
power systems,” in 2016 26th International Conference on Field Pro-
grammable Logic and Applications (FPL), Aug 2016, pp. 1–4.
[52] L. Jiang, Y. Zhang, B. R. Childers, and J. Yang, “FPB: Fine-grained
power budgeting to improve write throughput of multi-level cell
phase change memory,” in Proceedings of the 2012 45th Annual
IEEE/ACM International Symposium on Microarchitecture, ser.
MICRO-45. Washington, DC, USA: IEEE Computer Society, 2012.
[Online]. Available: http://dx.doi.org/10.1109/MICRO.2012.10 pp.
1–12.
[53] S. Schechter, G. H. Loh, K. Strauss, and D. Burger, “Use ECP, Not
ECC, for hard failures in resistive memories,” in Proceedings of the
37th Annual International Symposium on Computer Architecture, ser.
ISCA ’10. New York, NY, USA: ACM, 2010. [Online]. Available:
http://doi.acm.org/10.1145/1815961.1815980 pp. 141–152.
[54] R. Ramanujan, G. Hinton, and D. Zimmerman, “Dynamic partial
power down of memory-side cache in a 2-level memory hierarchy,”
Oct. 9 2014, US Patent App. 13/994,726. [Online]. Available:
https://www.google.com/patents/US20140304475
[55] R. Ramanujan, D. Ziakas, D. Zimmerman, M. Kumar, M. Swami-
nathan, and B. Coury, “Apparatus and method for implementing
a multi-level memory hierarchy over common memory chan-
nels,” Apr. 19 2016, US Patent 9,317,429. [Online]. Available:
https://www.google.com/patents/US9317429
[56] S. Qawami and J. Hulbert, “Phase change memory in a dual
inline memory module,” Jan. 7 2014, US Patent 8,626,997. [Online].
Available: https://www.google.com/patents/US8626997
[57] S. Sills, S. Yasuda, A. Calderoni, C. Cardon, J. Strand, K. Aratani, and
N. Ramaswamy, “Challenges for high-density 16Gb ReRAM with 27nm
technology,” in 2015 Symposium on VLSI Circuits (VLSI Circuits),
June 2015, pp. T106–T107.
[58] S. Sills, S. Yasuda, J. Strand, A. Calderoni, K. Aratani, A. Johnson,
and N. Ramaswamy, “A copper ReRAM cell for storage class memory
applications,” in 2014 Symposium on VLSI Technology (VLSI-
Technology): Digest of Technical Papers, June 2014, pp. 1–2.
[59] H. Naeimi, C. Augustine, A. Raychowdhury, L. Shih-Lien, and
J. Tschanz, “STTRAM scaling and retention failure,” Intel Technology
Journal, vol. 17, no. 1, pp. 54 – 75, 2013.
[60] T. Parnell, “NAND flash basics & error characteristics; why do we need
smart controllers?” Flash Memory Summit, 2016.
[61] M. Awasthi, M. Shevgoor, K. Sudan, B. Rajendran, R. Balasubramo-
nian, and V. Srinivasan, “Efficient scrub mechanisms for error-prone
emerging memories,” in Proceedings of the 2012 IEEE 18th Interna-
tional Symposium on High-Performance Computer Architecture, ser.
HPCA ’12. Washington, DC, USA: IEEE Computer Society, 2012.
[Online]. Available: http://dx.doi.org/10.1109/HPCA.2012.6168941
pp. 1–12.
[62] M. Mao, P. Y. Chen, S. Yu, and C. Chakrabarti, “A multilayer approach
to designing energy-efficient and reliable ReRAM cross-point array
system,” IEEE Transactions on Very Large Scale Integration (VLSI)
Systems, vol. 25, no. 5, pp. 1611–1621, May 2017.
[63] CYPRESS, “SLC versus MLC NAND Flash Memory,” 2015,
http://www.cypress.com/file/209181/download.




[65] A. Athmanathan, “Multilevel-cell phase-change memory,” Ph.D. dis-
sertation, STI, Lausanne, 2016.
[66] A. Sebastian, T. Tuma, N. Papandreou, M. Le Gallo, L. Kull, T. Par-
nell, and E. Eleftheriou, “Temporal correlation detection using compu-
tational phase-change memory,” ArXiv e-prints, June 2017.
[67] B. Gao, H. Zhang, B. Chen, L. Liu, X. Liu, R. Han, J. Kang, Z. Fang,
H. Yu, B. Yu, and D. L. Kwong, “Modeling of retention failure behav-
ior in bipolar oxide-based resistive switching memory,” IEEE Electron
Device Letters, vol. 32, no. 3, pp. 276–278, March 2011.
[68] J. Frascaroli, F. G. Volpe, S. Brivio, and S. Spiga, “Effect of Al Doping
on the retention behavior of HfO2 resistive switching memories,”
Microelectron. Eng., vol. 147, no. C, pp. 104–107, Nov. 2015. [Online].
Available: https://doi.org/10.1016/j.mee.2015.04.043
[69] S. Yu, Y. Yin Chen, X. Guan, H.-S. Philip Wong, and J. A. Kittl, “A
Monte Carlo study of the low resistance state retention of HfOx based
resistive switching memory,” Applied Physics Letters, vol. 100, no. 4,
p. 043507, Jan. 2012.
[70] D. Acharyya, A. Hazra, and P. Bhattacharyya, “A journey
towards reliability improvement of TiO2 based resistive random
access memory: A review,” Microelectronics Reliability,
vol. 54, no. 3, pp. 541 – 560, 2014. [Online]. Available:
http://www.sciencedirect.com/science/article/pii/S0026271413004344
[71] P. J. Nair, V. Sridharan, and M. K. Qureshi, “XED: Exposing on-die
error detection information for strong memory reliability,” in Proceed-
ings of the 43rd Annual International Symposium on Computer Archi-
tecture, ser. ISCA ’16, 2016.
[72] K. Chakraborty and P. Mazumder, Fault-tolerance and Reliability
Techniques for High-density Random-access Memories, ser. Prentice
Hall modern semiconductor design series. Prentice Hall, 2002. [Online].
Available: https://books.google.com/books?id=nZUeAQAAIAAJ
[73] Y. H. Son, S. Lee, O. Seongil, S. Kwon, N. S. Kim, and J. H. Ahn,
“CiDRA: A cache-inspired DRAM resilience architecture,” in 2015
IEEE 21st International Symposium on High Performance Computer
Architecture (HPCA), Feb 2015, pp. 502–513.
[74] R. Metzger, “Advantages of ECC-free NAND in high performance ap-
plications,” Flash Memory Summit, 2012.
[75] J. Cooke and M. Kim, “Making informed memory choices,” 2014,
cache.freescale.com/files/training/doc/ftf/2014/FTF-IND-F0378.pdf.
[76] MICRON, “Nor — nand flash guide,” 2013,
https://www.micron.com/ /media/documents/products/ product-
flyer/flyer nor nand flash guide.pdf.
[77] C. Wilkerson, A. R. Alameldeen, Z. Chishti, W. Wu, D. Somasekhar,
and S.-l. Lu, “Reducing cache power with low-cost, multi-
bit error-correcting codes,” in Proceedings of the 37th Annual
International Symposium on Computer Architecture, ser. ISCA
’10. New York, NY, USA: ACM, 2010. [Online]. Available:
http://doi.acm.org/10.1145/1815961.1815973 pp. 83–93.
[78] N. Chatterjee, M. O’Connor, D. Lee, D. R. Johnson, S. W. Keckler,
M. Rhu, and W. J. Dally, “Architecting an energy-efficient DRAM
system for GPUs,” in 2017 IEEE International Symposium on High
Performance Computer Architecture (HPCA), Feb 2017, pp. 73–84.
[79] X. Dong, C. Xu, Y. Xie, and N. P. Jouppi, “NVSim: A circuit-level per-
formance, energy, and area model for emerging nonvolatile memory,”
IEEE Transactions on Computer-Aided Design of Integrated Circuits
and Systems, vol. 31, no. 7, pp. 994–1007, July 2012.
[80] R. Micheloni, A. Marelli, and K. Eshghi, Inside Solid State Drives
(SSDs). Springer Publishing Company, Incorporated, 2012.
[81] Sandisk, “Flash management: A detailed overview of flash management
techniques,” 2013.
[82] MICRON, “Error correction code (ECC) in Micron® single-
level cell (SLC) NAND,” 2011, https://www.micron.com/ /me-
dia/documents/products/ technical-note/nand-
flash/tn2963 ecc in slc nand.pdf.
[83] X. Jian, H. Duwe, J. Sartori, V. Sridharan, and R. Kumar,
“Low-power, low-storage-overhead chipkill correct via multi-line error
correction,” in Proceedings of SC13: International Conference for
High Performance Computing, Networking, Storage and Analysis, ser.
SC ’13. New York, NY, USA: ACM, 2013. [Online]. Available:
http://doi.acm.org/10.1145/2503210.2503243 pp. 24:1–24:12.
[84] J. Kim, M. Sullivan, and M. Erez, “Bamboo ECC: Strong, safe, and
flexible codes for reliable computer memory,” in 2015 IEEE 21st In-
ternational Symposium on High Performance Computer Architecture
(HPCA), Feb 2015, pp. 101–112.
[85] L. Chen and Z. Zhang, “MemGuard: A low cost and energy efficient
design to support and enhance memory system reliability,” in 2014
ACM/IEEE 41st International Symposium on Computer Architecture
(ISCA), June 2014, pp. 49–60.
[86] A. A. Hwang, I. A. Stefanovici, and B. Schroeder, “Cosmic Rays Don’t
Strike Twice: Understanding the Nature of DRAM Errors and the
Implications for System Design,” SIGARCH Comput. Archit. News,
pp. 111–122, 2012.
[87] J. Abrams and D. Iler, “Pre-failure alerts pro-
vided by Dell PowerEdge server systems manage-
ment,” 2015, http://en.community.dell.com/techcenter/ ex-
tras/m/white papers/20441294/download.
[88] M. T. Chapman, “Introducing IBM X6 technology,” 2014,
http://www.lenovo.com/images/products/system-x/pdfs/white-
papers/XSW03145USEN.PDF.
[89] Microsoft, “Predictive failure analysis (PFA),” 2017,
http://tinyurl.com/n34z657.
[90] Oracle, “Oracle Solaris and Oracle SPARC servers integrated
and optimized for mission critical computing,”
2010, http://www.oracle.com/technetwork/server-
storage/solaris/documentation/solaris-sparc-rf-final-175353.pdf.
[91] T. Y. Oh, H. Chung, J. Y. Park, K. W. Lee, S. Oh, S. Y. Doo, H. J.
Kim, C. Lee, H. R. Kim, J. H. Lee, J. I. Lee, K. S. Ha, Y. Choi, Y. C.
Cho, Y. C. Bae, T. Jang, C. Park, K. Park, S. Jang, and J. S. Choi,
“A 3.2 Gbps/pin 8 Gbit 1.0 V LPDDR4 SDRAM with integrated ECC
engine for Sub-1 V DRAM Core Operation,” IEEE Journal of Solid-
State Circuits, vol. 50, no. 1, pp. 178–190, Jan 2015.
[92] M. K. Qureshi, V. Srinivasan, and J. A. Rivers, “Scalable high perfor-
mance main memory system using phase-change memory technology,”
in Proceedings of the 36th Annual International Symposium on Com-
puter Architecture, ser. ISCA ’09. New York, NY, USA: ACM, 2009.
[Online]. Available: http://doi.acm.org/10.1145/1555754.1555760 pp.
24–33.
[93] M. K. Qureshi, M. M. Franceschini, and L. A. Lastras-Montaño,
“Improving read performance of phase change memories via write cancel-
lation and write pausing,” in HPCA - 16 2010 The Sixteenth Interna-
tional Symposium on High-Performance Computer Architecture, Jan
2010, pp. 1–11.
[94] M. K. Qureshi, M. M. Franceschini, A. Jagmohan, and L. A. Las-
tras, “Preset: Improving performance of phase change memories by
exploiting asymmetry in write times,” in 2012 39th Annual Interna-
tional Symposium on Computer Architecture (ISCA), June 2012, pp.
380–391.
[95] H. H. S. Lee, G. S. Tyson, and M. K. Farrens, “Eager writeback - A
technique for improving bandwidth utilization,” in Proceedings 33rd
Annual IEEE/ACM International Symposium on Microarchitecture.
MICRO-33 2000, 2000, pp. 11–21.
[96] HP Labs, “Cacti 6.5,” http://www.hpl.hp.com/research/cacti/cacti65.tgz.
[97] A. R. Alameldeen, I. Wagner, Z. Chishti, W. Wu, C. Wilkerson,
and S.-L. Lu, “Energy-efficient cache design using variable-
strength error-correcting codes,” in Proceedings of the 38th Annual
International Symposium on Computer Architecture, ser. ISCA
’11. New York, NY, USA: ACM, 2011. [Online]. Available:
http://doi.acm.org/10.1145/2000064.2000118 pp. 461–472.
[98] “The international technology roadmap for semiconductors (ITRS),
system drivers,” 2012, http://www.itrs2.net/2012-itrs.html.
[99] C. Yang, Y. Emre, and C. Chakrabarti, “Product code schemes for
error correction in mlc nand flash memories,” IEEE Transactions on
Very Large Scale Integration (VLSI) Systems, vol. 20, no. 12, pp. 2302–
2314, Dec 2012.
[100] Y. Lee, H. Yoo, I. Yoo, and I. C. Park, “6.4gb/s multi-threaded bch
encoder and decoder for multi-channel SSD controllers,” in 2012 IEEE
International Solid-State Circuits Conference, Feb 2012, pp. 426–428.
[101] K. Aingaran, S. Jairath, G. Konstadinidis, S. Leung, P. Loewen-
stein, C. McAllister, S. Phillips, Z. Radovic, R. Sivaramakrishnan,
D. Smentek, and T. Wicki, “M7: Oracle’s next-generation sparc pro-
cessor,” IEEE Micro, vol. 35, no. 2, pp. 36–45, Mar 2015.
[102] S. K. Sadasivam, B. W. Thompto, R. Kalla, and W. J. Starke, “IBM
Power9 processor architecture,” IEEE Micro, vol. 37, no. 2, pp. 40–51,
Mar 2017.
[103] X. Jian and R. Kumar, “Adaptive Reliability Chipkill Correct
(ARCC),” in High Performance Computer Architecture (HPCA2013),
2013 IEEE 19th International Symposium on, 2013, pp. 270–281.
[104] S. Li, D. H. Yoon, K. Chen, J. Zhao, J. H. Ahn, J. B. Brockman,
Y. Xie, and N. P. Jouppi, “Mage: Adaptive granularity and ECC for
resilient and power efficient memory systems,” in Proceedings of the
International Conference on High Performance Computing, Network-
ing, Storage and Analysis, ser. SC ’12. Los Alamitos, CA, USA: IEEE
Computer Society Press, 2012, pp. 33:1–33:11.
[105] L. Zhang, B. Neely, D. Franklin, D. Strukov, Y. Xie, and F. T. Chong,
“Mellow writes: Extending lifetime in resistive memories through selec-
tive slow write backs,” in 2016 ACM/IEEE 43rd Annual International
Symposium on Computer Architecture (ISCA), June 2016, pp. 519–531.
[106] P. Nair, D. Roberts, and M. Qureshi, “Citadel: Efficiently protect-
ing stacked memory from large granularity failures,” in Microarchitec-
ture (MICRO), 2014 47th Annual IEEE/ACM International Sympo-
sium on, Dec 2014, pp. 51–62.
[107] X. Jian, V. Sridharan, and R. Kumar, “Parity helix: Efficient pro-
tection for single-dimensional faults in multi-dimensional memory sys-
tems,” in 2016 IEEE International Symposium on High Performance
Computer Architecture (HPCA), March 2016, pp. 555–567.
[108] K. Meng, R. Joseph, R. P. Dick, and L. Shang, “Multi-optimization
power management for chip multiprocessors,” in Proceedings of the 17th
International Conference on Parallel Architectures and Compilation
Techniques, ser. PACT ’08. New York, NY, USA: ACM, 2008.
[Online]. Available: http://doi.acm.org/10.1145/1454115.1454141 pp.
177–186.
[109] K. Ma, X. Li, M. Chen, and X. Wang, “Scalable power control
for many-core architectures running multi-threaded applications,” in
Proceedings of the 38th Annual International Symposium on Computer
Architecture, ser. ISCA ’11. New York, NY, USA: ACM, 2011.
[Online]. Available: http://doi.acm.org/10.1145/2000064.2000117 pp.
449–460.
[110] M. B. Taylor, “Is dark silicon useful?: Harnessing the four
horsemen of the coming dark silicon apocalypse,” in Proceedings
of the 49th Annual Design Automation Conference, ser. DAC
’12. New York, NY, USA: ACM, 2012. [Online]. Available:
http://doi.acm.org/10.1145/2228360.2228567 pp. 1131–1136.
[111] C. Wilkerson, H. Gao, A. R. Alameldeen, Z. Chishti, M. Khellah,
and S.-L. Lu, “Trading off cache capacity for reliability to enable low
voltage operation,” in Proceedings of the 35th Annual International
Symposium on Computer Architecture, ser. ISCA ’08. Washington,
DC, USA: IEEE Computer Society, 2008. [Online]. Available:
http://dx.doi.org/10.1109/ISCA.2008.22 pp. 203–214.
[112] Z. Chishti, A. Alameldeen, C. Wilkerson, W. Wu, and S.-L. Lu, “Im-
proving cache lifetime reliability at ultra-low voltages,” in Microar-
chitecture, 2009. MICRO-42. 42nd Annual IEEE/ACM International
Symposium on, 2009, pp. 89–99.
[113] R. Dreslinski, M. Wieckowski, D. Blaauw, D. Sylvester, and T. Mudge,
“Near-threshold computing: Reclaiming moore’s law through energy
efficient integrated circuits,” Proceedings of the IEEE, vol. 98, no. 2,
pp. 253–266, 2010.
[114] A. Ansari, S. Feng, S. Gupta, and S. A. Mahlke, “Archipelago: A poly-
morphic cache design for enabling robust near-threshold operation.” in
HPCA. IEEE Computer Society, 2011, pp. 539–550.
[115] M. K. Qureshi and Z. Chishti, “Operating secded-based caches at ultra-
low voltage with flair,” in Dependable Systems and Networks (DSN),
2013 43rd Annual IEEE/IFIP International Conference on, 2013, pp.
1–11.
[116] T. Bonnoit, M. Nicolaidis, and N.-E. Zergainoh, “Using error
correcting codes without speed penalty in embedded memories:
Algorithm, implementation and case study,” Journal of Electronic
Testing, vol. 29, no. 3, pp. 383–400, 2013. [Online]. Available:
http://dx.doi.org/10.1007/s10836-013-5386-8
[117] Alpha 21264 Microprocessor Hardware Reference Manual, Compaq
Computer Corporation. [Online]. Available:
http://h18000.www1.hp.com/cpq-alphaserver/technology/literature/21264hrm.pdf
[118] V. Chandra and R. Aitken, “Impact of technology and voltage
scaling on the soft error susceptibility in nanoscale CMOS,” in
Proceedings of the 2008 IEEE International Symposium on Defect
and Fault Tolerance of VLSI Systems, ser. DFT ’08. Washington,
DC, USA: IEEE Computer Society, 2008. [Online]. Available:
http://dx.doi.org/10.1109/DFT.2008.50 pp. 114–122.
[119] P. P. Shirvani and E. J. McCluskey, “Padded cache: A new fault-
tolerance technique for cache memories,” in VLSI Test Symposium,
1999. Proceedings. 17th IEEE. IEEE, 1999, pp. 440–445.
[120] D. Strukov, “The area and latency tradeoffs of binary bit-parallel bch
decoders for prospective nanoelectronic memories,” in Signals, Systems
and Computers, 2006. ACSSC ’06. Fortieth Asilomar Conference on,
2006, pp. 1183–1187.
[121] N. Weste and D. Harris, CMOS VLSI Design: A Circuits and Systems
Perspective, 4th ed. USA: Addison-Wesley Publishing Company, 2010.
[122] S. Thoziyoor, N. Muralimanohar, J. H. Ahn, and N. P. Jouppi, “CACTI
5.1,” HP Laboratories, Palo Alto, Tech. Rep, vol. 20, 2008.
[123] K. Khubaib, M. Suleman, M. Hashemi, C. Wilkerson, and Y. Patt,
“Morphcore: An energy-efficient microarchitecture for high perfor-
mance ILP and high throughput TLP,” in Microarchitecture (MICRO),
2012 45th Annual IEEE/ACM International Symposium on, 2012, pp.
305–316.
[124] “International technology roadmap for semiconductors
2001 edition process integration, devices, and struc-
tures and emerging research devices.” [Online]. Available:
http://www.itrs.net/Links/2001ITRS/PIDS.pdf
[125] J. Kulkarni and K. Roy, “Ultralow-voltage process-variation-tolerant
schmitt-trigger-based sram design,” Very Large Scale Integration
(VLSI) Systems, IEEE Transactions on, vol. 20, no. 2, pp. 319–332,
Feb 2012.
[126] J. Kulkarni, K. Kim, and K. Roy, “A 160 mv robust schmitt trig-
ger based subthreshold sram,” Solid-State Circuits, IEEE Journal of,
vol. 42, no. 10, pp. 2303–2313, Oct 2007.
[127] L. Chang, D. Fried, J. Hergenrother, J. Sleight, R. Dennard, R. Mon-
toye, L. Sekaric, S. McNab, A. Topol, C. Adams, K. Guarini, and
W. Haensch, “Stable sram cell design for the 32 nm node and beyond,”
in VLSI Technology, 2005. Digest of Technical Papers. 2005 Sympo-
sium on, June 2005, pp. 128–129.
[128] B. Calhoun and A. Chandrakasan, “A 256kb sub-threshold sram in
65nm CMOS,” in Solid-State Circuits Conference, 2006. ISSCC 2006.
Digest of Technical Papers. IEEE International, Feb 2006, pp. 2592–
2601.
[129] E. Le Sueur and G. Heiser, “Dynamic voltage and frequency scaling:
The laws of diminishing returns,” in Proceedings of the 2010 Inter-
national Conference on Power Aware Computing and Systems, ser.
HotPower’10. Berkeley, CA, USA: USENIX Association, 2010. [Online].
Available: http://dl.acm.org/citation.cfm?id=1924920.1924921
pp. 1–8.
[130] Standard Performance Evaluation Corporation, “Spec cpu2000.”
[Online]. Available: www.spec.org/cpu2000
[131] Standard Performance Evaluation Corporation, “Spec cpu2006.”
[Online]. Available: www.spec.org/cpu2006
[132] ARM, “Cortex-a7 technical reference manual, rev r0p5.”
[133] S. Li, J. H. Ahn, R. D. Strong, J. B. Brockman, D. M. Tullsen, and
N. P. Jouppi, “McPAT: an integrated power, area, and timing modeling
framework for multicore and manycore architectures,” in Proceedings
of the 42nd Annual IEEE/ACM International Symposium on
Microarchitecture, ser. MICRO 42. New York, NY, USA: ACM, 2009.
[Online]. Available: http://doi.acm.org/10.1145/1669112.1669172 pp.
469–480.
[134] W. Kim, M. S. Gupta, G. yeon Wei, and D. Brooks, “System level
analysis of fast, per-core DVFS using on-chip switching regulators,” in
HPCA, 2008.
[135] S. M. Khan, A. R. Alameldeen, C. Wilkerson, J. Kulkarni, and D. A.
Jimenez, “Improving multi-core performance using mixed-cell cache
architecture,” in Proceedings of the 2013 IEEE 19th International
Symposium on High Performance Computer Architecture (HPCA), ser.
HPCA ’13. Washington, DC, USA: IEEE Computer Society, 2013.
[Online]. Available: http://dx.doi.org/10.1109/HPCA.2013.6522312
pp. 119–130.
[136] F. Frustaci, M. Khayatzadeh, D. Blaauw, D. Sylvester, and M. Alioto,
“SRAM for error-tolerant applications with dynamic energy-quality
management in 28 nm CMOS,” Solid-State Circuits, IEEE Journal
of, vol. 50, no. 5, pp. 1310–1323, May 2015.
[137] T. N. Miller, R. Thomas, J. Dinan, B. Adcock, and R. Teodorescu,
“Parichute: Generalized turbocode-based error correction for near-
threshold caches,” in Proceedings of the 2010 43rd Annual IEEE/ACM
International Symposium on Microarchitecture. IEEE Computer So-
ciety, 2010, pp. 351–362.
[138] Synopsys Design Compiler User’s Manual, Synopsys.
[139] Cadence SoC Encounter User’s Manual, Cadence.
[140] H. Duwe, X. Jian, and R. Kumar, “Correction prediction: Reducing
error correction latency for on-chip memories,” in High Performance
Computer Architecture (HPCA), 2015 IEEE 21st International Sym-
posium on, Feb 2015, pp. 463–475.
[141] M. Hempstead, G. yeon Wei, and D. Brooks, “Architecture and cir-
cuit techniques for low-throughput, energy-constrained systems across
technology generations,” in Proceedings of CASES. ACM Press, 2006,
pp. 368–378.
[142] S. Morioka and Y. Katayama, “Design methodology for a one-shot
Reed-Solomon encoder and decoder,” in Computer Design, 1999.
(ICCD ’99) International Conference on, 1999, pp. 60–67.
[143] D. Palframan, N. S. Kim, and M. Lipasti, “ipatch: Intelligent fault
patching to improve energy efficiency,” in High Performance Computer
Architecture (HPCA), 2015 IEEE 21st International Symposium on,
Feb 2015, pp. 428–438.
[144] J. Kim, N. Hardavellas, K. Mai, B. Falsafi, and J. Hoe, “Multi-bit
error tolerant caches using two-dimensional error coding,” in Microar-
chitecture, 2007. MICRO 2007. 40th Annual IEEE/ACM International
Symposium on, Dec 2007, pp. 197–209.
[145] Intel, “Intel E7500 Chipset MCH Intel x4 single device data correction
(x4 sddc) implementation and validation,” Aug 2002. [Online]. Available:
http://www.intel.com/content/www/us/en/chipsets/e7500-chipset-mch-x4-single-device-data-correction-note.html
[146] G. Bond, “Fault alignment control system and circuits,”
Dec. 18 1984, US Patent 4,489,403. [Online]. Available:
https://www.google.com/patents/US4489403
[147] G. Bond, F. Cartman, and P. Ryan, “Multi-bit error scattering
arrangement to provide fault tolerant semiconductor static mem-
ories,” Dec. 11 1984, US Patent 4,488,298. [Online]. Available:
http://www.google.com/patents/US4488298
[148] D. Bossen and M. Hsiao, “Deterministic permutation algo-
rithm,” July 17 1984, US Patent 4,461,001. [Online]. Available:
http://www.google.com/patents/US4461001
[149] W. Beausoleil, “Method of manufacturing a full capac-
ity monolithic memory utilizing defective storage cells,”
Aug. 5 1975, US Patent 3,897,626. [Online]. Available:
http://www.google.com/patents/US3897626
[150] D. Bossen, C. Haugh, and M. Hsiao, “Dynamic address translation scheme using
orthogonal squares,” May 21 1974, US Patent 3,812,336. [Online].
Available: http://www.google.com/patents/US3812336
[151] W. Beausoleil, “Monolithic memory utilizing defective storage
cells,” Dec. 25 1973, US Patent 3,781,826. [Online]. Available:
http://www.google.com/patents/US3781826
[152] W. F. Beausoleil, “Memory with reconfiguration to avoid uncorrectable
errors,” Feb. 22 1972, US Patent 3,644,902. [Online]. Available:
http://www.google.com/patents/US3644902
[153] B. Giridhar, M. Cieslak, D. Duggal, R. Dreslinski, H. M. Chen,
R. Patti, B. Hold, C. Chakrabarti, T. Mudge, and D. Blaauw,
“Exploring DRAM organizations for energy-efficient and resilient
exascale memories,” in Proceedings of SC13: International Conference
for High Performance Computing, Networking, Storage and Analysis,
ser. SC ’13. New York, NY, USA: ACM, 2013. [Online]. Available:
http://doi.acm.org/10.1145/2503210.2503215 pp. 23:1–23:12.
[154] A. N. Udipi, N. Muralimanohar, N. Chatterjee, R. Balasubramonian,
A. Davis, and N. P. Jouppi, “Rethinking dram design and organization
for energy-constrained multi-cores,” in Proceedings of the 37th
Annual International Symposium on Computer Architecture, ser.
ISCA ’10. New York, NY, USA: ACM, 2010. [Online]. Available:
http://doi.acm.org/10.1145/1815961.1815983 pp. 175–186.
[155] D. H. Yoon, J. Chang, N. Muralimanohar, and P. Ranganathan,
“BOOM: Enabling mobile memory based low-power server DIMMs,” in
Computer Architecture (ISCA), 2012 39th Annual International Sym-
posium on, June 2012, pp. 25–36.
[156] D. H. Yoon, M. K. Jeong, and M. Erez, “Adaptive granularity memory
systems: A tradeoff between storage efficiency and throughput,” in
Proceedings of the 38th Annual International Symposium on Computer
Architecture, ser. ISCA ’11. New York, NY, USA: ACM, 2011, pp.
295–306.
[157] J. T. Pawlowski, “Hybrid Memory Cube (HMC),” Hot Chips 23, 2011.
[158] M. O’Connor, “Highlights of the High Bandwidth Memory (HBM)
Standard,” 2014, http://www.cs.utah.edu/thememoryforum/mike.pdf.
[159] V. Sridharan, N. DeBardeleben, S. Blanchard, K. B. Ferreira,
J. Stearley, J. Shalf, and S. Gurumurthi, “Memory errors in modern
systems: The good, the bad, and the ugly,” in Proceedings of
the Twentieth International Conference on Architectural Support
for Programming Languages and Operating Systems, ser. ASPLOS
’15. New York, NY, USA: ACM, 2015. [Online]. Available:
http://doi.acm.org/10.1145/2694344.2694348 pp. 297–310.
[160] S. Li, K. Chen, M.-Y. Hsieh, N. Muralimanohar, C. D. Kersey, J. B.
Brockman, A. F. Rodrigues, and N. P. Jouppi, “System implications of
memory reliability in exascale computing,” in Proceedings of 2011 In-
ternational Conference for High Performance Computing, Networking,
Storage and Analysis, ser. SC ’11. New York, NY, USA: ACM, 2011.
[Online]. Available: http://doi.acm.org/10.1145/2063384.2063445 pp.
46:1–46:12.
[161] J. Sim, G. H. Loh, V. Sridharan, and M. O’Connor, “Resilient
die-stacked DRAM caches,” in Proceedings of the 40th Annual
International Symposium on Computer Architecture, ser. ISCA
’13. New York, NY, USA: ACM, 2013. [Online]. Available:
http://doi.acm.org/10.1145/2485922.2485958 pp. 416–427.
[162] H. Jeon, G. Loh, and M. Annavaram, “Efficient RAS support for die-
stacked dram,” in Test Conference (ITC), 2014 IEEE International,
Oct 2014, pp. 1–10.
[163] “JEDEC STANDARD: High Bandwidth Memory (HBM) DRAM,”
2013, http://www.jedec.org/standards-documents/results/jesd235.
[164] “Hybrid Memory Cube Specification 2.1,” 2015,
http://www.hybridmemorycube.org/specification-v2-download-form/.
[165] D. A. Patterson, G. Gibson, and R. H. Katz, “A case for redundant
arrays of inexpensive disks (raid),” in Proceedings of the 1988 ACM
SIGMOD International Conference on Management of Data, ser.
SIGMOD ’88. New York, NY, USA: ACM, 1988. [Online]. Available:
http://doi.acm.org/10.1145/50202.50214 pp. 109–116.
[166] J. Kim and Y. Kim, “HBM: Memory solution for bandwidth-hungry
processors,” Hot Chips 26, 2014.
[167] Tezzaron, “Our Technology 101,”
http://www.tezzaron.com/about-us/our-technology-101/.
[168] W. Sun, W. Zhu, F. Che, C. Wang, A. Sun, and H. Tan, “Ultra-thin
die characterization for stack-die packaging,” in Electronic Components
and Technology Conference, 2007. ECTC ’07. Proceedings. 57th, May
2007, pp. 1390–1396.
[169] H. Guojun, L. Jing-en, and X. Baraton, “Characterization of silicon die
strength with application to die crack analysis,” in Electronic Man-
ufacturing Technology Symposium (IEMT), 2008 33rd IEEE/CPMT
International, Nov 2008, pp. 1–7.
[170] EMC, “EMC CLARiiON RAID 6 Technology: A Detailed Review,”
2007,
http://www.emc.com/collateral/hardware/white-papers/h2891-clariion-raid-6.pdf.
[171] M. Blaum, J. Brady, J. Bruck, and J. Menon, “Evenodd: An efficient
scheme for tolerating double disk failures in raid architectures,” Com-
puters, IEEE Transactions on, vol. 44, no. 2, pp. 192–202, Feb 1995.
[172] P. Corbett, B. English, A. Goel, T. Grcanac, S. Kleiman, J. Leong, and
S. Sankar, “Row-diagonal parity for double disk failure correction,” in
Proceedings of the 3rd USENIX Symposium on File and Storage
Technologies (FAST ’04), 2004, pp. 1–14.
[173] C. Wu, X. He, G. Wu, S. Wan, X. Liu, Q. Cao, and C. Xie, “HDP
code: A horizontal-diagonal parity code to optimize i/o load balancing
in raid-6,” in Dependable Systems Networks (DSN), 2011 IEEE/IFIP
41st International Conference on, June 2011, pp. 209–220.
[174] Y. Kim, R. Daly, J. Kim, C. Fallin, J. H. Lee, D. Lee,
C. Wilkerson, K. Lai, and O. Mutlu, “Flipping bits in mem-
ory without accessing them: An experimental study of dram
disturbance errors,” in Proceedings of the 41st Annual International
Symposium on Computer Architecture, ser. ISCA ’14.
Piscataway, NJ, USA: IEEE Press, 2014. [Online]. Available:
http://dl.acm.org/citation.cfm?id=2665671.2665726 pp. 361–372.
[175] M. K. Qureshi and G. H. Loh, “Fundamental latency trade-off in
architecting dram caches: Outperforming impractical sram-tags with
a simple and practical design,” in Proceedings of the 2012 45th
Annual IEEE/ACM International Symposium on Microarchitecture,
ser. MICRO-45. Washington, DC, USA: IEEE Computer Society,
2012. [Online]. Available: http://dx.doi.org/10.1109/MICRO.2012.30
pp. 235–246.
[176] T. Drane, W.-c. Cheung, and G. Constantinides, “Correctly rounded
constant integer division via multiply-add,” in Circuits and Systems
(ISCAS), 2012 IEEE International Symposium on, May 2012, pp.
1243–1246.
[177] X. Jian and R. Kumar, “ECC Parity: A technique for efficient
memory error resilience for multi-channel memory systems,” in
Proceedings of the International Conference for High Performance
Computing, Networking, Storage and Analysis, ser. SC ’14.




[179] K. Chen, S. Li, N. Muralimanohar, J.-H. Ahn, J. Brockman, and
N. Jouppi, “Cacti-3dd: Architecture-level modeling for 3d die-stacked
dram main memory,” in Design, Automation Test in Europe Confer-
ence Exhibition (DATE), 2012, 2012, pp. 33–38.
[180] N. Jouppi, A. Kahng, N. Muralimanohar, and V. Srinivas, “Cacti-
io: Cacti with off-chip power-area-timing models,” in Computer-Aided
Design (ICCAD), 2012 IEEE/ACM International Conference on, Nov
2012, pp. 294–301.
[181] J. Shalf, S. Dosanjh, and J. Morrison, “Exascale computing technology
challenges,” in Proceedings of the 9th International Conference
on High Performance Computing for Computational Science, ser.
VECPAR’10. Berlin, Heidelberg: Springer-Verlag, 2011. [Online].
Available: http://dl.acm.org/citation.cfm?id=1964238.1964240 pp.
1–25.
[182] D. H. Yoon, M. K. Jeong, M. Sullivan, and M. Erez, “The dynamic
granularity memory system,” in Proceedings of the 39th Annual
International Symposium on Computer Architecture, ser. ISCA ’12.
Washington, DC, USA: IEEE Computer Society, 2012. [Online].
Available: http://dl.acm.org/citation.cfm?id=2337159.2337222 pp.
548–559.
[183] Y.-B. Kim and T. Chen, “Assessing merged dram/logic technology,” in
Circuits and Systems, 1996. ISCAS ’96., Connecting the World., 1996
IEEE International Symposium on, vol. 4, May 1996, pp. 133–134.
[184] J. Zerbe, P. Chau, C. Werner, T. Thrush, D. Perino, B. Garlepp, and
K. Donnelly, “1.6 gb/s/pin 4-pam signaling and circuits for a multi-
drop bus,” in VLSI Circuits, 2000. Digest of Technical Papers. 2000
Symposium on, June 2000, pp. 128–131.
[185] B. Ganesh, A. Jaleel, D. Wang, and B. Jacob, “Fully-buffered
dimm memory architectures: Understanding mechanisms, overheads
and scaling,” in Proceedings of the 2007 IEEE 13th International
Symposium on High Performance Computer Architecture, ser. HPCA
’07. Washington, DC, USA: IEEE Computer Society, 2007. [Online].
Available: http://dx.doi.org/10.1109/HPCA.2007.346190 pp. 109–120.
[186] D. Resnick, “Memory network methods, apparatus, and systems,”
Aug. 19 2010, US Patent App. 12/389,200. [Online]. Available:
http://www.google.com/patents/US20100211721
[187] G. Kim, M. Lee, J. Jeong, and J. Kim, “Multi-GPU system design with
memory networks,” in Microarchitecture (MICRO), 2014 47th Annual
IEEE/ACM International Symposium on, Dec 2014, pp. 484–495.
[188] B. M. Rogers, A. Krishna, G. B. Bell, K. Vu, X. Jiang, and
Y. Solihin, “Scaling the bandwidth wall: Challenges in and
avenues for cmp scaling,” in Proceedings of the 36th Annual
International Symposium on Computer Architecture, ser. ISCA
’09. New York, NY, USA: ACM, 2009. [Online]. Available:
http://doi.acm.org/10.1145/1555754.1555801 pp. 371–382.
[189] E. Cooper-Balis, P. Rosenfeld, and B. Jacob, “Buffer-on-board
memory systems,” in Proceedings of the 39th Annual International
Symposium on Computer Architecture, ser. ISCA ’12. Washington,
DC, USA: IEEE Computer Society, 2012. [Online]. Available:
http://dl.acm.org/citation.cfm?id=2337159.2337204 pp. 392–403.
[190] MICRON, “4Gb:x4, x8, x16 DDR4 SDRAM,”
https://www.micron.com/~/media/documents/products/data-sheet/dram/ddr4/4gb_ddr4_sdram.pdf.
[191] S. Pugsley, J. Jestes, R. Balasubramonian, V. Srinivasan, A. Buyuk-
tosunoglu, A. Davis, and F. Li, “Comparing implementations of
near-data computing with in-memory MapReduce workloads,” Micro,
IEEE, vol. 34, no. 4, pp. 44–52, July 2014.
[192] T. Anand, M. Talegaonkar, A. Elkholy, S. Saxena, A. Elshazly, and
P. Hanumolu, “3.7 a 7gb/s rapid on/off embedded-clock serial-link
transceiver with 20ns power-on time, 740 uw off-state power for
energy-proportional links in 65nm cmos,” in Solid-State Circuits Conference
- (ISSCC), 2015 IEEE International, Feb 2015, pp. 1–3.
[193] S. P. Muralidhara, L. Subramanian, O. Mutlu, M. Kandemir, and
T. Moscibroda, “Reducing memory interference in multicore systems
via application-aware memory channel partitioning,” in Proceedings
of the 44th Annual IEEE/ACM International Symposium on
Microarchitecture, ser. MICRO-44. New York, NY, USA: ACM, 2011.
[Online]. Available: http://doi.acm.org/10.1145/2155620.2155664 pp.
374–385.
[194] J. Jeddeloh and B. Keeth, “Hybrid memory cube new dram architec-
ture increases density and performance,” in VLSI Technology (VLSIT),
2012 Symposium on, June 2012, pp. 87–88.
[195] K. T. Malladi, B. C. Lee, F. A. Nothaft, C. Kozyrakis,
K. Periyathambi, and M. Horowitz, “Towards energy-proportional
datacenter memory with mobile dram,” in Proceedings of the 39th
Annual International Symposium on Computer Architecture, ser. ISCA




[196] G. Shu, W. S. Choi, S. Saxena, S. J. Kim, M. Talegaonkar, R. Nand-
wana, A. Elkholy, D. Wei, T. Nandi, and P. K. Hanumolu, “23.1 a
16mb/s-to-8gb/s 14.1-to-5.9pj/b source synchronous transceiver using
dvfs and rapid on/off in 65nm cmos,” in 2016 IEEE International Solid-
State Circuits Conference (ISSCC), Jan 2016, pp. 398–399.
[197] D. Abts, M. R. Marty, P. M. Wells, P. Klausler, and H. Liu,
“Energy proportional datacenter networks,” in Proceedings of the
37th Annual International Symposium on Computer Architecture, ser.
ISCA ’10. New York, NY, USA: ACM, 2010. [Online]. Available:
http://doi.acm.org/10.1145/1815961.1816004 pp. 338–347.
[198] T. M. Andersen, F. Krismer, J. W. Kolar, T. Toifl, C. Menolfi, L. Kull,
T. Morf, M. Kossel, M. Brändli, P. Buchmann, and P. A. Francese, “4.7
a sub-ns response on-chip switched-capacitor dc-dc voltage regulator
delivering 3.7w/mm2 at 90% efficiency using deep-trench capacitors in
32nm SOI CMOS,” in Solid-State Circuits Conference Digest of
Technical Papers (ISSCC), 2014 IEEE International, Feb 2014, pp.
90–91.
[199] J. Ahn, S. Yoo, and K. Choi, “Dynamic power management
of off-chip links for hybrid memory cubes,” in Proceedings
of the 51st Annual Design Automation Conference, ser. DAC
’14. New York, NY, USA: ACM, 2014. [Online]. Available:
http://doi.acm.org/10.1145/2593069.2593128 pp. 139:1–139:6.
[200] D. Wu, B. He, X. Tang, J. Xu, and M. Guo, “Ramzzz: Rank-aware
dram power management with dynamic migrations and demotions,” in
High Performance Computing, Networking, Storage and Analysis (SC),
2012 International Conference for, Nov 2012, pp. 1–11.
[201] K. Malladi, I. Shaeffer, L. Gopalakrishnan, D. Lo, B. Lee, and
M. Horowitz, “Rethinking dram power modes for energy proportion-
ality,” in Microarchitecture (MICRO), 2012 45th Annual IEEE/ACM
International Symposium on, Dec 2012, pp. 131–142.
[202] X. Li, Z. Li, Y. Zhou, and S. Adve, “Performance directed
energy management for main memory and disks,” Trans. Storage,
vol. 1, no. 3, pp. 346–380, Aug. 2005. [Online]. Available:
http://doi.acm.org/10.1145/1084779.1084782
