Coding for Phase Change Memory Performance Optimization by Mirhoseini, Azalia

Abstract
Coding for Phase Change Memory Performance Optimization
by
Azalia Mirhoseini
Over the past several decades, memory technologies have exploited
continual scaling of CMOS to drastically improve performance and cost.
Unfortunately, charge-based memories become unreliable beyond 20 nm
feature sizes. A promising alternative is Phase-Change-Memory (PCM)
which leverages scalable resistive thermal mechanisms. To realize PCM's
potential, a number of challenges, including the limited wear-endurance
and costly writes, need to be addressed. This thesis introduces novel
methodologies for encoding data on PCM which exploit asymmetries
in read/write performance to minimize memory's wear/energy consumption.
First, we map the problem to a distance-based graph clustering
problem and prove it is NP-hard. Next, we propose two different approaches:
an optimal solution based on Integer-Linear-Programming, and
an approximately-optimal solution based on Dynamic-Programming. Our
methods target both single-level and multi-level cell PCM and provide
further optimizations for stochastically-distributed data. We devise a low
overhead hardware architecture for the encoder. Evaluations demonstrate
significant performance gains of our framework.
Acknowledgements
The completion of this research would not have been possible without
the contribution of several individuals who in one way or another have
provided their valuable guidance and assistance.
My deepest gratitude to my advisor Prof. Farinaz Koushanfar for her
priceless support, encouragement, and guidance. Her dedication, fondness,
and motivation towards doing novel research, as well as exceptional
care for her students, have taught me invaluable lessons and greatly
influenced my life at both academic and personal levels.
Thanks to Prof. Miodrag Potkonjak for sharing his insights and ideas
about new directions in optimizing phase-change memory technology.
I would like to acknowledge all the great people whom I have been fortunate
to work with during my graduate career at Rice University for their
inspiration and support.
Finally, I wish to express profound admiration and gratitude to my
beloved parents for their endless love and continuous support for my
education.
Contents
Abstract
Acknowledgements
1 Introduction
2 Related Work and Background
3 PCM Operation and Energy Model
  3.1 Single-Level PCM
  3.2 Multi-Level PCM
4 Data Coding Problem
  4.1 Coding Overview
  4.2 Problem Formulation, Complexity and Bounds
5 Solving Energy Efficient Coding Problem
  5.1 Optimal Coding via Integer Linear Programming
  5.2 Coding via Dynamic Programming
6 Effect of the Encoding on Memory Wear
7 Multi-Level Cell PCM
8 Data Encoder/Decoder Architecture and Overhead
9 Evaluations
  9.1 ILP Results
  9.2 Performance of DP-based Algorithm on Uniform Data
  9.3 Performance on Audio and Image Data
  9.4 Performance of Stochastic Data Coding
  9.5 MLC Coding
10 Conclusion
References
List of Figures
3.1 (a) The cross section of a conventional PCM memory cell; (b) The
    flowing current pulses' amplitude and duration control the set, reset,
    and read operations.
4.1 Data encoding/decoding module is a part of memory controller in the
    memory hierarchy.
4.2 A 3-bit encoding for the 4 words $W_1$, $W_2$, $W_3$, and $W_4$; $E_S$ and $E_R$ are
    set and reset energies.
5.1 Data-aware alphabet letter codings.
6.1 This plot shows that PCM write endurance improves with data coding.
    Each word of length N (bits) is coded by codes of length N+1. The
    wear efficiencies of 2N-1-bit and 2N-bit codes are equal.
8.1 This plot shows the architecture of the encoder module. The read
    buffer contains the old data from PCM and the write buffer contains
    the new data that is going to overwrite the read data on PCM.
9.1 8-bit system
9.2 16-bit system
9.3 32-bit system
9.4 Cost reduction by data-aware coding.
9.5 Audio data, 32-bit system, $E_R / E_S = 2$.
9.6 Image data, 32-bit system, $E_R / E_S = 2$.
9.7 This plot shows the results of the proposed MLC-PCM coding for energy
    saving. The dashed line shows the (normalized) average required energy
    for writing data compared to the no-coding method. The dotted line
    shows the capacity of the MLC-PCM compared to the no-coding method.
    For example, coding a 30-bit (15-cell) word with 32-bit (16-cell) codes
    results in 0.78 reduction in energy. The capacity is reduced to
    $30/32 = 0.93$ of the full memory capacity.
List of Tables
7.1 Required energy for programming different levels.
7.2 MLC-PCM coding for 2-cell words. The total number of intermediate
    cells (01 and 10s) in the 2-cell words is 16. The total number of
    intermediate cells in the corresponding 3-cell codes is 8.
9.1 Performance compared to the optimal coding for various code sizes.
9.2 Comparing the ASCII data-aware energy cost with the uniform coding.
    The costs show the energy reductions that are normalized to the
    no-coding method.
Chapter 1
Introduction
In the design of digital integrated circuits, memory often significantly impacts the
system's implementation cost, performance, and power dissipation. Presently, there
is an ever increasing performance and energy gap between emerging (multi-)processor
families and memory. For portable devices and embedded systems with constrained
energy sources, minimizing memory energy dissipation is of great importance [1].
Improving and scaling the currently used storage technologies would have limited
effectiveness in the long run, especially as the miniaturized technologies reach the
limits of silicon. For example, for capacitive memories such as DRAM, scaling
beyond 20 nm would likely result in a diminishing capacitor with increased leakage
that is unreliable for holding the charge [2]. The newer resistive memory technologies
enable an alternative or a hybrid solution that could bridge the growing performance
and energy gap between processing and storage.
The data storage mechanism for resistive memories is built upon the large electrical
resistance discrepancy between the states of a phase-change material. In one phase
(state), the material is amorphous and has a very high resistance. In another phase,
the same material is crystalline, which is highly conductive. Extensive research in
the field of phase change storage has demonstrated new materials and non-volatile
memory cell structures with improved performance, integration, endurance, retention,
and yield properties. Phase-change memory is projected to scale to 9 nm [3, 4]. Since
several recent works have adopted the term PCM for Phase-Change Memory,
we use this term in the remainder of the manuscript.
This thesis aims at minimizing the energy cost of rewriting to the PCM by proposing
very low overhead data encoding methods. Our general optimization is easily
integrable within the processor architecture and memory interface with a very low
complexity and overhead. The method is largely transparent and orthogonal to most
other energy saving transformations and methods. Our proposed resistive memory
encoding utilizes bitwise manipulation ability during the word overwrites; only the
bits that are changing for the new word compared to the existing word in the memory
location would require overwriting. Our encoding ensures that the number of
required overwrites is minimized. Our optimizations capture asymmetries in energy
cost of read, set, and reset operations.
A special case of data encoding for minimizing the unidirectional transitions in the
memory is the Write-Once Memory (WOM) coding originally proposed by Rivest and
Shamir [5]. They assumed a memory model where the bits could only be set (and could
not be reset), and the goal was to increase the number of effective cycles for rewriting
to the memory. Subsequent work followed, mostly in information theory
and coding, with the goal of estimating the capacity and finding more efficient WOM
codes. Applications and extensions of this model for improving flash memory
device lifetime were studied [6, 7, 8]. These methods, however, cannot be
directly applied to PCM due to its distinctive energy characteristics. A few
works have focused specifically on improving PCM endurance and energy consumption.
Reducing the number of cell programming operations in data updates, by avoiding
reprogramming redundant bits and by flipping the data when that costs less
programming energy, has been suggested [9, 10]. Our work generalizes the above
approaches by devising codes for minimizing the energy cost of bi-directional bit
transitions for PCM data writes.
The large space of possibilities provided by the freedom in both setting and
resetting transitions, and the possibility to perform bit-level operations, motivate the
development of a new type of codes that can improve the PCM's energy consumption.
The complicating factors for this problem are the new degrees of freedom and the
curse-of-dimensionality resulting from the exponential number of plausible code
combinations. To address the challenge, this thesis presents a novel formal handling of
the energy minimization that is appropriate for resistive memory and other storage
technologies with bit-level operations and simultaneous consideration of the set and
reset transition energy costs. Our contributions are as follows.
• We introduce a formal treatment and formulation of PCM coding, with the goal
  of minimizing the energy. We show that the problem is NP-complete.
• A methodology for deriving the optimal bounds for the minimum-energy data
  encoding problem is developed.
• We devise a new Integer Linear Programming (ILP) formulation that can find
  the optimal solution to the problem. Our ILP framework can integrate both
  symmetric and asymmetric set/reset costs for different code sizes.
• For runtime and efficiency reasons, we develop a new, alternative, rapid and
  efficient algorithm for addressing the problem. The method builds upon the
  smaller optimal codes using Dynamic Programming (DP).
• An efficient distribution-aware data encoding method for non-uniformly
  distributed data is introduced.
• We discuss and analyze how our method reduces the memory wear.
• An architecture for the coding module is proposed and its overhead is discussed.
• We develop and present a new energy efficient coding for Multi-Level Cell PCM
  that incorporates its different structure and energy-related properties compared
  to single-level PCM.
• Evaluation of the proposed encoding methods on a diverse set of data stored
  on PCM is demonstrated. The data includes a diverse set of benchmark image,
  audio and text files.
An earlier version of this work appeared at the 49th Design Automation Conference
(DAC) in San Francisco, CA [11].
Chapter 2
Related Work and Background
Recent advances in resistive memory material and device technology have paved the
way for building PCM devices that are comparable to or better than conventional solid
state memory and DRAM in terms of certain properties. The field has been rapidly
growing in recent years both in research and in terms of industrial prototypes, making
PCM the most viable emerging technology for the next generation of storage devices
[12, 13].
The idea of using the resistance change in phase-change material for storage has
been known for more than forty years now [14]. Historically, the performance of
resistive memories was not on par with the contemporary solid state and DRAM
storage alternatives. During the past 15 years, there has been an unprecedented
growth in this technology driven by its desirable characteristics and extensive research
in the field. A number of recent works have shown significant improvements in memory
performance by integrating PCM within the storage hierarchy [15, 16, 17, 18].
Previous PCM research introduced methods for rewriting to the memory cells
such that the writes to all bits have a uniform distribution. The heavily written
lines are remapped to less frequently utilized locations by the memory management
unit [19]. It has been demonstrated that PCM endurance, reliability, and
energy consumption greatly improve if redundant writes are avoided, i.e.,
by reading the existing contents of the bits and only programming those bits that
must be changed. The method is called Data-Comparison-Write (DCW) [9].
Flip-N-Write (FNW) is a protocol that adds an indicator bit to each word to
determine if the word is inverted or not [10]. The PCM controller can write the data in
an inverted form if doing so requires fewer bit changes. No optimality proof was
provided. Our work formalizes, provides proofs for, and generalizes the Flip-N-Write
method by devising codes of length $N + K$ for words of length $N$, where $K \geq 1$.
Our approach, for the first time in the literature, considers the asymmetric set and
reset energy costs. We will show that significant improvements in energy are achieved
over FNW at the expense of allowing a few extra storage bits. We have also devised
coding for Multi-Level-Cell PCM.
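The DCW and FNW ideas described above can be illustrated with a minimal Python sketch. The function names (`write_cost`, `fnw_encode`, `fnw_decode`), the placement of the indicator bit as the most significant bit, and the unit set/reset energies are assumptions made for illustration; they are not taken from [9] or [10].

```python
def write_cost(old: int, new: int, e_set: float = 1.0, e_reset: float = 1.0) -> float:
    """Energy to turn `old` into `new` when only differing bits are programmed (DCW)."""
    sets = bin(~old & new).count("1")    # bits going 0 -> 1
    resets = bin(old & ~new).count("1")  # bits going 1 -> 0
    return sets * e_set + resets * e_reset


def fnw_encode(stored: int, word: int, n: int) -> int:
    """Pick the cheaper of the plain and inverted (n + 1)-bit representations."""
    mask = (1 << n) - 1
    plain = word & mask                   # indicator bit = 0
    inverted = (1 << n) | (~word & mask)  # indicator bit = 1, data bits flipped
    return min((plain, inverted), key=lambda code: write_cost(stored, code))


def fnw_decode(code: int, n: int) -> int:
    """Recover the n-bit word from an (n + 1)-bit stored code."""
    mask = (1 << n) - 1
    data = code & mask
    return (~data & mask) if (code >> n) & 1 else data


stored = 0b00000                       # a 4-bit word 0000 stored with indicator 0
code = fnw_encode(stored, 0b1110, n=4)
print(format(code, "05b"), write_cost(stored, code))  # 10001 2.0
```

In this instance, writing the word plainly would change three bits, while the inverted form changes only the indicator bit and one data bit.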
Some of the existing digital storage mechanisms, including optical storage,
only allow for one-directional transition of the bits. Write-Once Memory (WOM)
encoding was introduced in a classic paper by Rivest and Shamir [5] to increase the
number of writes to such memories with one-directional bit setting (in an irreversible
fashion). A flurry of subsequent research has centered on improving and generalizing
the WOM codes and on extending their reach to other models. The NAND flash memory
has been modeled as a one-way transitional memory. Thus, generalizations of the
WOM codes have been applied to this class of memories [7, 6, 8].
For the PCM devices, the WOM model and the flash encoding methods do not
correctly capture the specifics of the technology. One reason is that the energy
discrepancy ratio between the set and reset commands on the PCM is much less subtle
when compared to the NAND flash memory devices. The other reason, perhaps even
more important, is the ability to perform bit-level manipulation on the PCM, as
opposed to block-level operations on NAND flash. The bit-level operations for PCM
have been used earlier for error correcting codes [20]. The work in [20] focused on
developing error correction for PCM. Since the faulty bits are rather static, they have
demonstrated that Error Correcting Pointers (ECP) that include the knowledge of the
fault location are much more efficient than classic Error Correcting Codes (ECCs).
Error correction is orthogonal to our energy efficient data encoding method.
Write-Efficient Memory or WEM is an extension of WOM that has been introduced
in [21]. The objective of WEM codes is to minimize the overall number of
transitions, and therefore, its goal is close to our encoding case when the set and
reset have equal costs. However, to the best of our knowledge, the few papers available
on WEM have mainly focused on developing bounds but did not provide an optimality
guarantee, or they centered on constructing suitable error correcting codes, e.g.,
[22, 23]. Aside from the loose bounds and the error correction, we have not been
able to find WEM codes that are applicable to the PCM. Besides, we did not find a
transform or discussion of the problem's NP-completeness in the earlier literature.
Chapter 3
PCM Operation and Energy Model
3.1 Single-Level PCM
A key challenge for non-volatile memory technology, in particular flash, is the high
energy cost of writes [13]. Caches and DRAM are written and read at high speed,
and therefore undergo a higher number of transitions than
external memories. Since resistive memory is suggested for replacing and
complementing various storage units in the memory hierarchy, saving the energy cost
of set and reset transitions is of high value [13, 19].
Figure 3.1: (a) The cross section of a conventional PCM memory cell (metal,
chalcogenide with programmable region, heating element, metal); (b) The flowing
current pulses' amplitude and duration control the set, reset, and read operations
(temperature versus time for the reset, set, and read pulses).
As shown in Figure 3.1(a), the current flows through the phase change material
(chalcogenide) from the electrode/metal to the heater. This current is provided
as a pulse, and its duration and amplitude control the temperature needed for the
set and reset operations. Heating the phase change material above its crystallization
temperature by applying a moderate-current, long-duration pulse results in the
set operation. A very high current (melt-quenching) pulse with a short duration
resets the device to its amorphous state. The read is done by applying a very low
amplitude, low power pulse that senses the device resistance. The shapes of the
three pulses used for set, reset, and read commands are plotted in Figure 3.1(b) (source:
[13]). The energy discrepancy between the PCM set and reset operations has been
experimentally demonstrated and quantified, e.g., [24].
3.2 Multi-Level PCM
The large difference between set and reset resistances in PCM has enabled devising
Multi-Level Cell (MLC) PCM. As opposed to the conventional single-level PCM, the
randomness and variability in the MLC-PCM structure make it impossible to have a
universal pulse shape to attain intermediate resistance levels. Instead, Program and
Verify (P&V) is the technique that is used to obtain different resistance distributions
for PCM [25]. P&V applies partial program pulses iteratively and then verifies if the
desired cell level is achieved. The iterative approach causes MLC PCM to require an
order of magnitude more write energy than single-level PCM.
One of the main challenges in prototyping MLC PCM is the relaxation effect
that induces resistance drift in the phase-change material over time. The drift is
particularly important in MLC-PCM due to the high sensitivity of the cell state level
to the resistance value. Different memory sensing and error correction techniques
have been proposed to develop more robust MLC PCM systems [26, 27, 28]. IBM
announced the implementation of the first drift-tolerant 2-bit cell PCM in 2011 [29].
Chapter 4
Data Coding Problem
4.1 Coding Overview
Our codes bring energy efficiency by reducing the cost of writes. The efficiency is
achieved at the expense of memory overhead; the coded data has a larger length
than the actual data. For a given budget of memory overhead, our algorithm develops
the codes off-line and its complexity does not affect the real-time performance
of the system. The resulting codes from our algorithms are then saved in the memory
controller, which interfaces to the PCM on one side and to processing units on
the other side. Figure 4.1 presents an abstract view of the placement of the data
encoding/decoding module for our method. The details of the architecture of the
encoder/decoder module and its overhead will be discussed in Chapter 8.
Figure 4.1: Data encoding/decoding module is a part of memory controller in the
memory hierarchy (blocks shown: CPU, memory controller with data code/decode,
misc. memory, PCM).
4.2 Problem Formulation, Complexity and Bounds
Our goal is to minimize the energy cost associated with writing words to the memory
with bitwise operability. In this section, each word consists of a fixed number of bits
and the energy cost of writing the word is equal to the total cost of the required bit
flips, i.e., sets/resets.
We provide an optimal encoding scheme that assigns multiple representations (or
codes) to each word in the data set. The objective of encoding is to minimize the
energy cost for writing the next word of data. The method trades off the encoding data
overhead against the resulting energy improvements. We provide a motivational example to
demonstrate the concept more clearly.
4.2.1 A word encoding/decoding Example
In this example, we describe how one may benefit from coding the PCM data. Here
we solve the problem of finding the optimal coding for 2-bit words with 3-bit
codes. We denote the words by $W_1 = (00)$, $W_2 = (01)$, $W_3 = (10)$, $W_4 = (11)$ and denote
the codes corresponding to the word $W_i$ by $Z_{i1}$ and $Z_{i2}$, for $1 \leq i \leq 4$; since $K = 1$,
each word has $2^K = 2$ code representations. The key point is to exploit multiple
representations of each word for minimizing the write energy. For instance, if the
existing data is $Z_{11}$ and $W_2$ is to be written on it, among its representations $Z_{21}$ and
$Z_{22}$, the one that incurs the minimum energy cost to overwrite $Z_{11}$ is selected.
Figure 4.2 shows a graph representation of the encodings for the 2-bit words
shown in separate clusters. The vertices of the graph are the codes and each cluster
represents a word. The graph is a directed graph and the weight of each edge shows
the cost of overwriting one node with the other. The optimal encoding is provided
in the figure.
Figure 4.2: A 3-bit encoding for the 4 words $W_1$, $W_2$, $W_3$, and $W_4$; $E_S$ and $E_R$ are
set and reset energies. Optimal codes for $E_S = E_R$:
    $W_1$: $Z_{11} = 000$, $Z_{12} = 111$
    $W_2$: $Z_{21} = 001$, $Z_{22} = 110$
    $W_3$: $Z_{31} = 010$, $Z_{32} = 101$
    $W_4$: $Z_{41} = 100$, $Z_{42} = 011$
Let $E_S$ denote the energy required for setting a bit (0 to 1) and $E_R$ the energy
required for resetting a bit (1 to 0). If the code $Z_{22}$ is to be overwritten by a code
of $W_3$, $Z_{31}$ is selected because the overwrite $110 \to 010$ changes a single bit and
costs only $E_R$. If no coding was used, overwriting $W_2$ with $W_3$
would cost the higher value of $E_R + E_S$. Another example is a cycle of word overwrites
($W_1$, $W_2$, $W_3$, $W_4$, $W_1$). Assume that $W_1$ is coded as $Z_{11}$. Then, the minimum cost codes
would be selected as follows: ($Z_{11}$, $Z_{21}$, $Z_{32}$, $Z_{41}$, $Z_{11}$). The cost associated with the code
overwrites is $E_S + E_S + E_R + E_R = 2E_S + 2E_R$, whereas the cost of the same
overwrites without coding is $E_S + (E_S + E_R) + E_S + 2E_R = 3E_S + 3E_R$.
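The overwrite cycle in this example can be replayed with a short Python sketch. The codebook is the one listed in Figure 4.2; the helper names (`overwrite_cost`, `coded_cycle`, `plain_cycle`) and the greedy per-write selection of the cheapest representation are illustrative assumptions, with a 0 to 1 transition charged $E_S$ and a 1 to 0 transition charged $E_R$.

```python
CODEBOOK = {
    "W1": ("000", "111"),
    "W2": ("001", "110"),
    "W3": ("010", "101"),
    "W4": ("100", "011"),
}
PLAIN = {"W1": "00", "W2": "01", "W3": "10", "W4": "11"}


def overwrite_cost(old: str, new: str, e_s: float, e_r: float) -> float:
    """Sum E_S for every 0 -> 1 bit and E_R for every 1 -> 0 bit."""
    cost = 0.0
    for a, b in zip(old, new):
        if a == "0" and b == "1":
            cost += e_s
        elif a == "1" and b == "0":
            cost += e_r
    return cost


def coded_cycle(words, start_code, e_s=1.0, e_r=1.0):
    """Greedily pick, for each write, the cheapest code of the next word."""
    current, total, path = start_code, 0.0, [start_code]
    for w in words:
        best = min(CODEBOOK[w], key=lambda z: overwrite_cost(current, z, e_s, e_r))
        total += overwrite_cost(current, best, e_s, e_r)
        current = best
        path.append(best)
    return total, path


def plain_cycle(words, start_word, e_s=1.0, e_r=1.0):
    """Cost of the same sequence of writes on uncoded 2-bit words."""
    current, total = PLAIN[start_word], 0.0
    for w in words:
        total += overwrite_cost(current, PLAIN[w], e_s, e_r)
        current = PLAIN[w]
    return total


sequence = ["W2", "W3", "W4", "W1"]           # writes following the initial W1
print(coded_cycle(sequence, "000"))           # (4.0, ['000', '001', '101', '100', '000'])
print(plain_cycle(sequence, "W1"))            # 6.0
```

With unit energies the coded cycle costs 4 units ($2E_S + 2E_R$) against 6 units ($3E_S + 3E_R$) without coding, in agreement with the totals above.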
We have shown that assigning the best codes to each word is equivalent to clustering
the vertices of a graph where each cluster represents a word (see Section 4.2.1).
Clustering should be done such that it yields the minimum distance between the
vertices of different clusters. We can formally define our problem as follows:
Problem. Minimize the energy cost of PCM rewrites.
Given. The word and the codeword (symbol) lengths in bits, denoted by $N$ and
$N + K$, where $K \geq 1$. Each word is represented by $2^K$ symbols. The read, set, and
reset energies are denoted by $E_{read}$, $E_S$, and $E_R$, respectively.
Objective. Find the best codes for each word so as to minimize the average energy
cost of overwrites. We refer to this problem as $P(N, K)$.
4.2.2 Problem Formulation
We denote the words by $W_1, W_2, \ldots, W_{2^N}$ and denote the codes corresponding to word
$W_l$ by $Z_{li}$, where $1 \leq i \leq 2^K$. Function $\delta$ gives the energy required to overwrite a
currently written symbol by a symbol of the next word that would incur the minimum
energy cost:
$$\delta(Z_{li}, W_{l'}) = \min\{ C(Z_{li}, Z_{l'i'}),\ \forall\, 1 \leq i' \leq 2^K \}. \tag{4.1}$$
The cost function $C$ measures the amount of energy consumed to overwrite a symbol
by another one. To overwrite $Z_{li}$ with $Z_{l'i'}$, if $N_S$ bit sets and $N_R$
bit resets are needed, then $C$ would be:
$$C(Z_{li}, Z_{l'i'}) = (N + K) \cdot E_{read} + N_S \cdot E_S + N_R \cdot E_R. \tag{4.2}$$
In the above equation, the first term shows the energy for reading the bits of the
existing symbol in the memory ($Z_{li}$). This cost is ignored in our work because of its
low value. The next two terms show the energy for the overwrite process (setting and
resetting) so as to get $Z_{l'i'}$. Similar bits in the two symbols remain untouched. The
Objective Function (OF) can be written as follows:
$$\text{OF}: \min \Big\{ C(N, K) = \frac{1}{2^{2N+K}} \sum_{1 \leq l, l' \leq 2^N} \; \sum_{1 \leq i \leq 2^K} \delta(Z_{li}, W_{l'}) \Big\}. \tag{4.3}$$
The challenge is to find the optimal coding of the words that minimizes the OF.
Function $C(N, K)$ represents the average energy cost of code overwrites for all possible
rewrites.
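As a rough illustration of Equations 4.1 to 4.3, the following Python sketch evaluates the objective for a given codebook. Codes are represented as integers, the read energy defaults to zero (it is ignored in this work), and the names `overwrite_cost`, `delta`, and `objective` are hypothetical helpers rather than part of the original formulation.

```python
def overwrite_cost(old, new, n_bits, e_set=1.0, e_reset=1.0, e_read=0.0):
    """Equation 4.2: read cost plus the cost of the required sets and resets."""
    n_sets = bin(~old & new).count("1")    # bits going 0 -> 1
    n_resets = bin(old & ~new).count("1")  # bits going 1 -> 0
    return n_bits * e_read + n_sets * e_set + n_resets * e_reset


def delta(code, target_codes, n_bits, **energy):
    """Equation 4.1: cheapest overwrite of `code` by any code of the target word."""
    return min(overwrite_cost(code, z, n_bits, **energy) for z in target_codes)


def objective(codebook, n, k, **energy):
    """Equation 4.3: average of delta over all (code, target word) combinations.

    `codebook[l]` holds the 2**k integer codes assigned to word l.
    """
    total = 0.0
    for source_codes in codebook:
        for target_codes in codebook:
            for z in source_codes:
                total += delta(z, target_codes, n + k, **energy)
    return total / 2 ** (2 * n + k)


# Codebook of the 2-bit example (Figure 4.2) and the uncoded baseline (K = 0).
example = [[0b000, 0b111], [0b001, 0b110], [0b010, 0b101], [0b100, 0b011]]
uncoded = [[0b00], [0b01], [0b10], [0b11]]
print(objective(example, n=2, k=1))  # 0.75
print(objective(uncoded, n=2, k=0))  # 1.0
```

Under unit set and reset energies, the example codebook lowers the average overwrite cost from 1.0 to 0.75 at the price of one extra bit per word.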
4.2.3 Problem Complexity
We have expressed the energy minimizing coding problem as an instance of a distance-based
graph clustering problem; each cluster corresponds to a word and the nodes that
belong to a cluster are different codes for the cluster's associated word. The goal is to
minimize the inter-cluster distances. In the energy minimizing encoding scenario, the
inter-cluster distance is the average distance between the code symbols in one cluster
and the closest code symbol in every other cluster. Our example demonstrates the
interpretation of the data coding as a graph problem (see Section 4.2.1). Extensive
prior work on distance-based graph clustering has shown that this problem is NP-hard.
The proof was given by a reduction from the set covering problem [30].
4.2.4 Optimal Bounds on the OF
In this part, we provide a lower bound for the OF. The average cost of overwriting
each symbol $Z_{li}$ with the other words is determined by the following formulation:
$\frac{1}{2^N - 1} \sum_{l'} \delta(Z_{li}, W_{l'})$ for $l' \neq l$ and $1 \leq l' \leq 2^N$. An optimal code assignment is the one
that assigns each of the closest $2^N - 1$ symbols to $Z_{li}$ to one of the words $W_{l'} \neq W_l$.
This assignment gives the minimum average overwrite cost of the symbols.
We provide a lower bound for the OF as follows. First, we calculate the distances
from each code $Z_{li}$ to all the other $2^{N+K} - 1$ possible codes. Next, the resulting
distances are sorted and the average of the smallest $2^N - 1$ distances is calculated
for each node. We compare our DP algorithm result with the optimal bound in our
evaluations.
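The bound computation described above can be sketched in Python as follows. The function names and the use of integers for codes are assumptions; the final averaging of the per-code values over all codes is also an assumption about how the per-node quantities are combined into a single number.

```python
def cost(old, new, e_set=1.0, e_reset=1.0):
    """Overwrite cost: E_S per 0 -> 1 bit and E_R per 1 -> 0 bit (read cost ignored)."""
    return e_set * bin(~old & new).count("1") + e_reset * bin(old & ~new).count("1")


def lower_bound(n, k, e_set=1.0, e_reset=1.0):
    """Average, over all codes, of the mean of the 2**n - 1 smallest overwrite costs."""
    all_codes = range(2 ** (n + k))
    per_code = []
    for z in all_codes:
        # Distances from z to every other possible (n + k)-bit code, sorted ascending.
        dists = sorted(cost(z, other, e_set, e_reset) for other in all_codes if other != z)
        nearest = dists[: 2 ** n - 1]
        per_code.append(sum(nearest) / len(nearest))
    return sum(per_code) / len(per_code)


print(lower_bound(2, 1))  # 1.0
```

For $N = 2$, $K = 1$ and unit energies every 3-bit code has exactly three neighbors at distance one, so the bound on the average cost of moving to a different word is 1.0. Note that Equation 4.3 also averages over the zero-cost case of rewriting the same word, which scales this value by $(2^N - 1)/2^N$.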
Chapter 5
Solving Energy Efficient Coding Problem
We propose two different approaches for solving the problem formulated earlier. Our
first solution is based on mapping the problem to an instance of Integer Linear
Programming (ILP). The method finds the optimal coding for any given word/code
width. The approach is discussed in detail in Section 5.1. Due to the complexity
of the ILP approach, which grows exponentially with the size of the coding
problem (i.e., the word and code widths), we introduce another solution based on
the Dynamic Programming (DP) paradigm. The solution is designed for both uniform
and stochastic data and is presented in Section 5.2.
5.1 Optimal Coding via Integer Linear Programming
An ILP problem formulation requires a linear representation of the objective function
and the constraints. To the best of our knowledge, ILP has not been used for
addressing similar coding problems before. The variables in ILP take integer values.
There is a combinatorial complexity associated with assigning values to the variables
of our NP-complete problem. The OF represented in Equation 4.3 is not linear since
the function $\delta(\cdot, \cdot)$ is a distance minimization function. To formulate this OF in a
linear form, we define variables to indicate the distance of each symbol in a cluster
from its closest symbol in every other cluster. The OF is equivalent to the average of
all these variables. Certain linear constraints are applied to ensure the variables meet
the minimum distance criteria. The ILP method finds the optimal solution at the
expense of runtimes that increase exponentially with the code size.
To formulate the OF in a linear form, we define an index variable that, for each
symbol, keeps track of the index of the element (in each of the other clusters) with the
minimum distance to the symbol. The following set of variables is used in our ILP
formulation:
    $l, l'$             Word indices of $W_l$ and $W_{l'}$, for $1 \leq l, l' \leq 2^N$.
    $i, i'$             Code indices within each cluster, $1 \leq i, i' \leq 2^K$.
    $Z_{li}$            The $i$-th code $\in W_l$, for all $i$.
    $\delta_{ll'i}$     $\delta(Z_{li}, W_{l'})$ for all $l$, $l'$ and $i$.
    $w_{ll'ii'}$        $w(Z_{li}, Z_{l'i'})$ for all $l$, $l'$, $i$ and $i'$.
    $\Delta_{ll'ii'}$   $w_{ll'ii'} - \delta_{ll'i}$ for all $l$, $l'$, $i$ and $i'$.
    $X_{lij}$           $j$-th significant bit of $Z_{li}$, for $1 \leq j \leq (N + K)$.
    $F_{ll'ii'j}$       $w(X_{lij}, X_{l'i'j})$ for all $l$, $l'$, $i$, $i'$ and $j$.
    $Id_{ll'ii'}$       A binary indicator; $= 0$ iff $\Delta_{ll'ii'} = 0$,
                        for all $l$, $l'$, $i$ and $i'$.
The codes representing a word $W_l$ are denoted by $Z_{li}$; $\delta_{ll'i}$ denotes the cost of
overwriting $Z_{li}$ by the code in $W_{l'}$ that requires the minimum overwrite energy; $w_{ll'ii'}$
is the cost of overwriting code $Z_{li}$ with code $Z_{l'i'}$. Thus, $\delta_{ll'i} = \min_{i'} w_{ll'ii'}$. Each
code $Z_{li}$ consists of $N + K$ bits and can be written as $(X_{li(N+K)}, \ldots, X_{li2}, X_{li1})$. The
parameter $F_{ll'ii'j}$ is defined to be the cost of overwriting $X_{lij}$ with $X_{l'i'j}$ and its range
of values is shown in the table below. Variable $Id_{ll'ii'}$ is a binary indicator variable
that indicates whether the closest code to $Z_{li}$ in cluster $l'$ is $Z_{l'i'}$ or not.
    $X_{lij}$   $X_{l'i'j}$   $F_{ll'ii'j}$
    0           0             0
    0           1             $E_S$
    1           0             $E_R$
    1           1             0
Using the above variables, we define our OF and provide constraints for our problem
in a way that conforms to the ILP format. Our OF, as written in Equation 4.3,
minimizes the average cost of overwriting the codes for all possible overwrites:
$$\text{OF}: \min \frac{1}{2^N \cdot 2^N \cdot 2^K} \sum_{l, l', i} \delta_{ll'i} \quad \text{for all } l,\ l' \text{ and } i. \tag{5.1}$$
The following constraints define $\delta_{ll'i}$:
C1. $\Delta_{ll'ii'} \geq 0$ for all $l$, $l'$, $i$ and $i'$,
C2. $\sum_{i' \in \{1, \ldots, 2^K\}} Id_{ll'ii'} \leq 2^K - 1$,
C3. $Id_{ll'ii'} \leq \Delta_{ll'ii'}$,
C4. $E_R \cdot (N + K) \cdot Id_{ll'ii'} \geq \Delta_{ll'ii'}$.
Constraints C1 and C2 set $\delta_{ll'i}$ to be not greater than each distance $w_{ll'ii'}$ and equal
to at least one of them, respectively; Constraints C3 and C4 define the indicator
variable based on the fact that $E_R \cdot (N + K)$ is always greater than $\Delta_{ll'ii'}$.
The linear constraints below set $F_{ll'ii'j}$ to the desired value:
C5. $\frac{1}{E_R + E_S} F_{ll'ii'j} + X_{lij} + X_{l'i'j} \leq 2$,
C7. $F_{ll'ii'j} - E_R \cdot X_{lij} - E_S \cdot X_{l'i'j} \leq 0$,
C8. $F_{ll'ii'j} - E_R \cdot X_{lij} + E_R \cdot X_{l'i'j} \geq 0$,
C9. $F_{ll'ii'j} + E_S \cdot X_{lij} - E_S \cdot X_{l'i'j} \geq 0$.
The following constraint defines the distance $w_{ll'ii'}$:
C10. $w_{ll'ii'} = \sum_{1 \leq j \leq N+K} F_{ll'ii'j}$.
The next constraint ensures that no code is assigned to more than one word;
$E_S$ is the minimum cost of overwriting one code with a different code:
C11. $w_{ll'ii'} \geq E_S$.
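A quick way to sanity-check a linearization of the bit-level cost table is to enumerate all bit pairs and confirm that the inequalities admit exactly one value of $F$. The sketch below does this for one possible set of inequalities; the specific signs and directions used here, the integer energies $E_S = 1$, $E_R = 2$, and the function name `feasible_f` are assumptions chosen for illustration and should not be read as the definitive form of the constraints.

```python
def feasible_f(x_old, x_new, e_s, e_r):
    """Return every integer F in [0, E_S + E_R] allowed by the four inequalities."""
    allowed = []
    for f in range(0, e_s + e_r + 1):
        ok = (
            f / (e_s + e_r) + x_old + x_new <= 2      # forces F = 0 when both bits are 1
            and f - e_r * x_old - e_s * x_new <= 0    # upper bound: 0, E_S or E_R
            and f - e_r * x_old + e_r * x_new >= 0    # lower bound E_R on a 1 -> 0 bit
            and f + e_s * x_old - e_s * x_new >= 0    # lower bound E_S on a 0 -> 1 bit
        )
        if ok:
            allowed.append(f)
    return allowed


E_S, E_R = 1, 2  # assumed integer energies
table = {(0, 0): 0, (0, 1): E_S, (1, 0): E_R, (1, 1): 0}
for (x_old, x_new), expected in table.items():
    print((x_old, x_new), feasible_f(x_old, x_new, E_S, E_R), "expected", expected)
```

For each of the four bit combinations the only feasible value is the one listed in the table, which is the property the linearization needs.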
The output of the above ILP is the set of values of $X_{lij}$ that construct the codes $Z_{li}$.
The above constraints are all in linear format and can be readily implemented in
any ILP solver. The complexity and runtime of solving instances of the ILP
for our NP-complete problem increase exponentially with the instance size. In our
experiments, we have been able to find the optimal solution by using a limited version
of an ILP solver licensed to one user for small $N$ and $K$ ($N = 2, 3, 4$; $K = 1, 2$). If one
has access to commercial ILP solvers that run on the cloud or on supercomputers, it
is likely possible to find the optimal codes for practical problems of larger sizes.
The longer runtimes can be tolerated since the ILP needs to be used only once and
off-line.
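For the smallest instances, the optimum can also be cross-checked without an ILP solver. The sketch below is not the ILP formulation itself: it swaps in a plain exhaustive search over all ways of pairing the $2^{N+1}$ codes (the case $K = 1$), which is only practical for very small $N$. The helper names are hypothetical, and the search relies on the observation that the objective of Equation 4.3 does not depend on which word label a cluster receives.

```python
def cost(old, new, e_set=1.0, e_reset=1.0):
    """Overwrite cost: E_S per 0 -> 1 bit and E_R per 1 -> 0 bit (read cost ignored)."""
    return e_set * bin(~old & new).count("1") + e_reset * bin(old & ~new).count("1")


def objective(clusters, n, k, e_set=1.0, e_reset=1.0):
    """Equation 4.3 evaluated for a list of clusters (one tuple of codes per word)."""
    total = sum(
        min(cost(z, t, e_set, e_reset) for t in target)
        for source in clusters
        for z in source
        for target in clusters
    )
    return total / 2 ** (2 * n + k)


def pair_partitions(items):
    """Yield every way to split `items` into unordered pairs."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for idx, partner in enumerate(rest):
        remaining = rest[:idx] + rest[idx + 1:]
        for tail in pair_partitions(remaining):
            yield [(first, partner)] + tail


def best_pair_coding(n, e_set=1.0, e_reset=1.0):
    """Exhaustively search all K = 1 codings of n-bit words."""
    codes = list(range(2 ** (n + 1)))
    return min(pair_partitions(codes), key=lambda c: objective(c, n, 1, e_set, e_reset))


best = best_pair_coding(2)
print(objective(best, 2, 1))  # 0.75
```

For $N = 2$ with equal set and reset energies, the search inspects 105 pairings and finds a minimum of 0.75, the same value attained by the codebook of Figure 4.2.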
5.2 Coding via Dynamic Programming
5.2.1 Coding for Uniform Data
In this subsection, we first show the optimal coding for solving $P(N, 1)$. Next,
we show how to devise the codes for any $P(N, K)$ based on the coding solutions for the
smaller instances of $N$ and $K$.
Coding for $P(N, 1)$:
Claim: Optimal coding of $P(N, 1)$, for any $N \geq 1$, is achieved by assigning
complement pairs to the words.
Proof: The optimal coding nds 2K = 2 symbols, each of size N + 1, for each word.
For now, let us assume that the cost of set and reset is equal. This makes the overwrite
cost proportional to the number of bitwise dierences for the codes, ER = ES = E.
The average transition cost from each code Zli to all the other words satises the
following inequality:
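$$\frac{1}{2^N - 1} \sum_{l'} \delta(Z_{li}, W_{l'}) \;\ge\; 0 \cdot \binom{N+1}{0} + E \cdot \binom{N+1}{1} + \cdots + \frac{N-1}{2} E \cdot \binom{N+1}{\left[\frac{N-1}{2}\right]} + O \cdot \frac{N+1}{2} E \cdot \binom{N+1}{\left[\frac{N+1}{2}\right]}, \quad \text{for } 1 \le l' \le 2^N.$$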
Here O = 1 if N is odd and O = 0 otherwise. The right side of the inequality equals
$E \cdot (N+1) \cdot 2^{N-1}$. The proof of the inequality is as follows. The nearest $2^N$ codes to $Z_{li}$
should contain all the codes that have zero distance from it (that is, $Z_{li}$ itself). The
number of such codes is $\binom{N+1}{0}$. They should also include all the codes that differ from $Z_{li}$ in just
one bit; the number of such codes is $\binom{N+1}{1}$. The next closest set
of codes are the ones that differ from $Z_{li}$ in 2 bits, and so on. We continue
until we reach the closest $2^N$ codes to $Z_{li}$. At that point, the number of bit
differences reaches $\frac{N-1}{2}$ when N is even and $\frac{N+1}{2}$ when N is odd. This is because
the following equation holds:
$$\binom{N+1}{0} + \binom{N+1}{1} + \cdots + \binom{N+1}{\left[\frac{N-1}{2}\right]} + O \binom{N+1}{\left[\frac{N+1}{2}\right]} = 2^N,$$
where O is the same as defined before.
Now, we show that the complement-pair coding assigns all the above $2^N$ codes
to different words. In this case, the average transition cost for each code $Z_{li}$ will be
equal to its optimal value and thus the optimal OF is achieved. The sum of the bitwise
differences of $Z_{li}$ from any complement pair $(Z_{l'1}, Z_{l'2})$ is equal to N + 1. This is
because each bit of $Z_{li}$ is equal to the corresponding bit of exactly one member of the complement pair.
Thus, one symbol of each word has a distance of less than $\frac{N+1}{2}$ bits and the other
symbol has a distance of more than $\frac{N+1}{2}$ bits from $Z_{li}$. This means that all the $2^N - 1$
closest codes to $Z_{li}$ belong to different words.

Algorithm 1. DP-based method for energy-aware coding
Inputs: Word and code lengths: N, N+K; C(N, 1)
and optimal coding for P(N, 1) from Section 5.2.1.
Finding C(n, k) (initialized to infinity for k > 1) and the partitioning index index(n, k, 1:2):
1  for (n=1 to n=N)
2    for (k=1 to k=K)
3      if (k==1)
4        C(n, k) = C(n, 1);
5      else
6        for (i=1 to i=n-1)
7          for (j=1 to j=k-1)
8            if (C(n, k) >= C(n-i, k-j) + C(i, j))
9              C(n, k) = C(n-i, k-j) + C(i, j);
10             index(n, k, 1:2) = (i, j);
Building the codes for P(N, K):
11 for (n=1 to n=N)
12   for (k=1 to k=K)
13     if (k==1)
14       P(n, k) = P(n, 1) from Section 5.2.1;
15     else
16       P(n, k) = all code combinations from
           P(n - index(n, k, 1), k - index(n, k, 2)) and
           P(index(n, k, 1), index(n, k, 2));
Note that our complement-pair results for the K = 1 case also apply to asymmetric
set/reset costs. The number of sets and resets for traversing from a code to its
complement is not symmetric for most of the code words. Recall that our objective is
to minimize the average cost over all possible transitions. It can be readily shown that,
for computing the mean cost, the average inter-complement distance can replace the
two disparate transition costs between the complements. The result of Lemma
1 then directly follows.
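The complement-pair claim can be checked by brute force for small N. The sketch below is illustrative only: it assumes each N-bit word is assigned the symbol "word followed by 0" and its bitwise complement (one possible complement-pair assignment, not necessarily the one used in the thesis), and that a 0-to-1 change is a set and a 1-to-0 change is a reset. The function names are hypothetical. For equal set and reset cost it compares the measured average cost with (N + 1) times the closed form on the right-hand side of Equation 6.1.

```python
from itertools import product
from math import comb


def complement_pair_codes(n):
    """Assign each n-bit word the (n+1)-bit symbol word+0 and its bitwise complement."""
    codes = {}
    for word in product((0, 1), repeat=n):
        symbol = word + (0,)
        codes[word] = (symbol, tuple(1 - b for b in symbol))
    return codes


def overwrite_cost(old, new, e_set=1.0, e_reset=1.0):
    """Cost of turning `old` into `new`; a 0->1 change is assumed to be a set."""
    sets = sum(1 for o, b in zip(old, new) if (o, b) == (0, 1))
    resets = sum(1 for o, b in zip(old, new) if (o, b) == (1, 0))
    return e_set * sets + e_reset * resets


def average_min_cost(n, e_set=1.0, e_reset=1.0):
    """Average cheapest overwrite over every stored symbol and every new word."""
    codes = complement_pair_codes(n)
    stored = [s for pair in codes.values() for s in pair]
    total = sum(
        min(overwrite_cost(old, new, e_set, e_reset) for new in pair)
        for old in stored
        for pair in codes.values()
    )
    return total / (len(stored) * len(codes))


def closed_form(n):
    """(n + 1) times the right-hand side of Eq. 6.1, for equal set and reset cost."""
    return (n + 1) * (0.5 - 0.5 * comb(n, n // 2) / 2 ** n)


for n in range(1, 7):
    print(n, average_min_cost(n), closed_form(n))
```

Passing different `e_set` and `e_reset` values to `average_min_cost` gives the asymmetric-cost average for the same assignment.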
Coding For P(N, K):
We introduce a DP-based algorithm for solving the general P(N, K) problem. Our
algorithm uses the coding results for P(p, q) and P(r, s) to construct the codes for
P(p + r, q + s) such that the following bound is achieved:
$$C(p + r, q + s) = C(p, q) + C(r, s). \qquad (5.2)$$
The code construction is as follows. The word $W_i$ of length p + r is partitioned into
2 words, $W_i^1$ and $W_i^2$. The first word is the first p bits and the second word is the
last r bits of $W_i$. There are $2^q$ (p + q)-bit symbols for $W_i^1$ and $2^s$ (r + s)-bit symbols
for $W_i^2$, obtained from solving P(p, q) and P(r, s) respectively. We construct
the codes for $W_i$ by concatenating all the possible combinations of these two sets of
symbols, which provides a total of $2^q \cdot 2^s = 2^{q+s}$ codes (of length p + q + r + s) for
$W_i$. It can be easily seen that the codes satisfy Equation 5.2. Based on the above
code construction, the DP method breaks N into smaller values and selects the best
partitioning to minimize:
$$C(N, K) = \min_{i \le N} \Big\{ \min_{j \le i} \; C(N - i, K - j) + C(i, j) \Big\}. \qquad (5.3)$$
Algorithm 1 provides the details of the DP method. The optimal coding for P(N, 1)
is given in the previous part, and the algorithm iteratively traverses all the
possible partitions to improve the energy minimization objective (Lines 1-10). The
index vector index(n, k, 1:2) is used to store the optimal partitioning of (n, k). After
finding all the indices, the algorithm builds the codes (Lines 11-16). The complexity
of the algorithm is $O(N^2 K^2)$, but recall that this algorithm is run off-line.
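The cost recursion of Lines 1-10 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the thesis implementation: the base cost C(n, 1) is taken to be (n + 1) times the closed form of Equation 6.1 (equal set and reset cost), unreachable partitions are given infinite cost, and the names `base_cost` and `dp_partition` are hypothetical.

```python
from math import comb, inf


def base_cost(n):
    """Assumed C(n, 1): average bit flips per write, (n + 1) times Eq. 6.1."""
    return (n + 1) * (0.5 - 0.5 * comb(n, n // 2) / 2 ** n)


def dp_partition(N, K):
    """Fill C(n, k) and the best split index(n, k) with the loops of Algorithm 1."""
    C = {}
    index = {}
    for n in range(1, N + 1):
        for k in range(1, K + 1):
            if k == 1:
                C[n, k] = base_cost(n)
                continue
            C[n, k] = inf  # no partition found yet
            for i in range(1, n):
                for j in range(1, k):
                    candidate = C[n - i, k - j] + C[i, j]
                    if candidate < C[n, k]:
                        C[n, k] = candidate
                        index[n, k] = (i, j)
    return C, index


C, index = dp_partition(30, 2)
i, j = index[30, 2]
print("P(30, 2) splits into", (30 - i, 2 - j), "and", (i, j), "with cost", C[30, 2])
```

Under these assumptions the recursion splits P(30, 2) into word lengths 14 and 16 rather than 15 and 15, since the assumed base cost favors even word lengths.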
Prefix   Letter          Remaining N+K bits
0 0 0    e               All 2^(N+K) symbols
0 0 1    t               All 2^(N+K) symbols
0 1 0    a               All 2^(N+K) symbols
0 1 1    o               All 2^(N+K) symbols
1 0 0    i               All 2^(N+K) symbols
1 0 1    n               All 2^(N+K) symbols
1 1 0    s               All 2^(N+K) symbols
1 1 1    other letters   2^K symbols of P(N,K)
Figure 5.1: Data-aware alphabet letter codings.
5.2.2 Coding For Stochastic Data
In Section 4.2, the OF 4.3 minimizes the average energy cost over all the possible word
overwrites. Here, we discuss how the inherent stochastic properties of real data
can be exploited to further improve the memory's energy performance. An
important feature is that different words occur with differing frequencies. To benefit
from this fact, instead of weighting all the rewrite energy costs equally, we aggressively
optimize our encoding for the rewrites that are more prevalent by assigning different
numbers of codes to the words based on their frequency of occurrence.
Variable-length and fixed-length coding are two statistical compression techniques.
In the variable-length method, shorter codes are assigned to the more frequent words
to improve the compression. However, this adds to the decoding complexity and,
since our main goal is to minimize the energy, decoding efficiency is very important.
Thus, we use a fixed-length coding method. We describe our method on text files that
contain English alphabet letters. The method can be generalized to other data sets
with nonuniform frequencies. Our data consists of the lower-case alphabet letters:
W1 = a, W2 = b, ..., W26 = z. Since there are 26 letters, the Wi's are 5-bit words.
Let us consider the 7 most frequent letters, e, t, a, o, i, n and
s. The probability that an overwrite occurs on any of these letters (by any other
letter) plus the probability that these letters overwrite any other letter accounts for
almost 60% of all probable overwrites. Thus, we can benefit a lot by optimizing our
coding for these seven letters. To do so, we assign a different prefix to each of these
letters such that the prefix alone determines the letter. Since there are 7 letters, the
prefixes are 3 bits each and are shown in Figure 5.1. The prefixes can be interpreted
as dictionary indices. The remaining N + K bits of these letters take all the possible
$2^{N+K}$ states. Thus, an overwrite to/by any of these letters requires only adjusting
the prefix, which is of length 3. The other 19 letters have the prefix (111) as shown in
the figure. The remaining N + K bits for the less frequent letters are filled with the
codes obtained by solving P(N, K) as described in Section 5.2.1. Thus, an overwrite
between these letters costs as much as for a regular P(N, K). By this coding, we assign
$2^{N+K}$ symbols to the highly frequent letters and $2^K$ codes to the rest of the letters.
All the symbols are of length prefix-length + N + K.
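The prefix scheme of Figure 5.1 can be illustrated with a small encoder and decoder for N = 5, K = 1 and 3-bit prefixes. The sketch rests on assumptions that are not taken from the thesis: the less frequent letters use the complement pair of their 5-bit alphabet index as their P(5, 1) codes, a 0-to-1 change is a set, and the reset energy is twice the set energy. All function names are hypothetical.

```python
FREQUENT = "etaoins"          # letters with their own prefixes 000..110 (Figure 5.1)
N, K, PREFIX_BITS = 5, 1, 3   # word bits, extra bits, prefix bits
SHARED_PREFIX = (1, 1, 1)     # prefix shared by the remaining 19 letters


def bits(value, width):
    """Big-endian bit tuple of `value`."""
    return tuple((value >> (width - 1 - i)) & 1 for i in range(width))


def candidate_codes(letter):
    """All 9-bit codes that may represent `letter`."""
    if letter in FREQUENT:
        prefix = bits(FREQUENT.index(letter), PREFIX_BITS)
        return [prefix + bits(p, N + K) for p in range(2 ** (N + K))]
    # Assumption: rare letters use the complement pair of their 5-bit index.
    symbol = bits(ord(letter) - ord("a"), N) + (0,)
    return [SHARED_PREFIX + symbol, SHARED_PREFIX + tuple(1 - b for b in symbol)]


def overwrite_cost(old, new, e_set=1.0, e_reset=2.0):
    """Energy of turning `old` into `new`; 0->1 is assumed to be a set."""
    sets = sum(1 for o, b in zip(old, new) if (o, b) == (0, 1))
    resets = sum(1 for o, b in zip(old, new) if (o, b) == (1, 0))
    return e_set * sets + e_reset * resets


def encode(letter, old_code):
    """Pick the cheapest code for `letter` given the code currently stored."""
    return min(candidate_codes(letter), key=lambda c: overwrite_cost(old_code, c))


def decode(code):
    """Recover the letter: the prefix alone identifies a frequent letter."""
    prefix, payload = code[:PREFIX_BITS], code[PREFIX_BITS:]
    if prefix != SHARED_PREFIX:
        return FREQUENT[int("".join(map(str, prefix)), 2)]
    if payload[-1] == 1:  # complemented symbol, undo the flip
        payload = tuple(1 - b for b in payload)
    return chr(ord("a") + int("".join(map(str, payload[:N])), 2))


old = (1, 1, 1, 0, 1, 0, 1, 1, 0)
new = encode("e", old)
print(new, decode(new))
```

Writing a frequent letter leaves the last six bits of the stored code untouched, so only the three prefix bits are reprogrammed.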
Chapter 6
Effect of the Encoding on Memory Wear
In this section, we study our encoding scheme in terms of memory wear. The
write endurance of PCM, although orders of magnitude higher than that of Flash memories,
is still limited and considerably lower than that of DRAM. Wear leveling is a technique that is
widely used to mitigate the limited number of memory write cycles by managing data
writes such that they are distributed uniformly across the memory. Wear leveling is
performed by the memory controller. The encoding scheme can be used along with any
conventional wear leveling technique. After the memory controller decides the address
at which to write the data based on the wear leveling method, the encoding module steps in
and performs the encoding by first reading the memory at those addresses and then
finding the best codes for the data to be written.
The encoding improves the endurance of the memory by reducing the total number
of writes (sets and resets). For example, for P(2, 1), the average number of programmings,
i.e., the average total number of resets and sets, per code write is 0.39, which is equal
to 0.13 per bit (since the codes have 2+1=3 bits). However, the average number of
programmings per word write if no coding is used is 0.5, which is equal to 0.25 per bit
(since the words have 2 bits). In general, the average number of bit flips for rewriting
an (N+1)-bit code with any other existing code is equal to the following:
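$$\frac{1}{N+1} \cdot \frac{\sum_{1 \le k \le \lfloor \frac{N+1}{2} \rfloor} k \binom{N+1}{k}}{2^N} = \frac{(N+1) \sum_{1 \le k \le \lfloor \frac{N+1}{2} \rfloor} \binom{N}{k-1}}{(N+1) \cdot 2^N} = \frac{1}{2} - \frac{1}{2} \cdot \frac{\binom{N}{\lfloor \frac{N}{2} \rfloor}}{2^N}. \qquad (6.1)$$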
The numerator represents the total number of bit flips required to overwrite an arbitrary
code with the closest code (the one that incurs the lowest cost) of each of the other $2^N$ words.
As mentioned in the proof of optimal coding for P(N, 1), our coding is designed such
that for rewriting a code with the closest code of any other word, the number of bit flips
k is in the range $1 \le k \le \frac{N+1}{2}$. For each k, there are $\binom{N+1}{k}$ such closest
codes, each representing a different word. There is a total of $2^N$ such close codes
(one for each word) and each code has N+1 bits. Thus, dividing by the denominator
yields the average number of flips for each individual bit during a code write. For the
N-bit word data, without coding, the average number of bit flips is $\frac{1}{2}$. The proof is
straightforward due to the symmetry in the words.
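The closed form on the right-hand side of Equation 6.1 is easy to tabulate. The sketch below evaluates it and the resulting wear reduction relative to the uncoded average of 1/2 flips per bit. It also evaluates the middle expression of Equation 6.1, restricted to even N, which is the only case checked here; the function names are hypothetical.

```python
from math import comb


def coded_flips_per_bit(n):
    """Right-hand side of Eq. 6.1 for n-bit words stored in (n+1)-bit codes."""
    return 0.5 - 0.5 * comb(n, n // 2) / 2 ** n


def coded_flips_per_bit_sum(n):
    """Middle expression of Eq. 6.1; only checked here for even n."""
    upper = (n + 1) // 2
    total = (n + 1) * sum(comb(n, k - 1) for k in range(1, upper + 1))
    return total / ((n + 1) * 2 ** n)


def wear_reduction(n):
    """Fractional reduction in bit flips relative to the uncoded 1/2 per bit."""
    return 1.0 - coded_flips_per_bit(n) / 0.5


for n in (2, 4, 8, 16, 30):
    print(n + 1, "bit code:", round(100 * wear_reduction(n), 1), "% fewer flips per bit")
```

For 2-bit words in 3-bit codes the closed form gives 0.25 flips per bit, half of the uncoded 0.5.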
[Figure 6.1 plot: wear reduction normalized to DCW (%) versus code length (bits), with separate series for odd and even code lengths.]
Figure 6.1: This plot shows that PCM write endurance improves with data coding.
Each word of length N (bits) is coded by codes of length N+1. The wear efficiencies
of (2N-1)-bit and 2N-bit codes are equal.
Figure 6.1 shows the improved write endurance with coding compared to the DCW
method. The words are of length N with codes of length N + 1. For example, for
a 2-bit word with 3-bit codes, the number of allowed writes per bit increases by 50%.
Note that the wear efficiencies of codes of length 2N - 1 and 2N are equal. However,
the memory overhead of a 2N-bit code is $\frac{1}{2N}$, which is less than that of a (2N-1)-bit
code, which is $\frac{1}{2N-1}$. Thus, it is more efficient to use codes with even lengths for saving
memory capacity. Our DP algorithm takes this property into account.
Chapter 7
Multi-Level Cell PCM
The programming energy properties of MLC-PCM vary from those of single-level cell
PCM. Thus, we provide separate energy encoding optimizations for it. As mentioned earlier,
program-and-verify is used. Table 7.1 shows the average required energies for the different
levels of a 4-level PCM [25, 31].
Table 7.1: Required energy for programming different levels.
Level Energy (pJ)
00 36
01 307
10 547
11 20
Since the required energy for programming the intermediate levels, 01 and 10,
is significantly higher than that of 00 and 11, we encode the data such
that the number of 01 and 10 cells is minimized. Our method assigns (N+1)-cell
((2N+2)-bit) codes to N-cell (2N-bit) data. There are $2^{2N+2}$ different (N+1)-cell data values.
Our coding selects the data values of length N+1 (cells) that have the minimum number of
intermediate levels and uses them to code the words of length N (cells). Each N-cell
word is coded by one (N+1)-cell code. Here, we provide an example for 2-cell words and 3-cell
codes in Table 7.2.
Table 7.2: MLC-PCM coding for 2-cell words. The total number of intermediate cells
(01 and 10s) in the 2-cell words is 16. The total number of intermediate cells in the
corresponding 3-cell codes is 8.
Word Code
00,00 00,00,00
00,01 00,00,11
00,10 00,11,00
00,11 00,11,11
01,00 01,00,00
01,01 01,00,11
01,10 01,11,00
01,11 01,11,11
10,00 10,00,00
10,01 10,00,11
10,10 10,11,00
10,11 10,11,11
11,00 11,00,00
11,01 11,00,11
11,10 11,11,00
11,11 11,11,11
As can be observed in the table, the total number of intermediate levels for the
words is 16, whereas there are only 8 intermediate levels in the corresponding codes.
To observe the energy efficiency of the code, we compare the average energy cost of
writing uniform data to the memory before and after coding. We denote the four
levels 00, 01, 10, and 11 by L1, H1, H2, and L2 respectively. For simplicity and
due to the symmetry of the problem, we consider the write energies of L1 and L2
to be equal to the average of their individual energies, $\frac{36+20}{2} = 28$. Likewise, the H1
and H2 write energies are taken to be $\frac{307+547}{2} = 427$. For uniform data, where
on average all the 16 values are written an equal number of times, the average write
energy of the words is $\frac{8(L1+H1+H2+L2)}{16} = 455$, whereas for coded data this value equals
$\frac{20(L1+L2)+4(H1+H2)}{16} = 283.5$. Thus, on average, the energy is reduced by almost 38%.
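The two averages can be recomputed directly from Table 7.1 and Table 7.2. The sketch below is illustrative: it reproduces the mapping of Table 7.2 by keeping the first cell and spreading the two bits of the second cell over two cells that use only levels 00 and 11 (this reading of the table is an assumption, as are the function names), and it uses the averaged energies of 28 pJ and 427 pJ.

```python
from itertools import product

LOW = (36 + 20) / 2      # averaged energy of levels 00 and 11 (pJ)
HIGH = (307 + 547) / 2   # averaged energy of levels 01 and 10 (pJ)
ENERGY = {"00": LOW, "11": LOW, "01": HIGH, "10": HIGH}
LEVELS = ("00", "01", "10", "11")


def encode(word):
    """Map a 2-cell word to a 3-cell code as in Table 7.2."""
    first, second = word
    return (first, second[0] * 2, second[1] * 2)


def write_energy(cells):
    """Energy of programming every cell of a word or code."""
    return sum(ENERGY[c] for c in cells)


def intermediate_cells(cells):
    """Number of cells programmed to level 01 or 10."""
    return sum(1 for c in cells if c in ("01", "10"))


words = list(product(LEVELS, repeat=2))
codes = [encode(w) for w in words]

avg_word = sum(write_energy(w) for w in words) / len(words)
avg_code = sum(write_energy(c) for c in codes) / len(codes)
print(avg_word, avg_code, 1 - avg_code / avg_word)
```

The script also lets one count intermediate cells with `intermediate_cells`, which gives 16 for the words and 8 for the codes.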
In the above example, despite the significant energy saving, the 33% reduction
in memory capacity (3-cell codes for 2-cell data) is not desirable. To address this
problem, we propose coding N-cell data with (N+1)-cell codes for larger N values. In
this case, the memory capacity is reduced by a factor of $\frac{1}{N+1}$. The coding uses the
same technique as the example and reduces the number of intermediate levels to save
energy. We begin by assigning all the codes with zero intermediate levels to the
words, then we assign all the codes with one intermediate level to the words, and so
on until all the words have a code. Here we calculate how many (N+1)-cell codes with
a given number of intermediate levels, say $m \ge 0$, are available. The answer is equal
to the number of (N+1)-character data values with exactly m characters from {H1, H2} and
N + 1 - m characters from {L1, L2}. From combinatorics, this number is equal to the
following:
$$\binom{N+1}{m} \cdot 2^{N+1} \quad \text{for } 0 \le m \le N+1. \qquad (7.1)$$
To have the best code for all the N-cell words, we find the minimum m such that
all the $2^{2N}$ words are covered with codes that have at most m intermediate levels, i.e.,
the minimum m such that the following inequality holds:
$$\sum_{0 \le m' \le m} \binom{N+1}{m'} \cdot 2^{N+1} \ge 2^{2N}. \qquad (7.2)$$
We denote the answer by $m_{MIN}$. Given the answer, the average energy for a code write
for uniform data is $\frac{1}{2^{2N}} \sum_{0 \le m \le m_{MIN}} \binom{N+1}{m} \cdot 2^{N+1} \cdot (m \cdot 427 + (N + 1 - m) \cdot 28)$.
The average write energy for the corresponding non-coded words is $\frac{N}{2}(427 + 28)$; the proof
is straightforward due to the symmetry.
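Equations 7.1 and 7.2 translate directly into a short search for $m_{MIN}$. The sketch below is a hedged illustration with hypothetical function names. One deliberate difference from the expression in the text: when the last level holds more codes than are still needed, the sketch takes only the number of codes required to cover the $2^{2N}$ words, which is an assumption about how the final level is used.

```python
from math import comb

LOW, HIGH = 28.0, 427.0  # averaged energies (pJ) of low and intermediate levels


def min_intermediate_levels(n):
    """Smallest m satisfying Eq. 7.2 for n-cell words and (n+1)-cell codes."""
    covered = 0
    for m in range(n + 2):
        covered += comb(n + 1, m) * 2 ** (n + 1)   # Eq. 7.1
        if covered >= 2 ** (2 * n):
            return m
    raise ValueError("codes cannot cover all words")


def average_code_energy(n):
    """Average write energy when codes are assigned from fewest intermediate levels up."""
    remaining = 2 ** (2 * n)
    total = 0.0
    for m in range(n + 2):
        take = min(comb(n + 1, m) * 2 ** (n + 1), remaining)
        total += take * (m * HIGH + (n + 1 - m) * LOW)
        remaining -= take
        if remaining == 0:
            break
    return total / 2 ** (2 * n)


def average_word_energy(n):
    """Average write energy of uncoded n-cell words: n/2 * (427 + 28)."""
    return n / 2 * (HIGH + LOW)


for n in (2, 3, 7, 15, 31):
    ratio = average_code_energy(n) / average_word_energy(n)
    print(n + 1, "cell codes: m_MIN =", min_intermediate_levels(n), "energy ratio", round(ratio, 3))
```

For N = 2 this reproduces the 3-cell example of Table 7.2, and for N = 15 the energy ratio lands between 0.7 and 0.8 under the stated assumption.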
Chapter 8
Data Encoder/Decoder Architecture and Overhead
Figure 8.1 shows the architecture of the encoding unit. The read buffer contains the
data from the current address and the write buffer contains the new data that is
not yet coded. A lookup table is employed for storing the matching codes; given the
data in the write buffer and the data in the read buffer, the lookup table finds the code
that incurs the minimum energy for the overwrite. Finding the best code
in the lookup table causes very low energy overhead due to the low read energy
of PCM. Also, since PCM is non-volatile and its leakage power consumption is very
small, there is negligible standby power for storing the lookup table [15]. The read
latency of PCM is also very low and comparable to that of DRAM [15, 32].
One method to store the lookup table is to store all possible combinations of $2^N$ words and $2^{N+K}$
codes, which means a lookup table of size $2^{N+N+K}$. For example, such a
lookup table amounts to a $2^{15}$ = 32KB table for 8-bit codes (with N = 7 and K = 1), which
is a low memory overhead considering the large memory capacity of today's computer
systems (GB). However, as the size of the codes grows, the size of the lookup table
increases exponentially. To make the lookup table design feasible for large codes,
we break the table into sub-lookup tables, each containing part of the efficient codes.
In our dynamic programming method, each coding problem P(N, K) is solved by
concatenating the codes from the optimal sub-problems P(N1, K1) and P(N2, K2),
where N1 + N2 = N and K1 + K2 = K. For example, P(28, 4) = P(14, 2) + P(14, 2),
which can be further broken down using P(14, 2) = P(6, 1) + P(8, 1). Thus, for
storing the codes for the original problem, we only need to break the 28-bit word into
sub-words of length 6 and 8 and store the codes for the corresponding $2^{6+7} = 2^{13}$ and
$2^{8+9} = 2^{17}$ sub-words. The result is two lookup tables of sizes 8KB and 128KB.
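The table sizes quoted above follow from counting index bits. The short sketch below counts lookup-table entries for a monolithic table and for the split tables; it assumes one entry per (new word, stored code) combination, and the function names are hypothetical.

```python
def lut_entries(n, k):
    """Entries of a table indexed by an n-bit new word and an (n+k)-bit stored code."""
    return 2 ** (n + n + k)


def split_lut_entries(parts):
    """Entries of each distinct sub-table for a list of (n, k) sub-problems."""
    return {part: lut_entries(*part) for part in set(parts)}


# P(7, 1): 8-bit codes, a single table.
print(lut_entries(7, 1))

# P(28, 4) = 2 x (P(6, 1) + P(8, 1)): only two distinct sub-tables are needed.
print(split_lut_entries([(6, 1), (8, 1), (6, 1), (8, 1)]))

# The monolithic P(28, 4) table, for comparison.
print(lut_entries(28, 4))
```

Because the sub-problems P(6, 1) and P(8, 1) repeat, the same two sub-tables can serve both halves of the 28-bit word.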
[Figure 8.1 diagram labels: Write Buffer (N-bit); Read Buffer (N+K-bit); LUT_1, LUT_2, ..., LUT_K; port widths N1, N2, ..., NK and N1+1, N2+1, ..., NK+1; PROGRAM/SET/RESET Enable.]
Figure 8.1: This plot shows the architecture of the encoder module. The read buffer
contains the old data from PCM and the write buffer contains the new data that is
going to overwrite the read data on PCM.
The Read-Before-Write technique has already been implemented and shown to be
promising in designs that replace the SRAM cache with a PCM cache. The higher density
of PCM allows replacing the SRAM cache with a larger capacity cache. Read and
write buffers can be used to compensate for the lower speed of PCM compared to
SRAM [33].
The main overhead of our coding system is the memory overhead, since in
P(N, K), K extra bits are used to represent an N-bit word, incurring an overhead of
$\frac{K}{K+N}$. In our evaluations, we study the effect of different memory overheads on
the system performance.
Chapter 9
Evaluations
We perform system-level evaluations of our methods on a variety of real-world data
sets. We examine the effect of different word widths, different set and reset energies, and the
memory and delay overheads on the efficiency of our method.
9.1 ILP Results
We used the latest version of the Gurobi ILP solver, Gurobi 4.5.2 [34], to solve the ILP
described in Section 5.1. Gurobi provides free access for academic purposes. The
runtime of the solver for P(4, 2) is about 30 hours on a computer with an Intel
dual-core 2.80GHz processor and 4GB of RAM. Thus, due to the time constraint, we
were not able to solve the objective function for larger problems. The Python ILP
code is available upon request to interested readers.
[Figure 9.1 plot: improvement over DCW (%) versus memory overhead (%), for our method and FNW.]
Figure 9.1: 8-bit system
[Figure 9.2 plot: improvement over DCW (%) versus memory overhead (%), for our method and FNW.]
Figure 9.2: 16-bit system
[Figure 9.3 plot: improvement over DCW (%) versus memory overhead (%), for our method and FNW.]
Figure 9.3: 32-bit system
9.2 Performance of DP-based Algorithm on Uniform Data
We analyze the DP-based encoding method of Algorithm 1 for different memory
overheads. We compare our results with the Data-Comparison-Write (DCW) and
Flip-N-Write (FNW) algorithms [9, 10] described in Section 2.
Here, we show the average efficiencies for uniform data, where all the word writes
occur with the same frequency. Our metric is the average (per write) energy over all
possible word overwrite combinations.
Figures 9.1, 9.2, and 9.3 show the energy improvements of our method and the
FNW method over the conventional DCW method for 8-bit, 16-bit and 32-bit systems
respectively. The line shows the results of our method and the black circles
show the results of FNW. For example, a 25% memory overhead in a 32-bit system
means that a 32-bit code represents 24-bit data. The results for FNW are sparser
since FNW can only take memory overheads of the form $\frac{1}{N+1}$. Thus, in a 32-bit system,
FNW only accepts data-overhead sizes of 31-1, 30-2 (equivalently 15-1), 28-4 (equivalently 7-1), and 24-8 (equivalently 3-1).
It can be seen that our method performs up to 15% better than FNW. There
are two main reasons for the better performance. The first reason is the ability of
our method to accept different overheads. For example, for solving P(30, 2) the DP
algorithm breaks it into P(14, 1) + P(16, 1), as opposed to the FNW approach, which
uses 2P(15, 1). The former combination, as we discussed in Chapter 6, delivers better
efficiency. Note that as the memory overhead increases, our method becomes more
efficient, whereas in the FNW method, P(31, 1) with a memory overhead of 3.13% is
slightly more efficient than P(30, 2) with a memory overhead of 6.25%.
The second reason is our focus on energy-efficient selection of the codes to overwrite.
FNW always flips the data if the number of bit flips required to write the
original data is more than half of the word's size. However, considering the asymmetric
cost of set and reset, we do not count the number of bit flips; instead, we use
the total energy of sets and resets as our metric to choose a code. For example,
assume that the new word 00001111 is to overwrite 00000000. FNW writes the
new word as it is, whereas our method chooses its complement 11110000 to overwrite
00000000. The difference in energy is $4 \times E_R$ for FNW versus $4 \times E_S$ in our
method; if the ratio of the reset to set energy ($\frac{E_R}{E_S}$) is 2, then our code requires half
the energy.
9.2.1 Performance comparison with respect to the optimal coding
We have compared the performance of our DP-based algorithm with the optimal
bound (as presented in Section 4.2.4) and provide the results in the following table.
The results show the average energy cost over all possible data overwrites (see above)
with $\frac{E_R}{E_S} = 2$. According to the table, the performance gap increases with the
memory overhead; however, the DP algorithm results are still close, and in
some cases equal, to the optimal bound.
Table 9.1: Performance compared to the optimal coding for various code sizes.
Problem size             P(4, 1)  P(4, 2)  P(8, 1)  P(8, 2)  P(8, 3)  P(8, 4)
Optimal cost / DP cost   1        1        1        0.98     0.93     0.90
[Figure 9.4 plot: normalized energy cost versus memory overhead (%), with one curve each for ER=ES, ER=2ES, ER=3ES, and ER=4ES.]
Figure 9.4: Cost reduction by data-aware coding.
9.2.2 Effect of the asymmetric set/reset energy ratios
Here we look at the effect of different $\frac{E_R}{E_S}$ ratios on the efficiency of our method.
Figure 9.4 shows the normalized energy costs for various memory overheads. As the
ratio increases, more energy savings are achieved; for example, for a memory overhead of
30%, if the costs of set and reset are equal, the efficiency is % less for P(31, 1) with $\frac{E_R}{E_S} = 1$
than for P(31, 1) with $\frac{E_R}{E_S} = 4$. This is because our coding scheme aims to optimize the
energy consumption by minimizing the number of overwrites. Since resets have a
higher energy cost, the minimization impact will be higher for them.
We always consider memory overheads of up to 33.34% (or equivalently K up
to $\frac{N}{2}$) since no extra efficiency is achieved for $K > \frac{N}{2}$. The reason is that our DP
algorithm breaks the P(N, K) problem for $K > \frac{N}{2}$ into at least one P(1, 1), and it is
straightforward to see that coding 1-bit words with 2-bit codes does not provide any
efficiency gain and only incurs memory overhead. Thus, by setting the limit on K we avoid
such overheads.
9.3 Performance on Audio and Image Data
[Figure 9.5 plot: normalized energy costs (%) versus memory overhead (%), for our method and FNW.]
Figure 9.5: Audio data, 32-bit system, $\frac{E_R}{E_S} = 2$.
We use the encoding method for storage of audio and image data on PCM. Our
benchmark data were taken from the Columbia University audio and Caltech Vision image
databases, [35] and [36] respectively. Four audio and four image files are selected. The
audio files are msmn1.wav, msmv1.wav, mssp1.wav, and msms1.wav and are denoted
by a1, a2, a3 and a4, and the image files are dcp-2897.jpg, dcp-2898.jpg,
dcp-2899.jpg and dcp-2830.jpg.
[Figure 9.6 plot: normalized energy costs (%) versus memory overhead (%), for our method and FNW.]
Figure 9.6: Image data, 32-bit system, $\frac{E_R}{E_S} = 2$.
We show the normalized average energy cost of overwriting all the audio files in
Figure 9.5 and all the image files in Figure 9.6. There are 12 possible overwrites for each
file type. The costs are shown for 32-bit codings with various memory overheads.
The PCM has the following properties: $E_S = 13.733$ pJ/bit and $E_R = 26.808$ pJ;
the values are for a 32-nm PCM from [17]. The costs of applying
our coding method and the FNW method are presented. The costs are normalized to the
DCW method's cost. For some memory capacities, FNW cannot be applied, and
in such cases its cost is set equal to the DCW cost, i.e., 100%.
In both figures, our method outperforms the FNW method. The gap between
the performances becomes wider as the memory overhead increases. Our method
outperforms the FNW method by up to 14% and 16%.
9.4 Performance of Stochastic Data Coding
Here we first provide evaluation results for the English alphabet coding described in
Section 5.2.2. Then, we provide coding and evaluations for the ASCII characters. We
used two text benchmarks: the 31 MB text8.txt file from [37] for the alphabet (excluding
spaces) evaluations, and the 4.8 MB KJV.txt file from [38] for the ASCII evaluations.
Our evaluations are based on the assumption that the overwrites are independent events;
for example, the probability that letter a is overwritten by letter b, which we
denote by p(a, b), is equal to $p(a) \cdot p(b)$, where p(a) and p(b) are the normalized frequencies
of the corresponding letters. We experimentally verified the above assumption by
randomly selecting 100 vector pairs for overwriting, each of size 100000, from the text
benchmark text8.txt. We formed a table of normalized frequencies for all combinations
of vector pair rewrites and observed that the resulting numbers comply with our
independence assumption.
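A check of this kind can be sketched as follows. This is not the script used for the evaluation; it is a minimal illustration, with a hypothetical function name, of comparing the empirical joint frequency of each (old letter, new letter) pair against the product of the marginal frequencies for one pair of equal-length vectors.

```python
from collections import Counter


def independence_gap(old, new):
    """Largest |p(a, b) - p(a) * p(b)| over all observed old/new letters."""
    n = min(len(old), len(new))
    old, new = old[:n], new[:n]
    joint = Counter(zip(old, new))   # how often a is overwritten by b
    p_old = Counter(old)
    p_new = Counter(new)
    return max(
        abs(joint[a, b] / n - (p_old[a] / n) * (p_new[b] / n))
        for a in p_old
        for b in p_new
    )


# Every (old, new) pair occurs equally often: consistent with independence.
print(independence_gap("abab", "aabb"))

# The new vector repeats the old one: strongly dependent.
print(independence_gap("abab", "abab"))
```

A gap close to zero for vectors sampled from the benchmark would be consistent with treating p(a, b) as $p(a) \cdot p(b)$.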
9.4.1 Alphabet Letters
We encoded the alphabet letters with the distribution-aware encoding. Since there
are 26 alphabet letters, N = 5; we set K = 1 and Prefix = 3. The codes are of length
Prefix + N + K = 9. We evaluated the method on the text8.txt data for different test
trials. For each trial, we created 100 pairs of vectors by randomly reading the data
from the text file. Each vector has 1000 letters. We overwrote the vectors of each pair
and computed the average overwrite cost for $\frac{E_R}{E_S} = 2$. The results demonstrate an
average 44.1% reduction when compared to the no-coding scheme and a 9.3% reduction
compared to the uniform coding P(5, 2).
9.4.2 ASCII Characters
According to the frequencies of ASCII characters from [39], 59% of all the possible
rewrites are to/by one of the 15 most frequent characters out of the total of 127
characters. Thus, we optimize our coding for these characters by assigning separate
prefixes to them.
The 15 most frequent characters are: space, e, t, a, o, i, n, s, h, r, d, l, u,
m, c. We assigned the following 4-bit prefixes to them respectively: (0000), (0001),
(0010), (0100), (1000), (1001), (1010), (0110), (0111), (1011), (1101). The prefix for
all the other characters is (1111). Since there are $2^7$ ASCII characters, N = 7 and we
set K = 1. Thus, the codes are of length 4 + N + K = 12. The encoding method
is the same as described for the alphabet letters.
We evaluated the ASCII coding scheme on the KJV.txt file. We created 100
pairs of vectors, each of length 1000, from the file. The first vector in each pair
was overwritten by the second vector. We considered $\frac{E_R}{E_S} = 2$. To compare this
method with the uniform coding, we encoded the ASCII characters with the codes
from P(7, 1), P(7, 2) and P(7, 3) and report the corresponding average costs in the
following table.
Table 9.2: Comparing the ASCII data-aware energy cost with the uniform coding.
The costs are normalized to the no-coding method.
Encoding                      Data-aware  P(7, 1)  P(7, 2)  P(7, 3)
Average normalized cost (%)   81.6        94.6     91.3     89.3
We see that the ASCII data-aware coding, on average, reduces the energy cost
more than the best result achieved with P(7, 3); for overwriting each ASCII character, there
is almost 8% more reduction in the energy cost compared to the results of the
uniform encoding. This improvement is at the expense of two extra bits per character.
9.5 MLC Coding
[Figure 9.7 plot: energy normalized to the no-coding scheme (%), dashed line, and capacity normalized to the no-coding scheme (%), dotted line, versus code length in bits.]
Figure 9.7: This plot shows the results of the proposed MLC-PCM coding for energy
saving. The dashed line shows the (normalized) average required energy for writing
data compared to the no-coding method. The dotted line shows the capacity of the
MLC-PCM compared to the no-coding method. For example, coding a 30-bit (15-cell)
word with 32-bit (16-cell) codes reduces the energy to 0.78 of the no-coding energy. The capacity is
reduced to $\frac{30}{32} = 0.93$ of the full memory capacity.
Figure 9.7 shows the average energy reduction of the coded data
for different code widths (N) compared to the non-coded data. Each N-cell word is
coded with (N+1)-cell codes. Thus the overhead for such codes is $\frac{1}{N+1}$. The figure also
shows the memory capacity usage of the coded data. It can be seen that there is a tradeoff
between the energy reduction and the capacity usage. However, the reductions are
still significant for small capacity losses. For example, for 16-cell codes, the write
energy is reduced to 78% of the no-coding energy while 93% of the memory capacity is being used.
Chapter 10
Conclusion
We proposed a novel data coding methodology for minimizing the energy consumption
and wear effect of PCM writes. Our method creates several alternative codes
for each word in the memory, trading off performance against memory capacity and
encoding overhead. The new words to be written to the memory are encoded such
that they incur the minimum cost when overwriting the existing words of the memory.
To address the coding problem, we developed (i) an ILP-based solution with
a high combinatorial complexity that finds the codes optimally; (ii) a Dynamic
Programming-based approach that combines smaller optimal codewords to find
near-optimal codes; and (iii) an independent coding approach for Multi-Level Cell
PCM that reduces the number of costly intermediate level transitions to improve the
performance. For cases where the distributions of the data are known a priori, we
created a new data-aware algorithm that incorporates that information for further
optimizations. A low-overhead architecture for our encoder module was proposed.
Evaluations on a diverse set of text, image, and audio benchmark data demonstrated
the applicability and effectiveness of our new methods. It was shown that allowing
extra memory overhead results in significantly better reductions in memory energy
and wear.
References
[1] R. Balasubramonian, D. Albonesi, A. Buyuktosunoglu, and S. Dwarkadas,
\Memory hierarchy reconguration for energy and performance in general-
purpose processor architectures," in International Symposium on Microarchi-
tecture (MICRO), 2000, pp. 245{257. 1
[2] B. Lee, P. Zhou, J. Yang, Y. Zhang, B. Zhao, E. Ipek, O. Mutlu, and D. Burger,
\Phase-change technology and the future of main memory,"Micro, IEEE, vol. 30,
no. 1, p. 143, jan.-feb. 2010. 1
[3] S. Raoux, G. W. Burr, M. J. Breitwisch, C. T. Rettner, Y.-C. Chen, R. M. Shelby,
M. Salinga, D. Krebs, S.-H. Chen, H.-L. Lung, and C. H. Lam, \Phase-change
random access memory: A scalable technology," IBM Journal of Research and
Development, vol. 52, no. 4.5, pp. 465 {479, july 2008. 1
[4] I. Kim, S. Cho, D. Im, E. Cho, D. Kim, G. Oh, D. Ahn, S. Park, S. Nam,
J. Moon, and C. Chung, \High performance pram cell scalable to sub-20nm
technology with below 4f2 cell size, extendable to dram applications," in VLSI
Technology (VLSIT), 2010 Symposium on, june 2010, pp. 203 {204. 1
[5] R. L. Rivest and A. Shamir, \How to reuse a write - once memory (preliminary
version)," in Symposium on Theory of computing (STOC), 1982, pp. 105{113. 1,
2
[6] A. Jiang, M. Langberg, M. Schwartz, and J. Bruck, "Universal rewriting in
constrained memories," in International Symposium on Information Theory (ISIT),
2009, pp. 1219-1223. 1, 2
[7] H. Mahdavifar, P. Siegel, A. Vardy, J. Wolf, and E. Yaakobi, "A nearly optimal
construction of flash codes," in International Symposium on Information Theory
(ISIT), 2009, pp. 1239-1243. 1, 2
[8] Y. Wu and A. Jiang, "Position modulation code for rewriting write-once
memories," IEEE Transactions on Information Theory, vol. 57, no. 6, pp. 3692-3697,
June 2011. 1, 2
[9] B.-D. Yang, J.-E. Lee, J.-S. Kim, J. Cho, S.-Y. Lee, and B.-G. Yu, "A low power
phase-change random access memory using a data-comparison write scheme,"
in IEEE International Symposium on Circuits and Systems (ISCAS), 2007, pp.
3014-3017. 1, 2, 9.2
[10] S. Cho and H. Lee, "Flip-N-Write: a simple deterministic technique to improve
PRAM write performance, energy and endurance," in International Symposium
on Microarchitecture (MICRO), 2009, pp. 347-357. 1, 2, 9.2
[11] A. Mirhoseini, M. Potkonjak, and F. Koushanfar, "Coding-based energy
minimization for phase change memory," in Design Automation Conference (DAC),
2012, pp. -. 1
[12] S. Lai, "Current status of the phase change memory and its future," in
International Electron Devices Meeting (IEDM), 2003, pp. 10.1.1-10.1.4. 2
[13] H. Wong, S. Raoux, S. Kim, J. Liang, J. Reifenberg, B. Rajendran, M. Asheghi,
and K. Goodson, "Phase change memory," Proceedings of the IEEE, vol. 98,
no. 12, pp. 2201-2227, 2010. 2, 3.1, 3.1
[14] C. Sie, "Memory devices using bistable resistivity in amorphous As-Te-Ge
films," PhD dissertation, Proquest/UMI publication 69-20670, Iowa State
University, January 1969. 2
[15] G. Dhiman, R. Ayoub, and T. Rosing, "PDRAM: A hybrid PRAM and DRAM
main memory system," in Design Automation Conference, 2009, pp. 664-669.
2, 8
[16] B. C. Lee, E. Ipek, O. Mutlu, and D. Burger, "Architecting phase change
memory as a scalable DRAM alternative," in International Symposium on Computer
Architecture (ISCA), 2009, pp. 2-13. 2
[17] P. Zhou, B. Zhao, J. Yang, and Y. Zhang, "A durable and energy efficient main
memory using phase change memory technology," in International Symposium
on Computer Architecture (ISCA), 2009, pp. 14-23. 2, 9.3
[18] X. Wu, J. Li, L. Zhang, E. Speight, R. Rajamony, and Y. Xie, "Design
exploration of hybrid caches with disparate memory technologies," ACM Transactions
on Architecture and Code Optimization, vol. 7, December 2010. 2
[19] M. K. Qureshi, J. Karidis, M. Franceschini, V. Srinivasan, L. Lastras, and
B. Abali, "Enhancing lifetime and security of PCM-based main memory with
start-gap wear leveling," in International Symposium on Microarchitecture
(MICRO-42), 2009, pp. 14-23. 2, 3.1
[20] S. Schechter, G. H. Loh, K. Straus, and D. Burger, "Use ECP, not ECC, for
hard failures in resistive memories," in International Symposium on Computer
Architecture (ISCA), 2010, pp. 141-152. 2
[21] R. Ahlswede and Z. Zhang, "Coding for write-efficient memory," Information
and Computation, vol. 83, no. 1, pp. 80-97, October 1989. 2
[22] F.-W. Fu and R. Yeung, "On the capacity and error-correcting codes of
write-efficient memories," IEEE Transactions on Information Theory, vol. 46, no. 7,
pp. 2299-2314, November 2000. 2
[23] T. Mittelholzer, L. Lastras-Montaño, M. Sharma, and M. Franceschini,
"Rewritable storage channels with limited number of rewrite iterations," in
International Symposium on Information Theory (ISIT), 2010, pp. 973-977. 2
[24] F. Bedeschi, R. Bez, C. Bono, E. Bonizzoni, E. Buda, G. Casagrande, L. Costa,
M. Ferraro, R. Gastaldi, O. Khouri, F. Ottogalli, F. Pellizzer, A. Pirovano,
C. Resta, G. Torelli, and M. Tosi, "4-Mb MOSFET-selected trench phase-change
memory experimental chip," IEEE Journal of Solid-State Circuits,
vol. 40, no. 7, pp. 1557-1565, July 2005. 3.1
[25] F. Bedeschi, R. Fackenthal, C. Resta, E. Donze, M. Jagasivamani, E. Buda,
F. Pellizzer, D. Chow, A. Cabrini, G. Calvi, R. Faravelli, A. Fantini, G. Torelli,
D. Mills, R. Gastaldi, and G. Casagrande, "A bipolar-selected phase change
memory featuring multi-level cell storage," IEEE Journal of Solid-State
Circuits, vol. 44, no. 1, pp. 217-227, Jan. 2009. 3.2, 7
[26] W. Xu and T. Zhang, "Using time-aware memory sensing to address resistance
drift issue in multi-level phase change memory," in 2010 11th International
Symposium on Quality Electronic Design, March 2010, pp. 356-361. 3.2
[27] S. Braga, A. Sanasi, A. Cabrini, and G. Torelli, "Voltage-driven partial-reset
multilevel programming in phase-change memories," IEEE Transactions on
Electron Devices, vol. 57, no. 10, pp. 2556-2563, Oct. 2010. 3.2
[28] M. Joshi, W. Zhang, and T. Li, "Mercury: A fast and energy-efficient
multi-level cell based phase change memory system," in 2011 IEEE 17th
International Symposium on High Performance Computer Architecture (HPCA),
Feb. 2011, pp. 345-356. 3.2
[29] N. Papandreou, H. Pozidis, T. Mittelholzer, G. Close, M. Breitwisch, C. Lam,
and E. Eleftheriou, "Drift-tolerant multilevel phase-change memory," in 2011
3rd IEEE International Memory Workshop (IMW), May 2011, pp. 1-4. 3.2
[30] "Clustering to minimize the maximum intercluster distance," Theoretical
Computer Science, vol. 38, no. 0, pp. 293-306, 1985. 4.2.3
[31] J. Wang, X. Dong, G. Sun, D. Niu, and Y. Xie, "Energy-efficient multi-level cell
phase-change memory system with data encoding," in 2011 IEEE 29th International
Conference on Computer Design (ICCD), Oct. 2011, pp. 175-182. 7
[32] K.-J. Lee, B.-H. Cho, W.-Y. Cho, S. Kang, B.-G. Choi, H.-R. Oh, C.-S. Lee,
H.-J. Kim, J.-M. Park, Q. Wang, M.-H. Park, Y.-H. Ro, J.-Y. Choi, K.-S. Kim,
Y.-R. Kim, I.-C. Shin, K.-W. Lim, H.-K. Cho, C.-H. Choi, W.-R. Chung, D.-E.
Kim, K.-S. Yu, G.-T. Jeong, H.-S. Jeong, C.-K. Kwak, C.-H. Kim, and K. Kim,
"A 90nm 1.8V 512Mb diode-switch PRAM with 266MB/s read throughput," in
IEEE International Solid-State Circuits Conference (ISSCC), Digest of
Technical Papers, 2007, pp. 472-616. 8
[33] Y. Joo, D. Niu, X. Dong, G. Sun, N. Chang, and Y. Xie, "Energy- and
endurance-aware design of phase change memory caches," in Proceedings of the
Conference on Design, Automation and Test in Europe, ser. DATE '10, 2010,
pp. 136-141. 8
[34] "Gurobi ILP solver. http://www.gurobi.com/." 9.1
[35] "Columbia University sound examples directory:
http://labrosa.ee.columbia.edu/sounds." 9.3
[36] "Caltech computational vision data repository website:
http://www.vision.caltech.edu/html-files/archive.html." 9.3
[37] "Text file test data. http://mattmahoney.net/dc/textdata/." 9.4
[38] "The King James Bible (KJV). http://patriot.net/~bmcgin/kjvpage.html." 9.4
[39] "Letter frequency counter. http://millikeys.sourceforge.net/freqanalysis.html."
9.4.2
