Purdue University

Purdue e-Pubs
Open Access Dissertations

Theses and Dissertations

5-2018

Signal Processing for Caching Networks and Non-volatile
Memories
Tianqiong Luo
Purdue University

Recommended Citation
Luo, Tianqiong, "Signal Processing for Caching Networks and Non-volatile Memories" (2018). Open
Access Dissertations. 1765.
https://docs.lib.purdue.edu/open_access_dissertations/1765

This document has been made available through Purdue e-Pubs, a service of the Purdue University Libraries.
Please contact epubs@purdue.edu for additional information.

SIGNAL PROCESSING FOR CACHING NETWORKS AND NON-VOLATILE
MEMORIES

A Dissertation
Submitted to the Faculty
of
Purdue University
by
Tianqiong Luo

In Partial Fulfillment of the
Requirements for the Degree
of
Doctor of Philosophy

May 2018
Purdue University
West Lafayette, Indiana

THE PURDUE UNIVERSITY GRADUATE SCHOOL
STATEMENT OF COMMITTEE APPROVAL

Dr. Borja M. Peleato-Inarrea, Chair
Department of Electrical and Computer Engineering
Dr. Chih-Chun Wang
Department of Electrical and Computer Engineering
Dr. David J. Love
Department of Electrical and Computer Engineering
Dr. Vijay Raghunathan
Department of Electrical and Computer Engineering

Approved by:
Dr. Venkataramanan Balakrishnan
Head of the Graduate Program

ACKNOWLEDGMENTS

I would like to first thank my advisor, Professor Borja Peleato, for his support and
guidance during my Ph.D. studies. I will never forget his encouragement and help,
especially the patience with which he advised me on how to continue my research,
improve my writing skills, and proceed with my future career. His suggestions will
remain helpful and beneficial to my future work.

I had two wonderful internships in 2017, one with Google and one with Facebook.
After these two internships, I made up my mind to pursue a career as a software
engineer. I want to extend my deep gratitude to my supervisors, Kai Shen and Xiao
Jing, who provided me with guidance on how to do systems research in industry.

I want to thank my parents, who have always encouraged me in my Ph.D. studies.
Finally, I want to thank my boyfriend Yimajian Yan, who supported me when I went
through hard times.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
ABSTRACT
1 INTRODUCTION
  1.1 Challenges of Caching Networks
    1.1.1 Coded Caching with Distributed Storage
    1.1.2 Traffic Load-I/O Trade-off for Coded Caching
  1.2 Challenges of Modern Non-volatile Memory Technologies
    1.2.1 Challenges of NAND Flash
    1.2.2 Challenges of Resistive RAM
2 CODED CACHING AND DISTRIBUTED STORAGE
  2.1 Introduction
  2.2 Background
    2.2.1 System Model
    2.2.2 Maddah-Ali and Niesen's scheme
    2.2.3 Interference Elimination
    2.2.4 Extension to multiple servers
  2.3 File striping
  2.4 Scheme 1: Large cache
    2.4.1 No parity servers
    2.4.2 One parity and two data servers
    2.4.3 One parity and L data servers
    2.4.4 Two parity and L data servers
  2.5 Scheme 2: Small cache
  2.6 Simulations
  2.7 Summary
3 TRAFFIC LOAD-I/O TRADE-OFF FOR CACHING
  3.1 Introduction
  3.2 Background
    3.2.1 System Model
    3.2.2 Uncoded scheme
    3.2.3 Coded scheme
  3.3 General Algorithms
    3.3.1 Adaptive delivery
    3.3.2 Partial Caching
  3.4 Simulations
  3.5 Summary
4 SIGNAL PROCESSING FOR NAND FLASH MEMORIES
  4.1 Introduction
  4.2 Background
  4.3 Multi-page Read for NAND Flash
    4.3.1 Multi-page Read Method
    4.3.2 Applications for Multi-page Read
  4.4 Spreading Modulation for NAND Flash Memories
    4.4.1 System Model
    4.4.2 The Spreading Approach
    4.4.3 Choice of Spreading Parameter
    4.4.4 Obtaining Soft Input
    4.4.5 Security
    4.4.6 Simulation Results
  4.5 Summary
5 SIGNAL PROCESSING FOR CROSSPOINT RESISTIVE MEMORIES
  5.1 Introduction
  5.2 System Model
    5.2.1 Sneak Currents
    5.2.2 Voltage Drop
  5.3 Compensation for Sneak Currents
    5.3.1 Spreading Modulation
    5.3.2 Distribution Shaping
  5.4 Simulations
  5.5 Summary
6 SUMMARY AND FUTURE WORK
  6.1 Caching Networks
  6.2 Non-volatile Memories
A PROOF FOR LEMMA 2.4.3
VITA
REFERENCES

LIST OF TABLES

2.1 Files stored in each server (no parity)
2.2 Files stored in each server (3 servers)
2.3 Mapping of file segments to user caches (K = 6, N = 8, M = 4)
2.4 Segments received by each user in each transmission
2.5 Files stored in parity servers
2.6 Segments users get in each transmission
2.7 Normalized peak rate of Scheme 1
2.8 Normalized (M,R) pair of Scheme 2
3.1 Mapping of file segments to user caches (K = 4, N = 4, M = 2)
4.1 Bitline illustration & Multi-page reads for MLC ICI equalization
4.2 Transition probabilities and LLR values
LIST OF FIGURES

2.1 Pairing for 4 data servers and 1 parity server system
2.2 Comparison of systems with and without parity servers
2.3 Comparison of Scheme 1 and Scheme 2 (N ≥ K, RAID-4)
2.4 Comparison of Scheme 1 and Scheme 2 (N ≥ K, RAID-6)
2.5 Comparison of Scheme 1 and Scheme 2 (N ≤ K, RAID-4)
2.6 Comparison of Scheme 1 and Scheme 2 (N ≤ K, RAID-6)
3.1 Caching system
3.2 Comparison of adaptive and coded schemes for varying α
3.3 Comparison of partial caching, coded and uncoded schemes for varying α
3.4 Comparison of adaptive and coded schemes for varying K
3.5 Comparison of partial caching, coded and uncoded schemes for varying K
4.1 Floating gate transistor structure
4.2 Bitline-Wordline structure of NAND flash memory
4.3 ABL read operation timing diagram
4.4 ABL sense circuits for NAND flash memory
4.5 Illustration of multi-page read method for MLC ICI equalization
4.6 Bitline illustration & Multi-page reads
4.7 Channel capacity for an MLC cell after multi-page reads
4.8 The WOM code on the cube
4.9 Multi-page read to decode WOM code
4.10 Illustration of the spreading approach
4.11 Distribution of cell voltages for regular and spreading schemes
4.12 Quantization noise power as a function of M and k
4.13 PAM channel equivalent to SLC flash read channel in spreading
4.14 Comparison of the channel capacity of spreading and regular schemes
4.15 Voltage distribution for a SLC cell after spreading
4.16 Comparison of different hidden information sequences
4.17 Evolution of the probability of error for an MLC memory
4.18 Evolution of the probability of error for a TLC memory
4.19 Evolution of BER as cell broken rate increases
4.20 Coefficients modeling the damage
5.1 Illustration of sneak currents
5.2 Circuit model for sneak currents
5.3 Circuit model for voltage drop
5.4 Distribution of sneak currents
5.5 Noise shift and variance for SLC ReRAM
5.6 Relative error between the simulated estimates and analytic model
5.7 Evolution of the BER as the array size grows
5.8 Capacity-maximizing distribution of resistance levels per bitline
A.1 Pairing illustration

ABSTRACT

Luo, Tianqiong Ph.D., Purdue University, May 2018. Signal Processing for Caching
Networks and Non-volatile Memories.

Major Professor: Borja Peleato.

The recent information explosion has created a pressing need for faster and more
reliable data storage and transmission schemes. This thesis focuses on two systems:
caching networks and non-volatile storage systems. It proposes network protocols to
improve the efficiency of information delivery, as well as signal processing schemes
to reduce errors at the physical layer.

This thesis first investigates caching and delivery strategies for content delivery
networks. Caching has been investigated as a useful technique to reduce the network
burden by prefetching some content during off-peak hours. Coded caching [1],
proposed by Maddah-Ali and Niesen, is the foundation of our algorithms. It has been
shown to reduce peak traffic rates by encoding transmissions so that different users
can extract different information from the same packet. Content delivery networks
store information distributed across multiple servers, so as to balance the load and
avoid unrecoverable losses in case of node or disk failures. On one hand, distributed
storage limits the capability of combining content from different servers into a
single message, causing performance losses in coded caching schemes. On the other
hand, the inherent redundancy existing in distributed storage systems can be used to
improve the performance of those schemes through parallelism. This thesis proposes a
scheme combining distributed storage of the content in multiple servers and an
efficient coded caching algorithm for delivery to the users. This scheme is shown to
reduce the peak transmission rate below that of state-of-the-art algorithms.

Then we study the trade-off between the network traffic load and disk I/O for
caching networks. Coded caching can reduce traffic load by broadcasting coded
messages that can benefit multiple users but, when there are redundant requests, it
requires reading some data segments multiple times to compose different coded
messages. Hence, coded caching requires more disk I/Os than uncoded transmission.
This thesis proposes caching and delivery algorithms which combine coded and uncoded
transmission to strike a trade-off between traffic load and disk I/Os. Our algorithms
can improve both the average and worst case performance in terms of the user
requests.

Finally, we broaden our perspective to look at the storage hardware. Two methods
are proposed which are suitable for NAND flash technology: multi-page read and
spreading modulation. The first one reads multiple wordlines simultaneously and
returns a combination of their stored information. This multi-page read method is
shown to be useful for equalizing the inter-cell interference, reducing the damage
caused by erase operations, and speeding up the decoding of some codes, such as WOM
codes [2]. Then a new data representation scheme is proposed which increases
endurance and significantly reduces the probability of error caused by inter-cell
interference. This data representation scheme is based on using an orthogonal code
to spread each bit across multiple cells, resulting in lower variance for the
voltages being programmed. We also study an up-and-coming memory technology, ReRAM,
with a different set of challenges. Specifically, we build a simple analytic model
for the voltage drop and sneak currents in MLC-ReRAM arrays as a form of inter-cell
interference and propose two techniques to minimize the resulting BER: distribution
shaping and spreading modulation, which is extended from that of NAND flash.

1. INTRODUCTION
For several decades, CPU performance has roughly doubled every two years, a trend
commonly associated with Moore's law, but storage technology has not been able to
keep up: magnetic hard drives have steadily increased their capacity, but not their
speed. Current computers and communication networks are not limited by the speed at
which information can be processed, but rather by the speed at which it can be read,
moved, and written. Furthermore, the recent information explosion is driving an
exponential increase in the demand for data, which is not expected to slow down any
time soon. Users and applications require storing large amounts of data and
transmitting data at higher speeds, straining the devices and networks to their
maximum capabilities. This thesis focuses on two modern storage systems: caching
networks and non-volatile memories, as explained below.

The first part of the thesis focuses on caching networks. In the context of
networking, popular content can be pre-cached in multiple nodes to balance the load
and alleviate the stress on the network during peak times. The caching problem
therefore focuses on what data to store and how to deliver it so that the system
efficiency is improved. Besides the heavy network transfer load, a large number of
disk I/Os is another performance bottleneck for storage systems, putting a burden on
the system resources.

The second part of the thesis focuses on storage hardware, specifically non-volatile
memories. Non-volatile memories store data using persistent physical properties,
which do not change even if the power is turned off. Specifically, Flash memories use
the threshold voltage of floating gate transistors and ReRAM memories use the
resistance of memristors (a contraction of "memory resistor"). Unfortunately, these
parameters are subject to noise during reads and writes, making it a signal
processing challenge to store data reliably.

Section 1.1 and Section 1.2 introduce the challenges that caching networks and
non-volatile memory technologies are facing and describe the contributions of this
thesis.

1.1 Challenges of Caching Networks
Caching has been investigated as a useful technique to reduce the network burden by
prefetching some content during off-peak hours. A caching scheme has two phases:
placement and delivery. In the placement phase, the users have access to all files to
fill their caches. In the delivery phase, every user requests one file and only the
server has database access. The server delivers messages to the users to fulfill
their requests.

Coded caching has recently become quite popular among the coding community, starting
with the work by Maddah-Ali and Niesen in [1]. It has been shown that coded caching
can reduce peak traffic rates by encoding transmissions so that different users can
extract different information from the same packet. Our study is based on Maddah-Ali
and Niesen's work in [1], which we extend to solve two problems: extending coded
caching to distributed storage systems (as explained in Subsection 1.1.1) and
achieving the traffic load-I/O trade-off for coded caching (as explained in
Subsection 1.1.2).
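
The idea that one coded packet can serve several users at once can be illustrated
with the standard two-user, two-file toy case. The following is a minimal sketch
under stated assumptions, not the general scheme of [1]: the file contents, the
variable names, and the choice of a cache holding exactly one file's worth of data
are all hypothetical and chosen only for illustration.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bitwise XOR of two equal-length byte strings."""
    assert len(a) == len(b)
    return bytes(x ^ y for x, y in zip(a, b))

# Two files of equal length, each split into two halves (hypothetical contents).
file_a = b"AAAAaaaa"
file_b = b"BBBBbbbb"
half = len(file_a) // 2
a1, a2 = file_a[:half], file_a[half:]
b1, b2 = file_b[:half], file_b[half:]

# Placement phase: each user caches one half of every file (one file in total).
cache_user1 = {"A1": a1, "B1": b1}
cache_user2 = {"A2": a2, "B2": b2}

# Delivery phase: user 1 requests file A, user 2 requests file B.
# One coded packet of half a file is broadcast to both users.
packet = xor_bytes(a2, b1)

# User 1 cancels its cached B1 to obtain A2; user 2 cancels its cached A2 to obtain B1.
recovered_a = cache_user1["A1"] + xor_bytes(packet, cache_user1["B1"])
recovered_b = xor_bytes(packet, cache_user2["A2"]) + cache_user2["B2"]

# Traffic in units of one file: coded broadcast versus sending A2 and B1 separately.
coded_load = len(packet) / len(file_a)
uncoded_load = (len(a2) + len(b1)) / len(file_a)
```

In this toy case both requests are served with half a file of traffic, whereas
sending the two missing halves separately would cost a full file.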

1.1.1 Coded Caching with Distributed Storage
Maddah-Ali and Niesen's work [1] focuses on a system with multiple users connecting
to a single server through a shared broadcast link. However, with the growing demand
for data, networks usually distribute popular files across multiple independent
servers. This thesis proposes and analyzes multiple caching mechanisms for
multi-server systems with different system parameters.

Distributed storage deals with how the information is stored at the servers. Disk
failures are very common in large storage systems, so they need to have some amount
of redundancy. Erasure codes have recently sparked a renewed interest from the
research community for this task. Files are encoded and distributed among a set of
nodes (disks, servers, etc.) in such a way that the system can recover from the
failure of a certain number of nodes [3]. Most large scale systems use some form of
erasure codes (such as RAID [4]) with striping across multiple storage drives, but
some others store or replicate whole files as a single unit in the network nodes
(e.g. data centers). This increases the peak rate, but it also simplifies
book-keeping and deduplication, improves security, and makes the network more
flexible.

In this thesis, we aim to design a joint storage and transmission protocol for the
multi-server multi-user system. We combine distributed storage with coded caching,
utilizing parallelism and redundancy to reduce the peak traffic rate. The main
contributions are: (1) a flexible model for multi-server systems where each file can
be divided among multiple servers or kept as a single block in one server; (2) an
extension of the coded caching algorithms in [1] and [5] to striping multi-server
systems; (3) new caching and delivery schemes with significantly lower peak rates for
the case when files are stored as a single unit in a data server. The detailed
algorithms are elaborated in Chapter 2 and the related publication is [6].

1.1.2 Traffic Load-I/O Trade-off for Coded Caching
Although the traffic load is the dominant factor in slow or congested networks, disk
I/Os are also valuable and can become the bottleneck in some systems, such as
Haystack [7] and Colossus [8]. A significant amount of research has gone into coding
techniques to minimize disk I/Os in storage systems [9, 10].

Maddah-Ali and Niesen's coded caching scheme in [1] has the lowest peak traffic load
in the literature and its extension by Yu et al. [11] is proved to achieve the best
average traffic load with uncoded prefetching. However, their I/O performance is
suboptimal when there are redundant user demands. The same segment could be read
multiple times if it is used to construct different messages, which dramatically
increases I/O reads. In contrast, if all messages are transmitted uncoded, each data
segment requested is read once and broadcast to all users. Inspired by this fact, we
study the trade-off between traffic load and I/O by designing algorithms which
combine coded and uncoded transmission. The algorithms are shown to improve both the
average and worst case performance in terms of the user requests, as elaborated in
Chapter 3; the related publication is [12].
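
The tension between traffic and disk reads can be made concrete with a small
counting exercise. The sketch below is only an illustration under simplifying
assumptions and is not one of the algorithms of Chapter 3: it assumes K users, files
split into K equal segments, user k caching segment k of every file, pairwise XOR
delivery in the style of [1], and one disk read per segment per message (no
server-side read buffering). The function names are hypothetical.

```python
from itertools import combinations

def coded_delivery(demands):
    """Pairwise coded delivery when user k caches segment k of every file.

    Returns (number_of_messages, number_of_segment_reads).
    """
    messages = []
    for i, j in combinations(range(len(demands)), 2):
        # User i is missing segment j of its file; user j is missing segment i of its file.
        messages.append([(demands[i], j), (demands[j], i)])
    # Assumption: each segment is read from disk once per message that uses it.
    reads = sum(len(m) for m in messages)
    return len(messages), reads

def uncoded_delivery(demands):
    """Broadcast every missing segment once, uncoded; same return format."""
    users = len(demands)
    needed = {(f, seg) for user, f in enumerate(demands)
              for seg in range(users) if seg != user}
    return len(needed), len(needed)

for demands in (["A", "B", "C"], ["A", "A", "B"], ["A", "A", "A"]):
    print(demands, "coded:", coded_delivery(demands), "uncoded:", uncoded_delivery(demands))
```

All messages have the size of one segment, so the message count is proportional to
the traffic load. Under these assumptions, distinct demands favor coded delivery
(half the messages for the same number of reads), while fully redundant demands make
the coded messages read every segment twice without saving any traffic.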

1.2 Challenges of Modern Non-volatile Memory Technologies
Recently, some new non-volatile memory technologies have emerged as a faster and
more efficient alternative to hard drives. We will investigate two promising
non-volatile memory technologies in this thesis: NAND flash memories and Resistive
RAM. These technologies offer significantly higher speeds and power efficiency than
hard drives, but their higher cost is still an obstacle to their widespread use. The
cost is dominated by the area of silicon that they require per gigabyte of stored
information. Manufacturers have tried to increase the capacity of the memories by
shrinking the cells and storing more bits in each of them, but this has introduced
some problems, mainly related to reliability and endurance. The challenges for NAND
flash and ReRAM are explained in Subsection 1.2.1 and Subsection 1.2.2, respectively.

1.2.1 Challenges of NAND Flash
A NAND flash memory is fundamentally an array of floating gate transistors, known as
flash cells, whose threshold voltages can be programmed to represent different
information symbols. Single-Level Cell (SLC) memories can only store one bit in each
cell, but most commercial products now use Multi-Level Cell (MLC) memories that can
store two bits in each cell by taking 4 possible voltages. In order to reduce the
cost, manufacturers have aggressively scaled the technology to pack more cells in the
same silicon real estate, while also increasing the number of bits stored in each
cell.

As the cells shrink, the noise observed in the programmed voltages increases,
especially the inter-cell interference (ICI). ICI is a phenomenon by which
programming a cell increases the voltage of its neighbors. It has been shown that the
ICI noise created by a cell is proportional to the voltage to which it is programmed
[13]. Another challenge that the flash memory industry is facing is the limited
lifetime of flash cells. Basic operations like programming and erasing¹ the cells
require tunneling charges through a dielectric barrier. This results in stresses that
degrade the properties of the barrier, making the cells less efficient in the
retention of data, more vulnerable to noise, and consequently more prone to errors.
Once again, the degradation is proportional to the voltage variations [14].

In this thesis, two signal processing approaches are proposed for NAND flash to
improve reliability and endurance: (1) We propose a new method which reads multiple
wordlines simultaneously and returns a combination of their stored information. This
multi-page read method is shown to be useful for equalizing the ICI, reducing the
damage caused by erase operations, and speeding up the decoding of certain WOM codes
[2]. (2) A spreading modulation is proposed. It is a new data representation scheme
which reduces ICI and extends the lifetime of the memory by reducing the frequency
with which the largest voltage levels are programmed. The proposed modulation is
based on using an orthogonal code to spread each information symbol across multiple
cells, similar to how DS-CDMA is used in wireless communications [15, 16]. The
detailed algorithms are elaborated in Chapter 4 and the related publications are
[17–19].
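
The orthogonal spreading idea borrowed from DS-CDMA can be sketched generically as
follows. This is not the modulation of Chapter 4, whose mapping to cell voltages is
defined there; it is a textbook illustration that assumes antipodal (+1/-1) symbols,
Walsh-Hadamard codes built with the Sylvester construction, and cells that store the
plain sum of the spread symbols. All function names are hypothetical.

```python
def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n must be a power of two)."""
    h = [[1]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-x for x in row] for row in h]
    return h

def spread(symbols, codes):
    """Each cell stores the sum, over all symbols, of symbol times its code chip."""
    n = len(codes)
    return [sum(s * codes[i][c] for i, s in enumerate(symbols)) for c in range(n)]

def despread(cells, codes):
    """Correlate the cell values with each code and normalize by the code length."""
    n = len(codes)
    return [sum(cells[c] * codes[i][c] for c in range(n)) / n for i in range(n)]

codes = hadamard(4)
symbols = [1, -1, -1, 1]
cells = spread(symbols, codes)
estimates = despread(cells, codes)

# A disturbance on a single cell is shared among all symbols after despreading.
noisy_cells = list(cells)
noisy_cells[3] += 0.5
noisy_estimates = despread(noisy_cells, codes)
```

Because the codes are mutually orthogonal, correlation recovers every symbol exactly
in the noiseless case, and an error of 0.5 on one cell moves each estimate by only
0.5/4.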

¹ Flash cells need to be erased before they can be overwritten.

1.2.2 Challenges of Resistive RAM
Besides NAND flash memory technology, which is already in commercial use, some other
promising non-volatile memory technologies are also under research. Resistive RAM
(ReRAM) is rising as a promising non-volatile alternative because of its high
density, fast access time and low power consumption. ReRAM uses memristors (a
contraction of "memory resistor") to store information. A memristor is a nonlinear
resistor whose value can be adjusted by pushing current across its terminals.

To increase the density of the memristor array, crosspoint architectures are used,
which show excellent scalability but suffer from other problems, mainly related to
sneak currents flowing through supposedly deactivated cells. Typically, writes and
reads are done by fully biasing a selected wordline and bitline (row and column,
respectively) while the others are only partially biased or not biased at all. This
creates a large voltage drop across the cell at the intersection, a smaller one
across other cells in the same wordline or bitline (half-selected cells), and a
negligible one across the rest of the cells. Ideally, all the current would flow
through the selected cell, but in practice there are some additional currents flowing
through the half-selected cells. These are called sneak currents [20]. The magnitude
of the sneak currents increases dramatically with the size of the memristor array, to
the point that they have become the limiting factor in the scalability of ReRAM
memories. This effect is even more severe in MLC memories.

In this thesis, we propose signal processing approaches to compensate for sneak
currents in MLC ReRAM. Our contributions include: (1) a simple analytic model for the
voltage drop and sneak currents in MLC-ReRAM arrays as a form of ICI; (2) two
techniques to minimize the resulting BER: spreading modulation and distribution
shaping. The detailed algorithms are shown in Chapter 5 and the related publication
is [21].

2. CODED CACHING AND DISTRIBUTED STORAGE

2.1 Introduction
The recent information explosion has created a pressing need for faster and more
reliable data transmission and recovery schemes. The IT industry has addressed this
challenge through parallelism and caching: instead of using a single high capacity
storage drive to serve all the requests, networks usually distribute popular ˝les across
multiple independent servers that can operate in parallel and cache part of the information at intermediate or ˝nal nodes. This chapter proposes multiple caching mechanisms for multi-server systems with di˙erent system parameters. Previous literature
has addressed coded caching for single server systems and distributed storage without caching but, to the extent of our knowledge, this is the ˝rst work that considers
both coded caching at the users and distributed storage at the servers. Furthermore,
it provides solutions for systems with and without ˝le striping (

i.e.

with ˝les split

among multiple servers and with whole ˝les stored in each server).
Erasure codes are adopted to solve the disk failures in distributed storage system.
Files are encoded and distributed among servers in such a way that the system can
recover from the failure of a certain number of servers [3]. One widely used distributed
storage technique based on erasure codes is RAID (redundant array of independent
disks).

It combines multiple storage nodes (disks, servers, etc.)

into a single logi-

cal unit with data redundancy. Two of the most common are RAID-4 and RAID-6,
consisting of block-level striping with one and two dedicated parity nodes, respectively [4, 22]. Most large scale systems use some form of RAID with striping across
multiple storage drives, but some others store or replicate whole ˝les as a single unit

e.g.

in the network nodes (

data centers). This increases the peak rate, but it also

8

simpli˝es book-keeping and deduplication, improves security, and makes the network
more ˛exible.
Coded caching deals with the high temporal variability of network tra°c: the peak
tra°c in the network is reduced by pre-fetching popular content in each receiver's
local cache memory during o˙-peak hours, when resources are abundant.

Coded

caching has also recently become a research hit, starting with the work by MaddahAli and Niesen in [1], which focused on how a set of users with local memories can
e°ciently receive data from a single server through a common link. Their seminal
paper proposed a caching and delivery scheme o˙ering a worst case performance
within a constant factor of the information-theoretic optimum, as well as upper and
lower bounds on that optimum.

The lower bounds were later re˝ned in [23] and

new schemes were designed to consider non-uniform ˝le sizes and popularity [24, 25];
multiple requests per user [26, 27]; variable number of users [28]; and multiple servers
with access to the whole library of ˝les [29].
Maddah-Ali and Niesen's work in [1] caches the information uncoded and encodes
the transmitted packets. This scheme performs well when the cache size is relatively
large, but a close inspection shows that there are other cases in which its performance
is far from optimal. Tian and Chen's recent work in [5] designs a new algorithm which
encodes both the cached and transmitted segments to achieve a better performance
than [1] when the cache size is small or the number of users is greater than the number
of ˝les. However, this scheme also focuses on a single server system. In this chapter,
we aim to design a joint storage and transmission protocol for the multi-server multiuser system.
Summarizing, prior work on distributed storage has studied how a single user can
efficiently recover data distributed across a set of nodes and prior work on coded
caching has studied how a set of users with local memories can efficiently receive data
from a single node. However, to the best of our knowledge, it has not been studied
how the cache placement and content delivery should be performed when multiple
nodes send data to multiple users through independent channels. In this thesis, we combine
distributed storage with coded caching, utilizing parallelism and redundancy to reduce
the peak traffic rate.
The rest of the chapter is structured as follows: Section 2.2 introduces the system
model and two existing coded caching algorithms for single server systems, namely
the one proposed by Maddah-Ali and Niesen in [1] and the interference elimination
scheme in [5]. Section 2.3 extends both algorithms to a multi-server system with
file striping, while Sections 2.4 and 2.5 consider the case where servers store whole
files. Specifically, Section 2.4 extends Maddah-Ali and Niesen's scheme, suitable for
systems with large cache capacity, and Section 2.5 extends the interference elimination
scheme, which provides better performance when the cache size is small. Finally,
Section 2.6 provides simulations to support and illustrate our algorithms and Section
2.7 concludes the chapter.

2.2 Background
This section describes the multi-node multi-server model in 2.2.1 and then reviews
two existing coded caching schemes that constitute the basis for our algorithms. Subsection 2.2.2 summarizes Maddah-Ali and Niesen's coded caching scheme from [1]
and subsection 2.2.3 summarizes Tian and Chen's interference elimination scheme
from [5].

2.2.1 System Model
We consider a network with $K$ users¹ and $N$ files stored in $L$ data servers. Some
parts of the chapter will also include additional parity servers, denoted parity server
$P$ when storing the bitwise XOR of the information in the data servers (RAID-4) and
parity server $Q$ when storing a different linear combination of the data (RAID-6).
The network is assumed to be flexible, in the sense that there is a path from every
server to every user [29]. Each server stores the same number of files with the same
size and each user has a cache with capacity for $M$ files. For the sake of simplicity,
this chapter assumes that all files have identical length and popularity.

¹ Servers and users can be anything from a single disk to a computer cluster, depending on the
application.

The servers are assumed to operate on independent error-free channels, so that
two or more servers can transmit messages simultaneously and without interference to
the same or different users. A server can broadcast the same message to multiple users
without additional cost in terms of bandwidth, but users cannot share the content
of their caches with each other. This assumption makes sense in a practical setting
since peer-to-peer content sharing is generally illegal. Also, users typically have an
asymmetric channel, with large download capacity but limited upload speed.
Similarly, each server can only access the files that it is storing, not those stored
on other servers. A server can read multiple segments from its own files and combine
them into a single message, but two files stored on different servers cannot be
combined into a single message. However, it will be assumed that servers are aware
of the content cached by each user and of the content stored in other servers, so that
they can coordinate their messages. This can be achieved by exchanging segment IDs
through a separate low-capacity control channel or by maintaining a centralized log.
The problem consists of two phases: placement and delivery. During the placement
phase, the content is stored in the users' caches. The decisions on where to locate
each file, how to compute the parity, and what data to store in each cache are made
based on the statistics for each file's popularity, without knowledge of the actual
user requests. In this chapter, we assume all the files have the same popularity. The
delivery phase starts with each user requesting one of the files. All servers are made
aware of these requests and proceed to send the necessary messages.
Throughout the chapter, we use subindices to represent file indices and superindices
to represent segment indices, so $F_i^j$ will represent the $j$-th segment from
file $F_i$. Some parts of the chapter will also use different letters to represent files from
different servers. For example, $A_i$ to represent the $i$-th file from server $A$ and
$A_i^j$ to represent the $j$-th segment from file $A_i$. The chapter focuses on minimizing the
peak rate (or delay), implicitly assuming that different users request different files.
Therefore, we will refer interchangeably to users or their requests.

2.2.2 Maddah-Ali and Niesen's scheme
The coded caching scheme proposed by Maddah-Ali and Niesen in [1] has a single
server storing all the files $\{F_1, F_2, \dots, F_N\}$, and users are connected to this server
through a shared broadcast link. Their goal is to design caching and delivery schemes
so as to minimize the peak load on the link, i.e. the total amount of information
transferred from the server to the users. This scheme splits each file $F_i$ into $\binom{K}{t}$
nonoverlapping segments $F_i^j$ of equal size, $j = 1, \dots, \binom{K}{t}$, with $t = \frac{KM}{N}$, and caches
each segment in a distinct group of $t$ users. In other words, each subset of $t$ users is
assigned one segment from each file for all the users to cache². In the delivery phase
the server sends one message to each subset of $t+1$ users, for a total of $\binom{K}{t+1}$ messages.
This caching scheme ensures that, regardless of which files have been requested, each
user in a given subset of $t+1$ nodes is missing a segment that all the others have in
their cache. The message sent to that subset of nodes consists of the bitwise XOR of
all $t+1$ missing segments: a set of users $S$ requesting files $F_{i_1}, F_{i_2}, \dots, F_{i_{t+1}}$ would
receive the message

$$m_S = F_{i_1}^{j_1} \oplus F_{i_2}^{j_2} \oplus \cdots \oplus F_{i_{t+1}}^{j_{t+1}}, \tag{2.1}$$

where $j_k$ is the index for the segment cached by all the users in the set except the one
requesting $F_{i_k}$. Each user can then cancel out the segments that it already has in its
cache to recover the desired segment. In the worst case, i.e. when all users request
different files, this scheme yields a (normalized by file size) peak rate of

$$R_C(K, t) = \binom{K}{t+1} \Big/ \binom{K}{t} = K(1 - M/N)\,\frac{1}{1 + KM/N}. \tag{2.2}$$

² Parameter $t$ is assumed to be an integer for the sake of symmetry. Otherwise some segments
would be cached more often than others, requiring special treatment during the delivery phase and
complicating the analysis unnecessarily.

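The placement and delivery steps above can be checked on a toy instance. The following is a minimal sketch, not the implementation from [1]: the function name `simulate`, the default parameters, and the random file contents are assumptions made for illustration, and user $k$ is assumed to request file $k$ (so the sketch needs $N \geq K$).

```python
from itertools import combinations
from math import comb
import os

def xor(a, b):
    """Bitwise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

def simulate(K=4, N=4, M=2, seg_len=8):
    t = K * M // N                                  # each segment is cached by t users
    subsets = list(combinations(range(K), t))       # segment j <-> j-th t-subset of users
    # files[i][j] is segment j of file i (random bytes, hypothetical data)
    files = [[os.urandom(seg_len) for _ in subsets] for _ in range(N)]
    # Placement: user k caches segment j of every file whenever k is in subsets[j]
    cache = [{(i, j): files[i][j] for i in range(N)
              for j, s in enumerate(subsets) if k in s} for k in range(K)]
    requests = list(range(K))                       # assumed worst case: user k wants file k
    recovered = [dict(c) for c in cache]
    n_messages = 0
    # Delivery: one XOR message per (t+1)-subset of users, as in Eq. (2.1)
    for group in combinations(range(K), t + 1):
        parts = {}
        for k in group:
            others = tuple(u for u in group if u != k)
            parts[k] = (requests[k], subsets.index(others))
        msg = bytes(seg_len)
        for i, j in parts.values():
            msg = xor(msg, files[i][j])
        n_messages += 1
        # Each user cancels the t terms that it already holds in its cache
        for k in group:
            out = msg
            for u, (i, j) in parts.items():
                if u != k:
                    out = xor(out, cache[k][(i, j)])
            recovered[k][parts[k]] = out
    assert n_messages == comb(K, t + 1)
    ok = all(recovered[k][(requests[k], j)] == files[requests[k]][j]
             for k in range(K) for j in range(len(subsets)))
    rate = n_messages / len(subsets)                # normalized by file size
    return ok, rate, t
```

With the default parameters, `simulate()` uses $t = 2$, sends $\binom{4}{3} = 4$ messages of one segment each, and every user recovers its full file; the resulting rate agrees with Eq. (2.2).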

Under some parameter combinations, broadcasting all the missing segments uncoded
could require lower rate than $R_C(K, t)$, so the generalized peak rate is

$$\min\{R_C(K, t),\ N - M\}$$

but this chapter will ignore those pathological cases, assuming that $N$, $M$, and $K$ are
such that $R_C(K, t) \leq N - M$. It has been shown that this peak rate is the minimum
achievable for some parameter combinations and falls within a constant factor of the
information-theoretic optimum for all others [1, 23].
This scheme, henceforth referred to as "Maddah's scheme", will be the basis for
multiple others throughout the chapter. It is therefore recommended that the reader
has a clear understanding of Maddah's scheme before proceeding.

2.2.3 Interference Elimination
A close examination of Maddah's algorithm reveals that it has poor performance
when the cache is small and $N \leq K$. Thus, a new coded caching scheme based on
interference elimination was proposed by Tian and Chen in [5] for the case where the
number of users is greater than the number of files. Instead of caching file segments
in plain form, they propose that the users cache linear combinations of multiple segments.
After formulating the requests, undesired terms are treated as interference that
needs to be eliminated to recover the requested segment. The transmitted messages
are designed to achieve this using maximum distance separable (MDS) codes [30, 31].
In the placement phase, this scheme also splits each file into $\binom{K}{t}$ non-overlapping
segments of equal size and each segment is cached by $t$ users, albeit combined with
other segments. Let $F_i^S$, where $S \subseteq \{1, 2, \dots, K\}$ and $|S| = t$, denote the file segment
from file $F_i$ chosen to be cached by the users in $S$. In the placement phase user $k$
collects the file segments

$$\{F_i^S \mid i \in \{1, 2, \dots, N\},\ k \in S\}, \tag{2.3}$$

($P = \binom{K-1}{t-1} N$ in total), encodes them with an MDS code $C(P_0, P)$ of length
$P_0 = 2\binom{K-1}{t-1} N - \binom{K-2}{t-1}(N - 1)$, and stores the $P_0 - P$ parity symbols in its cache.
The delivery phase proceeds as if all the files are requested. When only some
files are requested, the scheme replaces some users' requests with the unrequested
files and proceeds as if all files were requested. A total of $\binom{K-1}{t}$ messages are
transmitted (either uncoded or coded) for each file $F_i$, regardless of the requests.
Uncoded messages provide the segments that were not cached by the users requesting
$F_i$, while coded messages combining multiple segments from $F_i$ are used to eliminate
the interference in their cached segments. Each user gathers $\binom{K-2}{t-1}(N - 1)$ useful
messages which, together with the $P_0 - P$ components stored in its cache, are enough
to recover all $P$ components in the $C(P_0, P)$ MDS code. A more detailed description
of the messages can be found in [5].
Therefore, the total number of messages transmitted from the server is $N\binom{K-1}{t}$.
In this interference elimination scheme, the following normalized $(M, R)$ pairs are
achievable:

$$\left( \frac{t\,[(N - 1)t + K - N]}{K(K - 1)},\ \frac{N(K - t)}{K} \right), \quad t = 0, 1, \dots, K. \tag{2.4}$$

This scheme is shown to improve the inner bound given in [1] for the case $N \leq K$
and has a better performance than the algorithm in subsection 2.2.2 when the cache
capacity is small.
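To make Eq. (2.4) concrete, the sketch below tabulates the achievable $(M, R)$ pairs and compares one of them with Eq. (2.2). The function names are hypothetical, and evaluating Eq. (2.2) at a non-integer $t$ is an assumption used only as an optimistic proxy for Maddah's scheme (the expression is convex in $M$, so memory sharing between integer points cannot do better).

```python
from fractions import Fraction

def interference_elimination_pairs(N, K):
    """Normalized (M, R) pairs of Eq. (2.4) for t = 0, 1, ..., K."""
    return [(Fraction(t * ((N - 1) * t + K - N), K * (K - 1)),
             Fraction(N * (K - t), K)) for t in range(K + 1)]

def maddah_rate(N, K, M):
    """Eq. (2.2) evaluated at cache size M, capped by uncoded delivery N - M."""
    M = Fraction(M)
    coded = K * (1 - M / N) / (1 + K * M / N)
    return min(coded, N - M)

pairs = interference_elimination_pairs(N=2, K=4)
M1, R1 = pairs[1]          # the t = 1 point: small cache, more users than files
```

For $N = 2$ and $K = 4$, the $t = 1$ point is $(M, R) = (1/4, 3/2)$, below the value $7/4$ that `maddah_rate` returns for the same cache size.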

2.2.4 Extension to multiple servers
Both of the previous schemes assume that a single server stores all the files and
can combine any two segments into a message. Then, they design a list of messages
to be broadcast by the server, based on the users' requests. In practice, however, it
is often the case that content delivery networks have multiple servers and throughput
is limited by the highest load on any one server rather than by the total traffic in the
link between servers and users. Shariatpanahi et al. addressed this case in [29], but
still assumed that all servers had access to all the files and could therefore compose
any message. They proposed a load balancing scheme distributing the same list of
messages among all the servers, scaling the peak rate by the number of servers.
If each server only has access to some of the files, the problem is significantly more
complicated. The general case, where each segment can be stored by multiple servers
and users, is known as the index coding problem. This is one of the core problems
of network information theory but it remains open despite significant efforts from
the research community [32–34]. Instead of addressing the index coding problem in
its general form, we focus on the case where each data segment is stored in a single
server, all caches have the same capacity, and users request a single file.
A simple way to generalize the previous schemes to our scenario is to follow the
same list of messages, combining transmissions from multiple servers to compose
each of them. Instead of receiving a single message with all the segments as shown
in Eq. (2.1), each node would receive multiple messages from different servers. The
peak rate for any one server would then be the same as in a single server system.
With parity servers storing linear combinations of the data, the peak rate can be
reduced. In general, distributed storage systems use MDS codes for the parity³, so
any subset of $L$ servers can be used to generate any message. Therefore, a simple
balancing of the load by rotating among all subsets of $L$ servers would scale the peak
rate by $\frac{L}{L+L'}$, where $L'$ is the number of parity servers. However, we intend to design
caching and delivery algorithms capable of further reducing the peak rate of any one
server.

2.3 File striping
The simplest way to extend single-server coded caching algorithms to a multi-server
system is to spread each file across all servers. That way, each user will request
an equal amount of information from each server, balancing the load. This is called
data striping [38] and it is common practice in data centers and solid state drives
(SSD), where multiple drives or memory blocks can be written or read in parallel.
The users then allocate an equal portion of their cache to each server and the delivery
is structured as $L$ independent single-server demands. We now proceed to give a
detailed description of how striping can reduce the peak rate of Maddah's scheme,
but the same idea can be applied to any other scheme.

³ Some systems use repetition or pyramid codes [35–37] to reduce the recovery bandwidth, but this
chapter will focus on MDS codes.

Each of the $N$ files $\{F_1, F_2, \dots, F_N\}$ is split into $L$ blocks to be stored in different
servers and each block is divided into $\binom{K}{t}$ segments. These segments are denoted by
$F_i^{(j,m)}$, where $i = 1, 2, \dots, N$ represents the file number; $m = 1, 2, \dots, L$ the block
number; and $j = 1, 2, \dots, \binom{K}{t}$ the segment number. The $m$-th server is designed to store
the $m$-th block of each file, that is $F_i^{(j,m)}$ for every $i$ and $j$.
The placement is the same as in Maddah's scheme. Each segment is cached by
$t$ users, with $\{F_i^{(j,1)}, F_i^{(j,2)}, \dots, F_i^{(j,L)}\}$ being cached by the same users. We notice
that each message transmitted by Maddah's scheme in Eq. (2.1) can be split into $L$
components

$$F_{i_1}^{(j_1,m)} \oplus F_{i_2}^{(j_2,m)} \oplus \cdots \oplus F_{i_{t+1}}^{(j_{t+1},m)}, \quad m = 1, 2, \dots, L, \tag{2.5}$$

to be sent by different servers. Then the problem can be decomposed
into $L$ independent single-server subproblems with reduced file sizes of $\frac{F}{L}$ bits. The
subproblems have the same number of users, files, and cache capacity (relative to the
file size) as the global problem. Since all servers can transmit simultaneously, the
peak load is reduced to $\frac{1}{L}$ of that in Eq. (2.2) (Maddah's single server scheme).
L

If one additional parity server $P$ is available (RAID-4), it will store the bitwise
XOR of the blocks for each file, i.e. $F_i^{(j,1)} \oplus F_i^{(j,2)} \oplus \cdots \oplus F_i^{(j,L)}$ for all $i$ and $j$. Then,
server $P$ can take over some of the transmissions, reducing the peak load to $\frac{1}{L+1}$ of
that with Maddah's scheme⁴. Specifically, instead of having all data servers transmit
their corresponding component in Eq. (2.5), server $P$ can transmit the XOR of all
the components, relieving one data server from transmitting. The users can combine
the rest of the components with this XOR to obtain the missing one. Similarly, if two
additional parity servers $P$ and $Q$ are available (RAID-6), it is possible to choose any
$L$ out of the $L+2$ servers to take care of each set of messages in Eq. (2.5), thereby
reducing the peak rate to $\frac{1}{L+2}$ of that with Maddah's scheme.

⁴ The number of segments must be a multiple of $L$ to achieve this reduction, but it is always possible
to divide each segment into multiple chunks to fulfil this condition.

A similar process with identical file splitting can be followed for the interference
cancelling scheme, achieving the same scaling of the peak rate: $\frac{1}{L}$ when there is no
parity, $\frac{1}{L+1}$ with a single parity server, and $\frac{1}{L+2}$ with two parity servers.
In practice, however, it is often preferred to avoid striping and store whole files as
a single unit in each server to simplify the book-keeping, ensure security, and make
the network more flexible. The rest of the chapter will focus on the case where nodes
store entire files, and each user requests a file stored in a specific node.
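The role of the RAID-4 parity server in the striped scheme can be illustrated with a few lines of code. This is a sketch under stated assumptions: the component contents are random placeholders, the helper names are hypothetical, and `striped_peak_load` simply encodes the $\frac{1}{L}$, $\frac{1}{L+1}$, $\frac{1}{L+2}$ scaling stated above.

```python
import os
from functools import reduce

def xor(a, b):
    """Bitwise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

def striped_peak_load(single_server_rate, L, parity_servers=0):
    """Per-server peak load when each message is split into L components and
    any L of the L + parity_servers servers can deliver them."""
    return single_server_rate / (L + parity_servers)

L = 3
components = [os.urandom(16) for _ in range(L)]   # the L components of Eq. (2.5)
parity = reduce(xor, components)                  # what server P would transmit
# Data server 0 stays silent; the user rebuilds its component from the others.
rebuilt = reduce(xor, components[1:], parity)
```

Here `rebuilt` equals `components[0]`, so any one data server can be relieved for a given message, which is what allows the load to be rotated over $L+1$ servers.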

2.4 Scheme 1: Large cache
In this section, we extend Maddah-Ali and Niesen's scheme to the multiple server
system. Instead of spreading each file across multiple servers as in Section 2.3, each
file is stored as a single unit in a data server, as shown in Table 2.1.

Table 2.1.: Files stored in each server in distributed storage system.

Server A    Server B    ···    Server L
A_1         B_1         ···    L_1
A_2         B_2         ···    L_2
⋮           ⋮                  ⋮
A_r         B_r         ···    L_r

The performance of Maddah's scheme in Eq. (2.2) is highly dependent on the
cache capacity $M$. Compared with the interference elimination in section 2.2.3, the
advantage of Maddah's scheme lies in that file segments are stored in plain form
instead of encoded as linear combinations. This saves some segments from being
transmitted in the delivery phase, but it requires larger cache capacities to obtain
coded caching gains. Hence, Maddah's scheme is appropriate when the cache capacity
is large.
The placement phase of our algorithm is identical to that in the traditional scheme.
For example, in a system with $K = 6$ users with cache capacity $M = 4$ and $N = 8$
files, each file is divided into $20$ segments and each segment is stored by $t = 3$ users.
Table 2.3 indicates the indices of the $10$ segments that each user stores, assumed to
be the same for all files without loss of generality.
In order to simplify later derivations, the notation is clarified here. Since the peak
rate for the storage system is considered, we assume that all users request different
files, hence each user can be represented by the file that it has requested. Denote
$S$ to be the user set and $m_A^S$ to represent the message sent from server $A$ to all the
users in $S$. Furthermore, if $\alpha = \{\alpha_1, \alpha_2, \dots, \alpha_i\}$ represents a vector of file indices and
$\gamma = \{\gamma_1, \gamma_2, \dots, \gamma_i\}$ represents a vector of segment indices, then $A_\alpha$ represents the set
of requests (or users)

$$A_\alpha = \{A_{\alpha_1}, A_{\alpha_2}, \dots, A_{\alpha_i}\}$$

and $A_\alpha^\gamma$ represents the message

$$A_\alpha^\gamma = A_{\alpha_1}^{\gamma_1} \oplus A_{\alpha_2}^{\gamma_2} \oplus \dots \oplus A_{\alpha_i}^{\gamma_i},$$

where $A_i^j$ represents the $j$-th segment from the $i$-th file in server $A$. Similarly,
$A_\alpha^\gamma \oplus B_\alpha^\gamma$ represents the message:

$$A_\alpha^\gamma \oplus B_\alpha^\gamma = (A_{\alpha_1}^{\gamma_1} \oplus B_{\alpha_1}^{\gamma_1}) \oplus \dots \oplus (A_{\alpha_i}^{\gamma_i} \oplus B_{\alpha_i}^{\gamma_i}).$$

We first explore the multi-server system without parity servers in subsection 2.4.1.
Then we study a simple system with two data and one parity server in subsection 2.4.2.
Finally, we study the cases with one and two parity servers in subsections 2.4.3 and
2.4.4, respectively.


2.4.1 No parity servers
In a system without redundancy, such as the one shown in Table 2.1, the servers
cannot collaborate with each other. During the delivery phase, each user is assigned
to the server storing the file that it requested, and then each data server transmits
enough messages to fulfil its requests. Specifically, following Maddah's scheme, a
server receiving $m$ requests would need to transmit $\binom{K}{t+1} - \binom{K-m}{t+1}$ messages, i.e. one
for each group of $t+1$ users containing at least one of its requesters. The normalized
peak rate for that server would therefore be

$$\left[ \binom{K}{t+1} - \binom{K-m}{t+1} \right] \Big/ \binom{K}{t}.$$

The worst case occurs when all users request files from the same server, i.e. $m = K$.
Then the peak transmission rate is the same as in the single server system.
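The message count above can be verified by brute force. The sketch below assumes, purely for illustration, that the $m$ users assigned to the server are users $0, \dots, m-1$; the function names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations
from math import comb

def messages_for_server(K, t, m):
    """Count the (t+1)-subsets of K users that contain at least one of the
    m users assigned to this server (assumed to be users 0..m-1)."""
    mine = set(range(m))
    return sum(1 for g in combinations(range(K), t + 1) if mine & set(g))

def normalized_rate(K, t, m):
    """Closed form for the normalized peak rate of a server with m requests."""
    return Fraction(comb(K, t + 1) - comb(K - m, t + 1), comb(K, t))
```

For $K = 6$ and $t = 3$, the enumeration matches $\binom{K}{t+1} - \binom{K-m}{t+1}$ for every $m$, and `normalized_rate(6, 3, 6)` recovers the single-server value $3/4$.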

2.4.2 One parity and two data servers
This section focuses on a very simple storage system with two data servers and
a third server storing their bitwise XOR, as shown in Table 2.2. Although each server
can only access its own files, the configuration in Table 2.2 allows composing any
message by combining messages from any two servers. Intuitively, if server $A$ (or $B$)
finishes its transmission task before the other one, it can work with the parity server to
help server $B$ (or $A$). This collaborative scheme allows serving two requests for files
stored in the same server in parallel, balancing the load and reducing the worst case
peak rate below that achieved without the parity server (see Section 2.4.1).
However, there is a better transmission scheme where messages from all three
servers are combined to get more information across to the users. The basic idea is
to include some unrequested segments, as well as the requested ones, in each message
from a data server. If the additional segments are well chosen, they can be combined
with messages from the parity server to obtain desired file segments. The algorithm
developed in this section is based on this idea.


Table 2.2.: Files stored in each server in a system with two data and one parity server.

Server A    Server B    Server P
A_1         B_1         A_1 ⊕ B_1
A_2         B_2         A_2 ⊕ B_2
⋮           ⋮           ⋮
A_r         B_r         A_r ⊕ B_r

Just like in Maddah's scheme, data servers will send each message to a set of
$t+1$ users and the message will contain the XOR of $t+1$ segments (one for each
user). These segments are chosen so that all users except the intended receiver can
cancel them out. If the user had requested a file stored by the sender, the message will
contain the corresponding segment; otherwise the message will include its complement
in terms of the parity in server $P$, i.e. $A_i^j$ instead of $B_i^j$ and vice versa. Therefore, the
contents of each message from server $A$ or $B$ are uniquely determined by the sender
and the set of receivers, denoted by $S_1$ or $S_2$ respectively. In the example shown in
Table 2.3, the message from server $A$ to $S_1 = \{A_1, A_2, A_3, B_4\}$, corresponding to users
1 through 4, will be $m_A^{S_1} = A_1^{11} \oplus A_2^{5} \oplus A_3^{2} \oplus A_4^{1}$.

Lemma 2.4.1 Let the receivers for servers $A$ and $B$ be

$$S_1 = \{A_\alpha, B_\beta, A_*\}, \qquad S_2 = \{A_\alpha, B_\beta, B_*\},$$

respectively, where $\alpha$ and $\beta$ denote (possibly empty) sets of indices, the $*$ denote
arbitrary sets, and $S_1 \neq S_2$. The corresponding messages are

$$m_A^{S_1} = A_\alpha^{*} \oplus A_\beta^{\gamma} \oplus A_*^{*}, \qquad m_B^{S_2} = B_\alpha^{\eta} \oplus B_\beta^{*} \oplus B_*^{*},$$

with segment indices chosen so that each user can cancel all but one of the components.
This provides users $B_\beta$ and $A_\alpha$ with some unrequested segments $A_\beta^\gamma$ and $B_\alpha^\eta$,
respectively. Then server $P$ can send the message

$$m_P^{S_1 \cap S_2} = (A_\alpha^{\eta} \oplus B_\alpha^{\eta}) \oplus (A_\beta^{\gamma} \oplus B_\beta^{\gamma}),$$

to $S_1 \cap S_2$, so that each user in $S_1$ and $S_2$ obtains a missing segment and those in
the intersection obtain two. These three transmissions are equivalent to messages $m_{S_1}$
and $m_{S_2}$ as defined in Eq. (2.1) for Maddah's single server scheme. They both provide
the same requested segments to their destinations.

Proof All the users in $S_1$ and $S_2$ get at least one desired segment, from the server
storing their requested file. Those in $S_1 \cap S_2$ also receive an unrequested segment from
server $A$ or $B$. It only remains to prove that users in $S_1 \cap S_2$ can use this unrequested
segment to obtain its complement from $m_P^{S_1 \cap S_2}$.
Without loss of generality, consider user $B_{\beta_i} \in S_1 \cap S_2$. The set of segment indices $\gamma$
in $m_A^{S_1}$ were chosen so that user $B_{\beta_i}$ is caching all the segments except the $\gamma_i$-th.
Similarly, the set of indices $\eta$ in $m_B^{S_2}$ was chosen so that $B_{\beta_i}$ is caching all of them
(for all files). Therefore, $B_{\beta_i}$ can obtain $A_{\beta_i}^{\gamma_i}$ from $m_A^{S_1}$ and should be able to cancel
all terms from $m_P^{S_1 \cap S_2}$ except $A_{\beta_i}^{\gamma_i} \oplus B_{\beta_i}^{\gamma_i}$. Combining both of these yields the desired
segment $B_{\beta_i}^{\gamma_i}$. As long as $S_1 \neq S_2$, this segment will be different from the one that $B_{\beta_i}$
obtains from $m_B^{S_2}$ because there is a one-to-one relationship between segment indices
and user subsets. ■

Take the case in Table 2.3 as an example. Lemma 2.4.1 states that if $S_1 = \{A_1, A_2, A_3, B_4\}$
and $S_2 = \{A_1, A_2, B_1, B_4\}$, we construct $m_A^{S_1}$, $m_B^{S_2}$, $m_P^{S_1 \cap S_2}$ as:

$$m_A^{S_1} = A_1^{11} \oplus A_2^{5} \oplus A_3^{2} \oplus A_4^{1},$$
$$m_B^{S_2} = B_1^{14} \oplus B_2^{8} \oplus B_1^{2} \oplus B_4^{3},$$
$$m_P^{S_1 \cap S_2} = (A_1^{14} \oplus B_1^{14}) \oplus (A_2^{8} \oplus B_2^{8}) \oplus (A_4^{1} \oplus B_4^{1}).$$

It is easy to verify that these messages are equivalent to two transmissions in Maddah's
scheme, specifically those intended for users $\{A_1, A_2, A_3, B_4\}$ and $\{A_1, A_2, B_1, B_4\}$.

Corollary 2.4.1.1 Assume $S_1 = \{A_*, B_\beta\}$ and $S_2 = \{B_*\}$, i.e. it only contains
requests for server $B$. Then server $P$ sends $m_P^{B_\beta} = A_\beta^\gamma \oplus B_\beta^\gamma$ to all the users in $B_\beta$ in
Lemma 2.4.1, so that all the users in $S_1$ and $S_2$ get the same segments as in Maddah's
scheme. The same holds switching the roles of $A$ and $B$.


Table 2.3.: Mapping of file segments to user caches. Each cache stores the same 10
segments for every file, marked with X ($K = 6$, $M = 4$, $N = 8$).

Segment \ User    1     2     3     4     5     6
 1                X     X     X     -     -     -
 2                X     X     -     X     -     -
 3                X     X     -     -     X     -
 4                X     X     -     -     -     X
 5                X     -     X     X     -     -
 6                X     -     X     -     X     -
 7                X     -     X     -     -     X
 8                X     -     -     X     X     -
 9                X     -     -     X     -     X
10                X     -     -     -     X     X
11                -     X     X     X     -     -
12                -     X     X     -     X     -
13                -     X     X     -     -     X
14                -     X     -     X     X     -
15                -     X     -     X     -     X
16                -     X     -     -     X     X
17                -     -     X     X     X     -
18                -     -     X     X     -     X
19                -     -     X     -     X     X
20                -     -     -     X     X     X
Request           A_1   A_2   A_3   B_4   B_1   B_2


Proof This is a particular case of Lemma 2.4.1 when $\alpha$ is empty ($\beta$ can be empty
or non-empty). ■
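The segment indices in the worked example for Lemma 2.4.1 can be reproduced programmatically. The sketch below assumes that segment $j$ is cached by the $j$-th 3-subset of users in lexicographic order; this numbering is an assumption, chosen because it reproduces the indices quoted in the example. The helper `message` and the dictionary names are hypothetical.

```python
from itertools import combinations

K, t = 6, 3
# Assumed numbering: the j-th t-subset of users (lexicographic, 1-based)
# caches segment j of every file.
subsets = list(combinations(range(1, K + 1), t))
seg = {s: j + 1 for j, s in enumerate(subsets)}
request = {1: "A1", 2: "A2", 3: "A3", 4: "B4", 5: "B1", 6: "B2"}

def message(users):
    """For each user in the group, return (file index, segment index) of the
    term meant for it: the segment cached by everyone else in the group."""
    users = tuple(sorted(users))
    return {u: (int(request[u][1:]), seg[tuple(v for v in users if v != u)])
            for u in users}

S1, S2 = (1, 2, 3, 4), (1, 2, 4, 5)
mA, mB = message(S1), message(S2)      # terms of m_A^{S1} and m_B^{S2}
# Server P covers the intersection: users requesting from A take the index
# used in m_B, users requesting from B take the index used in m_A.
mP = {u: (mB[u] if request[u][0] == "A" else mA[u])
      for u in set(S1) & set(S2)}
# Every user in the intersection caches the segments of the other P terms.
cancellable = all(u in subsets[mP[v][1] - 1]
                  for u in mP for v in mP if v != u)
```

Under this numbering, `mA` yields the indices $(11, 5, 2, 1)$, `mB` yields $(14, 8, 3, 2)$ for users 1, 2, 4, 5, and `mP` pairs files 1, 2, 4 with segments 14, 8, 1, in agreement with the three messages of the example.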

Definition 2.4.1 If user subsets $S_1$ and $S_2$ fulfill the conditions in Lemma 2.4.1, we
call $(S_1, S_2)$ an effective pair.

Our goal is to design a scheme equivalent to Maddah's scheme while minimizing
the maximum number of messages sent by any server. If two user subsets form an
effective pair, the corresponding messages in Maddah's scheme (see Eq. (2.1)) can be
replaced by a single transmission from each server. Hence, we wish to make as many
effective pairs as possible.

Lemma 2.4.2 The peak rate is $\left(\frac{1}{2} + \frac{1}{6}\Delta\right) R_C(K, t)$ for the server system in Table 2.2,
where $\Delta$ represents the ratio of unpaired messages and $t = \frac{KM}{N}$.

Proof For each effective pair, we can use a single transmission from each server to
deliver the same information as two transmissions in Maddah's single server scheme.
This contributes $\frac{1}{2}(1 - \Delta) R_C(K, t)$ to the total rate. Unpaired messages are
transmitted as described in section 2.2.4, that is combining messages from any two out of
the three servers. Assuming that this load is balanced among all three servers, the
contribution to the total rate is $\frac{2}{3}\Delta R_C(K, t)$. Adding both contributions yields the
rate above. ■

The following lemma characterizes the ratio of unpaired user subsets $\Delta$ in the case
with symmetric requests (both servers receive the same number of requests).

Lemma 2.4.3 If the requests are symmetric, then $\Delta = 0$ when $t$ is even and $\Delta \leq \frac{1}{3}$
when $t$ is odd. That is, the following peak rate is achievable in the case with symmetric
requests:

$$R_T(K, t) = \begin{cases} \frac{1}{2} R_C(K, t) & \text{if } t \text{ is even} \\ \left(\frac{1}{2} + \frac{1}{6}\Delta\right) R_C(K, t) & \text{if } t \text{ is odd,} \end{cases} \tag{2.6}$$

where $R_C(K, t)$ is defined in Eq. (2.2).

Proof A pairing algorithm with these characteristics is presented in the Appendix. ■
Although $\Delta$ can reach $\frac{1}{3}$, in most cases the pairing algorithm in the Appendix performs
much better. As an example, Table 2.3 has each segment cached by $t = \frac{KM}{N} = 3$
users and the normalized peak rate with the pairing algorithm is $\frac{2}{5}$, significantly lower
than the $\frac{3}{4}$ with Maddah's single server scheme.
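The arithmetic behind these figures follows directly from Lemma 2.4.2. In the sketch below, the function names are hypothetical, and the value $\Delta = \frac{1}{5}$ is not stated in the text: it is back-solved from the quoted rate of $\frac{2}{5}$ and should be read as an inference.

```python
from fractions import Fraction

def maddah_rate(K, t):
    """R_C(K, t) from Eq. (2.2), written as C(K, t+1) / C(K, t) = (K - t) / (t + 1)."""
    return Fraction(K - t, t + 1)

def paired_rate(K, t, delta):
    """Peak rate of Lemma 2.4.2 for a fraction delta of unpaired messages."""
    return (Fraction(1, 2) + Fraction(delta) / 6) * maddah_rate(K, t)

best = paired_rate(6, 3, 0)                  # every message paired
worst = paired_rate(6, 3, Fraction(1, 3))    # bound of Lemma 2.4.3
quoted = paired_rate(6, 3, Fraction(1, 5))   # delta inferred from the quoted 2/5
```

For $K = 6$ and $t = 3$ the rate therefore lies between $\frac{3}{8}$ and $\frac{5}{12}$, and the quoted $\frac{2}{5}$ sits inside that interval.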

Finally, we are ready to derive an achievable peak rate for a general set of requests,
based on the following lemma.

Lemma 2.4.4 If $(S_1, S_2)$ form an effective pair, then $S_1' = \{S_1, A_\alpha\}$ and $S_2' =
\{S_2, A_\alpha\}$ also form an effective pair of a larger dimension. The same holds when
an all-$B$ file set is appended instead of the all-$A$ file set $A_\alpha$.

Proof The proof is straightforward by observing that $(S_1', S_2')$ still fulfills the
conditions in Lemma 2.4.1. ■

The extension to the asymmetric case is as follows. Let $K_A$ and $K_B$ respectively
denote the number of requests for servers $A$ and $B$, and assume $K_A > K_B$ without
loss of generality. Divide the $K = K_A + K_B$ requests (or users) into two groups: the
first with $K_B$ requests for each server (symmetric demands) and the second with the
remaining $K_A - K_B$ requests for server $A$. We construct effective pairs of length $t+1$
by appending requests from the second group to effective pairs from the first.

Theorem 2.4.5 If the requests are asymmetric, the ratio of unpaired messages is
also bounded by $\Delta \leq \frac{1}{3}$. Specifically, if $K_A$ and $K_B$ respectively denote the number
of requests for servers $A$ and $B$, assuming $K_A > K_B$ without loss of generality, the
following normalized peak rate is achievable:

$$R(K_A, K_B, t) = \sum_{l=0}^{t+1} \binom{K_A - K_B}{l} R_T(2K_B, t - l), \tag{2.7}$$

where $R_T$ is defined in Eq. (2.6) and $K = K_A + K_B$.


Proof From Lemma 2.4.3, $R_T(2K_B, t - l)$ represents the peak rate after pairing all
subsets of $t + 1 - l$ requests from the symmetric group. For each $l = 0, 1, \dots, t + 1$, we
multiply $R_T(2K_B, t - l)$ by the number of possible completions with $l$ requests from
the second group, to obtain the peak rate corresponding to subsets with $t + 1 - l$
requests from the first group and $l$ from the second. Adding them for all $l$ gives
Eq. (2.7).
Since $R_T(i, j) \leq \left(\frac{1}{2} + \frac{1}{6}\Delta\right) R_C(i, j)$ with $\Delta \leq \frac{1}{3}$ by Lemma 2.4.3, and
$\sum_{l=0}^{t+1} \binom{K_A - K_B}{l} R_C(2K_B, t - l) = R_C(K, t)$ by combinatorial equations, Eq. (2.7)
implies that $R(K_A, K_B, t) \leq \left(\frac{1}{2} + \frac{1}{6}\Delta\right) R_C(K, t)$ with $\Delta \leq \frac{1}{3}$ as defined in Lemma 2.4.2. ■
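The combinatorial step in this proof rests on counting $(t+1)$-subsets of the $K$ requests by how many of their members come from the unmatched group. The sketch below checks that count (Vandermonde's identity) numerically. It assumes that every term is normalized by the same $\binom{K}{t}$, so that only the message counts need to be compared; the function name is hypothetical.

```python
from math import comb

def split_message_count(KA, KB, t):
    """Count the (t+1)-subsets of the K = KA + KB requests by the number l of
    members taken from the KA - KB unmatched requests for server A."""
    return sum(comb(KA - KB, l) * comb(2 * KB, t + 1 - l)
               for l in range(t + 2))
```

For instance, `split_message_count(4, 2, 3)` returns $\binom{6}{4} = 15$, the number of messages in Maddah's scheme for $K = 6$ and $t = 3$.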

Corollary 2.4.5.1 A peak rate of $\frac{5}{9} R_C(K, t)$ is achievable for a system with two data
servers and a parity server.

2.4.3 One parity and L data servers
The previous subsection has discussed the case with two data servers and one
parity server, but the same algorithm can be extended to systems with more than
two data servers. Intuitively, if there are $L$ data servers and one parity server, any
message can be built by combining messages from any $L$ servers. A first approach
could be distributing the $\binom{K}{t+1}$ messages in Maddah's scheme across the $L+1$ possible
groups of $L$ servers, as proposed in subsection 2.2.4. Each server would then need to
send a maximum of $\binom{K}{t+1} \cdot \frac{L}{L+1}$ messages. However, there is a more efficient way of
fulfilling the requests based on the algorithms in subsections 2.2.4, 2.4.1 and 2.4.2.

Lemma 2.4.6 Let $S_1 = \{A_\alpha, B_\beta, A_*, Y\}$ and $S_2 = \{A_\alpha, B_\beta, B_*, Y'\}$ be two user
subsets, where $Y$ and $Y'$ are arbitrary lists of requests for servers $C$ through $L$ and the
$*$ represent arbitrary (possibly empty) index sets. Then, $S_1$ and $S_2$ can be paired so that
servers $A$, $B$ and $P$ require a single transmission to provide the same information as
messages $m_{S_1}$ and $m_{S_2}$ in Maddah's single server scheme. The other data servers, $C$
through $L$, require a maximum of two transmissions, as shown in paired transmissions
in Fig. 2.1.

Proof The transmissions would proceed as follows:

1. Servers $C$ through $L$ each send two messages, to $S_1$ and $S_2$. For example, server
$C$ would send $m_C^{S_1}$ and $m_C^{S_2}$, providing a desired segment to users requesting
files from $C$ and the corresponding $C$-segments to those requesting other files.

2. Server $A$ sends⁵ $m_A^{S_1}$, providing a desired segment to users requesting $\{A_*, A_\alpha\}$
and the corresponding undesired $A$-segments to those requesting $B_\beta$.

3. Server $B$ sends $m_B^{S_2}$, providing a desired segment to users requesting $\{B_\beta, B_*\}$
and the corresponding undesired $B$-segments to those requesting $A_\alpha$.

4. Server $P$ sends $m_P^{\{A_\alpha, B_\beta\}}$ to users requesting $\{A_\alpha, B_\beta\}$. Using the undesired
segments previously received, the users in $\{A_\alpha, B_\beta\}$ can solve for the desired $A$
and $B$ segments.

A simple comparison of the requested and received segments shows that these
transmissions deliver the same information as messages $m_{S_1}$ and $m_{S_2}$ in Maddah's single
server scheme. ■

As an example, Table 2.4 shows the segments that each user gets in transmissions
(1)-(4) when $S_1 = \{A_1, A_2, B_1, C_1\}$ and $S_2 = \{A_1, B_1, B_2, C_2\}$, respectively
corresponding to segments $\{A_1^1, A_2^2, B_1^3, C_1^4\}$ and $\{A_1^5, B_1^6, B_2^7, C_2^8\}$.
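The last step of this example, in which users $A_1$ and $B_1$ solve for their second segment, can be sketched as follows. The segment contents are random placeholders and the variable names are hypothetical; the sketch assumes a three-server parity $P_i^j = A_i^j \oplus B_i^j \oplus C_i^j$ and that each user has already isolated its own parity term from the message sent by server $P$.

```python
import os

def xor(*parts):
    """Bitwise XOR of any number of equal-length byte strings."""
    out = bytes(len(parts[0]))
    for p in parts:
        out = bytes(x ^ y for x, y in zip(out, p))
    return out

# Placeholder contents; names read as <server>_<file index>_<segment index>.
A_1_5, B_1_5, C_1_5 = (os.urandom(8) for _ in range(3))
A_1_3, B_1_3, C_1_3 = (os.urandom(8) for _ in range(3))

P_1_5 = xor(A_1_5, B_1_5, C_1_5)   # parity segments held by server P
P_1_3 = xor(A_1_3, B_1_3, C_1_3)

# User A1 holds B_1^5 (transmission 3) and C_1^5 (transmission 1).
decoded_for_A1 = xor(P_1_5, B_1_5, C_1_5)
# User B1 holds A_1^3 (transmission 2) and C_1^3 (transmission 1).
decoded_for_B1 = xor(P_1_3, A_1_3, C_1_3)
```

Here `decoded_for_A1` equals the desired segment $A_1^5$ and `decoded_for_B1` equals $B_1^3$, which are the segments these users would otherwise have obtained from $m_{S_2}$ and $m_{S_1}$, respectively.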

Theorem 2.4.7 The following normalized peak rate is achievable for a system with
$L \geq 3$ data servers and one parity server:

$$R_P(K, t) = \frac{L-1}{L} R_C(K, t), \tag{2.8}$$

where $R_C$ is defined in Eq. (2.2).

⁵ It would be enough for $A$ to send $m_A^{\{A_*, A_\alpha, B_\beta\}}$ instead of $m_A^{S_1}$, but we use the latter for the sake
of simplicity. The same applies to the message from server $B$.


Table 2.4.: Segments received by each user in transmissions (1)-(4) from
Lemma 2.4.6, where $P_i^j = A_i^j \oplus B_i^j \oplus C_i^j$.

Trans.\Req.   A_1             A_2      B_1             B_2      C_1      C_2
(1)           C_1^5           -        C_1^3           -        C_1^4    C_2^8
(2)           A_1^1           A_2^2    A_1^3           -        -        -
(3)           B_1^5           -        B_1^6           B_2^7    -        -
(4)           P_1^5           -        P_1^3           -        -        -
in total      A_1^1, A_1^5    A_2^2    B_1^3, B_1^6    B_2^7    C_1^4    C_2^8

[Figure 2.1: grid marking, for paired and unpaired transmissions, which of the servers A, B, C, D and P transmit a message]

Figure 2.1.: Pairing for 4 data servers and 1 parity server system. A, B, C, D
are data servers and P represents the parity server. X means there is a message
transmitted from the corresponding server.


Proof First we show that we can deliver $\frac{2}{L}\binom{K}{t+1}$ of the messages in Maddah's scheme using at most $\frac{1}{L}\binom{K}{t+1}$ transmissions from servers A, B and P; and at most $\frac{2}{L}\binom{K}{t+1}$ transmissions from each of the other servers. This can be done by pairing the messages as shown in Lemma 2.4.6, if they include requests for A or B, and by using the scheme in subsection 2.4.1, if they do not.

Selecting these $\frac{2}{L}\binom{K}{t+1}$ messages can be done as follows: group messages by the number of segments that they have from servers A or B. Within each group, we pair the messages as shown in Lemma 2.4.6. This is equivalent to pairing the A and B requests into effective pairs according to Theorem 2.4.5 and considering all possible completions for each pair using requests for other servers. Theorem 2.4.5 showed that at least $\frac{2}{3} \geq \frac{2}{L}$ of the messages in each group can be paired. Messages which have no A or B segments can be transmitted as described in section 2.4.1, without requiring any transmissions from servers A, B or P.

The remaining $\frac{L-2}{L}\binom{K}{t+1}$ messages can be transmitted as described in subsection 2.2.4, distributing the savings evenly among servers C through L. This requires $\frac{L-2}{L}\binom{K}{t+1}$ transmissions from servers A, B and P; and $\frac{L-3}{L}\binom{K}{t+1}$ from each of the rest. Each server then transmits a total of $\frac{L-1}{L}\binom{K}{t+1}$, hence the peak rate in Eq. (2.8).

■
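The bookkeeping in this proof can be checked numerically. The sketch below is only an arithmetic check under the transmission counts stated in the proof; the function name `per_server_transmissions` and the sample values of L, K and t are illustrative assumptions.

```python
from fractions import Fraction
from math import comb

def per_server_transmissions(L: int, K: int, t: int) -> dict:
    """Add up the two transmission stages counted in the proof, per server."""
    total = comb(K, t + 1)  # number of messages in the single server scheme
    # Stage 1: the 2/L fraction of messages delivered through pairing.
    stage1_abp = Fraction(1, L) * total
    stage1_rest = Fraction(2, L) * total
    # Stage 2: the remaining (L-2)/L fraction, with savings spread over C..L.
    stage2_abp = Fraction(L - 2, L) * total
    stage2_rest = Fraction(L - 3, L) * total
    return {
        "A, B, P": stage1_abp + stage2_abp,
        "others": stage1_rest + stage2_rest,
    }

loads = per_server_transmissions(L=4, K=6, t=2)
target = Fraction(3, 4) * comb(6, 3)  # (L-1)/L of the single server count
assert loads["A, B, P"] == loads["others"] == target
```

Both groups of servers end up with the same load, which is why the peak rate is $\frac{L-1}{L} R_C(K, t)$.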

Theorem 2.4.7 provides a very loose bound for the peak rate in a system with one parity and L data servers. In practice, there often exist alternative delivery schemes with significantly lower rates. For example, if all the users request files from the same server, that server should send half of the messages while all the other servers collaborate to deliver the other half. The rate would then be reduced to half of that in Maddah's scheme. Similarly, if $L > t+1$ and all the servers receive similar numbers of requests, the scheme in subsection 2.4.1 can provide significantly lower rates than Eq. (2.8).

Table 2.5.: Files stored in parity servers in RAID-6.

Server P                     Server Q
A_1 + B_1 + ... + L_1        A_1 + κ_B B_1 + ... + κ_L L_1
A_2 + B_2 + ... + L_2        A_2 + κ_B B_2 + ... + κ_L L_2
...                          ...
A_r + B_r + ... + L_r        A_r + κ_B B_r + ... + κ_L L_r

2.4.4 Two parity and L data servers
In this section, we will extend our algorithm to a system with L data and two linear parity servers operating in a higher order field instead of GF(2). The parity server P stores the horizontal sum of all the files while the parity server Q stores a different linear combination of the files by row, as shown in Table 2.5. It will be assumed that the servers form an MDS code. We will show that with a careful design of the delivery strategy, the peak rate can be reduced to almost half of that with Maddah's single server scheme.

Lemma 2.4.8 Let $S_1 = \{A_*, Y\}$ and $S_2 = \{B_*, Y\}$, where Y represents a common set of requests from any server. Then $S_1$ and $S_2$ can be paired so that a single transmission from each server fills the same requests as messages $m_{S_1}$ and $m_{S_2}$ in Eq. (2.1).

Proof The transmission scheme shares the same pairing idea as the algorithm in subsection 2.4.2. The transmissions are as follows:

1. Server A sends $m^A_{S_1}$, providing a desired segment to users requesting its files and the corresponding undesired A-segments to others.

2. Server B sends $m^B_{S_2}$, providing a desired segment to users requesting its files and the corresponding undesired B-segments to others.

3. Servers C, D, ..., L each send a single message to $S_1 \cap S_2 = \{Y\}$ with the following content for each user:

   • Users requesting files from server B received some undesired segments from server A. Servers C, D, ..., L send them the matching ones so that the desired segments can be decoded using the parity in server P later.

   • The remaining users in Y will get the desired segment corresponding to $S_1$ when possible, otherwise they will get the undesired segment corresponding to $S_2$.

   In other words, each server C, ..., L will send segments corresponding to $S_1$ to users requesting its files or those from server B, and segments corresponding to $S_2$ to the rest. At this point, all the users have satisfied their requests related to $S_1$, except those requesting files from server B, who satisfied their requests related to $S_2$ instead. Each user has also received $L-2$ undesired "matched" segments⁶, corresponding to $S_1$ for those requesting files from server B and corresponding to $S_2$ for the rest.

4. Finally, parity servers P and Q each transmit a message to $S_1 \cap S_2 = \{Y\}$ with a combination of segments for each user (see Table 2.5). Those requesting files from server B will get two combinations of the segments corresponding to $S_1$, while the rest will get two combinations of the segments corresponding to $S_2$.

Since each user now has $L-2$ individual segments and two independent linear combinations of all L segments, it can isolate the requested segment (as well as all the "matching" segments in other servers).

A simple comparison of the requested and received segments shows that these transmissions deliver the same information as messages $m_{S_1}$ and $m_{S_2}$ in Maddah's single server scheme. ■

⁶ Users in Y requesting files from servers A or B received $L-1$ "matched" segments instead of $L-2$, but we can ignore the extra one.
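The last decoding step, isolating a segment from $L-2$ known segments and two independent linear combinations, can be sketched for $L = 3$. The field and coefficients below are assumptions made purely for illustration: a prime field of size 257 and $\kappa_B = 2$, $\kappa_C = 3$; the text only requires a field larger than GF(2) in which the servers form an MDS code. Each segment is reduced to a single field symbol.

```python
P_MOD = 257               # hypothetical prime field size
KAPPA_B, KAPPA_C = 2, 3   # hypothetical distinct nonzero coefficients

def parities(a: int, b: int, c: int) -> tuple:
    """Symbols stored by parity servers P and Q for matching segments a, b, c."""
    p = (a + b + c) % P_MOD
    q = (a + KAPPA_B * b + KAPPA_C * c) % P_MOD
    return p, q

def decode_with_known_a(a: int, p: int, q: int) -> tuple:
    """Recover b and c from the known segment a and the two parity symbols."""
    s1 = (p - a) % P_MOD  # equals b + c
    s2 = (q - a) % P_MOD  # equals KAPPA_B * b + KAPPA_C * c
    inv = pow((KAPPA_C - KAPPA_B) % P_MOD, -1, P_MOD)  # modular inverse
    c = ((s2 - KAPPA_B * s1) * inv) % P_MOD
    b = (s1 - c) % P_MOD
    return b, c

a, b, c = 17, 200, 99
p, q = parities(a, b, c)
assert decode_with_known_a(a, p, q) == (b, c)
```

With more data servers the user would subtract all $L-2$ known segments first and then solve the same two-by-two system.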

Table 2.6.: Segments users get in (1)-(4) transmissions (In order to simplify notation, denote $P_i^j = A_i^j + B_i^j + C_i^j$ and $Q_i^j = A_i^j + \kappa_B B_i^j + \kappa_C C_i^j$).

Trans.\Req.   A_1            A_2     B_1            B_2     C_1            C_2
(1)           A_1^1          A_2^2   A_1^3          -       A_1^4          A_2^5
(2)           B_1^6          -       B_1^7          B_2^8   -              -
(3)           C_1^6          -       C_1^3          -       C_1^9          C_2^10
(4)           P_1^6          -       P_1^3          -       P_1^4, Q_1^4   P_2^5, Q_2^5
in total      A_1^1, A_1^6   A_2^2   B_1^3, B_1^7   B_2^8   C_1^4, C_1^9   C_2^5, C_2^10

As an example, Table 2.6 shows the delivered segments in transmissions (1)-(4) if $m_{S_1} = \{A_1^1, A_2^2, B_1^3, C_1^4, C_2^5\}$ and $m_{S_2} = \{A_1^6, B_1^7, B_2^8, C_1^9, C_2^{10}\}$.

Theorem 2.4.9 For the L data server and two parity server system, the following normalized peak rate is achievable:

$$R_Q(K, t) = \left(\frac{1}{2} + \frac{L-2}{2L+4}\Delta\right) R_C(K, t), \tag{2.9}$$

where $\Delta \leq \frac{1}{3}$ is the pairing loss and $R_C$ is the rate of the single server Maddah's scheme in Eq. (2.2).

Proof Group messages by the number of segments that they have from servers A or B. Within each group, we pair the messages as shown in Lemma 2.4.8. If the number of requests from A or B is not zero, this is equivalent to pairing the A and B requests into effective pairs according to Theorem 2.4.5 and considering all possible completions for each pair using requests for other servers. Theorem 2.4.5 showed that at most $\frac{1}{3}$ of the messages in each group remain unpaired. For the messages which do not contain segments from A or B, we repeat the same process with two other servers, with identical results: at most $\frac{1}{3}$ of them remain unpaired.

Each pair of messages can be delivered using a single transmission from each server, as shown in Lemma 2.4.8, hence paired messages contribute $\frac{1}{2}(1-\Delta)R_C(K, t)$ to the total rate, where $\Delta$ denotes the ratio of unpaired messages. Unpaired messages are transmitted as described in section 2.2.4, that is using L out of the $L+2$ servers. Balancing this load among all the servers, they contribute $\frac{L}{L+2}\Delta R_C(K, t)$ to the total rate. Adding both contributions yields the rate above.

■
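The final addition in this proof is easy to verify with exact arithmetic. The sketch below only checks that the paired and unpaired contributions add up to the factor in Eq. (2.9); the function names are hypothetical.

```python
from fractions import Fraction

def summed_contributions(L: int, delta: Fraction) -> Fraction:
    """Paired plus unpaired contributions, as multiples of R_C(K, t)."""
    paired = Fraction(1, 2) * (1 - delta)
    unpaired = Fraction(L, L + 2) * delta
    return paired + unpaired

def closed_form(L: int, delta: Fraction) -> Fraction:
    """Factor multiplying R_C(K, t) in Eq. (2.9)."""
    return Fraction(1, 2) + Fraction(L - 2, 2 * L + 4) * delta

# Worst case pairing loss with L = 4 data servers: the factor is 5/9.
assert summed_contributions(4, Fraction(1, 3)) == closed_form(4, Fraction(1, 3)) == Fraction(5, 9)
```

With perfect pairing ($\Delta = 0$) the factor is exactly one half, which is the "almost half" reduction mentioned at the start of the subsection.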

2.5 Scheme 2: Small cache
This section extends the interference elimination scheme in section 2.2.3 to a multi-server system. The interference elimination scheme is specially designed to reduce the peak rate when the cache size is small [5]. Unlike Maddah's scheme, which caches plain segments, the interference elimination scheme proposes caching linear combinations of them. That way each segment can be cached by more users, albeit with interference. This section will start with the system without parity in Table 2.1, showing that the transmission rate decreases as $\frac{1}{L}$ with the number of servers. Then it performs a similar analysis for the case with parity servers, which can be interpreted as an extension of the users' caches.

Theorem 2.5.1 In a system with L data servers and parallel channels, the peak rate of the interference cancelling scheme can be reduced to $\frac{1}{L}$ of that in a single server system, i.e. the following (M, R) pair is achievable:

$$\left(\frac{t[(N-1)t + K - N]}{K(K-1)},\ \frac{N(K-t)}{LK}\right), \quad t = 0, 1, \ldots, K. \tag{2.10}$$

This holds regardless of whether each file is spread across servers (striping) or stored as a single block in one server.

Proof Section 2.3 showed that striping the files across L servers reduces the peak rate of the interference cancelling scheme to $\frac{1}{L}$ of that in a single server system. In contrast to Maddah's scheme, the interference cancelling scheme sends the same number of segments from each file, regardless of the users' requests. Moreover, each message consists of a combination of segments from a single file [5]. Therefore, the same messages can be transmitted even if different files are stored in different servers. Each server will need to transmit a fraction $\frac{1}{L}$ of the messages, since it will be storing that same fraction of the files. The peak load can then be reduced to $\frac{1}{L}$ of that in Eq. (2.4).

■

If there are parity servers, we can further reduce the transmission rate by regarding them as an extension of the users' cache. Section 2.2.3 explained that in the interference elimination algorithm [5], each user caches the parity symbols resulting from encoding a set of segments with a systematic MDS code C(P0, P). It is possible to pick the code in such a way that some of these parity symbols can be found as combinations of the information stored in servers P and Q. Then, instead of storing them in the user's cache, they are discarded. Those that are needed in the delivery phase will be transmitted by the parity servers.

For example, parity server P stores the horizontal sum of the files, so it can transmit messages of the form:

$$\sum_{i=1}^{N/L} \sum_{j=1}^{\binom{K-1}{t-1}} \lambda_{ij}\left(A_i^{s_j} + B_i^{s_j} + \ldots + L_i^{s_j}\right),$$

with arbitrary coefficients $\lambda_{ij}$ for any user set $s_j$. This corresponds to a linear combination of all the segments in Eq. (2.3). Similarly, parity server Q can transmit some other linear combinations of the segments which can also work as components of an MDS code. This effectively increases the size of the cache memories by $M'$ file units, corresponding to the amount of information that the parity servers can afford to send each user during the delivery phase.

Theorem 2.5.2 If there are $\eta$ parity servers and $K \geq N$, the following (M, R) pairs are achievable for $t = 0, 1, \ldots, K$:

$$\left(\frac{t[(N-1)t + K - N]}{K(K-1)} - \eta\frac{N(K-t)}{LK^2},\ \frac{N(K-t)}{LK}\right).$$

Proof The information sent by the parity server is bounded by the peak rate of the data servers, i.e. $\frac{N(K-t)}{LK}$ according to Eq. (2.10). Assuming a worst case scenario, each transmission from a parity server will benefit a single user. Therefore, each parity server can effectively increase the cache of each user by $M' = \frac{N(K-t)}{LK^2}$.

■
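As a small numerical aid, the (M, R) pairs of Theorems 2.5.1 and 2.5.2 can be tabulated directly. The sketch below evaluates the formulas exactly as stated, without memory sharing between points or any clipping of M; the function name `mr_pair` and the sample parameters (N = 12, K = 15, L = 4) are illustrative choices.

```python
from fractions import Fraction

def mr_pair(N: int, K: int, L: int, t: int, eta: int = 0) -> tuple:
    """(M, R) pair of Theorem 2.5.2; eta = 0 gives Eq. (2.10)."""
    rate = Fraction(N * (K - t), L * K)
    # Cache saving credited to each parity server in the proof.
    saving_per_parity = Fraction(N * (K - t), L * K * K)
    memory = Fraction(t * ((N - 1) * t + K - N), K * (K - 1)) - eta * saving_per_parity
    return memory, rate

# Without parity servers, t = 0 needs no cache and gives R = N / L.
assert mr_pair(12, 15, 4, t=0) == (Fraction(0), Fraction(3))

# Two parity servers leave the rate unchanged and lower the required memory.
m0, r0 = mr_pair(12, 15, 4, t=1, eta=0)
m2, r2 = mr_pair(12, 15, 4, t=1, eta=2)
assert r0 == r2 == Fraction(14, 5)
assert m0 - m2 == 2 * Fraction(12 * 14, 4 * 15 * 15)
```

Sweeping t from 0 to K and plotting the resulting pairs gives the corner points between which memory sharing interpolates.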


This memory sharing strategy provides significant improvement when the cache capacity is small. Fig. 2.2 shows the performance for K = 15 users and N = 12 files stored in L = 4 data servers. When the cache size is small, the peak rate of the system with two parity servers is much lower than that without parity servers. As the cache grows the advantage of the system with parity servers becomes less clear.

[Figure 2.2: plot of peak rate R versus cache size M for the system with no parity servers and the system with 2 parity servers.]

Figure 2.2.: Comparison of the performance between the multi-server system without parity servers and the system with two parity servers.

The interference elimination scheme is specially designed for the case with fewer files than users ($N \leq K$) in the single server system. However, since the peak load is reduced to $\frac{1}{L}$ of its single server value in a multi-server system, the interference elimination scheme might also have good performance when $N > K$ if L is large. In order to apply the algorithm, we can just add $N - K$ dummy users with arbitrary requests. Then, we have the following corollary from Theorem 2.5.2:

Table 2.7.: Normalized peak rate of Scheme 1.

server system        Normalized peak rate
single server        $R_C(K, t) = \binom{K}{t+1} / \binom{K}{t}$
L data, 1 parity     $\frac{L-1}{L} R_C(K, t)$
L data, 2 parity     $\left(\frac{1}{2} + \frac{L-2}{2L+4}\Delta\right) R_C(K, t)$  $(\Delta \leq \frac{1}{3})$

Table 2.8.: Normalized (M, R) pair of Scheme 2. ($\eta$ is the number of parity servers.)

server system                        Normalized (M, R)
single server                        $\left(\frac{t[(N-1)t+K-N]}{K(K-1)},\ \frac{N(K-t)}{K}\right)$
L data, $\eta$ parity ($K \geq N$)   $\left(\frac{t[(N-1)t+K-N]}{K(K-1)} - \eta\frac{N(K-t)}{LK^2},\ \frac{N(K-t)}{LK}\right)$
L data, $\eta$ parity ($K \leq N$)   $\left(\frac{t^2}{N} - \eta\frac{N-t}{LN},\ \frac{N-t}{L}\right)$

Corollary 2.5.2.1 If there are $\eta$ parity servers and $K \leq N$, the following (M, R) pairs are achievable:

$$\left(\frac{t^2}{N} - \eta\frac{N-t}{LN},\ \frac{N-t}{L}\right), \quad t = 0, 1, \ldots, N.$$

2.6 Simulations
This section compares all the schemes studied in this chapter, for a system with N = 20 files stored in L = 4 data servers with 5 files each. We show that striping has better performance than the schemes in sections 2.4 and 2.5 (Scheme 1 and Scheme 2, respectively) at the cost of network flexibility. If each file is stored as a single block in one server, Scheme 2 has better performance when the cache capacity is small while Scheme 1 is more suitable for the case where the cache capacity is large. The performances of Scheme 1 and Scheme 2 are summarized in Table 2.7 and Table 2.8, respectively.

Fig. 2.3 and Fig. 2.4 focus on the case with one and two parity servers, respectively. We assume that there are K = 15 users, thus there are more files than users, with varying cache capacity. We observe that striping provides lower peak rates than storing whole files, as expected. Additionally, since $N > K$, the interference elimination scheme always has worse performance than Maddah's scheme when striping is used. Without striping, Scheme 2 provides lower peak rate than Scheme 1 when the cache capacity is small, and it is the other way around when the capacity is large.

[Figure 2.3: plot of peak rate R versus cache size M for striping with Maddah's scheme, striping with interference elimination, Scheme 1 and Scheme 2.]

Figure 2.3.: Comparison of the performance of Scheme 1 and Scheme 2 in a one parity server system when N = 20 and K = 15.

Then Fig. 2.5 and Fig. 2.6 compare the performance of Scheme 1 and Scheme 2 when there are more users (K = 60) than files, for the one and two parity cases, respectively. As shown in Fig. 2.5 and Fig. 2.6, striping has a lower rate than storing whole files and, when the cache capacity is very small, striping with interference elimination has better performance than striping with Maddah's scheme. For Scheme 1 and Scheme 2, when the cache capacity is small, Scheme 2 provides lower peak rate, while when the cache capacity increases, Scheme 1 has better performance. Moreover, we notice that the curves intersect at a point with larger M than they did in Fig. 2.3 and Fig. 2.4, which means that Scheme 2 is preferable over a wider range of cache sizes when there are more users than files.

[Figure 2.4: plot of peak rate R versus cache size M for striping with Maddah's scheme, striping with interference elimination, Scheme 1 and Scheme 2.]

Figure 2.4.: Comparison of the performance of Scheme 1 and Scheme 2 in a two parity server system when N = 20 and K = 15.

2.7 Summary
This chapter proposes coded caching algorithms for reducing the peak data rate in multi-server systems with distributed storage and different levels of redundancy. It shows that, by striping each file across multiple servers, the peak rate can be reduced proportionally to the number of servers. Then it addresses the case where each file is stored as a single block in one server and proposes different caching and delivery schemes depending on the size of the cache memories.

[Figure 2.5: plot of peak rate R versus cache size M for striping with Maddah's scheme, striping with interference elimination, Scheme 1 and Scheme 2.]

Figure 2.5.: Comparison of the performance of Scheme 1 and Scheme 2 in a one parity server system when N = 20 and K = 60.

Distributed storage systems generally use MDS codes across the servers to protect the information against node failures. The coded caching schemes proposed in this chapter are able to leverage that redundancy in creative ways to reduce the achievable traffic peak rate. The results for Scheme 1 and Scheme 2 are shown in Table 2.7 and Table 2.8 respectively.

This chapter proposed methods to reduce the load on the links between servers and users, which is the most common bottleneck for system performance. However, there are cases in which the server I/Os, not the overall traffic on the links, are the limiting parameter. The next chapter will study the trade-off between network traffic load and disk I/Os.

[Figure 2.6: plot of peak rate R versus cache size M for striping with Maddah's scheme, striping with interference elimination, Scheme 1 and Scheme 2.]

Figure 2.6.: Comparison of the performance of Scheme 1 and Scheme 2 in a two parity server system when N = 20 and K = 60.

3. TRAFFIC LOAD-I/O TRADE-OFF FOR CACHING

3.1 Introduction
Users and applications now demand data access at higher speed and lower latency, which poses challenges to both networks and devices. This chapter addresses two performance bottlenecks of storage systems: the number of read and write operations (disk I/Os) and the amount of data transferred (transfer load).

Disk I/Os are a valuable resource. Many applications are I/O bound: they serve a huge number of user requests and perform intensive computations. A significant amount of research has gone into coding techniques to minimize disk I/Os in storage systems [9, 10].

Transfer load (or traffic) is another dominant factor in slow or congested networks. Caching has been investigated as a useful technique to relieve peak traffic by prefetching contents during off-peak hours. A caching scheme has two phases: placement and delivery. In the placement phase, the users have access to all files to fill their caches. In the delivery phase, only the server has database access and it delivers messages to the users to fulfill their requests. In [1], Maddah-Ali and Niesen proposed a caching and delivery scheme offering a worst case performance within a constant factor of the information-theoretic optimum, for a system with a single server broadcasting to multiple users and uniform file popularity. Inspired by their work, [40, 41] studied its average performance and the case with random demands. Further works improved the delivery scheme by exploiting commonality among users' demands [6, 11, 42] and introduced a decentralized version [43].

This chapter focuses on the same system, illustrated in Fig. 3.1. Maddah-Ali and Niesen's coded caching scheme in [1] (henceforth denoted "M-N scheme") has the lowest peak traffic load in the literature and its extension by Yu et al. [11] (henceforth denoted "Yu's scheme") is proved to achieve the best average traffic load with uncoded prefetching. However, their I/O performance is suboptimal when there are redundant user demands. The same segment could be read multiple times if it is used to construct different messages, which dramatically increases I/O reads. In contrast, if all messages are transmitted uncoded, each data segment requested is read once and broadcast to all users. Inspired by this fact, we study the trade-off between traffic load and I/O by designing algorithms which combine coded and uncoded transmission. To the best of our knowledge, this is the first work which studies the I/O performance for coded caching.

The rest of this chapter is organized as follows. Section 3.2 introduces the system model and the traditional uncoded and coded caching schemes. Section 3.3 proposes two algorithms which study the trade-off between traffic load and I/O access. Section 3.4 provides simulations to support and illustrate our algorithms and Section 3.5 concludes the chapter.

[Figure 3.1: a server holding N files, connected through a shared link to K users, each with a cache of size M.]

Figure 3.1.: Caching system considered in this chapter.

3.2 Background
3.2.1 System Model
Our system model is identical to the one in [1], shown in Fig. 3.1: a single server is connected to K users through a shared broadcast link and stores N files of size F bits with uniform popularity. Each user has a cache of size MF bits.

Users fill their caches during the placement phase and then independently request a file in the delivery phase. We denote these requests by $d = \{d_1, \ldots, d_K\}$, where $d_k$ is the index of the file requested by user k. d is uniformly distributed over $\{1, \ldots, N\}^K$ and the number of distinct files requested is denoted by $D = N_e(d)$.

The server must then fulfil those requests. We wish to study the trade-off between the resulting load on the shared link and the disk I/O. Disks are read one page at a time, all of the same size [44]. Therefore, the disk I/O is approximately proportional to the total data read. Moreover, if the same file segment is used to construct k messages, we assume that it needs to be read k times. Denoting the load on the shared link by $R_t$ and the total data read by $R_{IO}$ (both normalized by the file size), the objective is to design an algorithm to minimize the cost

$$R_{cost}(\alpha, d) = \alpha R_t(\alpha, d) + (1-\alpha) R_{IO}(\alpha, d), \tag{3.1}$$

where $\alpha \in [0, 1]$ is the trade-off coefficient. Although this chapter will focus on $R_{cost}(\alpha, d)$, the proposed algorithms could be easily applied to other cost functions. Also, we denote the expected cost over all users' requests as

$$R_{cost}(\alpha) = \alpha E_d[R_t(\alpha, d)] + (1-\alpha) E_d[R_{IO}(\alpha, d)]. \tag{3.2}$$

3.2.2 Uncoded scheme
In the uncoded scheme, every user caches the same $M/N$ fraction of each of the N files. In the delivery phase, the server sends plain missing segments to all users. Since each data segment is read once and all of those segments need to be transmitted, the normalized I/O $R_{IO}^u$ is identical to the normalized traffic load of the shared link $R_t^u$:

$$R_{IO}^u(d) = R_t^u(d) = N_e(d)\, g, \tag{3.3}$$

where $g = 1 - \frac{M}{N}$ is the local caching gain. Each file is requested with probability $p_r = 1 - (1 - \frac{1}{N})^K$, so

$$R_{cost}^u = R_t^u = R_{IO}^u = N p_r g. \tag{3.4}$$
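Eq. (3.4) can be evaluated and cross-checked by brute force for small systems. The sketch below is illustrative only: the function names are hypothetical and the enumeration simply averages $N_e(d)\,g$ over all $N^K$ equally likely demand vectors.

```python
from itertools import product

def uncoded_cost(N: int, K: int, M: float) -> float:
    """Expected cost of the uncoded scheme, Eq. (3.4): N * p_r * g."""
    g = 1 - M / N                  # local caching gain
    p_r = 1 - (1 - 1 / N) ** K     # probability that a given file is requested
    return N * p_r * g

def uncoded_cost_bruteforce(N: int, K: int, M: float) -> float:
    """Average of N_e(d) * g, Eq. (3.3), over all equally likely demands d."""
    g = 1 - M / N
    demands = list(product(range(N), repeat=K))
    return sum(len(set(d)) * g for d in demands) / len(demands)

# Two files, two users and M = 1/2 give 2 * 0.75 * 0.75.
assert abs(uncoded_cost(2, 2, 0.5) - 1.125) < 1e-12
assert abs(uncoded_cost(3, 4, 1.0) - uncoded_cost_bruteforce(3, 4, 1.0)) < 1e-12
```

Because the uncoded scheme reads each transmitted segment exactly once, the same value serves as traffic load, I/O and cost for every $\alpha$.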

Lemma 3.2.1 The uncoded scheme is optimal in terms of expected data read among all schemes with uncoded prefetching, i.e., for any other scheme with uncoded prefetching $R_{IO}$ will be no lower than that in Eq. (3.4).

Proof Denote by $m_{ij}$ the fraction of file j being cached by user i, for $i = 1, \ldots, K$ and $j = 1, \ldots, N$. Given a list of demands d, let $U = \{u_1, \ldots, u_{N_e(d)}\}$ be an arbitrary subset of users requesting distinct files $\{f_1, \ldots, f_{N_e(d)}\}$. Then

$$R_{IO}(d) \geq \sum_{i=1}^{N_e(d)} (1 - m_{u_i f_i}), \tag{3.5}$$

since the total data read cannot be lower than that delivered. Each user $u_i \in U$ has probability $\frac{1}{N}$ of requesting file j, so the average I/O for user $u_i$ is at least $\frac{1}{N}\sum_{j=1}^{N}(1 - m_{u_i j})$. Since $\sum_{j=1}^{N} m_{u_i j} = M$, the average I/O for $u_i$ is at least $\frac{1}{N}(N - M) = g$. Combined with Eq. (3.5), $R_{IO}$ is bounded by:

$$R_{IO} \geq E_d[N_e(d)\, g] = N p_r g. \tag{3.6}$$

■
According to Lemma 3.2.1, the uncoded scheme achieves the best average I/O performance with uncoded prefetching. However, the I/O could be even lower with coded prefetching [45]. For example, consider a system with 2 files (A, B), 2 users, and cache size $M = \frac{1}{2}$. The uncoded scheme yields $R_{IO}^u = 1.125$, but dividing each file into 2 segments of the same size ($A_1, A_2, B_1, B_2$) and caching $A_i \oplus B_i$ at user i ($i = 1, 2$) would only require $R_{IO} = 1$, regardless of the requests (e.g., if user 1 requests A and user 2 requests B, then the server only needs to transmit $A_2$, $B_1$). However, the I/O for coded prefetching is a complex problem beyond the scope of this thesis. Instead, we focus our discussion on uncoded prefetching.
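The two-file coded prefetching example can be checked exhaustively. The sketch below is a toy model under stated assumptions: file halves are random byte strings, users are indexed from 0, and the helper names `deliver` and `decode` are hypothetical. The delivery rule for equal requests (send both halves of the requested file) is one choice that is consistent with the claim that one file unit of reads suffices.

```python
import os
from itertools import product

HALF_LEN = 8  # hypothetical size of half a file, in bytes

def xor(x: bytes, y: bytes) -> bytes:
    return bytes(a ^ b for a, b in zip(x, y))

# files["A"] = (A_1, A_2) and files["B"] = (B_1, B_2), stored 0-indexed.
files = {name: (os.urandom(HALF_LEN), os.urandom(HALF_LEN)) for name in "AB"}
# User i caches A_i XOR B_i, which is half a file unit (M = 1/2).
caches = [xor(files["A"][i], files["B"][i]) for i in range(2)]

def deliver(requests):
    """Plain half-segments (file, half index) that the server reads and sends."""
    first, second = requests
    if first == second:
        return [(first, 0), (first, 1)]
    # E.g. requests (A, B): send A_2 and B_1, as in the text.
    return [(first, 1), (second, 0)]

def decode(user, want, sent):
    """Rebuild both halves of the wanted file from the broadcast and the cache."""
    received = {key: files[key[0]][key[1]] for key in sent}
    other = "B" if want == "A" else "A"
    halves = []
    for h in range(2):
        if (want, h) in received:
            halves.append(received[(want, h)])
        else:
            # The missing half is unlocked by cancelling the other file's half.
            halves.append(xor(caches[user], received[(other, h)]))
    return tuple(halves)

for requests in product("AB", repeat=2):
    sent = deliver(requests)
    assert len(sent) == 2  # two half-segments read, i.e. R_IO = 1 file unit
    for user, want in enumerate(requests):
        assert decode(user, want, sent) == files[want]
```

All four request combinations are served with two half-segment reads, compared with the 1.125 average of the uncoded scheme.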

3.2.3 Coded scheme
The centralized coded caching scheme proposed by Maddah-Ali and Niesen [1] splits each file into $\binom{K}{t}$ nonoverlapping segments of equal size, with $t = \frac{KM}{N}$, and caches each segment in a distinct group of t users. In the delivery phase, the server sends one message to each subset of $t+1$ users, for a total of $\binom{K}{t+1}$ messages. Each message is composed as the XOR of the $t+1$ segments requested by one user and cached by the others. Each user can then cancel out the segments that it already has in its cache to recover the desired segment. This algorithm has the best normalized traffic load in the worst case, i.e. when all users request distinct files. When $N \geq K$, the normalized rate $R_t^m$ is

$$R_t^m = \binom{K}{t+1} \Big/ \binom{K}{t} = Kg/(1 + KM/N), \tag{3.7}$$

where $g = 1 - \frac{M}{N}$. Each message is the XOR of $t+1$ segments, thus the normalized I/O is $R_{IO}^m = Kg$.
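The placement and the rate identity in Eq. (3.7) can be illustrated with a few lines of code. This is a minimal sketch assuming that $KM/N$ is an integer; the function names are hypothetical and segments are represented only by the set of users caching them.

```python
from itertools import combinations
from math import comb

def mn_placement(K: int, t: int) -> list:
    """Label each segment of a file by the t-subset of users that cache it."""
    return list(combinations(range(K), t))

def mn_rate(K: int, N: int, M: int) -> float:
    """Normalized worst case rate: one message per (t+1)-subset of users."""
    t = K * M // N  # assumes KM/N is an integer
    return comb(K, t + 1) / comb(K, t)

K, N, M = 4, 4, 2
t = K * M // N
segments = mn_placement(K, t)

# Each file has C(4, 2) = 6 segments and every user caches 3 of them (M/N = 1/2).
assert len(segments) == 6
assert all(sum(user in s for s in segments) == 3 for user in range(K))

# Closed form on the right-hand side of Eq. (3.7).
g = 1 - M / N
assert abs(mn_rate(K, N, M) - K * g / (1 + K * M / N)) < 1e-12
# Each message XORs t + 1 segments, so the normalized I/O is K * g.
assert abs((t + 1) * mn_rate(K, N, M) - K * g) < 1e-12
```

The message for a user subset S combines, for every user k in S, the segment of that user's requested file labelled by S without k.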

Yu's scheme and our own research [6, 11] extended this work to the case with redundant requests and more general values of N and K. It uses the same placement as [1]. As for the delivery, the server picks $N_e(d)$ "leader" users requesting distinct files and only sends messages to subsets of $t+1$ users containing at least 1 leader. The corresponding rate $R_t^c$ is:

$$R_t^c(d, t) = \frac{\binom{K}{t+1} - \binom{K - N_e(d)}{t+1}}{\binom{K}{t}}. \tag{3.8}$$

This extension is shown to achieve the best average traffic load with uncoded prefetching. Since each message is the XOR of $t+1$ segments, the data read is $(1+t)R_t^c(d, t)$ and the average cost $R_{cost}^c$ is:

$$R_{cost}^c(\alpha, t) = E_d\left[\alpha R_t^c(d, t) + (1-\alpha)(1+t) R_t^c(d, t)\right]. \tag{3.9}$$

Further extensions proposed a decentralized version of the algorithm [11, 40] without coordination in the content placement: users randomly cache a subset of $\frac{MF}{N}$ bits from every file. The delivery phase takes a K-step greedy approach: for bits which are stored in exactly i users ($i = 0, \ldots, K-1$), it constructs messages by XORing $i+1$ segments, similarly to the centralized scheme.

Both M-N and Yu's schemes have optimal I/O in the worst case (i.e., no repeated requests), given by Lemma 3.2.1 ($N_e(d) = K$ in Eq. (3.6)). Moreover, both schemes have the best peak traffic load (hence the best worst case $R_{cost}(\alpha)$) in the literature. They are the basis for the algorithms proposed in this chapter. It is therefore recommended that readers have a clear understanding of both M-N and Yu's schemes before proceeding.

3.3 General Algorithms
In this section, we propose algorithms aiming at minimizing the cost functions in Eq. (3.1) and Eq. (3.2). Subsection 3.3.1 introduces an algorithm with the same placement as M-N and Yu's scheme, to maintain optimal performance in the worst case, and an adaptive delivery algorithm to further reduce $R_{cost}(d)$ in the case with redundant requests. Subsection 3.3.2 sacrifices worst case performance to improve it in the average case. It introduces a new placement algorithm that yields a lower average $R_{cost}$ than both the coded and uncoded schemes.

3.3.1 Adaptive delivery
As mentioned in section 3.2.3, the coded caching scheme has the best $R_{cost}$ in the worst case. However, it could be further reduced when some requests are redundant by sending some segments uncoded. Take the following case as an example.

Example 3.3.1 Consider a server with 4 files (denoted A, B, C and D), 4 users with a normalized cache size M = 2, and a trade-off parameter α = 0.2. In the placement phase, file A is split into 6 segments (denoted $A_1, A_2, \ldots, A_6$) and each segment is cached by 2 users. The same goes for files B, C, D. Table 3.1 indicates the indices of the 3 segments that each user stores, assumed to be the same for all files without loss of generality. Let the requests be C, B, A, A.

Table 3.1.: Mapping of file segments to user caches. Each cache stores the same three segments for every file, marked with X (K = 4, N = 4, M = 2).

user\segment   1   2   3   4   5   6   request
1              X   X   X   -   -   -   C
2              X   -   -   X   X   -   B
3              -   X   -   X   -   X   A
4              -   -   X   -   X   X   A

In the delivery phase, we can easily derive that $R_{cost}^c = 1.73$ according to Eq. (3.9). In this example, we notice that $A_1$ is needed by users 3 and 4. The messages containing $A_1$ are $A_1 \oplus B_2 \oplus C_4$ and $A_1 \oplus B_3 \oplus C_5$. If we transmit $A_1$ uncoded along with $B_2 \oplus C_4$ and $B_3 \oplus C_5$ instead, the users are still able to recover the requested files. The traffic load is higher but the I/O is reduced. The resulting $R_{cost} = 1.63$ is better than both M-N and Yu's schemes.
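The two cost values in Example 3.3.1 can be reproduced with exact fractions. The sketch below evaluates the cost for the single demand vector of the example rather than an expectation, takes the data read of the coded scheme as $(t+1)$ times its traffic load, and then applies the one-segment change described in the example; the function name `yu_rate` is hypothetical.

```python
from fractions import Fraction
from math import comb

def yu_rate(K: int, t: int, n_distinct: int) -> Fraction:
    """Normalized traffic load of Eq. (3.8) for n_distinct requested files."""
    return Fraction(comb(K, t + 1) - comb(K - n_distinct, t + 1), comb(K, t))

K, N, M = 4, 4, 2
alpha = Fraction(1, 5)
t = K * M // N
requests = ["C", "B", "A", "A"]

# Coded delivery: 4 messages of 3 segments each, with 6 segments per file.
r_t = yu_rate(K, t, len(set(requests)))        # 2/3
r_io = (t + 1) * r_t                           # 2
coded_cost = alpha * r_t + (1 - alpha) * r_io  # 26/15, about 1.73

# Sending A_1 uncoded adds one transmission and saves one read of A_1.
segment = Fraction(1, comb(K, t))
adaptive_cost = alpha * (r_t + segment) + (1 - alpha) * (r_io - segment)  # 49/30

assert coded_cost == Fraction(26, 15)
assert adaptive_cost == Fraction(49, 30)
```

Rounded to two decimals these are the 1.73 and 1.63 quoted in the example.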
The general algorithm is shown in the following lemma.

Lemma 3.3.1 If a segment is requested by more than $\frac{1}{1-\alpha}$ users, then $R_{cost}(\alpha, d)$ can be reduced by transmitting it uncoded.

Proof If a segment is requested by j users ($j = 1, \ldots, K$), transmitting it uncoded increases $R_t$ to $R_t' = R_t + 1/\binom{K}{t}$ and decreases $R_{IO}$ to $R_{IO}' = R_{IO} - (j-1)/\binom{K}{t}$. Therefore, $R_{cost}' = \alpha R_t' + (1-\alpha) R_{IO}'$ as defined in Eq. (3.1) is lower than $R_{cost}$ when $j > \frac{1}{1-\alpha}$.

■

In the worst case, each segment is requested by only one user. Our adaptive delivery scheme is then identical to M-N and Yu's scheme, ensuring an optimal $R_{cost}$ in the worst case. This adaptive delivery algorithm can also be easily extended to the decentralized scheme in [11, 43], following the same principle: the bits which are requested by more than $\frac{1}{1-\alpha}$ users are sent uncoded and all the others coded. Fig. 3.2 compares the average $R_{cost}$ for M-N scheme, Yu's scheme and the adaptive algorithm proposed in this chapter both for the centralized and decentralized settings. It shows that, when α is small, the adaptive algorithm has a much better performance than M-N scheme and Yu's scheme both in the centralized and decentralized cases.

[Figure 3.2: two panels (Centralized, Decentralized) plotting the average $R_{cost}(\alpha)$ against α for the Yu, Adaptive and M-N schemes.]

Figure 3.2.: Comparison of adaptive and coded schemes for varying trade-off coefficient α (K = 10, N = 10, M = 2).

3.3.2 Partial Caching

The adaptive delivery scheme used the same placement as [1] to ensure optimal $R_{cost}$ in the worst case. This subsection proposes an algorithm seeking a lower average $R_{cost}(\alpha)$ as defined in Eq. (3.2), at the cost of suboptimal worst case performance. Intuitively, when α is very small, $R_{cost}(\alpha)$ mainly depends on I/O, so we should transmit all the requested segments uncoded; vice versa, when α is close to 1, we should transmit all the requested segments coded. Inspired by this fact, if only a portion $p \in [0, 1]$ of every file is cached at the users and transmitted coded, while the rest is always transmitted uncoded, we expect a better average performance for intermediate values of α.

The general algorithm is as follows. In the placement phase, we choose a fraction p of each file to be cached at the users' end. This portion is divided into segments as in the coded scheme, with $t' = \frac{KM}{Np}$, and the rest (of size $(1-p)F$) is not cached. In the delivery phase, the cached part is transmitted coded using Yu's scheme and the uncached portion is transmitted uncoded. The fraction p is optimized to minimize $R_{cost}$:

$$p = \arg\min_p \left((1-p) R_{cost}^u(\alpha) + p R_{cost}^c(\alpha, t')\right),$$

where $t' = \frac{KM}{Np}$, $R_{cost}^u$ is defined in Eq. (3.4) and $R_{cost}^c$ is defined in Eq. (3.9). This algorithm can be easily extended to the decentralized case by caching a random portion p of each file at each user, employing the decentralized transmission strategy mentioned in section 3.2.3 for these portions, and transmitting the uncached portions uncoded.

The coded, uncoded and partial caching schemes are compared in Fig. 3.3 for both the centralized and decentralized cases. When α is small, the partial caching scheme takes p = 0 and transmits all the segments uncoded; vice versa, when α is close to 1, the algorithm takes p = 1 and it is equivalent to Yu's scheme. For intermediate values of α, partial caching offers lower $R_{cost}$ than both the coded and uncoded schemes.

Figure 3.3.: Comparison of partial caching, coded and uncoded schemes for varying α (K = 8, N = 8, M = 2). [Panels: Centralized, Decentralized; x-axis: α; y-axis: Rcost(α); curves: M-N, Yu, Uncoded, Partial.]

3.4 Simulations
This section compares the proposed algorithms with traditional schemes through simulations. It fixes the number of files as N = 8 and studies the trade-off between traffic load and I/O by varying the cache size M and the number of users K.
Fig. 3.4 compares the performance of the adaptive scheme and Yu's scheme with trade-off parameter α = 0.3 when the cache size M changes. It shows that the adaptive scheme has an advantage over Yu's scheme in terms of Rcost when M is small, but the gap closes as M increases. Moreover, the figure presents results for 4 and 8 users, showing that the gains are more prominent as the number of users increases. This is because both a small cache size and more users increase the chance that a segment is requested by multiple users.
Fig. 3.5 compares the performance of the coded, uncoded and partial caching schemes for different numbers of users. As mentioned in Section 3.3.2, when the trade-off parameter α is small, we prefer to use the uncoded scheme to minimize I/O. As α increases, the portion p of each file that is transmitted coded also increases. The threshold α for which p is no longer 0 is bigger for the system with 8 users than for the system with 4 users. This is because when there are more users, the probability that some users request the same file increases, which benefits the uncoded transmission.

Figure 3.4.: Comparison of adaptive and coded schemes for varying cache capacity and users (N = 8, α = 0.3). [Panels: Centralized, Decentralized; x-axis: M; y-axis: Rcost(0.3); curves: Yu and Adaptive for K = 8 and K = 4.]

Figure 3.5.: Comparison of partial caching, coded and uncoded schemes for different number of users (N = 8, M = 2). [Panels: Centralized, Decentralized; x-axis: α; y-axis: Rcost(α); curves: Yu, uncoded and partial for K = 8 and K = 4.]

3.5 Summary
This chapter proposes algorithms to study the trade-off between traffic load and I/O for coded caching in both the centralized and decentralized settings. Reading a file segment multiple times to compose different messages can be suboptimal when I/O is considered. The proposed algorithms strike a balance between coded and uncoded transmissions, showing better performance than traditional schemes both in the worst case and the average case.
Besides network protocols, as we studied in Chapter 2 and Chapter 3, storage hardware is another constraint which limits the system performance. In the next two chapters, we will focus on another storage system, non-volatile memories, and study how to utilize signal processing approaches to improve the reliability.

4. SIGNAL PROCESSING FOR NAND FLASH MEMORIES

4.1 Introduction
NAND Flash is a non-volatile memory technology which offers significantly higher speeds and power efficiency than hard drives, but its higher cost is still an obstacle for its widespread use. In order to reduce the cost, manufacturers are scaling the technology and trying to pack more bits in each cell. One of the main problems that NAND flash memories are facing today is the reliability of the stored information [46].
A flash cell is a floating gate transistor whose threshold voltage can be adjusted by injecting charges into its floating gate. Information is stored by setting this voltage threshold to specific values. In its simplest form, one bit is stored in each cell, depending on whether it is charged or discharged. Memories of this type are known as SLC. In order to increase the capacity (and reduce their cost accordingly) most applications now use MLC memories, which can be programmed to four different voltage levels and store two bits in each cell. Some manufacturers have gone even further, producing memories which store three (TLC) or even four bits in each cell [47, 48].
As Flash memory technology scales and more bits are stored in each cell, the signal to noise ratio observed in the programmed voltages decreases. One of the main sources of noise, which is becoming increasingly important as the technology scales and is expected to get even worse for the forthcoming 3D flash structures, is inter-cell interference (ICI) [13, 49]. The shift in threshold voltage of one cell can change the threshold voltage of its neighbors due to the parasitic capacitance-coupling effect [50]. Extensive measurements have shown that the ICI noise created by a cell is proportional to the voltage to which it is being programmed [51]. Other sources of noise include Gaussian noise, caused by overprogramming and charge leakage, and impulse noise, caused by defective or broken cells [52].

Additionally, flash cells have a limited lifetime. Before data can be written to a page¹, the block must have been erased (i.e., all the cells need to be discharged). The tunneling of charges into and out of the floating gate causes damage to the dielectric barrier that holds the charges, limiting the range of programmed voltages and the number of times that each cell can be written. The amount of damage that a cell suffers in a single write operation increases super-linearly with the programmed voltage [14]. Hence, writing data patterns that are represented by a lower threshold voltage could prolong the lifetime of the flash [53-56].
This chapter proposes two new signal processing methods. The definitions for page, block, and read threshold will be given in Section 4.2, together with some necessary background on NAND flash memories. Then Section 4.3 studies a new read method, which we call multi-page read, that can help alleviate some of the challenges that the flash memory industry is facing. A multi-page read operation selects multiple pages in a block, biases them with different read thresholds, and returns a combination of their stored information. Section 4.4 proposes a new data representation scheme which increases endurance and significantly reduces the probability of error caused by inter-cell interference. The method is based on using an orthogonal code to spread each bit across multiple cells, resulting in lower variance for the voltages being programmed in the cells. This new data representation method is also shown to present many of the advantages that spreading sequences bring to wireless communications. For example, multiple information sequences can be written on the same cells at different times without interfering with each other. It also allows storing additional information on an already programmed memory in such a way that the new information is hidden by the noise. Finally, Section 4.5 summarizes the chapter. The results of this chapter are published in [17-19].

¹ Cells in a NAND flash are grouped into pages, which are the smallest unit for write and read operations. Pages are grouped into blocks, which are the elementary unit for erase operations.

4.2 Background
A flash cell, illustrated in Fig. 4.1, is a floating gate transistor whose threshold voltage can be adjusted by Fowler-Nordheim (FN) tunneling [57] of charge into or out of the floating gate. If the control gate voltage is greater than this threshold, the cell opens a channel between the drain and the source and we say that the cell conducts. Otherwise, the cell acts as an open circuit and the cell does not conduct.

Figure 4.1.: Floating gate transistor structure. [Labels: wordline, control gate, dielectric, floating gate, tunnel oxide, bitline, source, drain, P-substrate.]

NAND flash memories organize cells in array structures known as blocks, like the one shown in Fig. 4.2. We refer to each row of cells in a block as a wordline and to each column as a bitline. A page is a logical structure that includes one bit from each cell in a wordline. SLC memories have one page per wordline, MLC memories have two, and TLC have three. Blocks are the elementary unit for erase operations, but reads and writes can be done at a page granularity. This wordline-bitline structure allows programming or reading all the cells in a page in parallel, as described below.

Figure 4.2.: Bitline-Wordline structure of NAND flash memory. [Labels: bit-lines, word-lines, page, block, source line.]

Program operation
The programming is done by sending high voltage pulses into one wordline and biasing all other wordlines so that their cells conduct. Cells in the selected wordline with grounded bitlines experience a high electric field across the floating gate and the substrate, triggering the FN tunneling. After each pulse, a verify read is performed and cells which have reached the desired level of charge are inhibited from further programming. This can be done by biasing their bitlines to a high voltage. This programming method is called the ISPP algorithm [58]. For MLC cells, the programming includes two stages: the LSB programming leaves the cell either erased or at half its maximum charge, and the MSB programming does the fine adjustment of the final voltage. The number of electrons injected into the floating gate is determined by both the LSB and MSB bit values [59].

Erase operation
The erasing of a block follows a similar process to the programming, but uses negative pulses to remove charges from the floating gate instead. Additionally, stronger and fewer pulses are used since there is little harm in over-erasing the cells. Individual pages cannot be erased independently because a dielectric breakdown may occur due to the interference between wordlines [60].

Read operation
The threshold voltage of the cells cannot be read directly; it can only be compared with an adjustable reference (read voltage). Pages are read by biasing one wordline with this read voltage l while all others are set to a high voltage Vpass (l < Vpass) so that their cells conduct. Cells in the selected wordline whose threshold voltage is below the read voltage l also conduct, causing the discharge of a capacitor through the bitline, whereas cells with higher threshold voltage act as an open circuit, not letting the current through. By sensing which capacitors got discharged, many bitlines can be read in parallel.

4.3 Multi-page Read for NAND Flash
This section explains the multi-page read method and it is structured as follows: Subsection 4.3.1 describes the multi-page read and explains how it should be implemented. Then, Subsection 4.3.2 provides examples where the multi-page read can help improve the reliability, speed, and endurance of NAND flash memories.

4.3.1 Multi-page Read Method
The read operation described above provides one bit of information about each cell in the selected wordline. If a bitline conducts, it means that the cell's threshold voltage is below the read voltage. The corresponding bit is then read as "0". If a bitline does not conduct, it means that the cell's threshold is above the read voltage and the corresponding bit is then read as "1". However, it is important to understand that the bit values depend on the read threshold: the same page can yield different bit values for different read thresholds².

Figure 4.3.: ABL read operation timing diagram. [Labels: precharge, evaluation time TEVA, sensing margin, VSO, VTHSA, Icell, read as 1, read as 0.]

From the perspective of sensing circuits, there exist multiple read architectures, all of which use capacitances to integrate the bitline current. Most modern NAND Flash memories use the All Bitline (ABL) architecture [61] shown in Fig. 4.4, which includes a dedicated capacitor CSO and keeps the bitline voltage constant during the evaluation phase. Fig. 4.3 shows the three phases in a read operation. First, CSO is pre-charged to a high voltage VDD. Then MPCH is shut off and conducting bitlines experience a constant current Icell that discharges CSO (evaluation phase). After TEVA seconds, the capacitor voltage is compared with a reference VTHSA and the read result is output through a latch [62]. The total read time is dominated by the evaluation time TEVA, which can be represented as:

$$T_{EVA} = \frac{(V_{DD} - V_{THSA})\, C_{SO}}{I_{cell}}. \tag{4.1}$$

² Some manufacturers use the reverse bit labels, "1" to denote conducting and "0" not conducting. This convention makes no difference towards our results, but OR operations should be replaced with AND.

Figure 4.4.: ABL sense circuits for NAND flash. [Labels: VDD, MPCH, PCH, SO, CSO, VTHSA, LATCH, OUT, MSEL, SEL, Icell, selected string biased with Vpass, l1, l2, MSLS, SLS, source line.]

The current Icell through a MOS transistor operating in the ohmic region with small drain-to-source voltage VDS can be approximated by:

$$I_{cell} = k \left[ (V_{GS} - V_{TH})\, V_{DS} \right], \tag{4.2}$$

where VGS and VTH respectively represent the gate-to-source and threshold voltages and k is a scaling parameter [62]. Hence, the equivalent resistance Roh for the transistor working in the ohmic region is:

$$R_{oh} = \frac{V_{DS}}{I_{cell}} = \frac{1}{k (V_{GS} - V_{TH})}. \tag{4.3}$$

The multi-page read proposed in this section uses the same components and read methodology as the ABL architecture, but instead of biasing a single wordline with a read threshold and all others with Vpass, multiple wordlines are biased with different read thresholds {l1, l2, . . .} while the rest are kept at Vpass as shown in Fig. 4.4. A bitline will conduct only when all the selected cells have lower voltage than the corresponding read thresholds. Since we are using value "1" to denote "not conducting", this is equivalent to a bit-wise OR operation of all the selected wordlines.
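The equivalence between a multi-page read and a bit-wise OR can be checked with a small behavioral model. The sketch below is only an illustration under simplifying assumptions: cells are ideal switches, wordlines biased at Vpass always conduct, and the function names and voltage values are hypothetical rather than taken from the thesis.

```python
def read_bit(vth, read_threshold):
    """Regular read of one cell: 0 if it conducts (vth below threshold), else 1."""
    return 0 if vth < read_threshold else 1


def multi_page_read(block, thresholds):
    """Behavioral model of a multi-page read.

    block[w][b] is the threshold voltage of the cell on wordline w, bitline b.
    thresholds maps each selected wordline index to its read threshold.
    Wordlines that are not selected are assumed to sit at Vpass and conduct.
    """
    n_bitlines = len(block[0])
    result = []
    for b in range(n_bitlines):
        # The bitline conducts only if every selected cell conducts.
        conducts = all(block[w][b] < l for w, l in thresholds.items())
        result.append(0 if conducts else 1)
    return result


# Three wordlines, four bitlines (arbitrary illustrative voltages).
block = [
    [0.2, 1.5, 0.3, 1.7],
    [2.0, 2.0, 2.0, 2.0],
    [0.1, 0.4, 1.6, 1.9],
]
page0 = [read_bit(v, 1.0) for v in block[0]]
page2 = [read_bit(v, 1.0) for v in block[2]]
combined = multi_page_read(block, {0: 1.0, 2: 1.0})
```

Here `combined` matches the element-wise OR of `page0` and `page2`, obtained with one read instead of two.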
The main problem of the ABL architecture is the static current consumption during the pre-charge phase, especially by cells with threshold voltage much smaller than the read voltage, as indicated by Eq. (4.2). With multi-page read, however, fewer bitlines will conduct and those that do will only draw strong currents if all the read voltages are much larger than the corresponding thresholds. Hence, multi-page read helps alleviate the power consumption problem.
Bitlines that do conduct will experience a read current very similar to that in regular reads. Each bitline has hundreds of cells connected in series whose equivalent resistance is determined by the gap between the read and threshold voltages according to Eq. (4.3). This gap is smaller for cells being read than for those biased at Vpass, but both are usually of the same order of magnitude. The equivalent resistance of the whole string is then dominated by the hundreds of cells acting as pass transistors, not by the few being read. If the read and threshold voltages are very close for a cell, it would reach the saturation mode, thereby limiting the current independently of VDS. Since Icell in Eq. (4.1) does not suffer a significant decrease in either case, we can conclude that the evaluation time TEVA in a multi-page read is similar to that in a regular read, offering comparable read speeds.
Additionally, the multi-page read method can be applied to improve several applications of flash memories as we will discuss in Section 4.3.2.

4.3.2 Applications for Multi-page Read
ICI Equalization
The ISPP programming algorithm can compensate for the inter-cell interference caused by previously programmed wordlines, but not for the interference of subsequent write operations. Since wordlines are programmed in sequential order, most of the ICI suffered by a specific wordline is caused by the direct-above-neighbor. Extensive measurements have shown that the change in threshold voltage suffered by the victim cell is proportional to the threshold voltage of the aggressor cell, with a proportionality factor γ that depends on the parasitic capacitance between the aggressor cell and the victim cell.
The neighbor-cell assisted error correction (NAC) algorithm was proposed in [59] to equalize ICI. The NAC method first performs one read of the aggressor wordline and classifies the cells in the victim wordline as suffering weak or strong ICI depending on the value programmed in their direct above neighbor. Then, it reads the victim wordline with different thresholds, selectively choosing which result to keep for each cell. It effectively reads cells suffering strong ICI with a different threshold than those suffering weak ICI, thereby reducing the probability of error. However, this algorithm requires reading the aggressor wordline and thus reduces the read speed. We propose to use the multi-page read method to read the victim and aggressor wordlines simultaneously.
If we set a read threshold lvictim on the desired wordline and an intermediate threshold laggressor on its neighbor, while the rest are set to Vpass, only bitlines which fulfil both conditions would conduct. This way we can use a single multi-page read to detect the cells which have voltage below lvictim and are suffering weak ICI. Combining these multi-page reads allows us to obtain similar results to the NAC algorithm.
For example: In MLC memories, each cell can be programmed to four different levels, denoted S0, S1, S2, and S3. According to the suffered ICI, we further classify them into 8 states: S0^weak, S0^strong, S1^weak, S1^strong, . . ., as Fig. 4.5 shows. The read thresholds for weak ICI cells are A1, B1, C1 and for strong ICI cells A2, B2, C2. The six proposed reads are listed in Table 4.1.

Figure 4.5.: Illustration of multi-page read method for MLC ICI equalization. The curves represent histograms of threshold voltages across a page and vertical lines represent read thresholds. [States in order: S0^weak, S0^strong, S1^weak, S1^strong, S2^weak, S2^strong, S3^weak, S3^strong; victim thresholds A1, A2, B1, B2, C1, C2; aggressor threshold t separating weak ICI from strong ICI.]

Figure 4.6 & Table 4.1: Bitline illustration & Multi-page reads for MLC ICI equalization. [Bitline illustration: wordlines biased at Vpass, lvictim, laggressor, Vpass.]

Read      lvictim   laggressor
Read_1    A1        t
Read_2    B1        t
Read_3    C1        t
Read_4    A2        Vpass
Read_5    B2        Vpass
Read_6    C2        Vpass

To classify the cells into S0, S1, S2 and S3, we combine the results of the six threshold comparisons. State Si (i = 0, 1, 2, 3) can be represented as (Si^weak OR Si^strong), where Si^weak can be found from the first three reads and Si^strong can be found from the last three reads after eliminating weak ICI cells. For example, we classify a cell as S1 iff {Read_1=0 and Read_2=1} OR {Read_4=0 and Read_5=1 and Read_3=0}.

This strategy brings slightly more error between S2 and S3 than NAC, which performs an additional read on the aggressor wordline. In order to reduce this error, C2 is slightly shifted, as shown in Fig. 4.5. However, the error is minor and this strategy will save one read. Moreover, when ICI from the nearest neighbor is large, this method provides lower BER than any other with reads on a single page ever could.
Fig. 4.7 compares the channel capacity of a regular scheme using six reads to produce soft information [63] with NAC and the multi-page read method for MLC flash memories. It is assumed that S0, S1, S2, S3 are Gaussian distributed [13] with the same variance σ = 0.15 and means 0, 1, 2, 3, respectively. The read thresholds for all three methods were numerically optimized to maximize the resulting capacity. The results show that as γ increases, the NAC and multi-page read methods provide significantly better performance. In fact, when ICI dominates over the Gaussian noise, the proposed method would always provide higher capacity than reading the desired page alone, regardless of how many reads the latter employs. This is due to the fact that the multi-page read method provides some amount of equalization for the channel. The figure also shows that the performance of multi-page read is very close to that of the NAC method even though it requires one fewer read operation.

Partial Erase
The erase operation consists of sending a series of voltage pulses into the gates of all the cells in a block, until all the floating gates have been discharged [64]. All cells in the block suffer the same pulses, even though some can be "fast erased" and others need more negative pulses [65]. These pulses damage the dielectric barrier in the cells, increasing BER and shortening their lifetime. This subsection shows how to apply the multi-page read method to reduce the number of erase pulses sent, so that this damage can be reduced and the erase operation can be accelerated.

Figure 4.7.: Channel capacity for an MLC cell after 6 traditional reads, 6 multi-page reads, and NAC with 7 reads. [x-axis: γ (ICI coefficient), from 0 to 0.14; y-axis: channel capacity; curves: Multi-page 6 reads, NAC 7 reads, Regular 6 reads.]

Flash memories generally use a log-structured file system, with a background garbage collection process that keeps a pool of erased blocks ready to be written [66]. As new information arrives, the controller writes it in these blocks, filling one before moving on to the next.
Unlike the traditional erase algorithm, which continues issuing pulses until the whole block is totally erased, we propose a partial-erase option where fewer pulses are sent and a small number of cells remain incompletely erased. By reading all the wordlines simultaneously with a very low read threshold, the controller can detect which bitlines have cells that have not been completely erased, and store their indices as part of the block's metadata. The controller may then choose to skip those bitlines during the writing process, use a constrained code to mask the errors [52], or ignore this information and rely on the ECC block to correct any errors that might arise.
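The detection step of the partial erase can be illustrated with the same kind of behavioral model. In this minimal sketch, a block is a hypothetical matrix of threshold voltages in arbitrary units, and a bitline is flagged when it fails to conduct while every wordline is biased at a low read threshold; the function names are assumptions made for illustration only.

```python
def unerased_bitlines(block, erase_threshold):
    """Bitlines that do not conduct when all wordlines use erase_threshold.

    block[w][b] is the threshold voltage of the cell on wordline w, bitline b.
    A bitline conducts only if all of its cells are below the threshold,
    so one incompletely erased cell is enough to flag the whole bitline.
    """
    n_bitlines = len(block[0])
    return [
        b for b in range(n_bitlines)
        if any(wordline[b] >= erase_threshold for wordline in block)
    ]


def usable_bitlines(block, erase_threshold):
    """Bitlines the controller could keep writing to after a partial erase."""
    bad = set(unerased_bitlines(block, erase_threshold))
    return [b for b in range(len(block[0])) if b not in bad]


# Illustrative block after a shortened erase: two cells are still charged.
block = [
    [0.1, 0.0, 0.2, 1.4],
    [0.0, 0.9, 0.1, 0.1],
    [0.2, 0.1, 0.0, 0.0],
]
bad = unerased_bitlines(block, erase_threshold=0.5)
good = usable_bitlines(block, erase_threshold=0.5)
```

The list `bad` is what the controller would store in the block's metadata; whether to skip those bitlines, mask them with a constrained code, or leave them to the ECC remains a policy choice.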
Similar ideas can also be applied to worn-out flash memories. Although it is common to assume a uniform wear among all the cells, not all of them present the same level of tolerance towards program-erase (P/E) cycles. This makes the reuse of worn-out blocks meaningful. Lab-collected data show that erase errors, which typically trigger the permanent retirement of a block, are caused by a few broken cells that remain unerased, but most of the other cells are healthy. By setting all wordlines to a small read threshold, we can detect which bitlines have broken cells and skip them in subsequent writes.
Unfortunately, the multi-page read step may slow down the erase operation. The gap between the read and threshold voltages could be relatively small for many cells along the bitline, thereby limiting the current and extending the evaluation time. However, the read step still takes much less time than sending the erase pulses. According to [67], the erase operation for one block takes about 500µs while the read operation takes only several nanoseconds. So the latency brought by the multi-page read step is negligible in the partial erase operation.

WOM Codes
WOM (Write-Once-Memory) codes were designed for memories where bits can change in one direction (e.g., 0 to 1) but not the other [2, 68], and they have recently been proposed to allow multiple overwrites of a flash page without erasing [69]. A simple example of a WOM code is shown in Fig. 4.8. This example uses three SLC (binary value) cells to write two bits twice. Initially, all three cells are erased (state 000). The first two bits are written by transitioning to one of the states in the first generation, according to the labels shown in the plot. The second pair of bits, if different from the first, is written by transitioning to one of the states in the second generation. All transitions involve charging, not erasing, the cells so they are feasible.
However, WOM codes present some practical limitations that prevent their adoption by the flash memory industry: they significantly reduce the capacity and speed of the memory. The example shown in Fig. 4.8 provides two bits of information for every three cells read, so page length and read throughput are 33% lower than with a traditional scheme.

In order to address these challenges, we propose aligning the WOM codewords vertically, across wordlines, instead of storing the whole codeword on the same page. The multi-page read allows us to rapidly compute the OR operation of multiple wordlines, accelerating the decoding. Observe that, denoting the state of the cells by b1 b2 b3, the two bits in the first generation are given by "b2 OR b3" and "b1 OR b3", respectively. A single multi-page read of the last two wordlines would provide the first information bit and a single multi-page read of the first and third wordlines the second information bit, as shown in Fig. 4.9. Each multi-page read provides a sequence of information bits of the same length as a page, so the read throughput is the same as in the traditional scheme.
Unfortunately, a similar idea cannot be applied to the second generation of writes. Some codewords may still be in first generation states so it is necessary to read all three pages individually to obtain two pages of information. This yields the same rate as the regular WOM scheme, which performs one read to obtain 2/3 of a page of information.
On average, for the example shown in Fig. 4.8, the proposed scheme with multi-page reads would provide an information rate of 0.8 (4 pages of information after 5 reads), which is a significant improvement over the 0.66 rate shown by the usual WOM scheme. Similar approaches might be possible for more advanced WOM codes.
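The decoding rules above can be exercised on a small model of the three-cell code. The sketch below assumes the first-generation labels implied by the OR rules (000 for 00, 100 for 01, 010 for 10, 001 for 11) and assumes that the second generation uses the bit-wise complements, as in the classical two-write construction that Fig. 4.8 appears to depict. The XOR-based general decoder is an observation checked against this table, not a rule stated in the text, and all function names are hypothetical.

```python
# First-generation codewords (b1, b2, b3) for each pair of data bits.
FIRST_GEN = {
    (0, 0): (0, 0, 0),
    (0, 1): (1, 0, 0),
    (1, 0): (0, 1, 0),
    (1, 1): (0, 0, 1),
}
# Assumed second generation: complement of the first-generation codeword.
SECOND_GEN = {d: tuple(1 - c for c in cw) for d, cw in FIRST_GEN.items()}


def decode(state):
    """General decoder (either generation); needs all three cells."""
    b1, b2, b3 = state
    return (b2 ^ b3, b1 ^ b3)


def write(state, data):
    """Store two data bits, only ever changing cells from 0 to 1."""
    if decode(state) == data:
        return state
    target = FIRST_GEN[data] if state == (0, 0, 0) else SECOND_GEN[data]
    assert all(t >= s for s, t in zip(state, target)), "would need an erase"
    return target


def first_gen_multi_page_decode(wl1, wl2, wl3):
    """Decode first-generation data stored vertically across three wordlines.

    Each wordline is a list of bits; column j holds codeword j.
    Two multi-page reads (bit-wise ORs) yield two pages of information.
    """
    read1 = [x | y for x, y in zip(wl2, wl3)]  # first bit:  b2 OR b3
    read2 = [x | y for x, y in zip(wl1, wl3)]  # second bit: b1 OR b3
    return list(zip(read1, read2))


# Wordline contents as in the Fig. 4.9 example.
decoded = first_gen_multi_page_decode([0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 1])

# Rates quoted in the text: 2 + 2 pages over 2 + 3 reads, versus 2/3.
rate_multi_page = (2 + 2) / (2 + 3)
rate_regular = 2 / 3
```

In this model the first generation costs two reads for two pages of data, while the second generation falls back to three individual page reads followed by `decode`, which is where the 4 pages in 5 reads figure comes from.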

Other Applications
The multi-page read provides a way of obtaining the bit-wise OR (or bit-wise AND, if the discharged state is denoted by 1) of the information stored in multiple wordlines using a single read operation³. There are multiple applications that could benefit from such a feature: group testing, masking, constrained codes, hash lookups, etc. Instead of reading multiple pages and storing them in registers to perform these operations, the multi-page read allows us to obtain the result in a fraction of the time by performing a single read.

³ For MLC memories, finding the OR of an MSB page would require two multi-read threshold comparisons, just like in a traditional MSB read.

Figure 4.8.: The WOM code on the cube. [Cube vertices are cell states labeled with data bits. Erased state: 000 (00). 1st generation: 100 (01), 010 (10), 001 (11). 2nd generation: 011 (01), 101 (10), 110 (11), 111 (00).]

Figure 4.9.: Multi-page read to decode WOM code. [Example with wordlines 0100, 1000 and 0001 storing vertical codewords; Read 1 and Read 2 decode the columns as 10, 01, 00, 11.]

4.4 Spreading Modulation for NAND Flash Memories
This section will explain a new signal processing approach: the spreading modulation. This section is organized as follows: Subsection 4.4.1 introduces the system model used in the rest of the chapter. Subsection 4.4.2 explains the spreading data representation approach, analyzing its performance under different types of noise, and Section 4.4.3 provides guidelines on how to adjust the spreading parameter. Sections 4.4.4 and 4.4.5 respectively show how the spreading approach can be used to generate soft information and to hide data in the memory. Finally, Section 4.4.6 presents simulation results to validate the method.

4.4.1 System Model
In order to better illustrate the features of the proposed scheme, this thesis will consider multiple scenarios with different noise distributions and memory types. From a high level perspective, it will be assumed that in a write operation the host provides a vector of (possibly encoded) information symbols b ∈ X^M from an alphabet X, which are then mapped to a vector of voltages v0 to be programmed on the cells. By the time that the cells are read, the voltages v0 will have suffered some amount of white Gaussian noise [13, 70], denoted nw, as well as inter-cell interference (ICI), denoted nICI. Therefore, the voltage actually stored in the cells at read time is

$$\mathbf{v} = \mathbf{v}^0 + \mathbf{n}, \qquad \mathbf{n} = \mathbf{n}_w + \mathbf{n}_{ICI}. \tag{4.4}$$

The noise due to leakage is also assumed to be Gaussian and is therefore absorbed into the nw term.

ICI occurs when a shift in the threshold voltage of one cell changes the threshold voltage of its neighbors due to the parasitic capacitance between cells, known as "floating-gate interference" [50]. Extensive measurements have shown that the change in threshold voltage suffered by the victim cell is proportional to the threshold voltage of the aggressor cell, with a proportionality factor that depends on the parasitic capacitance between the aggressor cell and the victim cell. This factor is commonly known as coupling ratio and will be denoted by γ. Hence,

$$n_{ICI} = \gamma\, v_{aggressor}. \tag{4.5}$$

With the usual data representation scheme, each symbol b is mapped to a fixed nominal voltage v0. So, for the sake of simplicity, it will be assumed that they both share the same alphabet X and b = v0. In SLC memories these symbols are binary, in MLC they can take four values (representing two bits of information), in TLC they take 8 values (3 bits), etc. In general, the number of levels is chosen to be as large as possible while still avoiding potential overlap between the levels and excessive damage when programming the largest voltage, denoted Vmax.

Damage to flash memory cells is caused by program/erase (P/E) cycling. According to [14] and the experimental results presented in Section 4.4.6, the damage suffered by a cell when programmed to a voltage Vth is approximately proportional to Vth^2. Most of the damage happens when cells are programmed to the largest voltage Vmax, so writing data patterns that are represented by a lower threshold voltage could prolong the lifetime of the flash [14, 53].
The proposed data representation scheme will use a linear mapping between the symbols b and the nominal voltages v0, to be described in the next section. This mapping will extend the number of possible voltages to be programmed, but also reduce the number of cells programmed to Vmax, attenuate the ICI, and increase robustness to impulse noise. This will result in increased capacity and extended lifetime for the memory.
In practice, the discharged state in which the cells are left after being erased sets a lower limit for the range of programmable voltages and the write procedure can only push the cells towards higher voltages. Thus it is not possible to program NAND flash cells with a negative voltage. However, for our derivations it will be useful to assume that the range of programmable voltages is symmetric and the voltages v0 and symbols b can take both positive and negative values. SLC cells will therefore have their voltage levels relabeled as −0.5 and +0.5, while MLC cells will be assumed to take voltage levels −1.5, −0.5, 0.5, and 1.5. In general, if there are 2S symbols in the alphabet X, they will be labeled as X = {±0.5, ±1.5, . . . , ±(S − 0.5)}. The largest symbol in the alphabet will be denoted by Vmax = S − 0.5. This represents a simple shift of the physical reference system. The rest of the chapter assumes that the symbols b_i and voltages v0_i have zero mean.
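The system model of this section can be summarized in a few lines of code. The sketch below builds the symmetric alphabet X for 2S levels and applies Eqs. (4.4) and (4.5) to one victim wordline. The helper names, the noise level and the way aggressor voltages are passed in are illustrative assumptions; in particular, the aggressor voltages are simply taken as given numbers in the same shifted reference system.

```python
import random


def alphabet(S):
    """Symmetric alphabet X = {±0.5, ±1.5, ..., ±(S − 0.5)} with 2S levels."""
    return [s + 0.5 for s in range(-S, S)]


def v_max(S):
    """Largest symbol in the alphabet, Vmax = S − 0.5."""
    return S - 0.5


def read_voltages(v0, v_aggressor, gamma, sigma, rng):
    """Voltages at read time: v = v0 + nw + nICI, with nICI = gamma * v_aggressor.

    nw is drawn as zero-mean Gaussian noise with standard deviation sigma.
    """
    return [
        v + rng.gauss(0.0, sigma) + gamma * va
        for v, va in zip(v0, v_aggressor)
    ]


rng = random.Random(0)            # fixed seed so the sketch is repeatable
X_mlc = alphabet(2)               # MLC: four levels
v0 = [rng.choice(X_mlc) for _ in range(8)]     # usual scheme: b = v0
v_agg = [rng.choice(X_mlc) for _ in range(8)]  # neighboring wordline
v = read_voltages(v0, v_agg, gamma=0.1, sigma=0.15, rng=rng)
```

Setting `sigma=0.0` isolates the ICI term, which is convenient when checking an equalization scheme against this model.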


4.4.2 The Spreading Approach
This section introduces the spreading modulation and then analyzes its performance against three types of noise: Gaussian, ICI, and impulse noise. We first study the trade-off between damage and Gaussian noise. The SNR can be increased by widening the range of programmed voltages, but doing so increases the damage suffered by the cells. Then we study how the proposed modulation can attenuate ICI and impulse noise, respectively.
In a traditional flash memory, each cell stores a fixed number of bits. There are usually some redundant bits introduced by the ECC or RAID schemes but, ultimately, each bit is stored in a specific cell. Cells have a fixed number of voltage levels to which they can be programmed and all the levels are written with the same frequency. This section proposes a new data representation scheme which uses orthogonal codes to spread each bit across multiple cells, similar to DS-CDMA transmission in wireless communications. This data representation scheme reduces the variability of the voltages being programmed in the cells, resulting in improved endurance and additional robustness towards impulse noise and ICI.
Instead of mapping each symbol $b_i$ to a fixed voltage $v'_i$, the proposed scheme uses a matrix with orthogonal columns $\mathbf{C}$ (e.g., a Walsh matrix) to map the symbols $\mathbf{b}$ into voltages $\mathbf{v}'$ to be programmed. For example, when mapping four symbols $\mathbf{b} \in \mathcal{X}^4$ into four cells, the voltages to be programmed are:

$$
\begin{bmatrix} v'_1 \\ v'_2 \\ v'_3 \\ v'_4 \end{bmatrix}
= \frac{k}{4} \cdot
\begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & -1 & 1 & -1 \\ 1 & 1 & -1 & -1 \\ 1 & -1 & -1 & 1 \end{bmatrix}
\cdot
\begin{bmatrix} b_1 \\ b_2 \\ b_3 \\ b_4 \end{bmatrix},
$$
where $k$ is an adjustable parameter that controls the range of voltages being programmed. By scaling $k$, we can introduce more separation between the programmed voltage levels, but the damage suffered by the cells and the power consumed would also increase. In general, when $M$ symbols are to be programmed into $N \geq M$ cells,

$$
\mathbf{v}' = \frac{k}{M} \mathbf{C} \mathbf{b}, \tag{4.6}
$$

where $\mathbf{C}$ is a $\{-1, 1\}^{N \times M}$ matrix with orthogonal columns and $k V_{max}$ is the maximum voltage to be programmed (remember that $\mathbf{v}'$ are not the programmed voltages, but shifted versions of them). By scaling $k$, we can introduce more separation between the programmed voltage levels, but the noise is not affected by this scaling. We will refer to this operation as spreading.
By spreading each information symbol across multiple cells, we increase the number of possible programmed voltages in each cell, so symbols and nominal voltages no longer share the same alphabet. For example, in the SLC case where $b_i \in \{-0.5, 0.5\}$ and $M = N = 4$, our scheme would have five possible levels for each cell: $v'_i \in \{-0.5k, -0.25k, 0, 0.25k, 0.5k\}$. In general, if $V_{max}$ is the largest symbol in the alphabet $\mathcal{X}$, the voltage levels after spreading are in the range $[-kV_{max}, kV_{max}]$.
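The level count quoted above can be checked by brute force. The following is a minimal Python sketch, assuming the 4x4 Walsh matrix of the running example and $M = N = 4$; the helper name `nominal_levels` is hypothetical and not part of the original text.

```python
from itertools import product

import numpy as np

# 4x4 Walsh matrix from the running example (orthogonal columns).
C = np.array([[1, 1, 1, 1],
              [1, -1, 1, -1],
              [1, 1, -1, -1],
              [1, -1, -1, 1]])


def nominal_levels(alphabet, k=1.0):
    """Enumerate the distinct nominal voltages v' = (k/M) C b over all b."""
    M = C.shape[1]
    levels = set()
    for b in product(alphabet, repeat=M):
        v = (k / M) * (C @ np.array(b))
        levels.update(np.round(v, 12).tolist())
    return sorted(levels)


slc_levels = nominal_levels([-0.5, 0.5], k=1.0)
mlc_levels = nominal_levels([-1.5, -0.5, 0.5, 1.5], k=1.0)
print(slc_levels)        # five SLC levels for k = 1
print(mlc_levels[-1])    # largest MLC level, equal to k * Vmax
```

Enumeration is only practical for small $N$; it is used here because it makes the relation between the symbol alphabet and the voltage alphabet explicit.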

When the read operation is performed, the voltages are multiplied by a de-spreading matrix $\mathbf{C}^T$, which is, up to a scaling factor, the left inverse of the spreading matrix. Because of the properties of Walsh sequences, the de-spreading matrix is the transpose of the spreading matrix. Continuing with the previous example, when $N = M = 4$

$$
\begin{bmatrix} \hat{b}_1 \\ \hat{b}_2 \\ \hat{b}_3 \\ \hat{b}_4 \end{bmatrix}
= \frac{1}{k} \cdot
\begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & -1 & 1 & -1 \\ 1 & 1 & -1 & -1 \\ 1 & -1 & -1 & 1 \end{bmatrix}
\cdot
\begin{bmatrix} v_1 \\ v_2 \\ v_3 \\ v_4 \end{bmatrix},
$$

where $\hat{b}_i$, $i = 1, 2, 3, 4$ represent the information estimates after reading. In general,

$$
\hat{\mathbf{b}} = \frac{M}{Nk} \mathbf{C}^T \mathbf{v}. \tag{4.7}
$$

The processes of spreading and de-spreading discussed above are shown in Fig. 4.10.
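As an illustration of Eqs. (4.6) and (4.7), the following Python sketch spreads a block of symbols and recovers it from a noiseless read. The function names (`walsh`, `spread`, `despread`) and the chosen values of $k$ and $\mathbf{b}$ are assumptions made for this example; the Sylvester construction is one common way to build a Walsh-Hadamard matrix and is not necessarily the one used by the author.

```python
import numpy as np


def walsh(n):
    """Sylvester construction of an n x n Walsh-Hadamard matrix (n a power of 2)."""
    H = np.array([[1]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H


def spread(b, C, k):
    """Eq. (4.6): nominal voltages v' = (k/M) C b."""
    M = C.shape[1]
    return (k / M) * (C @ b)


def despread(v, C, k):
    """Eq. (4.7): symbol estimates b_hat = M/(N k) C^T v."""
    N, M = C.shape
    return (M / (N * k)) * (C.T @ v)


k = 1.2
C = walsh(4)
b = np.array([0.5, -0.5, -0.5, 0.5])      # four SLC symbols
v_nominal = spread(b, C, k)
b_hat = despread(v_nominal, C, k)         # noiseless read: b_hat equals b

# With redundancy (M = 3 symbols over N = 4 cells) the same functions apply.
C_red = walsh(4)[:, :3]
b_red = np.array([0.5, 0.5, -0.5])
b_red_hat = despread(spread(b_red, C_red, k), C_red, k)
```

The factor $M/(Nk)$ in `despread` undoes both the $k/M$ scaling of the spreading step and the gain $N$ that comes from $\mathbf{C}^T\mathbf{C} = N\mathbf{I}$.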

Figure 4.10.: Illustration of the spreading approach. [Block diagram: $\mathbf{b}$ → Spreading → $\mathbf{v}'$ → Flash Write/Read → $\mathbf{v}$ → De-spreading → $\hat{\mathbf{b}}$.]

Combining Eqs. (4.4), (4.6), and (4.7), the estimated information symbols can be represented as:

$$
\hat{b}_i = b_i + \frac{M}{Nk} \sum_{j=1}^{N} \pm n_j, \qquad i = 1, 2, \ldots, M. \tag{4.8}
$$

The noise can be arbitrarily attenuated by decreasing $\frac{M}{Nk}$, but that involves sacrificing capacity by decreasing $\frac{M}{N}$ or using a wider range of programmed voltages by increasing $k$. Since most practical applications are not willing to compromise capacity, the rest of the chapter assumes $M = N$, which means that the storage space is the same as in the regular scheme.

Gaussian Noise and Damage
For fixed voltage range ($k = 1$) and signal-independent Gaussian noise, spreading actually decreases the signal-to-noise ratio (SNR) at read time. Assuming independent and identically distributed noise components $n_i \sim \mathcal{N}(0, \sigma^2)$ [70], the SNR of the regular and spreading schemes are:

$$
SNR_{regular} = \frac{P_s}{\sigma^2}, \qquad SNR_{spread} = \frac{P_s}{\frac{N}{k^2}\sigma^2}, \tag{4.9}
$$

where $P_s = E[b_i^2]$ represents the power of the stored symbols. It is easy to increase $SNR_{spread}$ when needed by increasing the scaling constant $k$, but doing so widens the range of programmed voltages and thus causes more damage and consumes more power. This subsection studies this trade-off.
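A short numerical check of Eq. (4.9) may help here. The sketch below assumes $M = N = 4$ with the Walsh matrix of the running example and SLC symbols ($P_s = 0.25$); the function names, the trial count and the random seed are illustrative choices, not taken from the original.

```python
import numpy as np

# 4x4 Walsh matrix from the running example (assumes M = N = 4).
C = np.array([[1, 1, 1, 1],
              [1, -1, 1, -1],
              [1, 1, -1, -1],
              [1, -1, -1, 1]])


def snr_regular(Ps, sigma):
    """Regular scheme: SNR = Ps / sigma^2."""
    return Ps / sigma**2


def snr_spread(Ps, sigma, N, k):
    """Eq. (4.9): SNR = Ps / ((N / k^2) sigma^2), for M = N."""
    return Ps / (N / k**2 * sigma**2)


def empirical_noise_power(sigma, k, trials=200_000, seed=0):
    """Variance of the de-spread noise (1/k) C^T n for i.i.d. Gaussian n."""
    rng = np.random.default_rng(seed)
    n = rng.normal(0.0, sigma, size=(C.shape[0], trials))
    return float(np.var((1.0 / k) * (C.T @ n)))


Ps, sigma, N = 0.25, 0.1, 4
print(snr_regular(Ps, sigma), snr_spread(Ps, sigma, N, k=1.0))
print(empirical_noise_power(sigma, k=1.0), N * sigma**2)
```

The Monte Carlo estimate should land close to $N\sigma^2/k^2$, which is the denominator of $SNR_{spread}$; setting $k = \sqrt{N}$ makes the two SNR expressions coincide.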
One of the advantages of the spreading scheme is that it reduces the probability of programming the maximum voltage as shown in Fig. 4.11, thus reducing the damage to the flash memory. The amount of damage suffered by the memory is approximately proportional to the square of the voltage programmed. As mentioned in Section 4.4.1, cell voltages must be non-negative so in practice they are shifted to $b_i + V_{max}$ in the regular scheme and to $v'_i + kV_{max}$ in the spreading scheme when $E[b_i] = 0$ and $E[v'_i] = 0$. Denote by $T_{spread}$ and $T_{regular}$ the damage with the spreading and the regular scheme, respectively. Then,

$$
\begin{aligned}
T_{regular} &= a \cdot E\left[(b_i + V_{max})^2\right] = a\left(E[b_i^2] + V_{max}^2\right), \\
T_{spread} &= a \cdot E\left[(v'_i + kV_{max})^2\right] = a\left(\frac{k^2}{N} E[b_i^2] + k^2 V_{max}^2\right),
\end{aligned} \tag{4.10}
$$

for some constant $a$. For $k = 1$ (i.e., both schemes have identical programming range), spreading causes less damage than the regular scheme but it lowers the SNR. For $k = \sqrt{N}$ both schemes have the same SNR, but spreading causes more damage. Section 4.4.3 will elaborate how to choose an optimal $k$ in between.
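The two damage expressions in Eq. (4.10) can be compared numerically. The following Python sketch evaluates them for SLC symbols with $M = N = 4$ and cross-checks the spreading formula against an exhaustive enumeration of all symbol blocks. It assumes uniformly distributed, independent symbols and $a = 1$; the function names are hypothetical.

```python
from itertools import product

import numpy as np

# 4x4 Walsh matrix from the running example (assumes M = N = 4).
C = np.array([[1, 1, 1, 1],
              [1, -1, 1, -1],
              [1, 1, -1, -1],
              [1, -1, -1, 1]])


def damage_regular(Eb2, Vmax, a=1.0):
    """Eq. (4.10), regular scheme: a (E[b^2] + Vmax^2)."""
    return a * (Eb2 + Vmax**2)


def damage_spread(Eb2, Vmax, N, k, a=1.0):
    """Eq. (4.10), spreading scheme: a ((k^2/N) E[b^2] + k^2 Vmax^2)."""
    return a * (k**2 / N * Eb2 + k**2 * Vmax**2)


def damage_spread_enumerated(alphabet, k, a=1.0):
    """Average of (v' + k Vmax)^2 over all symbol blocks and all cells."""
    N = C.shape[0]
    Vmax = max(alphabet)
    per_block = []
    for b in product(alphabet, repeat=N):
        v = (k / N) * (C @ np.array(b))
        per_block.append(np.mean((v + k * Vmax) ** 2))
    return a * float(np.mean(per_block))


slc = [-0.5, 0.5]
Eb2, Vmax, N = 0.25, 0.5, 4
print(damage_regular(Eb2, Vmax))                 # regular scheme
print(damage_spread(Eb2, Vmax, N, k=1.0))        # same range, less damage
print(damage_spread(Eb2, Vmax, N, k=N ** 0.5))   # same SNR, more damage
```

The enumeration relies only on the definition of $\mathbf{v}'$, so agreement with `damage_spread` is a consistency check on the closed form rather than on the underlying quadratic damage model.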

Figure 4.11.: Distribution of cell voltages for both modulation schemes when M=N=4, k=1, $\sigma = 0.1$, $\gamma = 0.2$. Spreading leads to a distribution with less variance. [Two histogram panels, "regular scheme" and "spreading scheme"; horizontal axis: programmed voltage, vertical axis: percentage.]

Inter-cell Interference
The previous section showed that when the noise is independent from the voltages being programmed, our spreading scheme does not provide any improvement in terms of BER unless $k \geq \sqrt{N}$. However, the main source of noise in new memory generations is ICI, which is proportional to the voltages being programmed in the cells.
In most memories, flash cells are organized in an array structure, where all the cells in a wordline are programmed simultaneously and wordlines are programmed in increasing order. The ISPP [58] algorithm used to program wordlines can compensate for the inter-cell interference caused by previously programmed wordlines, but not for the interference of subsequent program operations. Hence, most of the ICI suffered by a specific cell is caused by its direct neighbor in the next wordline. This will be the only ICI component considered in our analysis, but the simulations in Section 4.4.6 will include 3 neighbors.
Assuming $n^{ICI} \gg n^{w}$ and $M = N$, Eq. (4.8) becomes

$$
\hat{b}_i = b_i + \frac{1}{k} \sum_{j=1}^{N} \pm n_j^{ICI}, \qquad i = 1, 2, \ldots, M,
$$

where $n^{ICI}$ is proportional to the programmed voltages. According to Eq. (4.5), $n^{ICI}$ can be represented as:

$$
n^{ICI} = \frac{k\gamma}{N} \sum_{j=1}^{N} \pm b_j.
$$

So the estimated symbol can be represented as $\hat{b}_i = b_i + \Delta b_{spread}$, where

$$
\Delta b_{spread} = \frac{\gamma}{N} \sum_{j=1}^{N^2} \pm b_j.
$$

If the distance between the symbols is $d$, errors happen only when $|\Delta b_{spread}| \geq \frac{d}{2}$. The distribution of $\Delta b_{spread}$ is approximately $\mathcal{N}(0, \gamma^2 E[b_i^2])$ when $N$ is large according to the Central Limit Theorem. Then the probability of error for non-extreme symbols$^4$ is approximately

$$
P_e^{spread} \approx 2\phi\left(\frac{-d}{2\gamma\sqrt{E[b_i^2]}}\right), \tag{4.11}
$$

where $\phi(u) = \int_{-\infty}^{u} \frac{1}{\sqrt{2\pi}} e^{-\frac{y^2}{2}} \, dy$. Note that in Eq. (4.11), the scaling parameter $k$ has no effect on ICI.
In the regular scheme, the estimated symbol is:

$$
\hat{b}_i = b_i + \gamma b_j,
$$

with probability of error for non-extreme symbols

$$
P_e^{regular} = P\left(|b_j| > \frac{d}{2\gamma}\right). \tag{4.12}
$$

The main advantage of the proposed spreading scheme comes from the fact that it leads to less variance in the programmed voltages than the regular scheme, as shown in Fig. 4.11. As $\gamma$ increases, the probability of error grows much faster in the regular scheme than in our spreading scheme. For example, if $\gamma = 0.35$ and $d = 1$, then for MLC memories $\mathcal{X} = \{\pm 0.5, \pm 1.5\}$ and

$$
P_{e(MLC)}^{regular} \simeq 0.25, \qquad P_{e(MLC)}^{spread} \simeq 0.2, \tag{4.13}
$$

for intermediate (non-extreme) symbols when Gaussian noise is negligible, according to Eq. (4.12) and Eq. (4.11). The lowest and highest symbols would suffer half as much probability of error in both cases.
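The spreading-scheme value in the example can be reproduced directly from Eq. (4.11). The Python sketch below does so under the assumption of uniformly distributed MLC symbols; the helper names are hypothetical, and only the spreading expression is evaluated.

```python
import math


def std_normal_cdf(u):
    """phi(u): cumulative distribution function of a standard Gaussian."""
    return 0.5 * math.erfc(-u / math.sqrt(2.0))


def pe_spread(gamma, d, alphabet):
    """Eq. (4.11): 2 * phi(-d / (2 * gamma * sqrt(E[b^2])))."""
    Eb2 = sum(x * x for x in alphabet) / len(alphabet)
    return 2.0 * std_normal_cdf(-d / (2.0 * gamma * math.sqrt(Eb2)))


mlc = [-1.5, -0.5, 0.5, 1.5]
pe_mlc = pe_spread(gamma=0.35, d=1.0, alphabet=mlc)
print(round(pe_mlc, 3))
```

Because Eq. (4.11) rests on a Central Limit Theorem approximation, the result should be read as an estimate for large $N$ rather than an exact error rate.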
As $\gamma$ increases, $P_e^{spread}$ increases more slowly than $P_e^{regular}$. So, when $\gamma$ is large enough, the spreading scheme has a better performance. It is also important to take into account that the grouping of cells into spreading blocks must be done carefully. If the same cells, say cells 1 to 4, were taken as a spreading block in two consecutive wordlines, the ICI would have the form of a scaled codeword, and would therefore not be attenuated by the de-spreading.

4 Non-extreme symbols refer to the symbols which are not programmed to the highest or lowest
voltage levels.


Impulse Noise
Another important advantage of the spreading approach lies in its increased robustness to impulse noise. Flash memories are currently being used in a wide variety of environments. In most of them they compete with HDD and DRAM, but there are some cases in which flash is the only viable option. One of those cases is satellite applications. Hard drives have moving parts, and need a certain air pressure for the head to fly appropriately. DRAM memories are volatile and require frequent refreshing to avoid losing the information. Flash memories, however, are perfect for satellite applications. Their lack of moving parts makes them very compact and shock resistant, and they can be powered off for extended periods of time without losing information.
Satellites suffer a significant amount of radiation, constituting one of the leading causes of electrical component failures [71]. A high energy particle impacting on a NAND flash cell usually causes what is known as a stuck-at defect [52]. The cell effectively breaks and will henceforth be read as storing the same voltage value, regardless of what it was meant to be programmed to. In the regular scheme, any bit written to that cell will most likely be lost. The scheme proposed in this chapter, on the other hand, spreads each bit across multiple cells, and has a chance to recover the bit even if one of the cells is stuck at a given value.
Broken cells can usually be identified before they are read. The ISPP programming mechanism checks the cell voltages after sending each pulse and, when the controller detects that the cell voltage has not changed after having sent multiple pulses, the cell is marked as broken. If this were known before the programming started, we could just ignore that cell altogether and not store anything in it. Unfortunately, if the broken cell is detected during programming, it is too late to stop the programming of the other cells in the page.
Let $p$ denote the probability of a cell breaking. We assume that the controller knows which cells are broken, and can therefore assign them an arbitrary voltage at read time, independently from the actual state they are in. In order to minimize the resulting noise, broken cells will be read as having a voltage of 0, the average voltage stored by a healthy cell.
Equation (4.6) shows that the nominal voltage programmed in the $i$-th cell is $v'_i = \frac{k}{M}\mathbf{c}_i^T \mathbf{b}$, where $\mathbf{c}_i^T$ represents the $i$-th row of the spreading matrix $\mathbf{C}$. If the $i$-th cell is broken, the controller interprets $v_i = 0$, which is equivalent to replacing the $i$-th column of the de-spreading matrix by zeros when the read operation is performed. Denoting by $\hat{\mathbf{C}}$ the matrix $\mathbf{C}$ with the $i$-th row replaced by zeros, the estimated data symbols can then be represented as:

$$
\hat{\mathbf{b}} = \frac{1}{N} \cdot \hat{\mathbf{C}}^T \cdot \mathbf{C} \cdot \mathbf{b}
= \frac{1}{N}
\begin{bmatrix}
N-1 & \pm 1 & \cdots & \pm 1 \\
\pm 1 & N-1 & \cdots & \pm 1 \\
\vdots & \vdots & \ddots & \vdots \\
\pm 1 & \pm 1 & \cdots & N-1
\end{bmatrix}
\begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_N \end{bmatrix},
$$

where all the noise except impulse noise has been neglected. In other words, the estimated information symbol $\hat{b}_i$ can be represented as:

$$
\hat{b}_i = \frac{N-1}{N} b_i + \frac{1}{N} \sum_{j \neq i} \pm b_j, \qquad i = 1, 2, \ldots, N,
$$

where the signs of the error terms depend on the spreading matrix and the signs of the different bits.
In SLC flash memories, an error will occur if the sign of $\hat{b}_i$ is different from the sign of $b_i$. This can only happen if the signs of the other bits align just right so that the $N - 1$ error terms cancel out the correct $\frac{N-1}{N} b_i$ contribution. It happens with probability $\frac{1}{2^{N-1}}$. Moreover, even when the signs align just right to give $\hat{b}_i = 0$, we still have a 50% chance of guessing the sign correctly. So the probability of error due to broken cells is $\frac{N}{2^N} p (1 - p)^{N-1} + O(p^2)$.
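The counting argument above can be verified by enumeration for $N = 4$. The sketch below assumes the 4x4 Walsh matrix of the running example, exactly one broken cell that is read as 0, and $k = 1$; all function names are hypothetical.

```python
from itertools import product

import numpy as np

# 4x4 Walsh matrix from the running example (assumes M = N = 4).
C = np.array([[1, 1, 1, 1],
              [1, -1, 1, -1],
              [1, 1, -1, -1],
              [1, -1, -1, 1]])


def read_with_broken_cell(b, broken, k=1.0):
    """Spread b, read the broken cell as 0, then de-spread."""
    N = C.shape[0]
    v = (k / N) * (C @ b)
    v[broken] = 0.0
    return (1.0 / k) * (C.T @ v)


def pe_spread_broken(p, N):
    """Leading term (N / 2^N) p (1 - p)^(N - 1); O(p^2) terms are ignored."""
    return N / 2**N * p * (1 - p) ** (N - 1)


def pe_regular_broken(p):
    """Regular scheme: a broken cell holds the right value half of the time."""
    return p / 2


patterns = [np.array(b) for b in product([-0.5, 0.5], repeat=4)]
estimates = [read_with_broken_cell(b, broken=0) for b in patterns]

# Fraction of blocks in which the estimate of symbol 0 collapses to zero,
# and number of blocks in which its sign is actually flipped.
zero_fraction = np.mean([abs(e[0]) < 1e-12 for e in estimates])
sign_flips = sum(e[0] * b[0] < -1e-12 for e, b in zip(estimates, patterns))
print(zero_fraction, sign_flips)
print(pe_spread_broken(0.001, 4), pe_regular_broken(0.001))
```

For SLC the estimate can be driven to zero but never past it, which is why the remaining error comes only from the coin flip on an ambiguous read.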

However, for the regular scheme, whatever bits were stored in the broken cells are completely lost. The ECC will have to recover them if possible. If a cell is broken, it has a $\frac{1}{2}$ chance of storing the correct value, so the probability of error is $\frac{p}{2}$ for the regular scheme, which is much larger than with the spreading scheme.
In MLC flash memories, however, our scheme no longer offers advantages towards impulse noise. Since there are more programming voltage levels, the error terms may play a more important role because they may contain some large voltage levels. But space applications generally use SLC memories because they are more reliable than MLC.

4.4.3 Choice of Spreading Parameter
Increasing $k$ can improve SNR through noise attenuation, but the range of programmed voltages becomes wider. It was shown in Fig. 4.11 that the probability of programming a very large or small voltage with the spreading scheme is very low, so it could be helpful to increase $k$ and then crop those extremes. If the gains in terms of noise attenuation obtained by increasing $k$ make up for the cropping noise, the overall SNR will increase.
Instead of increasing $k$ and then cropping the largest voltages, our scheme crops both high and low voltages symmetrically, so as to minimize the cropping noise. Assuming that the desired range of programmed voltages is $[-V_{max}, V_{max}]$, the quantization noise introduced by cropping is

$$
q_i =
\begin{cases}
0 & \text{if } v'_i \in [-V_{max}, V_{max}] \\
-v'_i + V_{max} & \text{if } v'_i > V_{max} \\
-v'_i - V_{max} & \text{if } v'_i < -V_{max},
\end{cases}
$$

where $i = 1, 2, \ldots, N$ and $v'_i$ is the $i$-th component of the programmed voltage $\mathbf{v}'$ defined in Eq. (4.6). The information symbols read can be represented as:

$$
\hat{\mathbf{b}} = \mathbf{b} + \frac{1}{k} \mathbf{C}^T \mathbf{n} + \frac{1}{k} \mathbf{C}^T \mathbf{q},
$$

where $\mathbf{n}$ is write noise and ICI noise as defined in Eq. (4.4) and $\mathbf{q} = [q_1, q_2, \ldots, q_N]^T$ is the quantization noise.


In other words, for each estimated information value $\hat{b}_i$:

$$
\hat{b}_i = b_i + \frac{1}{k} \sum_{j=1}^{N} \pm\left(n_j^{ICI} + n_j^{w}\right) + \frac{1}{k} \sum_{j=1}^{N} \pm q_j.
$$

To minimize the overload distortion introduced by cropping, we hope to crop only the largest and smallest voltages in our scheme, $\pm kV_{max}$. These levels are programmed with probability $\frac{2}{L^N}$, where $L$ is the number of possible voltage levels, hence $q_j$ can be represented as:

$$
q_j =
\begin{cases}
\pm(k - 1)V_{max} & \text{with probability } \frac{2}{L^N} \\
0 & \text{with probability } \frac{L^N - 2}{L^N}.
\end{cases}
$$
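The cropping rule and the resulting distribution of $q_j$ can be illustrated with a few lines of Python. This sketch assumes SLC symbols, $M = N = 4$, the Walsh matrix of the running example, $k = 1.2$ (chosen so that only the extreme levels exceed $V_{max}$), and interprets $L$ as the number of symbol levels per cell ($L = 2$ for SLC); the function name `crop` is hypothetical.

```python
from itertools import product

import numpy as np

# 4x4 Walsh matrix from the running example (assumes M = N = 4).
C = np.array([[1, 1, 1, 1],
              [1, -1, 1, -1],
              [1, 1, -1, -1],
              [1, -1, -1, 1]])


def crop(v_nominal, Vmax):
    """Clip voltages to [-Vmax, Vmax]; return cropped voltages and noise q."""
    cropped = np.clip(v_nominal, -Vmax, Vmax)
    return cropped, cropped - v_nominal


k, Vmax = 1.2, 0.5
q_values = []
for b in product([-Vmax, Vmax], repeat=4):
    v = (k / 4) * (C @ np.array(b))
    _, q = crop(v, Vmax)
    q_values.extend(q.tolist())
q_values = np.array(q_values)

crop_fraction = np.mean(q_values != 0)          # compare with 2 / L^N
crop_magnitudes = np.abs(q_values[q_values != 0])
print(crop_fraction, 2 / 2**4)
print(crop_magnitudes.max(), (k - 1) * Vmax)
```

`np.clip` implements the three-case definition of $q_i$ in a single call: values inside the range are returned unchanged (so $q_i = 0$), and values outside are replaced by the nearest bound.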

The Gaussian noise, ICI, and quantization noise are uncorrelated, so the total noise power $P_N$ can be found by a simple sum of the components $P_N = P_w + P_{ICI} + P_q$, where:

$$
P_w = \frac{N}{k^2}\sigma^2, \qquad
P_{ICI} = \gamma^2 E[b_i^2], \qquad
P_q = \frac{2N(k - 1)^2}{k^2 L^N} V_{max}^2. \tag{4.14}
$$

As $k$ increases the write noise decreases but the quantization noise increases. There is a trade-off between quantization noise and write noise. As shown in Fig. 4.12, the optimal $k$ which minimizes the total noise power and consequently maximizes the SNR is:

$$
k^\star = \arg\min_k \left\{ \frac{2N(k - 1)^2}{k^2 L^N} V_{max}^2 + \frac{N}{k^2}\sigma^2 \right\}.
$$

Figure 4.12.: Quantization noise power as a function of $k$ for a SLC flash memory with M=N=4, $\sigma = 0.1$, and $\gamma = 0.2$. [Plot of $P_N$ (power of noise) against $k$ (scaling parameter).]

The scaling parameter $k$ should not be too large, so that the range of the programmed voltage of the spreading approach is close to that of the regular scheme. However, if $k$ is small, the distance between any two adjacent levels will be reduced or compressed. For some memories, it may be hard to control the small voltage increments between the levels in the spreading scheme, especially if $k$ is small. The over-programming could introduce Gaussian noise, but the total power of this noise would still be lower than that in the regular scheme, since the programming pulses would also be smaller.
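The minimization of the noise power in Eq. (4.14) over $k$ is easy to carry out numerically. The sketch below does a grid search and compares it with a closed form, $k^\star = 1 + \sigma^2 L^N / (2V_{max}^2)$, which is not stated in the text: it is obtained here by substituting $u = 1/k$ and setting the derivative to zero, and should be treated as an assumption of this example. The parameter values are the SLC ones used in the chapter, with $L = 2$ taken to be the number of symbol levels per cell; function names are hypothetical.

```python
import numpy as np


def noise_power(k, N, L, sigma, Vmax, gamma, Eb2):
    """Eq. (4.14): total noise power P_N = P_w + P_ICI + P_q."""
    Pw = N / k**2 * sigma**2
    Pici = gamma**2 * Eb2
    Pq = 2 * N * (k - 1) ** 2 / (k**2 * L**N) * Vmax**2
    return Pw + Pici + Pq


def optimal_k_closed_form(L, N, sigma, Vmax):
    """Stationary point of P_w + P_q (derived for this sketch, not from the text)."""
    return 1.0 + sigma**2 * L**N / (2.0 * Vmax**2)


params = dict(N=4, L=2, sigma=0.1, Vmax=0.5, gamma=0.2, Eb2=0.25)
ks = np.linspace(1.0, 2.0, 10001)
PN = noise_power(ks, **params)
k_grid = float(ks[np.argmin(PN)])
k_star = optimal_k_closed_form(L=2, N=4, sigma=0.1, Vmax=0.5)
print(k_grid, k_star)
```

The ICI term does not depend on $k$, so it shifts the curve vertically without moving the minimum.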
In addition to cropping, there are other ways to increase SNR and at the same time maintain the same programming range: we can reassign the programmed voltages to reduce the probability of error. That is, we can increase the distance between the voltage levels with higher probability and decrease the distance between the voltage levels with lower probability. This scheme is more complex than cropping and is suitable for flash memories with high computational capability. We will not discuss it in detail in this chapter.

4.4.4 Obtaining Soft Input
There are two types of decoders in flash: hard-decoders and soft-decoders. The difference between them lies in the input and output dictionary: hard-decoders usually have the same input and output dictionary, which is a fixed set of deterministic symbols; soft-decoders operate on log-likelihood ratios (LLR), specifying the probability of each input being a noisy version of each symbol. Soft-decoders can correct more errors, but they require more reads and a more complex decoding algorithm. Some flash controllers use a hard-decoder when BER is low and switch to a soft one when the former fails [72].
In flash memories, cells are read by comparing their voltage with a number of reference thresholds. If a total of $l$ reads have been performed on a page, each cell can be classified as falling into one of the $l + 1$ intervals between the read thresholds. The problem of reliably storing information on the flash is therefore equivalent to the problem of error-free transmission over a pulse amplitude modulation (PAM) channel [63]. The channel inputs represent the levels to which the cells are written, the outputs represent read intervals, and the channel transition probabilities specify how likely it is for cells programmed to a specific level to be found in each interval at read time [55, 73].
When we perform the minimum required number of reads on a page, cells can only be classified into the nominal symbols. However, if we perform additional reads, we can achieve a finer quantization of the cell voltages. It is then possible to assign an LLR value to each of these voltage intervals. The LLR value associated with a read interval $r$ between level $i$ and level $j$ is defined as $LLR_r = \log(P_{ir}/P_{jr})$, where $P_{ab}$ denotes the transition probability from $a$ to $b$. A hard decoder takes a greedy approach, mapping each interval to the most likely nominal symbol and returning the closest codeword. A soft decoder operates on the LLR values and uses those probabilities to perform a maximum likelihood estimation of the codeword.
As mentioned in Section 4.4.2, the spreading approach brings more possible programming voltage levels and requires more reads to distinguish them. This results in a finer quantization of the cell voltages and provides soft inputs to the decoder. This holds even if we reduce the number of reads to be the same as in the regular scheme, since the de-spreading step will combine the read voltages, increasing the total number of possible values. For example, conventional SLC memories use a single read to classify the cells into two states. The channel with the regular scheme is then equivalent to a Binary Symmetric Channel (BSC). In the spreading scheme, however, a single read will still classify each cell into one of two states, but after de-spreading with $N = 4$, each component can take five possible values. The channel observed by each symbol is then equivalent to the PAM channel with five outputs illustrated in Fig. 4.13. As an example, Table 4.2 shows the transition probabilities for Fig. 4.13 when write noise is Gaussian with $\sigma = 0.3$ and ICI parameter $\gamma = 0.5$. Soft information plays an important role when noise is large and helps to minimize the probability of error. In our numerical simulation with strong noise in SLC mentioned above (i.e., write noise with $\sigma = 0.3$ and ICI noise with $\gamma = 0.5$), the regular scheme causes probability of error 0.1043 and the spreading scheme causes probability of error 0.0893.
Fig. 4.14 shows the symmetric capacity (i.e., capacity under uniform distribution of inputs) of the channel in the MLC case with write noise $\mathcal{N}(0, 0.1^2)$, as a function of the ICI parameter $\gamma$. When the ICI is weak, the regular scheme (using 3 reads) has higher capacity than the spreading one, even when the latter uses 12 reads. However, the capacity of the regular scheme decreases rapidly when the ICI increases, falling below the capacity of the spreading scheme, even when the latter uses 3 reads.

Figure 4.13.: PAM channel equivalent to SLC flash read channel in spreading scheme with $M = N = 4$. [Channel diagram: two inputs (1, 2), five outputs (1 to 5), transition probabilities $P_{11}, \ldots, P_{15}$ and $P_{21}, \ldots, P_{25}$.]

Table 4.2.: Transition probabilities and LLR values for SLC cells.

P_ij     j=1       j=2       j=3       j=4       j=5
i=1      0.0555    0.2048    0.1784    0.0578    0.0041
i=2      0.0040    0.0580    0.1786    0.2034    0.0555
LLR_j    2.6301    1.2616    -0.0011   -1.2582   -2.6054
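The LLR row of Table 4.2 follows from the two probability rows through $LLR_j = \log(P_{1j}/P_{2j})$. The short Python sketch below recomputes it and also derives the hard decision a greedy decoder would take for each output interval. The function names are hypothetical, and small differences from the tabulated LLRs are expected because the tabulated probabilities are rounded.

```python
import math

# Transition probabilities P_ij from Table 4.2 (rows: inputs i = 1, 2).
P = [
    [0.0555, 0.2048, 0.1784, 0.0578, 0.0041],
    [0.0040, 0.0580, 0.1786, 0.2034, 0.0555],
]


def llr(P):
    """LLR_j = log(P_1j / P_2j) for every output interval j (natural log)."""
    return [math.log(p1 / p2) for p1, p2 in zip(P[0], P[1])]


def hard_decision(P):
    """Most likely input (1 or 2) for each output interval."""
    return [1 if p1 >= p2 else 2 for p1, p2 in zip(P[0], P[1])]


llr_values = llr(P)
decisions = hard_decision(P)
print([round(x, 4) for x in llr_values])
print(decisions)
```

The middle interval has an LLR very close to zero, so a hard decoder's choice there carries almost no information, whereas a soft decoder can weight it accordingly.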

Figure 4.14.: Comparison of the channel capacity of the spreading scheme and regular scheme for MLC flash. [Curves: Usual scheme, Spreading (12 reads), Spreading (3 reads); horizontal axis: $\gamma$ (ICI parameter), vertical axis: capacity (uniform inputs).]

4.4.5 Security
Section 4.4.2 has shown how spreading can be beneficial to reduce the probability of error when ICI is large. This section will show that it can also be used to hide information using a technique known as superposition coding [74]. This technique has been widely used in direct-sequence spread spectrum (DSSS) communications to make spread-spectrum signals appear wide-band and noise-like, thus making them hard to detect [75].

The key idea of hiding information using superposition coding is making the modulated hidden information look like additional noise. In this chapter, we are going to use a long pseudo-noise (PN) sequence [76] to spread a single symbol of hidden information over many cells. This will create a very long sequence of voltage components, which will be added on top of the original information stored in flash in plain view. The spreading and de-spreading processes are described as follows. Denote the spreading sequence (PN sequence) by $\mathbf{d} \in \{+1, -1\}^L$ and a hidden information symbol by $h$. The voltage components for the hidden information can be represented as

$$
\tilde{\mathbf{v}} = \varepsilon \mathbf{d} h,
$$

where $\varepsilon$ is a (small) scaling parameter that controls the range of programmed voltages. The combined voltage $\mathbf{v}_c$ for the original and hidden information is:

$$
\mathbf{v}_c = \mathbf{b} + \varepsilon \mathbf{d} h + \mathbf{n},
$$

where $\mathbf{b}$ is the vector of $L$ plain view information symbols and $\mathbf{n}$ is the noise as defined in Eq. (4.4).
The distribution of the combined voltage $\mathbf{v}_c$ should still look similar to that of the original information to any unauthorized reader, making it difficult for them to notice the existence of the hidden information. For example, as shown in Fig. 4.15, the distribution of the combined voltage of the original symbols and hidden information is still similar to the distribution of the original information when we choose the scaling factor $\varepsilon$ appropriately. To make the combined voltage as random as possible, we may decrease $\varepsilon$ so that the power of hidden information is reduced. However, a small $\varepsilon$ also brings higher probability of error when decoding the hidden information because write noise will have a comparatively larger influence.
Two steps are required to decode the hidden information. The first step is subtracting the original information. Assuming that we can recover the original information with low probability of error, the voltage left after subtracting the original information is approximately:

$$
\mathbf{v}_s \approx \varepsilon \mathbf{d} h + \mathbf{n}.
$$

The second step is de-spreading using $\mathbf{d}$. The decoded hidden information symbol is:

$$
\hat{h} = \frac{\mathbf{v}_s \mathbf{d}^T}{L\varepsilon} = h + \frac{1}{\varepsilon L} \sum_{i=1}^{L} \pm n_i,
$$

where $L$ is the length of the spreading sequence.
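The two decoding steps can be sketched in Python as follows. The example assumes that the plain-view symbols are recovered without error before subtraction, uses a random $\pm 1$ sequence as a stand-in for a PN sequence, and uses a fixed, deterministic perturbation in place of random write noise so that the result is reproducible. All names and numerical values are illustrative.

```python
import numpy as np


def embed(b, d, h, eps):
    """Superpose the hidden component eps * d * h on the plain-view symbols b."""
    return b + eps * d * h


def recover_hidden(v_c, b_hat, d, eps):
    """Subtract the recovered plain-view symbols, then de-spread with d."""
    v_s = v_c - b_hat
    return float(v_s @ d) / (len(d) * eps)


rng = np.random.default_rng(1)
L, eps, h = 16, 0.1, 1.0
d = rng.choice([-1.0, 1.0], size=L)      # stand-in for a PN sequence
b = rng.choice([-0.5, 0.5], size=L)      # plain-view SLC symbols
n = 0.05 * np.where(np.arange(L) % 2 == 0, 1.0, -1.0)  # fixed illustrative noise

h_clean = recover_hidden(embed(b, d, h, eps), b, d, eps)
h_noisy = recover_hidden(embed(b, d, h, eps) + n, b, d, eps)
print(h_clean, h_noisy)
```

Because $\mathbf{d}\mathbf{d}^T = L$ for any $\pm 1$ sequence, the hidden symbol is recovered exactly in the noiseless case, and the noise contribution is scaled by $1/(\varepsilon L)$ as in the expression for $\hat{h}$.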

Assuming the write noise to be Gaussian with variance $\sigma^2$, the SNR after the de-spreading process is:

$$
SNR_{PN} = \frac{P_s L \varepsilon^2}{\sigma^2}, \tag{4.15}
$$

where $P_s$ is the power of each hidden information symbol.

According to Eq. (4.15), another way to increase the accuracy of the recovered hidden information is to increase the length of the spreading sequence $L$. As shown in Fig. 4.16, $P_e^{hidden}$ is lower for the same $P_e^{original}$ (equivalently, for the same noise variance) when the length of the spreading sequence $L$ is larger; and vice versa. However, as $L$ increases, so does the number of cells required to store each hidden information symbol, thereby reducing the effective capacity of the memory.
In order to both increase the accuracy of the recovered hidden information and save storage space, we may group several information symbols together. Grouping means that we can write several information symbols in one group of cells using orthogonal spreading sequences and decode them separately. For example, Fig. 4.16 shows that choosing the spreading sequence length to be $L = 32$ and using two overlapping orthogonal sequences has a better performance than the case with $L = 16$ and a single sequence, even though both schemes use the same storage space.
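Grouping can be illustrated with two hidden symbols written over the same $L = 32$ cells. The sketch below uses two rows of a Walsh-Hadamard matrix as the orthogonal sequences, which is an assumption made for simplicity (the text uses PN sequences), and omits noise and the plain-view symbols so that only the separation of the two hidden symbols is shown.

```python
import numpy as np


def walsh(n):
    """Sylvester construction of an n x n Walsh-Hadamard matrix (n a power of 2)."""
    H = np.array([[1]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H


L, eps = 32, 0.1
H = walsh(L)
d1 = H[1].astype(float)          # two mutually orthogonal +/-1 sequences
d2 = H[2].astype(float)
h1, h2 = 1.0, -1.0               # two hidden symbols sharing the same cells

v_hidden = eps * (d1 * h1 + d2 * h2)

# Each symbol is de-spread with its own sequence; the other one cancels out.
h1_hat = float(v_hidden @ d1) / (L * eps)
h2_hat = float(v_hidden @ d2) / (L * eps)
print(h1_hat, h2_hat)
```

Orthogonality is what allows separate decoding: the cross term `d1 @ d2` is zero, so neither symbol leaks into the estimate of the other.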
Additionally, the spreading approach supports multiple access: different hidden information sequences can be written at different times using orthogonal spreading sequences and without erasing the cells between writes. This can be achieved by shifting $\tilde{\mathbf{v}}$ up to be non-negative, so that each write only needs to introduce a small voltage increment on the cells, and shifting the read voltages down before de-spreading.

Figure 4.15.: Voltage distribution for a SLC cell with $\sigma = 0.1$, spreading factor $\varepsilon = 0.1$ and length of spreading sequence $L = 16$. [Three histogram panels: "original information only", "hidden information only", and "combined distribution"; horizontal axis: voltage, vertical axis: percentage.]

4.4.6 Simulation Results
This section compares the proposed data representation scheme with the traditional one through simulations. It evaluates both of them in terms of BER and damage caused to the memory. We simulate 10 memory blocks with 128 pages per block and 8096 cells in each page. Each cell is assumed to suffer ICI from 3 neighbors in the next wordline, with ICI coefficients $(\gamma_y, \gamma_{xy}) = (0.08, 0.006)$, where $\gamma_y$ is the ICI coefficient for the direct neighbor (the one in the same bitline) and $\gamma_{xy}$ is the ICI coefficient for the two diagonal ones [13, 77]. The write noise is assumed to be Gaussian with zero mean and $\sigma = 0.1$, so $n^w \sim \mathcal{N}(0, 0.1^2)$.

Figure 4.16.: SLC cell with $\sigma = 0.2$. $P_e^{original}$ represents the probability of error of the decoded original information and $P_e^{hidden}$ represents the probability of error of the decoded hidden information. [Plot of $P_e^{original}$ (vertical axis) against $P_e^{hidden}$ (horizontal axis); curves: L=32 (2 groups), L=16, N=8.]

First, we study how BER increases with ICI when $M = N$, so that the storage efficiency is the same as that in the regular scheme. Assume $M = N = 4$, $k = 1.1$, and the voltages $\mathbf{v}'$ are cropped to be in the range $[-1.5, 1.5]$ for a MLC memory and $[-3.5, 3.5]$ for a TLC memory. The first two curves in Fig. 4.17 and Fig. 4.18 (no redundancy) show the results for MLC and TLC, respectively. When ICI is small the regular scheme performs better in both MLC and TLC cells, but when ICI increases, the spreading scheme provides lower BER.
We also study the case when there is redundancy. We assume $N = 4$ and $M = 3$, that is, spreading 3 symbols over 4 cells, so that the code rate is 75% in the spreading scheme. In the regular scheme, we use a $(15, 11)$ Hamming code to encode the input information so that the code rate is almost the same as that in the spreading scheme. The last two curves in Fig. 4.17 and Fig. 4.18 show that the spreading scheme provides lower BER as ICI increases.

Figure 4.17.: Evolution of the probability of error as ICI increases for an MLC memory (i.e., $\mathbf{b} \in \{-1.5, -0.5, 0.5, 1.5\}^4$), when k=1.1, $\sigma = 0.1$, $(\gamma_y, \gamma_{xy}) = (0.08, 0.006)$, and broken rate $p = 0.001$. [Curves: $P_e^{spread}$ (no redundancy), $P_e^{regular}$ (no redundancy), $P_e^{spread}$ (with redundancy), $P_e^{regular}$ (with redundancy); horizontal axis: $\gamma_y$ (ICI coefficient), vertical axis: $P_e$.]

Then, we study the case when the impulse noise dominates the BER in SLC. In order to focus on the impulse noise, both the Gaussian noise and ICI are assumed to be small. The results without parity are given by the first two curves in Fig. 4.19 (no LDPC), showing that the BER with the traditional scheme is much larger than with the spreading scheme. Additionally, we analyzed the performance when the spreading modulation was combined with an LDPC encoding of the information. Specifically, we used the (64800, 58320) LDPC code which is embedded in MATLAB R2015b. When the write noise and ICI noise are negligible, the problem of writing and reading information from a flash memory with the traditional modulation is equivalent to transmission over a binary erasure channel (BEC). The spreading scheme, on the other hand, spreads out the noise caused by broken cells, effectively transforming the BEC into a binary input AWGN channel, which is better for soft decoding. Fig. 4.19 shows that the spreading scheme begins to fail at a larger $p$ and has lower BER among the output bits.

Figure 4.18.: Evolution of the probability of error as ICI increases for a TLC memory (i.e., $\mathbf{b} \in \{\pm 3.5, \pm 2.5, \pm 1.5, \pm 0.5\}^4$), when k=1.1, $\sigma = 0.1$, $\gamma_y : \gamma_{xy} = 0.08 : 0.006$, and broken rate $p = 0.001$. [Curves: $P_e^{spread}$ (no redundancy), $P_e^{regular}$ (no redundancy), $P_e^{spread}$ (with redundancy), $P_e^{regular}$ (with redundancy); horizontal axis: $\gamma_y$ (ICI coefficient), vertical axis: $P_e$.]

Finally, we designed an experiment to evaluate how the voltage level to which a cell is programmed influences the damage that it suffers. Our preliminary results showed that when memories are programmed with highly structured data (e.g. 50% of the cells in a wordline written to the same level), they behave abnormally. Hence we tried to use random data in our experiment, while still imposing enough structure to observe different amounts of damage in different cells. Four blocks in a 19nm MLC flash were repeatedly erased and programmed with random data, generated according to a different distribution for each cell. For example, some cells were programmed to the highest level 90% of the time, while others only reached that level on 10% of the PE cycles. After the wearing phase, each cell was programmed 100 more times with uniform random data and a dwell time of 1 hour at a temperature of 60 °C between writes. The information was read back before each new write, so as to obtain an average BER (proxy for damage) at the end of the 100 writes.

[Figure 4.19 plot: BER (Pe) versus cell broken rate p, with curves for the spread and regular schemes, each with and without LDPC.]

Figure 4.19.: Evolution of BER as cell broken rate $p$ increases for an SLC memory (i.e., $b \in \{-0.5, 0.5\}^4$), when $M = N = 4$, $k = 1.1$, $\sigma = 0.1$, ICI coefficient $\gamma = 0.1$; the impulse noise dominates the BER.

Once the cells had been worn (to a different number of PE cycles for each block) and the BER data had been collected, we performed a least squares fit to the model

$$\mathrm{BER}_i = \alpha_1 P_i^{(1)} + \alpha_2 P_i^{(2)} + \alpha_3 P_i^{(3)} + \alpha_4 P_i^{(4)}, \quad i = 1, \ldots, 10^8,$$

where $P_i^{(j)}$ denotes the probability of programming the $i$-th cell to level $j$ on each cycle of the wearing phase. The coefficients obtained for each of the four blocks, which should be proportional to the damage caused by programming each level, are shown in Fig. 4.20. Assuming that the voltage levels are equally spaced$^{5}$, a clear superlinear behavior can be observed. A quadratic model was adopted for simplicity, yielding the expression in Eq. (4.10) for the damage with each scheme (without loss of generality, we assumed $a = 1$).

5 Unfortunately, we were not able to verify this fact from our memory manufacturer.
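The least squares fit described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `fit_damage_coefficients`, the synthetic probabilities, and the coefficient values are all made up for the example and are not the measured data.

```python
import numpy as np

def fit_damage_coefficients(P, ber):
    """Least-squares fit of ber ~ P @ alpha.

    P   : (num_cells, 4) array; P[i, j] is the probability that cell i was
          programmed to level j+1 during the wearing phase.
    ber : (num_cells,) array of per-cell error rates (proxy for damage).
    """
    alpha, *_ = np.linalg.lstsq(P, ber, rcond=None)
    return alpha

# Synthetic illustration (hypothetical numbers, not measurements).
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(4), size=1000)            # each row sums to 1
true_alpha = np.array([0.001, 0.002, 0.004, 0.008])  # assumed superlinear growth
ber = P @ true_alpha                                 # noiseless, for checking
alpha_hat = fit_damage_coefficients(P, ber)
```

With noiseless synthetic data the fit recovers the assumed coefficients; with real measurements the residual would indicate how well the linear-in-probabilities model holds.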

[Figure 4.20 plot: fitted coefficient α versus level programmed (1 to 4), with one curve each for PE = 5000, 10000, 15000, and 20000.]

Figure 4.20.: Coefficients modeling the damage suffered by a 19nm MLC cell when programmed to each voltage level, for different numbers of PE cycles.

4.5 Summary
This chapter proposes two signal processing approaches to improve the reliability of NAND flash memories. The first is the multi-page read method, which reads multiple wordlines together and returns a bitwise OR (or AND, depending on notation) of their stored information. We show that this read method can improve reliability by equalizing ICI noise, reduce the damage caused by erase operations, and accelerate the decoding of certain WOM codes. Furthermore, the proposed read method provides a very fast way of operating on the data without actually having to read it, so other applications are likely to exist and can be studied in future research.
We also propose a novel data representation scheme where Walsh codes are used to store the information in a NAND flash memory, so that $M$ symbols are spread out over $N$ cells. In this chapter we only discuss the case where $M = N$, so that the storage efficiency is the same as that of the regular scheme. However, the proposed scheme could achieve better performance at the cost of more storage space when $N > M$. By increasing the number of possible voltage levels in each cell, disregarding the fact that these levels could overlap, the proposed scheme can provide significant gains in terms of robustness towards inter-cell interference and impulse noise. Additionally, higher levels are programmed less frequently, reducing the damage suffered by the cells and thereby extending the endurance of the memory. We also show that this spreading technique can be used to overlap a hidden layer of information on top of the one stored in plain view. This chapter provides analytical expressions for the SNR and BER of this spreading scheme under Gaussian noise, ICI, and impulse noise. Its performance is then studied through simulations. In the next chapter, we will extend these signal processing approaches to an emerging non-volatile memory, Resistive RAM.


5. SIGNAL PROCESSING FOR CROSSPOINT RESISTIVE
MEMORIES

5.1 Introduction
Resistive RAM (ReRAM) has attracted intense research interest in the memory industry because of its high density, fast access time, and low power consumption. It uses memory resistors ("memristors" for short) to store information. A memristor is a nonlinear resistor whose value can be adjusted by pushing current across its terminals. An SLC ReRAM cell has two states: high resistance state (HRS) and low resistance state (LRS), while MLC cells may have other intermediate levels.
Memristors are implemented in the form of a metal oxide layer sandwiched between two metal electrodes. When a voltage is applied to a ReRAM cell, conductive filaments (CF) are either formed or ruptured depending on the voltage polarity. The cell's resistance depends on the strength of the CFs and thus can be controlled by the programming current. This feature makes multiple-level ReRAM cells possible.

There are two types of architectures for ReRAM memories: MOS-accessed and crosspoint. In MOS-accessed architectures, each memristor is paired with a transistor that isolates the cell from the rest of the array when it is not in use. This 1T1R (one transistor one resistor) architecture provides superior isolation between neighboring cells, power efficiency, and access time, but the transistors dominate the cell size, increasing the area and consequently the cost of the memory. Crosspoint architectures employ diodes (1D1R) or no selector device at all (0T1R) instead of MOS transistors to control cell access [20,78]. Crosspoint architectures for ReRAM offer higher density, power efficiency, and endurance than most other emerging memory technologies, but they suffer from sneak currents that cause significant write and read noise.


Typically, writing or reading the information stored in a cell is done by biasing the corresponding wordline with a given voltage, grounding the corresponding bitline, and measuring the current flowing out of the bitline, as shown in Fig. 5.1. Other wordlines and bitlines are partially biased with a smaller voltage, so as to reduce the current through the unselected cells. Ideally, all the current would flow through the selected cell, but in practice some additional current flows through unselected cells, especially those in the same wordline or bitline as the selected cell (half-selected cells). These additional currents are called sneak currents [20], as illustrated in Fig. 5.1.
The magnitude of the sneak currents increases dramatically with the size of the memristor array. This is especially critical during programming: cells located far from the driver can experience significantly different voltages depending on the state of the other cells in the same bitline and wordline, low when they are all LRS and high when they are HRS. If the voltage across the cells is too low they do not get programmed, but increasing the driver's voltage can cause undesired programming of the cells closer to it. This voltage drop problem can be mitigated using dual-port write as proposed in [79], but at the cost of doubling the area and power required for the drivers.

Sneak currents are also a problem during read operations. The state of a cell is read out by measuring the current leaving the selected bitline [78]. Large currents correspond to cells in the LRS and small currents to cells in the HRS. However, sneak currents introduce noise in the measured currents and can cause errors in the estimation of cell resistances.
The sneak current problem is even more severe in MLC memories, which use intermediate resistance levels between LRS and HRS to store multiple bits in each memristor. MLC memories have higher density but also lower noise margins than SLC memories [80]. The scaling of the technology is causing the HRS resistance to increase while the LRS resistance remains nearly constant, effectively increasing the noise margins and making room for additional levels$^{1}$. Consequently, MLC is becoming the norm in the industry.

[Figure 5.1 diagram: crosspoint array with wordlines Vw1 to Vw6 and bitlines Vb1 to Vb6, marking the desired current path and a sneak path.]

Figure 5.1.: Illustration of sneak currents.
Several approaches have been proposed to deal with sneak currents in crosspoint ReRAM arrays. At the device level, it is desirable to have the current through a cell decrease superlinearly with the voltage across its terminals, so as to reduce the sneak currents through half-selected cells. This non-linearity is achieved with special materials or by including a selector in each cell [20]. There also exist methods which perform additional reads to estimate the sneak current (often called background noise) and then subtract it from the current obtained at read time [78]. A more precise alternative is multistage reading, which performs three or more reads of the target cell [80]. These methods can cancel the sneak current, but they sacrifice speed and power efficiency.
This chapter proposes and analyzes the potential of multiple signal processing methods to fight the sneak current problem. The rest of this chapter is organized as follows: Section 5.2 explains our model of sneak currents as a form of inter-cell interference and states the assumptions that will be made throughout the chapter. Section 5.3 proposes the different techniques. Section 5.4 presents simulation results to validate the methods. Finally, Section 5.5 summarizes the chapter. The results of this chapter are published in [21].

1 HRS can increase up to 16 MΩ when the cell size shrinks to 10 nm [81].

5.2 System Model
Writes and reads of an MLC ReRAM memory can be done on a cell-by-cell basis as shown in Fig. 5.1, multiple cells at a time, or even an entire wordline at a time by grounding and sensing the current in all the bitlines. Experiments show that the latter is more power efficient than single cell operations [82], but it has significant disadvantages in terms of area and reliability. This chapter assumes that multiple cells are written and read at a time, but not necessarily the whole wordline.
A critical feature of crosspoint ReRAM architectures is the non-linearity of the cells. That is, the current through a memristor decreases superlinearly as the voltage across its terminals is reduced. This relationship is captured in a non-linearity coefficient

$$k_r(p, V) = p \times \frac{R(V/p)}{R(V)}, \tag{5.1}$$

where $R(V/p)$ and $R(V)$ are the resistances of the cell biased at $V/p$ and at $V$, respectively. If $k_r(p, V)$ is large, the resistance increases rapidly as the voltage drops. For example, when $k_r(2, V) = 2000$ the resistance of a half-biased cell is 1000 times larger than that of a fully biased cell. Memristors have an inherent amount of non-linearity, but it is not sufficient for typical array sizes [82]. The rest of the chapter will assume that each memristor is paired with a dedicated selector to increase the non-linearity coefficient to $k_r(2, V) = 2000$ [20].
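As a small numerical illustration of Eq. (5.1), the sketch below inverts the definition of the non-linearity coefficient to obtain the resistance and current of a partially biased cell. The function names are hypothetical and the cell is assumed to follow Eq. (5.1) exactly.

```python
def resistance_at_fraction(r_full, k_r, p=2.0):
    """Resistance of a cell biased at V/p, from k_r = p * R(V/p) / R(V)."""
    return k_r * r_full / p

def current_at_fraction(v, r_full, k_r, p=2.0):
    """Current through a cell biased at V/p (Ohm's law at the reduced bias)."""
    return (v / p) / resistance_at_fraction(r_full, k_r, p)

# A 1 kOhm cell with k_r(2, V) = 2000, read voltage 1 V.
r_half = resistance_at_fraction(1e3, 2000)    # 1000x the full-bias resistance
i_half = current_at_fraction(1.0, 1e3, 2000)  # equals V / (k_r * R)
```

Note that the half-biased current works out to $V/(k_r R)$, which is the form in which half-selected cells enter the sneak current model later in this section.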

The resistances of the highest and lowest levels are fixed by the device, but there exist multiple choices of intermediate resistance levels. An ISO-ΔR allocation spaces the resistances linearly, resulting in low power consumption but long sensing latency. An ISO-ΔI allocation spaces the currents (inverse resistances) linearly, increasing noise and power consumption to facilitate read operations. The ISO-Δlog(R) allocation provides a trade-off between the previous two by spacing the resistances geometrically [83].
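The three allocations can be compared with a short sketch. The function names are illustrative, and the end points (1 kΩ and 1 MΩ) are chosen only because they match the LRS and HRS values used in the simulations of this chapter.

```python
import numpy as np

def iso_delta_r(r_low, r_high, levels):
    """Resistances equally spaced between r_low and r_high."""
    return np.linspace(r_low, r_high, levels)

def iso_delta_i(r_low, r_high, levels):
    """Currents (inverse resistances) equally spaced at a fixed read voltage."""
    return 1.0 / np.linspace(1.0 / r_low, 1.0 / r_high, levels)

def iso_delta_log_r(r_low, r_high, levels):
    """Resistances geometrically spaced (equal steps in log R)."""
    return np.geomspace(r_low, r_high, levels)

for name, alloc in [("ISO-dR", iso_delta_r), ("ISO-dI", iso_delta_i),
                    ("ISO-dlog(R)", iso_delta_log_r)]:
    print(name, alloc(1e3, 1e6, 4))
```

With four levels, the geometric allocation yields one level per decade, whereas the ISO-ΔI allocation crowds the intermediate levels close to the LRS.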
There exist multiple read schemes, with different biasing for wordlines and bitlines. The biasing voltages for selected and non-selected wordlines will be denoted by $V_{sw}$ and $V_w$, respectively. Selected bitlines will be assumed to be grounded and non-selected bitlines will be biased to a voltage $V_b$. Two of the most popular biasings are $V_w = V_b = 0$ (ground-ground) and $V_w = V_b = V_{sw}/2$ (half-biasing) [84]. The former reduces the sneak currents but consumes a lot of power and causes significant voltage drop in long wordlines [80]. Half-biasing alleviates the voltage drop problem and lowers power consumption but suffers stronger sneak currents. The rest of the chapter assumes a half-biasing scheme for reading.
Reads are subject to two main sources of noise: sneak currents and voltage drop. The estimated resistance $\widehat{R}_{ij}$ at crossing $(i, j)$ can be represented as:

$$\widehat{R}_{ij} = R_{ij} + Z_{drop} - Z_{sneak}, \tag{5.2}$$

where $R_{ij}$ is the exact resistance, $Z_{drop}$ is the error caused by the voltage drop along the wordline, and $Z_{sneak}$ is the error caused by sneak currents flowing into the bitline being read. The voltage drop noise is most prominent in bitlines far from the driver. It reduces the measured current, hence increasing the estimated resistance. Sneak currents, on the other hand, are similar for all wordlines. They increase the measured current, hence reducing the estimated resistance. The next subsections will develop simplified models for both sources of noise.

5.2.1 Sneak Currents
With half-biasing, non-selected wordlines keep a nearly constant voltage of $V_{sw}/2$ throughout. Consequently, the sneak current flowing into a selected bitline depends mostly on the resistances on that bitline. Ignoring the voltage drop results in the circuit model shown in Fig. 5.2. Let $R_{ij}$ denote the resistance of the memristor located on the $i$-th wordline and $j$-th bitline of an $n \times n$ array, then the current measured when reading cell $(i, j)$ is

$$I_{ij}^{total} = I_{ij} + I_{ij}^{sneak}, \tag{5.3}$$

where

$$I_{ij} = \frac{V_{sw}}{R_{ij}}, \qquad I_{ij}^{sneak} = \sum_{1 \le w \le n,\, w \neq i} \frac{V_{sw}}{k_r \cdot R_{wj}} \tag{5.4}$$

represent the desired and sneak currents, respectively. If the resistance is estimated as the read voltage divided by the measured current, i.e. $\widehat{R}_{ij} = V_{sw} / I_{ij}^{total}$, then

$$\widehat{R}_{ij}^{s} = R_{ij} \left( \frac{1}{1 + \sum_{w \neq i} \frac{R_{ij}}{k_r R_{wj}}} \right). \tag{5.5}$$

The sneak currents shown in Eq. (5.4) can be understood as a form of inter-cell interference (ICI) between the cells in the same bitline. ICI has been extensively studied for other types of memories, such as NAND Flash [13, 17, 49]. Section 5.3 shows how some of the existing methods for dealing with ICI in other technologies can be used in ReRAM memories.
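To get a feel for the size of this effect, Eq. (5.5) can be evaluated directly. The sketch below is a minimal illustration: the function name, the 0-based indexing, and the worst-case data pattern are assumptions made for the example, and voltage drop is ignored as in the derivation above.

```python
import numpy as np

def sneak_estimate(R, i, j, k_r):
    """Estimated resistance of cell (i, j) under Eq. (5.5).

    R is an n x n array of true resistances; i and j are 0-based indices.
    """
    others = np.delete(R[:, j], i)                 # other cells on bitline j
    return R[i, j] / (1.0 + np.sum(R[i, j] / (k_r * others)))

# Hypothetical worst case: one HRS cell on a bitline otherwise full of LRS cells.
R = np.full((256, 256), 1e3)
R[0, 0] = 1e6
print(sneak_estimate(R, 0, 0, k_r=2000))           # about 7.8 kOhm per Eq. (5.5)
```

Under this simplified model the HRS cell is read as a much lower resistance, which illustrates why high-resistance levels are the most sensitive to sneak currents as the array grows.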

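[Figure 5.2 diagram: the selected cell Rij, driven at Vsw, together with the other cells R1j, R2j, ..., Rnj on the same bitline, each driven at Vsw/2; the total current I is measured on the bitline.]

Figure 5.2.: Circuit model for sneak currents in ReRAM.

[Figure 5.3 diagram: the selected wordline, driven at Vsw, with a wire resistance Rwire between adjacent cells Ri1, ..., Rin.]

Figure 5.3.: Circuit model for voltage drop in ReRAM.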

5.2.2 Voltage Drop
The second source of noise is the voltage drop along the selected wordline. Ignoring the sneak currents through other wordlines results in the circuit model shown in Fig. 5.3, where $R_{wire}$ represents the resistance of the connection between adjacent memristors on the same wordline. Denoting by $I_{i1}, I_{i2}, \ldots, I_{in}$ the currents leaving the selected wordline $i$ through each bitline, the voltage at the $j$-th cell is given by:

$$V_{ij} = V_{sw} - R_{wire} \cdot [1 \; 2 \; 3 \; \ldots \; j \; j \; \ldots \; j] \cdot \begin{bmatrix} I_{i1} \\ I_{i2} \\ \vdots \\ I_{in} \end{bmatrix}. \tag{5.6}$$

If all the currents are known, it is possible to cancel the voltage drop through equalization: applying Ohm's law ($R_{ij} = V_{ij} / I_{ij}$) to Eq. (5.6) yields:

$$R = \mathrm{diag}(1/I) \cdot (V_{sw} - R_{wire} \cdot A \cdot I),$$

where $R$ and $I$ are vectors of resistances and currents for the selected wordline and $A_{ij} = \min(i, j)$. However, in large arrays it is only possible to read a few bitlines at a time. Otherwise, the power consumed becomes too large and the excessive voltage drops introduce non-linear effects.
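The equalization step can be checked numerically with a short sketch. It assumes, purely for illustration, that every bitline on the selected wordline is grounded and sensed, so that each cell draws $I_{ij} = V_{ij}/R_{ij}$; the function names are hypothetical.

```python
import numpy as np

def drop_matrix(n):
    """A[i, j] = min(i, j) with 1-based indices, as in the text."""
    idx = np.arange(1, n + 1)
    return np.minimum.outer(idx, idx)

def wordline_currents(R, v_sw, r_wire):
    """Forward model: solve (diag(R) + r_wire * A) I = v_sw * 1 for I."""
    n = len(R)
    return np.linalg.solve(np.diag(R) + r_wire * drop_matrix(n),
                           np.full(n, v_sw))

def equalize(I, v_sw, r_wire):
    """Recover resistances: R = diag(1/I) (v_sw - r_wire * A I)."""
    return (v_sw - r_wire * (drop_matrix(len(I)) @ I)) / I

R_true = np.array([1e3, 1e4, 1e5, 1e6])
I = wordline_currents(R_true, v_sw=1.0, r_wire=1.0)
R_naive = 1.0 / I                          # ignores the voltage drop
R_equalized = equalize(I, v_sw=1.0, r_wire=1.0)
```

Here `R_naive` overestimates every resistance, while `R_equalized` matches `R_true` up to numerical precision, as expected when all currents on the wordline are known.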

If the resistances are estimated as $\widehat{R}_{ij} = V_{sw} / I_{ij}$ then

$$\widehat{R}_{ij}^{v} \simeq R_{ij} \left( 1 + R_{wire}\,[1 \; 2 \; \ldots \; j \; \ldots \; j] \begin{bmatrix} \alpha_1 / R_{i1} \\ \vdots \\ \alpha_n / R_{in} \end{bmatrix} \right), \tag{5.7}$$

where $\alpha = 1/k_r$ for half-biased bitlines and $\alpha = 1$ otherwise. Once again, the voltage drop can be understood as a form of inter-cell interference (ICI), this time between the cells in the same wordline.

If both sneak currents and voltage drop are taken into account and resistance levels are exponentially spaced, the estimated resistance $\widehat{R}_{ij}$ can be approximated as:

$$\log(\widehat{R}_{ij}) = \log(R_{ij}) - \log\left( 1 + \sum_{w \neq i} \frac{R_{ij}}{k_r R_{wj}} \right) + \log\left( 1 + R_{wire}\,[1 \; 2 \; \ldots \; j \; \ldots \; j] \begin{bmatrix} \alpha_1 / R_{i1} \\ \vdots \\ \alpha_n / R_{in} \end{bmatrix} \right). \tag{5.8}$$
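A direct evaluation of Eq. (5.8) can be sketched as follows. The function name, the 0-based indexing, and the boolean `selected` mask (which bitlines are grounded and read together) are assumptions introduced for the example.

```python
import numpy as np

def predicted_log_resistance(R, i, j, k_r, r_wire, selected):
    """log of the estimated resistance of cell (i, j), following Eq. (5.8).

    R        : array of true resistances (wordlines x bitlines), 0-based (i, j).
    selected : boolean array over bitlines; True means grounded (alpha = 1),
               False means half-biased (alpha = 1 / k_r).
    """
    n_bitlines = R.shape[1]
    # Sneak term: other cells on the same bitline.
    sneak = np.sum(R[i, j] / (k_r * np.delete(R[:, j], i)))
    # Voltage drop term: weights [1 2 ... j ... j] with 1-based j.
    weights = np.minimum(np.arange(1, n_bitlines + 1), j + 1)
    alpha = np.where(selected, 1.0, 1.0 / k_r)
    drop = r_wire * np.sum(weights * alpha / R[i, :])
    return np.log(R[i, j]) - np.log1p(sneak) + np.log1p(drop)
```

The two correction terms have opposite signs, matching the discussion of Eq. (5.2): sneak currents lower the estimate and the voltage drop raises it.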

5.3 Compensation for Sneak Currents
In ReRAM memories, information is modulated into cell resistances: SLC cells take two resistance levels, storing one bit of information, and MLC cells take four resistance levels, storing two bits of information. However, there is no inherent limitation in the number of levels that a cell can take and it is relatively easy to program other resistance levels with little write noise [83]. It is therefore possible to increase the number of levels that the cells take and use the redundancy to improve the reliability of the channel. This is commonly known as coded modulation.

Coded modulation can be used to reduce the number of errors in the memory [85, 86] or to reduce ICI and other sources of noise [17, 51]. This section proposes two techniques for reducing the voltage drop and sneak current noise: spreading modulation and distribution shaping.

5.3.1 Spreading Modulation
Spreading modulation was proposed in [17] to reduce ICI in NAND flash. This section shows how the same method can be used to address the sneak currents in ReRAM, with some additional advantages. The main idea is to use orthogonal spreading sequences to store multiple information symbols in the same cells without interfering, similar to the Code Division Multiple Access (CDMA) method used in wireless communications.

In its most general form, the spreading modulation uses an $N \times M$ Walsh submatrix (i.e., with $\pm 1$ entries and mutually orthogonal columns) denoted $C$ to map a vector $b$ of $M$ data symbols (assumed zero-mean) to a vector of $N$ cell currents $I$, where $N \ge M$. The corresponding cells are then programmed to appropriate resistances so that, when biased at $V_{sw}$, their currents are $I$. For example, when $N = 4$ and $M = 3$

$$\begin{bmatrix} I_1 \\ I_2 \\ I_3 \\ I_4 \end{bmatrix} = \alpha \cdot \begin{bmatrix} 1 & 1 & 1 \\ -1 & 1 & -1 \\ 1 & -1 & -1 \\ -1 & -1 & 1 \end{bmatrix} \cdot \begin{bmatrix} b_1 \\ b_2 \\ b_3 \end{bmatrix} + \beta \begin{bmatrix} 1 \\ \vdots \\ 1 \end{bmatrix}, \tag{5.9}$$

where the scaling and shifting parameters $\alpha$ and $\beta$ can be used to conform with the feasible range of currents. In general, the modulation can be expressed as

$$I_{written} = \alpha C b + \beta \mathbf{1},$$

and the demodulation (or despreading) as

$$\hat{b} = \frac{1}{\alpha N} C^T \left( I_{read} - \beta \mathbf{1} \right).$$
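The modulation and despreading expressions can be exercised with the $N = 4$, $M = 3$ matrix of Eq. (5.9). This is a minimal sketch: the function names and the numerical values of $\alpha$, $\beta$, and the data symbols are arbitrary choices for illustration.

```python
import numpy as np

# Walsh submatrix from Eq. (5.9): +-1 entries, mutually orthogonal columns.
C = np.array([[ 1,  1,  1],
              [-1,  1, -1],
              [ 1, -1, -1],
              [-1, -1,  1]])

def spread(b, alpha, beta):
    """I_written = alpha * C b + beta * 1."""
    return alpha * (C @ b) + beta

def despread(i_read, alpha, beta):
    """b_hat = C^T (I_read - beta * 1) / (alpha * N)."""
    return C.T @ (i_read - beta) / (alpha * C.shape[0])

b = np.array([0.5, -0.5, 0.5])            # illustrative zero-mean symbols
currents = spread(b, alpha=2.0, beta=10.0)
b_hat = despread(currents, alpha=2.0, beta=10.0)
b_hat_offset = despread(currents + 0.3, alpha=2.0, beta=10.0)  # common offset
```

Because every column of this $C$ sums to zero, an offset common to all four cells (such as the 0.3 added above) drops out of the despread symbols.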
The benefits of this scheme are two-fold. First, it programs intermediate resistance levels more often than extreme ones, thereby reducing the variance in the currents through the cells. Sneak currents are then more predictable and it becomes simpler to compensate for them, as shown in Fig. 5.4. Furthermore, the voltage drop problem during programming is significantly alleviated by the reduction in the number of LRS cells. The second benefit is that each information symbol is spread over many cells, so the information becomes less vulnerable to cell failures [17]. ReRAM cells do not suffer gradual drifts in their resistances, unlike Flash or PCM, but they are prone to large magnitude errors caused by sudden transitions to their lowest or highest resistance states [83].

[Figure 5.4 plot: two histograms of percentage versus sneak current (horizontal axis in units of 10^-6), one for the regular scheme and one for the spreading scheme.]

Figure 5.4.: Distribution of sneak currents centered at 0.

The spreading can also be done vertically, across cells on the same bitline. In this case, sneak currents are automatically canceled during despreading for symmetric spreading sequences, since sneak currents are nearly identical for all the cells in a bitline. However, it requires reading multiple wordlines to recover the information. Reading all $N$ wordlines allows recovering all $M$ information symbols, but it is also possible to recover a single symbol using two reads. The first read activates the wordlines corresponding to positive entries in the spreading sequence, and measures the sum of their currents flowing down the bitline. The second read does the same with the negative entries. Then, it is just a matter of subtracting both currents and estimating the information symbol.
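The two-read recovery of a single symbol can be sketched as follows. The model is deliberately simplified and is an assumption of this example: each aggregated read returns the sum of the activated cells' currents plus a sneak term that is the same for both reads. All names and numerical values are illustrative.

```python
import numpy as np

# Same N = 4, M = 3 Walsh submatrix as in Eq. (5.9).
C = np.array([[ 1,  1,  1],
              [-1,  1, -1],
              [ 1, -1, -1],
              [-1, -1,  1]])

def aggregated_read(cell_currents, active, sneak):
    """Sum of currents of the activated wordlines plus a common sneak term."""
    return cell_currents[active].sum() + sneak

def two_read_symbol(cell_currents, sequence, alpha, sneak=0.0):
    """Recover one symbol from a 'positive' and a 'negative' read."""
    pos = aggregated_read(cell_currents, sequence > 0, sneak)
    neg = aggregated_read(cell_currents, sequence < 0, sneak)
    return (pos - neg) / (alpha * len(sequence))

b = np.array([0.5, -0.5, 0.5])
alpha, beta = 2.0, 10.0
cells = alpha * (C @ b) + beta            # currents of the 4 cells on a bitline
recovered = [two_read_symbol(cells, C[:, m], alpha, sneak=0.3) for m in range(3)]
```

Since each spreading sequence has as many positive as negative entries, both the shift $\beta$ and the common sneak term cancel in the subtraction.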


5.3.2 Distribution Shaping
Another alternative for addressing the sneak current and voltage drop problems without expanding the modulation is to shape the distribution of the programmed levels to approach the channel capacity. The magnitude of the sneak currents and voltage drops depends on the data written and the bitlines being read, as shown by Eq. (5.8). Figure 5.5 illustrates this dependence for a $256 \times 256$ SLC ReRAM with LRS $= 10^3\,\Omega$, HRS $= 10^6\,\Omega$. The mean shift and variance for the HRS remain almost the same for all bitlines but for the LRS they both increase with the distance to the drivers.

Large resistances cause smaller voltage drops and sneak currents on other positions but they suffer more noise themselves. The sneak current flowing into a wordline is independent of the position within the array, but bitlines far from the drivers suffer and create stronger voltage drops. Intuitively, the distribution should favor high resistances at bitlines far from the driver so as to reduce the noise in all other positions, and use a balanced distribution for the first bitlines since they barely suffer any voltage drop.

[Figure 5.5 plots: mean shift (left) and variance (right) of the read noise versus bitline number (1 to 250), each with one curve for HRS and one for LRS.]

Figure 5.5.: Noise shift and variance for $256 \times 256$ SLC ReRAM with LRS $= 10^3\,\Omega$, HRS $= 10^6\,\Omega$, $R_{wire} = 1\,\Omega$, and $k_r = 2000$.

The optimal distribution depends strongly on the array's characteristics and needs
to be found numerically for each case. Once the desired distribution has been found,
it is necessary to design an encoder and decoder tailored to that distribution. One
possible way of achieving this is to use a lossless data compressor as decoder and the
corresponding de-compressor as encoder. Iterative source/channel decoding can then
be used to correct errors [87].

5.4 Simulations
This section provides simulation results to validate the model proposed in Section 5.2 and analyze the techniques proposed in Section 5.3. The simulations are based on a system of linear equations for the cross-point ReRAM obtained from Kirchhoff's Current Law (KCL), which was shown in [82] to be highly accurate. Except where stated otherwise, all simulations are for a $512 \times 512$ array with $k_r = 2000$, $R_{wire} = 1\,\Omega$, and $V_{sw} = 1$. Resistance levels are $(10^3, 10^4, 10^5, 10^6)$ and reads are done 16 cells at a time, evenly spaced along the wordline. For example, bitlines $1, 33, \ldots, 481$ are read simultaneously, as are bitlines $2, 34, \ldots, 482$.

First, we compare the resistance estimates predicted by the model in Eq. (5.8) with those obtained from the simulations as $\widehat{R}_{ij} = V_{sw} / I_{ij}$. Figure 5.6 shows the average difference between both estimates, normalized by the exact resistance, for different bitlines and resistance levels. It can be observed that the relative error is quite small, especially for large resistances and bitlines far from the voltage drivers, which are the most critical cases.

Figure 5.7 compares the BER observed with the spreading modulation scheme in Eq. (5.9) with that of a typical MLC modulation concatenated with an error correcting code of similar rate. For small arrays, both of them show negligible BER, but the spreading modulation provides significantly lower error rates for arrays larger than $256 \times 256$. The spreading modulation requires multiple reads on different wordlines (four in this case, since the spreading is done vertically), but it also returns more data so read throughput is not affected.
A significant portion of the gains obtained by the spreading modulation technique stems from the lower fraction of cells in the LRS. Figure 5.8 shows the capacity-maximizing distribution of levels for each bitline, discretized to the closest percentage point. It can be observed that, for the case under consideration, far bitlines should reduce the frequency of LRS and use the other three levels with equal probability. The information-theoretic capacity increases from 1.81 bits per cell when all four levels are equiprobable to 1.96 bits per cell with the distribution in Fig. 5.8. The attenuation of sneak currents and voltage drops more than compensates for the loss in capacity due to the asymmetric input distribution.
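The cost of an asymmetric input distribution can be gauged through the input entropy, which upper-bounds the information a cell can carry. The sketch below computes it for the uniform distribution and for a skewed one; the skewed probabilities are hypothetical values chosen for illustration and are not read from Fig. 5.8.

```python
import numpy as np

def entropy_bits(p):
    """Shannon entropy (in bits) of a discrete distribution p."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # 0 * log(0) is taken as 0
    return float(-np.sum(p * np.log2(p)))

uniform = [0.25, 0.25, 0.25, 0.25]    # four equiprobable levels
skewed = [0.16, 0.28, 0.28, 0.28]     # hypothetical: LRS used less often

print(entropy_bits(uniform), entropy_bits(skewed))
```

Moderately reducing the frequency of one level costs only a small fraction of a bit of input entropy, which is why the noise reduction it buys can result in a net capacity gain.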

[Figure 5.6 plot: relative error versus bitline number (0 to 512), with one curve each for R = 1 kΩ, 10 kΩ, 100 kΩ, and 1 MΩ.]

Figure 5.6.: Relative error between the simulated resistance estimates and those predicted by Eq. (5.8).

[Figure 5.7 plot: BER versus array size (220 to 360) for the regular and spreading schemes.]

Figure 5.7.: Evolution of the BER as the array size grows. The regular scheme uses a (15,11) Hamming code and the spreading modulation has $N = 4$, $M = 3$, so the rate is approximately 75% for both.

[Figure 5.8 plot: percentage of cells programmed to each level (R = 1 kΩ, 10 kΩ, 100 kΩ, 1 MΩ) versus bitline number.]

Figure 5.8.: Capacity-maximizing distribution of resistance levels per bitline. LRS cells cause more noise, so they are chosen less often; the other levels are equiprobable so their graphs overlap.

5.5 Summary
Crosspoint resistive memories are a very promising storage technology providing high density, fast access time, and low power consumption. As the memristors scale, the gap between the LRS and the HRS is becoming wider, so it can be expected that most ReRAM memories will soon store multiple bits in each cell. Unfortunately, resistive memories suffer from voltage drops and sneak currents which limit the size of the arrays. A lot of effort is being put into increasing the non-linearity of the cells through better materials and selectors, but there is little research available from the perspective of signal processing.

This chapter proposed a simple analytical model for estimating the read noise and two data representation techniques for reducing it. The first is a coded modulation scheme which spreads each data symbol across multiple cells, reducing the variance in the programmed resistances and thereby making noise more predictable. The second scheme consists of shaping the distribution of programmed levels to reduce noise levels. Both schemes were evaluated through simulations.


6. SUMMARY AND FUTURE WORK
In this thesis, we utilize signal processing approaches to solve problems in two storage systems: caching networks and non-volatile memories. Algorithms are proposed to improve the efficiency of information delivery in networks and reliability in storage hardware. Section 6.1 and Section 6.2 present a summary of our work and some suggested directions for future research in each of these two systems.

6.1 Caching Networks
Caching has been investigated as a useful technique to reduce the network load by prefetching some contents during off-peak hours. It consists of two phases: placement and delivery. In the placement phase, the users have access to the database and fill their caches. In the delivery phase, each user requests one file and only the server has access to all files. The server delivers messages to the users to fulfill their requests. Our work focuses on designing caching and delivery strategies for content delivery networks based on Maddah-Ali and Niesen's coded caching scheme [1].
We first propose coded caching algorithms for reducing the peak data rate in multi-user multi-server systems with distributed storage and different levels of redundancy in Chapter 2. The content delivery network is assumed to be flexible, in the sense that there is a path from every server to every user. Files are encoded using erasure codes and distributed among servers to protect against disk failures. Some systems use striping across multiple servers, while others store each whole file as a single unit to simplify bookkeeping. We propose algorithms for both cases in this thesis. We show that, by striping each file across multiple servers, the peak rate can be reduced proportionally to the number of servers. We then address the case where each file is stored as a single unit in one server and propose different caching and delivery schemes depending on the size of the cache memories.
One possible direction for future work is extending our scheme to distributed systems with more advanced erasure codes. In Chapter 2, we developed schemes for RAID-4 and RAID-6 codes to combine coded caching and distributed storage. However, more advanced erasure codes are used in real network systems. Therefore, it will be interesting to study how this process can be generalized to larger systems and more advanced erasure codes, such as fractional repetition codes [35, 36] or other RAID-6 [39] structures. It is also interesting to study the case where files have different popularity. All the former work [3, 88] has focused on the case where all data nodes (disks, servers, ...) have identical probability of failure and cost of recovery. This makes sense for error correction applications but when the data consists of files with different size or popularity, it would be useful to make some files easier to recover than others. Yet another interesting problem would be to find erasure codes for systems where different nodes have different failure probabilities.
Our second contribution is the study of the traffic load-I/O trade-off for coded caching in Chapter 3. In this work, we study the caching and delivery scheme for a system identical to that in [1]: users connect to a single server through a broadcast link. The I/O performance of the coded caching scheme proposed by Maddah-Ali and Niesen [1] is suboptimal when there are redundant requests. When the server constructs messages, the same segment could be read multiple times if it is used to construct different messages, which dramatically increases I/O reads. This thesis proposes caching and delivery algorithms which combine coded and uncoded transmission to leverage the trade-off between traffic load and disk I/Os. Our algorithms can improve both the average and worst case performance in terms of the user requests. In the future, it would be interesting to extend our work to study the traffic load-I/O trade-off with coded prefetching.


6.2 Non-volatile Memories
Our research on non-volatile memories mainly focuses on improving hardware
reliability and endurance. In Chapter 4, we study NAND flash memories. A NAND
flash memory is fundamentally an array of floating gate transistors, known as flash
cells, whose threshold voltages can be programmed to represent different information
symbols. To reduce cost, manufacturers are shrinking the cells and
packing more bits into each cell, which brings reliability challenges, especially ICI noise. ICI
is a phenomenon by which programming a cell increases the voltage of its neighbors.
Two methods are proposed in this thesis to improve the reliability of NAND flash:
multi-page read and spreading modulation. The multi-page read method reads multiple wordlines together and returns a combination of their stored information. It is
shown that this read method can improve the reliability of the stored information by
equalizing ICI noise, reduce the damage caused by erase operations, and accelerate
the decoding of certain constrained codes. Spreading modulation spreads the stored
information across multiple cells using Walsh codes. Higher levels are used less frequently than in the regular scheme, providing significant gains in terms of robustness
towards ICI and reducing the damage suffered by the cells. We also show that this
spreading technique can be used to overlay a hidden layer of information on top of
the one stored in plain view.
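As a rough illustration of spreading with Walsh codes, the sketch below superimposes several symbols over a group of cells using orthogonal Walsh-Hadamard rows and recovers them by correlation. It is generic orthogonal spreading under stated assumptions, not the level mapping of Chapter 4: the helper names are hypothetical, and the sketch ignores the constraint that real cell voltages are non-negative and bounded.

```python
def walsh_matrix(order):
    """Sylvester construction of a 2**order x 2**order Walsh-Hadamard matrix."""
    rows = [[1]]
    for _ in range(order):
        rows = [r + r for r in rows] + [r + [-x for x in r] for r in rows]
    return rows


def spread(symbols, codes):
    """Superimpose symbols[i] * codes[i] over the cells (one value per cell)."""
    num_cells = len(codes[0])
    return [sum(s * c[j] for s, c in zip(symbols, codes)) for j in range(num_cells)]


def despread(cells, codes):
    """Recover each symbol by correlating the cell values with its code."""
    num_cells = len(cells)
    # Orthogonality makes every correlation an exact multiple of num_cells.
    return [sum(x * c[j] for j, x in enumerate(cells)) // num_cells for c in codes]


codes = walsh_matrix(2)           # four mutually orthogonal codes of length 4
stored = spread([3, 0, 1, 2], codes)
recovered = despread(stored, codes)
```

Because each symbol is distributed over all cells in the group, a disturbance on one cell is averaged across the recovered symbols instead of corrupting a single one.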
In Chapter 5, we focus on another promising type of non-volatile memory, Resistive RAM. Re-RAM stores information in memory resistors, whose resistance can
be adjusted by pushing current across their terminals. To increase the density, crosspoint architectures are used, but they bring other problems, mainly related
to sneak currents flowing through deactivated cells. We propose a simple analytical
model for estimating the read noise and two data representation techniques for reducing it. The first one is spreading modulation, extended from the NAND scheme. It
reduces the variance in the programmed resistances, thereby making the noise more
predictable. The second scheme consists of shaping the distribution of programmed
levels to reduce noise levels.
One interesting direction for future work is extending these signal processing approaches
to emerging memory architectures such as 3D flash memories [89], which offer
much higher capacity and have become widespread among flash manufacturers. Instead of shrinking cells within a 2D plane, 3D flash memories stack cells in the
vertical direction [90, 91], and read noise becomes even more pronounced in this 3D
array architecture [92].

APPENDIX


A. PROOF FOR LEMMA 2.4.3
In this appendix, we elaborate on the pairing scheme in Lemma 2.4.3 from Chapter 2, especially for the case with even $K$ and symmetric requests.

Definition A.0.1 Let $\chi_A$ denote a set of messages (or, equivalently, subsets of $t+1$
users) to be sent by server A and $\chi_B$ denote a set of messages to be sent by server B.
If there is an injective function providing each element in $\chi_A$ with an effective pair in
$\chi_B$, we say that there is a saturating matching for $\chi_A$.

In order to reduce the peak rate we want to separate all the messages to be
transmitted (equivalently, subsets of $t+1$ users) into two groups $\chi_A$ and $\chi_B$ such that
there are as many effective pairs as possible, as we shall see.
To better illustrate the allocation scheme, the problem of finding effective pairs
is mapped to a graph problem. Let $G$ be a finite bipartite graph with bipartite sets
$\chi_A$ and $\chi_B$, where each message (or user subset) is represented as a vertex in the
graph and edges connect effective pairs from $\chi_A$ and $\chi_B$. The idea of our design is
to allocate as many messages as possible to $\chi_A$, while guaranteeing the existence of a
saturating matching for $\chi_A$ based on Hall's marriage Theorem [93].

Theorem A.0.1 (Hall's Marriage Theorem [93]) Let $G$ be a finite bipartite graph
with bipartite sets $\chi_A$ and $\chi_B$. For a set $u$ of vertices in $\chi_A$, let $N_G(u)$ denote its
neighbourhood in $G$, i.e. the set of all vertices in $\chi_B$ adjacent to some element of $u$.
There is a matching that entirely covers $\chi_A$ if and only if

$$|u| \le |N_G(u)|$$

for every subset $u$ of $\chi_A$.
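For small graphs, Hall's condition can be checked directly by enumerating every subset of $\chi_A$. The following is a minimal brute-force sketch, exponential in the size of $\chi_A$ and intended only as an illustration; the function name and the two toy graphs are hypothetical.

```python
from itertools import combinations


def hall_condition_holds(adjacency):
    """Check |u| <= |N_G(u)| for every non-empty subset u of chi_A.

    adjacency maps each vertex of chi_A to the set of its neighbours in chi_B.
    """
    left_vertices = list(adjacency)
    for size in range(1, len(left_vertices) + 1):
        for u in combinations(left_vertices, size):
            neighbourhood = set().union(*(adjacency[v] for v in u))
            if len(u) > len(neighbourhood):
                return False
    return True


# Toy graph in which chi_A can be fully matched.
matchable = {"a1": {"b1", "b2"}, "a2": {"b2", "b3"}, "a3": {"b1", "b3"}}
# Toy graph in which a1 and a2 compete for the single neighbour b1.
unmatchable = {"a1": {"b1"}, "a2": {"b1"}, "a3": {"b1", "b2"}}

matchable_ok = hall_condition_holds(matchable)
unmatchable_ok = hall_condition_holds(unmatchable)
```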


Corollary A.0.1.1 If all vertices in $\chi_A$ have the same degree $d_A$ and all the vertices
in $\chi_B$ have the same degree $d_B$ ($d_A \ge d_B$), then there is a saturating matching for
$\chi_A$.

Proof For any $u \subseteq \chi_A$, all edges connected to $u$ are also connected to $N_G(u)$, hence
$|N_G(u)| \cdot d_B \ge |u| \cdot d_A$. Since $d_A \ge d_B$, we know that $|u| \le |N_G(u)|$. According to
Theorem A.0.1, there is a saturating matching for $\chi_A$. ■
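The corollary can be sanity-checked numerically on a small example. The sketch below builds a hypothetical bipartite graph in which every vertex of $\chi_A$ has degree $d_A = 3$ and every vertex of $\chi_B$ has degree $d_B = 2$, then searches for a matching with the standard augmenting-path (Kuhn) algorithm. This algorithm is swapped in for illustration and is not claimed to be the pairing procedure used in Chapter 2.

```python
def maximum_matching(adjacency):
    """Augmenting-path bipartite matching; returns {left vertex: right vertex}."""
    match_of_right = {}

    def try_augment(v, visited):
        for b in adjacency[v]:
            if b in visited:
                continue
            visited.add(b)
            # Take b if it is free, or if its current partner can move elsewhere.
            if b not in match_of_right or try_augment(match_of_right[b], visited):
                match_of_right[b] = v
                return True
        return False

    for v in adjacency:
        try_augment(v, set())
    return {a: b for b, a in match_of_right.items()}


# Hypothetical graph: 4 vertices in chi_A with degree 3,
# 6 vertices in chi_B (labelled 0..5) with degree 2.
adjacency = {
    "a0": {0, 1, 2},
    "a1": {3, 4, 5},
    "a2": {0, 1, 2},
    "a3": {3, 4, 5},
}
matching = maximum_matching(adjacency)
```

Since $d_A \ge d_B$ here, the corollary predicts that every vertex of $\chi_A$ is matched, so the returned dictionary should have one entry per vertex of $\chi_A$.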

In order to compute the peak rate in the worst case, we assume that all $K$ users
request different files. Since each subset contains $t+1$ files, there are $\binom{K}{t+1}$ messages
to allocate between $\chi_A$ and $\chi_B$. We classify these subsets according to the number of
requests from server A: sets of type $w$ will have $w$ requests from server A and $t+1-w$
from server B. The following proposition states that the messages of the same type
are not able to pair with each other.
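The classification by type can be made concrete with a short counting sketch. It assumes, for illustration, that a fixed number of users request files stored on server A and the remaining users request files stored on server B (the parameters `users_a` and `users_b` are hypothetical names); with symmetric requests and even $K$ both equal $K/2$.

```python
from math import comb


def messages_by_type(users_a, users_b, t):
    """Count the (t + 1)-user subsets of each type w.

    A subset has type w when exactly w of its users request from server A.
    """
    return {w: comb(users_a, w) * comb(users_b, t + 1 - w) for w in range(t + 2)}


# K = 6 users split symmetrically, t = 2.
counts = messages_by_type(3, 3, t=2)
total = sum(counts.values())
```

The counts over all types add up to $\binom{K}{t+1}$, and in the symmetric case type $w$ and type $t+1-w$ contain the same number of messages, which is what makes pairing them plausible.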
When $t$ is even and the demands are symmetric, type $w$ sets and type $t+1-w$ sets
form a symmetric bipartite graph, so there exists a saturating matching according to
Corollary A.0.1.1. When $t$ is odd, type $(t+1)/2$ sets are paired with the union of type
$(t-1)/2$ sets and type $(t+3)/2$ sets. Since the vertices in type $(t-1)/2$ sets and type
$(t+3)/2$ sets are connected to the same number of vertices in type $(t+1)/2$ sets, this
bipartite graph also fulfills the condition in Corollary A.0.1.1. Other sets are paired
as in the case with $t$ even, that is, type $w$ sets are paired with type $t+1-w$ sets.
These pairings are illustrated in Fig. A.1.
When $t$ is even, there is a matching for every candidate file set, thus the peak rate
is cut by half compared with the traditional single server scheme. When $t$ is odd,
some vertices of types $(t-1)/2$, $(t+1)/2$, or $(t+3)/2$ could fail to be paired. Denote
the ratio of unpaired messages when $t$ is odd by $\Delta$. Any two servers can collaborate to
fulfill those requests, so the normalized overall peak rate $R_T$ with symmetric demands
is given by:

$$
R_T(K,t) =
\begin{cases}
\frac{1}{2} R_C(K,t) & \text{if } t \text{ is even} \\
\left(\frac{1}{2} + \frac{1}{6}\Delta\right) R_C(K,t) & \text{if } t \text{ is odd.}
\end{cases}
$$

Figure A.1.: Pairing illustration for (a) $t$ is even and (b) $t$ is odd. $w$ is the number of files from server A in a message.

The pairing loss $\Delta$ is limited. The worst case occurs when there is a big difference
between the number of vertices of type $(t+1)/2$ and the number of vertices of types
$(t-1)/2$ or $(t+3)/2$. In both cases, the pairing loss $\Delta$ is bounded by $\frac{1}{3}$.
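As a worked consequence, and reading the odd-$t$ expression for the peak rate as $\left(\frac{1}{2} + \frac{1}{6}\Delta\right) R_C(K,t)$, substituting the bound $\Delta \le \frac{1}{3}$ gives a worst-case rate that stays close to half of the single-server rate:

```latex
R_T(K,t) = \left(\frac{1}{2} + \frac{1}{6}\Delta\right) R_C(K,t)
\le \left(\frac{1}{2} + \frac{1}{6}\cdot\frac{1}{3}\right) R_C(K,t)
= \frac{5}{9}\, R_C(K,t), \qquad t \text{ odd}.
```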

VITA


Tianqiong Luo received her B.S. from Fudan University, Shanghai, China, in 2013.
She is currently pursuing the Ph.D. degree in Electrical and Computer Engineering
at Purdue University. Her research interests involve signal processing and coding for
caching networks and non-volatile storage systems.

REFERENCES


[1] M. A. Maddah-Ali and U. Niesen, "Fundamental limits of caching," IEEE Trans. Inf. Theory, vol. 60, no. 5, pp. 2856–2867, 2014.

[2] R. L. Rivest and A. Shamir, "How to reuse a 'write-once' memory," Information and control, vol. 55, no. 1, pp. 1–19, 1982.

[3] A. G. Dimakis, P. Godfrey, Y. Wu, M. J. Wainwright, and K. Ramchandran, "Network coding for distributed storage systems," IEEE Trans. Inf. Theory, vol. 56, no. 9, pp. 4539–4551, 2010.

[4] J. S. Plank, "The RAID-6 liber8tion code," International Journal of High Performance Computing Applications, 2009.

[5] C. Tian and J. Chen, "Caching and delivery via interference elimination," arXiv preprint arXiv:1604.08600, 2016.

[6] T. Luo, V. Aggarwal, and B. Peleato, "Coded caching with distributed storage," arXiv preprint arXiv:1611.06591, 2016.

[7] D. Beaver, S. Kumar, H. C. Li, J. Sobel, P. Vajgel et al., "Finding a needle in haystack: Facebook's photo storage," in OSDI, vol. 10, no. 2010, 2010, pp. 1–8.

[8] S. Ghemawat, H. Gobioff, and S.-T. Leung, "The google file system," in ACM SIGOPS operating systems review, vol. 37, no. 5. ACM, 2003, pp. 29–43.

[9] K. Rashmi, P. Nakkiran, J. Wang, N. B. Shah, and K. Ramchandran, "Having your cake and eating it too: Jointly optimal erasure codes for I/O, storage, and network-bandwidth," in FAST, 2015, pp. 81–94.

[10] O. Khan, R. C. Burns, J. S. Plank, W. Pierce, and C. Huang, "Rethinking erasure codes for cloud file systems: minimizing I/O for recovery and degraded reads," in FAST, 2012, p. 20.

[11] Q. Yu, M. A. Maddah-Ali, and A. S. Avestimehr, "The exact rate-memory tradeoff for caching with uncoded prefetching," in IEEE Int. Symp. on Information Theory Proceedings (ISIT). IEEE, 2017, pp. 1613–1617.

[12] T. Luo and B. Peleato, "The rate-I/O trade-off for coded caching," submitted to IEEE Commun. Lett., 2018.

[13] G. Dong, S. Li, and T. Zhang, "Using data postcompensation and predistortion to tolerate cell-to-cell interference in MLC NAND flash memory," IEEE Trans. Circuits Syst. I: Reg. Papers, vol. 57, no. 10, pp. 2718–2728, Oct. 2010.


[14] W. Wang, T. Xie, and D. Zhou, "Understanding the impact of threshold voltage on MLC flash memory performance and reliability," in Proc. 28th ACM Int. Conf. on Supercomputing (ICS). ACM, 2014, pp. 201–210.

[15] S. Moshavi, "Multi-user detection for DS-CDMA communications," IEEE Commun. Mag., vol. 34, no. 10, pp. 124–136, Oct. 1996.

[16] F. Adachi, M. Sawahashi, and H. Suda, "Wideband DS-CDMA for next-generation mobile communications systems," IEEE Commun. Mag., vol. 36, no. 9, pp. 56–69, Sep. 1998.

[17] T. Luo and B. Peleato, "Spread programming for NAND flash," in IEEE Int. Conf. on Communications (ICC). IEEE, 2015, pp. 277–282.

[18] T. Luo and B. Peleato, "Spreading modulation for multilevel nonvolatile memories," IEEE Trans. Commun., vol. 64, no. 3, pp. 1110–1119, 2016.

[19] T. Luo and B. Peleato, "Multipage read for NAND flash," IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 64, no. 1, pp. 76–80, 2017.

[20] C. Xu, D. Niu, N. Muralimanohar, R. Balasubramonian, T. Zhang, S. Yu, and Y. Xie, "Overcoming the challenges of crossbar resistive memory architectures," in IEEE 21st International Symp. on High Performance Computer Architecture (HPCA). IEEE, 2015, pp. 476–488.

[21] T. Luo, O. Milenkovic, and B. Peleato, "Compensating for sneak currents in multi-level crosspoint resistive memories," in 49th Asilomar Conference on Signals, Systems and Computers. IEEE, 2015, pp. 839–843.

[22] P. Corbett, B. English, A. Goel, T. Grcanac, S. Kleiman, J. Leong, and S. Sankar, "Row-diagonal parity for double disk failure correction," in FAST-2004: 3rd Usenix Conference on File and Storage Technologies, 2004.

[23] H. Ghasemi and A. Ramamoorthy, "Improved lower bounds for coded caching," in IEEE Int. Symp. on Information Theory Proceedings (ISIT). IEEE, 2015, pp. 1696–1700.

[24] U. Niesen and M. A. Maddah-Ali, "Coded caching with nonuniform demands," in Computer Communications Workshops (INFOCOM WKSHPS), 2014 IEEE Conference on. IEEE, 2014, pp. 221–226.

[25] J. Zhang, X. Lin, C.-C. Wang, and X. Wang, "Coded caching for files with distinct file sizes," in IEEE Int. Symp. on Information Theory Proceedings (ISIT). IEEE, 2015, pp. 1686–1690.

[26] M. Ji, A. M. Tulino, J. Llorca, and G. Caire, "Caching and coded multicasting: Multiple groupcast index coding," in IEEE Global Conf. on Signal and Information Processing (GlobalSIP). IEEE, 2014, pp. 881–885.

[27] M. Ji, A. Tulino, J. Llorca, and G. Caire, "Caching-aided coded multicasting with multiple random requests," in IEEE Information Theory Workshop (ITW). IEEE, 2015, pp. 1–5.

[28] J. Hachem, N. Karamchandani, and S. Diggavi, "Effect of number of users in multi-level coded caching," in IEEE Int. Symp. on Information Theory Proceedings (ISIT). IEEE, 2015, pp. 1701–1705.


[29] S. P. Shariatpanahi, S. A. Motahari, and B. H. Khalaj, "Multi-server coded caching," arXiv preprint arXiv:1503.00265, 2015.

[30] R. Blom, "An optimal class of symmetric key generation systems," in Workshop on the Theory and Application of Cryptographic Techniques. Springer, 1984, pp. 335–338.

[31] C. Suh and K. Ramchandran, "Exact-repair MDS code construction using interference alignment," IEEE Trans. Inf. Theory, vol. 57, no. 3, pp. 1425–1442, 2011.

[32] S. El Rouayheb, A. Sprintson, and C. Georghiades, "On the index coding problem and its relation to network coding and matroid theory," IEEE Trans. Inf. Theory, vol. 56, no. 7, pp. 3187–3195, 2010.

[33] Z. Bar-Yossef, Y. Birk, T. Jayram, and T. Kol, "Index coding with side information," IEEE Trans. Inf. Theory, vol. 57, no. 3, pp. 1479–1494, 2011.

[34] M. A. R. Chaudhry and A. Sprintson, "Efficient algorithms for index coding," in IEEE INFOCOM Workshops. IEEE, 2008, pp. 1–4.

[35] S. El Rouayheb and K. Ramchandran, "Fractional repetition codes for repair in distributed storage systems," in Proc. 48th Annu. Allerton Conf. Commun., Control, and Comput. (Allerton). IEEE, 2010, pp. 1510–1517.

[36] Q. Yu, C. W. Sung, and T. H. Chan, "Irregular fractional repetition code optimization for heterogeneous cloud storage," IEEE J. Sel. Areas Commun., vol. 32, no. 5, pp. 1048–1060, 2014.

[37] C. Huang, M. Chen, and J. Li, "Pyramid codes: Flexible schemes to trade space for access efficiency in reliable data storage systems," ACM Trans. Storage (TOS), vol. 9, no. 1, p. 3, 2013.

[38] J. R. Santos, R. R. Muntz, and B. Ribeiro-Neto, Comparing random data allocation and data striping in multimedia servers. ACM, 2000, vol. 28, no. 1.

[39] Y. Wang, X. Yin, and X. Wang, "MDR codes: A new class of RAID-6 codes with optimal rebuilding and encoding," IEEE J. Sel. Areas Commun., vol. 32, no. 5, pp. 1008–1018, 2014.

[40] U. Niesen and M. A. Maddah-Ali, "Coded caching with nonuniform demands," IEEE Trans. Inf. Theory, vol. 63, no. 2, pp. 1146–1158, 2017.

[41] M. Ji, A. M. Tulino, J. Llorca, and G. Caire, "On the average performance of caching and coded multicasting with random demands," in 11th Int. Symp. on Wireless Communications Systems (ISWCS). IEEE, 2014, pp. 922–926.

[42] S. A. Saberali, H. E. Saffar, L. Lampe, and I. Blake, "Adaptive delivery in caching networks," IEEE Commun. Lett., vol. 20, no. 7, pp. 1405–1408, 2016.

[43] M. A. Maddah-Ali and U. Niesen, "Decentralized coded caching attains order-optimal memory-rate tradeoff," IEEE/ACM Trans. Netw. (TON), vol. 23, no. 4, pp. 1029–1040, 2015.


[44] E. J. O'neil, P. E. O'neil, and G. Weikum, "The LRU-K page replacement algorithm for database disk buffering," ACM SIGMOD Record, vol. 22, no. 2, pp. 297–306, 1993.

[45] Z. Chen, P. Fan, and K. B. Letaief, "Fundamental limits of caching: Improved bounds for small buffer users," arXiv preprint arXiv:1407.1935, 2014.

[46] R. Frickey, "Data integrity on 20nm SSDs," in Flash Memory Summit, 2012.

[47] P. Pavan, R. Bez, P. Olivo, and E. Zanoni, "Flash memory cells-an overview," Proc. IEEE, vol. 85, no. 8, pp. 1248–1271, Aug. 1997.

[48] B. Shin, C. Seol, J.-S. Chung, and J. J. Kong, "Error control coding and signal processing for flash memories," in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), 2012, pp. 409–412.

[49] M. Qin, E. Yaakobi, and P. H. Siegel, "Constrained codes that mitigate inter-cell interference in read/write cycles for flash memories," IEEE J. Sel. Areas Commun., vol. 32, no. 5, pp. 836–846, May 2014.

[50] J.-D. Lee, S.-H. Hur, and J.-D. Choi, "Effects of floating-gate interference on NAND flash memory cell operation," IEEE Electron Device Lett., vol. 23, no. 5, pp. 264–266, 2002.

[51] Y. Kim, B. Kumar, K. L. Cho, H. Son, J. Kim, J. J. Kong, and J. Lee, "Modulation coding for flash memories," in International Conf. on Comput., Netw. and Commun. (ICNC). IEEE, 2013, pp. 961–967.

[52] Y. Kim and B. V. Kumar, "Coding for memory with stuck-at defects," in IEEE Int. Conf. on Communications (ICC). IEEE, 2013, pp. 4347–4352.

[53] H.-W. Tseng, L. Grupp, and S. Swanson, "Understanding the impact of power loss on flash memory," in Proc. 48th Design Automation Conf. (DAC). ACM, 2011, pp. 35–40.

[54] A. Jagmohan, M. Franceschini, L. Lastras-Montano, J. Karidis et al., "Adaptive endurance coding for NAND flash," in Proc. IEEE GLOBECOM Workshops. IEEE, Dec. 2010, pp. 1841–1845.

[55] B. Peleato and R. Agarwal, "Maximizing MLC NAND lifetime and reliability in the presence of write noise," in IEEE Int. Conf. on Communications (ICC). IEEE, 2012, pp. 3752–3756.

[56] B. Peleato, R. Agarwal, and J. Cioffi, "Probabilistic graphical model for flash memory programming," in IEEE Statistical Signal Processing Workshop (SSP). IEEE, 2012, pp. 788–791.

[57] R. Katsumata, M. Kito, Y. Fukuzumi, M. Kido, H. Tanaka, Y. Komori, M. Ishiduki, J. Matsunami, T. Fujiwara, Y. Nagata et al., "Pipe-shaped BiCS flash memory with 16 stacked layers and multi-level-cell operation for ultra high density storage devices," in Proc. IEEE Symp. on VLSI Technol., 2009, pp. 136–137.


[58] K.-D. Suh, B.-H. Suh, Y.-H. Lim, J.-K. Kim, Y.-J. Choi, Y.-N. Koh, S.-S. Lee, S.-C. Kwon, B.-S. Choi, J.-S. Yum et al., "A 3.3 v 32 mb NAND flash memory with incremental step pulse programming scheme," IEEE J. Solid-State Circuits, vol. 30, no. 11, pp. 1149–1156, Nov. 1995.

[59] Y. Cai, G. Yalcin, O. Mutlu, E. F. Haratsch, O. Unsal, A. Cristal, and K. Mai, "Neighbor-cell assisted error correction for MLC NAND flash memories," in ACM Int. Conf. on Meas. and Modeling of Comput. Syst. ACM, 2014, pp. 491–504.

[60] J. E. Brewer and M. Gill, Nonvolatile Memory Technologies with Emphasis on Flash. Wiley, 2008.

[61] R. Cernea, L. Pham, F. Moogat, S. Chan, B. Le, Y. Li, S. Tsao, T.-Y. Tseng, K. Nguyen, J. Li et al., "A 34MB/s-program-throughput 16Gb MLC NAND with all-bitline architecture in 56nm," in IEEE Int. Solid-State Circuits Conf. (ISSCC), 2008, pp. 420–624.

[62] R. Micheloni, L. Crippa, and A. Marelli, Inside NAND Flash Memories. Springer, 2010.

[63] J. Wang, T. Courtade, H. Shankar, and R. Wesel, "Soft information for LDPC decoding in flash: Mutual information optimized quantization," in IEEE Global Communications Conf. (GLOBECOM). IEEE, 2011, pp. 5–9.

[64] G. Wu and X. He, "Reducing SSD read latency via NAND flash program and erase suspension," in Proc. of the 10th USENIX Conf. on File and Storage Technol., vol. 12, 2012, pp. 10–10.

[65] S. C. Hollmer, C.-Y. Hu, B. Q. Le, P.-l. Chen, J. Su, R. Gutala, and C. Bill, "Erase verify scheme for NAND flash," Dec. 28 1999, US Patent 6,009,014.

[66] E. Gal and S. Toledo, "Algorithms and data structures for flash memories," ACM Computing Surveys (CSUR), vol. 37, no. 2, pp. 138–163, 2005.

[67] J. Cooke, "Flash memory 101: An introduction to NAND flash," Micron Technology Inc: Boise, 2006.

[68] E. Yaakobi, S. Kayser, P. H. Siegel, A. Vardy, and J. K. Wolf, "Codes for write-once memories," IEEE Trans. Inf. Theory, vol. 58, no. 9, pp. 5985–5999, 2012.

[69] R. Gabrys and L. Dolecek, "Constructions of nonbinary WOM codes for multilevel flash memories," IEEE Trans. Inf. Theory, vol. 61, no. 4, pp. 1905–1919, Apr. 2015.

[70] Y. Cai, E. F. Haratsch, O. Mutlu, and K. Mai, "Threshold voltage distribution in MLC NAND flash memory: Characterization, analysis, and modeling," in Proc. Design, Autom., Test in Eur. Conf. (DATE), Mar. 2013, pp. 1285–1290.

[71] S. Gerardin, M. Bagatin, A. Paccagnella, K. Grurmann, F. Gliem, T. Oldham, F. Irom, and D. Nguyen, "Radiation effects in flash memories," IEEE Trans. Nucl. Sci., vol. 60, no. 3, pp. 1953–1969, 2013.

[72] B. Peleato, R. Agarwal, J. Cioffi, M. Qin, and P. H. Siegel, "Towards minimizing read time for NAND flash," in IEEE Global Communications Conf. (GLOBECOM). IEEE, 2012, pp. 3219–3224.


[73] Y. Maeda and H. Kaneko, "Error control coding for multilevel cell flash memories using nonbinary low-density parity-check codes," in Proc. IEEE Int. Symp. Defect Fault Tolerance VLSI Syst. IEEE, 2009, pp. 367–375.

[74] L. Wang, E. Sasoglu, B. Bandemer, and Y.-H. Kim, "A comparison of superposition coding schemes," in IEEE Int. Symp. on Information Theory (ISIT). IEEE, 2013, pp. 2970–2974.

[75] B. Sklar, Digital communications. Prentice Hall NJ, 2001, vol. 2.

[76] T. S. Rappaport et al., Wireless communications: principles and practice. Prentice Hall PTR New Jersey, 1996, vol. 2.

[77] D.-h. Lee and W. Sung, "Direct and indirect measurement of inter-cell capacitance in NAND flash memory," in Proc. of IEEE Workshop on Signal Processing Systems (SiPS). IEEE, 2014, pp. 1–6.

[78] C. Xu, X. Dong, N. P. Jouppi, and Y. Xie, "Design implications of memristor-based RRAM cross-point structures," in Design, Automation & Test in Europe Conference & Exhibition (DATE), 2011. IEEE, 2011, pp. 1–6.

[79] Y. Zheng, C. Xu, and Y. Xie, "Modeling framework for cross-point resistive memory design emphasizing reliability and variability issues," in Design Automation Conference (ASP-DAC), 2015 20th Asia and South Pacific. IEEE, 2015, pp. 112–117.

[80] M. A. Zidan, H. A. H. Fahmy, M. M. Hussain, and K. N. Salama, "Memristor-based memory: The sneak paths problem and solutions," Microelectronics Journal, vol. 44, no. 2, pp. 176–183, 2013.

[81] H. Nazarian, "Versatile RRAM technology and applications," in Flash Memory Summit, 2015.

[82] D. Niu, C. Xu, N. Muralimanohar, N. P. Jouppi, and Y. Xie, "Design trade-offs for high density cross-point resistive memory," in Proc. ACM/IEEE international symp. on Low power electronics and design. ACM, 2012, pp. 209–214.

[83] C. Xu, D. Niu, N. Muralimanohar, N. P. Jouppi, and Y. Xie, "Understanding the trade-offs in multi-level cell ReRAM memory design," in 50th ACM/EDAC/IEEE Design Automation Conference (DAC). IEEE, 2013, pp. 1–6.

[84] J. Zhou, K.-H. Kim, and W. Lu, "Crossbar RRAM arrays: Selector device requirements during read operation," IEEE Trans. Electron Devices, vol. 61, no. 5, pp. 1369–1376, 2014.

[85] H.-L. Lou and C.-E. W. Sundberg, "Coded modulation for digital storage in analog memory devices," Apr. 3 2001, US Patent 6,212,654.

[86] B. M. Kurkoski, "Coded modulation using lattices and Reed-Solomon codes, with applications to flash memories," IEEE J. Sel. Areas Commun., vol. 32, no. 5, pp. 900–908, 2014.

[87] R. Bauer and J. Hagenauer, "On variable length codes for iterative source/channel decoding," in Proc. of Data Compression Conference (DCC). IEEE, 2001, pp. 273–282.


[88] N. B. Shah, "On minimizing data-read and download for storage-node recovery," IEEE Commun. Lett., vol. 17, no. 5, pp. 964–967, 2013.

[89] Y.-H. Hsiao, H.-T. Lue, T.-H. Hsu, K.-Y. Hsieh, and C.-Y. Lu, "A critical examination of 3D stackable NAND flash memory architectures by simulation study of the scaling capability," in IEEE International Memory Workshop (IMW). IEEE, 2010, pp. 1–4.

[90] Y. Kim, R. Mateescu, S.-H. Song, Z. Bandic, and B. V. Kumar, "Coding scheme for 3D vertical flash memory," in IEEE Int. Conf. on Communications (ICC). IEEE, 2015, pp. 264–270.

[91] Y.-M. Chang, Y.-H. Chang, T.-W. Kuo, Y.-C. Li, and H.-P. Li, "Disturbance relaxation for 3D flash memory," IEEE Trans. Comput, vol. 65, no. 5, pp. 1467–1483, 2016.

[92] S. Buzaglo, P. H. Siegel, and E. Yaakobi, "Coding schemes for inter-cell interference in flash memory," in IEEE Int. Symp. on Information Theory Proceedings (ISIT). IEEE, 2015, pp. 1736–1740.

[93] P. Hall, "On representatives of subsets," J. London Math. Soc, vol. 10, no. 1, pp. 26–30, 1935.

