University of Central Florida

STARS
Electronic Theses and Dissertations, 2020-

2021

Robust Acceleration of Data-Centric Applications using Resistive
Computing Systems
Baogang Zhang
University of Central Florida

Part of the Data Storage Systems Commons

This Doctoral Dissertation (Open Access) is brought to you for free and open access by STARS. It has been accepted
for inclusion in Electronic Theses and Dissertations, 2020- by an authorized administrator of STARS. For more
information, please contact STARS@ucf.edu.

STARS Citation
Zhang, Baogang, "Robust Acceleration of Data-Centric Applications using Resistive Computing Systems"
(2021). Electronic Theses and Dissertations, 2020-. 592.
https://stars.library.ucf.edu/etd2020/592

ROBUST ACCELERATION OF DATA-CENTRIC APPLICATIONS USING
RESISTIVE COMPUTING SYSTEMS

by

BAOGANG ZHANG
M.S. Florida Institute of Technology, 2016
B.S. Florida Institute of Technology, 2014

A dissertation submitted in partial fulfilment of the requirements
for the degree of Doctor of Philosophy
in the Department of Electrical and Computer Engineering
in the College of Engineering and Computer Science
at the University of Central Florida
Orlando, Florida

Spring Term
2021
Major Professor: Rickard Ewetz

© 2021 Baogang Zhang

ABSTRACT

With accessible data reaching the zettabyte level, CMOS technology is reaching its limits for
data-hungry applications, and recent studies indicate that Moore’s law is nearing its end. At the
same time, the von Neumann architecture is approaching a bottleneck due to the data movement
between the computing and memory units. With data movement and power budgets becoming
the limiting factors of today’s computing systems, in-memory computing using emerging
non-volatile resistive devices has attracted an increasing amount of attention. A non-volatile
resistive device may be realized using a memristor, resistive random access memory (ReRAM), phase
change memory (PCM), or spin-transfer torque magnetic random access memory (STT-MRAM).
Resistive devices integrated into crossbar arrays simultaneously support both dense storage and
energy-efficient analog computation, which is highly desirable for processing big data using
both low-power mobile devices and high-performance computing (HPC) systems. However, analog
computation is vulnerable and may suffer from robustness issues due to non-idealities such as
array parasitics, device defects, non-ideal device characteristics, and various other sources of error.
These non-ideal factors directly impact the computational accuracy of the in-memory computation
and thereby the application-level functional correctness.
This dissertation focuses on improving the robustness and reliability of analog in-memory
computing. Three main directions are explored: data layout organization techniques, software and
hardware co-design, and hardware redundancy. Data layout organization aims to improve
robustness by mapping the data to hardware according to the behavior of defective devices. Software
and hardware co-design mitigates the impact of device non-idealities by modifying the data in neural
network or image compression applications so that it becomes amenable to device defects and data
layout organization. Hardware redundancy uses multiple resistive devices to realize each data value,
so that each device can be programmed with a different value and the data can be realized accurately
with lower overhead.

ACKNOWLEDGMENTS

I would like to give thanks to the trinity God, my creator, my provider, my sustainer, my comforter.
I would like to thank my advisor and mentor, Dr. Rickard Ewetz, the most important person
throughout my Ph.D. study, without whom I could not have become who I am academically. It was my
honor and joy to work with Dr. Ewetz, and I cannot imagine what my Ph.D. study would
have been without him. If I were asked to choose again, I would not hesitate for one second or have
any reservation about pursuing my Ph.D. degree with Dr. Ewetz. I have gained much knowledge and
experience in these years, including but not limited to problem solving, programming, technical
writing, presenting, and communicating, and this list could go on for two more pages. I have received
abundantly, more than I could have imagined anywhere else.
I also want to thank Dr. Deliang Fan, Dr. Murat Yuksel, Dr. Fan Yao, and Dr. Liqiang Wang for
serving as my doctoral committee members. I want to thank Dr. Amro Awad and Dr.
Sumit Kumar Jha for the collaborative publications. I am grateful for these professors’ support and
guidance throughout my Ph.D. research.
I would also like to thank Necati Uysal, my colleague and friend, who helped me greatly with
programming, debugging, presentation practice, and paper proofreading, which I would have been
unable to complete by myself.
I would like to thank UCF, our beautiful school, and the ECE department, for all the supplies and
equipment we can enjoy as students, and for all the support and help I have received from the
employees and staff. Also, I would like to thank the cafeterias on campus, where I could fill my
stomach at affordable prices and with extra convenience.
I would like to give thanks to my parents, who not only gave me my life, but also have supported me
to this day. Having never attended college themselves, they would not be considered experts in
education or parenting. Yet they love me in their own language and as hard as they possibly can, which
has made me who I am today.
I would like to give thanks to my fiancee, Meixuan Song, whom I met the second day after I arrived
in Orlando. We did not have the luxury of time or money during our relationship, yet she still chose
to be with me and said ‘yes’ when I proposed. I will never forget those holidays and weekends
when I was working towards paper submissions and she would stay and keep me company. Surely
we have beautiful memories from these years, and I am overjoyed that we have decided to
walk through life together. I am so blessed to have her in my life.
I would like to give thanks to my no-blood brother, Andy Huffman, a true and caring mentor
with whom I would be willing to share anything.
I would like to give thanks to my church, Orlando Chinese Evangelical Christian Church (OCECC),
which has provided me the spiritual nourishment and helped me to become a more mature follower
of Christ.
Finally, I would like to acknowledge the financial support from the ECE department at UCF,
NSF (CRII-1755825, CNS-1908471), and Cyber-Florida (3910-1011-00).

TABLE OF CONTENTS

LIST OF FIGURES

LIST OF TABLES

CHAPTER 1: INTRODUCTION

CHAPTER 2: BACKGROUND
  Non-volatile Resistive Devices
  Analog In-memory Computing for Matrix-vector Multiplication (MVM)
  Big Data Applications
    Deep Neural Network (DNN)
    Image compression
  Sources of Errors
    Write accuracy
    Array parasitics
    Device defects/Stuck-at-fault defects
    DAC/ADC quantization errors
    Resistance drift
    Temperature variations
    Stochastic variations
    Non-ideal device characteristics

CHAPTER 3: PREVIOUS WORK
  Write accuracy
  Array parasitics
  Device defects
  DAC/ADC quantization errors
  Resistance drift
  Temperature variations
  Stochastic variations
  Non-ideal device characteristics

CHAPTER 4: MATRIX TRANSFORMATION
  Problem Formulation
  Cost Metric
  Transformations
    Row flipping transformation
    Permutation transformation
    Value range transformation
  Flow of MT Framework
    Matrix value range transformation
    Multi-pass flow
    Weight update
  Experimental evaluation
    Evaluation of ε in value range reduction transformation
    Evaluation of cost metric selection
    Evaluation of multi-pass vs. single-pass flow
    Evaluation of transformations on classification accuracy
    Evaluation transformations in terms of overhead
    Comparison with hardware aware training
    Evaluation of sensitivity to defect ratios and distributions
  Conclusion

CHAPTER 5: FAST RESILIENT-AWARE DATA LAYOUT ORGANIZATION
  Background and Motivation
    The Error Cost (EC) Metric
    Cost Matrix Computation and Assignment Problem
    Run-time Limitation of Data Layout Organization in [78]
    Problem Definition
  Proposed Speed-up Techniques
    The Sparse Defect Indexing Technique
    The weight range characterization technique
    The LP Formulation Technique
  Methodology
  Experimental Evaluation
      Evaluation of sparse defect indexing
      Evaluation of the weight range characterization
      Evaluation of the LP formulation
    Comparison with related studies
  Conclusion

CHAPTER 6: STAT: MEAN AND VARIANCE CHARACTERIZATION FOR ROBUST INFERENCE OF DNNS ON MEMRISTOR-BASED PLATFORMS
  Design
    Mean and variance characterization
    Mean guided bias weight modification
    Variance guided neuron permutation
  Flow of the STAT framework
  Experimental Evaluation
    Effectiveness of STAT framework
  Conclusion

CHAPTER 7: REDUNDANT NEURONS AND SHARED REDUNDANT SYNAPSES
  Design
    Redundant neurons
      Redundant neurons and no stuck-at-fault defects
      Redundant neurons and stuck-at-fault defects
      Robustness analysis
      Hardware realization
    Shared redundant synapses
      Sharing of redundant synapses (2:3)
      Neuron pairing and permutation
  Methodology
  Experimental Evaluation
    Effectiveness of redundant neurons
    Effectiveness of shared redundant synapses
    Comparison with related work
  Conclusion

CHAPTER 8: RESILIENT NEURAL NETWORKS WITH HIGH THROUGHPUT
  Background and Motivation
    Mapping neural networks to RCSs
    RCS with stuck-at-fault defects
    Motivation
  Design
    Data layout transformation
      Channel transformation
      Pixel transformation
      Hybrid transformation
    Flow for data layout organization
    Distribution guided training
  Experimental Evaluation
    Evaluation of data layout organization
    Evaluation of distribution guided training
    Comparison with state-of-the-art works
  Conclusion

CHAPTER 9: COMPUTATIONAL RESTRUCTURING FOR IMAGE COMPRESSION
  Background and Motivation
    Review of image compression
    Image compression performance metrics
    Acceleration of MVM using emerging RCAs
  Previous Work
  Rethinking Image Compression using RCAs
  Proposed Image Compression
    2D DCT Reconstruction
      Overview of reconstruction
      Analysis of reconstruction
      The reconstructed DCT matrix D
    Frequency Spectrum Optimization
      Frequency reordering
      Frequency spectrum pruning
      Analysis of frequency spectrum pruning
    Quantization optimization
      ADC based quantization
      Hardware friendly ADC based quantization
  Experimental Evaluation
    Evaluation of optimization techniques
      Evaluation of 2D DCT reconstruction
      Evaluation of frequency optimization
      Evaluation of quantization optimization
    Evaluation of proposed image compression
  Conclusion

CHAPTER 10: FUTURE WORK
  Advanced Data Layout Organization Techniques
  System Software Support for Resistive Computing Systems

LIST OF REFERENCES

LIST OF FIGURES

Figure 2.1: (a) Digital MVM (b) Analog MVM (c) RCA circuit for analog MVM.
Figure 2.2: Comparison between performing MVM using an RCA and a digital ASIC in
terms of (a) speed and (b) power [28].
Figure 2.3: Examples of using analog in-memory computing to accelerate DNN inference.
Figure 2.4: Example of using analog in-memory computing to accelerate image compression.

Figure 3.1: Data to hardware assignment (a) using routers and (b) using data layout
organization. W_l and W_{l+1} are the weight matrices of two adjacent layers in a
neural network.

Figure 4.1: Value range of one and two parallel memristors while using row flipping.
Figure 4.2: (a) Neuron permutation [73]. (b) Feature map permutation. (c) Permutation
of rows and columns in W_{l-1} and W_l. (d) Permutation of rows in W_{l-1} and
groups of columns in W_l. (e) Assignment problem for neurons and (f) feature
maps.
Figure 4.3: Proposed flow of the MT framework.
Figure 4.4: (a) Average value range reduction vs. ε. (b) Classification accuracy in
software and in hardware vs. ε on CIFAR-10.
Figure 4.5: Comparison between cost metrics used to guide the MT framework.
Figure 4.6: Norm. cost and Norm. accuracy vs. the number of passes in the multi-pass
flow. The results in (a) and (b) are obtained for two different sets of defect
matrices on CIFAR-10.
Figure 4.7: Normalized overhead evaluation of (a) power and (b) area.
Figure 4.8: Comparison of performance with (a) different stuck-on and stuck-off ratios,
(b) different stuck-at-fault distributions.

Figure 5.1: (a) Formulation and (b) Cost matrix of the assignment problem.
Figure 5.2: Data layout organization run-time break down of a 16-layer network.
Figure 5.3: (a) The cost computation of data to hardware and (b) the proposed sparse data
structure for cost computation.
Figure 5.4: An example of the weight range characterization technique of using R = 4
devices for a single weight.
Figure 5.5: Flow of the proposed framework.
Figure 5.6: Evaluation of sparse defect indexing with respect to defect rate on (a) MNIST
and (b) CIFAR-10 dataset.
Figure 5.7: Effectiveness of the weight range characterization technique with respect to
different redundancy factor R on (a) MLP-6 and (b) CNN-16a.
Figure 5.8: Evaluation of the solver run-time vs. problem dimension.

Figure 6.1: (a) Variance of input to layer l. (b) Weight significance factors set based on
the variance statistics. (c) The function g used to define the weight
significance factors c_k.
Figure 6.2: Proposed flow of the STAT framework.

Figure 7.1: (a) Initial neural network. (b) A redundant neuron t replicating the neuron s
in the network.
Figure 7.2: Robustness provided by redundant neuron.
Figure 7.3: Redundant neurons can be inserted without area overhead when there is a
mismatch between the dimensions of the weight matrices and the MCAs.
Figure 7.4: Shared redundant synapses. Three rows in the MCA are used to realize two
rows in a matrix W.
Figure 7.5: Proposed flow.
Figure 7.6: Evaluation of redundant neurons. (a) Recovered accuracy as a function of the
number of redundant neurons. (b) P99 with and without redundant neurons.
Figure 7.7: The recovered accuracy for different percentages of defects on different
networks.
Figure 7.8: Evaluation of power and area overhead for different redundancy factors (q:r).

Figure 8.1: Flow of mapping CNNs with high throughput to RCSs with defects.
Figure 8.2: Weight value ranges based on RRAM state. ‘on’ (‘off’) denotes a device
stuck-on (stuck-off).
Figure 8.3: The proposed data layout transformations.
Figure 8.4: (a) Weight value distribution of a VGG-7. (b) Percentage of weights stuck to
a specific value.
Figure 8.5: Effectiveness of the proposed data layout organization on evaluated networks.
Figure 8.6: Effectiveness of the data layout organization compatible training comparing
to the hybrid data layout organization.

Figure 9.1: Review of JPEG image compression based on 2D DCT [55].
Figure 9.2: Review of direct mapping in [27, 47, 48].
Figure 9.3: Image compression using (a) digital and (b) resistive hardware. The RCAs
have dimensions 64x128 and parameters as in [27, 47, 48].
Figure 9.4: (a) Flow of proposed image compression. (b) Overview of 2D DCT
reconstruction, (c) frequency spectrum optimization, and (d) quantization optimization.
Figure 9.5: Proposed 2D DCT computation using reconstructed DCT matrix.
Figure 9.6: (a) Frequency reordering and (b) frequency spectrum pruning.
Figure 9.7: The figure shows the total, analog, and frequency errors in terms of MSE with
respect to only computing N_f of the N frequency coefficients. The trade-off
is shown with respect to a reconstructed DCT matrix with dimensions (a)
64x64, (b) 144x144, and (c) 256x256.
Figure 9.8: (a) Quantization in the digital domain using a quantization table qTable. (b)
Quantization in the analog domain using ADCs.
Figure 9.9: (a) Two differential ADCs with individual reference voltages. (b) Two
differential ADCs with shared reference voltages. (c) An ideal quantization table.
(d) Shared quantization table with respect to a group size (M) of eight.
Figure 9.10: The output interface (a) power and (b) area breakdown based on the ADC
group size M.
Figure 9.11: (a) Reference image. (b) Images obtained using the direct mapping in [27,
47, 48]. (c) Images obtained using the proposed 2D DCT reconstruction.
Figure 9.12: Sensitivity of the image quality with respect to the bit-accuracy of the DAC
and ADC domain interfaces. The images in (a) are obtained using the direct
mapping and the images in (b) are obtained using the proposed 2D DCT
reconstruction.
Figure 9.13: The images in the left (right) column are obtained without (with) frequency
spectrum optimization. The dimension of the reconstructed 2D DCT matrix
and the MSE are shown below each figure.
Figure 9.14: Performance improvements from frequency spectrum optimization for
reconstructed 2D DCT matrices with different dimensions.
Figure 9.15: Comparison between proposed ADC based quantization and using 8-bit ADCs
and performing digital quantization. The comparison is evaluated in terms of
MSE in (a) and power in (b).
Figure 9.16: Evaluation of group size selection (M) in terms of (a) power consumption
and (b) area.
Figure 9.17: Comparison of image quality obtained using (a) digital hardware and (b)
resistive hardware. Quantization is performed using the table in Figure 9.9.
Figure 9.18: The image quality and compression ratio are evaluated with respect to the
processed block size in (a). The normalized power and area with respect to the
block size are evaluated in (b).

LIST OF TABLES

Table 2.1: Comparison of non-volatile devices in [74].

Table 4.1: Matrix value range transformation compared with the baseline (no
transformation) in terms of cost for a weight w_k.
Table 4.2: Properties of FF NNs trained on MNIST.
Table 4.3: Properties of CNN trained on CIFAR-10 with 75.03% software accuracy.
“Conv” denotes convolutional layer, and “FC” denotes fully-connected layer.
Table 4.4: Evaluation of proposed transformations on MNIST.
Table 4.5: Evaluation of proposed transformations on CIFAR-10.
Table 4.6: Evaluation of the training schemes on MNIST and CIFAR-10.

Table 5.1: Time complexity analysis of assigning N weights to R parallel resistive
devices in an RCA.
Table 5.2: Properties of the evaluated neural networks.
Table 5.3: Comparison of the proposed framework with related studies.

Table 6.1: Evaluation of various methods in terms of normalized classification accuracy.

Table 7.1: Comparisons with previous works.

Table 8.1: Holistic analysis of techniques for deploying CNNs with low and high
throughput to RCSs.
Table 8.2: Properties of evaluated CNNs on CIFAR-10.
Table 8.3: Type of data layout transformation scheme in each layer of the evaluated CNNs.
Table 8.4: Solution space for data layout organization for CNNs with low and high
throughput.
Table 8.5: Comparison with state-of-the-art techniques on 2% defect rate on VGG-7.
Table 8.6: Comparison with state-of-the-art techniques on 10% defect rate using 2X
hardware on VGG-13.

Table 9.1: Analysis of number of MVM operations.
Table 9.2: Image quality and performance (power and area) with respect to the selected
frequency spectrum N_f. The table shows that the ideal frequency spectrum
is in the range [1, N_f^*].
Table 9.3: Performance in power and area with respect to the group size (M) of ADCs
that share reference voltages from the same DACs. M^* is the group size that
minimizes the power consumption.
Table 9.4: Properties of the data sets of input images.
Table 9.5: Parameters of the RCAs used in the experimental evaluation.
Table 9.6: Power and area of crossbar and peripheral circuitry.
Table 9.7: Comparison of performance and overheads with and without 2D DCT
reconstruction.
Table 9.8: Comparison of different image compression techniques.

CHAPTER 1: INTRODUCTION

With the exponential growth and availability of digital data, we have entered an era dominated by
data-centric applications such as computer vision, data analytics, and deep learning. As a result,
the demand for data to be analyzed and processed has rapidly increased to exascale (10^18 bytes/s).
Unfortunately, these computing needs cannot be met through further technology scaling using
traditional silicon technology and the von Neumann architecture. This is mainly due to the separation
of computing units and memory units, which translates into power-hungry and bandwidth-limited data
transfer [70]. Recently, numerous large-scale research programs and research efforts have been
devoted to improving energy efficiency and reducing data movement by rethinking all layers of
the computing stack, including fundamental hardware, software, and hardware/software approaches
and schemes [3, 5, 8].
Due to promises of simultaneous storage and processing, in-memory computing based on
non-volatile resistive technology has emerged as an appealing solution to overcome the
aforementioned challenges. A non-volatile resistive device is a two-terminal device with programmable
resistance, which may be realized using a memristor [18, 65], resistive random access memory
(ReRAM) [68, 44], phase change memory (PCM) [69, 33], or spin-transfer torque magnetic random
access memory (STT-MRAM) [57, 29]. By integrating the emerging devices into resistive
crossbar arrays (RCAs), various core mathematical kernels can be evaluated efficiently in the analog
domain. Based on the availability of these new core kernels, non-traditional computing paradigms
such as analog matrix-vector multiplication (MVM) [28], flow-based computing [6], imply-logic [40],
and memristor-aided logic [39, 30] have been explored. Of these computing paradigms, analog
MVM is particularly promising because the computation is orders of magnitude
more energy-efficient than in the digital domain [28]. Data movement is also substantially reduced
by storing the matrix in-memory and performing the computation in-situ [58, 15]. Moreover,
1

MVM is the dominating computation in many data intensive applications as deep learning [60],
image processing [48], graph processing [63], and simulation of physical systems [21]. Therefore,
in-memory computing based on RCAs is viewed as a promising candidate to (i) process big data
in high-performance computing systems and (ii) perform energy-efficient computation on powerconstrained devices deployed in the Internet of Things (IoT).
The main challenge of leveraging analog in-memory computing is that the computational accuracy
is directly impacted by various reliability and robustness issues, which may compromise the
functional correctness of the accelerated applications. In contrast, robustness issues within digital
computing systems only introduce timing violations, which can be alleviated by scaling down the
clock frequency or by using less aggressive voltage scaling. In particular, the computational
accuracy within analog MVM is impacted by write accuracy, non-zero array parasitics, limited device
yield, resistance drift, temperature variations, random telegraph noise, and endurance. To provide
guarantees on the system level performance, synergistic innovations on the device level, algorithm
level, and software application level are required. Such advances have the potential to enable
early deployment of RCAs in commercial applications.
This dissertation is focused on improving the robustness and reliability of analog in-memory
computing. We have investigated techniques based on data layout organization, software and hardware
co-design, and hardware redundancy. Chapters 4 to 7 focus on minimizing the negative impact
of limited device yield within DNNs. The implemented framework achieves software-level
classification accuracy even when the RCAs have a high fabrication defect rate. Chapter 9
concentrates on optimizing image processing quality by performing computational restructuring.
Observable image quality improvement can be achieved with reduced overhead.

CHAPTER 2: BACKGROUND

Non-volatile Resistive Devices

In this section, we provide a review and comparison of the non-volatile devices. The comparison
from [74] is shown in Table 2.1.
ReRAM is popular as a crossbar accelerator due to its programmable conductance, low write energy and latency, and high endurance. ReRAM devices may be vulnerable to write noise, device
variations, and random telegraph noise (RTN). PCM uses current pulses for phase transition to
switch between amorphous state and crystalline state. Compared to ReRAM, PCM requires higher
programming current and may encounter resistance drift in the amorphous phase.

Table 2.1: Comparison of non-volatile devices in [74].

                           ReRAM            PCM
Maximum resistance         ∼1 MΩ            ∼1 MΩ
Device area                4F^2             4F^2
Endurance                  10^12 cycles     10^12 cycles
Programmable resolution    8 bits           5 bits
Write current              1 µA             100 µA

Analog In-memory Computing for Matrix-vector Multiplication (MVM)

The high-level concept and circuitry needed to perform MVM operations (Ax=y) using an RCA
is illustrated in Figure 2.1. The MVM operations are accelerated using a one-time expensive
initialization phase and a fast and efficient evaluation phase. In the initialization phase, a matrix A
is mapped to the state (or conductance) of the multi-level resistive devices within an RCA. This is
a slow and expensive process because the conductance of each device in the RCA is required to be
accurately programmed based on the input matrix. In the evaluation phase, MVM operations are
performed fast and efficiently by converting input vectors (x) into input voltages and decoding the
output voltages/currents into output vectors (y), which is shown in Figure 2.1(a). Consequently,
RCAs are a promising candidate to accelerate applications where the matrix is relatively fixed and
the input vectors frequently change.

Figure 2.1: (a) Digital MVM (b) Analog MVM (c) RCA circuit for analog MVM.

The principle of analog MVM is shown in Figure 2.1(b). The figure shows a set of wordlines
connected to a set of bitlines using a resistive device in each intersection. The current through
each resistive device is obtained using Ohm’s law by multiplying the input voltages applied to the
wordlines with the conductance values of the resistive devices. Next, the currents are summed
along the bitlines using Kirchhoff’s current law. The relation between the input voltages ($v_{in}$)
and the output currents ($i_{out}$) is $v_{in}^T G = i_{out}^T$, where $G$ is the conductance matrix realized
by the RCA. The conductance matrix $G$ has dimensions $N \times M$ for an RCA with $N$ wordlines
and $M$ bitlines. Let the conductance values of the resistive devices (organized in matrix form)
be denoted $g$, where $g_{ij}$ is the conductance of the resistive device connecting wordline $i$ with
bitline $j$. In the ideal case (no array/input/output resistances), each entry $G_{ij}$ in $G$ is equal to
$g_{ij}$. Consequently, the conductance values $g$ are obtained by linearly mapping the target matrix
$A$ into the programmable conductance range $[g_{min}, g_{max}]$, where $g_{min}$ and $g_{max}$ respectively
denote the minimum and maximum conductance between which the resistive devices demonstrate the
necessary high linearity. The circuit of an RCA used for analog MVM is shown in Figure 2.1(c).
Digital-to-analog converters (DACs) attached to the wordlines convert a digital input vector $x$ into
analog input voltages $v_{in}$. The transimpedance amplifiers (TIAs) attached to the bitlines amplify
the output currents ($i_{out}$) into output voltages ($v_{out}$), where $v_{out} = i_{out} R_s$ and $R_s$ is the feedback
resistance of the TIAs. Consequently, the output voltages are equal to $v_{out}^T = v_{in}^T G R_s$. Next, the
output voltages $v_{out}$ are converted into a digital output vector $y$ using analog-to-digital converters
(ADCs).
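
The ideal-case relations above can be sketched in a few lines of Python. The sketch is illustrative only and assumes an ideal array (no parasitics, no DAC/ADC quantization, no write errors). The function names, the offset-removal step in `decode`, and the numeric values chosen for `g_min`, `g_max`, and `r_s` are assumptions introduced here and are not taken from the dissertation. Wordline i holds row i of the mapped matrix, so the array evaluates the product of the input row vector with the matrix, as in the relation between input voltages and output currents given above.

```python
def map_to_conductance(A, g_min, g_max):
    """Linearly map the entries of A into [g_min, g_max] (assumes A is not constant)."""
    a_min = min(min(row) for row in A)
    a_max = max(max(row) for row in A)
    scale = (g_max - g_min) / (a_max - a_min)
    g = [[g_min + (a - a_min) * scale for a in row] for row in A]
    return g, a_min, scale


def analog_mvm(g, v_in, r_s):
    """Ideal crossbar: Ohm's law per device, Kirchhoff's current law per bitline."""
    n_wordlines, n_bitlines = len(g), len(g[0])
    i_out = [sum(v_in[i] * g[i][j] for i in range(n_wordlines))
             for j in range(n_bitlines)]
    return [i * r_s for i in i_out]  # TIA stage: v_out = i_out * R_s


def decode(v_out, v_in, r_s, g_min, a_min, scale):
    """Hypothetical helper: undo the TIA gain and the affine conductance mapping."""
    s = sum(v_in)
    return [(v / r_s - g_min * s) / scale + a_min * s for v in v_out]


A = [[1.0, 2.0], [3.0, 4.0]]
x = [0.5, 0.25]  # input vector, applied directly as wordline voltages
g, a_min, scale = map_to_conductance(A, g_min=1e-6, g_max=1e-4)
v_out = analog_mvm(g, x, r_s=1e4)
y = decode(v_out, x, 1e4, 1e-6, a_min, scale)  # approximately x^T A
```

Because the mapping into the conductance range is affine rather than purely proportional, the decoding step has to remove an offset that depends on the sum of the inputs; a real design might handle this differently (for example with a reference column), which is outside the scope of this sketch.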
The interest in leveraging RCAs to accelerate MVM operations stems from the fact that (i) the
performance is projected to be more efficient than that of a custom ASIC and (ii) in-memory computing
significantly reduces data movement. The projected performance of MVM on a 512x512 RCA
is compared with a custom ASIC optimized for both latency and energy-efficiency in Figure 2.2.
The figure shows that the RCA is estimated to be capable of performing 10 Peta operations per
second (OPS) and have a power efficiency of 100 Peta OPS per Watt [28]. In contrast, the speed
and power efficiency of a custom ASIC is 0.1 Peta OPS and 1-10 Peta OPS/Watt, respectively.
Moreover, further improvements in power will be obtained if the dimensions are scaled further.
In terms of data movement, in-memory computing inherently circumvents the power-hungry
data transfer associated with fetching the matrix from memory. In [24], it was shown that a DRAM
access consumes 1000∼2000 pJ while a floating point multiplication only consumes 100∼200 pJ,
i.e., the power consumption is dominated by the data movement.

Figure 2.2: Comparison between performing MVM using an RCA and a digital ASIC in terms of
(a) speed and (b) power [28].

Big Data Applications

Deep Neural Network (DNN)

DNNs have surpassed human-level capabilities for a number of computer vision applications [43].
The networks consist of multiple layers of neurons connected together by synapse weights. The
networks operate using a training and an inference phase. In the training phase, the synapse weights
are learned to solve a classification task. In the inference phase, input images/objects/videos are
classified into one of multiple output categories by passing an input to the first layer and recording
the output from the last layer. The processing (or evaluation) of every layer involves multiplying
the outputs from the previous layer with a matrix of synapse weights (an MVM operation) and
passing the result through a non-linear activation function. It is appealing to accelerate the MVM
operations using RCAs because the computation and its associated memory accesses are the bottleneck
limiting the system performance. Using this high-level approach, architectural level studies have
projected significant improvements in terms of power, area, latency, and throughput [58, 15, 62].

Figure 2.3: Examples of using analog in-memory computing to accelerate DNN inference.

Image compression

Low-latency and low-power signal and image compression is essential for edge devices deployed
in the Internet of Things. State-of-the-art signal and image compression techniques are based
on transforming signals and images from the spatial domain to the frequency domain. A signal
is translated into the frequency domain using the one-dimensional discrete cosine transform (1D
DCT), which involves performing a single MVM operation. Let $x$ be the spatial representation of a
signal. The frequency coefficients $c$ are computed using $c = Dx$, where $D$ is the DCT matrix [27].
A 2D image is translated into the frequency domain using the two-dimensional discrete cosine
transform (2D DCT), which involves performing two matrix-matrix multiplications in series. Let $X$ be a
matrix representation of an image in the spatial domain. The frequency coefficients $C$ of the image
$X$ are computed using $C = DXD'$, where $D$ again is the DCT matrix and $D'$ is its transpose.
Consequently, both signal and image compression can be accelerated using RCAs. Promising results
have been demonstrated in both simulation and using hardware prototypes [27, 47, 48].
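
As a small illustration of the two transforms described above, the following Python sketch builds a DCT matrix and evaluates the 1D and 2D transforms as plain matrix products. It assumes the orthonormal DCT-II definition, which is one common convention; the dissertation does not state which normalization is used, so the scaling factors and all helper names here are assumptions.

```python
import math


def dct_matrix(n):
    """Orthonormal DCT-II matrix D of size n x n (normalization is an assumption)."""
    d = []
    for k in range(n):
        alpha = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        d.append([alpha * math.cos(math.pi * (2 * j + 1) * k / (2 * n))
                  for j in range(n)])
    return d


def matmul(a, b):
    """Plain matrix-matrix product on nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


def dct_1d(x):
    """c = D x: a single matrix-vector multiplication."""
    d = dct_matrix(len(x))
    return [sum(d[k][j] * x[j] for j in range(len(x))) for k in range(len(x))]


def dct_2d(X):
    """C = D X D': two matrix-matrix multiplications in series (square X assumed)."""
    d = dct_matrix(len(X))
    return matmul(matmul(d, X), transpose(d))


def idct_2d(C):
    """Inverse transform X = D' C D, valid because D is orthonormal."""
    d = dct_matrix(len(C))
    return matmul(matmul(transpose(d), C), d)


c = dct_1d([1.0, 1.0, 1.0, 1.0])   # a constant signal has only a DC coefficient
C = dct_2d([[1.0, 2.0], [3.0, 4.0]])
X_rec = idct_2d(C)                  # recovers the original 2 x 2 block
```

In an RCA-based implementation, the matrix D would be the quantity programmed into the array, while the signal or image data would be applied as inputs.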

Figure 2.4: Example of using analog in-memory computing to accelerate image compression.

Sources of Errors

The main challenge of leveraging RCAs to accelerate MVM operations is that various sources of
variations may introduce errors that degrade system performance. A list of the most important
types of errors and how to model them is outlined below. Not all of these errors are addressed in
this dissertation; the focus is on overcoming the challenge of device defects.

Write accuracy

Resistive devices cannot be exactly programmed to a specific conductance value due to device
variabilities. With respect to a desired target conductance, the obtained device conductance exhibits
a log-normal distribution [28]. The programming accuracy on an abstract level can be captured
using a bit-accuracy $b$ and a lower and upper bound on the programmable conductance range
$[g_{min}, g_{max}]$ [7, 25]. Hence, each element in $g$ is a discrete variable with $2^b$ states uniformly distributed
over $[g_{min}, g_{max}]$. The modeling of non-uniform state distributions has been explored in [56].
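
The abstract write-accuracy model can be illustrated with a short Python sketch. It only captures the discretization into uniformly spaced states; the log-normal spread around each state is deliberately not modeled. The helper names and the example range are hypothetical.

```python
def programmable_states(g_min, g_max, b):
    """Return the 2**b uniformly spaced states in [g_min, g_max] (assumes b >= 1)."""
    levels = 2 ** b
    step = (g_max - g_min) / (levels - 1)
    return [g_min + k * step for k in range(levels)]


def program(target, g_min, g_max, b):
    """Snap a target conductance to the nearest programmable state."""
    states = programmable_states(g_min, g_max, b)
    return min(states, key=lambda s: abs(s - target))


states = programmable_states(0.0, 3.0, 2)   # four states for b = 2
nearest = program(1.2, 0.0, 3.0, 2)         # closest state to the target 1.2
```

With this model, the worst-case discretization error is half the spacing between adjacent states, so each additional bit roughly halves it.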

Array parasitics

In the ideal case, the conductance matrix $G$ is equivalent to the conductance values of the resistive
devices $g$. In reality, the conductance matrix $G$, or $G(g)$, is a highly non-linear function of $g$ and
the non-zero input (driver), output (sensing), and array (wire) parasitics [50]. The shorthand $G$ is
used to balance clarity and brevity. Let $r_p$ capture the input, output, and array resistances that are
fixed after fabrication. Let the RCA have $M$ wordlines and $N$ bitlines. The conductance
matrix $G$ can be obtained as follows:

$$G = S\,Y(g, r_p)^{-1} B, \tag{2.1}$$

where $B$, $S$, and $Y(g, r_p)$ are matrices. The $B$ and $S$ matrices respectively have dimensions
$(M) \times (2NM + M + N)$ and $(2NM + M + N) \times (N)$ and only depend on the size of the RCA. $Y(g, r_p)$ is a
matrix with dimension $(2MN + M + N) \times (2MN + M + N)$ that depends on the fixed parasitic properties $r_p$ and
the conductance values $g$ of the resistive devices. The matrices in Eq (2.1) are defined in [50, 80].

Device defects/Stuck-at-fault defects

A resistive device can suffer from a hard defect, which implies that the device conductance cannot be
further programmed. Stuck-at-faults can arise in the fabrication process or from heavy testing and
utilization. Let $G_d$ be a defect matrix that captures the stuck-at-fault defects in an RCA. Each
memristor in $G_d$ is either non-defective, stuck-on, or stuck-off. $G_d$ can be identified using the
techniques in [13, 34, 66, 73].
We follow the overall approach of implementing a matrix-vector accelerator (dot-product engine)
proposed in [26, 28]. The accelerators are based on using multilevel devices that enable the value
range of the matrix to be linearly mapped into the memristor conductance range. The linear mapping
is followed by minor tuning of the conductance values to compensate for IR-drop over the
array parasitics. Let $w_k^{eff}$ be the effective weight realized when a weight $w_k$ in $W$ is mapped to a
memristor $m_k$ in a memristor crossbar array (MCA) characterized by $G_d$. Based on the models in
[51, 14, 28], $w_k^{eff}$ can be computed as follows:

$$
w_k^{eff} =
\begin{cases}
w_{min} & \text{if } m_k \text{ stuck-off}, \\
w_k & \text{if } m_k \text{ non-defective}, \\
w_{max} & \text{if } m_k \text{ stuck-on},
\end{cases}
\tag{2.2}
$$

where $w_{min}$ and $w_{max}$ are the minimum and maximum values in $W$. A weight matrix $W$ is converted
into an effective weight matrix $W^{eff}$ using $G_d$ and Eq (2.2). Clearly, stuck-at-fault defects may
impact the computational accuracy of matrix-vector multiplication accelerated using MCAs.
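
A minimal Python sketch of the stuck-at-fault model in Eq (2.2) is given below. The encoding of the defect matrix as strings and the function name are assumptions made for illustration; the dissertation does not prescribe a data format.

```python
OK, STUCK_OFF, STUCK_ON = "ok", "off", "on"  # hypothetical encoding of defect states


def effective_weights(W, Gd):
    """Apply Eq (2.2) element-wise: stuck-off -> w_min, stuck-on -> w_max."""
    w_min = min(min(row) for row in W)
    w_max = max(max(row) for row in W)
    W_eff = []
    for w_row, d_row in zip(W, Gd):
        row = []
        for w, d in zip(w_row, d_row):
            if d == STUCK_OFF:
                row.append(w_min)
            elif d == STUCK_ON:
                row.append(w_max)
            else:
                row.append(w)
        W_eff.append(row)
    return W_eff


W = [[0.5, -1.0], [2.0, 0.0]]
Gd = [[STUCK_ON, OK], [OK, STUCK_OFF]]
W_eff = effective_weights(W, Gd)
```

In this example the stuck-on device replaces 0.5 with the maximum weight 2.0 and the stuck-off device replaces 0.0 with the minimum weight -1.0, while the non-defective entries are unchanged.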

DAC/ADC quantization errors

The domain interfaces introduce errors in both the digital/analog and the analog/digital conversion.
The DACs quantize the input value range into $b_{dac}$ bits, or $2^{b_{dac}}$ states. The ADCs quantize the output
voltage range into $b_{adc}$ bits, or $2^{b_{adc}}$ states. The errors introduced by the DACs and ADCs are
therefore proportional to the input and output value range, respectively.

Resistance drift

The resistance of every memristor will be slightly changed after each MVM operation is performed
due to resistance drift [12]. Consequently, the computational accuracy of the MVM operations will
be degraded over time.

Temperature variations

The conductance of each resistive device is dependent on the temperature of the operating environment [28]. The devices are more (less) conductive at higher (lower) temperatures.

Stochastic variations

Random telegraph noise (RTN) introduces conductance variations for every resistive device [16,
20]. Johnson and shot noise introduce non-ideal currents through the resistive devices and TIAs,
respectively.

Non-ideal device characteristics

Non-ideal device characteristics cause each resistive device to act as a non-linear device $i(v, s)$
instead of an ideal resistor/conductor. Consequently, the current through a resistive device $i(v, s)$
is a function of a state variable $s$ and the voltage across the device $v$. Qualitatively, the resistive
devices have been reported to become more conductive under high voltage excitation [28].

CHAPTER 3: PREVIOUS WORK

This chapter outlines the state-of-the-art solutions to handle the aforementioned sources of errors.

Write accuracy

A resistive device is typically programmed by first applying a SET operation to the maximum
conductance. Next, it is programmed to a target conductance by applying multiple short RESET
pulses. The two-phase scheme is used because the write operation is asymmetric. The SET (or
increase in conductance) is abrupt and cannot be performed with fine control, whereas the RESET
operation can be performed in a more predictable manner [7]. Open loop programming is based
on the concept of analytically calculating the pulse length (or the number of pulses) required to
program the device to the target conductance. Closed loop programming (or program and verify) is
the concept of iteratively applying voltage pulses and measuring the realized device conductance.
Intuitively, smaller write errors are obtained through closed loop programming but the process is
slower and consumes more power.
The length of the write operation depends exponentially on the voltage across the resistive
device [50]. The challenge is that when the array parasitics are considered, the voltage
across each device will be significantly smaller than in the ideal case. The smaller voltage drop
translates into substantially longer programming latencies. To reduce the voltage drop during the
programming phase, the RCA can be fabricated with an access transistor connected in series with
each resistive device. The access transistors and resistive devices have demonstrated adequate
linearity in hardware [28].
In [51, 14], it was observed that different resistive devices exhibit write errors with different
variance. To reduce the impact of write error in a neural network application, small (large) weights
were mapped to devices with large (small) variance. The data to hardware assignment (or mapping)
was realized using hardware routers.

Array parasitics

The techniques used to compensate for non-zero array parasitics are based on conductance mapping, post-processing, and retraining of neural network weights.
Conductance mapping is the concept of mapping a matrix A into conductance values g such that
the realized conductance matrix G is proportional to A. The challenge stems from (i) the IR-drop
(or voltage drop) over the array parasitics and (ii) that currents may flow through multiple paths
from an input to an output in the RCA. Nevertheless, given the conductance values g, the realized
conductance matrix G can be computed analytically using Eq (2.1).
Early work on conductance mapping focused on specifying the conductance values $g$ while
compensating for the non-zero output resistance of the TIAs. A linear approximation technique was
used in [26]. An iterative technique was proposed in [71]. The first technique that explicitly
captured the array parasitics was proposed in [50]. The method was based on defining the matrix
realized by an RCA to be $A_r = G/\alpha$, where $\alpha$ is a scaling factor and $G$ is the conductance matrix
obtained using Eq (2.1). Next, the conductance values $g$ were specified by minimizing $\|A - G/\alpha\|^2$
using steepest gradient descent, where $\|\cdot\|^2$ denotes the square of the Frobenius norm. The
limitation of that work is that the write accuracy of the resistive devices was not considered and the
run-time of the algorithm was also very long. In [28], an extremely fast conductance mapping
algorithm was proposed. First, an ideal target current through each resistive device was
determined while treating the RCA as ideal (no array parasitics). Next, the conductance values of
the resistive devices were tuned using Newton’s algorithm to deliver the specified target currents
through each device. This method is heuristic because it does not explicitly minimize the
difference between $A$ and $A_r$. However, it has empirically demonstrated high fidelity. In [80], the method
in [50] was extended to consider the write accuracy of the resistive devices. This was performed by
minimizing $\|A - A_r\|$ while optimizing both the scaling factor $\alpha$ and the conductance values $g$. $\alpha$
regulates a trade-off between errors introduced by the write accuracy and the IR-drop over the array
parasitics. This method results in significantly higher accuracy compared with the methods in [28] and [50].
However, the run-time of the mapping is too long to be practical for real-world applications.
An orthogonal approach to compensating for the IR-drop over the array parasitics is to modify the
application to account for the IR-drop. In particular, retraining of the neural network weights has
been explored [31, 23]. Our understanding is that the retraining was performed by linearly mapping the neural network weights into conductance values and computing the weight matrix that is
effectively realized. In [31], the accurate model in Eq (2.1) was used to capture the array parasitics.
In [23], an approximate statistical model was used to reduce the run-time of the retraining.

Device defects

The techniques used to compensate for stuck-at-fault defects are based on retraining of neural
network weights, digital co-processing, post-processing, and optimizing the data to hardware
assignment. Hardware-aware training aims to train the weights of a neural network in software to
mimic the defects in the RCA hardware [14, 52]. The main limitation is that each neural network
application must be retrained based on the unique defect pattern of each RCA based platform.
Digital co-processing is the concept of compensating for the defects using a digital co-processor.
This technique can be used to compensate for any number of defective devices but is expected
to introduce significant performance and hardware overheads [23]. In [81], post-processing
techniques were used to minimize the errors using a first order polynomial. In [76], it was observed that
the constant term of the polynomial can be captured by using the bias weights without
introducing any overhead. Stuck-at-fault defects can also be compensated for using redundant hardware,
which involves representing each matrix element using multiple parallel resistive devices [13]. A
technique of programming the parallel devices to realize an effective matrix value was proposed
in [72]. In [51], it was observed that the negative impact of stuck-at-fault defects can be reduced
by optimizing the data to hardware assignment. Specifically, small (large) matrix elements were
assigned to devices stuck-off (stuck-on). The assignment was realized by permuting rows using
routers. The matrix row to RCA row assignment was guided by a greedy algorithm. In [14],
the mapping was formulated as an assignment problem, which can be solved optimally using the
Hungarian algorithm. In [73], it was observed that the data to hardware assignment could be
performed without any routers (or hardware overhead) by reordering the neurons in each layer of
a neural network, which can be referred to as data layout organization and is illustrated in (a) and (b)
of Figure 4.2. Nevertheless, the data layout organization reduces the number of possible data to
hardware assignments from $(n!)^2$ to $n!$, where $n$ is the number of neurons in a layer. In [78], data
layout organization was performed by solving an assignment problem for the neurons in each layer
of a neural network. The technique also seamlessly incorporated the use of hardware redundancy.
The main drawback of that work is that the run-time becomes prohibitively long for large-scale
DNNs.
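
The row assignment idea discussed above can be sketched as follows. This is a toy illustration under several assumptions: only matrix rows are permuted, the cost of placing a matrix row on an RCA row is taken to be the sum of squared differences between the weights and their effective values under the stuck-at-fault model, and the optimum is found by brute-force enumeration rather than by the Hungarian algorithm used in [14]. Brute force is only feasible for very small n; for realistic sizes a polynomial-time assignment solver, such as SciPy's `scipy.optimize.linear_sum_assignment`, would be the natural choice. All names below are hypothetical.

```python
from itertools import permutations

OK, STUCK_OFF, STUCK_ON = "ok", "off", "on"  # hypothetical defect encoding


def row_cost(w_row, defect_row, w_min, w_max):
    """Squared error incurred when a matrix row is placed on a given RCA row."""
    cost = 0.0
    for w, d in zip(w_row, defect_row):
        w_eff = w_min if d == STUCK_OFF else w_max if d == STUCK_ON else w
        cost += (w - w_eff) ** 2
    return cost


def best_row_assignment(W, Gd):
    """Try every permutation; assignment[i] is the RCA row used for matrix row i."""
    w_min = min(min(r) for r in W)
    w_max = max(max(r) for r in W)
    n = len(W)
    cost = [[row_cost(W[i], Gd[j], w_min, w_max) for j in range(n)]
            for i in range(n)]
    best = min(permutations(range(n)),
               key=lambda p: sum(cost[i][p[i]] for i in range(n)))
    return list(best), sum(cost[i][best[i]] for i in range(n))


W = [[0.0, 1.0], [1.0, 0.0]]
Gd = [[STUCK_ON, OK], [STUCK_OFF, OK]]
assignment, total = best_row_assignment(W, Gd)
```

In this example, swapping the two rows places the large weight on the stuck-on device and the small weight on the stuck-off device, so the defects introduce no error.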

DAC/ADC quantization errors

Figure 3.1: Data to hardware assignment (a) using routers and (b) using data layout organization.
$W_l$ and $W_{l+1}$ are the weight matrices of two adjacent layers in a neural network.

The precision of the DACs and the ADCs is limited to 8 bits. However, many scientific computing
applications require 16-bit fixed-point or even floating-point precision. To overcome this limitation,
many architectural level studies have adopted bit-slicing techniques to emulate high precision [11,
58, 21]. The concept is based on decomposing both the input vector and the matrix into bit-slices.
Next, the bit-slices are multiplied and added together using a shift-and-add reduction
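network. The concept of bit-slicing of a matrix is shown below:

$$
\begin{bmatrix} 2 & 1 \\ 7 & 3 \end{bmatrix}
= 2^2 \begin{bmatrix} 0 & 0 \\ 1 & 0 \end{bmatrix}
+ 2^1 \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix}
+ 2^0 \begin{bmatrix} 0 & 1 \\ 1 & 1 \end{bmatrix}
$$
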
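
The decomposition above can be reproduced with a short Python sketch. For brevity, only the matrix is sliced (the input vector is kept at full precision), the entries are assumed to be non-negative integers, and the function names are hypothetical.

```python
def bit_slices(A, bits):
    """Return the binary matrices S_0 .. S_{bits-1} with A = sum_k 2**k * S_k."""
    return [[[(a >> k) & 1 for a in row] for row in A] for k in range(bits)]


def mvm(M, x):
    """Plain matrix-vector product on nested lists."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


def sliced_mvm(A, x, bits):
    """Multiply each bit-slice by x, then combine the partial results by shift-and-add."""
    result = [0] * len(A)
    for k, s in enumerate(bit_slices(A, bits)):
        partial = mvm(s, x)
        result = [r + (p << k) for r, p in zip(result, partial)]
    return result


A = [[2, 1], [7, 3]]
slices = bit_slices(A, 3)          # slices[2], slices[1], slices[0] match the equation
y = sliced_mvm(A, [1, 2], 3)       # same result as mvm(A, [1, 2])
```

The shift by k in the reduction step is also where an analog error in a high-order slice gets multiplied by a large power of two.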
The limitation of bit-slicing is that the paradigm is highly vulnerable to errors. Specifically, small
analog errors may be amplified into large digital errors by the shift operations. An orthogonal
approach to minimizing the impact of quantization errors is based on computational
restructuring. In [77], image compression using 2D DCT was reformulated into a single MVM operation,
which prevents quantization errors (and analog errors) in the first matrix-matrix multiplication
from being amplified in the second matrix-matrix multiplication. In [41], linear systems of equations
($Ax = b$) were solved using a combination of low-precision analog hardware and high-precision
digital hardware using a two-step methodology. First, an approximate solution is obtained using
low-precision analog in-memory computing. Second, the error residual is computed using
high-precision digital hardware. Next, the right-hand side of the linear equation is updated to the error
residual and the two-step process is iteratively repeated until an arbitrary bound on the precision is
satisfied. In [49], neural networks were trained to be robust to variations by penalizing outputs
with a large gradient in the activation functions. In [54], the introduction of quantization errors
was circumvented by proposing a fully analog neural network architecture.
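
The two-step mixed-precision methodology described for linear systems can be illustrated with a small Python sketch of iterative refinement. The low-precision analog solver is emulated here by solving a 2x2 system exactly and rounding the result to two significant digits; this stand-in, the tolerance, and all function names are assumptions made for illustration and do not describe the hardware used in [41].

```python
import math


def round_sig(value, digits=2):
    """Round to a few significant digits to mimic a low-precision result."""
    if value == 0.0:
        return 0.0
    exponent = math.floor(math.log10(abs(value)))
    factor = 10.0 ** (digits - 1 - exponent)
    return round(value * factor) / factor


def low_precision_solve(A, b):
    """Stand-in for the analog solver: exact 2x2 solve followed by coarse rounding."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    x0 = (b[0] * A[1][1] - A[0][1] * b[1]) / det
    x1 = (A[0][0] * b[1] - b[0] * A[1][0]) / det
    return [round_sig(x0), round_sig(x1)]


def residual(A, x, b):
    """High-precision residual r = b - A x (the digital step)."""
    return [b[i] - sum(A[i][j] * x[j] for j in range(2)) for i in range(2)]


def refine(A, b, tol=1e-10, max_iter=50):
    """Repeat: compute the residual, solve for a correction at low precision, update x."""
    x = [0.0, 0.0]
    for iteration in range(max_iter):
        r = residual(A, x, b)
        if max(abs(v) for v in r) <= tol:
            return x, iteration
        dx = low_precision_solve(A, r)  # the residual becomes the new right-hand side
        x = [x[i] + dx[i] for i in range(2)]
    return x, max_iter


A = [[3.0, 1.0], [2.0, 3.0]]
b = [1.0, 1.0]
x, iterations = refine(A, b)   # converges toward [2/7, 1/7]
```

Each pass only needs the correction to be accurate to a couple of digits, yet the accumulated solution keeps improving because the residual is always evaluated at high precision.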

Resistance drift

Resistance drift causes the conductance of a resistive device to change slightly after each MVM
operation. Intuitively, resistance drift can be compensated for by periodically reprogramming the
conductance of the resistive devices to their original state. In [53], a framework was proposed to
perform the reprogramming with respect to the application performance instead of with respect to
a fixed time period. In [75], the reprogramming was performed while the RCA was idle to
minimize the system downtime. Because the write operation is asymmetric, resistive devices can
only be programmed to be more conductive with fine granularity. To make the resistive devices
less conductive, a disruptive RESET operation is required to program the device to the minimum
conductance. Next, the conductance can be tuned to the target conductance. Consequently, it was
observed that by using two devices organized in a differential pair configuration, the
reprogramming of the devices can be performed much quicker [46, 45]. If a device has drifted to be more
conductive, the other device in the differential pair configuration is programmed to be slightly
more conductive to compensate for the errors.

Temperature variations

In [28], the state variables of the resistive devices were specified based on a calibration vector
using Newton’s method. The specification can be performed with respect to any arbitrary operating
temperature. In [10], it was observed that the temperature impacts the conductance of the resistive
devices. To overcome this limitation, the weights were retrained or remapped based on a sensor
characterization. The main idea was to avoid mapping large weights to resistive devices with high
temperature.

Stochastic variations

It is difficult to mitigate the errors introduced by random telegraph noise. Most studies exploit
that many applications have an inherent resilience to errors and can tolerate a certain amount
of errors. The errors introduced by stochastic variations can also be reduced using techniques based on
computational restructuring. In [23], it was observed that stochastic noise (thermal noise and shot
noise) is dependent on the operating frequency. Consequently, the impact of the variations can be
reduced to viable levels by selecting a sufficiently slow operating frequency.

Non-ideal device characteristics

Conductance range selection is based on analysing the linearity of the device and selecting an
appropriate part to be the programmable conductance range. Intuitively, there is a trade-off between
selecting a small range (few programmable states but small non-linearities) and a large range (many
states but larger non-linearities) [25]. Compensation for non-linearities has been explored using
pre-processing and post-processing. Using pre-processing, non-linearities were compensated for
by non-linearly mapping the input vector $x$ to the input voltage range $[0, V_{max}]$ [71], i.e., by exploiting
a pseudo Ohm’s law. In [25], the output voltages were mapped to the digital output range using
a first order polynomial based on minimizing the mean square error. In [28], the non-linearities
were considered when specifying the conductance values by ensuring the correct output for an
input calibration vector. However, the input vector was used to simultaneously compensate for
the impact of the array parasitics. In [80], the conductance values with respect to ideal devices
were first specified. Next, the state variables were specified with respect to an input vector, i.e.,
decoupling the compensation for the non-ideal device characteristics and the array parasitics.

CHAPTER 4: MATRIX TRANSFORMATION

Problem Formulation

In this chapter, we address the problem of mapping DNNs to MCA based platforms with stuck-at-fault
defects. The objective is to perform the mapping while maximizing the classification accuracy
in the MCA based hardware. We approach this problem by mapping DNNs to MCA based
platforms while minimizing a cost metric $cost$ using matrix transformations $\mathcal{T}$, as follows:

$$\min_{\mathcal{T}} \sum_{l=1}^{L-1} cost(\widetilde{W}_l, G_{dl}), \tag{4.1}$$

where $\widetilde{W}_l$ is the transformed weight matrix $W_l$, i.e., $\widetilde{W}_l = \mathcal{T}(W_l)$. $W_l$ is the weight matrix that
connects the neurons in layer $l$ to the neurons in layer $l + 1$. $G_{dl}$ is the defect matrix capturing the
defects in the MCA used for the weight matrix $W_l$ or $\widetilde{W}_l$. $\mathcal{T}$ is a series of transformations $T$.

Cost Metric

The MT framework is guided by a cost metric that is based on the weighted square error, which is
defined as follows:

$$cost(W_l, G_{dl}) = \frac{u_l}{n_l} \cdot \sum_{k \in W_l} (w_k - w_k^{eff})^2, \tag{4.2}$$

where $w_k$ is a weight in a weight matrix $W_l$. $w_k^{eff}$ is the effective weight obtained when $w_k$ is
mapped to the memristor $m_k$ in $G_{dl}$. $u_l$ is the number of times the weight matrix $W_l$ is used per
input image and $n_l$ is the number of weights in $W_l$.

The square error is used to prioritize minimizing large differences between $w_k$ and $w_k^{eff}$. $n_l$ is
used to prioritize the realization of the connections between any pair of layers $l$ and $l + 1$ equally.
$u_l$ is used to account for the fact that certain weight matrices are used multiple times while processing one
input image.
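
The cost metric can be evaluated with a few lines of Python. In the sketch below, the effective weight matrix is assumed to have been computed beforehand from the defect matrix using Eq (2.2) and is passed in directly; the function names and the example values are hypothetical.

```python
def layer_cost(W, W_eff, uses):
    """Eq (4.2): (u_l / n_l) * sum over all weights of (w_k - w_k_eff)**2."""
    n = sum(len(row) for row in W)
    squared_error = sum((w - e) ** 2
                        for w_row, e_row in zip(W, W_eff)
                        for w, e in zip(w_row, e_row))
    return uses / n * squared_error


def total_cost(layers):
    """Objective of Eq (4.1): sum of the layer costs; layers holds (W, W_eff, u) tuples."""
    return sum(layer_cost(W, W_eff, u) for W, W_eff, u in layers)


W1 = [[0.0, 1.0], [2.0, 3.0]]
W1_eff = [[3.0, 1.0], [2.0, 0.0]]   # one stuck-on and one stuck-off device
c1 = layer_cost(W1, W1_eff, uses=2)
c_all = total_cost([(W1, W1_eff, 2), ([[1.0]], [[0.0]], 1)])
```

Here the two defective devices each contribute a squared error of 9, which is scaled by the usage count 2 and divided by the 4 weights in the matrix.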

Transformations

In this section, three transformations $T$ are introduced to transform a weight matrix $W$ into a
different weight matrix $\widetilde{W} = T(W)$, which may reduce the cost metric and improve the robustness
to stuck-at-faults.

Row flipping transformation

The row flipping transformation is based on the observation that the matrix-vector multiplication
$Wx$ can be reformulated as follows:

$$Wx = w_{max}\,\mathrm{sum}(x) - \underbrace{(w_{max} - W)}_{\widetilde{W}}\,x, \tag{4.3}$$

where $\widetilde{W}$ is the matrix mapped to the MCA. $\mathrm{sum}(x)$ is the sum of the elements in the input
vector $x$, which is computed using digital hardware. The minimum and maximum values in $\widetilde{W}$ are
respectively equal to $0$ and $w_{max} - w_{min}$. Consequently, the effective weight $w_k^{eff}$ of a stuck-on
fault is equal to $w_{min} = w_{max} - (w_{max} - w_{min})$, and the effective weight $w_k^{eff}$ of a stuck-off fault
is $w_{max} = w_{max} - 0$. Hence, the transformation effectively flips all stuck-off (stuck-on) faults into
stuck-on (stuck-off) faults. Moreover, the row flipping transformation can be applied to
independently flip any subset of the rows in $W$.
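
The identity in Eq (4.3) can be checked numerically with the following Python sketch, which applies the flip to a chosen subset of rows. It ignores the conductance mapping and all analog non-idealities, and the function names are hypothetical.

```python
def mvm(M, x):
    """Plain matrix-vector product on nested lists."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


def row_flip_mvm(W, x, flip):
    """Evaluate W x where rows with flip[i] == True are stored as (w_max - row)."""
    w_max = max(max(row) for row in W)
    # Matrix that would be mapped to the array: flipped rows hold w_max - w.
    W_tilde = [[w_max - w for w in row] if f else list(row)
               for row, f in zip(W, flip)]
    analog = mvm(W_tilde, x)            # the part evaluated by the crossbar
    correction = w_max * sum(x)         # computed digitally, as in Eq (4.3)
    return [correction - a if f else a for a, f in zip(analog, flip)]


W = [[1.0, 2.0], [3.0, 4.0]]
x = [2.0, -1.0]
reference = mvm(W, x)
mixed = row_flip_mvm(W, x, [True, False])
both = row_flip_mvm(W, x, [True, True])
```

All three results agree, which illustrates that the choice of which rows to flip is free from the point of view of the ideal computation and can therefore be made purely based on the defect pattern.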

Figure 4.1: Value range of one and two parallel memristors while using row flipping.
(Legend: n = non-flipped, f = flipped, W+ = positive row, W- = negative row, H = stuck-on, L = stuck-off.)

Let the value range for a matrix element $w_k$ denote the effective weight values $w_k^{\mathrm{eff}}$ that can be realized while considering the defects in the hardware. The value ranges that can be realized using the row flipping transformation while considering all combinations of defects with respect to one and two memristors are shown in Figure 4.1. The figure shows that when one memristor is used to realize a weight without the row flipping transformation, the effective weights that can be realized are $w_{\min}$ (stuck-off), $w_{\max}$ (stuck-on), or $[w_{\min}, w_{\max}]$ (non-defective). With the flipping transformation applied, the effective weights that can be realized are $w_{\min}$ (stuck-on), $w_{\max}$ (stuck-off), or $[w_{\min}, w_{\max}]$ (non-defective). Clearly, the row flipping transformation allows the selection of alternative value ranges, which translates into a smaller cost.
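The one-memristor case can be written out directly. The following is a minimal sketch, assuming the fault state is encoded as `None`, `"on"`, or `"off"` (a hypothetical encoding) and that the error of a weight is its squared distance to the closest realizable value. In the actual transformation the flip decision is made per row, not per memristor; the single-element view below only illustrates how flipping changes the realizable range.

```python
def realizable_range(fault, flipped, w_min, w_max):
    """Effective-weight range of one memristor, following the one-memristor case.

    fault is None (non-defective), "on" (stuck-on) or "off" (stuck-off).
    """
    if fault is None:
        return (w_min, w_max)
    # Without flipping: stuck-off -> w_min, stuck-on -> w_max; flipping swaps them.
    pinned_high = (fault == "on") != flipped
    value = w_max if pinned_high else w_min
    return (value, value)


def best_flip(weight, fault, w_min, w_max):
    """Return (flipped, squared_error) for the flip choice with the smaller error."""
    options = []
    for flipped in (False, True):
        low, high = realizable_range(fault, flipped, w_min, w_max)
        effective = min(max(weight, low), high)  # closest realizable value
        options.append(((weight - effective) ** 2, flipped))
    error, flipped = min(options)
    return flipped, error
```

For example, a weight close to $w_{\max}$ that lands on a stuck-off memristor is realized almost exactly once the row is flipped.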
The transformation is particularly effective when each row in W is realized using two (or multiple) parallel rows in an MCA. The explanation is that there are $2^R$ possibilities of applying the row flipping transformation when R rows in an MCA are used to realize a single row in a weight matrix. The first row in the figure is equivalent to the technique of programming parallel rows used in [72]. Note that it is assumed that an ADC is used to measure the output of each row in the MCA. If the outputs of two (or R) rows are subtracted (added) in the analog domain, row flipping would be restricted to flipping, or not flipping, both of the rows.

Permutation transformation

In this section, we explain how to perform neuron permutation or feature map permutation while
considering row flipping in order to minimize cost. The permutation transformation permutes the
order of the neurons or feature maps in layer l of a neural network, which is shown in (a) and (b)
of Figure 4.2, respectively.
Permuting the order of two neurons in layer l is equivalent to permuting the corresponding two
rows in Wl−1 and the corresponding two columns in Wl [73], which is illustrated in Figure 4.2(c).
Note that the columns in Wl−1 and the rows in Wl are fixed while permuting the order of the
neurons in layer l. Permuting the order of two feature maps in layer l is equivalent to permuting
the corresponding two rows in Wl−1 and a group of columns in Wl , which is shown in Figure 4.2(d).
The size of the group is equal to the kernel size, i.e., typically 3x3=9. Finding the permutation of
the Ml neurons or feature maps in layer l that minimizes the cost metric is equivalent to assigning
each of the neurons or feature maps to one of Ml locations, as illustrated in (e) and (f) of Figure 4.2,
respectively. The assignment of neurons (or feature maps) to locations can be formulated as an
assignment problem, which can be solved optimally using the Hungarian algorithm [19].
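The equivalence between reordering neurons and permuting rows and columns can be verified on a toy network. The sketch below is an illustration under stated assumptions: two small dense layers, a ReLU activation (chosen only for the example), and the convention used in the text that neurons of layer l index the rows of $W_{l-1}$ and the columns of $W_l$. All function names are hypothetical.

```python
def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def relu(vector):
    """Element-wise activation, assumed here only for illustration."""
    return [max(0.0, v) for v in vector]


def permute_layer(w_prev, w_next, order):
    """Reorder the neurons of layer l: rows of W_{l-1} and columns of W_l follow `order`."""
    new_prev = [w_prev[i] for i in order]
    new_next = [[row[i] for i in order] for row in w_next]
    return new_prev, new_next


def forward(w_prev, w_next, x):
    """Output of the two-layer network for input x."""
    return matvec(w_next, relu(matvec(w_prev, x)))
```

Because the same ordering is applied to both matrices, the network output is unchanged; only the placement of the weights on the crossbars differs.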
Figure 4.2: (a) Neuron permutation [73]. (b) Feature map permutation. (c) Permutation of rows and columns in $W_{l-1}$ and $W_l$. (d) Permutation of rows in $W_{l-1}$ and groups of columns in $W_l$. (e) Assignment problem for neurons and (f) feature maps.

We explain the formulation of the assignment problem with respect to neurons, but it can directly be applied to feature maps. The assignment problem formulation requires the cost $c_{ij}$ of assigning neuron i to location j to be computed for all $i \in M_l$ and $j \in M_l$. $c_{11}$ is illustrated in Figure 4.2(c). $c_{ij}$ is equal to $c^{in}_{ij} + c^{out}_{ij}$, where $c^{in}_{ij}$ is the cost of assigning row i in $W_{l-1}$ to row j in MCA$_{l-1}$ and $c^{out}_{ij}$ is the cost of assigning column i in $W_l$ to column j in MCA$_l$. When computing $c^{in}_{ij}$, we compute the cost when row j is flipped and non-flipped. Next, $c^{in}_{ij}$ is set to the minimum of the two computed costs. In general, if a row in W is represented using R parallel rows, we enumerate the $2^R$ combinations of flipped and non-flipped rows and select the smallest cost when setting $c^{in}_{ij}$.
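The construction of the row cost and the assignment step can be sketched for R = 1. The example below is illustrative and makes several assumptions: faults are encoded as `None`, `"on"`, or `"off"`; only the row term $c^{in}_{ij}$ is built (the column term would be added to it in the same way); and an exhaustive search over permutations stands in for the Hungarian algorithm, which is only practical for very small layers. In practice a polynomial-time solver such as SciPy's `linear_sum_assignment` would be the natural choice.

```python
from itertools import permutations


def element_error(weight, fault, flipped, w_min, w_max):
    """Squared error of one weight on one memristor (fault: None, "on" or "off")."""
    if fault is None:
        return 0.0
    pinned_high = (fault == "on") != flipped  # flipping swaps stuck-on and stuck-off
    effective = w_max if pinned_high else w_min
    return (weight - effective) ** 2


def row_cost(row, faults, w_min, w_max):
    """Row term for one (weight row, MCA row) pair: cheaper of non-flipped and flipped."""
    return min(
        sum(element_error(w, f, flipped, w_min, w_max) for w, f in zip(row, faults))
        for flipped in (False, True)
    )


def best_assignment(cost):
    """Exhaustive search over permutations; a stand-in for the Hungarian algorithm."""
    size = len(cost)
    return min(
        permutations(range(size)),
        key=lambda perm: sum(cost[i][perm[i]] for i in range(size)),
    )
```

The returned tuple maps weight row i to MCA row `perm[i]`; the flip decision for each pair is the one that attained the minimum inside `row_cost`.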

Value range transformation

The value range of W is equal to [wmin , wmax ]. The proposed value range transformation consists
of introducing a reduced value range [wrmin , wrmax ] with wmin ≤ wrmin and wrmax ≤ wmax . The matrix
value range is mapped to the reduced value range by setting all weights in W larger (smaller) than
wrmax (wrmin ) to wrmax (wrmin ).
When a weight $w_k$ is mapped to a memristor without a stuck-at-fault, the effective weight $w_k^{\mathrm{eff}}$ is obtained as follows:

$$w_k^{\mathrm{eff}} = \begin{cases} w_{rmin} & \text{if } w_k \le w_{rmin}, \\ w_k & \text{if } w_k \in [w_{rmin}, w_{rmax}], \\ w_{rmax} & \text{if } w_k \ge w_{rmax}. \end{cases} \qquad (4.4)$$

The effective weight $w_k^{\mathrm{eff}}$ of a weight $w_k$ mapped to a memristor with a stuck-at-fault is obtained by replacing $w_{min}$ and $w_{max}$ respectively with $w_{rmin}$ and $w_{rmax}$ in Eq (2.2).
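For the fault-free case, Eq (4.4) is a clipping operation. A minimal sketch, with hypothetical function names and nested lists standing in for a weight matrix:

```python
def clip_weight(weight, wr_min, wr_max):
    """Effective weight on a fault-free memristor under the reduced range, Eq (4.4)."""
    if weight <= wr_min:
        return wr_min
    if weight >= wr_max:
        return wr_max
    return weight


def clip_matrix(matrix, wr_min, wr_max):
    """Apply the value range transformation to every weight in a matrix."""
    return [[clip_weight(w, wr_min, wr_max) for w in row] for row in matrix]
```

Weights inside the reduced range are unchanged, so they incur no error on fault-free memristors; only the clipped outliers do.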
The impact of the proposed transformation on the cost metric is summarized in Table 4.1. Both
the reduction and the baseline have no errors (or cost) when weights in the range [wrmin , wrmax ] are
mapped to memristors without stuck-at-faults. Therefore, the performance is listed as equal in
Table 4.1. Compared with the baseline, the reduction results in smaller cost when weights in the
range [wrmin , wrmax ] are mapped to memristors with stuck-at-faults and larger cost when weights in
the ranges [wmin , wrmin ) and (wrmax , wmax ] are mapped to memristors without stuck-at-faults. With
respect to the baseline, the cost can be smaller, equal, or larger when weights in the ranges [wmin ,
wrmin ) and (wrmax , wmax ] are mapped to memristors with defects. The magnitude of the cost depends
on the values of wrmin , wrmax , and the type of the stuck-at-fault.

Table 4.1: Matrix value range transformation compared with the baseline (no transformation) in terms of cost for a weight wk.

wk within [wrmin, wrmax]   Memristor stuck-at-fault   Cost compared with baseline
(inside/outside)           (yes/no)                   (smaller/equal/larger)
inside                     no                         equal
inside                     yes                        smaller
outside                    no                         larger
outside                    yes                        smaller/equal/larger

Based on Table 4.1, it can be observed that the cost can potentially be reduced with the proposed transformation. In particular, the cost is reduced if there exists a reduced value range [wrmin, wrmax] such that few values lie outside of it and it is substantially smaller than the initial value range [wmin, wmax]. We propose a method to specify the reduced value range.
An alternative to the proposed value range transformation is to train the neural network to initially have weights of smaller magnitude. Nevertheless, the transformation allows neural networks not specifically trained for MCA based platforms to be executed more robustly using MCAs.

Flow of MT Framework

The flow of the proposed MT framework is shown in Figure 4.3. The input to the framework
is a CNN (or a feed-forward neural network) with L layers and weight matrices W , an MCA
for each weight matrix, and a parameter R denoting the number of parallel rows in each MCA
that is used to represent each row in W . R is called the redundancy factor. The objective of the
MT framework is to specify the transformations T that result in the smallest cost, as defined in
the problem formulation. The output of the framework is a row flipping transformation $T_f$, a permutation transformation $T_p$, and a value range transformation $T_v$ for each weight matrix in the neural network. The $T_p$ transformation can be decomposed into a row and a column transformation, $T_{pr}$ and $T_{pc}$, respectively. The MT framework can also be used to perform hardware aware training using an optional weight update step. The weight update is mainly included in the flow to enable more detailed comparisons with previous studies.
Figure 4.3: Proposed flow of the MT framework. From the input, the flow identifies Gd for each MCA; applies the matrix value range transformation (find Tv = [wrmin, wrmax]); runs the multi-pass flow, which flips rows and permutes neurons/feature maps layer by layer (find Tf and Tpr for layer l and Tpc for layer l+1) while layers remain, and repeats while passes remain and the cost is reduced; and applies the optional weight update (update W) before producing the output.

The first step of the proposed framework is to determine Gd for each MCA using the techniques in [13, 34, 66]. Second, the value range transformation is applied to find a reduced matrix value range [wrmin, wrmax] for each weight matrix W. Third, neuron permutation is applied to find the permutation transformations Tp and the row flipping transformations Tf. The row flipping and the neuron and feature map permutation can only be applied to a single layer at a time. Consequently, the optimization problem in Eq (4.1) is decoupled into a separate optimization problem for each layer in the network. In particular, the MT framework iterates over the layers from 1 to L while applying the transformations.

The permutation of neurons in the input and output layer can be optionally deactivated to avoid
reorganization of the input and output data. Lastly, the optional weight update retraining can be
applied to update the weights W using the techniques in [51, 14, 73], which requires the permutation and row flipping transformations to be reapplied with respect to the new weights. Please refer
to these works for the technical details of the hardware aware training techniques.

Matrix value range transformation

This step consists of specifying the reduced value range [wrmin, wrmax]. This is performed by setting wrmax = x and wrmin = −x and sweeping x from the maximum of wmax and |wmin| towards 0. The classification accuracy of the neural network is evaluated for different values of x when there are no defective memristors. Let ε be the accuracy loss that the value range transformation is allowed to introduce in total. For each weight matrix, x is selected to be the smallest value that results in a classification accuracy loss of less than $\frac{\varepsilon}{L-1}$%, where L is the number of layers in the neural network. The parameter ε is determined based on experiments. Note that the value range transformation Tv (i.e., x) cannot be determined based on cost alone, as this may result in an unacceptable loss in classification accuracy.
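The sweep can be sketched as follows. This is a hedged illustration, not the dissertation's procedure in detail: `accuracy_at` is a hypothetical callback returning the fault-free accuracy (in %) when one weight matrix is clipped to [−x, x], and the uniform grid with `num_steps` candidates is an assumption, since the step size of the sweep is not stated.

```python
def per_matrix_budget(epsilon, num_layers):
    """Accuracy loss (in %) allowed per weight matrix: epsilon / (L - 1)."""
    return epsilon / (num_layers - 1)


def select_clip_value(x_start, num_steps, baseline_accuracy, accuracy_at, max_loss):
    """Sweep x from x_start towards 0; return the smallest x with loss below max_loss.

    accuracy_at(x) is a hypothetical callback (see the lead-in); x_start is the
    maximum of w_max and |w_min| for the weight matrix being processed.
    """
    best = x_start
    for i in range(num_steps):
        x = x_start * (num_steps - i) / num_steps  # from x_start down towards 0
        if baseline_accuracy - accuracy_at(x) < max_loss:
            best = x  # candidates shrink, so the last hit is the smallest
    return best
```

Every candidate is checked rather than stopping at the first violation, because the accuracy is not guaranteed to decrease monotonically as x shrinks.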

Multi-pass flow

This section provides the motivation and details of the multi-pass flow. Given the ordering of the
neurons (or feature maps) in layer l − 1 and l + 1, the optimal Tpr and T f for layer l and optimal
Tpc for layer l + 1 are determined by formulating an assignment problem that is solved optimally
using the Hungarian algorithm. However, as the problem in Eq (4.1) is decoupled for each layer,
there is no guarantee that a globally optimal solution is obtained after the neurons in each of the
layers have been permuted one time, i.e., after one-pass has been completed, which is shown in
Figure 4.3. Therefore, we propose to iterate over all the layers multiple times using a multi-pass
flow in order to converge to a local minimum. The iterative flow halts when the total cost is not
reduced after a complete pass or after m passes have been performed. The user defined parameter
m is specified based on a trade-off between run-time and cost.
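The halting rule of the multi-pass flow can be sketched as a short loop. The callbacks `optimize_layer` and `total_cost` are hypothetical placeholders for the per-layer assignment step and the cost evaluation; the sketch only illustrates the control flow.

```python
def multi_pass(num_layers, optimize_layer, total_cost, max_passes):
    """Repeat layer-by-layer optimization until the cost stops decreasing.

    optimize_layer(l) and total_cost() are hypothetical callbacks. Returns the
    number of passes performed and the best cost observed.
    """
    best = total_cost()
    passes = 0
    for _ in range(max_passes):
        for layer in range(1, num_layers + 1):
            optimize_layer(layer)  # solve the assignment problem for this layer
        passes += 1
        cost = total_cost()
        if cost >= best:  # no reduction in this pass: stop early
            break
        best = cost
    return passes, best
```

The parameter `max_passes` plays the role of m, bounding the run-time when the cost keeps improving by small amounts.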

Weight update

The weight update is based on updating the weights in W of the neural network that are mapped to the MCA based platform. The weight update is performed by setting each weight matrix W in the neural network to be equal to $W^{\mathrm{eff}}$. Next, the neural network is trained using back propagation and steepest gradient descent for n epochs with a deep learning framework such as TensorFlow or PyTorch. In this paper, the hardware aware training can be in the form of ‘retraining’ or ‘training’. ‘Retraining’ refers to applying the weight update step to improve the weights of an already trained neural network. ‘Training’ refers to integrating the MT framework in the initial training process of a neural network.

Experimental evaluation

The proposed framework is implemented using C++, and the experiments are performed on a 3.4 GHz machine with 32GB of memory. Keras and TensorFlow are used to train the neural networks. The training is performed on an NVIDIA 1080Ti GPU for image recognition.
The effectiveness of the proposed techniques is evaluated using DNNs trained on the MNIST handwritten digit classification dataset [42] and the CIFAR-10 image classification dataset [36]. Both datasets consist of 60,000 images that are partitioned into 10 classes. 50,000 and 10,000 of the images are randomly selected for training and evaluation, respectively. The images in the CIFAR dataset have three color channels and the images in the MNIST dataset are grey scale. A four-layer and a six-layer feed-forward neural network (FF NN) are trained on the MNIST dataset using 12 epochs. A seven-layer CNN is trained on the CIFAR-10 dataset using 50 epochs. An epoch denotes one pass over all the training examples in the training process of a neural network. The CNN uses 3x3 kernels in convolutional layers and there are max pooling layers after Conv2 and Conv4. The detailed structure is available online [4]. A summary of the structure of the neural networks and the classification accuracy achieved in software is provided in Table 4.2 and Table 4.3.

Table 4.2: Properties of FF NNs trained on MNIST.

Neural network   Weight matrix dimensions    Software accuracy
Four-layer       784x500x300x10              98.35%
Six-layer        784x500x400x300x200x10      98.43%

Table 4.3: Properties of CNN trained on CIFAR-10 with 75.03% software accuracy. “Conv” denotes convolutional layer, and “FC” denotes fully-connected layer.

Layer   Weight matrix dimensions   ul
Conv1   27x32                      1024
Conv2   288x32                     900
Conv3   288x64                     225
Conv4   576x64                     169
FC1     2304x512                   1
FC2     512x10                     1

A neural network with L layers is mapped to a memristor based platform with L − 1 grids of MCAs. The MCAs have a dimension of 128×128. Each weight matrix is distributed onto the smallest grid that can capture the entire weight matrix. The defect matrix Gd for each MCA is generated while assuming that p percent of the memristors are defective and that the stuck-at-faults are randomly distributed in each MCA. p is set to 10% and 20% on MNIST and 5% and 10% on CIFAR. (Based on the fabrication results reported in [13], we believe that it is sufficient to consider defect rates up to 10%. However, we use 20% on the MNIST data set to demonstrate that the MT framework is capable of handling such high defect rates with small overhead.) 81.6% of the defects are stuck-on and 18.4% of the defects are stuck-off [52]. The sensitivity to the stuck-on/stuck-off ratio and the distribution of the defects is evaluated in Section 4.
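The mapping and the defect model can be sketched as follows. Two assumptions are made loudly here: the grid computation ignores the redundancy factor R (it treats R = 1), and each memristor is drawn as defective independently with probability p, which matches "p percent" only in expectation. Function names and the `None`/`"on"`/`"off"` encoding are hypothetical.

```python
import math
import random


def grid_shape(rows, cols, mca_size=128):
    """Smallest grid of mca_size x mca_size crossbars that holds a rows x cols matrix."""
    return math.ceil(rows / mca_size), math.ceil(cols / mca_size)


def random_defect_map(size, defect_rate, stuck_on_share=0.816, seed=None):
    """Return a size x size map with entries None, "on" or "off", drawn independently."""
    rng = random.Random(seed)
    defect_map = []
    for _ in range(size):
        row = []
        for _ in range(size):
            if rng.random() < defect_rate:
                # 81.6% of the defects are stuck-on, the remainder stuck-off.
                row.append("on" if rng.random() < stuck_on_share else "off")
            else:
                row.append(None)
        defect_map.append(row)
    return defect_map
```

For instance, the 2304x512 matrix of FC1 in Table 4.3 requires an 18x4 grid of 128×128 crossbars under these assumptions.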
Recall that the classification accuracy achieved in hardware when mapping a neural network with weight matrices W to an MCA based platform is obtained by first computing the effective weight matrices $W^{\mathrm{eff}}$. Next, the classification accuracy in hardware is obtained by evaluating the neural network using the effective weight matrices. When the proposed transformations are applied, the weight matrices will be transformed into $\widetilde{W}$ and different effective weight matrices will be realized, so the classification accuracy may be improved (or degraded).
We evaluate the MT framework using cost and normalized accuracy. The cost is computed using Eq (4.2). The normalized accuracy is equal to the classification accuracy achieved in hardware divided by the classification accuracy achieved in software. The selection of ε in the value range transformation is evaluated in Section 4. The proposed cost metric is evaluated in Section 4. The multi-pass flow is evaluated in Section 4. The proposed transformations are evaluated in terms of classification accuracy in Section 7. The overhead is evaluated in Section 4. Comparisons with techniques based on hardware aware training are provided in Section 7. A sensitivity analysis of the MT framework to the ratio and distribution of the stuck-at-fault defects is performed in Section 4.

Evaluation of ε in value range reduction transformation

In Figure 4.4, we evaluate the selection of the value for the parameter ε, which is used in the proposed value range transformation. The figure shows the average value range reduction, the classification accuracy in software, and the classification accuracy in hardware (with all optimization techniques in the MT framework applied) as a function of ε. The classification accuracy in hardware is obtained as the average with respect to ten different sets of defect matrices. Note that the classification accuracy in this section is evaluated using the raw accuracy to more clearly explain the value range reduction technique. As expected, the average value range length is reduced when ε is increased. There is one outlier in the graph (ε = 0.11), which stems from the fact that the value range reduction is a heuristic method that is applied layer-by-layer. While examining the results in detail, we observe that a large value range reduction is obtained in an initial layer, which results in smaller reductions in the subsequent layers. Moreover, the software classification accuracy is reduced when ε is increased. When ε is swept from 0% to 0.2%, it can be observed that the classification accuracy in hardware improves up to a turning point (ε = 0.03), after which it starts to decrease. The explanation is that a small non-zero ε reduces the value ranges of the weight matrices significantly, which mitigates the negative impact of stuck-at-fault defects. However, when ε is increased further, the hardware accuracy is degraded because the software classification accuracy is reduced, which is an upper bound on the classification accuracy that can be obtained in hardware after the subsequent optimization techniques (transformations) are applied. Based on the results in the figure, ε is set to 0.03% in the MT framework.

Evaluation of cost metric selection

In this section, we evaluate the selection of the cost metric used in the MT framework. We choose to evaluate the binary error metric in [73], the absolute linear metric in [51], a square error metric, and our proposed weighted square cost metric. In Figure 4.5, the normalized classification accuracy of the three neural networks is shown when all transformations in the MT framework are applied with respect to the different cost metrics. The figure shows that the binary cost metric performs worse than the other cost metrics. Moreover, the absolute linear, square, and weighted square metrics perform similarly for the FF NNs trained on the MNIST dataset. However, the proposed weighted square cost metric improves the classification accuracy for the seven-layer CNN trained on CIFAR-10. The main reason is that the ul factor in Eq (4.2) accounts for the fact that certain weight matrices are used multiple times in the classification of a single input image. This is particularly important for the layer of neurons that connects the convolutional layers and the fully-connected layers together.

Figure 4.4: (a) Average value range reduction vs. ε. (b) Classification accuracy in software and in hardware vs. ε on CIFAR-10.

Figure 4.5: Comparison between cost metrics used to guide the MT framework. Panels: four-layer FF NN (MNIST), six-layer FF NN (MNIST), and seven-layer CNN (CIFAR-10).

Evaluation of multi-pass vs. single-pass flow

In this section, we evaluate the advantage of utilizing the proposed multi-pass flow compared with
the single-pass flows in [51, 14, 73]. In Figure 4.6, both the cost and normalized classification
accuracy of the CNN on CIFAR-10 are shown as a function of the number of passes of the flow that
are applied using two different sets of defect matrices. The figure shows that both the cost and
accuracy are greatly improved in the first pass of the flow. In the subsequent passes, the cost is
only slightly reduced and only minor accuracy improvements are obtained. The figure shows that
no cost improvements are achieved after a few passes, i.e., the multi-pass flow has converged to
a local minimum. Recall that the explanation to why the multi-pass flow can improve the cost is
that the neuron permutation is only applied to a single layer at a time. In the first pass, the neurons
in layer l are arranged optimally with respect to the arrangement of the neurons in layer l − 1 and
l + 1. However, after the neurons in layer l + 1 have been permuted (with respect to the order of
the neurons in layer l and l + 2), the neurons in layer l can potentially be permuted again to reduce
the cost further. Nevertheless, it is easy to understand that the flow quickly converges to a local
minimum in terms of cost because only neurons in a single layer are permuted at the same time. In
the MT framework, we choose to use a maximum of three passes (m = 3) to balance run-time and
performance based on the results in the figure. Of course, the flow also terminates early if no cost
improvements are obtained in a pass.

Evaluation of transformations on classification accuracy

In this section, we evaluate the transformations in the MT framework in terms of cost and classification accuracy on MNIST and CIFAR-10.
Figure 4.6: Norm. cost and Norm. accuracy vs. the number of passes in the multi-pass flow. The results in (a) and (b) are obtained for two different sets of defect matrices on CIFAR-10.

In Table 4.4, the proposed transformations are evaluated in terms of cost, normalized classification accuracy, and run-time on the feed-forward networks trained on the MNIST dataset. The normalized classification accuracy is obtained by performing Monte Carlo simulations using different defect matrices Gd and recording the mean (µ) and the standard deviation (σ). In this paper, we aim to achieve a normalized classification accuracy of 99% or higher; such entries are marked bold in the table. The run-time is reported in minutes. The Tv, Tp, and Tf transformations are respectively denoted ‘V’, ‘P’, and ‘F’, and ‘R’ is the redundancy factor.
It can be observed that the stuck-at-fault defects greatly degrade the normalized classification accuracy when no optimization is applied (entries labeled ‘-’). The results show that the six-layer neural network is more robust to variations than the four-layer network, which may be because the network is larger and has higher inherent robustness to defects. As expected, the normalized classification accuracy is higher when there are fewer defective memristors in the MCAs (lower p). Moreover, the normalized accuracy is greatly improved when each row in a weight matrix is realized using two (R=2) parallel rows in each MCA, as proposed in [72]. The improvements stem from the fact that each stuck-at-fault introduces a smaller error in $W^{\mathrm{eff}}$.
Table 4.4: Evaluation of proposed transformations on MNIST. Rows labeled ‘-’ have no run-time entry.

                              Four-layer                                  Six-layer
p (%)  Transformation  R      cost    µ (%)   σ (%)   Run time (min)      cost    µ (%)   σ (%)   Run time (min)
10     -               1      243.4   38.0    5.9                         285.3   46.9    8.2
10     V               1      101.4   77.9    5.7     4.0                 89.9    85.0    4.0     7.5
10     VP              1      34.2    90.9    1.7     5.4                 56.1    92.1    2.4     9.6
10     VPF             1      32.8    73.7    11.8    6.9                 52.4    67.4    10.5    11.7
10     - [72]          2      37.6    99.3    2.3                         44.4    99.7    0.1
10     V               2      32.1    99.1    0.1     4.0                 22.7    99.1    0.1     7.5
10     VP              2      4.6     99.4    0.1     5.6                 10.1    99.4    0.1     10.0
10     VPF             2      3.4     99.7    0.0     9.3                 7.3     99.6    0.1     15.1
20     -               1      489.7   14.0    4.7                         563.7   10.2    0.4
20     V               1      204.4   30.3    5.8     4.0                 174.9   22.2    8.4     7.5
20     VP              1      77.9    37.6    1.9     5.4                 125.4   31.8    6.1     9.8
20     VPF             1      75.2    14.9    4.3     7.0                 117.3   17.0    8.1     11.9
20     - [72]          2      97.5    78.7    8.0                         116.4   97.7    0.6
20     V               2      74.2    81.6    5.3     4.0                 53.9    86.5    1.4     7.5
20     VP              2      16.0    93.5    0.8     5.7                 31.7    93.3    0.5     10.2
20     VPF             2      11.3    99.7    0.0     9.8                 22.0    99.5    0.1     15.8

The table shows that the value range transformation alone improves the normalized accuracy for both the four-layer and six-layer networks when a redundancy factor of R=1 is used. We note that the classification accuracy may be slightly degraded by the value range transformation when a redundancy factor of R=2 is used. The explanation is that the parameter ε was optimized under the assumption that all transformations (‘VPF’) would be applied. Note that ε is set to 0.03% for all networks based on the characterization on the CIFAR-10 network. The performance could be further improved by tailoring the value of the ε parameter to each individual neural network and defect rate.
It can be observed in the table that the permutation transformation is capable of greatly reducing the cost. Consequently, it is not surprising that the normalized accuracy is improved when the permutation transformation is applied. The improvements are a result of small (large) weights being assigned to memristors that are stuck-off (stuck-on). Note that when R=2 and the permutation transformation is used, the MT framework achieves very stable (small σ) and high normalized classification accuracy (large µ). The multi-pass flow is also an important factor that contributes to the stable performance.
When the row flipping transformation is enabled, the performance is slightly better. The transformation improves both the cost and accuracy by providing $2^R$ alternative sets of value ranges for each row. The cost is improved for all p and R configurations. However, we observe that the row flipping may degrade the classification accuracy when R=1. The explanation is that the cost metric and the normalized classification accuracy are only strongly correlated when the cost is small.
It can be observed that for R=1, the MT framework is not capable of meeting the specified 99% accuracy requirement. However, it is very promising that the MT framework is capable of meeting the 99% requirement for both networks and both the 10% and 20% defect rate with R=2. In fact, this is the first study demonstrating that a 99% classification accuracy can be achieved using R=2 without using hardware aware training of the neural networks.
The results indicate that the CNN on CIFAR-10 is more sensitive to stuck-at-fault defects than the FF NNs on the MNIST dataset. This is likely because classifying the images in the CIFAR-10 data set is more difficult than classifying the images in the MNIST dataset. Moreover, the native structure of CNNs may be more vulnerable to defects than FF NNs. In Table 4.5, we evaluate the performance of the seven-layer CNN trained on the CIFAR dataset using the various transformations. The same notation as in Table 4.4 is used. The trends on the CIFAR-10 dataset are very similar to the trends observed on the MNIST dataset. The main difference is that the CNN is more sensitive to stuck-at-fault defects, i.e., the classification accuracy in hardware when no optimization is applied (labeled ‘-’ in the table) is very small, close to a random guess.
Using R=1, the MT framework is not capable of improving the normalized classification accuracy above 22.9% even when all transformations are applied. Using R=2, the normalized classification accuracy is improved to 93.0% and 64.2% when p is equal to 5% and 10%, respectively. An explanation for why the performance of the CNN on CIFAR-10 is worse compared with the feed-forward networks on MNIST is that the permutation transformation is only applied to the feature maps and not individual neurons, which makes it more difficult to align the small and large weights with the memristors that are stuck-off and stuck-on, respectively. However, using R=4 the normalized classification accuracy is improved to 99.8% for the 5% defect rate p. For R=4 and p=10%, a normalized accuracy of 98.4% is achieved. To meet the 99% requirement, we use R=8 and achieve a normalized classification accuracy of 99.8%.

Evaluation of transformations in terms of overhead

In this section, we analyze the run-time, power and area overhead of the proposed transformations.
The run-time for applying the value range transformation for the four-layer and six-layer network is
4 and 7.5 minutes, respectively. Note that the value range reduction is a one-time effort per neural
network that can be performed after training without any knowledge of the specific locations of
the stuck-at-faults in the MCAs. From the table, it can be observed that the run-time for the
permutation transformation is almost identical for any redundancy factor R when row flipping
transformation is not used. However, the run-time increases exponentially with $O(2^R)$ when the
row flipping is applied because the total run-time of the MT framework is dominated by computing
the cost matrix used in the Hungarian algorithm.
Table 4.5: Evaluation of proposed transformations on CIFAR-10. Rows labeled ‘-’ have no run-time entry.

p (%)  Transformation  R   cost    µ (%)   σ (%)   Run time (min)
5      -               1   11.96   14.7    1.9
5      V               1   6.10    14.7    2.1     8.9
5      VP              1   3.12    19.9    4.1     11.7
5      VPF             1   2.74    22.9    3.3     14.8
5      - [72]          2   1.31    42.8    6.5
5      V               2   1.08    42.9    12.7    8.9
5      VP              2   0.31    82.2    6.7     11.7
5      VPF             2   0.19    93.0    3.1     18.9
5      - [72]          4   0.15    90.6    3.3
5      V               4   0.21    82.5    5.5     8.9
5      VP              4   0.03    98.6    0.9     12.6
5      VPF             4   0.01    99.8    0.3     49.7
10     -               1   23.97   13.7    1.0
10     V               1   12.07   13.6    0.5     8.9
10     VP              1   7.34    14.7    1.3     11.8
10     VPF             1   6.51    14.4    1.1     15.0
10     - [72]          2   3.39    23.3    3.1
10     V               2   2.49    20.8    4.1     8.9
10     VP              2   1.07    29.2    9.9     12.2
10     VPF             2   0.64    64.2    7.5     19.4
10     - [72]          4   0.49    56.4    11.4
10     V               4   0.63    44.9    10.9    8.9
10     VP              4   0.17    91.6    3.7     13.0
10     VPF             4   0.06    98.4    0.8     52.5

Next, we focus on evaluating the overhead in terms of power and area. The permutation transformation introduces no power or area overhead when it is used for internal layers of a neural network. In our implementation, we only allow permutation of the color channels in the input layer, which can be expected to be performed for free using system level optimization. Neurons in the output layer are allowed to be permuted, as a ten-input MUX or look-up table (LUT) can be used to decode the neuron-to-class mapping with negligible overhead. Consequently, the permutation of the feature maps and the neurons in the neural networks is seamless to a user of the MCA based accelerator. The overhead of the row flipping transformation is one bit for each row (indicating if the row is flipped or not) and the digital hardware required to compute sum(x). In [28], it was shown that the sum(x) term can be computed using digital hardware with limited overhead.

The power and area overhead with respect to R is shown in Figure 4.7. The figure shows that the power and area overheads are slightly larger than the redundancy factor R. This is because adders are required to sum the outputs from the R rows that represent a single row in a weight matrix. For example, the power and area overheads for the seven-layer CNN with R={2, 4, 8} are {2.03×, 4.08×, 8.19×} and {2.20×, 4.53×, 9.23×}, respectively.

Figure 4.7: Normalized overhead evaluation of (a) power and (b) area.

Comparison with hardware aware training

In Table 4.6, we provide a comparison with different hardware aware training schemes with defect rates of 20% and 10% on MNIST and CIFAR-10, respectively. The table shows the type of hardware aware training, the normalized classification accuracy, and the run-time. In the column labeled ‘training scheme’, ‘-’ indicates that no hardware aware training was performed. ‘retraining’ denotes that the weight update in Figure 4.3 was applied one time. ‘training’ indicates that hardware aware training was performed throughout the entire training process using the weight update in Figure 4.3. In our implementation, the weight update is performed until no further accuracy improvements are obtained. The input to the ‘-’ and ‘retraining’ schemes is a neural network trained in digital hardware. The run-time of training this initial network is 1.1 min and 53.5 min for MNIST and CIFAR-10, respectively.

Table 4.6: Evaluation of the training schemes on MNIST and CIFAR-10.

                       MNIST                       CIFAR-10
R   Training scheme    Norm. accuracy   Run time   Norm. accuracy   Run time
1   -                  17.0%            4.4        14.4%            6.1
1   retraining         14.4%            9.1        14.9%            15.4
1   training           10.5%            65.1       20.5%            65.1
2   -                  99.5%            8.3        64.2%            10.5
2   retraining         99.4%            16.9       65.0%            24.2
2   training           99.1%            119.7      77.4%            95.9
4   -                  99.9%            33.8       98.4%            43.6
4   retraining         99.8%            67.9       99.3%            90.4
4   training           99.8%            340.5      97.9%            187.2
8   -                  99.9%            272.1      99.8%            348.8
8   retraining         99.9%            544.5      99.9%            700.8
8   training           99.9%            1089.4     99.9%            1056.0

The entries labeled ‘-’ show that the MT framework can achieve a 99% normalized classification accuracy on MNIST and CIFAR-10 without hardware aware training. The accuracy is achieved using a redundancy factor of 2 and 8, respectively. The main advantage of the ‘-’ scheme is that no access to the training data set is required, which may be critical when deploying neural networks on edge devices in the future.
Compared with ‘-’, it can be observed that the ‘retraining’ scheme results in similar or slightly better normalized accuracy at the expense of longer run-time on CIFAR-10. It is particularly promising that the redundancy factor can be reduced from 8 to 4 while still achieving 99% normalized accuracy on CIFAR-10. The use of a smaller redundancy factor directly translates into a 50% and 51% reduction in power and area, respectively. The results on the MNIST data set are inconclusive because the initial classification accuracy is either very low or very high.
Compared with ‘retraining’, the table shows that ‘training’ results in higher normalized accuracy
when the normalized accuracy achieved by ‘-’ or ‘retraining’ is low. This is expected because iterative use of the weight update can in many cases compensate for the stuck-at-fault defects that cannot be handled using the proposed transformations. However, the results show that the ‘training’ technique is not always capable of converging to neural networks with 99% normalized accuracy, i.e., the technique becomes stuck in a local minimum even with the use of the permutation transformation. For example, the ‘-’ and ‘retraining’ schemes outperform the ‘training’ scheme when a redundancy factor of 4 is used for the CNN on CIFAR-10. In terms of run-time, it can be observed that the run-time of the ‘training’ scheme is up to 14.8× and 7.2× longer than that of the ‘-’ and ‘retraining’ schemes, respectively. In summary, the MT framework is orthogonal to hardware aware training, which can be used to reduce the overhead in terms of power and area.
The table also provides an indirect comparison with previous studies that are based on hardware aware training [73, 51, 14, 52]. The accuracy of the hardware aware training schemes in the previous works can be viewed as upper bounded by the accuracy of the ‘training’ scheme shown in Table 4.6. This is because the method in [73] is very similar to the ‘training’ scheme, but it is guided by the binary cost metric and the permutation transformation is applied using a greedy algorithm. The training schemes in [51, 14, 52] only use a subset of the transformations in the MT framework and have only been evaluated on the MNIST data set (or other small data sets). Moreover, many of the techniques for updating and redistributing gradients are not compatible with state-of-the-art deep learning frameworks such as TensorFlow or PyTorch, which are required to train state-of-the-art CNNs.

Evaluation of sensitivity to defect ratios and distributions

The sensitivity of the MT framework to the ratio of stuck-on/stuck-off defects and the distributions of the defects is evaluated in Figure 4.8. In Figure 4.8(a), the classification accuracy is evaluated with respect to both a (1:1) and a (4.43:1) [52] ratio of stuck-on/stuck-off defects. Note that (4.43:1) is the ratio used for the experimental setup in the previous sections. It can be observed that the MT framework achieves similar classification accuracy on both the MNIST and CIFAR-10 datasets, regardless of the stuck-on to stuck-off ratio. A possible explanation is that the row flipping transformation allows on/off defects to be transformed into off/on defects. In Figure 4.8(b), the normalized classification accuracy is evaluated with respect to the distribution of the defects, which are assumed to follow a random or Gaussian distribution [64]. The figure shows that the neural networks trained on the MNIST data set are robust to the distribution of the defects. However, the normalized classification accuracy on the CIFAR-10 data set is slightly degraded when a Gaussian distribution is used. We believe that the slight degradation arises because the convolutional layers are more sensitive to the clusters of defects that are generated using the Gaussian distribution.

Figure 4.8: Comparison of performance with (a) different stuck-on and stuck-off ratios, (b) different stuck-at-fault distributions.

Conclusion

In this work, we have demonstrated that stuck-at-faults can be handled using matrix transformations. In particular, we demonstrate that by optimizing a proposed cost metric using a row flipping transformation, a permutation transformation, and a value range transformation, a normalized classification accuracy of 99% can be achieved. Compared with previous works, the main advantage is that no hardware aware training of the neural networks based on the location of the defects is required. However, when weight update retraining is combined with the proposed MT framework, the overhead can be reduced by up to 51%.

CHAPTER 5: FAST RESILIENT-AWARE DATA LAYOUT
ORGANIZATION

Background and Motivation

In this section, we review the data layout organization in [78], which is an extension of the techniques in [14, 51, 73]. The technique is based on reordering the neurons in each layer, while minimizing a cost metric that measures the difference between the weight matrix $W$ and the realized weight matrix $W^r$. The reordering of the neurons in a layer is performed by first computing the cost of various candidate data to hardware assignments (or data layout organizations). Next, the data layout organization that minimizes the cost is selected by solving an assignment problem. Finally, the run-time limitation of [78] is analyzed.

The Error Cost (EC) Metric

Previous studies introduce a cost metric to measure the difference between the weight matrix $W_l$ and the realized weight matrix $W_l^r$ [14, 51, 73, 72, 78]. In [78], the weighted square error metric was used to compute the assignment cost, which is defined as follows:

\[ EC = c_l \cdot \sum_{w_k \in W_l} (w_k - w_k^r)^2 \tag{5.1} \]

where EC is the error cost, $c_l$ is the ratio of the number of times a weight matrix is used per image to the number of weights in the matrix, and $w_k$ and $w_k^r$ are the weights in the weight matrix $W_l$ and the realized weight matrix $W_l^r$, respectively.
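
As an illustration of Eq (5.1), the following minimal Python sketch computes the error cost for a small weight matrix. The function name `error_cost`, the nested-list matrix representation, and the numeric values are assumptions made for this example and are not taken from [78].

```python
def error_cost(W, W_r, c_l):
    """Weighted squared error between a weight matrix W and its realized version W_r."""
    if len(W) != len(W_r) or any(len(a) != len(b) for a, b in zip(W, W_r)):
        raise ValueError("W and W_r must have the same shape")
    squared_error = sum(
        (w - w_r) ** 2
        for row, row_r in zip(W, W_r)
        for w, w_r in zip(row, row_r)
    )
    return c_l * squared_error


# Hypothetical 2x2 example: the weight -0.25 is realized as 1.0 by a defective device.
W = [[0.5, -0.25], [0.0, 1.0]]
W_r = [[0.5, 1.0], [0.0, 1.0]]
print(error_cost(W, W_r, c_l=0.5))
```

Only the single mismatched weight contributes to the cost, which is the property that the speed-up techniques in this chapter exploit.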

Cost Matrix Computation and Assignment Problem

The reordering of the neurons in a layer can be viewed as the problem of assigning each of the neurons to a hardware location, which is illustrated in Figure 5.1(a). Determining the mapping that minimizes the cost in Eq (5.1) can be formulated as an assignment problem, which requires a cost matrix $C$ to be computed. The entry $c_{ij}$ in $C$ denotes the cost of assigning neuron i to location j. This involves computing the cost of assigning row i in $W_l$ to row j in $RCA_l$ and column i in $W_{l+1}$ to column j in $RCA_{l+1}$. Next, the Hungarian algorithm is used to find the neuron to hardware assignment that minimizes the cost in Eq (5.1) based on the cost matrix $C$.
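
To make the assignment problem concrete, the sketch below finds the minimum-cost neuron to location mapping for a hypothetical 3×3 cost matrix. It uses exhaustive search over all permutations as a stand-in for the Hungarian algorithm, which is only feasible for very small matrices; the function name and the cost values are assumptions for illustration.

```python
from itertools import permutations


def solve_assignment_brute_force(C):
    """Return (assignment, cost) where assignment[i] is the location of neuron i."""
    n = len(C)
    best_cost, best_perm = None, None
    for perm in permutations(range(n)):
        cost = sum(C[i][perm[i]] for i in range(n))
        if best_cost is None or cost < best_cost:
            best_cost, best_perm = cost, perm
    return list(best_perm), best_cost


# Hypothetical cost matrix: C[i][j] is the cost of assigning neuron i to location j.
C = [[4, 1, 3],
     [2, 0, 5],
     [3, 2, 2]]
assignment, total = solve_assignment_brute_force(C)
print(assignment, total)
```

The Hungarian algorithm returns an assignment with the same minimum cost in polynomial time, which is what makes it usable for layers with many neurons.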

Figure 5.1: (a) Formulation and (b) Cost matrix of the assignment problem.

Run-time Limitation of Data Layout Organization in [78]

When the resilient-aware data layout organization in [78] is applied to two 16-layer CNNs, it can
be observed that the run-time is 1.7 hours and 67.9 hours, which is illustrated in Figure 5.2 (a).
The run-time is longer for CNN-16b because it has been optimized for throughput. Clearly, the
run-time is too long to be practical for real-world applications, which motivates the work in this
paper. To identify the run-time bottleneck, we profile the run-time of the data layout organization
of CNN-16b in Figure 5.2 (b). The figure shows that 96.9% of the run-time is consumed by
computing the cost matrix of the assignment problem and 2.8% of the run-time is consumed by
solving the assignment problem. Therefore, the speed-up techniques in the paper are focused on
reducing the run-time of these two steps.

Figure 5.2: Data layout organization run-time break down of a 16-layer network.

Problem Definition

Data layout organization is based on the observation that if two neurons (in layer l of a DNN) are permuted, the network is functionally equivalent in software if the corresponding rows in $W_l$ and columns in $W_{l+1}$ are exchanged. However, the reordering of neurons will change the assignment of the weight matrices to the RCAs, which results in different classification accuracy in hardware. Consequently, the data layout organization problem consists of finding the ordering of the neurons in each layer that maximizes the classification accuracy in hardware.
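
The functional equivalence under neuron permutation can be checked with a small Python sketch. It assumes a two-layer network of the form y = W_{l+1} · relu(W_l · x) without bias terms; the helper names, the ReLU activation, and the weight values are assumptions chosen for illustration.

```python
def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def relu(v):
    return [max(0.0, a) for a in v]


def forward(W_l, W_next, x):
    return matvec(W_next, relu(matvec(W_l, x)))


def permute_neurons(W_l, W_next, order):
    """order[j] is the index of the original neuron placed at position j."""
    W_l_p = [W_l[i] for i in order]                          # reorder rows of W_l
    W_next_p = [[row[i] for i in order] for row in W_next]   # reorder columns of W_{l+1}
    return W_l_p, W_next_p


W_l = [[1.0, -2.0], [0.5, 0.5], [-1.0, 3.0]]
W_next = [[2.0, -1.0, 0.5], [0.0, 1.0, 1.0]]
x = [1.0, 2.0]

y_original = forward(W_l, W_next, x)
W_l_p, W_next_p = permute_neurons(W_l, W_next, order=[2, 0, 1])
y_permuted = forward(W_l_p, W_next_p, x)
print(y_original, y_permuted)
```

Both forward passes give the same output in software, while the permuted matrices place different weights on each hardware row and column.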

Proposed Speed-up Techniques

In this section, we present the details of the three proposed speed-up techniques.

The Sparse Defect Indexing Technique

The sparse defect indexing technique aims to speed up the computation of the cost matrix. The key insight of the technique is that the cost of assigning a weight to a non-defective device is equal to zero. Consequently, it is expected that the run-time can be significantly reduced by only computing the cost of assigning weights to defective devices.
The entry $c_{ij}$ in the cost matrix $C$ is obtained by mapping row i in $W_l$ to row j in $RCA_l$ and column i in $W_{l+1}$ to column j in $RCA_{l+1}$, which is illustrated in Figure 5.3. It can be observed that many of the costs are equal to zero, as the non-defective devices outnumber the defective devices. To only compute the cost of assigning weights to defective devices, we introduce two adjacency matrices to store the location of the defects within each RCA. One stores the defect locations in a row-oriented fashion and the other stores the defect locations in a column-oriented fashion. Consequently, when the cost of assigning row i in $W_l$ to row j in $RCA_l$ is computed, the framework iterates over the elements in the row-oriented data structure to only compute the cost of assigning weights to the defective devices. Similarly, the column-oriented data structure is used to compute the cost of assigning column i in $W_{l+1}$ to column j in $RCA_{l+1}$. With defect indexing, the number of cost computations is proportional to the number of defective devices instead of the total number of devices.
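
A minimal Python sketch of the row-oriented part of this idea is given below. It assumes R = 1, that a stuck-on device realizes the value `W_MAX` and a stuck-off device realizes `W_MIN`, and that the defects are given as a dictionary keyed by (row, column); these names and the data layout are assumptions for illustration and not the data structures of the C++ implementation. The column-oriented index for $W_{l+1}$ would be built analogously.

```python
from collections import defaultdict

W_MIN, W_MAX = -1.0, 1.0  # assumed values realized by stuck-off / stuck-on devices


def build_row_index(defects):
    """Group defects by hardware row: row -> [(column, stuck value), ...]."""
    index = defaultdict(list)
    for (row, col), stuck_value in defects.items():
        index[row].append((col, stuck_value))
    return index


def row_cost_sparse(weight_row, hw_row, row_index):
    """Cost of placing weight_row on hw_row, visiting defective devices only."""
    return sum((weight_row[col] - stuck) ** 2
               for col, stuck in row_index.get(hw_row, []))


def row_cost_dense(weight_row, hw_row, defects):
    """Reference computation that visits every device in the row."""
    cost = 0.0
    for col, w in enumerate(weight_row):
        realized = defects.get((hw_row, col), w)  # healthy devices realize w exactly
        cost += (w - realized) ** 2
    return cost


W = [[0.5, -1.0, 0.0],
     [1.0, 1.0, -0.5]]
defects = {(0, 1): W_MAX, (1, 2): W_MIN}
row_index = build_row_index(defects)
cost_matrix = [[row_cost_sparse(W[i], j, row_index) for j in range(2)]
               for i in range(2)]
print(cost_matrix)
```

The sparse and dense functions return the same costs, but the sparse version performs one squared-error evaluation per defect rather than one per device.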

The Weight Range Characterization Technique

The weight range characterization technique aims to speed up the computation of the cost matrix when more than one resistive device is used to realize each weight. The main idea is to pre-characterize the weight value range that can be realized by a set of parallel devices, which avoids dynamically computing the weight range every time a weight is assigned to the devices.

Figure 5.3: (a) The cost computation of data to hardware and (b) the proposed sparse data structure for cost computation. (x: defective device, *: non-zero error.)

When R parallel resistive devices are used to realize a single weight, the realized weight $w_k^r$ is computed in three steps [72], as follows:

Step 1: For the R parallel devices, count the number of devices stuck-on $d^H$ and the number of devices stuck-off $d^L$.

Step 2: Convert $d^H$ and $d^L$ into a weight value range $[w_{min}, w_{max}]$, as follows:

\[
\begin{aligned}
w_{min} &= (d^H \cdot W_{max} + (R - d^H) \cdot W_{min}) / R \\
w_{max} &= (d^L \cdot W_{min} + (R - d^L) \cdot W_{max}) / R
\end{aligned}
\tag{5.2}
\]

where $w_{min}$ and $w_{max}$ are the minimum and maximum of the weight range. $W_{min}$ and $W_{max}$ are the minimum and maximum values of matrix $W_l$, respectively.

Step 3: Compute the realized weight $w_k^r$ based on the weight value range $[w_{min}, w_{max}]$, as follows:

\[
w_k^r =
\begin{cases}
w_{min}, & w_k < w_{min}, \\
w_k, & w_{min} \le w_k \le w_{max}, \\
w_{max}, & w_k > w_{max}
\end{cases}
\tag{5.3}
\]

Using $w_k^r$, the cost can be evaluated quickly using Eq (5.1).
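
The three steps can be sketched in Python as follows. The encoding of device states as the strings "ok", "on", and "off" and the function name `realized_weight` are assumptions made for this example.

```python
def realized_weight(w, device_states, W_min, W_max):
    """Realized weight for one group of R parallel devices, following Eq (5.2)-(5.3)."""
    R = len(device_states)
    # Step 1: count stuck-on and stuck-off devices.
    d_H = device_states.count("on")
    d_L = device_states.count("off")
    # Step 2: convert the counts into the realizable range [w_min, w_max].
    w_min = (d_H * W_max + (R - d_H) * W_min) / R
    w_max = (d_L * W_min + (R - d_L) * W_max) / R
    # Step 3: clamp the desired weight to the realizable range.
    return min(max(w, w_min), w_max)


# Hypothetical group of R = 4 devices with one stuck-on and one stuck-off device.
states = ["on", "ok", "ok", "off"]
for w in (0.8, -0.9, 0.2):
    print(w, realized_weight(w, states, W_min=-1.0, W_max=1.0))
```

With these states the realizable range is [-0.5, 0.5], so weights outside that interval are clamped to its nearest end and weights inside it are realized exactly.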
When formulating the cost matrix in Section 5, we observe that the algorithm evaluates mapping many different weights to the same parallel resistive devices. For each weight, the three steps are repeated and the time complexity is O(R) based on the first step. The process is illustrated at the top of Figure 5.4. However, we observe that the first two steps are independent of the specific weight. Consequently, there exists an opportunity to pre-characterize the weight range using a one-time expensive initialization phase. Next, each realized weight can be computed quickly based on the weight range characterization, as shown at the bottom of Figure 5.4.
The time complexity analysis of assigning N weights to R parallel resistive devices on the RCA is shown in Table 5.1. Without the weight range characterization, the time complexity of the N assignments is O(NR). When the one-time characterization is utilized, the time complexity is reduced to O(N + R). Clearly, the technique is particularly effective when higher redundancy factors are used to compensate for many defects in the hardware.

Figure 5.4: An example of the weight range characterization technique of using R = 4 devices for a single weight.

Table 5.1: Time complexity analysis of assigning N weights to R parallel resistive devices in an RCA.

Weight range mechanism      Step 1   Step 2   Step 3   Total
Without characterization    O(NR)    O(N)     O(N)     O(NR)
With characterization       O(R)     O(1)     O(N)     O(N + R)
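
The split between a one-time characterization and a constant-time lookup can be sketched as follows. The function names, the string encoding of device states, and the example values are assumptions for illustration; the sketch only mirrors the structure summarized in Table 5.1.

```python
def characterize(device_groups, W_min, W_max):
    """One-time pass: compute the realizable range [w_min, w_max] of every location."""
    ranges = []
    for states in device_groups:
        R = len(states)
        d_H = states.count("on")    # stuck-on devices
        d_L = states.count("off")   # stuck-off devices
        ranges.append(((d_H * W_max + (R - d_H) * W_min) / R,
                       (d_L * W_min + (R - d_L) * W_max) / R))
    return ranges


def realized_from_range(w, weight_range):
    """Constant-time Step 3: clamp w to a pre-characterized range."""
    w_min, w_max = weight_range
    return min(max(w, w_min), w_max)


# Three hypothetical locations, each realized by R = 4 parallel devices.
groups = [["on", "ok", "ok", "off"],
          ["ok", "ok", "ok", "ok"],
          ["off", "off", "ok", "ok"]]
ranges = characterize(groups, W_min=-1.0, W_max=1.0)

# Many candidate weights are evaluated against the same location without rescanning devices.
candidates = [0.8, -0.9, 0.2]
realized_at_0 = [realized_from_range(w, ranges[0]) for w in candidates]
realized_at_2 = [realized_from_range(w, ranges[2]) for w in candidates]
print(ranges, realized_at_0, realized_at_2)
```

The device states are scanned once per location during initialization, so each of the N candidate assignments afterwards costs a single clamp operation.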

The LP Formulation Technique

The assignment problem can be solved using the Hungarian algorithm or an LP formulation. Previous studies have used the Hungarian algorithm. We empirically observe that the use of an LP
formulation results in shorter run-time due to the structure of the problem. The LP is formulated as follows:

\[ \min \sum_{i \in M} \sum_{j \in N} c_{ij} x_{ij}, \tag{5.4} \]

\[ \sum_{j \in N} x_{ij} = 1, \quad \forall i \in M \tag{5.5} \]

\[ \sum_{i \in M} x_{ij} = 1, \quad \forall j \in N \tag{5.6} \]

where $c_{ij}$ is the cost of assigning neuron i to location j, i.e., entry (i, j) in the cost matrix $C$. $x_{ij} \in \{0, 1\}$ is a binary variable that denotes if neuron i is assigned to location j. The objective function in Eq (5.4) minimizes the total cost of the data layout organization. The constraints in Eq (5.5) and (5.6) ensure that each neuron is assigned to one and only one location.
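
The sketch below shows one way the data of Eq (5.4)-(5.6) could be laid out for a generic LP solver, assuming a square cost matrix (|M| = |N| = n) and the flattening x[i·n + j] = x_ij. The function names and the example matrix are assumptions; no particular solver is implied, and the sketch only builds the problem and checks candidate solutions against the constraints.

```python
def build_assignment_lp(C):
    """Flatten Eq (5.4)-(5.6) into (c, A_eq, b_eq) with x[i * n + j] = x_ij."""
    n = len(C)
    c = [C[i][j] for i in range(n) for j in range(n)]
    A_eq, b_eq = [], []
    for i in range(n):  # Eq (5.5): neuron i is assigned to exactly one location
        A_eq.append([1 if k // n == i else 0 for k in range(n * n)])
        b_eq.append(1)
    for j in range(n):  # Eq (5.6): location j receives exactly one neuron
        A_eq.append([1 if k % n == j else 0 for k in range(n * n)])
        b_eq.append(1)
    return c, A_eq, b_eq


def is_feasible(x, A_eq, b_eq):
    return all(sum(a * v for a, v in zip(row, x)) == b
               for row, b in zip(A_eq, b_eq))


def objective(c, x):
    return sum(ci * xi for ci, xi in zip(c, x))


C = [[4, 1, 3],
     [2, 0, 5],
     [3, 2, 2]]
c, A_eq, b_eq = build_assignment_lp(C)

x_perm = [0, 1, 0,  1, 0, 0,  0, 0, 1]   # neuron 0 -> 1, neuron 1 -> 0, neuron 2 -> 2
x_bad = [1, 1, 0,  0, 0, 0,  0, 0, 1]    # neuron 0 placed twice, neuron 1 unplaced
print(is_feasible(x_perm, A_eq, b_eq), objective(c, x_perm))
print(is_feasible(x_bad, A_eq, b_eq))
```

When the binary requirement is relaxed to 0 ≤ x_ij ≤ 1, the vertices of the resulting feasible region are permutation matrices, so an LP solver that returns a vertex solution yields a valid assignment.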

Methodology

The flow of the proposed framework for fast data layout organization is shown in Figure 5.5. The input to the framework is the weight matrices of a neural network with L layers, an RCA for each weight matrix, and a redundancy factor R indicating the number of parallel resistive devices that are used to realize each weight. The output from the framework is the order of the neurons in each layer of the neural network. The neuron order is expected to reduce the cost in Eq (5.1) and therefore improve the classification accuracy in hardware.
First, the defective devices in each RCA are detected using the techniques in [13, 34, 66]. In the initialization phase, the row-oriented and column-oriented data structures are constructed to facilitate the sparse indexing technique. Moreover, the value range for each set of R parallel devices is pre-characterized. Next, the neuron permutation is iteratively applied to each layer of the neural network from the first layer to the last layer. In each layer, the neuron permutation is performed by computing a cost matrix and solving an assignment problem. The cost matrix is quickly computed using the proposed sparse defect indexing and weight range characterization techniques. Next, the assignment problem is swiftly solved using the proposed LP formulation.

Figure 5.5: Flow of the proposed framework.

Experimental Evaluation

The speed-up techniques in the proposed framework are implemented using C++ and the experiments are performed on a 3.4 GHz 8-core machine with 32 GiB of memory. The neural networks on MNIST [42] and CIFAR-10 [36] are trained using Keras [17] and TensorFlow on an NVIDIA Tesla K80 GPU.
We evaluate the performance of the framework using two MLPs trained on MNIST and three
CNNs trained on CIFAR-10. The properties of the evaluated neural networks on both the MNIST
and CIFAR-10 datasets are shown in Table 5.2.

Table 5.2: Properties of the evaluated neural networks.

Network    Dataset     Software accuracy   Layers   Weights     Throughput
MLP-4      MNIST       98.35%              4        545000      1
MLP-6      MNIST       98.43%              6        774000      1
CNN-7      CIFAR-10    75.03%              7        1250144     1024
CNN-16a    CIFAR-10    93.45%              16       14977728    1024
CNN-16b    CIFAR-10    93.45%              16       54586368    32

The framework is evaluated in terms of the normalized classification accuracy, which is equal to the hardware accuracy divided by the software accuracy. The software classification accuracy is shown in the table. The classification accuracy in hardware is obtained by introducing 10% defects into the RCAs and evaluating the DNN classification accuracy using the realized weight matrices.
The baseline method with no optimization is labeled ‘-’. The technique of utilizing redundant
hardware in [72] is denoted ‘H’. The use of both redundant hardware and data layout organization
in [78] is denoted ‘HD’. The ‘HD’ method extended with the sparse defect indexing technique
is called ‘HD-I’. ‘HD-IC’ is the ‘HD-I’ method integrated with the weight range characterization
technique. ‘HD-ICL’ is the ‘HD-IC’ technique extended with the proposed LP formulation.

Evaluation of sparse defect indexing

The sparse defect indexing technique is evaluated in Figure 5.6. The figure shows the run-time of
the framework with and without the sparse defect indexing technique with different defect rates on
both the MNIST and CIFAR-10 datasets.
The figure shows that the run-time without sparse defect indexing is proportional to the number of devices in the DNN. The sparse defect indexing technique significantly reduces the run-time, as the run-time is proportional to the number of defective devices. With the defect rate increasing from 1% to 10%, the normalized indexing run-time increases from 5.1% to 20.6% on MNIST and from 15.5% to 72.4% on CIFAR-10. The run-time is reduced by only computing the cost at the recorded defect locations instead of at each location of the RCAs, as the cost introduced by non-defective devices is equal to zero.

Figure 5.6: Evaluation of sparse defect indexing with respect to defect rate on (a) MNIST (MLP-6) and (b) CIFAR-10 (CNN-16a) dataset.

Evaluation of the weight range characterization

The weight range characterization technique is evaluated in Figure 5.7. The figure shows that the run-time is proportional to R when the weight range characterization technique is not used. In contrast, the run-time is almost constant when the weight range characterization is applied. The speed-up is obtained because the time complexity of computing $w_k^r$ is reduced from O(R) to O(1).

Figure 5.7: Effectiveness of the weight range characterization technique with respect to different
redundancy factor R on (a) MLP-6 and (b) CNN-16a.

Evaluation of the LP formulation

The linear programming formulation technique is evaluated and compared with the Hungarian algorithm, as shown in Figure 5.8. It can be observed that the run-time of the Hungarian algorithm increases greatly as the cost matrix dimension increases.

Figure 5.8: Evaluation of the solver run-time vs. problem dimension.

Comparison with related studies

In Table 5.3, we present the comparison of the proposed framework with previous studies in terms of normalized accuracy, the cost in Eq (5.1), and run-time. The comparison is performed on both the MNIST and CIFAR-10 datasets using networks with a 10% defect rate. The redundancy factor is set such that the ‘HD’ method achieves 99% normalized accuracy. Specifically, a redundancy factor R equal to 4, 4, 8, 4, and 8 is used for MLP-4, MLP-6, CNN-7, CNN-16a, and CNN-16b, respectively.
When no optimization is applied, the normalized accuracy is unacceptably low. The technique of utilizing hardware redundancy (the ‘H’ method in [72]) greatly improves the accuracy. Compared with the ‘H’ method, the ‘HD’ method improves the average normalized classification accuracy by 1.8% without any hardware overhead. The normalized accuracy is improved because the cost in Eq (5.1) is improved by 52.6%. However, the run-time is too long to be practical for the larger neural networks. In particular, it can be observed that the run-time for CNN-16a and CNN-16b is 1.7 hours and 67.9 hours, respectively.
Compared with the ‘HD’ method, the proposed ‘HD-I’, ‘HD-IC’, and ‘HD-ICL’ methods result in the exact same cost and normalized classification accuracy. This is expected because the speed-up techniques only avoid redundant computation when computing the cost metric or solve the assignment problem faster, i.e., the exact same data to hardware assignment (or data layout organization) is obtained. Compared with the ‘HD’ method, the ‘HD-I’, ‘HD-IC’, and ‘HD-ICL’ methods reduce the average run-time by 40%, 88%, and 89%, respectively. The improvements in run-time are not surprising, as the ‘HD-I’ method avoids a significant amount of redundant “zero” cost computation and the weight characterization reduces the time complexity of utilizing redundant hardware. The LP formulation slightly reduces the run-time for the large neural networks. In summary, the proposed framework reduces the average run-time of the state-of-the-art data layout organization without degrading the performance in resistive hardware.

Conclusion

In this work, we have proposed a framework to speed up the data layout organization for deploying DNNs on RCSs. The results show that the proposed framework is capable of achieving software level classification accuracy while reducing the run-time by up to 89%.

Table 5.3: Comparison of the proposed framework with related studies.

Network (dataset)     Work   Method   Cost in Eq (5.1)   Norm. accuracy   Run-time (sec)
MLP-4 (MNIST)                -        39.0               22.0             3
                      [72]   H        6.8                97.8             4
                      [78]   HD       0.1                99.9             176
                      Ours   HD-I     0.1                99.9             69
                      Ours   HD-IC    0.1                99.9             18
                      Ours   HD-ICL   0.1                99.9             18
MLP-6 (MNIST)                -        48.4               22.9             4
                      [72]   H        7.3                98.4             5
                      [78]   HD       0.1                100.0            259
                      Ours   HD-I     0.1                100.0            100
                      Ours   HD-IC    0.1                100.0            27
                      Ours   HD-ICL   0.1                100.0            27
CNN-7 (CIFAR-10)             -        10050.3            13.3             2
                      [72]   H        97.8               97.2             7
                      [78]   HD       7.9                99.9             686
                      Ours   HD-I     7.9                99.9             668
                      Ours   HD-IC    7.9                99.9             71
                      Ours   HD-ICL   7.9                99.9             71
CNN-16a (CIFAR-10)           -        2931020.0          10.7             25
                      [72]   H        5821.1             96.9             64
                      [78]   HD       1911.8             99.3             5969
                      Ours   HD-I     1911.8             99.3             5051
                      Ours   HD-IC    1911.8             99.3             943
                      Ours   HD-ICL   1911.8             99.3             939
CNN-16b (CIFAR-10)           -        90413.3            7.5              166
                      [72]   H        24.3               99.5             770
                      [78]   HD       0.3                99.9             244572
                      Ours   HD-I     0.3                99.9             95725
                      Ours   HD-IC    0.3                99.9             28905
                      Ours   HD-ICL   0.3                99.9             25248
Norm.                        -        61011.4            0.15 (-84.5%)    0.01
                      [72]   H        47.4               0.98 (-1.8%)     0.01
                      [78]   HD       1.00               1.00             1.00
                      Ours   HD-I     1.00               1.00             0.60
                      Ours   HD-IC    1.00               1.00             0.12
                      Ours   HD-ICL   1.00               1.00             0.11

CHAPTER 6: STAT: MEAN AND VARIANCE CHARACTERIZATION
FOR ROBUST INFERENCE OF DNNS ON MEMRISTOR-BASED
PLATFORMS

Design

In this section, we explain the characterization technique used in the STAT framework. Next, the mean guided bias weight modification technique and the variance guided neuron permutation technique are explained.

Mean and variance characterization

The characterization in the STAT framework consists of computing the mean and variance of the output from each neuron. The mean and variance statistics are straightforward to obtain by propagating the training dataset through a neural network and recording the output from each neuron for each input vector. Next, the mean and the variance of the output from each neuron are computed. The mean and variance of the inputs to layer l are denoted $\mu_l$ and $\sigma_l^2$, respectively. The computed statistics are used in the two subsequent optimization techniques.
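
A minimal sketch of the characterization step is shown below. It assumes that the neuron outputs have already been recorded as one list per training sample and that the population variance is used; the source does not state which variance estimator is applied, so this choice and the function name are assumptions.

```python
from statistics import fmean, pvariance


def characterize_neurons(outputs):
    """outputs[s][k] is the recorded output of neuron k for training sample s."""
    per_neuron = list(zip(*outputs))   # transpose to one tuple of samples per neuron
    means = [fmean(values) for values in per_neuron]
    variances = [pvariance(values) for values in per_neuron]
    return means, variances


# Hypothetical recorded outputs of two neurons for three training samples.
recorded = [[0.0, 1.0],
            [2.0, 1.0],
            [4.0, 1.0]]
mu, sigma2 = characterize_neurons(recorded)
print(mu, sigma2)
```

In this toy example the second neuron has zero variance, which is the kind of neuron whose errors the bias modification technique can compensate for well.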

Mean guided bias weight modification

The objective of the mean guided bias weight modification technique is to adjust the bias weights such that there are no input errors to the neurons in a layer with respect to a specific input vector. In this paper, we modify the bias weights such that there are no input errors (to the neurons in each layer) when the input vector (to each layer) is equal to the mean input vector obtained from the statistical characterization. Consequently, the bias weights are modified as follows:

\[ b_l^m = b_l - (W_l^r - W_l) \cdot \mu_l, \tag{6.1} \]

where $b_l$ and $b_l^m$ are the original and modified bias weights, respectively. $\mu_l$ is the mean input vector to layer l from the statistical characterization. $W_l$ and $W_l^r$ are the weight matrix and the realized weight matrix for layer l, respectively.
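
The effect of Eq (6.1) can be verified numerically: with the modified bias, the input to each neuron computed with the realized weights equals the ideal input when the layer input is the mean vector. The sketch below uses hypothetical weights, bias, and mean values, and the helper names are assumptions for illustration.

```python
def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def modified_bias(b, W, W_r, mu):
    """b_m = b - (W_r - W) . mu, computed as b - (W_r . mu - W . mu)."""
    error_at_mean = [r - i for r, i in zip(matvec(W_r, mu), matvec(W, mu))]
    return [bi - e for bi, e in zip(b, error_at_mean)]


W = [[0.5, -1.0], [1.0, 0.25]]
W_r = [[0.5, 1.0], [1.0, 0.25]]   # the weight -1.0 is realized as 1.0 by a defect
b = [0.5, -0.5]
mu = [2.0, 0.5]

b_m = modified_bias(b, W, W_r, mu)
ideal = [z + bi for z, bi in zip(matvec(W, mu), b)]
hardware = [z + bi for z, bi in zip(matvec(W_r, mu), b_m)]
print(b_m, ideal, hardware)
```

For inputs that deviate from the mean, a residual error of (W_r − W) · (x − µ) remains, which is why the variance of each neuron matters in the next technique.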

Variance guided neuron permutation

The objective of variance guided neuron permutation is to prioritize mapping weights connected to
neurons with high variance to non-defective memristors using neuron permutation, as the modified
bias weights technique can effectively compensate for errors from neurons with small variance.
To enable the neuron permutation techniques in earlier studies to facilitate such mappings, a weight significance factor $c_k$ is introduced for each weight $w_k$ to guide the neuron permutation in [73, 79].

Figure 6.1: (a) Variance of input to layer l, e.g., $\sigma_l^2 = [0.1, 0.7, 0.2]^T$. (b) Weight significance factors $C_l$ set based on the variance statistics, where each row is [g(0.1), g(0.7), g(0.2)]. (c) The function g used to define the weight significance factors $c_k$ (weight significance versus variance $\sigma^2$, with threshold α).

Next, neuron permutation is performed with respect to a significance error (SE) metric, which is defined as follows:

\[ SE(W) = \sum_{k \in W} c_k \cdot (w_k^r - w_k)^2, \tag{6.2} \]

where $c_k$ is the weight significance factor for weight $w_k$. We choose to minimize the square error instead of the linear error $|w_k^r - w_k|$ to avoid introducing large errors for certain weights in order to represent other weights more accurately.
To prioritize assigning weights connected to neurons with high variance to non-defective memristors, the weight significance factors $c_k$ are specified as follows:

\[ c_k = g(\sigma_{lk}^2), \tag{6.3} \]

where $\sigma_{lk}^2$ is the recorded variance of the output of neuron k in layer l − 1. $\sigma_l^2$ denotes the vector of input variances to layer l, which is illustrated in Figure 6.1(a). The weight significance factors for $W_l$ with respect to the inputs to layer l are shown in Figure 6.1(b). g is the function shown in Figure 6.1(c). The purpose of the function g is to prioritize all neurons with a variance above a threshold α equally. The parameter α is a user-defined parameter determined based on Monte Carlo simulations.
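
The sketch below illustrates Eq (6.2) and (6.3) on the variance vector from Figure 6.1(a). The text only states that g treats all variances above α equally; the linear ramp below α, the value α = 0.4, and the function names are assumptions made for this example.

```python
def g(variance, alpha):
    """Assumed shape: linear ramp from 0 to 1 on [0, alpha], saturated at 1 above alpha."""
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    return min(max(variance / alpha, 0.0), 1.0)


def significance_error(W, W_r, input_variances, alpha):
    """SE from Eq (6.2), with c_k = g(variance of the input that weight k multiplies)."""
    return sum(
        g(input_variances[k], alpha) * (w_r - w) ** 2
        for row, row_r in zip(W, W_r)
        for k, (w, w_r) in enumerate(zip(row, row_r))
    )


sigma2 = [0.1, 0.7, 0.2]          # input variances to layer l, as in Figure 6.1(a)
alpha = 0.4                       # hypothetical threshold
W = [[1.0, 0.0, -1.0]]
W_r_low = [[-1.0, 0.0, -1.0]]     # error of size 2 on the low-variance input
W_r_high = [[1.0, 2.0, -1.0]]     # error of size 2 on the high-variance input

se_low = significance_error(W, W_r_low, sigma2, alpha)
se_high = significance_error(W, W_r_high, sigma2, alpha)
print(se_low, se_high)
```

The same weight error is penalized four times more when it sits on the high-variance input, which steers the neuron permutation toward placing such weights on non-defective memristors.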

Flow of the STAT framework

The flow of the STAT framework is shown in Figure 6.2. The input to the framework is a neural
network, an MCA for each weight matrix, and the redundancy factor R. The objective is to map
the neural network to the memristor-based platform while minimizing the SE metric, which is
expected to maximize the classification accuracy.

Figure 6.2: Proposed flow of the STAT framework.

The first step is to detect stuck-at-fault defects, which consists of identifying the location and the type (stuck-on or stuck-off) of each defect in the MCAs using the methods in [66, 9]. The second step is mean and variance characterization, which involves propagating the training set through the network to obtain the mean and variance of each neuron. Using the variance statistics for the neurons, the weight significance factors are specified for each weight in the network. Next, the neuron permutation in [73, 79] is performed using the introduced weight significance factors. The combination of the weight significance factors and the neuron permutation forms the proposed variance guided neuron permutation step. Lastly, the mean guided bias weight modification step is applied.

Experimental Evaluation

The proposed framework is implemented in C++. The experiments are performed on a 3.4 GHz
machine with 32GB of memory. The neural networks are trained using Keras and TensorFlow. The
proposed techniques are evaluated using two feed-forward neural networks trained on the MNIST
dataset [42]. The dimensions of the neural networks are 784x500x10 and 784x500x300x10, with
the software classification accuracy of 98.25% and 98.35%, respectively.

Effectiveness of STAT framework

The STAT framework is evaluated in Table 6.1. ‘p’ is the percentage of defects and ‘R’ is the
redundancy factor. We evaluate several different methods in the table to demonstrate the effectiveness of the STAT framework. The baseline with no optimization is denoted ‘-’. The baseline
with R=2 is equivalent to the method used in [72]. The technique of permuting neurons in [79]
is called ‘P [79]’, which is obtained by only using the first and the fourth step in Figure 6.2. The
‘PM’ method is the ‘P’ method extended with the mean guided bias weight modification technique
proposed in the STAT framework. The ‘PMV’ method denotes the ‘PM’ technique extended with
the proposed variance guided neuron permutation. The run-time for all methods and both networks
is less than five minutes.
In Table 6.1, it can be observed that stuck-at-fault defects greatly degrade the classification accuracy of both the three and four layer neural network. The classification accuracy can be greatly
improved when a redundancy factor R=2 is used, as in [72]. However, the improved robustness
comes at the expense of doubling the hardware utilization and power consumption.
Compared with the baseline ‘-’, the ‘P [79]’ method can substantially improve the classification
accuracy with no overhead. Drastic improvements are obtained when a redundancy factor of R=1
is used, as the obtained accuracies for the ‘-’ method are low.
Table 6.1: Evaluation of various methods in terms of normalized classification accuracy.

                                Normalized accuracy (%)
                                Three-layer             Four-layer
p (%)   Technique (P/M/V)   R   [min, max]      mean    [min, max]     mean
10      -                   1   [39.3, 76.0]    62.1    [33.7, 48.3]   39.8
10      P [79]              1   [59.3, 90.5]    78.5    [33.7, 51.9]   43.3
10      PM                  1   [91.5, 97.2]    94.9    [39.0, 69.1]   54.5
10      PMV                 1   [92.5, 97.3]    96.1    [41.7, 71.3]   55.8
10      - [72]              2   [99.0, 99.7]    99.5    [99.2, 99.6]   99.4
10      P [79]              2   [99.3, 99.9]    99.6    [99.4, 99.8]   99.6
10      PM                  2   [99.6, 99.9]    99.7    [99.5, 99.7]   99.6
10      PMV                 2   [99.7, 100.0]   99.8    [99.5, 99.8]   99.6
20      -                   1   [13.3, 37.6]    25.6    [10.5, 19.7]   13.8
20      P [79]              1   [26.4, 63.8]    43.4    [9.2, 37.5]    19.2
20      PM                  1   [68.9, 82.0]    75.3    [10.5, 26.1]   19.4
20      PMV                 1   [74.8, 89.2]    84.0    [9.6, 30.1]    19.6
20      - [72]              2   [94.7, 98.1]    97.0    [61.1, 78.1]   70.0
20      P [79]              2   [96.7, 98.7]    98.1    [86.6, 95.6]   92.2
20      PM                  2   [98.3, 99.3]    98.8    [85.5, 96.0]   93.0
20      PMV                 2   [97.8, 99.3]    98.8    [89.1, 96.1]   93.5

Next, we compare the proposed ‘PM’ method with the ‘P [79]’ method. Notable improvements are obtained when R=1, as the normalized classification accuracies for R=2 are already very high. The accuracy improvements are obtained because the modified bias weights cancel out the expected input errors to each neuron. Consequently, we conclude that the mean guided bias technique is effective and improves the performance without introducing any additional overhead with respect to the state-of-the-art ‘P [79]’ method. Compared with the ‘PM’ method, the proposed ‘PMV’ method further improves the robustness to defects. The improvements are obtained because weights connected to neurons with small variance are mapped to memristors with stuck-at-faults. The parameter α is obtained using a parameter sweep. Based on the results reported, we conclude that the variance guided neuron permutation method further improves the performance without introducing additional overhead.
In conclusion, the STAT framework improves the robustness significantly over the previous
state-of-the-art neuron permutation technique ‘P [79]’ without introducing any overhead. This is a
stepping stone to our objective of executing DNNs on any memristor-based platform without any
performance degradation and with R=1. Nevertheless, the experimental results demonstrate that
there is a need for even better techniques to handle the four-layer (or large) neural network in the
presence of many defects.
We note that high normalized classification accuracies can be obtained using retraining as demonstrated in [52] and [73]. However, these techniques require each neural network to be retrained for
each memristor-based platform, which is not very practical as the training process is time consuming.

Conclusion

Handling stuck-at-fault defects is one of the key challenges when accelerating DNNs using
memristor-based platforms. In this work, we have demonstrated that statistics can be used to guide
optimization techniques aimed at handling stuck-at-fault defects. In particular, the statistics were
used in a proposed mean guided bias weight modification technique and a variance guided neuron
permutation technique, which resulted in substantial improvements compared with the
state-of-the-art framework in [79].

CHAPTER 7: REDUNDANT NEURONS AND SHARED REDUNDANT
SYNAPSES

Design

In this section, the designs of redundant neurons and shared redundant synapses are presented.

Redundant neurons

This technique aims to improve the robustness of neural networks to stuck-at-fault defects by
inserting redundant neurons in the network, which is illustrated in Figure 7.1. First, we explain how
redundant neurons can be inserted in a neural network without affecting the classification accuracy
when there are no stuck-at-fault defects. Next, we explain how to insert redundant neurons and
set the synapse weights to improve the robustness of the network in the presence of stuck-at-fault defects. Subsequently, we analyze the robustness provided by a redundant neuron. Lastly,
we explain how redundant neurons can be inserted without hardware overhead when there is a
mismatch between the dimensions of the weight matrices and the MCAs.

Redundant neurons and no stuck-at-fault defects

Part of a neural network is shown in Figure 7.1(a). The redundant neuron technique is based on the
observation that a redundant neuron t can be inserted to replicate a neuron s without influencing the
performance of the initial neural network. The redundant neuron t is inserted by replicating all the
incoming and outgoing synapses to the neuron s using the same synapse weights, which is
illustrated in Figure 7.1(b). Moreover, the activation functions of the neurons s and t are updated
to $\sigma_u(\cdot)$, where $\sigma_s(\cdot)$ is the original activation function of neuron s and
$\sigma_u(\cdot) = \sigma_s(\cdot)/2$. Consequently, the
two neural networks in Figure 7.1 are logically equivalent when there are no stuck-at-fault defects,
as the sum of the output from the two neurons is equal to the output of the initial neuron s.
The insertion of a neuron t in layer k requires an additional row in weight matrix $W_k$ and an
additional column in weight matrix $W_{k+1}$. The modification of the activation function can be
performed with no (or negligible) overhead using a single shift operation.
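
The equivalence argument can be checked numerically with a small sketch. It is only an illustration under stated assumptions: the layer is modeled as $y = W_k x$ followed by ReLU, the halved activation is modeled as a per-neuron scale factor, and the helper names (`forward`, `insert_redundant_neuron`) and the toy weights are hypothetical rather than part of the proposed framework.

```python
def relu(values):
    return [max(0.0, v) for v in values]


def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def forward(W_k, W_k1, x, scale):
    """Two-layer forward pass; scale[i] multiplies the activation of neuron i in layer k."""
    hidden = [s * h for s, h in zip(scale, relu(matvec(W_k, x)))]
    return matvec(W_k1, hidden)


def insert_redundant_neuron(W_k, W_k1, scale, s):
    """Replicate neuron s as a new neuron t (appended last) and halve both activations."""
    W_k_new = [row[:] for row in W_k] + [W_k[s][:]]    # additional row in W_k
    W_k1_new = [row[:] + [row[s]] for row in W_k1]     # additional column in W_{k+1}
    scale_new = scale[:] + [scale[s] / 2.0]            # sigma_u for neuron t
    scale_new[s] = scale[s] / 2.0                      # sigma_u for neuron s
    return W_k_new, W_k1_new, scale_new


# Toy weights chosen for illustration only.
W_k = [[0.5, -1.0], [1.5, 0.25]]
W_k1 = [[1.0, -2.0], [0.5, 0.75]]
x = [2.0, 1.0]

base = forward(W_k, W_k1, x, [1.0, 1.0])
W_k_dup, W_k1_dup, scale_dup = insert_redundant_neuron(W_k, W_k1, [1.0, 1.0], s=1)
dup = forward(W_k_dup, W_k1_dup, x, scale_dup)
print(base, dup)
```

Halving corresponds to the single shift operation mentioned above when activations are stored in fixed point; the sketch uses floating point for brevity.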

Figure 7.1: (a) Initial neural network. (b) A redundant neuron t replicating the neuron s in the
network.

Redundant neurons and stuck-at-fault defects

In the presence of stuck-at-fault defects, the activation function for the neurons t and s is specified
as described in the normal case (no defects). However, instead of simply copying the synapse
weights connected to neuron s, we propose to improve the robustness by specifying the synapse
weights connected to the neurons t and s to compensate for each other's errors using a two-step
approach.
Let $\bar{w}_s$ be the row in $W_k$ connected to neuron s. First, the memristors in the rows of the MCA
connected to the neurons t and s are programmed to realize $\bar{w}_s$ using the technique in [72]. Let the
difference between the target $\bar{w}_s$ and the weights effectively realized be denoted
$\Delta_s = (\bar{w}_s - \bar{w}^e_s)$ and $\Delta_t = (\bar{w}_s - \bar{w}^e_t)$. Second, the memristors
in the rows connected to neurons t and s are programmed
to respectively realize the target row weights $(w_t + \Delta_s)$ and $(w_s + \Delta_t)$ using the technique in [72].
Next, the exact same two-step method is used to program the memristors in the columns connected
to the neurons t and s in the MCA for $W_{k+1}$. Note that it is not guaranteed that the new target
weights for the rows (or columns) t and s can be realized.
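
The two-step approach can be sketched as follows. This is a simplified illustration and not the programming technique of [72]: it assumes a toy device model in which a stuck cell always realizes a fixed value and a healthy cell realizes its target clipped to an assumed range [W_MIN, W_MAX]. The function names and the example values are hypothetical.

```python
W_MIN, W_MAX = -1.0, 1.0  # assumed realizable weight range


def program_row(target, stuck):
    """Toy device model (assumption): stuck maps a cell index to the value it is stuck at."""
    realized = []
    for j, w in enumerate(target):
        if j in stuck:
            realized.append(stuck[j])
        else:
            realized.append(min(W_MAX, max(W_MIN, w)))
    return realized


def two_step(w_bar, stuck_s, stuck_t):
    # Step 1: program both rows to the nominal row and record the realization errors.
    eff_s = program_row(w_bar, stuck_s)
    eff_t = program_row(w_bar, stuck_t)
    delta_s = [w - e for w, e in zip(w_bar, eff_s)]
    delta_t = [w - e for w, e in zip(w_bar, eff_t)]
    # Step 2: re-program each row so that it absorbs the error of the other row.
    new_t = program_row([w + d for w, d in zip(w_bar, delta_s)], stuck_t)
    new_s = program_row([w + d for w, d in zip(w_bar, delta_t)], stuck_s)
    return new_s, new_t


w_bar = [0.2, -0.4, 0.6]
stuck_s = {0: 0.0}  # cell 0 of row s is stuck at 0.0
stuck_t = {2: 1.0}  # cell 2 of row t is stuck at 1.0
row_s, row_t = two_step(w_bar, stuck_s, stuck_t)

# With halved activations, the pair behaves like the average row in the linear regime.
average = [(a + b) / 2.0 for a, b in zip(row_s, row_t)]
print(row_s, row_t, average)
```

In this example the healthy cell in one row offsets the stuck cell in the other, so the average row matches the target. When the compensated target falls outside the realizable range, the clipping leaves a residual error, which is the case noted at the end of the paragraph above.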

Robustness analysis

In this section, the robustness improvement provided by the insertion of a redundant neuron is
analyzed using Figure 7.2. It is straightforward that the robustness provided by the synapses from
the neurons t and s to the neurons in layer k + 1 is similar to the robustness provided using parallel redundant synapses in [72, 79]. However, the robustness provided by the parallel redundant
synapses from the neurons in layer k − 1 to the neurons t and s in layer k is slightly different due to
the non-linear activation functions $\sigma(\cdot)$ of the neurons. Given an input vector $x_k$, let $y_i$ be the ideal
input (no defects) to the neurons t and s. Moreover, let $y_t$ and $y_s$ denote the input to neuron t and s
in the presence of stuck-at-faults, respectively. If the programming of the memristors in step two
realizes the target weights perfectly, $y_t$ and $y_s$ will satisfy the conditions $y_t = y_i + \epsilon_y$
and $y_s = y_i - \epsilon_y$, where $\epsilon_y$ is the error introduced by the stuck-at-faults, which is illustrated along
the horizontal axis in Figure 7.2. Clearly, the neurons t and s can be replaced with an equivalent
neuron with a value of $\sigma(y_s)/2 + \sigma(y_t)/2$, as shown with the large hollow circle in the figure.
In contrast, if a pair of parallel redundant synapses were used [72, 79], the output of the neuron
would have been $\sigma(y_s/2 + y_t/2) = \sigma(y_i)$, i.e., the ideal output, which is shown with the large solid
circle. Consequently, the redundant neuron provides $\epsilon_x$ worse robustness compared with redundant
synapses, as illustrated in Figure 7.2.
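
A short numeric sketch of this comparison, assuming a ReLU activation as in Figure 7.2 (the function names and sample inputs are illustrative only):

```python
def relu(y):
    return max(0.0, y)


def redundant_neuron_output(y_i, eps_y):
    """Equivalent neuron: sigma(y_s)/2 + sigma(y_t)/2."""
    y_t, y_s = y_i + eps_y, y_i - eps_y
    return relu(y_s) / 2.0 + relu(y_t) / 2.0


def redundant_synapse_output(y_i, eps_y):
    """Parallel redundant synapses: sigma(y_s/2 + y_t/2) = sigma(y_i)."""
    y_t, y_s = y_i + eps_y, y_i - eps_y
    return relu(y_s / 2.0 + y_t / 2.0)


def eps_x(y_i, eps_y):
    """Output gap between the redundant neuron and the redundant synapses."""
    return redundant_neuron_output(y_i, eps_y) - redundant_synapse_output(y_i, eps_y)


print(eps_x(1.0, 0.5))  # both perturbed inputs stay on the linear branch of ReLU
print(eps_x(0.2, 0.5))  # y_s crosses zero, so the two outputs differ
```

For ReLU the gap vanishes whenever both perturbed inputs lie on the same linear branch, and it is non-zero only when $|y_i| < \epsilon_y$, i.e., when the two inputs straddle the kink of the activation function.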

Figure 7.2: Robustness provided by redundant neuron. (Horizontal axis: input to neuron; vertical
axis: output from equivalent neuron, with σ(.) = ReLU.)

Hardware realization

Despite the slightly worse robustness compared with redundant synapses, there are a number of advantages of utilizing redundant neurons: (i) Redundant neurons can be used without area overhead
when there is a mismatch between the dimensions of the weight matrices and the MCAs, which is
illustrated in Figure 7.3. Note that there is a similar opportunity even when a weight matrix is tiled
across a grid of MCAs. When redundant neurons are needed but there is no mismatch between
the matrix dimension and MCA tiles, an extra row/column of MCA tiles can be introduced. Such
redundancy introduces area overhead but with a finer granularity than increasing the redundancy
factor. (ii) The redundant neuron technique can be used to surgically improve the robustness of
neurons connected to rows and columns in MCAs with many stuck-at-faults. Note that the redundant synapse technique is required to be applied uniformly because the redundancy is hardwired
at fabrication. (iii) A redundant neuron can be used to replicate any neuron in the same layer, i.e.,
there is no proximity constraint. This may allow the weights to more effectively mask the defects
in the hardware when neuron permutation is applied.

Figure 7.3: Redundant neurons can be inserted without area overhead when there is a mismatch
between the dimensions of the weight matrices and the MCAs.

Shared redundant synapses

This technique aims to share redundant synapses to reduce hardware overhead, as illustrated in
Figure 7.4. Note that the proposed sharing of redundant synapses generalizes (1:r) synapse redundancy to (q:r) synapse redundancy. However, the description of the sharing of redundant synapses
is focused on (2:3) redundancy.

Sharing of redundant synapses (2:3)

The proposed realization of shared redundant synapses between the neurons in layer k and layer k +
1 is shown in Figure 7.4. The input to two neurons in layer k +1 is generated using three rows in the
MCA and the ‘Control’ circuitry, which explains the (2:3) redundant synapse notation. The control
circuitry consists of two 3:2 multiplexers and two subtractor units. Both the multiplexers and
the subtractors are required to handle the same number of bits as the analog-to-digital converters
(ADCs). The input to each of the neurons in layer k + 1 is obtained by selecting two of the three
outputs from the ADCs and subtracting the negative component from the positive component,
similar to the parallel redundant synapses.

Figure 7.4: Shared redundant synapses. Three rows in the MCA are used to realize two rows in a
matrix W. (DACs = digital-to-analog converters; ADCs = analog-to-digital converters.)

Next, we describe how to program three parallel memristors ($m_1$, $m_2$, $m_3$) to realize two parallel
weights ($w_1$, $w_2$), which is shown to the left in Figure 7.4. Assume that row 1 and row 2 are
respectively used to realize the positive and negative component of $w_1$. Similarly, row 3 and row
2 are respectively used to realize the positive and negative component of $w_2$. Consequently, we
have $w_1^+ - w_1^- = w_1$, $w_2^+ - w_2^- = w_2$ and $w_1^- = w_2^-$. We propose to specify the values of the three
components to minimize the cost for $w_1$ and $w_2$, i.e., $(w_1 - w_1^e)^2 + (w_2 - w_2^e)^2$. Clearly, this is a
standard quadratic programming formulation with linear constraints, which can be solved using
interior point or active set methods [19]. However, since the problem size is small (one to three
variables based on the number of stuck-at-fault defects), the formulation can be solved optimally
in a few steps by decomposing the problem formulation for the different defect scenarios. The
outlined method is used to program all groups of three parallel memristors ($m_1$, $m_2$, $m_3$).
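
The small quadratic program can be illustrated with the sketch below. It does not reproduce the case decomposition described above; instead it uses plain coordinate descent, which is a swapped-in technique that should converge for this convex box-constrained problem. The normalization of the memristor values to [0, g_max], the encoding of defects as a dictionary, and all names are assumptions made for illustration.

```python
def clip(v, lo, hi):
    return min(hi, max(lo, v))


def program_shared(w1, w2, g_max, stuck=None, iters=200):
    """Return (m1, m2, m3): m1 = w1+, m2 = shared negative component, m3 = w2+.

    stuck maps a memristor index (0, 1, 2) to the value it is stuck at (assumed model).
    """
    stuck = stuck or {}
    m = [stuck.get(i, 0.0) for i in range(3)]
    for _ in range(iters):
        if 0 not in stuck:  # best m1 for fixed m2: w1 = m1 - m2
            m[0] = clip(w1 + m[1], 0.0, g_max)
        if 2 not in stuck:  # best m3 for fixed m2: w2 = m3 - m2
            m[2] = clip(w2 + m[1], 0.0, g_max)
        if 1 not in stuck:  # best shared m2 for fixed m1 and m3
            m[1] = clip(((m[0] - w1) + (m[2] - w2)) / 2.0, 0.0, g_max)
    return tuple(m)


def cost(w1, w2, m):
    """(w1 - w1_e)^2 + (w2 - w2_e)^2 with w1_e = m1 - m2 and w2_e = m3 - m2."""
    return (w1 - (m[0] - m[1])) ** 2 + (w2 - (m[2] - m[1])) ** 2


healthy = program_shared(0.3, -0.2, g_max=1.0)
faulty = program_shared(0.3, -0.2, g_max=1.0, stuck={1: 1.0})  # shared device stuck-on
print(healthy, cost(0.3, -0.2, healthy))
print(faulty, cost(0.3, -0.2, faulty))
```

In the defect-free call both weights are realized with negligible cost. With the shared device stuck at its maximum value, $w_2$ is still realized closely, whereas $w_1$ saturates, which is the kind of trade-off the cost function arbitrates.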
Next, the technique is extended to cover all the different row sharing alternatives and row flipping
alternatives. Given two rows in a weight matrix and three rows in an MCA, there are six alternative
ways to share the rows between the two neurons and each row can be flipped using the row flipping
technique in [79], which results in eight different flip combinations for the six sharing alternatives,
i.e., a total of 48 = 6 · 8 options. Each of the 48 candidates is evaluated and the option with the
minimum cost is selected.

Neuron pairing and permutation

In this section, we explain how rows (or columns) in the weight matrix are assigned to rows and
columns in the MCA hardware. We decompose the assignment problem into a row pairing step
and a neuron pair permutation step. In the row pairing step, the rows in each weight matrix are
grouped together into pairs while considering the nominal values of the aligned weights in the
paired rows. Next, neuron pair permutation is performed while accounting for the defects in the
hardware using the efficient algorithms developed in earlier studies [14, 79, 73]. This involves
permuting the paired neurons (or rows) instead of individual neurons.
Since each pair of aligned weights $w_1$ and $w_2$ share one memristor, it is
impossible to realize two weights that are of significantly different magnitude. Even when there
are no defects, a pair of weights can only be realized exactly if the magnitude constraint is satisfied, as
follows:

$$|w_1 - w_2| \le \frac{(w_{max} - w_{min})}{2}, \qquad (7.1)$$

where $w_{max}$ is the maximum weight in the MCA and $w_{min}$ is the minimum weight in the MCA.
$w_1$ and $w_2$ are a pair of aligned weights. Here, it is assumed that $w_{max} = -w_{min}$ or that numeric
representations with bias are used. We formulate a mixed integer programming (MIP) problem
to pair rows while minimizing the violations of the magnitude constraint, as follows:

$$\min \sum_{(i,j) \in E} c_{ij} x_{ij}, \qquad (7.2)$$

$$\sum_{(i,j) \in E:\, k \in (i,j)} x_{ij} = 1, \quad \forall k \in V,$$

where $x_{ij}$ is a binary variable that indicates if rows i and j are paired. V is the set of all rows in a
weight matrix W, and E is the set of non-duplicate row pairs. $c_{ij}$ is the sum of
$\max(|w_1 - w_2| - (w_{max} - w_{min})/2, 0)^2$ for all the aligned weights in rows i and j in W. The constraint ensures that
each row is paired with a unique row.
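
The pairing objective can be made concrete with a small sketch. It is not the MIP formulation itself: for illustration, the sketch exhaustively enumerates perfect matchings, which is only viable for a handful of rows and assumes an even number of rows. The function names and the example matrix are hypothetical.

```python
def pair_cost(row_i, row_j, w_min, w_max):
    """c_ij: sum of squared violations of the magnitude constraint over aligned weights."""
    limit = (w_max - w_min) / 2.0
    return sum(max(abs(a - b) - limit, 0.0) ** 2 for a, b in zip(row_i, row_j))


def best_pairing(rows, w_min, w_max):
    """Exhaustive minimum-cost pairing; each row ends up in exactly one pair."""
    def solve(free):
        if not free:
            return 0.0, []
        i, rest = free[0], free[1:]
        best = (float("inf"), [])
        for j in rest:
            remaining = [r for r in rest if r != j]
            sub_cost, sub_pairs = solve(remaining)
            total = pair_cost(rows[i], rows[j], w_min, w_max) + sub_cost
            if total < best[0]:
                best = (total, [(i, j)] + sub_pairs)
        return best

    return solve(list(range(len(rows))))


# Toy weight matrix: rows 0 and 2 are similar, as are rows 1 and 3.
rows = [[1.0, 1.0], [-1.0, -1.0], [0.9, 1.0], [-0.9, -1.0]]
total_cost, pairs = best_pairing(rows, w_min=-1.0, w_max=1.0)
print(total_cost, pairs)
```

Rows with similar aligned weights are paired because they violate the magnitude constraint the least. For realistic layer sizes, a MIP or matching solver would be required instead of enumeration.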

Methodology

In this section, we describe the flow of the proposed framework, as illustrated in Figure 7.5. The
input to the framework is a neural network and a memristor-based platform. The platform is
specified using the user-defined parameters synapse redundancy and number of redundant neurons.
Note that different parameter values can be used for each layer k. The objective is to map the input
neural network to the memristor-based platform while minimizing cost for each weight matrix in
order to maximize the classification accuracy.
The first part of the flow is a defect identification step, where the memristors with stuck-at-fault defects
are identified using the techniques in [13, 34, 66, 73]. The subsequent steps are applied for layer
1 to layer K in order. The second step is to perform layer-wise row pairing if (2:3) redundant
synapses are used. Next, neuron permutation and row flipping are performed layer-wise. The
implementation is based on the techniques in [73, 79]. Finally, surgical repair using redundant neurons
is performed layer-wise until there are no available redundant neurons in any layer. Note that pairs
of redundant neurons are inserted when (2:3) redundant synapses are used.

Figure 7.5: Proposed flow. (Input: neural network and memristor-based platform with synapse
redundancy and number of redundant neurons. Steps: identify defects; then, for each layer k, row
pairing, neuron permutation and row flipping, and surgical repair using redundant neurons, repeated
while layers remain. Output: mapped network.)

The surgical repair step consists of iteratively inserting redundant neurons to improve robustness.
The method first finds the neuron in layer k that has the largest cost, i.e., the cost of the row in
$W_k$ and column in $W_{k+1}$ connected to the neuron. Let this worst neuron be denoted s. Next, the
cost for all available redundant neurons in the layer is computed with respect to neuron s. Let
the redundant neuron with the minimum cost be denoted t. Subsequently, t is used as a redundant
neuron for s and the synapse weights are set. Lastly, the costs of the neurons s and t are divided by
two and the process is iteratively repeated until there are no neurons remaining in layer k (or until
no redundant neurons can be inserted to reduce cost).

Experimental Evaluation

The proposed framework is implemented using C++. The experiments are performed on a 3.4 GHz
machine with 32 GB of memory. The proposed techniques are evaluated using a four-layer and a
six-layer feed-forward neural network trained on the MNIST data set and a seven-layer
convolutional neural network (CNN) trained on the CIFAR-10 data set. The feed-forward neural networks
have dimensions 784x500x300x10 and 784x500x400x300x200x10, respectively. The CNN has 4
convolutional layers and 2 fully-connected layers with the dimensions of 32x32x64x64 (3x3
kernels) and 2304x512x10, respectively. The nominal accuracies of the three networks are 98.35%,
98.43% and 75.03%, respectively. The MCAs have dimensions of 128x128 with a
one-transistor-one-memristor configuration.
We evaluate the performance of the proposed algorithms when there are p percent stuck-at-fault
defects in the MCAs with p between 0% and 10%. Recent fabrication results have demonstrated
that this is an adequate defect range [25]. The effectiveness of the proposed techniques is measured
using recovered classification accuracy, which is equal to the classification accuracy obtained using
$W^e$ divided by the classification accuracy obtained using $W$. In this work, we aim to achieve
99% recovered accuracy using the MCA hardware. Let P99 denote the maximum defect rate p that
allows 99% recovered accuracy to be achieved.
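
The two metrics can be written down in a few lines. The sketch below is only illustrative: the hardware accuracies are made-up placeholder values (not results from this work), and P99 is computed over a discrete set of sampled defect rates.

```python
def recovered_accuracy(acc_hw, acc_sw):
    """Accuracy obtained with W^e divided by the accuracy obtained with W."""
    return acc_hw / acc_sw


def p99(measurements, acc_sw, threshold=0.99):
    """Largest sampled defect rate p (%) whose recovered accuracy meets the threshold."""
    ok = [p for p, acc_hw in measurements.items()
          if recovered_accuracy(acc_hw, acc_sw) >= threshold]
    return max(ok) if ok else None


# Hypothetical hardware accuracies (%) per defect rate (%), for illustration only.
measurements = {0: 98.35, 2: 98.0, 4: 97.5, 6: 96.9, 8: 95.0}
result = p99(measurements, acc_sw=98.35)
print(result)
```

Taking the maximum over the sampled rates implicitly assumes that the recovered accuracy decreases monotonically with the defect rate.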
We refer to (q:r) redundant synapses as the redundancy factor in this section. This is the redundancy factor used for the fully-connected layers in the feed-forward neural networks and the CNN.
For the convolutional layers in the CNN, a redundancy factor (1:4) is always used because those
layers are very sensitive to defects.

Effectiveness of redundant neurons

The effectiveness of inserting redundant neurons is evaluated in Figure 7.6. In Figure 7.6(a), we
show that when up to 40 redundant neurons are inserted the recovered accuracy increases from
98.7% to 99.3%. This confirms that redundant neurons are capable of improving the robustness
to stuck-at-fault defects. In Figure 7.6(b), we show P99 for the different DNNs without and with
redundant neurons. When (1:1) redundancy is used without redundant neurons, a 99% recovered
accuracy can be achieved when the defect rate is at most 2%. In contrast, with redundant neurons
inserted, a 99% recovered accuracy can be obtained when there is up to 4% defects. Note that these
benefits are obtained without any area overhead due to the mismatch between the weight matrices
and MCA dimensions. However, the power consumption is increased by 0.4% when the redundant
neurons are used.

Figure 7.6: Evaluation of redundant neurons. (a) Recovered accuracy as a function of the number
of redundant neurons. (b) P99 with and without redundant neurons.

Effectiveness of shared redundant synapses

The recovered classification accuracy for different redundancy factors (q:r) is shown in Figure 7.7.
The figure shows that a higher r/q ratio (or redundancy factor) results in improved robustness,
which is expected. However, the size of the required MCAs also increases with a larger r/q.
Clearly, the smallest redundancy factor that is required to achieve 99% recovered accuracy is
dependent on the percentage of defects in the memristor-based platform. Interestingly, it can be
observed that the robustness provided by (2:3) redundancy is almost equivalent to (1:2)
redundancy. Thus, a new dominant solution is provided by the proposed technique between redundancy
factor (1:1) and (2:3), as shown by the “shaded” area in Figure 7.7, as higher robustness than (1:1)
is provided with smaller power and area overhead than (1:2).

Figure 7.7: The recovered accuracy for different percentages of defects on different networks:
(a) 4-layer, (b) 6-layer, (c) 7-layer.

Figure 7.8: Evaluation of power and area overhead for different redundancy factors (q:r).

Next, we focus on evaluating the power and area overhead of different redundancy factors (q:r).

A lower bound for the overhead of (q:r) redundancy is r/q. However, the ‘Control’ circuitry in
Figure 7.4 introduces additional overhead in terms of power and area. In Figure 7.8, we show the
normalized power and area overhead for different redundancy factors. The values in the figure are
obtained using parameters from [58], Synopsys Design Compiler, and a 45 nm library. Compared with
(1:1) redundancy, the figure shows that the power overhead of (2:3) and (1:2) redundancy is 1.71×
and 2.04×, respectively. Since the corresponding lower bounds r/q are 1.5× and 2×, the additional
power overhead for (2:3) redundancy is associated with the ‘Control’ circuitry.

Comparison with related work

In this section, we compare the proposed framework with previous studies in terms of P99
satisfaction, normalized power and normalized area. The framework is compared with the work in [73]
and [79]. The work in [73] is based on hardware aware training, and the work in [79] is based on
the redundant synapses, row flipping and permutation of neurons techniques reviewed in Section 9.
The defect rate on MNIST and CIFAR-10 is set to 10% and 4%, respectively.
The hardware aware training in [73] is capable of recovering 99% of the accuracy on MNIST
but not on CIFAR-10 using a redundancy factor of 2. Compared with [73], the work in [79] is
capable of obtaining a 99% normalized classification accuracy on both MNIST and CIFAR-10 by
combining the neuron permutation and row flipping. Compared with [79], the proposed framework
is also capable of obtaining a 99% normalized classification accuracy on both MNIST and
CIFAR-10. However, the power and area overhead on MNIST is reduced by 16% and 25%, respectively.
The power and area overhead on CIFAR-10 is reduced by 1% and 22%, respectively. The power
and area improvements are smaller on CIFAR-10 as the power consumption is dominated by the
convolutional layers, where a (1:4) redundancy factor is used for all techniques. Nevertheless,
Table 7.1 shows that the proposed framework satisfies the classification accuracy requirement but with
smaller power and area overhead.

Table 7.1: Comparisons with previous works.

            MNIST                             CIFAR-10                          Training
Works       P99?  Norm. Power  Norm. Area     P99?  Norm. Power  Norm. Area     required?
[73]        Yes   1.00         1.00           No    1.00         1.00           Yes
[79]        Yes   1.00         1.00           Yes   1.00         1.00           No
This work   Yes   0.84         0.75           Yes   0.99         0.78           No

Conclusion

In this work, we have demonstrated that redundant neurons can be used to improve robustness to
stuck-at-fault defects and that redundant synapses can be shared to reduce hardware overhead. The
experimental results demonstrate improvements on both the MNIST and the CIFAR-10 data sets.

CHAPTER 8: RESILIENT NEURAL NETWORKS WITH HIGH
THROUGHPUT

Background and Motivation

Mapping neural networks to RCSs

The flow of mapping CNNs with high throughput to RCSs is shown in Figure 8.1. The input to the
flow is a CNN model M trained in software and a crossbar-based RCS R.

Figure 8.1: Flow of mapping CNNs with high throughput to RCSs with defects. (Input: CNN model M
and RCS platform; output: hardware classification accuracy.)

The first step is the proposed distribution guided training, which involves retraining the weights
of the CNN to be amenable to data layout organization. This converts the model $M$ into a new
model $\tilde{M}$. The details are provided in Section 9.
The second step is throughput optimization. In [62], it was observed that Nl is typically the highest
in the first few layers of CNNs. If a layer-wise pipeline structure is implemented, it is intuitive
that the layer with the largest Nl becomes the bottleneck of the pipeline. Therefore, the authors
proposed to replicate the weight matrices of the initial layers to improve throughput. Hardware
utilization can be further improved by staggering the replicated weight matrices [83].
The third step is resource allocation. This involves partitioning and binding each weight matrix
within the CNN model M to crossbars within the RCS R [35]. The binding of the weight matrices
to the crossbars is performed with the objective of minimizing the data movement between the
crossbars in R.
The fourth step is the proposed data layout organization. This involves reorganizing the weights
within a CNN such that the software classification accuracy is the same but the assignment of the
weights to the crossbars is modified. This also converts the input model $M$ into a new model $\tilde{M}$.
The fifth step is to program the matrices to the assigned crossbars. This involves programming the
RRAM devices within an RCS using closed-loop tuning [25].
The sixth step is to stream input data to the platform and determine the classification accuracy in
hardware. Let the classification accuracy of a CNN model $M$ be denoted $P(M)$. When there are
no defects in the RCS, $P(M)$ is equal to the software classification accuracy. Similarly, let $P(\tilde{M})$
denote the accuracy with the two proposed steps applied.

RCS with stuck-at-fault defects

Each crossbar in an RCS is used to accelerate a matrix-vector multiplication $y = W x$ using analog
matrix-vector multiplication $i_{out} = G v_{in}$, where $v_{in}$ are the input voltages, $G$ is the crossbar
conductance matrix, and $i_{out}$ are the output currents. The input vector $x$ is mapped into input voltages
$v_{in}$ and the output vector $y$ is decoded from the output currents $i_{out}$. The conductance matrix $G$ is
specified to be proportional to the weight matrix $W$.
As mentioned earlier, RRAM devices within RCSs may suffer hard defects at fabrication or from
prior testing and utilization [13]. As a result, a crossbar realizes an effective weight matrix
$W^r$ instead of $W$. As crossbars commonly utilize a differential pair configuration, we show the
weight value range that can be realized by two RRAM devices (+, −) with different combinations
of stuck-at-faults in Figure 8.2. Here, $w_{min}$ and $w_{max}$ denote the minimum and maximum value
within a weight matrix $W$. More specifically, when a weight $w$ is mapped to a differential pair of
RRAM devices, the realized weight $w^r$ is the closest value to $w$ within the value range [72]. The
hardware accuracy is determined by evaluating the CNN with weight matrices $W^r$ instead of $W$.
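
The "closest value within the value range" rule can be sketched as a clamp. The device model below is an assumption made for illustration: a weight is realized as the difference of two normalized device values in [0, w_max] with $w_{min} = -w_{max}$, a stuck-on device is fixed at w_max, and a stuck-off device is fixed at 0. The state labels and function names are hypothetical.

```python
def device_range(state, g_max):
    """Values a single device can take: 'ok' is programmable, 'on'/'off' are stuck."""
    if state == "on":
        return (g_max, g_max)
    if state == "off":
        return (0.0, 0.0)
    return (0.0, g_max)


def weight_range(pos_state, neg_state, w_max):
    """Range of w = g_plus - g_minus for a differential pair of devices."""
    p_lo, p_hi = device_range(pos_state, w_max)
    n_lo, n_hi = device_range(neg_state, w_max)
    return (p_lo - n_hi, p_hi - n_lo)


def realized_weight(w, pos_state, neg_state, w_max):
    """Closest value to w within the realizable range of the pair."""
    lo, hi = weight_range(pos_state, neg_state, w_max)
    return min(hi, max(lo, w))


print(weight_range("ok", "ok", 1.0))           # full range
print(weight_range("off", "ok", 1.0))          # only non-positive weights remain
print(realized_weight(0.5, "off", "ok", 1.0))  # positive target is clamped
```

Under this model, a single stuck device halves the realizable range, and a pair with both devices stuck can realize only one value.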

Figure 8.2: Weight value ranges based on RRAM state. ‘on’ (‘off’) denotes a device stuck-on
(stuck-off). (Vertical axis: weight value range between $w_{min}$ and $w_{max}$; horizontal axis: RRAM
state of the (+, −) device pair.)

Motivation

We holistically analyze the use of training schemes and data layout organization techniques for
handling defects in Table 8.1. The analysis is performed for CNNs with both low and high throughput.

Table 8.1: Holistic analysis of techniques for deploying CNNs with low and high throughput to
RCSs.

Degree of throughput   Hardware-aware training      Data layout organization
low                    mimic specific defects       few opportunities
high                   mimic defect distribution    many opportunities

The table shows that for CNNs with low throughput, it is ideal to train the network’s weights
based on the specific defects in the platform. This stems from the fact that there is a one-to-one mapping
between the weights and the RRAM devices. Moreover, there are few opportunities for data layout
organization because of the high degree of weight sharing. The sharing places constraints on how
the weights within a CNN model can be organized. For CNNs with high throughput, we discover
that there are new opportunities to perform data layout organization. This is because the
degree of weight sharing is reduced by the replication of weights. Our detailed analysis reveals
that the average solution space for data layout organization is increased by a factor of $1.4 \times 10^{14}$ when the
weights within a CNN are replicated to improve throughput (see details in Section 8). The larger
solution space motivates our investigation into data layout organization in Section 8, i.e., to take
advantage of these new opportunities. Moreover, as it is impossible to train the weights to mimic
specific defects, we propose to train the weights to be amenable to data layout organization in
Section 9. This is performed by training the weights to mimic the distribution of the defects.

Design

Data layout transformation

In this section, we propose a channel, a pixel, and a hybrid channel-pixel transformation to modify
the weight data to crossbar assignments within the convolutional layers of a CNN. The different
transformations are applied depending on the number of cycles required to produce and consume
the data within a feature map. A cycle is defined as one matrix-vector multiplication with one
input vector and one output vector. The number of cycles used to produce and consume a layer
is directly dependent on the number of times the weights within a layer are replicated, which we
denote rl for layer l.

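$$r_l = \frac{m_l \cdot m_l}{m_{in}}, \qquad (8.1)$$

$$m_l = \begin{cases} m_{in}, & l = 1 \\ \dfrac{(m_{l-1} - (K - S) + 2 \cdot P_{pad}) \cdot S}{P_{pool}}, & l \ge 2 \end{cases} \qquad (8.2)$$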

where $m_{in}$ is the volume size (number of pixels in x and y-direction) of input images to the network.
$m_l$ is the volume size of input feature maps to a convolutional layer l. $P_{pad}$ is the zero-padding
factor. $P_{pool}$ is the max pooling factor, which is performed with 2 × 2 dimension in this work. $S$ is the
stride of the convolutional kernels and $K$ is the kernel size. Note that the data layout transformation
between the last convolutional and first FC layer in each network can only be a channel
transformation. The three proposed techniques are illustrated in Figure 8.3. The data layout transformation
can be cast into three different cases using Eq. 8.2, as follows:

• Case 1, $r_l < \frac{m_l}{2S}$: Channel transformation (Section 8).

• Case 2, $r_l = \frac{m_l}{S}$: Pixel transformation (Section 8).

• Case 3, $\frac{m_l}{2S} \le r_l < \frac{m_l}{S}$: Hybrid pixel-channel transformation (Section 8).
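
The case distinction above can be expressed as a small helper. The function name and the example values are hypothetical; only the three conditions are taken from the list.

```python
def select_transformation(r_l, m_l, stride):
    """Pick the data layout transformation for layer l from its replication factor r_l."""
    full = m_l / stride        # m_l / S
    half = m_l / (2 * stride)  # m_l / (2S)
    if r_l < half:
        return "channel"       # Case 1
    if r_l < full:
        return "hybrid"        # Case 3
    if r_l == full:
        return "pixel"         # Case 2
    raise ValueError("r_l exceeds m_l / S, which is outside the three cases")


# Example with a 32-pixel-wide feature map and stride 1 (illustrative values).
for r in (8, 16, 32):
    print(r, select_transformation(r, m_l=32, stride=1))
```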

Channel transformation

The channel transformation is illustrated in Figure 8.3(a). For the data consumed in two subsequent
cycles, it can be observed that there is overlap in both the x-direction and y-direction within a
feature map. Therefore, it is only possible to reorganize data in the z-direction. As the data along
the z-direction are called channels, we call the operation a channel transformation. If two channels
in layer l are permuted, the corresponding pair of rows in $W_l$ and the corresponding
groups of columns in $W_{l+1}$ are permuted, which is illustrated at the bottom of Figure 8.3(a). The
column group size is equal to $K \times K \times Ch_l$, where $Ch_l$ is the number of channels in feature map l,
i.e., the transformation is relatively coarse grained for CNNs with many channels.

Figure 8.3: The proposed data layout transformations: (a) channel transformation, (b) hybrid
channel-pixel transformation, (c) pixel transformation.
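
The reason a channel permutation leaves the computed function unchanged can be illustrated with a simplified sketch. It assumes fully-connected style matrices with a column group size of one, so each channel corresponds to one row of $W_l$ and one column of $W_{l+1}$; in a convolutional layer, whole column groups would move together. All names and values are illustrative.

```python
def relu(values):
    return [max(0.0, v) for v in values]


def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def permute_channels(W_l, W_l1, perm):
    """perm[new] = old: reorder rows of W_l and the matching columns of W_{l+1}."""
    W_l_new = [W_l[old][:] for old in perm]
    W_l1_new = [[row[old] for old in perm] for row in W_l1]
    return W_l_new, W_l1_new


def forward(W_l, W_l1, x):
    return matvec(W_l1, relu(matvec(W_l, x)))


# Toy layer with three channels and two inputs.
W_l = [[1.0, 0.0], [0.5, -1.0], [2.0, 1.0]]
W_l1 = [[1.0, 2.0, -1.0]]
x = [1.0, 2.0]

W_l_p, W_l1_p = permute_channels(W_l, W_l1, perm=[2, 0, 1])
before = forward(W_l, W_l1, x)
after = forward(W_l_p, W_l1_p, x)
print(before, after)
```

The outputs match, while the physical placement of the weights on the crossbar rows and columns changes. This freedom is what allows weights to be moved away from defective devices.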

Pixel transformation

The pixel transformation is illustrated in Figure 8.3(c). Instead of deploying convolution
operations onto the RCS platforms, the weights in the convolutional layers are replicated to convert
the convolutional layer into a semi-FC layer. The semi-FC layer performs the exact same
mathematical computations while processing a higher bandwidth of input data. There is only overlap in the
y-direction, when all the data in the x-direction within a feature map is consumed and produced in
a single cycle. Consequently, data can be reorganized in the xz-plane. We call the transformation
pixel transformation because from a top-view, it can be viewed as permuting pixels within a plane.
If two pixels in layer l are permuted, the corresponding rows in layer $W_l$ are permuted and the
corresponding group of rows is permuted in $W_{l+1}$. The group size is equal to 3, which stems from
the standard y-height of a 3x3 convolutional kernel. The pixel transformation is advantageous over
the channel transformation because it is more fine grained, i.e., it provides better opportunities to
mask the defects using the neural network weights.

Hybrid transformation

The hybrid channel-pixel transformation is shown in Figure 8.3(b). When the data in the
x-direction of a feature map is consumed in a few cycles, there is overlap in both the x-direction
and the y-direction. However, in contrast to Figure 8.3(a), the overlap in the x-direction
only covers part of the data produced and consumed in each cycle. Therefore, there exists an
opportunity to apply a hybrid version of the channel and pixel transformation. Specifically, channel
transformation is applied to the sections of the feature map with overlap and pixel transformation
is applied to the sections without any overlap. In Figure 8.3(b), we show an example of both a
channel and pixel transformation. Channel transformations are applied to the first and last rows
in $W_l$ and the first and last columns of $W_{l+1}$. Pixel transformations are applied to the
midsections of $W_l$ and $W_{l+1}$. Hence, the hybrid channel-pixel transformation allows the fine grained pixel
transformation to be used partly although there is overlap in the x-direction.

Flow for data layout organization

In this section, we describe the flow for performing data layout organization. The input to the flow
is a CNN model $M$ and the location and type of the defects within each crossbar of an RCS R. The
defects can be identified using the techniques in [13, 34, 66]. The output from the flow is a CNN
model $\tilde{M}$, which is obtained by applying data layout transformations T in Section 8. While the
flow is similar to that in [14, 51, 73, 78], the transformations T are new. The data layout organization
is performed by defining a cost metric and casting the problem into a mathematical optimization problem. We
first describe the cost metric. Next, we describe how transformations T are applied to minimize
the cost metric.

Cost metric: The framework is based on minimizing a cost metric, as follows:

\[
\mathrm{cost}(M) = \sum_{l=1}^{L} \sum_{w_k \in W_l} (w_k - w_k^r)^2, \tag{8.3}
\]

where w_k is a weight in weight matrix W_l within model M, and w_k^r is the weight realized when w_k is mapped to RRAM hardware, as explained in Section 8 using Figure 8.2.
Minimization of Cost Metric: The flow for the data layout organization is provided in Algorithm 1. The flow iterates over layers 2 to (L − 1) and applies a data layout transformation T to each layer. Finding the transformation that minimizes the cost metric can be formulated and solved as an assignment problem. This involves defining an assignment cost matrix A, where A_ij is the cost of assigning channel/pixel i to location j. Next, the assignment problem is solved using the Hungarian algorithm [19]. Finally, the data within W_l and W_{l+1} is reorganized to update M into M̃.
Algorithm 1: Data layout organization
Input: CNN model M, RCS R.
Output: CNN model M̃ (effective network).
Generate assignment constraints for all layers.
for l in 2 to (L − 1) do
    Formulate assignment problem
    Solve assignment problem
    Reassign data within W_l and W_{l+1}
end
return M̃
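The per-layer step of Algorithm 1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in this work: the defect model (a defective cell fixes the realized weight to a single stuck value), the toy weights, and the helper names `realized`, `assignment_cost_matrix`, and `solve_assignment` are hypothetical. An exhaustive search over permutations stands in for the Hungarian algorithm [19]; it returns the same optimum but is only feasible for tiny examples.

```python
from itertools import permutations

def realized(weight, stuck):
    """Assumed defect model: a defective cell realizes its stuck value."""
    return weight if stuck is None else stuck

def assignment_cost_matrix(rows, defect_rows):
    """A[i][j] = squared error (Eq. 8.3) of placing logical row i at location j."""
    return [
        [sum((w - realized(w, s)) ** 2 for w, s in zip(row, defects))
         for defects in defect_rows]
        for row in rows
    ]

def solve_assignment(A):
    """Exhaustive search over permutations (stand-in for the Hungarian algorithm)."""
    n = len(A)
    best = min(permutations(range(n)),
               key=lambda perm: sum(A[i][perm[i]] for i in range(n)))
    return list(best), sum(A[i][best[i]] for i in range(n))

# Toy layer: 3 logical rows (channels/pixels) and 3 physical row locations.
W_MIN, W_MAX = -1.0, 1.0
rows = [
    [0.0, 0.4, -0.2],
    [1.0, -0.3, 0.1],
    [-0.5, 0.2, -1.0],
]
# None marks a healthy cell; otherwise the value the defect fixes the weight to.
defect_rows = [
    [W_MAX, None, None],
    [None, None, W_MIN],
    [0.0, None, None],
]

A = assignment_cost_matrix(rows, defect_rows)
perm, cost = solve_assignment(A)
identity_cost = sum(A[i][i] for i in range(len(A)))
print(perm, cost, identity_cost)
```

In this toy case, every stuck cell coincides with a weight of the same value after the rows are reassigned, so the defects are fully masked, whereas the original placement has a non-zero cost.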

Distribution guided training

In this section, we introduce the distribution guided training. The objective is to retrain neural networks so that they are amenable to data layout organization. The idea is motivated by the theoretical result that there exists an infinite number of neural networks with similar classification accuracy [22]. Concretely, we propose to retrain the CNN weights to mimic the distribution of the value ranges governed by the RRAM defects.
We show the weight distribution of a CNN in Figure 8.4(a) and the percentage of the weights that
are fixated to a specific value (wmin /0/wmax ) by RRAM defects in Figure 8.4(b). Here, a 5% defect
rate is assumed for the RCS.

Figure 8.4: (a) Weight value distribution of a VGG-7. (b) Percentage of weights stuck to a specific value.

First, by comparing the distributions, we observe that there are typically only one or two weights within each weight matrix that have the minimum or maximum value. At the same time, the defects fixate 0.04% of the weights to the minimum or maximum value. Clearly, it is impossible to mask all of these defects using the default weights within the model M. Consequently, we propose to retrain the model to have more weights with the min/max value. Concretely, we perform this using a value range squeezing technique that sets all weights with a magnitude larger than the p-percentile to the magnitude of the p-percentile weight. p is set such that there are ten times as many weights with the min/max value as fixated weights. Second, similar to [32], we observe that zeros are the best weights for data layout organization. Therefore, we propose a zero snap technique that sets the z-percentile of the weights with the smallest magnitude to zero. In contrast with [32], we target the distribution instead of specific weights.
After the squeezing and snapping techniques have been applied to the model M, we apply retraining to recover the original software classification accuracy. We tolerate a maximal accuracy loss of ε = 0.1% after retraining; otherwise, we use less aggressive percentiles p and z.
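The squeezing and snapping steps described above can be sketched as follows. This is an illustrative sketch only: the function name `squeeze_and_snap`, the nearest-rank percentile convention, and the toy weight list are assumptions, and the subsequent retraining step is not shown.

```python
import math

def squeeze_and_snap(weights, p, z):
    """Clip magnitudes above the p-percentile and zero the z percent smallest ones."""
    mags = sorted(abs(w) for w in weights)
    n = len(mags)
    # Nearest-rank percentile (an assumption; the text does not fix a convention).
    clip = mags[max(1, math.ceil(p * n / 100)) - 1]
    n_zero = z * n // 100
    zero_thresh = mags[n_zero - 1] if n_zero > 0 else -1.0
    out = []
    for w in weights:
        if abs(w) <= zero_thresh:
            out.append(0.0)                         # zero snap
        elif abs(w) > clip:
            out.append(clip if w > 0 else -clip)    # value range squeezing
        else:
            out.append(w)
    return out

weights = [-0.9, -0.5, -0.05, 0.02, 0.1, 0.3, 0.6, 0.8, 1.2, -1.5]
squeezed = squeeze_and_snap(weights, p=80, z=20)
print(squeezed)
```

With p = 80 and z = 20, the two largest magnitudes are clipped to ±0.9 and the two smallest are set to zero, so more weights share the extreme values and the zero value that defects can fixate.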

Experimental Evaluation

The experiments are performed on a Linux machine with a 3.4 GHz 8-core CPU and 32 GB memory. The proposed framework is implemented in a combination of C++ and MATLAB.
The evaluation is performed using four CNNs with a VGG-like structure [61] trained on the CIFAR-10 dataset [37]. The networks are trained using Keras with a TensorFlow backend on an NVIDIA Tesla K80 graphics card. The properties of the evaluated neural networks are shown in Table 8.2. Throughput optimization based on weight replication [62, 83] is used to improve the throughput by 32X, which improves the hardware utilization by 19.5X.

Table 8.2: Properties of evaluated CNNs on CIFAR-10.
Name     Software accuracy   #Conv layers   #FC layers   Throughput
VGG-7    82.93%              4              2            32X
VGG-10   88.42%              7              2            32X
VGG-13   93.16%              10             2            32X
VGG-16   93.45%              13             2            32X

The RCSs used in the evaluation are based on crossbars with dimensions of 128x128. The evaluation is focused on RCSs with 1% to 5% defects, as crossbars with as few as 0.4% defects have been reported in [52]. While previous studies could handle defect rates as high as 20%, they focused on CNNs with low throughput [78, 23, 73], which we observed are easy to handle using hardware-aware training. Many other studies only provided evaluation on the MNIST dataset [14, 51, 52, 72], which is too small for an insightful robustness evaluation.
The effectiveness of the two proposed techniques is evaluated using the flow in Figure 8.1. The
evaluation is performed in terms of normalized classification accuracy. The normalized accuracy
is equal to the hardware classification accuracy divided by the software classification accuracy.

Evaluation of data layout organization

In this section, we evaluate the proposed data layout organization. First, we analyze the types of transformations that are used in each layer. Next, we evaluate the impact on the solution space and the normalized classification accuracy.
The transformation type that is used for each layer of the evaluated CNNs is shown in Table 8.3. The ‘-’ indicates that no transformations are applied to the first and last layer of each network, which ensures that the optimization is seamless on the application level. C/P/H denotes the channel, pixel, and hybrid channel-pixel transformations, respectively. ‘N’ denotes the neuron transformation used in fully-connected layers [73]. The table shows that the pixel transformation is used for the second layer within all the CNNs. Next, the hybrid channel-pixel transformation is used for the subsequent 2 to 3 layers. The pixel and hybrid transformations can only be applied to the initial layers because those are the only layers that have a significant amount of weight replication. Channel transformation is applied until the last fully-connected layer, where the neuron transformation ‘N’ is applied [73].
Table 8.3: Type of data layout transformation scheme in each layer of the evaluated CNNs.
          Layer
Network   1   2   3   4   5   6   7   8   9   10  11  12  13  14  15  16
VGG-7     -   P   H   H   C   N   -
VGG-10    -   P   H   H   C   C   C   C   N   -
VGG-13    -   P   H   H   C   C   C   C   C   C   C   N   -
VGG-16    -   P   H   H   H   C   C   C   C   C   C   C   C   C   N   -

Table 8.4: Solution space for data layout organization for CNNs with low and high throughput.
          Number of alternative data layout organizations
Network   low throughput   high throughput
VGG-7     2.1 × 10^9       2.3 × 10^15
VGG-10    4.5 × 10^15      4.7 × 10^21
VGG-13    1.5 × 10^26      6.5 × 10^32
VGG-16    4.1 × 10^31      2.2 × 10^43
Norm.     1.0              1.4 × 10^14

We now evaluate the solution space for data layout organization. The solution space for CNNs with low and high throughput is shown in Table 8.4. The evaluation is performed in terms of the total number of possible data layout organizations. The numbers in the table are computed by multiplying together the number of possible transformations for each layer. The table shows that the average solution space for CNNs with high throughput is 1.4 × 10^14 X larger than for CNNs with low throughput. This is because the proposed transformations are capable of taking advantage of the new opportunities provided by the weight replication. Note that for CNNs with low throughput, channel transformation is the only transformation that can be applied.
We evaluate the impact of the data layout organization on the classification accuracy in Figure 8.5. The figure shows that the data layout organization improves the classification accuracy for all the networks and defect rates, which is a result of successfully masking defects using the CNN weights. As expected, larger networks have a higher intrinsic robustness to defects, which is similar to what was observed for MLPs in [78].

Figure 8.5: Effectiveness of the proposed data layout organization on the evaluated networks: (a) VGG-7, (b) VGG-10, (c) VGG-13, and (d) VGG-16.

Evaluation of distribution guided training

In this section, we evaluate the effectiveness of the distribution guided training. The normalized accuracy for VGG-7 and VGG-10 with and without distribution guided training is shown in Figure 8.6. Data layout organization is applied to both lines in the plot. The figure shows that the distribution guided training is capable of improving the normalized classification accuracy for all defect rates. This is because the weights are trained to be amenable to data layout organization.

Figure 8.6: Effectiveness of the data layout organization compatible training compared with the hybrid data layout organization: (a) VGG-7 and (b) VGG-10.

Comparison with state-of-the-art works

We compare the effectiveness of the proposed framework with the state-of-the-art techniques in [14, 23, 51, 52, 73, 72, 78], as shown in Table 8.5 and Table 8.6. The comparison is performed with hardware-aware training [14, 23, 51, 52, 73], redundant hardware [72], and data layout organization [78]. To make fair comparisons, the throughput is set to 32X. For hardware-aware training, we show both a version where the entire network is replicated 32 times (denoted ‘N’) and a version with layer replication. For layer replication, which is used by all other methods, layers are replicated to achieve the 32X throughput using the method in [83] (denoted ‘L’). The hardware redundancy factor is denoted R. The power and area consumption are contributed by analog-to-digital converters (ADCs), digital-to-analog converters (DACs), and RCAs. The specifications are adopted from [58]. Note that when 2X hardware redundancy is used in our proposed framework, the power and area overhead is less than 2X because a few crossbars are underutilized in the baseline. The best results in each column are marked in bold.
It can be observed that, by increasing the redundancy factor R in the proposed framework, over 99% accuracy can be recovered with less than 2X power and area overhead.

Table 8.5: Comparison with state-of-the-art techniques on 2% defect rate on VGG-7.
Technique              Replication   R    Norm. Acc   Norm. Power   Norm. Area   Run time
[14, 23, 51, 52, 73]   N             1X   99.1%       49.44         6.89         1394.9
[14, 23, 51, 52, 73]   L             1X   12.1%       1.00          1.00         43.6
[72]                   L             1X   26.2%       1.00          1.00         0.2
[78]                   L             1X   53.3%       1.00          1.00         2.4
Ours                   L             1X   98.8%       1.00          1.00         1.0
Ours                   L             2X   99.5%       1.80          1.97         3.3

Table 8.6: Comparison with state-of-the-art techniques on 10% defect rate using 2X hardware on VGG-13.
Technique              Replication   R    Norm. Acc   Norm. Power   Norm. Area   Run time
[14, 23, 51, 52, 73]   N             1X   98.1%       37.23         11.85        241.5
[14, 23, 51, 52, 73]   L             1X   15.7%       1.00          1.00         7.5
[72]                   L             1X   88.3%       1.00          1.00         0.1
[78]                   L             1X   56.7%       1.00          1.00         1.8
Ours                   L             1X   95.5%       1.00          1.00         1.0
Ours                   L             2X   99.3%       1.86          1.98         6.5

The different techniques are compared in Table 8.5 and Table 8.6. It can be observed that hardware-aware training is capable of restoring the accuracy when network replication is used to improve the throughput, which maintains the one-to-one mapping. Unfortunately, this introduces large power and area overheads. When layer replication is used instead, hardware-aware training is not effective at all.
The method in [72] is based on programming the two memristors that realize each weight in tandem, which improves the accuracy slightly. This method requires no power or area overhead and only up to 0.2X run time. This is because no data layout organization is applied to the networks and all weights are directly mapped to crossbars. However, the recovered accuracy is very low at a 2% defect rate with low redundancy.
Using channel permutation, the technique in [78] requires no power or area overhead. The recovered accuracy is low because there are fewer data layout organization opportunities. In comparison, our proposed framework is able to recover over 95% accuracy with the same hardware redundancy. This is because the pixel and hybrid pixel-channel transformations enable more data layout organization opportunities.

Conclusion

In this work, we have proposed a framework to enable resilient deployment of CNNs with high
throughput to RCSs with defects. The framework is based on integrating a data layout organization
technique and a distribution guided training technique into the flow for mapping CNNs to RCSs.
The main idea is to mask the defects using the weights in the CNNs through data layout organization. The framework is capable of achieving close to software level accuracy while tolerating up
to 10% defects. Compared with state-of-the-art techniques, the advantage of the proposed framework is that software level accuracy can be recovered with low overhead while the framework
attains high throughput.

CHAPTER 9: COMPUTATIONAL RESTRUCTURING FOR IMAGE
COMPRESSION

Background and Motivation

In this section, we review the basics of image compression, metrics for image compression, and
the acceleration of MVM operations using emerging RCAs.

Review of image compression

Common lossy image and video compression formats such as JPEG [67] and motion JPEG (MJPEG) are based on transforming an image (or video) from the spatial domain to the frequency domain using the 2D DCT and encoding the frequency coefficients. The fundamental steps of JPEG compression are partitioning, 2D DCT, quantization, zig-zag reordering, encoding, and file creation, as illustrated in Figure 9.1.

Figure 9.1: Review of JPEG image compression based on 2D DCT [55]. The steps are: 1. partitioning (into 8x8 blocks), 2. 2D DCT, 3. quantization, 4. zig-zag reorder, 5. encoding (run-length encoding and Huffman encoding), and 6. create file.

The first step is to partition the input image I into small image blocks X with dimension 8x8 (or
16x16). Small block sizes are used in order to preserve high image quality. For color images, the
RGB components are compressed separately. Second, each image block X is converted into the
frequency domain by applying the 2D DCT, as follows:
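\[
C = D X D', \tag{9.1}
\]

where C is a matrix with the frequency coefficients of X and D' is the transpose of D. D is the standard 2D DCT matrix. Each element D_ij in D is defined, as follows:

\[
D_{ij} =
\begin{cases}
\frac{1}{\sqrt{N}}, & i = 1,\ 1 \le j \le N, \\
\sqrt{\frac{2}{N}} \cos\left[ \frac{\pi (2j-1)(i-1)}{2N} \right], & 2 \le i \le N,\ 1 \le j \le N,
\end{cases}
\tag{9.2}
\]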

where the block size is NxN. Third, each frequency coefficient is divided by the corresponding entry in a quantization table. The quantization table is designed to preserve low frequency components and discard high frequency components, as empirical studies have shown that humans are less sensitive to high frequency patterns. Moreover, the coefficients in the quantization table can be scaled with a factor q_user to balance image quality and compression ratio. The quantization is followed by zig-zag reordering of the frequency coefficients into a vector (with the coefficients ordered from low to high frequency). The zig-zag reordering is performed to statistically place the non-zero coefficients at the beginning and the zero coefficients at the end of the vector, which allows the subsequent encoding to be performed more effectively. The encoding step consists of run-length encoding and Huffman encoding. Run-length encoding is based on storing the non-zero elements and the number of zeros that precede each non-zero element in the vector. In particular, each non-zero element is stored using a triplet (r, s, v), where r is the number of zeros before the non-zero element; v is the value of the non-zero element; and s is the number of bits required to store v. Next, the triplets are further compressed using Huffman encoding. The last step is to create the file, where the encoded image is appended with the information required to perform the uncompression, i.e., the quantization table and the specification of the Huffman encoding that was used. Uncompression is performed by reversing the process in Figure 9.1.
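The zig-zag reordering and the run-length triplets (r, s, v) described above can be illustrated with a short Python sketch. The helper names, the 4x4 toy block, and the choice of s as the bit length of |v| are assumptions made for illustration; trailing zeros are simply dropped here, and the Huffman stage is not shown.

```python
def zigzag_order(n):
    """(row, col) positions of an n x n block, ordered from low to high frequency."""
    return sorted(
        ((r, c) for r in range(n) for c in range(n)),
        key=lambda rc: (rc[0] + rc[1],
                        rc[0] if (rc[0] + rc[1]) % 2 else rc[1]),
    )

def run_length_triplets(vec):
    """Encode each non-zero value v as (r, s, v): r preceding zeros, s bits for |v|."""
    triplets, run = [], 0
    for v in vec:
        if v == 0:
            run += 1
        else:
            triplets.append((run, abs(v).bit_length(), v))
            run = 0
    return triplets  # trailing zeros are dropped in this sketch

# Hypothetical block of quantized frequency coefficients.
block = [
    [12, 3, 5, 0],
    [-2, 0, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
vec = [block[r][c] for r, c in zigzag_order(4)]
triplets = run_length_triplets(vec)
print(vec)
print(triplets)
```

Because the non-zero coefficients cluster at the start of the zig-zag vector, the sixteen coefficients reduce to five triplets in this example.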

Image compression performance metrics

In this paper, the quality of the image compression is measured using mean squared error (MSE), peak signal-to-noise ratio (PSNR), and structural similarity index measure (SSIM) [82]. The degree of compression (or compression ratio) is measured using bits per pixel (BPP). The MSE is computed, as follows:

\[
\mathrm{MSE}(I, \hat{I}) = \frac{1}{PQ} \sum_{p=1}^{P} \sum_{q=1}^{Q} (I_{pq} - \hat{I}_{pq})^2, \tag{9.3}
\]

where Î is the original reference image with dimensions PxQ. I is the image obtained after Î has been compressed and uncompressed using the flow in Figure 9.1. PSNR is computed, as follows:

\[
\mathrm{PSNR}(I, \hat{I}) = 20 \cdot \log_{10} \left( \frac{I_{peak}}{\sqrt{\mathrm{MSE}(I, \hat{I})}} \right), \tag{9.4}
\]

where I_peak is the maximum pixel value. The technical details of the SSIM metric are provided in [82]. The BPP metric for an image is computed, as follows:

\[
\mathrm{BPP} = \frac{\#\mathrm{num\_bits}}{\#\mathrm{num\_pixels}}. \tag{9.5}
\]

Using the basic RGB representation of an image, each RGB component is represented using 8 bits. Consequently, an image stored in the RGB representation uses 24 BPP.
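The three metrics in Eq (9.3) to Eq (9.5) can be computed with a few lines of Python. The sketch below assumes images given as nested lists of pixel values and a peak value of 255 (8-bit pixels); the function names and the 2x2 toy images are illustrative only.

```python
import math

def mse(img, ref):
    """Mean squared error between an image and its reference, Eq. (9.3)."""
    p, q = len(ref), len(ref[0])
    return sum((img[i][j] - ref[i][j]) ** 2
               for i in range(p) for j in range(q)) / (p * q)

def psnr(img, ref, peak=255):
    """Peak signal-to-noise ratio in dB, Eq. (9.4); infinite for identical images."""
    err = mse(img, ref)
    return math.inf if err == 0 else 20 * math.log10(peak / math.sqrt(err))

def bpp(num_bits, num_pixels):
    """Bits per pixel, Eq. (9.5)."""
    return num_bits / num_pixels

ref = [[10, 20], [30, 40]]   # reference image
img = [[12, 20], [30, 36]]   # image after compression and uncompression
print(mse(img, ref), psnr(img, ref), bpp(24 * 4, 4))
```

For the toy images, the squared differences are 4, 0, 0, and 16, which gives an MSE of 5 and a PSNR of roughly 41.1 dB.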

Acceleration of MVM using emerging RCAs

In this section, we outline how MVM operations can be accelerated using emerging RCAs. In particular, we focus on the MVM operation y = Dx, where D is a DCT matrix (or a reconstructed DCT matrix D or D̃ in Section 9) and x and y are the input and output vectors, respectively. An RCA consists of wordlines and bitlines with a non-volatile resistive device at each cross-point. The fabrication of non-volatile resistive devices has been explored based on resistive random access memory (ReRAM) [68, 44], spin transfer torque magnetic random access memory (STT-MRAM) [57, 29], and phase change memory (PCM) [69, 33].
Analog MVM is performed using a one-time expensive initialization phase and a fast and efficient evaluation phase. In the initialization phase, the conductance values of the resistive devices are programmed to realize a conductance matrix G. In this paper, the conductance matrix G is programmed to be proportional to the DCT matrix D in Eq (9.2). Next, analog MVM is performed by passing an input vector v_in to the wordlines and recording an output vector v_out from the transimpedance amplifiers (TIAs) attached to the bitlines. The computation performed in the analog domain is v_out^T = v_in^T G R_s, where R_s is the feedback resistance of the TIAs. The digital input vector x is converted into an analog input vector v_in using digital-to-analog converters (DACs). Similarly, the analog output vector v_out is converted into a digital output vector y using analog-to-digital converters (ADCs). As conductance values cannot be negative, the common differential pair approach is used to represent negative matrix values, i.e., two bitlines are used to represent the positive and negative elements, respectively, of one row in a matrix. Next, the two outputs are subtracted while being converted into the digital domain using a differential ADC. Consequently, an NxN matrix is represented using an RCA with dimensions Nx2N.
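The differential pair representation can be made concrete with an idealized Python sketch. It models neither noise, parasitics, nor converter quantization; the scaling to a maximum conductance `g_max`, the unit feedback resistance, and the function names are assumptions made for illustration.

```python
def to_differential(matrix, g_max=1.0):
    """Split a signed matrix into non-negative conductances G+ and G-."""
    scale = g_max / max(abs(v) for row in matrix for v in row)
    g_pos = [[max(v, 0.0) * scale for v in row] for row in matrix]
    g_neg = [[max(-v, 0.0) * scale for v in row] for row in matrix]
    return g_pos, g_neg, scale

def analog_mvm(matrix, v_in, r_s=1.0):
    """Ideal y = matrix * v_in using one pair of bitlines per matrix row."""
    g_pos, g_neg, scale = to_differential(matrix)
    out = []
    for row_p, row_n in zip(g_pos, g_neg):
        i_pos = sum(g * v for g, v in zip(row_p, v_in))  # current on the positive bitline
        i_neg = sum(g * v for g, v in zip(row_n, v_in))  # current on the negative bitline
        # TIA gain r_s, subtraction in the differential ADC, then undo the scaling.
        out.append((i_pos - i_neg) * r_s / (scale * r_s))
    return out

matrix = [[1.0, -2.0], [0.5, 3.0]]
y = analog_mvm(matrix, [2.0, 1.0])
print(y)
```

Each signed row is stored as two non-negative rows, which is why an NxN matrix occupies an Nx2N array.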

Previous Work

In [27, 47, 48], image compression was accelerated by directly mapping the computation of the 2D DCT step to resistive hardware, which is illustrated in Figure 9.2. The 2D DCT computation was selected because it is the bottleneck of image compression.
The figure shows how an image block X with dimensions NxN is processed into the corresponding frequency coefficients C using 2N MVM operations. The two RCAs in the figure have dimensions Nx2N as two resistive devices are used per matrix element. First, XD' is computed by passing each row from image block X as an input vector to an RCA programmed with the transpose of the DCT matrix D. The result of each MVM operation is saved as a row in a temporary storage, which is illustrated to the left in Figure 9.2. Next, DXD' is computed by passing each column from the temporary storage as an input vector to a second RCA, which is programmed with the DCT matrix D. The output vector of each MVM operation is stored as a column in the final output C = DXD', which is shown to the right in Figure 9.2.
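The two-stage direct mapping can be sketched in Python as follows. The sketch is purely digital and error-free, so it only illustrates the data flow and the count of 2N MVM operations; the function names are hypothetical, and the DCT matrix of Eq (9.2) is written with zero-based indices.

```python
import math

def dct_matrix(n):
    """DCT matrix of Eq. (9.2), using zero-based row index i and column index j."""
    return [
        [math.sqrt((1 if i == 0 else 2) / n)
         * math.cos(math.pi * (2 * j + 1) * i / (2 * n))
         for j in range(n)]
        for i in range(n)
    ]

def mvm(matrix, vec):
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

def direct_2d_dct(block):
    """Compute C = D X D' with 2N MVMs, as in the direct mapping."""
    n = len(block)
    d = dct_matrix(n)
    # Stage 1: N MVMs, one per row of X, produce the rows of XD'.
    temp = [mvm(d, row) for row in block]
    # Stage 2: N MVMs, one per column of XD', produce the columns of D(XD').
    cols = [mvm(d, [temp[r][c] for r in range(n)]) for c in range(n)]
    coeffs = [[cols[c][r] for c in range(n)] for r in range(n)]
    return coeffs, 2 * n

coeffs, n_ops = direct_2d_dct([[8.0] * 4 for _ in range(4)])
print(round(coeffs[0][0], 6), n_ops)
```

For a constant block, all energy ends up in the lowest frequency coefficient. In analog hardware, any error in the intermediate result `temp` would pass through the second stage as well, which is the sensitivity discussed in the text.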

Figure 9.2: Review of direct mapping in [27, 47, 48]. The image block X is passed through a first RCA to obtain XD' in intermediate storage (N MVM operations, small errors) and then through a second RCA to obtain the frequency coefficients C = DXD' (N MVM operations, amplified large errors).

The two main limitations of the direct mapping are: (i) The image quality after uncompression is degraded. (ii) It is difficult to scale up the RCA dimensions, which would be highly beneficial in terms of power and area. We illustrate the image quality obtained when image compression is performed using digital and resistive hardware in Figure 9.3. The images in the figure are obtained with the quantization step deactivated. It can be easily observed that the analog computation degrades the image quality. Although the image is recognizable, it is well known from adversarial learning that even minor distortions may have a devastating impact on the subsequent processing (such as classification or object detection) [38, 55]. The image is degraded because the series matrix-matrix multiplication is inherently sensitive to variations. Small errors introduced in the first matrix-matrix multiplication are amplified into large errors by the second matrix-matrix multiplication. Due to the inherent presence of errors and variations within analog computing, it is impossible to achieve high image quality [27, 47, 48]. Moreover, it is not possible to trade off image quality for performance (power/area) by reducing the complexity of the domain interfaces, as the uncompressed images quickly become unrecognizable.

Figure 9.3: Image compression using (a) digital and (b) resistive hardware. The RCAs have dimensions 64x128 and parameters as in [27, 47, 48].

RCAs with large dimensions have to be leveraged in order to gain performance advantages (power and latency) over digital implementations. Consequently, large block sizes are required for the compression. In [27, 47, 48], RCAs with dimensions of 64x128 were used to process block sizes of 64x64. In contrast, small block sizes of 8x8 or 16x16 are commonly used in standard image and video compression formats. The small block sizes are needed to attain high image quality after uncompression. Nevertheless, the errors introduced by larger block sizes may be relatively minor for block sizes of 64x64. However, the errors are significant when the RCA dimensions are scaled to 512x512 and above.

Rethinking Image Compression using RCAs

In this paper, we propose to fundamentally rethink how to perform image compression using RCAs. The key idea is to restructure the computation within image compression to natively match the properties of the underlying resistive hardware, which allows the full potential of the emerging hardware to be unleashed.
The proposed computational restructuring is based on two observations: (i) Any number of linear transformations performed in series can by definition be restructured into a single linear transformation. Consequently, the 2D DCT in Eq (9.1) can be reconstructed into a single linear transformation (or MVM operation). (ii) We view the quantization performed by the ADCs as a “free” quantization operation that can be exploited to perform efficient computation. In contrast, quantization performed by ADCs is commonly viewed as a source of errors that should be minimized.
These two observations enable the 2D DCT step, the quantization step, and the zig-zag reordering step to be integrated into a single analog MVM operation, which is illustrated in Figure 9.4(a). The integration is facilitated by 2D DCT reconstruction, frequency spectrum optimization, and quantization optimization, which are shown in (b) to (d) of Figure 9.4.

Figure 9.4: (a) Flow of proposed image compression. (b) Overview of 2D DCT reconstruction, (c) frequency spectrum optimization, and (d) quantization optimization.

The 2D DCT reconstruction involves reconstructing the 2D DCT into a single larger linear transformation, which is illustrated in Figure 9.4(b). The reconstruction solves the two main challenges in the previous works [27, 47, 48]. (i) The sensitivity to errors is reduced as the series computation is eliminated. (ii) The reconstruction allows RCAs with large dimensions to efficiently process small block sizes. This translates into significant improvements in terms of latency, power, and area. The details of the reconstruction and the advantages are provided in Section 9. The reconstruction also facilitates the explicit computation of each frequency coefficient from the spatial representation, which opens the door for frequency spectrum optimization and quantization optimization.
Frequency spectrum optimization involves first reordering the rows in the reconstructed DCT matrix to perform the zig-zag reordering for “free”. This arranges the frequency coefficients from low to high frequency. Next, we propose to prune the less important high frequency coefficients, which is illustrated in Figure 9.4(c). Interestingly, the optimization improves image quality while simultaneously reducing hardware overheads due to the array parasitics in the RCAs. The details are provided in Section 9.

Quantization optimization is based on configuring the ADCs to perform the quantization step for “free”, which is shown in Figure 9.4(d). In particular, the bit-accuracy of each ADC is specified based on the corresponding entry in the quantization table. This allows the requirements on the domain interfaces to be reduced to an absolute minimum, which translates into significant savings in terms of power and area. Intuitively, it is wasteful to compute each frequency coefficient with high precision and then quantize it to low precision in order to save memory. The technique is explained in Section 9.

Proposed Image Compression

In this section, we provide the details of our proposed image compression consisting of 2D DCT
reconstruction, frequency spectrum optimization, and quantization optimization.

2D DCT Reconstruction

In this section, we explain the proposed 2D DCT reconstruction. An overview of the reconstruction
is followed by an analysis of the advantages and the detailed specification of the reconstructed
matrix.

Overview of reconstruction

The 2D DCT reconstruction involves restructuring the series matrix-matrix multiplication in Eq (9.1) into a single linear transformation, as follows:

\[
c = D x, \tag{9.6}
\]

where D is a reconstructed 2D DCT matrix. x and c are column-wise vector representations of the spatial and frequency coefficients X and C, respectively. If the reconstructed DCT matrix has dimensions NxN, the image block X and the block of frequency coefficients C both have dimensions √N x √N.
The partitioning and the 2D DCT step performed using the proposed reconstruction are shown in Figure 9.5. Given that the reconstructed DCT matrix has dimensions NxN, the image I is partitioned into image blocks X and frequency blocks C with dimensions √N x √N. In the example, it is assumed that the image I has dimensions NxN. Consequently, there is a total of N image and frequency blocks. Let the image and frequency blocks be denoted X_ij and C_ij, respectively, with 1 ≤ i ≤ √N and 1 ≤ j ≤ √N.

Figure 9.5: Proposed 2D DCT computation using reconstructed DCT matrix.

The image blocks are processed one by one into the corresponding frequency blocks, i.e., X_ij is processed into C_ij. Specifically, C_ij is obtained from X_ij by decomposing X_ij column-wise into a vector x. Next, the vector is passed to an RCA programmed with the matrix D to perform the computation c = Dx using an analog MVM operation, which is shown in the middle of Figure 9.5. The frequency block C_ij is obtained from the output vector c by organizing the elements in c into a block format.

In reality, there is no need to reorganize the vector c into the corresponding frequency block Ci j .
Using the subsequent optimization techniques, the output vector from the RCA will be the input
vector expected by the run-length encoding in the top-right of Figure 9.4(a) or step 5 in Figure 9.1.

Analysis of reconstruction

In Table 9.1, we analyze the number of MVM operations required to process an image of size NxN using an RCA with dimensions Nx2N (two resistive devices per matrix element). The direct mapping of DXD' to RCA hardware used in the previous works results in 2N MVM operations. First, N operations are used to compute XD'. Second, an additional N operations are required to compute D(XD'). In contrast, the proposed mapping only results in N MVM operations. The image X is decomposed into N blocks of dimensions √N x √N and each block is processed using a single MVM operation. Consequently, the restructuring directly reduces the number of MVM operations by 2X (from 2N to N), which translates into a 2X improvement in power, latency, and area. Moreover, no intermediate results are required to be stored. Furthermore, the robustness to errors is significantly improved because the series computation is eliminated, which results in only small errors in the computed frequency coefficients.

Table 9.1: Analysis of number of MVM operations.
Mapping technique        Direct   Proposed
Block size               NxN      √N x √N
# partitions             1        N
# MVMs per partition     2N       1
Total # MVM operations   2N       N

The reconstructed DCT matrix D

The reconstructed 2D DCT matrix D is defined with respect to a column-wise decomposition of X and C into x and c, respectively. Let the element on row i and column j in D be denoted D_ij and defined, as follows:

\[
D_{ij} = a_p \cdot a_q \cdot \cos\left[ \frac{\pi p (2t+1)}{2N} \right] \cdot \cos\left[ \frac{\pi q (2r+1)}{2N} \right], \tag{9.7}
\]

where 1 ≤ i ≤ N, 1 ≤ j ≤ N. q = i/N and r = j/N where / is integer division. p = mod(i, N) and t = mod(j, N) where mod is the modulus operator. The constant a_k is defined, as follows:

\[
a_k =
\begin{cases}
\frac{1}{\sqrt{N}}, & k = 0, \\
\sqrt{\frac{2}{N}}, & k \neq 0.
\end{cases}
\]
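The equivalence between the reconstructed matrix and the series computation DXD' can be checked numerically. The Python sketch below makes two assumptions that should be read as an interpretation rather than as the exact notation above: the indices i and j are zero-based, and the block side n (the √N of the preceding discussion) takes the place of N inside the cosines, the index arithmetic, and the constants a_k. The function names and the 4x4 toy block are hypothetical.

```python
import math

def a(k, n):
    """Normalization constant a_k for block side n."""
    return math.sqrt((1 if k == 0 else 2) / n)

def reconstructed_dct(n):
    """(n*n) x (n*n) matrix following Eq. (9.7) for n x n blocks (zero-based i, j)."""
    size = n * n
    mat = []
    for i in range(size):
        q, p = divmod(i, n)        # q = i / n (integer division), p = mod(i, n)
        row = []
        for j in range(size):
            r, t = divmod(j, n)    # r = j / n (integer division), t = mod(j, n)
            row.append(a(p, n) * a(q, n)
                       * math.cos(math.pi * p * (2 * t + 1) / (2 * n))
                       * math.cos(math.pi * q * (2 * r + 1) / (2 * n)))
        mat.append(row)
    return mat

def vec_columnwise(block):
    n = len(block)
    return [block[row][col] for col in range(n) for row in range(n)]

def separable_2d_dct(block):
    """Reference result C = D X D' using the DCT matrix of Eq. (9.2)."""
    n = len(block)
    d = [[a(i, n) * math.cos(math.pi * (2 * j + 1) * i / (2 * n)) for j in range(n)]
         for i in range(n)]
    dx = [[sum(d[i][k] * block[k][j] for k in range(n)) for j in range(n)]
          for i in range(n)]
    return [[sum(dx[i][k] * d[j][k] for k in range(n)) for j in range(n)]
            for i in range(n)]

block = [[52, 55, 61, 66], [63, 59, 55, 90], [62, 59, 68, 113], [63, 58, 71, 122]]
x = vec_columnwise(block)
c = [sum(m * v for m, v in zip(row, x)) for row in reconstructed_dct(4)]  # one MVM
reference = vec_columnwise(separable_2d_dct(block))
max_diff = max(abs(u - v) for u, v in zip(c, reference))
print(max_diff)
```

Under these assumptions, the reconstructed matrix equals the Kronecker product of the DCT matrix with itself, so the single MVM reproduces the column-wise vectorization of DXD' up to floating-point rounding.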

Frequency Spectrum Optimization

In this section, we explain the proposed frequency spectrum optimization. The frequency spectrum
optimization consists of a frequency reordering step and frequency spectrum pruning step, which
is illustrated in Figure 9.6.

Frequency reordering

The input to the frequency reordering step is the reconstructed DCT matrix D. The output vector c
contains the frequency coefficients arranged in a column-wise order with respect to the frequency
block C, which is illustrated in the top right of Figure 9.6(a). However, the run-length encoding
expects the frequency coefficients to be organized from low to high using the zig-zag pattern in
the bottom right of Figure 9.6(a). We observe that the elements in the output vector can be
reordered without any overhead by simply permuting the corresponding rows in the reconstructed
DCT matrix. Consequently, the frequency reordering step consists of reordering the rows in D
such that a matrix $\tilde{D}$ is obtained where the output elements are arranged with respect to the
zig-zag pattern expected by the run-length encoding. The zig-zag reordering in Figure 9.6(a)
is therefore performed for “free”.

Figure 9.6: (a) Frequency reordering and (b) frequency spectrum pruning.
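The row permutation can be sketched as follows. The sketch assumes the JPEG-style zig-zag traversal and assumes that row i of D produces the coefficient at (row, col) of the frequency block with i = col · n + row (column-wise flattening, zero-based). Both function names are hypothetical.

```python
def zigzag_positions(n):
    """(row, col) positions of an n x n block in JPEG-style zig-zag order."""
    order = []
    for s in range(2 * n - 1):                 # s = row + col selects one anti-diagonal
        rows = range(max(0, s - n + 1), min(s, n - 1) + 1)
        # Even diagonals are walked bottom-to-top, odd diagonals top-to-bottom.
        for row in (reversed(rows) if s % 2 == 0 else rows):
            order.append((row, s - row))
    return order

def reorder_rows(d, n):
    """Permute the rows of D so that output k is the k-th zig-zag coefficient."""
    return [d[col * n + row] for (row, col) in zigzag_positions(n)]

# Label each row of a 9 x 1 "matrix" with its original index to expose the permutation.
labelled = [[i] for i in range(9)]
permuted = reorder_rows(labelled, 3)
permutation = [row[0] for row in permuted]
```

Because only the order of the rows changes, the permuted matrix is programmed into the RCA once and no reordering hardware is needed at run-time.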

Frequency spectrum pruning

The frequency spectrum pruning involves computing only the $N_f$ lowest frequency coefficients of
an image block X. The remaining $(N - N_f)$ frequency coefficients are pruned (or set to zero).
Intuitively, the hardware cost is reduced when only a subset of the frequency coefficients is
computed. In digital hardware, frequency spectrum pruning techniques have demonstrated a smooth
trade-off between image quality and hardware cost. Interestingly, when the image compression is
performed using RCAs, substantial reductions in overheads can be obtained while at the same time
improving the image quality. The savings can be significant because an expensive ADC is used to
measure each frequency coefficient.
Each row in the reconstructed DCT matrix $\tilde{D}$ is used to compute a frequency coefficient in C.
Consequently, the frequency spectrum pruning involves transforming $\tilde{D}$ into a new matrix
$\tilde{D}_f$ with dimensions $N_f \times N$, which is illustrated in Figure 9.6(b). The transformation
automatically ensures that only the desired $N_f$ frequency coefficients are computed.
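A minimal sketch of the pruning step, assuming the rows of the matrix are already in zig-zag (low to high frequency) order. The helper names are hypothetical, and the zero padding only mimics what the decoder would see for the pruned coefficients.

```python
def prune_rows(d_tilde, n_f):
    """Keep the n_f lowest-frequency rows of the zig-zag ordered matrix."""
    if not 1 <= n_f <= len(d_tilde):
        raise ValueError("n_f must be between 1 and the number of rows")
    return [list(row) for row in d_tilde[:n_f]]

def mvm(matrix, x):
    """Matrix-vector multiplication, one output per matrix row."""
    return [sum(w * v for w, v in zip(row, x)) for row in matrix]

def pruned_coefficients(d_tilde, x, n_f):
    """Compute n_f coefficients and fill the pruned ones with zeros."""
    kept = mvm(prune_rows(d_tilde, n_f), x)
    return kept + [0.0] * (len(d_tilde) - n_f)

toy_matrix = [[1, 0, 0], [0, 2, 0], [0, 0, 3]]   # toy values, not a DCT matrix
result = pruned_coefficients(toy_matrix, [1, 1, 1], 2)
```

Each removed row corresponds to one output that no longer has to be measured.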

Analysis of frequency spectrum pruning

In this section, we first analyze the trade-off between frequency errors and analog errors that is
introduced by the pruning of frequency coefficients. All the errors are measured in terms of MSE.
Next, we analyze the implicit trade-off between image quality and overheads in terms of power
and area.
Let the errors introduced when only a subset of the frequency coefficients is computed be called
frequency errors. Intuitively, the magnitude of the frequency errors increases when fewer
frequency coefficients are computed. Nevertheless, the image quality is gracefully
degraded with respect to the number of discarded coefficients. This is because we choose
to discard the highest part of the frequency spectrum. In contrast, the impact of analog errors
is reduced when fewer frequency coefficients are computed. The explanation is that RCAs with
smaller dimensions introduce smaller analog errors because there is less IR-drop over the array
parasitics [50]. Consequently, there exists a trade-off between analog errors and frequency errors
that is governed by the selected number of frequency coefficients $N_f$. Let the combination of the
frequency errors and the analog errors be equal to the total errors.
In Figure 9.7, we plot the errors with respect to the ratio $N_f/N$. The total errors and the frequency
errors are obtained by compressing an image using an RCA and digital hardware, respectively, with a
subset $N_f/N$ of the frequency components. Next, the images are uncompressed and the
errors are measured in terms of MSE using Eq (9.3). The analog errors are equal to the difference
between the total errors and the frequency errors. The figure shows that the frequency errors
increase when the ratio $N_f/N$ is reduced. In contrast, the analog errors decrease when $N_f/N$ is
reduced. Consequently, when $N_f/N$ is lowered from 1, the total errors first decrease until a turning
point from where the errors start to increase rapidly, which is shown using a blue line in
Figure 9.7. Let $N_f^*$ be the number of frequency coefficients at the turning point. In other words,
$N_f^*$ is the number of frequency coefficients that maximizes the image quality.

Figure 9.7: The figure shows the total, analog, and frequency errors in terms of MSE with respect
to only computing $N_f$ of the N frequency coefficients. The trade-off is shown with respect to a
reconstructed DCT matrix with dimensions (a) 64x64, (b) 144x144, and (c) 256x256.
Now we turn our attention to the trade-off between image quality and overheads based
on $N_f$, which is shown in Table 9.2. The key observation is that the highest image quality is
obtained when $N_f$ is equal to $N_f^*$. Consequently, power and area can be saved while at the same
time improving the image quality. Additional power and area savings can intuitively be obtained
at the expense of image quality by setting $N_f$ below $N_f^*$. However, it is never beneficial to use
$N_f > N_f^*$, as both the image quality and the power/area are worse than for $N_f$ equal to $N_f^*$. In our
implementation, we set $N_f$ to an estimate of $N_f^*$.

Table 9.2: Image quality and performance (power and area) with respect to the selected frequency
spectrum $N_f$. The table shows that the ideal frequency spectrum is in the range $[1, N_f^*]$.

Frequency spectrum N_f      Image quality    Overhead
with respect to N_f^*                        (power/area)
larger                      degraded         high
equal                       highest          medium
smaller                     degraded         small
We also note that the frequency spectrum pruning can be applied using RCAs with fixed dimensions.
This would be realized by mapping the matrix $\tilde{D}_f$ to the bottom-left corner of the RCA. Next, all
resistive devices that are not used would be programmed to have the maximum resistance, which
greatly reduces the analog errors by reducing the amount of IR-drop in the RCA.

Quantization optimization

The quantization optimization consists of ADC based quantization and hardware friendly ADC
based quantization.

ADC based quantization

We observe that there exists an equivalency between the digital quantization performed by a
quantization table and analog quantization performed by an ADC, which is shown in Figure 9.8. In this
section, we explain how to exploit this equivalency to perform quantization operations for “free”
by appropriately configuring the ADCs, which intuitively allows the complexity of the ADCs to
be reduced to an absolute minimum. It would obviously be very wasteful to measure the analog
signal with high precision and then quantize the digital results (into low precision) to save memory
(or improve the BPP metric).

Figure 9.8: (a) Quantization in the digital domain using a quantization table qTable. (b)
Quantization in the analog domain using ADCs.

The quantization step involves quantizing each frequency coefficient in a frequency block with the
corresponding entry in the quantization table. The quantization is motivated by the fact that the
subsequent encoding can be more effective when most coefficients are small or preferably equal to
zero. Let $c_{ij}$, $c^q_{ij}$, and $q_{ij}$ be the frequency coefficient, the quantized frequency coefficient,
and the entry in the quantization table on row i and column j with respect to a frequency block C.
The quantized frequency coefficients are computed, as follows:

$$c^q_{ij} = \mathrm{round}\left(\frac{c_{ij}}{q_{ij}}\right), \tag{9.8}$$

where round(.) is the rounding operator. This is equivalent to defining quantization levels, as
follows:

$$q_k = \frac{1}{2} q_{ij} + k \Delta q, \quad k \in \{\ldots, -1, 0, 1, \ldots\}, \qquad \Delta q = q_{ij}, \tag{9.9}$$

where $\Delta q$ is the distance between two adjacent quantization levels. $q_k$ are the boundaries between
the quantization levels. The output after quantization is k if the input number is within the range
$[q_{k-1}, q_k]$.
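The equivalence between Eq (9.8) and the level boundaries in Eq (9.9) can be checked numerically with the short sketch below. The handling of ties (inputs that fall exactly on a boundary are rounded up) is our assumption, as the text does not specify it. Both function names are hypothetical.

```python
import math

def quantize_round(c, q):
    """Eq (9.8) with round-half-up: round(c / q)."""
    return math.floor(c / q + 0.5)

def quantize_levels(c, q):
    """Eq (9.9): find k such that q_(k-1) <= c < q_k, with q_k = q/2 + k*q."""
    k = 0
    while c >= q / 2 + k * q:          # c is at or above boundary q_k: move up
        k += 1
    while c < q / 2 + (k - 1) * q:     # c is below boundary q_(k-1): move down
        k -= 1
    return k

# Compare both definitions on a grid of inputs for two table entries.
mismatches = [
    (c, q)
    for q in (10, 16)
    for c in (i * 0.25 for i in range(-200, 201))
    if quantize_round(c, q) != quantize_levels(c, q)
]
```

An ADC whose comparison thresholds sit on the boundaries $q_k$ therefore outputs the quantized coefficient directly.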
A differential ADC with a bit-accuracy of b compares an analog input voltage with $2^b - 1$ reference
voltages and outputs a b-bit binary number. The reference voltages are uniformly distributed
between a low and a high reference voltage $(v_L, v_H)$. The reference voltage levels $v_k$ are defined, as
follows:

$$v_k = v_L + k \cdot \Delta v, \quad k \in 0 \text{ to } 2^{(b-1)}, \qquad \Delta v = \frac{v_H - v_L}{2^{(b-1)}}, \tag{9.10}$$

where $\Delta v$ is the distance between two adjacent voltage levels. The output from the ADC is equal
to k if the input signal is within the voltage range $[v_{k-1}, v_k]$.
It can be observed that there exists an equivalency between the quantization performed by
a quantization table and an ADC. Therefore, by appropriately specifying the reference voltages to
the ADC, the quantization operation can be performed for “free”. In particular, $\Delta v$
must be specified to be proportional to $\Delta q$. The main difference between the two types of
quantization is that the ADC based quantization requires the value range of the analog input signal
to be defined. We solve this issue by deriving the value range of the frequency coefficients in the
digital domain. Next, the digital value range is translated into an analog value range. Given the
analog value range, it is straightforward to specify the parameters $v_L$, $v_H$, and b to realize any entry
in a quantization table. As a result, the complexity of the ADCs
is reduced to an absolute minimum with respect to the specified quantization table.
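One possible way to derive ADC parameters from a quantization table entry is sketched below. This is a simplified accounting under our own assumptions: the coefficient range and the scale factor `volts_per_unit` (volts per unit of coefficient value) are hypothetical inputs, the bit count is simply the number of bits needed to enumerate the output levels, and the differential encoding of Eq (9.10) is not modeled.

```python
import math

def adc_config(c_min, c_max, q, volts_per_unit):
    """Return (bits, delta_v, v_low, v_high) for one quantization table entry q."""
    k_min = math.floor(c_min / q + 0.5)       # smallest quantized output
    k_max = math.floor(c_max / q + 0.5)       # largest quantized output
    codes = k_max - k_min + 1                 # number of distinct outputs
    bits = max(1, (codes - 1).bit_length())   # smallest b with 2**b >= codes
    delta_v = volts_per_unit * q              # delta_v proportional to delta_q
    v_low = volts_per_unit * (k_min - 0.5) * q
    v_high = v_low + codes * delta_v
    return bits, delta_v, v_low, v_high

# Hypothetical coefficient range; a small and a large table entry.
low_freq = adc_config(-1024, 1023, 16, 1e-3)
high_freq = adc_config(-1024, 1023, 99, 1e-3)
```

With these illustrative inputs, the entry 16 needs 8 bits while the entry 99 needs only 5 bits, which shows why coarsely quantized coefficients can be measured with simpler ADCs.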


Hardware friendly ADC based quantization

In this section, we propose a hardware friendly implementation of the ADC based quantization.
There are two main limitations of the ADC based quantization in the previous section. First, if
the complexity of the ADCs is fixed at fabrication based on a specific quantization table, it is
impossible to adjust the ADCs to obtain a higher image quality at run-time. Second, the technique
requires that different reference voltages $v_L$ and $v_H$ are provided as an input to each differential
ADC. This requires each ADC to have two internal DACs, which is illustrated in Figure 9.9(a). The
separate DACs naturally introduce significant power and area overheads.

(c) Ideal quantization table:

16  11  10  16   24   40   51   61
12  12  14  19   26   58   60   55
14  13  16  24   40   57   69   56
14  17  22  29   51   87   80   62
18  22  37  56   68  109  103   77
24  35  55  64   81  104  113   92
49  64  78  87  103  121  120  101
72  92  95  98  112  100  103   99

(d) Shared quantization table (M = 8):

10  10  10  10  13  13  24  24
10  10  10  13  22  24  24  55
10  13  13  22  24  24  55  55
13  13  22  24  24  55  55  62
13  22  22  55  55  55  62  62
22  22  55  55  55  62  62  92
22  55  55  55  62  92  92  92
55  55  62  62  92  92  92  92

Figure 9.9: (a) Two differential ADCs with individual reference voltages. (b) Two differential
ADCs with shared reference voltages. (c) An ideal quantization table. (d) Shared quantization
table with respect to a group size (M) of eight.

We propose to circumvent these two limitations by attaching an ADC with the maximum
bit-accuracy (8 bits) to each bitline. This allows the ADCs to be calibrated to deliver variable image
quality. Next, groups of adjacent ADCs are set to share pairs of DACs that provide the reference
voltages $(v_L, v_H)$, which is illustrated in Figure 9.9(b). Specifically, we divide the ADCs into groups
of M and let the ADCs in each group share the same reference voltages. The sharing intuitively
reduces the power and area overheads. On the other hand, the sharing of the reference voltages
requires the corresponding entries in the quantization table to be shared. Consequently,
we convert the original quantization table into a shared quantization table, which is illustrated in
(c) and (d) of Figure 9.9. The shared quantization table is constructed by setting each group of M
quantization entries to be equal to the minimum of the M entries in the original quantization table.
This ensures that the image quality is not degraded by the sharing. Moreover, we propose to power
gate the ADC groups to save power if the full bit-accuracy is not required. A k-bit ADC requires
k clock cycles to provide the k-bit output. If only p bits are required to be computed, the ADC can
be gated for k − p cycles. For example, let the required bit-accuracy for the low and high frequency
components be 8 and 5 bits, respectively. Consequently, there is an opportunity to power gate the
ADCs used to compute the high frequency coefficients for 3 cycles.
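The construction of the shared entries and the power gating rule can be sketched as follows. We assume that the entries are listed in the order of the bitlines, which we take to be the zig-zag order after frequency reordering; the example input is the first 16 entries of the table in Figure 9.9(c) in that order. The function names are hypothetical.

```python
def shared_entries(q_entries, group_size):
    """Replace every group of M consecutive entries by the group minimum."""
    shared = []
    for start in range(0, len(q_entries), group_size):
        group = q_entries[start:start + group_size]
        shared.extend([min(group)] * len(group))
    return shared

def gated_cycles(adc_bits, required_bits):
    """Clock cycles during which a k-bit ADC can be gated when p bits suffice."""
    if not 0 <= required_bits <= adc_bits:
        raise ValueError("required_bits must be between 0 and adc_bits")
    return adc_bits - required_bits

# First 16 entries of the ideal table, read in zig-zag order (assumed grouping).
first_entries = [16, 11, 12, 14, 12, 10, 16, 14, 13, 14, 18, 17, 16, 19, 24, 40]
shared = shared_entries(first_entries, 8)
cycles = gated_cycles(8, 5)
```

Taking the minimum means every coefficient in a group is quantized at least as finely as the original table requires, at the cost of a somewhat higher bit-accuracy for some ADCs in the group.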
In Figure 9.10, we plot the power and area of the output interface, i.e., the differential ADCs and
the DACs used to provide the reference voltages to the ADCs, with respect to the group size M.
In Figure 9.10(a), it can be observed that the power of the DACs used to provide the reference
voltages is reduced when the group size is increased. At the same time, the power consumption of
the ADCs is increased because there are fewer opportunities for power gating. Consequently, it
is not surprising that the total power is reduced until a turning point from where the power starts to
increase. Let the group size that minimizes the power consumption be denoted $M^*$. We illustrate
a breakdown of the area in Figure 9.10(b). The area of the DACs providing the reference voltages
is reduced when the group size is increased. However, as the area is dominated by the ADCs
(constant), the total area is only slightly reduced.
Figure 9.10: The output interface (a) power and (b) area breakdown based on the ADC group size
M.

We summarize our performance observations with respect to the group size M in Table 9.3. The
total power is minimum when the group size M is equal to $M^*$. The power consumption is degraded
(or larger) if M is smaller or larger than $M^*$. The total area is only slightly decreased when M is
increased due to the increased degree of sharing.

Table 9.3: Performance in power and area with respect to the group size (M) of ADCs that share
reference voltages from the same DACs. $M^*$ is the group size that minimizes the power consumption.

Group size M              Total power    Total area
with respect to M^*
smaller                   larger         high
equal                     smallest       medium
larger                    larger         small

Experimental Evaluation

The experimental results are obtained using a quad core 3.4 GHz Linux machine with 32GB of
memory. The images in the evaluation are subsets of the images within the Berkeley Segmentation
Dataset [1] and Challenge on Learned Image Compression (CLIC) mobile and professional
datasets [2]. A summary of the properties of the evaluated images is provided in Table 9.4.
The images in the experimental results section are obtained by performing compression using
RCA hardware using the proposed flow in Figure 9.5(a) or the default flow in Figure 9.1.
Uncompression of the images is performed by reversing the flow using digital hardware. The compression
is evaluated in terms of image quality, compression ratio, latency, power, and area. Specifically,
the image quality is evaluated in terms of MSE, PSNR, and SSIM after uncompression. Image
quality degradations stem from both the quantization step and the errors introduced in the analog
computation. The compression ratio is evaluated in terms of BPP after the run-length encoding.
The BPP is mainly governed by the quantization table used in the quantization step.

Table 9.4: Properties of the data sets of input images.

Dataset                  Size         Average dimensions
(name)                   (#images)    rows      cols
Berkeley Segmentation    100          369       433
CLIC mobile              30           1688      1785
CLIC professional        30           1314      1961

The MVM operations that are accelerated using RCAs are evaluated using circuit simulation with
SPICE level accuracy. The circuit simulations explicitly capture the impact of array parasitics,
programming accuracy, and domain interface quantization errors. The circuit simulation is performed
using a custom simulator that exploits the sparse structure of the RCAs, which results in significant
(orders of magnitude) run-time improvements over HSPICE. The accuracy has been validated to be
equivalent to or higher than that of HSPICE. Using the parameters in Table 9.5, the experimental
setup has been shown to exhibit high correlation with results obtained using hardware prototypes [25, 47].
The conductance values of the resistive devices are programmed using the program and verify
techniques in [28].

Table 9.5: Parameters of the RCAs used in the experimental evaluation.

Property                         Value
Array block resistance           0.4 Ω
Input resistance                 100 Ω
Output resistance                100 Ω
Programmable resistance range    [2k, 2M] Ω
Bit accuracy                     6 bits

The performance in terms of power, latency, and area has been obtained by carefully combining
results reported in [48, 28, 58]. The power and area for a 64x128 crossbar, an 8-bit DAC, and an
8-bit differential ADC are shown in Table 9.6. The power and area for crossbars of other dimensions
are obtained by scaling the crossbar parameters with the number of cells in the crossbar. The power
and area of the DACs and ADCs are assumed to scale exponentially with the bit-accuracy. The latency
of an MVM operation with 8-bit ADCs is 100 ns.

Table 9.6: Power and area of crossbar and peripheral circuitry.

                    Bits    Power (mW)    Area (µm²)
Differential ADC    8       1.5           1178.8
DAC                 8       0.5           21.2
64x128 crossbar     n/a     4.8           400.0
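To make the scaling rules concrete, the sketch below combines the values in Table 9.6 into a rough power estimate for one RCA. Several points are our assumptions and are not stated in the text: one differential ADC per pair of bitlines, one input DAC per wordline, two reference DACs per ADC group, and a base-2 exponential law for the bit-accuracy scaling. The function names are hypothetical.

```python
import math

ADC_POWER_MW = 1.5     # 8-bit differential ADC (Table 9.6)
DAC_POWER_MW = 0.5     # 8-bit DAC (Table 9.6)
XBAR_POWER_MW = 4.8    # 64x128 crossbar (Table 9.6)

def scale_with_bits(value_at_8_bits, bits):
    """Assumed exponential law: the cost halves for every bit that is removed."""
    return value_at_8_bits * 2.0 ** (bits - 8)

def rca_power_mw(rows, cols, group_size):
    """Rough power of one RCA with 8-bit interfaces (see assumptions above)."""
    n_adc = cols // 2                                  # one ADC per bitline pair
    n_ref_dac = 2 * math.ceil(n_adc / group_size)      # (vL, vH) pair per group
    xbar = XBAR_POWER_MW * (rows * cols) / (64 * 128)  # scaled with cell count
    return (n_adc * ADC_POWER_MW + rows * DAC_POWER_MW
            + n_ref_dac * DAC_POWER_MW + xbar)

power_64x128 = rca_power_mw(64, 128, 8)
adc_power_5_bits = scale_with_bits(ADC_POWER_MW, 5)
```

Under these assumptions the 64x128 configuration evaluates to 140.8 mW, which coincides with the power listed for the D and R methods in Table 9.8. The accounting itself is not spelled out in the text, so the agreement should not be read as a confirmation of the assumptions.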

Evaluation of optimization techniques

Evaluation of 2D DCT reconstruction

We evaluate the impact of the proposed 2D reconstruction in Figure 9.11. The images are obtained
using RCAs with dimensions 64x128. The quantization step is disabled to demonstrate the
maximum image quality that can be achieved after uncompression. The reference images are shown
in Figure 9.11(a). The images obtained using the direct mapping in [27, 47, 48] are shown in
Figure 9.11(b). The images obtained using the proposed 2D DCT reconstruction are shown in
Figure 9.11(c). The reference images are of high quality. It can be observed that the images
obtained using the direct mapping are degraded by the compression. The degradation stems from the
amplification of errors in the second matrix-matrix multiplication. Moreover, the impact of using
the large block sizes is visible when examining the images in detail. The images obtained after the
proposed 2D DCT reconstruction are only slightly degraded with respect to the reference images.
This is because there is no amplification of errors and the reconstruction enables small block
sizes to be used. In fact, the reconstruction improves the robustness to any type of errors.
To further demonstrate the improvement in robustness to errors, the image quality with respect to
the bit-accuracy of the domain interfaces is shown in Figure 9.12. The figure shows that using
the direct mapping in [27, 47, 48], the image quality is quickly degraded when the bit-accuracy is
reduced from 8 to 4 bits. Using the proposed 2D DCT reconstruction, the image quality is more
smoothly degraded when the bit-accuracy of the domain interfaces is reduced. Hence, a lower
vulnerability to errors is demonstrated.

Table 9.7: Comparison of performance and overheads with and without 2D DCT reconstruction.

2D DCT             Performance               Storage of intermediate
reconstruction?    (power/area/latency)      data required?
No                 2X                        Yes
Yes                1X                        No

In Table 9.7, we compare the performance in terms of latency, power and area. Even when excluding the
extra overhead introduced by the storage of the intermediate data, the power, area, and latency are
reduced by 2X. These performance benefits are obtained because the
number of MVM operations is reduced from 2N to N, which was analyzed in detail in Section 9.

Figure 9.11: (a) Reference images. (b) Images obtained using the direct mapping in [27, 47, 48].
(c) Images obtained using the proposed 2D DCT reconstruction.

Based on the observed results, it is clearly advantageous to leverage the proposed 2D DCT
reconstruction because it improves both the image quality and the performance in terms of power, area,
and latency.

Figure 9.12: Sensitivity of the image quality with respect to the bit-accuracy of the DAC and ADC
domain interfaces (8, 6, and 4 bits). The images in (a) are obtained using the direct mapping and
the images in (b) are obtained using the proposed 2D DCT reconstruction.

Evaluation of frequency optimization

In this section, we analyze the impact of the frequency spectrum optimization in Section 9. The
analysis is focused on the frequency spectrum pruning because the frequency reordering only
avoids performing the zig-zag reordering using a specialized router. For the frequency spectrum
pruning, the estimated optimal $\hat{N}_f^*/N$ ratios are 0.8, 0.7, and 0.7 for $\tilde{D}$ matrices with
dimensions 64x64, 144x144, and 256x256, respectively. The ratios were determined by performing the
image compression using RCAs with different dimensions and selecting the ratio that minimized MSE in
Eq (9.3). It is not surprising that the $N_f^*/N$ ratio becomes smaller for RCAs with larger
dimensions, as larger RCAs are more severely impacted by IR-drop over the array parasitics [50].


The frequency spectrum pruning is evaluated in terms of image quality in Figure 9.13. The images
in the left column are obtained using the full frequency spectrum. The images in the right column
are obtained using a reduced frequency spectrum. The number of columns in the reconstructed
DCT matrix is 64, 144, and 256 for the top, middle, and bottom row, respectively. The dimensions
of the reconstructed DCT matrices ($\tilde{D}$ or $\tilde{D}_f$) and the MSE are shown below each image. It can be
observed that image quality is gracefully degraded when RCAs with larger dimensions are utilized.
The degradation stems from the IR-drop over the array parasitics. Note that the loss in image
quality is observed although the state-of-the-art technique of tuning the memristor conductance
values to compensate for the IR-drop is utilized [28]. It can also be seen that the frequency pruning
improves the image quality (or reduces MSE). The image quality improvements result from the
smaller analog errors that are introduced when the size of the RCAs is scaled down.
Next, we focus on evaluating the frequency spectrum pruning in terms of power and area for RCAs
with different dimensions. The evaluation is shown in (a) and (b) of Figure 9.14. The figure shows that
the pruning significantly reduces the power and area while $N_f$ is selected to maximize the image
quality. The improvements are slightly smaller than the $N_f/N$ ratio would suggest because only the
number of bitlines is reduced. However, large gains are still obtained because the ADCs used to
measure the outputs are more expensive in terms of overheads than the DACs used to provide the
inputs. For the RCAs with 144 or 256 inputs, it may be advantageous to accept a slight degradation
in image quality in order to significantly reduce the hardware overheads. The trade-off between the
image quality and the number of frequency coefficients for an RCA with 144 inputs was shown in
Figure 9.7(b). The trends for RCAs with 256 inputs are similar.
Given that the frequency spectrum pruning provides performance benefits without any degradation
in image quality, it can be concluded that it is always advantageous to apply frequency spectrum
optimization.

Dimensions of the reconstructed DCT matrices and MSE of the images in Figure 9.13:

Without pruning ($\tilde{D}$)      With pruning ($\tilde{D}_f$)
64x64,   MSE 2.4          64x52,   MSE 1.8
144x144, MSE 5.4          144x87,  MSE 5.1
256x256, MSE 8.7          256x154, MSE 6.8

Figure 9.13: The images in the left (right) column are obtained without (with) frequency spectrum
optimization. The dimension of the reconstructed 2D DCT matrix and the MSE are shown below
each figure.

Evaluation of quantization optimization

In this section, we evaluate the quantization optimization in Section 9.

Figure 9.14: Performance improvements in terms of (a) power and (b) area from frequency spectrum
optimization for reconstructed 2D DCT matrices with different dimensions.

The ADC based quantization is examined in Figure 9.15. We compare the proposed ADC based
quantization with using 8-bit ADCs and performing digital quantization. The evaluation is
performed in terms of image quality with respect to $q_{user}$ in Figure 9.15(a). Recall that the
quantization table is scaled with $q_{user}$ to balance compression ratio with image quality. The figure
shows that the MSE for both methods is correlated with $q_{user}$. For $q_{user}$ larger than 0.25, the MSE is
similar for both methods because the overall errors are dominated by the quantization specified by the
quantization table. However, for $q_{user}$ equal to 0.25, the ADC based quantization results in smaller
errors. This is because the 8-bit ADCs introduce larger errors than the digital quantization
for small values of $q_{user}$. The ADC based quantization would use ADCs with a bit-accuracy higher
than 8 to prevent this from occurring. We evaluate the normalized power with respect to $q_{user}$ in
Figure 9.15(b). The figure shows that the power consumption is constant when 8-bit ADCs are used
and quantization is performed in the digital domain. In contrast, the power consumption of ADC
based quantization is correlated with the value of $q_{user}$. This is because when
$q_{user}$ is equal to 1, the ADC based quantization utilizes 22/10/12/8 ADCs with a bit-accuracy of
5/6/7/8, respectively. (Frequency pruning is assumed to be used, i.e., there are only 52 ADCs.)

Figure 9.15: Comparison between proposed ADC based quantization and using 8-bit ADCs and
performing digital quantization. The comparison is evaluated in terms of MSE in (a) and power in
(b).
In summary, the ADC based quantization is quality configurable, i.e., the effort in power is
proportional to the desired image quality. With respect to the 8-bit ADC baseline, power is saved
when $q_{user}$ is set equal to or greater than 0.5 by downsizing the bit-accuracy of the ADCs. The
power savings are obtained without degrading the image quality in terms of MSE. In particular, the
figure shows that the technique is able to save up to 60% of the total power for larger values of
$q_{user}$. For $q_{user}$ equal to 0.25, the image quality is improved at the expense of increasing the power
consumption by sizing up some ADCs beyond the 8-bit baseline. The main limitation of the ADC
based quantization is that $q_{user}$ is restricted to a single value at the time of fabrication.
The hardware friendly ADC based quantization is evaluated in Figure 9.16. The parameter $q_{user}$
is set to 1 in the evaluation. The technique allows the domain interfaces to be configured with
respect to the image quality at run-time. The normalized performance in terms of power and
area is evaluated with respect to the group size in the figure. The group size M refers to the
number of ADCs that share the same reference voltages $v_L$ and $v_H$. In Figure 9.16(a), it can be
observed that the minimum power is achieved for M equal to 14. This is because M equal to 14
strikes the best balance between the power reductions obtained from the sharing of the reference
voltages and the power savings obtained from the power gating. The total RCA power is reported
in Figure 9.16(a). The power savings are smaller than in Figure 9.10 where only the power of
the output interface was reported. We evaluate the total area with respect to the group size in
Figure 9.16(b). It can be observed in the figure that only minor savings in terms of total area are
obtained with a higher degree of sharing. This is because the ADCs dominate the area of each
RCA. Although the minimum power is obtained for a group size of 14, it may be more practical
to utilize a group size of 8, which would ease delivering the reference signals to the ADCs. The
majority of the savings in terms of power consumption are still achieved.

Figure 9.16: Evaluation of group size selection (M) in terms of (a) power consumption and (b)
area.

Evaluation of proposed image compression

In this section, we evaluate the proposed image compression as a whole and provide comparisons
with previous studies. We perform the evaluation in terms of image quality, compression ratio,
latency, power, and area in Table 9.8. Note that a lower MSE indicates higher image quality while
a higher PSNR or SSIM indicates higher image quality. The reported latency in the table is the
average time for processing an image from the respective datasets. For the proposed image
compression this is equal to the number of image blocks multiplied by 100 ns. In the table, we evaluate
six different methodologies to clearly demonstrate the effectiveness of the proposed techniques.
“Ideal” denotes the performance obtained using floating point computation in digital hardware.
This method should be viewed as a reference point (or upper bound) on the image quality that
can be achieved. The “D” method stands for the direct mapping in [47, 48]. The “D-P” method
stands for the D method but with an implementation that is pipelined to maximize throughput [27].
The “R” method denotes our framework with only the 2D DCT reconstruction applied. The “RF”
method indicates the “R” method extended with frequency spectrum optimization. The “RFQ”
method is the “RF” method extended with the hardware friendly quantization. The normalized
performance of the different methods is shown in bold at the bottom of the table. For all the
methods, we assume that the RCAs are fabricated with an ADC group sharing size of 8. The RCAs have a
dimension of 64x128 and 64x104 without and with frequency pruning, respectively. Quantization
is performed with respect to the quantization table in Figure 9.9(c).

Table 9.8: Comparison of different image compression techniques.

Name           Method        Image quality            Compression   Latency   Power    Area
                             (MSE)   (PSNR)  (SSIM)   (BPP)         (ms)      (mW)     (mm²)
Berkeley       Ideal         73.7    30.3    0.930    3.6           -         -        -
segmentation   D [47, 48]    223.5   25.1    0.757    6.3           0.61      140.8    0.077
               D-P [27]      223.5   25.1    0.757    6.3           0.31      281.6    0.154
               R             89.1    29.4    0.908    3.2           0.25      140.8    0.077
               RF            84.8    29.6    0.913    3.3           0.25      121.8    0.063
               RFQ           83.2    29.7    0.916    3.3           0.25      107.9    0.063
CLIC           Ideal         22.3    35.6    0.957    2.0           -         -        -
mobile         D [47, 48]    140.7   26.8    0.742    3.8           9.59      140.8    0.077
               D-P [27]      140.7   26.8    0.742    3.8           4.80      281.6    0.154
               R             26.9    34.7    0.948    1.8           4.65      140.8    0.077
               RF            25.9    34.9    0.949    1.9           4.65      121.8    0.063
               RFQ           24.9    35.1    0.952    1.9           4.65      107.9    0.063
CLIC           Ideal         52.2    32.1    0.938    2.9           -         -        -
professional   D [47, 48]    213.8   25.0    0.689    6.0           8.38      140.8    0.077
               D-P [27]      213.8   25.0    0.689    6.0           4.19      281.6    0.154
               R             63.7    31.1    0.922    2.6           4.07      140.8    0.077
               RF            60.7    31.3    0.925    2.6           4.07      121.8    0.063
               RFQ           59.3    31.4    0.928    2.7           4.07      107.9    0.063
Norm.          Ideal         0.88    1.02    1.01     1.08          -         -        -
               D [47, 48]    4.29    0.79    0.75     2.17          2.06      1.30     1.23
               D-P [27]      4.29    0.79    0.75     2.17          1.03      2.61     2.45
               R             1.08    0.99    0.99     0.97          1.00      1.30     1.23
               RF            1.03    0.99    1.00     0.99          1.00      1.13     1.00
               RFQ           1.00    1.00    1.00     1.00          1.00      1.00     1.00
First, we evaluate the D method with respect to the Ideal method. The table shows that the D
method has 4.33X higher MSE and 23% and 26% smaller PSNR and SSIM than the ideal method.
The compression rate is 2.0X worse in terms of BPP. The degraded image quality stems from the
errors that are introduced when the RCAs are leveraged to perform the MVM operations. We believe
that the worse compression ratio is caused by the errors introducing additional non-zero frequency
coefficients. Every non-zero coefficient requires a minimum of 9 bits to be stored. Compared with
the D method, the D-P method achieves the exact same performance in terms of image quality and
compression. However, the latency is about two times lower and the power and area are two times
higher due to a parallel implementation.
Compared with the D and D-P method, the R method improves MSE, PSNR, and SSIM by 25%,
26%, and 36%, respectively. The degree of compression is improved by 46.0%. The improvements
in image quality arise because the series matrix-matrix multiplication is circumvented by
the computational reconstruction. The BPP is reduced due to the improved robustness to
variations. Compared with the D method, the latency is reduced by 51%. Compared with the D-P
method, power and area are reduced by 50%. The R method has slightly smaller (3%) average
latency than the D-P method because the image block size is reduced from 64x64 to 8x8.
Consequently, less padding is required to make the image dimensions match a multiple of the block size
dimensions, i.e., the amount of redundant computation is reduced.
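The effect of the block size on padding and latency can be illustrated with a back-of-the-envelope sketch. It assumes a single image channel with the average Berkeley dimensions from Table 9.4 and one MVM of 100 ns per block for the proposed mapping; the function names are hypothetical, and the numbers are not meant to reproduce the averages in Table 9.8.

```python
import math

MVM_LATENCY_NS = 100  # latency of one MVM operation with 8-bit ADCs

def num_blocks(rows, cols, block):
    """Number of block x block tiles needed to cover the padded image."""
    return math.ceil(rows / block) * math.ceil(cols / block)

def padded_pixels(rows, cols, block):
    """Pixels added by padding the image to a multiple of the block size."""
    return num_blocks(rows, cols, block) * block * block - rows * cols

def latency_ms(rows, cols, block, mvms_per_block=1):
    """Latency when the blocks are processed one after another."""
    return num_blocks(rows, cols, block) * mvms_per_block * MVM_LATENCY_NS * 1e-6

blocks_8 = num_blocks(369, 433, 8)
padding_8 = padded_pixels(369, 433, 8)
padding_64 = padded_pixels(369, 433, 64)
latency_8 = latency_ms(369, 433, 8)
```

For this example, 8x8 blocks require 5663 padded pixels whereas 64x64 blocks require 12255, so the smaller block size roughly halves the redundant computation.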


Compared with the R method, the RF method improves MSE, PSNR, and SSIM by 4%, 0.6%, and
0.2%, respectively. The image quality improves because frequency pruning reduces the amount of
error introduced in the analog computation. Recall that frequency pruning reduces the dimension
of the RCA, which in turn reduces the negative impact of IR-drop. The (2%) increase in
compression ratio may stem from the improved image quality. The power consumption is reduced
by 13.2% because RCAs with 20.0% fewer bitlines are utilized.
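
The effect of frequency pruning on the array dimension can be sketched as follows. The rule used
here, keeping the first 52 coefficients of an 8x8 block in zigzag order, is hypothetical and was
chosen only because it reproduces the 64x104 dimension quoted earlier; the actual pruning
criterion may differ. The factor of two columns per coefficient is likewise an assumption (for
example, one column pair per signed output).

```python
def zigzag_order(n):
    """(row, col) frequency indices of an n x n block in JPEG-style zigzag order."""
    def key(index):
        u, v = index
        diagonal = u + v
        # Odd anti-diagonals are traversed with increasing row, even ones with increasing column.
        return (diagonal, u if diagonal % 2 else v)
    return sorted(((u, v) for u in range(n) for v in range(n)), key=key)

def crossbar_shape(n, kept, columns_per_output=2):
    """Rows: pixels in the block. Columns: kept coefficients times columns per coefficient."""
    return n * n, columns_per_output * len(kept)

all_coefficients = zigzag_order(8)
kept_coefficients = all_coefficients[:52]  # hypothetical rule: drop the 12 highest frequencies

print(crossbar_shape(8, all_coefficients))   # unpruned array
print(crossbar_shape(8, kept_coefficients))  # pruned array
```

Under these assumptions the unpruned array is 64x128 and the pruned array is 64x104. Fewer
columns mean fewer ADC conversions per block.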
Compared with the RF method, the RFQ method results in similar performance in terms of image
quality and compression. However, the power consumption is reduced by 10.7% on average.
The savings arise because the RF method uses 8-bit ADCs and performs quantization in the
digital domain, whereas the RFQ method performs the quantization using the ADCs. This allows
the 8-bit ADCs used to compute the high frequency coefficients to be power gated in a few clock
cycles.
In summary, the proposed methods enable image compression to be performed using RCAs
while only slightly degrading the image quality compared with digital hardware. The RFQ method
is compared with the Ideal method in terms of image quality in Figure 9.17. Although the MSE
is slightly higher and the PSNR and SSIM are a bit lower, the images look very similar to the
human eye. Compared with the previous work in [27, 47, 48], the image quality is improved while
latency and power are reduced by 51% and 24% relative to the D method, or by 3% and 61%
relative to the D-P method, respectively. The benefits are obtained by reconstructing the
compression such that the dominating computational kernels are aligned with the properties of
the underlying hardware.
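
The 61% power figure is consistent with compounding the per-step reductions reported earlier in
this section: 50% (R versus D-P), 13.2% (RF versus R), and 10.7% (RFQ versus RF). The short
sketch below performs this arithmetic; the function name is hypothetical, and the inputs are the
rounded averages quoted in the text rather than the raw table entries.

```python
def compound_reduction(*fractions):
    """Overall fractional reduction after applying several reductions one after another."""
    remaining = 1.0
    for fraction in fractions:
        remaining *= 1.0 - fraction
    return 1.0 - remaining

# Relative to the D-P method: R (50%), then RF (13.2%), then RFQ (10.7%).
power_vs_dp = compound_reduction(0.50, 0.132, 0.107)

# Relative to the D method, which has half the power of D-P: only the last two steps apply.
power_vs_d = compound_reduction(0.132, 0.107)

print(f"{power_vs_dp:.1%}, {power_vs_d:.1%}")
```

The first value is about 61%. The second is about 22%, close to but not identical to the 24%
reported, which suggests that the reported numbers were taken directly from the table rather
than from multiplying rounded averages.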
(a) Image compression using digital hardware.

(b) Proposed image compression using resistive hardware.
Figure 9.17: Comparison of image quality obtained using (a) digital hardware and (b) resistive
hardware. Quantization is performed using the table in Figure 9.9.

Next, we evaluate the proposed compression with respect to the size of the blocks that are
processed. We compare the normalized MSE, BPP, power, and area in Figure 9.18. A uniform
quantization table in which every entry is 10 is used for all block sizes. It can be observed in
Figure 9.18(a) that the MSE is degraded and the BPP is improved when the block size is scaled
up. The power and area are shown in Figure 9.18(b) and are close to constant with respect to the
block size. The reason is that an RCA used to process 16x16 blocks has 4X larger domain
interfaces than one used to process 8x8 blocks but, at the same time, processes a 4X larger block.
The power and area savings obtained when the block size is scaled from 8x8 to 12x12 arise
because additional frequency coefficients can be pruned. Nevertheless, compared with digital
hardware, it is highly advantageous that the computational effort is constant (at worst) with
respect to the block size. The digital computational effort of 2D DCT is not constant with
respect to the block size.
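
The scaling argument can be made concrete with a rough per-pixel count. The sketch below assumes
a direct (non-fast) separable 2D DCT on digital hardware and a single-MVM crossbar mapping with
one DAC per input pixel and one ADC per output coefficient. Both cost models and all function
names are illustrative assumptions; frequency pruning, converter sharing, and the number of
resistive devices are not modeled.

```python
def digital_macs_per_pixel(n):
    """Direct separable 2D DCT on an n x n block: two n x n matrix products, 2 * n**3 MACs."""
    return 2 * n ** 3 / (n * n)

def interface_converters(n):
    """Assumed domain interfaces of a single-MVM array: n*n DACs plus n*n ADCs."""
    return 2 * n * n

def interface_converters_per_pixel(n):
    """Converters per processed pixel, which stays constant as the block grows."""
    return interface_converters(n) / (n * n)

for n in (8, 12, 16):
    print(n, digital_macs_per_pixel(n), interface_converters_per_pixel(n))
```

Under these assumptions the digital effort per pixel grows linearly with the block size (16, 24,
and 32 MACs), whereas the number of domain interfaces grows by 4X from 8x8 to 16x16 while the
number of pixels per block also grows by 4X, leaving the per-pixel interface cost unchanged.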
Figure 9.18: The image quality and compression ratio are evaluated with respect to the processed
block size in (a). The normalized power and area are evaluated with respect to the block size in (b).

Compared with performing image processing with digital hardware, we estimate the computation
to be at least 44X more energy-efficient. A detailed comparison between an RCA hardware
prototype and an application-specific integrated circuit (ASIC) was performed in [48, 59]. The
study reported a 17X improvement in energy-efficiency while obtaining similar image quality. The
techniques proposed in this work further improve the energy-efficiency by 2.61X, which gives a
combined improvement of about 44X (17 x 2.61 = 44.4). Moreover, the image quality is improved
at the same time. Therefore, the case for leveraging emerging resistive hardware is even more
compelling than before.

Conclusion

Computation of the 2D DCT is the bottleneck of real-time image and video compression. An
emerging solution for scalably performing the 2D DCT on edge devices is to accelerate the
computation using RCAs. In this work, we propose to rethink how image compression is performed
on emerging hardware by reconstructing the computational kernels so that they are aligned with
the underlying properties of the resistive hardware. This results in significant improvements in
image quality, robustness to errors, power, area, and latency.


CHAPTER 10: FUTURE WORK

Future investigation will focus on advanced data layout organization techniques and on system
software support for resistive computing systems.

Advanced Data Layout Organization Techniques

State-of-the-art data layout organization is performed one layer at a time through an entire
neural network. We have observed that performing data layout organization on more than one
layer at the same time may give the data assignment an opportunity to escape a locally optimal
solution. As observed in [78], when multiple passes of data layout organization are performed,
most of the cost reduction comes from the first pass. A possible explanation is that once the
assignment problem is stuck at a local optimum, it is very hard for the framework to reach a
globally optimal assignment.

System Software Support for Resistive Computing Systems

In addition to the defects caused by immature fabrication techniques, device defects also arise
at run-time from heavy utilization. Further investigation will address run-time device defects
and techniques that mitigate the impact of defective devices using optimized algorithms with
limited overhead.


LIST OF REFERENCES

[1] https://www2.eecs.berkeley.edu/Research/Projects/CS/vision/bsds/.
[2] https://www.compression.cc/challenge/.
[3] Exascale Proxy Applications. https://proxyapps.exascaleproject.org.
[4] https://keras.io/examples/cifar10_cnn/.
[5] Joint University Microelectronics Program (JUMP). https://www.darpa.mil/program/
joint-university-microelectronics-program.
[6] Z. Alamgir, K. Beckmann, N. Cady, A. Velasquez, and S. K. Jha. Flow-based computing on
nanoscale crossbars: Design and implementation of full adders. In 2016 IEEE International
Symposium on Circuits and Systems (ISCAS), pages 1870–1873. IEEE, 2016.
[7] F. Alibart, L. Gao, B. D. Hoskins, and D. B. Strukov. High precision tuning of state for memristive devices by adaptable variation-tolerant algorithm. Nanotechnology, 23(7):075201,
2012.
[8] C. I. Bargmann and W. T. Newsome. The brain research through advancing innovative neurotechnologies (brain) initiative and neurology. JAMA neurology, 71(6):675–676, 2014.
[9] K. Beckmann, J. Holt, H. Manem, J. Van Nostrand, and N. C. Cady. Nanoscale hafnium
oxide RRAM devices exhibit pulse dependent behavior and multi-level resistance capability.
MRS Advances, 1(49):3355–3360, 2016.
[10] M. V. Beigi and G. Memik. Thermal-aware optimizations of ReRam-based neuromorphic
computing systems. In Proc. Design Automation Conference, page 39, 2018.


[11] M. N. Bojnordi and E. Ipek. Memristive boltzmann machine: A hardware accelerator for
combinatorial optimization and deep learning. In Proc. International Symposium on High
Performance Computer Architecture, pages 1–13, 2016.
[12] T. Chang, S.-H. Jo, and W. Lu. Short-term memory to long-term memory transition in a
nanoscale memristor. ACS nano, 5(9):7669–7676, 2011.
[13] C. Y. Chen, H. C. Shih, C. W. Wu, C. H. Lin, P. F. Chiu, S. S. Sheu, and F. T. Chen. RRAM
defect modeling and failure analysis based on march test and a novel squeeze-search scheme.
IEEE Transactions on Computers, 64(1):180–190, 2015.
[14] L. Chen, J. Li, Y. Chen, Q. Deng, J. Shen, X. Liang, and L. Jiang. Accelerator-friendly
neural-network training: Learning variations and defects in RRAM crossbar. In Proc. Conference
on Design, Automation & Test in Europe, pages 19–24, 2017.
[15] P. Chi, S. Li, C. Xu, T. Zhang, J. Zhao, Y. Liu, Y. Wang, and Y. Xie. PRIME: a novel
processing-in-memory architecture for neural network computation in ReRAM-based main
memory. In Proc. ACM SIGARCH Computer Architecture News, volume 44, pages 27–39,
2016.
[16] S. Choi, Y. Yang, and W. Lu. Random telegraph noise and resistance switching analysis of
oxide based resistive memory. Nanoscale, 6(1):400–404, 2014.
[17] F. Chollet et al. Keras. https://keras.io, 2015.
[18] L. Chua. Memristor-the missing circuit element. IEEE Transactions on Circuit Theory,
18(5):507–519, 1971.
[19] T. H. Cormen, C. Stein, R. L. Rivest, and C. E. Leiserson. Introduction to Algorithms.
McGraw-Hill Higher Education, 2001.


[20] R. Degraeve, A. Fantini, N. Raghavan, L. Goux, S. Clima, B. Govoreanu, A. Belmonte,
D. Linten, and M. Jurczak. Causes and consequences of the stochastic aspect of filamentary
RRAM. Microelectronic Engineering, 147:171 – 175, 2015.
[21] B. Feinberg, U. K. R. Vengalam, N. Whitehair, S. Wang, and E. Ipek. Enabling scientific
computing on memristive accelerators. In Proc. International Symposium on Computer Architecture, pages 367–382, 2018.
[22] S. Feizi et al. Porcupine neural networks:(almost) all local optima are global. arXiv preprint
arXiv:1710.02196, 2017.
[23] Z. He, J. Lin, R. Ewetz, J.-S. Yuan, and D. Fan. Noise injection adaption: End-to-end ReRAM
crossbar non-ideal effect adaption for neural network mapping. In Design Automation Conference, pages 29:1–6, 2019.
[24] M. Horowitz. 1.1 computing’s energy problem (and what we can do about it). In Proc. Int.
Solid-State Circuits Conference Digest of Technical Papers, pages 10–14, 2014.
[25] M. Hu, C. E. Graves, C. Li, Y. Li, N. Ge, E. Montgomery, N. Davila, H. Jiang, R. S. Williams,
J. J. Yang, Q. Xia, and J. P. Strachan. Memristor-based analog computation and neural network classification with a DPE. Adv. Materials, 30, 2018.
[26] M. Hu, H. Li, Y. Chen, Q. Wu, G. Rose, and R. W. Linderman. Memristor crossbar-based
neuromorphic computing system: A case study. NN. and Learning Sys., IEEE Tran. on,
25:1864–1878, 2014.
[27] M. Hu and J. P. Strachan. Accelerating discrete fourier transforms with dot-product engine.
In 2016 IEEE International Conference on Rebooting Computing (ICRC), pages 1–5, Oct
2016.


[28] M. Hu, J. P. Strachan, Z. Li, E. M. Grafals, N. Davila, C. Graves, S. Lam, N. Ge, J. J. Yang,
and R. S. Williams. Dot-product engine for neuromorphic computing: Programming 1T1M
crossbar to accelerate matrix-vector multiplication. In Proc. Design Automation Conference,
pages 1–6, 2016.
[29] Y. Huai. Spin-transfer torque MRAM (STT-MRAM): Challenges and prospects. AAPPS
bulletin, 18(6):33–40, 2008.
[30] R. B. Hur, N. Wald, N. Talati, and S. Kvatinsky. Simple magic: Synthesis and in-memory
mapping of logic execution for memristor-aided logic. In Proc. International Conference on
Computer-Aided Design, pages 225–232, 2017.
[31] S. Jain, A. Sengupta, K. Roy, and A. Raghunathan. Rx-caffe: Framework for evaluating and
training deep neural networks on resistive crossbars. CoRR, abs/1809.00072, 2018.
[32] S. Jin et al. On improving fault tolerance of memristor crossbar based neural network designs
by target sparsifying. DATE, pages 91–96, 2020.
[33] B. G. Johnson and C. H. Dennison. Phase change memory, 2004. US Patent 6,791,102.
[34] S. Kannan, N. Karimi, R. Karri, and O. Sinanoglu. Modeling, detection, and diagnosis of
faults in multilevel memristor memories. IEEE Transactions on Computer-Aided Design of
Integrated Circuits and Systems, 34(5):822–834, 2015.
[35] K. Kourtis et al. Compiling neural networks for a computational memory accelerator. arXiv
preprint arXiv:2003.04293, 2020.
[36] A. Krizhevsky. Learning multiple layers of features from tiny images. Technical report, 2009.
[37] A. Krizhevsky, V. Nair, and G. Hinton. Cifar-10 (canadian institute for advanced research).
[38] A. Kurakin, I. J. Goodfellow, and S. Bengio. Adversarial machine learning at scale. 2017.

[39] S. Kvatinsky, D. Belousov, S. Liman, G. Satat, N. Wald, E. G. Friedman, A. Kolodny, and
U. C. Weiser. Magic—memristor-aided logic. IEEE Transactions on Circuits and Systems
II: Express Briefs, 61(11):895–899, 2014.
[40] S. Kvatinsky, G. Satat, N. Wald, E. G. Friedman, A. Kolodny, and U. C. Weiser.
Memristor-based material implication (IMPLY) logic: Design principles and methodologies. IEEE
Transactions on Very Large Scale Integration (VLSI) Systems, 22(10):2054–2066, 2014.
[41] M. Le Gallo, A. Sebastian, R. Mathis, M. Manica, H. Giefers, T. Tuma, C. Bekas, A. Curioni,
and E. Eleftheriou. Mixed-precision in-memory computing. Nature Electronics, 1:246–253,
04 2018.
[42] Y. LeCun. The mnist database of handwritten digits. http://yann.lecun.com/exdb/mnist/,
1998.
[43] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
[44] H. Lee, P. Chen, T. Wu, Y. Chen, C. Wang, P. Tzeng, C. Lin, F. Chen, C. Lien, and M.-J. Tsai.
Low power and high speed bipolar switching with a thin reactive Ti buffer layer in robust
HfO2 based RRAM. In IEEE Int. Electron Devices Meeting, pages 1–4, 2008.
[45] B. Li, P. Gu, Y. Shan, Y. Wang, Y. Chen, and H. Yang. Rram-based analog approximate computing. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems,
34:1–1, 12 2015.
[46] B. Li, Y. Wang, Y. Chen, H. H. Li, and H. Yang. Ice: Inline calibration for memristor
crossbar-based computing engine. In Proceedings of the Conference on Design, Automation Test in
Europe, DATE ’14, Leuven, BEL, 2014. European Design and Automation Association.


[47] C. Li, M. Hu, Y. Li, H. Jiang, N. Ge, E. Montgomery, J. Zhang, W. Song, N. Dávila, C. E.
Graves, et al. Analogue signal and image processing with large memristor crossbars. Nature
Electronics, 1(1):52, 2018.
[48] C. Li, Y. Li, H. Jiang, W. Song, P. Lin, Z. Wang, J. J. Yang, Q. Xia, M. Hu, E. Montgomery,
J. Zhang, N. Dávila, C. E. Graves, Z. Li, J. P. Strachan, R. S. Williams, N. Ge, M. Barnell, and
Q. Wu. Large memristor crossbars for analog computing. In Proc. International Symposium
on Circuits and Systems, pages 1–4, 2018.
[49] B. Liu, M. Hu, H. Li, Z.-H. Mao, Y. Chen, T. Huang, and W. Zhang. Digital-assisted
noise-eliminating training for memristor crossbar-based analog neuromorphic computing engine.
In 2013 50th ACM/EDAC/IEEE Design Automation Conference (DAC), pages 1–6. IEEE,
2013.
[50] B. Liu, H. Li, Y. Chen, X. Li, T. Huang, Q. Wu, and M. Barnell. Reduction and IR-drop compensations techniques for reliable neuromorphic computing systems. In Proc. International
Conference on Computer-Aided Design, pages 63–70, 2014.
[51] B. Liu, H. Li, Y. Chen, X. Li, Q. Wu, and T. Huang. Vortex: Variation-aware training for
memristor X-bar. In Proc. Design Automation Conference, pages 1–6, 2015.
[52] C. Liu, M. Hu, J. P. Strachan, and H. H. Li. Rescuing memristor-based neuromorphic design
with high defects. In Proc. Design Automation Conference, pages 87:1–87:6, 2017.
[53] X. Liu, M. Mao, B. Liu, B. Li, Y. Wang, H. Jiang, M. Barnell, Q. Wu, J. Yang, H. Li, and
Y. Chen. Harmonica: A framework of heterogeneous computing systems with memristor-based
neuromorphic computing accelerators. IEEE Transactions on Circuits and Systems I:
Regular Papers, 63(5):617–628, 2016.


[54] X. Liu, M. Mao, B. Liu, H. Li, Y. Chen, B. Li, Y. Wang, H. Jiang, M. Barnell, Q. Wu,
et al. Reno: A high-efficient reconfigurable neuromorphic computing accelerator design. In
Proceedings of the 52nd Annual Design Automation Conference, pages 1–6, 2015.
[55] Z. Liu et al. Deepn-jpeg: A deep neural network favorable jpeg-based image compression
framework. DAC, page 18, 2018.
[56] Y. Long, X. She, and S. Mukhopadhyay. Design of reliable dnn accelerator with un-reliable
reram. In 2019 Design, Automation & Test in Europe Conference & Exhibition (DATE), pages
1769–1774. IEEE, 2019.
[57] S. Parkin, X. Jiang, C. Kaiser, A. Panchula, K. Roche, and M. Samant. Magnetically engineered spintronic sensors and memory. Proc. IEEE, 91(5):661–680, 2003.
[58] A. Shafiee, A. Nag, N. Muralimanohar, R. Balasubramonian, J. P. Strachan, M. Hu, R. S.
Williams, and V. Srikumar. ISAAC: A convolutional neural network accelerator with in-situ
analog arithmetic in crossbars. ACM SIGARCH Computer Architecture News, 44(3):14–26,
2016.
[59] P. Sheridan et al. Sparse coding with memristor networks. Nature nanotechnology,
12(8):784–789, 2017.
[60] P. Simon. Too Big to Ignore: The Business Case for Big Data. Wiley Publishing, 1st edition,
2013.
[61] K. Simonyan et al. Very deep convolutional networks for large-scale image recognition. arXiv
preprint arXiv:1409.1556, 2014.
[62] L. Song, X. Qian, H. Li, and Y. Chen. Pipelayer: A pipelined reram-based accelerator for deep
learning. In Proc. International Symposium on High Performance Computer Architecture,
pages 541–552, 2017.

[63] L. Song, Y. Zhuo, X. Qian, H. Li, and Y. Chen. GraphR: Accelerating graph processing using
ReRAM. In Proc. International Symposium on High Performance Computer Architecture,
pages 531–543, 2018.
[64] C. H. Stapper. Simulation of spatial fault distributions for integrated circuit yield estimations.
IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, 8(12):1314–
1318, 1989.
[65] D. B. Strukov, G. S. Snider, D. R. Stewart, and R. Williams. The missing memristor found.
Nature, 453(12):80–83, 2009.
[66] A. van de Goor and Y. Zorian. Effective march algorithms for testing single-order addressed
memories. In Proc. European Conference on Design Automation with the European Event in
ASIC Design, pages 499–505, 1993.
[67] G. K. Wallace. The jpeg still picture compression standard. IEEE transactions on consumer
electronics, 38(1):xviii–xxxiv, 1992.
[68] H.-S. P. Wong, H.-Y. Lee, S. Yu, Y.-S. Chen, Y. Wu, P.-S. Chen, B. Lee, F. T. Chen, and M.-J.
Tsai. Metal–oxide rram. Proc. IEEE, 100(6):1951–1970, 2012.
[69] H.-S. P. Wong, S. Raoux, S. Kim, J. Liang, J. P. Reifenberg, B. Rajendran, M. Asheghi, and
K. E. Goodson. Phase change memory. Proc. IEEE, 98(12):2201–2227, 2010.
[70] W. A. Wulf and S. A. McKee. Hitting the memory wall: Implications of the obvious.
SIGARCH Computing Architecture News, 23(1):20–24, 1995.
[71] L. Xia, P. Gu, B. Li, T. Tang, X. Yin, W. Huangfu, S. Yu, Y. Cao, Y. Wang, and H. Yang.
Technological exploration of RRAM crossbar array for matrix-vector multiplication. Journal
of Computer Science and Technology, 31(1):3–19, 2016.


[72] L. Xia, W. Huangfu, T. Tang, X. Yin, K. Chakrabarty, Y. Xie, Y. Wang, and H. Yang. Stuck-at
fault tolerance in RRAM computing systems. IEEE Journal on Emerging and Selected Topics
in Circuits and Systems, 8(1):102–115, 2018.
[73] L. Xia, M. Liu, X. Ning, K. Chakrabarty, and Y. Wang. Fault-tolerant training with on-line
fault detection for RRAM-based neural computing systems. In Proc. Design Automation
Conference, pages 1–6, 2017.
[74] T. P. Xiao, C. H. Bennett, B. Feinberg, S. Agarwal, and M. J. Marinella. Analog architectures
for neural network acceleration based on non-volatile memory. Applied Physics Reviews,
7(3):031301, 2020.
[75] B. Yan, J. Yang, Q. Wu, Y. Chen, and H. Li. A closed-loop design to enhance weight stability
of memristor based neural network chips. In Proc. International Conference on Computer-Aided
Design, pages 541–548, 2017.
[76] B. Zhang, N. Uysal, and R. Ewetz. STAT: Mean and variance characterization for robust
inference of DNNs on memristor-based platforms. In Proc. Great Lakes Symposium on VLSI,
pages 339–342, 2019.
[77] B. Zhang, N. Uysal, and R. Ewetz. Computational restructuring: Rethinking image processing using memristor crossbar arrays. In Design, Automation & Test in Europe Conference &
Exhibition (DATE), (accepted), 2020.
[78] B. Zhang, N. Uysal, D. Fan, and R. Ewetz. Handling stuck-at-fault defects using matrix
transformation for robust inference of dnns. IEEE Transactions on Computer-Aided Design
of Integrated Circuits and Systems, 2019.


[79] B. Zhang, N. Uysal, D. Fan, and R. Ewetz. Handling stuck-at-faults in memristor crossbar
arrays using matrix transformations. In Proc. Asia and South Pacific Design Automation
Conference, pages 438–443, 2019.
[80] B. Zhang, N. Uysal, D. Fan, and R. Ewetz. Representable matrices: Enabling high accuracy
analog computation for inference of dnns using memristors. In 2020 25th Asia and South
Pacific Design Automation Conference (ASP-DAC), pages 538–543. IEEE, 2020.
[81] F. Zhang and M. Hu. Defects mitigation in resistive crossbars for analog vector matrix multiplication. In 2020 25th Asia and South Pacific Design Automation Conference (ASP-DAC),
pages 187–192. IEEE, 2020.
[82] Zhou Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: from
error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–
612, 2004.
[83] Z. Zhu, J. Lin, M. Cheng, L. Xia, H. Sun, X. Chen, Y. Wang, and H. Yang. Mixed size crossbar based RRAM CNN accelerator with overlapped mapping method. In Proc. International
Conference on Computer-Aided Design, pages 69:1–69:8, 2018.


