Energy-efficient systems for information transfer and processing by Lin, Yingyan
© 2017 Yingyan Lin
ENERGY-EFFICIENT SYSTEMS FOR INFORMATION TRANSFER AND PROCESSING
BY
YINGYAN LIN
DISSERTATION
Submitted in partial fulﬁllment of the requirements
for the degree of Doctor of Philosophy in Electrical and Computer Engineering
in the Graduate College of the
University of Illinois at Urbana-Champaign, 2017
Urbana, Illinois
Doctoral Committee:
Professor Naresh R. Shanbhag, Chair
Professor Andrew C. Singer
Professor Elyse Rosenbaum
Professor Pavan K. Hanumolu
ABSTRACT
Machine learning (ML) systems are ﬁnding excellent utility in tackling the data deluge of
the big data era thanks to the exponential increase in computing power. Current ML sys-
tems adopt either centralized cloud computing or distributed edge computing. In both, the
challenge of energy eﬃciency has been drawing increased attention. In cloud computing,
data transfer due to inter-chip, inter-board, inter-shelf and inter-rack communications (I/O
interface) within data centers is one of the dominant energy costs. This will intensify with
the growing demand for increased I/O bandwidth of high-performance computing in data
centers. On the other hand, in edge computing, energy eﬃciency is the primary design
challenge, as mobile devices have limited energy, computation and storage resources. This
challenge is being exacerbated by the need to embed ML algorithms such as convolutional
neural networks (CNNs) for enabling local on-device inference capabilities. In this disserta-
tion, we investigate techniques to address these challenges.
To address the energy eﬃciency challenge in data centers, this dissertation focuses on
reducing the energy consumption of the I/O interface. Speciﬁcally, in the emerging analog-
to-digital converter (ADC)-based multi-Gb/s serial link receivers, the power dissipation is
dominated by the ADC. ADCs in serial links employ signal-to-noise-and-distortion-ratio
(SNDR) and eﬀective-number-of-bits (ENOB) as performance metrics because these are the
standard for generic ADC design. This dissertation presents the use of information-based
metrics such as bit-error-rate (BER) to design a BER-optimal ADC (BOA) for serial links.
First, theoretical analysis is developed to show when the beneﬁts of BOA over a conventional
uniform ADC (CUA) in a serial link receiver are substantial. Second, a 4 GS/s, 4-bit on-chip
ADC in a 90 nm CMOS process is designed and integrated into a 4 Gb/s serial link receiver
to verify the aforementioned analysis. Speciﬁcally, measured results demonstrate that a 3-bit
BOA receiver outperforms a 4-bit CUA receiver at a BER < $10^{-12}$ and provides 50% power
savings in the ADC. In the process, it is demonstrated conclusively that BER as opposed to
ENOB is a better metric when designing ADCs for serial links.
For the problem of resource-constrained computing at the edge, this dissertation tack-
les the issue of energy-eﬃcient implementation of ML algorithms, particularly CNNs which
have recently gained considerable interest due to their record-breaking performance in many
recognition tasks. However, their implementation complexity hinders their deployment
on power-constrained embedded platforms. This dissertation develops two techniques for
energy-eﬃcient CNN design.
The ﬁrst technique is a predictive CNN (PredictiveNet), which makes use of high sparsity
in well-trained CNNs to bypass a large fraction of power-dominant convolutions at run-
time without modifying the CNN structure. Analysis supported by simulations is provided
to justify PredictiveNet's eﬀectiveness. When applied to both the MNIST and CIFAR-10
datasets, simulation results show that PredictiveNet achieves 7.2× and 4.4× reduction in
the computational and representational costs, respectively, compared with a conventional
CNN. It is further shown that PredictiveNet enables computational and representational
cost reductions of 2.5× and 1.7×, respectively, compared to a state-of-the-art CNN, while
incurring only 0.02 classiﬁcation accuracy loss.
The second technique is a variation-tolerant architecture for CNN capable of operating
in near threshold voltage (NTV) regime for aggressive energy eﬃciency. It is well-known
that NTV computing can achieve up to 10× energy savings but is sensitive to process,
temperature, and voltage (PVT) variations which can lead to timing errors. To leverage the
great potential of NTV for energy eﬃciency, this dissertation develops a new statistical error
compensation (SEC) technique referred to as rank decomposed SEC (RD-SEC). RD-SEC
makes use of inherent redundancy in CNNs to handle timing errors due to NTV computing.
When evaluated in CNNs for both the MNIST and CIFAR-10 datasets, simulation results
in 45 nm CMOS show that RD-SEC enables robust CNNs operating in the NTV regime.
Speciﬁcally, the proposed RD-SEC can achieve up to 11× improvement in variation tolerance
and enable up to 113× reduction in the standard deviation of classiﬁcation accuracy while
incurring marginal degradation in the median classiﬁcation accuracy.
To my family whose selﬂess support made this journey possible and inspires me every day,
and to everyone who has blessed me with kindness along the way.
ACKNOWLEDGMENTS
Many great people have helped make my Ph.D. journey at UIUC more challenging, rewarding
and memorable. I am grateful from the bottom of my heart for their contributions to both
my research and my personal growth.
First, I would like to extend my sincerest thanks to my advisor, Prof. Naresh R. Shanbhag,
for his valuable guidance and support throughout the years. It is he who encouraged me
to start this journey and who has trained me along the path of becoming an independent
researcher. Prof. Shanbhag has consistently emphasized the importance of big vision, an-
alytical justiﬁcation, eﬀective presentation, and concise writing. He has held me to high
standards, and my interactions with him have made me much stronger, more appreciative,
and more conﬁdent. I believe he has thoroughly prepared me to face "the real world" that
I will enter upon graduation, for which I will always be grateful.
Second, I would like to sincerely thank Prof. Andrew C. Singer, Prof. Pavan K. Hanu-
molu, and Prof. Elyse Rosenbaum for serving on my thesis committee and for providing
their insightful comments. Their suggestions have helped me strengthen my research and
improve this dissertation. In addition, I have heartfelt appreciation to all three of them for
their valuable support on my faculty applications. I am also grateful for staﬀ members at
UIUC, including Jennifer Carlson, Laurie Fisher, Brooke Newell, Rachel Palmisano, Jen-
nifer Summers, and James Hutchinson, for their help with all kinds of administrative and
logistical tasks or editorial check of this dissertation.
Third, I truly appreciate my past and current research group colleagues, notably Dr. Eric
Kim, Dr. Sai Zhang, Dr. Yongjune Kim, Dr. Mingu Kang, Sujan Gonugondla, Ameya
Patil, Charbel Sakr, and Sungmin Lim. I could not have wished for better colleagues: they
are very tough and honest, they have a keen sense for technical details, and their feedback
was always balanced with kindness. The numerous inspiring discussions and all of your
generous encouragement will not be forgotten. Special thanks go to Dr. Yongjune Kim for
his comprehensive feedback to improve this dissertation, to Dr. Sai Zhang for his useful
contributions to the RD-SEC project, and to Charbel Sakr and Dr. Yongjune Kim for their
helpful input on the PredictiveNet project.
Fourth, I would like to acknowledge my other friends for their priceless help and support,
notably Dr. Yang Zhang, Katie Martin, Nicole Joy, Dr. Zhangyang Wang, Lili Su, Dr.
Songbin Gong, Dr. Tejasvi Anand, Dr. Guanghua Shu, Dr. Woo-Seok Choi, Eric Ehmann,
Yang Xiu, Dr. Peter Kairouz, and Dr. Saurabh Saxena. Special thanks go to Katie Martin
for her valuable critiques to improve my English and for her encouragement during diﬃcult
times. I also sincerely thank Nicole Joy for her great Zumba instruction, which has not only
made this form of exercise my true-love, but has also helped me ﬁnd the delicate balance
between research, personal life and health.
Finally, and most importantly, my deepest gratitude goes to my family who have supported
me with more love and care than one could hope for. My parents and brothers have been
models for my life: they showed me how to strive wholeheartedly towards your full potential;
they demonstrated what it looks like to choose what you love to do, and then to do your
very best; they taught by example how to help others in need, even when you are in need
yourself. My parents and parents-in-law have taken turns to travel the long distance from
China to the U.S. to help me take care of my son when I am at school. The work in this
dissertation would not have been possible without their selﬂess support. I am very fortunate
to have my husband Jing with me in the U.S., who has always encouraged me to pursue my
dreams. Although this resulted in a long distance between us, he put in extra eﬀorts to visit
and to demonstrate his love and support, including driving back and forth between Texas
and Illinois more than twenty times during my time at UIUC. I am extremely grateful to
have my son, Chengzhong, who brightens my life like a joyful angel. He has been and will
continue to be my greatest motivation for growth in all aspects of life.
CONTENTS
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
Chapter 1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2 Dissertation Contributions and Organization . . . . . . . . . . . . . . . . . . 10
Chapter 2 BER-OPTIMAL ADC-BASED RECEIVER FOR SERIAL LINKS . . . 12
2.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2 Achievable BER Improvement via BOA . . . . . . . . . . . . . . . . . . . . 17
2.3 Implementation of BOA Receiver . . . . . . . . . . . . . . . . . . . . . . . . 24
2.4 Measurement Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.6 Derivation of BERR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
Chapter 3 PREDICTIVENET . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.2 The Proposed PredictiveNet Technique . . . . . . . . . . . . . . . . . . . . 44
3.3 Simulation Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.5 Derivation of (3.10) and (3.11) . . . . . . . . . . . . . . . . . . . . . . . . . 56
Chapter 4 RANK DECOMPOSED STATISTICAL ERROR COMPENSATION . . 58
4.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.2 The Proposed RD-SEC Technique . . . . . . . . . . . . . . . . . . . . . . . 61
4.3 Error Model Generation and Validation . . . . . . . . . . . . . . . . . . . . 67
4.4 Simulation Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.6 Derivation of α in (4.10) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
Chapter 5 CONCLUSIONS AND FUTURE WORK . . . . . . . . . . . . . . . . . . 79
5.1 Dissertation Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
5.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
LIST OF TABLES
2.1 Performance Summary of the ADC-based Receivers . . . . . . . . . . . . . . 35
2.2 Performance Comparison with State-of-the-art ADC-based Receivers and
Analog Receivers in CMOS . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.1 Errors of MSB-CNN and PredictiveNet with Respect to FP-CNN . . . . . . 46
3.2 Parameters Summary of the CNN for the MNIST Dataset . . . . . . . . . . 50
3.3 Computational and Representational Cost Comparison among CNNs for
the MNIST Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.4 Parameters Summary of the CNN [78] for the CIFAR-10 Dataset . . . . . . 51
3.5 Computational and Representational Cost Comparison among CNNs for
the CIFAR-10 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.1 Parameters for the $\gamma_C$ and $\gamma_F$ in Fig. 4.4 . . . . . . . . . . . . . . . . . . . . 67
4.2 Summary of CNN Parameters from [68] . . . . . . . . . . . . . . . . . . . . 74
4.3 Summary of CNN Parameters for CIFAR-10 Dataset [79] . . . . . . . . . . 76
LIST OF FIGURES
1.1 Illustration of: (a) centralized cloud computing, and (b) distributed edge
computing. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Power breakdown in a state-of-the-art 48-core processor at both low and
high power modes [5]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Comparison of available resource between a standard desktop and a Google
glass [6, 7]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.4 The scaling of supply voltage, a quadratic knob for energy eﬃciency, re-
mains stagnant beyond 45 nm [11]. . . . . . . . . . . . . . . . . . . . . . . . 5
1.5 Role of an ADC in a serial link: (a) block diagram of a serial link, and (b)
idealized model for the ADC in (a). . . . . . . . . . . . . . . . . . . . . . . . 6
2.1 Channel response h = [0.0949, 0.2539, 0.1552, 0.0793, 0.0435, 0.0356, 0.0220]
with L = 7 of a 20-inch backplane channel carrying 10 Gb/s data [57]. . . . . 13
2.2 An illustrative example: (a) the block diagram of an ADC-ML receiver,
(b) the conditional pdf of the channel output given b[n] = −1, (c) the
conditional pdf of the channel output given b[n] = +1, (d) BOA's quanti-
zation thresholds (inverted triangles in yellow) and uniform quantization
thresholds (dashed lines in red) for channel h = [0.08, 0.07, 0.1, 0.04] when
SNR = 36 dB, and (e) the simulated BERR(SNR, 4, 3), peo(SNR, 3) and
peu(SNR, 4) versus SNR plot. . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.3 Examples of m-clustering and ht: (a) m-clustering with m = 3, and
(b) ht for to = [±0.3,±0.2,±0.1, 0] (case I where ht = 1) and to =
[±0.3,±0.11,±0.09, 0] (case II where ht = 0.7564) with ymax = 0.3. . . . . . . 20
2.4 An illustrative example for $d^*_{o,k}$, $\mu^*_k$ and $d^*_{u,k}$. . . . . . . . . . . . . . . . . . . 21
2.5 Comparison among peo from MC simulation (peo(sim)), peo estimated using
(2.9) (peo(10)), peu from MC simulation (peu(sim)), and peu estimated using
(2.12) (peu(13)), for channels (a) h = [0.09, 0.1, 0.08,−0.05] when Bu = 6
and Bo = 3, and (b) h = [0.08, 0.07, 0.1, 0.04] when Bu = 4 and Bo = 3,
respectively. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.6 Comparison between BERR(SNR, Bu, Bo) from MC simulation (BERRsim),
and BERR(SNR, Bu, Bo) estimated using (2.6) (BERR(7)) for channels
(a) h = [0.09, 0.1, 0.08, −0.05] when Bu = 6 and Bo = 3, and (b) h =
[0.08, 0.07, 0.1, 0.04] when Bu = 4 and Bo = 3, respectively. . . . . . . . . . 22
2.7 BERR(SNR, Bu, Bo) vs. ht, where Bu = Bo = log2(m + 1), for channels
h = [1, a1, a2, a3] (ai = 0.1 : 0.1 : 0.9, i = 1, 2, 3) using an ADC-ML re-
ceiver when SNR = 38 dB. The value of ht and measured BERR(SNR, 4, 3)
for a FR4 channel using an ADC-LE receiver (described in section 2.3) are
also shown. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.8 BERR(SNR,Bx, Bx) vs. Bx of (a) an ADC-ML receiver, when the chan-
nel impulse response is h = [0.09, 0.1, 0.08, 0.04] and SNR = 36 dB, and
(b) an ADC-LE receiver, when the channel impulse response is equal to
the FR-4 channel (see Fig. 2.1) and SNR = 36 dB. . . . . . . . . . . . . . . 25
2.9 Block diagram of the BOA receiver. . . . . . . . . . . . . . . . . . . . . . . . 25
2.10 Block diagram of the BOA chip. . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.11 The 8-bit single-core multiple-output DAC: (a) circuit schematic, and (b)
timing diagram. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.12 Architectures of: (a) the QL-UD unit, and (b) the ith RL-UD unit. . . . . . 29
2.13 Schematics of the preampliﬁer and latches. . . . . . . . . . . . . . . . . . . . 31
2.14 (a) Micrograph of the BOA chip, and (b) the test set-up. . . . . . . . . . . . 32
2.15 Standalone ADC measurement results: (a) DNL and (b) INL characteris-
tics before/after calibration, and (c) measured ENOB vs. input frequency. . 32
2.16 Measured ADC output: (a) eye diagram, and (b) histogram for a 20-inch
FR4 channel at 4 Gb/s when TX amplitude is 180 mVppd and ADC FSR
is 100 mVppd. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.17 ENOB and BER measurements: (a) ENOB vs. input frequency, and (b)
BER vs. TX amplitude at 4 Gb/s when the FSR of the CUA is 100 mVppd. . 34
2.18 Measured BER vs. sampling phase using a 20-inch channel when TX
amplitude is 180 mVppd and the FSR of the CUA is 100 mVppd. . . . . . . . 35
3.1 Illustration of a state-of-the-art CNN [68] showing a convolutional layer
(C-layer), a subsampling layer (S-layer), feature maps (FMs), and the
squashing function f(·). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.2 An architecture implementing (3.4) in PredictiveNet. . . . . . . . . . . . . . 45
3.3 Illustration of the empirical values of MSEMSB-CNN and MSEPredictiveNet
and the upper bound on MSEPredictiveNet with respect to Bx,msb for: the
(a) ﬁrst and (b) second C-layers over FP-CNN where Bw,msb = 5 bits,
Bw = 8 bits, Bx = Bδ = 7 bits, and Bδ,msb = Bx,msb. Both curves are
obtained by averaging over all pixels of the two C-layers' output FMs. . . . 48
3.4 Simulation results for the MNIST dataset comparing FP-CNN (FP-CONV
and FP-ZS), PredictiveNet, and MSB-CNN in terms of: (a) classiﬁcation
error rates, (b) normalized computational cost (# of full adders (FAs)),
and (c) normalized representational cost (# of bits), where Bx,msb =
Bδ,msb = 4 bits and Bw,msb = 5 bits. . . . . . . . . . . . . . . . . . . . . . . 52
3.5 Comparison on the C-layer input sparsity of FP-CONV/FP-ZS, Predic-
tiveNet and MSB-CNN for the MNIST dataset. . . . . . . . . . . . . . . . . 53
3.6 Simulation results for the CIFAR-10 dataset comparing FP-CNN (FP-CONV
and FP-ZS), PredictiveNet, and MSB-CNN in terms of: (a) classification
error rates, (b) normalized computational cost (# of full adders
(FAs)), and (c) normalized representational cost (# of bits), where Bx,msb =
Bδ,msb = 6 bits and Bw,msb = 5 bits. . . . . . . . . . . . . . . . . . . . . . . 54
3.7 Comparison on the C-layer input sparsity of FP-CONV/FP-ZS, Predic-
tiveNet and MSB-CNN for the CIFAR-10 dataset. . . . . . . . . . . . . . . 55
4.1 Algorithmic noise-tolerance (ANT): (a) architecture, and (b) the error
statistics in the M-block and E-block [50]. . . . . . . . . . . . . . . . . . . 60
4.2 Architecture of: (a) a (N,M) dot product ensemble (MVM), where wml =
[w1ml, · · · , wNml] andWl = [w1l, · · · ,wMl], and (b) one stage MVM-based
CNN consisting of a C-layer and an S-layer. . . . . . . . . . . . . . . . . . . 63
4.3 RD-SEC applied to an MVM. . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.4 Overhead of the RD-SEC-based MVM: computational overhead γ vs. N ,
where the corresponding parameters are summarized in Table 4.1. . . . . . 66
4.5 Block diagram of: (a) model generation methodology, and (b) error mod-
eling framework. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.6 Illustration of the AND gate within-the-die (WID) delay histograms from
HSPICE Monte Carlo (MC) simulations and the AND gate delay model
at: (a) Vdd = 1.2V, (b) Vdd = 0.4V, with 1000 MC iterations. . . . . . . . . 69
4.7 Validating the error model generation methodology by comparing SNR
from HDL simulations and the NTV methodology based on 30 MVM instances
with $10^5$ random input samples for each instance operating at gate
level delay variation of 3%-39%. . . . . . . . . . . . . . . . . . . . . . . . . 71
4.8 Characterization of: (a) process variations in terms of (σ/µ)d vs. Vdd, and
(b) the impact of process variations on MVM error rate p¯η based on 30
MVM instances. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.9 Simulation results for the MNIST dataset: (a) p¯det vs. (σ/µ)d, and (b)
σpdet vs. (σ/µ)d, based on 30 CNN instances in the presence of process
variations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.10 An example of the C1 FMs and the output vector from: (a) the Conv CNN,
and (b) the RD-SEC CNN, when the input digit is 5 and (σ/µ)d = 27%. . 75
4.11 Simulation results for the CIFAR-10 dataset: (a) p¯det vs. (σ/µ)d, and (b)
σpdet vs. (σ/µ)d, based on 30 CNN instances in the presence of process
variations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
Chapter 1
INTRODUCTION
Machine learning (ML) systems have been dramatically transforming the way we live and
work by enhancing our ability to recognize, analyze, and classify the world around us. In
fact, many see this as the fourth industrial revolution [1]. Such unprecedented transfor-
mation is made possible by the explosion in computing power and the availability of vast
amounts of data. Indeed, ML systems have transformed science ﬁction into everyday reality.
Examples include self-driving cars or aircraft, household robots, virtual assistants, and many
others. Recently, ML systems exceeded human performance in some applications such as
million-scale object recognition [2]. However, this record-breaking performance comes at a
large energy cost. For example, Google's AlphaGo, which amazed everyone by beating the
human Go champion early in 2016, runs on 1202 CPUs and 176 GPUs [3] and consumes
more than four orders of magnitude more power than the human brain. Therefore, there
is an imperative need to design energy-eﬃcient ML systems for enabling their pervasive
deployment in our daily lives.
Current ML systems are either centralized in a cloud (see Fig. 1.1(a)) or distributed at the
edge (see Fig. 1.1(b)). Speciﬁcally, in cloud platforms, data from the devices of end users,
such as mobile phones, are transferred to the data centers which execute ML algorithms on
CPU and GPU clusters. The extracted information is then transferred back to users' devices.
While cloud computing is rapidly expanding, recent work [4] shows that the energy cost of
transferring data between data centers and local devices can be a signiﬁcant percentage
of the total energy cost in cloud computing if the usage rate and data volume are large.
Therefore, there has been an increasing interest in enabling local inference capability at the
edge such as end users' devices. Local processing of raw data reduces energy and latency,
and enhances privacy. In both centralized cloud computing and distributed edge computing,
[Figure 1.1 labels: Centralized Cloud Computing ("intelligence" in the cloud, devices at the edge, raw data, information); Distributed Edge Computing ("intelligence" at the edge, feature extractor, classifier, trainer, decision, label, raw data).]
Figure 1.1: Illustration of: (a) centralized cloud computing, and (b) distributed edge computing.
Figure 1.2: Power breakdown in a state-of-the-art 48-core processor at both low and high
power modes [5].
there is a grand energy eﬃciency challenge as described next.
• Energy Efficiency Challenge in the Data Center: It is reported that US data
centers consumed about 70 billion kilowatt-hours of electricity in 2014, representing
2% of the country's total energy consumption [5]. Indeed, the costs of power and
cooling are becoming signiﬁcant factors in the total expenditures of large-scale data
centers [8]. In particular, data transfer due to inter-chip, inter-board, inter-shelf and
inter-rack communications within data centers is one of the dominant energy costs.
For example, the I/O interface consumes about 20% to 70% of the total power in a
state-of-the-art 48-core processor [5], as shown in Fig. 1.2. This will be made worse
by the growing demand for increased I/O bandwidth of high-performance computing
in data centers. For example, a recent projection [9] indicates that the I/O bandwidth
demand will exceed 750 TB/s for super-computers by the year 2020, and the I/O power
could reach half of the CPU power.
• Energy Efficiency Challenge at the Edge: Devices at the edge including smart
phones, autonomous vehicles, wearable devices, and many others have limited energy,
computation and storage resources since they are battery-powered and have a small
form factor. For example, the comparison in Fig. 1.3 shows that the CPU power of a
Google glass is about one eightieth of a standard desktop [6, 7]. On the other hand,
                 Dell Desktop (XPS 8910)   Google Glass   Ratio
Supply source    AC power                  Battery        ∞
CPU power        65 W                      0.81 W         ↓ 80×
Figure 1.3: Comparison of available resource between a standard desktop and a Google
glass [6, 7].
the required implementation complexity of many ML algorithms is high due to the
need to perform hundreds of millions of computations and a significant amount of data
movement. For example, a state-of-the-art convolutional neural network (CNN), AlexNet,
requires 666 million multiply-accumulate operations (MACs) per 227 × 227 image
(13k MACs/pixel) and hundreds of megabytes for weight storage [10]. Therefore, the
energy efficiency challenge will be aggravated by the need to run ML algorithms that
enable inference capability on these platforms. Conventional designs rely on voltage and
process scaling for energy efficiency, both of which have already stagnated as shown in
Fig. 1.4 [11].
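
The figures quoted in the bullet above can be checked with a few lines of arithmetic. The sketch below is only an illustration: the constant and function names are hypothetical, and the input numbers are the ones quoted in the text and in Fig. 1.3.

```python
# Back-of-the-envelope check of the numbers quoted in the text.
# All names here are illustrative; only the input values come from the text.
ALEXNET_MACS = 666e6      # MACs per image, as quoted for AlexNet
IMAGE_SIDE = 227          # the input image is 227 x 227 pixels
DESKTOP_CPU_W = 65.0      # desktop CPU power listed in Fig. 1.3
GLASS_CPU_W = 0.81        # Google Glass CPU power listed in Fig. 1.3


def macs_per_pixel(total_macs: float, side: int) -> float:
    """Average number of MACs spent per input pixel of a square image."""
    return total_macs / (side * side)


def power_ratio(high_w: float, low_w: float) -> float:
    """How many times larger the first power budget is than the second."""
    return high_w / low_w


mpp = macs_per_pixel(ALEXNET_MACS, IMAGE_SIDE)
ratio = power_ratio(DESKTOP_CPU_W, GLASS_CPU_W)
print(f"{mpp / 1e3:.1f}k MACs/pixel, desktop/Glass CPU power ratio {ratio:.1f}x")
```

Both results round to the values stated in the text: roughly 13k MACs per pixel and a CPU power ratio of about 80.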
Therefore, we aim to explore techniques to address energy eﬃciency challenges in both
data centers and resource-constrained platforms at the edge. Speciﬁcally, to address the
energy eﬃciency challenge in data centers, we focus on reducing the energy of the I/O
interface by exploring the design of analog-to-digital converter (ADC)-based multi-Gb/s
serial link receivers. In addition, we also investigate energy-eﬃcient design of complicated
ML algorithms such as CNNs for their employment in resource-constrained platforms.
In the remainder of this chapter, we provide an overview of related prior work, and then
present the contributions and organization of this dissertation.
Figure 1.4: The scaling of supply voltage, a quadratic knob for energy eﬃciency, remains
stagnant beyond 45 nm [11].
1.1 Related Work
1.1.1 ADC-based Links
In conventional ADC-based serial links (see Fig. 1.5(a)), the ADC is designed to be a
transparent conduit of the input analog waveform $x_c(t)$. In such ADCs, the quantization
thresholds $t$ are set uniformly within their full-scale range (FSR). We refer to such an
ADC as a conventional uniform ADC (CUA). The signal-to-quantization noise ratio (SQNR)
of a CUA can be approximated by
$\mathrm{SQNR} = 6.02 B_x + 4.8 - 20 \log_{10}(V_{\max}/\sigma_x)$ dB [12], where $\sigma_x^2$
is the average signal power at the ADC input, $B_x$ is the ADC resolution, and $V_{\max}$ is
the maximum input amplitude. The SQNR is a signal fidelity metric as it measures the
average squared difference between the ADC's sampled input $x_c(nT)$ and its quantized
output $x[n]$. Other signal fidelity-based metrics such as signal-to-noise-and-distortion-ratio
(SNDR) or effective-number-of-bits (ENOB) are also employed. Such fidelity-based metrics
impose overly stringent specifications on the ADC because they ignore the true role of the
ADC in a communication link, which is to preserve the information content in the input
signal $x_c(t)$ in order to recover the transmitted data reliably. One direct consequence of
employing fidelity-based metrics is that the ADC is given more resolution $B_x$ than
necessary. In such ADCs,
[Figure 1.5 blocks: (a) driver, channel, ADC with thresholds t, DSP, slicer; (b) quantizer.]
Figure 1.5: Role of an ADC in a serial link: (a) block diagram of a serial link, and (b) idealized model for the ADC in (a).
a single bit reduction in Bx can result in signiﬁcant power savings. For example, in ﬂash
ADCs, the area, power consumption, and input capacitance increase exponentially with Bx.
These result in large preampliﬁer bandwidth and multiple stages of latches which exacer-
bate the ADC power consumption problem [13,14]. Therefore, the design of low-power and
high-speed ADCs in serial links is a major challenge, which has drawn great interest from
both industry and academia.
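
To make the cost of one extra bit concrete, the following minimal sketch evaluates the SQNR approximation quoted above and counts the comparators of an idealized flash ADC. The function names and the loading factor Vmax/σx = 4 are assumptions chosen for illustration, and the comparator count is only a crude proxy for power, not a model of any measured design.

```python
import math


def sqnr_db(bits: int, v_max: float, sigma_x: float) -> float:
    """SQNR approximation from the text: 6.02*B + 4.8 - 20*log10(Vmax/sigma_x)."""
    return 6.02 * bits + 4.8 - 20.0 * math.log10(v_max / sigma_x)


def flash_comparators(bits: int) -> int:
    """An idealized B-bit flash ADC uses one comparator per threshold: 2^B - 1."""
    return 2 ** bits - 1


# Hypothetical loading factor Vmax/sigma_x = 4 (an assumption, not from the text).
V_MAX, SIGMA_X = 1.0, 0.25

for b in (3, 4, 5, 6):
    print(f"B = {b}: SQNR ~ {sqnr_db(b, V_MAX, SIGMA_X):5.2f} dB, "
          f"{flash_comparators(b):2d} comparators")

# Dropping from 4 bits to 3 bits costs about 6 dB of SQNR but removes
# more than half of the comparators in this idealized count.
saving = 1.0 - flash_comparators(3) / flash_comparators(4)
```

Under these assumptions each removed bit gives up 6.02 dB of SQNR while roughly halving the comparator count, which is why a metric that justifies fewer bits is attractive.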
Recently, there has been research that attempts to employ the link bit-error-rate (BER) as
a design metric for energy-eﬃcient link design. Past work on BER-optimal link components
includes [15], in which an adaptive minimum BER (AMBER) algorithm is proposed to
adapt the equalizer coeﬃcients. It was shown that minimum-BER equalizers outperform
conventional minimum mean square error (MMSE) equalizers over a wide variety of channels
especially when the BER lies in a regime of rapid descent with the number of equalizer
coeﬃcients. Chen et al. [16] demonstrated the beneﬁts of adapting the equalizer coeﬃcients
and the sampling phase of the clock-data-recovery (CDR) to minimize the BER in serial links
via the design of a prototype IC in 65 nm CMOS for a 6.25 Gb/s serial link. In [17, 18], an
algorithm to determine the BER-optimal ADC (BOA) representation levels was proposed.
The ADC shaping gain $\mathrm{SG}(p_e, B_u, B_o)$ defined below was employed to quantify the
benefits of BER optimality:
$$\mathrm{SG}(p_e, B_u, B_o) = \mathrm{SNR}_u(p_e, B_u) - \mathrm{SNR}_o(p_e, B_o), \tag{1.1}$$
where $\mathrm{SNR}_u(p_e, B_u)$ and $\mathrm{SNR}_o(p_e, B_o)$ are the signal-to-noise ratios (SNRs)
needed by a CUA and a BOA, respectively, to achieve a BER equal to $p_e$ with identical
receiver processing, and $B_u$ and $B_o$ are the resolutions of the CUA and the BOA,
respectively. This ADC shaping gain quantifies the reduction in the required channel SNR
to achieve a given BER $p_e$ due to the use of a BOA. The ADC shaping gain is analogous to
a coding gain when evaluating links with error control coding. It was shown in [17, 18] that
$\mathrm{SG}(10^{-12}, 3, 3)$ ranged from 2.5 dB to more than 30 dB for highly dispersive
channels. We note that a BOA employs representation levels that are dependent on signal
statistics and BER, and hence are typically non-uniformly spaced within the FSR of the
ADC. The works in [17, 18] also showed that the non-uniform BOA representation levels are
significantly different from and superior to the non-uniform (also signal statistics-dependent)
representation levels obtained from the well-known Lloyd-Max (LM) quantization algorithm
[19, 20]. This is because the LM algorithm minimizes the mean squared quantization error,
i.e., it maximizes the SQNR, which is also a fidelity metric. In [21], it was shown that BOA
can relax the component specifications of ADCs. In particular, BOA can achieve the same
or even better BER while imposing less stringent metastability and preamplifier
bandwidth requirements on the ADC comparators.
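
The fidelity-driven nature of the LM algorithm can be seen in a small experiment. The sketch below is a generic, sample-based Lloyd-Max iteration, not the BOA algorithm of [17, 18], whose details are not given here. The Gaussian input, its standard deviation, the 3-bit resolution, and all function names are assumptions made for illustration.

```python
import random


def uniform_levels(n_levels: int, v_max: float) -> list:
    """Mid-points of n_levels equal-width cells spanning [-v_max, v_max]."""
    step = 2.0 * v_max / n_levels
    return [-v_max + (i + 0.5) * step for i in range(n_levels)]


def nearest(x: float, levels: list) -> int:
    """Index of the representation level closest to x."""
    return min(range(len(levels)), key=lambda i: abs(x - levels[i]))


def mse(samples: list, levels: list) -> float:
    """Mean squared error between samples and their quantized values."""
    return sum((x - levels[nearest(x, levels)]) ** 2 for x in samples) / len(samples)


def lloyd_max(samples: list, levels: list, iterations: int = 30) -> list:
    """Alternate nearest-level assignment and centroid update (Lloyd-Max)."""
    levels = list(levels)
    for _ in range(iterations):
        cells = [[] for _ in levels]
        for x in samples:
            cells[nearest(x, levels)].append(x)
        # An empty cell keeps its previous level, so the MSE never increases.
        levels = [sum(c) / len(c) if c else r for c, r in zip(cells, levels)]
    return levels


rng = random.Random(0)                                  # fixed seed for repeatability
samples = [rng.gauss(0.0, 0.1) for _ in range(3000)]    # assumed input statistics
start = uniform_levels(8, 0.3)                          # 3-bit uniform quantizer
lm = lloyd_max(samples, start)
mse_uniform, mse_lm = mse(samples, start), mse(samples, lm)
print(f"uniform MSE = {mse_uniform:.3e}, Lloyd-Max MSE = {mse_lm:.3e}")
```

The LM levels move toward where the samples are dense and lower the mean squared error relative to the uniform quantizer. Nothing in the iteration looks at the transmitted bits, which is the sense in which LM remains a fidelity-based design rather than a BER-based one.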
A power-optimized ADC-based 10 Gb/s serial link receiver in 65 nm CMOS was designed in
[22] using a low-gain analog and mixed-mode pre-equalizer in conjunction with non-uniform
representation levels for the ADC. The works in [22, 23] propose to merge slicers whose
thresholds are similar into one for loop-unrolled decision feedback equalizers (DFEs) and
adjust a pseudo-BER metric (voltage margin) to minimize BER, which in effect emulates a
BOA followed by a DFE. However, this technique has two limitations. First, it is applicable
only to loop-unrolled DFEs, and has to rely on a continuous-time linear equalizer to cancel
the precursor ISI. Second, as mentioned in [23], the procedure to determine the optimal
threshold placement is not
suitable for online calibration. More recently, Son et al. [24] proposed a power eﬃcient
equalizing receiver front-end that includes a two-step adaptive BER-minimizing equalizer
algorithm. These works mentioned above demonstrate that the use of information-based
metrics such as the BER are indeed quite eﬀective in reducing link component power in
serial links.
1.1.2 Energy-eﬃcient CNNs
Many emerging applications in pattern recognition and data mining require the use of ML
algorithms to process massive data volumes on energy-constrained platforms [25]. The CNN is
a powerful ML algorithm that achieves state-of-the-art performance in various recognition
tasks [2]. For example, the Microsoft ResNet achieved a better-than-human top-5 error rate of
3.57% in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2015 [26], which
is a benchmark in object category classification and detection consisting of hundreds of object
categories and millions of images. The implementation complexity of CNNs is very high due
to the need to compute a large number of convolutions, which usually account for over 90% of
the total computational cost [27], and to move a significant amount of data. This
high complexity hinders the implementation of CNNs on power-constrained embedded
platforms.
Substantial research eﬀorts have been invested in reducing the complexity of CNNs. One
line of research attempts to reduce the precision of weights and activations, and has shown
that 8-bit [28] or even binary [29] ﬁxed-point representation is suﬃcient for evaluating CNNs.
Another approach focuses on optimizing the structure of the CNN itself. The work in [30]
employs a three-step method, where the network is first trained to learn important connections,
redundant connections in the pre-trained CNN are then pruned, and the pruned network is
finally retrained to restore the performance. Zhang et al. [31] proposed replacing a
convolutional layer with several convolutional layers applied sequentially, which have a lower
total complexity. Another research thrust exploits sparsity in well-trained CNNs or enhances
sparsity in CNNs via regularization, and skips operations with zero entries (zero-skipping)
[10, 32]. Recent work
[33,34] showed that it is possible to avoid evaluation of certain computations with a marginal
performance loss. In [33], a linear regression model was trained for each convolutional layer
to predict the importance of each convolutional ﬁlter and prune low-impact ﬁlters at runtime.
Panda et al. [34] proposed conditional deep learning (CDL) by adding a linear network to
each convolutional layer and monitoring the output to decide whether classiﬁcation can be
terminated at the current stage.
The above mentioned techniques have the potential to reduce both the computational
and data movement costs in CNN implementations. In general, data movement (memory
access) cost tends to dominate the overall energy consumption in data-intensive computing
systems [35]. This is especially true for large-scale CNN implementations [36, 37]. Thus,
research focus has been on reducing data movement cost via maximally reusing data locally
[36, 37] or in-memory computing [38,39]. Once these techniques have aggressively trimmed the
data movement cost of large-scale CNNs, and likewise in small-scale CNNs, the
computational cost will be on the same order as the data movement cost, or will even dominate
the overall energy cost [40]. In these cases, reducing the computational cost of CNNs becomes
the primary concern.
In line with this direction, one opportunity in CNNs is that matrix-vector multiply (MVM)
is the most power hungry kernel and accounts for 90% of the computational cost in state-of-the-art
integrated circuit implementations [10]. In an MVM, an input vector $\mathbf{x}$ is projected
onto a set of weight vectors, i.e.:
$$\mathbf{y} = \mathbf{W}^T \mathbf{x} \tag{1.2}$$
where $\mathbf{W} = [\mathbf{w}_1, \ldots, \mathbf{w}_M]$ is the $N \times M$ weight matrix, $\mathbf{w}_k$ is the $k$th $N \times 1$ weight vector,
$\mathbf{x}$ is the $N \times 1$ input vector, $\mathbf{y} = [y_1, \ldots, y_M]^T$ is the $M \times 1$ output vector, and $y_k$ is the $k$th
element of $\mathbf{y}$, which can be expressed as a dot product (DP) $y_k = \mathbf{w}_k^T \mathbf{x}$.
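As a minimal sketch of (1.2), the following Python snippet evaluates an MVM as M separate dot products, one per column of W. The function name `mvm` and the small integer matrix are illustrative assumptions and do not come from the dissertation.

```python
def mvm(W, x):
    """Compute y = W^T x, with W given as N rows of M entries and x of length N."""
    n_rows, n_cols = len(W), len(W[0])
    assert len(x) == n_rows, "x must have one entry per row of W"
    # y_k is the dot product of the k-th column w_k of W with x.
    return [sum(W[i][k] * x[i] for i in range(n_rows)) for k in range(n_cols)]


# Illustrative values only (N = 3, M = 2).
W = [[1, 2],
     [3, 4],
     [5, 6]]
x = [1, 0, -1]
y = mvm(W, x)
multiplies = len(W) * len(W[0])  # N * M multiplications for one MVM
```

Each output element costs N multiply-accumulate operations, so the whole MVM costs N × M of them, which is why the dot product is the natural unit for the energy-reduction techniques discussed next.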
As a result, energy-eﬃcient MVM architectures are of great importance for energy-eﬃcient
CNN design. Techniques such as low power parallel ﬁlter design [41] and common subexpres-
sion elimination (CSE) [42] can be applied to MVMs to reduce computational complexity.
These techniques exploit the redundancy within a multiplier or a DP. To further reduce
energy consumption, near threshold voltage (NTV) designs have been proposed where the
supply voltage is reduced to close to the transistor threshold voltage of 0.3V-0.7V. This design
paradigm is well-suited for low throughput sensor based applications such as biomedical
monitoring [43], surveillance [44], and structural sensing within critical infrastructures [11].
Past research has shown that NTV designs can achieve up to 10× savings in energy, but suffer
from a significant increase in variations, which can be as high as 20× [11]. Error-resilient
techniques [45-51] have been employed at various levels of design abstraction to compensate
for the resultant timing errors caused by NTV operation. At the logic or circuit level, RAZOR
[45], error detection sequential (EDS) [46], and Markov Random Field [47] have been
proposed. These techniques either compensate for small error rates (< 2%) or have large
overhead (> 5×), limiting their ability to enhance energy efficiency. At the system level,
conventional fault-tolerance techniques such as N-modular redundancy (NMR) [48] incur N×
complexity and power overhead, restricting their applicability. Statistical error compensation
(SEC) [49-51] has been shown to be a promising solution. SEC employs detection and
estimation-based techniques for error compensation. SEC techniques such as algorithmic
noise-tolerance (ANT) are able to compensate for error rates of 21%-89% while achieving
35%-72% energy savings [50].
1.2 Dissertation Contributions and Organization
The design of energy-eﬃcient ML systems is challenging due to the need for intensive com-
putation and massive data movement. In this dissertation, we address this challenge by (1)
employing information-based system metrics, as opposed to ﬁdelity circuit metrics, to design
power-dominant components in ML systems, (2) making use of inherent redundancy in ML
algorithms for reduced complexity, and (3) computing at the limits of energy eﬃciency and
robustness and developing SEC techniques to efficiently compensate for the resultant errors.
The major contributions and organization of this dissertation are summarized as follows:
Chapter 2 presents an investigation of the use of link BER for designing a BOA-based
serial link. Channel parameters such as the m-clustering value and the threshold
non-uniformity metric ht are introduced and employed to quantify the BER improvement achieved
by a BOA over a CUA in the receiver. Analytical expressions for BER improvement are
derived and validated through simulations. A prototype of BOA is designed, fabricated
and tested in a 1.2 V, 90 nm LP CMOS process to verify the results of this study. BOA's
variable-threshold and variable-resolution configurations are implemented via an 8-bit
single-core, multiple-output passive digital-to-analog converter (DAC), which incurs an additional
power overhead of < 0.1% (approximately 50 µW). Measurement results show that the BER
achieved by the 3-bit BOA receiver can be lower by a factor of $10^9$ and $10^{10}$, as compared to
the 4-bit and 3-bit CUA receivers, respectively, at a data rate of 4 Gb/s and a transmitted
signal amplitude of 180 mVppd.
Chapter 3 presents a predictive CNN (PredictiveNet), which predicts the zero outputs of
the nonlinear layers using low-cost predictors thereby bypassing a majority of computations.
PredictiveNet skips a large fraction of power-dominant convolutions in CNNs at runtime
without modifying the CNN structure or requiring additional branch networks. Analysis
supported by simulations validates the proposed PredictiveNet technique. When applied to
CNNs for both the MNIST and CIFAR-10 datasets, simulation results show that Predic-
tiveNet is able to achieve up to 2.5× and 1.7× reduction in the computational cost (number
of 1-bit full adders) and representational cost (number of bits to represent data and weights),
respectively, compared with a state-of-the-art CNN, while incurring only 0.02 classiﬁcation
accuracy degradation.
Chapter 4 presents a variation-tolerant architecture for CNNs capable of operating in
the NTV regime for energy efficiency. An SEC technique referred to as rank decomposed SEC
(RD-SEC) is proposed. The key idea of RD-SEC is to exploit inherent redundancy within an
MVM, a power-hungry operation in CNNs, to derive low-cost estimators for error detection
and compensation. When evaluated in CNNs for both the MNIST and CIFAR-10 datasets,
simulation results in 45 nm CMOS show that the proposed RD-SEC can enable up to 11× im-
provement in variation tolerance and achieve up to 113× reduction in the standard deviation
of classiﬁcation accuracy while incurring marginal degradation in the median classiﬁcation
accuracy.
Chapter 5 concludes this dissertation and provides directions for future research activi-
ties.
Chapter 2
BER-OPTIMAL ADC-BASED RECEIVER FOR
SERIAL LINKS
ADC-based multi-Gb/s serial link receivers have gained increasing attention as a promising
scheme for data transfer in data centers because they have enabled the application of digital
signal processing (DSP) techniques to recover data under severe channel impairments such as
channel loss, reﬂection, and crosstalk, while being constrained by a stringent power budget
[22, 52-56]. This chapter presents the effectiveness of employing information-based system
metrics such as the link BER to reduce the energy consumption of serial link components
such as the ADC, which tends to be the most power hungry block. For example, the ADC
itself (excluding the clock buﬀer) consumes 41% of the total receiver power in [22].
The rest of this chapter is organized as follows. Section 2.1 reviews the theory behind the
BOA. Section 2.2 discusses how to maximize the beneﬁts of a BOA receiver over a CUA
receiver. In Section 2.3, the design of a 4 GS/s, 4-bit BOA IC in a 90 nm CMOS process is
described. Both stand-alone ADC and link measurement results are summarized in Section
2.4, and a summary of this chapter is provided in Section 2.5.
2.1 Background
In this section, the concept of the BOA is reviewed. Figure 1.5(a) depicts a typical ADC-
based serial link, where the ADC is followed by an equalizer prior to detection. When
considering an equivalent discrete-time, symbol-spaced, time-invariant channel corrupted by
additive white Gaussian noise (AWGN), and 2-PAM modulation, the channel output at time
n is given by
$$x_c[n] = x_c(nT) = \sum_{i=0}^{L-1} h[i]\, b[n-i] + v[n],$$
Figure 2.1: Channel response h = [0.0949, 0.2539, 0.1552, 0.0793, 0.0435, 0.0356, 0.0220]
with L = 7 of a 20-inch backplane channel carrying 10 Gb/s data [57].
where $b[n] \in \{\pm 1\}$ is the transmitted sequence, $h[n]$ is the equivalent discrete time channel
impulse response (see Fig. 2.1 for an example) with memory $L$, and $v[n]$ is AWGN with
variance $\sigma_v^2$. At the receiver, the processor estimates the transmitted symbols from quantized
channel outputs through the ADC. A subsequent slicer determines the transmitted bit.
Figure 2.1 shows an example of the channel response h for a 20-inch backplane channel
carrying 10 Gb/s data [57].
2.1.1 Comparison Metric
Comparison of a BOA and a CUA requires an appropriate metric to be deﬁned. The ADC
shaping gain SG(pe, Bu, Bo) in (1.1) is one such metric. In applications where it is diﬃcult to
measure the underlying circuit and environmental noise, comparing the ratio of the resulting
bit-error-rates, or the BER ratio, may be of interest. Similarly, when two systems are being
compared, and only one of the two experiences an exponential decay in BER with SNR as
shown in Fig. 2.2(e), it may not be possible to measure SG(pe, Bu, Bo). However at a given
SNR, the ratio of the measured BERs may be readily measurable. Once again, the BER
ratio becomes a quantity of interest. In our application, as the resolution of the ADC may
be insuﬃcient to reach the so-called waterfall regime of the BER vs. SNR curve, we will
use the BER ratio as a metric of comparison. We recognize that this metric may be more
susceptible to measurement sensitivities since we are comparing quantities that may diﬀer
by orders of magnitude. However, we proceed with this metric for the above mentioned
reasons. Therefore, in this chapter we employ the BER ratio (BERR) deﬁned as:
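$$\mathrm{BERR}(\mathrm{SNR}, B_u, B_o) = \frac{p_{eu}(\mathrm{SNR}, B_u)}{p_{eo}(\mathrm{SNR}, B_o)},$$
where $p_{eu}(\mathrm{SNR}, B_u)$ and $p_{eo}(\mathrm{SNR}, B_o)$ are the BERs achieved by a $B_u$-bit CUA and a $B_o$-bit
BOA with identical receiver processing and channel SNR given by $\mathrm{SNR} = \sum_{i=0}^{L-1} |h[i]|^2 / \sigma_v^2$.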
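The channel model and the SNR definition above can be sketched in a few lines of Python. This is an illustrative sketch rather than the simulation setup used in this work: the helper names `noise_sigma` and `channel_output`, the seed, and the choice to start at n = L − 1 (so that every tap sees a transmitted bit) are assumptions.

```python
import math
import random


def noise_sigma(h, snr_db):
    """Noise standard deviation implied by SNR = sum(|h[i]|^2) / sigma_v^2."""
    signal_power = sum(tap * tap for tap in h)
    return math.sqrt(signal_power / 10 ** (snr_db / 10))


def channel_output(h, bits, sigma, rng):
    """x_c[n] = sum_i h[i] * b[n - i] + v[n], evaluated for n >= L - 1."""
    L = len(h)
    samples = []
    for n in range(L - 1, len(bits)):
        noise_free = sum(h[i] * bits[n - i] for i in range(L))
        samples.append(noise_free + rng.gauss(0.0, sigma))
    return samples


# Example: the 4-tap channel of Section 2.1.2 at SNR = 36 dB.
h = [0.08, 0.07, 0.1, 0.04]
sigma = noise_sigma(h, 36.0)
rng = random.Random(0)                      # fixed seed, illustration only
bits = [rng.choice((-1, 1)) for _ in range(1000)]
x_c = channel_output(h, bits, sigma, rng)
```

For this channel the noise standard deviation at 36 dB is roughly 0.0024, which is small compared with the spacing between the noise-free channel outputs.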
2.1.2 An Illustrative Example
A BOA [17] exploits signal statistics to maximize the probability of correctly detecting the
transmitted bits. To provide insight into the operation of a BOA, we consider an ADC
followed by a memoryless symbol-by-symbol maximum likelihood (ML) detector (ADC-ML)
receiver, as shown in Fig. 2.2(a), and provide an example to illustrate the point.
First, we deﬁne the set of quantization thresholds for the CUA and the BOA.
Definition 1. The vectors $\mathbf{t}_u = [t_{u,1}, t_{u,2}, \cdots, t_{u,M}]$, $\mathbf{r}_u = [r_{u,1}, r_{u,2}, \cdots, r_{u,M+1}]$, and the set
$I_u = \{I_{u,1}, I_{u,2}, \cdots, I_{u,M+1}\}$ are the thresholds, output representation levels, and interval set
of a $\log_2(M+1)$-bit CUA, where $I_{u,1} = (-\infty, t_{u,1}]$, $I_{u,k} = [t_{u,k-1}, t_{u,k}]$ for $k = 2, \cdots, M$,
$I_{u,M+1} = [t_{u,M}, +\infty)$, and the CUA output $x[n] = r_{u,k}$ if $x_c(nT) \in I_{u,k}$ for $k = 1, \cdots, M+1$.
Similarly, the vectors $\mathbf{t}_o = [t_{o,1}, t_{o,2}, \cdots, t_{o,N}]$, $\mathbf{r}_o = [r_{o,1}, r_{o,2}, \cdots, r_{o,N+1}]$, and the set
$I_o = \{I_{o,1}, I_{o,2}, \cdots, I_{o,N+1}\}$ are the thresholds, output representation levels, and interval set
of a $\log_2(N+1)$-bit BOA, where $I_{o,1} = (-\infty, t_{o,1}]$, $I_{o,k} = [t_{o,k-1}, t_{o,k}]$ for $k = 2, \cdots, N$,
$I_{o,N+1} = [t_{o,N}, +\infty)$, and the BOA output $x[n] = r_{o,k}$ if $x_c(nT) \in I_{o,k}$ for $k = 1, \cdots, N+1$.
Consider a 4-tap channel with impulse response h = [0.08, 0.07, 0.1, 0.04]. The conditional
probability density functions (pdfs) P (xc[n]|b[n] = −1) and P (xc[n]|b[n] = +1) correspond-
ing to the marginal pdf of the channel output conditioned on the bit b[n] being either −1
or +1 at time n, are illustrated in Fig. 2.2(b) and Fig. 2.2(c), respectively. In a CUA, the
thresholds are set uniformly within the ADC's FSR. Assuming that the FSR is [−0.3,+0.3], a
4-bit CUA will have its thresholds at tu = [±0.26005,±0.2290,±0.18575,±0.14875,±0.1145,
±0.0743,±0.03715, 0] (Fig. 2.2(d)). In contrast, the thresholds in a BOA are positioned at
Figure 2.2: An illustrative example: (a) the block diagram of an ADC-ML receiver, (b)
the conditional pdf of the channel output given $b[n] = -1$, (c) the conditional pdf of
the channel output given $b[n] = +1$, (d) BOA's quantization thresholds (inverted triangles
in yellow) and uniform quantization thresholds (dashed lines in red) for channel
$\mathbf{h} = [0.08, 0.07, 0.1, 0.04]$ when SNR = 36 dB, and (e) the simulated $\mathrm{BERR}(\mathrm{SNR}, 4, 3)$,
$p_{eo}(\mathrm{SNR}, 3)$ and $p_{eu}(\mathrm{SNR}, 4)$ versus SNR plot.
the crossover points of the two conditional pdfs. For this example, the BOA's thresholds
are found to be $\mathbf{t}_o = [\pm 0.11, \pm 0.08, \pm 0.03, 0]$ (see Fig. 2.2(d)) as there are 7 crossover
points. Thus, the BOA illustrated here is a 3-bit ADC. Figure 2.2(e) shows that the BOA
achieves a 1-bit reduction in the ADC resolution while achieving $\mathrm{BERR}(40, 4, 3) \approx 10^8$ and
$SG(10^{-3}, 4, 3) \approx 8$ dB.
In order to compute to, we need the following deﬁnition of noise-free channel outputs:
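Definition 2. The µ-set of a channel $\mathbf{h} = [h_0, h_1, \cdots, h_{L-1}]$ is an ordered set defined as
$\boldsymbol{\mu} = \{\boldsymbol{\mu}^+ \cup \boldsymbol{\mu}^-\}$, where both $\boldsymbol{\mu}^+ = \{\mu_l^+\}_{l=1}^{2^{L-1}}$ and $\boldsymbol{\mu}^- = \{\mu_l^-\}_{l=1}^{2^{L-1}}$ are ordered sets of
noise-free channel outputs conditioned on the transmitted symbol $b[n]$ taking the value $+1$ and $-1$,
respectively. The $\boldsymbol{\mu}$, $\boldsymbol{\mu}^+$ and $\boldsymbol{\mu}^-$ sets have elements in ascending order.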
In general, the N thresholds of a BOA for an ADC-ML receiver can be obtained [17] as
the N solutions for the unknowns {to,i} (i = 1, · · · , N) to the following equation:
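$$2^{-L+1} \sum_{l=1}^{2^{L-1}} \mathcal{N}(t_{o,i}; \mu_l^+, \sigma_n) = 2^{-L+1} \sum_{l=1}^{2^{L-1}} \mathcal{N}(t_{o,i}; \mu_l^-, \sigma_n), \quad (i = 1, \cdots, N), \tag{2.1}$$
where $\mathcal{N}(x; \mu, \sigma_n) = \frac{1}{\sqrt{2\pi\sigma_n^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma_n^2}}$, $N \le 2^L - 1$, and $\mu_l^+$ and $\mu_l^-$ ($1 \le l \le 2^{L-1}$) are the
$2^{L-1}$ noise-free channel outputs (see Definition 2). In the example shown in Fig. 2.2,
$\mathbf{h} = [0.08, 0.07, 0.1, 0.04]$, $L = 4$, $\{\mu_l^+\}_{l=1}^{8} = \{-0.09, -0.01, 0.05, 0.07, 0.13, 0.15, 0.21, 0.29\}$
and $\{\mu_l^-\}_{l=1}^{8} = \{-0.29, -0.21, -0.15, -0.13, -0.07, -0.05, 0.01, 0.09\}$.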
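The µ-sets listed above can be reproduced with a short Python sketch. Two assumptions are made here and are not stated in this form in the text: the main cursor is taken to be the largest-magnitude tap (which matches the listed values), and the thresholds are approximated by the midpoints between adjacent noise-free outputs of opposite sign. The midpoint rule is only a high-SNR shortcut, not a solution of (2.1). The names `mu_sets` and `high_snr_thresholds` are hypothetical.

```python
from itertools import product


def mu_sets(h):
    """Noise-free channel outputs given b[n] = +1 and b[n] = -1 at the main cursor."""
    # Assumption: the main cursor is the largest-magnitude tap.
    main = max(range(len(h)), key=lambda i: abs(h[i]))
    isi_taps = [tap for i, tap in enumerate(h) if i != main]
    # Every sign pattern of the interfering symbols yields one ISI value.
    isi = [sum(s * tap for s, tap in zip(signs, isi_taps))
           for signs in product((-1, 1), repeat=len(isi_taps))]
    mu_plus = sorted(round(h[main] + v, 10) for v in isi)
    mu_minus = sorted(round(-h[main] + v, 10) for v in isi)
    return mu_plus, mu_minus


def high_snr_thresholds(mu_plus, mu_minus):
    """Midpoints between neighbouring noise-free outputs that carry opposite bits."""
    labelled = sorted([(v, +1) for v in mu_plus] + [(v, -1) for v in mu_minus])
    return [round((a + b) / 2, 10)
            for (a, bit_a), (b, bit_b) in zip(labelled, labelled[1:])
            if bit_a != bit_b]


h = [0.08, 0.07, 0.1, 0.04]
mu_plus, mu_minus = mu_sets(h)
t_o = high_snr_thresholds(mu_plus, mu_minus)
m = len(t_o)  # number of sign transitions in the ordered mu-set
```

For this channel the sketch finds seven sign transitions, and the midpoints coincide with the thresholds $\mathbf{t}_o = [\pm 0.11, \pm 0.08, \pm 0.03, 0]$ quoted for Fig. 2.2(d). At lower SNR the crossover points of the two mixture pdfs would have to be found numerically from (2.1).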
2.1.3 BOA with a Linear Equalizer (LE)
Consider a BOA followed by a $K$-tap linear equalizer (LE) with taps $\mathbf{w} = [w_0, w_1, \ldots, w_{K-1}]$,
such that the equalizer output $y_{eq}[n] = \sum_{k=0}^{K-1} w_k\, x[n-k]$. In a BOA, the representation
levels $\mathbf{r}_o = \{r_{o,1}, r_{o,2}, \ldots, r_{o,N+1}\}$ and the thresholds $\mathbf{t}_o$ are chosen to minimize the link BER.
Obtaining a closed form expression for the BOA's representation levels in the presence of
channel ISI and a LE is in general intractable. Therefore, the gradient descent algorithm [17]
is employed to compute the representation levels iteratively as follows:
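$$\mathrm{BER} = f(\mathbf{h}, \mathbf{r}_o, \mathbf{t}_o, \mathbf{w}, \sigma_n)$$
$$\Delta\mathrm{BER} = f(\mathbf{h}, \mathbf{r}_o + \Delta\mathbf{r}_o, \mathbf{t}_o + \Delta\mathbf{t}_o, \mathbf{w}, \sigma_n) - f(\mathbf{h}, \mathbf{r}_o, \mathbf{t}_o, \mathbf{w}, \sigma_n), \tag{2.2}$$
$$\mathbf{r}_o[j] = \mathbf{r}_o[j-1] + \mu \left.\left(\frac{\partial\,\mathrm{BER}}{\partial \mathbf{r}_o}\right)\right|_{\mathbf{r}_o = \mathbf{r}_o[j-1]} \approx \mathbf{r}_o[j-1] + \mu \left.\left(\frac{\Delta\mathrm{BER}}{\Delta \mathbf{r}_o}\right)\right|_{\mathbf{r}_o = \mathbf{r}_o[j-1]}, \tag{2.3}$$
where it is assumed that $\mathrm{BER} = f(\mathbf{h}, \mathbf{r}_o, \mathbf{t}_o, \mathbf{w}, \sigma_n)$ is known, and $\mathbf{r}_o[j] = \{r_{o,1}[j], r_{o,2}[j], \cdots, r_{o,N+1}[j]\}$
are the ADC representation levels in the $j$th iteration of the gradient search. The
thresholds in the $j$th iteration are obtained as follows:
$$t_{o,i}[j] = \frac{r_{o,i}[j] + r_{o,i+1}[j]}{2}, \quad (i = 1, \cdots, N). \tag{2.4}$$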
The BOA's representation level adaptation algorithm is as follows. First, the ADC
parameters ro and to are initialized. Then, the gradient vector is estimated by computing finite
differences, and ro is updated based on (2.3). The next step is to update to using (2.4). The
last two steps are repeated until the BER converges, i.e., when the difference in the BER
between adjacent iterations is less than a pre-specified value.
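The adaptation loop can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation used in this work: the BER function is passed in as a hypothetical callable `ber(r, t)`, the example uses a simple quadratic stand-in cost (not a BER) so that the loop can be run, and the update moves against the finite-difference gradient with a positive step size, which is the usual descent convention. How that maps onto the sign of µ in (2.3) is an assumption.

```python
def midpoints(r):
    """Thresholds as midpoints of adjacent representation levels, as in (2.4)."""
    return [(a + b) / 2 for a, b in zip(r, r[1:])]


def adapt_levels(ber, r0, step, delta=1e-4, tol=1e-12, max_iter=10_000):
    """Finite-difference gradient search over the representation levels."""
    r = list(r0)
    current = ber(r, midpoints(r))
    for _ in range(max_iter):
        # Estimate each partial derivative by perturbing one level at a time.
        grad = []
        for i in range(len(r)):
            bumped = r[:i] + [r[i] + delta] + r[i + 1:]
            grad.append((ber(bumped, midpoints(bumped)) - current) / delta)
        # Step against the gradient (assumed sign convention), then refresh the cost.
        r = [ri - step * gi for ri, gi in zip(r, grad)]
        updated = ber(r, midpoints(r))
        converged = abs(updated - current) < tol
        current = updated
        if converged:
            break
    return r, midpoints(r)


# Stand-in cost for illustration only: squared distance to hypothetical target levels.
target = [-0.3, -0.1, 0.1, 0.3]


def toy_cost(r, t):
    return sum((ri - ci) ** 2 for ri, ci in zip(r, target))


r_opt, t_opt = adapt_levels(toy_cost, [-0.2, -0.05, 0.05, 0.2], step=0.1)
```

With a real link, `ber` would be replaced by a measured or analytically evaluated BER, and the stopping tolerance would be chosen relative to the BER floor of interest.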
2.2 Achievable BER Improvement via BOA
In this section, we study through analysis and simulations how to maximize the beneﬁts
of the BOA over the CUA. Note: a BOA receiver always achieves the same if not better
BER as compared to a CUA receiver, given the same number of bits, channel and noise
power, because a CUA is a special case of the BOA. An important question to ask is: Under
what conditions are the beneﬁts oﬀered by a BOA over a CUA substantial? In particular,
we wish to determine channel conditions under which BERR(SNR, Bu, Bo) is, say, at least
10×. BERR(SNR, Bu, Bo) is empirically observed to depend strongly on the difference
between the CUA's and the BOA's thresholds, the number of adjacent noise-free channel
outputs with opposing signs, the channel SNR, and the ADC resolution. We therefore discuss
the effect of these factors on BERR(SNR, Bu, Bo). We restrict our analysis to
channels with memory L < 7 in order to enable the derivation of useful insights analytically.
Note that the performance of BOA for channels with large memory L > 7 has been studied
in [18].
In this section, for tractability of analysis, we assume that the ADC (BOA or CUA) is
followed by a memoryless symbol-by-symbol ML decoder and that binary phase-shift keying
(BPSK) signaling is used over a known channel with impulse response h. Thus, dropping
the time index `n', and employing the notation Xc and Xu to represent the random variables
(RVs) corresponding to xc(nT ) and x[n], respectively, for a CUA, we have:
P (Xu = ru,k|b) = P (Xc ∈ Iu,k|b), (k = 1, · · · ,M + 1),
where Xu ∈ {ru,1, ru,2, · · · , ru,M+1}, ru,k is the kth CUA representation level (see Deﬁnition
1). Then, the memoryless ML decision rule for a CUA is given by:
$$\hat{b} = \begin{cases} +1, & \text{if } \dfrac{P(X_u | b = +1)}{P(X_u | b = -1)} > 1 \\ -1, & \text{otherwise.} \end{cases}$$
Similarly, let Xo be the RV corresponding to x[n] for a BOA. Then,
P (Xo = ro,k|b) = P (Xc ∈ Io,k|b), (k = 1, · · · , N + 1),
where Xo ∈ {ro,1, ro,2, · · · , ro,N+1} and ro,k is the kth BOA representation level (see Deﬁni-
tion 1). Then, the memoryless ML decision rule for a BOA is given by:
$$\hat{b} = \begin{cases} +1, & \text{if } \dfrac{P(X_o | b = +1)}{P(X_o | b = -1)} > 1 \\ -1, & \text{otherwise.} \end{cases}$$
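To make the decision rule concrete, the sketch below computes the interval probabilities $P(X = r_k | b)$ for an arbitrary threshold vector by integrating the Gaussian mixture over each interval, and then the BER of the memoryless ML detector. Equiprobable bits and equally weighted noise-free outputs are assumed, and the function names are hypothetical. Double-precision arithmetic limits this direct approach to BERs well above roughly 1e-16.

```python
import math


def phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def interval_probs(thresholds, mus, sigma):
    """P(X_c falls in each quantization interval), for equally likely noise-free outputs."""
    edges = [-math.inf] + sorted(thresholds) + [math.inf]
    probs = []
    for lo, hi in zip(edges, edges[1:]):
        mass = sum(phi((hi - mu) / sigma) - phi((lo - mu) / sigma) for mu in mus)
        probs.append(mass / len(mus))
    return probs


def ml_ber(thresholds, mu_plus, mu_minus, sigma):
    """BER of the memoryless ML detector acting on the quantizer output."""
    p_plus = interval_probs(thresholds, mu_plus, sigma)
    p_minus = interval_probs(thresholds, mu_minus, sigma)
    # In each interval the detector picks the likelier bit, so the error
    # contribution is the smaller of the two conditional probabilities.
    return 0.5 * sum(min(a, b) for a, b in zip(p_plus, p_minus))


# Sanity example: one threshold at 0, noise-free outputs at +1 and -1, unit noise.
ber_single = ml_ber([0.0], [1.0], [-1.0], 1.0)
ber_refined = ml_ber([-0.5, 0.0, 0.5], [1.0], [-1.0], 1.0)
```

The same `ml_ber` function can be evaluated for a CUA threshold vector and a BOA threshold vector on the same channel, and the ratio of the two results gives a numerical BERR for that SNR.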
2.2.1 BERR Expression
We wish to analytically predict BERR(SNR,Bu, Bo) given its arguments and the channel
h. Such analysis will eliminate the need for expensive Monte Carlo (MC) simulations.
Furthermore, conditions under which a BOA can offer a BERR(SNR, Bu, Bo) of $10^r$ can be
derived.
First, the following deﬁnitions are provided.
Definition 3. A channel h is said to be m-clustered if there are m transitions (µ-transitions)
in its µ-set, where a µ-transition occurs when an element of the µ+ set is followed by an
element of the µ− set or vice versa. Note: m > 0 and takes odd values only; m > N in the
low-SNR scenario, while m = N in the high-SNR scenario.
Deﬁnition 4. The threshold non-uniformity metric ht of a log2(N+1)-bit BOA is a measure
of the diﬀerence between to and tu, and is deﬁned as:
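$$h_t = \frac{-1}{\log_2\left(\frac{N+1}{2}\right)} \left[ \sum_{i=2}^{(N+1)/2} \left(\frac{t_{o,i} - t_{o,i-1}}{y_{\max}}\right) \log_2\left(\frac{t_{o,i} - t_{o,i-1}}{y_{\max}}\right) + \left(\frac{t_{o,1} + y_{\max}}{y_{\max}}\right) \log_2\left(\frac{t_{o,1} + y_{\max}}{y_{\max}}\right) \right] \tag{2.5}$$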
where [−ymax, ymax] is the ADC FSR, and only the non-positive BOA thresholds are used
since the BOA's thresholds are symmetric about the origin. Note: 0 ≤ ht ≤ 1, and the larger
the value of ht the closer are the BOA's thresholds to those of the CUA.
Figure 2.3(a) shows an example when m = 3 for a 4-tap channel (thus there are $2^4 = 16$
elements in µ), and Fig. 2.3(b) illustrates two examples when ht = 0.7564 and ht = 1,
respectively. Algorithm 1 can be employed to obtain m and ht for a specific channel.
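As a worked check of (2.5), the sketch below evaluates ht as the normalized entropy of the gaps between the non-positive thresholds. One assumption is needed to reproduce the two values quoted for Fig. 2.3(b): the entries at ±ymax in the threshold lists of that figure are treated as the FSR edges rather than as thresholds, so the threshold vectors passed in are [±0.2, ±0.1, 0] and [±0.11, ±0.09, 0]. The function name is hypothetical.

```python
import math


def threshold_nonuniformity(t_o, y_max):
    """h_t per (2.5), using only the non-positive thresholds of a symmetric BOA."""
    neg = sorted(t for t in t_o if t <= 0)            # (N + 1) / 2 thresholds, ascending
    # Gap from the lower FSR edge to the first threshold, then gaps between thresholds.
    gaps = [neg[0] + y_max] + [b - a for a, b in zip(neg, neg[1:])]
    fractions = [g / y_max for g in gaps]
    entropy = -sum(p * math.log2(p) for p in fractions if p > 0)
    return entropy / math.log2(len(neg))


y_max = 0.3
# Assumption: the +/-0.3 entries in Fig. 2.3(b) are the FSR edges, not thresholds.
h_t_uniform = threshold_nonuniformity([-0.2, -0.1, 0.0, 0.1, 0.2], y_max)
h_t_skewed = threshold_nonuniformity([-0.11, -0.09, 0.0, 0.09, 0.11], y_max)
```

Under this reading the uniformly spaced case evaluates to 1 and the second case to about 0.756, in line with the values given for cases I and II.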
Definition 5. Let $d_{o,k}^* = \min_{\mu \in \boldsymbol{\mu}} (|t_{o,k} - \mu|)$ be the minimum distance of the $k$th BOA threshold
$t_{o,k}$ ($k = 1, \cdots, N$) to the nearest noise-free channel output $\mu$. Then, the minimum BOA
distance $d_{o,\min} = \min_{1 \le k \le N} (d_{o,k}^*)$.
Deﬁnition 6. Each of the (M + 1) intervals Iu,k with k = 1, · · · ,M + 1, in a CUA has a
dominant noise-free channel output µ∗k given by
Figure 2.3: Examples of m-clustering and ht: (a) m-clustering with m = 3, and (b) ht for
to = [±0.3,±0.2,±0.1, 0] (case I where ht = 1) and to = [±0.3,±0.11,±0.09, 0] (case II
where ht = 0.7564) with ymax = 0.3.
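$$\mu_k^* = \begin{cases} \arg\max\limits_{\mu_l^- \in \boldsymbol{\mu}^-} \left[ \int_{I_{u,k}} \mathcal{N}(x; \mu_l^-, \sigma_n)\, dx \right], & \text{if } \dfrac{P(X_u = r_{u,k} | b = +1)}{P(X_u = r_{u,k} | b = -1)} > 1 \\ \arg\max\limits_{\mu_l^+ \in \boldsymbol{\mu}^+} \left[ \int_{I_{u,k}} \mathcal{N}(x; \mu_l^+, \sigma_n)\, dx \right], & \text{otherwise.} \end{cases}$$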
Definition 7. Let $d_{u,k}^* = -\min(\mu_k^* - t_{u,k-1},\; t_{u,k} - \mu_k^*)$ be the minimum distance of the $k$th
CUA's dominant noise-free channel output $\mu_k^*$ from the boundaries of the $k$th interval $I_{u,k}$.
Then, the minimum CUA distance $d_{u,\min} = \min_{1 \le k \le (M+1)} (d_{u,k}^*)$.
Figure 2.4 shows an example of the marginal pdf of the channel output, illustrating the
corresponding $d_{o,k}^*$, $\mu_k^*$ and $d_{u,k}^*$.
Figure 2.4: An illustrative example for $d_{o,k}^*$, $\mu_k^*$ and $d_{u,k}^*$.
Algorithm 1 Algorithm to obtain the m-clustered value and ht for an ADC-ML receiver.
1. Initialize the channel h and the SNR, and calculate the noise variance $\sigma_v^2$ based on h and the
SNR.
2. Define the main cursor of h, calculate $\boldsymbol{\mu}^+ = \{\mu_l^+\}_{l=1}^{2^{L-1}}$ and $\boldsymbol{\mu}^- = \{\mu_l^-\}_{l=1}^{2^{L-1}}$, respectively, and
obtain the ordered set $\boldsymbol{\mu} = \{\boldsymbol{\mu}^+ \cup \boldsymbol{\mu}^-\}$.
3. Count the number of transitions in µ, which is the m-clustered value.
4. Obtain $\mathbf{t}_o$ using equation (2.1).
5. Calculate ht using equation (2.5).
In Section 2.6, we show that BERR(SNR,Bu, Bo) is given by:
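$$\mathrm{BERR}(\mathrm{SNR}, B_u, B_o) \approx \begin{cases} \dfrac{d_{o,\min}}{2 d_{u,\min}}\, e^{\frac{d_{o,\min}^2 - d_{u,\min}^2}{2\sigma_n^2}}, & \text{if } d_{u,\min} > 0 \\ \sqrt{\dfrac{\pi}{2}}\, \dfrac{d_{o,\min}}{\sigma_n}\, e^{\frac{d_{o,\min}^2}{2\sigma_n^2}}, & \text{if } d_{u,\min} < 0 \\ \sqrt{\dfrac{\pi}{8}}\, \dfrac{d_{o,\min}}{\sigma_n}\, e^{\frac{d_{o,\min}^2}{2\sigma_n^2}}, & \text{if } d_{u,\min} = 0. \end{cases} \tag{2.6}$$
Equation (2.6) indicates that BERR(SNR, Bu, Bo) increases with $d_{o,\min}^2$ or $(d_{o,\min}^2 - d_{u,\min}^2)$.
Furthermore, BERR(SNR, Bu, Bo) can be predicted given the channel h (thus
$\boldsymbol{\mu}^-$, $\boldsymbol{\mu}^+$, $d_{o,\min}^2$ and $d_{u,\min}^2$) and SNR.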
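Expression (2.6) is straightforward to evaluate once the two minimum distances are known. The following sketch is a direct transcription of the three cases. The function name and the numeric inputs are illustrative assumptions and do not correspond to a channel from this chapter.

```python
import math


def berr_estimate(d_o_min, d_u_min, sigma_n):
    """Approximate BERR per (2.6) from the minimum BOA and CUA distances."""
    two_var = 2.0 * sigma_n ** 2
    if d_u_min > 0:
        # The CUA's dominant noise-free output lies outside its interval.
        return (d_o_min / (2.0 * d_u_min)) * math.exp((d_o_min ** 2 - d_u_min ** 2) / two_var)
    gain = (d_o_min / sigma_n) * math.exp(d_o_min ** 2 / two_var)
    if d_u_min < 0:
        return math.sqrt(math.pi / 2.0) * gain
    return math.sqrt(math.pi / 8.0) * gain          # d_u_min == 0


# Illustrative numbers only.
sigma_n = 0.0024
berr_inside = berr_estimate(0.01, -0.005, sigma_n)   # d_u_min < 0
berr_edge = berr_estimate(0.01, 0.0, sigma_n)        # d_u_min = 0
berr_equal = berr_estimate(0.01, 0.01, sigma_n)      # d_u_min = d_o_min > 0
```

Because the exponent grows with $d_{o,\min}^2 / \sigma_n^2$, the estimate can overflow double precision at very high SNR, in which case working with the logarithm of BERR is the safer choice.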
MC simulations of link BER were run with $10^8$ (for SNR ≤ 36 dB) or $10^{12}$ (for SNR >
36 dB) samples and for SNR ranging from 18 dB to 40 dB for channels h = [0.09, 0.1, 0.08,
−0.05] and h = [0.08, 0.07, 0.1, 0.04], respectively. Figure 2.5 indicates that the analytical
expressions (2.9) and (2.12) can predict the results of the MC simulations to within an
order of magnitude, and thus can be employed to estimate the peo and peu. Furthermore, as
expected, the expressions (2.9) and (2.12) become more accurate at high SNRs. Finally, it
can be seen in Fig. 2.5 that 3-bit BOA achieves a shaping gain of 2 dB (8 dB) over a 6-bit
(4-bit) CUA at a BER of $10^{-5}$ ($10^{-3}$). Figure 2.6 shows that BERR can also be predicted
via (2.6) to within an order of magnitude of MC simulations, and that it increases with SNR.
Figure 2.5: Comparison among peo from MC simulation (peo(sim)), peo estimated using
(2.9) (peo(10)), peu from MC simulation (peu(sim)), and peu estimated using (2.12) (peu(13)),
for channels (a) h = [0.09, 0.1, 0.08, −0.05] when Bu = 6 and Bo = 3, and (b)
h = [0.08, 0.07, 0.1, 0.04] when Bu = 4 and Bo = 3, respectively.
Figure 2.6: Comparison between BERR(SNR, Bu, Bo) from MC simulation (BERRsim),
and BERR(SNR, Bu, Bo) estimated using (2.6) (BERR(7)) for channels (a) h =
[0.09, 0.1, 0.08, −0.05] when Bu = 6 and Bo = 3, and (b) h = [0.08, 0.07, 0.1, 0.04] when
Bu = 4 and Bo = 3, respectively.
2.2.2 BER Improvement vs. Channel ISI
This subsection presents an empirical study of BERR as a function of channel ISI. In the rest
of this subsection, we consider the special case of log2(m+ 1)-bit BOA and CUA, i.e., Bu =
Bo = log2(m + 1). Speciﬁcally, we study the relationship between BERR(SNR,Bu, Bo)
and m-cluster and ht, for a set of 4-tap channels h = [1, a1, a2, a3] (ai = 0.1 : 0.1 : 0.9, i =
1, 2, 3) using an ADC-ML receiver. The corresponding results are shown in Fig. 2.7, where
BERR(SNR,Bu, Bo) is calculated using BER expressions (2.9) and (2.12) when SNR =
38 dB. Note: the results when $p_{e,u} > 10^{-1}$, which occurs because m = 1 or the channel
ISI is too large for the given SNR, are removed in Fig. 2.7 for better illustration. The
BERR(SNR, Bu, Bo) and ht for the channel (under which the measured results are shown
in Fig. 2.17) are also shown in Fig. 2.7, which has log2(m + 1) = 2.6 ≈ 3 and ht ≈ 0.46.
Figure 2.7 indicates that smaller ht and larger m combinations are likely to result in larger
BERR(SNR, Bu, Bo). Furthermore, Fig. 2.7 shows that BERR(SNR, Bu, Bo) > $10^6$ when
m ≥ 5 and ht ≤ 0.8.
2.2.3 BER Improvement vs. Number of Bits Bx in the ADC
In this subsection, we claim that for any given realization of the type considered in this
section, there exists an optimal ADC resolution in the sense that it achieves the maximum
of the function BERR(SNR,Bx, Bx). To see this, note that pe,u decreases with Bx, which
is a consequence of peu(SNR,Bx + 1) being deﬁned through a maximization over the set
of decision regions that includes those used to achieve peu(SNR,Bx) as a subset. Except
on a set of measure zero (for which peu(SNR,Bx) = peo(SNR,Bx)), peu(SNR,Bx + 1)
< peu(SNR,Bx), for all Bx. On the other hand, we note that peo decreases with Bx for
Bx < log2(N + 1), following the same reasoning. However, once Bx = log2(N + 1), further
increases in Bx provide no additional improvement in peo (for the memoryless symbol-by-symbol
detector used in this analysis). As a result, the ratio BERR(SNR, Bx, Bx) will
monotonically decrease for Bx > log2(N + 1). Hence, it must achieve a maximal value for
one (or more) Bx ≤ log2(N + 1). The BERR(SNR, Bx, Bx) versus Bx plot in Fig. 2.8(a)
shows that BERR(36 dB, Bx, Bx) is maximized when Bx = 3.
Figure 2.7: BERR(SNR, Bu, Bo) vs. ht, where Bu = Bo = log2(m + 1), for channels
h = [1, a1, a2, a3] (ai = 0.1 : 0.1 : 0.9, i = 1, 2, 3) using an ADC-ML receiver when
SNR = 38 dB. The value of ht and measured BERR(SNR, 4, 3) for a FR4 channel using
an ADC-LE receiver (described in section 2.3) are also shown.
The intuition gained from the analysis of the ADC-ML receiver holds for the ADC-
LE receivers. This is conﬁrmed by the simulated BERR(SNR,Bx, Bx) versus Bx plot
shown in Fig. 2.8(b), which is obtained for a FR4 channel at 10 Gb/s in Fig. 2.1, when
SNR = 36 dB. Figure 2.8(b) also shows, under the given condition, Bx = 3 maximizes the
BERR(SNR,Bx, Bx).
2.3 Implementation of BOA Receiver
In this section, a prototype IC implementation of BOA receiver is described. Figure 2.9
shows the block diagram of the receiver, which consists of a BOA chip and an Altera FPGA
board. The FPGA board implements the back-end DSP blocks. The ADC chip includes a
reconﬁgurable 4-bit 4 GS/s ﬂash ADC and an 8-bit digital-to-analog converter (DAC). The
8-bit DAC enables variable-threshold and variable-resolution ADC conﬁgurations, so that
Figure 2.8: BERR(SNR,Bx, Bx) vs. Bx of (a) an ADC-ML receiver, when the channel
impulse response is h = [0.09, 0.1, 0.08, 0.04] and SNR = 36 dB, and (b) an ADC-LE
receiver, when the channel impulse response is equal to the FR-4 channel (see Fig. 2.1)
and SNR = 36 dB.
the performance of a CUA receiver can be compared with that of the BOA receiver. The
back-end DSP block includes a LE and QL-UD. In the BOA receiver, the ADC quantizes
the channel outputs and provides these into the back-end DSP block, which implements
an adaptive LE. Once the equalizer coeﬃcients converge, the quantization levels ro that
minimize link BER are obtained using gradient descent search algorithm. The updated
quantization thresholds to are then fed back into the ADC chip.
)LJXUH%ORFNGLDJUDPRIWKHRYHUDOO%(5PLQLPL]LQJ$'&EDVHGUHFHLYHU7KH$'&,&LV
&KDQQHO
4/8'
(1&
6OLFHU
WDS/(
:HLJKW
XSGDWH
35%6
*HQ
'DWD
6\QF
Back-end FPGA 
&RPSRVLWH5HFHLYHU
%(5
0LQLPL]LQJ
$'&
Figure 2.9: Block diagram of the BOA receiver.
[Figure 2.10 block labels: ADC chip, Clock distributor, Comparator array, Thermometer-to-binary encoder, Register bank, Storage capacitor, DAC.]
Figure 2.10: Block diagram of the BOA chip.
2.3.1 ADC Full-chip Block Diagram
The ADC chip consists of an 8-bit DAC, storage capacitor array, and a 4-bit ﬂash ADC, as
illustrated in Fig. 2.10. In a conventional ﬂash ADC, the threshold voltages are generated by
a resistor ladder. A BOA, however, requires variable thresholds. Thus, a DAC is employed
for threshold generation. System analysis shows that an 8-bit DAC is required to ensure
that the 3-bit BOA receiver can achieve similar or better BER performance compared to a
4-bit CUA receiver. In principle, 30 DACs are needed for 4-bit ADC threshold generation.
To minimize power and area overhead, a single-core, multiple-output passive DAC, which
is an extension of the single-core single-output passive DAC presented in [58], is proposed
to generate the variable threshold voltages. The threshold voltage generator has a power
overhead of 10% (∼ 50µW) compared to a ﬁxed resistor string for a CUA.
Figure 2.11 illustrates the operation of the 8-bit DAC. A single voltage threshold update
occurs over two phases of non-overlapping clocks φ1 and φ2. In phase I (φ1 = 1, φ2 =
0), the 4 MSBs of the 8-bit DAC input select a 4-bit section of the resistor ladder to
charge the 4-bit unit capacitor (Cu) array. In phase II (φ1 = 0, φ2 = 1), Cu and Cref are
connected together. The resulting charge sharing shifts the threshold voltage toward the
desired value. The nominal, post-layout extracted operating frequency of the DAC core is
12 MHz, resulting in an update frequency per Cref of 375 kHz. This is more than suﬃcient
to compensate for leakage. The 8-bit DAC updates the thresholds of the comparator array
26
(a)
%è:sxF 0; %åØÙ6=
4 /
ö5 
ö6 
LOAD[0] 
LOAD[1] 
LOAD[29] 
1 update cycle 
(b)
Figure 2.11: The 8-bit single-core multiple-output DAC: (a) circuit schematic, and (b)
timing diagram.
sequentially. Therefore, only the threshold of one comparator is updated at each moment.
The storage capacitors hold the thresholds for the other comparators. In a ﬂash ADC,
one input of each comparator is connected to the analog input, while the other input is
connected to the corresponding threshold. The comparator array compares the analog input
with the threshold voltages simultaneously and produces comparison results in the form of
a thermometer code. Thus, a binary encoder following the comparator array is needed for
the ease of back-end digital processing.
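The charge-sharing update can be illustrated numerically. The sketch below is a simplified model and not the circuit of Fig. 2.11: it assumes an ideal unit capacitor that is recharged to the target voltage in phase I and shared with a storage capacitor in phase II, and the capacitance and voltage values are hypothetical. It only shows how repeated sharing moves the stored threshold toward the desired value.

```python
def share_charge(v_a, c_a, v_b, c_b):
    """Voltage after two ideal capacitors are connected (charge conservation)."""
    return (c_a * v_a + c_b * v_b) / (c_a + c_b)


def settle_threshold(v_target, v_start, c_u, c_ref, n_updates):
    """Repeat phase I (charge c_u to v_target) and phase II (share with c_ref)."""
    v_ref = v_start
    trace = []
    for _ in range(n_updates):
        v_ref = share_charge(v_target, c_u, v_ref, c_ref)
        trace.append(v_ref)
    return trace


# Hypothetical values, chosen only for illustration (not taken from the chip).
trace = settle_threshold(v_target=0.6, v_start=0.0, c_u=1.0, c_ref=4.0, n_updates=20)
print([round(v, 4) for v in trace[:3]], round(trace[-1], 4))

# Ratio of the quoted DAC core frequency to the quoted per-Cref update frequency.
core_cycles_per_update = 12e6 / 375e3
print(core_cycles_per_update)
```

In this model the remaining error shrinks by the factor c_ref / (c_u + c_ref) at every update, so a larger storage capacitor holds its value better but settles more slowly.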
2.3.2 Back-end DSP in the FPGA
The back-end DSP units are implemented on an Altera Transceiver Signal Integrity Development
FPGA board [59], in which the DSP units operate at a frequency of 100 MHz. Thus, 40
parallel channels are used to handle the 4 Gb/s outputs from the ADC chip. As shown in Fig.
2.9, the back-end DSP block consists of an encoder (ENC), a data synchronization unit (Data
Sync), an LE, and a QL-UD unit. The binary encoder converts the ADC output $x[n]$ into the
two's complement representation $x_r[n]$, while the data synchronization unit provides the start
position of the pseudorandom binary sequence (PRBS). A 3-tap least mean squares (LMS)
adaptive equalizer compensates for channel ISI, while the QL-UD unit adjusts
the ADC quantization thresholds $t_o$ and representation levels $r_o$ to achieve the minimum
BER. As the other blocks are standard DSP functional blocks, we focus on the implementation
of the QL-UD unit. The equalizer computes an estimate of the transmitted symbol $b[n-\Delta]$
based on the encoder output $[x_r[n], \ldots, x_r[n-M+1]]^T$ as:
\[ \hat{b}[n-\Delta] = \sum_{k=0}^{M-1} w[k]\, x_r[n-k]. \]
The estimation error $e[n] = b[n-\Delta] - \hat{b}[n-\Delta]$ is used to adjust the equalizer coefficients
$w$. Once $w$ converges, $e[n]$ can be used to update the ADC representation levels. Fixing
the quantization thresholds $t_o$, the optimal representation levels $r_o$ that minimize the MSE
between $b[n]$ and $\hat{b}[n]$ can be obtained from a gradient descent search. The LMS update
of the quantization levels is obtained by approximating the gradient of the MSE by its
instantaneous value,
\[ r_{o,i}[n+1] = r_{o,i}[n] + \mu_r\, e[n] \sum_{k:\, x_r[n-k] = r_{o,i}} w[k]. \]
The LMS update can be further modified to the AMBER algorithm [15]:
\[ r_{o,i}[n+1] = r_{o,i}[n] + \mu_r\, I[n]\, \mathrm{sgn}(e[n]) \sum_{k:\, x_r[n-k] = r_{o,i}} w[k], \]
where $I[n]$ is the bit error indicator function. Figure 2.12 shows the block diagram of the
quantization level update unit and of the individual RL-UD block, which updates each quantization
level based on the current ADC output, equalizer estimation error, and equalizer coefficients.
The architecture of the update unit for each representation level $r_{o,i}$ is shown in Fig. 2.12(b).
A total of $2^{B_o}$ such units are needed to update the representation levels $r_o$. The total number
of gates and power consumption of the QL-UD unit are estimated to be about 90K (NAND
gate equivalents) and 12.3 mW, respectively. This power accounts for about 44% of the total
power of the back-end DSP, and is expected to decrease with technology scaling. Note that the
QL-UD unit can be turned off once the BOA representation levels are obtained.
Figure 2.12: Architectures of: (a) the QL-UD unit, and (b) the ith RL-UD unit.
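The two update rules above can be sketched in a few lines. This is a minimal floating-point illustration under stated assumptions, not the fixed-point hardware of Fig. 2.12: ADC outputs are modeled as integer codes indexing a list of representation levels, `codes[k]` plays the role of $x_r[n-k]$, and the tap weights, levels and step size are hypothetical.

```python
def sign(v):
    """Return -1, 0 or +1."""
    return (v > 0) - (v < 0)


def equalizer_output(w, codes, levels):
    """b_hat = sum_k w[k] * level(code[n-k]); codes[0] is the newest sample."""
    return sum(w_k * levels[c] for w_k, c in zip(w, codes))


def lms_level_update(levels, w, codes, err, mu_r):
    """LMS rule: level i moves by mu_r * e[n] * (sum of taps currently holding level i)."""
    updated = list(levels)
    for w_k, c in zip(w, codes):
        updated[c] += mu_r * err * w_k
    return updated


def amber_level_update(levels, w, codes, err, bit_error, mu_r):
    """AMBER-style rule: update only when a bit error occurred, using sgn(e[n])."""
    if not bit_error:
        return list(levels)
    return lms_level_update(levels, w, codes, sign(err), mu_r)


# Hypothetical 2-bit example with a 3-tap equalizer.
levels = [-0.75, -0.25, 0.25, 0.75]
w = [1.0, -0.2, 0.05]
codes = [3, 1, 3]          # codes seen by taps k = 0, 1, 2
b = 1.0                    # transmitted symbol b[n - delta]

err_before = b - equalizer_output(w, codes, levels)
levels_lms = lms_level_update(levels, w, codes, err_before, mu_r=0.1)
err_after = b - equalizer_output(w, codes, levels_lms)
print(err_before, err_after)
```

A single LMS step reduces the magnitude of the estimation error for the same input, while the AMBER-style variant leaves the levels untouched unless the indicator flags a bit error.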
2.3.3 Comparator Design
The comparator consists of a preamplifier and three cascaded latches; Fig. 2.13 illustrates
the schematics of the preamplifier and latches. The preamplifier subtracts the analog input
from the threshold and provides the polarity of the comparison result. The cascaded latches
amplify the preamplifier output to reduce the occurrence of metastability. The preamplifier
topology shown in Fig. 2.13 is widely employed in high-speed ADCs [12,13]. The first
latch is a current-mode latch [12], which is composed of an input differential pair (M1, M2)
and a cross-coupled regenerative latch pair (M3, M4) sharing the same resistive load, RD.
When CLK is high, the circuit is in tracking mode with low gain and large bandwidth. When
CLK is low, the circuit shifts to regenerative mode, and the signal sampled in
tracking mode is amplified and delivered to the next stage. The second and third latches
do not consume static power once they fully regenerate, and are referred to as dynamic
latches [60], [61]. In this design, only the first latch is a current-mode latch because it is
the most critical for guaranteeing accurate comparison results. For example, if the first latch does
not have a large enough bandwidth to follow the updated polarity of the preamplifier output,
the comparator may generate an incorrect output, regardless of how well the following latches
perform. Note that the cascaded latches operate in a pipelined manner for
speed. In particular, when a latch is in tracking mode,
the subsequent latch is in regeneration mode, and vice versa.
Figure 2.13: Schematics of the preampliﬁer and latches.
2.3.4 Encoder Design
A Gray encoder is used because it is more compact and faster than a summing encoder [13].
Since less pipelining is required for multi-gigahertz operation, the Gray encoder also consumes less
power than a summing encoder. Encoding proceeds in three steps. First, the thermometer
code is converted to a 1-of-N code. Second, the 1-of-N code is converted to a Gray code
to suppress bubble errors. Finally, the Gray code is converted to a binary code by XOR gates
for further processing in the subsequent DSP blocks. In this chip, pipelined
D flip-flops (DFFs) are used between adjacent logic gates in the encoder to guarantee 4 Gb/s
operation.
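The three encoding steps can be sketched behaviorally as follows. This is an illustrative software model under assumptions, not the gate-level encoder on the chip: the thermometer code is a list of 15 comparator outputs with index 0 at the lowest threshold, and the 1-of-N to Gray step is modeled as a ROM-style OR of the Gray words of all active lines.

```python
def thermometer_to_one_hot(thermo):
    """Step 1: line i is hot where comparator i is 1 and comparator i+1 is 0."""
    padded = list(thermo) + [0]
    return [padded[i] & (1 - padded[i + 1]) for i in range(len(thermo))]


def one_hot_to_gray(one_hot):
    """Step 2: OR together the Gray words of every hot line (ROM-style encoding)."""
    gray = 0
    for i, hot in enumerate(one_hot):
        if hot:
            level = i + 1
            gray |= level ^ (level >> 1)
    return gray


def gray_to_binary(gray):
    """Step 3: binary bit j is the XOR of Gray bits j and above."""
    binary = 0
    while gray:
        binary ^= gray
        gray >>= 1
    return binary


def encode(thermo):
    return gray_to_binary(one_hot_to_gray(thermometer_to_one_hot(thermo)))


clean = [1] * 6 + [0] * 9                     # six comparators tripped
bubble = [1, 1, 1, 0, 1] + [0] * 10           # a bubble activates lines 3 and 5
print(encode(clean), encode(bubble))
```

With the bubble pattern, two 1-of-N lines are active at once; OR-ing their Gray words yields 5, whereas OR-ing the plain binary words 3 and 5 would yield 7, which is one way to see why the Gray step limits the size of bubble-induced errors.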
2.4 Measurement Results
This section summarizes the measurement results of both the stand-alone ADC chip and
the ADC-based receiver. A 4-bit 4 GS/s ADC chip was fabricated in a 90 nm low-power (LP)
CMOS technology with an active area of 0.33 mm^2 and tested in a chip-on-board assembly.
The chip's micrograph is shown in Fig. 2.14(a).
First, we ensure the standalone ADC performance is suﬃcient to support link operation.
Since the ADC chip includes an 8-bit DAC for variable-threshold and variable-resolution
ADC conﬁguration, we utilize it to conﬁgure the ADC's threshold voltages for calibration.
Figure 2.14: (a) Micrograph of the BOA chip, and (b) the test set-up.
Beforecal:+/r 1.3LSB
Aftercal:+/r 0.04LSB







              
/
6
%
&RGH
EHIRUHFDO
DIWHUFDO
/
6
%
(a)
Beforecal:4LSB
Aftercal:0.14LSB
 






              
/
6
%
&RGH
EHIRUHFDO
DIWHUFDO
(b)
3.93 3.93 
3.62 3.58 
3.4 
0
1
2
3
4
5
0 0.5 1 1.5 2
A
D
C
 E
N
O
B
 
b
it
s
 
ADC ,nput )requency 
GHz(c)
Figure 2.15: Standalone ADC measurement results: (a) DNL and (b) INL characteristics
before/after calibration, and (c) measured ENOB vs. input frequency.
Figure 2.16: Measured ADC output: (a) eye diagram, and (b) histogram for a 20-inch
FR4 channel at 4 Gb/s when TX amplitude is 180 mVppd and ADC FSR is 100 mVppd.
The measured DNL and INL characteristics before and after calibration are shown in Fig.
2.15(a) and (b), respectively. After calibration, the DNL and INL are reduced from ±1.3 LSB
and 4 LSB to ±0.04 LSB and 0.14 LSB, respectively, indicating the effectiveness of the
calibration process. Figure 2.15(c) illustrates that the ADC achieves 3.4 bits of
ENOB at a near-Nyquist input frequency of 1.9375 GHz. The figure of merit (FOM) of
the ADC is 1.42 pJ/conv.step at 4 GS/s excluding clock buffers, which is comparable to the
state of the art in a similar technology [13,14].
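As a rough cross-check of the quoted FOM, the sketch below assumes the commonly used Walden definition, FOM = P / (2^ENOB · fs), together with the 4-bit-mode ADC power of 59.7 mW reported for this chip and the near-Nyquist ENOB of 3.4 bits. The text does not state which definition or operating point was used, so these inputs are assumptions.

```python
def walden_fom_pj_per_step(power_w, enob_bits, fs_hz):
    """Walden figure of merit in pJ per conversion step."""
    return power_w / (2 ** enob_bits * fs_hz) * 1e12


# Assumed inputs: 59.7 mW (4-bit CUA mode, no clock buffers), 3.4 bits, 4 GS/s.
fom = walden_fom_pj_per_step(59.7e-3, 3.4, 4e9)
print(round(fom, 2))
```

Under these assumptions the result is within about 0.01 pJ/conv.step of the quoted 1.42 pJ/conv.step; the small gap is consistent with rounding of the ENOB and power values.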
Figure 2.14(b) shows the block diagram of the link test set-up, which mainly consists of
a BER tester (BERT), a channel board, the ADC PCB, and an FPGA board. The BERT
provides 4 Gb/s synchronous data and clock, with the data passing through a 20-inch FR4
channel before entering the ADC board. The ADC chip quantizes the incoming analog signal
and its outputs are fed into the FPGA board. The back-end DSP units in the FPGA then
perform equalization and the search for the optimal ADC representation levels. Finally, the updated
representation levels are fed back to the ADC chip. In our experiment, however, the BOA's
representation levels were obtained off-line due to a synchronization problem in the interface
between the BOA chip and the FPGA board.
Link tests were conducted at 4 Gb/s over a 20-inch channel with $2^{23} - 1$ PRBS data.
Figure 2.17: ENOB and BER measurements: (a) ENOB vs. input frequency, and (b) BER
vs. TX amplitude at 4 Gb/s when the FSR of the CUA is 100 mVppd.
The BOA's representation levels were obtained by first extracting the converged equalizer
coefficients with the ADC IC in 4-bit uniform mode, followed by an off-line adaptive
channel estimation and gradient search procedure [17]. The ADC representation levels were
then manually set in the lab. Figure 2.16 shows the post-ADC eye diagram and the histogram
of the ADC code for a 20-inch FR4 channel, with a TX amplitude of 180 mVppd and a CUA
FSR of 100 mVppd; both indicate that the received eye is closed. In particular, the channel
loss at the Nyquist rate is about −22 dB.
Figure 2.17(a) compares the measured ENOB when the FSR of the CUA is 100 mVppd.
The FSR of the 4-bit CUA was adjusted to achieve the best BER under the given TX
amplitude and channel loss. The 3-bit BOA has the lowest ENOB. The ENOB difference
between the 4-bit CUA and the 3-bit BOA ranges from 1.08 to 1.37 bits, while the
difference between the 3-bit CUA and the 3-bit BOA ranges from 0.48 to 0.74 bits.
Figure 2.17(b) illustrates that the BER achieved by the 3-bit BOA receiver is lower by a
factor of $10^{9}$ and $10^{10}$ compared to the 4-bit and 3-bit CUA receivers, respectively, at a
TX amplitude of 180 mVppd. This is in spite of the 3-bit BOA having a poorer ENOB than
both the 3-bit and 4-bit CUA. Furthermore, the 3-bit BOA requires a 60 mVppd lower TX
swing than a 4-bit CUA to achieve BER < $10^{-12}$. Thus, Fig. 2.17 indicates that
ENOB is not the best ADC design metric for serial links. The bathtub curve in Fig. 2.18
Figure 2.18: Measured BER vs. sampling phase using a 20-inch channel when TX ampli-
tude is 180 mVppd and the FSR of the CUA is 100 mVppd.
Table 2.1: Performance Summary of the ADC-based Receivers
ADC operating mode      3-bit BOA          4-bit CUA
Technology              90 nm LP CMOS (1P8M)
Core die area           0.38 mm^2
Supply voltage          1.2 V for analog, 1.28 V for digital & clock
Data rate               4 Gb/s
Power consumption
  ADC [mW]              30.7               59.7
  B/E digital [mW]      ∗(30 / ∗∗17.7)     ∗16.4
∗Digital back-end power estimated from synthesis in 90 nm LP CMOS.
∗∗The power of the QL-UD unit is excluded.
shows that the 3-bit BOA can tolerate a peak-to-peak jitter of about 43 ps (∼0.17 UI) at
BER = $10^{-12}$ with a TX amplitude of 180 mVppd, while the 4-bit and 3-bit CUAs are unable
to achieve BER < $10^{-4}$ and $10^{-3}$, respectively, under identical conditions.
Table 2.1 summarizes the performance of the proposed BOA receiver, and Table 2.2
compares this work against state-of-the-art ADC-based receivers [22, 53–55] and analog
receivers [62,63] in CMOS. The ADC IC consumes 59.7 mW in 4-bit CUA mode and 30.7 mW
in 3-bit non-uniform mode, excluding the clock buffers. The clock buffers in our design accept
external clocks and have to drive a long interconnect before reaching the ADC comparators,
as the ADC occupies a small fraction (< 9%) of the die area. Furthermore, the power
Table 2.2: Performance Comparison with State-of-the-art ADC-based Receivers and Analog
Receivers in CMOS
                       ADC-based Receivers                                             Analog Receivers
                       This work          [22]        [53]        [54]        [55]     [62]        [63]         [64]
Process                90 nm LP           65 nm       65 nm       65 nm       65 nm    40 nm       40 nm        90 nm
Sampling rate [GS/s]   4                  2.5         2.575       2.5         1.2875   5.1563      7.0125       3.125
Number of bits         3                  3           6           5           8        N/A         N/A          N/A
BER@Channel            < 10^-12           < 10^-7     < 10^-15    < 10^-12    N/A      < 10^-12    < 10^-15     < 10^-15
Channel loss           −22 dB             −17 dB      −26 dB      34''        N/A      ∼ −28 dB    −26 dB       −15 dB
RX power [mW]          60.7               106         500         192         1600     87          410          8.0
Efficiency [pJ/bit]    ∗(15.2 / ∗∗12.1)   10.6        48.5        38.4        155.3    ∗∗∗8.4      ∗∗∗∗29.23    1.28
FOM2                   ∗(10.9 / ∗∗13.7)   18.8        10.3        0.5568      N/A      74.8        13.6         24.7
∗Digital back-end power estimated from synthesis in 90 nm LP CMOS.
∗∗The power of the QL-UD unit is excluded.
∗∗∗With both Tx and Rx.
∗∗∗∗With both Tx and Rx and under the worst case PVT conditions.
Note: FOM2 = DR (Gb/s) × 10^(loss/10) / Power (mW) [65], where DR stands for data rate.
consumption of the clock buffers for the ADC core alone could not be measured because
the clock buffers driving the interconnect and those driving the ADC core share power pins.
The power of the clock buffers driving the ADC core alone, extracted from post-layout
simulations, is about 10 mW when the ADC operates at 4 GS/s in 4-bit CUA mode. The digital
back-end power, including all the functional blocks in the FPGA, was estimated via synthesis
to be 30 mW and 17.7 mW when including and excluding the QL-UD unit, respectively. The
energy efficiency of this receiver excluding the clock buffers is 15.2 pJ/bit and 12.1 pJ/bit
when including and excluding the QL-UD unit, respectively.
The presented receiver achieves a BER of less than $10^{-12}$ with the lowest ADC resolution
(3-bit non-uniform; note that a 3-bit CUA was not able to achieve BER < $10^{-3}$ under the
same conditions) at the highest ADC sampling rate (4 GS/s), while achieving more than
2× higher energy efficiency than [53], [54] and [55]. Taking channel loss into
account, our solution achieves a higher (better) figure of merit FOM2 (proposed in [65]) than
[53] and [54]. Implemented in a more advanced technology and combining several low-power
circuit techniques, [22] achieves a better energy efficiency than this work. However, [22] only
showed a measured BER of $10^{-7}$, although an extrapolated BER of $10^{-15}$ was reported. Based
on simulations, a 2.3-bit (5-comparator) BOA is sufficient if the target BER is relaxed from
$10^{-12}$ to $10^{-7}$, which translates to about 2/7 power savings in the ADC. As a result,
the efficiency improves from 15.2 pJ/bit to 13.0 pJ/bit. Furthermore, it should be noted
that the power consumption of a flash ADC is mostly determined by its sampling rate and
the process technology. Compared to [22], our ADC has a higher sampling rate (4 GS/s vs.
2.5 GS/s in [22]) while being implemented in a slower technology (90 nm LP vs. 65 nm
in [22]). Therefore, our solution is expected to achieve comparable or better energy
efficiency if the sampling rate and process technology are identical. On the other hand,
Table 2.2 shows that the energy efficiency and FOM2 of ADC-based receivers need to be further
improved compared with analog receivers [62–64].
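The energy-efficiency figures quoted above can be re-derived from the reported powers and the 4 Gb/s data rate. The sketch below is simple arithmetic; the one assumption is that, for the relaxed-BER case, only the ADC power is scaled by the 2/7 saving while the digital back-end power (including the QL-UD unit) is left unchanged.

```python
def energy_per_bit_pj(power_mw, data_rate_gbps):
    """mW divided by Gb/s gives pJ/bit."""
    return power_mw / data_rate_gbps


ADC_3B_BOA_MW = 30.7        # ADC power in 3-bit BOA mode, clock buffers excluded
DIGITAL_WITH_QLUD_MW = 30.0
DIGITAL_NO_QLUD_MW = 17.7
DATA_RATE_GBPS = 4.0

with_qlud = energy_per_bit_pj(ADC_3B_BOA_MW + DIGITAL_WITH_QLUD_MW, DATA_RATE_GBPS)
without_qlud = energy_per_bit_pj(ADC_3B_BOA_MW + DIGITAL_NO_QLUD_MW, DATA_RATE_GBPS)

# Assumed: a 5-comparator BOA saves 2/7 of the ADC power, digital power unchanged.
relaxed_ber = energy_per_bit_pj(
    ADC_3B_BOA_MW * (1 - 2 / 7) + DIGITAL_WITH_QLUD_MW, DATA_RATE_GBPS
)
print(round(with_qlud, 1), round(without_qlud, 1), round(relaxed_ber, 1))
```

Under this assumption the three values round to the 15.2, 12.1 and 13.0 pJ/bit stated in the text.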
A key outcome of this work is the demonstration of the benefits of information-based metrics,
such as the BER, in reducing the ADC precision requirements, and the identification of
conditions that maximize the BER improvement offered by a BOA receiver over a CUA
receiver under the same conditions. Although we demonstrate the BOA concept via a flash
ADC, BOA is in principle applicable to different ADC architectures because it adjusts the
ADC thresholds but does not change the ADC architecture. However, the power savings
when designing a BOA using other ADC architectures will not be as great as with flash ADCs.
In particular, for a flash ADC, every one-bit decrease in resolution almost halves the size of the
ADC core circuitry and its power. In contrast, for a SAR, pipelined, or sigma-delta ADC,
the die size and power decrease only linearly with a decrease in resolution.
2.5 Summary
This chapter describes our study on the benefits of BOA for serial links. First, we discuss
conditions that maximize the BER improvement of a BOA receiver over a CUA receiver, and
propose two channel-dependent parameters to quantify these conditions. Furthermore, a
4 Gb/s BOA receiver, which employs the true system BER to adjust the ADC representation
levels and a linear equalizer, was implemented in 90 nm LP CMOS to show that a 3-bit BOA
needs a lower SNR than a 4-bit CUA at a BER < $10^{-12}$. This study demonstrates that the
use of information-based system metrics such as the BER is very effective in reducing the
component power of information transfer in ML systems. It inspires us to extend this
design principle to address the challenge of energy-efficient information processing in ML
systems, as described in the following chapters.
2.6 Derivation of BERR
In this section, we derive (2.6).
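In an ADC-ML receiver, the BER of a $\log_2(N+1)$-bit BOA is given by:
\[ p_{e_o}(\mathrm{SNR}, B_o) = \sum_{k=1}^{N} p_{e_o,k}, \tag{2.7} \]
where $p_{e_o,k}$ is the BER contribution from the $k$th $\mu$-transition. In particular, $p_{e_o,k}$ includes
the BER contribution from all the peaks $\mathcal{N}(x; \mu_l^{+}, \sigma_n)$ with $\mu_l^{+} \in \mu^{+}$
($\mathcal{N}(x; \mu_l^{-}, \sigma_n)$ with $\mu_l^{-} \in \mu^{-}$) to the interval $[t^{l}_{o,k}, t_{o,k})$ and the BER
contribution from all the peaks $\mathcal{N}(x; \mu_l^{-}, \sigma_n)$ with $\mu_l^{-} \in \mu^{-}$
($\mathcal{N}(x; \mu_l^{+}, \sigma_n)$ with $\mu_l^{+} \in \mu^{+}$) to the interval $[t_{o,k}, t^{r}_{o,k})$ if the memoryless ML
decision for the interval $[t^{l}_{o,k}, t_{o,k})$ is $-1$ ($+1$), where $t^{l}_{o,k}$ and $t^{r}_{o,k}$ are defined as follows:
\[ t^{l}_{o,k} = \begin{cases} -\infty, & \text{if } k = 1 \\ \dfrac{t_{o,k-1} + t_{o,k}}{2}, & \text{if } 1 < k \le N \end{cases}
\qquad
t^{r}_{o,k} = \begin{cases} \dfrac{t_{o,k} + t_{o,k+1}}{2}, & \text{if } 1 \le k < N \\ +\infty, & \text{if } k = N. \end{cases} \]
Note: if the decision for the interval $[t^{l}_{o,k}, t_{o,k})$ is $+1$ (or $-1$), then the decision for the interval
$[t_{o,k}, t^{r}_{o,k})$ is $-1$ (or $+1$). At high SNR, this contribution is well-approximated by:
\[ p_{e_o,k} \approx 2^{-(L-1)}\, Q\!\left( \frac{d^{*}_{o,k}}{\sigma_n} \right), \tag{2.8} \]
where $d^{*}_{o,k}$ (see Definition 5) is the minimum distance of the $k$th BOA threshold to the
nearest noise-free channel output $\mu$.
Substituting (2.8) into (2.7), employing the high SNR approximation for the Q-function
($Q(y) \approx \frac{1}{y\sqrt{2\pi}} e^{-y^2/2}$, for $y > 0$), and the approximation
$\sum_k e^{a_k} \approx e^{\max_k(a_k)}$, we get:
\[ \begin{aligned}
p_{e_o}(\mathrm{SNR}, B_o) = \sum_{k=1}^{N} p_{e_o,k}
&\approx \sum_{k=1}^{N} \left[ 2^{-(L-1)}\, Q\!\left( \frac{d^{*}_{o,k}}{\sigma_n} \right) \right] \\
&= \sum_{k=1}^{N} \left[ 2^{-(L-1)}\, \frac{\sigma_n}{\sqrt{2\pi}\, d^{*}_{o,k}}\, e^{-\frac{1}{2}\left( \frac{d^{*}_{o,k}}{\sigma_n} \right)^{2}} \right] \\
&\approx 2^{-(L-1)}\, \frac{\sigma_n}{\sqrt{2\pi}\, d_{o,\min}}\, e^{-\frac{1}{2}\left( \frac{d_{o,\min}}{\sigma_n} \right)^{2}}.
\end{aligned} \tag{2.9} \]
Similarly, the BER of a $\log_2(M+1)$-bit CUA receiver is given by:
\[ p_{e_u}(\mathrm{SNR}, B_u) = \sum_{k=1}^{M+1} p_{e_u,k}, \tag{2.10} \]
where $p_{e_u,k}$ denotes the BER contributed by the $k$th interval $I_{u,k}$. Specifically, $p_{e_u,k}$ includes
the BER contribution from all the peaks $\mathcal{N}(x; \mu_l^{+}, \sigma_n)$ with $\mu_l^{+} \in \mu^{+}$
($\mathcal{N}(x; \mu_l^{-}, \sigma_n)$ with $\mu_l^{-} \in \mu^{-}$)
to the interval $I_{u,k}$ if the memoryless ML decision for the interval $I_{u,k}$ is $-1$ ($+1$). At high
SNR, this contribution is well-approximated by:
\[ p_{e_u,k} \approx 2^{-L}\, Q\!\left( \frac{d^{*}_{u,k}}{\sigma_n} \right), \tag{2.11} \]
where $d^{*}_{u,k}$ (see Definition 7) is the minimum distance of the $k$th dominant noise-free output
$\mu^{*}_{k}$ from the boundaries of the interval $I_{u,k}$.
Substituting (2.11) into (2.10), employing the high SNR approximation for the Q-function,
the approximation $\sum_k e^{a_k} \approx e^{\max_k(a_k)}$, and the relationship $Q(y) = 1 - Q(-y)$ for $y < 0$, we
get:
\[ \begin{aligned}
p_{e_u}(\mathrm{SNR}, B_u) = \sum_{k=1}^{M+1} p_{e_u,k}
&\approx \sum_{k=1}^{M+1} \left[ 2^{-L}\, Q\!\left( \frac{d^{*}_{u,k}}{\sigma_n} \right) \right]
\approx 2^{-L}\, Q\!\left( \frac{d_{u,\min}}{\sigma_n} \right) \\
&= \begin{cases}
2^{-L}\, \dfrac{\sigma_n}{\sqrt{2\pi}\, d_{u,\min}}\, e^{-\frac{1}{2}\left( \frac{d_{u,\min}}{\sigma_n} \right)^{2}}, & \text{if } d_{u,\min} > 0 \\
2^{-L} \left[ 1 + \dfrac{\sigma_n}{\sqrt{2\pi}\, d_{u,\min}}\, e^{-\frac{1}{2}\left( \frac{d_{u,\min}}{\sigma_n} \right)^{2}} \right], & \text{if } d_{u,\min} < 0 \\
2^{-(L+1)}, & \text{if } d_{u,\min} = 0.
\end{cases}
\end{aligned} \tag{2.12} \]
Therefore, from (2.9) and (2.12), we obtain:
\[ \mathrm{BERR}(\mathrm{SNR}, B_u, B_o) = \frac{p_{e_u}(\mathrm{SNR}, B_u)}{p_{e_o}(\mathrm{SNR}, B_o)} \approx
\begin{cases}
\dfrac{d_{o,\min}}{2 d_{u,\min}}\, e^{\frac{d_{o,\min}^{2} - d_{u,\min}^{2}}{2\sigma_n^{2}}}, & \text{if } d_{u,\min} > 0 \\
\dfrac{d_{o,\min}}{2 d_{u,\min}}\, e^{\frac{d_{o,\min}^{2} - d_{u,\min}^{2}}{2\sigma_n^{2}}}\, (1 + \chi), & \text{if } d_{u,\min} < 0 \\
\sqrt{\dfrac{\pi}{8}}\, \dfrac{d_{o,\min}}{\sigma_n}\, e^{\frac{d_{o,\min}^{2}}{2\sigma_n^{2}}}, & \text{if } d_{u,\min} = 0,
\end{cases} \tag{2.13} \]
where $\chi = \frac{\sqrt{2\pi}\, d_{u,\min}}{\sigma_n}\, e^{\frac{d_{u,\min}^{2}}{2\sigma_n^{2}}}$. Note: $d_{o,\min} \ge d_{u,\min}$ and $d_{o,\min} \ge 0$.
Applying the approximation $\sum_k e^{a_k} \approx e^{\max_k(a_k)}$ to the case when $d_{u,\min} < 0$, (2.13) can be
further simplified to (2.6).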
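The two approximations used in this derivation can be checked numerically. The sketch below uses only the Python standard library; the distances and noise level are hypothetical values chosen to represent a high-SNR operating point, and the last check covers only the $d_{u,\min} > 0$ case of (2.13).

```python
import math


def q_function(y):
    """Gaussian tail probability Q(y)."""
    return 0.5 * math.erfc(y / math.sqrt(2.0))


def q_high_snr(y):
    """High SNR approximation Q(y) ~ exp(-y^2/2) / (y * sqrt(2*pi)), valid for y > 0."""
    return math.exp(-0.5 * y * y) / (y * math.sqrt(2.0 * math.pi))


def log_sum_exp(a):
    """log(sum_k exp(a_k)), to compare against max_k a_k."""
    m = max(a)
    return m + math.log(sum(math.exp(v - m) for v in a))


def berr_positive_du(d_o, d_u, sigma):
    """First case of (2.13): d_u,min > 0."""
    return d_o / (2.0 * d_u) * math.exp((d_o ** 2 - d_u ** 2) / (2.0 * sigma ** 2))


for y in (3.0, 5.0, 7.0):
    print(y, q_high_snr(y) / q_function(y))          # ratio approaches 1 as y grows

exponents = [-0.5 * d * d for d in (7.0, 8.0, 9.0)]  # hypothetical d/sigma values
print(log_sum_exp(exponents) - max(exponents))       # small: the largest term dominates

# Hypothetical distances: d_o,min = 7, d_u,min = 5, sigma_n = 1.
exact_ratio = 0.5 * q_function(5.0) / q_function(7.0)
print(berr_positive_du(7.0, 5.0, 1.0) / exact_ratio)
```

The approximation overestimates Q(y) by a relative amount of at most 1/y^2, which is why it is reserved for the high-SNR regime in which the arguments of the Q-function are large.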
Chapter 3
PREDICTIVENET
In the previous chapter, the energy efficiency of information transfer in ML systems was
improved by using information-based metrics such as the BER to design power-dominant
components. In this chapter, we explore a similar approach to reduce the implementation
complexity of information processing, such as CNNs, in ML systems. The power-dominant
convolutions in CNNs are designed to maintain the information-based system metric
(i.e., classification accuracy) rather than circuit fidelity metrics such as SNR. Specifically, we
exploit the CNN structure to propose a technique, referred to as PredictiveNet, which
predicts zero activations using low-cost predictors in order to skip a significant number of
convolutional operations. PredictiveNet first evaluates the most significant bit (MSB) part of the
convolution to predict whether the nonlinear layer output corresponding to the current convolution
is zero, and then decides whether the computation of the remaining least significant bit (LSB) part
can be skipped. PredictiveNet takes advantage of the facts that the MSB part has
an exponentially larger contribution to the final output and that well-trained CNNs have high
sparsity.
The rest of this chapter is organized as follows. Section 3.1 provides the relevant
background. Section 3.2 presents the PredictiveNet technique and an analysis that justifies
its effectiveness. Simulation results are shown in Section 3.3. Section 3.4 provides
conclusions.
3.1 Background
3.1.1 Convolutional Neural Networks (CNNs)
CNNs are a class of multi-layer neural networks [66]. A CNN consists of a cascade of
multiple convolutional layers (C-layers) and subsampling layers (S-layers), which form the
feature extractor, and fully-connected layers (F-layers), which form the classifier. Figure 3.1
illustrates a state-of-the-art CNN for object recognition [66]. In a C-layer, DPs between
receptive fields and weight vectors are computed, a bias term is added, and the result is
passed through a nonlinear function to generate the output feature maps (FMs). The
computation of one output pixel for the C-layer is described as follows:
\[ z_m[j] = f(y_m[j] + \delta_m), \quad (m = 1, \ldots, M) \tag{3.1} \]
\[ y_m[j] = \sum_{l=1}^{L} \mathbf{w}_{ml}^{T} \mathbf{x}_{jl}, \quad (m = 1, \ldots, M), \tag{3.2} \]
where $L$ and $M$ are the number of input and output FMs, respectively, $\mathbf{w}_{ml}$ is the $N$-tuple
weight vector connecting the $l$th input FM $\mathbf{X}_l = [\mathbf{x}_{1l}, \ldots, \mathbf{x}_{Jl}]$ (where $\mathbf{x}_{jl}$ is the $j$th receptive
field in $\mathbf{X}_l$) to the $m$th convolutional output $y_m$, $\delta_m$ is the bias term, and $z_m$ denotes the
$m$th output FM in the C-layer. Equation (3.2) shows that the $j$th pixel $y_m[j]$ of the $m$th
convolutional output $y_m$ is obtained by first performing DPs between the $L$ input vectors
$\mathbf{x}_{jl}$ and the weight vectors $\mathbf{w}_{ml}$, and then summing up the results. The nonlinear function $f(\cdot)$
typically takes a sigmoid or hyperbolic tangent form. However, the rectified linear unit (ReLU) has
emerged recently, as increasing evidence shows that it improves the performance of CNNs [67].
The S-layer reduces the dimension of its input FMs via either average or max pooling.
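Equations (3.1) and (3.2) can be sketched for a single output pixel as follows. This is an illustrative floating-point model with hypothetical weights and receptive fields; `weights_m[l]` and `fields_j[l]` stand for $\mathbf{w}_{ml}$ and $\mathbf{x}_{jl}$, and ReLU is assumed as the nonlinearity.

```python
def relu(v):
    return max(v, 0.0)


def dot(a, b):
    return sum(a_i * b_i for a_i, b_i in zip(a, b))


def c_layer_pixel(weights_m, fields_j, bias_m, f=relu):
    """z_m[j] = f(sum_l w_ml^T x_jl + delta_m), following (3.1) and (3.2)."""
    y = sum(dot(w_ml, x_jl) for w_ml, x_jl in zip(weights_m, fields_j))
    return f(y + bias_m)


# Hypothetical example with L = 2 input FMs and N = 3 weights per kernel.
weights_m = [[0.5, -1.0, 0.25], [1.0, 0.0, -0.5]]
fields_j = [[1.0, 2.0, 4.0], [0.5, 3.0, 1.0]]

print(c_layer_pixel(weights_m, fields_j, bias_m=-0.25))   # pre-activation is negative
print(c_layer_pixel(weights_m, fields_j, bias_m=1.0))     # pre-activation is positive
```

The first call shows the case that matters for the rest of the chapter: whenever the pre-activation is negative, the ReLU output is exactly zero regardless of the precise value of the sum.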
3.1.2 Sparsity in CNNs
Sparsity has become a concept of interest in the fields of neuroscience, machine learning, and
signal processing. It was first introduced in the context of sparse coding in visual systems [69],
which seeks to find an overcomplete basis set and to represent images as a linear superposition
of basis functions from the resulting set. Overcomplete means that the number of basis
functions $M$ is greater than the effective dimensionality $L$ of the image space, which gives
rise to sparsity, as only $L$ out of $M$ nonzero coefficients are needed to represent arbitrary
$L$-dimensional images. As a similar sparse overcomplete representation was observed in biological
neurons, it has become a plausible model for the visual cortex [70, 71]. In ML models, sparse
overcomplete representation has been claimed to be a fundamental reason behind the success
of deep neural networks such as CNNs [72]. Specifically, it has a number of theoretical and
practical advantages [72–74], including 1) greater flexibility in capturing the inherent structure
of the underlying data; 2) increased robustness to small perturbations of the data; and 3) better
separability, because the information is represented in a high-dimensional space. State-of-the-art
CNN models can reach a sparsity of 50% to 85% in their activations [67]. It is worth
noting that while the conventional hyperbolic tangent and sigmoid nonlinear functions generate
activations that take small but non-zero values, the more recent ReLU produces exact zeros
and thus truly sparse representations, while achieving better classification accuracy.
Figure 3.1: Illustration of a state-of-the-art CNN [68] showing a convolutional layer (C-layer),
a subsampling layer (S-layer), feature maps (FMs), and the squashing function f(·).
3.2 The Proposed PredictiveNet Technique
This section describes the PredictiveNet technique and its analytical justiﬁcation.
3.2.1 Principle and Architecture
Without loss of generality, we drop the indices $j$ and $m$ in (3.1) and (3.2) and assume that
$f(\cdot)$ is a ReLU, i.e.,
\[ z = \max\left( \sum_{l=1}^{L} \mathbf{w}_l^{T} \mathbf{x}_l + \delta,\; 0 \right), \tag{3.3} \]
where $\mathbf{w}_l^{T} \mathbf{x}_l = \sum_{i=1}^{N} w_l[i]\, x_l[i]$.
We first decompose $x_l[i]$, $w_l[i]$, and $\delta$ into MSB and LSB parts. If we assume that $B_{msb}$ is
the precision of the MSB part of $\mathbf{w}_l$, $\mathbf{x}_l$, and $\delta$, then:
\[ z = \max(y, 0) = \max\left( \sum_{l=1}^{L} \mathbf{w}_l^{T} \mathbf{x}_l + \delta,\; 0 \right)
= \begin{cases} y_{msb} + y_{lsb}\, 2^{-(B_{msb}-1)} & \text{if } \sum_{l=1}^{L} \mathbf{w}_l^{T} \mathbf{x}_l + \delta > 0 \\ 0 & \text{otherwise}, \end{cases} \tag{3.4} \]
where $y_{msb}$ and $y_{lsb}$ can be expressed as follows:
\[ y_{msb} = \sum_{l=1}^{L} \mathbf{w}_{l,msb}^{T} \mathbf{x}_{l,msb} + \delta_{msb} \tag{3.5} \]
\[ y_{lsb} = \sum_{l=1}^{L} \left( \mathbf{x}_{l,lsb}^{T} \mathbf{w}_{l,msb} + \mathbf{x}_{l}^{T} \mathbf{w}_{l,lsb} \right) + \delta_{lsb}, \tag{3.6} \]
where $\mathbf{x}_{l,msb}$, $\mathbf{w}_{l,msb}$, and $\delta_{msb}$ denote the MSB parts of $\mathbf{x}_l$, $\mathbf{w}_l$, and $\delta$, respectively, and
$\mathbf{x}_{l,lsb}$, $\mathbf{w}_{l,lsb}$, and $\delta_{lsb}$ denote the LSB parts of $\mathbf{x}_l$, $\mathbf{w}_l$, and $\delta$, respectively.
The PredictiveNet architecture (Fig. 3.2) includes a C-MSB block that predicts the sign
of $y$ by computing only $y_{msb}$ in (3.5). If $y_{msb} < 0$ (i.e., $\mathrm{sign}(y_{msb}) = 1$), then we set $z = 0$
without computing $y_{lsb}$ in (3.6) and $y_{msb} + y_{lsb}\, 2^{-(B_{msb}-1)}$ in (3.4). If $y_{msb} \ge 0$
(i.e., $\mathrm{sign}(y_{msb}) = 0$), then C-LSB computes (3.6) and sets $z = y_{msb} + y_{lsb}\, 2^{-(B_{msb}-1)}$ according
to (3.4). By doing so, PredictiveNet avoids evaluating a significant number of convolutions while
incurring only a marginal accuracy loss.
Figure 3.2: An architecture implementing (3.4) in PredictiveNet.
The accuracy loss in PredictiveNet is marginal for the following reasons: 1) the
contribution of $y_{lsb}$ to $z$ is scaled down by a factor of $2^{-(B_{msb}-1)}$ relative to that of $y_{msb}$,
as shown in (3.4); 2) the specific value of $y_{msb} + y_{lsb}\, 2^{-(B_{msb}-1)}$ is not important if it is
negative, due to the rectification effect of the ReLU; and 3) the high sparsity in CNNs mentioned
in Section 3.1.2 implies that the term $y_{msb} + y_{lsb}\, 2^{-(B_{msb}-1)}$ is very likely to be negative,
which results in zero C-layer outputs after it is passed through the ReLU function.
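The decomposition in (3.4) to (3.6) and the resulting skip decision can be sketched as follows. This is a behavioral floating-point illustration under assumptions: the MSB part is obtained by truncating toward minus infinity on a grid of step $2^{-(B_{msb}-1)}$ (the exact number format is not specified here), the function names are hypothetical, and the example values are chosen as exact binary fractions.

```python
import math


def split(v, b_msb):
    """Split v into an MSB part on a 2^-(b_msb-1) grid and a rescaled LSB remainder."""
    step = 2.0 ** -(b_msb - 1)
    msb = math.floor(v / step) * step   # truncation toward -infinity (assumed format)
    lsb = (v - msb) / step              # so that v = msb + lsb * step
    return msb, lsb


def dot(a, b):
    return sum(a_i * b_i for a_i, b_i in zip(a, b))


def predictive_pixel(weights, inputs, bias, b_msb):
    """Return (z, lsb_skipped) for one output pixel."""
    step = 2.0 ** -(b_msb - 1)
    d_msb, d_lsb = split(bias, b_msb)
    w_parts = [list(zip(*(split(v, b_msb) for v in w))) for w in weights]
    x_parts = [list(zip(*(split(v, b_msb) for v in x))) for x in inputs]

    # C-MSB: evaluate only (3.5).
    y_msb = d_msb + sum(dot(wp[0], xp[0]) for wp, xp in zip(w_parts, x_parts))
    if y_msb < 0:
        return 0.0, True                # predicted zero: (3.6) is never evaluated

    # C-LSB: evaluate (3.6), then combine as in (3.4) and apply the ReLU.
    y_lsb = d_lsb + sum(
        dot(xp[1], wp[0]) + dot(x, wp[1])
        for wp, xp, x in zip(w_parts, x_parts, inputs)
    )
    return max(y_msb + y_lsb * step, 0.0), False


# Hypothetical single-FM examples with B_msb = 3 (grid step 0.25).
print(predictive_pixel([[0.5, -0.75]], [[0.25, 0.75]], -0.25, 3))   # skipped
print(predictive_pixel([[0.625, 0.375]], [[0.5, 0.875]], 0.125, 3)) # fully computed
```

When the LSB part is computed, the combined value equals the full-precision result, so an error can only arise when the MSB-only prediction is negative although the full-precision sum is positive.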
Table 3.1: Errors of MSB-CNN and PredictiveNet with Respect to FP-CNN
Event   Condition            MSB-CNN error   PredictiveNet error
H0      ymsb ≤ 0, y ≤ 0      0               0
H1      ymsb ≤ 0, y > 0      y               y
H2      ymsb > 0, y ≤ 0      −ymsb           0
H3      ymsb > 0, y > 0      y − ymsb        0
3.2.2 Analysis
In this subsection, analysis and empirical simulation results are presented to justify why
PredictiveNet incurs a marginal accuracy loss while greatly decreasing the computational cost.
Our analysis is based on the trade-off between accuracy and precision. Recently, this
trade-off has been analytically characterized for simple ML algorithms such as the support vector
machine (SVM) [75]. Such insights have not yet been leveraged for complex algorithms such
as CNNs.
Let $B_w$, $B_x$ and $B_\delta$ denote the required precisions of $\mathbf{w}_l$, $\mathbf{x}_l$, and $\delta$, respectively.
Also, let $B_{w,msb}$, $B_{x,msb}$ and $B_{\delta,msb}$ denote the precisions of $\mathbf{w}_{l,msb}$, $\mathbf{x}_{l,msb}$, and $\delta_{msb}$ in (3.5),
respectively. For convenience, we refer to the CNN comprising only C-MSB as MSB-CNN,
and to the CNN implemented using $B_w$, $B_x$ and $B_\delta$ as the full-precision CNN (FP-CNN).
In Table 3.1, we compare the ReLU output errors of MSB-CNN and PredictiveNet with
respect to the outputs of FP-CNN (i.e., $z$ in (3.3)) for four disjoint events $H_0$ to $H_3$.
Note that each possible outcome is included in exactly one of these events. The MSEs at the
outputs of the ReLU with respect to FP-CNN are:
\[ \mathrm{MSE}_{\text{MSB-CNN}} = E[Y^{2} \mid H_1] P(H_1) + E[Y_{msb}^{2} \mid H_2] P(H_2) + E[|Y - Y_{msb}|^{2} \mid H_3] P(H_3) \tag{3.7} \]
\[ \mathrm{MSE}_{\text{PredictiveNet}} = E[Y^{2} \mid H_1] P(H_1), \tag{3.8} \]
where upper-case letters denote random variables. Since (3.8) is the first term of (3.7) and the
remaining terms of (3.7) are non-negative, comparing (3.7) and (3.8) shows that:
\[ \mathrm{MSE}_{\text{PredictiveNet}} < \mathrm{MSE}_{\text{MSB-CNN}}. \tag{3.9} \]
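The error entries of Table 3.1 and the comparison in (3.9) can be reproduced with a small sketch. The sample pairs below are hypothetical values of $(y_{msb}, y)$, one per event, and the boundary case $y_{msb} = 0$ is treated as in Table 3.1 (output set to zero); this is an illustration, not the simulation behind Fig. 3.3.

```python
def relu_errors(y_msb, y):
    """Return (MSB-CNN error, PredictiveNet error) relative to the FP-CNN output."""
    z_fp = max(y, 0.0)
    z_msb_cnn = max(y_msb, 0.0)                 # MSB-CNN never computes the LSB part
    z_pred = z_fp if y_msb > 0 else 0.0         # PredictiveNet computes y only if y_msb > 0
    return z_fp - z_msb_cnn, z_fp - z_pred


def mse(errors):
    return sum(e * e for e in errors) / len(errors)


# One hypothetical sample per event H0, H1, H2, H3.
samples = [(-0.5, -0.4), (-0.1, 0.05), (0.2, -0.1), (0.5, 0.6)]
errors = [relu_errors(y_msb, y) for y_msb, y in samples]

mse_msb_cnn = mse([e[0] for e in errors])
mse_predictive = mse([e[1] for e in errors])
print(errors)
print(mse_msb_cnn, mse_predictive)
```

Only the H1 sample contributes to the PredictiveNet error, whereas the MSB-CNN error also accumulates the H2 and H3 terms, which is the content of (3.7) and (3.8).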
Furthermore, $P(H_1)$ has been found to be small in practice and can be upper bounded as
follows:
\[ P(H_1) \le \Delta_w^{2} E_1 + \Delta_x^{2} E_2 + \Delta_\delta^{2} E_3, \tag{3.10} \]
where $\Delta_w = 2^{-(B_{w,msb}-1)}$, $\Delta_x = 2^{-(B_{x,msb}-1)}$ and $\Delta_\delta = 2^{-(B_{\delta,msb}-1)}$ are the quantization noise
step sizes of $\mathbf{w}_{l,msb}$, $\mathbf{x}_{l,msb}$ and $\delta_{msb}$, respectively, and $E_1$, $E_2$, and $E_3$ are given in Section 3.5
along with the proof of (3.10).
Similarly, $E[Y^{2} \mid H_1]$ can be upper bounded as follows:
\[ E[Y^{2} \mid H_1] \le \Delta_w^{2} E_4 + \Delta_x^{2} E_5 + \Delta_\delta^{2}, \tag{3.11} \]
where $E_4 = E\left[ \sum_{l=1}^{L} \|\mathbf{X}_l\|^{2} \mid H_1 \right]$ and $E_5 = \sum_{l=1}^{L} \|\mathbf{w}_l\|^{2}$. The proof of (3.11) can also be
found in Section 3.5.
Combining (3.10) and (3.11), we can obtain an upper bound on $\mathrm{MSE}_{\text{PredictiveNet}}$:
\[ \mathrm{MSE}_{\text{PredictiveNet}} \le \Delta_w^{4} E_6 + \Delta_x^{4} E_7 + \Delta_\delta^{4} E_8 + (\Delta_w \Delta_x)^{2} E_9 + (\Delta_w \Delta_\delta)^{2} E_{10} + (\Delta_x \Delta_\delta)^{2} E_{11}, \tag{3.12} \]
where $E_6, \ldots, E_{11}$ are the cross product terms associated with the product of (3.10) and
(3.11).
We observe that every term in (3.12) is a fourth-order multiplicative combination of
quantization steps. Each quantization step is of the order of $2^{-B_{msb}}$. Hence, the upper
bound in (3.12) is of the order of $2^{-4B_{msb}}$.
Figure 3.3 shows empirical values of $\mathrm{MSE}_{\text{MSB-CNN}}$ and $\mathrm{MSE}_{\text{PredictiveNet}}$ for the two C-layers
in a CNN designed for handwritten digit recognition [76]. Figure 3.3 supports (3.9)
and shows that the MSE of PredictiveNet is much smaller than the MSE of MSB-CNN. This
results from the exponentially larger weighting factor of $y_{msb}$ in $y$ compared with that of
$y_{lsb}$, and from the high sparsity of the C-layer outputs in well-trained CNNs.
Figure 3.3: Illustration of the empirical values of $\mathrm{MSE}_{\text{MSB-CNN}}$ and $\mathrm{MSE}_{\text{PredictiveNet}}$ and
the upper bound on $\mathrm{MSE}_{\text{PredictiveNet}}$ with respect to $B_{x,msb}$ for the (a) first and (b) second
C-layers over FP-CNN, where $B_{w,msb} = 5$ bits, $B_w = 8$ bits, $B_x = B_\delta = 7$ bits, and
$B_{\delta,msb} = B_{x,msb}$. Both curves are obtained by averaging over all pixels of the two C-layers'
output FMs.
3.3 Simulation Results
In this section, we evaluate the performance of PredictiveNet on two datasets: MNIST
and CIFAR-10, which are benchmark datasets for handwritten digit and object recognition,
respectively.
3.3.1 System Set-up
The term δm and kernel wml in (3.2) are trained using the back propagation algorithm [66].
The following four architectures are considered: 1) FP-CONV: a conventional FP-CNN; 2)
FP-ZS: a full-precision input zero skipping CNN; 3) PredictiveNet; and 4) MSB-CNN: a
predictor-only CNN. These architectures are evaluated in terms of the following metrics:
• Classification error rate $p_e$: $p_e = P\{\hat{T} \ne t\}$, where $\hat{T}$ and $t$ are the decision of the
evaluated CNN and the true label, respectively.
• Computational cost: the total number of full adders (FAs) in the network, where
an FA is a basic building block of arithmetic units. We assume that the evaluated
CNNs are implemented using the commonly used Baugh-Wooley multiplier and ripple
carry adder (RCA) designed using FAs. Therefore, the number of FAs to compute an
$R$-dimensional DP between the kernel weights and the activations is [77]:
\[
R B_w B_x + (R - 1)\left(B_x + B_w + \lceil \log_2(R) \rceil - 1\right). \tag{3.13}
\]
• Representational cost: the total number of bits associated with non-zero activations
and weights in the network, which represents the data storage and movement costs.
For a fixed-point network, it is defined as:
\[
|\mathcal{X}| B_x + |\mathcal{W}| B_w, \tag{3.14}
\]
Table 3.2: Parameters Summary of the CNN for the MNIST Dataset
Parameter definition:
  Parameter   Description
  L/M         # of input/output FMs
  K × K       size of kernels
  I1 × I2     size of input FMs
CNN parameter summary:
  Layer   L     M    I1 × I2   K × K
  C1      1     16   28 × 28   5 × 5
  C2      16    32   12 × 12   5 × 5
  F1      100   10   4 × 4     4 × 4
where X and W are the sets of all non-zero activations and weights in the network, respec-
tively. Together, the computational and representational costs capture the implementation
complexity of a CNN, and are equally important metrics.
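As a rough illustration of how (3.13) and (3.14) are evaluated, the following sketch computes the full-adder count of one dot product and the bit count of a fixed-point network. The function names and the example numbers (a single 5 × 5 kernel treated as an R = 25 dot product, and the activation and weight counts) are assumptions chosen for illustration, not values from the experiments.

```python
def dot_product_full_adders(R: int, Bw: int, Bx: int) -> int:
    """Full adders for an R-dimensional dot product, following (3.13)."""
    # ceil(log2(R)) computed exactly with integer arithmetic (valid for R >= 1).
    ceil_log2_R = (R - 1).bit_length()
    return R * Bw * Bx + (R - 1) * (Bx + Bw + ceil_log2_R - 1)


def representational_cost(num_nonzero_acts: int, num_nonzero_weights: int,
                          Bx: int, Bw: int) -> int:
    """Bits needed for non-zero activations and weights, following (3.14)."""
    return num_nonzero_acts * Bx + num_nonzero_weights * Bw


# Hypothetical example: one 5x5 kernel (R = 25) with Bw = 8 and Bx = 7.
fa_count = dot_product_full_adders(R=25, Bw=8, Bx=7)
# Hypothetical example: 10 non-zero activations and 20 non-zero weights.
bits = representational_cost(10, 20, Bx=7, Bw=8)
print(fa_count, bits)
```

Both costs grow linearly with the precisions, which is why reducing the MSB precision and increasing sparsity both lower the implementation complexity.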
3.3.2 Evaluation on CNNs for MNIST
The parameters of the CNN for the MNIST dataset are summarized in Table 3.2, which is
developed based on the CNN architecture in [66]. The precisions $B_x$ and $B_w$ are set to 7
bits and 8 bits, respectively, ensuring that the error-free fixed-point $p_e$ increases by only $4 \times 10^{-3}$
compared with the floating-point $p_e$ of 0.016.
Figure 3.4 compares FP-CNN (FP-CONV and FP-ZS), PredictiveNet, and MSB-CNN in
terms of their classification error rates and their computational and representational costs normalized
over those of FP-CONV. Figure 3.4 shows that PredictiveNet is able to achieve a classification
error rate that is only $1.9 \times 10^{-2}$ larger than that of FP-CNN while reducing the computational
cost by 2.5× compared to the state-of-the-art FP-ZS. Furthermore, PredictiveNet
achieves a 1.7× reduction in the representational cost over that of FP-ZS. On the other hand,
when compared to MSB-CNN, PredictiveNet reduces the classification error rate by 12×
(0.475/0.039) at the cost of only 1.6× greater computational cost.
It is interesting to observe that PredictiveNet has an even smaller (by 19%) representational cost
than its predictor-only counterpart, i.e., MSB-CNN (see Fig. 3.4 (c)). This can be explained
by the higher sparsity observed in PredictiveNet compared with MSB-CNN. In particular, the
computational and representational costs of a CNN applied on top of FP-ZS depend not
only on the precision requirements associated with $B_x$ and $B_w$ but also on the sparsity of the
C-layer inputs. For example, Fig. 3.5 (a) shows that the input sparsity of PredictiveNet's
C2 and F1 layers is 14.6% and 1.6× higher than that of the MSB-CNN, respectively. As
the representational cost of the C2 and F1 layers accounts for > 90% of the total representational
cost, the higher sparsity in PredictiveNet's C2 and F1 layers explains its smaller
representational cost compared with the MSB-CNN.
Table 3.3: Computational and Representational Cost Comparison among CNNs for the
MNIST Dataset
                                   FP-CONV   FP-ZS    PredictiveNet   MSB-CNN
  Computational cost (million)     77.13     40.88    16.43           10.18
  Representational cost (million)  0.5376    0.2640   0.1586          0.1951
Table 3.4: Parameters Summary of the CNN [78] for the CIFAR-10 Dataset
Parameter definition:
  Parameter   Description
  L/M         # of input/output FMs
  K × K       size of kernels
  I1 × I2     size of input FMs
CNN parameter summary:
  Layer   L     M     I1 × I2   K × K
  C1      3     192   32 × 32   5 × 5
  C2      192   160   32 × 32   1 × 1
  C3      160   96    32 × 32   1 × 1
  C4      96    192   15 × 15   5 × 5
  C5      192   192   15 × 15   1 × 1
  C6      192   192   15 × 15   1 × 1
  C7      192   192   7 × 7     3 × 3
  C8      192   192   7 × 7     1 × 1
  C9      192   10    7 × 7     1 × 1
Table 3.3 summarizes the computational and representational costs to implement the four
CNNs. Figure 3.4 and Table 3.3 show that PredictiveNet's accuracy is slightly worse than
FP-CNN (FP-CONV or FP-ZS) but with signiﬁcantly lower complexity.
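The reduction factors quoted above can be re-derived from Table 3.3. The following sketch only repeats that arithmetic; the dictionary layout and variable names are illustrative assumptions.

```python
# Costs copied from Table 3.3, in millions: (computational, representational).
costs = {
    "FP-CONV": (77.13, 0.5376),
    "FP-ZS": (40.88, 0.2640),
    "PredictiveNet": (16.43, 0.1586),
    "MSB-CNN": (10.18, 0.1951),
}


def ratio(a: str, b: str, idx: int) -> float:
    """Cost of architecture a divided by cost of architecture b (idx 0: comp., 1: rep.)."""
    return costs[a][idx] / costs[b][idx]


comp_saving_vs_fpzs = ratio("FP-ZS", "PredictiveNet", 0)          # about 2.5x
rep_saving_vs_fpzs = ratio("FP-ZS", "PredictiveNet", 1)           # about 1.7x
comp_overhead_vs_msb = ratio("PredictiveNet", "MSB-CNN", 0)       # about 1.6x
rep_saving_vs_msb = 1.0 - ratio("PredictiveNet", "MSB-CNN", 1)    # about 19%

print(f"{comp_saving_vs_fpzs:.2f}x, {rep_saving_vs_fpzs:.2f}x, "
      f"{comp_overhead_vs_msb:.2f}x, {100 * rep_saving_vs_msb:.1f}%")
```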
3.3.3 Evaluation on CNNs for CIFAR-10
To demonstrate the generality of the proposed PredictiveNet technique, it is also applied
to the CIFAR-10 dataset [79]. The parameters of the CNN for the CIFAR-10 dataset are
summarized in Table 3.4 [78]. The precisions $B_x$ and $B_w$ are set to 9 bits and 8 bits,
respectively, ensuring that the error-free fixed-point $p_e$ is within $2.4 \times 10^{-2}$ of the floating-point
$p_e$ of 0.124. Although both the MNIST and CIFAR-10 datasets contain data of 10
categories, the data of the latter are more diverse and thus the data statistics are more
complex. As a result, it can be seen from Tables 3.2 and 3.4 that the CNN architecture for
the CIFAR-10 dataset is more complicated and the achievable classification error rates are
higher than those of the CNNs for the MNIST dataset.
[Figure 3.4 panels (MNIST): (a) classification error rate: FP-CONV/FP-ZS 0.02, PredictiveNet 0.039, MSB-CNN 0.475; (b) normalized computational cost of FP-CONV, FP-ZS, PredictiveNet, and MSB-CNN; (c) normalized representational cost of the same four architectures.]
Figure 3.4: Simulation results for the MNIST dataset comparing FP-CNN (FP-CONV and
FP-ZS), PredictiveNet, and MSB-CNN in terms of: (a) classification error rates, (b) normalized
computational cost (# of full adders (FAs)), and (c) normalized representational
cost (# of bits), where $B_{x,\mathrm{msb}} = B_{\delta,\mathrm{msb}} = 4$ bits and $B_{w,\mathrm{msb}} = 5$ bits.
[Figure 3.5 plots: input sparsity of FP-CONV, PredictiveNet, and MSB-CNN.]
Figure 3.5: Comparison of the C-layer input sparsity of FP-CONV/FP-ZS, PredictiveNet
and MSB-CNN for the MNIST dataset.
Figure 3.6 compares the performance of FP-CNN (FP-CONV and FP-ZS), PredictiveNet,
and MSB-CNN in terms of classification error rates and computational and representational
costs normalized over those of FP-CONV. It can be seen from Fig. 3.6 that PredictiveNet is
able to maintain a classification error rate that is only $1.7 \times 10^{-2}$ larger than that of FP-CNN
while achieving a 2.3× reduction in the computational cost over the state-of-the-art FP-ZS.
Furthermore, PredictiveNet reduces the representational cost by 1.4× compared to FP-ZS.
On the other hand, when compared to MSB-CNN, PredictiveNet reduces the classification
error rate by 3.7× (0.527/0.144) at the cost of 1.5× greater computational cost.
[Figure 3.6 panels (CIFAR-10): (a) classification error rate: FP-CONV/FP-ZS 0.127, PredictiveNet 0.144, MSB-CNN 0.527; (b) normalized computational cost of FP-CONV, FP-ZS, PredictiveNet, and MSB-CNN; (c) normalized representational cost of the same four architectures.]
Figure 3.6: Simulation results for the CIFAR-10 dataset comparing FP-CNN (FP-CONV
and FP-ZS), PredictiveNet, and MSB-CNN in terms of: (a) classification error rates, (b)
normalized computational cost (# of full adders (FAs)), and (c) normalized representational
cost (# of bits), where $B_{x,\mathrm{msb}} = B_{\delta,\mathrm{msb}} = 6$ bits and $B_{w,\mathrm{msb}} = 5$ bits.
[Figure 3.7 plot: input sparsity per C-layer.]
Figure 3.7: Comparison of the C-layer input sparsity of FP-CONV/FP-ZS, PredictiveNet
and MSB-CNN for the CIFAR-10 dataset.
Table 3.5: Computational and Representational Cost Comparison among CNNs for the
CIFAR-10 Dataset
                                   FP-CONV   FP-ZS   PredictiveNet   MSB-CNN
  Computational cost (billion)     17.60     5.52    2.45            1.69
  Representational cost (million)  11.95     3.94    2.74            2.63
Similar to the case of the CNN for the MNIST dataset, it is observed in Fig. 3.6(c) that
PredictiveNet requires a negligibly (0.4%) higher representational cost than its predictor-only
counterpart, i.e., MSB-CNN. Again, this can be explained by the higher sparsity observed in
PredictiveNet compared with MSB-CNN, as shown in Fig. 3.7. Specifically, the input sparsity of the
C2-C9 layers in PredictiveNet is higher than that of the MSB-CNN, and the representational
cost corresponding to these layers accounts for > 90% of the total representational
cost, which explains why the representational cost of PredictiveNet stays so close to that of MSB-CNN.
Table 3.5 lists the number of FAs and bits associated with the four CNNs for the CIFAR-
10 dataset. Similarly, it can be observed from Fig. 3.6 and Table 3.5 that PredictiveNet is
able to signiﬁcantly reduce the complexity while maintaining classiﬁcation accuracy close to
that of the FP-CNN (FP-CONV or FP-ZS).
3.4 Summary
In this chapter, we propose a new technique, PredictiveNet, which predicts the sparse nonlinear
outputs and skips the corresponding convolution operations to reduce the complexity of CNNs.
Analysis is performed to justify the effectiveness of PredictiveNet and to predict the behavior
of CNNs with respect to the precision of their predictors. PredictiveNet takes advantage of the
fact that the weighting factors in fixed-point representation decrease exponentially and that high
sparsity is commonly observed in well-trained CNNs. This work opens up a new research
dimension for greatly reducing the implementation cost of CNNs without degrading their detection
accuracy. Future work includes the application of PredictiveNet to other ML algorithms such
as multilayer perceptrons and spiking neural networks, where high sparsity is also commonly
observed. Incorporating additional constraints that favor the reduction of prediction errors in
PredictiveNet into the training algorithms is also an interesting research topic.
3.5 Derivation of (3.10) and (3.11)
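We provide a detailed derivation of (3.10) and (3.11).
\[
\begin{aligned}
P(H_1) &= P(Y_{\mathrm{msb}} \le 0,\, Y > 0) = P(Y > 0)\, P(Y_{\mathrm{msb}} \le 0 \mid Y > 0) \\
&= \frac{1}{2} P(Y > 0)\, P(|q_y| > Y \mid Y > 0) \\
&= \frac{1}{2} P(Y > 0) \int f_{\mathbf{X} \mid Y > 0}(\mathbf{x})\, P\left(|q_y| > Y \,\middle|\, Y > 0, \mathbf{X} = \mathbf{x}\right) d\mathbf{x} \\
&\le \frac{P(Y > 0)}{24} \int f_{\mathbf{X} \mid Y > 0}(\mathbf{x})\, \frac{\sum_{l=1}^{L} \left[\Delta_w^2 \|\mathbf{x}_l\|^2 + \Delta_x^2 \|\mathbf{w}_l\|^2\right] + \Delta_\delta^2}{\left|\sum_{l=1}^{L} \mathbf{w}_l^{\mathrm{T}} \mathbf{x}_l + \delta\right|^2}\, d\mathbf{x} \\
&= \frac{1}{24} P(Y > 0)\, E\left[\frac{\sum_{l=1}^{L} \left[\Delta_w^2 \|\mathbf{X}_l\|^2 + \Delta_x^2 \|\mathbf{w}_l\|^2\right] + \Delta_\delta^2}{\left|\sum_{l=1}^{L} \mathbf{w}_l^{\mathrm{T}} \mathbf{X}_l + \delta\right|^2} \,\middle|\, Y > 0\right] \\
&= \frac{1}{24} E\left[\frac{\sum_{l=1}^{L} \left[\Delta_w^2 \|\mathbf{X}_l\|^2 + \Delta_x^2 \|\mathbf{w}_l\|^2\right] + \Delta_\delta^2}{\left|\sum_{l=1}^{L} \mathbf{w}_l^{\mathrm{T}} \mathbf{X}_l + \delta\right|^2} \cdot 1_{Y > 0}\right] \\
&= \frac{1}{24} \Delta_w^2 \sum_{l=1}^{L} E\left[\frac{\|\mathbf{X}_l\|^2 \cdot 1_{Y > 0}}{\left|\sum_{l=1}^{L} \mathbf{w}_l^{\mathrm{T}} \mathbf{X}_l + \delta\right|^2}\right]
 + \frac{1}{24} \left(\Delta_x^2 \sum_{l=1}^{L} \|\mathbf{w}_l\|^2 + \Delta_\delta^2\right) E\left[\frac{1_{Y > 0}}{\left|\sum_{l=1}^{L} \mathbf{w}_l^{\mathrm{T}} \mathbf{X}_l + \delta\right|^2}\right] \\
&= \Delta_w^2 E_1 + \Delta_x^2 E_2 + \Delta_\delta^2 E_3,
\end{aligned}
\]
where $f_{\mathbf{X} \mid Y > 0}(\mathbf{x})$ is the conditional distribution of $\mathbf{X}$ given $Y > 0$ and $q_y = \sum_{l=1}^{L} (\mathbf{q}_{w_l}^{\mathrm{T}} \mathbf{x}_l + \mathbf{q}_{x_l}^{\mathrm{T}} \mathbf{w}_l) + q_\delta$. Note that $\mathbf{q}_{w_l}$, $\mathbf{q}_{x_l}$, and $q_\delta$ are the quantization noise terms of $\mathbf{w}_{l,\mathrm{msb}}$, $\mathbf{x}_{l,\mathrm{msb}}$,
and $\delta_{\mathrm{msb}}$, respectively. $1_A$ denotes the indicator function of the event $A$. The factor $\frac{1}{2}$ in the second
step is due to the symmetric distribution of $q_y$. The fourth step comes from Chebyshev's
inequality. Note that
\[
\begin{aligned}
E_1 &= \frac{1}{24} \sum_{l=1}^{L} E\left[\frac{\|\mathbf{X}_l\|^2 \cdot 1_{Y > 0}}{\left|\sum_{l=1}^{L} \mathbf{w}_l^{\mathrm{T}} \mathbf{X}_l + \delta\right|^2}\right], \\
E_2 &= \frac{1}{24} \sum_{l=1}^{L} \|\mathbf{w}_l\|^2\, E\left[\frac{1_{Y > 0}}{\left|\sum_{l=1}^{L} \mathbf{w}_l^{\mathrm{T}} \mathbf{X}_l + \delta\right|^2}\right], \\
E_3 &= \frac{1}{24} E\left[\frac{1_{Y > 0}}{\left|\sum_{l=1}^{L} \mathbf{w}_l^{\mathrm{T}} \mathbf{X}_l + \delta\right|^2}\right].
\end{aligned}
\]
Furthermore, under $H_1$ we have $y_{\mathrm{msb}} = y + q_y \le 0$ and $y > 0$, which means that $0 < y \le -q_y$
(i.e., $|y|^2 \le |q_y|^2$). Hence,
\[
E[Y^2 \mid H_1] \le E[q_y^2 \mid H_1] = \Delta_w^2 E\left[\sum_{l=1}^{L} \|\mathbf{X}_l\|^2 \,\middle|\, H_1\right] + \Delta_x^2 \sum_{l=1}^{L} \|\mathbf{w}_l\|^2 + \Delta_\delta^2.
\]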
Chapter 4
RANK DECOMPOSED STATISTICAL ERROR
COMPENSATION
In previous chapters, the energy eﬃciency of both information transfer and processing in
ML systems is improved by employing information-based system metrics rather than ﬁdelity
circuit metrics to aggressively reduce component power of information transfer or complexity
of ML algorithms for information processing. This new dimension of energy eﬃciency vs.
robustness trade-oﬀ is made possible by taking advantage of accuracy relaxation on circuit
level operations oﬀered by the probabilistic nature of information-based system metrics or
the inherent structure in ML algorithms.
Such an energy efficiency vs. robustness trade-off can also be leveraged to address the robustness
challenge of implementing information processing subsystems on stochastic, unreliable
fabrics such as NTV, enabling a more aggressive reduction in computational cost. Along these
lines, this chapter proposes a new SEC technique, referred to as RD-SEC, that enables robust
CNNs operating in the NTV regime. The opportunity arises because a commonly employed
operation in signal processing and ML applications is the MVM, in which the same input vector
is projected onto a set of weight vectors. Examples include CNNs [66], filter banks for feature
extraction [80], principal component analysis (PCA) [81], and wavelet transforms [82].
RD-SEC exploits the inherent structure within MVMs for low-cost error detection and compensation.
The remainder of this chapter is organized as follows. Section 4.1 provides background
on low power design techniques, ANT and rank decomposition. Section 4.2 presents the
proposed MVM-based CNN architecture and RD-SEC technique to enhance robustness.
The error model generation and validation are presented in Section 4.3. Simulation results
are shown in Section 4.4. Finally, conclusions and future work are presented in Section 4.5.
4.1 Background
4.1.1 Low Power Design Techniques
Various low power techniques can be used to reduce the energy consumption of MVMs. At
the logic level, programmable CSE [42] is a low power technique, where common subexpres-
sions (CSs) in the coeﬃcients are ﬁrst computed using shift and add, and then summed up
to obtain the ﬁnal product. Programmability is enabled via a look-up table.
To further reduce energy, NTV operation was proposed, in which devices operate at or near their
threshold voltage ($V_{th}$); it has shown an energy reduction on the order of 10× [11]. However,
the energy efficiency of NTV comes at the cost of an exponential increase in the normalized
delay variation, leading to an increased rate of functional failures. Specifically, circuit simulations in a
commercial 45 nm CMOS process show that, due to process variations, the delay variation of an 8-bit
ripple-carry adder (RCA) increases by 8.5× at $V_{dd} = 0.35$ V (NTV) compared with that at the nominal $V_{dd} = 1.1$ V.
The traditional approach to addressing the variation challenge is to
add design margin, which substantially reduces the benefits of NTV [11]. For example, it
is estimated that employing voltage margining to ensure error-free operation results
in a 3.1× energy overhead for the 8-bit RCA operating at 0.35 V. Techniques such as body
biasing [83] or variable pipeline stage latency [84] have also been proposed. Although these techniques
have demonstrated some degree of effectiveness, they can incur significant overheads due
to the local nature of variations.
4.1.2 Algorithmic Noise-Tolerance (ANT)
ANT is an algorithmic technique that employs error statistics to perform error compensation,
and has been shown to be effective for signal processing and ML kernels [49]. Specifically,
ANT incorporates a main block (M-block) and an estimator block (E-block), which is an
approximate version of the M-block (see Fig. 4.1(a)). The M-block is subject to large
magnitude errors $\eta$ (e.g., timing errors, which typically occur in the MSBs) while the E-block
is subject to small magnitude errors $e$ (see Fig. 4.1(b), e.g., due to quantization noise
in the LSBs), i.e.:
\[
y_a = y_o + \eta \tag{4.1}
\]
\[
y_e = y_o + e \tag{4.2}
\]
where $y_o$, $y_a$, and $y_e$ are the error-free, the M-block and the E-block outputs, respectively. ANT
exploits the difference in the error statistics of $\eta$ and $e$ to detect and compensate for errors
and obtain the final corrected output $\hat{y}$ as follows:
\[
\hat{y} =
\begin{cases}
y_a & \text{if } |y_a - y_e| \le T_h \\
y_e & \text{otherwise}
\end{cases}, \tag{4.3}
\]
where $T_h$ is an application-dependent threshold parameter chosen to maximize the performance
of ANT.
Figure 4.1: Algorithmic noise-tolerance (ANT): (a) architecture, and (b) the error statistics
in the M-block and E-block [50].
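The decision rule in (4.3) can be sketched in a few lines. The function name `ant_correct`, the example values, and the threshold below are assumptions made for illustration; they are not taken from the experiments in this chapter.

```python
import numpy as np


def ant_correct(y_a, y_e, th):
    """Apply the ANT rule (4.3): keep the M-block output unless it deviates
    from the E-block estimate by more than the threshold th."""
    y_a = np.asarray(y_a, dtype=float)
    y_e = np.asarray(y_e, dtype=float)
    return np.where(np.abs(y_a - y_e) <= th, y_a, y_e)


# Hypothetical example: the second M-block output has a large (MSB-like) error of +64,
# while the E-block outputs carry only small estimation errors.
y_o = np.array([10.0, -4.0, 7.0])          # error-free outputs
y_a = y_o + np.array([0.0, 64.0, 0.0])     # M-block outputs, see (4.1)
y_e = y_o + np.array([-0.5, -0.5, 0.25])   # E-block outputs, see (4.2)
y_hat = ant_correct(y_a, y_e, th=2.0)
print(y_hat)
```

In this example the corrupted output is replaced by its estimate, so the residual error is bounded by the small estimation error rather than by the large timing error.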
4.1.3 Rank Decomposition
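Rank decomposition exists for every finite-dimensional matrix [85]. Assume $\mathbf{A}$ is an $N \times M$
matrix ($N < M$) whose rank is $R$; then $R \le N$ and there exist $R$ linearly independent rows
in $\mathbf{A}$. A rank decomposition of $\mathbf{A}$ is a product $\mathbf{A} = \mathbf{B}\mathbf{C}$, where $\mathbf{B} = [\mathbf{b}_1, \ldots, \mathbf{b}_R]$ is an $N \times R$
basis matrix, $\mathbf{b}_r$ ($r = 1, \ldots, R$) is the $r$th $N \times 1$ basis vector, and $\mathbf{C} = [\mathbf{c}_1, \ldots, \mathbf{c}_M]$ is an
$R \times M$ coefficient matrix. Every column vector of $\mathbf{A}$ is a linear combination of the columns
of $\mathbf{B}$. That is, the $j$th column $\mathbf{a}_j$ of the matrix $\mathbf{A} = [\mathbf{a}_1, \ldots, \mathbf{a}_M]$ can be expressed as
$\mathbf{a}_j = \mathbf{B}\mathbf{c}_j = c_{1j}\mathbf{b}_1 + \cdots + c_{Rj}\mathbf{b}_R$ with $\mathbf{c}_j = [c_{1j}, \ldots, c_{Rj}]^{\mathrm{T}}$.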
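One way to obtain such a decomposition numerically is sketched below: it greedily selects linearly independent columns of A as the basis and solves for the coefficients by least squares. This is an illustrative construction under the stated assumptions (the function name, tolerance, and example matrix are hypothetical), not necessarily the procedure used in [85].

```python
import numpy as np


def rank_decompose(A, tol=1e-10):
    """Return (B, C) with A = B @ C, where B holds R linearly independent columns of A."""
    A = np.asarray(A, dtype=float)
    cols = []
    for j in range(A.shape[1]):
        # Keep column j only if it increases the rank of the selected set.
        if np.linalg.matrix_rank(A[:, cols + [j]], tol=tol) > len(cols):
            cols.append(j)
    B = A[:, cols]                              # N x R basis matrix
    C, *_ = np.linalg.lstsq(B, A, rcond=None)   # R x M coefficient matrix
    return B, C


# Hypothetical 3 x 4 example of rank 2 (third row = first row + second row).
A = np.array([[1.0, 2.0, 3.0, 0.0],
              [0.0, 1.0, 1.0, 1.0],
              [1.0, 3.0, 4.0, 1.0]])
B, C = rank_decompose(A)
print(B.shape, C.shape, np.allclose(B @ C, A))
```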
4.2 The Proposed RD-SEC Technique
This section describes the proposed error compensation technique RD-SEC to enable robust
CNN design in the NTV regime. First, we reformulate the C-layer computation in terms of
the MVM.
4.2.1 MVM-based CNNs
Equation (3.2) can be rewritten in a vector form as follows:
\[
\begin{bmatrix}
y_1[j] \\ \vdots \\ y_m[j] \\ \vdots \\ y_M[j]
\end{bmatrix}
=
\begin{bmatrix}
\sum_{l=1}^{L} \mathbf{w}_{1l}^{\mathrm{T}} \mathbf{x}_{jl} \\ \vdots \\ \sum_{l=1}^{L} \mathbf{w}_{ml}^{\mathrm{T}} \mathbf{x}_{jl} \\ \vdots \\ \sum_{l=1}^{L} \mathbf{w}_{Ml}^{\mathrm{T}} \mathbf{x}_{jl}
\end{bmatrix}
= \sum_{l=1}^{L} \mathbf{W}_l^{\mathrm{T}} \mathbf{x}_{jl}, \tag{4.4}
\]
where $\mathbf{W}_l = [\mathbf{w}_{1l}, \ldots, \mathbf{w}_{Ml}]$. It can be seen that (4.4) is the sum of $L$ MVMs, where the
$l$th MVM is given by $\mathbf{W}_l^{\mathrm{T}} \mathbf{x}_{jl}$. A single stage of an MVM-based CNN in Fig. 4.2 consists
of input and weight buffers, an MVM-based C-layer, and an S-layer. Specifically, the input
vectors and weight matrices are streamed from the input and weight buffers, respectively.
The MVM-based C-layer accepts the input vectors and weight matrices, and obtains the $M$
outputs according to (3.1) and (3.2). In the S-layer, the spatial resolution of the C-layer
output FMs is reduced by either averaging or max pooling.
4.2.2 Rank Decomposed SEC (RD-SEC): Principle and Architecture
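The formulation of an MVM-based CNN in Section 4.2.1 enables us to exploit redundancy
within an MVM for statistical error compensation. The proposed approach, RD-SEC, employs
low-cost estimators derived from a set of basis vectors in the $N \times M$ weight matrix $\mathbf{W}$ (see (1.2)).
To do so, we make use of the rank decomposition of $\mathbf{W}$ [85]:
\[
\mathbf{W} = \mathbf{B}\mathbf{C}, \tag{4.5}
\]
where $\mathbf{B} = [\mathbf{b}_1, \ldots, \mathbf{b}_R]$ is an $N \times R$ basis matrix with $R = \mathrm{rank}(\mathbf{W})$ (assume $M > N$;
then $R \le N$), $\mathbf{b}_r$ ($r = 1, \ldots, R$) is the $r$th $N \times 1$ basis vector, $\mathbf{C} = [\mathbf{c}_1, \ldots, \mathbf{c}_M]$ is an $R \times M$
coefficient matrix, and $\mathbf{c}_m$ ($m = 1, \ldots, M$) is the $m$th $R \times 1$ coefficient vector. We choose
$\mathbf{b}_i = \mathbf{w}_i$ ($i = 1, \ldots, R$) so that
\[
\mathbf{W} = \mathbf{B}\mathbf{C} = \mathbf{B}\,[\,\mathbf{I}_R \;\; \mathbf{C}_e\,], \tag{4.6}
\]
where $\mathbf{I}_R$ is an $R \times R$ identity matrix, and $\mathbf{C}_e = [\mathbf{c}_{e,1}, \ldots, \mathbf{c}_{e,M-R}]$ is an $R \times (M - R)$ matrix.
Substituting (4.6) into (1.2), we have:
\[
\mathbf{y} = [\,\mathbf{I}_R \;\; \mathbf{C}_e\,]^{\mathrm{T}} (\mathbf{B}^{\mathrm{T}} \mathbf{x})
= [\,\mathbf{I}_R \;\; \mathbf{C}_e\,]^{\mathrm{T}} \mathbf{y}_o
= \begin{bmatrix} \mathbf{y}_o \\ \mathbf{y}_a \end{bmatrix}, \tag{4.7}
\]
where $\mathbf{y}_o = \mathbf{B}^{\mathrm{T}} \mathbf{x} = [y_{o,1}, \ldots, y_{o,R}]^{\mathrm{T}}$ is the error-free $R \times 1$ vector, and $\mathbf{y}_a = \mathbf{C}_e^{\mathrm{T}} \mathbf{y}_o = [y_{a,1}, \ldots, y_{a,(M-R)}]^{\mathrm{T}}$
is an $(M - R) \times 1$ output vector from the M-block subject to errors. In RD-SEC,
we derive a low-cost estimator of $\mathbf{y}_a$ using the error-free output $\mathbf{y}_o$ and a rounded coefficient
matrix $\hat{\mathbf{C}}_e^{\mathrm{T}}$, i.e.:
\[
\mathbf{y}_e = \hat{\mathbf{C}}_e^{\mathrm{T}} \mathbf{y}_o = [\hat{\mathbf{c}}_{e,1}, \ldots, \hat{\mathbf{c}}_{e,(M-R)}]^{\mathrm{T}} \mathbf{y}_o, \tag{4.8}
\]
where $\mathbf{y}_e = [y_{e,1}, \ldots, y_{e,(M-R)}]^{\mathrm{T}}$ is an $(M - R) \times 1$ estimation vector, $\hat{\mathbf{C}}_e^{\mathrm{T}} = \mathrm{round}(\mathbf{C}_e^{\mathrm{T}})$,
where the $\mathrm{round}(\cdot)$ operator rounds an element to the nearest power of 2, and $\hat{\mathbf{c}}_{e,m} = [\hat{c}_{e,1m}, \ldots, \hat{c}_{e,Rm}]^{\mathrm{T}}$
is the $m$th $R \times 1$ coefficient vector corresponding to $y_{e,m}$. Equation (4.8)
indicates that $y_{e,m}$ can be implemented using only shifts and adds. Finally, the $m$th error
compensated output $\hat{y}_m$ is obtained as follows:
\[
\hat{y}_m =
\begin{cases}
y_{o,m} & \text{if } m \le R \\
y_{a,(m-R)} & \text{if } m > R \text{ and } \left|y_{a,(m-R)} - y_{e,(m-R)}\right| \le T_h \\
y_{e,(m-R)} & \text{otherwise}
\end{cases} \tag{4.9}
\]
where the threshold $T_h$ is an application-dependent parameter chosen to maximize system
performance [49]. The RD-SEC architecture is shown in Fig. 4.3.
Figure 4.2: Architecture of: (a) an $(N, M)$ dot product ensemble (MVM), where $\mathbf{w}_{ml} = [w_{1ml}, \cdots, w_{Nml}]$
and $\mathbf{W}_l = [\mathbf{w}_{1l}, \cdots, \mathbf{w}_{Ml}]$, and (b) a one-stage MVM-based CNN consisting
of a C-layer and an S-layer.
Figure 4.3: RD-SEC applied to an MVM.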
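The steps in (4.6)-(4.9) can be sketched as follows. This is a minimal illustration under several assumptions that are not stated in the text: the first R columns of W are linearly independent, the basis outputs are error-free, rounding to the nearest power of 2 is done in the log domain, and the matrix, input, injected error, and threshold are hypothetical.

```python
import numpy as np


def round_pow2(c):
    """Round non-zero elements to a signed power of 2 (log-domain rounding, an assumption)."""
    c = np.asarray(c, dtype=float)
    out = np.zeros_like(c)
    nz = c != 0
    out[nz] = np.sign(c[nz]) * 2.0 ** np.round(np.log2(np.abs(c[nz])))
    return out


def rd_sec(W, x, eta, th):
    """RD-SEC sketch for y = W^T x, with errors eta injected into the last M - R outputs."""
    R = np.linalg.matrix_rank(W)
    B = W[:, :R]                                        # basis: first R columns, see (4.6)
    Ce, *_ = np.linalg.lstsq(B, W[:, R:], rcond=None)   # R x (M - R) coefficients
    y_o = B.T @ x                                       # basis outputs, assumed error-free
    y_a = W[:, R:].T @ x + eta                          # M-block outputs with timing errors
    y_e = round_pow2(Ce).T @ y_o                        # shift-and-add estimator, see (4.8)
    y_tail = np.where(np.abs(y_a - y_e) <= th, y_a, y_e)  # decision rule (4.9)
    return np.concatenate([y_o, y_tail])


# Hypothetical example with N = 2, M = 4, R = 2.
W = np.array([[1.0, 0.0, 3.0, 1.0],
              [0.0, 1.0, 1.0, -4.0]])
x = np.array([2.0, 5.0])
eta = np.array([0.0, 64.0])    # a large error hits the last output only
y_hat = rd_sec(W, x, eta, th=4.0)
print(y_hat)
```

In this example the third output is left untouched because it agrees with its (slightly inexact) estimate, while the fourth output is replaced by its estimate.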
4.2.3 RD-SEC Overhead
The overhead of an RD-SEC-based CNN can be approximated relative to the M-block in
an MVM. The computational overhead $\gamma$ of RD-SEC relative to the M-block is defined as:
\[
\gamma = \frac{N_P - N_{\mathrm{conv}}}{N_{\mathrm{conv}}} = \frac{M - R}{M}\,\alpha, \tag{4.10}
\]
where $N_P$ and $N_{\mathrm{conv}}$ denote the complexities of the RD-SEC-based MVM and the conventional
MVM in terms of the number of full adders (FAs), respectively, and $\alpha$ quantifies the
ratio of the complexities of one E-block and one M-block (only $M - R$ out of $M$ channels have
E-blocks (see Fig. 4.3)). The detailed expression for $\alpha$ is provided in Section 4.6.
The $\gamma_C$ and $\gamma_F$ in Fig. 4.4 correspond to the computational overhead of the C1/C2 layers
and the F1 layer of the CNN in [68], respectively. Figure 4.4 shows that $\gamma_C$ increases with
$N$ for $N \le 5$, and then decreases with $N$. This is because $\alpha$ increases with $N$ due to the
increased number of adders in (4.8), while at the same time, the number of E-blocks $M - R$
reduces since $R = N$. Similar results were obtained for $\gamma_F$. This indicates that RD-SEC
overhead reduces with $N$ for large vector lengths (i.e., $N \ge 10$). Specifically, $\gamma_C \approx 5\%$ and
$\gamma_F \approx 15\%$ when RD-SEC is applied to the CNN in [68].
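The structure of (4.10) can be checked with a short sketch. The value of alpha used below is a placeholder, since its expression is given in Section 4.6 and is not reproduced here; the function name is also an assumption.

```python
def rd_sec_overhead(M: int, R: int, alpha: float) -> float:
    """Computational overhead gamma of RD-SEC relative to the M-block, following (4.10)."""
    return (M - R) / M * alpha


# Hypothetical example: M = 32 channels, R = N = 24, and a placeholder alpha = 0.2.
gamma = rd_sec_overhead(M=32, R=24, alpha=0.2)
print(f"gamma = {100 * gamma:.1f}%")

# For a fixed alpha, the fraction of channels that need an E-block shrinks as R = N grows.
fractions = [(N, (32 - N) / 32) for N in (5, 10, 20, 30)]
print(fractions)
```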
[Figure 4.4 plot: computational overhead $\gamma$ ($\gamma_C$ and $\gamma_F$) vs. $N$.]
Figure 4.4: Overhead of the RD-SEC-based MVM: computational overhead $\gamma$ vs. $N$,
where the corresponding parameters are summarized in Table 4.1.
Table 4.1: Parameters for the $\gamma_C$ and $\gamma_F$ in Fig. 4.4
              Parameters
  $\gamma_C$  $B_{in} = 7$, $B_w = 8$, $R = N$, $M = 32$
  $\gamma_F$  $B_{in} = 7$, $B_w = 8$, $R = N$, $M = 100$
4.3 Error Model Generation and Validation
This section presents the timing error model generation methodology [86] and the validation
of this timing error model in a commercial 45 nm CMOS.
4.3.1 Error Model Generation
The error model generation methodology is shown in Fig. 4.5, and described below:
1) Characterize the gate delay distribution vs. operating voltage $V_{dd}$ of basic gates such
as AND and XOR using HSPICE. Specifically, the gate delay $d$ is modeled as a Gaussian
random variable, i.e., $d \sim \mathcal{N}(\hat{\mu}_d, \hat{\sigma}_d)$, where the mean $\hat{\mu}_d$ and standard deviation $\hat{\sigma}_d$ are
estimated from HSPICE Monte Carlo (MC) simulations with 1000 iterations.
2) Implement the MVM architecture shown in Fig. 4.2(a) using structural Verilog HDL
and the basic gates characterized in Step 1.
3) Emulate process variations at NTV by generating multiple (30) architectural instances
and assigning random gate delays obtained via sampling the gate delay distributions obtained
in Step 1.
4) Generate the error PMF P (η) employing the procedure in [86].
Speciﬁcally, Steps 3-4 are performed according to Algorithm 2. During system level simu-
lations, the system performance (i.e. the probability of detection) is evaluated by performing
error injection using the HDL error PMF P (η).
Figure 4.6 shows that the extent of within-the-die (WID) delay variation $(\hat{\sigma}/\hat{\mu})_d$ of an
AND gate increases from 0.03 at the supply voltage $V_{dd} = 1.2$ V (see Fig. 4.6(a)) to 0.24 at
$V_{dd} = 0.4$ V (see Fig. 4.6(b)), indicating an 8× increase in the WID delay variation at NTV
compared with that of the super-threshold regime. The worst-case normalized confidence
intervals (with a 99% confidence level) for $\hat{\mu}_d$ and $\hat{\sigma}_d$ of the AND gate delays are 1% and
5%, respectively, where a confidence interval with a $p$ ($0 \le p \le 1$) confidence level implies
that the probability that the (random) confidence interval contains the true value is at
least $p$ [87]. These results indicate that the 1000 MC iterations are sufficient to provide
highly accurate estimates for the gate level delay models.
[Figure 4.5 panels: (a) model generation methodology; (b) proposed error modeling framework: errors modeled as random variables (RVs), additive error model, goal: model the error probability mass function (PMF) $P(\eta)$.]
Figure 4.5: Block diagram of: (a) model generation methodology, and (b) error modeling
framework.
[Figure 4.6 panels: # of occurrences vs. AND gate delay (ps), delay histogram and model; (a) $\hat{\mu}_{AND} = 22$ ps, $\hat{\sigma}_{AND} = 0.65$ ps; (b) $\hat{\mu}_{AND} = 163$ ps, $\hat{\sigma}_{AND} = 39$ ps.]
Figure 4.6: Illustration of the AND gate within-the-die (WID) delay histograms from
HSPICE Monte Carlo (MC) simulations and the AND gate delay model at: (a) $V_{dd} = 1.2$ V,
(b) $V_{dd} = 0.4$ V, with 1000 MC iterations.
Algorithm 2 Algorithm to obtain the kernel error PMF $P(\eta)$ under each operating voltage
$V_{dd}$.
Initialize the frequency to be the maximum error-free frequency with the delays of the basic
gates set to their estimated means, and obtain the error-free output $y_o$ for $N = 10^5$
random inputs
for each kernel instance $i$ do
    instantiate the die-to-die (D2D) delay $d_{D2D}$ by sampling the D2D delay distribution
    for each gate within the instance $i$ do
        instantiate the WID gate delay $d_{WID}$ by sampling the WID delay distribution, and
        set $d_g = d_{D2D} + d_{WID}$
    end for
    for each $n$-th random input do
        obtain the kernel output $y_i(n)$ and the error $\eta_i(n) = y_o(n) - y_i(n)$
    end for
    obtain the error PMF: $P(\eta)_i = \mathrm{hist}(\eta_i)/N$
end for
4.3.2 Error Model Validation
This subsection validates the error model generation methodology in Section 4.3.1. A complete
HDL simulation of the entire CNN is infeasible due to the large number of MVMs;
thus, we validate the model for a single MVM, employing the circuit-level SNR of the main
block (see Fig. 4.1 and (4.1)) as follows:
\[
\mathrm{SNR} = 10 \log_{10}\left(\frac{\sigma_{y_o}^2}{\sigma_\eta^2}\right), \tag{4.11}
\]
where $\sigma_{y_o}^2$ and $\sigma_\eta^2$ are the variances of the error-free output $y_o$ and the timing error $\eta$,
respectively.
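The SNR in (4.11), together with PMF-based error injection of the kind used in the system-level simulations, can be sketched as follows. The error values and probabilities of the PMF, the output distribution, and the function names are hypothetical placeholders rather than characterized data.

```python
import numpy as np


def circuit_snr_db(y_o, eta):
    """Circuit-level SNR of (4.11): ratio of the error-free output variance to the error variance."""
    return 10.0 * np.log10(np.var(y_o) / np.var(eta))


def inject_errors(y_o, error_values, error_probs, rng):
    """Draw additive errors from a PMF and add them to the error-free outputs, as in (4.1)."""
    eta = rng.choice(error_values, size=len(y_o), p=error_probs)
    return y_o + eta, eta


rng = np.random.default_rng(0)
y_o = rng.integers(-128, 128, size=100_000).astype(float)   # hypothetical error-free outputs
# Hypothetical error PMF: mostly no error, occasional large MSB-like errors.
error_values = np.array([0.0, 64.0, -64.0, 128.0, -128.0])
error_probs = np.array([0.90, 0.04, 0.04, 0.01, 0.01])
y_a, eta = inject_errors(y_o, error_values, error_probs, rng)
print(f"SNR = {circuit_snr_db(y_o, eta):.1f} dB")
```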
The validation procedure is as follows. First, HDL (bit and clock accurate) simulations of
each instance in Step 3 are run to obtain error samples and the circuit-level SNR $\mathrm{SNR}_h$. Second,
fixed-point MATLAB simulations using the PMF from Step 4 to inject errors for the MVMs
are run to obtain the circuit-level SNR $\mathrm{SNR}_s$. Third, we compare $\mathrm{SNR}_s$ with $\mathrm{SNR}_h$.
Figure 4.7 plots $\mathrm{SNR}_h$ obtained via HDL simulations using the characterized gate delay
distributions and $\mathrm{SNR}_s$ obtained via MATLAB simulations using the error PMF as a function
of the gate level delay variation $(\sigma/\mu)_d$. It is found that the difference between the median
$\mathrm{SNR}_h$ ($\overline{\mathrm{SNR}}_h$) and the median $\mathrm{SNR}_s$ ($\overline{\mathrm{SNR}}_s$) is no more than 5% when $(\sigma/\mu)_d$ increases from 3% to
39%. Figure 4.7 shows that the variation of SNR increases for $3\% \le (\sigma/\mu)_d \le 34\%$, and
then decreases because all the instances are subject to large timing errors. Figure 4.7 further
shows that the maximum and minimum values of $\mathrm{SNR}_h$ and $\mathrm{SNR}_s$ differ by no more than
6% and 4%, respectively. These results indicate that the timing error is well-modeled by its
PMF.
[Figure 4.7 plot: SNR (dB) vs. $(\sigma/\mu)_d$ for $\mathrm{SNR}_h$ and $\mathrm{SNR}_s$; plot annotation: random input and weight vectors, $N = 5$, $B_X = 7$, $B_W = 6$.]
Figure 4.7: Validating the error model generation methodology by comparing SNR from
HDL simulations and the NTV methodology based on 30 MVM instances with $10^5$ random
input samples for each instance operating at gate level delay variation of 3%-39%.
[Figure 4.8 panels: (a) $(\sigma/\mu)_d$ vs. $V_{dd}$ (V), annotated with a 13× increase; (b) median of $p_\eta$ vs. $(\sigma/\mu)_d$, annotated with a 70× increase.]
Figure 4.8: Characterization of: (a) process variations in terms of $(\sigma/\mu)_d$ vs. $V_{dd}$, and (b)
the impact of process variations on the MVM error rate $\bar{p}_\eta$ based on 30 MVM instances.
4.4 Simulation Results
In this section, we evaluate the performance of RD-SEC-based CNNs employing the error
PMFs from Section 4.3 for the MNIST [68] and CIFAR-10 datasets [79].
4.4.1 System Set-up
Similar to the case in PredictiveNet, the bias term δm in (3.1) and kernel wml in (3.2) of the
CNNs being studied are trained using the back-propagation algorithm [66]. The following
two architectures are considered: 1) a slow CNN architecture with RD-SEC applied to the
C-layers and F1 layer (denoted as RD-SEC CNN), where the multipliers and adders are
implemented using Baugh-Wooley (BW) multiplier and RCA, respectively; 2) an uncom-
pensated fast CNN architecture (denoted as Conv CNN), where the multipliers and adders
are implemented using the programmable CSE technique in [42] and Kogge-Stone adder,
respectively. The fast architecture is chosen for comparison because it will result in the
largest energy savings in the error-free case when voltage scaling is employed.
[Figure 4.9 panels: (a) median of $p_{det}$ vs. $(\sigma/\mu)_d$ for Conv CNN and RD-SEC CNN, annotated with 1.4× and 11×; (b) sigma of $p_{det}$ vs. $(\sigma/\mu)_d$ for Conv CNN and RD-SEC CNN, annotated with 113×.]
Figure 4.9: Simulation results for the MNIST dataset: (a) $\bar{p}_{det}$ vs. $(\sigma/\mu)_d$, and (b) $\sigma_{p_{det}}$ vs.
$(\sigma/\mu)_d$, based on 30 CNN instances in the presence of process variations.
4.4.2 Characterization
First, the extent of process variation in NTV is characterized in terms of $(\sigma/\mu)_d$. Figure
4.8(a) shows that $(\sigma/\mu)_d$ increases by 13× from 3% to 39% as the supply voltage $V_{dd}$ decreases
from 0.7 V to 0.3 V. Note that process variation makes the detection accuracy $p_{det} = P\{\hat{T} = t\}$
($\hat{T}$ and $t$ are the classifier decision and the true label, respectively) a random variable,
which is denoted as $P_{det}$. Figure 4.8(b) shows that the median error rate $\bar{p}_\eta$ (where the error
rate is defined as $p_\eta = P\{\eta \ne 0\}$) increases by 70× from $1.4 \times 10^{-2}$ to 0.99 as $V_{dd}$ decreases
from 0.7 V to 0.3 V. At $(\sigma/\mu)_d = 34\%$, the median error rate is $\bar{p}_\eta = 0.57$.
Next, we employ the error PMFs obtained from Step 4 of the NTV error modeling methodology
(see Section 4.3.1) to inject errors in fixed-point MATLAB simulations of CNN architectures
to evaluate their robustness to timing errors in NTV. We compare the two architectures
in terms of the median ($\bar{p}_{det}$) and standard deviation ($\sigma_{p_{det}}$) of the detection accuracy
$P_{det}$. This is because $p_\eta$ and $p_{det}$ are spatially distributed random variables in the presence
of process variations, where the path delay distribution and the timing violations (hence $p_\eta$
and $p_{det}$) are different for each MVM or CNN instance.
Table 4.2: Summary of CNN Parameters from [68]
Parameter definition:
  Parameter   Description
  L/M         # of input/output FMs
  K × K       size of kernels
  I1 × I2     size of input FMs
CNN parameter summary:
  Layer   L    M     I1 × I2   K × K
  C1      1    32    28 × 28   5 × 5
  C2      32   64    12 × 12   5 × 5
  F1      64   100   4 × 4     2 × 1
4.4.3 Comparison of p¯det and σpdet for CNNs using MNIST
The parameters of the CNNs for the MNIST dataset are summarized in Table 4.2 [68].
The precisions $B_{in}$ and $B_w$ are set to 7 bits and 8 bits, respectively, ensuring that the error-free
fixed-point detection accuracy is within 0.2% of the floating-point detection accuracy of
0.98.
Figure 4.9(a) shows that RD-SEC CNN is able to maintain $\bar{p}_{det} \ge 0.9$ (the worst-case confidence
interval with a 95% confidence level is [0.92, 0.95]) for $(\sigma/\mu)_d \le 34\%$, whereas Conv
CNN can only maintain the same performance for $(\sigma/\mu)_d \le 3\%$ (the worst-case confidence
interval with a 95% confidence level is [0.80, 0.90]). Thus, RD-SEC CNN is able to deliver
a high detection accuracy in the presence of a high error rate of $\bar{p}_\eta \le 0.57$ (see Fig. 4.8(b)).
This indicates an 11× improvement compared with the Conv CNN. Figure 4.9(b) shows that
the RD-SEC CNN can achieve $\sigma_{p_{det}} = 2.6 \times 10^{-3}$ (with a 95% level confidence interval of
$[2.1 \times 10^{-3}, 3.6 \times 10^{-3}]$), indicating a 113× reduction in $\sigma_{p_{det}}$ as compared to that of the
Conv CNN, $\sigma_{p_{det}} = 0.3$ (with a 95% level confidence interval of [0.24, 0.4]), at $(\sigma/\mu)_d = 11\%$.
Figure 4.9(b) also shows that $\sigma_{p_{det}}$ of the RD-SEC CNN is no more than $4.8 \times 10^{-2}$, whereas
the maximum $\sigma_{p_{det}}$ of the Conv CNN is 0.32 for $3\% \le (\sigma/\mu)_d \le 39\%$.
Furthermore, Fig. 4.9(b) demonstrates that $\sigma_{p_{det}}$ of the RD-SEC CNN increases from
$1.8 \times 10^{-3}$ to $4.8 \times 10^{-2}$ when $(\sigma/\mu)_d$ increases from 3% to 34%, and then decreases. When
$(\sigma/\mu)_d > 34\%$, $\sigma_{p_{det}}$ of the RD-SEC CNN is larger than that of the Conv CNN because
all the instances of the Conv CNN achieve a low $P_{det} \approx 0.1$, whereas some instances of the
RD-SEC CNN can still achieve a $P_{det} \ge 0.9$, leading to a larger $\sigma_{p_{det}}$.
To understand the robustness improvement achieved by RD-SEC, the input, the C1 FMs (12
out of 32), the output vector and the final decision T̂ are analyzed (see Fig. 4.10). Note
that T̂ is chosen as the index of the maximum element in the output vector. Figure 4.10(a)
shows that the timing errors contaminate the extracted features in the Conv CNN, leading to
classification failure. Specifically, the output vector has two peaks (at positions 3 and 5)
due to the contaminated features, resulting in a wrong decision of 3 instead of the correct
one, 5. On the other hand, RD-SEC is able to compensate for timing errors, and thus enables
the RD-SEC CNN to extract correct features for correct classification even in the presence
of a large number of timing errors.

[Figure 4.10, panels (a) and (b): Input, C1 output FMs, Output, Decision]
Figure 4.10: An example of the C1 FMs and the output vector from: (a) the Conv CNN,
and (b) the RD-SEC CNN, when the input digit is 5 and (σ/µ)d = 27%.
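The decision rule and the two-peak failure described for Fig. 4.10 can be illustrated with a few lines of Python. The output vectors below are made-up numbers chosen only to mimic the described behavior; they are not taken from the figure.

```python
def decide(output_vector):
    """Return T_hat, the index of the largest element of the output vector."""
    return max(range(len(output_vector)), key=lambda i: output_vector[i])

# Hypothetical 10-class output vectors for an input digit 5.
# Contaminated features give two peaks (positions 3 and 5); the spurious one is larger.
contaminated = [0.02, 0.01, 0.05, 0.48, 0.03, 0.41, 0.04, 0.02, 0.01, 0.03]
clean = [0.01, 0.01, 0.02, 0.06, 0.02, 0.85, 0.03, 0.01, 0.01, 0.02]

t_hat_conv = decide(contaminated)
t_hat_rdsec = decide(clean)
```

A small perturbation is enough to flip the decision once a second peak of similar height appears, which is why feature contamination leads directly to misclassification.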
Table 4.3: Summary of CNN Parameters for CIFAR-10 Dataset [79]

Parameter Definition
Parameter   Description
L           # of input FMs
M           # of output FMs
K × K       size of kernels
I1 × I2     size of input FMs

CNN Parameter Summary
Layer   L         M         I1 × I2    K × K
C1      3         3 × 32    32 × 32    5 × 5
C2      3 × 32    3 × 32    16 × 16    5 × 5
C3      3 × 32    3 × 64    8 × 8      5 × 5
F1      3 × 64    3 × 64    4 × 4      2 × 1
4.4.4 Comparison of p̄det and σpdet for CNNs using CIFAR-10
To demonstrate the generality of the proposed RD-SEC technique, the RD-SEC-based CNN is
also applied to the CIFAR-10 dataset [79]. The CNN for this dataset contains three C-layers
and S-layers and one F-layer. The parameters of the CNNs for the CIFAR-10 dataset, which are
developed based on the LeNet-5 CNN in [66], are summarized in Table 4.3. The precisions Bin
and Bw are set to 8 bits and 7 bits, respectively, which keeps the error-free fixed-point
detection accuracy within 0.3% of the floating-point detection accuracy of 0.8.
Figure 4.11(a) shows that the RD-SEC CNN is able to maintain p̄det ≥ 0.8 (the worst-case
confidence interval at a 95% confidence level is [0.78, 0.80]) for (σ/µ)d ≤ 29%, whereas the
Conv CNN can only maintain the same performance for (σ/µ)d ≤ 6% (the worst-case confidence
interval at a 95% confidence level is [0.63, 0.80]). This indicates a 5× improvement in error
rate tolerance compared with the Conv CNN. At (σ/µ)d = 13%, Figure 4.11(b) shows that the
RD-SEC CNN achieves σpdet = 2.6 × 10^{-3} (with a 95% confidence interval of
[2.1 × 10^{-3}, 3.6 × 10^{-3}]), an 85× reduction compared to the Conv CNN's σpdet = 0.229
(with a 95% confidence interval of [0.18, 0.30]). Figure 4.11(b) also shows that σpdet of the
RD-SEC CNN is no more than 2.3 × 10^{-2}, whereas the maximum σpdet of the Conv CNN is 0.23,
for 3% ≤ (σ/µ)d ≤ 39%.
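The text does not state how the confidence intervals on σpdet were obtained. One standard construction, sketched below in Python, is the chi-square interval for a standard deviation, which assumes that the per-instance accuracies are approximately normally distributed. The sample size of 30 is taken from the caption of Figure 4.11; the quantiles are standard table values for 29 degrees of freedom, and the function name is hypothetical.

```python
import math

# Chi-square quantiles for 29 degrees of freedom (30 instances), from standard tables.
CHI2_LOWER_29 = 16.047   # 2.5th percentile
CHI2_UPPER_29 = 45.722   # 97.5th percentile

def sigma_confidence_interval(s, n=30,
                              chi2_lower=CHI2_LOWER_29,
                              chi2_upper=CHI2_UPPER_29):
    """95% interval for a standard deviation estimated as s from n samples.

    The default quantiles are only valid for n = 30; other sample sizes
    need the quantiles for n - 1 degrees of freedom.
    """
    dof = n - 1
    return s * math.sqrt(dof / chi2_upper), s * math.sqrt(dof / chi2_lower)

lo, hi = sigma_confidence_interval(0.229)
```

Under these assumptions the interval for σpdet = 0.229 is about [0.18, 0.31], close to the reported [0.18, 0.30]. The small difference may come from rounding or from a different interval construction.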
Similar to the observation for CNNs using the MNIST dataset in Fig. 4.9(b), Fig. 4.11(b)
demonstrates that σpdet of the RD-SEC CNN increases from 1.6 × 10^{-3} to 2.3 × 10^{-2} when
(σ/µ)d increases from 3% to 34%, and then decreases. When (σ/µ)d > 34%, σpdet of the
RD-SEC CNN is larger than that of the Conv CNN because all the instances of the Conv
CNN achieve a low Pdet ≈ 0.1, whereas some instances of the RD-SEC CNN can still achieve
Pdet ≥ 0.8, leading to a larger σpdet.

Figure 4.11: Simulation results for the CIFAR-10 dataset: (a) p̄det vs. (σ/µ)d, and (b)
σpdet vs. (σ/µ)d, based on 30 CNN instances in the presence of process variations.
Comparing the simulation results for the MNIST (see Fig. 4.9) and CIFAR-10 (see Fig. 4.11)
datasets yields three observations. First, the Conv CNN for the CIFAR-10 dataset can tolerate
larger process variation in terms of (σ/µ)d (≤ 6%) than that for the MNIST dataset (≤ 3%).
This could result from its deeper structure and thus better inherent robustness. Second, the
Conv CNN for the CIFAR-10 dataset fails more abruptly as (σ/µ)d increases than the Conv CNN
for the MNIST dataset. This is likely because the data statistics of the CIFAR-10 dataset are
more diverse and thus more sensitive to computation errors. Third, RD-SEC can effectively
enhance the robustness of the CNNs for both the MNIST and CIFAR-10 datasets even in the
presence of a large number of computation errors, indicating its generality as a low-power
algorithmic technique for error resiliency.
4.5 Summary
In this chapter, a new SEC technique named RD-SEC is proposed for MVMs, a power-hungry
and commonly employed kernel in many signal processing and ML algorithms. RD-SEC is able
to significantly enhance the robustness of information processing when operating in NTV for
energy efficiency. Therefore, RD-SEC has the potential to enable the deployment of powerful
but power-hungry ML algorithms on power-constrained platforms. This work opens the
possibility of exploiting inherent redundancy or structure within signal processing and ML
algorithms to develop low-cost SEC techniques that enable robust computing on unreliable
stochastic fabrics for significant improvement in energy efficiency.
4.6 Derivation of α in (4.10)
In this section, we provide a detailed expression for α in (4.10). The complexity is calculated
in terms of the number of FAs. From (4.10), α is given by:

\[
\alpha = \frac{N_E}{N_M} = \frac{N_{\mathrm{add}-R} + N_{\mathrm{MUX}}}{N_{\mathrm{DP}}}, \tag{4.12}
\]

where $N_E$ and $N_M$ denote the complexities of one E-block and one M-block, respectively,
$N_{\mathrm{add}-R}$ denotes the complexity of the summer in (4.8), $N_{\mathrm{MUX}}$ denotes
the complexity of the MUX-based shifter in (4.8), and $N_{\mathrm{DP}}$ denotes the complexity
of one DP implemented using a Baugh-Wooley (BW) multiplier and a ripple-carry adder (RCA).
Specifically,

\begin{align}
N_{\mathrm{add}-R} &= (R-1)\left(B_{out} + \lceil \log_2(R) \rceil - 1\right) \tag{4.13}\\
N_{\mathrm{MUX}} &= B_{out}\left(\lceil \log_2(B_{out}+1) \rceil \, r_{\mathrm{M2F}}\right) R \tag{4.14}\\
N_{\mathrm{DP}} &= N B_w B_{in} + (N-1)\left(B_{in} + B_w + \lceil \log_2(N) \rceil - 1\right), \tag{4.15}
\end{align}

where $r_{\mathrm{M2F}}$ denotes the normalized complexity of a 2:1 MUX over an FA, for which we
use $r_{\mathrm{M2F}} = 3.5/9$ [88], $\lceil a \rceil$ denotes the ceiling operation, and
$B_{in}$, $B_{out}$ and $B_w$ denote the precisions of the input, output and weights,
respectively.
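Expressions (4.12) to (4.15) are simple enough to evaluate directly. The Python sketch below does so; the function names and the example operating point are hypothetical, and N and R are assumed to be the quantities that appear in (4.8) and (4.10), which are defined earlier in the chapter.

```python
R_M2F = 3.5 / 9  # complexity of a 2:1 MUX relative to one full adder, as used in the text

def ceil_log2(n):
    """Ceiling of log2(n) for a positive integer n, without floating point."""
    return (n - 1).bit_length()

def n_add_r(R, b_out):
    """Complexity of the summer, (4.13)."""
    return (R - 1) * (b_out + ceil_log2(R) - 1)

def n_mux(R, b_out, r_m2f=R_M2F):
    """Complexity of the MUX-based shifter, (4.14)."""
    return b_out * ceil_log2(b_out + 1) * r_m2f * R

def n_dp(N, b_in, b_w):
    """Complexity of one dot product, (4.15)."""
    return N * b_w * b_in + (N - 1) * (b_in + b_w + ceil_log2(N) - 1)

def alpha(N, R, b_in, b_out, b_w):
    """Overhead ratio of one E-block over one M-block, (4.12)."""
    return (n_add_r(R, b_out) + n_mux(R, b_out)) / n_dp(N, b_in, b_w)

# Hypothetical operating point: N = 25, R = 4, B_in = 7, B_w = 8, B_out = 8.
a = alpha(N=25, R=4, b_in=7, b_out=8, b_w=8)
```

For this made-up operating point the E-block costs roughly 4% of an M-block, which illustrates why the estimator can be called low-cost; the actual ratio depends on the parameters used in the design.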
Chapter 5
CONCLUSIONS AND FUTURE WORK
With ML systems increasingly becoming woven into our daily lives, energy efficiency will be
the key enabler for their pervasive application. The design of energy-efficient ML systems is
made challenging by the need for intensive computation and massive data movement. This
dissertation explores techniques to address this challenge for energy-efficient information
transfer and processing in ML systems.
5.1 Dissertation Contributions
Current ML systems adopt either centralized cloud computing or distributed edge computing.
In both cases, energy efficiency is a pressing challenge that will only be made worse by the
growing demand for increased I/O bandwidth of high-performance computing in data centers, as
well as by the increasing need to embed complicated ML algorithms in local devices.
To address the energy efficiency challenge in data centers, this dissertation has presented
our study on the use of link BER for designing a BOA-based serial link. First, we study,
through analysis and simulations, the benefits of the BOA over the CUA in a serial link
receiver. In particular, we propose two channel-dependent parameters to quantify these
benefits: 1) the m-clustering value, and 2) the threshold non-uniformity metric ht.
Furthermore, we show that the BER improvement is greater than 10^{6} when m ≥ 5 and ht ≤ 0.8
for a family of channels. Second, we present the design of a 4 GS/s, 4-bit BOA IC in a
90 nm CMOS process that includes a single-core, multiple-output passive DAC to enable a
variable-threshold and variable-resolution ADC configuration and verify the aforementioned
analysis. Measured results demonstrate that a 3-bit BOA has a lower SNR requirement than
a 4-bit CUA, thereby supporting the BOA idea in the presence of non-idealities such as
finite sampling bandwidth and metastability. In the process, we demonstrate conclusively
that ENOB is not the best metric when designing ADCs for serial links. Third, we propose
architectures to implement the gradient descent algorithm to compute the representation
levels of the BOA iteratively. In particular, the architectures for the QL-UD and the
individual RL-UD block are proposed.
For the problem of resource-constrained computing at the edge, this dissertation focuses on
energy-efficient implementation of ML algorithms, particularly CNNs, for their application
on power-constrained embedded platforms. This dissertation develops two techniques for
energy-efficient CNN design:
First, this dissertation proposes PredictiveNet, which predicts the zero activations (zero
prediction) and thereby avoids computing them. In this way, a significant reduction in the
number of convolutional operations is achieved without altering the structure or introducing
additional side networks. Thus, PredictiveNet has negligible overhead and can easily be
applied on top of existing techniques to obtain an even greater reduction in implementation
complexity. When applied to CNNs for the MNIST and CIFAR-10 datasets, simulation
results show that PredictiveNet can achieve up to 7.2× and 4.4× reduction in the
computational and representational costs, respectively, compared to a conventional CNN, and up
to 2.5× and 1.7× reduction in the computational and representational costs, respectively,
compared to a zero-skipping CNN, while incurring only 0.02 degradation in classification
accuracy.
Second, this dissertation proposes a new SEC technique referred to as RD-SEC that is
particularly well-suited for MVMs, a commonly used signal processing and ML kernel. RD-SEC
makes use of the fact that a large fraction of the computation inside an MVM can be derived
from a small subset, and employs these derived results for low-cost error detection and
correction. Simulation results in 45 nm CMOS for an RD-SEC-based CNN architecture show
that RD-SEC enables robust CNNs operating in the NTV regime for aggressive energy savings.
Specifically, when applied to CNNs for the MNIST dataset, the proposed architecture
can achieve a median classification accuracy p̄det ≥ 0.9 in the presence of gate-level delay
variation of up to 34%. This represents an improvement in variation tolerance of 11× as
compared to a conventional CNN. We further show that the RD-SEC-based CNN enables up
to 113× reduction in the standard deviation of Pdet compared to the conventional CNN.
When applied to CNNs for the CIFAR-10 dataset, the proposed architecture improves
variation tolerance by 5× and reduces variation in CNN classification accuracy (Pdet) by 85×
compared with a conventional CNN.
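This summary does not spell out the decision rule used for error detection and correction. The Python sketch below assumes the threshold comparison that is common in statistical error compensation: the main result is kept unless it differs from a low-cost estimate by more than a threshold. The function name, the threshold and the numbers are hypothetical.

```python
def sec_decide(y_main, y_est, threshold):
    """Return the corrected output of one dot product.

    y_main:    output of the main (error-prone) computation
    y_est:     low-cost estimate derived from other dot products
    threshold: largest |y_main - y_est| attributed to estimation error alone
    """
    if abs(y_main - y_est) > threshold:
        return y_est      # error detected: fall back to the estimate
    return y_main         # no error detected: keep the main result

# Hypothetical values: the exact result is 25 and the estimate is slightly off (24).
no_error = sec_decide(y_main=25, y_est=24, threshold=16)
with_error = sec_decide(y_main=25 + 128, y_est=24, threshold=16)
```

The scheme works when timing errors are large in magnitude while estimation errors are small, so that a single threshold separates the two cases.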
5.2 Future Work
Since ML systems have unique properties, it is important to consider non-traditional
approaches and explore innovations at various levels of design abstraction to address the
challenge of designing energy-efficient ML systems. This is because the design space of
ML systems is complex due to their interlinked challenges and opportunities at the system,
architecture, circuit and device levels.
5.2.1 System Level
Designing ML systems in many emerging applications leads to new problems compared to
those in the mature areas of signal processing and communication systems. ML systems are
currently designed in an ad hoc manner. Therefore, systematic design methodologies and
system innovations are critical for realizing ML systems with optimal energy efficiency. ML
algorithms are essentially optimization problems that try to minimize certain loss functions.
Furthermore, many ML algorithms such as CNNs have been shown to achieve satisfactory
performance even when the training reaches only a local minimum rather than the global
minimum in the search space. In addition, there is usually more than one local minimum that
would lead to the specified system performance. This provides a system-level opportunity to
improve energy efficiency: the original optimization problem can be reformulated to include
additional constraints on architecture, circuit, and data movement for energy efficiency.
Such an energy-constrained reformulation aims to achieve a holistic optimal realization of
the entire information gathering and processing stack in ML systems. In line with this
direction, the PredictiveNet and RD-SEC techniques proposed in this dissertation offer two
future directions. First, one natural next step is to constrain the ML training algorithms
such that the zero predictors in PredictiveNet can be shared by many kernels, thereby further
reducing both the computational and representational costs significantly. Second, imposing
additional constraints that favor the reduction of estimation errors in the RD-SEC technique
could suppress the estimation errors. Possibly, the low-cost estimators themselves can
guarantee marginal system performance loss and eliminate the need for power-hungry
implementations.
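One way to write the energy-constrained reformulation mentioned in this subsection is given below. The notation is not from the original and is introduced only for illustration: θ denotes the trainable parameters, L the training loss, E an energy model of the implementation, E_max an energy budget and λ a penalty weight.

```latex
\min_{\theta} \; \mathcal{L}(\theta)
\quad \text{subject to} \quad E(\theta) \le E_{\max},
\qquad \text{or, in penalty form,} \qquad
\min_{\theta} \; \mathcal{L}(\theta) + \lambda\, E(\theta), \quad \lambda \ge 0 .
```

Because several local minima may meet the accuracy target, the constraint or penalty can steer training toward the ones that are cheaper to implement.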
5.2.2 Architecture Level
Conventional computing architectures separate sensing, computation and storage units. Such
architectures may not be energy-efficient for ML systems due to the large amounts of data
movement they require. New architectures should reduce the need for costly massive data
movement in the entire information gathering and processing chain of ML systems and make
use of the probabilistic nature of the performance metrics in ML algorithms for significant
energy reduction. The recently proposed in-memory computing architecture [38] is one such
example. The MVM-based architecture presented in this dissertation offers another direction
that can be extended. When data movement is much more expensive than computation, one
energy-efficient option that still builds on conventional computing architectures is to store
only the basis weights and then derive the computations associated with the non-basis weights
from those associated with the basis weights. The energy efficiency of such an architecture
is achieved by trading costly data accesses for relatively low-cost computation. Another
possible direction is to develop energy-efficient combined sensing, storage and processing
units, and to distribute many such units in a systematic and energy-minimizing manner.
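The basis-weight idea rests on the linearity of the dot product: if a weight row is a linear combination of stored basis rows, its output is the same combination of the basis outputs. The Python sketch below illustrates this with a made-up 4 × 3 weight matrix. The matrix, the coefficients and the function names are hypothetical, and the non-basis rows are assumed to be exact combinations; with approximate combinations the derived outputs would carry an estimation error.

```python
def matvec(rows, x):
    """Plain matrix-vector multiplication: one dot product per row."""
    return [sum(w * v for w, v in zip(row, x)) for row in rows]

# Two stored basis rows of a hypothetical 4x3 weight matrix.
basis = [[1, 2, 0],
         [0, 1, 3]]
# Non-basis row k equals coeffs[k][0] * basis[0] + coeffs[k][1] * basis[1].
coeffs = [[2, 1],
          [1, -1]]

def mvm_from_basis(basis, coeffs, x):
    """Compute all outputs while reading only the basis rows and the coefficients."""
    y_basis = matvec(basis, x)                         # full dot products
    y_derived = [sum(c * y for c, y in zip(row, y_basis)) for row in coeffs]
    return y_basis + y_derived

x = [3, 1, 2]
y = mvm_from_basis(basis, coeffs, x)

# Reference: the same product with every row stored explicitly.
full = basis + [[2, 5, 3], [1, 1, -3]]
y_ref = matvec(full, x)
```

Each derived output needs only as many multiply-adds as there are basis rows, instead of one per input element, and the non-basis weights never have to be fetched. The saving in memory traffic grows with the row length.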
5.2.3 Circuit and Device Level
Each decision of many ML algorithms involves hundreds of operations; therefore, the
correctness of the final decision may not require every operation to be accurate. This inherent
robustness of ML algorithms can be leveraged to design circuits using non-traditional
information metrics such as classification accuracy. The resultant new circuits would be more
energy-efficient as they have more relaxed mismatch, precision or linearity requirements
than those designed using traditional fidelity metrics such as SQNR. The BOA technique
presented in Chapter 2 is one such example. Essentially, new techniques should bridge the
probabilistic nature of ML systems and the statistical behavior of circuits in scaled CMOS
and emerging technologies for energy efficiency. On the device side, the performance
benefits of CMOS scaling have stagnated for conventional designs. Although emerging
technologies such as spin [89] have been shown to have the potential to achieve large
energy savings, they have various robustness issues. A promising future direction is to
investigate SEC techniques to address those robustness issues and thus enable energy-efficient
and robust ML systems on scaled CMOS or emerging beyond-CMOS technologies. In fact,
the RD-SEC technique in Chapter 4 is one such example where SEC is shown to enable
robust ML systems designed in unreliable stochastic circuit/device fabrics for aggressive
improvement in energy efficiency.
REFERENCES
[1] The fourth industrial revolution, by Klaus Schwab. [Online]. Available:
https://www.weforum.org/pages/the-fourth-industrial-revolution-by-klaus-schwab/
[2] A. Krizhevsky, I. Sutskever, and G. E. Hinton, Imagenet classification with deep
convolutional neural networks, in Advances in Neural Information Processing Systems, 2012,
pp. 1097–1105.
[3] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche,
J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot et al., Mastering the game of
go with deep neural networks and tree search, Nature, vol. 529, no. 7587, pp. 484–489,
2016.
[4] J. Baliga, R. W. Ayre, K. Hinton, and R. S. Tucker, Green cloud computing: Balancing
energy in processing, storage, and transport, Proceedings of the IEEE, vol. 99, no. 1,
pp. 149–167, 2011.
[5] J. Howard, S. Dighe, S. R. Vangal, G. Ruhl, N. Borkar, S. Jain, V. Erraguntla,
M. Konow, M. Riepen, M. Gries et al., A 48-core IA-32 processor in 45 nm CMOS
using on-die message-passing and DVFS for performance and power scaling, IEEE
Journal of Solid-State Circuits, vol. 46, no. 1, pp. 173–183, 2011.
[6] Google glass, Google Technology Company, 2013. [Online]. Available:
https://en.wikipedia.org/wiki/Google-Glass/
[7] XPS tower, Dell Computer Company, 2016. [Online]. Available:
http://www.dell.com/en-us/shop/productdetails/xps-8910-desktop/
[8] D. Abts, M. R. Marty, P. M. Wells, P. Klausler, and H. Liu, Energy proportional
data center networks, in ACM SIGARCH Computer Architecture News, vol. 38, no. 3.
ACM, 2010, pp. 338–347.
[9] M. Mansuri, J. E. Jaussi, J. T. Kennedy, T.-C. Hsueh, S. Shekhar, G. Balamurugan,
F. O'Mahony, C. Roberts, R. Mooney, and B. Casper, A scalable 0.128-1 Tb/s, 0.8–2.6
pJ/bit, 64-lane parallel I/O in 32-nm CMOS, IEEE Journal of Solid-State Circuits,
vol. 48, no. 12, pp. 3229–3242, 2013.
[10] Y.-H. Chen, T. Krishna, J. Emer, and V. Sze, Eyeriss: An energy-efficient
reconfigurable accelerator for deep convolutional neural networks, in 2016 IEEE International
Solid-State Circuits Conference (ISSCC). IEEE, 2016, pp. 262–263.
[11] R. Dreslinski, M. Wieckowski, D. Blaauw, D. Sylvester, and T. Mudge, Near-threshold
computing: reclaiming Moore's law through energy efficient integrated circuits,
Proceedings of the IEEE, vol. 98, no. 2, 2010.
[12] R. Plassche, Integrated Analog-to-Digital and Digital-to-Analog Converters. Kluwer
Academic Publishers Dordrecht, The Netherlands, 1994.
[13] S. Park, Y. Palaskas, A. Ravi, R. Bishop, and M. Flynn, A 3.5 GS/s 5-b flash ADC in
90nm CMOS, in 2006. CICC IEEE Custom Integrated Circuits Conference, Sept 2006,
pp. 489–492.
[14] S. Sheikhaei, S. Mirabbasi, and A. Ivanov, A 43 mW single-channel 4GS/s 4-bit flash
ADC in 0.18µm CMOS, in 2007. CICC IEEE Custom Integrated Circuits Conference,
Sept 2007, pp. 333–336.
[15] C.-C. Yeh and J. Barry, Adaptive minimum bit-error rate equalization for binary
signaling, IEEE Transactions on Communications, vol. 48, no. 7, pp. 1226–1235, Jul
2000.
[16] E.-H. Chen, W. Leven, N. Warke, A. Joy, S. Hubbins, A. Amerasekera, and C.-K.
Yang, Adaptation of CDR and full scale range of ADC-based SerDes receiver, in 2009
Symposium on VLSI Circuits, June 2009, pp. 12–13.
[17] M. Lu, N. Shanbhag, and A. Singer, BER-optimal analog-to-digital converters for
communication links, in Proceedings of 2010 IEEE International Symposium on Circuits
and Systems (ISCAS), May 2010, pp. 1029–1032.
[18] R. Narasimha, M. Lu, N. Shanbhag, and A. Singer, BER-optimal analog-to-digital
converters for communication links, Signal Processing, IEEE Transactions on, vol. 60,
no. 7, pp. 3683–3691, 2012.
[19] S. Lloyd, Least squares quantization in PCM, IEEE Transactions on Information
Theory, vol. 28, no. 2, pp. 129–137, Mar 1982.
[20] J. Max, Quantizing for minimum distortion, IRE Transactions on Information Theory,
vol. 6, no. 1, pp. 7–12, March 1960.
[21] Y. Lin, A. Xu, N. Shanbhag, and A. Singer, Energy-efficient high-speed links using
BER-optimal ADCs, in Electrical Design of Advanced Packaging and Systems Symposium
(EDAPS), 2011 IEEE, 2011, pp. 1–4.
[22] E.-H. Chen, R. Yousry, and C.-K. Yang, Power optimized ADC-based serial link
receiver, Solid-State Circuits, IEEE Journal of, vol. 47, no. 4, pp. 938–951, April 2012.
[23] J. Kim, E.-H. Chen, J. Ren, B. Leibowitz, P. Satarzadeh, J. Zerbe, and C.-K. Yang,
Equalizer design and performance trade-offs in ADC-based serial links, Circuits and
Systems I: Regular Papers, IEEE Transactions on, vol. 58, no. 9, pp. 2096–2107, 2011.
[24] S. Son, H.-S. Kim, M.-J. Park, K. Kim, E.-H. Chen, B. Leibowitz, and J. Kim, A
2.3-mW, 5-Gb/s low-power decision-feedback equalizer receiver front-end and its two-step,
minimum bit-error-rate adaptation algorithm, IEEE Journal of Solid-State Circuits,
vol. 48, no. 11, pp. 2693–2704, Nov 2013.
[25] A. Pullini, F. Conti, D. Rossi, I. Loi, M. Gautschi, and L. Benini, A heterogeneous
multi-core system-on-chip for energy efficient brain inspired vision, in Circuits and
Systems (ISCAS), 2016 IEEE International Symposium on, 2016, pp. 2910–2910.
[26] K. He et al., Deep residual learning for image recognition, CoRR, vol. abs/1512.03385,
2015.
[27] J. Cong and B. Xiao, Minimizing computation in convolutional neural networks, in
International Conference on Artificial Neural Networks. Springer, 2014, pp. 281–290.
[28] Y. Wang, L. Xia, T. Tang, B. Li, S. Yao, M. Cheng, and H. Yang, Low power
convolutional neural networks on a chip, in Circuits and Systems (ISCAS), 2016 IEEE
International Symposium on, 2016, pp. 129–132.
[29] M. Courbariaux and Y. Bengio, BinaryNet: Training deep neural networks with weights
and activations constrained to +1 or -1, arXiv preprint arXiv:1602.02830, 2016.
[30] S. Han, J. Pool, J. Tran, and W. Dally, Learning both weights and connections for
efficient neural network, in Advances in Neural Information Processing Systems, 2015,
pp. 1135–1143.
[31] X. Zhang, J. Zou, K. He, and J. Sun, Accelerating very deep convolutional networks
for classification and detection, arXiv preprint arXiv:1505.06798, 2015.
[32] P. Knag, C. Liu, and Z. Zhang, A 1.40mm² 141mW 898GOPS sparse neuromorphic
processor in 40nm CMOS, in 2016 IEEE Symposium on VLSI Circuits (VLSI-Circuits),
June 2016, pp. 1–2.
[33] S. Leroux, S. Bohez, C. De Boom, E. De Coninck, T. Verbelen, B. Vankeirsbilck,
P. Simoens, and B. Dhoedt, Lazy evaluation of convolutional filters, arXiv preprint
arXiv:1605.08543, 2016.
[34] P. Panda, A. Sengupta, and K. Roy, Conditional deep learning for energy-efficient and
enhanced pattern recognition, in 2016 Design, Automation & Test in Europe Conference
& Exhibition (DATE), 2016, pp. 475–480.
[35] M. Horowitz, Computing's energy problem (and what we can do about it), in
Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2014 IEEE
International. IEEE, 2014, pp. 10–14.
[36] T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, and O. Temam, DianNao: A
small-footprint high-throughput accelerator for ubiquitous machine-learning, in ACM
Sigplan Notices, vol. 49, no. 4. ACM, 2014, pp. 269–284.
[37] Y.-H. Chen, J. Emer, and V. Sze, Eyeriss: A spatial architecture for energy-efficient
dataflow for convolutional neural networks, in Computer Architecture (ISCA), 2016
ACM/IEEE 43rd Annual International Symposium on. IEEE, 2016, pp. 367–379.
[38] M. Kang, S. K. Gonugondla, M.-S. Keel, and N. R. Shanbhag, An energy-efficient
memory-based high-throughput VLSI architecture for convolutional networks, in
Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on.
IEEE, 2015, pp. 1037–1041.
[39] J. Zhang, Z. Wang, and N. Verma, In-memory computation of a machine-learning
classifier in a standard 6T SRAM array, IEEE Journal of Solid-State Circuits, vol. 52,
no. 4, pp. 915–924, 2017.
[40] Z. Du et al., ShiDianNao: shifting vision processing closer to the sensor, in Computer
Architecture (ISCA), 2015 ACM/IEEE 42nd Annual International Symposium on, 2015,
pp. 92–104.
[41] J.-G. Chung and K. K. Parhi, Frequency spectrum based low-area low-power parallel
FIR filter design, EURASIP J. Appl. Signal Process., vol. 2002, pp. 944–953, 2002.
[42] R. Mahesh and A. Vinod, New reconfigurable architectures for implementing FIR filters
with low complexity, Computer-Aided Design of Integrated Circuits and Systems, IEEE
Transactions on, vol. 29, no. 2, 2010.
[43] X. Liu, J. Zhou, X. Liao, C. Wang, J. Luo, M. Madihian, and M. Je, Ultra-low-energy
near-threshold biomedical signal processor for versatile wireless health monitoring, in
2012 IEEE Asian Solid State Circuits Conference (A-SSCC), Nov 2012, pp. 381–384.
[44] Y. Kim, I. Hong, and H. J. Yoo, A 0.5V 54 uW ultra-low-power recognition processor
with 93.5% accuracy geometric vocabulary tree and 47.5% database compression, in
2015 IEEE International Solid-State Circuits Conference - (ISSCC) Digest of Technical
Papers, Feb 2015, pp. 1–3.
[45] S. Das, D. Blaauw, D. Bull, K. Flautner, and R. Aitken, Addressing design
margins through error-tolerant circuits, in Design Automation Conference (DAC), 46th
ACM/IEEE, 2009, pp. 11–12.
[46] J. Tschanz, K. Bowman, C. Wilkerson, S.-L. Lu, and T. Karnik, Resilient circuits:
enabling energy-efficient performance and reliability, in Computer-Aided Design
(ICCAD), IEEE/ACM International Conference on, 2009.
[47] R. Bahar, J. Mundy, and J. Chen, A probabilistic-based design methodology for
nanoscale computation, in Computer Aided Design (ICCAD), IEEE/ACM International
Conference on, 2003, pp. 480–486.
[48] N. Vaidya and D. Pradhan, Fault-tolerant design strategies for high reliability and
safety, Computers, IEEE Transactions on, vol. 42, no. 10, pp. 1195–1206, 1993.
[49] B. Shim, S. Sridhara, and N. Shanbhag, Reliable low-power digital signal processing
via reduced precision redundancy, Very Large Scale Integration (VLSI) Systems, IEEE
Transactions on, vol. 12, no. 5, pp. 497–510, 2004.
[50] J. Choi, E. P. Kim, R. A. Rutenbar, and N. R. Shanbhag, Error resilient MRF message
passing architecture for stereo matching, in Signal Processing Systems (SiPS), IEEE
Workshop on, 2013, pp. 348–353.
[51] R. A. Abdallah and N. R. Shanbhag, Error-resilient systems via statistical signal
processing, in Signal Processing Systems (SiPS), IEEE Workshop on, 2013.
[52] H.-M. Bae, J. Ashbrook, J. Park, N. Shanbhag, A. Singer, and S. Chopra, An MLSE
receiver for electronic dispersion compensation of OC-192 fiber links, IEEE Journal of
Solid-State Circuits, vol. 41, no. 11, pp. 2541–2554, Nov 2006.
[53] J. Cao, B. Zhang, U. Singh, D. Cui, A. Vasani, A. Garg, W. Zhang, N. Kocaman, D. Pi,
B. Raghavan, H. Pan, I. Fujimori, and A. Momtaz, A 500mW digitally calibrated
AFE in 65nm CMOS for 10Gb/s serial links over backplane and multimode fiber, in
2009 IEEE International Solid-State Circuits Conference Digest of Technical Papers
(ISSCC), Feb 2009, pp. 370–371, 371a.
[54] B. Abiri, A. Sheikholeslami, H. Tamura, and M. Kibune, A 5Gb/s adaptive DFE for
2x blind ADC-based CDR in 65nm CMOS, in 2011 IEEE International Solid-State
Circuits Conference Digest of Technical Papers (ISSCC), Feb 2011, pp. 436–438.
[55] O. Agazzi, D. Crivelli, M. Hueda, H. Carrer, G. Luna, A. Nazemi, C. Grace, B. Kobeissy,
C. Abidin, M. Kazemi, M. Kargar, C. Marquez, S. Ramprasad, F. Bollo, V. Posse,
S. Wang, G. Asmanis, G. Eaton, N. Swenson, T. Lindsay, and P. Voois, A 90nm CMOS
DSP MLSD transceiver with integrated AFE for electronic dispersion compensation of
multi-mode optical fibers at 10Gb/s, in 2008 IEEE International Solid-State Circuits
Conference Digest of Technical Papers (ISSCC), Feb 2008, pp. 232–609.
[56] M. Harwood, N. Warke, R. Simpson, T. Leslie, A. Amerasekera, S. Batty, D. Colman,
E. Carr, V. Gopinathan, S. Hubbins, P. Hunt, A. Joy, P. Khandelwal, B. Killips,
T. Krause, S. Lytollis, A. Pickering, M. Saxton, D. Sebastio, G. Swanson, A. Szczepanek,
T. Ward, J. Williams, R. Williams, and T. Willwerth, A 12.5Gb/s SerDes in 65nm
CMOS using a baud-rate ADC with digital receiver equalization and clock recovery, in
Solid-State Circuits Conference, 2007. ISSCC 2007. Digest of Technical Papers. IEEE
International, 2007, pp. 436–591.
[57] [Online]. Available:
http://www.ieee802.org/3/ap/public/channel_model/index.html#Backplane%20Models
[58] C.-M. Hsu, M. Straayer, and M. Perrott, A low-noise wide-BW 3.6-GHz digital
fractional-N frequency synthesizer with a noise-shaping time-to-digital converter and
quantization noise cancellation, IEEE Journal of Solid-State Circuits, vol. 43, no. 12,
pp. 2776–2786, Dec 2008.
[59] Stratix II GX edition transceiver signal integrity development FPGA board,
Altera Corporation, 2003. [Online]. Available:
http://www.altera.com/products/devkits/altera/kit-signa_integrity_s2gx.html
[60] G. Al-Rawi, A new offset measurement and cancellation technique for dynamic latches,
in Circuits and Systems, 2002. ISCAS 2002. IEEE International Symposium on, vol. 5,
2002, pp. V-149–V-152 vol.5.
[61] J. He, S. Zhan, D. Chen, and R. Geiger, Analyses of static and dynamic random
offset voltages in dynamic comparators, Circuits and Systems I: Regular Papers, IEEE
Transactions on, vol. 56, no. 5, pp. 911–919, 2009.
[62] M. Ramezani, M. Abdalla, A. Shoval, M. van Ierssel, A. Rezayee, A. McLaren,
C. Holdenried, J. Pham, E. So, D. Cassan, and S. Sadr, An 8.4mW/Gb/s 4-lane 48Gb/s
multi-standard-compliant transceiver in 40nm digital CMOS technology, in Solid-State
Circuits Conference Digest of Technical Papers (ISSCC), 2011 IEEE International,
2011, pp. 352–354.
[63] F. Zhong, S. Quan, W. Liu, P. Aziz, T. Jing, J. Dong, C. Desai, H. Gao, M. Garcia,
G. Hom, T. Huynh, H. Kimura, R. Kothari, L. Li, C. Liu, S. Lowrie, K. Ling,
A. Malipatil, R. Narayan, T. Prokop, C. Palusa, A. Rajashekara, A. Sinha, C. Zhong,
and E. Zhang, A 1.0625-14.025 Gb/s multi-media transceiver with full-rate
source-series-terminated transmit driver and floating-tap decision-feedback equalizer in 40 nm
CMOS, Solid-State Circuits, IEEE Journal of, vol. 46, no. 12, pp. 3126–3139, 2011.
[64] J. Poulton, R. Palmer, A. Fuller, T. Greer, J. Eyles, W. Dally, and M. Horowitz, A
14-mW 6.25-Gb/s Transceiver in 90-nm CMOS, Solid-State Circuits, IEEE Journal
of, vol. 42, no. 12, pp. 2745–2757, 2007.
[65] S. Saxena, R. Nandwana, and P. Hanumolu, A 5 Gb/s 3.2 mW/Gb/s 28 dB
loss-compensating pulse-width modulated voltage-mode transmitter, in Custom Integrated
Circuits Conference (CICC), 2013 IEEE, 2013, pp. 1–4.
[66] Y. Lecun et al., Gradient-based learning applied to document recognition, Proceedings
of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[67] X. Glorot, A. Bordes, and Y. Bengio, Deep sparse rectifier neural networks, in
Proceedings of the Fourteenth International Conference on Artificial Intelligence and
Statistics, 2011, pp. 315–323.
[68] K. Jarrett, K. Kavukcuoglu, M. Ranzato, and Y. LeCun, What is the best multi-stage
architecture for object recognition? in Computer Vision, IEEE 12th International
Conference on, 2009, pp. 2146–2153.
[69] B. A. Olshausen and D. J. Field, Sparse coding with an overcomplete basis set: A
strategy employed by V1? Vision research, vol. 37, no. 23, pp. 3311–3325, 1997.
[70] C. Poultney, S. Chopra, Y. L. Cun et al., Eﬃcient learning of sparse representations
with an energy-based model, in Advances in neural information processing systems,
2007, pp. 11371144.
[71] H. Lee, C. Ekanadham, and A. Y. Ng, Sparse deep belief net model for visual area
V2, in Advances in neural information processing systems, 2008, pp. 873880.
[72] D. Arpit, Y. Zhou, H. Ngo, and V. Govindaraju, Why regularized auto-encoders learn
sparse representation? in International Conference on Machine Learning, 2016, pp.
136144.
[73] X. Glorot, A. Bordes, and Y. Bengio, Deep sparse rectiﬁer neural networks, in AISTATS,
vol. 15, no. 106, 2011, p. 275.
[74] Y. Bengio, G. Mesnil, Y. Dauphin, and S. Rifai, Better mixing via deep representations,
in Proceedings of the 30th International Conference on Machine Learning (ICML-13),
2013, pp. 552-560.
[75] C. Sakr, A. Patil, S. Zhang, Y. Kim, and N. Shanbhag, Understanding the energy and
precision requirements for online learning, arXiv preprint arXiv:1607.00669, 2016.
[76] R. B. Palm, Prediction as a candidate for learning deep hierarchical models of data,
M.S. thesis, 2012.
[77] Y. Lin, S. Zhang, and N. R. Shanbhag, Variation-tolerant architectures for convolu-
tional neural networks in the near threshold voltage regime, in 2016 IEEE International
Workshop on Signal Processing Systems (SiPS), Oct 2016, pp. 17-22.
[78] M. Lin, Q. Chen, and S. Yan, Network In Network, ArXiv e-prints, Dec. 2013.
[79] A. Krizhevsky and G. Hinton, Learning multiple layers of features from tiny images,
2009.
[80] J. Yoo, L. Yan, D. El-Damak, M. Altaf, A. Shoeb, and A. Chandrakasan, An 8-
Channel Scalable EEG Acquisition SoC With Patient-Speciﬁc Seizure Classiﬁcation
and Recording Processor, Solid-State Circuits, IEEE Journal of, vol. 48, no. 1, pp.
214-228, Jan 2013.
[81] V. Perlibakas, Distance measures for PCA-based face recognition, Pattern Recogn.
Lett., vol. 25, no. 6, pp. 711-724, Apr. 2004.
[82] H. He and J. Starzyk, A self-organizing learning array system for power quality classi-
ﬁcation based on wavelet transform, Power Delivery, IEEE Transactions on, vol. 21,
no. 1, pp. 286295, 2006.
[83] R. Teodorescu, J. Nakano, A. Tiwari, and J. Torrellas, Mitigating parameter variation
with dynamic ﬁne-grain body biasing, in Microarchitecture (MICRO), 40th Annual
IEEE/ACM International Symposium on, 2007.
[84] X. Liang, G.-Y. Wei, and D. Brooks, Revival: a variation-tolerant architecture using
voltage interpolation and variable latency, Micro, IEEE, vol. 29, 2009.
[85] G. Strang, Introduction to Linear Algebra, 3rd ed. Wesley-Cambridge Press, 2003.
[86] S. Zhang and N. Shanbhag, Probabilistic error models for machine learning kernels
implemented on stochastic nanoscale fabrics, in Design, Automation & Test in Europe
(DATE), 2016.
[87] D. P. Bertsekas and J. N. Tsitsiklis, Introduction to Probability, 2nd ed. Athena
Scientiﬁc, 2008.
[88] J. M. Rabaey, A. Chandrakasan, and B. Nikolic, Digital Integrated Circuits: a Design
Perspective. Upper Saddle River (N.J.): Prentice-Hall, Inc., 2003.
[89] K. Roy, M. Sharad, D. Fan, and K. Yogendra, Beyond charge-based computation:
Boolean and non-boolean computing with spin torque devices, in Low Power Electron-
ics and Design (ISLPED), 2013 IEEE International Symposium on. IEEE, 2013, pp.
139142.
