Statistical error compensation for robust digital signal processing and machine learning by Kim, Eric
© 2014 Eric Park Kim
STATISTICAL ERROR COMPENSATION FOR ROBUST DIGITAL SIGNAL PROCESSING
AND MACHINE LEARNING
BY
ERIC PARK KIM
DISSERTATION
Submitted in partial fulfillment of the requirements
for the degree of Doctor of Philosophy in Electrical and Computer Engineering
in the Graduate College of the
University of Illinois at Urbana-Champaign, 2014
Urbana, Illinois
Doctoral Committee:
Professor Naresh R. Shanbhag, Chair
Professor Rob A. Rutenbar
Professor Andrew C. Singer
Associate Professor Deming Chen
Abstract
Machine learning (ML) based inference has recently gained importance as a key kernel in
processing massive data in digital signal processing (DSP) systems. Due to the ever-increasing
complexity of DSP systems, energy-efficient ML accelerators are critical. Traditionally,
energy efficiency was obtained through technology scaling. However, modern nanoscale
complementary metal–oxide semiconductor (CMOS) process technologies suffer from reduced
reliability caused by process, temperature, and voltage variations. As ML applications are inherently
probabilistic and robust to errors, statistical error compensation (SEC) techniques can play
a significant role in achieving robust and energy-efficient implementations of these important
kernels. SEC embraces the statistical nature of errors and utilizes statistical and probabilistic
techniques to build robust systems. Energy efficiency is obtained by trading the enhanced
robustness for energy. This dissertation focuses on utilizing statistical approaches via SEC
in implementing energy-efficient DSP systems with an emphasis
on machine learning kernels.
We first demonstrate the potential of SEC techniques in a detection-based application.
A 180 nm CMOS pseudonoise (PN) code acquisition integrated circuit (IC) has been imple-
mented and measured. Measurements show that while maintaining a detection probability
Pdet ≥ 90%, an error rate pη ≥ 85.83% with energy savings of 2.52× could be achieved.
SEC is then applied to a communication centric machine learning kernel, a low-density
parity check (LDPC) decoder. As iterative message-passing based architectures are inher-
ently robust to small-magnitude errors, the SEC based LDPC decoder shows significant
improvement in robustness and energy efficiency. Three different size LDPC codes, (50, 25),
(800, 400), and (1800, 900), were implemented with five iterations per block. Circuit simu-
lations in a commercial 45 nm process show that the SEC based LDPC decoder can operate
at a supply voltage up to 38% less than the nominal voltage and tolerate up to 30× more
errors over an SNR range of 3 dB to 8 dB, while maintaining less than 3× degradation in bit
error rate (BER). This is equivalent to energy savings of 45.7% compared to conventional
LDPC decoders, and 33.2% compared to a sign bit protected LDPC decoder.
Motivated by the success of SEC based LDPC decoders, SEC has been applied to a
more complex message-passing application: Markov random field (MRF) based stereo image
matching. Analysis and simulations show that for a 20-bit architecture, small errors (η ≤
1024) are tolerable, while large errors (η ≥ 4096) degrade the performance significantly. By
applying algorithmic noise tolerance (ANT), experimental results show that the proposed
ANT based hardware can tolerate an error rate of 20%, with performance degradation of
only 3.5% at an overhead of 97.4%, compared to an error-free full precision hardware with
an energy savings of 39.6%. To reduce the compensation complexity, higher level error
compensation is explored as well.
Recent studies on approximate computing (AC) follow a principle similar to SEC, but with
one critical exception. AC based design still carries the requirement of creating a determin-
istic design, and thus the improvement in energy efficiency is marginal. We successfully
apply SEC to AC based designs and show that by embracing the statistical nature of the
underlying process, an additional 44.9% energy savings can be obtained.
Finally, SEC techniques are analyzed to provide insight into the trade-offs in the design
of SEC based systems. Algorithmic noise tolerance is analyzed under a unifying framework
based on detection and estimation theory. ANT is shown to approximate the Bayes optimal
detector and estimator.
To my family, friends, and the curious minds
Acknowledgments
My deepest thanks go to my wife, my children, whom I have not yet had the pleasure to
meet, my parents, and my adviser. My wife has always given me motivation and strength to
continue my pursuit of this doctoral degree. She has set herself a goal far more challenging
than a doctoral degree, and has shown me, by example, how to live every day with meaning.
My parents have always put me and my sister before themselves, and cared for us with the
greatest love any parent can give. Their constant support and advice have inspired me and
helped me in numerous situations, and their attitude towards life has set an example that I
have always tried to follow. My adviser, Professor Naresh Shanbhag, has been the greatest
teacher and mentor one could hope for. He has given me great challenges and showed me
what it means to accomplish them. His patience during this journey along with his faith
in my abilities is the one reason this dissertation was able to be completed. I am greatly
indebted to him for all the advice, knowledge, responsibilities, and opportunities that he
gave me in school and life. His passion in life and work, his knowledge and experience,
along with his high standards will be a source of inspiration and an example to follow during
my entire career. I would also like to thank Professors Rob Rutenbar, Andrew Singer, and
Deming Chen for the many insightful discussions and input and for agreeing to be on my
committee. Their comments and suggestions have greatly helped improve this dissertation.
Additionally, I would like to thank Jungwook Choi for his help in implementing the TRW-S
HW architecture and running various simulations for the results in Chapter 4, and my
research group colleagues for their feedback and input during group meetings. I gratefully
acknowledge past and present support from the Gigascale Systems Research Center (GSRC),
one of six research centers funded under the Focus Center Research Program (FCRP), a
Semiconductor Research Corporation (SRC) entity; and Systems on Nanoscale Information
fabriCs (SONIC), one of the six SRC STARnet Centers, sponsored by MARCO and DARPA.
Finally, I would like to thank my friends at the Korean Tennis Club (KTC), especially Mr.
Chulkee Chang, for making me feel at home, and taking me in as a family member during
the long and lonely holidays. Because of the KTC, I was able to relieve my stress, and start
fresh on my research.
TABLE OF CONTENTS
List of Tables
List of Figures
List of Abbreviations
Chapter 1 Introduction
1.1 Research Direction
1.2 Robust System Design
1.3 Statistical Error Compensation
1.4 Contributions of This Dissertation
1.5 Dissertation Organization
Chapter 2 The Design of an Error-Resilient PN Code Acquisition Prototype Chip
2.1 PN Code Acquisition Filter
2.2 SEC Based PN Code Acquisition Filter
2.3 Chip Architecture
2.4 Measurement Results
2.5 Summary
Chapter 3 Statistical Error Compensation for Low-Density Parity Check Codes
3.1 Low-Density Parity Check Codes
3.2 LDPC Decoder Architecture
3.3 Energy, Delay, and Error Modeling
3.4 Simulation Results
3.5 Summary
Chapter 4 Statistical Error Compensation for Stereo Image Matching
4.1 TRW-S Message-Passing Based Stereo Matching
4.2 Error-Resilient TRW-S via ANT
4.3 Error Compensation at Various Levels
4.4 System Evaluation Setup
4.5 Results
4.6 Summary
Chapter 5 Approximate Computing Based Statistical Error Compensation
5.1 Approximate Computing
5.2 TRW-S Accelerator Architecture Using AC and SEC
5.3 Simulation and Results
Chapter 6 Analysis of Statistical Error Compensation Techniques
6.1 Analysis of Algorithmic Noise Tolerance
6.2 Comparison of Simulation and Analysis
6.3 Summary
Chapter 7 Conclusion and Future Work
7.1 Dissertation Contributions
7.2 Future Work
References
List of Tables
1.1 Various robust system design techniques applied at different levels.
1.2 Characteristic of robust system design techniques.
2.1 Complexity of circuit blocks in SSNOC prototype IC.
2.2 Comparison of SSNOC prototype IC with other work.
3.1 HW error rate that can be tolerated by SEC based LDPC decoder at a
given BER threshold.
4.1 Depth map and BPR comparison for error-free, conventional, and ANT at
various Vdd.
4.2 Estimated compensation overhead and power consumption of arithmetic-
level ANT obtained via synthesis in a commercial 45 nm CMOS process.
4.3 Estimated compensation overhead and power consumption of iteration-
level ANT obtained via synthesis in a commercial 45 nm CMOS process.
5.1 Truth table for the mirror based approximate adder 1 through 4 [21].
5.2 Estimated cell count and power consumption obtained via synthesis in a
45 nm CMOS process with AC using approximate adder 1 [21].
5.3 Estimated cell count and power consumption obtained via synthesis in a
45 nm CMOS process with AC using approximate adder 2 [21].
5.4 Estimated cell count and power consumption obtained via synthesis in a
45 nm CMOS process with AC using approximate adder 3 [21].
5.5 Estimated cell count and power consumption obtained via synthesis in a
45 nm CMOS process with AC using approximate adder 4 [21].
List of Figures
1.1 Machine learning applications: (a) computer vision (stereo image matching
and background segmentation), and (b) machine listening (sound source
separation).
1.2 Ever increasing VLSI power [3]: (a) increase in leakage power due to
technology scaling, and (b) expected power per chip. The increase in
leakage power may create a large discrepancy from the ITRS requirements.
1.3 Energy challenge due to thermal limits [5]: (a) leakage power of nominal
dice over temperature, and (b) silicon junction temperature of different
representative cellular baseband processors under heaviest usage scenario.
1.4 Variation in deep scaled CMOS devices [6]: (a) threshold voltage (Vth) at
different process technology, and (b) delay of memory cells, represented by
color, of a 32 nm 64 Kbit SRAM cell obtained via simulations.
1.5 IDS vs. VDS curve for 50 different CNFETs [7].
1.6 Simulation results of voltage overscaling (VOS) for a 4-tap correlation filter
(a sensor) at 50 MHz in a 180 nm CMOS process.
1.7 Error model used within this dissertation. The actual output yi consists
of the ideal output yo, hardware errors η, and estimation errors e.
1.8 An NMR system. The processing element (PE) is replicated N times, and
a majority voter is used to combine the outputs.
1.9 An NMR system with a triplicated voter.
1.10 Statistical error compensation: (a) general form, (b) algorithmic noise
tolerance, and (c) error distributions.
1.11 Block diagram of a stochastic sensor network-on-a-chip (SSNOC). By
decomposing the main computation into smaller blocks or “sensors,” the
stochastic sensor network-on-chip creates an opportunity for efficient
system-level error-tolerance techniques.
1.12 Block diagram of (a) NMR, and (b) soft NMR.
2.1 Architecture of a linear feedback shift register (LFSR) used for PN code
generation.
2.2 PN code acquisition filter: (a) conventional, and (b) SEC based.
2.3 Measured sensor outputs over time. Outliers in the sensor output are due
to hardware errors ηi and can be seen to shift the mean significantly.
2.4 Prototype IC architecture.
2.5 Sensor architecture.
2.6 Distribution of outputs selected by hierarchical median algorithms,
selecting the lower of the two central values at each stage.
2.7 Simulated ROCs for hierarchical median: (a) simple two-level, (b) simple
three-level, (c) overlapping two-level and (d) overlapping three-level.
2.8 Bit slice of the median block.
2.9 Fusion block architecture.
2.10 Thresholding block architecture.
2.11 The 256-tap PN code acquisition filter chip in a 180 nm CMOS process:
(a) layout, and (b) microphotograph.
2.12 Error PMF of region R1: (a) Vdd = 0.88 V, and (b) Vdd = 0.85 V. All
errors are one bit errors with powers-of-two magnitude.
2.13 Error PMF of region R2: (a) Vdd = 0.78 V, and (b) Vdd = 0.76 V. Multi-bit
errors are observed, but most errors are still small in magnitude.
2.14 Error PMF of region R3: (a) Vdd = 0.68 V, and (b) Vdd = 0.66 V. A dense
Gaussian PMF and a sparse Gaussian PMF are mixed.
2.15 Error PMF of region R4: (a) Vdd = 0.63 V, and (b) Vdd = 0.60 V.
2.16 Detection probability Pdet and sensor probability of error pη vs. supply
voltage Vdd.
2.17 Energy consumption and sensor probability of error pη vs. supply voltage
Vdd.
2.18 HSPICE simulation of a sensor to measure α0→1: (a) number of glitches
vs. Vdd, and (b) average α0→1 vs. Vdd.
3.1 Supply voltage vs. energy and delay of an LDPC variable node (Fig.
3.4(a)) in a commercial 45 nm process. By voltage overscaling up to an
error rate of 70%, the same performance can be achieved with 70% less energy.
3.2 Example LDPC code: (a) parity check matrix, and (b) its bipartite graph.
3.3 High-level block diagram of the LDPC decoder.
3.4 Architecture of nodes: (a) variable node, and (b) check node.
3.5 Energy consumption and delay curves obtained through circuit simulation
of the variable and check node architectures in Fig. 3.4, synthesized in a
commercial 45 nm process.
3.6 Interconnect energy for a 200 µm wire: (a) distributed RC model, and (b)
average energy vs. supply voltage curve.
3.7 Methodology for simulating the LDPC decoder architecture under VOS.
Energy models can be obtained as well.
3.8 BER vs. SNR plot of a (800, 400) LDPC code decoded with five iterations
at pη = 0.2.
3.9 BER vs. pη at SNR = 5 dB.
3.10 Energy vs. BER plot of a (800, 400) LDPC code decoded with five
iterations at SNR = 5 dB.
4.1 Architecture of streaming TRW-S hardware (TRWS_HW): (a) block
diagram, (b) reparameterize unit, and (c) message update unit.
4.2 Effect of AS error on updated message when (a) ∆H > 0, and (b) ∆H < 0.
4.3 Effect of error after normalization: (a) large effective error when error
values differ significantly, and (b) small effective error when error values
are similar.
4.4 Verification of error analysis: (a) effect of error on message updated via
a box plot of ηAS vs. η̃AS, and (b) effect of error on energy minimization
performance of message passing via a plot of energy E vs. η̃.
4.5 Performance of ANT with injection of uniform errors of different
magnitude using 4-bit precision estimator.
4.6 Block diagrams: (a) error compensation at different levels, and (b)
detailed comparison of arithmetic-level and iteration-level compensation in
the reparameterize unit.
4.7 Simulation methodology: (a) flow diagram, and (b) error statistics for AS
at Vdd = 0.75 V with pη = 0.21.
4.8 Energy minimization performance against various precisions in
computation of TRWS_HW.
4.9 Block diagram depicting the truncation and saturation for optimal
fixed-point implementation of REPARM and MSG_UPD.
4.10 Architecture of streaming TRW-S stereo matching CPU+FPGA system.
4.11 Timing errors in FPGA: (a) block diagram for error verification and
statistics collection, (b) measured error statistics (20-bit) in the FPGA, and (c)
error statistics (8-bit) obtained via circuit simulations in a 45 nm CMOS
process [91].
4.12 Simulation results of error injection rate vs. energy minimization for
different ANT estimators with M[8, 8, 4].
4.13 Correction performance at (a) system level with SAD, (b) system level
with HBP (S = 8), and (c) hybrid with HBP (S = 8) and HWRPR(3-bit).
4.14 Bad-pixel ratio vs. Vdd. With no error compensation, TRW-S alone cannot
tolerate 10% voltage scaling.
4.15 Results for RPR-ANT applied at the arithmetic level: (a) FPGA
emulation, and (b) error injection based simulation.
5.1 Approximate mirror adders [21]: (a) conventional mirror adder, (b)
approximate mirror adder 1, (c) approximate mirror adder 2, (d)
approximate mirror adder 3, and (e) approximate mirror adder 4.
5.2 Flow chart of the simulation methodology.
5.3 Energy E vs. supply voltage Vdd for TRW-S implemented with AC using:
(a) approximate adder 1, (b) approximate adder 2, (c) approximate adder
3, and (d) approximate adder 4 [21]. It can be seen that AC can tolerate
at most 10% scaling in Vdd, whereas when combined with ANT, up to 34%
scaling in Vdd can be tolerated.
6.1 The Bayesian framework for ANT.
6.2 Example error PMFs: (a) depiction of Pη and Pe that increase or decrease
in distance from the mean, and (b) voltage overscaling (VOS) induced
timing errors.
6.3 Simulation results for a simple example: (a) probability of correct
detection and MSE vs. τ, (b) error PMF, and (c) τ*s,p and τ*a,p for different
PMFs.
6.4 Block diagram of the simulation setup for the DCT application.
6.5 DCT example: (a) error statistics of a voltage overscaled DCT block at
Vdd = 1.1 V (pη = 0.0043), and Vdd = 1.0 V (pη = 0.0374), and (b) the
resulting probability of correct detection and MSE vs. τ.
List of Abbreviations
AC approximate computing
ANT algorithmic noise tolerance
AS add subtract
BER bit error rate
BP belief propagation
CDMA code division multiple access
CMOS complementary metal-oxide semiconductor
CNFET carbon nanotube field effect transistor
CS compare select
DCT discrete cosine transform
DMR double modular redundancy
DNA deoxyribonucleic acid
DSP digital signal processing
EDS error detection sequential
ERSA error-resilient system architecture
FF flip-flop
FIR finite impulse response
FOM figure of merit
FPGA field-programmable gate array
GMM Gaussian mixture model
HBP hierarchical belief propagation
HMM hidden Markov model
HW hardware
IC integrated circuit
ITRS International Technology Roadmap for Semiconductors
LAN local area network
LDPC low-density parity check
LFSR linear feedback shift register
LLR log likelihood ratio
LOA lower-part-OR adder
LSB least significant bit
MAE minimum absolute error
MAP maximum a posteriori
ML machine learning
MMAE minimum mean absolute error
MMSE minimum mean square error
MP message passing
MRF Markov random field
MSB most significant bit
MSE mean square error
NBTI negative bias temperature instability
NMR N-modular redundancy
PCA principal component analysis
PDF probability density function
PGM probabilistic graphical model
PMF probability mass function
PN pseudorandom noise
PoFF point of first failure
PSNR peak signal-to-noise ratio
PVT process, voltage, and temperature
RCA ripple carry adder
RMS recognition, mining, and synthesis
RPR reduced precision redundancy
RTL register-transfer level
SAD sum of absolute difference
SBP sign bit protection
SEC statistical error compensation
SNR signal-to-noise ratio
SOC system-on-chip
SRAM static random access memory
SSNOC stochastic sensor network-on-a-chip
SW software
TMR triple modular redundancy
TRC tunable replica circuit
TRW-S sequential tree-reweighted message passing
VLSI very large scale integration
VOS voltage overscaling
WLAN wireless LAN
Chapter 1
Introduction
Emerging applications such as recognition, mining, and synthesis (RMS) [1] process
massive amounts of data in order to derive inferences and enable decision-making tasks.
Machine learning (ML), in a broad sense, is being successfully applied to these problems.
These applications share the characteristic that system-level performance is determined
by a metric that is statistical in nature. Pseudonoise (PN) code acquisition is one such
application, where performance is measured via a statistical metric known as the
probability of detection, Pdet. ML applications, which span broad areas including computer
vision, speech recognition, search engines, and deoxyribonucleic acid (DNA) sequencing, also
fall within this category. Figure 1.1 depicts examples of such applications, including stereo
image matching, background segmentation, and sound source separation.
Machine learning has found its way into commercial products such as Apple’s Siri,
Samsung’s S Voice, and Google Now, which are intelligent personal assistants and knowledge
navigators that provide answers to queries, make recommendations, and process many
other requests through a natural language user interface. In the future, such machine
learning applications will be deeply embedded in our daily lives. It is predicted that the number of
embedded processors per person may exceed 1,000 by 2015 [2]. Massive amounts of data will
be collected by the ubiquitous sensors surrounding us, and machine learning
will play a critical role in processing these data. Moreover, biomedical
applications will be an important application area, whereby one’s health can be monitored
24/7. To enable such rich computation in a small form factor, the availability of low-power
hardware (HW) machine learning kernel accelerators is critical.
Figure 1.1: Machine learning applications: (a) computer vision (stereo image matching and
background segmentation), and (b) machine listening (sound source separation).
[Figure 1.2, panel (a): supply voltage VDD and threshold voltage VTH [V] vs. year, and dynamic power PDYNAMIC and leakage power PLEAK [µW/gate] vs. year, 2002 to 2016. Ref: T. Sakurai, ISSCC’03.]
[Figure 1.2, panel (b): power per chip [W] vs. year, 1980 to 2015; labels: MPU, DSP, dynamic, leakage, ITRS requirement. © IEEE 2003.]
Figure 1.2: Ever increasing VLSI power [3]: (a) increase in leakage power due to
technology scaling, and (b) expected power per chip. The increase in leakage power may
create a large discrepancy from the ITRS requirements.
Traditionally, energy efficiency has been achieved by scaling of feature sizes in CMOS
process technologies. As the load capacitance decreases along with the supply voltage,
dynamic power has shown a steady decrease with technology scaling (Fig. 1.2(a)). However,
as increasingly complex and rich functionality is being integrated on-chip, the total power
requirements are increasing exponentially [3]. Furthermore, with deep scaling, the gap between
the supply and threshold voltages is decreasing, resulting in higher leakage power,
as shown in Fig. 1.2(a). This increase in leakage power is expected to be much larger
than the dynamic power savings obtained through scaling, creating a large discrepancy from
the International Technology Roadmap for Semiconductors (ITRS) [4] power target (Fig.
1.2(b)). This increase in power dissipation will inevitably lead to an increase in temperature.
As shown in Fig. 1.3(a), high temperatures also result in increased leakage and exacerbate
the energy challenge. The relatively constant thermal constraint across process technology
generations suggests that the thermal limit, rather than power, may become the key bottleneck
(Fig. 1.3(b)).
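As a rough first-order illustration of the trends discussed above, the standard textbook models for CMOS dynamic and subthreshold leakage power can be written as follows. These expressions are not taken from [3] or [5]; the activity factor $\alpha$, load capacitance $C_L$, clock frequency $f$, technology-dependent current $I_0$, subthreshold slope factor $n$, and thermal voltage $V_T$ are symbols assumed here for illustration only.

```latex
\begin{align}
  P_{\mathrm{dyn}}  &= \alpha \, C_L \, V_{dd}^{2} \, f, \\
  P_{\mathrm{leak}} &\approx V_{dd} \, I_0 \exp\!\left(-\frac{V_{th}}{n V_T}\right),
  \qquad V_T = \frac{kT}{q}.
\end{align}
```

Under these assumptions, reducing $C_L$ and $V_{dd}$ lowers $P_{\mathrm{dyn}}$ linearly and quadratically, respectively, whereas reducing $V_{th}$ to preserve speed at a lower $V_{dd}$ raises $P_{\mathrm{leak}}$ exponentially. Since $V_T$ grows with the temperature $T$, the exponent also becomes less negative at high temperature, which is consistent with the qualitative behavior attributed to Fig. 1.3(a).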
To make matters worse, deep scaling creates reliability problems that could previously be
ignored. With aggressive scaling, process, voltage,
and temperature (PVT) variations significantly affect the operation of modern nanoscale
complementary metal–oxide semiconductor (CMOS) devices. The extent of variation in
[Figure 1.3(a): leakage power of nominal dice (normalized to 1 at 25°C) over temperature, reproduced from [5].]
[Figure 1.3(b): silicon junction temperature of representative cellular baseband processors, reproduced from [5].]
obile devices h
ow the silicon 
orresponding to
igure 7 (JED
mperature of 
owever, the fun
obile device h
mperature rang
ermal limit o
mperature of th
ackage-on-Pack
ost 3 times. 
further integra
obile devices. 
 dissipation of 
processors und
(normalized to
Silicon junction
 cellular baseb
usage s
ng power diss
ave been rapid
die temperatur
 the same chips
EC thermal a
25°C; withou
damental therm
ave not change
e, low-power D
f 105°C [21], 
e underlying ba
age (PoP) sys
This trend is 
tion of even 
different repre
er heaviest usa
 1W for Chip1)
 temperature o
and processors 
cenario 
ipation, the tem
ly increasing. F
e has increase
 and power diss
nalysis assum
t any therma
al limits of com
d. For example
RAM (LPDDR
thereby limitin
seband process
tems. Similarly
only expected 
more advance
sentative cellula
ge scenario 
 
f different 
und r heaviest
peratures insid
igure 8 illustrat
d over the yea
ipations shown 
ing an ambie
l managemen
ponents inside
, in the elevate
) devices have
g the maximu
ors in the case 
, the maximu
to 
d 
 
r 
 
 
e 
es 
rs 
in 
nt 
t). 
 a 
d 
 a 
m 
of 
m 
baseban
to 125
thermal
remaine
to with
cooling
laptops
phones
the fact
thinner
has bee
One wa
through
techniq
use of 
materia
improv
even m
bottlene
mobile 
therma
imperat
Therma
perform
reduce 
of mob
the mob
perform
manage
much l
of for s
of addin
4.2 M
Consid
their ef
Assume
100°C 
device 
reaches
throttle
down t
shows 
tempera
time un
the dev
the de
constan
shows 
(dotted
Policy2
under P
therefo
headroo
higher 
keeping
Additio
applica
visibly 
betwee
lead to 
d processor die
°C to ensure c
 limits on m
d constant ove
stand temperat
 mechanisms su
 and PCs for the
 and tablets. The
 that mobile dev
, and therefore 
n decreasing. 
y to mitigate th
 better platfo
ues such as ther
materials such
l (TIM). How
ements and are
ore than power
ck to the contin
devices. Theref
l managemen
ive. 
l management 
ance or functio
their power dis
ile devices. Ho
ile device, ther
ance. Therefor
ment policies, 
ower modes of 
ignificant perio
g such high-pe
otivationa
er, for example
fect on power 
 that the max
and the ambien
is allowed to di
 the maximum 
d to dissipate on
o 90°C, at whic
the power dissip
ture (Secondary
der Policy1. Th
ice under Policy
vice reaches a
tly throttled to
the power diss
 blue line) of th
 dissipate the s
olicy2, the de
re has 5°C m
m can be tran
power than P
 the temperat
nally, the effec
tion behavior co
jerky applicati
n high and low p
smoother, albeit
 junction temp
orrect device o
obile device s
r the years sinc
ure does not ch
ch as fans, wh
rmal mitigation
 thermal proble
ice form-factor
their ability to d
e thermal chal
rm-level therm
mally aware co
 as heat spread
ever, such te
 relatively expe
, is fast becom
ued progress to
ore, the design
t algorithms a
policies typica
nality of vario
sipation, and co
wever, by cons
mal manageme
e, without care
mobile devices 
performance tha
ds of time, ther
rformance featu
l Example 
, two thermal 
and temperature
imum temperat
t temperature i
ssipate 3W of p
limit of 100°C. 
ly 1W of power
h point throttli
ation (Primary
 Y-axis, dotted
e long-term av
1 is 2.3W. In c
 temperature 
 dissipate only 
ipation (solid b
e device under
ame amount of 
vice temperatur
ore thermal 
slated to allow
olicy1 (2.5W, 
ure below the 
t of Policy1 and
uld be very diff
on behavior du
erformance mo
 lower, applicat
eratures are usu
peration. Addit
urface tempera
e the ability of 
ange. Furtherm
ich are commo
, are not feasib
m is further exa
s are becoming 
issipate the gen
lenges in mobil
al design. Th
mponent placem
ers and therm
chniques prov
nsive to use. T
ing the fundame
wards even mo
 and developme
nd policies h
lly rely on th
us components 
nsequently the 
training the cap
nt will lead to s
ful design of su
could end up o
n they are actu
eby negating the
res in the first p
management p
 as illustrated i
ure limit of th
s 25°C. Under P
ower until the 
At this point, th
 until the tempe
ng is disengage
 Y-axis, solid re
 red line) of the 
erage power di
ontrast, under Po
greater than 9
2.3W of powe
lue line) and 
 Policy2. Both 
average power 
e reaches only
headroom. Th
ing the device 
8.7% higher) 
thermal limit 
 Policy2 on use
erent. Policy1 c
e to the consta
des, whereas Po
ion performance
ally limited 
ionally, the 
tures have 
human skin 
ore, active 
nly used in 
le for use in 
cerbated by 
smaller and 
erated heat 
e devices is 
is includes 
ent and the 
al interface 
ide limited 
emperature, 
ntal design 
re powerful 
nt of better 
as become 
rottling the 
in order to 
temperature 
abilities of 
ome loss of 
ch thermal 
perating in 
ally capable 
 usefulness 
lace. 
olicies and 
n Figure 9. 
e device is 
olicy1, the 
temperature 
e device is 
rature cools 
d. Figure 9 
d line) and 
device over 
ssipation of 
licy2, once 
0°C, it is 
r. Figure 9 
temperature 
Policy1 and 
(2.3W), but 
 95°C, and 
is thermal 
to dissipate 
while still 
of 100°C. 
r-perceived 
ould lead to 
nt toggling 
licy2 could 
. 
367
(b)
Figure 1.3: Energy challenge due to thermal limits [5]: (a) leakage power of nominal dice
over temperature, and (b) silicon junction temperature of different representative cellular
baseband processors under heaviest usage scenario.
CMOS devices [6] increases with scaling. Figure 1.4(a) shows that the variance of device
threshold voltages increases as feature sizes shrink from 28 nm through 20 nm to 14 nm.
Note that an increasing fraction of transistors are fabricated with threshold voltages at or
below zero, which means they no longer operate as switches, resulting in faulty operation
that cannot be mitigated even through conservative worst-case design methodologies. Figure
1.4(b) shows the delay (represented by colors) of memory cells in a 32 nm 64 kilobit static
random access memory (SRAM) array. The color differences represent a 70% difference in
delay. Unfortunately, this trend of increased variation and defects resulting in low reliability
is true for post-silicon devices too. Figure 1.5 shows the IDS vs. VDS curve for 50 different
carbon nanotube field effect transistors (CNFETs) [7]. As the figure shows, the saturation
current varies by more than 4×, ranging from 22 µA to 98 µA.
This large delay variation in deeply scaled CMOS and post-silicon devices can severely
impact the reliability of modern integrated circuits and information processing systems.
However, conventional worst-case design techniques incur a heavy energy penalty. On the
other hand, nominal-case designs, though energy efficient, fail to meet the robustness specifi-
cations of the applications. In fact, in its 2007 report [4], ITRS forecasts that “Relaxing the
Figure 1.4: Variation in deeply scaled CMOS devices [6]: (a) threshold voltage (Vth) at
different process technology nodes, and (b) delay of memory cells, represented by color, of a
32 nm 64 Kbit SRAM array obtained via simulations.
Figure 1.5: IDS vs. VDS curve for 50 different CNFETs [7].
requirement of 100% correctness for devices and interconnects may dramatically reduce costs
of manufacturing, verification, and test. Such a paradigm shift is likely forced in any case
by technology scaling, which leads to more transient and permanent failures of signals, logic
values, devices, and interconnects.” In short, enforcing an ideal deterministic abstraction on
the underlying CMOS circuit will incur significant overhead, and error-aware system design
is essential in order to continue to reap the benefits of technology scaling.
1.1 Research Direction
In this dissertation, we describe a class of solutions that can simultaneously achieve robustness
and energy efficiency in the implementation of deeply scaled ML kernels. The robust
behavior of communication systems based on Shannon’s information theory [8] inspires the
use of similar techniques for robust computation. Statistical system-level performance metrics
are essential in the use of communication-inspired techniques, and provide a means to
trade off reliability with energy efficiency. Figure 1.6 shows HSPICE simulation results of
a four-tap filter in 180 nm CMOS subject to voltage overscaling (VOS). If the errors are
fully compensated for without additional overhead, energy reductions of up to 9× can be
achieved by operating at a significantly lower supply voltage over a system operating at
the point of first failure (PoFF). As we show throughout this dissertation, statistical error
compensation (SEC) is a communication-inspired/Shannon-inspired technique that provides
this low overhead error compensation. SEC is a novel technique that can be applied to DSP,
communications, and ML systems to achieve simultaneous robustness and energy efficiency.
In the remainder of this chapter, we present the background information related to this
work. We begin by discussing conventional techniques used for robust system design. SEC
is then presented with a summary of the techniques that have been developed.
Figure 1.6: Simulation results of voltage overscaling (VOS) for a 4-tap correlation filter (a
sensor) at 50 MHz in a 180 nm CMOS process.
1.2 Robust System Design
Robust system design has a long history. As early as 1821, Charles Babbage and John
Herschel discussed methods for achieving infallible machines [9]. They viewed failure as a
matter related to purpose, not to whether an item is physically intact or not. This view
highlights the importance of measuring the reliability of a system based on its performance
at the application level. The SEC techniques discussed in this dissertation leverage this
viewpoint.
Robustness can be discussed at various levels of design abstraction: physical, logical,
algorithmic, and system/application. The following summarizes the characteristics of the
impairments and the robust design techniques available at each level:
• physical level: this level deals with the physical aspects of the circuit and device such
as geometry, layout, and material properties. Impairments at this level are known as
defects. A short/open circuit in the manufacturing process is one example of a defect.
Negative bias temperature instability (NBTI) [10] or an open circuit caused by elec-
tromigration [11] are examples of defects that develop over time due to aging. Defects
can remain dormant, until a particular input excites the defect, which then manifests
itself as a fault at the logic level. Defects are usually circumvented through the use
of worst-case design (over designing), or detected and corrected (eliminated) through
testing procedures such as burn-in [12] and periodic maintenance where sensitive parts
are replaced (preventive measure against aging).
• logic level: Impairments at this level are known as faults. A short/open circuit defect
can be modeled as a stuck-at fault, where the output of a gate is always a 0 or 1. Even
though the gate outputs may be faulty, this does not affect the system operation unless
the incorrect values are latched as errors at the algorithmic level. Fault avoidance is
usually achieved through defect avoidance, and fault tolerant techniques [13,14] require
the use of some form of redundancy to detect and circumvent faults before they develop
into errors.
• algorithmic level: Once incorrect values are latched, they become errors, i.e., in-
correct data or internal states. Errors in digital systems manifest as bit flips (0 → 1
or 1 → 0) in the storage element (latch). Though faults are one source of errors, in
modern information systems, complexity makes verification of the design a very diffi-
cult task, and design failures, such as bugs in hardware and software, become another
source of errors. Thus, in addition to fault avoidance, rigorous verification and testing
are used to identify bugs and design errors. As with faults, redundancy is employed to
tolerate errors, but at the information level. Error correcting codes have been successfully
used on memory to combat errors, while techniques such as checkpointing [15] and
Razor [16, 17] have been successfully applied to achieve error resilient computation.
SEC also performs error compensation at this level.
• system/application level: this level is where the actual functionality (purpose) of
the system is visible. Impairments at this level are system failures and can
result in catastrophic events. Software-oriented failure tolerance techniques include
atomic broadcasting, agreement and commit protocols, synchronization, global
state determination, and various other distributed algorithms [18]. Many applications
that employ statistical metrics possess inherent robustness. In particular, emerging
recognition, mining, and synthesis (RMS) applications and machine-learning-based
applications have large inherent application-level robustness, which can and should be
exploited by robustness techniques designed at lower levels. SEC exploits this
application-level information at the algorithmic level to achieve a significant reduction
in compensation complexity.
Robust system design enables a system to continuously operate even in the presence of
defects, faults, or errors. Robust design dates back to John von Neumann’s work [19],
where logic networks composed of noisy or probabilistic gates were considered and signal
replication with majority voting was proposed to increase resiliency. This is known as N-modular
redundancy (NMR). Due to its large overhead, in practice, such majority voting
techniques are employed today at the architecture level in mission critical applications and
servers. Thus, fault tolerance is usually obtained by adding some degree of redundancy.
The most widely used redundancy techniques require additional silicon area (by replicating
blocks) and/or time (by performing re-computation). A summary of current robust design
techniques is provided in Table 1.1.
Robust design techniques can be classified depending on whether the defects, faults, or
errors are being avoided or corrected. System aware design adds another dimension where
certain errors can be allowed to propagate to the system level, if the application has sufficient
tolerance for errors. Inference based applications including ML applications, communication
systems, and many DSP systems possess inherent application- or system-level error resiliency.
This is due to the following:
1. The performance metrics for these applications are statistical. Inference applications
including ML are measured by the probability of correct operation, such as probability
of detection, or probability of correct classification. Communication systems are evaluated
by bit error rate (BER), while many filters are measured by SNR-based metrics. Such
metrics show that the application is designed with the possibility of failure in mind.
Thus, even in the presence of HW errors, the application may provide acceptable
performance.
2. The modeling of the application and the input data is imperfect. In speech recognition,
Table 1.1: Various robust system design techniques applied at different levels.
[Table residue. Levels: circuit/process; latch/gate; micro-architectural; algorithm; system. Entries, in extracted order: worst-case design; N-modular redundancy [20]; checkpointing [15]; approximate computing [21, 22]; coding [23]; (reduced precision computing); burn-in testing [12]; Razor [16, 17]; best effort computing; SEC [24]; SEC [24].]
Gaussian mixture models (GMM) and hidden Markov models (HMM) for extracting
phonemes have been used with great success [25]. However, this is not a perfect model of
a real speech system, and many approximations are made to reduce the complexity of
the algorithm. In weather prediction, supercomputers are used to process and analyze
the data based on a sophisticated model. Due to practical reasons, many factors that
influence the weather are not present in the modeling, and the input data to the model,
i.e., temperature, atmospheric pressure, wind speed, etc., cannot be measured at every
location. Thus, such applications are designed to be robust to modeling and data
errors. In addition, if computation errors are introduced, but controlled to be within
some bounds, severe degradation in system performance can be prevented.
3. In many inference applications, it is difficult to define a golden output. Recommen-
dation systems such as those used in Netflix and Amazon are good examples. Based
on a user’s history and preference, recommendations are made. Sometimes several
recommendations are provided to the user in a ranked order. However, it is hard to
know whether the recommended results are actually the true correct values. The user
may not be sure of his or her preference, and it may change depending on the situ-
ation. Web search and data analytics are other examples. Thus, some errors in the
computation process that alter the results may not necessarily be regarded as a system
error.
4. Human interaction is another important factor. The limited perceptual capability of
humans and users’ willingness to accept less-than-perfect results both give rise to the
acceptance of erroneous results.
To achieve optimal robustness in a design, a combination of such techniques should be
employed. Table 1.2 summarizes the characteristics of each robustness technique. SEC is the
only technique that tries to actively compensate errors in a non-deterministic fashion, while
utilizing system-level information.
A uniform error model is used to facilitate the analysis and comparison between robust
design techniques. Two different types of errors exist: hardware errors and estimation errors.
Hardware errors, denoted as η, are errors that occur within the computational block. Timing
Table 1.2: Characteristics of robust system design techniques.

technique                avoidance   compensation/correction   system/application aware
worst-case design        √
burn-in testing          √
N-modular redundancy                 √ (deterministic)
checkpointing                        √ (deterministic)
coding                               √ (deterministic)
Razor                                √ (deterministic)
approximate computing                                          √
best effort computing                                          √
SEC                                  √ (stochastic)            √

Figure 1.7: Error model used within this dissertation. The actual output yi consists of the
ideal output yo, hardware errors η, and estimation errors e.
errors due to PVT variation, soft errors due to alpha particle hits, and errors induced by
circuit defects are all different forms of hardware errors. These errors are dynamic errors and
give rise to stochastic behavior. Estimation errors, denoted as e, are errors that are induced
by design. Algorithmic simplifications such as subsampling, approximation, using fewer
iterations, or reducing hardware complexity via reduced precision all give rise to estimation
errors at the output. The additive error model (see Fig. 1.7) captures both types of errors,
and is given by:
y_i = y_o + η_i + e_i    (1.1)
where y_o is the ideal error-free output, η_i is the hardware error, and e_i is the estimation
error of the ith output y_i.
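The additive model in (1.1) can be illustrated with a short simulation. The following is only a sketch: the function name `observe`, the parameter names, and the specific error distributions (a rare hardware error of fixed magnitude and a small, uniformly distributed estimation error) are assumptions made for illustration and are not taken from the dissertation.

```python
import random

def observe(y_o, p_eta, eta_mag, est_err, rng):
    """Return one observation y_i = y_o + eta_i + e_i, as in (1.1)."""
    # Hardware error (assumed): zero on most cycles; on a fraction p_eta of
    # cycles it takes a fixed magnitude with a random sign.
    eta_i = rng.choice([-eta_mag, eta_mag]) if rng.random() < p_eta else 0
    # Estimation error (assumed): small and always present by design,
    # e.g., from reduced precision.
    e_i = rng.uniform(-est_err, est_err)
    return y_o + eta_i + e_i

rng = random.Random(0)
y_o = 100.0
samples = [observe(y_o, p_eta=0.1, eta_mag=64, est_err=0.5, rng=rng)
           for _ in range(1000)]
# Observations hit by a hardware error stand out from the small estimation error.
hw_hits = sum(1 for y in samples if abs(y - y_o) > 1.0)
print(f"{hw_hits} of {len(samples)} observations contain a hardware error")
```

Setting `p_eta=0` leaves only the estimation error, and setting `est_err=0` leaves only the hardware error, which corresponds to the special case used for NMR and checkpointing.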
Next, we describe some of the well-known robust system design techniques.
Figure 1.8: An NMR system. The processing element (PE) is replicated N times, and a
majority voter is used to combine the outputs.
1.2.1 N-modular redundancy (NMR)
The basic concept of NMR (see Fig. 1.8) is to replicate the processing element (PE) N
times, then vote (majority, plurality) to determine the final output of the system. NMR has
an area overhead, as additional silicon area is required to replicate the PEs. In NMR, the
output of each block is assumed to be corrupted by hardware errors only, i.e., the following
error model is employed:
y_i = y_o + η_i    (1.2)
which is obtained from (1.1) by setting e_i = 0.
Triple modular redundancy (TMR) [20] is the most widely used form of NMR because the overhead of replication for N > 3
is very large. In TMR, if a fault occurs in one PE, the majority voter can generate a correct
output and keep the system in operation. In NMR (see Fig. 1.8), the voter becomes a
single-point-of-failure, i.e., voter failure leads to system failure, and should be designed to
operate error free. One possible solution is to triplicate the voter and have three independent
outputs. If these outputs are inputs to another TMR configuration, then each output could
be connected to each replicated block as shown in Fig. 1.9. An example of such a system
is the Tandem Integrity S2, which was designed to meet the need for a reliable UNIX-based
computing system [26].
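The single-point-of-failure role of the voter can be made concrete with the standard reliability expression for TMR. This is a sketch under assumptions that the text does not state: the three PEs fail independently, each produces a correct output with probability R, the voter is correct with probability R_v, and the system output is correct whenever the voter works and at least two PEs are correct. The symbols R and R_v are introduced here for illustration only.

```latex
R_{\mathrm{TMR}} = R_v \left[ R^3 + 3R^2(1-R) \right] = R_v \left( 3R^2 - 2R^3 \right)
```

Because R_v multiplies the whole expression, the system can be no more reliable than its voter. With a perfect voter (R_v = 1), the expression exceeds R only when R > 1/2, so under these assumptions replication helps only if each PE is correct more often than not.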
When all three outputs disagree with one another, a TMR system can detect the error but
cannot correct it. Further, an error goes undetected if a majority of the outputs
exhibit the same error.
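The three TMR outcomes described above (masked fault, detected but uncorrectable error, and undetected error) can be sketched with a small majority voter. The function name `majority_vote` and its return convention are hypothetical, and the voter itself is assumed to be error free.

```python
from collections import Counter

def majority_vote(outputs):
    """Return (value, status) for a list of replicated PE outputs.

    status is "ok" when a strict majority of outputs agree, and "detected"
    when no value has a strict majority (error detected, not correctable).
    """
    value, count = Counter(outputs).most_common(1)[0]
    if count > len(outputs) // 2:
        return value, "ok"
    return None, "detected"

# Assume the ideal output is 7 in each case below.
print(majority_vote([7, 9, 7]))  # one faulty PE: the fault is masked
print(majority_vote([7, 9, 5]))  # all three disagree: detected only
print(majority_vote([9, 9, 7]))  # two PEs share the same error: undetected
```

In the last case the voter reports a majority, so the wrong value 9 is passed on without any indication of a problem.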
Figure 1.9: An NMR system with a triplicated voter.
1.2.2 Checkpointing
A checkpoint is a snapshot of the entire state of the system at a specific time. During
computation, when a fault has been detected, the computation falls back to the most recent
checkpoint and resumes execution [15]. Thus, checkpointing is a fault-tolerant technique
that exploits time redundancy. These checkpoints need to be stored in a stable and reliable
memory, which is now the single-point-of-failure. The error model for checkpointing is the
same as in NMR (1.2). Error detection mechanisms must be incorporated in the design
to minimize the re-computation in case of a fault. Checkpointing generally provides more
reliable operation than does an NMR system, as in checkpointing, errors are corrected by
re-computation, while in NMR, forward error correction is performed via a majority vote.
However, as there are certain operations that cannot be undone, such as printing a document
or launching a missile, the system must be designed to deliver these outputs only when it is
certain there is no need to undo them.
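The rollback mechanism can be sketched in a few lines of Python. This is a hypothetical software model: the function `run_with_checkpoints`, the `interval` parameter, and the fault-detector callback are assumptions introduced for illustration, and the checkpoint storage is assumed to be perfectly reliable.

```python
import copy

def run_with_checkpoints(steps, state, fault_detected, interval=2):
    """Apply `steps` to `state`, rolling back to the last checkpoint on a fault."""
    checkpoint, checkpoint_idx = copy.deepcopy(state), 0
    i, rollbacks = 0, 0
    while i < len(steps):
        state = steps[i](state)
        if fault_detected(i, state):
            # Fall back to the most recent snapshot and re-execute from there.
            state, i = copy.deepcopy(checkpoint), checkpoint_idx
            rollbacks += 1
            continue
        i += 1
        if i % interval == 0:
            checkpoint, checkpoint_idx = copy.deepcopy(state), i
    return state, rollbacks

# A transient fault is reported once, at step 3 (hypothetical fault pattern).
pending = {3}
def detector(i, state):
    if i in pending:
        pending.discard(i)
        return True
    return False

final, rollbacks = run_with_checkpoints([lambda s: s + 1] * 6, 0, detector)
print(final, rollbacks)
```

A shorter checkpoint interval reduces the amount of re-computation after a fault at the cost of more frequent snapshots.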
1.2.3 Error resilient processors
Error resilient processors [27–30] employ logic-level techniques to detect errors and architec-
tural-level techniques to correct them. For example, in Razor [16, 17], a shadow latch is
employed that latches at a slightly delayed time from the main latch. If the critical path was
violated and the main latch was in error, the shadow latch will latch on to the correct value,
and an error can be detected by comparing the two. This type of error resiliency to timing
errors has been shown to be an effective approach for combating variations while achieving
energy efficiency. Voltage overscaling (VOS) [31] was employed in [27–30] to induce timing
errors by reducing the supply voltage (Vdd) below the critical voltage (Vdd,crit), which is the
point of first failure (PoFF). The error rate, pη (percentage of clock cycles in which the output
is in error), increases as Vdd is reduced. Razor I [27] employs VOS along with in situ local (FF
level) timing error detection and local correction in order to reduce energy while combating
variations. Razor I demonstrated that at an error rate of $10^{-7}$, which is near the PoFF,
the error-correction overhead is minimal, and energy-efficiency gains of 14% to 17% can be
obtained when compared to an error-free architecture operating just above the PoFF. Razor
II [28, 29] employs local error detection and architectural replay to operate at an error rate
of $p_\eta = 4 \times 10^{-4}$, which is also near PoFF, while achieving an energy savings of 33% to 35%.
An error-resilient microprocessor core in a 45 nm process [30] employs an error-detection
sequential (EDS) and a tunable replica circuit (TRC) to achieve 41% throughput gain or
22% energy reduction, with a 10% Vdd droop. However, these techniques strive to deliver
exact correctness (deterministic output), which may not be needed by the application, and
are severely limited in operable error rates ($p_\eta < 10^{-3}$) and thus achievable energy efficiency.
The error detection in these processors is at the logic level. Hence the error model in (1.1)
does not apply directly.
Error resilient system architecture (ERSA) [32] has been proposed as an alternative to
designing robust processors. ERSA relaxes the requirements of exact correctness. This
architecture has one reliable processor core, the super-reliable core (SRC), along with several
unreliable cores, known as relaxed reliability cores (RRCs). ERSA is shown to achieve
resilience to higher-order bit errors and maintain sufficient accuracy even at very high error
rates with minimum impact on execution time.
1.2.4 Approximate computing
Emerging new applications, such as recognition, mining, and synthesis [1], process large
amounts of noisy data via statistical and probabilistic computation and are inherently error
resilient. Conventional digital signal processing (DSP) applications including image, audio,
and communications also possess such inherent reliability. This motivates the use of approximate
computing, in which the requirement of numerical exactness on the outputs is relaxed.
Several factors such as limited perceptual capability of humans, difficulty to define a golden
output (e.g., web search, data analytics), and the user willingness to accept less-than-perfect
results all contribute to justification of approximate computing. The goal of approximate
computing is to simplify the implementation of computation to gain energy efficiency or to
obtain higher performance.
The distinctive feature of approximate computing (AC) is that it does not involve assumptions about the stochastic
nature of any underlying process implementing the system. It does, however, often utilize
statistical properties of data and algorithms to trade quality for energy reduction. AC, hence,
employs deterministic designs that produce imprecise results [22], i.e., AC incurs estimation
errors in the output and employs the following error model:
y = yo + e (1.3)
which is obtained from (1.1) by substituting η = 0. SEC, detailed in the following section,
embraces the stochastic nature of the underlying implementation to provide compensation
such that the output may be in error but guarantee that the application-level specifications
are met.
Approximate computing has been applied at the circuit and algorithmic level. At the
circuit level, several arithmetic circuits have been proposed. Adders are partitioned into two
parts [21, 33], where the more-significant bits are computed in an exact manner, while the
less-significant bits are computed in an inexact manner via a simplified circuit. Approxi-
mate mirror adders [21,33], approximate XOR/XNOR-based adders [34], and lower-part-OR
adders [35] are some examples. Approximate multipliers are constructed via the use of spec-
ulative adders for obtaining partial products [35–38]. Approximate logic-level synthesis has
been explored as well [39–42]. At the algorithmic level, incremental refinement, a charac-
teristic of iterative algorithms, has been exploited to achieve results that gradually increase
in quality [43–47]. To increase energy efficiency, dynamic bit width adaptation has been
proposed [48]. A more detailed survey of AC is presented in Chapter 5.
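As an illustration of the partitioned-adder idea, the following Python sketch adds the upper bits exactly and replaces the lower-part addition with a bitwise OR. The function name `approx_add`, the default split of 4 bits, and the carry rule (AND of the two lower-part MSBs) are assumptions chosen for illustration; they are not taken from the circuits in [21, 33, 35].

```python
def approx_add(a, b, low=4):
    """Approximate addition: OR the `low` LSBs, add the remaining bits exactly."""
    mask = (1 << low) - 1
    lower = (a | b) & mask                               # inexact lower part
    carry = ((a >> (low - 1)) & (b >> (low - 1))) & 1    # assumed carry estimate
    upper = (a >> low) + (b >> low) + carry              # exact upper part
    return (upper << low) | lower

for a, b in [(16, 32), (3, 4), (5, 5), (200, 100)]:
    approx = approx_add(a, b)
    print(f"{a} + {b}: exact = {a + b}, approximate = {approx}, error = {approx - (a + b)}")
```

The output is deterministic but imprecise, which matches the error model in (1.3): the deviation from the exact sum plays the role of the estimation error e, and its magnitude is bounded by the width of the inexact lower part.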
1.3 Statistical Error Compensation
Conventional robust design techniques are usually high in overhead (when applied at cir-
cuit/logic levels), or are unable to provide sufficient protection (when applied at the system
level). In this dissertation we propose the use of statistical error compensation (SEC). Sta-
tistical error compensation techniques, such as algorithmic noise tolerance (ANT) [31], em-
ploy statistical estimation and detection techniques to compensate for errors approximately.
These methods are applied at the algorithm level and compensate for erroneous behavior,
while still utilizing the system-level information. In short, catastrophic algorithm-level errors
that lead to system failure are converted to benign errors that the system can absorb. Thus,
SEC techniques use a combination of system aware design (utilizing system-level inherent
robustness for failure avoidance) and error compensation (approximate error compensation
at the algorithm level) as shown in Table 1.2.
SEC techniques are best suited for applications where the performance metrics themselves
are statistical. When SEC techniques are applied to these applications, significant power
savings, e.g., up to 67% for a finite impulse response (FIR) filter [31], and robustness enhance-
ment can be achieved, compared to logic- or architectural-level techniques such as Razor. A
high-level depiction of SEC is given in Fig. 1.10(a). SEC utilizes the statistics of errors to
perform detection and estimation to compensate for errors. It also incorporates system-level
statistical metrics, such as signal-to-noise ratio (SNR), or bit error rate (BER). SEC oper-
ates on multiple observations, where each observation is generated by erroneous hardware,
an error-free estimator, or an erroneous estimator. Each observation yi is a corrupted version
of the correct output yo, i.e., yi = yo + ηi + ei. Based on these observations, detection and
estimation techniques are employed in conjunction with the statistical information of ηi and
ei to obtain the output most likely to be correct. Errors that have a large effect on the
system-level performance are detected and compensated, while errors with minimal effect
on performance are considered benign and permitted.
Figure 1.10: Statistical error compensation: (a) general form, (b) algorithmic noise
tolerance, and (c) error distributions.
1.3.1 Algorithmic noise tolerance
Statistical error compensation (SEC) in the form of algorithmic noise tolerance (ANT) [31,49]
in Fig. 1.10(b) incorporates a main block and an estimator. The main block, unlike the
estimator, is permitted to make hardware or timing errors. The estimator is a low-complexity
block (typically 5% to 20% of the main block complexity) generating a statistical estimate
of the correct main block output, i.e.,
ya = yo + η (1.4)
ye = yo + e (1.5)
where ya is the actual main block output, yo is the error-free main block output, η is the
hardware error, ye is the estimator output, and e is the estimation error. Equations (1.4)
and (1.5) can be obtained by substituting ei = 0, and ηi = 0 in (1.1), respectively. Note that
the estimator exhibits estimation error e because it is simpler than the main block. ANT
exploits the difference in the statistics of η and e (see Fig. 1.10(c)). To enhance robustness,
it is necessary that when η ≠ 0, η be large compared to e. In addition, the probability of the
event η ≠ 0 must be small. The final or corrected output of an ANT system ŷ is obtained
via the following decision rule:
$$\hat{y} = \begin{cases} y_a, & \text{if } |y_a - y_e| < \tau \\ y_e, & \text{otherwise} \end{cases} \tag{1.6}$$
where τ is an application-dependent parameter chosen to maximize the performance of ANT.
Under the conditions outlined above, it is possible to show that
$$\mathrm{SNR}_{uc} \ll \mathrm{SNR}_{e} \ll \mathrm{SNR}_{ANT} \approx \mathrm{SNR}_{o} \tag{1.7}$$
where SNRuc, SNRe, SNRANT , and SNRo are the signal-to-noise ratios of the uncorrected
main block (η dominates), the estimator (e dominates), the ANT system, and the error-free
main block (ideal), respectively. Thus, ANT detects and corrects errors approximately, but
does so in a manner that satisfies an application-level performance specification (SNR). Sev-
eral low-overhead estimation techniques have been proposed by exploiting data correlation,
system architecture, and statistical signal processing [24].
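The decision rule in (1.6) and the SNR ordering in (1.7) can be explored with a small NumPy simulation. All numerical values below (signal and error variances, the 5% error rate, the error magnitude of 512, and τ = 64) are assumptions picked so that η is rare and large while e is small; they are not measurements from this dissertation.

```python
import numpy as np

def ant_correct(y_a, y_e, tau):
    """Decision rule (1.6): keep y_a unless it differs from y_e by tau or more."""
    y_a, y_e = np.asarray(y_a, float), np.asarray(y_e, float)
    return np.where(np.abs(y_a - y_e) < tau, y_a, y_e)

def snr_db(y_ref, y):
    """SNR of y with respect to the error-free reference y_ref, in dB."""
    return 10 * np.log10(np.sum(y_ref**2) / np.sum((y_ref - y)**2))

rng = np.random.default_rng(0)
n = 10_000
y_o = rng.normal(0, 100, n)                          # error-free main block output
hit = rng.random(n) < 0.05                           # rare error events
eta = np.where(hit, rng.choice([-1, 1], n) * 512.0, 0.0)  # large hardware errors
e = rng.normal(0, 4, n)                              # small estimation errors

y_a, y_e = y_o + eta, y_o + e                        # (1.4) and (1.5)
y_hat = ant_correct(y_a, y_e, tau=64)

for name, y in [("uncorrected", y_a), ("estimator", y_e), ("ANT", y_hat)]:
    print(f"{name:12s} SNR = {snr_db(y_o, y):6.1f} dB")
```

In this idealized setting the error-free main block has no residual noise, so SNRo is unbounded; the simulation only illustrates the ordering of the first three terms in (1.7).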
Conventional fault-tolerant techniques focus on providing complete correctness while in op-
eration. However, communication-inspired algorithmic noise tolerance (ANT) techniques [50]
utilize the fact that some applications are tolerant to small errors and show significant im-
provement in robustness while providing energy efficiency. This is done not by concentrating
on error-free output, but rather by trying to meet the signal-to-noise ratio (SNR) or bit error
rate (BER) specification of the application. The communication-inspired approach treats
nanometer circuit fabrics as a noisy channel, and faults and errors resulting from this chan-
nel are addressed through statistical signal processing techniques that were primarily used
in communication systems for several decades.
Figure 1.11: Block diagram of a stochastic sensor network-on-a-chip (SSNOC). By
decomposing the main computation into smaller blocks or “sensors,” the stochastic sensor
network-on-chip creates an opportunity for efficient system-level error-tolerance techniques.
1.3.2 Stochastic sensor network-on-a-chip (SSNOC)
Stochastic sensor network-on-chip (SSNOC) [51,52] extends ANT and provides a novel alter-
native that employs continuous error compensation as opposed to sequential error detection
and error correction as in [27–30]. The main computation is divided into a number of smaller
sub-computations. These sub-computations are said to be performed by “sensors” because
of the resemblance of the overall system to a sensor network. Figure 1.11 illustrates an
SSNOC system with four sensors. Traditional sensor networks consist of a large number
of measurement nodes that communicate with each other, usually over wireless links, and
estimate underlying physical phenomena (e.g., temperature or pollution). Their distributed
nature makes sensor networks robust to local variations, node failures, and communication
noise. The computational errors made by the sensors of the SSNOC may be viewed as anal-
ogous to measurement noise in sensor networks. This view of computation enables us to
borrow concepts from estimation theory to build robust IC systems.
In an SSNOC, the outputs of the sensors are combined using a fusion block that employs
techniques from robust statistics [53] to provide system-level robustness to hardware errors in
the sensors. The output of sensors can be corrupted by both hardware errors and estimation
errors, i.e., (1.1) applies as is
yi = yo + ηi + ei (1.8)
which is in contrast to ANT where the outputs were either corrupted by hardware errors
(1.4) or estimation errors (1.5) only. The fusion center needs to be designed to accommodate
hardware errors in the sensors that may be difficult to characterize but are potentially severe.
Computational errors can be viewed as probabilistic sources of noise that contaminate the
measurement noise in the input, commonly modeled as Gaussian. The aggregate noise in
the system (comprising computational errors and input noise) may be modeled as drawn
from the following set of mixture distributions:
P = {F |F = (1− )Φ + H} (1.9)
where Φ is the class of normal distributions, H is the class of arbitrary densities with zero
mean and finite but unbounded variance, and 0 <  < 1 is a mixing parameter. The robust
estimation problem is to find the maximum-likelihood estimate for the least informative
distribution in the above mixture class. This estimate will be robust to the worst-case
probability distribution of the errors caused due to variations. The theory of robust statistics
[53] states that the robust estimate will be the solution to the following equation:
m∑
k=1
ψ[Yk − θ] = 0 (1.10)
where ψ is a general odd-symmetric function known as the influence function, and Yk are the
measurements. For the case of -contaminated N (0, 1) distributions, the influence function,
ψ, is given by [53],
ψ(x) =

x, if |x| ≤ γ
γ sgn(x), else
(1.11)
where sgn(·) is the sign function, and γ is a constant that depends only on ε and the nominal
distribution, N(0, 1).
Figure 1.12: Block diagram of (a) NMR, and (b) soft NMR.
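A minimal sketch of solving (1.10) with the influence function (1.11) is given here in Python. The bisection solver, the function names, and the value γ = 1.5 are assumptions made for illustration (the source only states that γ depends on ε and the nominal distribution), and the observations are assumed to be scaled to unit nominal variance.

```python
import numpy as np

def psi(x, gamma):
    """Influence function (1.11): identity near zero, clipped at +/- gamma."""
    return np.clip(x, -gamma, gamma)

def huber_location(y, gamma, iters=60):
    """Solve sum_k psi(y_k - theta) = 0 for theta by bisection."""
    y = np.asarray(y, float)
    lo, hi = y.min(), y.max()            # the root lies between the extremes
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if psi(y - mid, gamma).sum() > 0:
            lo = mid                     # the sum is non-increasing in theta
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Five well-behaved observations and one gross outlier (hypothetical data).
y = [10.1, 9.8, 10.3, 9.9, 10.0, 138.0]
print("mean          :", np.mean(y))
print("robust (Huber):", huber_location(y, gamma=1.5))
```

Because ψ clips large residuals, a single gross error can shift the estimate by at most a bounded amount, whereas it shifts the sample mean in proportion to its magnitude.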
The abstraction provided by the SSNOC can be directly applied to many applications in
signal processing that can be parallelized. For instance, polyphase decomposition is often
used to obtain a parallel implementation of FIR filters. Such filters are used in the matched
filtering operation common in code division multiple access (CDMA) wireless receivers.
1.3.3 Soft NMR
ANT and SSNOC are heavily application dependent. NMR is a general computation tech-
nique that requires no knowledge of the application, but has limited performance gain. Soft
NMR [54] combines the general applicability of NMR and enhancement in robustness of
ANT in a very general technique that can be applied to nearly any computation. Like ANT,
soft NMR also incorporates application-specific knowledge to dramatically improve the per-
formance over conventional NMR. As shown in Fig. 1.12, soft NMR differs from NMR in
that it incorporates a soft voter, which is composed of an estimator and a detector. Soft
NMR views computation in the PEs as a noisy communication channel, i.e., the following
error model is used:
yi = yo + ηi (1.12)
which is obtained from (1.1) by substituting ei = 0, and employs the estimator as an equalizer
and the detector as the slicer.
The soft voter explicitly uses both data and error statistics. Data statistics
are the distribution of the error-free output yo, which is also known as the prior, while error
statistics are the distribution of the hardware errors η. The soft voter then finds the most
likely output, ŷ, among the hypotheses defined by a hypothesis set H:
$$\hat{y} = \arg\max_{H_i \in H} P(y_1, y_2, \ldots, y_N \mid H_i) \tag{1.13}$$
Optimally, H will contain all possible outputs yo with nonzero probability as given by the
prior. However, in practical implementations, the size of H limits the performance of soft
NMR, and thus the size of H needs to be limited. One good approximation is the set of all
observations, $H = \{y_i\}_{i=1}^{N}$.
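The soft voter in (1.13), restricted to the hypothesis set of observed values, can be sketched as follows. The sketch assumes that the PE errors are independent, so that the likelihood factors across PEs, and it omits the prior (equivalent to assuming a uniform prior). The function name `soft_vote` and the error probability mass function in the example are hypothetical.

```python
import math

def soft_vote(observations, error_pmf, floor=1e-12):
    """Pick the hypothesis h in H = {y_i} that maximizes P(y_1..y_N | h).

    Assumes independent errors: P(y_1..y_N | h) = prod_i P(eta = y_i - h).
    `error_pmf` maps an error value to its probability; unseen error values
    get a small floor probability instead of zero.
    """
    def log_likelihood(h):
        return sum(math.log(error_pmf.get(y - h, floor)) for y in observations)
    return max(set(observations), key=log_likelihood)

# Hypothetical error statistics: errors of +64 are common, errors of -64 are rare.
pmf = {0: 0.60, 64: 0.35, -64: 0.05}
print(soft_vote([100, 164, 164], pmf))
```

In this example a conventional majority voter would select 164, while the soft voter prefers 100 because two errors of +64 are more probable under the assumed statistics than one error of −64.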
1.4 Contributions of This Dissertation
In the past, SEC techniques have been applied to a wide range of applications in communi-
cation and DSP systems. In this dissertation, we extend the application of SEC to include
machine learning applications. Specifically, we show that when SEC is applied to applica-
tions that possess inherent robustness, its effectiveness is significantly enhanced. The major
contributions of the dissertation are summarized as follows:
1. An SSNOC based PN acquisition filter has been implemented in silicon and measured.
Results show operation at error rates larger than 85.83%, with energy savings of 2.52×.
2. SEC has been applied to a belief propagation based communication system, low-density
parity check (LDPC) codes. Combined with the inherent resiliency of LDPC codes,
an SEC based LDPC decoder can operate at a supply voltage up to 38% less than
the nominal voltage and tolerate up to 30× more errors over an SNR range of 3 dB to
8 dB, while maintaining less than 3× degradation in BER. This is equivalent to energy
savings of 45.7% compared to conventional LDPC decoders, and 33.2% compared to a
sign bit protected LDPC decoder.
3. SEC has been applied to a Markov random field based iterative stereo matching ar-
chitecture. Correction is performed at several levels including arithmetic, iteration,
and system levels. The compensation overhead, robustness, and energy savings are
characterized and compared among the different levels of compensation. Arithmetic
compensation achieves power savings of 39.6% at an overhead of 97.4%. A hybrid ap-
proach can successfully trade off the compensation complexity and energy savings by
achieving 16.1% additional power savings compared to arithmetic level while reducing
the overhead to 57.9%.
4. A study on combining SEC and approximate computing has been performed to show
that SEC can further extend the AC based design to achieve additional robustness
and energy efficiency. Results show that ANT combined with AC achieves energy
savings of 44.9% compared to a conventional system, while achieving at most 4%
degradation in performance. This supports our view that embracing the stochasticity
of the underlying process is essential to achieve significant energy savings.
5. Attempts to analyze SEC techniques have been made. An analysis framework has been
proposed and under this framework, ANT is shown to be a Bayes optimal detector and
estimator.
1.5 Dissertation Organization
This dissertation is organized as follows:
• Chapter 1 has provided the necessary background information on fault-tolerant design
methods and statistical error compensation design techniques.
• Chapter 2 shows the benefits of SEC applied to detection and correlation based
applications. A PN-code acquisition filter was chosen as an example. Through mea-
surements on a 180 nm silicon implementation, it is shown that the SEC based PN code
acquisition can achieve detection probability Pdet ≥ 90% at an error rate pη ≥ 85.83%
with energy savings of 2.52×.
• Chapter 3 delves into machine learning based communication applications. SEC is
applied to a low-density parity check (LDPC) decoder. As iterative message passing
based architectures are inherently robust to small magnitude errors, the SEC based
LDPC decoder shows significant improvement in robustness and energy efficiency.
• Chapter 4 applies SEC to a more generic machine learning application. With the
success in LDPC decoders, SEC has been applied to a complex message-passing ap-
plication, Markov random field (MRF) based stereo image matching. Measurements
and analysis on the inherent robustness of message-passing architectures are explored.
Then, the synergy between SEC and ML applications is shown through simulations
and FPGA emulations.
• Chapter 5 combines AC and SEC. AC is applied to the same stereo image-matching
architecture of Chapter 4, and it is shown that SEC is essential in achieving significant
energy savings.
• Chapter 6 analyzes SEC techniques to provide insight into the design of optimal
SEC based systems. In particular, ANT is shown to be a Bayes optimal detector and
estimator.
• Chapter 7 concludes this dissertation and provides directions for future research
activities.
Chapter 2
The Design of an Error-Resilient PN Code Acquisition
Prototype Chip
In this chapter, we describe the architecture and measurement results of an SEC based pseu-
dorandom noise (PN) code acquisition filter. As discussed in Chapter 1, the performance of
PN code acquisition is determined by its detection probability Pdet, and is a good application
for SEC.
Code division multiple access (CDMA) based communication systems employ PN code
sequences to spread the bandwidth of the transmitted signal over a wide frequency band.
This technique, employed in communication standards such as IEEE 802.11 WLAN [55],
CDMA2000 [56], and WCDMA [57], is useful for multiple access communication, as the
resulting transmit signal has a very low signal-to-noise ratio, and causes minimal interference.
Conventionally, the receiver correlates the noisy received signal with a local copy of the PN
code, and the output is then processed by a detector [58]. Matched filters with locally
generated code sequence as tap weights are commonly employed for this purpose. Peaks
in the output of the matched filter are employed for detection and synchronization of PN
sequences. Code acquisition is a power hungry, computationally intensive operation in a
spread-spectrum communication receiver [58]. For example, [59] shows that the matched
filter consumes 20% to 25% of total receiver power. Reducing the power required to perform
PN code acquisition is therefore essential for mobile wireless communication [60].
The matched filter based PN code acquisition is performed by searching over all possible
code phases. At one extreme, this can be achieved by serially processing each code phase
[61, 62]; while at the other extreme, a full parallel search can be performed. A serial search
has low area complexity while incurring a large acquisition and detection latency. On the
other hand, a full parallel search can obtain a phase lock in a single search period, but
results in large area overhead. Hybrid approaches have been utilized to trade off detection
latency and implementation complexity [63] to achieve low power operation. However, the
complexity or latency trade-off is only linear, which results in a constant area delay product
[64]. Shibano et al. [65,66] have employed analog techniques to achieve energy efficiency. Yet,
such techniques lack programmability. Recent developments in iterative message-passing
algorithms have inspired iterative PN acquisition techniques [67]. But, this technique incurs
a large interconnect complexity as well as latency due to its iterative nature. These low-
power PN code acquisition architectures sacrifice the simplicity of a digital matched filter.
In this chapter, we show that when SSNOC, also known as stochastic networked com-
putation [52], is applied to a 256-tap PN code acquisition filter implemented in a 180 nm
CMOS process [68], it can achieve 2.4× to 5.8× (ave 3.86×) energy savings over conven-
tional designs, and 1.55× to 3.79× (ave 2.52×) energy efficiency and 2146× to 2281× (ave
2225×) in error tolerance over existing error resilient designs [27–30]. In terms of a figure
of merit (FOM), defined as (power/(precision × sample rate)), this translates to a 19.27×
improvement compared to conventional PN code acquisition filters [69–71]. By enabling
operation at significantly lower voltage, and being able to effectively compensate for VOS
errors with a low error correction overhead, stochastic networked computation is able to
achieve a significant fraction of the promised 9× energy savings of Fig. 1.6.
The remainder of the chapter is organized as follows. The PN code acquisition problem
and the conventional matched filter based architectures are described in Section 2.1. The
application of statistical error compensation to the PN acquisition filter is described in
Section 2.2. The architecture of a prototype test chip is shown in Section 2.3. Section 2.4
presents measurement results, and Section 2.5 summarizes the chapter.
2.1 PN Code Acquisition Filter
Pseudorandom noise (PN) codes are a set of sequences whose cross correlation is zero, while
their autocorrelation is nonzero only at lag zero. PN codes are generated via a linear feedback
shift register (LFSR) as shown in Fig. 2.1. The LFSR can be described by a polynomial
$x_{k+r} = g_1 x_{k+r-1} \oplus g_2 x_{k+r-2} \oplus \cdots \oplus g_r x_k$, where ⊕ is modulo 2 addition. An LFSR polynomial
with degree r that results in a sequence with the maximum period of $2^r - 1$ is referred to
as a maximal length sequence (MLS) or an m-sequence. Such m-sequences are widely used
as spreading sequences for spread spectrum systems such as CDMA.
Figure 2.1: Architecture of a linear feedback shift register (LFSR) used for PN code
generation.
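The LFSR recurrence can be checked with a short Python sketch. The tap vector g = [0, 0, 1, 1] (that is, x[k+4] = x[k+1] ⊕ x[k], polynomial x⁴ + x + 1) and the seed are example choices made for illustration and are not the code used on the test chip.

```python
def lfsr_sequence(g, seed, length):
    """Generate bits from x[k+r] = g1*x[k+r-1] ^ g2*x[k+r-2] ^ ... ^ gr*x[k].

    g    : taps [g1, ..., gr], each 0 or 1
    seed : initial bits [x0, ..., x(r-1)], not all zero
    """
    r = len(g)
    x = list(seed)
    while len(x) < length:
        k = len(x) - r
        bit = 0
        for j in range(1, r + 1):
            bit ^= g[j - 1] & x[k + r - j]
        x.append(bit)
    return x[:length]

def find_period(bits):
    """Smallest shift p for which the observed sequence repeats itself."""
    for p in range(1, len(bits)):
        if bits[p:] == bits[:-p]:
            return p
    return None

g = [0, 0, 1, 1]
r = len(g)
seq = lfsr_sequence(g, seed=[1, 0, 0, 0], length=3 * (2**r - 1))
print("period:", find_period(seq))

# Periodic autocorrelation of one period, with bits mapped to -1/+1.
c = [2 * b - 1 for b in seq[: 2**r - 1]]
N = len(c)
autocorr = [sum(c[i] * c[(i + lag) % N] for i in range(N)) for lag in range(N)]
print("autocorrelation:", autocorr)
```

For this m-sequence the periodic autocorrelation equals N at lag zero and the small constant −1 at every other lag, which is the sharply peaked behavior that the acquisition filter relies on.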
In spread-spectrum systems, communication is possible only when the transmitted se-
quence matches the sequence stored in the receiver and the offsets are synchronized. This
synchronization task is known as PN acquisition, and is performed by utilizing the correla-
tion property of PN codes. The traditional architecture for a PN code acquisition system is
a simple matched filter such as the one in Fig. 2.2(a) [69]. An N -tap PN code acquisition
filter correlates the received signal x[n] with locally generated PN-code hj as follows:
$$y_o[n] = \sum_{j=0}^{N-1} h_j\, x[n-j] \tag{2.1}$$
where hj represents the 1 b (1 bit) PN-code, and x[n] is the received signal. In our imple-
mentation, x[n] is chosen to be 8 b. A peak detector or a thresholding block is used to detect
if there is a match via the equation:
$$m = \mathrm{sgn}(y_o[n] - \tau) \tag{2.2}$$
$$\mathrm{sgn}(a) = \begin{cases} 1, & a > 0 \\ 0, & \text{otherwise} \end{cases} \tag{2.3}$$
Here, m is a binary variable indicating a match when m = 1 and no match when m = 0, and
τ is a user-defined threshold chosen to maximize the probability of detection at a specific
false alarm rate.
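Equations (2.1) to (2.3) can be exercised with a small NumPy sketch. The 15-chip code, the mapping of code bits to ±1, the noise level, the code offset, and the threshold τ = 10 are all assumptions chosen for illustration; the prototype described in this chapter uses 256 taps and 8 b data.

```python
import numpy as np

# Length-15 m-sequence mapped from {0, 1} to {-1, +1} (assumed mapping).
bits = [1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1]
h = 2 * np.array(bits) - 1

def matched_filter(x, h):
    """y_o[n] = sum_j h[j] * x[n - j], with x[n] = 0 for n < 0 (Eq. 2.1)."""
    return np.convolve(x, h)[: len(x)]

def detect(y, tau):
    """m = sgn(y - tau), where sgn(a) = 1 if a > 0 and 0 otherwise (Eqs. 2.2, 2.3)."""
    return (y > tau).astype(int)

rng = np.random.default_rng(0)
offset = 7
x = rng.normal(0, 0.2, 40)                # weak channel noise
x[offset:offset + len(h)] += h[::-1]      # chip order chosen so x[n - j] lines up with h[j]
y = matched_filter(x, h)
print("match declared at n =", np.flatnonzero(detect(y, tau=10)))
```

The filter output peaks at n = offset + N − 1, the instant at which every tap h[j] is aligned with its matching received chip.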
Figure 2.2: PN code acquisition filter: (a) conventional, and (b) SEC based.
2.2 SEC Based PN Code Acquisition Filter
The SEC implementation of a 256-tap PN code acquisition system is shown in Fig. 2.2(b)
using stochastic networked computation (SNC) [52]. SNC decomposes the main computa-
tion into sub-blocks (sensors) whose outputs have identical mean and variance (statistically
similar). For the PN acquisition application, a symmetric structure composed of a bank of
identical sub-correlators is used and online error compensation is achieved using a fusion
block. This is in contrast to sequential error detection and correction as in [27–30]. With
a decomposition factor of 64, the 256-tap correlator in Fig. 2.2(b) is decomposed into 64
4-tap sub-correlators whose output yi[n] (i = 0, . . . , 63) is given by
$$y_i[n] = \sum_{j=0}^{3} h_{4i+j}\, x[n-4i-j] \tag{2.4}$$
This sub-blocking approach is different from the polyphase decomposition used in [52]. Both
perform statistically similar decomposition of the main block and thus will have minimal
impact on performance, which was verified through simulations. The sub-blocking approach
used in this chapter can achieve additional power savings (see Section 2.3.1).
It should be noted that the sum of all sub-correlators $\sum_{i=0}^{63} y_i[n]$ is equal to yo[n], the output
of the conventional matched filter in (2.1). Each sub-correlator is referred to as a sensor
and is subject to estimation errors ei[n], i.e., yi[n] = yo[n] + ei[n], which are uncorrelated
(between different sensors). If additionally, the sensors are subject to VOS, timing errors
ηi[n] are induced as well, resulting in the following output:
yi[n] = yo[n] + ei[n] + ηi[n] (2.5)
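The relationship between the sub-correlators in (2.4) and the full correlator in (2.1) can be verified numerically. In the sketch below the PN code is replaced by random ±1 values and the input by random 8 b integers; both are stand-ins assumed for illustration, and no hardware errors are modeled.

```python
import numpy as np

rng = np.random.default_rng(1)
N, TAPS = 256, 4
h = rng.choice([-1, 1], N)            # stand-in for the PN code (not an m-sequence)
x = rng.integers(-128, 128, 1000)     # stand-in for the 8 b received samples

def sub_correlator(i, n):
    """y_i[n] = sum_{j=0}^{3} h[4i+j] * x[n-4i-j] (Eq. 2.4), valid for n >= N-1."""
    return sum(h[TAPS * i + j] * x[n - TAPS * i - j] for j in range(TAPS))

n = 500
y_sensors = [sub_correlator(i, n) for i in range(N // TAPS)]
y_full = sum(h[j] * x[n - j] for j in range(N))   # Eq. (2.1)

print(len(y_sensors), "sensors; sum of sensors =", sum(y_sensors), "; full =", y_full)
```

Each sensor touches a disjoint group of four taps, so the 64 partial sums add up exactly to the output of the 256-tap filter when no timing errors are present.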
The fusion block in Fig. 2.2(b) combines the 64 outputs of the sub-correlators employing
Huber’s robust estimation theory [53] for error compensation. Huber’s robust estimation
theory provides the optimal fusion algorithm for ε-contaminated errors, i.e., the noise in
the system (comprising ηi[n] and ei[n]) is modeled as drawn from the following set of mixture
distributions:
$$\mathcal{P} = \{F \mid F = (1-\epsilon)\Phi + \epsilon H\} \tag{2.6}$$
where Φ is the class of normal distributions, H is the class of arbitrary densities with zero
mean and finite but unbounded variance, and 0 < ε < 1. The error model for the
sub-correlator output in (2.5) has a distribution that approximates (2.6) because the estimation
error ei[n] is small magnitude and Gaussian, and timing violations ηi[n] are large and
intermittent. Thus, the sum ei[n] + ηi[n] ≈ ηi[n] when ηi[n] ≠ 0, and hence ei[n] + ηi[n] is
approximately ε-contaminated.
Figure 2.3 plots the correct output yo[n] and measured outputs of four sensors, along with
the mean and median of all 64 sensors, from the prototype test chip. It can be seen that most
times the sensor outputs are close to yo[n], indicating that ηi[n] = 0, and yi[n] = yo[n]+ei[n],
which is approximately Gaussian distributed. Once in a while, the sensor outputs deviate
significantly from yo[n], indicating that ηi[n] ≠ 0 and is large in magnitude. This is to be
expected as MSB errors will occur in LSB-first computation.
Figure 2.3: Measured sensor outputs over time. Outliers in the sensor output are due to
hardware errors ηi and can be seen to shift the mean significantly.
The robust estimation problem is to find the maximum-likelihood estimate for the least
informative distribution in the set P (2.6) for a given ε. This estimate will be robust to the
worst-case probability distribution of the errors caused due to variations. Huber’s theory [53]
enables one to determine a fusion algorithm that finds the solution to this robust estimation
problem. This robustness property of the Huber estimator is useful as the distribution of
ηi[n] may not be known accurately because timing violations are a function of a number of
parameters including process, temperature, and voltage. However, this (worst-case) optimal
fusion block was shown to be very complex [52], and a simplification of this algorithm leads
to the median block, which performs equally well with significant complexity reduction.
The median and mean fusion is shown to compensate for errors effectively. Hence, in this
dissertation, we employ the median and mean as algorithms to fuse the sensor outputs.
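The effect of the two fusion rules on ε-contaminated sensor outputs can be illustrated with a small simulation. The values below (a correct output of 40, Gaussian estimation errors with standard deviation 3, and six sensors hit by a timing error of magnitude 128) are assumptions for illustration and are not chip measurements.

```python
import numpy as np

rng = np.random.default_rng(2)
y_o, n_sensors = 40.0, 64

e = rng.normal(0, 3, n_sensors)      # small Gaussian estimation errors e_i
eta = np.zeros(n_sensors)
eta[:6] = 128.0                      # a few large timing errors eta_i (assumed)
y = y_o + e + eta                    # sensor outputs, as in (2.5)

print("correct output:", y_o)
print("mean fusion   :", np.mean(y))
print("median fusion :", np.median(y))
```

With these assumed statistics the outliers pull the mean away from the correct output by roughly the outlier magnitude times the fraction of affected sensors, while the median remains close to the correct value as long as fewer than half of the sensors are in error.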
2.3 Chip Architecture
Figure 2.4 shows the architecture of the PN code acquisition test chip consisting of a total of
64 sensors, and a fusion block implementing two algorithms: (a) mean and (b) hierarchical
median. An adaptive thresholding block at the output determines whether a detection has
Figure 2.4: Prototype IC architecture.
Figure 2.5: Sensor architecture.
occurred. Additional logic to control the threshold block and select the correct tap to load
is placed in a control block (not shown).
2.3.1 Sensor architecture
Each sensor is a four-tap FIR filter (Fig. 2.5) with 8 b data (x[n]) and 1 b coefficient (hj),
and a 10 b output whose two LSBs are dropped to generate a final 8 b output ŷ0[n]. To
reduce complexity, the PN code bits are shifted rather than shifting the 8-bit data. This
reduces the number of registers that need to be updated from 2048 (256 8 b data registers)
to 264 (a single 8 b data register and the 256 coefficient registers). Each sensor, except for
Sensor-1, takes the next coefficient from the previous sensor as input. An external pnload
signal switches the Sensor-1 coefficient input between the final coefficient from Sensor-64 and
the external signal pnin. In addition to the FIR filter, a mod-24 state counter determines
when to advance the coefficients and latch new inputs and outputs, and the final two bits
of the data load counter are decoded to determine which data value, if any, to overwrite.
The latches that advance the coefficient and data operate with a different clock as shown
in Fig. 2.5 (shaded vs. unshaded latch) with the clock for PN code bits being 24× faster.
Three configuration bits, set to constants based on the sensor number during instantiation
and thus optimized out during synthesis, modify the operation of the data load decoder.
Finally, the sensor value is output serially.
2.3.2 Fusion block architecture
The fusion block implements a three-stage hierarchical mean or median function to avoid
global interconnect and simplify the implementation of a 64-median block. The mean is a
simple addition of four values at each level, with a depth of three. The hierarchical median
operates by grouping the input data, taking the median of each group, and repeating the
procedure until only one group remains. Two- and three-level structures with groups of
eight and four, respectively, were tested. In addition, to attempt to mitigate the loss of
accuracy resulting from grouping the data, overlapping structures in which each input is put
in multiple groups were also considered. These use group sizes of 16 and 8, respectively,
except for the final level. The accuracy of these structures in estimating the median is
dependent on how the inputs are grouped; to estimate the effect, the integers from 0 to
63 are randomly shuffled 100,000 times and used as inputs to the algorithm. A histogram
of the final outputs for each hierarchical median structure when selecting the lower of the
central values at each stage is shown in Fig. 2.6; selecting the higher instead would produce
a mirror image of the plot. The most frequently selected values are 26 for the simple two-
level algorithm, 29 for the overlapping two-level algorithm, 18 for the simple three-level
algorithm, and 25 for the overlapping three-level algorithm. The performance of each of
these algorithms is shown in Fig. 2.7. Interestingly, each algorithm performs noticeably
better when selecting the lower of the central values rather than the higher. When the code
sequence is present, nearly all sensor outputs should be high, but the distribution of sensor
outputs is approximately Gaussian when the sequence is not present. Thus, selecting the
lower value is more likely to produce a small output when the sequence is not present while
still producing a large output when the sequence is present. In general, the overlapping
structures perform best and the two-level structures are more accurate than the respective
three-level structures. Using the lower of the central values, the overlapping three-level
structure has a higher probability of detection than the simple two-level structure. The
resulting hierarchical median algorithm employs a three-level hierarchy, where the blocks in the
two lower levels of the median hierarchy each have eight inputs and compute the fifth largest
input, while the final fusion block computes the third largest of four inputs.
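The grouping experiment described above can be sketched as follows for the simple (non-overlapping) structures only. The overlapping groupings are not reproduced because their exact input assignment is not given here. Function names and the reduced trial count are assumptions made for illustration.

```python
import random
from collections import Counter

def lower_median(group):
    """Lower of the two central values of an even-sized group."""
    ordered = sorted(group)
    return ordered[(len(ordered) - 1) // 2]

def hierarchical_median(values, group_size):
    """Replace each consecutive group by its lower median until one value remains."""
    level = list(values)
    while len(level) > 1:
        level = [lower_median(level[i:i + group_size])
                 for i in range(0, len(level), group_size)]
    return level[0]

def output_histogram(group_size, trials=2000, seed=0):
    """Histogram of final outputs for randomly shuffled inputs 0..63."""
    rng = random.Random(seed)
    inputs = list(range(64))
    counts = Counter()
    for _ in range(trials):
        rng.shuffle(inputs)
        counts[hierarchical_median(inputs, group_size)] += 1
    return counts

hist2 = output_histogram(group_size=8)   # simple two-level structure (8 x 8)
hist3 = output_histogram(group_size=4)   # simple three-level structure (4 x 4 x 4)
print("two-level mode:  ", hist2.most_common(1)[0][0])
print("three-level mode:", hist3.most_common(1)[0][0])
```

For a group of eight, the lower median is the fifth largest input, and for a group of four it is the third largest, matching the selection rule stated above.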
Figure 2.6: Distribution of outputs selected by hierarchical median algorithms, selecting
the lower of the two central values at each stage.
The median block is based on [72] (Fig. 2.8). The algorithm requires one mask bit mi
and one set bit si per input to be passed from stage to stage. For the most significant bit,
the mask bits are all initialized to 1 and the set bits are not used. At each stage, if a mask
bit is 0, the corresponding input bit is replaced with its set bit. Then the majority value
z of the masked inputs is determined. This value z is the output bit for this stage of the
median filter. For the next stage, any inputs that do not already have a mask bit of 0 and
do not match z have their mask bits set to 0 and their set bits set to the complement of z,
i.e., their own bit value at this stage. Effectively, an input that is known to be below the
median is treated as 0, and an input known to be above the median is treated as 1 in all
remaining stages. The median bit slice structure can be seen in Fig. 2.8.
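A software sketch of this MSB-first procedure is given below. It is an illustration under stated assumptions rather than the circuit of [72]: inputs are treated as unsigned integers, the majority gate is generalized to a threshold "at least k ones" so that the same routine returns the k-th largest input, and each decided input keeps its own bit value as its set bit, which is what treating inputs below the output as 0 and inputs above it as 1 requires. The function name is hypothetical.

```python
def bit_serial_select(values, width, k):
    """Return the k-th largest of `values` (unsigned, `width` bits), MSB first."""
    n = len(values)
    mask = [1] * n       # 1: input not yet known to be above or below the output
    set_bit = [0] * n    # replacement bit used once an input has been decided
    result = 0
    for pos in range(width - 1, -1, -1):
        bits = [(v >> pos) & 1 for v in values]
        # Decided inputs contribute their stored set bit instead of the data bit.
        eff = [b if m else s for b, m, s in zip(bits, mask, set_bit)]
        # Threshold gate: the output bit is 1 when at least k effective bits are 1.
        z = 1 if sum(eff) >= k else 0
        result = (result << 1) | z
        for i in range(n):
            if mask[i] and bits[i] != z:
                mask[i] = 0
                set_bit[i] = bits[i]   # below the output -> 0, above -> 1
    return result

print(bit_serial_select([5, 3, 7, 1], width=3, k=3))
```

With eight inputs and k = 5, or four inputs and k = 3, this selects the lower of the two central values, as used in the hierarchical median.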
In addition to the fusion structures themselves, a decoder (see Fig. 2.9) for two bits of the
data load counter is included in the fusion block. The outputs of this decoder are passed
to the lower-level fusion blocks or the sensors, as appropriate, as load enable signals. The
input and output registers for the adder and median bit slice are disabled when a different
algorithm is selected to reduce the number of register switches.
Figure 2.7: Simulated ROCs for hierarchical median: (a) simple two-level, (b) simple
three-level, (c) overlapping two-level and (d) overlapping three-level.
Figure 2.8: Bit slice of the median block.
Figure 2.9: Fusion block architecture.
Figure 2.10: Thresholding block architecture.
2.3.3 Threshold block architecture
The threshold block is illustrated in Fig. 2.10. The value in the threshold register is adjusted
based on the user-selected false alarm rate and selected fusion algorithm. Because the
hierarchical median algorithm produces a value lower than the mean, the initial threshold
upon reset must be lower than that of the mean. Appropriate initial threshold values, chosen
through simulations, are 8 for the mean and 0 for the median. The
output of the final fusion is compared to the threshold, and the comparison result is sent to
the external output cmpout.
It is assumed that a PN code match is rare, and thus the number of positive responses
is a reasonable estimate of the false alarm rate. In order to maintain the selected rate, the
number of positive responses is counted for 1024 cycles. At this time, the count is compared
with predetermined values according to the user-selected false alarm rate. If this count is
outside the prescribed range, the threshold is incremented or decremented as necessary to
bring the count closer to the desired range. This process is then repeated so the threshold
will adapt to changing conditions if needed.
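The adaptation loop described above can be sketched in software as follows. The bounds `low` and `high`, the unit step, and the comparison direction (a positive response when the fused output exceeds the threshold) are assumptions for illustration; the actual predetermined count values are not restated here.

```python
def adapt_threshold(threshold, positives, low, high, step=1):
    """Update the threshold after one counting window.

    `low` and `high` bound the acceptable number of positive responses per
    window for the selected false alarm rate (hypothetical values).
    """
    if positives > high:
        return threshold + step   # too many positives: raise the threshold
    if positives < low:
        return threshold - step   # too few positives: lower the threshold
    return threshold

def run_threshold_loop(samples, threshold, low, high, window=1024):
    """Compare each fused output to the threshold and adapt once per window."""
    decisions = []
    positives = 0
    for n, y in enumerate(samples, start=1):
        hit = y > threshold
        decisions.append(hit)
        positives += hit
        if n % window == 0:
            threshold = adapt_threshold(threshold, positives, low, high)
            positives = 0
    return decisions, threshold

decisions, final_threshold = run_threshold_loop([10] * 2048, threshold=8, low=90, high=110)
print(final_threshold)
```

Because the threshold moves by one step per window, the loop tracks slow changes in operating conditions without reacting to individual detections.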
Table 2.1: Complexity of circuit blocks in SSNOC prototype IC.
block                 cells    area (mm²)
sensors               25472    0.844
fusion block           3201    0.109
control                 169    0.0055
thresholding block      249    0.0098
Figure 2.11: The 256-tap PN code acquisition filter chip in a 180 nm CMOS process: (a)
layout, and (b) microphotograph.
2.3.4 Chip layout
The prototype chip layout in a 180 nm CMOS process contains a total of 48440 cells and a
total cell area of 1.871 mm². A detailed breakdown of the complexity of each block is given
in Table 2.1. The core area is approximately 2 mm × 2 mm, and 10 IO pins are placed on each
side. Including the pad frame, the final layout size is 2.7 mm × 2.7 mm. Spacing from the
power rings to the pad frame is approximately 15 µm. The chip layout and microphotograph
are shown in Fig. 2.11. The chip is packaged in a 68-pin leadless chip carrier package.
2.4 Measurement Results
The chip was tested with the Agilent 16900A logic analysis system, at a frequency of fclk =
50 MHz. At this frequency, the point of first failure (PoFF) voltage is 0.95 V. The test vector
was generated by embedding a length-256 PN code in a sequence of Gaussian noise with an
SNR of −12 dB. Test vectors of length 10⁶ containing 10³ detections were employed. A
total of five chips were tested at supply voltages from 0.95 V down to 0.6 V.
2.4.1 Error characterization
First, the statistics of VOS errors are collected and analyzed. Errors can be observed at
operation below the critical voltage of Vdd−crit = 0.95 V and the error rate increases with
further scaling. Error characterization is performed by first subtracting the output at nomi-
nal voltage y0[n] from the output at overscaled voltages. Next, the errors are histogrammed
and normalized to generate the error probability mass function (PMF). The error PMFs are
grouped into four operating regions based on the supply voltage.
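The characterization procedure (subtract, histogram, normalize) amounts to the following minimal sketch. The function names and the toy input sequences are assumptions for illustration; they are not the measured chip data.

```python
from collections import Counter

def error_pmf(y_nominal, y_overscaled):
    """Return {error value: probability} from two equally long output sequences."""
    if len(y_nominal) != len(y_overscaled):
        raise ValueError("sequences must have the same length")
    errors = [yv - y0 for y0, yv in zip(y_nominal, y_overscaled)]
    counts = Counter(errors)                  # histogram of error values
    total = len(errors)
    return {err: c / total for err, c in sorted(counts.items())}

def error_rate(pmf):
    """Probability that the overscaled output differs from the nominal output."""
    return 1.0 - pmf.get(0, 0.0)

pmf = error_pmf([1, 2, 3, 4], [1, 2, 7, 0])   # toy data
print(pmf, error_rate(pmf))
```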
2.4.1.1 Region R1
Region R1 is at Vdd = 0.85 V, about 10% below the PoFF. The error rate is 0.074
(approximately 100× higher than in [28]), and the errors are all small (single-bit)
errors. Figure 2.12 (a), (b) shows the errors in this region.
2.4.1.2 Region R2
Region R2 is at Vdd = 0.76 V (20% below the PoFF) where multi-bit errors begin to appear
and the error rate is 0.54. This region has mild VOS applied to the circuit and the majority
of the errors are still small in magnitude as shown in Fig. 2.13 (a), (b).
Figure 2.12: Error PMF (occurrence vs. error magnitude) of region R1: (a) Vdd = 0.88 V, and
(b) Vdd = 0.85 V. All errors are one bit errors with powers-of-two magnitude.
Figure 2.13: Error PMF (occurrence vs. error magnitude) of region R2: (a) Vdd = 0.78 V, and
(b) Vdd = 0.76 V. Multi-bit errors are observed, but most errors are still small in magnitude.
Figure 2.14: Error PMF (occurrence vs. error magnitude) of region R3: (a) Vdd = 0.68 V, and
(b) Vdd = 0.66 V. A dense Gaussian PMF and a sparse Gaussian PMF are mixed.
2.4.1.3 Region R3
Region R3 is at Vdd = 0.66 V (30% below PoFF) with an error rate of 0.95. As seen in
Fig. 2.14 (a), (b), this region shows Gaussian-like error statistics (ei) overlaid with large
magnitude errors (ηi), which are still correctable with median or mean fusion. Region R3
represents a high VOS region with an exceptionally high error rate. The spiky characteristic
of the PMF results in the good performance of median fusion. Median is a function that
eliminates outliers, and the spikes that are observed can be interpreted as outliers.
2.4.1.4 Region R4
Region R4 (Vdd = 0.6 V or 37% below PoFF) is where the error rate is 0.97 and the system
breaks down. In this region, the PN acquisition system fails to produce any meaningful
computation. The errors no longer have spiky characteristics and look like a narrow Gaussian
distribution, thus violating the ε-contaminated model. Also, large magnitude errors, up to
the MSB, are present as shown in Fig. 2.15 (a), (b).
Figure 2.15: Error PMF (occurrence vs. error magnitude) of region R4: (a) Vdd = 0.63 V, and
(b) Vdd = 0.60 V.
2.4.2 Test results
Figure 2.16 plots Pdet and pη vs. Vdd for a fixed false alarm rate of 10%. It can be seen
that a near constant Pdet ≥ 90% is achieved for Vdd ≥ 0.69 V with pη ≤ 0.912. This voltage
is 27% below the PoFF voltage of Vdd = 0.95 V, indicating significant robustness to voltage
variations. These results are consistent with the simulation results in [52] that show that a
25% reduction in Vdd has minimal performance degradation.
Figure 2.17 shows the relation between energy and Vdd along with pη. The measurements
include the sensor and fusion block overhead. Compared to Vdd = 0.95 V (PoFF), energy
savings between 2.4× and 5.8×, for an average of 3.86×, and error tolerance between 2146×
and 2281×, for an average of 2225×, can be achieved at Vdd ranging between 0.69 V and
0.70 V without any loss in system-level performance (probability of detection Pdet) in the
presence of very high probability of error pη ≤ 0.912. Compared to simulation results in Fig.
1.6, measurements in Fig. 2.17 indicate that the expected error tolerance and two-thirds of the
potential energy savings have been realized. This represents a 2.52× greater energy savings
over [28] with a 2225× higher error rate tolerance.
Table 2.2 compares the results of our work with previously published work. The first three
entries are compared against generic error resilient designs while the last three are compared
against conventional PN code acquisition chips. It can be seen that the SEC based design
achieves significantly better performance by utilizing statistical information at the system
Figure 2.16: Detection probability Pdet and sensor probability of error pη vs. supply
voltage Vdd.
Figure 2.17: Energy consumption and sensor probability of error pη vs. supply voltage
Vdd.
Table 2.2: Comparison of SSNOC prototype IC with other work.
            Vdd            tech.     pη        energy savings   figure of merit†
[27]        1.2−1.8 V      180 nm    0.1%      14-17%           N/A
[28]        0.8−1.2 V      130 nm    0.04%     33-35%           N/A
[30]        0.9−1.0 V      45 nm     N/A       22%              N/A
[69]        1.8 V          350 nm    0%        N/A              0.319
[70]        2 V            350 nm    0%        N/A              0.09699
[71]        5 V            1.2 µm    0%        N/A              0.713
our work    0.69−0.95 V    180 nm    88.99%    71.87%           0.037
† Figure of merit is defined as power/(resolution × sample rate) (nJ/bit),
where resolution is the average of input (max of data and coefficient bits) and
output bit precision.
level. The high error resiliency has been successfully traded off with energy to achieve
significant savings compared to other error resilient designs. To compare our PN code
filter with existing designs, a figure of merit (FOM) that normalizes performance based
on technology, resolution, and sample rate has been employed. The FOM is defined as
(normalized power) / (taps × precision × sample rate) with units of (pJ/bit). Normalized
power for analog circuits is obtained by normalizing with Vdd and the feature size [73], while
for digital circuits, it is normalized by Vdd² and the feature size [74]. Resolution is the
average of input (max of data and coefficient bits) and output bit precision. As [69, 71] are
analog implementations, they are not normalized by precision, and IO power has been
included for [70].
The SEC based PN code acquisition filter, when compared against conventional PN code
acquisition filters, shows a minimum of 2.5× improvement in the FOM.
2.4.3 Impact of VOS on activity factor
The energy savings measured in Section 2.4.2 are larger than the values predicted by voltage
scaling via E = CL·Vdd², where CL is the effective load capacitance. In this section, we analyze
a circuit under VOS and provide simulation results to show the source of the additional
energy savings.
In VOS design, the operation frequency is kept constant at fCLK, while the supply voltage
Vdd is scaled down below Vdd,crit. The relationship between energy E and power P is
\[
E = Pt = \alpha_{0\to1}(V_{dd})\, f_{CLK}\, C_L\, V_{dd}^{2}\, t \tag{2.7}
\]
where α0→1 is the activity factor, which is the average probability of 0 → 1 transitions, and
t is the total time required to complete a given computation. In VOS design, α0→1 depends
on Vdd, while fCLK, CL, and t are all independent of Vdd. Thus the energy is proportional to
the activity factor α0→1 and to the square of the supply voltage, Vdd². Vdd is easily controllable
and measurable. For the PN acquisition prototype chip, Vdd is scaled from 0.95 V down
to 0.69 V, providing up to 47.2% energy savings, a little more than half of the total 82.8%
energy savings achieved. The remaining portion can be attributed to the reduction in activity
factor α0→1 due to VOS. However, the effect of VOS on α0→1 is not easy to measure. To
obtain some insight on how VOS affects α0→1, we have performed HSPICE simulations of a
sensor (Fig. 2.5) under various voltages and measured the average activity factor. Figure
2.18(a) plots the number of glitches observed for various bit positions vs. Vdd. It can be
seen that the number of glitches decreases as the voltage is lowered. This can be explained by
noting that under severe VOS, the propagation delay will most certainly be longer than the
clock period, and the voltage at the final output gate will see fewer transitions. Furthermore,
Figure 2.18: HSPICE simulation of a sensor to measure α0→1: (a) number of glitches vs.
Vdd for bits 3 to 8, and (b) average α0→1 vs. Vdd.
glitches that are generated by input timing mismatches will not be able to propagate to the
end of the logic chain, further reducing the activity factor. The resulting α0→1 is plotted in
Fig. 2.18(b). In our simulations, the activity factor has shown a reduction of at most 39.2%.
The combination of activity reduction by 40% and VOS from 0.95 V to 0.69 V results in a
(0.95/0.69)²/0.6 = 3.16× savings, which is close to the measured average energy savings of
3.86×.
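The arithmetic behind these figures can be reproduced with a short sketch based on (2.7), assuming that fCLK, CL, and t are unchanged so that energy scales as α0→1·Vdd². The constant and function names are chosen for illustration only.

```python
V_POFF = 0.95          # point-of-first-failure supply voltage (V)
V_VOS = 0.69           # overscaled supply voltage (V)
ACTIVITY_RATIO = 0.6   # activity factor at V_VOS relative to V_POFF (about 40% lower)

def energy_ratio(v_ref, v_scaled, activity_ratio=1.0):
    """E(v_ref) / E(v_scaled) when energy is proportional to alpha * Vdd^2."""
    return (v_ref / v_scaled) ** 2 / activity_ratio

voltage_only = energy_ratio(V_POFF, V_VOS)
combined = energy_ratio(V_POFF, V_VOS, ACTIVITY_RATIO)
print(f"voltage scaling alone:   {voltage_only:.2f}x ({1 - 1 / voltage_only:.1%} savings)")
print(f"with activity reduction: {combined:.2f}x")
```

Voltage scaling alone accounts for the 47.2% savings quoted earlier, and including the activity reduction gives the 3.16× estimate.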
In conclusion, VOS has been shown to also affect the activity factor significantly, and
result in greater energy benefits than predicted by voltage scaling alone. However, for
further generalization of this relationship between α0→1 and Vdd, a rigorous analysis and
simulation will need to be performed on various circuit architectures.
2.5 Summary
This chapter has shown that SEC is an effective technique for achieving energy efficiency
via error resiliency. SEC is an application of statistical estimation and detection techniques,
which form the foundation of communication systems, to the problem of robust and energy-
efficient information processing. In particular, stochastic networked computation, which
is based on Huber’s robust estimation theory, is particularly well suited for the PN code
acquisition filter. In this dissertation, we have approximated the one-step Huber algorithm
with the median block. However, there exists further possibility of reducing the fusion block
complexity with no loss in performance. Given the PN code acquisition application, the
thresholding stage and the fusion stage are interchangeable. By doing so, the median fusion
block reduces to N single bit adders followed by a thresholding operation with a threshold
at N/2, where N is the number of sensors.
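The interchange of the fusion and thresholding stages can be checked numerically with a small sketch. It assumes that a detection is declared when a value strictly exceeds the threshold and that the median is the lower of the two central values for even N; the function names and the random test data are hypothetical.

```python
import random

def lower_median(values):
    """Lower of the two central values (the median itself for odd lengths)."""
    ordered = sorted(values)
    return ordered[(len(ordered) - 1) // 2]

def detect_median_then_threshold(outputs, threshold):
    """Fuse with the median first, then compare against the threshold."""
    return lower_median(outputs) > threshold

def detect_threshold_then_count(outputs, threshold):
    """Threshold each sensor to one bit, add the bits, and compare the sum to N/2."""
    votes = sum(1 for y in outputs if y > threshold)
    return votes > len(outputs) / 2

rng = random.Random(1)
for _ in range(1000):
    outputs = [rng.randint(-128, 127) for _ in range(64)]
    threshold = rng.randint(-128, 127)
    assert (detect_median_then_threshold(outputs, threshold)
            == detect_threshold_then_count(outputs, threshold))
print("both orderings agree on all trials")
```

The two orderings agree because the lower median exceeds the threshold exactly when more than N/2 of the sensor outputs do.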
The application of SEC requires a good understanding of system-level requirements, which
need to be integrated with architecture and circuit design issues. A number of energy-efficient
kernels (stochastic kernels) for domain- or application-specific algorithms can be developed
based on SEC. Such accelerator cores can comprise a future system-on-a-chip platform on
nanoscale CMOSs, where energy efficiency and robustness are jointly achieved through a
pervasive application of SEC.
Chapter 3
Statistical Error Compensation for Low-Density Parity
Check Codes
Probabilistic graphical models (PGMs) are commonly used to perform inference in machine
learning algorithms used in applications such as classification and parameter estimation.
PGMs are usually learned using empirical (and potentially incomplete and noisy) sets of
data. Since these models are learned from noisy or incomplete data sets, they already have
slight perturbations from the exact underlying model. Thus, PGMs are inherently resilient
to some degree of computational errors, and tend to have superior and highly accurate
classification performance. Belief propagation (BP) is a message-passing based algorithm widely
used for inference on these probabilistic graphical models.
Low-density parity check (LDPC) codes are primarily decoded using BP. They are widely
adopted in communication standards, such as 802.11 Wi-Fi and DVB-S2 satellite transmission
of digital television, and are considered for many 4G systems including WiMax (IEEE Std
802.16e) [75–77] due to their excellent error correction performance. However, the decoding
complexity of LDPC codes is quite large and low-power LDPC decoders are required to
satisfy the power constraints of wireless handsets. Analog decoder architectures have been
proposed for short-length codes [78]; yet, scaling the code length to more than 250 will be
challenging due to device mismatch and buffering requirements. Digital low-power LDPC
decoder architectures mostly focus on reducing the decoding complexity through early ter-
mination or approximation [79].
Previous work on error-resilient LDPC decoders has protected the sign bit of decoding
messages (see Section 3.2.1) or employed triple modular redundancy (TMR) [80]. However,
the error resiliency provided by sign bit protection (SBP) alone has limitations at high error
rates (more than 3% errors per clock cycle), and TMR has very high overhead.
Recently, studies on erroneous LDPC decoders have been performed. Varshney [81] shows
that the performance of a Gallager A decoder subject to random bit flips cannot achieve
arbitrarily small probability of error, contrary to the results of Shannon’s channel capacity
theorem [8]. Yazdi et al. also show similar results for the Gallager B decoder [82]. However,
these decoders are hard decision based decoders that do not possess the inherent resiliency
of BP. For a BP based decoder, Varshney [81] has shown that for an additive white Gaussian
noise (AWGN) channel, if the decoding errors are bounded, arbitrarily small probability of
error is achievable. Similarly, according to this principle, quantized versions of BP show
little degradation in performance [83–85]. Such works show that BP based LDPC decoders
possess inherent resiliency, and are a good candidate for SEC. A general error analysis on BP
based message passing algorithms is performed in [86] for problems on distributed inference.
In this chapter, we apply SEC to present an energy-efficient BP based binary LDPC
decoder. Our focus is on timing errors, as the sources of unreliability in deep-scaled CMOS,
including PVT variations, all manifest as output delay variation. One characteristic of timing
errors in DSP and communication systems is that the error magnitude is large, as most
systems are designed to compute LSB first. SEC exploits the system-level inherent resiliency
by converting large-magnitude timing errors into small tolerable errors. We evaluate the
energy efficiency vs. robustness trade-offs of the SEC based LDPC decoder in a 45 nm
process technology. Once again, energy efficiency is obtained by trading reliability for energy
savings via voltage overscaling (VOS), which deliberately induces timing errors. Figure 3.1
shows the energy vs. Vdd for a variable node, a block used in decoding LDPC codes (see Fig.
3.4(a)), for a conventional error-free design, and a VOS design that targets an
error rate of 70%. If the VOS-induced errors are fully compensated for, significant energy
savings, up to 70%, can be achieved.
The chapter is organized as follows. Section 3.1 provides background information on LDPC
codes. Section 3.2 describes the LDPC decoder architecture in detail. The simulation setup
including energy and error modeling is presented in Section 3.3. Simulation results are given
in Section 3.4, with Section 3.5 summarizing the chapter.
Figure 3.1: Supply voltage vs. energy and delay of an LDPC variable node (Fig. 3.4(a)) in
a commercial 45 nm process. By voltage overscaling up to an error rate of 70%, the same
performance can be achieved with 70% less energy.
3.1 Low-Density Parity Check Codes
Low-density parity check (LDPC) codes, first introduced by Gallager [87], are error-correcting
codes used in various communication systems. The code is very powerful in the sense that
it can achieve the Shannon capacity for additive white Gaussian noise (AWGN) channels.
More interestingly, the decoding of LDPC codes is based on a message-passing algorithm
implemented over a factor graph. LDPC codes are linear block codes based on a sparse
parity check matrix H. Let H be a binary r × n matrix. In coding theory, any vector c of
length n is a valid codeword if Hc^T = 0. This parity check matrix gives rise to the bipartite
factor graph, where there are r check nodes and n variable nodes. The graph is connected
in a way such that if the entry (i, j) of H is 1 then the ith check node is connected to the
jth variable node. An example parity check matrix and its factor graph are depicted in Fig.
3.2. Each column of H represents a code bit (variable node), while each row represents
a parity check constraint (check node).
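The codeword test and the construction of the factor graph can be sketched as follows. The 4 × 8 matrix uses the rows as they appear in Fig. 3.2(a); the row order and the helper names are assumptions made for illustration.

```python
# Rows of the example parity check matrix (illustrative; taken from Fig. 3.2(a)).
H = [
    [0, 1, 0, 1, 1, 0, 0, 1],
    [1, 1, 1, 0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0, 1, 1, 1],
    [1, 0, 0, 1, 1, 0, 1, 0],
]

def syndrome(H, c):
    """Return H c^T over GF(2) as a list of parity bits, one per check node."""
    return [sum(h * x for h, x in zip(row, c)) % 2 for row in H]

def is_codeword(H, c):
    """A vector is a valid codeword when every parity check is satisfied."""
    return not any(syndrome(H, c))

def factor_graph(H):
    """Map each check node index to the variable node indices connected to it."""
    return {i: [j for j, h in enumerate(row) if h] for i, row in enumerate(H)}

print(is_codeword(H, [0, 0, 0, 1, 1, 0, 0, 0]))
print(syndrome(H, [1, 0, 0, 0, 0, 0, 0, 0]))
print(factor_graph(H)[0])
```

Indices in the sketch are zero-based, so check node 0 corresponds to the first row of H.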
Figure 3.2: Example LDPC code: (a) parity check matrix, and (b) its bipartite graph of check
nodes and variable nodes. The matrix in (a) is
\[
H = \begin{bmatrix}
0 & 1 & 0 & 1 & 1 & 0 & 0 & 1 \\
1 & 1 & 1 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 1 & 1 & 1 \\
1 & 0 & 0 & 1 & 1 & 0 & 1 & 0
\end{bmatrix}
\]
Though decoding of an LDPC code is NP-complete, BP based iterative decoding algorithms
have shown good performance. In each iteration, the variable nodes combine the
intrinsic information obtained from the received signal and the extrinsic information supplied
by the check nodes to update their beliefs on whether each bit is 0 or 1. These beliefs are then
propagated to the check nodes. The check nodes combine the beliefs of neighboring variable
nodes and update each node’s belief based on the parity constraint. This process is iterated
continuously until the beliefs converge and a hard decision is made at the variable nodes.
The following notation will be used: Pi is the probability of bit ci = 1, given the channel
output yi, i.e., Pr(ci = 1|yi). Denote qij(·) as the message sent from variable node vi to
check node cj, and rji(·) as the message sent from check node cj to variable node vi. The
variable-to-check messages, qij(0) and qij(1), are proportional to the probabilities Pr(vi = 0)
and Pr(vi = 1), respectively. The check-to-variable messages, rji(0) and rji(1), are the
probabilities of the jth check being satisfied when vi = 0 and vi = 1, respectively. Since
vi ∈ {0, 1}, qij(0) = 1 − qij(1). The check nodes compute, on average, the probability that
an even number of 1’s are observed through the following equations:
\[
r_{ji}(0) = \frac{1}{2} + \frac{1}{2} \prod_{i' \in V_j \setminus i} \bigl(1 - 2 q_{i'j}(1)\bigr) \tag{3.1}
\]
\[
r_{ji}(1) = 1 - r_{ji}(0) \tag{3.2}
\]
where Vj\i denotes all the variable nodes connected to check node cj excluding variable node vi.
This follows directly from Gallager [87], where the probability of a sequence having an even
number of 1’s is computed as described in Theorem 3.1.
Theorem 3.1. Given a sequence of M independent binary digits $\{a_i\}_{i=1}^{M}$, with the
probability of each digit being 1 denoted as $p_i = \Pr(a_i = 1)$, the probability that the
sequence contains an even number of 1’s is
$P_M = \frac{1}{2} + \frac{1}{2} \prod_{i=1}^{M} (1 - 2p_i)$.
Proof. The proof follows by induction. Assume this is true for M = k, i.e., the probability
that a sequence with M = k independent binary digits has an even number of 1’s is
$P_k = \frac{1}{2} + \frac{1}{2} \prod_{i=1}^{k} (1 - 2p_i)$. Then, for M = k + 1,
\begin{align*}
P_{k+1} &= P_k (1 - p_{k+1}) + (1 - P_k)\, p_{k+1} \\
&= \left\{ \frac{1}{2} + \frac{1}{2} \prod_{i=1}^{k} (1 - 2p_i) \right\} (1 - p_{k+1})
 + \left\{ \frac{1}{2} - \frac{1}{2} \prod_{i=1}^{k} (1 - 2p_i) \right\} p_{k+1} \\
&= \frac{1}{2} + \frac{1}{2} \prod_{i=1}^{k} (1 - 2p_i)
 - \frac{1}{2} \prod_{i=1}^{k} (1 - 2p_i)\, p_{k+1}
 - \frac{1}{2} \prod_{i=1}^{k} (1 - 2p_i)\, p_{k+1} \\
&= \frac{1}{2} + \frac{1}{2} \prod_{i=1}^{k+1} (1 - 2p_i)
\end{align*}
Thus, this is also true for M = k + 1.
For M = 1,
\[
P_1 = \Pr(a_1 = 0) = 1 - p_1 = \frac{1}{2} + \frac{1}{2} \prod_{i=1}^{1} (1 - 2p_i)
\]
Thus, this is also true for M = 1. By induction, this is true for all M ≥ 1, completing the
proof.
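As a numerical sanity check of Theorem 3.1, the closed form can be compared with a brute-force sum over all 2^M outcomes. The probabilities below are arbitrary illustrative values and the function names are hypothetical.

```python
from itertools import product
from math import isclose, prod

def even_parity_closed_form(p):
    """Probability of an even number of 1's using the closed form of Theorem 3.1."""
    return 0.5 + 0.5 * prod(1 - 2 * pi for pi in p)

def even_parity_brute_force(p):
    """Same probability obtained by summing over all 2^M outcomes."""
    total = 0.0
    for bits in product((0, 1), repeat=len(p)):
        if sum(bits) % 2 == 0:
            total += prod(pi if b else 1 - pi for pi, b in zip(p, bits))
    return total

p = [0.1, 0.35, 0.8, 0.5001, 0.27]   # arbitrary digit probabilities
assert isclose(even_parity_closed_form(p), even_parity_brute_force(p))
print(even_parity_closed_form(p))
```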
The variable node then computes its message by
\[
q_{ij}(0) = (1 - P_i) \prod_{j' \in C_i \setminus j} r_{j'i}(0) \tag{3.3}
\]
\[
q_{ij}(1) = P_i \prod_{j' \in C_i \setminus j} r_{j'i}(1) \tag{3.4}
\]
This is just multiplying all the beliefs from the check node of being 0 to obtain the final belief
of the variable being 0 and likewise for 1. In a more practical implementation of LDPC,
instead of tracking two beliefs, only one message is passed, the log likelihood ratio (LLR),
which is the log of the ratio of the two messages, $\log \frac{q_{ij}(1)}{q_{ij}(0)}$ and
$\log \frac{r_{ji}(1)}{r_{ji}(0)}$. Taking the log ratio of (3.2) to (3.1) and (3.4) to (3.3) we get
\[
m_{ij} = \log \frac{q_{ij}(1)}{q_{ij}(0)}
 = \log \left( \frac{P_i}{1 - P_i} \right) + \sum_{j' \in C_i \setminus j} n_{j'i} \tag{3.5}
\]
\begin{align}
n_{ji} = \log \frac{r_{ji}(1)}{r_{ji}(0)}
 &= \log \frac{1 + \prod_{i' \in V_j \setminus i} \tanh(m_{i'j}/2)}
              {1 - \prod_{i' \in V_j \setminus i} \tanh(m_{i'j}/2)} \tag{3.6} \\
 &= \psi^{-1} \Bigl( \sum_{i' \in V_j \setminus i} \psi(m_{i'j}) \Bigr) \tag{3.7}
\end{align}
where mij and nji are the LLR of the variable to check node and check node to variable node
messages, respectively, and ψ(x) = − log tanh(|x|/2). Using the max-log approximation [88],
i.e., $\log \sum_k z_k \approx \max_k \log z_k$, the check node to variable node message (3.7)
can be further approximated as
\[
n_{ji} \approx \Bigl( \min_{i' \in V_j \setminus i} |m_{i'j}| \Bigr)
 \prod_{i' \in V_j \setminus i} \operatorname{sgn}(m_{i'j}) \tag{3.8}
\]
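The relation between the check node update (3.6) and its min-sum approximation (3.8) can be illustrated with a short sketch. It evaluates both expressions exactly as written above for one check node; the incoming LLR values are hypothetical and the function names are chosen for illustration.

```python
import math

def check_node_exact(m):
    """Outgoing messages per (3.6), using the tanh product over the other edges."""
    out = []
    for i in range(len(m)):
        p = math.prod(math.tanh(x / 2) for k, x in enumerate(m) if k != i)
        out.append(math.log((1 + p) / (1 - p)))
    return out

def check_node_min_sum(m):
    """Outgoing messages per (3.8): minimum magnitude times the product of signs."""
    out = []
    for i in range(len(m)):
        others = [x for k, x in enumerate(m) if k != i]
        sign = math.prod(1 if x >= 0 else -1 for x in others)
        out.append(sign * min(abs(x) for x in others))
    return out

incoming = [1.2, -0.4, 2.5, -3.1]   # hypothetical variable-to-check LLRs
exact = check_node_exact(incoming)
approx = check_node_min_sum(incoming)
for e, a in zip(exact, approx):
    print(f"exact {e:+.3f}   min-sum {a:+.3f}")
```

The approximation keeps the sign of each message and never underestimates its magnitude, which is why it is a common low-complexity substitute for the lookup-table form (3.7).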
Figure 3.3: High-level block diagram of the LDPC decoder.
3.2 LDPC Decoder Architecture
High-throughput LDPC decoders require parallel computation of messages. We implement a
fully parallel LDPC decoder, with the computation of variable and check nodes multiplexed
in time, i.e., at one clock cycle variable nodes operate in parallel, and at the next clock cycle
check nodes operate in parallel. The high-level architecture along with node architectures is
shown in Figs. 3.3 and 3.4.
The variable node implements (3.5). It takes the prior information given by the received
signal y, and the messages incoming from the check nodes, and sums them up. A hard deci-
sion block extracts the sign bit of the computed messages and outputs them as the decoded
bits b̂. The check node implements (3.7). The ψ(·) and ψ⁻¹(·) functions are implemented as
a lookup table via piecewise linear approximation.
3.2.1 Error resiliency
The original LDPC decoder is inherently robust to errors due to its iterative message-passing
decoding algorithm. However, errors in the sign bit have been shown to be detrimental. In
our work, we employ SBP as our baseline LDPC decoder. The sign bit is computed separately
via the max-log approximation in (3.8). The critical path for the sign bit computation is
significantly shorter than that of the magnitude computation, and the sign bit will experience
errors only at significant VOS. In the proposed error-resilient LDPC decoder, SEC is added
to the variable node and check node via ANT. The estimator is a reduced precision replica
of the original variable and check node. For the 8-bit precision main block, three estimators
are employed with 2, 3, and 4 bits of precision, respectively.
Figure 3.4: Architecture of nodes: (a) variable node, and (b) check node.
3.3 Energy, Delay, and Error Modeling
3.3.1 Energy model
We estimate the energy consumption of the LDPC decoder by its constituent blocks. The
total energy of the LDPC decoder ELDPC is given by
\[ E_{LDPC} = N_{var}E_{var} + N_{check}E_{check} + N_{wire}E_{wire} \tag{3.9} \]
where Nvar, Ncheck, and Nwire are the total number of variable nodes, check nodes, and
interconnect wires, respectively, and Evar, Echeck, and Ewire are the energy consumptions of
a single variable node, check node, and interconnect wire, respectively. To obtain Evar and
Echeck, a single variable node and check node, as shown in Fig. 3.4, is implemented in Verilog
and synthesized using a commercial 45 nm standard cell library. The synthesized netlist is
then used to extract a SPICE netlist, which is used for circuit simulations. The energy vs.
Vdd plot and the delay vs. Vdd are shown in Fig. 3.5 for both variable node and check node.
Figure 3.5: Energy consumption and delay curves obtained through circuit simulation of
the variable and check node architectures in Fig. 3.4, synthesized in a commercial 45 nm
process.
A three-wire distributed RC network is used for the interconnect model as shown in Fig.
3.6(a). Only coupling between adjacent wires is considered, and the energy consumed by
the middle wire is averaged over all possible transitions. The wire was assumed to be routed
on metal 4 with a length of 200µm. The values for R, CC , and CG were obtained from the
design manual of a commercial 45 nm process. Figure 3.6(b) shows the energy consumption
obtained through circuit simulations. As the wire delays were significantly shorter than the
bit node and check node delays, the interconnect is assumed to be error-free. The energy
values obtained are comparable to those obtained from the bus energy consumption model
in [89].
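As a small worked instance of the energy model (3.9), the Python sketch below evaluates the total decoder energy from per-block energies. All numeric values are placeholders chosen only to show the arithmetic; they are not measured or simulated results from this work, and the wire count in particular is an assumption.

```python
def ldpc_energy(n_var, n_check, n_wire, e_var, e_check, e_wire):
    """Total decoder energy per (3.9): block counts times per-block energy."""
    return n_var * e_var + n_check * e_check + n_wire * e_wire


# Placeholder inputs (hypothetical): 800 variable nodes, 400 check nodes,
# 2400 wires, and per-block energies in joules at one supply voltage.
total = ldpc_energy(
    n_var=800, n_check=400, n_wire=2400,
    e_var=2e-12, e_check=4e-12, e_wire=4e-13,
)
print(f"{total:.3e} J")
```

Repeating the evaluation with per-block energies read off at each supply voltage gives the decoder energy as a function of Vdd.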
Figure 3.6: Interconnect energy for a 200µm wire: (a) distributed RC model, and (b)
average energy vs. supply voltage curve.
Figure 3.7: Methodology for simulating the LDPC decoder architecture under VOS.
Energy models can be obtained as well.
3.3.2 Error modeling
To simulate input-dependent timing errors, gate-level simulations using an HDL simulator
are performed. First the gate delay is characterized with respect to supply voltage for basic
gates such as a full adder and XOR, using a circuit-level simulator as in Section 3.3.1. Then
a structural HDL implementation of the LDPC decoder is simulated via an HDL simulator
using the pre-characterized delay values. By choosing the delay values that correspond
to various supply voltages, HDL simulation is effectively run at different voltages, and for
Vdd < Vdd,crit, errors can be observed in the outputs. The complete simulation methodology
is summarized in Fig. 3.7.
3.4 Simulation Results
Simulations of an (800, 400) LDPC code were performed at various Vdd and SNRs.
Figure 3.8: BER vs. SNR plot of an (800, 400) LDPC code decoded with five iterations at
pη = 0.2.
3.4.1 BER performance
The BER performance of the LDPC code for an error rate of pη = 0.2 is shown in Fig. 3.8.
It can be seen that SBP schemes only work until an SNR of 1.8 dB and fail completely at
SNR > 3 dB. ANT schemes are significantly more robust, with ANT-2b (ANT with
a 2-bit estimator precision) breaking down at SNR > 7 dB, ANT-3b breaking down at SNR
> 8 dB (not shown in figure), and ANT-4b retaining performance close to error-free LDPC
for SNR as high as 10 dB. The performance breakdown at high SNR can be attributed to
the error floor. The erroneous outputs created by ANT contribute to raising the error floor,
and the decoder hits this wall at a lower SNR than an error-free decoder.
3.4.2 Robustness
Figure 3.9 shows the change in BER as Vdd is reduced. It can be seen that even with no
error correction, the LDPC decoder maintains performance up to pη = 2 × 10^−3. However,
its performance degrades rapidly when subject to higher error rates. SBP is able to tolerate
higher error rates of up to pη = 7 × 10^−3, but ANT is still much more powerful, achieving
acceptable performance for up to pη = 0.2. As a slight degradation in BER is tolerable in
most systems (as compared to an order-of-magnitude change), we have chosen several BER
thresholds, 2 to 5 times the BER of the error-free case, and found the tolerable error rate for
each scheme. The results are summarized in Table 3.1. It can be seen that ANT shows up to
30.7× more robustness than SBP.
Figure 3.9: BER vs. pη at SNR = 5 dB.
Table 3.1: HW error rate that can be tolerated by SEC based LDPC decoder at a given
BER threshold.
BER threshold       2×       3×       4×       5×
no correction     0.0060   0.0097   0.0134   0.0171
SBP               0.0087   0.0114   0.0140   0.0167
ANT-4b            0.2873   0.3503   0.4134   0.4839
ANT-2b            0.0355   0.0513   0.0672   0.0830
(ANT-4b/SBP)      4.08     30.73    29.53    28.98
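The ratio row of Table 3.1 can be recomputed from the tabulated tolerable error rates. The Python sketch below does so; the dictionary layout and the function name are illustrative choices and not part of the original work.

```python
# Tolerable HW error rates from Table 3.1, ordered by BER threshold 2x..5x.
tolerable = {
    "no correction": [0.0060, 0.0097, 0.0134, 0.0171],
    "SBP":           [0.0087, 0.0114, 0.0140, 0.0167],
    "ANT-4b":        [0.2873, 0.3503, 0.4134, 0.4839],
    "ANT-2b":        [0.0355, 0.0513, 0.0672, 0.0830],
}


def robustness_ratio(scheme, baseline="SBP"):
    """Element-wise quotient of tolerable error rates, rounded to 2 decimals."""
    return [round(a / b, 2)
            for a, b in zip(tolerable[scheme], tolerable[baseline])]


print(robustness_ratio("ANT-4b"))
print(robustness_ratio("ANT-2b"))
```

Recomputed this way, the 3× to 5× quotients reproduce the printed row (30.73, 29.53, 28.98). The 2× quotient of the ANT-4b and SBP entries evaluates to about 33.0, whereas the printed 4.08 coincides with the ANT-2b/SBP quotient at that threshold.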
Figure 3.10: Energy vs. BER plot of an (800, 400) LDPC code decoded with five iterations
at SNR = 5 dB.
3.4.3 Energy savings
Energy savings of various schemes are compared at the same BER performance. ANT
schemes are able to achieve the same BER performance at a significantly lower Vdd and thus
result in energy savings. Figure 3.10 shows that ANT-4b can achieve up to 45.7% energy
savings compared to the error-free conventional LDPC decoder, and up to 33.2% energy
savings compared to the erroneous conventional LDPC decoder. This is in addition to the
30× enhanced robustness.
3.5 Summary
In this chapter, we applied SEC to LDPC decoders and achieved up to 30× enhancement in
robustness and 45.7% energy savings compared to conventional LDPC decoders. Compared
to conventional SBP based design, significant enhancement in reliability is achieved. Future
work includes analysis of the effect of HW errors on decoding performance and the propagation
of errors along the decoding graph. Based on this understanding, the overhead of SEC
can be reduced to achieve greater energy savings. Possible methods are to apply SEC to a
partial set of nodes that are critical, while simpler error resiliency techniques, such as SBP,
are applied to the remaining nodes.
Chapter 4
Statistical Error Compensation for Stereo Image
Matching
Machine learning based inference has recently gained importance as a key kernel in processing
massive data in signal processing systems, including computer vision and speech recognition.
Such applications contain an inference kernel that is inherently resilient to small magnitude
errors [90, 91]. By combining the statistical performance metric of machine learning appli-
cations and statistical nature of circuit non-idealities, statistical error compensation (SEC)
techniques [24] can achieve significant enhancement in error resiliency [90,91]. By trading off
the increased robustness with energy reduction, significant energy savings can be achieved as
well. In Chapter 3, algorithmic noise tolerance (ANT) has been applied to a message-passing
based low-density parity check (LDPC) decoder and shown to achieve 45.7% energy savings
while maintaining less than 4.7 dB degradation in bit error rate (BER) at a HW error rate,
pη (percentage of clock cycles in which an erroneous output exists), of 30%. In this chapter,
an architecture implementing a sequential tree-reweighted (TRW-S) Markov random field
(MRF) message-passing based stereo matching [92] will be shown to achieve similar results,
with an energy savings of 41%, a bad-pixel ratio (BPR) degradation of less than 3.5%, at
a HW error rate of 21.3%. This indicates that message-passing or belief propagation based
algorithms are inherently tolerant to estimation errors, p, within the ANT estimator, and
by properly exploiting this tolerance, ANT can provide significant benefits.
We first study the performance of the TRW-S architecture when subject to nanometer
imperfections to design an energy-efficient hardware implementation. Our goal is to analyze
the inherent error resiliency of message-passing based inference algorithms and to enhance
its error resiliency via application of SEC. The enhanced error resiliency can then be traded
off for energy efficiency by exploiting it to tolerate hardware errors generated by relaxed
hardware implementation or relaxed operating conditions, resulting in energy savings. In
this chapter, the work of Chapter 3 is extended to a more general case of MRF inference,
i.e., TRW-S based stereo matching, with a more complex graph and a larger hypothesis
space. The TRW-S based stereo matching hardware architecture (TRWS_HW) from [92]
is taken as a test case, and its error resilience, as well as the impact of ANT techniques, is
explored. Results show that the ANT based hardware can tolerate an error rate of 21.3%
with a performance degradation of only 3.5% and an energy savings of 41% compared to
error-free hardware. However, such arithmetic-level error compensation incurs a significant
overhead of 44.7%.
To reduce the overhead of arithmetic-level error compensation, ANT is further applied at
the iteration and system level for the TRW-S stereo matching system [93]. Iteration-level
compensation removes all correction overhead except at the final latch of each functional
block within the message-passing computation. At the system level, a CPU was used to aid
in the estimation and correction of the final stereo matching output (i.e., depth map). This
approach may seem similar to the existing error-resilient system architecture (ERSA) [32]
paradigm, but is significantly different because an application-specific accelerator is used and
permitted to make hardware errors, which are compensated via statistical estimation and
detection techniques. A hybrid approach combining both iteration-level and system-level
compensation shows promising enhancement in error resiliency with reduced overhead. Re-
sults show that, compared to arithmetic-level, system-level compensation reduces overhead
by more than 50%, while maintaining stereo matching performance with only 2.5% degra-
dation. These results are verified via FPGA emulation with timing errors induced within
the message-passing unit via relaxed synthesis.
The remainder of the chapter is organized as follows. Section 4.1 provides background
information on the TRW-S stereo matching algorithm. Section 4.2 describes the error-
resilience characteristic of the message-passing hardware in detail. The error compensation
methodology is discussed in Section 4.3. The simulation methodology and emulation setup,
including the generation of HW errors for TRW-S, are discussed in Section 4.4. Section 4.5 shows
the results, while Section 4.6 summarizes the chapter.
4.1 TRW-S Message-Passing Based Stereo Matching
Stereo matching infers depth information from two horizontally shifted images. It can be
restated as a maximum a posteriori (MAP) problem, where each pixel is given a label,
ls ∈ {0, . . . , Dmax}, that corresponds to a discrete depth level and the goal is to find the
most probable label assignments for all the pixels of an image. The depth level represents
the proximity of the pixel to the observation point with 0 being the closest and Dmax the
farthest. Typical values of Dmax are 15 and 63. This MAP problem can be formulated in
terms of the cost functions defined on an undirected grid graph (i.e., a grid MRF) with nodes
(V ) and edges (E) as follows [94,95]:
\[ \min_{l} E(l) = \min_{l}\Bigg[\sum_{s\in V} d_s(l_s) + \sum_{(s,t)\in E} V_{st}(l_s, l_t)\Bigg] \tag{4.1} \]
Here, a unary cost function ds(ls) represents the likelihood of a depth label ls being assigned
to a node s, and a pairwise (smoothness) cost function Vst(ls, lt) models, as prior preference,
the closeness (smoothness) among neighboring nodes. Note that Vst(ls, lt) is a truncated
function to allow large difference in labels between adjacent nodes. E(l) is referred to as
energy, and thus (4.1) is called an energy minimization problem; the quality of a label
assignment is higher if the obtained energy is lower, and thus E(l) is used as a metric to
gauge the quality of the results obtained by inference. In general, this energy minimization
problem is intractable since the complexity of the problem grows exponentially as the size
of the graph G(V,E) is increased.
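To make the energy in (4.1) concrete, the following Python sketch evaluates E(l) on a small 4-connected grid and finds the minimum by exhaustive search. It is an illustration under assumptions: the truncated linear form min(λ|ls − lt|, v_max) of the pairwise cost, the parameter values, and the toy unary costs are all hypothetical. The exhaustive search visits every labeling, which is exactly the exponential growth that makes the general problem intractable.

```python
from itertools import product


def pairwise_cost(ls, lt, lam=1, v_max=2):
    """Truncated linear smoothness cost (assumed form of V_st)."""
    return min(lam * abs(ls - lt), v_max)


def mrf_energy(labels, unary, lam=1, v_max=2):
    """E(l) per (4.1): unary costs plus pairwise costs over grid edges.
    Each edge is counted once (right and down neighbors only)."""
    h, w = len(labels), len(labels[0])
    energy = 0
    for y in range(h):
        for x in range(w):
            ls = labels[y][x]
            energy += unary[y][x][ls]
            if x + 1 < w:
                energy += pairwise_cost(ls, labels[y][x + 1], lam, v_max)
            if y + 1 < h:
                energy += pairwise_cost(ls, labels[y + 1][x], lam, v_max)
    return energy


def brute_force_min(unary, num_labels, lam=1, v_max=2):
    """Exhaustive search over all num_labels**(h*w) labelings."""
    h, w = len(unary), len(unary[0])
    best = None
    for flat in product(range(num_labels), repeat=h * w):
        labels = [list(flat[r * w:(r + 1) * w]) for r in range(h)]
        e = mrf_energy(labels, unary, lam, v_max)
        if best is None or e < best[0]:
            best = (e, labels)
    return best


# Toy 2x2 grid with 3 labels; unary[y][x][label] values are made up.
unary = [[[0, 5, 9], [4, 0, 6]],
         [[7, 1, 0], [3, 2, 0]]]
print(brute_force_min(unary, num_labels=3))
```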
Message-passing algorithms have been shown to efficiently solve (4.1) by selecting the best
label for each node based on local information from its neighbors (i.e., messages). Messages
between nodes of an MRF are updated repeatedly (iterations) until they converge to a fixed
value. The final label is then assigned based on the converged messages. In general, message-
passing algorithms such as belief propagation (BP) [94, 95] do not guarantee convergence if
a loop exists within the graph. Tree-reweighted message passing (TRW-S) performs tree
based sequential message passing which guarantees subsequent convergence to the lower
local minimum energy, and shows superior energy minimization performance compared to
BP in practice [96–98].
In case of a grid-MRF, message passing of TRW-S from node s to t can be performed in
two steps, a unary and pairwise computation as follows [97]:
\[ H_{st}(l_s) = \frac{1}{2}\Bigg(d_s(l_s) + \sum_{u\in \mathrm{Nb}(s)} M_{us}(l_s)\Bigg) - M_{ts}(l_s) \tag{4.2} \]
\[ M_{st}(l_t) = \min_{l_s}\{H_{st}(l_s) + V_{st}(l_s, l_t)\} \tag{4.3} \]
where Hst(ls) represents the updated belief of node s, Mst(lt) denotes the message from
node s to t, Nb(s) is the set of all neighbors of node s, and ∑_{u∈Nb(s)} Mus(ls) indicates
accumulation of messages from neighbors of node s. The unary computation (4.2) and the
pairwise computation (4.3) are called reparameterization (REPARAM) and message update
(MSG_UPD), respectively. In this chapter, we use the hardware architecture of [92], where
this two-step message passing of TRW-S is pipelined and executed in a streaming manner
(see Fig. 4.1). FIFOs are used to handle streaming data access to the main memory in order
to feed the pipelined message-passing units without many stalls. The REPARAM unit is a
simple hardware realization of (4.2), while for MSG_UPD, a parallel message construction
technique [99] is exploited to reduce complexity of the pairwise computation in (4.3), as
shown in Fig. 4.1(c). In addition, to avoid overflow, MSG_UPD includes a normalization
step, which rescales the value of messages so that the minimum message is always zero.
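The two-step update in (4.2) and (4.3), followed by normalization, can be sketched in Python as below. This is a direct O(L²) evaluation for illustration, not the parallel message construction technique of [99] and not the pipelined hardware; the function names, the truncated linear pairwise cost, and the example values are assumptions.

```python
def reparam(d_s, incoming, m_ts):
    """REPARAM per (4.2). `incoming` holds the messages from all
    neighbors of s (including t); `m_ts` is the message from t to s."""
    n = len(d_s)
    return [0.5 * (d_s[l] + sum(m[l] for m in incoming)) - m_ts[l]
            for l in range(n)]


def msg_update(h_st, lam, v_max):
    """MSG_UPD per (4.3) with an assumed truncated linear V_st, then
    normalization so that the minimum message entry is zero."""
    n = len(h_st)
    raw = [min(h_st[ls] + min(lam * abs(ls - lt), v_max) for ls in range(n))
           for lt in range(n)]
    lo = min(raw)
    return [m - lo for m in raw]


d_s = [4, 0, 6, 2]          # unary costs for 4 labels (made-up values)
m_us = [0, 2, 1, 3]         # message from another neighbor u
m_ts = [1, 0, 0, 2]         # message from the destination node t
h = reparam(d_s, [m_us, m_ts], m_ts)
print(h)
print(msg_update(h, lam=1, v_max=2))
```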
4.2 Error-Resilient TRW-S via ANT
It is well known in the literature (e.g., [90]) that iterative message-passing algorithms such
as TRW-S are intrinsically robust to errors in computation. In this section, we analyze the
error propagation of TRW-S as a test case to understand the basis of the inherent error
robustness. Based on our findings, ANT is applied to further enhance the error robustness.
Figure 4.1: Architecture of streaming TRW-S hardware (TRWS_HW): (a) block diagram,
(b) reparameterize unit, and (c) message update unit.
4.2.1 Error analysis of TRW-S
The error propagation of TRW-S is analyzed starting from (4.2) and (4.3). For simplicity,
we assume binary labels {0, 1} for all the nodes and the Potts model [96] (with penalty of
C) for the pairwise cost, i.e., the cost is 0 if neighboring labels are equal, and C if
they differ. The message update (4.3) for ls ∈ {0, 1} can then be represented as follows:
\[ M_{st}(l_s = 0) = \min\{H_{st}(0),\; H_{st}(1) + C\} \tag{4.4} \]
\[ M_{st}(l_s = 1) = \min\{H_{st}(0) + C,\; H_{st}(1)\} \tag{4.5} \]
Note that other message-passing algorithms with a truncated smoothness cost (e.g., BP
in [94,95]) can also be represented this way.
To see the effect of arithmetic error on the result of the final message, we assume that
arithmetic errors occur in add or subtract (AS) and compare and select (CS) operations. An
erroneous message for label 0, M̃st(0), can be represented as follows:
\[ \tilde{M}_{st}(0) = \min\{H_{st}(0) + \eta_{AS}(0),\; H_{st}(1) + \eta_{AS}(1) + C\} + \eta_{CS} \tag{4.6} \]
where ηAS(0) and ηAS(1) represent the error propagated from neighbors as well as any
arithmetic errors that occurred during computation of Hst, and ηCS indicates errors that
occur in CS. By setting ηAS = ηAS(0) − ηAS(1), and η̃CS = ηCS + ηAS(1), M̃st(0) can be
rewritten as
\[ \tilde{M}_{st}(0) = \min\{\overbrace{H_{st}(0) + \eta_{AS}}^{\text{arg1}},\; \overbrace{H_{st}(1) + C}^{\text{arg2}}\} + \tilde{\eta}_{CS} = M_{st}(0) + \tilde{\eta}_{AS} + \tilde{\eta}_{CS} \tag{4.7} \]
where η̃AS and η̃CS are the effective error in the message Mst generated by AS and CS,
respectively. We will refer to Hst(0) + ηAS and Hst(1) + C as arg1 and arg2, respectively,
and define ∆H ≜ Hst(1) + C − Hst(0).
First, assume that Hst(0) < Hst(1) + C, i.e., ∆H > 0. Then, when ηAS is small, such
that ηAS < ∆H, arg1 will be chosen as the minimum and η̃AS = ηAS. Once ηAS ≥ ∆H,
min operation will now choose arg2 and η̃AS will be fixed to ∆H (see Fig. 4.2(a)). Next
assume that Hst(0) > Hst(1) + C, i.e., ∆H < 0. In this case, if ηAS ≥ ∆H, arg2 will be
selected as the minimum, which is the correct minimum, and η̃AS = 0. When ηAS < ∆H,
min operation will select arg1, and η̃AS = ηAS − ∆H (see Fig. 4.2(b)). Note that the effect
of ηAS depends only on ∆H, which, in turn, only depends on the error-free computation of
Hst. Based on Fig. 4.2, we can deduce two error characteristics of TRW-S:
1. TRW-S is affected more by negative errors than positive errors, since the effect of
positive errors is either bounded (Fig. 4.2(a)) or removed (Fig. 4.2(b)).
2. If the magnitude of an error is small (|ηAS| < |∆H|), it can either be preserved (Fig.
4.2(a)) or removed (Fig. 4.2(b)).
Similar conclusions can be derived for M̃st(1).
Figure 4.2: Effect of AS error on updated message when (a) ∆H > 0, and (b) ∆H < 0.
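The case analysis above can be checked numerically. The Python sketch below computes the effective AS error directly from the two min operations and compares it with the compact form min(ηAS, ∆H) − min(0, ∆H), which is a restatement of the two cases and not a formula given in the text; function names and test values are illustrative.

```python
def effective_as_error(h0, h1, c, eta_as):
    """Effective error in M_st(0): erroneous min minus error-free min."""
    exact = min(h0, h1 + c)
    erroneous = min(h0 + eta_as, h1 + c)
    return erroneous - exact


def effective_as_error_closed_form(delta_h, eta_as):
    """Compact restatement of both cases, with delta_h = h1 + c - h0."""
    return min(eta_as, delta_h) - min(0, delta_h)


# delta_h = 15 > 0: small errors pass through, large positive errors saturate.
for eta in (7, 40, -30):
    print(eta, effective_as_error(10, 5, 20, eta))

# delta_h = -5 < 0: errors with eta >= delta_h are removed entirely.
for eta in (8, -3, -12):
    print(eta, effective_as_error(30, 5, 20, eta))
```

In both cases a positive error is clipped (to ∆H or to zero), while a sufficiently negative error passes into the message, which matches the first observation in the list.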
The error generated by CS, η̃CS, can be viewed as part of the overall error in a message, η̃ =
η̃AS + η̃CS. The error η̃ will affect the message computation of the adjacent node, and thus
can be regarded as a propagated error for the next message computation. Normalization,
which occurs at the final step of MSG_UPD, can also be shown to significantly reduce the
magnitude of the propagated error η̃, as illustrated in Fig. 4.3. If a large-magnitude error
occurs in M̃st(0) but only a small error in M̃st(1) (the dotted arrows in Fig. 4.3(a)), the
effective error after normalization (the solid red arrow in Fig. 4.3(a)) is relatively large. In
contrast, if large but similar magnitude errors occur in both M̃st(0) and M̃st(1), the effective
error is much smaller, as shown in Fig. 4.3(b).
Figure 4.3: Effect of error after normalization: (a) large effective error when error values
differ significantly, and (b) small effective error when error values are similar.
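The effect of normalization on the two error patterns can be illustrated with a short Python sketch. The message and error values are made up, and the function names are hypothetical; normalization is modeled as subtracting the minimum entry, as described for MSG_UPD.

```python
def normalize(msg):
    """Shift a message so that its minimum entry is zero."""
    lo = min(msg)
    return [m - lo for m in msg]


def effective_error_after_normalization(msg, err):
    """Per-label difference between the normalized erroneous message
    and the normalized error-free message."""
    noisy = [m + e for m, e in zip(msg, err)]
    return [a - b for a, b in zip(normalize(noisy), normalize(msg))]


msg = [5, 9]  # error-free message for labels 0 and 1 (made-up values)
print(effective_error_after_normalization(msg, [100, 2]))   # dissimilar errors
print(effective_error_after_normalization(msg, [100, 98]))  # similar errors
```

In the second case the common part of the two errors is removed by the subtraction of the minimum, so only their difference survives.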
4.2.2 Verification of error analysis
Two experiments were performed on TRW-S stereo matching hardware to verify our analysis.
We first run Tsukuba stereo matching [96] on our software simulator of TRWS_HW with
error injection on AS in (4.7) to see the effect of error on the message update. The Potts
model with C = 20 is used as the smoothness cost, and the computation in hardware
has 20-bit fixed point precision. The injected errors, ηAS, are drawn from U(−128, 128),
where U(a, b) is a uniform distribution between a and b. Figure 4.4(a) shows the resulting
relationship between ηAS and η̃AS summarized as a box plot. Note that |η̃AS| cannot be
larger than C due to truncation and normalization in message update. It can be seen that
η̃AS is more likely to be close to zero for ηAS > 0 than for ηAS < 0, which agrees with our
analysis.
For a macroscopic view of the effect of error on the message-passing performance, we fur-
ther apply errors with different magnitude on both ηAS and ηCS and run the same Tsukuba
stereo matching. We apply uniform errors as follows: ηAS, ηCS ∼ U(min[0, ηmax], max[0, ηmax]),
and −2^15 ≤ ηmax ≤ 2^15. Note that all injected errors are of the same sign. Fig. 4.4(b) shows
the effect of errors on the energy minimization performance for different error magnitudes
and error injection rates. If the error magnitude is small (e.g., |ηmax| ≤ 512), the minimum
E , given by (4.1), approaches the error-free performance after 10 iterations. However, if
the error magnitude is large (e.g., |ηmax| ≥ 1024), the minimum E does not decrease with
iterations, which indicates that message passing fails to find the best label assignment. Fur-
thermore, the slope of the minimum E is steeper in case of the negative errors, as expected
by analysis. Therefore, we can conclude that TRW-S is tolerant to small-magnitude errors
but suffers from large-magnitude errors.
4.2.3 Enhanced error robustness of TRW-S via ANT
As discussed in Section 4.2.1, message-passing inference, such as TRW-S, has intrinsic ro-
bustness to small-magnitude errors but is vulnerable to large-magnitude errors. Thus, the
approximate error correction capability of ANT, where large errors are converted to small
errors, can provide significant enhancement in error resiliency. In this section, we apply
ANT to TRWS_HW and demonstrate its effectiveness. ANT is applied to arithmetic com-
putation of REPARAM and MSG_UPD units of TRWS_HW (Fig. 4.1 (b) and (c)), to
compensate arithmetic errors injected on AS and CS. For the estimator, a reduced precision
replica (RPR) of the main block (i.e., AS and CS) has been used. This particular estimator
has a shorter critical path than does the original block, and thus, is not subject to timing er-
rors but suffers from estimation errors. Figure 4.5 shows the performance of ANT with 4-bit
precision estimators when uniform errors with different magnitude, as in Section 4.2.2, were
injected at various error rates, pη (percentage of clock cycles in which an erroneous output
exists). Only the tail (ηmax > 32) is shown. For small-magnitude errors (η < 1024), ANT’s
performance is mostly dictated by the main block. When large-magnitude errors (η > 4096)
are injected, the estimator becomes active and successfully compensates for the errors. The
slight degradation in performance at medium-sized errors (1024 ≤ η ≤ 8192) is because for
such errors, HW errors and estimator errors are similar in magnitude and thus, ANT is not
able to compensate for the errors. To avoid this ambiguity, HW errors and estimator errors
should have very distinct characteristics, which is true for timing based HW errors.
Figure 4.4: Verification of error analysis: (a) effect of error on message updated via a box
plot of ηAS vs. η̃AS, and (b) effect of error on energy minimization performance of message
passing via a plot of energy E vs. η̃.
Figure 4.5: Performance of ANT with injection of uniform errors of different magnitude
using 4-bit precision estimator.
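The way an RPR based ANT block turns large errors into small ones can be sketched in Python for a single adder. This is a minimal sketch under assumptions: the 8-bit operand width, the MSB-truncation model of the reduced precision replica, the decision threshold, and all names are illustrative and are not taken from the TRWS_HW design. The decision rule is the usual ANT rule of accepting the main block output unless it deviates from the estimate by at least a threshold.

```python
def truncate_to_msbs(x, total_bits, est_bits):
    """Keep only the est_bits most significant bits of a total_bits value."""
    shift = total_bits - est_bits
    return (x >> shift) << shift


def rpr_add(a, b, total_bits=8, est_bits=4):
    """Reduced precision replica of an adder: add the truncated operands."""
    return (truncate_to_msbs(a, total_bits, est_bits)
            + truncate_to_msbs(b, total_bits, est_bits))


def ant_select(y_main, y_est, threshold):
    """ANT decision: trust the main block unless it is far from the estimate."""
    return y_main if abs(y_main - y_est) < threshold else y_est


a, b = 100, 57
y_est = rpr_add(a, b)            # estimate with a small estimation error
threshold = 2 * 2 ** 4           # above the largest possible estimation error
exact = a + b
msb_error = exact ^ 0x80         # emulated timing error in the MSB
lsb_error = exact ^ 0x02         # emulated small-magnitude error

print(ant_select(exact, y_est, threshold))      # main output accepted
print(ant_select(msb_error, y_est, threshold))  # replaced by the estimate
print(ant_select(lsb_error, y_est, threshold))  # small error passes through
```

The large MSB error is replaced by the estimate, leaving only the estimation error as residue, while the small error is passed on and left to the intrinsic robustness of the message-passing algorithm.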
4.3 Error Compensation at Various Levels
In the previous section, we observed that the error resiliency of TRW-S can be enhanced
if the arithmetic units are protected by RPR based ANT [91]. However, applying ANT to
all the arithmetic units causes large hardware resource overhead. In this section, therefore,
we further study the trade-off between granularity of ANT protection and its corresponding
overhead. The goal is to find the proper level of error compensation that causes low overhead
but maintains high error resilience of the TRW-S hardware.
To this end, errors are compensated via ANT at three levels as illustrated in Fig. 4.6.
At the system level, errors are compensated directly on the resulting depth map. At the
iteration level, errors are compensated after the computation of each functional block in
the message passing unit (Fig. 4.1(a)), i.e., at the end of the reparameterize and message
74
update block. Iteration-level compensation is also performed in an online fashion (compensated
results are directly used as updates for the adjacent nodes). A hybrid scheme that
employs compensation at both system and iteration levels can enable further reduction in
complexity of the estimator. At the arithmetic level, errors are compensated after each
arithmetic operation, equivalent to each pipelining stage as shown in Fig. 4.1(b,c). The
original computation units of the TRW-S architecture are used as the main block, while
different estimators are used for each level of compensation.
Figure 4.6: Block diagrams: (a) error compensation at different levels, and (b) detailed
comparison of arithmetic-level and iteration-level compensation in the reparameterize unit.
4.3.1 Arithmetic-level compensation
At the lowest level, errors are compensated after each arithmetic operation that includes
every AS and CS. Every latch in REPARAM and MSG_UPD (Fig. 4.1(b,c)) is subject to ANT
compensation. As the operations in need of error compensation are primitive in nature, our
choice of estimators is limited to circuit or structural techniques. In our work, a reduced
precision replica (RPR) has been employed at this compensation level. With a suitably
designed estimator (i.e., an RPR with proper precision), the resulting ANT architecture is
highly reliable, but suffers from high compensation overhead since additional logic is added
at each point of protection.
4.3.2 Iteration-level compensation
At the iteration level, compensation occurs at the end of each functional block of the TRW-S
hardware architecture (see Fig. 4.6(a)). For example, error compensation is performed only
at the output of REPARAM. This is due to the fact that the CS operation within MSG_UPD
has a significantly shorter critical path and is not subject to errors. As any approximate
computation of REPARAM can be used for the estimator, we simply use RPR as before.
With sufficient estimator precision, significant error resiliency is obtained. Also, the overall
overhead is reduced compared to the arithmetic-level compensation since fewer detection
and correction blocks, which are required for ANT, are used. A detailed compensation
comparison of arithmetic level vs. iteration level is depicted in Fig. 4.6(b).
4.3.3 System-level compensation
System-level compensation is performed on the computed erroneous depth map (Fig. 4.6(a))
occurring outside of the TRW-S message-passing loop, while iteration-level compensation
occurs at the end of each iteration within the message-passing loop. A separate estimated
depth map is used for compensation, which is obtained independently of the TRW-S based
main block. Thus, simpler algorithms such as a sum of absolute differences (SAD) based stereo
matching algorithm [100] and a scaled version (scaled by a factor of S) of TRW-S motivated by
hierarchical belief propagation (HBP) [101] can be used as estimators. The idea of HBP is
to obtain an overview of inference by running BP on the coarser graph. In HBP, this overview
is used as the initial guess of the original inference, but in the system-level compensation,
it is used for estimation to compensate erroneous inference. There is a trade-off between
accuracy of this estimation and the overhead in execution time of this estimation, since
smaller S results in more accurate inference but requires more computation.
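A SAD based estimator and its use for compensating an erroneous label map can be sketched in Python as follows. This is a one-scanline illustration under assumptions: the window size, the border clamping, the matching convention left[x] against right[x − d], the threshold, and the per-pixel ANT-style selection rule are all hypothetical choices and not the estimator configuration used in this work.

```python
def sad_disparity_row(left, right, d_max, half_win=1):
    """For each pixel of one scanline, pick the label d in 0..d_max that
    minimizes the sum of absolute differences over a small window.
    Indices are clamped at the row borders."""
    w = len(left)
    labels = []
    for x in range(w):
        best_d, best_cost = 0, None
        for d in range(d_max + 1):
            cost = 0
            for k in range(-half_win, half_win + 1):
                xl = min(max(x + k, 0), w - 1)
                xr = min(max(x + k - d, 0), w - 1)
                cost += abs(left[xl] - right[xr])
            if best_cost is None or cost < best_cost:
                best_d, best_cost = d, cost
        labels.append(best_d)
    return labels


def compensate_labels(main, est, threshold):
    """Per-pixel selection: keep the main label unless it is far from
    the estimate, in which case the estimate is used instead."""
    return [m if abs(m - e) < threshold else e for m, e in zip(main, est)]


right = [10, 20, 80, 40, 90, 30, 70, 50]
left = [right[max(x - 2, 0)] for x in range(len(right))]  # shift by 2 pixels
est = sad_disparity_row(left, right, d_max=3)
print(est)
print(compensate_labels([3, 2, 15, 2], [2, 2, 2, 2], threshold=4))
```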
4.3.4 Combination of iteration-level and system-level compensation
In this dissertation, we show that the system-level compensation can achieve global stereo
matching performance while being robust to errors (see Section 4.5). However, its perfor-
mance depends highly on the output quality of the main block. Thus, additional compen-
sation is needed at the iteration level. A hybrid approach using ANT with a lower precision
RPR at the iteration level for reduced compensation overhead, and using system-level com-
pensation to maintain performance provides a good complexity-performance trade-off. This
compensation method can be easily extended to systems-on-chip (SoCs), as such systems make
extensive use of HW accelerators combined with a general purpose processor.
4.4 System Evaluation Setup
We employ a combination of simulation and emulation to evaluate the reliability of TRWS_HW.
A cycle accurate software simulator has been implemented for this purpose, while a CPU+FPGA
hybrid system was employed to emulate the Verilog implementation of TRWS_HW, mainly
for the purpose of verifying the software simulator results.
4.4.1 Simulation
A cycle accurate software simulator is used to simulate impact of hardware errors on the
TRW-S message passing architecture. Three stereo matching benchmarks (Tsukuba, Venus,
and Teddy [96]) are used as input to the simulator.
4.4.1.1 Simulation methodology
We evaluate the performance of TRWS_HW in a software simulator via error injection.
Input dependent timing based error statistics are obtained for accurate simulation via gate-
level simulations using an HDL simulator. First, the gate delay is characterized with respect
to supply voltage for basic gates such as a full adder and XOR using circuit-level simula-
tors with a commercial 45 nm process library. Then a structural HDL implementation of
the REPARAM and MSG_UPD block is simulated via an HDL simulator using the pre-
characterized delay values. By choosing the delay values that correspond to various supply
voltages, HDL simulation is effectively run at different voltages. For a supply voltage (Vdd)
less than the critical voltage, errors can be observed in the output. Through characterization
of these errors, error statistics are obtained and used to inject errors in the AS and CS block
of the software simulator. The simulation methodology is summarized in Fig. 4.7(a). The
message-passing unit was designed for Vdd,crit = 1.2 V at fCLK,crit = 270 MHz. Figure 4.7(b)
shows error statistics obtained for the AS block at Vdd = 0.75 V. Large-magnitude errors
can be seen to occur with high probability, which is expected due to the nature of LSB-first
computation. The CS block did not exhibit any errors under these operating conditions, due
to its short critical path. Thus, at the arithmetic level, ANT was only applied to AS opera-
tions. Each basic computation is protected by reduced precision replica (RPR) based ANT
to tolerate timing-induced arithmetic errors occurring in the message-passing hardware at
low supply voltage (Vdd). However, this fine-grained protection incurs high overhead, reducing
the benefits of voltage overscaling.
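The error injection step described above can be illustrated with a minimal sketch. It assumes that the characterized error statistics are available as a probability mass function (PMF) over additive error magnitudes, conditioned on an error occurring, and that an error occurs independently at each output with probability pη. The function name `inject_error` and the PMF values are hypothetical and are not taken from the characterized data.

```python
import random

def inject_error(value, error_pmf, p_eta, rng):
    """Return `value`, corrupted by an additive error with probability p_eta.

    error_pmf maps an error magnitude to its probability, conditioned on an
    error occurring (assumed format, for illustration only).
    """
    if rng.random() >= p_eta:
        return value  # no timing error at this output
    magnitudes = list(error_pmf.keys())
    weights = list(error_pmf.values())
    return value + rng.choices(magnitudes, weights=weights, k=1)[0]

# Hypothetical PMF dominated by large-magnitude errors (not measured data).
example_pmf = {-128: 0.3, -64: 0.2, 64: 0.2, 128: 0.3}

rng = random.Random(0)
outputs = [inject_error(100, example_pmf, 0.21, rng) for _ in range(10000)]
observed_rate = sum(o != 100 for o in outputs) / len(outputs)
print(f"observed error rate: {observed_rate:.3f}")
```

Keeping the error rate and the error PMF as separate arguments allows the same statistics to be replayed at different values of pη.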
[Figure 4.7(a) flow diagram blocks: 45nm CMOS PDK, Circuit Simulation, Verilog RTL, Delay injection (voltage overscaling), Error injection, TRW_S HW Architecture, TRW-S Software Simulator, ANT, Energy, Performance. Figure 4.7(b) plot: probability (0 to 0.4) vs. error magnitude (−200 to 200).]
Figure 4.7: Simulation methodology: (a) flow diagram, and (b) error statistics for AS at
Vdd = 0.75 V with pη = 0.21.
4.4.1.2 Optimization of precision
We can exploit the intrinsic error robustness of TRW-S to optimize the fixed point preci-
sion of arithmetic computations in (4.2) and (4.3). Figure 4.8 shows E vs. precision used
in the computation of TRWS_HW when running the Tsukuba stereo matching. The no-
tation M[a, b, c] represents different precision options applied to the main computation of
TRWS_HW: a precision of a-bits is used to perform the computation of (4.2) and (4.3)
[Figure 4.8 plot: energy (3.5 to 7, ×10^5) vs. number of iterations (1 to 10); curves for M[20,0,0], M[12,8,0], M[8,12,0], M[8,8,4], M[8,9,3], M[8,10,2], M[8,11,1], and M[7,8,5].]
Figure 4.8: Energy minimization performance against various precisions in computation of
TRWS_HW.
[Figure 4.9 diagram labels: REPARAM or MSG_UPD, 20b, >>b, a, c.]
Figure 4.9: Block diagram depicting the truncation and saturation for optimal fixed-point
implementation of REPARAM and MSG_UPD.
after b-bit LSB truncation and c-bit MSB saturation. Truncation and saturation are per-
formed at the output of each function block as depicted in Fig. 4.9. The baseline precision is
M[20, 0, 0] in [92], which outputs the same energy minimization results as the floating point
implementation [96]. Compared to this baseline, 8-bit LSB truncation (M[12, 8, 0]) produces
a similarly low energy minimization curve, whereas 12-bit truncation (M[8, 12, 0]) results in a
higher minimum energy curve. According to our experiments, 8-bit precision with less than
11-bit LSB truncation (i.e., M[8, 8, 4], M[8, 9, 3], and M[8, 10, 2]) achieves comparable energy
minimization performance to the baseline precision (M[20, 0, 0]). However, other precisions,
such as M[8, 11, 1] or M[7, 8, 5], perform worse due to large-magnitude quantization or satu-
ration errors. Similarly, the optimal precision of the main block for the other benchmarks,
Teddy and Venus, is obtained as M[8,8,4] and M[8,10,2], respectively.
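The M[a, b, c] notation can be made concrete with a short sketch. It assumes unsigned 20-bit block outputs, that truncation simply discards the b least significant bits, and that saturation clamps the result to the largest a-bit value; the function name `quantize` is hypothetical and the signedness is an assumption, since the text does not state it.

```python
def quantize(x, a, b, c, full_bits=20):
    """Map a full-precision output to M[a, b, c]: b-bit LSB truncation
    followed by c-bit MSB saturation, leaving an a-bit value.

    Assumes an unsigned input (illustrative assumption).
    """
    if a + b + c != full_bits:
        raise ValueError("a + b + c must equal the full precision")
    if not 0 <= x < (1 << full_bits):
        raise ValueError("x must fit in the full precision")
    truncated = x >> b           # drop the b least significant bits
    max_value = (1 << a) - 1     # largest value representable in a bits
    return min(truncated, max_value)

# A value whose upper c = 4 bits are non-zero saturates under M[8, 8, 4].
print(quantize(0x12345, 8, 8, 4))
# A value within range only loses its 8 LSBs.
print(quantize(0x0A5FF, 8, 8, 4))
```

Every precision option listed for Fig. 4.8 satisfies a + b + c = 20, which is why the sketch rejects other combinations.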
[Figure 4.10 diagram blocks: CPU, DIMMs, FPGAs (FPGA0), AHB, Load, Read FIFOs, Message passing unit (REPARAM, MSG_UPD (Horizontal), MSG_UPD (Vertical), Feedback FIFOs), Write FIFOs, Store, Memory.]
Figure 4.10: Architecture of streaming TRW-S stereo matching CPU+FPGA system.
4.4.2 Emulation
An emulation platform that generates timing errors was employed to verify the simulation
results. Though our entire system was implemented on the TRWS_HW FPGA platform,
the previous results were obtained via simulations, as this enabled us to explore injection of
the same error statistics, but at different error rates pη.
4.4.2.1 Emulation architecture
A TRW-S based stereo matching system has been implemented on a hybrid CPU+FPGA
platform to perform high-quality stereo matching at video-rate speed [92]. Figure 4.10 shows
the overall architecture of the TRW-S stereo matching system. The platform (Convey HC-1
[102]) contains an Intel Xeon dual core processor and four Virtex 5 (V5LX330) Xilinx FPGAs,
with a cache-coherent virtual memory system across both multicore and FPGA fabrics. The
two-step (reparameterize, REPARAM, and update message, MSG_UPD) message-passing
algorithm [97] is accelerated in an FPGA, where its data path is fully pipelined and MRF
data is streamed via FIFOs to achieve high throughput. The CPU not only controls FPGA
operations but also processes image input and stereo-matching output (i.e., depth map). Our
iteration-level error compensation is emulated using this CPU+FPGA, where the precisions
of the main block and the estimator are set to 20-bit and 8-bit, respectively.
4.4.2.2 Error generation
To evaluate the results under real HW errors, the message-passing unit is implemented
as an FPGA accelerator exhibiting timing violations via relaxed synthesis. At a target
frequency of 150 MHz, all the paths through the REPARAM block were set as false paths.
Xilinx’s ChipScope Pro was used to verify that the errors were generated within REPARAM,
and compared against error-free results to obtain the error statistics (Fig. 4.11(a)). Figure
4.11(b) shows similar error statistics to the ones obtained through HSPICE simulations (Fig.
4.11(c)) based on the modeling methodology in [91], indicating that the error generation in
FPGA emulates the HW timing errors well.
However, controlling the error rate pη in an FPGA system is known to be difficult due to
various synthesis and mapping constraints and unpredictable routing within the FPGA [103].
The buffer chain in Fig. 4.11(a) is such an example. Contrary to our expectations, pη and
the resulting matching performance did not depend strongly on the length of the
buffer chain. To enable flexible control of pη we extracted the error statistics of the FPGA
(Fig. 4.11(b)) and once again used error injection in the simulator. The results of the
simulator are comparable with the FPGA emulation results, justifying this approach.
4.5 Results
4.5.1 Simulation results for arithmetic-level compensation
The Tsukuba stereo matching task is run at various Vdd for erroneous message-passing com-
putations. A precision of M[8, 8, 4] has been used for the precision optimized main block,
and an estimator of precision E[a, b, c] is used for ANT. Figure 4.12(a) shows the energy min-
imization performance of ANT with different estimators at various Vdd. It is evident that
energy minimization performance is drastically degraded in the conventional case (no error
protection), while ANT introduces tolerance to a significant amount of errors. Compared to
2-bit and 3-bit precision estimators, estimators with 4-bit or higher precision show much
lower minimum energy. At Vdd = 1 V, where the error rate of the AS block is pη,AS = 0.8 %,
[Figure 4.11(a) diagram blocks: Computation (Error Free), Reparam (Error Free), Reparam (Erroneous), Buffer Chain, Chipscope. Figure 4.11(b) plot: probability (0 to 0.4) vs. error magnitude (−1 to 1, ×10^6). Figure 4.11(c) plot: probability (0 to 0.4) vs. error magnitude (−200 to 200).]
Figure 4.11: Timing errors in FPGA: (a) block diagram for error verification and statistics
collection, (b) measured error statistics (20-bit) in the FPGA, and (c) error statistics
(8-bit) obtained via circuit simulations in a 45 nm CMOS process [91].
the minimum energy of E[4, 12, 4] is almost the same as the minimum energy of the error-free
case. Similarly, the best estimator precision for the other benchmarks, Teddy and Venus, is
[Figure 4.12 plot: minimum energy (10^6 to 10^7) vs. Vdd (V) (0.7 to 1.1); curves for No ANT, E[2,14,4], E[3,13,4], E[4,12,4], E[5,11,4], E[6,10,4], and Error Free.]
Figure 4.12: Simulation results of error injection rate vs. energy minimization for different
ANT estimators with M[8, 8, 4].
obtained as E[4,12,4] and E[4,14,2], respectively.
To further establish the effectiveness of ANT for TRWS_HW, the depth maps of three
cases – error free, conventional, and ANT – are compared. The estimator precision is set
to E[4, 12, 4]. Table 4.1 shows the depth map of each case at different Vdd. The depth map
for the conventional case becomes corrupted even at a low error rate of 1 %. In contrast,
the depth map when ANT is applied is comparable to the error-free case when the same 1%
error rate is applied. Furthermore, at low Vdd, where the error rate is 10% to 25%, the depth
map of ANT is still close to the depth map of the error-free case, which demonstrates the
outstanding error robustness of ANT.
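The RPR-based ANT protection applied to each basic computation can be sketched as follows. This is a minimal illustration of the usual ANT decision rule, in which the main-block output is kept unless it deviates from the low-precision estimate by more than a threshold. The operand values, the number of dropped bits, and the threshold choice are assumptions made for the example and are not the settings used in TRWS_HW.

```python
def rpr_estimate_add(x, y, drop_bits):
    """Reduced-precision replica of x + y: add only the upper bits."""
    return ((x >> drop_bits) + (y >> drop_bits)) << drop_bits

def ant_correct(y_main, y_est, threshold):
    """Keep the main output unless it is too far from the estimate."""
    if abs(y_main - y_est) > threshold:
        return y_est   # large deviation: assume a timing error occurred
    return y_main      # small deviation: attribute it to estimation error

drop_bits = 4
# The replica underestimates an unsigned sum by less than 2 * 2**drop_bits,
# so this threshold never rejects an error-free main output.
threshold = 1 << (drop_bits + 1)

x, y = 1000, 2000
y_est = rpr_estimate_add(x, y, drop_bits)
print(ant_correct(x + y, y_est, threshold))         # error-free main output
print(ant_correct(x + y + 4096, y_est, threshold))  # large MSB error injected
```

When a large-magnitude error is detected, the output falls back to the estimate, so the residual error is bounded by the estimator's precision rather than by the timing error.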
To evaluate the accuracy of the depth maps, bad-pixel ratio (BPR) is employed [104] as
the system-level metric. BPR is calculated by comparing the depth label of each pixel in the
non-occlusion region to the true depth map (ground truth) and counting a pixel to be bad
if the label differs by more than a threshold κ (κ = 1 in our case). As shown in Table 4.1,
BPR of the conventional case rises drastically from 10.4% to 70.4% as Vdd is scaled down.
In contrast, BPR of ANT is robust to errors; BPR is at most 6.27%, at a high error rate of
Table 4.1: Depth map and BPR comparison for error-free, conventional, and ANT at various Vdd.
Task              Tsukuba                         Teddy                           Venus
Vdd      pη,AS    Error-Free  Conv.    ANT        Error-Free  Conv.    ANT        Error-Free  Conv.    ANT
1.0 V    0.8%     2.7%        10.4%    2.51%      12.8%       90.30%   12.60%     2.78%       49.60%   2.65%
0.9 V    2.9%     2.7%        60.7%    2.31%      12.8%       92.80%   12.60%     2.78%       67.60%   3.00%
0.84 V   5.1%     2.7%        70.3%    3.93%      12.8%       92.60%   13.10%     2.78%       71.70%   4.09%
0.75 V   21.3%    2.7%        70.4%    6.27%      12.8%       92.80%   17.00%     2.78%       75.70%   9.69%
Table 4.2: Estimated compensation overhead and power consumption of arithmetic-level
ANT obtained via synthesis in a commercial 45 nm CMOS process.
               M[8,8,4] only    ANT (M[8,8,4] + E[4,8,4])
Cell           40,790           80,534           81,366
Vdd (V)        1.10             1.10             0.72
Power (mW)     11.82            17.81            7.136
(leak; dyn)    (2.03; 9.79)     (3.39; 14.42)    (0.23; 6.91)
21.3% for the Tsukuba task.
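The BPR metric defined above can be sketched in a few lines. The depth maps are assumed to be given as nested lists of integer labels with a Boolean non-occlusion mask of the same shape; the function name `bad_pixel_ratio` and the toy data are hypothetical.

```python
def bad_pixel_ratio(labels, truth, non_occluded, kappa=1):
    """Percentage of non-occluded pixels whose depth label differs from
    the ground truth by more than kappa."""
    bad = 0
    total = 0
    for row_l, row_t, row_m in zip(labels, truth, non_occluded):
        for label, true_label, visible in zip(row_l, row_t, row_m):
            if not visible:
                continue  # occluded pixels are excluded from the metric
            total += 1
            if abs(label - true_label) > kappa:
                bad += 1
    return 100.0 * bad / total if total else 0.0

# Toy 2x2 depth maps (illustrative values only).
labels = [[5, 7], [9, 3]]
truth = [[5, 5], [8, 3]]
mask = [[True, True], [True, False]]
print(f"BPR = {bad_pixel_ratio(labels, truth, mask):.1f}%")
```

A label that is off by exactly κ is not counted as bad, which matches the "differs by more than a threshold" wording.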
Such enhancement in error robustness of ANT can be exploited to achieve power savings
in circuit implementation. The power consumption of the message-passing unit is estimated
via switching activity based RTL synthesis in a commercial 45 nm synthesis library. Switch-
ing activity is obtained from Verilog simulations running stereo matching of the Tsukuba
task. This switching activity is then annotated in RTL synthesis to estimate accurate power
consumption. Table 4.2 summarizes the synthesis results with cell utilization and estimated
power consumption including leakage and dynamic power. As can be seen, the cell overhead
of ANT is approximately 97.4%, which is significant. However, at an extreme supply voltage
of Vdd = 0.72 V, ANT is able to achieve 39.7% power savings while maintaining the system
performance. This result shows that significant energy savings are possible via voltage scaling,
even when additional complexity has to be added for the system to perform correctly. To
overcome the significant overhead in complexity, compensation at higher levels, iteration and
system, is considered next.
4.5.2 Simulation results for system-level and iteration-level compensation
Once again, a cycle accurate simulator was used to perform simulation of system- and
iteration-level compensation. BPR was used as our performance metric. An RPR based HW
estimator (HWRPR) with 2- to 4-bit precision was used for the iteration-level compensation,
while SAD of window size 7 (W = 7), or HBP with a scale factor of S = 8 and S = 16, were
used for the estimator at the system level.
SAD based correction at the system level with Vdd = 0.75 V is shown in Fig. 4.13(a).
At this error rate, system-level correction gives marginal improvement and is insufficient for
satisfactory operation (main block BPR is 68.5%). Using HBP (Fig. 4.13(b)) alone, the
final output is significantly better at a BPR of 6.95%. Further combining with HWRPR
with 3-bit precision, a BPR of 5.21% is achieved (Fig. 4.13(c)), which is a 2.5 percentage-point degradation
compared to the error-free BPR of 2.7%. Note that this error compensation result (BPR of
5.21%) is even better than the result of arithmetic error compensation (BPR of 6.27%).
Figure 4.14 summarizes the error compensation performance of the system- and iteration-
level compensation. As expected, HBP with low-precision HWRPR shows comparable error
resiliency to high-precision (and large overhead) HWRPR. One interesting observation is
that the hybrid scheme combining HBP and HWRPR performs exceptionally well under
aggressive voltage scaling. However, the same cannot be said for moderate voltage scaling.
This is due to the limitation in quality of the HBP estimator used at the system level.
The overhead of HWRPR for the iteration-level compensation is derived through the
same activity factor based RTL synthesis. A summary of the power estimation and the
cell utilization for a 4-bit, 3-bit, and 2-bit estimator is given in Table 4.3. Compared to
arithmetic-level compensation, the cell utilization has been reduced by 11.8% to 20.3%.
Using the same synthesis based power estimation as for the arithmetic compensation, the
power consumption of the iteration-level ANT was estimated to be 5.227 mW at the lowest
point, which is an additional 16.1% power savings.
4.5.3 Emulation results
For the estimator in emulation, the Intel Xeon processor available within the Convey HC-1
platform was used to compute estimates of the depth map.
The correctness of our simulator is verified by comparing the results against FPGA emu-
lation. FPGA emulation results of ANT applied at the iteration level and system level for
Tsukuba are shown in Fig. 4.15. For iteration-level compensation, the resulting performance
is very similar, and we conclude that the simulator faithfully represents the FPGA emulation
results.
Many system-on-chip (SoC) systems employ a platform consisting of a general purpose
[Figure 4.13 panels: (a) Main Block (Vdd = 0.75, BPR = 68.5%), Estimator (SAD W = 7, BPR = 16.4%), Final output (BPR = 16.0%); (b) Main Block (Vdd = 0.75, BPR = 68.5%), estimator HBP (S = 8, BPR: 13.9%), ANT, Final output (BPR: 6.95%); (c) Main Block + HWRPR (3b) (Vdd = 0.75, BPR = 11.3%), estimator HBP (S = 8, BPR: 13.9%), ANT, Final output (BPR: 5.21%).]
Figure 4.13: Correction performance at (a) system level with SAD, (b) system level with
HBP (S = 8), and (c) hybrid with HBP (S = 8) and HWRPR (3-bit).
[Figure 4.14 plot: BPR (%) (10^0 to 10^2) vs. Vdd (V) (0.7 to 1.2); curves for No error compensation, HWRPR(4b), HWRPR(3b), HWRPR(2b), HWRPR(3b)+HBP(S=16), HWRPR(3b)+HBP(S=8), HWRPR(2b)+HBP(S=8), and HBP(S=8).]
Figure 4.14: Bad-pixel ratio vs. Vdd. With no error compensation, TRW-S alone cannot
tolerate 10% voltage scaling.
[Figure 4.15 panels (depth maps with BPR): (a) Error free 2.7%, Only MAIN 11.8%, Only ESTM 4.64%, ANT (HWRPR) 3.87%, HWRPR+HBP 3.87%; (b) Error free 2.7%, Only MAIN 18.3%, Only ESTM 4.64%, ANT (HWRPR) 2.86%, HWRPR+HBP 2.86%.]
Figure 4.15: Results for RPR-ANT applied at the arithmetic level: (a) FPGA emulation,
and (b) error injection based simulation.
Table 4.3: Estimated compensation overhead and power consumption of iteration-level ANT
obtained via synthesis in a commercial 45 nm CMOS process.
               ANT (M[8,8,4]+E[4,12,4])           ANT (M[8,8,4]+E[3,13,4])           ANT (M[8,8,4]+E[2,14,4])
Vdd (V)        1.10             0.72              1.10             0.72              1.10             0.72
Cell           71,096           71,916            67,419           68,081            64,414           64,890
Power (mW)     15.234           5.961             13.814           5.380             13.432           5.227
(leak; dyn)    (3.131; 12.102)  (0.210; 5.751)    (2.866; 10.948)  (0.192; 5.189)    (2.755; 10.678)  (0.183; 5.044)
processor and accelerators. Under this environment, when the accelerator is under load,
the processor can be idle and available for computation of simple estimations. Thus, we
employed a general purpose processor to perform computation of the HBP based depth map.
However, the error rate on the FPGA emulation platform was not high enough for a degraded
iteration level output, and thus system-level compensation was not able to produce any
more performance benefits. Nonetheless, the concept of using a general purpose processor
for computation of the estimator has been shown to be valid. One important point of
consideration is that the execution time must be less than that of the accelerator, and the
complexity of HBP depends on the number of nodes in the graph, which is 1/64 for S = 8, and
1/256 for S = 16. The execution time for the HBP estimator on an Intel Xeon @ 2.13 GHz with
10 iterations was approximately 250 ms and 60 ms for S = 8 and 16, respectively, while the
main block took approximately 130 ms for 40 iterations. Thus, S = 16 is a viable choice for
performing estimation on a processor.
4.6 Summary
Statistical error compensation (SEC) in the form of algorithmic noise tolerance (ANT) has
been applied to a Markov random field (MRF) based stereo image matching architecture.
The error propagation characteristics of an iterative message-passing based stereo image
matching application have been studied in depth. Analysis and simulations show that for
a 20-bit architecture, small errors (η ≤ 1024) are tolerable, while large errors (η ≥ 4096)
degrade the performance significantly. Low overhead compensation methods were proposed
and applied at various levels (system, iteration, and arithmetic) and were shown to achieve
significant energy savings. System-level correction reduces correction complexity by a
significant amount, but is inadequate on its own. When combined with iteration-level
compensation, performance gains are significant with little increase in complexity.
Chapter 5
Approximate Computing Based Statistical Error
Compensation
Statistical error compensation (SEC) has shown significant benefits in achieving energy ef-
ficiency and error resiliency by embracing the stochastic nature of the underlying process
(see Chapters 2, 3, and 4). Approximate computing (AC) [22], on the other hand, employs
deterministic designs that produce imprecise results to achieve energy efficiency. In this
chapter, we bridge the two design paradigms by utilizing SEC and AC in the design of a
machine learning accelerator core.
AC, as briefly discussed in Section 1.2.4, relaxes the
required numerical exactness on the output. This relaxation is based on application-level
information just as SEC techniques have used application-level information to achieve en-
hancement in error resiliency. Emerging new applications, including recognition, mining,
and synthesis (RMS), process large amounts of noisy data via statistical and probabilistic
computation and are inherently error resilient. Furthermore, many machine learning appli-
cations have application metrics that are probabilistic, and are inherently robust to errors
as discussed in Section 1.2. Similar to SEC, these characteristics motivate the use of AC.
The goal of approximate computing is to prune the computation such that the applica-
tion is oblivious to the approximation and thereby gains energy efficiency or obtains higher
performance.
AC requires the computation to be free of hardware errors (i.e., y = yo + e). AC is based
on intentional simplification of the design that produces imprecise results by utilizing statis-
tical properties of data and algorithms [22]. Careful study of AC based algorithms needs to
be performed to ensure system-level performance. Through this process, the quality of the
output is traded off with area and energy. On the other hand, SEC is a more aggressive tech-
nique that relaxes the correctness requirement of the HW implementation. SEC embraces
the stochastic nature of the underlying implementation and then provides approximate error
compensation such that the output may be in error but the application-level performance
specifications are met.
In this chapter, we combine SEC and AC in the implementation of an accelerator archi-
tecture for a tree-reweighted message-passing (TRW-S) based stereo image matching appli-
cation (see Chapter 4). Algorithmic noise tolerance (ANT) was applied at the arithmetic
level while a mirror adder based approximate adder [21] was used in the application of AC.
Hardware errors were induced in the message update and reparameterize unit of the TRW-S
HW architecture. A study on the trade-off between the amount of approximation and the
energy efficiency was performed by implementing AC at different granularities. To jointly optimize
the performance, AC was not necessarily applied at the least significant bits (LSBs).
Results show that ANT combined with AC achieves energy savings of 44.9% compared to
a conventional system while achieving at most 4% degradation in bad-pixel ratio (BPR).
This result further strengthens the belief that energy efficiency is improved by embracing
the underlying stochasticity, i.e., circuit design should no longer follow worst-case design
principles, but allow errors to propagate to the next level via nominal case design.
The remainder of the chapter is organized as follows. Section 5.1 provides background
information on approximate computing. The TRW-S MRF message-passing based stereo
matching algorithm and its architecture are discussed in Section 5.2. Section 5.3 shows the
simulation setup and results, while Section 5.3.4 concludes the chapter.
5.1 Approximate Computing
In this section, we present a comprehensive survey on AC techniques. The actual AC
technique used in this dissertation, a mirror adder based approximate adder, is then discussed
in further detail.
5.1.1 Circuit-level AC techniques
5.1.1.1 Approximate adders
Approximate adders are implemented in two parts, an accurate MSB part and an approx-
imate LSB part. For the approximate LSB part, a simplified full adder is used. The truth
table of the full adder is altered to provide significant savings in the number of transis-
tors required. Many design variations exist. Approximate mirror adders [21] are based
on mirror adders. Five approximate adders are designed, each based on a different alter-
ation of the truth table. More details on approximate mirror adders are given in Section
5.1.3. Approximate XOR/XNOR based adders are based on the 10-transistor adder that
uses multiplexers implemented using pass transistors. Three different approximate adders
are presented in [34] and show good error distance [105] properties with low power-delay
product compared to other approximate adders. A third type of approximate adder, the
lower-part-OR adder (LOA) [35], uses OR gates to approximately compute the LSBs. The carry-in
for the MSB part is computed through an AND gate whose inputs are the most significant bits
of the two LSB-part operands. Most carries are ignored in the lower-part module of the LOA,
resulting in a loss of precision.
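The LOA behavior described above can be modeled at the bit level with a short sketch. It assumes unsigned operands and treats the width of the lower part as a free parameter; the function name `loa_add` is hypothetical and the model ignores circuit-level details of [35].

```python
def loa_add(x, y, lower_bits):
    """Behavioral model of a lower-part-OR adder for unsigned operands.

    The lower `lower_bits` bits are approximated by a bitwise OR; the upper
    part is added exactly, with a carry-in formed by ANDing the most
    significant bits of the two lower parts.
    """
    if lower_bits == 0:
        return x + y  # no approximate part: exact addition
    low_mask = (1 << lower_bits) - 1
    low = (x | y) & low_mask
    carry_in = (x >> (lower_bits - 1)) & (y >> (lower_bits - 1)) & 1
    high = (x >> lower_bits) + (y >> lower_bits) + carry_in
    return (high << lower_bits) | low

x, y = 0b01011011, 0b00100110   # 91 and 38
print(loa_add(x, y, 4), "vs exact", x + y)
```

The result is exact whenever the two lower parts have no overlapping 1 bits, since the OR then equals the sum and no carry is generated.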
5.1.1.2 Approximate multipliers
In [35, 36], approximate multipliers are constructed by using speculative adders to
obtain the sum of partial products. However, directly substituting approximate adders into
a multiplier may not yield a suitable approximate multiplier design. In [37,38],
it is noted that reducing the critical path of partial product adders is essential. In [38],
instead of directly replacing adders with approximate adders, partial sums that are part of
the LSBs are simply omitted. In [37], a hierarchical multiplier is used, where an approximate
2× 2 multiplier is implemented, then used to build a larger multiplier.
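The hierarchical construction can be illustrated with a small sketch. The 2 × 2 block below returns 7 instead of 9 for 3 × 3, so that its output fits in three bits; this is one commonly cited approximate block, and whether it matches the block of [37] exactly is an assumption. The function names are hypothetical.

```python
def approx_mul2x2(a, b):
    """Approximate 2-bit x 2-bit multiply: exact except that 3 x 3 gives 7
    (assumed block, chosen for illustration)."""
    return 7 if (a == 3 and b == 3) else a * b

def mul4x4(a, b, mul2=approx_mul2x2):
    """Build a 4-bit x 4-bit multiplier from four 2 x 2 blocks by adding
    the shifted partial products exactly."""
    a_hi, a_lo = a >> 2, a & 0b11
    b_hi, b_lo = b >> 2, b & 0b11
    return ((mul2(a_hi, b_hi) << 4)
            + ((mul2(a_hi, b_lo) + mul2(a_lo, b_hi)) << 2)
            + mul2(a_lo, b_lo))

print(mul4x4(5, 6))    # no 2-bit digit pair equals (3, 3): exact product
print(mul4x4(15, 15))  # every digit pair is (3, 3): approximate product
```

Errors arise only when a pair of 2-bit digits are both 3, so the approximation is input dependent rather than uniform.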
5.1.1.3 Approximate logic synthesis
Approximate logic-level synthesis has been explored as well. In [40], logic synthesis is used
to synthesize the minimum area circuit, at a given error rate. This is performed in a two-step
synthesis process. In [41], a multi-level logic minimization algorithm is developed to simplify
the design and minimize the area of approximate circuits. A systematic methodology for au-
tomatic logic synthesis of approximate circuits (SALSA) has been proposed in [42]. SALSA
encodes quality constraints in logic functions known as Q-functions. The increase in flexibil-
ity denoted by Q-functions is then represented as approximate don’t cares, and traditional
circuit simplification and don’t care optimization techniques are utilized for synthesis.
5.1.2 Algorithmic-level AC techniques
Increased area and energy savings can be achieved when AC is applied at the algorithm level.
The quality of the output of iterative algorithms improves incrementally with each iteration.
This characteristic, known as incremental refinement, has been exploited to achieve results
that gradually increase in quality. In [43, 44], this idea was demonstrated on an FFT-based
maximum-likelihood detection algorithm. This was expanded in [45,46], where several DSP
algorithms were transformed to exhibit the incremental refinement property. In machine
learning, an approximate support vector machine (SVM) was demonstrated in [47], where the number
of dimensions (features) used in the SVM was reduced. Algorithmic AC techniques, in
essence, can be employed to design the estimator in ANT.
Inexact computing is based on a similar principle. In that work, probabilistic pruning
[106, 107] and probabilistic logic minimization [108] are applied at the architecture and logic
level. In probabilistic pruning, components that have lower significance or probability of
activation are pruned in a systematic way, while probabilistic logic minimization transforms
logic functions to a lower cost variant.
Table 5.1: Truth table for the mirror adder based approximate adders 1 through 4 [21].
A cross (×) marks an output that differs from the accurate output.
Inputs      Accurate Outputs   Approximate Outputs
A  B  Cin   Sum  Cout          Sum1  Cout1   Sum2  Cout2   Sum3  Cout3   Sum4  Cout4
0  0  0     0    0             0     0       1×    0       1×    0       0     0
0  0  1     1    0             1     0       1     0       1     0       1     0
0  1  0     1    0             0×    1×      1     0       0×    1×      0×    0
0  1  1     0    1             0     1       0     1       0     1       1×    0×
1  0  0     1    0             0×    0       1     0       1     0       0×    1×
1  0  1     0    1             0     1       0     1       0     1       0     1
1  1  0     0    1             0     1       0     1       0     1       0     1
1  1  1     1    1             1     1       0×    1       0×    1       1     1
5.1.3 Mirror adder based approximate adder
In this dissertation, we have used the mirror adder based approximate adders 1 through 4
in [21], whose truth tables are shown in Table 5.1. By simplifying the truth table of the
adder, fewer transistors can be used to implement the approximate version of the adder,
resulting in area and energy reduction. The accurate mirror adder and the four different
approximate full adders corresponding to the truth table are depicted in Fig. 5.1. As can be
seen, the approximate versions require fewer transistors, and result in area savings ranging
from 28% to 55%. Further energy savings can be achieved if the reduction in the critical
path is traded off with energy via voltage scaling. A ripple carry adder that utilizes these
approximate adders was applied in the message passing unit of the TRWS_HW.
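A behavioral sketch of such a ripple carry adder is given below. The lookup tables are transcribed from Table 5.1; the function names, the choice of placing the approximate cells in the lowest bit positions, and the example operands are assumptions made for illustration and do not describe the configuration used in TRWS_HW.

```python
# (Sum, Cout) indexed by (A << 2) | (B << 1) | Cin, transcribed from Table 5.1.
ACCURATE = [(0, 0), (1, 0), (1, 0), (0, 1), (1, 0), (0, 1), (0, 1), (1, 1)]
APPROX = {
    1: [(0, 0), (1, 0), (0, 1), (0, 1), (0, 0), (0, 1), (0, 1), (1, 1)],
    2: [(1, 0), (1, 0), (1, 0), (0, 1), (1, 0), (0, 1), (0, 1), (0, 1)],
    3: [(1, 0), (1, 0), (0, 1), (0, 1), (1, 0), (0, 1), (0, 1), (0, 1)],
    4: [(0, 0), (1, 0), (0, 0), (1, 0), (0, 1), (0, 1), (0, 1), (1, 1)],
}

def error_counts(variant):
    """Number of erroneous (Sum, Cout) truth-table entries for a variant."""
    pairs = list(zip(APPROX[variant], ACCURATE))
    sum_errors = sum(a[0] != e[0] for a, e in pairs)
    cout_errors = sum(a[1] != e[1] for a, e in pairs)
    return sum_errors, cout_errors

def ripple_add(x, y, width, approx_bits=0, variant=1):
    """Ripple carry adder whose lowest `approx_bits` cells are approximate
    (assumed placement) and whose remaining cells are accurate."""
    carry = 0
    result = 0
    for i in range(width):
        a = (x >> i) & 1
        b = (y >> i) & 1
        table = APPROX[variant] if i < approx_bits else ACCURATE
        s, carry = table[(a << 2) | (b << 1) | carry]
        result |= s << i
    return result | (carry << width)  # carry-out becomes the top bit

for v in sorted(APPROX):
    print(f"approximation {v}: (Sum, Cout) errors = {error_counts(v)}")
print(ripple_add(91, 38, 8), ripple_add(91, 38, 8, approx_bits=4, variant=1))
```

Because an erroneous carry from an approximate cell propagates into the accurate cells above it, the position and number of approximate cells determine the error magnitude at the adder output.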
5.2 TRW-S Accelerator Architecture Using AC and SEC
5.2.1 TRW-S accelerator architecture
In this chapter, we employ the same TRW-S hardware architecture from Chapter 4, where
the two-step message passing of TRW-S is pipelined and executed in a streaming architecture
(Fig. 4.1). FIFOs are used to handle streaming data access to the main memory in order
[Figure 5.1 panels (a), (b): transistor-level mirror adder (MA) schematics reproduced from [21], IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 32, no. 1, January 2013 (conventional MA and MA approximations 1 through 4).]
(see Fig. 3) to implement the same functionality. The reason
for this can be explained as follows. If we set Sum= Cout as it is
1Henceforth, we denote the complement of a variable V by V .
(c)
126 IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 32, NO. 1, JANUARY 2013
Fig. 1. Conventional MA.
Fig. 2. MA approximation 1.
Fig. 3. MA approximation 2.
α is the switching activity or average number of switching
transitions per unit time and C is the load capacitance being
charged/discharged. This directly results in lower power dissi-
pation. Area reduction is also achieved by this process. Now,
let us discuss the conventional MA implementation followed
by the proposed approximations.
1) Conventional MA: Fig. 1 shows the transistor-level
schematic of a conventional MA [23], which is a popular way
of implementing an FA. It consists of a total of 24 transistors.
Since this implementation is not based on complementary
Fig. 4. MA approximation 3.
Fig. 5. MA approximation 4.
CMOS logic, it provides a good opportunity to design an
approximate version with removal of selected transistors.
2) Approximation 1: In order to get an approximate MA
with fewer transistors, we start to remove transistors from the
conventional schematic one by one. However, we cannot do
this in an arbitrary fashion. We need to make sure that any
input combination of A,B and Cin does not result in short
circuits or open circuits in the simplified schematic. Another
important criterion is that the resulting simplification should
introduce minimal errors in the FA truth table. A judicious
selection of transistors to be removed (ensuring no open or
short circuits) results in a schematic shown in Fig. 2, which we
call approximation 1. Clearly, this schematic has eight fewer
transistors compared to the conventional MA schematic. In
this case, there is one error in Cout and two errors in Sum,
as shown in Table I. A tick mark denotes a match with the
corresponding accurate output and a cross denotes an error.
3) Approximation 2: The truth table of an FA shows that
Sum= Cout1 for six out of eight cases, except for the input
combinations A = 0, B = 0, Cin = 0 and A = 1, B = 1, Cin = 1.
Now, in the conventional MA, Cout is computed in the first
stage. Thus, an easy way to get a simplified schematic is to
set Sum= Cout. However, we introduce a buffer stage after Cout
(see Fig. 3) to implement the same functionality. The reason
for this can be explained as follows. If we set Sum= Cout as it is
1Henceforth, we denote the complement of a variable V by V .
(d)
126 IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 32, NO. 1, JANUARY 2013
Fig. 1. Conventional MA.
Fig. 2. MA approximation 1.
Fig. 3. MA approximation 2.
α is the switching activity or average number of switching
transitions per unit time and C is the load capacitance being
charged/discharged. This directly results in l wer power dissi-
pation. Area reduction is also achieved by this process. Now,
let us discuss the conventional MA implementation followed
by the proposed approximation .
1) Conventional MA: Fig. 1 shows the transistor-level
schematic of a conventional MA [23], which is a popular way
of implementing an FA. It consists of a total of 24 transistors.
Since this implementation is not based on complem ntary
Fig. 4. MA approximation 3.
Fig. 5. MA approxima ion 4.
CMOS logic, it provides a good opportunity to design an
ap roximate version with removal of selected transistors.
2) Approximation 1: In order to get an approximate MA
with fewer transistors, we start to remove transistors from the
conventional schematic one by one. However, we cannot do
this in an arbitrary fashion. We need to make sure that any
input combination of A,B and Cin does not result in short
circuits or open circuits in the simplified schematic. Another
important criterion is that the resulting simplification should
introduce minimal errors in the FA truth table. A judicious
selection of transistors to be removed (ensuring no open or
short circuits) results in a schematic shown in Fig. 2, which we
call approximation 1. Clearly, this schematic has eight fewer
transistors compared to the conventional MA schematic. In
this case, there is one error in Cout and two errors in Sum,
as shown in Table I. A tick mark denotes a match with the
corresponding accurate output and a cross denotes an error.
3) Approximation 2: The truth table of an FA shows that
Sum= Cout1 for six out of eight cases, except for the input
combinations A = 0, B = 0, Cin = 0 and A = 1, B = 1, Cin = 1.
No , in the conventional MA, Cout is computed in the first
stage. Thus, an easy way to get a simplified schematic is to
set Sum= Cout. However, we introduce a buffer stage after Cout
(see Fig. 3) to implement the same functionality. The reason
for this can be explained as follows. If we set Sum= Cout as it is
1Henceforth, we denote the complement of a variable V by V .
(e)
Figure 5.1: Approximate mirror adders [21]: (a) conventional mirr r adder, (b)
approximate mirror adder 1, (c) ap roximate mirro adder 2, (d) approximate irror adder
3, and (e) approximate mirror adder 4.
97
to feed the pipelined message-passing units without many stalls. The REPARAM unit is a
simple hardware realization of (4.2), while for MSG_UPD, a parallel message construction
technique [99] is exploited to reduce complexity of the pairwise computation in (4.3), as
shown in Fig. 4.1(c). In addition, to avoid overflow, MSG_UPD includes a normalization
step, which rescales the value of messages so that the minimum message is always zero.
5.2.2 Design of ANT and AC
ANT and AC can be combined in a number of ways. The available design options are
described using the following notation:
• C[a] represents a conventional system (no robust design applied, i.e., neither AC nor
ANT) designed with a precision of a bits.
• AC[a, b, c] represents an AC-only design with a total precision of a bits, using AC
on b bits from the MSB and c bits from the LSB.
• ANT[a, b, c; d, e, f] represents an ANT design combined with AC. The first three values,
a, b, and c, indicate that the main block is designed as AC[a, b, c], while the latter three
values, d, e, and f, indicate that the estimator is designed as AC[d, e, f]. Thus,
an ANT-only design (with no AC) is denoted as ANT[a, 0, 0; d, 0, 0] or, for
simplicity, ANT[a; d].
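The notation above can be made concrete with a short sketch. The function name `approximate_bit_positions` and the convention that bit 0 is the LSB are assumptions introduced here for illustration; they do not appear in the original design flow.

```python
def approximate_bit_positions(a, b, c):
    """Return the bit indices (0 = LSB) that use approximate adders in AC[a, b, c].

    Hypothetical helper: a is the total precision, b counts bits from the MSB,
    and c counts bits from the LSB, following the notation in the text.
    """
    if b < 0 or c < 0 or b + c > a:
        raise ValueError("need b, c >= 0 and b + c <= a")
    lsb_side = list(range(0, c))        # c bits counted from the LSB
    msb_side = list(range(a - b, a))    # b bits counted from the MSB
    return lsb_side + msb_side


print(approximate_bit_positions(8, 0, 2))   # AC[8, 0, 2]
print(approximate_bit_positions(8, 1, 1))   # AC[8, 1, 1]
print(approximate_bit_positions(8, 0, 0))   # same as the conventional C[8]
```

Under this reading, AC[8, 1, 1] places one approximate adder cell at the LSB and one at the MSB, while the remaining six bit positions use exact adders.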
ANT was applied at the arithmetic level, i.e., errors are compensated after each arithmetic
operation that includes every add or subtract (AS) and compare and select (CS). Every
latch in the reparameterize unit and message update unit (Fig. 4.1(b,c)) is subject to ANT
compensation. As the operations to correct are primitive in nature, our choice of estimators
is limited to circuit or structural techniques. In this dissertation, a reduced precision replica
(RPR) has been employed. An 8-bit main block (a = 8) and a 4-bit estimator (d = 4) were
used based on previous work [91].
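As a rough illustration of arithmetic-level ANT with a reduced precision replica, the sketch below pairs an 8-bit addition with a 4-bit estimate. It assumes unsigned operands, a hypothetical threshold of 32, and a hypothetical single-bit MSB fault; none of these values are taken from the actual TRW-S hardware.

```python
def rpr_add(x, y, main_bits=8, est_bits=4):
    """Estimate x + y using only the est_bits most significant bits of each operand."""
    shift = main_bits - est_bits
    return ((x >> shift) + (y >> shift)) << shift


def ant_select(y_a, y_e, tau):
    """ANT decision: keep the main-block output unless it is far from the estimate."""
    return y_a if abs(y_a - y_e) < tau else y_e


x, y = 0b10110110, 0b01001011     # 182 and 75
exact = x + y                     # error-free main-block result
y_e = rpr_add(x, y)               # coarse estimate from the 4-bit replica
faulty = exact ^ (1 << 7)         # hypothetical timing error that flips bit 7

tau = 32                          # hypothetical threshold, above the largest RPR error (30)
print(ant_select(exact, y_e, tau))    # main-block output is kept
print(ant_select(faulty, y_e, tau))   # estimate replaces the faulty output
```

The threshold only needs to exceed the worst-case estimation error of the replica (here at most 15 per truncated operand), which is why a coarse estimator is sufficient to catch large-magnitude errors.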
AC has been applied exclusively to the main block. This is because ANT gains its benefits
via VOS, and the critical path of the estimator is an important factor in determining the
amount of VOS that can be applied. Thus, instead of applying AC to the estimator, only
truncation was used. The bit location where AC has been applied was varied and was not
limited to the least significant bits (LSBs), as ANT can provide compensation for large-
magnitude errors. In this chapter, we compare the following:
• Conventional: C[8].
• AC-only designs: AC[8, 0, 1], AC[8, 0, 2], and AC[8, 1, 1].
• ANT-only design: ANT[8; 4].
• ANT combined with AC: ANT[8, 0, 1; 4], ANT[8, 0, 2; 4], and ANT[8, 1, 1; 4].
Four different AC based adder structures as shown in Fig. 5.1 [21] were used in the design
of the AC based main block.
5.3 Simulation and Results
5.3.1 Simulation setup
A cycle accurate software simulator is used to simulate the impact of hardware errors on the
TRW-S message-passing architecture. The Tsukuba task from the Middlebury benchmark
[96,100] is used as input to the simulator.
HW errors are modeled as voltage-scaling-induced timing errors, and error injection is
used to evaluate performance. For accurate simulation, input-dependent timing error statistics
are obtained via gate-level simulations using an HDL simulator. First, the gate
delay is characterized with respect to supply voltage for basic gates such as a full adder
and XOR using circuit-level simulators with a commercial 45 nm process library. Then a
structural HDL implementation of the REPARAM and MSG_UPD block is simulated via
an HDL simulator using the pre-characterized delay values. By choosing the delay values
that correspond to various supply voltages, HDL simulation is effectively run at different
voltages. For a supply voltage (Vdd) less than the critical voltage, errors can be observed
in the output. Through characterization of these errors, error statistics are obtained and
used to inject errors in the AS and CS blocks of the software simulator. The simulation
methodology is summarized in Fig. 5.2.

Figure 5.2: Flow chart of the simulation methodology. [Block labels: 45 nm CMOS PDK; circuit simulation; Verilog RTL of the TRW-S HW architecture; delay injection (voltage overscaling); error injection; TRW-S software simulator; ANT; energy and performance.]
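The error-injection step can be sketched as follows. This is a simplified illustration, not the simulator used in this work: the PMF values are hypothetical, and the errors are drawn independently of the inputs, whereas the statistics described above are input dependent.

```python
import numpy as np


def make_error_injector(error_values, error_probs, rng):
    """Return a function that adds an error drawn from a PMF to an arithmetic result."""
    error_values = np.asarray(error_values)
    error_probs = np.asarray(error_probs, dtype=float)
    error_probs = error_probs / error_probs.sum()   # guard against rounding in the PMF

    def inject(result):
        return result + int(rng.choice(error_values, p=error_probs))

    return inject


rng = np.random.default_rng(0)

# Hypothetical PMF for one supply voltage: mostly error-free, with occasional
# errors of MSB weight (large magnitude, as expected for timing errors).
inject = make_error_injector([0, -128, 128], [0.90, 0.05, 0.05], rng)


def add_sub_with_errors(a, b):
    """AS operation of a software simulator with a hardware error injected."""
    return inject(a + b)


samples = [add_sub_with_errors(3, 4) for _ in range(1000)]
print(sorted(set(samples)))
```

In a fuller model, one PMF per supply voltage (and per operation type) would be loaded from the HDL characterization results.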
5.3.2 Error resiliency
The performance of the TRW-S HW is measured in terms of the energy E (4.1) and BPR.
BPR is calculated by comparing the depth label of each pixel in the non-occlusion region
to the true depth map (ground truth) and counting a pixel to be bad if the label differs by
more than a threshold κ (κ = 1 in our case). Four different types of approximate adders
were simulated using approximate mirror adder 1 through 4 in [21].
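The BPR definition above translates directly into a few lines of code. The sketch below assumes the depth labels, ground truth, and non-occlusion mask are available as arrays of equal shape; the function and argument names are chosen here for illustration.

```python
import numpy as np


def bad_pixel_ratio(labels, ground_truth, non_occluded, kappa=1):
    """Fraction of non-occluded pixels whose label differs from the truth by more than kappa."""
    labels = np.asarray(labels, dtype=np.int64)
    ground_truth = np.asarray(ground_truth, dtype=np.int64)
    mask = np.asarray(non_occluded, dtype=bool)
    bad = np.abs(labels - ground_truth) > kappa
    return float(bad[mask].mean())


# Toy 2 x 2 example: one of the three non-occluded pixels is off by 2 labels.
labels = [[5, 6], [7, 9]]
truth = [[5, 8], [7, 8]]
mask = [[True, True], [True, False]]
print(bad_pixel_ratio(labels, truth, mask))
```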
Figure 5.3 shows the energy E of various design choices vs. the supply voltage Vdd. From
Fig. 5.3, it can be seen that ANT-based systems are significantly more robust to VOS than
AC-based or conventional systems. This is not surprising because in Chapter 4, we showed that
a reduction up to 34% in Vdd is possible with a BPR degradation of at most 4%. In particular,
as Vdd is scaled, ANT [8, 1, 1; 4] has significantly lower E than AC[8, 1, 1] and C[8]. This is
because ANT is able to correct MSB errors induced by the use of AC and VOS.
Furthermore, ANT [8; 4], which is an ANT only design, has minimum energy E at any
voltage as compared to all other options. This is because introducing AC incurs additional
estimation errors when the main block output is chosen as the final output. This is a very
frequent event as 30% ≤ pη ≤ 40%.

Figure 5.3: Energy E vs. supply voltage Vdd for TRW-S implemented with AC using: (a)
approximate adder 1, (b) approximate adder 2, (c) approximate adder 3, and (d)
approximate adder 4 [21]. It can be seen that AC can tolerate at most 10% scaling in Vdd,
whereas when combined with ANT, up to 34% scaling in Vdd can be tolerated. [Each panel plots minimum energy (logarithmic axis) against Vdd from 0.75 V to 1 V for C[8], AC[8,0,1], AC[8,0,2], AC[8,1,1], ANT[8;4], ANT[8,0,1;4], ANT[8,0,2;4], and ANT[8,1,1;4].]
Also, as shown in Fig. 5.3, AC-based designs result in less degradation in performance at
moderately scaled voltages, as they are subject to fewer hardware errors due to their shorter critical
paths. The energy E for AC[8, 0, 2] is lower by 58.1% to 33.8% at Vdd = 1 V, depending
upon the type of approximate adder being used. This is because the TRW-S architecture
is resilient to small-magnitude errors as shown in Chapter 4, and due to its shorter critical
path, AC[8, 0, 2] has fewer hardware errors than both AC[8, 0, 1] and C[8]. Further, the gap
in E between the AC-based designs and the conventional design reduces with Vdd. This is because as
Vdd is scaled, hardware errors begin to dominate over the estimation errors induced by AC.
5.3.3 Overhead and energy savings
The overhead of AC and ANT was measured through the cell count as synthesized in a
commercial 45 nm process. Tables 5.2 to 5.5 summarize the results. Two supply voltages
are listed in each table: the first is the minimum supply voltage with error-free operation (critical
voltage, Vdd,crit), and the second is the minimum voltage that achieves a bad-pixel ratio
(BPR) lower than 6.7% (Vdd,BPR), which is 4% more than the error-free BPR of 2.7%. The
results for ANT[8; 4] show an overhead of 2.21, which is consistent with the results obtained
in Chapter 4. The difference is due to the synthesis library used; in order to reduce leakage
at low Vdd, a high threshold voltage library was used. It can be seen that the overhead
and Vdd,crit depend greatly on the number of bits to which AC is applied, regardless of whether
it is applied at the LSB or MSB. However, Vdd,BPR has a higher dependency on LSBs
than on MSBs. The largest energy savings of 44.9% is achieved for ANT[8, 0, 2; 4] using
approximate adder 3 compared to C[8]. This is in spite of the greater than 2× increase in
cell count.
Also, note that ANT[8, 0, 2; 4] is not the design that operates at the lowest Vdd; however,
when combined with the savings that AC provides, the total energy savings is the largest.
In general, an energy savings peak is observed at a Vdd slightly higher than Vdd,BPR. This
peak is due to the fact that scaling Vdd gives diminishing returns (see Fig. 3.1),
whereas AC provides a constant proportion of energy savings regardless of Vdd. Thus, at
high Vdd the energy savings due to VOS dominates, while at low Vdd energy savings due to
AC becomes an important factor, and a trade-off point exists. The savings for the AC-only
designs (AC[8, b, c]) is less than the amount reported in [21], as the number of bits where
AC was applied was limited to at most two bits to ensure system-level performance.

Table 5.2: Estimated cell count and power consumption obtained via synthesis in a 45 nm CMOS process with AC using approximate adder 1 [21].

Design         Vdd,crit (V)  Power @ Vdd,crit (mW)   Vdd,BPR (V)  Power @ Vdd,BPR (mW)   Energy savings  Cell count  Overhead
C[8]           1.10          11.82                   N/A                                 N/A             40,790      1.0
AC[8,0,1]      1.09          11.62                   1.08         11.41                  3.5%            39,215      0.96
AC[8,0,2]      1.08          11.49                   1.06         11.13                  5.8%            37,643      0.92
AC[8,1,1]      1.08          11.51                   1.08         11.61                  2.6%            37,589      0.92
ANT[8;4]       1.10          17.81                   0.72         7.14                   39.7%           90,524      2.21
ANT[8,0,1;4]   1.09          17.62                   0.76         8.03                   32.1%           88,215      2.16
ANT[8,0,2;4]   1.08          17.38                   0.76         7.88                   33.3%           86,733      2.13
ANT[8,1,1;4]   1.08          17.39                   0.84         8.62                   27.1%           86,850      2.13

Table 5.3: Estimated cell count and power consumption obtained via synthesis in a 45 nm CMOS process with AC using approximate adder 2 [21].

Design         Vdd,crit (V)  Power @ Vdd,crit (mW)   Vdd,BPR (V)  Power @ Vdd,BPR (mW)   Energy savings  Cell count  Overhead
C[8]           1.10          11.82                   N/A                                 N/A             40,790      1.0
AC[8,0,1]      1.09          11.58                   1.05         10.38                  8.0%            38,218      0.94
AC[8,0,2]      1.07          11.03                   1.02         9.83                   16.8%           35,370      0.87
AC[8,1,1]      1.07          11.13                   1.07         11.13                  5.89%           35,152      0.86
ANT[8;4]       1.10          17.81                   0.72         7.14                   39.7%           90,524      2.21
ANT[8,0,1;4]   1.09          16.92                   0.74         7.03                   40.5%           87,512      2.15
ANT[8,0,2;4]   1.07          16.80                   0.76         6.62                   44.0%           85,370      2.09
ANT[8,1,1;4]   1.07          16.82                   0.82         8.53                   27.8%           85,414      2.09

Table 5.4: Estimated cell count and power consumption obtained via synthesis in a 45 nm CMOS process with AC using approximate adder 3 [21].

Design         Vdd,crit (V)  Power @ Vdd,crit (mW)   Vdd,BPR (V)  Power @ Vdd,BPR (mW)   Energy savings  Cell count  Overhead
C[8]           1.10          11.82                   N/A                                 N/A             40,790      1.0
AC[8,0,1]      1.09          11.63                   1.06         11.21                  5.2%            37,972      0.93
AC[8,0,2]      1.07          10.92                   1.02         9.62                   18.6%           34,518      0.85
AC[8,1,1]      1.07          11.02                   1.07         11.02                  6.8%            34,499      0.85
ANT[8;4]       1.10          17.81                   0.72         7.14                   39.7%           90,524      2.21
ANT[8,0,1;4]   1.09          16.95                   0.73         6.99                   40.8%           86,825      2.13
ANT[8,0,2;4]   1.07          16.72                   0.75         6.51                   44.9%           84,821      2.08
ANT[8,1,1;4]   1.07          16.73                   0.81         8.48                   28.3%           84,905      2.08

Table 5.5: Estimated cell count and power consumption obtained via synthesis in a 45 nm CMOS process with AC using approximate adder 4 [21].

Design         Vdd,crit (V)  Power @ Vdd,crit (mW)   Vdd,BPR (V)  Power @ Vdd,BPR (mW)   Energy savings  Cell count  Overhead
C[8]           1.10          11.82                   N/A                                 N/A             40,790      1.0
AC[8,0,1]      1.09          11.59                   1.06         11.18                  5.4%            38,026      0.93
AC[8,0,2]      1.07          10.90                   1.04         9.68                   19.0%           34,873      0.85
AC[8,1,1]      1.07          10.98                   1.07         10.98                  7.1%            34,780      0.85
ANT[8;4]       1.10          17.81                   0.72         7.14                   39.6%           90,524      2.21
ANT[8,0,1;4]   1.09          16.95                   0.74         7.08                   40.1%           87,014      2.13
ANT[8,0,2;4]   1.07          16.72                   0.79         8.31                   29.7%           85,093      2.09
ANT[8,1,1;4]   1.07          16.73                   0.88         10.21                  13.6%           85,111      2.09
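The overhead and energy-savings entries in Tables 5.2 to 5.5 appear to be normalized to the conventional design C[8]: the overhead as a ratio of cell counts, and the savings as one minus the ratio of the power at Vdd,BPR to the power of C[8] at Vdd,crit. This reading is inferred from the tabulated numbers rather than stated in the text, so the sketch below should be taken as a consistency check under that assumption. The values are copied from Table 5.2.

```python
def overhead(cell_count, reference_cell_count):
    """Cell count relative to the conventional design (assumed definition)."""
    return cell_count / reference_cell_count


def energy_savings(power_mw, reference_power_mw):
    """Savings relative to C[8] at its critical voltage (assumed definition).

    Power is used as a proxy for energy, which presumes a fixed clock frequency.
    """
    return 1.0 - power_mw / reference_power_mw


C8_CELLS, C8_POWER = 40790, 11.82            # C[8] in Table 5.2

# AC[8,0,2] with approximate adder 1
print(round(overhead(37643, C8_CELLS), 2), round(100 * energy_savings(11.13, C8_POWER), 1))

# ANT[8,0,2;4] with approximate adder 1
print(round(overhead(86733, C8_CELLS), 2), round(100 * energy_savings(7.88, C8_POWER), 1))
```

Both rows reproduce the tabulated overhead (0.92 and 2.13) and savings (5.8% and 33.3%) under this reading; a few other entries differ in the last digit, presumably because the published powers are rounded.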
We also show that applying AC at the MSB is a viable design choice when combined
with ANT, as is the case with ANT [8, 1, 1; 4]. This shows that ANT can indeed tolerate
large-magnitude errors, and jointly optimizing the design with AC and ANT will result in
greater energy efficiency.
5.3.4 Conclusions
In this chapter, we have studied the use of both AC and ANT for robust and energy-
efficient design. First, AC-only designs were shown to achieve up to 10% energy savings by
incorporating application-level information. When combined with ANT, significant energy
savings, on the order of 45%, can be achieved. This shows that embracing the stochasticity
of the underlying process is crucial in achieving high energy efficiency. This dissertation has
mainly focused on the application of AC to the main block. However, the design space of AC
combined with ANT is large. Additional cases where AC is applied only to the estimator,
and where AC is applied to the main block and estimator together should be studied. After
full exploration of the design space, a design methodology to obtain the optimal design in
combining AC and ANT can be obtained. To reduce the ANT overhead, AC combined with
ANT at the iteration and system level (Chapter 4) is another direction for future work.
Chapter 6
Analysis of Statistical Error Compensation Techniques
In the previous chapters, SEC techniques have been successfully applied to tolerate errors in
hardware. In this chapter, we provide a theoretical analysis for algorithmic noise tolerance
(ANT).
In the design of an ANT-based system, the threshold τ is an important design parameter
that was previously chosen empirically. In this analysis, we establish a link between ANT
and the Bayesian detection and estimation framework. We show that ANT can be viewed as
a two-step process, where detection of an error event is performed, followed by estimation of
the correct value. The detection stage is a threshold detector, while the estimation stage is
an approximation to the minimum mean square error (MMSE) estimator. We further derive
expressions for determining the Bayesian optimal threshold $\tau^*$, and verify it with Monte
Carlo simulations. Thus, this analysis proves that ANT has a strong Bayesian foundation.
Furthermore, this dissertation opens up the possibility of establishing a similar basis for
other SEC techniques.
The remainder of the chapter is organized as follows. In Section 6.1 we formulate the
detection and estimation problem, then derive the optimal decision rule under the Bayesian
estimation framework. Section 6.2 compares the performance of ANT and the optimal
decision rule via a simple example and a 2D-DCT application. Section 6.3 summarizes the
chapter.
6.1 Analysis of Algorithmic Noise Tolerance
In this section, we formulate a general detection and estimation problem with two
observations. The optimal detection and estimation rule is derived, and ANT is shown to be a
low-complexity approximation of this optimal solution. A description of ANT is provided
in Section 1.2.4, where it was shown that the final or corrected output ŷ is obtained via the
following decision rule:
\[
\hat{y} = \theta_{\mathrm{ANT}}(\mathbf{y}) =
\begin{cases}
y_a, & \text{if } |y_a - y_e| < \tau \\
y_e, & \text{otherwise}
\end{cases}
\tag{6.1}
\]
where y = (ya, ye) is the observation vector (see Fig. 6.1), and τ is an application-dependent
parameter chosen empirically to maximize the performance (MMSE, probability of correct
operation Pcorrect) of ANT.
6.1.1 A Bayesian formulation of ANT
The detector assumes binary hypotheses: (1) the error-free event H0 (i.e., η = 0), and (2) the
error event H1 (i.e., η ≠ 0). The goal is to determine the optimal decision rule δ(y|Pη, Pe),
which chooses one hypothesis based on the observation y and the error statistics Pη and Pe.
Then estimation is performed via an estimation rule ŷ = θ(y), which finds the best estimate
of yo based on a specific optimization criterion. This differs from a classical detection and
estimation problem in that the hypotheses are not based on yo, but instead on η, and in that the
estimation step utilizes the detection stage information to simplify the estimation process.
This distinction turns out to be crucial in enabling a simplification of the Bayesian error
compensation scheme leading to ANT.
The following assumptions will be made. The error-free main block output yo and the
errors η and e are fixed-point numbers with bit width N, modeled as random variables
over the set S = {−2^(N−1), ..., 2^(N−1) − 1}. With the bit width of the output set to N, the
cardinality of S is 2^N. Furthermore, yo is assumed to be uniformly distributed and thus
p(yo) = 1/2^N. The errors η and e are assumed to be independent. We will further assume that
η is zero mean and Pη has a peak at zero that represents the probability of no errors. Further,
Pη is symmetric about the mean, and Pη(α) ≥ Pη(β) for |α| ≥ |β|, β ≠ 0. On the other
hand, e is assumed to be zero mean, Pe is symmetric about the mean, and Pe(α) ≤ Pe(β)
for |α| ≥ |β|, as depicted in Fig. 6.2(a). Large-magnitude values of e are assumed to
be extremely unlikely. These assumptions are motivated by noting that timing-induced
hardware errors are large in magnitude due to LSB-first computation, and estimation errors
are Gaussian (subsampled estimators) or uniformly distributed (reduced precision estimators).
An example conditional PMF of timing errors, conditioned on the error event induced by
voltage overscaling (VOS), is given in Fig. 6.2(b). This error statistic was obtained through
Verilog simulations of a 16-bit ripple carry adder. It can be seen that large-magnitude errors
occur with large probability, which justifies our assumption for Pη.

Figure 6.1: The Bayesian framework for ANT. [Diagram labels: input x; ya = yo + η; ye = yo + e; detector; estimator; error statistics Pη(η) and Pe(e); detection threshold τ; corrected output ŷ.]

Figure 6.2: Example error PMFs: (a) depiction of Pη and Pe that increase or decrease in
distance from the mean, and (b) voltage overscaling (VOS) induced timing errors.
6.1.2 Detection
We perform Bayesian hypothesis testing to detect the error event. For equally likely priors,
the decision rule is given by comparing the likelihoods [109]:
\[
p(H_1 \mid \mathbf{y}) \underset{H_0}{\overset{H_1}{\gtrless}} p(H_0 \mid \mathbf{y})
\tag{6.2}
\]
The likelihoods are derived as follows:
\[
p(H_0 \mid \mathbf{y}) = \frac{p(H_0 \cap \mathbf{y})}{p(\mathbf{y})}
= \frac{\frac{1}{2^N} P_\eta(0)\, P_e(y_e - y_a)}{p(\mathbf{y})}
\tag{6.3}
\]
where the second equality comes from the fact that all possible values of ya are equally
likely, and η and e are independent. Also, to have the no-error event, i.e., η = 0, when
y = (ya, ye) is observed, we need η = 0 and e = ye − ya. Likewise,
\[
p(H_1 \mid \mathbf{y}) = \frac{p(\mathbf{y}) - \frac{1}{2^N} P_\eta(0)\, P_e(y_e - y_a)}{p(\mathbf{y})}
\tag{6.4}
\]
where once again the independence of η and e has been used.
The marginal probability p(y) is the sum of the probabilities of all possible ways to observe
y. Thus, it is given by
\[
p(\mathbf{y}) = \sum_{y_o} p(y_o)\, P_\eta(y_a - y_o)\, P_e(y_e - y_o)
= \frac{1}{2^N} \sum_{\eta} P_\eta(\eta)\, P_e(\eta + (y_e - y_a))
\tag{6.5}
\]
Defining the likelihood ratio L as the ratio of (6.4) to (6.3), L = p(H1|y)/p(H0|y), the optimal
Bayesian detection rule δ(y) is given by
\[
\delta(\mathbf{y}) =
\begin{cases}
H_1 & \text{if } L \ge 1 \\
H_0 & \text{if } L < 1
\end{cases}
\tag{6.6}
\]
From (6.3) and (6.5), it can be seen that L only depends on the difference of the observed
outputs d = ya − ye. Furthermore, given that Pe is decreasing in distance from the mean
and symmetric, Pη(0)Pe(ye − ya) is a decreasing function in |d|. Also, p(y) is symmetric,
and convex in the half-plane, and given that Pe is heavily concentrated around the mean,
the minimum of p(y) is close to 0. Thus we can approximate p(y) as an increasing function
in |d|. Now we can rewrite the decision rule as
\[
\delta(\mathbf{y}) =
\begin{cases}
H_1 & \text{if } |d| \ge \tau \\
H_0 & \text{if } |d| < \tau
\end{cases}
\tag{6.7}
\]
where the theoretically optimal threshold $\tau^*_{a,p}$ is the value of |d| when L = 1, and satisfies
\[
\sum_{\eta} P_\eta(\eta)\, P_e(\eta + \tau^*_{a,p}) = 2\, P_\eta(0)\, P_e(\tau^*_{a,p})
\tag{6.8}
\]
We can see that the resulting optimal decision rule (6.7) is equivalent to the detection rule
(6.1) used in ANT. Furthermore, we have shown that the optimal threshold $\tau^*_{a,p}$ to be used
in ANT is given by (6.8).
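Condition (6.8) follows from setting L = 1 in (6.3) to (6.5), which gives p(y) equal to twice the numerator of (6.3). The sketch below solves it numerically by scanning integer thresholds and returning the first value at which the left-hand side reaches the right-hand side. Several choices here are assumptions made for illustration: PMFs are stored as arrays over S, Pe is taken to be zero outside S, and the demonstration PMFs follow the Laplace-shaped form used for the simple example in Section 6.2.

```python
import numpy as np

N = 8
HALF = 2 ** (N - 1)
K = np.arange(-HALF, HALF)          # support S = {-2^(N-1), ..., 2^(N-1) - 1}


def pmf_at(pmf, k):
    """Evaluate a PMF stored over K at integer offsets k; zero outside S (assumption)."""
    idx = np.asarray(k) + HALF
    inside = (idx >= 0) & (idx < pmf.size)
    out = np.zeros(idx.shape)
    out[inside] = pmf[idx[inside]]
    return out


def analytic_threshold(p_eta, p_e):
    """Smallest |d| at which (6.8) is met, i.e., where L first reaches 1."""
    for tau in range(HALF):
        lhs = np.sum(p_eta * pmf_at(p_e, K + tau))      # sum_eta P_eta(eta) P_e(eta + tau)
        rhs = 2.0 * p_eta[HALF] * p_e[HALF + tau]       # 2 P_eta(0) P_e(tau)
        if lhs >= rhs:
            return tau
    return None


# Demonstration PMFs: truncated discrete Laplace P_e (a = 0.9) and a circularly
# shifted copy for P_eta with error probability 0.1.
p_e = 0.9 ** np.abs(K)
p_e /= p_e.sum()
p_eta = np.roll(p_e, HALF)
p_eta[HALF] = 0.0
p_eta *= 0.1 / p_eta.sum()
p_eta[HALF] = 0.9

tau_star = analytic_threshold(p_eta, p_e)
print(tau_star)
```

Because both sides of (6.8) are evaluated only at integer thresholds, the result can differ by one from a value obtained with a different rounding convention at the crossing point.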
6.1.3 Estimation
When H0 is detected (η = 0), the main block output is used as the corrected output, i.e.,
ŷ = ya. When H1 is detected, a more complex estimation needs to take place. There
are many different optimality conditions available for estimation such as minimum mean
squared error (MMSE), minimum mean absolute error (MMAE), and maximum a posteriori
probability (MAP). In this analysis, we will focus on MMSE.
The optimization criterion will be to minimize E(ŷ − yo)^2 given y and that H1 has been
detected, where the expectation is over yo. It is well known that the MMSE estimator is the
conditional mean, i.e., the estimate ŷ(y) = E(yo|y, H1) [109]. As the posterior distribution
of yo is
\[
p(y_o \mid \mathbf{y}, H_1) = \frac{1}{1 - p_\eta}\, P_\eta(y_a - y_o)\, P_e(y_e - y_o)
\tag{6.9}
\]
the MMSE optimal estimator will be
\[
\hat{y} = \theta(\mathbf{y}) =
\begin{cases}
y_a & \text{if } \delta(\mathbf{y}) = H_0 \\
\sum_{y_o \neq y_a} p(y_o \mid \mathbf{y}, H_1)\, y_o & \text{if } \delta(\mathbf{y}) = H_1
\end{cases}
\tag{6.10}
\]
where ya is excluded in the summation as it corresponds to H0. Determining ŷ is a complex
and power-hungry task. In ANT, an approximation to the optimal estimator has been
made, such that ŷ = ye. This estimator can be implemented by a simple mux with the
output of the detection stage used as the control. In Section 6.2, we will show that such an
approximation results in minimal performance degradation.
6.2 Comparison of Simulation and Analysis
In this section, we will first apply ANT on a simple example and compare the results with
analysis. For this example, all signals are considered to be eight bits (N = 8). The error
PMFs Pη and Pe are constructed as follows. Pe is a truncated discrete Laplace distribution
with parameter a = 0.9 normalized by a constant A. Pη is a circularly shifted version of Pe
scaled with the probability of error pη = 0.1, and a peak of 1− pη at 0. The exact forms of
the error PMFs Pη and Pe are given below:
\[
P_e(k) = A\, \frac{1 - a}{1 + a}\, a^{|k|}, \quad -128 \le k \le 127
\tag{6.11}
\]
\[
P_\eta(k) =
\begin{cases}
1 - p_\eta & \text{if } k = 0 \\
p_\eta\, P_e\!\left( \left( k + 2^{N-1} \right)_{2^N} \right) & \text{if } k \neq 0
\end{cases}
\tag{6.12}
\]
where $(\cdot)_{2^N}$ represents the truncation operation where the N LSBs are obtained. The PMF
is depicted in Fig. 6.3(b), where the peak at Pη(0) has been clipped for better visibility.
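The two PMFs in (6.11) and (6.12) can be tabulated as shown below. The sketch interprets the truncation to N LSBs as a two's-complement wrap-around, which is one plausible reading of the notation and is labeled as an assumption; the variable names are illustrative.

```python
import numpy as np

N, a, p_err = 8, 0.9, 0.1
HALF = 2 ** (N - 1)
k = np.arange(-HALF, HALF)                     # -128 ... 127

# (6.11): truncated discrete Laplace PMF; A restores unit mass after truncation.
p_e = (1 - a) / (1 + a) * a ** np.abs(k)
A = 1.0 / p_e.sum()
p_e = A * p_e


def wrap(v):
    """Keep the N LSBs and reinterpret them as a signed N-bit value (assumed reading)."""
    return (v + HALF) % (2 * HALF) - HALF


# (6.12): circular shift of P_e by 2^(N-1), scaled by p_err away from zero,
# with a peak of 1 - p_err at zero.
p_eta = p_err * p_e[wrap(k + HALF) + HALF]
p_eta[k == 0] = 1 - p_err

nonzero_part = np.where(k == 0, 0.0, p_eta)
print(A, p_eta.sum(), k[np.argmax(nonzero_part)])
```

With this construction the nonzero errors concentrate near the ends of the range, so large-magnitude errors are the most likely ones, in line with the assumptions on Pη. The mass lost by overwriting the entry at k = 0 is negligible, so Pη sums to one only to within a very small tolerance.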

Figure 6.3: Simulation results for a simple example: (a) probability of correct detection
and MSE vs. τ, (b) error PMF, and (c) $\tau^*_{s,p}$ and $\tau^*_{a,p}$ for different PMFs. [Panel (a) annotations: $\tau^*_{s,p} = 70$, $\tau^*_{a,p} = 68$, $\tau^*_{s,m} = 65$.]

The performance of ANT with varying thresholds is shown in Fig. 6.3(a). The design of
an ANT system would go through such simulations to obtain the optimal threshold. For this
example, Monte Carlo simulations show that the probability of correct detection is maximum
at $\tau^*_{s,p} = 70$, with Pdet = 0.9986, which is very close to $\tau^*_{a,p} = 68$, the theoretical optimum
obtained from (6.8). Similarly, simulations indicate that the MSE is minimized at $\tau^*_{s,m} = 65$,
with MSE = 27.21, which is also very close to the MSE = 25.19 obtained via the optimal
estimator in (6.10). Furthermore, the utility of our analysis in design can be seen by the
fact that the MSE achieved via simulations with $\tau = \tau^*_{a,p}$ is 33.55, which corresponds to
a difference of 0.4% when normalized with respect to the maximum MSE obtained when
τ = 128. Note that the detection probability and MSE curves are flat around the optimal
point. This shows that the exact value of the optimal threshold is unnecessary to achieve
near-optimal performance, and that both $\tau^*_{a,p}$ and $\tau^*_{s,p}$ are good approximations for $\tau^*_{s,m}$.
Similar results (see Fig. 6.3(c)) were obtained with different forms for Pη and Pe, as long
as they met the assumptions outlined in Section 6.1. For the total of 20 cases shown in Fig.
6.3(c), the maximum difference between $\tau^*_{s,p}$ and $\tau^*_{a,p}$ was 11, while on average the difference was
6.6. In terms of MSE, this corresponds to a maximum degradation of 2.1%, with an average
degradation of 0.6% when normalized to the maximum MSE.
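A threshold sweep of the kind used to produce Fig. 6.3(a) can be sketched with a small Monte Carlo experiment. The code below is not the simulator used for the reported numbers: the trial count and random seed are arbitrary, wrap-around of ya and ye is not modeled, and the PMFs are rebuilt in compact form from (6.11) and (6.12), so the resulting values will only approximate those quoted above.

```python
import numpy as np

rng = np.random.default_rng(1)
N, HALF = 8, 128
k = np.arange(-HALF, HALF)

# PMFs of the simple example, following (6.11) and (6.12)
p_e = 0.9 ** np.abs(k)
p_e /= p_e.sum()
p_eta = 0.1 * np.roll(p_e, HALF)
p_eta[HALF] = 0.0
p_eta[HALF] = 1.0 - p_eta.sum()           # peak at zero (probability of no error)

trials = 200_000
y_o = rng.integers(-HALF, HALF, size=trials)      # uniform error-free output
eta = rng.choice(k, size=trials, p=p_eta)         # hardware errors
e = rng.choice(k, size=trials, p=p_e)             # estimation errors
y_a, y_e = y_o + eta, y_o + e                     # no wrap-around (assumption)


def sweep(taus):
    """Return detection probability and MSE of ANT for each threshold."""
    p_det, mse = [], []
    for tau in taus:
        flagged = np.abs(y_a - y_e) >= tau            # detector decides H1
        p_det.append(np.mean(flagged == (eta != 0)))  # correct detections
        y_hat = np.where(flagged, y_e, y_a)           # ANT estimate (6.1)
        mse.append(np.mean((y_hat - y_o) ** 2))
    return np.array(p_det), np.array(mse)


taus = np.arange(0, 129)
p_det, mse = sweep(taus)
print(taus[np.argmax(p_det)], p_det.max(), taus[np.argmin(mse)], mse.min())
```

Plotting `p_det` and `mse` against `taus` should show the flat region around the optimum noted above, which is why a threshold that is off by a few counts costs little.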
We have also applied ANT to a 2D-DCT image compression application. Figure 6.4
depicts the simulation setup. For ANT, the main block is an 8-bit input, 8-bit output, 8 × 8
2D-DCT block using Chen’s algorithm [110], followed by a quantizer, with mirror adders and
array multipliers [111] as fundamental building blocks, implemented in a commercial 45 nm,
1.2 V CMOS process. Pη is characterized at various voltages through delay-based
Verilog simulations, with two examples shown in Fig. 6.5(a). Pη has a few large-amplitude
errors that have a high probability of occurrence, which only partially follows our
assumptions for η. Only the main 2D-DCT block is subject to voltage overscaling (VOS), and hence it is
the only block that exhibits errors. The estimator is a reduced-precision version of the main
block. Figure 6.5(b) shows the simulation result at Vdd = 1 V. Simulations show τ*_{s,p} = 55,
while τ*_{s,m} = 57. Using (6.8), τ*_{a,p} = 65. We can see that even though Pη does not satisfy
the assumptions in Section 6.1, the values are similar. In this case as well, the detection
probability exhibits a very flat behavior around τ*_p and shows that ANT is robust to the
value of the detection threshold.
[Figure 6.4 block labels: input x; main 2D-DCT (y_a = y_o + η, hardware errors); RPR 2D-DCT estimator (y_e = y_o + e, estimation errors); threshold comparison |y_a − y_e| > τ; output ŷ; Q; 2D-IDCT; error-free and actual outputs.]
Figure 6.4: Block diagram of the simulation setup for the DCT application.
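The idea of a reduced-precision estimator can be sketched in a few lines. The example below is only an illustration under stated assumptions: it uses a direct 1-D 8-point orthonormal DCT-II in floating point rather than the 2-D fixed-point Chen implementation described above, and it models reduced precision by clearing the least significant input bits. The function names and the choice of 4 dropped bits are hypothetical.

```python
import math
import random

N = 8  # transform length


def dct8(x):
    """Orthonormal 8-point DCT-II computed directly (not Chen's fast algorithm)."""
    out = []
    for k in range(N):
        scale = math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)
        acc = sum(x[n] * math.cos(math.pi * (2 * n + 1) * k / (2 * N))
                  for n in range(N))
        out.append(scale * acc)
    return out


def reduce_precision(x, drop_bits):
    """Clear the drop_bits least significant bits of each unsigned sample."""
    mask = ~((1 << drop_bits) - 1)
    return [v & mask for v in x]


def estimation_errors(x, drop_bits=4):
    """Return e = y_e - y_o for every DCT coefficient."""
    y_o = dct8(x)
    y_e = dct8(reduce_precision(x, drop_bits))
    return [ye - yo for ye, yo in zip(y_e, y_o)]


def error_bound(drop_bits=4):
    """Worst-case |e|: each input changes by at most 2**drop_bits - 1, and the
    absolute values of any orthonormal basis row sum to at most sqrt(N)."""
    return ((1 << drop_bits) - 1) * math.sqrt(N)


if __name__ == "__main__":
    rng = random.Random(1)
    worst = 0.0
    for _ in range(1000):
        x = [rng.randint(0, 255) for _ in range(N)]
        worst = max(worst, max(abs(e) for e in estimation_errors(x)))
    print(f"largest |e| observed: {worst:.2f}, bound: {error_bound():.2f}")
```

In this sketch the estimation error is always present but bounded in magnitude, which is the property that lets a threshold separate it from large hardware errors.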
6.3 Summary
In this chapter, we have provided a statistical analysis of ANT to aid the design of
error-resilient DSP systems. This is a first attempt at providing an analytical basis for designing
and verifying the performance of such systems. The analysis gives insight into the significant
robustness enhancement that ANT provides. The simple thresholding-based detection is a
good approximation to the optimal Bayesian rule for detecting HW errors. Furthermore,
the performance of ANT is not sensitive to the threshold value, which suggests that a single
threshold value can be used without calibration for multiple instances of a design. This
analysis methodology can be generalized to include other statistical error compensation
techniques.
[Figure 6.5 plots: (a) two error PMFs, probability vs. error magnitude; (b) probability of correct detection (Pdet) and mean square error (MSE) vs. threshold, with τ*_{s,p} = 55, τ*_{s,m} = 57, and τ*_{a,p} = 65 marked.]
Figure 6.5: DCT example: (a) error statistics of a voltage overscaled DCT block at
Vdd = 1.1 V (pη = 0.0043) and Vdd = 1.0 V (pη = 0.0374), and (b) the resulting probability
of correct detection and MSE vs. τ.
Chapter 7
Conclusion and Future Work
Machine learning applications have become prevalent. Energy-efficient ML accelerators will
be a critical enabler for deploying ML applications not only in the mobile space but also in
the high-performance computing space due to the thermal bottleneck. In this dissertation
the statistical nature of DSP and ML applications is combined with the stochasticity of
deeply scaled process technology through statistical error compensation to enhance error
resiliency and energy efficiency.
7.1 Dissertation Contributions
In the past, SEC techniques have been applied to a wide range of applications in commu-
nication and DSP systems. In this dissertation, we extended the application of SEC to
include machine learning applications. Specifically, we showed that when SEC is applied to
applications that possess inherent robustness, its effectiveness is significantly enhanced.
First, PN code acquisition, a detection-based application widely used in communication
systems, was chosen. An SSNOC-based PN acquisition filter was implemented in 180 nm
CMOS, and its performance and energy consumption were measured. Results show that a detection
rate higher than 90% at a false alarm rate of 10% is maintained under aggressive VOS, at a supply
voltage 28.4% below the critical voltage, resulting in error rates larger than 85.83%, with a
corresponding energy savings of 2.52×.
Then SEC was applied to a belief propagation based communication system, a decoder
for low-density parity check (LDPC) codes. Combined with the inherent resilience of LDPC
codes, the SEC based LDPC decoder can operate at a supply voltage up to 38% lower
than the nominal voltage and tolerate up to 30× more errors over an SNR range of 3 dB to
8 dB, while maintaining less than 3× degradation in BER. This is equivalent to energy
savings of 45.7% compared to conventional LDPC decoders, and 33.2% compared to a sign
bit protected LDPC decoder.
SEC was also applied to a Markov random field based iterative stereo matching architecture.
Building upon the SEC based LDPC decoder, SEC was applied to a more complex ML
application with a larger graph and larger dimension. Error compensation is performed at several
levels, including the arithmetic, iteration, and system levels. The compensation overhead,
robustness, and energy savings are characterized and compared among the different levels of
compensation. Arithmetic compensation achieves power savings of 39.6% at an overhead
of 97.4%. A hybrid approach can successfully trade off the compensation complexity and
energy savings by achieving 16.1% additional power savings compared to arithmetic-level
compensation while reducing the overhead to 57.9%.
A study on combining SEC and approximate computing has been performed to show that
SEC can further extend the AC based design to achieve additional robustness and energy
efficiency. Results show that ANT combined with AC achieves energy savings of 44.7%
compared to a conventional system, while achieving at most 4% degradation in performance.
This supports our view that embracing the stochasticity of the underlying process is essential
to achieve significant energy savings.
Despite the successful design of SEC based communication and ML accelerators, SEC
design is mostly done in a non-systematic ad hoc manner. To develop theory and design
guides, attempts to analyze SEC techniques have been made. An analysis framework has
been proposed and under this framework, ANT was shown to be an approximation to the
Bayesian optimal detector and estimator. Furthermore, the performance of ANT is not
sensitive to the threshold value, which suggests that a single threshold value can be used
without calibration for multiple instances of a design.
7.2 Future Work
Design of DSP and ML systems based on SEC represents a major paradigm shift, along with a
significant increase in error resiliency and energy efficiency. To fully embrace this paradigm
shift and appreciate the potential benefits SEC can provide, several interesting challenges
must be explored. To solve these problems, an interdisciplinary research effort
is needed, with collaboration across research areas spanning all levels of the design
hierarchy.
7.2.1 SEC for ML kernels
Machine learning is a very broad field. In this dissertation, we have applied SEC to one
specific class of ML applications, probabilistic graphical models (PGMs). LDPC decoders
are based on message passing, while the TRW-S stereo image matching architecture is based
on belief propagation. As PGMs perform inference in an iterative manner, they possess
significant inherent resiliency, which we have successfully exploited. However, the benefits of
SEC are not limited to PGMs. The application of SEC to other classes of ML applications
should be explored carefully. As ML applications, in general, possess significant inherent error
resiliency, one can expect results similar to those seen for PGMs when SEC is
applied. Many classes such as traditional MAP estimation, expectation maximization
(EM), linear discriminants, and principal component analysis (PCA) can all benefit from
SEC to varying degrees.
One other class of ML applications that is of particular interest consists of those implemented in
two steps: training and classification. The classification step gives rise to inherent resiliency.
As long as the computed value falls within the same decision region, hardware errors
will not affect system performance. Support vector machines (SVMs) [112] are one example
successfully used in many classification systems. SVMs extract features from the input
data, then map the features to a high-dimensional space, which is divided via decision
boundaries. There already exists work on energy-efficient and error-resilient SVM kernels
using error-aware decision boundaries at the classification stage to combat hardware
errors occurring at the feature extraction stage [113]. Many structures based on neural
networks and deep belief networks [114], including convolutional nets [115], are also based on
the training-classification principle, and possess the potential for significant energy efficiency
and robustness via SEC.
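The resiliency argument for the classification step can be made concrete with a minimal sketch. It assumes a linear decision rule and models a hardware error as an additive perturbation of the decision score; the weights, the input, and the function names are hypothetical. This is not the error-aware decision boundary technique of [113], only an illustration of why errors smaller than the score margin leave the decision unchanged.

```python
def score(w, b, x):
    """Linear decision score w.x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def classify(w, b, x, hw_error=0.0):
    """Sign of the decision score; hw_error is an assumed additive error."""
    return 1 if score(w, b, x) + hw_error >= 0 else -1


def tolerates(w, b, x, hw_error):
    """True when the erroneous computation yields the error-free label."""
    return classify(w, b, x, hw_error) == classify(w, b, x)


if __name__ == "__main__":
    w, b, x = [2.0, -1.0], 0.5, [1.0, 0.5]  # error-free score is 2.0
    for err in (-1.9, -0.5, 0.0, 3.0, -2.1):
        print(f"error {err:+.1f}: label preserved = {tolerates(w, b, x, err)}")
```

In this example only the last error, whose magnitude exceeds the score margin and pushes the score across zero, changes the label.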
7.2.2 SEC techniques augmented with circuit- or logic-level techniques
In this dissertation, SEC was applied at the algorithm or microarchitecture level, and utilized
system- or application-level tolerance to achieve its benefits. Studies on combining SEC with
circuit- or logic-level techniques can be one research direction. The manufacturing processes
of post-silicon devices are still immature. Implementations that rely on post-silicon devices,
and even deeply scaled CMOS, have a very low yield due to the defects introduced within
the manufacturing process. For such defective processes, compensation closer to the physical
layer may be required, and circuit- or logic-level compensation techniques combined with
algorithmic compensation techniques can provide effective solutions to the manufacturing
challenge and enable the design of such systems.
7.2.3 Error-aware design methodology
In all our work, SEC design was performed in a custom manner that required deep knowledge
of systems and architecture as well as SEC techniques. To enable the general designer to
implement SEC based designs, an error-aware design methodology needs to be developed.
It is essential that this methodology incorporates system-level information such as a target
error rate or error PDF shape that the application can tolerate. Based on this methodology,
computer aided design (CAD) tools can be developed that truly automate the design of SEC
based systems. Similar design flows have been proposed in work on approximate logic
synthesis [40–42]. Synthesis methodologies targeted at processors that allow voltage or
reliability trade-offs [116] or graceful degradation [117] are another effort in this direction.
7.2.4 Verification and testing
Verification and testing are a major challenge in designing stochastic systems. As SEC
inherently causes the outputs to be stochastic, new methods to verify the correctness of the
design and test its functional correctness need to be developed. The goal for any component
in a system is to enable the system to perform useful tasks, and thus the testing procedure
should also be aware of system-level information. Defining new system-level metrics that
faithfully represent the validity of the HW implementation will be an interesting problem to
explore. In the context of AC, Liang et al. have proposed the use of new metrics in [105], but it is
essential that application-level information is incorporated as well.
7.2.5 Theoretical foundations for SEC
Last but not least, theoretical foundations for stochastic design principles need to be es-
tablished. Though SEC is a communication- and Shannon-inspired technique, there is no
analogue to information theory in the computing domain. An analogue to the channel capac-
ity theorem is essential in the computing domain to find fundamental limits on the amount
of error resiliency or energy efficiency one can achieve. This limit will enable us to gauge
how well our current systems are designed, and provide us with a means to keep exploring better
designs. To this end, an understanding of existing SEC techniques is important to enable
intelligent and efficient design choices.
References
[1] P. Dubey, “A platform 2015 workload model: Recognition, mining and synthesis moves
computers to the era of tera,” White Paper, Intel Corp., 2005.
[2] L. Sun, “The future according to Freescale,” Freescale Technology Forum, Freescale,
June 2008.
[3] T. Sakurai, “Perspectives on power-aware electronics,” in IEEE International Solid-
State Circuits Conference (ISSCC), vol. 1, Feb. 2003, pp. 26–29.
[4] “International Technology Roadmap for Semiconductors,” Online:
http://www.itrs.net.
[5] K. Sekar, “Power and thermal challenges in mobile devices,” in ACM International
Conference on Mobile Computing & Networking (MobiCom), 2013, pp. 363–368.
[6] M. Miranda, “When every atom counts,” IEEE Spectrum, vol. 49, no. 7, p. 32, 2012.
[7] C. Almudever and A. Rubio, “Carbon nanotube growth process-related variability in
CNFETs,” in IEEE Conference on Nanotechnology (IEEE-NANO), Aug. 2011, pp.
1084–1087.
[8] C. E. Shannon, “A mathematical theory of communication,” Bell System Technical
Journal, vol. 27, no. 3, pp. 379–423, 1948.
[9] D. Swade, Charles Babbage’s Difference Engine No. 2 Technical Description. London,
UK: National Museum of Science and Industry, 1996.
[10] D. K. Schroder and J. A. Babcock, “Negative bias temperature instability: Road
to cross in deep submicron silicon semiconductor manufacturing,” Journal of Applied
Physics, vol. 94, no. 1, pp. 1–18, 2003.
[11] J. Lienig, “Electromigration and its impact on physical design in future technologies,”
in ACM International Symposium on Physical Design, 2013, pp. 33–40.
[12] D. Pantic, “Benefits of integrated-circuit burn-in to obtain high reliability parts,” IEEE
Transactions on Reliability, vol. 35, no. 1, pp. 3–6, Apr. 1986.
[13] I. Koren and A. Singh, “Fault tolerance in VLSI circuits,” IEEE Computer, vol. 23,
no. 7, pp. 73–83, 1990.
[14] W. Moore, “A review of fault-tolerant techniques for the enhancement of integrated
circuit yield,” Proceedings of the IEEE, vol. 74, no. 5, pp. 684–698, May 1986.
[15] Y. Tamir and M. Tremblay, “High-performance fault-tolerant VLSI systems using
micro rollback,” IEEE Transactions on Computers, vol. 39, no. 4, pp. 548–554, 1990.
[16] D. Ernst et al., “Razor: A low-power pipeline based on circuit-level timing specu-
lation,” in Proceedings of the 36th annual IEEE/ACM International Symposium on
Microarchitecture, Dec. 2003, pp. 7–18.
[17] S. Das, C. Tokunaga, S. Pant, W. Ma, S. Kalaiselvan, K. Lai, D. Bull, and D. Blaauw,
“Razor II: In situ error detection and correction for PVT and SER tolerance,” IEEE
Journal of Solid-State Circuits, vol. 44, no. 1, pp. 32–48, 2009.
[18] B. Parhami, “A multi-level view of dependable computing,” Computers & Electrical
Engineering, vol. 20, no. 4, pp. 347–368, 1994.
[19] J. von Neumann, “Probabilistic logics and the synthesis of reliable organisms from
unreliable components,” Automata Studies, vol. 34, pp. 43–98, 1956.
[20] B. W. Johnson, Design & Analysis of Fault Tolerant Digital Systems. Boston, MA:
Addison-Wesley Longman Publishing Co., Inc., 1988.
[21] V. Gupta, D. Mohapatra, A. Raghunathan, and K. Roy, “Low-power digital signal
processing using approximate adders,” IEEE Transactions on Computer-Aided Design
of Integrated Circuits and Systems, vol. 32, no. 1, pp. 124–137, 2013.
[22] J. Han and M. Orshansky, “Approximate computing: An emerging paradigm for
energy-efficient design,” in 18th IEEE European Test Symposium (ETS), 2013, pp. 1–6.
[23] C. Winstead and S. Howard, “A probabilistic LDPC-coded fault compensation tech-
nique for reliable nanoscale computing,” IEEE Transactions on Circuits and Systems—
Part II: Express Briefs, vol. 56, no. 6, pp. 484–488, 2009.
[24] N. R. Shanbhag, R. A. Abdallah, R. Kumar, and D. L. Jones, “Stochastic computa-
tion,” in 47th Design Automation Conference (DAC), 2010, pp. 859–864.
[25] G. He, T. Sugahara, S. Izumi, H. Kawaguchi, and M. Yoshimoto, “A 40-nm 168-mW
2.4x-real-time VLSI processor for 60-kWord continuous speech recognition,” in IEEE
Custom Integrated Circuits Conference (CICC), Sep. 2012, pp. 1–4.
[26] D. Jewett, “Integrity S2: A fault-tolerant Unix platform,”
in Fault-Tolerant Computing, 1991. FTCS-21. Digest of Papers, Twenty-First
International Symposium, 1991, pp. 512–519.
[27] S. Das, D. Roberts, S. Lee, S. Pant, D. Blaauw, T. Austin, K. Flautner, and T. Mudge,
“A self-tuning DVS processor using delay-error detection and correction,” IEEE Jour-
nal of Solid-State Circuits, vol. 41, no. 4, pp. 792–804, Apr. 2006.
[28] D. Blaauw, S. Kalaiselvan, K. Lai, W.-H. Ma, S. Pant, C. Tokunaga, S. Das, and
D. Bull, “Razor II: In situ error detection and correction for PVT and SER tolerance,”
in IEEE International Solid-State Circuits Conference (ISSCC), Feb. 2008, pp. 400–622.
[29] D. Bull, S. Das, K. Shivashankar, G. Dasika, K. Flautner, and D. Blaauw, “A
power-efficient 32 bit ARM processor using timing-error detection and correction for
transient-error tolerance and adaptation to PVT variation,” IEEE Journal of Solid-
State Circuits, vol. 46, no. 1, pp. 18–31, Jan. 2011.
[30] J. Tschanz, K. Bowman, S.-L. Lu, P. Aseron, M. Khellah, A. Raychowdhury,
B. Geuskens, C. Tokunaga, C. Wilkerson, T. Karnik, and V. De, “A 45nm resilient and
adaptive microprocessor core for dynamic variation tolerance,” in IEEE International
Solid-State Circuits Conference (ISSCC), Feb. 2010, pp. 282–283.
[31] R. Hegde and N. Shanbhag, “A voltage overscaled low-power digital filter IC,” IEEE
Journal of Solid-State Circuits, vol. 39, no. 2, pp. 388–391, Feb. 2004.
[32] L. Leem, H. Cho, J. Bau, Q. Jacobson, and S. Mitra, “ERSA: Error resilient system
architecture for probabilistic applications,” in Design, Automation, and Test in Europe
Conference and Exhibition (DATE), 2010, pp. 1560–1565.
[33] V. Gupta, D. Mohapatra, S. P. Park, A. Raghunathan, and K. Roy, “IMPACT:
IMPrecise adders for low-power approximate computing,” in International Symposium
on Low Power Electronics and Design (ISLPED), 2011, pp. 409–414.
[34] Z. Yang, A. Jain, J. Liang, J. Han, and F. Lombardi, “Approximate XOR/XNOR-
based adders for inexact computing,” in IEEE International Conference on Nanotech-
nology (NANO), 2013, pp. 690–693.
[35] H. R. Mahdiani, A. Ahmadi, S. M. Fakhraie, and C. Lucas, “Bio-inspired imprecise
computational blocks for efficient VLSI implementation of soft-computing applica-
tions,” IEEE Transactions on Circuits and Systems—Part I: Regular Papers, vol. 57,
no. 4, pp. 850–862, 2010.
[36] J. Huang, J. Lach, and G. Robins, “A methodology for energy-quality tradeoff
using imprecise hardware,” in 49th ACM/EDAC/IEEE Design Automation Conference
(DAC), 2012, pp. 504–509.
[37] P. Kulkarni, P. Gupta, and M. Ercegovac, “Trading accuracy for power with an under-
designed multiplier architecture,” in 24th International Conference on VLSI Design
(VLSI Design), 2011, pp. 346–351.
[38] K. Y. Kyaw, W. L. Goh, and K. S. Yeo, “Low-power high-speed multiplier
for error-tolerant application,” in IEEE International Conference on Electron Devices
and Solid-State Circuits (EDSSC), 2010, pp. 1–4.
[39] M. R. Choudhury and K. Mohanram, “Approximate logic circuits for low overhead,
non-intrusive concurrent error detection,” in Design, Automation and Test in Europe
(DATE), 2008, pp. 903–908.
[40] D. Shin and S. K. Gupta, “Approximate logic synthesis for error tolerant
applications,” in Design, Automation and Test in Europe (DATE), 2010, pp. 957–960.
[41] D. Shin and S. K. Gupta, “A new circuit simplification method for error tolerant
applications,” in Design, Automation and Test in Europe (DATE), 2011, pp. 1–6.
[42] S. Venkataramani, A. Sabne, V. Kozhikkottu, K. Roy, and A. Raghunathan, “SALSA:
Systematic logic synthesis of approximate circuits,” in 49th ACM/EDAC/IEEE Design
Automation Conference (DAC), 2012, pp. 796–801.
[43] S. H. Nawab, A. V. Oppenheim, A. P. Chandrakasan, J. M. Winograd, and J. T.
Ludwig, “Approximate signal processing,” Journal of VLSI Signal Processing Systems
for Signal, Image and Video Technology, vol. 15, no. 1-2, pp. 177–200, 1997.
[44] J. T. Ludwig, S. H. Nawab, and A. P. Chandrakasan, “Low-power digital filtering
using approximate processing,” IEEE Journal of Solid-State Circuits, vol. 31, no. 3,
pp. 395–400, 1996.
[45] A. Sinha, A. Wang, and A. P. Chandrakasan, “Algorithmic transforms for efficient
energy scalable computation,” in Proceedings of the International Symposium on Low
Power Electronics and Design (ISLPED), 2000, pp. 31–36.
[46] J. Goodman, A. P. Dancy, and A. P. Chandrakasan, “An energy/security scalable
encryption processor using an embedded variable voltage DC/DC converter,” IEEE
Journal of Solid-State Circuits, vol. 33, no. 11, pp. 1799–1809, 1998.
[47] V. K. Chippa, D. Mohapatra, A. Raghunathan, K. Roy, and S. T. Chakradhar, “Scal-
able effort hardware design: Exploiting algorithmic resilience for energy efficiency,” in
47th ACM/IEEE Design Automation Conference (DAC), 2010, pp. 555–560.
[48] J. Park, J. H. Choi, and K. Roy, “Dynamic bit-width adaptation in DCT: An
approach to trade off image quality and computation energy,” IEEE Transactions on
Very Large Scale Integration (VLSI) Systems, vol. 18, no. 5, pp. 787–793, 2010.
[49] R. Abdallah and N. R. Shanbhag, “Error-resilient low power Viterbi decoders,” in
International Symposium on Low Power Electronics and Design (ISLPED), 2008, pp.
111–116.
[50] N. R. Shanbhag, “Reliable and efficient system-on-a-chip design,” IEEE Computer,
vol. 37, no. 3, pp. 42–50, Mar. 2004.
[51] G. V. Varatkar, S. Narayanan, N. R. Shanbhag, and D. Jones, “Sensor network-on-
chip,” in 2007 International Symposium on System-on-Chip, Nov. 2007, pp. 1–4.
[52] G. Varatkar, S. Narayanan, N. Shanbhag, and D. Jones, “Stochastic networked compu-
tation,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 18,
no. 10, pp. 1421–1432, Oct. 2010.
[53] P. Huber, Robust Statistics. New York, NY: Wiley, 1981.
[54] E. Kim and N. Shanbhag, “Soft N-modular redundancy,” IEEE Transactions on Com-
puters, vol. 61, no. 3, pp. 323–336, Mar. 2012.
[55] Wireless LAN Medium Access Control (MAC) and Physical Layer (PHY)
Specifications, ANSI/IEEE Std. 802.11:1999 (E) Part 11, ISO/IEC 8802-11, 1999.
[56] CDMA2000 Wireless IP Network Standard, 3GPP2 Std. TIA/EIA/IS-835B, 2000.
[57] 3rd Generation Partnership Project; Technical Specification Group Radio Access Net-
work; Physical Layer – General Description, 3GPP2 Std. 3GPP TS 25.201, Rev.
V10.0.0, 2011.
[58] D. Tse and P. Viswanath, Fundamentals of Wireless Communication. New York, NY:
Cambridge, 2005.
[59] S. Sheng, L. Lynn, J. Peroulas, K. Stone, I. O’Donnell, and R. Brodersen, “A low-
power CMOS chipset for spread spectrum communications,” in IEEE International
Solid-State Circuits Conference, 1996, pp. 346–347.
[60] W. Namgoong and T. Meng, “Minimizing power consumption in direct sequence spread
spectrum correlators by resampling IF samples–Part I: Performance analysis,” IEEE
Transactions on Circuits and Systems—Part II: Analog and Digital Signal Processing,
vol. 48, no. 5, pp. 450–459, May 2001.
[61] A. Polydoros and C. Weber, “A unified approach to serial search spread-spectrum code
acquisition–Part I: General theory,” IEEE Transactions on Communications, vol. 32,
no. 5, pp. 542–549, May 1984.
[62] A. Polydoros and C. Weber, “A unified approach to serial search spread-spectrum
code acquisition–Part II: A matched-filter receiver,” IEEE Transactions on Commu-
nications, vol. 32, no. 5, pp. 550–560, May 1984.
[63] C. Deng and C. Chien, “A PN-acquisition ASIC for wireless CDMA systems,” in IEEE
Custom Integrated Circuits Conference (CICC), 2000, pp. 469–472.
[64] W. Namgoong and T. Meng, “Power consumption of parallel spread spectrum correla-
tor architectures,” in International Symposium on Low Power Electronics and Design
(ISLPED), 1998, pp. 133–135.
[65] T. Shibano, K. Iizuka, M. Miyamoto, M. Osaka, R. Miyama, and A. Kito, “Matched
filter for DS-CDMA of up to 50 MChip/s based on sampled analog signal processing,”
in IEEE International Solid-State Circuits Conference (ISSCC), Feb. 1997,
pp. 100–101.
[66] T. Yamasaki, T. Nakayama, and T. Shibata, “A low-power and compact CDMA
matched filter based on switched-current technology,” IEEE Journal of Solid-State
Circuits, vol. 40, no. 4, pp. 926–932, Apr. 2005.
[67] O. Yeung and K. Chugg, “An iterative algorithm and low complexity hardware archi-
tecture for fast acquisition of long PN codes in UWB systems,” The Journal of VLSI
Signal Processing, vol. 43, no. 1, pp. 25–42, Apr. 2006.
[68] E. Kim, D. Baker, S. Narayanan, D. Jones, and N. Shanbhag, “Low power and er-
ror resilient PN code acquisition filter via statistical error compensation,” in Custom
Integrated Circuits Conference (CICC), Sep. 2011, pp. 1–4.
[69] D. Senderowicz, S. Azuma, H. Matsui, K. Hara, S. Kawama, Y. Ohta, M. Miyamoto,
and K. Iizuka, “A 23 mW 256-tap 8 MSample/s QPSK matched filter for DS-CDMA
cellular telephony using recycling integrator correlators,” in Proceedings of International
Solid-State Circuits Conference, 2000, pp. 354–355.
[70] S. Kim and B. Daneshrad, “A 100 µW, 20 Mcps versatile correlator chip for third gen-
eration WCDMA systems,” in Asilomar Conference on Signals, Systems, and Com-
puters, vol. 1, Oct. 1999, pp. 130–134.
[71] K. Onodera and P. Gray, “A 75-mW 128-MHz DS-CDMA baseband demodulator for
high-speed wireless applications [LANs],” IEEE Journal of Solid-State Circuits, vol. 33,
no. 5, pp. 753–761, May 1998.
[72] C. Lee and C. Jen, “Bit-sliced median filter design based on majority gate,” Circuits,
Devices and Systems, IEE Proceedings G, vol. 139, no. 1, pp. 63–71, Feb. 1992.
[73] E. Vittoz, “Future of analog in the VLSI environment,” in IEEE International Sym-
posium on Circuits and Systems (ISCAS), vol. 2, 1990, pp. 1372–1375.
[74] S. Asai, Y. Wada, and E. Takeda, “Downscaling ULSIs by using nanoscale engineer-
ing,” Microelectronic Engineering, vol. 32, no. 1-4, pp. 31–48, 1996.
[75] “European Telecommunications Standards Institute (ETSI). Digital video broadcast-
ing (DVB); Second generation framing structure, channel coding and modulation sys-
tems for broadcasting, interactive services, news gathering and other broadband satel-
lite applications (DVB-S2) EN 302 307 V1.2.1,” Aug. 2009.
[76] “IEEE Std. 802.16e, IEEE Standard for Local and Metropolitan Area Networks, Part
16: Air Interface for Fixed and Mobile Broadband Wireless Access Systems. Amend-
ment 2: Physical and Medium Access Control Layers for Combined Fixed and Mobile
Operation in Licensed Bands.” Mar. 2006.
[77] “IEEE 802.11n. Wireless LAN Medium Access Control and Physical Layer
Specifications: Enhancements for Higher Throughput. IEEE P802.11n/D1.0,” Mar. 2006.
[78] S. Hemati, A. Banihashemi, and C. Plett, “A 0.18-µm CMOS analog min-sum iterative
decoder for a (32, 8) low-density parity-check (LDPC) code,” IEEE Journal of
Solid-State Circuits, vol. 41, no. 11, pp. 2531–2540, 2006.
[79] A. Darabiha, A. Chan Carusone, and F. Kschischang, “Power reduction techniques for
LDPC decoders,” IEEE Journal of Solid-State Circuits, vol. 43, no. 8, pp. 1835–1845,
2008.
[80] M. May, M. Alles, and N. Wehn, “A case study in reliability-aware design: A resilient
LDPC code decoder,” in Design, Automation and Test in Europe (DATE). ACM,
2008, pp. 456–461.
[81] L. R. Varshney, “Performance of LDPC codes under faulty iterative decoding,” IEEE
Transactions on Information Theory, vol. 57, no. 7, pp. 4427–4444, 2011.
[82] S. M. S. Tabatabaei Yazdi, H. Cho, and L. Dolecek, “Gallager B decoder on noisy
hardware,” IEEE Transactions on Communications, vol. 61, no. 5, pp. 1660–1673,
2013.
[83] T. J. Richardson and R. L. Urbanke, “The capacity of low-density parity-check codes
under message-passing decoding,” IEEE Transactions on Information Theory, vol. 47,
no. 2, pp. 599–618, 2001.
[84] R. Singhal, G. S. Choi, and R. N. Mahapatra, “Quantized LDPC decoder design
for binary symmetric channels,” in IEEE International Symposium on Circuits and
Systems (ISCAS), pp. 5782–5785.
[85] P. Li and W. K. Leung, “Decoding low density parity check codes with finite quanti-
zation bits,” IEEE Communications Letters, vol. 4, no. 2, pp. 62–64, 2000.
[86] A. T. Ihler, J. W. Fisher III, and A. S. Willsky, “Loopy belief propagation: Convergence
and effects of message errors,” Journal of Machine Learning Research, pp. 905–936,
2005.
[87] R. Gallager, “Low-density parity-check codes,” IRE Transaction on Information The-
ory, vol. 8, no. 1, pp. 21–28, 1962.
[88] J. Chen, A. Dholakia, E. Eleftheriou, M. Fossorier, and X.-Y. Hu, “Reduced-
complexity decoding of LDPC codes,” IEEE Transactions on Communications, vol. 53,
no. 8, pp. 1288–1299, Aug. 2005.
[89] P. Sotiriadis and A. Chandrakasan, “A bus energy model for deep submicron tech-
nology,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 10,
no. 3, pp. 341–350, 2002.
[90] E. P. Kim and N. R. Shanbhag, “Energy-efficient LDPC decoders based on error-
resiliency,” in IEEE Workshop on Signal Processing Systems (SiPS), 2012, pp. 149–
154.
[91] J. Choi, E. P. Kim, R. A. Rutenbar, and N. R. Shanbhag, “Error resilient MRF message
passing architecture for stereo matching,” in IEEE Workshop on Signal Processing
Systems (SiPS), 2013.
[92] J. Choi and R. A. Rutenbar, “Video-rate stereo matching using Markov random field
TRW-S inference on a hybrid CPU+FPGA computing platform,” in Proceedings of the
ACM/SIGDA International Symposium on Field Programmable Gate Arrays. ACM,
2013, pp. 63–72.
[93] E. P. Kim, J. Choi, N. R. Shanbhag, and R. A. Rutenbar, “A robust message passing
based stereo matching kernel via system-level error resiliency,” in IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014.
[94] J. Sun, N. Zheng, and H. Shum, “Stereo matching using belief propagation,” IEEE
Transactions on Pattern Analysis and Machine Intelligence, vol. 25, no. 7, pp. 787–800,
2003.
[95] M. Tappen and W. Freeman, “Comparison of graph cuts with belief propagation for
stereo, using identical MRF parameters,” in IEEE International Conference on Com-
puter Vision, 2003, pp. 900–906.
[96] R. Szeliski, R. Zabih, D. Scharstein, O. Veksler, V. Kolmogorov, A. Agarwala, M. Tap-
pen, and C. Rother, “A comparative study of energy minimization methods for Markov
random fields with smoothness-based priors,” IEEE Transactions on Pattern Analysis
and Machine Intelligence, vol. 30, no. 6, pp. 1068–1080, 2008.
[97] V. Kolmogorov, “Convergent tree-reweighted message passing for energy minimiza-
tion,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28,
no. 10, pp. 1568–1583, 2006.
[98] M. Wainwright, T. Jaakkola, and A. Willsky, “MAP estimation via agreement on
trees: Message-passing and linear programming,” IEEE Transactions on Information
Theory, vol. 51, no. 11, pp. 3697–3717, 2005.
[99] C. Liang, C. Cheng, Y. Lai, L. Chen, and H. Chen, “Hardware-efficient belief prop-
agation,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 21,
no. 5, pp. 525–537, 2011.
[100] H. Hirschmuller and D. Scharstein, “Evaluation of cost functions for stereo matching,”
in IEEE Conference on Computer Vision and Pattern Recognition, 2007, pp. 1–8.
[101] P. F. Felzenszwalb and D. P. Huttenlocher, “Efficient belief propagation for early
vision,” International Journal of Computer Vision, vol. 70, no. 1, pp. 41–54, 2006.
[102] Convey Computer, “Convey Reference Manual,” Online:
http://www.conveycomputer.com, Sep. 2009.
[103] S. Morozov, A. Maiti, and P. Schaumont, “An analysis of delay based PUF implementa-
tions on FPGA,” in Reconfigurable Computing: Architectures, Tools and Applications,
ser. Lecture Notes in Computer Science, P. Sirisuk, F. Morgan, T. El-Ghazawi, and
H. Amano, Eds. Berlin, Heidelberg: Springer, 2010, vol. 5992, pp. 382–387.
[104] D. Scharstein and R. Szeliski, “A taxonomy and evaluation of dense two-frame stereo
correspondence algorithms,” International Journal of Computer Vision, vol. 47, no.
1-3, pp. 7–42, 2002.
[105] J. Liang, J. Han, and F. Lombardi, “New metrics for the reliability of approximate and
probabilistic adders,” IEEE Transactions on Computers, vol. 62, no. 9, pp. 1760–1771,
Sep. 2013.
[106] A. Lingamneni, C. Enz, J. L. Nagel, K. Palem, and C. Piguet, “Energy parsimonious
circuit design through probabilistic pruning,” in Design, Automation and Test in Europe
(DATE), 2011, pp. 1–6.
[107] A. Lingamneni, K. K. Muntimadugu, C. Enz, R. M. Karp, K. V. Palem, and C. Piguet,
“Algorithmic methodologies for ultra-efficient inexact architectures for sustaining technology
scaling,” in Proceedings of the 9th Conference on Computing Frontiers, ser. CF
’12. New York, NY, USA: ACM, 2012, pp. 3–12.
[108] A. Lingamneni, C. Enz, K. Palem, and C. Piguet, “Synthesizing parsimonious inexact
circuits through probabilistic design techniques,” ACM Transactions on Embedded
Computing Systems (TECS) - Special Section on Probabilistic Embedded Computing,
vol. 12, no. 2s, pp. 1–26, 2013.
[109] H. Poor, An Introduction to Signal Detection and Estimation. New York, NY:
Springer-Verlag, 1994.
[110] W.-H. Chen, C. Smith, and S. Fralick, “A fast computational algorithm for the discrete
cosine transform,” IEEE Transactions on Communications, vol. 25, no. 9,
pp. 1004–1009, Sep. 1977.
[111] J. Rabaey, A. Chandrakasan, and B. Nikolic, Digital Integrated Circuits, 2nd ed.
Upper Saddle River, NJ: Prentice Hall, 2002.
[112] C. Cortes and V. Vapnik, “Support-vector networks,” Machine Learning, vol. 20, no. 3,
pp. 273–297, 1995.
[113] N. Verma, K.-H. Lee, K. J. Jang, and A. H. Shoeb, “Enabling system-level platform
resilience through embedded data-driven inference capabilities in electronic devices,” in
IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP),
2012, pp. 5285–5288.
[114] A. Mohamed, T. Sainath, G. Dahl, B. Ramabhadran, G. Hinton, and M. Picheny,
“Deep belief networks using discriminative features for phone recognition,” in IEEE
International Conference on Acoustics, Speech and Signal Processing (ICASSP), May
2011, pp. 5060–5063.
[115] Y. LeCun and Y. Bengio, “Convolutional networks for images, speech, and time series,”
in The Handbook of Brain Theory and Neural Networks, vol. 3361. Cambridge, MA:
MIT Press, 1995.
[116] J. Sartori and R. Kumar, “Architecting processors to allow voltage/reliability trade-
offs,” in Proceedings of the 14th International Conference on Compilers, Architectures
and Synthesis for Embedded Systems, ser. CASES ’11, 2011, pp. 115–124.
[117] A. Kahng, S. Kang, R. Kumar, and J. Sartori, “Slack redistribution for graceful degradation
under voltage overscaling,” in Asia and South Pacific Design Automation Conference
(ASP-DAC), Jan. 2010, pp. 825–831.
