Stochastic computation for energy-efficient robust ultra-low-power platforms by Abdallah, Rami
© 2012 Rami A. Abdallah
STOCHASTIC COMPUTATION FOR ENERGY-EFFICIENT ROBUST
ULTRA-LOW-POWER PLATFORMS
BY
RAMI A. ABDALLAH
DISSERTATION
Submitted in partial fulfillment of the requirements
for the degree of Doctor of Philosophy in Electrical and Computer Engineering
in the Graduate College of the
University of Illinois at Urbana-Champaign, 2012
Urbana, Illinois
Doctoral Committee:
Professor Naresh R. Shanbhag, Chair
Professor Douglas L. Jones
Professor Philip T. Krein
Assistant Professor Rakesh Kumar
ABSTRACT
Next-generation ubiquitous computing promises new levels in immersion and seamless tech-
nology integration enabled through a profusion of embedded signal processing (DSP)-heavy
ultra-low-power (ULP) platforms. This dissertation proposes a holistic, integrated stochastic
computing approach to enable the design of next-generation ULP platforms that operate
dramatically closer to the limits of the achievable robustness-energy-performance envelope
over a highly unreliable device fabric.
Stochastic computing has been shown to be an elegant design approach for energy-efficient
and robust systems-on-a-chip (SoC) in superthreshold applications. This dissertation stud-
ies and extends the application of stochastic computing to the minimum-energy operating
point (MEOP), which is known to occur in the subthreshold regime. Analysis, architecture
and circuit-level simulations, and integrated circuit (IC) measurements in a 45-nm CMOS
technology, are employed to study the stochastic-subthreshold design space. Energy savings
of 28% to 54% beyond minimum achievable energy Emin at the conventional (error-free)
MEOP, along with 380× to 850× increase in pre-correction error rate (pη) handling ca-
pability, are demonstrated in the presence of voltage and process variations. A stochastic
computing-based biomedical processing IC is designed at the MEOP. The supply voltage of
the prototype IC can be scaled to 15% below its critical (error-free) value of 0.4 V, while
compensating for a pη = 58%, improving the heart-beat detection accuracy by 19×, and
achieving 28% Emin-energy savings over conventional MEOP processors. This IC consumes
14.5 fJ/cycle/1k-gate and exhibits 4.7× better energy efficiency than the state-of-the-art
while tolerating 16× more voltage variations.
To further enhance system-energy efficiency, this dissertation proposes an integrated design
approach for ULP platforms by jointly optimizing the design of the compute cores and the
energy-delivery subsystem. Joint core architecture and DC-DC converter design techniques
are proposed in order to minimize the total system energy consumption. Results show a
45.5% system-energy savings and a 2.3× improvement in the efficiency of energy-delivery
subsystem over the conventional case where the system is operated at the core MEOP while
ignoring DC-DC converter losses.
This dissertation also contributes a new technique, referred to as likelihood processing (LP),
to the portfolio of stochastic computing techniques. LP exploits hardware error statistics to
generate reliability information, or a confidence level, on the output bits of a compute block
in a statistically optimal manner and with low complexity. The benefits of LP are demonstrated in
the design of a 45-nm discrete-cosine transform (DCT) codec, which can be employed as a
hardware accelerator in a ULP platform. LP is shown to tolerate 5× to 100× greater values
of pη, and to achieve 15% to 71% energy savings, when compared with existing techniques.
Stochastic computing advocates an explicit characterization and processing of error statis-
tics at the architectural/system level. To support this need, this dissertation proposes a
unified framework with a generalized statistical error characterization methodology. The
proposed framework and methodology are analyzed and verified for a number of 45-nm DSP
kernels. Furthermore, design diversity techniques are introduced in order to engineer fa-
vorable spatially-independent error-statistics and aid the robustness and implementation of
stochastic computing.
The design principles proposed and the energy and robustness benefits demonstrated in this
dissertation can be generalized beyond ULP platforms to modern high-throughput computational
platforms. This is timely, because such platforms are moving in the direction of becoming
heterogeneous many-core SoCs. Thus, the work in this dissertation, and stochastic computing
in general, can provide stochastic accelerator cores for integration with conventional cores
onto such platforms.
To my mother, my father, and my advisor
ACKNOWLEDGMENTS
And I say that life is indeed darkness save when there is urge,
And all urge is blind save when there is knowledge,
And all knowledge is vain save when there is work,
And all work is empty save when there is love;
And when you work with love you bind yourself to yourself, and to one another,
and to God.
Work is love made visible. . . And if you sing though as angels, and love not the
singing, you muffle man’s ears to the voices of the day and the voices of the
night.1
For this quote, I would like to thank all those who have inspired me, challenged me, taught
me, or helped me love my work.
Foremost, my deepest gratitude goes to God, my parents, and my advisor. My parents
have never hesitated to sacrifice anything for the well-being of their children. Their constant
love and encouragement have always pushed me to do my very best. My advisor, Professor
Naresh Shanbhag, has been not only a teacher and a mentor throughout these years but
also a father. I am confident that his role in my life will go much beyond these few years. I
am greatly indebted to him for all the advice, knowledge, responsibilities, and opportunities
that he gave me in school and life. The most important things that I learned from him go
beyond this dissertation, and I will always cherish them. His attitude towards life, passion
for work, creativity, diversified knowledge, high standards, and meticulousness have been and
will always be a source of inspiration and a lifelong example. I am fortunate to have him
and my parents, and to them I dedicate this dissertation hoping that they find it worthy of
their expectations of me.
1 An excerpt from “The Prophet” by Gibran Khalil Gibran.
I would also like to thank Professors Douglas Jones, Rakesh Kumar, and Philip Krein for
the many insightful discussions and input during weekly group meetings and for agreeing to be
on my thesis committee. Their comments and suggestions have greatly helped improve this
thesis. My sincere thanks also to Yu-Hung Lee for his help on error statistics simulations in
Chapter 6, my research group colleagues for their feedback and input during group meetings,
and the circuits group staff, Carolyn Genzel and Loren Heal, for their help and support. I
am also grateful to my sister and brothers for being there for me, and my labmates and
friends at Illinois who made me feel at home while being away from home. In particular, I
would like to mention Dr. Ali Bazzi, Dr. Marc Ghossoub, and Dr. Rajan Narasimha.
The support of the Gigascale Systems Research Center (GSRC), one of six research cen-
ters funded under the Focus Center Research Program (FCRP), a Semiconductor Research
Corporation (SRC) entity, and the support from Texas Instruments, Inc., are gratefully
acknowledged.
TABLE OF CONTENTS
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x
LIST OF ABBREVIATIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiv
CHAPTER 1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Past Relevant Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2 Stochastic Computing Techniques . . . . . . . . . . . . . . . . . . . . . . . . 9
1.3 Role and Contributions of this Dissertation . . . . . . . . . . . . . . . . . . . 16
1.4 Dissertation Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
CHAPTER 2 ENERGY-EFFICIENT AND ROBUST ULP KERNELS VIA
STOCHASTIC COMPUTATION . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.1 The Minimum-Energy Operating Point (MEOP) . . . . . . . . . . . . . . . . 22
2.2 MEOP via ANT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.3 Simulations and Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
CHAPTER 3 A 14.5 FJ/CYCLE/K-GATE, 0.33 V STOCHASTIC COMPUTING-
BASED ECG PROCESSOR IN A 45-NM CMOS . . . . . . . . . . . . . . . . . . 40
3.1 The Pan-Tompkins Algorithm (PTA) . . . . . . . . . . . . . . . . . . . . . . 41
3.2 Architecture and Implementation . . . . . . . . . . . . . . . . . . . . . . . . 43
3.3 Measurement Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
CHAPTER 4 JOINT OPTIMIZATION OF POWER DELIVERY AND CORE
ENERGY IN ULP PLATFORMS . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.1 Core Energy Characterization Under Dynamic Voltage Scaling (DVS) . . . . 60
4.2 Design and Analysis of DC-DC Converters . . . . . . . . . . . . . . . . . . . 62
4.3 System (Core and DC-DC Converter) Energy Optimization . . . . . . . . . . 66
4.4 Core Architecture Optimization for Energy-Efficient Systems . . . . . . . . . 69
4.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
CHAPTER 5 STOCHASTIC COMPUTING PLATFORMS VIA LIKELIHOOD
PROCESSING . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
5.1 A Unified Framework for Error Resiliency . . . . . . . . . . . . . . . . . . . 80
5.2 The Proposed Technique: Likelihood Processing (LP) . . . . . . . . . . . . . 86
5.3 Simulation Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
5.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
CHAPTER 6 CHARACTERIZATION AND ENGINEERING OF TIMING
ERROR STATISTICS FOR STOCHASTIC COMPUTING PLATFORMS . . . . 108
6.1 Proposed Timing Error Model . . . . . . . . . . . . . . . . . . . . . . . . . . 109
6.2 Error Analysis: Impact of Input Statistics . . . . . . . . . . . . . . . . . . . 112
6.3 Simulations and Verifications for Statistical Error Characterization . . . . . . 120
6.4 Diversity Techniques for Error Independence . . . . . . . . . . . . . . . . . . 125
6.5 Case Study: Discrete-Cosine Transform (DCT) Codec Design . . . . . . . . . 129
6.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
CHAPTER 7 CONCLUSIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
7.1 Dissertation Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
7.2 The Broader Impact: Beyond ULP Platforms . . . . . . . . . . . . . . . . . 135
7.3 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
LIST OF TABLES
2.1 MEOP comparison of conventional and ANT filters in the 45-nm LVT process. 34
2.2 MEOP comparison of conventional and ANT filters in the 45-nm HVT process. 35
3.1 Transfer function of building blocks in PTA. . . . . . . . . . . . . . . . . . . 44
3.2 Comparison with state-of-the-art systems. . . . . . . . . . . . . . . . . . . . 56
5.1 Complexity of an L-parallel LG-processor for LPNx-(By). . . . . . . . . . . 94
5.2 Gate complexity (normalized to NAND2) of building blocks in error-compensated
2D-IDCT architectures. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
6.1 KL-distance between error PMFs in various architectures at different KVOS. 123
6.2 KL distance between error PMFs of 16-bit adders under various input
statistics and error PMF PEU obtained using a uniform input distribution. . 124
6.3 KL distance between error PMFs of a 16-tap FIR filter under various input
statistics and error PMF PEU obtained using a uniform input distribution. . 125
6.4 Error independence between RCA, CBA, and CSA, where Vdd−crit,RCA =
1.1 V, Vdd−crit,CBA = 0.95 V, Vdd−crit,CSA = 0.85 V, and f = 1.01 GHz . . . . . 127
6.5 Error independence between DF and TDF FIR filters, where Vdd−crit,DF =
1.1 V,Vdd−crit,TDF = 1 V, and f = 588 MHz. . . . . . . . . . . . . . . . . . . . 128
6.6 Error independence with scheduling diversity, where Vdd−crit = 1.1 V and
f = 714 MHz. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
6.7 Error independence of two voltage overscaled DCT codecs using different
scheduling. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
LIST OF FIGURES
1.1 Ubiquitous computing: (a) infrastructure and (b) an example: body-area
sensor network. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 The 1-σ gate delay variations due to random and systematic process vari-
ations in 65-nm CMOS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Energy delivery in ULP platforms: (a) power delivery network in a MacBook
Air (DC-DC converters/voltage regulators (VRs) are shown in blue),
(b) sensor node block diagram, and (c) measured DC-DC converter efficiency
in 45-nm CMOS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.4 Energy under dynamic and aggressive voltage scaling. . . . . . . . . . . . . . 5
1.5 Stochastic computing in ULP platforms: matching noisy circuits to appli-
cation metric. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.6 Stochastic computing: (a) block diagram, (b) measured error statistics at
the output of an FIR filter in a 45-nm CMOS process with Vdd scaled 15%
below its critical value, and (c) signal-to-noise ratio (SNR) performance
at different pre-correction error rates pη. . . . . . . . . . . . . . . . . . . . . 11
1.7 Algorithmic noise-tolerance (ANT): (a) framework and (b) error distributions. 12
1.8 The stochastic sensor network-on-a-chip (SSNOC). . . . . . . . . . . . . . . 14
1.9 Architecture-level error resilient techniques: (a) N-modular redundancy
(NMR) and (b) soft NMR. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.1 Energy in the subthreshold regime of conventional and ANT-based designs. 21
2.2 Validation of energy and frequency models for an eight-tap FIR filter in
45-nm CMOS processes: (a) direct-form architecture, (b) energy model,
and (c) throughput model. . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.3 Voltage-frequency plane of iso-pη curves in the 45-nm LVT (solid lines)
and HVT (dotted lines) processes. . . . . . . . . . . . . . . . . . . . . . . . . 28
2.4 Pre-correction error rate and energy characterization of the 8-tap FIR
filter under VOS (x axis ≤ 1) and FOS (x axis ≥ 1): (a) error rate pη
vs. KVOS and KFOS, and (b) normalized energy vs. KVOS and KFOS
(error-compensation overhead is not included). . . . . . . . . . . . . . . . . 30
2.5 SNR vs. error rate for the 8-tap reduced precision-redundancy (RPR)
ANT-based filter with different estimator precisions (Be): (a) architecture,
and (b) SNR performance vs. pη. . . . . . . . . . . . . . . . . . . . . . . . . 32
2.6 Energy of ANT FIR filter (including error-compensation overhead for pη ≠
0) at different pre-correction error rates and estimation overhead: (a) the
45-nm LVT process, and (b) the 45-nm HVT process. . . . . . . . . . . . . . 33
2.7 FIR filter frequency fluctuations under process variations using minimum-
size (Wmin) and 1.6-Wmin transistors in the 45-nm LVT process. . . . . . . . 36
2.8 Energy under process variations for the 8-tap FIR filter using up-sized
(1.6-Wmin) design and minimum-sized (Wmin) ANT design . . . . . . . . . . 37
2.9 Energy distributions at the MEOP of the minimum-sized (nominal) design,
up-sized design, and ANT minimum-sized designs with Be = 4 and 5 (error
compensation overhead is included). . . . . . . . . . . . . . . . . . . . . . . . 38
3.1 ECG processing of a segment of an MIT-BIH database record: (a) input
(noisy) ECG, (b) filtered ECG, (c) ECG at derivative output, and (d)
ECG at moving average output. . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.2 Block diagram of the Pan-Tompkins algorithm (PTA) for ECG processing . . 42
3.3 Architecture of ANT-based ECG Processor . . . . . . . . . . . . . . . . . . . 43
3.4 Architecture of building blocks in PTA: (a) LPF, (b) HPF, (c) derivative-
square (DS) block, and (d) moving average (MA) block. The precision
shown is for the main block M and the notation < n1, n2 > represents n1
integer-bits and n2 floating-bits. . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.5 Die photo of the test chip in the 45-nm IBM SOI CMOS process. . . . . . . 46
3.6 Measured energy and frequency of the conventional (error-free) ECG pro-
cessor under different work loads. . . . . . . . . . . . . . . . . . . . . . . . . 47
3.7 Measured pre-correction error rate at the MEOP of the ECG processor
under voltage and frequency overscaling. . . . . . . . . . . . . . . . . . . . . 48
3.8 Simulated detection performance of the conventional and ANT-based ECG
processors at different pre-correction error rates (solid lines indicate error-
free MA and dotted lines indicate erroneous MA). . . . . . . . . . . . . . . . 49
3.9 Measured detection performance of the conventional and ANT-based ECG
processors at different pre-correction error rates at the MEOP while the
MA block is error free. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.10 Error statistics of the ECG processor: (a) measured voltage overscaled
processor with pη = 0.38 at MEOP, (b) RTL simulations at pη = 0.35, (c)
measured frequency overscaled processor with pη = 0.58 at MEOP, and
(d) RTL simulations at pη = 0.54. . . . . . . . . . . . . . . . . . . . . . . . . 51
3.11 Distribution of instantaneous RR-interval measurement at MEOP: (a) con-
ventional ECG processor and (b) ANT ECG processor. . . . . . . . . . . . . 52
3.12 ANT-based ECG processor measurement results under the ECG dataset:
(a) iso-pη contours in the Vdd–f plane and (b) the total energy (including
error compensation overhead for pη ≠ 0) corresponding to the iso-pη contours. 54
3.13 ANT-based ECG processor measurement results under the synthetic dataset:
(a) iso-pη contours in the Vdd–f plane and (b) the total energy (including
error compensation overhead for pη ≠ 0) corresponding to the iso-pη contours. 55
3.14 Measured sensitivity and robustness of conventional and the ANT-based
ECG processors to voltage variations at the conventional MEOP (0.4 V, 600 kHz). 56
4.1 Energy-aware embedded system: (a) energy under DVS and (b) block diagram. 59
4.2 Switching DC-DC converter: (a) block diagram with parasitics, (b) continuous-
conduction mode (CCM), and (c) discontinuous-conduction mode (DCM). . 63
4.3 The computing core model: (a) architecture of a single MAC unit, (b) the
core frequency, and (c) the core energy consumption under DVS. . . . . . . . 67
4.4 DVS system energy:(a) DC-DC efficiency vs. core power and supply volt-
age and (b) the total system energy and losses. . . . . . . . . . . . . . . . . . 68
4.5 DC-DC efficiency for parallel/multi-cores. . . . . . . . . . . . . . . . . . . . 71
4.6 Architecture-optimized DVS system energy: (a) DC-DC efficiency and (b)
reconfigurable core (RC) system energy profile. . . . . . . . . . . . . . . . . . 72
4.7 DVS system energy with core pipelining: (a) DC-DC efficiency, and (b)
pipelined-core system energy profile. . . . . . . . . . . . . . . . . . . . . . . 74
4.8 Block diagram of a stochastic system (stochastic core and DC-DC converter). 75
4.9 DVS energy of jointly optimized systems. Solid lines (dotted lines) refer
to the conventional (stochastic) system. . . . . . . . . . . . . . . . . . . . . . 76
4.10 DC-DC energy efficiency of jointly optimized stochastic core and DC-DC
converter. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
5.1 Computational error model: (a) additive error model, (b) sample error
statistics, and (c) measured error PMF Pη(η) of a 20-bit output filter IC
in 45-nm CMOS with Vdd = 0.85Vdd−crit. . . . . . . . . . . . . . . . . . . . . 81
5.2 Existing architectural level error resiliency techniques: (a) NMR, (b) algo-
rithmic noise-tolerance (ANT), (c) stochastic sensor network-on-chip (SS-
NOC), and (d) soft NMR. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.3 The proposed technique: likelihood processing (LP). . . . . . . . . . . . . . . 84
5.4 Techniques to generate observation vector YLP: (a) replication, (b) esti-
mation, and (c) spatio-temporal correlation. . . . . . . . . . . . . . . . . . . 85
5.5 An example of LP: (a) 2-bit output erroneous computational block and
(b) a 2-bit sample error PMF. . . . . . . . . . . . . . . . . . . . . . . . . . . 89
5.6 System correctness of a 2-bit output system at different pη. . . . . . . . . . . 91
5.7 An LG-processor architecture for LPN -(By) (MU: metric unit and CS2:
2-operand compare-select unit). . . . . . . . . . . . . . . . . . . . . . . . . . 93
5.8 A bit-subgrouped LG-processor architecture for LPNx-(By) (LPNx-(B1, B2
, . . . , Bm)) with probabilistic activation module. . . . . . . . . . . . . . . . . 95
5.9 An 8-bit output 2D-DCT/IDCT codec: (a) single codec, (b) replication
set-up, (c) estimation set-up, and (d) spatial correlation set-up. . . . . . . . 97
5.10 VOS errors in 2D-IDCT: (a) pre-correction error rate (component proba-
bility of error) pη, and output error PMFs (PE(e)) at (b) Vdd = 1.1 V and
(c) Vdd = 1 V. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
5.11 System robustness of 2D DCT-IDCT codec under replication: (a) compar-
ing LPNr-(8) to other error-resilient techniques without bit-subgrouping
and (b) LPNr-(8) performance with bit-subgrouping. . . . . . . . . . . . . . 102
5.12 System robustness of 2D DCT-IDCT codec using (a) estimation and (b)
spatial correlation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
5.13 Sample codec output images: (a) original image, (b) error-free IDCT
(pη = 0, PSNR = 33 dB), (c) erroneous single IDCT (pη = 0.13, PSNR =
14 dB), (d) majority-vote TMR (pη = 0.13, PSNR = 19 dB), (e) LP3c-
(5,3) (pη = 0.14, PSNR = 24 dB), (f) ANT (pη = 0.13, PSNR = 26 dB),
(g) LP3r-(5,3) (pη = 0.13, PSNR = 29 dB), (h) LP2e-(8) (pη = 0.13, PSNR =
31 dB). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
5.14 LP power savings in a 45-nm TI CMOS process: (a) replication, (b) esti-
mation, and (c) spatial-correlation setups. . . . . . . . . . . . . . . . . . . . 106
6.1 A DSP kernel (M’) exhibiting errors: (a) block diagram and (b) proposed
additive error model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
6.2 Various 16-bit input statistics: (a) word-level distribution and (b) their
corresponding bit probability profiles (BPPs). . . . . . . . . . . . . . . . . . 115
6.3 An architectural model of a DSP kernel with input x, output bit by,i, and
Li processing elements (PE)s. . . . . . . . . . . . . . . . . . . . . . . . . . . 117
6.4 Error statistics of various architectures: (a)16-b RCA, (b) 16-b CBA, (c)
16-b CSA, and (d) DF and TDF 16-tap FIR filter. . . . . . . . . . . . . . . . 122
6.5 Output error statistics of 16-bit RCA at Kvos = 0.73 using: (a) symmetric
input statistics PX ’s and (b) uniform input distribution and asymmetric PX ’s 123
6.6 Block diagram of the 2D DCT-IDCT codec. . . . . . . . . . . . . . . . . . . 130
6.7 Performance of soft DMR-based codec under VOS. . . . . . . . . . . . . . . 131
7.1 The power wall in CPU design. . . . . . . . . . . . . . . . . . . . . . . . . . 135
7.2 The 45-nm 8-core Intel Enterprise Xeon processor: (a) block diagram with
on-chip power management and (b) die photo with multiple clock domains
and temperature sensors. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
7.3 The 45-nm Intel Core i7 (quad core) processor: (a) die photo and (b)
on-chip power management to enable DVS and power gating. . . . . . . . . . 137
LIST OF ABBREVIATIONS
ANT algorithmic noise-tolerance
BER bit error rate
BPP bit probability profile
BTWC better than worst case
C-MEOP core minimum energy operating point
CAD computer-aided design
CMOS complementary metal-oxide-semiconductor
CBA carry-bypass adder
CPU central processing unit
CSA carry-select adder
CVD cardiovascular disease
DCT discrete cosine transform
DF-FIR direct-form finite impulse response
DMR dual-modular redundancy
DSP digital signal processing
DVS dynamic voltage scaling
ECG electrocardiograph
FIR finite impulse response
FOS frequency overscaling
IC integrated circuit
IDCT inverse discrete cosine transform
KL Kullback-Leibler
LP likelihood processing
MEOP minimum energy operating point
MSE mean square error
NMR N-modular redundancy
PMF probability mass function
PE processing element
PoFF point of first failure
PSNR peak signal-to-noise ratio
PTA Pan-Tompkins algorithm
PVT process, voltage, and temperature
RC reconfigurable core
RCA ripple carry adder
RTL register transfer language
PC-MEOP parallelized core minimum energy operating point
S-MEOP system minimum energy operating point
SS-MEOP stochastic system minimum energy operating point
SC single core
SNR signal-to-noise ratio
SoC system-on-chip
SOI silicon-on-insulator
SSNOC stochastic sensor network-on-chip
TF-FIR transposed-form finite impulse response
TMR triple-modular redundancy
ULP ultra-low-power
VOS voltage overscaling
CHAPTER 1
INTRODUCTION
As we move into a world of immersive and ubiquitous computing, we will witness a mul-
titude of embedded processors that are going to transform the way we interact with the
physical world. Sensing, surveillance, and media-rich immersive computing will constitute a
large part of next-generation applications in environmental monitoring, supply chain management,
traffic control, power distribution, smart automotives and avionics, health care,
entertainment, and defense. Such applications will be characterized by increased
functionality being embedded in multiple tiny ultra-low-power (ULP) devices/nodes giving
rise to a sensory swarm (see Fig. 1.1(a)) [1], which will act as virtual eyes, ears, and hands
for data collection and analysis. For example, Fig. 1.1(b) shows a body-area
sensor network (BSN) [2], which consists of multiple embedded sensing and processing nodes
on, near, or within the human body. BSNs promise new levels of immersion and novel uses in
health care, medical diagnostic services, and entertainment. It is predicted that the number
of embedded processors per person will exceed 1000 by 2015 [3]. Form factor and energy are
the main design drivers in these applications, both for seamless integration and for extended
lifetime: frequent battery replacement across thousands of embedded processors is not a feasible
option, and in some cases these processors are operated on scavenged energy sources [4].
A growing concern with increased integration and complexity in these applications is
reliability, since increased process, voltage and temperature (PVT) variations, leakage, soft
errors, and noise in sub-45-nm process technologies [5] are conspiring to offset the energy
and area benefits of feature-size scaling. Current design philosophy addresses the reliability
problem at the expense of power or energy consumption by targeting worst-case variations
Figure 1.1: Ubiquitous computing: (a) infrastructure [1] and (b) an example: body-area
sensor network.
and scenarios. In fact, the International Technology Roadmap for Semiconductors (ITRS) [6]
has identified reliability and energy efficiency as two of the Grand Challenges
facing the semiconductor industry. ULP platforms are typically operated in the subthreshold
regime, i.e., below the threshold voltage (Vth), which is typically a few hundred mV, to
save energy. This leads to increased sensitivity to PVT variations due to the exponential
relation between current and supply voltage in subthreshold. For example, in 65-nm CMOS, the
worst-case 3-σ gate delay is around three orders of magnitude away from the nominal case (see Fig. 1.2) [7].
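The exponential sensitivity mentioned above can be illustrated with a minimal sketch. It assumes the textbook subthreshold relation I ∝ exp((Vgs − Vth)/(n·V_T)); the slope factor n = 1.5 and the thermal voltage V_T = 26 mV are illustrative assumptions, not values taken from this dissertation, and the function name is hypothetical.

```python
import math

def subthreshold_current_ratio(delta_vth, n=1.5, thermal_voltage=0.026):
    """Factor by which subthreshold current changes for a threshold shift.

    Assumes I ∝ exp((Vgs - Vth) / (n * V_T)). The defaults for n and V_T
    are illustrative assumptions, not measured device parameters.
    """
    return math.exp(delta_vth / (n * thermal_voltage))

# Gate delay is roughly inversely proportional to drive current, so the
# same factor applies (to first order) to the delay spread.
for delta_mv in (10, 30, 60, 90):
    ratio = subthreshold_current_ratio(delta_mv / 1000.0)
    print(f"Vth shift of {delta_mv} mV -> current changes by {ratio:.1f}x")
```

Under these assumptions, even a threshold shift of a few tens of millivolts changes the drive current, and hence the gate delay, by a multiplicative factor rather than by a small percentage.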
Further complicating the problem of energy in ULP platforms is the low efficiency and
increased size of the energy-delivery subsystem (DC-DC converter). Technology scaling
and energy-aware computing techniques have made it possible to operate cores at low Vdd.
Low-Vdd operation, however, strains the energy-delivery network and leads to the deployment
of multiple large programmable voltage regulators/DC-DC converters in modern
platforms (see Fig. 1.3(a)). In a typical sensor node platform (see Fig. 1.3(b)), one or
more programmable DC-DC converters adjust
Figure 1.2: The 1-σ gate delay variations due to random and systematic process variations
in 65-nm CMOS [7].
the core supply voltage Vdd from an external battery source (Vbat) depending on the required
throughput. For example, in subthreshold operation, the DC-DC converter is required to
convert a battery voltage Vbat = 1.2 V to a core voltage Vdd < 0.5 V. This severely reduces
the converter energy efficiency ηDC, sometimes to below 40%, as shown by integrated
circuit (IC) measurements in Fig. 1.3(c) [8] and reported in [9]. Addressing the energy-delivery
system in ULP sensory applications is made even more challenging by the small
form-factor requirements and the extremely large variations in workload characteristics (up
to four orders of magnitude [10]) due to the event-driven nature of sensory ULP applications.
A ULP platform architecture (see Fig. 1.3(b)) consists mainly of 1) a set of sensors
with signal-processing kernels to detect an interrupt, 2) an event or general-purpose pro-
cessor to manage interrupts, perform complex detection, and activate appropriate modules,
3) a set of hardware accelerators for enhanced performance and energy savings, 4) a mem-
ory subsystem, and 5) a communication kernel/radio, in addition to the DC-DC converter.
Figure 1.3: Energy delivery in ULP platforms: (a) power delivery network in a MacBook
Air [11] (DC-DC converters/voltage regulators (VRs) are shown in blue), (b) sensor node
block diagram, and (c) measured DC-DC converter efficiency in 45-nm CMOS [8].
These subsystems have different duty cycles and performance requirements depending on
the environmental settings and workload. For example, the sensors and signal processing
modules are always active and operate in subthreshold at very low frequencies to
match the observed phenomenon while consuming negligible power. At the other extreme,
the hardware accelerators are power-gated most of the time and activated depending on the
complexity and severity of the required processing.
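To make the cost of a low converter efficiency concrete, the following minimal sketch relates the energy consumed by the core to the energy drawn from the battery. It assumes the simple relation E_battery = E_core / ηDC, with all converter losses lumped into a single efficiency figure; the function names are hypothetical and the numbers are only illustrative.

```python
def battery_energy(core_energy, converter_efficiency):
    """Energy drawn from the battery to deliver core_energy to the core.

    Assumes all DC-DC losses are captured by one efficiency value in (0, 1].
    """
    if not 0.0 < converter_efficiency <= 1.0:
        raise ValueError("converter efficiency must lie in (0, 1]")
    return core_energy / converter_efficiency

def converter_loss(core_energy, converter_efficiency):
    """Energy dissipated in the converter itself."""
    return battery_energy(core_energy, converter_efficiency) - core_energy

# One unit of core energy at the 40% efficiency level quoted in the text,
# compared with a hypothetical 80% converter.
for eta in (0.4, 0.8):
    drawn = battery_energy(1.0, eta)
    lost = converter_loss(1.0, eta)
    print(f"eta_DC = {eta:.0%}: battery supplies {drawn:.2f}, converter loses {lost:.2f}")
```

At an efficiency of 40%, the battery must supply 2.5 units of energy for every unit the core consumes, so minimizing the core energy alone does not necessarily minimize the system energy.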
Thus, the design of robust and energy-efficient ULP platforms is an important area of
research, and this dissertation presents novel solutions that simultaneously address energy
Figure 1.4: Energy under dynamic and aggressive voltage scaling.
efficiency and reliability in ULP platforms while taking into consideration the energy-delivery
subsystem.
1.1 Past Relevant Work
This section describes past work addressing three problems: reliability, energy efficiency,
and energy delivery. In the past, these three problems have been addressed independently,
leading to suboptimal and sometimes contradictory design principles.
1.1.1 Low-Power Design
Low-power digital circuit and system design has been a mature topic of research since the early
1990s [12–15]. Supply voltage scaling [12], body biasing [16], transistor sizing [17], clock
gating [18], and reconfiguration [14, 19] are commonly employed energy reduction techniques
in practice today. Voltage scaling has been the workhorse technique for energy-efficient op-
eration. Dynamic voltage scaling (DVS) [20–22] is employed to match the voltage and
frequency of operation to the application requirements in order to minimize energy con-
sumption in the superthreshold regime.
In the subthreshold regime, aggressive voltage scaling (see Fig. 1.4) where the core supply
voltage Vdd < Vth is a well-established design approach for low-throughput (hundreds of Hz
to a few MHz) applications. Reducing the supply voltage Vdd results in a quadratic reduction
in dynamic energy consumption Edyn. However, as Vdd is reduced below Vth, delay, and
hence the leakage energy Elkg, increases exponentially and quickly becomes comparable to
Edyn. This trade-off between Edyn and Elkg is well-studied [23–26], and results in a minimum
energy operating point (MEOP). The MEOP is characterized by the tuple (Vdd,opt, fopt, Emin),
where Vdd,opt and fopt are the energy-optimum supply voltage and frequency of operation,
respectively, resulting in the minimum achievable system energy Emin. In the last few
years, several ICs [27–29] and embedded processors [30–32] operating at the MEOP have
been demonstrated, and biomedical sensing and processing platforms [33–38] that exploit
dynamic or aggressive voltage scaling have appeared. Several circuit-level techniques [39],
such as multi-threshold CMOS [40], sleep transistor [41], and body biasing [42], and device-
level techniques such as channel and gate engineering [39] have been proposed to reduce
subthreshold leakage and total system energy. In addition, architectural techniques such as
pipelining [28] have been employed to reduce Emin.
These low-power design techniques address energy reduction at the expense of reducing
performance (speed). These also lead to greater sensitivity to PVT variations due to the
use of smaller devices or smaller logic depth. Thus, the conventional worst-case design
philosophy adopted currently tends to offset the energy benefits promised by such techniques.
Furthermore, these techniques address energy efficiency of the core only, and ignore the
impact of low Vdd and low load current on the efficiency of the energy-delivery subsystem.
1.1.2 Robust System Design
Design of reliable systems dates back to Von Neumann [43] in 1950, where logic networks
composed of noisy/probabilistic gates were considered and signal replication with majority
voting was proposed to increase resiliency. A number of logic-level techniques for robust
design have been proposed since then. In [44], Markov random networks are employed to
design a robust logic network. This implementation, though quite robust to input voltage
noise, has a large overhead in terms of transistor count. In [45], stochastic logic is proposed
whereby Von Neumann’s N -wire bundle representation of Boolean variables is employed. It
is assumed that the logic is error-free, i.e., deterministic logic (error-free hardware) operating
on stochastic signals. These techniques provide error resiliency, at the expense of a large
gate-count overhead (> 5×) which limits their energy efficiency.
At the architecture level, N-modular redundancy (NMR) [46] is a commonly employed
fault-tolerant technique where N-way replication of processing elements is followed by a majority
vote. NMR leads to an N-fold complexity and power overhead. Similarly, techniques such
as checkpointing [47], and coding techniques [48] [49] have been proposed, each of which,
though effective in enhancing robustness, incurs a significant energy cost.
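As a minimal sketch of the majority-voting step in NMR, the following Python fragment votes over the outputs of N replicated modules. The function name `majority_vote` and the choice to return `None` when no strict majority exists are illustrative assumptions, not part of the cited work.

```python
from collections import Counter

def majority_vote(outputs):
    """Return the value produced by a strict majority of modules, or None."""
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) / 2 else None

# Triple modular redundancy (N = 3): a single faulty module is outvoted.
print(majority_vote([42, 42, 17]))
# Without a strict majority the voter cannot decide.
print(majority_vote([42, 17, 99]))
```

The sketch also makes the N-fold overhead visible: every entry of `outputs` comes from a full copy of the processing element.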
The problem of increased circuit sensitivity to PVT variations in subthreshold has also
been addressed at the circuit and architectural levels. Circuit-level techniques, such as body
biasing [50], non-ratioed logic styles [51], and transistor sizing [26], reduce Vth variations
and consequently delay variations. Architectural techniques such as increased pipelining
depth [52] have been applied in subthreshold to mitigate the effect of PVT variations across
longer critical path delays. However, these circuit and architectural-level techniques either
significantly increase energy or lower the performance.
The robustness overhead in each of the approaches described above aggravates the energy-
efficiency problem. The main challenge in robust system designs is to achieve robustness
with a low error-compensation overhead. Recently, deterministic microarchitecture error
compensation techniques [53–55], which employ local error detection and global error cor-
rection via architectural replay, have been proposed. Deterministic microarchitecture-level
error correction [53–55] is able to correct for error rates (percentage of clock cycles in which
the output is in error) pη < 0.1% while achieving energy savings of less than 15%
beyond the point of first failure (PoFF). For example, RAZOR-II [54] operates at an error rate
of pη = 4 × 10⁻⁴ while achieving energy savings of 5% beyond PoFF, and error-detection
sequential (EDS) and tunable replica circuits (TRC) [55] achieve a 7% energy reduction
over PoFF while tolerating a pre-correction error rate pη < 0.001. Deterministic error
compensation techniques rely on correcting for worst-case PVT variations and achieving 100%
correctness, which may not be required at the application level. This severely limits their
design margins and energy efficiency. They miss the opportunity presented by a large class
of next-generation applications where a relaxed definition of “correctness” is adopted,
thus admitting an expanded set of acceptable outputs.
1.1.3 Design of Integrated Energy-Delivery Subsystems
The design of high-efficiency DC-DC converters has been extensively explored [56–59] with
converter efficiencies ηDC > 0.9 being achieved for constant load currents, and for high-power
superthreshold core operation. However, as we have illustrated, the efficiency ηDC varies
significantly with variations in the load current, and can fall below 40% [8] [9]. Converters
for subthreshold applications have appeared recently in the literature [60]. These tend to
employ switched capacitor topology to enable full on-chip integration. However, switched-
cap converters suffer from either poor load regulation when Vdd is adjusted, or significant
losses as the load varies. Design of integrated circuits where the DC-DC converter tracks
the MEOP of the core as a function of switching activity, has also been done [61]. However,
energy-delivery losses are not accounted for. The increased energy loss in the subthreshold
region due to the converter was illustrated recently [62]. Furthermore, the CAD community
has been playing a leading role in incorporating models for converter losses [63] [64] and
converter transition delay overhead [65], in determining optimal power management policies,
and designing power distribution networks for superthreshold operation.
As a result, past work in the design of integrated converters has focused on either design
of high-efficiency converters at fixed high-current loads for superthreshold cores, or on the
design of converters to track the core Vdd,opt in the subthreshold region but ignoring energy-
delivery losses. Thus, there is a unique opportunity for the joint design of DC-DC converters
and subthreshold cores to minimize total system energy for variable load conditions. This
is particularly important in ULP applications, where load variations of up to four orders
of magnitude are observed [10].
1.2 Stochastic Computing Techniques
A large class of the next-generation applications can be categorized into recognition, mining,
and synthesis, where massive data volume needs to be processed [1]. Such applications rely
heavily on digital signal processing (DSP) algorithms and employ statistical performance
metrics, such as signal-to-noise ratio (SNR), bit error-rate (BER), probability of detection,
and many others. The statistical nature of these metrics allows a somewhat relaxed definition
of “correctness” where the output may be corrupted by errors. Stochastic computing [66]
relies on exploiting such a relaxed definition of “correctness” afforded by emerging applica-
tions. Stochastic computing (see Fig. 1.5) matches the statistical attributes of the underlying
circuit/device fabric to the statistical nature of the application-level performance metrics.
This is accomplished by employing statistical error compensation (SEC) (see Fig. 1.6(a)).
SEC exploits the statistical knowledge of the circuit/device fabric error, the input, and the
intermediate data processed by the main block in order to correct for the output error ap-
proximately at low overhead such that an application performance metric, such as SNR, is
maintained within an application-dependent tolerance limit ΔSNR. The non-uniformity of
Figure 1.5: Stochastic computing in ULP platforms: matching noisy circuits to application
metric.
the error statistics (see Fig. 1.6(b)) implies that certain error magnitudes are more likely to
occur than others. The statistics of the input and the intermediate data enable SEC with low
overhead. Several other error-resilient techniques based on the concept of stochastic
computing have also been developed, such as those in [67–69]. Naturally, there will always be a small class
of critical applications such as those in finance/banking, flight control systems, and others
where a precise definition of correctness is mandatory, and where stochastic computing may
not be applicable in its current form.
Typically, voltage overscaling (VOS) [70] is employed in the main block to reduce its
energy. In VOS, the supply voltage is scaled below the critical voltage Vdd−crit needed for
error-free operation with a fixed operating frequency. When the supply voltage is lower
than Vdd−crit, the circuit will operate slower than the designed margins, and thus timing
violations will occur. SEC is then introduced in order to compensate for timing violations
[Figure 1.6 graphic: (a) main block whose erroneous output is passed through statistical error compensation to give the corrected output; (b) probability of occurrence versus error magnitude; (c) SNR versus pre-correction error rate pη for the conventional and error-compensated systems.]
Figure 1.6: Stochastic computing: (a) block diagram, (b) measured error statistics at the
output of an FIR filter in a 45-nm CMOS process with Vdd scaled 15% below its critical
value, and (c) signal-to-noise ratio (SNR) performance at different pre-correction error rates
pη.
(see Fig. 1.6(c)) such that SNR∗ ≈ SNRo and p∗η ≫ poη, where SNR∗ is the SNR of the
error-compensated system at an error rate p∗η, and SNRo is the SNR of the conventional error-free
system at an error rate pη ≈ poη.
1.2.1 Algorithmic Noise-Tolerance (ANT)
Stochastic computing was first proposed in the form of algorithmic noise-tolerance (ANT) [70]
(see Fig. 1.7(a)). ANT incorporates a main block and an estimator. The main block is
Figure 1.7: Algorithmic noise-tolerance (ANT): (a) framework and (b) error distributions.
permitted to make errors, but not the estimator. The estimator is a low-complexity block
(typically 5% to 20% of the main block complexity) generating a statistical estimate of the
correct main block output, i.e.,
ya = yo + η (1.1)
ye = yo + e (1.2)
where ya is the actual main block output, yo is the error-free main block output, η is the
hardware error, ye is the estimator output, and e is the estimation error. Note: the estimator
exhibits an estimation error e because it is an approximate version of the main block. ANT
exploits the difference in the statistics of η and e (see Fig. 1.7(b)). To enhance robustness, it
is necessary that η be large compared to e whenever η ≠ 0. In addition, the probability of
the event η ≠ 0 must be small. The final/corrected output of an ANT system ŷ is obtained
via the following decision rule:
$$\hat{y} = \begin{cases} y_a, & \text{if } |y_a - y_e| < \tau \\ y_e, & \text{otherwise} \end{cases} \tag{1.3}$$
where τ is an application-dependent parameter chosen to maximize the performance of ANT.
Under the conditions outlined above, it is possible to show that
$$\mathrm{SNR}_{uc} \ll \mathrm{SNR}_{e} \ll \mathrm{SNR}_{ANT} \approx \mathrm{SNR}_{o} \tag{1.4}$$
where SNRuc, SNRe, SNRANT , and SNRo are the SNRs of the uncorrected main block
(η dominates), the estimator (e dominates), the ANT system, and the error-free main block
(ideal), respectively. Thus, ANT detects and corrects errors approximately, but does so in
a manner that satisfies an application-level performance specification (SNR). The decision
block is designed to be timing error-free at all process corners and voltages because it is a
critical block that directly impacts performance (SNR), and typically constitutes less than
5% of the main block complexity. Several low-overhead estimation techniques have been
proposed by exploiting data correlation, system architecture, and statistical signal processing
techniques [66].
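The decision rule (1.3) and the SNR ordering (1.4) can be illustrated with a short Python sketch. The error model below (rare hardware errors of magnitude 4096, frequent estimation errors of at most 16, and τ = 64) is an assumption chosen for illustration and is not taken from measurements; the helper names `ant_correct` and `snr_db` are likewise hypothetical.

```python
import math
import random

def ant_correct(ya, ye, tau):
    """Decision rule (1.3): keep ya if it agrees with the estimate, else use ye."""
    return ya if abs(ya - ye) < tau else ye

def snr_db(reference, observed):
    """SNR in dB of `observed` measured against the error-free `reference`."""
    signal = sum(r * r for r in reference)
    noise = sum((r - o) ** 2 for r, o in zip(reference, observed))
    return float("inf") if noise == 0 else 10 * math.log10(signal / noise)

random.seed(0)  # fixed seed so the sketch is repeatable
yo = [random.randint(-500, 500) for _ in range(2000)]  # error-free output
# Assumed model: rare, large hardware errors and frequent, small estimation errors.
ya = [y + (random.choice([-1, 1]) * 4096 if random.random() < 0.1 else 0) for y in yo]
ye = [y + random.randint(-16, 16) for y in yo]
y_hat = [ant_correct(a, e, tau=64) for a, e in zip(ya, ye)]

for name, y in [("uncorrected", ya), ("estimator", ye), ("ANT", y_hat)]:
    print(f"{name:12s} SNR = {snr_db(yo, y):6.1f} dB")
```

Under these assumptions the corrected output differs from yo only in the cycles where a hardware error occurred, and then only by the small estimation error, which is why the ANT SNR exceeds that of the estimator alone.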
For ANT to also provide energy efficiency, it is necessary that the errors in the main block
arise primarily from measures that enhance its energy efficiency. In practice, these properties are
easily satisfied when errors in the main block occur due to VOS or a nominal-case design
being subjected to a worse-case process corner (better-than-worst-case (BTWC) design).
As most computations are least-significant-bit (LSB) first, timing violations due to VOS
or BTWC are generally large magnitude most-significant-bit (MSB) errors. Thus, timing
violations satisfy the error distribution shown in Fig. 1.7(b).
ANT has been shown to achieve up to 3× energy savings in theory and in practice via pro-
totype IC design [71] for finite impulse response (FIR) filters. ANT has also been employed
in the design of error-resilient low-power motion estimators [72] and Viterbi decoders [73]
(8000× improvement in BER with 3× energy savings).
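[Figure 1.8 graphic: the computation on input x is decomposed into N statistically similar sensors (Sensor 1, Sensor 2, Sensor 3, ..., Sensor N) whose outputs ye1, ye2, ye3, ..., yeN feed a fusion block that produces the corrected output ŷ.]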
Figure 1.8: The stochastic sensor network-on-a-chip (SSNOC).
1.2.2 Stochastic Sensor Network-on-a-Chip (SSNOC)
SSNOC [74] relies only on multiple estimators or sensors to compute, permitting hardware
errors to occur (see Fig. 1.8), and then fusing their outputs to generate the final corrected
output ŷ. Thus, the output of the ith sensor is given as
yei = yo + ei + ηi (1.5)
where ηi and ei are the hardware and estimation errors in the ith estimator, respectively.
If hardware errors are due to timing violations, one can approximate the error term in (1.5)
as (1 − pη)ei + pηηi. Such an ε-contaminated model lends itself readily to the application of
robust statistics [75] for error compensation. SSNOC has been applied to a CDMA PN-code
acquisition system [74], where the sensors were obtained through polyphase decomposition of
the matched filter. Simulations and IC measurements [76] indicate an 800× improvement in
detection probability while achieving up to 40% power savings. A key drawback of SSNOC
is the requirement of decomposing computation into appropriate sub-blocks whose output
errors have an ε-contaminated distribution.
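To make the fusion step concrete, the sketch below fuses sensor outputs with the sample median, one simple robust estimator. The median is an assumption made here for illustration; the source only states that robust statistics [75] are applied. The error magnitudes and the helper names `fuse_median` and `sensor_output` are also hypothetical.

```python
import random
import statistics

def fuse_median(sensor_outputs):
    """Fuse sensor outputs with the median, a simple robust location estimator."""
    return statistics.median(sensor_outputs)

def sensor_output(yo, p_eta, rng):
    """One sensor: a small estimation error, or a large hardware error w.p. p_eta."""
    if rng.random() < p_eta:
        return yo + rng.choice([-1, 1]) * 1024  # assumed hardware-error magnitude
    return yo + rng.randint(-4, 4)              # assumed estimation-error range

rng = random.Random(1)
yo = 100
outputs = [sensor_output(yo, p_eta=0.2, rng=rng) for _ in range(9)]
print(outputs, "->", fuse_median(outputs))
```

The median ignores a minority of grossly wrong sensors, whereas a plain average would be pulled far from yo by a single large hardware error.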
[Figure 1.9 graphic: (a) modules M1, M2, ..., MN driven by input x, with outputs yi = yo + ηi subject to hardware errors, feeding a majority voter that produces ŷ; (b) the same modules feeding a soft voter that uses the error PMF P(η) to produce ŷ.]
Figure 1.9: Architecture-level error resilient techniques: (a) N-modular redundancy (NMR)
and (b) soft NMR.
1.2.3 Soft N-Modular Redundancy (Soft NMR)
ANT and SSNOC rely on certain properties of the distribution of hardware errors η and
the estimation error e. For ANT, the distributions of η and e should be sufficiently distinct,
and for SSNOC, the composite error distribution should be ε-contaminated. More powerful
versions of stochastic computation can be developed if error statistics are explicitly employed
in computation. In fact, the robustness and energy efficiency of any error-resilient design
depend upon the error statistics, even though error statistics are typically not accounted for
in the design. For example, the robustness of NMR depends upon the pre-correction error
rate pη and requires the error events across the redundant modules to be independent [77].
While conventional NMR ignores error statistics, [78] proposed soft NMR (see Fig. 1.9) where
error statistics, i.e., the probability mass function (PMF) Pη(η), are explicitly employed to
enhance the robustness of NMR.
Structurally, soft NMR differs from NMR in that it incorporates a soft voter, which is
composed of a detector. Soft NMR makes explicit use of two types of statistical information:
(1) data statistics, and (2) error statistics. Data statistics are the distribution of the error-
free output yo. This is referred to as the prior distribution, or prior. Error statistics are the
distribution of the errors ηi.
The role of the soft voter in Fig. 1.9(b) is to determine the output ŷ that would, on
average, optimize a pre-specified performance metric. Detection theory can be employed in
order to systematically derive the soft voting algorithm. The detector maps the
observations (y1, y2, . . . , yN) to the “closest” hypothesis. Thus, the detection problem requires the
definition of a hypothesis set H, from which the corrected output ŷ is selected. This is done
by solving the following problem:
$$\hat{y} = \arg\max_{H_i \in \mathcal{H}} P(y_1, y_2, \ldots, y_N \mid H_i) \tag{1.6}$$
where $\mathcal{H} = \{H_i\}_{i=1}^{m}$ is the set of all hypotheses.
As the arg max operation requires a search to be performed over the entire hypothesis space,
for practical implementations, the hypothesis space H needs to be limited. There are several
ways to limit H, the simplest being to choose H = {y1, y2, . . . , yN}.
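A minimal sketch of the soft voter in (1.6), with the hypothesis set restricted to the observed module outputs, is given below. It assumes that module errors are independent, so that the joint likelihood factors into a product of Pη(yi − H); the error PMF values, the probability floor for unseen errors, and the name `soft_vote` are illustrative assumptions.

```python
def soft_vote(observations, error_pmf, floor=1e-12):
    """Pick the hypothesis among the observations that maximizes the likelihood (1.6).

    Assumes independent module errors, so P(y1..yN | H) = prod_i P_eta(y_i - H).
    `error_pmf` maps an error value to its probability; unseen errors get `floor`.
    """
    def likelihood(h):
        p = 1.0
        for y in observations:
            p *= error_pmf.get(y - h, floor)
        return p
    return max(set(observations), key=likelihood)

# Hypothetical error PMF: no error 70% of the time, otherwise a +64 or +128 error.
pmf = {0: 0.7, 64: 0.2, 128: 0.1}
# Two of three modules err differently, so there is no majority to vote on.
print(soft_vote([100, 164, 228], pmf))
# Two modules err identically: a majority voter would pick 164.
print(soft_vote([164, 164, 100], pmf))
```

In the second call the soft voter returns 100, because an error of −64 is (nearly) impossible under the assumed PMF while two simultaneous +64 errors are merely unlikely. This is the sense in which explicit error statistics can outperform plain majority voting.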
This dissertation builds and expands on existing stochastic computation applications and
techniques in order to design robust and energy efficient ULP platforms while addressing
the energy efficiency of both the core and the energy-delivery subsystem.
1.3 Role and Contributions of this Dissertation
Past work in stochastic computing was applied to the superthreshold regime. Furthermore,
research in energy efficiency addresses energy-efficiency issues in the core and energy-delivery
subsystems independently. This dissertation studies the application of stochastic computing
in ULP platforms operating in the subthreshold regime, and includes the efficiency issues
present in the energy-delivery subsystem.
The major contributions of this dissertation can be summarized as follows:
1. Stochastic computing is applied to subthreshold ULP applications. The new design
space is characterized by its sensitivity to PVT variations, and energy frugality.
2. The energy and robustness benefits of stochastic computing at the MEOP are quantified
through analysis, simulations, and measurements from a 45-nm prototype ECG
processing IC which is 4.7× more energy-efficient than the state-of-the-art and 16× more
robust to voltage variations.
3. The energy optimization of compute cores and DC-DC converter in ULP platforms is
conducted jointly. Core architectural techniques are proposed to improve the system
(core and converter) energy efficiency.
4. A novel stochastic computing technique (likelihood processing) is proposed, which improves on the
energy benefits and robustness of existing stochastic techniques.
5. A unified framework for stochastic computation is introduced consisting of an error
statistics characterization methodology, and diversity techniques to generate favorable
error statistics.
1.4 Dissertation Organization
This dissertation is organized as follows:
• Chapter 2 addresses stochastic computing in the subthreshold regime. It employs
algorithmic noise-tolerance for a subthreshold filtering application to reduce the energy
at the MEOP. In addition, it studies the design trade-offs between correction overhead,
robustness, and energy savings in subthreshold by employing architectural and circuit-
level simulations in the presence of voltage and process variations.
• Chapter 3 describes the implementation and design of a subthreshold stochastic
computing-based ECG processor in an IBM 45-nm CMOS process. Furthermore, it
demonstrates the energy efficiency and robustness benefits of stochastic computing
in ULP applications through IC measurement results, and illustrates the superiority
of the prototype IC compared with state-of-the-art systems in terms of energy and
reliability.
• Chapter 4 studies the joint optimization of the DC-DC converter and the compu-
tational cores to improve system energy efficiency. It shows that DC-DC converter
energy losses are significant in the subthreshold regime if the DC-DC converter is de-
signed to handle a wide range of DVS or largely varying workload characteristics. It
presents core-architecture techniques to alleviate the DC-DC losses and improve the
overall system energy efficiency.
• Chapter 5 presents a unified framework for stochastic computing and introduces a
new technique, likelihood processing (LP), which generates reliability information on
each output bit. Design trade-offs involved in LP implementation are studied. Energy
and robustness benefits in the design of a 45-nm 2D-DCT codec are quantified.
• Chapter 6 proposes a statistical framework for error characterization at the system
level and introduces design techniques to generate favorable error statistics in order to
enable next-generation stochastic computing techniques. It studies the factors (PVT
corner, input statistics, architecture, etc.) affecting error statistics at the system level,
and presents a generalized one-time offline statistical error characterization
methodology. In addition, the chapter presents design diversity techniques to ensure
independent errors across the observations, and consequently improve the robustness of
next-generation stochastic computing techniques. Various DSP blocks, such as adders
and filters, are employed to validate the proposed error model and its characterization
in a 45-nm CMOS process.
• Chapter 7 provides a summary of contributions, concludes the work completed for
this dissertation, and discusses its broader impact and future extensions.
CHAPTER 2
ENERGY-EFFICIENT AND ROBUST ULP
KERNELS VIA STOCHASTIC COMPUTATION
This chapter studies the application of algorithmic noise-tolerance (ANT) [70] to designs
operating at the minimum energy operating point (MEOP). ULP platforms for portable
medical and health monitoring applications, distributed wireless sensor networks, and active
RFIDs operate at or around the MEOP. The subthreshold design space under stochastic com-
puting is explored and the energy efficiency and robustness benefits are illustrated through
analysis, architectural, and circuit-level simulations in a 45-nm CMOS technology.
The MEOP is defined via the tuple (Vdd,opt, fopt, Emin) (see the conventional MEOP (MEOPC)
in Fig. 2.1), where Emin, Vdd,opt, and fopt are the minimum achievable energy and the
corresponding supply voltage and frequency of operation at which Emin is achieved, respectively.
Conventional MEOP design is well-studied [23] [26] [25] and is primarily a circuit/architectural
endeavour. However, it assumes worst-case PVT variations, leading to significant energy
overhead.
Stochastic computing techniques based on statistical error-compensation such as ANT have
been effective in reducing energy in digital signal processing (DSP) kernels by permitting
voltage overscaling (VOS). However, ANT and other error-resiliency techniques [53, 55, 66,
68,69,72] have not been applied in the subthreshold regime. This chapter studies the impact
of ANT on the energy consumption Emin at the MEOP for an 8-tap FIR filter. The ability
of ANT to cope with large pre-correction error rates pη is exploited, and its benefits are
extended to the subthreshold regime. Furthermore, the various design factors affecting
robustness and energy efficiency are studied.
In the subthreshold regime, leakage energy is significant, and thus frequency overscaling
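[Figure 2.1 graphic: energy versus Vdd for the conventional and ANT-based designs, marking MEOPC at (Vdd,opt, Emin) and MEOPANT at (VANT,opt, EANT,min), with Vth and a 28%-54% energy reduction indicated.]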
Figure 2.1: Energy in the subthreshold regime of conventional and ANT-based designs.
(FOS), where the operating frequency is increased beyond its critical value, can be employed
as well to reduce energy. In this chapter, we allow both frequency and voltage to be overscaled
simultaneously, thus achieving higher energy savings. In addition, we analyze the results
in two IBM 45-nm CMOS processes with different threshold voltages, Vth: a high Vth (HVT)
process, and a low Vth (LVT) process. We also study the energy benefits of ANT in the presence
of process variations. The use of two different threshold voltages changes the contribution of
leakage energy at the MEOP and the delay sensitivity to voltage variations, thus affecting
energy efficiency and robustness. The results of this chapter (see Fig. 2.1) can be summarized
as follows:
1. Algorithmic noise-tolerance results in a new MEOP (MEOPANT in Fig. 2.1) where
Emin is reduced by up to 58% and 10%, while the operating frequency is increased by
2.5× and 1.2× in the 45-nm LVT and HVT processes, respectively, at a pre-correction
error rate of 85%.
2. Under process variations in the LVT process, ANT reduces Emin by 54% on average while
maintaining a parametric yield of 99.7%.
This chapter is organized as follows: Section 2.1 analyzes the MEOP and its
characterization. Section 2.2 studies the effectiveness of ANT in the subthreshold regime and
analyzes its impact on the MEOP. Section 2.3 presents simulation results for an FIR filter
in 45-nm IBM CMOS processes which demonstrate the impact of ANT on the MEOP in
the presence of voltage and process variations.
2.1 The Minimum-Energy Operating Point (MEOP)
The dominant sources of energy consumption in subthreshold are the dynamic energy (Edyn)
and the leakage energy (Elkg), given by:
$$E_o = E_{dyn} + E_{lkg}, \qquad E_{dyn} = \alpha N C V_{dd}^2, \qquad E_{lkg} = \frac{N I_{OFF} V_{dd}}{f} \tag{2.1}$$
where α is the switching activity factor, N is the number of gates each with an output
load capacitance C, f is the operating frequency, Vdd is the supply voltage, and IOFF is the
OFF-state leakage current. The subthreshold current [26] as a function of gate-to-source
and drain-to-source voltage is given by:
$$I_{SUB}(V_{GS}, V_{DS}) = I_o\, 10^{\frac{V_{GS} - V_{th} - \gamma V_{DS}}{S}} \left(1 - e^{-\frac{V_{DS}}{V_T}}\right) \tag{2.2}$$
where Io is a reference current and is proportional to the transistor W/L ratio, S = mVT
is the swing factor, γ is the DIBL coefficient, Vth is the threshold voltage, and VT is the
thermal voltage. Using (2.2), the ON-state and OFF-state currents for an NMOS transistor
are ION = ISUB (Vdd, Vdd) and IOFF = ISUB (0, Vdd), respectively.
Assuming the critical path of the computational kernel has a logic depth of L gates each
with an output load capacitance C, the operating frequency f is given by:
$$f = \frac{I_{ON}}{\beta L C V_{dd}} \tag{2.3}$$
where β is a fitting parameter needed to match the finite signal rise and fall times. The
subthreshold frequency of operation decreases exponentially with Vdd reduction due to the
exponential dependence of ION on Vdd in (2.2). This leads to an exponential increase in
leakage energy as seen by substituting (2.3) in (2.1) to get:
$$E_{lkg} = \beta N L C V_{dd}^2\, \frac{I_{OFF}}{I_{ON}} = \beta N L C V_{dd}^2\, 10^{-\frac{V_{dd}}{S}} \tag{2.4}$$
and the total subthreshold energy is given by:
$$E_o = \alpha N C V_{dd}^2 + \beta N L C V_{dd}^2\, 10^{-\frac{V_{dd}}{S}} \tag{2.5}$$
Therefore, reducing Vdd in the subthreshold region decreases Edyn but increases Elkg expo-
nentially so that operating point MEOPC in Fig. 2.1 exists.
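The existence of the MEOP can be checked numerically with a short sketch that evaluates the sum of Edyn from (2.1) and Elkg from (2.4) over a range of supply voltages. All parameter values below are illustrative assumptions rather than fitted 45-nm data, and the search range is restricted to voltages where the model is assumed to be meaningful (the model itself tends to zero energy as Vdd approaches zero).

```python
def subthreshold_energy(vdd, alpha, n_gates, depth, cap, beta, swing):
    """Return (E_dyn, E_lkg, E_o) per cycle, using E_dyn from (2.1) and E_lkg from (2.4)."""
    e_dyn = alpha * n_gates * cap * vdd ** 2
    e_lkg = beta * n_gates * depth * cap * vdd ** 2 * 10 ** (-vdd / swing)
    return e_dyn, e_lkg, e_dyn + e_lkg

def find_meop(v_lo, v_hi, steps, **params):
    """Grid-search the supply voltage in [v_lo, v_hi] that minimizes total energy."""
    grid = [v_lo + (v_hi - v_lo) * k / steps for k in range(steps + 1)]
    return min(grid, key=lambda v: subthreshold_energy(v, **params)[2])

# Illustrative (not measured) parameters: activity, gate count, logic depth,
# load capacitance in farads, fitting parameter, and swing factor in volts.
params = dict(alpha=0.1, n_gates=10_000, depth=40, cap=1e-15, beta=1.0, swing=0.09)
v_opt = find_meop(0.10, 0.60, 500, **params)
e_dyn, e_lkg, e_min = subthreshold_energy(v_opt, **params)
print(f"Vdd,opt ~ {v_opt:.3f} V, Emin ~ {e_min:.3e} J "
      f"(Edyn {e_dyn:.2e} J, Elkg {e_lkg:.2e} J)")
```

With these assumed values the minimum falls strictly inside the search range: below it the exponentially growing leakage term dominates, and above it the quadratic dynamic term dominates.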
2.2 MEOP via ANT
In this section, we study the energy behavior of ANT in the subthreshold regime and analyze
its impact on the MEOP. For ANT to provide energy efficiency, it is necessary that the errors
in the main block (see Fig. 1.7(a)) be primarily due to enhancement of its energy efficiency.
These properties can be satisfied when errors in the main block are induced by either VOS,
FOS, or a combination of both since leakage energy contributes significantly to total energy
in subthreshold.
In VOS, the supply voltage is reduced below the critical voltage Vdd,crit needed for error-free
operation while keeping the operating frequency fixed, i.e., Vdd = KV OSVdd,crit and f = fcrit
where KV OS < 1 is the VOS factor. In FOS, Vdd is kept fixed while f is increased beyond
fcrit, i.e., Vdd = Vdd,crit and f = KFOSfcrit where KFOS > 1 is the FOS factor. FOS not
only reduces leakage energy due to a smaller operating clock period, but also enables higher
performance (frequency). As most arithmetic computations are least-significant-bit (LSB)
first, timing violations due to VOS or FOS are generally large magnitude most-significant-
bit (MSB) errors. Thus, timing violations due to VOS or FOS satisfy the error distribution
shown in Fig. 1.7(b).
Given an application-level error-tolerance limit, e.g., maximum SNR loss, VOS and FOS
can be applied simultaneously to save energy. The estimator is designed such that it operates
error-free due to its lower complexity compared to the main block, and the ANT residual
(post-correction) error is within the application tolerance limit. The energy of a subthreshold
ANT system EANT is given by:
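$$E_{ANT} = K_{VOS}^2 \left(1 + \frac{\alpha_{est} N_{est}}{\alpha N}\right) E_{o,dyn} + \frac{1}{K_{FOS}}\, K_{VOS} \left(1 + \frac{N_{est}}{N}\right) \frac{I_{OFF,\,K_{VOS} V_{dd,crit}}}{I_{OFF,\,V_{dd,crit}}}\, E_{o,lkg} \tag{2.6}$$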
where Nest is the number of additional gates needed to implement the estimator and the
decision block (ANT overhead), and αest is the average switching activity of the Nest nodes.
Note that a large class of estimators in ANT operates on the high-order bits of the
input, which usually have a lower switching activity factor than the low-order bits, and thus
αest < α. Several design factors, such as the hardware error rate of the main block (pη),
the application error tolerance, and the estimator complexity, determine the total system
energy consumption. It will be shown that a new MEOP, MEOPANT characterized by
the tuple (VANT,opt, fANT,opt, EANT,min), exists, where VANT,opt ≤ Vdd,opt, fANT,opt > fopt, and
EANT,min < Emin. The MEOP energy EANT,min depends on the estimator overhead, the
application error tolerance, and delay sensitivity to voltage variations (process threshold
voltage). Next, we illustrate the energy savings and the different trade-offs involved in
subthreshold ANT using an FIR filter.
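As a rough sketch of how (2.6) trades overhead against overscaling, the function below evaluates EANT from the error-free dynamic and leakage energies. The argument names (`n_ratio`, `act_ratio`, `ioff_ratio`) and the numbers in the example are assumptions made for illustration only.

```python
def ant_energy(e_dyn, e_lkg, k_vos, k_fos, n_ratio, act_ratio, ioff_ratio):
    """Energy of a subthreshold ANT system following (2.6).

    e_dyn, e_lkg : dynamic and leakage energy of the error-free main block
    k_vos, k_fos : overscaling factors (k_vos <= 1, k_fos >= 1)
    n_ratio      : N_est / N, gate-count overhead of estimator and decision block
    act_ratio    : alpha_est / alpha, relative switching activity of the overhead
    ioff_ratio   : I_OFF at k_vos * Vdd,crit divided by I_OFF at Vdd,crit
    """
    dyn = k_vos ** 2 * (1 + act_ratio * n_ratio) * e_dyn
    lkg = (k_vos / k_fos) * (1 + n_ratio) * ioff_ratio * e_lkg
    return dyn + lkg

# Illustrative numbers only: leakage-dominated block, 15% overhead, FOS of 2x.
e_conv = 1.0 + 4.0
e_ant = ant_energy(1.0, 4.0, k_vos=1.0, k_fos=2.0, n_ratio=0.15,
                   act_ratio=0.5, ioff_ratio=1.0)
print(f"E_ANT / E_o = {e_ant / e_conv:.3f}")
```

In this assumed leakage-dominated case, halving the clock period through FOS more than pays for the estimator overhead, which is consistent with the argument that FOS is attractive in subthreshold.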
2.3 Simulations and Results
We design a 23-b output FIR filter (see Fig. 2.2(a)), which is a widely-used DSP kernel in
the subthreshold regime. The FIR filter operates at an error-free critical supply voltage and
frequency (Vdd,crit, fcrit), and computes $y[n] = \sum_{i=0}^{7} x[n-i] \times h_i$, where x[n] is a 10-b input
signal, the hi are the 10-b filter coefficients, and n is the clock-cycle/time index. The filter uses
a ripple-carry adder-based architecture as a building block for the adders and multipliers,
and achieves an SNR of 17 dB when operating hardware error-free. The filter is designed
in the IBM 45-nm LVT and HVT CMOS processes. The following simulation procedure is
employed to study the impact of ANT on the MEOP.
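A behavioral sketch of the eight-tap filter, together with a check of the 23-b output width, is shown below. The sketch assumes signed two's-complement inputs and coefficients and zero initial conditions; the source does not state the number format, and the helper names are hypothetical.

```python
def fir_filter(x, h):
    """Direct-form FIR: y[n] = sum_i x[n - i] * h[i], with x[n] = 0 for n < 0."""
    return [sum(h[i] * x[n - i] for i in range(len(h)) if n - i >= 0)
            for n in range(len(x))]

def signed_bits(value):
    """Smallest two's-complement width that can hold `value`."""
    if value >= 0:
        return value.bit_length() + 1
    return (-value - 1).bit_length() + 1

# Worst case for signed 10-b inputs and coefficients: every value equals -512.
h = [-512] * 8
x = [-512] * 8
y = fir_filter(x, h)
print(y[-1], signed_bits(y[-1]))
```

Each 10-b by 10-b product fits in 20 bits, and accumulating eight products adds three more bits, which matches the 23-b output stated for the filter.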
2.3.1 Simulation Procedure
Given an application error-tolerance limit (maximum SNR loss), energy/design margins are
reduced so that the ANT main block has a specific pre-correction error rate (pη) which can
be achieved by either VOS at KV OS or FOS at KFOS or a combination of both. To compute
pη and system energy, and study the effect of timing errors on application metric (SNR), the
simulation procedure is as follows:
1. Circuit Simulations: we employ circuit simulations using HSPICE in the 45-nm IBM
CMOS process in order to characterize the worst-case delay and power of a limited-size
gate library (1-bit adder, and-gate, or-gate, inverter, etc.) at different voltages or PVT
corners. We use the analytical models presented in (2.3) and (2.5) to fit the delay and
power numbers obtained for each gate.
2. Gate-Level Netlist Simulations: we employ an RTL-level structural Verilog model of the
filter to generate the erroneous output y[n], using individual gate delays obtained from
step 1 at Vdd = KV OSVdd,crit (Vdd = Vdd,crit), and operating at f = fcrit (f = KFOSfcrit)
for VOS (FOS).
3. We determine the main block error rate pη and the SNR of the filter under timing
errors by comparing the erroneous RTL output to the error-free output.
4. Power Estimation: we compute the overall system leakage and dynamic energy/power
by summing up the leakage and dynamic power estimates of the filter constituent gates
obtained from step 1, while taking into consideration the average activity factor α of
each gate in step 2.
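Step 3 of the procedure can be sketched in a few lines. The sketch below is illustrative only: it replaces the gate-delay-driven netlist simulation of step 2 with a hypothetical error model that flips one high-order output bit with a fixed probability, and it measures SNR against the error-free filter output, which is an assumption (the dissertation's SNR is an application-level metric). The function names fir, inject_msb_errors, error_rate, and snr_db are ours.

```python
import math
import random

def fir(x, h):
    """Direct-form FIR: y[n] = sum_i h[i] * x[n - i], with x[n] = 0 for n < 0."""
    return [sum(h[i] * x[n - i] for i in range(len(h)) if n - i >= 0)
            for n in range(len(x))]

def inject_msb_errors(y, p, bit, rng):
    """Hypothetical stand-in for timing errors: flip one output bit with probability p."""
    return [v ^ (1 << bit) if rng.random() < p else v for v in y]

def error_rate(y_ref, y_err):
    """Pre-correction error rate: fraction of output samples that differ."""
    return sum(a != b for a, b in zip(y_ref, y_err)) / len(y_ref)

def snr_db(y_ref, y_err):
    """SNR of the erroneous output, taking the error-free output as the signal."""
    signal = sum(a * a for a in y_ref)
    noise = sum((a - b) ** 2 for a, b in zip(y_ref, y_err))
    return math.inf if noise == 0 else 10 * math.log10(signal / noise)

rng = random.Random(0)
x = [rng.randrange(1024) for _ in range(2000)]   # 10-b unsigned input samples
h = [rng.randrange(1024) for _ in range(8)]      # eight 10-b unsigned coefficients
y_ref = fir(x, h)                                # error-free output (fits in 23 bits)
y_err = inject_msb_errors(y_ref, p=0.05, bit=20, rng=rng)
print(error_rate(y_ref, y_err), snr_db(y_ref, y_err))
```

Because a timing error typically corrupts the most-significant bits, even a small error rate produces a large drop in SNR in this model, which is the behavior the uncorrected filter exhibits.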
2.3.2 MEOP of (Error-Free) Filter
We employ steps 1 and 4 of the simulation procedure to estimate the energy consumption
of the conventional error-free FIR filter. We validate the subthreshold energy consumption
by comparing the estimated values to HSPICE simulations of the complete FIR filter with
an average switching activity factor α = 0.1 in the HVT and LVT 45-nm IBM CMOS
processes. Figures 2.2(b) and (c) show that the analytical model estimates of the filter energy
and operating frequency closely match the SPICE simulations. Several observations can
be made when comparing the energy and frequency behavior of the filter in the LVT process to that in
the HVT process, all of which can be attributed to the difference in Vth between the two processes:
1. Elkg in LVT process is significantly larger (20× in near/superthreshold) than Elkg in
HVT process while Edyn is almost the same in the two processes.
2. The total filter energy in LVT process is dominated by leakage even for near/superthreshold
supply voltages (Elkg ≈ 4Edyn), while Elkg in HVT starts to dominate total energy only
in the subthreshold regime.
[Figure 2.2 graphic: (a) direct-form FIR filter with 10-b input x[n], 10-b coefficients h0 to h7, 20-b products, and 23-b output y; (b) energy (fJ) vs. supply voltage and (c) frequency of operation (MHz) vs. supply voltage, with HVT and LVT SPICE curves and MEOPC marked for both processes.]
Figure 2.2: Validation of energy and frequency models for an eight-tap FIR filter in 45-nm
CMOS processes: (a) direct-form architecture, (b) energy model, and (c) throughput model.
[Figure 2.3 graphic: frequency of operation (MHz) vs. supply voltage Vdd, with MEOPC marked for the HVT and LVT processes.]
Figure 2.3: Voltage-frequency plane of iso-pη curves in the 45-nm LVT (solid lines) and HVT
(dotted lines) processes.
3. The filter in LVT achieves higher operating frequency than that in HVT. The filter
operating frequency starts to decrease significantly for supply voltages less than 0.38 V
in LVT and 0.5 V in HVT process, and the corresponding Elkg starts to increase at
similar voltages in each process.
4. The conventional MEOP (MEOPC) in the LVT process is reached at (Vdd,opt = 0.38 V, fopt =
240 MHz, Emin = 1022 fJ), while that in the HVT process is reached at (Vdd,opt = 0.48 V, fopt =
80 MHz, Emin = 335 fJ).
2.3.3 Energy vs. Error Rate and Error Rate vs. SNR Characterization
In this section, we characterize the voltage-frequency operating points (V, f) required to
achieve a given pre-correction error rate pη in the filter main block and compare the effect of
FOS vs. VOS on pη. The simulation procedure in Section 2.3.1 is employed to characterize
the (V, f) iso-pη curves in the 45-nm LVT and HVT processes (see Fig. 2.3). Note that a
horizontal translation (fixed frequency) in Fig. 2.3 corresponds to VOS, a vertical translation
(fixed voltage) corresponds to FOS, and an arbitrary translation corresponds to simultaneous
VOS and FOS. We can see that as the critical supply voltage reduces, the horizontal and
vertical gaps between different pη curves reduce due to the increased sensitivity of delay
as supply voltage approaches the threshold voltage. Moreover, the HVT process has larger
delay sensitivity as compared to the LVT process due to higher threshold voltage. To further
compare FOS and VOS at MEOPC , Figs. 2.4 (a) and (b) show the raw hardware error rate,
pη, at the output of the main block as well as its energy consumption under VOS and FOS
in the LVT and HVT processes. Under FOS, pη is the same for the LVT and HVT processes
since errors due to FOS are only a function of the architecture and are independent of the
underlying circuit and process fabric. However, under VOS, the LVT process has a lower
pη than the HVT process for the same KVOS since its |Vdd,opt − Vth| at MEOPC is less than that at the
MEOPC of the HVT process, which leads to a lower percentage increase in delay at MEOPC
due to VOS in the LVT process. This can be seen in Fig. 2.2(c), where the rate of change
(slope) of frequency at MEOPC in the LVT process is less than that in the HVT process.
Moreover, note that FOS is more robust than VOS at a given pη, as seen in Fig. 2.4(a) by
comparing the slopes of the respective pη curves. Small variations in KVOS lead to large
variations in pη, unlike variations in KFOS. This is due to the exponential relation between
voltage and delay in the subthreshold regime.
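The difference in sensitivity can be illustrated with a first-order sketch. It assumes a generic subthreshold delay model, delay ∝ Vdd·exp(−Vdd/(n·vT)), with an assumed slope factor n = 1.5 and thermal voltage vT = 26 mV; these are illustrative values, not the fitted models used in the simulations. The quantity compared is the factor by which the ratio of critical-path delay to clock period grows.

```python
import math

N_SLOPE = 1.5   # assumed subthreshold slope factor (illustrative)
V_T = 0.026     # thermal voltage at room temperature, in volts

def relative_delay(vdd, vdd_ref):
    """Delay at vdd divided by delay at vdd_ref, assuming
    delay ~ Vdd * exp(-Vdd / (n * vT)); meaningful only in subthreshold."""
    m = N_SLOPE * V_T
    return (vdd / vdd_ref) * math.exp((vdd_ref - vdd) / m)

def delay_to_period_growth_vos(k_vos, vdd_crit):
    """VOS at fixed frequency: the path delay grows, the clock period does not."""
    return relative_delay(k_vos * vdd_crit, vdd_crit)

def delay_to_period_growth_fos(k_fos):
    """FOS at fixed voltage: the clock period shrinks by k_fos, the delay does not."""
    return k_fos

for k in (0.95, 0.90):
    vos = delay_to_period_growth_vos(k, vdd_crit=0.38)
    fos = delay_to_period_growth_fos(1.0 / k)
    print(f"{(1 - k) * 100:.0f}% overscaling: VOS x{vos:.2f}, FOS x{fos:.2f}")
```

Under these assumed values, a 5% voltage reduction from 0.38 V inflates the delay-to-period ratio by roughly 1.5×, whereas a comparable frequency increase inflates it by only about 1.05×.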
The percentage energy savings in Emin at the MEOPC in the LVT and HVT processes
due to VOS is equal for the same KVOS (see Fig. 2.4(b)) since percentage energy savings
[Figure 2.4 graphic: (a) pre-correction error rate (pη) and (b) normalized energy, each plotted against the overscaling factor with a VOS region and an FOS region, for the LVT and HVT processes.]
Figure 2.4: Pre-correction error rate and energy characterization of the 8-tap FIR filter
under VOS (x axis ≤ 1) and FOS (x axis ≥ 1): (a) error rate pη vs. KVOS and KFOS, and
(b) normalized energy vs. KVOS and KFOS (error-compensation overhead is not included).
depends on KVOS and is independent of supply or threshold voltage, unlike the absolute
energy. However, under FOS, the percentage energy savings is larger for the LVT process
than for the HVT process for the same KFOS, since total energy in the LVT process at MEOPC
is more strongly dominated by Elkg than in the HVT process (as seen in Fig. 2.2(b) and discussed in
Section 2.3.2) and FOS reduces only Elkg.
To see the effect of ANT on the application performance metric (SNR), we employ a
reduced precision-redundancy (RPR) version of the main block filter as an estimator [79],
as shown in Fig. 2.5(a). The main filter output precision is 23-b with 10-b input and
coefficients, and its architecture is shown in Fig. 2.2(a). The RPR estimator block
has an architecture similar to that of the main filter but processes only the Be most-significant bits
of the main filter's 10-b inputs and coefficients (Be < 10) to generate an estimated output
having 2Be + 3 bits. The ANT estimation and correction circuits are operated at the same voltage
and frequency as the main block. However, they do not have the same timing error rates
as the main block due to their reduced complexity: the estimator has lower
precision than the main block, resulting in longer timing slack. We follow the simulation
procedure outlined in Section 2.3.1 to estimate the SNR under hardware errors.
Figure 2.5(b) shows the SNR of the uncorrected (conventional) filter and that of the ANT
filter with different Be. The conventional filter SNR drops catastrophically as pη increases
above 0.1%, while the SNR of the ANT filter remains within 0.8 dB of the error-free output for
pη values up to 70% for Be = 5 (point B). Higher-precision estimators reduce the residual
error at the output, resulting in an SNR drop of less than 0.2 dB (point A). However, they
remain error-free only up to a lower pη (40%) than the lower-precision estimators due to their longer
critical path. Next, we study the energy consumption of the ANT filter while accounting for
both estimator and main block energy.
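The RPR estimator and the threshold comparison of Fig. 2.5(a) can be sketched as follows. This is a minimal illustration assuming unsigned 10-b operands, truncation to the Be MSBs, and a decision rule that passes the main output unless it differs from the estimate by more than a threshold; the function names and the threshold value are hypothetical and are not taken from the design.

```python
def msb_truncate(value, total_bits, keep_bits):
    """Keep the keep_bits most-significant bits of an unsigned total_bits word."""
    return value >> (total_bits - keep_bits)

def rpr_estimate(x_window, h, be, total_bits=10):
    """Reduced-precision estimate of sum_i h[i] * x[n - i], rescaled to full scale."""
    shift = 2 * (total_bits - be)
    acc = sum(msb_truncate(hi, total_bits, be) * msb_truncate(xi, total_bits, be)
              for hi, xi in zip(h, x_window))
    return acc << shift

def ant_decide(y_main, y_est, threshold):
    """Pass the main output unless it departs from the estimate by more than threshold."""
    return y_main if abs(y_main - y_est) <= threshold else y_est

x_window = [512] * 8                      # the eight most recent input samples
h = [256] * 8                             # eight filter coefficients
y_exact = sum(hi * xi for hi, xi in zip(h, x_window))
y_est = rpr_estimate(x_window, h, be=5)
y_faulty = y_exact ^ (1 << 22)            # hypothetical MSB timing error in the main block
threshold = 1 << 18                       # hypothetical decision threshold
print(ant_decide(y_exact, y_est, threshold), ant_decide(y_faulty, y_est, threshold))
```

When the main block is correct, its full-precision output is passed through; when a large-magnitude error occurs, the output falls back to the coarser estimate, so the residual error is bounded by the estimation error.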
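[Figure 2.5 graphic: (a) main filter (10-b input and coefficients, 23-b output) in parallel with an RPR estimator operating on Be-bit inputs and coefficients, followed by a |·| > TH decision block producing the output ŷ; (b) SNR (dB) vs. pre-correction error rate (pη), with operating points A, B, and C marked.]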
Figure 2.5: SNR vs. error rate for the 8-tap reduced precision-redundancy (RPR) ANT-based
filter with different estimator precisions (Be): (a) architecture, and (b) SNR performance
vs. pη.
[Figure 2.6 graphic: panels (a) and (b) with axes supply voltage (Vdd), frequency of operation (MHz), and energy (fJ); curves for pη = 0, 0.4, 0.7, and 0.85.]
Figure 2.6: Energy of ANT FIR filter (including error-compensation overhead for pη ≠ 0)
at different pre-correction error rates and estimation overhead: (a) the 45-nm LVT process,
and (b) the 45-nm HVT process.
Table 2.1: MEOP comparison of conventional and ANT filters in the 45-nm LVT process.
The last two columns give the energy savings w.r.t. Conventional 0 (17.1 dB) and w.r.t. the conventional design at the same SNR.
Design Type                SNR (dB)  Vdd,opt (V)  fopt (MHz)  Emin (fJ)  Savings vs. Conv. 0  Savings at same SNR
Conventional 0 (pη = 0)    17.1      0.38         240         1022       0%
Conventional 1 (pη = 0)    16.9      0.375        240         998        2.3%
Conventional 2 (pη = 0)    16.3      0.371        240         946        7.4%
Conventional 3 (pη = 0)    13.6      0.370        240         891        12.8%
ANT (pη = 0.4, Be = 6)     16.9      0.36         430         957        20%                  4.1%
ANT (pη = 0.7, Be = 5)     16.3      0.36         542         738        38%                  22%
ANT (pη = 0.85, Be = 4)    13.6      0.36         610         632        47%                  29%
2.3.4 MEOP of ANT Filter at Nominal Process Corner
Given an application error tolerance (e.g., maximum allowable SNR loss), the optimal ANT
configuration (pη, Be) can be determined from Fig. 2.5. For example, for an SNR loss of
0 dB, 0.2 dB, 0.8 dB, and 3.5 dB, the optimal ANT configurations are Conventional, A,
B, and C, respectively, where the pre-correction error rates pη are 0, 0.4, 0.7, and 0.85
and the estimator precisions Be are 0-b, 6-b, 5-b, and 4-b, respectively. Figure 2.3 shows the
corresponding voltage-frequency pair needed to achieve the required pη at the corresponding
optimal ANT configurations. The total system energy behavior, including ANT estimation
and correction overhead when pη ≠ 0, is shown in Fig. 2.6. Vdd and f at the MEOP
differ across configurations, indicating that the system needs to be operated differently for
each configuration. The MEOP of each configuration is shown in Tables 2.1 and 2.2 for the
LVT and HVT processes, respectively. ANT achieves up to 38% and 47% energy saving at
Table 2.2: MEOP comparison of conventional and ANT filters in the 45-nm HVT process.
The last two columns give the energy savings w.r.t. Conventional 0 (17.1 dB) and w.r.t. the conventional design at the same SNR.
Design Type                SNR (dB)  Vdd,opt (V)  fopt (MHz)  Emin (fJ)  Savings vs. Conv. 0  Savings at same SNR
Conventional 0 (pη = 0)    17.1      0.48         80          335        0%                   N.A.
Conventional 1 (pη = 0)    16.9      0.478        80          329        1.8%                 N.A.
Conventional 2 (pη = 0)    16.3      0.475        80          324        3.3%                 N.A.
Conventional 3 (pη = 0)    13.6      0.47         80          311        7.2%                 N.A.
ANT (pη = 0.4, Be = 6)     16.9      0.47         141         369        −11%                 −12.1%
ANT (pη = 0.7, Be = 5)     16.3      0.45         107         326        2.4%                 −0.6%
ANT (pη = 0.85, Be = 4)    13.6      0.45         122         299        10%                  3.9%
error rates of pη = 0.7 and 0.85, respectively, in the LVT process while incurring an SNR
loss of 0.8 to 3.5 dB when compared to an error-free conventional design at 17.1 dB.
The SNR loss incurred by ANT filters can be traded in conventional design for lower
energy by reducing input and coefficient precisions (see Conventional 1, 2, and 3 in Table
2.1). Comparing conventional designs to ANT filters at the same SNR, the energy savings
of ANT ranges from 4% to 29%. From Table 2.1, we can also see that MEOP ANT not only
operates at lower supply voltage Vdd,opt but also provides increased frequency of operation
reaching 1.8× and 2.25× at pη = 0.7 and 0.85, respectively in the LVT process. In the
HVT process, leakage energy is less dominant at MEOP compared to the LVT process, and
the error rate shows greater sensitivity to VOS, as illustrated in Fig. 2.4(a). That is why
the energy benefits of ANT compared to the conventional system are less pronounced in the
HVT process, where at most 10% energy savings are achieved under ANT with an SNR loss
of 3.5 dB (see Table 2.2). Note that at pη = 0.4 in the HVT process, ANT results in 11%
Figure 2.7: FIR filter frequency fluctuations under process variations using minimum-size
(Wmin) and 1.6-Wmin transistors in the 45-nm LVT process.
energy overhead, since error correction overhead offsets the energy savings obtained by VOS
and FOS.
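The same-SNR comparison can be checked directly from the Emin entries of Tables 2.1 and 2.2. The short sketch below recomputes the percentage savings of each ANT configuration relative to the conventional design with the same SNR; the helper name savings_percent is ours, and a negative result denotes an energy overhead.

```python
def savings_percent(e_ref_fj, e_new_fj):
    """Percentage energy savings of e_new relative to e_ref (negative means overhead)."""
    return 100.0 * (e_ref_fj - e_new_fj) / e_ref_fj

# SNR (dB) -> (conventional Emin, ANT Emin) in fJ, copied from Tables 2.1 and 2.2
LVT = {16.9: (998, 957), 16.3: (946, 738), 13.6: (891, 632)}
HVT = {16.9: (329, 369), 16.3: (324, 326), 13.6: (311, 299)}

for name, table in (("LVT", LVT), ("HVT", HVT)):
    for snr, (e_conv, e_ant) in table.items():
        print(f"{name} at {snr} dB: {savings_percent(e_conv, e_ant):+.1f}%")
```

The recomputed values agree with the same-SNR columns of the two tables to within about 0.1 percentage point.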
An important factor to consider in Fig. 2.6 is that the energy curves under ANT are flatter
than those of the conventional error-free design, indicating that ANT designs are less sensitive to
Vdd variations. These results show that statistical error compensation saves considerable
energy and enhances robustness in energy-constrained subthreshold applications.
2.3.5 MEOP of ANT Filter Under Process Variations
Another design challenge, especially in MEOP designs, is process variations. Timing errors
induced by process variations severely affect the filter SNR. The traditional design philosophy
will reduce design margins to guarantee application requirements and performance at the
[Figure 2.8 graphic: energy (fJ) vs. supply voltage Vdd for the ANT (Wmin) design with Be = 4 and the 1.6-Wmin (conventional) design.]
Figure 2.8: Energy under process variations for the 8-tap FIR filter using up-sized (1.6-Wmin)
design and minimum-sized (Wmin) ANT design
worst-case process corner. A large portion of within-die (WID) variations are due to random
dopant fluctuations (RDF) which cause large variations in threshold voltage [80]. Increasing
the transistor sizes will reduce RDF at the expense of an increased energy consumption.
Using minimum-size (Wmin) transistors will guarantee the lowest energy consumption at
MEOP but will incur a loss of yield if the nominal/target performance is not met.
To simulate the effect of process variations on performance and energy, the delay distributions
of the various gates used in the filter were obtained via Monte Carlo simulations in the
LVT 45-nm IBM CMOS process with WID variations enabled. These delay distributions are
sampled to obtain different instances of the filter. These instances are then simulated at the
RTL-level to find the error-free operating frequency of each filter instance. Figure 2.7 shows
the frequency distribution of the filter under process variations at different supply voltages
[Figure 2.9 graphic: probability of occurrence vs. energy at MEOP (fJ), with annotations of 4.5%, 39%, and 54%.]
Figure 2.9: Energy distributions at the MEOP of the minimum-sized (nominal) design,
up-sized design, and ANT minimum-sized designs with Be = 4 and 5 (error compensation
overhead is included).
and transistor widths. To guarantee an operating frequency equal to the nominal
operating frequency of the minimum-sized design, fµ,nom = 240 MHz at MEOP (see Table 2.1), under WID
process variations, the transistor sizes have to be increased by at least 60% to maintain
a parametric yield of 99.7%.
Using HSPICE power estimates for each constituent gate of each filter instance and tak-
ing into consideration the switching activity factor of each gate, the energies of the up-sized
(1.6-Wmin) conventional design and the minimum-sized ANT design including error com-
pensation overhead are shown in Fig. 2.8, and the corresponding energy distributions at
MEOP are shown in Fig. 2.9. On average, a conventional design incurs a 4.5% increase in energy to guarantee
an operating frequency of fµ,nom under process variations. On
the other hand, a minimum-sized design, which employs ANT with FOS in order to meet
throughput and correct for timing violations, achieves a mean energy savings of 39% and
54% when Be = 5 and Be = 4, respectively (see Fig. 2.9). These results indicate the benefits
of stochastic computing in saving energy in the subthreshold regime while guaranteeing a
desired parametric yield.
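The yield estimate described in this section can be sketched as a small Monte Carlo experiment. The sketch makes several simplifying assumptions that are not taken from the dissertation: the critical path is a chain of identical gates, gate delays are independent and lognormally distributed (a common first-order model when delay depends exponentially on a normally distributed Vth), and all quantities are in normalized units. The names sample_fmax and parametric_yield are ours.

```python
import math
import random

def sample_fmax(n_gates, median_delay, sigma, rng):
    """Error-free frequency of one filter instance: the critical path is modelled as
    a chain of n_gates gates with independent lognormal delays (assumed model)."""
    path_delay = sum(rng.lognormvariate(math.log(median_delay), sigma)
                     for _ in range(n_gates))
    return 1.0 / path_delay

def parametric_yield(f_target, fmax_samples):
    """Fraction of sampled instances whose error-free frequency meets f_target."""
    return sum(f >= f_target for f in fmax_samples) / len(fmax_samples)

rng = random.Random(1)
samples = [sample_fmax(n_gates=30, median_delay=1.0, sigma=0.3, rng=rng)
           for _ in range(2000)]
f_nominal = 1.0 / 30                      # frequency of a variation-free instance
for margin in (1.0, 0.9, 0.8):
    print(margin, parametric_yield(margin * f_nominal, samples))
```

In this model, meeting a fixed frequency target with high yield requires either design margin (for example, up-sizing) or a mechanism such as ANT that tolerates the instances that miss timing.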
2.4 Summary
In this chapter, the impact of stochastic computing on MEOP was studied. An ANT FIR
filter was employed as a test case to demonstrate significant energy savings and robustness
under process and voltage variations. This work shows that, similarly to the superthreshold
regime, stochastic computing designs provide robustness and energy benefits in the sub-
threshold regime over energy-optimal error-free designs. These benefits are attributed to the
ability of stochastic computing techniques to cope with relatively high error rates.
CHAPTER 3
A 14.5 FJ/CYCLE/K-GATE, 0.33 V STOCHASTIC
COMPUTING-BASED ECG PROCESSOR IN A
45-NM CMOS
Chapter 2 demonstrated through analysis, RTL, and circuit-level simulations, the application
of stochastic computing at the MEOP in order to allow the design of energy-efficient robust
ULP platforms and ICs. This chapter describes the design and implementation of a ULP
stochastic-computing prototype IC for electrocardiogram (ECG) analysis, and illustrates the
energy and robustness benefits of stochastic computing at MEOP through measurements.
Spiralling health care costs and a rapidly aging population have led to a growing interest
in personal and preventive health care systems and telemedicine [81]. Real-time monitoring
and analysis of ECGs is expected to have a significant impact on personal and preventive
health care by enabling expert intervention in the early stages of cardiovascular diseases
(CVDs) [81–83], which account for 30% of all deaths [84, 85]. Since vital biomedical signal
bandwidths are less than 1 MHz, energy-efficient health monitoring systems operate at or near
the MEOP. Recently, several wearable/implantable biomedical devices and SoCs [29, 33–38]
have been reported at or near MEOP. However, they have been designed to be error-free assuming
worst-case PVT variations, which leads to large energy overhead and reduced MEOP design
margins.
The stochastic computing-based subthreshold prototype IC is designed in a 45-nm IBM
SOI CMOS process and implements the Pan-Tompkins algorithm (PTA) [86]. PTA is a
derivative-based algorithm widely used in ECG processing systems because it does not require
extensive computation, manual segmentation of data, a training phase, or patient-specific
modifications and provides an acceptable beat-detection accuracy [82, 87]. Measurement
results show that ANT reduces Emin by 28% compared to the conventional (error-free)
processor while maintaining acceptable beat-detection performance. Furthermore, ANT
enables the IC supply voltage to be scaled to 15% below its critical value at MEOP, while
compensating for a 58% pre-correction error rate pη. These results represent an improvement
of 19× in beat-detection performance, and 600× in pη, over conventional systems. The
prototype IC consumes 14.5 fJ/cycle/1k-gate and exhibits 4.7× better energy efficiency than
the state-of-the-art while tolerating 16× more voltage variations.
This chapter is organized as follows: Section 3.1 presents a brief background on PTA
for ECG processing. Section 3.2 describes the architecture and implementation details of
the ECG processor. Section 3.3 presents measurement results illustrating the benefits of
stochastic computing in subthreshold.
3.1 The Pan-Tompkins Algorithm (PTA)
An ECG consists of periodic QRS complexes (see Fig. 3.1(a)), which reflect the electrical activity
in the heart during ventricular contraction. Accurate real-time QRS detection and beat-to-beat
(RR) interval (see Fig. 3.1(a)) extraction are the basis for heart monitoring, providing
a simple, noninvasive, and quantitative assessment of cardiac health. Several techniques have
been proposed for RR-interval extraction such as derivative-based algorithms, filter banks,
wavelet transforms, neural networks, genetic algorithms, and others (see [89] and [90] for a
comprehensive coverage of the different algorithms).
The PTA is the most widely used algorithm for QRS detection. It consists of noise-
removal filters, derivative and squaring stage, a moving average, and a peak detector (see
Fig. 3.2). The raw ECG signal x (see Fig. 3.1(a)) is corrupted by various noise artifacts
such as 60 Hz noise, muscle noise, motion artifact, and skin interface [91]. PTA employs a
band-pass filter as a first step to maximize the QRS signal-to-noise ratio (SNR) in the QRS
frequency band of interest which is 5 Hz to 15 Hz. The band-pass filter is implemented using
a cascade of a low-pass filter (LPF) with a cut-off frequency of 15 Hz and a high-pass filter
[Figure 3.1 graphic: four waveforms vs. time index, (a) x[n], (b) xF[n], (c) yD[n], and (d) yMA[n], with the P, Q, R, and S waves and the RR-interval annotated.]
Figure 3.1: ECG processing of a segment of an MIT-BIH [88] database record: (a) input
(noisy) ECG, (b) filtered ECG, (c) ECG at derivative output, and (d) ECG at moving
average output.
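[Figure 3.2 graphic: cascade of LPF (HLP), HPF (HHP), derivative d/dt (HD), squaring (HSQ), 32-sample moving average (HAV), and peak detector, with the signals x, xF, yD, yMA, and ŷPTA labeled.]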
Figure 3.2: Block diagram of the Pan-Tompkins algorithm (PTA) for ECG processing
[Figure 3.3 graphic: main processor (M) and reduced-precision estimator (RPE), each containing LPF, HPF, derivative-square, and 1/32 moving-average blocks with optional pipelining latches (D) and control signals CNTRL 1 to CNTRL 4 selecting outputs Out-1 to Out-4; an ANT-decision block (|·| > Th) handles timing errors and estimation errors ahead of the PTA peak detector, which produces ŷPTA. Bus widths of 11-b, 4-b, 22-b, and 7-b are annotated.]
Figure 3.3: Architecture of ANT-based ECG Processor
(HPF) with a cut-off frequency of 5 Hz. The filtered ECG signal xF is then differentiated in
order to amplify the higher frequencies characteristic of the QRS complex while
attenuating the lower frequencies characteristic of the P- and T-waves (see Fig. 3.1(b)). Squaring is
then applied to intensify the higher-frequency characteristics. The squared signal ySQ is then
passed through a moving-window integrator to provide information about the QRS-complex
width. The final stage in PTA is an adaptive detector which exploits information about the
QRS amplitude, slope, and width, as well as physiological properties of the ECG, in order to
determine the locations of the R-waves. Thus, the final output yPTA is a pulse train where
each pulse indicates the location of an R-wave. Irregular RR-intervals can then signal the
onset of a CVD [82, 83].
3.2 Architecture and Implementation
Algorithmic noise tolerance is employed to provide robustness and reduce the energy of the
designed ECG processor. The chip architecture is shown in Fig. 3.3. The main processor (M)
Table 3.1: Transfer function of building blocks in PTA.
Block        Transfer Function
LPF          $H_{LP}(z) = \frac{1 - 2z^{-6} + z^{-12}}{1 - 2z^{-1} + z^{-2}}$
HPF          $H_{HP}(z) = \frac{-1 + 32z^{-16} + z^{-32}}{1 + z^{-1}}$
Derivative   $H_{D}(z) = \frac{1}{8}\left(-z^{-2} - 2z^{-1} + 2z^{1} + z^{2}\right)$
MA           $H_{MA}(z) = \frac{1}{32}\sum_{i=0}^{31} z^{-i}$
consists of low-pass (LPF) and high-pass (HPF) filters, derivative and squaring blocks
(DS), and a moving average block (MA). The main block operates on an 11-b input ECG
signal. Reduced-precision redundancy is employed for the estimator. The reduced-precision
estimator (RPE) operates on the 4 MSBs of the input, and each block in M is replicated
in RPE at the reduced precision. The estimator gate complexity is 32% of that of the main ECG
processor. The ANT-decision block in Fig. 3.3 employs the principle of statistical detection
in order to compute the corrected output. The PTA peak detector, like the ANT-decision
block, presents a challenge for statistical error compensation: it is non-linear with a
one-bit output, which makes it difficult to design a low-complexity estimator. Thus, it is designed
to operate error-free with enough timing slack to handle VOS or FOS.
A reconfigurable data path in M and RPE is employed where pipelining latches (D) can
be introduced at the output of different blocks. This allows us to control the error locations
in the ECG processing algorithm. The filter coefficients are designed to be powers of 2 to
reduce complexity and are implemented as proposed in [86]. The transfer functions of the basic
blocks and the corresponding architectures are shown in Table 3.1 and Fig. 3.4, respectively.
The LPF and HPF are designed using pole-zero cancellation on the unit circle in order to
have integer coefficients. The derivative is a five-point derivative which approximates an ideal
derivative up to 30 Hz. The moving average window size is 32 samples to accommodate the
largest QRS-complex width (160 ms) at a sample rate of 200 samples/s. The moving average
block (see Fig. 3.4(d)) is designed using Wallace-tree carry-save adders. For the rest of the
blocks in the ECG processor, the basic computation structure employs ripple-carry adders
and array multipliers.
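As an illustration of how the transfer functions in Table 3.1 map to sample-by-sample computation, the sketch below implements the LPF, derivative, squaring, and moving-average stages as difference equations. It is a floating-point reference sketch rather than the fixed-point datapath of the chip, it omits the HPF stage, and it treats samples outside the record as zero; the function names are ours.

```python
def at(seq, idx):
    """Sample seq[idx], or 0 when idx falls outside the sequence."""
    return seq[idx] if 0 <= idx < len(seq) else 0

def lowpass(x):
    """H_LP(z): y[n] = 2y[n-1] - y[n-2] + x[n] - 2x[n-6] + x[n-12]."""
    y = []
    for n in range(len(x)):
        y.append(2 * at(y, n - 1) - at(y, n - 2)
                 + at(x, n) - 2 * at(x, n - 6) + at(x, n - 12))
    return y

def derivative(x):
    """H_D(z): y[n] = (-x[n-2] - 2x[n-1] + 2x[n+1] + x[n+2]) / 8."""
    return [(-at(x, n - 2) - 2 * at(x, n - 1) + 2 * at(x, n + 1) + at(x, n + 2)) / 8
            for n in range(len(x))]

def square(x):
    """Point-wise squaring to emphasise the high-slope QRS region."""
    return [v * v for v in x]

def moving_average(x, window=32):
    """H_MA(z): y[n] = (1/window) * sum_{i=0}^{window-1} x[n-i]."""
    return [sum(at(x, n - i) for i in range(window)) / window for n in range(len(x))]

impulse = [1] + [0] * 15
print(lowpass(impulse))                   # triangular impulse response of the LPF
ramp = list(range(10))
print(derivative(ramp)[2:8])              # unit slope away from the record edges
```

The LPF impulse response is the convolution of two six-sample boxcars, which is why it can be realized with integer coefficients and a short recursion.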
[Figure 3.4 graphic: signal-flow diagrams of the four blocks built from delay elements (D), adders, constant multipliers, and quantizers (Q), with the block inputs and outputs labeled and the fixed-point precision <n1, n2> annotated at each node.]
Figure 3.4: Architecture of building blocks in PTA: (a) LPF, (b) HPF, (c) derivative-square
(DS) block, and (d) moving average (MA) block. The precision shown is for the main block
M, and the notation <n1, n2> represents n1 integer bits and n2 fractional bits.
The chip is implemented in a regular-Vth, 45-nm, 1-V IBM SOI CMOS process. An ARM
standard cell library is employed in synthesis. The usage of library cells is restricted to
minimum-strength cells to reduce energy and introduce timing slack between the MSB and LSB
parts, and thus allow a graceful increase in error rate at the output of M when VOS or
FOS is applied. A total of 12 power domains are employed: the combinational and sequential
logic are powered separately within each of the following six modules: 1) M-block filters and DS,
2) RPE-block filters and DS, 3) M-block MA, 4) RPE-block MA, 5) ANT-decision block, and
6) PTA peak detector. The separate domains avoid the
failure of sequential logic at very low voltages.
The final design has a total of 36 k NAND2 gates, and a total chip area including the pad
frame of 1.25 mm × 1.3 mm. The core area is approximately 0.7 mm × 0.7 mm. The chip
microphotograph is shown in Fig. 3.5.
[Figure 3.5 graphic: die photo with labeled regions: RPE, main processor (M), EC control, and test.]
Figure 3.5: Die photo of the test chip in the 45-nm IBM SOI CMOS process.
3.3 Measurement Results
The chip testing was done using two different workloads to study the impact of the switching activity
factor: 1) an ECG dataset (average switching activity = 0.065) consisting of 30-min ECG
recordings of 10 patients from the MIT-BIH arrhythmia database [88], and 2) a synthetic
dataset (average switching activity = 0.37). The ECG waveform was sampled at 200 Hz and
quantized to 11 bits as input to the ECG processor chip. However, given a critical supply
voltage Vdd,crit, the chip is operated at the corresponding critical frequency fcrit which is
typically greater than 200 Hz. The higher frequency of operation can be used to process
multiple ECG signals simultaneously. For example, modern ECG monitoring systems use
12- and 16-lead ECG sensors instead of 3 leads [81].
Figure 3.6 shows the measured energy for the conventional system (without error compen-
sation) and the corresponding fcrit as a function of supply voltage. Measured results indicate
that the error-free MEOP (Vdd,opt, fopt, Emin) is (0.4 V, 600 kHz, 0.72 pJ) and (0.3 V, 65 kHz, 4.1
pJ) for the ECG and synthetic datasets, respectively. The lower Vdd,opt for the synthetic
[Figure 3.6 graphic: energy per clock cycle (fJ, solid lines) and critical clock frequency fcrit (MHz, dotted lines) vs. critical supply voltage Vdd,crit (V), with the MEOPs at (0.3 V, 65 kHz) and (0.4 V, 600 kHz) marked.]
Figure 3.6: Measured energy and frequency of the conventional (error-free) ECG processor
under different work loads.
dataset is expected, since a higher activity factor causes dynamic energy to be more domi-
nant than leakage energy, and thus equilibrium is reached at a lower supply voltage.
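The dependence of Vdd,opt on the activity factor can be reproduced qualitatively with a first-order energy model. The sketch assumes a normalized energy per cycle of Vdd²·(α + k·exp(−Vdd/(n·vT))), where the first term is dynamic energy and the second is leakage power integrated over a cycle time that grows exponentially as Vdd drops. The lumped leakage parameter k and the slope factor n are hypothetical values, so the resulting voltages are not expected to match the measurements.

```python
import math

def energy_per_cycle(vdd, alpha, k_leak=1.0e4, n_slope=1.5, v_t=0.026):
    """Normalised energy per cycle: dynamic term alpha * Vdd^2 plus leakage term
    k_leak * Vdd^2 * exp(-Vdd / (n * vT)). All parameter values are assumptions."""
    return vdd * vdd * (alpha + k_leak * math.exp(-vdd / (n_slope * v_t)))

def find_meop(alpha, v_lo=0.20, v_hi=1.00, step=0.001):
    """Grid search for the supply voltage that minimises energy per cycle."""
    n_steps = int(round((v_hi - v_lo) / step))
    grid = [v_lo + i * step for i in range(n_steps + 1)]
    return min(grid, key=lambda v: energy_per_cycle(v, alpha))

for alpha in (0.065, 0.37):               # activity factors of the two workloads
    print(alpha, round(find_meop(alpha), 3))
```

In this model, a larger α raises the dynamic term at every voltage, so the point at which the leakage increase offsets the dynamic savings moves to a lower supply voltage, consistent with the measured trend.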
Hardware/timing errors can be introduced by either VOS (Vdd = KVOS·Vdd,opt), FOS (f =
KFOS·fopt), or a combination of both to reduce energy and illustrate system robustness to
hardware errors. The pre-correction error rate pη, which is the probability that the main ECG
processor output (without error compensation) is in error due to VOS or FOS at MEOP, is
shown in Fig. 3.7 for the two datasets. The pre-correction error rate pη increases more rapidly under
VOS than under FOS. This is expected due to the exponential dependence of delay on
Vdd in the subthreshold regime, and the linear dependence on f. Note that the synthetic dataset
has a higher pη than the ECG dataset, since more critical paths are excited at the higher
switching activity factor.
[Figure 3.7 graphic: pre-correction error rate (pη) vs. overscaling factor, with VOS and FOS regions on either side of the MEOP; solid lines: ECG data (α = 0.065), dotted lines: synthetic data (α = 0.37).]
Figure 3.7: Measured pre-correction error rate at the MEOP of the ECG processor under
voltage and frequency overscaling.
Based on FOS RTL-level simulations, Fig. 3.8 shows the impact of the hardware/timing
error rate pη on the system-level metrics for QRS detection: 1) the sensitivity (Se), which is
the probability of detecting a true QRS-complex, and 2) the positive predictivity (+P ) which
is the probability that the detected QRS-complex is true [92]. These metrics are defined in
terms of detection events as follows:
$$Se = \frac{TP}{TP + FN} \tag{3.1}$$
$$+P = \frac{TP}{TP + FP} \tag{3.2}$$
where TP , FN , and FP are the number of true-positive, false-negative, and false-positive
events, respectively. Probability values greater than or equal to 0.95 for Se and +P are
desirable [92].
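Equations (3.1) and (3.2) translate directly into code. The sketch below computes both metrics from detection counts and checks them against the 0.95 target; the counts in the example are hypothetical and serve only to show the calculation.

```python
def sensitivity(tp, fn):
    """Se = TP / (TP + FN): fraction of true QRS complexes that are detected."""
    return tp / (tp + fn)

def positive_predictivity(tp, fp):
    """+P = TP / (TP + FP): fraction of detections that are true QRS complexes."""
    return tp / (tp + fp)

def meets_target(tp, fn, fp, target=0.95):
    """True when both Se and +P reach the desired level."""
    return sensitivity(tp, fn) >= target and positive_predictivity(tp, fp) >= target

# Hypothetical counts for one record: 970 detected beats, 30 missed, 20 spurious.
tp, fn, fp = 970, 30, 20
print(sensitivity(tp, fn), positive_predictivity(tp, fp), meets_target(tp, fn, fp))
```

Both metrics are needed because a detector can trade one for the other: missing beats lowers Se, while spurious detections lower +P.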
Two different scenarios are employed in the simulations in Fig. 3.8: 1) error-free MA
(pipelining latches introduced at the outputs of the DS- and MA-block and the MA-block
[Figure 3.8 graphic: detection accuracy (Se and +P) vs. pre-correction error rate (pη) for the conventional and ANT processors, including curves for the erroneous-MA case, annotated with a 640× pη increase and a 20× application-metric increase.]
Figure 3.8: Simulated detection performance of the conventional and ANT-based ECG pro-
cessors at different pre-correction error rates (solid lines indicate error-free MA and dotted
lines indicate erroneous MA).
[Figure 3.9 graphic: detection accuracy (Se and +P) vs. pre-correction error rate (pη) for the conventional and ANT processors, annotated with a 600× pη increase and a 19× application-metric increase.]
Figure 3.9: Measured detection performance of the conventional and ANT-based ECG pro-
cessors at different pre-correction error rates at the MEOP while the MA block is error
free.
voltage is not overscaled), and 2) erroneous MA (pipelining latch introduced at the output
of the MA-block only, and all module voltages are overscaled). Error compensation is done
at the output of the MA block in both cases. For case 1, Se ≥ 0.95 and +P ≥ 0.95 for
pη ≤ 0.62 corresponding to a 640× increase in pη handling capability and 20× improvement
in detection accuracy. For case 2, Se ≥ 0.95 and +P ≥ 0.95 for pη ≤ 0.2 corresponding
to a 220× increase in pη handling capability and 4× improvement in detection accuracy.
This result clearly shows the effectiveness of ANT and the intrinsic error compensating
attribute of the MA-block since it acts as a low-pass filter averaging out large-magnitude
errors. Note that the error-free MA-block does not help the conventional architecture which
fails dramatically for pη > 0.001. This is due to the fact that the adaptive peak-detector
block in the PTA has memory, and thus uncorrected errors are propagated across different
clock-cycles causing erroneous thresholds to be used for different clock-cycles. This is not
the case for ANT, where large errors are corrected, at least in an approximate sense, prior
to the peak-detector block.
Measurement results that demonstrate the impact of ANT are shown in Fig. 3.9, where the
ECG processor with an error-free MA block is voltage overscaled at its MEOP (0.4 V, 600 kHz,
0.72 pJ). The ANT-based ECG processor achieves the desired level of detection accuracy
(Se and +P > 0.95) in the presence of a large raw error rate pη ≤ 0.58, which corresponds
to a 600× greater pη handling capability, and a 19× improvement in Se and +P compared
to conventional error-free designs.
A comparison of Figs. 3.8 and 3.9 shows that the measured results via VOS and those obtained
via frequency-overscaled RTL gate-level netlist simulations match very closely at the same
error rate. In fact, Fig. 3.10 shows a close match between the measured and simulated
timing error probability distribution collected at the output of the main ECG processor. Such
error statistics can be further explicitly exploited by advanced statistical error compensation
techniques [66] in order to further increase system robustness to hardware errors.
The instantaneous RR-interval measurement distribution for the conventional and the
ANT-based ECG processors is shown in Fig. 3.11 under different pη with error-free MA.
While the conventional processor can maintain a reasonable RR-interval (1.2 s) only for very low
pη (< 10^-3), after which more spread is observed in the RR-interval measurements, the ANT
processor can maintain a reasonable RR-interval up to pη = 0.58.
[Figure 3.10 panels (a)-(d): probability of occurrence versus error magnitude.]
Figure 3.10: Error statistics of the ECG processor: (a) measured voltage overscaled processor
with pη = 0.38 at MEOP, (b) RTL simulations at pη = 0.35, (c) measured frequency
overscaled processor with pη = 0.58 at MEOP, and (d) RTL simulations at pη = 0.54.
[Figure 3.11 panels (a)-(b): occurrence versus instantaneous RR-interval [s] for pη = 0, 10^-3, 10^-2, 0.1, 0.2, 0.38, 0.58, and 0.69.]
Figure 3.11: Distribution of instantaneous RR-interval measurement at MEOP: (a) conventional
ECG processor and (b) ANT ECG processor.
Figure 3.12(a) shows measured iso-pη contours in the Vdd–f plane. We refer to the
ANT-based ECG processor operating on the pη = 0 contour in Fig. 3.12(a) without
error-compensation overhead as the conventional processor. A vertical translation (fixed Vdd) in
the Vdd–f plane from the pη = 0 contour (see Fig. 3.12(a)) corresponds to an application
of FOS. Similarly, a horizontal translation (fixed f) in the Vdd–f plane from the pη = 0
contour corresponds to an application of VOS. Arbitrary translations correspond to a joint
application of VOS and FOS. The total energy consumption per iso-pη contour (including
the energy overhead of error compensation for pη ≠ 0) is shown in Fig. 3.12(b). The new
MEOP of the ANT-based processor at pη = 0.58 is (0.34 V, 630 kHz, 0.52 pJ). This corre-
sponds to a simultaneous 15% reduction in Vdd,opt, 5% increase in fopt, and a 28% reduction
in Emin compared to the MEOP of an error-free processor. Alternatively, the conventional
processor with Vdd,crit = 0.34 V operates at fcrit = 250 kHz and consumes 0.9 pJ. Thus, the
ANT-based processor (at its MEOP) can be viewed as being frequency overscaled with a
factor of KFOS = 630/250 = 2.5, i.e., there is 2.5× increase in throughput, along with a 42%
energy savings.
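The percentages quoted above follow directly from the three measured operating points. A quick arithmetic check in Python is sketched below; the operating points are copied from the text, while the variable and function names are ours and purely illustrative.

```python
# Operating points quoted in the text: (Vdd [V], f [kHz], energy per cycle [pJ]).
conventional_meop = (0.40, 600.0, 0.72)      # error-free MEOP
ant_meop = (0.34, 630.0, 0.52)               # ANT-based MEOP at p_eta = 0.58
conventional_at_vcrit = (0.34, 250.0, 0.90)  # error-free operation at Vdd,crit = 0.34 V


def percent_change(new, old):
    """Signed percentage change of `new` relative to `old`."""
    return 100.0 * (new - old) / old


vdd_reduction = -percent_change(ant_meop[0], conventional_meop[0])
f_increase = percent_change(ant_meop[1], conventional_meop[1])
emin_reduction = -percent_change(ant_meop[2], conventional_meop[2])

# Frequency-overscaling factor and energy savings at the same supply voltage.
k_fos = ant_meop[1] / conventional_at_vcrit[1]
iso_voltage_savings = -percent_change(ant_meop[2], conventional_at_vcrit[2])

print(f"Vdd,opt reduction:        {vdd_reduction:.1f}%")
print(f"f_opt increase:           {f_increase:.1f}%")
print(f"E_min reduction:          {emin_reduction:.1f}%")
print(f"K_FOS at 0.34 V:          {k_fos:.2f}")
print(f"Energy savings at 0.34 V: {iso_voltage_savings:.1f}%")
```

Rounded to the precision used in the text, these reproduce the 15%, 5%, 28%, 2.5×, and 42% figures.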
Note that for Vdd > 0.4 V in Fig. 3.12(b), the ANT-based processor operates at higher
frequency and consumes more energy than the conventional processor since FOS reduces
leakage energy only, which is not sufficient to account for the energy overhead of error
compensation. As Vdd drops below 0.4 V, leakage starts to dominate the overall energy, and
hence ANT starts to show energy savings.
Similarly to Fig. 3.12, Figs. 3.13(a) and (b) show the measured iso-pη contours and the
corresponding processor energy, respectively, for the synthetic dataset. The new MEOP of
the ANT-based processor at pη = 0.58 for the synthetic dataset is (0.26 V, 65 kHz, 3 pJ).
[Figure 3.12 plots: (a) clock frequency f [MHz] versus supply voltage Vdd [V]; (b) energy per clock-cycle [fJ] versus supply voltage Vdd [V]. Annotated points: conventional MEOP (0.4 V, 600 kHz); ANT-MEOP at pη = 0.58 (0.34 V, 630 kHz), 28% energy reduction; (0.34 V, 250 kHz).]
Figure 3.12: ANT-based ECG processor measurement results under the ECG dataset: (a)
iso-pη contours in the Vdd–f plane and (b) the total energy (including error compensation
overhead for pη ≠ 0) corresponding to the iso-pη contours.
[Figure 3.13 plots: (a) clock frequency f [MHz] versus supply voltage Vdd [V]; (b) energy per clock-cycle [fJ] versus supply voltage Vdd [V]. Annotated points: conventional MEOP (0.3 V, 65 kHz); ANT-MEOP at pη = 0.58 (0.26 V, 78 kHz), 27% energy reduction; (0.26 V, 35 kHz).]
Figure 3.13: ANT-based ECG processor measurement results under the synthetic dataset:
(a) iso-pη contours in the Vdd–f plane and (b) the total energy (including error compensation
overhead for pη ≠ 0) corresponding to the iso-pη contours.
[Figure 3.14 plot: sensitivity (ΔSe/Se0 and Δ+P/+P0) versus supply voltage variation (ΔVDD/Vdd,opt), with curves for conventional S_Se, conventional S_+P, ANT S_Se, and ANT S_+P; annotations mark 16× and 43×.]
Figure 3.14: Measured sensitivity and robustness of the conventional and the ANT-based ECG
processors to voltage variations at the conventional MEOP (0.4 V, 600 kHz).
Table 3.2: Comparison with state-of-the-art systems.
Design Type                   Near/Sub-threshold          Error Resilient                           Both
                              [37]          [38]          [53]          [54]          [55]          This Work
Technology (nm)               90            130           180           45            65            45
(Vdd [V], f [MHz])            (0.4, 1)      (0.5, 7)      (1.8, N.A.)   (1.165, 185)  (1, 3000)     (0.34, 0.6)
Error Rate (pη)               0             0             0.001         0.04          0.001         0.58
Energy/Cycle (pJ)             13            29            870           505           N.A.          0.52
Energy/Cycle/1k-Gates (fJ)    68            483           N.A.          8416          N.A.          15
Energy Savings (past PoFF)    0             0             14%           5%            7%            28%
This corresponds to a simultaneous 13% reduction in Vdd,opt, 20% increase in fopt, and a
27% reduction in Emin compared to the synthetic dataset MEOP of an error-free processor.
Alternatively, the conventional processor with Vdd,crit = 0.26 V operates at fcrit = 35 kHz
and consumes 5 pJ. Thus, the ANT-based processor (at its synthetic dataset MEOP) can
be viewed as being frequency overscaled with a factor of KFOS = 65/35 = 1.85, i.e., there is
1.85× increase in throughput, along with a 40% energy savings.
The sensitivity of Se (S_Se = ∆Se/Se) and +P (S_+P = ∆+P/+P) to voltage variations is
characterized in Fig. 3.14 at the conventional MEOP supply voltage Vdd,opt = 0.4 V, where
we find that the ANT-based processor tolerates up to 16× higher voltage variations, and
shows up to 43× lower sensitivity (S_Se and S_+P) compared to the conventional processor.
Table 3.2 compares our design to other near- or subthreshold (pη = 0) biomedical processors.
The ANT-based ECG processor consumes 14.5 fJ/cycle/1k-gate, which is at least 4.7× more
energy efficient than the state of the art, in addition to the robustness benefits. Compared
to other error-resilient designs in superthreshold, the ANT-based ECG processor tolerates
an error rate of up to pη = 0.58, which is at least 580× greater than existing techniques.
Therefore, we see that stochastic computing provides a tremendous increase in robustness
while meeting the threshold of acceptable detection performance. In addition, it results in
up to a 28% reduction in energy beyond the minimum achievable energy.
3.4 Summary
This chapter presented a stochastic computing-based ECG processor in a 45-nm IBM CMOS
process. The prototype IC illustrates the robustness and energy benefits of stochastic com-
puting in the subthreshold regime, where robustness is more of a concern due to increased
sensitivity to voltage, process, and temperature variations. The IC shows robust operation
up to a 58% error rate along with a 28% energy reduction beyond minimum achievable
energy and is 4.7× more energy efficient than state-of-the-art.
CHAPTER 4
JOINT OPTIMIZATION OF POWER DELIVERY
AND CORE ENERGY IN ULP PLATFORMS
This chapter addresses the problem of designing energy-efficient embedded systems by jointly
optimizing the energy of both the DC-DC converter(s) and the computational core(s) in ULP
platforms. Chapters 2 and 3 demonstrated the existence of a minimum energy operating
point (MEOP) in the subthreshold region for conventional and stochastic cores (core(C)-
MEOP) defined by the energy-optimum core voltage Vdd,opt, the energy-optimum frequency
fopt and the minimum core energy consumption Emin. In energy-aware ULP platforms, the
core supply voltage VC or Vdd (see Fig. 4.1(b)) is generated by a programmable DC-DC
converter, which translates a higher battery voltage to a lower core voltage Vdd < 1V and
dynamically adjusts it depending on the workload characteristics. Past work has focused
primarily on independently optimizing the energy consumption of the core and the DC-DC
converter, especially at MEOP and in the presence of DVS, leading to sub-optimal solutions.
This chapter proposes to minimize system (core and DC-DC converter) energy consump-
tion by jointly optimizing the DC-DC converter and core. Architectural-level techniques
(see Fig. 4.1(a)), which exploit joint-design principles, application-level requirements, and
stochastic core robustness, are proposed to mitigate energy-delivery losses. First, we show that
dynamic voltage scaling (DVS) causes the overall system MEOP (S-MEOP) to differ significantly
from C-MEOP due to the increased DC-DC converter losses. Simulations in a 130-nm,
1.2 V IBM CMOS process show that operation at S-MEOP results in a 45.5% energy sav-
ings over operation at a core voltage VC,opt suggested by C-MEOP. The DC-DC converter
efficiency is also improved by 2.2×. Second, we show that architectural techniques cause the
S-MEOP to approach C-MEOP. Thus, it is sufficient to track C-MEOP, a much easier task
on-chip, in order to account for process variations and changing workload characteristics.
We show that core parallelization reduces DC-DC converter losses in the subthreshold regime
but increases them in the superthreshold regime. This observation leads us to propose a
reconfigurable core architecture that improves the converter efficiency by 2.3× at C-MEOP and
brings the energy consumption at S-MEOP and C-MEOP to within 4% of each other. This
also improves throughput in the subthreshold regime by at least 8×. Furthermore, we show
that pipelining, which has been proposed to decrease core energy at C-MEOP while improving
throughput [28], adversely affects the S-MEOP. The pipelined system's energy at S-MEOP
is 85% lower than when operating at the C-MEOP voltage VC,opt. The DC-DC converter
efficiency is also improved by 10 percentage points compared to the unpipelined-core system.
Finally, we address the energy delivery for stochastic compute cores. The robustness of a
stochastic core to voltage variations relaxes the voltage ripple specification, thereby reducing
DC-DC losses and improving system energy efficiency. Preliminary results demonstrate
the promise of joint stochastic compute core and DC-DC converter design, and open up
interesting further investigations and future extensions.
Figure 4.1: Energy-aware embedded system: (a) energy under DVS and (b) block diagram.
This chapter is organized as follows: Section 4.1 analyzes the core energy consumption
under DVS, and Section 4.2 analyzes the associated DC-DC converter losses. Section 4.3
discusses the challenges in designing efficient DC-DC converters for a wide range of load
conditions, and illustrates the energy gains obtained by operating at S-MEOP instead of
C-MEOP through simulations in a 130-nm IBM CMOS process. Section 4.4 presents the
proposed reconfigurable core architecture and joint-system design techniques to reduce
energy at S-MEOP and improve DC-DC efficiency.
4.1 Core Energy Characterization Under Dynamic Voltage Scaling (DVS)
The core energy varies widely depending on the workload and application throughput re-
quirements. Dynamic voltage scaling (DVS) employs a programmable DC-DC converter
to adjust the core supply voltage VC , to meet the application throughput requirements at
minimum energy consumption.
The two main sources of core energy are dynamic and leakage energy (Edyn and Elkg) and
the latter becomes significant only when the core operates in subthreshold regime (VC ≤ Vth
where Vth is the threshold voltage). The core energy EC , can be expressed as:
\[
E_C = E_{dyn} + E_{lkg}, \qquad
E_{dyn} = \alpha N C_g V_C^2, \qquad
E_{lkg} = \frac{N I_{OFF} V_C}{f_C}
\tag{4.1}
\]
where α is the average switching activity factor of the core gates, N is the number of core
gates each with an output load capacitance Cg, fC is the core operating frequency, and IOFF
is the OFF-state leakage current.
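The MOSFET drain current ID, as a function of the gate-to-source and drain-to-source voltages
(VGS, VDS) in the subthreshold and superthreshold regimes, is given by:
\[
I_D(V_{GS}, V_{DS}) =
\begin{cases}
I_o \, e^{\frac{V_{GS} - V_{th} - \gamma V_{DS}}{m V_T}} \left(1 - e^{-\frac{V_{DS}}{V_T}}\right) & \text{if } V_{GS} < V_{th} + \nu m V_T \\
I_o \, e^{\frac{\nu m V_T + \gamma V_{DS}}{m V_T}} \left(\frac{V_{GS} - V_{th}}{\nu m V_T}\right)^{\nu} & \text{if } V_{GS} \ge V_{th} + \nu m V_T
\end{cases}
\tag{4.2}
\]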
where Io is a reference current and is proportional to the transistor W/L ratio, ν is the
velocity saturation index, m is the subthreshold slope factor, γ is the drain-induced barrier lowering (DIBL) coefficient, Vth
is the threshold voltage, and VT is the thermal voltage. Using (4.2), the ON-state and
OFF-state currents for an NMOS transistor are ION = ID (VC , VC) and IOFF = ID (0, VC),
respectively.
Assuming the critical path of the core has a logic depth of K gates each with an output
load capacitance Cg, the core operating frequency fC is given by:
\[
f_C = \frac{I_{ON}}{\beta K C_g V_C}
\tag{4.3}
\]
where β is a fitting parameter that accounts for the finite signal rise and fall times. In
subthreshold, the operating frequency decreases exponentially as VC is reduced, owing to
the exponential dependence of ION on VC in (4.2) when VGS < Vth + νmVT. This leads to
an exponential increase in subthreshold leakage energy. Leakage energy as a function of VC
is obtained by substituting (4.3) in (4.1) to yield:
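\[
E_{lkg} = \beta N K C_g V_C^2 \, \frac{I_{OFF}}{I_{ON}} =
\begin{cases}
\beta N K C_g V_C^2 \, e^{-\frac{\gamma V_C}{m V_T}} & \text{if } V_C < V_{th} + \nu m V_T \\
\beta N K C_g V_C^2 \, h \, \frac{1 - e^{-\frac{V_C}{V_T}}}{(V_C - V_{th})^{\nu}} & \text{if } V_C \ge V_{th} + \nu m V_T
\end{cases}
\tag{4.4}
\]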
where h is a constant that depends on Vth, VT, ν, and m. Note that in superthreshold, leakage is
negligible and varies as 1/(VC − Vth)^ν since VC >> VT. Including the dynamic energy, the
total core energy is given by:
\[
E_C =
\begin{cases}
N C_g V_C^2 \left(\alpha + \beta K e^{-\frac{\gamma V_C}{m V_T}}\right) & \text{if } V_C < V_{th} + \nu m V_T \\
N C_g V_C^2 \left(\alpha + \beta K h \, \frac{1 - e^{-\frac{V_C}{V_T}}}{(V_C - V_{th})^{\nu}}\right) & \text{if } V_C \ge V_{th} + \nu m V_T
\end{cases}
\tag{4.5}
\]
Therefore, decreasing VC results in a quadratic reduction in Edyn at the expense of increased
delay or reduced frequency of operation fC . As VC is reduced below Vth, i.e., subthreshold
operation, Elkg increases very rapidly and becomes comparable to Edyn. This trade-off
between Edyn and Elkg in subthreshold is well studied [23] [25], and results in a minimum
energy operating point (MEOP) defined via the tuple (E*_C, V*_C, f*_C).
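The trade-off between Edyn and Elkg can be illustrated numerically. The following Python sketch evaluates the subthreshold branch of the drain-current model (4.2), the frequency model (4.3), and the energy terms in (4.1), and then locates the minimum-energy point with a simple grid search. All parameter values are hypothetical placeholders chosen only so that a minimum appears inside the subthreshold range; they are not fitted to the process or core studied in this dissertation.

```python
import math

# Illustrative parameters (hypothetical; not taken from the dissertation).
V_TH = 0.40      # threshold voltage Vth [V]
V_T = 0.026      # thermal voltage VT [V]
M_SLOPE = 1.5    # subthreshold slope factor m
NU = 1.5         # velocity saturation index
GAMMA = 0.1      # DIBL coefficient
I_O = 3e-6       # reference current Io [A]
C_G = 1e-15      # gate load capacitance Cg [F]
N_GATES = 1000   # number of gates N
K_DEPTH = 50     # critical-path logic depth K
BETA = 9.0       # delay fitting parameter
ALPHA = 0.3      # average switching activity


def drain_current(v_gs, v_ds):
    """Subthreshold branch of (4.2); valid only for v_gs < Vth + nu*m*VT."""
    if v_gs >= V_TH + NU * M_SLOPE * V_T:
        raise ValueError("outside the subthreshold branch of (4.2)")
    exponent = (v_gs - V_TH - GAMMA * v_ds) / (M_SLOPE * V_T)
    return I_O * math.exp(exponent) * (1.0 - math.exp(-v_ds / V_T))


def core_operating_point(v_c):
    """Return (f_C, E_dyn, E_lkg) at core voltage v_c using (4.1)-(4.3)."""
    i_on = drain_current(v_c, v_c)              # ION = ID(VC, VC)
    i_off = drain_current(0.0, v_c)             # IOFF = ID(0, VC)
    f_c = i_on / (BETA * K_DEPTH * C_G * v_c)   # (4.3)
    e_dyn = ALPHA * N_GATES * C_G * v_c ** 2    # dynamic term of (4.1)
    e_lkg = N_GATES * i_off * v_c / f_c         # leakage term of (4.1)
    return f_c, e_dyn, e_lkg


def find_meop(v_min=0.15, v_max=0.455, step=0.005):
    """Grid-search the core voltage that minimizes E_C = E_dyn + E_lkg."""
    best = None
    n_points = int(round((v_max - v_min) / step)) + 1
    for i in range(n_points):
        v_c = v_min + i * step
        f_c, e_dyn, e_lkg = core_operating_point(v_c)
        e_c = e_dyn + e_lkg
        if best is None or e_c < best[0]:
            best = (e_c, v_c, f_c)
    return best


e_star, v_star, f_star = find_meop()
print(f"MEOP: E* = {e_star * 1e15:.1f} fJ, V* = {v_star:.3f} V, "
      f"f* = {f_star / 1e6:.2f} MHz")
```

Below the minimum, the exponential drop in fC makes the leakage term grow faster than the quadratic saving in Edyn; above it, the dynamic term dominates. The location of the minimum depends strongly on the assumed parameter values.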
4.2 Design and Analysis of DC-DC Converters
The programmable DC-DC converter efficiency greatly depends on the core energy EC . The
DC-DC converter regulates the core supply voltage VC (see Fig. 4.1(b)) from an external
battery with voltage VB, which is greater than VC . It is imperative to decrease the losses of
the DC-DC converter to maximize the efficiency of energy delivery. Three key types of DC-DC
converters are the linear regulator (LR), the switched-capacitor regulator (SC), and the switching
regulator (SR). An LR employs a power MOSFET whose gate is controlled by a feedback
error-control signal to supply a specific current demand from the battery while maintaining a
constant VC. The efficiency of an LR is limited by the ratio VC/VB and is further reduced by the
losses in the control driver and power MOSFET. An SC delivers energy to the core by transferring
the battery energy through a switched capacitive network. An SC achieves better efficiency
than an LR but has poor output regulation due to output voltage ripple. In addition, an SC does
not allow continuous DVS because the voltage conversion ratio D = VC/VB is
determined by the chosen capacitor values and topology. SRs are the most widely used
because they enable continuous DVS with high efficiency over a relatively wide range of load
variations, and they are well studied in the literature [93] [94]. The block diagram of an SR
DC-DC converter is shown in Fig. 4.2(a). Commonly, pulse-width modulation (PWM) is
employed to generate a periodic pulse with duty cycle D and switching frequency fs = 1/Ts.
The periodic pulse controls the gates of a PMOS and an NMOS switch. When the pulse is
low, the NMOS is off and the PMOS is turned on, and current is supplied from the battery
to the load (continuous-conduction mode (CCM)) (see Fig. 4.2(b)). At light (low-current)
loads, the DC-DC converter enters discontinuous-conduction mode (DCM) (a variable PWM or
pulse-frequency modulation (PFM)) to decrease switching losses and improve efficiency. In DCM,
both the NMOS and the PMOS are turned off once the inductor current
(IL) reaches zero in order to prevent it from flowing in the reverse direction (see Fig. 4.2(c)).
The LC-filter acts as a low-pass filter that passes the average of IL(t) as the core current IC(t)
and blocks the AC component of the core voltage VC (see Fig. 4.2(b) and (c)). Typically, the
duty cycle D is chosen to be equal to VC/VB so that IL,avg = IC. The output core voltage ripple
is given by:
\[
\frac{\Delta V_C}{V_C} = \frac{1 - D}{16 L C f_s^2}
\tag{4.6}
\]
Figure 4.2: Switching DC-DC converter: (a) block diagram with parasitics, (b) continuous-conduction
mode (CCM), and (c) discontinuous-conduction mode (DCM).
The choices of LC-filter (passive) components and the converter switching frequency fs,
determine the output voltage ripple and are chosen to balance switching and conduction
losses. Increasing fs decreases the size of the passive components and the conduction losses
associated with the LC-parasitics, but increases the switching losses.
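To make the dependence in (4.6) concrete, the short Python sketch below evaluates the relative ripple for several core voltages. The battery voltage, L, C, and fs are the values used later in Section 4.3 (3.3 V, 94 nH, 47 nF, 10 MHz); the function name and the chosen list of core voltages are ours.

```python
def output_ripple(v_c, v_b, l_henry, c_farad, f_s):
    """Relative core-voltage ripple dV_C / V_C from (4.6), with D = V_C / V_B."""
    duty = v_c / v_b
    return (1.0 - duty) / (16.0 * l_henry * c_farad * f_s ** 2)


V_B = 3.3       # battery voltage [V]
L_FILT = 94e-9  # filter inductance [H]
C_FILT = 47e-9  # filter capacitance [F]
F_S = 10e6      # converter switching frequency [Hz]

for v_c in (0.33, 0.45, 0.8, 1.2):
    ripple = output_ripple(v_c, V_B, L_FILT, C_FILT, F_S)
    print(f"V_C = {v_c:.2f} V -> ripple = {100 * ripple:.1f}%")
```

Because the ripple scales as 1/fs^2 for fixed L and C, halving fs quadruples the ripple, which is why a ripple specification constrains how far fs can be lowered at light loads.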
The performance and losses of switching DC-DC converters are well studied in the literature
[94] [93]. The losses mainly include conduction losses, switching losses, and drive
losses. The conduction losses (Pcond) are due to the ON-resistance of the PMOS and NMOS
switches (Ron,p and Ron,n, respectively), the inductor parasitic resistance (RL), and the
capacitor equivalent series resistance RC (see Fig. 4.2(a)), and are related to the root-mean-square
(RMS) currents through the inductor, PMOS, and NMOS switch in CCM and DCM. The
conduction loss in a resistance R carrying a time-varying current is given
by (IR,rms)^2 R, where IR,rms is the RMS current through the resistor. The RMS currents
through the resistances in Fig. 4.2(a) are largely determined by the inductor current waveform
IL(t), whose average value yields the supplied core current IC (see Fig. 4.2(b) for CCM
and (c) for DCM). In CCM, the RMS currents through Ron,p and Ron,n are, respectively:
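\[
I_{rms,p} = \sqrt{D \left(I_C^2 + \frac{\Delta i_L^2}{3}\right)}, \qquad
I_{rms,n} = \sqrt{(1 - D) \left(I_C^2 + \frac{\Delta i_L^2}{3}\right)}
\tag{4.7}
\]
where D = VC/VB and ΔiL is the ripple in IL(t) in Fig. 4.2(b), given by:
\[
\Delta i_L = \frac{V_C (1 - D)}{2 L f_{sw}}
\tag{4.8}
\]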
In DCM, the RMS currents through Ron,p and Ron,n are:
\[
I_{rms,p} = \sqrt{\frac{L f_{sw} I_{L,peak}^3}{3 (V_B - V_C)}}, \qquad
I_{rms,n} = \sqrt{\frac{L f_{sw} I_{L,peak}^3}{3 V_C}}
\tag{4.9}
\]
where IL,peak is the peak inductor current in Fig. 4.2(c), given by:
\[
I_{L,peak} = \sqrt{\frac{2 I_C V_C (1 - D)}{L f_{sw}}}
\tag{4.10}
\]
The switching losses (Ps) are due to the current and voltage overlap when activating the
PMOS and NMOS switches and are given by Ps = (1/a) τ VB IC, where IC is the core current, a
is a number usually between 2 and 6 that describes the switching trajectory, and τ is the
fraction of the DC-DC switching period during which the switch current and voltage overlap.
The drive losses (Pdrive) are due to the capacitive switching in the MOSFET switch driver
and the controller and are given by Pdrive = fs Cd Vd^2, where Cd is the driver and controller
switching capacitance and Vd is the driver supply voltage. The DC-DC converter efficiency
(ηDC) and energy (EDC) are given by:
\[
\eta_{DC} = \frac{P_C}{P_C + P_{DC}} = \frac{P_C}{P_C + P_{cond} + P_s + P_{drive}}, \qquad
E_{DC} = \frac{P_{DC}}{f_C}
\tag{4.11}
\]
where PDC is the power loss in the DC-DC converter, PC is the core power, and EDC is
the DC-DC energy loss per core instruction.
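The loss terms above can be combined into a rough efficiency estimate. The Python sketch below is a simplified illustration only: it includes the switch ON-resistances but omits RL and RC, it selects CCM or DCM by comparing IC with the ripple amplitude ΔiL, and in DCM it computes the mean-square switch currents from first principles for a triangular current pulse with peak IL,peak. The resistances, a, τ, Cd, and the two load points are hypothetical values; VB, L, and fs are those used in Section 4.3, and fsw is assumed to be the same switching frequency as fs.

```python
import math


def converter_losses(v_c, i_c, v_b=3.3, l_h=94e-9, f_s=10e6,
                     r_on_p=0.2, r_on_n=0.2, a=4.0, tau=0.02,
                     c_d=10e-12, v_d=1.2):
    """Return (P_cond, P_s, P_drive) in watts for the switching converter.

    Only the switch ON-resistances contribute to P_cond in this sketch;
    the inductor and capacitor parasitics (R_L, R_C) are left out.
    """
    d = v_c / v_b
    delta_i_l = v_c * (1.0 - d) / (2.0 * l_h * f_s)   # ripple amplitude, (4.8)
    if i_c >= delta_i_l:
        # CCM: mean-square switch currents, i.e. the squares of (4.7).
        ms_p = d * (i_c ** 2 + delta_i_l ** 2 / 3.0)
        ms_n = (1.0 - d) * (i_c ** 2 + delta_i_l ** 2 / 3.0)
    else:
        # DCM: triangular current pulses with peak current from (4.10).
        i_peak = math.sqrt(2.0 * i_c * v_c * (1.0 - d) / (l_h * f_s))
        # A linear ramp to i_peak lasting t has mean-square i_peak^2 * t * f_s / 3,
        # with t = L*i_peak/(V_B - V_C) for the PMOS and L*i_peak/V_C for the NMOS.
        ms_p = l_h * f_s * i_peak ** 3 / (3.0 * (v_b - v_c))
        ms_n = l_h * f_s * i_peak ** 3 / (3.0 * v_c)
    p_cond = ms_p * r_on_p + ms_n * r_on_n
    p_s = tau * v_b * i_c / a          # switching (overlap) loss
    p_drive = f_s * c_d * v_d ** 2     # gate-driver and controller loss
    return p_cond, p_s, p_drive


def converter_efficiency(v_c, i_c, **kwargs):
    """eta_DC = P_C / (P_C + P_cond + P_s + P_drive), as in (4.11)."""
    p_c = v_c * i_c
    return p_c / (p_c + sum(converter_losses(v_c, i_c, **kwargs)))


heavy = converter_efficiency(1.2, 40e-3)     # superthreshold-like load
light = converter_efficiency(0.33, 0.3e-3)   # subthreshold-like load
print(f"eta_DC at 1.2 V, 40 mA:   {heavy:.2f}")
print(f"eta_DC at 0.33 V, 0.3 mA: {light:.2f}")
```

With a fixed fs, the drive loss does not scale with the load, so it dominates at light loads even when conduction and switching losses become negligible.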
4.3 System (Core and DC-DC Converter) Energy Optimization
We optimize the total system (core and DC-DC converter) energy consumption in the presence
of variations in core energy demand due to DVS, employing the energy models in Sections 4.1
and 4.2 and HSPICE simulations in a 1.2 V 130-nm IBM CMOS process.
We model the computational core as a bank of 50 16-b×16-b multiply-accumulate (MAC)
units. Each MAC unit (see Fig. 4.3(a)) operates at a core voltage and frequency (VC , fC)
and computes y[n] = y[n − 1] + x1[n] × x2[n] where x1[n] and x2[n] are 16-b input signals,
y[n] is a 32-b output, and n is the clock-cycle/time index. We employ a ripple carry-based
architecture with 1-b full adders as the basic building block to study the energy consumption
of the core. Figures 4.3(b) and (c) show the core frequency and energy (50×EMAC) based on
the analytical models in (4.3) and (4.5), respectively, and HSPICE simulations of the circuit
schematic of the MAC unit at various core voltages VC for workloads with average switching
activity of α = 0.3 and 0.1. The analytical models in (4.3) and (4.5) approximate the results
of HSPICE simulations very well, and hence will be employed to estimate core energy in
the rest of the chapter. As voltage is reduced to subthreshold, fC decreases exponentially
and Elkg increases significantly while Edyn continues to decrease. Figure 4.3(c) shows that
the C-MEOP is reached at (E*_C = 60 pJ, V*_C = 0.33 V, f*_C = 1.5 MHz) for a workload with
α = 0.3. We can see that as VC varies from 1.2 V down to V*_C = 0.33 V, the frequency fC
and energy consumption EC vary by 200× and 9×, respectively, i.e., an 1800× variation in
power demand. Figure 4.3(c) shows that the average switching factor impacts the dynamic
energy only. Thus, in the rest of this chapter, we focus on energy demand variations due to
DVS only and consider a workload with a fixed average switching factor of α = 0.3.
Figure 4.3: The computing core model: (a) architecture of a single MAC unit, (b) the core
frequency, and (c) the core energy consumption under DVS.
Figure 4.4: DVS system energy: (a) DC-DC efficiency vs. core power and supply voltage and
(b) the total system energy and losses.
We assume a 3.3 V external battery source and a DC-DC converter designed for a
switching frequency of fs = 10 MHz while maintaining an output ripple of around 10%
for all VC. This leads to relatively reasonable passive element values of L = 94 nH and
C = 47 nF, resulting in reduced conduction losses and improved efficiency (> 80%) in the
superthreshold regime. In fact, the converter can maintain an efficiency greater than 80% for
0.45 V ≤ VC ≤ 1.2 V while delivering a core power in the range 0.6 to 50 mW (see Fig. 4.4(a)).
When VC is decreased further, efficiency drops significantly, reaching 33% at the C-MEOP
determined previously in Fig. 4.3(c). This is because the drive energy loss per core instruction,
Edrive = Pdrive/fC, dominates in the subthreshold regime. The energy losses of the
DC-DC converter per core clock-cycle are shown in Fig. 4.4(b). Here, it is assumed that the
capacitance of the driver Cd is equivalent to 1% of the core capacitance NCg, and its voltage is
Vd = 1.2 V. While the conduction energy per core instruction Econd = Pcond/fC and switching
energy Es = Ps/fC losses scale well with the core energy EC under DVS, the drive energy
losses Edrive increase significantly in subthreshold since fC starts to decrease exponentially
relative to the DC-DC switching frequency fs. The converter switching frequency fs
does not decrease much with VC in subthreshold under DCM because the output ripple
needs to be maintained at less than 10%. The increased driver losses cause the total system
MEOP (S-MEOP) core voltage V*_S to be higher than the core voltage V*_C at the C-MEOP. Operating
at V*_S instead of V*_C results in a 2.2× improvement in efficiency and 45.5% energy savings, as
illustrated in Fig. 4.4(a) and (b), respectively. However, tracking S-MEOP on-chip is more
difficult than C-MEOP as it requires external feedback. Next, we study core architecture
techniques to aid the design of energy efficient DC-DC converters at the C-MEOP in DVS
and thus increase the system efficiency and make S-MEOP approach C-MEOP.
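The shift of the system optimum away from the core optimum can be reproduced qualitatively with a toy model. The Python sketch below adds only the drive energy Edrive = fs Cd Vd^2 / fC to the subthreshold core energy obtained from (4.1)-(4.3); conduction and switching losses are ignored and fs is held fixed. Cd = 1% of NCg, Vd = 1.2 V, and fs = 10 MHz follow the text, whereas the device and core parameters are hypothetical placeholders, so the resulting numbers are not those of Fig. 4.4.

```python
import math

# Hypothetical device/core parameters (illustrative only).
V_TH, V_T, M_SLOPE, GAMMA = 0.40, 0.026, 1.5, 0.1
I_O, C_G, N_GATES, K_DEPTH, BETA, ALPHA = 3e-6, 1e-15, 1000, 50, 9.0, 0.3

# Converter assumptions taken from the text.
F_S = 10e6                  # switching frequency [Hz], held fixed here
V_D = 1.2                   # driver supply voltage [V]
C_D = 0.01 * N_GATES * C_G  # driver capacitance = 1% of core capacitance


def subthreshold_current(v_gs, v_ds):
    """Subthreshold branch of the drain-current model (4.2)."""
    exponent = (v_gs - V_TH - GAMMA * v_ds) / (M_SLOPE * V_T)
    return I_O * math.exp(exponent) * (1.0 - math.exp(-v_ds / V_T))


def core_frequency(v_c):
    """Core frequency from (4.3) with ION = ID(VC, VC)."""
    return subthreshold_current(v_c, v_c) / (BETA * K_DEPTH * C_G * v_c)


def core_energy(v_c):
    """E_C = E_dyn + E_lkg from (4.1) with IOFF = ID(0, VC)."""
    e_dyn = ALPHA * N_GATES * C_G * v_c ** 2
    e_lkg = N_GATES * subthreshold_current(0.0, v_c) * v_c / core_frequency(v_c)
    return e_dyn + e_lkg


def drive_energy(v_c):
    """Drive loss per core instruction, E_drive = P_drive / f_C."""
    return F_S * C_D * V_D ** 2 / core_frequency(v_c)


def system_energy(v_c):
    """Core energy plus drive loss only (a deliberately reduced loss model)."""
    return core_energy(v_c) + drive_energy(v_c)


# Sweep stays below Vth + nu*m*VT so the subthreshold branch remains valid.
voltages = [0.15 + 0.005 * i for i in range(62)]
v_core_opt = min(voltages, key=core_energy)
v_system_opt = min(voltages, key=system_energy)
savings = 1.0 - system_energy(v_system_opt) / system_energy(v_core_opt)

print(f"C-MEOP voltage: {v_core_opt:.3f} V")
print(f"S-MEOP voltage: {v_system_opt:.3f} V")
print(f"System energy saved by operating at S-MEOP: {100 * savings:.0f}%")
```

Because fC falls exponentially with VC while fs stays constant, Edrive grows rapidly at low voltage and pulls the system minimum to a higher core voltage than the core-only minimum.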
4.4 Core Architecture Optimization for Energy-Efficient Systems
To minimize DC-DC converter drive losses per core instruction Edrive in subthreshold, we
employ architecture techniques to increase the core operating frequency fC so that the
controller in discontinuous mode can better adapt its DC-DC switching frequency (fs) to
the core frequency while maintaining less than 10% ripple in the core voltage.
4.4.1 Energy-Efficient Multicore Systems
Parallelization and unfolding are commonly used techniques to increase core throughput.
Parallelizing/unfolding by a factor of M instantiates M copies of the core running at the
same frequency as the original core. This increases both the throughput and power by a
factor of M , and maintains the energy per instruction the same as a single core (SC) if
the overhead of serialization/deserialization is ignored, i.e., the parallelized core (PC) or
multicore operates at the same energy level of an SC (SC-MEOP is same as PC-MEOP)
while delivering higher throughput. However, parallelization increases the DC-DC converter
conduction power losses Pcond by a factor greater than M due to the reduced conduction
efficiency when delivering M× the original load power. This translates to lower system effi-
ciency and higher system energy consumption when conduction losses dominate the DC-DC
converter losses. On the other hand, the switching and driver power losses in the DC-DC
converter are relatively independent of the core parallelization, leading to an M× decrease in
their energy overhead per instruction, since PC throughput increases by M .
The effect of parallelization on the DC-DC efficiency is shown in Fig. 4.5. Core paral-
lelization will reduce the efficiency significantly in superthreshold, where conduction losses
dominate. On the other hand, it helps to extend the DC-DC converter high-efficiency range
into the subthreshold regime by reducing driver and switching losses until the core frequency
decreases significantly again. This improves DC-DC efficiency at MEOP by a factor of at
least 2.2× for M = 4, and that increases for higher values of M at the expense of reduced
efficiency in superthreshold (see Fig. 4.5). This motivates the use of reconfigurable core
(RC) architecture under DVS. The RC uses a single core, with the remaining M − 1 cores
power-gated, as long as fC ≥ 0.1fs, i.e., as long as the DC-DC controller can adapt its losses to
the core frequency and maintain good output ripple. When fC < 0.1fs, all M cores are
activated, which gives an additional window/slack to reduce VC while keeping driver losses
within bounds.
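The reconfiguration rule described above can be summarized in a few lines of Python. This is only a sketch of the stated policy; the function names, the treatment of the drive power as independent of the number of active cores, and the example numbers are our assumptions.

```python
def active_cores(f_c, f_s, m_cores, threshold=0.1):
    """Number of powered cores in the reconfigurable-core (RC) scheme.

    A single core runs while f_C >= threshold * f_s; otherwise all M cores
    are activated and the remaining cores are no longer power-gated.
    """
    return 1 if f_c >= threshold * f_s else m_cores


def drive_energy_per_instruction(p_drive, f_c, n_active):
    """Drive loss per instruction when n_active cores share one converter.

    The drive power is taken as independent of the number of cores, so the
    overhead per instruction shrinks with the aggregate throughput.
    """
    return p_drive / (n_active * f_c)


F_S = 10e6        # converter switching frequency [Hz]
M_CORES = 8       # number of cores available to the RC
P_DRIVE = 1.0e-4  # drive power [W], hypothetical value

for f_c in (5e6, 2e6, 0.5e6, 0.1e6):
    n = active_cores(f_c, F_S, M_CORES)
    e_drv = drive_energy_per_instruction(P_DRIVE, f_c, n)
    print(f"f_C = {f_c / 1e6:.1f} MHz -> {n} active core(s), "
          f"E_drive = {e_drv * 1e12:.1f} pJ/instruction")
```

The switch to M cores at low fC is what limits the growth of the per-instruction drive overhead in subthreshold.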
Figure 4.5: DC-DC efficiency for parallel/multi-cores.
The DC-DC efficiency of RC for M = 8 is shown in Fig. 4.6(a) and its corresponding
energy consumption and losses are shown in Fig. 4.6(b). RC reduces the driver energy losses
around C-MEOP so that SRC-MEOP approaches C-MEOP. The total system energy
at C-MEOP is within 4% of the energy at SRC-MEOP, and the proposed RC efficiency at
C-MEOP is 2.6× better than that of an SC. Simulations show that the difference
between S-MEOP and C-MEOP under the RC architecture decreases further for higher values
of M, and higher efficiency is achieved. An additional benefit of the RC system architecture
is that it allows higher throughput (an M = 8× increase) to be met in the subthreshold region
compared to the SC system.
4.4.2 Energy-Efficient Pipelined Systems
Recently [28], pipelining has been shown to be an attractive technique in the subthreshold
region as it reduces the energy consumption at C-MEOP by 30% and simultaneously
increases throughput by 1.6×. In this subsection, we show that pipelining is unattractive
when DC-DC converter losses are included.
Figure 4.6: Architecture-optimized DVS system energy: (a) DC-DC efficiency and (b)
reconfigurable core (RC) system energy profile.
Pipelining by a level of J will decrease DC-DC driver losses per core instruction Edrive
by a factor of J assuming it operates at J× higher operating frequency. However, doing so
increases the load current leading to an increase in conduction losses by a factor greater than
J , which is why in Fig. 4.7(a), the efficiency of the pipelined system is always less than that
of the original core. Figure 4.7(b) shows the energy consumption of the original core with
pipelining level of J = 4. Pipelining reduces core leakage energy due to increased operating
frequency with minimal effect on core dynamic energy (overhead of pipelining registers). This
not only reduces core energy at CPIP-MEOP but also pushes the voltage at CPIP-MEOP to
lower values (compare CPIP-MEOP in Fig. 4.7(b) to C-MEOP in Fig. 4.4(b)). Thus, DC-DC
driver losses Edrive will be more significant when included in overall system energy. This
leads to an 85% increase in the pipelined-core system energy if operating at CPIP-MEOP
instead of SPIP-MEOP, in addition to a 2.6× reduction in DC-DC converter efficiency ηDC.
Simulations also show that similar results are obtained with increased pipelining levels until
pipelining-register overhead starts to dominate.
4.4.3 Energy Delivery for Stochastic Compute Cores
Chapters 2 and 3 demonstrated the robustness of stochastic-computing cores to voltage variation
through their ability to tolerate a large number of voltage-induced timing errors. Such
resiliency can be exploited in a joint stochastic system (DC-DC converter and
stochastic-computing core) design (see Fig. 4.8) by relaxing the core voltage ripple specification. This
reduces the value of the required converter passive elements (L and C) or its switching
frequency fs, as shown in (4.6), and thus can reduce the form factor or the energy of the conversion
subsystem. Although the ripple voltage is a dynamic phenomenon and is a strong function
of the core workload characteristics, a conservative approach is adopted here: voltage ripple
and droop are assumed to be similar to (static) VOS. Thus, since Chapters 2 and 3 showed
that the stochastic core at MEOP can tolerate up to 15% worst-case voltage reduction, it
is assumed that the core ripple specification can be relaxed by an additional 15%. Furthermore,
since further work is needed to study the core energy and timing errors under voltage
ripple and droop, we assume a worst-case scenario where the energy of the relaxed-ripple
stochastic core is the same as that of a conventional core.
Figure 4.7: DVS system energy with core pipelining: (a) DC-DC efficiency, and (b)
pipelined-core system energy profile.
Figure 4.8: Block diagram of a stochastic system (stochastic core and DC-DC converter).
Using the same value of L and C used so far, the DC-DC converter frequency fs is
decreased until (4.6) is satisfied with the relaxed ripple specification. Figure 4.9 shows
that the switching, conduction, and driver losses are reduced once the ripple specification
is relaxed (compare the dotted lines to the solid lines in Fig. 4.9). This results in a 13.5%
total system energy reduction at the new stochastic-system (SS)-MEOP as compared to the
conventional S-MEOP. In addition, the core voltage at SS-MEOP is closer to the
core voltage at C-MEOP than the core voltage at S-MEOP is. The reduction in DC-DC losses
improves the DC-DC efficiency at SS-MEOP by 8 percentage points (compare SS-MEOP
and S-MEOP in Fig. 4.10). All this illustrates the promise of joint optimization of the DC-DC
converter and stochastic core in improving overall system energy efficiency, even under
the assumed worst-case ripple voltage scenario. This motivates the investigation of stochastic
cores under voltage-ripple-induced errors and a better understanding of the effect of core-voltage
ripple on the overall system behavior and energy, which is part of the future work opened up
by this dissertation.
Figure 4.9: DVS energy of jointly optimized systems. Solid lines (dotted lines) refer to the
conventional (stochastic) system.
Figure 4.10: DC-DC energy efficiency of jointly optimized stochastic core and DC-DC converter.
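The effect of a relaxed ripple specification on the converter switching frequency can be estimated by solving (4.6) for fs. The Python sketch below does this for the L and C values of Section 4.3. It assumes that "an additional 15%" means the allowed ripple rises from 10% to 25% of VC, which is our reading rather than a figure stated explicitly in the text, and the function name is ours.

```python
import math


def switching_frequency_for_ripple(ripple, v_c, v_b, l_henry, c_farad):
    """Solve (4.6) for f_s given a target relative ripple dV_C / V_C."""
    duty = v_c / v_b
    return math.sqrt((1.0 - duty) / (16.0 * l_henry * c_farad * ripple))


V_B, L_FILT, C_FILT = 3.3, 94e-9, 47e-9
V_C = 0.33               # core voltage near the C-MEOP reported in Section 4.3
BASE_RIPPLE = 0.10       # conventional ripple specification
RELAXED_RIPPLE = 0.25    # assumed: 10% plus the additional 15% tolerated by the core

f_base = switching_frequency_for_ripple(BASE_RIPPLE, V_C, V_B, L_FILT, C_FILT)
f_relaxed = switching_frequency_for_ripple(RELAXED_RIPPLE, V_C, V_B, L_FILT, C_FILT)
ratio = f_relaxed / f_base

print(f"f_s for 10% ripple: {f_base / 1e6:.1f} MHz")
print(f"f_s for 25% ripple: {f_relaxed / 1e6:.1f} MHz")
# P_drive = f_s * C_d * V_d^2 is linear in f_s, so it scales by the same ratio.
print(f"Drive-loss scaling factor: {ratio:.2f}")
```

Since fs scales with the inverse square root of the allowed ripple, the drive loss, which is linear in fs, falls by the same factor under this assumption.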
4.5 Summary
We developed a holistic view of energy-efficient system design, taking into consideration
the DC-DC converter losses. We showed that the DC-DC losses have a considerable effect
on minimum energy operation. We employed architecture techniques to alleviate the DC-DC
converter losses so that high energy efficiency can be maintained over a wide range
of core energy demand variations. Furthermore, the robustness of stochastic cores to
voltage variations was exploited to increase the DC-DC design margins and improve overall
system efficiency.
CHAPTER 5
STOCHASTIC COMPUTING PLATFORMS VIA
LIKELIHOOD PROCESSING
Previous chapters employed ANT in subthreshold designs. This chapter proposes a new
stochastic computing technique referred to as likelihood processing (LP) and demonstrates
its energy savings and robustness benefits in 45-nm CMOS.
Stochastic computing techniques such as ANT [70] and SSNOC [74] implicitly employ
error statistics of the architectural components, while soft NMR [78] does so explicitly. The
proposed technique of LP also employs error statistics explicitly for error compensation. It does
so by computing the likelihood, i.e., the ratio of the probability of an output bit being a ‘1’
to the probability of it being a ‘0’. In doing so, LP processes error statistics at a finer granularity
than soft NMR, which exploits error statistics only at the word level. Furthermore, unlike soft
NMR, LP avoids the need for replication. Results show that LP significantly improves on
the robustness and energy efficiency of the conventional (error-free), ANT, and soft NMR
systems. This is demonstrated in the design of a 2D discrete-cosine transform (2D-DCT)
codec, a widely used image processing kernel, in a TI 45-nm CMOS process. The codec can
be employed as a hardware accelerator in a ULP platform for surveillance applications,
for example. The energy-robustness trade-off in LP is evaluated in the presence of voltage
overscaling (VOS), which is employed to emulate timing violations due to PVT variations.
The chapter is organized as follows: Section 5.1 presents a unified framework for error-
resiliency and stochastic-computing techniques including LP. Section 5.2 formally describes
the algorithmic basis for LP and its architecture, and presents a motivational example.
Section 5.3 demonstrates the benefits of LP in terms of robustness and energy efficiency in the
design of a 2D-DCT image codec.
5.1 A Unified Framework for Error Resiliency
This section presents a unified framework for describing error-resiliency techniques, including
both conventional and statistical techniques, in order to relate the proposed LP technique
to existing work.
The error model for an arbitrary computational kernel M (see Fig. 5.1(a)) is given by:
$$y = y_o + \eta + \epsilon = y_o + e \tag{5.1}$$
where y is a By-bit observed output, yo is the correct (error-free) output, η and ε are the
hardware and estimation errors, respectively, and e = η + ε is the composite error. Though yo
can belong to a set of acceptable outputs, we take a conservative approach and assume that
the cardinality of such a set is unity, i.e., there is one ideal/error-free value for yo. The set
of all possible outputs of M is referred to as the output space Y , i.e., yi, yo, and e ∈ Y . Note
that, without any loss of generality, we employ a weighted number representation, e.g., two’s
complement, for the output and error signals, which is quite appropriate for media kernels,
as these tend to employ arithmetic operations quite extensively. This does not preclude the
use of other number representations both weighted and non-weighted.
In this chapter, the hardware error η arises due to timing violations, and it is typically large
in magnitude because the arithmetic operations in DSP kernels are least-significant bit (LSB)
first. This is reflected in the sample probability mass function (PMF) Pη(η) in Fig. 5.1(b).
On the other hand, the estimation error ε is small in magnitude (see Pε(ε) in Fig. 5.1(b))
because it can arise as a difference between yo and an internal signal in block M, or perhaps
the output of another lower-complexity block ME that tries to approximate/estimate the
output of M, i.e., an estimator. The topic of error characterization of a computational block
is an interesting one in its own right, and can be accomplished in many ways, both off-line
as well as via in situ calibration using typical inputs. Characterization of timing errors for
computational blocks in the form of PMFs has been addressed in [95]. Figure 5.1(c) shows
a measured error PMF Pη(η) obtained from a voltage-overscaled IC in a 45-nm CMOS process.
Figure 5.1: Computational error model: (a) additive error model, (b) sample error statistics,
and (c) measured error PMF Pη(η) of a 20-bit output filter IC in 45-nm CMOS with Vdd =
0.85Vdd−crit.
Figure 5.2: Existing architectural level error resiliency techniques: (a) NMR, (b) algorithmic
noise-tolerance (ANT), (c) stochastic sensor network-on-chip (SSNOC), and (d) soft NMR.
We now describe existing error-resiliency techniques (see Fig. 5.2) using the following
definitions:
1. Observation vector Y: Y = (y1, y2, . . . , yN), where yi = yo + ηi + εi = yo + ei with yo,
yi, ηi, εi ∈ Y .
2. Decision rule R: ŷ = f(Y, Pη, Pε), where the corrected/final output ŷ ∈ Y .
For example, NMR (see Fig. 5.2(a)) employs N -way replication, i.e., ei = ηi and εi = 0, to
generate the observation vector Y, and employs majority or other forms of voting strategies
as the decision rule R. NMR is effective if the hardware errors ηi’s are independent, and is
described as follows:
1. YNMR = (y1, y2, . . . , yN), where yi = yo + ηi with yo, yi, ei = ηi ∈ Y .
2. RNMR : ŷ = maj (YNMR), where maj(.) is the majority operator.
Algorithmic-noise tolerance (ANT) [70] employs an estimator block ME (see Fig. 5.2(b)),
which is a low-complexity version of the M-block, to generate an estimate y2 of the error-free
output yo. The estimator block ME is designed to be free of hardware errors, i.e., η2 = 0
and thus e2 = ε2. ANT relies on the difference in the statistics of the hardware error of the
M-block η1 and the estimation error ε2, which can be made small (see Pε(ε) in Fig. 5.1(b)),
to detect and correct for η1. Thus, ANT is described as:
1. YANT = (y1 = yo + η1, y2 = yo + ε2), where y1, y2, η1, ε2 ∈ Y .
2. RANT : $\hat{y} = \begin{cases} y_1, & \text{if } |y_1 - y_2| < T_h \\ y_2, & \text{otherwise} \end{cases}$, where Th is a user-defined threshold that
maximizes an application-level performance metric.
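The ANT decision rule can be sketched in a few lines. This is only an illustration of RANT as written above; the function name and the numerical values (error-free output, error magnitudes, and threshold) are hypothetical and not taken from the dissertation.

```python
def ant_decide(y1: int, y2: int, th: int) -> int:
    """ANT decision rule: keep the main-block output y1 unless it deviates
    from the estimator output y2 by Th or more, in which case use y2."""
    return y1 if abs(y1 - y2) < th else y2


# Hypothetical values chosen only to exercise both branches of the rule.
y_o = 100            # error-free output
eps = 3              # estimation error (small in magnitude)
eta = 1 << 9         # timing error (large in magnitude)

no_error = ant_decide(y_o, y_o + eps, th=16)          # main block is correct
with_error = ant_decide(y_o + eta, y_o + eps, th=16)  # main block is in error
print(no_error, with_error)
```

When the main block is error-free, the rule passes y1 through unchanged; when a large timing error occurs, the output falls back to the estimate and the residual error is the small estimation error.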
The stochastic sensor network-on-chip (SSNOC) [74] in Fig. 5.2(c) decomposes the M-
block into a set of statistically similar estimator blocks ME,i, and employs robust estimation
techniques [75] for error compensation. Thus, SSNOC is described as:
1. YSSNOC = (y1, y2, . . . , yN), where yi = yo + ei, ei = ηi + εi, with yo, yi, ei ∈ Y .
2. RSSNOC : ŷ = frobust(YSSNOC), e.g., frobust(YSSNOC) = median(YSSNOC) or
frobust(YSSNOC) = huber(YSSNOC).
Soft NMR [78] (see Fig. 5.2(d)) employs the same observation vector as NMR. However,
unlike NMR, soft NMR exploits the hardware error PMF Pη(η), to implement a decision
rule R based on the maximum-likelihood (ML) principle to enhance system robustness. Soft
NMR is described as:
1. YSNMR = (y1, y2, . . . , yN), where yi = yo + ηi with yo, yi, ei = ηi ∈ Y .
2. RSNMR : $\hat{y} = \arg\max_{y_o} P(\mathbf{Y}_{SNMR} \mid y_o) = \arg\max_{y_o} P_{\boldsymbol{\eta}}(\boldsymbol{\eta} \mid y_o)$, where η = (η1, η2, . . . , ηN).
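To make the contrast between majority voting and the word-level ML rule concrete, the following minimal sketch evaluates both on one observation vector. It assumes independent hardware errors so that the joint PMF factorizes into per-observation terms; the error PMF, the 4-bit output space, and the function names are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def majority(observations):
    """NMR decision rule: return the most frequent observation."""
    return Counter(observations).most_common(1)[0][0]


def soft_nmr_decide(observations, error_pmf, output_space):
    """Word-level ML rule: argmax over y_o of prod_i P_eta(y_i - y_o).
    Assumes independent errors (an assumption of this sketch)."""
    def log_likelihood(y_o):
        total = 0.0
        for y in observations:
            p = error_pmf.get(y - y_o, 0.0)
            if p == 0.0:
                return -math.inf   # this y_o cannot explain observation y
            total += math.log(p)
        return total
    return max(output_space, key=log_likelihood)


# Hypothetical PMF: an error of +8 is far more likely than an error of -8.
error_pmf = {0: 0.7, 8: 0.25, -8: 0.05}
observations = (13, 13, 5)

y_maj = majority(observations)
y_ml = soft_nmr_decide(observations, error_pmf, range(16))
print(y_maj, y_ml)
```

Here two observations agree on 13, so the majority voter selects 13. The ML rule instead selects 5, because two errors of +8 are more probable under the assumed PMF than a single error of −8.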
Figure 5.3: The proposed technique: likelihood processing (LP).
Our proposed technique LP (see Fig. 5.3) consists of a computational block MLP generating
an N -element observation vector YLP , a likelihood generator (LG), and a slicer. Figure 5.4
shows that the block MLP can be designed via one or more of the following techniques:
1) replication, 2) estimation, and 3) exploiting inherent signal correlations in M. In the
latter case, intermediate signals from M are employed to generate YLP , thereby avoiding
any hardware replication (see Fig. 5.4(c)). For example, adjacent pixels in image/video
processing applications have correlated values, thereby providing multiple observations with
low overhead.
The LG-block in Fig. 5.3 employs the composite error PMF PE(e = η+ ε) to compute the
a-posteriori probability (APP) ratio λj = P (bj = 1|YLP )/P (bj = 0|YLP ) for each output
bit bj (j = 0, .., By − 1), of the By-bit output yo. This soft information provides a measure
of the confidence/reliability on each output bit. For example, our confidence in bit bj being
a ‘1’ increases with the numerical value of λj for λj > 1, and vice versa for bit bj being
a ‘0’. The slicer in Fig. 5.3 thresholds λj to obtain a hard estimate b̂j. In this chapter, we
consider only hard decisions for simplicity and ignore the additional improvement available
by exploiting soft information further. The LP technique can be described as:
1. YLP = (y1, y2, . . . , yN), where yi = yo + ηi + εi = yo + ei with yo, yi, ei ∈ Y .
2. RLP : $\hat{b}_j = \begin{cases} 1, & \text{if } \lambda_j \ge 1 \\ 0, & \text{otherwise} \end{cases}$, where $\lambda_j = \dfrac{P(b_j = 1 \mid \mathbf{Y}_{LP})}{P(b_j = 0 \mid \mathbf{Y}_{LP})}$ for j = 0, 1, . . . , By − 1.
Figure 5.4: Techniques to generate observation vector YLP: (a) replication, (b) estimation,
and (c) spatio-temporal correlation.
We employ the notation LPNx-(By) to denote the LP technique operating on an obser-
vation vector YLP of length N and generating a By-bit error-compensated output. The
character ‘x’ in LPNx-(By) denotes the architectural setup of the MLP -block and can take
the values ‘r’, ‘e’, or ‘c’ corresponding to the three architectural choices: replication, estima-
tion, or correlation, respectively.
The next section describes the LP framework in detail and shows how to generate the
APP ratio λj for each output bit bj.
5.2 The Proposed Technique: Likelihood Processing (LP)
In this section, we present LP in its most general form, illustrate it through an example,
and then describe an efficient architecture.
5.2.1 The LP Algorithm
Consider the computational block M in Fig. 5.1(a), whose correct output yo is represented
with By bits, $y_o = \{b_j\}_{j=1}^{B_y}$, and manifests an output space $Y = \{0, 1, \ldots, 2^{B_y} - 1\}$.
Furthermore, employing one of the observability-enhancing techniques in Fig. 5.4, an observation
vector $\mathbf{Y}_{LP} = \{y_i\}_{i=1}^{N}$ is generated, where yi = yo + ηi + εi. The error PMFs $\{P_{E_i}(e_i)\}_{i=1}^{N}$
are assumed to be known, as in the case of soft NMR [78].
For each output bit bj (j = 1, . . . , By), we need to compute the APP ratio λj defined as
follows:
$$\lambda_j = \frac{P(b_j = 1 \mid \mathbf{Y}_{LP})}{P(b_j = 0 \mid \mathbf{Y}_{LP})} = \frac{P(b_j = 1 \mid \mathbf{Y}_{LP})}{1 - P(b_j = 1 \mid \mathbf{Y}_{LP})} \tag{5.2}$$
In practice, computing the log-domain APP ratio Λj simplifies the algorithm and implementation,
and is given by:
$$\Lambda_j = \log \lambda_j = \log \frac{P(b_j = 1 \mid \mathbf{Y}_{LP})}{P(b_j = 0 \mid \mathbf{Y}_{LP})} \tag{5.3}$$
Thus, our confidence in bj being a ‘1’ is very high if Λj ≫ 0, and vice versa for bj being a
‘0’.
Applying the Bayes rule to the probabilities in (5.2), we obtain:
$$P(b_j = k \mid \mathbf{Y}_{LP}) = \frac{P(\mathbf{Y}_{LP} \mid b_j = k)\, P(b_j = k)}{P(\mathbf{Y}_{LP})} \tag{5.4}$$
where P(YLP) is the probability of observing the vector YLP and k ∈ {0, 1}. Substituting
(5.4) in (5.2), we obtain:
$$\lambda_j = \frac{P(\mathbf{Y}_{LP} \mid b_j = 1)\, P(b_j = 1)}{P(\mathbf{Y}_{LP} \mid b_j = 0)\, P(b_j = 0)} \triangleq \frac{p_{j,1}}{p_{j,0}} \tag{5.5}$$
where we denote $p_{j,k} = P(\mathbf{Y}_{LP} \mid b_j = k)\, P(b_j = k)$ for k = 0 or 1. For each output bit bj,
the probabilities {pj,k}k=0,1 are generated by a bit-level to word-level mapping in the output
space Y as follows:
$$p_{j,k} = \sum_{y_o \in Y_{j,k}} P(\mathbf{Y}_{LP} \mid y_o)\, P(y_o) \tag{5.6}$$
where $Y_{j,k} = \{y_o \in Y \mid b_j = k\}$, i.e., the set of all possible outputs that have the jth bit
bj = k, and P(yo) is the prior probability, i.e., the distribution of the error-free output yo.
We assume that the errors ei are independent in order to reduce the memory complexity
inherent in the storage of joint error PMFs. Error independence can be achieved by employing
techniques such as data diversity, architecture diversity, or scheduling diversity, which
are described in Chapter 6.
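Assuming independent errors, the conditional PMF in (5.6) can be written as:
$$\begin{aligned} P(\mathbf{Y}_{LP} \mid y_o) &= P\big((y_1, y_2, \ldots, y_N) \mid y_o\big) \\ &= \prod_{i=1}^{N} P(y_i \mid y_o) = \prod_{i=1}^{N} P(e_i = y_i - y_o) = \prod_{i=1}^{N} P_{E_i}(e_i) \end{aligned} \tag{5.7}$$
Therefore, substituting (5.7) into (5.6) results in
$$p_{j,k} = \sum_{y_o \in Y_{j,k}} \left[ \prod_{i=1}^{N} P_{E_i}(e_i = y_i - y_o) \right] P(y_o) \tag{5.8}$$
Substituting (5.8) in (5.5) then provides the final expression for λj as follows:
$$\lambda_j = \frac{p_{j,1}}{p_{j,0}} = \frac{\displaystyle\sum_{y_o \in Y_{j,1}} \left[ \prod_{i=1}^{N} P_{E_i}(e_i = y_i - y_o) \right] P(y_o)}{\displaystyle\sum_{y_o \in Y_{j,0}} \left[ \prod_{i=1}^{N} P_{E_i}(e_i = y_i - y_o) \right] P(y_o)} \tag{5.9}$$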
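A direct, brute-force evaluation of (5.9) can be sketched as follows. The function names, the dictionary representation of the error PMFs, the bit indexing (j = 0 denotes the LSB here), and the 3-bit toy data are assumptions of this sketch rather than part of the LP definition; an unquantized, exhaustive search of this kind is only practical for small By.

```python
def app_ratios(observations, error_pmfs, prior, n_bits):
    """Evaluate lambda_j of (5.9) for every output bit.

    observations : sequence of N integers y_i
    error_pmfs   : sequence of N dicts mapping error value -> probability
    prior        : list of length 2**n_bits holding P(y_o)
    Bit j = 0 denotes the LSB in this sketch (an indexing assumption).
    """
    ratios = []
    for j in range(n_bits):
        p = [0.0, 0.0]                      # p[k] accumulates p_{j,k} of (5.8)
        for y_o in range(2 ** n_bits):
            term = prior[y_o]
            for y, pmf in zip(observations, error_pmfs):
                term *= pmf.get(y - y_o, 0.0)   # P_Ei(e_i = y_i - y_o)
            p[(y_o >> j) & 1] += term
        # A zero denominator is reported as infinity (sketch-level convention).
        ratios.append(float("inf") if p[0] == 0.0 else p[1] / p[0])
    return ratios


def slice_bits(ratios):
    """Slicer: b_j = 1 if lambda_j >= 1, else 0; returns the integer output."""
    return sum((1 if lam >= 1 else 0) << j for j, lam in enumerate(ratios))


# Hypothetical 3-bit example with three observations and a shared error PMF.
pmf = {0: 0.6, 4: 0.3, -4: 0.1}
prior = [1 / 8] * 8
lam = app_ratios((6, 6, 2), [pmf] * 3, prior, n_bits=3)
y_hat = slice_bits(lam)
print(lam, y_hat)
```

In this toy case only yo = 2 and yo = 6 have nonzero likelihood, and the ratio for the MSB is 0.036/0.054 ≈ 0.67, so the slicer resolves the disagreement in favor of yo = 2 although two of the three observations equal 6.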
Next, we illustrate the LP algorithm with an example.
5.2.2 Motivational Example
Consider the computation kernel M in Fig. 5.5(a) with a By = 2-bit output y = (b1, b2)
corrupted by a 2-bit error e = (e1, e2) = (bo,1 ⊕ b1, bo,2 ⊕ b2) according to the PMF in
Fig. 5.5(b), where the component probability of error (pre-correction error rate) is pη = P(ŷ ≠
yo) = PE(e ≠ (0, 0)). Assume that we rely on a single output observation y = (b1, b2) = (1, 0),
i.e., N = 1, and YLP = (y1 = (1, 0)). We refer to this form of likelihood processing as LP1r-(2),
i.e., LP with N = 1 and By = 2.
The LG-block computes p1,1 in (5.6) by considering all possible outputs yo ∈ Y1,1 = {yo =
(1, b2) ∈ Y} as follows:
$$\begin{aligned} p_{1,1} &= \sum_{y_o \in Y_{1,1}} P(\mathbf{Y}_{LP} \mid y_o)\, P(y_o) \\ &= P(\mathbf{Y}_{LP} \mid y_o = (1,0))\, P(y_o = (1,0)) + P(\mathbf{Y}_{LP} \mid y_o = (1,1))\, P(y_o = (1,1)) \end{aligned} \tag{5.10}$$
Figure 5.5: An example of LP: (a) 2-bit output erroneous computational block and (b) a
2-bit sample error PMF.
Assuming all outputs yo ∈ Y are equally likely to occur, we have P(yo = (1, 0)) = 0.25 and
P(yo = (1, 1)) = 0.25 in (5.10). Employing the error PMF PE(e) in Fig. 5.5(b), and dropping
the common prior factor of 0.25, which cancels in the ratio λ1, we write (5.10) as
$$\begin{aligned} p_{1,1} &= P(y_1 = (1,0) \mid y_o = (1,0)) + P(y_1 = (1,0) \mid y_o = (1,1)) \\ &= P(e = (0,0)) + P(e = (1,1)) \\ &= (1 - p_\eta) + c \end{aligned} \tag{5.11}$$
Assuming that the pre-correction error rate pη = 0.6, a = 0.7pη, b = 0.3pη, and c = 0, (5.11)
gives p1,1 = 0.4. Similarly from (5.6), we get p1,0 = a + b = 0.6. Thus, the APP ratio of the first
bit is λ1 = 0.67 and Λ1 = −0.4, which is less than 0, and thus the slicer in Fig. 5.3 generates
b̂1 = 0. Similarly, one can show that p2,1 = a + c = 0.7pη = 0.42, p2,0 = (1 − pη) + b = 0.58,
λ2 = 0.72, and Λ2 = −0.33; hence b̂2 = 0. Thus, the LG-block generates the final sliced error-
compensated output ŷ = (0, 0) even though the M-block output is y = (b1, b2) = (1, 0), i.e.,
there is a high probability that b1 is in error, given the knowledge of the error PMF PE(e).
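The LP1r-(2) arithmetic can be checked with a short script. Two points are assumptions of this sketch, inferred from the stated values rather than given explicitly: (i) the error is taken as the modulo-4 difference e = (y − yo) mod 4 with b1 as the MSB, since this reading reproduces p1,1 = (1 − pη) + c, p1,0 = a + b, and p2,1 = a + c; and (ii) the labels a, b, and c are mapped to the errors (0,1), (1,0), and (1,1), respectively. The common prior factor of 0.25 is omitted because it cancels in the ratios.

```python
import math

p_eta = 0.6
a, b, c = 0.7 * p_eta, 0.3 * p_eta, 0.0

# Assumed mapping of error value -> probability (see the lead-in).
pmf = {0: 1 - p_eta, 1: a, 2: b, 3: c}

y = 0b10   # single observation (b1, b2) = (1, 0), with b1 as the MSB


def p_bit(bit_pos, k):
    """p_{j,k}: sum of P_E((y - y_o) mod 4) over all y_o whose bit equals k.
    The uniform prior (0.25) is dropped since it cancels in lambda_j."""
    return sum(pmf[(y - y_o) % 4]
               for y_o in range(4)
               if ((y_o >> bit_pos) & 1) == k)


p11, p10 = p_bit(1, 1), p_bit(1, 0)   # first bit b1 (MSB)
p21, p20 = p_bit(0, 1), p_bit(0, 0)   # second bit b2 (LSB)

lam1, lam2 = p11 / p10, p21 / p20
Lam1, Lam2 = math.log(lam1), math.log(lam2)   # natural log (assumed base)

y_hat = (int(lam1 >= 1), int(lam2 >= 1))
print(p11, p10, p21, p20, y_hat)
```

Under these assumptions the script returns p1,1 = 0.4, p1,0 = 0.6, p2,1 = 0.42, p2,0 = 0.58, and ŷ = (0, 0), in agreement with the values quoted above; both log-domain ratios are negative.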
Next consider LP3r-(2), i.e., we have three identical copies of the M-block in Fig. 5.5(a)
generating the observation vector YLP = {y1, y2, y3} followed by a three-input LG-block
and a slicer. Without any loss of generality, we assume that y1, y2, and y3 are corrupted
by independent identically distributed (iid) errors e1, e2, and e3, respectively. In other
words, the errors e1, e2, and e3 are independent and follow the same error PMF PE(e) in
Fig. 5.5(b), i.e., PE1(e1) = PE2(e2) = PE3(e3) = PE(e). If the observation vector YLP =
(y1 = (1, 0), y2 = (1, 0), y3 = (0, 1)), then TMR selects yˆ = (1, 0) via a majority vote. On
the other hand, a smart voter with the knowledge of error statistics would realize that the
correct output yo ≠ (1, 0) since e3 = y3 − yo = (1, 1) but PE3(e3 = (1, 1)) = c = 0 (see
Fig. 5.5(b)). For example, employing LP3r-(2), the computation of P (YLP |yo = (1, 1)) in
(5.6) is given by:
$$\begin{aligned} P(\mathbf{Y}_{LP} \mid y_o = (1,1)) &= P(e_1 = (0,1),\, e_2 = (0,1),\, e_3 = (1,0)) \\ &= P(e_1 = (0,1))\, P(e_2 = (0,1))\, P(e_3 = (1,0)) \\ &= (a p_\eta)^2 (b p_\eta) = b\, a^2 p_\eta^3 \end{aligned}$$
Similarly, $P(\mathbf{Y}_{LP} \mid y_o = (1,0)) = a p_\eta (1 - p_\eta)^2$. Assuming pη = 0.5, a = 0.7pη, b = 0.3pη,
c = 0, and equal priors, (5.10) results in P(b1 = 1|YLP) = 0, i.e., λ1 = 0, indicating that our
confidence in the decision b̂1 = 0 is very high. Similarly, one can show for the second bit b2 that
$p_{2,1} = (c p_\eta)^2 (1 - p_\eta) = 0.07$, $p_{2,0} = (b p_\eta)^2 (c p_\eta) = 0.019$, λ2 = 3.4, and Λ2 = 1.22, and hence
b̂2 = 1, resulting in the corrected output ŷ = (0, 1). We also observe that the log-domain
APP ratio moved farther away from 0 with N = 3 than that with N = 1. This indicates that
multiple observations increase our confidence in the decision ŷ = (0, 1). The effectiveness
of LP over conventional design can be calculated for this example by injecting errors at
the output of 2-bit computational kernel(s) according to the PMF in Fig. 5.5(b) at various
pre-correction error rates pη. LPNr-(2) employs PE(e) to generate probabilities for b1 and
b2 from YLP followed by a slicer to produce a hard estimate ŷ, while a conventional design
directly uses YLP followed by a majority vote. The system correctness metric, defined as
P(ŷ = yo) = 1 − P(ŷ ≠ yo) = 1 − pe,sys, is employed to compare LP to conventional design.
[Figure 5.6 plot: system correctness (1 − pe,sys) versus pre-correction error rate (pη) for
TMR (majority), LP3r-(2), LP1r-(2), and conventional N = 1.]
Figure 5.6: System correctness of a 2-bit output system at different pη.
From Fig. 5.6, we observe first that LP3r-(2) outperforms TMR for all values of pη. Second,
system correctness for both LP1r-(2) and LP3r-(2) increases with pη for pη ≥ 0.6 and pη ≥
0.7, respectively. This unusual outcome occurs because LP recognizes that the observations
in YLP are unreliable for high values of pη, and thus tends to choose outputs from Y that
do not belong to YLP . Note that when pη ≥ 0.6, system correctness for TMR falls below
even LP1r-(2) and the conventional N = 1 system, because the probability of two or more
identical errors becomes larger, and hence the majority voter selects the wrong values more
often. On the other hand, LP exploits the knowledge of the error distribution in Fig. 5.5(b),
i.e., different error magnitudes have different error probabilities, to correct for errors.
5.2.3 The N -Input Likelihood Generator (LG-Block) Architecture
We use the log domain processing of probabilities to simplify the implementation of the
LG-block in Fig. 5.3. Taking the logarithm (base 2) of (5.8), we obtain:
$$\log p_{j,k} = \log \sum_{y_o \in Y_{j,k}} 2^{\left[ \sum_{i=1}^{N} \log P_{E_i}(e_i = y_i - y_o) \right] + \log P(y_o)} \tag{5.12}$$
We use the log-max approximation [96] to simplify (5.12). The log-max approximation is
given by:
$$\log(2^x + 2^y) = \max(x, y) + C(|x - y|) \approx \max(x, y) \tag{5.13}$$
where C(.) is a correction term. Thus, ignoring the correction term in (5.13), (5.12) can be
written as:
$$\begin{aligned} \log p_{j,k} &\approx \max_{y_o \in Y_{j,k}} \left[ \sum_{i=1}^{N} \log P_{E_i}(e_i = y_i - y_o) + \log P(y_o) \right] \\ &= \max_{y_o \in Y_{j,k}} \left[ \Gamma(y_o) + \log P(y_o) \right] \end{aligned} \tag{5.14}$$
where Γ(yo) is referred to as the word-metric and is defined as:
$$\Gamma(y_o) = \sum_{i=1}^{N} \log P_{E_i}(e_i = y_i - y_o) \tag{5.15}$$
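The size of the term dropped in (5.13) can be examined numerically. For base-2 logarithms, the identity log(2^x + 2^y) = max(x, y) + log(1 + 2^−|x−y|) holds, so the correction term is C(d) = log2(1 + 2^−d), which is at most 1 and decays quickly as the gap d grows. The function names and sample values below are illustrative only.

```python
import math


def log2_sum_exact(x: float, y: float) -> float:
    """Exact log2(2**x + 2**y), written as max plus the correction C(|x - y|)."""
    return max(x, y) + math.log2(1 + 2 ** (-abs(x - y)))


def log2_sum_max(x: float, y: float) -> float:
    """Log-max approximation: the correction term is ignored."""
    return max(x, y)


# Approximation error for a few gaps d = |x - y| (values are illustrative).
errors = {d: log2_sum_exact(-3.0, -3.0 - d) - log2_sum_max(-3.0, -3.0 - d)
          for d in (0, 1, 2, 4, 8)}
print(errors)
```

The approximation is therefore most accurate when one candidate dominates the sum, which is the typical situation when a single yo explains the observations much better than the others.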
Figure 5.7: An LG-processor architecture for LPN -(By) (MU: metric unit and CS2: 2-
operand compare-select unit).
From (5.5) and (5.14), the log-APP ratio Λj for bit bj is computed as follows:
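$$\begin{aligned} \Lambda_j &= \log p_{j,1} - \log p_{j,0} \\ &\approx \max_{y_o \in Y_{j,1}} \left[ \Gamma(y_o) + \log P(y_o) \right] - \max_{y_o \in Y_{j,0}} \left[ \Gamma(y_o) + \log P(y_o) \right] \\ &= \max_{y_o \in Y_{j,1}} \Omega(y_o) - \max_{y_o \in Y_{j,0}} \Omega(y_o) \end{aligned} \tag{5.16}$$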
where Ω(yo) = Γ(yo) + logP (yo), and Γ(yo) is given by (5.15).
Table 5.1: Complexity of an L-parallel LG-processor for LPNx-(By).
Parallelization factor:     $L \le 2^{B_y}$
Latency (clock cycles):     $2^{B_y}/L$
Storage:                    $2(2^{B_y} \times B_p)$ bits
Computational complexity:   $2L \times N + L + B_y$ Add, and $B_y(\log_2 L + 2)$ CS2
Activation factor:          $\alpha_{LP} = 1 - \prod_{i=1}^{N}(1 - p_{\eta,i})$
The LG architecture implementing (5.16) is shown in Fig. 5.7. The look-up tables (LUTs)
in Fig. 5.7 store the output prior (log P(yo)) and the error PMFs (log PE(e)). The LG-block
generates Λj after $2^{B_y}$ clock-cycles. In each clock cycle, a specific yo ∈ Y is fed into the
metric unit (MU). The MU compares yo to each of the N observations yi and generates
Ω(yo). For each bj, there are two recursive compare-select (CS) units to keep track of the
maximum values of Γ(yo) + log P(yo) computed over Yj,1 and Yj,0, respectively, according
to (5.16).
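A behavioral sketch of this serial search is given below: one candidate yo is processed per loop iteration, the metric is accumulated from look-up tables, and two running maxima are kept per bit. It is a software model under stated assumptions, not the hardware of Fig. 5.7: the LUTs are Python dictionaries, errors missing from a LUT receive an assumed floor value, bit j = 0 denotes the LSB, and the toy PMF and observations are hypothetical.

```python
import math

FLOOR = -32.0   # assumed log2 floor for error values absent from the LUT


def lg_max_log(observations, log_error_luts, log_prior, n_bits):
    """Max-log evaluation of (5.16); returns [Lambda_j] with j = 0 the LSB."""
    # best[j][k]: running maximum of Omega(y_o) over y_o with bit j equal to k
    best = [[-math.inf, -math.inf] for _ in range(n_bits)]
    for y_o in range(2 ** n_bits):               # one candidate per iteration
        omega = log_prior[y_o]                   # prior LUT
        for y, lut in zip(observations, log_error_luts):
            omega += lut.get(y - y_o, FLOOR)     # error LUT, word-metric term
        for j in range(n_bits):                  # compare-select per bit
            k = (y_o >> j) & 1
            best[j][k] = max(best[j][k], omega)
    return [best[j][1] - best[j][0] for j in range(n_bits)]


# Hypothetical 3-bit example with three observations sharing one error PMF.
pmf = {0: 0.6, 4: 0.3, -4: 0.1}
lut = {e: math.log2(p) for e, p in pmf.items()}
log_prior = [math.log2(1 / 8)] * 8

llr = lg_max_log((6, 6, 2), [lut] * 3, log_prior, n_bits=3)
y_hat = sum((1 if v >= 0 else 0) << j for j, v in enumerate(llr))
print(llr, y_hat)
```

In this example a single candidate dominates each maximum, so the max-log value for the MSB equals log2(0.036/0.054) and the sliced output is yo = 2.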
Complexity and Power Overhead
The complexity of the LG-block depends only on the output precision (By) and the number
of observations N , and is independent of the complexity of the main MLP -block.
Thus, as the complexity of MLP increases, the LPNx-(By) overhead constitutes a smaller portion
of the total system complexity, resulting in higher energy and robustness benefits. The LG
architecture in Fig. 5.7 needs $2^{B_y}$ clock-cycles to compute Λj. Parallelization by a factor of
$L \le 2^{B_y}$ reduces the number of clock-cycles to $2^{B_y}/L$ but increases hardware complexity.
The complexity estimate of an L-parallel LG-block operating on By bits is shown in Table 5.1.
The storage of the prior and error PMFs (P(yo) and PE) requires $2(2^{B_y} \times B_p)$ bits, where
it is assumed that PE and P(yo) are quantized to Bp bits.
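The latency and storage expressions can be evaluated directly. The helper names below are hypothetical, and the sketch assumes that L divides 2^By (e.g., L is a power of two); By = 8 and Bp = 8 are used as sample values.

```python
def lg_latency_cycles(b_y: int, parallel: int = 1) -> int:
    """Clock cycles to search the output space: 2**By / L.
    Assumes L divides 2**By (an assumption of this sketch)."""
    if parallel < 1 or (2 ** b_y) % parallel != 0:
        raise ValueError("L must divide 2**By")
    return (2 ** b_y) // parallel


def lg_storage_bits(b_y: int, b_p: int) -> int:
    """Storage for the prior and error PMF LUTs: 2 * (2**By * Bp) bits."""
    return 2 * (2 ** b_y) * b_p


serial = lg_latency_cycles(8)                 # L = 1
fully_parallel = lg_latency_cycles(8, 256)    # L = 2**By
storage = lg_storage_bits(8, 8)
print(serial, fully_parallel, storage)
```

For an 8-bit output, the serial LG-block needs 256 cycles, the fully parallel one needs a single cycle, and the two LUTs occupy 4096 bits with 8-bit quantization.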
5.2.4 Low-Complexity LP Architectures
The complexity and power overhead of LP can be reduced significantly by probabilistic
activation and bit-subgrouping as explained next.
Probabilistic Activation
The LG-block in LP can be activated only when there is a large difference between the
observations yi, which indicates the presence of a large error (see the activation block in
Fig. 5.8). Assuming hardware errors are large and independent across the observations yi,
the LG-block activation factor, αLP , is given by:
$$\begin{aligned} \alpha_{LP} &= 1 - P\big(|y_i - y_j| \le T_h \text{ for all } (i, j) \text{ such that } i, j \in \{1, 2, \ldots, N\} \text{ and } i \ne j\big) \\ &\approx 1 - \prod_{i=1}^{N} (1 - p_{\eta,i}) \end{aligned} \tag{5.17}$$
where pη,i is the probability of hardware error in yi.
Figure 5.8: A bit-subgrouped LG-processor architecture for LPNx-(By) (LPNx-
(B1, B2, ..., Bm)) with probabilistic activation module.
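The activation test and the approximation in (5.17) can be sketched as follows. The function names, threshold, and error rates are hypothetical; the enable condition simply negates the event inside (5.17), i.e., the LG-block is enabled when at least one pair of observations differs by more than Th.

```python
from itertools import combinations
from math import prod


def lg_enable(observations, th):
    """Enable the LG-block if any pair of observations differs by more than Th."""
    return any(abs(yi - yj) > th for yi, yj in combinations(observations, 2))


def activation_factor(p_eta):
    """Approximate alpha_LP = 1 - prod_i (1 - p_eta_i), as in (5.17)."""
    return 1.0 - prod(1.0 - p for p in p_eta)


# Hypothetical values: three observations, each in error 10% of the time.
alpha = activation_factor([0.1, 0.1, 0.1])
print(alpha, lg_enable((100, 101, 612), th=16), lg_enable((100, 101, 99), th=16))
```

With three observations that each fail 10% of the time, the LG-block would be active in roughly 27% of the cycles under this approximation, so its power overhead scales with the pre-correction error rate rather than being paid in every cycle.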
Bit-Subgrouping
Since the output search space Y is exponential in the number of output bits By, the complexity
and power of the LG-block can be reduced via bit-subgrouping, as shown in Fig. 5.8. In
bit-subgrouping, the By-bit output is divided into m subgroups with precisions B1, B2, . . . , Bm,
respectively, such that By = B1 + B2 + . . . + Bm. Then, LP is applied independently on each
subgroup of Bi bits. With bit-subgrouping, LPNx-(By) is denoted as LPNx-(B1, B2, . . . , Bm).
Bit-subgrouping significantly reduces the output search space, and thus the LP overhead.
For example, if the By-bit output is uniformly divided into m subgroups, each with Bi = By/m
bits, m LPNx-(By/m)’s are needed instead of a single LPNx-(By), thereby reducing the
storage and computational complexity of the LG-block from $2^{B_y}$ to $m\,2^{B_y/m}$. However,
as m increases, the system-level correctness psys, i.e., the robustness of LP, will be reduced,
since bit-subgrouping ignores error correlations between adjacent subgroups of bits.
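The reduction in search space is easy to tabulate. The helper below (a hypothetical name) sums 2^Bi over the subgroups, which reduces to m·2^(By/m) for a uniform split; the groupings shown are examples for an 8-bit output.

```python
def search_space(subgroups):
    """Number of candidate outputs searched: sum of 2**Bi over all subgroups."""
    return sum(2 ** b for b in subgroups)


# Example groupings of an 8-bit output.
sizes = {
    "(8)": search_space((8,)),
    "(5,3)": search_space((5, 3)),
    "(4,4)": search_space((4, 4)),
    "(1,1,...,1)": search_space((1,) * 8),
}
print(sizes)
```

Splitting the 8-bit output into a 5-bit and a 3-bit subgroup shrinks the search from 256 candidates to 40, and 1-bit subgrouping shrinks it to 16.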
5.3 Simulation Results
We demonstrate benefits of LP in terms of robustness and energy efficiency in the design of
a two-dimensional inverse discrete-cosine transform (2D-IDCT) image codec subject to PVT
errors. The DCT-IDCT transform (see Fig. 5.9(a)) is applied on 256×256 8-bit pixel images,
stored initially in memory (Mem), in blocks of 8×8 pixels using Chen’s algorithm [97]. Each
2D transform employs two 1D transforms: the first is applied column-wise on the input
block, and the second is applied row-wise on the output of the first. Transposition memory
(TM) is used to swap the data between rows and columns. The quantizer (Q) and inverse
quantizer (Q−1) employ the JPEG quantization table for compression. Only the receiver
computational kernels (Q−1 and IDCT blocks) are subject to hardware errors. The error-
free DCT-IDCT codec achieves a peak signal-to-noise ratio (PSNR) of 33 dB, where the
PSNR is defined as
$$\mathrm{PSNR} = 10 \log_{10} \left( \frac{255^2}{E[(y_o - \hat{y})^2]} \right) \tag{5.18}$$
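A minimal sketch of (5.18) for 8-bit pixels is shown below. It replaces the expectation with the sample mean over the supplied pixels, which is an assumption of the sketch; the function name and the flat-list image representation are also illustrative.

```python
import math


def psnr_db(reference, reconstructed, peak=255):
    """PSNR in dB per (5.18), with E[.] estimated by the sample mean.

    reference, reconstructed: equal-length sequences of pixel values.
    """
    if len(reference) != len(reconstructed) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - x) ** 2 for r, x in zip(reference, reconstructed)) / len(reference)
    if mse == 0:
        return math.inf            # identical images
    return 10 * math.log10(peak ** 2 / mse)


# A uniform error of one gray level on every pixel (illustrative input).
value = psnr_db([0, 0, 0, 0], [1, 1, 1, 1])
print(round(value, 2))
```

A mean-squared error of one gray level corresponds to about 48.13 dB, which gives a sense of scale for the 33 dB achieved by the error-free codec with JPEG quantization.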
Figure 5.9: An 8-bit output 2D-DCT/IDCT codec: (a) single codec, (b) replication set-up,
(c) estimation set-up, and (d) spatial correlation set-up.
5.3.1 System and Architecture Set-up
We employ three different setups to generate multiple observations of the 8-bit output pixel
in order to detect and correct hardware errors in the 2D-IDCT block:
1. Replication (see Fig. 5.9(b)): 2D-IDCT kernel is replicated to provide exact estimates
of yo corrupted by hardware errors ηi only. Such a setup is typical of robust general
computing systems. Data, architecture, and scheduling diversity can be employed to
ensure error independence across redundant kernels, as will be discussed in Chapter 6.
Such a setup can be employed in ULP sensing platforms for critical applications where
complexity can be traded for increased robustness.
2. Estimation (see Fig. 5.9(c)): A reduced-precision redundancy (RPR) estimator of the
2D-IDCT is employed in parallel with the main 2D-IDCT. While the main block is
designed to operate on 8-b pixel inputs, the RPR estimator is designed to operate on the
three most-significant bits (MSBs) of the pixel value, and thus it is of lower complexity,
allowing it to be designed hardware error-free at low overhead. The estimator output
y2 is corrupted by estimation error ε2 only.
3. Spatial correlation (see Fig. 5.9(d)): Application-level data correlations are exploited
to generate multiple observations at very low overhead, thereby avoiding approximate
or full replication overhead. In the IDCT output, pixels in adjacent rows have similar
values and have independent errors since the 1D-IDCT is applied row-wise on the
image in the final step of the 2D-IDCT. Thus, pixels in adjacent rows are employed to
generate multiple observations yi. For example, a 4-element observation vector YLP
for the pixel y1 at row and column coordinates (rj, cj) is generated by choosing
pixels y1 = (rj, cj), y2 = (rj − 1, cj), y3 = (rj − 2, cj), and y4 = (rj + 1, cj). Thus,
pixels in the observation vector other than y1 are corrupted by both estimation and
hardware errors, while y1 is corrupted only by hardware error.
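The construction of the observation vector in the spatial-correlation setup can be sketched as below. The row offsets follow the 4-element example above; the handling of image borders (clamping to the nearest valid row) is an assumption of this sketch, as are the function name and the toy image.

```python
def observation_vector(image, r, c, row_offsets=(0, -1, -2, 1)):
    """Collect y1..y4 for the pixel at (r, c) from the same column of
    adjacent rows. Rows outside the image are clamped to the border
    (a border policy assumed for this sketch)."""
    last_row = len(image) - 1
    return [image[min(max(r + dr, 0), last_row)][c] for dr in row_offsets]


# Toy single-column "image" with slowly varying pixel values.
image = [[10], [12], [11], [13]]
interior = observation_vector(image, 2, 0)
border = observation_vector(image, 0, 0)
print(interior, border)
```

No additional IDCT hardware is involved: the extra observations are neighboring output pixels that the codec produces anyway.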
Error-correction mechanisms, such as NMR (Fig. 5.2(a)), ANT (Fig. 5.2(b)), and soft
NMR (Fig. 5.2(d)), and the proposed technique LP (Fig. 5.3) are employed to process the
observation vector and correct for errors in each setup. All error detection and correction
blocks are operated at a critical supply voltage of Vdd−crit = 0.7 V to ensure correct
operation while consuming minimum power. LP employs a $2^{B_y}$-parallel version of the
LG-processor in Fig. 5.7 so that it requires one clock-cycle to generate the output APP ratios.
Different bit-subgroupings of the 8-bit output (see Fig. 5.8) are applied to study the trade-off
between LP complexity overhead and performance. In addition, a probabilistic
activation module for the LG-block (see Fig. 5.8) is employed to decrease LP power overhead.
The error and prior PMFs are quantized to 8 bits before being stored by the LG-block. The
complexity of different blocks in error-compensated 2D-IDCTs is shown in Table 5.2.
Table 5.2: Gate complexity (normalized to NAND2) of building blocks in error-compensated
2D-IDCT architectures.
Block                                  Gate complexity
8-bit 2D-IDCT module                   64.2 k
3-bit RPR estimator                    20.4 k
TMR 2D-IDCT module                     192.5 k
N = 3 majority voter                   0.13 k
ANT compare-select module              0.22 k
LG-processor for LP3x-(8)              50.8 k
LG-processor for LP3x-(5,3)            14.6 k
LG-processor for LP3x-(1,1,...,1)      0.6 k
5.3.2 Error Characterization and Simulation Procedure
Likelihood processing requires statistical characterization of output errors of the computation
engine (DCT-IDCT codec). Accurate modeling of errors increases resiliency at the expense
of increased storage requirement and search space of LP. Generally, hardware errors are a
function of the PVT settings, the input space, and the architecture. In Chapter 6, we will
show that timing error statistics are relatively independent of the input statistics and are
a strong function of the architecture. Hence, a training input data-set IT can be employed to
statistically characterize the error statistics of the M-block. This captures the dependence
of hardware errors on the architecture, indirectly considers the dependence on the input
statistics, and provides the LG-block with a good estimate of the actual output errors. This
training phase can be performed either off-chip or on-chip.
We evaluate the robustness and energy efficiency of the proposed soft-output technique
under timing violations caused by PVT variations. As mentioned earlier, VOS is used to
emulate these timing violations using a 45-nm TI CMOS process. Keeping the frequency
fixed, the supply voltage Vdd is reduced beyond a critical design voltage Vdd−crit so that
intermittent timing errors appear. The simulation methodology involves two steps:
1. Training phase: An error PMF PE(e) is obtained via an RTL simulation of the
M-block employing a training input data-set IT as follows:
• Step 1: Circuit simulations are employed to characterize the worst-case delay vs.
Vdd relationship of the gate library.
• Step 2: At each Vdd < Vdd−crit, a gate-level netlist of the M-block is simulated
using gate delays obtained from Step 1 and employing the data-set IT while the
frequency of operation is fixed to meet Vdd−crit timing constraints only. This step
generates the erroneous output y in (5.1).
• Error PMFs PE(e) are obtained at each Vdd by comparing yo and y as shown in
(5.1).
2. Operational phase: The M-block operates under VOS on an actual data-set IA,
which is different from IT , and exhibits IA-dependent errors. LP employs the IT -
dependent error PMF PE(e) obtained during the training phase for error compensation.
Figure 5.10: VOS errors in 2D-IDCT: (a) pre-correction error rate (component probability
of error) pη, and output error PMFs (PE(e)) at (b) Vdd = 1.1 V and (c) Vdd = 1 V.
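The last step of the training phase, turning pairs of error-free and erroneous outputs into an error PMF, can be sketched as a normalized histogram. The function names and sample data are hypothetical; in the flow above, the erroneous outputs would come from the gate-level simulation at a given Vdd.

```python
from collections import Counter


def estimate_error_pmf(correct_outputs, observed_outputs):
    """Estimate P_E(e) as the normalized histogram of e = y - y_o, per (5.1)."""
    errors = [y - y_o for y_o, y in zip(correct_outputs, observed_outputs)]
    if not errors:
        raise ValueError("no training samples")
    counts = Counter(errors)
    return {e: n / len(errors) for e, n in counts.items()}


def pre_correction_error_rate(pmf):
    """p_eta = P_E(e != 0) = 1 - P_E(0)."""
    return 1.0 - pmf.get(0, 0.0)


# Hypothetical training data: one of four outputs has an MSB-side error.
y_o_train = [5, 5, 7, 9]
y_train = [5, 13, 7, 9]
pmf = estimate_error_pmf(y_o_train, y_train)
p_eta = pre_correction_error_rate(pmf)
print(pmf, p_eta)
```

One such PMF would be stored per operating point, and the operational phase then reuses it on data that differ from the training set.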
Figure 5.10(a) shows the pre-correction error rate pη at the output of 2D-IDCT as Vdd
is reduced from 1.2 V to 0.6 V. Each point on the curve is characterized by an error PMF
PE(e). For example, error PMFs for the 8-bit output 2D-IDCT at 1.1 V and 1.0 V are shown
in Figs. 5.10 (b) and (c), respectively. As voltage is reduced, more spread in error values is
observed because more circuit paths begin to fail to meet the timing constraints.
5.3.3 System Performance and Robustness
The PSNR for DCT-IDCT codec output under replication (Fig. 5.9(b)) is shown in Fig. 5.11
for different pη (probability of output error for a single 2D-IDCT) corresponding to different
Vdd. Figure 5.10(a) is employed to relate Vdd to pη. Figure 5.11(a) shows that LP3r-(8) can
tolerate 70×, 5× and 3× higher pre-correction error rate pη compared to the conventional
(uncompensated) single 2D-IDCT, TMR, and soft TMR, respectively, in order to achieve
a PSNR of 30 dB. Interestingly, as seen in Fig. 5.11(a), LP2r-(8), i.e., LP with dual-MR
(DMR), behaves close to or even better than TMR when pη ≥ 0.05, i.e., LP with a replication
factor of two outperforms TMR. This is unlike conventional DMR, which can only detect
errors but not correct them. The effect of bit-subgrouping on LP3r performance is studied
in Fig. 5.11(b). Bit-subgrouping of LP3r-(8) to LP3r-(5,3), where one sub-LP operates on
the 5-MSBs and the second on the rest of the 3-LSBs, minimally affects the robustness of
LP3r-(8). As bit-subgrouping increases in order to decrease LP overhead, the robustness
benefits of LP over TMR reduce as expected. This is because more error correlations across
adjacent bits are being ignored. However, even with 1-bit sub-grouping (LP3r-(1,1,. . .,1))
LP still outperforms TMR.
The PSNR of the DCT-IDCT codec under estimation setup (Fig. 5.9(c)) is shown in
Fig. 5.12(a) for different pre-correction error rates pη’s. The 3-bit RPR estimator is not
subject to VOS. The estimator alone achieves a PSNR of 22.2 dB. LP2e-(8) processes the
two outputs y1 (main block) and y2 (estimator) similarly to ANT (see Fig. 5.9(c)), and
achieves robustness enhancement of 100× and 5× compared to the uncompensated single
2D-IDCT and ANT, respectively, at a PSNR of 30 dB. The robustness benefits of LP2e-(8) reduce
with bit-subgrouping (LP2e-(5,3)), and LP becomes less efficient than ANT.
The PSNR of the DCT-IDCT codec under the spatial-correlation setup (Fig. 5.9(d)),
where adjacent row pixels are used as estimators, is shown in Fig. 5.12(b). Only LP with
(5,3) bit-subgrouping is shown. Similarly to the case of replication, simulations confirm that
Figure 5.11: System robustness of 2D DCT-IDCT codec under replication, plotted against the
pre-correction error rate (pη): (a) comparing LPNr-(8) to other error-resilient techniques
without bit-subgrouping and (b) LPNr-(8) performance with bit-subgrouping.
Figure 5.12: System robustness (PSNR versus pre-correction error rate pη) of 2D DCT-IDCT
codec using (a) estimation and (b) spatial correlation.
bit-subgrouping under correlation setup from (8) to (5,3) shows negligible loss in perfor-
mance. In Fig. 5.12(b), LP3c-(5,3) uses pixels in adjacent two rows as estimators for the
current pixel and achieves 14× increase in robustness at a PSNR of 30 dB compared to the
conventional system. This level of robustness is similar to that achieved by TMR. However,
the latter has two additional M-blocks (IDCTs) compared to LP3c-(5,3). Note that majority
voting does not apply to the spatial-correlation setup, since the multiple observations
are corrupted by estimation errors even when they are free of hardware errors. Using two (current
and previous-row) instead of three adjacent pixels, LP2c-(5,3) shows worse performance
than LP3c-(5,3), since it has a smaller number of estimators. In fact, it behaves worse than
conventional design when pη < 0.004 because for pη < 0.004, estimation errors dominate
hardware errors and LP2c-(5,3) is not able to determine which of the two outputs is correct.
LP4c-(5,3) performance also degrades compared to LP3c-(5,3) because LP4c-(5,3) employs
pixels that are spatially farther apart than LP3c-(5,3), leading to higher estimation error.
Thus, the performance of LPNc-(5,3) depends upon the relative contribution of estimation
and hardware errors to the PSNR.
The perceptual quality of a sample image under different error-compensation techniques
Figure 5.13: Sample codec output images: (a) original image, (b) error-free IDCT (pη =
0, PSNR = 33 dB), (c) erroneous single IDCT (pη = 0.13, PSNR = 14 dB), (d) majority-
vote TMR (pη = 0.13, PSNR = 19 dB), (e) LP3c-(5,3) (pη = 0.14, PSNR = 24 dB), (f)
ANT (pη = 0.13, PSNR = 26 dB), (g) LP3r-(5,3) (pη = 0.13, PSNR = 29 dB), (h) LP2e-(8)
(pη = 0.13, PSNR = 31 dB).
is shown in Fig. 5.13 where the underlying hardware has the same pre-correction error rate
pη = 0.13. TMR achieves only 5 dB improvement in PSNR over the conventional system,
thereby failing to improve the image quality significantly. LP3r-(5,3) achieves much better
image quality with a PSNR of 28 dB. This result corresponds to a 14 dB improvement in
PSNR over the conventional system. Using the RPR-estimator setup, LP2e-(8) achieves the
best image quality (PSNR of 31 dB) with only a few noticeable errors in the image.
Avoiding any form of redundancy and using only signal correlations, LP3c-(5,3) achieves
relatively much better quality than TMR with 9 dB instead of 5 dB improvement in PSNR
over the conventional system.
Therefore, we see that LP provides a tremendous increase in robustness to circuit errors
at the same level of application metric (output quality) compared to conventional systems.
This translates to significant improvement in the application metric at the same degree
of unreliability in the circuit fabric. Under exact replication and estimation (approximate
replication), LP outperforms existing error-resiliency techniques such as NMR and ANT
in terms of robustness and application metric. It can also provide robustness and quality
levels similar to those of NMR and ANT while avoiding any form of exact or approximate
replication.
5.3.4 Power Savings
HSPICE is used to estimate the power consumption of the gate library at different Vdd’s in
the 45-nm TI CMOS process. The total power for the computational kernel (2D-IDCT) and
error-compensation blocks (majority voter, soft voter, RPR estimator, and G-processors) is
obtained by summing up the individual power of constituent gates under various architectural
setups (replication, estimation, and correlation). Since the LG-processor is activated only
when a large difference is observed among the observations (see Fig. 5.7(b)), the LG-processor
power overhead is scaled by its probabilistic activation factor. Figure 5.14 shows the power
consumption at each PSNR in Fig. 5.11(a) and Figs. 5.12 (a) and (b) for the three different
setups. The conventional (N = 1) architecture is error-free only at PSNR = 33 dB.
In the case of replication (see Fig. 5.14(a)), LP3r-(5,3) achieves 15% power savings com-
pared to TMR for a wide range of PSNR. These power savings are achieved with additional
robustness to hardware error, i.e., higher component probability of error (pre-correction error
rate), at the same PSNR as illustrated in Fig. 5.11(a). We can trade this additional robust-
ness for further power savings. For example, at a PSNR of 28 dB, LP2r-(8) tolerates the
same level of component probability of error as TMR in Fig. 5.11(a) but achieves a power
savings of 35% (see Fig. 5.14(a)). Bit-subgrouping of LP from (8) to (5,3) increases the
power savings at a given PSNR due to the reduced complexity overhead. However, LP3r-(8)
Figure 5.14: LP power savings in a 45-nm TI CMOS process: (a) replication, (b) estimation,
and (c) spatial-correlation setups.
achieves slightly better robustness at the same PSNR compared to LP3r-(5,3) as illustrated
in Fig. 5.11(b).
In the case of the estimation setup (see Fig. 5.14(b)), the LP2e-(8) power savings compared to
the conventional design range between 10% and 27% across PSNR values and are slightly better
than those of the ANT-based design. This is in addition to the 100× improvement in robustness
over the conventional design. Note that bit-subgrouping in the estimation setup results
in additional power overhead at the same PSNR, since its robustness loss in Fig. 5.12(a)
compared to LP2e-(8) is significant compared to the logic complexity savings it provides.
In the spatial-correlation setup (see Fig. 5.14(c)), LP3c-(5,3) achieves 15% power savings
compared to conventional designs. If we want to trade increased robustness provided by LP
for additional power savings, TMR will employ two more IDCT modules to achieve similar
robustness to LP3c-(5,3) at 28 dB of PSNR (see Fig. 5.12(b)). In this case, the power savings
of LP3c-(5,3) will be 71% compared to TMR, since LP3c-(5,3) uses two fewer IDCT modules.
5.4 Summary
We presented a stochastic computing technique referred to as likelihood processing to design
robust systems by exploiting error statistics at bit-level. Techniques from detection and
estimation theory, in particular the maximum a-posteriori (MAP) rule, are employed to
generate reliability information or confidence-level on each output bit enabling the correction
of errors in an optimal probabilistic sense. Simulations in DCT image codec show that LP
improves on existing reliable system design techniques such as TMR, as well as on stochastic
computing techniques such as ANT and soft NMR. Energy savings up to 71% and robustness
benefits up to 100× are illustrated.
CHAPTER 6
CHARACTERIZATION AND ENGINEERING OF
TIMING ERROR STATISTICS FOR STOCHASTIC
COMPUTING PLATFORMS
Chapter 5 demonstrated the benefits of employing error statistics for robust and energy-
efficient DSP kernels in emerging ULP applications. It is clear that the availability of
statistical error models, and developing an understanding of the factors that impact error
statistics are essential in the design of stochastic computing systems. Furthermore, the
availability of error statistics enables robustness analysis of existing techniques, as done
in [77] and [98] for NMR, ANT, and soft NMR. However, error modeling and abstraction
is a hard problem because errors, in particular timing errors, are a function of a number
of parameters such as the input statistics, path delay distribution of the architecture, PVT
corner, and other physical parameters.
This chapter makes a case for developing statistical timing error models of DSP kernels
implemented in nanoscale circuit fabrics. First, it proposes a simple additive error model for
timing errors in arithmetic computations. Second, it analyzes the relationship between error
statistics and parameters such as the input statistics and the architecture based on the pro-
posed error models. Third, it presents a statistical error characterization methodology based
on the proposed error model, thus enabling efficient implementation of emerging stochastic
computing techniques. Key results include the following observations: 1) the output error
statistics is a strong function of the architecture, and a weak function of input statistics,
and 2) the output error statistics depends upon the one’s probability profile of the input
word. These observations enable a one-time off-line statistical error characterization of DSP
kernels similar to the delay and power characterizations done presently for standard cells
and IP cores. The proposed error model is derived and verified for a number of DSP kernels
in a 45-nm TI CMOS process.
The second part of this chapter addresses engineering of error statistics to enhance the
effectiveness of stochastic computing techniques such as soft-NMR and LP. These techniques
benefit from the independence of error events and magnitudes across the redundant
processing elements (PEs). For this purpose, architectural diversity and scheduling diversity
techniques are proposed to engineer the occurrence of spatially independent error events and
magnitudes. Furthermore, the Kullback-Leibler (KL) distance [99] is proposed as an error
independence metric to measure the degree of diversity in such systems. The effectiveness of
the proposed techniques is demonstrated in the design of a 45-nm soft dual-MR (DMR)-based
discrete-cosine transform (DCT) codec. The soft DMR codec achieves a peak signal-to-noise
ratio (PSNR) close to that of a triple-MR (TMR) codec even though the former employs
one fewer PE.
This chapter is organized as follows: Section 6.1 presents the additive error model and
its parameters. Section 6.2 analyzes relationships between the output error statistics and
the input statistics for a DSP kernel, and presents the error characterization methodology.
Section 6.3 verifies the proposed analysis, error model, and characterization methodology for
various 45-nm DSP blocks. Architecture and scheduling diversity techniques are presented
in Section 6.4. Finally, the case study of the DCT image codec is described in Section 6.5.
6.1 Proposed Timing Error Model
We propose that the output of any DSP kernel M’ with latched inputs and outputs (see
Fig. 6.1(a)) that exhibits timing errors can be represented via an additive error model (see
Fig. 6.1(b)):
\[ y[n] = y_o[n] \oplus e[n] = f_1(x[n], y[n-1], A, V_{dd}, V_t, T, P) \tag{6.1} \]
Figure 6.1: A DSP kernel (M’) exhibiting errors: (a) block diagram and (b) proposed additive
error model.
where y[n] is the corresponding output at time-index (or clock-cycle) n, x[n] is the input,
yo[n] is the correct (error-free) output, and e[n] is the error. For example, if only the second
bit in a 3-bit output is in error, then e = (010).
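As a small illustration of the additive model in (6.1), the sketch below treats ⊕ as a bitwise XOR on the output word, which is consistent with the 3-bit example above. The function name `error_word` and the sample values are hypothetical and serve only to make the example concrete.

```python
def error_word(y: int, y_o: int) -> int:
    """Error pattern e such that y = y_o XOR e, following the additive model (6.1)."""
    return y ^ y_o

y_o = 0b101          # assumed error-free 3-bit output
y = 0b111            # observed output: only the second bit is flipped
e = error_word(y, y_o)
print(format(e, "03b"))      # prints 010
print(y_o == (y ^ e))        # prints True: XOR is its own inverse
```

Because XOR is its own inverse, knowing e (or its most likely value) is enough to recover yo from the observed y, which is the property that error-compensation techniques rely on.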
Equation (6.1) indicates that the output y[n] is a complex non-linear function f1(·) of the
processing element (PE) architecture (A), the input (x[n]), the supply voltage (Vdd), the
threshold voltage (Vt), the temperature (T), and other physical effects (P). It is also
a function of the previous output y[n − 1], either because some or all bits of the output
y[n] can retain their value if the clock period is too small, or because the architecture is
recursive. In the remainder of this chapter, we will focus on non-recursive architectures in
order to simplify the exposition and because such architectures can implement a large class
of applications.
In non-recursive PEs, yo[n] is a function of only the present input x[n]. Thus, e[n] in (6.1)
can be expressed as
\[ e[n] = f_2(x[n], y[n-1], A, V_{dd}, V_t, T, P) \tag{6.2} \]
where e[n] embodies all the complex non-linear dependencies on the parameters in the ar-
gument of f2. The dependence of e[n] on y[n − 1] reflects the intrinsic memory effects in
combinational circuits that are operated faster than the critical path delay. As y[n − 1] is
also a function of x[n− 1] and y[n− 2], we can express e[n] as
\[ e[n] = f_3(\mathbf{x}[n], A, V_{dd}, V_t, T, P) \tag{6.3} \]
where $\mathbf{x}[n] = (x[1], x[2], \ldots, x[n])$ is the input history up to clock-cycle n. Function f3() is complex if described in a deterministic
manner. Instead, by recognizing that most emerging applications employ statistical perfor-
mance metrics such as mean-square error (MSE), SNR, and PSNR, and error-aware resilient
techniques such as soft NMR that rely on the statistics of e[n] instead of the exact value of
error e[n], we propose to treat e[n] as a random variable E and characterize its probability
mass function (PMF) denoted by PE(k) = p(e[n] = k). That is, we are interested in
\[ P_E = f_4(P_X, A, V_{dd}, V_t, T, P) \tag{6.4} \]
where PX represents the PMF of input x. Thus, given a fixed PVT corner, the output
error PMF depends on the architectural implementation of the DSP computation and input
statistics.
The output error statistics is a strong function of the architecture A or the kernel im-
plementation. Different architectures have different path delay distributions, and thus will
result in different errors for the same set of input statistics. In the next section, we study
the relationship between the error statistics PE and input statistics PX . Specifically, we
show that given a DSP architecture A, PE is relatively insensitive to a large class of input
distributions PX. Thus, the error PMF for each DSP architecture/kernel can be characterized
via a one-time off-line procedure at each PVT corner, independent of the application,
similar to the power and delay characterization done presently for standard cells. These error PMFs can then
be employed by stochastic computing techniques, such as soft NMR and LP, to correct for
output errors.
6.2 Error Analysis: Impact of Input Statistics
Many DSP applications have a typical input data set or statistics PX,t which can be employed
to characterize the output error of a given architecture. In communication systems, knowing
the channel noise and transmitted symbol constellation, we can characterize the PMF of
the input signals to the receiver kernels. In media processing systems, the PMF of pixel
values in a given image can represent a large class of images, since adjacent pixels in most
images have high data correlations, and thus most image processing kernels have similar
input statistics. In general-workload DSP, the input statistics can be assumed to be
uniform. Therefore, a typical input data set can be employed to generate the error PMF
of a given architecture in a given application. However, this makes error abstraction and
characterization procedure dependent on application.
Given a DSP kernel/architecture A, we wish to answer the questions:
1. If we employ an input PMF PX,t to obtain the output error PMF PE,t, can we find a
class of input PMFs $C_{X,t} = \{P_{X,i}\}_{i=1}^{M}$ such that they all yield error PMFs
similar to PE,t?
2. Can we find a PX,DSP such that the size of the corresponding class, M = |CX,DSP |, is
large and its characteristics are commonly encountered in most DSP applications?
3. What are the characteristic(s)/condition(s) that the PMFs of CX,t share?
If the answer to the first two questions is in the affirmative, then error characterization can
be done once for DSP kernels/architectures employing PX,DSP . We show that this is indeed
the case. To demonstrate this fact, we study the relationship between input statistics and
output error.
As Boolean computation occurs at bit-level, it is expected that the output error statistics
PE will be a stronger function of bit-level input statistics rather than word-level input statis-
tics PX . In what follows, 1) we study the relation between word-level and bit-level input
statistics, 2) analyze the impact of bit-level input statistics on output error statistics, and 3)
generate a representative input PMF PX,DSP for different DSP applications by conditioning
on bit-level statistics of the input PMFs in CX,DSP .
6.2.1 Bit-level vs. Word-level Statistics
Any Bx-bit signal/operand x[n] in a DSP kernel consists of bits denoted by bx,i[n] for i =
1, 2, . . . , Bx. We define the following:
• Bit probability of bx,i: px,i = p(bx,i[n] = 1)
• Bit probability profile (BPP) of an operand x: ΦX = (px,1, px,2, . . . , px,Bx), i.e., the set
of bit probabilities of its constituent bits.
• Probability mass function (PMF) of an operand x: PX = p(x)
• ON(x) and OFF (x): bit locations of x whose values are one and zero, respectively,
i.e., ON(x) = {i|bx,i = 1} and OFF (x) = {i|bx,i = 0}.
Given a PMF of x, we determine its BPP as follows:
\[ p_{x,i} = \sum_{x \in \{x \mid i \in ON(x)\}} P_X(x) \quad \text{for } i = 1, 2, \ldots, B_x \tag{6.5} \]
i.e., the ith bit probability is obtained by summing PX over all x whose ith bit is one. On
the other hand, given a BPP ΦX , a unique PX cannot be obtained unless the correlations
between bits bx,i are explicitly specified. This is summarized in the following property.
Property 1. Given any two operands x1 and x2:
\[ P_{X_1} = P_{X_2} \;\Rightarrow\; \Phi_{X_1} = \Phi_{X_2} \]
whereas the converse does not hold in general.
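The mapping in (6.5) and the one-directional nature of Property 1 can be checked numerically. The following is a minimal sketch, assuming that a PMF is stored as a dictionary from integer word values to probabilities and that bit index 1 is the least significant bit; the function name and the two example PMFs are hypothetical.

```python
def bit_probability_profile(pmf: dict, bits: int) -> list:
    """Return [p_{x,1}, ..., p_{x,B}] as in (6.5); index 1 is taken to be the LSB."""
    return [sum(p for x, p in pmf.items() if (x >> i) & 1) for i in range(bits)]

# Two different 2-bit PMFs (hypothetical values)
pmf_a = {0b00: 0.25, 0b01: 0.25, 0b10: 0.25, 0b11: 0.25}   # uniform, independent bits
pmf_b = {0b00: 0.5, 0b11: 0.5}                             # fully correlated bits
bpp_a = bit_probability_profile(pmf_a, 2)
bpp_b = bit_probability_profile(pmf_b, 2)
print(bpp_a, bpp_b)   # prints [0.5, 0.5] [0.5, 0.5]
```

The two PMFs differ only in the correlation between their bits, yet they map to the same BPP, which is why a BPP alone does not identify a unique PMF.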
In fact, the next property shows that the number of PX that can be mapped to the same
ΦX is very large. Thus, we can define conditions on ΦX instead of PX to enforce similar
output error statistics for a given DSP kernel without excessively restricting the input space.
In the next section, we study the relation between ΦX and PE in order to determine these
conditions.
Property 2. For a fixed precision Bx:
\[ P_X \text{ is symmetric around the mean } \mu_x = \frac{2^{B_x}-1}{2} \;\Leftrightarrow\; \Phi_X = (0.5, 0.5, \ldots, 0.5), \]
i.e., px,i = 0.5 for all i = 1, 2, . . . , Bx.
Property 2 indicates that any PMF of x that is symmetric around $\mu_x = (2^{B_x}-1)/2$ is mapped
to the same BPP where each bit is equally likely to be zero or one. Figures 6.2(a) and
(b) show a set of different 16-bit input distributions and their respective BPPs. Symmetric
distributions (U, G, and iG) with mean $\mu_x = (2^{16}-1)/2$ have the same equally-likely BPPs where
each px,i = 0.5, unlike asymmetric distributions (Asym1 and Asym2).
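The forward implication of Property 2, which is the direction used in the remainder of the analysis, can be checked in a few lines. The sketch below assumes that "symmetric around µx" means PX(x) = PX(2µx − x), and it writes $\bar{x}$ for the bitwise complement of x; this notation is introduced here only for illustration.

```latex
\begin{align*}
2\mu_x - x &= (2^{B_x}-1) - x = \bar{x}
  && \text{(subtracting from the all-ones word flips every bit)} \\
p_{x,i} &= \sum_{x:\, b_{x,i}=1} P_X(x)
         = \sum_{x:\, b_{x,i}=1} P_X(\bar{x})
         = \sum_{x':\, b_{x',i}=0} P_X(x')
         = 1 - p_{x,i} \\
\Rightarrow\quad p_{x,i} &= 0.5 \quad \text{for all } i = 1, 2, \ldots, B_x
\end{align*}
```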
6.2.2 Impact of Bit-Level Input Statistics on Output Error
We aim to define conditions on any two different input PMFs, PX,1 and PX,2, such that the
corresponding output error PMFs, PE,1 and PE,2, are equal. Here, we show that the output
error statistics of a given DSP kernel is more dependent on the input BPP, ΦX , instead
of the word-level input PMF, PX . Thus, condition(s) to ensure similarity of output error
statistics can be placed on ΦX instead of PX .
Any output signal yi of a DSP kernel/architecture with input x can be viewed as a cascade
of Li processing elements (PEs), denoted by $\{PE_k\}_{k=1}^{L_i}$ (see Fig. 6.3). Each PEk has an
output signal(s) zk, intermediate input signal(s) zk−1, and a direct input signal set xk ⊆ x.
For example, in case of a carry-ripple adder, the 1-bit full-adders are the PEk’s, the rippled
carries are the intermediate signals zk−1, and the kth bits of the input operands are the direct
input signal set xk.
Note that this representation can take place at different granularity levels. For example,
each PEk can represent a single or multiple PEs or even a single logic gate. In what
Figure 6.2: Various 16-bit input statistics: (a) word-level distributions and (b) their
corresponding bit probability profiles (BPPs).
follows, we decompose the main DSP kernel into PEk’s in such a way that zk−1 and xk are
independent. For example, if both zk−1 and xk are generated from the same set of signals,
then they are correlated (e.g., $z_{k-1} = a$ and $x_k = \bar{a}$), and in that case we have to enlarge PEk
to make zk−1 an internal signal. With such a decomposition, if we know the logic functions
implemented by all PEj with j ≤ k, then the probability of any zk is completely determined by the
BPP (Φ) of the direct inputs xj with j ≤ k, i.e.,
\[ p(z_k) = f_k\left(\Phi_{x_j \mid j \le k}\right) \tag{6.6} \]
where fk(·) is a polynomial function that depends on the logic functions of PEj, j ≤ k. To see
this, for example, if we assume the first PE, PE1, is a NAND gate with independent 1-bit
input signals, x1 and z0, and output $z_1 = \overline{x_1 \cdot z_0}$, then p(z1 = 0) = p(x1 = 1)p(z0 = 1) =
px1 pz0 and p(z1 = 1) = 1 − px1 pz0. Similarly, one can determine p(z2) if the logic function
of PE2 is known. This process is continued until PEk is processed, so that we can express
p(zk) as a polynomial function of $\Phi_{x_j \mid j \le k}$.
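To make the polynomial dependence in (6.6) concrete, the following sketch propagates signal probabilities through a ripple-carry chain, taking each 1-bit full adder as PEk and its carry-out as zk. It assumes, in line with the decomposition above, that the carry-in and the two direct input bits of each stage are mutually independent (which in turn assumes independent bits within and across the operands). The function names and the example BPPs are hypothetical.

```python
def nand_prob(p_x: float, p_z: float) -> float:
    """p(z1 = 1) for a NAND gate with independent inputs: 1 - p_x * p_z."""
    return 1.0 - p_x * p_z

def carry_prob(p_a: float, p_b: float, p_cin: float) -> float:
    """p(cout = 1) for a full adder, cout = a.b + (a xor b).cin, independent inputs."""
    p_gen = p_a * p_b                          # both operand bits are one
    p_prop = p_a + p_b - 2.0 * p_a * p_b       # exactly one operand bit is one
    return p_gen + p_prop * p_cin              # generate and propagate are disjoint

def carry_chain_probs(bpp_a, bpp_b, p_c0=0.0):
    """Carry-out probability of every stage: a polynomial in the BPP entries seen so far."""
    probs, p_c = [], p_c0
    for p_a, p_b in zip(bpp_a, bpp_b):
        p_c = carry_prob(p_a, p_b, p_c)
        probs.append(p_c)
    return probs

print(nand_prob(0.5, 0.5))                        # prints 0.75
print(carry_chain_probs([0.5] * 4, [0.5] * 4))    # prints [0.25, 0.375, 0.4375, 0.46875]
```

Each stage only multiplies and adds quantities that are themselves polynomials in the earlier bit probabilities, so the probability at stage k stays a polynomial in the BPP entries of stages 1 to k.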
Timing violations occur when the computation for output yi is not allowed to complete.
Assume that each PEk in Fig. 6.3 has the same delay d and the clock period is d(Li − 1),
i.e., at most Li − 1 PEs can compute correctly. A timing error occurs at the output if all
Li PEs’ outputs zk[n] change their values from the previous clock cycle. If we denote the
transition event of a signal zk as tzk, i.e., tzk = 1 if zk[n] ≠ zk[n − 1], then the probability of
output yi being in error, pey,i, is expressed as:
\[
\begin{aligned}
p_{e_{y,i}} &= \sum_{\Phi_X} p\left(t_{z_1}=1, t_{z_2}=1, \ldots, t_{z_{L_i}}=1 \mid \Phi_X\right) p(\Phi_X) \\
            &= \sum_{\Phi_X} \prod_{k=1}^{L_i} \left[ p\left(t_{z_k}=1 \mid \{t_{z_j}=1\}_{j=1}^{k-1}, \Phi_X\right) \right] p(\Phi_X)
\end{aligned}
\tag{6.7}
\]
However, the input signal set for each PEk, denoted by Izk = {zk−1, xk}, shields zk from
signal transitions in preceding PEs, i.e., p(zk|Izk, w) = p(zk|Izk), where w is any signal in PEj
for j = 1, 2, . . . , k−1. For example, in the case of a ripple-carry adder, given the input carry
and the input bits into the kth 1-bit full-adder, the probability of output carry is independent
of signals in the preceding full-adders. Thus,
\[ p\left(t_{z_k}=1 \mid \{t_{z_j}=1\}_{j=1}^{k-1}, \Phi_X\right) = p\left(t_{z_k}=1 \mid t_{z_{k-1}}=1, \Phi_{X_k}\right) \tag{6.8} \]
Figure 6.3: An architectural model of a DSP kernel with input x, output bit by,i, and Li
processing elements (PEs).
Substituting in (6.7), we write
\[ p_{e_{y,i}} = \sum_{\Phi_X} \prod_{k=1}^{L_i} \left[ p\left(t_{z_k}=1 \mid t_{z_{k-1}}=1, \Phi_{X_k}\right) \right] p(\Phi_X) \tag{6.9} \]
In addition, tzk is relatively independent of tzk−1 since zk is determined by xk as well, i.e.,
transitions in zk−1 do not necessarily imply transitions in zk. Thus, (6.9) is expressed as
\[ p_{e_{y,i}} = \sum_{\Phi_X} \prod_{k=1}^{L_i} \left[ p\left(t_{z_k}=1 \mid \Phi_{X_k}\right) \right] p(\Phi_X) \tag{6.10} \]
For ease of notation, we denote (zk−1[n], zk[n]) as z[n] and introduce the operator |= to
denote that all individual components of the two vectors z[n] and z[n − 1] are not equal.
In non-recursive architectures, the signal transitions are independent across time, and the
conditional transition probability p(tzk = 1|ΦXk) in (6.10) at the output of each PEk is
expressed as follows:
\[ p\left(t_{z_k}=1 \mid \Phi_{X_k}\right) = \sum_{z_k[n-1] \neq z_k[n]} p\left(z_k[n-1] \mid \Phi_{X_k}\right) p\left(z_k[n] \mid \Phi_{X_k}\right) \tag{6.11} \]
This means that we treat the logic state of PEk as independent of time and sum over the values
for which zk[n] and zk[n − 1] differ. For example, if zk is 1-bit, then we sum over the
pairs (zk[n], zk[n − 1]) ∈ {(0, 1), (1, 0)}.
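For the 1-bit case just mentioned, (6.11) reduces to a closed form. Writing $q_k$ for $p(z_k = 1 \mid \Phi_{X_k})$ (a symbol introduced here only for illustration) and treating ΦX as fixed, the per-stage transition probability and the resulting product in (6.10) become:

```latex
\begin{align*}
p\left(t_{z_k}=1 \mid \Phi_{X_k}\right) &= (1-q_k)\,q_k + q_k\,(1-q_k) = 2\,q_k\,(1-q_k) \le \tfrac{1}{2} \\
\prod_{k=1}^{L_i} p\left(t_{z_k}=1 \mid \Phi_{X_k}\right) &= \prod_{k=1}^{L_i} 2\,q_k\,(1-q_k) \le 2^{-L_i}
\end{align*}
```

Equality holds when every $q_k = 0.5$, so under this simplified model a chain of 1-bit PEs is most likely to be fully activated when all of its internal signals are equally likely to be zero or one.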
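Since the probabilities are stationary, we treat each p(z[n]|ΦXk) and p(z[n − 1]|ΦXk)
similarly. Substituting (6.6) into (6.11) and then (6.10), we get:
\[ p_{e_{y,i}} = \sum_{\Phi_X} \left[ \prod_{k=1}^{L_i} \sum_{z_k[n-1] \neq z_k[n]} f_{k,n}\left(\Phi_{x_j \mid j \le k}\right) f_{k,n-1}\left(\Phi_{x_j \mid j \le k}\right) \right] p(\Phi_X) \tag{6.12} \]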
If we assume that at most B ≤ Li− 1 PEs compute correctly, then, for an error to appear
at the output, all PEk for Qi = Li −B − 1 ≤ k ≤ Li need to undergo a transition in clock-
cycle n, i.e., the last B PEs need to undergo a transition independent of preceding PEs in
the chain. Otherwise the error cannot be propagated. Conditioning on p(zQi−1) will shield
all PEk>Qi−1 from signal transitions in preceding PEs in the logic chain, and thus (6.9) is
written as:
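\[ p_{e_{y,i}} = \sum_{\Phi_X} p(\Phi_X) \sum_{z_{Q_i-1}} p\left(z_{Q_i-1}\right) \prod_{k=Q_i}^{L_i} \left[ p\left(t_{z_k}=1 \mid t_{z_{k-1}}=1, \Phi_{X_k}, z_{Q_i-1}\right) \right] \tag{6.13} \]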
Following similar procedure from (6.9) to (6.12), pey,i in (6.13) can also be written as a
polynomial function of ΦX .
This shows that knowing ΦX completely determines the probability of output errors in a
given DSP kernel. Thus, we can modulate the probability of output error of a DSP block by
enforcing conditions on the constituent elements of ΦX . Next, we employ this observation
to generalize the proposed error model to be independent of the application, given a DSP
architecture.
6.2.3 Generalized Error Characterization
Given a DSP kernel/architecture A and two input statistics PX,1 and PX,2 that have the same
BPP, i.e., ΦX,1 = ΦX,2 = ΦX, (6.13) shows that the output error PMFs corresponding to
the two input PMFs are equal, i.e., PE,1 = PE,2 = PE. Moreover, Property 2 shows that
for a DSP kernel with input precision Bx, all input PMFs that are symmetric around $(2^{B_x}-1)/2$
have a BPP where all bits are equally likely. We denote this BPP as ΦX,U and define the
corresponding class of PMFs as CX,U . The uniform input distribution U can be used as a
representative input distribution to characterize the DSP kernel for CX,U . Moreover, the
class of input PMFs CX,U can be generalized further to include any input PMF that is
symmetric around any value $\mu_x \in (0 : 2^{B_x}-1)$. We denote this class as CX,DSP and the
uniform input distribution U can still be used as a representative for error characterization
to obtain PE,DSP of CX,DSP. To see this, the mean of $x' = x + \frac{2^{B_x}-1}{2} - \mu_x$ is $\mu_{x'} = \frac{2^{B_x}-1}{2}$,
and thus PX′ ∈ CX,U and the corresponding error PMF of x′ is PE′ = PE,U. Then, the error
PMF of x can be obtained from the error-free DSP kernel functionality fDSP via a simple
translation of PE,U as follows:
\[ P_E = P_{E,U} + f_{DSP}\left(\mu_x - \frac{2^{B_x}-1}{2}\right) \tag{6.14} \]
Therefore, output error characterization for a DSP kernel/architecture at a given PVT
corner can be done once using a uniform input distribution to obtain PE,DSP . The obtained
error PMFs PE,DSP of the DSP kernel/architecture is applicable to any application whose
input statistics is symmetric. A large class of DSP applications can thus use PE,DSP . For
example, any application whose input distribution is uniform, Gaussian, or iGaussian as
shown in Fig. 6.2(a), or is a mixture of any set of symmetric distributions, will have the same
output error PMF PE,DSP. If the input statistics PX,as of a given application is found
to be asymmetric, then the error characterization needs to be redone for the DSP kernel,
taking PX,as into account.
Given a DSP kernel/architecture A, an operating frequency fop, and synthesis libraries at
different PVT corners, the generalized error characterization flow is as follows:
1. Generate a uniformly distributed input data set Dx,U and obtain the corresponding
error-free output yo[n] using an RTL or fixed-point simulation of the DSP kernel.
2. Synthesize the design at a PVT corner to obtain a gate-level netlist of the DSP kernel
that can operate timing error-free at fop.
3. Back-annotate the synthesized gate-level netlist with timing information (standard
delay format (SDF) file) at PVT corners worse than the synthesis PVT corner in step
2, i.e., at supply voltages lower than the synthesis voltage and/or process corners slower
than the synthesis process corner.
4. Generate the erroneous output y[n] at different PVT corners by employing an RTL-
level simulation of the synthesized gate-level netlist in step 2 using the same input
data set Dx,U as step 1 and the SDF files generated in step 3 while fixing the operating
frequency at fop.
5. The error PMF PE is obtained at different PVT corners by comparing yo[n] from step 1 to
y[n] from step 4 according to the relation in (6.1).
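Steps 1, 4, and 5 of the flow above can be mimicked in software to see what the resulting error PMF looks like. The sketch below is only a behavioral toy and not the gate-level, SDF-annotated simulation described in the flow: it assumes a ripple-carry adder in which any carry that has rippled through more than `limit` stages arrives too late and is read as 0. All function names, the `limit` parameter, and the sample count are hypothetical.

```python
import random
from collections import Counter

def rca_exact(a: int, b: int, bits: int) -> int:
    """Error-free adder output y_o (step 1), truncated to `bits` bits."""
    return (a + b) & ((1 << bits) - 1)

def rca_toy_timing(a: int, b: int, bits: int, limit: int) -> int:
    """Toy stand-in for step 4: a carry that has rippled through more than
    `limit` stages is assumed to arrive too late and is read as 0."""
    out, carry, age = 0, 0, 0
    for k in range(bits):
        ak, bk = (a >> k) & 1, (b >> k) & 1
        cin = carry if age <= limit else 0
        out |= (ak ^ bk ^ cin) << k
        if ak & bk:                    # carry generated at this stage
            carry, age = 1, 1
        elif (ak ^ bk) and cin:        # carry propagated through one more stage
            carry, age = 1, age + 1
        else:                          # carry killed or dropped
            carry, age = 0, 0
    return out

def error_pmf(bits: int, limit: int, samples: int = 20000, seed: int = 0) -> dict:
    """Step 5: empirical P_E(e) with e = y XOR y_o, using uniformly distributed inputs."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(samples):
        a, b = rng.getrandbits(bits), rng.getrandbits(bits)
        counts[rca_toy_timing(a, b, bits, limit) ^ rca_exact(a, b, bits)] += 1
    return {e: c / samples for e, c in counts.items()}

pmf = error_pmf(bits=16, limit=6)
n_error_values = sum(1 for e in pmf if e != 0)
print(f"error rate ~ {1.0 - pmf.get(0, 0.0):.3f}, distinct error values: {n_error_values}")
```

Lowering `limit` plays the role of moving to a worse PVT corner at a fixed fop: more carry chains miss the deadline, and the PMF spreads over more error values. Setting `limit` to the word length or more reproduces the error-free case.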
6.3 Simulations and Verifications for Statistical Error
Characterization
To validate the error analysis, modeling, and characterization, we employ voltage overscaling
(VOS) in order to generate timing violations and thereby emulate PVT variations. In VOS,
the supply voltage is reduced below a critical supply voltage Vdd−crit, which is the lowest
voltage at which the system operates error-free, while keeping the frequency of operation
fixed at fop. Thus, intermittent timing errors e[n] will appear at the output. We define
Vdd/Vdd−crit as the voltage overscaling factor KVOS. In what follows, a 45-nm TI CMOS
process is employed and error PMF PE of a given DSP kernel/architecture A is obtained at
each voltage following the characterization flow outlined in the previous section. In certain
cases, when we want to study the effect of different input statistics on output error, we use
the respective input statistics instead of a uniform one to perform error characterization of
the DSP kernel. We focus on adder and multiplier units since these are widely used in DSP
designs and form most of the data path in circuit benchmarks such as ISCAS-85/89.
We employ Kullback-Leibler (KL) distance [99] to quantify the difference between error
PMFs for different input statistics and architectures. Given two PMFs PE1 and PE2 of two
random variables E1 and E2, the KL-distance is:
\[ KL(P_{E_1}, P_{E_2}) = \sum_{e} P_{E_1}(e) \log_2 \frac{P_{E_1}(e)}{P_{E_2}(e)} \tag{6.15} \]
The KL distance measures the dissimilarity between two distributions, with KL(PE1, PE2) = 0 if
and only if PE1 = PE2. Usually, two PMFs are quite similar if the KL distance is below 1.
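A direct implementation of (6.15) is short. The sketch below assumes PMFs stored as dictionaries from error value to probability. The text does not say how error values with zero probability under PE2 are handled, so the optional `floor` argument is an assumption of this sketch; the function name and the example PMFs are likewise hypothetical.

```python
import math

def kl_distance(p: dict, q: dict, floor: float = 0.0) -> float:
    """KL(P, Q) in bits as in (6.15). `floor` optionally replaces zero entries of Q
    (an assumption of this sketch, not something specified in the text)."""
    total = 0.0
    for e, p_e in p.items():
        if p_e == 0.0:
            continue                    # 0 * log(0 / q) is taken as 0
        q_e = q.get(e, 0.0) or floor
        if q_e == 0.0:
            return math.inf             # P puts mass where Q has none
        total += p_e * math.log2(p_e / q_e)
    return total

p1 = {0: 0.5, 4: 0.5}      # hypothetical error PMFs
p2 = {0: 0.25, 4: 0.75}
print(kl_distance(p1, p1))   # prints 0.0
print(kl_distance(p1, p2))   # about 0.2075
```

Note that the measure is not symmetric in its two arguments, so the order in which the two error PMFs are passed matters when comparing against a reference PMF.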
6.3.1 Impact of Architecture
The architectural or implementation choice of a DSP function strongly affects the output
error behavior since different architectures have different logic paths between input and
output. To verify this, we show how error PMF, PE, varies at the output of different
architectures implementing the same DSP functionality with the same input x and error-
free output yo.
We characterize the error PMFs due to VOS using the same uniformly distributed input
data-set for three 16-bit adders employing different architectures (ripple-carry adder (RCA),
Figure 6.4: Error statistics of various architectures: (a) 16-b RCA, (b) 16-b CBA, (c) 16-b
CSA, and (d) DF and TDF 16-tap FIR filter.
carry-bypass adder (CBA), and carry-select adder (CSA)). We do the same for an FIR
filter with direct-form (DF) and transposed direct-form (TDF) implementation. The FIR
filters are 8-bit input 16-tap low-pass filters implemented using Baugh-Wooley multipliers
and RCAs. Figure 6.4 shows PE for the three adder types and for the two FIR filters. The
plots in Figs. 6.4(a), (b), and (c) indicate that the three different adders have clearly distinct
error PMFs at the same Kvos. Similarly in Fig. 6.4(d), the direct and transposed form FIR
filters do have distinct error statistics though they have the same input. These conclusions
support those of [100] which show that different arithmetic unit architectures have different
average error magnitudes. Therefore, error statistics are indeed strongly dependent on the
architecture.
Table 6.1 shows that the KL distance is large among the three adder architectures and
Table 6.1: KL distance between error PMFs in various architectures at different KVOS.

          16-bit adder                          16-tap FIR
KVOS      KL_RC,CB    KL_RC,CS    KL_CB,CS      KL_DF,TDF
0.95      7.3         9.0         0.4           3.2
0.90      18.3        19.3        11.0          7.1
0.82      26          64          92            15
0.73      69          148         190           32
Figure 6.5: Output error statistics of 16-bit RCA at KVOS = 0.73 using: (a) symmetric input
statistics PX’s and (b) uniform input distribution and asymmetric PX’s.
between DF and TDF FIR filters, indicating that the architecture choice indeed strongly
affects error PMF. Note that as voltage is reduced, the KL distance between any two error
PMFs increases since more architecturally-different paths become critical and more distinct
errors appear at the output of a DSP block.
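As a minimal sketch of how the KL distances in Table 6.1 could be estimated from simulation data, the following Python fragment builds empirical error PMFs from error samples e = y − yo and compares them. The definition in (6.15) is not reproduced in this section, so the standard form KL(P, Q) = Σ P log2(P/Q) is assumed here; the probability floor for unseen values, the function names, and the toy samples are likewise assumptions made only for illustration.

```python
import math
from collections import Counter

def empirical_pmf(samples):
    """Estimate a PMF from observed error values e = y - y_o."""
    counts = Counter(samples)
    total = len(samples)
    return {value: n / total for value, n in counts.items()}

def kl_distance(p, q, floor=1e-12):
    """KL(p || q) in bits; q is floored (an assumption) so unseen values stay finite."""
    return sum(pv * math.log2(pv / max(q.get(v, 0.0), floor))
               for v, pv in p.items() if pv > 0.0)

# Toy error samples (hypothetical, not data from the dissertation).
errors_a = [0] * 90 + [256] * 10     # e.g., errors of one architecture
errors_b = [0] * 90 + [-4096] * 10   # e.g., errors of another architecture
p_a, p_b = empirical_pmf(errors_a), empirical_pmf(errors_b)

print(kl_distance(p_a, p_a))  # identical PMFs give 0.0
print(kl_distance(p_a, p_b))  # disjoint error values give a large distance
```

With a finite number of samples, the result is sensitive to the floor applied to values that one architecture never produces, so absolute values from such a sketch are only comparable under the same floor.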
6.3.2 Impact of Input Statistics
To verify the analysis and the relation between word-level (PMF) and bit-level (BPP) error
statistics and output statistics, we use the probability distributions in Fig. 6.2(a) as input
statistics for a 16-bit ripple-carry adder (RCA). Figure 6.5 shows the output error PMF of
a 16-bit RCA when subject to the different input statistics (U, G, iG, Asym1, and Asym2)
at $K_{VOS} = 0.73$. Note that the input PMFs U, G, and iG are symmetric around
$\mu = (2^{16} - 1)/2$ and have the same equally likely BPP in Fig. 6.2(b), unlike the other two
(Asym1 and Asym2). Figure 6.5(a) shows that the output error statistics for the symmetric
distributions are quite similar. On the other hand, the output error PMFs in Fig. 6.5(b)
corresponding to asymmetric input PMFs are very different from those obtained using the uniform
input distribution (U).

Table 6.2: KL distance between error PMFs of 16-bit adders under various input statistics
and error PMF $P_{E_U}$ obtained using a uniform input distribution.

$K_{VOS}$    $KL_{E_U,E_G}$    $KL_{E_U,E_{iG}}$    $KL_{E_U,E_{Asym1}}$    $KL_{E_U,E_{Asym2}}$
16-bit RCA
0.95         0                 0                    0.062                   0.04
0.90         0                 0                    0.15                    0.06
0.82         0.01              0.01                 1.15                    0.20
0.73         0.07              0.07                 8.86                    1.33
0.65         0.30              0.28                 52.0                    8.48
16-bit CBA
0.95         0                 0                    0.08                    0.05
0.90         0                 0                    3.93                    0.06
0.82         0                 0                    24.3                    0.72
0.73         0.02              0.01                 32.6                    1.83
0.65         0.01              0                    142                     14.5
16-bit CSA
0.95         0                 0                    0.07                    0.07
0.90         0                 0                    1.29                    0.53
0.82         0                 0                    40.7                    0.40
0.73         0.01              0                    129                     15.7
0.65         0.1               0.02                 308                     96.5
Table 6.3: KL distance between error PMFs of a 16-tap FIR filter under various input
statistics and error PMF $P_{E_U}$ obtained using a uniform input distribution.

$K_{VOS}$    $KL_{E_U,E_G}$    $KL_{E_U,E_{iG}}$    $KL_{E_U,E_{Asym1}}$    $KL_{E_U,E_{Asym2}}$
Direct-Form FIR
0.95         0.06              0.04                 21.6                    0.05
0.90         0.94              0.15                 63                      3.57
0.82         0.92              0.14                 33                      3.10
0.73         0.03              0.82                 227                     209
Transposed-Form FIR
0.95         0.49              0.13                 70                      0.53
0.90         0.91              0.38                 62                      5.78
0.82         0.31              0.08                 56                      3.41
0.73         0.03              0.89                 203                     163

Table 6.2 shows the KL distance between the error PMFs corresponding to different input
PMFs and the error PMF $P_{E_U}$ obtained using a uniform input distribution in different
16-bit adders. The error PMFs corresponding to symmetric input PMFs, G and iG, have
a very small KL distance with $P_{E_U}$. On the other hand, error PMFs corresponding to
asymmetric input PMFs Asym1 and Asym2 are close to $P_{E_U}$ only at high $K_{VOS}$, where the
voltage of the adder is not reduced enough to produce a large number of output errors. As
voltage is reduced further, the error PMF of asymmetric input distributions starts to have
a very large KL distance compared to $P_{E_U}$. Note that $KL(P_{E_U}, P_{E_{Asym1}})$ is greater than
$KL(P_{E_U}, P_{E_{Asym2}})$, indicating that, when compared to $P_{E_U}$, the error PMF due to Asym1
differs much more than that of Asym2 because the Asym1 PMF is more asymmetric than the
Asym2 PMF (see Fig. 6.2(b)). A similar trend is observed in Table 6.3 for different types of
16-tap FIR filters, where error PMFs of symmetric input distributions are close to $P_{E_U}$ while
those of asymmetric distributions are quite different. These results support the presented
error analysis and modeling procedure and, specifically, the fact that input distributions
with similar input BPPs produce similar output error statistics.
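The link between symmetric input PMFs and an equally likely BPP can be checked numerically. The sketch below assumes that the BPP is the per-bit probability of observing a 1, uses an 8-bit word to keep the example small (the text uses 16 bits), and substitutes hypothetical triangular and ramp-shaped PMFs for the G and Asym distributions of Fig. 6.2(a), which are not specified in this section.

```python
def bit_level_profile(pmf, width):
    """Return [P(bit_i = 1)] for i = 0 (LSB) .. width-1, given a word-level PMF."""
    profile = [0.0] * width
    for word, prob in pmf.items():
        for i in range(width):
            if (word >> i) & 1:
                profile[i] += prob
    return profile

WIDTH = 8          # small width for a fast toy example (assumption)
N = 1 << WIDTH

# Uniform input PMF.
uniform = {x: 1.0 / N for x in range(N)}

# Triangular PMF, symmetric around (2^WIDTH - 1) / 2 (hypothetical stand-in for G).
weights = [min(x, N - 1 - x) + 1 for x in range(N)]
total = sum(weights)
symmetric = {x: w / total for x, w in enumerate(weights)}

# Ramp PMF skewed toward small words (hypothetical stand-in for an asymmetric input).
weights = [N - x for x in range(N)]
total = sum(weights)
asymmetric = {x: w / total for x, w in enumerate(weights)}

for name, pmf in (("uniform", uniform), ("symmetric", symmetric), ("asymmetric", asymmetric)):
    print(name, [round(p, 3) for p in bit_level_profile(pmf, WIDTH)])
```

The symmetric case yields 0.5 for every bit because a PMF with P(x) = P(2^WIDTH − 1 − x) assigns equal probability to each word and its bitwise complement, whereas the skewed PMF pulls the most significant bits away from 0.5.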
6.4 Diversity Techniques for Error Independence
Conventional NMR requires that the error events across the replicated modules be independent
to avoid common-mode failures (CMFs), i.e., correlated error events, so that the
majority voter does not fail catastrophically. Also, a large class of fault-tolerant systems
is based on DMR with re-computation when an error is detected. These systems require
that the replicated modules produce non-identical error values/magnitudes to avoid
undetectable errors. The authors in [77] introduced the D-metric to measure the diversity degree
of such systems, i.e., the probability of producing non-identical errors across two modules
given by:

$$D = \sum_{\{(e_1, e_2)\,:\,e_1 \neq e_2\}} p\left(e_1, e_2 \mid \text{an error occurred}\right) \qquad (6.16)$$

where $e_1$ and $e_2$ are the error values appearing at the output of two modules. Thus, the
higher the D-metric, the more reliable the NMR system.
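The D-metric in (6.16) can be estimated from paired error samples, as in the following minimal sketch. It assumes that "an error occurred" means that at least one of the two modules produced a nonzero error in that cycle; the function name and the toy samples are hypothetical.

```python
def d_metric(e1_samples, e2_samples):
    """Estimate D: fraction of erroneous cycles in which the two error values differ.

    Assumption: a cycle counts as erroneous when at least one module is in error.
    """
    erroneous = [(a, b) for a, b in zip(e1_samples, e2_samples) if a != 0 or b != 0]
    if not erroneous:
        return None  # no error observed, D is undefined
    differing = sum(1 for a, b in erroneous if a != b)
    return differing / len(erroneous)

# Toy paired errors (hypothetical): one identical error pair out of four erroneous cycles.
e1 = [0, 0, 4, 4, 0, 8]
e2 = [0, 0, 4, 0, 2, -8]
print(d_metric(e1, e2))  # 0.75
```

Only the cycle with identical nonzero errors (4, 4) lowers D, which is exactly the case a conventional DMR comparison cannot detect.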
Emerging stochastic computing techniques such as soft-NMR and the proposed LP technique
can operate effectively in the presence of identical error values, but benefit from having
the error magnitudes, and not only the error events, be independent across the modules in
order to reduce the complexity overhead and improve robustness to errors. Thus, diversity
techniques [101], [102] and diversity metrics for conventional NMR, such as the D-metric,
are no longer relevant in emerging robust system-design techniques. In this part, we propose
architectural diversity and scheduling diversity to support these emerging techniques.
We show that architectural and scheduling diversity techniques are simple to apply and yet
highly effective in ensuring that errors are spatially independent. Furthermore, we employ
the KL distance defined in (6.15) to measure the degree of independence between two random
variables $E_1$ and $E_2$ representing the errors at the output of two PEs, and consequently
to measure the effectiveness of the diversity techniques. Finally, as a practical demonstration,
we show how error characterization and engineering in a DCT-based image codec leads to
enhanced robustness and energy efficiency.
6.4.1 Architectural Diversity
The architectural or implementation choice of a DSP function strongly affects the output
error behavior since different architectures have different logic paths between input and
output [95] [100]. This makes architectural diversity an attractive candidate to generate
independent error magnitudes in NMR even though all modules have the same input. We
employ architectural diversity in the design of a dual-modular redundant (DMR) 16-bit adder
and 16-tap FIR filter, and measure the independence of output errors.

Table 6.4: Error independence between RCA, CBA, and CSA, where $V_{dd-crit,RCA} = 1.1$ V,
$V_{dd-crit,CBA} = 0.95$ V, $V_{dd-crit,CSA} = 0.85$ V, and $f = 1.01$ GHz.

$K_{VOS}$    $p_{CMF}$ (%)    D (%)      $KL_{E_1,E_2}$
RCA and CBA
0.95         0.0              100        0.001
0.90         0.0              100        0.004
0.85         0.3              99.999     0.025
RCA and CSA
0.95         0.0              100        0.001
0.90         0.1              100        0.002
0.85         0.3              99.998     0.009
CBA and CSA
0.95         0.0              100        0.001
0.90         1.6              99.987     0.036
0.85         6.7              99.976     0.154

Voltage overscaling (VOS) is employed in the 45-nm TI CMOS process in order to generate timing
violations and thereby emulate PVT variations in computation. In VOS, the supply voltage is
reduced below a critical supply voltage $V_{dd-crit}$, which is the lowest voltage at which the
system operates error-free, while keeping the frequency of operation fixed at $f_{op}$. Thus,
intermittent timing errors $e[n]$ will appear at the output. We define the ratio
$V_{dd}/V_{dd-crit}$ as the voltage overscaling factor $K_{VOS}$. The different architectures
operate at the same clock frequency, but each has its own $V_{dd-crit}$ in order to meet the
timing constraints imposed by the system clock. To observe the output error behavior across the
different architectures simultaneously, gate-level simulations are carried out at each supply
voltage while the same input is fed to all architectures. A total of $10^7$ input vectors,
sampled from a uniform distribution, are used.
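To give a feel for how timing violations turn into output errors, the following behavioral sketch emulates an overscaled ripple-carry adder by letting each carry ripple through only a limited number of bit positions. This is a simplification introduced here for illustration and is not the gate-level simulation flow used in this work: the parameter `max_chain` is a hypothetical stand-in for the reduced timing slack at lower $K_{VOS}$, and no mapping between `max_chain` and an actual supply voltage is implied.

```python
import random

def rca_limited_carry(a, b, width, max_chain):
    """Add a and b, letting each carry ripple through at most max_chain bit positions."""
    result = 0
    for i in range(width):
        # Recompute the carry into bit i from only the max_chain bits below it,
        # assuming a zero carry enters that window (longer chains are cut off).
        carry = 0
        for j in range(max(0, i - max_chain), i):
            aj, bj = (a >> j) & 1, (b >> j) & 1
            carry = (aj & bj) | (carry & (aj ^ bj))
        s = ((a >> i) & 1) ^ ((b >> i) & 1) ^ carry
        result |= s << i
    return result

def error_rate(width, max_chain, trials=2000, seed=0):
    """Fraction of uniformly drawn input pairs whose output differs from the exact sum."""
    rng = random.Random(seed)
    mask = (1 << width) - 1
    errors = 0
    for _ in range(trials):
        a, b = rng.getrandbits(width), rng.getrandbits(width)
        y_o = (a + b) & mask                         # error-free output
        y = rca_limited_carry(a, b, width, max_chain)
        errors += (y != y_o)
    return errors / trials

for chain in (16, 8, 4):
    print(chain, error_rate(16, chain))
```

In this model, errors appear only for input pairs that excite long carry chains, and they land on the higher-order sum bits, which is consistent with the intermittent, large-magnitude character of timing errors described above.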
Three architectural candidates are considered for the 16-bit adder: RCA, CBA, and CSA.
Table 6.4 quantifies the dependence of errors $(E_1, E_2)$ at the outputs of a pair of adders
employing different adder architectures using the KL distance, and shows conventional measures
such as the D-metric and $p_{CMF}$, the probability of common-mode failures, i.e., error events
where a conventional DMR system fails to detect an error. The output error magnitudes
are almost independent for any pair, especially for the RCA-CSA pair, which has the lowest
$KL(P_{E_1,E_2}, P_{E_1}P_{E_2})$, making it the best choice for advanced robust system design
techniques. This conclusion contrasts with that obtained using conventional measures ($p_{CMF}$
and D-metric in Table 6.4), which indicate that the RCA-CBA pair is better for conventional NMR
since it has the smallest probability of CMFs.

Table 6.5: Error independence between DF and TDF FIR filters, where $V_{dd-crit,DF} = 1.1$ V,
$V_{dd-crit,TDF} = 1$ V, and $f = 588$ MHz.

DF FIR and Transposed DF FIR
$K_{VOS}$    $p_{CMF}$ (%)    D (%)      $KL_{E_1,E_2}$
0.95         1.1              99.628     0.007
0.90         16.2             97.952     0.029
For the 16-tap FIR filter, two architectures are considered: direct-form and transposed
direct-form. The FIR filters are 8-bit input 16-tap low-pass filters implemented using Baugh-
Wooley multipliers and RCAs. Table 6.5 shows that errors are indeed independent between
the two architectures. Therefore, architectural diversity is an effective way to make error
magnitudes independent in advanced error-resilient designs.
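The independence measure used in Tables 6.4 and 6.5, the KL distance between the joint error PMF and the product of its marginals, can be estimated from paired error samples as sketched here. The base-2 logarithm, the function name, and the toy samples are assumptions for illustration; under this standard definition the measure is zero exactly when the empirical errors are independent.

```python
import math
from collections import Counter

def independence_kl(e1_samples, e2_samples):
    """Estimate KL(P_{E1,E2} || P_{E1} P_{E2}) in bits from paired error samples."""
    n = len(e1_samples)
    joint = Counter(zip(e1_samples, e2_samples))
    marginal_1 = Counter(e1_samples)
    marginal_2 = Counter(e2_samples)
    total = 0.0
    for (a, b), count in joint.items():
        p_joint = count / n
        p_product = (marginal_1[a] / n) * (marginal_2[b] / n)
        total += p_joint * math.log2(p_joint / p_product)
    return total

# Toy paired errors (hypothetical).
independent_pair = ([0, 0, 1, 1], [0, 1, 0, 1])   # every combination equally likely
identical_pair = ([0, 0, 1, 1], [0, 0, 1, 1])     # second module repeats the first

print(independence_kl(*independent_pair))  # 0.0
print(independence_kl(*identical_pair))    # 1.0
```

No probability floor is needed in this case because every value pair with nonzero joint probability also has nonzero marginal probabilities.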
6.4.2 Scheduling Diversity
Besides architectural diversity, we propose another general and simple diversity technique,
scheduling diversity, to make errors independent across redundant outputs. To reduce hard-
ware complexity and increase power efficiency, folding is a well-known technique which ex-
ecutes similar operations on the same hardware. In scheduling diversity, we reorder the
sequence of operations to be executed on multiplexed modules. A 15-tap FIR filter is em-
ployed as an example to demonstrate scheduling diversity, whose output at time n is given
by
$$y[n] = \sum_{k=0}^{14} h[k]\,x[n-k] \qquad (6.17)$$

where $h[k]$ is the filter coefficient, and $x[k]$ is the input. Different schedules can be
employed to map (6.17) onto a single multiply-accumulate (MAC) unit. We employ three possible
schedules, and measure the independence of errors in Table 6.6. We observe that output
errors for the three schedules are pairwise independent even if $p_{CMF}$ is high.

Table 6.6: Error independence with scheduling diversity, where $V_{dd-crit} = 1.1$ V and
$f = 714$ MHz.

$K_{VOS}$    $p_{CMF}$ (%)    D (%)      $KL_{E_1,E_2}$
Schedule 1 and 2
0.95         7.0              96.425     0.027
0.90         26.6             96.574     0.119
Schedule 1 and 3
0.95         8.9              95.548     0.051
0.90         29.5             96.624     0.137
Schedule 2 and 3
0.95         8.9              95.114     0.011
0.90         26.1             97.550     0.091
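Scheduling diversity for the folded filter in (6.17) can be illustrated with a short sketch in which one MAC unit visits the 15 taps in different orders. The three schedules used in this work are not listed in this section, so the orderings below, the coefficients, and the input samples are hypothetical.

```python
def fir_output(h, x, n, schedule):
    """Compute y[n] = sum_k h[k] x[n-k] on one MAC unit, visiting taps in `schedule` order."""
    acc = 0
    for k in schedule:
        if 0 <= n - k < len(x):
            acc += h[k] * x[n - k]  # one multiply-accumulate per cycle
    return acc

h = list(range(1, 16))  # 15 hypothetical integer coefficients h[0..14]
x = [3, -1, 4, 1, -5, 9, 2, -6, 5, 3, -5, 8, 9, -7, 9, 3, 2]  # hypothetical input

schedule_1 = list(range(15))                                 # k = 0, 1, ..., 14
schedule_2 = list(reversed(range(15)))                       # k = 14, 13, ..., 0
schedule_3 = list(range(0, 15, 2)) + list(range(1, 15, 2))   # even taps, then odd taps

n = 16
outputs = [fir_output(h, x, n, s) for s in (schedule_1, schedule_2, schedule_3)]
print(outputs)  # three identical error-free values
```

All schedules give the same error-free output, but the sequence of operands and partial sums seen by the MAC unit differs from schedule to schedule, which is why different timing errors can be expected under VOS.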
6.5 Case Study: Discrete-Cosine Transform (DCT) Codec Design
We demonstrate the use of error statistics and diversity techniques in the design of a robust
DMR-based DCT codec shown in Fig. 6.6(a). The DCT and inverse-DCT (IDCT) transforms
are applied to a 256×256 8-bit pixel image in blocks of 8×8 pixels using Chen’s algorithm [97].
Each two-dimensional (2D) transform is implemented by applying a 1D transform row-wise,
and then column-wise on the output of the first. Transposition memory (TM) is used to
swap the data between rows and columns. The quantizer (Q) and inverse quantizer ($Q^{-1}$)
employ the JPEG quantization table. The error-free codec achieves a peak signal-to-noise
ratio (PSNR) of 33 dB.
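The row-column structure of the 2D transform and the PSNR metric can be sketched in a few lines of Python. This illustration uses the direct orthonormal DCT-II formula rather than Chen's fast algorithm, and a uniform quantization step in place of the JPEG quantization table; both substitutions, along with the test block and all function names, are assumptions made to keep the example short.

```python
import math

N = 8  # block size

def _scale(k):
    return math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)

def dct_1d(v):
    """Orthonormal 1D DCT-II (direct formula, not Chen's fast algorithm)."""
    return [_scale(k) * sum(v[n] * math.cos(math.pi * (2 * n + 1) * k / (2 * N))
                            for n in range(N)) for k in range(N)]

def idct_1d(c):
    """Inverse of dct_1d."""
    return [sum(_scale(k) * c[k] * math.cos(math.pi * (2 * n + 1) * k / (2 * N))
                for k in range(N)) for n in range(N)]

def transpose(block):
    return [list(col) for col in zip(*block)]

def transform_2d(block, transform_1d):
    """Apply the 1D transform row-wise, transpose, apply it again, and transpose back."""
    rows = [transform_1d(r) for r in block]
    cols = [transform_1d(c) for c in transpose(rows)]
    return transpose(cols)

def psnr(ref, test, peak=255.0):
    mse = sum((a - b) ** 2 for ra, rb in zip(ref, test) for a, b in zip(ra, rb)) / (N * N)
    return float("inf") if mse == 0 else 10.0 * math.log10(peak ** 2 / mse)

block = [[(13 * r + 7 * c) % 256 for c in range(N)] for r in range(N)]  # hypothetical pixels
coeffs = transform_2d(block, dct_1d)
restored = [[round(v) for v in row] for row in transform_2d(coeffs, idct_1d)]

STEP = 16  # uniform quantizer step (assumption; the codec uses the JPEG table)
quantized = [[STEP * round(c / STEP) for c in row] for row in coeffs]
decoded = transform_2d(quantized, idct_1d)
print(restored == block, psnr(block, decoded))
```

The transposition step plays the role of the transposition memory (TM) in the codec: it lets the same row-wise 1D transform be reused for the column pass.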
Figure 6.6: Block diagram of the 2D DCT-IDCT codec.

In DMR, two codecs are employed so that the two outputs $y_1$ and $y_2$ are available to the
voter. To ensure error independence between the two codecs, scheduling diversity is applied
by swapping the inputs of the array-based multipliers in the redundant codec. We obtain
the VOS error statistics $P_{E_1}$ and $P_{E_2}$ at the output of the two codecs, respectively.
Table 6.7 shows that errors at the output of the two codecs are indeed independent. Note that, as
VOS increases (smaller $K_{VOS}$), the error dependence measure (KL distance) increases since
the number of erroneous paths for the two codecs increases, especially at lower $V_{dd}$. However,
the KL distance is still small even at low $V_{dd}$ (the maximum value of the KL distance is 8
for the 8-bit output codec), and thus the errors are still independent.

Table 6.7: Error independence of two voltage-overscaled DCT codecs using different scheduling.

$K_{VOS}$    $p_{CMF}$ (%)    D (%)      $KL_{E_1,E_2}$
0.96         0                100        0.000
0.92         0.03             100        0.003
0.88         0.05             99.991     0.005
0.83         0.44             99.966     0.040
0.79         4.29             99.835     0.257
0.75         15.09            99.515     0.639
0.71         33.87            98.764     1.300
0.66         58.76            97.531     1.950

Figure 6.7: Performance of soft DMR-based codec under VOS.
Conventional DMR can only detect errors and relies on re-computation to correct
them, while soft DMR [78] utilizes error statistics to detect and correct errors. Using a
look-up table that stores the pre-characterized error statistics $P_{E_1}$ and $P_{E_2}$ of the
two codecs, the soft voter employs the maximum likelihood (ML) rule at the outputs of the two
voltage-overscaled RTL codecs to select the output with the higher probability of occurrence
(see [78] for a low-overhead implementation of the soft voter). Figure 6.7 shows the robustness
of soft DMR at different $K_{VOS}$. Soft DMR performs close to conventional TMR though it uses
one less codec. Also, the robustness of soft DMR is barely affected in Fig. 6.7 when using
a single error distribution for the two redundant codecs, i.e., assuming $P_{E_1} = P_{E_2}$,
since they have similar architectures but different scheduling. Note that as $K_{VOS}$ decreases
below 0.8, soft DMR starts to perform even better than conventional TMR since $p_{CMF}$ is high
(4% to 60% in Table 6.7) and TMR ignores error statistics.
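One plausible reading of the ML rule in the soft voter is sketched below: each candidate output implies an error pattern for the two codecs, and the candidate whose implied pattern is more likely under the stored PMFs is selected. The implementation in [78] may differ (for instance, in the candidate set or in the use of priors); the function name, the probability floor for unseen error values, and the toy PMFs are hypothetical.

```python
def soft_dmr_vote(y1, y2, pmf_e1, pmf_e2, floor=1e-9):
    """Choose y1 or y2 by comparing the likelihood of the error pattern each one implies."""
    if y1 == y2:
        return y1
    # Hypothesis "y1 is correct": e1 = 0 and e2 = y2 - y1.
    like_y1 = pmf_e1.get(0, floor) * pmf_e2.get(y2 - y1, floor)
    # Hypothesis "y2 is correct": e2 = 0 and e1 = y1 - y2.
    like_y2 = pmf_e1.get(y1 - y2, floor) * pmf_e2.get(0, floor)
    return y1 if like_y1 >= like_y2 else y2

# Toy pre-characterized error PMFs (hypothetical look-up tables).
pmf_e1 = {0: 0.9, 64: 0.1}     # codec 1 occasionally errs by +64
pmf_e2 = {0: 0.8, -128: 0.2}   # codec 2 occasionally errs by -128

print(soft_dmr_vote(164, 100, pmf_e1, pmf_e2))  # 100: a +64 error in codec 1 is likely
print(soft_dmr_vote(100, -28, pmf_e1, pmf_e2))  # 100: a -128 error in codec 2 is likely
```

Unlike a conventional DMR comparator, which can only flag the mismatch, this rule resolves it whenever the observed difference matches a known error value of one codec better than the other.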
6.6 Summary
We proposed a statistical additive error model that captures the statistical distribution of
timing errors in arithmetic units and DSP blocks at architecture and system level. We showed
that this model is relatively independent of input statistics for a wide class of applications.
Moreover, we presented techniques to ensure error magnitude/value independence across
redundant observations (spatial independence) in emerging error-resilient techniques.
CHAPTER 7
CONCLUSIONS
The paradigm shift toward a ubiquitous computing world is characterized by a profusion of
embedded ULP platforms where energy and size are of utmost concern for seamless integration
and long battery life. This dissertation exploits the statistical nature of the ULP application
performance metrics and the dynamic nature of ULP workload characteristics, and matches
them to the statistical attributes and the device region of operation of the underlying
circuit/device fabric. This is done while taking into consideration the energy-delivery overhead,
resulting in systems that operate closer to the limits of the energy-performance envelope.
7.1 Dissertation Contributions
Research on MEOP has primarily been a circuit-level inquiry and is characterized by sig-
nificant loss in design margins since PVT variations at the MEOP are significant and error
resiliency at the MEOP is yet unexplored. Stochastic computing and other error compen-
sation techniques have been applied in the high-throughput superthreshold regime. This
dissertation studies the application of stochastic computing at the MEOP. It shows that
stochastic computing achieves 28% to 54% energy savings beyond what is achievable by con-
ventional MEOP designs. It demonstrates acceptable application-level performance metrics
in the presence of 70% to 85% pre-correction error rate, which represents a 700× to 850×
increase in error handling capability as compared to existing error-compensation techniques.
These conclusions are further verified by designing a subthreshold stochastic computing-
based ECG processor IC in a 45-nm CMOS process. The prototype IC delivers acceptable
beat-detection rates while operating at 15% below its critical supply voltage in the presence
of 58% error rate. These results represent an improvement of 19× in beat-detection accu-
racy, 600× in pη, and 28% in energy over conventional (error-free) MEOP systems. The
prototype IC consumes 14.5 fJ/cycle/1k-gate and exhibits 4.7× better energy efficiency than
the state-of-the-art while tolerating 16× more voltage variations.
Stochastic computing allows ULP next-generation applications to continue reaping the
energy and size benefits of Moore’s law despite the increasingly statistical device behavior.
However, low supply voltage operation due to technology scaling and subthreshold operation
reduces the efficiency of the energy-delivery subsystem. This dissertation enables the design of
energy-efficient integrated ULP platforms/systems by jointly optimizing over the core and
DC-DC converter design spaces. Significant energy savings (45.5%) are demonstrated by
operating at the MEOP of the system (S-MEOP), as compared to the current practice of
operating at the core MEOP (C-MEOP). Architectural techniques are proposed to mitigate
the energy-delivery overhead at S-MEOP resulting in a 2.3× improvement in system energy
efficiency and S-MEOP approaching C-MEOP to within 5%.
This dissertation proposes a novel stochastic-computing technique, which generates reli-
ability information or confidence level on each output bit, and is referred to as likelihood
processing (LP). The robustness and energy benefits of LP are demonstrated in the design
of a 45-nm 2D-DCT codec, which can be employed as a hardware accelerator in a ULP plat-
form. Results show 5× to 100× improvement in robustness along with 15% to 71% energy
savings compared to conventional (error-free) and existing stochastic computing techniques.
It is clear that the availability of statistical hardware error models, and developing an
understanding of the factors that impact these models, are essential in the investigation and
development of robust stochastic computing design principles. This dissertation proposes
a unified framework for the stochastic computing paradigm, develops a statistical error model
for robust DSP-heavy computations, and shows that the proposed model is effective in
abstracting the hardware-error behavior at system level. The different factors affecting error
Figure 7.1: The power wall in CPU design [103]. (Plot of power density [W/cm$^2$] versus
critical dimension (CD) [µm]; labeled data points: Pentium 4, Core 2 Duo, and Multicore.)
statistics under the proposed model are studied, and a one-time off-line error characterization
methodology is proposed, which is similar to the power and delay characterization done today
for standard cells and IP cores. Finally, design diversity techniques are proposed to engineer
favorable spatially-independent error statistics.
7.2 The Broader Impact: Beyond ULP Platforms
This dissertation has addressed energy efficiency and robustness in ULP platforms, en-
abling them to operate dramatically closer to the limits of the achievable robustness-energy-
performance envelope. This dissertation is distinguished by its integration of principles
from power electronics, statistical signal processing, estimation and detection, VLSI archi-
tectures, and IC design. The stochastic design philosophy and principles introduced in
Figure 7.2: The 45-nm 8-core Intel Enterprise Xeon processor: (a) block diagram with
on-chip power management and (b) die photo with multiple clock domains and thermal
sensors [104]. (Figure labels: PCU, PLL circuits, 10 thermal sensors.)
this dissertation and their demonstrated robustness and energy benefits can be extended to
next-generation high-throughput processors and platforms.
In the last three decades, performance (speed) has been the main central processing
unit (CPU) design metric, with technology scaling as its workhorse. Device scaling has
led to faster circuits, smaller silicon area, and reduced supply voltages. Thus, in spite of the
increase in CPU speed and functional complexity, power density (power per unit area)
remained almost constant for the same chip area in old process technologies. However, in
sub-micrometer scale processes, the reduction in the supply voltage has been limited by the
threshold voltage, leading to an increase in power density. For example, a 3 GHz Pentium 4
CPU has a power density close to that of a nuclear reactor (see Fig. 7.1). This problem was
aggravated further by microarchitectural design techniques such as instruction-level parallelism,
out-of-order execution, and multi-threading, which focused solely on throughput enhancement.
Hitting the power wall (the chip’s overall power budget due to cooling constraints)
in single-core designs has forced industry to shift recently towards heterogeneous multi- and
Figure 7.3: The 45-nm Intel Core i7 (quad core) processor: (a) die photo and (b) on-chip
power management to enable DVS and power gating [105]. (Figure labels: 11 PLL circuits,
5 thermal sensors.)
many-core architectures (see Fig. 7.1) where explicit application- and thread-level parallelism
is exploited in a power efficient manner. In fact, modern CPUs are being designed as com-
plex SoCs with massive interfaces, multiple diverse and specialized functional units, which
are power gated and dynamically voltage scaled depending on workload characteristics (see
Figs. 7.2 and 7.3) [104–107]. Modern processors are converging toward the SoC-based archi-
tecture, similar to those found in ULP platforms. Thus, system-design principles proposed
in this dissertation can be similarly applied to modern processors to aid them in overcoming
the power wall and coping with the increasingly unreliable device/circuit fabric. For example,
a stochastic-based approach to modern processor design will enable VOS in modern
processors to considerably save power beyond conventional techniques while maintaining the
application performance metric and the processor speed requirements.
7.3 Future Work
Stochastic computing for ULP platforms introduces a paradigm shift in systems and hard-
ware design with tremendous increase in robustness and energy savings. Thus, extending the
applicability and benefits of stochastic computing across all levels of modern system design
in silicon and post-silicon technologies will have significant impact on future systems and
platforms. This opens up a number of interesting problems to further explore. Although
these problems are interlinked, they can be characterized across five domains: 1) application
and software, 2) systems, 3) architecture and CAD support, 4) circuits and devices, and
5) theoretical foundations. A synergetic research effort is needed, which requires seeking
collaboration with researchers in various research areas such as nanotechnology, computer
architecture, communications, digital signal processing, machine learning, and computer
vision.
7.3.1 Applications and Software Domain
It is fortuitous that next-generation applications depend heavily on sensing, surveillance, and
media-rich immersive computing [1]. The recognition and mining aspects of such applica-
tions make them heavily dependent on advanced signal processing and classification (ma-
chine learning) mechanisms, such as support vector machines and Bayesian classifiers, which
achieve high probabilities of abnormal event detection. The increased power and complex-
ity of these error-tolerant kernels provide a good opportunity for applications of stochastic
computing techniques.
Exposing the architectural-level hooks to the software would provide additional benefits
over the current practice where stochastic computing has been applied in hardware. For ex-
ample, Chapter 5 showed how soft information can be generated by LP. This information can
be exploited at the software level to enable optimal task allocation and power management
policies. Furthermore, Chapter 4 investigated core error-resiliency assuming a worst-case
scenario of voltage droop/ripple. Incorporating task scheduling and software profiling tech-
niques will improve the implementation strategy of a jointly designed stochastic core and
DC-DC converter. Indeed, [108] shows that task scheduling can be employed in multicore
platforms to mitigate the effect of voltage droop and achieve better energy efficiency.
7.3.2 Systems
Advanced processing of hardware error statistics increases the energy and robustness benefits
of stochastic computing. To this end, advanced statistical techniques from machine learning
and communications, such as belief propagation and approximately decodable codes [109],
present good candidates to exploit error statistics more efficiently. An interesting problem
would be to develop iterative/turbo versions of LP in Chapter 5 where different likelihood
processors exchange their soft information to achieve increased energy efficiency and robust-
ness.
7.3.3 Architecture and CAD Support
Stochastic computing has been applied to DSP-heavy applications where statistical perfor-
mance metrics are employed. Architecture-level research is required to overcome the ap-
plication specificity of stochastic computing and extend its deployment to general-purpose
architectures. Architectural techniques are further necessary to engineer associated error
statistics into forms that are amenable to stochastic computing techniques, e.g., see Chap-
ter 6 where diversity techniques are introduced to achieve spatial architecture-level error
independence.
Composability of the error model is a desired property for statistical error characterization.
Composable error models enable us to build architectural error models from those of their
constituents. Statistical error characterization and engineering of architecture and circuit
macros need to be done along with power and delay modeling, in order to enable the design
of robust energy-efficient systems.
7.3.4 Circuits and Devices
Circuit-level techniques need to be investigated in order to support architecture techniques
in generating favorable error statistics as well as voltage-ripple-tolerant designs. A variety
of research has also emerged on post-silicon technologies such as carbon nanotubes, phase
change memory, and spintronics [110]. These technologies are inherently probabilistic in
nature. Thus, the emerging radical device fabrics provide a natural and fertile ground for
the application of stochastic computing.
7.3.5 Theoretical Foundations
Last but not least, theoretical foundations for stochastic design principles need to be es-
tablished. Such an effort requires a synergy between the seminal works of Shannon on
information transfer [111] and Von Neumann on designing reliable systems from unreliable
components [43].
REFERENCES
[1] J. M. Rabaey, D. Burke, K. Lutz, and J. Wawrzynek, “Workloads of the future,” IEEE
Design & Test of Computers, vol. 25, no. 4, pp. 358–365, 2008.
[2] M. A. Hanson, H. C. Powell Jr., A. T. Barth, K. Ringgenberg, B. H. Calhoun, J. H.
Aylor, and J. Lach, “Body area sensor networks: Challenges and opportunities,” IEEE
Computer, vol. 42, no. 1, pp. 58–65, Jan. 2009.
[3] L. Sun, “The future according to Freescale,” in Freescale Technology Forum, June 2008.
[4] V. Raghunathan, C. Schurgers, S. P. S. Park, and M. B. Srivastava, “Energy-aware
wireless microsensor networks,” IEEE Signal Processing Magazine, vol. 19, no. 2, pp.
40–50, 2002.
[5] T. Karnik, S. Borkar, and V. De, “Sub-90 nm technologies: Challenges and opportu-
nities for CAD,” in Proceedings of the 2002 IEEE/ACM International Conference on
Computer-Aided Design, ser. ICCAD ’02, 2002, pp. 203–206.
[6] “Intl. technology roadmap for semiconductors,” ITRS, Tech. Rep., 2008.
[7] S. Hanson, B. Zhai, D. Blaauw, D. Sylvester, A. Bryant, and X. Wang, “Energy opti-
mality and variability in subthreshold design,” in Proceedings of the 2006 International
Symposium on Low Power Electronics and Design ISLPED 06, 2006, pp. 363–365.
[8] N. Sturcken, M. Petracca, S. Warren, L. P. Carloni, A. V. Peterchev, and K. L. Shep-
ard, “An integrated four-phase buck converter delivering 1A/mm2 with 700-ps con-
troller delay and network-on-chip load in 45-nm SOI,” in IEEE Custom Integrated
Circuits Conference, September 2011, pp. 1–4.
[9] T. Šimunić, L. Benini, and G. D. Micheli, “Energy-efficient design of battery-powered
embedded systems,” IEEE Trans. Very Large Scale Integr. Syst., vol. 9, no. 1, pp.
15–28, February 2001.
[10] M. Hempstead, N. Tripathi, P. Mauro, G.-Y. Wei, and D. Brooks, “An ultra low power
system architecture for sensor network applications,” in ISCA, 2005, pp. 208–219.
[11] S. Smith, “Power optimization in the connected world,” in Proc. of Int. Conf. on
Energy Aware Computing, Dec. 2010.
[12] A. P. Chandrakasan, S. Sheng, and R. W. Brodersen, “Low-power CMOS digital de-
sign,” IEEE Journal of Solid-State Circuits, vol. 27, no. 4, pp. 473–484, 1992.
[13] S. C. Prasad and K. Roy, “Circuit optimization for minimisation of power consumption
under delay constraint,” in Proceedings of the 8th International Conference on VLSI
Design, ser. VLSID ’95, 1995, pp. 305–309.
[14] J. Rabaey, “Reconfigurable processing: The solution to low-power programmable
DSP,” in Proceedings of the 1997 IEEE International Conference on Acoustics, Speech,
and Signal Processing, ser. ICASSP ’97, Washington, DC, USA, 1997, pp. 275–278.
[15] M. Pedram, “Power minimization in IC design: principles and applications,” ACM
Trans. Design Autom. Electron. Syst., vol. 1, no. 1, pp. 3–56, Jan. 1996.
[16] L. Wei, Z. Chen, K. Roy, Y. Ye, and V. De, “Mixed-Vth (MVT) CMOS circuit
design methodology for low power applications,” in Proceedings of the 36th annual
ACM/IEEE Design Automation Conference, ser. DAC ’99, 1999, pp. 430–435.
[17] M. Borah, R. M. Owens, and M. J. Irwin, “Transistor sizing for low power CMOS
circuits,” IEEE Trans. on CAD of Integrated Circuits and Systems, vol. 15, no. 6, pp.
665–671, 1996.
[18] Q. Wu, M. Pedram, and X. Wu, “Clock-gating and its application to low power design
of sequential circuits,” IEEE Trans. on Circuits and Systems I: Fundamental Theory
and Applications, vol. 47, no. 3, pp. 415–420, 2000.
[19] M. Goel and N. R. Shanbhag, “Dynamic algorithm transformations (DAT)-a system-
atic approach to low-power reconfigurable signal processing,” IEEE Trans. VLSI Syst.,
vol. 7, no. 4, pp. 463–476, 1999.
[20] T. Pering, T. Burd, and R. Brodersen, “The simulation and evaluation of dynamic
voltage scaling algorithms,” in Proceedings of the 1998 International Symposium on
Low Power Electronics and Design, ser. ISLPED ’98, 1998, pp. 76–81.
[21] J. Zhuo, C. Chakrabarti, and N. Chang, “Energy management of DVS-DPM enabled
embedded systems powered by fuel cell-battery hybrid source,” in Proceedings of the
2007 International Symposium on Low Power Electronics and Design, ser. ISLPED
’07, 2007, pp. 322–327.
[22] Y.-H. Lee, Y.-Y. Yang, K.-H. Chen, Y.-H. Lin, S.-J. Wang, K.-L. Zheng, P.-F. Chen,
C.-Y. Hsieh, Y.-Z. Ke, Y.-K. Chen, and C.-C. Huang, “A DVS embedded power man-
agement for high efficiency integrated SoC in UWB system,” J. Solid-State Circuits,
vol. 45, no. 11, pp. 2227–2238, 2010.
[23] B. H. Calhoun, A. Wang, and A. Chandrakasan, “Modeling and sizing for minimum en-
ergy operation in subthreshold circuits,” IEEE Journal of Solid-State Circuits, vol. 40,
no. 9, pp. 1778–1786, 2005.
[24] A. Raychowdhury, B. C. Paul, S. Bhunia, and K. Roy, “Computing with subthreshold
leakage: device/circuit/architecture co-design for ultralow-power subthreshold opera-
tion,” IEEE Trans. Very Large Scale Integr. Syst., vol. 13, no. 11, pp. 1213–1224, Nov.
2005.
[25] B. Zhai, D. Blaauw, D. Sylvester, and K. Flautner, “The limit of dynamic voltage
scaling and insomniac dynamic voltage scaling,” IEEE Trans. Very Large Scale Integr.
Syst., vol. 13, no. 11, pp. 1239–1252, Nov. 2005.
[26] J. Kwong and A. P. Chandrakasan, “Variation-driven device sizing for minimum energy
sub-threshold circuits,” in Proceedings of the 2006 International Symposium on Low
Power Electronics and Design ISLPED 06, 2006, pp. 8–13.
[27] A. Wang and A. Chandrakasan, “A 180-mV subthreshold FFT processor using a mini-
mum energy design methodology,” IEEE Journal of Solid-State Circuits, vol. 40, no. 1,
pp. 310–319, 2005.
[28] D. Jeon, M. Seok, C. Chakrabarti, D. Blaauw, and D. Sylvester, “A super-pipelined
energy efficient subthreshold 240 MS/s FFT core in 65-nm CMOS,” J. Solid-State
Circuits, vol. 47, no. 1, pp. 23–34, 2012.
[29] C. H.-I. Kim, H. Soeleman, and K. Roy, “Ultra-low-power DLMS adaptive filter
for hearing aid applications,” IEEE Trans. on Very Large Scale Integration Systems,
vol. 11, no. 6, pp. 352–357, 2003.
[30] B. H. Calhoun and A. P. Chandrakasan, “Ultra-dynamic voltage scaling (UDVS) us-
ing sub-threshold operation and local voltage dithering,” IEEE Journal of Solid-State
Circuits, vol. 41, no. 1, pp. 238–245, 2006.
[31] S. Hanson, M. Seok, Y.-S. Lin, Z. Y. Foo, D. Kim, Y. L. Lee, N. Liu, D. Sylvester, and
D. Blaauw, “A low-voltage processor for sensing applications with picowatt standby
mode,” IEEE Journal of Solid-State Circuits, vol. 44, no. 4, pp. 1145–1155, 2009.
[32] B. Zhai, S. Pant, L. Nazhandali, S. Hanson, J. Olson, A. Reeves, M. Minuth,
R. Helfand, T. Austin, D. Sylvester, and D. Blaauw, “Energy-efficient subthreshold
processor design,” IEEE Trans. Very Large Scale Integr. Syst., vol. 17, no. 8, pp.
1127–1137, Aug. 2009.
[33] S. C. Jocke, J. F. Bolus, S. N. Wooters, T. N. Blalock, and B. H. Calhoun, “A 2.6µW
sub-threshold mixed-signal ECG SoC,” in Proceedings of the 14th ACM/IEEE Inter-
national Symposium on Low Power Electronics and Design ISLPED 09, 2009, pp.
117–118.
[34] N. Verma, A. Shoeb, J. Bohorquez, J. Dawson, J. Guttag, and A. P. Chandrakasan, “A
micro-power EEG acquisition SoC with integrated feature extraction processor for a
chronic seizure detection system,” IEEE Journal of Solid-State Circuits, vol. 45, no. 4,
pp. 804–816, 2010.
[35] J. Kwong and A. P. Chandrakasan, “An energy-efficient biomedical signal processing
platform,” J. Solid-State Circuits, vol. 46, no. 7, pp. 1742–1753, 2011.
[36] R. F. Yazicioglu, S. Kim, T. Torfs, H. Kim, and C. Van Hoof, “A 30 µW analog signal
processor ASIC for portable biopotential signal monitoring,” IEEE Journal of Solid-
State Circuits, vol. 46, no. 1, pp. 209–223, 2011.
[37] M. Ashouei, J. Hulzink, M. Konijnenburg, J. Zhou, F. Duarte, A. Breeschoten,
J. Huisken, J. Stuyt, H. de Groot, F. Barat, J. David, and J. V. Ginderdeuren, “A
voltage-scalable biomedical signal processor running ECG using 13 pJ/cycle at 1 MHz
and 0.4 V,” in IEEE International Solid-State Circuits Conference, February.
[38] S. R. Sridhara et al., “Microwatt embedded processor platform for medical system-on-
chip applications,” IEEE Journal of Solid-State Circuits, vol. 46, no. 4, pp. 721–730,
2011.
[39] K. Roy, S. Mukhopadhyay, and H. Mahmoodi-Meimand, “Leakage current mechanisms
and leakage reduction techniques in deep-submicrometer CMOS circuits,” Proceedings
of the IEEE, vol. 91, no. 2, pp. 305–327, 2003.
[40] S. Mutoh, T. Douseki, Y. Matsuya, T. Aoki, S. Shigematsu, and J. Yamada, “1-V power
supply high-speed digital circuit technology with multithreshold-voltage CMOS,”
IEEE Journal of Solid-State Circuits, vol. 30, no. 8, pp. 847–854, 1995.
[41] K. Shi and D. Howard, “Challenges in sleep transistor design and implementation in
low-power designs,” in Proceedings of the 43rd annual Design Automation Conference,
ser. DAC ’06, 2006, pp. 113–116.
[42] T. Chen and S. Naffziger, “Comparison of adaptive body bias (ABB) and adaptive
supply voltage (ASV) for improving delay and leakage under the presence of process
variation,” IEEE Trans. VLSI Syst., vol. 11, no. 5, pp. 888–899, 2003.
[43] J. von Neumann, Probabilistic logics and the synthesis of reliable organisms from unre-
liable components. Princeton, N.J.: Automata Studies, Princeton Univ. Press, 1956.
[44] R. I. Bahar, J. Mundy, and J. Chen, “A probabilistic-based design methodology for
nanoscale computation,” in Proceedings of the 2003 IEEE/ACM International Con-
ference on Computer-Aided Design, ser. ICCAD ’03, Washington, DC, USA, 2003, pp.
480–486.
[45] W. Qian, M. D. Riedel, K. Bazargan, and D. J. Lilja, “The synthesis of combinational
logic to generate probabilities,” in Proceedings of the 2009 International Conference
on Computer-Aided Design, ser. ICCAD ’09, 2009, pp. 367–374.
[46] N. F. Vaidya and D. K. Pradhan, “Fault-tolerant design strategies for high reliability
and safety,” IEEE Trans. Comput., vol. 42, no. 10, pp. 1195–1206, Oct. 1993.
[47] Y. Tamir and M. Tremblay, “High-performance fault-tolerant VLSI systems using mi-
cro rollback,” IEEE Trans. Comput., vol. 39, no. 4, pp. 548–554, Apr. 1990.
[48] S. J. Piestrak, “Design of fast self-testing checkers for a class of Berger codes,” IEEE
Trans. Comput., vol. 36, no. 5, pp. 629–634, May 1987.
[49] C. Winstead and S. Howard, “A probabilistic LDPC-coded fault compensation tech-
nique for reliable nanoscale computing,” IEEE Trans. Cir. Sys., vol. 56, no. 6, pp.
484–488, June 2009.
[50] N. Jayakumar and S. P. Khatri, “A variation tolerant subthreshold design approach,”
in Proceedings of the 42nd annual Design Automation Conference, ser. DAC ’05, 2005,
pp. 716–719.
[51] N. Verma, J. Kwong, and A. P. Chandrakasan, “Nanometer MOSFET variation in min-
imum energy subthreshold circuits,” IEEE Transactions on Electron Devices, vol. 55,
no. 1, pp. 163–174, 2008.
[52] B. Zhai, S. Hanson, D. Blaauw, and D. Sylvester, “Analysis and mitigation of vari-
ability in subthreshold design,” in Proceedings of the 2005 International Symposium
on Low Power Electronics and Design, ser. ISLPED ’05, 2005, pp. 20–25.
[53] S. Das, D. Roberts, S. Lee, S. Pant, D. Blaauw, T. Austin, K. Flautner, and T. Mudge,
“A self-tuning DVS processor using delay-error detection and correction,” IEEE Jour-
nal of Solid-State Circuits, vol. 41, no. 4, pp. 792–804, 2006.
[54] S. Das, C. Tokunaga, S. Pant, W.-H. Ma, S. Kalaiselvan, K. Lai, D. M. Bull, and
D. T. Blaauw, “RAZOR-II: In situ error detection and correction for PVT and SER
tolerance,” IEEE Journal of Solid-State Circuits, vol. 44, no. 1, pp. 32–48, 2009.
[55] K. A. Bowman, J. W. Tschanz, N.-S. Kim, J. C. Lee, C. B. Wilkerson, S.-L. L. Lu,
T. Karnik, and V. K. De, “Energy-efficient and metastability-immune resilient circuits
for dynamic variation tolerance,” IEEE Journal of Solid-State Circuits, vol. 44, no. 1,
pp. 49–63, 2009.
[56] P. Hazucha, T. Karnik, B. A. Bloechel, C. Parsons, D. Finan, and S. Borkar, “Area-
efficient linear regulator with ultra-fast load regulation,” IEEE Journal of Solid-State
Circuits, vol. 40, no. 4, pp. 933–940, 2005.
[57] G. Patounakis, Y. W. Li, and K. L. Shepard, “A fully integrated on-chip DC-DC
conversion and power management system,” IEEE Journal of Solid-State Circuits,
vol. 39, no. 3, pp. 443–451, 2004.
[58] G.-Y. Wei and M. Horowitz, “A fully digital, energy-efficient, adaptive power-supply
regulator,” IEEE Journal of Solid-State Circuits, vol. 34, no. 4, pp. 520–528, 1999.
[59] F. Ichiba, K. Suzuki, S. Mita, T. Kuroda, and T. Furuyama, “Variable supply-voltage
scheme with 95%-efficiency DC-DC converter for MPEG-4 codec,” in Proceedings of the
1999 International Symposium on Low Power Electronics and Design, ser. ISLPED
’99, 1999, pp. 54–59.
[60] Y. Ramadass and A. Chandrakasan, “Voltage scalable switched capacitor DC-DC con-
verter for ultra-low-power on-chip applications,” in IEEE Power Electronics Specialists
Conference, PESC 2007, June 2007, pp. 2353–2359.
[61] Y. K. Ramadass and A. P. Chandrakasan, “Minimum energy tracking loop with embedded
DC-DC converter enabling ultra-low-voltage operation down to 250 mV in 65-nm
CMOS,” IEEE Journal of Solid-State Circuits, vol. 43, no. 1, pp. 256–265, 2008.
[62] Y. Pu, X. Zhang, J. Huang, A. Muramatsu, M. Nomura, K. Hirairi, H. Takata,
T. Sakurabayashi, S. Miyano, M. Takamiya, and T. Sakurai, “Misleading energy and
performance claims in sub/near threshold digital systems,” in Proceedings of the Inter-
national Conference on Computer-Aided Design, ser. ICCAD ’10, 2010, pp. 625–631.
[63] Y. Choi, N. Chang, and T. Kim, “DC-DC converter-aware power management for low-
power embedded systems,” IEEE Trans. on CAD of Integrated Circuits and Systems,
vol. 26, no. 8, pp. 1367–1381, 2007.
[64] B. Amelifard and M. Pedram, “Design of an efficient power delivery network in an
SoC to enable dynamic power management,” in Proceedings of the 2007 International
Symposium on Low Power Electronics and Design, ser. ISLPED ’07, 2007, pp. 328–333.
[65] J. Park, D. Shin, N. Chang, and M. Pedram, “Accurate modeling and calculation of
delay and energy overheads of dynamic voltage scaling in modern high-performance
microprocessors,” in Proceedings of the 16th ACM/IEEE International Symposium on
Low Power Electronics and Design, ser. ISLPED ’10, 2010, pp. 419–424.
[66] N. R. Shanbhag, R. A. Abdallah, R. Kumar, and D. L. Jones, “Stochastic computa-
tion,” in Proceedings of the 47th Design Automation Conference, ser. DAC ’10, New
York, NY, USA, 2010, pp. 859–864.
[67] F. J. Kurdahi, A. M. Eltawil, Y.-H. Park, R. N. Kanj, and S. R. Nassif, “System-level
SRAM yield enhancement,” in Proceedings of the 7th International Symposium on
Quality Electronic Design, ser. ISQED ’06, 2006, pp. 179–184.
[68] G. Karakonstantis, G. Panagopoulos, and K. Roy, “HERQULES: system level cross-
layer design exploration for efficient energy-quality trade-offs,” in Proceedings of the
16th ACM/IEEE International Symposium on Low Power Electronics and Design, ser.
ISLPED ’10, 2010, pp. 117–122.
[69] V. Papirla, A. Jain, and C. Chakrabarti, “Low power robust signal processing,” in Pro-
ceedings of the 14th ACM/IEEE International Symposium on Low Power Electronics
and Design, ser. ISLPED ’09, 2009, pp. 303–306.
[70] R. Hegde and N. R. Shanbhag, “Soft digital signal processing,” IEEE Trans. Very
Large Scale Integr. Syst., vol. 9, no. 6, pp. 813–823, Dec. 2001.
[71] R. Hegde and N. R. Shanbhag, “A voltage overscaled low-power digital filter IC,”
IEEE Journal of Solid-State Circuits, vol. 39, no. 2, pp. 388–391, 2004.
[72] G. V. Varatkar and N. R. Shanbhag, “Error-resilient motion estimation architecture,”
IEEE Transactions on Very Large Scale Integration Systems, vol. 16, no. 10, pp. 1399–
1412, 2008.
[73] R. A. Abdallah and N. R. Shanbhag, “Error-resilient low-power Viterbi decoder
architectures,” IEEE Trans. Signal Processing, vol. 57, no. 12, pp. 4906–4917, 2009.
[74] G. V. Varatkar, S. Narayanan, N. R. Shanbhag, and D. L. Jones, “Stochastic networked
computation,” IEEE Transactions on Very Large Scale Integration Systems, vol. 18,
no. 10, pp. 1421–1432, 2010.
[75] P. J. Huber, Robust Statistics. Wiley, 1981, vol. 1, no. 3.
[76] E. P. Kim, D. J. Baker, S. Narayanan, D. L. Jones, and N. R. Shanbhag, “Low power
and error resilient PN code acquisition filter via statistical error compensation,” in
IEEE Custom Integrated Circ. Conf. (CICC), 2011, pp. 1–4.
[77] S. Mitra, N. R. Saxena, and E. J. McCluskey, “A design diversity metric and analysis
of redundant systems,” IEEE Trans. Comput., vol. 51, no. 5, pp. 498–510, May 2002.
[78] E. P. Kim and N. R. Shanbhag, “Soft N-modular redundancy,” IEEE Trans. Comput.,
vol. 61, no. 3, pp. 323–336, Mar. 2012.
[79] B. Shim, S. R. Sridhara, and N. R. Shanbhag, “Reliable low-power digital signal pro-
cessing via reduced precision redundancy,” IEEE Transactions on Very Large Scale
Integration Systems, vol. 12, no. 5, pp. 497–510, 2004.
[80] H. Mahmoodi, S. Mukhopadhyay, and K. Roy, “Estimation of delay variations due to
random-dopant fluctuations in nanoscale CMOS circuits,” IEEE Journal of Solid-State
Circuits, vol. 40, no. 9, pp. 1787–1796, 2005.
[81] J. Fayn and P. Rubel, “Toward a personal health society in cardiology,” IEEE Trans.
Info. Tech. Biomed., vol. 14, no. 2, pp. 401–409, Mar. 2010.
[82] N. M. Arzeno, Z.-D. Deng, and C.-S. Poon, “Analysis of first-derivative based QRS
detection algorithms,” IEEE Transactions on Biomedical Engineering, vol. 55, no. 2,
pp. 478–484, 2008.
[83] M. G. Tsipouras, D. I. Fotiadis, and D. Sideris, “An arrhythmia classification system
based on the RR-interval signal,” Artificial Intelligence in Medicine, vol. 33, no. 3, pp.
237–250, 2005.
[84] “The global burden of disease,” World Health Organization, Geneva, Tech. Rep., 2008.
[85] V. Fuster and B. Kelly, Promoting cardiovascular health in the developing world.
Washington DC: The National Academies Press, 2010.
[86] J. Pan and W. J. Tompkins, “A real-time QRS detection algorithm,” IEEE Transac-
tions on Biomedical Engineering, vol. 32, no. 3, pp. 230–236, 1985.
[87] A. Amann, R. Tratnig, and K. Unterkofler, “Reliability of old and new ventricular fib-
rillation detection algorithms for automated external defibrillators,” BioMedical Engi-
neering Online, vol. 4, no. 1, p. 60, 2005.
[88] PhysioNet, “MIT-BIH arrhythmia database,” [Online]. Available:
http://www.physionet.org/physiobank/database/mitdb.
[89] B. U. Kohler, C. Hennig, and R. Orglmeister, “The principles of software QRS detec-
tion,” IEEE Engineering in Medicine and Biology Magazine, vol. 21, no. 1, pp. 42–57,
2002.
[90] K. V. Suárez, J. C. Silva, Y. Berthoumieu, P. Gomis, and M. Najim, “ECG beat
detection using a geometrical matching approach,” IEEE Transactions on Biomedical
Engineering, vol. 54, no. 4, pp. 641–650, 2007.
[91] G. M. Friesen, T. C. Jannett, M. A. Jadallah, S. L. Yates, S. R. Quint, and H. T.
Nagle, “A comparison of the noise sensitivity of nine QRS detection algorithms,”
IEEE Transactions on Biomedical Engineering, vol. 37, no. 1, pp. 85–98, 1990.
[92] “Testing and reporting performance results of cardiac rhythm and ST segment
measurement algorithms,” AAMI Recommended Practice/American National
Standard, Tech. Rep., 1998.
[93] H. Arbetter, R. Erickson, and D. Maksimović, “DC-DC converter design for battery-
operated systems,” in IEEE Power Electronics Specialists Conference, 1995, pp. 103–
109.
[94] V. Kursun, S. G. Narendra, V. K. De, and E. G. Friedman, “Low-voltage-swing monolithic
DC-DC conversion,” IEEE Transactions on Circuits and Systems II: Express
Briefs, vol. 51, no. 5, pp. 241–248, 2004.
[95] R. A. Abdallah, Y.-h. Lee, and N. R. Shanbhag, “Timing error statistics for energy-
efficient robust DSP systems,” in Proceedings of the conference on Design, automation
and test in Europe, ser. DATE, 2011, pp. 1–4.
[96] J. Erfanian, S. Pasupathy, and G. Gulak, “Reduced complexity symbol detectors with
parallel structure for ISI channels,” IEEE Transactions on Communications, vol. 42,
no. 2/3/4, pp. 1661–1671, Feb./Mar./Apr. 1994.
[97] W.-H. Chen, C. H. Smith, and S. C. Fralick, “A fast computational algorithm for the
discrete cosine transform,” IEEE Transactions on Communications, vol. 25, no. 9, pp.
1004–1009, 1977.
[98] E. P. Kim and N. R. Shanbhag, “Soft NMR: Analysis and application to DSP systems,”
in Proceedings of the 2010 IEEE International Conference on Acoustics, Speech, and
Signal Processing (ICASSP 10), 2010, pp. 1494–1497.
[99] S. Kullback and R. A. Leibler, “On information and sufficiency,” Ann. Mathematical
Statistics, vol. 22, no. 1, Mar. 1951.
[100] Y. Liu, T. Zhang, and K. K. Parhi, “Computation error analysis in digital signal
processing systems with overscaled supply voltage,” IEEE Trans. Very Large Scale
Integr. Syst., vol. 18, no. 4, pp. 517–526, Apr. 2010.
[101] J. H. Lala and R. E. Harper, “Architectural principles for safety-critical real-time
applications,” Proceedings of the IEEE, vol. 82, no. 1, pp. 25–40, 1994.
[102] Y. Tamir and C. H. Sequin, “Reducing common mode failures in duplicate module,”
in Proc. IEEE Int. Conf. Computer Design, 1984, pp. 302–307.
[103] G. Taylor, “Energy-efficient circuit design and the future of power delivery,” in IEEE
Electrical Performance of Electronic Packaging and Systems Conference, EPEPS’09,
October 2009.
[104] S. Rusu, S. Tam, H. Muljono, J. Stinson, D. Ayers, J. Chang, R. Varada, M. Ratta,
S. Kottapalli, and S. Vora, “A 45 nm 8-core enterprise Xeon processor,” J. Solid-State
Circuits, vol. 45, no. 1, pp. 7–14, 2010.
[105] R. Kumar and G. Hinton, “A family of 45-nm IA processors,” in IEEE International
Solid-State Circuits Conference, ISSCC’09, February 2009, pp. 58–59.
[106] M. Bohr, “The new era of scaling in an SoC world,” in IEEE International Solid-State
Circuits Conference, ISSCC’09, February 2009, pp. 23–28.
[107] S. Vangal, J. Howard, G. Ruhl, S. Dighe, H. Wilson, J. Tschanz, D. Finan, P. Iyer,
A. Singh, T. Jacob et al., “An 80-tile 1.28 TFLOPS network-on-chip in 65-nm
CMOS,” in IEEE International Solid-State Circuits Conference, ISSCC’07, February
2007, pp. 98–99.
[108] V. J. Reddi, S. Kanev, W. Kim, S. Campanoni, M. D. Smith, G.-Y. Wei, and D. Brooks,
“Voltage smoothing: Characterizing and mitigating voltage noise in production pro-
cessors via software-guided thread scheduling,” in Proceedings of the 2010 43rd Annual
IEEE/ACM International Symposium on Microarchitecture, ser. MICRO ’43. Wash-
ington, DC, USA: IEEE Computer Society, 2010, pp. 77–88.
[109] V. Guruswami and P. Indyk, “Expander-based constructions of efficiently decodable
codes,” in Proceedings of the IEEE Symposium on Foundations of Computer Science,
2001, pp. 658–667.
[110] H.-S. P. Wong, “Beyond the conventional transistor,” IBM J. Res. Dev., vol. 46, no.
2-3, pp. 133–168, Mar. 2002.
[111] C. E. Shannon, “A mathematical theory of communication,” Bell System Technical
Journal, vol. 27, pp. 379–423 and 623–656, July and October 1948.
