Design of robust ultra-low power platform for in-silicon machine learning by Zhang, Sai
© 2016 Sai Zhang
DESIGN OF ROBUST ULTRA-LOW-POWER PLATFORM FOR
IN-SILICON MACHINE LEARNING
BY
SAI ZHANG
DISSERTATION
Submitted in partial fulﬁllment of the requirements
for the degree of Doctor of Philosophy in Electrical and Computer Engineering
in the Graduate College of the
University of Illinois at Urbana-Champaign, 2016
Urbana, Illinois
Doctoral Committee:
Professor Naresh R. Shanbhag, Chair
Professor Minh N. Do
Professor David T. Blaauw, University of Michigan
Assistant Professor Lav R. Varshney
ABSTRACT
The rapid development of machine learning plays a key role in enabling next generation com-
puting systems with enhanced intelligence. Present day machine learning systems adopt an
“intelligence in the cloud” paradigm, resulting in heavy energy cost despite state-of-the-art
performance. It is therefore of great interest to design embedded ultra-low power (ULP)
platforms with in-silicon machine learning capability. A self-contained ULP platform con-
sists of the energy delivery, sensing and information processing subsystems. This dissertation
proposes techniques to design and optimize the ULP platform for in-silicon machine learning
by exploring a trade-oﬀ that exists between energy-eﬃciency and robustness. This trade-oﬀ
arises when the information processing functionality is integrated into the energy delivery,
sensing, or emerging stochastic fabrics (e.g., CMOS operating in near-threshold voltage or
voltage overscaling, and beyond CMOS devices).
This dissertation presents the Compute VRM (C-VRM) to embed the information process-
ing into the energy delivery subsystem. The C-VRM employs multiple voltage domain stack-
ing and core swapping to achieve high total system energy eﬃciency in near/sub-threshold
region. A prototype IC of the C-VRM is implemented in a 1.2 V, 130 nm CMOS process.
Measured results indicate that the C-VRM has up to 44.8% savings in system-level energy
per operation compared to the conventional system, and an eﬃciency ranging from 79% to
83% over an output voltage range of 0.52 V to 0.6 V.
This dissertation further proposes the Compute Sensor approach to embed information
processing into the sensing subsystem. The Compute Sensor eliminates both the traditional
sensor-processor interface, and the high-SNR/high-energy digital processing by moving fea-
ture extraction and classiﬁcation functions into the analog domain. Simulation results in
65 nm CMOS show that the proposed Compute Sensor can achieve a detection accuracy
greater than 94.7% using the Caltech101 dataset, which is within 0.5% of that achieved by
an ideal digital implementation. The performance is achieved with 7× to 17× lower energy
than the conventional architecture for the same level of accuracy.
To further explore the energy-eﬃciency vs. robustness trade-oﬀ, this dissertation explores
the use of highly energy eﬃcient but unreliable stochastic fabrics to implement in-silicon
machine learning kernels. In order to perform reliable computation on the stochastic fabrics,
this dissertation proposes to employ statistical error compensation (SEC) as an eﬀective
error compensation technique. This dissertation makes a contribution to the portfolio of
SEC by proposing embedded algorithmic noise tolerance (E-ANT) for low overhead error
compensation. E-ANT operates by reusing part of the main block as estimator and thus
embedding the estimator into the main block. System level simulation results in a commer-
cial 45 nm CMOS process show that E-ANT achieves up to 38% error tolerance and up to
51% energy savings compared with an uncompensated system.
This dissertation makes a contribution to the theoretical understanding of stochastic fab-
rics by proposing a class of probabilistic error models that can accurately model the hardware
errors on the stochastic fabrics. The models are validated in a commercial 45 nm CMOS
process and employed to evaluate the performance of machine learning kernels in the presence
of hardware errors. Performance prediction of a support vector machine (SVM) based
classifier using these models indicates that the probability of detection Pdet estimated using
the proposed model is within 3% for timing errors due to voltage overscaling when the error
rate pη ≤ 80%, within 5% for timing errors due to process variation in the near-threshold voltage
(NTV) region (0.3 V to 0.7 V), and within 2% for defect errors when the defect rate psaf is
between 10⁻³ and 20%, compared with HDL simulation results.
Employing the proposed error model and evaluation methodology, this dissertation ex-
plores the use of distributed machine learning architectures, named classiﬁer ensemble, to
enhance the robustness of in-silicon machine learning kernels. Comparative study of dis-
tributed architectures (i.e., random forest (RF)) and centralized architectures (i.e., SVM)
is performed in a commercial 45 nm CMOS process. Employing the UCI machine learning
repository as input, it is determined that RF-based architectures are significantly more robust
than SVM architectures in the presence of timing errors in the NTV region (0.3 V to 0.7 V).
Additionally, an error weighted voting technique that incorporates the timing error statistics
of the NTV circuit fabric is proposed to further enhance the robustness of RF architectures.
Simulation results conﬁrm that the error weighted voting technique achieves a Pdet that
varies by only 1.4%, which is 12× lower compared to centralized architectures.
To my parents, and my advisor
ACKNOWLEDGMENTS
First and foremost, my deepest thanks go to my parents, and my advisor. My parents have
always put me before themselves, and supported me with the greatest love and care that
anyone can hope for. My advisor, Prof. Naresh Shanbhag, has not only been the teacher and
supervisor for my research, but also the mentor for my life. It is he who showed me the path
towards being an independent researcher with his passion, deep knowledge, patient guidance
and high standards. He demonstrates to me the importance of strong motivation, systematic
research, and logical reasoning. I believe these valuable lessons, along with many others that
I have learned from him, will continuously beneﬁt me throughout the rest of my career. I am
grateful to have my parents and my advisor, to whom I dedicate my thesis. I would also like to
sincerely thank Prof. David Blaauw, Prof. Lav Varshney, and Prof. Minh Do for their many
insightful comments and input and for agreeing to be on my committee. Their suggestions
have greatly helped improve this dissertation. Additionally, I would like to thank Jane Tu for
her help with implementing the FIR ﬁlter core in the Compute VRM project in Chapter 2,
and Mingu Kang and Charbel Sakr for providing simulation models for the Compute Sensor
project in Chapter 3. I would also like to give my sincere acknowledgments to my research
group colleagues, Yingyan Lin, Charbel Sakr, Sujan Gonugondla, Mingu Kang, Ameya Patil,
and Dr. Yongjune Kim; my research group alumni, Dr. Rami Abdallah and Dr. Eric Kim;
and my colleagues at UIUC, notably Dr. Talegaonkar, Dr. Kairouz, and Guanghua Shu
for their valuable feedback, input, and generous help with improving this dissertation. I
gratefully acknowledge past and present support from Texas Instruments, and Systems on
Nanoscale Information fabriCs (SONIC), one of the six SRC STARnet Centers, sponsored
by MARCO and DARPA. Finally, I would like to thank my friends at Illinois, especially the
Chinese Volleyball Team and the Formosa Volleyball Enthusiasts group who made me feel
at home. Because of them, I was able to keep my physical well-being, and stay focused on
my research.
CONTENTS
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x
Chapter 1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.2 Dissertation Contributions and Organization . . . . . . . . . . . . . . . . . . 16
Chapter 2 COMPUTE VRM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.2 C-VRM System Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.3 C-VRM Prototype IC Design . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.4 Test Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
2.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
Chapter 3 COMPUTE SENSOR . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.2 The Compute Sensor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.3 Simulation Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
3.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
Chapter 4 EMBEDDED ALGORITHMIC-NOISE TOLERANCE . . . . . . . . . . 65
4.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.2 Proposed E-ANT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.3 Simulation Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
4.4 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
Chapter 5 PROBABILISTIC ERROR MODELS FOR MACHINE LEARNING
KERNELS IMPLEMENTED ON STOCHASTIC NANOSCALE FABRICS . . . . 108
5.1 Modeling Framework and Accuracy Measure . . . . . . . . . . . . . . . . . . 110
5.2 Error Model Derivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
5.3 Model Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
5.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
5.5 Derivation of REM-j . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
Chapter 6 ERROR-RESILIENT MACHINE LEARNING IN NEAR THRESH-
OLD VOLTAGE VIA CLASSIFIER ENSEMBLE . . . . . . . . . . . . . . . . . . 124
6.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
6.2 System Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
6.3 Simulation Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
6.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
6.5 Derivations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
Chapter 7 CONCLUSION AND FUTURE WORK . . . . . . . . . . . . . . . . . . 140
7.1 Dissertation Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
7.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
LIST OF TABLES
2.1 Comparison with Previously Published Work . . . . . . . . . . . . . . . . . 46
3.1 Model Parameters in 65 nm CMOS . . . . . . . . . . . . . . . . . . . . . . . 59
3.2 Energy per Pixel Processing in 65 nm CMOS . . . . . . . . . . . . . . . . . 61
4.1 DPD for FFT BU . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
4.2 ARCH-ANT Performance and Energy Comparison . . . . . . . . . . . . . . 100
4.3 ALG-ANT Performance and Energy Comparison . . . . . . . . . . . . . . . 107
LIST OF FIGURES
1.1 The need for in-silicon machine learning. . . . . . . . . . . . . . . . . . . . . 2
1.2 Architecture of the ULP platform for in-silicon machine learning. . . . . . . 3
1.3 The voltage conversion eﬃciency of VRMs tends to decrease as the con-
version ratio increases. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.4 The communication challenge. . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.5 The computation challenge. . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.6 The robustness challenge. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.7 Three commonly used DC-DC converter topologies. . . . . . . . . . . . . . . 9
1.8 Existing works on integrated sensing and computing. . . . . . . . . . . . . . 10
1.9 Near/sub-threshold operation. . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.10 SEC techniques. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.1 Conventional design approach for separated VRM and core. . . . . . . . . . . 19
2.2 Conventional SC-VRM architecture. . . . . . . . . . . . . . . . . . . . . . . . 24
2.3 The C-VRM principle for N = 2. . . . . . . . . . . . . . . . . . . . . . . . . 26
2.4 Data transfer in the C-VRM during core swapping. . . . . . . . . . . . . . . 27
2.5 The variable supply voltage Vdd(m) results in a time varying clock period
Tclk−C(m). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.6 The principle of charge conservation in the C-VRM. . . . . . . . . . . . . . . 30
2.7 The eﬀective voltage Vdd,eff as a function of ∆V = Vbat/2− Vdd(M). . . . . . 32
2.8 Comparison of SC-VRM system and C-VRM with 1% data transfer overhead. 34
2.9 System comparison between SC-VRM system and C-VRM with 10% data
transfer overhead. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.10 The C-VRM prototype IC architecture. . . . . . . . . . . . . . . . . . . . . . 35
2.11 The 2:1 ladder SC-VRM. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.12 The strong ARM comparator and the current starved oscillator employed
in the 2:1 SC-VRM. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.13 Non-overlapping driver employed in the 2:1 SC-VRM. . . . . . . . . . . . . . 37
2.14 The 2:1 C-VRM block diagram. . . . . . . . . . . . . . . . . . . . . . . . . . 38
2.15 Control block of the C-VRM. . . . . . . . . . . . . . . . . . . . . . . . . . . 40
2.16 Bidirectional level shifter. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
2.17 Architecture of the CPR oscillator with tunable delay. . . . . . . . . . . . . . 41
2.18 Post layout simulations of the CPR oscillator. . . . . . . . . . . . . . . . . . 41
2.19 Measured SC-VRM operation during start up and steady state. . . . . . . . 42
2.20 Measured C-VRM core swapping and data transfer. . . . . . . . . . . . . . . 43
2.21 The C-VRM test chip measurement results. . . . . . . . . . . . . . . . . . . 45
2.22 Die photo of the test chip. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.1 A typical embedded vision platform. . . . . . . . . . . . . . . . . . . . . . . 50
3.2 3T APS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.3 Compute Sensor implementing PCA and SVM. . . . . . . . . . . . . . . . . 53
3.4 Capacitive multiplier. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.5 Model characterization and validation. . . . . . . . . . . . . . . . . . . . . . 59
3.6 Compute Sensor system simulation results. . . . . . . . . . . . . . . . . . . . 60
3.7 The feature distribution and SVM separation hyper-plane. . . . . . . . . . . 61
3.8 Energy per decision. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
3.9 Compute Sensor characterization chip in 65 nm CMOS. . . . . . . . . . . . . 64
4.1 Error statistics comparison. . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.2 Algorithmic noise-tolerance (ANT). . . . . . . . . . . . . . . . . . . . . . . . 70
4.3 EEG seizure classiﬁer with SVM. . . . . . . . . . . . . . . . . . . . . . . . . 73
4.4 E-ANT Adder. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
4.5 E-ANT Multiplier. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
4.6 E-ANT MAC unit. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
4.7 E-ANT FIR ﬁlter. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
4.8 DFG of the E-ANT FFT butterﬂy unit. . . . . . . . . . . . . . . . . . . . . 83
4.9 DFG of the E-ANT exponential kernel. . . . . . . . . . . . . . . . . . . . . . 85
4.10 ALG-ANT FIR ﬁlter structure. . . . . . . . . . . . . . . . . . . . . . . . . . 86
4.11 ALG-ANT linear SVM: the dot product result is unaltered when the order
of the multiply-accumulates (MACs) is varied. . . . . . . . . . . . . . . . . 88
4.12 Evaluation methodology. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
4.13 Optimization of E-ANT MAC. . . . . . . . . . . . . . . . . . . . . . . . . . . 93
4.14 Energy savings vs. input precision and MSE. . . . . . . . . . . . . . . . . . 94
4.15 Second order polynomial kernel SVM EEG classiﬁcation system architec-
ture. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
4.16 SNR at the output of the FE. . . . . . . . . . . . . . . . . . . . . . . . . . . 96
4.17 Simulation results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
4.18 ALG-ANT applied to the ﬁlter design problem. . . . . . . . . . . . . . . . . 101
4.19 Comparison of classiﬁcation results with and without DR. . . . . . . . . . . 102
4.20 ALG-ANT based SVM EEG classiﬁcation system architecture. . . . . . . . 103
4.21 Simulation results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
5.1 Error modeling framework. . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
5.2 Model validation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
5.3 JS divergence comparison of the proposed models for VOS errors for MAC1
used in the SVM classiﬁer. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
5.4 JS divergence comparison of the proposed models for process variation
errors for MAC1 used in the SVM classiﬁer. . . . . . . . . . . . . . . . . . . 118
5.5 JS divergence comparison of the proposed models for defect errors for
MAC1 used in the SVM classiﬁer. . . . . . . . . . . . . . . . . . . . . . . . . 119
5.6 System simulation results in presence of VOS errors for the SVM classiﬁer
comparing the proposed models with HDL simulation results. . . . . . . . . 120
5.7 System simulation results in presence of process variation errors for the
SVM classiﬁer comparing the HDL simulation results. . . . . . . . . . . . . . 121
5.8 System simulation results in presence of defect errors for the SVM classiﬁer
comparing the HDL simulation results. . . . . . . . . . . . . . . . . . . . . . 122
6.1 Two distinct machine learning frameworks. . . . . . . . . . . . . . . . . . . . 125
6.2 System architecture for the RF. . . . . . . . . . . . . . . . . . . . . . . . . . 128
6.3 System architecture for a second-order polynomial kernel SVM classiﬁer. . . 131
6.4 Median error rate p¯η and gate level delay variation (σ/µ)d of SVM and RF
architecture in NTV region of 0.3 V ≤ Vdd ≤ 0.7 V. . . . . . . . . . . . . . . 134
6.5 Robustness comparison. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
6.6 The variance of RF output when (σ/µ)d = 29%. . . . . . . . . . . . . . . . 135
Chapter 1
INTRODUCTION
Machine learning based systems are transforming the way humans interact with the physical
world and have found wide applications in computer vision, data mining, healthcare, and
more. In many areas such as object recognition [1], machines have begun to exceed human
performance due to machine learning algorithms' capability to learn complex correlations
from large volumes of data. However, this state-of-the-art performance comes at the price
of heavy energy cost. For example, Google's AlphaGo system [2], which beat the human
Go champion, employs 1202 CPUs and 176 GPUs and consumes more than four orders
of magnitude more power than the much-cited ∼20 W power consumption of
the human brain. As a result of this high energy cost, most current machine
learning systems adopt an “intelligence in the cloud” paradigm, as shown in Fig. 1.1(a). In
this paradigm, a large volume of sensory data is transferred from mobile devices to data
centers, where the bulk of machine learning algorithms are implemented on CPU and GPU-
based clusters. The extracted inference models are then transferred from the cloud back to
the device. This voluminous data transmission to the cloud leads to signiﬁcant energy and
latency costs. Indeed, recent projections [3] indicate that the traﬃc to the cloud consumes
9× more energy than that in the data center itself, and can account for 2% of the global
electricity consumption [4]. Therefore, there is a clear need for small form factor ultra-
low-power (ULP) platforms with inference capability so that the generated data can be
processed to obtain decisions locally (see Fig. 1.1(b)). Achieving this goal requires energy
eﬃcient in-silicon implementation of machine learning systems.
A self-contained ULP platform for in-silicon machine learning consists of sensing, in-
formation processing, and energy delivery subsystems (see Fig. 1.2). Figure 1.2 shows a
conventional architecture for embedded vision applications. Image data is ﬁrst acquired via
Figure 1.1: The need for in-silicon machine learning: (a) current machine learning systems
are implemented in the cloud, requiring the transmission of large volumes of raw data,
and (b) ultra-low-power (ULP) platforms with in-silicon machine learning can potentially
process raw data to generate decisions locally.
an active pixel sensor (APS) array whose analog pixel values are sensed sequentially,
converted into digital samples via analog-to-digital converters (ADCs), and then streamed
out to a back-end digital processor, which implements the feature extraction and classification
functions to obtain the final decision. A digital trainer block computes the hyperparameters
of the feature extractor and classiﬁer. The energy delivery subsystem converts the voltage
from the supply (battery or energy harvester) to the voltage level suitable for information
processing. The limited energy source and the computational complexity of learning algo-
rithms make energy eﬃciency one of the primary design objectives. Several challenges arise
in designing and optimizing such a system:
Figure 1.2: Architecture of the ULP platform for in-silicon machine learning.
• The energy delivery challenge: The energy delivery subsystem typically consists
of one or more voltage regulator modules (VRMs) that convert the voltage of the energy
source into the operating voltage. While designing a VRM with a high efficiency of η > 90%
is feasible for an output voltage Vdd ≥ 1 V, it is increasingly difficult to maintain such high
efficiency for ULP platforms that must operate at scaled voltages to reduce
computation energy. As shown in Fig. 1.3, the efficiency of a VRM tends to decrease as
the output load voltage decreases [5]. This poor VRM efficiency offsets the energy
savings provided by low-voltage design techniques such as sub/near-threshold voltage
design.
Figure 1.3: The voltage conversion efficiency of VRMs tends to decrease as the conversion
ratio increases.
• The communication challenge: The physical separation between the sensing and
information processing subsystems leads to a large interface energy. This separation
is unavoidable in the conventional architecture, which employs digital signal processors,
because sensing is intrinsically an analog process while information processing is digital.
The energy required to move the data over the sensor-processor
interface (see Fig. 1.4(a)), which comprises the ADC, the read-out (RD) circuitry, and the
interconnect to the digital processor, can account for more than 50% of the total energy,
as shown in Fig. 1.4(b).
Figure 1.4: The communication challenge [6]: (a) the sensing front-end, and (b) energy
breakdown in 65 nm CMOS of an embedded vision system consisting of an active pixel
sensor (APS) array as the sensing front-end and a principal component analysis (PCA) and
support vector machine (SVM) digital signal processor as the back-end.
• The computation challenge: As projected in Fig. 1.5(a), the power consumption
of portable electronics is expected to keep increasing. This increase will accelerate
if machine learning and inference capabilities are integrated in-silicon. Indeed,
many machine learning algorithms are computationally intensive; e.g., more than 666
million MACs are required to process one 227×227 image (13k MACs/pixel) in AlexNet
[7], one of the state-of-the-art deep learning algorithms. To reduce energy, the conventional
approach is to rely on continued scaling of supply voltage and feature size. However,
this scaling trend has stagnated, as shown in Fig. 1.5(b) [8].
• The robustness challenge: One way to further reduce energy consumption is to
employ near/sub-threshold design, where the supply voltage is aggressively scaled down to
200 mV to 500 mV. However, the resulting energy savings come with an increase
in delay variation, as shown in Fig. 1.6(a) [10]. In addition, process scaling leads to
increased process-voltage-temperature (PVT) variation, leakage, soft errors, and noise
(see Fig. 1.6(b) [11]), which is a growing concern for reliable computing.
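The per-pixel workload quoted in the computation challenge can be checked with a short calculation. The sketch below uses only the two figures given in the text (666 million MACs and a 227×227 input); the function name `macs_per_pixel` is introduced here purely for illustration.

```python
# Back-of-the-envelope check of the AlexNet workload quoted in the text.
TOTAL_MACS = 666e6   # MACs per image, as stated in the text
IMAGE_SIDE = 227     # the input image is 227 x 227 pixels, as stated in the text


def macs_per_pixel(total_macs: float, side: int) -> float:
    """Average number of multiply-accumulates per input pixel."""
    return total_macs / (side * side)


if __name__ == "__main__":
    value = macs_per_pixel(TOTAL_MACS, IMAGE_SIDE)
    # Prints a value close to the 13k MACs/pixel quoted in the text.
    print(f"{value:.0f} MACs/pixel")
```

The result is slightly below 13,000 MACs per pixel, consistent with the rounded figure of 13k.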
Figure 1.5: The computation challenge: (a) the total power for embedded devices keeps
increasing [9], and (b) supply voltage scaling for CMOS process is stagnant below 45 nm [8].
Figure 1.6: The robustness challenge: (a) delay variation increases as supply voltage scales
from super to sub-threshold region [10], and (b) standard deviation of threshold voltage
increases as technology scales [11].
An estimated energy breakdown among the energy delivery, sensing, and information
processing subsystems helps identify the limitations of the conventional architecture.
Based on published work, it is reasonable to assume that 5% to 20%
of the energy is consumed in the energy delivery subsystem as power conversion loss under
low-voltage operation. The energy breakdown between the sensing and information processing
subsystems depends strongly on the application, algorithm, and circuit architecture employed.
As shown in Fig. 1.4(b), the sensing subsystem accounts for 59% of the system energy
(excluding the energy delivery loss), and most of this energy is consumed in the interface
circuitry. At the same time, information processing accounts for 41% of the total system
energy when employing PCA and SVM kernels as the digital back-end. This suggests that
the interface circuitry and the information processing subsystem are the dominant sources
of energy consumption in the conventional architecture.
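To see how these figures combine, the following sketch folds the assumed 5% to 20% delivery loss into the 59%/41% split of Fig. 1.4(b). It assumes that the delivery loss is a fraction of the total energy drawn from the source and that the 59%/41% split applies to the remaining delivered energy; this interpretation and the function name `energy_breakdown` are illustrative assumptions, not results from the dissertation.

```python
def energy_breakdown(delivery_loss: float,
                     sensing_share: float = 0.59,
                     processing_share: float = 0.41) -> dict:
    """Fractions of the total source energy spent in each subsystem.

    delivery_loss: assumed fraction of source energy lost in voltage conversion.
    sensing_share, processing_share: split of the delivered energy (Fig. 1.4(b)).
    """
    if not 0.0 <= delivery_loss < 1.0:
        raise ValueError("delivery_loss must be in [0, 1)")
    delivered = 1.0 - delivery_loss
    return {
        "delivery": delivery_loss,
        "sensing": sensing_share * delivered,
        "processing": processing_share * delivered,
    }


if __name__ == "__main__":
    # The two ends of the 5% to 20% range quoted in the text.
    for loss in (0.05, 0.20):
        parts = energy_breakdown(loss)
        print(loss, {name: round(frac, 3) for name, frac in parts.items()})
```

Under these assumptions, sensing and processing together still account for 80% to 95% of the source energy.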
In the remainder of this chapter, we provide an overview of related work that addresses
these challenges and then present our approach to solving these problems.
1.1 Related Work
This section provides an overview of related work in the design of energy delivery and
sensing subsystems, low power design, and robust system design.
1.1.1 Energy Delivery
There are three commonly used VRM topologies: 1) the linear regulator, 2) the switching converter,
and 3) the switched capacitor voltage regulator module (SC-VRM), as shown in Fig. 1.7. The
linear regulator (Fig. 1.7(a)) uses a high-gain amplifier and series-shunt feedback to regulate
the voltage to a desired reference level. A major problem with the linear regulator is that
its efficiency is determined by the ratio Vout/Vin, where Vout and Vin are the output and input
voltages, respectively. Hence, the linear regulator has poor efficiency at low output voltages.
A switching converter (Fig. 1.7(b)) employs duty-cycle-controlled switches to convert a DC
voltage into a pulse train, followed by an LC low-pass filter to extract its DC component.
However, the off-chip inductor increases the form factor of the system and thus prohibits
its use in form-factor-constrained ULP platforms. The SC-VRM (Fig. 1.7(c))
employs a capacitor array to store and transfer charge. Voltage conversion is achieved by
transferring charge using duty-cycle-controlled power switches. An SC-VRM can achieve high
efficiency when the output voltage is close to the ideal conversion output, but its efficiency
decreases rapidly as the output deviates from the ideal output voltage. Its compactness and
capability to achieve a high conversion ratio make the SC-VRM an ideal candidate for use in ULP
platforms.
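The efficiency trends described above can be illustrated with a minimal sketch. The linear regulator expression Vout/Vin comes from the text. The SC-VRM expression is a common first-order bound (output voltage divided by the ideal no-load output, ignoring gate drive and other switching losses) and is an assumption introduced here, as are the function names and the example voltages.

```python
def linear_regulator_efficiency(v_in: float, v_out: float) -> float:
    """Efficiency of a linear regulator, Vout/Vin, as stated in the text."""
    if not 0.0 < v_out <= v_in:
        raise ValueError("require 0 < v_out <= v_in")
    return v_out / v_in


def sc_vrm_efficiency_bound(v_in: float, v_out: float, ratio: float) -> float:
    """First-order upper bound on SC-VRM efficiency (assumed model).

    ratio is the ideal conversion ratio, e.g. 0.5 for a 2:1 step-down,
    so the ideal no-load output is ratio * v_in.
    """
    v_ideal = ratio * v_in
    if not 0.0 < v_out <= v_ideal:
        raise ValueError("require 0 < v_out <= ratio * v_in")
    return v_out / v_ideal


if __name__ == "__main__":
    v_in = 1.2  # hypothetical supply voltage in volts
    for v_out in (0.58, 0.50, 0.40):  # hypothetical output voltages in volts
        lin = linear_regulator_efficiency(v_in, v_out)
        sc = sc_vrm_efficiency_bound(v_in, v_out, ratio=0.5)
        print(f"Vout={v_out:.2f} V  linear={lin:.1%}  2:1 SC bound={sc:.1%}")
```

In this model the linear regulator stays below 50% for any output under half the input, while the 2:1 SC-VRM bound starts near 100% close to the ideal output and falls as the output moves away from it, matching the qualitative behavior described in the text.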
Design of high-efficiency SC-VRMs for ULP platforms is challenging due to the large
step-down ratio [12, 5] and the light load conditions (≈ 10 nA). SC-VRMs with pulse frequency
modulation (PFM) control [5] and capacitance modulation [12] have been employed
to boost light load efficiency up to 74%. A hybrid converter [13] has been proposed to address
energy delivery for loads in the range of 5 nA to 500 nA, with efficiency up to 56%.
Ultra-low-power clock generation and level shifters are employed to reduce the switching loss,
which degrades light load efficiency. To mitigate the conversion ratio problem, the stacked
voltage domain approach [14] has been proposed, where multiple cores are connected in series
to lower the step-down ratio. However, this approach needs push and pull linear regulators
to compensate for voltage fluctuations in the intermediate supply nodes caused by
current mismatch. Most conventional approaches target reducing converter losses in
existing topologies. However, SC-VRMs incur intrinsic charge sharing loss [15], gate drive
loss, and other overheads that tend to severely degrade the light load efficiency. Thus,
alternative architectures are needed to increase the VRM efficiency for ULP platforms.
1.1.2 Sensing Subsystem
The sensing subsystem contains various types of sensors, such as image, biomedical, or
chemical sensors, to convert physical signals into electrical signals for further processing. In
applications such as CMOS image sensor based embedded vision, where sensing and information
processing co-exist, the interface energy between the two subsystems can dominate. One
approach to address the resulting communication challenge is to tightly integrate sensing
and computation. Previous work in integrating computation into the CMOS image sensor
array falls into one of two categories. In the ﬁrst, the pixel architecture is modiﬁed (see Fig.
1.8(a)) to enable simple computations such as 2D convolution [16], image ﬁltering [17, 18],
compressive image sensing [19], matrix transformations [20], Gaussian pyramid [21], and
image decomposition [22]. These approaches suﬀer from a loss in ﬁll-factor or the spatial
resolution because the modiﬁed pixel occupies an area that can be as high as 8× greater
Figure 1.7: Three commonly used DC-DC converter topologies: (a) linear regulator, (b)
switching converter, and (c) SC-VRM.
Figure 1.8: Integrated sensing and computing falls into two categories: (a) modiﬁed pixel
architecture leading to a loss of ﬁll factor [17], and (b) attaching digital/analog processor
in APS peripheral for low level ﬁltering functions [23].
than the standard pixel architecture. In the second, very simple analog processing functions
are embedded in the periphery of the APS array (see Fig. 1.8(b)). These include convolution
[23], random projections for compressed sensing [24], and diﬀerence of Gaussian (DOG) [25].
The main limitation of these approaches is the absence of learning capabilities since only
low level image processing algorithms such as ﬁltering are supported. This lack of learning
prevents the system from adapting model parameters to compensate for the non-idealities in
the hardware platform such as the non-linearity and noise in the sensor and the peripheral
circuitry.
1.1.3 Digital Low Power Design
Conventional approaches for power reduction rely on device level techniques such as feature
size scaling and body biasing [26]; circuit level techniques such as transistor sizing [27] and
voltage scaling [28]; and architectural level techniques such as algorithm transformation [29],
clock/power gating [30, 31], and dynamic voltage and frequency scaling (DVFS) [32]. As CMOS
technology scales into the sub-10 nm regime, traditional energy-reduction knobs such as supply voltage, frequency,
and threshold voltage are becoming ineffective due to leakage as well
as power density concerns. Moreover, the conventional techniques adopt a worst-case design
methodology to ensure error free operation. The resulting large margin limits the achievable
energy eﬃciency.
The work of Calhoun et al. [33] and Zhai et al. [34] led to the discovery of the minimum
energy operation point (MEOP) in digital integrated circuits, which arises from the trade-off
between dynamic energy Edyn and leakage energy Eleak in the sub-threshold domain (see Fig.
1.9(a)). Supply voltage scaling results in a quadratic reduction in Edyn, and an exponential
increase in the delay and thus in Eleak. The resulting MEOP is defined via the tuple
($V_{dd}^*$, $f_{clk}^*$, $E_{op}^*$), where $V_{dd}^*$ is the optimum supply voltage, $f_{clk}^*$ is the optimum clock frequency, and
$E_{op}^*$ is the optimum energy per operation. Operating circuits in sub-threshold might lead to
severe performance loss due to the increased delay. To compensate for the performance loss,
researchers have also proposed near-threshold operation where the supply voltage is scaled
down to 400 − 500 mV [8] (see Fig. 1.9(b)). Near-threshold computing oﬀers 10× energy
beneﬁts with relatively small performance loss and is considered a good trade oﬀ between
super and sub-threshold operations. Both system level studies and IC implementations exist
for near/sub-threshold computing. Markovic et al. [35] study the impact of activity factor
and various design parameters on near-threshold operation and propose suitable logic families
for near-threshold design. The work concludes that near-threshold operation can provide a
10× throughput increase with a 20% energy increase relative to the MEOP. Dreslinski et
al. [36] study architectural optimization of parallel chip multi-processors (CMP) operating
in near-threshold. Kwong [37] provides a design methodology for sub-threshold logic with
emphasis on device sizing. Processors operating in near/sub-threshold [38, 39, 40, 41] as well
as custom digital signal processing (DSP) kernels [42, 43] have also been proposed. The wide
adoption of sub/near-threshold design is limited due to the performance loss and increased
PVT variations [11]. The conventional approach employs transistor up-sizing and extensive
verification using a computer-aided design (CAD) flow [37]. This worst-case design
methodology limits the achievable energy efficiency.
Figure 1.9: Near/Sub-threshold operation: (a) MEOP exists due to the balance between
dynamic and leakage energies [33], and (b) near-threshold computing oﬀers trade-oﬀ
between throughput and energy eﬃciency [8].
1.1.4 Robust System Design
Defects and errors originating from various sources necessitate robust system design for next
generation ULP platforms. Errors can be caused by imperfections in fabrication such as
scratches from wafer mishandling, mask misalignment and over/under-etching [44]. These
imperfections, referred to as defects, can cause unpredictable open or short circuits in the
fabricated chip, leading to circuit stuck-at-faults. In addition, near/sub-threshold computing
presents new challenges for robust system design. The increased PVT variations [8] lead to
higher probability of timing errors.
Robust system design dates back to the work of von Neumann [45] who showed that reliable
networks can be designed with a cascade of three-input majority gates if the component
probability of failure $p_e < 0.0073$, and that reliable computation is impossible if $p_e \geq 1/6$.
Techniques for various design abstractions have been proposed and are summarized next.
At the circuit level, yield enhancement routing [46] and ﬂoor planning [47] techniques
have been proposed. These techniques target improving yield at manufacturing time. In
[48], Markov random ﬁeld (MRF) logic is proposed to enhance noise immunity under low
voltage operation. The proposed logic is able to achieve great robustness improvement, but
the overhead is prohibitive. Techniques such as transistor sizing [37] and body biasing [49]
have been proposed to reduce variations in sub/near-threshold designs. Circuit hardening
techniques [50] have been proposed to mitigate single event transients (SET) on logic circuits.
At logic and microarchitecture level, conventional robust system design methods employ
redundancy based approaches. For example, in N modular redundancy (NMR), the design is
robustiﬁed by replicating the module N times followed by majority voting to obtain the ﬁnal
results. NMR incurs large (N fold) area and energy overhead, thus is not suitable for use in
low power platforms. In [51, 52], RAZOR is proposed as a low overhead microarchitecture
level technique for detecting timing errors. RAZOR employs a specially designed shadow
latch to detect late-arriving signals and a recovery scheme to re-execute the erroneous in-
struction. RAZOR is able to achieve 44% energy savings over the worst case design point
while operating close to point of ﬁrst failure (PoFF) with an error rate of 0.1% [51]. RA-
ZOR's deterministic error compensation makes it well-suited for applications where 100%
correctness is important. Emerging machine learning applications have a relaxed notion
of correctness that can potentially be utilized to enhance robustness and improve energy
eﬃciency.
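The NMR scheme described above can be illustrated with a minimal Python sketch. The helper names (`majority_vote`, `nmr`, `make_faulty_square`) and the single-bit-flip fault model are assumptions made purely for illustration; they are not taken from the cited works.

```python
from collections import Counter
import random

def majority_vote(outputs):
    """Return the most common value among the replica outputs."""
    value, _count = Counter(outputs).most_common(1)[0]
    return value

def nmr(module, x, n=3):
    """Evaluate `module` n times on x and majority-vote the n results."""
    return majority_vote([module(x) for _ in range(n)])

def make_faulty_square(p_err, rng):
    """Hypothetical module: returns x*x, corrupted with probability p_err."""
    def module(x):
        y = x * x
        if rng.random() < p_err:
            y ^= 1 << rng.randrange(8)   # flip one of the low 8 bits
        return y
    return module

# Compare the raw error rate with the error rate after triple modular redundancy.
rng = random.Random(1)
module = make_faulty_square(p_err=0.05, rng=rng)
trials = 10_000
raw_errors = sum(module(7) != 49 for _ in range(trials))
tmr_errors = sum(nmr(module, 7, n=3) != 49 for _ in range(trials))
print(f"raw error rate: {raw_errors / trials:.4f}, TMR error rate: {tmr_errors / trials:.4f}")
```

The sketch also makes the overhead visible: each protected evaluation calls the module N times, which is the N-fold cost noted in the text.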
In contrast to logic or circuit level techniques, algorithm and system level techniques can
take advantage of application level performance metrics to enhance system level robustness.
Emerging machine learning applications employ performance metrics that are statistical in
nature [53]. For example, feature extractors consisting of filter banks employ signal-to-noise
ratio (SNR) as the design metric, and classification engines employ true positive
rate (TP rate), false positive rate (FP rate), or detection rate (Pdet) as the design metric
[54, 55]. Statistical metrics result in inherent tolerance to small-magnitude errors.
In data driven hardware resiliency (DDHR) [54], stuck-at faults and various system non-
idealities are treated as feature inputs into the machine learning algorithm. Through training
with error aﬀected features, the resulting classiﬁcation/regression engine compensates for
these errors. The resulting engine can thus perform correct classiﬁcation and regression in
the presence of errors. In [56], adaptive boosting is employed to train in the presence of
hardware errors. Data driven methods are eﬀective at handling static errors such as stuck-at
faults. However, the data driven nature of these approaches requires the error statistics to
be the same during training and testing, which might not be true for dynamic errors such
as timing errors.
Statistical error compensation (SEC) [53, 57, 58, 61, 60] (see Fig. 1.10) is a class of system
level error compensation techniques that utilize signal and error statistics. SEC employs
detection and estimation theory to compensate for the errors in the main computation
block, and thus can tolerate a much higher error rate compared with logic level techniques.
Algorithmic noise-tolerance (ANT) [57] employs an explicit estimator block to compensate
for the most signiﬁcant bit (MSB) ﬁrst errors in the main DSP block. ANT has been shown to
provide up to 65% energy savings with little loss of performance. Stochastic sensor-network-
on-a-chip (SSNOC) [58] employs statistically similar decomposition and robust estimation
theory to compensate for errors and achieves up to 5.8× energy savings. Soft NMR [61] makes
explicit use of error probability mass functions (PMFs) to provide up to 10× improvement
in robustness with 35% energy savings. Likelihood processing [60] utilizes bit-level error
statistics to perform inference and has been shown to provide up to 14× improvement in
robustness with 25% energy savings. In general, circuit level error resiliency techniques
[51, 62] enable operation close to point of ﬁrst failure (PoFF) or in the low error rate (<0.1%)
regime. In comparison, system level error resiliency techniques such as SEC [53, 57, 63, 61]
can operate in the high error rate (>10%) regime. Previous studies have shown that a
reduced precision replica ANT (RPR-ANT) protected ECG processor [64] and MRF stereo
matching block [65] can be fully functional at error rates of 58% and 21.3%, respectively.
However, the improved robustness in RPR-ANT comes at the price of 30% [64] to 40%
[65] complexity overhead due to the use of explicit estimator blocks. Thus, improved SEC
techniques need to be developed for them to be applicable to more complex signal processing
and inference kernels.
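To make the role of the explicit estimator concrete, the following Python sketch shows a decision rule of the kind commonly associated with ANT: the main block output is kept unless it deviates from the estimator output by more than a threshold, in which case the estimator output is used instead. The rule as written, the function names, and the bit-truncation estimator are assumptions of this sketch and are not specified in this section.

```python
def reduced_precision(x, drop_bits):
    """Hypothetical estimator: keep only the upper bits of a non-negative integer."""
    return (x >> drop_bits) << drop_bits

def ant_output(y_main, y_est, threshold):
    """Keep the main output unless it disagrees strongly with the estimate."""
    return y_main if abs(y_main - y_est) <= threshold else y_est

exact = 1000
y_est = reduced_precision(exact, 4)     # coarse but reliable estimate
threshold = (1 << 4) - 1                # largest possible estimation error
y_ok = exact                            # error-free main block output
y_bad = exact ^ (1 << 12)               # large-magnitude (MSB-type) error

print(ant_output(y_ok, y_est, threshold))    # main output is kept
print(ant_output(y_bad, y_est, threshold))   # estimator output is substituted
```

The residual error after correction is bounded by the estimation error, which is why a small-magnitude estimation error is acceptable for applications with statistical performance metrics.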
Figure 1.10: SEC techniques (a) ANT [57], (b) SSNOC [58], (c) soft NMR [59] and (d)
likelihood processing [60].
1.2 Dissertation Contributions and Organization
The design of ULP platforms for machine learning applications is challenging due to the en-
ergy delivery, communication and the tightly coupled computation and robustness challenges.
Conventional design approaches optimize the energy delivery, sensing, and information pro-
cessing separately. In this dissertation, we tackle these problems by (1) embedding infor-
mation processing into the energy delivery and sensing subsystem to eliminate the voltage
conversion loss and interface overhead, and (2) computing at the limits of energy eﬃciency
and thus robustness, and employing SEC techniques to compensate for the resultant hard-
ware errors. The major contributions and organization of the dissertation are summarized
as follows:
Chapter 2 presents the C-VRM approach where the information processing subsystem
is embedded into the energy delivery subsystem. The C-VRM employs multiple voltage do-
main stacking and core swapping to achieve high total system energy eﬃciency in near/sub-
threshold region. A prototype IC incorporating a C-VRM and an SC-VRM supplying energy
to an 8-tap fully folded FIR filter core is implemented in a 1.2 V, 130 nm CMOS process.
Measured results indicate that the C-VRM has up to 44.8% savings in system-level energy
per operation compared to the SC-VRM system, and an eﬃciency ranging from 79% to 83%
over an output voltage range of 0.52 V to 0.6 V.
Chapter 3 presents an in-sensor computing architecture which (mostly) eliminates the
sensor-processor interface by embedding information processing into the noisy sensor fabric
in analog and retraining the hyperparameters in order to compensate for non-ideal computa-
tions. The resulting architecture, referred to as the Compute Sensor - a sensor that computes
in addition to sensing - represents a radical departure from the conventional. A Compute
Sensor for image data is designed by embedding both feature extraction and classiﬁcation
functions in the analog domain in close proximity to the CMOS active pixel sensor (APS)
array. Signiﬁcant gains in energy eﬃciency are demonstrated using behavioral and energy
models in a commercial semiconductor process technology.
Chapter 4 presents embedded algorithmic-noise tolerance (E-ANT), a new low overhead
SEC technique aiming at enhancing the robustness and energy eﬃciency of signal processing
and machine learning kernels. E-ANT operates by reusing part of the main block operation
as estimation and thus embedding the estimator block into the main block. Such embed-
ding can be achieved at various levels. At the architecture level, we propose ARCH-ANT,
which uses data path decomposition to embed the reduced precision replica estimator into
the main block. At the algorithm level, we propose ALG-ANT, which employs additional
optimization constraints during algorithm to architecture mapping to design incremental
refinement architectures. System-level simulation results in a commercial 45 nm process show
large improvements in energy efficiency and robustness.
Chapter 5 presents several probabilistic error models for machine learning kernels im-
plemented on low-SNR circuit fabrics where errors arise due to voltage overscaling (VOS),
process variations, or defects. Four diﬀerent variants of the additive error model are proposed
that describe the error PMF. Analytical expressions for the error PMF are derived. Per-
formance prediction of a support vector machine (SVM) based classiﬁer using these models
indicates that when comparing Monte Carlo with HDL simulations, probability of detection
Pdet estimated using the model is within 3% for VOS error when the error rate pη ≤ 80%,
within 5% for process variation error and within 2% for defect errors when the defect rate
(the percentage of circuit nets subject to stuck-at faults) $p_{saf}$ is between $10^{-3}$ and 0.2.
Chapter 6 presents the design of error-resilient machine learning architectures by em-
ploying a distributed machine learning framework referred to as classiﬁer ensemble (CE). CE
combines several simple classiﬁers to obtain a strong one. In contrast, centralized machine
learning employs a single complex block. The random forest (RF) and the support vector ma-
chine (SVM), which are representative techniques from the CE and centralized frameworks,
respectively, are compared. Employing the breast cancer data set in the UCI machine learn-
ing repository and architectural-level error models in a commercial 45 nm CMOS process,
it is determined that RF-based architectures are signiﬁcantly more robust than SVM archi-
tectures in the presence of timing errors due to process variations in near-threshold voltage
(NTV) regions (0.3 V− 0.7 V). Additionally, an error weighted voting technique that incor-
porates the timing error statistics of the NTV circuit fabric is proposed to further enhance
the robustness of RF architectures. Simulation results conﬁrm that the error weighted voting
achieves a Pdet that varies by only 1.4%, which is 12× lower than SVM.
Chapter 7 concludes this dissertation and provides directions for future research activi-
ties.
Chapter 2
COMPUTE VRM
The emerging applications in machine learning require the design of ULP platforms with
limited energy supply. Energy per operation (Eop) of such systems is equal to the energy
extracted from the battery per operation Ebat and is given by Ebat = Eop = Evrm + Ecore,
where Evrm and Ecore are the energy consumption per instruction by the VRM and the core,
respectively. The conventional approach is to design the VRM to maximize its eﬃciency η
at a pre-speciﬁed core supply voltage Vdd and core/load current Iload (see Fig. 2.1).
Figure 2.1: Conventional design approach addresses VRM design and core design
separately. Due to the tradeoﬀ between dynamic and leakage energy, minimum energy
operation point (MEOP) of compute cores usually lies in near or sub-threshold regime.
However, the resulting high conversion ratio often results in poor VRM eﬃciency.
Sub/near-threshold computing (NTC) has been proposed [66, 67] to minimize Ecore by op-
erating the core close to its minimum energy operating point (MEOP) where Vdd is regulated
in the 400 mV-to-600 mV range. The battery voltage Vbat is typically in the range of 1.2 V to
3.6 V [68]. This large gap between Vbat and Vdd requires the voltage regulator module (VRM)
to achieve a high step-down ratio. Among the VRM topologies, the switched capacitor VRM
(SC-VRM) is attractive as it can achieve high conversion ratio and is amenable to on-chip
integration [15, 69]. Design of high-eﬃciency VRM for ULP platforms is challenging due to
the large step-down ratio [5, 12] and light load conditions due to NTC.
Various approaches have been proposed to address the eﬃciency issue in SC-VRM under
light load conditions. SC-VRM with pulse frequency modulation (PFM) control [5, 12] has
been employed to boost light load eﬃciency up to 74%. A hybrid converter [13] has been
proposed to address energy delivery for loads in the range of 5 nA-to-500 nA, with eﬃciency
up to 56%. The stacked voltage domain approach [14] has been proposed where multiple
cores are connected in series to lower the step-down ratio. This approach needs push and pull
linear regulators in order to compensate for voltage ﬂuctuation in the intermediate supply
nodes caused by current mismatch.
In this chapter, we propose the compute VRM (C-VRM) which exploits the similarity
between charge transfer in an SC-VRM and CMOS logic. Computation in CMOS occurs via
transfer of charge between supply/ground nodes and capacitive output nodes. This transfer
is controlled by MOS transistor switches. Energy delivery in a SC-VRM occurs in a similar
manner with power switches controlling the transfer of charge from the battery to the core.
The C-VRM exploits this similarity by replacing the power switches in a SC-VRM with
CMOS compute cores. In doing so, the proposed C-VRM provides the following advantages:
(1) eliminates driver loss, bottom plate capacitor loss, and charge transfer loss to enhance
voltage conversion eﬃciency, and (2) seamlessly integrates energy delivery and computation
to provide a uniﬁed platform that enables the minimization of total system (VRM+core)
energy Eop. The C-VRM concept is validated by: (a) developing energy models for the C-
VRM and the SC-VRM, and employing these in system simulations to evaluate the beneﬁts
of C-VRM in energy per operation Eop and eﬃciency η, and (b) implementing a prototype
IC incorporating a C-VRM and a SC-VRM supplying energy to an 8-tap folded FIR ﬁlter
core in a 1.2 V, 130 nm CMOS process to verify the beneﬁts of C-VRM via measured results.
The rest of the chapter is organized as follows: Section 2.1 describes the background of
the conventional SC-VRM and develops energy models to evaluate its eﬃciency. Section
2.2 presents the C-VRM and develops energy models to compare with the conventional SC-
VRM. Section 2.3 describes the prototype IC consisting of both the conventional SC-VRM
system and the C-VRM. Test results in Section 2.4 demonstrate the improvement in energy
and converter eﬃciency. Conclusions and future work are addressed in Section 2.5.
2.1 Background
This section reviews the design of a conventional SC-VRM system. An energy model is
derived to reveal the fundamental loss mechanisms in a SC-VRM. A core energy model is
also derived for use in system simulations in the following sections.
2.1.1 Intrinsic Loss in SC-VRM
A block diagram of a 2:1 SC-VRM is shown in Fig. 2.2(a), where a set of charge trans-
fer capacitors and switches are connected in diﬀerent conﬁgurations in each clock phase to
convert the voltage. Since the charge transfer procedure involves direct connection of volt-
age sources and capacitors, the current will be impulsive and lead to an intrinsic charge
transfer loss ECTL. As pointed out in [70, 71], ECTL depends on the operational domain
of the SC-VRM, i.e., complete charge, partial-charge, or no-charge. In near/sub-threshold
operation with light load, driver loss will degrade light load eﬃciency severely and should be
minimized. Thus, we assume that the SC-VRM is operating in the complete charge opera-
tion domain so as to maximize the charge transferred to the output per converter switching
cycle. Figure 2.2(b) illustrates the source of ECTL in the context of a simple 1:1 SC-VRM
where the charge is transferred from Vbat to Vdd with a ﬂying capacitor Csc. In Fig. 2.2(b),
it can be shown that in each phase (Φ1 and Φ2), the energy loss due to charge sharing is
$\frac{1}{2} C_{sc}(V_{bat} - V_{dd})^2$. Thus, the intrinsic charge transfer loss during every switching cycle is:
$$E_{CTL} = C_{sc}(V_{bat} - V_{dd})^2 \tag{2.1}$$
Note that (2.1) does not depend on switch resistances R1 and R2. Therefore, ECTL in (2.1)
represents a fundamental loss mechanism in a conventional SC-VRM. The C-VRM extracts
useful computation from this loss and thereby improves Eop.
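The independence of the charge sharing loss from the switch resistance can be checked numerically. The sketch below integrates the resistive dissipation while a capacitor charges through a resistor from an ideal source, assuming complete charging within each phase. The function name `charging_loss` and the component values are hypothetical choices for illustration.

```python
import math

def charging_loss(c, v_src, v0, r, steps=20_000):
    """Integrate i(t)^2 * r while capacitor c charges from v0 to v_src through r.

    Uses the midpoint rule over ten RC time constants (complete charging).
    """
    tau = r * c
    dt = 10.0 * tau / steps
    loss = 0.0
    for k in range(steps):
        t = (k + 0.5) * dt
        i = (v_src - v0) / r * math.exp(-t / tau)   # RC charging current
        loss += i * i * r * dt
    return loss

c_sc, v_bat, v_dd = 500e-12, 1.2, 0.6               # hypothetical values
per_phase = 0.5 * c_sc * (v_bat - v_dd) ** 2        # per-phase loss stated in the text
for r in (1.0, 1000.0):
    print(f"R = {r:7.1f} ohm: loss = {charging_loss(c_sc, v_bat, v_dd, r):.4e} J "
          f"(expected {per_phase:.4e} J)")
```

Both resistances give the same dissipated energy, so a lower switch resistance only shortens the charging time and does not reduce the loss.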
2.1.2 SC-VRM Energy Model
In order to evaluate the converter eﬃciency under diﬀerent load conditions, a power model
for the SC-VRM is necessary. For simplicity of analysis, we choose a 2:1 ladder SC-VRM as
shown in Fig. 2.2(a). However, the analysis method can be extended to a higher conversion
ratio. There are four major loss components in the conventional SC-VRM:
2.1.2.1 Charge Transfer Loss (ECTL)
As with any SC-VRM, there is the loss ECTL during each charge transfer. In [15, 72], SC-
VRM is modeled as an ideal transformer (see Fig. 2.2(c)) representing a conversion ratio of
N , and ECTL is captured by a series resistance Rctl, and is given by:
$$E_{CTL} = I_{core}^2 R_{ctl} T_{sw} \tag{2.2}$$
where Icore is the load current and Tsw is the switching period. Substituting $I_{core} = \frac{2 C_{sc} \Delta V}{T_{sw}}$
and $R_{ctl} = \frac{1}{4 C_{sc} f_{sw}}$ into (2.2), we get:
$$E_{CTL} = \frac{I_{core}^2}{4 C_{sc} f_{sw}} T_{sw} = C_{sc} (\Delta V)^2 \tag{2.3}$$
where fsw is the switching frequency of the SC-VRM, and ∆V is the difference between
$\frac{V_{bat}}{N}$ (ideal output) and the regulated Vdd. Note that (2.1) and (2.3) are identical when ∆V =
Vbat − Vdd.
2.1.2.2 Gate Drive Loss (EGDL)
The SC-VRM requires explicit power switches to transfer charge. A driver circuit, such
as a super buﬀer, is therefore needed, resulting in additional losses. Assuming the gate
capacitance of the power switch is Cswitch, the gate drive loss per instruction EGDL can be
expressed as:
$$E_{GDL} = C_{switch} V_{bat}^2 \frac{f_{sw}}{f_{clk-S}} \tag{2.4}$$
where EGDL is calculated per one core clock period Tclk−S = 1/fclk−S, and fclk−S is the
core clock frequency. This deﬁnition of EGDL allows a direct comparison with the energy
consumption of the core Ecore.
2.1.2.3 Bottom Plate Capacitor Loss (EBPCL)
The bottom plate capacitor Cbottom is the parasitic capacitor between the bottom plate of
Csc and the substrate (see Fig. 2.2(a)). Cbottom scales with the area of Csc and can be as high
as 5% of Csc [73]. Since the bottom plate of the Csc is not always grounded, during every
switching cycle, Cbottom will be charged and discharged, as shown in Fig. 2.2(a). Assuming
the ratio of Cbottom to Csc is γ, this will lead to an energy loss given by:
$$E_{BPCL} = \gamma C_{sc} V_{dd}^2 \frac{f_{sw}}{f_{clk-S}} \tag{2.5}$$
2.1.2.4 Control Loss (ECL)
Control loss represents a constant loss in the SC-VRM and will degrade light load eﬃciency.
Assuming the eﬀective load capacitance of control circuit is Cctrl, and the control circuit
frequency is fctrl, the control loss can be expressed as:
$$E_{CL} = C_{ctrl} V_{bat}^2 \frac{f_{ctrl}}{f_{clk-S}} \tag{2.6}$$
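The four loss components can be evaluated directly from (2.3)-(2.6). The Python sketch below also checks that the resistor model in (2.2), with the stated substitutions for Icore and Rctl, reduces to (2.3). All numeric values, including Cswitch and the three frequencies, are hypothetical and chosen only to exercise the formulas.

```python
def ctl_from_resistor_model(c_sc, delta_v, f_sw):
    """Evaluate (2.2) using the substitutions for I_core and R_ctl given in the text."""
    t_sw = 1.0 / f_sw
    i_core = 2.0 * c_sc * delta_v / t_sw
    r_ctl = 1.0 / (4.0 * c_sc * f_sw)
    return i_core ** 2 * r_ctl * t_sw

def sc_vrm_losses(c_sc, delta_v, c_switch, gamma, c_ctrl,
                  v_bat, v_dd, f_sw, f_ctrl, f_clk_s):
    """Return the SC-VRM loss components in joules."""
    return {
        "E_CTL": c_sc * delta_v ** 2,                          # (2.3)
        "E_GDL": c_switch * v_bat ** 2 * f_sw / f_clk_s,       # (2.4)
        "E_BPCL": gamma * c_sc * v_dd ** 2 * f_sw / f_clk_s,   # (2.5)
        "E_CL": c_ctrl * v_bat ** 2 * f_ctrl / f_clk_s,        # (2.6)
    }

losses = sc_vrm_losses(
    c_sc=500e-12, delta_v=0.05, c_switch=2e-12, gamma=0.02, c_ctrl=5e-12,
    v_bat=1.2, v_dd=0.55, f_sw=1e6, f_ctrl=1e6, f_clk_s=10e6,
)
for name, value in losses.items():
    print(f"{name}: {value:.3e} J")
```

Note that (2.4)-(2.6) are normalized to one core clock period through the factor 1/fclk−S, which is what allows them to be compared with Ecore.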
2.1.3 Core Energy Model
There are two types of energy consumption in a core operating in near/sub-threshold region:
dynamic energy and leakage energy. A uniﬁed model that accounts for both components has
been proposed in [74]:
$$E_{core} = \alpha C_{core} V_{dd}^2 + V_{dd} I_{leak}(V_{dd}) \frac{1}{f_{clk}} \tag{2.7}$$
Figure 2.2: Conventional SC-VRM architecture: (a) block diagram of a 2:1 SC-VRM, (b)
charge sharing loss mechanism, and (c) a simpliﬁed energy transfer model.
$$I_{leak}(V_{dd}) = \mu C_{ox} \frac{W}{L} (m-1) V_T^2 \, e^{\frac{-V_t}{m V_T}} \, e^{\frac{-\eta_d V_{dd}}{m V_T}} \left(1 - e^{\frac{-V_{dd}}{V_T}}\right) \tag{2.8}$$
where α is the core activity factor, Ccore is the load capacitance in the core, Vdd is the supply
voltage, Vt is the threshold voltage, VT is the thermal voltage, µ is the carrier mobility, Cox
is the gate capacitance per W/L, m is a constant related to the sub-threshold slope factor,
and ηd is the drain induced barrier lowering (DIBL) coefficient. This model captures the
trade-off between the dynamic and leakage energy, which leads to the MEOP [74], defined
via the 3-tuple ($E_{core}^*$, $V_{dd}^*$, $f_{clk}^*$), where $E_{core}^*$ is the energy at the MEOP, $V_{dd}^*$ is the optimum
voltage, and $f_{clk}^*$ is the energy-optimum frequency. The core is modeled as a resistor Rcore
in parallel with a leakage current source Ileak(Vdd) (see Fig. 2.2(c)).
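The way (2.7) and (2.8) give rise to an MEOP can be illustrated with a short numerical sweep. The sketch below needs a relation between fclk and Vdd, which this section does not give, so it assumes a simple sub-threshold delay model of the form Td = k·Ccore·Vdd/Ion. That delay model, the lumped prefactor `i0`, the constant `k_delay`, and every numeric value are hypothetical, and DIBL is neglected for simplicity.

```python
import math

V_THERMAL = 0.026   # thermal voltage V_T at room temperature, in volts

def i_leak(vdd, i0, vt, m, eta_d):
    """Leakage current in the form of (2.8); i0 lumps mu*Cox*(W/L)*(m-1)*V_T**2."""
    a = m * V_THERMAL
    return (i0 * math.exp(-vt / a) * math.exp(-eta_d * vdd / a)
            * (1.0 - math.exp(-vdd / V_THERMAL)))

def f_clk(vdd, i0, vt, m, c_core, k_delay):
    """ASSUMED delay model (not from the text): T_d = k_delay * C_core * Vdd / I_on."""
    i_on = i0 * math.exp((vdd - vt) / (m * V_THERMAL))
    return i_on / (k_delay * c_core * vdd)

def e_core(vdd, alpha, c_core, i0, vt, m, eta_d, k_delay):
    """Energy per operation as in (2.7): dynamic term plus leakage term."""
    e_dyn = alpha * c_core * vdd ** 2
    e_leak = vdd * i_leak(vdd, i0, vt, m, eta_d) / f_clk(vdd, i0, vt, m, c_core, k_delay)
    return e_dyn + e_leak

def find_meop(params, v_lo=0.2, v_hi=1.0, step=0.001):
    """Grid search for the supply voltage that minimizes e_core."""
    n = int(round((v_hi - v_lo) / step))
    grid = [v_lo + k * step for k in range(n + 1)]
    v_opt = min(grid, key=lambda v: e_core(v, **params))
    f_opt = f_clk(v_opt, params["i0"], params["vt"], params["m"],
                  params["c_core"], params["k_delay"])
    return v_opt, f_opt, e_core(v_opt, **params)

# Hypothetical parameter set; DIBL is neglected (eta_d = 0).
PARAMS = dict(alpha=0.1, c_core=100e-12, i0=0.1, vt=0.4, m=1.5,
              eta_d=0.0, k_delay=2000.0)
v_opt, f_opt, e_opt = find_meop(PARAMS)
print(f"V*dd = {v_opt:.3f} V, f*clk = {f_opt:.3e} Hz, E*core = {e_opt:.3e} J")
```

Below the minimum, the exponentially growing delay inflates the leakage term; above it, the quadratic dynamic term dominates.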
2.2 C-VRM System Design
This section presents the system design of the proposed C-VRM. An analytical energy model
for the C-VRM is developed to compare its eﬃciency with the conventional SC-VRM system.
2.2.1 Principle of Operation of the C-VRM
C-VRM utilizes computational cores as switches to perform computation and transfer charge.
The compute cores are used as distributed power switches and perform the dual functions
of energy transfer and information processing.
The proposed C-VRM operates in principle the same as an interleaved SC-VRM. To
illustrate this, Fig. 2.3 describes the operation of an interleaved 2:1 SC-VRM and a 2:1
C-VRM. For the interleaved SC-VRM, in the ﬁrst phase (Φ1), charge is stored in the ﬂying
capacitor C1 and released by C2; in the second phase (Φ2), charge is released by C1 and stored
in C2. The 2:1 C-VRM implements the same charge transfer function described above but
without explicit power switches/drivers. In Φ1, the core in the high voltage domain (CH) is
clock gated while the core in low voltage domain (CL) is active. Thus, charge is stored in C1
and released by C2. In Φ2 , CL is clock gated while CH is active, so that charge is released
Figure 2.3: The C-VRM principle for N = 2, shown for the interleaved SC-VRM, the simplified SC-VRM, and the compute VRM.
by C1 and stored in C2. The C-VRM achieves improved energy eﬃciency compared to the
conventional SC-VRM as the losses associated with the driver, bottom plate capacitor, and
intrinsic charge transfer are eliminated. Furthermore, it incorporates computation as an
intrinsic part of its energy delivery functionality.
The out-of-phase operation of CH and CL (core swapping) requires data transfer between
two voltage domains, as shown in Fig. 2.4. At the end of Φ1 and Φ2, data is transferred
between CL and CH by adding an extra core swapping cycle. The core swapping has
negligible eﬀect on total throughput, so long as the swap frequency is low compared to C-
VRM core clock frequency fclk−C . To ensure this condition, CH and CL employ continuous
voltage and frequency scaling (CVFS), where fclk−C tracks the decaying voltage (VCH or
VCL) across the active core. The voltage of the intermediate node (Vmid) is permitted to
vary by 80 mV to 200 mV. As a result, VCH and VCL vary between 500 mV and 700 mV, and
fclk−C tracks the instantaneous voltage by employing an on-chip critical path replica (CPR)
oscillator.
Figure 2.4: Data transfer in the C-VRM during core swapping.
2.2.2 C-VRM Energy Model
An energy model of the C-VRM is necessary to compare its system level energy consump-
tion Eop with that of the SC-VRM system. The C-VRM eliminates driver loss and charge
transfer loss associated with the conventional SC-VRM system. However, the variable core
voltage results in data transfer loss and increased core energy. There are three major energy
components in the C-VRM: core energy (Ecore−C), data transfer loss (EDTL), and control
loss (ECL−C), all of which need to be characterized.
Since the Vdd across each core during its operation varies, Ecore−C is time varying, as
shown in Fig. 2.5. Assume that M operations are completed during M clock cycles that
comprise the active period. In the mth (m = 1, 2, ...M) cycle, the average voltage across
the core is denoted as Vdd(m) and the clock period during the m
th operation is denoted as
Tclk−C(m). The supply voltage Vdd drops from a pre-deﬁned voltage Vdd(0) to another pre-
deﬁned voltage Vdd(M) over M clock cycles, as shown in Fig. 2.5. In the test chip, Vdd(0)
and Vdd(M) are chosen to be 500 mV and 700 mV, respectively, and the value of M ranges
from 96 to 131 depending on the core activity factor α. We also assume that CH and CL
see the same voltage proﬁle (Vdd(m)) during active periods in order to simplify the analysis.
Thus, Ecore−C , EDTL, and ECL−C can be calculated as
Figure 2.5: The variable supply voltage Vdd(m) results in a time varying clock period
Tclk−C(m).
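$$E_{core-C} = \frac{1}{M} \sum_{m=1}^{M} \left[\alpha C_{core} V_{dd}^2(m) + I_{leak}(m) V_{dd}(m) T_{clk-C}(m)\right] \tag{2.9}$$
$$E_{DTL} = \frac{C_{reg-C} V_{bat}^2}{M} \tag{2.10}$$
$$E_{CL-C} = \frac{C_{ctrl-C} V_{bat}^2 f_{ctrl-C}}{f_{clk-C}} \tag{2.11}$$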
where α is the core activity factor, Ileak(m) is the leakage current at the supply voltage
of Vdd(m), Creg−C is the total load capacitance of data transfer logic, Cctrl−C is the load
capacitance of the control circuitry, fctrl−C is the equivalent control frequency, and fclk−C is
the core clock frequency. Figure 2.6 shows a C-VRM with N cores (thus N voltage domains).
From the principle of charge conservation, the following set of equations holds:
$$Q = \alpha C_{core} V_{dd}(m) \tag{2.12}$$
$$Q \frac{N-1}{N} = \frac{C_{sc}}{N} V_{dd}(m-1) - \frac{C_{sc}}{N} V_{dd}(m) \tag{2.13}$$
where (2.12) describes the charge consumed by the core in the mth clock cycle, and (2.13)
describes the charge conservation at node a in Fig. 2.6. Equations (2.12) and (2.13) can be
used to solve for Vdd(m) (m = 1...,M) to obtain:
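$$V_{dd}(m) = \left[\frac{C_{sc}}{(N-1)\alpha C_{core} + C_{sc}}\right]^{m} V_{dd}(0) \tag{2.14}$$
Next, substituting m = M in (2.14), we solve for M as follows:
$$M = \frac{\ln \frac{V_{dd}(M)}{V_{dd}(0)}}{\ln \left[\frac{C_{sc}}{(N-1)\alpha C_{core} + C_{sc}}\right]} \tag{2.15}$$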
The clock period Tclk−C(m) is obtained as the average of the critical path delays at Vdd(m)
(Td(Vdd(m))) and Vdd(m− 1) (Td(Vdd(m− 1))):
$$T_{clk-C}(m) = \frac{T_d(V_{dd}(m)) + T_d(V_{dd}(m-1))}{2} \tag{2.16}$$
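The geometric voltage decay in (2.14) and the cycle count obtained by setting m = M can be checked against the charge balance in (2.12) and (2.13). The Python sketch below does this for one parameter set. The function names are hypothetical, and the capacitor value is chosen only so that M lands near the range reported for the test chip; it is not the test chip's value.

```python
import math

def decay_ratio(c_sc, c_core, alpha, n):
    """Per-cycle voltage ratio Vdd(m) / Vdd(m-1) implied by (2.12)-(2.13)."""
    return c_sc / ((n - 1) * alpha * c_core + c_sc)

def vdd_closed_form(m, vdd0, c_sc, c_core, alpha, n):
    """Closed-form Vdd(m) as in (2.14)."""
    return decay_ratio(c_sc, c_core, alpha, n) ** m * vdd0

def cycles_between(vdd0, vdd_end, c_sc, c_core, alpha, n):
    """Number of cycles M for Vdd to decay from vdd0 to vdd_end (m = M in (2.14))."""
    return math.log(vdd_end / vdd0) / math.log(decay_ratio(c_sc, c_core, alpha, n))

def charge_balance_residual(m, vdd0, c_sc, c_core, alpha, n):
    """Left side minus right side of (2.13), with Q taken from (2.12)."""
    v_prev = vdd_closed_form(m - 1, vdd0, c_sc, c_core, alpha, n)
    v_now = vdd_closed_form(m, vdd0, c_sc, c_core, alpha, n)
    q = alpha * c_core * v_now
    return q * (n - 1) / n - (c_sc / n) * (v_prev - v_now)

# Hypothetical values for illustration only.
c_sc, c_core, alpha, n = 10e-9, 100e-12, 0.3, 2
m_cycles = cycles_between(0.7, 0.5, c_sc, c_core, alpha, n)
print(f"M = {m_cycles:.1f} cycles")
print(f"residual of (2.13) at m = 10: {charge_balance_residual(10, 0.7, c_sc, c_core, alpha, n):.2e}")
```

Because the ratio depends on the product αCcore, a more active core discharges the capacitor faster and completes fewer cycles per active period.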
Therefore, the Eop of the conventional SC-VRM system is given by:
$$E_{op-SC} = E_{core} + E_{CTL} + E_{GDL} + E_{BPCL} + E_{CL} \tag{2.17}$$
where Ecore, ECTL, EGDL, EBPCL, and ECL are defined in (2.2)-(2.7). Similarly, the Eop of the
C-VRM is obtained as:
$$E_{op-C} = E_{core-C} + E_{DTL} + E_{CL-C} \tag{2.18}$$
where Ecore−C, EDTL, and ECL−C are defined in (2.9)-(2.11), and Vdd(m), M, and Tclk−C(m) are
obtained from (2.14)-(2.16). Measured results from a prototype test chip in 130 nm CMOS
(see Fig. 2.21) indicate that (2.17) and (2.18) accurately model the energy consumption
of the SC-VRM and the C-VRM, respectively. Energy savings are obtained if Eop−C <
Eop−SC. Next, we determine conditions under which the C-VRM is more energy efficient
than an SC-VRM system.
2.2.3 C-VRM System Design
In the rest of this chapter, we will assume that the C-VRM has N = 2 cores to simplify
the analysis. System simulations are performed to compare the energy eﬃciencies of the
C-VRM and the SC-VRM system. The battery voltage Vbat is assumed to be 1.2 V. For the
SC-VRM, we use an N = 2 ladder topology as shown in Fig. 2.2(a), with ﬂying capacitor Csc
chosen to be 500 pF. Cswitch is chosen such that the SC-VRM is operating in slow switching
Figure 2.6: The principle of charge conservation in the C-VRM.
limit (SSL) and fast switching limit (FSL) boundary to balance shunt and series losses. The
bottom plate capacitance Cbottom is assumed to be 2% of Csc and Cctrl is assumed to be 1%
of Csc. For the C-VRM, we also choose the N = 2 topology as shown in Fig. 2.3. For
fairness of comparison, we constrain the total charge transfer capacitance (C1 + C2) in the
C-VRM to equal Csc in the SC-VRM. The 500 pF capacitor is split equally between C1 and
C2. Each 250 pF capacitance supplies one of the cores. We also keep the control loss of the
C-VRM the same as the SC-VRM. We assume the same 100 pF Ccore for both the SC-VRM
system and the C-VRM. The average activity factor α is assumed to be 0.3. The switching
frequency fsw is swept to generate Vdd in the range of 0.42 V to 0.6 V. Energy losses and
core energy are calculated via the energy model developed in previous sections.
To compare energy eﬃciency of the SC-VRM and C-VRM, we deﬁne the eﬀective Vdd
(Vdd,eff ) as the Vdd under which the MAC core in the SC-VRM will give the same throughput
as the MAC core in the C-VRM, i.e., the SC-VRM clock period Tclk−S(Vdd,eff ) equals the
average C-VRM clock period:
$$T_{clk-S}(V_{dd,eff}) = \frac{1}{M}\sum_{i=1}^{M} T_{clk-C}(i) \qquad (2.19)$$
where Tclk−C(i) is the C-VRM clock period in the ith cycle. Substituting (2.15), (2.16) into
(2.19) for N = 2, we obtain:
$$T_{clk-S}(V_{dd,eff}) = \frac{\ln\left[\frac{C_{sc}}{C_{core}+C_{sc}}\right]}{\ln\left[\frac{V_{dd}(M)}{V_{dd}(0)}\right]} \sum_{i=1}^{M} \frac{T_d(V_{dd}(i)) + T_d(V_{dd}(i-1))}{2} \qquad (2.20)$$
Substituting (2.14) into (2.20), we obtain the relation between Vdd,eff and Vdd(0) and
Vdd(M) as follows:
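$$T_{clk-S}(V_{dd,eff}) = \frac{\ln\left[\frac{C_{sc}}{C_{core}+C_{sc}}\right]}{\ln\left[\frac{V_{dd}(M)}{V_{dd}(0)}\right]} \sum_{i=1}^{M} \frac{T_d\left(\left[\frac{C_{sc}}{\alpha C_{core}+C_{sc}}\right]^{i} V_{dd}(0)\right) + T_d\left(\left[\frac{C_{sc}}{\alpha C_{core}+C_{sc}}\right]^{i-1} V_{dd}(0)\right)}{2} \qquad (2.21)$$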
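Under the assumption that CH and CL see the same voltage profile Vdd(m) during their
active periods, we can substitute Vdd(0) = Vbat/2 + ∆V and Vdd(M) = Vbat/2 − ∆V into (2.21) to obtain:
$$T_{clk-S}(V_{dd,eff}) = \frac{\ln\left[\frac{C_{sc}}{C_{core}+C_{sc}}\right]}{\ln\left[\frac{V_{bat}/2-\Delta V}{V_{bat}/2+\Delta V}\right]} \times \sum_{i=1}^{M} \frac{T_d\left(\left[\frac{C_{sc}}{\alpha C_{core}+C_{sc}}\right]^{i}\left(\frac{V_{bat}}{2}+\Delta V\right)\right) + T_d\left(\left[\frac{C_{sc}}{\alpha C_{core}+C_{sc}}\right]^{i-1}\left(\frac{V_{bat}}{2}+\Delta V\right)\right)}{2}$$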
For near threshold operation, it is diﬃcult to obtain a precise analytical expression for
this delay. Simulation results were used to extract the delay vs. Vdd curve of the critical
path and Vdd,eff can be solved numerically for diﬀerent values of ∆V . Figure 2.7 shows that
while the actual voltage range might go beyond Vbat/2 (0.6 V in the simulation), the eﬀective
voltage is less than the ideal output voltage Vbat/2.
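The numerical solution described above can be sketched as follows. This is an illustration only: the exponential `delay_model` and its constants are hypothetical stand-ins for the simulated delay vs. Vdd curve (they are not fitted to the test chip), the SC-VRM clock period is assumed to equal the critical-path delay at Vdd,eff, and M is taken as a small integer. The capacitor values and activity factor follow the system simulation setup.

```python
import math

def delay_model(vdd, t0=1e-9, v0=0.6, slope=0.08):
    """Hypothetical critical-path delay in seconds (assumed exponential in Vdd)."""
    return t0 * math.exp(-(vdd - v0) / slope)

def average_cvrm_period(vdd0, ratio, m, delay=delay_model):
    """Mean of Tclk-C(i), i = 1..M, with Tclk-C(i) = [Td(Vdd(i)) + Td(Vdd(i-1))] / 2."""
    total = 0.0
    for i in range(1, m + 1):
        v_now = (ratio ** i) * vdd0         # Vdd(i)
        v_prev = (ratio ** (i - 1)) * vdd0  # Vdd(i-1)
        total += 0.5 * (delay(v_now) + delay(v_prev))
    return total / m

def solve_vdd_eff(target_period, v_lo, v_hi, delay=delay_model, tol=1e-9):
    """Bisection for Td(Vdd_eff) = target_period; assumes delay falls as Vdd rises."""
    while v_hi - v_lo > tol:
        mid = 0.5 * (v_lo + v_hi)
        if delay(mid) > target_period:
            v_lo = mid  # too slow at this voltage, so search higher
        else:
            v_hi = mid
    return 0.5 * (v_lo + v_hi)

# Values from the simulation setup: 250 pF per core, Ccore = 100 pF, alpha = 0.3.
c_sc, c_core, alpha = 250e-12, 100e-12, 0.3
ratio = c_sc / (alpha * c_core + c_sc)   # per-cycle voltage ratio used in (2.21)
vdd0, m = 0.65, 3                        # illustrative starting voltage and cycle count
vdd_m = (ratio ** m) * vdd0              # Vdd(M)
t_avg = average_cvrm_period(vdd0, ratio, m)
vdd_eff = solve_vdd_eff(t_avg, vdd_m, vdd0)
print(f"Vdd(M) = {vdd_m:.3f} V, Vdd,eff = {vdd_eff:.3f} V")
```

Because the assumed delay curve is convex, the effective voltage in this sketch falls below the midpoint of the voltage swing, which matches the qualitative trend reported for Fig. 2.7.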
Figure 2.7: The effective voltage Vdd,eff as a function of ∆V = Vbat/2 − Vdd(M).
In the first experiment, we assume Creg−C is only 1% of Ccore so that the data transfer
overhead is small. Figure 2.8(a) shows the different energy components for the conventional
SC-VRM as a function of Vdd. We denote ESHUNT = EGDL + EBPCL + ECL since driver loss,
bottom plate capacitance loss and control loss can all be represented as parallel equivalent
resistors in Fig. 2.2(c). From Fig. 2.8(a), we can see that as Vdd increases, ECTL decreases
due to reduced ∆V according to (2.3), but ESHUNT increases due to increased fsw. In the
super-threshold region, as Vdd decreases, Ecore decreases because dynamic energy dominates. As
Vdd further decreases to sub/near-threshold region, Ecore increases due to the exponential
increase of propagation delay. Due to the trade-oﬀ between ESHUNT , ECTL and Ecore, the
system MEOP (S-MEOP) voltage V*dd,S-MEOP is around 0.46 V. Eop increases as
Vdd deviates from V*dd,S-MEOP.
Figure 2.8(b) shows the diﬀerent energy components of the C-VRM as a function of Vdd,
where EDTL and ECL−C are lumped together as ELOSS for simplicity. The Vdd value for the
C-VRM is the Vdd,eff deﬁned in (2.19). Figure 2.8(b) illustrates that compared with the
conventional SC-VRM system, the C-VRM has a higher Ecore due to its variable voltage
operation. However, the C-VRM eliminates EGDL, EBPCL and ECTL associated with the
SC-VRM system. Furthermore, Fig. 2.8(b) indicates that ELOSS becomes higher when Vdd
is close to the ideal output (Vbat/2 = 0.6 V) due to the increased data transfer frequency.
ELOSS also increases as the core enters sub-threshold region due to increased delay.
Figure 2.8(c) compares the Eop−SC and Eop−C , as deﬁned in (2.17) and (2.18), and shows
that the C-VRM has lower Eop compared with the SC-VRM system across the entire operating
range from 0.42 V to 0.6 V. Large energy savings can be achieved either at high Vdd
(close to the ideal output of 0.6 V) due to the elimination of EGDL and EBPCL, or when Vdd is
reduced below V*dd,S-MEOP of 0.46 V due to the elimination of ECTL.
Figure 2.8(d) shows the eﬃciency comparison of the SC-VRM system and the C-VRM.
This ﬁgure illustrates that the C-VRM can maintain high eﬃciency (ηC−V RM > 93%) across
the operating range from 0.42 V to 0.6 V, while the SC-VRM can only achieve ηSC−V RM ≈
80% at around 0.54 V. As Vdd deviates from this eﬃciency maximum voltage, ηSC−V RM
drops quickly due to increased ESHUNT or ECTL.
The energy beneﬁt of the C-VRM depends on the assumption that the data transfer
loss EDTL is small. This assumption holds if the core swapping frequency fswap is small
compared to fclk−C , and Creg−C is small. To illustrate this point, we perform the same set
of experiments as in Fig. 2.8 but with Creg−C increased to 10% of Ccore. Figure 2.9 shows
the resulting Eop and η. Figure 2.9(a) shows that when EDTL is large, it is possible that
Eop−C > Eop−SC. However, energy savings are preserved when Vdd is close to the ideal
output of Vbat/2 = 0.6 V or when Vdd is far below V*dd,S-MEOP in the sub-threshold region. Figure
2.9(b) shows that when EDTL is large, ηC−V RM decreases dramatically when Vdd increases due
to the increased core swapping frequency. Therefore, to achieve maximum energy savings,
EDTL of the C-VRM needs to be kept to a minimum.
2.3 C-VRM Prototype IC Design
A prototype IC was designed in a 1.2 V, 130 nm CMOS process to compare the SC-VRM
system and the C-VRM. This section describes the prototype IC.
2.3.1 Chip Architecture
To enable a direct comparison between the SC-VRM system and the C-VRM, we ﬁx the
charge transfer capacitor (Csc in Fig. 2.2(a) and C1 + C2 in Fig. 2.4) to 250 pF and employ an
8-bit multiply-accumulator (MAC) as the core in both systems. The SC-VRM system and
the C-VRM are optimized to supply an Icore up to 1 mA at a nominal Vdd of 500 mV. To
reduce EDTL, a folded MAC architecture is adopted.
Figure 2.10 shows the top level chip architecture. The chip consists of a 2:1 SC-VRM
system, a 2:1 C-VRM and a test block. The 2:1 SC-VRM system consists of a ladder SC-
VRM delivering energy to a core. The core is a MAC with an 8 bit array multiplier and a
Figure 2.8: Comparison of SC-VRM system and C-VRM with 1% data transfer overhead:
(a) energy vs. output Vdd of SC-VRM, (b) energy vs. Vdd of C-VRM, where EDTL and
ECL−C were lumped together as ELOSS, (c) Eop comparison of SC-VRM and C-VRM, and
(d) eﬃciency comparison of SC-VRM and C-VRM.
Figure 2.9: System comparison between SC-VRM system and C-VRM with 10% data
transfer overhead: (a) Eop comparison of SC-VRM and C-VRM, and (b) eﬃciency
comparison of SC-VRM and C-VRM.
Figure 2.10: The C-VRM prototype IC architecture.
ripple carry adder. It is conﬁgured as an 8-tap folded FIR ﬁlter. The 2:1 C-VRM consists
of cores MAC_H and MAC_L, which are identical to the core in the 2:1 SC-VRM system.
A CPR oscillator with tunable delay is designed to continuously scale the core frequency
fclk−C with Vdd. The test block consists of a vector generator to feed input data to the cores
and level shifters to transfer output data for oﬀ-chip processing.
2.3.2 SC-VRM Design
Figure 2.11 shows the detailed architecture of the 2:1 SC-VRM, which has a ladder topology
containing four power switches and one 250 pF on-chip MIM ﬂying capacitor Csc. Csc is
chosen to supply a maximum Icore of 1 mA with a maximum fsw of 10 MHz. Transistors M1
and M2 are chosen to be PMOS and NMOS, respectively, to remove the threshold voltage
drop. M3 and M4 are chosen to be NMOS because the regulated output Vdd is always
lower than Vbat/2. The power switches are sized to balance shunt and series loss according to
[15, 75]. Figure 2.11 also shows the control loop of the SC-VRM. A hysteresis PFM control
[76] is realized via a strong ARM comparator and a current starved oscillator. The output
Figure 2.11: The 2:1 ladder SC-VRM. [Schematic annotations: Csc = 250 pF; device sizes 100 µ/0.12 µ, 50 µ/0.12 µ, 100 µ/0.12 µ, and 100 µ/0.12 µ listed alongside M1 to M4.]
of the oscillator is passed through the driver circuitry and converted to non-overlapping
two-phase clock signals.
Figure 2.12 shows the circuit diagram of the strong ARM comparator and the current
starved oscillator. We adopt the dynamic comparator to avoid steady state current, which
degrades light load efficiency. In the pre-charge phase, MP1-MP4 pre-charge the output
nodes and the drains of MN3 and MN4 to Vdd. In the evaluation phase, the drains of MN3
and MN4 are discharged at different rates according to inputs Vip and Vin, respectively. If
Vip is higher than Vin, MN1 will turn on prior to MN2, and the positive feedback formed by
MN1, MN2, MP5 and MP6 will discharge Von and charge Vop back to Vdd. The pre-charge
transistors are kept to a minimum size to reduce the load capacitance. The input pair and
cross coupled inverter MN1, MN2, MP5 and MP6 are sized to trade oﬀ speed and oﬀset. The
current starved oscillator contains mirror transistors MN1-MN3 and MP1-MP3. Transistors
MN4-MN6 and MP4-MP6 form a 3 stage ring oscillator. They are sized to minimize the
power consumption while providing suﬃcient driver speed.
Figure 2.13 shows the non-overlapping circuit with an embedded driver. The complemen-
tary clock signal is fed into a cross coupled NOR gate to add dead time, tp, between Φ1 and
Φ2. The tp is adjusted by changing the number of the buﬀer chain stages. A superbuﬀer is
added at the end of the buﬀer chain to drive the power switches.
Figure 2.12: The strong ARM comparator and the current starved oscillator employed in
the 2:1 SC-VRM.
Figure 2.13: Non-overlapping driver employed in the 2:1 SC-VRM.
2.3.3 C-VRM Design
Figure 2.14 shows the design of the C-VRM. The 2:1 C-VRM contains MAC_H and MAC_L
as the two compute cores, level shifters (LS) for data transfer, and a control block to switch
the compute cores from active to inactive modes.
Figure 2.14: The 2:1 C-VRM block diagram.
Figure 2.15(a) shows the circuit diagram of the control block. The shaded blocks operate
in the low voltage domain, while the unshaded blocks operate in the high voltage domain.
The control block consists of an RC delay based frequency detector, a latch and a pulse
generator for core swapping. Figure 2.15(b) shows the operation of the frequency detection
block. During the pre-charge phase, C2 is connected to Vmid; during the evaluation phase, C2
is discharged through the RC circuit formed by R2 and C2, with the discharging time set to
1/(2fclk). If Va drops below the threshold during this period, Vb will rise to Vmid and the state
of the latch (ENl) will be set to 0, disabling MAC_L. The pulse generator is realized by
applying a NOR operation to ENl and a delayed version of the signal, so that a pulse is
generated during the 1-0 transition of ENl. The pulse forces the state of the latch in the high
voltage domain (ENh) to 1 through the pulldown transistor M1, thus enabling MAC_H.
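To make the detection condition concrete, a first-order sketch is given below. It assumes, beyond what the text states, that Va decays as a single-pole exponential from Vmid toward ground with time constant R2C2, and that the detector trips at a fixed threshold V_T below Vmid; both V_T and the single-pole decay are illustrative assumptions.

$$V_a(t) = V_{mid}\, e^{-t/(R_2 C_2)}, \qquad V_a\!\left(\frac{1}{2 f_{clk}}\right) < V_T \;\Longleftrightarrow\; f_{clk} < \frac{1}{2 R_2 C_2 \ln(V_{mid}/V_T)}$$

Under these assumptions a core swap is triggered once the core clock frequency falls below a bound set by R2C2, so adjusting the RC time constant moves the voltage at which swapping occurs.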
Figure 2.16 shows the level shifter used in the C-VRM. The level shifter will perform
bidirectional transfer of the data between MAC_H and MAC_L. The conventional level
shifter design shown in Fig. 2.16(a) is not suitable for two reasons: (1) two diﬀerent circuit
topologies are needed to perform high-to-low and low-to-high level conversion, respectively,
and (2) there is direct-path current through MN1, MP1 or MN2, MP2 during the shift. Both
of these will result in additional EDTL in (2.18). Therefore, we adopt a capacitor coupling
based dynamic level shifter in Fig. 2.16(b). Figure 2.16(b) also shows the operation during
high-to-low data transfer. In the pre-charge phase, the capacitor Cls is pre-charged to Vmid.
When MAC_H is disabled by changing ENh from 1 to 0, a one clock cycle pulse shift_hl
is inserted before ENl goes high. This will turn on the transmission gate and shift data_h
from MAC_H to MAC_L. After the shift operation, Cls is charged to Vmid before the next
operation. The dynamic level shifter achieves bidirectional shifting and removes the direct
path loss associated with the conventional design.
2.3.4 CPR Oscillator
The CPR oscillator we employed in this chapter is similar to the one used in [62], which is
an inverter chain based ring oscillator with tunable delay cells (see Fig. 2.17(a)). A chain
of inverters is used instead of a direct mapping of the core critical path components to
provide a near-50% duty cycle clock. The number of inverters is calculated based on a ﬁrst-
order approximation using the Elmore delay formula for a resistor-capacitor network. In this
design, 68 inverters are used to replicate the critical path. To account for PVT variation, a
tuning circuit is added to adjust the delay margin provided by the CPR oscillator to ensure
the clock period is greater than the actual critical path. A digital control is chosen over
voltage control for simplicity and reliable bias in diﬀerent voltage domains. Figure 2.17(b)
shows the detail of the inverter delay cell. Each delay cell consists of a long path and a
short path, and the selection between the paths is controlled by a MUX. Each delay cell uses a single-bit
control to minimize the capacitive loading at the output of the delay cell. In the design, 4
delay cells are added to provide tuning range of 16 inverter delays.
Figure 2.18(a) shows the postlayout simulation of MAC unit critical path delay and the
CPR oscillator clock period in diﬀerent process corners. It can be seen that the CPR
oscillator is able to track the MAC unit critical path and ensure that the clock period is
larger than the MAC unit critical path delay across all process corners. Figure 2.18(b)
shows the CPR oscillator clock period with diﬀerent tuning settings. The 8 delay cells
Figure 2.15: Control block of the C-VRM: (a) circuit schematic, and (b) principle of
operation.
Figure 2.16: Bidirectional level shifter: (a) conventional design of low-to-high and
high-to-low level shifters, and (b) the proposed capacitor coupling based dynamic level
shifter.
Figure 2.17: Architecture of: (a) the CPR oscillator, and (b) the tunable delay cell.
Figure 2.18: Post layout simulations showing: (a) the CPR oscillator clock period and
MAC critical path delay, and (b) the CPR oscillator clock period with diﬀerent delay cells
selected.
provide a sufficiently large tuning range to account for the delay variations.
2.4 Test Results
Figure 2.19 shows the operation of the SC-VRM during startup and steady state. During
steady state, PFM control is employed to scale the switching frequency fsw with Icore to
reduce the driver and bottom plate capacitance loss.
Figure 2.20 shows the operation of the C-VRM including core swapping and data transfer.
The startup circuitry for the C-VRM is implemented on the board. When the MAC core
voltage in one voltage domain decreases to 500 mV, core swapping is performed by employing
Figure 2.19: Measured SC-VRM operation during start up and steady state.
Figure 2.20: Measured C-VRM core swapping and data transfer.
the on-chip generated enable signals ENh and ENl for MAC_H and MAC_L, respectively.
During data transfer, when MAC_L is enabled by ENl, a one cycle switching signal sw_h2l
is generated to shift data from the high voltage to the low voltage domain.
Figure 2.21 compares the measured Eop and the eﬃciency of the fabricated 2:1 SC-VRM
system and the 2:1 C-VRM. The simulated results from the energy model in Section 2.2
are also shown as dashed lines to demonstrate the accuracy of the model.
The eﬃciency vs. Vdd is obtained by changing the reference voltage for the SC-VRM, and
by changing the RC time constant of the frequency detector in the C-VRM controller. To
perform efficiency measurements for the C-VRM, the compute cores are given supply pins
separate from those of the controller and level shifters, which allows the loss current
Iloss (control loss and data transfer loss) and the core current Io to be measured
separately. The efficiency is then calculated using η = Io/(Io + Iloss). As shown in Fig.
2.21(a), the C-VRM has lower Eop across all measured Vdd from 0.52 V to 0.59 V. Since
C-VRM has a continuously varying voltage, the Vdd is the eﬀective output voltage delivering
the same throughput (11 MHz-to-20 MHz) as the SC-VRM system. Vdd of C-VRM cannot
extend below 0.52 V due to the limitations of the capacitively coupled level shifter. The
SC-VRM system has high system energy overhead both in high Vdd, due to increased driver
loss EGDL, and in low Vdd, due to control loss ECL and low fclk−C . This is also indicated
by the system level simulations in Fig. 2.8(a). The absence of the driver circuits and the
use of a low power frequency detection scheme enables the C-VRM to achieve a maximum
of 44.8% energy savings compared to the SC-VRM system. Figure 2.21(b) shows that the
C-VRM achieves eﬃciency > 79% across the entire tested Vdd range. As a comparison, the
SC-VRM has a peak eﬃciency of only 54% due to ESHUNT and ECTL. Figure 2.21(b) also
plots the efficiency of previously published SC-VRM designs operating at power levels from
1 µW to hundreds of µW [67, 13, 77].
Table 2.1 shows the design speciﬁcations and performance of C-VRM compared with
previously published works. The system energy per instruction/K-gate is calculated by
dividing the system EPI by the estimated number of gates for [66] and [67]. For [12] and
[13], since no compute cores are included on-chip, the system energy per instruction/K-gate
is estimated using the core EPI and gate count of our design and the eﬃciency reported
in [12] and [13]. The comparison in Fig. 2.21(b) and in Table 2.1 shows that the C-VRM
achieves the highest efficiency (83%) at the designed power level and the lowest EPI/k-gate
of 0.79 pJ.
Figure 2.22 shows the die photo of the test chip, which is fabricated in a 1.2 V, 130 nm
CMOS process and has an area of 2 mm × 2 mm. Note that the SC-VRM requires a 2 nF
off-chip decoupling capacitor. If integrated on-chip, this capacitor would occupy an
additional 0.49 mm² of area.
Figure 2.21: The C-VRM test chip measurement results: (a) Eop comparison, and (b)
eﬃciency comparison.
Figure 2.22: Die photo of the test chip.
Table 2.1: Comparison with Previously Published Work

                              [67]                    [12]            [13]              [66]              C-VRM
Technology                    65 nm                   45 nm           130 nm            180 nm            130 nm
Conversion ratio              1/3, 1/2, 2/3, 3/4, 1   2/3             1/5               1/6               1/2
VRM topology                  SC-VRM                  SC-VRM          SC-VRM+LDO        SC-VRM+LDO        Compute VRM
Input voltage                 1.2 V                   1.8 V           3.6 V             3.6 V             1.2 V
Output power/current level    1-500 µW                100 µA-9 mA     2.5 nW-254 nW     550 pW-7.7 µW     1-60 µA
Csc                           600 pF                  534 pF          800 pF            NA                250 pF
Cout                          NA                      700 pF          NA                NA                0
Max. driver switching freq.   15 MHz                  30 MHz          2 kHz             1.2 MHz           No driver
Efficiency                    75%                     55%             56%               41.6%             83%
  at Vdd                      0.5 V                   0.9 V           0.44 V            0.4 V             0.5 V-0.7 V
  at Io                       100 µA                  200 µA          300 nA            -                 40 µA
Eop/k-gate                    1.09 pJ                 1.19 pJ (1)     1.17 pJ (1)       0.88 pJ           0.79 pJ

(1) System energy per instruction is calculated based on the core power of the filter core in this work and the
reported efficiency.
2.5 Conclusions
In this chapter, we propose the C-VRM, a uniﬁed architecture for energy delivery and
computation, to overcome the intrinsic loss and drive circuit overhead of the conventional
SC-VRM. The C-VRM employs multiple voltage domain stacking, core swapping, and CVFS
to achieve high energy eﬃciency in the sub/near-threshold region. This work shows that by
combining the compute core and the energy delivery block, the system energy eﬃciency
can be signiﬁcantly improved. It opens the possibilities of embedding more sophisticated
computational blocks into SC-VRMs and other switching power delivery blocks. Further
study could develop C-VRM architectures based on multi-ratio or
multi-phase SC-VRMs.
Chapter 3
COMPUTE SENSOR
In the previous chapter, the energy delivery eﬃciency is improved by embedding information
processing into the VRM. In this chapter, we explore a similar approach which embeds infor-
mation processing into the sensing circuits to drastically alleviate the communication chal-
lenge between the sensing and information processing subsystems. Speciﬁcally, we present
an in-sensor computing architecture which (mostly) eliminates the sensor-processor interface
and thus resolves the communication challenge by embedding inference computations in the
noisy sensor fabric in analog, and retraining the hyperparameters in order to compensate
for non-ideal computations. The resulting architecture, referred to as the Compute Sen-
sor - a sensor that computes in addition to sensing - represents a radical departure from
the conventional architecture. We show that a Compute Sensor for image data can be
designed by embedding both feature extraction and classiﬁcation functions in the analog
domain in close proximity to the CMOS active pixel sensor (APS) array. Signiﬁcant gains
in energy-eﬃciency are demonstrated using behavioral and energy models in a commercial
semiconductor process technology. In the process, the Compute Sensor creates a unique
opportunity to develop machine learning algorithms for information extraction from data on
a noisy underlying computational fabric.
Figure 3.1(a) shows a conventional architecture of an embedded vision system. Image
data is ﬁrst acquired via an Mr row × Mc column active pixel sensor (APS) array whose
analog pixel values are sensed sequentially in a row-wise fashion, and then converted into
digital samples by the sample-and-hold (S/H) and the analog-to-digital converter (ADC),
and then streamed out by the read-out (RD) circuitry to a back-end digital processor which
implements feature extraction and classification functions to obtain the final decision ŷ. A
digital trainer block computes the hyperparameters in supervised learning mode. This physical
separation between sensing and processing subsystems is unavoidable because sensing is
intrinsically an analog process while information processing is intrinsically digital. Exclud-
ing the energy delivery loss, the energy dissipation in such a system is dominated by two
sources:
• The energy required to move the data over the sensor-processor interface comprising
  the ADC, RD and the interconnect to the digital processor, i.e., communication energy,
  and
• The energy consumed in processing the data using digital circuits which by nature are
  high signal-to-noise ratio (SNR), i.e., computational energy.
We employed energy data from [6] for a CMOS image sensor consisting of a 32 × 32 APS
array and the associated interface circuits, and estimated the computational energy needed to
implement a principal component analysis (PCA) [78] engine and a support vector machine
(SVM) [79] in a 65 nm CMOS process operating at a throughput of 32 frames/s. This analysis
indicates that the communication and computational energies are approximately 53% and
41%, respectively, for a combined total of 94%. An impactful solution to the energy problem
needs to reduce both components of energy - communication and computational energy.
In this chapter, we propose the Compute Sensor shown in Figure 3.1(b) - a sensory system
that senses and processes the sensed data thereby integrating both data acquisition and
information extraction functionalities. The Compute Sensor architecture consists of a data
processing engine and a training engine. The data processing engine is a cascade of: (1) the
APS array which is identical to the conventional architecture, (2) a bit-line processor (BLP)
whose physical dimensions are matched to that of the APS array in order to perform pixel-
wise operation such as sample-and-hold (S/H), scaling, and absolute diﬀerence but no ADC,
(3) a cross bit-line processor (CBP) to perform data dimensionality reduction operations
such as dot product, ﬁltering, sum-of-absolute diﬀerence (SAD), mean square, followed by
an ADC that operates on the reduced dimensionality data and feeds it into (4) the residual
digital processor (RDP) which implements very simple digital computations needed to obtain
the final decision ŷ. Unlike the conventional architecture, both the BLP and CBP in the
Compute Sensor operate in the analog domain. The trainer is digital.
Figure 3.1: A typical embedded vision platform: (a) conventional architecture, and (b) the
proposed Compute Sensor architecture.
The Compute Sensor eliminates both the traditional sensor-processor interface, and the
high-SNR/high-energy digital processing by moving feature extraction and classiﬁcation
functions into the analog domain in close proximity to the APS array. The Compute Sen-
sor leverages the intrinsic ability of machine learning algorithms to extract information from
noisy and often incomplete data to provide robust inference in presence of non-ideal computa-
tions. We demonstrate a Compute Sensor that incorporates a PCA-based feature extractor
and a support vector machine (SVM). Using circuit characterized behavioral and energy
models in a 65 nm CMOS process, we show that the Compute Sensor is able to achieve a
detection accuracy greater than 94.7% using the Caltech101 dataset [80], which is within
0.5% of that achieved by an ideal digital implementation. Furthermore, the Compute Sensor
is able to compensate for variations in the electrical parameters of the transistors in the APS
array caused by ﬁnite tolerances of the semiconductor manufacturing process by retraining
in presence of these non-idealities. As a result the Compute Sensor consumes 7× to 17×
less energy than the conventional architecture for the same level of accuracy. Thus, this chapter
highlights the potential for conducting algorithmic research that accounts for platform
resource-constraints such as energy, storage, and computation.
The rest of the chapter is organized as follows. Section 3.1 presents the necessary back-
ground and establishes notation. Section 3.2 presents the Compute Sensor architecture
incorporating PCA and SVM algorithms, and the behavioral and energy models in a 65 nm
CMOS process. Simulation results are shown in Section 3.3, and discussions are provided
in Section 3.4.
3.1 Background
3.1.1 Active Pixel Sensor
Solid state imaging devices can be classiﬁed into two categories, i.e., charge coupled device
(CCD) sensor and CMOS sensor. The CMOS image sensor has gained much popularity due
to its low voltage, low power operation, and compatibility with standard CMOS technologies
[81]. APS is by far the most widely employed CMOS image sensor architecture due to the
speed advantage and low noise. The architecture of the 3-transistor (3T) APS and associated
read-out (RD) circuit is shown in Fig. 3.2(a), where each APS consists of a photodiode (PD)
as the optical detector and three transistors for readout. A rolling shutter operation where
the pixel array is exposed row by row is typically employed for the 3T-APS array, and the
timing diagram is shown in Fig. 3.2(b). During the reset phase, MRST is on and the charge
integrated on the photodiode is removed. During the integration phase, MRST is oﬀ and the
photodiode converts light into current, discharging the parasitic capacitor CPD. During the
readout phase, MSEL is on, and the signal voltage VSIG is sampled to sampler S-SIG.
3.1.2 Principal Component Analysis (PCA)
PCA is a widely used method for dimensionality reduction. This reduction is accomplished
by projecting the data vector $\mathbf{x}_n \in \mathbb{R}^M$ (where $n = 1, \ldots, N$ is the sample index and $N$ is the
total number of samples in the dataset) onto a set of orthonormal principal components
$\boldsymbol{\alpha}_k \in \mathbb{R}^M$, $k = 1, \ldots, K$. These principal components are the top $K$ variance-maximizing
eigenvectors of the sample covariance matrix $\sum_{n=1}^{N} \mathbf{x}_n \mathbf{x}_n^T$ [78]. Hence, the reduced dimension
Figure 3.2: 3T APS: (a) pixel architecture and associated RD circuit, and (b) timing
diagram in rolling shutter operation.
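(feature) vectors $\mathbf{f} \in \mathbb{R}^K$ are obtained as:
$$\mathbf{f} = \mathbf{A}\mathbf{x} \qquad (3.1)$$
where $\mathbf{A} = [\boldsymbol{\alpha}_1, \ldots, \boldsymbol{\alpha}_K]^T \in \mathbb{R}^{K \times M}$ is the eigenmatrix, and $\mathbf{x} \in \mathbb{R}^M$ is the test data vector.
In the CMOS image sensor shown in Figure 3.1(a), $\mathbf{x}$ is obtained from the APS array and
$M = M_r M_c$, where $M_r$ and $M_c$ are the number of rows and number of columns, respectively,
in the APS array.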
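As an illustration of (3.1), the sketch below builds the eigenmatrix from the unnormalized sum of outer products used in the text and projects one vector onto it. The data are synthetic placeholders, the function name `pca_eigenmatrix` is hypothetical, and NumPy is assumed to be available; this is not the training flow used for the reported results.

```python
import numpy as np

def pca_eigenmatrix(samples, k):
    """Return A (k x M): rows are the top-k eigenvectors of sum_n x_n x_n^T."""
    scatter = samples.T @ samples               # M x M sum of outer products
    eigvals, eigvecs = np.linalg.eigh(scatter)  # symmetric input, orthonormal columns
    top = np.argsort(eigvals)[::-1][:k]         # indices of the k largest eigenvalues
    return eigvecs[:, top].T

rng = np.random.default_rng(0)
samples = rng.normal(size=(200, 16))  # N = 200 synthetic vectors with M = 16
A = pca_eigenmatrix(samples, k=4)
x = samples[0]
f = A @ x                             # reduced-dimension feature vector, as in (3.1)
print(f.shape)
```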
3.1.3 Support Vector Machine (SVM)
The SVM [79] is a popular supervised learning method for classiﬁcation and regression. In
SVM, the trained model is represented by:
$$y_o = \mathbf{w}_s^T \mathbf{f} - b \qquad (3.2)$$
where $\mathbf{w}_s \in \mathbb{R}^K$ is the optimum weight vector, and $\mathbf{f} \in \mathbb{R}^K$ is the test feature vector. It can
be shown that the optimum weight vector $\mathbf{w}_s$ can be described in terms of feature vectors
that lie on the margins, i.e., support vectors:
$$\mathbf{w}_s = \sum_{n=1}^{N_s} \beta_n y_n \mathbf{f}_{s,n} \qquad (3.3)$$
where $y_n$, $N_s$ and $\mathbf{f}_{s,n}$ are the label, the number of support vectors, and the $n$th support
vector, respectively. The SVM's classification accuracy is denoted by $p_c = \Pr\{\hat{y} = y\}$,
where $\hat{y} = \mathrm{sgn}(y_o)$ is the computed label and $y$ is the true label.
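A minimal sketch of (3.2) and (3.3) follows. The coefficients β_n, which the text does not define further, are simply taken as given outputs of training; the function names and the numerical values are hypothetical, and mapping y_o = 0 to +1 is an arbitrary choice made here.

```python
def weight_from_support_vectors(betas, labels, support_vectors):
    """w_s = sum_n beta_n * y_n * f_{s,n}, as in (3.3)."""
    k = len(support_vectors[0])
    return [
        sum(beta * y * f[j] for beta, y, f in zip(betas, labels, support_vectors))
        for j in range(k)
    ]

def classify(w_s, f, b):
    """Computed label sgn(y_o) with y_o = w_s^T f - b, as in (3.2)."""
    y_o = sum(w * x for w, x in zip(w_s, f)) - b
    return 1 if y_o >= 0 else -1  # tie at zero assigned to +1 (assumption)

# Toy values, for illustration only.
w_s = weight_from_support_vectors(
    betas=[0.5, 0.5], labels=[1, -1], support_vectors=[[2.0, 0.0], [0.0, 2.0]]
)
print(w_s, classify(w_s, [3.0, 1.0], b=0.0), classify(w_s, [1.0, 3.0], b=0.0))
```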
3.2 The Compute Sensor
This section presents the proposed Compute Sensor architecture for implementing the PCA
and SVM, along with architectural level functional and energy models in a 65 nm CMOS
process. These models are employed to study the eﬀectiveness of retraining on compensating
for analog non-idealities, and for estimating the energy consumption.
3.2.1 Architecture
Figure 3.3: Compute Sensor implementing PCA and SVM: (a) the architecture, and (b)
the behavioral model.
The general Compute Sensor architecture in Figure 3.1(b) enforces a speciﬁc sequence
of functions - acquire data in the APS array of size M = MrMc, bit-line processing, cross
bit-line processing, followed by residual digital processing. Bit-line processing involves scalar
operations while the cross bit-line processing results in dimensionality reduction. For exam-
ple, bit-line operations could be the product of scalar data values and scalar weights, while
cross bit-line processing would sum up these scalar products to generate a dot product. In
the following, recall that M = MrMc. Keeping these architectural constraints in
mind, we exploit the linearity of the PCA and SVM computations in (3.1) and (3.2) to
combine them as follows:
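$$y_o = \mathbf{w}_s^T \mathbf{A}\mathbf{x} - b = \mathbf{w}^T \mathbf{x} - b \qquad (3.4)$$
where $\mathbf{w}^T = \mathbf{w}_s^T \mathbf{A} = [\mathbf{w}_1^T, \ldots, \mathbf{w}_{M_r}^T] \in \mathbb{R}^{1 \times M}$, $\mathbf{w}_i \in \mathbb{R}^{M_c}$, and $\mathbf{x} \in \mathbb{R}^M$, with
$\mathbf{x}^T = [\mathbf{x}_1^T, \ldots, \mathbf{x}_{M_r}^T]$ and $\mathbf{x}_i \in \mathbb{R}^{M_c}$. The composite weight vector $\mathbf{w}$ can be obtained directly via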
SVM training methods. The data acquisition in the APS array occurs sequentially in a
row-by-row fashion. In order to accommodate this constraint, we rewrite (3.4) as follows:
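$$y_o = \left[\mathbf{w}_1^T, \ldots, \mathbf{w}_{M_r}^T\right] \begin{bmatrix} \mathbf{x}_1 \\ \vdots \\ \mathbf{x}_{M_r} \end{bmatrix} - b = \sum_{i=1}^{M_r} \mathbf{w}_i^T \mathbf{x}_i - b \qquad (3.5)$$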
This simple step enables us to implement multiplication operations involved in computing
the dot product $\mathbf{w}_i^T \mathbf{x}_i$ in (3.5) in the BLP consisting of an array of Mc capacitive multipliers,
and the addition of these products in a charge sharing-based adder in the CBP. The Compute
Sensor's classiﬁcation accuracy pc = Pr{yˆ = y} is calculated in the same manner as that in
the conventional system.
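The equivalence between (3.4) and the row-by-row form (3.5) can be checked with a small numerical sketch. The matrix, weights, and pixel values below are arbitrary toy numbers, and the function names are hypothetical; the loop only mimics the order of operations (per-row products, then summation), not the analog circuits.

```python
def fold_weights(A, w_s):
    """Composite weight w = A^T w_s, so that w^T x = w_s^T (A x)."""
    k, m = len(A), len(A[0])
    return [sum(w_s[r] * A[r][c] for r in range(k)) for c in range(m)]

def rowwise_decision(w, x, b, m_r, m_c):
    """Accumulate sum_i w_i^T x_i - b one pixel row at a time, as in (3.5)."""
    y_o = -b
    for i in range(m_r):
        row = slice(i * m_c, (i + 1) * m_c)
        # element-wise products (BLP role) followed by their sum (CBP role)
        y_o += sum(wj * xj for wj, xj in zip(w[row], x[row]))
    return y_o

# Toy example: Mr = 2 rows, Mc = 3 columns, K = 2 features.
A = [[1.0, 0.0, 2.0, 0.0, 1.0, 0.0],
     [0.0, 1.0, 0.0, 1.0, 0.0, 2.0]]
w_s, b = [0.5, -1.0], 0.25
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

f = [sum(a * xi for a, xi in zip(row, x)) for row in A]  # f = A x, as in (3.1)
y_direct = sum(w * fi for w, fi in zip(w_s, f)) - b      # (3.2)
y_rowwise = rowwise_decision(fold_weights(A, w_s), x, b, m_r=2, m_c=3)
print(y_direct, y_rowwise)
```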
3.2.2 Behavioral and Energy Models
Behavioral models describe the input-output relationship of the various blocks constituting
the Compute Sensor while accounting for circuit non-idealities. These models can be em-
ployed in system simulations to estimate the performance of algorithms implemented on the
Compute Sensor. In this chapter, the noise sources included in the behavioral model are: (1)
spatial threshold mismatch in the APS array, (2) temporal noise in the APS array, and (3)
non-linearities in the BLP. Figure 3.3(b) shows that the ﬁrst two stages of the Compute
Sensor - the APS array and the S/H blocks - map light energy incident on the ith row of
pixels to xi as follows:
$$\mathbf{x}_i = x_{max}\mathbf{1} - \gamma \mathbf{I}_i + \boldsymbol{\eta}_{s,i} + \boldsymbol{\eta}_{a,i} \qquad (3.6)$$
where $\mathbf{x}_i \in \mathbb{R}^{M_c}$ is a discrete-time continuous-amplitude voltage representation of the luminous
exposure $\mathbf{I}_i$ incident on the $i$th row of pixels, $x_{max}$ is the maximum output, $\mathbf{1}$ is a
column vector with all ones, $\gamma$ is the conversion gain, $\boldsymbol{\eta}_{s,i} \in \mathbb{R}^{M_c}$ is a vector of samples
from $\mathcal{N}(0, \sigma_s^2)$ representing the impact of spatial mismatch in device parameters across the
APS array, and $\boldsymbol{\eta}_{a,i} \in \mathbb{R}^{M_c}$ is a vector of samples from $\mathcal{N}(0, \sigma_n^2)$ representing the thermal
noise in the APS array. To derive this model, we note that during the APS operation, the
exposed PD voltage is first sampled on the sampler in the S&H block in Fig. 3.3(a) by selecting
the associated word line (WL). When the $i$th row is selected, the voltage on the $j$th sampler
$V_{SIG}$ can be expressed as:
$$V_{SIG,j} = V_{PDrst} - V_{gs0} - \left[\frac{\kappa_1}{C_{PD}} - \kappa_2 (V_{gs0} - V_{gs1})\right] I_{i,j} + \Delta V_{th,j} + V_{n,j} \qquad (3.7)$$
where VPDrst is the voltage of the PD after reset, Vgs0 and Vgs1 are the gate to source voltage
of MSF (see Fig. 3.2(a)) in dark and highest illumination condition, CPD is the parasitic
capacitance at the PD node, ∆Vth,j is the threshold mismatch, Vn,j is the output referred
RD noise, and κ1 and κ2 are ﬁtting parameters. The model in (3.6) can be derived by noting
that:
$$x_{\max} = V_{PDrst} - V_{gs0} \tag{3.8}$$
$$\gamma = \frac{\kappa_1}{C_{PD}} - \kappa_2 (V_{gs0} - V_{gs1}) \tag{3.9}$$
and the threshold mismatch $\Delta V_{th,j}$ and noise $V_{n,j}$ are modeled as normally distributed
random variables with variances $\sigma_s^2$ and $\sigma_n^2$, respectively.
The BLP scales each pixel value by the weight $w_{i,j}$ using a mixed-signal capacitive
multiplier [82] as follows:
$$y_{m,i} = \rho_0 (x_{\max}\mathbf{1} - x_i) \ast w_i + \rho_1 x_i + \rho_2 w_i + \eta_{m,i} \tag{3.11}$$
where $\ast$ represents the element-wise product of two vectors, $\rho_0$ to $\rho_2$ capture the non-linearity
due to charge sharing based computation, and $\eta_{m,i}$ is a vector of samples from $\mathcal{N}(0, \sigma_m^2)$
representing the impact of reset mismatches. To derive this model, we first show the
operation principle of the capacitive multiplier. In the Compute Sensor, the scaling operation
between input $\Delta V_{SIG,i,j} = V_{PDrst} - V_{gs0} - V_{SIG,i,j}$ and weight $w_{i,j}$ is realized in the
bit-line processor employing a mixed-signal capacitive multiplier as shown in Fig. 3.4(a). We
next drop the index $(i, j)$ and denote the analog voltage and $B_p$-b digital weight as $\Delta V_{SIG}$
and $w = \sum_{i=0}^{B_p-1} p_i 2^{-(B_p-i)}$, respectively, for notational simplicity. The capacitive multiplier
employs successive charge sharing to obtain a voltage $V_m$ of:
$$V_m = V_{pre} - w\,(V_{pre} - (V_{PDrst} - V_{gs0} - \Delta V_{SIG})) \tag{3.12}$$
By choosing $V_{pre} = V_{PDrst} - V_{gs0}$, the voltage drop $\Delta V_m$ is thus:
$$\Delta V_m = V_{pre} - V_m = w \Delta V_{SIG} \tag{3.13}$$
To account for the nonlinearity due to charge sharing based operation and the mismatch in
the reset transistors, the following model is employed:
$$\Delta V_m = \rho_0 \Delta V_{SIG} w + \rho_1 V_{SIG} + \rho_2 w + \eta_m \tag{3.14}$$
The behavior in (3.11) can be obtained by noting that $y_m = \Delta V_m$ and
$\Delta V_{SIG} = V_{PDrst} - V_{gs0} - V_{SIG} = x_{\max} - x$.
After the BLP, the CBP uses charge sharing-based circuits to sum up the elements of $y_{m,i}$
and obtain the dot product $w_i^T x_i$:
$$y_{s,i} = \mathbf{1}^T y_{m,i} \tag{3.15}$$
Figure 3.4: Capacitive multiplier (a) architecture and (b) timing diagram.
The residual digital processor maintains a running sum of the row-wise dot products in order
to compute the output $y_o = \sum_i y_{s,i} - b$ in (3.5) followed by $\hat{y} = \mathrm{sign}(y_o)$ as the computed
label. Equations (3.6)-(3.15) describe the behavior of the Compute Sensor. Table 3.1 lists
the model parameter values in a 65 nm CMOS process. These equations can be employed
to estimate the system behavior of the Compute Sensor.
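The chain of equations (3.6), (3.11), (3.15), and the final sign decision can be sketched as a small behavioral simulation. The sketch below is illustrative only: the function names, the made-up exposures and weights, and the choice to redraw the spatial mismatch on every call (rather than keeping it fixed per pixel) are assumptions, while the parameter values are those listed in Table 3.1.

```python
import random

# Parameter values copied from Table 3.1 (65 nm CMOS); all names are illustrative.
X_MAX, GAMMA = 0.9, 4.39e-5                       # V, V/(lx*s)
SIGMA_S, SIGMA_N, SIGMA_M = 2e-2, 7.5e-4, 1.6e-2  # V
RHO0, RHO1, RHO2 = 0.93, 1.2e-2, 6.68e-4


def aps_row(exposure, rng, sigma_s=SIGMA_S, sigma_n=SIGMA_N):
    """Eq. (3.6): pixel voltages x_i for one row of luminous exposures I_i."""
    # Simplification: mismatch is redrawn per call instead of fixed per pixel.
    return [X_MAX - GAMMA * e + rng.gauss(0.0, sigma_s) + rng.gauss(0.0, sigma_n)
            for e in exposure]


def blp_row(x, w, rng, sigma_m=SIGMA_M):
    """Eq. (3.11): element-wise capacitive multiplication with non-idealities."""
    return [RHO0 * (X_MAX - xj) * wj + RHO1 * xj + RHO2 * wj + rng.gauss(0.0, sigma_m)
            for xj, wj in zip(x, w)]


def compute_sensor(image, weights, bias, rng,
                   sigma_s=SIGMA_S, sigma_n=SIGMA_N, sigma_m=SIGMA_M):
    """Eqs. (3.15) and (3.5): row sums, running total, and sign decision."""
    y_o = -bias
    for exposure_row, w_row in zip(image, weights):
        x = aps_row(exposure_row, rng, sigma_s, sigma_n)
        y_o += sum(blp_row(x, w_row, rng, sigma_m))  # CBP: y_s,i = 1^T y_m,i
    return (1 if y_o >= 0 else -1), y_o


if __name__ == "__main__":
    rng = random.Random(0)
    image = [[0.0, 1.0e4], [5.0e3, 2.0e3]]  # made-up exposures in lx*s
    weights = [[0.5, 0.25], [0.125, 0.75]]  # made-up weights in [0, 1)
    print(compute_sensor(image, weights, bias=0.2, rng=rng))
```

Setting the three standard deviations to zero gives the deterministic part of the model, which is convenient for checking an implementation before noise is added.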
The Compute Sensor's energy consumption per decision, i.e., in processing one Mr ×Mc
image, is given by:
ECS = MrMc(Ep + Em) +Mr(2Eadc + 2Eadd) + Eadd (3.16)
where Ep, Em, Eadc, and Eadd are the energy consumptions of the pixel, capacitive multiplier,
the ADC, and a digital adder, respectively.
The conventional system needs to convert all pixel values into the digital domain then
process digitally. The energy consumption per decision is given by:
Econv = McMr(Ep + Eadc + Erd) +McMrEmac (3.17)
where Erd and Emac are the energy per readout and multiply-accumulate (MAC) operation,
respectively. The behavioral and energy models will be employed in Section 3.3 to evaluate
the system performance and energy savings. The energy savings from the Compute Sensor are
evident from (3.16) and (3.17). The key savings arise from having the ADC operate on
row-wise dot products, giving rise to the multiplicative factor of 2Mr as compared to the factor
of McMr (Mc ≫ 2) for the conventional system. The second source of energy savings arises
from performing multiplication in the analog domain in the Compute Sensor rather than in
the digital domain, because Emac ≈ 3Em or 4Em.
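As a rough illustration, (3.16) and (3.17) can be evaluated directly with the per-operation energies listed in Table 3.2. The function names below are hypothetical, and the ratios this sketch prints come only from plugging those table values into the two equations, so they need not coincide with the savings reported from the detailed energy breakdown in Section 3.3.3.

```python
# Per-operation energies in pJ, copied from Table 3.2.
E_P, E_ADC, E_RD, E_M, E_MAC, E_ADD = 2.69, 20.5, 5.0, 0.77, 3.2, 0.1


def energy_compute_sensor(m_r, m_c):
    """Eq. (3.16): Compute Sensor energy per decision, in pJ."""
    return m_r * m_c * (E_P + E_M) + m_r * (2 * E_ADC + 2 * E_ADD) + E_ADD


def energy_conventional(m_r, m_c):
    """Eq. (3.17): conventional system energy per decision, in pJ."""
    return m_c * m_r * (E_P + E_ADC + E_RD) + m_c * m_r * E_MAC


if __name__ == "__main__":
    for n in (32, 128, 512):
        e_cs = energy_compute_sensor(n, n)
        e_conv = energy_conventional(n, n)
        print(f"{n}x{n}: E_CS = {e_cs / 1e3:.2f} nJ, "
              f"E_conv = {e_conv / 1e3:.2f} nJ, ratio = {e_conv / e_cs:.2f}")
```

The ADC term grows with Mr in (3.16) but with McMr in (3.17), which is why the ratio improves as the array gets larger.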
3.3 Simulation Results
We ﬁrst validate the behavioral and energy models described in Section 3.2 using the pa-
rameters of a 65 nm CMOS process. The system performance and energy savings achieved
by Compute Sensor are estimated using these models. In the following, the conventional
system is assumed to be operating with noise-free data and ideal digital computations.
The Compute Sensor architecture in this study consists of a 32×32 APS array, a capacitive
multiplier array with 5 b weights, an 8 b column ADC array, and 16 b addition in the digital
domain. The conventional digital implementation has an identical APS array and ADC, but
employs a digital MAC with 8 b input, 5 b weight, and 32 b output. These precisions are the
minimum needed for the conventional architecture to achieve a classiﬁcation accuracy pc =
95%. The face and non-face images extracted from the Caltech101 dataset [80] consisting
of 32× 32 gray-scale images are employed. During the system simulation, a linear mapping
from the pixel values to the luminous exposure Ii is employed and used in (3.6).
3.3.1 Model Validation
Table 3.1 lists the parameter values for the behavioral model in (3.6) obtained by curve
ﬁtting to the results of circuit simulations of a standard 3-transistor APS. The model is
found to match detailed circuit simulations to within 5.2% when the pixel output xi,j lies in
the interval [0.2, 0.9] as shown in Fig. 3.5(a). The standard deviation of spatial mismatch σs
was found to lie in the interval $[1.62 \times 10^{-2}, 2 \times 10^{-2}]$ using Monte Carlo circuit simulations
and is shown in Fig. 3.5(b). The standard deviation of output referred noise $\sigma_n$ was found
to lie in the interval $[7 \times 10^{-4}, 7.5 \times 10^{-4}]$. The model parameters in (3.11) were obtained
using the methodology in [82] and are also listed in Table 3.1.
Table 3.1: Model Parameters in 65 nm CMOS
  Parameter   Unit        Value
  xmax        V           0.9
  γ           V/(lx·s)    4.39 × 10^-5
  σs          V           2 × 10^-2
  σn          V           7.5 × 10^-4
  ρ0          -           0.93
  ρ1          -           1.2 × 10^-2
  ρ2          V           6.68 × 10^-4
  σm          V           1.6 × 10^-2
Figure 3.5: Model characterization and validation: (a) linearity, and (b) standard deviation
of mismatch and noise.
3.3.2 Classiﬁcation Accuracy
The classiﬁcation accuracy of Compute Sensor is evaluated employing the models in Sec-
tion 3.3.1 with parameters from Table 3.1.
Figure 3.6: Classification accuracy of the Compute Sensor with respect to: (a) APS spatial
mismatch, (b) capacitive multiplier mismatch, and (c) input peak signal-to-noise ratio (PSNR).
Figure 3.6(a) shows that the Compute Sensor is able to achieve a classification accuracy
$p_c = 94.7\%$ at the nominal values of spatial mismatch $\sigma_s = 2 \times 10^{-2}$, multiplier mismatch
$\sigma_m = 2 \times 10^{-2}$, and noise $\sigma_n = 7.5 \times 10^{-4}$. This accuracy is very close to the value of
95% achieved by the ideal digital implementation. In fact, the Compute Sensor is able to
maintain $p_c \ge 94\%$ when $\sigma_s$ is increased to 0.1, which is 5× more than the nominal value,
using the hyperparameters obtained with the nominal value of σs. Any further increases in
σs lead to a large reduction in pc. For example, pc decreases to 87% when σs increases to
0.5. Next, we retrain the Compute Sensor with data generated in the presence of spatial
mismatch. Figure 3.6(a) shows that the Compute Sensor achieves a pc = 92% when σs = 0.5
after retraining. This clearly indicates the eﬀectiveness of retraining in order to compensate
for spatial mismatch in the APS array. Retraining can also be employed to address the
computational errors due to capacitive multiplier mismatch ηm. A similar study was conducted
to observe the impact of multiplier mismatch σm, as shown in Figure 3.6(b), where σs and
σn were set at their nominal values. This figure shows that the Compute Sensor achieves
pc = 90% in the presence of σm = 0.5 with retraining, which is a significant improvement
over the case when retraining was not employed.
Classification accuracy is a function of the input peak signal-to-noise ratio (PSNR), defined
as $\mathrm{PSNR} = 20\log_{10}(x_{\max}/\sigma_n)$. A PSNR of 61 dB is obtained with the nominal values of
$x_{\max} = 0.9$, $\sigma_s$, and $\sigma_n$. At this value of PSNR, a classification accuracy of 94.7% is achieved.
Figure 3.6(c) shows that the Compute Sensor's classiﬁcation accuracy decreases to 78% as
the PSNR reduces to 0 dB.
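The PSNR definition above is easy to check numerically. The helper names below are hypothetical; the nominal values of xmax and σn are those in Table 3.1.

```python
import math


def psnr_db(x_max, sigma_n):
    """PSNR = 20 log10(x_max / sigma_n), in dB."""
    return 20.0 * math.log10(x_max / sigma_n)


def sigma_n_for_psnr(x_max, psnr):
    """Invert the definition: noise standard deviation that yields a given PSNR."""
    return x_max / (10.0 ** (psnr / 20.0))


if __name__ == "__main__":
    print(f"nominal PSNR: {psnr_db(0.9, 7.5e-4):.1f} dB")
    print(f"sigma_n at 20 dB: {sigma_n_for_psnr(0.9, 20.0):.3f} V")
```

With xmax = 0.9 and σn = 7.5 × 10^-4 the ratio is 1200, which gives roughly 61.6 dB and is consistent with the 61 dB quoted in the text.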
To further understand the performance of Compute Sensor, PCA is performed on the
feature vectors obtained from the behavior model in Section 3.2.2. Figure 3.7(a) shows the
distribution of the feature vectors when circuit non-idealities are absent. In this case, the
SVM chooses a hyperplane that successfully separates the two classes. However, in the
presence of spatial and multiplier mismatch (σs = σm = 0.3), the feature vectors shift as
shown in Figure 3.7(b). The classiﬁcation accuracy falls if the original hyperplane is used.
However, retraining with the new set of feature vectors enables the Compute Sensor to
obtain a new separating hyperplane with a commensurate improvement in the classiﬁcation
accuracy as shown in Figure 3.7(c). These results indicate that the Compute Sensor may
need to adapt to changing environmental conditions such as temperature in order to ensure
that the optimal separating hyperplane is generated and employed for classiﬁcation.
Figure 3.7: The feature distribution and SVM separation hyper-plane when: (a)
σs = σm = 0 without retraining, (b) σs = σm = 0.3 without retraining, and (c)
σs = σm = 0.3 with retraining.
3.3.3 Energy Savings
In order to compare the energy consumption of the Compute Sensor with the conventional
architecture, we employ the energy numbers in Table 3.2 which are based on circuit simula-
tion and published energy numbers from [6, 83].
Table 3.2: Energy per Pixel Processing in 65 nm CMOS
  Quantity   Energy (pJ)
  Ep         2.69
  Eadc       20.5
  Erd        5
  Em         0.77
  Emac       3.2
  Eadd       0.1
Figure 3.8(a) shows that the proposed Compute Sensor consumes 6.2× less energy than
the conventional implementation. The main sources of energy savings are the
elimination of the per bit-line ADC and RD energy and the use of analog dot product
computations in the Compute Sensor. For example, the 1024-length dot product in analog
consumes 0.79 nJ, which is 4.1× less than the 3.28 nJ needed by the digital implementation.
We also studied the energy savings as a function of the array size, as shown in Figure 3.8(b).
Indeed, the energy savings increase from 6.2× to 11× as the APS array size increases from
32 × 32 to 512 × 512. This is because the Compute Sensor performs ADC operations
row-wise on dot products, whereas the conventional architecture performs ADC
operations pixel-wise on scalars.
Another opportunity to reduce energy consumption is to reduce the APS current. However,
doing so will degrade the input PSNR. Specifically, one of the fundamental noise
sources in the APS is thermal noise, whose noise power is described by:
$$\sigma_n^2 = kT/C \tag{3.18}$$
where $k$ is Boltzmann's constant, $T$ is the temperature, and $C$ is the sampling
capacitance. The bandwidth of the APS can be approximated by:
$$B = \frac{g_m}{C} = \frac{I_{aps}}{V_{ov} C} \tag{3.19}$$
where $I_{aps}$ is the current consumption of the APS array and $V_{ov}$ is the overdrive voltage,
a fixed parameter chosen during the design. A fundamental trade-off between noise and
bandwidth (thus speed) can be seen from (3.18) and (3.19). For fixed bandwidth, reducing
$C$ will allow a smaller $I_{aps}$ and thus lower energy, but will increase the noise variance
$\sigma_n^2$ per (3.18).
More specifically, the PSNR is related to the current $I_{aps}$ via:
$$\mathrm{PSNR} = 20\log_{10}(x_{\max}/\sigma_n) \propto 10\log_{10}(I_{aps}) \tag{3.20}$$
and the energy of the APS is related to $I_{aps}$ via:
$$E_{pix} = V_{dd} I_{aps} T_{pix} \tag{3.21}$$
where $V_{dd}$ and $T_{pix}$ are the supply voltage and the pixel access time, respectively. The degraded
PSNR may be acceptable if retraining is employed, as suggested in Figure 3.6(c). This figure
shows that the Compute Sensor is able to achieve less than 1% performance drop from the
ideal digital performance of pc = 95% for PSNR ≥ 20 dB. This relaxed PSNR requirement
allows the APS array current to be reduced for additional energy savings. Figure 3.8(c)
shows that the energy savings increase to 17× as the PSNR decreases from 61 dB to
20 dB.
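The relation in (3.20) follows from (3.18) and (3.19) in a few steps. The derivation below is a sketch that assumes the bandwidth $B$ and overdrive voltage $V_{ov}$ are held fixed while $C$ and $I_{aps}$ are scaled together, as described above; the grouping of terms into a constant is added here for illustration.

```latex
\begin{align*}
C &= \frac{I_{aps}}{V_{ov} B}
  && \text{from (3.19), with } B \text{ and } V_{ov} \text{ fixed} \\
\sigma_n^2 &= \frac{kT}{C} = \frac{kT\, V_{ov} B}{I_{aps}}
  && \text{substituting into (3.18)} \\
\mathrm{PSNR} &= 20\log_{10}\frac{x_{\max}}{\sigma_n}
  = \underbrace{20\log_{10} x_{\max} - 10\log_{10}(kT\, V_{ov} B)}_{\text{independent of } I_{aps}}
  + 10\log_{10} I_{aps}
\end{align*}
```

Under these assumptions, each 10× reduction in $I_{aps}$ lowers the PSNR by 10 dB and, per (3.21), lowers $E_{pix}$ by the same factor of 10.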
Figure 3.8: Energy per decision: (a) energy breakdown, (b) energy savings vs. APS size,
and (c) energy savings vs. PSNR.
3.4 Discussion
We have shown the beneﬁts of embedding information processing functionality into the sen-
sory substrates. We note that such embeddings are made possible due to the intrinsic ability
of machine learning algorithms to adapt to noise. Behavioral models such as those in Sec-
tion 3.2.2 can be employed to develop a variety of machine learning algorithms for Compute
Sensor style architectures. We believe that more powerful machine learning algorithms in-
cluding deep neural networks, ensemble methods such as bagging and boosting, decision
trees, and random forest, can also potentially be embedded into the Compute Sensor. The
huge design space spanned by the Compute Sensor encompassing algorithms, architectures,
circuits, and sensors, can be a challenge when searching for energy-optimal implementations.
Another formidable challenge that we hope to address in the future is to design programmable
Compute Sensor architectures whereby a variety of algorithms can be mapped on to the same
platform. Silicon prototypes are necessary to demonstrate the beneﬁts of Compute Sensor in
real world applications. An initial characterization chip (see Fig. 3.9) containing 8 different
types of CMOS APS arrays in 65 nm CMOS has been taped out. The chip allows flexible
control of the APS supply voltage, bias current, readout timing and pulse widths. It will be
employed to characterize the APS model as well as the trade-oﬀ between performance and
energy consumption, and for future integration with various learning kernels.
Figure 3.9: Compute Sensor characterization chip in 65 nm CMOS: (a) chip architecture,
and (b) chip layout.
Chapter 4
EMBEDDED ALGORITHMIC-NOISE TOLERANCE
In previous chapters, the energy efficiency of in-silicon machine learning kernels is improved
by integrating information processing into the energy delivery and sensing fabrics, thereby
eliminating the fundamental losses associated with the conventional architectures or the
communication overhead between sensing and computing. The resultant energy efficiency
vs. robustness trade-off is exploited by retraining the hyperparameters in the machine
learning algorithms to compensate for the circuit level non-idealities. Such an energy
efficiency vs. robustness trade-off also arises when implementing information processing
subsystems on stochastic fabrics, i.e., fabrics with low SNR due to operation in a different
regime or the use of new devices. Examples of stochastic fabrics include CMOS circuits
operating with overscaled supply voltage, near/subthreshold voltage (NTV) CMOS [84], and
emerging nanoscale devices such as CNFET [85], spin [86], and others. As pointed out in
Chapter 1, these stochastic fabrics have the potential to achieve high energy efficiency, but
are subject to various kinds of hardware errors.
Hardware errors in stochastic fabrics have unique properties and should be distinguished
from the input noise and approximation errors such as those in approximate computing
(AC) literature [87, 88, 89]. Input noise in the feature vectors occurs during the data
acquisition process. Although it has been shown that many machine learning algorithms
are robust to noise [90, 91], and that adding noise during the training might even improve
the performance [92, 93], it is always assumed that the noise power is much smaller than
the signal power [92], and that the computation is error-free. Approximation errors occur
when complex operations/circuits are replaced with simpler approximated ones and thus are
static errors. Both logic level AC [87, 88, 89], and algorithmic level AC [94, 95] have been
proposed to improve the energy eﬃciency of machine learning algorithms. However, the
Figure 4.1: The error probability mass function of: (a) the approximation errors in an
approximate multiplier in [96], and (b) the hardware errors (timing errors) of an 8 b
multiplier operating with a scaled voltage of Vdd = 0.7 V. Unlike the approximation errors,
the hardware errors are both dynamic and large-magnitude.
performance improvement relies solely on the inherent algorithmic robustness, thus limiting
the approximation errors to be of small magnitude. Furthermore, the computation at the
circuit level is also assumed to be error-free. In contrast, hardware errors occur during
the computations on the circuit fabrics. These errors are complex functions of the circuit
state, inputs, architecture, and the process technology, and can be both dynamic and large-
magnitude (see Fig. 4.1). This is particularly the case if the errors are timing errors in
DSP data path circuits [53], since these errors are most signiﬁcant bit (MSB) errors and
can directly lead to decision failures [53]. As a result, hardware errors are usually far more
detrimental to the system performance compared with the input noise and approximation
errors, and cannot be compensated for via the inherent robustness of machine learning
algorithms. Therefore, statistical error compensation techniques are needed to detect and
compensate for these errors.
Starting from this chapter, we will explore the use of error resiliency techniques to com-
pensate for the hardware errors on the stochastic fabrics, so that large energy savings can
be achieved without loss of system level performance. Error resiliency techniques have been
proposed [53] to enhance energy eﬃciency by reducing design margins and compensating for
the resultant errors. Large design margins arise from the need to provide robustness in the
presence of process, voltage, and temperature variations [97], and represent an energy over-
head as high as 3×-to-4× [98]. The key to the use of error resiliency for energy reduction is
that such techniques need to be low overhead and yet eﬀective in compensating for high error
rates. Classical fault-tolerance techniques such as N-modular redundancy (NMR) rely on
replication of the main computation block and as a result are ineﬀective for the purposes of
energy reduction. Hence, low overhead error resiliency techniques such as RAZOR [99, 100],
error-detection sequential (EDS) [101], and conﬁdence driven computing (CDC) [102] have
been proposed to enhance energy eﬃciency. These techniques employ rollback based error
correction, and are suitable when operating close to the point of first failure (PoFF). Unlike
the rollback based techniques, statistical error compensation (SEC) [53] is a class of system
level error compensation techniques that utilizes signal and error statistics and hence is par-
ticularly well-suited for signal processing and machine learning systems. These techniques
include algorithmic noise tolerance (ANT), soft NMR, and stochastic sensor network on a
chip (SSNOC) [53], and have been shown to compensate for error rates ranging from 0.21
to 0.89, with a combined error detection and correction overhead ranging from 5% to 30%
resulting in energy savings ranging from 35% to 72%.
ANT [53] is a speciﬁc SEC technique that has been shown to be eﬀective in compensating
for high error rates in signal processing and machine learning kernels. For example, the
reduced precision replica (RPR) ANT technique and prediction based ANT were employed to
compensate for error rates of 0.27 ∼ 0.58 in an ECG processor [103, 104] while delivering the
required application-level performance. The overhead in ANT ranges from 5% to 30% [103]
due to the use of explicit estimator blocks in error compensation. This overhead, though
small compared to other techniques, limits the achievable systems level energy eﬃciency to
28% ∼ 41%.
In this chapter, we propose embedded algorithmic-noise tolerance (E-ANT), a new class
of statistical error compensation (SEC) techniques aiming to reduce the error compensation
complexity associated with conventional SEC techniques. E-ANT operates by reusing part
of the main block as an estimator and thus embedding it into the main block. At the archi-
tectural level, we propose ARCH-ANT, which employs data path decomposition (DPD) to
embed the RPR estimator into the main block. At the algorithmic level, we propose
ALG-ANT, which employs additional optimization constraints during algorithm to architecture
mapping to design incremental reﬁnement architectures. The logic overhead of E-ANT is
reduced to below 8% from the 20%-44.1% [57, 65] overhead associated with conventional
ANT system. To evaluate the improved robustness and energy savings of the proposed tech-
nique, ARCH-ANT and ALG-ANT are applied to the design of an EEG seizure classiﬁcation
system consisting of a frequency selective ﬁlter bank as the feature extractor and a support
vector machine (SVM) as the classiﬁer. Simulation results in a commercial 45 nm CMOS
process show that ARCH-ANT can compensate for error rates up to 0.38, and ALG-ANT
can compensate for error rates up to 0.41, while maintaining a true positive rate ptp > 0.9
and a false positive rate pfp ≤ 0.01. This error tolerance is employed to reduce energy via
the use of voltage overscaling (VOS). ARCH-ANT and ALG-ANT are able to achieve up to
51% and 44% energy savings, respectively.
The rest of the chapter is organized as follows. Section 4.1 describes the background of the
ANT technique and the SVM EEG classiﬁcation system architecture. Section 4.2 presents
the principle of ARCH-ANT and ALG-ANT. Section 4.3 presents the design optimization
of ARCH-ANT and ALG-ANT compute kernels and their application to the SVM EEG
classiﬁcation system. Conclusions are presented in Section 4.4.
4.1 Background
4.1.1 Conventional ANT
Conventional ANT incorporates a main block (M) and an estimator (E) as shown in Fig. 4.2(a).
The M-block implements the algorithm of interest and is conventionally error-free. In ANT,
the M-block is permitted to make errors, which are then compensated for by the rest of
the blocks in Fig. 4.2(a) including the E-block. In RPR ANT, the E-block is obtained by
reducing the precision of the M-block. The M-block is subject to large magnitude errors η
(e.g., timing errors due to critical path violations which typically occur in the MSBs) while
the E-block is subject to small magnitude errors e (see Fig. 4.2(b), e.g., due to quantization
noise in the LSBs), i.e.:
ya = yo + η (4.1)
ye = yo + e (4.2)
where yo, ya, and ye are the error-free, the M, and E-block outputs, respectively. ANT
exploits the diﬀerence in the statistics of η and e to detect and compensate for errors to
obtain the ﬁnal corrected output yˆ as follows:
$$\hat{y} = \begin{cases} y_a & \text{if } |y_a - y_e| \le T_h \\ y_e & \text{otherwise} \end{cases} \tag{4.3}$$
where Th is an application dependent threshold parameter chosen to maximize the perfor-
mance of ANT. In this paper, Th is chosen to equal max(|yo − ye|) as this ensures that the
M-block output ya will always be selected [103] when the output is error free. The error
rate pη is deﬁned as:
$$p_\eta = 1 - P_\eta(0) = \Pr\{\eta \ne 0\} \tag{4.4}$$
where Pη(·) is the error probability mass function (PMF) of η. The errors η are most
conveniently obtained by applying voltage overscaling (VOS) where the supply voltage Vdd
is scaled as follows:
Vdd = KvosVdd−crit (4.5)
where Kvos is the voltage overscaling factor, and Vdd−crit is the minimum voltage needed for
error free operation in the M-block. Note that for the ANT system to work properly, the
E-block is not permitted to make large magnitude errors such as those arising from timing
violations. This helps maintain the diﬀerence in the error statistics at the output of the M
and E-block as shown in Fig. 4.2(b).
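The decision rule in (4.3) can be illustrated with a few lines of code. This is a minimal sketch, not the implementation used in this work: the estimator is modeled as a round-to-nearest quantizer with a hypothetical 3 fractional bits, the output value is made up, and the threshold is set to the largest possible rounding error so that an error-free M-block output is always selected.

```python
def quantize(value, frac_bits):
    """Round value to a fixed-point grid with frac_bits fractional bits."""
    scale = 1 << frac_bits
    return round(value * scale) / scale


def ant_select(y_a, y_e, threshold):
    """Eq. (4.3): keep the M-block output unless it disagrees with the estimate."""
    return y_a if abs(y_a - y_e) <= threshold else y_e


if __name__ == "__main__":
    frac_bits_e = 3                          # hypothetical E-block precision
    threshold = 2.0 ** -(frac_bits_e + 1)    # max |y_o - y_e| for round-to-nearest
    y_o = 0.40625                            # error-free output (made-up value)
    y_e = quantize(y_o, frac_bits_e)         # Eq. (4.2): small estimation error e
    for eta in (0.0, 0.5):                   # no error, then an MSB-weight error
        y_a = y_o + eta                      # Eq. (4.1)
        print(f"eta = {eta}: corrected output = {ant_select(y_a, y_e, threshold)}")
```

When η = 0 the M-block output passes through unchanged; when a large error occurs, the output falls back to the estimate and the residual error is only the small estimation error e.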
The performance improvement achieved by ANT can be evaluated by employing a system
level metric such as the signal-to-noise ratio (SNR). Assume that the error-free output in
Figure 4.2: Algorithmic noise-tolerance (ANT): (a) conventional architecture, and (b) the
error statistics in the main (M) and estimator (E) blocks.
Fig. 4.2(a) is expressed as:
yo = s+ ns (4.6)
where $s$ and $n_s$ represent the signal and noise components in the error-free output $y_o$,
respectively. At the application level, one is interested in the ratio of the signal power $\sigma_s^2$ to
the noise powers at the outputs of the M-block, the E-block, and the ANT system. Thus, the
following application level SNRs can be defined:
$$\mathrm{SNR}_{M,a} = 10\log_{10}\left(\frac{\sigma_s^2}{\sigma_{n_s}^2 + \sigma_\eta^2}\right) \tag{4.7}$$
$$\mathrm{SNR}_{E,a} = 10\log_{10}\left(\frac{\sigma_s^2}{\sigma_{n_s}^2 + \sigma_e^2}\right) \tag{4.8}$$
$$\mathrm{SNR}_{ANT,a} = 10\log_{10}\left(\frac{\sigma_s^2}{\sigma_{n_s}^2 + \sigma_{n_r}^2}\right) \tag{4.9}$$
where $\sigma_s^2$, $\sigma_{n_s}^2$, $\sigma_\eta^2$, $\sigma_e^2$, $\sigma_{y_o}^2$ and $\sigma_{n_r}^2$ are the variances of the signal $s$, noise $n_s$, M-block hardware
error $\eta$, E-block estimation error $e$, error-free output $y_o$, and residual error $n_r = y_o - \hat{y}$,
respectively. It is also of interest to evaluate how `noisy' the circuit fabric is with respect to
an error-free (conventional) architecture. By deﬁnition, the output of such an architecture
is yo. Thus, we deﬁne the circuit level SNRs as follows:
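$$\mathrm{SNR}_{M,c} = 10\log_{10}\left(\frac{\sigma_{y_o}^2}{\sigma_\eta^2}\right) \tag{4.10}$$
$$\mathrm{SNR}_{E,c} = 10\log_{10}\left(\frac{\sigma_{y_o}^2}{\sigma_e^2}\right) \tag{4.11}$$
$$\mathrm{SNR}_{ANT,c} = 10\log_{10}\left(\frac{\sigma_{y_o}^2}{\sigma_{n_r}^2}\right) \tag{4.12}$$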
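If error detection is ideal, then $n_r \in \{0, e\}$, and its probability mass function (PMF)
$P_{\eta_r}(n_r)$ is given by:
$$P_{\eta_r}(n_r) = \begin{cases} 1 - p_\eta & \text{if } n_r = 0 \\ p_\eta & \text{if } n_r = e \end{cases}$$
and
$$\sigma_{n_r}^2 = p_\eta \sigma_e^2$$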
where pη is the error rate of the M-block deﬁned in (4.4). Therefore, (4.9) and (4.12) can
be expressed as:
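$$\mathrm{SNR}_{ANT,a} = 10\log_{10}\left(\frac{\sigma_s^2}{\sigma_{n_s}^2 + p_\eta \sigma_e^2}\right) \tag{4.13}$$
$$\mathrm{SNR}_{ANT,c} = 10\log_{10}\left(\frac{\sigma_{y_o}^2}{p_\eta \sigma_e^2}\right) \tag{4.14}$$
Since $e$ is the small magnitude LSB error and $\eta$ is the large magnitude MSB error,
$p_\eta \sigma_e^2 \ll \sigma_e^2 \ll \sigma_\eta^2$. This further implies that
$\mathrm{SNR}_{ANT,a} \gg \mathrm{SNR}_{E,a} \gg \mathrm{SNR}_{M,a}$, and
$\mathrm{SNR}_{ANT,c} \gg \mathrm{SNR}_{E,c} \gg \mathrm{SNR}_{M,c}$. Thus, the output SNR of the ANT system is significantly greater than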
the SNR at the output of either the M or E-block. This phenomenon occurs in spite of the
fact that the ANT system output yˆ ∈ {ya, ye} (see Fig. 4.2(a)), i.e., yˆ equals the output of
either the M or E-block. The reason for this unique feature of ANT is that it exploits the
diﬀerence in the error statistics (see Fig. 4.2(b)) at the output of the M and E-block.
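The ordering of the circuit level SNRs in (4.10), (4.11), and (4.14) can be checked with a short calculation. The function names and the variances used below are made up for illustration, and ideal error detection is assumed as in the derivation above.

```python
import math


def snr_db(signal_power, noise_power):
    """10 log10 of a power ratio, in dB."""
    return 10.0 * math.log10(signal_power / noise_power)


def circuit_level_snrs(var_yo, var_eta, var_e, p_eta):
    """Eqs. (4.10), (4.11), and (4.14), assuming ideal error detection."""
    return {
        "M": snr_db(var_yo, var_eta),
        "E": snr_db(var_yo, var_e),
        "ANT": snr_db(var_yo, p_eta * var_e),
    }


if __name__ == "__main__":
    # Made-up example: large MSB errors, small LSB errors, 10% error rate.
    for block, value in circuit_level_snrs(1.0, 0.1, 1e-3, 0.1).items():
        print(f"SNR_{block},c = {value:.1f} dB")
```

Comparing (4.11) and (4.14), the ANT output gains $10\log_{10}(1/p_\eta)$ dB over the E-block alone, so the benefit is largest when the error rate is small.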
4.1.2 EEG Classiﬁcation System using SVM
Portable health monitoring is an important class of applications that can beneﬁt from the
design of energy eﬃcient machine learning kernels. It has been shown [105] that epileptic
seizures can be eﬃciently detected by analyzing the EEG signal using an SVM kernel. The
EEG seizure classiﬁcation system [105] shown in Fig. 4.3(a) consists of a frequency selective
filter bank to extract signal energy in the 0-20 Hz range and an SVM classifier. The filter
bank has a passband of 3 Hz with a transition band of 1.5 Hz. Eight channels are employed to
cover the entire frequency range [105].
SVM [79] is a popular supervised learning method for classiﬁcation and regression. An
SVM operates by ﬁrst training the model (the training phase) followed by classiﬁcation (the
classiﬁcation phase). During the training phase, feature vectors with labels are used to
train the model. During the classiﬁcation phase, the SVM produces a predictive label when
provided with a new feature vector. The SVM training can be formulated as the following
optimization problem to determine the maximum margin classiﬁer [79] (see Fig. 4.3(b)):
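$$\begin{aligned} \min \quad & \frac{1}{2}\|w\|^2 + C \sum_i \xi_i \\ \text{s.t.} \quad & y_i (w^T x_i - b) \ge 1 - \xi_i \\ & \xi_i \ge 0 \end{aligned} \tag{4.15}$$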
where C is the cost factor, ξi is the soft margin, xi is the feature vector, yi is the label
corresponding to the feature vector xi, w is the weight vector, and b is the bias. The trained
model is represented by:
$$y = w_o^T x - b \tag{4.16}$$
where wo are the optimized weights. It can be shown that the optimum weights are repre-
sented as a linear combination of the feature vectors that lie on the margins (see Fig. 4.3(b)),
i.e., support vectors:
$$w_o = \sum_{n=1}^{N_s} \alpha_n y_n x_{s,n} \tag{4.17}$$
where $N_s$ and $x_{s,n}$ are the number of support vectors and the $n$th support vector, respectively.
The linear model can thus be represented as:
$$y = \sum_{n=1}^{N_s} \alpha_n y_n x_{s,n}^T x - b \tag{4.18}$$
Figure 4.3: EEG seizure classifier with SVM: (a) system architecture, and (b) principle of
SVM.
The linear SVM in (4.18) can be easily extended into non-linear SVM by employing the
kernel trick [79], resulting in:
$$y = \sum_{n=1}^{N_s} \alpha_n y_n K(x_{s,n}, x) - b \tag{4.19}$$
where K(xs,n,x) is a kernel function. Popular kernel functions include polynomial, radial
basis function (RBF), and others [106].
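The decision functions in (4.18) and (4.19) differ only in the kernel, which the following sketch makes explicit. The support vectors, coefficients, and the RBF width `gamma` are made-up values chosen for illustration; they are not taken from the EEG classifier.

```python
import math


def linear_kernel(u, v):
    """K(u, v) = u^T v, which reduces (4.19) to the linear model (4.18)."""
    return sum(a * b for a, b in zip(u, v))


def rbf_kernel(u, v, gamma=1.0):
    """K(u, v) = exp(-gamma * ||u - v||^2); gamma is an assumed width parameter."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))


def svm_decision(x, support_vectors, alphas, labels, bias, kernel=linear_kernel):
    """Eq. (4.19): y = sum_n alpha_n * y_n * K(x_s,n, x) - b."""
    return sum(a * y * kernel(sv, x)
               for a, y, sv in zip(alphas, labels, support_vectors)) - bias


if __name__ == "__main__":
    svs = [(1.0, 1.0), (-1.0, -1.0)]   # made-up support vectors
    alphas, labels, bias = [0.5, 0.5], [1, -1], 0.0
    x = (2.0, 0.0)
    print("linear:", svm_decision(x, svs, alphas, labels, bias))
    print("rbf:   ", svm_decision(x, svs, alphas, labels, bias, kernel=rbf_kernel))
```

With the linear kernel, the same result is obtained by first forming the weight vector as in (4.17) and then evaluating (4.16); the kernel form avoids building that vector explicitly, which is what allows the non-linear extension.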
4.2 Proposed E-ANT
E-ANT reuses part of the main block M to generate an estimate of its error free output
yo. This is in contrast to conventional SEC techniques, where an explicit estimator is re-
quired. Such embedding of the estimator can be performed either at the architectural or the
algorithmic level. At the architectural level, data path decomposition can be employed to
transform an existing architecture into an error resilient architecture, leading to the proposed
ARCH-ANT technique. At the algorithm level, traditional algorithm transforms search over
the design space for optimum parameters suitable for hardware implementations. Additional
training/optimization constraints can be employed to trade oﬀ performance and error re-
siliency, leading to the proposed ALG-ANT technique. Both techniques will be presented in
this section.
4.2.1 ARCH-ANT
In RPR ANT, the M and E-blocks process the same data but with diﬀerent precisions.
This redundancy can be exploited to embed the E-block into the M-block via DPD. In
particular, DPD decomposes the M-block into MSB and LSB components, and employs
the output of the MSB component as an estimate of the error-free M-block output yo. By
ensuring that the critical path of the MSB block is always shorter than that of the M-block,
the requirements on the error statistics (see Fig. 4.2(b)) of the M and E-blocks are satisfied.
Let $y_a = f(x)$ denote the M-block functionality, where $x$ and $y_a$ are the input and output of
the M-block, respectively. A $B_x$-bit input $x = x_0 x_1 \ldots x_{B_x-1}$ can be written in 2's complement
form [107], as follows:
$$x = -x_0 + \sum_{i=1}^{B_x-1} x_i 2^{-i} = x_M + x_L 2^{-(B_{msb}-1)} \tag{4.20}$$
where $x_M$ is the value of the $B_{msb}$ MSB bits, and $x_L$ is the value of the $B_x - B_{msb}$ LSB bits, as
shown below:
$$x_M = -x_0 + \sum_{i=1}^{B_{msb}-1} x_i 2^{-i} \tag{4.21}$$
$$x_L = \sum_{i=B_{msb}}^{B_x-1} x_i 2^{-(i-B_{msb}+1)} \tag{4.22}$$
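The split in (4.20)-(4.22) can be verified on a concrete bit pattern. In this sketch the input is represented as a list of bits with the sign bit first; the function names and the 8-bit example word are illustrative assumptions.

```python
def twos_complement_value(bits):
    """Eq. (4.20), left-hand side: x = -x_0 + sum_{i>=1} x_i * 2^-i."""
    return -bits[0] + sum(b * 2.0 ** -i for i, b in enumerate(bits) if i >= 1)


def split_msb_lsb(bits, b_msb):
    """Eqs. (4.21) and (4.22): values of the MSB part x_M and LSB part x_L."""
    x_m = -bits[0] + sum(bits[i] * 2.0 ** -i for i in range(1, b_msb))
    x_l = sum(bits[i] * 2.0 ** -(i - b_msb + 1) for i in range(b_msb, len(bits)))
    return x_m, x_l


if __name__ == "__main__":
    bits = [1, 0, 1, 1, 0, 1, 0, 1]   # made-up 8-bit word, sign bit first
    b_msb = 4
    x_m, x_l = split_msb_lsb(bits, b_msb)
    recombined = x_m + x_l * 2.0 ** -(b_msb - 1)
    print(twos_complement_value(bits), x_m, x_l, recombined)
```

Note that $x_L$ is always non-negative and smaller than 1, so $x_M$ is the full-precision input truncated to $B_{msb}$ bits.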
Therefore, the M-block output is expressed as:
$$y_a = f(x) = f(x_M + x_L 2^{-(B_{msb}-1)})$$
In E-ANT, we decompose f(x) as follows:
$$\begin{aligned} y_a &= f(x) \\ &= f(x_M + x_L 2^{-(B_{msb}-1)}) \\ &= g(f_M(x_M), f_L(x_M, x_L)) \end{aligned} \tag{4.23}$$
where fM(xM) and fL(xM , xL) are functions that are combined by the operator g(·) to
generate the ﬁnal output ya. Since this decomposition utilizes the ﬁnite precision nature of
arithmetic units, it is referred to as DPD. We show that DPD exists if f(x) is n-times diﬀer-
entiable or can be piecewise approximated. An E-ANT system can be obtained via DPD by
ensuring that: (1) the critical path of fM(xM) is shorter than that of g(fM(xM), fL(xM , xL)),
and (2) fM(xM) generates an estimate ye of the error-free output yo. The operation of DPD
based E-ANT is described as follows:
$$y_a = g(f_M(x_M), f_L(x_M, x_L))$$
$$y_e = f_M(x_M)$$
$$\hat{y} = \begin{cases} y_a & \text{if } |y_a - y_e| \le T_h \\ y_e & \text{otherwise} \end{cases}$$
where Th is the error detection threshold as in (4.3). Next, we describe several methods to
achieve DPD.
4.2.2 DPD via Taylor Expansion
Taylor expansion can be employed to achieve DPD. If $f(x)$ is $n$-times differentiable in the
input range $x \in [x_l, x_u]$, the DPD for $f(x)$ using Taylor expansion is given by:
$$f(x) \approx f_M(x_M) + f_L(x_M, x_L) \tag{4.24}$$
where
$$f_M(x_M) = f(x_0) + \sum_{k=1}^{n} \sum_{i=0}^{k} \left[ \frac{f^{(k)}(x_0)}{k!} \binom{k}{i} (-x_0)^{k-i} \right] x_{i,M}$$
$$f_L(x_M, x_L) = \sum_{k=1}^{n} \sum_{i=0}^{k} \left[ \frac{f^{(k)}(x_0)}{k!} \binom{k}{i} (-x_0)^{k-i} \right] x_{i,L}$$
$$x_{i,M} = x_M^i$$
$$x_{i,L} = \sum_{j=0}^{i-1} \binom{i}{j} x_M^j \left(x_L 2^{-(B_{msb}-1)}\right)^{i-j}$$
where $x_M$ and $x_L$ are defined in (4.21) and (4.22), respectively.
As a special case, when a first order Taylor expansion is employed at $x_0 = \frac{1}{2}(x_l + x_u)$, i.e.,
at the center of the input dynamic range, (4.24) simplifies into:
$$f(x) \approx f(x_0) + f'(x_0)(x - x_0) \tag{4.25}$$
where $f'(x_0)$ is the first order derivative of $f(x)$ at $x_0$. Substituting (4.20) into (4.25), we
obtain the DPD of $f(x)$ as follows:
$$\begin{aligned} f(x) &\approx f(x_0) + f'(x_0)\left(x_M + x_L 2^{-(B_{msb}-1)} - x_{0,M} - x_{0,L} 2^{-(B_{msb}-1)}\right) \\ &= f_M(x_M) + f_L(x_L) 2^{-(B_{msb}-1)} \end{aligned} \tag{4.26}$$
where
$$f_M(x_M) = f(x_0) + f'(x_0)(x_M - x_{0,M})$$
$$f_L(x_L) = f'(x_0)(x_L - x_{0,L})$$
and fM(xM) can be used as the E-block. Note that in the decomposition in (4.26), only x is
decomposed into MSB and LSB components and the factor f ′(x0) remains in full precision.
If a simpler E-block is required, f ′(x0) can also be decomposed into MSB and LSB parts,
as shown in Section 4.2.4. The pivot point x0 should be chosen such that the error metric,
e.g., the mean square error, between the original and the E-ANT kernel is minimized.
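As a numerical check of (4.26), the sketch below splits x and x0 into MSB and LSB parts and
confirms that fM + fL 2^−(Bmsb−1) reproduces the first order Taylor approximation. The
helper `split` is an assumption: it only enforces x = xM + xL 2^−(Bmsb−1) with xM on a
grid of step 2^−(Bmsb−1), and may differ in detail from the definitions in (4.21) and
(4.22). The test function e^−x and the values of x, x0, and Bmsb are illustrative.

```python
import math


def split(x, b_msb):
    """Hypothetical MSB/LSB split such that x = x_m + x_l * s, s = 2^-(b_msb-1)."""
    s = 2.0 ** -(b_msb - 1)
    x_m = math.floor(x / s) * s   # MSB part, on a grid of step s
    x_l = (x - x_m) / s           # LSB part, rescaled to [0, 1)
    return x_m, x_l, s


def taylor1_dpd(f, df, x, x0, b_msb):
    """First order Taylor DPD: returns (f_M, f_L, s) as in (4.26)."""
    x_m, x_l, s = split(x, b_msb)
    x0_m, x0_l, _ = split(x0, b_msb)
    f_m = f(x0) + df(x0) * (x_m - x0_m)   # candidate E-block output
    f_l = df(x0) * (x_l - x0_l)
    return f_m, f_l, s


f = lambda x: math.exp(-x)
df = lambda x: -math.exp(-x)
x, x0 = 0.7, 0.5
f_m, f_l, s = taylor1_dpd(f, df, x, x0, b_msb=4)
full = f(x0) + df(x0) * (x - x0)          # undecomposed first order Taylor value
print(f_m + f_l * s, full)                # the two values agree up to rounding
```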
4.2.3 DPD via Piecewise Linear (PWL) Approximation
The PWL approximation can be employed when f(x) (x ∈ [xl, xu]) is non-diﬀerentiable or
the input dynamic range is large.
The PWL approximation employs N + 1 points (xk, f(xk)), where xk = xl + (k/N)(xu − xl)
and k = 0, 1, ..., N, to approximate f(x) as:
\[
f(x) \approx \sum_{k=1}^{N} p_k(x), \qquad
p_k(x) = \begin{cases} a_k x + b_k & x_k \le x < x_{k+1} \\ 0 & \text{otherwise} \end{cases} \tag{4.27}
\]
where x0 = xl, xN = xu, ak = (f(xk+1) − f(xk))/(xk+1 − xk), and
bk = (xk+1 f(xk) − xk f(xk+1))/(xk+1 − xk). Each segment pk(x)
can be decomposed by noting that for a linear function p(x), substituting for x from (4.20),
we have
\[
p(x) = p(x_M + x_L 2^{-(B_{msb}-1)}) = p(x_M) + p(x_L) 2^{-(B_{msb}-1)} \tag{4.28}
\]
Therefore, substituting (4.20) into (4.27), we obtain:
\[
p_k(x) = \begin{cases} p_{k,M}(x_M) + p_{k,L}(x_L) 2^{-(B_{msb}-1)} & x_k \le x < x_{k+1} \\ 0 & \text{otherwise} \end{cases} \tag{4.29}
\]
where
\[
p_{k,M}(x_M) = a_k x_M + b_k, \qquad p_{k,L}(x_L) = a_k x_L
\]
and pk,M(xM) can be employed as the E-block.
Note that other piecewise approximation methods such as spline interpolation [108] where
each segment is approximated with a low order polynomial can also be employed for DPD.
Each low order polynomial can be decomposed in a manner similar to (4.24).
Next, we apply DPD to obtain E-ANT architectures for arithmetic units and compute
kernels commonly used in signal processing and machine learning.
4.2.4 E-ANT Arithmetic Unit Architectures
4.2.4.1 E-ANT Adder
The output of a two-operand adder is given by:
ya = x1 + x2
where x1 and x2 are the input operands. We ﬁrst decompose the operands into MSB and
LSB components according to (4.20):
\[
x_1 = x_{1M} + x_{1L} 2^{-(B_{msb}-1)}, \qquad x_2 = x_{2M} + x_{2L} 2^{-(B_{msb}-1)}
\]
where xiM and xiL are defined in (4.21) and (4.22), respectively.
Figure 4.4: E-ANT adder: (a) DFG, and (b) symbol.
Since addition is a linear function, DPD can be easily obtained from (4.28) as follows:
\[
\begin{aligned}
y_a &= x_{1M} + x_{2M} + x_{1L} 2^{-(B_{msb}-1)} + x_{2L} 2^{-(B_{msb}-1)} \\
&= f_M + f_L 2^{-(B_{msb}-1)}
\end{aligned} \tag{4.30}
\]
where fM = x1M + x2M and fL = x1L + x2L. The data ﬂow graph (DFG) and the symbol of
the E-ANT adder are shown in Fig. 4.4(a) and Fig. 4.4(b), respectively.
4.2.4.2 E-ANT Multiplier
Employing the DPD in (4.21)-(4.23), the E-ANT multiplier can be derived as follows:
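\[
\begin{aligned}
y_a &= x_1 x_2 \\
&= \left(x_{1M} + x_{1L} 2^{-(B_{msb}-1)}\right)\left(x_{2M} + x_{2L} 2^{-(B_{msb}-1)}\right) \\
&= f_M + f_L 2^{-(B_{msb}-1)}
\end{aligned} \tag{4.31}
\]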
where fM = x1Mx2M and fL = x1Lx2M + x1x2L. Figure 4.5(a) and Fig. 4.5(b) show the
DFG and symbol of the E-ANT multiplier.
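The multiplier decomposition in (4.31) can be checked numerically with a short sketch. The
helper `split` is hypothetical: it only guarantees x = xM + xL 2^−(Bmsb−1) and may differ
from the exact definitions in (4.21) and (4.22); the operand values are illustrative.

```python
import math


def split(x, b_msb):
    """Hypothetical MSB/LSB split such that x = x_m + x_l * s, s = 2^-(b_msb-1)."""
    s = 2.0 ** -(b_msb - 1)
    x_m = math.floor(x / s) * s
    x_l = (x - x_m) / s
    return x_m, x_l, s


def eant_multiply(x1, x2, b_msb):
    """Return (f_M, f_L, s) so that x1 * x2 = f_M + f_L * s, as in (4.31)."""
    x1m, x1l, s = split(x1, b_msb)
    x2m, x2l, _ = split(x2, b_msb)
    f_m = x1m * x2m               # MSB-only product, used as the estimate
    f_l = x1l * x2m + x1 * x2l    # remaining terms; note the full-precision x1
    return f_m, f_l, s


x1, x2 = 0.7, 0.3
f_m, f_l, s = eant_multiply(x1, x2, b_msb=4)
print(f_m, f_m + f_l * s, x1 * x2)  # estimate, reconstructed product, exact product
```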
Figure 4.5: E-ANT multiplier: (a) DFG, and (b) symbol.
4.2.4.3 E-ANT Multiply-accumulator (MAC)
MAC operation is described as:
ya[n] = x[n]w[n] + ya[n− 1] (4.32)
We first decompose x[n], w[n], and ya[n − 1] according to (4.20):
\[
x[n] = x_M[n] + x_L[n] 2^{-(B_{msb}-1)} \tag{4.33}
\]
\[
w[n] = w_M[n] + w_L[n] 2^{-(B_{msb}-1)} \tag{4.34}
\]
\[
y_a[n-1] = y_{a,M}[n-1] + y_{a,L}[n-1] 2^{-2(B_{msb}-1)} \tag{4.35}
\]
The E-ANT MAC can be obtained by substituting (4.33)-(4.35) into (4.32), and employing
(4.30)-(4.31) to decompose ya[n] as follows:
\[
\begin{aligned}
y_a[n] &= x_M[n] w_M[n] + y_{a,M}[n-1] \\
&\quad + \left( x_L[n] w_M[n] + \left( x_M[n] + x_L[n] 2^{-(B_{msb}-1)} \right) w_L[n] + y_{a,L}[n-1] 2^{-(B_{msb}-1)} \right) 2^{-(B_{msb}-1)} \\
&= x_M[n] w_M[n] + y_{a,M}[n-1] + \left( x_L[n] w_M[n] + x[n] w_L[n] + y_{a,L}[n-1] 2^{-(B_{msb}-1)} \right) 2^{-(B_{msb}-1)} \\
&= f_M + f_L 2^{-(B_{msb}-1)}
\end{aligned}
\]
where fM = xM[n]wM[n] + ya,M[n − 1] and fL = xL[n]wM[n] + x[n]wL[n] + ya,L[n − 1]2^−(Bmsb−1).
Figure 4.6(a) and Fig. 4.6(b) show the DFG and the symbol of the E-ANT MAC.
Figure 4.6: E-ANT MAC unit: (a) DFG, and (b) symbol.
4.2.5 E-ANT Signal Processing and Machine Learning Kernels
Complex E-ANT kernels can be derived by employing the E-ANT arithmetic units derived
in section 4.2.4.
4.2.5.1 E-ANT FIR Filter
One of the most important kernels in information processing is ﬁltering/convolution. We
can derive an E-ANT FIR ﬁlter by employing (4.30)-(4.31) as follows:
\[
y_a[n] = \sum_{i=0}^{N-1} w[i] x[n-i] = \sum_{i=0}^{N-1} f_{iM} + \sum_{i} f_{iL} 2^{-(B_{msb}-1)}
\]
where fiM = xM[n − i]wM[i] and fiL = xL[n − i]wM[i] + x[n − i]wL[i] for i = 0, ..., N − 1.
Figure 4.7 shows the DFGs of the direct form and transposed form E-ANT FIR filter where
we make use of the symbols in Fig. 4.4(b), 4.5(b), and 4.6(b) to simplify the DFGs.
Figure 4.7: E-ANT FIR filter: (a) the DFG of direct form FIR filter, and (b) the DFG of
transposed form FIR filter.
4.2.5.2 E-ANT Fast Fourier Transform (FFT) Butterﬂy Unit (BU)
BU is the main data processing unit in FFT processors. A general BU implements the
following function:
y1r,a = x1r + x2r, y1i,a = x1i + x2i
d = x1 − x2
y2r,a = drWr − diWi, y2i,a = drWi + diWr
where x1 and x2 are the inputs, y1 and y2 are the outputs, and W is the twiddle factor. The
real and imaginary parts are denoted by r and i subscripts, respectively. E-ANT FFT BU
can be derived as shown in Table 4.1.
Table 4.1: DPD for FFT BU
y1r,a = x1r + x2r = y1r,M + y1r,L 2^−(Bmsb−1), where y1r,M = x1r,M + x2r,M and
y1r,L = x1r,L + x2r,L.
y1i,a = x1i + x2i = y1i,M + y1i,L 2^−(Bmsb−1), where y1i,M = x1i,M + x2i,M and
y1i,L = x1i,L + x2i,L.
y2r,a = drWr − diWi = y2r,M + y2r,L 2^−(Bmsb−1), where y2r,M = dr,MWr,M − di,MWi,M
and y2r,L = dr,LWr,M + drWr,L − (di,LWi,M + diWi,L).
y2i,a = drWi + diWr = y2i,M + y2i,L 2^−(Bmsb−1), where y2i,M = dr,MWi,M + di,MWr,M
and y2i,L = dr,LWi,M + drWi,L + di,LWr,M + diWr,L.
Figure 4.8: DFG of the E-ANT FFT butterfly unit.
The DFG of the E-ANT FFT BU is shown in Fig. 4.8.
4.2.5.3 E-ANT Exponential Kernel
The exponential kernel (e^−x) is a critical component in many machine learning algorithms
such as kernel SVM [105, 109], Gaussian mixture model [110], and others [111, 112]. Taylor
expansion in Section 4.2.2 and PWL approximation in Section 4.2.3 can be employed to
obtain E-ANT exponential kernels.
Assuming that the input dynamic range is scaled to [0, 1], a 2nd order Taylor expansion
leads to:
\[
y_a = e^{-x} \approx e^{-x_0} - e^{-x_0} (x - x_0) + \frac{e^{-x_0} (x - x_0)^2}{2} \tag{4.36}
\]
When x0 = 0.5, (4.36) simpliﬁes to
ya ≈ ax2 + bx+ c (4.37)
where a = 0.3033, b = −0.9098, and c = 0.9856. The Taylor expansion based E-ANT
exponential block can thus be derived from (4.24) as follows:
\[
y_a \approx f_M + f_L 2^{-(B_{msb}-1)}
\]
where
\[
f_M = a x_M^2 + b x_M + c, \qquad f_L = a \left( 2 x_M x_L + x_L^2 2^{-(B_{msb}-1)} \right) + b x_L
\]
The DFG is shown in Fig. 4.9(a).
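The coefficients in (4.37) and the resulting decomposition can be reproduced with the
sketch below. Collecting powers of x in (4.36) gives a = e^−x0/2, b = −e^−x0(1 + x0), and
c = e^−x0(1 + x0 + x0²/2). The helper `split` is hypothetical (it only enforces
x = xM + xL 2^−(Bmsb−1)), and the values of x and Bmsb are illustrative.

```python
import math

x0 = 0.5
e0 = math.exp(-x0)
# Collect powers of x in (4.36) to obtain ya ≈ a*x^2 + b*x + c, as in (4.37).
a = e0 / 2
b = -e0 * (1 + x0)
c = e0 * (1 + x0 + x0 ** 2 / 2)
print(round(a, 4), round(b, 4), round(c, 4))  # 0.3033 -0.9098 0.9856


def split(x, b_msb):
    """Hypothetical MSB/LSB split such that x = x_m + x_l * s, s = 2^-(b_msb-1)."""
    s = 2.0 ** -(b_msb - 1)
    x_m = math.floor(x / s) * s
    x_l = (x - x_m) / s
    return x_m, x_l, s


def exp_taylor_dpd(x, b_msb):
    """Return (f_M, f_L, s) for the 2nd order Taylor based exponential block."""
    x_m, x_l, s = split(x, b_msb)
    f_m = a * x_m ** 2 + b * x_m + c
    f_l = a * (2 * x_m * x_l + x_l ** 2 * s) + b * x_l
    return f_m, f_l, s


x = 0.7
f_m, f_l, s = exp_taylor_dpd(x, b_msb=4)
poly = a * x ** 2 + b * x + c      # undecomposed polynomial (4.37)
print(f_m + f_l * s, poly, math.exp(-x))
```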
Alternatively, PWL approximation can be employed to obtain an E-ANT exponential
kernel. Assume that two linear functions on [0, 0.5] and [0.5, 1] are used to approximate the
exponential function on the interval [0, 1]; then according to (4.27):
\[
y_a = e^{-x} \approx \begin{cases} a_1 x + b_1 & 0 \le x < 0.5 \\ a_2 x + b_2 & 0.5 \le x < 1 \end{cases}
\]
where a1 = −0.7869, b1 = 1, a2 = −0.4773 and b2 = 0.8452. We ﬁrst decompose ai, bi
(i = 1, 2) according to (4.20):
\[
a_i = a_{i,M} + a_{i,L} 2^{-(B_{msb}-1)}, \qquad b_i = b_{i,M} + b_{i,L} 2^{-2(B_{msb}-1)}
\]
Since each segment is linear, it can be decomposed by using (4.28):
\[
y_{a,i} = a_i x + b_i = f_{i,M} + f_{i,L} 2^{-(B_{msb}-1)}
\]
where
\[
f_{i,M} = a_{i,M} x_M + b_{i,M}, \qquad f_{i,L} = a_{i,M} x_L + a_{i,L} x + b_{i,L} 2^{-(B_{msb}-1)}
\]
The resultant E-ANT exponential kernel has a reconfigurable architecture where different
approximations are chosen according to the input values. The DFG of the PWL based
E-ANT exponential kernel is shown in Fig. 4.9(b).
Figure 4.9: DFG of the E-ANT exponential kernel: (a) Taylor expansion based, and (b)
PWL approximation based.
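The PWL coefficients quoted above follow directly from the expressions for ak and bk given
with (4.27). A minimal sketch (the function name `segment` is hypothetical):

```python
import math


def segment(f, xk, xk1):
    """Slope a_k and intercept b_k of the chord of f between xk and xk1."""
    a = (f(xk1) - f(xk)) / (xk1 - xk)
    b = (xk1 * f(xk) - xk * f(xk1)) / (xk1 - xk)
    return a, b


f = lambda x: math.exp(-x)
a1, b1 = segment(f, 0.0, 0.5)
a2, b2 = segment(f, 0.5, 1.0)
print(round(a1, 4), round(b1, 4))  # -0.7869 1.0
print(round(a2, 4), round(b2, 4))  # -0.4773 0.8452
```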
Figure 4.10: ALG-ANT FIR ﬁlter structure.
4.2.6 ALG-ANT
Low complexity estimation can also be achieved by changing the algorithm into an incre-
mental reﬁnement structure through algorithm transformation. Unlike architecture level
techniques, ALG-ANT is algorithm speciﬁc. We next derive two ALG-ANT techniques -
one for the FIR ﬁlter kernel and another for the dot product kernel.
4.2.7 ALG-ANT FIR Filter Kernel
The FIR ﬁlter is a commonly used kernel in signal processing and machine learning. The
conventional FIR ﬁlter design method employs algorithms such as the weighted least square
(WLS) method [113], which formulates the ﬁlter design as an optimization problem. Let
H(e^jω) and Hd(e^jω) denote the designed and the ideal filter frequency responses, respectively,
and W(e^jω) be a non-negative error weighting function. The WLS method minimizes the
L2 norm of the weighted difference between H(e^jω) and Hd(e^jω) as follows:
\[
\min \frac{1}{2\pi} \int_{-\pi}^{\pi} \left[ W(e^{j\omega}) H(e^{j\omega}) - W(e^{j\omega}) H_d(e^{j\omega}) \right]^2 d\omega \tag{4.38}
\]
Let h = [h[0], ..., h[M]]^T and d = [d[0], ..., d[N − 1]]^T be the impulse response of the
(M + 1)-tap filter H(e^jω) and the IDFT of W(e^jω)Hd(e^jω), respectively, and let the
N by M + 1 matrix W be defined as W[n, l] = w[n − l], where w[n] is the IDFT of W(e^jω).
The optimization can be reduced to:
\[
\min \|Wh - d\|^2 \tag{4.39}
\]
and has the solution h∗ = W†d, where W† is the Moore-Penrose pseudoinverse of W.
In ALG-ANT, the optimization in (4.39) is modiﬁed to include architectural level con-
straints. In particular, we employ the ﬁlter architecture in Fig. 4.10 where the center
M + 1− 2Kf ﬁlter taps are employed to obtain the estimator output ye[n] (see Fig. 4.2(a)).
Here Kf is a design parameter that determines the estimator length. The rationale for using
the center taps of an FIR ﬁlter to obtain an estimate of its ﬁnal output yo[n] is that for
linear phase FIR ﬁlter, the center taps of the ﬁlter can provide a good estimate of the ﬁlter
response [114]. Doing so embeds the estimator completely into the main block. To achieve
this, we reformulate the objective function in (4.39) as follows:
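\[
\min \; (1-\gamma) \|Wh - d\|^2 + \gamma \|\tilde{W} h\|^2 \tag{4.40}
\]
where
\[
\tilde{W} = \begin{bmatrix}
I_{K_f \times K_f} & 0_{K_f \times \hat{K}_f} & 0_{K_f \times K_f} \\
0_{\hat{K}_f \times K_f} & 0_{\hat{K}_f \times \hat{K}_f} & 0_{\hat{K}_f \times K_f} \\
0_{K_f \times K_f} & 0_{K_f \times \hat{K}_f} & I_{K_f \times K_f}
\end{bmatrix},
\]
K̂f = M + 1 − 2Kf, and the parameter γ (0 ≤ γ < 1) is used to control the relative
strength of the two optimization terms. Doing so
constrains the magnitude of the outer taps of the filter. The optimization in (4.40) can
be solved by setting the derivative of the loss function in (4.40) to zero, resulting in the
following filter:
\[
h^* = \left( (1-\gamma) W^T W + \gamma \tilde{W}^T \tilde{W} \right)^{-1} \left( (1-\gamma) W^T d \right) \tag{4.41}
\]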
where h∗ contains the optimum ALG-ANT filter coefficients. In practice, γ and Kf are design
parameters that can be employed to trade off the two optimization terms in (4.40). A large
γ will weigh more on the estimator design, leading to a more accurate estimator. However, a
large γ tends to decrease the performance of the main block since the resulting filter deviates
from the ideal filter d. A small Kf (thus larger estimator length) will lead to a more accurate
estimator because more coefficients can be employed, but a small value of Kf will limit the
amount by which VOS can be applied before the estimator begins to exhibit large magnitude
timing violations. Thus, in this design, as expected, the accuracy of the estimator and the
main block trade off with each other, and so does the extent of VOS that can be applied.
In Section 4.3.6, Kf is determined by the error rate pη (thus Kvos) and γ is optimized via a
grid search.
Figure 4.11: ALG-ANT linear SVM: the dot product result is unaltered when the order of
the multiply-accumulates (MACs) is varied.
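The step from (4.40) to (4.41) can be written out explicitly. The derivation below is a
sketch that assumes real-valued W, d, and h and squared Euclidean norms in (4.40); J(h) is
a name introduced here for the loss function.

```latex
\begin{aligned}
J(h) &= (1-\gamma)\,\|Wh - d\|^2 + \gamma\,\|\tilde{W}h\|^2 \\
\nabla_h J(h) &= 2(1-\gamma)\,W^T (Wh - d) + 2\gamma\,\tilde{W}^T \tilde{W} h = 0 \\
\left( (1-\gamma)\,W^T W + \gamma\,\tilde{W}^T \tilde{W} \right) h^* &= (1-\gamma)\,W^T d \\
h^* &= \left( (1-\gamma)\,W^T W + \gamma\,\tilde{W}^T \tilde{W} \right)^{-1} (1-\gamma)\,W^T d
\end{aligned}
```

For γ = 0 this reduces to the normal equations of (4.39), i.e., h∗ = (W^T W)^−1 W^T d,
which equals W†d when W has full column rank.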
4.2.8 ALG-ANT Dot Product Kernel
We next derive ALG-ANT for the dot product kernel, another widely used kernel in machine
learning. The dot product kernel is employed in the linear SVM (see Fig. 4.3(a)), which
provides good classiﬁcation performance and results in a particularly simple architecture
[115]. In the dot product kernel (see Fig. 4.11), the input vector x and the weight vector w
are multiplied element-wise and the resulting products are added up.
One observation in (4.16) is that it is only the ﬁnal dot product that contributes to the
classiﬁcation result, not the order in which computation is done. This suggests that we
can implement the dot product kernel via dimension reordering (DR) which will reorder
the dimensions of the inputs x and w in the classiﬁcation engine and use more important
weights ﬁrst during the dot product evaluation.
The reordered weight vector wˆ can be calculated via a simple sorting operation:
wˆ = [wˆ1, wˆ2, ...wˆn]
where |wˆi| ≥ |wˆj| for i < j. In other words, we reorder the calculation of the dot product
according to the importance of weights wi in these dimensions. The resulting incremental
reﬁnement architecture enables us to employ the intermediate stage output as the estimator
output ye[n], as shown in Fig. 4.11. As the estimator length Kc increases, the classiﬁcation
results will improve but the extent to which VOS can be applied will reduce. This trade-oﬀ
is explored in the next section.
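Dimension reordering can be illustrated with a short sketch. The function names and the
toy weight and input vectors are hypothetical; the point is only that sorting by |wi|
leaves the full dot product unchanged while letting the first Kc terms serve as the estimate.

```python
def reorder_by_weight(w, x):
    """Sort dimensions by decreasing |w_i|; apply the same permutation to x."""
    order = sorted(range(len(w)), key=lambda i: abs(w[i]), reverse=True)
    return [w[i] for i in order], [x[i] for i in order]


def partial_dot(w, x, k):
    """Dot product over the first k dimensions only (estimator of length k)."""
    return sum(wi * xi for wi, xi in zip(w[:k], x[:k]))


w = [0.1, -2.0, 0.05, 1.5]   # toy weight vector
x = [1.0, 1.0, 1.0, 1.0]     # toy input vector
w_hat, x_hat = reorder_by_weight(w, x)
print(w_hat)                                  # [-2.0, 1.5, 0.1, 0.05]

full = partial_dot(w_hat, x_hat, len(w_hat))  # same value as the unordered dot product
est = partial_dot(w_hat, x_hat, 2)            # estimate from the 2 largest weights
print(full, est)
```

In this toy case the two-term estimate already has the same sign as the full result, which
is what a linear classifier decision depends on.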
4.3 Simulation Results
This section presents the design optimization of the proposed ARCH-ANT and ALG-ANT
techniques, and shows the simulation results in a 45 nm CMOS process when they are applied
to an SVM EEG classification system.
4.3.1 Methodology
Figure 4.12(a) shows the evaluation methodology employed to quantify system-level per-
formance metrics and to estimate system-level energy consumption that integrates circuit,
architecture, and system level design variables. The methodology consists of two parts:
1) system-level error injection, and 2) system-level energy estimation. The proposed E-ANT
is compared with the conventional approach (no error compensation) and the retraining-based
approach in [54] using a commercial 45 nm CMOS process.
System-level error injection is done as follows:
1. Characterize delay vs. Vdd of basic gates such as AND and XOR using HSPICE for
0.2 V ≤ Vdd ≤ 1.2 V.
2. Develop structural Verilog HDL models of key kernels needed in the EEG classification
system using the basic gates characterized in Step 1. These kernels are a 44-tap FIR
filter with 12 b input, 8 b coefficients, and 16 b output (used in the FE), and a
vector-matrix multiplication kernel with 8 b input, 8 b coefficients, and 19 b output
(used in the polynomial kernel SVM CE).
3. Conduct HDL simulations of these kernels at different voltages by including the
appropriate voltage-specific delay numbers obtained in Step 1 into the HDL model.
The error PMFs of these kernels and the error rates pη are obtained for different supply
voltages (and thus voltage overscaling factors Kvos).
4. System performance evaluation and design optimization are done by injecting errors
into a ﬁxed point MATLAB-model of the EEG classiﬁcation system. The errors are
obtained by sampling the error PMFs obtained in Step 3.
Figure 4.12(c) shows the error PMF Pη(η) of the 44-tap low pass FIR filter used in the
FE at Vdd = 0.9 V (fclk = 76 MHz), which corresponds to Kvos = 0.75 and an error rate
pη = 0.05, and Fig. 4.12(d) shows that the error rate pη increases from 10^−5 at
Vdd = 1.15 V to 0.99 at Vdd = 0.5 V as the voltage scales down.
System-level energy estimation is done as follows:
1. Obtain a full adder (FA) count NFA of the kernel being analyzed.
2. Conduct a one-time characterization of the energy consumption of a FA incorporating
both dynamic and leakage energies as follows:
\[
E_{FA} = C_{FA} V_{dd}^2 + V_{dd} I_{leak}(V_{dd}) \frac{1}{f_{clk}} \tag{4.42}
\]
with
\[
I_{leak}(V_{dd}) = \mu C_{ox} \frac{W}{L} (m-1) V_T^2 \, e^{\frac{-V_t}{m V_T}} \, e^{\frac{-\eta_d V_{dd}}{m V_T}} \left( 1 - e^{\frac{-V_{dd}}{V_T}} \right) \tag{4.43}
\]
where CFA is the effective load capacitance of the FA and is extracted from HSPICE,
Vdd is the supply voltage, Vt, VT, µ, Cox, and ηd are the threshold voltage, the thermal
voltage, the carrier mobility, the gate capacitance per unit W/L, and the drain induced
barrier lowering (DIBL) coefficient, respectively, obtained from the process files, and
m is a constant related to the sub-threshold slope factor and is a fitting parameter.
3. The energy estimate of the kernel is obtained as Eop = NFA EFA.
Figure 4.12(b) shows the modeling results of the FA and ripple carry adder (RCA) for
various bit widths demonstrating the accuracy and scalability of the energy model. The
energy model is within 5% (for 0.2 V ≤ Vdd ≤ 1.2 V) of circuit simulation results.
Figure 4.12: Evaluation methodology: (a) simulation setup, (b) comparison of the
energy model and HSPICE simulations in a 45 nm CMOS process, (c) error PMF at
Vdd = 0.9 V, and (d) error rate pη vs. Vdd for the 44-tap low pass filter employed in the FE.
The CHB-MIT EEG data set [54] is employed as input.
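The energy model in (4.42) and (4.43) can be sketched as follows. All numeric parameter
values below are placeholders chosen for illustration, not values from the 45 nm process
files, and `k0` is a hypothetical name that lumps the product µ Cox (W/L) into a single
constant. The signs of the exponents follow (4.43) as written.

```python
import math


def leakage_current(vdd, k0, m, v_thermal, v_t, eta_d):
    """Leakage current per (4.43); k0 stands for mu * Cox * (W / L)."""
    return (k0 * (m - 1) * v_thermal ** 2
            * math.exp(-v_t / (m * v_thermal))
            * math.exp(-eta_d * vdd / (m * v_thermal))
            * (1 - math.exp(-vdd / v_thermal)))


def fa_energy(vdd, f_clk, c_fa, **leak):
    """Full adder energy per (4.42): dynamic part plus leakage over one clock period."""
    return c_fa * vdd ** 2 + vdd * leakage_current(vdd, **leak) / f_clk


def kernel_energy(n_fa, vdd, f_clk, c_fa, **leak):
    """Kernel energy Eop = NFA * EFA."""
    return n_fa * fa_energy(vdd, f_clk, c_fa, **leak)


# Placeholder parameters (assumptions, for illustration only).
params = dict(k0=1e-4, m=1.5, v_thermal=0.026, v_t=0.4, eta_d=0.1)
e_fa = fa_energy(vdd=0.9, f_clk=76e6, c_fa=1e-15, **params)
e_op = kernel_energy(n_fa=1000, vdd=0.9, f_clk=76e6, c_fa=1e-15, **params)
print(e_fa, e_op)
```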
Algorithm 1 Energy optimization algorithm for E-ANT
1. Initialize K∗vos = 1, B∗msb = 0, E∗op = energy of conventional MAC, MSEreq = specified
MSE requirement.
2. Kvos = Kvos − ∆, Bmsb = 0. Obtain maximum E-block precision Bmax to ensure error-free
E-block operation.
3. Bmsb = Bmsb + 1. If Bmsb > Bmax, then exit, else compute MSE according to (23).
4. If MSE < MSEreq, then calculate energy E(Kvos) according to (21), else go to step 3.
5. If E∗op > E(Kvos), then E∗op = E(Kvos), and B∗msb = Bmsb.
6. Go to step 2.
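The control flow of Algorithm 1 can be sketched in Python as shown below. The callbacks
`b_max_of`, `mse_of`, and `energy_of` are hypothetical stand-ins for the characterization
data and for the MSE and energy expressions referenced in steps 3 and 4; the toy models
passed to them are invented purely to exercise the loop. Two details are additions not
stated in Algorithm 1: the sketch also records the Kvos value of the best configuration,
and it stops if Kvos reaches zero.

```python
def optimize_eant(e_conv, mse_req, delta, b_max_of, mse_of, energy_of):
    """Grid search over (Kvos, Bmsb) following the steps of Algorithm 1."""
    best = {"k_vos": 1.0, "b_msb": 0, "e_op": e_conv}        # step 1
    k_vos = 1.0
    while True:
        k_vos = round(k_vos - delta, 10)                     # step 2
        if k_vos <= 0:                                       # added guard (assumption)
            return best
        b_max = b_max_of(k_vos)
        b_msb = 0
        while True:
            b_msb += 1                                       # step 3
            if b_msb > b_max:
                return best                                  # exit
            if mse_of(k_vos, b_msb) < mse_req:               # step 4
                break
        energy = energy_of(k_vos, b_msb)
        if best["e_op"] > energy:                            # step 5
            best = {"k_vos": k_vos, "b_msb": b_msb, "e_op": energy}
        # step 6: return to step 2


# Toy models (assumptions, not measured data).
best = optimize_eant(
    e_conv=1.0,
    mse_req=1e-3,
    delta=0.1,
    b_max_of=lambda k: int(8 * k),                  # E-block precision limit shrinks with Kvos
    mse_of=lambda k, b: (1 - k) * 2.0 ** (-2 * b),  # MSE falls as the E-block gets wider
    energy_of=lambda k, b: k ** 2 * (1 + b / 32),   # VOS savings plus E-block overhead
)
print(best)
```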
4.3.2 ARCH-ANT Design Optimization
The methodology in Fig. 4.12(a) is employed to perform optimization for the E-ANT kernels
proposed in Sect. 4.2, and the E-ANT MAC kernel in Fig. 4.6(a) is used as an example.
Since we adopt VOS to obtain different error rates, the parameters to be optimized are
the voltage overscaling factor Kvos and the E-block bit width Bmsb, where we assume that
Bx,msb = Bw,msb = Bmsb. The optimization framework is general enough to include the case
when Bx,msb ≠ Bw,msb. A grid search algorithm is employed to systematically determine the
optimum setting K∗vos and B∗msb satisfying the performance metric, as shown in Algorithm 1.
We adopt mean squared error (MSE) with respect to the floating point kernel as the
performance metric:
\[
MSE = E\left[ (\hat{y} - y_{fl})^2 \right] \tag{4.44}
\]
where ŷ and yfl indicate the E-ANT and floating point output, respectively. The maximum
E-block length Bmax under which the E-block does not make errors is determined by Kvos.
The optimization routine gives the optimum E-ANT configuration, including K∗vos, B∗msb and
minimum energy E∗op, at the output.
Algorithm 1 is employed to optimize the E-ANT MAC for 8 b and 16 b precision with MSE
requirements of 10^−2 to 10^−5. The iso-MSE plots in the Bmsb and pη plane (see Fig. 4.13(a)
and Fig. 4.13(c)) indicate that the optimum Bmsb increases as the error rate pη increases
because a higher precision E-block is needed to compensate for the M-block errors. Figure
4.13(b) shows that the 8 bit E-ANT MAC achieves energy savings of 16% to 69%, whereas
the 8 bit ANT MAC fails to achieve energy savings at the tight MSE requirement of
10^−5 due to E-block overheads. Figure 4.13(d) shows that the 16 bit E-ANT MAC achieves
59% energy savings compared with the conventional MAC. The overhead of the E-ANT
architecture is below 8% compared with the 36.4% and 13.9% overheads for the 8 bit and
16 bit ANT architecture, respectively.
Figure 4.13: Optimization of E-ANT MAC. The dashed line illustrates that the maximum
E-block precision Bmax decreases as pη increases, indicating that to ensure error-free
E-block operation, the E-block bit width is upper bounded. The solid lines show the
optimum Bmsb configuration for each pη at different MSE requirements, with the circle
marker indicating the (B∗msb, p∗η) pair achieving the MSE requirements with minimum Eop:
(a) optimization results of an 8 × 8 E-ANT MAC for different MSE requirements, (b)
normalized energy of an 8 × 8 conventional MAC, ANT MAC, and E-ANT MAC, (c)
optimization results of a 16 × 16 E-ANT MAC for different MSE requirements, and (d)
normalized energy of a 16 × 16 conventional MAC, ANT MAC, and E-ANT MAC.
Figure 4.14 shows that the energy savings increase as the MSE requirement increases for a
fixed Bx. This is because a larger MSE requirement allows the MAC to operate at a higher
pη and can thus reduce the E-block overheads. This is also confirmed in Fig. 4.13(b) and
Fig. 4.13(d). Additionally, the energy savings increase as Bx increases for a fixed MSE
requirement because a large Bx tends to tolerate more LSB errors, thus enabling the MAC
to operate at a higher pη.
Figure 4.14: Energy savings vs. input precision and MSE.
4.3.3 ARCH-ANT System Performance
To evaluate the performance of E-ANT, the filter kernel in the FE and the vector-matrix
multiplication kernel in the SVM CE shown in Fig. 4.15 are implemented employing the
E-ANT MAC shown in Fig. 4.6, and are characterized via the procedure described in
Section 4.3.2. For the filter bank in the FE, we use a 12 bit input, with the 8 most
significant bits taken as the E-block. For the CE, the input precision of the two MACs is
8 bit, and the E-block precision is 4 bit.
Figure 4.15: Second order polynomial kernel SVM EEG classification system architecture.
We employ the CHB-MIT EEG data set [54] to train the SVM and use leave-one-out cross
validations to evaluate the system performance. The system performance metrics employed
are the true positive (TP) rate ptp and the false positive/alarm (FP) rate pfp, defined as:
\[
p_{tp} = \frac{TP}{TP + FN}, \qquad p_{fp} = \frac{FP}{FP + TN}
\]
where TP , FN , FP , and TN are the number of true positives, false negatives, false positives,
and true negatives, respectively. A good classiﬁer achieves high values of ptp ( > 0.9) at a
small constant false alarm rate pfp (<0.01).
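The two rates can be computed from labels and classifier decisions as in the sketch below.
The ±1 label convention and the toy data are assumptions made for illustration.

```python
def tp_fp_rates(labels, predictions):
    """Return (p_tp, p_fp) for labels and predictions taking values +1 / -1."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == -1)
    fp = sum(1 for y, p in pairs if y == -1 and p == 1)
    tn = sum(1 for y, p in pairs if y == -1 and p == -1)
    return tp / (tp + fn), fp / (fp + tn)


labels      = [1, 1, 1, 1, -1, -1, -1, -1, -1, -1]   # toy ground truth
predictions = [1, 1, 1, -1, -1, -1, -1, -1, 1, -1]   # toy classifier output
p_tp, p_fp = tp_fp_rates(labels, predictions)
print(p_tp, p_fp)   # 3 of 4 positives detected, 1 of 6 negatives misclassified
```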
Three implementations are considered: the uncompensated system (denoted as CONV),
the system which performs retraining with erroneous features, similar to the one proposed
in [54] (denoted as RETRAIN), and the system with E-ANT (denoted as E-ANT). In the
retraining method [54], the classiﬁer is trained with features extracted in the presence of
VOS errors. Unlike in the retraining method [54] where the CE needs to be error-free, E-
ANT can tolerate errors in both the FE and CE. Therefore, two setups are considered in our
experiment: (1) errors in FE only, and (2) errors in both FE and CE. The maximum value
of the error rate pη for which ptp > 0.9 and pfp < 0.01 is referred to as the error tolerance
pη−max of the architecture. In the ﬁrst setup, pη−max is the error rate in the FE, and in the
second setup, pη−max is the maximum of the error rate in the FE and CE.
Figure 4.16: SNR at the output of the FE: (a) application level SNR, and (b) circuit level
SNR.
4.3.4 SNR Performance
Figure 4.16(a) shows that the improvement in the application level SNR (see (4.7)-(4.9))
achieved by E-ANT at the FE output is signiﬁcant. In particular, the SNR of the M-block
(also the SNR of the conventional system with errors), SNRM,a, drops catastrophically from
42 dB to 10 dB for values of pη as low as 8 × 10^−4. The SNR of the E-block, SNRE,a,
is constant at 23 dB for pη ≤ 0.42. This is because the E-block makes small magnitude
estimation errors e. For pη > 0.42, SNRE,a drops catastrophically as the E-block also starts
to make large magnitude timing errors. In contrast, the ANT system SNR, SNRANT,a, is
at least 10 dB higher than either SNRE,a or SNRM,a for values of pη as high as 0.1, and
approaches the E-block SNR as pη increases.
The circuit level SNRs (see (4.10) - (4.12)) also exhibit a similar trend in Fig. 4.16(b).
Furthermore, the SNR analysis in Section 4.1.1 is validated by plotting (4.13) and (4.14) in
Fig. 4.16(a) and Fig. 4.16(b), respectively.
4.3.5 Classiﬁcation Performance and Energy Savings
As shown in Fig. 4.17(a), ptp drops sharply as the circuit error rate increases in the CONV
system, where no SEC is applied. The RETRAIN system does slightly better than the CONV
system because the classifier is retrained to adapt to the error-affected features. However,
pη−max is still only around 10^−3. This is due to the fact that the errors under investigation
are timing errors. Unlike the stuck-at faults in [54], timing errors are dynamic and depend on
the state of the circuit, so the error pattern observed during training might not be the same
as during the test. In contrast, when E-ANT is applied, ptp degrades gracefully as pη
increases. As a result, the E-ANT system can achieve pη−max as high as 0.38. Figure 4.17(a)
also shows that ptp is always lower than 0.9 when only the E-block is employed, i.e., the
E-block on its own is unable to meet the performance specifications. Similarly, when errors
are present in both FE and CE (Fig. 4.17(b)), both the CONV system and the RETRAIN system
achieve pη−max < 10^−3, while E-ANT achieves a pη−max of 0.17. These are 2 orders of magnitude
(errors in FE only) and 3 orders of magnitude (errors in both FE and CE) greater than
the existing systems. The receiver operating characteristic (ROC) curve at pη−max is shown
in Fig. 4.17(c) when errors are in FE only, and in Fig. 4.17(d) when errors are in both
FE and CE. In both experiments, the ROC of the CONV as well as the RETRAIN system
approaches the ROC of a random classifier which outputs ±1 with equal probability, while
the ROC of the E-ANT system (w/ or w/o retraining) remains close to the ROC of an ideal
classifier.
Principal component analysis (PCA) is performed on the feature vectors to understand
why the CONV system fails but the E-ANT system is able to maintain good performance.
Figure 4.17(e) shows that when no SEC is applied, circuit errors have two eﬀects on the
feature vectors: (1) errors make it harder to separate the positive and negative samples, and
(2) the entire feature space is shifted due to the accumulation block in the FE. The SVM fails
to correctly perform classiﬁcation without knowledge of the error statistics. Figure 4.17(f)
shows that the large magnitude errors are compensated and converted to small residual
errors when E-ANT is applied. This will cause a very small shift in the feature space. As
a result, the SVM classiﬁer can still perform correct classiﬁcation. One way to improve E-
ANT further is to incorporate retraining. In this method, the classiﬁer is trained employing
features that are subject to residual errors after the correction via E-ANT. However, as
shown in Fig. 4.17(a), the improvement is minor due to the fact that the residual errors are
typically small.
Table 4.2 compares the pη−max, FE energy/feature (EF), and CE energy/decision (EC)
of the three systems. The E-ANT system can achieve pη−max of 0.38 when errors are in
FE only, and 0.17 when errors are in both FE and CE. When VOS is applied for energy
saving, the E-ANT system is able to achieve 51% energy savings when errors are in FE only
compared with the CONV system. When both FE and CE are in error, the E-ANT system
is able to achieve 43% and 29% energy savings in the FE and CE, respectively.
Figure 4.17: Simulation results: (a) ptp of CONV, RETRAIN and E-ANT with pfp = 0.01
when errors are in FE only, (b) ptp of CONV, RETRAIN and E-ANT with pfp = 0.01 when
errors are in both FE and CE, (c) ROC curve of CONV, RETRAIN and E-ANT at pη−max
when errors are in FE only, (d) ROC curve of CONV, RETRAIN and E-ANT at pη−max
when errors are in both FE and CE, (e) PCA results of error-free and erroneous features
for CONV, and (f) PCA results of error-free and erroneous features for E-ANT.
4.3.6 ALG-ANT Design Optimization
The FE and CE in an ALG-ANT based system have a number of design parameters that
need to be selected for optimal system performance. For the FE, (4.40) indicates that Kf
determines the estimator complexity. Hence, Kf places a lower bound on the supply voltage
because the estimator needs to be free of timing violations. Similarly, γ indicates how
closely the main block approximates the ideal frequency response. Thus, the accuracies of
the estimator and the main block trade off with each other, which suggests that optimum
values for γ and Kf, i.e., γ∗ and K∗f, exist. To explore the trade-off between main block
and estimator performance, the application level ALG-ANT filter SNR is defined. Let yo and ŷ
denote the error-free main filter output and the ALG-ANT filter output, respectively, and let
yd denote the error-free ideal filter (with coefficient d) output. The ALG-ANT filter SNR is
defined as
\[
SNR_{ALG-ANT} = 10 \log_{10} \left( \frac{\sigma_{y_o}^2}{\sigma_{ae}^2 + \sigma_{he}^2} \right) \tag{4.45}
\]
where σ²ae = E[(yo − yd)²] is the variance of the approximation error and
σ²he = E[(yo − ŷ)²] is the variance of the hardware error.
In order to determine these SNR-optimum values, K∗f is first determined by choosing the
maximum estimator length at a given supply voltage Vdd, and hence error rate pη (thus
Kvos). In particular, K∗f increases with pη as shown in Fig. 4.18(a). Next, γ∗ is obtained
via sweeping its value and observing the SNRALG−ANT. Figure 4.18(a) shows that when
pη is low, i.e., K∗f is small, γ∗ is small because the approximation error σ²ae dominates. On
the other hand, when pη is high, i.e., K∗f is large, γ∗ is large because the hardware error
σ²he dominates, and the optimization procedure will strive for a more accurate estimator, as
shown in (4.40). Figure 4.18(b) shows this trade-off for a specific value of pη, where it can
be seen that as γ increases, σ²ae increases because the overall filter no longer minimizes the
difference between ideal filter and main block; at the same time, σ²he decreases because the
estimator gives better approximations.

Table 4.2: ARCH-ANT Performance and Energy Comparison

                              Errors in FE only             Errors in both FE and CE
                              pη−max      Energy savings    pη−max      Energy savings   Energy savings
                                          in FE                         in FE            in CE
Conventional                  8 × 10^−4   NA                2 × 10^−4   NA               NA
Error resilient retraining    3 × 10^−3   7%                8 × 10^−4   13%              12%
E-ANT                         0.38        51%               0.17        43%              29%

Figure 4.18: ALG-ANT applied to the filter design problem: (a) SNRALG−ANT vs. γ for
various pη, and (b) approximation error σ²ae, hardware error σ²he and total error (σ²ae + σ²he)
vs. γ at pη = 0.1. The CHB-MIT EEG data set [54] is employed as input, and error variances
are normalized w.r.t. the total error variance at γ = 0.
For the linear SVM, DR is applied. We employ the CHB-MIT EEG data set [54] to
train the SVM and use leave-one-out cross validations to evaluate the classiﬁer performance.
The system performance metric employed is the true positive (TP) rate ptp and false posi-
tive/alarm (FP) rate pfp, as deﬁned in Sect. 4.3.3.
Figure 4.19 studies the impact of DR in the SVM classiﬁer in an error-free condition and
a pfp ≤ 0.01. It indicates that the TP rate in the absence of DR (ptp−nro) increases non-
monotonically with Kc (the estimator complexity). In particular, ptp−nro ≤ 0.5 for Kc ≤ 45,
and ptp−nro ≥ 0.9 only when Kc ≥ 112. In contrast, when DR is employed the TP rate ptp−ro
increases monotonically with Kc, and ptp−ro ≥ 0.9 when Kc ≥ 64, which is 43% smaller
than when DR is not used. Note that DR needs to be performed only once during the
training and thus does not incur overhead during classiﬁcation. Figure 4.19(b) shows that
without DR, the large magnitude weights are scattered across the dimensions, leading to
poor classiﬁcation results unless the value of Kc is suﬃciently large. DR uses the important
weights first (see Fig. 4.19(c)), and thus can produce acceptable results with much smaller
values of Kc.
Figure 4.19: Comparison of classification results with and without DR; feature vectors
extracted with FE are employed as input: (a) ptp (with pfp ≤ 0.01) of the SVM classifier
vs. estimator length Kc, where the estimator is directly obtained by using the first Kc taps
of the dot product kernel; the results with DR are denoted as ptp−ro, while the results of
directly using the reduced dimension classifier are denoted as ptp−nro. (b) The weights
without DR. (c) The weights with DR.
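The effect described above can be illustrated with a small sketch. It assumes that DR
orders the weight dimensions by decreasing magnitude, which is one reading of "uses the
important weights first"; the weights and feature vector are made-up values, the bias term
is omitted, and the names `dr_order` and `estimator_output` are illustrative only.

```python
def dr_order(weights):
    """Indices sorted by decreasing |weight| (assumed reordering rule)."""
    return sorted(range(len(weights)), key=lambda i: -abs(weights[i]))


def estimator_output(weights, x, order, kc):
    """Dot product restricted to the first kc dimensions listed in `order`."""
    return sum(weights[i] * x[i] for i in order[:kc])


# Hypothetical 6-dimensional classifier weights and feature vector.
weights = [0.05, -2.0, 0.1, 1.5, -0.02, 0.8]
x = [1.0, 0.5, -1.0, 2.0, 3.0, -1.0]

natural = list(range(len(weights)))
full = estimator_output(weights, x, natural, len(weights))         # all taps
est_without_dr = estimator_output(weights, x, natural, 3)          # first 3 taps as given
est_with_dr = estimator_output(weights, x, dr_order(weights), 3)   # 3 largest-magnitude taps

print("full:", full)
print("Kc = 3 without reordering:", est_without_dr)
print("Kc = 3 with reordering:", est_with_dr)
```

With these illustrative numbers, the reordered 3-tap estimate is much closer to the full dot
product than the 3-tap estimate taken in the original order, which is the behaviour
Fig. 4.19(a) reports for small Kc.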
4.3.7 ALG-ANT System Performance
The system architecture of the SVM EEG classification system is shown in Fig. 4.20, where
the feature extractor employs the design parameters from [54, 55], with an input of 12 b
for the filter bank and 8 b for the SVM classifier. Three architectures are considered:
the conventional classifier (denoted as CONV), the classifier with retraining [54] (denoted
as RETRAIN), and the classifier with ALG-ANT (denoted as ALG-ANT). In the retraining
method [54], the classifier is trained with features extracted in the presence of VOS errors.
Unlike the retraining method [54], where the CE needs to be error free, ALG-ANT can tolerate
errors in both FE and CE. Therefore, two setups are considered in our experiment: (1) errors
in FE only and (2) errors in both FE and CE. The maximum error rate pη for which
ptp > 0.9 and pfp < 0.01 is referred to as the error tolerance metric pη−max of the architecture.
In the first setup, pη−max is the error rate in the FE, and in the second setup, pη−max is the
maximum of the error rates in FE and CE.
Figure 4.20: ALG-ANT based SVM EEG classification system architecture.
Figure 4.21(a) shows that when errors are in FE only, ptp for the conventional system drops
sharply and pη−max is as low as 1.5 × 10−4. Retraining does slightly better as the classiﬁer
is retrained to adapt to the error aﬀected features. However, the error tolerance pη−max is
below 10−3. This is most likely due to the fact that unlike stuck-at faults studied in [54],
timing errors due to VOS are dynamic and depend on the state of circuit. In contrast, when
ALG-ANT is applied, ptp has a graceful degradation as pη increases and pη−max is improved
to 0.41. The performance of the conventional ANT system was found to be similar to the
ALG-ANT system, and thus is not shown. Figure 4.21(a) also shows that ptp is always
lower than 0.9 when only the estimator is employed. Similarly, when errors are present in both FE
and CE (see Fig. 4.21(b)), the ALG-ANT classifier achieves pη−max = 0.19. These are both 3
orders of magnitude greater than those of the existing systems.
Principal component analysis (PCA) is performed on the feature vectors to understand
why the conventional system fails while ALG-ANT is able to maintain good performance.
Figure 4.21(c) shows that in the conventional system, circuit errors have two
eﬀects on the feature vectors: (1) errors make it harder to separate the positive and nega-
tive samples, and (2) the entire feature space is shifted. The SVM fails to correctly perform
classiﬁcation without knowledge of the error statistics. Figure 4.21(d) shows the large magni-
tude error is compensated and converted to small residual errors when ALG-ANT is applied,
which will cause a very small shift in the feature space. As a result, the SVM classiﬁer can
still perform correct classiﬁcation.
Table 4.3 compares the error tolerance pη−max, feature extraction energy/feature (EF ),
and classiﬁcation energy/decision (EC) of three classiﬁers. When VOS is applied for energy
savings, compared with the conventional classiﬁer, the ALG-ANT classiﬁer is able to achieve
44.3% energy savings when errors are in FE only. When both FE and CE are in error, the
ALG-ANT classiﬁer is able to achieve 37.1% and 36.9% energy savings in the FE and CE,
respectively. The energy savings are due to: (1) the elimination of an explicit estimator, and
(2) the scaling of supply voltage.
4.4 Conclusions
In this chapter, we propose E-ANT, where the estimator is embedded into the main block via
proper architecture and algorithm level transforms, resulting in a low overhead architecture
with the same error compensation functionality. At the architecture level, ARCH-ANT uses
data path decomposition to embed a reduced precision replica estimator into the main block.
The data path decomposition is general and can be derived for a wide class of compute kernels.
At the algorithm level, ALG-ANT employs additional optimization constraints during
the algorithm to architecture mapping to embed the estimator into the main block. The
result is a single architecture that can be used to obtain both the estimator and main block
outputs as in ANT systems. The effectiveness of the proposed ARCH-ANT and ALG-ANT
techniques has been demonstrated through the design of an SVM EEG seizure classification
system, where simulation results in a commercial 45 nm CMOS process show that ARCH-ANT
achieves up to 38% error tolerance and up to 50.6% energy savings compared with
an uncompensated system. ALG-ANT achieves up to 41% error tolerance and up to 44.3%
energy savings compared with an uncompensated system.
Figure 4.21: Simulation results: (a) ptp of conventional, retraining, and ALG-ANT classifier
with pfp ≤ 0.01 when errors are in feature extractor only, (b) ptp of conventional, retraining,
and ALG-ANT system with pfp ≤ 0.01 when errors are in both the feature extractor and
the classifier, (c) PCA results of error free and erroneous features for conventional
classifier, and (d) PCA results of error free and erroneous features for ALG-ANT classifier.
(Axes: PCA dim. 1, PCA dim. 2. Legend: error free non-seizure, error free seizure,
erroneous non-seizure, erroneous seizure.)
This work shows that by exploring architectural and algorithmic level transforms, it is
possible to design architectures that are inherently error resilient without explicit estimator
blocks. It opens up a few research directions to extend or generalize existing SEC techniques.
In particular, E-ANT techniques for other SEC techniques such as SSNOC [63] and soft-
NMR [61] can be derived. Moreover, with the adoption of near/sub-threshold voltage design
and continued scaling of the CMOS process, PVT-induced and defect-induced errors are
becoming a growing concern for the design of ULP platforms. E-ANT techniques, and in
general SEC techniques, can be applied in the near-threshold region to enhance system
robustness in the presence of these new error models.
Table 4.3: ALG-ANT Performance and Energy Comparison

                Errors in FE only               Errors in both FE and CE
                pη−max        Energy Savings    pη−max    Energy Savings    Energy Savings
                              in FE                       in FE             in CE
Conventional    1.5 × 10−4    NA                10−4      NA                NA
Retraining      3 × 10−4      5.1%              10−4      0                 0
ALG-ANT         0.41          44.3%             0.19      37.1%             36.9%
Chapter 5
PROBABILISTIC ERROR MODELS FOR MACHINE
LEARNING KERNELS IMPLEMENTED ON
STOCHASTIC NANOSCALE FABRICS
Systematic design of ML kernels on stochastic nanoscale fabrics requires one to eﬃciently
predict the behavior of such implementations. For this, high-level error models of key ML
building blocks need to be developed. The error models of such kernels need to capture
the stochastic behavior of the underlying fabric such as voltage overscaling (VOS), process
variations, and defects. Compact analytical models of kernel behavior in the presence of
errors are very desirable as these can be employed to: (1) characterize the inherent error
resiliency of ML algorithms, and (2) evaluate the eﬀectiveness of error resiliency techniques
in compensating for these errors.
The error behavior of computational kernels can be fully captured in terms of their joint
probability mass functions (PMFs). To date, not much work has been done on this topic.
Analytical models for logic errors [116], transient errors [117], and timing errors [118] have
been proposed. These models focus on obtaining expressions for the error rate/magnitude.
In approximate computing, theoretical models have been proposed to model the inaccuracy
of circuits [119]. However, the models are architecture speciﬁc. Interval-based approaches
(interval arithmetic or aﬃne arithmetic) [120] have been proposed to model and propagate
PMFs. These approaches need to store the entire error PMF. A lookup table based technique
[121] has been proposed to characterize the statistical properties of approximate hardware.
These models only capture the standard deviations of basic circuit building blocks rather
than the error PMF. In signature analysis based testing, symmetrical error model [122] and
independent error model [123] are employed to model the output error PMF. Additionally, in
all the models for approximate computing, the errors are due to imprecise but deterministic
circuits, not dynamic errors due to VOS and process variations.
In this chapter, we propose a probabilistic additive error model capable of modeling errors
due to sources such as VOS, process variations, and defects. Four different variants of
the additive error model are studied: additive over Reals Error Model with independent
Bernoulli RVs (REM-i), additive over Reals Error Model with joint Bernoulli RVs (REM-
j), additive over Galois ﬁeld Error Model with independent Bernoulli RVs (GEM-i), and
additive over Galois ﬁeld Error Model with joint Bernoulli RVs (GEM-j). Analytical ex-
pressions for the error PMFs are derived. Kernel level model validation is accomplished by
comparing the Jensen-Shannon divergence DJS between the modeled PMF and the PMFs
obtained via HDL simulations in a commercial 45 nm CMOS process of MAC units used in
a support vector machine (SVM) to classify the UCI machine learning dataset [124]. Results
indicate that at the MAC unit level, DJS for the GEM models are 2 orders of magnitude
lower (better) than the REM models for VOS, and 1 order of magnitude lower for process
variation errors. However, when considering errors due to defects, DJS for REM-j is between
1 and 2 orders of magnitude lower than the others. Performance (probability of detection
Pdet) prediction of a 2nd order polynomial SVM classifier is conducted using the proposed
model and compared with HDL simulations. We ﬁnd that Pdet estimated using GEM-j is
within 3% for VOS errors when the error rate pη ≤ 80%, and within 5% for process variation
errors when supply voltage Vdd is between 0.3 V and 0.7 V. In addition, Pdet using REM-j is
within 2% for defect errors when the defect rate (the percentage of circuit nets subject to
stuck-at faults) psaf is between 10−3 and 0.2.
The rest of the chapter is organized as follows. Section 5.1 describes the framework for the
error analysis and the distance measure employed to compare models. Section 5.2 presents
the proposed models, and derives analytical expressions for the PMF of the errors. Section
5.3 presents the error characterization/simulation methodology, and model validation results
at kernel level and system level. Conclusions are presented in Section 5.4.
5.1 Modeling Framework and Accuracy Measure
5.1.1 Error Modeling Framework
In this chapter, we employ capital letters and small letters to denote a RV Y and its real-
ization y, respectively. The proposed error modeling framework (see Fig. 5.1) captures the
spatio-temporal distribution of errors. This is required as certain error sources such as defect
and process variations result in an error RV whose PMF is determined by the statistics of
the input and the spatial distribution across physical instantiations of the computational
block. The following notation is employed in this chapter: let Ik (k = 1, 2...,M) denote the
kth instance of the system/kernel subject to errors, and let xk[n], yo,k[n], ηk[n], and ya,k[n]
denote the samples corresponding to the input, error free output, error, and the ﬁnal output
of Ik, with time index n, respectively.
5.1.2 Additive Error Models
For notational simplicity, we drop the index n and k. We consider the additive error model
as shown in Fig. 5.1:
ya = yo ⊕F η (5.1)
where ⊕F denotes addition over ﬁeld F , η is the error and is a realization of the RV N with
PMF P (η). The models (REM-i,j and GEM-i,j) proposed in this paper focus on modeling
P (η).
In REM-i,j, addition in (5.1) is taken over the ﬁeld of reals R. Thus, (5.1) can be written
as
ya = yo + η (5.2)
where ya, yo and η are reals expressed in the 2's complement form. For example, η is written
in the 2's complement form as:
$\eta = -\eta^b_0 + \sum_{i=1}^{B_\eta - 1} \eta^b_i 2^{-i}$ (5.3)
where $\eta^b_i \in \{0, 1\}$ and $B_\eta$ are the ith bit and bit precision of η, respectively. The 2's complement
form of ya and yo can be expressed similarly.
In GEM-i,j, addition in (5.1) is taken over the Galois ﬁeld of 2 (GF(2)). Thus, (5.1) can
be written as:
ya = yo ⊕ η (5.4)
where ya,yo and η are the bit vectors representing ya, yo and η, respectively, and ⊕ is the
bitwise XOR operator. For example, η is given by the vectorized form as:
$\boldsymbol{\eta} = [\eta^b_0, \ldots, \eta^b_{B_\eta - 1}]^T$ (5.5)
and is a realization of the RV $\mathbf{N} = [N^b_0, N^b_1, \ldots, N^b_{B_\eta - 1}]^T$. The vectorized form of ya and yo can
be expressed similarly. Note that the 2's complement form and the vectorized form of η are
equivalent.
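The difference between the two additive models can be made concrete with a minimal sketch.
It assumes a 4-bit two's complement fraction as in (5.3), with bit 0 as the sign bit; the
helper names and the example values are illustrative and do not come from the simulations
in this chapter. The example applies the same physical bit flip to two different error free
outputs and extracts the error under (5.2) and under (5.4).

```python
def bits_to_value(bits):
    """Two's complement fraction as in (5.3): -b0 + sum_{i>=1} b_i * 2^-i."""
    return -bits[0] + sum(b * 2.0 ** -i for i, b in enumerate(bits) if i >= 1)


def value_to_bits(value, n_bits):
    """Bit vector [b0, ..., b_{B-1}] of a value on the n_bits grid in [-1, 1)."""
    scaled = int(round(value * 2 ** (n_bits - 1))) % (2 ** n_bits)
    return [(scaled >> (n_bits - 1 - i)) & 1 for i in range(n_bits)]


def rem_error(y_o, y_a):
    """Error under (5.2): addition over the reals, eta = y_a - y_o."""
    return y_a - y_o


def gem_error(y_o, y_a, n_bits):
    """Error under (5.4): addition over GF(2), eta = y_a XOR y_o bit by bit."""
    bits_o = value_to_bits(y_o, n_bits)
    bits_a = value_to_bits(y_a, n_bits)
    return [a ^ o for a, o in zip(bits_a, bits_o)]


B = 4
# The same flip of bit 1 applied to two different error free outputs.
cases = [(0.375, 0.875),   # 0011 -> 0111
         (0.5, 0.0)]       # 0100 -> 0000
for y_o, y_a in cases:
    print(y_o, y_a, "REM error:", rem_error(y_o, y_a),
          "GEM error:", gem_error(y_o, y_a, B))
```

In this example the GEM error vector is the same for both cases, whereas the REM error
changes sign with the error free output, so the two models assign probability mass to
different error values for the same underlying bit flip.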
Figure 5.1: Error modeling framework.
5.1.3 Model Accuracy Metric
To quantify model accuracy, we employ the widely used Jensen-Shannon (JS) divergence
DJS [125] as the measure of the distance between two distributions. The JS divergence
between two PMFs P and Q is deﬁned as:
$D_{JS}(P\|Q) = \frac{1}{2} D_{KL}(P\|M) + \frac{1}{2} D_{KL}(Q\|M)$ (5.6)
where $M(\cdot) = \frac{1}{2} P(\cdot) + \frac{1}{2} Q(\cdot)$, and DKL is the Kullback-Leibler (KL) divergence defined as:
$D_{KL}(P\|Q) = \sum_i P(i) \log_2\left(\frac{P(i)}{Q(i)}\right)$ (5.7)
The reason for choosing the JS divergence as the distance measure is that it is symmetric
(DJS(P ||Q) = DJS(Q||P )) and bounded (0 ≤ DJS ≤ 1) [125], unlike KL divergence.
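As a minimal sketch of (5.6) and (5.7), the following computes the JS divergence between
two PMFs given as equal-length lists over the same support. The function names are
illustrative, and terms with P(i) = 0 are skipped under the usual convention 0 log 0 = 0.

```python
import math


def kl_divergence(p, q):
    """D_KL(P||Q) in bits, as in (5.7); terms with P(i) = 0 contribute zero."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """D_JS(P||Q) as in (5.6), using the mixture M = (P + Q) / 2."""
    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


p = [0.7, 0.2, 0.1, 0.0]
q = [0.1, 0.2, 0.3, 0.4]
print("D_JS(P||Q) =", js_divergence(p, q))
print("D_JS(Q||P) =", js_divergence(q, p))
print("disjoint supports:", js_divergence([1.0, 0.0], [0.0, 1.0]))
```

Because M(i) > 0 whenever P(i) > 0 or Q(i) > 0, the JS divergence stays finite even when
the two PMFs have different supports, which is not the case for the KL divergence itself.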
5.2 Error Model Derivation
The challenges in modeling error N lie in the fact that: (1) N is a discrete RV and is
restricted to certain error magnitudes (especially when error rate is low), and (2) the PMF
is not smooth. Instead of modeling the error magnitude directly, we propose to model the
bits $N^b_i$ of N as joint RVs. Four different variants of the additive error model are studied:
REM-i, REM-j, GEM-i, and GEM-j.
5.2.1 REM-i: Additive over Real Error Model with Independent Bernoulli
RVs
In REM-i (see (5.2)), $N^b_i$ ($i = 0, 1, \ldots, B_\eta - 1$) is modeled as a Bernoulli RV, so the PMF of
$N^b_i$ can be written as:
$P_{N^b_i}(x) = \begin{cases} p_i & \text{if } x = 1 \\ 1 - p_i & \text{if } x = 0 \end{cases}$ (5.8)
and the $N^b_i$ ($i = 0, 1, \ldots, B_\eta - 1$) are assumed to be independent. Thus, under REM-i, the PMF
P(η) can be obtained from (5.8) as:
$P(\eta) = \prod_{i=0}^{B_\eta - 1} p_i^{\eta^b_i} (1 - p_i)^{1 - \eta^b_i}$ (5.9)
Statistical metrics such as the mean and variance of N can be easily derived from (5.3)
and (5.9). The modeling complexity, deﬁned as the number of parameters to be estimated,
is O(Bη) for REM-i, as shown in (5.9).
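A minimal sketch of the independent-bit model follows. It estimates the $B_\eta$ parameters
$p_i$ from a list of error bit vectors, evaluates (5.9), and computes the mean of N from
(5.3) by linearity of expectation. The four 2-bit error samples are made up for
illustration, and the function names are not part of the original methodology.

```python
from itertools import product


def estimate_bit_probs(error_bit_vectors):
    """p_i = fraction of samples in which error bit i equals 1."""
    n = len(error_bit_vectors)
    n_bits = len(error_bit_vectors[0])
    return [sum(v[i] for v in error_bit_vectors) / n for i in range(n_bits)]


def pmf_independent(eta_bits, probs):
    """P(eta) under independent Bernoulli bits, as in (5.9)."""
    result = 1.0
    for bit, p in zip(eta_bits, probs):
        result *= p if bit == 1 else 1.0 - p
    return result


def mean_error(probs):
    """E[N] from (5.3): -p_0 + sum_{i>=1} p_i * 2^-i."""
    return -probs[0] + sum(p * 2.0 ** -i for i, p in enumerate(probs) if i >= 1)


# Hypothetical 2-bit error samples [eta_0, eta_1].
samples = [[0, 1], [0, 0], [1, 1], [0, 1]]
probs = estimate_bit_probs(samples)
pmf = {bits: pmf_independent(bits, probs) for bits in product([0, 1], repeat=2)}

print("p_i:", probs)
print("PMF:", pmf)
print("E[N]:", mean_error(probs))
```

Only one parameter per bit is stored, which is the O($B_\eta$) modeling complexity stated
above, while the full PMF over all $2^{B_\eta}$ error values can still be evaluated on demand.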
5.2.2 REM-j: Additive over Real Error Model with Joint Bernoulli RVs
The pairwise covariance between $N^b_i$ and $N^b_j$ can be included to improve the modeling accuracy.
In REM-j, the PMF of N is parametrized by the mean vector $\mu_\eta = [p_0, p_1, \ldots, p_{B_\eta - 1}]^T$
and the covariance matrix $C_\eta$, where $C_\eta(i, j) = \mathrm{cov}(N^b_i, N^b_j)$ is the covariance between $N^b_i$
and $N^b_j$ for $i, j = 0, 1, \ldots, B_\eta - 1$.
The dichotomized Gaussian (DG) distribution [126] can be used to obtain the PMF
of N. It is shown in [126] that for any N, there exists a latent multivariate Gaussian
$U = [U_0, U_1, \ldots, U_{B_\eta - 1}]^T$ with mean vector $\mu_u$ and covariance matrix $C_u$ such that after
dichotomizing U, i.e.,
$\hat{N}^b_i = \begin{cases} 1 & U_i \ge 0 \\ 0 & U_i < 0 \end{cases}$ for $i = 0, 1, \ldots, B_\eta - 1$ (5.10)
the obtained RV $\hat{N} = [\hat{N}^b_0, \hat{N}^b_1, \ldots, \hat{N}^b_{B_\eta - 1}]^T$ can have identical first and second order statistics
as N. Therefore, as shown in Section 5.5, REM-j can be obtained as:
$P(\eta) = P_N([N^b_0, \ldots, N^b_{B_\eta - 1}]^T = [\eta^b_0, \ldots, \eta^b_{B_\eta - 1}]^T) = \Phi([0, \ldots, 0]^T; D\mu_u, D C_u D^T)$ (5.11)
where
$D = \begin{bmatrix} (-1)^{\eta^b_0} & 0 & 0 \\ 0 & \ddots & 0 \\ 0 & 0 & (-1)^{\eta^b_{B_\eta - 1}} \end{bmatrix}$ (5.12)
and $\Phi([0, \ldots, 0]^T; D\mu_u, D C_u D^T)$ is the CDF of the joint Gaussian with mean $D\mu_u$ and
covariance matrix $D C_u D^T$ evaluated at $[0, \ldots, 0]^T$.
The mean and variance of N can be calculated using (5.3) and (5.11). In REM-j, the
parameters to be estimated are the mean vector µu and the covariance matrix Cu. Thus,
the modeling complexity is $O(B_\eta^2)$, as shown in (5.11).
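The dichotomization step in (5.10) can be sketched for the two-bit case. Instead of
evaluating the Gaussian CDF in (5.11), this sketch swaps in plain Monte Carlo sampling of
the latent Gaussian, assuming unit variances and a correlation coefficient `rho`; these
parameter choices and the function name are illustrative assumptions. For zero means, the
estimate of P([1, 1]) can be checked against the known bivariate orthant probability
1/4 + arcsin(ρ)/(2π).

```python
import math
import random


def sample_dg_pmf(mu, rho, n_samples, seed=0):
    """Monte Carlo PMF of two dichotomized, correlated unit-variance Gaussians."""
    rng = random.Random(seed)
    counts = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 0}
    for _ in range(n_samples):
        z0 = rng.gauss(0.0, 1.0)
        z1 = rng.gauss(0.0, 1.0)
        u0 = mu[0] + z0
        u1 = mu[1] + rho * z0 + math.sqrt(1.0 - rho * rho) * z1  # corr(u0, u1) = rho
        bits = (1 if u0 >= 0 else 0, 1 if u1 >= 0 else 0)        # rule (5.10)
        counts[bits] += 1
    return {bits: c / n_samples for bits, c in counts.items()}


pmf = sample_dg_pmf(mu=(0.0, 0.0), rho=0.5, n_samples=100_000)
expected_11 = 0.25 + math.asin(0.5) / (2.0 * math.pi)

print("estimated PMF:", pmf)
print("closed-form P([1, 1]) for zero mean:", expected_11)
```

A positive latent correlation makes the outcomes [0, 0] and [1, 1] more likely than under
independent bits with the same marginals, which is the dependence that the covariance
matrix captures and the independent-bit model cannot.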
5.2.3 GEM-i and GEM-j
The main diﬀerence between GEM-i,j (see (5.4)) and REM-i,j (see (5.2)) is that in GEM-i,j,
the error is deﬁned using addition over GF(2) instead of real addition. Since the vectorized
form N and the 2's complement form N are equivalent, GEM-i,j can be derived in the same
manner as REM-i,j. Therefore, the PMF P (η) under GEM-i and GEM-j has the same form
as in (5.9) and (5.11), respectively. Note that the independent error model [123] is a special
case (pi = p,∀i) of the GEM-i model. The modeling complexities for GEM-i and GEM-j
are $O(B_\eta)$ and $O(B_\eta^2)$, respectively.
5.3 Model Validation
The proposed models are validated and compared at both the kernel and system levels. Ker-
nel level validation aims at comparing the JS divergence DJS between the proposed models
and the PMFs obtained via HDL simulation. System level validation aims at validating the
accuracy of the proposed models in predicting system level performance metric S. As shown
in Fig. 5.1, S can be obtained via averaging over the spatio-temporal domain. However,
we employ the following procedure in order to evaluate the performance yield: (1) For each
instance Ik, the system level performance metric for Ik is obtained by averaging over the
input X, i.e., Sk = E(S|Ik), where Sk is a RV, and (2) statistical measures such as the mean
and standard deviation of Sk can be obtained by performing spatial averaging.
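The two-step procedure above can be sketched as follows, assuming the per-input outcome of
each instance is available as a list of 0/1 values (1 meaning a correct decision). The
three instances and their outcomes are made-up illustrative data, and the population
standard deviation is used as one possible choice of spread measure.

```python
import statistics


def per_instance_metric(outcomes):
    """S_k = E(S | I_k): average of the metric over the inputs of one instance."""
    return sum(outcomes) / len(outcomes)


def spatial_statistics(outcomes_per_instance):
    """Mean and standard deviation of S_k across instances (spatial averaging)."""
    s_k = [per_instance_metric(o) for o in outcomes_per_instance]
    return statistics.mean(s_k), statistics.pstdev(s_k)


# Hypothetical outcomes for M = 3 instances, 4 inputs each.
outcomes_per_instance = [
    [1, 1, 1, 0],
    [1, 1, 1, 1],
    [1, 0, 1, 0],
]
mean_s, std_s = spatial_statistics(outcomes_per_instance)
print("mean of S_k:", mean_s, "std of S_k:", std_s)
```

Keeping the per-instance values $S_k$ rather than a single spatio-temporal average is what
allows the spread across instances, and hence the yield, to be examined.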
5.3.1 Error Characterization and Injection Methodology
Figure 5.2: Model validation: (a) error characterization and injection methodology, and (b)
2nd order polynomial kernel SVM classiﬁer.
Figure 5.2(a) shows the error characterization and injection methodology for VOS, process
variation, and defect errors in a common framework. In this paper, an SVM classiﬁer as
shown in Fig. 5.2(b) is employed to validate the models. The SVM classiﬁer consists of two
types of multiply accumulator (MAC) kernels: MAC1 is an 8 b input, 8 b coeﬃcient, and
22 b output MAC used in the ﬁrst stage, and MAC2 is a 10 b input, 8 b coeﬃcient, and 24 b
output MAC used in the second stage. Simulation results are obtained using a commercial
45 nm CMOS process.
VOS error characterization and injection are done as follows:
1. Characterize delay vs. Vdd of basic gates such as AND and XOR using HSPICE for
0.3 V ≤ Vdd ≤ 1.2 V.
2. Develop structural Verilog HDL models for the SVM classiﬁer using the basic gates
characterized in Step 1.
3. Run HDL (bit and clock accurate) simulations using a characterization dataset to
obtain error samples η and classiﬁcation accuracy Pdet−h. The characterization dataset
is obtained via sampling with replacement from the application level data to emulate
the input statistics. Note that we treat the detection accuracy pdet = P (Yˆa = c) as a
RV, which we denote as Pdet.
4. During kernel level validation, analytical models P (η) were built using REM-i,j and
GEM-i,j (see (5.9) and (5.11)). The JS divergence between the models and the char-
acterized error PMFs were calculated according to (5.6).
5. During system level validation, run ﬁxed-point MATLAB simulations using P (η) to
inject errors using the UCI dataset to obtain detection accuracy Pdet−s. Compare
Pdet−s with Pdet−h.
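Two bookkeeping operations in the steps above can be sketched in a few lines: drawing a
characterization dataset by sampling with replacement (Step 3) and turning the recorded
error samples into an empirical PMF for comparison with the models (Step 4). The data
values and function names are illustrative assumptions; the HDL simulation itself is not
modeled here.

```python
import random
from collections import Counter


def characterization_set(dataset, size, seed=0):
    """Draw `size` inputs from `dataset` by sampling with replacement."""
    rng = random.Random(seed)
    return rng.choices(dataset, k=size)


def empirical_pmf(error_samples):
    """Relative frequency of each observed error value."""
    counts = Counter(error_samples)
    total = len(error_samples)
    return {value: count / total for value, count in counts.items()}


# Hypothetical application-level inputs and recorded error samples.
application_inputs = [3, 17, 42, 56, 90]
char_inputs = characterization_set(application_inputs, size=8)
error_samples = [0, 0, 0, 4, -8, 0, 4, 0]

print("characterization inputs:", char_inputs)
print("empirical error PMF:", empirical_pmf(error_samples))
```

The resulting dictionary can be expanded onto a common support and passed, together with a
model PMF, to a JS divergence routine implementing (5.6).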
Process variation error characterization and injection are done as follows:
1. Characterize the gate delay distribution vs. operating voltage Vdd of basic gates such
as AND and XOR using HSPICE in the NTV range 0.3 V-0.7 V.
2. Implement the SVM architecture using structural Verilog HDL using the basic gates
characterized in Step 1.
3. Emulate process variations at NTV by generating multiple (30) architectural instances
and assigning random gate delays obtained via sampling the gate delay distributions
obtained in Step 1.
4. Kernel and system level model validations were conducted following the same procedure
as in the VOS methodology.
Defect error characterization and injection are done as follows:
1. Develop structural Verilog HDL models for the SVM classifier.
2. During HDL simulation, multiple instances (30) were generated, and defects (stuck-at-one
and stuck-at-zero errors) with different defect rates psaf were injected into
randomly selected nets in the Verilog netlist using custom scripts.
3. Kernel and system level model validations were conducted following the same procedure
as in the VOS methodology.
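The defect injection step can be illustrated on a toy netlist. The sketch below is not the
custom script used in this work: it assumes a one-bit full adder described as a list of
gates, selects each net independently with probability `p_saf`, and forces selected nets to
a random stuck value. The netlist, the selection rule, and all names are illustrative.

```python
import random

# Toy full adder: (output net, gate type, input nets), in topological order.
NETLIST = [
    ("x1", "XOR", ("a", "b")),
    ("s", "XOR", ("x1", "cin")),
    ("a1", "AND", ("a", "b")),
    ("a2", "AND", ("x1", "cin")),
    ("cout", "OR", ("a1", "a2")),
]
GATES = {
    "AND": lambda x, y: x & y,
    "OR": lambda x, y: x | y,
    "XOR": lambda x, y: x ^ y,
}
ALL_NETS = ["a", "b", "cin"] + [out for out, _, _ in NETLIST]


def pick_stuck_nets(nets, p_saf, rng):
    """Select each net with probability p_saf; map it to its stuck value (0 or 1)."""
    return {net: rng.randint(0, 1) for net in nets if rng.random() < p_saf}


def simulate(inputs, stuck):
    """Evaluate the netlist; a stuck net ignores its driver and keeps its stuck value."""
    values = {net: stuck.get(net, v) for net, v in inputs.items()}
    for out, gate, ins in NETLIST:
        driven = GATES[gate](*(values[i] for i in ins))
        values[out] = stuck.get(out, driven)
    return values["s"], values["cout"]


rng = random.Random(1)
instance_defects = pick_stuck_nets(ALL_NETS, p_saf=0.2, rng=rng)  # one instance I_k
vector = {"a": 1, "b": 1, "cin": 0}
print("defects:", instance_defects)
print("fault free:", simulate(vector, {}), "faulty:", simulate(vector, instance_defects))
```

Repeating `pick_stuck_nets` with different seeds gives multiple defective instances of the
same architecture, and comparing faulty and fault free outputs over an input set yields the
error samples used for PMF characterization.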
5.3.2 Kernel Level Model Validation
Figure 5.3 shows the JS divergence comparison at different voltage overscaling factors Kvos =
Vdd/Vdd−crit, where Vdd−crit is the minimum voltage needed for error free operation. It shows
that for VOS errors, GEM-j achieves the lowest DJS, which is below 10−2 for 0.5 ≤ Kvos ≤
0.95, and both GEM-i and GEM-j achieve up to two-orders-of-magnitude smaller DJS compared
with REM-i and REM-j. Additionally, Figure 5.3 shows that the DJS of the symmetrical
error model [122] and independent error model [123] is higher than that of the GEM models.
For all modeling methods, the PMF modeling accuracy decreases as Kvos decreases (and thus
the error rate pη increases).
Figure 5.3: JS divergence comparison of the proposed models for VOS errors for MAC1
used in the SVM classiﬁer. The JS divergence is calculated between the proposed models
and the error PMFs obtained via HDL simulation. The results for MAC2 are similar.
Figure 5.4 shows that in the case of process variation errors, GEM-j achieves the lowest
DJS which is below 0.03 for 0.3 V ≤ Vdd ≤ 0.7 V, and both GEM-i and GEM-j achieve up
to 10× smaller DJS compared with REM-i and REM-j. Additionally, Figure 5.4 shows that
the DJS of the symmetrical error model [122] and independent error model [123] are higher
than the GEM models. This is expected due to the fact that process variation errors are
indeed timing errors. For all modeling methods, the PMF modeling accuracy decreases with
Vdd.
Figure 5.4: JS divergence comparison of the proposed models for process variation errors
for MAC1 used in the SVM classiﬁer. The JS divergence is calculated between the
proposed models and error PMFs obtained via HDL simulation for M = 30 instances. The
mean DJS is shown in the ﬁgure. The results for MAC2 are similar.
Figure 5.5 shows that unlike timing errors caused by VOS or process variations, in the
case of defects, REM-j achieves the lowest DJS among the models, which is below 0.05 for
10−3 ≤ psaf ≤ 0.2. This indicates that the error statistics of defect errors are
different from those of timing errors, and a different model should be employed. In addition,
Figure 5.5 shows that the DJS of the symmetrical error model [122] and independent error model
[123] is higher than that of any of the proposed models. Figure 5.5 also shows that PMF
modeling accuracy decreases at higher psaf for all modeling methods.
5.3.3 System Level Simulation
To evaluate the model, we employ the probabilistic models in system simulation and compare
them with HDL results following the procedure in Section 5.3.1. We employ the Breast
Cancer Wisconsin dataset from the UCI machine learning repository [124] which consists
of labeled feature vectors (benign vs. malignant) constructed from digitized images of ﬁne
needle aspirates (FNA) of patient tissue, and use the SVM classiﬁer to perform classiﬁcation.
Figure 5.5: JS divergence comparison of the proposed models for defect errors for MAC1
used in the SVM classiﬁer. The JS divergence is calculated between the proposed models
and error PMFs obtained via HDL simulation for M = 30 instances. The mean DJS is
shown in the ﬁgure. The results for MAC2 are similar.
Figure 5.6 plots the Pdet in presence of VOS errors, and shows that GEM-i and GEM-j
are more accurate than REM-i and REM-j. The diﬀerence between the estimated Pdet using
GEM-j and HDL error statistics is within 3% for error rate pη ≤ 80%.
Figure 5.7 shows the distribution of Pdet in presence of process variation errors, and demon-
strates that GEM-i and GEM-j are more accurate than REM-i and REM-j, similar to the
case of VOS errors. The diﬀerence between the estimated Pdet using GEM-j and HDL error
statistics is within 5% for 0.3 V ≤ Vdd ≤ 0.7 V.
Figure 5.8 shows the distribution of Pdet in presence of defect errors, and demonstrates
that REM-j achieves higher accuracy than other models, unlike the case of VOS and process
variation errors. The diﬀerence between the estimated Pdet using REM-j and HDL error
statistics is within 2% for 10−3 ≤ psaf ≤ 0.2.
5.4 Conclusion
In this chapter, probabilistic additive models for circuit errors due to VOS, process variation,
and defects were proposed to effectively predict the performance of ML kernels in presence
of hardware errors. Four models were compared, and analytical expressions for the PMF
were derived. In addition, error characterization/injection methodologies were proposed and
employed to validate the models. Kernel level validation showed that GEM-j is the most
accurate for VOS and process variation errors, but REM-j is the most accurate for defect
errors. System level simulation using a 2nd order polynomial SVM classifier further confirms
the validity of the models.
Figure 5.6: System simulation results in presence of VOS errors for the SVM classifier
comparing the proposed models with HDL simulation results.
5.5 Derivation of REM-j
In this section, we derive (5.11). Using the vectorized notation η in (5.5), P(η) can be
expressed as:
$P(\boldsymbol{\eta}) = P([N^b_0, \ldots, N^b_{B_\eta - 1}]^T = [\eta^b_0, \ldots, \eta^b_{B_\eta - 1}]^T) = P_U((-1)^{\eta^b_0} U_0 < 0, \ldots, (-1)^{\eta^b_{B_\eta - 1}} U_{B_\eta - 1} < 0)$ (5.13)
where $U = [U_0, U_1, \ldots, U_{B_\eta - 1}]^T$ is the latent Gaussian that can be dichotomized to obtain η
according to (5.10). We further define $\hat{U} = DU$, where D is defined in (5.12). Hence, the
mean and covariance of $\hat{U}$ can be calculated as $E(\hat{U}) = D\mu_u$ and $\mathrm{Cov}(\hat{U}) = D C_u D^T$, where
$\mu_u$ and $C_u$ are the mean vector and covariance matrix of the latent Gaussian U. The PMF
P(η) can then be obtained as:
$P(\boldsymbol{\eta}) = \Phi([0, \ldots, 0]^T; D\mu_u, D C_u D^T)$
where $\Phi([0, \ldots, 0]^T; D\mu_u, D C_u D^T)$ is the CDF of the joint Gaussian $\hat{U}$ evaluated at $[0, \ldots, 0]^T$.
Figure 5.7: System simulation results in presence of process variation errors for the SVM
classifier comparing the HDL simulation results with (a) REM-i, (b) REM-j, (c) GEM-i,
and (d) GEM-j. Simulations are performed for 30 instances, the box plot shows the
median, 25%, and 75% quartile of the prediction accuracy Pdet, the dashed line shows the
median Pdet.
Figure 5.8: System simulation results in presence of defect errors for the SVM classifier
comparing the HDL simulation results with (a) REM-i, (b) REM-j, (c) GEM-i, and (d)
GEM-j. Simulations are performed for 30 instances, the box plot shows the median, 25%,
and 75% quartile of the prediction accuracy Pdet, the dashed line shows the median Pdet.
Chapter 6
ERROR-RESILIENT MACHINE LEARNING IN
NEAR THRESHOLD VOLTAGE VIA CLASSIFIER
ENSEMBLE
In this chapter, we present the design of error-resilient machine learning architectures by em-
ploying a distributed machine learning framework referred to as classiﬁer ensemble (CE). The
most common machine learning architecture is the centralized architecture (see Fig. 6.1(a))
where a complex block such as the support vector machine (SVM) is employed to process all
the input data. However, the computational complexity of centralized architecture increases
dramatically as a function of the non-linearity of the decision boundary [54]. The CE (see
Fig. 6.1(b)) is a distributed architecture for machine learning which combines several weak
(low-complexity) classiﬁers to form a strong classiﬁer. CE enables on-chip training due to
its distributed nature, and exhibits robustness to feature/label noise. Thus, it is of great
importance to compare the robustness and energy eﬃciency of distributed machine learning
architectures designed using CE with centralized architectures such as SVM. Speciﬁcally, we
hypothesize that architectures based on distributed algorithms are more robust than those
based on centralized ones in presence of timing errors due to NTV operations. We compare
a CE method - random forest (RF) - with SVM using architectural-level error models [127]
in a commercial 45 nm CMOS process on the breast cancer data set in the UCI machine
learning repository [124]. We show that RF achieves a detection accuracy (Pdet) that varies
by 3.2% while maintaining a median Pdet ≥ 0.9 when operating with a gate level delay
variation of 28.9%. This is 5× lower as compared to SVM which exhibits a Pdet that varies
by 16.8% under identical conditions. We further propose a new error weighted voting scheme to
enhance the robustness of RF by employing the timing error statistics of the NTV circuit
fabric. Simulation results conﬁrm that the proposed method leads to a Pdet that varies by
only 1.4%, which is 12× lower compared to SVM.
The rest of the chapter is organized as follows. Section 6.1 provides the background
for CE and SVM. Section 6.2 describes dedicated architectures for RF and SVM classifiers.
Section 6.3 presents simulation results validating the error models in a 45 nm CMOS process,
and employs these models to compare the detection accuracy of SVM, RF, and the proposed
RF with the error weighted voting scheme. Conclusions are presented in Section 6.4.
Figure 6.1: Two distinct machine learning frameworks: (a) centralized machine learning,
and (b) classiﬁer ensemble.
6.1 Background
6.1.1 Classiﬁer Ensemble (CE)
Classiﬁer ensemble (also referred to as multiple classiﬁer system) has been employed to
enhance the performance of single classiﬁer system [128]. A wide variety of CE methods
exist. In bootstrap aggregating (bagging) [129], multiple training sets are generated from
the original training set via random sampling with replacement, in order to train multiple
classiﬁers. Adaboost [130] is another popular method for ensemble generation. The training
samples are re-weighted after each iteration so that the mis-classiﬁed samples get higher
weights. Other methods such as randomness injection, random subspace and output coding
[128] also exist.
RF is a CE method that combines random subspace and bagging, while employing an
ensemble of decision trees (DTs) as weak classifiers. It is a popular technique for classification,
prediction, and variable selection, and has yielded results superior to those of other linear
and non-linear predictive modeling techniques [131]. Advantages include parallel training,
robustness to overfitting, ease of design, and the ability to obtain an out-of-bag (OOB) error
estimate, among others.
In RF, the training set for each individual DT is generated using bagging. During the
training of each DT, a random subset of features is selected, and the best feature is selected
to split the DT according to an appropriate criterion. Several variations of RF exist based
on the type of DT used as base classiﬁers. Classiﬁcation and regression tree (CART) [131]
employs the Gini index as a measure of the impurity of nodes. ID3 [132] employs information
gain as the criterion. C4.5 [132] improves ID3 by using the information gain ratio.
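The three split criteria named above can be sketched in a few lines of Python. This is a minimal illustration, not the training code used in this work: the helper names (`gini`, `entropy`, `split_scores`) and the restriction to a binary threshold split on one numeric feature are assumptions made for the example.

```python
import math
from collections import Counter


def gini(labels):
    """Gini impurity 1 - sum_k p_k^2 of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((count / n) * math.log2(count / n)
                for count in Counter(labels).values())


def split_scores(values, labels, threshold):
    """Score the binary split `value <= threshold` (hypothetical helper).

    Returns (Gini decrease, information gain, gain ratio).
    """
    left = [y for v, y in zip(values, labels) if v <= threshold]
    right = [y for v, y in zip(values, labels) if v > threshold]
    w_left, w_right = len(left) / len(labels), len(right) / len(labels)

    gini_decrease = gini(labels) - (w_left * gini(left) + w_right * gini(right))
    info_gain = entropy(labels) - (w_left * entropy(left) + w_right * entropy(right))
    # Split information is the entropy of the split itself (used by the gain ratio).
    split_info = -sum(w * math.log2(w) for w in (w_left, w_right) if w > 0)
    gain_ratio = info_gain / split_info if split_info > 0 else 0.0
    return gini_decrease, info_gain, gain_ratio


# A split that separates the two classes perfectly.
print(split_scores([1.0, 2.0, 3.0, 4.0], [0, 0, 1, 1], threshold=2.0))
```

During training, a DT would evaluate such a score for every candidate feature in the random subset and keep the best one.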
6.1.2 Support Vector Machine
Support vector machine (SVM) [133] is a popular supervised learning method for classiﬁca-
tion and regression. SVM operates by ﬁrst training a model (the training phase) followed
by test/classiﬁcation (the test phase). During the training phase, labeled feature vectors
are used to train a model. During the test phase, SVM produces a predictive label when
provided with a new (test) feature vector. SVM training can be formulated as the solution
to the following optimization problem [133]:
\[
\min_{w,\, b,\, \xi} \; \frac{1}{2}\|w\|^2 + C \sum_i \xi_i
\quad \text{s.t.} \quad c_i (w^T x_i - b) \ge 1 - \xi_i, \quad \xi_i \ge 0
\]
where C is the cost factor, ξi is the soft margin, xi is the feature vector, ci is the label
corresponding to the feature vector xi, w is the weight vector, and b is the bias. It can
be shown that the optimum weights are a linear combination of the feature vectors that lie
on the margins, i.e., support vectors. Kernel tricks can be employed to realize non-linear
decision boundaries [133].
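As a small numerical illustration of the formulation above, the following Python sketch evaluates the objective for a given (w, b). It assumes that each ξi takes its smallest feasible value, max(0, 1 − ci(w^T xi − b)); the function name and the toy data are hypothetical and are not taken from this work.

```python
def soft_margin_objective(w, b, C, X, labels):
    """Evaluate 0.5*||w||^2 + C*sum(xi_i) for a linear SVM.

    Each xi_i is set to the smallest value satisfying both constraints,
    i.e. max(0, 1 - c_i * (w^T x_i - b)).
    """
    slacks = []
    for x, c in zip(X, labels):
        margin = c * (sum(wj * xj for wj, xj in zip(w, x)) - b)
        slacks.append(max(0.0, 1.0 - margin))
    objective = 0.5 * sum(wj * wj for wj in w) + C * sum(slacks)
    return objective, slacks


# Toy example: the third sample lies inside the margin and needs a slack of 0.5.
X = [[2.0, 0.0], [-2.0, 0.0], [0.5, 0.0]]
labels = [1, -1, 1]
print(soft_margin_objective([1.0, 0.0], 0.0, 1.0, X, labels))
```

A larger cost factor C penalizes margin violations more heavily, at the expense of a larger ‖w‖.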
6.2 System Architecture
In this section, we present system architectures for RF and SVM classiﬁers.
6.2.1 The RF Architecture
The RF classifier is implemented using an ensemble of L two-stage DT classifiers (weak
learners), as shown in Fig. 6.2(a). The lth DT is trained on a bootstrapped training set $S_l$
obtained from the original training set $S$, and processes the $M_l$-dimensional data vector
$x_l = [x_{l,1}, x_{l,2}, \ldots, x_{l,M_l}]^T$ obtained from the $M$-dimensional test data vector $x$ ($M \ge M_l$).
Stage 1 of the lth DT consists of a comparator array that computes $\mathrm{sgn}(x_{l,i} - T_{l,i})$
($l = 1, 2, \ldots, L$, and $i = 1, 2, \ldots, M_l$), where the $T_{l,i}$ are the thresholds obtained via training.
Stage 2 consists of a look-up table (LUT) that encodes the decision of each root-to-leaf path into
a 1-bit output $y_{a,l} \in \{0, 1\}$. The outputs of the L DTs are combined via a voter block to
generate the final decision. Each DT is trained using the Gini index [131] as the training
criterion.
Conventionally, a majority voter is employed to combine the outputs from all DTs as
follows:
yˆa = maj(ya,1, ya,2, ..., ya,L)
where yˆa ∈ {0, 1} is the majority voter output, and ya,l is the lth DT output given by:
ya,l = yo,l ⊕ ηl
where yo,l ∈ {0, 1} is the error-free output and ηl ∈ {0, 1} is the timing error of the lth
DT. The RF with majority voter is denoted as RF-M. In case of binary classiﬁcation, the
majority voter can be implemented as shown in Fig. 6.2(b).
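The error model and the majority voter described above can be sketched as follows. This is an illustrative Python model, not the hardware voter of Fig. 6.2(b); the function names are hypothetical, and resolving ties to 0 when L is even is an assumption of this sketch.

```python
def dt_output(y_o, eta):
    """Observed DT output y_a = y_o XOR eta, where eta = 1 marks a timing error."""
    return y_o ^ eta


def majority_vote(outputs):
    """Return 1 if more than half of the 1-bit DT outputs are 1, else 0."""
    return 1 if 2 * sum(outputs) > len(outputs) else 0


error_free = [1, 1, 1, 0, 1]

# One timing error: the ensemble decision is unchanged.
one_error = [dt_output(y, e) for y, e in zip(error_free, [1, 0, 0, 0, 0])]
print(majority_vote(one_error))

# Two timing errors on correct DTs: the majority flips.
two_errors = [dt_output(y, e) for y, e in zip(error_free, [1, 1, 0, 0, 0])]
print(majority_vote(two_errors))
```

The second case shows why treating all DTs as equally reliable becomes fragile once several of them suffer timing errors.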
Figure 6.2: System architecture for: (a) the RF classifier with L DTs, (b) the majority
voter, and (c) the weighted voter.
In order to enhance the robustness of RF in the presence of timing errors, we propose an error
weighted voting scheme in which the timing error statistics are incorporated into the decision
process. To do so, we employ the maximum a posteriori (MAP) criterion, i.e.:
\[ \hat{y}_a = \arg\max_{c \in C} P(c \mid x) \tag{6.1} \]
where C is the label set, and P (c|x) is the posterior probability of class label c conditioned
on the test data x. Thus:
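\[
\begin{aligned}
P(c \mid x) &= \sum_{l=1}^{L} P(c \mid R_l, x)\, P(R_l \mid x) && \text{(6.2)} \\
&= \sum_{l=1}^{L} P(c \mid R_l, x)\, P(R_l) && \text{(6.3)} \\
&\approx \sum_{l=1}^{L} 1\{y_{a,l} = c\}\, p_l && \text{(6.4)}
\end{aligned}
\]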
where P (c|Rl,x) denotes the posterior probability of the class label, Rl is the event of the
lth DT being correct during the training phase, pl = P (Rl) is the probability of the event Rl,
and 1{·} denotes the indicator function. Equation (6.2) implies (6.3) because the test data
x and event Rl are independent, and (6.3) implies (6.4) because we assume the DT output
has a probability mass of 1 at the selected class label. The ﬁnal decision yˆa is obtained
from (6.1) by choosing the label c that maximizes (6.4). Note that pl represents the decision
accuracy of the lth DT in presence of timing errors.
In the case of binary classification, one can simplify (6.1) using (6.4) into:
\[
\hat{y}_a =
\begin{cases}
1 & \text{if } \sum_{l=1}^{L} 1\{y_{a,l} = 1\}\, p'_l > \frac{1}{2} \\
0 & \text{otherwise}
\end{cases}
\]
where $p'_l = p_l / \sum_{k=1}^{L} p_k$, and the voter can be implemented as shown in Fig. 6.2(c).
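The binary weighted decision rule above can be written as a short Python function. This is a sketch of the arithmetic only, not the circuit of Fig. 6.2(c); the function name and the example weights are hypothetical.

```python
def weighted_vote(outputs, p):
    """Binary weighted voter: decide 1 if the normalized weight of the DTs voting 1 exceeds 1/2.

    outputs: list of 1-bit DT outputs y_{a,l}
    p:       list of per-DT accuracies p_l (unnormalized weights)
    """
    total = sum(p)
    score = sum(p_l / total for y, p_l in zip(outputs, p) if y == 1)
    return 1 if score > 0.5 else 0


# One reliable DT outvotes two unreliable ones.
print(weighted_vote([1, 0, 0], [0.9, 0.3, 0.3]))

# With equal weights the rule behaves like a majority voter.
print(weighted_vote([1, 0, 0], [0.5, 0.5, 0.5]))
```

With equal weights the score is simply the fraction of DTs voting 1, so the rule coincides with the majority voter.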
To incorporate the timing error statistics of each DT, we express $p_l$ in (6.4) as follows:
\[
p_l = \sum_{\eta_l=0}^{1} P(R_l, \eta_l) = \sum_{\eta_l=0}^{1} P(R_l \mid \eta_l)\, P(\eta_l) \tag{6.5}
\]
where P (Rl|ηl) is the probability of correct decision of the lth DT conditioned on ηl. The
probabilities P (Rl|ηl) and P (ηl) can be obtained during the training phase for each DT. For
a RF binary classiﬁer, (6.5) can be simpliﬁed into (see Section 6.5.1):
\[
p_l = P(R_l \mid \eta_l = 0)(1 - p_{\eta_l}) + \left(1 - P(R_l \mid \eta_l = 0)\right) p_{\eta_l} \tag{6.6}
\]
where $P(R_l \mid \eta_l = 0)$ can be obtained by performing validation using out-of-bag samples, and
$p_{\eta_l} = P(\eta_l \neq 0)$ is the error rate of the lth DT. We denote the RF with the error weighted
voting scheme as RF-EW.
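Equation (6.6) can be checked numerically with the following Python sketch. The function names are hypothetical, and the second function assumes, as in Section 6.5.1, that the data-noise error and the timing error are independent binary events.

```python
def error_weighted_accuracy(p_oob, p_eta):
    """Eq. (6.6): accuracy p_l of one DT whose 1-bit output is flipped with probability p_eta.

    p_oob: error-free accuracy P(R_l | eta_l = 0), e.g. from out-of-bag validation
    p_eta: timing error rate of the DT
    """
    return p_oob * (1.0 - p_eta) + (1.0 - p_oob) * p_eta


def enumerated_accuracy(p_oob, p_eta):
    """Same quantity obtained by enumerating (e_l, eta_l), assumed independent."""
    total = 0.0
    for e, p_e in ((0, p_oob), (1, 1.0 - p_oob)):
        for eta, p_n in ((0, 1.0 - p_eta), (1, p_eta)):
            if e ^ eta == 0:  # event R_l: the DT output equals the true label
                total += p_e * p_n
    return total


for p_eta in (0.0, 0.1, 0.5, 1.0):
    print(p_eta, error_weighted_accuracy(0.95, p_eta), enumerated_accuracy(0.95, p_eta))
```

The weight equals the error-free accuracy at $p_{\eta_l} = 0$ and falls to 0.5 at $p_{\eta_l} = 0.5$, where the DT output carries no information about the label.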
When error rate pηl = 0, the error weighted voting scheme reduces to the conventional
weighted voter [128] where pl = P (Rl|ηl = 0). The RF with conventional weighted voter is
denoted as RF-W.
The performance of RF-EW improves when the DTs exhibit uncorrelated errors, i.e., the
DT outputs exhibit diversity in terms of error statistics. It is possible to enhance DT diversity
by designing each DT to have diﬀerent: (1) algorithm (algorithmic diversity), (2) architecture
(architectural diversity), and (3) data-path precision (precisional diversity), across the DT
ensemble. Precision has a significant impact on the timing error statistics since the hardware
errors under investigation are due to timing violations. Therefore, in this chapter, the precision
of each DT data-path in the RF-EW is drawn uniformly at random between 4 b and 8 b,
leading to different critical path delays among the DTs, and hence uncorrelated errors.
6.2.2 The SVM Architecture
The centralized machine learning algorithm employed in this chapter is a second-order
polynomial kernel SVM described as:
\[
\hat{y}_a = \mathrm{sgn}(y_a), \qquad
y_a = \sum_{i=1}^{N} \left(\beta s_i^T x + \gamma\right)^2 \alpha_i + b \tag{6.7}
\]
where $x = [x_1, x_2, \ldots, x_M]^T$ is the M-dimensional test data vector, $s_i = [s_1, s_2, \ldots, s_M]^T$ is the
ith support vector, $\alpha_i$ is the weight associated with $s_i$, $b$ is the bias, $\beta$ and $\gamma$ are parameters
of the polynomial kernel, and $N$ is the total number of support vectors (typically $N \gg M$).
Direct computation of (6.7) requires $O(NM)$ multiply-accumulate (MAC) operations. The
following reformulation [134] reduces the number of MAC operations to $O(M^2)$:
\[ y_a = \tilde{x}^T \tilde{W} \tilde{x} + b \tag{6.8} \]
\[ \tilde{W} = \sum_{i=1}^{N} \alpha_i \tilde{s}_i \tilde{s}_i^T \]
where $\tilde{W}$ is a precomputed weight matrix, $\tilde{x} = [1, x^T]^T$, and $\tilde{s}_i = [\gamma, \beta s_i^T]^T$. Figure 6.3 shows
a folded SVM architecture implementing (6.8), where Stage 1 computes $\tilde{W}\tilde{x}$ and Stage 2
computes the dot product between $\tilde{x}$ and the Stage 1 output, and adds the bias term $b$.
Figure 6.3: System architecture for a second-order polynomial kernel SVM classiﬁer.
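The equivalence between (6.7) and (6.8) can be checked with a short pure-Python sketch. The support vectors, weights, and kernel parameters below are made-up toy values, and the function names are hypothetical; the two-step evaluation mirrors Stage 1 and Stage 2 of the folded architecture only at the arithmetic level.

```python
def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def direct_poly_svm(x, svs, alphas, b, beta, gamma):
    """Eq. (6.7): one kernel evaluation per support vector, O(N*M) MACs."""
    return sum(a * (beta * dot(s, x) + gamma) ** 2 for s, a in zip(svs, alphas)) + b


def precompute_weight_matrix(svs, alphas, beta, gamma):
    """W~ = sum_i alpha_i * s~_i s~_i^T with s~_i = [gamma, beta * s_i]; done once, offline."""
    size = len(svs[0]) + 1
    W = [[0.0] * size for _ in range(size)]
    for s, a in zip(svs, alphas):
        s_tilde = [gamma] + [beta * v for v in s]
        for r in range(size):
            for c in range(size):
                W[r][c] += a * s_tilde[r] * s_tilde[c]
    return W


def reformulated_poly_svm(x, W, b):
    """Eq. (6.8): y_a = x~^T W~ x~ + b with x~ = [1, x], O(M^2) MACs."""
    x_tilde = [1.0] + list(x)
    stage1 = [dot(row, x_tilde) for row in W]  # Stage 1: W~ x~
    return dot(x_tilde, stage1) + b            # Stage 2: dot product plus bias


svs = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]]
alphas = [0.5, -1.0, 0.25]
x = [2.0, -1.0]
W = precompute_weight_matrix(svs, alphas, beta=0.5, gamma=1.0)
print(direct_poly_svm(x, svs, alphas, 0.1, 0.5, 1.0), reformulated_poly_svm(x, W, 0.1))
```

Because $\tilde{W}$ depends only on the trained model, its cost is paid once, and the per-sample work no longer grows with the number of support vectors.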
6.2.3 System Analysis
The potential robustness improvement achieved by RF can be analyzed by inspecting the
generalized error $E\big[\big(C - \frac{1}{L}\sum_{l=1}^{L} \hat{Y}_{a,l}\big)^2\big]$, where $C$ is the label and $\frac{1}{L}\sum_{l=1}^{L} \hat{Y}_{a,l}$ is the RF
output; equal weights in the voter are assumed for simplicity of analysis. Here the
expectation is taken over the distribution of the label $C$, the training set $S$, and the timing
errors $N_1, \ldots, N_L$.
We start by deriving the generalized error for a single DT, defined as $E[(C - \hat{Y}_a)^2]$, where
$\hat{Y}_a$ is the DT output. It can be shown that (see Section 6.5.2):
\[
E\big[(C - \hat{Y}_a)^2\big] = \sigma_C^2 + b^2 + \sigma_{\hat{Y}_a}^2 \tag{6.9}
\]
where $\sigma_C^2 = E[(C - E[C])^2]$ is the irreducible error (noise), $b^2 = (E[C] - E[\hat{Y}_a])^2$ is the
bias term, and $\sigma_{\hat{Y}_a}^2 = E[(E[\hat{Y}_a] - \hat{Y}_a)^2]$ is the variance of $\hat{Y}_a$. Such a decomposition identifies
the contribution of different error sources and allows one to understand the effect of CE in
reducing these errors.
For CE, it can further be shown that the noise $\sigma_{C,RF}^2$ and the bias $b_{RF}^2$ (corresponding
to the first two terms in (6.9)) do not change, i.e., $\sigma_{C,RF}^2 = \sigma_C^2$ and $b_{RF}^2 = b^2$, respectively.
However, the output variance $\sigma_{RF}^2$ can be expressed as (see Section 6.5.3):
\[
\sigma_{RF}^2 = \frac{1}{L}\, \sigma_{\hat{Y}_a}^2 \tag{6.10}
\]
We can see from (6.10) that $\sigma_{RF}^2$ is a factor of $L$ smaller than $\sigma_{\hat{Y}_a}^2$, and that for the RF
to achieve a lower variance, $\sigma_{\hat{Y}_a}^2$ should be less than $L$ times the variance of the centralized
system. The reduction of variance leads to reduced generalized error and misclassification
rate.
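The 1/L scaling in (6.10) can be reproduced exactly for a toy model in which the L DT outputs are i.i.d. Bernoulli variables. The sketch below is an assumption-laden illustration: the function name and the value q = 0.8 are hypothetical, and the i.i.d. model is precisely the idealization under which (6.10) is derived.

```python
import math


def ensemble_variance(q, L):
    """Exact variance of the mean of L i.i.d. Bernoulli(q) DT outputs.

    The variance is computed by enumerating the number k of DTs that output 1,
    which follows a binomial distribution.
    """
    mean = q
    variance = 0.0
    for k in range(L + 1):
        pmf = math.comb(L, k) * q ** k * (1.0 - q) ** (L - k)
        variance += pmf * (k / L - mean) ** 2
    return variance


single = ensemble_variance(0.8, 1)
for L in (1, 5, 10, 25):
    # The ratio single / ensemble equals L under the i.i.d. assumption.
    print(L, ensemble_variance(0.8, L), single / ensemble_variance(0.8, L))
```

When the DT outputs are correlated, the cross terms no longer vanish and the reduction is smaller than 1/L.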
6.3 Simulation Results
In Section 6.3.1, the detection accuracies of the SVM and RF architectures are compared
using the validated error models and methodology from Chapter 5. We employ the Breast
Cancer Wisconsin dataset from the UCI machine learning repository [124] which consists
of labeled feature vectors (benign vs. malignant) constructed from digitized images of ﬁne
needle aspirates (FNA) of patient tissue.
The SVM architecture being considered in this study consists of two types of MACs:
Stage 1 employs 8 b input, 8 b coeﬃcient, and 22 b output MACs, and Stage 2 employs 10 b
input, 8 b coeﬃcient, and 24 b output MACs. The conventional RF architecture employing
majority and weighted voter has Stage 1 consisting of comparator arrays with 8 b input and
8 b thresholds, and LUTs implemented as logic networks during the architecture generation.
In the proposed RF-EW, each DT is implemented using a randomly selected precision uni-
formly distributed between 4 b and 8 b for both the input and the thresholds. In all cases,
the parameters of SVM and RF were trained assuming no timing errors. These precisions
were chosen to obtain less than 0.5% degradation in Pdet compared to a ﬂoating point imple-
mentation. The complexities of the SVM and RF (with ensemble size L = 10) were found
to be 1.63K and 1.47K 2-input NAND gate equivalents, respectively.
6.3.1 Comparison of SVM and RF
6.3.1.1 Comparison of Timing Error Rates
We first compare the timing error rates $p_\eta = P(\eta \neq 0)$ of SVM and RF obtained via HDL
simulations as the voltage decreases in NTV. Figure 6.4 shows that the median timing error
rate $\bar{p}_\eta$ increases by 500× from $2.1 \times 10^{-3}$ to 0.99, and from $1.1 \times 10^{-3}$ to 0.61 for SVM and
RF, respectively, as the voltage $V_{dd}$ decreases from 0.7 V to 0.3 V, indicating that the RF
architecture has up to 4.5× lower timing error rate compared with SVM. The error rate of
the RF architecture is lower because it has comparator blocks, which have a much simpler data
path compared with the MAC units in SVM. Figure 6.4 also demonstrates that the gate
level delay variation $(\sigma/\mu)_d$ increases by 12× from 2.8% to 33% as the voltage $V_{dd}$ decreases
from 0.7 V to 0.3 V.
Next, we employ P (η) to inject errors in ﬁxed-point MATLAB simulations of SVM and
RF architectures to compare their robustness to timing errors in NTV. All comparisons
henceforth are in terms of Pdet−s. Hence, we simplify the subscript and denote the detec-
tion accuracy as Pdet. Four architectures are compared: (1) SVM, (2) RF with majority
voter [131] (RF-M), (3) RF with weighted majority voter [128] (RF-W), and (4) RF with
the proposed error weighted voter (RF-EW). We will compare the four architectures in terms
of median (p¯det) and standard deviation (σpdet) of detection accuracy Pdet.
6.3.1.2 Comparison of p¯det
Figure 6.5(a) shows that RF has higher p¯det than SVM when the ensemble size L is suﬃciently
large. Speciﬁcally, RF-M is able to maintain p¯det ≥ 0.9 for (σ/µ)d ≤ 28.9% with L = 10,
whereas SVM can only maintain the same performance for (σ/µ)d ≤ 11.7%. Additionally,
RF-EW achieves up to 3% higher p¯det compared with RF-M and RF-W, and is able to
maintain p¯det ≥ 0.93 for (σ/µ)d ≤ 29.6%. Finally, Figure 6.5(a) further shows that RF with
L = 10 is able to maintain detection performance even at (σ/µ)d of 28.9%. This indicates
that RF architectures have a higher robustness to timing errors compared with SVM in spite
of its complexity being lower by 10% when L = 10.
Figure 6.4: Median error rate p¯η and gate level delay variation (σ/µ)d of SVM and RF
architecture in NTV region of 0.3 V ≤ Vdd ≤ 0.7 V.
Figure 6.5: Robustness comparison in: (a) the median detection accuracy p¯det , and (b) the
standard deviation of detection accuracy σpdet for SVM classiﬁer, RF-M with L = 1 (i.e.
single DT), and RF-M (L = 10), RF-W (L = 10), and RF-EW (L = 10). Simulations were
performed over 30 instances.
6.3.1.3 Comparison of σpdet
Figure 6.5(b) shows that $\sigma_{p_{det}}$ is significantly reduced as L increases. RF-M achieves
$\sigma_{p_{det}} \le 3.5 \times 10^{-2}$ when L = 10, which is 5× lower compared to SVM or RF-M with L = 1. This
further demonstrates that distributed architectures are inherently more robust to timing
errors than centralized ones. Figure 6.5(b) also shows that RF-EW achieves
$\sigma_{p_{det}} \le 1.4 \times 10^{-2}$ when $(\sigma/\mu)_d \le 29.6\%$, which is 12× and 3.5× lower compared to SVM and RF-W,
respectively. This demonstrates that incorporating timing error statistics into the decision
making process enhances robustness. When (σ/µ)d ≥ 30%, σpdet of RF-EW is higher than
that of RF-M and RF-W because all instances of RF-M and RF-W achieve a low Pdet ≈ 0.6,
whereas some instances of RF-EW can still achieve a Pdet ≥ 0.9, leading to increased σpdet .
To further understand the robustness improvement achieved by RF, Fig. 6.6 shows that
the RF output variance σ2RF reduces from 0.16 to 0.02 as L increases from 1 to 25 when no
precision diversity is employed. The variance reduction is more signiﬁcant when the ensemble
size L is small, and slows down as L further increases. This is because the independence
assumption across the DTs is violated for large L. Figure 6.6 also shows that σ2RF can be
further reduced to 0.01 due to more uncorrelated error statistics when precision diversity is
employed as in the RF-EW.
Figure 6.6: The variance of RF output when (σ/µ)d = 29%.
6.4 Conclusion
In this chapter, the inherent robustness of CE and centralized machine learning architectures
in presence of timing violations is compared. It is shown that distributed architectures em-
ploying CE are inherently more robust to timing errors than centralized ones. Furthermore,
it is shown that the algorithm itself can be adapted to further enhance the robustness. Such
enhancement is achieved by using error weighted voting during the decision combination, and
employing precision diversity in the architecture data path. The results demonstrate that in
the CE framework, architectural level information can be incorporated at the system level
to achieve enhanced robustness. In the future, architectural and algorithmic level diversity
techniques can be employed to improve the robustness of CE. In addition, the robustness of
CE in the presence of defect errors (stuck-at faults) can also be evaluated.
6.5 Derivations
6.5.1 Derivation of (6.6)
In this subsection, we derive (6.6). From the theorem of total probability:
\[
\begin{aligned}
p_l &= P(R_l \mid \eta_l = 0)\, P(\eta_l = 0) + P(R_l \mid \eta_l = 1)\, P(\eta_l \neq 0) \\
&= P(R_l \mid \eta_l = 0)(1 - p_{\eta_l}) + P(R_l \mid \eta_l = 1)\, p_{\eta_l} && \text{(6.11)}
\end{aligned}
\]
We need to show $P(R_l \mid \eta_l = 1) = 1 - P(R_l \mid \eta_l = 0)$. In binary classification, the erroneous
output of the lth DT can be expressed as:
\[ y_{a,l} = c \oplus \eta_l \oplus e_l \tag{6.12} \]
where $c$, $\eta_l$, and $e_l$ denote the true label, timing error, and error due to noise in data, respectively.
Thus, the event $R_l = \{e_l \oplus \eta_l = 0\}$, and we have:
\[
\begin{aligned}
P(R_l \mid \eta_l = 1) &= P(e_l \oplus \eta_l = 0 \mid \eta_l = 1) \\
&= P(e_l = 1 \mid \eta_l = 1) && \text{(6.13)} \\
&= P(e_l = 1) && \text{(6.14)} \\
&= P(e_l = 1 \mid \eta_l = 0) \\
&= P(e_l \oplus \eta_l = 1 \mid \eta_l = 0) \\
&= 1 - P(e_l \oplus \eta_l = 0 \mid \eta_l = 0) \\
&= 1 - P(R_l \mid \eta_l = 0) && \text{(6.15)}
\end{aligned}
\]
where the step from (6.13) to (6.14) follows from the independence of $e_l$ and $\eta_l$. Substituting (6.15) into
(6.11) leads to (6.6), thereby completing the proof.
6.5.2 Derivation of (6.9)
In this subsection, we derive (6.9). In deriving the generalized error $E[(C - \hat{Y}_a)^2]$, the
expectation is taken over the label $C$, the training set $S$, and the timing error $N$. Here $S$ and
$N$ are independent. Without loss of generality, we assume a fixed input $X = x$ as suggested
by [135] for notational simplicity. Thus, (6.9) can be expressed as:
\[
\begin{aligned}
E\big[(C - \hat{Y}_a)^2\big] &= E\big[(C - E[C] + E[C] - \hat{Y}_a)^2\big] \\
&= E\big[(C - E[C])^2\big] + E\big[(E[C] - \hat{Y}_a)^2\big] && \text{(6.16)}
\end{aligned}
\]
where we use the fact that
\[
E\big[(C - E[C])(E[C] - \hat{Y}_a)\big]
= E\Big[E\big[(C - E[C])(E[C] - \hat{Y}_a) \mid S, N\big]\Big] = 0 \tag{6.17}
\]
The first term in (6.16), $E[(C - E[C])^2]$, is the noise $\sigma_C^2$. The second term in (6.16) can be
further decomposed as:
\[
\begin{aligned}
E\big[(E[C] - \hat{Y}_a)^2\big] &= E\big[(E[C] - E[\hat{Y}_a] + E[\hat{Y}_a] - \hat{Y}_a)^2\big] && \text{(6.18)} \\
&= (E[C] - E[\hat{Y}_a])^2 + E\big[(E[\hat{Y}_a] - \hat{Y}_a)^2\big] && \text{(6.19)} \\
&= b^2 + \sigma_{\hat{Y}_a}^2 && \text{(6.20)}
\end{aligned}
\]
where in going from (6.18) to (6.19) we use the fact that
\[
E\big[(E[C] - E[\hat{Y}_a])(E[\hat{Y}_a] - \hat{Y}_a)\big]
= (E[C] - E[\hat{Y}_a])\, E\big[E[\hat{Y}_a] - \hat{Y}_a\big] = 0 \tag{6.21}
\]
This completes the proof of (6.9).
6.5.3 Derivation of (6.10)
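In this subsection, we derive (6.10). In deriving the generalized error $E\big[\big(C - \frac{1}{L}\sum_{l=1}^{L} \hat{Y}_{a,l}\big)^2\big]$,
the expectation is taken over $C$, $S$, and the timing errors of the L DTs, $N_1, \ldots, N_L$. The
generalized error can be decomposed similarly to (6.16) and (6.20) as follows:
\[
E\Big[\Big(C - \frac{1}{L}\sum_{l=1}^{L} \hat{Y}_{a,l}\Big)^2\Big] = \sigma_C^2 + b_{RF}^2 + \sigma_{RF}^2
\]
where $\sigma_C^2 = E[(C - E[C])^2]$ is the noise term, the same as the first term in (6.16). Assuming the
$\hat{Y}_{a,l}$ are i.i.d. with the same distribution as the single DT output $\hat{Y}_a$, the bias term $b_{RF}^2$ can be
simplified into:
\[
b_{RF}^2 = \Big(E[C] - E\Big[\frac{1}{L}\sum_{l=1}^{L} \hat{Y}_{a,l}\Big]\Big)^2 = \big(E[C] - E[\hat{Y}_a]\big)^2
\]
which is the same as the first term in (6.20), and $\sigma_{RF}^2$ can be simplified as follows:
\[
\begin{aligned}
\sigma_{RF}^2 &= E\Big[\Big(E\Big[\frac{1}{L}\sum_{l=1}^{L} \hat{Y}_{a,l}\Big] - \frac{1}{L}\sum_{l=1}^{L} \hat{Y}_{a,l}\Big)^2\Big] \\
&= \frac{1}{L^2}\, E\Big[\Big(\sum_{l=1}^{L} \big(E[\hat{Y}_{a,l}] - \hat{Y}_{a,l}\big)\Big)^2\Big] && \text{(6.22)} \\
&= \frac{1}{L^2} \sum_{l=1}^{L} E\big[(E[\hat{Y}_{a,l}] - \hat{Y}_{a,l})^2\big] && \text{(6.23)} \\
&= \frac{1}{L}\, \sigma_{\hat{Y}_a}^2 && \text{(6.24)}
\end{aligned}
\]
where the step from (6.22) to (6.23) follows from the assumed independence of the $\hat{Y}_{a,l}$, which makes
the cross terms vanish. This completes the proof of (6.10).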
Chapter 7
CONCLUSION AND FUTURE WORK
Moving towards the age of ubiquitous computing, ULP platforms for in-silicon machine
learning will be the key enabler for pervasive intelligence in our daily lives. The quest for
energy-efficient in-silicon machine learning is made difficult by energy delivery,
communication, computation, and robustness challenges. This dissertation explores design
approaches that represent a radical change from the conventional design methodology by
embedding computation into the energy delivery, sensing, and emerging stochastic fabrics.
7.1 Dissertation Contributions
The Compute Voltage Regulator Module (C-VRM) has been proposed to embed information
processing into the energy delivery subsystem. The C-VRM eliminates the loss associated
with conventional switched capacitor based energy delivery circuits. Measured results of
a prototype IC show more than 40% savings in system-level energy per operation, and an
eﬃciency ranging from 79% to 83%.
The Compute Sensor approach has been proposed to embed information processing and
learning into the sensing front-end. The Compute Sensor eliminates both the traditional
sensor-processor interface, and the high-SNR/high-energy digital processing by moving fea-
ture extraction and classiﬁcation functions into the analog domain in close proximity to the
APS array. The Compute Sensor designed in 65 nm CMOS is shown to achieve a detection
accuracy greater than 94.7% using the Caltech101 dataset [80], which is within 0.5% of that
achieved by an ideal digital implementation. The performance is achieved with 7× to 17×
lower energy than the conventional architecture for the same level of accuracy.
This dissertation proposes a new SEC technique well-suited for machine learning applications,
i.e., embedded algorithmic-noise tolerance (E-ANT). E-ANT at the architectural
and algorithmic level are applied to EEG seizure detection systems. Simulation results in a
commercial 45 nm CMOS process show that ARCH-ANT can compensate for error rates up
to 0.38, and ALG-ANT can compensate for error rates up to 0.41, while maintaining a true
positive rate ptp > 0.9 and a false positive rate pfp ≤ 0.01. This error tolerance is employed
to reduce energy via the use of voltage overscaling (VOS). ARCH-ANT and ALG-ANT are
able to achieve up to 51% and 44% energy savings, respectively.
This dissertation contributes to the theoretical analysis of SEC by proposing a class
of probabilistic error models that can accurately model the error distribution on the noisy
hardware fabrics. The models are validated in a commercial 45 nm CMOS process and
employed to evaluate the performance of machine learning kernels in presence of hardware
errors. Performance prediction of a support vector machine (SVM) based classiﬁer using
these models indicates that when comparing Monte Carlo with HDL simulations, probability
of detection Pdet estimated using the model is within 3% for VOS error when the error rate
pη ≤ 80%, within 5% for process variation error and within 2% for defect errors when the
defect rate (the percentage of circuit nets subject to stuck-at faults) $p_{saf}$ is between $10^{-3}$
and 0.2.
Finally, SEC techniques have been extended into distributed machine learning algorithms
by proposing error resilient classiﬁer ensemble. Employing the breast cancer data set in
the UCI machine learning repository and architectural-level error models in a commercial
45 nm CMOS process, it is determined that RF-based architectures are signiﬁcantly more
robust than SVM architectures in presence of timing errors due to process variations in
near-threshold voltage (NTV) regions (0.3 V to 0.7 V). Additionally, an error weighted voting
technique that incorporates the timing error statistics of the NTV circuit fabric is proposed
to further enhance the robustness of RF architectures. Simulation results conﬁrm that the
error weighted voting achieves a Pdet that varies by only 1.4%, which is 12× lower compared
to SVM.
7.2 Future Work
7.2.1 System Design
System optimization and design space exploration play a crucial role in designing energy
eﬃcient and robust ULP platforms for in-silicon machine learning. To date, an ad-hoc
approach is adopted for the design space exploration. The search for energy minimum
realization of in-silicon machine learning systems needs to be done in a systematic manner,
much as it is done in the area of low power signal processing and communication systems and
ICs [136, 29]. Many machine learning algorithms are inherently formulated as optimization
problems, i.e., minimizing certain loss functions subject to constraints such as the form
of the functions (linear, quadratic, multi-layer, etc), the sparsity of solutions, and others.
This optimization formulation opens many opportunities where the architecture/circuit level
performance metrics can be incorporated so as to introduce new constraints in the original
optimization problem. Such a resource constrained optimization allows the systematic design
of ULP platforms for machine learning. In fact, the ALG-ANT based FIR ﬁlter in Chapter 4
is one example where we introduce additional constraints during the WLS ﬁlter optimization
to enhance the robustness of the design. Some possible constraints to consider are the
precision requirements in the data path, the error rate/error statistics in the underlying
circuit fabric, the storage/communication cost for data transfer, and others.
7.2.2 Architecture
New architectures are needed for energy minimum realization of in-silicon machine learning
systems. Conventionally, the design of in-silicon machine learning is treated as yet another
problem of eﬃcient computing. As a result, reliable circuit operations are guaranteed by
adopting a worst case design methodology and use of large design margins, limiting the
achievable energy eﬃciency. In contrast, new architectures should embrace both the prob-
abilistic nature of the performance metric, and the statistical behavior in the computing
fabric. To this end, the architectures explored in this dissertation offer several directions
for future extension.
In-sensor machine learning points to an opportunity to achieve large energy savings by the
elimination of interface overhead. At the same time, the non-idealities resulting from mixed
signal implementation require error resilient computing. The proposed Compute Sensor of-
fers a general framework to map diﬀerent algorithms such as decision trees, kernel SVMs,
and ensembles of these classiﬁers, because the bit-line and cross bit-line processors perform
element-wise and dimensionality reduction operations which are common in many of these
algorithms. Feature selection techniques, such as the focus-of-attention mechanisms used
in many vision algorithms [137], can also be embedded into the sensor. These techniques
employ a cascade of simple classiﬁers, and can rapidly discard images/samples that are
not informative. Therefore, further energy savings can be achieved by avoiding the access
of non-informative pixels. In addition, programmable in-sensor machine learning architec-
tures can also be explored to conﬁgure the bit-line and cross bit-line processors to achieve
various inference tasks. Another interesting direction is the combination of in-sensor and recently
proposed in-memory computing [82] architectures so that the inference task is partitioned
between the two systems, completely eliminating the need for conventional von Neumann
architectures.
Machine learning in stochastic fabrics, such as those operating with voltage scaling, process
variation or defects, oﬀers another opportunity where architectural design can be explored
to tolerate device/circuit errors and to enhance energy eﬃciency. Many machine learning
algorithms are iterative and possess inherent tolerance for certain error statistics. New SEC
techniques can be explored for in-silicon machine learning systems. Error compensation
such as retraining can also be explored to develop architectures that have on-line training
capabilities. Such architectures are of great interest in applications where the data statistics
are time varying.
7.2.3 Circuit and Devices
Circuit level techniques should be explored in combination with the architectural level tech-
niques. Speciﬁcally, machine learning algorithms oﬀer relaxed precision/linearity require-
ment for the underlying circuit. This new degree of freedom can be explored to design
circuits that have tolerable non-ideality, but at the same time oﬀer large energy beneﬁts.
Circuit level techniques that can engineer the error statistics to favor error compensation
should also be explored for wide application of SEC techniques. At the device level, many
emerging devices such as CNFET [85], spin devices [86], and others have the potential for
large energy saving or density improvement, but suﬀer from various hardware errors such as
defects and noise. This inherent statistical behavior oﬀers an opportunity to employ SEC
to design reliable systems on emerging beyond CMOS technologies.
7.2.4 Theoretical Limits
Theoretical foundations for in-silicon machine learning need to be investigated. The sys-
tem performance in presence of errors/non-idealities should be further analyzed to reveal
the impact of diﬀerent error statistics. Performance bounds when hardware constraints are
incorporated need to be derived to guide the design of resource-constrained machine learn-
ing systems. Such theoretical limits will provide guidance in the quest for energy eﬃcient
realizations of in-silicon machine learning systems.
REFERENCES
[1] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy,
A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, "Imagenet large scale visual
recognition challenge," International Journal of Computer Vision, vol. 115, no. 3, pp.
211–252, 2015. [Online]. Available: http://dx.doi.org/10.1007/s11263-015-0816-y
[2] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche,
J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe,
J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu,
T. Graepel, and D. Hassabis, "Mastering the game of Go with deep neural
networks and tree search," Nature, vol. 529, pp. 484–503, 2016. [Online]. Available:
http://www.nature.com/nature/journal/v529/n7587/full/nature16961.html
[3] J. Baliga, R. W. A. Ayre, K. Hinton, and R. S. Tucker, "Green cloud computing:
Balancing energy in processing, storage, and transport," Proceedings of the IEEE,
vol. 99, no. 1, pp. 149–167, Jan 2011.
[4] M. K. Weldon, The Future X Network: A Bell Labs Perspective. CRC Press, 2016.
[5] Y. Ramadass and A. Chandrakasan, "Voltage scalable switched capacitor DC-DC
converter for ultra-low-power on-chip applications," in Power Electronics Specialists
Conference, 2007. PESC 2007. IEEE, June 2007, pp. 2353–2359.
[6] J. Choi, S. Park, J. Cho, and E. Yoon, "An energy/illumination-adaptive CMOS image
sensor with reconfigurable modes of operations," IEEE Journal of Solid-State Circuits,
vol. 50, no. 6, pp. 1438–1450, June 2015.
[7] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep
convolutional neural networks," in Advances in Neural Information Processing Systems 25,
F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, Eds. Curran Associates,
Inc., 2012, pp. 1097–1105. [Online]. Available:
http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf
[8] R. Dreslinski, M. Wieckowski, D. Blaauw, D. Sylvester, and T. Mudge, "Near-threshold
computing: Reclaiming Moore's law through energy efficient integrated circuits,"
Proceedings of the IEEE, vol. 98, no. 2, pp. 253–266, Feb 2010.
[9] "International technology roadmap 2011 edition," ITRS, Tech. Rep., 2011. [Online].
Available: http://www.itrs.net/Links/2011ITRS/2011Chapters/2011SysDrivers.pdf
[10] S. Hanson, B. Zhai, D. Blaauw, D. Sylvester, A. Bryant, and X. Wang, "Energy
optimality and variability in subthreshold design," in Low Power Electronics and Design,
2006. ISLPED'06. Proceedings of the 2006 International Symposium on, Oct 2006, pp.
363–365.
[11] K. Tsuji, K. Terada, R. Takeda, T. Tsunomura, A. Nishida, and T. Mogami, "Threshold
voltage variation extracted from MOSFET C-V curves by charge-based capacitance
measurement," in Microelectronic Test Structures (ICMTS), 2012 IEEE International
Conference on, March 2012, pp. 82–86.
[12] Y. Ramadass, A. Fayed, B. Haroun, and A. Chandrakasan, "A 0.16 mm² completely
on-chip switched-capacitor DC-DC converter using digital capacitance modulation for
LDO replacement in 45nm CMOS," in Solid-State Circuits Conference Digest of
Technical Papers (ISSCC), 2010 IEEE International, 2010, pp. 208–209.
[13] M. Wieckowski, G. K. Chen, M. Seok, D. Blaauw, and D. Sylvester, "A hybrid DC-DC
converter for sub-microwatt sub-1V implantable applications," in VLSI Circuits, 2009
Symposium on, June 2009, pp. 166–167.
[14] S. Rajapandian, K. Shepard, P. Hazucha, and T. Karnik, "High-tension power delivery:
operating 0.18 µm CMOS digital logic at 5.4V," in Solid-State Circuits Conference,
2005. Digest of Technical Papers. ISSCC. 2005 IEEE International, 2005, pp. 298–599.
[15] M. Seeman and S. Sanders, "Analysis and optimization of switched-capacitor DC-DC
converters," in Computers in Power Electronics, 2006. COMPEL '06. IEEE Workshops
on, July 2006, pp. 216–224.
[16] A. Nilchi, J. Aziz, and R. Genov, "Focal-plane algorithmically-multiplying CMOS
computational image sensor," IEEE Journal of Solid-State Circuits, vol. 44, no. 6, pp.
1829–1839, June 2009.
[17] P. Dudek and P. J. Hicks, "A general-purpose processor-per-pixel analog SIMD vision
chip," IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 52, no. 1,
pp. 13–20, Jan 2005.
[18] N. Massari, M. Gottardi, L. Gonzo, D. Stoppa, and A. Simoni, "A CMOS image
sensor with programmable pixel-level analog processing," IEEE Transactions on Neural
Networks, vol. 16, no. 6, pp. 1673–1684, Nov 2005.
[19] R. Robucci, J. D. Gray, L. K. Chiu, J. Romberg, and P. Hasler, "Compressive sensing
on a CMOS separable-transform image sensor," Proceedings of the IEEE, vol. 98, no. 6,
pp. 1089–1101, June 2010.
[20] A. Bandyopadhyay, J. Lee, R. W. Robucci, and P. Hasler, "MATIA: A programmable
80 µW/frame CMOS block matrix transform imager architecture," IEEE Journal of
Solid-State Circuits, vol. 41, no. 3, pp. 663–672, March 2006.
[21] J. Fernandez-Berni, R. Carmona-Galan, and L. Carranza-Gonzalez, FLIP-Q: A QCIF
resolution focal-plane array for low-power image processing, IEEE Journal of Solid-
State Circuits, vol. 46, no. 3, pp. 669680, March 2011.
[22] Z. Lin, M. W. Hoffman, N. Schemm, W. D. Leon-Salas, and S. Balkir, A CMOS
image sensor for multi-level focal plane image decomposition, IEEE Transactions on
Circuits and Systems I: Regular Papers, vol. 55, no. 9, pp. 2561–2572, Oct 2008.
[23] A. Graupner, J. Schreiter, S. Getzlaff, and R. Schuffny, CMOS image sensor with
mixed-signal processor array, IEEE Journal of Solid-State Circuits, vol. 38, no. 6, pp.
948–957, June 2003.
[24] Y. Oike and A. E. Gamal, CMOS image sensor with per-column ADC and pro-
grammable compressed sensing, IEEE Journal of Solid-State Circuits, vol. 48, no. 1,
pp. 318328, Jan 2013.
[25] Y. Ni and J. Guan, A 256 x 256 pixel smart CMOS image sensor for line-based
stereo vision applications, IEEE Journal of Solid-State Circuits, vol. 35, no. 7, pp.
10551061, July 2000.
[26] S. Martin, K. Flautner, T. Mudge, and D. Blaauw, Combined dynamic voltage scaling
and adaptive body biasing for lower power microprocessors under dynamic workloads,
in Computer Aided Design, 2002. ICCAD 2002. IEEE/ACM International Conference
on, Nov 2002, pp. 721–725.
[27] M. Borah, R. Owens, and M. Irwin, Transistor sizing for low power CMOS circuits,
Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on,
vol. 15, no. 6, pp. 665–671, Jun 1996.
[28] A. Chandrakasan, S. Sheng, and R. Brodersen, Low-power CMOS digital design,
Solid-State Circuits, IEEE Journal of, vol. 27, no. 4, pp. 473–484, Apr 1992.
[29] M. Goel and N. Shanbhag, Dynamic algorithm transforms for low-power reconfigurable
adaptive equalizers, in Signal Processing, IEEE Transactions on, vol. 47,
no. 10, Oct 1999, pp. 2821–2832.
[30] M. Donno, A. Ivaldi, L. Benini, and E. Macii, Clock-tree power optimization based on
RTL clock-gating, in Design Automation Conference, 2003. Proceedings, June 2003,
pp. 622627.
[31] S.-H. Chen and J.-Y. Lin, Implementation and verification practices of DVFS and
power gating, in VLSI Design, Automation and Test, 2009. VLSI-DAT '09. International
Symposium on, April 2009, pp. 19–22.
[32] W. Kim, M. Gupta, G.-Y. Wei, and D. Brooks, System level analysis of fast, per-
core DVFS using on-chip switching regulators, in High Performance Computer Ar-
chitecture, 2008. HPCA 2008. IEEE 14th International Symposium on, Feb 2008, pp.
123134.
[33] B. Calhoun, A. Wang, and A. Chandrakasan, Modeling and sizing for minimum energy
operation in subthreshold circuits, Solid-State Circuits, IEEE Journal of, vol. 40,
no. 9, pp. 17781786, Sept 2005.
[34] B. Zhai, D. Blaauw, D. Sylvester, and K. Flautner, Theoretical and practical limits of
dynamic voltage scaling, in Design Automation Conference, 2004. Proceedings. 41st,
July 2004, pp. 868–873.
[35] D. Markovic, C. C. Wang, L. P. Alarcon, T. T. Liu, and J. M. Rabaey, Ultralow-power
design in near-threshold region, Proceedings of the IEEE, vol. 98, no. 2, pp. 237–252,
Feb 2010.
[36] R. Dreslinski, B. Zhai, T. Mudge, D. Blaauw, and D. Sylvester, An energy efficient
parallel architecture using near threshold operation, in Parallel Architecture and Compilation
Techniques, 2007. PACT 2007. 16th International Conference on, Sept 2007,
pp. 175–188.
[37] J. Y. S. Kwong, A sub-threshold cell library and methodology, M.S. thesis, Mas-
sachusetts Institute of Technology, Cambridge, 2006.
[38] B. Zhai, S. Pant, L. Nazhandali, S. Hanson, J. Olson, A. Reeves, M. Minuth,
R. Helfand, T. Austin, D. Sylvester, and D. Blaauw, Energy-efficient subthreshold
processor design, Very Large Scale Integration (VLSI) Systems, IEEE Transactions
on, vol. 17, no. 8, pp. 1127–1137, Aug 2009.
[39] S. Luetkemeier, T. Jungeblut, M. Porrmann, and U. Rueckert, A 200mV 32b sub-
threshold processor with adaptive supply voltage control, in Solid-State Circuits Con-
ference Digest of Technical Papers (ISSCC), 2012 IEEE International, Feb 2012, pp.
484486.
[40] J. Kwong, Y. Ramadass, N. Verma, M. Koesler, K. Huber, H. Moormann, and A. Chan-
drakasan, A 65nm sub-Vt microcontroller with integrated SRAM and switched-
capacitor DC-DC converter, in Solid-State Circuits Conference, 2008. ISSCC 2008.
Digest of Technical Papers. IEEE International, Feb 2008, pp. 318616.
[41] S. Vangal, S. Jain, and V. De, A solar-powered 280mV-to-1.2V wide-operating-range
IA-32 processor, in IC Design Technology (ICICDT), 2014 IEEE International Conference
on, May 2014, pp. 1–4.
[42] A. Wang and A. Chandrakasan, A 180-mV subthreshold FFT processor using a mini-
mum energy design methodology, Solid-State Circuits, IEEE Journal of, vol. 40, no. 1,
pp. 310319, Jan 2005.
[43] N. Reynders and W. Dehaene, Variation-resilient sub-threshold circuit solutions for
ultra-low-power digital signal processors with 10MHz clock frequency, in ESSCIRC
(ESSCIRC), 2012 Proceedings of the, Sept 2012, pp. 474–477.
[44] I. Koren and Z. Koren, Defect tolerance in VLSI circuits: techniques and yield analysis,
Proceedings of the IEEE, vol. 86, no. 9, pp. 1819–1838, Sep 1998.
[45] J. von Neumann, Probabilistic logics and the synthesis of reliable organisms from
unreliable components, Automata Studies, pp. 43–98, 1956.
[46] E. Huijbregts, H. Xue, and J. Jess, Routing for reliable manufacturing, Semiconductor
Manufacturing, IEEE Transactions on, vol. 8, no. 2, pp. 188–194, May 1995.
[47] Z. Koren and I. Koren, On the effect of floorplanning on the yield of large area
integrated circuits, Very Large Scale Integration (VLSI) Systems, IEEE Transactions
on, vol. 5, no. 1, pp. 3–14, March 1997.
[48] K. Nepal, R. Bahar, J. Mundy, W. Patterson, and A. Zaslavsky, Designing logic
circuits for probabilistic computation in the presence of noise, in Design Automation
Conference, 2005. Proceedings. 42nd, June 2005, pp. 485–490.
[49] N. Jayakumar and S. Khatri, A variation-tolerant sub-threshold design approach, in
Design Automation Conference, 2005. Proceedings. 42nd, June 2005, pp. 716–719.
[50] M. Zhang and N. Shanbhag, A CMOS design style for logic circuit hardening, in Reli-
ability Physics Symposium, 2005. Proceedings. 43rd Annual. 2005 IEEE International,
April 2005, pp. 223–229.
[51] D. Ernst, N. S. Kim, S. Das, S. Pant, R. Rao, T. Pham, C. Ziesler, D. Blaauw,
T. Austin, K. Flautner, and T. Mudge, Razor: a low-power pipeline based on circuit-
level timing speculation, in Microarchitecture, 2003. MICRO-36. Proceedings. 36th
Annual IEEE/ACM International Symposium on, Dec 2003, pp. 7–18.
[52] N. Jayakumar and S. Khatri, A variation-tolerant sub-threshold design approach, in
Design Automation Conference, 2005. Proceedings. 42nd, June 2005, pp. 716–719.
[53] N. Shanbhag, R. Abdallah, R. Kumar, and D. Jones, Stochastic computation, in
Design Automation Conference (DAC), 2010 47th ACM/IEEE, June 2010, pp. 859–864.
[54] N. Verma, K. H. Lee, K. J. Jang, and A. Shoeb, Enabling system-level platform re-
silience through embedded data-driven inference capabilities in electronic devices, in
Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference
on, March 2012, pp. 5285–5288.
[55] S. Zhang and N. Shanbhag, Embedded error compensation for energy efficient DSP
systems, in Signal and Information Processing (GlobalSIP), 2014 IEEE Global Conference
on, Dec 2014, pp. 30–34.
[56] Z. Wang, R. Schapire, and N. Verma, Error-adaptive classifier boosting (EACB): Exploiting
data-driven training for highly fault-tolerant hardware, in Acoustics, Speech
and Signal Processing (ICASSP), 2014 IEEE International Conference on, May 2014,
pp. 3884–3888.
[57] B. Shim, S. Sridhara, and N. Shanbhag, Reliable low-power digital signal process-
ing via reduced precision redundancy, Very Large Scale Integration (VLSI) Systems,
IEEE Transactions on, vol. 12, no. 5, pp. 497–510, May 2004.
[58] G. Varatkar, S. Narayanan, N. Shanbhag, and D. Jones, Variation-tolerant, low-
power PN-code acquisition using stochastic sensor NOC, in Circuits and Systems,
2008. ISCAS 2008. IEEE International Symposium on, May 2008, pp. 380–383.
[59] E. Kim and N. Shanbhag, Soft N-Modular Redundancy, Computers, IEEE Transactions
on, vol. 61, no. 3, pp. 323–336, March 2012.
[60] R. Abdallah and N. Shanbhag, Robust and energy-efficient DSP systems via output
probability processing, in Computer Design (ICCD), 2010 IEEE International
Conference on, Oct 2010, pp. 38–44.
[61] E. Kim and N. Shanbhag, Soft NMR: Analysis and application to DSP systems, in
Acoustics Speech and Signal Processing (ICASSP), 2010 IEEE International Conference
on, March 2010, pp. 1494–1497.
[62] K. Bowman, J. Tschanz, S. Lu, P. Aseron, M. Khellah, A. Raychowdhury, B. Geuskens,
C. Tokunaga, C. Wilkerson, T. Karnik, and V. De, A 45 nm resilient microprocessor
core for dynamic variation tolerance, Solid-State Circuits, IEEE Journal of, vol. 46,
no. 1, pp. 194208, Jan 2011.
[63] E. Kim, D. Baker, S. Narayanan, N. Shanbhag, and D. Jones, A 3.6-mW 50-MHz
PN code acquisition filter via statistical error compensation in 180-nm CMOS, Very
Large Scale Integration (VLSI) Systems, IEEE Transactions on, vol. PP, no. 99, pp.
1–1, 2014.
[64] R. Abdallah and N. Shanbhag, An energy-efficient ECG processor in 45-nm CMOS
using statistical error compensation, Solid-State Circuits, IEEE Journal of, vol. 48,
no. 11, pp. 2882–2893, Nov 2013.
[65] J. Choi, E. Kim, R. Rutenbar, and N. Shanbhag, Error resilient MRF message passing
architecture for stereo matching, in Signal Processing Systems (SiPS), 2013 IEEE
Workshop on, Oct 2013, pp. 348–353.
[66] G. Chen, M. Fojtik, D. Kim, D. Fick, J. Park, M. Seok, M.-T. Chen, Z. Foo,
D. Sylvester, and D. Blaauw, Millimeter-scale nearly perpetual sensor system with
stacked battery and solar cells, in Solid-State Circuits Conference Digest of Technical
Papers (ISSCC), 2010 IEEE International, Feb 2010, pp. 288–289.
[67] J. Kwong, Y. Ramadass, N. Verma, and A. Chandrakasan, A 65 nm sub-Vt microcon-
troller with integrated SRAM and switched capacitor DC-DC converter, Solid-State
Circuits, IEEE Journal of, vol. 44, no. 1, pp. 115–126, Jan. 2009.
[68] M. Stojcev, M. Kosanovic, and L. Golubovic, Power management and energy harvest-
ing techniques for wireless sensor nodes, in Telecommunication in Modern Satellite,
Cable, and Broadcasting Services, 2009. TELSIKS '09. 9th International Conference
on, Oct. 2009, pp. 65–72.
[69] J. Kimball and P. Krein, Analysis and design of switched capacitor converters, in
Applied Power Electronics Conference and Exposition, 2005. APEC 2005. Twentieth
Annual IEEE, vol. 3, 2005, pp. 1473–1477.
[70] M. Evzelman and S. Ben-Yaakov, Average modeling technique for switched capacitor
converters including large signal dynamics and small signal responses, in Microwaves,
Communications, Antennas and Electronics Systems (COMCAS), 2011 IEEE International
Conference on, Nov 2011, pp. 1–5.
[71] M. Evzelman and S. Ben-Yaakov, Optimal switch resistances in switched capacitor
converters, in Electrical and Electronics Engineers in Israel (IEEEI), 2010 IEEE 26th
Convention of, Nov 2010, pp. 000436–000439.
[72] S. Ben-Yaakov and M. Evzelman, Generic and unified model of Switched Capacitor
Converters, in Energy Conversion Congress and Exposition, 2009. ECCE 2009. IEEE,
2009, pp. 3501–3508.
[73] R. Aparicio and A. Hajimiri, Capacity limits and matching properties of integrated
capacitors, Solid-State Circuits, IEEE Journal of, vol. 37, no. 3, pp. 384 393, Mar
2002.
[74] B. Calhoun, A. Wang, and A. Chandrakasan, Modeling and sizing for minimum energy
operation in subthreshold circuits, Solid-State Circuits, IEEE Journal of, vol. 40,
no. 9, pp. 17781786, Sept 2005.
[75] D. El-Damak, S. Bandyopadhyay, and A. Chandrakasan, A 93% efficiency reconfigurable
switched-capacitor DC-DC converter using on-chip ferroelectric capacitors,
in Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2013 IEEE
International, Feb 2013, pp. 374–375.
[76] M. Seeman and R. Jain, Single-bound hysteretic regulation of switched-capacitor
converters, Mar. 31 2011, US Patent App. 12/566,730. [Online]. Available:
http://www.google.com/patents/US20110074371
[77] N. Krihely, S. Ben-Yaakov, and A. Fish, Efficiency optimization of a step-down
switched capacitor converter for subthreshold, Very Large Scale Integration (VLSI)
Systems, IEEE Transactions on, vol. 21, no. 12, pp. 2353–2357, 2013.
[78] J. Karhunen and J. Joutsensalo, Generalizations of principal component analysis,
optimization problems, and neural networks, Neural Networks, vol. 8, no. 4, pp. 549–562, 1995.
[79] V. Vapnik, An overview of statistical learning theory, Neural Networks, IEEE Transactions
on, vol. 10, no. 5, pp. 988–999, Sep 1999.
[80] L. Fei-Fei, R. Fergus, and P. Perona, Learning generative visual models from few train-
ing examples: an incremental bayesian approach tested on 101 object categories, in
Computer Vision and Pattern Recognition Workshop, 2004. CVPRW '04. Conference
on, June 2004, pp. 178–178.
[81] H.-S. P. Wong, CMOS image sensors-recent advances and device scaling considera-
tions, in Electron Devices Meeting, 1997. IEDM '97. Technical Digest., International,
Dec 1997, pp. 201–204.
[82] M. Kang, S. K. Gonugondla, M. S. Keel, and N. R. Shanbhag, An energy-efficient
memory-based high-throughput VLSI architecture for convolutional networks, in
Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference
on, April 2015, pp. 1037–1041.
[83] M. Horowitz, Computing's energy problem (and what we can do about it), in
2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers
(ISSCC), Feb 2014, pp. 10–14.
[84] R. Dreslinski, M. Wieckowski, D. Blaauw, D. Sylvester, and T. Mudge, Near-Threshold
Computing: Reclaiming Moore's Law Through Energy Efficient Integrated
Circuits, Proceedings of the IEEE, vol. 98, no. 2, pp. 253–266, Feb 2010.
[85] R. Martel, V. Derycke, J. Appenzeller, S. Wind, and P. Avouris, Carbon nanotube
field-effect transistors and logic circuits, in Design Automation Conference, 2002.
Proceedings. 39th, 2002, pp. 94–98.
[86] K. Roy, M. Sharad, D. Fan, and K. Yogendra, Beyond charge-based computation:
Boolean and non-Boolean computing with spin torque devices, in Low Power Elec-
tronics and Design (ISLPED), 2013 IEEE International Symposium on, Sept 2013,
pp. 139142.
[87] V. Gupta, D. Mohapatra, S. P. Park, A. Raghunathan, and K. Roy, IMPACT: IMPrecise
adders for low-power approximate computing, in Low Power Electronics and
Design (ISLPED) 2011 International Symposium on, Aug 2011, pp. 409–414.
[88] S. Venkataramani, A. Sabne, V. Kozhikkottu, K. Roy, and A. Raghunathan, SALSA:
Systematic logic synthesis of approximate circuits, in Design Automation Conference
(DAC), 2012 49th ACM/EDAC/IEEE, June 2012, pp. 796–801.
[89] S. Venkataramani, K. Roy, and A. Raghunathan, Substitute-and-simplify: A unified
design paradigm for approximate and quality configurable circuits, in Design,
Automation Test in Europe Conference Exhibition (DATE), 2013, March 2013, pp.
1367–1372.
[90] X. Zhu and X. Wu, Class noise vs. attribute noise: A quantitative study of
their impacts, Artif. Intell. Rev., vol. 22, no. 3, pp. 177–210, Nov. 2004. [Online].
Available: http://dx.doi.org/10.1007/s10462-004-0751-8
[91] A. Atla, R. Tada, V. Sheng, and N. Singireddy, Sensitivity of different machine
learning algorithms to noise, J. Comput. Sci. Coll., vol. 26, no. 5, pp. 96–103, May
2011. [Online]. Available: http://dl.acm.org/citation.cfm?id=1961574.1961594
[92] C. M. Bishop, Training with noise is equivalent to Tikhonov regularization,
Neural Comput., vol. 7, no. 1, pp. 108–116, Jan. 1995. [Online]. Available:
http://dx.doi.org/10.1162/neco.1995.7.1.108
[93] Y. Raviv and N. Intrator, Bootstrapping with noise: An effective regularization technique,
Connection Science, vol. 8, pp. 355–372, 1996.
[94] J. Mengte, A. Raghunathan, S. Chakradhar, and S. Byna, Exploiting the forgiving
nature of applications for scalable parallel execution, in Parallel Distributed Processing
(IPDPS), 2010 IEEE International Symposium on, April 2010, pp. 1–12.
[95] S. Venkataramani, A. Raghunathan, J. Liu, and M. Shoaib, Scalable-effort classifiers
for energy-efficient machine learning, in Proceedings of the 52nd Annual Design
Automation Conference, ser. DAC '15. New York, NY, USA: ACM, 2015. [Online].
Available: http://doi.acm.org/10.1145/2744769.2744904 pp. 67:1–67:6.
[96] S. Hashemi, R. I. Bahar, and S. Reda, Drum: A dynamic range unbiased multiplier
for approximate applications, in Proceedings of the IEEE/ACM International
Conference on Computer-Aided Design, ser. ICCAD '15. Piscataway, NJ, USA: IEEE
Press, 2015. [Online]. Available: http://dl.acm.org/citation.cfm?id=2840819.2840878
pp. 418425.
[97] S. Borkar, T. Karnik, S. Narendra, J. Tschanz, A. Keshavarzi, and V. De, Param-
eter variations and impact on circuits and microarchitecture, in Design Automation
Conference, 2003. Proceedings, June 2003, pp. 338342.
[98] K. Nepal, R. Bahar, J. Mundy, W. Patterson, and A. Zaslavsky, Techniques for de-
signing noise-tolerant multi-level combinational circuits, in Design, Automation Test
in Europe Conference Exhibition, 2007. DATE '07, April 2007, pp. 1–6.
[99] D. Ernst, S. Das, S. Lee, D. Blaauw, T. Austin, T. Mudge, N. S. Kim, and K. Flautner,
RAZOR: circuit-level correction of timing errors for low-power operation, Micro,
IEEE, vol. 24, no. 6, pp. 10–20, Nov 2004.
[100] D. Blaauw, S. Kalaiselvan, K. Lai, W.-H. Ma, S. Pant, C. Tokunaga, S. Das, and
D. Bull, RAZOR II: In situ error detection and correction for PVT and SER tol-
erance, in Solid-State Circuits Conference, 2008. ISSCC 2008. Digest of Technical
Papers. IEEE International, Feb 2008, pp. 400–622.
[101] J. Tschanz, K. Bowman, C. Wilkerson, S.-L. Lu, and T. Karnik, Resilient circuits;
Enabling energy-efficient performance and reliability, in Computer-Aided Design -
Digest of Technical Papers, 2009. ICCAD 2009. IEEE/ACM International Conference
on, Nov 2009, pp. 71–73.
[102] C.-H. Chen, D. Blaauw, D. Sylvester, and Z. Zhang, Design and evaluation of
confidence-driven error-resilient systems, Very Large Scale Integration (VLSI) Systems,
IEEE Transactions on, vol. 22, no. 8, pp. 1727–1737, Aug 2014.
[103] R. Abdallah and N. Shanbhag, An energy-efficient ECG processor in 45-nm CMOS
using statistical error compensation, Solid-State Circuits, IEEE Journal of, vol. 48,
no. 11, pp. 2882–2893, Nov 2013.
[104] M. Chen, J. Han, Y. Zhang, Y. Zou, Y. Li, and X. Zeng, An error-resilient wavelet-
based ECG processor under voltage overscaling, in Biomedical Circuits and Systems
Conference (BioCAS), 2014 IEEE, Oct 2014, pp. 628–631.
[105] J. Yoo, L. Yan, D. El-Damak, M. Altaf, A. Shoeb, and A. Chandrakasan, An 8-channel
scalable EEG acquisition SoC with patient-specific seizure classification and recording
processor, Solid-State Circuits, IEEE Journal of, vol. 48, no. 1, pp. 214–228, Jan
2013.
[106] C. J. C. Burges, A tutorial on support vector machines for pattern recognition, Data
Mining and Knowledge Discovery, vol. 2, pp. 121–167, 1998.
[107] K. K. Parhi, VLSI Digital Signal Processing Systems: Design and Implementation.
New York: Wiley, 1999, a Wiley-Interscience publication. [Online]. Available:
http://opac.inria.fr/record=b1097682
[108] M. T. Heath, Scientific Computing: An Introductory Survey, 2nd ed., E. M. Munson,
Ed. McGraw-Hill Higher Education, 1996.
[109] A. Ganapathiraju, J. Hamaker, and J. Picone, Applications of support vector ma-
chines to speech recognition, Signal Processing, IEEE Transactions on, vol. 52, no. 8,
pp. 23482355, Aug 2004.
[110] H. Hu, M.-X. Xu, and W. Wu, GMM supervector based SVM with spectral features
for speech emotion recognition, in Acoustics, Speech and Signal Processing, 2007.
ICASSP 2007. IEEE International Conference on, vol. 4, April 2007, pp. IV-413–IV-416.
[111] S. Lawrence, C. Giles, A. C. Tsoi, and A. Back, Face recognition: A convolutional
neural-network approach, Neural Networks, IEEE Transactions on, vol. 8, no. 1, pp.
98113, Jan 1997.
[112] D. Wedge, D. Ingram, D. McLean, and Z. Bandar, On global-local artificial neural
networks for function approximation, Neural Networks, IEEE Transactions on,
vol. 17, no. 4, pp. 942–952, July 2006.
[113] A. V. Oppenheim, R. W. Schafer, and J. R. Buck, Discrete-time Signal Processing
(2nd ed.). Upper Saddle River, NJ, USA: Prentice-Hall, Inc., 1999.
[114] J. Ludwig, S. Nawab, and A. Chandrakasan, Low-power digital filtering using approximate
processing, Solid-State Circuits, IEEE Journal of, vol. 31, no. 3, pp. 395–400,
Mar 1996.
[115] K. Lee, S.-Y. Kung, and N. Verma, Low-energy formulations of support vector ma-
chine kernel functions for biomedical sensor applications, Journal of Signal Processing
Systems, vol. 69, no. 3, pp. 339–349, 2012.
[116] M. Abbas, M. Ikeda, and K. Asada, Statistical model for logic errors in CMOS digital
circuits for reliability-driven design ﬂow, in Design and Diagnostics of Electronic
Circuits and Systems, 2006 IEEE, April 2006, pp. 145–146.
[117] K. Lingasubramanian and S. Bhanja, An error model to study the behavior of tran-
sient errors in sequential circuits, in VLSI Design, 2009 22nd International Conference
on, Jan 2009, pp. 485–490.
[118] Y. Liu, T. Zhang, and K. Parhi, Computation error analysis in digital signal processing
systems with overscaled supply voltage, Very Large Scale Integration (VLSI) Systems,
IEEE Transactions on, vol. 18, no. 4, pp. 517–526, April 2010.
[119] L. Li and H. Zhou, On error modeling and analysis of approximate adders, in
Computer-Aided Design (ICCAD), 2014 IEEE/ACM International Conference on,
Nov 2014, pp. 511–518.
[120] J. Huang and J. Lach, Exploring the fidelity-efficiency design space using imprecise
arithmetic, in Design Automation Conference (ASP-DAC), 2011 16th Asia and South
Pacific, Jan 2011, pp. 579–584.
[121] W.-T. Chan, A. Kahng, S. Kang, R. Kumar, and J. Sartori, Statistical analysis and
modeling for error composition in approximate computation circuits, in Computer
Design (ICCD), 2013 IEEE 31st International Conference on, Oct 2013, pp. 47–53.
[122] K. Iwasaki and F. Arakawa, An analysis of the aliasing probability of multiple-input
signature registers in the case of a 2^m-ary symmetric channel, Computer-Aided Design
of Integrated Circuits and Systems, IEEE Transactions on, vol. 9, no. 4, pp. 427–438,
Apr 1990.
[123] T. Williams, W. Daehn, M. Gruetzner, and C. Starke, Aliasing errors in signature
analysis registers, Design Test of Computers, IEEE, vol. 4, no. 2, pp. 39–45, April
1987.
[124] D. N. A. Asuncion, UCI machine learning repository, 2007. [Online]. Available:
http://www.ics.uci.edu/~mlearn/MLRepository.html
[125] J. Lin, Divergence measures based on the Shannon entropy, Information Theory,
IEEE Transactions on, vol. 37, no. 1, pp. 145–151, Jan 1991.
[126] J. Macke, P. Berens, A. Ecker, A. Tolias, and M. Bethge, Generating spike trains with
specified correlation coefficients, Neural Computation, vol. 21, no. 2, pp. 397–423, Feb
2009.
[127] S. Zhang and N. Shanbhag, Probabilistic error models for machine learning kernels
implemented on stochastic nanoscale fabrics, in Design, Automation Test in Europe
Conference Exhibition, 2016. DATE '16, March 2016.
[128] T. G. Dietterich, Ensemble methods in machine learning, in Proceedings of the First
International Workshop on Multiple Classifier Systems, ser. MCS '00. London, UK:
Springer-Verlag, 2000, pp. 1–15.
[129] L. Breiman, Bagging Predictors, Mach. Learn., vol. 24, no. 2, pp. 123–140, Aug.
1996.
[130] Y. Freund and R. E. Schapire, Experiments with a new boosting algorithm,
in Proceedings of the Thirteenth International Conference on Machine Learning
(ICML 1996), L. Saitta, Ed. Morgan Kaufmann, 1996. [Online]. Available:
http://www.biostat.wisc.edu/~kbroman/teaching/statgen/2004/refs/freund.pdf pp.
148–156.
[131] L. Breiman, Random Forests, Mach. Learn., vol. 45, no. 1, pp. 5–32, Oct. 2001.
[132] J. R. Quinlan, C4.5: Programs for Machine Learning. San Francisco, CA, USA:
Morgan Kaufmann Publishers Inc., 1993.
[133] C. Cortes and V. Vapnik, Support-Vector Networks, Mach. Learn., vol. 20, no. 3, pp.
273297, Sep. 1995. [Online]. Available: http://dx.doi.org/10.1023/A:1022627411411
[134] K. Lee, S. Kung, and N. Verma, Low-energy formulations of support vector machine
kernel functions for biomedical sensor applications, Signal Processing Systems, vol. 69,
no. 3, pp. 339–349, 2012. [Online]. Available: http://dx.doi.org/10.1007/s11265-012-0672-8
[135] P. Domingos, A unified bias-variance decomposition and its applications, in Proc.
17th International Conf. on Machine Learning. Morgan Kaufmann, 2000, pp. 231–238.
[136] K. K. Parhi, VLSI Digital Signal Processing Systems: Design and Implementation.
New York: Wiley, 1999, a Wiley-Interscience publication. [Online]. Available:
http://opac.inria.fr/record=b1097682
[137] P. Viola and M. Jones, Rapid object detection using a boosted cascade of simple
features, in Computer Vision and Pattern Recognition, 2001. CVPR 2001. Proceedings
of the 2001 IEEE Computer Society Conference on, vol. 1, 2001, pp. I-511–I-518.
