A dot product kernel using rapidly switched analog circuit by Nahlus, Ihab
© 2015 Ihab Nahlus
A DOT PRODUCT KERNEL USING RAPIDLY SWITCHED ANALOG
CIRCUIT
BY
IHAB NAHLUS
THESIS
Submitted in partial fulfillment of the requirements
for the degree of Master of Science in Electrical and Computer Engineering
in the Graduate College of the
University of Illinois at Urbana-Champaign, 2015
Urbana, Illinois
Adviser:
Professor Naresh R. Shanbhag
ABSTRACT
In a world driven by technology and hand-held devices, there is ubiquitous
demand for high-performance, low-energy processing engines. In this thesis,
we present rapidly switched analog circuit (RSAC), a new circuit architec-
ture, to implement an energy-efficient mixed-signal dot product (DP) kernel
for machine learning and signal processing applications. RSAC operates by
fast switching the analog inputs to the output via variable width digital
pulses. A description of the different components of RSAC, along with a de-
tailed accuracy and energy consumption analysis is presented. We show two
RSAC designs that span the different design options and technology nodes.
Simulations for the first design in a 130 nm process show energy savings of
19× to 32× compared to a digital implementation for signal-to-quantization-
noise ratios (SQNRs) of 30 dB to 24 dB, respectively. Simulations for the
second design in a 28 nm FDSOI process show energy savings of 15.7×, 4×,
2.1× compared to a digital implementation running at the same sampling fre-
quency for SQNRs of 8 dB, 14 dB and 20 dB, respectively. Finally, we present
the design of an emotion recognition system composed solely of RSAC-based
dot-products. Based on the behavioral and energy models developed in this
thesis, we obtain energy savings of 45% and 49% compared to a digital im-
plementation for average probabilities of error of 0.23 and 0.07, running at
frequencies of 1.87 MHz and 1.7 MHz, respectively.
To my family, for their love and support
ACKNOWLEDGMENTS
First and foremost, I would like to express my sincere gratitude to my advisor
Prof. Naresh R. Shanbhag for his continuous guidance, patience, enthusiasm
and invaluable comments throughout this work. I also thank our collabora-
tors Dr. Eric Kim and Prof. David Blaauw for their comments and feedback
for this work. I want to also thank my labmates: Pourya, Tianqi, Yingyan,
Sai, Mingu, Sujan and Ameya. They eased me into the research group when
I first arrived and made my research environment very friendly.
I want to thank my parents: Nazih and Ghada, my brothers: Saad and
Ramzi, and sisters: Farah, Juliana and Dunia, for their love and support
during my studies. They are my backbone and I feel blessed for having them
by my side. Life would not be complete without having friends to share the
journey with. I want to thank Mohammad, Peter, Marc, Laurence and Farah
for the wonderful company during my stay in Illinois. It hasn’t been an easy
ride but they definitely made it enjoyable.
Finally, I want to thank Brooke Newell and my friend Laurence for helping
me in making this thesis easy to read, grammatically correct, and typo-free.
TABLE OF CONTENTS
CHAPTER 1 INTRODUCTION
1.1 Motivation
1.2 Background
1.2.1 Digital approach
1.2.2 Analog approach
1.2.3 Stochastic computing approach
CHAPTER 2 RAPIDLY SWITCHED ANALOG CIRCUIT (RSAC) PRINCIPLES
2.1 RSAC architecture
2.2 Principle of operation
2.3 RSAC circuit implementations
2.4 Select generation
2.4.1 Constant coefficients
2.4.2 Programmable coefficients
2.5 Behavioral model
2.6 Energy model
CHAPTER 3 RSAC DESIGN METHODOLOGY AND EXAMPLES
3.1 Design I
3.1.1 Simulation setup
3.1.2 Choice of R and C
3.1.3 Validation of MSE model
3.1.4 Validation of energy model
3.1.5 Design optimization
3.1.6 Comparison to digital implementation
3.2 Design II
3.2.1 Simulation setup
3.2.2 Choice of R and C
3.2.3 Validation of behavioral model
3.2.4 Energy model
3.2.5 Design optimization
3.2.6 Comparison to digital implementation
CHAPTER 4 RSAC-BASED EMOTION RECOGNITION SYSTEM
4.1 Feature extraction using Gabor filtering
4.2 Feature selection using Adaboost
4.3 Support vector machine (SVM)
4.4 Simulation results
CHAPTER 5 CONCLUSION AND FUTURE DIRECTIONS
REFERENCES
CHAPTER 1
INTRODUCTION
1.1 Motivation
In a world driven by technology and hand-held devices, the demand for ubiq-
uitous computing with learning and decision making capabilities has grown
tremendously in the past several years. These applications need to process
large amounts of data acquired by sensing the surrounding environment and
are subject to strict energy demands. In fact, smartphone users have reported
that the single most important feature when choosing a phone is battery life
[1]. Additionally, these applications have strict speed demands and require
sufficient accuracy at the output. Current devices in the market lack intelli-
gence, for they rely heavily on the cloud for processing. In order to embed
intelligence into hand-held devices, we should ensure that the cost of com-
puting in the learning algorithms is less than the cost of communicating with
a server. Our work targets the computing front in order to achieve that goal.
A typical machine learning algorithm consists of: (i) a feature extraction
engine to extract the relevant information from the input data and (ii) a
classifier to make a decision (Fig. 1.1). The dot product (DP) kernel is a key
kernel in both the feature extraction engine and the classifier. This kernel
implements a variety of functions, including but not limited to vector inner
products, correlators, filters, convolutions, multiply-accumulate, and ℓ1 and ℓ2
norms.
Conventionally, there are several ways to design the DP kernel: the digital
approach, the analog approach, and the stochastic computing approach. In the
remaining sections of this chapter, we discuss these approaches in more detail.
[Figure 1.1 (block diagram). Blocks shown: sensors (analog), feature extractor,
classifier, actuator (analog); building blocks (.)^2, e^(.), and the dot product
⟨A, B⟩ = Σ_{i=1}^{N} A_i B_i; user requirements SNR and P_det.]
Figure 1.1: A machine learning processing chain consists of a feature
extractor and a classifier. The dot product is a key kernel.
1.2 Background
1.2.1 Digital approach
Sensory input $x(t)$ has to pass through an analog-to-digital converter (ADC),
which consists of a sampler and a quantizer, before being processed digitally.
The input is first sampled with a period $T_s$, and then quantized into a
discrete value,
$$x_q[n] \triangleq x(nT_s) + q[n],$$
where $x_q[n]$ is the $n$-th discrete sample and $q[n]$ is the $n$-th quantization error.
Typically, for a uniform quantizer, the quantization step, $\Delta$, is constant:
$$\Delta \triangleq \frac{x_{\max} - x_{\min}}{2^{B_x}}, \tag{1.1}$$
where $x_{\max}$ is the maximum input value, $x_{\min}$ is the minimum input value,
and $B_x$ is the number of bits used to represent the analog values. Assuming
$x_{\max} = x_m$ and $x_{\min} = -x_m$, Equation (1.1) reduces to:
$$\Delta = \frac{x_m}{2^{B_x - 1}}. \tag{1.2}$$
Figure 1.2: (a) Input-output relationship for a uniform quantizer, and (b)
the probability density function of the quantization noise q.
Figure 1.3: A two-input digital DP kernel.
It is typically assumed that the quantization error, $q$, is uniformly
distributed between any two quantization levels, as shown in Fig. 1.2(b) [2],
and independent of the input. Under these assumptions, we can compute
the mean and variance of $q$ as follows:
$$E[q] = \frac{1}{\Delta}\int_{-\Delta/2}^{\Delta/2} q \, dq = 0,$$
$$E[q^2] = \frac{1}{\Delta}\int_{-\Delta/2}^{\Delta/2} q^2 \, dq = \frac{\Delta^2}{12}.$$
This will determine the precision at the output, which can also be cast
in the form of a signal-to-quantization-noise ratio (SQNR). The SQNR is
defined as the ratio between the energy of the input and that of the
quantization error:
$$\mathrm{SQNR} \triangleq \frac{E[x^2]}{E[q^2]} = \frac{12\,(2^{2B_x - 2})}{\mathrm{PAR}}, \tag{1.3}$$
where PAR is the peak-to-average power ratio, defined as:
$$\mathrm{PAR} = \frac{x_m^2}{E[x^2]}.$$
In dB, Equation (1.3) reduces to:
$$\mathrm{SQNR}_{dB} = 6B_x + 4.8 - \mathrm{PAR}_{dB},$$
where $\mathrm{PAR}_{dB} = 10\log_{10}\mathrm{PAR}$. Essentially, for a 6 dB improvement in
SQNR, $B_x$ would have to increase by only 1. To compute the DP, we first
multiply each pair of corresponding elements using a multiplier, a Baugh-Wooley
multiplier (BWM) for example, and then add the products in a tree fashion
using adders, ripple-carry adders (RCAs) for example. This is shown in
Figure 1.3 for a length-2 DP. While there are many ways to model the energy
consumption of digital circuits, a general one would be:
$$E_{tot} = g(B_x)\left(\alpha C V_{dd}^2 + E_{lkg}\right),$$
where $\alpha$ is the average transition activity, $C$ is the average output
capacitance, $V_{dd}$ is the supply voltage, $E_{lkg}$ is the average leakage energy,
and $g(\cdot)$ is an increasing function that depends on the algorithm and architecture.
For example, $g(\cdot)$ is linear if we are performing addition using an RCA and
quadratic if performing multiplication using a BWM.
Figure 1.4: A two-input analog DP.
We will further discuss this in Chapter 3 when comparing the techniques
developed in this thesis versus the digital approach.
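The quantization-noise model of this section can be checked numerically. The sketch below is illustrative only: the function names, the uniformly distributed test input, and the seed are assumptions introduced here, and PAR is taken as the power ratio x_m^2 / E[x^2]. It quantizes a signal with step Δ = x_m / 2^(Bx−1) and compares the measured error power with Δ²/12 and the measured SQNR with 6.02 Bx + 4.77 − PAR_dB.

```python
import numpy as np

def uniform_quantize(x, x_m, bits):
    """Round x to the nearest multiple of Delta = x_m / 2**(bits - 1), as in Eq. (1.2)."""
    delta = x_m / 2 ** (bits - 1)
    return delta * np.round(x / delta), delta

def sqnr_db(x, xq):
    """Empirical SQNR in dB: signal power over quantization-error power."""
    return 10 * np.log10(np.mean(x ** 2) / np.mean((xq - x) ** 2))

def predicted_sqnr_db(x, x_m, bits):
    """6.02*Bx + 4.77 - PAR_dB, with PAR assumed to be the power ratio x_m^2 / E[x^2]."""
    par_db = 10 * np.log10(x_m ** 2 / np.mean(x ** 2))
    return 6.02 * bits + 4.77 - par_db

rng = np.random.default_rng(0)         # fixed seed, for repeatability only
x_m = 1.0
x = rng.uniform(-x_m, x_m, 200_000)    # hypothetical test input (uniform, so PAR = 3)

for bits in (6, 8, 10):
    xq, delta = uniform_quantize(x, x_m, bits)
    ratio = np.mean((xq - x) ** 2) / (delta ** 2 / 12)   # should be close to 1
    print(bits, ratio, sqnr_db(x, xq), predicted_sqnr_db(x, x_m, bits))
```

For each word length, the printed ratio should be close to 1 and the two SQNR values should agree to within a fraction of a dB, with roughly 6 dB gained per extra bit.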
1.2.2 Analog approach
An alternative but less common approach to digital processing is analog
processing, which is typically used when accuracy requirements are low and speed
requirements are high. In the past, it was reported that analog processing
was more energy efficient than digital processing for precision below 8
bits [3]. However, this crossover point has since shifted down to 5-6
bits [4]. While quantization is the main source of error in a digital system,
errors in an analog system arise from various physical phenomena,
such as thermal noise, flicker noise, and shot noise, which are even more
prominent in advanced processes [5].
One possible design for a DP circuit is described in Fig. 1.4. The scheme
shows two differential amplifiers where the multiplication is done through the
gain of each pair and the summing is done through the connection between
the two amplifiers (i.e., Kirchhoff’s current law) [6]. For a DP of length $N$, we
need $N$ drain-coupled differential pairs as shown in Fig. 1.4. The operational
transconductance of the transistors can be written as:
$$g_{m1,2} = \frac{2\left(\frac{I_1}{2}\right)}{V_{ov1,2}} = \frac{f(\alpha_1)}{V_{ov1,2}}, \qquad
g_{m3,4} = \frac{2\left(\frac{I_2}{2}\right)}{V_{ov3,4}} = \frac{f(\alpha_2)}{V_{ov3,4}},$$
where $I_1$ and $I_2$ are the biasing currents in the first and second branch,
respectively, $V_{ov_i}$ is the overdrive voltage of the $i$-th transistor, and $\alpha_i$ is the
control knob to the biasing currents through a function $f(\cdot)$. The output
voltage is then obtained by the gain equation for differential amplifiers [6] as:
$$V_{out} = R_D\, g_{m1,2}\, v_1 + R_D\, g_{m3,4}\, v_2. \tag{1.4}$$
Assuming $V_{ov1,2} = V_{ov3,4} = V_{ov}$ and $f(x) = kx$ for some $k$, Equation (1.4)
reduces to:
$$V_{out} = \frac{k R_D}{V_{ov}}\left(\alpha_1 v_1 + \alpha_2 v_2\right).$$
There are other ways to compute DPs using analog circuits. The most
common one is the Gilbert cell, which is an extension to the circuit de-
scribed above so that it computes a four-quadrant product. The Gilbert
cell is widely used in modern receivers as a high-speed mixer [7]. Another
approach that has reported large energy savings is through a mixed-signal
circuit that utilizes switched capacitors [8].
1.2.3 Stochastic computing approach
In 1967, B. R. Gaines [9] proposed an alternative system, where a number is
represented by a binary stream with the information stored in the density of 1s.
For example, to store the number 0.5, half the binary stream should consist
of 1s. The stream length $M$ is a design parameter. The longer the stream,
the more accurate the output representation, as we will show later in this
section. A typical way to obtain the stream from an analog voltage is through
an analog-to-stochastic converter (ASC) [10]. This ASC compares the input
to a random signal $U$, which is uniformly distributed between 0 and 1. An
example is shown in Fig. 1.5(a). The probability of obtaining a 1 at the
output is:
$$p_O \triangleq P(O = 1) = P(U < I) = I.$$
Figure 1.5: (a) Analog-to-Stochastic Converter (ASC), and (b)
Stochastic-to-Digital Converter (SDC).
Figure 1.6: Example of a DP computed in stochastic computing using a
multiplexer. [Streams shown: I1 = 1,1,0,1,1,1,1,1 (7/8); I2 = 0,1,1,0,0,1,0,0 (3/8);
select I3 = 1,0,1,0,1,0,0,1 (4/8); output O = 1,1,0,0,1,1,0,1 (5/8).]
We use a stochastic-to-digital converter (SDC) [10] to go back to the
digital domain. The SDC consists of a counter that counts the number of
1s, as shown in Fig. 1.5(b). The computing elements are simple, such as an
AND gate used for multiplication and a multiplexer used for weighted addition.
Next, we show the conditions under which those computing elements work.
This work focuses on the DP kernel and thus, we will focus on the
multiplexer. For a multiplexer with inputs shown in Fig. 1.6, the output Boolean
equation can be written as:
$$O = I_1 I_3 + I_2 \bar{I}_3,$$
where $I_1$ and $I_2$ are the two inputs to the multiplexer, and $I_3$ is the select
input. Assuming that $I_1$, $I_2$, and $I_3$ are independent Bernoulli streams with
parameters $p_1$, $p_2$, and $p_3$ respectively, we obtain that $O$ is also a Bernoulli
stream with parameter:
$$\begin{aligned}
p_O &\triangleq P(O = 1) \\
    &= P(I_1 I_3 + I_2 \bar{I}_3 = 1) \\
    &= P(I_1 = 1)P(I_3 = 1) + P(I_2 = 1)P(I_3 = 0) \\
    &= p_1 p_3 + p_2 (1 - p_3).
\end{aligned}$$
The counter at the backend counts all the 1s in the stream and thus,
it is essentially summing $M$ independent Bernoulli random variables with
parameter $p_O$ and dividing by the length $M$. Hence, the mean and variance
of the output can be written as:
$$\mu = \frac{1}{M}\left(M p_O\right) = p_O,$$
$$\sigma^2 = \frac{1}{M^2}\left(M p_O (1 - p_O)\right) = \frac{p_O (1 - p_O)}{M}.$$
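The multiplexer relations above can be illustrated with a short Monte Carlo sketch. The helper names, the seed, and the chosen stream length are assumptions made for illustration, and the 8-bit streams are those recoverable from Fig. 1.6 with the input assignment inferred from the multiplexer equation. The sketch checks that the fraction of 1s at the output has mean p1 p3 + p2 (1 − p3) and variance pO (1 − pO) / M.

```python
import numpy as np

def mux(i1, i2, sel):
    """Two-input multiplexer on bit streams: O = I1 where sel is 1, otherwise I2."""
    return np.where(sel == 1, i1, i2)

def bernoulli_stream(p, shape, rng):
    """Bit stream whose entries are 1 with probability p (an idealized ASC output)."""
    return (rng.random(shape) < p).astype(int)

def mux_statistics(p1, p2, p3, M, trials, rng):
    """Mean and variance of the counter output (fraction of 1s) over many length-M streams."""
    out = mux(bernoulli_stream(p1, (trials, M), rng),
              bernoulli_stream(p2, (trials, M), rng),
              bernoulli_stream(p3, (trials, M), rng))
    estimates = out.mean(axis=1)
    return estimates.mean(), estimates.var()

# 8-bit example streams (assignment to I1, I2, I3 is inferred, not stated in the text).
i1 = np.array([1, 1, 0, 1, 1, 1, 1, 1])   # 7/8
i2 = np.array([0, 1, 1, 0, 0, 1, 0, 0])   # 3/8
i3 = np.array([1, 0, 1, 0, 1, 0, 0, 1])   # 4/8, select input
o = mux(i1, i2, i3)
print(o, o.mean())                        # fraction of 1s approximates p1*p3 + p2*(1 - p3)

rng = np.random.default_rng(1)            # fixed seed, for repeatability only
mean, var = mux_statistics(7 / 8, 3 / 8, 1 / 2, M=64, trials=20_000, rng=rng)
p_o = (7 / 8) * (1 / 2) + (3 / 8) * (1 / 2)
print(mean, p_o, var, p_o * (1 - p_o) / 64)
```

Doubling M in this sketch should roughly halve the measured variance, which is the linear (rather than exponential) improvement in accuracy with stream length that makes long streams costly.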
The energy consumed by a multiplexer used in this configuration can be
written as:
$$E_{tot} = M\left(\alpha C V_{dd}^2 + E_{lkg}\right),$$
where $M$ is the stream length, $\alpha$ is the average transition activity, $C$ is
the output capacitance, $V_{dd}$ is the supply voltage, and $E_{lkg}$ is the leakage
energy consumed by the multiplexer. Most recently, Blaauw showed how to
use analog inputs instead of digital Bernoulli bit streams, which reduces the
swing in the circuit and hence decreases energy consumption [11]. However,
the energy consumption is proportional to the stream length $M$ and thus,
this technique becomes energy inefficient for large stream lengths.
Among the three approaches discussed in this chapter, the digital imple-
mentation remains the mainstream approach when designing systems today.
Digital designers have the ability to use extensive tools that take the sys-
tem from a behavioral description to a circuit layout. For low precision, the
energy savings due to using the analog approach are typically barely signifi-
cant (< 2×)[4] and are hardly worth the extra complexity and time spent in
designing such a system. For high precision, stochastic computing does not
provide competition due to its energy inefficiency. In Chapter 2, we will intro-
duce a new circuit, rapidly switched analog circuit (RSAC), that implements
the DP kernel [12]. RSAC is based on fast switching analog inputs to the
output via variable width digital pulses. Unlike the analog approach, RSAC
is oblivious of the transistor models, which greatly reduces the complexity
and time needed to design a system. Like stochastic computing, RSAC uses
a multiplexer to compute the DP. However, the operation mechanism is very
different and leads to an output convergence that is exponential and thus,
the energy consumption and delay are greatly reduced.
CHAPTER 2
RAPIDLY SWITCHED ANALOG CIRCUIT
(RSAC) PRINCIPLES
In this chapter, we introduce rapidly switched analog circuit (RSAC), a new
energy-efficient mixed-signal circuit. RSAC implements the dot product
(DP) kernel by fast switching the analog input to the output via variable
width digital pulses. The input analog voltages constitute the first vector in
the DP and are passed through an N input multiplexer with N select signals
(Fig. 2.1(a)). By having only one select signal active at a time, and switch-
ing among the inputs at a high frequency with certain duty cycles, which
constitute the second vector in the DP, the output voltage is obtained as the
weighted sum of the input voltages. An example operation with N = 3 is
depicted in Fig. 2.1(b). The output is sampled at TS, which is a multiple of
the iteration period T (i.e. TS = KT for some K). The iteration period T
is the average on-time for each path. Section 2.1 presents the architecture in
which an RSAC-based DP kernel can be implemented. A detailed accuracy
analysis is presented in Section 2.2, followed by a description of circuit imple-
mentations in Section 2.3. Section 2.4 details two different implementations
for the select generator. Finally, a behavioral model and an energy model
are presented in Sections 2.5 and 2.6, respectively.
2.1 RSAC architecture
An RSAC-based processor architecture consists of $J$ stages of DP kernels as
shown in Fig. 2.2. $SG_{jh}$ represents the $h$-th select generator in stage $j$, with a
total of $H_j$ such select generators. Here, $K_{jhi}$ represents the $i$-th computation
kernel sharing $SG_{jh}$, with a total of $I_{jh}$ such kernels. The number of inputs
going into kernel $K_{jhi}$ is $N_{jh}$. Accordingly, the total number of inputs going
into stage $j$ is $\sum_{h=1}^{H_j} I_{jh} N_{jh}$ and the total number of outputs leaving stage $j$
is $\sum_{h=1}^{H_j} I_{jh}$. A system-wide multi-phase clock generator (MPCG) provides
clock inputs to the $SG_{jh}$ select generators that generate the vector $\Phi_{jh}$, the
select signals of period $N_{jh}T$ required for switching. $N_{jh}T$ will also be referred
to as the round period, which is the time required to go through one round
over all inputs.
Figure 2.1: Rapidly switched analog circuit (RSAC)-based DP kernel: (a)
conceptual operation, and (b) output waveform for N = 3, V1 = 0.4,
V2 = 0.9, V3 = 0.1, and p1 = p2 = p3 = 1/3.
[Panel (a): inputs V_0, ..., V_{N−1} switched by select signals φ_i (i = 0, ..., N − 1)
onto a capacitor C, giving V_z = Σ_{i=0}^{N−1} V_i p_i. Panel (b): output voltage (V) versus
number of iterations (0 to 40N), with the target voltage marked.]
Figure 2.2: Architecture of a rapidly switched analog circuit (RSAC)
processor.
The computation kernel consists of an RSAC-based DP kernel, as shown
in Fig. 2.1(a). For a given input voltage vector $\mathbf{V} = (V_0, V_1, \ldots, V_{N-1})$ and
weight vector $\mathbf{p} = (p_0, p_1, \ldots, p_{N-1})$, with $p_i \ge 0$ and $\sum_{i=0}^{N-1} p_i = 1$, the
circuit computes the output voltage:
$$V_Z = \mathbf{V} \cdot \mathbf{p} = \sum_{i=0}^{N-1} V_i p_i.$$
In general, for a positive coefficient vector $\mathbf{c} = (c_0, c_1, \ldots, c_{N-1})$, the circuit
computes the output voltage $V_Z = \frac{1}{\sum_{i=0}^{N-1} c_i} \sum_{i=0}^{N-1} V_i c_i$. This can be seen by
identifying the relationship between the weight and the coefficient:
$$p_i = \frac{c_i}{\sum_{i=0}^{N-1} c_i}.$$
The weights $p_i$ are implemented by a set of non-overlapping pulses $\phi_i$
($i = 0, \ldots, N-1$) of period $NT$, which is also the round period, and duty
cycle $p_i$. By designing $T$ to be significantly smaller than the time constant of
the RC network, $V_Z$ will converge to $\sum_{i=0}^{N-1} V_i p_i$, with accuracy increasing at
an exponential rate with the number of cycles until it settles. This circuit,
being analog in nature, is also affected by various physical and mechanical
phenomena that will affect its accuracy. Those will be discussed in detail in
Section 2.2.
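The switching behavior described above can be sketched behaviorally. The following is a minimal model, not the circuit simulation used in this thesis: it assumes ideal switches, a single time constant RC, and a hypothetical parameter `ttr` standing for the ratio RC/T. Each path i drives a first-order RC stage toward V_i for a time p_i N T, and the output is recorded at the end of every round.

```python
import math

def rsac_rounds(v, p, ttr, rounds, vz0=0.0):
    """Output at the end of each round for ideal switches and ttr = RC / T (assumed model)."""
    n = len(v)
    vz = vz0
    trace = []
    for _ in range(rounds):
        for vi, pi in zip(v, p):
            # Path i is on for p_i * N * T; a first-order RC step response toward V_i.
            decay = math.exp(-pi * n / ttr)
            vz = vi + (vz - vi) * decay
        trace.append(vz)
    return trace

# Values from the Fig. 2.1(b) caption; ttr = 30 is an arbitrary illustrative choice.
v = [0.4, 0.9, 0.1]
p = [1 / 3, 1 / 3, 1 / 3]
ideal = sum(vi * pi for vi, pi in zip(v, p))
trace = rsac_rounds(v, p, ttr=30, rounds=200)
print(ideal, trace[0], trace[-1])
```

In this sketch the output starts far from the weighted sum and settles close to it; a small residual offset remains because the output is sampled at the end of one particular path rather than averaged over the round.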
2.2 Principle of operation
The basic implementation for RSAC is shown in Fig. 2.3. Transmission gates
are used to implement the switches to allow full swing at the output. $C_d$ is
the total drain capacitance at node $I$ and $R_i$ is the equivalent resistance of
the $i$-th path. A series resistance $R$ and a capacitor $C$ are added for proper
operation, as will be shown shortly. By imposing condition:
$$\text{C1)}: \quad (R + R_i)C \gg \max_{i=0..(N-1)} R_i C_d,$$
we can treat the circuit as a first-order RC circuit with step inputs. We
define $V_{z,i}$ as the output sequence of iterating over all paths but ending in
the $i$-th path. From basic principles, we obtain the following relations:
$$V_{z,i}[n+1] =
\begin{cases}
V_i + \left(V_{z,(i-1)_N}[n] - V_i\right)X_i, & i = 0 \\
V_i + \left(V_{z,(i-1)_N}[n+1] - V_i\right)X_i, & i = 1, \ldots, N-1,
\end{cases} \tag{2.1}$$
where $X_i = \exp\left(-\frac{p_i N T}{(R + R_i)C}\right)$ and $(x)_N \triangleq x \bmod N$. The relationship between
the total number of iterations $K$ and the total number of rounds $n$ is $K = nN$.
Figure 2.3: Basic implementation of RSAC.
Using Equation (2.1), we obtain the recurrence relations:
$$V_{z,i}[n+1] = \sum_{k=0}^{N-1}\left[V_{(i-k)_N}\left(1 - X_{(i-k)_N}\right)\prod_{j=0}^{k-1} X_{(i-j)_N}\right] + X\,V_{z,i}[n], \tag{2.2}$$
where $X \triangleq \prod_{k=0}^{N-1} X_k$. The recurrence relations in Equation (2.2) are
first-order linear difference equations with constant coefficients. Their
convergence is guaranteed by the fact that $X < 1$ and they converge to:
$$\tilde{V}_{z,i} \triangleq \lim_{n \to \infty} V_{z,i}[n] = \frac{\sum_{k=0}^{N-1}\left[V_{(i-k)_N}\left(1 - X_{(i-k)_N}\right)\prod_{j=0}^{k-1} X_{(i-j)_N}\right]}{1 - X}. \tag{2.3}$$
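We note that $\exp(-x) = 1 - x + O(x^2)$ and for small $x$, we obtain $\exp(-x) \approx
1 - x$. By making $T \ll RC$, we can approximate $X_i$ by $1 - \frac{p_i N T}{(R + R_i)C}$. By
imposing condition:
$$\text{C2)}: \quad R \gg R_i,$$
we make $R$ large enough to dominate the equivalent resistances of the paths.
Thus, we can approximate $X_i$ by $1 - \frac{p_i N T}{RC}$, and $1 - X$ by $\frac{NT}{RC}$ to first
order in $\frac{T}{RC}$ (since $\sum_k p_k = 1$). This allows us to write Equation (2.3) as:
$$\tilde{V}_{z,i} = \frac{\sum_{k=0}^{N-1}\left[V_{(i-k)_N}\left(\frac{p_{(i-k)_N} N T}{RC}\right)\prod_{j=0}^{k-1}\left(1 - \frac{p_{(i-j)_N} N T}{RC}\right)\right]}{\frac{NT}{RC}}.$$
Since $T \ll RC$, we will approximate $\left(\frac{T}{RC}\right)^2$ by 0 and obtain:
$$\tilde{V}_{z,i} = \frac{\sum_{k=0}^{N-1}\left[V_{(i-k)_N}\left(\frac{p_{(i-k)_N} N T}{RC}\right)\right]}{\frac{NT}{RC}} = \sum_{k=0}^{N-1} V_k p_k,$$
where $\sum_{k=0}^{N-1} V_k p_k$ is the ideal output. We note that any of those $N$ outputs
can serve as an approximation to the ideal output and hence sampling the
output at any instant of time (beyond convergence) can serve as the final
output.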
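The quality of this approximation can be examined by evaluating the limit in Equation (2.3) directly. The sketch below is illustrative: it assumes condition C2 holds so that the path resistances are neglected, uses a hypothetical parameter `ttr` for RC/T, and reuses the voltages from the Fig. 2.1(b) caption. It computes the steady-state value for each ending path i and compares it with the ideal weighted sum as RC/T grows.

```python
import math

def steady_state(v, p, ttr, i):
    """Evaluate Eq. (2.3) for ending path i, with X_k = exp(-p_k * N / ttr) (R_k neglected)."""
    n = len(v)
    x = [math.exp(-pk * n / ttr) for pk in p]
    x_all = math.prod(x)                       # X, the product of all X_k
    total = 0.0
    for k in range(n):
        idx = (i - k) % n                      # (i - k)_N
        prod = 1.0
        for j in range(k):
            prod *= x[(i - j) % n]             # product of X_{(i-j)_N}, j = 0..k-1
        total += v[idx] * (1.0 - x[idx]) * prod
    return total / (1.0 - x_all)

v = [0.4, 0.9, 0.1]
p = [1 / 3, 1 / 3, 1 / 3]
ideal = sum(vi * pi for vi, pi in zip(v, p))
for ttr in (10, 100, 1000):
    worst = max(abs(steady_state(v, p, ttr, i) - ideal) for i in range(len(v)))
    print(ttr, worst)                          # worst-case deviation over the N outputs
```

Under these assumptions, the worst-case deviation shrinks roughly in proportion to T/RC, consistent with the first-order approximation used in the derivation.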
Another important aspect of this analysis is the convergence speed, since
the number of iterations, $K = nN$, is proportional to the overall energy
consumption (as will be seen in Section 2.6). Assuming a zero initial output,
$V_{z,i}[n]$ can be unrolled and written as:
$$V_{z,i}[n] = \tilde{V}_{z,i}\left(1 - X^n\right). \tag{2.4}$$
Next, we define the error at the output to be $e_i[n] \triangleq V_{ideal} - V_{z,i}[n]$, where
$V_{ideal} = \sum_{k=0}^{N-1} V_k p_k$ is the ideal output. As such, the mean-square error
(MSE) $J[n, X]$ is:
$$J[n, X] \triangleq E\left\{e_i[n]^2\right\} = J_{min} + X^n \alpha_A + X^{2n} \beta_A, \tag{2.5}$$
where $J_{min} = E\left\{\left(V_{ideal} - \tilde{V}_{z,i}\right)^2\right\}$, $\alpha_A = 2E\left\{\left(V_{ideal} - \tilde{V}_{z,i}\right)\tilde{V}_{z,i}\right\}$, and
$\beta_A = E\left\{\tilde{V}_{z,i}^2\right\}$. Note that as $n$ increases, $X^n$ will go to zero and therefore:
$$\lim_{n \to \infty} J[n, X] = J_{min}.$$
The speed of convergence is exponential and similar to the one in a simple
RC circuit. There is a trade-off between $K$ and $\mathrm{TTR} \triangleq \frac{\tau}{T}$, the tau-to-T ratio
where $\tau \triangleq RC$, which will be studied in the next chapter, where we optimize
the choice of $K$ and TTR in the context of minimizing energy consumption
for a fixed SQNR at the output.
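A rough worked example of this trade-off follows. It is a sketch under simplifying assumptions: condition C2 holds so that the path resistances are neglected, only the transient factor X^n of Equation (2.4) is considered, and the tolerance ε is a symbol introduced here for illustration.

```latex
X = \prod_{k=0}^{N-1} \exp\!\left(-\frac{p_k N T}{RC}\right)
  = \exp\!\left(-\frac{N T}{RC}\sum_{k=0}^{N-1} p_k\right)
  = \exp\!\left(-\frac{N}{\mathrm{TTR}}\right),
\qquad
X^n \le \epsilon
\;\Longleftrightarrow\;
K = nN \ge \mathrm{TTR}\,\ln\frac{1}{\epsilon}.
```

Under these assumptions, the number of iterations needed to reach a given transient error grows linearly with TTR, whereas the first-order approximation behind the steady-state value improves as TTR grows, which is the tension the optimization has to resolve.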
Finally, the accuracy of the circuit will be affected by various physical
phenomena as well as some issues related to the operation mechanics. These
issues include: rise and fall times in select signals, charge sharing between $C_d$
and $C$ (Section 2.4.2), finite precision in select generation, clock feed-through,
PVT variations, and thermal noise.
We discuss here the issue associated with finite rise and fall times in select
signals. The time when the switch turns on or off depends on the threshold
voltage of the switch, the source voltage, and, most importantly, the rise and
fall times. The latter are finite and unequal, which causes the effective
duty cycle $\hat{p}_i$ to differ from $p_i$. Another problem with finite rise and fall
times is that two select signals might overlap during the transition. This overlap
activates two paths at the same time, which is problematic. To see that, assume
paths $i$ and $j$ are active at the same time. With the capacitances neglected, the
equivalent voltage at node $I$ (see Fig. 2.3) is then:
$$V_I = \frac{R R_i}{R(R_i + R_j) + R_i R_j} V_j + \frac{R R_j}{R(R_i + R_j) + R_i R_j} V_i + \frac{R_i R_j}{R(R_i + R_j) + R_i R_j} V_Z,$$
which is approximately $\frac{R_i}{R_i + R_j} V_j + \frac{R_j}{R_i + R_j} V_i$ when condition C2 is imposed
($R \gg R_i$, $R \gg R_j$). This voltage will be driving the input and will cause
interference at the output.
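The overlap expression can be cross-checked against a direct nodal analysis. In the sketch below, the function names and the component values are hypothetical, and the capacitances are neglected as in the text. It evaluates the closed-form expression for the voltage at node I, the same voltage obtained from Kirchhoff's current law at that node, and the simplified two-resistor divider that results when R dominates.

```python
def node_voltage_closed_form(vi, vj, vz, ri, rj, r):
    """Voltage at node I with paths i and j both on, using the closed-form expression."""
    den = r * (ri + rj) + ri * rj
    return (r * ri * vj + r * rj * vi + ri * rj * vz) / den

def node_voltage_nodal(vi, vj, vz, ri, rj, r):
    """Same voltage from KCL at node I: conductance-weighted average of the three sources."""
    g_total = 1.0 / ri + 1.0 / rj + 1.0 / r
    return (vi / ri + vj / rj + vz / r) / g_total

def node_voltage_c2(vi, vj, ri, rj):
    """Approximation under condition C2 (R much larger than R_i and R_j)."""
    return (ri * vj + rj * vi) / (ri + rj)

# Hypothetical values: 100 ohm and 200 ohm path resistances, 1 Mohm series resistance.
args = dict(vi=0.4, vj=0.9, vz=0.5, ri=100.0, rj=200.0, r=1e6)
print(node_voltage_closed_form(**args),
      node_voltage_nodal(**args),
      node_voltage_c2(0.4, 0.9, 100.0, 200.0))
```

With these values the three results nearly coincide, and the node sits between V_i and V_j, weighted toward the input with the lower path resistance, so the RC stage is driven by neither intended input during the overlap.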
The circuit described in this section cannot be cascaded since it has a finite
input impedance and non-zero output impedance. To solve that, we propose
two circuit implementations in the next section and we show how the analysis
in this section can be extended to explain their functionality.
2.3 RSAC circuit implementations
We have developed two circuit implementations for RSAC-based DP kernels,
shown in Figure 2.4, that allow cascading without the need for a voltage
buffer. This comes at the expense of a bias term at the output. These
circuits compute:
$$V_Z = \frac{G}{\sum_{i=0}^{N-1} c_i} \sum_{i=0}^{N-1} V_i c_i - V_t, \tag{2.6}$$
where $G$ is the attenuation factor of the source follower ($0 < G \le 1$) and $V_t$
is the threshold voltage. Another limitation of these implementations is the
constrained range of allowable inputs. N-RSAC (shown in Fig. 2.4(a)) uses
NMOS transistors; its inputs should be in the range $[V_{tn}, V_{dd}]$ and its
output is shifted down ($V_{tn} > 0$). P-RSAC (shown in Fig. 2.4(b)) uses PMOS
transistors; its inputs should be in the range $[0, V_{dd} - |V_{tp}|]$ and its
output is shifted up ($V_{tp} < 0$).
Figure 2.4: Circuit implementations of an RSAC-based DP kernel: (a)
N-RSAC, and (b) P-RSAC.
Figure 2.5: The equivalent circuit of Fig. 2.4(b) when path i is on and all
other paths are off.
Figure 2.6: The input to the second stage is a ramp input.
The voltage at node $I$ at DC (i.e., with the capacitances neglected) will help us
relate the analysis of the implementation in Fig. 2.3 (shown in Section 2.2)
and the analysis of the implementations in Fig. 2.4. When path $i$ is on, in
Fig. 2.3, we use KVL to obtain:
$$V_I = \frac{R}{R + R_i} V_i + \frac{R_i}{R + R_i} V_Z,$$
which is approximately $V_i$ when condition C2 is imposed ($R \gg R_i$). Hence,
the RC circuit is being driven by $V_i$, which makes the analysis in Section 2.2
valid. When path $i$ is on in Fig. 2.4(a), the circuit reduces to the one shown
in Fig. 2.5. This circuit is a variation of the source follower configuration
and thus, the voltage at node $I$ can be written as:
$$V_I = G V_i - V_t, \tag{2.7}$$
where $G$ is the attenuation factor of the source follower and $V_t$ is the threshold
voltage. The exact value of $G$ depends on $V_{BIAS}$ and will be investigated in
Chapter 3. By substituting the value of $V_I$ in Equation (2.7) for $V_i$ in the
analysis above, we obtain the relation in Equation (2.6). The same analysis
holds for the implementation in Fig. 2.4(b).
When cascading RSAC-based DP circuits, the inputs to the second stage
consist of the outputs of the first stage, which are not constant values but
rather exponentials. As discussed in Section 2.2, the exponentials can be
substituted by a linear approximation. With this approximation, we will
describe the behavior of the latter stages of an RSAC-based system with ramp
inputs. Figure 2.6 shows the input, which is described as:
$$V_i(t) =
\begin{cases}
V_i(0) + \alpha t, & 0 \le t \le p_i N T \\
\beta, & p_i N T < t,
\end{cases}$$
where $\alpha \triangleq \frac{\beta - V_i(0)}{p_i N T}$ is the slope. The one-sided Laplace transform of $V_i(t)$ can
be written as:
$$V_i(s) = \frac{V_i(0)}{s} + \alpha\,\frac{1 - e^{-s p_i N T}}{s^2}.$$
For an RC circuit, the one-sided Laplace transform of the output is:
$$V_o(s) = \frac{V_i(s)}{1 + RCs} + \frac{RC\,V_o(0)}{1 + RCs}.$$
We then obtain the output by replacing $V_i(s)$ by its value and taking the
inverse Laplace transform:
$$\begin{aligned}
V_o(t) ={}& V_i(0) + \left(V_o(0) - V_i(0)\right)\exp\left(-\frac{t}{RC}\right)u(t) \\
&+ \alpha\left(t - RC + RC\exp\left(-\frac{t}{RC}\right)\right)u(t) \\
&- \alpha\left((t - p_i N T) - RC + RC\exp\left(-\frac{t - p_i N T}{RC}\right)\right)u(t - p_i N T),
\end{aligned} \tag{2.8}$$
where $u(t)$ is the unit step function. We make here the approximation that
$\exp(-x) \approx 1 - x$ and thus, we can approximate the second term in
Equation (2.8) as:
$$t - RC + RC\exp\left(-\frac{t}{RC}\right) \approx t - RC + RC\left(1 - \frac{t}{RC}\right) = 0.$$
Similarly, the third term can also be approximated by 0. This reduces
Equation (2.8) to:
$$V_o(t) = V_i(0) + \left(V_o(0) - V_i(0)\right)\exp\left(-\frac{t}{RC}\right)u(t),$$
which is the same response as an RC circuit excited by a step input of $V_i(0)$.
Thus, the analysis provided in Sections 2.2 and 2.3 applies to the latter
stages of an RSAC-based system. We will verify this in Chapter 3.
Figure 2.7: Length M ring oscillator.
In Section 2.4, we discuss the different design alternatives for the select
generator.
2.4 Select generation
Generation of the select signals is needed for proper RSAC operation. The
coefficients $p_i$ could be constant (e.g., filtering) or programmable (e.g., read
from memory), each resulting in a very different design for the select
generator. However, in both cases, the select generator is fed by an MPCG that
provides the clock inputs at different phases. The MPCG is a ring counter
of length $M$, running at a frequency of $f_{Clk} \triangleq \frac{1}{T_{Clk}}$ (Fig. 2.7). The choice of
$M$ depends largely on the precision required in the select generation and the
topology used for the select generator, as will be shown shortly.
Alternatives such as counter-based oscillators or inverter-based ring
oscillators provide similar accuracy, but the ring counter in Figure 2.7 makes the
design of the following select generator simpler and more energy efficient.
In both designs, the coefficients are implemented using a finite precision of
$B$ bits. For a coefficient vector $\mathbf{c} = (c_0, c_1, \ldots, c_{N-1})$, we define the quantized
version $\hat{\mathbf{c}} = (\hat{c}_0, \hat{c}_1, \ldots, \hat{c}_{N-1})$, where:
$$\hat{c}_i = \mathrm{round}\left(\frac{c_i}{\max_i c_i} \times 2^B\right).$$
Figure 2.8: (a) Coefficients of a Gaussian blur filter with σ2 = 0.85, and (b)
select generator for the corresponding coefficients.
[Panel (a) coefficients:
c0 = 1   c1 = 2   c2 = 1
c3 = 2   c4 = 4   c5 = 2
c6 = 1   c7 = 2   c8 = 1
Panel (b): select signals φ0 to φ8 formed from the MPCG outputs O0 to O15.]
2.4.1 Constant coefficients
When the coefficients are constant, such as in the case of filtering, we can
obtain the select signals by ORing the corresponding outputs of the MPCG. For
example, consider the case of applying a 3 × 3 Gaussian blur filter with
$\sigma^2 = 0.85$ with coefficients shown in Fig. 2.8(a). In this case, $\mathbf{c} = \hat{\mathbf{c}}$ and
thus the coefficients are unaltered in the implementation. The length of
the MPCG is obtained by computing $M = \sum_{i=0}^{N-1} \hat{c}_i$, which is 16 in this
case. The implementation of the select generator is shown in Fig. 2.8(b).
The relationship between the iteration period $T$ and the clock period $T_{Clk}$ is
$T = \frac{M}{N} T_{Clk}$ in this design, which is the average on-time of any path.
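The constant-coefficient scheme can be sketched in a few lines. This is an illustrative model rather than the gate-level design of Fig. 2.8(b): the function names are hypothetical, and it assumes that each path is assigned a run of consecutive MPCG phases whose outputs are ORed together. It quantizes the Gaussian blur coefficients, derives the MPCG length M, and builds one select waveform per path over a single round.

```python
def quantize_coeffs(c, bits):
    """c_hat_i = round(c_i / max(c) * 2**bits); Python's round() rounds ties to even."""
    c_max = max(c)
    return [round(ci / c_max * 2 ** bits) for ci in c]

def select_signals(c_hat):
    """One list per path: the select value at each of the M ring-counter phases.

    Assumption: path i is on during c_hat[i] consecutive phases, i.e. its select
    signal is the OR of the ring-counter outputs O_start .. O_{start + c_hat[i] - 1}.
    """
    m = sum(c_hat)
    selects = []
    start = 0
    for ci in c_hat:
        selects.append([1 if start <= k < start + ci else 0 for k in range(m)])
        start += ci
    return selects

c = [1, 2, 1, 2, 4, 2, 1, 2, 1]          # 3 x 3 Gaussian blur coefficients of Fig. 2.8(a)
c_hat = quantize_coeffs(c, bits=2)       # bits = 2 leaves these coefficients unaltered
sel = select_signals(c_hat)
m = sum(c_hat)                           # MPCG length M
duty = [sum(s) / m for s in sel]         # duty cycle p_i = c_hat_i / M
print(m, duty)
```

In this sketch exactly one select signal is high in every clock phase, and each path is on for a fraction c_i / M of the round, so the average on-time per path is (M / N) T_Clk.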
2.4.2 Programmable coefficients
When the coefficients are programmable, such as when they are read from
memory, we can formulate the problem as a finite state machine for which
the state transition diagram is shown in Fig. 2.9. The cnt signal is the output
of a B-bit counter and is an input to this state machine, in addition to the
coefficients cˆi. The output of this state diagram is Φ, the N -bit signal that
represents a one-hot encoding of N states (in particular, the ones in white).
The extra N states (the ones shown in gray) represent idle states, which were
added to keep the control of the counter unaltered since it is shared by all
state machines in the processor.

Figure 2.9: State transition diagram for the input-dependent select
generator.

Figure 2.10: Optimized logic design for the programmable select generator.
An SR latch is used to distinguish the trigger point between the white and
gray states in Fig. 2.9.

The implementation of this select generator is optimized as shown in
Fig. 2.10:
• Each successive white and gray state is merged into one state. An extra
bit, b in Fig. 2.10, serves to distinguish between the original states at
the final output.
• The states are one-hot encoded in order to be re-used as the outputs,
thus avoiding extra circuitry.
• A circular shifter is used to represent the states since the transitions in
the output are one-directional.
• A ring-counter is used instead of a counter to facilitate the comparison
circuitry. Although this counter is more expensive, it is shared by all
select generators running in parallel. This extra cost will be attenuated
by a factor of $\sum_{j=1}^{J} \sum_{h=1}^{H_j} I_{jh}$, which is typically large.

Figure 2.11: The equivalent circuit of the RSAC-based DP kernel when all
paths are off.

We note that the relationship between the iteration period T and the clock
period TClk is $T = 2^{B-1} T_{Clk}$ in this design, which is the average on-time of
any path.
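To make the timing of the programmable select generator concrete, here is a small behavioral sketch in Python. It is one hypothetical reading of the state diagram in Fig. 2.9, not the synthesized logic: path i is assumed to be selected while the shared counter value is below ĉi and idle for the rest of its 2^B-cycle slot. The function name is illustrative.

```python
def programmable_select_sequence(c_hat, B):
    """Yield the one-hot select word for every clock cycle of one pass.

    Assumption: path i is on while cnt < c_hat[i] (white state) and all
    paths are off for the remaining cycles of the slot (gray state).
    """
    n = len(c_hat)
    period = 2 ** B
    for i, ci in enumerate(c_hat):
        if not 0 <= ci <= period:
            raise ValueError("coefficient does not fit the counter range")
        for cnt in range(period):
            word = [0] * n
            if cnt < ci:
                word[i] = 1
            yield tuple(word)


c_hat = [3, 0, 7, 4]                     # hypothetical 3-bit coefficients
words = list(programmable_select_sequence(c_hat, B=3))
on_cycles = [sum(w[i] for w in words) for i in range(len(c_hat))]
idle_cycles = sum(1 for w in words if sum(w) == 0)
print(on_cycles, idle_cycles)
```

Under this reading, path i is on for ĉi clock periods and every path is off for the remaining 2^B − ĉi periods of its slot, which is the idle interval analyzed next.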
This design gives rise to charge sharing between Cd and C when all signal
paths are idle. To see this, assume path i has been on for ĉi TClk and now
turns off for the remaining period, $T_{off} = (2^B - \hat{c}_i) T_{Clk}$. Figure 2.11 shows
the equivalent circuit in that period. Solving for the output voltage after
Toff, we obtain:
\[ V_z(T_{off}) = V_f + K \exp\left( -\frac{T_{off}(C_d + C)}{R C_d C} \right), \]
\[ V_f = \frac{C_d}{C_d + C} V_i(0) + \frac{C}{C_d + C} V_z(0), \]
where Vi(0) is Vi, Vz(0) is the expected output, and K is some positive
factor. For C ≫ Cd and Toff ≫ RCd (which is typically the case), we obtain
Vz(Toff) ≈ Vf. This will cause interference at the output.
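The size of this effect can be estimated with a short Python sketch of the two expressions above. The component values are hypothetical, and K is taken here from the initial condition Vz(0), which is an assumption about how the constant is fixed.

```python
import math


def charge_sharing_output(v_i0, v_z0, c_d, c, r, t_off):
    """Return (V_f, V_z(T_off)) for the idle-period circuit of Fig. 2.11."""
    v_f = (c_d * v_i0 + c * v_z0) / (c_d + c)
    tau = r * c_d * c / (c_d + c)        # time constant of the RC loop
    k = v_z0 - v_f                       # assumption: set by V_z at t = 0
    return v_f, v_f + k * math.exp(-t_off / tau)


# Hypothetical values: C_d = 1 fF, C = 400 fF, R = 200 kOhm, T_off = 5 ns.
v_f, v_z = charge_sharing_output(
    v_i0=0.4, v_z0=0.8, c_d=1e-15, c=400e-15, r=200e3, t_off=5e-9
)
print(v_f, v_z)
```

With these numbers the time constant is about 0.2 ns, so the output settles to Vf well within Toff, and the shift away from Vz(0) is set by the ratio Cd/(Cd + C).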
In Sections 2.5 and 2.6, we introduce a behavioral model that predicts the
output of RSAC and an energy model that predicts the energy consumed by
RSAC.
2.5 Behavioral model
Given the issues described in Sections 2.2 and 2.3, we can establish a behav-
ioral model for the RSAC circuit in Fig. 2.3:
\[ V_{out}[K+1] = V_{(K)_N} + (1-\epsilon)\left( V_{out}[K] - V_{(K)_N} \right) \exp\left( -\frac{(\hat{p}_{(K)_N} + \Delta) N}{\mathrm{TTR}} \right) + \eta, \]
where $\epsilon \triangleq \frac{C_d}{C_d + C}$ is the factor due to charge sharing, p̂i is the quantized
version of the weight, ∆ is the adjustment to the weight due to rise and
fall times and PVT variations in the threshold voltages, and η is the term
due to thermal noise and the various other phenomena not captured in the
analysis. Accordingly, ∆ and η will be obtained using simulations before
using this model for system design. This same behavioral model can be used
for the circuit implementations in Fig. 2.4(a) and (b) by adjusting $V_{(K)_N}$ to
$G V_{(K)_N} - V_t$ to obtain:
\[ V_{out}[K+1] = G V_{(K)_N} - V_t + (1-\epsilon)\left( V_{out}[K] - G V_{(K)_N} \right) \exp\left( -\frac{(\hat{p}_{(K)_N} + \Delta) N}{\mathrm{TTR}} \right) + \eta. \]
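The recursion for the basic circuit is easy to iterate numerically. The Python sketch below is a minimal illustration of the first form of the model; the function names, the default values ε = ∆ = η = 0, and the assumption that the weights p̂i are normalized to sum to one are all choices made for this example.

```python
import math


def rsac_step(v_out, v_in, p_hat, n_paths, ttr, eps=0.0, delta=0.0, eta=0.0):
    """One update of the behavioral model for the path that is switched on."""
    decay = math.exp(-(p_hat + delta) * n_paths / ttr)
    return v_in + (1.0 - eps) * (v_out - v_in) * decay + eta


def rsac_run(v_inputs, p_hats, ttr, n_iters, v0=0.0, **params):
    """Iterate the model, visiting path K mod N at iteration K."""
    n = len(v_inputs)
    v = v0
    for k in range(n_iters):
        i = k % n
        v = rsac_step(v, v_inputs[i], p_hats[i], n, ttr, **params)
    return v


# Two inputs with weights 0.75 and 0.25 (assumed normalized to sum to one).
v = rsac_run([0.2, 0.8], [0.75, 0.25], ttr=200, n_iters=4000)
print(v)   # settles near the weighted average 0.75*0.2 + 0.25*0.8
```

For a large TTR each iteration moves the output only slightly toward the active input, so the steady-state output approaches the weighted average of the inputs, which is the dot product the kernel is meant to compute.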
2.6 Energy model
A RSAC implementation of $\sum_{j=1}^{J} \sum_{h=1}^{H_j} I_{jh}$ DP kernels requires one MPCG.
Hence, the total energy consumption for a kernel can be written as:
\[ E_{tot}[n,X] \triangleq E_{RSAC}[n,X] + \frac{E_{MPCG}}{\sum_{j=1}^{J} \sum_{h=1}^{H_j} I_{jh}}, \tag{2.9} \]
where ERSAC[n,X] is the energy consumption of the RSAC signal path
and select generation over n rounds, and EMPCG is the energy consumption
of the MPCG. EMPCG depends largely on the topology used and thus will
be estimated through simulations, while ERSAC[n,X] can be written as:
\[ E_{RSAC}[n,X] \triangleq E_{sig}[n,X] + \frac{E_{sel}[n]}{I_{jh}}, \]
where Esig[n,X] is the energy consumed in the signal path over n rounds and
Esel[n] is the energy dissipated by the select generator over n rounds. These
quantities have different expressions, depending on the implementation used.
When using the basic circuit implementation described in Section 2.2, the
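energy consumed in the signal path is denoted by Esig,bas[n,X] and can be
written as:
\[
\begin{aligned}
E_{sig,bas}[n,X] &= n \frac{C_d}{2} \sum_{k=0}^{N-1} \left( V_{(k+1)_N} - V_k \right)^2 + C \sum_{l=0}^{n-1} V_0 \left( V_{z,0}[l+1] - V_{z,N-1}[l] \right) \\
&\quad + C \sum_{l=0}^{n-1} \sum_{k=1}^{N-1} V_k \left( V_{z,k}[l+1] - V_{z,k-1}[l+1] \right) \\
&= n C_d c_E + C \left( \tilde{V}_z^2 (1 - X^n) + n d_E \right),
\end{aligned}
\]
with $c_E \triangleq \sum_{k=0}^{N-1} \frac{(V_{(k+1)\%N} - V_k)^2}{2}$ and $d_E \triangleq \left( \sum_{k=0}^{N-1} V_k^2 (1 - X_k) - \tilde{V}_z^2 (1 - X) \right)$.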
In the case of using the implementations in Figure 2.4, the energy consumed
in the signal path is denoted by Esig,casc[n,X] and can be written as:
\[ E_{sig,casc}[n,X] = n I_{Bias} V_{dd} N T. \]
The energy dissipated by the select generator depends on the topology
used. When the constant coefficient select generator topology in Section 2.4.1
is used, we obtain:
\[ E_{sel,cst}[n] = n C_{sel} V_{dd}^2, \]
where Csel includes the total gate capacitance of the access transistors in
the DP kernel and total drain capacitance of the constant coefficient select
generator. When the programmable select generator topology in Section 2.4.2
is used, we can write a more explicit expression:
\[ E_{sel,prog}[n] = n \left( (N+2) E_{NOR,2} + 2^B E_{MUX,2} + B E_{MUX,N} \right), \tag{2.10} \]
where ENOR,2 is the energy dissipated by a two-input NOR gate, and EMUX,2
and EMUX,N are the energies dissipated by a two-input multiplexer and an
N-input multiplexer, respectively. Equation (2.10) shows the dependence of the
energy consumed by the select generator on the length N and the precision
B.
In Chapter 3, we will introduce a design methodology for RSAC followed
by two design examples.
CHAPTER 3
RSAC DESIGN METHODOLOGY AND
EXAMPLES
RSAC has a rich design space. Namely, R, C, and T are left as parameters to
be controlled by the designer. Once these parameters are chosen, TTR = RC/T
and K = TS/T are automatically decided and will affect the accuracy and
the energy consumption of the circuit. Our objective is to minimize energy
consumption for a given SQNR constraint. Hence, the design methodology
is as follows:
Step 1: Obtain Rmin and Cmin that satisfy conditions C1 and C2 (Section 2.2).
Step 2: Obtain SQNR vs. energy plots for different TTR while varying K.
Step 3: Obtain the pair TTRmin and Kmin that has the minimum energy
consumption for the given SQNR constraint.
Step 4: Choose R and C. Initially, R = Rmin and C = Cmin.
Step 5: Choose T = RC/TTRmin and K = TS/T. If TClk (∝ T) is infeasible, go to
Step 4 and increase R and/or C. If K < Kmin, go to Step 3 while
ignoring the current TTRmin and Kmin.
We will assume that TClk is always feasible and thus, we will merge Step
1 and Step 4. However, in real-life applications, this assumption should be
removed and the design methodology above should be kept intact.
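Steps 3 and 5 of the methodology amount to a constrained search, which the Python sketch below illustrates under the same assumption that TClk is always feasible. The sweep data, the dictionary keys, and the function name `choose_operating_point` are hypothetical; in practice the sweep would come from the SQNR vs. energy simulations of Step 2.

```python
def choose_operating_point(sweep, sqnr_target_db, r, c, t_s):
    """Pick the lowest-energy (TTR, K) pair that meets the SQNR target.

    sweep: list of dicts with keys "ttr", "k", "sqnr_db", "energy".
    Returns None when no candidate satisfies K >= Kmin.
    """
    feasible = [p for p in sweep if p["sqnr_db"] >= sqnr_target_db]
    for point in sorted(feasible, key=lambda p: p["energy"]):   # Step 3
        t = r * c / point["ttr"]          # Step 5: T = RC / TTRmin
        k = t_s / t                       # Step 5: K = TS / T
        if k >= point["k"]:
            return {"ttr": point["ttr"], "k_min": point["k"], "t": t, "k": k}
        # K < Kmin: ignore this pair and try the next-best one (Step 3).
    return None


# Hypothetical sweep results; R = 1 MOhm, C = 32 fF, TS = 20 ns.
sweep = [
    {"ttr": 45, "k": 23, "sqnr_db": 30.0, "energy": 2.0e-13},
    {"ttr": 18, "k": 8, "sqnr_db": 24.0, "energy": 1.0e-13},
    {"ttr": 9, "k": 3, "sqnr_db": 15.0, "energy": 0.5e-13},
]
print(choose_operating_point(sweep, 24.0, r=1e6, c=32e-15, t_s=20e-9))
```

The loop returns the first pair whose available number of iterations K is at least Kmin, which mirrors the "go to Step 3" fallback in Step 5.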
In this chapter, we will apply this design methodology to two different
design examples. Design I in Section 3.1 uses the basic circuit in Fig. 2.3 with
a constant coefficient select generator, shown in Fig. 2.8. The simulations
are conducted using a 130 nm CMOS process. We verify in this design the
mean-square error (MSE) and energy models developed in Sections 2.2 and
2.6, respectively. We apply the design methodology to optimize RSAC and
finally compare it against a digital implementation.
Design II in Section 3.2 uses the implementations in Fig. 2.4 with an
input-dependent select generator, shown in Fig. 2.10. The simulations are
conducted using a 28 nm CMOS fully-depleted silicon-on-insulator (FDSOI)
process. We verify the behavioral model for RSAC which was developed in
Section 2.5. We apply the design methodology to optimize RSAC and finally
compare it against a digital implementation.
3.1 Design I
In this section, a RSAC-based DP kernel is used to implement an image filter.
We will investigate two filters, the average filter and the Gaussian blur (GB)
filter whose coefficients are shown in Fig. 2.8(a) with its select generator in
Fig. 2.8(b).
3.1.1 Simulation setup
A RSAC-based DP kernel (Fig. 2.1(b)) is used to implement an image filter.
We only have one processing stage and thus, J = 1. Circuit simulations in a
130 nm CMOS process at the nominal corner were performed for image sizes
30 × 30, 60 × 60, and 120 × 120. The MPCG operates at TClk = 400 ps.
Three average filters of lengths M = 9, 25, and 49 were implemented, which
correspond to a 3× 3, 5× 5 and 7× 7 window, respectively. These filters do
not require any combinational logic at the output of the ring counter since
the different phases already represent the coefficients. A 3× 3 Gaussian blur
filter with σ2 = 0.85 (Fig. 2.8(a)) has also been implemented. The D-flipflops
in the ring counter are implemented using true single-phase clocking (TSPC).
Select signals are generated using static-CMOS based NOR and NAND gates.
Select signals corresponding to coefficients of unit value do not need a logic
stage but are still passed through inverters to match the delay of other select
signals. The image is processed on a per-row basis, but the select
generators were duplicated, depending on the size of the image, to ensure
sharp rise and fall of the select signals. For the 30 × 30 image, H1 = 1 and
I11 = 30, while for the 60× 60 image, H1 = 2 and I11 = I12 = 30, and finally
for the 120× 120 image, H1 = 4 and I11 = I12 = I13 = I14 = 30.
Figure 3.1: Circuit simulation of a RSAC-based GB filter with TTR = 45
(K = 5TTR). Energy and MSE saturate at R ≥ 700 kΩ.
3.1.2 Choice of R and C
Step 1 in the design methodology is applied here by designing R and C
to satisfy conditions C1 and C2 (Section 2.2). To obtain a value for R, the
circuit was simulated at different values of R while keeping TTR = RC/T at
a fixed value by adjusting C accordingly. C1 was satisfied by choosing RC
large enough to dominate τmax. Figure 3.1 shows the plot of the MSE and
energy consumption of the circuit vs. R with TTR = 45. The same trend
was observed for different values of TTR. A total of 5TTR iterations were
performed. A reduction in MSE and energy consumption can be seen as R
increases until R ≈ 700 kΩ for the MSE and R ≈ 900 kΩ for energy. Thus
in all simulations, R was chosen to be 1 MΩ, and the value of C was set
to obtain a specific value for TTR. In practice, C can consist solely of the
intrinsic capacitance, as long as condition C1 in Section 2.2 is satisfied. If so,
we can use TClk as a variable to control TTR.
3.1.3 Validation of MSE model
Comparison of circuit simulations and Equation (2.5) for MSE vs. iterations
is shown in Fig. 3.2. Figure 3.2(a) shows the GB filter applied to three
different 30×30 images (I1, I2 and I3). I2 and I3 were chosen to have similar
statistics while being different from those of I1. Weighted least-squares fitting
was applied to obtain the parameters αA, βA and Jmin in Section 2.2. The
fitted parameters obtained for TTR = 45 (C = 32 fF) are tabulated in
Table 3.1. It can be seen that the parameters for I2 and I3 are similar in
value, as expected.

Figure 3.2: Validation of Equation (2.5) for: (a) GB filter with different
images and (b) GB filter with varying TTR.

Table 3.1: Fitted accuracy parameter values of a GB filter with TTR = 45
(C = 32 fF).
Image   αA (V^2)          βA (V^2)         Jmin (V^2)
I1      −3.07 × 10^−4     2.3 × 10^−2      1.4 × 10^−5
I2      −1.11 × 10^−3     9.83 × 10^−2     6.17 × 10^−5
I3      −1.44 × 10^−3     8.9 × 10^−2      5.96 × 10^−5

In Fig. 3.2(b), TTR was varied for I1 to see its effect on MSE. As ex-
pected, the parameters αA and βA in this model are a weak function of TTR
and depend largely on the input statistics. The asymptotic MSE value Jmin
decreases as TTR increases. However, the decrease in MSE is minimal because
the overall accuracy is dominated by the non-idealities described in
Section 2.2.
3.1.4 Validation of energy model
The energy consumption of this design can be written as:
\[ E_{RSAC}[n,X] = n \alpha_E + \beta_E (1 - X^n), \tag{3.1} \]
for:
\[ \alpha_E = C_d c_E + C d_E + \frac{C_{sel} V_{dd}^2}{30}, \qquad \beta_E = C \tilde{V}_z^2, \]
where cE, dE, and Csel are defined in Section 2.6. Comparison of circuit
simulations and Equation (3.1) for energy vs. iterations is shown in Fig. 3.3.
Figure 3.3(a) shows the GB filter applied to the same three images as above.
Similarly, we can see the dependence of the parameters on input statistics.
To see the dependence of βE on TTR, the same simulation was run on I1 with
varying TTR. The fitted energy parameter values αE and βE are tabulated
in Table 3.2. It can be seen that βE changes with varying TTR, while αE
remains constant, as predicted by our energy model.
Figure 3.3(b) shows the average filter of different window sizes applied to
I1, with TTR = 45. The energy parameters are tabulated in Table 3.3. It
can be seen that the parameters αE and βE remain relatively constant, as
the input statistics and I1h remain constant. Also, the ring counter energy
consumption is constant per row of computation for the different image sizes.
In our design, EMPCG ≫ ERSAC, and energy per single computation is
dominated by $E_{MPCG} / \sum_{h=1}^{H_1} I_{1h}$. As more computation kernels can share the same ring
counter, more energy benefits can be obtained (Fig. 3.4). Finally, the energy
consumption of the ring counter is proportional to M^2, as can be seen in
Table 3.3.

Figure 3.3: Validation of Equation (3.1) for: (a) GB filter with different
images and (b) average filter with different window sizes.

Figure 3.4: Energy per computation breakdown of the GB filter applied to
30 × 30, 60 × 60, and 120 × 120 images.

Table 3.2: Fitted energy parameter values of a GB filter with different
TTR.
TTR (C)        αE (J)            βE (J)            EMPCG (J)
45 (32 fF)     8.12 × 10^−15     2.26 × 10^−15     2.25 × 10^−12
90 (64 fF)     8.12 × 10^−15     4.07 × 10^−15     2.25 × 10^−12
135 (96 fF)    8.13 × 10^−15     5.82 × 10^−15     2.25 × 10^−12

Table 3.3: Fitted energy parameter values of an average filter with
TTR = 45.
Window (C)       αE (J)            βE (J)            EMPCG (J)
3 × 3 (18 fF)    5.97 × 10^−15     1.47 × 10^−15     7.29 × 10^−13
5 × 5 (50 fF)    1.98 × 10^−14     3.18 × 10^−15     5.45 × 10^−12
7 × 7 (98 fF)    2.41 × 10^−14     3.07 × 10^−15     1.06 × 10^−11

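The benefit of sharing the ring counter follows directly from the MPCG term in Equation (2.9). The Python sketch below evaluates that term for the three image sizes of Section 3.1.1, using the GB filter value of EMPCG from Table 3.2; treating this value as identical for all three image sizes is an assumption of the example, and the function name is illustrative.

```python
def mpcg_energy_per_kernel(e_mpcg, kernels_per_generator):
    """MPCG energy share per DP kernel: E_MPCG / sum over h of I_1h."""
    return e_mpcg / sum(kernels_per_generator)


E_MPCG = 2.25e-12   # J, GB filter value from Table 3.2
configs = {
    "30x30": [30],                 # H1 = 1
    "60x60": [30, 30],             # H1 = 2
    "120x120": [30, 30, 30, 30],   # H1 = 4
}
for image, kernels in configs.items():
    share = mpcg_energy_per_kernel(E_MPCG, kernels)
    print(f"{image}: {share * 1e15:.2f} fJ per kernel")
```

Doubling the number of kernels that share one MPCG halves its contribution to the energy per computation, which is the trend the text attributes to Fig. 3.4.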
Figure 3.5: SQNR vs. number of rounds of a GB filter for a 30× 30 image.
3.1.5 Design optimization
The remaining steps in the design methodology are applied in this section.
HSPICE simulations were performed on a GB filter for I1, by sweeping over
TTR at various iterations (Fig. 3.5). The number of iterations for a given
SQNR should be minimized since the linear component in ERSAC dominates
(αE ≫ βE). From the close-up view (Fig. 3.6) of 9 ≤ TTR ≤ 18, it can
be seen that the minimum number of rounds for a given SQNR occurs at
n = (5/9) TTR − 2. The optimal curve is obtained by joining these minimizing
points. SQNR improvements saturate around TTR = 45 because the overall
accuracy is dominated by the imperfection of the select-signal generation
block.
3.1.6 Comparison to digital implementation
The RSAC-based DP kernel is compared against a digital logic implementa-
tion using ripple carry adders (RCA) and Baugh-Wooley multipliers (BWM).
To estimate the energy consumption of the adders and multipliers, the en-
ergy for a 1-bit full adder (EFA) using a mirror-adder structure loaded with
FO4 inverters was simulated. In a 130 nm process, EFA = 18.63 fJ. Energy
consumption of a B-bit RCA is then estimated to be:
\[ E_{RCA}[B] = \alpha_{0 \to 1} B E_{FA}, \tag{3.2} \]
where α0→1 is the activity factor of the RCA. We assume the inputs are
uniformly distributed and thus α0→1 is 0.25. The energy consumption of a
B-bit BWM is lower-bounded [13] by:
\[ E_{BW}[B] \geq E_{FA} (B^2 - 2B + 2). \]

Figure 3.6: Close-up view of SQNR vs. number of rounds of a GB filter for
a 30 × 30 image for 9 ≤ TTR ≤ 18.

Figure 3.7: Comparison of energy per DP computation vs. SQNR for a GB
filter. The plot shows energy (J) against SQNR (dB) for the curves labeled
Digital (5 to 8 bits) and SAC, with annotated savings of 32× and 19×.

Figure 3.8: Filtered images: (a) original image, (b) noisy image, (c) ideal
filtered image, and (d) RSAC filtered image with TTR = 45 and
n = (5/9) TTR − 2.

Table 3.4: Fitted source follower parameter values for the N-RSAC and
P-RSAC implementations.
Implementation   VBIAS     G       Vt         E{e^2_SF}
N-RSAC           0.24 V    0.868   0.24 V     6.91 × 10^−7
P-RSAC           0.6 V     0.818   −0.49 V    6.93 × 10^−6

The SQNR vs. energy per DP computation is shown in Fig. 3.7 for a 120 ×
120 image. For SQNR ≈ 24 dB, energy savings are approximately 32×;
whereas for SQNR ≈ 30 dB, the energy savings are approximately 19×.
These savings are conservative, as EBW was based on a lower bound. The
filtered output of our RSAC-based image filter is shown in Fig. 3.8(d).
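The digital reference energies of Section 3.1.6 can be evaluated with a few lines of Python. The sketch below only implements Equation (3.2) and the Baugh-Wooley lower bound with the simulated EFA; the function names are illustrative, and how these per-operator energies are combined into a full DP is not stated in the text, so it is not attempted here.

```python
E_FA = 18.63e-15   # J, simulated 1-bit full adder energy in 130 nm


def rca_energy(bits, e_fa=E_FA, activity=0.25):
    """Equation (3.2): energy of a B-bit ripple carry adder."""
    return activity * bits * e_fa


def bwm_energy_lower_bound(bits, e_fa=E_FA):
    """Lower bound on the energy of a B-bit Baugh-Wooley multiplier."""
    return e_fa * (bits ** 2 - 2 * bits + 2)


for bits in (5, 6, 7, 8):
    print(
        f"B = {bits}: RCA {rca_energy(bits) * 1e15:.2f} fJ, "
        f"BWM >= {bwm_energy_lower_bound(bits) * 1e15:.2f} fJ"
    )
```

The multiplier bound grows quadratically with B, which is why the digital curves in Fig. 3.7 rise quickly with precision.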
3.2 Design II
In this section, a RSAC-based DP kernel is used to implement a DP between
two arbitrary vectors of length 8. We will first discuss the simulation setup,
followed by the design optimization for RSAC and finally compare it to a
digital design for the same sampling period, TS.
3.2.1 Simulation setup
Circuit simulations in 28 nm fully-depleted silicon on insulator (FDSOI) at
the nominal corner were performed. The select generator in Fig. 2.10 was
synthesized with B = 6, N = 8 and mapped into an HSPICE netlist to
simulate with the computation kernel. A precision of 6 bits was chosen in the
select generator to make sure that this block will not be the limiting factor
in the accuracy of RSAC. The RSAC-based DP kernel was implemented
using the circuits in Fig. 2.4(b) and (c). VBIAS was swept in steps of 0.01 V
from 0.5 V to 0.2 V for N-RSAC and from 0.5 V to 0.7 V for P-RSAC. Weighted
least-squares fitting was then applied to obtain the parameters G and Vt.
The value of VBIAS and the fitted parameters obtained are tabulated in
Table 3.4. eSF is the error due to the weighted least-squares fitting and
E{e^2_SF} is the MSE. We chose VBIAS to achieve an operating point that
minimizes energy consumption while keeping the MSE at the output low
enough to not dominate the overall accuracy. The allowable input range is
[0.35, 0.80] V for N-RSAC and [0.00, 0.40] V for P-RSAC, which leads to an
output range of [0.06, 0.40] V and [0.49, 0.82] V, respectively. The select
generator used here allows charge sharing and hence we will fix the value of
C and R, and use TClk as a knob to control T and in turn, TTR. Finally, the
inputs to the MUX and the coefficients of the select generator were both
generated at random using a uniform distribution.

Figure 3.9: Circuit simulation of a RSAC-based DP kernel of length 8 with
TTR = 10 (K = 5TTR). MSE saturates at R ≥ 200 kΩ.

Figure 3.10: Circuit simulation of a RSAC-based DP kernel of length 8
with TTR = 10 (K = 5TTR). MSE saturates at C ≥ 200 fF.

3.2.2 Choice of R and C
Step 1 in the design methodology is applied here by designing R and C
to satisfy conditions C1 and C2 (Section 2.2). Unlike Design I, we need to
obtain acceptable values of both R and C. Indeed, a large enough value of
C will limit the charge sharing to obtain acceptable accuracy at the output.
Figures 3.9 and 3.10 show the MSE vs. R and C, respectively. In Figure 3.9,
C is fixed to 400 fF and TClk is changed to keep TTR = 10. The total number
of iterations is 5TTR. It can be seen that the MSE saturates at R ≥ 200 kΩ.
The same trend was observed for different values of TTR. As R increases,
the area increases and the clock period TClk will need to increase accordingly
in order to keep TTR constant. In turn, TS will also increase and so will the
energy consumption. R = 200 kΩ was chosen as a good balance between the
improvement in MSE and the increase in the sampling period. In Figure 3.10,
R is fixed to 1000 kΩ and TClk is changed to keep TTR = 10. The total
number of iterations is 5TTR. It can be seen that the MSE saturates at
C ≥ 200 fF. The same trend was observed for different values of TTR. As C
increases, the area increases and the clock period TClk will need to increase
accordingly in order to keep TTR constant. In turn, TS will also increase and
so will the energy consumption. C = 200 fF was chosen as a good balance
between the improvement in MSE and the increase in the sampling period.

Figure 3.11: The probability mass function for the error between the circuit
simulation output and behavioral model output with η set to 0.

3.2.3 Validation of behavioral model
The behavioral model described in Section 2.5 is verified in this section. We
review the model again:
Vout[K + 1] = GV(K)N − Vt
+(1− )(Vout[K]−GV(K)N ) exp
(
−(pˆ(K)N + ∆)N
TTR
)
+ η,
where  , Cd
Cd+C
is the factor due to charge sharing, pˆi is the quantized version
of the weight, ∆ is the adjustment to the weight due to rise and fall times
and PVT variations in the threshold voltages, η is the term due to thermal
noise and the various other phenomena not captured in the analysis.  is
obtained through fitting as 0.002. ∆ is negligible in this design due to the
high precision in the select generator (B = 6). The parameter η is obtained
as the difference between circuit simulation output and the behavioral model
output with η set to 0. The values of the inputs were swept from 0.35 V to
0.80 V for N-RSAC and 0.00 V to 0.40 V for P-RSAC in steps of 0.01 V. The
39
Figure 3.12: Validating Equation (3.3) for a one-stage RSAC-based system.
The average normalized error is 8.27%.
Figure 3.13: Validating Equation (3.3) for a two-stage RSAC-based system.
The average normalized error is 12.89% for the first stage and 13.17% for
the second stage.
40
Figure 3.14: Validating Equation (3.3) for a three-stage RSAC-based
system. The average normalized error is 11.23% for the first stage, 14.54%
for the second stage, and 21.17% for the third stage.
value of T was swept from 0.01τ to τ in steps of 0.01τ . The initial value of
the output was also swept from 0.00 V to 0.45 V for N-RSAC and 0.49 V to
0.82 V for P-RSAC in steps of 0.01 V. The probability mass function of η is
shown in Fig. 3.11. η has a normalized variance of 4.23× 10−6, which makes
its effect on the behavioral model minimal and will be ignored in the next
parts. We now verify the behavioral model in emulating the RSAC-based
DP. For a RSAC-based DP of length 8, we perform circuit simulations on
200 uniformly distributed inputs and we compare the output to the output
obtained from using the behavioral model. Figure 3.12 compares the output
from circuit simulations and the output from the behavioral model. The
average normalized error is 8.27%. Figure 3.13 shows the output from circuit
simulations and output from the behavioral model for a two-stage RSAC-
based DP of length 2. The total number of samples is 200 and the inputs are
uniformly distributed. The average normalized error is 12.89% for the first
stage and 13.17% for the second stage. Figure 3.14 shows the output from
circuit simulations and output from the behavioral model for a three-stage
RSAC-based DP of length 2. The total number of samples is 200 and the
inputs are uniformly distributed. The average normalized error is 11.23% for
41
Table 3.5: Actual and estimated energy consumed by the select generators
of different lengths and precision.
Actual Esel,prog(fJ) Estimated Esel,prog(fJ)
N = 8,B = 6 9.75 10.5
N = 8,B = 7 16.7 18
N = 8,B = 8 30.5 32.8
N = 25,B = 7 23.8 25
N = 150,B = 7 205 211
the first stage, 14.54% for the second stage, and 21.17% for the third stage.
3.2.4 Energy model
In this section, we verify the expression in Equation (2.10), which governs
the energy consumed by the select generator in this design. This expression
is parametrized by the length N and the coefficient precision B, which will
help in estimating the energy consumption of RSAC-based systems. A Vdd
of 0.675 V is used in the select generator, which has minimal effect on the
accuracy of RSAC compared to using 0.9 V as the supply voltage. Using circuit
simulations at Vdd = 0.675 V, we obtain:
\[
\begin{aligned}
E_{NOR,2} &= 1.12 \times 10^{-16}\ \mathrm{J} \\
E_{MUX,2} &= 1.12 \times 10^{-16}\ \mathrm{J} \\
E_{MUX,4} &= 2 \times 10^{-16}\ \mathrm{J} \\
E_{MUX,8} &= 3.75 \times 10^{-16}\ \mathrm{J} \\
E_{MUX,16} &= 7.08 \times 10^{-16}\ \mathrm{J}.
\end{aligned}
\]
We use linear fitting to obtain EMUX,N for any N:
\[ E_{MUX,N} = (4.25 N + 3) \times 10^{-17}\ \mathrm{J}. \]
We compare the energy consumption of the select generators of different
lengths and precision with the estimated energy consumption in Table 3.5.
The values show an average overestimate of 6.2%. The energy consumption
in RSAC is dominated by the select generator (Esel ≫ Esig). This model is
used only in the next chapter to estimate the energy savings of a system
of RSAC-based DPs.

Figure 3.15: SQNR vs. number of iterations for a RSAC-based DP kernel.
The plot shows SQNR (dB) against the number of iterations for TTR = 2.5,
5, 7.5, 10, 12.5, 15, and 17.5, together with the optimal curve.

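The select generator estimate of Equation (2.10) can be reproduced with the gate energies reported in Section 3.2.4. The Python sketch below is a minimal illustration; using the simulated multiplexer energy where one is listed and the linear fit otherwise is an assumption of this example, and the function names are illustrative. Only the first four rows of Table 3.5 are checked here.

```python
E_NOR2 = 1.12e-16   # J, two-input NOR gate at Vdd = 0.675 V
SIMULATED_MUX = {2: 1.12e-16, 4: 2e-16, 8: 3.75e-16, 16: 7.08e-16}   # J


def e_mux(n):
    """N-input multiplexer energy: simulated value if listed, else the fit."""
    return SIMULATED_MUX.get(n, (4.25 * n + 3) * 1e-17)


def e_sel_prog(n, b, rounds=1):
    """Equation (2.10): programmable select generator energy over n rounds."""
    return rounds * ((n + 2) * E_NOR2 + 2 ** b * e_mux(2) + b * e_mux(n))


for n, b in [(8, 6), (8, 7), (8, 8), (25, 7)]:
    print(f"N = {n}, B = {b}: {e_sel_prog(n, b) * 1e15:.2f} fJ")
```

The 2^B term dominates for the small kernel lengths considered, so each extra bit of precision roughly doubles the select generator energy.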
3.2.5 Design optimization
Similarly to Section 3.1.5, the optimal curve is obtained by joining the min-
imizing points, shown in Fig. 3.15. SQNR improvements saturate around
TTR = 15 due to the overall accuracy being dominated by the other noise
factors described in Section 2.2.
3.2.6 Comparison to digital implementation
In this section, we will compare the RSAC-based DP against the digital
implementation, for the same sampling period TS. Thus, we need a more ad-
vanced model for digital, compared to the one in Section 3.1.6, that estimates
both the delay and energy consumption. We use here an energy model and a
delay model for the digital approach that is applicable for both superthresh-
old and subthreshold operations [14]. The dominant energy sources of a
processing element are dynamic energy and leakage energy, expressed as:
\[ E_{tot} = E_{dyn} + E_{lkg}, \qquad E_{dyn} = \alpha M C_{dyn} V_{dd}^2, \qquad E_{lkg} = \frac{M I_{OFF} V_{dd}}{f}, \]
where α is the switching activity factor, M is the number of processing nodes
each with capacitance Cdyn, f is the operating frequency, Vdd is the supply
voltage, and IOFF is the leakage current. The superthreshold current as a
function of gate-to-source voltage is given by:
\[ I_{SUP}(V_{gs}) = \mu_n C_{ox} \frac{W}{L} (V_{gs} - V_t)^2, \tag{3.3} \]
where µn is the electron mobility, Cox is oxide capacitance, and Vt is the
threshold voltage. The subthreshold current [15] as a function of gate-to-
source and drain-to-source voltage is given by:
\[ I_{SUB}(V_{gs}, V_{ds}) = I_0 \, 10^{\frac{V_{gs} - V_t - \gamma V_{ds}}{S V_T}} \left( 1 - e^{-\frac{V_{ds}}{V_T}} \right), \tag{3.4} \]
where I0 is the reference current and is proportional to the transistor W/L
ratio, S is the swing factor, γ is the DIBL coefficient, and VT is the thermal
voltage. Using Equations (3.3) and (3.4), the leakage current for an NMOS
transistor is IOFF = ISUB(0, Vdd), while the switching current is:
\[ I_{ON} = \begin{cases} I_{SUB}(V_{dd}, V_{dd}), & \text{if } V_{dd} < V_t, \\ I_{SUP}(V_{dd}), & \text{otherwise.} \end{cases} \tag{3.5} \]

Figure 3.16: The superthreshold current vs. Vdd with Vgs = Vds = Vdd
(fitted model and SPICE).

Figure 3.17: The subthreshold current vs. Vdd with Vgs = Vds = Vdd
(fitted model and SPICE).

Figure 3.18: The subthreshold current vs. Vdd with Vgs = 0
(fitted model and SPICE).

Figures 3.16, 3.17, and 3.18 compare the fitted model with the
output from HSPICE. We fit the parameters using least-squares fitting to
obtain:
\[
\begin{aligned}
V_t &= 0.35\ \mathrm{V} \\
\gamma &= 0.16\ \mathrm{V} \\
S &= 2.2 \\
I_0 &= 56\ \mathrm{nA} \\
\mu_n C_{ox} \frac{W}{L} &= 1.16\ \mathrm{mA}.
\end{aligned}
\]
Assuming the critical path of the processing element consists of L processing
nodes each with capacitance Cdyn, the operating frequency f is given
by:
\[ f = \frac{1}{T_S} = \frac{I_{ON}}{\beta L C_{dyn} V_{dd}}, \tag{3.6} \]
where β is a fitting parameter needed due to finite rise and fall times. For
the length 8 DP at hand, we have:
\[ M = N \left( B(B+1) + B^2 C_{and} \right) + \sum_{i=1}^{\log_2 N} \frac{N}{2^i} (2B + i - 1), \]
\[ L = B T_{sum} + B T_{cout} + T_{and} + T_{sum} + \sum_{i=1}^{\log_2 N} (2B + i - 2) T_{cout}, \]
where Cand is the ratio between the capacitance of the full-adder block and
that of the AND gate, which was found to be 0.25 from simulations. Tand is
the ratio between the switching period of the full-adder and that of the
AND gate, which was also found to be 0.25 from simulations. Similarly, Tsum
is obtained as 0.95 (Tcout = 1 is the reference switching time). Finally, Cdyn
was least-squares fitted and found to be 8.58 fF. Figures 3.19 and 3.20 show
the energy consumption vs. Vdd and energy consumption vs. frequency,
respectively. We use these plots to compare with our RSAC-based DP of
length 8. The results are shown in Fig. 3.21 and 3.22. For SQNR ≈ 8 dB,
14 dB, and 20 dB, the energy savings are approximately 15.7×, 4×, and 2.1×.

Figure 3.19: Energy consumption vs. supply voltage.

Figure 3.20: Energy consumption vs. frequency of operation. The plot shows
energy (fJ) against frequency (Hz) for the dynamic, leakage, and total
energy of the model, with SPICE results for the dynamic and total energy.

Figure 3.21: Energy consumption vs. frequency of operation.

Figure 3.22: SQNR vs. frequency of operation.

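The node count M and critical path length L of the digital reference are simple to evaluate. The Python sketch below computes both expressions from Section 3.2.6 for the length 8, 6-bit case; the function names are illustrative and N is assumed to be a power of two so that log2 N is an integer.

```python
def node_count(n, b, c_and=0.25):
    """M: processing nodes of a length-n, b-bit digital DP."""
    stages = n.bit_length() - 1              # log2(n) for a power of two
    multipliers = n * (b * (b + 1) + b ** 2 * c_and)
    adder_tree = sum((n // 2 ** i) * (2 * b + i - 1) for i in range(1, stages + 1))
    return multipliers + adder_tree


def critical_path(n, b, t_sum=0.95, t_cout=1.0, t_and=0.25):
    """L: critical path length in units of the carry-out switching time."""
    stages = n.bit_length() - 1
    multiplier = b * t_sum + b * t_cout + t_and + t_sum
    adder_tree = sum((2 * b + i - 2) * t_cout for i in range(1, stages + 1))
    return multiplier + adder_tree


print(node_count(8, 6), critical_path(8, 6))
```

For N = 8 and B = 6 this gives M = 496 nodes and L ≈ 48.9 carry-out delays, which are the quantities that enter the energy expressions and Equation (3.6).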
In Chapter 4, we will discuss the design of an emotion recognition system
that uses only RSAC-based DPs.
CHAPTER 4
RSAC-BASED EMOTION RECOGNITION
SYSTEM
In this chapter, we discuss how to implement an emotion recognition system
using three cascaded RSAC-based DPs, shown in Figure 4.1. The input to
this system is a frontal face and the objective is to classify this face between
two emotions, Emotion0 and Emotion1, each of which can be any one of eight
emotions: neutral, anger, contempt, disgust, fear, joy, sadness, surprise. The
system in Figure 4.1 uses Gabor filters to extract features from the frontal
face, picks the most important features using adaptive boosting (Adaboost),
and classifies those features using a linear support vector machine (SVM).
This system has three cascaded DP stages, J = 3, with the algorithmic work
largely based on [16, 17]. The first and third stages use N-RSAC, while the
second stage uses P-RSAC. This alternation between N-RSAC and P-RSAC
is needed to make sure that the output of the current stage is a valid input
for the next stage. Computation in this system is done in stages:
• Compute in stage 1, select generators in stages 2 and 3 are clock-gated.
• Compute in stage 2, select generators in stages 1 and 3 are clock-gated.
• Compute in stage 3, select generators in stages 1 and 2 are clock-gated.
When the select generators in a stage are clock-gated, the output of each
computation kernel is stored on its capacitor, which is typically large enough
to hold the value during the sampling period TS. This will be analyzed in
Section 4.4.
The face inputs are extracted from the extended Cohn-Kanade dataset [18]
using the MATLAB face detection algorithm, based on [19]. The faces are re-
scaled to 48× 48 pixels, with 8-bit precision for the grayscale values. Videos
were recorded in analog S-video using a camera located directly in front of the
subject. Subjects began each display with a neutral face and then performed
a display which has been described and modeled by the experimenter. For
our study, we selected 327 sequences, from 113 subjects, that were labeled as
one of the seven basic emotions. The first frame of every sequence was also
collected as a neutral face, totaling 654 faces. For the system at hand, of the
654 faces collected, only the ones exhibiting one of Emotion0 or Emotion1
will be used. We will use a leave-one-subject-out cross validation method due
to the limited number of faces available. For a total of x subjects per system,
training will be conducted on x−1 subjects and testing on the remaining one
subject. This procedure will be repeated x times to obtain the probability
of error, Pe. There are $\binom{8}{2} = 28$ possible systems and we obtain the average
probability of error, Pe,avg, by averaging over all 28 Pe values.

Figure 4.1: An emotion recognition system that classifies the face between
emotion 0 and emotion 1. The block diagram takes face images through a
Gabor filter bank (5 spatial frequencies, 8 orientations) for feature
extraction and a linear SVM for classification, with analog signals between
the DP stages and a 0/1 output that selects Emotion0 or Emotion1. The eight
emotions listed are neutral, anger, contempt, disgust, fear, joy, sadness,
and surprise.

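The evaluation protocol can be sketched in Python as follows. This is a minimal illustration only: the sample layout, the callables `train_fn` and `predict_fn`, and the pooling of errors over all folds to obtain Pe are assumptions of this example, since the text does not spell out those details.

```python
from itertools import combinations

EMOTIONS = ["neutral", "anger", "contempt", "disgust",
            "fear", "joy", "sadness", "surprise"]


def emotion_pairs():
    """All (Emotion0, Emotion1) systems: 8 choose 2 = 28 pairs."""
    return list(combinations(EMOTIONS, 2))


def leave_one_subject_out(samples):
    """Yield (train, test) splits; samples are (subject, features, label)."""
    for held_out in sorted({s[0] for s in samples}):
        train = [s for s in samples if s[0] != held_out]
        test = [s for s in samples if s[0] == held_out]
        yield train, test


def probability_of_error(samples, train_fn, predict_fn):
    """Pooled error rate over all held-out subjects (assumed definition)."""
    errors = total = 0
    for train, test in leave_one_subject_out(samples):
        model = train_fn(train)
        for _, features, label in test:
            errors += int(predict_fn(model, features) != label)
            total += 1
    return errors / total


print(len(emotion_pairs()))
```

Averaging `probability_of_error` over the 28 pairs returned by `emotion_pairs` would give the Pe,avg figure described above, with the classifier itself supplied through the two callables.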
We note that RSAC has two potential drawbacks that need to be addressed
before designing a system of cascaded RSAC-based DPs. First, RSAC, in its
current implementation, can only handle positive numbers. Thus, we will
devise a strategy used throughout this system to resolve this.
Assume we have two vectors: A, consisting of only positive numbers, and
B, consisting of positive and negative numbers. Let I+ and I− be the
index vectors that specify the indices of the positive numbers and negative
numbers in B, respectively. The vector of elements from B specified by the
index vector I is denoted by B(I). With this notation, B(I+) is the vector
consisting of the positive elements of B. The DP, α, of A and B can be
re-written as:
\[
\begin{aligned}
\alpha &\triangleq \mathbf{A} \cdot \mathbf{B} \\
&= \mathbf{A}(I^+) \cdot \mathbf{B}(I^+) + \mathbf{A}(I^-) \cdot \mathbf{B}(I^-) \\
&= \mathbf{A}(I^+) \cdot \mathbf{B}(I^+) - \mathbf{A}(I^-) \cdot \left(-\mathbf{B}(I^-)\right) \\
&\triangleq \alpha^+ - \alpha^-,
\end{aligned}
\]
where α+ and α− are defined as A(I+) · B(I+) and A(I−) · (−B(I−)), respectively.
We note that both α+ and α− consist of computing DPs of positive
vectors, which is the target of this strategy. In this system, whenever
applicable, we will compute α+ and α− and defer the subtraction
until the end, which is possible because DP operations are linear. We denote by
$\boldsymbol{\alpha}^+$ the vector consisting of the α+ values and by
$\boldsymbol{\alpha}^-$ the vector consisting of the α− values. The same strategy
can be applied to $\boldsymbol{\alpha}^+$ and $\boldsymbol{\alpha}^-$ in the next
stage, separately. In our system,
we start off with a positive vector consisting of the image pixels. At the
end of the third stage, we obtain eight outputs, of which four consist of re-
sults due to positive combinations and four consist of results due to negative
combinations. We will see in Section 4.3 that the final decision is made by
comparing the theoretical output to 0; and thus, the final decision in this
system can be made by comparing the sum of the first four outputs to the
sum of the second four outputs. This final decision is assumed to be done
offline in our system and uses floating-point operations.
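The sign-splitting strategy can be checked numerically with a short sketch. The function name and the random test vectors are illustrative assumptions; only the decomposition α = α+ − α− comes from the derivation above.

```python
import numpy as np

def split_dot_product(a, b):
    """Return (alpha_plus, alpha_minus) such that a . b = alpha_plus - alpha_minus.

    `a` is assumed non-negative; `b` may contain both signs.
    """
    pos = b > 0                                # index set I+
    neg = b < 0                                # index set I-
    alpha_plus = np.dot(a[pos], b[pos])        # DP of two non-negative vectors
    alpha_minus = np.dot(a[neg], -b[neg])      # DP of two non-negative vectors
    return alpha_plus, alpha_minus

rng = np.random.default_rng(0)
a = rng.uniform(0.0, 1.0, size=49)     # e.g. non-negative pixel values
b = rng.uniform(-1.0, 1.0, size=49)    # e.g. signed filter coefficients
alpha_plus, alpha_minus = split_dot_product(a, b)
```

Both partial results are non-negative, so each can be produced by a kernel that handles only positive numbers, and the subtraction `alpha_plus - alpha_minus` can be deferred.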
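The second drawback to using RSAC in a cascaded system is that instead
of computing $\alpha_i \triangleq \mathbf{A} \cdot \mathbf{B}_i$, we obtain
$\hat{\alpha}_i = \frac{G_n}{\|\mathbf{B}_i\|_1}\alpha_i - V_{tn}$, where $G_n$ is the
attenuation factor of N-RSAC and $\|\mathbf{B}_i\|_1$ is the $\ell_1$ norm of $\mathbf{B}_i$.
This is a linear transformation whose gain is inversely proportional to
$\|\mathbf{B}_i\|_1$, and thus is not constant across the outputs. This will cause a
problem in the next stage, where we want to compute
$\beta \triangleq \boldsymbol{\alpha} \cdot \mathbf{C} = \sum_{i=0}^{N} \alpha_i C_i$. Instead, we obtain:
\[
\begin{aligned}
\hat{\beta} &\triangleq \frac{G_p}{\|\mathbf{C}\|_1}\, \hat{\boldsymbol{\alpha}} \cdot \mathbf{C} - V_{tp} \\
&= \frac{G_p}{\|\mathbf{C}\|_1} \sum_{i=0}^{N} \hat{\alpha}_i C_i - V_{tp} \\
&= \frac{G_n G_p}{\|\mathbf{C}\|_1} \sum_{i=0}^{N} \frac{\alpha_i C_i}{\|\mathbf{B}_i\|_1} - V_{tn} G_p - V_{tp},
\end{aligned}
\]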
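where $G_p$ is the attenuation factor of P-RSAC. We can clearly see that the
original DP β cannot be retrieved from the computed DP, because every term
of the sum is scaled by a different factor $1/\|\mathbf{B}_i\|_1$. We solve this
problem by substituting the vector $\mathbf{C}$ with a vector $\hat{\mathbf{C}}$, whose elements are
defined as $\hat{C}_i = C_i \|\mathbf{B}_i\|_1$. With this, we obtain:
\[
\begin{aligned}
\hat{\beta} &\triangleq \frac{G_p}{\|\hat{\mathbf{C}}\|_1}\, \hat{\boldsymbol{\alpha}} \cdot \hat{\mathbf{C}} - V_{tp} \\
&= \frac{G_p}{\|\hat{\mathbf{C}}\|_1} \sum_{i=0}^{N} \hat{\alpha}_i \hat{C}_i - V_{tp} \\
&= \frac{G_n G_p}{\|\hat{\mathbf{C}}\|_1} \sum_{i=0}^{N} \frac{\alpha_i C_i \|\mathbf{B}_i\|_1}{\|\mathbf{B}_i\|_1} - V_{tn} G_p - V_{tp} \\
&= \frac{G_n G_p}{\|\hat{\mathbf{C}}\|_1} \sum_{i=0}^{N} \alpha_i C_i - V_{tn} G_p - V_{tp} \\
&= \frac{G_n G_p}{\|\hat{\mathbf{C}}\|_1}\, \beta - V_{tn} G_p - V_{tp}.
\end{aligned}
\]
We now discuss the different algorithms used in this system.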
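The compensation can be verified numerically with a small sketch. The helper `rsac_dp` encodes only the input-output relation stated above (gain divided by the ℓ1 norm of the weights, minus an offset); the numerical values of the gains and offsets, the vector sizes, and the restriction to positive vectors are illustrative assumptions.

```python
import numpy as np

def rsac_dp(x, w, gain, v_t):
    """Assumed behavioral relation: (gain / ||w||_1) * (x . w) - v_t."""
    return gain / np.sum(np.abs(w)) * np.dot(x, w) - v_t

rng = np.random.default_rng(1)
G_n, G_p, V_tn, V_tp = 0.8, 0.7, 0.05, 0.03   # illustrative values only
A = rng.uniform(0.0, 1.0, size=16)
B = rng.uniform(0.0, 1.0, size=(5, 16))       # row i holds B_i
C = rng.uniform(0.0, 1.0, size=5)

alpha = B @ A                                  # ideal first-stage DPs
beta = alpha @ C                               # ideal second-stage DP
alpha_hat = np.array([rsac_dp(A, B_i, G_n, V_tn) for B_i in B])

# Compensated weights: C_hat_i = C_i * ||B_i||_1
C_hat = C * np.sum(np.abs(B), axis=1)
beta_hat = rsac_dp(alpha_hat, C_hat, G_p, V_tp)

# Closed form from the derivation: an affine function of the ideal beta
expected = G_n * G_p / np.sum(np.abs(C_hat)) * beta - V_tn * G_p - V_tp
recovered = (beta_hat + V_tn * G_p + V_tp) * np.sum(np.abs(C_hat)) / (G_n * G_p)
```

With the compensated weights, `beta_hat` matches `expected`, and inverting the affine map gives back `beta`.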
We now discuss the different algorithms used in this system.
4.1 Feature extraction using Gabor filtering
The feature extraction method used in this work is Gabor filtering. A Gabor
filter [20] is a 2-D filter consisting of a complex sinusoid, known as the carrier,
and a 2-D Gaussian-shaped function, known as the envelope. The Gabor filter
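is defined as:
\[
G(x, y) = \exp\left(-\frac{\hat{x}^2}{2S_x} - \frac{\hat{y}^2}{2S_y}\right) \exp\left(j 2\pi f \hat{x}\right),
\]
where $S_x$ and $S_y$ are the variances along the x and y-axes, respectively, f is
the spatial frequency of the sinusoid, and $\hat{x}$ and $\hat{y}$ are defined as:
\[
\begin{aligned}
\hat{x} &= x\cos(\theta) + y\sin(\theta) \\
\hat{y} &= y\cos(\theta) - x\sin(\theta),
\end{aligned}
\]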
where θ is the orientation of the Gabor filter. Typically, the dimension of the
filter is chosen as (2S + 1) × (2S + 1), where S , Sx = Sy. In this system,
we chose S = 3 to obtain a 7× 7 complex filter, of which 25 coefficients are
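positive and 24 are negative. To obtain useful features from the images, a
bank of Gabor filters at eight orientations and five spatial frequencies [21]
was used:
\[
\begin{aligned}
f &= \{4,\ 4\sqrt{2},\ 8,\ 8\sqrt{2},\ 16\} \\
\theta &= \left\{0,\ \frac{\pi}{8},\ \frac{2\pi}{8},\ \frac{3\pi}{8},\ \frac{4\pi}{8},\ \frac{5\pi}{8},\ \frac{6\pi}{8},\ \frac{7\pi}{8}\right\}.
\end{aligned}
\]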
Typically, the frontal face is convolved with the filter and every complex
sample is merged into one real sample by taking its magnitude. However, in
this system, we treat every complex sample as two distinct samples in order
to avoid computing the magnitude, which is typically computationally heavy.
The total number of features obtained from this step is 48×48×5×8×2 =
184,320. This number is very large and needs to be reduced in order to have
a feasible implementation of this system.
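A minimal sketch of the filter bank and of the feature count is given below. It follows the filter definition above with S = Sx = Sy = 3. One point is an assumption and is not stated in the text: the listed values of f are treated as wavelengths in pixels, so the carrier uses 1/f cycles per pixel. The names `gabor_filter` and `bank` are illustrative.

```python
import numpy as np

def gabor_filter(cycles_per_pixel, theta, S=3):
    """Complex (2S+1) x (2S+1) Gabor filter with S = Sx = Sy."""
    y, x = np.mgrid[-S:S + 1, -S:S + 1]
    x_hat = x * np.cos(theta) + y * np.sin(theta)
    y_hat = y * np.cos(theta) - x * np.sin(theta)
    envelope = np.exp(-x_hat**2 / (2 * S) - y_hat**2 / (2 * S))
    carrier = np.exp(1j * 2 * np.pi * cycles_per_pixel * x_hat)
    return envelope * carrier

# Assumption: the listed f values are wavelengths in pixels.
wavelengths = [4, 4 * np.sqrt(2), 8, 8 * np.sqrt(2), 16]
orientations = [k * np.pi / 8 for k in range(8)]

bank = [gabor_filter(1.0 / lam, th) for lam in wavelengths for th in orientations]

# Real and imaginary parts are kept as two distinct samples per pixel.
n_features = 48 * 48 * len(bank) * 2
```

Keeping the real and imaginary parts separate avoids the magnitude computation, at the cost of doubling the number of candidate features.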
4.2 Feature selection using Adaboost
As previously mentioned, a total number of 184,320 features is very large,
and thus, we need a feature selection mechanism. In this thesis, we use
Adaboost to perform feature selection. Other techniques, such as principal
component analysis (PCA), were compared to Adaboost in [22] and the latter
was found to be the best option for this system. The feature selection is done
offline and only the relevant features are computed in the feature extraction
stage. In feature selection by Adaboost, every feature from the Gabor filters
is treated as a weak classifier. Adaboost selects the best of those classifiers
based on a threshold scheme, and then boosts the weights on the errors
exhibited by using the selected feature. This procedure is repeated to obtain
the best 150 features for every emotion by classifying it against all others.
Thus, for every system, we only use 300 features from the Gabor filter and
those will be classified using a linear support vector machine, discussed in
Section 4.3. Depending on the two emotions chosen, we will have a different
set of features and thus, different H1 and different I1h values. However, we
know that $\sum_{h=1}^{H_1} I_{1h} = 600$, which is the total number of DPs and the total
number of outputs, of which half consists of the positive samples and the
other half consists of the negative samples. The length of these DPs is 49/2
on average. The implementation using the digital approach has 300 DPs in
total, each of length N = 49.
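The selection loop can be sketched as below. This is a generic discrete AdaBoost with single-threshold weak classifiers, written only to illustrate the idea; the exact threshold scheme, the weight update, and the rule that a feature is selected at most once are assumptions, and the function names are hypothetical.

```python
import numpy as np

def best_stump(feature, labels, weights):
    """Return (weighted error, threshold, polarity) of the best threshold test."""
    best = (np.inf, 0.0, 1)
    for thr in np.unique(feature):
        for polarity in (1, -1):
            pred = np.where(polarity * (feature - thr) >= 0, 1, -1)
            err = np.sum(weights[pred != labels])
            if err < best[0]:
                best = (err, thr, polarity)
    return best

def adaboost_select(X, labels, n_select):
    """Pick n_select distinct columns of X; labels must be in {-1, +1}."""
    n_samples, n_features = X.shape
    weights = np.full(n_samples, 1.0 / n_samples)
    selected = []
    for _ in range(n_select):
        candidates = [j for j in range(n_features) if j not in selected]
        stumps = {j: best_stump(X[:, j], labels, weights) for j in candidates}
        j_best = min(stumps, key=lambda j: stumps[j][0])
        err, thr, polarity = stumps[j_best]
        err = np.clip(err, 1e-10, 1 - 1e-10)       # avoid log(0)
        stump_weight = 0.5 * np.log((1 - err) / err)
        pred = np.where(polarity * (X[:, j_best] - thr) >= 0, 1, -1)
        # Boost the weights of the samples the chosen feature got wrong.
        weights = weights * np.exp(-stump_weight * labels * pred)
        weights = weights / weights.sum()
        selected.append(j_best)
    return selected

# Toy data: column 3 alone determines the label.
rng = np.random.default_rng(3)
X_toy = rng.normal(size=(40, 6))
y_toy = np.where(X_toy[:, 3] > 0, 1, -1)
chosen = adaboost_select(X_toy, y_toy, n_select=3)
```

In the system described here, the labels would mark one emotion against all others, and the procedure would be run once per emotion to obtain its 150 features.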
4.3 Support vector machine (SVM)
We use a linear SVM in this system to classify the features between the
two different emotions. As a brief review of SVMs, assume that $\{\mathbf{X}_i\}_{i=1}^{n}$
are the features, with $\mathbf{X}_i \in \mathbb{R}^D$. Accordingly, we have $\{t_i\}_{i=1}^{n}$, which are
the target classes, with $t_i \in \{-1, 1\}$. The purpose of SVM is to find the
hyperplane separating the two classes. The boundary has the equation
W · Φ(X) + b = 0, where W and b are parameters to be determined and
Φ(X) is some mapping, typically taking X to a higher dimensional space.
This is discussed below in more detail.
The decision function y(X) is sign(W · Φ(X) + b). Let X1 and X2 be
such that:
\[
\begin{aligned}
\mathbf{W} \cdot \Phi(\mathbf{X}_1) + b &= -1 \\
\mathbf{W} \cdot \Phi(\mathbf{X}_2) + b &= +1,
\end{aligned}
\]
where the data has been re-scaled such that no points lie between −1 and
1; hence, X1 and X2 lie on the boundaries of the two classes. We will
see that these vectors play a very important role in SVMs and are referred to as
support vectors.
One can show that the distance between the two boundaries is $2/\|\mathbf{W}\|$. We
would like to maximize this quantity, which is equivalent to minimizing $\|\mathbf{W}\|^2/2$.
This leads to the following Quadratic Programming (QP) problem:
\[
\begin{aligned}
\min_{\mathbf{W},\, b} \quad & \frac{1}{2}\|\mathbf{W}\|^2 \\
\text{subject to} \quad & t_i\left(\mathbf{W} \cdot \Phi(\mathbf{X}_i) + b\right) \ge 1.
\end{aligned}
\]
This formulation assumes that the data is perfectly separable. A simple
extension of this problem when the data is not perfectly separable is the
soft-margin extension, where we introduce slack variables $\xi_i$ for each $\mathbf{X}_i$. The QP
problem becomes:
\[
\begin{aligned}
\min_{\mathbf{W},\, b,\, \xi} \quad & \frac{1}{2}\|\mathbf{W}\|^2 + C \sum_{i=1}^{n} \xi_i \\
\text{subject to} \quad & t_i\left(\mathbf{W} \cdot \Phi(\mathbf{X}_i) + b\right) \ge 1 - \xi_i \\
& \xi_i \ge 0,
\end{aligned}
\tag{4.1}
\]
where C weights the penalty incurred by misclassified data.
One can show that the solution to the problem in (4.1) is:
\[
y(\mathbf{X}) = \sum_{i=1}^{n} \alpha_i t_i \left(\Phi(\mathbf{X}) \cdot \Phi(\mathbf{X}_i)\right) + b, \tag{4.2}
\]
where αi are the Lagrange multipliers of the Lagrangian solution to problem
(4.1) and b is a bias. We note that αi are 0 for all vectors except the ones on
the boundary, the support vectors. Hence, the complexity of this algorithm
depends largely on the number of support vectors, which we denote by L.
We can rewrite Equation (4.2) as:
\[
y(\mathbf{X}) = \sum_{i=1}^{L} \alpha_i t_i \left(\Phi(\mathbf{X}) \cdot \Phi(\mathbf{X}_i)\right) + b, \tag{4.3}
\]
where $\{\alpha_i\}_{i=L+1}^{n} = 0$ with a slight abuse of re-indexing. The decision is made
by taking the sign of the result in Equation (4.3). In the emotion recognition
system, a linear SVM is used, where Φ(X) = X. Hence, Equation (4.3)
reduces to:
\[
y(\mathbf{X}) = \sum_{i=1}^{L} \alpha_i t_i \left(\mathbf{X} \cdot \mathbf{X}_i\right) + b.
\]
In the emotion recognition system, the second stage computes $\beta_i \triangleq \mathbf{X} \cdot \mathbf{X}_i$.
This stage has H2 = 2L select generators, L for the positive coefficients and
L for the negative coefficients. Each of those select generators feeds the two
output chains from the previous stage; hence, I2h = 2 (h = 1, ..., H2). A total
of 4L RSAC-based DPs are needed in this stage, for a total of four chains
with L outputs each. The length of each DP is N2h = 150 (h = 1, ..., H2),
on average. The implementation using the digital approach requires a total
of L DPs, each of length N = 300.
The third stage computes y(X) = α·β, where α = (α1t1, α2t2, . . . , αLtL, 1)
and β = (β1, β2, . . . , βL, b). This stage has H3 = 2 select generators, one for
the positive coefficients and one for the negative coefficients. Each of those
select generators feeds the four output chains from the previous stage; hence,
I31 = I32 = 4. A total of eight RSAC-based DPs are needed in this stage.
The lengths of these DPs are dictated by N31 + N32 = L + 1, with an average
length of N31 = N32 = (L + 1)/2. The implementation using the digital approach
requires only 1 DP of length N = L + 1.
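The mapping of the linear SVM onto the second and third DP stages can be illustrated with an idealized floating-point sketch. It ignores the RSAC gain, offset, and sign splitting discussed earlier; the function name, the random data, and the sizes L = 12 and D = 300 are illustrative assumptions.

```python
import numpy as np

def svm_two_stage(x, support_vectors, alpha_t, b):
    """Evaluate y(X) as two cascaded dot-product stages.

    support_vectors: (L, D) array whose rows are X_i
    alpha_t:         (L,) array holding the products alpha_i * t_i
    """
    beta = support_vectors @ x               # stage 2: beta_i = X . X_i
    alpha_vec = np.append(alpha_t, 1.0)      # (alpha_1 t_1, ..., alpha_L t_L, 1)
    beta_vec = np.append(beta, b)            # (beta_1, ..., beta_L, b)
    return float(alpha_vec @ beta_vec)       # stage 3: one DP of length L + 1

rng = np.random.default_rng(2)
sv = rng.normal(size=(12, 300))              # L = 12 support vectors, D = 300
alpha_t = rng.normal(size=12)
x = rng.normal(size=300)
y_two_stage = svm_two_stage(x, sv, alpha_t, b=0.4)

# Reference: collapse the support vectors into a single weight vector W.
W = alpha_t @ sv
y_direct = float(W @ x + 0.4)
```

Appending 1 to the coefficient vector and b to the stage-2 outputs folds the bias into the final DP, which is why its length is L + 1.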
In Section 4.4, we will show the effect of quantization on the probability
of error when using a digital implementation. Then, we will show the results
of using RSAC-based DPs and the energy savings.
4.4 Simulation results
The system in Figure 4.1 is implemented using a digital implementation with
different precision, B. In this implementation, the result of every DP is trun-
cated back to B bits before going into the next stage. The last DP output
is left unchanged. Figure 4.2 shows the plot of Pe,avg for different precision,
compared to using floating-point operations. We can see that the perfor-
mance saturates around 7 bits of precision. Figure 4.3 shows the energy
consumption of the different digital implementations vs. fS (the sampling
frequency). The minimum energy operating point (MEOP) for these
implementations is $1.0309\times10^{-9}$, $1.5012\times10^{-9}$, $2.0766\times10^{-9}$, $2.7654\times10^{-9}$, and
$3.5758\times10^{-9}$ for bit precision of 4, 5, 6, 7, and 8, respectively.
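The truncation between stages can be sketched as follows. The number format is not specified in the text, so this example assumes a signed fixed-point grid over [−1, 1) with truncation toward negative infinity; the function names and the normalization to a full scale of 1 are assumptions made for illustration.

```python
import numpy as np

def truncate(values, bits, full_scale=1.0):
    """Truncate to a signed `bits`-bit grid over [-full_scale, full_scale)."""
    step = 2.0 * full_scale / 2**bits
    q = np.floor(np.asarray(values, dtype=float) / step) * step
    return np.clip(q, -full_scale, full_scale - step)

def quantized_cascade(x, W1, w2, bits):
    """Two cascaded DP stages with B-bit truncation in between."""
    hidden = truncate(W1 @ x, bits)   # every intermediate DP result is truncated
    return float(w2 @ hidden)         # the last DP output is left unchanged

x_demo = np.array([0.5, 0.5])
W1_demo = np.array([[0.5, 0.25]])
w2_demo = np.array([1.0])
y_demo = quantized_cascade(x_demo, W1_demo, w2_demo, bits=3)
```

In this toy case the first-stage result 0.375 falls between grid points of a 3-bit format (step 0.25) and is truncated to 0.25 before entering the next stage.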
[Figure 4.2: Quantization effect of the digital implementation of the system
in Figure 4.1 on the average probability of error (POE). Vertical axis: Pe,avg.]

[Figure 4.3: Energy consumption of the different digital implementations vs.
sampling frequency. Horizontal axis: fS.]

[Figure 4.4: Average POE vs. frequency of operation for the RSAC-based
system and the digital implementations. Vertical axis: Pe,avg; horizontal axis: fS.]
The simulation setup for the RSAC-based system is the same as the one
for Design II in Section 3.2. Unlike Design II, we will have a different R and
C for every stage, denoted by Rj and Cj, respectively. Rj will remain fixed
to 200 kΩ, while Cj = cj × 200 fF and accordingly, TTRj = cj × 10, where
c1 = 1, c2 = 3, and c3 = 1. The total number of iterations in stage 1 is
3 × TTR1 and the total number of iterations in stage 2 is 3 × TTR2. The
precision of the select generator was chosen to be 7.
Figure 4.4 shows the plot of Pe,avg vs. fS for the RSAC-based system
and the digital implementations. For Pe,avg = 0.23 and 0.07, we obtain
energy savings of 36% and 49% compared to the digital implementations’
MEOP, running at frequencies of 1.87 MHz and 1.7 MHz, respectively. For
Pe,avg = 0.23, we can operate RSAC with a select generator with a precision
of 6 and thus, increase the energy savings to 45% while not affecting Pe,avg.
CHAPTER 5
CONCLUSION AND FUTURE
DIRECTIONS
In this thesis, we presented RSAC, a new circuit architecture, that imple-
ments an energy-efficient DP. In Chapter 2, we analyzed the way RSAC
operates, its architecture, and its select generation. We also developed a
behavioral model for RSAC, and an energy model. In Chapter 3, we showed
the energy savings of RSAC compared to digital implementations at the DP
kernel level. In a 130 nm process, we showed 19× to 32× in energy savings
compared to a digital implementation for SQNRs of 30 dB to 24 dB, respectively.
In a 28 nm FDSOI process, we showed 15.7×, 4×, and 2.1× in energy
savings compared to a digital implementation running at the same sampling
frequency for SQNRs of 8 dB, 14 dB and 20 dB, respectively. In Chapter 4,
we showed the design of an emotion recognition system composed solely of
RSAC-based DPs. We obtained 2× energy savings, approximately, compared
to digital implementations for Pe,avg = 0.23 and Pe,avg = 0.07. An extension
to the system described in Chapter 4 is one where we need to distinguish
among the eight emotion classes of Figure 4.1 (the seven basic emotions plus
neutral) exhibited by a face. For this extension, one can use
the emotion recognition system in Chapter 4 to compare every combination
of two emotions, totaling 28 combinations, followed by a scoreboard to
determine the most likely emotion. This is left as a topic for future research.
We note that RSAC is most powerful when the select generators are shared,
which was not the case in the system studied. One way to improve the energy
savings of using RSAC is by adopting another design for the select generator,
digital or analog. This is a topic for future study. In particular, in the case of
the emotion recognition system, we can improve the energy consumption by
using the output of the Gabor filter for the select generation and the support
vectors as inputs to the DP kernel. This will allow us to share the select
generation for all support vectors, but it also requires an analog select
generation scheme and digital-to-analog converters for the support vectors. We
also note that the costs of analog-to-digital and digital-to-analog
conversions have all been ignored in this study; a more reliable
study would take them into account for the designed system.
In Chapter 4, we showed how we can split the processing chain to two
chains, positive and negative, to accommodate the processing of negative
values and thus, doubled the number of DPs needed. Extra energy savings
can be achieved by working on a new RSAC implementation that accommodates
negative numbers without doubling the number of DPs. We also note
that this technique would possibly not work if there is a non-linear operation
in the processing chain. Further, the emotion recognition system consisted
of only three RSAC-based DP stages, but having more stages could be detri-
mental for RSAC since the dynamic range shrinks with every stage. This
should be considered when designing systems with more stages.
Finally, we can improve the sampling period by pipelining, which requires
an additional sample-and-hold (SAH) circuit at the output of each RSAC-
based DP. This is also left as a future direction.
REFERENCES
[1] C. Arthur, “Your smartphone’s best app? Battery
life, say 89% of Britons,” May 2014. [Online]. Available:
http://www.theguardian.com/technology/2014/may/21/your-smartphones-best-app-battery-life-say-89-of-britons
[2] A. V. Oppenheim and R. W. Schafer, Discrete-Time Signal Processing.
Englewood Cliffs, NJ: Prentice-Hall, 1989.
[3] R. Sarpeshkar, “Analog versus digital: Extrapolating from electronics
to neurobiology,” Neural Computation, vol. 10, no. 7, pp. 1601–1638,
1998.
[4] Y. Lu and E. Alon, “Design techniques for a 66 Gb/s 46 mW 3-tap
decision feedback equalizer in 65 nm CMOS,” vol. 48, no. 12, pp. 3243–
3257, Dec. 2013.
[5] W. M. Leach Jr, “Fundamentals of low-noise analog circuit design,”
Proceedings of the IEEE, vol. 82, no. 10, pp. 1515–1538, 1994.
[6] B. Razavi, Design of Analog CMOS Integrated Circuits. Boston, MA:
McGraw-Hill, 2001.
[7] J.-H. Tsai, P.-S. Wu, C.-S. Lin, T.-W. Huang, J. G. Chern, and W.-
C. Huang, “A 25–75 GHz broadband Gilbert-cell mixer using 90-nm
cmos technology,” Microwave and Wireless Components Letters, IEEE,
vol. 17, no. 4, pp. 247–249, 2007.
[8] M. Duppils and C. Svensson, “Low power mixed analog-digital signal
processing,” in Proc. of Int. Symp. on Low Power Elect. and Design,
2000, pp. 61–66.
[9] B. Gaines, “Stochastic computing systems,” in Advances in Information
Systems Science. Springer, 1969, pp. 37–172.
[10] A. Alaghi and J. P. Hayes, “Survey of stochastic computing,” ACM
Transactions on Embedded Computing Systems (TECS), vol. 12, no. 2s,
p. 92, 2013.
[11] D. Fick, G. Kim, A. Wang, D. Blaauw, and D. Sylvester, “Mixed-signal
stochastic computation demonstrated in an image sensor with integrated
2d edge detection and noise filtering,” in Custom Integrated Circuits
Conference (CICC), 2014 IEEE Proceedings of the. IEEE, 2014, pp.
1–4.
[12] I. Nahlus, E. P. Kim, N. R. Shanbhag, and D. Blaauw, “Energy-efficient
dot product computation using a switched analog circuit architecture,”
in Proceedings of the 2014 International Symposium on Low Power Elec-
tronics and Design. ACM, 2014, pp. 315–318.
[13] J. H. Satyanarayana and K. K. Parhi, “A theoretical approach to esti-
mation of bounds on power consumption in digital multipliers,” vol. 44,
no. 6, pp. 473–481, 1997.
[14] R. A. Abdallah and N. R. Shanbhag, “Minimum-energy operation via
error resiliency,” Embedded Systems Letters, IEEE, vol. 2, no. 4, pp.
115–118, 2010.
[15] J. Kwong and A. P. Chandrakasan, “Variation-driven device sizing for
minimum energy sub-threshold circuits,” International Symposium on
Low Power Electronics and Design (ISLPED), pp. 8–13, 2006.
[16] M. S. Bartlett, G. Littlewort, M. Frank, C. Lainscsek, I. Fasel, and
J. Movellan, “Recognizing facial expression: Machine learning and ap-
plication to spontaneous behavior,” in Computer Vision and Pattern
Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on,
vol. 2. IEEE, 2005, pp. 568–573.
[17] M. S. Bartlett, G. Littlewort, I. Fasel, and J. R. Movellan, “Real time
face detection and facial expression recognition: Development and appli-
cations to human computer interaction.” in Computer Vision and Pat-
tern Recognition Workshop, 2003. CVPRW’03. Conference on, vol. 5.
IEEE, 2003, pp. 53–53.
[18] P. Lucey, J. F. Cohn, T. Kanade, J. Saragih, Z. Ambadar, and
I. Matthews, “The extended Cohn-Kanade dataset (ck+): A com-
plete dataset for action unit and emotion-specified expression,” in Com-
puter Vision and Pattern Recognition Workshops (CVPRW), 2010 IEEE
Computer Society Conference on. IEEE, 2010, pp. 94–101.
[19] P. Viola and M. J. Jones, “Robust real-time face detection,” Interna-
tional Journal of Computer Vision, vol. 57, no. 2, pp. 137–154, 2004.
[20] D. Gabor, “Theory of communication. Part 1: The analysis of informa-
tion,” Journal of the Institution of Electrical Engineers-Part III: Radio
and Communication Engineering, vol. 93, no. 26, pp. 429–441, 1946.
[21] M. Lades, J. C. Vorbruggen, J. Buhmann, J. Lange, C. von der Mals-
burg, R. P. Wurtz, and W. Konen, “Distortion invariant object recogni-
tion in the dynamic link architecture,” Computers, IEEE Transactions
on, vol. 42, no. 3, pp. 300–311, 1993.
[22] G. Littlewort, M. S. Bartlett, I. Fasel, J. Susskind, and J. Movellan, “Dy-
namics of facial expression extracted automatically from video,” Image
and Vision Computing, vol. 24, no. 6, pp. 615–625, 2006.
