Linear Delay-cell Design for Low-energy Delay Multiplication and
  Accumulation by Shukla, Aditya
arXiv:2007.13895v1 [cs.ET] 27 Jul 2020
Linear Delay-cell Design for Low-energy Delay
Multiplication and Accumulation
Aditya Shukla, Student Member, IEEE
Abstract—A practical deep neural network’s (DNN) evaluation
involves thousands of multiply-and-accumulate (MAC) opera-
tions. To extend DNN’s superior inference capabilities to energy
constrained devices, architectures and circuits that minimize
energy-per-MAC must be developed. In this respect, analog
delay-based MAC is advantageous due to reasons both extrinsic
and intrinsic to the MAC implementation − (1) lower fixed-
point precision (1-8 bits) requirement in a DNN’s evaluation,
(2) better dynamic range than charge-based accumulation for
smaller technology nodes and (3) simpler analog-digital inter-
faces. Implementing DNNs using delay-based MAC requires
mixed-signal delay multipliers that accept digitally stored weights
and analog voltages as arguments. To this end, a novel, linearly
tune-able delay-cell is proposed, wherein, the delay is realized
with an inverted MOS capacitor (C∗) steadily discharged from
a linearly input-voltage dependent initial charge. The cell is
analytically modeled, constraints for its functional validity are
determined, and jitter-models are developed. Multiple cells with
scaled delays, corresponding to each bit of the digital argument,
must be cascaded to form the multiplier. To realize such bit-
wise delay-scaling of the cells, a biasing circuit is proposed that
generates sub-threshold gate-voltages to scale C∗’s discharging
rate, and thus area-expensive transistor width-scaling is avoided.
On applying the constraints and jitter models to 130nm tech-
nology, the minimum optimal C∗ was found to be 2 fF and
maximum number of bits to be 5. Schematic-level simulations
show a worst case energy-consumption close to the state-of-art,
and thus, feasibility of the cell.
Index Terms—Analog-computing, delay-cell, mixed-signal de-
lay multiplier, multiply-and-accumulate
I. INTRODUCTION
RECENT advances in machine learning algorithms and, particularly, deep neural networks (DNNs), have
equipped portable computing devices with human-like infer-
ring, classifying and planning capabilities. Enormous sizes
of these networks, with number of operations per evaluation
often running into millions, make remote computing servers
indispensable. Reliance on servers increases inference latency,
communication energy, risk of privacy loss, traffic, and needs
a perpetual connection to the server. Some of these metrics are
critical in applications like self-driven cars, that cannot afford
delays while making decisions. Moving the computational
effort of evaluating an ML model away from the server and towards
the leaf nodes requires ML-specific, energy-efficient computing
architectures [1]. Many such architectures have been
proposed to greatly accelerate the training and inference speed
of DNNs [2]–[4], but much work is needed to run these networks
efficiently under the severe energy restrictions that many portable
devices operate under.
The author is with the Department of Electrical Engineering and Computer
Science, University of Michigan, Ann Arbor, MI - 48104, USA (e-mail:
aditshuk@umich.edu).
The computing-energy problem [5] is tackled by (1)
using simpler data-types: these algorithms do not require
a large precision and continue to provide similar accuracy
with simpler data-types and restricted widths [6]–[11]; and (2)
minimizing data-transfer: the number of data-fetches shoots up
for human-level, large-scale applications of these algorithms,
causing significant non-compute (latent) energy losses [2], [4],
[12].
Relative robustness of DNNs to precision-loss, together
with a limitation on energy, motivates the use of analog
computing systems, wherein, the loss of information due
to noise and process-variability can effectively be modelled
as the loss in precision. To maintain the energy-efficiency
without an excessive (counter-productive) precision-loss, these
systems constitute both analog and digital computing units.
The computational roles are distributed such that the multiply-
and-accumulate (MAC) operations, which form the bulk of a
DNN’s evaluation, are executed in an analog domain, while
other operations (e.g. control-flow, data-communication and
storage) are done using binary voltages. Superposable electric
variables like charge [13], [14] and current [15]–[17] physically
represent partial sums of a MAC, with a capacitor acting as
an accumulator that stores the sum of the physical variables.
Recently, time was proposed as an accumulation variable,
as it is better than charge and current in following regards:
(1) time-to-voltage/digital converters (TDC, and vice versa
DTC) are more area and power-efficient than voltage-based
converters [18], [19]. For instance, both DTC and TDC can
be realized out of clocked counters, while voltage ADC/DACs
require area and energy-expensive operational amplifiers; (2)
while the noise-floor is relatively constant, the supply voltage,
VDD , drops with technology nodes. Thus, the dynamic-range
of accumulation of voltage, current or charge gets increasingly
limited; (3) the transition frequency of the FETs, which
dictates the temporal resolution of a TDC, increases with tech.
nodes.
Within the purview of time-based accumulation, pulse-
width [20], [21] and pulse-delay [18], [22], [23] are the two
modulation schemes that have been demonstrated on-chip. Of
these, pulse- (or event-) delay is more promising for MAC
applications because (1) addition and subtraction come for free
with delays, and (2) pulse-width modulation requires peripheral
pulse re-routing circuitry.
Fig. 1. Sub-processes of the delay-cell
A practical delay-MAC must meet the following specifications:
firstly, it must accept mixed-signal arguments, one analog
and the other digital (for locally stored weights); secondly, it
should possess linear voltage-delay (transfer) characteristics,
to accept externally sensed analog voltages and to allow
cascading of multiple layers of MACs. To implement a low-energy
mixed-signal delay-multiplier, the major challenge is the
design of a tunable delay-cell having linear transfer
characteristics. Miyashita et al. [18] first proposed the use of
analog-digital mixed signal delay-MAC. Common mathemat-
ical operations like addition, subtraction, multiplication, and
max-/minimization were demonstrated in a clocked time-
domain. However, the multiplication using clocked time-to-
digital converters negated power-savings expected from an
analog processor. Clock-less tune-able delay-cells for MACs
were proposed in [22], where the accumulation after each
dot-product of a binary convolutional neural network was carried
out implicitly by the delays of a series of nMOS resistor-based
delay-cells. However, the use of resistors to scale the
delay led to an area-expensive solution.
Delay-modulation via programming voltages, for both
low-power front-end analog processing and approximate-computing
acceleration, was demonstrated in [23]. Applying
a small-signal analog input to the back-gate of the transistor
modulated the threshold voltage and hence, the delay. Since
the threshold voltage varies with the input in a square-root
fashion, the delay is inherently non-linear. Also, variation
in the threshold voltage across a chip can introduce non-
homogeneity in the multiplier.
In this work, a novel CMOS referential delay-cell, based
on the steady discharge of a capacitor (C∗) via a constant
current (I∗), is proposed. Block-diagram of Fig. 1 depicts three
sequential processes that C∗ undergoes, from t = 0:
1) instantaneous pre-charge to V ∗0 (colored red)
2) steady discharge, through a constant current I∗ (blue)
3) thresholding of V ∗ at V ∗th, using a threshold detector
(green).
With these three processes, the time taken for $V^*$ to reach $V^*_{th}$
is:
$$T_d = \frac{C^*}{I^*}\left(V^*_0 - V^*_{th}\right). \tag{1}$$
If $V^*_0$ is a linear function of $V_A$, then the time taken to discharge
to $V^*_{th}$ (or simply, the delay) becomes a linear function of
$V_A$. This forms the basis of the proposed delay-cell. For use
within a multiplier, its delay is exponentially scaled through the
gate-voltages of the source of I∗, rather than transistor widths.
Next, analytical models for all the sub-processes in the delay-cell
are developed, key sources of jitter are identified and a model
for the net jitter is formed. From these models, constraints on
C∗ and I∗ for the usability of delay-cells in a multiplier are
found, and it is shown that the multiplier cannot accommodate
more than 5 bits of (signed) digital input. Biasing circuits
that generate the gate-voltages to scale I∗, and hence the delay,
exponentially are proposed and validated.
Fig. 2. Referential delay
The paper is divided as follows: in Sec. II, necessary but
brief background on mixed-signal delay multipliers, delay
MACs and how multiple delay-cells together constitute a
multiplier, is presented. In Sec. III, the concept behind the
proposed delay cell is presented in more detail. For each of
the three sub-processes, CMOS implementation details are presented
and constraints for linear delay transfer characteristics are
developed. Jitter-models are then developed for the two
sub-processes that contribute most to the jitter. In Sec. IV, the
constraints and jitter model developed are employed to find
the minimum latency and maximum number of bits that can
be accommodated within the multiplier. Also, a biasing circuit
that enables an accurate exponential scaling of delays of cells
within a multiplier is presented. In Sec. V, we discuss
the chosen noise-floor and its connection with the maximum
number of bits, as well as the input dependence of energy consumption.
Next, the proposed delay-cell is compared with the state of the art,
before concluding in Sec. VI.
II. BACKGROUND
A delay MAC comprises several delay multipliers, the delays
of which get accumulated before undergoing further non-linear
processing, e.g., comparison or max-/minimization. The
multiplier accepts a time-referenced event signal, which it
propagates forward as-is, but after a delay in proportion to
the product of its arguments. When several such multipliers
are placed in series, and a reference event is applied to the
first, then, the ref. event is propagated forward, and the net
(accumulation of) delay models the dot-product of inputs.
By using delays, the need for adders is eliminated, because
the delays are summed up naturally. Since negative numbers
cannot be represented using individual events, a pair of events
is used, where the time of one event's occurrence,
referred to that of the other, is called the referential delay (Fig. 2).
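The accumulation described above can be sketched numerically. The following is a minimal illustration rather than a model of the actual circuit: each multiplier stage is assumed to add a common latency to both events and a signed referential delay to the variable event. The names `propagate` and `t_common`, and all sample values, are hypothetical.

```python
def propagate(stage_ref_delays, t_common=1e-9):
    """Propagate a (variable, reference) event pair through a chain of stages.

    stage_ref_delays: signed referential delay added by each stage (seconds).
    t_common: hypothetical latency that every stage adds to both events.
    Returns the arrival times (t_var, t_ref) at the end of the chain.
    """
    t_var = 0.0
    t_ref = 0.0
    for ref_delay in stage_ref_delays:
        t_var += t_common + ref_delay  # variable event carries the partial sum
        t_ref += t_common              # reference event only sees the common latency
    return t_var, t_ref


# Illustrative partial sums: two positive, one negative.
stages = [0.3e-9, -0.5e-9, 0.1e-9]
t_var, t_ref = propagate(stages)
net_referential_delay = t_var - t_ref  # models the accumulated dot-product
```

In this sketch the common latency cancels in the difference, so the net referential delay equals the sum of the per-stage referential delays and can be negative even though each individual event only moves forward in time.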
A delay multiplier, besides two input arguments, has a pair
of event-signals, one variable and one reference (henceforth called
referential events), at its input and output. For a mixed-signal
multiplier, a signed, fixed-point weight (a sign bit S and an n-bit
wide vector W) and an analog scalar (VA) form the arguments,
and a pair of rising (or falling) edges of voltages form the
referential event-signals (Fig. 2). To accommodate negative weights,
symmetric 2:2 multiplexers (or relays), realized using
transmission-gates, are placed within each multiplier. The relay
ensures that for each negative weight, the referential events are
swapped before multiplication (Fig. 3a,b).
Each multiplier consists of smaller referential delay-cells
that correspond to each bit of W and create a referential
delay equaling $2^i D$, where D is the common delay-factor
and $i \in \{0, 1, \dots, n-1\}$. The common delay-factor, D, is a
linear function of VA. For reasons explained in Sec. III-C, two
VDD → 0 falling edges are propagated through the multiplier
as referential event signals. Weight bits, $w_i$, individually
determine whether the reference-event signals are delayed by $2^i D$
or not, by making the falling-edge pass or skip a delay-cell.
Each referential delay-cell has a pair of identical and
parallel, linearly tunable delay-cells. One delay-cell inputs VA
and outputs the falling-edge after delay linearly dependent on
VA. The second cell inputs a constant reference voltage, VA0,
and outputs the event after a fixed time. If VA > VA0, then
variable event gets more delayed compared to the reference,
which represents a positive partial sum. A negative partial sum
is produced if VA < VA0 and zero, if VA = VA0. Thus, using
a pair of delay-cells homogenises the multiplier with respect
to the multiplicand and allows negative weights and referential
delays.
For illustration, a 1-bit multiplier is shown in Fig. 3a. The
multiplier comprises a 2:2 relay and a referential delay
cell. S = 1 implies a negative weight, which causes the
falling-edges to get swapped. The referential delay-cell further
contains two delay-cells, one with a variable input (VA) and the
other with a reference input (VA0). When W = w0 = 0, the
cell-bypassing MUX is enabled, leading to both a negligible delay
and a negligible ref. delay. Next, a 3-bit multiplier is shown in Fig. 3b. The
referential delays are scaled in the ratio 1, 2 and 4, by scaling
the absolute delays in the ratio 1, 2 and 4.
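The behaviour of the signed multiplier described above can be summarised in a short sketch. It is a behavioural model under stated assumptions, not the circuit itself: the common delay-factor D is taken as `slope * (v_a - v_a0)`, where `slope` is a hypothetical constant, and the function name and sample values are illustrative only.

```python
def multiplier_ref_delay(sign, bits, v_a, v_a0, slope):
    """Referential delay of an n-bit signed mixed-signal delay multiplier.

    sign:  S, where 1 means a negative weight (events are swapped).
    bits:  [w_0, w_1, ..., w_{n-1}], least-significant bit first.
    slope: hypothetical slope of the common delay-factor D versus V_A (s/V).
    """
    d = slope * (v_a - v_a0)  # common delay-factor D, linear in V_A
    # Bit w_i either inserts a cell with referential delay 2^i * D or bypasses it.
    total = sum(w * (2 ** i) * d for i, w in enumerate(bits))
    # Swapping the referential events negates the referential delay.
    return -total if sign else total


pos = multiplier_ref_delay(sign=0, bits=[1, 0, 1], v_a=1.0, v_a0=0.75, slope=1e-9)
neg = multiplier_ref_delay(sign=1, bits=[1, 0, 1], v_a=1.0, v_a0=0.75, slope=1e-9)
zero = multiplier_ref_delay(sign=0, bits=[1, 1, 1], v_a=0.75, v_a0=0.75, slope=1e-9)
```

With bits `[1, 0, 1]` the weight magnitude is 5, so `pos` is five times D, `neg` is its negation, and `zero` shows that an input equal to the reference produces no referential delay whatever the weight.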
One may also use a differential mode of operation, where
the referential delay-cell is replaced with a differential
delay-cell. In this mode, the analog inputs are changed from VA and
VA0 to VA0 + VA and VA0 − VA. This may remove second-order
distortion terms of the multiplier without changing the
multiplier's circuitry.
III. DELAY-CELL DESIGN IN CMOS
A. Steady discharge-based delay-cells
An idealized circuit implementing an equivalent of this process
is shown in Fig. 4, where each component responsible for the
three sub-processes has been boxed and colored correspondingly.
In branch 1, the key component is a VA-accepting nFET, MA, that
has a net source-capacitance CS. Initially, both S1 and S2 are
open and C∗ is charged to VDD. Once S1 is closed, the nFET
initializes C∗ by sinking its charge into CS, until its
source-voltage reaches approximately Vthn below the gate voltage,
i.e. VA − Vthn. From charge conservation, V∗ lowers by:
$$\Delta V^*_0 \approx \frac{C_S}{C^*}\left(V_A - V_{thn}\right). \tag{2}$$
Thus, the n-FET conducts until the source voltage rises
enough to cut off the channel, establishing a linear relationship
between $V_A$ and $\Delta V^*_0$. The approximation in Eq. 2 comes from
the fact that a real sub-micron FET does not have a well-defined
threshold-voltage. However, as shown later in Sec. III-C, the
linear relationship still holds well if VA > Vthn.
Once the voltage across CS is set, S1 is opened and S2 is
closed, causing C∗ to spontaneously discharge at a constant
rate via I∗ (Fig. 4). The steady discharge process can be
described as:
$$\Delta V^*(t) = \Delta V^*_0 + \frac{I^*}{C^*}(t - t_0), \tag{3}$$
where $\Delta V^*$ is the drop in $V^*$ below $V_{DD}$. Next, a
threshold-detector, with a threshold $V^*_{th}$, outputs a falling-edge
once $V^*$ drops below $V^*_{th}$. The time taken for $\Delta V^*$
($= V_{DD} - V^*$) to reach a given threshold $\Delta V^*_{th}$
($= V_{DD} - V^*_{th}$), called the absolute delay ($T_d$), is given by:
$$T_d = \frac{C^*}{I^*}\left[\Delta V^*_{th} - \Delta V^*_0\right] \approx \frac{C^*}{I^*}\left[\Delta V^*_{th} - \frac{C_S}{C^*}\left(V_A - V_{thn}\right)\right]. \tag{4}$$
This is a linear function of VA, so the delay can be
adjusted linearly with the input (Fig. 5). As discussed in Sec. II,
a referential delay is used to homogenize the delay-input
relationship. For a pair of steady-discharge-based delay-cells, the
referential delay $\Delta t_D$ is:
$$\Delta t_D = T_d - T_{d,REF} = -\frac{C_S}{I^*}\left(V_A - V_{A0}\right), \tag{5}$$
which is independent of C∗. In case C∗ varies with V∗, the
referential delay may be written as:
$$\Delta t_D = \frac{1}{I^*}\int_{V^*_{0,REF}}^{V^*_0(V_A)} C(v)\,dv. \tag{6}$$
For $\Delta t_D$ to be a linear function of $V_A$, $C(V^*)$ needs to
be maximally constant over the range $V^* \in [V^*_0, V^*_{th}]$. If $I^*$
is scaled exponentially, by factors of 2 (Fig. 5), then a delay
multiplier with a digital input vector $\bar{W} = \{w_i, \forall i = 1, 2, \dots, n\}$
will yield the following referential delay:
$$\Delta t_D = \sum_{i=1}^{n} \frac{w_i}{2^n}\left[\Delta t_{D,i}(V_A)\right], \tag{7}$$
where $\Delta t_{D,i}$ is the ref. delay from delay cell $i$. This forms
the basis of our mixed-signal delay multiplier.
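As a numerical check of the idealized model, the sketch below evaluates Eqs. 2, 4 and 5 for two different values of C∗ and confirms that the referential delay does not depend on C∗. All parameter values (current, capacitances, threshold and latch voltages) are assumptions picked for illustration, and the function names are hypothetical.

```python
import math


def absolute_delay(v_a, c_star, i_star, c_s, dv_th, v_thn):
    """Absolute delay T_d of one cell, following Eqs. 2 and 4."""
    dv0 = (c_s / c_star) * (v_a - v_thn)      # initial drop, Eq. 2
    return (c_star / i_star) * (dv_th - dv0)  # steady discharge to threshold, Eq. 4


def referential_delay(v_a, v_a0, **cell):
    """Difference between the variable and reference cells, Eq. 5."""
    return absolute_delay(v_a, **cell) - absolute_delay(v_a0, **cell)


# Illustrative values only (assumed, not taken as design values).
common = dict(i_star=1e-6, c_s=0.23e-15, dv_th=0.6, v_thn=0.32)
small = referential_delay(1.0, 0.75, c_star=2.2e-15, **common)
large = referential_delay(1.0, 0.75, c_star=4.4e-15, **common)
expected = -(common["c_s"] / common["i_star"]) * (1.0 - 0.75)
```

Both `small` and `large` agree with the closed form of Eq. 5, because the C∗-dependent part of the absolute delay is common to the variable and reference cells and cancels in the difference.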
B. CMOS implementation: C∗
Later in Sec. IV-A, it is shown that C∗ of approximately
2fF is optimal for minimizing energy, latency and jitter. An
inverted MOSCAP, steadily discharging towards depletion,
reliably provides capacitance in this range. Since an n-type
MOSCAP has a larger inversion capacitance-density than a p-type
one (Fig. 7), the former is used.
Fig. 3. Signed mixed-signal delay multiplier: (a) 1-bit, (b) 3-bit
Fig. 4. Three sub-processes with n-FET for (2)
Fig. 5. Expected transient response: (a) varying VA, (b) varying I∗
C. CMOS implementation: Voltage initialization
This stage comprises minimum-sized transistors M1−3 and the
pMOSCAP CS in Fig. 6, the key design considerations of which
are discussed next:
1) Input nFET MA (M2): To nullify the effect of process
variations (PV), the VA-accepting nFET is unique to
a multiplier, i.e. it is shared by all delay-cells within
a multiplier. Specifically, random dopant-fluctuation,
oxide-thickness variations and other process-related
non-idealities may offset Vthn, which may in turn offset the output
ref. delay by:
$$\Delta t_D = -\frac{C_S}{I^*}\,\Delta V_{thn}, \tag{8}$$
where $\Delta V_{thn}$ models the effect of PV.
2) CS : This capacitor is unique to a delay-cell, and a
min. sized pFET is assigned to each cell. The pFET
stays in the inversion regime regardless of VA, because
VS saturates to a value that is at least Vthn less than
VA,max(= VDD)
3) Switch S-1 (M1): In the relevant regime of operation,
V ∗ remains close to VDD, necessitating a p-type FET.
The switch is unique to each delay-cell, as it isolates C∗
of each cell from a shared M2 of the multiplier
4) Reset switch (M3): A switch to reset the VS to zero
before each computation is kept common to all cells
within a multiplier
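If the initial charge on CS is zero, then $\Delta V^*_0$ can be
expressed as a linear function of $V_A$ and an offset term:
$$\Delta V^*(V_A) = \frac{C_S + C_{pS}}{C^*}\left(V_A - V_{thn}\right) + \frac{\Delta Q_{of}}{C^*} \approx \frac{C_S}{C^*}\left(V_A - V_{thn}\right) + \frac{\Delta Q_{of}}{C^*}. \tag{9}$$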
Here, CpS is the parasitic capacitance arising from M2 and
M3; ∆Qof models the zero-offset at VA = Vthn, which depends
on several parameters: C∗, M2's width and other parasitic
effects like feed-forward of the input falling-edge into C∗. Note
that ∆Qof has two distinct values: the first is defined during
the discharge-pulse application (MD) and contains a
feed-forward component of the falling-edge. The second is defined
after the discharge-pulse application (PD), and is slightly less
than the MD value.
To quantify linearity, ∆V∗ is plotted against VA in Fig. 8a
and its derivative w.r.t. VA in Fig. 8b, for VA ranging between 0.3 V
and 1.2 V, with W∗ (or C∗) as the design parameter. Over this
range, less than 10% variation is seen. The figure shows that
larger capacitors can provide better linearity.
Fig. 9 plots $\Delta V^*_0$ and its average derivative versus
W∗ (∝ C∗), in a log-log fashion. Both plots have a constant
slope of −1 for sufficiently large W∗, validating Eq. 9 as a
model for the discharge process. CS + CpS and ∆Qof are then
empirically determined by fitting the model of Eq. 9, yielding
CS + CpS = 0.23 fF and ∆Qof = 0.5 fC (MD).
Eq. 9 is only valid when the voltage on C∗ is large enough to
charge up CS. Mathematically,
$$V_{DD} - \Delta V^*(V_A) > V_A - V_{thn}. \tag{10}$$
For $V_A = V_{DD}$, we get:
$$C^* > \frac{(C_S + C_{pS})(V_{DD} - V_{thn}) + \Delta Q_{of}}{V_{thn}}. \tag{11}$$
This sets the lower limit on C∗, which is employed later in
Sec. IV-A.
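The bound of Eq. 11 is easy to evaluate once the parameters are fixed. The sketch below uses the fitted values of CS + CpS and ∆Qof quoted above; the supply voltage (taken as the upper end of the VA range) and, in particular, the threshold voltage are assumptions made for illustration, since the text does not state Vthn. The function name is hypothetical.

```python
def c_star_min(c_s_total, dq_of, v_dd, v_thn):
    """Lower bound on C* from Eq. 11 (all quantities in SI units)."""
    return (c_s_total * (v_dd - v_thn) + dq_of) / v_thn


bound = c_star_min(
    c_s_total=0.23e-15,  # C_S + C_pS from the fit quoted above
    dq_of=0.5e-15,       # Delta Q_of (MD) from the fit quoted above
    v_dd=1.2,            # assumed: upper end of the V_A range
    v_thn=0.32,          # assumed threshold voltage, for illustration only
)
print(f"C* must exceed about {bound * 1e15:.2f} fF")
```

With these assumed voltages the bound lands in the low-femtofarad range; a different threshold voltage would move it, since Vthn appears in both the numerator and the denominator.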
Fig. 6. Delay-cell schematic. All transistors have minimum widths
Fig. 7. C∗ (F) vs. FET width, W∗ (m), for N-type and P-type MOSCAPs
Fig. 8. ∆V∗ plots vs. VA (V): (a) absolute value, (b) derivative. Legend entries: 160n, 400n, 1, 2.56
To slightly enhance the linearity without adding to the
area, one in every five pMOSCAPs of CS is replaced by
an nMOSCAP. For VS < VDD − Vthp, the pFET is inverted and
provides a nearly constant capacitance. For VS > VDD − Vthp,
the pFET's capacitance diminishes but the nFET offsets the loss.
Since the nMOS is small, it does so only to a small extent.
D. CMOS implementation: Steady discharge
This part of the delay-cell consists of transistors M4−6 in
Fig. 6, the key design considerations of which are discussed next:
1) I∗ (M4−M5): This is realized using a bi-cascoded nFET
current source, with the FETs at their min. widths.
The exponential current scaling is done via an external
biasing circuit that generates gate-voltages for both M4
and M5. The biasing circuits are discussed in Sec. IV-B.
2) Switch S-2 (M6): An NMOS switch is placed in series
with the current-source. Unlike S-1, the switch
is placed away from C∗, preventing feed-forward
through the parasitic capacitors.
Fig. 9. Variation of ∆V∗ with W∗ and linearization models: (a) Deriv. ∆V∗ vs. W∗, (b) ∆V∗ vs. W∗. Legend entries: Mid., Post, Slope −1
Fig. 10. Steady discharge's characterization: (a) V∗ transient (simulated), for rates R0, 2R0 and 4R0, (b) Td at VDD/2 vs. VA
For WS (∝ CS) at its minimum value (160 nm) and
W∗ (∝ C∗) = 640 nm, the discharge transient of C∗ is shown in
Fig. 10a. Referring back to Fig. 6, the switch (M2) is turned
ON at t = 25 ns, by ramping up IFE′, the input to M6. As
expected, I∗ discharges C∗ at a near-constant rate (Fig. 10a).
Its constancy depends solely on output resistance of the current
source. Kinks observable in the transients are caused by the
feedback from the half-latch and do not practically affect the
performance. The absolute delay for various R (= I∗/C∗),
spaced exponentially with a factor of 2, is plotted in Fig. 10b.
E. CMOS Implementation: Threshold detector
It comprises M7−14, which form the falling-edge, uni-polar
threshold detector, the half-latch and other switches for resetting.
Details and design considerations are discussed next:
1) Falling-edge inverter (M8): To minimize the area re-
quirements, the width of M8 is kept minimum. As
discussed below, under certain constraints on C∗, this
inverter contributes to a VA-independent delay, thus
keeping distortion negligible.
2) Leak-prevention switch (M7): It serves as a means to
prevent a sub-threshold M8 from leaking and setting up the
latch prematurely. It inputs the falling-edge of the
previous delay cell.
3) Half-latching inverter (M11,13): These transistors latch
VRE to VDD and the OFE-node to 0, once VRE reaches
Vthn.
4) Latch-en-/disable switches (M10,M14): These switches
enable the half-latch operation when closed and other-
wise, disable it, reducing the energy required to reset
the half-latch
5) Reset-FETs (M9,12): M9 resets node VRE to 0. M12
sets the OFE node to VDD before the start of computation.
These transistors are shared within the multiplier.
A CMOS inverter can serve as a low-energy threshold
detector, whose switching-voltage can be set by designing the
ratio of sizes of the pMOS and nMOS. However, it consumes
short-circuit energy ($E_{SC}$) given by:
$$E_{SC} = \frac{V_{DD}}{6R}\,\mu\frac{W}{L}\,C_{ox}\left(V_{DD} - V_{thp} - V_{thn}\right)^3, \tag{12}$$
where $R = I^*/C^*$ and
$$\mu\frac{W}{L} = \frac{\mu_p\mu_n (W/L)_p (W/L)_n}{\left(\left(\mu_p (W/L)_p\right)^{1/2} + \left(\mu_n (W/L)_n\right)^{1/2}\right)^2}.$$
Since R decreases exponentially from cell to cell, $E_{SC}$ increases
exponentially. Hence, a delay-cell implementing the n-th exponent
will expend $2^n\times$ the $E_{SC}$ of the cell implementing the first.
For an n-bit multiplier, the total short-circuit energy lost is
$\left(2^{n+1} - 1\right)E_{SC}$. Thus, an exponentially growing energy
consumption motivates an alternative inverting mechanism.
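The closed form for the total short-circuit energy is a geometric sum, which the following sketch verifies. It assumes, as the closed form implies, that the cells cover exponents 0 through n and that the energy of each cell scales as 2^i times that of the first; `total_short_circuit_energy` is a hypothetical helper and `e_sc` is normalised to 1.

```python
def total_short_circuit_energy(n, e_sc=1.0):
    """Sum the short-circuit energy of cells with exponents 0..n."""
    return sum((2 ** i) * e_sc for i in range(n + 1))


# The explicit sum matches the closed form (2^(n+1) - 1) * E_SC.
for n in range(1, 8):
    assert total_short_circuit_energy(n) == 2 ** (n + 1) - 1
```

Each additional bit therefore roughly doubles the total short-circuit energy, which is the cost the standalone pFET detector is meant to avoid.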
The low-energy alternative to the CMOS inverter is a
standalone pFET (M8 in Fig. 6), due to its switch-like I-V
relationship. If it were an ideal switch, with a switching
voltage $V_S$ ($> \Delta V^*_{0,max}$), VRE would jump to VDD after a
fixed delay following $\Delta V^*(t) = V_{thp}$. This would conserve
the linearity of Eq. 5 with respect to VA, as it only adds a
constant delay. However, a real pFET has a close-to-exponential
I-V relationship, so conservation of linearity needs to
be established, or at least the constraints for maximal linearity
determined.
With the assumption of exponential I-V characteristics and
large output-resistance (gDS), the sub-threshold current can be
expressed as a function of the gate-source voltage ($= \Delta V^*_0$)
using the following equation:
$$I = I_0 \exp\left(\frac{\Delta V^*_0}{V_T}\right), \tag{13}$$
where $V_T$ is the thermal voltage. Eq. 13 is valid only for
$\Delta V^*_0 < V_{thp}$; for $V_{GS} > V_{thp}$, the I-V relationship is usually
a polynomial of degree 2 or less, moving the switch away from
ideal behavior.
For a VA that linearly decreases from $\Delta V^*_0$ with a steady
rate R, VRE (Fig. 6) can be expressed as a function of time
using:
$$V_{RE}(t) = \frac{I_0}{C}\frac{V_T}{R}\exp\left(\frac{\Delta V^*_0}{V_T}\right)\left(\exp\left(\frac{Rt}{V_T}\right) - 1\right), \tag{14}$$
where $R = I^*/C^*$ is the rate of change of $V^*$ with time, and $C$
is the net capacitance at the drain of M8. When $V_{RE}(t) = V_{thn}$,
the half-latch is set up and the voltage at the OFE node (Fig.
6) falls to 0. Thus, the time taken from the start of discharge
($t = 0$) to the drop in the OFE-node voltage to zero ($t = T_d$) is:
$$T_d = \frac{V_T}{R}\ln\left(\frac{R\,C\,V_{thn}}{I_0 V_T}\exp\left(-\frac{\Delta V^*_0}{V_T}\right) + 1\right). \tag{15}$$
$T_d$ becomes a linear function of $\Delta V^*_0$ under the constraint:
$$\frac{R\,C\,V_{thn}}{I_0 V_T}\exp\left(-\frac{\Delta V^*_0}{V_T}\right) \gg 1. \tag{16}$$
Putting $R = I^*/C^*$, this inequality may alternatively be written
as:
$$\frac{I^*}{C^*}\,\frac{C}{I_0\exp\left(\frac{\Delta V^*_0}{V_T}\right)}\,\frac{V_{thn}}{V_T} \gg 1. \tag{17}$$
Since $\Delta V^*_0$ varies inversely with $C^*$ (from Eq. 9), the
denominator is a monotonically decreasing function of $C^*$. Then,
as per this inequality, $C^*$ should be greater than a critical
capacitance $C^*_{min}$. This inequality is used in Sec. IV-A for
establishing constraints on $C^*$ and $n$.
Under the validity of this inequality, $T_d$ can be expressed
as:
$$T_d = \frac{V_T}{R}\ln\left(\frac{R\,C\,V_{thn}}{I_0 V_T}\right) - \frac{\Delta V^*_0}{R}, \tag{18}$$
matching the expectation of $T_d$'s linearity with $\Delta V^*_0$, or $V_A$.
Note that the latch-point, or the value of $V^*$ when $V_{RE} = V_{thn}$,
is a constant, independent of $V^*_0$ (or $V_A$), and expressible
as:
$$\Delta V^*_{th} = \Delta V^*_0 + R\,T_d = V_T\ln\left(\frac{R\,C\,V_{thn}}{I_0 V_T}\right). \tag{19}$$
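The quality of the linear approximation can be checked numerically by comparing Eq. 15 with Eq. 18. The sketch below does this for two initial drops; the device values (`r`, `c`, `i0`, `v_t`, `v_thn`) are hypothetical and were chosen only so that the constraint of Eq. 16 is comfortably satisfied.

```python
import math


def td_exact(dv0, r, c, i0, v_t, v_thn):
    """Eq. 15: delay of the pFET threshold detector."""
    x = (r * c * v_thn) / (i0 * v_t) * math.exp(-dv0 / v_t)
    return (v_t / r) * math.log(x + 1.0)


def td_linear(dv0, r, c, i0, v_t, v_thn):
    """Eq. 18: approximation valid when the constraint of Eq. 16 holds."""
    return (v_t / r) * math.log((r * c * v_thn) / (i0 * v_t)) - dv0 / r


# Hypothetical device values, chosen only so that Eq. 16 is satisfied.
p = dict(r=1e8, c=1e-15, i0=1e-12, v_t=0.0259, v_thn=0.32)
for dv0 in (0.1, 0.2):
    exact, linear = td_exact(dv0, **p), td_linear(dv0, **p)
    latch_point = dv0 + p["r"] * exact  # Eq. 19, expected to be nearly constant
    print(f"dv0={dv0:.1f} V  exact={exact:.4e} s  linear={linear:.4e} s  "
          f"latch={latch_point:.4f} V")
```

For these assumed values the two expressions agree closely and the latch point barely moves with the initial drop, which is the behaviour Eqs. 18 and 19 predict. As the left-hand side of Eq. 16 approaches 1, the `+ 1` term in Eq. 15 stops being negligible and the two expressions diverge.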
Though exponential I-V characteristics are assumed for
M8, in reality they are exponential only for sub-threshold gate
voltages. For devices with power-law I-V relations, Eq. 14 is
re-derived with the modified I-V, and the constraints of Eq. 11
re-determined. For a $p-1$ power current-voltage relationship,
$$\frac{I^*}{C^*}\,\frac{C}{I(V_A, C^*)}\,\frac{V_{RE}}{V_{G0}/p} \gg 1. \tag{20}$$
For an ideal switch ($p \to \infty$), the constraint is trivially satisfied
and the TD stage does not contribute to distortion. As the I-V
relationship of the pFET moves away from step-like behaviour
towards linearity ($p \to 0$), ensuring linearity of the delay-VA
characteristics becomes harder.
F. Noise-modelling
The delay-cell consists essentially of two current-integrators
that accumulate the accompanying noise-current, from
the arrival of the falling-edge (t = 0) to the latch-up (t = Td). This
leads to a net temporal shift, or jitter, in
the output falling-edge. To simplify noise modeling, it is
assumed that:
1) Out of the three, only two processes contribute to the
jitter: steady discharge and threshold-detection. (Initial
discharge is done within a period much smaller than
Td, so it contributes negligibly to the net jitter.)
2) The net jitter is much smaller than Td.
To enable design of the delay-cell and multiplier, the two
jitter components are modelled as functions of C∗ and I∗ (the
design variables), and an upper limit is set on the noise, which yields
constraints on the design variables and n.
1) Jitter from steady discharge core: For this stage, the
primary contributor of jitter is the noise-current from M4,5, which
accumulates into C∗. To simplify the modelling, it is assumed
that the noise-current out of M4 circulates within itself, and hence
contributes negligibly to the jitter. With this assumption, the
stage reduces to a noisy FET discharging a fixed capacitor,
jitter modelling for which has already been done for ring
oscillators [24]. For an inverter-type ring-oscillator, the
jitter-per-stage was modelled as:
$$\Delta t^2_{Dn} = \frac{4kT\gamma g_{d0}}{2I^{*2}}\,T_d, \tag{21}$$
where $\gamma$ is the excess noise factor and $g_{d0}$ is the drain-source
conductance at $V_{DS} = 0$. This naturally extends to the
proposed delay-cell, with the exception that $T_d$ is dependent
on the rate of discharge. Since $T_d$ can also be expressed as:
$$T_d = \frac{C^*}{I^*}\left(\Delta V^*_{th} - \Delta V^*_0\right), \tag{22}$$
the expression for jitter then becomes:
$$\Delta t^2_{Dn} = \frac{4kT\gamma g_{d0}}{2I^{*3}}\,C^*\left(\Delta V^*_{th} - \Delta V^*_0\right). \tag{23}$$
For simplicity, the dependence of jitter on $\Delta V^*_{th}$ and $\Delta V^*_0$ is
neglected and the jitter is defined for the instant when $V^* = V_{DD}/2$.
Owing to the fact that $g_{d0}$ has a linear dependence
on current, the jitter is then compactly expressible as:
$$\Delta t^2_{dn} = K\,\frac{C^*}{I^{*2}}, \tag{24}$$
where $K$ is a temperature- and technology-dependent constant.
For model validation, the jitter is simulated in software,
using the BSIMv4 (tt) model for IBM's 130nm technology. The
resulting jitter, with only the noise of M4−5 turned on, is plotted
versus C∗ and I∗ in Fig. 11.
Instead of Eq. 24, the following equation is used, as it fits the
experimental data better (R-sq. of 0.982, from 10 iterations):
$$\Delta t^2_{dn} = K\,\frac{C^*}{I^{*p}}, \tag{25}$$
where $K = 2.95\times 10^{-16}$ and $p = 2.46$.
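One consequence of the fitted model of Eq. 25 for a multiplier is how quickly the jitter variance grows when the current is halved for the next bit. The sketch below looks only at this ratio, because the text does not state the unit convention behind the fitted K, so absolute values computed here would not be meaningful. The function name and the sample inputs are hypothetical.

```python
def jitter_variance(c_star, i_star, k=2.95e-16, p=2.46):
    """Fitted steady-discharge jitter model of Eq. 25."""
    return k * c_star / (i_star ** p)


# Compare two cells that differ only by a factor of 2 in current.
fast = jitter_variance(c_star=2.2, i_star=1.0)
slow = jitter_variance(c_star=2.2, i_star=0.5)
ratio = slow / fast  # equals 2**p regardless of the units used in the fit
```

With p = 2.46 the ratio is 2^2.46, roughly 5.5, so under this model each halving of I∗ multiplies the steady-discharge jitter variance by about that factor, more than the factor of 4 that Eq. 24 would give.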
Fig. 11. Steady discharge-stage's jitter, ∆t²Dn (s²): (a) iso-I∗, vs. C∗ (F), (b) iso-C∗, vs. I∗ (A)
Fig. 12. Variance in VRE and jitter due to M8's noise-current, vs. R (V/s): (a) variance in VRE (V²), (b) jitter (s²)
2) Jitter from threshold detector: For M8, since the input
gate-source voltage ($\Delta V^*$) increases linearly with time and
the drain-current exponentially, it is assumed that the RMS noise
current ($i_n$) out of M8 increases exponentially. Thus, intuitively,
at any given instant of time after the falling-edge's arrival, only the noise
current from the past 3-4 $V_T$ of drop in $\Delta V^*$ is dominant.
If $\Delta v_n$ is the deviation in $V_{RE}$ at $t \to T_d^-$, then for a
constant $i_n$, we have:
$$\Delta v^2_n \propto \frac{i^2_n}{\Delta f}\,\frac{T_{dn}}{C^2}. \tag{26}$$
Since $i^2_n$ varies exponentially over the duration $t_{dn}$, this
equation cannot be applied without adjustments. Thus, the
following equation is used:
$$\Delta v^2_n \propto \frac{\int_0^{T_d} 4kT\gamma\,g_{d0}(t)\,dt}{C^2}. \tag{27}$$
Letting $g_{d0} = G_0\exp\left(\frac{\Delta V^*_0 + Rt}{V_T}\right)$, we get:
$$\Delta v^2_n \propto \int_0^{T_d} 4kT\gamma\,G_0\exp\left(\frac{\Delta V^*_0 + Rt}{V_T}\right)dt = \beta\,\frac{V_T}{R}\exp\left(\frac{\Delta V^*_{th}}{V_T}\right) = \beta\,\frac{C\,V_{thn}}{I_0}, \tag{28}$$
where $\beta = 4kT\gamma G_0$. This equation establishes the
independence of $v_n$ from $R$, which is confirmed by Fig. 12. In the
figure, $R$ is varied by a factor of more than 10×, but less than
a 2× rise is seen in $\Delta v^2_n$.
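The R-independence claimed by Eq. 28 can be checked by integrating the exponential conductance numerically up to the latch point of Eq. 19. The sketch below does so with a plain trapezoidal rule; the integral is normalised by 4kTγ (so only G0 appears), and every device value is a hypothetical placeholder rather than a number from the text.

```python
import math


def integrated_noise(r, dv0=0.1, v_t=0.0259, v_thn=0.32,
                     c=1e-15, i0=1e-12, g0=1.0, steps=20000):
    """Trapezoidal estimate of the integral in Eq. 28, divided by 4kT*gamma."""
    dv_th = v_t * math.log(r * c * v_thn / (i0 * v_t))  # latch point, Eq. 19
    t_d = (dv_th - dv0) / r                             # time to reach the latch point
    h = t_d / steps

    def g_d0(t):
        # Conductance grows exponentially as the gate drive ramps at rate r.
        return g0 * math.exp((dv0 + r * t) / v_t)

    total = 0.5 * (g_d0(0.0) + g_d0(t_d))
    for k in range(1, steps):
        total += g_d0(k * h)
    return total * h


v_slow = integrated_noise(1e7)
v_fast = integrated_noise(1e8)
closed_form = 1.0 * 1e-15 * 0.32 / 1e-12  # G0 * C * V_thn / I0
```

Under these assumptions a tenfold change in R leaves the integral essentially unchanged and close to the closed form, because a faster ramp shortens the integration window by the same factor by which it raises the latch-point conductance.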
Next, the relationship between $\Delta t_{Dn}$ and $R$ is determined.
Similar to the approach adopted in [24], $\Delta t_{Dn}$ can be found
by extrapolating the noisy $V_{RE}$ along the noise-less $V_{RE}$, right
before it latches up:
$$\Delta t^2_{Dn} = \left(\frac{dV_{RE}}{dt}\right)^{-2}\Delta v^2_n \propto \frac{v^2_n}{R^2}, \tag{29}$$
where Eq. 14 was used for $dV_{RE}/dt \propto R$. Simulated jitter,
with only M8's noise turned on, is plotted in Fig. 12. For
minimally sized M7−14, the fitted model from 10 iterations
is:
$$\Delta t^2_{dn} = K_2\,\frac{1}{R^{1.5}}, \tag{30}$$
where $K_2 = 1.29\times 10^{-10}$. Thus, the actual exponent is less
than predicted.
Fig. 13. Constraints 1, 2 and 3 for 4, 5 and 6 bit multipliers, plotted as C∗ (F) vs. I∗fast (A): (a) n=4, (b) n=5, (c) n=6 (no solution)
IV. MIXED-SIGNAL DELAY MULTIPLIER
With the delay-cell design considerations discussed, next,
the necessary steps to employ the cells within a multiplier
are presented: (1) use of constraints to find the valid re-
gion of design and operation (2) bias-circuit design for I∗’s
exponentiation. Lastly, through simulations, the functionality
of cascaded delay-cells as multiplier is validated and the
key energy components for each multiplication operation are
identified.
A. Optimizing C∗ and No. of Bits
Using the inequalities involving C∗ developed in Sec. III-C and
Sec. III-E, and the noise models of Sec. III-F, the constraints
on C∗ and I∗f are determined.
1) Linearity of voltage-initialization: In the inequality of
Eq. 11, putting in the model parameters gives constraint 1:
$$C^* > 2.2\,\mathrm{fF}. \tag{31}$$
For IBM's 130nm technology used in this work, this
corresponds to an inverted nMOSCAP single-finger width of
1.28 µm. This constraint is marked by '1' in Fig. 13a-c.
2) Linearity of threshold detection: Since the current of the slowest cell is 2^n times smaller than I∗f, constraint 2 from Eq. 17 becomes:
$$\frac{2^{-n} I_f^*}{C^*}\,\frac{C}{I_0 \exp\!\left(\dfrac{\Delta V_0^*(V_A, C^*)}{V_T}\right)}\,\frac{V_{thn}}{V_T} > 1 \tag{32}$$
Only C∗ and I∗f are design variables; the rest (C, Vthn, VT and I0) are constants. To simplify the analysis, the denominator is maximized over VA, i.e. ∆V∗0(VA = 1.2V, C∗) is used. Although the argument of the exponential, ∆V∗0(VA, C∗), was modelled in Sec. III-C, the data plotted in Fig. 8 is used here. This constraint is marked by ’2’ in Fig. 13a-c.
3) Upper limit on jitter: The referential delay of the fastest cell, from Eq. 5, is:
$$\Delta t_D = -\frac{CS}{I_f^*}\,(V_A - V_{A0})$$
If the jitter from the steady discharge is denoted by ∆tDn,1 and that from TD by ∆tDn,2, the net peak jitter (three times the combined jitter) must be smaller than the maximum ref. delay of the fastest cell. For VA0 = 0.75V,
$$3\sqrt{\Delta t_{Dn,1}^2 + \Delta t_{Dn,2}^2} \le 0.4\,\frac{CS}{I_f^*}$$
$$3\sqrt{K\,\frac{C^*}{\left(2^{-n} I_f^*\right)^{2.46}} + K_2\left(\frac{C^*}{2^{-n} I_f^*}\right)^{1.5}} \le 0.4\,\frac{CS}{I_f^*}. \tag{33}$$
Since the LHS increases monotonically with C∗, Eq. 33 gives an upper limit on C∗ for a given n. This constraint is marked by ’3’ in Fig. 13a-c.
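Because the left-hand side of Eq. 33 grows with C∗, the upper limit on C∗ can be located numerically by bisection. The sketch below illustrates this under loudly stated assumptions: the values `K1_ASSUMED` (standing in for K) and `CS_ASSUMED` (standing in for the product CS) are hypothetical placeholders, not values from the paper, and all function names are illustrative. Only K2 is taken from Eq. 30.

```python
import math

K2_FIT = 1.29e-10    # fitted constant from Eq. 30
K1_ASSUMED = 1e-25   # hypothetical stand-in for K (steady-discharge jitter)
CS_ASSUMED = 4e-16   # hypothetical stand-in for the product CS in Eq. 33


def peak_jitter(c_star, i_fast, n_bits, k1=K1_ASSUMED, k2=K2_FIT):
    """Left-hand side of Eq. 33 for the slowest cell (current 2**-n * I_f*)."""
    i_slow = i_fast * 2.0 ** (-n_bits)
    var_discharge = k1 * c_star / i_slow ** 2.46
    var_td = k2 * (c_star / i_slow) ** 1.5
    return 3.0 * math.sqrt(var_discharge + var_td)


def jitter_budget(i_fast, cs=CS_ASSUMED):
    """Right-hand side of Eq. 33."""
    return 0.4 * cs / i_fast


def max_c_star(i_fast, n_bits, lo=1e-18, hi=1e-12, iters=200):
    """Largest C* in [lo, hi] meeting Eq. 33, or None if even `lo` fails."""
    budget = jitter_budget(i_fast)
    if peak_jitter(lo, i_fast, n_bits) > budget:
        return None
    if peak_jitter(hi, i_fast, n_bits) <= budget:
        return hi
    for _ in range(iters):
        mid = math.sqrt(lo * hi)  # bisect on a logarithmic scale
        if peak_jitter(mid, i_fast, n_bits) <= budget:
            lo = mid
        else:
            hi = mid
    return lo


for n in (4, 5, 6):
    print(n, max_c_star(1e-6, n))
```

With real values of K and CS substituted, sweeping `i_fast` in the same way would trace a curve of the kind labelled ’3’ in Fig. 13; the numbers printed here only reflect the placeholder constants.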
Fig. 13 plots the constraints for 4-, 5- and 6-bit multipliers. For 4 and 5 bits, the valid region of operation lies between the curves of constraints 1, 2 and 3, marked by double-sided arrows; for 6 bits, no solution exists. For the 5-bit multiplier, C∗ = 2.2fF and I∗f = 1µA emerge as the design point, as this choice works for all multipliers with fewer than 6 bits and minimizes the multiplication latency and energy consumption.
B. Biasing circuit
Accurate biasing of the current sources is required to ensure low output distortion. The biasing circuit must meet two specifications:
1) As discussed in Sec. IV-A, I∗f is achievable only with sub-threshold transistors in the technology node employed. Hence, the biasing circuit is designed for sub-threshold currents and works well only in this regime.
2) The current source within each cell consists of a pair of series NFETs (M4 and M5 in Fig. 6). M5’s gate-voltage (primary bias) is set such that it sinks 2^−n I∗, and its drain-voltage is fixed close to 100mV (≈ 4VT). The drain voltage is maintained by M4, gated with a secondary bias approximately 100mV above M5’s gate-voltage.
The circuit (Fig. 14) has two branches: source and scaling. The source branch (M1−6) generates biasing voltages that depend on a programmable voltage, VREF. The scaling branch (M7−12) first uses those biasing voltages to generate currents scaled in powers of 2 (using transistor widths) and then generates the bias for the cells’ current sources using self-biasing. Here, the primary bias out of M12 is denoted as VB1 and the secondary bias out of M10 as VB2.
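To give a feel for the size of the gate-voltage steps involved, the sketch below uses the textbook first-order sub-threshold law, I ∝ exp(VGS/(m·VT)), under which halving the current takes a gate-voltage reduction of m·VT·ln 2. This is an illustration only and not the paper's bias circuit: the slope factor `SLOPE_M`, the example VREF of 0.4 V and the function names are assumptions, and body effect and drain dependence are ignored.

```python
import math

V_T = 0.025     # thermal voltage in volts (the text takes 100 mV as about 4 V_T)
SLOPE_M = 1.3   # hypothetical sub-threshold slope factor


def current_ratio(delta_vgs, slope=SLOPE_M, v_t=V_T):
    """I(V_GS + delta_vgs) / I(V_GS) for an exponential sub-threshold law."""
    return math.exp(delta_vgs / (slope * v_t))


def gate_step_per_halving(slope=SLOPE_M, v_t=V_T):
    """Gate-voltage reduction that halves the sub-threshold current."""
    return slope * v_t * math.log(2.0)


def primary_bias(v_ref, exponent, slope=SLOPE_M, v_t=V_T):
    """Gate voltage for a cell sinking 2**-exponent of the fastest cell's
    current, taking the fastest cell's gate voltage as V_REF (Eq. 34)."""
    return v_ref - exponent * gate_step_per_halving(slope, v_t)


biases = [primary_bias(0.4, i) for i in range(5)]
print([round(v, 4) for v in biases])
```

With these assumed values each halving of the current costs roughly 22.5 mV of gate voltage, so a few bits of binary scaling fit within a small fraction of a volt.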
In the source branch, M1 converts the reference voltage VREF into the current IBIAS = I∗f. M2 and M3, being self-biased in the saturation regime, push up the gate voltages of the tri-cascode enough to keep the mirror transistors (M10−11) of the scaling branch saturated. M4−6 produce the multiplier-cascode’s bias.
In the scaling branch, the widths of M7−9 are down-scaled by 2^−n w.r.t. the source cascode’s width, which down-scales the current in the same proportion. M12 is self-biased to accept this current and generates the primary bias VB1. The gate voltage of the FET with the largest current exponent (or, the smallest delay exponent) is:
$$V_{B1,f} \approx V_{REF} \tag{34}$$
Its drain-voltage is maintained at 100mV using a fixed-bias M11. With M11’s size down-scaled by 2^−i (i being the exponent), its source voltage is maintained at a constant value. M10, also a down-scaled transistor, generates the secondary bias (VB2). M10’s width is adjusted using parametric analysis to keep its self-bias 100mV above VB1. To counter the lower turn-on voltages of the scaling branches, an additional exponent-dependent scaling of M10 is needed:
$$W_{M10,i} \approx (1.3)^{-i}\, W_{M10,\max} \tag{35}$$
The response of the bias circuit is plotted in Fig. 15: Fig. 15a shows the bias current input to the multiplier’s cascode as VREF varies, and Fig. 15b-d show VB1, VB2 and VB2 − VB1 of the down-scaled multiplier branches (up to 8 bits).
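As a small worked instance of Eq. 35, the sketch below lists the additional relative width scaling of M10 for the first few exponents. The function name and the normalisation to a unit maximum width are illustrative assumptions.

```python
def m10_width(exponent, w_max=1.0, base=1.3):
    """Eq. 35: W_M10,i ~ base**(-exponent) * W_M10,max."""
    if exponent < 0:
        raise ValueError("exponent must be non-negative")
    return w_max * base ** (-exponent)


relative_widths = [m10_width(i) for i in range(5)]
print([round(w, 3) for w in relative_widths])
```

Each step in the exponent shrinks this extra factor by 1.3×, which is mild compared with the 2× current scaling per bit.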
To minimize distortion, it is essential that the M7 transistors of all exponents see the same drain-source voltage. Besides using 3-level cascoding, the length of all transistors within the cascode is increased to 10× the minimum to reduce channel-length modulation (CLM) and short-channel effects, which lower rDS. The sizes of all transistors are summarized in Table I.
C. Multiplier Simulation
Using the peripheral elements described in Sec. II, the multiplier is simulated with a transient simulator in IBM 130nm technology.
Fig. 14. Biasing circuit. M1−6 constitute source branch and M7−12
constitute the scaling branch
Fig. 15. Biasing circuit outputs versus VREF (V): (a) source branch’s current IBIAS (A), with the levels I∗ and 2^−8 I∗ marked, (b) output VB1 (V) for the fastest and slowest branches, (c) output VB2 (V), (d) VB2 − VB1 (V).
1) Functionality test: A 5-bit multiplier composed of delay-cells cascaded as described in Sec. II was simulated. With VA0 = 0.75V, Fig. 16a plots the ref. delay of the multiplier against VA, with the weight W as a parameter. The ref. delay with W as the independent variable and VA as the parameter is plotted in Fig. 16b. Since the output ref. delay is distorted for VA < 75mV, the input range of the multiplier is 75mV to 1.2V.
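The transfer characteristics of Fig. 16 can be compared against an idealized model in which each cell contributes the referential delay of the fastest cell scaled by its power of 2, and only cells whose weight bit is 1 contribute. The sketch below is such a model, not the simulated circuit: `DELAY_PER_VOLT` (standing in for CS/I∗f) is a hypothetical value, and the indexing of bit i to a current of 2^−i I∗f is an assumption.

```python
V_A0 = 0.75              # reference input in volts, as in the functionality test
DELAY_PER_VOLT = 0.4e-9  # hypothetical CS / I_f* of the fastest cell (s/V)


def cell_ref_delay(v_a, exponent, delay_per_volt=DELAY_PER_VOLT, v_a0=V_A0):
    """Referential delay of the cell whose current is 2**-exponent of I_f*."""
    return -(2 ** exponent) * delay_per_volt * (v_a - v_a0)


def multiplier_ref_delay(v_a, weight, n_bits=5):
    """Sum of the referential delays of the cells selected by the bits of |W|."""
    if not 0 <= weight < 2 ** n_bits:
        raise ValueError("weight magnitude out of range")
    return sum(
        cell_ref_delay(v_a, i) for i in range(n_bits) if (weight >> i) & 1
    )


for w in (1, 10, 31):
    print(w, multiplier_ref_delay(1.2, w), multiplier_ref_delay(0.075, w))
```

In this ideal model the output is proportional to |W|·(VA − VA0) and vanishes at VA = VA0, which is the bilinear behaviour that the iso-|W| and iso-VA plots are meant to demonstrate.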
2) Energy analysis and simulation results: Within a delay cell, the key components of energy are:
1) EC∗: Energy used in charging C∗ for each computation. It is given by:
$$E_{C^*} = C^* V_{DD}^2 \tag{36}$$
The actual value may vary due to parasitic capacitance and the dependence of C∗ on V∗.
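TABLE I
FET SIZES FOR THE BIASING CIRCUIT
| FET Label | Width (in units of 160nm) | Length (in units of 120nm) |
|-----------|---------------------------|----------------------------|
| M1        | 1                         | 1                          |
| M2−3      | 2^n                       | 1                          |
| M4−6      | 10 × 2^n                  | 10                         |
| M7−9      | 10 × 2^i                  | 10                         |
| M10       | 2.6^i                     | 10                         |
| M11       | 2^i                       | 10                         |
| M12       | 1                         | 1                          |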
Fig. 16. 5-bit multiplier transfer characteristics: ref. delay tD (s, axis scaled by 10^−9) versus (a) VA (V) for iso-|W| curves and (b) |W| for iso-VA curves, with the valid VA range marked.
2) ETD: Energy stored at the node corresponding to VRE (Fig. 6), once ∆V∗ crosses the threshold. It is given by:
$$E_{TD} = C_{RE} V_{DD}^2 \tag{37}$$
where CRE is the net capacitance at the node. Part of it comes from the thresholding pFET M8 (ETD1) and the rest from the latching pFET M11 (ETD2).
3) EPU: Energy used in the pull-up of the event-propagating wires of the delay cell (OFE node in Fig. 6).
4) EINV: Energy used in inverting the input falling edge to a rising edge (IFE′, input to M6 in Fig. 6).
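As a rough first-order check of Eq. 36, the sketch below evaluates C∗·VDD^2 for the design value C∗ = 2.2 fF. The supply VDD = 1.2 V is an assumption (taken equal to the maximum VA), and the function name is illustrative.

```python
C_STAR = 2.2e-15  # design-point capacitance from Sec. IV-A (F)
V_DD = 1.2        # assumed supply voltage (V), taken equal to the maximum V_A
N_BITS = 5


def charging_energy(capacitance, v_dd):
    """Eq. 36 form: energy drawn to charge a capacitor from 0 V to V_DD."""
    return capacitance * v_dd ** 2


per_cell = charging_energy(C_STAR, V_DD)
print(f"per cell: {per_cell * 1e15:.2f} fJ, "
      f"{N_BITS} cells: {N_BITS * per_cell * 1e15:.2f} fJ")
```

This ideal estimate comes to about 3.2 fJ per cell. The simulated EC∗ in Table II is larger, which may reflect the parasitic capacitance and voltage dependence of C∗ noted in item 1.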
Next, the values of these components are determined by simulating the 5-bit multiplier (schematic) for one cycle of computation, with arguments VA = 1.2V and |W| = 31. ETD, EPU and EINV are determined during the computation phase, and EC∗ and EPU during the pre-charge phase. The energy components and their simulated values are listed in Table II. Comparing their sum with the simulated total shows that the listed components account for almost all the expended energy.
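The comparison between the component sum and the simulated total can be reproduced directly from the Table II entries, as in the short sketch below. The dictionary keys are just labels for the table rows.

```python
# Energy per MAC in fJ, copied from Table II
components_fj = {
    "E_C*": 34.0,
    "E_TD1": 5.6,
    "E_TD2": 8.8,
    "E_PU": 46.0,
    "E_INV": 16.0,
}
SIMULATED_TOTAL_FJ = 116.0
N_BITS = 5

component_sum = sum(components_fj.values())
coverage = component_sum / SIMULATED_TOTAL_FJ
per_bit = component_sum / N_BITS
print(f"sum = {component_sum:.1f} fJ, coverage = {coverage:.1%}, "
      f"per bit = {per_bit:.1f} fJ")
```

The listed components add up to 110.4 fJ, about 95% of the simulated 116 fJ, and the two largest contributors are EPU and EC∗.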
V. DISCUSSION AND BENCH-MARKING
The constraint on noise was chosen such that the peak jitter (3∆tDn) from the slowest delay-cell is less than the maximum referential delay of the fastest delay-cell. However, for certain inputs, the output referential delay of a MAC can be zero, which makes it impossible for the noise to always be smaller than the output signal. Thus, the chosen constraint is a reasonable one, as it grants the energy benefit of delay-based analog computing and simultaneously prevents excessive signal corruption by the noise. Depending on the SNR specification of an application, a much tighter constraint on noise may be placed, which, in effect, reduces the number of bits. Fig. 17 plots the number of bits possible within the multiplier as a function of the excess jitter margin (ε), where ε is the ratio of the maximum ref. delay of the fastest cell to the peak jitter. At ε = 1, the number of bits is the highest, and it decreases to 1 at ε ≈ 14.
TABLE II
DELAY-CELL ENERGY COMPONENTS
| Component         | Energy/MAC (fJ) | Energy/MAC/bit (fJ) |
|-------------------|-----------------|---------------------|
| EC∗               | 34.0            | 6.8                 |
| ETD1              | 5.6             | 1.1                 |
| ETD2              | 8.8             | 1.7                 |
| EPU               | 46.0            | 9.0                 |
| EINV              | 16.0            | 3.0                 |
| Total             | 110             | 22                  |
| Total (simulated) | 116             | 23                  |
Fig. 17. Number of bits (n) vs. excess jitter margin (ε)
Note that the multiplier’s EC∗ given in Table II is computed for the case when all weight bits are set to 1 (W = 31). In general, this component of energy depends on (1) the input weight and (2) the number of computations performed by the MAC per second. If the multiplier is used in a sense, or one-time-use, mode, the listed EC∗ is accurate, as all the charged-up energy eventually leaks out, and any new MAC cycle requires the same energy to charge C∗ from zero charge. In acceleration mode, however, where the same weights are used with a varying VA, the C∗ of cells with w = 0 never gets an opportunity to discharge completely, since all falling edges bypass those cells. Before such a C∗ fully discharges, the pre-charge step of a new MAC cycle charges it to VDD from an intermediate voltage, which takes less energy than the listed EC∗.
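The saving in acceleration mode can be illustrated with the standard expression for the energy drawn from the supply when a constant capacitor is charged from an intermediate voltage to VDD, namely C·VDD·(VDD − Vstart), which reduces to the form of Eq. 36 for Vstart = 0. The sketch below assumes a constant C∗ (the real MOS capacitor is voltage dependent), VDD = 1.2 V, and a hypothetical intermediate voltage of 0.9 V.

```python
def recharge_energy(capacitance, v_dd, v_start=0.0):
    """Energy drawn from the supply to charge a constant capacitor
    from v_start up to v_dd."""
    if not 0.0 <= v_start <= v_dd:
        raise ValueError("v_start must lie between 0 and v_dd")
    return capacitance * v_dd * (v_dd - v_start)


C_STAR = 2.2e-15  # design-point capacitance (F)
V_DD = 1.2        # assumed supply voltage (V)

full = recharge_energy(C_STAR, V_DD)          # one-time-use mode
partial = recharge_energy(C_STAR, V_DD, 0.9)  # hypothetical w = 0 cell
print(f"full: {full * 1e15:.2f} fJ, partial: {partial * 1e15:.2f} fJ")
```

For this hypothetical case a bypassed cell that has only leaked down to 0.9 V needs a quarter of the full charging energy in the next pre-charge step.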
In Table III, two simulated performance metrics are compared with the state-of-art: (1) energy consumption per multiply-accumulate, reported above, and (2) multiplication latency (from the absolute delay of the delay-cell). We also compare whether the multiplier allows negative weights and the maximum number of bits that the various mixed-signal MACs can accommodate. From the table, it is seen that:
1) The delay cell consumes 23 fJ per bit of the digital argument/input, which is more than the lowest reported state-of-art energy consumption. However, our energy metric is at 130nm, whereas the lowest state-of-art metric is at 65nm.
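TABLE III
COMPARISON WITH STATE-OF-ART
| Metric              | This work                | Gopal et al. [23] | Miyashita et al. [18] | Sayal et al. [20] | Lee et al. [14] | Skrzyniarz et al. [16] |
|---------------------|--------------------------|-------------------|-----------------------|-------------------|-----------------|------------------------|
| Domain              | Time: delay              | Time: delay       | Time: clocked-delay   | Time: pulse-width | Analog-charge   | Analog-current         |
| Demo. node          | 130nm                    | 65nm              | 65nm                  | 40nm              | 40nm            | 65nm                   |
| Input-width         | Analog-5b                | Analog-3b         | 1b                    | 8b                | Analog-3b       | 2b/1b                  |
| Energy (fJ/MAC/bit) | 23                       | 7                 | 20                    | -                 | 15              | 13                     |
| Latency             | 1b: 1.2ns, 5b: 50ns      | 250ps             | 50ps                  | -                 | -               | -                      |
| Negative weights    | Yes                      | No                | Yes                   | Yes               | No              | No                     |
| Linearity mechanism | Discharge-till-pinch-off | Body-gate biasing | Binary                | Binary            | NA              | NA                     |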
Assuming that the energy scales as L^2, the scaled energy consumption approaches that of the state-of-art.
2) In [23], linearity of the delay-cell is based on back-body biasing, which is theoretically non-linear. In the proposed cell, the output-input characteristics are linear, owing to the linear voltage-initialization step.
3) Despite noise limitations, the number of bits that can be accommodated in the mixed-signal multiplier is higher than the state-of-art. All mixed-signal MACs use an exponential scaling of transistor widths as a way to convert digital signals to analog. In this work, we proposed a biasing circuit that exponentially scales the currents via gate-voltages and avoids area-expensive width-scaling.
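The L^2 scaling assumption in item 1 can be made concrete with a one-line calculation. The sketch below projects the 23 fJ/MAC/bit obtained at 130nm to the 65nm node of the lowest-energy entry in Table III; the function name is illustrative, and the quadratic law is the only scaling effect considered.

```python
def scale_energy(energy_fj, node_from_nm, node_to_nm):
    """Scale an energy figure assuming energy is proportional to L**2."""
    return energy_fj * (node_to_nm / node_from_nm) ** 2


scaled_65nm = scale_energy(23.0, 130.0, 65.0)
print(f"23 fJ at 130nm -> {scaled_65nm:.2f} fJ at 65nm")
```

Under this assumption the projected figure is 5.75 fJ/MAC/bit, which is comparable to the 7 fJ/MAC/bit listed for [23] at 65nm.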
VI. CONCLUSION
In this work, a linearly tunable delay-cell is proposed that realizes the analog input-dependent delay using three sequential sub-processes: (1) an input-dependent charge-up of C∗, (2) its steady discharge via current I∗, and (3) thresholding of its voltage. Each of the sub-processes is analytically modelled, and from these models constraints on C∗ and I∗ for linearity are found. Jitter models, based on prior ones developed for the CMOS inverter ring-oscillator, were modified and validated for the proposed cell. To form a multiplier, delay cells with the same analog input and with I∗ scaled in powers of 2 must be cascaded. Since I∗ is scaled using gate-source voltages, a biasing circuit that accepts a ref. voltage and generates gate biases for the delay cells corresponding to all exponents is proposed and validated. From the constraints on C∗ for linearity and noise, the minimum C∗ was found to be around 2fF and the maximum number of bits supportable to be five. Lastly, we also identify the key energy components of the multiplier, which sum up to about 22 fJ/MAC/bit (Table II) for IBM’s 130nm technology.
REFERENCES
[1] X. Xu, Y. Ding, S. X. Hu, M. Niemier, J. Cong, Y. Hu, and Y. Shi,
“Scaling for edge inference of deep neural networks,” Nature Electron-
ics, vol. 1, no. 4, pp. 216–222, 2018.
[2] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa,
S. Bates, S. Bhatia, N. Boden, A. Borchers, R. Boyle, P.-l. Cantin,
C. Chao, C. Clark, J. Coriell, M. Daley, M. Dau, J. Dean, B. Gelb, T. V.
Ghaemmaghami, R. Gottipati, W. Gulland, R. Hagmann, C. R. Ho,
D. Hogberg, J. Hu, R. Hundt, D. Hurt, J. Ibarz, A. Jaffey, A. Jaworski,
A. Kaplan, H. Khaitan, D. Killebrew, A. Koch, N. Kumar, S. Lacy,
J. Laudon, J. Law, D. Le, C. Leary, Z. Liu, K. Lucke, A. Lundin,
G. MacKean, A. Maggiore, M. Mahony, K. Miller, R. Nagarajan,
R. Narayanaswami, R. Ni, K. Nix, T. Norrie, M. Omernick,
N. Penukonda, A. Phelps, J. Ross, M. Ross, A. Salek, E. Samadiani,
C. Severn, G. Sizikov, M. Snelham, J. Souter, D. Steinberg, A. Swing,
M. Tan, G. Thorson, B. Tian, H. Toma, E. Tuttle, V. Vasudevan,
R. Walter, W. Wang, E. Wilcox, and D. H. Yoon, “In-Datacenter
Performance Analysis of a Tensor Processing Unit,” in Proceedings of
the 44th Annual International Symposium on Computer Architecture.
New York, NY, USA: ACM, 6 2017, pp. 1–12. [Online]. Available:
https://dl.acm.org/doi/10.1145/3079856.3080246
[3] J. Fowers, K. Ovtcharov, M. Papamichael, T. Massengill, M. Liu,
D. Lo, S. Alkalay, M. Haselman, L. Adams, M. Ghandi, S. Heil,
P. Patel, A. Sapek, G. Weisz, L. Woods, S. Lanka, S. K. Reinhardt,
A. M. Caulfield, E. S. Chung, and D. Burger, “A configurable cloud-
Scale DNN processor for real-Time AI,” Proceedings - International
Symposium on Computer Architecture, pp. 1–14, 2018.
[4] V. Sze, Y. H. Chen, T. J. Yang, and J. S. Emer, “Efficient Processing
of Deep Neural Networks: A Tutorial and Survey,” Proceedings of the
IEEE, vol. 105, no. 12, pp. 2295–2329, 2017.
[5] M. Horowitz, “Computing’s energy problem (and what we can do about
it),” Digest of Technical Papers - IEEE International Solid-State Circuits
Conference, vol. 57, pp. 10–14, 2014.
[6] P. Judd, J. Albericio, T. Hetherington, T. M. Aamodt, and
A. Moshovos, “Stripes: Bit-serial deep neural network computing,”
in 2016 49th Annual IEEE/ACM International Symposium on
Microarchitecture (MICRO). IEEE, 10 2016, pp. 1–12. [Online].
Available: http://ieeexplore.ieee.org/document/7783722/
[7] B. Reagen, P. Whatmough, R. Adolf, S. Rama, H. Lee, S. K. Lee, J. M.
Hernandez-Lobato, G. Y. Wei, and D. Brooks, “Minerva: Enabling Low-
Power, Highly-Accurate Deep Neural Network Accelerators,” Proceed-
ings - 2016 43rd International Symposium on Computer Architecture,
ISCA 2016, pp. 267–278, 2016.
[8] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, “XNOR-Net:
ImageNet Classification Using Binary Convolutional Neural Networks,”
pp. 1–17, 3 2016. [Online]. Available: http://arxiv.org/abs/1603.05279
[9] S. Sharify, A. D. Lascorz, K. Siu, P. Judd, and A. Moshovos,
“Loom: Exploiting Weight and Activation Precisions to Accelerate
Convolutional Neural Networks,” 6 2017. [Online]. Available:
http://arxiv.org/abs/1706.07853
[10] M. Courbariaux and I. Hubara, “Binarized Neural Networks: Training
Neural Networks with Weights and Activations Constrained to +1 or
−1,” 2014.
[11] B. Moons, D. Bankman, L. Yang, B. Murmann, and M. Verhelst,
“BinarEye : An Always-On Energy-Accuracy-Scalable Binary CNN
Processor With All Memory On Chip In 28nm CMOS,” no. Ld, pp.
2–5.
[12] T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, and
O. Temam, “DianNao,” in Proceedings of the 19th international
conference on Architectural support for programming languages
and operating systems - ASPLOS ’14. New York, New York,
USA: ACM Press, 2014, pp. 269–284. [Online]. Available:
http://dl.acm.org/citation.cfm?doid=2541940.2541967
[13] M. Kang, S. K. Gonugondla, A. Patil, and N. R. Shanbhag, “A Multi-
Functional In-Memory Inference Processor Using a Standard 6T SRAM
Array,” IEEE Journal of Solid-State Circuits, vol. 53, no. 2, pp. 642–655,
2018.
[14] E. H. Lee and S. S. Wong, “Analysis and Design of a Passive
Switched-Capacitor Matrix Multiplier for Approximate Computing,”
IEEE Journal of Solid-State Circuits, vol. 52, no. 1, pp. 261–271, 1
2017. [Online]. Available: http://ieeexplore.ieee.org/document/7579580/
[15] H. Li, T. F. Wu, S. Mitra, and H. S. Wong, “Resistive RAM-Centric
Computing: Design and Modeling Methodology,” IEEE Transactions on
Circuits and Systems I: Regular Papers, vol. 64, no. 9, pp. 2263–2273,
2017.
[16] S. Skrzyniarz, L. Fick, J. Shah, Y. Kim, D. Sylvester, D. Blaauw,
D. Fick, and M. B. Henry, “24.3 A 36.8 2b-TOPS/W self-
calibrating GPS accelerator implemented using analog calculation in
65nm LP CMOS,” in 2016 IEEE International Solid-State Circuits
Conference (ISSCC). IEEE, 1 2016, pp. 420–422. [Online]. Available:
http://ieeexplore.ieee.org/document/7418086/
[17] Z. Wang and N. Verma, “A Low-Energy Machine-Learning Classifier
Based on Clocked Comparators for Direct Inference on Analog Sensors,”
IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 64,
no. 11, pp. 2954–2965, 2017.
[18] D. Miyashita, R. Yamaki, K. Hashiyoshi, H. Kobayashi, S. Kousai,
Y. Oowaki, and Y. Unekawa, “An LDPC Decoder With Time-Domain
Analog and Digital Mixed-Signal Processing,” IEEE Journal of Solid-
State Circuits, vol. 49, no. 1, pp. 73–83, 1 2014. [Online]. Available:
http://ieeexplore.ieee.org/document/6630119/
[19] G. Li, Y. M. Tousi, A. Hassibi, and E. Afshari, “Delay-Line-Based
Analog-to-Digital Converters,” IEEE Transactions on Circuits and
Systems II: Express Briefs, vol. 56, no. 6, pp. 464–468, 6 2009.
[Online]. Available: http://ieeexplore.ieee.org/document/5075832/
[20] A. Sayal, S. Fathima, S. S. Nibhanupudi, and J. P. Kulkarni, “14.4 All-
Digital Time-Domain CNN Engine Using Bidirectional Memory Delay
Lines for Energy-Efficient Edge Computing,” Digest of Technical Papers
- IEEE International Solid-State Circuits Conference, vol. 2019-Febru,
no. 4, pp. 228–230, 2019.
[21] A. Sayal, S. S. Nibhanupudi, S. Fathima, and J. P. Kulkarni, “A 12.08-
TOPS/W All-Digital Time-Domain CNN Engine Using Bi-Directional
Memory Delay Lines for Energy Efficient Edge Computing,” IEEE
Journal of Solid-State Circuits, vol. 55, no. 1, pp. 60–75, 2020.
[22] D. Miyashita, S. Kousai, T. Suzuki, and J. Deguchi, “A Neuromorphic
Chip Optimized for Deep Learning and CMOS Technology With Time-
Domain Analog and Digital Mixed-Signal Processing,” IEEE Journal of
Solid-State Circuits, vol. 52, no. 10, pp. 2679–2689, 2017.
[23] S. Gopal, P. Agarwal, J. Baylon, L. Renaud, S. N. Ali, P. P. Pande, and
D. Heo, “A Spatial Multi-Bit Sub-1-V Time-Domain Matrix Multiplier
Interface for Approximate Computing in 65-nm CMOS,” IEEE Journal
on Emerging and Selected Topics in Circuits and Systems, vol. 8, no. 3,
pp. 506–518, 2018.
[24] A. Abidi, “Phase Noise and Jitter in CMOS Ring Oscillators,” IEEE
Journal of Solid-State Circuits, vol. 41, no. 8, pp. 1803–1816, 8 2006.
[Online]. Available: http://ieeexplore.ieee.org/document/1661757/
