Self-timed, Minimum Latency Circuits for the Internet of Things by Wheeldon A et al.
Self-timed, Minimum Latency Circuits
for the Internet of Things
Adrian Wheeldon∗, Jordan Morris, Danil Sokolov, Alex Yakovlev
µSystems Group, Newcastle University, Newcastle Upon Tyne, United Kingdom
Abstract
This work presents a design flow for asynchronous, self-timed dual-rail circuits which introduces a timing
assumption in the return-to-spacer phase. The design flow enables power proportionality and is demonstrated
through the design of a 32-bit ripple-carry adder and a 32-bit comparator for internet of things applications.
The designs are synthesized to a 65 nm cell library with state-of-the-art transistor sizing for subthreshold.
Simulation results show improved performance and energy per computation across operating conditions
compared with single-rail equivalents. The design flow allows extension of the power proportional philosophy
to a wider range of circuits.
Keywords: dual-rail, internet of things, power proportional, self-timed
1. Introduction
Nowadays with the proliferation of internet of
things (IoT) devices, we see the utilization of low
power accelerators and signal conditioning circuits
in many different usage scenarios: from hand-held
battery-powered devices; to small, wide sensor net-
work nodes powered by energy harvesters; and
mains-powered smart-home devices which are al-
ways on.
An important factor in the IoT is the reduction
of the functional supply voltage (VDD,min). For
systems powered by energy harvesters, maximum
power point tracking (MPPT) is commonly imple-
mented to extract what little power may be avail-
able [1, 2]. The nature of MPPT can lead to a very
low VDD supplied to the circuit. In many cases it
is preferable for the circuit to continue operation
under this condition, albeit with reduced perfor-
mance. If the circuit did not function at this VDD,
the energy might simply be wasted, or stored in an
inefficient battery system.
∗Corresponding Author
Email addresses: a.r.wheeldon2@ncl.ac.uk
(Adrian Wheeldon), j.morris@ncl.ac.uk (Jordan Morris),
danil.sokolov@ncl.ac.uk (Danil Sokolov),
alex.yakovlev@ncl.ac.uk (Alex Yakovlev)
Asynchronous dual-rail circuits make it possible
to detect when the circuit has completed its
computation by using a self-timed approach [3]. Using
a weakly-indicating circuit (such as one based
on the NCL-X design methodology [4]) with completion
detection, we can implement a design which
is early propagative. Such a design allows flexible
performance which adapts itself to the VDD, and
therefore to the available power when coupled with
an energy harvester. We call this a power proportional
design.
A useful side-effect of a power proportional de-
sign comes from its ability to adapt to the VDD. In
systems where we have high power availability, but
lower energy availability (such as battery-powered
systems), we can artificially lower the VDD, simu-
lating a power-sparse condition. Operating over a
long period of time in this condition leads to low
energy usage overall. In such battery-powered ap-
plications, we are interested in operating at the sys-
tem’s minimum energy point (MEP)—the point at
which the system consumes the minimum amount
of energy per computation. Operating at the MEP
will lead to maximized battery life. The amount
of energy per computation at the MEP (Emin)
occurs close to the point where the leakage en-
ergy of the circuit is equal to its dynamic energy
(Eleak ≈ Edyn). Emin can be characterized as a
function of supply voltage. As a result, we define
VDD,MEP as the supply voltage for which Emin is
achieved. We are interested specifically in operation
where VDD < VTH, since this is where the MEP
tends to lie for complex CMOS designs [5, 6].
Preprint submitted to Elsevier January 9, 2019
In the contrary situation where we have high
power and high energy availability (such as in a
mains-powered device), we can supply a higher
VDD. The power proportional properties of the
system will allow it to operate with higher perfor-
mance, albeit at the expense of greater energy per
operation.
In IoT edge nodes, operands can arrive sporadically
and far apart in time. This leads to long
periods of inactivity in which the circuit is idle
and does no work. In a dual-rail design, we must
normally acknowledge both codeword and spacer
to avoid hazards. However, if we assume that the
idle time is long enough for the circuit to reset to
spacer, then we no longer need to acknowledge the
spacer. This is especially true in single-cycle
accelerators and signal conditioning circuits, where
subsystems are duplicated for performance rather
than shared. This timing assumption can lead to
reduced completion detection circuitry, and therefore
reduced energy consumption, while retaining the
benefits of early propagation and power proportionality.
Here we have discussed how a power proportional
design can operate under three different usage sce-
narios with no extra design effort: energy harvest-
ing, battery-powered and mains-powered. It is for
this reason that we present 32-bit adder and com-
parator designs based on the power proportionality
principle.
The main contributions of this work are as fol-
lows:
• a novel sizing strategy is presented for
CMOS cells which reduces the effects of pro-
cess variation on propagation delay in the sub-
threshold region;
• two types of circuit are identified as can-
didates for early-propagative designs—
namely adders and comparators;
• a design flow is presented for dual-rail IoT
circuits where acknowledgment of the spacer
can be replaced with a timing assumption.
Section 2 introduces the cell library upon which
this work is based. Section 3 demonstrates the dual-
rail design flow. Section 4 uses the power propor-
tional adder design as a case study. Section 5 ap-
plies the same methodology to a comparator de-
sign. Section 6 shows the results of the designs
when benchmarked in a simulation environment.
Section 7 summarizes the work and results.
Throughout the paper, dual-rail refers to the
asynchronous design style which utilizes two wires
per data bit [7]. Single-rail refers to a conventional
circuit utilizing one wire per data bit.
2. Full Diffusion Cell Library
The cell library in [8] is based on a commercial
65 nm low-power process. The library uses a novel,
full diffusion sizing strategy for subthreshold with
100 nm transistor lengths. It features improved de-
lay variability and increased performance in sub-
threshold at the expense of greater leakage current
when compared to libraries sized for superthresh-
old.
During ion implantation in today's deep submicron
technologies, dopants may penetrate more
or less deeply into the channel. Known as random
dopant fluctuation (RDF), this causes up to
70 % variation in propagation delays during subthreshold
operation [9]. The variation due to RDF is proportional to
the inverse square root of the device's diffusion
area, so increasing the gate area reduces it.
The full diffusion sizing strategy maximizes
the gate area, thereby reducing delay variability
due to RDF, whilst maintaining parallel transistors
(contrast Figures 1 and 2). This novel sizing strategy
shows a delay variation of up to 30 % compared
to 200 % for minimum and regular diffusion sizing
strategies [8]. These attributes are summarized for
the standard-drive inverter cells in Table 1. The full
diffusion cells with 1, 2, 3 and 4 fingers are equivalent
to 210 nm, 420 nm, 630 nm and 840 nm regular
cells respectively. Furthermore, performance increases
with the number of fingers at a rate proportional
to leakage power. This leads to higher performance
at a similar energy cost.
Transistor stacks degrade the Ion/Ioff ratio of
cells, reducing performance whilst increasing leak-
age power. In superthreshold the effect is small
enough that several transistors can be stacked with-
out a problem. In subthreshold however, the effect
is greater to the point where only transistor stacks
of two are used. Connecting cells to create more
complex logic is favored over more complex cells.
The transistor stack limitation leads to a small
number of cells in the library, namely: NAND2,
NOR2, INV, AOI22, OAI22.
Each full diffusion cell exists in a one, two, three
and four finger variant. Exceptions are the AOI22
and OAI22 cells which do not exist in the four fin-
ger variant due to cell dimension limitations. An
increased number of fingers results in increased per-
formance and leakage power. These variants give
rise to a performance and leakage range for the syn-
thesis tool to work with.
In superthreshold, carrier mobility decreases with
increasing temperature and is the main cause of
temperature-related variation. In subthreshold, however,
variation in VTH dominates because the transistor
current depends exponentially on VTH.
VTH has a negative correlation with
temperature in subthreshold [10], and therefore cells
perform best at high temperatures and fail at low
temperatures, the opposite of what is observed in
superthreshold.
Figure 1: Full diffusion inverter cells.
Figure 2: Regular diffusion inverter cells.
Table 1: Attributes of X1-drive inverter cells. Reproduced from [8].

Cell                 Delay mean (ns)   Delay std. dev. (ns)   Leakage
Regular Diffusion
  210 nm             140               35.8                   -
  420 nm             155               30.9                   -
  630 nm             159               26.3                   -
  840 nm             164               23.7                   -
Full Diffusion
  1F                 164               23.7                   -
  2F                 128               19.0                   1.19x
  3F                 89.9              12.8                   1.24x
  4F                 68.7              8.50                   2.0x
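One way to compare the rows of Table 1 is to express each standard deviation as a fraction of its mean delay. The short sketch below does this with the values copied from the table. The ratio (often called the coefficient of variation) is a metric chosen here for illustration and is not necessarily the "delay variation" figure quoted from [8]; the names `TABLE_1` and `relative_spread` are hypothetical.

```python
# Propagation delay (mean, standard deviation) in ns, copied from Table 1.
TABLE_1 = {
    "regular 210 nm": (140.0, 35.8),
    "regular 420 nm": (155.0, 30.9),
    "regular 630 nm": (159.0, 26.3),
    "regular 840 nm": (164.0, 23.7),
    "full 1F": (164.0, 23.7),
    "full 2F": (128.0, 19.0),
    "full 3F": (89.9, 12.8),
    "full 4F": (68.7, 8.50),
}


def relative_spread(mean_ns, std_ns):
    """Standard deviation as a fraction of the mean delay (illustrative metric)."""
    return std_ns / mean_ns


for cell, (mean_ns, std_ns) in TABLE_1.items():
    print(f"{cell:>15}: sigma/mean = {relative_spread(mean_ns, std_ns):.1%}")
```

Under this metric the four-finger full diffusion cell has both the smallest absolute spread and the smallest spread relative to its mean.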
3. Design Flow
A dual-rail NCL-X design style is chosen for
its robustness and early-propagative nature as dis-
cussed in Section 1. These attributes allow bet-
ter average-case performance in exchange for silicon
area when compared to a single-rail design. This
design style is not as robust as dual-rail NCL-D
from a timing perspective, however it benefits from
a more tolerable area overhead and better perfor-
mance [4]. A single-rail bundled-data circuit is self-
timed and comes at a much lower area overhead—
however it does not offer the same robustness nor
can it be early-propagative. From here on, all references
to dual-rail refer to the dual-rail NCL-X
design style.
In dual-rail encoding, two wires are used to rep-
resent a codeword. For a single bit x, the dual-rail
encoding consists of the positive and negative rails
{x1, x0}. x = 0 is encoded as {0, 1}, and x = 1 is
encoded as {1, 0}. One of the remaining states—
{0, 0} or {1, 1}—is chosen to represent the empty
state, referred to as a spacer, which separates code-
words temporally so they can be distinguished from
each other. Care must be taken to correctly handle
spacers in the design, otherwise data hazards could
occur where one codeword overtakes another [11].
The remaining state is forbidden and must not be
used.
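The encoding can be summarized in a few lines of code. The following is a minimal sketch assuming the all-zeroes spacer; the function names are hypothetical and rail pairs are written as tuples (x1, x0).

```python
SPACER = (0, 0)      # all-zeroes spacer assumed in this sketch
FORBIDDEN = (1, 1)   # the unused state for this choice of spacer


def encode(bit):
    """Return the rail pair (x1, x0) for a single data bit."""
    return (1, 0) if bit else (0, 1)


def decode(rails):
    """Return the bit, None for a spacer, and raise on the forbidden state."""
    if rails == SPACER:
        return None
    if rails == FORBIDDEN:
        raise ValueError("forbidden dual-rail state")
    x1, _x0 = rails
    return x1


def encode_word(value, width):
    """Encode an unsigned integer as rail pairs, least significant bit first."""
    return [encode((value >> i) & 1) for i in range(width)]


print(encode_word(5, 3))
```

With an all-ones spacer the roles of `SPACER` and `FORBIDDEN` would simply be exchanged.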
Completion detection circuitry can be added to
a dual-rail circuit to make it self-timed. A comple-
tion detector generates a true value when the state
of a bit is either {0, 1} or {1, 0}, indicating a valid
codeword. The outputs of these completion detectors
can be combined to create a done signal for
the entire computation. Figure 3 shows how this
can be done for a circuit using an all-zeroes spacer
({0, 0}). The AND gate denoted with a ‘C’ repre-
sents a Muller C-element [12] which acknowledges
all-zeroes at its inputs. This is in addition to ac-
knowledging all-ones like a regular AND gate. The
C-element ensures that all output signals return to
spacer before the done signal is deasserted. The
OR gates can be replaced with NAND gates if the
circuit uses an all-ones spacer. Although strictly an
XOR gate should be used, we design our circuits to
guarantee the invalid state will not be entered so
these simpler gates can be used. The done signal
facilitates inter-block handshaking to self-time the
design.
Figure 3: An example completion detection tree for a dual-
rail circuit with two outputs.
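The behavior of such a completion detection tree can be modeled in software. The sketch below is a behavioral model only, assuming an all-zeroes spacer, one OR gate per output bit and a single C-element combining them; the class name is hypothetical.

```python
class CompletionDetector:
    """Done signal for an all-zeroes-spacer circuit: OR per bit, C-element on top."""

    def __init__(self):
        self.done = 0  # state held by the Muller C-element

    def update(self, outputs):
        """outputs: list of rail pairs (x1, x0), one per output bit."""
        valid = [x1 | x0 for x1, x0 in outputs]   # per-bit OR gate
        if all(valid):          # every bit holds a codeword: assert done
            self.done = 1
        elif not any(valid):    # every bit is back at spacer: deassert done
            self.done = 0
        return self.done        # mixed inputs: the C-element holds its state


detector = CompletionDetector()
sequence = [
    [(0, 0), (0, 0)],  # spacer
    [(1, 0), (0, 0)],  # first bit valid, second still at spacer
    [(1, 0), (0, 1)],  # both bits valid
    [(0, 0), (0, 1)],  # first bit reset, second still valid
    [(0, 0), (0, 0)],  # spacer again
]
trace = [detector.update(outputs) for outputs in sequence]
print(trace)
```

The done signal rises only once every output is valid and falls only once every output has returned to spacer, which is the acknowledgment of both phases described above.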
Completion detection allows the circuit to be
early-propagative, gives rise to power proportionality,
and minimizes latency. In this work we
have replaced the C-element with an AND gate.
This implementation only acknowledges the all-ones
state, and not the all-zeroes state, with the
side-effect that race hazards can occur in the
return-to-spacer phase. It therefore relies on the assumption
that sufficient time is given for all logic to reset to
spacer before a new operand is applied to the primary
inputs. This assumption is reasonable in IoT
applications where the activity factor is low and there
may be extensive idle time between operands. Figure 4
illustrates how a spacer hazard may occur with
this simplified completion detection. Signals w1, w0,
y1 and y0 are outputs of a dual-rail circuit. input
shows the state of the circuit's inputs (codeword or
spacer). done is generated from simplified completion
detection. Signal y1 did not return to spacer
(logic 0) before the second codeword was applied.
As a result, done is asserted prematurely and the
outputs are invalid. Our assumption is that the
spacer time will always be sufficient for all signals
to return to spacer, and therefore the hazard will
not occur.
Figure 4: A spacer hazard in a dual-rail circuit using simplified
completion detection. (Waveforms: input, alternating between spacer
and codeword; outputs w1, w0, y1 and y0; done, with the hazard marked.)
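The hazard can be reproduced with a small behavioral model. The sketch below compares an AND-based done signal with a C-element-based one on a three-step trace loosely following Figure 4. The specific rail values in the trace are assumptions chosen for illustration, as is the function name `done_history`.

```python
def valid(rails):
    """Per-bit OR gate for an all-zeroes spacer."""
    x1, x0 = rails
    return x1 | x0


def done_history(trace, use_c_element):
    """Return the done signal after each step of the trace."""
    done, history = 0, []
    for outputs in trace:
        bits = [valid(r) for r in outputs]
        if all(bits):
            done = 1
        elif not any(bits) or not use_c_element:
            done = 0      # an AND gate drops as soon as any bit is at spacer
        history.append(done)  # a C-element holds its state on mixed inputs
    return history


# Outputs (w, y) as rail pairs (x1, x0).
hazard_trace = [
    [(1, 0), (1, 0)],   # first codeword: w = 1, y = 1
    [(0, 0), (1, 0)],   # spacer phase: w has reset, y1 is still high
    [(0, 1), (1, 0)],   # second codeword applied early: w is new, y is stale
]

print("AND gate: ", done_history(hazard_trace, use_c_element=False))
print("C-element:", done_history(hazard_trace, use_c_element=True))
```

With the AND gate, done falls and then rises again in the last step even though y still carries the stale value, which is the premature assertion described above. With the C-element, done never falls, so the environment can tell that the spacer has not yet been acknowledged.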
Figure 5 gives an overview of the design flow
used in this work. The first step to obtaining a
dual-rail circuit is the same as for a single-rail circuit:
we must derive a Boolean equation for the output
from the specification. After this, we replace any
XOR/XNOR gates with alternatives since they are
not present in our cell library. At this stage, we
can choose to synthesize the equations to obtain
a single-rail netlist. To obtain a dual-rail netlist,
we instead apply dual-rail expansion as shown in
Figure 6. The single-rail inputs and outputs are
replaced by a positive-rail notation (e.g. a becomes
a1) in the process of positive-rail insertion. Next,
negative-rail optimization is used to replace inverted
literals by their negative-rail equivalent (e.g.
an inverted a1 is replaced by a0). Now that we have
obtained the final equation for the positive rail, we
clone it, exchanging all gates in the equation for their duals
(e.g. AND changes to OR, OR changes to AND,
etc.). We call this gate dualing. This gives the
equation for the negative rail. Together these two
equations form the dual-rail Boolean equations for
the original specification.
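The three expansion steps can be sketched on a small expression tree. The code below is an illustrative model, not the tooling used in this work. It assumes that XORs have already been removed and that inversions appear only on literals. It also assumes that gate dualing swaps the rail of each literal along with the gates, which is what DeMorgan's theorem requires for the negative rail to be the complement of the positive rail. All function names are hypothetical.

```python
from itertools import product

# Expressions are nested tuples:
#   ("var", name), ("not", e), ("and", e1, e2), ("or", e1, e2)
# After expansion, literals become ("rail", name, r) with r = 1 or 0.


def positive_rail_insertion(e):
    if e[0] == "var":
        return ("rail", e[1], 1)
    return (e[0],) + tuple(positive_rail_insertion(s) for s in e[1:])


def negative_rail_optimization(e):
    if e[0] == "rail":
        return e
    if e[0] == "not":
        kind, name, rail = e[1]   # sketch assumes inversions sit on literals only
        assert kind == "rail"
        return ("rail", name, 1 - rail)
    return (e[0],) + tuple(negative_rail_optimization(s) for s in e[1:])


def gate_dualing(e):
    if e[0] == "rail":
        return ("rail", e[1], 1 - e[2])   # assumption: rails swap with the gates
    dual = {"and": "or", "or": "and"}[e[0]]
    return (dual,) + tuple(gate_dualing(s) for s in e[1:])


def evaluate(e, rails):
    """rails maps each input name to its pair (x1, x0)."""
    if e[0] == "rail":
        x1, x0 = rails[e[1]]
        return x1 if e[2] == 1 else x0
    values = [evaluate(s, rails) for s in e[1:]]
    return int(all(values)) if e[0] == "and" else int(any(values))


# a XOR b with the XOR already removed: a.(not b) + (not a).b
a, b = ("var", "a"), ("var", "b")
xor_expr = ("or", ("and", a, ("not", b)), ("and", ("not", a), b))
positive = negative_rail_optimization(positive_rail_insertion(xor_expr))
negative = gate_dualing(positive)
print(positive)
print(negative)
```

For every valid codeword exactly one of the two rails evaluates to 1, and an all-zeroes spacer at the inputs gives an all-zeroes spacer at the output because only positive gates are present.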
Once we have the complete set of equations for
our dual-rail implementation, we can apply spacer-
aware negative-gate optimization. During this opti-
mization, care is taken to ensure that correct spacer
behavior is retained, since introduction of negative
gates can cause mismatched spacer polarity at the
outputs. Completion detection insertion is then
used to make the design self-timed. Finally, we syn-
thesize the equations using a standard commercial
tool as with any single-rail circuit.
In this work, all circuits were synthesized using
Synopsys Design Compiler with a combination
of fingered variants from the regular-VTH full dif-
fusion cell library. The tool chose faster cells with
more fingers in order to optimize the critical path
delay. Fewer-fingered variants were chosen else-
where in order to minimize power dissipation.
Figure 5: Comparison of the design flow for a single-rail
and a self-timed dual-rail circuit, starting with a traditional
boolean equation. (Flowchart. Single-rail path: Boolean Equation
from Specification, XOR Removal, Single-rail Boolean Equation,
Synthesis, Single-rail Netlist. Dual-rail path: Single-rail Boolean
Equation, Dual-rail Expansion, Dual-rail Boolean Equations,
Spacer-aware Negative Gate Optimization, Optimized Dual-rail Boolean
Equations, Completion Detection Insertion, Optimized, Self-timed
Dual-rail Boolean Equations, Synthesis, Self-timed Dual-rail Netlist.
The Cell Library supplies the Cell Descriptions.)
4. 32-bit Ripple-Carry Adder
We illustrate the design flow of Section 3 using
a 32-bit adder, as it is an essential block in any
IoT system. Since leakage power is increased in
the target library compared to libraries with standard
transistor sizing, it is important to use circuit
design techniques that give the design a competitive
Emin. Furthermore, in silicon processes with
reduced feature sizes such as the one used in this
work, Eleak is inherently greater, requiring more effort
in circuit design to keep the energy per computation
low. For these reasons, a ripple-carry architecture
is chosen for its small logic footprint, since
a smaller logic footprint leads to a lower Eleak from
a circuit perspective.
For the 32-bit ripple-carry adder, in the worst
case, the carry signal must travel through all 32 full
adder blocks before the output is valid. However,
due to the carry-propagate and carry-generate characteristics,
it is possible for the output to be valid much
sooner than the worst case: after a delay of log N
carry stages on average [13]. We take advantage
of this by using an early propagative design based
on the dual-rail NCL-X design style discussed in
Section 3.
Figure 6: Dual-rail expansion of a single-rail equation. (Flowchart:
Single-rail Boolean Equation, Positive-rail Insertion, Negative-rail
Optimization, Positive-rail Boolean Equation, then Gate Dualing to give
the Negative-rail Boolean Equation; together these form the Dual-rail
Boolean Equations.)
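The gap between the worst case and the average case can be illustrated with a quick Monte Carlo experiment. The sketch below is not the simulation environment used in this work. It assumes uniformly random operands and counts the number of consecutive full adder stages traversed by the longest-lived carry, starting at the stage that generates it. The function name and the sample size are arbitrary choices.

```python
import math
import random


def longest_carry_chain(a, b, width=32):
    """Stages traversed by the longest-lived carry when adding a and b.

    A carry is born where a_i = b_i = 1 and survives while a_i != b_i.
    """
    longest = current = 0
    for i in range(width):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        if ai & bi:            # generate: a new carry starts here
            current = 1
        elif ai ^ bi:          # propagate: a live carry moves one stage on
            current = current + 1 if current else 0
        else:                  # kill: any live carry stops here
            current = 0
        longest = max(longest, current)
    return longest


rng = random.Random(0)  # fixed seed so that the experiment is repeatable
samples = [
    longest_carry_chain(rng.getrandbits(32), rng.getrandbits(32))
    for _ in range(20000)
]
mean_chain = sum(samples) / len(samples)
print(f"mean longest chain: {mean_chain:.2f} stages; log2(32) = {math.log2(32):.0f}")
print("worst case:", longest_carry_chain(0xFFFFFFFF, 1), "stages")
```

The mean comes out at a handful of stages, on the order of log2(32), whereas the operand pair 0xFFFFFFFF and 1 exercises the full 32-stage worst case.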
In this section, we present a single-rail full adder
design from which a dual-rail full adder is derived.
These designs are chained to form 32-bit ripple-
carry adders.
4.1. Single-rail
The single-rail design is derived from the well-known
equations (1a) and (2a). We remove XORs
in order to obtain (1b) and (2b) in terms of simple
gates.
\[ \mathit{sum} = a \oplus b \oplus c \tag{1a} \]
\[ \phantom{\mathit{sum}} = (a\bar{b} + \bar{a}b)\,\bar{c} + \overline{(a\bar{b} + \bar{a}b)}\,c \tag{1b} \]
\[ \mathit{cout} = ab + (a \oplus b)\,c \tag{2a} \]
\[ \phantom{\mathit{cout}} = ab + (a\bar{b} + \bar{a}b)\,c \tag{2b} \]
These forms allow us to extract the common term
$(a\bar{b} + \bar{a}b)$. We apply DeMorgan's theorem to this
term to obtain it in terms of negative gates. We
write (3) using z as an intermediate term for simplicity.
\[ z = \overline{(\bar{a} + b) \cdot (a + \bar{b})} \tag{3} \]
Applying DeMorgan's theorem to (1b) and (2b) we
can now use z to give (4) and (5) which lead to the
circuit implementation in Figure 7.
\[ \mathit{sum} = \overline{(\bar{z} + c) \cdot (z + \bar{c})} \tag{4} \]
\[ \mathit{cout} = \overline{(\bar{a} + \bar{b}) \cdot (\bar{z} + \bar{c})} \tag{5} \]
Figure 7: The circuit for the single-rail full adder.
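A negative-gate factoring of this kind is easy to check exhaustively. The sketch below models one such factoring built from OAI22 cells and inverters and compares it with integer addition for all eight input combinations. The exact placement of inversions is an assumption made for this sketch, and the helper names are hypothetical.

```python
from itertools import product


def inv(x):
    return 1 - x


def oai22(a, b, c, d):
    """OR-AND-INVERT cell: NOT((a OR b) AND (c OR d))."""
    return inv((a | b) & (c | d))


def full_adder(a, b, c):
    z = oai22(inv(a), b, a, inv(b))                 # z = a XOR b
    s = oai22(inv(z), c, z, inv(c))                 # sum = z XOR c
    cout = oai22(inv(a), inv(b), inv(z), inv(c))    # cout = a.b + z.c
    return s, cout


for a, b, c in product((0, 1), repeat=3):
    total = a + b + c
    assert full_adder(a, b, c) == (total & 1, total >> 1)
print("all 8 input combinations match")
```

Sharing the intermediate term z between the sum and carry outputs is what keeps the gate count low in this form.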
4.2. Dual-rail
Equations for the dual-rail design are derived
from those of the single-rail design as described in
Section 3. We start from (1b) and (2b) since these
are in terms of positive gates. This ensures correct
spacer behavior from the outset (since negative
gates invert the spacer polarity). All of the
inputs and outputs are substituted with their positive
dual-rail counterparts and all inverted inputs
are replaced by the corresponding negative-rail input.
After noting that $\overline{a\bar{b} + \bar{a}b} = ab + \bar{a}\bar{b}$, the result
is (6) and (7). Furthermore, the negative rails (8)
and (9) are derived.
\[ \mathit{sum}_1 = (a_1 b_0 + a_0 b_1)\,c_0 + (a_1 b_1 + a_0 b_0)\,c_1 \tag{6} \]
\[ \mathit{cout}_1 = a_1 b_1 + (a_1 + b_1)\,c_1 \tag{7} \]
\[ \mathit{sum}_0 = (a_0 b_1 + a_1 b_0)\,c_1 + (a_0 b_0 + a_1 b_1)\,c_0 \tag{8} \]
\[ \mathit{cout}_0 = a_0 b_0 + (a_0 + b_0)\,c_0 \tag{9} \]
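Equations (6) to (9) can be checked directly in software. The sketch below evaluates them on rail pairs written as (x1, x0), confirms the expected codeword for every valid input, and shows the early-propagative behavior of the carry output. The function names are hypothetical.

```python
from itertools import product


def encode(bit):
    """Dual-rail pair (x1, x0) for a single bit."""
    return (1, 0) if bit else (0, 1)


def dual_rail_full_adder(a, b, c):
    """Evaluate (6) to (9); a, b, c and both results are rail pairs (x1, x0)."""
    a1, a0 = a
    b1, b0 = b
    c1, c0 = c
    sum1 = (((a1 & b0) | (a0 & b1)) & c0) | (((a1 & b1) | (a0 & b0)) & c1)
    cout1 = (a1 & b1) | ((a1 | b1) & c1)
    sum0 = (((a0 & b1) | (a1 & b0)) & c1) | (((a0 & b0) | (a1 & b1)) & c0)
    cout0 = (a0 & b0) | ((a0 | b0) & c0)
    return (sum1, sum0), (cout1, cout0)


for x, y, z in product((0, 1), repeat=3):
    total = x + y + z
    expected = (encode(total & 1), encode(total >> 1))
    assert dual_rail_full_adder(encode(x), encode(y), encode(z)) == expected

# An all-zeroes spacer at the inputs gives an all-zeroes spacer at the outputs.
print(dual_rail_full_adder((0, 0), (0, 0), (0, 0)))
# Early propagation: with a = b = 1 the carry out is valid before c arrives.
print(dual_rail_full_adder(encode(1), encode(1), (0, 0)))
```

In the last call the carry output is already the codeword for 1 while the sum output remains at spacer, which is the property that lets a ripple-carry chain finish early.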
Figure 8 shows the initial implementation of the
dual-rail full adder. The circuit is made up of positive
gates as shown by the groupings. These must
be implemented as negative gates with a following
inverter due to the cells available in the library.
If we classify layers of logic separated by positive
gates, when logic zero is applied to all primary inputs,
all layers will produce a zero since there are
no inversions. In this case, we say that the circuit
uses all-zeroes spacers: the spacer is {0, 0} and the
{1, 1} state is not allowed.
Figure 8: The initial dual-rail full adder design using positive
gates. (Layers 0, 1 and 2 all use an all-0 spacer.)
To improve the circuit, we can transform the AO
gates into OAI gates with inverted inputs. The
result is shown in Figure 9 where we have removed
some of the double inversions and consequently re-
duced the number of inverters from eight to six. In
the new circuit, for an all-zeroes spacer at layer 0,
layer 1 produces an all-ones spacer due to the in-
verting logic. We now have spacer inversion in layer
1 and we use the {1, 1} state as a spacer and forbid
the {0, 0} state. Following on, layer 2 has another
spacer inversion and therefore the primary outputs
produce all-zeroes spacers. We term this alterna-
tion of spacers between logic layers an alternating
spacer protocol.
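The alternation follows from the fact that every negative gate in the library inverts a uniform input. The sketch below makes this explicit with simple gate models; the cell functions are standard definitions and `layer_spacers` is a hypothetical helper.

```python
def nand2(a, b):
    return 1 - (a & b)


def nor2(a, b):
    return 1 - (a | b)


def aoi22(a, b, c, d):
    return 1 - ((a & b) | (c & d))


def oai22(a, b, c, d):
    return 1 - ((a | b) & (c | d))


def layer_spacers(input_spacer, inverting_layers):
    """Spacer polarity at the inputs and after each layer of negative gates."""
    polarities = [input_spacer]
    for _ in range(inverting_layers):
        s = polarities[-1]
        polarities.append(nor2(s, s))   # any negative gate inverts a uniform input
    return polarities


print(layer_spacers(0, 2))
```

Starting from an all-zeroes spacer, two layers of negative gates give the polarities 0, 1, 0 described above.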
Figure 10 shows a further optimized design.
Here, sum1 and sum0 use all-zeroes spacers, whereas
cout1 and cout0 use all-ones spacers. To achieve
this, the gates directly preceding the cout1 and
cout0 primary outputs have been moved into a new
layer of logic by the introduction of inverters at
their inputs. Consequently, c1 and c0 are moved
into layer 1 so that their spacers match cout1 and
cout0 from the previous block when the full adders
are chained. This optimization removes a further
two inverters from the design, reducing the amount
of logic and therefore lessening the effect of leakage
power on the MEP.
Figure 9: Logic for the dual-rail full adder using an alternating
spacer. (Layer 0 uses an all-0 spacer, layer 1 an all-1 spacer and
layer 2 an all-0 spacer.)
Figure 10: The final optimized logic for the dual-rail full
adder. (Four logic layers, 0 to 3, annotated with all-0 and all-1
spacers.)
5. 32-bit Comparator
A comparator is another widely utilized circuit
forming part of an ALU or a discrete component in
IoT applications. A full comparator takes two n-
bit operands, a and b, and asserts a > b (greater-
than), a < b (less-than), or a = b (equal).
Figure 11 shows how multiple 1-bit comparators
are connected to form an n-bit comparator. The gt
and lt outputs are combined through OR gates to
form the n-bit outputs, whereas the equal output
is derived from the final comparator in the chain.
The eval input reads the eq output of the previous
stage. If eval is low, the result of a previous comparison
was not equal and comparison in the current
stage need not take place: the outputs should
be zero. If eval is high, all previous comparisons
were equal and the current bits must be
compared. If the final eq output is high, all bits of
the operands are equal. This architecture can benefit
greatly from early propagation as the result is
known immediately after one of the 1-bit comparators
asserts gt or lt. The longest propagation path
is equal, which must ripple through all comparator
stages.
Figure 11: A 4-bit comparator using a cascade of 1-bit comparators.
(Four 1-bit stages with inputs a, b and eval and outputs lt, gt and eq,
taking a[3], b[3] down to a[0], b[0]; the first eval input is tied to
'1' and the chain produces the greater, less and equal outputs.)
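The cascade can be modeled behaviorally in a few lines. The sketch below is a software model rather than a gate-level description: it chains 1-bit comparisons from the MSB downwards and also reports how many stages were active before the result was decided. The function names and the `stages` return value are additions made for illustration.

```python
def compare_1bit(eval_in, a, b):
    """Return (gt, eq, lt) for one stage; all outputs are 0 when eval_in is 0."""
    gt = eval_in & a & (1 - b)
    lt = eval_in & (1 - a) & b
    eq = eval_in & (1 - (a ^ b))
    return gt, eq, lt


def compare(a, b, width=32):
    """Return (greater, equal, less, stages) for unsigned operands a and b."""
    eval_in = 1               # the first stage is always enabled
    greater = less = 0
    stages = 0
    for i in reversed(range(width)):          # most significant bit first
        gt, eq, lt = compare_1bit(eval_in, (a >> i) & 1, (b >> i) & 1)
        greater |= gt                         # OR of all gt outputs
        less |= lt                            # OR of all lt outputs
        stages += eval_in                     # count the stages that were enabled
        eval_in = eq                          # eq feeds the next stage's eval
    return greater, eval_in, less, stages


print(compare(5, 3, width=4))
print(compare(7, 7, width=4))
```

Operands that differ in a high-order bit are decided after few stages, while equal operands exercise every stage of the chain.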
In this section we will present a single-rail 32-bit
comparator design. From this, we will derive a self-
timed variant which benefits from early propagation
using the methodology from Section 3.
5.1. Single-rail
Table 2 shows the truth table required to imple-
ment a 1-bit comparator. Note that eval disables
the outputs, since if the previous (more significant)
bits were not equal, a decision has already been
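made in a previous stage. After applying DeMorgan's
theorem, this set of constraints gives rise to
(10) to (12) which are in terms of negative gates and
can be implemented directly in the target library.
\[ \mathit{gt} = \overline{\overline{\mathit{eval}} + \overline{(a \cdot \bar{b})}} \tag{10} \]
\[ \mathit{lt} = \overline{\overline{\mathit{eval}} + \overline{(\bar{a} \cdot b)}} \tag{11} \]
\[ \mathit{eq} = \overline{\overline{\mathit{eval}} + \overline{(a \cdot b + \bar{a} \cdot \bar{b})}} \tag{12} \]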
The inverted literals from these equations can be
shared giving the final single-rail implementation
in Figure 12.
5.2. Dual-rail
Table 2 shows that the outputs of the single-rail
design have a one-hot encoding. One-hot encoding,
like dual-rail encoding, is a subset of 1-of-n
codes [14]. Provided a spacer separates codewords,
the switching of 1-of-n codes is monotonic, which
leads to delay insensitivity. We take advantage of this and
retain the one-hot encoding at the output, deviating
from the design methodology by transforming
only the inputs to dual-rail encoding. The
one-hot-encoded outputs can be completion detected in a
similar way to dual-rail encoding, giving rise to a
self-timed design.

Table 2: Truth table for the single-rail 1-bit comparator.

eval  a  b  |  gt  eq  lt
0     -  -  |  0   0   0
1     0  0  |  0   1   0
1     0  1  |  0   0   1
1     1  0  |  1   0   0
1     1  1  |  0   1   0

Figure 12: The single-rail 1-bit comparator implemented in
the target library.
Starting from Table 2, applying negative-rail optimization,
generating redundant eval terms for
each input and applying DeMorgan's theorem, we
arrive at (14) given (13). (15) and (16) follow.
\[ x_n' = \overline{\mathit{eval} \cdot x_n} \tag{13} \]
\[ \mathit{gt} = \overline{a_1' + b_0'} \tag{14} \]
\[ \mathit{lt} = \overline{a_0' + b_1'} \tag{15} \]
\[ \mathit{eq} = \overline{(a_1' + b_1') \cdot (a_0' + b_0')} \tag{16} \]
This implementation gates the dual-rail inputs to
the circuit, with the advantage that dynamic
power dissipation is saved in stages where eval is
low. We rename the eval signal to req since it
forms half of a handshake between two self-timed
circuits. The final circuit implementation is shown in
Figure 13.
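The gated, one-hot comparator stage can be modeled with the library's negative gates. The sketch below assumes one plausible gate-level realization: each input rail is gated by a NAND with req, the gt and lt outputs are NOR gates and the eq output is an OAI22 gate. The placement of inversions and all function names are assumptions for illustration.

```python
def nand2(a, b):
    return 1 - (a & b)


def nor2(a, b):
    return 1 - (a | b)


def oai22(a, b, c, d):
    return 1 - ((a | b) & (c | d))


def self_timed_compare_1bit(req, a, b):
    """a and b are rail pairs (x1, x0); returns one-hot (gt, eq, lt)."""
    a1, a0 = a
    b1, b0 = b
    # Input gating: each gated rail idles at 1 when req is low or the rail is 0.
    a1g, a0g, b1g, b0g = (nand2(req, x) for x in (a1, a0, b1, b0))
    gt = nor2(a1g, b0g)                 # a = 1 and b = 0
    lt = nor2(a0g, b1g)                 # a = 0 and b = 1
    eq = oai22(a1g, b1g, a0g, b0g)      # a and b are both 1 or both 0
    return gt, eq, lt


def done(outputs):
    """Completion detection for one-hot outputs: a simple OR."""
    gt, eq, lt = outputs
    return gt | eq | lt


print(self_timed_compare_1bit(1, (1, 0), (0, 1)))   # a = 1, b = 0
print(self_timed_compare_1bit(1, (0, 0), (0, 0)))   # spacer at the inputs
print(self_timed_compare_1bit(0, (1, 0), (0, 1)))   # req low: stage is gated
```

No output is asserted while either operand is still at spacer or while req is low, so the OR of the three outputs serves as a done signal for the stage.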
The 1-bit self-timed comparator is cascaded in
the same way shown in Figure 11 to construct a 32-
bit comparator, the only difference being operands
a and b are dual-rail encoded and eval is renamed
req. Due to the one-hot encoding, the done signal
for this circuit is derived by ORing the greater,
equal and less outputs.
Figure 13: The self-timed 1-bit comparator implemented in
the target library. (Gated inputs a1', a0', b1' and b0' drive the gt,
eq and lt outputs; layer 0 uses an all-0 spacer, layer 1 an all-1
spacer and layer 2 an all-0 spacer.)
6. Results
The designs were synthesized using Synopsys
Design Compiler for VDD = 0.25 V. Both single-
rail and dual-rail designs use the full diffusion cell
library. 100 000 randomized 32-bit integer operands
were used as input in a digital simulation envi-
ronment on the post-synthesis netlist—no layout
was performed. The computation times for single
operands were measured from the dual-rail designs
using the done signal and sorted into bins giving
the normalized probability density functions.
To obtain a range of computation time and en-
ergy results, the designs were simulated in an ana-
logue environment from VDD = 600 mV down to
VDD = VDD,min (the circuit’s minimum operating
voltage). The voltage range was chosen to show
both sub- and super-threshold behaviors of the cir-
cuits. The temperature range was chosen as an
extended commercial range of −10 ◦C to +85 ◦C
which covers the harsh operating conditions ex-
pected of IoT nodes. The lower limit was enforced
since lower temperatures severely degraded logic
levels causing the circuits to fail (refer to the dis-
cussion in Section 2). Operating corners available
for analogue simulation were TT, FF and SS. The
operating conditions were identified as follows:
• Typical: TT, 27 ◦C.
• Best: FF, 85 ◦C.
• Worst: SS, −10 ◦C.
The mean operands were chosen from the binning
process and used as input to the analogue simula-
tions. For the single-rail designs, the time taken
for the circuit to compute at VDD,min for the worst
conditions was used as the clock period throughout
testing. A 10 % margin was applied to account for
on-chip variation which is not accounted for by the
process corners.
Figure 14 shows the average time for a single
computation as a function of VDD for the 32-bit
adder. The 32-bit comparator follows the same pat-
tern. The return-to-spacer time is not included in
this figure, since for our applications we assume idle
time is used for this, and we are more interested
in the computation latency. The single-rail design
is shown as if implemented with a three-step dy-
namic voltage and frequency scaling (DVFS) strat-
egy. The hatched areas illustrate time saved by the
self-timed implementation.
Figure 14: 32-bit adder computation time for a single set of
operands under typical operating conditions. (Plot: average
computation time [µs], logarithmic axis, against VDD [mV] from 200 to
600 for the DR Adder and SR Adder; the hatched region marks the time
saved.)
The remainder of the section discusses the delay
distributions and energy results in detail.
6.1. 32-bit Adder Results
From Figure 15 we see the adder exhibits a log-
normal distribution, since the average number of
full adders the carry must ripple through is log n
for n-bit operands [13]. A higher concentration of
operands fall towards a faster computation time,
illustrating the advantages of early propagation in
the adder circuit.
Figures 16 and 17 show the energy per computa-
tion as a function of VDD for the single-rail (SR) and
dual-rail (DR) implementations respectively. These
figures include the return-to-spacer energy for the
self-timed design. For the dual-rail design under
typical and best conditions, the MEP lies in the
range 150 mV to 200 mV. The energy in the single-
rail design is much higher due to leakage, since the
clock period is much longer than required. Un-
der worst conditions, the circuit does not function
below 250 mV. These are the only conditions for
which the single-rail design overtakes the dual-rail
in terms of energy. Figure 18 shows this is around
270 mV. Since the single-rail period is optimized
for these conditions and has a lesser area, a lesser
amount of energy is consumed per operation.
Figure 15: Probability distribution for the self-timed adder
design. (Plot: probability against normalized time per computation
from 0 to 1, with the mean marked.)
Figure 16: Energy per computation of the dual-rail (DR)
adder. (Plot: average computation energy [fJ] against VDD [mV] from
200 to 600 under typical, best and worst conditions.)
6.2. 32-bit Comparator Results
The fastest propagation path occurs when the
MSBs differ. In this case, a decision is made after a
single comparator delay. A decision can be made on
bit 31 of the operands when the pair (a[31], b[31]) is
either {0, 1} or {1, 0}. The remaining bits account
for $2^{2(32-1)} = 2^{62}$ combinations. Since there are two
such MSB combinations, this leads to $2^{63}$ input
combinations which are decided on the MSBs: half of the
total input operand space, which is $2^{64}$. Following on,
a quarter of the input operand space can be decided on
bit 30. This decision carries a delay of two comparators.
This reasoning leads to the negative exponential
probability distribution seen in Figure 19.
Figure 17: Energy per computation of the single-rail (SR)
adder. (Plot: average computation energy [fJ], logarithmic axis,
against VDD [mV] from 200 to 600 under typical, best and worst
conditions.)
Figure 18: Energy per computation of the adder designs
under worst conditions. (Plot: average computation energy [fJ] against
VDD [mV] from 300 to 600 for the DR Adder and SR Adder.)
Figure 19: Probability distribution for the self-timed comparator
design. (Plot: probability against normalized time per computation
from 0 to 1, with the mean marked.)
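The counting argument can be written out exactly. The sketch below assumes independent, uniformly distributed operands and computes the probability that the comparison is decided at each stage, counting the MSB stage as stage 1, together with the expected number of stages. The function name is hypothetical.

```python
from fractions import Fraction


def decision_probability(stage, width=32):
    """Probability that random operands are decided at `stage` (1 = MSB stage)."""
    if stage < width:
        # All higher bits are equal (2^-(stage-1)) and this bit pair differs (1/2).
        return Fraction(1, 2 ** stage)
    # Final stage: reached when all higher bits are equal, whatever the last bits.
    return Fraction(1, 2 ** (width - 1))


probabilities = [decision_probability(k) for k in range(1, 33)]
expected_stages = sum(k * p for k, p in enumerate(probabilities, start=1))

print("decided at stage 1:", probabilities[0])
print("decided at stage 2:", probabilities[1])
print("expected stages:", float(expected_stages))
```

Half of all operand pairs are decided at the first stage and a quarter at the second, and the expected depth is just under two stages out of 32.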
Figures 20 and 21 show the energy per operation
for the 32-bit comparator designs. In the dual-rail
(DR) design, the circuit operating under worst con-
ditions outperforms the typical conditions at the
upper part of the voltage range. This can be ex-
plained by the reduced leakage power due to the
low temperature (−10 ◦C). As the worst-case com-
putation time becomes exponentially slower how-
ever, the energy grows greater. In contrast to the
dual-rail design, the energy per computation of the
single-rail (SR) design varies over several orders of
magnitude throughout the operating conditions.
The dynamic energy saved by gating the inputs
to the self-timed design leads to improved energy
per operation across all operating conditions com-
pared to the single-rail design, including the worst
conditions shown in Figure 22.
Figure 20: Energy per computation of the dual-rail (DR)
comparator. (Plot: average computation energy [fJ] against VDD [mV]
from 100 to 600 under typical, best and worst conditions.)
7. Conclusions
In this work we have described a design flow
for datapath circuits in applications where average-
case performance and energy per computation are
the most important factors. We have designed a 32-
bit ripple-carry adder and a 32-bit comparator us-
ing this flow and shown them to outperform single-
rail counterparts in these metrics.
[Figure 21 plot: average computation energy (fJ, logarithmic scale) versus VDD (mV); series: SR Comparator Typical, SR Comparator Best, SR Comparator Worst.]
Figure 21: Energy per computation of the single-rail (SR)
comparator.
[Figure 22 plot: average computation energy (fJ) versus VDD (mV); series: DR Comparator Worst, SR Comparator Worst.]
Figure 22: Energy per computation of the comparator designs under worst conditions.
Table 3 summarizes the designs under typical conditions. The self-timed dual-rail adder and comparator reduce energy per computation by 86 % and 94 %, and computation time by 81 % and 79 %, respectively. The dual-rail adder occupies more than twice the area of the single-rail design (+128 %) due to the duplicated logic and completion detection. In contrast, the dual-rail comparator uses less area than the single-rail design. This is achieved by retaining the one-hot encoding at the output. Additionally, the dual-rail inputs allow some inverters to be saved. As expected, the leakage power of the designs scales with area.

To assess the full energy advantage of the dual-rail over the single-rail designs, we must also take into account the energy leaked during idle periods. For a dual-rail design that leaks more than its single-rail equivalent, there is a maximum time between new input operands below which the dual-rail design uses less total energy (including idle periods). We find this time by dividing the difference in computation energy by the difference in leakage power. For the adder designs, this is 855 µs, equivalent to an input frequency of 1.2 kHz. In the case of the comparator, the total energy consumed is always less than that of the single-rail design, since both computation energy and leakage power are lower.
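As a worked check of this break-even calculation, the following sketch uses the rounded values from Table 3. The function name and its handling of the no-break-even case are illustrative assumptions, not part of the design flow. With the rounded table values the adder result comes out at roughly 854 µs, close to the quoted 855 µs; the small difference presumably comes from rounding in the table.

```python
def break_even_interval_s(e_sr_fj, e_dr_fj, p_sr_pw, p_dr_pw):
    """Maximum time between inputs (seconds) for which DR uses less total energy.

    Inputs are energy per computation in fJ and leakage power in pW.
    Assumes the DR design saves computation energy. Returns None when the
    DR design leaks no more than the SR design, i.e. DR is always better.
    """
    energy_saved_j = (e_sr_fj - e_dr_fj) * 1e-15
    extra_leakage_w = (p_dr_pw - p_sr_pw) * 1e-12
    if extra_leakage_w <= 0:
        return None
    return energy_saved_j / extra_leakage_w


# Values taken from Table 3 (typical conditions, VDD = 0.25 V).
adder = break_even_interval_s(436, 61.0, 317, 756)
comparator = break_even_interval_s(1470, 88.4, 323, 300)

print(f"adder break-even: {adder * 1e6:.0f} us, {1 / adder / 1e3:.2f} kHz")
print(f"comparator break-even: {comparator}")
```

For the comparator the function returns None, reflecting that there is no input interval at which the single-rail design consumes less.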
Table 3: Comparison of single-rail (SR) and self-timed dual-rail (DR) designs under typical conditions with VDD = 0.25 V. Energy and period are averages per computation.

Design             Type  Energy (fJ)  Period (µs)  Area     Leakage Power (pW)
32-bit Adder       SR    436          15.8         592      317
32-bit Adder       DR    61.0         3.01         1350     756
                         (−86%)       (−81%)       (+128%)  (+138%)
32-bit Comparator  SR    1470         14.4         799      323
32-bit Comparator  DR    88.4         3.00         758      300
                         (−94%)       (−79%)       (−5%)    (−7%)
Acknowledgment
This work was supported by EPSRC and ARM.
References
[1] X.-D. Do, S.-K. Han, S.-G. Lee, Optimization
of piezoelectric energy harvesting systems by us-
ing a MPPT method, in: 2014 IEEE 5th Int.
Conf. Commun. Electron., 2014, pp. 309–312.
doi:10.1109/CCE.2014.6916720.
[2] A. Montecucco, A. R. Knox, Maximum Power Point
Tracking Converter Based on the Open-Circuit Volt-
age Method for Thermoelectric Generators, IEEE
Trans. Power Electron. 30 (2) (2015) 828–839.
doi:10.1109/TPEL.2014.2313294.
[3] J. Sparsø, S. Furber, Principles of Asynchronous De-
sign: A Systems Perspective, Kluwer Academic Pub-
lishers, 2001.
[4] A. Kondratyev, K. Lwin, Design of asynchronous
circuits using synchronous CAD tools, IEEE
Des. Test Comput. 19 (4) (2002) 107–117.
doi:10.1109/MDT.2002.1018139.
[5] A. Wang, A. Chandrakasan, A 180-mV subthreshold
FFT processor using a minimum energy design method-
ology, IEEE J. Solid-State Circuits 40 (1) (2005) 310–
319. doi:10.1109/JSSC.2004.837945.
[6] J. Myers, A. Savanth, R. Gaddh, D. Howard, P. Prab-
hat, D. Flynn, A subthreshold ARM cortex-M0+ sub-
system in 65 nm CMOS for WSN applications with 14
Power Domains, 10T SRAM, and integrated voltage
regulator, IEEE J. Solid-State Circuits 51 (1) (2016)
31–44. doi:10.1109/JSSC.2015.2477046.
[7] A. J. Martin, Asynchronous datapaths and the design
of an asynchronous adder, Form. Methods Syst. Des.
1 (1) (1992) 117–137. doi:10.1007/BF00464358.
[8] J. Morris, P. Prabhat, J. Myers, A. Yakovlev, Uncon-
ventional Layout Techniques for a High Performance,
Low Variability Subthreshold Standard Cell Library, in:
2017 IEEE Comput. Soc. Annu. Symp. VLSI, IEEE,
2017, pp. 19–24. doi:10.1109/ISVLSI.2017.14.
[9] M. Alioto, Ultra-Low Power VLSI Circuit Design De-
mystified and Explained: A Tutorial, IEEE Trans.
Circuits Syst. I Regul. Pap. 59 (1) (2012) 3–29.
doi:10.1109/TCSI.2011.2177004.
[10] R. Singh, P. Bhatnagar, D. Sahu, N. Kr Shukla,
A. Goel, Analysis of the Effect of Temperature Vari-
ations on Sub-threshold Leakage Current in P3 and P4
SRAM Cells at Deep Sub-micron CMOS Technology,
Int. J. Comput. Appl. 35 (5) (2011) 975–8887.
[11] D. Sokolov, Automated synthesis of asynchronous cir-
cuits using direct mapping for control and data paths,
Ph.D. thesis, Newcastle University (2006).
[12] D. Muller, Theory of Asynchronous Circuits, Tech. rep.,
University of Illinois (1955).
[13] B. Gilchrist, J. H. Pomerene, S. Y. Wong, Fast
Carry Logic for Digital Computers, IEEE Trans.
Electron. Comput. EC-4 (4) (1955) 133–136.
doi:10.1109/TEC.1955.5219482.
[14] W. Bainbridge, W. Toms, D. Edwards, S. Furber,
Delay-insensitive, point-to-point interconnect using m-
of-n codes, in: Ninth Int. Symp. Asynchronous Circuits
Syst. 2003. Proceedings., IEEE Comput. Soc, 2003, pp.
132–140. doi:10.1109/ASYNC.2003.1199173.
