Logical Effort Based Design Exploration of 64-bit Adders Using a Mixed
Dynamic-CMOS/Threshold-Logic Approach
Peter Celinski, Said Al-Sarawi, Derek Abbott
Centre for High Performance Integrated
Technologies & Systems (CHiPTec)
The Department of Electrical and Electronic
Engineering, The University of Adelaide
SA 5005, Australia.
celinski@eleceng.adelaide.edu.au
Sorin Cotofana, Stamatis Vassiliadis
Computer Engineering Group
Electrical Engineering Department
Delft University of Technology
Mekelweg 4, 2628 CD Delft
The Netherlands.
Abstract
This paper presents the design exploration of CMOS
64-bit adders designed using threshold logic gates, based
on systematic transistor-level delay estimation using
Logical Effort (LE). The adders are hybrid designs consisting
of domino logic and the recently proposed Charge Recycling
Threshold Logic (CRTL). The delay evaluation is based on LE
modeling of the delay of the domino and CRTL gates. From
the initial estimations, we select the 8-bit sparse carry
look-ahead/carry-select scheme. The LE estimate gives a delay
of less than 5 FO4 and simulation gives 5.3 FO4, which is
1.1 FO4 or 17% faster than the nearest domino design.
1 Introduction
Addition is one of the most critical operations performed
by VLSI processors. Adders are used in floating-point
arithmetic units, ALUs, memory addressing and program
counter update. The main requirements are high speed, low
power dissipation and small area.
Threshold logic (TL) was introduced over four decades
ago, and over the years has promised much in terms of
reduced logic depth and gate count compared to conventional
logic-gate based design. Lack of efficient CMOS realizations
has meant that TL has thus far had little impact on
VLSI. Efficient TL gate realizations have recently become
available, and a small number of applications based on TL
gates [1, 2, 9, 10] have demonstrated its ability to achieve
high operating speed and significantly reduced area.
To date no large scale arithmetic building blocks for
processors have been designed using TL. We address this issue
by proposing a high speed 64-bit TL based adder.
The delay analysis method used in this work enables
the comparison of various adder topologies based on logical
effort. Another motivator for this approach is the desire to
avoid the largely unsatisfactory presentation of circuit
performance results often found in the literature, in the
form of delay numbers with insufficient information to allow
comparison across different process technologies and loading
conditions.
This paper presents the design and critical path delay
evaluation of a high speed hybrid CRTL-domino adder. The
aim is to use a relatively quick method to determine fast
64-bit adder topologies without venturing into CAD tool
complexity, taking into account the advantages and limitations
of CRTL. Simulations are used to verify the accuracy of the
delay model estimate. The proposed 64-bit adder has a
simulated critical path delay of 5.3 FO4 in a 0.35 µm process.
We begin in Section 2 by giving a brief overview of
threshold logic. This is followed by a description of CRTL
in Section 3. Section 4 briefly reviews Logical Effort, and
the delay model for CRTL gates is developed in Section 5.
The circuit design examples are presented and evaluated in
Section 6. Finally, conclusions and suggestions for future
work are given in Section 7.
2 Overview of Threshold Logic
Threshold logic emerged in the early 1960’s as a gener-
alized theory of switching logic and includes conventional
Boolean logic as its subset. A threshold logic gate is func-
tionally similar to a hard limiting neuron without learning
capability. The gate takes n binary inputs x1,x2,. . . ,xn and
produces a single binary output Y , as shown in Fig. 1.
The Boolean function computed by such a gate is called
a threshold function, and it is specified by the gate threshold
T and the weights w1, w2, ..., wn, where wi is the weight
associated with the ith input variable xi. The output Y is
given by

$$Y = \begin{cases} 1, & \text{if } \sum_{i=1}^{n} w_i x_i \ge T \\ 0, & \text{otherwise.} \end{cases} \tag{1}$$

Figure 1. Threshold Gate Model

Proceedings of the IEEE Computer Society Annual Symposium on VLSI Emerging Trends in VLSI Systems Design (ISVLSI’04)
A TL gate can be programmed to realize many distinct
Boolean functions by adjusting the threshold T and/or the
weights wi. For example, an n-input TL gate with T = n
will realize an n-input AND gate, and by setting T = n/2
the gate computes a majority function. This versatility
means that TL offers a significantly increased computational
capability over conventional AND-OR-NOT logic.
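The behaviour described above can be illustrated with a short functional sketch. It models only the Boolean function of a threshold gate, not its circuit realization; the function names are illustrative, and the OR case (T = 1 with unit weights) is added here as a further standard example.

```python
def threshold_gate(inputs, weights, threshold):
    """Return 1 if the weighted sum of the binary inputs reaches the threshold."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return 1 if weighted_sum >= threshold else 0


def and_gate(inputs):
    # Unit weights and T = n: every input must be 1.
    n = len(inputs)
    return threshold_gate(inputs, [1] * n, n)


def or_gate(inputs):
    # Unit weights and T = 1: a single 1 is enough.
    return threshold_gate(inputs, [1] * len(inputs), 1)


def majority_gate(inputs):
    # Unit weights and T = n/2, as in the text.
    n = len(inputs)
    return threshold_gate(inputs, [1] * n, n / 2)


print(and_gate([1, 1, 1]), and_gate([1, 0, 1]))            # 1 0
print(or_gate([0, 0, 0]), or_gate([0, 1, 0]))              # 0 1
print(majority_gate([1, 1, 0]), majority_gate([1, 0, 0]))  # 1 0
```

The same gate model yields three different Boolean functions purely through the choice of T, which is the versatility referred to above.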
3 Charge Recycling Threshold Logic
The realization of CMOS threshold gates presented
in [2], which is used in the design of TL circuits in this
work, is now described. Fig. 2 shows the circuit structure.
The sense amplifier (cross-coupled transistors M1-M4)
generates the output Q and its complement Qb. The precharge
and evaluate phases are specified by the enable clock signal
$E$ and its complement $\bar{E}$. The inputs $x_i$ are
capacitively coupled onto the floating gate φ of M5, and the
threshold is set by the gate voltage t of M6. The potential
φ is given by

$$\phi = \frac{\sum_{i=1}^{n} C_i x_i}{C_{tot}}, \tag{2}$$

where $C_{tot}$ is the sum of all capacitances, including
parasitics, at the floating gate of M5. Weight values are thus
realized by setting the capacitors $C_i$ to appropriate values.
Typically, in CMOS technology these capacitors are
implemented between the polysilicon 1 and polysilicon 2 layers
where available, or using other layers dedicated to linear
capacitors.
The enable signal E controls the precharge and activation
of the sense circuit. The gate has two phases of operation,
the equalize phase and the evaluate phase. When $\bar{E}$ is
high the output voltages are equalized. When E is high, the
outputs are disconnected and the differential circuit
(M5-M7, M10, M11) draws different currents from the
formerly equalized nodes Q and Qb. The sense amplifier is
activated after the delay of the enable inverters and
amplifies the difference in potential now present between Q
and Qb, accelerating the transition. In this way the circuit
structure determines whether the weighted sum of the inputs,
φ, is greater or less than the threshold, t, and a TL gate is
realized. Transistors M10 and M11 turn off the differential
circuit after evaluation is completed to reduce the power
dissipation. The gates may be pipelined in a self-timed
manner as described for the Asynchronous Sense Differential
Logic family in [5]. Extensive Monte Carlo simulations,
together with simulations over varied voltage and temperature
operating points and process corners, have shown that the
gate operates reliably at high speed [2]. In addition, a
series of test gates with fan-in ranging from 8 to 64 were
fabricated in a 0.35 µm process, as shown in Fig. 3, and
correct functionality was verified for all gates. Precise
delay measurement at high clock frequency is the subject of
ongoing work.

Figure 2. The CRTL gate circuit and Enable signals.
Figure 3. Chip micrograph showing CRTL
gates with 8 to 64 inputs.
0-7695-2097-9/04 $20.00 © 2004 IEEE
4 Logical Effort
Logical effort (LE) is a design methodology for estimating
the delay of CMOS logic circuits implementing a given
logic function [12]. Logical effort is based on a
reformulation of the conventional RC model of CMOS gate delay
which separates the effects on delay of gate size, topology,
parasitics and load. The relative simplicity of the method
compared to other delay modeling techniques, together with
its sufficient accuracy, allows it to be used early in the
design process to evaluate alternative circuits.
The total delay of a gate, d, comprises two parts:
an intrinsic parasitic delay, p, and an effort delay, f,
incurred in driving the capacitive load. The parasitic delay
is largely independent of the transistor sizes in the gate,
since wider transistors, which provide increased current,
have correspondingly larger diffusion capacitances. The
effort delay in turn depends on two factors: the ratio of the
load capacitance to the size of the transistors in the gate,
and the complexity of the gate. The former term is called
electrical effort, h, and the latter is called logical
effort, g. The electrical effort is

$$h = \frac{C_{out}}{C_{in}}, \tag{3}$$

where $C_{out}$ and $C_{in}$ are the gate load capacitance and
input capacitance, respectively. The logical effort, g,
characterizes the gate complexity, and is defined as the ratio
of the input capacitance of the gate to the input capacitance
of an inverter that can produce equal output current.
Alternatively, the logical effort describes how much larger
than an inverter the transistors in the gate must be to be
able to drive loads equally well as the inverter. By
definition an inverter has a logical effort of 1.
The delay of a single logic gate can be expressed as

$$d = gh + p. \tag{4}$$

This delay is in units of τ, which is the delay of an inverter
driving an identical copy of itself, without parasitics. This
normalization enables the comparison of delay across
different technologies. The product gh is called the gate or
stage effort.
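As a quick numerical check of Equation (4), the sketch below evaluates the delay of an inverter driving four copies of itself (h = 4), which is the usual meaning of an FO4 delay. The values τ = 40 ps and p_inv = 1.18 are the process parameters quoted in Section 5.2; the function name is illustrative.

```python
TAU_PS = 40.0  # tau for the 0.35 um process, as quoted in Section 5.2
P_INV = 1.18   # inverter parasitic delay, as quoted in Section 5.2


def gate_delay(g, h, p):
    """Single-gate delay d = g*h + p, in units of tau (Equation (4))."""
    return g * h + p


# An inverter (g = 1) driving four identical inverters (h = 4).
fo4_tau = gate_delay(1.0, 4.0, P_INV)
fo4_ps = fo4_tau * TAU_PS
print(f"FO4 delay = {fo4_tau:.2f} tau = {fo4_ps:.1f} ps")
```

The result, about 5.18 τ or 207 ps, agrees with the FO4 delay of 207 ps quoted for this process.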
The path delay, D, is the sum of the delays of each of
the gate stages in the path, $d_i$, and consists of the path
effort delay and the path parasitic delay, P:

$$D = \sum_{i=1}^{N} d_i = \sum_{i=1}^{N} g_i h_i + P, \qquad P = \sum_{i=1}^{N} p_i, \tag{5}$$

where N is the number of stages in the path. It can be shown
that the path delay is minimized when each stage in the path
bears the same stage effort, and the minimum delay is
achieved when the stage effort is

$$f_{min} = g_i h_i = F^{1/N}, \tag{6}$$

where F is the path effort.
This leads to the main result of logical effort, which is
the expression for minimum path delay

$$D_{min} = N F^{1/N} + P. \tag{7}$$
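Equations (6) and (7) can be sketched in a few lines. The sketch assumes an unbranched path, so that the path effort F is simply the product of the stage logical efforts and the overall electrical effort of the path; the function name and the three-inverter example are hypothetical illustrations, not circuits taken from this work.

```python
import math


def min_path_delay(logical_efforts, electrical_effort, parasitic_delays):
    """Return (D_min, stage_effort) in units of tau for an unbranched path."""
    n_stages = len(logical_efforts)
    path_effort = math.prod(logical_efforts) * electrical_effort  # F
    stage_effort = path_effort ** (1.0 / n_stages)                # Equation (6)
    d_min = n_stages * stage_effort + sum(parasitic_delays)       # Equation (7)
    return d_min, stage_effort


# Hypothetical example: three inverters (g = 1, p = 1.18 each) driving a
# load 64 times the input capacitance of the first stage.
d_min, f_opt = min_path_delay([1.0, 1.0, 1.0], 64.0, [1.18, 1.18, 1.18])
print(f"stage effort = {f_opt:.2f}, D_min = {d_min:.2f} tau")
```

In this example each stage bears an effort of 4, giving a minimum delay of about 15.5 τ.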
The accuracy of the delay predicted by Equation (4) can
be improved by calibrating the model: the delay is simulated
as a function of load (electrical effort), and a straight
line is fitted to extract τ, the inverter parasitic delay,
$p_{inv}$, and the logical effort, g. We will use this
technique to develop a calibrated logical effort based model
for the delay of the CRTL gates.
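The calibration procedure can be sketched as two straight-line fits. For an inverter, g = 1 by definition, so the slope of delay against h gives τ and the intercept gives τ·p_inv; for any other gate, the slope and intercept divided by τ give g and p. The data below are synthetic, generated from the parameter values quoted later in this paper purely to exercise the code; they are not simulation results, and the function names are illustrative.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def calibrate(h_values, inverter_delays_ps, gate_delays_ps):
    """Extract tau (ps), p_inv, and the gate's g and p from delay-vs-load data."""
    tau, inv_intercept = fit_line(h_values, inverter_delays_ps)
    p_inv = inv_intercept / tau          # inverter has g = 1
    slope, intercept = fit_line(h_values, gate_delays_ps)
    return tau, p_inv, slope / tau, intercept / tau


# Synthetic data (assumption): tau = 40 ps, p_inv = 1.18, gate g = 0.3, p = 1.4.
hs = [1.0, 2.0, 3.0, 4.0, 5.0]
inv_ps = [40.0 * (h + 1.18) for h in hs]
gate_ps = [40.0 * (0.3 * h + 1.4) for h in hs]
tau, p_inv, g, p = calibrate(hs, inv_ps, gate_ps)
print(f"tau = {tau:.1f} ps, p_inv = {p_inv:.2f}, g = {g:.2f}, p = {p:.2f}")
```

With real simulation data the points do not lie exactly on a line, and the quality of the fit indicates how well the linear model of Equation (4) holds.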
5 Modeling CRTL Delay
We begin by providing a set of assumptions which will
simplify the analysis, a proposed expression for the worst
case delay of the CRTL gate and a derivation of the model’s
parameters.
5.1 Notation and Assumptions
The TL gate is assumed to have n logic inputs (fanin),
the total number of gate inputs connected to logic one is
denoted by N, and T is the threshold of the gate. The
potential of the gate of transistor M6, t, in Fig. 2 is given by
$t = \frac{T}{n} V_{dd}$. In the worst case, the voltage φ in
Equation (2) takes the values $\phi = t \pm \frac{\delta}{2}$,
where δ is given by $\delta = \frac{V_{dd}}{n}$. This expresses
the worst case (greatest delay) condition, where the
difference between φ and t is minimal, i.e. the step voltage
generated by the sum of inputs with respect to the threshold
voltage is smallest.
The gate inputs are assumed to have unit weights, i.e.
$w_i = 1$, since the delay depends only on the values of N and
T. Also, without loss of generality, we will assume positive
weights and threshold, since negative weights may easily
be accommodated in the differential structure of the gate by
using a network of input capacitors connected to the gate of
M6. Since the gate is clocked, we will measure delay from
the clock E to Qi-Qbi. Specifically, delay will be measured
as the average of the 50% point on two falling transitions of
E to the 50% points on the corresponding falling and rising
edges of Qi and Qbi. Generally, the delay will depend on
the threshold voltage, t, the step size, δ, and the capacitive
output load on Qi and Qbi. To simplify the analysis, we
will fix the value of t at 1.5 V. This value is close to the
required gate threshold voltage in typical circuit
applications. The worst case delay therefore depends only on
the fan-in and gate loading, which allows us to propose a
model based on expressions similar to those for conventional
logic based on the theory of logical effort.
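The worst-case input condition can be made concrete with a small numerical sketch. It evaluates t = (T/n)·Vdd, δ = Vdd/n and φ = t ± δ/2 as stated above. The supply voltage of 3.3 V and the example gate (n = 8, T = 4) are assumptions chosen for illustration; neither is specified in the text.

```python
def worst_case_levels(threshold, fan_in, vdd):
    """Return (t, delta, phi_low, phi_high) in volts for the worst-case input."""
    t = threshold / fan_in * vdd   # gate voltage of M6
    delta = vdd / fan_in           # step produced by one unit-weight input
    return t, delta, t - delta / 2, t + delta / 2


VDD = 3.3  # assumed supply voltage, not given in the text
t, delta, phi_low, phi_high = worst_case_levels(threshold=4, fan_in=8, vdd=VDD)
print(f"t = {t:.3f} V, delta = {delta:.4f} V")
print(f"phi = {phi_low:.4f} V or {phi_high:.4f} V")

# The worst-case margin between phi and t shrinks as the fan-in grows.
for n in (8, 16, 32, 64):
    _, d, _, _ = worst_case_levels(n // 2, n, VDD)
    print(f"n = {n:2d}: |phi - t| = {d / 2 * 1000:.1f} mV")
```

The loop shows why fan-in is a key parameter of the delay model: the voltage difference that the sense amplifier must resolve is inversely proportional to n.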
5.2 Formulation of the Model and Parameter Extraction
The delay of the CRTL gate may be expressed as in
Equation (8). This delay is the total delay of the sense
amplifier and the buffer inverters connected to Q and Qb, and
depends on the load, h, and the fanin, n, as follows

$$d_{E \to Q_i} = \{g(n)\,h + p(n)\}\,\tau. \tag{8}$$

The load, h, is defined as the ratio of the load capacitance on
Qi (we assume the loads on Qi and Qbi are equal) to the
CRTL gate unit weight capacitance. Both the logical effort and
the parasitic delay in Equation (8) are functions of the fanin.
The delay parameters for the industrial 0.35 µm process
used to obtain the simulation results presented here are
τ = 40 ps, $p_{inv}$ = 1.18 and FO4 = 207 ps.
The values of g and p in Equation (8) were extracted by
linear regression from simulation results for a range of fanin
from n = 2 to 60, while the electrical effort was swept from
h = 0 to 20. By fitting a curve to the parameters g and p, the
CRTL gate delay may be approximated in closed form by

$$d_{E \to Q_i} = \{(0.002n + 0.34)\,h + \ln(n) + 1.6\}\,\tau. \tag{9}$$
In order to use Equation (9), it is necessary to compensate
for the parasitic capacitance at the floating gate of M5.
The resulting effective fanin, neff, is given by

neff = {∑n …   (10)

where n0 is the number of inputs to the gate and neff is the
value used to calculate the delay. Typically, for a large fanin
CRTL gate, by far the major contribution to the parasitic
capacitance will be from the bottom plate of the floating
capacitors used to implement the weights. In the process used
in this work, this corresponds to the capacitance from the
poly1 plate to the underlying n-well, which is used to reduce
substrate noise coupling to the floating node.
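The closed-form model of Equation (9) is straightforward to evaluate. The sketch below is a direct transcription of that equation; the function name is illustrative, and the example uses an effective fanin of 10 with unit electrical effort, the values applied to the 8-input gates in Section 6.2.

```python
import math

TAU_PS = 40.0  # tau for the 0.35 um process, as quoted in this section


def crtl_delay_tau(n_eff, h):
    """CRTL gate delay from E to Qi in units of tau, following Equation (9)."""
    g = 0.002 * n_eff + 0.34      # fitted logical effort g(n)
    p = math.log(n_eff) + 1.6     # fitted parasitic delay p(n), natural log
    return g * h + p


d10 = crtl_delay_tau(n_eff=10, h=1.0)
print(f"n_eff = 10, h = 1: {d10:.2f} tau = {d10 * TAU_PS:.0f} ps")

# The logical effort grows only weakly with fanin; the parasitic term dominates.
for n in (2, 8, 16, 32, 60):
    print(f"n = {n:2d}: {crtl_delay_tau(n, 1.0):.2f} tau")
```

For n_eff = 10 and h = 1 the model gives approximately 4.26 τ.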
6 64-bit Adder Design and Critical Path Delay Estimation
Delay estimation based on logical effort has been
carried out for a number of high speed adders [3, 8],
including dynamic Kogge-Stone (D-KSA), dynamic carry
look-ahead (D-CLA), dynamic Ling/conditional-sum
(D-LCNSA) and Intel’s Quaternary (D-QTA) [6]. We extend
this work to include CRTL based adders. For completeness,
we also include a comparison with the HP Ling adder [7],
Harris’ adder [4] and the Output Prediction Logic adder
developed by Sechen [11].
6.1 64-bit Adder Architecture
The selection of the adder architecture is heavily
influenced by the availability of fast high fan-in CRTL gates.
This leads us to use CLA (carry look-ahead) and CSA
(carry-select) blocks. The adders described in [7] and [4]
are based on 4-bit CLA blocks, which is usually the optimal
trade-off between the depth of the CLA tree and the
number of series transistors in a CMOS gate. The carries in
these adders are generated at 16-bit boundaries, requiring
16-bit sub-adders for carry-select blocks.
Increasing the number of bits handled by a CLA block to
8 results in fewer logic levels and a more regular design
and layout [11]. This is impractical in conventional CMOS
logic, since it requires 8 series transistors. We can,
however, take advantage of the wide AND gates in CRTL to
obtain the regular structure shown in Fig. 4. In this scheme,
the 64-bit input addends are divided into eight 8-bit blocks,
and it has $\log_8 64$, or two, levels of carry look-ahead. The
Kogge-Stone scheme generates carries for each bit position,
so no carry select is needed. The 4- and 8-bit block versions
have depth $\log_4 64 = 3$ and $\log_8 64 = 2$, respectively.
However, they consist of many more CLA blocks with
significantly increased wiring and fanouts.
The structure of the proposed adder is a sparse carry
prefix tree. In the first layer, the bitwise propagate and
generate signals, $p_i$, $g_i$, are formed, followed by the
computation of eight pairs of 8-bit group generate and
propagate signals $P_j^{j-7}$, $G_j^{j-7}$ in the second layer.
These are then assimilated in the global carry look-ahead
block to generate the sum selection carries,
$c_7, c_{15}, \ldots, c_{55}$, which select pre-computed
8-bit sums. These 8-bit adders are also based on carry
look-ahead.
The CLA equations are given as Equations (11)-(14).
Each CLA level consists of an AND and an OR gate, which
requires a significantly lower sum of weights in the CRTL
implementation than a single-gate AND-OR implementation.
The six-stage critical path of the 64-bit adder consists of
the domino-OR2 to generate $p_7$ (despite its lower logical
effort and parasitic delay, this gate has a higher fanout
than that of $g_0$), AND8 and OR8 to generate $G_7^0$, AND8
and OR8 to generate $c_{55}$ in the global CLA block, and a
2:1 MUX to select the sum.
Figure 4. 64-bit adder block diagram. (Blocks shown: 1-bit p, g; 8-bit group P, G; 8-bit global CLA; 8-bit sum by CLA; 8-bit sum by carry-select.)

The bitwise propagate and generate signals are computed
as follows

$$p_i = a_i + b_i \tag{11}$$

$$g_i = a_i \cdot b_i, \tag{12}$$

and from these the 8-bit block group propagate and generate
signals are given by

$$P_7^0 = p_7 \cdot p_6 \cdot p_5 \cdot p_4 \cdot p_3 \cdot p_2 \cdot p_1 \tag{13}$$

$$G_7^0 = g_7 + p_7 \cdot g_6 + p_7 \cdot p_6 \cdot g_5 + \ldots + p_7 \cdot p_6 \cdot p_5 \cdot p_4 \cdot p_3 \cdot p_2 \cdot p_1 \cdot g_0. \tag{14}$$

Finally the 8-bit block carry outputs are given by

$$c_7 = G_7^0. \tag{15}$$

A similar expression to Equation (14) may be written for
generating the global look-ahead carries.
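Equations (11), (12), (14) and (15) can be checked functionally for the least significant 8-bit block, which has no carry-in. The sketch below is a bit-level model only and says nothing about the CRTL or domino implementation; the function names are illustrative.

```python
def bitwise_pg(a, b, width=8):
    """Bitwise propagate p_i = a_i OR b_i and generate g_i = a_i AND b_i."""
    p = [((a >> i) | (b >> i)) & 1 for i in range(width)]
    g = [((a >> i) & (b >> i)) & 1 for i in range(width)]
    return p, g


def group_generate(p, g):
    """AND-OR form of Equation (14): g7 + p7*g6 + ... + p7*...*p1*g0."""
    result = 0
    prefix = 1  # running product p[top] * ... * p[k+1]
    for k in range(len(g) - 1, -1, -1):
        result |= prefix & g[k]   # one AND term feeding the OR
        prefix &= p[k]
    return result


def block_carry_out(a, b, width=8):
    """Carry out of the block, c7 = G (Equation (15)), with no carry-in."""
    p, g = bitwise_pg(a, b, width)
    return group_generate(p, g)


# Exhaustive check against ordinary integer addition for all 8-bit operands.
all_match = all(
    block_carry_out(a, b) == (a + b) >> 8
    for a in range(256)
    for b in range(256)
)
print("G matches the 8-bit carry out for all operands:", all_match)
```

Each iteration of the loop in `group_generate` corresponds to one AND term of Equation (14), the widest of which has eight inputs, and the final OR combines eight terms; this is the AND8/OR8 pair on the critical path.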
6.2 Delay Estimation and Comparison
In addition to the delay model for CRTL discussed
earlier, in order to evaluate the adder delay it is necessary
to characterize the domino gates using HSPICE simulation of
the gate delay for various output loads, according to the LE
rules. Characterization of domino gates considers only the
one transition of interest, which is the falling transition for
the dynamic pull down and the rising transition for the
hi-skew static inverter. This is repeated for each of the
gates, and the results are shown in Table 1. Note that the
dynamic gates listed consist of the pull down path only,
excluding the static inverter.
Table 1. Normalized LE parameters of various gates in 0.35 µm technology, in units of τ = 40 ps.

Gate Type          Logical effort, g   Parasitic delay, p
Inverter           1                   1.18
Hi-skew Inverter   0.7                 1
dyn-NAND2          0.4                 1.8
dyn-NOR2           0.3                 1.4
2:1 static MUX     1.13                2.6

The delay of the critical path to s63, dyn-OR2 →
CRTL-AND8 → CRTL-OR8 → CRTL-AND8 → CRTL-OR8 →
MUX2, is calculated using Equations (5) and (9) and
Table 1. For the 8-input CRTL gates we use an neff value of
10. In addition, we must consider the fan-out of 7 of the
dyn-OR2 gate (which drives 7 unit weight CRTL inputs).
The other gates have unity electrical effort. From
Equation (7), the optimized delay of the two-stage dyn-OR2
gate is therefore given by

$$\begin{aligned} d_{OR2,min} &= N F^{1/N} + P \\ &= 2\{g_{NOR2} \times g_{HS\text{-}Inv} \times h_{HS\text{-}Inv}\}^{0.5} + p_{NOR2} + p_{HS\text{-}Inv} \\ &= 2\{0.3 \times 0.7 \times 7\}^{0.5} + 1.4 + 1 \\ &= 4.8\tau. \end{aligned} \tag{16}$$

From this the critical path delay is calculated as follows

$$\begin{aligned} d_{s63} &= d_{OR2,min} + 4 \times d_{CRTL10} + (gh + p)_{MUX2} \\ &= 4.8 + 4 \times 4.26 + 1.13 + 2.6 \\ &= 25.6\tau \\ &= 4.9\ \text{FO4}. \end{aligned} \tag{17}$$
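The arithmetic of Equations (16) and (17) can be reproduced directly from Table 1 and Equation (9). The sketch below uses only values stated in the text (τ = 40 ps, FO4 = 207 ps, neff = 10, fan-out of 7); the variable and function names are illustrative.

```python
import math

TAU_PS = 40.0
FO4_PS = 207.0


def crtl_delay_tau(n_eff, h):
    """CRTL gate delay in units of tau, following Equation (9)."""
    return (0.002 * n_eff + 0.34) * h + math.log(n_eff) + 1.6


# Two-stage dyn-OR2: dynamic NOR2 pull-down followed by a hi-skew inverter,
# driving 7 unit-weight CRTL inputs (Equation (16)).
g_nor2, p_nor2 = 0.3, 1.4
g_hs_inv, p_hs_inv = 0.7, 1.0
fanout = 7.0
d_or2 = 2 * math.sqrt(g_nor2 * g_hs_inv * fanout) + p_nor2 + p_hs_inv

# Four 8-input CRTL gates (AND8, OR8, AND8, OR8) with n_eff = 10 and h = 1.
d_crtl = crtl_delay_tau(10, 1.0)

# 2:1 static MUX with unity electrical effort.
d_mux = 1.13 * 1.0 + 2.6

d_total = d_or2 + 4 * d_crtl + d_mux        # Equation (17), in tau
d_total_fo4 = d_total * TAU_PS / FO4_PS

print(f"dyn-OR2: {d_or2:.1f} tau, CRTL gate: {d_crtl:.2f} tau, MUX2: {d_mux:.2f} tau")
print(f"critical path: {d_total:.1f} tau = {d_total_fo4:.1f} FO4")
```

Carrying full precision through the sum instead of rounding each term changes the total only in the second decimal place, so the estimate of about 25.6 τ or 4.9 FO4 is unaffected.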
The proposed adder consists of 3653 transistors and 342
unit capacitors. The critical path was also simulated,
including wiring capacitance estimates based on the traversed
CRTL and domino cell pitch, and the extracted gate layouts;
the critical path delay thus obtained was 5.3 FO4. Note
that the 207 ps FO4 delay is a very slow process corner for
a drawn channel length of 0.4 µm, and is the fastest we had
available ([11] similarly reports 162 ps for the 0.25 µm
process used in that work). It is therefore not surprising that
the 0.5 µm process for which the 930 ps delay is reported
in [7] has an FO4 delay less than ours, especially if a faster
process corner was used.

Table 2. Comparison of high speed 64-bit adders.

64-bit Adder       # Stages   Tech. (µm)   LE (FO4)   Sim. (FO4)
D-CLA [3]          14         0.18         11.1       13.6
D-LCNSA [3]        9          0.18         9.0        9.5
Intel D-QTA [6]    10         0.10         8.3        -
D-HCA [8]          10         0.10         8.26       -
D-KSA [3]          6          0.18         6.2        7.4
HP mod. Ling [7]   4          0.5          -          7
Harris [4]         -          0.6          -          6.4
OPL [11]           8          0.25         -          2.9
→ This Work        6          0.35         4.9        5.3
The FO4 delay comparison with eight other dynamic
high speed adders is shown in Table 2, with the logical
effort estimate and the simulated or measured delay values
listed where available. The comparison suggests a significant
delay improvement of almost 1.1 FO4 or 17% compared
to Harris’ aggressive domino design. The OPL adder is
included for completeness to acknowledge other novel circuit
techniques. It has the significant drawback of requiring 8
clock phases, which raises significant power dissipation
issues, in addition to the reduced noise margin of OPL gates.
Table 2 also shows that delay is related to, but not
proportional to, the number of gate levels on the critical
path, so comparing delay estimates based on this simple
metric is inconclusive.
7 Conclusions and Future Work
A high speed 64-bit adder based on a hybrid carry
look-ahead/carry-select scheme using Charge Recycling
Threshold Logic and conventional domino logic has been
proposed. The worst case critical path delay was shown to
be significantly improved compared to previously proposed
domino high-speed adders. The results show that by
combining TL and conventional CMOS logic with the
appropriate architectural strategy, relatively fast arithmetic
circuits may be achieved.
The work presented here leaves a number of unresolved
questions. The important issue of power dissipation has not
been addressed. Power dissipation may be traded for delay,
and the energy-delay curves for adders may cross [8], which
implies that single point delay comparisons such as in
Table 2 are not always meaningful. The energy-delay
dependency of the proposed adder is currently under
investigation. The results presented here suggest that the
substantial delay improvement over domino justifies the added
design complexity of CRTL.
References
[1] P. Celinski, S. D. Cotofana, and D. Abbott. Area efficient,
high speed parallel counter circuits using charge recycling
threshold logic. In Proc. IEEE International Symposium on
Circuits and Systems, volume 5, pages 233–236, May 2003.
[2] P. Celinski, J. F. López, S. Al-Sarawi, and D. Abbott. Low
power, high speed, charge recycling CMOS threshold logic
gate. IEE Electronics Letters, 37(17):1067–1069, August
2001.
[3] H. Dao and V. G. Oklobdzija. Application of logical effort
techniques for speed optimization and analysis of
representative adders. In Proc. Thirty-Fifth Asilomar
Conference on Signals, Systems and Computers, volume 2,
pages 1666–1669, November 2001.
[4] M. Horowitz. EE271 class notes (adders). Stanford
University, http://eeclass.stanford.edu/ee371/.
[5] B.-S. Kong, J.-D. Im, Y.-C. Kim, S.-J. Jang, and Y.-H.
Jun. CMOS differential logic family with self-timing and
charge-recycling for high-speed and low-power VLSI. IEE
Proceedings Circuits, Devices and Systems, 150(1):45–50,
February 2003.
[6] S. Mathew, M. Anders, R. Krishnamurthy, and S. Borkar.
A 4 GHz 130nm address generation unit with 32-bit
sparse-tree adder core. In Symposium on VLSI Digest of
Technical Papers, pages 126–127. IEEE, 2002.
[7] S. Naffziger. A sub-nanosecond 0.5µm 64b adder design.
In International Solid State Circuits Conference Digest of
Technical Papers, pages 362–363. IEEE, 1996.
[8] V. Oklobdzija, B. Zeydel, D. Hoang, S. Mathew, and
R. Krishnamurthy. Energy-delay estimation technique for
high-performance microprocessor VLSI adders. In Proceedings
16th IEEE Symposium on Computer Arithmetic, pages
272–279, June 2003.
[9] H. Özdemir, A. Kepkep, B. Pamir, Y. Leblebici, and
U. Çilingiroğlu. A capacitive threshold-logic gate. IEEE
JSSC, 31(8):1141–1149, August 1996.
[10] M. Padure, S. Cotofana, and S. Vassiliadis. A low-power
threshold logic family. In Proc. IEEE International
Conference on Electronics, Circuits and Systems, pages
657–660, 2002.
[11] S. Sun, L. McMurchie, and C. Sechen. A high-performance
64-bit adder implemented in output prediction logic. In
Proceedings Conference on Advanced Research in VLSI, pages
213–222, 2001.
[12] I. E. Sutherland, R. F. Sproull, and D. L. Harris. Logical
Effort: Designing Fast CMOS Circuits. Morgan Kaufmann,
1999.
