Ultra-Low Power 32-bit Pipelined Adder Using Subthreshold Source-Coupled Logic with 5fJ/stage PDP by Tajalli, Armin et al.
Ultra-Low Power 32-bit Pipelined Adder
Using Subthreshold Source-Coupled Logic
with 5fJ/stage PDP
Armin Tajalli a Elizabeth J. Brauer b Yusuf Leblebici a
aMicroelectronic Systems Lab. (LSM),
Ecole Polytechnique Fédérale de Lausanne (EPFL), 1015 Lausanne, Switzerland
bElectrical Eng. Dept., Northern Arizona Univ., Flagstaff AZ 86011, USA
Abstract
This article presents a new approach for improving the power-delay performance
of subthreshold source-coupled logic (STSCL) circuits. Using a simple two-phase
pipelining technique, it is possible to increase the activity rate of STSCL gates with
negligible additional cost, and hence reduce the total system energy consumption
per operation. In the proposed pipelined topology, each STSCL gate is followed
by a simple cross-coupled differential pair operating as a state keeper with a very
low power consumption and small area overhead. Measurement results on a 32-
bit pipelined adder chain fabricated with 0.18µm CMOS technology show that the
proposed approach can achieve a significant reduction in power-delay product (PDP)
down to 5fJ/stage.
Key words: CMOS integrated circuits, ultra low power circuit design,
source-coupled logic (SCL), current-mode logic (CML), subthreshold SCL
(STSCL), pipelined SCL.
1 Introduction
The demand for implementing ultra low power circuit building blocks in many
emerging applications where the energy consumption is extremely important
has made the subthreshold CMOS circuit techniques very attractive [1]. Ap-
plications such as wearable computing and implantable systems require very
low power circuits with low sensitivity to the supply voltage variations for ro-
bust operation where circuit performance is ideally independent of the supply
voltage [2].
Preprint submitted to Elsevier 7 January 2009
Fig. 1. Conventional source-coupled logic (SCL) circuit topology [7].
While the power consumption of conventional CMOS digital circuits can be
reduced substantially by proper biasing in subthreshold regime [3]-[5], they
generally require very careful control of their supply voltage (VDD), since
both the speed of operation and the power consumption depend critically
on VDD [6].
Source-coupled logic (SCL) circuits, shown in Fig. 1, exhibit a very low
sensitivity to supply voltage variations [7]. Indeed, the speed of operation in
SCL circuits is independent of the supply voltage and can be adjusted through
the tail bias current. Meanwhile, SCL circuits exhibit good immunity to the
substrate and supply noise [7]. Recent results by the authors have shown that
it is possible to use this type of circuits in subthreshold regime where the
power consumption can be reduced to 1fJ/gate and even less [8], [9]. These
properties make the subthreshold SCL (STSCL) circuits a good choice for low
voltage and low power applications.
Building on this foundation, we present in this article a new approach for
improving the performance of STSCL circuits in terms of power-delay product
(PDP). Using a simple two-phase pipelining technique, it is possible to increase
the activity rate in STSCL circuits and hence utilize more efficiently the static
power consumption of the gates. A dramatic improvement in PDP (by a factor of
14) is demonstrated for a 32-bit adder that is shown to operate with 5fJ/gate
PDP, independent of the supply voltage.
In Section 2, a short overview on STSCL circuits will be presented. The pro-
posed pipelining technique is introduced in Section 3, and Section 4 presents
the simulation and measurement results.
2 Subthreshold SCL Circuits
2.1 Conventional SCL
In an SCL circuit, the logic operation takes place in the switching network
that is composed of NMOS differential pair transistors as illustrated in Fig.
1. In this configuration, the constant tail bias current ISS will be switched
between two NMOS transistors in each stage and finally will be steered to one
of the output branches. This current is converted to an output voltage by the
load resistances (RL), which determine the output logic levels [10]. Generally,
PMOS devices biased in triode region are used as the load resistances. The
required output voltage swing (VSW = RLISS) should be high enough to switch
the NMOS transistors of the following SCL stages. The output voltage swing
can be controlled by a replica bias circuit to make sure the output voltage
swing will remain high enough over process, temperature, and supply voltage
(PVT) variations [9].
The main speed limiting factor in SCL topology arises from the circuit output
time constant. Hence, the propagation delay of each gate can be estimated by:
$$t_d \approx \ln 2 \cdot R_L C_L = \ln 2 \cdot \frac{V_{SW} C_L}{I_{SS}} \qquad (1)$$
where CL stands for the total equivalent output capacitance seen by the SCL
gate.
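As a rough numerical illustration of (1), the sketch below evaluates the gate delay for a few tail bias currents. The load capacitance of 10 fF and the list of currents are illustrative assumptions rather than measured values, and the function name is hypothetical.

```python
import math


def scl_gate_delay(v_sw, c_load, i_ss):
    """Gate delay from Eq. (1): t_d ~ ln(2) * V_SW * C_L / I_SS (seconds)."""
    return math.log(2) * v_sw * c_load / i_ss


V_SW = 0.2      # output swing in volts (example value)
C_L = 10e-15    # load capacitance in farads (assumed for illustration)

for i_ss in (10e-12, 1e-9, 100e-9):  # tail currents in amperes (assumed)
    delay = scl_gate_delay(V_SW, C_L, i_ss)
    print(f"I_SS = {i_ss:.0e} A -> t_d = {delay:.3e} s")
```

Under this first-order model the delay scales inversely with the tail current and does not depend on VDD.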
2.2 Subthreshold SCL
To maintain the desired output voltage swing at very low bias current levels,
it is necessary to increase the load resistance value in inverse proportion to
the reducing tail bias current as
RL = VSW/ISS. (2)
In subthreshold operation, the tail bias current would be in the range of a few
nA or even less. Therefore, to obtain a reasonable output voltage swing, the
load resistance should be in the range of hundreds of MΩ. Meanwhile, this
resistance should be controlled very accurately based on the ISS value. Hence,
a well controlled high resistivity load device with a very small area is required.
For this range of resistivity, conventional PMOS devices biased in triode region
cannot be utilized since the required channel length of the transistor would be
Fig. 2. Compact subthreshold SCL circuit topology [8].
impractically large. The conventional bulk-source connected PMOS load con-
figuration [Fig. 1] results in a current source with almost infinite impedance,
even for deep sub-micron devices. Hence, the gain would not be limited, nei-
ther would the amplitude. However, the proposed configuration illustrated in
Fig. 2 for the load devices produces a finite and controllable differential re-
sistance, which, associated with the transconductance of the differential pair
will provide a controlled, limited gain and amplitude. Hence, it is possible
to implement a very high resistivity load device using a single minimum-size
PMOS transistor [11].
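To give a feel for the resistance values implied by (2), the following sketch computes the required load resistance for a fixed swing over several tail currents. The swing of 0.2 V and the current values are illustrative assumptions; the function name is hypothetical.

```python
def load_resistance(v_sw, i_ss):
    """Required load resistance from Eq. (2): R_L = V_SW / I_SS (ohms)."""
    return v_sw / i_ss


V_SW = 0.2  # output swing in volts (assumed example value)

for i_ss in (10e-9, 1e-9, 100e-12):  # tail currents in amperes (assumed)
    r_load = load_resistance(V_SW, i_ss)
    print(f"I_SS = {i_ss:.0e} A -> R_L = {r_load / 1e6:.0f} Mohm")
```

At a tail current of 1 nA the required resistance is already about 200 MΩ, which is why a triode-region PMOS load is impractical in this regime.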
As shown in Fig. 2, a replica bias circuit will produce the proper gate bias
voltage for PMOS load devices (VBP ) to control the output voltage swing [9].
The voltage swing must be selected larger than 4nnUT (nn is the subthreshold
slope factor of NMOS differential pair devices and UT is the thermal voltage)
to make sure that the NMOS differential pair devices will switch completely
[12].
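The lower bound on the swing can be estimated numerically. The sketch below computes 4·nn·UT using the exact SI values of the Boltzmann constant and the elementary charge. The slope factor nn = 1.3 and the temperature of 300 K are assumptions chosen for illustration, not values given in this article.

```python
BOLTZMANN = 1.380649e-23            # J/K (exact SI value)
ELEMENTARY_CHARGE = 1.602176634e-19  # C (exact SI value)


def thermal_voltage(temp_kelvin):
    """Thermal voltage U_T = k*T/q in volts."""
    return BOLTZMANN * temp_kelvin / ELEMENTARY_CHARGE


def min_swing(n_n, temp_kelvin=300.0):
    """Minimum swing for complete switching: V_SW > 4 * n_n * U_T."""
    return 4.0 * n_n * thermal_voltage(temp_kelvin)


# n_n = 1.3 is an assumed slope factor for illustration only.
print(f"Minimum swing: {min_swing(1.3) * 1e3:.1f} mV")
```

With these assumed values the bound is roughly 134 mV, so a swing in the range of 150 mV to 200 mV leaves some margin.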
Measurement results show that the tail bias current of the STSCL circuit
built using the topology of Fig. 2 can be reduced down to 10pA with a supply
voltage of as low as 350mV and still maintain an output voltage swing of
150mV and a PDP of less than 0.1fJ/gate [9].
2.3 Power-Delay Performance
Unlike the conventional CMOS gates, SCL circuits draw a constant bias cur-
rent from the supply voltage. This bias current should be kept high enough to
have an acceptable delay in each gate. Regarding (1), the power-delay product
(PDP [13]) of STSCL gates is equal to
$$PDP_{SCL} = \ln 2 \cdot V_{DD} V_{SW} C_L \qquad (3)$$
Using VDD=0.5V and VSW=0.2V, for example, the PDP of an SCL gate can
be as low as 70aJ/fF/gate. However, compared to the conventional CMOS
digital circuits, an SCL circuit with logic depth of N > VDD/VSW exhibits
higher PDP which is mainly due to the static current consumption of SCL
gates [10]. In a digital SCL circuit with logic depth of N , the total delay is
td,N = N · td and total power consumption is P = NVDDISS. Therefore, for an
SCL digital circuit with a logic depth of N , the maximum operating frequency
would be:
$$f_{op,N} \approx \frac{1}{t_{d,N}} = \frac{I_{SS}}{\ln 2 \cdot N V_{SW} C_L} \qquad (4)$$
which is N times less than the maximum possible operating frequency of each
SCL gate:
$$f_{op,Max} \approx \frac{1}{t_d} = \frac{I_{SS}}{\ln 2 \cdot V_{SW} C_L} \qquad (5)$$
Here, we are neglecting the effect of incomplete settling when N is small. The
main reason for this reduction is that the activity rate in a digital circuit with
the logic depth of N is reduced by a factor of N while the power consumption
of each gate remains the same.
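The figures in this subsection can be checked with a short calculation. The sketch below evaluates (3), (4) and (5); VDD = 0.5 V and VSW = 0.2 V are the example values quoted above, while the tail current of 1 nA, the 1 fF load and the logic depth of 32 are assumptions for illustration.

```python
import math


def pdp_per_gate(v_dd, v_sw, c_load):
    """Eq. (3): PDP of one SCL gate in joules."""
    return math.log(2) * v_dd * v_sw * c_load


def f_op_max(i_ss, v_sw, c_load):
    """Eq. (5): maximum operating frequency of a single gate in hertz."""
    return i_ss / (math.log(2) * v_sw * c_load)


def f_op_chain(i_ss, v_sw, c_load, depth):
    """Eq. (4): maximum operating frequency of a chain with logic depth N."""
    return f_op_max(i_ss, v_sw, c_load) / depth


pdp = pdp_per_gate(0.5, 0.2, 1e-15)  # per 1 fF of load
print(f"PDP per gate per fF: {pdp * 1e18:.1f} aJ")
print(f"f_op,Max: {f_op_max(1e-9, 0.2, 1e-15):.3e} Hz")
print(f"f_op,N (N = 32): {f_op_chain(1e-9, 0.2, 1e-15, 32):.3e} Hz")
```

The first result is about 69.3 aJ per fF, in line with the 70aJ/fF/gate figure quoted above.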
Defining the activity rate (or duty rate) as:
$$\alpha = \frac{f_{op}}{f_{op,Max}} \qquad (6)$$
and regarding (3), one can show that the power-delay product with logic depth
of N is:
$$PDP_{SCL,N} = \ln 2 \cdot \frac{N}{\alpha} V_{DD} V_{SW} C_L \qquad (7)$$
Therefore, by increasing the activity rate it is possible to reduce the power-
delay product of the proposed SCL circuit with a logic depth of N . Comparing
this result with the PDP of CMOS gates [6], [10]:
$$PDP_{CMOS,N} = \ln 2 \cdot N V_{DD}^2 C_L \qquad (8)$$
it can be seen that increasing the activity rate of the STSCL topology can
help to achieve a PDP performance which is at least as good as the PDP
of conventional CMOS topology, with the additional benefit of keeping the
output swing and the delay completely independent of the supply voltage.
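Dividing (7) by (8) gives a ratio of VSW/(α·VDD), so under these two first-order models the STSCL chain matches the CMOS chain when α = VSW/VDD. The sketch below checks this numerically. VDD = 0.5 V and VSW = 0.2 V are the example values used earlier; the 1 fF load and the depth of 32 are illustrative assumptions, and the function names are hypothetical.

```python
import math


def pdp_scl_chain(depth, alpha, v_dd, v_sw, c_load):
    """Eq. (7): PDP of an SCL chain with logic depth N and activity rate alpha."""
    return math.log(2) * (depth / alpha) * v_dd * v_sw * c_load


def pdp_cmos_chain(depth, v_dd, c_load):
    """Eq. (8): PDP of a CMOS chain with logic depth N."""
    return math.log(2) * depth * v_dd ** 2 * c_load


V_DD, V_SW, C_L, N = 0.5, 0.2, 1e-15, 32  # C_L and N are assumed values

break_even_alpha = V_SW / V_DD  # follows from (7) / (8) = 1
for alpha in (1.0 / N, break_even_alpha, 0.5):
    ratio = pdp_scl_chain(N, alpha, V_DD, V_SW, C_L) / pdp_cmos_chain(N, V_DD, C_L)
    print(f"alpha = {alpha:.4f} -> PDP_SCL / PDP_CMOS = {ratio:.2f}")
```

With these example voltages the break-even activity rate is 0.4, so an activity rate of 0.5 would place the STSCL chain slightly below the CMOS estimate.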
Regarding (4), one can conclude that the delay (or the maximum operating
frequency) in an STSCL gate depends on the tail bias current (ISS), but not
on VDD. Therefore, the delay of a logic block can be controlled without in-
fluencing PDP, which is not possible in conventional CMOS topologies. More
importantly, the speed and the operation (supply) voltage can be effectively
decoupled in STSCL circuits.
Meanwhile, to reduce the PDP of STSCL circuits as predicted in (7), α should
be kept as large as possible. This observation does not contradict similar
results for conventional CMOS, where
$$(P/f)_{CMOS} = C_L V_{DD}^2 \left(1 + \frac{2}{\alpha}\, e^{-V_{DD}/(n U_T)}\right) \qquad (9)$$
as shown in [1]. Here, power-to-frequency is defined as:
$$(P/f) = \frac{P_{diss}}{f_{op}} \qquad (10)$$
However, the influence of VDD on (P/f) is quite different in conventional
CMOS, where an optimum VDD value to minimize (P/f) can be found, espe-
cially for small α values, due to significant leakage in CMOS topology.
Therefore, assuming that the system clock frequency is dictated by the longest
delay path between two consecutive register stages, and assuming that the
activity rate depends inversely on the maximum logic depth between two
registers, it is most beneficial to keep the logic depth as shallow as possible,
and thus, increase α. This calls for very short (ideally one stage) pipelining in
STSCL systems, which is demonstrated with an example in the next Section.
3 Pipelined STSCL Topology with Compound Gates
In this Section, some techniques for improving the performance of STSCL
circuits will be introduced. First, the performance of stacked STSCL gates will
be analyzed and then the proposed pipelining technique will be introduced.
3.1 Compound STSCL Structure
Compound SCL gates, which merge two or more logic operations into a single
gate, make it possible to reduce the power consumption and improve
the speed of operation simultaneously. Figure 3 shows an example in which an
Fig. 3. Compound STSCL gate (AND operation followed by XOR gate).
AND gate and an XOR gate are merged together to construct the proposed
compound logic operation. Using this technique, it is possible to have only
one pair of output load devices and a single tail biasing transistor and hence
reduce the area in addition to halving the total current consumption.
Assuming that the time constant at the output nodes of each SCL gate is
equal to
$$\tau_L = R_L C_L = \frac{V_{SW} C_L}{I_{SS}} \qquad (11)$$
then the total equivalent time constant of a simple two stage SCL gate will
be:
$$\tau_{tot,A} \approx 2 \times \frac{V_{SW} C_L}{I_{SS}} \qquad (12)$$
Assuming a compound STSCL gate with M stacked levels of NMOS differen-
tial pairs (for example in Fig. 3: M=3), then the total time constant of the
circuit will be
$$\tau_{tot,B} \approx \frac{V_{SW} C_L}{I_{SS}} + M \frac{C_S}{g_{ms}} \qquad (13)$$
Fig. 4. Performance improvement in an (8×8) multiplier circuit using compound
STSCL gates.
where $g_{ms} = I_{SS}/(n_n U_T)$ and $C_S$ is the parasitic capacitance seen from the
source of each NMOS differential pair. Here, it is assumed that the time constant
at the intermediate nodes of a compound SCL gate is $\tau_i = C_S/g_{ms}$ (see
Fig. 3) and the total time constant can be calculated by
$\tau_{tot} = \tau_L + \sum_{i=1}^{M} \tau_i$ [12]. Comparing (12) and (13), it can be
concluded that as long as $M U_T C_S \ll V_{SW} C_L$, or
$$M \ll \frac{V_{SW} C_L}{U_T C_S} \qquad (14)$$
stacking differential pair stages will not degrade the speed of operation. Simu-
lations show that the proposed technique can reduce the power dissipation of
an (8×8) Multiplier by about 40% and at the same time improve the speed of
operation. Figure 4 depicts this improvement for different operating frequen-
cies.
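The comparison between a two-stage implementation and a compound gate can be sketched numerically from (11) to (13). All component values below (10 fF load, 2 fF source parasitic, nn = 1.3, UT = 25.85 mV, 1 nA tail current) are assumptions chosen only to illustrate the orders of magnitude; the function names are hypothetical.

```python
def two_stage_tau(v_sw, c_load, i_ss):
    """Eq. (12): time constant of two cascaded simple SCL gates."""
    return 2.0 * v_sw * c_load / i_ss


def compound_tau(v_sw, c_load, c_src, i_ss, levels, n_n, u_t):
    """Eq. (13): output time constant plus one C_S/g_ms term per stacked level."""
    g_ms = i_ss / (n_n * u_t)  # source transconductance in weak inversion
    return v_sw * c_load / i_ss + levels * c_src / g_ms


# Assumed illustrative values, not taken from the measurements.
V_SW, C_L, C_S, I_SS = 0.2, 10e-15, 2e-15, 1e-9
N_N, U_T, M = 1.3, 0.02585, 3

print(f"two cascaded gates: {two_stage_tau(V_SW, C_L, I_SS):.3e} s")
print(f"compound gate (M = {M}): {compound_tau(V_SW, C_L, C_S, I_SS, M, N_N, U_T):.3e} s")
```

With these assumed values the stacked-level terms add only about 10% to the output time constant, so the compound gate remains well below the two-stage figure.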
3.2 Two-Phase Pipelining Techniques
An effective approach for increasing the activity rate is using a simple two-
phase pipelining technique [13], [14]. Figure 5 shows one possible technique
to implement two-phase latch-based pipelining where the output of each gate
is latched during one clock phase, and passed on to the next stage during
the next clock phase - effectively reducing the maximum logic depth to two
consecutive gates.
The topology of a single stage pipelined gate is shown in Fig. 5(a). When the
clock is low, the latch is disabled and the gate is evaluating the output value
based on the input data. In this period, as the gate is evaluating the output, the
Fig. 5. Pipelining technique for improving the activity rate in STSCL topology. (a)
Single stage pipelined gate and timing diagram. (b) Multi stage pipelined logic.
input data should remain constant.
When the clock is high, on the other hand, the output is latched and the
following stages can start their evaluation step. Since in this period the output of
this stage is kept constant by the latch, input data can attain its new value.
Therefore, the input data rate can be increased practically to fD = 1/(2td).
This input data rate does not reduce if the logic depth increases (Fig. 5(b))
since during the evaluation phase of each gate, its input is kept constant by
the latch of the previous stage and hence does not change. Without pipelining,
the entire system consisting of N stages needs to wait until all the gates in the
chain complete their evaluation, hence the maximum data rate is limited to
fD = 1/(Ntd). In conclusion, pipelining can theoretically improve
the speed by a factor of N/2.
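The two data-rate expressions above can be compared directly. In the sketch below the gate delay of 1 µs is an arbitrary assumption (it cancels in the ratio), N = 32 corresponds to the 32-bit adder chain, and the function names are hypothetical.

```python
def data_rate_unpipelined(t_d, depth):
    """Maximum input data rate without pipelining: f_D = 1 / (N * t_d)."""
    return 1.0 / (depth * t_d)


def data_rate_pipelined(t_d):
    """Maximum input data rate with two-phase pipelining: f_D = 1 / (2 * t_d)."""
    return 1.0 / (2.0 * t_d)


def pipelining_speedup(depth, t_d=1e-6):
    """Ratio of the two data rates; equals N/2 regardless of t_d."""
    return data_rate_pipelined(t_d) / data_rate_unpipelined(t_d, depth)


print(f"Theoretical speedup for N = 32: {pipelining_speedup(32):.1f}")
```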
Instead of using explicit latch stages, such two-phase pipelining can be achieved
by increasing and reducing the tail current bias of alternating stages, using
the gate terminal of the tail current bias transistor of each stage as the clock
input [VBN in Fig. 6(a)]. In this approach, as illustrated in Fig. 6(a) for the
example of an STSCL full adder (FA) gate, the current bias of odd stages
is reduced to a low (yet non-zero) level to retain (hold) their output while
Fig. 6. (a) STSCL full adder and keeper stage. Here, the tail current bias VBN
is switched according to CK (or its complement) while VBN0 is kept as a constant bias. (b)
Simulated output of the pipelined FA chain showing the holding and tracking modes
of operation.
the current bias of even stages is raised to the nominal operating value to
enable evaluation. Very simple cross-coupled keeper stages connected to each
gate output ensure that the output levels do not degrade significantly during
the hold phase. Since the keeper stage is used to maintain the latest state of
the output of each gate, it does not need to be very fast. Therefore, the bias
current of keeper stage (ISS,L) can be chosen as low as 1% of the bias current
of the main gate (ISS). This means that the power overhead of the keeper
stages is virtually negligible. Meanwhile, since the bias current of half of the
gates is almost zero in each clock phase, the overall power consumption of the
system will be reduced by a factor of two. Figure 6(b) shows the transient
simulation results for the output of an adder stage in a chain of adders. In this
figure it is possible to see the hold and evaluation phases for ISS,L = 0.01ISS
for VSW=0.2V.
Assuming that the delay of each gate is td, theoretically it is possible to increase
the input data rate in Fig. 5 to approximately 1/(2td). Therefore, the power-
Fig. 7. Test chip photomicrograph.
Fig. 8. Measured output of the pipelined full adder chain in comparison to the (a)
input data, and (b) reference clock. Here, VDD=1V, VSW=0.2V, ISS=1nA.
delay product of a pipelined STSCL system can be calculated as
$$PDP_{SCL,N,Pipe} = 2 \ln 2 \cdot N V_{DD} V_{SW} C_L \qquad (15)$$
Regarding (7) and (15), it can be seen that pipelining helps to reduce the
system power-delay product by a factor of approximately N/2 which is a
considerable improvement especially in deeper pipelines with a large number
of stages. In practice, the improvement in power-delay product is less than
this value because of increased loading at the output nodes as well as power
consumption of the keeper stage.
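The factor of N/2 can be made explicit by substituting the two activity rates into (7). This is a sketch of the intermediate step, assuming that the non-pipelined chain runs at fop = fop,Max/N (from (4) and (5)) and that the pipelined chain runs at fD = 1/(2td):

```latex
\begin{align*}
% Non-pipelined chain: alpha = 1/N in (7)
PDP_{SCL,N}\big|_{\alpha = 1/N} &= \ln 2 \cdot N^2 \, V_{DD} V_{SW} C_L \\
% Two-phase pipeline: alpha = 1/2 in (7), which reproduces (15)
PDP_{SCL,N}\big|_{\alpha = 1/2} &= 2 \ln 2 \cdot N \, V_{DD} V_{SW} C_L = PDP_{SCL,N,Pipe} \\
% Ratio of the two expressions
\frac{PDP_{SCL,N}\big|_{\alpha = 1/N}}{PDP_{SCL,N,Pipe}} &= \frac{N}{2}
\end{align*}
```

For a chain with N = 32 this first-order estimate gives an upper bound of 16 on the improvement.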
4 Experimental Results
A test chip has been fabricated in digital 0.18µm CMOS technology, which
consists of a 32-bit pipelined adder chain, and a conventional (non-pipelined)
Fig. 9. Measured delay versus tail bias current: total delay of simple adder chain, and
stage delay in pipelined adder chain. In both cases, the delay figure corresponds to
the time period between two consecutive inputs. The effective operating frequency
improves by a factor of 14 with pipelining.
32-bit ripple-carry adder as the comparison block, both designed with STSCL
topology. Figure 7 shows the test chip photomicrograph. Internal current mir-
rors are used to control the bias current of the gates and the keeper stage
separately. Each adder chain is followed by an SCL-to-CMOS level converter
circuit and an output driver.
Figure 8 shows the measured output of the pipelined FA chain in comparison to
the input data and clock. The latency is equal to NTCK/2 which in this figure
is 320µs. It is possible to measure the total delay in the simple non-pipelined
32-bit adder and also the delay of a single gate for the pipelined 32-bit adder.
The measurement results are shown in Fig. 9 as delay versus tail bias current.
The delay of both circuits can be adjusted linearly by changing their tail
bias current in a very wide range which is about three orders of magnitude
in these measurements. Note that the time delay between two consecutive
inputs can be reduced by a factor of 14 with pipelining (maximum theoretical
improvement would have been by a factor of N/2=16, as explained above).
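The latency relation quoted above can be inverted to estimate the clock period used in the measurement. The sketch below assumes that N = 32 stages contribute to the 320µs latency; the function names are hypothetical.

```python
def pipeline_latency(n_stages, t_clk):
    """Latency of the two-phase pipeline: N * T_CK / 2 (seconds)."""
    return n_stages * t_clk / 2.0


def clock_period_from_latency(latency, n_stages):
    """Clock period implied by a measured latency: T_CK = 2 * latency / N."""
    return 2.0 * latency / n_stages


# Assumes all 32 stages of the adder chain contribute to the latency.
t_clk = clock_period_from_latency(320e-6, 32)
print(f"Implied clock period: {t_clk * 1e6:.1f} us")
print(f"Implied clock frequency: {1.0 / t_clk / 1e3:.1f} kHz")
```

Under that assumption the measurement corresponds to a clock period of about 20 µs.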
The measured power-delay products for the two topologies are shown in Fig.
10. Both topologies show a relatively constant PDP over their tuning range.
The average PDPs for the simple and pipelined FA chains are 2.6pJ and 0.18pJ,
respectively, which corresponds to an improvement factor of about 14. Figure
11 shows more clearly the improvement in power consumption at iso-speed
operation or speed improvement at iso-power operation. This result is very
close to the estimation made in (15).
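The reported averages can be cross-checked with simple arithmetic. The sketch below takes the two measured chain PDP values from the text and assumes that the per-stage figure is obtained by dividing the pipelined chain PDP by its 32 stages; the variable and function names are hypothetical.

```python
PDP_SIMPLE_CHAIN = 2.6e-12      # J, measured average for the non-pipelined chain
PDP_PIPELINED_CHAIN = 0.18e-12  # J, measured average for the pipelined chain
N_STAGES = 32                   # stages in the adder chain


def improvement_factor(pdp_before, pdp_after):
    """Ratio of the non-pipelined PDP to the pipelined PDP."""
    return pdp_before / pdp_after


def pdp_per_stage(pdp_chain, n_stages):
    """Per-stage PDP, assuming the chain PDP divides evenly over the stages."""
    return pdp_chain / n_stages


factor = improvement_factor(PDP_SIMPLE_CHAIN, PDP_PIPELINED_CHAIN)
per_stage = pdp_per_stage(PDP_PIPELINED_CHAIN, N_STAGES)
print(f"Improvement factor: {factor:.1f}")
print(f"PDP per stage: {per_stage * 1e15:.2f} fJ")
```

The ratio comes out at roughly 14.4 and the per-stage value at roughly 5.6 fJ, consistent with the rounded figures of 14 and 5fJ/stage quoted in the text.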
Measurements for the pipelined adder chain have been performed for two different
keeper bias currents: ISS,L = ISS/10 and ISS,L = ISS/100. As can be seen in
Fig. 10. Measured power-delay product for the two adder topologies. The pipelined
adder topology achieves a very significant reduction of PDP, over a wide range of
operating frequencies.
Fig. 10, the results for two bias currents for the keeper stage are very close.
Therefore, it is possible to reduce the bias current of the keeper stage to
ISS/100 and hence minimize the power overhead of this stage.
It can be seen that the PDP of the proposed STSCL adder circuit with a deep
pipeline can be as low as the PDP of static CMOS adders reported in [15]-[18].
This means that, using the pipelining technique, it is possible to improve the
performance of STSCL circuits and make it comparable to that of static CMOS
circuits even with high logic depth.
5 Conclusion
A simple two-phase pipelining technique has been demonstrated to improve
the performance of subthreshold source-coupled logic circuits. It is shown that
the proposed approach can significantly increase the activity rate of logic cir-
cuits while reducing logic depth, and hence use more efficiently the static power
consumption in source-coupled logic circuits, with minimum overhead. Mea-
surement results obtained with a 32-bit pipelined adder chain structure show
that the PDP can be improved by a factor of 14 compared to the non-pipelined
topology, achieving a very low PDP of 5fJ/stage.
Fig. 11. Power-Frequency improvement achieved by pipelining technique.
Acknowledgment
The authors would like to thank M. Alioto, F. K. Gurkaynak and S. Badel for
their help in test chip design and S. Hauser for preparing the test setup.
References
[1] E. Vittoz, ”Weak Inversion for Ultimate Low-Power Logic”, in Low-Power
Electronics Design, Editor C. Piguet, CRC Press, 2005.
[2] B. A. Warneke and K. S. J. Potkonjak, ”System-architecture for sensor networks
issues, alternatives, and directions,” in Proc. IEEE Int. Solid-State Circ. Conf.
(ISSCC), Feb. 2004.
[3] R. Gonzalez, B. M. Gordon, and M. A. Horowitz, ”Supply and threshold voltage
scaling for low power CMOS,” IEEE J. Solid-State Circuits, vol. 32, no. 8, pp.
1210-1216, Aug. 1997.
[4] H. Soeleman, K. Roy, and B. C. Paul, ”Robust subthreshold logic for ultra-low
power operation,” IEEE Trans. Very Large Scale Integ. (VLSI) Syst., vol. 9,
no. 1, pp. 90-99, Sep. 2001.
[5] B. H. Calhoun, A. Wang, and A. Chandrakasan, ”Modeling and sizing for
minimum energy operation in subthreshold circuits,” IEEE J. Solid-State
Circuits, vol. 40, no. 9, pp. 1778-1786, Sep. 2005.
[6] A. P. Chandrakasan and R. W. Brodersen, ”Minimizing power consumption in
digital CMOS circuits,” in Proceedings of the IEEE, vol. 83, no. 4, pp. 498-523,
Apr. 1995.
[7] S. Badel, ”MOS current-mode logic standard cells for high-speed low-noise
applications,” PhD Dissertation, Ecole Polytechnique Fédérale de Lausanne
(EPFL), Switzerland, 2008.
[8] A. Tajalli, E. Vittoz, Y. Leblebici, and E. J. Brauer, ”Ultra low power
subthreshold MOS current mode logic circuits using a novel load device
concept,” in Proc. of Eur. Solid-State Circ. Conf. (ESSCIRC), Munich,
Germany, Sep. 2007, pp. 281-284.
[9] A. Tajalli, E. J. Brauer, Y. Leblebici, and E. Vittoz, ”Sub-threshold source-
coupled logic circuit design for ultra low power applications,” IEEE J. of Solid-
State Circuits, no. 7, vol. 43, pp. 1699-1710, Jul. 2008.
[10] J. M. Musicer and J. Rabaey, ”MOS current mode logic for low power, low noise
CORDIC computation in mixed-signal environment,” in Proc. of Int. Symp. on
Low Power Elect. and Des. (ISLPED), pp.102-107, 2000.
[11] A. Tajalli, Y. Leblebici, and E. J. Brauer, ”Implementing ultra-high-value
floating tunable CMOS resistors,” in IEE Electronics Lett., vol. 44, no. 5, pp.
349-350, Feb. 2008.
[12] P. R. Gray, P. J. Hurst, S. H. Lewis, and R. G. Meyer, Analysis and Design of
Analog Integrated Circuits, John Wiley & Sons Inc., Fourth Edition, 2000.
[13] S.-M. Kang and Y. Leblebici, CMOS Digital Integrated Circuits, McGraw-Hill,
2003.
[14] M. Mizuno et al., ”A GHz MOS adaptive pipeline technique using MOS
current-mode logic,” in IEEE J. of Solid-State Circuits, pp. 784-791, vol. 31,
no. 6, Jun. 1996.
[15] T. T. Jeong, ”Implementation of low power adder design and analysis based
on power reduction technique,” in Microelectronics Journal, vol. 39, no. 12, pp.
1880-1886, Dec. 2008.
[16] M. Alioto and G. Palumbo, ”Analysis and comparison of full adder block in
submicron technology,” in IEEE J. of VLSI Syst., vol. 10, no. 6, pp. 806-823,
Dec. 2002.
[17] J.-F. Lin, Y.-T. Hwang, M.-H. Sheu, and C.-C. Ho, ”A novel high-speed and
energy efficient 10-transistor full adder design,” in IEEE Trans. on Circ. and
Syst.-I: Regular Papers, vol. 54, no. 5, pp. 1050-1059, May 2007.
[18] A. M. Shams and M. A. Bayoumi, ”A novel high-performance CMOS 1-bit full-
adder cell,” in IEEE Trans. on Circ. and Syst.-II: Analog Digl. Signal Process.,
vol. 47, no. 5, pp. 478-481, May 2000.
