Abstract-This paper presents a novel approach for implementing ultra-low-power digital components and systems using source-coupled logic (SCL) circuit topology, operating in weak inversion (subthreshold) regime. Minimum size pMOS transistors with shorted drain-substrate contacts are used as gate-controlled, very high resistivity load devices. Based on the proposed approach, the power consumption and the operation frequency of logic circuits can be scaled down linearly by changing the tail bias current of SCL gates over a very wide range spanning several orders of magnitude, which is not achievable in subthreshold CMOS circuits. Measurements in conventional 0.18 µm CMOS technology show that the tail bias current of each gate can be set as low as 10 pA, with a supply voltage of 300 mV, resulting in a power-delay product of less than 1 fJ. Fundamental circuits such as ring oscillators and frequency dividers, as well as more complex digital blocks such as parallel multipliers designed by using the subthreshold SCL (STSCL) topology, have been experimentally characterized.
I. INTRODUCTION
THE demand for implementing ultra-low-power digital systems in many modern applications such as mobile systems [1], [2], sensor networks [3], [4], and implanted biomedical systems [5], has increased the importance of designing logic circuits in subthreshold regime [6]. In subthreshold MOSFET operation, current density is very low and the ratio of the transconductance to bias current of the device is maximum [7], [8]. Meanwhile, the exponential relationship between drain current and gate voltage makes this mode of operation very suitable for implementing widely adjustable circuits [7], [9]. Conventional CMOS logic circuits utilizing subthreshold transistors can typically operate with a very low power consumption [10]-[13], which is mainly due to the dynamic (switching) power consumption and depends quadratically on the supply voltage as $P_{dyn} \propto f V_{DD}^2$ (where $f$ is the frequency of operation and $V_{DD}$ indicates the supply voltage). Hence, reducing the supply voltage will result in reduction of power dissipation [1], [14] as well as the output logic swing. Supply voltage reduction, on the other hand, increases the delay in each gate, which means that the power dissipation, logic swing, and speed of operation are tightly related to each other. Meanwhile, the exponential relationship between power dissipation and supply voltage in subthreshold regime makes the accurate control of power consumption difficult. To implement very low power digital systems, it is necessary to minimize the energy dissipation at the system level in addition to the gate level to achieve the desired performance [10].

Source-coupled logic (SCL) circuits are widely used in mixed-mode integrated circuits where supply noise and substrate noise injection are crucial [15]. Reduced output voltage swing in SCL circuits compared to the CMOS logic gates has made this topology very suitable for high frequency applications [16], [17].
This paper explores the potential of subthreshold SCL circuits as an alternative solution for implementing ultra-low-power digital systems. In this approach, the power consumption and maximum speed of operation can be adjusted linearly through the tail bias current of each gate over a very wide range [18], [19], thus efficiently decoupling the choice of output voltage swing from power dissipation and delay.
To enable operation at very low current levels and to achieve the desired performance specifications, special circuit techniques [18]-[21] have to be applied for implementing very low power SCL circuits. In [20], the intrinsically limited output impedance of deep-submicron, short-channel pMOS devices has been used to implement very high value load resistances for SCL topology. Here, a more general approach with much less sensitivity to process and technology variations will be introduced [19]. This paper presents novel techniques for implementing subthreshold SCL (STSCL) gates where the bias current of each cell can be set as low as 10 pA. In Section II, after a brief review of SCL circuits, the proposed technique for implementing subthreshold SCL gates will be introduced. Section III discusses the power-delay performance of the proposed circuit configuration. Experimental results and comparison with conventional CMOS circuits are presented in Section IV, followed by conclusions in Section V.
II. SUBTHRESHOLD SOURCE-COUPLED LOGIC CIRCUITS

A. Conventional SCL Topology
In an SCL gate, the logic operation takes place mainly in current domain. Therefore, the speed of operation can be inherently high. As shown in Fig. 1, a logic network composed of nMOS source-coupled differential pair switches steers the tail current $I_{SS}$ to one of the output branches based on the input logic levels. The output load resistance $R_L$ converts the branch current back to the voltage domain in order to drive the subsequent SCL gates. The voltage swing at the output node, $V_{SW} = R_L I_{SS}$, should be high enough to switch completely the input differential pair of the next stage. Based on this observation, the voltage swing should be larger than $\sqrt{2}\,V_{DSAT}$ ($V_{DSAT}$ is the drain-source overdrive voltage of the input nMOS devices when $I_D = I_{SS}/2$) when the input nMOS devices are in strong inversion [22], and larger than $4 n U_T$ when the devices are in weak inversion [7] ($U_T = kT/q$ is the thermal voltage and $n$ is the subthreshold slope factor). Therefore, the required voltage swing when the devices are in subthreshold regime can be as low as $4 n U_T$, which is about 150 mV at room temperature (assuming $n \approx 1.5$). This swing in the subthreshold regime depends on the subthreshold slope factor and is independent of the threshold voltage of the nMOS switching devices. Provided that the load resistance can be made sufficiently high, this means that the switching operation of nMOS devices has low dependence on the fabrication process variations. Therefore, as long as the tail bias current is higher than junction leakage currents and the output impedance of the devices is much higher than the load resistance, the proposed topology can operate properly as a logic circuit, even in aggressively scaled deep-submicron technologies. Unlike CMOS logic circuits, where the subthreshold channel leakage current is the dominant leakage component, in STSCL topology the main leakage currents are due to the p-n junctions of the MOS devices.

Fig. 1. A conventional SCL-based inverter/buffer circuit. The switching part can be composed of a network of nMOS source-coupled pairs to implement more complex logic functions [15]. The load resistances can be implemented using pMOS devices biased in triode region.
The speed of operation in an SCL gate is mainly limited by the time constant at the output node, which is

$$\tau = R_L C_L \approx \frac{V_{SW}}{I_{SS}}\, C_L \tag{1}$$

where $C_L$ is the total load capacitance at the output node. Based on this, the propagation delay is inversely proportional to the tail bias current. Meanwhile, the circuit power-delay product (PDP) is independent of $I_{SS}$ [15], [16], [23].
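The two scaling relations discussed above (minimum swing in weak inversion and output time constant) can be checked numerically. The following is only a minimal sketch: it assumes the weak-inversion switching criterion $V_{SW} \ge 4 n U_T$ and $\tau = (V_{SW}/I_{SS}) C_L$, and the function names, the slope factor of 1.5, and the 10 fF load are illustrative choices rather than values taken from the measurements.

```python
import math

K_B = 1.380649e-23      # Boltzmann constant (J/K)
Q_E = 1.602176634e-19   # elementary charge (C)


def thermal_voltage(temp_k=300.0):
    """Thermal voltage U_T = kT/q in volts."""
    return K_B * temp_k / Q_E


def min_swing_weak_inversion(n=1.5, temp_k=300.0):
    """Assumed weak-inversion switching criterion: V_SW >= 4 * n * U_T."""
    return 4.0 * n * thermal_voltage(temp_k)


def output_time_constant(v_sw, i_ss, c_load):
    """Output-node time constant tau = R_L * C_L with R_L = V_SW / I_SS."""
    return (v_sw / i_ss) * c_load


if __name__ == "__main__":
    print(f"minimum swing: {min_swing_weak_inversion() * 1e3:.0f} mV")
    # Illustrative sweep: the time constant scales as 1 / I_SS.
    for i_ss in (10e-12, 1e-9, 100e-9):
        tau = output_time_constant(0.2, i_ss, 10e-15)
        print(f"I_SS = {i_ss:.0e} A -> tau = {tau:.2e} s")
```

Because the swing is held constant, a tenfold reduction of the tail current lengthens the time constant by the same factor, which is the linear speed-versus-current scaling the topology relies on.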
B. Load Device Concept
To maintain the desired output voltage swing at very low bias current levels, it is necessary to increase the load resistance value in inverse proportion to the reducing tail bias current as

$$R_L = \frac{V_{SW}}{I_{SS}} \tag{2}$$

In subthreshold operation, the tail bias current would be in the range of a few nA or even less. Therefore, to obtain a reasonable output voltage swing, the load resistance should be in the range of hundreds of MΩ. Meanwhile, this resistance should be controlled very accurately based on the $I_{SS}$ value. Hence, a well controlled high resistivity load device with a very small area is required. For this range of resistivity, conventional pMOS devices biased in triode region cannot be utilized since the required channel length of the transistor would be impractically large [Fig. 2(a)]. Fig. 2(c) (dotted line) shows the I-V characteristics of a pMOS device realized in 0.18 µm technology for different $V_{SG}$ values, indicating that the configuration of Fig. 2(a) results in a current source with almost infinite output impedance, even for deep-submicron devices. Hence, the gain would not be limited, neither would the amplitude. Fig. 2(b) shows the proposed load device, where the drain of the pMOS device is connected to its bulk. In this way, as illustrated in Fig. 2(c), the configuration shown in Fig. 2(b) produces a finite and controllable differential resistance, which, associated with the transconductance of the differential pair, will provide a controlled, limited gain and amplitude. Thus, it is possible to implement a very high resistivity load device using a single minimum size pMOS device. The fact that each individual pMOS load device must be confined in its own n-well also does not have a severe impact on area, as will be demonstrated later. The measured DC I-V characteristics of the device are shown in Fig. 2(d). For $V_{BD} = 0$ (bulk tied to the drain), the device operates as a very high resistivity element as expected.
This plot also shows that the measurement results are very close to the resistance values predicted by simulations.
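To give a feeling for the resistance values involved, the relation $R_L = V_{SW}/I_{SS}$ can be tabulated over the bias range of interest. This is a hedged sketch only; the 200 mV swing and the list of currents are assumed example values, and the function name is hypothetical.

```python
def required_load_resistance(v_sw, i_ss):
    """Load resistance needed to keep the swing V_SW at tail current I_SS."""
    return v_sw / i_ss


if __name__ == "__main__":
    v_sw = 0.2  # assumed output swing in volts
    # Tail currents from 10 pA to 10 nA (illustrative sweep)
    for i_ss in (10e-12, 100e-12, 1e-9, 10e-9):
        r_l = required_load_resistance(v_sw, i_ss)
        print(f"I_SS = {i_ss:.0e} A -> R_L = {r_l / 1e6:.0f} MOhm")
```

At a tail current of 1 nA the required resistance is already in the hundreds of MΩ, and it grows to tens of GΩ at 10 pA, which is why a triode-region pMOS resistor is not a practical option here.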
The cross section view of the proposed pMOS load device can be seen in Fig. 3. Connecting the drain to the bulk of the pMOS load device ties the cathode of the n-well-to-substrate reverse-biased diode to the output node. However, since the devices are minimum size, the parasitic capacitance associated with this diode is very small and can usually be neglected (in this design using 0.18 µm technology: less than 1 fF). The other important parasitic element is the forward biased source-bulk diode. As illustrated in Fig. 3, this diode can limit the possible voltage swing at the drain of the device to 400-500 mV. However, as the required voltage swing for subthreshold SCL gates is well below this value, the source-bulk diode does not influence the operation of the circuit.
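Using the EKV model, the I-V characteristics of the subthreshold pMOS device can be expressed by [7], [8]

$$I_{SD} = I_0 \exp\left(\frac{V_{BG} - V_{T0}}{n_p U_T}\right)\left[\exp\left(-\frac{V_{BS}}{U_T}\right) - \exp\left(-\frac{V_{BD}}{U_T}\right)\right] \tag{3}$$

in which $I_0 = 2 n_p \mu_p C_{ox} (W/L) U_T^2$ and $V_{T0}$ denotes the magnitude of the threshold voltage. In the proposed configuration illustrated in Fig. 2(b), $V_{BD} = 0$, hence

$$V_{BS} = -V_{SD}, \qquad V_{BG} = V_{SG} - V_{SD} \tag{4}$$

$$I_{SD} = I_0 \exp\left(\frac{V_{SG} - V_{SD} - V_{T0}}{n_p U_T}\right)\left[\exp\left(\frac{V_{SD}}{U_T}\right) - 1\right] \tag{5}$$

Therefore, the output small signal resistance of the proposed load device is

$$R_{SD} = \left(\frac{\partial I_{SD}}{\partial V_{SD}}\right)^{-1} = \frac{n_p U_T}{I_{SD}} \cdot \frac{\exp(V_{SD}/U_T) - 1}{(n_p - 1)\exp(V_{SD}/U_T) + 1} \tag{6}$$

in which $n_p$ is the subthreshold slope factor of the pMOS device and $U_T$ is the thermal voltage. Thus, $R_{SD}$ can be controlled through the source-gate voltage of the device through $I_{SD}$. Because of the exponential dependence of the output resistance on $V_{SG}$, it can be adjusted in a very wide range. To avoid process-related deviations, a replica bias generator is required for $V_{SG}$, as explained in Section II-D. The wide tuning range of $R_{SD}$ means that the proposed STSCL gate can be used in a very wide range of operating conditions without the need for modifying the size of devices. Meanwhile, as long as the matching requirements are respected, the frequency of operation would be linearly proportional to the bias current.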
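The behavior of the bulk-drain-shorted load can be explored with a small numerical model. The sketch below assumes the weak-inversion EKV form $I_{SD} = I_0 \exp((V_{SG} - V_{SD} - V_{T0})/(n_p U_T))[\exp(V_{SD}/U_T) - 1]$ for this configuration; the parameter values (`i0`, `v_t0`, `n_p`) are placeholders chosen for illustration, not extracted device data, and all function names are hypothetical.

```python
import math

U_T = 0.02585  # thermal voltage near 300 K (V)


def load_current(v_sd, v_sg, i0=1e-7, v_t0=0.45, n_p=1.3):
    """Source-drain current of the load with bulk tied to drain (assumed EKV form)."""
    gate_term = math.exp((v_sg - v_sd - v_t0) / (n_p * U_T))
    return i0 * gate_term * (math.exp(v_sd / U_T) - 1.0)


def load_resistance(v_sd, v_sg, i0=1e-7, v_t0=0.45, n_p=1.3):
    """Closed-form small-signal resistance dV_SD/dI_SD, valid for v_sd > 0."""
    i_sd = load_current(v_sd, v_sg, i0, v_t0, n_p)
    e = math.exp(v_sd / U_T)
    return (n_p * U_T / i_sd) * (e - 1.0) / ((n_p - 1.0) * e + 1.0)


def numeric_resistance(v_sd, v_sg, dv=1e-6):
    """Central-difference estimate of dV_SD/dI_SD, used as a cross-check."""
    di = load_current(v_sd + dv, v_sg) - load_current(v_sd - dv, v_sg)
    return 2.0 * dv / di


def zero_bias_resistance(v_sg, i0=1e-7, v_t0=0.45, n_p=1.3):
    """Limit of the small-signal resistance for v_sd -> 0 (finite)."""
    return U_T / (i0 * math.exp((v_sg - v_t0) / (n_p * U_T)))


if __name__ == "__main__":
    for v_sg in (0.10, 0.20, 0.30):
        r = load_resistance(0.1, v_sg)
        print(f"V_SG = {v_sg:.2f} V -> R_SD = {r:.3e} Ohm")
```

In this model, raising $V_{SG}$ by $n_p U_T \ln 10$ lowers the resistance by one decade at fixed $V_{SD}$, which illustrates why a single minimum-size device can cover a very wide resistance range.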
Note that when $V_{SD}$ becomes negative, the current direction is reversed and the device switches to the conventional configuration in which the bulk is connected to the source. In this case, the drain current will increase rapidly. This property can help implement high valued floating resistors with a very wide adjusting range by connecting two pMOS transistors in series, as shown in Fig. 4.
C. STSCL Gates
The proposed pMOS load device can be utilized to implement an SCL gate biased in subthreshold. Fig. 5 shows the basic structure of the proposed STSCL gate. A simplified circuit diagram of the replica bias circuit used to control the output voltage swing is also shown. In this schematic, all devices operate in subthreshold regime and the tail bias current can be reduced until it becomes comparable in magnitude to the leakage currents that exist in the circuit.
Since the input differential pair transistors are operating in subthreshold, it can be shown that the transconductance of the input differential pair is

$$G_m = \frac{\partial I_{out}}{\partial V_{in}} = \frac{I_{SS}}{2 n_n U_T} \cdot \frac{1}{\cosh^2\left(V_{in}/(2 n_n U_T)\right)} \tag{7}$$

in which $V_{in}$ indicates the input differential voltage, $I_{out}$ is the differential output current, and $n_n$ is the subthreshold slope factor of nMOS devices. Based on (7), for $V_{in} > 4 n_n U_T$ the entire current will be switched to one of the branches. Therefore, a voltage swing of more than $4 n_n U_T$ would be sufficient to make sure that the gain of the STSCL circuit is enough to be used as a logic gate. Combining (7) with (6) results in

$$A_v = G_m R_{SD} \approx \frac{n_p}{n_n} \cdot \frac{\exp(V_{SD}/U_T) - 1}{(n_p - 1)\exp(V_{SD}/U_T) + 1} \tag{8}$$

evaluated at the cross-over point ($V_{in} = 0$, with $I_{SS}/2$ flowing through each load device and $V_{SD}$ the corresponding voltage drop across the load). Fig. 6(a) illustrates the DC transfer characteristics of an STSCL gate as well as the stage gain. The simulated DC gain of 3.2 at the cross-over point is very close to the value estimated by (8). The measured input-output transfer characteristics of an STSCL buffer stage are shown in Fig. 6(b). Since all the devices are operating in subthreshold regime, the transfer characteristics of the circuit are independent of the bias current. In this plot, the deviation from the ideal DC characteristics is mainly due to the leakage currents in the test circuit coming from electrostatic discharge (ESD) protection circuitry. To measure the DC characteristics, the output voltage swing has been adjusted manually.
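The switching behavior of the subthreshold differential pair can be illustrated with the standard weak-inversion result that the branch currents follow a hyperbolic tangent of the differential input. The sketch below is written under that assumption; the slope factors and the load voltage drop used in the gain estimate are illustrative, and `crossover_gain` encodes the cross-over gain expression as an assumed model, not a measured quantity.

```python
import math

U_T = 0.02585  # thermal voltage near 300 K (V)


def branch_fraction(v_in, n_n=1.5):
    """Fraction of I_SS in the conducting branch: 0.5 * (1 + tanh(V_in / (2 n U_T)))."""
    return 0.5 * (1.0 + math.tanh(v_in / (2.0 * n_n * U_T)))


def gm_differential(v_in, i_ss, n_n=1.5):
    """d(I1 - I2)/dV_in = I_SS / (2 n U_T) / cosh^2(V_in / (2 n U_T))."""
    x = v_in / (2.0 * n_n * U_T)
    return i_ss / (2.0 * n_n * U_T) / math.cosh(x) ** 2


def crossover_gain(v_sd, n_n=1.3, n_p=1.3):
    """Assumed gain model at V_in = 0 with a bulk-drain-shorted pMOS load."""
    e = math.exp(v_sd / U_T)
    return (n_p / n_n) * (e - 1.0) / ((n_p - 1.0) * e + 1.0)


if __name__ == "__main__":
    v_full = 4.0 * 1.5 * U_T  # the 4 n U_T switching criterion
    print(f"steered fraction at 4nU_T: {branch_fraction(v_full):.3f}")
    print(f"cross-over gain estimate: {crossover_gain(0.1):.2f}")
```

Note that the tail current cancels out of the gain expression, consistent with the statement that the transfer characteristics do not depend on the bias current.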
Meanwhile, based on (5) it can be shown that the equivalent output resistance of the pMOS load for $V_{SD} = 0$ V is finite and equal to

$$R_{SD}\big|_{V_{SD}=0} = \frac{U_T}{I_0 \exp\left(\dfrac{V_{SG} - V_{T0}}{n_p U_T}\right)} \tag{9}$$

which means the load devices are capable of pulling up the output node completely to $V_{DD}$. Concerning the area overhead associated with the pMOS load devices, actual mask layout examples using 0.18 µm CMOS technology design rules provide an accurate assessment. The layout of a three-input XOR gate is shown in Fig. 7, where the area required for the pMOS load devices is demonstrated to be small compared to the remaining parts of the circuit.
D. Voltage Swing Control
A controlling circuit is necessary to keep the voltage swing at the output of the SCL gates at the desired value. Fig. 5 shows the simplified schematic of a replica bias (RB) circuit [15]. This circuit should be well matched to the SCL gates to have very low deviation in operating point. Meanwhile, the amplifier should provide enough gain with a very low offset to achieve the desired accuracy. In this work, a folded-cascode amplifier has been used to provide a large swing at the output node and to be able to test the SCL gates in a very wide range of bias current values.
Any mismatch in the bias current or devices of the SCL gates and RB circuit will result in variation of the desired output voltage swing and it can be shown that the sensitivity of this circuit to the mismatches is (10) Meanwhile, it can be shown that the voltage gain from gate to drain of transistor MPR in Fig. 5 is small. Therefore, in spite of the exponential relationship between $I_{SD}$ and $V_{SG}$, the gain of this stage is low and the RB circuit can be stabilized without difficulty. Finally, note that one single replica bias circuit can be used for a large number of STSCL gates. Therefore, its area overhead would be negligible in large scale applications.
III. PERFORMANCE ANALYSIS AND OBSERVATIONS
Power-Delay Product: The power dissipation of the STSCL gate is $P = V_{DD} I_{SS}$, where $I_{SS}$ is the tail bias current, and the typical delay of the gate is

$$t_d = \ln 2 \cdot R_L C_L = \ln 2 \cdot \frac{V_{SW} C_L}{I_{SS}} \tag{11}$$

Thus, the power-delay product (PDP) is found as

$$PDP = P \cdot t_d = \ln 2 \cdot V_{DD} V_{SW} C_L \tag{12}$$

Meanwhile, the power-to-frequency ratio can be calculated as

$$\frac{P}{f_{op}} = \frac{V_{DD} I_{SS}}{f_{op}} \tag{13}$$

where the operating frequency is defined as

$$f_{op} = \alpha f_{max} \tag{14}$$

with $\alpha$ being the activity rate factor (duty rate) and $f_{max}$ being the maximum possible operating frequency: $f_{max} = 1/t_d$. Thus, the ratio is

$$\frac{P}{f_{op}} = \frac{\ln 2}{\alpha} \cdot V_{DD} V_{SW} C_L \tag{15}$$

which provides a more practical measure for the power/frequency tradeoff of any functional block.

Observation 1: The delay (or the maximum operating frequency) in an STSCL gate depends on the tail bias current $I_{SS}$, but not on $V_{DD}$. Therefore, the delay of a logic block can be controlled without influencing PDP, which is not possible in conventional CMOS topologies. More importantly, the speed and the operation (supply) voltage can be effectively decoupled in STSCL circuits. This point will be further elaborated in Section IV-B.
Observation 2: To reduce the $P/f_{op}$ ratio, $\alpha$ should be kept as large as possible. This observation does not contradict similar results for conventional CMOS, where (16) as shown in [6]. However, the influence of $V_{DD}$ on $P/f_{op}$ is quite different in conventional CMOS, where an optimum $V_{DD}$ value to minimize $P/f_{op}$ can be found, especially for small $\alpha$ values, due to the significant leakage in CMOS.
Observation 3: Assuming that the system clock frequency is dictated by the longest delay path between two consecutive register stages, and assuming that the activity rate depends inversely on the maximum logic depth between two registers, it is most beneficial to keep the logic depth as shallow as possible, and thus increase $\alpha$. This calls for very short (one stage) pipelining in STSCL systems, which is demonstrated with an example in Section IV-D.
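The effect of logic depth described in Observation 3 can be made concrete with a small calculation. The sketch below assumes a gate delay of $\ln 2 \cdot V_{SW} C_L / I_{SS}$, a clock period equal to the logic depth times that delay, and gates that draw their tail current continuously; these modeling choices, the function names, and the numeric values are assumptions for illustration.

```python
import math


def gate_delay(v_sw, c_load, i_ss):
    """Assumed STSCL gate delay: ln(2) * V_SW * C_L / I_SS."""
    return math.log(2) * v_sw * c_load / i_ss


def energy_per_cycle_per_gate(v_dd, v_sw, c_load, i_ss, logic_depth):
    """Energy drawn by one continuously biased gate during one clock period."""
    t_clk = logic_depth * gate_delay(v_sw, c_load, i_ss)  # critical path sets the clock
    return v_dd * i_ss * t_clk


if __name__ == "__main__":
    for depth in (1, 2, 8, 16):
        e = energy_per_cycle_per_gate(0.4, 0.2, 10e-15, 1e-9, depth)
        print(f"logic depth {depth:2d} -> {e * 1e15:.2f} fJ per gate per cycle")
```

Since every gate consumes static current whether or not it is evaluating, the energy per cycle grows in proportion to the logic depth, while it stays independent of the chosen tail current. This is the motivation for very shallow pipelining.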
The output load capacitance $C_L$ is partially due to the device parasitic capacitances, such as the capacitance of the n-well to p-substrate reverse biased diode, and the wiring capacitance related to interconnections. Since the n-well to p-substrate capacitance for small size pMOS devices is less than 1 fF, it can be ignored in comparison to the wiring capacitance, which can be much larger even for simple circuits.
Regarding (12), one can conclude that the achievable power-delay product per unit capacitance would be $\ln 2 \cdot V_{DD} V_{SW}$. This means that for a supply voltage of 400 mV and $V_{SW}$ on the order of 150 to 200 mV, the minimum achievable PDP would be 0.04-0.06 fJ/fF/Gate. Since the total parasitic capacitance due to the STSCL gate itself (including the n-well capacitance) is less than 1 fF, the minimum PDP that can be expected for an unloaded gate is of the same order, i.e., below about 0.06 fJ/Gate. Notice that PDP also depends on temperature through $U_T$, which sets the minimum $V_{SW}$, and can be reduced by reducing the temperature.
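The PDP figures quoted above can be reproduced with a few lines of arithmetic. This is a sketch under the assumption that the PDP per unit load capacitance equals $\ln 2 \cdot V_{DD} V_{SW}$; the swing values of 150 mV and 200 mV are assumed examples, and the function names are hypothetical.

```python
import math


def pdp_per_unit_capacitance(v_dd, v_sw):
    """PDP per farad of load, ln(2) * V_DD * V_SW (J/F, numerically equal to fJ/fF)."""
    return math.log(2) * v_dd * v_sw


def pdp(v_dd, v_sw, c_load):
    """Power-delay product in joules for a given load capacitance."""
    return pdp_per_unit_capacitance(v_dd, v_sw) * c_load


if __name__ == "__main__":
    for v_sw in (0.15, 0.20):
        per_ff = pdp_per_unit_capacitance(0.4, v_sw)
        print(f"V_SW = {v_sw * 1e3:.0f} mV -> {per_ff:.3f} fJ/fF")
    # Example: a gate loaded with 10 fF of wiring at V_DD = 0.4 V
    print(f"PDP at 10 fF: {pdp(0.4, 0.2, 10e-15) * 1e15:.2f} fJ")
```

The result depends only on the supply voltage, the swing, and the load capacitance, so in this model the wiring capacitance, rather than the bias current, determines the energy cost of a gate.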
IV. TEST STRUCTURES AND MEASUREMENT RESULTS
A. Ring Oscillator and Divider Operation
To measure the delay versus power consumption for the proposed STSCL gates, a test chip has been designed and fabricated in conventional 0.18 µm CMOS technology. The test structures consist of eight-stage ring oscillator and frequency divider (divide-by-8) circuits, both of which are implemented based on a two-input multiplexer (MUX) STSCL gate. The microphotographs of the test circuits are shown in Fig. 8. To control the operation of the test circuits, the tail bias current of the SCL gates can be adjusted externally. Internal current mirrors with the ratio of 1/100 are used to simplify the measurement process. The supply voltages of the test blocks are directly accessible to measure the total power consumption of each block using an HP4156A Semiconductor Analyzer. An internal replica bias circuit has been applied to control the voltage swing at the output of the gates, as described in Section II-D, ensuring a minimum output swing of 100 mV. The die-to-die variation of the gate bias voltage (shown in Fig. 5) required to ensure a fixed voltage swing of 150 mV at a given tail current was found to be less than %, in conventional 0.18 µm CMOS technology. Fig. 9 illustrates the measured oscillation frequency of an eight-stage ring oscillator with differential STSCL NAND gates (which are constructed based on two-input MUX) in comparison to the simulation results. The conventional CMOS oscillator used for comparison is built with two-input standard NAND gates in the same 0.18 µm CMOS technology with driving strength of 1. As depicted in this figure, the measurement results of the STSCL oscillator are very close to the simulation results, and consistent over a range of several orders of magnitude. Meanwhile, PDP is very well predictable by (12). This figure also shows the results for the CMOS ring oscillator, operating in subthreshold regime with different supply voltage values between 0.1 and 0.4 V.
The divide-by-8 circuit has been realized using the source-coupled latch structure as shown in Fig. 10. Since all transistors operate in weak inversion, the device dimensions can be kept close to minimum size. The measured maximum operating (input) frequency of the divider is plotted against power dissipation in Fig. 11(a) at V and V, comparing the results with the performance of an optimized CMOS frequency divider operating in subthreshold regime. While the CMOS divider cannot sustain correct operation below 200 mV supply voltage, the SCL divider with the bulk-drain connected pMOS load continues its operation down to 10 pA/Gate of tail current, and 3 kHz of input frequency. The resulting (measured) PDP corresponds to less than 1 fJ/Gate.
To compare the performance of the STSCL gates at scaled technology nodes, the maximum operating frequency of a divide-by-8 circuit has been simulated using technology parameters for 90 nm, 130 nm, and 180 nm CMOS processes [ Fig. 11(b) ]. Here, it is assumed that the DFF gates are loaded with the same amount of interconnect capacitance, and all leakage components are taken into account. It can be seen that the STSCL frequency divider exhibits very similar performance in different technology nodes. It is possible to reduce the tail bias current of the circuit down to 10 pA in a controlled manner both in 130 nm and 90 nm technologies, whereas the subthreshold leakage current would be very difficult to limit in conventional CMOS logic circuits.
Considering the results presented in Figs. 9 and 11 , it can be observed that the STSCL solution can successfully extend the range of operation by two orders of magnitude along the power axis, and by about one order of magnitude along the frequency axis, while allowing completely separate control of voltage swing and power dissipation.
B. Carry-Save Multiplier Using SCL Gates
To illustrate the use of the proposed circuit topology for more complex functions, a second test chip containing an 8 × 8 bit parallel carry-save multiplier has been designed and fabricated using 0.18 µm CMOS technology (Fig. 12). Fig. 13 shows the measured input-to-output delay of the STSCL-based multiplier, operating at $V_{DD}$ = 0.3 V, 0.4 V, and 1.0 V, in comparison to the simulation results. It can be seen that the performance of the STSCL multiplier is accurately predicted by the simulations. The supply voltage can be reduced to 0.3 V while the circuit remains operational over a very wide range of tail bias current. The saturation behavior of the delay at higher bias currents is mainly due to the limited swing of the replica bias circuit that is used to produce the proper gate voltage for the pMOS load devices. To illustrate the independent control of the delay and the voltage supply, the PDP versus the delay of the STSCL multiplier circuit is plotted in Fig. 14 for different bias current levels, and compared with the variation of PDP of an equivalent CMOS multiplier circuit, also operating in subthreshold regime. In this example, the power supply voltage and the output voltage swing of the STSCL circuit are kept at 0.35 V and 0.15 V, respectively, resulting in a nearly constant PDP of less than 1 pJ over the entire operating range. The PDP of the CMOS circuit, on the other hand, varies significantly with $V_{DD}$, due to the quadratic dependence of PDP on $V_{DD}$ and the increasing dominance of leakage at low $V_{DD}$ values.
C. Compound SCL Gates to Improve Power-Delay Performance
Using STSCL topology, the power consumption of a functional block is directly proportional to the number of logic gates to be biased with a tail current. Therefore, implementing more complex logic functions in a single stage SCL gate can be expected to result in a smaller number of gates and hence, reduced power consumption. In this approach, since the time constant at the common source nodes (which is set by the parasitic capacitance in each common source node) is much smaller than the time constant at the output node ($R_L C_L$, as shown in (1)), the speed degradation due to the stacking will be negligible as long as the number of stacked stages in the nMOS switching network stays below the bound given by (17). Fig. 15(a) shows a unit cell which is required to implement the carry-save multiplier [24]. This unit block consists of a two-input AND gate and a full-adder (FA), and it can be implemented by two separate SCL gates, as shown in Fig. 15(b). Alternatively, Fig. 15(c) shows an STSCL gate implemented by merging the two logic functions of AND and XOR on a single branch and realizing the compound logic operation in a single gate. Using the merged STSCL gate topology [Fig. 15(c)] results in a significant improvement of the power-frequency performance of the 8 × 8 multiplier, as illustrated in Fig. 16(a). The multiplier in this example is built out of 56 adders and 64 AND gates (total number of gates is 120), of which 49 can be merged with the corresponding adder as described above. This modification alone results in approximately 40% power reduction. In the general case of an $N \times N$ multiplier, the total number of gates is $2N^2 - N$, and it is possible to merge $(N-1)^2$ AND gates with adders, resulting in almost 50% power reduction for higher $N$ values. In addition to the obvious reduction of tail currents, the merging of AND gates with adders also reduces the layout area of the unit cell, and hence, lowers the parasitic capacitance due to wiring.
Finally, the operating frequency is further increased by reducing the overall logic depth, resulting in about 80% total improvement of speed at iso-power. While the results are difficult to generalize for random logic topologies, the merging of complex logic gates clearly presents a valuable opportunity that can be exploited for improving the power-frequency performance in STSCL circuits. A comparison of the power-frequency performance of the STSCL multiplier with that of a conventional CMOS multiplier shows that the STSCL circuit is comparable to, and in many cases better than, the CMOS equivalent over a wide frequency range. The main drawback of the merged-gate approach is a slight increase of the minimum usable supply voltage, since compound gates with more levels typically require a higher supply voltage. However, this is a relatively minor limitation as long as the nMOS network transistors are biased in subthreshold regime.
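The gate-count argument behind the merged-gate power saving can be verified with a short script. The sketch assumes that an N × N carry-save multiplier uses $N^2$ AND gates and $N(N-1)$ adders, of which $(N-1)^2$ AND gates can be merged into adders; these general expressions are chosen to match the 8 × 8 counts given in the text (64 AND gates, 56 adders, 49 merged) and should be read as an assumption for other sizes. The function name is hypothetical.

```python
def multiplier_gate_counts(n):
    """Gate counts of an n x n carry-save multiplier under the assumed formulas."""
    and_gates = n * n
    adders = n * (n - 1)
    total = and_gates + adders       # equals 2*n*n - n
    merged = (n - 1) ** 2            # AND gates absorbed into adder gates
    return {
        "and_gates": and_gates,
        "adders": adders,
        "total": total,
        "merged": merged,
        # Each merged AND gate removes one tail current source.
        "power_reduction": merged / total,
    }


if __name__ == "__main__":
    for n in (4, 8, 16, 64):
        c = multiplier_gate_counts(n)
        print(f"N = {n:2d}: total = {c['total']:5d}, merged = {c['merged']:5d}, "
              f"saving = {100 * c['power_reduction']:.1f}%")
```

Because every STSCL gate draws the same static tail current, the fraction of gates removed translates directly into the fraction of power saved, which approaches one half for large multipliers.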
D. Shallow Pipelining to Improve Activity Rate
As already discussed in Section III, the power-to-frequency ratio of STSCL circuits (i.e., the power efficiency to operate at a given frequency) can be significantly improved by increasing the activity rate using shallow pipelining and by reducing logic depth as much as possible. One possibility is to implement two-phase latch-based pipelining where the output of each gate is latched during one clock phase, and passed on to the next stage during the other clock phase, effectively reducing the maximum logic depth to two consecutive gates. Instead of using explicit latch stages, such two-phase pipelining can be achieved by increasing (and reducing) the source (tail) current bias of alternating stages, using the gate terminal of the tail current bias transistor of each stage as the "clock" input. In this approach, illustrated in Fig. 17 for the example of the carry-save multiplier architecture, the current bias of odd stages is reduced to a low (yet non-zero) level to retain (hold) their output while the current bias of even stages is raised to the nominal operating value to enable evaluation. Very simple cross-coupled "keeper" stages connected to each gate output ensure that the output levels do not degrade significantly during the "hold" phase. Fig. 17(a) shows the circuit topology of an adder (sum generator) stage and the output keeper stage, where the pulsed tail bias achieves a very robust dynamic latching effect, augmented by the output keeper with a tail bias current of 100 pA. In an 8 × 8 bit carry-save multiplier circuit, taking into account the additional power overhead of pipelining (which is only 1%), shallow pipelining using keeper-latch stages will result in an overall improvement of the power-to-frequency ratio by a factor of 5 (Fig. 18). The pipelining technique described above can certainly be applied in combination with the gate-merging approach discussed in Section IV-C, to improve the power-frequency performance of subthreshold SCL circuits considerably.

Fig. 18. Power-frequency improvement that can be achieved in the 8 × 8 carry-save multiplier circuit, by using shallow pipelining with keeper-latch stages.
V. CONCLUSION
A new approach for implementing ultra-low-power source-coupled logic circuits biased in subthreshold regime has been demonstrated. The new topology uses compact high resistance pMOS load devices to provide the required voltage swing at the output for proper logic operation. Measurement results show that the tail bias current of each logic gate can be reduced to less than 10 pA, while the power-delay product of the gate remains less than 1 fJ, using 0.18 µm CMOS technology. Robust operation of ring oscillator and frequency divider circuits, as well as more complex logic blocks (8 × 8 bit carry-save multiplier), has been demonstrated over a very wide range of frequencies. Among other advantages, the proposed approach effectively decouples the circuit propagation delay from the operating voltage, resulting in near-constant PDP versus frequency. The bias current of the STSCL gate can be scaled over several decades using the same device dimensions, which makes this circuit topology very suitable for ultra-low-power configurable digital systems.
