Abstract-Gate diffusion input (GDI)-a new technique
I. INTRODUCTION
With the rapid development of portable digital applications, the demand for increasing speed, compact implementation, and low power dissipation triggers numerous research efforts [1]-[3]. The wish to improve the performance of logic circuits, once based on traditional CMOS technology, resulted in the development of many logic design techniques during the last two decades [17]. One form of logic that is popular in low-power digital circuits is pass-transistor logic (PTL).
Formal methods for deriving pass-transistor logic have been presented for nMOS. They are based on a model in which a set of control signals is applied to the gates of nMOS transistors and another set of data signals is applied to the sources of the n-transistors [1]. Many PTL circuit implementations have been proposed in the literature [1], [2], [4]-[6], [14].
Some of the main advantages of PTL over standard CMOS design are 1) high speed, due to the small node capacitances; 2) low power dissipation, as a result of the reduced number of transistors; and 3) lower interconnection effects [7] , [8] , due to a small area.
However, most of the PTL implementations have two basic problems. First, the threshold drop across the single-channel pass transistors results in reduced current drive and hence slower operation at reduced supply voltages; this is particularly important for low-power design since it is desirable to operate at the lowest possible voltage level. Second, since the "high" input voltage level at the regenerative inverters is not $V_{DD}$, the PMOS device in the inverter is not fully turned off, and hence direct-path static power dissipation could be significant [4].
A. Fish is with the Electrical Engineering Department, Ben-Gurion University, Israel (e-mail: afish@ee.bgu.ac.il).
I. A. Wagner is with IBM Haifa Labs, Haifa University, Mount Carmel, Israel (e-mail: wagner@il.ibm.com).
Digital Object Identifier 10.1109/TVLSI.2002.801578
Several PTL techniques have been proposed to solve the problems mentioned above [5].
1) Transmission gate CMOS (TG) uses transmission gate logic to realize complex logic functions using a small number of complementary transistors. It solves the problem of low logic level swing by using pMOS as well as nMOS [1] .
2) Complementary pass-transistor logic (CPL) features complementary inputs/outputs using nMOS pass-transistor logic with CMOS output inverters. CPL's most important feature is the small stack height and the internal node low swing, which contribute to lowering the power consumption. The CPL suffers from static power consumption due to the low swing at the gates of the output inverters. To lower the power consumption of CPL circuits, LCPL and SRPL circuit styles are used. Those styles contain pMOS restoration transistors or cross-coupled inverters (respectively).
3) Double pass-transistor logic (DPL) uses complementary transistors to keep full swing operation and reduce the dc power consumption. This eliminates the need for restoration circuitry. One disadvantage of DPL is the large area used due to the presence of pMOS transistors.
An additional problem of existing PTL is top-down logic design complexity, which prevents the pass transistors from capturing a major role in real logic LSIs [6]. One of the main reasons for this is that no simple and universal cell library is available for PTL-based design.
This paper proposes a new low-power design technique that solves most of the problems mentioned above: the gate diffusion input (GDI) technique. The GDI approach allows implementation of a wide range of complex logic functions using only two transistors. This method is suitable for the design of fast, low-power circuits using a reduced number of transistors (as compared to CMOS and existing PTL techniques), while improving logic level swing and static power characteristics and allowing simple top-down design by using a small cell library.
Section II presents basic GDI functions and their circuit principle. In Section III, a detailed analysis of GDI cell is presented. Section IV shows a design methodology for GDI circuitry. Comparisons of some basic logic functions and high-level combinatorial circuits designed in CMOS, PTL, and GDI are discussed in Section V. Section VI presents measurements of a test chip, fabricated in GDI and CMOS. Conclusions and future work are discussed in Section VII.
II. BASIC GDI FUNCTIONS
The GDI method is based on the use of a simple cell as shown in Fig. 1 . At first glance, the basic cell reminds one of the standard CMOS inverter, but there are some important differences.
1) The GDI cell contains three inputs: G (common gate input of nMOS and pMOS), P (input to the source/drain of pMOS), and N (input to the source/drain of nMOS).
2) Bulks of both nMOS and pMOS are connected to N or P (respectively), so they can be arbitrarily biased, in contrast with a CMOS inverter.
It must be remarked that not all of the functions are possible in a standard p-well CMOS process, but they can be successfully implemented in twin-well CMOS or silicon on insulator (SOI) technologies. This issue will be discussed in Section VII. Table I shows how a simple change of the input configuration of the simple GDI cell corresponds to very different Boolean functions.
Most of these functions are complex (6-12 transistors) in CMOS, as well as in standard PTL implementations, but very simple (only two transistors per function) in the GDI design method.
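As a rough illustration of how the input configuration selects the Boolean function, the cell can be modeled at the switch level: the nMOS passes N when G is high and the pMOS passes P when G is low. The sketch below is only an idealized model that ignores threshold drops and bulk effects. The helper names (`gdi_cell`, `CONFIGS`, `mux`) and the specific (G, P, N) assignments are assumptions made for illustration, since Table I is not reproduced here.

```python
from itertools import product


def gdi_cell(g: int, p: int, n: int) -> int:
    """Idealized GDI cell: the nMOS passes N when G = 1, the pMOS passes P when G = 0."""
    return n if g else p


# Assumed (G, P, N) configurations for two-input functions of A and B.
CONFIGS = {
    "F1": lambda a, b: gdi_cell(g=a, p=b, n=0),   # (not A) and B
    "F2": lambda a, b: gdi_cell(g=a, p=1, n=b),   # (not A) or B
    "OR": lambda a, b: gdi_cell(g=a, p=b, n=1),   # A or B
    "AND": lambda a, b: gdi_cell(g=a, p=0, n=b),  # A and B
}


def truth_table(fn):
    """Outputs for (A, B) = (0,0), (0,1), (1,0), (1,1)."""
    return [fn(a, b) for a, b in product((0, 1), repeat=2)]


def mux(a: int, b: int, c: int) -> int:
    """Three-input configuration: selects B when A = 0 and C when A = 1."""
    return gdi_cell(g=a, p=b, n=c)


if __name__ == "__main__":
    for name, fn in CONFIGS.items():
        print(name, truth_table(fn))
```

In this model every listed function costs a single two-transistor cell; only the wiring of G, P, and N changes.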
In this paper, most of the designed circuits were based on the F1 and F2 functions. The reasons for this are as follows.
1) Both F1 and F2 are complete logic families (each allows realization of any possible two-input logic function).
2) F1 is the only GDI function that can be realized in a standard p-well CMOS process, because the bulk of any nMOS is constantly and equally biased.
3) When the N input is driven at a high logic level and the P input is at a low logic level, the diodes between the NMOS and PMOS bulks to Out are directly polarized and there is a short between N and P, resulting in static power dissipation and . This causes a drawback for OR, AND, and MUX implementations in regular CMOS with configuration. The effect can be reduced if the design is performed in floating-bulk SOI technologies [22], where a full GDI library can be implemented. In that case, floating-bulk effects have to be considered.
As can be seen, the GDI cell structure is different from the existing PTL techniques, reviewed in Section I, and has some important features, which allow improvements in design complexity level, transistor count, and power dissipation (all of these will be discussed in Sections IV-VI). Understanding of GDI cell properties demands a deeper operational analysis of the basic cell in different cases and configurations.
III. ANALYSIS OF GDI CIRCUITS
In this section, we analyze GDI circuits. First we explain their operation and analyze their transient behavior. Then we consider swing restoration issues and switching characteristics.
A. Operational Analysis of GDI Cell
As mentioned in Section I, one of the common problems of PTL design methods is the low swing of output signals because of the threshold drop across the single-channel pass transistors. In existing PTL techniques, additional buffering circuitry is used to overcome this problem.
To understand the effects of the low swing problem in a GDI cell, we suggest the following analysis, which is based on the example of the F1 function and can be easily extended to other GDI functions. Table II presents a full set of logic states and related functionality modes of F1.
As can be seen from Table II, the only state where low swing occurs in the output value is $A=0$, $B=0$. In this case, the voltage level of F1 is $|V_{Tp}|$ (instead of the expected 0 V) because of the poor high-to-low transition characteristics of the pMOS pass transistor [4]. The only case (among all the possible transitions) where the effect occurs is the transition from $A=0$, $B=V_{DD}$ to $A=0$, $B=0$. The fact that demands special emphasis is that in about 50% of the cases (for $B=1$), the GDI cell operates as a regular CMOS inverter, which is widely used as a digital buffer for logic-level restoration. In some of these cases, when the input arrives without a swing drop from the previous stages, a GDI cell functions as an inverter buffer and recovers the voltage swing. Although this feature allows a self-swing restoration in certain cases, in this paper the worst case is assumed and additional circuitry is used for swing restoration in the implemented circuits.
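The state analysis above can be mimicked with a small behavioral model. This is only a sketch: it assumes F1 is configured with G = A, P = B, and N grounded, that the pMOS passes a high level fully but a low level only down to its threshold magnitude, and that the nMOS passes ground without loss. The function names and the default values of `vdd` and `vtp` are illustrative, not taken from the paper.

```python
def f1_output_level(a: int, b: int, vdd: float = 3.3, vtp: float = 0.7) -> float:
    """Approximate output voltage of F1 = (not A) and B for logic inputs a, b."""
    if a == 1:
        # nMOS conducts and passes the grounded N input without degradation.
        return 0.0
    # pMOS conducts: a high B is passed fully, a low B only down to |V_Tp|.
    return vdd if b == 1 else abs(vtp)


def degraded_states(vdd: float = 3.3, vtp: float = 0.7):
    """Return the (A, B) states whose output level differs from the ideal one."""
    bad = []
    for a in (0, 1):
        for b in (0, 1):
            ideal = vdd if (a == 0 and b == 1) else 0.0
            if f1_output_level(a, b, vdd, vtp) != ideal:
                bad.append((a, b))
    return bad


if __name__ == "__main__":
    print(degraded_states())
```

Under these assumptions exactly one of the four states shows a reduced swing, which is why worst-case buffering is sized around that state.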
B. Transient Analysis
The exact transient analysis for a basic GDI cell, in most cases, is similar to a standard CMOS inverter, widely presented in the literature [9] , [10] . This classic analysis is based on the Shockley model, where the drain current is expressed as follows:
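$$I_D = \begin{cases} 0, & V_{GS} \le V_{TH} \quad \text{(subthreshold region)} \\ K\left[(V_{GS}-V_{TH})V_{DS} - \tfrac{1}{2}V_{DS}^2\right], & V_{DS} < V_{GS}-V_{TH} \quad \text{(linear region)} \\ \tfrac{1}{2}K\,(V_{GS}-V_{TH})^2, & V_{DS} \ge V_{GS}-V_{TH} \quad \text{(saturation region)} \end{cases} \tag{1}$$
where $K = \mu(\varepsilon_{ox}/t_{ox})(W/L)$ is the drivability factor, $V_{TH}$ is the threshold voltage, $W$ is the channel width, and $L$ is the channel length.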
In contrast with CMOS inverter analysis [8], where $V_{GS}$ was taken as the input voltage, in most GDI circuits $V_{DS}$ must be considered as a variable of the input voltage in the Shockley model. In this paper, we shall only discuss the aspects in which GDI differs from CMOS.
The case of most interest is when a step signal is supplied to diffusion of nMOS transistor and causes a swing drop in output. Fig. 2 shows the schematic and a transient response for this case.
During this response, the nMOS transistor passes from the saturation to the subthreshold region. Assuming a fast transition at the input, the linear region can be neglected in our analysis. Analytical expressions that describe the transient response can be derived from (1), while considering a capacitive load at the output. The capacitive current is
$$I_C = C\,\frac{dV_C}{dt} = I_D \tag{2}$$
where $C$ is the output capacitance, $V_C$ is the voltage across the capacitance $C$, $I_C$ is the current charging the capacitor, and $I_D$ is the drain current through the N-channel device.
The expression for $V_C$ as a function of time is derived as follows. In the saturation region, substituting the saturation current of (1) into (2) gives
$$C\,\frac{dV_C}{dt} = \frac{K}{2}\,(V_{GS}-V_{TH})^2 \tag{3}$$
with $K$ the drivability factor and $V_{TH}$ the threshold voltage of (1), where, in the case of GDI cells linked through diffusion inputs, the capacitance $C$ includes both diffusion and well capacitances of the driven cell.
The integral form of (3) is (4) The same expression can be written as
where , , and are constants of the process or the given circuit. The final expression of transient response in the saturation region is (7) where is time in the saturation region and is a constant of integration and is calculated for initial conditions ( ). The solution of (7) is done numerically (e.g., in MATLAB) for specific values of ( ). After entering the subthreshold region, continues rising while the output capacitance is charged by according to (1) In subthreshold region
where is temperature in K, is Boltzmann's constant, is charge of an electron, and is a constant
The expression for response in the subthreshold region is
where is constant of integration defined by the initial conditions, is from (10), and is the threshold voltage.
It must be noted that the analysis of propagation delay of a basic GDI cell given by (2)-(7) can be refined by taking into account the effect of the diode between the NMOS source and body. This diode is forward biased during the transient (Fig. 2). By conducting an additional current, it contributes to charging the output capacitance. This current contribution can be calculated to be
$$I_{diode} = I_S\left(e^{qV/(nkT)} - 1\right) \tag{13}$$
where $I_{diode}$ is the diode current, $I_S$ is the reverse current, $V$ is the forward voltage across the diode, $n$ is a factor between one and two, and $q$, $k$, and $T$ are the electron charge, Boltzmann's constant, and the temperature. This current should be added in (2) to derive an improved propagation delay, resulting in a faster transient operation of the GDI cell.
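To make the saturation-region part of this analysis concrete, the charging of the output capacitance can be integrated numerically. The sketch below rests on simplifying assumptions that are not stated in the text: the gate is held at $V_{DD}$ so that $V_{GS} = V_{DD} - V_C$, the threshold voltage is constant (no body effect), and the diode current is ignored. Under those assumptions the equation $C\,dV_C/dt = (K/2)(V_{DD} - V_C - V_{TH})^2$ also has a closed form, which is used as a cross-check. All parameter values and function names are illustrative.

```python
def v_out_closed_form(t, k=100e-6, c=100e-15, vdd=3.3, vth=0.6, v0=0.0):
    """Closed-form V_C(t) for C dV/dt = (K/2)(Vdd - V - Vth)^2 with constant Vth."""
    u0 = vdd - vth - v0                       # initial gate overdrive
    u = u0 / (1.0 + u0 * k * t / (2.0 * c))   # overdrive decays hyperbolically
    return vdd - vth - u


def v_out_euler(t_end, dt=1e-12, k=100e-6, c=100e-15, vdd=3.3, vth=0.6, v0=0.0):
    """Forward-Euler integration of the same equation, as a numerical reference."""
    v = v0
    steps = int(round(t_end / dt))
    for _ in range(steps):
        overdrive = vdd - v - vth
        if overdrive <= 0.0:
            break  # transistor would leave saturation toward subthreshold
        v += dt * k / (2.0 * c) * overdrive ** 2
    return v


if __name__ == "__main__":
    for t_ns in (0.5, 1.0, 2.0, 5.0):
        t = t_ns * 1e-9
        print(t_ns, round(v_out_closed_form(t), 3), round(v_out_euler(t), 3))
```

The output approaches $V_{DD} - V_{TH}$ only asymptotically in this model, which matches the observation that the remaining rise happens slowly once the device nears the subthreshold region.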
C. Analysis of Swing-Restoring Buffers
As mentioned above, an important concern in PTL circuit design is the problem of swing degradation. This section presents a methodology for swing restoration in GDI circuits under constraints of area (power) and circuit frequency (delay).
The simplest method of swing restoration is to add a buffer stage after every GDI cell. This will certainly prevent the voltage drop, but at the cost of additional area, delay, and power dissipation, which makes this method highly inefficient.
Note that our approach to swing restoration is rather simple; various buffering techniques are presented in the literature, e.g., [6] and [14] . Given a clocked logic circuit with known Tcycle and Tsetup, buffering of cascaded GDI cells will be optimal if the following conditions are preserved.
1) Successive Swing Restoration:
While cascading GDI cells, each cell contributes a voltage drop $V_{drop}$ at its output. Assuming $0.3V_{DD}$ as a maximal allowed voltage drop of the whole cascade, the number $n_1$ of linked GDI cells between two buffers is limited by
$$n_1 \le \frac{0.3\,V_{DD}}{V_{drop}} \tag{14}$$
As shown above, after exiting the saturation area, the value of $V_{drop}$ is equal to $V_{TH}$ and decreases with time as follows, using (9): (15) Equation (15) applies for the subthreshold region only, namely, for . According to (15), remaining in the subthreshold region for will assure a significant decrease of $V_{drop}$ and, as a result, an increase in the number of linked cells $n_1$. This allows achieving successive swing restoration while using a lower number of buffers. Fig. 3 presents Cadence Spectre simulation results of operation in the subthreshold region in an AND gate implemented in GDI.
If interconnection effects are essential, a signal potential loss over long interconnects has to be treated. In this case, (15) will be extended with respect to IR drop. Suppose that the voltage has to be applied to the drain input of the NMOS (Fig. 2) through a long wire. For a given wire length $L_w$ and width $W_w$, the resistance of the interconnect is defined by
$$R_w = R_\square\,\frac{L_w}{W_w} \tag{16}$$
where $R_\square$ is the metal sheet resistance per square. The current flowing through the wire and causing the voltage drop $\Delta V$ is given by
$$I_w = \frac{\Delta V}{R_w} \tag{17}$$
$\Delta V$ can be determined by the equalization between the wire and NMOS transistor's currents as follows:
$$\frac{\Delta V}{R_w} = I_D \tag{18}$$
where $I_D$ is found from (1) according to the operation region of the transistor. Equation (18) can be solved numerically, and its contribution to the final expression is represented by (19) with the voltage drop from (15). The operation in the subthreshold region causes an increase of delay. Therefore, this method can be efficiently used mostly in low-frequency design.
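The interconnect contribution can be estimated with a few lines of arithmetic. The sketch below assumes the usual sheet-resistance relation (resistance equals sheet resistance times the number of squares, length over width) and a known wire current; the function names and the numbers in the example are hypothetical and are not taken from the paper.

```python
def wire_resistance(r_sheet: float, length: float, width: float) -> float:
    """Interconnect resistance in ohms from sheet resistance (ohm/square)."""
    if width <= 0.0:
        raise ValueError("width must be positive")
    return r_sheet * length / width


def ir_drop(current: float, r_sheet: float, length: float, width: float) -> float:
    """Voltage lost along the wire for a given current (volts)."""
    return current * wire_resistance(r_sheet, length, width)


if __name__ == "__main__":
    # Illustrative values: 0.07 ohm/square metal, 1000 um long, 0.5 um wide, 100 uA.
    print(wire_resistance(0.07, 1000e-6, 0.5e-6))
    print(ir_drop(100e-6, 0.07, 1000e-6, 0.5e-6))
```

When the transistor current itself depends on the voltage that reaches the diffusion input, the drop has to be found iteratively rather than from a fixed current as done here.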
Scaling, namely, $V_{DD}$ reduction and threshold nonscalability, influences the number of required buffers in GDI design (14). As a result, when operating at lower supply voltages while the technology and $V_{TH}$ remain the same, insertion of additional buffers has to be considered. The direct impact of this is on the area and number of gates.
Finally, several points have to be emphasized concerning the buffer insertion topology in GDI.
1) Buffer insertion has to be considered only in the case of linking GDI cells through diffusion inputs. No buffers are needed before gate inputs of GDI cells.
2) Due to this feature, the "mixed path" topology can be used as an efficient method for buffer insertion. It allows one to reduce the number of buffers by intermittently involving diffusion and gate inputs in a given signal path.
3) The designer should check the tradeoff between buffer insertion and delay, area, and power consumption to achieve an efficient swing restoration.
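The swing-limited bound on the number of cascaded cells can be sketched as a one-line calculation. The code assumes that the allowed drop of the whole cascade is the fraction 0.3 of the supply voltage and that every cell contributes the same per-cell drop; the function name and the example voltages are illustrative.

```python
import math


def max_cells_between_buffers(vdd: float, v_drop: float,
                              allowed_fraction: float = 0.3) -> int:
    """Largest number of diffusion-linked GDI cells whose accumulated drop
    stays within allowed_fraction * vdd (assumes equal drop per cell)."""
    if v_drop <= 0.0:
        raise ValueError("v_drop must be positive")
    return math.floor(allowed_fraction * vdd / v_drop)


if __name__ == "__main__":
    # Illustrative: 3.3 V supply, 0.45 V drop per cell.
    print(max_cells_between_buffers(3.3, 0.45))
    # Lower supply with the same per-cell drop calls for more frequent buffering.
    print(max_cells_between_buffers(1.8, 0.45))
```

Because the per-cell drop does not scale with the supply, lowering the supply shrinks this bound, which is the scaling effect discussed above.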
2) Impacts of Process Variation on Swing Restoration:
In every VLSI process, there are variations in parameters like threshold tracking, variations, etc. The process dependence of and influences the value of and the swing restoration in GDI. This effect can be best described by defining a sensitivity of to the mentioned parameter variations as follows:
Current sensitivity of (20) Threshold sensitivity of (21) where is given by (19) .
3) Constraint of Maximal Delay of Cascade:
A signal path in cascade can be represented by a single-branch RC tree [11] , [12] , where are effective resistances of conducting transistors and are capacitive loads caused by following devices. An example of an RC tree can be seen in Fig. 4 .
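$R_{ee}$ is defined as the resistance of the path between the input and output $e$ (for an RC tree without side branches).
$R_{kk}$ is the resistance between the input and node $k$, and $R_{ke}$ is the resistance of the portion of that path that is shared with the path between the input and the output $e$.
$C_k$ is the capacitance at node $k$. The following times are defined in order to derive the bounds of delay in an RC tree:
$$T_P = \sum_k R_{kk} C_k \tag{22}$$
$$T_{De} = \sum_k R_{ke} C_k \tag{23}$$
$$T_{Re} = \frac{\sum_k R_{ke}^2 C_k}{R_{ee}} \tag{24}$$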
The maximal delay of the RC tree can be derived numerically from bound for the time, given in the following equation:
The number of stages in GDI cascade can be found for maximal total delay time , while using the condition
Notice that (25) can be checked only after the number of stages was assumed and a suitable RC tree was built.
The maximal number of stages in cascade between two buffers is therefore the minimal value between the swing-limited bound of (14) and the delay-limited bound of (25).
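The delay-limited bound can be explored numerically. The sketch below is an assumption-laden illustration: it models the cascade as a uniform RC chain (identical resistance and capacitance per stage), uses the standard Rubinstein-Penfield-Horowitz upper bounds on the time to reach a fraction `v` of the final voltage, and takes the timing budget (for example, the cycle time minus the setup time) as a given number. All names and values are hypothetical.

```python
import math


def rc_chain_times(r_stage: float, c_stage: float, n: int):
    """T_P, T_D, T_R at the output of an n-stage uniform RC chain."""
    if n < 1:
        raise ValueError("n must be at least 1")
    r_path = [r_stage * (k + 1) for k in range(n)]  # resistance from input to node k
    t_p = sum(r * c_stage for r in r_path)
    t_d = t_p  # for the last node of a chain the shared path is the whole path
    t_r = sum(r * r * c_stage for r in r_path) / r_path[-1]
    return t_p, t_d, t_r


def delay_upper_bound(r_stage: float, c_stage: float, n: int, v: float = 0.5) -> float:
    """Upper bound on the time for the output to reach fraction v of its final value."""
    t_p, t_d, t_r = rc_chain_times(r_stage, c_stage, n)
    bound_a = t_d / (1.0 - v) - t_r
    bound_b = t_p - t_r + t_p * math.log(t_d / (t_p * (1.0 - v)))
    return min(bound_a, bound_b)


def max_stages_for_delay(r_stage, c_stage, t_budget, v=0.5, n_limit=64) -> int:
    """Largest n whose delay bound fits in the timing budget."""
    n = 0
    while n < n_limit and delay_upper_bound(r_stage, c_stage, n + 1, v) <= t_budget:
        n += 1
    return n


def max_stages_between_buffers(n_swing: int, n_delay: int) -> int:
    """The tighter of the swing-limited and delay-limited bounds."""
    return min(n_swing, n_delay)


if __name__ == "__main__":
    n_delay = max_stages_for_delay(1e3, 1e-13, 3e-10)
    print(n_delay, max_stages_between_buffers(5, n_delay))
```

For a single stage the bound reduces to the exact single-pole response, which is a convenient sanity check on the formulas.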
D. Switching Characteristics: GDI Versus CMOS
Due to the complexity of the logic function that can be implemented in a GDI cell by using only two transistors, it is important to perform a comparison of its switching characteristics with CMOS gate, whose logic function is of the same order of complexity. This comparison can be used as a base for delay estimation in early stages of circuit design, if GDI or CMOS design techniques are considered.
While a GDI cell's characteristics are close to a standard inverter, the gate with equivalent functional complexity in CMOS will be NAND. The switching behavior of the inverter can be generalized by examining the parasitic capacitances and resistances associated with the inverter [15] , [16] . Consider the inverter shown in Fig. 5 with its equivalent digital model.
The propagation delay for an inverter driving a capacitive load is (26) where is the total capacitance on the output of the inverter, that is, the sum of the output capacitance of the inverter, any capacitance of interconnecting lines, and the input capacitance of the following gate(s).
A NAND gate with a series connection of identical n-channel MOSFETs is shown in Fig. 6 . We can estimate the intrinsic switching time of series-connected MOSFETs with an external load capacitance by [16] (27) The first term in this equation represents the intrinsic switching time of the series connection of MOSFETs, while the second term represents RC delay caused by charging .
For equal to 3/2 and assuming two serial n-MOS transistors, the propagation delay in NAND is (28) Therefore, the delay of a NAND gate compared to a GDI gate is approximated by (29) where the high bound is for high and the low bound is for low . Note that this ratio will become better if the effect of the body-source diode in a GDI cell is considered (13) and the delay formula in (7) is used in its improved form.
E. Fan-in and Fanout Bounds in GDI
1) Fanout:
Following the analysis performed in Section III-D, GDI cells with a two-transistor structure can be compared with CMOS gates with equivalent functional complexity. This approach allows definition of fanout bounds by using the logic-effort concept [23]. The logic effort is directly related to the fanout when the effort delay of a logic structure is analyzed. The effort delay of the logic gate is the product of these two factors
$$f = g\,h \tag{30}$$
where $f$ is the effort delay, $g$ is the logic effort, and $h$ represents the fanout of the gate. For a desired delay, reducing the logic effort results in an improved fanout by the same ratio.
The values of logic effort are given by Sutherland in [23] for inputs of various static CMOS gates, normalized to the logic effort of an inverter. While a GDI cell's logic effort is close to that of a standard inverter, the equivalent logic functions in CMOS will be NAND, NOR, or MUX, depending on the GDI input configuration (Table I). It can be seen from [23] that the following improvement factors in the fanout $h$ of GDI are derived in comparison with CMOS:
a) for F1, F2 versus NAND, the factor is 4/3;
b) for F1, F2 versus NOR, the factor is 5/3; and
c) for MUX in GDI versus CMOS, the factor is 2.
The presented values are correct for the gate input of a GDI cell, which makes its characteristics similar to those of the CMOS inverter. If the diffusion input is considered, an additional factor has to be applied to represent the capacitance ratio between the gate and diffusion inputs: the factors given above have to be multiplied by $C_{gate}/C_{diff}$, where $C_{gate}$ and $C_{diff}$ are the gate and diffusion input capacitances. Both parameters are defined by the design technology.
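The fanout argument can be restated as a short calculation. The sketch below assumes the effort delay is the product of logic effort and fanout, treats the GDI cell as having the logic effort of an inverter (normalized to 1), and uses the factors 4/3, 5/3, and 2 quoted above for the CMOS NAND, NOR, and MUX. The dictionary, the function names, and the capacitance arguments are illustrative.

```python
# Logic effort per input, normalized to an inverter (values quoted in the text).
LOGICAL_EFFORT = {"INV": 1.0, "NAND2": 4.0 / 3.0, "NOR2": 5.0 / 3.0, "MUX2": 2.0}


def fanout_for_delay(effort_delay: float, logical_effort: float) -> float:
    """Fanout reachable for a target effort delay: h = f / g."""
    return effort_delay / logical_effort


def gdi_fanout_gain(cmos_gate: str, diffusion_input: bool = False,
                    c_gate: float = 1.0, c_diff: float = 1.0) -> float:
    """Fanout improvement of a GDI cell over the equivalent CMOS gate.

    For a diffusion input the gain is scaled by the gate-to-diffusion
    capacitance ratio, which depends on the technology."""
    gain = LOGICAL_EFFORT[cmos_gate] / LOGICAL_EFFORT["INV"]
    if diffusion_input:
        gain *= c_gate / c_diff
    return gain


if __name__ == "__main__":
    target = 4.0  # illustrative effort delay
    h_gdi = fanout_for_delay(target, LOGICAL_EFFORT["INV"])
    h_nand = fanout_for_delay(target, LOGICAL_EFFORT["NAND2"])
    print(h_gdi / h_nand, gdi_fanout_gain("NAND2"))
```

The ratio of reachable fanouts equals the ratio of logic efforts, so the improvement factor is independent of the chosen target delay.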
2) Fan-in: As will be shown in Section IV, an $(n+2)$-input GDI cell can be implemented by extension of any n-input CMOS structure. While the stack of serial MOSFET devices and the fan-in in CMOS gates are limited by body-effect considerations, the addition of diffusion inputs in GDI for the same structure results in an improved fan-in, defined by
$$\text{fan-in}_{GDI} = \text{fan-in}_{CMOS} + 2 \tag{31}$$
Note that for F1 and F2 functions, where only one additional input is applied to diffusion, the fan-in will increase by one compared to CMOS.
F. Discussion
In this section, the analysis of a basic GDI cell was presented. The operational and transient analysis was performed, as well as comparison of switching characteristics of CMOS and GDI, showing the advantages of GDI in terms of delay, number of transistors, and area and power consumption. Several drawbacks, mostly related to inputs connection to MOSFET wells, have to be mentioned: 1) the threshold drop and, in some cases, an increased diffusion input capacitance (both exist also in PTL techniques and were considered in simulations and analysis) and 2) the relative increase of circuit area because of separated MOSFET wells (comparisons based on real layouts will be presented in Section V). However, as we shall show in Sections V and VI, those drawbacks are mostly compensated by the advantages of GDI circuits.
IV. A DESIGN METHODOLOGY FOR COMBINATORIAL CIRCUITS USING GDI CELLS

A. Designing Leaf Cells in GDI
The examples of GDI functions given in Table I refer only to extension of a single-input CMOS inverter structure to a triple-input GDI cell in order to achieve implementation of complicated logic functions with a minimal number of transistors. Actually, this approach can be defined in a more general form. Extension of any n-input CMOS structure to an $(n+2)$-input GDI cell can be done by introducing an input P instead of the supply voltage in the pMOS block of a CMOS structure and an input N instead of ground in the nMOS block (see Fig. 7). This extended implementation can be represented by the following logic expression:
$$\mathrm{Out} = \overline{F(x_1,\ldots,x_n)}\,P + F(x_1,\ldots,x_n)\,N \tag{32}$$
where $F(x_1,\ldots,x_n)$ is a logic function of the nMOS block (not of the whole original n-input CMOS structure). An example for this extension can be seen in Fig. 8, where a three-input CMOS structure is converted to a five-input GDI cell. Equation (32) can be used to implement a Shannon expansion [18], writing a function $Z$ with inputs $\{x_1,\ldots,x_n\}$ as
$$Z(x_1,\ldots,x_n) = x_1\,H(x_2,\ldots,x_n) + \bar{x}_1\,J(x_2,\ldots,x_n) \tag{33}$$
where the functions $H$ and $J$ are
$$H = Z|_{x_1=1}, \qquad J = Z|_{x_1=0} \tag{34}$$
Shannon expansion is a very useful technique for precomputation-based low-power design in sequential logic circuits, due to its multiplexing properties [13]. In multiplexer-based precomputation, input $x_1$ can be used as an "enable" line of functions $H$ and $J$ and as the select line of a multiplexer that chooses between data of $H$ and $J$, so that for a given value of $x_1$, only one of the $H$ or $J$ blocks will operate, which significantly reduces the power dissipation of the circuit.
Due to their special properties, GDI cells can be successfully used for low-power design of combinatorial circuits, while combining two approaches: 1) Shannon expansion and 2) combinational logic precomputation, where transitions of logic values are prevented from propagating through the circuit if the final result does not change as a result of those transitions. Fig. 9 shows an architecture based on (32). We implement the functions $H$ and $J$. Depending on the value of $x_1$, only one of the functions will drive the data computed as a result of its input transitions, while the data transitions from the other function are prevented from propagating to the next logic block C.
The applicability of Shannon expansion to any logic function allows GDI implementation of any digital circuit in order to achieve low power dissipation.
B. Buffer Insertion
By using (14) and (25), the swing-limited and delay-limited bounds on the number of stages can be derived. As mentioned, the maximal number of stages in cascade between two buffers is the minimal value between these two bounds. Their calculation depends on process parameters, frequency demand, and output loads. For example, given a 0.35-µm technology process V , frequency demand of 40 MHz, and load capacitance of 100 fF, the maximal number of stages is dictated by (14), where calculated with . The derived value will be two stages of GDI cells between the buffers.
C. High-Level Design
As was mentioned in Section I, one of the most significant problems of existing PTL is that no simple and universal cell library is available for PTL-based design. The result is a difficulty in developing synthesis tools. This section contains a simple algorithm that allows realization of any logic function by using only the basic GDI cell. It is based on Shannon expansion (33), where any function $Z$ can be written as follows:
$$Z(x_1,\ldots,x_n) = x_1\,H(x_2,\ldots,x_n) + \bar{x}_1\,J(x_2,\ldots,x_n) \tag{35}$$
$$H = Z|_{x_1=1} \tag{36}$$
$$J = Z|_{x_1=0} \tag{37}$$
Algorithm's steps:
1) Given a function $Z$ with $n$ variables.
2) Check whether the function is equal to 1, 0, or a noninverted single variable.
3) If it is, no additional hardware is needed.
4) If it is not, use Shannon expansion (35) for the given function.
5) Use a GDI cell for the function implementation, using the products of Shannon expansion ($x_1$, $H$, and $J$).
6) Go back to step 2) for both functions $H$ and $J$.
One advantage of this algorithm is the ability to calculate the maximal count of transistors needed for n-input function implementation, in the predesign stage. This can be calculated by (38) where is the maximal number of transistors that are needed to implement the function, is the maximal count of GDI cells and is the number of variables in the given function. Knowledge of the maximal number of GDI cells will fix firmly the final maximal area of the circuit.
The following pseudocode shows how any combinatorial function can be synthesized by means of three-input GDI cells, where G(a, b, c) = ab + (not b)c:
/* recursively synthesize an input function f(x1, ..., xn) */
/* with GDI cells */
Algorithm SyntGDI(f(x1, ..., xn))
    if (f = 1) then return('1')
    else if (f = 0) then return('0')
    else return(G(SyntGDI(f|xn=1); xn; SyntGDI(f|xn=0)));
As an example, if XOR , the above procedure will return where stands for GDI and stands for an inverted GDI cell that is inserted as a postprocess in order to keep signal integrity. This approach can be used in combination with existing cell library-based synthesis tools to achieve an optimized design.
It must be noted that, as has been shown before, using Shannon expansion in regular logic circuits results in a lower power dissipation but requires significant area overhead. This overhead is caused by the additional precomputation circuitry. On the other hand, a Shannon-based GDI design does not require special precomputation circuitry because of its particular MUX-like nature, so that most area overhead is eliminated. The presented approach can be used in combination with existing cell library-based synthesis tools to achieve an optimized design.
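The recursive procedure of this section can be sketched in executable form. The code below is a minimal illustration, not the authors' tool: it represents a function as a Python callable with a known number of inputs, treats each GDI cell as an ideal 2:1 selector (P chosen when the gate input is 0, N when it is 1), expands around the last variable, and stops at constants and noninverted single variables. Swing-restoring buffers and the inverted cells mentioned above are not modeled, and the names `synt_gdi`, `evaluate`, and `count_cells` are hypothetical.

```python
from itertools import product


def truth_table(f, n):
    """Outputs of an n-input function over all input combinations."""
    return tuple(f(*xs) for xs in product((0, 1), repeat=n))


def cofactor(f, value):
    """Fix the last variable of f to value, giving a function of one fewer input."""
    return lambda *xs: f(*xs, value)


def synt_gdi(f, n):
    """Return a nested netlist: '0', '1', 'xi', or ('GDI', gate, p_input, n_input)."""
    table = truth_table(f, n)
    if all(v == 1 for v in table):
        return "1"
    if all(v == 0 for v in table):
        return "0"
    for i in range(n):
        if table == truth_table(lambda *xs, i=i: xs[i], n):
            return f"x{i + 1}"  # noninverted single variable: no hardware needed
    # Shannon expansion around x_n, realized by one GDI cell.
    n_input = synt_gdi(cofactor(f, 1), n - 1)  # selected when x_n = 1
    p_input = synt_gdi(cofactor(f, 0), n - 1)  # selected when x_n = 0
    return ("GDI", f"x{n}", p_input, n_input)


def evaluate(net, xs):
    """Evaluate a netlist for the input tuple xs = (x1, ..., xn)."""
    if net == "1":
        return 1
    if net == "0":
        return 0
    if isinstance(net, str):
        return xs[int(net[1:]) - 1]
    _, gate, p_input, n_input = net
    return evaluate(n_input, xs) if evaluate(gate, xs) else evaluate(p_input, xs)


def count_cells(net):
    """Number of GDI cells (two transistors each) in the netlist."""
    if isinstance(net, str):
        return 0
    return 1 + count_cells(net[2]) + count_cells(net[3])


if __name__ == "__main__":
    net = synt_gdi(lambda a, b: a ^ b, 2)
    print(net, count_cells(net))
```

For a two-input XOR this sketch produces one selector cell plus one cell acting as an inverter; sharing of identical subfunctions, which a real synthesis flow would exploit, is left out for brevity.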
V. COMPARISONS WITH OTHER LOGIC STYLES
A. Leaf Cells Comparisons
1) Cells and Simulation Conditions:
Five sets of comparisons were carried out on various logic gates. Circuits were designed at the transistor level in a 0.35-µm twin-well CMOS process technology ( V, V). The circuits were simulated using Cadence Spectre at 3.3 V, 40 MHz, and 27 °C, with a load capacitance of 100 fF. In our simulations, the well capacitance and other parasitic parameters were taken into account. Each set includes a logic cell implemented in four different techniques: GDI, CMOS, transmission gate, and nMOS pass gate. Cells were designed for a minimal number of transistors in each technique as shown in Table III, while in nMOS pass-gate cells a buffer was added because of low swing of output voltage ( ). Most circuits were implemented with ratio of three to achieve the best power-delay performance. The same transitions of logic values were supplied to the inputs of the test circuits in each technique. Measured values apply to transitions in inputs connected to the gates of transistors, in order to achieve a consistent comparison.
Measurements were performed on test circuits that were placed between two blocks, which contain circuits similar to the device under test (DUT). The measured power is that of the DUT, including the power consumed by driving the next stage, thus accounting for the input power consumption and not just the power directly consumed from supply. This allows more realistic environment conditions for test circuit, instead of the ideal input transitions of the simulator's voltage sources [24] .
The fact that no GDI cell contains full supply implies that the only power consumed is through the inputs, as GDI cells are fed only by the previous circuits. A similar phenomenon is partially observed in most PTL circuits, but there the power consumption from the source is caused by CMOS buffers, which are included in every regular PTL. Yet, in real circuits and simulations, current flow from the sources can be measured in GDI. It is caused by buffers that are connected between cascaded cells. Hence, a fair comparison between the techniques must be performed for measurements that are carried out from cells series with buffers and not from a single cell. GDI and TG test circuits contain two basic cells with one output buffer. N-PG contains two buffers: one after each cell. CMOS has no buffers in test circuits.
2) Comparisons and Results: For each technique, average power, maximal delay, and number of transistors were measured. The results are given in Table IV.
a) Number of transistors comparison: Among all the design techniques, GDI proves to have the minimal number of transistors. Each GDI gate was implemented using only two transistors. The worst case, with respect to transistor count, is for the CMOS MUX gate (multiplexers are the well-known domain of pass-transistor logic). In this sense, the PTL techniques prove to be inferior compared to GDI.
TABLE IV: LOGIC GATE COMPARISONS (GDI, CMOS, TRANSMISSION GATE, AND nMOS PASS GATE) USING THE CIRCUIT TOPOLOGIES FROM TABLE III
b) Power dissipation comparison: Results are given for power dissipation in different gates. Consistently for all design techniques, the MUX gate has the largest power consumption because of its complicated implementation (CMOS) and the presence of additional input. On the other hand, AND's power dissipation is the minimal among all the gates. Still, most GDI logic gates prove to be the most power efficient among the four compared design techniques (only F2 gate shows an advantage of nMOS pass gate compared to GDI).
c) Delay comparison:
The best performance with respect to circuit delay was measured in GDI and TG circuits. The advantage of TG in some circuits can be explained by the fact that one nMOS and one pMOS transistor are conducting at once for each logic state in TG gates. It should be noted that the results of CMOS delays compared to GDI in most cases are bounded according to (29), as expected. Circuits implemented in N-PG are the slowest, because of the need for additional buffer circuitry in each gate.
3) Discussion: Among the presented design techniques, GDI proves to have the best performance values and lowest transistor count. Even in the cases where power or delay parameters of some GDI gates are inferior, compared to TG or N-PG, the power-delay products and transistor count of GDI are lower. Only the TG design method is a viable alternative for GDI if high-frequency operation is of concern.
B. Cell Comparisons for Different Load Conditions
A fair comparison of properties of the techniques mentioned above should involve measuring delay and power consumption under different load conditions of the cell. In this section, the results of parametric simulations for power and delay measurement are presented. The simulations were carried out in Spectre to compare NOR and AND GDI cells, using the F1 function, with CMOS, N-PG, and TG techniques (as presented in Table V) in 0.24-µm CMOS technology. A regular CMOS inverter was used as a load for the DUT, with dimensions of 2.4 µm/0.24 µm for the PFET and 0.9 µm/0.24 µm for the NFET. In this technology, the given load size applies a load capacitance of about 1 fF. To obtain the dependence of the simulations on load conditions, the load size was multiplied by a PS parameter (changing from one to three). The results of power and delay as a function of the PS parameter are presented in Table V, showing the consistent advantage of GDI.
C. High-Level Circuit Comparisons
1) Circuits and Design Methods:
Wishing to cover a wide range of possible circuits, design methods, and property comparisons, several digital combinatorial circuits were implemented using various methods (GDI, PTL, and CMOS), design techniques, and technology processes. Table VI contains a full list of circuits implemented during the research with respect to design methods and processes.
TABLE VI. FULL LIST OF CIRCUITS IMPLEMENTED DURING RESEARCH. G: GDI, C: CMOS, P: PTL, * fabricated circuits, ** research in progress (0.35-μm twin-well technology).
a) AND, OR, and XOR Cells: Table III contains AND, OR, and XOR cells using GDI, CMOS, and PTL design techniques. It must be noted that use of the full GDI library is not possible in a regular p-well CMOS process. As a result, only function F1 and its expansions could be implemented. Table VII presents the implemented GDI basic functions for a regular p-well process and their layouts.
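b) 8-bit CLA adder: The carry-lookahead adder (CLA) structure is well known and widely used thanks to its high-speed operation, achieved by calculating the carries in parallel [1]. The carry of the $i$th stage can be written as $C_i = G_i + P_i C_{i-1}$, where $G_i = A_i B_i$ is the generate signal and $P_i = A_i + B_i$ is the propagate signal.
Expanding this yields
$$C_i = G_i + P_i G_{i-1} + P_i P_{i-1} G_{i-2} + \cdots + P_i P_{i-1} \cdots P_1 C_0. \tag{42}$$
The sum is generated by
$$S_i = C_{i-1} \oplus A_i \oplus B_i, \quad \text{or} \quad S_i = C_{i-1} \oplus P_i \ \text{ if } \ P_i = A_i \oplus B_i. \tag{43}$$
For four stages of lookahead, the appropriate terms are
$$C_1 = G_1 + P_1 C_0 \tag{44}$$
$$C_2 = G_2 + P_2 G_1 + P_2 P_1 C_0 \tag{45}$$
$$C_3 = G_3 + P_3 G_2 + P_3 P_2 G_1 + P_3 P_2 P_1 C_0 \tag{46}$$
$$C_4 = G_4 + P_4 G_3 + P_4 P_3 G_2 + P_4 P_3 P_2 G_1 + P_4 P_3 P_2 P_1 C_0. \tag{47}$$
Fig. 11 shows a generic carry-lookahead adder. The PG generation and SUM generation circuits surround a carry-generate block. The circuit presented is a 4-bit adder that can be replicated in order to create an 8-bit adder, due to fan-in and size limitations of the gates.
c) 4-bit ripple comparator: A 4-bit ripple comparator consists of a cascade of four identical basic units, with the comparison data transmitted through the units (Fig. 12). Comparison of the most significant bit is done first, proceeding down to the least significant bit.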
The outcome of the comparison in every unit is represented by two signals, C and D, according to Table VIII.
d) 4-bit multiplier: The multiplier contains an array of interconnected basic cells [5], as shown in Fig. 13. Each multiplier cell (Fig. 14) represents one bit of a partial product and is responsible for 1) generating a bit of the correct partial product in response to the input signals and 2) adding this bit to the cumulative sum propagated from the row above. The cell consists of two components: an AND gate that generates the partial product bit and an adder that adds this bit to the previous sum.
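The behaviour of the multiplier cell described above (an AND gate feeding an adder) can be sketched as follows. This is a behavioural model only: the function names are illustrative, and the row-by-row ripple wiring in `array_multiply` is an assumption, since the exact interconnection of Fig. 13 is not reproduced here.

```python
def full_adder(a, b, carry_in):
    """One-bit full adder: returns (sum, carry_out)."""
    total = a ^ b ^ carry_in
    carry_out = (a & b) | (carry_in & (a ^ b))
    return total, carry_out


def multiplier_cell(a_bit, b_bit, sum_in, carry_in):
    """AND gate forms the partial-product bit; the adder accumulates it."""
    partial = a_bit & b_bit
    return full_adder(partial, sum_in, carry_in)


def array_multiply(a, b, n=4):
    """Multiply two n-bit unsigned numbers with an n x n array of cells."""
    sums = [0] * (2 * n)  # cumulative sum bits handed from row to row
    for j in range(n):  # one row per bit of b
        b_bit = (b >> j) & 1
        carry = 0
        for i in range(n):  # one cell per bit of a
            a_bit = (a >> i) & 1
            sums[i + j], carry = multiplier_cell(a_bit, b_bit, sums[i + j], carry)
        sums[j + n] = carry  # the row's final carry becomes the next sum bit
    return sum(bit << k for k, bit in enumerate(sums))


if __name__ == "__main__":
    print(array_multiply(13, 11))
```

Each call to `multiplier_cell` corresponds to one cell of the array; a transistor-level comparison of the cell in GDI and CMOS is outside the scope of this sketch.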
e) Sequential logic circuits: Although this paper covers mostly combinatorial digital circuits, some implementations of sequential logic circuits were also performed. Fig. 15 presents the basic scheme of an n-bit counter based on toggle flip-flop (TFF) cells.
The circuit was implemented in a 0.35-μm twin-well CMOS process technology, and its research is currently in progress. Layouts of basic TFF cells can be seen in Fig. 16 with respect to the number of transistors and area of each cell. Fig. 16(a) and (b) presents layouts of GDI TFFs based on the F1 and F2 functions, respectively. Fig. 16(c) shows the layout of the CMOS TFF.
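A behavioural sketch of an n-bit counter built from TFF cells is given below. It assumes a synchronous arrangement in which each cell toggles when all lower-order bits are 1; the actual scheme of Fig. 15 may differ (for example, a ripple arrangement), and the class and function names are illustrative.

```python
class TFF:
    """Toggle flip-flop: flips its stored bit on a clock edge when t is 1."""

    def __init__(self):
        self.q = 0

    def clock(self, t):
        if t:
            self.q ^= 1
        return self.q


def step(cells):
    """Apply one clock edge to the counter and return its new value."""
    # Toggle inputs are derived from the state before the edge.
    toggles = []
    enable = 1  # the least significant cell toggles on every edge
    for cell in cells:
        toggles.append(enable)
        enable &= cell.q  # the next cell toggles only if all lower bits are 1
    for cell, t in zip(cells, toggles):
        cell.clock(t)
    return sum(cell.q << i for i, cell in enumerate(cells))


if __name__ == "__main__":
    counter = [TFF() for _ in range(4)]
    print([step(counter) for _ in range(16)])
```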
2) Simulated Results and Comparisons:
This section presents the results of performance comparisons of some of the digital circuits mentioned above. All given measurements were carried out on a representative pattern of possible input transitions, with the worst-case assumption used to find the maximal delay of the circuit. Power dissipation was calculated as an average over the pattern.
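The way the three reported figures follow from a pattern of transitions can be sketched as below. The data layout (one delay and one power value per input transition) and the function name are assumptions made for illustration, and the numbers in the example are invented, not simulation results.

```python
from statistics import mean


def summarize(measurements):
    """Summarize per-transition results given as (delay_ns, power_mw) pairs."""
    delays = [delay for delay, _ in measurements]
    powers = [power for _, power in measurements]
    max_delay_ns = max(delays)  # worst case over the pattern
    avg_power_mw = mean(powers)  # average over the pattern
    return {
        "max_delay_ns": max_delay_ns,
        "avg_power_mw": avg_power_mw,
        "pdp_pj": max_delay_ns * avg_power_mw,  # mW * ns = pJ
    }


if __name__ == "__main__":
    # Invented values, for illustration only.
    pattern = [(1.0, 2.0), (2.0, 4.0), (1.5, 3.0)]
    print(summarize(pattern))
```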
a) 8-bit CLA adder: GDI versus CMOS and TG: An 8-bit adder was realized in a 1.6-μm CMOS process. Two chips were designed, and their layouts can be seen in Fig. 17. Each chip contains a GDI circuit and a compared circuit implemented in either CMOS or TG.
Performance comparisons were done by simulating in Cadence Spectre at V, MHz, and 27 °C. Several parameters were measured: average power, maximal delay, power-delay product, number of transistors, and circuit area. The results are assembled in Table IX and Fig. 18.
As can be seen, the GDI adder proves to be the most power-efficient circuit. Power dissipation in GDI is lower than in CMOS and in TG, yet the delay of TG is lower than that of GDI (as expected). The CMOS circuit has the highest delay, 44.9% more than GDI. In spite of the inferior speed of GDI relative to TG, the power-delay product of GDI is lower than that of both TG and CMOS. Because of the use of a limited GDI cell library in the p-well CMOS process, the number of transistors and the area of the CMOS and GDI circuits are close, but much smaller than in the TG adder implementation.
b) 8-bit comparator: GDI versus CMOS and N-PG:
The implementation of an 8-bit comparator was carried out in the same 1.6-μm CMOS process at V, MHz, and 27 °C. The layout of a chip that contains the three compared circuits can be seen in Fig. 19.
GDI proves to have the best performance among the tested design methods, as can be seen in Fig. 20 and Table X. The power, delay, and power-delay product of GDI are the best among the compared circuits, while N-PG has the worst performance results. Here, as in the adder circuit, the limited GDI library was used because of process constraints. As a result, the final area of the GDI comparator is greater than that of CMOS and N-PG, while the number of transistors in all three circuits is the same.
c) 4-bit multiplier: GDI versus CMOS:
The multiplier was implemented in 0.5-μm CMOS technology and simulated with a 3.3-V supply at 50 MHz and 27 °C. To achieve a robust measure of the power-delay product, we ran our simulations on CMOS and GDI circuits that were parametric in size; e.g., running with an area parameter of two means that the transistor widths are twice those used when running with an area parameter of one. Spectre simulations were done on schematic circuits, while changing the area parameter from one to eight. Fig. 21 shows the power, delay, and power-delay product as a function of the area parameter. As can be seen, GDI shows better results in all parameters for all area coefficients. Twenty-six transistors were used in the GDI multiplier, compared with 44 transistors in CMOS. An additional comparison was done for circuits with the same delay value (1.03 ns). The results of area, power dissipation, and power-delay product can be seen in Table XI.
VI. MEASUREMENTS OF A TEST CHIP
An 8-bit adder designed in GDI and CMOS [Fig. 17(a)] was fabricated in 1.6-μm CMOS technology (MOSIS). The voltage supplies of the two circuits were separated in order to enable separate power measurements. After the postprocessing, three types of chips were available: GDI adder, CMOS adder, and chips that contain both circuits, connected. This allowed measuring the dynamic power of the circuits while eliminating the static power dissipation and the power dissipation of the output pads (Fig. 22).
Several sets of measurements and tests were applied to the test chips, using the EXCELL 100 + testing system of IMS. To demonstrate the influence of scaling on a given GDI circuit, the measurements were performed with various supply voltages.
1) Operational Tests: Both circuits were checked for proper operation, while using two scripts, which generated patterns of input values. The first set of values was generated according to binary order of input numbers. The second set included more than 20 000 random transitions, which were used in delay and power measurements.
2) Delay Measurements: The maximal delay of both circuits was measured by increasing the frequency of the input signal and checking the results of the addition. The frequency at which the first error appears defines the delay of the circuit. Table XII presents the delay results for the GDI and CMOS adders at various supply voltage levels.
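The frequency-sweep procedure can be sketched as follows. The callback `passes_at` stands in for the tester (a hypothetical interface, not the API of the testing system), and taking the delay as the period of the first failing frequency is an assumption about how the measured frequency is converted to a delay.

```python
def measure_max_delay_ns(frequencies_mhz, passes_at):
    """Sweep upward in frequency; return the period (ns) at the first failure.

    passes_at(f_mhz) must return True while the adder still produces
    correct sums at input frequency f_mhz.
    """
    for f_mhz in sorted(frequencies_mhz):
        if not passes_at(f_mhz):
            return 1e3 / f_mhz  # period in ns for a frequency in MHz
    return None  # no error observed within the swept range


if __name__ == "__main__":
    # Stand-in for the tester: pretend errors start at 50 MHz (invented value).
    def fake_tester(f_mhz):
        return f_mhz < 50

    print(measure_max_delay_ns(range(10, 101, 10), fake_tester))
```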
It can be noticed that for the given implementation and the output load, defined by the testing system, both circuits have equal delays.
3) Dynamic Power Measurements: Wishing to eliminate the influence of the circuitry in the output pads, which causes high additional power dissipation, a set of measurements in low frequencies was performed for various supply voltages. Those results represent the static power dissipation of the test chip. Then, power measurements at high frequencies were performed and static power values were subtracted from those results to achieve the dynamic power at the given frequency.
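The subtraction described above can be written out as a short sketch. The dictionary layout and function name are illustrative, and the readings in the example are invented placeholders rather than values from Table XIII.

```python
def dynamic_power_mw(readings):
    """Subtract the low-frequency (static) reading from the high-frequency one.

    readings maps a supply voltage to a pair
    (power_at_low_frequency_mw, power_at_high_frequency_mw).
    """
    return {
        vdd: high_freq - low_freq
        for vdd, (low_freq, high_freq) in readings.items()
    }


if __name__ == "__main__":
    # Invented readings, for illustration only.
    example = {5.0: (2.0, 12.0), 4.5: (1.5, 8.0)}
    print(dynamic_power_mw(example))
```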
Final results of dynamic power dissipation are shown in Table XIII .
Dynamic power measurements were performed for various frequencies, respective to the voltage supply level. The measurements were performed for a 5-V supply at 12.5 MHz, for 4.5 V at 10 MHz, and for the rest of the values at 4 MHz.
4) Power-Delay Product:
Due to equal delay values in both circuits, the normalized power-delay product has about the same values as those of the power measurements. For power and power-delay product, improvements in the range of 11-45% were measured.
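The reason the normalized power-delay product tracks the power ratio can be made explicit. With $P$ denoting measured dynamic power and $t_d$ the common measured delay (symbols introduced here for illustration only):

```latex
\frac{\mathrm{PDP}_{\mathrm{GDI}}}{\mathrm{PDP}_{\mathrm{CMOS}}}
  = \frac{P_{\mathrm{GDI}}\, t_d}{P_{\mathrm{CMOS}}\, t_d}
  = \frac{P_{\mathrm{GDI}}}{P_{\mathrm{CMOS}}},
\qquad
\text{improvement} = 1 - \frac{P_{\mathrm{GDI}}}{P_{\mathrm{CMOS}}}.
```

Under this reading, an improvement of 11-45% corresponds to a GDI-to-CMOS power ratio between roughly 0.89 and 0.55.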
It must be noted that there is a difference between the simulations and measured data. This is caused by the fact that in all the presented circuits, the simulations have been performed while placing the DUT in the environment of logic circuits designed in the same technique, while in the test chip measurements, the single DUT has been connected directly to output pads, causing a significantly higher load capacitance. Still, in both measured and simulated results, the relative advantage of GDI is preserved.
VII. CONCLUSION AND FUTURE RESEARCH
A novel GDI technique for low-power design was presented. An 8-bit CLA adder was fabricated using GDI and CMOS and used as a test vehicle. Numerous logic gates and high-level digital circuits were implemented in various methods and process technologies, and their simulation results were discussed. Comparisons with existing TG and N-PG techniques were carried out, showing up to a 45% reduction of power-delay product in the test chip in GDI over CMOS and significant improvements in performance, as well as a decreased number of transistors and area in most simulated GDI circuits over CMOS and PTL.
An operational analysis and a design methodology were also presented. The GDI technique allows use of a simple and efficient design algorithm, based on the Shannon expansion. It makes GDI suitable for synthesis and realization of combinatorial logic in real LSI chips, while using a single-cell library. This proves to be an additional advantage of GDI over CMOS and PTL.
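The idea of a Shannon-based design algorithm can be illustrated with a small sketch that recursively expands a Boolean function into a tree of two-way selections. It is not the authors' synthesis procedure: the `("mux", variable, low, high)` node is a hypothetical stand-in for one library cell, and no transistor-level behaviour is modeled.

```python
from itertools import product


def shannon(f, variables):
    """Expand f as f = x'*f(x=0) + x*f(x=1), one variable at a time."""
    if not variables:
        return int(f())  # all inputs fixed: a constant 0 or 1
    x, rest = variables[0], variables[1:]
    low = shannon(lambda *args: f(0, *args), rest)  # cofactor with x = 0
    high = shannon(lambda *args: f(1, *args), rest)  # cofactor with x = 1
    if low == high:
        return low  # f does not depend on x here, so no cell is needed
    return ("mux", x, low, high)


def evaluate(tree, env):
    """Evaluate an expansion tree for the input assignment in env."""
    if isinstance(tree, int):
        return tree
    _, x, low, high = tree
    return evaluate(high if env[x] else low, env)


if __name__ == "__main__":
    names = ["a", "b", "c"]

    def majority(a, b, c):
        return (a & b) | (a & c) | (b & c)

    tree = shannon(majority, names)
    print(tree)
    for bits in product((0, 1), repeat=3):
        assert evaluate(tree, dict(zip(names, bits))) == majority(*bits)
```

Because every internal node has the same form, a single cell type suffices to realize the whole tree, which is the property the text associates with a single-cell library.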
Most of the circuits were implemented in regular p-well CMOS processes, which casts a limitation on a GDI cell library. Still, even in limited-library-based GDI circuits, significant improvements of performance are observed. Implementations of GDI circuits in SOI or twin-well CMOS processes are expected to supply more power-delay efficient design, due to the use of a complete cell library with reduced transistor count.
The advantages of the GDI technique, namely, the Shannon-based design algorithm, the two-transistor implementation of complex logic functions, and in-cell swing restoration under certain operating conditions, are unique among existing low-power design techniques. This, together with positive measurement and simulation results, provides evidence that GDI design might enrich the toolbox of VLSI circuit designers.
We hope that the presented results will encourage further research activities on the GDI technique. Implementations of different kinds of digital and mixed circuits have to be carried out in order to determine the fields of circuitry where GDI is superior over other styles. The issue of sequential logic design is currently being explored, as well as technology compatibility for twin-well CMOS process. More work is required in the automation of a logic design methodology based on GDI cells.
