Abstract: Two techniques are proposed which enhance the optimisation efficiency of CMOS combinational logic circuits. One uses transition times (rise and fall times) of each gate as variables of the optimisation process. The other technique uses the optimal characteristic waveform synthesising method (OCWSM) to obtain the initial guess for the optimisation process. The optimisation process, with these two techniques, can perform sizing and optimisation for circuits with a smaller fixed-delay specification than other sizing and optimisation algorithms. The circuits sized using the proposed algorithm have shown a smaller power dissipation, especially when the delay specification is small. The CPU time consumed is reasonable. High-speed low-power circuits are thus more realisable using the proposed algorithm. 
variables of the optimisation process because they are the desired solution. Since the relation between delay time and device sizes is very complex, the efficiency of an optimisation that uses device sizes as optimisation variables may be degraded. The initial guess (initial device sizes) of the optimisation process is either arbitrarily chosen by the user [10] or obtained from the heuristic approach [13]. The required circuit performance is usually different from that achieved with the user-assigned device sizes (usually the minimum allowable device sizes). This may cause difficulties in the optimisation process because a large number of iterations and a large CPU time may result. The convergence speed can be greatly enhanced by using the heuristic approach to obtain the initial guess for the optimisation process. However, as the specified delay time approaches the optimal value, the CPU time required by the heuristic approach becomes intolerable even for a medium-size circuit [13].
Fig. 1 for 2 µm CMOS inverters with $C_L$ = 0 pF (only one fanout) under characteristic-waveform consideration. Fig. 2 shows the comparisons of 2 µm CMOS inverters with $C_L$ = 0 pF (only one fanout) under exponential input excitations with time constants from 0.4 to 4.0 ns. The accuracy verification shows that the maximum error is under 15% for inverters with different device dimensions, capacitive loads, input waveforms, device parameters and temperatures. Fine tuning can further decrease the error to below 7% [17].
The timing models should be able to characterise the timing of multi-input logic gates excited at any input node to accurately calculate the delay time of a logic circuit. A CMOS three-input NAND gate, as shown in 
CMOS three input NAND gate
excited and the other nodes are stabilised at $V_{DD}$, the timing is not the worst case type. According to our observations, the longest rise delay of a CMOS three-input NAND gate with one fanout gate is about 58% longer than the corresponding shortest one. The timing models [15, 18] developed by the authors consider all the triggering cases. Part of the comparison between SPICE simulation and model calculation is listed in Table 3 for 2 µm CMOS three-input NAND gates in different triggering cases. The accuracy verification shows that the maximum error of the developed timing models is under 15% for CMOS multi-input NAND/NOR gates over wide ranges of device sizes, capacitive loads and device parameter variations, for input voltage waveforms that do not deviate much from the characteristic waveforms. Fine tuning can further reduce the maximum error to below 10% [17]. A similar error characteristic can be obtained for the timing models of small-geometry CMOS AOI/OAI gates [18].
3 General power dissipation models of CMOS combinational gates
Consider a string of CMOS three-input NAND gates excited at the node 2 as shown in 
Typical fall characteristic of CMOS three input NAND gate
Neglecting the short-circuit power dissipation is an acceptable approximation as long as it is small compared with the dynamic power dissipation needed to charge the capacitor [11]. According to our observations, when CMOS logic circuits are operated under the characteristic-waveform consideration, the energy loss is mainly caused by the charging and discharging of device capacitances during the output rising/falling transition periods. Only the dynamic power dissipation is therefore considered.
There are four different types of device capacitances: voltage-dependent drain (source)-bulk junction capacitance, voltage-dependent gate-drain (source, bulk) capacitance, voltage-independent gate-drain (source) overlap capacitance, and external voltage-independent capacitances. The energy loss during the charging/discharging cycle of a capacitor is equal to the change of the energy stored on the capacitor [34]. To characterise the change of the energy stored on the voltage-dependent pn junction capacitances, the case of the drain-bulk junction capacitance is considered. The change of stored energy from the drain-bulk voltage $V_{DB}$ = 0 V to $V_{DB} = V_{bias}$ is derived in the Appendix. For CMOS three-input NAND gates excited at node 2, the change of the energy stored on voltage-dependent drain-bulk junction capacitances can be further divided into three subgroups.
(a) Drain-bulk junction capacitances at the output node: In a typical CMOS logic gate, the substrate of the PMOSFET is connected to the positive power supply $V_{DD}$. When the output voltage is $V_{DD}$, the voltage across the drain-bulk junction capacitance of the PMOSFET is 0 V and there is no charge stored on this capacitor. The voltage across this capacitance is $V_{DD}$ when the output voltage is 0 V. The change of the energy during each transition period can then be expressed as
where $P_{DP}$ is the drain perimeter of the PMOSFET, and $F_{area,P}(V_{DD})$ and $F_{peri,P}(V_{DD})$ can be found from the Appendix for the PMOSFET.
The change of the energy stored on the drain-bulk junction capacitance of the NMOSFET can also be expressed by using the same derivation technique.
(b) Drain-bulk junction capacitances of internal series NMOSFETs: The voltage swing of $V_{DB}$ at node 3 is $V_{DD} - V_{TNF}$. Thus, the change of the energy on the drain-bulk junction capacitor at this node is
where $CJ_N$ is the zero-bias bulk capacitance of the NMOSFET, $CJSW_N$ is the zero-bias perimeter capacitance of the NMOSFET, $A_{DN}$ is the drain area of the NMOSFET, $P_{DN}$ is the drain perimeter of the NMOSFET, and $F_{area,N}(V_{DD} - V_{TNF})$ and $F_{peri,N}(V_{DD} - V_{TNF})$ can be found from the Appendix for the NMOSFET.
(c) Internal inactive drain-bulk junction capacitances: From Fig. 4, it is found that the voltage at node 2 ($V_2$) is 0 V both before and after the transition period. The bias voltage $V_{bias}$ is therefore 0 V.
From the Appendix, the change of the energy on such an inactive drain-bulk junction capacitor is 0, as is the power dissipation.
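The Appendix expressions are not reproduced in this section, so the three subgroups can only be illustrated numerically. The sketch below assumes the common SPICE-style reverse-bias junction model C(v) = C0 / (1 + v/PB)^MJ and estimates the change of stored energy as the integral of C(v)·v from 0 V to V_bias. The model, the parameter values and the function names are assumptions for illustration and may differ from the paper's own $F_{area}$ and $F_{peri}$ functions.

```python
def junction_cap(v, c0, pb=0.8, mj=0.5):
    """Junction capacitance in farads at reverse bias v >= 0 (assumed SPICE-style model)."""
    return c0 / (1.0 + v / pb) ** mj


def stored_energy_change(v_bias, c0, pb=0.8, mj=0.5, steps=2000):
    """Trapezoidal estimate of the integral of C(v) * v dv from 0 to v_bias (joules)."""
    if v_bias == 0.0:
        return 0.0  # case (c): no voltage swing, no energy change
    dv = v_bias / steps
    total = 0.0
    for k in range(steps):
        v0, v1 = k * dv, (k + 1) * dv
        f0 = junction_cap(v0, c0, pb, mj) * v0
        f1 = junction_cap(v1, c0, pb, mj) * v1
        total += 0.5 * (f0 + f1) * dv
    return total


C0 = 10e-15    # hypothetical zero-bias junction capacitance, 10 fF
V_DD = 5.0     # hypothetical supply voltage, V
V_TNF = 1.0    # hypothetical NMOS threshold voltage, V

e_output = stored_energy_change(V_DD, C0)            # case (a): swing of V_DD
e_internal = stored_energy_change(V_DD - V_TNF, C0)  # case (b): swing of V_DD - V_TNF
e_inactive = stored_energy_change(0.0, C0)           # case (c): V_bias = 0
```

Because the assumed capacitance falls with reverse bias, the estimate for case (a) stays below the fixed-capacitance value 0.5·C0·V_DD², and case (c) contributes nothing, in line with the statement above.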
The change of the energy stored on the voltage-dependent source-bulk junction capacitor can also be formulated. For the voltage-dependent gate-drain (source, bulk) capacitance, the energy change is calculated region by region with suitable capacitance values determined by the device operating regions. For the voltage-independent gate-drain (source) overlap capacitance and the external voltage-independent capacitances, the energy change can be easily characterised. The total energy loss during the rising/falling transition period can be determined by summing up the changes of the energy in a logic gate. The average power consumption of a logic circuit can be calculated by using the definition
where $N$ is the total gate number in the circuit, $E_i$ is the energy loss of gate $i$, and $T_{max}$ is the critical maximum delay time of the logic circuit under operation, which is the delay time of the critical path.
It should be emphasised that the energy losses of a logic gate are different for different excitation inputs and so are the average power dissipations of a logic circuit. The power dissipation models developed can characterise those different power dissipations.
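The display form of the average-power definition is not legible in this section. A minimal sketch, assuming it is simply the total energy loss of the N gates divided by the critical-path delay (which is what the symbol definitions suggest), is given below. The function name and the energy values are hypothetical.

```python
def average_power(gate_energies, t_critical):
    """Average power in watts: sum of per-gate energy losses (J) over the critical-path delay (s)."""
    if t_critical <= 0:
        raise ValueError("critical-path delay must be positive")
    return sum(gate_energies) / t_critical


# Hypothetical energy losses of three gates for one excitation pattern, in joules.
energies = [0.2e-12, 0.35e-12, 0.15e-12]
p_avg = average_power(energies, 7e-9)  # 0.7 pJ over 7 ns, about 0.1 mW
```

A different excitation pattern would give a different list of energies and hence a different average power.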
Sizing and constrained optimisation
As an example of the proposed techniques in the sizing and constrained optimisation, the minimisation of power dissipation with fixed-delay specifications in the sizing of CMOS combinational logic circuits is considered. The same techniques can also be applied to improve the optimisation efficiency for other optimisation problems. Minimising the power dissipation, $P$, with fixed-delay specifications is a nonlinear constrained optimisation problem which is defined as
..., $m$, respectively. $T_1, T_2, \ldots, T_m$ are the specified delay times at nodes 1, 2, ..., $m$, respectively.
4.1 Using transition times as independent optimisation variables
Transistor sizes are conventionally chosen as optimisation variables in the sizing and constrained optimisation. They are manipulated to obtain optimisation directions and steps toward a given delay in the constrained optimisation with specified delay times as design constraints. The delay time is a very complicated function of transistor sizes, as may be seen from the physical timing models [15-18]. Such a complex function leads to many difficulties in mathematical treatment.
The rise/fall and delay times of a MOS logic gate are generally determined by
(i) driving capability
(ii) internal gate capacitances
(iii) load capacitance or resistance contributed by the loading gates
(iv) load capacitance or resistance contributed by the interconnection line or the on-chip or off-chip fixed capacitive loads
(v) rise/fall times of the input waveforms
(vi) excited input nodes or the input excitation patterns.
The first two factors are related to device sizes. The last two factors are associated with input excitations. If the output loading of a logic gate and the input excitations are known, the output transition times can be determined from the device sizes. Conversely, the device sizes of a logic gate can also be determined from its output transition times under the same conditions.
In sizing a MOS digital IC, the logic structure, the input waveforms to the circuit, the output off-chip loading, and the technology and device parameters are known. In combinational logic circuits, the output off-chip loading becomes the loading of the last stage in each of the signal paths. If the input excitation patterns and the output rise and fall times of the last stage are given, the device sizes of that stage are the only unknown factors in the rise-time and fall-time equations [15-18] of the timing models. They can then be calculated from these timing equations.
Having obtained the device sizes of the last stage, the output loading contributed by the last stage to the preceding stage can be determined. If the input excitation patterns and the output rise and fall times of those stages are given, their device sizes can also be calculated by solving their timing equations. This implies that if the output rise and fall times of each logic gate are known, the timing synthesis of combinational logic circuits can be achieved by working from the last stage of each signal path back to the first stage of the path. The sizing can be performed simultaneously and globally from all the output stages to all the input stages. It is therefore feasible to treat the rise and fall times of the gates in a circuit as independent variables in the sizing and optimisation process. In each optimisation step, the corresponding device sizes in each gate can be calculated from the rise/fall times by using the timing equations. Since the delay time of a CMOS logic gate is approximately a linear function of the rise and fall times as described earlier, it is expected that the optimisation using rise and fall times as optimisation variables reaches a better optimum and/or converges faster than that using device sizes as optimisation variables.
IEE PROCEEDINGS-E, Vol. 138, No. 3, MAY 1991
In the synthesising process, the resultant rise and fall times may be larger or smaller than those in practical circuits. The synthesised device sizes may thus be smaller than the user-specified minimum allowable channel widths or larger than the user-specified maximum allowable channel widths. To solve this problem, the device sizes are reset to the minimum or maximum allowable values and the transition times are reset to the corresponding values.
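The back-to-front synthesis and the resetting rule can be sketched for a single chain of gates. The sketch does not use the timing models of [15-18]; it assumes a hypothetical first-order model in which the transition time equals K·C_load/W and the input capacitance of a gate equals C_G·W. All constants and names are illustrative assumptions.

```python
K = 2.0e4                    # hypothetical drive constant, s*um/F (t = K * C_load / W)
C_G = 2.0e-15                # hypothetical input capacitance per um of channel width, F/um
W_MIN, W_MAX = 2.0, 200.0    # hypothetical allowable channel widths, um


def size_chain(transition_times, c_out):
    """Size a chain of gates from the last stage back to the first.

    transition_times[i] is the target output transition time of stage i in seconds,
    c_out is the fixed off-chip load in farads. Returns (widths, adjusted_times).
    """
    n = len(transition_times)
    widths = [0.0] * n
    times = list(transition_times)
    c_load = c_out
    for i in range(n - 1, -1, -1):
        w = K * c_load / times[i]              # invert the assumed timing equation
        w_clamped = min(max(w, W_MIN), W_MAX)  # reset to the allowable range
        if w_clamped != w:
            times[i] = K * c_load / w_clamped  # reset the transition time to match
        widths[i] = w_clamped
        c_load = C_G * w_clamped               # this stage now loads the preceding one
    return widths, times


widths, times = size_chain([1.0e-9, 1.0e-9, 1.0e-9], c_out=1.0e-12)
```

In this example the last stage needs about 20 µm to drive 1 pF in 1 ns, while the two preceding stages would need less than the minimum width, so they are reset to 2 µm and their transition times shrink accordingly.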
In the optimisation, the ratio of rise time to fall time for all the logic gates can be defined by the user. Symmetrical rise and fall transitions are one of the most important design issues, so the ratio of rise time to fall time for all the logic gates is taken to be unity. All MOSFETs in series are designed with equal channel widths, as are all MOSFETs in parallel.
Using the OCWSM to obtain a set of device sizes as the initial guess
In the design of a tapered buffer, the minimum total delay can be obtained by equalising the delay in each stage [3, 15-18, 28]. The resultant rising or falling waveform in each stage is the same, being the characteristic waveform. The characteristic waveform appears in any minimum delay path of identical logic gates. In a minimum delay signal path with different types of logic gates, although the exact characteristic waveform does not appear, the deviation of the actual waveform from the characteristic waveform is not so significant because of the similarities among these inverting logic gates. It is expected that actual waveforms in an optimally designed chip are close to the characteristic waveforms. Based on these considerations, a quick sizing method called the optimal characteristic waveform synthesising method (OCWSM) was developed [19, 20].
With the OCWSM, the designer chooses the ratio of rise time to fall time for all the gates. The ratio of the output rise (or fall) time to the fan-in number of a gate is also fixed. Given an initially guessed value of rise or fall time, the rise/fall times of all the other gates can be found through the two fixed ratios. For example, if the ratio of rise time to fall time is unity and the rise time of a CMOS inverter is $T_r$, the fall time of the CMOS inverter is also $T_r$, and the rise and fall times of a two-input CMOS NAND gate are $2T_r$. The OCWSM finds the value of $T_r$ that achieves the minimum delay. This is a single-variable optimisation problem, and the OCWSM can quickly find the optimal value of rise (or fall) time for the minimum delay. The required number of iterations is typically under eight.
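The single-variable search can be illustrated as follows. The section does not state which one-dimensional search the OCWSM uses, so a golden-section search is swapped in here, and the path delay is a hypothetical convex function of $T_r$ rather than the result of the timing models. All names and constants are assumptions.

```python
import math


def gate_transition_times(t_r, fan_ins, ratio=1.0):
    """(rise, fall) times of each gate from the single variable t_r.

    rise = fan_in * t_r, fall = rise / ratio, where ratio is rise time over fall time.
    """
    return [(n * t_r, n * t_r / ratio) for n in fan_ins]


def golden_section_min(f, lo, hi, tol=1e-3):
    """Minimise a unimodal function f on [lo, hi] by golden-section search."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    fc, fd = f(c), f(d)
    while (b - a) > tol:
        if fc < fd:            # minimum lies in [a, d]
            b, d, fd = d, c, fc
            c = b - inv_phi * (b - a)
            fc = f(c)
        else:                  # minimum lies in [c, b]
            a, c, fc = c, d, fd
            d = a + inv_phi * (b - a)
            fd = f(d)
    return 0.5 * (a + b)


def toy_path_delay(t_r):
    """Hypothetical path delay in ns as a function of the inverter rise time in ns."""
    return 3.0 * t_r + 0.75 / t_r


t_opt = golden_section_min(toy_path_delay, 0.1, 4.0)
times = gate_transition_times(0.5, fan_ins=[1, 2])  # inverter and two-input NAND
```

In the real flow, each evaluation of the delay would synthesise all device sizes from $T_r$ and then evaluate the critical-path delay with the timing models.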
Using the solution from the OCWSM as the initial guess in the sizing and optimisation process, it is found that the speed of convergence can be significantly improved compared with that obtained when using the heuristic approach.
Outline of the augmented Lagrangian function and the self-scaling and restarting quasi-Newton method
The augmented Lagrangian function $L_c(x, \lambda)$ of eqn. 6 is defined as
where $c$ is the penalty parameter, $n$ is the number of optimisation variables (design parameters), $x$ is an $n \times 1$ vector (the vector of design parameters), $\lambda$ is an $m \times 1$ vector (the multiplier vector), and $\lambda'$ is the transpose of $\lambda$.
and
$\|h(x)\|$ is the Euclidean norm of $h(x)$.
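The display equation itself is not legible in this section. For orientation, the standard augmented Lagrangian for an equality-constrained problem (minimise $f(x)$ subject to $h(x) = 0$), as given in nonlinear optimisation texts, is shown below. It uses only the symbols defined above. Whether the paper writes eqn. 7 in exactly this form is an assumption, as is the reading of $h_i(x)$ as the achieved delay at node $i$ minus the specified delay $T_i$.

```latex
L_c(x, \lambda) = f(x) + \lambda' h(x) + \frac{c}{2}\,\|h(x)\|^2
```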
A sequence of minimisations of the form
minimise $L_{c_j}(x, \lambda_j)$ subject to $x \in X$
is performed, where $\{c_j\}$ and $\{\lambda_j\}$ are the sequences of positive penalty parameters and multiplier vectors, respectively. The modified self-scaling and restarting quasi-Newton method is implemented as shown in Table 4. In Table 4, the vector $g_j$ is the first derivative of the cost function at $x_j$. The vectors $g_j$, $d_j$, $p_j$ and $q_j$ are $n \times 1$ vectors. $S_j$ is an $n \times n$ matrix (inverse Hessian). The value $\alpha$ is the optimal step size along the descent direction $d_j$ for the minimisation of $L_{c_j}(x, \lambda_j)$.
Table 4: Steps 1 to 8 of the modified self-scaling and restarting quasi-Newton method (Step 8: obtain the solution)
To determine the derivative of the cost function, the device sizes of each gate are first calculated from the deviated (perturbed) $x_j$ by using the timing equations. The dynamic power dissipation of the circuit is then determined from the calculated device sizes. The derivative of the cost function is approximated by the first finite divided difference of the deviated cost function.
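A first finite divided difference of this kind can be sketched as below. The cost function here is a hypothetical stand-in. In the actual flow each call would synthesise device sizes from the transition times and then evaluate the power and delay models. The relative step size is also an assumption.

```python
def finite_difference_gradient(cost, x, rel_step=1e-6):
    """Approximate the gradient of cost at x by forward divided differences."""
    f0 = cost(x)
    grad = []
    for k in range(len(x)):
        h = rel_step * max(abs(x[k]), 1.0)   # deviation applied to variable k
        x_dev = list(x)
        x_dev[k] += h
        grad.append((cost(x_dev) - f0) / h)  # first divided difference
    return grad


def toy_cost(x):
    """Hypothetical smooth cost with known gradient [2*x0 + 3*x1, 3*x0]."""
    return x[0] ** 2 + 3.0 * x[0] * x[1]


g = finite_difference_gradient(toy_cost, [1.0, 2.0])  # exact gradient is [8, 3]
```

Each gradient costs n + 1 cost evaluations for n optimisation variables, which is why the smoothness of the cost function in the chosen variables matters for both accuracy and CPU time.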
There are two iteration loops as shown in Table 4. The inner loop uses the self-scaling scheme to approximate the inverse Hessian matrix $S$. One complete cycle of an approximation requires at least $n$ steps to approach the result of the conjugate gradient method. Because of round-off errors, numerical errors and inaccurate line searches, the resultant inverse Hessian matrix may deviate from the actual inverse Hessian matrix and degrade the quadratic convergence rate. A smaller step number $RN$ is therefore assigned to the inner loop. The outer loop is used to construct the restarting scheme of the self-scaling quasi-Newton method. The maximum number of restarting cycles, iteration, is specified by the user.
There are two check points ($A_1$ and $A_2$) in the optimisation process. $A_1$ is used to check whether $L_{c_j}(x_{j+1}, \lambda_j)$ is larger than 0.99 times $L_{c_j}(x_j, \lambda_j)$. If that condition is satisfied, the optimisation process restarts. $A_2$ applies the same 0.99 test to a second pair of augmented Lagrangian values. If that condition is satisfied, the optimisation process ends and the required timing specifications for each gate are obtained. The device sizes of the whole circuit can then be determined by using the timing equations from all the circuit outputs to all the circuit inputs.
From eqn. 7, it is found that the scales of $f(x)$ and $\|h(x)\|$ are not compatible. The penalty parameter $c_0$ is first normalised to balance the scales of $f(x)$ and $\|h(x)\|$. The subsequent values of $c_j$ are monotonically increased using the equation $c_{j+1} = F_{inc} c_j$. The value of $F_{inc}$ is typically larger than 4 and smaller than 10 [21].
The multiplier vector $\lambda_0$ is initially chosen to be 0. The subsequent vectors $\lambda_j$ are modified using the equation $\lambda_{j+1} = \lambda_j + c_j h(x_j)$. Other good update equations for the multiplier vectors are described in nonlinear optimisation texts [21-23]. The objective of this paper is only to verify the efficiency of the proposed techniques, so the other equations are not considered.
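The two update rules can be seen at work on a toy problem. The sketch below minimises $x_0^2 + x_1^2$ subject to $x_0 + x_1 - 1 = 0$ (solution $x = (0.5, 0.5)$, multiplier $-1$). It assumes the standard augmented Lagrangian $f + \lambda' h + (c/2)\|h\|^2$, uses plain gradient descent as a stand-in for the self-scaling quasi-Newton inner loop, and skips the normalisation of $c_0$. The toy functions and all parameter values are hypothetical.

```python
def h(x):
    """Toy equality constraint (stands in for achieved delay minus specified delay)."""
    return [x[0] + x[1] - 1.0]


def grad_aug_lagrangian(x, lam, c):
    """Gradient of x0^2 + x1^2 + lam*h + (c/2)*h^2 for the toy problem."""
    common = lam[0] + c * h(x)[0]
    return [2.0 * x[0] + common, 2.0 * x[1] + common]


def minimise_inner(x, lam, c, iters=500):
    """Gradient-descent stand-in for the quasi-Newton inner loop."""
    step = 1.0 / (2.0 + 2.0 * c)  # 1 / largest Hessian eigenvalue of the toy problem
    for _ in range(iters):
        g = grad_aug_lagrangian(x, lam, c)
        x = [x[0] - step * g[0], x[1] - step * g[1]]
    return x


def method_of_multipliers(x0, c0=1.0, f_inc=5.0, outer=6):
    x, lam, c = list(x0), [0.0], c0  # lambda_0 = 0
    for _ in range(outer):
        x = minimise_inner(x, lam, c)
        lam = [lam[0] + c * h(x)[0]]  # lambda_{j+1} = lambda_j + c_j h(x_j)
        c = f_inc * c                 # c_{j+1} = F_inc * c_j
    return x, lam


x_opt, lam_opt = method_of_multipliers([2.0, -1.0])
```

With $F_{inc} = 5$, inside the range quoted above, the constraint violation of this toy problem shrinks by a factor of $1 + c_j$ per outer iteration, so a handful of outer iterations is enough.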
Experimental results
Using Turbo-C on a PC-AT, the above sizing and optimisation techniques have been implemented in an experimental program called TISA [19, 20] and applied to size many circuits. The memory required for program and dynamic data is 200 Kbyte and 64 Kbyte per 100 gates. The required memory increases quadratically as a function of the gate number because the optimisation method contains a two-dimensional matrix $S(x)$. The maximum number of gates allowed in the optimisation is 128 under the PC-AT 640 Kbyte real-mode limitation. The conjugate gradient method [23] uses a one-dimensional vector in the optimisation process. If TISA were implemented using the conjugate gradient method in PC-AT protected-mode operation (with 16 Mbyte), the maximum number of gates would be expected to be more than 10 000.
To demonstrate the efficiency of the proposed techniques in the sizing and optimisation, the conventional optimisation algorithms [10, 13] were also implemented and applied to size the same circuits. In one of the conventional algorithms, the device sizes are used as the variables of the optimisation process and the minimum device sizes are used as the initial guess of the optimisation process [10]. It is called the minimum-size algorithm for simplicity. In the other algorithm, the device sizes are used as variables of the optimisation process and the heuristic approach is used to obtain the initial guess of the optimisation process [13]. It is called the heuristic algorithm for simplicity. The timing models developed [15-18] are also used in both algorithms for a fair comparison. The increment constant bumpsize [13] used in the heuristic approach is 1.1.
Different input excitation nodes lead to different output timing and power consumption. The input excitation node of each gate is considered to be the node furthest away from the output node to simplify the computation complexity and the computer time in sizing. This can lead to a safe design so that the actual chip delay is always equal to or smaller than that designed. Different input excitation nodes are considered in timing verification.
To verify the efficiency of the proposed techniques, the values of $F_{inc}$, which is the factor associated with the penalty parameters ($c_{j+1} = F_{inc} c_j$), must be comprehensively considered. Both the developed and the conventional sizing and optimisation algorithms were applied to size a four-bit even parity checker as shown in Fig. 5. The input voltage was an exponential waveform with a rise/fall time of 0.44 ns. Using device sizes as optimisation variables and minimum device sizes as the initial guess, the minimum achievable fixed-delay specification is 7.23 ns. For the optimisations with 5.5, 6, or 7 ns fixed-delay specifications, this algorithm cannot respond, as can be seen from Fig. 6a. Using device sizes as optimisation variables and the heuristic approach to obtain the initial guess, the minimum fixed-delay specification achievable is 6 ns. The fixed-delay specification of the optimisation using transition times as optimisation variables and the OCWSM to obtain the initial guess can be as small as 5.5 ns. The combination of these proposed techniques is called the proposed algorithm. It is also found that the optimisation with a 10 ns fixed-delay specification performed by using the minimum-size algorithm has the local minimum problem: the error between the resultant and the specified delay times is greater than 1%. This phenomenon is not seen in the other two algorithms (the proposed algorithm and the heuristic algorithm).
From Fig. 6b, it is found that the resultant power dissipation of the circuit optimised with 8 ns and 9 ns fixed-delay specifications by using the proposed algorithm is greater than that obtained by using the heuristic algorithm. As the fixed-delay specification decreases to 6 ns and 7 ns, the power dissipation of the circuit optimised by using the proposed algorithm becomes smaller than that obtained by using the heuristic algorithm. This means that in high-speed design, the proposed algorithm can perform a more satisfactory optimisation with less resultant power dissipation.
In Fig. 6c, it is found that the CPU times required by the three algorithms for the optimisations with 7, 8 and 9 ns fixed-delay specifications are very close. Although the CPU time required for the optimisation with the 6 ns fixed-delay specification by using the proposed algorithm is 30% greater than that by using the heuristic algorithm, the power dissipation of the circuit optimised by using the proposed algorithm is smaller. The trade-off between CPU time and circuit performance is thus satisfactory in the proposed algorithm.
To further verify the efficiency of the optimisation process using the proposed algorithm on a more complex circuit, the benchmark circuit RD53 [37] shown in Fig. 7 was optimised. It contains CMOS standard static logic gates and AOI gates. The given input rise/fall time is 0.44 ns. The fixed-delay specifications at nodes F0, F1, and F2 are considered to be the same. Fig. 8 shows the comparisons between the optimisation results of the proposed algorithm and the heuristic algorithm. The optimisation using the minimum-size algorithm cannot satisfy the specified delay time and so it is not shown in Fig. 8. It is found from Fig. 8a that the heuristic algorithm and the proposed algorithm can satisfy the delay specification. As seen from Fig. 8b, the power dissipations obtained by using the proposed algorithm are smaller than those obtained by using the heuristic algorithm. This is because the $h(x)$ in the cost function is nearly proportional to the optimisation variable. The relation between the cost function and the optimisation variables of the proposed algorithm is more linear than that of the heuristic algorithm. The proposed algorithm can thus avoid the incorrect optimisation convergence caused by the numerical errors and inaccurate line searches from the first finite divided difference of the deviated cost function. These characteristics also make the CPU time of the proposed algorithm smaller than that of the heuristic algorithm (Fig. 8c).
To verify the accuracy of the adopted timing models, a one-bit full adder is sized by using the proposed algorithm. From Table 5, it is also found that, because of the consideration of worst-case timing in the sizing and optimisation, the resultant delay times of sum and carry are smaller than the fixed-delay specifications. This guarantees a safe design.
Conclusion and discussion
Two techniques were proposed to enhance the efficiency of the optimisation process. The techniques use the transition times as variables of the optimisation process and the OCWSM to obtain the initial guess of the optimisation process. The proposed techniques can perform the sizing and optimisation of CMOS combinational logic circuits with a smaller delay specification than other sizing and optimisation algorithms. This enhances the speed performance of the sized circuits. The circuit sized by the proposed algorithm has a smaller power dissipation, especially when the delay specification is small. The CPU time is of the same order of magnitude as that in other algorithms. It is therefore suitable to use the proposed techniques in the sizing and optimisation of high-performance circuits.
The power dissipation of CMOS logic gates consists of two components: dynamic power dissipation and short-circuit power dissipation. The power dissipation considered earlier is the dynamic power dissipation. To accurately calculate the power dissipation of a CMOS logic circuit, an accurate model of the short-circuit power dissipation for CMOS logic gates must be constructed. This is the intention of a future study.
There are many design considerations for a CMOS logic gate such as equal rise and fall times, equal channel widths of PMOSFET and NMOSFET, equal high and low noise margin, optimal ratio of each gate, etc. The design consideration of symmetrical rise and fall times is adopted in the above optimisation. All the design considerations can be arranged to use the transition times as the optimisation variables, so it is expected that the above-mentioned advantages can also be obtained by using the proposed techniques.
The proposed techniques can be applied to other sizing and constrained optimisation problems. Further generalisation of the proposed algorithm in solving various problems will be performed in the future. 
