Abstract-This paper presents the first reported joint gate sizing and buffer insertion method for minimizing the delay of power constrained combinational logic networks that can incorporate a mixture of unbuffered and buffered gates (or mixture of CMOS and BiCMOS gates). In the method, buffered gates in a network are decided on by an iterative process that uses a sequence of sizing optimizations where after each sizing optimization an update to the selection of buffered gates is made. In this way, high drive capability buffered (i.e., BiCMOS) gates with sufficiently low fan-out are identified and replaced with a lower power unbuffered (i.e., CMOS) version. As well, the optimality of the final design is assessed based on a lower-bound delay value that is calculated. Experimental results have confirmed the efficiency and utility of the proposed method. In 8-b adder or 8 × 8 b multiplier networks, just two iterations are sufficient to achieve a delay that is at worst within 0.6% of its final optimized value and at worst within 10% of the lower-bound value. In the design of BiCMOS networks, it is seen that a speed advantage (at equivalent power) can be systematically achieved by using a mix of CMOS and BiCMOS gates versus using all CMOS or all BiCMOS gates and that this advantage increases with the tightness of the power constraint and with load capacitance.
performance by adding an output buffer circuit to selected logic gates in the network and uses the fact that buffering a gate reduces its output resistance but increases its internal delay (i.e., unloaded delay) and its power dissipation. Gate sizing and buffer insertion are, thus, similar in that they both attempt to exploit a tradeoff between capacitive drive capability (i.e., output resistance) and power dissipation in a selected set of gates. As well, both techniques preserve the logic function and topology of a network and are applicable to design in a standard cell based methodology where logic gates and buffer circuits are selected from a predefined set of alternatives that comprise a cell library. The techniques complement the capabilities of logic synthesis tools and provide a means for refining technology mappings in both the prelayout and postlayout phases of design.
The close relationship between gate sizing and buffer insertion suggests that a method capable of joint optimization could eliminate the need for separate gate sizing and buffer insertion tools and identify superior designs. Moreover, in BiCMOS networks that can incorporate both BiCMOS and CMOS gates, a joint optimization method will tell designers where to use BiCMOS gates and where to use CMOS gates since a BiCMOS logic gate acts as a form of high-quality buffered CMOS logic gate. By comparing optimized BiCMOS and CMOS networks, designers will be able to calculate the performance advantage a BiCMOS technology achieves over CMOS for a given application.
The objective of this work is to develop a joint gate sizing and buffer insertion method and assess its utility in the design of typical CMOS and BiCMOS networks. We seek a method that is capable of constrained optimization and in particular the minimization of network delay subject to a constraint in network power dissipation. As well, the method is to be suitable for a continuous type cell library where gate and buffer sizes are selectable across a given continuous range. The allowed transformations in the joint optimization are, thus, as shown in Fig. 1. Namely, 1) add a buffer circuit to the output of an existing gate and optimize the size of the buffer circuit and gate and 2) just optimize the size of an existing gate without adding a buffer circuit. Note that we exclude the case of discrete sizing and assume that the given network topology has been optimized for logic depth and buffer tree expansions. Buffer tree optimization, unlike gate buffering, tries to resynthesize a critical path of a logic network so that no gate along this path drives a large number of other gates. This is generally done by identifying the critical and the noncritical signals emanating from each high fan-out gate. The critical signals are left as is (i.e., unbuffered). The noncritical signals are driven by a buffer that is added to the output of the high fan-out gate.
0278-0070/98$10.00 © 1998 IEEE
A. Prior Work
Unfortunately, previous methods concerning joint optimization or mixed CMOS/BiCMOS optimization appear to be inappropriate for our objective. In the method of Chen [1], [2], optimization of a mixed CMOS/BiCMOS network begins with an all CMOS implementation and proceeds in a greedy manner by iterating a three step procedure of 1) identifying critical paths and gates, 2) upward sizing transistors along the most critical path until no further improvement is obtained, and 3) switching the most delay sensitive CMOS gate along this path to BiCMOS if beneficial. In this way, Chen's method achieves CMOS versus BiCMOS gate selection capability without greatly increasing CPU time above that required for sizing optimization alone. However, the optimality of the method is not assessed and remains unclear. As well, the problem of delay minimization with respect to a power (or area) constraint is not considered. Conversely, previous methods by Duchene [3] and by Wissel [4] for mixed CMOS/BiCMOS networks have limited their scope to performance prediction and do not provide a means for finding an optimized design. Both methods are stochastic and estimate the fractional usage of BiCMOS and CMOS gates based on a distribution of node capacitances and, in [4], a distribution of node delay slack. The distributions are static and, as such, the methods do not incorporate the effects due to gate sizing.
In contrast to joint optimization, methods for continuous case gate sizing have been extensively investigated since the pioneering work of Ruehli [5], [6]. This work studied gate sizing to minimize power subject to a delay constraint and demonstrated the basic feasibility of exploiting nonlinear programming techniques. However, a crude gate delay model was used that did not consider input capacitance variation with gate size, and the problem formulation that was used resulted in an objective function that had discontinuous first derivatives and singular points. Hedelund [7] improved the formulation by introducing lower and upper size constraints and modeled the dependence of capacitance on gate size. His and other methods [8], however, only handled a single delay path at a time, making full network optimization tedious and suboptimal. Subsequently, Fishburn [9], Matson [10], and Marple [11] addressed multipath optimization. Fishburn, studying transistor sizing, showed that a multipath delay minimization problem could be formulated as a type of nonlinear program known as a posynomial program [12] that has the desirable property of being unimodal. Marple generalized this formulation to include the constrained case, identified an alternative posynomial formulation, and showed that gate sizing can be formulated as a posynomial program provided the gate model is of posynomial form. Since a posynomial program is unimodal, the globally optimal solution for gate sizing can always be found using a gradient based numerical algorithm for constrained optimization such as the augmented Lagrangian or projected Lagrangian algorithms [13]. More recently, Sapatnekar [14] has used a novel convex programming algorithm to achieve global optimality with improved efficiency while Berkelaar [33] has investigated linear programming to improve efficiency.
B. Overview of this Work
To exploit the prior work in gate sizing, we investigate a joint gate sizing and buffer insertion method where buffered (i.e., BiCMOS) gates in a network are decided on by an iterative process that uses a sequence of sizing optimizations where after each sizing optimization an update to the selection of buffered gates can be made. In this way, high drive capability buffered (i.e., BiCMOS) gates with sufficiently low fan-out are identified and replaced with a lower power unbuffered (i.e., CMOS) version. As well, the optimality of the final design is assessed based on a lower-bound delay value that is calculated. Each required sizing optimization is formulated as a posynomial program and solved using a nonlinear programming routine. To assess the utility of the method, example adder, multiplier, parity generator, and decoder networks are optimized using a cell library for a 0.8-μm BiCMOS process.
This paper is organized as follows. Section II provides basic definitions relevant to gate sizing and buffer insertion optimization while Section III describes network delay and power modeling. Section IV provides a formal description of the joint gate sizing and buffer insertion problem. Section V describes the optimization algorithm and its implementation. Example results are presented in Section VI. A discussion of the algorithm and results follows in Section VII, while conclusions are presented in Section VIII.
II. BASIC DEFINITIONS

A. Network Topology
The topology of a combinational logic network to be optimized is represented as a directed acyclic graph following Definition 1. The graph specifies the logic operation and I/O connections for each gate in the network. The graph also specifies the wire capacitances. These capacitances can be estimated from a physical layout (or layout sketch) of the network (that gives the length of wire that a gate drives) and from the appropriate capacitance per length factor (that characterizes the wire interconnect).
Definition 1: The topology of a combinational logic network is a directed acyclic graph The set of vertices is given by where is the set of gates, is the set of network inputs, and is the set of network outputs. Each vertex implements a specified single-output Boolean logic primitive. There is an edge if and only if the output of vertex is connected with an input of vertex For each vertex the term is the net wire capacitance of the edges connected to the output of vertex
In Definition 1, wire capacitance is a constant and wire resistance is not considered. Modeling wire capacitance as a constant is reasonable provided optimization does not force a change in the relative placement of gates in the layout. Setting an appropriate value for the maximum allowed gate size (see Definition 3) can ensure this is met. Neglecting resistance effects is reasonable provided wire lengths are less than about 1 mm [15].
B. Gate Size and Logic Family
Given a network graph, the task of optimization seeks to find the best logic gate realization for each vertex Different realizations of a vertex correspond to a different choice of logic family or gate size, or both, as defined below.
Definition 2: A logic family is a unique set of transistor circuit designs for the realization of a collection of logic primitives (such as -input NOR, NAND, and XOR). A logic primitive is mapped to one and only one of these designs but all designs share a common set of operational characteristics. These characteristics are: 1) signal voltage level requirements, 2) supply voltage level requirements, 3) signal polarity requirements (i.e., single-ended or differential), and 4) clock requirements (i.e., static or dynamic). The transistor sizes (i.e., field-effect transistor (FET) gate widths and bipolar emitter lengths) that are specified in a design are the minimum allowed values consistent with proper operation and technology design rules.
Typically, the characteristics of a logic family will conform to a popular logic standard (e.g., ECL, TTL, and 5-V CMOS). The choice of logic family and standard influences network performance including noise and process tolerance behavior and must be considered a critical design parameter.
In designing a network, more than one logic standard may be used but then logic signal translation circuitry and additional power supply levels may be needed [16] . Moreover, since the noise margins of the standards will in general differ, network optimization cannot be accomplished on the basis of delay and power metrics alone.
As stated in Definition 2, the specification of a logic family for gate identifies the minimum transistor sizes that can be used. A scaled up version of this gate is identified by specifying its size relative to the minimum allowed. Examples of buffered gates and their corresponding unbuffered form are shown in Fig. 2. The buffered gate shown in Fig. 2(b) can be sized as a two-stage gate since the front-end gate operation is independent of the buffer circuit. The buffered gate of Fig. 2(b) could also be sized as a one-stage gate. This is done by specifying a fixed ratio for the buffer size to the front-end gate size. The buffered gate shown in Fig. 2(a) must be sized as a one-stage gate. This is because the buffered gate cannot be directly decomposed into an independently sizable front-end stage plus an independently sizable buffer stage.
D. Cell Library and Aggregate Cell Library
In optimizing a network, the choice of realizations for each gate is assumed to be restricted to a given set of logic families and sizes. This set of all available logic gates is called the aggregate cell library. The set of available logic gates belonging to a common logic family constitutes a cell library.
Definition 5: A cell library is the set of all logic gates (of different size and logic primitive) of a given logic family that have been characterized and are available for use in the implementation of a logic network. 
E. Network and Cell Technology
The process technology of a network (or cell) is defined based on the types of transistors used to implement the network (or cell). A greater number of transistor types generally implies a more complex and costly process.
Definition 6: The technology of a network is defined by the set of transistor types used to implement the network. Different transistor types arise from differences in: 1) substrate type (silicon, gallium arsenide, etc.), 2) transistor action type (unipolar or bipolar), 3) majority carrier type (electrons or holes), and 4) gate/emitter junction type (MOS, PN, Schottky, etc.). In a similar manner, the technology of a cell (cell library) is defined by the set of transistor types used to implement the cell (cell library).
Following Definition 6, a CMOS technology network (cell) is implemented using Si n-channel enhancement MOSFET (NMOS) and Si p-channel enhancement MOSFET (PMOS) transistor types. A BiCMOS technology network (cell) is implemented using NMOS, PMOS, and Si NPN bipolar transistor types. A complementary BiCMOS technology network (cell) is implemented using NMOS, PMOS, NPN, and PNP transistor types.
Definitions 2-6, thus, provide the flexibility for describing the range of designs relevant to a BiCMOS technology. For example, can be defined as a one-stage CMOS cell library containing unbuffered gates, can be defined as a two-stage CMOS cell library containing buffered gates, and can be defined as a BiCMOS cell library containing buffered gates. The aggregate library for mixed CMOS/BiCMOS networks is , while the aggregate library for CMOS/buffered CMOS networks is Other relevant aggregate cell libraries are listed in Table I . In Table I , denotes a library of unbuffered gates and denotes a library of buffered gates.
III. DELAY AND POWER MODELING
A set of selections for the cell library and size for each gate in a network is written as and defines a specific implementation of the network. For such an implementation, the resulting network delay and power are calculated as follows.
A. Network Delay
Definition 7: The delay of a network is given by (1) The term denotes the set of gates in the th sensitizable delay path that starts at a network input and terminates at a network output. The term denotes the contribution of the delay on path that is attributed to gate and is modeled as (2) where is called the fan-out of gate The term is the capacitance of the relevant input of gate for path and is given by (3) where is the relevant input capacitance of a unit size first stage of gate The term in (2) is the output capacitance of gate and is given by (4) where is the wire capacitance, is the external load capacitance (if gate is connected to a network output), and the summation is the total input capacitance of the set of gates connected to the output of gate (since is the set of vertices connected to the output of The term in (2) is the loaded delay factor for the relevant input of gate and is given by (5) where is the loaded delay factor of the final stage of gate Finally, the term in (2) is the internal delay for the relevant input of gate and is given by (6) where is the internal delay for the relevant input of stage of gate , as defined, represents the delay of the longest path(s) of a network. It is assumed that each of the paths considered in (1) can be exercised by a primary input sequence and is, hence, not false [17]. Such a set of paths can be found by the method of Chen [18].
In computing by Definition 7, the values for the stage parameters that are used to model gate on path can be specific to the input and state of gate Thus, includes the fact that the delay through a gate can depend both on the particular input that is switched and on the state (i.e., low or high) of nonswitching inputs. Similarly, the delay dependence on the direction of input switching (i.e., low-to-high or high-to-low) can be included (however, this complication is eliminated by ensuring all gates have a symmetric output or have a differential logic structure).
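The structure that Definition 7 describes in words can be sketched as follows. This is a minimal illustration for one-stage gates only, assuming a single delay value per gate (the paper allows the value to depend on the path and the switching input). The function and argument names are hypothetical and are not the paper's notation.

```python
def gate_delay(internal_delay, loaded_delay_factor, c_out, c_in):
    """Gate delay as internal delay plus loaded delay factor times fan-out.

    Fan-out is taken as the ratio of output capacitance to input capacitance,
    following the prose description of (2).
    """
    fan_out = c_out / c_in
    return internal_delay + loaded_delay_factor * fan_out


def network_delay(paths, delay_of):
    """Network delay as the largest sum of gate delays over the given paths.

    paths    : list of paths, each a list of gate names (assumed sensitizable)
    delay_of : dict mapping gate name to its delay (simplification: one value
               per gate rather than one per gate and path)
    """
    return max(sum(delay_of[g] for g in path) for path in paths)


if __name__ == "__main__":
    d = {"a": 1.0, "b": 2.0, "c": 0.5}
    print(network_delay([["a", "b"], ["a", "c"]], d))
```

For multi-stage gates the internal delay and loaded delay factor would come from (5) and (6), which this sketch does not attempt to reproduce.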
Additionally, the dependence of gate delay on input slew rate can be accommodated by following Hedenstierna [19] and Hoppe [20] and adjusting internal delay according to (7) where is the rise (or fall) time of the input signal and and are called the independent and dependent, respectively, internal delay parameters. Thus, gates connected to a network input can directly use (7) if the input rise time is specified. For other gates, rise time can be determined from an analysis of the gate that drives the input to gate
In particular (8) where and are constants for gate [21] .
B. Network Power
Definition 8: The power dissipation of a network is given by (9) where (10) and (11) In (9), the term is the output voltage swing of gate. The term is the average number of output transitions per second for gate and is called the transition density [22] of gate and is assumed to be independent of sizing. The term is the internal dissipation of gate and is obtained by summing the internal dissipation of each stage of gate A unit size stage of gate has an internal static power dissipation of and internal dynamic energy dissipation of For , includes the energy required to charge/discharge the input capacitance of the stage.
Note that as defined incorporates both static and dynamic power dissipation. Static (i.e., dc-leakage) dissipation is modeled by the terms. Dynamic dissipation is modeled through: 1) the terms that account for the power to charge/discharge gate input capacitances and interconnect capacitances and 2) the terms that account for short circuit dissipation [23] as well as the switching energy to charge/discharge the internal nodes and (except for the first stage) input capacitance of the stages. For the first stage, does not include the energy required to charge/discharge its input capacitance since this capacitance is driven by a preceding gate.
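The decomposition into a capacitive term and a per-stage internal term can be illustrated with a short sketch. Several points here are assumptions rather than the paper's equations (9)-(11): the energy drawn per output transition is taken as one half of capacitance times swing times supply voltage, internal dissipation is assumed to scale linearly with stage size, and all names are hypothetical.

```python
def gate_power(density, c_out, v_swing, v_supply,
               stage_sizes, unit_static_power, unit_dynamic_energy):
    """Average power of one gate under the stated assumptions.

    density             : output transitions per second (transition density)
    c_out               : output capacitance in farads
    stage_sizes         : size of each stage relative to unit size
    unit_static_power   : static power of a unit size stage, per stage (W)
    unit_dynamic_energy : internal energy per transition of a unit size
                          stage, per stage (J)
    """
    # Assumed capacitive term: 0.5 * C * V_swing * V_supply per transition.
    capacitive = 0.5 * density * c_out * v_swing * v_supply
    # Assumed internal term: static power plus density times dynamic energy,
    # each scaled by the stage size.
    internal = sum(
        size * (p_static + density * e_dyn)
        for size, p_static, e_dyn in zip(
            stage_sizes, unit_static_power, unit_dynamic_energy)
    )
    return capacitive + internal


def network_power(gate_powers):
    """Network power as the sum of the individual gate powers."""
    return sum(gate_powers)
```

Because the transition density is assumed independent of sizing, every term in this sketch is a product of constants and positive powers of sizes, which is the property the later posynomial formulation relies on.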
C. Model Parameters
The stage parameters that are used to model gate delay and power are constants defined by the given cell library. They are independent of stage size and readily calculated by circuit simulation. The parameters and are found by measuring stage delay for different fan-out loadings and taking the intercept and slope, respectively, of a straight line fit. The parameter is found by measuring the delay for a known capacitive load and using the straight line fit to determine the equivalent fan-out of this capacitive load. The parameter is found by causing the output of the stage to change state and integrating over the switching period the product of the current supplied to the stage times the supply voltage. An illustration of the delay and power relationships for 1-stage unbuffered and 2-stage buffered gates is shown in Fig. 3. Note that delay and power relationships for a 2-stage buffered gate are defined for all choices of buffer size once the front-end parameters and buffer parameters are known.
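The straight line fit described above can be sketched with an ordinary least-squares fit of delay against fan-out. The intercept plays the role of the internal delay and the slope the role of the loaded delay factor. The function names and the sample numbers are illustrative assumptions, not values from the paper.

```python
def fit_delay_line(fan_outs, delays):
    """Least-squares straight line through (fan-out, delay) points.

    Returns (intercept, slope): the intercept models the internal (unloaded)
    delay and the slope models the loaded delay factor.
    """
    n = len(fan_outs)
    mean_x = sum(fan_outs) / n
    mean_y = sum(delays) / n
    sxx = sum((x - mean_x) ** 2 for x in fan_outs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(fan_outs, delays))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def equivalent_fan_out(measured_delay, intercept, slope):
    """Fan-out that would produce the delay measured for a known load."""
    return (measured_delay - intercept) / slope


if __name__ == "__main__":
    # Hypothetical simulated delays (ns) at four fan-out loadings.
    fan_outs = [1.0, 2.0, 4.0, 8.0]
    delays = [0.2 + 0.05 * f for f in fan_outs]
    print(fit_delay_line(fan_outs, delays))
```

Dividing the known load capacitance by its equivalent fan-out would then give a capacitance per unit fan-out; reading that as the unit input capacitance is an interpretation of the text, not something the source states explicitly.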
IV. THE JOINT GATE SIZING AND BUFFER INSERTION PROBLEM
Using Definitions 1-8, a continuous case joint gate sizing and buffer insertion problem for the case of delay minimization subject to a power constraint can be formulated mathematically as Problem 1 below. The formulation for the case of power minimization subject to a delay constraint is similar except: 1) power constraint (12) is removed and 2) in (13) is replaced by the specified delay constant In Problem 1, is a specified external load capacitance connected to each of the network outputs and represents the input capacitance of a succeeding latch or output cell. As indicated in (4), contributes to the output capacitance of each output gate. The size constraints (14)-(15) in Problem 1 are stated in a general form. Nodes of the same primitive will typically be specified to have the same allowed size range since they share a common aggregate cell library. Nodes connected to one of the network inputs will typically be specified to have a fixed size for their first stage since this fixes the input capacitance seen at a network input and ensures that the optimization will not alter the delay of the circuitry that drives the inputs of network If the input capacitance is the same for all input gates, a network fan-out can be defined as the ratio of load to input capacitance.
As indicated in Problem 1, is either a library of one-stage gates or a library of two-stage gates. If is a library of two-stage gates, the first stage of a gate in corresponds to a gate in library while the second stage is a buffer (that can be sized independently of the first stage). In this case, Problem 1 is classified as a joint gate sizing and buffer insertion problem The different problem types that Problem 1 incorporates are contrasted by a plot of the selection space for gate as illustrated in Fig. 4. Such a plot shows how the gate parameters as given by (6), (5), and (10) vary with the choice of library and size ratio (Note that these parameters are independent of size .) A library choice of (unbuffered gate) corresponds to the diamond point at A library choice of (buffered gate) corresponds to a point at along the solid curve shown. For a joint problem with two-stage buffered gates, the choice of for a buffered gate is variable as , where and For a mixed-gate style problem, the choice of for a buffered gate is fixed.
Note from Fig. 4 that a gate selection space is disjoint (for other than a sizing problem). Consequently, Problem 1 is classified as a type of discrete choice gate selection problem. Such problems are known to be NP-complete (even for special case network topologies such as chains and trees) [24].
V. JOINT GATE SIZING AND BUFFER INSERTION ALGORITHM
The proposed algorithm for solving Problem 1 is labeled Algorithm 1 and has the overall structure outlined in Fig. 5 . First, a lower-bound delay calculation is performed. Next, an initial selection of buffered nodes is made based on this lower-bound result. A sizing optimization is then performed. Finally, if this sizing optimization reveals that some nodes are unnecessarily buffered, they are replaced with an unbuffered form (i.e., ) and another sizing optimization is performed. The sizing optimization step solves a continuous case gate sizing problem.
In Algorithm 1, buffered gates are decided by an iterative process that uses a sequence of sizing optimizations where after each optimization an update to the selection of buffered gates can be made. Convergence of Algorithm 1 is ensured since each iteration can only remove buffered gates. Each iteration ensures an incremental performance improvement since a buffered gate is replaced with an unbuffered gate only if gate delay can be maintained (or improved). With minor modification, Algorithm 1 is also applicable to the case of power minimization subject to a delay constraint. In particular, the lower-bound optimization (step 2) and each sizing optimization (step 4) now refer to a power minimization optimization. The following subsections detail the steps of Algorithm 1.
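The control flow just described can be sketched as a short loop. This is only an outline of the iteration structure, assuming the lower-bound step, the sizing optimizer, and the unbuffering test are supplied as callables. All names are hypothetical, and the sketch does not reproduce the TOP program.

```python
def joint_optimize(initial_buffered, size_optimize, should_unbuffer):
    """Iterate sizing optimizations, removing unnecessary buffers each pass.

    initial_buffered : iterable of gates selected as buffered after the
                       lower-bound step (assumed to be computed elsewhere)
    size_optimize    : callable(buffered_set) -> sizing result
    should_unbuffer  : callable(gate, sizing_result) -> bool
    Returns (final buffered set, final sizing result, number of sizing runs).
    """
    buffered = set(initial_buffered)
    sizing = size_optimize(buffered)
    runs = 1
    while True:
        # Test every buffered gate against the current sized implementation.
        removable = {g for g in buffered if should_unbuffer(g, sizing)}
        if not removable:
            break  # no update possible: the selection has converged
        buffered -= removable  # buffers are only ever removed, never added
        sizing = size_optimize(buffered)
        runs += 1
    return buffered, sizing, runs


if __name__ == "__main__":
    # Toy stand-ins: the "optimizer" just reports which gates are buffered.
    result = joint_optimize(
        {"g1", "g2"},
        size_optimize=lambda b: sorted(b),
        should_unbuffer=lambda g, sizing: g == "g2",
    )
    print(result)
```

Since the buffered set can only shrink and is finite, the loop terminates after at most one sizing run per initially buffered gate plus one.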
A. The Sizing Optimization Step
Algorithm 1 exploits the realization that once the set of cell library selections for each gate is set, the remaining task of sizing optimization can be efficiently and optimally solved using posynomial programming techniques [12] . A posynomial program consists of an objective function to be minimized and constraints that are all of posynomial form. An objective function is of posynomial form if it can be expressed as (20) where is a positive real design variable (i.e., , the exponents are arbitrary real numbers and the coefficients are positive. A constraint is of posynomial form if it can be expressed as , where is of form (20) . Thus, if is set, Problem 1 reduces to a posynomial programming problem since the objective function and all the constraint equations indicated by (12)-(15) are of posynomial form in the design variables and and these variables are continuous.
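For reference, the standard posynomial form that this paragraph describes can be written as below. The symbol names ($g$, $c_k$, $a_{kj}$, $x_j$, $K$, $n$) are chosen here for illustration and may differ from the notation used in (20).

```latex
g(\mathbf{x}) \;=\; \sum_{k=1}^{K} c_k \prod_{j=1}^{n} x_j^{a_{kj}},
\qquad c_k > 0,\quad x_j > 0,\quad a_{kj} \in \mathbb{R},
\qquad \text{with constraints of the form } g(\mathbf{x}) \le 1 .
```

Under the change of variables $x_j = e^{y_j}$, the function $\log g$ becomes a log-sum-exp of affine functions of $\mathbf{y}$, which is convex. This is the usual argument behind the equivalence to a convex program.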
The resulting sizing optimization problem can be optimally solved since a posynomial program can be shown to be equivalent to a convex programming problem, that is, a problem of minimizing a convex function over a feasible solution space that forms a convex set [12]. Consequently, for a posynomial program (as for a convex program), any local minimum that is found is guaranteed to also be a global minimum. Such a minimum can be found numerically by using a constrained nonlinear programming technique such as the augmented Lagrangian method [13]. In Fig. 5, refers to this technique, which finds the values for the design variables that minimize the posynomial objective function subject to the posynomial constraints The solution of a sizing optimization is denoted by The resulting network implementation is written as For networks that have a large number of delay paths to be optimized, the reformulation suggested by Marple [11] may be helpful in reducing the complexity of the posynomial program that needs to be solved. In this case, the path constraints of (13) are replaced by (21) and connected to a network output
The term represents the cumulative delay at the output of gate (measured from a network input), and the term represents the set of gates connected to the input of gate
The resulting reformulated problem has more unknowns to determine (i.e., the 's) than the original problem but potentially significantly fewer constraints.
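The idea behind this reformulation can be illustrated by computing cumulative (arrival) delays node by node instead of enumerating paths. In the optimization itself these cumulative delays are unknowns tied together by inequality constraints. The sketch below only evaluates their tightest values for fixed gate delays, and its names are hypothetical.

```python
def arrival_times(fanin, delay):
    """Cumulative delay at each gate output, measured from the network inputs.

    fanin : dict mapping a gate to the list of gates driving its inputs
            (gates fed only by network inputs may be omitted or map to [])
    delay : dict mapping every gate to its own delay
    """
    memo = {}

    def arrive(gate):
        if gate not in memo:
            # Latest arrival among the driving gates, or 0 at a network input.
            latest_input = max(
                (arrive(p) for p in fanin.get(gate, [])), default=0.0)
            memo[gate] = latest_input + delay[gate]
        return memo[gate]

    for gate in delay:
        arrive(gate)
    return memo


if __name__ == "__main__":
    fanin = {"c": ["a", "b"], "d": ["c"]}
    delay = {"a": 1.0, "b": 2.0, "c": 1.0, "d": 0.5}
    print(arrival_times(fanin, delay)["d"])
```

One constraint per gate input connection replaces one constraint per path, which is why the constraint count can drop sharply in networks where the number of paths grows much faster than the number of edges.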
B. The Buffer Insertion Update Step
Unless the set of library selections was optimal, the sizing optimized implementation will not be a (globally) optimal solution to Problem 1. However, based on this implementation an improved set of selections can be systematically deduced. In particular, such an improved set is obtained by changing a buffered node in implementation to an unbuffered node according to Library Selection Rule 1. This rule is based on the observation that changing from a buffered gate to an unbuffered gate (with same input capacitance) not only results in a lower gate power (i.e., since the dissipation of the buffer circuit is eliminated) but can also result in a lower gate delay provided the fan-out loading is sufficiently low [e.g., see Fig. 3(a)].
Library Selection Rule 1: A node that is buffered in implementation and has delay is changed to an unbuffered form (i.e., changed from to ) if and only if
The term is the delay that node would have if it is changed to an unbuffered form having the same input and output capacitance as in implementation A node that is unbuffered in implementation remains unbuffered.
Note that Library Selection Rule 1 involves a simple algebraic calculation and comparison of gate delays for each buffered node and no computation for each unbuffered node. Since all the buffered nodes are tested, several of them may be replaced in one update step. Since application of Library Selection Rule 1 preserves gate input capacitance, the set of replaced nodes does not depend on the order in which they are tested. As well, note that the subsequent sizing-optimized implementations are ensured to have a delay that is incrementally lower than implementation This is evident since each replaced (and nonreplaced) gate can be initialized to a size that gives the same input capacitance as the corresponding gate in implementation , resulting in an implementation that has lower power and equivalent (or lower) delay than implementation
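The update test can be sketched as a per-gate comparison. The threshold used here, replacing a buffered gate when the unbuffered delay at the same input and output capacitance is no larger than the buffered delay, is inferred from the statement that a replacement is made only if gate delay can be maintained or improved. The one-stage linear delay model and all names are assumptions for illustration.

```python
def unbuffered_delay(internal_delay, loaded_delay_factor, c_out, c_in):
    """Delay of the unbuffered form at the same input/output capacitance."""
    return internal_delay + loaded_delay_factor * (c_out / c_in)


def select_unbuffer(buffered_gates):
    """Return the set of buffered gates that the rule would replace.

    buffered_gates : dict mapping gate name to a dict with keys
        'delay'  : delay of the buffered gate in the current implementation
        'c_in'   : its input capacitance in the current implementation
        'c_out'  : its output capacitance in the current implementation
        'tau_u'  : internal delay of the unbuffered form
        'r_u'    : loaded delay factor of the unbuffered form
    """
    replace = set()
    for name, g in buffered_gates.items():
        candidate = unbuffered_delay(g["tau_u"], g["r_u"],
                                     g["c_out"], g["c_in"])
        if candidate <= g["delay"]:  # delay maintained or improved
            replace.add(name)
    return replace


if __name__ == "__main__":
    gates = {
        "low_fanout":  {"delay": 0.50, "c_in": 50.0, "c_out": 100.0,
                        "tau_u": 0.20, "r_u": 0.10},
        "high_fanout": {"delay": 0.60, "c_in": 50.0, "c_out": 500.0,
                        "tau_u": 0.20, "r_u": 0.10},
    }
    print(select_unbuffer(gates))
```

Each test reads only values from the current implementation, so the result is the same whatever order the gates are visited in, matching the order-independence noted above.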
C. The Lower-Bound Optimization Step
A lower bound for network delay consistent with the given power constraint is found by deriving a lower-bound model for gate delay and power and optimizing the network using this model. To guarantee the delay found is a lower bound, the lower-bound model is defined such that: 1) for each choice in the selection space of gate there exists a choice in the lower-bound selection space that is equivalent or superior in all gate characteristics, 2) the lower-bound selection space is continuous (rather than disjoint as previously), and 3) the resulting lower-bound optimization is a posynomial program. To enable the delay found to be a tight lower bound, the lower-bound model is defined such that the lower-bound selection space closely approximates the gate selection space. The lower-bound parameters for gate are denoted to distinguish them from the gate parameters As shown in Fig. 6, each lower-bound parameter is derived from its respective gate parameter and is a function of the lower-bound size ratio where The value for is chosen such that at , the loaded delay factor of the corresponding buffered gate equals the loaded delay factor of the unbuffered gate. Thus, using (5) . The value for is chosen such that With and fixed, the lower-bound relationships as a function of are defined such that at they match the characteristics of the unbuffered gate while at they match the characteristics of the buffered gate. In particular, the relationships for and are given by the straight line drawn between the two endpoints while the relationship for simply follows the buffered gate characteristics. Since at , and while at and and since the equation for a straight line defined by two endpoints and is , it follows that the lower-bound parameters for gate are defined as follows: (23) (24) (25) where and are constants such that (26) Note that if is a one-stage buffered library, fitted parameters for and (see Section IV) are used in (23)-(26).
In the subsequent lower-bound optimization, (23) and (25) replace (6) and (5), respectively, in gate delay equation (2), while (24) replaces (10) in power equation (9) . The input capacitance of a gate is calculated using (3) as previously. The solution of the lower-bound optimization is denoted where is the size ratio of node The corresponding lower-bound network delay is denoted
D. The Buffer Insertion Initialization Step
An additional feature of the lower-bound solution is that it directly provides a good initial selection of buffered nodes following Library Selection Rule 2.
Library Selection Rule 2: Following a lower-bound optimization: if , then set ; else set where is given by (26). For the case of a single cell buffered gate library with a fixed buffer size ratio a gate can be initially unbuffered if where
Library Selection Rule 2 is based on the observation that the lower-bound selection space of each gate is a selection space that contains the unbuffered gate choice (i.e., for ) plus a set of lower-bound buffered gate choices (i.e., for ) such that for each choice of buffer gate in the actual gate selection space (e.g., Fig. 4) there is a corresponding choice of lower-bound buffer gate that is equivalent or superior in all gate characteristics. Consequently, if the lower-bound optimization determines a gate should be unbuffered, it is unlikely that by changing to an inferior set of buffers, such as offered by the actual selection space, there will be a benefit to now making this gate buffered. Therefore, in Algorithm 1 the lower-bound optimization step is used to set an upper limit on the number of buffered gates while subsequent steps need only focus on reducing this initial number of buffered gates.
E. Implementation
Algorithm 1 is implemented in a program called technology optimization program (TOP). TOP consists of: 1) a main program that sets up a sizing iteration by calculating an initial or updated choice of buffered nodes and 2) a subroutine that solves this sizing (or lower-bound sizing) problem and realizes in Fig. 5. The user controls the choice of: 1) power constraint value or list of values, 2) size range constraint values, 3) external load, and 4) whether or not wiring capacitance is included. The subroutine used in TOP is called automated design synthesis (ADS) and is described by Vanderplaats in [25]. ADS provides a choice of numerical methods for constrained nonlinear programming.
VI. EXPERIMENTAL RESULTS
Example networks that were optimized using Algorithm 1 and the TOP program are listed in Table II. Shown as well are the number of distinct gates, paths, and primitives in each network as well as the unoptimized network power and delay corresponding to the case where all gates are minimum size and unbuffered. Networks and are 16-b parity generators constructed using three- and two-input XOR gates, respectively. Networks and are 4-16 b decoders constructed using two-input gates. Network is a balanced decoder. Network is a tree-style decoder. Networks -are 8-b adders constructed using two-input gates. Network corresponds to the lookahead carry adder of Fishburn [26]. Network is a ripple-style adder while networks and are intermediate complexity designs. Network is an 8 × 8 b Braun-type parallel multiplier [27]. In this multiplier, only the 162 longest paths and the array of 56 SUM-3 and 56 CARRY-3 gates are explicitly considered. Each of the 64 partial product generation gates in the multiplier is taken to be fixed at minimum size and unbuffered.
In optimizing the networks, certain conditions were held constant. The load capacitance present on each network output is 100 fF. The transition density for each gate is 0.5 transitions/clock-cycle. The wire capacitance is 10 fF per outdegree. That is, there is 10 fF for each node connected to the output of a gate. Each gate connected to a network input has an input capacitance fixed at 50 fF. All optimizations were run on a SUN 4/40 SPARCstation IPC with 20-MB MM. The augmented Lagrangian method was used as the constrained nonlinear programming method, where each unconstrained minimization step used the Broyden-Fletcher-Goldfarb-Shanno (BFGS) technique and each line search used the Golden Section method followed by polynomial interpolation [13].
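The Golden Section step of the line search can be sketched as follows. This is a generic textbook version written for illustration, not the ADS routine: the function name, tolerance, and test objective are assumptions, and the polynomial interpolation refinement mentioned above is not included.

```python
import math


def golden_section_minimize(f, a, b, tol=1e-8, max_iters=200):
    """Minimize a unimodal function f on [a, b] by golden section search."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0  # about 0.618
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    fc, fd = f(c), f(d)
    for _ in range(max_iters):
        if abs(b - a) < tol:
            break
        if fc < fd:
            # Minimum lies in [a, d]; the old c becomes the new d.
            b, d, fd = d, c, fc
            c = b - inv_phi * (b - a)
            fc = f(c)
        else:
            # Minimum lies in [c, b]; the old d becomes the new c.
            a, c, fc = c, d, fd
            d = a + inv_phi * (b - a)
            fd = f(d)
    return (a + b) / 2.0


# Hypothetical one-dimensional objective along a search direction.
step = golden_section_minimize(lambda t: (t - 2.0) ** 2, 0.0, 5.0)
print(step)
```

Each iteration reuses one of the two interior function values, so only one new evaluation of the objective is needed per interval reduction.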
The gate and buffer designs used for network optimizations are shown in Fig. 7. Note that gate cells and buffer cells are independent (i.e., a buffered gate is implemented as a double cell) and that both a CMOS buffer design and a BiCMOS buffer design are included. The gate and buffer designs used are similar to those of Song [28]. Namely, differential CMOS designs are used for the first stage while differential CMOS and BiNMOS [29] designs are used for the second stage. The parameter values shown in Fig. 7 are for a cell of minimum (i.e., unity) size. The maximum allowed size is five for all gates and buffers. Note that since all designs have negligible static dissipation, only a value for the (internal) dynamic dissipation parameter is listed. All the parameter values in Fig. 7 were deduced by simulation program with integrated circuit emphasis (SPICE) simulation using transistor models obtained from a BiCMOS process that has a 0.8-μm minimum gate length and a 0.8-μm minimum emitter width [30].
For each network, joint gate sizing and buffer insertion optimizations were run to deduce mixed CMOS/BiCMOS (i.e., library type ) implementations for three different power constraints. For comparison, gate sizing optimizations were run to deduce optimal all CMOS (i.e., ) and all BiCMOS (i.e., ) implementations. Optimization results for the various networks are summarized in Table III showing the 1) power constraint setting, 2) CPU time required per iteration, 3) number of iterations (including lower-bound iteration), 4) lower-bound delay (calculated as described in Section V-C) expressed in relation to the final iteration delay , 5) first iteration delay normalized with respect to , 6) second iteration delay normalized with respect to , 7) final delay normalized with respect to the unoptimized delay , 8) fraction of buffered (i.e., BiCMOS) gates in the final design, 9) delay advantage (expressed as a speed-up ratio ) of the final design compared with an optimized CMOS design, and 10) delay advantage of the final design compared with an optimized all BiCMOS design (where a " " sign means that the given power constraint cannot be met in an all BiCMOS design so the delay advantage listed is conservative and based on the minimum power all BiCMOS design).
Typical optimization iteration results generated by Algorithm 1 are shown in Tables IV and V. Table IV lists results for a mixed CMOS/BiCMOS optimization of with a power constraint of As indicated, three optimization iterations were required. The first iteration is the lower-bound optimization that finds a lower-bound delay 3.13 ns. The lower-bound iteration also suggests that 14 gates be buffered. (Note that the lower-bound optimization does select only a fraction 14/46 of the nodes to be buffered.) The second iteration optimizes the network (with these 14 gates as buffered gates) using exact (rather than lower bound) gate parameters. 
A delay of 3.43 ns is computed and three of the 14 buffered gates are determined to be unnecessary by Library Selection Rule 1. The third iteration reoptimizes the network with the remaining 11 buffered gates and finds that delay is improved to 3.39 ns. Moreover, since none of the buffered gates were determined to be unnecessary and all buffer sizes are minimum or greater, iteration three is the final iteration. (If one or more of the buffer sizes was below minimum size, an additional iteration would be run with the buffer size constraint of imposed.) The user CPU time required by TOP for the three iterations was 208.5 s (or 69.5 s per iteration). The final iteration design is shown in Fig. 8.
Table V lists iteration results for a mixed CMOS/BiCMOS optimization of with a power constraint of As indicated, four optimization iterations were required. The lower-bound optimization finds a lower-bound delay 9.00 ns. The final design has a delay of 9.50 ns and uses 63 buffered gates. The user CPU time required by TOP for the four iterations was 4268.4 s (or 1067.1 s per iteration). The final iteration design is shown in Fig. 9 along with designs corresponding to other power constraint settings. (Although the reformulation given by (21)-(22) can be used to implicitly enumerate all paths of , the resulting sizing optimization problem has twice the number of unknowns and approximately twice the number of constraints to consider. Moreover, no improvement in optimized performance is achieved. This is because an upper-bound delay for the ignored paths can be calculated knowing the maximum allowed gate size and shown to be less than the optimized delay found by considering just the 162 longest paths.)
Typical relationships between optimized delay and power constraint setting are plotted in Fig. 10. Fig. 10(a) compares the optimized characteristics for network for different choices of aggregate cell libraries as defined in Table I. Fig. 10(b) compares the optimized characteristics for network for different categories of optimization. Shown also in Fig. 10(b) is a comparison of optimization results obtained using alternative approaches that contrast with the nonlinear programming approach of Algorithm 1. Point G (solid square) is for an alternative gate sizing approach similar to the repower method of Singh [31] but modified to handle continuous case sizing. In this optimization, the target network delay was set to the minimum value found by Algorithm 1 and a critical network was defined based on analysis of node timing slacks. Nodes for resizing were identified from a min-cutset of the critical network after assigning node weights based on fanout. Point J (solid diamond) is for an alternative joint gate sizing and buffer insertion approach. This approach is similar to the alternative gate sizing approach except that nodes can now be buffered as well as sized to improve their delay. Point T (solid circle) is for the full optimization method of Singh [31] that allows both gate sizing and critical path resynthesis through buffer tree transformations. The points G, J, and T appear consistent with the locus of values generated by Algorithm 1. A direct comparison with Algorithm 1 is, however, limited by the fact that the alternative approaches do not presently handle the case of constrained optimization.
Finally, typical relationships between optimized delay and external load (i.e., network fan-out) are plotted in Fig. 11. (A fan-out greater than unity occurs, for example, if the network has broadcasted outputs that provide the inputs to a parallel multiplier or other array structure.) Fig. 11(a) shows optimized delay versus external load for network for different choices of aggregate cell libraries. Fig. 11(b) shows the performance advantage of BiCMOS over CMOS versus 
VII. DISCUSSION OF RESULTS
The results of Section VI confirm the utility of the proposed joint gate sizing and buffer insertion algorithm to designers of CMOS and BiCMOS networks. From Fig. 10 and from examining the speed-up values in Table III, it is evident that joint optimization yields an advantage over just gate sizing or just buffer insertion optimization. In particular, it is seen that for a BiCMOS technology, a mixed CMOS/BiCMOS approach achieves a delay advantage (at equivalent power) versus any other CMOS approach (i.e., or in Table I) or the all BiCMOS approach (i.e., in Table I). In large fan-out networks, the delay advantage increases proportionally with external load on the network at a rate dependent on the power constraint (Fig. 11). The fact that the performance advantage and optimized design (e.g., Fig. 9) depend strongly on the power constraint provides justification for seeking a method that is capable of constrained optimization. Note that it is possible to include area constraints in Algorithm 1 provided a posynomial area model is used such as that identified by Hoppe [20].
The results of Table III also demonstrate that Algorithm 1 can find both a tight lower bound and a near-optimal final design. From the ratios of final to lower-bound delay , it is seen that in all cases Moreover, from Fig. 10(b), it is seen that Algorithm 1 is superior to each of the alternative methods examined in determining minimum delay designs. This may reflect the fact that Algorithm 1 considers enough of the global consequences of each gate choice that it is less prone to converge on a local minimum. As well, Algorithm 1 appears to have a rapid rate of convergence in terms of the number of iterations needed. From the intermediate iteration delay ratios in Table III, it is seen that in all cases Hence, just two iterations are sufficient to achieve a delay that is at worst within 1.4% of its final optimized value in and at worst within 11% of the lower-bound value.
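As a quick arithmetic check, the delay values quoted in Section VI for the Table IV and Table V examples can be compared against these bounds. The helper name `percent_above` is hypothetical; the delay values are the ones stated in the text.

```python
def percent_above(value, reference):
    """Return how far value lies above reference, in percent."""
    return 100.0 * (value / reference - 1.0)


# Table IV example (ns): lower bound, second iteration, final iteration.
lb_iv, second_iv, final_iv = 3.13, 3.43, 3.39
# Table V example (ns): lower bound and final iteration.
lb_v, final_v = 9.00, 9.50

final_vs_lb_iv = percent_above(final_iv, lb_iv)      # final vs. lower bound
second_vs_final_iv = percent_above(second_iv, final_iv)  # second vs. final
final_vs_lb_v = percent_above(final_v, lb_v)

print(f"{final_vs_lb_iv:.2f}% {second_vs_final_iv:.2f}% {final_vs_lb_v:.2f}%")
```

Both examples fall inside the stated worst-case figures: the final delays sit below 11% above their lower bounds, and the second-iteration delay of the Table IV example sits below 1.4% above its final value.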
At present the CPU requirement of Algorithm 1 permits networks of a few hundred gates, , to be optimized. The computational complexity of an augmented Lagrangian algorithm that uses the BFGS method for unconstrained minimization is approximately flops where is the number of unknowns (i.e., number of gates in the network) and is the number of constraints (i.e., number of delay paths in the network). This assumes that the Lagrange multipliers are updated times, each BFGS minimization needs line searches, and a line search direction calculation requires flops [34]. The memory required by such an augmented Lagrangian algorithm is approximately since the Hessian and gradient of the augmented Lagrangian function must be accessible.
A more efficient gate sizing approach can be easily incorporated into Algorithm 1 and will enable larger networks to be optimized (with possibly some sacrifice in accuracy). For example, the method of feasible directions algorithm [32] for constrained nonlinear programming can be selected in TOP instead of the augmented Lagrangian algorithm. (Doing this for with reduced the CPU time by a factor of 1.87, but the optimized delay was 1.05 times higher.) Future incorporation of the convex programming-based sizing approach of Sapatnekar [14] or the linear programming-based sizing approach of Berkelaar [33] may also yield efficiency improvements. Their work suggests that continuous case sizing optimization of networks containing gates should be feasible assuming the same SPARCstation used in this work.
The absolute accuracy of the network delay and power optimization values reported in Section VI is limited by two simplifying assumptions that were made but that are not inherent limitations of Algorithm 1. First, the gate parameter values used (Fig. 7) were for the slowest input of each gate. Second, the transition density value used was the same for all gates in a network. Assigning unique parameter values for each gate input will improve the delay accuracy of Algorithm 1 without increasing CPU time. Precalculating [21] and assigning a distinct transition density to each node will improve dynamic power accuracy without increasing CPU time.
VIII. CONCLUSION
In conclusion, this paper has presented the first reported joint gate sizing and buffer insertion method for minimizing the delay of power constrained combinational logic networks that can incorporate a mixture of unbuffered and buffered gates (or mixture of CMOS and BiCMOS gates). The method is versatile and solves mixed CMOS/BiCMOS optimization problems or any other optimization problem where the network uses a mixture of compatible logic families and continuous sizing. Experimental results have shown that the approach of the algorithm (where logic family choices are decided by an iterative process that uses a sequence of sizing optimizations) results in a rapid convergence to a near-optimal design. As well, this approach enables the algorithm to be implemented from an existing gate sizing capability and to benefit from future developments in gate sizing. Experimental results have also highlighted the need for the proposed algorithm by demonstrating that: 1) a mixed logic family design (e.g., CMOS/BiCMOS design) can achieve a significant performance advantage over a corresponding single logic family design (e.g., CMOS only design), 2) the advantage is strongly dependent on the given application and must be systematically calculated, and 3) a simpler algorithm (not using iteration) yields inferior designs. The proposed algorithm, thus, represents an effective tool for applying and designing logic circuits implemented in BiCMOS or other mixed technologies.
