I. INTRODUCTION
Circuit delays in MOS integrated circuits often need to be reduced to obtain faster response times, with a minimal area penalty. A typical MOS digital integrated circuit consists of multiple stages of combinational logic blocks that lie between latches, clocked by system clock signals. Delay reduction must ensure that the worst-case delay of each combinational block is such that valid signals reach a latch before any transition in the signal clocking the latch, with allowances for set-up time requirements. In other words, the worst-case delay of each combinational stage must be kept below a given specification. The requirements for hold times are different in nature, and are not addressed in this paper.
Given the MOS circuit topology, the delay can be controlled by varying the sizes of transistors in the circuit. Here, the size of a transistor is measured in terms of its channel width, since the channel lengths in a digital circuit are generally uniform. Roughly speaking, the sizes of certain transistors can be increased to reduce the circuit delay at the expense of additional chip area. In this paper, the delay of a circuit is defined to be the maximum of the delays of all paths in the circuit. Hence, it can be formulated as the maximum of posynomial functions. This is mapped by the above transformation onto a maximum of convex functions, which is also a convex function. The area function is also a posynomial, and is transformed into a convex function by the same mapping. Therefore, the optimization problem defined in (1) is mapped to a convex programming problem, i.e., a problem of minimizing a convex function over a convex constraint set. Due to the unimodal property of convex functions over convex sets, any local minimum of (1) is also a global minimum. The most commonly used measure of the circuit area is given by an affine function of transistor sizes [3], [5]-[12]. While this measure is not very accurate, it has the advantage of being a posynomial function of the sizes of transistors in the circuit.
Various methods have been used for optimization. TILOS [3], [6] performs the task by iteratively identifying a critical delay path and using a heuristic method to reduce the delay along this path. The iterative process stops when the critical path (i.e., the largest delay path among all paths between a primary input and a primary output) meets the delay constraint. All transistors are initially set to the minimum size, and the sizes of only those transistors that lie on the critical path are increased, in an attempt to meet the delay constraint by increasing the sizes of as few transistors as possible. A subsequent algorithm proposed by Shyu et al. [7] works in two phases. It uses TILOS to generate a rough initial solution in the first phase. In the second phase, it converts the problem to a mathematical optimization problem in a smaller parameter space corresponding to sizes of transistors on the paths of worst delay, and uses a method of feasible directions to find the optimal solution. The use of the reduced space serves to reduce the complexity of the optimization problem.
iDEAS [8], like TILOS, iteratively reduces the delay along the critical path; it differs from TILOS in that it changes the size of more than one transistor in each iteration. The methods used by Cirit [9], Hedlund [10] and Marple [11], [12] formulate non-linear programs, and solve them by the method of Lagrangian multipliers. Another approach, as practised in MOSIZ [13], CATS [14] and iCOACH [15], is to perform the transistor size optimization as a two-step iterative process. The first step is an outer loop in which a timing "budget", $T_i$, is assigned to each gate $i$, using a coarse simplification based on the overall delay specification. In the inner loop, the transistors in gate $i$ are sized optimally so as to satisfy the timing budget, $T_i$, for that gate. The partitioning of the task into two steps serves to reduce the computational complexity of the algorithm.
There are several problems associated with the above optimization methods. Essentially, they perform a sequence of local optimizations over a reduced parameter space, hoping, but not guaranteeing, that such optimizations would lead to a global optimum. Moreover, apart from using the unimodality property, none of these algorithms takes advantage of the fact that the optimization problem can be posed as a convex programming problem.
With regard to delay modeling, each of the algorithms described in this section, except for [5], assumes waveforms with step transitions at the input and output of each gate. This is not realistic, since actual waveforms have non-zero rise and fall times. In [5], although delay models accommodate the effects of non-zero transition times, the accuracy of the optimization is compromised by choosing uniform widths $W_n$ and $W_p$ for all n-transistors and p-transistors, respectively, in a gate.
In this paper, we tackle the transistor sizing problem as defined in (1), which is the most common form of the problem faced by practising circuit designers. The other formulations mentioned earlier in this section can also be handled using the same approach.
We use a new and more accurate delay estimator that permits waveforms with non-zero rise and fall times, and computes rise and fall delays separately. The details of the delay estimation algorithm are furnished in Section II. An efficient convex programming method [16] is used for global optimization over the parameter space of all transistor sizes in a combinational subcircuit. This algorithm is capable of handling large problem sizes without having to prune any variables; moreover, its complexity is independent of the number of constraints. Hence, the optimization procedure is guaranteed to solve the problem exactly by finding the global minimum of the optimization problem, unlike many other methods which make simplifying assumptions for tractability, but cannot guarantee optimality and reasonable runtimes. The algorithm starts by bounding the convex domain by an initial polytope. By using special cutting plane techniques, the volume of this polytope is shrunk in each iteration, while ensuring that the optimal solution lies within the boundary of the polytope. The iterative procedure stops when the volume of the polytope becomes sufficiently small.

A more complete description is given in Section III. Since this is the first practical implementation of the convex programming algorithm [16] on problems of the size that we have handled, a considerable portion of this paper is devoted to practical aspects of the implementation. The extension of the algorithm from combinational circuits to general sequential circuits is outlined in Section IV. Finally, experimental results to illustrate the efficacy of this technique are presented in Section V.
II. THE DELAY ESTIMATION ALGORITHM
In this section, an algorithm for estimating the worst-case delay through the circuit, over all possible input combinations, is described.
Consider a combinational CMOS circuit with a set of primary input nodes and primary output nodes. The circuit is first divided into channel-connected components (henceforth referred to simply as components); each component corresponds to a set of transistors that are connected by drain and source nodes.
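The partition into channel-connected components can be sketched with a union-find structure. This is a minimal illustration, not the authors' implementation: the `(name, drain, source)` tuple format and the function name are assumptions made here, and supply rails and primary inputs are kept from merging components by giving each transistor its own private copy of such a node.

```python
from collections import defaultdict

def channel_connected_components(transistors, rails, primary_inputs):
    """Group transistors into channel-connected components.

    transistors: list of (name, drain, source) tuples (hypothetical format).
    rails, primary_inputs: node names that must not join two components.
    """
    boundary = set(rails) | set(primary_inputs)
    parent = {}

    def find(v):
        parent.setdefault(v, v)
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    def key(node, name):
        # A boundary node is "split": each transistor sees a private copy,
        # so sharing VDD, ground, or a primary input never merges groups.
        return (node, name) if node in boundary else node

    for name, drain, source in transistors:
        d, s = key(drain, name), key(source, name)
        parent[find(d)] = find(s)  # the channel joins drain and source

    groups = defaultdict(list)
    for name, drain, source in transistors:
        groups[find(key(drain, name))].append(name)
    return sorted(sorted(g) for g in groups.values())


# Two inverters sharing only the supply rails form two components.
netlist = [
    ("p1", "out1", "vdd"), ("n1", "out1", "gnd"),
    ("p2", "out2", "vdd"), ("n2", "out2", "gnd"),
]
print(channel_connected_components(netlist, {"vdd", "gnd"}, {"a"}))
```

A series stack such as the pull-down network of a NAND gate stays in one component because its internal node is not a boundary node.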
More formally, the definition of a component can be given by the following construction. Create an undirected graph with a vertex for each circuit node and an edge between the drain and source node of each transistor. Next, split the vertices corresponding to the ground node, the supply ($V_{DD}$) node, and the primary input nodes such that each of these vertices is incident on only one edge after splitting. A component is then a set of transistors corresponding to the edges within a connected component of the graph. This process is illustrated in Fig. 1.

... is used to compute the maximum overall rise and fall delays between primary inputs and primary outputs of the circuit. A trace-back method is then used to obtain the critical path, which consists of the set of gates that lie on the largest delay path from a primary input to a primary output of the combinational network. Two numbers, $t_h$ and $t_l$, are assigned to each output node of each component in the circuit; they correspond to the total rise and fall delay from the primary inputs, respectively. In addition, for each component, we compute $\tau_h$ and $\tau_l$, the Elmore delays of an RC network that corresponds to the worst-case rise and fall scenarios, respectively. The output transition waveform is modeled as a function that varies linearly with time. The transition times of the rising and falling waveforms at the output of the component are taken to be $2\tau_h$ and $2\tau_l$, respectively.

... given by the resistance $R_{on}$ of the corresponding transistor. The $V_{DD}$ node and all of its incident edges are then removed from the graph. Let $t_{h,max}$ denote the maximum value of $t_h$ among all input nodes of the component, and suppose this occurs at the gate node of an n-type transistor corresponding to an edge $e_{max}$ in $G$. It is assumed that the worst-case path is the largest resistive path (LRP), i.e., the path of largest weight between $o$ and ground that passes through $e_{max}$.
This assumption is valid when the load capacitance at the output node is much greater than the internal capacitance at any node that lies on any path between the output node and the ground node through $e_{max}$, as is often the case for CMOS circuits. As one pushes a component to its speed limit, internal node capacitances will no longer be small. However, since the capacitances that need to be driven by the component would probably increase, it is hoped that this assumption will hold. For the circuit delay specifications reported in this paper, it is seen in Section V that the approximation is valid.
Since finding the LRP is equivalent to the longest path problem in a graph, which is NP-hard [18], we have developed a heuristic to perform this task. This heuristic is exact for series-parallel graphs, such as CMOS complex gates, and can be outlined as follows.

    T = maximum weighted spanning tree in G containing e_max such that
        the path P between o and ground in T contains e_max
    maxW = sum of weights of edges in P
    LINK = {edges in G - T}
    for each edge e in LINK
        T1 = T + e
        P1 = max weight o-to-ground path in T1 through e_max
        W = sum of weights of edges in P1
        if (W > maxW)
            e' = any edge in P - (P intersect P1)
            T = T1 - e', P = P1, maxW = W
The heuristic begins by finding a maximum weighted spanning tree $T$ of $G$ that contains the edge $e_{max}$, using a variant of Prim's algorithm [18]. Let $P_0$ denote the unique path in $T$ between $o$ and ground. If $P_0$ contains $e_{max}$, set $P$ to $P_0$; otherwise an edge $e \notin T$ is added to $T$ such that $T + e$ has a path $P$ between $o$ and ground through $e_{max}$, and $e$ is the edge of greatest weight among all edges that satisfy this condition. The introduction of $e$ creates a unique cycle; an edge $e'$, such that $e' \in P_0$ and $e' \notin P$, is removed from $T + e$ to give a new initial tree $T$.
The edges which are not in $T$ constitute the set of links. A link is then added to the present tree $T$ to produce a subgraph $T_1$ that contains a unique cycle. Therefore, there can be at most two paths from $o$ to ground in $T_1$. The path of larger weight is called $P_1$. If the weight of $P_1$ is larger than that of $P$, then the present tree $T$ is updated by removing any edge from $T_1$ that belonged to $P$ but not to $P_1$. Also, $P$ is reset to $P_1$ and the heuristic proceeds to process the next link, and so on, until all links of the original tree have been processed. The path between $o$ and ground in the final tree produced by the heuristic is referred to as the largest resistive path (LRP). In the case of series-parallel graphs, the heuristic does indeed generate the path of largest resistance from output to ground; in other cases, such as graphs with bridges, it gives a good approximation.

Now, consider any spanning tree $T_w$ of the graph $G$. If $P_p$ and $P_q$ are the paths to ground from nodes $p$ and $q$, respectively, in $T_w$, let $R_{pq}$ denote the resistance of the path $P_p \cap P_q$. The Elmore delay [2] between $o$ and the ground node in the RC-tree represented by $T_w$ is given by

$$\tau = \sum_{j \in T_w} R_{oj} C_j \qquad (3)$$

where $C_j$ is the capacitance to ground at node $j$ in $T_w$. Note that while finding the Elmore delay, the capacitances which lie between the switching transistor and the supply rail are assumed to be at the voltage level of the supply rail at the time of the switching transition, and do not contribute to the Elmore delay.
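The Elmore sum over an RC tree can be illustrated with a short sketch. The data layout is an assumption made for this example only: `parent[v]` holds the next node toward ground and the resistance of that edge, and `cap[v]` holds the capacitance to ground at node `v`. The shared resistance $R_{oj}$ is computed literally as the resistance of the intersection of the two paths to ground.

```python
def path_to_ground(node, parent):
    """Return the nodes whose parent edges form the path from node to ground."""
    on_path = set()
    while node in parent:          # ground has no parent entry
        on_path.add(node)          # node stands for the edge (node -> parent)
        node = parent[node][0]
    return on_path

def elmore_delay(output, parent, cap):
    """Sum of R_oj * C_j over all nodes j of the tree.

    parent[v] = (next node toward ground, resistance of that edge)
    cap[v]    = capacitance to ground at node v
    """
    p_out = path_to_ground(output, parent)
    delay = 0.0
    for j, c_j in cap.items():
        shared = p_out & path_to_ground(j, parent)
        r_oj = sum(parent[v][1] for v in shared)   # resistance common to both paths
        delay += r_oj * c_j
    return delay


# Two transistors in series: o --2-- n1 --3-- gnd, with a side branch s off n1.
parent = {"o": ("n1", 2), "n1": ("gnd", 3), "s": ("n1", 10)}
cap = {"o": 5, "n1": 1, "s": 2}
print(elmore_delay("o", parent, cap))   # (2+3)*5 + 3*1 + 3*2 = 34.0
```

The side branch contributes only through the resistance it shares with the output path, which is why appending extra edges to the LRP can raise the delay estimate.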
In order to find a tree that contains the LRP and which maximizes the Elmore delay, certain edges must be added to the LRP in such a way that $R_{oj}$ is maximized for every node $j$ in the graph. The algorithm to construct the worst-case tree $T_w$ from the LRP is as follows. Initially $T_w$ is taken to be the LRP itself. For a node $n_1 \notin T_w$, the algorithm finds a node $n_2 \in T_w$ that is farthest from the ground node and is connected to $n_1$ by a path that does not intersect $T_w$. This path is then added to $T_w$ and the procedure is repeated until all nodes of $G$ are included in the tree $T_w$. The worst-case fall delay at $o$ is then computed using (3).
Example 1: Consider the graph $G$ shown in Fig. 3. Assume that the LRP between the output node $o$ and ground has been found to be $\{d, e\}$. Initially, $T_w$ is taken to be the LRP $\{d, e\}$. Consider node $n_1$, which is connected to node $o$ through several paths, one of which is $\{j, k\}$. This path is added to $T_w$, which now becomes $\{d, e, j, k\}$. Note that both nodes $n_1$ and $n_2$ are now part of the tree $T_w$. The nodes $n_4$ and $n_5$ are then added to the tree by adding the edges $a$ and $b$, respectively. Finally, the node $n_6$ is added to the tree by adding the edge $f$ to it. This completes the formation of the worst-case tree, which is $\{d, e, j, k, a, b, f\}$, indicated by the bold edges in Fig. 3. If branch $d$ corresponds to the switching transistor, the worst-case Elmore delay is given by ...
Finally, the value of $t_l$ for output node $o$ is computed by adding $t_{h,max}$, the Elmore delay of the worst-case RC network, $\tau_l$, and a term [4] related to the transition time of the rising input at the input node corresponding to the worst-case Elmore fall delay. A more detailed description of how the effect of input transition time is incorporated is provided later in this section. This procedure is repeated for all output nodes of the component.
The value of $t_h$, the worst-case rise delay at each output node of the component, can be found in a similar manner. The weighted graph representing the component is constructed as before, except that the ground node is removed instead of the $V_{DD}$ node. The rest of the procedure to find the worst-case Elmore rise delay is identical to that of the fall delay, except that the role of the ground node is taken by the $V_{DD}$ node, and the roles of $t_h$ and $\tau_h$ are exchanged with those of $t_l$ and $\tau_l$ in the fall delay case.
In other delay estimators that we have come across, the Elmore rise and fall delays are computed directly from the LRP without appending additional edges to extend it to the worst-case RC-tree as described above.
Delay Model for Components under Nonstep Transitions
In [4], it has been shown that a good approximation to the delay, $\tau$, of a CMOS ...

A general complex gate such as the AOI gate, when excited by a step excitation, may be represented by an equivalent inverter $I$ whose size is determined by the Elmore delay of the worst-case RC tree described earlier. For an excitation of the type in (5), we may consider the general complex gate as being equivalent to the inverter $I$ being excited by the same excitation. Hence, (6) also holds for complex gates.
The form of the path delay under step excitations is described in [3]. We examine the change required in this form to include the effect of waveforms with nonstep transitions as described in (5), under the assumption that the signal at the output of a component is modeled by a ramp function, as described earlier in this section.
Let $\tau_{i,step}$ refer to the delay of component $i$ on a path of the circuit, with all input waveforms having step transitions. The delay of the circuit, $\mathrm{Delay}_{step}$, is given by

$$\mathrm{Delay}_{step} = \tau_{1,step} + \tau_{2,step} + \cdots + \tau_{n,step} \qquad (7)$$
When we incorporate the effect of the transition time, and add the simplifying assumption that the magnitude of the threshold voltage is the same for nMOS and pMOS enhancement mode transistors, the delay along the path is given by

$$\begin{aligned}
\mathrm{Delay}_n &= \tau_{1,step} + (\tau_{2,step} + \mathrm{Delay}_1) + (\tau_{3,step} + \mathrm{Delay}_2 - \mathrm{Delay}_1) + (\tau_{4,step} + \mathrm{Delay}_3 - \mathrm{Delay}_2) + \cdots + (\tau_{n,step} + \mathrm{Delay}_{n-1} - \mathrm{Delay}_{n-2}) \\
&= \tau_{1,step} + \tau_{2,step} + \cdots + \tau_{n-1,step} + \tau_{n,step} + \mathrm{Delay}_{n-1} \\
&= w_1 \tau_{1,step} + w_2 \tau_{2,step} + \cdots + w_n \tau_{n,step}. \qquad (8)
\end{aligned}$$

In the case where the threshold voltage, $V_T$, is different in magnitude for n and p type transistors, the form of (8) remains the same, but the expression for the $w_k$'s is more involved. In this work, we have assumed that the magnitude of the threshold voltage is the same for n and p type transistors.
As will be shown in Section V, the delay times calculated by our estimator are in good agreement with SPICE results.
Area and Delay Functions
Let $n$ denote the number of transistors in a combinational circuit and let $x = (x_1, x_2, \ldots, x_n)$ be an $n$-dimensional vector of the transistor sizes. The total area of the circuit is taken, for simplicity, to be the sum of the transistor sizes, i.e.,

$$\mathrm{Area}(x) = \sum_{i=1}^{n} x_i$$

Note that the area function is a posynomial in $x$. The other formulations mentioned in the introduction, namely, minimizing the delay subject to area constraints, minimizing the area-delay product, or a formulation that involves the area, delay and power dissipation, can also be handled by this algorithm. However, since the above formulation is the most practically useful one, we restrict our discussion to this formulation.
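As a worked illustration of why a posynomial becomes convex, the following assumes the standard geometric-programming substitution $x_i = e^{z_i}$; the coefficients $c_k$ and exponents $a_{ik}$ are illustrative symbols introduced only for this sketch.

```latex
f(x) = \sum_{k} c_k \prod_{i=1}^{n} x_i^{a_{ik}}, \quad c_k > 0
\qquad \xrightarrow{\; x_i = e^{z_i} \;} \qquad
F(z) = \sum_{k} c_k \exp\Big( \sum_{i=1}^{n} a_{ik} z_i \Big),
\qquad
\mathrm{Area}(z) = \sum_{i=1}^{n} e^{z_i}
```

Each term of $F$ is the exponential of a linear function of $z$ and is therefore convex; a sum of convex functions with positive coefficients is convex, and so is the pointwise maximum of such sums.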
Note that under this transformation, the delay along a path has the form ...

The algorithm proceeds iteratively as follows. First, a center $z_c$ deep in the interior of the current polytope $P$ is found by using a technique which will be described later. Next, an oracle is invoked to determine whether or not the center $z_c$ lies within the feasible region $S$. From the definition of $S$, the oracle is simply a routine that invokes the delay ...

Since this is the first practical implementation of this convex programming algorithm on problems of the size that we have handled, our work addresses several issues that were inconsequential to previous implementations that worked with a smaller number of variables. Hence, some of the practical issues involved are described in detail in this section.
Example 2: Consider the problem

    minimize f(x_1, x_2)  s.t.  (x_1, x_2) in S

where $S$ is a convex set and $f$ is a convex function. The shaded region in Fig. 4a corresponds to $S$, and the dotted lines show the level curves of the function $f$. The point $x^*$ is the solution to this problem. The procedure begins by bounding the expected solution region by a closed polytope, which corresponds to a rectangle in two dimensions. This is shown in Fig. 4a. The center, $z_c$, of this rectangle is found. The oracle is invoked to determine whether $z_c$ lies within the feasible region or not; in this case it can be seen that $z_c$ lies outside the feasible region. Hence, the gradient of the constraint function is used to pass a hyperplane through $z_c$, such that the polytope is divided into two parts, one of which contains the solution $x^*$. This is illustrated in Fig. 4b, where the hatched region corresponds to the polytope containing the solution. The process is repeated on this new smaller polytope. Its center lies inside the feasible region, and hence the gradient of the objective function is used to generate a hyperplane that further shrinks the size of the polytope, as shown in Fig. 4c. The result of another iteration is illustrated in Fig. 4d. The process continues until the polytope has been shrunk sufficiently.
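The choice of cutting plane in Example 2 can be sketched as follows. This is an illustration under stated assumptions rather than the paper's routine: the variables are taken to be $z_i = \ln x_i$, the area is $\sum_i e^{z_i}$, and `toy_delay` is a hypothetical convex stand-in for the delay estimator. For a convex function with gradient $g$ at $z_c$, every point with a smaller function value satisfies $g \cdot z \le g \cdot z_c$, so the retained half-space has that form in both cases.

```python
import math

def area_gradient(z):
    """Gradient of Area(z) = sum(exp(z_i)) in the transformed variables."""
    return [math.exp(zi) for zi in z]

def toy_delay(z):
    """Hypothetical convex delay model: one term 1/x_i per transistor."""
    return sum(math.exp(-zi) for zi in z)

def toy_delay_gradient(z):
    return [-math.exp(-zi) for zi in z]

def make_cut(z_c, delay, delay_gradient, t_spec):
    """Return (a, b) so that the retained half-space is {z : a . z <= b}."""
    if delay(z_c) > t_spec:
        g = delay_gradient(z_c)   # infeasible center: cut with the constraint gradient
    else:
        g = area_gradient(z_c)    # feasible center: cut with the objective gradient
    b = sum(gi * zi for gi, zi in zip(g, z_c))
    return g, b


# Center at unit sizes; a tight specification forces a feasibility cut.
print(make_cut([0.0, 0.0], toy_delay, toy_delay_gradient, t_spec=1.0))
# A loose specification yields an objective cut instead.
print(make_cut([0.0, 0.0], toy_delay, toy_delay_gradient, t_spec=3.0))
```

In the first call the cut keeps the region where the sizes grow, since only larger transistors can meet the tighter delay; in the second it keeps the region of smaller area.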
It can be seen that the key parts of this algorithm are (1) finding the center $z_c$ of the existing polytope $P$, (2) generating gradient functions in (17) and (19) above, and (3) deciding when to terminate the algorithm.
Procedure for finding the center of the polytope
We would like to find a point inside a polytope that satisfies the property that any separating hyperplane drawn through it divides the original polytope into two parts of approximately equal volume. Finding such a point is difficult [16], and so we settle for finding a point that is reasonably deep within the interior of the polytope, and can be found through relatively inexpensive computation.

... and is obtained by performing a one-dimensional line search.

Note that the process of computing a Newton direction by (24) involves the inversion of an $n \times n$ Hessian matrix, which takes $O(n^3)$ time and can prove to be rather expensive. This expense can be cut down by maintaining the inverse of an approximate Hessian $\hat{H}$ via rank-one updates [19] as described below, and by using an approximate Newton direction instead of the exact one in the line search. We note that using an approximate Newton direction instead of the exact one essentially does not affect the convergence properties of the center-finding algorithm [16].
Rank-one updates
Let $z_k$ be the point at the beginning of the $(k+1)$-th iteration of Newton's method for finding the center $z_c$ of the polytope $P$ described by (14).
Two methods for maintaining the approximate Hessian using rank-one updates [19] are outlined below.
Method 1
The Hessian at $z_k$ may be written as ...

For Scheme (a) for maintaining an approximate inverse Hessian described above, the first threshold parameter is typically chosen to be around 1.5, while the second may be set to about 5; for Scheme (b), typical values for the two parameters are 3 and 20, respectively.

The reason why the second parameter is set to be larger than the first is as follows. When $\omega$ is positive (i.e., when the second parameter determines whether an update is to be made or not), the denominator of the update is relatively large, and hence numerical errors in its calculation are damped out. In the case where $\omega$ is negative (i.e., the update decision depends on the value of the first parameter), the denominator grows smaller as the first parameter increases, and a large value could lead to an amplification of numerical errors. Therefore, the choice of the second parameter may be more liberal than that of the first.

In each of these two methods, it suffices to maintain $\hat{H}^{-1}$; it is not even necessary to explicitly find $\hat{H}$.
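A rank-one correction to a maintained inverse can be sketched with the Sherman-Morrison identity, $(A + uv^T)^{-1} = A^{-1} - A^{-1} u v^T A^{-1} / (1 + v^T A^{-1} u)$. This is an assumption about the general form of such updates, not a reproduction of the update rule or thresholds of [19]; the function name is hypothetical.

```python
def sherman_morrison(a_inv, u, v):
    """Return (A + u v^T)^{-1} given A^{-1}, using O(n^2) operations."""
    n = len(u)
    a_inv_u = [sum(a_inv[i][k] * u[k] for k in range(n)) for i in range(n)]
    v_a_inv = [sum(v[k] * a_inv[k][j] for k in range(n)) for j in range(n)]
    denom = 1.0 + sum(v[k] * a_inv_u[k] for k in range(n))
    if abs(denom) < 1e-12:
        # A tiny denominator amplifies rounding error; refuse the update.
        raise ZeroDivisionError("rank-one update is numerically singular")
    return [[a_inv[i][j] - a_inv_u[i] * v_a_inv[j] / denom for j in range(n)]
            for i in range(n)]


# Start from the identity and add the rank-one term e1 e1^T:
# A + u v^T = diag(2, 1), so the inverse is diag(0.5, 1).
identity = [[1.0, 0.0], [0.0, 1.0]]
print(sherman_morrison(identity, [1.0, 0.0], [1.0, 0.0]))
```

The guard on the denominator mirrors the concern raised above: an update whose denominator is small is the one most likely to amplify numerical errors.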
Method 2
The Hessian at $z_k$ may also be written as ...

Let $p$ be the number of additional planes added to the initial polytope, the box, described by (15). $\Gamma \in \mathbb{R}^{n \times n}$ is the Hessian at $z_k$ due to the planes of this box only, and is a diagonal matrix. The $i$-th diagonal element of $\Gamma$, denoted $\Gamma_{ii}$, is given by

$$\Gamma_{ii} = \left[ (z_{k,i} - x_{min})^{-2} + (x_{max} - z_{k,i})^{-2} \right]$$

The rows of $U^T \in \mathbb{R}^{p \times n}$ correspond to the planes added to the initial polytope, i.e., the $(2n+1)$-th to the $m$-th rows of $A^T$. The remaining factor, a diagonal matrix in $\mathbb{R}^{p \times p}$, has diagonal entries that correspond to the last $p$ diagonal entries of the matrix in (26).
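The diagonal box term is cheap to form and to invert, which is what makes this splitting attractive. A minimal sketch, using the bound names $x_{min}$ and $x_{max}$ from the text and otherwise hypothetical function names:

```python
def box_barrier_diagonal(z, x_min, x_max):
    """Diagonal entries (z_i - x_min)^-2 + (x_max - z_i)^-2 of the box Hessian."""
    return [(zi - x_min) ** -2 + (x_max - zi) ** -2 for zi in z]

def invert_diagonal(diag):
    """Inverse of a diagonal matrix, stored as its list of diagonal entries."""
    return [1.0 / d for d in diag]


gamma = box_barrier_diagonal([1.0, 0.5], x_min=0.0, x_max=2.0)
print(gamma)                   # first entry is 1 + 1 = 2.0
print(invert_diagonal(gamma))  # O(n) work, no matrix factorization needed
```

Both routines take $O(n)$ time, so the cost of the overall inverse is dominated by the part that depends on the $p$ added planes.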
We may now write $H^{-1}$ in the form (34).
As in Method 1, it suffices to maintain the approximate inverse of $C$; it is not necessary to explicitly store $C$ itself. The approximate Hessian or the approximate inverse Hessian are never explicitly maintained; the search direction is found by computing $-H^{-1} (\nabla F(z_k))^T$, which involves multiplication of the expression (34) for $H^{-1}$ by an $n \times 1$ vector. The cost of this computation can be seen to be $O(np)$ if $n \gg p$, i.e., if the number of added planes is much less than the problem dimension. This is seen to be the case for large problems, and hence the use of this method would speed up the computation substantially for large problems.
If the number of additional planes, $p$, has not changed since the last calculation of $C^{-1}$, all that needs to be done to get the new $C^{-1}$ is a set of rank-one updates. If a new plane has been added, a method outlined in [21] may be used to update $C^{-1}$. The method involves a rank-one update and a few additional operations to incorporate the effect of the newly-added plane. As before, one of two schemes may be used to calculate the approximate Newton direction:
(a) Maintaining a more accurate $C^{-1}$, and setting ...
(b) Maintaining a more approximate $C^{-1}$, and using it as a preconditioner for a preconditioned conjugate gradient iteration that solves ...
It may be noted that the preconditioned conjugate gradient does not need an approximate $H$ or $H^{-1}$ explicitly, but multiplies $H^{-1}$ by an $n \times 1$ vector; we have already seen that this operation is computationally cheap when $p$ is small.
It was found experimentally that Scheme (b) of Method 2 gave the best overall results for the problems that we worked on.
One-dimensional Line Search
Once the Newton direction of (24) has been found, the value of $t$ that minimizes ...

As a result of (38) and (39), and since the function is convex in the interval $[t_{min}, t_{max}]$, $t_{min}$ can be set to 0, and a simple bisection search can be used to find the $t$ at which the derivative with respect to $t$ vanishes, as follows:

    repeat
        t = (t_min + t_max) / 2
        if the derivative at t and the derivative at t_min are of opposite sign
            t_max = t
        else
            t_min = t
    until |derivative at t| <= tolerance

where the tolerance is a small positive number.
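The bisection loop above can be written as a short runnable sketch. The callable `dphi`, standing for the derivative of the one-dimensional function along the search direction, and the parameter names are assumptions made for illustration; an iteration cap is added as a safeguard that the outline above does not mention.

```python
def bisect_derivative(dphi, t_max, eps=1e-8, max_iter=200):
    """Find t in [0, t_max] with |dphi(t)| <= eps, for a convex 1-D function."""
    t_min = 0.0
    d_min = dphi(t_min)
    if abs(d_min) <= eps:
        return t_min
    t = t_min
    for _ in range(max_iter):
        t = (t_min + t_max) / 2.0
        d = dphi(t)
        if abs(d) <= eps:
            break
        if d * d_min < 0:      # opposite signs: the zero lies in [t_min, t]
            t_max = t
        else:                  # same sign: the zero lies in [t, t_max]
            t_min, d_min = t, d
    return t


# phi(t) = (t - 1)^2 has derivative 2(t - 1), which vanishes at t = 1.
print(bisect_derivative(lambda t: 2.0 * (t - 1.0), t_max=4.0))
```

Because the derivative of a convex function is non-decreasing, a sign change between the two ends of the bracket is enough to locate the minimizer.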
Generation of hyperplanes
When the center z c of a polytope lies within the feasible region S, the gradient of the area function is required to generate the new hyperplane passing through the center. Transistors that satisfy neither of these two requirements have no contribution to the gradient of the delay function.
Termination criterion
The algorithm should be terminated when the volume of the final polytope is sufficiently small. In practice, near the optimum, the polytope becomes flat in the direction normal to the gradient of the area. A practical termination criterion uses this property.
From the current center, $z_c$, let $z_1$ and $z_2$ be the two nearest points on the boundary of the polytope, in the direction of the positive and negative gradient of the area, respectively.
The difference between the area of the circuit corresponding to the transistor sizes at $z_1$ and $z_2$ provides a measure of the flatness of the polytope in the direction of the area gradient.
Hence, the termination criterion is taken to be ...
where the tolerance is a small user-specified number (a reasonable default value is 0.01).
IV. EXTENSION TO SEQUENTIAL CIRCUITS
For sizing sequential circuits, it is first required that latches in the circuit be identified. Next, the combinational subcircuits that lie between these latches are extracted, and the delay constraint for each of these subcircuits is computed. For each subcircuit, the transistor sizing problem is solved by minimizing the area of the subcircuit, while ensuring that its delay requirement is satisfied.
The task of identifying latches proceeds as follows. The circuit is represented by a graph, $G$, with vertices corresponding to components, and with edges drawn from a component to each component that it fans out to. Feedback loops in the circuit (e.g., cross-coupled NAND gates), which manifest themselves as strongly connected components in this graph, are identified using Tarjan's algorithm [22].

The input to the program is a SPICE deck that gives a transistor-level netlist of the circuit. In the preprocessing stage, the circuit is first divided into channel-connected components. Next, latches in the circuit are identified. The circuit is divided into combinational subcircuits that lie between latches, and the delay constraints for each such subcircuit are determined. The main body of the procedure carries out a convex optimization on each combinational subcircuit.
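The latch-identification step of the preprocessing stage can be sketched with Tarjan's strongly connected components algorithm. The `fanout` adjacency map and the function names are assumptions for this example; the recursive form is kept for brevity and would need an explicit stack for very deep circuits.

```python
def strongly_connected_components(fanout):
    """Tarjan's algorithm; fanout maps each component to those it drives."""
    index, low = {}, {}
    stack, on_stack = [], set()
    sccs = []
    counter = [0]

    def visit(v):
        index[v] = low[v] = counter[0]
        counter[0] += 1
        stack.append(v)
        on_stack.add(v)
        for w in fanout.get(v, ()):
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:          # v is the root of a component
            scc = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                scc.append(w)
                if w == v:
                    break
            sccs.append(sorted(scc))

    for v in list(fanout):
        if v not in index:
            visit(v)
    return sccs

def find_feedback_loops(fanout):
    """Keep only components that contain a cycle (candidate latches)."""
    return [s for s in strongly_connected_components(fanout)
            if len(s) > 1 or s[0] in fanout.get(s[0], ())]


# Cross-coupled NAND gates between an input stage and an output stage.
circuit = {"in": ["nand1"], "nand1": ["nand2"], "nand2": ["nand1", "out"], "out": []}
print(find_feedback_loops(circuit))   # [['nand1', 'nand2']]
```

Removing the reported loops from the graph leaves an acyclic structure whose pieces are the combinational subcircuits to be sized.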
It must be mentioned here that for our experimental results, the approximate Hessian for finding the Newton direction was maintained using Scheme (b) of Method 2 described in Section III.
A set of test circuits, described in Table 1, was used to evaluate the performance of iCONTRAST. The entries under "unsized area" and "unsized delay" correspond to the area and delay when all transistors in the circuit are set to the minimum size. In the case of the sequential circuit, the delay refers to the maximum stage delay for the circuit. It may be noted that the word "area" refers to the sum of transistor sizes. The technology parameters used here correspond to a submicron technology. The number of iterations for each circuit was of the order $O(n)$. For these circuits, the initial polytope was taken to be a box with the minimum transistor size being 1.8, and the maximum size being 500. Table 2 shows the area of the circuit after it has been sized by iCONTRAST to meet a delay specification, $T_{spec}$, and the execution time on a Sun SPARCstation I. Since our method solves the underlying convex programming problem exactly, the areas shown here correspond to the globally optimum solution to the transistor sizing problem, with an accuracy that is dictated by the tightness of the user-specified termination criterion. The number of iterations and the memory requirement for each circuit are also shown. In the case of the sequential circuit, Seq, the number of iterations corresponds to the maximum number of iterations required to size any combinational subcircuit.
Consider, for example, the results on the example circuit, Add8. As seen in Table 1, the unsized area and delay for this circuit are 374.4 µm and 109.8 ns, respectively. The area penalty required to achieve a relatively loose delay specification such as the first one, 100 ns, is not very large; the active area of the sized circuit is only 11% larger than that of the unsized circuit. As the delay specification becomes tighter, the area penalty increases non-linearly; to achieve a delay specification of 40 ns, the active area of the sized circuit is 182% larger than that of the unsized circuit. A similar trend is visible for each of the other example circuits in Table 2.
The number of iterations and the memory requirement are seen to increase slightly in most cases with the tightness of the delay specification. For the largest circuit, however, the number of iterations is seen to be roughly independent of the delay specification.
None of these results violates the theoretical prediction that the order of magnitude of the number of iterations for a given circuit is dependent only on the size of the initial polytope (which was the same for all circuits) and is independent of the delay specification.
The basis for this prediction lies in the fact that the volume of the polytope is roughly halved in each iteration; hence, the volume of the polytope containing the solution is roughly the same after the same number of iterations, regardless of where the solution lies within the initial polytope.

In a comparison with the optimization algorithm of TILOS [3], [6], [23], using the same delay models for both algorithms, it was found that when the delay specification was loose, the area of the TILOS-sized circuit was close (within a few tenths of a percent) to the optimal one obtained using the iCONTRAST algorithm. However, as the delay specification was made tighter, it was observed that the TILOS solution moved away from the optimal one; in some cases, the area achieved by iCONTRAST was under 1/3 of that given by TILOS [23].
In comparison with TILOS, both the CPU time and the memory requirements were found to be larger; however, the improvement in the quality of the solution provided by iCONTRAST could be considerable, since the global optimum is guaranteed by this algorithm.

... between the primary inputs and the primary output; the delays along both paths, i.e., the rise and the fall delays at the output node, are equal after sizing, as expected. For relatively loose delay specifications, it is seen that only the last stages are made larger, while those towards the input remain relatively unaffected. As $T_{spec}$ (given in ns) is made tighter, it is seen that in addition to affecting the transistors at the output stages, the sizes of the transistors that are closer to the input are also significantly increased. The sizes of transistors in the input stage are restricted by the contribution of the user-specified resistance of the source that drives the first stage. The variation of sizes in the n-transistor stages is illustrated in Fig. 5a; the variation of p-transistor sizes, shown in Fig. 5b, follows the same trend as the n-transistor stages.
It should be noted that in this circuit, since the number of n-transistors (p-transistors) in the two paths is not equal, the nature of the variation in transistor sizes is somewhat different from a circuit such as an 8-inverter chain, which has equal numbers of n-transistors (p-transistors) on each path. To illustrate this, note that the path with the larger unsized delay goes through p-transistors 1, 3, 5 and 7. Hence, in the sized circuit, where both path delays are equal, it is seen that p-transistors 3 and 5 contribute to the disruption of the smoothness of the curve by being larger than their interpolated values.
Transistors 1 and 7, being at the primary input and primary output respectively, are influenced by other considerations (namely, the input resistance and the output capacitance, respectively), and therefore such effects are not visible.
However, a caveat is in order here: the above considerations are not the only reason for nonsmoothness of the curve; the curve for an 8-inverter chain is seen to be nonsmooth too. Also, one should curb an instinctive tendency to compare these variations with the smooth exponential variations of Mead and Conway [24], since the two problems are not the same. The Mead-Conway problem principally differs from ours in the following respects:
(a) The objective of their problem is to minimize the number of stages and the circuit delay. In our problem, the circuit topology, and hence the number of stages, is fixed.
(b) The Mead-Conway approach uses a simpler delay model.

... like Add2 that is composed of complex gates, the accuracy can be seen to deteriorate slightly, but remains close to SPICE. It is clear from the data displayed here that the enhancements in our algorithm provide a considerable improvement.
VI. CONCLUSION
In this paper, we have presented a convex programming approach to solving the transistor sizing problem. This approach is guaranteed to find the global minimum solution to the problem. Any of the commonly-specified forms of the transistor sizing problem can be handled by this approach; we have illustrated the algorithm on the most useful form, given in Eq. (1). A major advantage is that the delay constraints do not need to be explicitly stated. Ensuring that the delay of the circuit satisfies the specification is equivalent to ensuring that the delay along each path of the circuit satisfies the specification; since the number of paths in the circuit could be exponentially large, the number of constraints could also be exponential. A conventional technique, such as Lagrangian multipliers, would not be able to solve a problem with such a large number of constraints in a reasonable time.
The complexity of the algorithm is dependent on the number of variables, the size of the initial polytope, and the termination criterion, and is independent of the number of convex constraints. Moreover, the discontinuities in the circuit delay function do not require special treatment from the algorithm, as they do in many other transistor sizing algorithms such as [7].
A new delay estimation algorithm that takes waveform slopes into account and calculates the worst-case delays is also presented. Experimental comparisons with SPICE show that the enhancements made by this approach over previous approaches afford a large improvement in the quality of the solution.
The algorithm was implemented as a C program on a Sun SPARCstation I, and results on purely combinational circuits with up to 832 transistors, and on a sequential circuit, have been presented here.
