This paper considers simultaneous gate and wire sizing for general VLSI circuits under the Elmore delay model. We present a fast and exact algorithm which can minimize total area subject to a maximum delay bound. The algorithm can be easily modified to give exact algorithms for optimizing several other objectives (e.g. minimizing maximum delay or minimizing total area subject to arrival time specifications at all inputs and outputs). No previous algorithm for simultaneous gate and wire sizing can guarantee exact solutions for general circuits. Our algorithm is an iterative one with a guarantee on convergence to global optimal solutions. It is based on Lagrangian relaxation and "one-gate/wire-at-a-time" local optimizations, and is extremely economical and fast. For example, we can optimize a circuit with 27,648 gates and wires in about 36 minutes using under 23 MB memory on an IBM RS/6000 workstation.
Introduction
Since the invention of integrated circuits almost 40 years ago, gate sizing has always been an effective technique to achieve desirable circuit performance. As technology continues to scale down, the total number of gates and interconnects within a die grows into the millions. In such increasingly dense integrated circuits, a significant portion of the total circuit delay comes from the interconnects. Therefore, developing efficient algorithms which can handle large scale gate and interconnect optimization problems is of great importance.
In the past, gate delay was the dominant factor in determining circuit performance. Thus, gate and transistor sizing have been extensively studied in the literature [6, 12, 15, 20]. As interconnect delay plays an increasingly important role in determining circuit performance, wire sizing has been an active research topic in the past few years [2, 4, 7, 9, 17, 19].
Since gate sizes affect wire-sizing solutions and wire sizes affect gate-sizing solutions, it is beneficial to simultaneously size both gates and wires. Several results on simultaneous gate and wire sizing have been reported [2, 7, 8, 16, 18, 20].

*This work was partially supported by the Texas Advanced Research Program under Grant No. 003658288 and by a grant from the Intel Corporation. The current address of Chung-Ping Chen is Intel Corporation, 2111 N.E. 25th Ave., Hillsboro, OR 97124-5961. ICCAD98, San Jose, CA, USA.
[8] studied simultaneous driver and wire sizing and [2] considered simultaneous wire and buffer sizing, but both works only apply to circuits that are of tree topology. For simultaneous gate and wire sizing for general circuits, [18] uses a least-squares optimization technique, [16] employs a sequential quadratic programming approach, and [7] uses a greedy sizing technique in conjunction with dynamic programming. But none of these algorithms can guarantee exact solutions for objectives such as minimizing total area subject to a maximum delay bound or minimizing maximum delay.
In this paper, we consider simultaneous gate and wire sizing for general VLSI circuits under the Elmore delay model. We present a fast and exact algorithm which can minimize total area subject to a maximum delay bound. The algorithm can be easily modified to give exact algorithms for optimizing several other objectives (e.g. minimizing maximum delay or minimizing total area subject to arrival time specifications at all inputs and outputs). Our algorithm is an iterative one with a guarantee on convergence to global optimal solutions. It is based on Lagrangian relaxation and "one-gate/wire-at-a-time" local optimization, and is extremely economical and fast. For example, we can optimize a circuit with 27,648 gates and wires in about 36 minutes using under 23 MB memory on an IBM RS/6000 workstation.
The problem in this paper is formulated as a geometric program [10]. Note that the transistor sizing problem is similar to our problem and was also formulated as a geometric program a long time ago [12]. However, it would be very slow to solve it with a general-purpose geometric programming solver. So instead of solving it exactly, [12] proposed TILOS, which is based on an efficient sensitivity-based heuristic. Years later, [20] transformed the geometric program into a convex program and solved it with a sophisticated general-purpose convex programming solver based on the interior point method. This is the best known previous algorithm that can guarantee exact transistor sizing solutions. However, because we exploit the special structure of the geometric program, our tailored algorithm is much faster than algorithms using general-purpose solvers as in [20]. For example, the largest test circuit in [20] has 832 transistors and the reported runtime and memory are 9 hours (on a Sun SPARCstation 1) and 11 MB, respectively. For a problem of similar size (864), our approach only needs 7 seconds of runtime (on an RS/6000 workstation) and 1.15 MB of memory.
The rest of this paper is organized as follows. In Section 2, we will introduce some notations and terminology that we will use in this paper. In Section 3, we will present our algorithm for the problem of minimizing total area subject to maximum delay bound. In Section 4, we will show how to modify our algorithm to minimize maximum delay, to handle arrival time specifications at all inputs and outputs, to consider power consumption and to use a more accurate gate model. In Section 5, experimental results of our algorithms are presented.
Preliminaries
In this section, we will define some notations and terminology that we will use in this paper.
For a general VLSI circuit, we can ignore all latches and optimize its combinational subcircuits. Therefore, we will focus on combinational circuits below.
Given a combinational circuit with $s$ input drivers, $t$ output loads, and $n$ gates or wire segments, the gate sizes or the segment widths are allowed to be varied in order to optimize some objective. For $1 \le i \le s$, let $R^D_i$ be the driver resistance of the $i$th input driver. For $1 \le i \le t$, let $C^L_i$ be the load capacitance of the $i$th output load. See Figure 1 for an illustration of a circuit. Note that it is reasonable to assume that the gates are of bounded fanin. Hence $s = O(n)$ and $t = O(n)$.
[Figure 1: An illustration of a circuit, showing the input drivers and the sizable gates or wire segments.]

A gate, a wire segment, or an input driver is called a component. In order to unify the notations that we will introduce later, imagine that two factitious components are added to the circuit as shown in Figure 2. The first one is called an output component, which consists of all the $t$ output loads. The second one is called an input component, which connects to all the $s$ input drivers. Let a node be a connection point between two components or the output point of the output component. Note that the output of each component should connect to a distinct node. So it is easy to see that there are $n + s + 2$ components and $n + s + 2$ nodes.
Let $m = n + s + 1$. We label the nodes by indexes $0, \ldots, m$ as follows. The node with index 0 is the output point of the output component. For $1 \le i \le t$, the node with index $i$ is the one connecting to the $i$th output load. For $t + 1 \le i \le n$, the node with index $i$ is a connection point among the gates and wire segments. The indexes are assigned in such a way that if node $i$ and node $j$ are connected to an input and the output of some component respectively, then $i > j$. For $n + 1 \le i \le n + s$, the node with index $i$ is the one connecting to the $(i - n)$th input driver. The node with index $m$ is the output point of the input component. It is not difficult to see that if we view the circuit as a directed acyclic graph, the node index assignment is a reverse topological ordering of the graph. We also label the components by indexes $0, \ldots, m$ such that the output of the component with index $i$ is connected to node $i$. See Figure 2 for an illustration of the circuit in Figure 1 with factitious components, node indexes and component indexes.

Let $G$, $W$ and $D$ denote the sets of indexes of the gates, the wire segments and the input drivers respectively, and for each component $i$ let input($i$) denote the set of indexes of the components directly connected to its inputs. If $i \in G$, then let $x_i$ be the gate size, $r_i$ be the output resistance of the gate and $c_i$ be the input capacitance of a pin of the gate. (To simplify the notations, we assume without loss of generality that the input capacitances of all input pins of a gate are the same.) Let $\hat{r}_i$ and $\hat{c}_i$ be respectively the unit size output resistance and the input capacitance per unit size of gate $i$. Then $r_i = \hat{r}_i / x_i$ and $c_i = \hat{c}_i x_i$.

If $i \in W$, then let $x_i$ be the segment width, $r_i$ be the segment resistance and $c_i$ be the segment capacitance. Let $\hat{r}_i$, $\hat{c}_i$ and $f_i$ be respectively the unit width wire resistance, the wire area capacitance per unit width and the wire fringing capacitance of segment $i$. Then $r_i = \hat{r}_i / x_i$ and $c_i = \hat{c}_i x_i + f_i$. For $i \in G \cup W$, let $L_i$ and $U_i$ be respectively the lower bound and upper bound of the value of $x_i$, i.e. $L_i \le x_i \le U_i$.
For the purpose of delay calculation, we model components as RC circuits. A gate is modeled as a switch-level RC circuit as shown in Figure 3. (For simplicity, we ignore the intrinsic gate delay in the model. It is easy to see that all our results will still hold even if intrinsic delay is considered.) A wire segment is modeled as a π-type RC circuit as shown in Figure 4.

[Figure 3: The model of a gate by a switch-level RC circuit. Although the gate shown here is a 2-input AND gate, the model can be easily generalized for any gate with any number of input pins.]

The Elmore delay model [11] is used for delay calculation. Basically, the Elmore delay along a signal path is the sum of the delays associated with the resistors in the path, where the delay associated with a resistor is equal to its resistance times its downstream capacitance. For our case, each component (except the 2 factitious components) contains a resistor. We label the resistors by indexes $1, \ldots, n + s$ such that resistor $i$ is the one inside component $i$. For convenience, for $i \in D$, let $r_i = R^D_{i-n}$ (i.e. the driver resistance of the $(i - n)$th input driver). So for $i \in G \cup W \cup D$, the resistance of resistor $i$ is $r_i$. For $i \in G \cup W \cup D$, let $C_i$ be the downstream capacitance of resistor $i$. Figure 5 shows the circuit in Figure 2 after replacing the components by the RC models. The resistance of each resistor is marked in the figure. Also, the regions corresponding to the downstream capacitances of resistor 5 and resistor 12 are shaded.
Let $D_i = r_i C_i$ be the delay associated with resistor $i$. We represent a signal path passing through resistors $i_1, \ldots, i_k$ by the set $p = \{i_1, \ldots, i_k\}$. Let $P$ be the set of all possible paths from node $m$ to node 0 (i.e. from an input driver to an output load). Then for any $p \in P$, the Elmore delay along path $p$ is $\sum_{i \in p} D_i$.
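The delay definition above can be illustrated with a minimal sketch for the simplest possible path: one input driver, a chain of wire segments and one output load. The function names, the chain topology and the numeric values are assumptions made for illustration only; a general circuit would need the downstream capacitances computed over the whole graph.

```python
def wire_rc(width, unit_res, unit_cap, fringe_cap):
    """Resistance and capacitance of a wire segment of the given width."""
    return unit_res / width, unit_cap * width + fringe_cap


def elmore_chain_delay(driver_res, segments, load_cap):
    """Elmore delay from an input driver through a chain of wire segments.

    segments is a list of (r, c) pairs ordered from the driver to the load;
    each segment is a pi-type RC circuit with c/2 at either end of r.
    """
    downstream = sum(c for _, c in segments) + load_cap
    delay = driver_res * downstream      # the driver resistor sees every capacitor
    for r, c in segments:
        downstream -= c / 2              # near-end half is upstream of this resistor
        delay += r * downstream          # D_i = r_i * C_i
        downstream -= c / 2              # far-end half is upstream of the next resistor
    return delay


# Two segments of widths 1 and 2 (hypothetical unit values).
segments = [wire_rc(w, unit_res=2.0, unit_cap=1.0, fringe_cap=0.0) for w in (1.0, 2.0)]
chain_delay = elmore_chain_delay(1.0, segments, load_cap=1.0)
```

Widening a segment lowers its own resistance but raises the capacitance seen by everything upstream of it, which is the tradeoff the sizing problem has to balance.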
Minimizing total area subject to maximum delay bound
In this section, we will solve the problem of minimizing the total component area with respect to the component sizes $x_1, \ldots, x_n$ subject to the constraint that the maximum delay from any input driver to any output load is at most some constant $A_0$ (i.e. $A_0$ is a bound on the arrival time at node 0). We will formulate the problem as a constrained optimization problem and then solve it using Lagrangian relaxation. Lagrangian relaxation is a general technique for solving constrained optimization problems. We outline the basic idea of Lagrangian relaxation below. More details can be found in [1, 13, 14].
We call the constrained optimization problem to be solved the primal problem (PP). In Lagrangian relaxation, "troublesome" constraints in PP are "relaxed" and incorporated into the objective function after multiplying them by constants called Lagrange multipliers, one multiplier for each constraint. For each fixed vector $\lambda$ of the Lagrange multipliers introduced, we have a new optimization problem (which should be easier to solve because it is free of troublesome constraints) called the Lagrangian relaxation subproblem associated with $\lambda$ (LRS/$\lambda$).
It can be shown that there exists a vector $\lambda$ such that the optimal solution of LRS/$\lambda$ is also the optimal solution of the original problem PP. The problem of finding such a vector $\lambda$ is called the Lagrangian dual problem (LDP). So if we can solve both LRS/$\lambda$ and LDP, then the optimal solution of PP will be given by LRS/$\lambda$ where $\lambda$ is the optimal solution of LDP.
In Section 3.1, we will show how to formulate the gate and wire sizing problem as a constrained optimization problem with a polynomial number of constraints. This formulation is our primal problem (PP). In Section 3.2, we will show how PP is relaxed to obtain LRS/$\lambda$. We will use the Kuhn-Tucker conditions (see [1] for a reference) to greatly simplify LRS/$\lambda$. We call the simplified version LRS/$\mu$. In Section 3.3, we will show how to solve LRS/$\mu$ (i.e. LRS/$\lambda$) for any fixed vector $\mu$. In Section 3.4, we will show how to solve the LDP by the classical method of subgradient optimization.
Due to space limitation, all the proofs in this section have been omitted. They can be found in [5].
Problem formulation
The total component area can be written as $\sum_{i=1}^{n} \alpha_i x_i$ for some constants $\alpha_1, \ldots, \alpha_n$. So the problem of minimizing total area subject to maximum delay bound can be formulated directly as a mathematical program that minimizes $\sum_{i=1}^{n} \alpha_i x_i$ subject to $\sum_{i \in p} D_i \le A_0$ for every path $p \in P$ and $L_i \le x_i \le U_i$ for $i \in G \cup W$. However, the number of possible signal paths from node $m$ to node 0 (and hence the number of constraints in this mathematical program) can be exponential in $n$. So this direct formulation is impractical.
This difficulty can be handled by the classical technique of partitioning the constraints on path delay into constraints on delay across components. We associate a variable $a_i$ to each node $i$. $a_i$ represents the arrival time at node $i$ (i.e. the maximum delay from node $m$ to node $i$). Then it is not difficult to see that the path delay constraints can be replaced by constraints that relate the arrival times at the input and output nodes of each component to the delay $D_i$ of that component. The resulting mathematical program, which we call the primal problem (PP), is equivalent to the path-based program above.
Note that the number of constraints in PP is linear in $n$. Also note that for the problem PP, the objective function and the constraints can be rewritten in the form of posynomials [10].
It is well known that under a variable transformation, the problem is convex. So PP has a unique global minimum and no other local minimum. We will see how to solve PP in the following.
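As a sketch of the two formulations discussed in this subsection, the programs can be written out as follows. The path-based program follows directly from the definitions of $D_i$, $P$ and $A_0$. The component-level constraint set shown for PP is an assumption reconstructed from the surrounding definitions (arrival times $a_i$, component delays $D_i$, the sets $G$, $W$, $D$, and input($i$) for the components feeding component $i$), not a transcription of the original listing.

```latex
% Path-based formulation: one delay constraint per signal path
\[
\begin{aligned}
\text{minimize}\quad   & \sum_{i=1}^{n} \alpha_i x_i \\
\text{subject to}\quad & \sum_{i \in p} D_i \le A_0, && p \in P, \\
                       & L_i \le x_i \le U_i,        && i \in G \cup W.
\end{aligned}
\]

% Node-based formulation PP (assumed form): one constraint per component input
\[
\begin{aligned}
\text{minimize}\quad   & \sum_{i=1}^{n} \alpha_i x_i \\
\text{subject to}\quad & a_j \le A_0,       && j \in \mathrm{input}(0), \\
                       & a_j + D_i \le a_i, && i \in G \cup W,\ j \in \mathrm{input}(i), \\
                       & D_i \le a_i,       && i \in D, \\
                       & L_i \le x_i \le U_i, && i \in G \cup W.
\end{aligned}
\]
```

Under this form each component contributes one constraint per input pin, so with bounded fanin the number of constraints grows linearly in $n$, in contrast to the possibly exponential number of paths.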
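Lagrangian Relaxation
Following the Lagrangian relaxation procedure, we introduce a non-negative value called the Lagrange multiplier for each constraint on arrival time. For $j \in$ input(0) (i.e. $j = 1, \ldots, t$), we introduce $\lambda_{j0}$ for the constraint $a_j \le A_0$. A multiplier is introduced in the same way for each of the remaining arrival time constraints, including those of the input drivers $i \in D$. Let $L_\lambda(x, a)$ be the objective function of PP plus the sum of the relaxed constraints, each multiplied by its Lagrange multiplier; we refer to this expression as (1). Then the Lagrangian relaxation subproblem associated with the Lagrange multipliers $\lambda$ will be:

LRS/$\lambda$: Minimize $L_\lambda(x, a)$
Subject to $L_i \le x_i \le U_i$, $i \in G \cup W$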
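Let $(x^*, a^*)$ be the optimal solution of PP. By the Kuhn-Tucker conditions, if the optimal solution of LRS/$\lambda$ is also the optimal solution of PP, then $\lambda$ must satisfy the conditions $\partial L_\lambda / \partial a_i \, (x^*, a^*) = 0$ for $1 \le i \le n + s$. Therefore, we can consider only those $\lambda$ satisfying these conditions. By rearranging (1), we can write $L_\lambda(x, a)$ as a sum of terms that are linear in the $a_i$'s plus terms independent of all $a_i$'s. So $\partial L_\lambda / \partial a_i = 0$ for $1 \le i \le n + s$ implies optimality conditions on the Lagrange multipliers $\lambda$, which we refer to as (2): the coefficient of each $a_i$ in (1) must vanish.

Let $\Omega_\lambda = \{\lambda \ge 0 : \lambda \text{ satisfies (2)}\}$. We observe that by considering only those $\lambda$ in $\Omega_\lambda$ and substituting (2) back into (1), we can greatly simplify the objective function $L_\lambda(x, a)$, and hence the problem LRS/$\lambda$. This is summarized in the following lemma.

Lemma 1 For any $\lambda \in \Omega_\lambda$, the optimal $x$ of LRS/$\lambda$ is the same as the optimal $x$ of

LRS/$\mu$: Minimize $L_\mu(x)$
Subject to $L_i \le x_i \le U_i$, $i \in G \cup W$

where $\mu = (\mu_0, \ldots, \mu_{n+s})$ with $\mu_i = \sum_{j \in \mathrm{input}(i)} \lambda_{ji}$, and $L_\mu(x)$ is the simplified form of $L_\lambda(x, a)$ obtained from the substitution.

To solve LRS/$\lambda$, we can solve LRS/$\mu$ to find the optimal $x$. Then the optimal $a$ can be found by considering the variables $a_i$ one by one in the order of decreasing $i$. For each $a_i$, we set it to the smallest possible value that satisfies the constraints of PP.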
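To make the simplification concrete, the following sketch writes out the Lagrangian, the resulting multiplier conditions and the simplified objective. It assumes that the relaxed constraints of PP have the form $a_j \le A_0$ for $j \in$ input(0), $a_j + D_i \le a_i$ for $i \in G \cup W$ and $j \in$ input($i$), and $D_i \le a_i$ for $i \in D$. The multiplier name $\lambda_{mi}$ for the driver constraints and the set output($i$) (the components fed by component $i$) are likewise assumptions introduced for illustration.

```latex
% (1) Lagrangian under the assumed constraint set
\[
L_\lambda(x,a) = \sum_{i=1}^{n} \alpha_i x_i
  + \sum_{j \in \mathrm{input}(0)} \lambda_{j0}\,(a_j - A_0)
  + \sum_{i \in G \cup W}\ \sum_{j \in \mathrm{input}(i)} \lambda_{ji}\,(a_j + D_i - a_i)
  + \sum_{i \in D} \lambda_{mi}\,(D_i - a_i)
\]

% (2) the coefficient of every a_i vanishes, with input(i) = {m} for i in D
\[
\sum_{k \in \mathrm{output}(i)} \lambda_{ik} \;=\; \sum_{j \in \mathrm{input}(i)} \lambda_{ji},
\qquad 1 \le i \le n+s
\]

% Substituting (2) into (1) removes every a_i
\[
L_\mu(x) = \sum_{i=1}^{n} \alpha_i x_i
  + \sum_{i \in G \cup W \cup D} \mu_i D_i \;-\; \mu_0 A_0,
\qquad \mu_i = \sum_{j \in \mathrm{input}(i)} \lambda_{ji}
\]
```

Read this way, condition (2) resembles a flow conservation law on the circuit graph: the total multiplier entering a component equals the total multiplier leaving it. The constant $-\mu_0 A_0$ does not affect the minimizing $x$.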
Solving LRS/$\mu$
In this subsection, for any fixed $\mu \ge 0$, we will show how to solve LRS/$\mu$ optimally by a greedy algorithm based on iteratively re-sizing the gates and wire segments. Similar techniques have been successfully applied to some other wire or buffer sizing problems before (e.g. [3, 9]).
For $1 \le i \le n$, let upstream($i$) be the set of resistor indexes (excluding $i$) on the path(s) from component $i$ to the nearest upstream gate(s) or input driver(s). If we re-size component $i$ (i.e. change $x_i$) while keeping the sizes of all the other components fixed, we say that it is a local re-sizing of component $i$. An optimal local re-sizing of component $i$ is a local re-sizing that minimizes $L_\mu(x)$, and is given by Lemma 2.
LRS/$\mu$ can be solved by a greedy algorithm based on iteratively re-sizing the components. In each iteration, the components are examined one at a time; each time a component is re-sized optimally using Lemma 2 while keeping the sizes of the other components fixed. We call the algorithm SOLVE_LRS/$\mu$ and it is described below. Note that in order to use Lemma 2 to re-size component $i$, we need to compute $R_i$ and $C_i$ first. Our algorithm SOLVE_LRS/$\mu$ computes the $C_i$'s and $R_i$'s incrementally by traversing the circuit in a reverse topological order (step 2) and in a topological order (step 3) respectively. So it is not difficult to see that each iteration of the algorithm takes only $O(n)$ time. Note that $L_\mu(x)$ is a posynomial [10] in $x$. It is well known that under a variable transformation, a posynomial is equivalent to a convex function. So $L_\mu(x)$ has a unique global minimum and no other local minimum. We can prove the following theorem, which says that algorithm SOLVE_LRS/$\mu$ always converges to the global minimum.
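[Algorithm SOLVE_LRS/$\mu$: step 2 computes the $C_i$'s in a reverse topological traversal; step 3 computes the $R_i$'s and re-sizes the components in a topological traversal.]

Theorem 1 For any fixed vector $\mu \ge 0$, algorithm SOLVE_LRS/$\mu$ always converges to the optimal component-sizing solution of the problem LRS/$\mu$.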
Algorithm SOLVE_LRS/$\mu$ runs in $O(rn)$ time using $O(n)$ storage, where $n$ is the number of sizable components and $r$ is the number of iterations. We observe that $r$ is constant (i.e. the run time of SOLVE_LRS/$\mu$ is linear) in practice.
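As a hedged illustration of the iterative local re-sizing idea, the sketch below applies it to a deliberately simple case: one input driver, a chain of wire segments with π-models, and one load. The closed-form update is derived for this chain only, by collecting the terms of $L_\mu(x) = \sum \alpha_i x_i + \sum \mu_i r_i C_i$ that depend on $x_i$, which gives a function of the form $A x_i + B / x_i$ plus a constant. It is not a transcription of Lemma 2, and all function and parameter names are hypothetical. For brevity it recomputes the upstream and downstream sums from scratch, so each sweep costs $O(n^2)$ rather than the $O(n)$ obtained by incremental traversal.

```python
import math


def lagrangian(x, r_hat, c_hat, fringe, alpha, mu, mu_drv, r_drv, c_load):
    """L_mu(x) = sum alpha_i x_i + sum mu_i r_i C_i for driver -> wire chain -> load."""
    n = len(x)
    cap = [c_hat[i] * x[i] + fringe[i] for i in range(n)]
    res = [r_hat[i] / x[i] for i in range(n)]
    total = sum(alpha[i] * x[i] for i in range(n))
    total += mu_drv * r_drv * (sum(cap) + c_load)      # driver sees all capacitance
    for i in range(n):
        downstream = cap[i] / 2 + sum(cap[i + 1:]) + c_load
        total += mu[i] * res[i] * downstream
    return total


def solve_lrs_chain(r_hat, c_hat, fringe, alpha, mu, mu_drv, r_drv, c_load,
                    lower, upper, tol=1e-12, max_iters=1000):
    """Greedy one-segment-at-a-time re-sizing until the widths stop changing."""
    n = len(r_hat)
    x = [lower] * n
    for _ in range(max_iters):
        largest_change = 0.0
        for i in range(n):
            # Multiplier-weighted resistance upstream of segment i.
            up_res = mu_drv * r_drv + sum(mu[k] * r_hat[k] / x[k] for k in range(i))
            # Capacitance downstream of resistor i that does not depend on x_i.
            down_cap = fringe[i] / 2 + c_load + sum(
                c_hat[j] * x[j] + fringe[j] for j in range(i + 1, n))
            # Minimizer of A*x + B/x, clamped to the allowed width range.
            best = math.sqrt(mu[i] * r_hat[i] * down_cap / (alpha[i] + c_hat[i] * up_res))
            best = min(upper, max(lower, best))
            largest_change = max(largest_change, abs(best - x[i]))
            x[i] = best
        if largest_change < tol:
            break
    return x


params = dict(r_hat=[1.0] * 3, c_hat=[1.0] * 3, fringe=[0.5] * 3, alpha=[1.0] * 3,
              mu=[2.0] * 3, mu_drv=2.0, r_drv=1.0, c_load=1.0)
widths = solve_lrs_chain(lower=0.1, upper=10.0, **params)
```

Because each single-variable subproblem has the form $A x + B / x$ with non-negative $A$ and $B$, every local re-sizing step can only decrease the objective, which is what makes the sweep a descent method.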
Solving the LDP
Define the function $Q(\lambda)$ = the optimal value of the problem LRS/$\lambda$.
In this subsection, we will consider the Lagrangian dual problem:

LDP: Maximize $Q(\lambda)$
Subject to $\lambda \in \Omega_\lambda$

As said in Section 3.1, PP can be transformed into a convex problem. So Theorem 6.2.4 of [1] implies that if $\lambda$ is the optimal solution of LDP, then the optimal solution of LRS/$\lambda$ will also optimize PP.
By Theorem 6.3.1 of [1], $Q(\lambda)$ is a concave function over $\lambda \ge 0$. However, $Q(\lambda)$ is not differentiable in general. So methods like steepest descent, which depend on the gradient directions, are not applicable. The subgradient optimization method is usually used instead. The subgradient optimization method can be viewed as a generalization of the steepest descent method in which the gradient direction is substituted by a subgradient-based direction (see [1] for a reference).
Basically, starting from an arbitrary point $\lambda$, the method iteratively moves from the current point to a new point following the subgradient direction. At step $k$, we first solve LRS/$\lambda$ (by solving the simpler LRS/$\mu$).
Then for each relaxed constraint, we define the subgradient to be the amount by which the constraint is violated, i.e. the left hand side minus the right hand side of the constraint, evaluated at the current solution, so that the multiplier of a violated constraint increases. The subgradient direction is the vector of all the subgradients. We move to a new point by multiplying the subgradient direction by a step size $\rho_k$ and adding it to $\lambda$. After each move, we project $\lambda$ back to the nearest point in $\Omega_\lambda$ so that we can solve LRS/$\mu$ instead of LRS/$\lambda$ for the next iteration. The procedure is repeated until it converges.
It is well known (see Theorem 8.9.2 of [1]) that if the step size sequence $\{\rho_k\}$ satisfies the conditions $\lim_{k \to \infty} \rho_k = 0$ and $\sum_{k=1}^{\infty} \rho_k = \infty$, then the subgradient optimization method will always converge to the optimal solution.
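The step size conditions can be tried out on a toy problem. The sketch below is an illustration under stated assumptions, not the paper's algorithm: it sizes a single component with delay $k / x$ and area $\alpha x$ under one delay bound, uses the step sizes $\rho_k = 1/k$ (which tend to zero while their sum diverges), and projects only onto $\lambda \ge 0$, since with a single constraint there is no condition (2) to project onto. All names and numbers are hypothetical.

```python
import math


def solve_relaxed(lam, alpha, k, lower, upper):
    """Minimize alpha*x + lam*(k/x - bound) over lower <= x <= upper (closed form)."""
    return min(upper, max(lower, math.sqrt(lam * k / alpha)))


def subgradient_dual(alpha, k, bound, lower, upper, lam=0.25, iterations=5000):
    """Toy dual ascent for: minimize alpha*x subject to k/x <= bound."""
    for step in range(1, iterations + 1):
        x = solve_relaxed(lam, alpha, k, lower, upper)
        violation = k / x - bound               # subgradient of the dual at lam
        rho = 1.0 / step                        # rho -> 0 while sum(rho) diverges
        lam = max(0.0, lam + rho * violation)   # move, then project onto lam >= 0
    return lam, solve_relaxed(lam, alpha, k, lower, upper)


# With alpha = 1, k = 4 and bound = 2 the constrained optimum is x = 2.
multiplier, size = subgradient_dual(alpha=1.0, k=4.0, bound=2.0, lower=0.5, upper=10.0)
```

The multiplier rises while the delay bound is violated and falls while there is slack, so the iterates settle where the bound is met with equality.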
The description is summarized in the algorithm SOLVE_LDP below.
Theorem 2 The algorithm SOLVE_LDP always converges to the optimal solution of LDP.
We conclude Section 3 by giving the algorithm SGWS-LR (Simultaneous Gate and Wire Sizing by Lagrangian Relaxation) below.
Theorem 3 For simultaneous gate and wire sizing, the problem of minimizing total area subject to maximum delay bound can be solved optimally by SGWS-LR.
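ALGORITHM SOLVE_LDP
Output: $\lambda$ which maximizes the optimal value of LRS/$\lambda$
1. $k := 1$ /* step counter */
   $\lambda$ := arbitrary initial vector in $\Omega_\lambda$
2. $\mu = (\mu_0, \ldots, \mu_{n+s})$ where $\mu_i = \sum_{j \in \mathrm{input}(i)} \lambda_{ji}$
3. Call SOLVE_LRS/$\mu$ to solve LRS/$\mu$, and compute the arrival times $a_i$
4. Move $\lambda$ along the subgradient direction with step size $\rho_k$
5. Project $\lambda$ onto the nearest point in $\Omega_\lambda$
6. $k := k + 1$
7. Repeat steps 2-6 until the gap between the current solution and $Q(\lambda)$ is within the error bound.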
ALGORITHM SGWS-LR
Output: the optimal gate and wire sizing solution $x$
1. Call SOLVE_LDP to find the optimal $\lambda$.
2. $\mu = (\mu_0, \ldots, \mu_{n+s})$ where $\mu_i = \sum_{j \in \mathrm{input}(i)} \lambda_{ji}$
3. Call SOLVE_LRS/$\mu$ to find the optimal $x$.
Extensions
In this section, we will show how to extend our approach to handle problems with other objectives and with other constraints. For all the extensions, as we will see, only slight modifications to our algorithms presented in Section 3 are needed. Moreover, convergence to global optimal solutions is still guaranteed. Actually, it is not difficult to see that any combination of the problem in Section 3 and the extensions here can be handled similarly. For example, we can optimally solve the problem of minimizing power subject to bounds on area and on maximum delay from any input to any output.
Minimizing Maximum Delay
Instead of having a constant bound $A_0$ for the arrival time at node 0, we introduce one more variable $a_0$ to represent the arrival time at node 0 (i.e. the maximum delay), and we want to minimize $a_0$. As in Section 3.1, the problem can be formulated as a mathematical program over $x$ and the arrival times, now with $a_0$ as the objective and with $a_0$ in place of the constant $A_0$ in the constraints. It is easy to see that the resulting LRS/$\mu$ can be solved optimally by the iterative local re-sizing algorithm in Section 3.3 and the corresponding LDP can be solved optimally by the subgradient optimization method as described in Section 3.4. Therefore the problem of minimizing maximum delay can also be solved optimally by our approach.
In fact, the problem of minimizing maximum delay subject to an area bound can also be optimally solved by our Lagrangian relaxation approach. The constraint on area can be relaxed and incorporated into the objective function as well. The function $L_\lambda(x, a)$ will be of the same form as the one in Section 3.2.
Arrival Time Specifications on Inputs and Outputs
We show in this subsection that different arrival time specifications on the input and output signals can be easily handled. We demonstrate the idea by considering the problem of minimizing total area subject to different arrival time constraints at inputs and outputs.
For $i \in D$, let $A_i$ be the arrival time specification of the input signal at the $(i - n)$th input driver. For $j \in$ input(0), let $A_j$ be the arrival time requirement on the output signal at the $j$th output load. Then the problem can be formulated as in Section 3.1, with the bound $A_0$ on $a_j$ replaced by $A_j$ for each $j \in$ input(0), and with the signal at the $(i - n)$th input driver arriving at time $A_i$ for each $i \in D$.
We can obtain exactly the same optimality conditions on Lagrange multipliers as (2) in Section 3.2. The problem LRS/$\mu$ is also in exactly the same form as the one in Lemma 1. So LRS/$\mu$ and LDP can be solved as before.
Power Consideration
For each $i$, the power consumption of component $i$ is proportional to its size $x_i$. Therefore, the total power consumption can be written as $\sum_{i=1}^{n} p_i x_i$ for some constants $p_1, \ldots, p_n$. It is of the same form as the total component area. So it is easy to see that it can be handled in exactly the same way as component area.
More Accurate Gate Model
For higher precision timing requirements, more accurate gate models are desirable. Although in Section 2 we model a gate as a switch-level RC circuit with a resistance inversely proportional to the gate size, better gate models can be easily integrated into our algorithm. We now show an example of using a precharacterized function as the delay model for gates.
The following precharacterized delay function $D_i(\cdot)$ and output slope function $T_i(\cdot)$ can capture the input slope effect as well as the diffusion capacitance effect on the delay of gate $i$:

$D_i(x_i, t_i, C_i) = \hat{s}_i + \hat{p}_i t_i + \hat{q}_i x_i + \hat{r}_i C_i / x_i$,
$T_i(x_i, t_i, C_i) = \tilde{s}_i + \tilde{p}_i t_i + \tilde{q}_i x_i + \tilde{r}_i C_i / x_i$,

where $x_i$ is the gate size, $t_i$ is the input rise or fall time of gate $i$, $C_i$ is the capacitance load, and $\hat{s}_i$, $\hat{p}_i$, $\hat{q}_i$, $\hat{r}_i$, $\tilde{s}_i$, $\tilde{p}_i$, $\tilde{q}_i$ and $\tilde{r}_i$ are precharacterized coefficients.
It is not difficult to see that while keeping the size of other components fixed, the input slope $t_i$ is a linear function of $x_i$, since gate $i$ contributes only the linear term $\hat{c}_i x_i$ to its parents' capacitance load. Hence the delay of gate $i$ can be rewritten as follows:

$D_i(x_i, t_i, C_i) = s_i' + q_i' x_i + \hat{r}_i C_i / x_i$

where $s_i' = \hat{s}_i + \hat{p}_i(\tilde{s}_j + \tilde{p}_j t_j + \tilde{q}_j x_j)$, $q_i' = \hat{q}_i + \hat{p}_i \tilde{r}_j \hat{c}_i / x_j$, and component $j$ is the parent of component $i$. It is not hard to see that after the substitution, the delay of gate $i$ still consists of a constant term, a term linear in $x_i$ and a term inversely proportional to $x_i$, so the local re-sizing step keeps the same structure. Hence our algorithm in Section 3 will still converge to the optimal solution under this modification.
Experimental Results
We implemented our algorithms on an IBM RS/6000 workstation. We ran our algorithms on adders of different sizes ranging from 8 bits to 512 bits. The number of gates ranges from 120 to 15,360. The number of wires ranges from 96 to 12,288 (note that the number of wires here means the number of sizable wire segments). The total number of sizable components ranges from 216 to 27,648. The lower bound and upper bound of the size of each gate are 1 and 100 respectively. The lower bound and upper bound of the width of each wire are 1 and 3 μm respectively. The stopping criterion of our algorithm is that the solution is within 1% of the optimal solution.

Table 1 shows the runtime and storage requirements of our algorithm. Even for a circuit with 27,648 sizable components, the runtime and storage requirements of our algorithm are only about half an hour and 23 MB respectively. The maximum delays for the solution of minimum gate and wire sizes, and for our solution, are also listed.

Figure 6 and Figure 7 show the runtime and storage requirements of our algorithm respectively. Figure 6 shows that the runtime increases roughly three times when the circuit size is doubled. Hence the empirical runtime of our program is about $n^{\log 3 / \log 2} \approx n^{1.6}$. Figure 7 shows that the storage requirement is linear in the circuit size. The storage requirement for each sizable component is about 0.8 KB.

Figure 8 shows the convergence sequence of our algorithm SOLVE_LDP on a 128-bit adder. It shows that our algorithm converges steadily to the optimal solution. The solid line and the dotted line represent respectively the upper bound and lower bound of the optimal delay. The lower bound values come from the optimal value of LRS/$\lambda$ at the current iteration. Note that the optimal solution is always in between the upper bound and the lower bound. So these curves provide useful information about the distance between the optimal solution and the current solution, and help users to decide when to stop the program.
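The runtime exponent quoted above follows from a one-line calculation: if the runtime triples each time the circuit size doubles, the runtime grows as $n^{\log_2 3}$. The short sketch below reproduces that arithmetic; the variable and function names are illustrative only.

```python
import math

growth_per_doubling = 3.0                      # observed: runtime triples when n doubles
exponent = math.log2(growth_per_doubling)      # runtime ~ n ** exponent


def predicted_ratio(size_ratio):
    """Runtime ratio predicted when the circuit grows by the factor size_ratio."""
    return size_ratio ** exponent
```

For instance, `predicted_ratio(4.0)` corresponds to two successive doublings and therefore to a ninefold increase in runtime.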
Figure 9 shows the area versus delay tradeoff curve of a 16-bit adder. In our experiment, we observe that to generate a new point on the area and delay tradeoff curve, SOLVE_LDP converges in only about 5 iterations. This is because the $\lambda$ of the previous point is a good approximation for that of the new point, and hence the convergence of SOLVE_LDP is fast. As a result, generating these tradeoff curves requires only a little more runtime but provides valuable information.
[Figure 6: Runtime vs. circuit size.]
[Figure 9: The area vs. delay tradeoff curve for a 16-bit adder.]
