In this paper we describe an algorithm for obtaining a placement of large-scale cell-based ICs subject to performance constraints. The problem is formulated as a constrained programming problem and is solved in two phases: continuous and discrete. Constraints are placed on total path delays rather than on nets, and the behaviour of all the paths is captured. A unified mathematical technique based on Lagrangian Relaxation is used. The algorithm yields good results, as we show on a set of real examples. On average, we are able to obtain up to 15% improvement in the wire delay of these examples with little or no impact on chip area after routing. These improvements are obtained by modifying the placement alone. The acronym RITUAL represents the key idea of our technique: Residual Iterative Technique for Updating All Lagrange multipliers.
Introduction

How Does Physical Layout Affect Performance?
One of the most important trends in silicon technology has been the scaling down of device and line geometries. The minimum feature width of devices that can be etched on silicon has decreased from about 8 microns in the late seventies to about 0.8 microns today. The speed of the metal-oxide semiconductor transistor, which is the basic building block for cells, has increased dramatically, by a factor of 20. Unfortunately, aggressive scaling has resulted in interconnect capacitance becoming the dominant determiner of performance in today's circuits. Informally, a net is the set of wire connections that link a cell to all of its output cells. A cell drives its outputs through interconnect wires belonging to the output net of that cell, and as the wire capacitance increases, the time taken to charge and discharge the net increases. In fact, it has been observed that "the value of capacitance is increasing at a fast pace and promises to be the major performance limiter". In addition, the size of the chips manufactured today has increased, compounding the interconnect delay problem because signals have to travel longer distances from input to output.
Interconnect wires contribute significantly to delay, as the following simple analysis illustrates. The values in this analysis are derived from an industrial cell library. Let $G_a$ be a cell driving a length of interconnect wire that connects $G_a$ to a receiving cell $G_d$. Typical cell delays for 0.8 micron technology are between 0.5 and 0.7 ns. Let us compute the RC delay contribution when the pullup transistor of $G_a$ charges the interconnect wire. The average "on"-resistance of a pullup in performance-optimized 0.8 micron CMOS technology is about 2.0 kΩ. The capacitance per unit length of 0.8 micron aluminium wire is 2.0 pF/cm. Consider a chip 2.0 cm on a side. Assume that the wire connecting $G_a$ to $G_d$ travels across one eighth of the chip width (0.25 cm). The RC delay in such a wire is proportional to 2.0 pF/cm × 0.25 cm × 2000 Ω = 1.0 ns. This value is already as much as the delay through a cell, and the wire delay to cell delay ratio is expected to continue increasing in the future. When we consider that the average interconnect length of a net on a 2.0 cm × 2.0 cm chip using Rent's rule is about 0.25 cm (see [Bakoglu 90a]) and that there are about 15-30 levels of logic in a typical IC, and thus 15-30 nets along a typical path, it is clear that interconnect delay is a significant proportion of the total delay along a path in a circuit.
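The arithmetic of this estimate can be reproduced with a few lines of Python. This is only a sketch of the lumped RC product using the values quoted above; the function and constant names are illustrative and not part of the original analysis.

```python
# Back-of-the-envelope RC delay for the example wire (values from the text).
R_ON_OHMS = 2000.0           # pull-up "on" resistance, 2.0 kOhm
C_PER_CM_FARADS = 2.0e-12    # wire capacitance per unit length, 2.0 pF/cm
CHIP_SIDE_CM = 2.0           # chip is 2.0 cm on a side


def rc_delay_seconds(r_on_ohms, c_per_cm, length_cm):
    """Lumped RC product of the driver resistance and total wire capacitance."""
    return r_on_ohms * c_per_cm * length_cm


wire_length_cm = CHIP_SIDE_CM / 8.0   # one eighth of the chip width
delay_ns = rc_delay_seconds(R_ON_OHMS, C_PER_CM_FARADS, wire_length_cm) * 1e9
print(f"wire length = {wire_length_cm} cm, RC delay = {delay_ns:.2f} ns")
```

The result is about 1.0 ns, which can be compared directly with the 0.5 to 0.7 ns cell delays quoted above.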
Problem Definition
Given a sequential circuit composed of a large number of small cells, the problem is to place the cells on a two-dimensional plane so as to minimize total wire length while satisfying user-specified timing constraints and cell position constraints. The total wire length is a measure of the routability of the circuit. The cell position constraints are required to satisfy the design rules. They may include constraints such as requiring cells to lie on a grid or to be placed in rows. Even simpler formulations of the problem without timing constraints are known to be NP-complete.
The State of the Art
A performance driven computer tool should satisfy at least two goals in order to be useful:
1. It should deliver circuits with predictable performance.
2. It should be efficient, i.e., it should assist in designing faster chips in a short time.
The quest for such a tool in the area of physical layout started receiving attention in the early eighties. In one of the first documented attempts at performance optimization, [Wolff 78] developed techniques for optimizing the power and timing of LSI chips using ideas from physics to place the cells. Connections between cells were modeled as "springs" and a location was found for each cell that minimized its "potential energy". The model was at best a crude approximation of the wirability of the circuit. [Dunlop 84] did pioneering work in this area by designing a system that worked as follows: the circuit was initially laid out and then completely simulated on a computer to determine which input-to-output paths were limiting the performance. The layout was subsequently readjusted and then simulated again. The process was repeated several times until a layout that satisfied the performance and area requirements was obtained. The approach had several problems: (1) simulation was time-consuming, (2) it was not clear at the time how to modify the layout to ensure better performance, and (3) sometimes the iterative process did not converge. Later, [Burstein 85] developed performance-driven circuit partitioning heuristics that resulted in some performance improvements with little loss in the wirability of the resulting circuit. However, the heuristic could not guarantee that the resulting chip met the performance requirements. In 1986, [Teig 86] described a method that interleaves timing analysis with placement and routing steps to successively refine net weights. The net weights are a measure of a net's criticality to timing and are used to bias layout tools.
An important development in the area of performance came from [Hauge 87, Nair 89]. They developed a method of generating constraints on the sizes of nets that connect cells so that performance would be guaranteed. Any physical layout satisfying their bounds on the lengths of nets would also satisfy the timing constraints. However, no technique was given for positioning the cells that would guarantee the net bounds. [Jackson 89] developed a linear-programming approach for finding a layout that minimizes the estimated cycle time of a circuit. The true timing constraints and the measure of wirability of the resulting circuit are modeled by approximations in this approach. The method works well on small examples, but on circuits of moderate size it takes hours or even days to find a layout. [Youssef 89] attempt to predict the critical path before placement of a circuit using correlation coefficients derived from historic data. The problem with predicting critical paths prior to placement is that the circuit performance cannot be guaranteed.
[…] used ideas from rectilinear distance facility location and partitioning to minimize wire delay, but they too could not guarantee the performance of the resulting placement. [Prasitjutrakul 89] and [Ogawa 86] …
1. combinational: a module that computes a logic function based on its inputs and produces an output
2. synchronizing: a storage module that has data input, data output and clock signals. When the clock signal is active, the data input is sampled and stored internally and, after some delay, the data output signal assumes the same value as the internally stored signal
3. primary input (PI): receives inputs from the external world outside the chip
4. primary output (PO): presents signals from the chip to the external world
To simplify the discussion, it is assumed that each cell computes only one function, each primary input receives only one signal from the outside world and each primary output presents only one signal to the outside world. Deviations from these assumptions can be handled by trivial modifications to the theory presented herein. Let $f$ represent the number of primary inputs and $g$ the number of primary outputs; thus, there are $M - f - g$ internal modules, where an internal module is one that does not receive any signals directly from the outside world. The chip is assumed to be a two-dimensional region and hence we can assign a coordinate to the center of a cell $m_j$, denoted by $(x_j, y_j)$. In the following discussion, the term cell location denotes the coordinate
Timing Problems in Digital Logic
Consider a block of combinational logic receiving inputs from synchronizing elements and presenting outputs to synchronizing elements as shown in Figure 1. This is a general sequential machine model, and in this work it is assumed that cycles of combinational logic do not exist. If signals are applied to the inputs of the combinational logic, then after some time $T_{long}$ the circuit's outputs will settle to values that are a function of the circuit's inputs. If the outputs are sampled before $T_{long}$ units of time have elapsed, the circuit may not behave as designed. Thus, the longest path delay through the combinational logic constrains the earliest time at which the output may be sampled. Figure 1 illustrates the relationship between the longest path delay $T_{long}$, the clock period $CP$, the skew to the synchronizing clock pins $T_{skew}$, the set-up time of the synchronizing elements $T_{su}$, and the synchronizing elements' internal clock-to-output delay $T_{clk \to Q}$. This relationship is expressed as follows:
$$CP > T_{long} + T_{skew} + T_{clk \to Q} + T_{su} \qquad (1)$$
If Equation 1 is not satisfied, then a long path timing problem exists in the design. The short path problem occurs when a signal arrives at the output too early and races around the circuit before the end of one clock cycle. This happens if the clock period is too large and the synchronizing elements in the circuit are of the level-sensitive type. [Wakerly 90] has an excellent discussion of this problem.
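As a small worked illustration of the long path condition, the sketch below checks whether a given clock period exceeds the sum of the longest path delay, clock skew, clock-to-output delay and set-up time. The numerical values (in ns) are hypothetical and chosen only for the example; the function names are likewise illustrative.

```python
def min_clock_period(t_long, t_skew, t_clk_to_q, t_su):
    """Sum of the delay terms that the clock period must exceed."""
    return t_long + t_skew + t_clk_to_q + t_su


def has_long_path_problem(cp, t_long, t_skew, t_clk_to_q, t_su):
    """True when the clock period does not exceed the required bound."""
    return not cp > min_clock_period(t_long, t_skew, t_clk_to_q, t_su)


# Hypothetical values in nanoseconds.
timing = dict(t_long=8.0, t_skew=0.25, t_clk_to_q=0.5, t_su=0.25)
print(min_clock_period(**timing))               # bound on the clock period
print(has_long_path_problem(10.0, **timing))    # 10 ns clock
print(has_long_path_problem(8.5, **timing))     # 8.5 ns clock
```

With these numbers the bound is 9.0 ns, so a 10 ns clock satisfies the condition while an 8.5 ns clock exhibits a long path problem.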
Assumptions
This work restricts attention to one specific timing problem: the long path problem and ignores the related short path problem. Most designers consider the long path problem to be the key timing problem in large scale digital ICs. Ad hoc methods (such as adding delay lines) to fix the short path problem usually work well. However, the long path problem is not usually amenable to ad hoc fixes.
It is assumed that cell signal flow is unidirectional for every input-output conducting path in a cell. Similarly, each net has a signal direction associated with its output pin. Associated with every signal flow is a rising and falling delay that is a function of the corresponding cell and interconnect delay models. A single delay value is calculated for each signal flow, based on the rising and falling transitions. The methods to be discussed are generalizable to the case of separate rising and falling delays [Hitchcock 83]. Each synchronizing cell is assumed to have a clock pin, data-input pins, and a data-output pin. Synchronizing cells may be allowed to move freely within the chip along with other combinational logic cells. For simplicity of discussion it is assumed that edge-triggered synchronizing elements are used. The methods described are generalizable to the case of level-sensitive latches.
The performance of a synchronous digital IC is inversely proportional to the circuit's cycle time or clock period. A path is defined to be a sequence of interconnected modules and nets with a well-defined starting point and ending point (the starting and ending points are represented by modules). A critical path is a path whose delay does not meet the timing requirements of the chip.
Graph Representation of Chip Timing
Let the digraph $D(V, A)$ represent the integrated circuit in the physical/timing domain. Let the vertex set $V$ be in one-to-one correspondence with the set of pins. Arc weights $d(v_i, v_j)$ denote the pin-to-pin signal propagation delays for all $(v_i, v_j) \in A$, and arc direction represents the direction of signal flow in the circuit. Also, let $A_I$ and $A_E$ model the signal behavior internal and external to all cells respectively; thus, internal signal arcs represent cell signal flow while external arcs represent net signal flow.
$$A = A_I \cup A_E$$
Let $\{v_1, \ldots, v_{M-g}\}$ represent the cell output pins in the circuit (it is assumed that each net is driven by a single output pin and that primary inputs have no input pin and primary outputs have no output pin) and let $\{v_{M-g+1}, \ldots, v_P\}$ correspond to the cell-input pins. Assume that $p_i$ is the output pin of $m_i$ and connects to $n_i$. In the event that a cell has more than one output pin, the cell may be replicated for each output, with identical nets feeding each replicated cell, and all the copies of the cell are constrained to a common location during physical design. A path is defined by an unbroken sequence $(v_s, \ldots, v_e)$ of vertices that uniquely occur along the path. Delay in an
The greater flexibility of this multiple-arc cell and net model permits more accurate modeling than single cell and net delay models. This is particularly important when it becomes necessary to model different pin-to-pin net delays for aggressively scaled technologies where interconnect resistive contributions become significant. Let $E$ denote the set of vertices representing path end points that correspond to the input pins of primary outputs and the data-input pins of the synchronizing modules. Associated with each path end point vertex is a required arrival time $r_i$ specified by the designer of the circuit. In a similar manner, let $S$ denote the set of vertices representing path starting points that correspond to the primary inputs and data-output pins of the synchronizing modules. Associated with each path starting point vertex is a designer-specified actual arrival time. Path delay in the circuit is computed by a block-oriented search [Hitchcock 83]. Actual arrival times for vertices not in $S$ are determined in a breadth-first manner, beginning at the path starting points and terminating at the path ending points. The worst-case actual arrival time $a_j$ at an arbitrary vertex is given by
The required arrival times specified for the path end points may be propagated in a backward breadth-first manner through the circuit, starting from vertices in $E$, so that the required arrival times for vertices not in $E$ may be determined. The required arrival time $r_i$ for an arbitrary vertex is defined to be
Based on the calculation of actual arrival and required arrival times for all $v_i$, a slack $s_i$ may be defined as
Slack values are useful in characterizing the timing behavior of a circuit. A negative value of $s_i$ for $v_i$ indicates that a violation of a timing constraint has occurred.
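The forward and backward passes can be sketched as follows. The sketch assumes the usual block-oriented recurrences, namely $a_j = \max_{(v_i, v_j) \in A} (a_i + d(v_i, v_j))$, $r_i = \min_{(v_i, v_j) \in A} (r_j - d(v_i, v_j))$ and $s_i = r_i - a_i$; these forms, the function name `timing_analysis`, and the small example graph are assumptions made for illustration. It also assumes an acyclic graph in which start points have no predecessors and end points have no successors.

```python
from collections import deque


def timing_analysis(arcs, start_arrival, end_required):
    """Return (arrival, required, slack) for an acyclic timing graph.

    arcs          -- {(u, v): delay}, the pin-to-pin arc delays d(u, v)
    start_arrival -- {vertex: arrival time} for path starting points (set S)
    end_required  -- {vertex: required time} for path end points (set E)
    """
    vertices = set(start_arrival) | set(end_required)
    for u, v in arcs:
        vertices.update((u, v))
    succ = {v: [] for v in vertices}
    pred = {v: [] for v in vertices}
    for (u, v), delay in arcs.items():
        succ[u].append((v, delay))
        pred[v].append((u, delay))

    # Topological order (Kahn's algorithm) so each vertex is visited
    # only after all of its predecessors.
    pending = {v: len(pred[v]) for v in vertices}
    queue = deque(sorted(v for v in vertices if pending[v] == 0))
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for v, _ in succ[u]:
            pending[v] -= 1
            if pending[v] == 0:
                queue.append(v)

    arrival = dict(start_arrival)
    for v in order:                      # forward pass: worst-case arrival
        if pred[v]:
            arrival[v] = max(arrival[u] + d for u, d in pred[v])

    required = dict(end_required)
    for u in reversed(order):            # backward pass: tightest requirement
        if succ[u]:
            required[u] = min(required[v] - d for v, d in succ[u])

    slack = {v: required[v] - arrival[v] for v in order}
    return arrival, required, slack


# Hypothetical example: two start points, one end point required at t = 3.5.
arcs = {("s1", "a"): 1.0, ("s2", "a"): 2.0, ("a", "b"): 1.5, ("b", "e"): 0.5}
arrival, required, slack = timing_analysis(
    arcs, {"s1": 0.0, "s2": 0.0}, {"e": 3.5})
print(slack)
```

In this example every vertex on the path s2, a, b, e has slack -0.5, marking it as critical, while s1 has slack +0.5.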
Interconnect Delay
Bakoglu in [Bakoglu 90b] has presented an interconnect delay model that is the basis for the model chosen in this work. Consider a net $n$ consisting of a driving cell $G_a$ and $|n| - 1$ receivers. Cell $G_a$ drives a length of interconnect wire connecting $G_a$ to cells $G_1, \ldots, G_{|n|-1}$ as shown in Figure 3. Let
$C_{net}$ is the interconnect capacitance of the output net of $G_a$ and $C_{load}$ is the capacitance of the driven pins. $R_w$ is the lumped interconnect resistance and, for current technologies, its contribution is of the second order. The factor of 0.5 which multiplies $R_w$ is based on the analysis of [Bakoglu 90b] to model distributed RC delay. Although RITUAL can be easily modified to include the second-order delay effects due to $R_w$, in this paper we will neglect it for simplicity. For details see [Srinivasan 91].
Models for the Lumped Capacitance
In order to estimate $d_{rise}$ and $d_{fall}$ due to a net, we need to model the capacitance of the net during placement. Both these parameters are complex functions of the layout of the wires and neighboring nets. Once again, a choice has to be made that is efficient as well as accurate.
Analytical net capacitance models used in the past for large-scale circuits have relied on simplified net bounding-box estimates [Jackson 89] or similar techniques. However, the deviation of the final routed net length from the estimated value may be quite large. Our goal was to be able to incorporate models of various complexities into a general framework. The techniques developed are capable of handling a variety of linear as well as non-linear delay models.
Bounding Box Model
In this model it is assumed that horizontal and vertical wires are routed on different layers and hence have different capacitance and resistance characteristics. Let $C_h$ and $R_h$ represent the capacitance and resistance per unit length respectively of horizontal wire, and $C_v$ and $R_v$ the capacitance and resistance per unit length respectively of vertical wire. The estimators for the capacitance and resistance of a net are:
where $x_{max} = \max_{p_i \in n}\{x_i\}$, $x_{min} = \min_{p_i \in n}\{x_i\}$, $y_{max} = \max_{p_i \in n}\{y_i\}$, and $y_{min}$ is defined analogously.
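A minimal sketch of a bounding-box estimator is given below. It assumes the natural form in which the horizontal extent of the box is weighted by the horizontal per-unit-length parameters and the vertical extent by the vertical ones, i.e. $C_h (x_{max} - x_{min}) + C_v (y_{max} - y_{min})$ and likewise for resistance; this form, the function name and the sample values are assumptions for illustration.

```python
def bounding_box_rc(pins, c_h, c_v, r_h, r_v):
    """Estimate net capacitance and resistance from the pin bounding box.

    pins -- list of (x, y) pin locations on the net
    c_h, r_h -- capacitance / resistance per unit length of horizontal wire
    c_v, r_v -- capacitance / resistance per unit length of vertical wire
    """
    xs = [x for x, _ in pins]
    ys = [y for _, y in pins]
    width = max(xs) - min(xs)     # x_max - x_min
    height = max(ys) - min(ys)    # y_max - y_min
    capacitance = c_h * width + c_v * height
    resistance = r_h * width + r_v * height
    return capacitance, resistance


# Hypothetical three-pin net and per-unit-length values.
cap, res = bounding_box_rc([(0.0, 0.0), (3.0, 1.0), (1.0, 4.0)],
                           c_h=2.0, c_v=1.0, r_h=0.5, r_v=0.25)
print(cap, res)
```

For the sample net the bounding box is 3 units wide and 4 units tall, giving a capacitance estimate of 10.0 and a resistance estimate of 2.5 in the corresponding units.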
Single-Trunk Steiner Tree Model
This estimator uses a single-trunk Steiner tree to model the length of a net. It is an accurate model, and experiments on a number of chips yielded delay values close to delay estimates based on the final routing of the nets.
The resistance is modeled by similar equations.
Star Connected Net Model
Another model that was considered during the experimentation and yielded excellent results was the star-connected net. This model tends to overestimate the net length, but has the advantage of being simple and efficient to compute. Let $(x_0, y_0)$ represent the location of the cell driving net $n$.
The capacitance and resistance of the net are estimated as:
For the proof, see [Srinivasan 91].
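One plausible reading of the star-connected model is that each receiver is joined to the driver by its own rectilinear connection, so that the horizontal and vertical wire lengths are the sums of $|x_i - x_0|$ and $|y_i - y_0|$ over the receivers. The sketch below implements that reading; the exact estimator, the function name and the sample values are assumptions made for illustration only.

```python
def star_net_rc(driver, receivers, c_h, c_v, r_h, r_v):
    """Star-connected estimate: every receiver is wired to the driver.

    driver    -- (x0, y0) location of the driving cell
    receivers -- list of (x, y) locations of the driven cells
    """
    x0, y0 = driver
    horizontal = sum(abs(x - x0) for x, _ in receivers)
    vertical = sum(abs(y - y0) for _, y in receivers)
    capacitance = c_h * horizontal + c_v * vertical
    resistance = r_h * horizontal + r_v * vertical
    return capacitance, resistance


# Hypothetical net: driver at the origin with two receivers.
cap, res = star_net_rc((0.0, 0.0), [(3.0, 1.0), (1.0, 4.0)],
                       c_h=2.0, c_v=1.0, r_h=0.5, r_v=0.25)
print(cap, res)
```

For this net the star wiring uses 4 units of horizontal and 5 units of vertical wire, whereas the pin bounding box is only 3 by 4, which illustrates the tendency of the model to overestimate net length.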
Phase I: Continuous Space Optimization
Quadratic Objective Function
The quadratic objective function was originally introduced by Hall in 1970 [Hall 70] and later used very successfully for producing high-quality area-directed layouts by placement systems like Gordian [Kleinhans 91] and PROUD [Tsay 88]. The variant of the quadratic wirelength model we use has the following representation for the length of net $n$ (a sum over the pins $p_i, p_j \in n$)
where $\bar{x}_n$ and $\bar{y}_n$ are defined as:
Other quadratic measures have been introduced since the work of Hall. The differences between these models are minor and the analysis in this work remains unchanged for any convex model.
Figure 4: The quadratic wirelength model
The estimate of the cost of a placement can be written as
Note that since the pin locations can be expressed in terms of cell locations, the function $L$ is a function of cell coordinates. In further discussion, we will assume that the coordinate of a pin is the same as the coordinate of the cell to which it is attached. This does not detract from the generality of the techniques described, since pin locations can be derived from cell locations. The extension of this work to other models of wirelength (for example the bounding-box model or the approximate Steiner tree model) has been fully explored in [Srinivasan 91].
Formulation
minimize … subject to
… $\forall v_j \in S$
$d(v_i, v_j) = f(X_i, Y_i)$ for every net $n_i$ (14)
where the function $f(X_i, Y_i)$ is the net delay equation for the output net of cell $m_i$ based on the bounding-box, Steiner or star-connected net capacitance estimate presented earlier.
Optimality Conditions
The quadratic objective function can be written compactly in matrix notation as:
$$L(x, y) = \frac{1}{2}\left(x^T B x + y^T B y\right) + c^T x + d^T y$$
where $x$ is a vector of the x-coordinates of the module locations and $y$ is a vector of the y-coordinates. $c$ and $d$ are constant vectors arising from the fixed modules. $B$ is a symmetric matrix, typically sparse, and can be stored very efficiently using sparse matrix data structures as in [Bunch 76]. In addition, it is shown that $B$ is positive-definite.
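To make the matrix form concrete, the sketch below builds $B$ and $c$ for a one-dimensional toy instance and checks that the matrix expression agrees with a direct sum of squared connection lengths. It assumes unit-weight two-pin connections with cost $\frac{1}{2} w (x_i - x_j)^2$, which is a simplification of the net model used in this work; the helper names and the instance are hypothetical. A constant term contributed by the fixed modules, which does not affect the minimizer, is tracked separately.

```python
def build_quadratic(num_movable, movable_edges, fixed_edges):
    """Build B, c and the constant term for one coordinate direction.

    movable_edges -- (i, j, weight) connections between movable cells
    fixed_edges   -- (i, position, weight) connections to fixed modules
    """
    B = [[0.0] * num_movable for _ in range(num_movable)]
    c = [0.0] * num_movable
    const = 0.0
    for i, j, w in movable_edges:
        B[i][i] += w
        B[j][j] += w
        B[i][j] -= w
        B[j][i] -= w
    for i, pos, w in fixed_edges:
        B[i][i] += w                # fixed connections strengthen the diagonal
        c[i] -= w * pos             # and produce the linear term
        const += 0.5 * w * pos * pos
    return B, c, const


def quadratic_cost(B, c, const, x):
    """Evaluate 0.5 * x^T B x + c^T x + const."""
    n = len(x)
    quad = sum(x[i] * B[i][j] * x[j] for i in range(n) for j in range(n))
    return 0.5 * quad + sum(ci * xi for ci, xi in zip(c, x)) + const


def direct_cost(x, movable_edges, fixed_edges):
    """Evaluate the same cost as a plain sum of squared lengths."""
    total = sum(w * (x[i] - x[j]) ** 2 for i, j, w in movable_edges)
    total += sum(w * (x[i] - pos) ** 2 for i, pos, w in fixed_edges)
    return 0.5 * total


# Two movable cells in a chain between fixed pads at x = 0 and x = 10.
movable_edges = [(0, 1, 1.0)]
fixed_edges = [(0, 0.0, 1.0), (1, 10.0, 1.0)]
B, c, const = build_quadratic(2, movable_edges, fixed_edges)
x = [3.0, 7.0]
print(B, c)
print(quadratic_cost(B, c, const, x), direct_cost(x, movable_edges, fixed_edges))
```

Setting the gradient $Bx + c$ to zero for this instance gives $x = (10/3, 20/3)$, i.e. the two cells spread evenly between the pads, which is the behavior one expects from an unconstrained quadratic placement.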
The total number of variables in the problem is $2M + P$. However, the $P$ arrival time variables do not enter into the cost function, so the value of the cost function at any point is unchanged and the sparsity of the matrix representing the cost function is retained. To simplify further discussion some notation is introduced. Let …
Proof. When there is more than one fixed module, the proof that a unique solution exists can be found in [Srinivasan 91]. When there is one fixed module, the obvious unique optimal solution is the one in which all the cells occupy the same location as the fixed module.
It is assumed in the following discussion that the wirelength models and the timing models used are among those presented earlier. Under this assumption, due to the convexity of the objective function and the constraints, the general formulation GP is a convex programming problem. Let $A^*$ denote the vector function (possibly non-linear) of the active constraints at a global minimum $w^*$ and $\nabla A^*$ denote the associated Jacobian matrix. Since the programming problem is convex,
provided $w^*$ is a regular point of the constraints, i.e., at $w^*$ the matrix $\nabla A^*$ has full rank. The above conditions are popularly known as the Kuhn-Tucker first-order optimality conditions.
The Lagrange multiplier associated with a constraint at a given point $w$ has a useful mathematical interpretation. It represents the sensitivity of the cost function to that constraint at that point. If the Lagrange multiplier is zero, then the constraint is inactive and has no effect locally. A positive value indicates that the constraint is binding and that moving away from it towards the interior of the feasible region will increase the objective value. A negative Lagrange multiplier indicates that moving away from the constraint towards the interior of the feasible region will result in a decrease in the objective value, and hence that constraint can be "dropped" provided all the other binding constraints are retained. An excellent treatment of Lagrange multipliers and their interpretation may be found in [Murty 83].
Why is the Problem Difficult to Solve?
The number of constraints and variables in GP can be enormous even for problems of moderate size. For a typical problem with 1000 cells and 3000 nets, the number of active variables could be up to 18,000 and the active constraints could number 15,000. However, what makes the problem even more difficult is that the constraint set is highly degenerate. The effect of degeneracy is that standard quadratic-programming algorithms flounder for many steps without improving the objective function. For a detailed description of the problems in using conventional quadratic programming methods, see [Srinivasan 91].
Lagrangian Relaxation
Lagrangian Relaxation has been used occasionally in the past by economists and operations researchers but has not found widespread use because of the difficulties involved in getting the method to converge to a solution on general problems. However, problems with special structure in the objective function and constraints respond very well to the technique [Fisher 85]. Unfortunately, finding special structure requires special insight in most cases. Fortunately, the constrained optimization problem as stated in this work possesses some very useful properties that have been exploited fully. In order to give the reader an idea of the method of Lagrangian Relaxation, a simple example is discussed.
A Simple Example
Consider the constrained optimization problem:
$$\min_x \; (5x - 14)^2 \quad \text{subject to} \quad x^2 \le 5 \qquad (18)$$
The optimal solution to this problem is $x = \sqrt{5}$. From basic Lagrangian theory [Fisher 85] it can be shown that the problem stated above is equivalent to the following optimization problem:
$$\max_{\mu \ge 0} \; \min_x \; (5x - 14)^2 + \mu (x^2 - 5) \qquad (19)$$
where, as before, $\mu$ is the Lagrange multiplier associated with the constraint (if there are multiple constraints, there is one multiplier associated with each constraint). The method of Lagrangian Relaxation as applied to this problem is now described in outline. The description here is considerably simplified for ease of explanation and the reader should be cautioned that several complicating details have been omitted. A more comprehensive treatment can be found in [Shapiro 79]. The method proceeds iteratively as follows:
1. Start with an initial value for $\mu$, say 0. (Usually one can start with an educated guess.)
2. For a fixed value of $\mu$, solve the problem of Equation 19. For fixed $\mu$ this is an unconstrained minimization problem (and for problems with special structure, it is easy to solve).
3. Update $\mu$ based on the solution obtained. Intuitively, $\mu$ acts like a penalty on the constraint. If the current value of $\mu$ results in a solution that violates the constraint, it needs to be increased. If the value of $\mu$ is too high, the solution will be far from optimal: the constraint is satisfied by a wide margin. Thus, it is possible to update $\mu$ based on the residue in the constraint. There are many possible update methods that will guarantee convergence of the method for convex programming problems. One that is widely used is:
where $\alpha$ is a positive constant and $\beta$ is a positive constant $\le 1.0$.
4. Repeat steps 2 and 3 till convergence.
Let us apply this recipe to the example, with $\alpha = 1.0$, $\beta = 1.0$. For a fixed value of $\mu$, the optimal solution satisfies
$$2(5x - 14) + \mu (2x) = 0$$
$$x = \frac{14}{5 + \mu} \qquad (21)$$
The values for $x$ and $\mu$ are listed for each iteration.
1. Solve Equation 21 with $\mu^{(0)} = 0$. The optimal solution is $x^{(0)} = 2.8$.
2. Update $\mu^{(1)} = \mu^{(0)} + ((x^{(0)})^2 - 5) = 2.84$. The initial value of $\mu$ was too low and this step increases it by an amount equal to the residue in the constraint.
3. Proceeding in a similar manner to Step 2, we obtain $x^{(1)} = 1.78$, $\mu^{(2)} = 1.03$.
4. $x^{(2)} = 2.32$, $\mu^{(3)} = 1.42$.
5. $x^{(3)} = 2.18$, $\mu^{(4)} = 1.17$.
6. $x^{(4)} = 2.27$, $\mu^{(5)} = 1.31$.
7. $x^{(5)} = 2.22$, $\mu^{(6)} = 1.23$, and so on.
In the limit $x$ converges to the optimal value of 2.23... In practice, there are several methods of accelerating the convergence, and for well-structured problems typically only a few iterations are required (see further references in [Fisher 75]). This example illustrates the power of the relaxation method in solving nonlinearly constrained problems by a series of unconstrained optimizations.
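The iteration of the simple example can be replayed with the short sketch below. It uses the closed-form inner solution $x = 14/(5 + \mu)$ exactly as printed in Equation 21 and the residue-based update $\mu \leftarrow \max(0, \mu + \alpha (x^2 - 5))$. Since the example takes $\beta = 1.0$, the step is treated as the constant $\alpha$; how $\beta$ enters the general update formula is not assumed here. The function names are illustrative.

```python
import math


def inner_solution(mu):
    """Minimizer for a fixed multiplier, as printed in Equation 21."""
    return 14.0 / (5.0 + mu)


def lagrangian_relaxation(mu0=0.0, alpha=1.0, iterations=60):
    """Alternate the inner solve and the residue-based multiplier update.

    Returns the list of (x, mu) pairs, one per iteration.
    """
    mu = mu0
    history = []
    for _ in range(iterations):
        x = inner_solution(mu)
        history.append((x, mu))
        residue = x * x - 5.0               # violation of x^2 <= 5
        mu = max(0.0, mu + alpha * residue)  # multipliers stay non-negative
    return history


history = lagrangian_relaxation()
for k, (x, mu) in enumerate(history[:6]):
    print(f"k={k}: x={x:.4f}, mu={mu:.4f}")
print("final x:", history[-1][0], "sqrt(5):", math.sqrt(5.0))
```

The first few iterates agree with the values listed in the text to within about 0.01, and the sequence oscillates around and converges to $x = \sqrt{5}$.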
Detailed Recipe for Lagrangian Relaxation
Let us now consider a detailed description of the method of Lagrangian Relaxation for a general convex problem of the form:
where $g(x)$, $h(x)$ are convex vector functions of $x$. The constraint set is partitioned into $g(x)$ and $h(x)$. It is assumed that $g(x)$ consists of constraints that complicate the problem; they are termed "complicating" constraints. It is also assumed that the problem is easy to solve in the absence of $g(x)$. As an aside and a preview of the techniques in this chapter, note that the wirelength optimization problem is very easy to solve without the timing constraints. Hall [Hall 70] first solved it for the quadratic wirelength model and showed that the solution corresponds to solving a linear system of equations. Later, Tsay [Tsay 88]
subject to $h(x) \le d$, where $\lambda$ is a vector of multipliers. The most general method proceeds as follows:
Now, eight centering constraints (four in the x direction and four in the y direction) are added to the constraint set to form a new problem $GP_1$, termed the "level 1" problem. Figure 5 shows an example with four regions and the sets of cells in different shades after the solution. (Note that some cells from one region have migrated into another
Solving the Lagrangian
The specific problem for the quadratic wirelength model and the star net delay model (neglecting interconnect resistance) with $k$ centers of mass is restated below:
where $\lambda$ is a vector of multipliers. For any fixed value of $\lambda$, say $\lambda^{(k)}$, the problem has a very simple solution.¹
Note that $Q$ is independent of cell locations. Thus, at every iteration, the only component that changes in the right-hand side of Equation 29 is $\lambda$ and the only product to be computed is $\lambda^{(k)} A$.
This is a linear-time computation since the maximum number of active equations is equal to the number of edges in the timing graph. In an efficient implementation, $Q^{-1}$ is never computed.
Since $Q$ is positive definite, Equation 29 can be solved iteratively using an algorithm like the accelerated Gauss-Seidel method for solving linear systems of equations [Golub 89]. If the Gauss-Seidel method is used, at each iteration $k$ the previous solution $w^{(k-1)}$ can be used as an initial solution for rapid convergence. It is interesting to note that Rockafellar [Rockafellar 84] shows that solving Equation 29 is equivalent to solving a minimum quadratic cost flow problem.
¹ To keep the notation simple, it is assumed that $\lambda^{(k)}$ refers to a row vector, i.e., the transposition symbol is dropped.
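The benefit of reusing the previous solution as the starting point can be illustrated with a small sketch. It implements plain Gauss-Seidel sweeps (not the accelerated variant mentioned above) on a generic symmetric positive-definite system $Qw = b$; the matrix, the right-hand sides and the function name are hypothetical and stand in for the actual system of Equation 29.

```python
def gauss_seidel(Q, b, w0=None, tol=1e-10, max_sweeps=10_000):
    """Solve Q w = b by Gauss-Seidel sweeps.

    w0 is an optional initial guess (e.g. the previous iteration's solution).
    Returns (w, number_of_sweeps_used).
    """
    n = len(b)
    w = list(w0) if w0 is not None else [0.0] * n
    for sweep in range(1, max_sweeps + 1):
        largest_change = 0.0
        for i in range(n):
            # Use the most recent values of the other unknowns.
            sigma = sum(Q[i][j] * w[j] for j in range(n) if j != i)
            new_value = (b[i] - sigma) / Q[i][i]
            largest_change = max(largest_change, abs(new_value - w[i]))
            w[i] = new_value
        if largest_change < tol:
            return w, sweep
    return w, max_sweeps


Q = [[2.0, -1.0], [-1.0, 2.0]]

# Solve once, then solve again after a small change in the right-hand side,
# as happens when only the multipliers change between iterations.
w_prev, _ = gauss_seidel(Q, [0.0, 10.0])
b_new = [0.5, 10.0]
w_cold, cold_sweeps = gauss_seidel(Q, b_new)
w_warm, warm_sweeps = gauss_seidel(Q, b_new, w0=w_prev)
print(w_warm, cold_sweeps, warm_sweeps)
```

Both runs reach the same solution, but the warm-started run needs fewer sweeps because its initial error is only as large as the change caused by the updated right-hand side.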
Updating the Lagrange Multipliers
The method used to update Lagrange multipliers from iteration to iteration is based on the subgradient method for setting dual variables [Held 74]. This technique starts with an initial value for $\lambda$ and iteratively applies the formula:
$$\lambda^{(k+1)} = \max\{0, \lambda^{(k)} + t^{(k)} (A w^{(k)} - c)\} \qquad (30)$$
In this formula, $t^{(k)}$ is a scalar step size and $w^{(k)}$ is the optimal solution of Equation 29 for $\lambda = \lambda^{(k)}$.
The components of $A w^{(k)} - c$ are the slacks in the constraints. For the timing constraints the components are none other than the vertex slacks for the cells on critical paths. For the spread constraints, they are the differences between the desired centers of mass of the various groups and the actual centers of mass. The choice of $t^{(k)}$ is critical to the success of the algorithm for two reasons: (1) it is closely tied to the linearization of the absolute valued delay constraints and (2) it affects the convergence of the algorithm. The procedure for computing $t^{(k)}$ is explained in the following subsection. The convergence properties of such a method for updating $\lambda$ are described in detail in [Held 74].
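A direct transcription of the update of Equation 30 for dense data is sketched below. The matrix, vectors and step size are hypothetical, and a practical implementation would work only with the sparse set of active constraints; the function name is illustrative.

```python
def update_multipliers(lam, A, w, c, step):
    """Apply lam <- max(0, lam + step * (A w - c)) componentwise.

    lam  -- current multipliers, one per constraint
    A    -- constraint matrix as a list of rows
    w    -- current solution vector
    c    -- constraint right-hand sides
    step -- scalar step size t
    """
    residual = [sum(a_ij * w_j for a_ij, w_j in zip(row, w)) - c_i
                for row, c_i in zip(A, c)]
    return [max(0.0, l + step * r) for l, r in zip(lam, residual)]


# Hypothetical data: the first constraint is violated (positive residual),
# the second is satisfied with room to spare (negative residual).
A = [[1.0, -1.0], [0.0, 1.0]]
w = [3.0, 1.0]
c = [1.0, 2.0]
new_lam = update_multipliers([0.5, 0.25], A, w, c, step=0.5)
print(new_lam)
```

The multiplier of the violated constraint grows from 0.5 to 1.0, while that of the satisfied constraint is driven to zero by the projection onto the non-negative orthant.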
Computing $t^{(k)}$
Recall that all the delay models described in Section 4 have equations with absolute valued terms in them. It is possible to convert these delay constraints to linear constraints by using additional variables as in Section 5. … and the system of equations is solved for $w^{(k+1)}$. Next, the largest value of $t^{(k)}$ is chosen such that a term in one of the delay equations just changes sign (i.e., is about to change from its current configuration to the opposite one as shown in Figure 6). $w^{(k+1)}$ and $\lambda^{(k+1)}$ are updated according to this value of $t^{(k)}$.
Updating the critical arc set
The algorithm maintains a set of active critical arcs throughout its execution. Active arcs are those whose current Lagrange multipliers are positive. By maintaining the set of active arcs, further efficiency can be achieved since the components of $A^T \lambda$ need not be computed when some component of $\lambda$, say $\lambda_j$, is zero. Since not all arcs are critical, this typically makes updating the right-hand side a sub-linear procedure. In any event, the number of critical arcs at any time is linearly bounded. Thus, the maximum number of multipliers that are active at any time is linear in the size of the timing graph. Note that although the matrix $A$ contains all the arc equations, in a practical implementation they are never explicitly computed or written unless they are required. The only arcs that actually participate in updating the right-hand side are those which belong to the current set of critical arcs.
After solving for $w^{(k+1)}$, a fast timing analysis on the timing graph is performed to determine the arcs that have become critical since the previous iteration. These arcs are then added to the critical set with zero initial-valued Lagrange multipliers.
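The bookkeeping for the critical set can be sketched with a dictionary that maps arcs to multipliers. The sketch assumes that an arc is deemed critical when the timing analysis reports negative slack for it; that criterion, the function names and the example data are assumptions for illustration.

```python
def update_critical_set(multipliers, slack_by_arc):
    """Add newly critical arcs to the set with zero initial multipliers.

    multipliers  -- {arc: multiplier} for arcs already in the critical set
    slack_by_arc -- {arc: slack} reported by the latest timing analysis
    """
    updated = dict(multipliers)
    for arc, slack in slack_by_arc.items():
        if slack < 0 and arc not in updated:
            updated[arc] = 0.0
    return updated


def active_arcs(multipliers):
    """Arcs whose multipliers are positive; only these affect the update."""
    return [arc for arc, lam in multipliers.items() if lam > 0.0]


# Hypothetical state: one arc is already active, one new arc turns critical.
multipliers = {("a", "b"): 0.7}
slacks = {("a", "b"): -0.2, ("b", "e"): -0.1, ("s1", "a"): 0.5}
multipliers = update_critical_set(multipliers, slacks)
print(multipliers)
print(active_arcs(multipliers))
```

The newly critical arc enters the set with a zero multiplier and therefore does not contribute to the right-hand side until a later multiplier update makes it positive.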
Computational Complexity
The flow of the algorithm is described in Figure 7. The work done per inner-loop iteration of the algorithm is small. It involves a right-hand side update, which is $O(M)$; one step of matrix solution (assuming a direct solution method) to solve for the new value of $w$, which is $O(M^2)$; computing $t$, which is $O(E)$, where $E$ is the number of edges in the timing graph; and updating the critical edges, which can be done in $O(M + E)$. Therefore, the work done per inner-loop iteration is bounded by $O(M^2)$. Note that the critical arc set is continuously updated as new paths become critical. For the linearized delay equations, this procedure converges according to [Held 74, Shapiro 79]. There is no theoretical bound on the number of iterations required for convergence of the inner loop; however, in practice the number of iterations required per level was low (200-400) even for the largest examples tested.
Extension to Nonlinear Delay Models
It is straightforward to extend the Lagrangian Relaxation algorithm described above to a convex nonlinear delay model like that of the star-connected net with interconnect resistance effects. Assume that an iterative method like Gauss-Seidel relaxation is used to solve the system of equations generated at each iteration. In the case of linear delay equations, only the right-hand side of Equation 29 had to be updated every iteration. When the timing constraints are nonlinear, however, some matrix entries may also need to be updated. The work done per iteration remains the same, although the number of iterations to convergence usually increases.
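For readers unfamiliar with Gauss-Seidel relaxation, the following is a generic textbook sketch for a linear system $Mx = b$. It is not taken from the paper: the function name `gauss_seidel`, the dense list-of-lists storage, and the stopping tolerance are assumptions, and convergence is assumed to hold (for example, for a diagonally dominant matrix). In the nonlinear case discussed above, some entries of the matrix would be re-evaluated between outer iterations before calling such a routine again.

```python
def gauss_seidel(M, b, x0=None, sweeps=200, tol=1e-10):
    """Solve M x = b by Gauss-Seidel sweeps (assumes the iteration converges)."""
    n = len(b)
    x = list(x0) if x0 is not None else [0.0] * n
    for _ in range(sweeps):
        max_change = 0.0
        for i in range(n):
            # Use the newest values of x already computed in this sweep.
            s = sum(M[i][j] * x[j] for j in range(n) if j != i)
            new_xi = (b[i] - s) / M[i][i]
            max_change = max(max_change, abs(new_xi - x[i]))
            x[i] = new_xi
        if max_change < tol:
            break
    return x


# Toy system: 4x + y = 1, x + 3y = 2, whose solution is x = 1/11, y = 7/11.
print(gauss_seidel([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0]))
```

Passing the previous solution as `x0` is the natural warm start when only the right-hand side changes between iterations.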
Phase II: Discrete Space Optimization
Although it is possible to add slot constraints as described in Section 7 till exactly one cell remains in each region, using a modification of that technique results in further improvement in timing and wirelength. The star connected net delay model is used to illustrate the ideas.
The idea is to perform hierarchical partitioning until a few (10-20) cells remain in each region. Following this, constrained assignment (weighted bipartite matching) is used to assign slot positions to the cells within each region. Consider a region containing $M$ cells and $N > M$ slots. A cost matrix $C$ is set up with as many rows as the number of cells in the region and as many columns as the number of slots.
Note that the integer constraints on $Z_{ij}$ have been dropped too. Let $A$ have $p$ constraints. Then $A Z_{xy} W$ can be rewritten as:
Substituting this expression in $\nu^T A Z_{xy} W$, we can state the problem more intuitively as:
3. Update the set of critical arcs (i.e., $A$, the matrix representation of the current timing constraints) and $\nu$ according to $\nu^{(k+1)} = \max(0, \nu^{(k)} + (Aw - c))$.
4. Repeat steps 2 and 3 until the constraints are satisfied to a desired accuracy.
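The multiplier update in step 3 sets each new multiplier to the maximum of zero and the old multiplier plus the constraint residual $Aw - c$. A minimal sketch is given below. The function name `update_multipliers` and the dense row storage of $A$ are assumptions made for illustration; the unit step size follows the update rule as stated in step 3.

```python
def update_multipliers(nu, A, w, c):
    """Return max(0, nu + (A w - c)) computed elementwise.

    nu: current multipliers, one per timing constraint
    A:  list of constraint rows; w: current positions; c: constraint bounds
    """
    new_nu = []
    for nu_i, row, c_i in zip(nu, A, c):
        residual = sum(a * x for a, x in zip(row, w)) - c_i
        # Violated constraints (residual > 0) raise the multiplier,
        # satisfied ones lower it, but never below zero.
        new_nu.append(max(0.0, nu_i + residual))
    return new_nu


print(update_multipliers([0.5, 0.5], [[1, 1], [1, -1]], [2, 1], [2, 3]))
```

In this toy case the first constraint is violated by 1 and its multiplier grows to 1.5, while the second is satisfied with slack 2 and its multiplier is clipped to zero.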
The relaxed linear assignment problem has an intuitively appealing interpretation. Consider the cost of assigning a cell $m_i$ to a slot $S_j$. The cost of this assignment is:
At every iteration, the cost of assigning a cell to a slot is modified by adding a term that is derived by looking at the Lagrange multipliers of the arcs. The multipliers themselves are derived from the path slacks. Thus, the additional cost of assigning a cell to a slot can be derived by looking only at the arcs with non-zero multipliers passing through the cell. As the reader may recall, only a linear number of arcs need to be considered in the worst case.
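The cost modification can be illustrated with a small sketch. Several simplifying assumptions are made here and none of them come from the paper: slots have a single coordinate, each timing constraint is linear in the cell coordinates with coefficients `A[a][i]`, and the assignment is found by brute-force enumeration, which is only practical for a handful of cells (a real implementation would use a weighted bipartite matching algorithm). The names `adjusted_costs` and `best_assignment` are hypothetical.

```python
from itertools import permutations


def adjusted_costs(base, nu, A, slot_pos):
    """Add the multiplier-derived term to the base assignment cost.

    price[i] sums nu[a] * A[a][i] over arcs with non-zero multipliers only,
    so the extra cost of placing cell i in slot j is price[i] * slot_pos[j].
    """
    n_cells = len(base)
    price = [
        sum(nu[a] * A[a][i] for a in range(len(nu)) if nu[a] > 0.0)
        for i in range(n_cells)
    ]
    return [
        [base[i][j] + price[i] * slot_pos[j] for j in range(len(slot_pos))]
        for i in range(n_cells)
    ]


def best_assignment(cost):
    """Brute-force minimum-cost assignment of cells (rows) to distinct slots."""
    n_cells, n_slots = len(cost), len(cost[0])
    best = min(
        permutations(range(n_slots), n_cells),
        key=lambda p: sum(cost[i][p[i]] for i in range(n_cells)),
    )
    return list(best)


base = [[1, 2, 3], [3, 1, 2]]   # 2 cells, 3 slots
slot_pos = [0, 1, 2]
A = [[1, 0], [1, -1]]           # two arcs, coefficients per cell
print(best_assignment(adjusted_costs(base, [0.0, 0.0], A, slot_pos)))
print(best_assignment(adjusted_costs(base, [0.0, 2.0], A, slot_pos)))
```

With all multipliers at zero the assignment minimizes the base cost alone. Once the second arc becomes active, the second cell is pulled toward a slot with a larger coordinate, which changes the chosen assignment.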
Unlike the continuous case, it may not be possible to obtain the exact optimum by solving the relaxed problem in this manner. However, a solution that is close to optimal can usually be obtained very quickly. The term "near-optimal" deserves some explanation. In this context it means that the timing constraints may be violated by a marginal amount depending on the level of discretization. Since the cells are required to lie on discrete locations, there may not be a solution where the constraints are satisfied exactly: there are only a discrete number of possibilities for the left-hand side of each constraint. The amount of error in the constraints is usually extremely small, and the larger the problem size, the smaller the granularity in the left-hand side and the smaller the violation. Note that the constraints are not always violated. They may be satisfied by an equally marginal amount (i.e., the error due to discretization could go either way).
In practice, the observed violation was of the order of 1-2%.
Note that any solution obtained is locally near-optimal for a region only with respect to the connections outside the region, and is not guaranteed to minimize the cost of connections within the region. (The timing constraints, however, do represent the correct behavior inside and outside the region.) The problem could have been formulated as a Quadratic Assignment to handle the connections within the region properly. However, the large run time of Quadratic Assignment makes it impractical for regions with even a moderate number of cells. For regions with fewer cells, the effect of interconnection within the region is small, and by repeating the Linear Assignment a few times an improved solution can be obtained. A further improvement in wirelength and timing can be obtained by allowing cells to migrate outside their region. This is achieved by shifting the regions in the x and y directions by half the region size at alternate iterations, as shown in Figure 8, and repeating the process until the improvement is small. Note that the absolute-valued delay equations can be dealt with in a manner similar to the continuous case, i.e., by ensuring that the equations are always written with the correct signs. The flow of the slot assignment algorithm is shown in Figure 9.
Figure 8: Regions of optimization are shifted at alternate iterations.
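The alternating half-region shift can be sketched as follows. This is an illustration under assumptions: slots are taken to lie on an integer grid, regions are square with side `size`, and the names `region_of` and `group_slots` are hypothetical.

```python
def region_of(x, y, size, iteration):
    """Region index of slot (x, y); odd iterations shift the grid by half a region."""
    shift = size // 2 if iteration % 2 else 0
    return ((x + shift) // size, (y + shift) // size)


def group_slots(slots, size, iteration):
    """Group slot coordinates by the region they fall in at this iteration."""
    groups = {}
    for x, y in slots:
        groups.setdefault(region_of(x, y, size, iteration), []).append((x, y))
    return groups


# Slots (3, 0) and (4, 0) straddle a region boundary when size = 4.
print(region_of(3, 0, 4, 0), region_of(4, 0, 4, 0))  # different regions
print(region_of(3, 0, 4, 1), region_of(4, 0, 4, 1))  # same region after the shift
```

Two neighbouring slots separated by a region boundary in one iteration share a region in the next, so a cell can move across the old boundary during the following assignment pass.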
Some Comments on the Formulation
Experiments proved the method of constrained assignment to be extremely effective in reducing the wirelength while satisfying the timing constraints. The ability to use any assignment cost, regardless of the mathematical properties of the cost function makes it possible to include routability and 
The Complete Algorithm
The flow of the algorithm is shown in Figure 10. Figure 11. This results in considerable "dead area" or wasted space. The problem can be solved by adding a row width constraint for each row, with the average row width as the target. The current width of a row is the sum of the widths of all cells currently in that row. These constraints may be dealt with in a manner similar to the timing constraints during the discrete assignment phase of the algorithm. A multiplier $\mu_i$ is associated with each constraint (or conceptually, with each row). These multipliers can be transferred to slots, i.e., the multiplier for a slot $S_j$ is denoted by $\mu_j$ and is the multiplier for the row in which the slot is located. The modified Lagrangian can be written as:
where $R_{ij}$ is the width of cell $m_i$. Conceptually, to the wirelength and timing cost of assigning a cell to a slot, one must add the product of the multiplier of the row in which the slot is located and the width of the cell. Intuitively, this tends to have the following effect:
• Rows that exceed the target width have a positive value of $\mu_i$ and cells are penalized for being assigned to that row.
• Rows that are shorter than the target width have a negative value of $\mu_i$ and the cost of assigning cells to those rows is reduced.
• When assigning cells to a row that exceeds the requirement, a wider cell is penalized more than a narrower one.
• When assigning cells to a row that is shorter than the requirement, a wider cell is preferred over a narrower one.
Hence, the mathematical theory has an intuitive and natural interpretation.
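The four effects listed above can be checked on a toy example. The sketch below is an assumption-laden illustration rather than the paper's formulation: the row multiplier update (previous value plus a step times the difference between the current and target row width, with no clipping at zero because the multipliers may be negative) is assumed by analogy with the timing multipliers, and the names `row_adjusted_costs` and `update_row_multipliers` are hypothetical.

```python
def update_row_multipliers(mu, row_widths, target, step=1.0):
    """Raise the multiplier of over-wide rows and lower it for short rows."""
    return [m + step * (rw - target) for m, rw in zip(mu, row_widths)]


def row_adjusted_costs(cost, width, slot_row, mu):
    """Add (multiplier of the slot's row) * (cell width) to each assignment cost."""
    return [
        [cost[i][j] + mu[slot_row[j]] * width[i] for j in range(len(slot_row))]
        for i in range(len(width))
    ]


# Row 0 is wider than the target of 10, row 1 is shorter.
mu = update_row_multipliers([0.0, 0.0], [12, 8], 10)
cost = [[0.0, 0.0], [0.0, 0.0]]   # base cost: 2 cells, 2 slots
width = [3, 1]                    # cell 0 is wide, cell 1 is narrow
slot_row = [0, 1]                 # slot 0 lies in row 0, slot 1 in row 1
print(mu)
print(row_adjusted_costs(cost, width, slot_row, mu))
```

In the adjusted matrix the wide cell pays more than the narrow one for the slot in the over-wide row and receives the larger discount for the slot in the short row.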
Example # cells 
