We present a new top-down quadrisection-based global placer for standard-cell layout. The key contribution is a new general gain update scheme for partitioning that can exactly capture detailed placement objectives on a per-net basis.
INTRODUCTION
In the physical implementation of high-performance, complex deep-submicron integrated circuits, module placement is a critical step. Given fixed decisions from the upstream stages of the chip design flow (namely, microarchitecture design, chip timing, chip planning, logic synthesis and physical floorplanning), it is placement solution quality that is the major determinant of whether timing correctness and routing completion can be achieved. This paper describes a new placement tool for the standard-cell methodology; we assume a row-based layout with uniform module heights and variable module widths, with instance sizes of up to several tens of thousands of cells being of greatest interest. For overviews of (standard-cell) placement, see, e.g., Lengauer [Len90] or Shahookar and Mazumder [SM91].
A VLSI circuit netlist consists of a set of modules (cells) connected by signal nets.
In the corresponding edge- and vertex-weighted netlist hypergraph $N(V, E)$ with $V = \{v_1, v_2, \ldots, v_n\}$ and $E = \{e_1, e_2, \ldots, e_m\}$, the $n$ vertices correspond to netlist modules (cells) and the $m$ hyperedges correspond to signal nets. Each hyperedge $e \in E$ is a subset of $V$ containing one source vertex, with the remaining vertices of the hyperedge being sinks. The input to a placer is assumed to be the netlist and cell library information.
Define the location of a cell (and all its pins) to be the location of its center. A placement of the $n$ cells in $V$ is an assignment of cells to locations in the two-dimensional plane. The placement is legal if cells do not overlap and are placed within the prescribed row coordinates. The placer typically seeks a legal placement of $V$ such that layout area is minimized while maintaining auto-routability and satisfying timing and other performance constraints.
For cell-based placement, the first-order objective is to place connected cells closer together to reduce both wirelength and lower bounds on signal delay. Thus, most placers have a minimum-wirelength objective: Given a netlist $N(V, E)$, find a legal placement such that $\sum_{e \in E} cost(e)$ is minimized, where $cost(e)$ is the routing cost of the net $e$.
It is difficult to estimate routed wirelength, and hence only simple estimates are used in practice. Let MST(e) denote the minimum spanning tree (MST) cost over the locations of cells belonging to net e. Also, let HP(e) denote the half-perimeter of the minimum enclosing bounding box of the locations of cells belonging to net e. In practice, total MST cost and total HP cost are the most commonly used wirelength estimates for wirelength-driven placement; any other practical estimate needs to have similarly low time complexity of evaluation. (Following several previous works, we will use the MST estimate for illustrative purposes and to evaluate total wirelength of our placements; however, our placer handles arbitrarily complicated per-net placement objectives.)
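As an illustration of the two estimates, the following Python sketch computes HP(e) and MST(e) for a single net from its cell locations. It assumes rectilinear (Manhattan) distances between cell centers; the function names are ours and are not part of the placer.

```python
def half_perimeter(pins):
    """HP(e): half-perimeter of the bounding box of the pin locations."""
    xs = [x for x, _ in pins]
    ys = [y for _, y in pins]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))


def mst_cost(pins):
    """MST(e): Prim's algorithm over the pins, O(p^2) for a p-pin net.

    Assumption: edge length is the Manhattan distance between pins.
    """
    if len(pins) < 2:
        return 0

    def dist(p, q):
        return abs(p[0] - q[0]) + abs(p[1] - q[1])

    # best[i] = cheapest known connection from pin i to the growing tree
    best = {i: dist(pins[0], pins[i]) for i in range(1, len(pins))}
    total = 0
    while best:
        j = min(best, key=best.get)      # closest pin not yet in the tree
        total += best.pop(j)
        for i in best:                   # relax remaining pins against pin j
            best[i] = min(best[i], dist(pins[j], pins[i]))
    return total


def total_wirelength(nets, cost):
    """Sum of cost(e) over all nets; each net is a list of (x, y) pins."""
    return sum(cost(pins) for pins in nets)
```

Both estimates are cheap to evaluate per net, which is the property the text asks of any practical wirelength estimate.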
PARTITIONING-BASED PLACEMENT
Our proposed placement approach is based on top-down partitioning. In this section, we first review the traditional (KL-FM) iterative partitioning approach, along with its gain update scheme. We then review several partitioning-based placement techniques in the literature, centering on the issue of terminal propagation. We will omit discussion of local-improvement techniques (e.g., simulated annealing [SS93] [SS95] and DOMINO [DJA94]).
2.1. Gain Update in Iterative Partitioning
Iterative improvement heuristics for netlist partitioning typically start with an initial solution and make a series of passes. Each pass iteratively determines the move of one or more cells which achieves the best possible gain in the partitioning objective. After all cells have been moved in a given pass, the best solution seen during the entire pass is selected; the next pass begins with this selected solution. The process terminates when a local minimum is reached, i.e., the current pass does not improve the objective. Computing and updating gain data is the heart of the iterative improvement approach.
The prototype iterative heuristic is that of Kernighan and Lin (KL) [KL70], which uses a pair-swap move structure. During each pass, every cell is moved exactly once between two partitions.
At the beginning of the pass, all cells are "unlocked", i.e., free to be swapped. Iteratively, the pair of unlocked cells with highest gain is swapped. After the selected cells are swapped, they become "locked" and the algorithm updates both the cost of the new partition and the gains of the remaining unlocked cells. After all cells are locked, the lowest-cost partition encountered over the entire pass is restored and returned.
Further passes are executed, each using the result from the previous pass as its starting point, until no improvement results. Computing gains in the KL heuristic is expensive; $O(n^2)$ swaps are evaluated before every move, resulting in a complexity per pass of $O(n^2 \log n)$ (assuming a sorted list of costs). The method of Fiduccia and Mattheyses (FM) [FM82] reduces the time per pass to linear in the size of the netlist (i.e., $O(p)$, where $p$ is the total number of pins) by adopting a single-cell move structure, and a gain bucket data structure that allows constant-time selection of the highest-gain cell and fast gain updates after each move.
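To make the gain bucket idea concrete, here is a minimal Python sketch of such a structure. It is a simplification under stated assumptions: gains are integers bounded by a known `max_gain`, and each bucket is a Python set rather than the doubly linked list of [FM82]; the class and method names are hypothetical.

```python
class GainBuckets:
    """Array of buckets indexed by gain, with a pointer to the best bucket."""

    def __init__(self, max_gain):
        self.max_gain = max_gain
        # One bucket per integer gain in [-max_gain, +max_gain].
        self.buckets = [set() for _ in range(2 * max_gain + 1)]
        self.gain = {}    # cell -> current gain
        self.top = -1     # index of the highest possibly non-empty bucket

    def insert(self, cell, gain):
        assert -self.max_gain <= gain <= self.max_gain, "gain out of range"
        idx = gain + self.max_gain
        self.buckets[idx].add(cell)
        self.gain[cell] = gain
        self.top = max(self.top, idx)

    def remove(self, cell):
        idx = self.gain.pop(cell) + self.max_gain
        self.buckets[idx].discard(cell)

    def update(self, cell, delta):
        """Apply a gain update: move the cell to its new bucket."""
        new_gain = self.gain[cell] + delta
        self.remove(cell)
        self.insert(cell, new_gain)

    def pop_best(self):
        """Return (cell, gain) for a highest-gain cell, or None if empty."""
        while self.top >= 0 and not self.buckets[self.top]:
            self.top -= 1          # lazily skip buckets that have emptied
        if self.top < 0:
            return None
        cell = self.buckets[self.top].pop()
        return cell, self.gain.pop(cell)
```

In a pass, every unlocked cell is inserted once, `pop_best` selects the next move, and `update` is called for the neighbors of the moved cell.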
2.2. Min-Cut Placement
Placement by recursive (bi-)partitioning is based on repeated division of a given circuit into subhypergraphs to optimize a given partitioning objective. With each partitioning of the circuit, the given layout area is partitioned in either the horizontal or the vertical direction. Each subhypergraph is assigned to a partition; when each subhypergraph has only one cell, then each cell will have been mapped to a unique (non-overlapping) position on the chip. Early approaches which use a min-cut partitioning objective are due to such authors as Breuer [Bre76] [Bre77] or Lauther [Lau79]. Most modern partitioning-based placers use some form of KL-FM partitioning heuristic, also with the minimum net-cut objective. Because the minimum net-cut is a poor abstraction of the real placement cost function (e.g., only in some limiting sense will total cuts capture total MST wirelength), various devices have been used to improve min-cut placement; the most important of these are quadrisection and terminal propagation.
2.3. Quadrisection
While many placement tools have relied on top-down min-cut bipartitioning, the main disadvantage of such an approach is that it can greedily obtain very good results in the first cut, but then bad results in successive cuts. The placement problem is essentially two-dimensional, in that we assign cells to locations in a planar layout. However, min-cut bisection adopts a one-dimensional approach, partitioning the netlist along a single cut line at each step. Suaris and Kedem [SK87b] [SK87a] [SK88] [SK89] use quadrisection to divide the chip, yielding a truly two-dimensional placement procedure and results that are superior to those of top-down bipartitioning placement. Their quadrisection algorithm uses an extension of the FM heuristic which also runs in linear time per pass. Since a cell in one quadrant can be moved to any of the other three quadrants, there are 12 gain buckets, each corresponding to an ordered pair of quadrants. At each step, a cell with highest gain is selected. Suaris and Kedem also apply a more accurate cost function which considers different horizontal and vertical weights.
2.4. Terminal Propagation
When partitioning a (sub-)circuit into several parts, it is not sufficient to consider only the netlist induced over the modules in the subcircuit, i.e., only the internal nets. Nets connecting to external IO pads or other cells in another (higher-level) partition must also be considered. Dunlop and Kernighan [DK85] proposed the terminal propagation technique, which adds to the current netlist dummy cells that are fixed in the appropriate partitions.
For quadrisection, the terminal propagation technique is shown in Figure 1. The figure shows that block $B_2$ is about to be partitioned into $\{B_{20}, B_{21}, B_{22}, B_{23}\}$. Cells C in $B_{02}$ and E in $B_3$ are connected to cells D and F in $B_2$. It would be beneficial to assign D to $B_{20}$ and F to either $B_{21}$ or $B_{23}$. The terminal propagation is done by inserting dummy cells fixed in specific blocks: in this example, dummy cell G is fixed in block $B_{20}$ and dummy cell H is allowed to move only between $B_{21}$ and $B_{23}$.
[Figure 1: Terminal propagation for quadrisection. Legend: standard cell, dummy cell.]
A GENERAL GAIN UPDATE SCHEME FOR ITERATIVE PARTITIONING
In this section, we introduce an efficient, unified approach to updating gains for arbitrary objective functions during iterative multi-way partitioning. This technique is general, and can capture particular placement objectives exactly for individual nets; it enables the new top-down placer described in the next section. Given a $k$-way partitioning $\{P_0, P_1, \ldots, P_{k-1}\}$, define the configuration of a given net to be the distribution of its cells into the partitions $P_i$, $i = 0, 1, \ldots, k-1$. Each $P_i$ contains either zero or a nonzero number of cells in the net. For each net $e$, let $c_i(e)$ be the number of cells in $e$ that are distributed in partition $P_i$, i.e., $c_i(e) = |\{v \mid v \in e \text{ and } v \in P_i\}|$. We can use a binary number $f_0 f_1 \ldots f_{k-1}$ to represent each configuration, where $f_i = 1$ if $c_i(e) \ge 1$ and $f_i = 0$ if $c_i(e) = 0$. There are at most $2^k - 1$ different configurations for each net in a $k$-way partitioning. Figure 2 shows the 15 possible configurations in a 4-way partitioning.
We use $\mathrm{conf\_id}(f_{k-1} f_{k-2} \ldots f_0) = \sum_{j=0}^{k-1} 2^j f_j$ to denote the configuration-id of a given net. In our new gain update scheme, each net $e$ has an associated net vector $V_e$ with length $2^k - 1$.
Sum-of-degrees Cost: Minimize $\sum_{e \in E} cost(e)$, where $cost(e) = 0$ if net $e$ distributes cells in one partition, and $cost(e) = j$ if net $e$ distributes cells in exactly $j > 1$ partitions.
MST Cost: This is a special objective for 4-way partitioning. The hypergraph is partitioned among the upper-right, upper-left, lower-right and lower-left quadrants of the layout. Minimize $\sum_{e \in E} cost(e)$, where $cost(e)$ is the MST routing cost based on the cell distribution of net $e$.
As noted above, Sanchis [San89] developed a multi-way gain computation with lookahead for net-cut cost; she also developed gain computation schemes for absorption cost and quadratic cost in [San93]. Here, we propose to use the net vector concept to unify the gain computation for various objectives. Examples of net vectors with different values corresponding to different objectives are shown in Table 1 for 4-way partitioning. The method can be extended to any $k$-way partitioning as long as $k$ is not too large.
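The net vector idea can be sketched in a few lines of Python: the configuration-id is computed from the per-partition cell counts $c_i(e)$, and a net vector is simply a table of costs indexed by configuration-id. This is an illustrative sketch, not the implementation used in our placer. The function names are ours, net-cut cost is taken to be 1 for any net spanning more than one partition, and the vector is stored with $2^k$ entries so that it can be indexed directly by configuration-id (entry 0, the empty configuration, is unused).

```python
def conf_id(counts):
    """Configuration-id: sum over j of 2^j * f_j, with f_j = 1 iff c_j(e) >= 1."""
    return sum(1 << j for j, c in enumerate(counts) if c >= 1)


def partitions_used(cid):
    """Number of partitions in which the net has at least one cell."""
    return bin(cid).count("1")


def net_cut_cost(cid):
    # Assumed definition: a net costs 1 when it spans more than one partition.
    return 1 if partitions_used(cid) > 1 else 0


def sum_of_degrees_cost(cid):
    # 0 for an uncut net, otherwise the number of partitions it spans.
    parts = partitions_used(cid)
    return parts if parts > 1 else 0


def net_vector(k, cost_of_config):
    """Cost table indexed by configuration-id (index 0 is never looked up)."""
    return [cost_of_config(cid) for cid in range(1 << k)]


# A net with one cell in P0, one in P1 and two in P2 (none in P3).
cid = conf_id([1, 1, 2, 0])
cut_vec = net_vector(4, net_cut_cost)
deg_vec = net_vector(4, sum_of_degrees_cost)
print(cid, cut_vec[cid], deg_vec[cid])
```

Because the objective enters only through the table, changing the objective means changing the vector contents and nothing else.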
A more detailed discussion of the gain computation and update is now in order. We will center on the MST cost objective and 4-way partitioning. This is because our placement approach is based on recursive quadrisection. Also, since the MST is more accurate than net-cut as an estimate of routing cost, our placer uses an MST cost objective instead of the traditional cut-based objective.
We first observe that the net vector given in Table 1 for the MST cost objective assumes unit wire cost in both the horizontal and vertical directions. In other words, resources, congestion and routing costs are equal in both directions. In practice, horizontal and vertical wire costs should be weighted according to the available resources; e.g., a three-layer HVH design might be relatively richer in horizontal resources, while a four-layer HVHV design might be relatively richer in vertical resources. Let $h$ and $v$ respectively be the unit costs of horizontal and vertical wiring. We can easily create a net vector to capture this kind of objective function, as shown in Table 2.
We next observe that our partitioning algorithm will use the same FM gain bucket data structure as in [FM82]. However, our gain computation is different from that of previous works. There are $k(k-1)$ gain buckets for $k$-way partitioning. We let $\gamma_j(v)$ denote the gain for moving cell $v$ to partition $j$. Suppose cell $a$ is moved from partition $P_s$ to partition $P_t$. For each net $e$ incident to $a$, we must update the gain of each cell $b \in e$ ($b \ne a$) as it moves from its current partition $P_x$ to partition $P_y$, i.e., $b \in P_x$ and $y \ne x$. To see how this gain update can be accomplished in constant time, consider the following four configurations: $conf_0$ is the original configuration for net $e$ (i.e., $a \in P_s$ and $b \in P_x$); $conf_1$ is the configuration after $a$ is moved to $P_t$ (i.e., $a \in P_t$ and $b \in P_x$); $conf_2$ is the configuration after $b$ is moved to $P_y$ but before $a$ is moved (i.e., $a \in P_s$ and $b \in P_y$); and $conf_3$ is the configuration after $b$ is moved to $P_y$ and after $a$ is moved (i.e., $a \in P_t$ and $b \in P_y$).
Indexing the net vector by configuration-id, the gain for moving cell $b \in P_x$ to $P_y$ before $a$ is moved is $V_e[conf_0] - V_e[conf_2]$, and the gain after $a$ is moved is $V_e[conf_1] - V_e[conf_3]$. The gain update is the difference of the two, $\Delta\gamma_y(b) = (V_e[conf_1] - V_e[conf_3]) - (V_e[conf_0] - V_e[conf_2])$, and thus computing the gain update for cell $b$ requires looking up only the four configurations $conf_0, \ldots, conf_3$. Figure 3 shows an example of the gain update computation. In the figure, the 4-pin net $e$ contains cells $\{a, b, c, d\}$: $a$ is in $P_1$, $b$ is in $P_0$, and $c$ and $d$ are in $P_2$. Cell $a$ is about to be moved to $P_0$, and we would like to update the gain for moving cell $b$ to $P_3$. The current configuration for net $e$ (i.e., $conf_0$) is "0111". After $a$ is moved to $P_0$, $conf_1$ is "0101". The configuration after moving $b$ to $P_3$ before $a$ is moved (i.e., $conf_2$) is "1110". The configuration after moving $b$ to $P_3$ after $a$ is moved (i.e., $conf_3$) is "1101". Therefore, the gain update for moving $b$ to $P_3$ is
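$\Delta\gamma_3(b) = (V_e[\mathrm{conf\_id}(0101)] - V_e[\mathrm{conf\_id}(1101)]) - (V_e[\mathrm{conf\_id}(0111)] - V_e[\mathrm{conf\_id}(1110)])$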
If net cut is our objective function, the gain update is $\Delta\gamma_3(b) = (1 - 1) - (1 - 1) = 0$. By contrast, if MST cost is our objective function, the gain update is $\Delta\gamma_3(b) = (1 - 2) - (1 - 1) = -1$. In other words, the net cut cost does not change when $a$ and $b$ are moved to their new partitions, while the MST cost is reduced by 1. Figure 4 summarizes our gain update scheme when cell $a$ is moved from $P_s$ to $P_t$. We also illustrate how to efficiently compute the four configurations using bit operations. We emphasize that this gain update scheme can be used within almost any iterative partitioning approach, including k-way FM, 2-phase FM [BCL87], CLIP-FM [DD96a] and multilevel FM [Alp96].
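The four-configuration lookup can be sketched in Python as follows. This is a minimal illustration rather than the code of Figure 4. It assumes a hypothetical quadrant layout (P0, P1 in one row and P2, P3 in the next, so that P0 and P2 are adjacent), builds the MST net vector from quadrant centers with unit costs `h` and `v`, and derives each configuration from the per-partition cell counts of the net. The individual vector entries therefore depend on that assumed layout; only the lookup pattern is taken from the text.

```python
# Assumed quadrant centers on a unit grid: P0=(0,0), P1=(1,0), P2=(0,1), P3=(1,1).
CENTERS = [(0, 0), (1, 0), (0, 1), (1, 1)]


def mst(points, h, v):
    """Prim's algorithm with weighted Manhattan distance h*|dx| + v*|dy|."""
    if len(points) < 2:
        return 0

    def dist(p, q):
        return h * abs(p[0] - q[0]) + v * abs(p[1] - q[1])

    best = {i: dist(points[0], points[i]) for i in range(1, len(points))}
    total = 0
    while best:
        j = min(best, key=best.get)
        total += best.pop(j)
        for i in best:
            best[i] = min(best[i], dist(points[j], points[i]))
    return total


def mst_vector(h=1, v=1):
    """Net vector for the MST objective, indexed by configuration-id."""
    return [mst([CENTERS[i] for i in range(4) if (cid >> i) & 1], h, v)
            for cid in range(16)]


def config(counts):
    """Bitmask with bit i set iff the net has a cell in partition i."""
    return sum(1 << i for i, c in enumerate(counts) if c > 0)


def moved(counts, src, dst):
    """Cell counts of the net after one of its cells moves src -> dst."""
    new = list(counts)
    new[src] -= 1
    new[dst] += 1
    return new


def gain_update(vec, counts, s, t, x, y):
    """Change in the gain of moving b (x -> y) caused by moving a (s -> t)."""
    conf0 = config(counts)
    conf1 = config(moved(counts, s, t))
    conf2 = config(moved(counts, x, y))
    conf3 = config(moved(moved(counts, s, t), x, y))
    return (vec[conf1] - vec[conf3]) - (vec[conf0] - vec[conf2])


# Example of the text: a in P1, b in P0, c and d in P2; a moves to P0, b to P3.
counts = [1, 1, 2, 0]
cut_vec = [1 if bin(cid).count("1") > 1 else 0 for cid in range(16)]
print(gain_update(cut_vec, counts, s=1, t=0, x=0, y=3))       # net-cut objective
print(gain_update(mst_vector(), counts, s=1, t=0, x=0, y=3))  # MST objective
```

Under these assumptions the sketch reproduces the gain updates of the example, 0 for net cut and -1 for MST cost, and the work per affected cell is four table lookups regardless of the objective.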
A NEW TOP-DOWN QUADRISECTION BASED PLACER
We now describe a new top-down quadrisection-based placement algorithm, based on the gain update scheme described in the previous section along with a multilevel partitioning engine. We also describe meta-heuristic approaches to improve solution quality, as well as means to accommodate such practical constraints as differing vertical and horizontal cut costs.
Our quadrisection-based placement is based on the gain update scheme of Section 3. The differences between our new placer and previous quadrisection-based placers are as follows. First, our approach handles instances with mixed floating and fixed pads, or even without any pads at all. E.g., if all IO pad locations are fixed, we only partition the internal (core) standard cells and do not need to partition [...] where it serves as the initial partitioning solution of $N_{m-1}$ and is refined by the FM-based partitioner. The unclustering and refinement procedure continues until the original netlist $N_0$ is partitioned. A similar approach can be applied to multi-way partitioning. ML is efficient (an untuned implementation performs 4-way partitioning of a 25,000-cell design in 32 CPU seconds on a Sun Ultra 1 (140 MHz)), and yields excellent results when compared against the best known methods from the literature [Alp96] [AHK96].
4.2. Net Vector Computation
During each stage of quadrisection, only the cells located in the current partition are movable; cells outside the current partition are fixed. We first compute the center coordinates of the four quadrants in the current partition. For each net e, we compute the number of pins located in the current partition, as well as all possible configurations with respect to the net e. Next, we evaluate the user-specified cost function (e.g., MST cost or half-perimeter) for the net e according to the pin distributions of all possible configurations, and normalize the costs so that the lowest cost is zero (this reduces the index of the highest-gain bucket, i.e., the maximum possible gain, and improves runtime efficiency). Finally, we assign the net costs to their corresponding net vector entries. Figure 5 shows a snapshot of the top-down quadrisection process, with the northeast quadrant as the current partition. In the figure, the northwest quadrant has already been quadrisected and the northeast quadrant will be processed next. Consider a 5-pin net with two pins located in the current partition and three pins fixed outside the partition, with one of the fixed pins an IO pad. There are 10 different configurations. Figure 5 illustrates the configuration-id, MST cost and half-perimeter cost for each configuration.
If the MST cost function is selected, the net cost ordered by the configuration-id is [-, 5, 7, 6, 7, 6, 9, -, 8, 7, 8, -, 8, -, -, -], and the resulting net vector is [0, 0, 2, 1, 2, 1, 4, 0, 3, 2, 3, 0, 3, 0, 0, 0]. When this quadrant is partitioned, the hypergraph instance contains a 2-pin net which has the above net vector. Again, our approach does not require terminal propagation, and exactly captures the placement cost function during partitioning. Figure 6 shows the algorithm template for QUAD.
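The normalization step of this example can be checked with a short Python sketch. The helper names are ours; infeasible configurations (those that would need more than two movable pins) are represented by `None` and mapped to 0 in the net vector, as in the vector above.

```python
def feasible_ids(movable_pins, k=4):
    """Configuration-ids reachable when the net has this many movable pins."""
    return [cid for cid in range(1, 1 << k)
            if bin(cid).count("1") <= movable_pins]


def normalize(costs):
    """Shift costs so the lowest feasible cost is zero; infeasible -> 0."""
    lowest = min(c for c in costs if c is not None)
    return [0 if c is None else c - lowest for c in costs]


# MST costs of the 5-pin net of Figure 5, ordered by configuration-id.
costs = [None, 5, 7, 6, 7, 6, 9, None, 8, 7, 8, None, 8, None, None, None]

print(len(feasible_ids(2)))   # number of configurations for two movable pins
print(normalize(costs))       # the resulting net vector
```

Two movable pins can occupy one quadrant (4 ways) or two quadrants (6 ways), which gives the 10 configurations stated above.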
4.3. Meta-Heuristic Improvements
We can further improve the placement quality by the following two operations.
At each level of quadrisection, we may cycle the partitioning process. Figure 7 [...] While cycling the partitioning procedure at each level, a second performance improvement is possible by performing the quadrisection on overlapped regions. Figure 8 shows nine overlapped regions that are quadrisected at the second level. In general, there are $(2^k - 1)^2$ overlapped regions at the $k$-th quadrisection level.
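The count of overlapped regions can be illustrated with a small Python sketch. It rests on an assumption about the geometry that the text does not spell out: after the $k$-th level the layout is a $2^k \times 2^k$ grid of blocks, and each overlapped region is a window of 2 x 2 adjacent blocks, with windows stepping by one block in each direction. The function name and the tuple layout are hypothetical.

```python
def overlapped_regions(level):
    """Enumerate 2x2 block windows on the 2^level x 2^level block grid.

    Each region is returned as (col_lo, row_lo, col_hi, row_hi) in block
    units, with the upper bounds exclusive.
    """
    n = 1 << level                      # blocks per side at this level
    return [(col, row, col + 2, row + 2)
            for row in range(n - 1)
            for col in range(n - 1)]


for level in (1, 2, 3):
    print(level, len(overlapped_regions(level)))
```

With this reading, the second level yields the nine regions of Figure 8, and the count at level $k$ is $(2^k - 1)^2$.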
EXPERIMENTAL RESULTS
Our experiments were run on a Sun Ultra 1 (140 MHz) with 192 MB RAM, and all runtimes reported (mm:ss) are for this machine. Our first experiment compares QUAD without cycling/overlapping (QUAD w/o CO), QUAD without overlapping (QUAD w/o O) and QUAD. All test cases were placed with 100% area utilization. The results are shown in Table 3. QUAD w/o CO averages 10% greater wirelength but can require as little as 17% of the runtime of QUAD for large benchmarks.
Our second experiment compares our quadrisection results with GORDIAN-L [SDJ91] and the post-processing detailed placer DOMINO [DJA94] on 18 benchmarks with 100% area utilization (results for GORDIAN-L/DOMINO were provided by Guenter Stenz [Ste97] at TU Munich). Note that GORDIAN-L is a global quadratic placement tool, while DOMINO is a detailed placer; QUAD should be considered as a global placer. The MST wirelength results are shown in Table 4. QUAD outperforms GORDIAN-L on 15 benchmarks, and performs about 1% worse on three benchmarks.
The average MST wirelength improvement over GORDIAN-L is 4.8%. QUAD also performs slightly better than DOMINO. Table 5 compares QUAD against GORDIAN-L/DOMINO on the same set of benchmarks using the half-perimeter (HP) objective; this is the measure used by the authors of GORDIAN-L and DOMINO. QUAD has an average of 4.4% improvement over GORDIAN-L, but [...]
We have also compared 2-D congestion as measured by a simple supply and demand model, where "supply" is the available horizontal and vertical routing tracks and "demand" is the MST routing edge for the net. We did this to verify that our wirelength improvements did not come at the cost of routing hotspots. Figure 9 depicts the overcongested areas of the avqsmall placements generated by QUAD and DOMINO; overcongested resources are those for which the sum of vertical and horizontal demands exceeds the sum of supplies (for space reasons, we dispense with the details of these measurements). The QUAD placement has 0.8% overcongested area while the DOMINO placement has 1.2% overcongested area. Thus, although QUAD uses 1% more wirelength for this case, it has better congestion distribution than DOMINO.
EXTENSIONS TO TIMING-DRIVEN PLACEMENT
We have extended our basic quadrisection-based global placement engine in a number of directions.
One direction of interest is timing-driven placement, where simple extensions allow the top-down quadrisection to be driven by net cost vectors that capture both timing and wirelength aspects of the circuit layout. Our timing-driven implementations update net cost vectors according to various schemes, e.g., based on timing analysis that is interleaved with the partitioning. Table 6 shows results comparing our timing-driven placement results with those of SPEED [RE95]. Here, "Delay" (a sort of "cycle time") is the maximum path delay between any pair of sequentially adjacent storage elements (flip-flops). Path delays are computed using pin parasitics and cell intrinsic delays from the timing-PROUD library data, along with a centroid-star net model and Elmore delay for the interconnect. This is the same delay evaluation (with the same interconnect parasitics) used in [RE95], except that we apply factors of 1/2 in the Elmore delay expressions that were not applied in [RE95]. We see that timing-driven QUAD ("Timing-QUAD") outperforms SPEED by an average of 3% in terms of delay while maintaining an average of 4.7% less MST cost.
We have also compared Timing-QUAD with the TimberWolf simulated annealing based timing-driven placement package (results obtained from Swartz [Swa96]) on the three test cases fract, struct and avqsmall using the same technology parameters as in the previous experiment. For each test case, TimberWolf uses different IO locations, number of rows and row locations. Thus, comparisons with TimberWolf involve completely different QUAD results from those of Table 6 . The TimberWolf comparison with Timing-QUAD is shown in Table 7 ; the two packages seem very comparable.
