In this paper we take a fresh look at the partition-driven placement (PDP) paradigm for standard-cell placement for wire-length minimization. The goal is to develop several new algorithms for incorporation into a PDP framework that can rectify the well-known drawbacks of traditional PDP (increasingly localized view of nets with increasing levels of the partitioning tree, min-cut objective, inaccuracy and cost of terminal propagation (TP), irreversibility of move decisions), while preserving its considerable advantages (time efficiency, flexibility in accurately incorporating many optimization metrics, and flexibility in satisfying most constraints). We have developed several novel techniques within a PDP-based framework that yield the best wire-length results so far on all but two circuits of the MCNC benchmark suite. Our major innovations are: (1) simultaneous level partitioning (SLP), in which we partition the entire circuit globally in every level of the partitioning tree, across the current cutline(s); (2) cell gain computation based on a global or distributed view of entire nets (thus obviating TP) and on the bounding-box (BB) minimization of nets (as opposed to min-cut in prior PDP); (3) move irreversibility tackled in a post-processing phase via vertical and horizontal swaps. Empirical results indicate that our PDP algorithm SPADE (for Simultaneous level PArtitioning with Distributed [i.e., global] nEt views) provides almost 20% better wirelength results than an internal version of "regular" PDP with min-cut based gains, 10.8% better than the previous best PDP method QUAD, 10.6% better than TimberWolf (TW) 7.0, 15.8% better than the state-of-the-art force-directed technique from U. Munich (termed FD-98 here), and 15.3% better than the multilevel placement technique Snap-On. Besides TW7.0, we are also the only ones to report results on the approximately 100K-cell circuit golem3 (12.2% better than TW7.0). Our run times are quite reasonable.
Introduction
Cell placement is a decisive phase in the physical design of VLSI circuits. With the advent of deep sub-micron (DSM) technology, placement tools must be able to handle large input circuits (up to tens of millions of transistors), and optimize different objectives (delay, power, area, etc.) under possibly multiple constraints (like crosstalk bounding and thermal distribution). Bearing these in mind, we find hierarchical partitioning to be a desirable framework for placement, because of the following three properties:
* It is a divide-and-conquer type approach, which is in general time-efficient for large inputs.
* It implies top-down processing. Thus, many circuit parameters can be considered at the same time, enabling us to take more balanced and gradually refined decisions with a global view.
* It offers great flexibility in tackling multiple constraints. Particularly, the partitioning engine can be implemented in an iterative improvement (local search) fashion to proceed with multi-constraint satisfaction on a per-move basis.
We term any placement algorithm based on hierarchical partitioning a partition-driven placement (PDP) method. Basically, PDP is based on repeated subdivision of the given netlist and layout surface (core region) into subcircuits and rectangular regions, respectively, and assignment of subcircuits to those regions [1, 4]. When each subcircuit consists of only one cell, it is also uniquely placed in a region, marking the termination of PDP. Such a process can be represented by construction of a partition tree (see Fig. 1b), where each node in the tree represents a region and the circuit cells assigned to it. The root node is the starting core region, containing the whole circuit. All nodes of the same processing depth in the tree constitute a level. In this paper, we focus on application of the PDP methodology to standard cell circuit placement. Given a user-specified row number and row length upper bound, our objective is to get a placement with minimized total half-perimeter wire length, which is free of cell overlap.
This work was partly funded by a grant from Intel Corp.
Other mainstream placement algorithms are largely based on either simulated annealing (SA) [12, 13] or mathematical programming [14, 15, 16] (LP, or force-directed quadratic programming) methodologies. SA (e.g., TW7.0) is well-known for its good solution quality as applied to placement problems, but can be fairly slow. The Gordian-L/Domino package, based on linear programming and 'soft' partitioning [14, 15], offers a good trade-off between run time and solution quality. Another force-directed method, FD-98 [16], gives slightly better performance than TW7.0, and is considered to be the current state-of-the-art. Mathematical programming based methods, though efficient and effective per se, suffer from rigid formulation constraints (e.g., linear or quadratic programming). Hence, such algorithms usually adopt some imprecise models (e.g., cell-overlap and lumped-capacitance delay), and render modifications accommodating certain additional constraints difficult. An orthogonal paradigm of two-phase placement is proposed in the Snap-On placer [18], which offers good flexibility and enables early prediction of final placement results. In the first phase, the input netlist is hierarchically placed in 'global bins' using a combination of min-cut partitioning for the higher levels (the hMetis partitioner is used in [18]) and simulated annealing with a min-wirelength objective at the lower levels. This global placement is then refined in the detailed placement phase using simulated annealing.
The PDP methodology is not without drawbacks. First of all, placement is a problem that, in general, cannot be solved optimally through a divide-and-conquer approach like PDP, i.e., optimal solutions to subproblems from PDP (optimal placement within separate regions) may not constitute a globally best solution. Specifically, any cell moved across a cutline during a previous level of processing can never go back. Such irreversibility tends to limit the ability of PDP to find a good local optimum when reaching lower levels of the partition hierarchy. Further, the traditional PDP method pursues minimization of cutsize across cutlines, while the actual objective is to minimize (estimated) wire length or layout area. Such a mismatch in optimization objectives can have an adverse impact on overall solution quality. We have developed a set of techniques, discussed in Sec. 3, that effectively deal with all of the above issues.
The rest of the paper is organized as follows. Section 2 describes previous work on two fronts: the traditional PDP framework (with terminal propagation), and circuit partitioning algorithms. Section 3 consists of our new techniques for making the PDP paradigm significantly more effective: simultaneous level partitioning, global net views, min-wirelength partitioning, and swap-based post-processing. We
An important contribution that makes this framework complete is terminal propagation (TP), which results in as much as 30% improvement in placement quality [2]. The significance of TP lies in the fact that bi-partitioning of each region is not independent of the others, and such external influences are largely captured by producing dummy cells serving as virtual I/O pads at the boundary of the region under processing.
Circuit Partitioning Engine: Circuit partitioning engines are crucial to the performance of any PDP algorithm. A comprehensive survey of circuit partitioning is given in [9]. Most state-of-the-art bi-partitioners are of the iterative-improvement family, and can be regarded as variants of FM [6], which is linear time per pass. Dutt and Deng proposed two novel iterative-improvement partitioners (IIPs): PROP [7] considers futuristic as well as immediate gain, while CLIP [8] is a general cluster-oriented scheme that can be overlaid on any IIP engine. More recently, two multi-level partitioners, hMetis [10] and MLc [11], with improved results and lower run times were proposed.
Our New PDP Techniques
In this section we present the general framework of our PDP algorithm SPADE, and elaborate on the new PDP techniques used in it, and how they address the drawbacks of classical PDP discussed earlier. In the sequel, we will use p, n and e to denote the total number of pins, cells and nets, respectively, in a circuit, and we will use d to denote the degree of an arbitrary net.
Simultaneous Level Partitioning (SLP) with Global Net Views
To consider the interdependencies among multiple regions under PDP in each level of the partition tree, we partition all regions simultaneously, instead of using the traditional SeqLP paradigm of all previous PDP algorithms [1, 2, 3, 4]. At the same time, information on the local region to which each cell belongs is maintained, and moves are still local (between the two child regions generated by the local cutline). The move sequence is, however, global, i.e., the current best-gain feasible cell across the entire circuit is moved. Cell gains are computed and updated (at each move) based on the global pin distribution of their nets. Pins of each net are stored in a binary search tree (pin-tree) sorted by their coordinates along the direction perpendicular to the current cutlines. In this way, during any level of processing, we always have a global view of the pin distribution of each net, and can find its bounding box (BB) geometry quickly. For example, in Fig. 2(a), cutlines 3 and 3' (they are actually 'slice lines', to be explained later) are processed together in the third level (assuming cutlines 1 and 2 have been processed in the first two levels). As a result, four regions are cut simultaneously. Every cell move will potentially have a global (out-of-region) effect, and the best prefix point after each pass of partitioning is also determined by the overall solution cost at that level, as opposed to local gain update and prefix point determination in previous SeqLP methods. After each cell is moved, we also need to update the gains of neighboring cells outside the current region. Again, in Fig. 2, if cells u, v and w are connected by a three-pin net n1, and the horizontal dimension of each level-4 partition is one unit, then w's gain is +1, u's gain is -1, and v's gain is 0. However, if w is moved, v's gain will get updated and become -1, although they are in different regions.
Figure 2: (a) Regions under SLP. (b) Corresponding partition tree.
As a comparison, terminal propagation, required for previous SeqLP methods, also takes into account external pin distributions regarding nets inside the current region, but it needs to generate additional dummy cells on-the-fly. Moreover, in some cases, TP cannot help SeqLP escape local optima, or it might not be accurate enough. Such an example is shown in Figs. 3b-c. These show that SeqLP cannot escape a local optimum when the subregions are partitioned in clockwise order starting from the upper left quadrant; the cluster move sequence is B, D, E, H, which results in a wire-length reduction of 22x, where x is the horizontal dimension of each of the 8 regions. However, with SLP, the move order of the clusters is chosen for global optimality, and is H, A, D, E, which results in a greater wire-length reduction of 25x. Further, Fig. 3d shows that cell gain calculations based on local subnets using TP are not always accurate. In the figure, cell u1 can be in either the A1 or A2 subregion initially, while u2 can be in either B1 or B2 initially; thus they are shown on the cutline to depict this generality. Using TP and the local subnet of ni in the (B1, B2) region, the gain of u2 (w.r.t. ni) will be 0 in both directions, as only the nearest dummy position is considered. The dummy on the upper-left portion of the net does not affect u2's gain. However, a global view of net ni, as considered in SPADE, will correctly compute u2's gain to move from right to left as positive, as that decreases the BB of ni. Note that in SPADE, TP is obviated, since there is no concept of 'external' nets (every net is now globally visible).
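The global net view can be illustrated with a minimal sketch. It assumes, for brevity, that the pin-tree of a net is replaced by a sorted Python list along one axis (a balanced search tree would give logarithmic updates); the class name `PinTree`, its methods, and the coordinates in the usage note are hypothetical and are not taken from Fig. 2.

```python
from bisect import bisect_left, insort


class PinTree:
    """Sorted pin coordinates of one net along one axis.

    A sorted list stands in for the binary search tree (pin-tree);
    this is an assumption made to keep the sketch short.
    """

    def __init__(self, coords):
        self.coords = sorted(coords)

    def span(self):
        """Extent of the net's bounding box along this axis."""
        return self.coords[-1] - self.coords[0] if self.coords else 0.0

    def move(self, old, new):
        """Move one pin from coordinate `old` to coordinate `new`."""
        i = bisect_left(self.coords, old)
        if i == len(self.coords) or self.coords[i] != old:
            raise ValueError("no pin at coordinate %r" % (old,))
        self.coords.pop(i)
        insort(self.coords, new)

    def move_gain(self, old, new):
        """Reduction in span if a pin moved from `old` to `new`.

        Positive means the bounding box shrinks. The net is left unchanged.
        """
        before = self.span()
        self.move(old, new)
        after = self.span()
        self.move(new, old)  # restore the original pin position
        return before - after
```

With hypothetical pins at 0.5, 2.5 and 3.5, moving the rightmost pin one unit to the left has gain +1, while moving the middle pin one unit to the right has gain 0. Once the rightmost pin has actually moved, the same move of the middle pin has gain -1, which is the kind of out-of-region gain update that a global net view makes visible.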
Decoupled Regions: A Caveat in SLP
Since standard cell circuits are placed in uniform-height rows, the entire chip layout area is repeatedly divided into implicit row-sets at each horizontal partitioning step, and horizontal (local) cutlines across the regions spanning the same underlying row-set are perfectly aligned. We term such an aligned set of cutlines a slice line. Cell gains in the half-perimeter wire length measure are solely dependent on the pin positions of incident nets. Meanwhile, each cell, once assigned to one side of a slice line, can never go back across it in later levels of processing. That means the gain of any cell on one side of a (previous-level) slice line is independent of perturbations on the other side. In other words, any region in a row-set is decoupled from the regions in other row-sets, as summarized in Theorem 1.
Theorem 1 For every net ni spanning multiple row-sets, its net length change within any row-set is independent of cell moves (due to horizontal partitioning) in other row-sets. Hence, horizontal partitioning of multiple row-sets can be performed sequentially, without loss of solution quality.
Proof Sketch: Cell moves in different row-sets are only in the vertical direction, thus the horizontal dimension (and coordinates) of the net BBs cannot change. Nets spanning row-sets cannot be removed from the previous levels' slice line(s). Thus changes in pin distribution of such nets in one row-set cannot affect the wire length reduction possible for such nets due to cell moves in other row-sets. □
To exploit such separability of row-based placement, we implemented horizontal partitioning (at any level) as processing each set of regions spanning the same row-set individually, starting from the bottom row-set. Within any row-set, regions are partitioned simultaneously. For instance, in Fig. 2(a), the two lower quadrants spanning slice line 3 are simultaneously processed before the two upper quadrants spanning slice line 3'. Similarly, in higher levels of processing, we performed vertical partitioning by cutting one 'column' of regions at a time, starting from the leftmost column. Although vertical cutlines are not perfectly aligned, their deviation in coordinates is not significant at higher levels due to the balance constraint in partitioning. When vertically non-cuttable regions (e.g., single-cell regions) start to appear, our PDP algorithm switches to SLP automatically. In general, decoupled SLP leads to greatly reduced run time and improved wire length results. The pseudo code for (horizontally and vertically decoupled) SLP is given in Fig. 4.
Choice of cut directions for each level remains important in SPADE. Our experiments revealed that the quadrature style (alternating cut direction in adjacent levels) gives the best results. Hence, all later discussions and experimental results are based on quadrature-style cutline generation.
Wire-Length Based Gain Computation
Traditional PDP algorithms are oriented toward cutsize minimization. This is a crude approximation of the actual wire length objective, especially when there are many cells of greatly varied dimensions. Hence, we introduce cost and gain measures based directly on half-perimeter wire length, which has been shown to be a good estimate of wire length [5]. Also, at the end of each level of processing, we recompute the width of each newly generated child region as the total size of cells contained in it, and perform compaction of regions along horizontal slice lines (row-based placement). This is also carried out to estimate the wire length achieved after each run of SPADE, so that the best wire length (assuming every cell at the center of its region) so far is computed with more precision.
Algorithm SLP(partition_tree, head_region)
/* Simultaneous partitioning of regions (accessible through head_region) in partition tree */
    best_cost = ∞;
    for (i = 0; i < trial_num; i++) {   /* perform trial_num random runs */
        init_processing(partition_tree);   /* necessary initialization */
        init_partition(head_region);   /* random initial partitioning */
        done = 0;
        while (!done) {
            init_gain_cmp(head_region);   /* at the beginning of each pass, compute gains of all free cells within current set of regions, using intrinsic-gain-pairs of connected net segments */
            while ((cell = global_select_cell()) != NULL) {   /* select the globally best-gain cell that is non-violating */
                move_cell(cell);   /* lock the cell, record its wire-length gain */
                local_update_gain(cell);   /* update wire length gains of local neighboring cells */
                global_update_gain(cell);   /* update intrinsic-gain-pairs of newly perturbed net segments (in other regions), recompute gains of cells connected to those net segments */
            }
            pass_gain = get_best_prefix();   /* get best prefix point, revert moves after it, return overall gain of the pass */
            if (pass_gain ≤ 0) done = 1;
        }
        if ((curr_cost = get_cost()) ≤ best_cost) {
            best_cost = curr_cost;
            keep a copy of best result so far;
        }
    }
    update partition_tree with best recorded partitioning result;
    return;
Figure 4: Pseudo code for decoupled SLP.
Since a net can have pins in multiple regions, we term any maximal subset of pins of a net within a region a 'net segment'. For example, if net ni has three pins, p1, p2, p3, and p1, p2 are in one region, while p3 is in another, then (p1, p2) and (p3) are the two net segments of ni. There are up to three possible configurations of any net segment, defined by the distribution of its free-to-move pins relative to the cutline going across its containing region, as shown in Fig. 6a-b. We assign a bit label to identify a subregion relative to the cutline: the lower or left (upper or right) subregion has label 0 (1); see Fig. 6d. Also, when all free pins are in the lower or left (upper or right) subregion, a net segment is said to be in config. 0 (config. 1); otherwise it is in config. 2 (spanning the cutline). At any level, a net's half-perimeter length is uniquely determined by the current configurations of all its constituent net segments. When a cell is moved, all net segments connected to it are perturbed. Hence, the gain of a cell $c_i$ can be computed as $gain(c_i) = \sum_{n_{j,k}} gain(c_i, n_{j,k})$, where $n_{j,k}$ (the k-th net segment of net $n_j$) is connected to $c_i$. However, if the configuration of $n_{j,k}$ is not changed by the move of a cell connected to it, $gain(c_i, n_{j,k}) = 0$. Only those net segments whose configurations are changed by the current move contribute to a change in wire length.
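The configuration labels can be made concrete with a small sketch. It assumes a net segment is represented simply by the side labels (0 or 1) of its free pins; the function names `segment_config` and `config_changes` are hypothetical and serve only to illustrate when a move can contribute a non-zero gain term.

```python
def segment_config(free_pin_sides):
    """Return the configuration (0, 1 or 2) of a net segment.

    `free_pin_sides` lists the side label (0 = lower/left,
    1 = upper/right) of every free pin of the segment.
    """
    sides = set(free_pin_sides)
    if not sides:
        raise ValueError("segment has no free pins")
    if sides == {0}:
        return 0  # all free pins in the lower/left subregion
    if sides == {1}:
        return 1  # all free pins in the upper/right subregion
    return 2      # free pins on both sides: segment spans the cutline


def config_changes(free_pin_sides, pin_index):
    """True if moving pin `pin_index` across the cutline changes the
    segment's configuration, i.e. if the move can affect wire length."""
    before = segment_config(free_pin_sides)
    flipped = list(free_pin_sides)
    flipped[pin_index] = 1 - flipped[pin_index]
    return segment_config(flipped) != before
```

For a segment with free pins on sides [0, 0, 1], moving either pin on side 0 leaves the segment in config. 2, so that segment contributes nothing to the gain of the move, whereas moving the single pin on side 1 switches the segment to config. 0.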
To compute cell gains efficiently, we record a pair of gains, the intrinsic-gain-pair (IGP), for each net segment. Even if a net segment has only one free pin, we can compute $IGP(n_{i,j})$ by assuming a virtual config. 2. As we can see, $g^{n_{i,j}}[0]$ ($g^{n_{i,j}}[1]$) of net segment $n_{i,j}$ is the change in wire length of net $n_i$ when $n_{i,j}$ switches from config. 2 to config. 0 (config. 1). Hence, computation of the IGP of $n_{i,j}$ is dependent on the current global pin distribution of $n_i$. For instance, in Fig. 6d, $IGP(n_{i,1})$ is computed as (0.25, 0), $IGP(n_{i,2})$ = (0, 0), $IGP(n_{i,3})$ = (1, 0), $IGP(n_{i,4})$ = (0, 0). After a cell is moved, however, any neighbors within the same region will have their gains updated (note that an 'intrinsic' gain pair never changes at local moves, but can be changed by moves of cells connected to the same net outside its region). Moreover, gains of those neighbors external to the current region, which are connected to other net segments of the perturbed nets, need to be updated in a two-step fashion, i.e., first update the intrinsic-gain-pairs of their corresponding net segments, then recompute the gains of connected cells. This corresponds to the global_update_gain() procedure in Fig. 4.
    gain(c_i) = 0;
    side = getside(c_i);   /* get current side of c_i */
    for each net segment n_{j,k} connected to c_i {
        if (n_{j,k} is in config. side)
            gain(c_i) -= g^{n_{j,k}}[side];
        if (c_i is the only free cell connected to n_{j,k} in current side)
            gain(c_i) += g^{n_{j,k}}[other(side)];
    }
Figure 5: Computation of cell gain based on intrinsic-gain-pairs of incident net segments.
Inter-region update of intrinsic-gain-pairs, if done exhaustively, can be $O(d^2)$ in time for a net of degree d (consider the case in which the d pins of net $n_i$ are distributed in exactly d regions; we may need to update the IGPs of all (d - 1) other net segments at each move of a pin). This is not acceptable if the input circuit has very-large-degree nets. However, there is redundancy in exhaustive update. For example, in Fig. 6d, when $n_{i,3}$ has one pin moved across its local cutline, and switches to another configuration, there is no need to update the IGPs of the other net segments of $n_i$. For large-degree nets, only those pins really close to the boundaries of the net's current bounding box can trigger inter-region updates. In general, we can show that the overhead incurred by such updates, based on our binary search tree data structure for the pins of each net, is proportional to $d \log d$ for a net with a large d (updates of small nets with 2 to 4 pins, which are the majority of nets in real circuits, consume constant time).
We now analyze the per-pass time requirement of our PDP algorithm. [...] per-pass is $O(p \log^2 n)$ (since $p = \Theta(n)$).
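The gain rule of Fig. 5 can be sketched as follows, under the assumption that each net segment stores its intrinsic-gain-pair and the number of free cells it has on each side of the cutline, and that the cell under evaluation is itself counted among the free cells on its side. The names `NetSegment`, `igp`, `free_on_side` and `cell_gain` are hypothetical, and the sketch does not cover how the IGP values themselves are derived from the global pin distribution.

```python
from dataclasses import dataclass


@dataclass
class NetSegment:
    igp: tuple          # (g[0], g[1]): gain of switching from config. 2 to config. 0 / 1
    free_on_side: list  # [free cells on side 0, free cells on side 1]

    def config(self):
        """Configuration of the segment: 0, 1, or 2 (spanning the cutline)."""
        n0, n1 = self.free_on_side
        if n1 == 0:
            return 0
        if n0 == 0:
            return 1
        return 2


def cell_gain(side, segments):
    """Wire-length gain of moving a free cell from `side` to the other side.

    `segments` are the net segments connected to the cell.
    """
    other = 1 - side
    gain = 0.0
    for seg in segments:
        # Leaving a segment that sits entirely on `side` makes it span
        # the cutline, which forfeits the intrinsic gain g[side].
        if seg.config() == side:
            gain -= seg.igp[side]
        # If the cell is the only free cell on `side`, its move puts the
        # whole segment on the other side, which earns g[other].
        if seg.free_on_side[side] == 1:
            gain += seg.igp[other]
    return gain
```

For a cell on side 0 connected to a segment with IGP (0.25, 0.5) and a single free cell on side 0, both rules apply and the contribution is -0.25 + 0.5 = 0.25. A segment that spans the cutline and has two free cells on side 0 contributes nothing.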
Tackling Move Irreversibility: Post-Processing with Horizontal and Vertical Swaps
To address the irreversibility problem of PDP, we tried several techniques for post-processing for PDP, facilitating potentially global movement of cells. The first of such is a traditional method [17], in which each row is scanned from left to right, with neighboring pairs of cells swapped if there is positive gain (in wire length), and the process is restarted from the first row, until no further improvement can be achieved.
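A minimal sketch of such a neighbor-swap pass is given below. It assumes unit-width cells placed at integer slots, so that swapping two neighbors never creates overlap, and it recomputes the total half-perimeter wire length for every trial swap; an actual implementation would evaluate only the nets incident to the two cells. The function names `hpwl` and `neighbor_swap` are hypothetical.

```python
def hpwl(nets, pos):
    """Total half-perimeter wire length; `pos` maps cell -> (x, y)."""
    total = 0
    for net in nets:
        xs = [pos[c][0] for c in net]
        ys = [pos[c][1] for c in net]
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total


def neighbor_swap(rows, nets):
    """Scan rows left to right, swapping adjacent cells whenever the swap
    shortens the total wire length; repeat until a full scan brings no gain.

    `rows` is a list of rows (lists of cell names) and is modified in place.
    Returns the final wire length.
    """
    pos = {c: (x, y) for y, row in enumerate(rows) for x, c in enumerate(row)}
    improved = True
    while improved:
        improved = False
        for row in rows:
            for x in range(len(row) - 1):
                a, b = row[x], row[x + 1]
                before = hpwl(nets, pos)
                pos[a], pos[b] = pos[b], pos[a]      # trial swap
                if hpwl(nets, pos) < before:
                    row[x], row[x + 1] = b, a        # keep the swap
                    improved = True
                else:
                    pos[a], pos[b] = pos[b], pos[a]  # undo the swap
    return hpwl(nets, pos)
```

Because every accepted swap strictly reduces an integer-valued cost, the loop terminates; like the scheme described above, it stops at the first local optimum reachable by adjacent swaps.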
Then, intra-row clustering is included to further improve the gain achievable through the above neighbor-swap method. Our clustering technique features an internal-to-external net count ratio (I/E). Every net that has all its pins inside the cluster is considered 'internal', and any partially connected net is then 'external'. Initially, for any single cell, the ratio is always zero, since every net is external. Starting from the leftmost cell, each row will be scanned, and the profile of the I/E ratio recorded along the way. A cluster is formed whenever I/E drops, after scanning a new cell, below a pre-determined threshold. This is intuitive since a cluster is considered to be strongly connected only when the number of internal nets is high relative to external connections. Care is also taken to make the clusters more or less equal in size. We do clustered and non-clustered neighbor swaps alternately until convergence is detected. Details of clustering are shown in Fig. 7.
Figure 7: Internal (thick) and external (thin) nets of a cluster, and its I/E ratio profile along the row direction.
However, this intra-row swap scheme converges to a local optimum very quickly (typically within a couple of iterations). To pull it out of the local optimum, we also considered vertical (inter-row) swaps of cells (given the row length limit is satisfied), with step size possibly greater than one (i.e., between non-adjacent rows). To limit computation time, we only swap cells vertically when their boundaries are horizontally overlapping each other. However, vertical swaps, unlike horizontal ones, tend to cause cell overlaps, which must be resolved by shifting at least one of the overlapped subrows either to the left or right. Hence, we have to estimate, to a certain precision, the change in wire length due to such shifts.
Note that if a pin is moved under the assumption that all other pins of the same net are fixed, there are only one or two pins that are really relevant to any change in wire length [13] . We developed a procedure to estimate shift-induced wire length change based on such juncture-pair modeling. Introduction of vertical swaps indeed improves our results by an additional 2 or 3 percent, but timing overhead is high due to shift-oriented estimation.
Circuit Partitioning Engine
The circuit partitioning engine invoked in our PDP program SPADE is CLIP-FM^wl, which is an adapted version of CLIP-FM [8] for wire length minimization. The way CLIP favors cluster-removal is that it selects the next (legal) move with the maximum updated part of the gain, instead of the aggregate FM gain (see the footnote below). Hence, the gain update scheme of FM^wl is left intact, and CLIP-FM^wl has the same per-pass time complexity as FM^wl. However, through empirical studies we found that CLIP-FM^wl converges with a number of passes proportional to log n.
Footnote: We actually use a variation of CLIP in which the gains of cells on nets perturbed for the first time are magnified by 1/(shrink factor), where 0 ≤ shrink factor ≤ 1. When shrink factor = 0, we get CLIP (infinite amplification), and when shrink factor = 1, we obtain regular FM.
When applied to SPADE, we need to introduce the following nontrivial modifications to CLIP-FM as a circuit bi-partitioner. First of all, since there are multiple regions to be bisected simultaneously, an efficient data structure is needed to facilitate selection of the best-gain, yet non-violating (with respect to all constraints under consideration) cell to move. Initially we used two 'global' binary search trees to store all free cells on either side of any cutline, only to discover later that violating cells tend to accumulate at the higher-gain end of each tree, making further search for legal moves prohibitively time-consuming (since we always restart at the best-gain node of each tree, it is obviously a quadratic-time search). Hence, in later implementations we used a two-level tree structure, i.e., cells in each region are first stored in a pair of trees locally, and all best-gain nodes of the local trees are then also sorted in a pair of global trees. Then an efficient search strategy is applied, as detailed below.
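The idea behind the two-level structure can be sketched with binary heaps standing in for the search trees. This is a simplified illustration under several assumptions: one heap per region rather than a pair per region, a caller-supplied feasibility test, and no gain updates between selections. The class name `TwoLevelSelector` and its interface are hypothetical.

```python
import heapq


class TwoLevelSelector:
    """Local max-heaps per region plus a global heap of the local bests."""

    def __init__(self, gains_by_region):
        # Local level: region -> heap of (-gain, cell).
        self.local = {}
        for region, gains in gains_by_region.items():
            heap = [(-g, c) for c, g in gains.items()]
            heapq.heapify(heap)
            self.local[region] = heap
        # Global level: only the current best entry of each region.
        self.top = [(h[0][0], h[0][1], r) for r, h in self.local.items() if h]
        heapq.heapify(self.top)

    def select(self, feasible):
        """Remove and return (cell, region, gain) for the best-gain cell
        whose move passes `feasible(cell, region)`, or None if no region's
        best cell is feasible."""
        skipped = []
        result = None
        while self.top:
            neg_gain, cell, region = heapq.heappop(self.top)
            if feasible(cell, region):
                heap = self.local[region]
                heapq.heappop(heap)  # same entry as the global one
                if heap:
                    # Promote the region's next best cell to the global level.
                    heapq.heappush(self.top, (heap[0][0], heap[0][1], region))
                result = (cell, region, -neg_gain)
                break
            skipped.append((neg_gain, cell, region))
        # Violating region bests stay available for later selections.
        for entry in skipped:
            heapq.heappush(self.top, entry)
        return result
```

In this sketch a violating best-gain cell blocks only its own region, so a single selection inspects at most one candidate per region instead of walking through every violating cell accumulated at the high-gain end of one global tree.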
Global Cell Selection Strategy: Based on the two-level tree structure, we can still access the globally best-gain cell in constant time. If a violation occurs, we first determine the type of violation involved. For vertical partitioning, the only constraint is local tolerance for each region, so we only need to select the best-gain cell in the opposite local tree to move. In some cases, e.g., the opposite local tree is empty, we need to delete the violating best-gain cell from the tree, and start the search process again. For horizontal partitioning, however, more complex decisions have to be made, since there are two types of constraints (for local region and row-set balance control, as explained in Sec. 3.5).
If there is only a local violation, we can proceed as in the vertical partitioning case. Otherwise, we prioritize correction of the row-set balance violation, and try the best-gain cell from the opposite global tree. If that cell is also violating, we search the entire opposite global tree until a feasible move is found. If no feasible moves are found, then the global tree on the side of the originally violating best-gain cell is searched. One other element in our search strategy that contributes to higher solution quality is 'tie-breaking' for any pair of best-gain nodes in the global trees. Here 'ties' specifically refer to the two best-gain cells from both global trees. The basic idea is that if both moves are feasible, then the one that follows the last move tends to continue cluster-removal and should be selected.
For the row-based standard cell placement problem, row length control is crucial in determining final die area utilization. Maximum row length forms a lower bound on chip width. In a general PDP process, even if we assign a relatively small balance tolerance (say, 1%) to every region, the final maximum row length may still run out of control, since errors tend to accumulate. Hence, for each implicit row-set under (horizontal) partitioning, we must assign specific aggregate tolerances to it, in addition to the local tolerances of each region. This is, however, not needed in vertical partitioning.
Let us consider a row-set R under horizontal partitioning. Suppose a slice line divides R into two sub-row-sets R0 and R1, with r0 and r1 rows, respectively. To keep final length of any row in R within user-
where 0 < a < 1. Then, local balance tolerances of regions in R can be computed based on the two row-set tolerances. Setting local tolerances for a region to be equal to its containing row-set tolerances overconstrains the problem. Hence we amplify the local tolerances by a factor of row-set tolerances that is greater than one, while still ensuring that the overall row-set tolerances are not violated. Experimental results reveal that setting a within [0.5, 0.75], combined with a horizontal tolerance amplification factor of 1.5, gives good wire length results, and makes final row length violation almost impossible.
Experimental Results
Experiments were run for ten MCNC standard cell benchmarks whose characteristics are given in Table 1. We used the same number of rows (also see Table 1) as [4, 13, 16, 18]. All experiments were conducted on a 550 MHz Pentium-III Linux workstation. All reported wire length (WL) results are in meters, and run times are in seconds. Table 2 provides a comparison of SPADE with an internally developed traditional SeqLP-based PDP program (using a somewhat better TP method than that used traditionally), both for 16 runs. Results show that SPADE provides an improvement of almost 20% in wirelength over SeqLP, thus establishing that our new paradigm of SLP with global net views is superior to traditional PDP using TP. Table 3 exhibits the tradeoff between run time and solution quality: results for 8, 12 and 16 runs at each level, under the same SPADE settings, are shown. It can be seen that only the five largest circuits benefit appreciably from additional runs. On average, SPADE/16 obtains 5.8% smaller wirelength than SPADE/8 at the cost of 54% more run time. Table 4 compares SPADE/16 to four state-of-the-art methods representing four different approaches: QUAD (PDP) [4], TW7.0 (SA) [13], FD-98 (mathematical programming) [16] and Snap-On [18] (two-phase placement). We obtain overall wire-length improvements of 10.8%, 10.6%, 15.8% and 15.3%, respectively, over these four techniques. Also, as shown in the table, our results are the best so far for all circuits except two, avq-small and avq-large, for which our results are worse than those of TW7.0, FD-98 and Snap-On, but better than those of QUAD. Besides TW7.0, we are the only ones to report results on golem3; SPADE/16 gives 12.2% better WL than TW7.0. Table 3's last column also shows the best SPADE results over different settings for each circuit; this gives a sense of what is achievable by a PDP-based method, and it is about 2% better than the single SPADE/16 setting results reported in Tables 3 and 4. Finally, run times are compared in Table 5 (Snap-On times are not compared, as [18] reports times only for the global placement phase, which is its main focus). Times for SPADE/16 are reasonable, ranging from 10 minutes for biomed (6.4K cells) to about 8.3 hours for golem3 (100K cells); times for SPADE/8, which still gives significant improvements over recent methods (see caption of Table 5), are about 35% less than those for SPADE/16. The table shows (using appropriate run-time scaling for different workstations) that SPADE/8 is comparable in speed to TW7.0, about 1.8 times faster than QUAD, and 2.5 times slower than FD-98. In future work, we hope to obtain faster solutions by incorporating multilevel partitioning (we currently use a flat partitioner) and early-stop mechanisms, without sacrificing solution quality.
Table 3: Trade-off between SPADE run time and quality. The settings are: row length deviation up to 3%, H/V decoupling, shrink factor of 0.1, and H/V swap as post-processing. Also included, in the last column, is a set of results for best settings for each circuit (within 16 runs and 3-5% deviation).
Table 4: [...] FD-98 [16], QUAD [4] and Snap-On [18]. The *'ed numbers for QUAD correspond to three circuits whose wire lengths are not consistent with those of the other methods, probably due to QUAD's use of circuits in the PROUD format. Our comparisons to QUAD thus do not include these three circuits.
Table 5: [...] SPADE/8's wirelengths are 5.1%, 11.3% and 5.6% better than those of TW7.0, FD-98 and QUAD, respectively. The TW7.0 times are those reported in [16]. The "Scaled Time Ratio" row indicates the machine-based normalized time ratios of all methods to SPADE/8; numbers greater than 1 for other methods indicate they are slower by that factor. The scaled times were obtained according to the Heapsort benchmark in www.netlib.org, and our empirical findings that the 550 MHz Pentium III workstations we used are 3 times faster than the Sun Ultra 1 Model 170 machines (performance data for the Pentium III is not available). The Heapsort performance benchmark was chosen since placement requires some number crunching but also a more significant amount of dynamic data structure creation, access and update, similar to that in Heapsort. The TW7.0 and FD-98 times are for a DEC Alphastation 250 4/266 (2 times slower than the 550 MHz Pentium III) and the QUAD times are for a Sun Ultra 1 Model 140 (3.6 times slower).
Conclusions
We presented several new PDP techniques used in our SPADE placer to rectify the drawbacks of classical PDP (mentioned in Sec. 1). The results obtained are overall significantly better than current state-of-the-art placement methods (10.8% better than QUAD [4], 10.6% better than TW7.0 [13], 15.8% better than FD-98 [16] and 15.3% better than Snap-On [18]). For all but two circuits in the MCNC benchmark suite, we have the best results so far. Our run times are reasonable. We have thus demonstrated that PDP can be used effectively and efficiently to place large circuits. This is significant, since PDP has several advantages that can be exploited to meet the special needs of placement for complex deep sub-micron circuits: time efficiency, flexibility in accurately incorporating many optimization metrics like wirelength, power and timing, and flexibility in satisfying many constraints that are required for DSM chips (e.g., uniform thermal distribution, congestion control) without significantly degrading solution quality. Future work will investigate these aspects of cell placement.
