Placement is one of the most important steps in the RTL-to-GDSII synthesis process, as it directly defines the interconnects, which have become the bottleneck in circuit and system performance in deep submicron technologies. The placement problem has been studied extensively in the past 30 years. However, recent studies show that existing placement solutions are surprisingly far from optimal. The first part of this tutorial summarizes results from recent optimality and scalability studies of existing placement tools. These studies show that the results of leading placement tools from both industry and academia may be 50% to 150% away from optimal in total wirelength. If such a gap can be closed, the corresponding performance improvement will be equivalent to several technology-generation advancements. The second part of the tutorial highlights the recent progress on large-scale circuit placement, including techniques for wirelength minimization, routability optimization, and performance optimization.
INTRODUCTION
The exponential growth of on-chip complexity has dramatically increased the demand for scalable optimization al-

*Financial support from the Semiconductor Research Consortium under contracts 98-DJ-605, 98-TJ-686, and 2001-TJ-910 and from the National Science Foundation under grant CCR-0096383 is gratefully acknowledged.
{cong, shinnerl, xie, yuanxin}@cs.ucla.edu, kongtm@magma-da.com

with little or no consideration of the layout and interconnect information, it may not map well to a two-dimensional layout solution. Therefore, large-scale global placement on a nearly flattened netlist is needed for physical hierarchy generation to achieve the best performance. This approach is even more important in today's nanometer designs, where the interconnect has become the performance bottleneck. This tutorial highlights state-of-the-art placement optimization techniques. Section 2 presents recent studies on the quality and scalability of existing placement algorithms on a set of benchmarks with known optimal solutions. Section 3 reviews scalable paradigms for large-scale wirelength minimization. Timing optimization and routability optimization are discussed in Sections 4 and 5, respectively. Conclusions are given in Section 6.

Figure 1: Average solution quality vs. percentage of non-local nets, from PEKO (0% non-local nets) through PEKU (0.25% to 10% non-local nets) to G-PEKU (100% non-local nets). Each data point is an average quality ratio for a given placer over all circuits in the given suite.
circuits, there may also be global connections that span a significant portion of the chip, even when they are optimally placed. Additional benchmark circuits were therefore constructed to study the impact of global nets [24]. Circuits in the G-PEKU suite consist only of global nets connecting either an entire row or an entire column. For such circuits, an obvious upper bound on optimal wirelength is the sum of the lengths of the rows and columns. Circuits in the PEKU suite (Placement Examples with Known Upper bounds on wirelength) consist of both PEKO-style local nets and additional, randomly generated non-local nets. An upper bound on the optimal wirelength is derived simply by adding the wirelengths of non-local nets to the known total wirelength of the local nets. In the study [24], the percentage of non-local nets was gradually increased from 0.25% to 10%. The G-PEKU and PEKU suites are also available online [1].
Gap Analysis Results
Four state-of-the-art placers from academia and one industrial placer were studied for optimality and scalability: 2.20GHz running RedHat 8.0 with 2GB of memory. To measure how close the placement results are to optimal, the ratio of a placement's wirelength to the optimal wirelength (on PEKO) or its upper bound (on G-PEKU and PEKU) was computed. This ratio is called the "quality ratio." An upper limit of 24 hours was placed on the run time; any process exceeding this limit was terminated.
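As a small illustration of the metric defined above, the following sketch computes a quality ratio and the per-suite average of the kind plotted per data point. The function names and the sample numbers are hypothetical and are not taken from the study.

```python
def quality_ratio(placed_wirelength, reference_wirelength):
    """Wirelength of a placement divided by the optimal wirelength (PEKO)
    or by the known upper bound on the optimum (PEKU, G-PEKU)."""
    if reference_wirelength <= 0:
        raise ValueError("reference wirelength must be positive")
    return placed_wirelength / reference_wirelength


def average_quality_ratio(results):
    """Average ratio over a suite; `results` is a list of
    (placed_wirelength, reference_wirelength) pairs, one per circuit."""
    ratios = [quality_ratio(placed, ref) for placed, ref in results]
    return sum(ratios) / len(ratios)


# Hypothetical numbers for two circuits of one suite.
suite = [(150.0, 100.0), (260.0, 200.0)]
print(average_quality_ratio(suite))
```

A ratio of 1 means the placer matched the reference wirelength; on PEKU and G-PEKU the reference is only an upper bound, so the true gap to optimal can be larger than the ratio suggests.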
The results are summarized in Figure 1 and Figure 2. Figure 1 shows how the average quality ratios of these tools change with the percentage of non-local nets. Figure 2 shows how the run times of these tools change as the number of cells increases. We make the following three observations.
(i) None of the placers achieves a quality ratio close to 1.
On PEKO, the wirelengths produced by these tools range from 1.41 to 2.09 times the optimal on average (see Figure 1) and 1.66 to 2.50 times the optimal in the worst case (not shown). On G-PEKU, the gap between their solutions and the upper bound varies between 79% and 102% in the worst case. Some placers may try to improve routability by sacrificing wirelength. However, given the gap between their wirelengths and the optimal value, there remains significant room for improvement in existing placement algorithms.
(ii) The quality ratio from the same placer can vary significantly for designs of similar sizes but different characteristics. No placer produces consistently better results than the others. On PEKO, mPL gives the shortest wirelength. However, its quality ratio shows an increase of more than 40% with a small increase of non-local nets. On G-PEKU, Capo gives the closest solution to the upper bound in most cases. On PEKU, Dragon's wirelength gradually becomes the closest to the upper bound. This seems to suggest that more scalable and stable hybrid techniques may be needed for future generations of placement tools.
(iii) Different placers displayed different scalability in run time and solution quality. None of them can successfully finish all the circuits of PEKO, because of either the run-time limit (e.g., Dragon), or memory consumption (e.g., Capo, mPL, mPG, QPlace). For those circuits they successfully placed, an average solution quality deterioration from 4% (on QPlace) to 25% (on mPL) can be observed when the problem size is increased by a factor of 10.
It is not known whether the gaps on real circuits are similar to those observed on the benchmarks discussed above. The construction of placement examples that resemble real circuits more closely, including examples optimized for timing [25] or routability, is an active area of research.
SCALABLE PARADIGMS
We assert that scalability without some hierarchical form of computation is impossible. The use of hierarchy may be subtle or indirect, but never completely absent. In this paper, we use scalability in its traditional sense and therefore consider not just O(N) algorithms but rather any framework likely to have applicability lasting for several technology generations.
Wirelength, performance, power consumption, and routability are the typical objectives of VLSI placement. Of these, weighted total wirelength is a useful single representative, as (i) it can be optimized efficiently, and (ii) strategic, iterative net reweighting can be used to optimize other objectives, such as performance and routability.
Our discussion is centered on methods for wirelength-driven global placement. The goal here is only an approximately uniform distribution of cells with as little total wirelength as possible. The problem of transforming a global placement to an overlap-free configuration is left to the detailed placement phase.
The most promising large-scale approaches to wirelength-driven global placement can be broadly categorized by (i) the manner in which their hierarchies are constructed and traversed, and (ii) the kinds of intralevel optimizations used and the manner in which they are incorporated into the hierarchy and coordinated with each other. At the highest level,
Recursive Top-Down Partitioning
Among academic placement tools, all the leading top-down methods rely on variants of circuit partitioning in some way. Seminal work on partitioning-based placement was done by Breuer [7] and Dunlop and Kernighan [27]. Most contemporary methods have exploited further advances in fast algorithms to push these frameworks beyond their original capabilities. Fast, high-quality O(N) partitioning algorithms give top-down partitioning attractive O(N log N) scalability overall.
Cutsize Minimization
Simple and traditional recursive bisection with a cutsize objective can be used quite effectively with simple Fiduccia-Mattheyses-style iterations. At a given level, each region is considered separately from the others in some arbitrary order. A spatial cutline for the region, either horizontal or vertical, can be carefully chosen. Given some initial partition, subsets of cells are moved across the cutline in a way that reduces the total weight of hyperedges cut without violating a given area-balance constraint. This constraint can be set loosely initially and then gradually tightened.
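The move-based iteration described above can be sketched as follows. This is a deliberately naive illustration, assuming a bipartition stored as a dictionary and recomputing gains from scratch; production Fiduccia-Mattheyses implementations use incremental gain updates and bucket lists to reach linear time per pass. All names (`cut_weight`, `fm_style_pass`, `max_imbalance`) are hypothetical.

```python
def cut_weight(nets, weights, side):
    """Total weight of hyperedges that have cells on both sides of the cutline."""
    return sum(w for net, w in zip(nets, weights)
               if len({side[c] for c in net}) == 2)


def fm_style_pass(nets, weights, side, area, max_imbalance):
    """Move each cell at most once, highest gain first, and return the best
    partition seen during the pass together with its cut weight."""
    side = dict(side)                      # work on a copy of the partition
    total = [0.0, 0.0]                     # area currently on side 0 and side 1
    for cell, s in side.items():
        total[s] += area[cell]
    current_cut = cut_weight(nets, weights, side)
    best_side, best_cut = dict(side), current_cut
    locked = set()
    for _ in range(len(side)):
        choice, choice_gain = None, None
        for cell in list(side):
            if cell in locked:
                continue
            s = side[cell]
            new_total = list(total)
            new_total[s] -= area[cell]
            new_total[1 - s] += area[cell]
            if abs(new_total[0] - new_total[1]) > max_imbalance:
                continue                   # move would violate the area balance
            side[cell] = 1 - s             # tentative move to measure the gain
            gain = current_cut - cut_weight(nets, weights, side)
            side[cell] = s                 # undo the tentative move
            if choice is None or gain > choice_gain:
                choice, choice_gain = cell, gain
        if choice is None:
            break                          # no legal move remains
        s = side[choice]
        side[choice] = 1 - s               # commit the move, even if gain < 0
        total[s] -= area[choice]
        total[1 - s] += area[choice]
        locked.add(choice)
        current_cut -= choice_gain
        if current_cut < best_cut:
            best_side, best_cut = dict(side), current_cut
    return best_side, best_cut


# A four-cell chain a-b-c-d, initially partitioned so that every net is cut.
nets = [("a", "b"), ("b", "c"), ("c", "d")]
weights = [1, 1, 1]
start = {"a": 0, "b": 1, "c": 0, "d": 1}
areas = {"a": 1.0, "b": 1.0, "c": 1.0, "d": 1.0}
print(fm_style_pass(nets, weights, start, areas, max_imbalance=2.0))
```

Accepting negative-gain moves and keeping the best prefix is what lets the pass climb out of some local minima. Tightening `max_imbalance` over successive passes corresponds to the gradual tightening of the balance constraint mentioned in the text.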
Connections between subregions can be modeled by terminal propagation [27], in which the usual cutsize objective is augmented by terms incorporating the effect of connections to external subregions. Other techniques for organizing local partitioning subproblems use Rent's rule to relate cutsize to wirelength estimation [62, 67]. Careful consideration of the order and manner in which subregions are selected for partitioning can be significant. For example, a dynamic-programming approach to cutline selection can improve overall results by 5% or more [67]. In the multi-
Partitions Guided by Analytical Placements
An oft-cited disadvantage of recursive bisection is its alleged tendency to ignore the global objective as it pursues locally optimal partitions. Approximating wirelength by cutsize in the objective may also degrade the quality of the final placement. A radically different approach, first introduced in Proud [20, 59] and subsequently refined by Gordian [41], is to use continuous, iteratively-constrained quadratic star-model wirelength minimization over the entire circuit to guide partitioning decisions. The choice of a quadratic-wirelength objective helps avoid long wires and facilitates the construction of efficient numerical linear-system solvers for the optimality conditions, e.g., preconditioned conjugate gradients. I/O pads prevent the cells from simply collapsing to a single point. Linear wirelength can still be asymptotically approximated by iterative adjustments to the net weights [55]. Following this "analytical" placement, each region is then quadrisected, and cells are partitioned to subregions in order to further reduce overlap and area congestion. In Gordian, carefully chosen cutlines and FM-based cutsize-driven partitioning and repartitioning are used. Cell-to-subregion assignments are loosely enforced by imposing and maintaining a single center-of-mass equality constraint for each subregion. As constraints accumulate geometrically, degrees of freedom in cell movement are eliminated, and the quadratic minimization at each step moves cells less and less.
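To make the unconstrained quadratic step concrete, the sketch below minimizes a weighted sum of squared distances in one coordinate by solving the optimality conditions A x = b with plain (unpreconditioned) conjugate gradients. It is an illustration under simplifying assumptions: two-pin connections stand in for the star net model, a dense matrix is used instead of a sparse one, no center-of-mass constraints are imposed, and the function names are hypothetical.

```python
def solve_cg(A, b, tol=1e-10, max_iter=1000):
    """Conjugate gradients for a symmetric positive-definite system A x = b."""
    n = len(b)
    x = [0.0] * n
    r = list(b)                            # residual b - A x for the start x = 0
    p = list(r)
    rs = sum(v * v for v in r)
    for _ in range(max_iter):
        if rs ** 0.5 < tol:
            break
        Ap = [sum(A[i][j] * p[j] for j in range(n)) for i in range(n)]
        alpha = rs / sum(p[i] * Ap[i] for i in range(n))
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * Ap[i] for i in range(n)]
        rs_new = sum(v * v for v in r)
        p = [r[i] + (rs_new / rs) * p[i] for i in range(n)]
        rs = rs_new
    return x


def quadratic_placement_1d(n_cells, cell_edges, pad_edges):
    """Minimize sum w*(x_i - x_j)^2 over cell-cell connections plus
    sum w*(x_i - pad_x)^2 over cell-pad connections.

    cell_edges: list of (i, j, w); pad_edges: list of (i, pad_x, w).
    The fixed pads make the system positive definite, so the cells do not
    collapse to a single point."""
    A = [[0.0] * n_cells for _ in range(n_cells)]
    b = [0.0] * n_cells
    for i, j, w in cell_edges:
        A[i][i] += w
        A[j][j] += w
        A[i][j] -= w
        A[j][i] -= w
    for i, pad_x, w in pad_edges:
        A[i][i] += w
        b[i] += w * pad_x
    return solve_cg(A, b)


# Two movable cells in a chain between pads at x = 0 and x = 3.
x = quadratic_placement_1d(2, [(0, 1, 1.0)], [(0, 0.0, 1.0), (1, 3.0, 1.0)])
print(x)  # approximately [1.0, 2.0]
```

The x and y coordinates decouple under this objective, so the same routine can be run once per axis.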
BonnPlace [61, 6] is the leading contemporary variation of this framework. It employs a sophisticated, novel, and linear-time displacement-minimizing partitioning instead of cutsize minimization during subregion assignment. Instead of introducing explicit equality constraints into the analytical minimization, the quadratic-wirelength objective is altered to minimize cells' displacements from their assigned subregions.
Iterative Refinement
Following the initial partitioning at a given level, various means of further improving the result at that level can be used. In BonnPlace [61, 6], unconstrained quadratic wirelength minimization over 2 x 2 windows of subregions is followed by a repartitioning of the cells in these windows. Windows can be selected based on routing-congestion estimates. Capo [12] greedily selects cell orientations in order to reduce wirelength and improve routability. Feng Shui [56] follows k-way partitioning by localized repartitioning of each subregion. Some leading partitioning-based placers also employ time-limited branch-and-bound-based enumeration at the finest levels [11].
In Dragon [62, 53], an initial cutsize-minimizing quadrisection is followed by a bin-swapping-based refinement, in which entire partition blocks at that level are interchanged in an effort to reduce total wirelength. At all levels except the last, low-temperature simulated annealing is used; at the finest level, a more detailed and greedy strategy is employed. Because the refinement is performed on aggregates of cells rather than on cells from the original netlist, Dragon closely resembles the multilevel methods discussed next.
Multilevel Methods
Placement algorithms in the multilevel paradigm have only recently drawn attention [50, 13, 15, 14, 26]. These methods are based on coarsening, relaxation, and interpolation, defined as follows.
(i) Coarsening. Hierarchies are built from the bottom up by recursive aggregation, i.e., clustering or extensions.
(ii) Relaxation. Localized optimizations are performed at every aggregation level.
(iii) Interpolation. Intermediate solutions are transferred from each aggregation level to its adjacent finer level.
The scalability of this approach is straightforward to obtain and understand. Provided relaxation at each level has order linear in the number $N_i$ of aggregates at that level, and the number of aggregates per level decreases by a factor $r < 1$ at each level of coarsening, say $N_i = r^i N$ at level $i$, the total order of a multilevel method is at most $cN(1 + r + r^2 + \cdots) = cN/(1 - r)$. Higher-order (nonlinear) relaxations can still be used, if their use is limited to subsets of bounded size, e.g., by sweeps over overlapping windows of contiguous clusters at the current aggregation level.
Coarsening
Initial Placement at Coarsest Level
Relaxations
Relaxations at a given level are fast and relatively localized. The global view comes from the multilevel hierarchy, not from the intralevel relaxations. Almost any algorithm can be used, provided that it can support (i) incorporation of complex constraints and (ii) restriction to subsets of movable objects. Relaxation in mPG and Ultrafast VPR is by fast annealing. The mPG framework employs a fixed set of hierarchical bin-density constraints to monitor area and routing congestion. In mPL, relaxation at intermediate levels proceeds both by (i) quadratic wirelength minimization on small subsets followed by path-based area-congestion relief [38] and (ii) randomized, greedy, and discrete Goto-based cell swapping [31].
Interpolation
Simple declustering and linear assignment can be effective [13]. With this approach, each component cluster is initially placed at the center of its (single) parent's location. If an overlap-free configuration is needed, a uniform bin grid can be laid down, and clusters can be assigned to nearby bins or sets of bins. The complexity of this assignment can be reduced by first partitioning clusters into smaller windows, e.g., of 500 clusters each. If clusters can be assumed to have uniform size, then fast linear assignment can be used. Otherwise, approximation heuristics are needed.
Under AMG-style weighted disaggregation, interpolation proceeds by weighted averaging: each finer-level cluster is initially placed at the weighted average of the positions of all coarser-level clusters with which its connection is sufficiently strong [14]. Finer-level connections can also be used: once a finer-level cluster is placed, it can be treated as a fixed, coarser-level cluster for the purpose of placing subsequent finer-level clusters.
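The weighted-averaging step can be sketched as below. How connection strengths are computed and what counts as "sufficiently strong" are not specified here, so the dictionary layout and the `min_strength` threshold are assumptions made purely for illustration.

```python
def interpolate_positions(links, coarse_pos, min_strength=0.0):
    """Place each finer-level cluster at the weighted average of the positions
    of the coarser-level clusters it is strongly connected to.

    links:      {fine_cluster: {coarse_cluster: connection_weight}}
    coarse_pos: {coarse_cluster: (x, y)}
    Only connections with weight above `min_strength` take part in the average.
    """
    fine_pos = {}
    for fine, neighbors in links.items():
        strong = {c: w for c, w in neighbors.items() if w > min_strength}
        if not strong:
            raise ValueError(f"cluster {fine!r} has no sufficiently strong connection")
        total = sum(strong.values())
        x = sum(w * coarse_pos[c][0] for c, w in strong.items()) / total
        y = sum(w * coarse_pos[c][1] for c, w in strong.items()) / total
        fine_pos[fine] = (x, y)
    return fine_pos


# One fine cluster tied to coarse cluster "A" three times as strongly as to "B".
coarse = {"A": (0.0, 0.0), "B": (4.0, 8.0)}
print(interpolate_positions({"f": {"A": 3.0, "B": 1.0}}, coarse))
```

To use finer-level connections as described above, a cluster that has just been placed could be added to `coarse_pos` before the next one is processed; the sketch omits that sequential variant.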
A constructive approach, as in Ultrafast VPR [50], can also lead to extremely fast and scalable algorithms. At each level, clusters are initially placed in the following sequence: (i) clusters directly connected to output pads, (ii) clusters directly connected to input pads, (iii) other clusters.
Embedded Multilevel Optimization
Most leading methods owe their performance not just to external design but also to sophisticated and hierarchical iterative internal calculation. In fact, a placement problem of order $10^6$ cells and nets can still be solved "flat," i.e., without any explicit aggregation or partitioning, provided that
TIMING OPTIMIZATION
Extensive research on timing-driven placement has been done in the past two decades and continues today. The performance of a circuit is determined by its longest path delay, but timing constraints are extremely complex. The number of paths present grows exponentially with circuit size. The advantage of path-based algorithms is their accurate timing view during the optimization procedure. However, the drawback is that they usually require substantial computation resources due to the exponential number of paths which need to be simultaneously minimized. Moreover, in certain placement frameworks, e.g., top-down partitioning, it is very difficult or infeasible to maintain an accurate global timing view.
Net-based Algorithms
Net-based algorithms [28, 48, 60, 29], by contrast, do not directly enforce path-based constraints. Instead, timing constraints or requirements on paths are transformed into either length constraints or weights on individual nets. This information is then fed to a weighted-wirelength-minimization-based placement engine to obtain a new placement with better timing. This new placement is then analyzed by a static analyzer, thus generating a new set of timing information to guide the next placement iteration. Usually this process must be repeated for a few iterations until no improvement can be made or until a certain iteration limit has been reached.
The process of generating net-length constraints or net-delay constraints is called delay budgeting [35, 30, 44, 68, 58, 52, 51, 18]. The main idea is to distribute slacks at the endpoints of each path (POs or inputs of memory elements) to constituent nets in the path such that a zero-slack solution is obtained [48, 69, 18]. A serious drawback of this class of algorithms is that delay budgeting is usually done in the circuit's structural domain, without consideration of physical placement feasibility. As a result, it may severely overconstrain the placement problem. Recently, some attempts have been made to unify delay budgeting and placement [51, 63, 33], where a complete or coarse [63, 33] placement solution is used to guide the delay budgeting step. However, it is generally difficult to find an efficient or scalable algorithm for such unification.
To overcome these problems, approaches based on net weighting use different means. Instead of assigning a delay budget to each individual net or edge, net-weighting-based approaches assign weights to nets based on their timing criticality. Compared with delay-budgeting approaches, these methods do not suffer from the overconstraining problem. Net-weighting-based algorithms are generally very flexible. They can be naturally integrated into an existing wirelength-minimization-based placement framework. They also have a relatively low complexity. As circuit sizes continue to increase and practical timing constraints become increasingly complex, these advantages make the net-weighting-based approaches more and more attractive. Unfortunately, despite these advantages, net weighting is usually done in an ad hoc, intuitive manner. The main principle used in most algorithms is that a timing-critical net should receive a heavy weight. For example, VPR [46] used the following formula to assign weight to an edge e:
$w(e) = (1 - \mathrm{slack}(e)/T)^{\alpha}$
where $T$ is the current longest path delay and $\alpha$ is a constant. A potential problem in the net-based approaches is the so-called oscillation problem. Usually net weights or budgets are assigned by performing timing analysis for some given placement solution $P^n$ at the n-th iteration; more critical nets will receive higher weights. Thus, in the next placement solution $P^{n+1}$, the lengths of critical nets in $P^n$ will be reduced, while the lengths of other non-critical nets are potentially increased, resulting in changes in net criticalities, and, thus, in net weights. Therefore, it is important to ensure convergence of weighted-wirelength optimization.
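The criticality-based weight given above is simple to evaluate. The sketch below is an illustration only: the exponent is passed as `alpha`, and the clamping of the criticality to the range [0, 1] is an added assumption (it guards against slacks outside 0..T, which would otherwise produce invalid powers), not something stated in the text.

```python
def net_weight(slack, longest_path_delay, alpha):
    """Weight (1 - slack/T)^alpha for one edge, with T the current longest
    path delay. The criticality is clamped to [0, 1] as a safeguard."""
    if longest_path_delay <= 0:
        raise ValueError("longest path delay must be positive")
    criticality = 1.0 - slack / longest_path_delay
    criticality = min(max(criticality, 0.0), 1.0)
    return criticality ** alpha


# A zero-slack edge gets the full weight; larger slack lowers the weight,
# and a larger exponent sharpens the distinction.
for slack in (0.0, 5.0, 10.0):
    print(slack, net_weight(slack, longest_path_delay=10.0, alpha=2.0))
```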
A recent work [42] proposed a nice solution.
Note that certain path-based approaches suffer from similar problems, e.g., a need to dynamically adjust the set of paths being optimized [57].
Two ways to solve this problem have appeared in the literature. The first approach is to perform timing analysis and recompute net weights periodically. VPR [46] and PATH [42] follow this approach. Based on simulated annealing, both methods perform timing analysis and net reweighting once per temperature. The second approach is to make use of historic information [29], i.e., to combine weights in previous iterations with criticality information in the current placement to derive the current weights. Intuitively, if a net is always critical during all placement iterations, we want to gradually increase its weight, while if it is never critical, we will decrease its weight.
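One generic way to fold history into the weights is exponential smoothing of the per-net criticality. The sketch below is not the scheme of any cited work; the update rule, the `decay` and `gain` parameters, and the function name are all hypothetical choices that merely reproduce the intuition stated above.

```python
def update_history_weights(history, criticality, decay=0.5, gain=4.0):
    """Blend each net's previous smoothed criticality with its current one.

    history:     {net: smoothed criticality from earlier iterations}
    criticality: {net: criticality in [0, 1] under the current placement}
    Returns the new history and the derived net weights (1 + gain * history).
    """
    new_history = {
        net: decay * history.get(net, 0.0) + (1.0 - decay) * crit
        for net, crit in criticality.items()
    }
    weights = {net: 1.0 + gain * h for net, h in new_history.items()}
    return new_history, weights


# A net that stays critical sees its weight rise gradually rather than jump;
# once it stops being critical, the weight decays again.
history = {}
for crit in (1.0, 1.0, 1.0, 0.0):
    history, weights = update_history_weights(history, {"n1": crit})
    print(weights["n1"])
```

Because each update moves the weight only part of the way toward its new target, a net whose criticality flips between iterations does not swing the objective as strongly, which is the damping effect sought against oscillation.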
ROUTABILITY OPTIMIZATION
Routing congestion is one of the fundamental issues in VLSI physical design. Because an aggressive wirelength-driven placement may not be routable, routability is best considered directly during the placement phase in order to achieve the best overall performance.
Routability-driven placement involves mainly (i) routability modeling and (ii) solution techniques for routability control. Usually optimization for routability control is performed based on the estimated routing congestion of a placement configuration. We discuss these two issues in the following subsections.
Routability Modeling
Routability is usually modeled on an X x Y global-routing grid in the chip's core region. Routing supply and demand are modeled for each bin and each boundary of the routing grid structure.
There have been many studies on routability modeling. There are two major categories: topology-free modeling (TPfree), where no explicit routing is done, and topology-based modeling (TP-based), where routing trees are explicitly constructed on some routing grid.
TP-free modeling is faster in general. Examples of this class include bounding-box (BBOX)-based modeling [21], probabilistic analysis-based modeling [43], Rent's rule-based modeling [65], and pin density-based modeling [5]. In RISA modeling [21], the wiring supply is modeled based on the pre-wiring, cells and mega cells, and the wiring demand of a net is modeled by a weighted BBOX length. A net-based stochastic model for 2-pin nets is presented to compute expected horizontal and vertical track usage with consideration of routing blockage [43]. Peak routing demand and regional routing demand are estimated using Rent's rule [65]. Pin density per bin can be used as a metric for intrabin routing congestion, but it cannot model the interbin boundary congestion. Therefore, it is combined with probabilistic analysis-based modeling for completeness [5].
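A very simple topology-free estimate in the bounding-box family can be sketched as follows. It is a simplified illustration, not the RISA model: each net's half-perimeter length is spread uniformly over the bins of its bounding box, with no pin-count weighting and no separate horizontal and vertical tracks. Pin locations are assumed to be given as bin indices, and all names are hypothetical.

```python
def bbox_demand(nets, grid_w, grid_h):
    """Estimate routing demand per bin without constructing any routing tree.

    nets: list of nets, each a list of (x, y) bin indices of its pins.
    Returns demand[y][x] on a grid_w x grid_h global-routing grid.
    """
    demand = [[0.0] * grid_w for _ in range(grid_h)]
    for pins in nets:
        xs = [x for x, _ in pins]
        ys = [y for _, y in pins]
        x0, x1, y0, y1 = min(xs), max(xs), min(ys), max(ys)
        half_perimeter = (x1 - x0) + (y1 - y0)
        n_bins = (x1 - x0 + 1) * (y1 - y0 + 1)
        for y in range(y0, y1 + 1):
            for x in range(x0, x1 + 1):
                demand[y][x] += half_perimeter / n_bins
    return demand


def overflowed_bins(demand, supply):
    """Bins (x, y) whose estimated demand exceeds a uniform per-bin supply."""
    return [(x, y) for y, row in enumerate(demand)
            for x, d in enumerate(row) if d > supply]


# Two nets on a 2 x 2 grid: one diagonal, one along the bottom row.
demand = bbox_demand([[(0, 0), (1, 1)], [(0, 0), (1, 0)]], grid_w=2, grid_h=2)
print(demand)
print(overflowed_bins(demand, supply=0.75))
```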
In a TP-based modeling method, for each net, a Steiner tree topology is generated on the given routing grid. Such a modeling method can generate at least a global routing solution, i.e., it can provide an upper bound for the routability estimation. If a TP-based modeling method uses a topology similar to what the after-placement router does, the fidelity of the model can be guaranteed. However, topology generation is often of high complexity; therefore, most research focuses mainly on efficiency. In one approach, a precomputed Steiner tree topology on a few grid structures is used for wiring-demand estimation [47]. Two algorithms of logarithmic complexity have recently been proposed: a fast congestion-avoidance two-bend routing algorithm, LZ-router, for topology generation for two-pin nets, and the IncA-tree algorithm, which can support incremental updates for building a rectilinear Steiner arborescence tree (A-tree) for a multipin net [15].
Optimization Techniques
After routability is modeled, a routing-congestion picture is obtained on the global-routing grid structure. Basically, there are two ways to apply the modeling results to the placement optimization process: net weighting and cell weighting (cell inflation).
Net weighting directly transfers a congestion picture into bin weights and optimizes weighted wirelength. It can easily be incorporated into iterative placement algorithms such as simulated-annealing-based methods [36, 15].
Cell weighting (a.k.a. cell inflation) inflates cell sizes based on congestion estimation, so that cells in congested bins can be moved out of the bins after being inflated. It is more suitable for incorporation into constructive placement techniques, such as analytical placers [49] and quadrisection-based placers [5], as well as iterative placement techniques, such as simulated-annealing-based placers [64].
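The inflation idea can be sketched as a per-cell scaling driven by the congestion of the bin that currently holds the cell. The linear scaling rule, the `threshold` and `max_factor` parameters, and the data layout are illustrative assumptions; the cited placers may use quite different rules.

```python
def inflate_cells(cell_width, cell_bin, bin_congestion, threshold=1.0, max_factor=2.0):
    """Scale up the widths of cells that sit in congested bins.

    cell_width:     {cell: width}
    cell_bin:       {cell: bin id of the bin currently containing the cell}
    bin_congestion: {bin id: estimated routing demand divided by supply}
    Cells in bins at or below `threshold` keep their size; above it, the
    width grows in proportion to the congestion, capped at `max_factor`.
    """
    inflated = {}
    for cell, width in cell_width.items():
        ratio = bin_congestion[cell_bin[cell]] / threshold
        factor = min(max(ratio, 1.0), max_factor)
        inflated[cell] = width * factor
    return inflated


widths = {"a": 2.0, "b": 2.0, "c": 2.0}
bins = {"a": 0, "b": 1, "c": 2}
congestion = {0: 0.8, 1: 1.5, 2: 3.0}
print(inflate_cells(widths, bins, congestion))
```

A density-aware placer that is then rerun on the inflated sizes will spread cells out of the congested bins, which frees routing resources there at some cost in wirelength.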
CONCLUSION
Algorithms for large-scale circuit placement play a vital role in today's interconnect-limited nanometer designs. Recent studies suggest that the potential exists for a full technology generation's worth of performance gains in the placement step alone. In this paper, we have reviewed the current state of the art, from the basic paradigms for scalable wirelength-driven placement to techniques for performance and routability optimization. We believe that hierarchical/multilevel methods are needed for scalability, and weighted wirelength minimization provides a general framework for performance and routability optimization in placement.
Ideally, systematic empirical comparisons would be used to understand the trade-offs of the different algorithms summarized in this paper. However, direct numerical comparisons of these algorithms are difficult, partly due to limited access to these algorithms, and partly due to differences in their assumptions. Recently, comparisons based on wirelength minimization have been attempted [2]. We are not aware of any comprehensive quantitative comparison in terms of performance or routability optimization. More work is needed to build a common framework for direct comparisons of different placement methods.
REFERENCES
[1] http://cadlab.cs.ucla.edu/~pubbench.
