Abstract: Automated cell placement is a critical problem in VLSI physical design. New analytical placement methods that simultaneously spread cells and optimize wirelength have recently received much attention from both academia and industry. A novel and simple objective function for spreading cells over the placement area is described in the patent of Naylor et al. [44]. When combined with a wirelength objective function, this allows efficient simultaneous cell spreading and wirelength optimization using nonlinear optimization techniques. In this work, we implement an analytic placer (APlace) according to these ideas (which have other precedents in the open literature), and conduct an in-depth analysis of the characteristics and extensibility of the placer. Our contributions are as follows. (1) We extend the objective functions described in [44] with congestion information and implement a top-down hierarchical (multilevel) placer (APlace) based on them. For IBM-ISPD04 circuits, the half-perimeter wirelength of APlace outperforms that of FastPlace, Dragon and Capo by 7.8%, 6.5% and 7.0% on average, respectively. For eight IBM-PLACE v2 circuits, after the placements are detail-routed using Cadence WRoute, the average improvement in final wirelength is 12.0%, 8.1% and 14.1% over QPlace, Dragon and Capo, respectively. (2) We extend the placer to address mixed-size placement and achieve an average of 4% wirelength reduction on ten ISPD02 mixed-size benchmarks compared to results of the leading-edge solver, FengShui. (3) We extend the placer to perform timing-driven placement. Compared with timing-driven industry tools, evaluated by commercial detailed routing and STA, we achieve an average of 8.4% reduction in cycle time and 7.5% reduction in wirelength for a set of six industry testcases. (4) We also extend the placer to perform I/O-core co-placement and constraint handling for mixed-signal designs.
Our work aims to show, and empirically demonstrates, that the APlace framework is a general and extensible platform for "spatial embedding" tasks across many aspects of system physical implementation.
I. INTRODUCTION
Automated cell placement is a critical problem in VLSI design. As deep-submicron technology scales, new challenges arise for placement tools: design sizes are larger, turnaround times are shorter, and a variety of additional physical and geometrical constraints must be fulfilled simultaneously.
A common placement formulation seeks to minimize wirelength under the constraint that cells do not overlap each other. The current state-of-the-art placement tools can be classified into two categories, based on how they obtain a placement without cell overlaps [15]. The first class consists of algorithms that refine an existing placement to obtain a better overlap-free placement. For example, TimberWolf [52] is a well-known annealing-based placement tool; it develops new placements by permuting an existing placement. The second class of algorithms uses top-down recursive partitioning to provide the necessary cell spreading. Within this approach, a min-cut objective [4] [32] has been successfully used. A number of placement tools fall into this category: recursive partitioning with a min-cut objective [8] [23] [54] [60] (Capo, CPlace, FengShui), quadratic placement [37] [53] [56] (GORDIAN, PROUD), and analytic placement with linear wirelength [51] (GORDIAN-L). Recently, the Dragon [57] [58] placement tool was presented, combining annealing with recursive bisection. Constructive methods based on (hybrids of) partitioning and analytical techniques are usually fast and produce good results. However, wirelength minimization may come at the cost of routability, or of the ability to handle hard constraints.
New analytical placement methods that simultaneously spread cells and optimize wirelength have recently received much attention from both academia and industry. In such methods, forces based on the current cell distribution are applied to iteratively reduce cell overlaps. A quadratic objective combining wirelength and additional forces is proposed by [15] ; in each iteration, a better spreading of cells is achieved without excessive net stretching. Another model of cell attracting and repelling (ARP) is presented in [16] [17] . Attractors (dummy cells) are added to sparse regions to drag cells from nearby dense regions, together with a cell repeller model that captures the wirelength objective. The ideas of additional forces and fixed pseudocells are also combined in [24] . The difficulty of force and attractor-directed placement methods is that wirelength is easily damaged by improper forces and attractors. Recently, a fast placement algorithm with good quality was presented in [55] . Quadratic wirelength optimization and a cell shifting technique are iteratively applied to obtain a high-quality placement without cell overlapping. After quadratic optimization, a special cell shifting technique is used to reduce cell overlapping, and pseudo pins and nets are added to hold cells from clustering again in the next quadratic optimization process.
A novel and simple objective function for spreading cells over the placement area was proposed in the recent patent of Naylor et al. [44]. Combined with a wirelength objective function, it allows efficient simultaneous cell spreading and wirelength optimization using nonlinear optimization techniques. The focus of our present work is the implementation of an analytic placer (APlace) according to this idea, followed by in-depth analysis of characteristics and extensibility of the placer. While we implement the placer relying largely on the description in [44], it should be noted that this method integrates quite a few ideas that have been published in the open literature, as we describe in detail in Section II below. The main contributions of our work include the following.
• We perform analysis and empirical studies of relevant characteristics of the objective functions described in [44] .
• We extend the objective functions with congestion information, to improve routability of results.
• We implement a top-down hierarchical (multilevel) placer (APlace) based on the objective functions. For IBM-ISPD04 circuits, the half-perimeter wirelength of APlace outperforms that of FastPlace, UCLA Dragon (v2.2.3) and Capo (v8.8) respectively by 7.8%, 6.5% and 7.0% on average. For eight IBM-PLACE v2 circuits, after APlace's results are detail-routed using Cadence WRoute (SE5.4), the average improvement in final wirelength is 12.0% over Cadence QPlace (SE5.4), and 8.1% over UCLA Dragon (v3.01), and 14.1% over Capo (v8.7).
• We extend the placer to handle mixed-size placement. Our extension is compared to recent academic tools: UCLA mPG-MS [11], Feng Shui (v2.4) [35] and a three-stage placement-floorplanning-placement flow that uses Capo [1] [2]. For ten IBM-ISPD02 mixed-size circuits, the half-perimeter wirelength of our placer outperforms that of mPG-MS, Feng Shui and the Capo flow by 24.7%, 4.0% and 26.0% on average, respectively.
• We extend the placer to address timing-driven placement. Our extension is compared to two industry placers: QPlace (SE v5.4) and amoebaPlace (SoC Encounter v3.2). When timing-driven placement is performed for six industry circuits and placements are detail-routed using Cadence WarpRoute (SoC Encounter v3.2), our placer has a minimum cycle time that outperforms that of QPlace and amoebaPlace respectively by 9.6% and 8.5%, as well as average improvements of 7.2% and 6.5% in routed wirelength, respectively.
• We extend the placer to perform I/O-core co-placement for area-array I/O designs. I/Os can be evenly distributed without damaging the wirelength figure of merit.
• We also extend the placer with constraint handling for mixed-signal designs. Basic geometric constraints including alignment, spacing and symmetry constraints can be enforced during placement.
The remainder of this paper is organized as follows. Section II discusses the objective functions of the placer. Section III describes our implementation, and Section IV summarizes the placement results. In Sections V and VI, we extend the placer to handle mixed-size placement and timing-driven placement. In Sections VII and VIII, the placer is extended with I/O-core co-placement and constraint handling. The paper concludes in Section IX.
II. PROBLEM FORMULATION
A basic goal of placement is to minimize wirelength subject to the constraint that cells do not overlap. Therefore, the objective function for analytic placement historically includes two terms: a density objective to spread cells, and a wirelength objective to minimize wirelength. In this section, we discuss the basic objective functions that our placer uses, and report experiments that study their characteristics.
A. Cell Spreading
One important objective of a placer is to distribute cells evenly over the placement area. Constructive placement methods can easily achieve this goal. Force-directed placement methods apply forces based on the current area distribution to move cells away from high-density regions and toward low-density regions. However, it is difficult to choose appropriate forces; wirelength is often seriously damaged when the cells are spread out.
To distribute cells evenly over the placement area, a generic strategy is to divide the placement region into grids and then attempt to equalize the total cell area in every grid. The straightforward "squared deviation" penalty for uneven cell distribution is

$$Penalty = \sum_{g} \left( TotalCellArea(g) - AverageCellArea \right)^2 \qquad (1)$$

However, this penalty function is not smooth or differentiable, and is hence difficult to optimize. Naylor et al. [44] smooth the above penalty function by using a "bell-shaped" cell potential function instead of a solid cell area function. For a cell c with center at (CellX, CellY) and area $A_c$, the potential at grid point g = (GridX, GridY) is given by

$$Potential(c, g) = K_c \cdot p(|CellX - GridX|) \cdot p(|CellY - GridY|) \qquad (2)$$

where

$$p(d) = \begin{cases} 1 - 2d^2/r^2, & 0 \le d \le r/2 \\ 2(d - r)^2/r^2, & r/2 \le d \le r \\ 0, & d > r \end{cases} \qquad (3)$$

Here, p(d) defines the bell-shaped function, which is illustrated in Figure 1; r controls the radius of any given cell's potential (range of interaction); and $K_c$ is a normalization factor chosen so that $\sum_g Potential(c, g) = A_c$, i.e., each cell has a total potential equal to its area. Then, the penalty function in Equation (1) is transformed to:

$$Penalty = \sum_{g} \left( \sum_{c} Potential(c, g) - ExpPotential(g) \right)^2 \qquad (4)$$

where ExpPotential(g) = TotalCellArea/NumGrids is the expected total potential at the grid point g.
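As a rough illustration of the smoothed density penalty, the following Python sketch evaluates a bell-shaped potential and the resulting squared-deviation penalty on a small grid. It is a minimal sketch, not the APlace implementation: the piecewise-quadratic profile in `bell` is an assumed form of p(d), and the names `bell`, `cell_potentials`, `density_penalty`, and the toy data are hypothetical.

```python
def bell(d, r):
    """Assumed bell-shaped profile p(d): 1 at d = 0, falling smoothly to 0 at d = r."""
    d = abs(d)
    if d <= r / 2:
        return 1.0 - 2.0 * d * d / (r * r)
    if d <= r:
        return 2.0 * (d - r) ** 2 / (r * r)
    return 0.0


def cell_potentials(cx, cy, area, grid_pts, r):
    """Potential of one cell at every grid point, scaled so the total equals its area."""
    raw = [bell(cx - gx, r) * bell(cy - gy, r) for gx, gy in grid_pts]
    total = sum(raw)
    k_c = area / total if total > 0 else 0.0  # normalization factor K_c
    return [k_c * v for v in raw]


def density_penalty(cells, grid_pts, r):
    """Sum over grid points of (total potential - expected potential)^2."""
    totals = [0.0] * len(grid_pts)
    for cx, cy, area in cells:
        for i, v in enumerate(cell_potentials(cx, cy, area, grid_pts, r)):
            totals[i] += v
    expected = sum(a for _, _, a in cells) / len(grid_pts)
    return sum((t - expected) ** 2 for t in totals)


# Toy data: a 4 x 4 grid, four unit-area cells either stacked or spread out.
grid = [(x, y) for x in range(4) for y in range(4)]
clumped = [(1.5, 1.5, 1.0)] * 4
spread = [(0.5, 0.5, 1.0), (0.5, 2.5, 1.0), (2.5, 0.5, 1.0), (2.5, 2.5, 1.0)]
print(density_penalty(clumped, grid, 2.0), density_penalty(spread, grid, 2.0))
```

In this toy setting the stacked cells yield a much larger penalty than the spread-out cells, which is the behavior the density term relies on.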
We can use the conjugate gradient method to minimize the penalty function in Equation (4). Our implementation will be described in detail in Section III. We calibrate our optimizer using the ibm01-easy testcase from the IBM-PLACE 2.0 benchmark suite [25]. Our stopping criterion is that the maximum (Manhattan) movement of any cell between consecutive iterations is less than 30 units. Results with different numbers of grids and cell potential radii (r's) are summarized in Table I.
Definition 1: Discrepancy within area A w is defined as the maximum ratio of actual total cell area to expected cell area over all windows with area A w .
We use discrepancy to measure evenness of cell distribution. In Table I , the second and fifth columns show the discrepancy within 1% of the total placement area, for each of the cell distributions that the placer obtains. The number of iterations, and total running times needed to meet the above convergence criterion, are also shown in the table.
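The discrepancy measure of Definition 1 can be sketched as follows. This is a simplified illustration under stated assumptions: cell area has already been accumulated into a rectangular array of equal-sized bins, and only square windows aligned to bin boundaries are examined. The function name `discrepancy` and its arguments are hypothetical.

```python
def discrepancy(bin_area, win):
    """Max over win x win bin windows of (cell area in window) / (expected area)."""
    rows, cols = len(bin_area), len(bin_area[0])
    total = sum(map(sum, bin_area))
    # Expected area of a window if cells were spread perfectly evenly.
    expected = total * (win * win) / (rows * cols)
    if expected == 0:
        return 0.0
    worst = 0.0
    for i in range(rows - win + 1):
        for j in range(cols - win + 1):
            actual = sum(bin_area[a][b]
                         for a in range(i, i + win)
                         for b in range(j, j + win))
            worst = max(worst, actual / expected)
    return worst


uniform = [[1.0] * 4 for _ in range(4)]
print(discrepancy(uniform, 2))
```

A perfectly even distribution gives a discrepancy of 1; larger values indicate that some window holds more cell area than its share.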
From these tests, we make the following conclusions.
(1) The finer the grid, the more iterations needed for the optimizer to converge; however, a more even cell distribution is obtained. (2) For finer grids, larger values of r help to reduce the number of iterations, although the running time per iteration is increased. (3) We observe serious oscillations in discrepancy when r = 2 and the number of grid nodes is larger than 60.
B. Wirelength Formulation
Minimization of wirelength is a common objective for circuit placement. Linear and quadratic wirelength objectives are typically used; see, e.g., [39] and [51] for comparisons. The quadratic objective function is used in many analytical placement methods because it is continuously differentiable and can be minimized efficiently by solving a system of linear equations. Unfortunately, this is not true for linear objective functions, and linear programming suffers from excessive computation times. The Gordian-L objective [51] minimizes a linear wirelength function using quadratic programming methods. Also, [38] proposes an α-order objective function to capture the strengths of both methods. While wirelength and overall placement quality are typically evaluated according to half-perimeter wirelength (HPWL), this "linear wirelength" function cannot be efficiently minimized. Convex nonlinear approximations of HPWL, which do not require net models and which permit direct inclusion of nonlinear delay terms, are proposed and well-studied in such works as [3] [6] [33] [34]. The approach of Naylor et al. [44] follows along similar lines, and uses a log-sum-exp method to capture the linear half-perimeter wirelength while simultaneously obtaining the desirable characteristic of continuous differentiability. The log-sum-exp formula picks the most dominant terms among pin coordinates. For a net t with pin coordinates {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}, the wirelength objective is

$$WL(t) = \alpha \left( \ln \sum_{i=1}^{n} e^{x_i/\alpha} + \ln \sum_{i=1}^{n} e^{-x_i/\alpha} + \ln \sum_{i=1}^{n} e^{y_i/\alpha} + \ln \sum_{i=1}^{n} e^{-y_i/\alpha} \right) \qquad (5)$$

where α is a smoothing parameter. WL(t) is strictly convex, continuously differentiable and converges to HPWL(t) as α converges to 0. The log-sum-exp formula has previously been used in physical design applications such as transistor sizing [50]. We minimize the wirelength objective function using the conjugate gradient optimizer. Initially, cells are randomly distributed over the placement area, and then the wirelength objective function in Equation (5) is minimized. The program is stopped after 300 iterations. Results with different smoothing parameter values (α's) are shown in Table II. The wirelength calculated according to Equation (5) for the initial random placement is displayed in the third column. Compared to the actual initial HPWL of this random placement (i.e., Initial HPWL = 7.311), we see that the wirelength formulation becomes more accurate (closer to HPWL) when smaller values of α are used. However, with larger α values, the wirelength objective function is smoother and can be minimized more quickly, and a smaller final HPWL is obtained (e.g., final HPWL = 0.803 for α = 3336).
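The trade-off between accuracy and smoothness can be seen in a small Python sketch of the log-sum-exp wirelength. It assumes the usual four-term form (one smoothed maximum and one smoothed minimum per axis); the helper names `lse`, `smooth_wl`, and `hpwl` are hypothetical, and the max-shift inside `lse` is a standard numerical safeguard rather than something taken from [44].

```python
import math


def lse(values, alpha):
    """alpha * ln(sum(exp(v / alpha))), shifted by max(values) to avoid overflow."""
    m = max(values)
    return m + alpha * math.log(sum(math.exp((v - m) / alpha) for v in values))


def smooth_wl(pins, alpha):
    """Log-sum-exp approximation of the half-perimeter wirelength of one net."""
    xs = [x for x, _ in pins]
    ys = [y for _, y in pins]
    return (lse(xs, alpha) + lse([-x for x in xs], alpha)
            + lse(ys, alpha) + lse([-y for y in ys], alpha))


def hpwl(pins):
    """Exact half-perimeter wirelength of one net."""
    xs = [x for x, _ in pins]
    ys = [y for _, y in pins]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))


pins = [(0.0, 0.0), (3.0, 4.0), (1.0, 2.0)]
for alpha in (1.0, 0.1, 0.01):
    print(alpha, smooth_wl(pins, alpha), hpwl(pins))
```

The smoothed value always overestimates HPWL, and the gap shrinks as α decreases, at the price of a less smooth function.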
Combining the above two objectives, the analytic placer optimizes the following function:

$$WLWeight \cdot TotalWL + DensityWeight \cdot DensityPenalty \qquad (6)$$

The density term drives the spreading of cells and is always changing with the current cell distribution. The wirelength term draws connected components back toward each other.
III. DETAILED IMPLEMENTATION
We now describe implementation details of APlace.
A. Conjugate Gradient Optimizer
We use the conjugate gradient method to optimize the objective function as described in Equation (6) . The conjugate gradient method is quite useful in finding an unconstrained minimum of a high-dimensional function f (x). A detailed treatment, along with a survey of descent-based methods for nonlinear programming, can be found in [42] .
In general, the conjugate gradient method finds the minimum by executing a series of line minimizations (i.e., line searches). A line minimization corresponds to one-dimensional function minimization along some search direction. The result of one line minimization is used as the start point for the next line minimization. The method has the following form:

$$d_k = \begin{cases} -g_k, & k = 1 \\ -g_k + \beta_k d_{k-1}, & k \ge 2 \end{cases} \qquad (7)$$

$$x_{k+1} = x_k + \alpha_k d_k \qquad (8)$$

where $g_k$ denotes the gradient $\nabla f(x_k)$, $\alpha_k$ is a step length obtained by a line search algorithm, $d_k$ is the search direction, and $\beta_k$ is chosen so that $d_k$ becomes the k-th conjugate direction when the function is quadratic and the line search finds the minimum along the direction exactly. Varieties of conjugate gradient methods differ in how they select $\beta_k$: the best-known formulas for $\beta_k$ are due to Fletcher-Reeves, Polak-Ribiere, and Hestenes-Stiefel. Our implementation uses the Polak-Ribiere formula:

$$\beta_k = \frac{g_k^T (g_k - g_{k-1})}{\|g_{k-1}\|^2} \qquad (9)$$
A Golden Section search method is used to find the step length for each iteration. The length of the search interval shrinks by a factor of 0.618 (the inverse of the golden ratio) or more in each step, and converges linearly to zero.
The conjugate gradient iteration in Equation (8) repeats until one of the following stopping criteria is met: (1) a predetermined number of iterations has passed; or (2) the step length returned by the line search function is small enough; or (3) the function value is not changing significantly with additional iterations.
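The optimizer loop described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the APlace code: the search interval `[0, max_step]`, the tolerances, and the names `golden_section` and `conjugate_gradient` are hypothetical, and the objective along each search direction is assumed to be unimodal on the search interval.

```python
import math

INV_PHI = (math.sqrt(5) - 1) / 2  # about 0.618, the inverse golden ratio


def golden_section(phi, lo, hi, tol=1e-8):
    """Minimize a unimodal 1-D function phi on [lo, hi]."""
    a, b = lo, hi
    c = b - INV_PHI * (b - a)
    d = a + INV_PHI * (b - a)
    fc, fd = phi(c), phi(d)
    while b - a > tol:
        if fc < fd:   # minimum lies in [a, d]
            b, d, fd = d, c, fc
            c = b - INV_PHI * (b - a)
            fc = phi(c)
        else:         # minimum lies in [c, b]
            a, c, fc = c, d, fd
            d = a + INV_PHI * (b - a)
            fd = phi(d)
    return (a + b) / 2


def conjugate_gradient(f, grad, x0, max_iters=200, max_step=1.0,
                       min_step=1e-7, f_tol=1e-14):
    """Nonlinear conjugate gradient with the Polak-Ribiere beta."""
    x = list(x0)
    g = grad(x)
    d = [-gi for gi in g]
    fx = f(x)
    for _ in range(max_iters):                       # criterion (1)
        step = golden_section(
            lambda t: f([xi + t * di for xi, di in zip(x, d)]), 0.0, max_step)
        if step < min_step:                          # criterion (2)
            break
        x = [xi + step * di for xi, di in zip(x, d)]
        f_new = f(x)
        if abs(fx - f_new) < f_tol:                  # criterion (3)
            break
        fx = f_new
        g_new = grad(x)
        denom = sum(gi * gi for gi in g)
        if denom == 0.0:
            break
        beta = sum(gn * (gn - go) for gn, go in zip(g_new, g)) / denom
        d = [-gn + beta * di for gn, di in zip(g_new, d)]
        g = g_new
    return x


# Toy quadratic with minimum at (1, -2).
f = lambda x: (x[0] - 1) ** 2 + 10 * (x[1] + 2) ** 2
grad = lambda x: [2 * (x[0] - 1), 20 * (x[1] + 2)]
print(conjugate_gradient(f, grad, [0.0, 0.0]))
```

On the two-variable quadratic in the usage lines, two line searches are essentially enough to reach the minimum, consistent with the conjugacy property mentioned above.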
B. Control Factors
The weights of the wirelength and density objective functions provide important control parameters to the placer. Intuitively, a larger wirelength weight will draw cells together and prevent them from spreading out, while a larger density penalty weight will spread the cells out (without attention to wirelength). These controls are managed by keeping the density weight fixed at some constant, and setting the wirelength weight to be large at the outset, but then decreasing this weight (by a factor of two, or a smaller factor near the final placement) whenever the conjugate gradient optimizer slows down and a stable solution emerges. After every weight change, the conjugate gradient optimizer is used to compute a new stable state wherein cells are distributed more evenly but wirelength is larger. The process repeats until the cells are spread evenly over the placement area.

Fig. 2. Top-Down Hierarchical APlace Algorithm. Input: user-defined objective density discrepancy DestDisc; user-defined minimum step length; user-defined total number of iterations TotalIters. Output: cell placement {CellPosition(c)}. Algorithm: 01. Construct a hierarchy of clusters; ...; 11. StepLength = LineSearch(f, Gradient_l); 12. ClusterPosition_l = CellPosition_l + StepLength * Gradient_l; 13. ...
The number of grid nodes, cell potential parameter r and wirelength smoothing parameter α are also important control knobs for APlace. According to the empirical studies discussed in Section II above, the number of grid nodes should increase during the whole placement process. Coarser grids at the beginning spread out the cells faster, while finer grids at the final stages help to reach a more even distribution. In our implementation, the cell potential radius r is set to 2 for coarser grids in order to reduce running times, and to 4 for finer grids. The wirelength smoothing parameter α is set to be half of the grid length.
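The weight schedule described in this subsection can be outlined as a small driver loop. The sketch below is only illustrative: `optimize` and `discrepancy` are hypothetical callables standing in for the conjugate gradient optimizer and the evenness measure, the fixed halving factor ignores the smaller factor used near the final placement, and the toy stand-ins at the bottom are invented solely to make the example runnable.

```python
def spread_with_weight_schedule(optimize, discrepancy, state, wl_weight,
                                dest_disc, density_weight=1.0,
                                factor=2.0, max_rounds=50):
    """Keep the density weight fixed and shrink the wirelength weight
    until the placement is spread evenly enough."""
    for _ in range(max_rounds):
        # Run the optimizer to a stable state for the current weights.
        state = optimize(state, wl_weight, density_weight)
        if discrepancy(state) <= dest_disc:
            break
        wl_weight /= factor  # relax the wirelength pull, then re-optimize
    return state, wl_weight


# Toy stand-ins (hypothetical): the "placement" is a single spread value s
# that minimizes wl_w * s^2 + den_w * (1 - s)^2, and evenness is 1 / s.
def toy_optimize(_state, wl_w, den_w):
    return den_w / (wl_w + den_w)


def toy_discrepancy(s):
    return 1.0 / s


final_state, final_weight = spread_with_weight_schedule(
    toy_optimize, toy_discrepancy, 0.0, 8.0, dest_disc=1.1)
print(final_state, final_weight)
```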
C. Top-Down Hierarchical Algorithm
We use a top-down hierarchical approach to accelerate APlace. During initialization, a hierarchy of clusters is constructed using MLPart 4.21, a leading-edge, open-source min-cut hypergraph/circuit partitioner [7]. The top-down hierarchical algorithm is described in Figure 2. Notations used are summarized as follows:
• c: a cell
• n: number of cells
• MaxLevel: maximum number of cluster levels
• N_l: number of clusters at level l
• C_l(i): a cluster of cells at level l
• Area(C): total cell area of cluster C
• r: radius of cell's potential
• GridLen: spacing of grids
• α: wirelength smoothing parameter
• f: objective function of the placer
• {Gradient_l(i)}: gradient vector
• {ClusterPosition_l(i)}: a vector of cluster positions
• {CellPosition(c)}: a vector of cell positions
Subscript ranges, where not explicit, are: c = 1, ..., n; l = 0, ..., MaxLevel; and i = 1, ..., N_l.
For each level in the cell/cluster hierarchy, a coarse grid is determined by the average cluster size. We compute the density penalty by regarding cells in a cluster as a macro cell with area equal to the total cell area of the cluster. For wirelength calculation, cells are assumed to be located at the center of the cluster.
Figures 3 and 4 show how discrepancy and HPWL change with successive iterations for the ibm01-easy circuit. The clustering hierarchy for ibm01-easy has three levels. During the first approximately 900 iterations, the placer is working at cluster levels. Clustering helps to spread cells more quickly, but wirelength is impaired during cell expansion. It is clearly seen from the figures that when wirelength weight is decreased and the conjugate gradient optimizer restarts, discrepancy drops sharply and wirelength is often increased at first and then refined during the optimization. When both discrepancy and wirelength change slowly, we have a near stable sub-optimal solution for the current objective function; additional iterations will not further reduce discrepancy and wirelength very much and the wirelength weight should be reduced. Guided by these figures, we set the iteration limit at 100 in our experiments below.
D. Detailed Placement
The placement results of APlace have cell overlaps and need to be legalized. A simplified Tetris [22] legalization algorithm is implemented in APlace; this algorithm also bears strong resemblance to the method proposed in a technical report of Li and Koh [40] . The Tetris legalization is applied after global placement: cells are sorted according to their vertical coordinates, and then for each cell from left to right the current nearest available position is found. This greedy algorithm is very fast, with negligible running time compared to that of global placement, and increases wirelength by about 4% for IBM-PLACE 2.0 circuits [25] .
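A greedy legalizer in the spirit of Tetris can be sketched as follows. This is a simplified illustration, not the APlace legalizer: it assumes cells are processed in order of increasing horizontal coordinate, that each row tracks a single left-to-right frontier, and that rows are wide enough that the right boundary never matters. The function name `tetris_legalize` and the tuple layout are hypothetical.

```python
def tetris_legalize(cells, row_ys, x_min=0.0):
    """cells: list of (name, x, y, width). Returns {name: (x, y)} such that
    cells placed in the same row do not overlap."""
    frontier = {ry: x_min for ry in row_ys}  # leftmost free x in each row
    placed = {}
    for name, x, y, w in sorted(cells, key=lambda c: c[1]):
        best = None
        for ry in row_ys:
            px = max(x, frontier[ry])          # cannot go left of the frontier
            cost = abs(px - x) + abs(ry - y)   # Manhattan displacement
            if best is None or cost < best[0]:
                best = (cost, px, ry)
        _, px, ry = best
        placed[name] = (px, ry)
        frontier[ry] = px + w                  # row is now filled up to here
    return placed


cells = [("a", 0.0, 0.0, 2.0), ("b", 1.5, 0.0, 2.0), ("c", 1.8, 0.2, 2.0)]
print(tetris_legalize(cells, [0.0, 1.0]))
```

Because each cell is committed once and never revisited, the running time is proportional to the number of cells times the number of rows, which is why such a pass is cheap compared with global placement.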
After legalization, the orientation optimization and "row ironing" modules from UCLApack [10] are applied to the results. Row ironing helps to improve wirelength by applying a branch-and-bound placer in sliding windows. These algorithms are fast (running times are negligible compared to that of global placement) and decrease wirelength by about 2% for IBM-PLACE 2.0 circuits [25].
E. Congestion-Directed Placement
To improve routability of placement results, we have integrated congestion information into the objective functions to direct cell distribution. We use Kahng and Xu's accurate bend-based congestion estimation method [31] in our placer. If a particular grid is determined to be congested (resp. uncongested), the expected total cell potential of the grid in Equation (4) is reduced (resp. increased) accordingly. The sum of expected area potential over all grids is kept constant, and equal to the total cell area. Specifically, expected cell potential is adjusted as follows:
where γ is the congestion adjustment factor and decides the extent of congestion-directed placement. Table III summarizes placement runs for the ibm01-easy circuit with different congestion factors (γ's). Specific parameters used for these runs are: = 30 and destDisc = 1.1. Wirelengths before and after legalization are shown in the second and third columns. Discrepancy within 1% placement area is shown in the fourth column, and serves as a measure of evenness of cell distribution. The fifth column shows the total number of iterations. Global routing is performed using Cadence WRoute (SE5.4). The last two columns show the total wire length in the gcell grid, and total number of over-capacity gcells; these values provide figures of merit for routability of the placement results.
According to these results, routability is approximately 38% better with a congestion factor of 0.05, but deteriorates when the congestion factor is larger. Wirelength of congestion-aware placements is not seriously impaired.
IV. EXPERIMENTAL RESULTS
APlace is implemented in C++. We use the IBM-PLACE v2.0 benchmarks [25] as our test cases. These circuits include easy and hard cases of eight designs. IBM-PLACE circuits are widely used in the literature, and are becoming common benchmarks for standard-cell placement.
Our results are compared with a leading industry placer, Cadence QPlace (SE5.4), and recent academic tools: UCLA Dragon (v3.01) [14] and Capo (v8.7) [10] . All placers read the same LEF/DEF files and write DEF files as the placement output. Results of Capo are legalized using Cadence QPlace (SE5.4, ECO mode). The placements are then sent to Cadence WRoute (SE5.4, global and final routing enabled, and autostop disabled) to be routed.
All the experiments of APlace, Dragon and Capo are performed on a Xeon server with 2.4GHz CPU (double threads on each CPU) and 4GB memory; QPlace and WRoute are run on a Sun Ultra10 workstation with 400MHz CPU. We run QPlace with full placement mode (-full), and Dragon with fixed die mode (-fd). We run MetaPlacer for Linux, which incorporates Capo and orientation optimizer. Row ironing is disabled (-noRowIroning) for MetaPlacer for better routability of the placement results. One start of Dragon and of Capo is done for each case in the experiments, since the whole flow of placement and routing is very time consuming. The APlace results with no congestion awareness are summarized in Table IV . Specific parameters used for these runs are:
= 30 and destDisc = 1.1. Wirelengths (meters) before and after legalization are shown in the third and fourth columns. Average wirelength increase after legalization is 3.7% (range: 1.3% to 7.6%). Discrepancy within 1% placement area is shown in the fifth column and used to evaluate the evenness of cell distribution in APlace results. The last two columns show the total number of iterations and running times in minutes.
The placement results with congestion awareness are sent to Cadence WRoute (SE5.4) to be routed. The results are compared to QPlace, Dragon and Capo in Table V. Routing results include success (finished routing with no violations), finished (with violations) and failure (because of time limit). For all cases, the number of violations, final wirelength, the number of vias and running times of WRoute are shown in the last four columns.
According to the results, average placed wirelength improvement over QPlace is 8.5% (range: 2.0% to 11.9%).
We also tested APlace on IBM-ISPD04 benchmarks [55]. These circuits are derived from IBM benchmarks, but have fixed pads on the boundary. The APlace results with no congestion awareness are summarized in Table VI. The results are compared with those of FastPlace, Dragon (v2.2.3) and Capo (v8.8) reported in [55]. According to the results, average placed wirelength improvement over FastPlace is 7.8% (range: 0.8% to 11.4%); average improvement over Dragon is 6.5% (range: -2.8% to 13.1%); average improvement over Capo is 7.0% (range: 3.1% to 11.1%). Running times (in seconds) cannot be directly compared: Capo is run on a Sun Sparc-2 750MHz machine and APlace is run on a PIII 1.4GHz machine. On average, APlace is less than 4.3X slower than Capo on IBM-ISPD04 benchmarks, and much slower (about 50X) than FastPlace.
V. MIXED-SIZE PLACEMENT
As VLSI designs scale to billion-transistor complexities, design productivity increasingly requires the reuse of predesigned or generated macro blocks (processing and interface cores, embedded memories, etc.). This presents a "boulders and dust" challenge to placers [5], where the sizes of placeable objects can vary by factors of 10,000 or more. In this section, we extend the APlace approach to address the mixed-size placement problem. Our focus is on two issues: (1) the cell-spreading potential function, and (2) legalization.
A. Previous Work
Modern ASIC designs are typically laid out in the fixed-die context, where the outline of the core area, as well as routing tracks and power/ground distribution, are fixed before placement [8]. Mixed-size placement becomes particularly complex in the fixed-die context because of its discreteness [2]. Traditional standard-cell frameworks often cannot address this challenge smoothly, and must resort to devices such as manual preplacement of blocks, with an attendant loss of overall solution quality.
Recently, a three-stage placement-floorplanning-placement flow [1] [2] has been proposed which places macro blocks and standard cells without overlap. In the first step, all macros are shredded into small pieces connected by fake wires. A standard-cell placer, Capo, is used to obtain an initial placement, and the initial locations of macros are produced by averaging the locations of the fake cells created during the shredding process. In the second step, the standard cells are merged into soft blocks, and a fixed-outline floorplanner, Parquet, generates valid locations of macros and soft blocks of movable cells. Finally, with macros considered fixed, Capo is used again to re-place small cells. This approach scales reasonably well, but wirelength results are often quite suboptimal.
A different approach is pursued by [11], wherein a simulated-annealing-based multi-level placer, mPG-MS, recursively clusters both macro blocks and standard cells to build a hierarchy. The top-level netlist is placed, and then the placement is gradually refined by unclustering the netlist and improving the placement of smaller clusters by simulated annealing. Large objects are gradually fixed without overlap during coarse placement, and the locations of smaller objects are determined during further refinement. Significant effort is needed for legalization and overlap removal during this placement process.
Another recursive bisection based placement tool, Feng Shui, has been presented more recently in [35] . Rather than address standard cells and macro blocks separately, the placer considers them simultaneously via a fractional cut technique which allows horizontal cut lines that are not aligned with row boundaries. When compared to previous academic tools, the Feng Shui placer achieves surprising solution quality (as evaluated by placed half-perimeter wirelength) and good scalability.
B. Potential Function for Mixed-Size Placement
As described in Section II-A, the density objective drives cell spreading. The placement area is divided into grids, each cell has a potential or influence with respect to nearby grids, and the placer seeks to equalize the total cell potential at each grid. For standard-cell placement, the grid usually has a length greater than the average cell width, and the radius of cell's potential, r, is set to be a constant during optimization. However, for mixed-size placement, the size range between large and small objects can be as large as a factor of 10,000 [11] , and the radius of influence of a cell's potential will need to change according to the cell's dimension. In particular, a larger block will have potential with respect to more grids.
After investigation of several possibilities, we have chosen to address the potential function for large macros in the following simple way. Suppose a macro block b has width w. The radius or scope of this block's influence is w/2 + r, i.e., every grid within the distance of w/2+r from the block's center has a non-zero potential from this block. Moreover, the total potential of the block over all grids is equal to the block's area. Therefore, the function p(d) in Equation (2) becomes 
C. Legalization
A second key issue is that analytic placement results have cell overlaps that must be legalized. After investigation of a variety of approaches, we perform legalization in mixed-size placement as follows. Cells are sorted based on a combination of vertical coordinate and width, so that larger blocks may be fixed at a position ahead of nearby small cells. We also scale the cell positions to the left side by a fixed factor (set to 0.90 in all of the discussion and results below) so that (1) cells will not be pushed outside the placement region, and (2) horizontal overlaps among macros can be properly resolved by the legalization.
When an initial global placement has many overlaps among macros, legalization of mixed-size circuits can be extremely challenging. Indeed, a greedy algorithm such as "Tetris" may fail to find a valid position for one or more blocks, or wirelength may be seriously damaged by movement of blocks. Fortunately, the potential function described above allows our mixed-size placer to distribute cells quite evenly in the global placement, with little overlap among larger blocks. Hence, the greedy legalization approach is still an acceptable adjunct even for mixed-size placement: Wirelength increase after the legalization step for our testcases is 6.5% on average (cf. an increase of approximately 5% for pure standard-cell designs).
D. Experimental Results
We use ten circuits from the IBM-ISPD02 Mixed-Size Placement Benchmarks [26] as our testcases. These circuits are publicly available at the GSRC Bookshelf [19] and are widely used in the literature. Table VII summarizes the results for the ten mixed-size circuits. The second and third columns show half-perimeter wirelengths (meters) before and after legalization. The fourth column shows the wirelength increase due to legalization, expressed as a percentage. Average wirelength increase after the legalization step for our testcases is 6.5%. As our placer does not perform detailed placement, we use Feng Shui (v2.4) [35] as the detailed placer. Wirelength (meters) after detailed placement and percentage improvement are shown in the sixth and seventh columns. We see that the average wirelength improvement after the detailed placement step is 3.5%. All of our experiments are performed on an Intel Xeon server with dual 2.4GHz CPUs (double threads on each CPU) and 4GB memory. Running times of global and detailed placement (minutes) are shown in the fifth and eighth columns, respectively.
Table VIII compares our results with those of recently published works that experiment with the same benchmarks. The results of the three-stage placement-floorplanning-placement flow, which uses the Capo standard-cell placer, were first presented in [1] and further improved in [2]; results of the mPG-MS placer were presented in [11]; and results of Feng Shui (v2.4) are from [35]. Final HPWL values and running times (minutes) for each placer are also shown in Table VIII. According to the results, average wirelength improvement of our placer over the Capo flow is 26.0% (range: 11.5% to 34.0%); average improvement over mPG-MS is 24.7% (range: 9.9% to 40.1%); and average improvement over Feng Shui is 4.0% (range: -7.3% to 20.0%).
Running times of the placers cannot be directly compared, since the Capo flow used 2GHz Linux/Pentium 4 workstations, mPG used 750MHz Sun Blade 1000 workstations, and Feng Shui used 2.5GHz Linux/Pentium 4 workstations. Nevertheless, APlace is much slower than Feng Shui. Finally, Figures 5 and 6 show placements for the ibm02 benchmark before and after legalization.
VI. TIMING-DRIVEN PLACEMENT
Design value depends on performance; with device and interconnect scaling, this presents greater challenges to timing-driven placement. In this section, we extend the placer to address timing-driven placement.
A. Previous Work
Timing-driven placement has been studied extensively. Existing approaches can be broadly divided into two classes: path-based and net-based.
A typical path-based approach [21] [28] [47] [48] considers all paths, or a subset of them, directly within the problem formulation. Most approaches in this class are based on mathematical programming techniques. Such algorithms usually maintain an accurate timing view during optimization, but their drawback is relatively high complexity due to the exponential number of paths that need to be simultaneously minimized.
For this reason, much of the recent timing-driven work [20] [45] [46] has been net-based. Unlike path-based approaches that handle paths directly, net-based approaches [15] [49] usually transform timing constraints or requirements into either net weight or net length (or delay) constraints, and employ a weighted wirelength minimization engine.
The process of generating net-length constraints or net-delay constraints is called delay budgeting. The main idea is to distribute slacks from the end-points of each path to constituent nets along the path, such that a zero-slack solution is obtained [12] [43]. A serious drawback of this class of algorithms is that delay budgeting is usually done in the circuit's structural domain, without consideration of physical placement feasibility. As a result, it may severely over-constrain the placement problem.
Instead of assigning a delay budget to each individual net or edge, net-weighting-based approaches assign weights to nets based on their timing criticality. The basic idea is to assign higher weights to nets that are more timing critical. Net weighting techniques have some favorable properties: relatively low complexity, strong flexibility and easy implementation. As circuit sizes increase and practical timing constraints become increasingly complex, these advantages make the net weighting method more attractive.
There are two principles for assigning net weights. The main principle used in most algorithms is that a timing critical net should receive a heavy weight. For example, VPR [41] uses the following formula to assign weight to an edge e:
$$w(e) = \left(1 - \frac{slack(e)}{T}\right)^{\delta} \qquad (13)$$
where T is the current longest path delay, and δ is a constant called the criticality exponent.
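As a minimal sketch of Equation (13), the following Python function evaluates the weight of a single edge. The function and parameter names are illustrative and are not taken from VPR; the clamp that keeps the base non-negative is an added assumption, so that a slack larger than T cannot produce a negative base.

```python
def criticality_weight(slack: float, longest_delay: float, delta: float) -> float:
    """Evaluate w(e) = (1 - slack(e)/T)^delta from Equation (13)."""
    if longest_delay <= 0.0:
        raise ValueError("longest path delay T must be positive")
    # Assumption (not from the paper): clamp so that slack > T
    # cannot yield a negative base under a fractional exponent.
    base = max(0.0, 1.0 - slack / longest_delay)
    return base ** delta


# A zero-slack edge gets weight 1; the weight falls off as slack grows.
weights = [criticality_weight(s, 10.0, 2.0) for s in (0.0, 5.0, 10.0)]
```

A larger criticality exponent concentrates the weight on edges whose slack is close to zero.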
The other principle is path sharing: in general, an edge with many paths passing through it should have a heavy weight as well. Path counting is a method developed to take path-sharing effects into consideration by computing the number of paths passing through each edge in the circuit. These numbers can then be used as net weights. Another work [36] proposed a solution that distinguishes timing-critical paths from non-critical paths, and scales the impact of all paths by their relative timing criticality. Given a weighting function D(slack, T ), the weight assigned to a particular edge e is:
where T is the current longest path delay, and slack(π) is the slack of a timing critical path π.
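The path-counting idea described above can be sketched as follows. This is an illustrative Python example, not the implementation of any cited work: it assumes the timing graph is a directed acyclic graph given as a list of (driver, sink) pairs without parallel edges, and it counts plain source-to-sink paths without any criticality scaling. The number of paths through an edge (u, v) is the number of paths from primary inputs to u times the number of paths from v to primary outputs.

```python
from collections import defaultdict


def count_paths_per_edge(edges):
    """Return {(u, v): number of source-to-sink paths through edge (u, v)}."""
    succ, pred, nodes = defaultdict(list), defaultdict(list), set()
    for u, v in edges:
        succ[u].append(v)
        pred[v].append(u)
        nodes.update((u, v))

    # Topological order (Kahn's algorithm).
    indegree = {n: len(pred[n]) for n in nodes}
    ready = [n for n in nodes if indegree[n] == 0]
    order = []
    while ready:
        n = ready.pop()
        order.append(n)
        for m in succ[n]:
            indegree[m] -= 1
            if indegree[m] == 0:
                ready.append(m)
    if len(order) != len(nodes):
        raise ValueError("timing graph must be acyclic")

    # Paths arriving at each node from any source (node with no fan-in).
    from_source = {}
    for n in order:
        from_source[n] = sum(from_source[p] for p in pred[n]) if pred[n] else 1
    # Paths leaving each node toward any sink (node with no fan-out).
    to_sink = {}
    for n in reversed(order):
        to_sink[n] = sum(to_sink[s] for s in succ[n]) if succ[n] else 1

    return {(u, v): from_source[u] * to_sink[v] for u, v in edges}


# Two inputs converge on gate "c", which fans out to two outputs.
counts = count_paths_per_edge([("a", "c"), ("b", "c"), ("c", "d"), ("c", "e")])
```

Both passes visit each edge once, so the counts are obtained in time linear in the size of the graph even though the number of paths itself can be exponential.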
B. Slack-Derived Edge Weights
It is natural to apply the net weighting method in APlace to perform timing-driven placement.
Our placer uses the following formula to assign weight w(e) to an edge e: where
Here, δ is the criticality exponent, slack(π) is the slack of path π, and u is the expected improvement of the longest path delay after this timing-driven iteration. The timing-driven process may repeat for a few iterations. The weight of a net that remains timing critical accumulates across iterations.
C. Timing-Driven Placement Flow
Figure 7 shows the flow of the timing-driven process at the final stage of our placer. The near-finished placement is sent to TrialRoute (SoC Encounter v3.2) to perform fast global and detailed routing. RC is extracted based on the routing result. We then use a commercial tool, Pearl (SE v5.4), to perform static timing analysis (STA), and the resulting critical path delays are imported into the placer to decide net weights based on timing criticality and path sharing. The weighted wirelength objective is then optimized using the Conjugate Gradient solver, together with the density objective.
D. Experimental Results
1) Impact of Net Weighting Parameters:
Experiments are performed to study the parameters of the net weighting formula described in Equations (15) and (16) . Table IX shows the timing-driven placement runs with varying expected improvements (u's) and criticality exponents (δ's) for an industry testcase, indust1. The circuit is in LEF/DEF/GCF/SDC format and has 7077 cells and 8032 nets.
Again, all experiments are performed on an Intel Xeon server with dual 2.4GHz CPUs (double-threaded) and 4GB memory. Minimum cycle time (MCT) in nanoseconds is reported by the STA to measure performance of timing-driven placements, together with HPWL (meters) and running times (minutes) of the placements, and routed wirelength (meters) and the number of vias of TrialRoute's results.
The value of expected improvement determines how many timing-critical paths are considered for net weighting. For each value of expected improvement, timing-driven placements are performed with criticality exponents between 1 and 19. The MCT improvement (in percent) with timing-driven placement is shown in the last column of Table IX. The minimum cycle time initially decreases as the criticality exponent grows; however, because wirelength always increases, the minimum cycle time gradually deteriorates at larger criticality exponents.
2) Comparison with Industry Placers:
Results of our placer are compared with two industry placers: QPlace (SE v5.4) and amoebaPlace (SoC Encounter v3.2) in Table X. QPlace is run on a Sun Ultra10 workstation with 400MHz CPU. Timing-driven and non-timing-driven placements are sent to Cadence WarpRoute (SoC Encounter v3.2) to do timing-driven routing. RC is then extracted and Pearl (SE v5.4) is used to perform static timing analysis (STA).
We use six industry circuits as our testcases. Two of them, mac1 and mac2, are among the ISPD 2001 Circuit Benchmarks [27] that first appeared in [13] . These circuits are also used in [59] as benchmarks for timing-driven placement. Only Verilog files are available for these two cases; they are synthesized with a commercial tool, Cadence BuildGates (v5.12). We use a 0.18µm standard-cell library as the LEF file and use the values reported in [59] as the clock cycles for the two circuits. The other four testcases are available in complete LEF/DEF/GCF/SDC format.
In Table X, the second and third columns show the number of cells and nets of the industry circuits. For each testcase, non-timing-driven and timing-driven placement are performed by each placer. For the indust4 circuit, timing-driven QPlace fails because of an incompatible timing constraint file format. Placed wirelengths (meters) and running times (minutes) of each placement are summarized in the fifth and sixth columns. APlace-TD usually has a better placed wirelength than the industry placers, but it is slower.
Timing-driven routing is performed with WarpRoute; over-capacity gcells in percentage, the number of violations, routed wirelength (meters), the number of vias and running times (minutes) of WarpRoute's results are shown in the seventh through eleventh columns of Table X. Most of the placement results of APlace-TD can be successfully routed with good wirelength; finished routings with a small number of violations can be manually fixed. According to the results, average (routed) wirelength improvement of APlace-TD over QPlace is 7.2% (range: -1.2% to 7.1%); and average improvement over amoebaPlace is 6.5% (range: -11.1% to 23.2%). Compared to non-timing-driven APlace, APlace-TD has a slightly better routed wirelength (0.7% on average).
In the last column, minimum cycle time (nanoseconds) is reported from the STA tool as the performance measure of timing-driven or non-timing-driven placements.
Our placer usually has a better minimum cycle time. Average improvement in minimum cycle time of APlace-TD over QPlace is 9.6% (range: -1.2% to 14.8%); and average improvement over amoebaPlace is 8.5% (range: -0.8% to 28.5%). Compared to non-timing-driven APlace, APlace-TD improves the minimum cycle time by 2% on average (range: 0.1% to 3.8%). The MCT improvements are negligible for the indust3 and indust4 circuits; as noted in the footnote above, minimum cycle times of these two circuits are less sensitive to different net weights.
VII. I/O-CORE CO-PLACEMENT
IC packaging technologies with peripheral I/O pads have well-known shortcomings: clock/power distribution is constrained, and large parasitics of peripheral I/O pads cause coupling and power issues for off-chip signaling. The area-array I/O regime is projected to eventually dominate IC implementation methodology, affording improved pad count and reliability, and reduced noise coupling. Area-array I/O presents new challenges to placement tools. Caldwell et al. [9] conducted a thorough study of the implication of area-array I/O for placement methods. They determined that with alternating I/O and core placement methods, which are often used in practice, the number of iterations needed to achieve good solutions can be surprisingly large. Also, a bad initial I/O placement can seriously handicap subsequent iterations. A simultaneous methodology is proposed in [18], which performs top-down hierarchical placement, with I/Os re-placed to each partition by min-cost assignment. However, previous results from the simultaneous methodology are not good.
I/Os and core cells can be simultaneously placed in APlace. I/Os are spread over the placement area, in the same way and at the same time as core cells. The objective function of APlace in Equation (6) is modified as:
$$\mathit{WLWeight} \cdot \mathit{TotalWL} + \mathit{DensityWeight} \cdot \mathit{DensityPenalty} + \mathit{IODensityWeight} \cdot \mathit{IODensityPenalty} \qquad (17)$$
where the third term drives the spreading of I/Os.
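The structure of Equation (17) can be illustrated with a small Python sketch. The weighted sum follows the equation directly; everything else is an assumption made for illustration. In particular, `bin_density_penalty` is a hypothetical stand-in (squared deviation of per-bin counts from the average) and not the potential-based density penalty used by the placer. It is included only to show how an I/O-only term penalizes clustered I/Os.

```python
def bin_density_penalty(points, nx, ny, width, height):
    """Hypothetical stand-in: sum over bins of (count - average count)^2."""
    counts = [[0] * ny for _ in range(nx)]
    for x, y in points:
        i = min(int(x / width * nx), nx - 1)
        j = min(int(y / height * ny), ny - 1)
        counts[i][j] += 1
    target = len(points) / (nx * ny)
    return sum((c - target) ** 2 for column in counts for c in column)


def coplacement_objective(total_wl, density_penalty, io_density_penalty,
                          wl_weight, density_weight, io_density_weight):
    """Weighted sum of the three terms in Equation (17)."""
    return (wl_weight * total_wl
            + density_weight * density_penalty
            + io_density_weight * io_density_penalty)


# Four I/Os clustered in one corner versus spread over a 2x2 grid of bins.
clustered = [(0.1, 0.1), (0.2, 0.1), (0.1, 0.2), (0.2, 0.2)]
spread = [(0.25, 0.25), (0.75, 0.25), (0.25, 0.75), (0.75, 0.75)]
io_clustered = bin_density_penalty(clustered, 2, 2, 1.0, 1.0)
io_spread = bin_density_penalty(spread, 2, 2, 1.0, 1.0)
objective = coplacement_objective(10.0, 2.0, io_clustered, 1.0, 0.5, 0.25)
```

With the other two terms held equal, the clustered I/O arrangement yields the larger objective value, which is the pressure the third term is meant to apply.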
We have tested extensions of APlace to perform I/O-core co-placement on six industry circuits. Cells directly connected to fixed pads are regarded as I/O cells. Specific parameters used for the experiments are: = 30 and destDisc = 1.1. Figure 8 shows the placement for the mac1 circuit with 623 I/Os. Core cells and I/Os are displayed as yellow (grey) blocks and pink (dark) blocks, respectively. I/Os are distributed evenly over the placement area in the figure. The results are summarized in Table XI. For each circuit, the placement is performed without and with I/O-core co-placement. The third column shows the wirelength after legalization, in meters. I/O distribution is evaluated by the I/O discrepancy within 1% placement area. According to the results, I/O-core co-placement reduces I/O discrepancy by 61.3% on average, with an average increase of 5.3% in placed wirelength.
VIII. GEOMETRIC CONSTRAINTS
The need to increase the complexity and reduce the cost of electronic systems has greatly accelerated the demand for combining discrete components into application specific integrated circuits (ASICs). As more digital circuitry is integrated, the analog components of a system are more likely to represent a bottleneck in the path to size and cost reduction for a system. With increased demand for mixed-signal ASICs, there is a corresponding demand for software tools and design methodologies that increase the productivity of analog and mixed-signal ASIC designs. Digitally targeted tools are often inadequate to handle the critical and specific requirements of analog layout. Performance of analog circuits is much more sensitive to the details of physical implementation than that of digital circuits. A large number of constraints have to be considered in order to avoid extra design iterations caused by too many parasitic effects. Layout synthesis is often a multi-objective optimization problem where, along with area, wiring length and delay, topological constraints must be taken into account.
Basic geometric constraints for mixed-signal placement include the following categories: fixed components, spacing, alignment, axial symmetry and nodal symmetry. These categories cover most of the topological requirements in an analog layout system. More complex constraints, such as matching, can be built as combinations of these basic constraints.
Geometric constraints can be handled directly in APlace, since they can be converted to penalty functions and added to the objective function of the placer. Penalty functions for basic geometric constraints are summarized as follows.
• For an x-alignment constraint for cells with coordinates $\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$, which is expressed as $x_1 = x_2 = \cdots = x_n$, the penalty function is
• For an x-spacing constraint between two cells with coordinates $(x_1, y_1)$ and $(x_2, y_2)$, which is expressed as $L < |x_1 - x_2| < U$, the penalty function is
• For an axial symmetry constraint with an axis $y = c$ between two cells with coordinates $(x_1, y_1)$ and $(x_2, y_2)$, which is expressed as $x_1 = x_2$ and $y_1 + y_2 = 2c$, the penalty function is $(x_1 - x_2)^2 + (y_1 + y_2 - 2c)^2$.
• For a nodal symmetry constraint with a center $(c_x, c_y)$ between two cells with coordinates $(x_1, y_1)$ and $(x_2, y_2)$, which is expressed as $x_1 + x_2 = 2c_x$ and $y_1 + y_2 = 2c_y$, the penalty function is $(x_1 + x_2 - 2c_x)^2 + (y_1 + y_2 - 2c_y)^2$.
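The two symmetry penalties listed above can be written directly as code. The following Python sketch is illustrative only: the function names are hypothetical, and the gradient helper is added on the assumption that a gradient-based solver needs the derivative of each penalty term. Only the axial and nodal symmetry penalties are shown.

```python
def axial_symmetry_penalty(p1, p2, c):
    """(x1 - x2)^2 + (y1 + y2 - 2c)^2 for symmetry about the axis y = c."""
    (x1, y1), (x2, y2) = p1, p2
    return (x1 - x2) ** 2 + (y1 + y2 - 2 * c) ** 2


def axial_symmetry_gradient(p1, p2, c):
    """Partial derivatives with respect to (x1, y1, x2, y2)."""
    (x1, y1), (x2, y2) = p1, p2
    dx = 2 * (x1 - x2)
    dy = 2 * (y1 + y2 - 2 * c)
    return (dx, dy, -dx, dy)


def nodal_symmetry_penalty(p1, p2, center):
    """(x1 + x2 - 2cx)^2 + (y1 + y2 - 2cy)^2 for symmetry about a point."""
    (x1, y1), (x2, y2) = p1, p2
    cx, cy = center
    return (x1 + x2 - 2 * cx) ** 2 + (y1 + y2 - 2 * cy) ** 2


# Cells mirrored about y = 4 incur no penalty; shifting one cell in x does.
satisfied = axial_symmetry_penalty((1, 3), (1, 5), 4)
violated = axial_symmetry_penalty((2, 3), (1, 5), 4)
```

Each penalty is zero exactly when its constraint holds and grows quadratically with the violation, so it can be added to the placement objective with a weight.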
Constraint handling is implemented in APlace, and experiments are performed for each IBM-PLACE easy circuit with a manually specified combination of 40 spacing constraints, 20 alignment constraints, 20 axial symmetry and 10 nodal symmetry constraints. Cells in the artificial constraints are randomly selected from the netlist. Specific parameters used for the experiments are: = 30 and destDisc = 1.1. Figure 9 shows the placement with 90 geometric constraints for the ibm01-easy circuit. Constraints are shown as cells connected by colored (dark) lines. Cells connected by blue lines have alignment constraints among them; cells connected by black lines have axial symmetry and spacing constraints; and cells connected by red lines have nodal symmetry and spacing constraints. We can see from the figure that geometric constraints are closely enforced. The placement results are summarized in Table XII. The second and third columns show the wirelength before and after legalization in meters. Compared to placement results without constraints, the legalized wirelength is increased by 8.2% on average.
IX. CONCLUSION AND FUTURE WORK
We have implemented APlace, an analytic placer based on ideas described in the recent patent of Naylor et al. [44], and have conducted in-depth analysis of characteristics and results of the placer. The implementation is successful: placed and routed wirelengths outperform QPlace, FastPlace, Dragon and Capo. We also extend the basic placer to perform top-down hierarchical placement, congestion-directed placement, mixed-size placement, timing-driven placement, I/O-core co-placement and constraint handling for mixed-signal contexts. Our work empirically demonstrates that the APlace analytic framework is a general and extensible platform for "spatial embedding" tasks across many aspects of system physical implementation.
We are currently working on speedup of APlace. The efforts include: (1) use of lookup tables when computing density penalties and exponential functions; (2) implementation of a faster pseudo-Newton solver; and (3) application of the Augmented Lagrangian method for the constrained optimization problem.
Our other ongoing research directions include: (1) extension of the placer to power or IR drop directed placement; (2) extension to 3D placement; (3) extension to thermal-directed placement; and (4) devising a unified analytic placement approach that can simultaneously address congestion, timing, power, and wirelength at a level beyond the existing state of the art.
