Abstract-In this paper, we present FastPlace 3.0, an efficient and scalable multilevel quadratic placement algorithm for large-scale mixed-size designs. The main contributions of our work are: (1) A multilevel global placement framework, obtained by incorporating a two-level clustering scheme within the flat analytical placer FastPlace [27, 28]. (2) 
I. INTRODUCTION
In recent years, it has become common to interleave placement with logic synthesis and timing-optimization transforms to create a physical synthesis design flow. As a result, placement needs to be run repeatedly during the early design stages. In addition, circuits today often contain over a million objects that need to be placed. Hence, it is necessary to have efficient and scalable placement algorithms that produce good-quality results satisfying various design objectives including congestion, routability and timing.
Existing placement algorithms employ various approaches including simulated annealing [24, 25], partitioning [1, 2, 7, 29] and analytical placement [4, 9-11, 16, 17, 21, 27, 28]. Analytical placement algorithms based on the quadratic objective function (also called quadratic placers) are very popular, as they are efficient and give good quality of results. They typically employ a flat placement methodology [9-11, 17, 27, 28] so as to maintain a global view of the placement problem.
However, with circuit sizes steadily increasing towards tens of millions of objects, a flat placement methodology may not be effective in handling the large problem size. Hence, for better scalability and solution quality, a hierarchical placement approach is beneficial. To this end, many modern placers follow a hierarchical or multilevel approach [3, 4, 13, 15, 21, 26].
An essential constraint that needs to be handled by current placers is that of placement congestion. Designers often run placement algorithms with specific target-density values. To determine the placement density, a pre-defined bin structure is imposed over the placement region. The density of a bin is then defined as the ratio of the total area of the movable objects within the bin to the total available free space within the bin. The target-density specifies the maximum allowed occupation for any bin in the placement region. Satisfying the target-density constraint means that the density of every bin in the placement region is less than or equal to the target-density value. The purpose of the target-density is to leave more room within a bin for the subsequent routing step. It also creates space to perform subsequent timing-optimization transforms such as buffer insertion and gate sizing.
*This work was partially supported by the Semiconductor Research Corporation under Task ID 1206 and NSF under grant CCF-0540998.
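The bin-density definition and the target-density check can be illustrated with a small sketch. This is not the FastPlace 3.0 implementation: the function names, the rectangle representation `(x, y, w, h)`, and the `free_space` dictionary are assumptions made purely for illustration.

```python
import math

def bin_densities(cells, free_space, bin_w, bin_h):
    """Density of each bin = movable area inside the bin / free space of the bin.

    cells      : list of (x, y, w, h) rectangles (lower-left corner, size),
                 a hypothetical representation of movable objects.
    free_space : dict mapping (col, row) -> available free area of that bin.
    """
    area = {b: 0.0 for b in free_space}
    for x, y, w, h in cells:
        # Range of bin columns/rows that the cell rectangle can touch.
        c0, c1 = int(x // bin_w), math.ceil((x + w) / bin_w)
        r0, r1 = int(y // bin_h), math.ceil((y + h) / bin_h)
        for c in range(c0, c1):
            for r in range(r0, r1):
                if (c, r) not in area:
                    continue
                # Overlap of the cell with bin (c, r).
                ox = min(x + w, (c + 1) * bin_w) - max(x, c * bin_w)
                oy = min(y + h, (r + 1) * bin_h) - max(y, r * bin_h)
                if ox > 0 and oy > 0:
                    area[(c, r)] += ox * oy
    density = {}
    for b, free in free_space.items():
        if free > 0:
            density[b] = area[b] / free
        else:
            # No free space: any movable area in the bin is a violation.
            density[b] = float("inf") if area[b] > 0 else 0.0
    return density

def violating_bins(density, target_density):
    """Bins whose density exceeds the target-density."""
    return sorted(b for b, d in density.items() if d > target_density)

# Two 10x10 bins; the second has half of its area taken by a fixed blockage.
free = {(0, 0): 100.0, (1, 0): 50.0}
cells = [(0, 0, 5, 10), (8, 0, 4, 10), (12, 0, 3, 10)]
dens = bin_densities(cells, free, bin_w=10, bin_h=10)
over = violating_bins(dens, target_density=0.9)
```

In this toy layout the second bin ends up completely full, so it is the only one reported against a target-density of 0.9.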
In this paper we address the two issues of scalability and placement congestion. We present FastPlace 3.0, an efficient multilevel quadratic placement algorithm with placement congestion control for large-scale mixed-size designs. The main contributions of our work are:
* Incorporating a multilevel framework within the global placement stage of the flat quadratic placer FastPlace [27, 28]. This is done by employing two levels of clustering: an initial netlist-based fine-grain clustering followed by a netlist- and location-based coarse-grain clustering.
Our multilevel placement framework is summarized in Fig. 1 and follows the classical hierarchical flow that has been used in many existing placement algorithms [3, 4, 6, 13, 15, 21]. The entire flow of our placement algorithm is summarized in Fig. 2. It consists of three stages: (a) global placement using a multilevel framework, (b) legalization of macro blocks using the Iterative Clustering Algorithm of [28], followed by a density-aware standard-cell legalization scheme, and (c) an effective detailed placement algorithm [22]. The individual components of the flow are described in more detail in the subsequent sections.
III. CLUSTERING FOR PLACEMENT
Circuit clustering is an attractive method to reduce the placement problem size for large-scale VLSI designs. If clustering is performed carefully, it can also yield better wirelength along with faster runtime compared to flat placement approaches. In our multilevel framework we use clustering in a persistent context as defined in [21]. That is, we use clustering at the beginning of placement to pre-process the flat netlist so as to reduce the placement problem size.
In our multilevel framework, we follow a two-level clustering scheme as shown in Fig. 1. In the first level of clustering we create fine-grain clusters of about 2-3 objects per cluster. This clustering is based solely on the connectivity information between the objects in the original flat netlist. Since this clustering is performed before any placement, we restrict it to fine-grain clustering to minimize any loss in placement quality due to incorrect clustering. In fact, it was demonstrated in [12] that building fine-grain clusters can improve placement efficiency with negligible loss in placement quality.
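To make connectivity-only clustering concrete, the following is a simplified sketch in the spirit of the Best-Choice algorithm of [21], which this paper uses for both clustering levels. Several things here are assumptions for illustration rather than details taken from the paper: the pair score (sum of 1/|net| over shared nets, divided by the combined area), the stopping rule based on a target cluster count, and all names. For brevity, the sketch recomputes all pair scores after every merge instead of maintaining a priority queue.

```python
def best_choice_sketch(nets, areas, target_count):
    """Greedily merge the highest-scoring connected pair until
    `target_count` clusters remain.

    nets   : list of sets of object ids (hyperedges).
    areas  : dict id -> area.
    Assumed score(u, v) = sum over shared nets of 1/|net|, divided by
                          area(u) + area(v).
    """
    nets = [set(n) for n in nets]
    areas = dict(areas)
    members = {v: [v] for v in areas}
    while len(areas) > target_count:
        weight = {}
        for net in nets:
            if len(net) < 2:
                continue  # a net absorbed into one cluster connects nothing
            w = 1.0 / len(net)
            pins = sorted(net)
            for i, u in enumerate(pins):
                for v in pins[i + 1:]:
                    weight[(u, v)] = weight.get((u, v), 0.0) + w
        if not weight:
            break  # no connected pairs left to merge
        u, v = max(weight,
                   key=lambda p: weight[p] / (areas[p[0]] + areas[p[1]]))
        # Merge v into u and redirect v's pins to u.
        areas[u] += areas.pop(v)
        members[u] += members.pop(v)
        for net in nets:
            if v in net:
                net.discard(v)
                net.add(u)
    return list(members.values())

# Four unit-area objects; target of two clusters (about 2 objects each).
nets = [{"a", "b"}, {"a", "b", "c"}, {"c", "d"}]
areas = {"a": 1.0, "b": 1.0, "c": 1.0, "d": 1.0}
clusters = best_choice_sketch(nets, areas, target_count=2)
```

Dividing by the combined area discourages the formation of very large clusters, which matches the intent of keeping first-level clusters to a few objects each.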
We then perform a fast, initial placement of the fine-grain clusters. The purpose of this step is to get some placement information for the next clustering level. Since each cluster in the first level has only around 2-3 objects, the initial placement of the clusters closely resembles an initial placement of the flat netlist. We then create coarse-grain clusters by performing a second level of clustering. In this level, we consider both the connectivity information between the clusters and their physical locations as obtained from the initial placement. We believe that generating coarse-grain clusters based on actual placement information is better than generating them by a solely netlist-based approach. Also, such an approach would further minimize any loss in (or even improve) the final wirelength.
The key difference between our clustering scheme and the ones followed in [3, 5, 15, 21] is that we use actual placement information while forming coarse-grain clusters, whereas the other approaches generate coarse-grain clusters based solely on netlist information. Our approach closely resembles that of [13]; the difference is that [13] uses two levels of netlist-based clustering followed by physical clustering, whereas we use only one level of fine-grain netlist-based clustering.
For both levels of clustering, we use the Best-Choice clustering algorithm described in [21] . In Fig. 3 
For placement congestion control, the ILR is divided into two components. The d-ILR uses the global pre-defined bin structure used for placement density computation. It calculates the utilization and contour height for these bins. Cells are then moved from source to target bins of the global bin structure.
Once the d-ILR is performed, we then run the r-ILR as before in which the bin sizes are initially set to a large value and then decreased over subsequent placement iterations. Fig. 6 depicts the interaction between the d-ILR and the r-ILR and shows the decrease in the size of the bins from the d-ILR stage to the end of the r-ILR stage.
V. LEGALIZATION AND DETAILED PLACEMENT
The aim of the legalization stage is to resolve the module overlaps present after global placement and yield a legal, non-overlapping placement. Our legalization stage is divided into two steps: we first ignore all the standard-cells and resolve overlaps among the macro blocks; we then fix the macros and legalize the standard-cells. This is followed by detailed placement. These steps are described in more detail below.
A. Macro Block Legalization
During legalization, we do not want to move the macros by a significant amount from their global placement positions. Hence, the goal of the macro block legalization algorithm is to resolve overlaps among the macros by perturbing them by the minimum possible distance from their global placement positions. This is achieved by using the Iterative Clustering Algorithm [28] for macro block legalization. Due to space constraints, we refer the reader to [28] for more details. 
B. Density Aware Selective Bin-based Cell Legalization
After macro block legalization, we fix their positions and treat them as placement blockages for all subsequent steps. Each row in the placement region is then fragmented into segments based on the overlap of the row with the placement blockages. The aim of the density aware standard-cell legalizer is to satisfy segment capacities as well as placement congestion constraints and legalize the standard-cells within the segments.
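The fragmentation of a row into segments can be sketched as follows. This is an illustrative sketch only: the interval representation of blockages and the function name are assumptions, and the blockages passed in are taken to be those that overlap the row vertically.

```python
def row_segments(row_x0, row_x1, blockages):
    """Split the row [row_x0, row_x1) into free segments.

    blockages : list of (x0, x1) horizontal extents of the blockages
                (fixed macros) that overlap this row; they may overlap
                each other or stick out of the row.
    Returns a list of (x0, x1) free segments, left to right.
    """
    segments = []
    cursor = row_x0  # left end of the free space not yet emitted
    for b0, b1 in sorted(blockages):
        b0, b1 = max(b0, row_x0), min(b1, row_x1)  # clip to the row
        if b1 <= b0:
            continue  # blockage lies entirely outside the row
        if b0 > cursor:
            segments.append((cursor, b0))
        cursor = max(cursor, b1)
    if cursor < row_x1:
        segments.append((cursor, row_x1))
    return segments

# A row spanning [0, 100) with two overlapping macros and one at the edge.
segs = row_segments(0, 100, [(20, 30), (25, 50), (90, 120)])
```

Here the two overlapping macros merge into a single blocked span, leaving two free segments in the row.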
To perform legalization, we create a Regular Bin Structure (RBS) over the entire placement region. The height of each bin is equal to the cell row height and its width is equal to around 4x the average cell width. We then determine the utilization of every bin and segment in the placement region. The utilization of a segment is defined as the total width of all the cells within the segment. If the total width is greater than the segment width, the segment is considered to be above capacity.
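As a small illustration of these definitions, the sketch below computes the RBS bin width and flags segments that are above capacity. The names are hypothetical, and assigning a cell to a segment by its left edge is an assumption made to keep the example short.

```python
def rbs_bin_width(cell_widths, factor=4):
    """Bin width of the Regular Bin Structure: about 4x the average cell width."""
    return factor * sum(cell_widths) / len(cell_widths)

def over_capacity_segments(segments, cells):
    """Return (segment, utilization) for every segment above capacity.

    segments : list of (x0, x1) free segments within one row.
    cells    : list of (x, width) for the cells in that row; a cell is
               counted in the segment containing its left edge (assumption).
    The utilization of a segment is the total width of its cells; the
    segment is above capacity if that exceeds the segment width.
    """
    result = []
    for s0, s1 in segments:
        used = sum(w for x, w in cells if s0 <= x < s1)
        if used > (s1 - s0):
            result.append(((s0, s1), used))
    return result

bin_width = rbs_bin_width([2, 4, 6])
cells = [(0, 8), (5, 8), (10, 8), (55, 10)]
overfull = over_capacity_segments([(0, 20), (50, 90)], cells)
```

In this example the first segment holds 24 units of cell width in 20 units of space, so it is the only one flagged.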
Based on the segment utilizations and placement blockages, we construct a move map of the entire placement region. For each bin in the RBS, this map has a value of 1 if cells may be moved into or out of the bin, and 0 otherwise. Bins that completely overlap blockages are assigned a value of 0, as we do not want cells to be moved on top of a blockage. If the utilization of a particular segment is greater than the target-density, then a small region of bins in and around that segment is assigned a value of 1, so that move-based legalization is performed only on these bins. This is depicted in Fig. 7, where there are two segments that are above capacity (shown by the diagonal lines). Move-based legalization is turned on for only a small set of bins around these segments (shown by the shaded regions). For moving the cells among the bins we use a technique similar to the ILR. The difference is that the score for a move during legalization is a weighted sum of three components: (a) the half-perimeter wirelength reduction for the move, (b) a
Since the legalization technique is mainly used to even out the placement and satisfy segment capacities, a higher weight is assigned to the second and third components. Once all the segments are brought within capacity, we assign the cells to legal positions within each segment.
The selective bin-based legalizer has two key advantages. First, it does not significantly perturb the global placement solution. Second, it distributes the cells evenly within the segments, which helps to satisfy placement congestion constraints.
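The move map described in this subsection can be sketched as a 0/1 grid over the RBS. The sketch below is an illustration under stated assumptions, not the authors' code: segments are given as inclusive bin-index spans, and `margin` (how many bins around an over-capacity segment are enabled) is a hypothetical parameter standing in for the "small region" mentioned in the text.

```python
def build_move_map(n_rows, n_cols, over_segments, blocked_bins, margin=1):
    """Build the 0/1 move map over an n_rows x n_cols Regular Bin Structure.

    over_segments : list of (row, col_start, col_end) bin-index spans
                    (inclusive) of segments that need legalization.
    blocked_bins  : set of (row, col) bins completely covered by blockages.
    margin        : number of surrounding bins to enable (assumed value).
    """
    move_map = [[0] * n_cols for _ in range(n_rows)]
    # Enable a small window of bins in and around each flagged segment.
    for row, c0, c1 in over_segments:
        for r in range(max(0, row - margin), min(n_rows, row + margin + 1)):
            for c in range(max(0, c0 - margin), min(n_cols, c1 + margin + 1)):
                move_map[r][c] = 1
    # Bins fully covered by blockages never accept cells.
    for r, c in blocked_bins:
        move_map[r][c] = 0
    return move_map

move_map = build_move_map(
    n_rows=3, n_cols=5,
    over_segments=[(1, 1, 2)],
    blocked_bins={(0, 0), (2, 3)},
)
```

Applying the blockage rule last guarantees that a blocked bin stays at 0 even when it falls inside the window around an over-capacity segment.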
C. Detailed Placement
To further reduce the wirelength of the placement, we adopt a modified version of the FastDP [22] detailed placer that can handle placement congestion constraints.
VI. EXPERIMENTAL RESULTS

FastPlace 3.0 was tested on the ISPD-2005 Placement Benchmarks [19] and the ISPD-2006 Placement Benchmarks [20]. These benchmarks have been derived from industrial ASIC designs with circuit sizes ranging from 211K to 2.50M objects. In addition, the ISPD-2006 benchmark suite has a specific target-density assigned to each circuit.
In Table I, we compare FastPlace 3.0 with the latest available versions of the academic placers mPL6 [4, 5, 8], Capo 10.2 [23] and APlace 2.0 [15, 16]. In Table II, we compare our results with those of other placers reported during the ISPD 2005 placement contest. It should be noted that, for the contest, all the placers were given the benchmarks in advance and there was no limit on the CPU time required to get the best possible results on the individual circuits.
From Table II In Table III Table IV gives the runtime comparison of our placer with other placers in the ISPD 2006 placement contest. This is a direct comparison of the runtime, as the machine specifications for the contest are the same as those of the machine on which we ran our experiments. On average, the runtime of our placer is the lowest among all the placers.
VII. CONCLUSIONS
In this paper we describe FastPlace 3.0, an efficient and scalable quadratic placer for large-scale mixed-size circuits. It is based on a multilevel global placement framework and incorporates an improved Iterative Local Refinement technique that can handle placement blockages as well as placement congestion constraints. We also describe an efficient density-aware standard-cell legalization scheme.
The current implementation produces results competitive with other state-of-the-art academic placers on various benchmark circuits, but with significantly lower runtime. Such an ultra-fast placer is much needed in present-day iterative physical synthesis flows to achieve timing closure without a significant runtime overhead.
