In this paper, we propose an effective algorithm flow to handle large-scale mixed-size placement. The basic idea is to use floorplanning to guide the placement of objects at the global level. The flow consists of four steps: 1) The objects in the original netlist are clustered into blocks; 2) Floorplanning is performed on the blocks; 3) The blocks are shifted within the chip region to further optimize the wirelength; 4) With big macro locations fixed, incremental placement is applied to place the remaining objects. There are several advantages of handling placement at the global level with a floorplanning technique. First, the problem size can be significantly reduced. Second, exact HPWL can be minimized. Third, precise object distribution can be achieved so that legalization only needs to handle minor overlaps among small objects in a block. Fourth, rotation and various placement constraints on macros can be handled. To demonstrate the effectiveness of this new flow, we implement a high-quality floorplan-guided placer called FLOP. We also construct the Modern Mixed-Size (MMS) placement benchmarks, which can effectively represent the complexities of modern mixed-size designs and the challenges faced by modern mixed-size placers. Compared with state-of-the-art mixed-size placers and leading macro placers, experimental results show that FLOP achieves the best wirelength, and easily obtains legal solutions on all circuits.
INTRODUCTION
In the nanometer scale era, placement has become an extremely challenging stage in modern VLSI designs. Millions of objects need to be placed legally within a chip region, while both the interconnection and object distribution have to be optimized simultaneously. As an early step of the VLSI physical design flow, the quality of the placement solution has a significant impact on both routing and manufacturing. In modern System-on-Chip (SoC) designs, the usage of Intellectual Property (IP) and embedded memory blocks becomes more and more popular. As a result, a design usually contains tens or even hundreds of big macros. A design with big movable macros and numerous standard cells is known as a mixed-size design, where the placement of big macros plays a key role. Due to the big size difference between big macros and standard cells, the placement of mixed-size designs is much more difficult than standard-cell placement. Existing placement algorithms usually cannot generate a legal solution by themselves. They have to rely on a post-placement legalization process. However, legalizing big macros with wirelength minimization has long been considered a very hard problem.
Previous Work
Most mixed-size placement algorithms place both the macros and the standard cells simultaneously. Examples are the annealing-based placer Dragon [1], the partitioning-based placer Capo [2], and the analytical placers FastPlace3 [3], APlace2 [4], Kraftwerk [5], mPL6 [6], and NTUplace3 [7]. The analytical placers are the state-of-the-art placement algorithms. They can produce the best result in the best runtime. However, the analytical approach has two problems. First, only an approximation (e.g., by log-sum-exp or quadratic function) of the Half-Perimeter Wirelength (HPWL) is minimized. Second, the distribution of objects is also approximated, which usually results in a large number of overlaps. These placers have to rely on a legalization step to resolve the overlaps. For mixed-size designs, such a legalization process is very difficult and is likely to significantly increase the HPWL.
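To make the first problem concrete, the sketch below contrasts the exact HPWL of a single net with a log-sum-exp smoothing of it. The function names and the smoothing parameter `gamma` are illustrative assumptions and are not taken from any of the cited placers.

```python
import math

def hpwl(pins):
    """Exact half-perimeter wirelength of one net; pins is a list of (x, y)."""
    xs = [x for x, _ in pins]
    ys = [y for _, y in pins]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))

def lse_wirelength(pins, gamma=1.0):
    """Log-sum-exp smoothing of HPWL. gamma is an assumed smoothing
    parameter: smaller values track the exact HPWL more closely."""
    def smooth_span(vals):
        hi, lo = max(vals), min(vals)
        # Shift by hi/lo before exponentiating for numerical stability.
        smooth_max = hi + gamma * math.log(sum(math.exp((v - hi) / gamma) for v in vals))
        smooth_min = lo - gamma * math.log(sum(math.exp((lo - v) / gamma) for v in vals))
        return smooth_max - smooth_min
    return smooth_span([x for x, _ in pins]) + smooth_span([y for _, y in pins])
```

The smoothed value never falls below the exact HPWL and approaches it as `gamma` shrinks, which is why an analytical placer that minimizes it is only indirectly minimizing the true objective.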
Other researchers apply a two-stage approach, as shown in Figure 1, to handle mixed-size placement. An initial wirelength-driven placement is first generated. Then a macro placement or legalization algorithm is used to place only the macros, without considering the standard cells. After that, the macros are fixed, and the standard cells are re-placed in the remaining whitespace from scratch. As the macro placement is a crucial stage in this flow, different techniques have been proposed to improve the quality of result (QoR). Based on the MP-tree representation, Chen et al. [8] used a packing-based algorithm to place the macros around the four corners of the chip region. In [9], a transitive closure graph (TCG) based technique was applied to enhance the quality of macro placement. One main problem with the above two approaches is that the initial placement is produced with a large amount of overlap. Thus, the initial solution may not provide a good indication of the locations of objects. However, the following macro-placement stage still determines the macro locations by minimizing the displacement from the low-quality initial placement. Alternatively, Adya et al. [10] used an annealing-based floorplanner to directly minimize the HPWL among the macros and clustered standard cells at the macro-placement stage. However, they still have to rely on the illegal placement to determine the initial locations of macros and clusters. For all of the above two-stage approaches, after fixing the macros, the initial positions of standard cells have to be discarded to reduce the overlaps.
Our Contributions
To effectively handle the complexities of mixed-size placement, we present a new algorithm flow which efficiently integrates floorplanning and incremental placement algorithms. As floorplanners have a good capability of handling a small number of objects [2] , we apply floorplanning on the clustered circuit to generate a global overlap-free layout, and use it to guide the subsequent placement algorithm. This new flow is as follows (see Fig. 2 ).
1. Block Formation: The purpose of the first step is to cut down the problem size. We define "small objects" as small macros and standard cells. The small objects are clustered into soft blocks, while each big macro is treated as a single hard block.
2. Floorplanning: In this step, a floorplanner is applied on the blocks to directly minimize the exact HPWL. Simultaneously, the objects are precisely distributed across the chip region to guarantee an overlap-free layout.
3. Wirelength-driven Shifting: In order to further optimize the HPWL, the blocks are shifted at the floorplan level. After shifting, big macros are fixed. The remaining movable objects are assumed to be at the center of the corresponding soft block.
4. Incremental Placement: Lastly, the placement algorithm will place the remaining objects. The initial positions of such objects provided by the previous step are used to guide the incremental placement.
Comparing this new methodology with the state-of-the-art analytical placers, we can see that it is superior in several aspects: 1) The exact HPWL is optimized in Steps 1-3; 2) The objects are more precisely distributed in Step 2; 3) Placement constraints and macro orientation optimization can be handled in Step 2. Compared with the previous two-stage approach, instead of starting from an illegal initial placement, we use the floorplanner to directly generate a global overlap-free layout among the big macros, as well as between big macros and small objects. In addition, the problem size has been significantly reduced by clustering. A good floorplanner should be able to produce a high-quality global layout for the subsequent incremental placer. Furthermore, the initial positions of the small objects are not discarded. We keep such information as a starting point of incremental placement. Since the big macros have already been fixed, the placer avoids the difficulty of legalizing the big macros. Based on the new algorithm flow, we implement a robust, efficient and high-quality floorplan-guided placer called FLOP. It can effectively handle mixed-size placement with all movable objects, including both macros and standard cells. FLOP can also optimize the macro orientation with respect to packing and wirelength optimization.
To show the effectiveness of FLOP, we derive the Modern Mixed-Size (MMS) placement benchmarks from the original ISPD05/06 Placement Benchmarks. These new circuits can represent the challenges of modern large-scale mixed-size placement.
The rest of this paper is organized as follows. Section 2 describes the overview of FLOP. Section 3 introduces the block formation and floorplanning algorithms. Section 4 presents the wirelength-driven shifting technique. Section 5 describes the incremental placement algorithm. Section 6 describes the MMS benchmarks. Section 7 presents the experimental results. Finally, this paper ends with the conclusion and future work.
OVERVIEW OF FLOP
FLOP follows the same algorithm flow as shown in Figure 2 . The block formation is based on the result of recursive partitioning of the original circuit. After partitioning, small objects in each partition are clustered into a soft block and each big macro becomes a single hard block.
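As a minimal sketch of this block formation rule, the code below turns a list of partitions into hard and soft blocks. The `Obj` record, the `is_big_macro` flag and the dictionary layout of a block are hypothetical; the text does not state how a big macro is distinguished from a small one.

```python
from dataclasses import dataclass

@dataclass
class Obj:
    name: str
    area: float
    is_big_macro: bool  # assumed flag; the classification rule is not given here

def form_blocks(partitions):
    """Each big macro becomes its own hard block; the small objects of
    each partition are clustered into one soft block."""
    blocks = []
    for part in partitions:
        for obj in part:
            if obj.is_big_macro:
                blocks.append({"kind": "hard", "members": [obj.name], "area": obj.area})
        small = [obj for obj in part if not obj.is_big_macro]
        if small:
            blocks.append({
                "kind": "soft",
                "members": [obj.name for obj in small],
                "area": sum(obj.area for obj in small),
            })
    return blocks
```

A partition that contains only big macros yields no soft block, so the number of blocks is at most the number of big macros plus the number of partitions.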
In the floorplanning step, FLOP adopts a min-cut based fixed-outline floorplanner similar to DeFer [11]. In DeFer, a hierarchy of the blocks needs to be derived using recursive partitioning. Because such a hierarchy has already been generated during the block formation step, it will be passed down and will not be generated again. Another way to look at the flow of FLOP is that the block formation step is merged into the floorplanning step as the first stage of DeFer.
We formulate the wirelength-driven shifting problem as a linear programming (LP) problem. Therefore, we can find the optimal block position in terms of the HPWL minimization among the blocks. In the LP-based shifting we only ignore the local netlist among small objects within each soft block.
Because analytical placers have the best capability in placing a large number of small objects, we use an analytical placer as the engine in the incremental placement step.
BLOCK FORMATION AND FLOORPLANNING
A high-quality and non-stochastic fixed-outline floorplanner DeFer was presented in [11] . It has been shown that, compared with other fixed-outline floorplanners, DeFer achieves the best success rate, the best wirelength and the best runtime on average.
Here is a brief description of the algorithm flow of DeFer: Firstly the original circuit is partitioned into several subcircuits, each of which contains at most 10 objects. After that, a high-level slicing tree structure is built up. Secondly, for each subcircuit an associated shape curve is generated to represent all possible slicing layouts within the subcircuit. Thirdly, the shape curves are combined from bottom-up following the high-level slicing tree. In the final shape curve at the root the points within the fixed outline are chosen for further HPWL optimization. At the end DeFer outputs a final layout.
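The bottom-up combination of shape curves can be illustrated with a simplified sketch. Here a shape curve is assumed to be a list of (width, height) points, and two curves are combined for one fixed slicing direction; this is the textbook slicing-tree operation, not DeFer's actual implementation, and the function names are hypothetical.

```python
def prune(points):
    """Keep only non-dominated (width, height) points, sorted by width."""
    kept = []
    for w, h in sorted(set(points)):
        # A wider point is only useful if it is strictly shorter.
        if not kept or h < kept[-1][1]:
            kept.append((w, h))
    return kept

def combine(curve_a, curve_b, side_by_side=True):
    """Combine two shape curves placed side by side (widths add) or
    stacked (heights add), then discard dominated points."""
    points = []
    for wa, ha in curve_a:
        for wb, hb in curve_b:
            if side_by_side:
                points.append((wa + wb, max(ha, hb)))
            else:
                points.append((max(wa, wb), ha + hb))
    return prune(points)
```

For example, combining `[(1, 4), (2, 2), (4, 1)]` with `[(2, 2)]` side by side gives `[(3, 4), (4, 2)]`; the candidate `(6, 2)` is dropped because `(4, 2)` is narrower at the same height.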
In FLOP, we use DeFer in the floorplanning step. To make it more robust and efficient for mixed-size placement, we propose some new techniques and strategies, which are described in Sections 3.1-3.3.
Usage of Exact Net Model
We use the exact net model in [12] to improve the HPWL in partitioning. By applying this net model in partitioning, the cut value becomes exactly the same as the placed HPWL, so that the partitioner can directly minimize the HPWL instead of interconnections between two partitions. In FLOP at the first β levels of the high-level slicing tree (β = 3 by default), we apply two cuts on the original partition. One is horizontal cut, and another is vertical cut. We compare these two cuts and pick the one with less cost, i.e. HPWL.
However, for a vertical/horizontal cut, the cut value returned by the net model is only equal to the horizontal/vertical component of the HPWL. So for two cuts with different directions, it is incorrect to decide the better cut direction based on the two cut values generated by these two cuts. The authors in [12] avoided such a comparison by fixing the cut direction based on the dimension of the partition region. Nevertheless, this may potentially lose the better cut direction. Here we propose a simple heuristic to solve the cut value comparison between the cuts from two different directions.
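Suppose $K$ is the total number of nets in the partition that we are going to cut. For the horizontal cut (H-cut), let $L^{H}_{x_i}$ and $L^{H}_{y_i}$ be the horizontal and vertical HPWL of net $i$; $L^{V}_{x_i}$ and $L^{V}_{y_i}$ are defined likewise for the vertical cut (V-cut). So the total HPWL of the $K$ nets in this partition are:

$$L^{H} = \sum_{i=1}^{K} \left( L^{H}_{x_i} + L^{H}_{y_i} \right), \qquad L^{V} = \sum_{i=1}^{K} \left( L^{V}_{x_i} + L^{V}_{y_i} \right)$$

Thus, the correct way to make the comparison between H-cut and V-cut should be to compare $L^{H}$ with $L^{V}$.

As the net model only returns $\sum_{i=1}^{K} L^{H}_{y_i}$ for H-cut, and $\sum_{i=1}^{K} L^{V}_{x_i}$ for V-cut, we need to find a way to estimate $\sum_{i=1}^{K} L^{H}_{x_i}$ and $\sum_{i=1}^{K} L^{V}_{y_i}$. Let the aspect ratio (i.e. height/width) of the partition region be $\gamma$. When $K$ is very big, based on statistics we can have:

$$\frac{\sum_{i=1}^{K} L^{H}_{y_i}}{\sum_{i=1}^{K} L^{H}_{x_i}} \approx \frac{\sum_{i=1}^{K} L^{V}_{y_i}}{\sum_{i=1}^{K} L^{V}_{x_i}} \approx \gamma$$

Two reasons prevent us from applying the net model in lower levels (> β): 1) As partitioning goes on, $K$ becomes smaller and smaller, which makes the approximation of $\sum_{i=1}^{K} L^{H}_{x_i}$ and $\sum_{i=1}^{K} L^{V}_{y_i}$ inaccurate; 2) Using the net model, we restrict the combine direction in the Generalized Slicing Tree [11], which hurts the packing quality.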
To make a trade-off we only apply the net model in the first β levels. 
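The cut-direction comparison can be sketched as follows. The sketch assumes that the missing HPWL component of each cut is estimated from the returned one through the aspect ratio of the partition region (vertical span roughly equal to the aspect ratio times the horizontal span); this estimator and all names are illustrative assumptions, not a transcription of the formula used in FLOP.

```python
def estimate_total_hpwl(cut_value, direction, aspect_ratio):
    """Estimate the total HPWL of a cut from the single component that the
    net model returns. aspect_ratio is height / width of the region."""
    if aspect_ratio <= 0:
        raise ValueError("aspect_ratio must be positive")
    if direction == "H":
        # H-cut: cut_value is the vertical component; assume the
        # horizontal component is about cut_value / aspect_ratio.
        return cut_value * (1.0 + 1.0 / aspect_ratio)
    if direction == "V":
        # V-cut: cut_value is the horizontal component; assume the
        # vertical component is about cut_value * aspect_ratio.
        return cut_value * (1.0 + aspect_ratio)
    raise ValueError("direction must be 'H' or 'V'")

def pick_cut(h_cut_value, v_cut_value, aspect_ratio):
    """Return the direction with the smaller estimated total HPWL."""
    h_total = estimate_total_hpwl(h_cut_value, "H", aspect_ratio)
    v_total = estimate_total_hpwl(v_cut_value, "V", aspect_ratio)
    return "H" if h_total <= v_total else "V"
```

With a tall region (aspect ratio 2), an H-cut value of 10 is estimated at 15 in total, so a V-cut must return less than 5 to be preferred; comparing the raw values 10 and 6 would pick the wrong direction.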
Block Formation
As mentioned earlier, since DeFer starts with a min-cut partitioning, FLOP merges the block formation step into the floorplanning step. After the original circuit is partitioned into multiple subcircuits, in each subcircuit we treat a big macro as a hard block, and cluster all small objects into a soft block.
However, in DeFer the partitioning will not stop until each subcircuit contains at most 10 objects. If the same stopping criterion is used in FLOP, then most subcircuits will contain at most 10 standard cells, which means that by clustering we can only cut down the problem size by at most 90%. Nevertheless, for a typical placement problem with millions of objects, the resulting circuit size is still too big for the floorplanning algorithm. So here we propose a more suitable stopping criterion. Let A_o be the total area of all objects in the design. In one partition there are N_p objects, of which the total area is A_p, and α is the area bound (α = 0.15% by default). We will stop cutting this partition if either one of the following conditions is satisfied: 1)
Generation of Shape Curve for Blocks
To capture the shape of the blocks, we generate an associated shape curve for each block. For the hard block if a macro cannot be rotated, only one point representing the user-specified rotation is generated (see Fig. 3 (a) ). Otherwise two points representing two different rotations are generated (see Fig. 3 (b) ). For the soft block we bound its aspect ratio from 1/3 to 3, and sample multiple points on the shape curve to represent its shape (see Fig. 3 (c) ). Considering the target density constraint in the placement, we add some white space in each soft block. In some sense, we "inflate" the soft block based on the target density.
In Equation 1, for soft block $i$, $A'_{s_i}$ is the "inflated area", $A_{s_i}$ is the total area of objects within soft block $i$, and $TD$ is the target density. Based on this formula, if the target density is more than 93%, we add some white space into the soft block. The purpose is to leave some space for the analytical placer to place the small objects.
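The shape-curve points for the two block types can be sketched as below. The geometric spacing of the sampled aspect ratios, the default sample count and the function names are assumptions; the area passed to `soft_block_points` is meant to be the already inflated area, whose formula is not reproduced here.

```python
import math

def hard_block_points(width, height, rotatable):
    """One point for a macro with a fixed rotation, two if it may rotate."""
    points = [(width, height)]
    if rotatable:
        points.append((height, width))
    return points

def soft_block_points(area, samples=5, min_ar=1.0 / 3.0, max_ar=3.0):
    """Sample (width, height) points of constant area whose aspect ratio
    (height / width) runs from min_ar to max_ar. Requires samples >= 2."""
    points = []
    for k in range(samples):
        t = k / (samples - 1)
        aspect = min_ar * (max_ar / min_ar) ** t  # assumed geometric spacing
        width = math.sqrt(area / aspect)
        points.append((width, area / width))
    return points
```

Every sampled point keeps the same area, so the floorplanner only chooses the shape of a soft block, never its size.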
WIRELENGTH-DRIVEN SHIFTING
In FLOP the wirelength-driven shifting process is formulated as a linear programming (LP) problem, which is the same as in [13] . We use the contour structure [14] to derive the horizontal and vertical non-overlapping constraints among the blocks.
The LP-based shifting is an essential part of FLOP. In terms of HPWL minimization, it can find the optimal position for each block, and it provides a globally optimized layout for the analytical placer. Since the LP-based shifting optimizes the HPWL at the floorplan level, it only ignores the local nets among the small objects within each soft block. The smaller the soft block is, the fewer nets it ignores, and the better the final HPWL will be. However, if the soft blocks become too small, numerous nets will be considered in the shifting. This would slow down the whole algorithm. Because of this, in the partition stopping criteria we set an area bound α, so that the soft blocks would not become too small. On the other hand, we only need the shifting step to generate a globally good layout. Regarding the local nets within the soft blocks, the following analytical placer can handle them very efficiently and effectively.
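For orientation, a generic LP of this kind can be written as follows. This is a standard textbook formulation under assumed notation, not necessarily the exact model of [13]: $(x_b, y_b)$ is the lower-left corner of block $b$ with width $w_b$ and height $h_b$; $(\delta^x_p, \delta^y_p)$ is the offset of pin $p$ on its block; $L_n, R_n, B_n, T_n$ bound net $n$; $\mathcal{H}$ and $\mathcal{V}$ are the left-of and below relations taken from the floorplan; and $W \times H$ is the chip region.

```latex
\begin{aligned}
\min \quad & \sum_{n} \bigl[ (R_n - L_n) + (T_n - B_n) \bigr] \\
\text{s.t.} \quad
& L_n \le x_b + \delta^x_p \le R_n, \quad
  B_n \le y_b + \delta^y_p \le T_n
  && \forall \text{ pin } p \text{ of net } n \text{ on block } b \\
& x_i + w_i \le x_j && \forall (i, j) \in \mathcal{H} \\
& y_i + h_i \le y_j && \forall (i, j) \in \mathcal{V} \\
& 0 \le x_b \le W - w_b, \quad 0 \le y_b \le H - h_b && \forall b
\end{aligned}
```

Because the relative order of the blocks is fixed by $\mathcal{H}$ and $\mathcal{V}$, every constraint is linear and the objective equals the exact HPWL at the optimum, which is what allows the shifting step to be solved optimally for a given floorplan topology.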
INCREMENTAL PLACEMENT
As mentioned before, the output of the wirelength-driven shifting step is a layout with legal, non-overlapping locations for the big macros. These big macros are then fixed in place to prevent further movement during any subsequent steps. However, there are multiple "soft blocks" in the layout, each containing numerous "small objects" (i.e., small macros and standard cells). The shifting step assigns these small objects to the center of the corresponding soft block. In this respect, the placement step has two key tasks: 1) Spread the small objects over the placement region and obtain a final overlap-free placement among all objects; 2) Use the initial locations of the small objects as obtained by the shifting step.
To satisfy these two tasks, we use an efficient analytical incremental placement algorithm (see Algorithm 1).

Algorithm 1
    while number_of_clusters > target_number_of_clusters do
        cluster netlist using Best-choice clustering [15]
        use physical locations of small objects in clustering score
        set cluster_location ← center of gravity of the objects within cluster
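The last step of the loop, placing a cluster at the center of gravity of its objects, can be sketched as below. Weighting each object by its area is an assumption; the text only says "center of gravity", and the function name is hypothetical.

```python
def cluster_location(objects):
    """Area-weighted center of gravity of a cluster.
    objects is a list of (x, y, area) tuples; the area weighting is assumed."""
    total_area = sum(area for _, _, area in objects)
    if total_area <= 0:
        raise ValueError("cluster must have positive total area")
    cx = sum(x * area for x, _, area in objects) / total_area
    cy = sum(y * area for _, y, area in objects) / total_area
    return cx, cy
```

Since the shifting step places all small objects of a soft block at the block center, clusters formed inside one soft block start at that center, and only clusters that span several soft blocks land in between.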

MMS BENCHMARKS
The only publicly available benchmarks for mixed-size designs are ISPD02 and ICCAD04 IBM-MS [10, 17], which are derived from the ISPD98 Placement Benchmarks. As pointed out in [18], these circuits can no longer be representative of modern VLSI physical design. To continue driving the progress of physical design for the academic community, two suites of placement benchmarks [18, 19] have been released recently. They are directly derived from modern industrial ASIC designs. Unfortunately, in the original circuits most macros have been fixed due to the difficulty of handling movable macros for the existing placers. The authors in [8, 9] freed all fixed objects in the ISPD06 benchmarks and created new mixed-size placement circuits. But seven out of eight circuits do not have any fixed I/O objects, which is not realistic in real designs. In order to recover the complexities of modern mixed-size designs, we modify the original ISPD05/06 benchmarks and derive the Modern Mixed-Size (MMS) placement benchmarks (see Table 1). Essentially, we make the following changes on the original circuits.
I. Macros are freed from the original positions. In the GSRC Bookshelf format that the original benchmarks use, both fixed macros and fixed I/O objects are treated as fixed objects. There is no extra specification to differentiate them. So we have to distinguish them only based on the size differences. Basically, if the area of one fixed object is more than λ× the average area of the whole circuit, we will recognize it as a macro. Otherwise, it is a fixed I/O object. Because for each circuit the average area is different, we need to use a different λ (see the last column in Table 1 ) to decide a reasonable number and suitable threshold size for the macros. There is one exception: in both circuits bigblue2 and bigblue4, there is one macro that does not connect with any other objects. If this macro is freed, it may cause some trouble for quadratic-based analytical placers. So we keep it fixed. Since this macro is also very small compared with other macros, it would not affect the circuit property.
II. The sizes of all I/O objects are set to zero. In MMS benchmarks there are two types of I/Os: perimeter I/Os around the chip boundary and area-array I/Os spreading across the chip region. Generally, the area-array I/Os are allowed to be overlapped with other movable objects in the design. But existing placers treat all fixed I/Os as fixed objects, so that their algorithms internally do not allow such overlaps during the legalization. Since the macros have already been freed in MMS benchmarks, the placers should ignore the overlaps between fixed I/O objects and movable objects, and concentrate on the legalization of movable objects. As we cannot change the code of other placers, one simple way to enforce this is to set the sizes of all I/O objects to zero.
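The size-based rule from change I can be sketched as follows. The function name and the dictionary input are hypothetical, and `average_area` is taken as a precomputed input because the text does not spell out how the average is computed.

```python
def classify_fixed_objects(fixed_areas, average_area, lam):
    """Split fixed objects into macros and fixed I/O objects by area.
    fixed_areas maps object name to area; lam is the per-circuit factor."""
    threshold = lam * average_area
    macros, ios = [], []
    for name, area in fixed_areas.items():
        # Strictly larger than the threshold counts as a macro.
        if area > threshold:
            macros.append(name)
        else:
            ios.append(name)
    return macros, ios
```

The objects returned in `macros` are the ones to free, while those in `ios` stay fixed and, following change II, have their sizes set to zero.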
The target density constraints are the same as in the original circuits. The same scoring function is used to calculate the scaled HPWL. However, since the macros are movable in the MMS circuits, we need to modify the script used in [19] to get the correct "scaled_overflow_factor". The modification is as follows: any movable macro that has a width or height greater than the bin dimension used for the scaled overflow calculation is now treated as a fixed macro during that calculation. Note that this was the method employed by the original script on newblue1, which is the only design that has big movable macros in the original circuits. It is required to treat big movable macros as fixed; otherwise, we will get an incorrect picture of the placement density.
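The modified rule amounts to a single predicate, sketched below. The function name and the separate `bin_width` and `bin_height` arguments are assumptions; the text speaks of a single "bin dimension", and the original script is not reproduced here.

```python
def treated_as_fixed_for_overflow(width, height, bin_width, bin_height, is_fixed):
    """Return True if an object counts as fixed in the scaled overflow
    calculation: either it is fixed, or it is wider or taller than a bin."""
    return is_fixed or width > bin_width or height > bin_height
```

Objects for which this returns True reduce the free space of the bins they cover instead of contributing to the movable area that is checked against the target density.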
We have discussed the MMS benchmarks setup with the authors in [18, 19] . To keep the original circuit properties as much as possible, the above changes are the best we can do without accessing the original industrial data of the circuits. The MMS benchmarks are publicly available at [20] .
EXPERIMENTAL RESULTS
All experiments were performed on a Linux machine with an AMD Opteron 2.6 GHz CPU and 8GB memory. We use hMetis2.0 [21] as the partitioner and QSopt [22] as the LP solver. The seed of hMetis2.0 is set to 5. Essentially, we set up four experiments.
II. To show the importance of the initial positions of small objects in the incremental placement step, we generate the results of FLOP-NI, which discards such information and places all small objects from scratch. As shown in Table 2, FLOP-NI produces 5% worse HPWL and is 17% slower than FLOP.
III. We compare FLOP with the leading macro placers CG, MPT and XDP. Due to IP issues, their binaries are not available. But the authors sent us the benchmarks used in [9]. So in Table 3 the other placers' results are cited from [9]. These benchmarks allow the rotation of macros and do not consider the target density. As shown in Table 3, FLOP achieves 1%, 12%, 7% and 14% better HPWL compared with CG, MPT, XDP and NTUplace3, respectively. To show which algorithm provides the best macro locations, we use NTUplace3 to substitute the incremental placer inside FLOP (NTUplace3 does not support incremental placement). The results show that FLOP+NTUplace3 generates 9% worse HPWL than CG. But this does not mean FLOP is weaker than CG in terms of handling the macros. We observe that FLOP+NTUplace3 produces significantly worse HPWL on newblue7. However, using the same macro locations generated by FLOP, the incremental placer inside FLOP achieves the best HPWL on newblue7. We believe this is because NTUplace3 is not an incremental placer. As shown earlier, non-incremental placement will significantly degrade FLOP's result.
IV. The runtime breakdown of FLOP is shown in Figure 4. We can see that the LP-based shifting takes almost 1/3 of the total runtime. This is the main runtime bottleneck in FLOP.
CONCLUSION
This paper presents a new algorithm flow for large-scale mixed-size placement. To show the effectiveness of this flow, a high-quality mixed-size placer FLOP is proposed. Compared with state-of-the-art mixed-size placers and leading macro placers, FLOP achieves the best HPWL, and easily produces a legal layout for every circuit.
We believe there is much room to further improve the QoR of FLOP. For example, we can use the min-cost flow algorithm to substitute the linear programming formulation in order to speed up the LP-based shifting step. We also observe that the partitioning takes around 80% of the total runtime in the floorplanning step. Thus a stand-alone clustering algorithm is needed in the block formation step to cut down the problem size before partitioning. This will definitely improve both the runtime and HPWL. In the future, different floorplanners and placers can be incorporated into this flow to handle other problems, e.g. placement with geometry constraints. 
