In the context of physical synthesis, large-scale standard-cell placement algorithms must facilitate incremental changes to layout, both local and global. In particular, flexible gate sizing, net buffering and detail placement require a certain amount of unused space in every region of the die. The need for "local" whitespace is further emphasized by temperature and power-density limits as well as the increasing use of buffered interconnect. Another requirement, the stability of placement results from run to run, is important to the convergence of physical synthesis loops. Indeed, logic re-synthesis targeting local congestion in a given placement or particular critical paths may be irrelevant for another placement produced by the same or a different layout tool.
Introduction
With the rapid decrease of feature sizes, circuit layouts become more complex, both in terms of size and design constraints [3]. To achieve timing closure for high-performance circuits, it is now common to use physical synthesis, an approach that combines logic and physical optimization, potentially performing placement-aware buffer insertion, gate sizing, fanout optimization, etc. A recent work from Intel [23] suggests that buffering alone implies the need for "local" whitespace throughout the core area. Such unused cell sites facilitate placement of signal-net and clock-tree buffers in near-optimal locations rather than pre-determined "buffer islands". However, research from IBM [4] shows that distributing whitespace uniformly [8] may significantly increase wirelength. It also suggests that pin-limited and floorplanned designs, e.g., microprocessors with large on-chip caches, very frequently contain placement partitions with large amounts of whitespace. To this end, we (i) develop techniques to achieve a compromise between cell density and design flexibility, and (ii) study relevant trade-offs.
Timing optimization and congestion removal often use loops in which a netlist is re-placed based on information gleaned from a trial placement. However, some popular algorithms such as min-cut placement and simulated annealing tend to produce very different placement solutions from run to run. Therefore information about timing-critical nets and nets that failed to route may be invalidated (similar reasons hamper interconnect prediction [25]). To facilitate incremental improvement of layout, we propose to stabilize placements from run to run. We distinguish two kinds of stability. An inherently stable algorithm, such as many analytical algorithms, would produce similar results from run to run. However, even with a generally unstable algorithm one can tether all new placements to a given trial placement, with a tunable amount of freedom for further optimizations. Thus, we distinguish inherent stability from relative stability. The latter may be used, e.g., to tie placements produced by an annealer to a placement produced by a min-cut algorithm. We demonstrate such relative stability by comparing congestion maps [19] of several min-cut placements, cell displacements and top critical paths. Our empirical results show that small modifications of a placement instance can suppress the instability inherent in common placement algorithms, without the loss of solution quality. Our techniques rely largely on pre- and post-processing, and can be easily implemented with existing tools.
Another trend in VLSI design is the increasing dominance of interconnects [13]. This is primarily because wires do not scale as well as devices. Assuming ideal scaling, all dimensions of a wire shrink by 0.7x per generation (in practice, wire heights currently do not scale as well as widths, resulting in tall, thin wires). It is known that the wire capacitance per unit length remains invariant from generation to generation [5]. However, the resistance per unit length doubles every process generation, resulting in a wire delay that scales as 1.4x every generation. The RC delay of a wire grows quadratically with its length. Repeaters are often placed at optimal (generally equal) distances along the wire to linearize the delay through it [5]. Since the delay of an ideally shrunk interconnect scales as 1.4x, more buffers are required to linearize the delay through an interconnect in the new process generation. Additionally, design frequencies are increasing, which further increases the number of buffers per process node. The need for additional buffers often entails re-placing the entire block with a more relaxed minimum local whitespace requirement. In an application of our proposed techniques for placement stability, we address the issue of rescaling a placement to satisfy different minimum local whitespace requirements while maintaining the timing characteristics of the original placement. Unlike current ECO techniques, our rescaling flow is not restricted to small changes of the netlist and layout. By combining our techniques one can devise placement flows to efficiently map an existing layout-optimized design to a new process generation, while allowing sufficient room for further optimizations. Companies selling soft IP, like Tensilica and MIPS, and their customers could benefit from rescaling. Microprocessors are also often downscaled.
Perhaps our techniques can be used as a first step followed by technology-specific and/or performance driven layout and circuit optimizations.
In the remainder of the paper, Section 2 gives background on large-scale placement and describes previous work.
Whitespace distribution is discussed in Section 3, and stability in Section 4, accompanied by relevant empirical results.
Our proposed rescaling flow to allow for different minimum local whitespace requirements is explained in Section 5.
Finally, our contributions are summarized in Section 6.
Background and Previous Work
Modern ASIC designs are typically laid out in the fixed-die context, where the outline of the core area, all routing tracks and power lines are fixed before placement starts [6]. One of the reasons for this is the use of previously designed and rigorously simulated power grids. Also, standard-cell partitions of microprocessors are often laid out with fixed outlines in hierarchical floorplan-driven design flows because reshaping the outline would affect neighboring partitions. Large on-chip memories similarly constrain random-logic partitions. In the context of massive IP reuse, especially with hard IP blocks (analog circuits such as DACs, ADCs, PLLs and embedded memories), the die area may be determined by floorplanning, thus making area-minimization during placement irrelevant. Fixed-die layout is reasonable for processes with over-the-cell routing on three or more metal layers. In this context, the total area is fixed and the number of unused cell sites (whitespace) is known in advance. Variable-die placers typically pack all cells to the left in rows. However, fixed-die placers often allocate whitespace uniformly [6, 8] or according to congestion maps [21, 27]. When significantly more whitespace is available, the work in [4] proposes to allocate whitespace so as to improve half-perimeter wirelength.
They show that uniform whitespace distribution in such designs causes very significant increase in wirelength.
Fixed-die placement in physical synthesis
It is important to note that in the context of physical synthesis, the structure of the netlist may be changed and incremental placement must be performed. Given that some gates may be up-sized and many nets are likely to be buffered, the availability of "local" whitespace is a necessity. Indeed, the work in [23, 17] predicts that buffers will soon be the most frequently used gates in large high-performance circuits. Local whitespace can also be useful to accommodate regular structures such as (i) N-well contacts that have to be assigned to vertices of a grid, and (ii) area-array I/O pads that also form a grid. Thus, desired whitespace distribution must guarantee a minimum percent of "local" whitespace throughout the chip and beyond that optimize other design objectives. The requirement for minimum local whitespace may also be used to generically improve routability and yield, even out the temperature gradient across the die and decrease the likelihood of cross-talk noise.
Another effect of fixed-die layout is the occurrence of unroutable placements. Indeed, in variable-die layout one can always add routing tracks to complete routing at the cost of increased area [20] , but this is impossible with a fixed outline.
To improve congestion, it is common to use cell-bloating (i.e., treating cells as if they were larger in order to free routing tracks around them) in congested regions [24] . Additionally, a number of logic transformations (fanout optimization, input reordering, gate merging and cloning, etc) can be used to improve congestion. However, if the same placement tool produces an entirely different placement at the next run, such optimizations would be wasted. This problem is especially noticeable with placers based on min-cut and simulated annealing. The same problem is encountered when logic resynthesis targets timing optimization. Therefore, to reliably achieve timing closure one may want to stabilize placement solutions.
Hierarchical Whitespace Allocation in Top-Down Placement
The academic placement tool Capo [6] applies a top-down, min-cut partitioning based approach to find a global placement. Capo uniformly spreads [3] the available whitespace throughout the core region. We briefly explain the whitespace allocation strategy implemented in Capo [8]. In the top-down, divide-and-conquer approach for global placement, a given placement instance is decomposed into smaller instances by subdividing the placement region and assigning cells to subregions such that good solutions to sub-instances combine into good solutions of the original instance. The concept of a placement bin is pivotal. A bin represents: 1) a placement region with allowed locations; 2) a collection of cells to be placed in this region; 3) all nets incident to the cells; and 4) locations of all cells beyond the given region that are adjacent to the cells to be placed in the region; such external cells are considered to be terminals, and their locations are fixed. In a min-cut placer like Capo, every placement bin yields a hypergraph partitioning instance which is split through min-cut hypergraph bisection with FM-type move-based heuristics. The uniform whitespace distribution strategy in Capo is explained as follows. Let a placement bin have site area $S$, cell area $C$, absolute whitespace $W = \max\{S - C, 0\}$, and relative whitespace $w = W/S$. A hypergraph bi-partitioning solution implies cell areas $C_0$ and $C_1$ in child bins, such that
$$C_0 + C_1 = C.$$
The input to the hypergraph bi-partitioner must specify both the netlist and the allowed ranges for $C_0$ and $C_1$, i.e., bounds
$$C_j^{\min} \le C_j \le C_j^{\max}, \quad j = 0, 1.$$
These bounds establish the absolute tolerance $T_j$ and the relative tolerance $\tau_j = T_j / C$. Capo uses a mix of fixed tolerances and hierarchical whitespace allocation during top-down placement [8]. The placer chooses vertical or horizontal bin splits depending on the bin's aspect ratio and typically cuts along the longest side of a bin. Vertical partitioning is performed with a fixed 20% tolerance. After partitioning, when the actual total cell area in each partition is available, the vertical cut-line determining the bin boundaries is shifted to equalize relative whitespace in the bins. A different strategy is employed for allocating whitespace for bins split by a horizontal cut-line. During a horizontal split, the partitioning tolerances are calculated based on the relative whitespace of the bin and the number of rows in the bin. A precise mathematical model of hierarchical whitespace allocation in placement is proposed in [8]. It is based on the concept of whitespace deterioration, which is explained as follows. Assuming non-zero relative whitespace at the top level, one requires that for each bin split with a relative whitespace of w, the relative whitespace in each child bin is at least αw, where 0 ≤ α ≤ 1 is the whitespace deterioration. As α approaches 1, the whitespace distribution in the final placement approaches the uniform distribution. An α of 0 allows for fully utilized regions of the layout. One can adjust α on a per-bin basis to account for maximum allowed layout densities in the leaf-level bins, which can be guided by minimum local whitespace requirements. It is shown in [8] that, given the whitespace deterioration α for a bin, the partition capacities and tolerances for the partitioner can be calculated as follows.
When the bin has a large amount of whitespace (i.e., C is very small compared to S) and α is sufficiently small, $C_j^{\max}$ and $C_j^{\min}$ may degenerate into C and zero, respectively. In such a case, all cells are allowed to go into one partition. A closed expression for whitespace deterioration α in terms of relative whitespace w in the bin and the number of rows R in the bin is given as follows.
Partitioning tolerances increase as the placer descends to lower levels, and relative whitespace in all bins is bounded from below, thus preventing overlaps. This facilitates good use of whitespace when it is scarce and prevents dense regions when large amounts of whitespace are available. As with vertical partitioning, after horizontal partitioning the cut-line determining the bin boundaries is shifted to equalize relative whitespace in the bins. Hierarchical whitespace allocation during horizontal partitioning allows for higher partitioning tolerances, and hence for lower cuts [15] during min-cut partitioning. This can also lead to certain regions of the layout being packed more densely than others. However, a constant tolerance during the vertical partitioning step and shifting of cut-lines after each partitioning to equalize the relative whitespace in the child bins ensure a uniform distribution of whitespace throughout the core region.
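The cut-line shift mentioned above amounts to dividing the bin's site area in proportion to the cell areas assigned to the two children. The following is a minimal sketch under stated assumptions: a rectangular bin with uniform rows (so site area is proportional to width) and a vertical cut. The function names are hypothetical and are not taken from Capo.

```python
def relative_whitespace(site_area, cell_area):
    """w = W / S with W = max(S - C, 0), as defined in the text."""
    return max(site_area - cell_area, 0.0) / site_area


def shift_cutline(x_left, x_right, cell_area_0, cell_area_1):
    """Return the x-coordinate of a vertical cut-line that equalizes
    relative whitespace in the two child bins.

    Assumption: site area is proportional to bin width, so equal relative
    whitespace is obtained when width_0 / width_1 == C_0 / C_1.
    """
    total = cell_area_0 + cell_area_1
    if total <= 0:
        # No cells in the bin: keep the geometric middle.
        return (x_left + x_right) / 2.0
    return x_left + (x_right - x_left) * cell_area_0 / total


if __name__ == "__main__":
    height = 10.0
    cut = shift_cutline(0.0, 100.0, cell_area_0=480.0, cell_area_1=120.0)
    w0 = relative_whitespace(cut * height, 480.0)
    w1 = relative_whitespace((100.0 - cut) * height, 120.0)
    print(cut, w0, w1)
```

With an 80%:20% split of cell area, the cut-line moves to 80% of the bin width and both children end up with the same relative whitespace.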
Whitespace Management Framework
As shown in [4], min-cut placers that uniformly distribute whitespace [8, 6] tend to produce excessive wirelength when large amounts of whitespace are present. The authors of [4] propose a fairly sophisticated technique, Analytical Constraint Generation (ACG), to place sparse designs. It has been argued in [4] that analytical placement algorithms have a global view of the placement problem and can better manage large amounts of whitespace. ACG [4] combines a min-cut based placer with a quadratic placement engine. In ACG, during the top-down recursive-bisection min-cut placement flow, the partitioning capacities for each placement bin to be partitioned are generated from a placement that minimizes quadratic wirelength at that level. While we address the same problem of placing sparse designs, our study is somewhat orthogonal to theirs. The methods we propose are much simpler and can be implemented as pre-processing without having access to placer source code. This allows us to explore the effect of whitespace on routed wirelength and congestion using different academic placers. We also describe optimized implementations of our whitespace management techniques in a typical top-down min-cut placer framework. Additionally, our placement framework is somewhat different from that used in [4] and benefits from these simple techniques in new ways. Namely, Capo can shift the cut-line to better reflect the outcome (balance) of every min-cut partitioning call, whereas the placer in [4] uses a grid of placement bins rather than a more general slicing floorplan as in Capo.
Free Cells
The technique we propose assumes a placer that uniformly distributes whitespace across the core area. We assume that the minimum "local" whitespace requirement leaves certain slack relative to the total whitespace available in the design.
By pre-processing, we can ensure (i) the minimum "local" whitespace through the core area, and (ii) better allocation of the remaining whitespace. The technique consists of adding small disconnected "free cells" to the design in an amount not exceeding whitespace that remains after the "local" requirement is satisfied. Since free cells are disconnected and small, a placer is free to place those cells so as to improve relevant design objectives. After placement, we remove free cells and treat the remaining cell sites as empty. This causes high cell density in certain areas, with free cells occupying the vacant areas of the chip.
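The amount of free-cell area follows directly from the description above: it may not exceed the whitespace left over once the minimum "local" requirement is reserved. The sketch below is illustrative only; the function name, the uniform free-cell size, and the example numbers are assumptions rather than details from the experiments.

```python
import math


def free_cell_budget(site_area, cell_area, min_local_ws, free_cell_area=1.0):
    """Return (count, total_area) of free cells that can be added.

    site_area      -- total area of cell sites in the core (S)
    cell_area      -- total area of real cells (C)
    min_local_ws   -- required minimum local whitespace, as a fraction of S
    free_cell_area -- area of one free cell (hypothetical uniform size)
    """
    if not 0.0 <= min_local_ws < 1.0:
        raise ValueError("min_local_ws must be in [0, 1)")
    whitespace = max(site_area - cell_area, 0.0)
    # Whitespace remaining after the local requirement is set aside.
    slack = whitespace - min_local_ws * site_area
    if slack <= 0:
        return 0, 0.0
    count = math.floor(slack / free_cell_area)
    return count, count * free_cell_area


if __name__ == "__main__":
    # A design with 74% whitespace and a 25% minimum local requirement.
    print(free_cell_budget(1000.0, 260.0, 0.25, free_cell_area=2.0))
```

After the free cells are added, a placer that spreads whitespace uniformly sees only the reserved fraction as whitespace, which is what yields the local guarantee.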
Our empirical evaluation uses the Capo placer [6], which uniformly distributes available whitespace [8] with routability in mind (Feng Shui 2.0 and mPl 2.0 do not distribute whitespace, but Dragon 2.23 does in the fixed-die mode). However, for designs with low placement densities this strategy results in excessive wirelength and potentially poor signal delay. Figure 1 shows placements of an industrial design with 72940 cells, 73155 nets and 74% whitespace. ACG was also tested on this circuit [4], and wirelength improved from 11.43e6 (for uniform whitespace distribution) to 10.38e6 (with ACG). Figure 2 shows the effect of free cells on the local whitespace distribution for the same design. To calculate the local whitespace distribution, we divide the layout region into a grid of bins (27x27 in this case) and calculate the local whitespace in each bin. Free cells are removed from the design before calculating the local whitespace distribution.
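The local whitespace distribution described above can be computed with a simple binning pass. The sketch below assumes cells are axis-aligned rectangles given as (x, y, width, height) with the core's lower-left corner at the origin, and that free cells have already been removed. The function names and the bucket width are illustrative choices.

```python
def local_whitespace_grid(cells, core_w, core_h, nx, ny):
    """Return an ny-by-nx grid of relative whitespace per bin."""
    bin_w = core_w / nx
    bin_h = core_h / ny
    used = [[0.0] * nx for _ in range(ny)]
    for x, y, w, h in cells:
        # Range of bins the cell may overlap, clamped to the grid.
        ix0 = max(int(x // bin_w), 0)
        ix1 = min(int((x + w) // bin_w), nx - 1)
        iy0 = max(int(y // bin_h), 0)
        iy1 = min(int((y + h) // bin_h), ny - 1)
        for iy in range(iy0, iy1 + 1):
            for ix in range(ix0, ix1 + 1):
                ox = min(x + w, (ix + 1) * bin_w) - max(x, ix * bin_w)
                oy = min(y + h, (iy + 1) * bin_h) - max(y, iy * bin_h)
                if ox > 0 and oy > 0:
                    used[iy][ix] += ox * oy
    bin_area = bin_w * bin_h
    return [[max(bin_area - u, 0.0) / bin_area for u in row] for row in used]


def whitespace_histogram(ws_grid, buckets=10):
    """Percent of bins falling into each whitespace bucket."""
    counts = [0] * buckets
    total = 0
    for row in ws_grid:
        for ws in row:
            counts[min(int(ws * buckets), buckets - 1)] += 1
            total += 1
    return [100.0 * c / total for c in counts]


if __name__ == "__main__":
    cells = [(0, 0, 2, 2), (2, 0, 1, 2), (1, 2, 2, 1)]
    grid = local_whitespace_grid(cells, core_w=4, core_h=4, nx=2, ny=2)
    print(grid)
    print(whitespace_histogram(grid))
```

For the experiment in the text, `nx` and `ny` would both be 27.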
We plot the % of bins vs. the % local whitespace in each bin. As seen from can be demonstrated using a sparse design with one dense cluster of logic connected to pins on the periphery so that the cluster must be placed in the center to minimize wirelength. However, since such a placement implies a high top-level cut, some top-down placers (especially those with fixed cutline) will avoid this optimal placement.
We show that better whitespace allocation reduces wirelength in the mixed-size placement flow from [2]. The main contribution of [2] is a methodology to place designs with numerous macros by combining floorplanning and standard-cell techniques. The proposed design flow is as follows:
• A black-box standard-cell placer generates an initial placement. In a pre-processing step, all macros are shredded into small pieces (fake cells) connected by fake wires, and pins from the macro are propagated to individual pieces.
Each macro is thus represented by a grid, and the resulting netlist consists of only small cells. If the fake nets have sufficiently high weights, the fake cells belonging to the same macro should place next to each other. Fixed orientations of macros can be accommodated.
• The initial locations of macros are produced by averaging the locations of respective fake cells. To remove overlaps between macros, a physical clustering algorithm constructs a fixed-outline floorplanning instance. Thus, small standard cells placed next to each other are clustered and form soft blocks.
• A fixed-outline floorplanner [1] generates valid locations of macros and soft blocks of movable cells.
• With macros considered fixed, the black-box standard-cell placer is called again to re-place small cells.
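The shredding and averaging steps of this flow can be illustrated with a small sketch. It is not the implementation from [2]: the square piece size, the naming scheme, and the choice to connect only grid neighbors with fake two-pin nets are assumptions made for illustration.

```python
import math


def shred_macro(name, width, height, piece):
    """Split a macro into a grid of fake cells joined by fake two-pin nets.

    Returns (pieces, nets): pieces is a list of fake-cell names, nets is a
    list of (cell_a, cell_b) pairs connecting horizontal and vertical
    neighbors. 'piece' is the (hypothetical) side length of one fake cell.
    """
    cols = max(1, math.ceil(width / piece))
    rows = max(1, math.ceil(height / piece))

    def cell_id(r, c):
        return f"{name}_{r}_{c}"

    pieces = [cell_id(r, c) for r in range(rows) for c in range(cols)]
    nets = []
    for r in range(rows):
        for c in range(cols):
            if c + 1 < cols:
                nets.append((cell_id(r, c), cell_id(r, c + 1)))
            if r + 1 < rows:
                nets.append((cell_id(r, c), cell_id(r + 1, c)))
    return pieces, nets


def macro_location(fake_cell_positions):
    """Initial macro location: the average of its fake cells' locations."""
    n = len(fake_cell_positions)
    xs = sum(x for x, _ in fake_cell_positions.values())
    ys = sum(y for _, y in fake_cell_positions.values())
    return xs / n, ys / n


if __name__ == "__main__":
    pieces, nets = shred_macro("ram0", 4, 4, 2)
    print(pieces, nets)
    print(macro_location({"ram0_0_0": (0.0, 0.0), "ram0_0_1": (2.0, 4.0)}))
```

In a real flow the fake nets would also carry high weights so that pieces of the same macro are placed next to each other, as the text notes.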
Step 4 of the mixed-size placement flow presented in [2] fixes the macro locations to the ones provided by the floorplanner and re-places standard cells around the macros. We improve whitespace allocation in this stage by introducing free cells.
We add free cells to reduce the whitespace available to the placer to 10% and re-place the design with the macros fixed. The results are summarized in Table 1. We compare our results to mPG [11].
Physical synthesis flows interleave placement optimizations with logic optimizations to achieve desired timing. This reduces the number of iterations required between the front-end design and back-end design for timing closure. Physical synthesis tools typically start from a global placement and perform logic optimizations like buffer insertion, driver sizing, logic replication etc. to improve timing of the design. These logic optimizations are based on the physical information generated by the initial global placement. Such tools rely on ECO placement techniques to legalize incremental changes in the netlist after global placement. This enforces a minimum local whitespace requirement after global placement to facilitate ECO placement after changes due to logic optimizations. Compacting a placement without physical synthesis in mind will severely limit the efficacy of the physical synthesis tools. We study the effect of free cells on physical synthesis in Table 2 . We conduct our experiments on proprietary industrial benchmarks with varying row-utilization. We report the worst slack and the total negative slack (TNS) in the design after the physical synthesis. In the default run, the global placer (Capo) uniformly spreads the cells around the core area. As an alternative flow, we add free cells during the global placement stage to reduce the whitespace available to the placer to 40%. Thus, the global placer compacts the placement but ensures minimum local whitespace of 40% around the core area. Free cells are removed after global placement. As seen from the results in Table 2 , the worst slack and total negative slack for all the designs improve considerably by adding free cells during the global placement stage of physical synthesis. All the designs are routable even after compacting the designs by using free cells.
We also conduct experiments to demonstrate the effect of free cells on the routability of a design. We use the ibm02 benchmark from [27]. The design initially has about 9% whitespace. The design is re-floorplanned to have 65% whitespace. The design is placed with the Capo placer and routed with WarpRoute from Cadence. Free cells are gradually added during placement, reducing the whitespace that the placer can allocate uniformly. Each of these designs is placed; the free cells are removed after placement and the design is routed with Cadence WarpRoute.

Table 2: The impact of free cells on physical synthesis for industrial designs with low utilization. We report the worst slack and total negative slack (TNS) after physical synthesis. During the placement stage of physical synthesis, we add free cells so that the whitespace available to the placer is reduced to 40%. Free cells are removed after global placement. All designs are routable after physical synthesis.
ing only half-perimeter wirelength may be misleading. Routability of Capo and Dragon placements on ibm-Dragon benchmarks is discussed in [3] , where, the differences are traced to greater horizontal wirelength and smaller vertical wirelength in Capo placements.
Low-Overhead Implementation of Free Cells in a Min-cut Placer
As explained in Section 3, a generic whitespace management framework can be obtained by using a placer that distributes whitespace uniformly through the core region and by representing the excess whitespace as small disconnected free cells. This implementation of free cells does not require any changes to the placer source code and only pre-processes the input netlist by introducing fake free cells, which makes it suitable for use with existing commercial placers. However, explicitly modeling free cells and letting the placer process the modified netlist increases the run-time and the memory footprint of the placer. The placer run-time degradation is evident from results presented in Tables 2 and 3. In this section, we explain implicit handling of large amounts of whitespace in a top-down, recursive-bisection-based min-cut placer [6] with minimal run-time and memory overhead. We achieve this without representing excess whitespace as free cells and hence without pre-processing the input netlist. Figure 3 shows the min-cut partitioning procedure for a bin with 50% whitespace in the absence and presence of free cells. Figures 3A and 3B show two ways to partition a bin with no free cells added. The parameters affecting the balance in the two partitions are (i) partition capacities ($C_j$), and (ii) partition tolerances ($C_j^{\max}$, $C_j^{\min}$). In the example, Figure 3A shows the standard cells being partitioned in the ratio 50%:50% with a net-cut of 3. However, if a higher tolerance is allowed, the partitioner may decide to partition the cells in the ratio 80%:20% with a smaller net-cut of 2. As shown in Figure 3B, Capo is allowed to shift cut-lines after partitioning to equalize the relative whitespace in the two partitions. Figures 3C and 3D show how the partitioning procedure works in the presence of free cells. Out of the 50% whitespace, 40% is represented as free cells and the remaining 10% whitespace can be distributed between the two bins.
To achieve a lower cut, a good min-cut partitioner will prefer to partition the instance as shown in Figure 3D rather than Figure 3C. Thus, the effect of higher tolerances is imitated using free cells and a constant tolerance.
As explained in Section 2.2, Capo uniformly spreads the available whitespace throughout the core region. Capo uses hierarchical whitespace tolerance calculation [8] only while splitting a bin horizontally. The tolerance during the vertical split is constant and the vertical cut-line is allowed to move after partitioning to balance the relative whitespace in the two child bins. This strategy works well for low-whitespace designs. However, for high-whitespace designs, it results in lower tolerances during vertical partitioning and uniformly distributes the whitespace in the core region, resulting in excessive wirelength. After studying the behavior of Capo8.7 on low-utilization designs, we added the option -nonUniformWS to Capo8.8. This causes Capo to use the same hierarchical tolerance computation for both horizontal and vertical splits when the bin whitespace is greater than the minimum local whitespace requirement. Since, during the top-down placement process, the aspect ratio of most of the bins is close to 1.0, we can approximate the number of recursively applied parallel vertical bin splits by $n = \log_2 R$, where R is the number of rows in the bin. With this assumption, the partition tolerances are calculated in the same manner for horizontal and vertical splits. This change allows Capo to transparently handle designs with a large amount of whitespace. To account for the local minimum whitespace requirement, we make sure that $C_j^{\max}$ for any partition does not violate the local minimum whitespace requirement for that child bin. Also, if the bin whitespace is greater than the minimum local whitespace requirement, we do not shift the cut-lines after partitioning.
This ensures that some regions of the layout are more tightly packed than other regions resulting in lower wirelength.
However, the minimum local whitespace requirements are still respected. We test the effect of this change on the "qor" test [...]

[...] (fake) pins, skew lines show fake two-pin nets, and a fake 5-pin net is shown by a spline. In all three cases moving the cell within the region does not affect the total length of fake nets. However, any placement beyond the region will incur a wirelength penalty that is independent of other movable objects.
Stability of Placement Results
Physical synthesis flows often require the stability of placement results from run to run for future optimizations which target timing and/or congestion. However, Figures 4 (A) and (B) show that congestion maps [19] produced for unrelated runs of a randomized min-cut placer may be very different. In order to improve congestion, one may distribute whitespace to congested areas or restructure the logic, but such fixes may be irrelevant to the result of the next placement run, or if another placer is used. To achieve relative stability, we propose the following approach. Given a placement, we modify the original netlist by adding fake pins and fake nets. After the modified netlist is placed, the locations of real cells are likely to be close to their original locations, and the amount of change allowed can be easily controlled during preprocessing. It is important to note that we are not adding hard constraints -in principle, any cell can be placed anywhere.
However, locations that are far from the original location carry a wirelength penalty in terms of fake wires: the farther the location, the greater the penalty. A key property of our construction is that all locations within a prescribed rectangle centered around the original location carry the same minimal wirelength penalty, and thus are equally attractive during wirelength optimization. Figure 5 demonstrates several ways to tie a cell or a macro to a region without inducing a hard constraint. Four outer fake pins are fixed in the corners of the given region. In Figure 5 (A), four fake pins are added in the corners of the cell to preserve cell orientation. In Figure 5 (B), one fake pin is added at the center of the cell so that changes in orientation do not affect wirelength. In Figure 5 (C), the same effect is achieved by using one fake 5-pin net rather than four fake two-pin nets. In Figures 5 (B) and 5 (C), only the center of the cell is constrained to be in the region. In Figure 5 (D), one fake 8-pin net is used with the fake pins in the corners to ensure that the entire cell is placed within the region. Note that a technique similar to that in Figure 5 (A) is used in [2] to restrict orientations of macros, but in that work the four outer fake pins are fixed at the corners of the core region. The three new constructions ignore cell orientations. The first one uses four two-pin nets, the second uses one five-pin net and the third uses one eight-pin net. The third new construction was suggested to us by Amir Farrahi from Sun Microsystems. It can be used to reduce the number of added nets and to ensure that the entire cell is placed within the constraining region. Otherwise, these constructions are equivalent if used with min-cut placers or placers based on simulated annealing that minimize HPWL.

Table 4: The impact of constraining region size during tethering on the stability of global placements produced by Capo on the ibm06 benchmark. The constraining region size is measured as a percent of the total layout region size. 5% of cells are tethered to the base placement for all the runs. We report the average and maximum Manhattan displacement per cell between tethered placements and the base placement.
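The equal-penalty property of the tethering constructions can be checked numerically. The sketch below covers the two center-pin variants (four fake two-pin nets, and one fake 5-pin net), assuming HPWL as the wirelength measure and unit net weights. The function names and the sample region are illustrative.

```python
def tether_penalty_two_pin(x, y, region):
    """Total HPWL of four fake two-pin nets joining a pin at the cell
    center (x, y) to pins fixed at the four corners of the region.

    region -- (x_left, y_bottom, x_right, y_top)
    """
    xl, yb, xr, yt = region
    corners = [(xl, yb), (xl, yt), (xr, yb), (xr, yt)]
    return sum(abs(x - cx) + abs(y - cy) for cx, cy in corners)


def tether_penalty_five_pin(x, y, region):
    """HPWL of one fake 5-pin net over the four corner pins and the
    center pin: the half-perimeter of their bounding box."""
    xl, yb, xr, yt = region
    return (max(xr, x) - min(xl, x)) + (max(yt, y) - min(yb, y))


if __name__ == "__main__":
    region = (10.0, 10.0, 20.0, 30.0)
    for point in [(15.0, 20.0), (10.0, 10.0), (19.0, 29.0), (25.0, 20.0)]:
        print(point,
              tether_penalty_two_pin(*point, region),
              tether_penalty_five_pin(*point, region))
```

Inside the region both penalties are constant (twice the region's half-perimeter for the two-pin variant, the half-perimeter itself for the 5-pin net). Outside the region they grow with the Manhattan distance to the region, four times faster for the two-pin variant at unit weights.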
In our experiments, we randomly select 2%-5% of the cells in a given placement and tie them to regions centered at the cells' locations. The size of the regions is selected as a small fraction (several percent) of the core region size. These sizes and the weights of fake wires allow one to control changes from the original placement. As shown in Figure 4 (C), additional runs of the min-cut placer Capo produce essentially the same congestion map. The placement in Figure 4 (D) is tied to the output of Dragon. Table 5 reports the effect of tethering cells to a base placement on the IBM-v1
benchmarks [26]. Base placements are generated using the randomized min-cut placer Capo. We then tether a small number of randomly selected cells of the netlist to the base placement. The IBM-v1 benchmarks have disconnected groups of cells, caused by the removal of macros (and incident nets) during the conversion from the original ISPD 98 partitioning suite to placement benchmarks [26]. 2 To stabilize such designs we randomly select at least one cell from each disconnected component in the netlist for tethering. Table 5 reports the average and maximum Manhattan difference between locations of cells in the new tethered placements and those in the base placement. The difference is reported as a percentage of the core region bounding box and can be compared to the tethering region whose half-perimeter is 1% of that bounding box. As seen from the results, tethering several percent of the cells to a base placement dramatically improves the stability of the randomized min-cut placer: the average cell displacement from the initial locations is very small.
However, the maximum displacement remains comparatively high. We trace this to cells in high-fanout nets which, if not tethered, have considerable freedom to be placed around the core region without affecting the half-perimeter wirelength of the design. In practice, when it is desirable to stabilize placement with respect to a particular design objective, e.g., circuit delay, one should tether cells that are relevant to that objective, e.g., those on critical paths (see Section 5.2). Table 4 shows that the constraining-region size does not have a significant effect on the stability of global placement as measured by average and maximum displacement, a surprising result. Finally, in all of our experiments, except for those with very small constraining regions, the wirelength of tethered placements is similar to the original wirelength.
Table 5: On the IBM-v1 benchmarks, we evaluate the impact of tethering a random 2% / 5% / 10% / 50% of cells to a base placement. We report the average and maximum Manhattan cell-to-cell displacement between tethered placements and the base placement. The displacement is reported as % of the core bounding box.
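The displacement figures discussed above can be computed along the following lines. This is a sketch only: normalizing by the half-perimeter of the core bounding box is an assumption (the text says only "percentage of the core region bounding box"), and the function name and the dictionary layout are hypothetical.

```python
def displacement_stats(base, tethered, core_width, core_height):
    """Average and maximum Manhattan displacement, as % of the core half-perimeter.

    base and tethered map cell name -> (x, y). Normalizing by the half-perimeter
    of the core bounding box is an assumption made for this sketch.
    """
    half_perimeter = core_width + core_height
    dists = [
        abs(tethered[cell][0] - x) + abs(tethered[cell][1] - y)
        for cell, (x, y) in base.items()
    ]
    avg = 100.0 * sum(dists) / len(dists) / half_perimeter
    worst = 100.0 * max(dists) / half_perimeter
    return avg, worst


base = {"u1": (10.0, 10.0), "u2": (50.0, 20.0), "u3": (80.0, 90.0)}
new = {"u1": (12.0, 10.0), "u2": (50.0, 24.0), "u3": (20.0, 90.0)}
avg_pct, max_pct = displacement_stats(base, new, core_width=100.0, core_height=100.0)
```

In this toy instance a single far-moving cell (`u3`) dominates the maximum while the other cells barely move, which mirrors the gap between average and maximum displacement noted in the text.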
Figure 6: Capo placements for the AES (Rijndael) core. (A) Row utilization 70%, aspect ratio 1. (B) Row utilization 50%, aspect ratio 1.
Application: Resizing Existing Placed Designs
The number of buffers will, in general, increase as we move to smaller process nodes [23]. This is primarily because (i) wires are not scaling as well as devices, and (ii) transistor counts on chips are increasing, resulting in more buffers per logic gate. As we scale to newer process nodes and existing IP blocks get embedded in larger designs, the IP blocks may have to be more porous, i.e., have a larger minimum local whitespace requirement. We extend our techniques for achieving placement stability to the context of rescaling a layout while maintaining the same relative timing characteristics. In contrast to current ECO techniques, our proposed rescaling method allows for large-scale optimizations during the placement process and is not limited to small changes in netlist and layout. We use the techniques for achieving stability presented in Section 4 as the basis of our rescaling technique. We tether a randomly chosen x% of the standard cells to the locations specified by the original layout. This helps a decision-based placer, such as a top-down min-cut placer, make the right decisions when splitting the netlist at each partitioning level. Thus, the new placement is biased to be similar to the original placement. The objective in Section 4 was to achieve run-to-run stability for inherently unstable placement algorithms. We use similar concepts to trade off predictability vs. optimization potential when rescaling a design. Our proposed flow for achieving predictability while rescaling a design is shown in Figure 7.
Experimental Flow
For our experiments, we use the well-known academic placer Capo [6] and the industrial placement tool QPlace from Cadence Silicon Ensemble (SEDSM). In the first step, an initial placement of the design is obtained. In an actual scenario, this would be obtained from the layout-optimized design that we are trying to rescale. In our experiments, we obtain this placement by running the placement tool on the design. We then re-floorplan the design to have a lower row utilization (higher whitespace) compared to the original floorplan. A placer which uniformly distributes whitespace will ensure that the newly floorplanned design has a higher local minimum whitespace than the original design. However, from our experiments we observe that the resulting new placements are in general not similar to the original placement in terms of timing characteristics. In most cases, the most critical paths of the new placement are totally different from the most critical paths in the original placement of the higher row-utilization design. This points to the instability of the placement algorithms we used. To achieve similarity in placement we use the following approach. We first rescale the locations of all the standard cells and terminals from the original placement to the new floorplan. This scaling is straightforward. Let the floorplan of the block scale from height h and width w to height h' and width w'. Then the location (x, y) of a standard cell in the original placement scales to the location (x', y') in the new floorplan as follows: x' = x · (w'/w), y' = y · (h'/h).
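The rescaling step can be sketched in a few lines. This sketch assumes a purely proportional mapping with both floorplans sharing an origin at the lower-left corner; the function name, the `(width, height)` tuples and the dictionary layout are hypothetical.

```python
def rescale_placement(locations, old_size, new_size):
    """Map each (x, y) from a w x h floorplan to a w' x h' floorplan.

    Assumes both floorplans share the origin (0, 0) at the lower-left corner,
    so that scaling is a per-axis multiplication.
    """
    w, h = old_size
    w_new, h_new = new_size
    return {
        name: (x * w_new / w, y * h_new / h)
        for name, (x, y) in locations.items()
    }


orig = {"u1": (0.0, 0.0), "u2": (50.0, 25.0), "pad_a": (100.0, 100.0)}
scaled = rescale_placement(orig, old_size=(100.0, 100.0), new_size=(120.0, 110.0))
```

The scaled locations are generally not legal (cells may overlap or sit off the site grid), which is why the flow follows this step with either ECO legalization or a tethered re-placement.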
After the rescaled placement has been obtained, the straightforward thing to do is to apply ECO placement techniques to legalize the new rescaled placement. This can be done efficiently using current ECO placement techniques such as QPlace run in the ECO mode. However, in a scenario where the netlist also changes considerably during the rescaling process, using local ECO changes will produce sub-optimal results. In our proposed flow (Figure 7), we tether the new rescaled placement using fake pins and fake nets as explained in Section 4. The new tethered netlist is then re-placed.
The tethering fake nets and pins ensure that the netlist is placed similarly to the original placement. We thus ensure predictability during rescaling. After the placement, the fake pins and fake nets are removed from the netlist, which yields the final rescaled placement. The allowed freedom during the second placement run is governed by the number of standard cells tethered and the size of the tethering bounding box around each tethered standard cell. Our proposed technique allows one to conveniently trade off further optimization vs. predictability during the rescaling process.
Figure 7 (steps 4-6 of the proposed flow): 4. Tether x% of cells to the resulting placement. 5. Re-place the tethered netlist. 6. Remove the fake tethering pins and nets.
Table 6: Benchmarks used in our rescaling experiments. The Advanced Encryption Standard/Rijndael (AES) core is an encryption core downloaded from http://www.opencores.org. MULT is a 53X53 bit multiplier that we synthesized using the Synopsys design foundation library.
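The tether, re-place and clean-up sequence amounts to pre- and post-processing of the netlist. A minimal sketch follows, assuming a netlist held as plain dictionaries and one fake 5-pin net per tethered cell; the `__tether__` prefix, the function names and the data layout are all hypothetical and do not correspond to the Capo or QPlace input formats.

```python
FAKE_PREFIX = "__tether__"   # hypothetical naming convention for fake objects


def add_tethers(nets, fixed_pins, locations, cells, region_w, region_h):
    """Add one fake 5-pin net per tethered cell: four fixed corner pins plus the cell."""
    for cell in cells:
        x, y = locations[cell]
        pins = [cell]
        for i, (dx, dy) in enumerate([(-1, -1), (-1, 1), (1, -1), (1, 1)]):
            pin = f"{FAKE_PREFIX}{cell}_p{i}"
            # Fixed pin at one corner of the region centered on the cell.
            fixed_pins[pin] = (x + dx * region_w / 2, y + dy * region_h / 2)
            pins.append(pin)
        nets[f"{FAKE_PREFIX}{cell}"] = pins


def remove_tethers(nets, fixed_pins):
    """Strip every fake net and fake pin after the tethered netlist has been re-placed."""
    for name in [n for n in nets if n.startswith(FAKE_PREFIX)]:
        del nets[name]
    for name in [p for p in fixed_pins if p.startswith(FAKE_PREFIX)]:
        del fixed_pins[name]


nets = {"n1": ["u1", "u2"]}
fixed_pins = {"pad_a": (0.0, 0.0)}
locations = {"u1": (30.0, 40.0), "u2": (70.0, 60.0)}   # rescaled locations

add_tethers(nets, fixed_pins, locations, ["u1"], region_w=10.0, region_h=10.0)
tethered_net_count = len(nets)
tethered_pin_count = len(fixed_pins)
u1_corner = fixed_pins["__tether__u1_p0"]
# ... the placer would be run on the tethered netlist here ...
remove_tethers(nets, fixed_pins)
```

Because the fake objects are identified only by a name prefix, the same two functions can wrap any placer that reads a netlist with fixed terminals, which is the property the flow relies on.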
Preserving Critical Paths
In the techniques presented in Section 4, the designer specifies the % of cells to be tethered to their initial locations.
Specific cells are chosen at random. This approach works well when one is mainly interested in the average similarity of the two placements. However, since in our case we are trying to preserve the timing behavior of the design during the rescaling process, we select cells differently. From the initial layout-optimized design, we extract the cells on the top few (1000 in our experiments) worst paths in the design. We tether these critical cells to locations rescaled from their original locations. The remaining cells to be tethered are chosen randomly. We thus try to ensure that the cells on critical paths are placed close to their rescaled locations, with the aim of better preserving the timing behavior of the design.
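The timing-driven selection described above can be sketched as follows. The function name, the `fraction` and `seed` parameters, and the policy of keeping every critical cell even when they exceed the target count are assumptions made for illustration; the list of critical paths would come from a timing report.

```python
import random


def select_tethered_cells(all_cells, critical_paths, fraction, seed=0):
    """Pick cells on critical paths first, then fill up to `fraction` at random."""
    target = int(round(fraction * len(all_cells)))
    critical = []
    seen = set()
    for path in critical_paths:            # paths are assumed ordered worst-first
        for cell in path:
            if cell not in seen:           # a cell may lie on several paths
                seen.add(cell)
                critical.append(cell)
    rng = random.Random(seed)
    others = [c for c in all_cells if c not in seen]
    n_extra = max(0, min(len(others), target - len(critical)))
    return critical + rng.sample(others, n_extra)


cells = [f"u{i}" for i in range(100)]
paths = [["u3", "u7", "u9"], ["u7", "u11"]]
chosen = select_tethered_cells(cells, paths, fraction=0.10)
```

Here 10% of 100 cells gives a target of ten: the four distinct critical cells are always included and the remaining six are drawn at random from the rest of the netlist.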
As a variant of timing-driven tethering of cells, we also propose to change the size of the tethering region based on the dimensions of the critical nets in the original placement. By default, the tethering region is chosen to be a certain percentage (in our case 0.5% to 5%) of the layout dimensions. The size of the tethering region gives us a knob to trade off optimization vs. predictability. However, by tethering the critical cells to the bounding box of the critical net they belong to, one would maximize the optimization potential without sacrificing predictability in terms of the timing behavior of the design. Table 6 lists the characteristics of the benchmarks used in our experiments. We downloaded the Verilog code for the Advanced Encryption Standard/Rijndael (AES) core design from [18]. The MULT design is a 53X53 bit multiplier that we instantiated from the Synopsys design foundation library. We synthesized these designs using Synopsys Design Compiler and floorplanned them using Cadence Silicon Ensemble (SEDSM ver. 5.4). For our experiments we use the academic standard-cell placement tool Capo [6] and the leading industrial placement tool QPlace from Cadence.
Experimental Results
The results for rescaling are shown in Tables 7, 8 and 9. For all rescaling experiments, we first floorplan the die to have a row utilization of 70%. We run placement tools on the initial design to get a base placement and an initial timing report, which is generated using the Synopsys PrimeTime tool. All subsequent comparisons are made to this base placement.
The design is then re-floorplanned to have a row utilization of 50%. The netlist remains the same. We then re-place the design at 50% utilization. Next, we employ our proposed rescaling flow to maintain the timing characteristics of the design.
For our flow, we choose 10% of the cells to be tethered for the second placement run. We apply two variants of our flow.
In the first version we select the cells to be tethered randomly. The second version uses timing-driven tethering, selecting cells on the top critical paths along with a few random cells as the cells to be tethered. Table 7 presents the results for Capo when the size of the tethering region around each tethered node was chosen to be 0.5% of the layout region size. As can be seen, without tethering, Capo produces vastly different results in terms of timing behavior, with 0 out of the worst 1000 paths being similar for AES and MULT. We stabilize the placement considerably using our proposed flow and produce placements which have very similar timing characteristics compared to the original placement. However, the wirelength degrades by around 10% due to tethering. One of the problems was that the tethering region was too small, thus limiting the optimization potential of the placer. So we increased the area of the tethering region to 5% of the layout region. Results for this configuration are shown in Table 8. With this change we are able to reduce the impact on HPWL to a minimum while still maintaining the similarity in the timing behavior of the designs. We repeat the same rescaling experiment with the industrial placer QPlace. Results presented in Table 9 show that QPlace produces vastly different results in terms of timing behavior when the floorplan is changed a little. This result suggests that QPlace uses inherently unstable algorithms. Since our techniques mainly rely on pre-processing and post-processing of the input netlist, we can perform the same experiments on QPlace and try to improve its predictability. As seen from Table 9, we were able to produce placements with very similar timing characteristics even when we changed the row utilization of the design. However, the loss in HPWL due to tethering seems to be more significant for QPlace.
We suspect this is because of the fake fixed pins introduced all over the layout during tethering. Capo does not reserve any space for these fake pins on the layout, whereas QPlace seems to reserve a site for each of these pins during placement, thus hurting the optimization of HPWL. Currently, we are trying to alleviate this problem by tweaking our QPlace flow.
Conclusions
Large-scale placement is becoming more sophisticated in the presence of large IP blocks, embedded memories and macros. Aggressive timing constraints, large whitespace and physical synthesis flows pose new challenges to layout tools. In particular, local and global incremental changes must be sustained without chaotic effects on congestion and circuit delay. We observe that "local" whitespace makes layouts amenable to local modifications and re-synthesis, while stability of placement results facilitates larger incremental changes.
We contribute simple and tunable techniques for ensuring minimum "local" whitespace throughout the core region without distributing all whitespace uniformly, and empirically demonstrate that such local whitespace is achieved with approximately 5% precision. Our study is complementary to that in [4] where whitespace is managed using a combination of min-cut and analytical placement techniques. Similarly, our methods can be used with congestion-driven whitespace allocation from [21, 27] . Our empirical results show that lax controls over whitespace may lead to better half-perimeter wirelength, but at the same time may increase routed wirelength or even lead to unroutable designs. This may be the clearest example yet of the divergence between half-perimeter wirelength and routed wirelength as optimization objectives. Our experiments with physical synthesis point out that using a combination of free cells and uniform whitespace distribution during global placement can significantly improve circuit delay of low-utilization designs.
Our study of stability shows that while min-cut placers may produce solutions with very different congestion maps, it is possible to stabilize their results by simple pre-processing. In fact, it takes a surprisingly small modification of the netlist to tie future placement solutions to a given set of locations. While some algorithms, e.g., analytical placement, tend to produce consistent results on multiple runs, our techniques can be used to tie the results produced by different placement algorithms and implementations to each other. In particular, placement predictions made by a fast estimator can be enforced at a global scale when a slower placer is used to optimize wirelength and various design objectives. We also address the issue of reshaping an existing layout-optimized design with the aim of preserving the timing characteristics of the design while still allowing room for further optimizations.
We apply our techniques for achieving stability in placers to devise a flow to rescale an existing layout-optimized design with the aim of preserving the timing characteristics of the design. We study the optimization vs. predictability trade-off in this context. The proposed rescaling flow is particularly useful when one also changes the netlist of the block during its re-implementation. Further, the rescaling flow is not limited to small changes in layout and netlist.
Straightforward implementations of the proposed techniques, such as free cells and fake nets, may increase the memory footprint of the placer and its runtime. Instead, those techniques can be implemented implicitly so as to guarantee the original memory footprint and only an insignificant slow-down. However, this is incompatible with the simple preprocessing approach that enabled our experiments with several placers.
