Recursive bisection based placement is well known, and recent advances in partitioning have made the approach more attractive. While partitioners can optimize a placement from a local perspective, high performance design requires consideration of global issues as well. We focus on aspects of the placement problem which cannot be captured with bisection, addressing them through a new approach derived from recent work on k-way partitioning. We consider large values of k, and objective functions which are more complex than the traditional min-cut.
INTRODUCTION
Circuit placement has been extensively studied; objectives such as area minimization, wire length minimization, and timing optimization are common. In this paper, we focus on standard cell placement problems: our objective is to place rectilinear circuit elements (cells) into one or more horizontal rows, minimizing total wire length. In particular, we are interested in capturing global aspects of the problem, improving a traditional approach.
Placement is a difficult problem. For all but trivial problem sizes, we have only heuristic methods; optimization is usually performed on only a small subset of the circuit at any given time. If we perturb a subset of circuit elements, we can optimize a placement from a local perspective, but have no guarantee that our modifications are appropriate from a global perspective. This distinction is a primary focus of this paper. Our contribution is a method which allows at least some global issues to be captured within a bisection based framework, resulting in improved placement quality.
We adapt a recently presented partitioning approach for k-way partitioning. Rather than direct k-way partitioning, we are interested in k simultaneous bisections, for large values of k. We integrate this into a traditional bisection based placement framework, resulting in a practical standard cell placement tool we call Feng Shui.
FORMULATION AND PREVIOUS WORK
We follow traditional problem formulations, and use hypergraphs to model circuit netlists. Cells are denoted as c_i, and are roughly equivalent to vertices in a hypergraph. Nets are denoted as n_j, and are roughly equivalent to hyperedges. When placing circuit elements, we will determine regions in which the elements lie; these are non-overlapping rectilinear areas, and are denoted by capital letters. Our approach employs partitioning of regions, converting each region into two or more subregions.
As the placement problem is well studied, we have a number of established approaches for it. Force Directed or LP based approaches repeatedly solve systems of equations, determining cell locations iteratively (for example, [5] ). This approach is popular in commercial placement tools. Simulated Annealing based approaches obtain cell placements by swapping positions of cells randomly, guided by a probabilistic acceptance function. A number of current commercial placement engines utilize this approach; efficient cost estimates allow the consideration of large numbers of intermediate states. A well known example of this approach is TimberWolf [13] . Partitioning based approaches determine cell locations by recursively dividing an initial area (region) with successive bisections or quadrisections. This approach has become more attractive recently; advances in partitioning research have provided a number of fast algorithms which produce extremely good results.
Our standard cell placement approach is integrated into a traditional partitioning-based framework. In an early algorithm, Breuer [1] utilized repeated graph bisections to obtain a circuit placement; this approach is shown in Figure 1 . The bisections divide the circuit netlist into a hierarchy of cells, with the resulting hierarchy roughly mapping into a rectilinear grid. Dunlop and Kernighan [4] extended this approach, through the use of an improved partitioning method [9] , and also terminal propagation. When partitioning a region, we can expect a number of connections to be required to cells or pads outside of the region. Terminal propagation provides a simple method to insert fixed "dummy" vertices, so that the partitioning considers these external connections. Note that with terminal propagation, the partitionings of regions become interdependent; if we begin with two regions, L and R, and partition L first, this impacts the optimal solution for R. Partitioning R first might result in a different solution, and neither of these might be globally optimal, even if the individual partitionings were.
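The idea behind terminal propagation can be sketched in a few lines. This is a simplified illustration and not the procedure of [4]: it assumes a vertical cut, assumes every external cell or pad has a known x coordinate, and pins one fixed dummy terminal per net to whichever half of the region is closer to that external connection. The function name and data layout are hypothetical.

```python
def propagate_terminals(region_nets, external_x, cut_x):
    """Decide which side(s) of a vertical cut receive a fixed dummy terminal.

    region_nets: dict net name -> list of external cell/pad names on that net
    external_x:  dict external cell/pad name -> x coordinate (assumed known)
    cut_x:       x coordinate of the vertical cut line
    Returns dict net name -> set of sides ('left' and/or 'right').
    Nets with no external connections get no dummy terminal.
    """
    dummies = {}
    for net, externals in region_nets.items():
        sides = set()
        for name in externals:
            sides.add("left" if external_x[name] < cut_x else "right")
        if sides:
            dummies[net] = sides
    return dummies
```

Because the dummy terminals depend on where neighboring cells currently sit, the result for one region changes once a neighboring region has been partitioned, which is the order dependence described above.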
To address the order dependence of the partitioning, both [12] and [7] employ repeated partitioning at each level. We might wish to partition L, followed by R, and then partition L a second time. Repeated partitionings do not, however, change a local optimization process into a global one.
For circuit placement applications, we may have tens or hundreds of thousands of regions, each of which requires partitioning, and each of which has impact on neighboring regions. This problem motivates our interest in k-way partitioning, or more precisely, k simultaneous bisections, for extremely large values of k. Recently, Zhong and Dutt [14] presented a partitioning driven placement approach which also uses large scale multi-way partitioning.
PLACEMENT APPROACH
Our placement approach integrates a variant of a recent k-way partitioning approach into a traditional framework. In this section, we first briefly describe the framework, followed by a discussion of the new technique.
Bisection
The framework for our placement tool might be considered a textbook implementation of the approach of Dunlop and Kernighan [4] . We repeatedly divide the circuit netlist by either horizontal or vertical cut lines. We utilize the recent multi-level clustering based partitioning algorithm hMetis [8] , version 1.5.3.
At each partitioning, we attempt to obtain a nearly exact bisection if cutting vertically. If the cut line is horizontal (splitting a number of rows), we split the rows as evenly as possible. If the region being bisected contains an odd number of standard cell rows, we insert fixed and weighted dummy vertices, allowing a nearly exact bisection to be mapped into the space available easily. The partitioning objective is min-cut; the hMetis partitioner attempts to minimize the number of cut hyperedges.
When we employ bisection, we may choose to cut a region either horizontally or vertically. In our placement engine, we control the aspect ratio of any region, using a parameter AR to determine the direction of cut lines. If the region being bisected contains more than a single row, we may elect to bisect horizontally if the ratio of the height and width exceeds a user defined threshold.
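A minimal sketch of this cut direction rule follows. The function name and argument layout are hypothetical; only the parameter AR and the multi-row condition come from the text, and the sketch always cuts horizontally once the threshold is exceeded, although the text only says we may elect to do so.

```python
def choose_cut_direction(width, height, num_rows, ar_threshold):
    """Return 'horizontal' or 'vertical' for the next cut line.

    A horizontal cut is only possible when the region spans more than
    one row; it is chosen when height / width exceeds the threshold AR.
    """
    if num_rows > 1 and height / width > ar_threshold:
        return "horizontal"
    return "vertical"
```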
The recursive bisection process divides placement regions into progressively smaller areas, ultimately assigning each cell to a single row, but possibly having several cells remaining within a region. To establish positions for each cell, we order them by region location within each row, packing them together without spaces or overlap.
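The packing step can be illustrated as follows. This is a sketch under stated assumptions: each cell in a row carries the x location of the region it was assigned to, and cells from the same region keep their input order. The names are hypothetical.

```python
def pack_row(cells, row_x0=0.0):
    """Pack the cells of one row left to right without gaps or overlap.

    cells: list of (name, region_x, width) tuples
    Returns dict name -> x coordinate of the cell's left edge.
    Cells are ordered by region location; the sort is stable, so cells
    sharing a region keep their (arbitrary) input order.
    """
    positions = {}
    x = row_x0
    for name, _region_x, width in sorted(cells, key=lambda c: c[1]):
        positions[name] = x
        x += width
    return positions
```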
The positions of cells which were within the same region will be arbitrary at this point; they were not ordered by the partitioning process. To optimize these positions, we apply branch-and-bound reordering, modifying the positions of a small set of consecutive cells in a single row.
Feng Shui allows the specification of a "window size," controlling the number of cells involved in any branch-and-bound optimization. This window passes over each cell row (in order), traveling along each row at steps of half the window size. At each step, the optimal order for cells found under the window is determined.
The number of passes over the placement, and the size of the window, are both parameters which can be controlled by the user. In practice, we find that window sizes of 6 to 8 cells, and 4 passes of improvement, are sufficient for good overall performance. Increasing the window size may impact run times substantially, as the complexity of the branch-and-bound procedure is O(w!) in the worst case, where w is the size of the optimization window. In [2], a number of ways to implement branch-and-bound reorderings efficiently were explored.
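The sliding window can be sketched as below. Note that this sketch substitutes exhaustive enumeration of all w! orderings for branch-and-bound, so it shows the worst-case behavior without any pruning. It also assumes a single row, horizontal net span as the cost, and hypothetical function names; it is not the Feng Shui implementation.

```python
from itertools import permutations

def row_positions(row, widths, x0=0.0):
    """Pack cells left to right; return dict name -> center x."""
    pos, x = {}, x0
    for name in row:
        pos[name] = x + widths[name] / 2.0
        x += widths[name]
    return pos

def wire_length(row, widths, nets, fixed=None):
    """Sum of horizontal net spans; `fixed` gives x for pins outside the row."""
    pos = dict(fixed or {})
    pos.update(row_positions(row, widths))
    total = 0.0
    for net in nets:
        xs = [pos[p] for p in net]
        total += max(xs) - min(xs)
    return total

def reorder_row(row, widths, nets, window=4, passes=1, fixed=None):
    """Slide a window along the row in steps of half the window size,
    keeping the best ordering found for the cells under the window."""
    row = list(row)
    step = max(1, window // 2)
    for _ in range(passes):
        for start in range(0, len(row), step):
            under = row[start:start + window]
            best, best_cost = under, wire_length(row, widths, nets, fixed)
            for perm in permutations(under):  # w! candidates, no pruning
                trial = row[:start] + list(perm) + row[start + window:]
                cost = wire_length(trial, widths, nets, fixed)
                if cost < best_cost:
                    best, best_cost = list(perm), cost
            row[start:start + window] = best
    return row
```

An ordering is only accepted when it strictly reduces the cost, so a pass never increases wire length.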
k-Way Partitioning
The focus of our work has been on the global aspects of the placement problem. With partitioning, we can optimize the number of edges cut within a region effectively, but have no way of knowing if this local optimization is appropriate from a global perspective. Similarly, our branch-and-bound reordering is also a local optimization.
A careful examination of placement by recursive bisection reveals a number of instances where global objectives may be lost. The example in Figure 2 shows a simple case where local optimization is insufficient; we have four regions to bisect, each with two cells. If we approach this problem as a series of independent bisections, a number of configurations which are both stable and suboptimal can be encountered. The suboptimality of the global solution has nothing to do with the quality of the bisections of each region; simply improving the bisection algorithm will not improve the global configuration.
As we progress through the placement process, the number of regions increases, doubling repeatedly. If we have k regions, and wish to split each of them (obtaining 2k new regions), the traditional approach is iterative, bisecting a single region at a time. Our new approach is to attempt bisection of all regions at the same time, obtaining a solution that is of good quality globally. The method used to perform this massive bisection is based on partitioning by iterative deletion [10] .
New Formulation
To capture global objectives effectively, we formulate the placement problem as a variant of multi-way partitioning, rather than as a series of bipartitions. We partition all regions simultaneously, with the intermediate state of each region influencing the others. We are concerned with partitioning very large numbers of regions, and our cost objective is wire length rather than min-cut. The problem we consider is as follows: given a set of regions (with physical constraints) and a set of elements mapped to these regions, bisect all regions to minimize the resulting bounding-box wire length. Solution of this problem optimizes the circuit from a global perspective.
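For reference, the bounding-box wire length objective can be computed as in the following sketch. It assumes each net is a list of pin names and each pin has a single (x, y) position, for example the center of the region its cell currently occupies. The names are hypothetical.

```python
def bounding_box_wire_length(nets, positions):
    """Total half-perimeter of the bounding box of every net.

    nets:      dict net name -> list of pin (cell or pad) names
    positions: dict pin name -> (x, y)
    """
    total = 0.0
    for pins in nets.values():
        if len(pins) < 2:
            continue  # a net with fewer than two pins has no length
        xs = [positions[p][0] for p in pins]
        ys = [positions[p][1] for p in pins]
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total
```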
Multi-way partitioning has proven quite challenging [11] ; for traditional objectives such as min-cut, the greatest success has been obtained with recursive partitioning, largely ignoring the nature of the problem considered here.
Iterative Deletion
In [10] , hypergraph partitioning was considered. A new method based on iterative deletion was presented; in this approach, vertices are duplicated, with one instance of each vertex being assigned to a partition. Redundant elements are removed one at a time until no duplicates remain.
While the approach was relatively simple, it proved effective in some areas where traditional methods had difficulty. For bipartitioning, cut sizes from a single linear-time pass were comparable to many passes of a traditional FM [6] algorithm. Multi-way cut sizes were superior to a direct flat multi-way partitioning algorithm [11]. For problems with a variety of hyperedge weights, a combination of iterative deletion and FM partitioning proved substantially more effective than FM partitioning alone. The approach is computationally attractive: with integer hyperedge weights, it may be implemented in O(n) time.
Our variation of the iterative deletion approach for placement operates in the following manner. Each cell in a region is assigned to both subregions; if there is more than a single instance of a cell, it is considered to be redundant. We repeatedly remove redundant cells from subregions which have high utilization, and select the highest cost cell for removal.
In our current implementation of the placement engine, we evaluate cell cost based on the center of mass for the component nets. For each net n_j, the center of mass is the average X and Y location of the cells which it connects. The cost of any cell c_i is the sum of the distances between the cell and the center of mass of each net to which the cell is connected.
In this way, a cell which is far from the center of mass of each net to which the cell is connected has high cost. Each region has a number of cells assigned to it, and an available capacity; we remove redundant cells from the region which has the highest ratio of cell area to capacity. The cell removed from any region is the redundant cell which has the highest cost.
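The cost computation can be sketched as follows. The text does not state which distance metric is used; Manhattan distance is assumed here, and the function names and data layout are hypothetical.

```python
def net_centers(nets, positions):
    """Center of mass of each net: average x and y of the cells it connects.

    nets:      dict net name -> non-empty list of cell names
    positions: dict cell name -> (x, y)
    """
    centers = {}
    for net, pins in nets.items():
        n = len(pins)
        centers[net] = (
            sum(positions[p][0] for p in pins) / n,
            sum(positions[p][1] for p in pins) / n,
        )
    return centers

def cell_cost(cell, nets, positions, centers):
    """Sum of distances from the cell to the center of each net it is on.

    Manhattan distance is an assumption of this sketch.
    """
    x, y = positions[cell]
    return sum(
        abs(x - centers[net][0]) + abs(y - centers[net][1])
        for net, pins in nets.items()
        if cell in pins
    )
```

When a cell has two instances, each instance would be evaluated at its own location, and the instance farther from the net centers receives the higher cost.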
We utilize heaps to maintain the ordering of cells within any given region, and we also use a heap to maintain an ordering of regions. In this way, maintenance and cell selection are both at worst O(log n) for each cell removed. As the number of redundant elements to be removed in any pass is n, each pass is O(n log n). Region sizes decrease by a factor of 2 with each pass, resulting in a logarithmic number of passes required. Thus, the iterative deletion portion of our algorithm is at worst O(n log^2 n).
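One pass of heap-based iterative deletion might look like the sketch below. It makes a strong simplifying assumption: the cost of each cell instance is supplied up front and stays fixed, whereas the method described here recomputes costs as net centers of mass shift. All names are hypothetical.

```python
import heapq

def iterative_deletion(cells, capacity, cost):
    """Assign each duplicated cell to exactly one of its two subregions.

    cells:    dict cell -> (area, (sub_a, sub_b))
    capacity: dict subregion -> capacity (positive)
    cost:     dict (cell, subregion) -> cost of that instance (static here)
    Returns dict cell -> subregion that keeps the cell.
    """
    load = {s: 0.0 for s in capacity}
    copies = {c: set(subs) for c, (_, subs) in cells.items()}
    per_region = {s: [] for s in capacity}  # max-heaps on cost, via negation
    for c, (area, subs) in cells.items():
        for s in subs:
            load[s] += area
            heapq.heappush(per_region[s], (-cost[(c, s)], c))

    # Max-heap of subregions keyed on utilization (cell area / capacity).
    regions = [(-load[s] / capacity[s], s) for s in capacity]
    heapq.heapify(regions)

    while regions:
        _, s = heapq.heappop(regions)
        heap = per_region[s]
        # Skip instances that are no longer redundant (the other copy is gone).
        while heap and len(copies[heap[0][1]]) < 2:
            heapq.heappop(heap)
        if not heap:
            continue  # nothing left to remove here; the subregion is final
        _, c = heapq.heappop(heap)  # highest-cost redundant instance
        copies[c].discard(s)
        load[s] -= cells[c][0]
        heapq.heappush(regions, (-load[s] / capacity[s], s))

    return {c: next(iter(subs)) for c, subs in copies.items()}
```

A subregion's load only changes when that subregion itself is popped, so each subregion has at most one entry in the region heap and no stale entries need to be filtered. Each removal costs a constant number of heap operations, matching the O(log n) bound per removed cell.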
To illustrate the iterative deletion process, we present Figure 3. In this figure, we duplicate cell c_1 in region R_1 and cell c_2 in region R_2, and assume that one net connects c_1 to c_2 and that a second net connects c_2 to a pad.
The order of cell deletions in this figure is as follows. Note that other cell deletions may be interspersed with the following; we focus only on these cells to clarify the process.
First, the center of mass for the nets connected to cell c_2 is closest to the pad; we remove the instance of c_2 which is furthest from this location (as this is the instance which has highest cost).
Next, the center of mass for nets connected to cell c_2 is recalculated, and this is propagated to the other cells.
EXPERIMENTAL RESULTS
To evaluate the impact of our global optimization strategy, we have implemented a standard cell placement engine which utilizes either a traditional Dunlop and Kernighan style recursive bisection approach, or the Dunlop and Kernighan approach combined with iterative deletion. We refer to this tool as Feng Shui, and have made it publicly available. All experiments were performed on a 500 MHz Pentium III PC running Linux.
Our expectations were not that we would obtain dramatic reductions in wire length. Global optimization is extremely difficult, and if the work on partitioning was any indication, little progress could be made. Nevertheless, we feel that a global perspective is crucial to successful high performance design, and wished to determine if any leverage could be obtained by considering the global problem.
[Figure 3 panels: (a) Initial configuration, with duplicates for both cells. (b) One instance of a cell has been removed, resulting in a change in the center of mass for some nets, and also new cell costs. (c) A second redundant instance is removed.]
Placements were performed on the MCNC standard cell benchmarks, and also the golem3 benchmark. These are the largest publicly available benchmark circuits. The number of cells, nets, and rows used are shown in Table 1. A commercial version of TimberWolf, 1.2, was used to determine reference wire lengths; we note that the TimberWolf results are optimized for routability, and not with the sole objective of wire length minimization. As such, these results are not directly comparable to those of Feng Shui. All wire lengths are scaled by 10^6.
For comparison with current tools, we also ran Capo [3] on the benchmarks. Capo has been made publicly available, along with evaluation tools and detailed benchmark information, making independent verification of results possible. Capo has been compared favorably to current commercial placement tools, producing low wire length placements that were routable in most experiments with industrial circuits.
As both Feng Shui and Capo utilize partitioners which use random seeds, we have run Feng Shui 20 times and Capo 80 times on each benchmark, reporting best, worst, and average results for each. In Feng Shui, we apply two iterations of partitioning improvement; the run time is dominated by this activity. Capo is from 3 to 4 times faster, and the additional runs allow comparable total run times, but with a greater chance of obtaining a "good" placement result. These results are shown in Table 2 ; a graphic version of the results is shown in Figure 4 .
In many benchmarks, the results of Capo and Feng Shui are similar; as they both utilize recursive bisection and strong partitioning algorithms, this is not surprising.
The performance of Feng Shui is extremely good on the largest benchmark, golem3, consistently outperforming Capo by nearly 11%. The substantial improvement obtained by Feng Shui appears to be a result of cut direction selection, and not an artifact of the underlying partitioning algorithms. Methods to optimize cut directions are part of our current work.
When comparing Feng Shui with the iterative deletion preprocessing to Feng Shui in a basic Dunlop and Kernighan configuration, some benchmarks showed greater improvement than others. This indicates a varying degree of effectiveness of the iterative deletion step. In nine of eleven benchmarks, average wire length is improved, and in eight of eleven benchmarks, best observed wire length is improved. There is reduced variation in results (we note again that the partitioning algorithms utilize random seeds, resulting in different placements with each run). In most cases, consideration of the global aspect of the placement problem resulted in improvements in the best, average, and worst case results.
While improvements in general were modest, we make the following observations. A very simple and efficient method, iterative deletion, can provide a few percent improvement with negligible impact on run time. From this, we assert that we can obtain a modest improvement "for free," and suspect that improvements in direct k-way partitioning will have benefit for circuit placement.
CONCLUSION
In this paper, we have presented an optimization approach which allows the consideration of global objectives from within a traditional top-down placement framework.
Global optimization is extremely difficult; while the conditions under which iterated bisections can obtain stable and suboptimal configurations are clear, it was unknown if this would occur in practice, or if it would be at all common. With a relatively simple approach, some improvement has been obtained, indicating that not only do these configurations occur, but that we can address them through a global approach. While the average improvement is relatively small, these gains cannot be obtained through local optimization alone. We expect that greater improvements are possible, and are investigating methods to obtain improved direct k-way partitions.
Timing driven placement is a significant concern for modern design. We are currently working with an industry research group to evaluate the performance of our approach on large designs under realistic delay rules. We note that delay optimization is perhaps more of a global phenomenon than wire length minimization: meeting timing objectives may require modifications in many areas of a placement, and reductions in delay for some nets may require increased delay in others.
We also observe that the traditional approach of Dunlop and Kernighan is quite effective when modern multi-level clustering partitioners are utilized. Wire lengths were comparable to those of a well known commercial tool, and results reported for the tool Capo indicate that placements are not necessarily difficult to route. Our implementation of Feng Shui is fast; we find solutions for the largest available benchmark circuits, using modest hardware and CPU times. The nearly linear growth in run times should allow application to extremely large industrial circuits.
