Abstract - Placement is an important step in the overall IC design process in DSM technologies, as it defines the on-chip interconnects, which have become the bottleneck in determining circuit performance. The rapidly increasing design complexity, combined with the demand for the capability of handling nearly flattened designs for physical hierarchy generation, poses significant challenges to existing placement algorithms. There are very few studies on the optimality and scalability of placement algorithms, due to the limited sizes of existing benchmarks and limited knowledge of optimal solutions. The contribution of this paper includes two parts: 1) We implemented an algorithm for generating synthetic benchmarks that have known optimal wirelengths and can match any given net distribution vector. 2) Using benchmarks of 10K to 2M placeable modules with known optimal solutions, we studied the optimality and scalability of three state-of-the-art placers from academia, Dragon [4], Capo [1], and mPL [24], and one leading-edge industrial placer, QPlace [5] from Cadence. For the first time, our study reveals the gap between the results produced by these tools and the true optimal solutions. The wirelengths produced by these tools are 1.66 to 2.53 times the optimal in the worst cases, and 1.46 to 2.38 times the optimal on average. As for scalability, the average solution quality of each tool deteriorates by an additional 4% to 25% when the problem size increases by a factor of 10. These results indicate significant room for improvement in existing placement algorithms.
Introduction
Placement is an important step in the overall IC design process in DSM technologies, as it defines the on-chip interconnects, which have become the bottleneck in determining circuit performance. Existing placement algorithms can be classified into three categories: min-cut based methods [1], analytical methods [2], and iterative methods [3]. There are also many hybrid methods [4]: after producing an initial solution with one algorithm, they shift to another to further improve the solution quality.
According to the ITRS'01 Roadmap [6], the maximum number of transistors per chip will be over 1.6 billion, with a clock frequency of 28.7 GHz, by the year 2016. Such high complexity poses significant challenges to the scalability of placement algorithms. The traditional way to handle large designs is through partitioning according to the logical hierarchy. However, it is pointed out in [7] that these hierarchies are derived with little or no consideration for the physical layout, and they may not embed well in a two-dimensional silicon surface. Therefore, it is proposed in [7] that the right way to partition the design is to first flatten the logic hierarchy to the extent that we are certain about the "physical locality" of each module in the flattened design, and then construct a physical hierarchy (coarse placement) on this almost flattened netlist. The algorithm presented in [8] was developed to support this methodology. In general, this approach requires highly scalable placement algorithms which can handle nearly flattened designs with 100K to 10M placeable objects.
Until now, there have been few studies to understand the optimality and scalability of placement algorithms. This is due to the limited sizes of existing benchmarks and limited knowledge of their optimal solutions. Two types of benchmarks are commonly used. One type is based on real designs [9] [10] [11]. They are either directly extracted from real designs [9], or based on minor perturbations of real designs [ [15] to search Rent's parameter that incurred the highest resource utilization ratio. The study in [19] attempted to quantify the suboptimality of placement algorithms in terms of chip area by "stitching" small designs to form large designs. The major drawback shared by these benchmarks is that their optimal placement solutions are unknown, so it is difficult to determine how the solution quality changes as the design size grows.
The contribution of this paper includes two parts: (1) We implemented an algorithm for generating synthetic benchmarks that have known optimal wirelengths and can match any given net distribution vector. Our algorithm is similar to the one first proposed by Boese, which was outlined in [19]. Boese, however, never implemented his idea or experimented with it on any placer [22]. (2) Using benchmarks of 10K to 2M placeable modules with known optimal solutions, we experimented with three state-of-the-art placers from academia, Dragon [4], Capo [1], and mPL [24], and one leading-edge industrial placer, QPlace [5] from Cadence.
¹ The current contact address for Chin-Chih Chang is Cadence Design Systems Inc., 555 River Oaks Parkway, San Jose, CA 95134.
For the first time our study reveals the gap between the results produced by these tools versus true optimal solutions. The wirelengths produced by these tools are 1.66 to 2.53 times the optimal in the worst cases, and are 1.46 to 2.38 times the optimal on the average. As for scalability, the average solution quality of each tool deteriorates by an additional 4% to 25% when the problem size increases by a factor of 10. These results indicate significant room for improvement in existing placement algorithms.
The rest of this paper is organized as follows. Section 2 describes our benchmark generation algorithm. Section 3 gives experimental results using the synthetic benchmarks. Section 4 gives the conclusion and future work.
Placement Benchmark Generation with Known Optimal Wirelength

Problem Formulation
First, we introduce some notation. Given a netlist N, let p be the number of placeable modules in the netlist, and let $D = (d_2, d_3, \ldots, d_p)$ be its net distribution vector (NDV), where $d_k$ is the total number of k-pin nets in the netlist.
We are interested in the following problem: given a number p and a vector D, construct a placement benchmark with p placeable modules such that its netlist has D as its NDV and has a known optimal half-perimeter wirelength.
Placement Benchmark Construction Algorithm
A. Algorithm Description
Our algorithm, PEKO (Placement Example with Known Optimal wirelength), makes two assumptions: all the modules are of equal size, and there is no space between the rows. It first places all the modules in a rectangular region close to a square, then connects the nets to the modules one by one, using the minimum-perimeter bounding box for each net. In the end, a netlist is extracted from this placed configuration. Fig. 1 gives a description of the algorithm. Fig. 2 shows an example with p = 9 and D = (6,2,2), i.e., six 2-pin nets, two 3-pin nets, and two 4-pin nets. Net A is a 4-pin net. According to our algorithm, it connects four modules located in a 2x2 rectangular region. Similarly, the two 3-pin nets are generated as C and D respectively. This process is repeated until the NDV is exhausted. The total wirelength for this benchmark is 6*1+2*2+2*2=14.
Fig. 1. Algorithm PEKO (input and output).
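The construction described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`optimal_box`, `peko_sketch`, `hpwl`), the row-major grid layout, the unit module size, and the uniformly random choice of window position and pins are all assumptions made for illustration. The sketch also does not force every module to be connected to a net.

```python
import math
import random

def optimal_box(k):
    """Width and height of a smallest-half-perimeter box that holds k unit modules."""
    w = math.isqrt(k - 1) + 1   # ceil(sqrt(k)) for k >= 1
    h = -(-k // w)              # ceil(k / w)
    return w, h

def peko_sketch(p, ndv, seed=0):
    """Return (positions, nets); ndv[i] is the number of (i + 2)-pin nets.

    Hypothetical simplification: windows are only sampled from completely
    filled rows, so very large nets on a ragged grid are rejected.
    """
    rng = random.Random(seed)
    cols = math.isqrt(p - 1) + 1                            # near-square grid
    positions = [(m % cols, m // cols) for m in range(p)]   # row-major, unit pitch
    full_rows = p // cols
    nets = []
    for i, count in enumerate(ndv):
        k = i + 2
        w, h = optimal_box(k)
        if w > cols or h > full_rows:
            raise ValueError(f"{k}-pin net does not fit in the filled rows")
        for _ in range(count):
            x0 = rng.randrange(cols - w + 1)
            y0 = rng.randrange(full_rows - h + 1)
            window = [(y0 + dy) * cols + (x0 + dx)
                      for dy in range(h) for dx in range(w)]
            # Any k distinct modules inside the w x h window span exactly
            # the optimal bounding box for a k-pin net.
            nets.append(sorted(rng.sample(window, k)))
    return positions, nets

def hpwl(net, positions):
    """Half-perimeter wirelength of one net, measured between module origins."""
    xs = [positions[m][0] for m in net]
    ys = [positions[m][1] for m in net]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))

positions, nets = peko_sketch(9, (6, 2, 2))
total = sum(hpwl(net, positions) for net in nets)
```

For p = 9 and D = (6,2,2), `total` equals 14 for any seed, because every net is confined to a window of optimal size regardless of where the window lands.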
B. Proof of Optimality
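According to the generation algorithm, the wirelength of each n-pin net is $\lceil\sqrt{n}\rceil + \lceil n/\lceil\sqrt{n}\rceil\rceil - 2$. For any n-pin net, the optimal half-perimeter wirelength can only be achieved when the modules of this net are placed in a rectangular region close to a square, i.e., the length of each side is close to $\sqrt{n}$. In particular, the width and height of the rectangle should be $\lceil\sqrt{n}\rceil$ and $\lceil n/\lceil\sqrt{n}\rceil\rceil$.
The wirelength of such a configuration is $\lceil\sqrt{n}\rceil + \lceil n/\lceil\sqrt{n}\rceil\rceil - 2$.
The wirelength of each n-pin net achieved by our algorithm is therefore optimal. Since the total wirelength is the sum over all nets and every net attains its own minimum, the total wirelength is also optimal.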
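Given a benchmark E generated by PEKO with NDV $D = (d_2, d_3, \ldots, d_p)$,

$$OW(E) = \sum_{k=2}^{p} d_k \left( \lceil\sqrt{k}\rceil + \lceil k/\lceil\sqrt{k}\rceil\rceil - 2 \right)$$

is the optimal wirelength of the benchmark. Given a placement solution s to benchmark E, we measure its wirelength and denote it as $PW_s(E)$. We define the ratio $PW_s(E)/OW(E)$ as the Quality Ratio (QR) of s.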
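As a hedged illustration of these definitions, the sketch below computes the optimal wirelength from an NDV and the resulting Quality Ratio. It assumes unit-sized modules, so that a k-pin net has an optimal half-perimeter of ceil(sqrt(k)) + ceil(k / ceil(sqrt(k))) - 2, which is consistent with the worked total of 14 for D = (6,2,2). The function names are hypothetical, and `brute_force_hpwl` is included only to cross-check the closed form against an exhaustive search over bounding boxes.

```python
import math

def net_optimal_hpwl(k):
    """Closed-form optimal half-perimeter wirelength of a k-pin net (unit modules)."""
    w = math.isqrt(k - 1) + 1   # ceil(sqrt(k))
    h = -(-k // w)              # ceil(k / w)
    return w + h - 2

def brute_force_hpwl(k):
    """Smallest (w - 1) + (h - 1) over all w x h boxes with room for k modules."""
    return min((w - 1) + (h - 1)
               for w in range(1, k + 1)
               for h in range(1, k + 1)
               if w * h >= k)

def optimal_wirelength(ndv):
    """OW for an NDV given as (d_2, d_3, ...), i.e. ndv[i] counts (i + 2)-pin nets."""
    return sum(d * net_optimal_hpwl(i + 2) for i, d in enumerate(ndv))

def quality_ratio(placed_wirelength, ndv):
    """Ratio of a placer's wirelength to the known optimum; 1.0 means optimal."""
    return placed_wirelength / optimal_wirelength(ndv)

ow = optimal_wirelength((6, 2, 2))   # 6*1 + 2*2 + 2*2
qr = quality_ratio(21, (6, 2, 2))    # a hypothetical placement with wirelength 21
```

Here `ow` is 14 and `qr` is 1.5; the value 21 is an invented placement result used only to show the ratio.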
Generation of Realistic Benchmark Set with Known Optimal Wirelength
In order to generate realistic benchmarks, we first extract the module numbers and NDVs from the netlists in the ISPD98 suite [9] (originally from IBM) and generate a set of benchmarks named suite-1 using PEKO. Table 1 gives the characteristics of suite-1. The column "OW" gives the optimal half-perimeter wirelength for each benchmark. Suite-2 is generated by scaling the module number and NDV of each circuit in suite-1 by a factor of 10.
One important feature of suite-1 and suite-2 is that there is no net connected with pads. This feature is enforced out of concern that such nets may give hints about where to place each net. To make our study complete, we also generate another two sets of benchmarks which have nets connected with pads. They are named suite-3 and suite-4 respectively. Table 2 gives a description of suite-3. All benchmarks used in this paper are given in both GSRC Bookshelf format and LEF/DEF format, and can be downloaded from: http://cadlab.cs.ucla.edu/-pubbench/peko.htm.
White Space Generation
To mimic real designs, we take a simplistic approach to generate white space in the PEKO suite. After the optimal configuration is obtained, white space is inserted to the right of the placeable modules. For each circuit in PEKO, 15% of the chip area is white space.
An alternative is to first connect each module with at least one net, then randomly remove αp modules and all the nets connected with them, where α is the ratio of the desired white space area to the chip area. It is easy to prove that benchmarks thus generated also have a known optimal wirelength. Furthermore, the white space is randomly distributed on the chip. This method, however, may not give a benchmark matching the desired NDV. Therefore, it is not used in PEKO.
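The two mechanical steps in this section, scaling a circuit by a factor of 10 and reserving 15% of the chip area as white space to the right of the modules, can be sketched as follows. This is a simplified reading rather than the authors' code: the function names are hypothetical, and the width calculation assumes every row holds the same number of unit-width modules and that the chip height is left unchanged.

```python
def scale_benchmark(p, ndv, factor=10):
    """Scale the module count and every NDV entry by the same factor."""
    return p * factor, tuple(d * factor for d in ndv)

def chip_width_with_whitespace(cols, whitespace_ratio=0.15):
    """Row width (in module widths) after appending white space on the right.

    Solves (width - cols) / width == whitespace_ratio for width, so the
    requested fraction of the total chip area is empty.
    """
    if not 0.0 <= whitespace_ratio < 1.0:
        raise ValueError("whitespace_ratio must be in [0, 1)")
    return cols / (1.0 - whitespace_ratio)

p10, ndv10 = scale_benchmark(9, (6, 2, 2))
width = chip_width_with_whitespace(85)   # 85 module columns plus white space
```

Because the white space sits entirely to the right of the optimal configuration, the module positions, and therefore the optimal wirelength, are unchanged by this step.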
Experimental Results and Analysis
The benchmarks were run on three state-of-the-art placers from academia and one leading-edge industrial placer, including:
Dragon: Dragon is based on a multilevel framework. It uses hMetis [21] to derive an initial partition of the circuit, then undergoes a series of refinement stages that perform bin-based swapping with simulated annealing [4].
... of memory. The data is collected after running each tool only once. Since all the tools make use of randomization, running them several times may give different results. Also, a direct comparison of Capo's runtime with the other tools may not be meaningful, as it is run on a different machine, but the runtime data can give us some idea about its speed and scalability. We emphasize that it is not our purpose to compare the four placers. The experiments are performed to determine how much room is left for improvement in existing placement algorithms.
The test results for suite-1 are given in Table 3. The column "PW" gives the detailed placement wirelength produced by each tool. For each benchmark, the Quality Ratio is calculated for the four tools and given in the columns named "QR." According to the experiments, none of these tools achieves a Quality Ratio close to 1. The wirelengths produced by these tools can be 1.66 to 2.53 times the optimal in the worst cases². It is possible that some of the placers try to enhance routability by sacrificing wirelength. However, given the gap between their wirelengths and the optimal value, there remains significant room for improvement in existing placement algorithms.
The entire test is repeated on suite-2 to observe how the QRs change as the design size grows. Since the benchmark sizes are 10x larger in this set, we set an upper limit of 24 hours on each tool's runtime. The results are given in Table 4. QPlace scales well in terms of runtime. It finishes 16 out of 18 benchmarks (up to 1.83M placeable modules), and runs out of memory on the remaining two (with 1.85M and 2.15M placeable modules) on our machine's configuration. Its average Quality Ratio increases by only 4%, from 1.84 to 1.88. Of the four tools, this increase is the smallest. Capo also shows good scalability in runtime. It finishes 13 of the circuits (up to 837K placeable modules) and runs out of memory on the remaining 5 circuits. Its average Quality Ratio shows an increase of 16% with the increase in design size. mPL finishes 9 of the 18 benchmarks, and runs out of memory on the remaining circuits. Its average Quality Ratio increases by 25%, from 1.46 to 1.71. This is the highest increase of the four placers. Dragon manages to complete the placement for only the first 6 benchmarks (up to 323K placeable modules) within 24 hours. Its average Quality Ratio increases from 2.09 to 2.28. Fig. 4 and Fig. 5 give the combined results for suite-1 and suite-2. They show how the solution quality and runtime of each tool change with the increase in cell numbers.
Table 5 and Table 6 give the experimental results for suite-3 and suite-4, which have nets connected with pads. For these circuits, the wirelengths produced by the placers are 1.53 to 2.49 times the optimal in the worst cases, and 1.43 to 2.37 times the optimal on average. Their average solution quality deteriorates by an additional 5% to 21% when the problem size increases by a factor of 10. It can be seen from Table 3 to Table 6 that having nets connected with pads does give some placers a hint about the optimal solution, with a 12% improvement by mPL and a 10% improvement by QPlace. This is understandable, as analytical placement algorithms make use of fixed pad locations to avoid degenerate solutions. Interestingly, Capo and Dragon do not benefit from the additional information provided by connections to pads.
² We were provided with Capo's latest version by its author when preparing the final version of this paper. Some preliminary experiments did show improvement in solution quality. However, due to time limitations, we could not finish all the experiments and include the results here.
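The scalability percentages quoted in this section appear to be differences between average Quality Ratios, expressed in percent of the optimal wirelength, rather than relative changes. This reading is an inference from the quoted figures (1.84 to 1.88 reported as 4%, and 1.46 to 1.71 reported as 25%), not a definition stated in the text. The short sketch below, with a hypothetical function name, reproduces those numbers under that assumption.

```python
def qr_deterioration_points(qr_small, qr_large):
    """Increase in average Quality Ratio, in percent of the optimal wirelength."""
    return round((qr_large - qr_small) * 100)

# Average Quality Ratios quoted in the text for suite-1 and suite-2.
reported = {
    "QPlace": (1.84, 1.88),
    "mPL": (1.46, 1.71),
    "Dragon": (2.09, 2.28),
}
deterioration = {tool: qr_deterioration_points(a, b)
                 for tool, (a, b) in reported.items()}
```

Under this reading QPlace and mPL come out at 4 and 25, matching the text, and Dragon comes out at 19. Each suite-2 average covers only the benchmarks that the tool completed.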
Although our algorithm is capable of generating arbitrarily-sized benchmarks with known optimal wirelength, given the scalability problems encountered by these tools on suite-2 and suite-4, it is not meaningful for us to construct larger designs to further evaluate these algorithms.
Conclusion and Future work
In this paper, we implemented an algorithm to generate synthetic placement benchmarks with known optimal wirelength matching any net distribution vector. Using benchmarks of 10K to 2M placeable modules with known optimal solutions, we experimented with four state-of-the-art placement tools. The wirelengths produced by these tools are 1.66 to 2.53 times the optimal in the worst cases, and 1.46 to 2.38 times the optimal on average. As for scalability, the average solution quality of each tool deteriorates by an additional 4% to 25% when the problem size increases by a factor of 10.
Our study, as reported in this paper, is by no means complete. We did not have a chance to experiment with a number of well-known placers, such as Gordian-L [2], Timberwolf [3], and mPG [8] from academia, and the placement engines used by Avant!, Magma, and Synopsys. Also, the benchmarks generated by our algorithm have several limitations. For example, all modules in these circuits are of uniform size, making them unsuitable for evaluating the legalization capability of detailed placement algorithms. All the nets in the optimal solutions are local connections, i.e., they only connect modules in contiguous areas. This may not be true in real circuits. Therefore, obtaining good results on these benchmarks may not guarantee good solution quality on real circuits. Also, these benchmarks cannot be used to evaluate routability and delay. Nevertheless, we have made a very important step in understanding the optimality and scalability of existing placement algorithms. We plan to further enhance our benchmark construction algorithm and broaden its applicability in the future.
