, the MCNC regularly introduced and maintained circuit benchmarks for use by the Design Automation community. However, during the last five years, no new circuits have been introduced that can be used for developing fundamental physical design applications, such as partitioning and placement. The largest circuit in the existing set of benchmark suites has over 100,000 modules, but the second largest has just over 25,000 modules, which is small by today's standards. This paper introduces the ISPD98 benchmark suite which consists of 18 circuits with sizes ranging from 13,000 to 210,000 modules. Experimental results for three existing partitioners are presented so that future researchers in partitioning can more easily evaluate their heuristics.
Introduction
For over a decade, the Design Automation (DA) community has heavily relied on circuit benchmark suites to compare and validate their algorithms. Hundreds and perhaps thousands of publications have presented experimental results which use the circuits originally released by the Microelectronics Center of North Carolina (MCNC) and sponsored by ACM/SIGDA [3] . Indeed, papers in several fields, such as partitioning and placement, hardly stand a chance of being accepted into one of the major DA conferences without including experimental results that utilize these benchmarks. These benchmark suites (e.g., ISCAS85, ISCAS89, LayoutSynthesis92, Partitioning93, etc.) are currently maintained by the Collaborative Benchmarking Laboratory at North Carolina State University (www.cbl.ncsu.edu).
From 1985 to 1993, new suites of circuit benchmarks were regularly released; however, no new circuits have been released since. Most of these circuits are now obsolete, and do not adequately represent the complexity of modern designs. Consequently, there is a widening gap between the problems that are being solved in the academic literature and the problems that need to be solved. For example, a placer which achieves "5% improvement" on a design with 20 thousand moveable objects is not nearly as interesting or relevant as a placer which achieves "5% improvement" on a design with 200 thousand moveable objects.
One might argue that hierarchical design methodologies eliminate truly massive physical design problems. Currently, the only circuit in the existing suite of benchmarks with more than 26 thousand modules is golem3. However, given that next generation microprocessors will have between 20 and 50 million transistors, a physical design problem with just 1% of this complexity will still have between 200 and 500 thousand objects. It is not unreasonable to expect partitioning and placement problems of relatively small macros to reach this complexity. Indeed, physical design problems of this size have already been encountered within IBM. Given that golem3 is the only circuit in the public domain that can be said to represent medium to large designs, it seems unlikely that the academic community will be able to supply the algorithms that can manage the complexity expected in future designs.
The partitioning problem provides a perfect example of how both the academic and industrial communities are likely to suffer from the lack of an up-to-date benchmark suite. Over the last few years, several innovative partitioning algorithms have been proposed, e.g., [1] [6] [8] [14], and the state of the art has advanced significantly (see [2] for a survey). However, the most recent partitioners are achieving virtually identical solution quality for most of the current benchmarks. Table 1 shows the minimum cut bipartitioning results (with a 45/55 partition size balance constraint) obtained by four algorithms: Dutt/Deng¹ [8], hMetis [14], ML C [1] and LSR/MFFS² [6]. Observe that there are very small differences in solution quality for almost every benchmark. Indeed, complete convergence has been obtained by several partitioners for the smaller benchmarks balu (cut=27), struct (cut=33) and s9234 (cut=40). Consequently, it appears impossible for any future partitioner to obtain more than, say, a 2% average improvement over the current best partitioner. This state of affairs hardly means that partitioning is a solved problem. Rather, with over five years of opportunities to optimize a fixed suite of benchmarks, the DA community has collectively succeeded in finding superior partitioning solutions for these benchmarks. However, virtually nothing is known about what partitioners will work best or be most efficient on designs with 150 thousand or more moveable objects. Without the introduction of new, larger circuits, the CAD literature in pure partitioning will certainly die.
¹ Dutt and Deng present a general scheme for improving any iterative improvement engine. They present experiments with CLIP and CDIP on iterative improvement engines using lookahead, not using lookahead, and with probabilistic moves.
To offset the lack of public benchmarks, several works have studied random circuit generation. Success in this research domain could certainly offset the lack of available large circuits, yet much work remains. Early works, such as Bui et al. [5] and Garbers et al. [10], propose classes of random graphs that have natural clustering and partitioning solutions. More recent works, such as Darnauer and Dai [7] and Hutton et al. [12], generate random circuits that seek to capture such properties of real circuits as Rent parameter, circuit shape and depth, fanout distribution, reconvergence, etc. While these circuits are better than random graphs in representing real circuits, they are no substitute for actual test cases.³
The purpose of this work is to release a new set of circuits, called the ISPD98 benchmark suite, for physical design applications. The circuit sizes range from 13,000 to 210,000 modules and were translated from internal IBM designs. The circuits can be downloaded via the World Wide Web at vlsicad.cs.ucla.edu. In addition, some partitioning results are presented to enable easy comparisons for future work.
Table 2 presents the characteristics of the 18 circuits in the ISPD98 benchmark suite. The circuits are all generated from IBM internal designs produced at the Austin, Burlington and Rochester sites. The designs represent many types of parts, including bus arbitrators, bus bridge chips, memory and PCI bus interfaces, communication adaptors, memory controllers, processors, and graphics adaptors. For each circuit, a cell is considered to be an internal moveable object, a pad is an external (perhaps moveable) object, and a module is either a cell or a pad. The last column, Max%, gives the percent of the total area occupied by the largest module in the design. This percentage gives some idea as to how easy it is to partition the design under tight balance constraints.
A New Set Of Circuits
Each circuit is a translation from VIM (IBM's internal data format) into "net/are" format, a simple hypergraph representation originally proposed by Wei and Cheng [15] (see vlsicad.cs.ucla.edu for benchmarks in this format). In addition, a new format called "netD" is introduced, as described below. The circuits can be downloaded from vlsicad.cs.ucla.edu and complete descriptions of the benchmark formats can also be found there. The translation from VIM to "net/are" is performed as follows.
• All information relating to circuit functionality, timing and technology is removed. Unfortunately, this limits the direct applicability of these circuits (e.g., functional replication for partitioning); yet, the release of these circuits would have been impossible otherwise. Nevertheless, other applications besides pure partitioning can still be developed from this suite of circuits by making reasonable assumptions.
• All nets with more than 200 pins are removed from the design; most of these are likely related to clock and power distribution. The omission of these nets makes it more difficult to distinguish sequential from combinational cells. However, in modern design methodologies, layout is generally performed without the clock nets since they can bias the objective functions for partitioning and placement. For example, a placement algorithm might try to minimize the wirelength of the clock nets, forcing sequential elements to be clustered together. This may lead to an unbalanced clock distribution and misappropriation of clock resources.
• Small components that are disconnected from the largest component of the circuit are removed. This helps disguise the design while having virtually no effect on the layout since the disconnected components constitute a very small percentage of the layout area. As a side benefit, optimization techniques can be applied more easily. For example, spectral methods will not compute degenerate eigenvectors, flow based methods only need to construct a single network, and search based methods need to start from only a single module.
• Duplicate pins are removed. If a given net is connected to multiple pins incident to the same cell, then only one of these pins is included in the translated circuit. This has no effect on the topology of the netlist, but makes it easier to write physical design tools. For example, it simplifies the updating of gain buckets in Fiduccia-Mattheyses partitioning.
• All internal cells and pads are randomly numbered. Pads are assigned a default area of 0.
One shortcoming of the original net/are format is that signal direction information is not preserved, so we propose a new format called "netD". This format is identical to net/are format except that each module in a given net is identified as either an input, output or bidirectional pin for that net. This information should enable one to apply standard directional clustering techniques such as cones and MFFCs [6]. The netD format subsumes net format, but the web site will maintain net format to ensure backward compatibility with existing tools.
A potential problem with interpreting the signal direction information lies in handling bidirectional pads. Due to strict I/O limits in many technologies, a large percentage of the pads (up to 90%) in many designs are bidirectional. This makes it difficult to perform many operations, such as computing the longest paths from primary inputs to primary outputs, or generating cones. Figure 1(a) illustrates a typical instance. Here, a 2-pin net connects a bidirectional pad to an internal cell which also has three inputs (I1, I2, I3) and three outputs (O1, O2, O3).
To apply cone-based techniques, one must construct an equivalent circuit without bidirectional pads. One possibility is to split the pad into a primary input (PI) and a primary output (PO) as shown in Figure 1(b). A potential problem that arises is that the path that goes from pad 1 through the cell and then to pad 2 does not really exist. Special care would have to be taken to avoid these "false paths". Figure 1(c) shows another alternative in which both the pad and cell are replicated. All the appropriate paths are preserved, but having two distinct cells becomes problematic since both cells must always appear in the same partition. Neither (b) nor (c) may be the best way to model bidirectional pads for cone-like constructions.
Table 3: Min-cut bipartitioning results with up to 10% deviation from exact bisection. Each cell and pad is assigned unit area.
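The structural translation steps listed in this section (removing large nets, discarding small disconnected components, and removing duplicate pins) can be sketched in Python as follows. This is only an illustrative sketch, not the tool used to produce the suite. The function name `translate`, the dictionary-based netlist representation, applying the 200-pin threshold to the raw pin count before duplicate removal, and measuring component size by module count are all assumptions. Random renumbering and area assignment are not shown.

```python
from collections import defaultdict


def translate(nets, max_pins=200):
    """Apply the structural translation steps to a netlist (hypothetical helper).

    nets: dict mapping net name -> list of module names, one entry per pin.
    Returns a new dict with large nets, small components, and duplicate
    pins removed.
    """
    # Step 1: drop nets with more than max_pins pins (assumed to be
    # counted on the raw pin list, before duplicates are removed).
    kept = {name: pins for name, pins in nets.items() if len(pins) <= max_pins}

    # Step 2: find the largest connected component (by module count,
    # an assumption) with a depth-first search over module/net incidence.
    nets_of = defaultdict(list)
    for name, pins in kept.items():
        for module in pins:
            nets_of[module].append(name)

    seen, largest = set(), set()
    for start in nets_of:
        if start in seen:
            continue
        seen.add(start)
        component, stack = set(), [start]
        while stack:
            module = stack.pop()
            component.add(module)
            for name in nets_of[module]:
                for other in kept[name]:
                    if other not in seen:
                        seen.add(other)
                        stack.append(other)
        if len(component) > len(largest):
            largest = component

    # Step 3: keep only nets inside the largest component and collapse
    # duplicate pins (one pin per module per net, original order kept).
    # All pins of a net lie in one component, so testing pins[0] suffices.
    return {
        name: list(dict.fromkeys(pins))
        for name, pins in kept.items()
        if pins and pins[0] in largest
    }
```

For instance, with `max_pins=3`, a 4-pin net is dropped first, which may in turn disconnect a small group of modules that is then discarded in the second step.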
Partitioning Results
We now present results for three partitioners on the new suite of circuits. The purpose is not to make a comparative evaluation of current partitioners, but rather to provide a set of data for use by future researchers. We ran three partitioning algorithms: Fiduccia-Mattheyses (FM) [9] , CLIP [8] , and hMetis [14] . Implementations of FM and CLIP use a LIFO bucket structure and were obtained from the authors of [1] , and the hMetis executable was obtained from the authors of [14] . FM is the industry standard iterative exchange heuristic, CLIP is a modification of FM that biases cells to move in clusters, and hMetis is a multilevel partitioner. hMetis offers a choice of several different coarsening schemes, uncoarsening schemes, and V-cycle refinement schemes. We use the default schemes as described in [13] .
Results are presented for two different modelings of the cells: (i) each cell and pad has unit area; (ii) each pad has area zero, and each cell has non-unit (actual) area as specified in the appropriate area file. The reasons for including both are somewhat historical. Unit areas are more prominent in the literature (partly due to the absence of area data) and in some sense yield a "purer" partitioning problem. Implementation of a partitioner is also much simpler with unit areas since balance constraints are easy to enforce. However, non-unit (actual) areas afford a much more realistic problem formulation. As the following results show, there are some problems with partitioning with non-unit areas that need to be addressed.
Table 3 presents bipartitioning results for the designs for unit cell and pad area and allowing up to 10% deviation from exact bisection, i.e., each partition must have between 45% and 55% of the total area. Both the minimum and average cuts over 100 runs of each algorithm are reported. The CPU column gives the average time required for a single run of each algorithm. Runtimes are reported for a 135 MHz IBM RS6000 S/595. Table 4 presents the same set of experiments except that the cells have non-unit areas, given in the "are" file. Tables 5 and 6 present similar results for the three partitioners, this time allowing up to 2% deviation from exact bisection, i.e., each partition must consist of between 49% and 51% of the total area. Table 5 presents results for unit cell and pad area, while Table 6 presents results for non-unit area. Observe that some of the cut sizes for both FM and CLIP are very large in both Tables 4 and 6 for several circuits, e.g., ibm05, ibm07, ibm12 and ibm15.
Table 5: Min-cut bipartitioning results with up to 2% deviation from exact bisection. Each cell and pad is assigned unit area.
These large results do not necessarily reflect that FM and CLIP are poor algorithms, but rather that the implementation [1] is not particularly good at satisfying balance criteria when there are large variations in cell sizes. Indeed, the problem of even finding an exact bisection is NP-Complete when cells have non-unit areas [11]. Thus, when area constraints are fairly restricted and there are several cells with large areas, sophisticated balancing and rebalancing schemes need to be incorporated (at least in an iterative approach). This aspect of iterative partitioning has not been very actively researched. Some open questions include how to choose which partition to move a cell from, how to rebalance a solution that has become unbalanced by a given move, and how to handle designs with very large cells (e.g., more than 10% of the total area).
Table 7: Net cut, Sum of Degrees, and CPU times for 100 runs of hMetis 4-way partitioning for both unit and non-unit areas. Solutions were allowed to deviate up to 10% from exact quadrisection, i.e., each partition has between 22.5% and 27.5% of the total area.
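The balance constraints quoted in this section follow one pattern: with k partitions and an allowed deviation d, each partition may hold between (1 - d)/k and (1 + d)/k of the total area. The following Python sketch, with hypothetical function names `balance_bounds` and `is_balanced`, shows how such a constraint might be checked. It is an illustration of the arithmetic only and is not taken from any of the partitioners evaluated here.

```python
def balance_bounds(total_area, k, deviation):
    """Return (lower, upper) area bounds for each of k partitions.

    deviation is a fraction, e.g. 0.10 for "up to 10% deviation"
    from an exact k-way split.
    """
    target = total_area / k
    return target * (1 - deviation), target * (1 + deviation)


def is_balanced(part_areas, deviation):
    """Check whether every partition area lies within the allowed bounds."""
    lower, upper = balance_bounds(sum(part_areas), len(part_areas), deviation)
    return all(lower <= area <= upper for area in part_areas)
```

With a total area of 100, `balance_bounds(100.0, 2, 0.10)` yields bounds of approximately 45 and 55, and `balance_bounds(100.0, 4, 0.10)` yields approximately 22.5 and 27.5, matching the percentages stated for bisection and quadrisection. Note that a single module whose area exceeds the upper bound makes the constraint infeasible, which is one reason large cells are troublesome under tight balance.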
Finally, Table 7 and Table 8 respectively present results for 4-way and 8-way partitioning, obtained by recursively applying hMetis. The solutions are the best recorded over 100 runs, and CPU is the amount of time for a single run. Note that hMetis first performs 100 runs of 2-way partitioning, chooses the best solution, then performs 100 runs on each of the two subpartitions. In the tables, "Cut" refers to the total number of nets cut by the solution, and "SOD" refers to the Sum of Degrees objective. Sum of Degrees is the sum over all partitions of the number of cut nets incident to the partition (see [1] ). The same parameters are used as for hMetis bipartitioning, and the area of each partition can vary up to 10% from exact quadrisection or octisection. Results are given for both unit and non-unit areas. Note that for ibm03, hMetis is unable to find an 8-way partitioning solution for non-unit areas. This is most likely due to the presence of the large module which occupies 10.76% of the total area. 
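To make the two objectives concrete, the sketch below computes both the net cut and the Sum of Degrees for a given multi-way partition, following the definitions stated above. The function name `cut_and_sod` and the list/dictionary data layout are assumptions made for illustration. Summing, over all partitions, the cut nets incident to each partition is the same as summing, over all cut nets, the number of partitions each net touches, and the code uses the latter form.

```python
def cut_and_sod(nets, part_of):
    """Compute (net cut, Sum of Degrees) for a k-way partition.

    nets: iterable of nets, each a list of module names.
    part_of: dict mapping module name -> partition index.
    """
    cut = 0
    sod = 0
    for pins in nets:
        # Set of distinct partitions this net touches.
        parts = {part_of[module] for module in pins}
        if len(parts) > 1:
            cut += 1           # the net is cut once, however many parts it spans
            sod += len(parts)  # it is incident to each of the parts it spans
    return cut, sod
```

For a bipartition every cut net touches exactly two partitions, so the Sum of Degrees is simply twice the net cut; the two objectives differ only for three or more partitions.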
Conclusions
A new set of benchmarks is introduced for physical design applications. Results for several experiments are reported to serve as a stepping stone for future work in partitioning. It is our hope that others in industry will follow suit and make efforts to publish their data as well. Providing data in these simple formats does not compromise the intellectual property of the design, yet gives enough topological information to form real challenges to modern PD tools.
Table 8: Net cut, Sum of Degrees, and CPU times for 100 runs of hMetis 8-way partitioning for both unit and non-unit areas. Solutions were allowed to deviate up to 10% from exact octisection, i.e., each partition has between 11.25% and 13.75% of the total area.
