In this paper, clustering for the circuit placement problem is examined from the perspective of wire length contribution from groups of nets.
INTRODUCTION
Very Large Scale Integrated (VLSI) circuits are widely used in daily life, from cell phones to computers. Physical design of VLSI is the stage where the physical shape of a circuit is decided. The quality of the physical design solution largely affects circuit performance and stability, and hence it requires careful attention. Typically, physical design consists of three major phases: partitioning, placement, and routing. In partitioning, a circuit is gradually divided into small subcircuits. In placement, the exact locations of circuit components are determined. In routing, the paths of wires are determined. A typical flow chart for VLSI physical design is shown in Figure 1.
VLSI placement is the step in physical design in which the physical locations of the components of a circuit are determined while optimizing some objective or a set of objectives. Minimizing total estimated wire length is the most common placement objective in academic research. Placement is an NP-hard problem, but many heuristics exist for solving it. For example, a total of 9 academic placers competed in the ISPD2005 [1] and ISPD2006 [2] placement competitions.
Figure 1. Typical physical design flow chart
The placement stage can involve several steps: global placement, which roughly decides the positions of the circuit components, or "cells"; legalization, which removes all cell overlaps; and detailed placement, which further improves the total wire length. Due to advances in deep sub-micron (DSM) technology, current state-of-the-art placers have to handle increasingly complex VLSI circuits. One successful solution to this problem has been the application of clustering algorithms in placement so that it can be performed in a hierarchical manner. As an example, most placers in the ISPD2005 and ISPD2006 placement competitions used clustering in part of their program to improve final placement solutions and speed up the overall runtime.
Clustering is usually performed in a hierarchical manner. At each level of clustering, cells with high connectivity are sought and grouped together. Each group of cells is then considered as a new cell in the circuit at the next level. At the end of each level, the connectivity information of the cells, the netlist, is updated. Once the circuit size has been reduced to a desirable size, placement is performed for the small circuit. Then, the lowest level circuit is projected to its previous level and the placement solution from the previous level is refined to remove overlap and further reduce the total wire length. This process of projection and refinement is repeated until the original circuit is reached and cells are placed.
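The netlist update between clustering levels can be sketched as follows. This is a minimal illustration and not the procedure of any particular placer: the function name `coarsen_netlist`, the representation of nets as collections of cell names, and the choice to drop nets that fall entirely inside one cluster are assumptions made for the example.

```python
def coarsen_netlist(nets, cluster_of):
    """Rebuild the netlist for the next clustering level.

    nets       : list of iterables of cell names (hypothetical representation)
    cluster_of : dict mapping a clustered cell to the name of its cluster;
                 cells missing from the dict stay as they are
    """
    coarse = []
    for net in nets:
        # Replace every cell by its cluster; a set removes repeated clusters.
        mapped = {cluster_of.get(cell, cell) for cell in net}
        # A net absorbed by a single cluster connects nothing at the next level.
        if len(mapped) < 2:
            continue
        coarse.append(tuple(sorted(mapped)))
    return coarse


# Usage: cells a and b are merged into cluster "ab".
nets = [{"a", "b"}, {"a", "b", "c"}, {"c", "d"}]
print(coarsen_netlist(nets, {"a": "ab", "b": "ab"}))
```

The net {a, b} disappears because both of its cells end up in the same cluster, which is also why clustering the cells of a net removes that net's wire length contribution at the coarser level.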
Several heuristics exist to perform clustering on circuits, such as [3, 4, 5, 6, 7, 8, 9]. Most clustering algorithms consider the local connectivity of cells and group one or two cells at a time. Therefore, the main objective of the existing state-of-the-art clustering algorithms is usually to reduce the number of interconnections between clusters or to increase the interconnections among the cells belonging to one cluster, instead of reducing the wire length, which is the objective of placement.
Most clustering algorithms cluster low-degree nets, such as nets with degree 2 or 3. In [10], placement algorithms are examined for optimality and scalability. In [11], a study of placement efficiency and netlist structure is given. Both of these studies point to a common problem of current placement algorithms in dealing with long nets. For example, it is stated in [10] that when DRAGON [12] is used as the placer, 64% of the total wire length of the ICCAD04 benchmark circuits [13] is attributable to the longest 10% of the nets.
In this paper, a preprocessing algorithm is proposed that tries to remedy two shortcomings of existing clustering algorithms: ignoring high-degree nets and the lack of connection with placement. The algorithm uses a simplified maximum flow algorithm to find cells belonging to high-degree nets that have other direct or indirect connections, and assigns scores to pairs of cells for each net. The pairs with the highest scores are clustered. The experimental results show that the total wire length can be reduced by 2% to 5% on average for each placer.
The rest of this paper is organized as follows. In Section 2, a literature review on placement and clustering algorithms is given. In Section 3, the proposed clustering algorithm is described in detail. Experimental results are reported in Section 4. Finally, conclusions and future work are presented in Section 5.
CLUSTERING AND PLACEMENT BACKGROUND
Current Placement Techniques
The typical algorithms for placement are analytical, annealing-based or partitioning-based. In analytical placement, the problem is formulated as an optimization problem with an objective, such as reducing the total wire length, and constraints, such as reducing congestion. The solution of an analytical placer often contains overlaps between cells, which are removed in the legalization step. Analytical placers have been very successful in the ISPD2005 and ISPD2006 placement competitions.
In annealing-based placers, the estimated wire length of a circuit is reduced using the simulated annealing process [14]. Dragon [12] is an example of a simulated annealing placer.
In partitioning-based placement, a circuit is recursively partitioned. The positions of the partitions are determined at each partitioning step. This process is continued until only a few cells belong to each partition. Examples of partitioning-based placers are Capo [15] and Fengshui 5.0 [16].
Clustering Algorithms Review
In clustering algorithms, cells at each level that are deemed to be highly connected are grouped to form a new cell, referred to as a cluster, in the next level. In clustering algorithms such as edge coarsening and hyperedge coarsening [4], FirstChoice [4], heavy-edge matching [3] and PinEC [5], once a cluster is formed, it is finalized without any comparison with other candidate clusters. These algorithms are usually fast and can cluster circuits to very small sizes. However, since no comparison between clusters is made, potentially high-quality clusters can be missed.
Best-choice clustering [6] and Net Cluster [9] are examples of clustering techniques in which a score is calculated for each potential cluster. The potential clusters are compared and the clusters with the best scores are finalized. These algorithms can provide a more global view of the circuits at the cost of extra runtime.
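The idea of scoring candidate clusters and finalizing the best ones first can be sketched as below. The score used here (each shared net of k cells contributes 1/(k-1) to every pair of its cells) is an illustrative assumption and is not the score defined in [6] or [9]; the function names are likewise hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def pair_scores(nets):
    """Accumulate an illustrative connectivity score for every pair of cells."""
    scores = defaultdict(float)
    for net in nets:
        cells = sorted(set(net))
        if len(cells) < 2:
            continue
        weight = 1.0 / (len(cells) - 1)  # low-degree nets weigh more
        for u, v in combinations(cells, 2):
            scores[(u, v)] += weight
    return scores


def greedy_best_pairs(nets):
    """Compare all candidate pairs and finalize them from best score down."""
    ranked = sorted(pair_scores(nets).items(), key=lambda kv: (-kv[1], kv[0]))
    used, clusters = set(), []
    for (u, v), _score in ranked:
        if u in used or v in used:
            continue  # each cell joins at most one cluster per level
        clusters.append((u, v))
        used.update((u, v))
    return clusters


nets = [("a", "b"), ("a", "b", "c"), ("c", "d"), ("d", "e", "f")]
print(greedy_best_pairs(nets))
```

Note how this weighting favors low-degree nets: a 2-pin net contributes a full point to its pair, while a 10-pin net contributes only 1/9 to each of its pairs. This is the bias toward low-degree nets that the later sections set out to address.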
In [17], a preprocessing clustering algorithm that focuses on clustering high-degree nets is proposed. Promising experimental results are reported in [17]; however, that approach has two major shortcomings. First, it uses a naive rule in which any net with degree higher than 15 is classified as high-degree and deemed to be long. Second, the total runtime for placement is increased significantly. In this paper, major modifications to the algorithm proposed in [17] are made to remedy these problems. In addition, a simplified maximum flow algorithm is implemented to find the best pair of cells to be clustered that belong to a high-degree net.
PROPOSED DEGREE-BASED CLUSTERING ALGORITHM
Algorithm Motivation
The two major motivations for the algorithm proposed in this paper are reducing the length of high-degree nets and considering wire length while performing clustering. Most clustering algorithms focus on clustering cells with the highest connectivity. However, the connectivity measure usually favors low-degree nets. For example, Figure 2 illustrates that after one level of clustering using FirstChoice, best-choice, Net Cluster and hyperedge coarsening for ibm11 in the ICCAD04 benchmark suite, more than 65% of the clustered nets have degree 2 [18].
However, the wire length contribution of high-degree nets is large compared to their share of the netlist. This is illustrated in Figures 3 (a) and (b), where nets are divided into four groups. The first group consists of nets with degree 2, which usually constitute 50% to 70% of the nets. The second group consists of nets with degrees between 3 and 5, making up 20% to 30% of the total number of nets. The last two groups are nets with degrees from 6 to 9 and nets with degree 10 or higher. The x-axis in Figures 3 (a) and (b) represents the 18 benchmarks, numbered from 1 to 18. The y-axis in Figure 3 (a) shows the percentage of nets belonging to each group, stacked on top of each other. The y-axis of Figure 3 (b) shows the wire length contribution of the nets of each group after placement using Capo 10.1 [15]. Comparing Figures 3 (a) and (b) shows that even though nets with degree 10 or higher constitute a small percentage of the total number of nets, they can contribute heavily, up to 40%, to the total wire length.
The first motivation behind the proposed technique is to focus on clustering cells in high-degree nets that are also estimated to be long. The main difference between the clustering algorithm in this paper and the algorithm proposed in [17] is that in [17], at the clustering stage, all nets with degree higher than an arbitrary threshold were considered for clustering. This can mislabel nets, because a high degree alone does not guarantee that a net will be long. However, since clustering comes before placement, the lengths of individual nets are not known a priori. In this paper, it is proposed to use a fast pre-placement wire length estimation to predict the length of nets and determine their eligibility for clustering.
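The comparison between the share of nets and the share of wire length per degree group can be computed as in the sketch below. The four groups follow the text; the function names and the input format, a list of (degree, length) pairs taken from some finished placement, are assumptions for illustration, and the numbers in the usage example are made up rather than taken from the benchmarks.

```python
GROUPS = ("2", "3-5", "6-9", "10+")


def degree_group(degree):
    """Map a net degree to one of the four groups used in the text."""
    if degree <= 2:
        return "2"
    if degree <= 5:
        return "3-5"
    if degree <= 9:
        return "6-9"
    return "10+"


def group_shares(nets):
    """Return {group: (fraction of nets, fraction of total wire length)}.

    nets: list of (degree, length) pairs; lengths come from a placed circuit.
    """
    count = {g: 0 for g in GROUPS}
    length = {g: 0.0 for g in GROUPS}
    for degree, net_length in nets:
        g = degree_group(degree)
        count[g] += 1
        length[g] += net_length
    total_count = sum(count.values())
    total_length = sum(length.values())
    return {g: (count[g] / total_count, length[g] / total_length) for g in GROUPS}


# Toy data: one degree-12 net is 25% of the nets but 60% of the wire length.
toy = [(2, 1.0), (2, 1.0), (3, 2.0), (12, 6.0)]
for group, (net_share, wl_share) in group_shares(toy).items():
    print(group, net_share, wl_share)
```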
Algorithm Procedure
A flow chart of the proposed algorithm is shown in Figure 4. The algorithm has three phases: pre-placement wire length estimation, predicted global net clustering, and clustering ratio adjustment.
Figure 4. Flow chart of the proposed algorithm (input: flat netlist)

Pre-placement Wire length Estimation
The first step of the proposed clustering algorithm is pre-placement individual wire length estimation. The main purpose of this phase is to predict which nets will be long in the final solution. These nets will be referred to as global nets, or nets that span a whole row. In [19] and [20], two individual net wire length estimation techniques are proposed. Experiments performed by the authors showed that the algorithm proposed in [19] is more accurate, and hence it was chosen for the proposed clustering. This technique first defines a base length for each net, which is proportional to its degree. Then, the estimated wire length is further adjusted by considering local congestion metrics, such as the number of nets with different degrees in the neighborhood of the net, and global congestion metrics, such as the number of nets with different degrees in the circuit.
In the proposed clustering algorithm, pre-placement wire length estimation [19] is used to predict which nets can become global nets. In order to reduce runtime, the wire length estimation is only used for nets with degree higher than or equal to 6. According to Figure 3, these nets constitute between 25% and 35% of the nets, but conventional clustering techniques tend to ignore them, as seen in Figure 2.
To further reduce runtime, it is proposed to first calculate a threshold degree at which the base length of a net becomes larger than the row length. This threshold degree is different for each circuit and can vary from 6 for IBM01 to 28 for IBM18. Any net with degree higher than this threshold is automatically considered for the proposed clustering. This modification resulted in a 5% to 10% reduction in the total estimation calculation. Once the threshold degree has been decided, individual length estimates are calculated for all nets with degree between 6 and the threshold degree. Nets whose estimated length is longer than the row length of the circuit are labeled as global nets.
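The two-stage labeling described above can be sketched as follows. The estimator of [19] is not reproduced here: `estimate_length` is a hypothetical callable standing in for it, and the base length is assumed to be `length_per_pin * degree` with a made-up constant, since the text only states that it is proportional to the degree. The sketch also assumes that nets below the threshold are compared to the row length using their adjusted estimate.

```python
import math

MIN_DEGREE = 6  # nets below this degree are never estimated (from the text)


def threshold_degree(row_length, length_per_pin):
    """Smallest degree whose assumed base length exceeds the row length."""
    return math.floor(row_length / length_per_pin) + 1


def label_global_nets(net_degrees, row_length, length_per_pin, estimate_length):
    """Return (set of global net ids, number of nets that needed an estimate).

    net_degrees     : dict net id -> degree
    estimate_length : hypothetical callable, net id -> estimated length
    """
    threshold = threshold_degree(row_length, length_per_pin)
    global_nets, estimated = set(), 0
    for net_id, degree in net_degrees.items():
        if degree < MIN_DEGREE:
            continue  # left to the low-degree clustering phase
        if degree > threshold:
            global_nets.add(net_id)  # accepted without running the estimator
            continue
        estimated += 1
        if estimate_length(net_id) > row_length:
            global_nets.add(net_id)
    return global_nets, estimated


degrees = {"n1": 2, "n2": 7, "n3": 8, "n4": 20}
fake_estimates = {"n2": 60.0, "n3": 130.0}  # stand-in values, not real data
print(label_global_nets(degrees, 100.0, 10.0, fake_estimates.__getitem__))
```

In this toy run only two of the four nets reach the estimator, which is the source of the runtime saving: very high-degree nets skip estimation entirely.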
Predicted global net clustering
The main purpose of this phase is to find and cluster cells on global nets that are also connected via other nets. Clustering an entire global net can result in large, loosely connected clusters. Therefore, only the cells in the net that have other connections are clustered. First, global nets are sorted in descending order of degree. Each cell of a global net is visited and considered as a seed cell. The nets connected to a seed cell are visited to find whether this seed cell has alternative connections with other cells belonging to the same global net. If such cells exist, they are grouped in a cluster. For each unclustered cell without direct alternative connections on the global net, the maximum flow between the cell and its neighbors belonging to the same global net is examined. The maximum flow is approximated by calculating the minimum cut between the cell and its neighbors in a subgraph containing the first-level and second-level neighborhoods of the cell. Then, the cell is clustered with the neighbor with the highest maximum flow, which serves as the score. In this paper, it is proposed that cells adjacent to nets with degree 2 are not considered, since in phase III a second level of clustering is performed which mainly focuses on degree-2 nets. The proposed clustering continues until all global nets are visited.
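The two searches in this phase, direct alternative connections and flow-based scoring of indirect ones, can be sketched as below. The paper does not specify its simplified maximum flow procedure, so this sketch substitutes a plain Edmonds-Karp search on a unit-capacity clique expansion of the nets within two levels of the seed cell; that graph model, the omission of the degree-2 exclusion, and all function names are assumptions for illustration.

```python
from collections import defaultdict, deque


def index_cells(nets):
    """Build cell -> set of net ids from net id -> set of cells."""
    cell_nets = defaultdict(set)
    for net_id, cells in nets.items():
        for cell in cells:
            cell_nets[cell].add(net_id)
    return cell_nets


def direct_partners(seed, global_net, nets, cell_nets):
    """Cells of the global net that share at least one other net with seed."""
    partners = set()
    for net_id in cell_nets[seed]:
        if net_id != global_net:
            partners |= nets[net_id] & nets[global_net]
    partners.discard(seed)
    return partners


def local_graph(seed, global_net, nets, cell_nets):
    """Clique-expand the nets touching seed or its first-level neighbors."""
    level1 = {c for n in cell_nets[seed] if n != global_net for c in nets[n]}
    region = {n for c in level1 | {seed} for n in cell_nets[c]} - {global_net}
    cap = defaultdict(lambda: defaultdict(int))
    for net_id in region:
        cells = sorted(nets[net_id])
        for i, u in enumerate(cells):
            for v in cells[i + 1:]:
                cap[u][v] += 1  # one unit of capacity per shared net
                cap[v][u] += 1
    return cap


def max_flow(cap, source, sink):
    """Edmonds-Karp: repeatedly augment along shortest residual paths."""
    residual = {u: dict(vs) for u, vs in cap.items()}
    flow = 0
    while True:
        parent, queue = {source: None}, deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, c in residual.get(u, {}).items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return flow  # no augmenting path left; flow equals the min cut
        bottleneck, v = float("inf"), sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]
        v = sink
        while parent[v] is not None:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] = residual[v].get(u, 0) + bottleneck
            v = u
        flow += bottleneck


def best_flow_partner(seed, global_net, nets, cell_nets):
    """Pick the global-net cell with the largest local flow to seed."""
    cap = local_graph(seed, global_net, nets, cell_nets)
    best, best_score = None, 0
    for cand in sorted(nets[global_net] - {seed}):
        if cand in cap:
            score = max_flow(cap, seed, cand)
            if score > best_score:
                best, best_score = cand, score
    return best, best_score


nets = {"g": {"a", "b", "c", "d"}, "n1": {"a", "b", "x"},
        "n2": {"c", "y", "z"}, "n3": {"y", "d", "w"}, "n4": {"z", "d", "v"}}
cells = index_cells(nets)
print(direct_partners("a", "g", nets, cells))    # a and b also share n1
print(best_flow_partner("c", "g", nets, cells))  # c reaches d through y and z
```

In the toy netlist, cell c has no direct alternative connection on net g, but two disjoint paths lead from c to d through the intermediate cells y and z, so d receives a flow score of 2 and would be chosen as the clustering partner of c.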
Clustering ratio adjustment
Since the proposed clustering algorithm only visits a limited number of nets, it is unable to significantly reduce circuit sizes. Therefore, it is proposed to follow it with a clustering algorithm that mainly clusters low-degree nets. In this paper, Net Cluster (NC) [9] has been chosen to find clusters of low-degree nets. The main reason for choosing NC over other clustering algorithms such as best-choice and FirstChoice is that NC is very efficient in clustering nets with degrees from 2 to 6, as shown in Figure 2, and was proven effective as a preprocessing step [21].
EXPERIMENTS
The proposed degree-based clustering algorithm was implemented and tested as a preprocessing clustering algorithm on the ICCAD04 benchmark suite. First, the benchmarks are clustered by the proposed clustering algorithm. Then, the clustered netlists are fed to four academic placers: Capo10.1 [15], Fengshui5.1 [16], mPL6 [22], and NTUplace3-LE [23]. The placement results of the clustered circuits are mapped back onto the original circuits, and the detailed placer in Capo10.1 is used to refine the final placement results. The tests are performed on a 2.8 GHz Intel Pentium 4 CPU running RedHat Linux. In this section, the wire length is modeled using the Batched Iterated 1-Steiner Tree (BI1ST) algorithm [24]. This is because the half-perimeter wire length (HPWL) estimate usually underestimates the wire length contribution of nets with degree higher than 3, whereas the BI1ST estimate is more accurate.
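For reference, the HPWL estimate mentioned above is simply the half perimeter of the bounding box of a net's pins. The sketch below shows it in a few lines; the function name and the pin format, a list of (x, y) tuples, are assumptions for illustration. BI1ST itself is not reproduced here.

```python
def hpwl(pins):
    """Half-perimeter wire length of one net.

    pins: list of (x, y) pin coordinates (hypothetical input format).
    """
    xs = [x for x, _ in pins]
    ys = [y for _, y in pins]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))


def total_hpwl(nets):
    """Sum of HPWL over all nets, each given as a list of pin coordinates."""
    return sum(hpwl(pins) for pins in nets)


# A 4-pin net on the corners of a unit square has HPWL 2, although any
# rectilinear tree connecting the four corners needs a length of at least 3.
print(hpwl([(0, 0), (1, 0), (0, 1), (1, 1)]))
```

For nets with 2 or 3 pins the bounding-box half perimeter equals the length of an optimal rectilinear Steiner tree, which is why the underestimation only appears for nets of higher degree.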
To show the effectiveness of the proposed algorithm, Table 1 compares the wire length of different groups of nets, averaged over all circuits, after one level of the proposed clustering or one level of Net Cluster. For both the proposed algorithm and NC, the average wire length of four groups of nets is compared to the original average wire length. Although the two clustering algorithms achieve similar average improvements for the different placers, as shown in the row 'all nets', their effects on the different groups of nets differ. For the proposed clustering, the improvement is achieved by shortening the wire length of almost all net groups, but mainly of the high-degree nets. In contrast, the improvement for NC comes from shortening low-degree nets rather than high-degree nets. Hence, Table 1 verifies the effectiveness of the proposed clustering algorithm in reducing the length of high-degree nets.
Table 2 shows the cell clustering ratio and the clustering runtime for one level and two levels of clustering. Columns 2 and 3 give the cell clustering ratio (CCR) for each benchmark after one level and two levels of clustering. One level of the proposed clustering reduces the circuit size by only up to 9.6%. After two levels of clustering, the cell clustering ratio is adjusted and the circuit size can be reduced by up to 50%. Columns 4 and 5 give the runtime for clustering each circuit using only one level of the proposed algorithm or two levels consisting of the proposed algorithm and NC.
The final BI1ST placement results and runtimes with and without the proposed preprocessing step are given in Tables 3-6 for each placer. In Tables 3-6, the original wire length and runtime without the proposed clustering are shown for each academic placer under the respective 'orig' columns. For each placer, the BI1ST placement result comparisons after applying only one level of the proposed clustering, or two levels consisting of the proposed clustering and NC, are given under the '1-level WL comp.' and '2-level WL comp.' columns. Columns 5 and 6 in Tables 3-6 give the runtime comparisons when performing placement with one-level and two-level clustering.
It can be seen that the placement results for most of the benchmarks are improved. For Capo10.1 in Table 3, the placement results have been improved by 2% for one-level and 1% for two-level clustering. This is the smallest improvement achieved. We believe that this may be because Capo focuses more on generating routable placement results, and hence already pays more attention to high-degree nets. For Fengshui in Table 4, the average improvement is around 5% for one level and 4% for two levels. Moreover, placement results have been improved for all circuits with one-level clustering. The largest improvement for Fengshui is 17% for ibm02 at both one level and two levels. For mPL6 and NTU in Tables 5-6, an average wire length improvement of more than 2% has been achieved for one level.
For runtime, it can be seen that there is a runtime increase for one-level proposed clustering. However, after two-level clustering, the total runtime is improved. In Table 3, it can be seen that, without sacrificing final placement results, the runtime can be improved by up to 7% for Capo and 4% for NTU. For Fengshui and mPL6, the runtime is only increased by 4% and 2%, respectively. However, compared to the runtime of one level, the runtime has been improved by up to 25%.
CONCLUSIONS AND FUTURE WORK
In this paper, a degree-based clustering preprocessing algorithm for placement has been proposed. Three major modifications are made to improve the original clustering algorithm published in [17] . Finally, the effectiveness of the proposed algorithm is verified by using four academic placers and the ICCAD04 benchmark suite.
A future direction for this research can include combining this clustering algorithm with congestion constraints to investigate how much improvement placers can obtain while guaranteeing routability.
