I. INTRODUCTION

WITH increased system complexity, circuit hypergraph partitioning, which is the "divide" step of the divide-and-conquer paradigm, plays a crucial role in many design tasks [8]. The problem is to divide the nodes of a graph into several roughly equal parts; the traditional objective is to minimize cutsize. Among many partitioning methods, multilevel approaches (e.g., MLPart [3] or hMetis [15]) are considered effective for cutsize minimization. Such methods perform a sequence of netlist coarsening and uncoarsening steps with FM-based partitioning refinement at each level of the uncoarsening hierarchy. While these algorithms outperform other algorithms in cutsize, they cannot in general guarantee small delay. For performance-driven contexts, the hypergraph partitioner must consider the impact of implied interconnects¹ on performance. The primary objective of a performance-driven partitioner is to minimize path delay on timing paths. Recently, several performance-driven partitioning methods have been proposed. Most of these methods do not consider cutsize, and no attractive cutsize-delay tradeoff (let alone transparent consideration of path delay within traditional cutsize-driven approaches) has been discovered. In particular, existing performance-driven partitioners either try to modify the input of multilevel FM partitioners, by means such as reweighting [2], or else apply novel approaches such as min-delay clustering [9]. However, these approaches may be impractical because of large cutsize [7] or large runtime [2]. Furthermore, improvements in timing are often not obvious. The goal of our work is to find a performance-driven partitioner that can provide a more attractive, and hopefully tunable, cutsize/delay tradeoff. Our contribution is summarized as follows.
1) We define the concept of a V-shaped node in a partitioning solution, as well as its generalization to distance-k V-shaped nodes. We observe that even a few V-shaped or distance-k V-shaped nodes in the partitioning solution may significantly increase path hopcounts across the cutline. This suggests improving performance by eliminating such nodes.
2) We propose a new algorithm to eliminate, or at least reduce, V-shaped nodes. Instead of modifying the input of MLPart, we modify the MLPart algorithm itself by changing the gain function. We use a "look-ahead" algorithm reminiscent of CLIP [11] to eliminate distance-k V-shaped nodes. We also propose to reweight the nets whose fanout nodes are "X-nodes" to further reduce the hopcount.
3) Our method is easily implementable within standard FM with little cutsize and runtime penalty. We focus on the flat bipartitioning engine context, but our result can be applied within multilevel or any other framework that invokes standard FM. Our approach also extends easily to multiway implementations. Our experimental results show that this method can achieve an average of 39% hopcount reduction on industry testcases, with negligible implementation effort and negligible impact on cutsize and runtime.
4) We incorporate the new partitioner into the CapoT placer [19] and evaluate circuit delay using a commercial static timing analyzer, Cadence Pearl v5.1 [21]. The experimental results show that the delay is significantly reduced, with very acceptable impacts on wirelength or runtime.
¹ That is to say, any cut net will correspond to an interblock wire, or a wire that has some expected length that depends on the size of the given partitioning instance.
0278-0070/04$20.00 © 2004 IEEE
II. NOTATION AND PROBLEM FORMULATION
Below, we use the following notation.
• denotes the circuit hypergraph.
• is the set of nodes representing components (e.g., cells) of the circuit.
• is the set of signal nets, where each net is a subset of nodes that are electrically connected by a signal.
• is the subset of combinational nodes, and is the subset of sequential nodes (or FF nodes); and .
• A bipartition of divides into two disjoint subsets and , such that ; the two subsets are also called Part 0 and Part 1.
• For a net , where is the fanout node whose output signal is the input signal to , we say that is the input of , and that each is an output of .
• If node is an input of node , then we say that there is a directed edge from to .
• PI denotes the set of primary inputs, and PO denotes the set of primary outputs. For purposes of path timing analysis we treat the nodes of PI, PO, and FF as the end points of timing paths, i.e., the circuit delay is the longest combinational path delay from any FF or PI output to any FF or PO input. We generically refer to timing paths as FF-FF paths.
• is a directed path from to , if there exists a directed edge from to , . We say that the length of is .
• Let be a directed path from to . If , , is a combinational directed path.
• Let be a directed path from to . If , and , , is a FF-FF path.
• A combinational node is a distance-k V-shaped node, or V_k-node, if it satisfies: 1) , , such that there is a directed edge from to and a combinational directed path from to whose length is k and 2) and are both in the other partition with respect to . For the special case of k = 1, is called a V-shaped node.
• A combinational node is an X-shaped node, or X-node, if it satisfies: 1) , and , and 2) there are directed edges from and to , and from to and .
• the total area of all the nodes in . the total area of all the nodes in .
• If , , such that and , then is a cut net of the bipartition .
• The cutset of a bipartition is is a cut net of . The cutsize of the bipartition is .
• A directed edge from to is a hop, if is the input to in a cut net.
• denotes the hopcount of an FF-FF path , i.e., the number of hops in .
• the maximum value of over all FF-FF paths in .
• A critical path is an FF-FF path whose hopcount is equal to .
• Suppose a bipartition has a cut on , and let be the fanout node whose output signal is the input to the rest of the nodes in . If , the cut direction is indicated as ; if , the cut direction is indicated as .
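The cutsize and hopcount definitions above can be illustrated with a small sketch. Everything named here is an assumption made for illustration: each net is stored as a hypothetical `(driver, sinks)` pair, `part` maps a node to 0 or 1, `endpoints` is the set of PI, PO, and FF nodes, and the combinational logic between endpoints is assumed to be acyclic. Following the definition literally, an edge counts as a hop whenever the net that carries it is a cut net.

```python
from functools import lru_cache

def cut_nets(nets, part):
    """Names of nets that have nodes in both parts."""
    return {name for name, (driver, sinks) in nets.items()
            if any(part[s] != part[driver] for s in sinks)}

def max_hopcount(nets, part, endpoints):
    """Largest number of hops on any endpoint-to-endpoint path.

    Assumes the logic between endpoints is acyclic (illustrative only).
    """
    cut = cut_nets(nets, part)
    succ = {}
    for name, (driver, sinks) in nets.items():
        for s in sinks:
            # An edge is a hop if its net is a cut net.
            succ.setdefault(driver, []).append((s, name in cut))

    @lru_cache(maxsize=None)
    def best_from(u):
        best = 0
        for w, is_hop in succ.get(u, ()):
            # A timing path ends as soon as it reaches an endpoint.
            tail = 0 if w in endpoints else best_from(w)
            best = max(best, int(is_hop) + tail)
        return best

    return max((best_from(s) for s in endpoints), default=0)

# Hypothetical chain ff1 -> a -> b -> ff2 with node a alone in Part 1.
nets = {"n1": ("ff1", ["a"]), "n2": ("a", ["b"]), "n3": ("b", ["ff2"])}
endpoints = frozenset({"ff1", "ff2"})
before = {"ff1": 0, "a": 1, "b": 0, "ff2": 0}
after = dict(before, a=0)  # move node a to Part 0
```

In this toy instance node `a` has its input and its output in the other part, so the single path crosses the cutline twice; moving `a` removes both hops.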
Performance-Driven Bipartition Problem (PDBP)
Given: Hypergraph ; area balance tolerance , a given parameter that constrains partition areas; and , a given parameter which captures the desired tradeoff between cutsize and path delay in the objective function.
Find: A bipartition which satisfies and minimizes
III. PREVIOUS PERFORMANCE DRIVEN METHODS
Most previous performance-driven partitioning approaches alter the netlist using logic replication, retiming or buffer insertion to meet delay constraints while minimizing the cutsize [7] - [10] , [16] . For example, Cong et al. [9] propose a global clustering-based partitioning algorithm. The basic idea is as follows.
• Construct a clustered circuit with the minimum clock period, and perform retiming and node duplication as possible.
• Perform cutsize-driven clustering on the clusters formed in the previous step.
• Perform simultaneous cutsize and delay refinement during cluster decomposition.
The method reduces delay by 16% while increasing cutsize by 17% (with retiming) compared with hMetis [9]. However, such methods can require substantial gate replication, which potentially increases die area. Some of these methods [7] tend to produce noticeably worse cutsize compared to multilevel FM partitioning.
Other approaches [2] , [13] have been proposed which do not change the netlist. Typically, a multilevel FM partitioner such as hMetis [15] is used with some modified (weighted) input. In the taxonomy of [2] , these approaches can be divided into net-based and path-based categories. Net-based partitioning approaches ( [18] , according to [2] ) define a criticality value for each net after timing analysis, while path-based approaches consider the criticality of paths instead of single nets [2] . All of these methods require timing analysis in order to find critical nets or paths, and then reweight the critical nets or paths so as to reduce the chances of cuts occurring. According to Ababei et al. [2] , their bipartitioning algorithm can reduce delay by 14% at the expense of an increase of 10% in cutsize and 139% in runtime, compared with hMetis [2] .
We observe that timing analysis can take a long time and that it is necessarily based on an inaccurate delay model. Delay models such as that of Cong et al. [9] (node delay , intrablock delay , and interblock delay ) may identify "critical paths" that incorrectly drive the timing analysis and, hence, the partitioner. Since we may not have enough information at the partitioning stage to make accurate delay estimates, for some testcases these algorithms can produce partitioning solutions with worse delay than solutions found by generic multilevel FM partitioning.²

Recently, some algorithms have been proposed which attempt to incorporate timing analysis results somewhat more directly into the FM partitioner [1]. For example, Ababei et al. [1] propose to perform timing analysis first in order to assign a "criticality" value to each edge. Then, the gain function of the FM partitioner is changed such that edges with higher criticality will not be cut. Since criticality is a global quantity, a reasonable criticality value can only be obtained through global timing analysis. Moreover, since the change of one node in the partitioning solution may affect the criticality values of many other nodes, the approach becomes complex and its convergence difficult, as witnessed by the level of improvements reported.
IV. SOLVING PDBP VIA ELIMINATION OF V_k-NODES
In a bipartition , if all nets in the cutset have the same cut direction, then we call a unidirectional bipartition. A key intuition stems from the fact that in a unidirectional bipartition, the hopcount of any FF-FF path is at most one. So, unidirectional bipartitioning is in some sense an "idealized goal" for performance-driven partitioning, and can be sought by, e.g., flow-based methods [17] and the KAFM algorithm [5], [6]. Unfortunately, a unidirectional bipartition tends to have much higher cutsize than multilevel FM solutions [6], and for some testcases no purely unidirectional solution exists. Therefore, we propose to relax the unidirectional condition to "locally unidirectional." We call a bipartition without any V_k-nodes (defined in Section II above) a distance-unidirectional bipartition. Our intuition is that a tradeoff between cutsize and delay can be achieved by elimination or reduction of V_k-nodes in the partitioning solution.
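As a small illustration of the unidirectional condition, the sketch below collects the cut direction of every cut net and checks whether at most one direction occurs. The `(driver, sinks)` net representation, the `part` mapping, and the function names are assumptions made for this example only.

```python
def cut_directions(nets, part):
    """Set of (from_part, to_part) directions over all cut nets."""
    directions = set()
    for driver, sinks in nets.values():
        if any(part[s] != part[driver] for s in sinks):
            # The direction of a cut net is fixed by the side of its driver.
            directions.add((part[driver], 1 - part[driver]))
    return directions

def is_unidirectional(nets, part):
    """True if all cut nets share the same cut direction."""
    return len(cut_directions(nets, part)) <= 1

# Hypothetical chain ff1 -> a -> b -> ff2.
nets = {"n1": ("ff1", ["a"]), "n2": ("a", ["b"]), "n3": ("b", ["ff2"])}
zigzag = {"ff1": 0, "a": 1, "b": 0, "ff2": 0}    # cuts in both directions
one_way = {"ff1": 0, "a": 0, "b": 1, "ff2": 1}   # a single 0-to-1 cut
```

In the `one_way` assignment every path crosses the cutline at most once, which is the property the text relies on.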
A. V-Shaped Node Elimination
For any node , let be the set of nets to which is connected that lie entirely in the current partition of , and let be the set of nets that belong to the cutset and for which is the only incident node in the partition of . The traditional gain function in FM partitioning is for all nodes . FM partitioning [12] starts with a random initial partition and iteratively checks the node with maximum gain to see whether moving it to the other part will violate the area balance constraint. If not, the node is moved to the other part, otherwise, the node with maximum gain in the other subset will be moved. Every node is locked after moving, and the process continues until all nodes are locked. Then, all prefix sums are calculated, and is chosen such that is maximum (all node moves after the are undone). This process is called a pass, and FM partitioning repeats passes until . In general, the partitioning solutions returned by a multilevel FM partitioner, such as MLPart [4] , have good cutsize. However, for some testcases, MLPart tends to produce solutions with high values. We have analyzed critical paths and consistently found that:
1) there are a few V-shaped nodes in the partitioning solutions;
2) every V-shaped node is included in many critical paths;
3) almost every critical path contains one or more V-shaped nodes;
4) MLPart cannot eliminate these V-shaped nodes due to the traditional gain function.
Based on these observations, we propose to improve timing by local biasing of FM to eliminate some (not necessarily all) V-shaped nodes. For example, in Fig. 1(a), node is a V-shaped node. Suppose that the area balance tolerance is 0.35. MLPart cannot move node to Part 0 since there are directed edges from nodes and to . The traditional gain values of nodes - are 0, 0, 0, 1, 1, 2. Since the smallest cutsize is already achieved and no improvement is available, the FM partitioner will stop here. Of the directed paths passing through , will have two cut nets; and will each have one cut net. However, if we move node to Part 0, as shown in Fig. 1(b), although the cutsize remains the same, the two cut nets on the path are saved, while the numbers of cut nets on the other two paths remain at one. We see that unlike timing-analysis based algorithms, which may increase the hopcounts of near-critical paths when the hopcounts of critical paths are reduced, elimination of V-shaped nodes can improve timing without any negative effect.
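The traditional gain described at the start of this subsection can be sketched as follows, assuming the standard FM form: the number of cut nets on which the node is alone on its side, minus the number of uncut nets that would become cut. The representation (each net as a set of node names) and the function name are hypothetical.

```python
def fm_gain(v, nets, part):
    """Traditional net-cut gain of moving node v to the other part."""
    gain = 0
    for pins in nets:
        if v not in pins:
            continue
        same = sum(1 for u in pins if part[u] == part[v])
        other = len(pins) - same
        if other > 0 and same == 1:
            gain += 1   # v is alone on its side: the net becomes uncut
        elif other == 0 and same > 1:
            gain -= 1   # net lies entirely on v's side: it becomes cut
    return gain

# Hypothetical instance: node a touches three two-pin nets.
nets = [{"a", "b"}, {"a", "c"}, {"a", "d"}]
part = {"a": 0, "b": 1, "c": 1, "d": 0}
gains = {v: fm_gain(v, nets, part) for v in part}
```

Because this gain only counts cut nets, it cannot distinguish between two solutions of equal cutsize that differ in how often a path crosses the cutline, which is the limitation the observations above point to.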
We believe that this effect is increasingly important in recent industry testcases: partitioning solutions have much worse timing if they do not consider V-shaped nodes. However, this effect is not apparent for testcases in which most gates have only two inputs. For example, in Fig. 1(a), if node is removed, the gain value of node will be 1, and the FM partitioner will move node to Part 0. For such testcases, there will be very few V-shaped nodes in the MLPart partitioning solution. Our method may, therefore, be more suited to the "true" underlying netlist topology after synthesis, and in fact is not effective if the netlist has been reduced to some sort of generic 2-input gate variant.³

[Table I: Basic properties of testcases]
B. Elimination of Generalized V-Shaped Nodes
For some testcases, eliminating V-shaped nodes is not sufficient, since many V_k-nodes with k > 1 remain in the partitioning solutions after elimination of V-shaped nodes. Moving these nodes can make the solutions better. Ideally, we hope that no two cut nets with different directions (one cut net in each direction) are located too close to each other in a path.⁴ Therefore, we want to eliminate V_k-nodes, with k as large as possible. However, the number of V_k-nodes will increase dramatically with k, which means that we likely need to change the pure mincut-driven solution more substantially with large k. Another problem associated with large k is that movement of one node may affect many other nodes, again leading to increased cutsize and complexity.⁵ Currently, we do not believe that it is practical to be concerned with k > 2.

To eliminate V_k-nodes, we effectively need to move more than one node. For example, in Fig. 2, we need to move nodes , , in order to eliminate -node . If just node is moved, will become a new -node. For any V_k-node v, we denote the set of nodes whose movement is required to eliminate v as MS(v), and we require . In Fig. 2, .⁶ Therefore, after moving one node , we need to use a "look ahead" algorithm to, in effect, move the rest of the nodes in MS(v). We achieve this by means similar to the CLIP algorithm of Dutt et al. [11]. That is, we reset the gains of all nodes to zero after initial gain calculation such that the gain of a node after update only reflects its goodness for moving with regard to the nodes currently being moved. For example, in Fig. 2, after moving the node and gain update, the node has the largest gain.
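To make the need for moving several nodes concrete, here is a minimal sketch of a test for the distance-k shaped nodes of Section II (written V_k-nodes here). It is one reading of the definition, stated as an assumption: a combinational node qualifies if some input lies in the other part and some node reached by a combinational directed path of exactly k edges also lies in the other part. The adjacency maps `preds` and `succs`, the `sequential` set, and the node names are hypothetical, and the combinational logic is assumed acyclic.

```python
def is_vk_node(v, k, preds, succs, part, sequential=frozenset()):
    """True if combinational node v is a distance-k shaped node."""
    if v in sequential:
        return False
    other = 1 - part[v]
    # Condition on the input side: some driver of v is in the other part.
    if not any(part[u] == other for u in preds.get(v, ())):
        return False
    # Level-by-level expansion: nodes reached by exactly k edges,
    # never passing through (or ending at) a sequential node.
    frontier = {v}
    for _ in range(k):
        frontier = {w for x in frontier for w in succs.get(x, ())
                    if w not in sequential}
    return any(part[w] == other for w in frontier)

# Hypothetical chain u -> v -> a -> w with v and a in Part 0.
preds = {"v": ["u"], "a": ["v"], "w": ["a"]}
succs = {"u": ["v"], "v": ["a"], "a": ["w"]}
part = {"u": 1, "v": 0, "a": 0, "w": 1}
only_v_moved = dict(part, v=1)       # move v alone
both_moved = dict(part, v=1, a=1)    # move v and a together
```

In this toy chain `v` is a distance-2 node. Moving `v` alone turns `a` into a distance-1 node, whereas moving `v` and `a` together leaves no such node, which mirrors the role of the move set in the text.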
C. New Gain Function Calculation
To achieve what we call "distance-unidirectional bias", we change the gain function to: Gain for each node Here, is the traditional net-cut based gain function. The user-defined coefficients weight the attention paid to different V-shapes (if only is positive, then we have the original FM algorithm), and is the reduction in the number of V_k-nodes after moving . For any node , we denote the set of inputs of as . To simplify the description, we assume that . The procedure to calculate is shown in Fig. 3. In Steps 1-3 of the procedure, if is a combinational node and its input set is not empty, we find all the nodes within distance k of using BFS. All sequential nodes and their descendants will be removed from the BFS tree. Then, for every level from 1 to k, we check whether is a V_k-node in Steps 6-8. If so, . In Steps 9-11, we check whether will be a V_k-node if it is moved to the other part. The value of can be −1, 0, or 1, which represents the reduction of the number of V_k-nodes in the hypergraph due to the move of . To analyze the time complexity of the procedure, assume that the maximum fanin is and the maximum fanout is . For every node, checking the inputs will take at most time and BFS will take at most time. Therefore, the total time needed for calculating is .

⁵ In fact, the experimental results for k = 3 and k = 4 show that the runtime and cutsize become worse while the delay is not improved compared with results of k = 2.
⁶ We require v ∈ MS(v) in our approach. Other approaches such as moving nodes a or d are not considered since nodes a and d may have neighbors and we wish to avoid the complexity of checking all possible configurations.

Distance-Unidirectional Bipartitioning: Summary and Time Complexity: Our algorithm to achieve distance-unidirectional bias in the bipartitioning (reminiscent of CLIP [11] in how it induces movement of clusters across the cutline) is summarized in Fig. 4. Time complexity may be analyzed as follows.
We use the same gain bucket list structure as proposed in [12] . The maximum possible gain is Gain , where is the maximum possible traditional gain. The time for calculating initial gain is , where is the number of nodes in the hypergraph.
Gain time is needed to reset the gains of all nodes to zero, since we only need to remove all linked lists from buckets and concatenate them to the bucket of zero gain. Moving gains also takes Gain time. Since moving one node only affects the gains of the nodes within distance , we need to update at most nodes in every iteration. Because every node can be moved at most once, the total time should be Gain . Therefore, our algorithm takes linear time per pass, just as in the original FM algorithm. The negligible impact on runtime is confirmed in the next section.
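The biased gain of this subsection can be sketched as a weighted sum of the traditional gain and, for each distance k, the reduction in the number of distance-k shaped nodes caused by moving the node. The exact formula is not legible in this text, so the form below is an assumption: the coefficient tuple `alphas`, the helper names, and the adjacency maps are hypothetical, and the reduction term only looks at the moved node itself, which is why it takes values in {−1, 0, 1}.

```python
def would_be_vk(v, k, side, preds, succs, part, sequential):
    """True if combinational node v, placed on `side`, is a distance-k node."""
    if v in sequential:
        return False
    other = 1 - side
    if not any(part[u] == other for u in preds.get(v, ())):
        return False
    frontier = {v}
    for _ in range(k):
        frontier = {w for x in frontier for w in succs.get(x, ())
                    if w not in sequential}
    return any(part[w] == other for w in frontier)

def delta_vk(v, k, preds, succs, part, sequential=frozenset()):
    """Reduction (-1, 0, or 1) in distance-k nodes if v is moved."""
    now = would_be_vk(v, k, part[v], preds, succs, part, sequential)
    moved = would_be_vk(v, k, 1 - part[v], preds, succs, part, sequential)
    return int(now) - int(moved)

def biased_gain(v, g0, alphas, preds, succs, part, sequential=frozenset()):
    """alphas[0] weights the traditional gain g0; alphas[k] weights delta_vk."""
    bias = sum(alphas[k] * delta_vk(v, k, preds, succs, part, sequential)
               for k in range(1, len(alphas)))
    return alphas[0] * g0 + bias

# Hypothetical chain u -> v -> w with v alone in Part 0.
preds = {"v": ["u"], "w": ["v"]}
succs = {"u": ["v"], "v": ["w"]}
part = {"u": 1, "v": 0, "w": 1}
```

With `alphas = (1, 0, 0)` the sketch reduces to the traditional gain, matching the remark that only the first coefficient being positive recovers the original FM algorithm. The coefficient values used in the test are arbitrary.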
D. X-Nodes Reweighting
In our analysis of critical paths, we notice that there are a few X-nodes, as shown in Fig. 5. These are nets that "straddle a cutline" and which should be penalized. Since moving node from Part 1 to Part 0 will increase the hopcount of the path by two, the reduction of hopcount cannot be obtained by simply moving the node . A node can be identified as an X-node if both Part 0 and Part 1 have at least one input and one output of . In order to eliminate X-nodes, we propose to increase the weights of the nets whose fanout nodes are X-nodes, such as the net in Fig. 5, in order to constrain the cutsize-driven partitioner not to cut these nets.⁷ Performing X-nodes reweighting and V_k-nodes elimination simultaneously is difficult since both of these operations will change the gain structures for the netlist.⁸ Thus, we perform X-nodes reweighting after V_k-nodes elimination, at the expense of increased runtime. Initially, the weights of all nets are set to 1. We reset the weight of each net whose fanout node is an X-node to a given constant . The new gain function is: Gain for each node if belongs to one net whose fanout node is an X-node and moving can save the net from being cut; Gain otherwise. Here, is the traditional gain function of . We then use the algorithm specified in Fig. 4 to obtain the final bipartitioning solution.
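The identification rule and the reweighting step can be sketched as below. This is an illustration under assumptions: nets are hypothetical `(driver, sinks)` pairs, "fanout node" is read as the driver of the net (as in Section II), the constant `w_x` stands for the unspecified weight, and the function names are not from the source.

```python
def is_x_node(v, preds, succs, part, sequential=frozenset()):
    """True if combinational v has an input and an output in each part."""
    if v in sequential:
        return False
    in_sides = {part[u] for u in preds.get(v, ())}
    out_sides = {part[w] for w in succs.get(v, ())}
    return in_sides == {0, 1} and out_sides == {0, 1}

def reweight_nets(nets, part, w_x, sequential=frozenset()):
    """Weight 1 for every net, except w_x for nets driven by such a node."""
    preds, succs = {}, {}
    for driver, sinks in nets.values():
        for s in sinks:
            succs.setdefault(driver, set()).add(s)
            preds.setdefault(s, set()).add(driver)
    return {name: (w_x if is_x_node(driver, preds, succs, part, sequential)
                   else 1)
            for name, (driver, _sinks) in nets.items()}

# Hypothetical instance: x has inputs a, b and outputs c, d on both sides.
nets = {"na": ("a", ["x"]), "nb": ("b", ["x"]), "nx": ("x", ["c", "d"])}
part = {"a": 0, "b": 1, "x": 1, "c": 0, "d": 1}
weights = reweight_nets(nets, part, w_x=5)
```

The resulting weights would then be fed to a weighted cutsize-driven pass, so that cutting the net driven by `x` costs more than cutting an ordinary net.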
V. EXPERIMENTAL RESULTS
The MLPart code of [4] was downloaded from the MARCO GSRC Bookshelf [20] and modified. The code is currently compiled and run on Solaris and Linux platforms. Total code modifications amounted to less than 2000 lines.
We tested our algorithm on four industry testcases given to us in LEF/DEF format. The testcase parameters are summarized in Table I. All tests were run on code compiled with the GNU gcc 2.95.2 compiler running on a 600-MHz Intel Pentium-III Xeon processor under the RedHat 7.3 Linux operating system. We use the model in [2] to calculate the delay. Table II shows the results of multiple single-start runs on the testcase "industry1" when run with , . The results show that around 30% improvement in hopcount and 23% improvement in delay are achieved while only slightly increasing cutsize and runtime.⁹

We also tested our algorithm with different values of , , and for the testcase "industry1" in order to find the best tuning of parameter values. In each test, we choose a different combination of values from the set {1, 3, 10, 30} for each of the three parameters, and run the code with ten independent random starts. The (cutsize, delay) pairs across all 64 combinations, averaged over ten starts for each combination, are given as a scatter plot in Fig. 6. The best tradeoff point is achieved with , , and . Empirically, we believe that good results are consistently achieved with , , and .

Table III gives the average results of MLPart [4] and reweighting for all the six testcases with ten random starts, comparing directly against the results of V_k-nodes removal and V_k-nodes removal plus X-nodes reweighting in Table IV.¹⁰ We set , , , and . The results show that our algorithm is very effective in reducing hopcount as well as delay. Across all testcases, the increase of cutsize (average of 5.1% after X-nodes reweighting) and runtime (average of 30.9%) is acceptable.
Finally, to address our original motivating application, we have studied the impact of the new partitioner within the framework of top-down, partitioning-based, timing driven placement. We have incorporated the new partitioner into CapoT [19] , a timing driven placer used in [14] . Table V compares the results of modified CapoT by using the new partitioner with the original CapoT. Circuit delay is evaluated by a commercial static timing analyzer, Cadence Pearl v5.1 [21] . The experimental results show that worst-case timing slack is increased with the new partitioner, while the increase of wirelength (average of 0.1%) and runtime (average of 15.9%) is quite moderate.
VI. CONCLUSION
In this paper, we have proposed a simple yet efficient timing-driven partitioning algorithm which does not rely on any global timing analysis. Since only local information is used in the algorithm, we achieve an effective return of solution quality versus runtime. By changing the gain function in the FM partitioner, we bias toward movement of some V_k-nodes in the FM partitioning solution across the cutline. We have observed that these "bad nodes," that is, V_k-nodes, contribute significantly to the delay of the whole circuit. Thus, our biasing approach improves timing by eliminating or minimizing such nodes. We also propose to reweight the nets whose fanout nodes are X-nodes to further reduce the hopcount and path delay. Experimental results show that our method significantly reduces path delay while keeping the cutsize and runtime almost the same as MLPart. To verify the effectiveness of our new partitioner, it is incorporated into a placer and results are evaluated by a commercial static timing analyzer. We observe that the circuit delay is reduced while the wirelength remains almost the same and the increase of runtime is moderate.
⁹ When using the simpler delay model of [9], we obtain an average of 20% improvement in delay over all testcases.
¹⁰ Programs described in [9] and [2] were not available for comparison. The reweighting code is obtained by modifying MLPart [20] according to the algorithm proposed in [2].
