Abstract - In this paper, we present a new performance-driven multilevel partitioning algorithm, which calculates the timing gain of a move in move-based partitioning strategies based on the aggregation of preferred signal directions. In addition, we propose a new timing-aware multilevel clustering algorithm that uses the connection strength of an edge as the primary objective, and the maximum depth or the maximum hop-count of any path containing the edge as a tiebreaker for the clustering step. These ideas are integrated into a general multilevel partitioning framework, which consists of three phases: coarsening, initial partitioning, and uncoarsening with refinement. Benchmark results show that, on average, we reduce delay by 14.6% while increasing the cutsize by 1.2% when compared to hMetis [1].
I. INTRODUCTION
Traditional partitioning approaches [1-3] have been quite successful in reducing cutsize. However, these techniques cannot adequately control the cut count of timing-critical paths in the circuit. Recently, timing-driven partitioning approaches [4-11] have been proposed to simultaneously consider cutsize and circuit delay. These works may be classified into two categories depending on whether or not they modify the netlist. Most of the reported works modify the netlist, i.e., they use logic replication, retiming, or buffer insertion techniques to reduce the circuit delay while minimizing the cutsize [6][7][8][11]. These methods tend to reduce the circuit delay significantly compared to cutsize-oriented methods such as hMetis [1]. However, the gate replication used by these methods can result in a significant increase in the chip area. In addition, some of these methods suffer from large cutsize [8] or large runtime [7][11]. Techniques that do not alter the circuit netlist normally give more weight to the edges that lie on the timing-critical paths in a circuit [9][10]. These techniques require an a priori classification of signal nets into timing-critical and non-critical ones based on a timing analysis of the circuit prior to partitioning. However, these methods are sensitive to the choice of the K most critical paths, since their performance in terms of cutsize, delay, and runtime depends heavily on the value of K [9]. A number of researchers [4][5] have used the signal direction as an indicator in the timing gain function during the move-based partitioning process. Examples include "backward edges" [4] and "V-shaped nodes" [5]. These early results motivate the use of signal direction to guide the performance-driven partitioning process.
In this paper, we present a new performance-driven multilevel partitioning algorithm, called PMP, which can minimize circuit delay efficiently by aggregating the preferred signal directions as implied by all input-output conduits in the circuit (see Section II for a formal definition of I/O conduits). Unlike the previous approaches [4][5], our timing gain function accounts for the effect of a cell move on the delay of all input-output conduits that go through that cell. This enables PMP to compute the timing gain of each candidate cell move accurately and efficiently. In addition, we propose a new timing-aware multilevel clustering algorithm that uses the connection strength of an edge as the primary objective, and the maximum depth or the maximum hop-count of any path containing the edge as a tiebreaker for the matching step.
Author contact: 1-213-740-9803, Email: pedram@usc.edu.
The benchmark results show that, when compared with hMetis, PMP reduces the circuit delay by 14.6%, while increasing the cutsize by only 1.2% on average.
II. SIGNAL DIRECTION CONSTRAINTS
In this section, we introduce the notion of a (preferred) signal direction, and the resulting constraint, which will be used to optimize circuit delay efficiently at the minimum cost of cutsize.
Let's denote the set of primary inputs of a circuit as PI, the set of primary outputs as PO, and the set of flip-flops as FF. In a bipartition, if the signal directions of all I/O conduits are satisfied, then the bipartition will be optimum in terms of the path delay. Figure 1 shows the signal direction constraints between a signal source node s(e) and a signal target node r(e) for an edge e. An I/O conduit σ1 from pi1 to po1 comprises a single topological path pi1 → v1 → v2 → v3 → po1. The minimum achievable cut count of σ1 is one since pi1 and po1 are located in different parts. The signal directions of the edges of σ1 should be from part M0 to part M1 in order to obtain this minimum cut count. Let P(vi) denote the part that node vi is assigned to, i.e., P(vi) = 0 if vi is put in M0; otherwise, P(vi) = 1. Notice that P(vi) of the source node vi of an edge e of σ1 should not be any larger than P(vj) of the target node vj of that edge. For an I/O conduit σ2, comprising a single path pi2 → v4 → v5 → v6 → po2, both the source and target nodes of the edges on σ2 should be put in M0 in order to satisfy the signal direction constraint of σ2 (the minimum achievable cut count of σ2 is zero). Based on this discussion, we define signal direction constraints (SDC's) for a vertical cut line in terms of SD(σ), the signal direction of a conduit σ, which is one of LL, RR, LR or RL. Clearly, LL (RR) implies that both the start and end nodes of the conduit are located in M0 (M1), whereas LR (RL) means that the start node of the conduit is in M0 (M1) while the end node of the conduit is in M1 (M0).
ASP-DAC 2005
[Figure 1: Signal direction constraints. Edges labeled in the figure include e5(pi2,v4), e6(v4,v5), e7(v5,v6), e8(v6,po2).]
Based on the above definitions, each edge on any topological path in an I/O conduit has the same preferred signal direction. Even though many topological paths of a conduit may pass through an edge, the edge has only one signal direction constraint (SDC) for the conduit. This is because all topological paths of a conduit also have the same preferred signal direction. Therefore, we need to consider only I/O conduits, rather than all topological paths in a circuit, when computing the SDC's of all edges. In general, an edge may belong to many I/O conduits in the circuit, say n, each assigning a preferred signal direction to the edge. Let's assume that n1 of these conduits are of type LL whereas n2, n3 and n4 are of types RR, LR and RL, respectively (n = n1 + n2 + n3 + n4). Clearly, n is polynomially bounded. Violating these signal directions causes signal direction violations (SDV's).
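The discussion above suggests the following per-edge reading of the constraints. This is a sketch reconstructed from the prose (M0 as part 0, M1 as part 1, and s(e), r(e) as the source and target of edge e); the exact form used by the authors is an assumption.

```latex
\[
\text{for every edge } e \text{ on conduit } \sigma:\qquad
\begin{cases}
P(s(e)) = P(r(e)) = 0, & SD(\sigma) = LL,\\
P(s(e)) = P(r(e)) = 1, & SD(\sigma) = RR,\\
P(s(e)) \le P(r(e)),   & SD(\sigma) = LR,\\
P(s(e)) \ge P(r(e)),   & SD(\sigma) = RL.
\end{cases}
\]
```

Under this reading, conduit σ1 of Figure 1 is of type LR and conduit σ2 is of type LL.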
A. Timing gain function of a move
A bipartition that satisfies all of the SDC's associated with the I/O conduits seldom exists for any realistic netlist. Even when such a solution exists, it tends to have a huge cutsize. Therefore, we must relax the constraints in order to obtain a smooth tradeoff between the circuit delay and cutsize.
Our proposed partitioner, PMP, employs an FM heuristic [12] in the uncoarsening phase, with a modified move gain function accounting for both the signal directions and the cutsize. To manage delay as the optimization objective rather than a constraint to be satisfied, we make use of the violation counts as defined above. More precisely, we define a timing gain function, TG(vi), to quantitatively evaluate the desirability of moving vi from M0 to M1.
This gain function is defined in terms of the violation counts: the timing gain of a move is the reduction in the total number of SDC violations that the move produces. Proof is straightforward and is omitted to save space. Notice that to calculate the timing gain for node vi, the contributions to delay of all I/O conduits containing vi should be aggregated.
As an example, consider moving node v3. Before v3 is moved, edges e2 and e4 do not satisfy the SDC's of their respective conduits. This is because both conduits have signal direction RR, but the source and target nodes of these two edges are not in M1. The number of SDC violations is thus four. After v3 is moved, the SDC is satisfied for both e2 and e4, while edges e1 and e3 do not satisfy the SDC of conduit σ1. This means that by moving v3, we are able to reduce the cut counts of the two RR conduits by two each, while the cut count of conduit σ1 is increased by two. As a result, the total number of SDC violations is reduced by two, i.e., the timing gain for the v3-move is two, TG(v3) = 2.
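The aggregation of violation counts can be illustrated with a minimal Python sketch. It assumes that the timing gain is the number of violations before the move minus the number after it, and that each (edge, conduit) pair whose preferred direction fails counts as one violation; the paper's exact counting convention may differ. The names `satisfies_sdc`, `violation_count`, and `timing_gain` are hypothetical.

```python
def satisfies_sdc(direction, p_src, p_tgt):
    """Check one edge against one conduit direction (assumed reading of the SDC's).

    p_src and p_tgt are the parts (0 = M0, 1 = M1) of the edge's source and target.
    """
    if direction == "LL":
        return p_src == 0 and p_tgt == 0
    if direction == "RR":
        return p_src == 1 and p_tgt == 1
    if direction == "LR":
        return p_src <= p_tgt
    if direction == "RL":
        return p_src >= p_tgt
    raise ValueError(f"unknown signal direction: {direction}")


def violation_count(node, part, edges):
    """Count violated (edge, conduit) pairs on the edges incident to node.

    edges is a list of (source, target, directions), where directions lists the
    signal direction of every conduit that passes through the edge.
    """
    count = 0
    for src, tgt, directions in edges:
        if node in (src, tgt):
            for d in directions:
                if not satisfies_sdc(d, part[src], part[tgt]):
                    count += 1
    return count


def timing_gain(node, part, edges):
    """Violations removed by moving node to the other part (assumed definition)."""
    before = violation_count(node, part, edges)
    moved = dict(part)
    moved[node] = 1 - part[node]
    after = violation_count(node, moved, edges)
    return before - after


# Toy example (not the circuit of the paper's figure).
part = {"a": 1, "b": 1, "c": 0, "v": 0}
edges = [("a", "v", ["RR"]), ("v", "b", ["RR"]), ("c", "v", ["LL"])]
print(timing_gain("v", part, edges))  # 2 violations before, 1 after -> prints 1
```

Only the edges incident to the moved node are revisited, since the move cannot change the status of any other edge.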
III. THE PMP ALGORITHM

A. Clustering considering timing criticality
We use the Heavy Edge Matching (HEM) algorithm of [14] in order to find a maximal matching in the netlist graph. The clustering phase greatly influences the quality of the final partitioning solution in terms of both delay and cutsize. In order to improve the delay, the clustering algorithm should consider the timing criticality of an edge as well as its connectivity strength. However, our experimental results have taught us that connectivity is the more important consideration at this stage and that delay should be used only as a tiebreaker when two or more candidate edges have equal weights. We now define the terminology needed to quantify the timing criticality of edges. Path depth is defined as the number of intermediate nodes in a path (excluding PI's, PO's and FF's).
Definition: The Maximum Depth of any Path of an edge (referred to as the MDP of an edge) is defined as the maximum logical depth of any path that goes through that edge.
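One possible way to compute the MDP of every edge is sketched below in Python. The sketch assumes an acyclic netlist graph in which PI's, PO's and FF's act as path endpoints and contribute no depth; the function name `edge_mdp` and the data layout are hypothetical.

```python
from functools import lru_cache


def edge_mdp(edges, terminals):
    """Map each directed edge (u, v) to the maximum depth of any path through it.

    edges: iterable of (source, target) pairs of an acyclic netlist graph.
    terminals: set of PI, PO and FF nodes; they end paths and add no depth.
    """
    edges = list(edges)
    succs, preds = {}, {}
    for u, v in edges:
        succs.setdefault(u, []).append(v)
        preds.setdefault(v, []).append(u)

    @lru_cache(maxsize=None)
    def up(n):
        # Intermediate nodes on the deepest path that ends at n (n included).
        if n in terminals:
            return 0
        return 1 + max((up(p) for p in preds.get(n, [])), default=0)

    @lru_cache(maxsize=None)
    def down(n):
        # Intermediate nodes on the deepest path that starts at n (n included).
        if n in terminals:
            return 0
        return 1 + max((down(s) for s in succs.get(n, [])), default=0)

    return {(u, v): up(u) + down(v) for u, v in edges}


# Toy example: pi1 -> v1 -> v2 -> v3 -> po1, plus a shortcut pi2 -> v2.
edges = [("pi1", "v1"), ("v1", "v2"), ("v2", "v3"), ("v3", "po1"), ("pi2", "v2")]
mdp = edge_mdp(edges, terminals={"pi1", "pi2", "po1"})
print(mdp[("v1", "v2")], mdp[("pi2", "v2")])  # prints: 3 2
```

Splitting the depth into an upstream and a downstream part lets all edges be evaluated in time linear in the size of the graph.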
PMP uses the MDP of edges as a tiebreaker when selecting an unmatched adjacent node uj of vi. In particular, consider that there are m matching candidates u1, ..., um. Suppose the first k of these candidates have the same edge weight, which is higher than any other edge weight. Among these candidates that are tied under the HEM selection criterion, u1, ..., uk, PMP chooses the one whose corresponding edge to vi has the highest MDP value.
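The selection rule can be sketched in Python as a lexicographic comparison: edge weight first, MDP second. The function name `pick_match` and the dictionaries keyed by (v, u) pairs are hypothetical choices made for this illustration.

```python
def pick_match(v, candidates, edge_weight, mdp):
    """Pick the unmatched neighbor of v to cluster with.

    The heaviest edge wins, as in HEM; among edges of equal weight,
    the one with the highest MDP wins.
    """
    if not candidates:
        return None
    return max(candidates, key=lambda u: (edge_weight[(v, u)], mdp[(v, u)]))


# u1 and u2 tie on edge weight; u2 has the higher MDP, so it is selected.
edge_weight = {("v", "u1"): 3, ("v", "u2"): 3, ("v", "u3"): 1}
mdp = {("v", "u1"): 4, ("v", "u2"): 7, ("v", "u3"): 9}
print(pick_match("v", ["u1", "u2", "u3"], edge_weight, mdp))  # prints: u2
```

Note that u3 is not chosen despite having the largest MDP, because MDP is consulted only among the heaviest edges.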
B. Initial partitioning
During the initial partitioning phase, a bisection of the coarsened hypergraph is computed to minimize cutsize while ensuring that each part contains roughly half of the node weight of the original graph. The node weight represents the area of a node. PMP does not consider delay at this stage because the coarse graph is very small (we set the threshold for the top-level size to less than 100) and too rough for a meaningful timing gain function to be calculated.
The initial partitioning solution is then used to decide the locations of the FF's. Recall that, in a typical top-down design flow, the locations of the PI's and PO's of the circuit are fixed, whereas the FF's are floating. For sequential circuits, the locations of the FF's should be fixed before calculating the SDC counts for all edges. Since the cutsize of a circuit greatly depends on the FF locations, we must assign these locations carefully. Therefore, we performed a number of experiments to assess how much pre-fixing the FF locations affects the cutsize in a multilevel partitioning scenario. To save space, we only summarize the results here. When the FF locations were pre-fixed according to the result of the initial partitioning, the increase in cutsize stayed within 10% on average for the benchmark circuits (see Table 1), compared to the case of not pre-fixing the FF locations.
C. Uncoarsening with a new gain function
During the uncoarsening phase, the partitioning solution of the coarsest graph is projected back to the original graph by going through the multiple levels of the hierarchy that was constructed during the coarsening phase. At each level, bipartition refinement starts from the projected partition of the level above it as an initial partition. We use standard FM as the bipartition refinement algorithm. Since our goal is to trade off smoothly between delay and cutsize, we use a move gain function for moving node vi that linearly combines CC(vi), the standard cutsize gain function, with the timing gain TG(vi) normalized by λ(G). Here, α is a weight coefficient, and λ(G) is the average number of I/O conduits going through any node in a circuit graph G. Note that we must normalize TG(vi) by λ(G) in order to be able to pick a fixed weighting coefficient across different benchmark circuits. We multiply the linear combination of cutsize gain and normalized timing gain by 100 and then take the ceiling. The idea is to produce an integer value for the move gain of a node that can differentiate between different moves. In our experiments, we used α = 0.87 for best results.
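The combination described above can be sketched in Python as follows. Which of the two terms is weighted by α and which by 1 - α is not recoverable from the text, so the assignment below (α on the cutsize gain) is an assumption, as are the function and parameter names.

```python
import math


def move_gain(cutsize_gain, timing_gain, avg_conduits, alpha=0.87):
    """Integer move gain from cutsize gain and normalized timing gain.

    ASSUMPTION: alpha weights the cutsize gain and (1 - alpha) weights the
    timing gain; the paper may assign the weights the other way around.
    avg_conduits plays the role of lambda(G), the average number of I/O
    conduits through a node.
    """
    combined = alpha * cutsize_gain + (1 - alpha) * timing_gain / avg_conduits
    return math.ceil(100 * combined)


# Cutsize gain 1, timing gain 2, lambda(G) = 4:
# 0.87 * 1 + 0.13 * 2 / 4 = 0.935, scaled to 93.5, ceiling 94.
print(move_gain(1, 2, 4.0))  # prints: 94
```

Scaling by 100 before taking the ceiling keeps two decimal digits of the combined gain, so moves with equal cutsize gain but different timing gain usually fall into different FM gain buckets.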
D. Extension to the multiway partitioning
The Performance-driven, Multilevel, Multiway Partitioning (PMMP) algorithm is implemented simply by making iterative calls to PMP to generate two-way, four-way, and eventually K-way partitioning solutions. The first bisection uses MDP as the clustering tiebreaker, as described in Section III(A). After the first bisection, however, PMMP uses a different tiebreaker during clustering.
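The iterative scheme can be sketched in Python as repeated bisection of every current part. The `bisect` argument stands in for a call to PMP and is a hypothetical interface; the sketch also assumes that K is a power of two.

```python
def recursive_bisection(nodes, k, bisect):
    """Split nodes into k parts by bisecting every part at each round.

    bisect is any function that takes a list of nodes and returns two lists;
    here it is a placeholder for a PMP call. k is assumed to be a power of two.
    """
    parts = [list(nodes)]
    while len(parts) < k:
        next_parts = []
        for p in parts:
            left, right = bisect(p)
            next_parts.extend([left, right])
        parts = next_parts
    return parts


def split_in_half(p):
    """Trivial stand-in bisector, used only to exercise the control flow."""
    mid = len(p) // 2
    return p[:mid], p[mid:]


print(recursive_bisection(range(8), 4, split_in_half))
# prints: [[0, 1], [2, 3], [4, 5], [6, 7]]
```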
For the first bipartitioning step, we do not have enough information to define critical paths in the circuit graph, so we take the topological depth of a path as an indication of its timing criticality. In the succeeding bipartitioning steps, on the other hand, critical paths can be identified more precisely by counting the existing cuts on the paths. Therefore, we can use the maximum hop-count of any path that goes through an edge as a measure of the timing criticality of that edge (a hop is an edge that has been cut). In other words, if there is a path that goes through some edge e and the hop-count of that path is already high, we do not want to cut the edge e in the current bipartitioning step, since doing so will likely increase the critical path delay after PMP completes its job.
Definition: The Maximum Hop-count of any Path of an edge (referred to as the MHP of an edge) is defined as the maximum hop-count, i.e., the number of already-cut edges, of any path that goes through that edge.
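Following the hop-count reading given above, the MHP of every edge could be computed as in the Python sketch below. It assumes an acyclic graph whose PI's, PO's and FF's act as path endpoints, and it treats an edge as a hop when its two endpoints were placed in different parts by the earlier bisections; the name `edge_mhp` and the data layout are hypothetical.

```python
from functools import lru_cache


def edge_mhp(edges, part, terminals):
    """Map each directed edge (u, v) to the maximum hop-count of any path through it.

    edges: iterable of (source, target) pairs of an acyclic netlist graph.
    part: part index of every node after the bisections performed so far.
    terminals: set of PI, PO and FF nodes, which end paths.
    """
    edges = list(edges)
    succs, preds = {}, {}
    for u, v in edges:
        succs.setdefault(u, []).append(v)
        preds.setdefault(v, []).append(u)

    def hop(u, v):
        # An edge is a hop when it is already cut.
        return 1 if part[u] != part[v] else 0

    @lru_cache(maxsize=None)
    def up(n):
        # Most hops on any path that ends at n.
        if n in terminals:
            return 0
        return max((up(p) + hop(p, n) for p in preds.get(n, [])), default=0)

    @lru_cache(maxsize=None)
    def down(n):
        # Most hops on any path that starts at n.
        if n in terminals:
            return 0
        return max((hop(n, s) + down(s) for s in succs.get(n, [])), default=0)

    return {(u, v): up(u) + hop(u, v) + down(v) for u, v in edges}


# Toy example: pi1 -> v1 -> v2 -> v3 -> po1 with three cut edges, plus pi2 -> v3.
edges = [("pi1", "v1"), ("v1", "v2"), ("v2", "v3"), ("v3", "po1"), ("pi2", "v3")]
part = {"pi1": 0, "pi2": 0, "v1": 0, "v2": 1, "v3": 0, "po1": 1}
mhp = edge_mhp(edges, part, terminals={"pi1", "pi2", "po1"})
print(mhp[("v1", "v2")], mhp[("pi2", "v3")])  # prints: 3 1
```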
IV. EXPERIMENTAL RESULTS
The PMP (or PMMP) algorithm was implemented in C++ on a Sun UltraSPARC II machine, and tested on ISCAS89 and ITC benchmarks. The characteristics of the benchmark circuits are summarized in Table 1. The circuits were optimized by using script.rugged in SIS [15]. We obtained these optimized circuits from the authors of [9]. In our experiments, we arbitrarily assigned locations to the primary inputs and outputs and kept their locations fixed throughout the experiments. To compare PMP with other timing-driven partitioning algorithms, we implemented the algorithm of reference [9], which we denote as TPA. This algorithm gives more weight to the edges that lie on the timing-critical paths in a circuit, and it relies on static timing analysis to identify these paths. We also compared PMP with hMetis [1]. All three algorithms were compared on the eight-way partitioning problem.
The maximum delay of the benchmark circuits, QG, is calculated based on the delay model presented in [9]. The chip size for each circuit is assumed to be twice the total area of all nodes in the circuit. For the 8-way partitioning, we set the area skew to 5%. For PMP, we set the parameter α to 0.87 (cf. Section III(C)). Table 2 reports the results of comparing PMP with hMetis and TPA for the eight-way partitioning problem. All results in this table represent the average of 20 different runs for each partitioning algorithm. The cutsize, AG, is calculated as the sum of the costs of all nets that are cut. In turn, the cost of a cut net is k-1 if that net has pins in k parts.
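The cutsize metric can be written as a short Python sketch. The representation of a net as a list of pin nodes and the function name `cutsize` are assumptions made for illustration.

```python
def cutsize(nets, part):
    """Sum of (k - 1) over all nets, where k is the number of parts a net touches.

    nets: list of nets, each given as a list of the nodes it connects.
    part: part index of every node. Uncut nets (k = 1) contribute zero.
    """
    total = 0
    for pins in nets:
        k = len({part[p] for p in pins})
        total += k - 1
    return total


# Net 1 spans two parts (cost 1), net 2 is uncut (cost 0), net 3 spans three (cost 2).
nets = [["a", "b", "c"], ["c", "d"], ["a", "d", "e"]]
part = {"a": 0, "b": 0, "c": 1, "d": 1, "e": 2}
print(cutsize(nets, part))  # prints: 3
```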
Based on the data in Table 2, we conclude that, in terms of the circuit delay, PMP outperforms hMetis and TPA by an average of 14.6% and 3.8%, respectively. In addition, we can see that, compared to hMetis, PMP increases the cutsize by an average of 1.2%, while, on average, PMP obtains a 3.7% lower cutsize than TPA. The runtime of PMP is on average three times higher than that of hMetis. Notice that PMP is on average 18.8% faster than TPA. Overall, PMP reduced the circuit delay by 14.6% with a negligible increase in the cutsize compared to hMetis.
V. CONCLUSIONS
In this paper, we presented a new performance-driven multilevel partitioning algorithm. Our main contribution is a new and efficient timing gain function formulation for the move-based bipartitioning algorithm. In addition, we proposed a simple but efficient timing-aware clustering algorithm that uses the maximum logic depth and/or the hop-count of an edge as a tiebreaker. These new methods fit very nicely within the general framework of a multilevel partitioning algorithm. Consequently, we can reduce circuit delay very efficiently at a minimal cost in cutsize.
