Abstract-In this paper, we present an effective algorithm for performance-driven multi-level clustering of combinational circuits, which is applicable to hierarchical FPGAs. With a novel graph contraction technique, which allows some crucial delay information of a lower-level clustering to be maintained in the contracted graph, our algorithm recursively divides the lower-level clustering into the next higher-level one, such that each recursive clustering step is accomplished by applying a modified single-level circuit clustering algorithm based on [1]. We test our algorithm on the two-level clustering problem and compare it with the latest algorithm in [2]. Experimental results show that our algorithm achieves, on average, 12% more delay reduction than the best results (from TLC with full node duplication) in [2]. In fact, our algorithm is the first one for the general multi-level circuit clustering problem with more than two levels.
I. INTRODUCTION
Circuit clustering is defined as assigning circuit elements into clusters under different design constraints, such as area and pin constraints [1, 3, 4, 5]. In this way, the circuit clusters are smaller than the original circuit, and hence manipulation and synthesis of the clusters are easier. Most circuit clustering algorithms aim at minimizing either the circuit delay or the number of inter-cluster connections.
In this paper, we focus on the problem of combinational circuit clustering for delay minimization subject to area constraints. This problem was first studied in [5]. The authors formulate the problem in the unit delay model, in which no delay is associated with any gate or with any connection within a cluster, while a unit delay is assigned to each inter-cluster connection. A polynomial-time algorithm is also proposed to solve the problem optimally. More recently, most research has adopted the general delay model [4], in which each gate is associated with a delay value, connections within the same cluster have no delay, and each inter-cluster connection has a constant delay. An algorithm that solves the circuit clustering problem under the general delay model is proposed in [1]. It is proved that the algorithm solves the problem optimally in polynomial time. The problems considered in [1, 4, 5] are referred to as single-level circuit clustering.
Ting-Chi Wang, Department of Computer Science, National Tsing Hua University, Hsinchu, Taiwan, e-mail: tcwang@cs.nthu.edu.tw
The need for a solution to the multi-level circuit clustering problem is increasing as more and more designs are built on hierarchical FPGA architectures. The two-level clustering problem with area constraints is studied in [2]. The problem formulation requires the division of a circuit into clusters (second-level clusters), each of which is then further divided into smaller clusters (first-level clusters). It is proved that two-level circuit clustering for delay minimization is NP-hard. Hence, the authors propose a heuristic which is extended from [1]. Their algorithm constructs a candidate second-level cluster rooted at each node and then covers the whole circuit based on the clusters. During the construction of each candidate second-level cluster, the first-level clusters within it are formed at the same time. Both first-level and second-level clusters are constructed according to the same criterion: nodes are chosen by comparing the maximum delay of the paths from primary inputs to the cluster root passing through them.
However, the heuristic in [2] is not effective enough according to our experiments. The main reason may be its restriction that node duplication is not allowed within a second-level cluster. Node duplication within a second-level cluster has two contrasting effects on delay minimization. On one hand, it may reduce the circuit delay, since a node can be included in different clusters so that the number of inter-cluster connections may be reduced. On the other hand, each cluster is constrained by an area bound and node duplication consumes area, so fewer distinct nodes can be included in a second-level cluster and the circuit delay may increase. However, we have shown by experiment that properly allowing node duplication within second-level clusters is beneficial to delay minimization. Moreover, since the algorithm in [2] performs first-level and second-level clusterings at the same time, for each node included in a first-level cluster, all the data for first-level and second-level clusters (e.g., lists of candidate nodes and the immediate successor with maximum delay) must be updated accordingly. This makes their algorithm hard to extend to the circuit clustering problem with more than two levels.
In order to cope with the difficulties mentioned above, we propose an algorithm for the general combinational circuit clustering problem with any arbitrary number of levels. Our algorithm constructs clusters for each level separately, from the first level to the desired level. The clustering of each level is performed on a contracted graph which only captures the most important delay information from the clustering of the previous level. Besides, since we only perform circuit clustering on the contracted graph formed from the previous level, a simple but effective single-level circuit clustering algorithm can be employed. As a result, our algorithm effectively handles the multi-level problem by repeating the graph contraction technique and the single-level circuit clustering algorithm.
Although we employ a single-level clustering algorithm which is extended from [1], our overall algorithm is not merely a trivial extension of [1], because without our graph contraction technique, the single-level clustering algorithm cannot be repeatedly applied to the circuit to obtain a multi-level clustering. As a result, the graph contraction algorithm plays a critical role in our work and it successfully links every two successive levels of circuit clustering.
Taking the two-level clustering problem as an example, our algorithm first divides the circuit into a set of first-level clusters with node duplication, so that the node duplication within second-level clusters can later be guided by those first-level clusters. In this way, node duplication within a second-level cluster happens only when the duplication helps minimize the delay of the resultant first-level clustering. In fact, our implementation and experimental results show that, with this mechanism, allowing node duplication within second-level clusters indeed further reduces the delay values. We are able to achieve 12% more delay reduction compared with the algorithm in [2], in which node duplication within the second-level clusters is not allowed.
II. DEFINITIONS AND PROBLEM FORMULATION
A combinational circuit can be represented as a directed acyclic graph (DAG) $G = (V, E)$. $V$ is the set of nodes which represent the functional blocks (e.g., gates) in the circuit and $E$ is the set of edges which stand for the connections among the blocks. Each edge that connects two nodes in two different $n$-th-level clusters has a fixed delay $D_{n+1}$. Practically, we have
For the delay of a path from node $a$ to node $b$, we always include all node delays and edge delays along the path. The path delay at a node $v$ is defined as the maximum delay of all paths from PIs to $v$. The delay of a clustered circuit is defined as the maximum path delay at all PO nodes; in other words, it is the maximum delay of all paths from PIs to POs within the clustered circuit. According to the above definitions, the multi-level circuit clustering problem with area constraints is presented in the following. An example is shown in Figure 1. In the figure, there are 17 nodes in the graph and a three-level clustering is shown. In the clustering, the delay of edge $(j, k)$ is $D_4$ since $j$ and $k$ are in different third-level clusters. The delay values of $(l, q)$ and $(i, j)$ are $D_3$ and $D_2$, respectively. Since $e$ and $f$ are in the same first-level cluster, the delay associated with $(e, f)$ is $D_1$. If each node is associated with unit area and $M_1 = 2$, $M_2 = 6$, $M_3 = 12$, the area $w$ of every first-level cluster in the clustering is 2, except for the clusters containing $i$, $j$ or $o$, whose $w$ equals 1. The second-level cluster containing $\{a, b, c, d, e, f\}$ has an area of 6. The area of the second-level cluster containing $\{g, h, i, j\}$ is also 6, since it consists of three first-level clusters and no further first-level cluster can be placed in it. In fact, the area of every second-level cluster in Figure 1 is 6, except the one containing $\{k, l\}$ (whose area is 2). The area of each third-level cluster in the graph is the same, which is 12.
In the example, we assume $D_1 = 1$, $D_2 = 3$, $D_3 = 7$, $D_4 = 17$ and $d(v) = 1$ for each node $v$. The delay of the clustered circuit is equal to 85, which is along the path 
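The delay definitions above can be illustrated with a small sketch. It assumes, for simplicity, that there is no node duplication, so every node belongs to exactly one cluster per level; the names `edge_delay`, `circuit_delay` and `membership`, and the cluster identifiers in the usage note, are hypothetical and not taken from the paper or from Figure 1.

```python
from functools import lru_cache


def edge_delay(a, b, membership, D):
    """Delay of edge (a, b) in a multi-level clustering.

    membership[v] is a tuple of cluster ids for node v, first level first.
    D is the list [D1, D2, ..., D_{n+1}].
    Assumption: no node duplication (one cluster per node and level).
    """
    # Scan from the top level down: the highest level at which the two
    # endpoints sit in different clusters decides the delay.
    for level in range(len(D) - 1, 0, -1):
        if membership[a][level - 1] != membership[b][level - 1]:
            return D[level]  # this is D_{level+1}
    return D[0]  # same first-level cluster: D1


def circuit_delay(preds, node_delay, membership, D):
    """Maximum path delay over all nodes of the clustered DAG.

    preds[v] lists the immediate predecessors of v (empty for a PI).
    With non-negative delays the maximum is always reached at a PO.
    """
    @lru_cache(maxsize=None)
    def path_delay(v):
        worst_input = max(
            (path_delay(u) + edge_delay(u, v, membership, D) for u in preds[v]),
            default=0,
        )
        return worst_input + node_delay[v]

    return max(path_delay(v) for v in preds)
```

For instance, with `D = [1, 3, 7, 17]` as in the example and unit node delays, a chain `x -> y -> z` in which `x` and `y` share a first-level cluster while `z` sits in another first-level cluster of the same second-level cluster gives `circuit_delay` equal to 7 (three node delays, one $D_1$ edge and one $D_2$ edge).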
III. THE ALGORITHM
The flow of our algorithm is depicted in Figure 2. The algorithm mainly consists of two parts: single-level graph clustering and graph contraction. Initially, $lev$ is set to 0. Then, single-level graph clustering is performed on the input graph $G_{lev}$ ($= G$ when $lev = 0$) and it outputs a set of clusters, $S_{lev+1}$ ($= S_1$ for the first time). After that, $lev$ is increased by 1. If $lev$ is equal to the required number of levels, $n$, the algorithm terminates. Otherwise, graph contraction is performed on $G_{lev-1}$ ($= G_0$ for the first time) based on $S_{lev}$. To generate the contracted graph $G_{lev}$, we treat each $lev$-th-level cluster $C_i^{lev}$ (at the first iteration, we have the set $S_1$ of first-level clusters) as an independent supernode. Thus, $G_{lev}$ is built such that each node represents a supernode, which stands for a cluster in $S_{lev}$. Then single-level graph clustering is performed on the new graph $G_{lev}$, and the output is a set $S_{lev+1}$ of clusters. The algorithm iterates until the desired level $n$ of clusters is obtained.
The details of the single-level graph clustering and graph contraction algorithms are presented in Sections III-A and III-B, respectively. The graph contraction algorithm is the main contribution of the paper, since without graph contraction, single-level graph clustering cannot be repeatedly applied to the circuit $G$.
A. Single-Level Graph Clustering
In the algorithm, given a graph $G_{lev}$, the clustering $S_{lev+1}$ is constructed by applying the single-level circuit clustering algorithm in [1] with modifications. The delay model adopted in [1] is slightly different from the delay model that we consider in this paper. In the delay model in [1], there is no delay (i.e., $D_1 = 0$) for any edge within the same cluster. In our model, however, the delay of each edge within the same cluster can be any non-negative number. (Note that $\delta(e) = D_1$ for each edge $e$ at the beginning of the algorithm when $lev = 0$. After the first iteration, $lev \geq 1$, each edge delay is assigned by graph contraction and it may not be a fixed value for every edge. Here, we only use $D_1$ to illustrate the difference between the delay models.) Nevertheless, if we assume each edge has the delay $D_1$ during the calculation of the delay matrix $\Delta$, which stores the maximum delay (including node delays and edge delays) of the paths between any two nodes, single-level graph clustering under our model can be solved optimally in a similar way. The pseudo-code of the modified single-level graph clustering algorithm is shown in the following.
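[Pseudo-code listing of the modified single-level graph clustering algorithm, including the procedure "Labeling". Only fragments of the listing are recoverable: "Remove the first node v from T;", "Compute N_v;", "Remove the first node u in P;" and, at line 5, "IF (w(cluster(v) ∪ ...".]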
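Since the listing is incomplete, the following is only a minimal sketch of a labeling step in the style of the general-delay-model algorithm of [1], adapted so that edges already carry delays and crossing a cluster boundary costs an extra `crossing_penalty` on top of the stored edge delay. It is not the authors' pseudo-code: the function name, the parameters, the unit-area assumption and the rule of filling the cluster with the highest-ranked predecessors are all assumptions made for illustration.

```python
def label_and_cluster(preds, node_delay, edge_delay, area_bound, crossing_penalty):
    """Label every node and build one candidate cluster rooted at each node.

    preds[v]           -> immediate predecessors of v (empty list for a PI)
    node_delay[v]      -> delay of node v
    edge_delay[(u, v)] -> delay already attached to edge (u, v)
    area_bound         -> maximum number of unit-area nodes per cluster
    crossing_penalty   -> extra delay when an edge crosses a cluster boundary
    """
    # Topological order: every predecessor appears before its successors.
    order, seen = [], set()

    def visit(v):
        if v not in seen:
            seen.add(v)
            for u in preds[v]:
                visit(u)
            order.append(v)

    for v in preds:
        visit(v)

    label, cluster = {}, {}
    for idx, v in enumerate(order):
        if not preds[v]:  # primary input
            label[v], cluster[v] = node_delay[v], {v}
            continue
        # longest[x]: maximum delay of a path from the output of x to the
        # output of v, counting the stored edge delays and node delays.
        longest = {v: 0}
        for y in reversed(order[: idx + 1]):
            if y in longest:
                for u in preds[y]:
                    cand = longest[y] + edge_delay[(u, y)] + node_delay[y]
                    longest[u] = max(longest.get(u, cand), cand)
        # through[x]: delay at v of the worst path via x if no boundary is crossed.
        through = {x: label[x] + longest[x] for x in longest if x != v}
        ranked = sorted(through, key=through.get, reverse=True)
        members = {v} | set(ranked[: area_bound - 1])
        inside_pi = [through[x] for x in members if x != v and not preds[x]]
        outside = [through[x] + crossing_penalty for x in ranked if x not in members]
        label[v], cluster[v] = max(inside_pi + outside), members
    return label, cluster
```

With a three-node chain, unit node delays, every edge delay equal to $D_1 = 1$, an area bound of 2 and a penalty of $D_2 - D_1 = 2$, the sketch labels the last node with 7, which matches the delay of placing the first two nodes in one cluster and the third in another.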
B. Graph Contraction
After constructing the set $S_{lev}$ of $lev$-th-level clusters, we build a contracted graph $G_{lev}$ from the graph $G_{lev-1}$ and the clustering $S_{lev}$. The contracted graph $G_{lev}$ is constructed such that each node $v_i$ in $G_{lev}$ corresponds to a $lev$-th-level cluster $C_i^{lev}$ in $S_{lev}$ and its path delay in $G_{lev}$ is the same as that of the root node of $C_i^{lev}$. Since different clusters have different numbers of elements and sub-graph structures, and there are different edges connecting the nodes of two clusters, delay assignment to each supernode and each edge connecting the supernodes is not straightforward. The general algorithm for constructing $G_{lev}$ from graph $G_{lev-1}$ and clustering $S_{lev}$ is shown in the following.
Algorithm Graph-Contraction
Input: Graph G_{lev-1}, Clustering S_{lev} = {C_1^{lev}, C_2^{lev}, ...}
1.  FOR each cluster C_i^{lev} in S_{lev}
2.      construct a node v_i in G_{lev};
3.      δ_{lev+1}(v_i) = D(C_i^{lev});   /* δ_{lev+1}(v_i) is assigned in G_{lev} */
4.  END FOR
5.  FOR each edge e = (a, b) in G_{lev-1}
6.      IF (a and b are not in the same cluster of S_{lev})
7.          assume b in C_i^{lev} which is rooted at r;
8.          find the cluster C_j^{lev} rooted at a;
9.          IF (there exists no edge from v_j to v_i in G_{lev})
10.             add an edge from v_j to v_i in G_{lev};
[The remaining lines of the listing, through line 18, are not recoverable.]

For each inter-cluster edge, we build an edge between the two corresponding supernodes and assign the delay value such that the maximum delay of the paths passing through that inter-cluster edge is maintained. This is the most complicated part of graph contraction, since different edges may connect different nodes of the two clusters. The delay assignment to each edge (lines 9-18) is depicted in Figure 3. Finally, for every two supernodes, if there exists more than one edge between them, we keep only the edge with the maximum delay value and remove all the others in the new graph $G_{lev}$. Note that in the clustering generated by our modified single-level graph clustering algorithm described in Section III-A, inter-cluster edges only connect from the roots of the predecessor clusters.
Fig. 3. Illustration of Graph-Contraction
We describe in Section IV that in this graph contraction algorithm, some crucial delay information of each new node $v_i$ is extracted from the root of the corresponding cluster $C_i^{lev}$. Since this delay information is retained in the contracted graph, finding the next higher-level clustering from the contracted graph for delay minimization can be achieved by the modified single-level graph clustering algorithm in polynomial time.
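The structural part of graph contraction (one supernode per cluster, one edge per pair of connected supernodes, keeping the maximum delay among parallel edges) can be sketched as follows. The delay-assignment rule of the paper is not reproduced here; it is abstracted into a caller-supplied function `edge_delay_fn`, which is a hypothetical placeholder, as are the other names. Supernodes are identified by the root of their cluster, and clusters may overlap because of node duplication.

```python
def contract(preds, clusters, edge_delay_fn):
    """Build the edges of the contracted graph.

    preds[v]      -> immediate predecessors of v in the graph being contracted
    clusters[r]   -> set of nodes of the cluster rooted at r (r included)
    edge_delay_fn -> edge_delay_fn(a, b, r) returns the delay to attach to the
                     supernode edge induced by edge (a, b) entering the cluster
                     rooted at r (assumed rule, supplied by the caller)

    Returns a dict mapping (root_of_source_cluster, root_of_target_cluster)
    to the delay of the single edge kept between those two supernodes.
    """
    super_edges = {}
    for root, members in clusters.items():
        for b in members:
            for a in preds[b]:
                if a in members:
                    continue  # intra-cluster edge, absorbed by the supernode
                # Inter-cluster edges are assumed to leave only from cluster
                # roots, so a itself identifies the source supernode.
                delay = edge_delay_fn(a, b, root)
                key = (a, root)
                # Keep only the maximum-delay edge between two supernodes.
                if key not in super_edges or delay > super_edges[key]:
                    super_edges[key] = delay
    return super_edges
```

The set of supernodes is simply `clusters.keys()`; node delays of the supernodes would be assigned separately, as in lines 1-4 of the listing.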
C. Remarks
A multi-level clustering is achieved by iteratively performing single-level graph clustering and graph contraction. At first, we set $lev = 0$ and $G_0 = G$, which is the target unclustered circuit. After the single-level graph clustering process, the first-level clustering $S_1$ is obtained and we can produce $G_1$ by the graph contraction algorithm. Then, a second-level clustering $S_2$ can be found from the contracted graph $G_1$ by the same graph clustering algorithm with $lev = 1$. This time, the algorithm takes $D_3 - D_2$ for the calculation of the label value (in line 10 of "Labeling"), $M_2$ for the area bound (in line 5 of "Labeling"), and the edge and node delays in $G_1$ for the calculation of $\Delta$. After the second-level clustering $S_2$ (or in general, the $i$-th-level clustering $S_i$) is obtained, a contracted graph $G_2$ ($G_i$) is constructed. The third-level clustering $S_3$ (the $(i + 1)$-th-level clustering $S_{i+1}$) can then be generated similarly. In this way, our overall algorithm can be applied recursively to obtain an $n$-level circuit clustering.
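The iteration described in these remarks can be sketched as a short driver loop. The two callbacks `cluster_one_level` and `contract_graph` are hypothetical placeholders for the single-level clustering and graph contraction steps. The use of $D_{lev+2} - D_{lev+1}$ as the boundary-crossing delay at every level is an assumption that generalizes the $D_3 - D_2$ stated above for $lev = 1$, and how the area of a supernode is counted against the bound is left to the callback.

```python
def multilevel_clustering(graph, n, D, M, cluster_one_level, contract_graph):
    """Return the list [S_1, ..., S_n] of clusterings, one per level.

    D = [D1, ..., D_{n+1}] are the edge delays and M = [M1, ..., Mn] the
    area bounds. cluster_one_level(graph, area_bound, crossing_penalty)
    and contract_graph(graph, clustering) are supplied by the caller.
    """
    clusterings = []
    lev = 0
    while True:
        # Extra delay for crossing a cluster boundary at this level
        # (assumed generalization: D_{lev+2} - D_{lev+1}).
        penalty = D[lev + 1] - D[lev]
        s = cluster_one_level(graph, area_bound=M[lev], crossing_penalty=penalty)
        clusterings.append(s)
        lev += 1
        if lev == n:
            return clusterings
        # Contract the current graph so that each cluster becomes a supernode.
        graph = contract_graph(graph, s)
```

Keeping the two steps behind callbacks mirrors the structure of the algorithm: only the contracted graph and the level-dependent parameters change between iterations.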
IV. ANALYSIS OF THE ALGORITHM
In our algorithm, the contracted graph $G_{lev}$ is constructed such that the path delay at $v_i$ in $G_{lev}$ equals the label value $l_{lev}(r)$ of the root node $r$ of the corresponding cluster $C_i^{lev}$ in $G_{lev-1}$. We have the property stated in the following theorem.
Theorem 1 For any node $v_i$ in the contracted graph $G_{lev} = \{V_{lev}, E_{lev}\}$ generated by the algorithm Graph-Contraction, the path delay at $v_i$ in $G_{lev}$ is equal to the label, $l_{lev}(r_i)$, of the root node $r_i$ of the corresponding cluster $C_i^{lev} \in S_{lev}$. Proof It is omitted due to limited space.
Theorem 1 implies that the single-level graph clustering algorithm can be repeatedly applied to cluster the circuit to any level $n$ with correct delay information. With Theorem 1, we can further derive the local optimality of our algorithm as stated in the following theorem.
Theorem 2 Given an $i$-th-level clustering produced by our algorithm, our algorithm generates an $(i + 1)$-th-level clustering which divides the $i$-th-level clusters and minimizes the delay of the resultant circuit. Proof It is omitted due to limited space.
The local optimality described in Theorem 2 does not guarantee a globally optimal final solution, but the local optimality of each recursive step tends to keep the circuit delay small. In fact, our experiments show that our algorithm achieves, on average, 12% more delay reduction than the state-of-the-art algorithm in [2].
For the time complexity of the algorithm, the modified single- 
V. POSTPROCESSING TECHNIQUES
In the experiments, our algorithm generates large clustered circuits, since node duplication occurs frequently in order to minimize the circuit delay. We therefore present two simple postprocessing techniques to reduce the area of a clustered circuit. The techniques are employed after the clustered circuit is generated, and they do not increase the delay of the clustered circuit. Assuming we have an n-level clustered circuit, the first technique locates those n-th-level clusters each of which is a subset of another n-th-level cluster, so that they can be deleted without changing the delay and functionality of the whole circuit.
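The first technique amounts to a subset test between clusters. A minimal sketch, assuming each n-th-level cluster is given as a set of node identifiers (the function name and the data representation are hypothetical):

```python
def remove_subsumed(clusters):
    """Drop every cluster that is a subset of another cluster that is kept.

    clusters is a list of sets of node identifiers. Larger clusters are
    visited first, so any superset of a cluster is examined before it;
    of several identical clusters only the first one is kept.
    """
    kept = []
    for c in sorted(clusters, key=len, reverse=True):
        if not any(c <= k for k in kept):
            kept.append(c)
    return kept
```

For example, from the clusters `{1, 2, 3}`, `{1, 2}`, `{4}` and `{3}`, only `{1, 2, 3}` and `{4}` remain.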
The second technique packs several n-th-level clusters into one single cluster if the area constraint is not violated. This can be done when the original clusters are small. The technique is based on the First Fit Decreasing heuristic for the bin packing problem (which is also mentioned in [4]). All the n-th-level clusters are sorted in non-increasing order of area. We assume each bin has the capacity $M_n$. Then, the heuristic places the clusters one by one into the bins. Each time, we place a cluster in the leftmost bin that still has enough space for it, and start a new bin if necessary.
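A minimal sketch of First Fit Decreasing as described above. It assumes that cluster areas simply add up when clusters are packed together (that is, shared nodes are not discounted) and that no single cluster exceeds the capacity; the function name is hypothetical.

```python
def first_fit_decreasing(areas, capacity):
    """Pack clusters into bins of the given capacity.

    areas[i] is the area of cluster i. Returns a list of bins, each bin
    being the list of cluster indices merged into one new cluster.
    """
    bins = []  # each entry is [remaining_capacity, [cluster indices]]
    # Visit clusters in non-increasing order of area.
    order = sorted(range(len(areas)), key=lambda i: areas[i], reverse=True)
    for i in order:
        for b in bins:
            if b[0] >= areas[i]:  # leftmost bin with enough room
                b[0] -= areas[i]
                b[1].append(i)
                break
        else:
            # No existing bin can take the cluster: open a new one.
            bins.append([capacity - areas[i], [i]])
    return [b[1] for b in bins]
```

With areas `[5, 7, 3, 2, 4]` and capacity 10, the clusters are packed into three bins: indices `[1, 2]`, `[0, 4]` and `[3]`.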
VI. IMPLEMENTATION AND EXPERIMENTAL RESULTS
We tested our algorithm on a two-level hierarchy based on Altera's APEX FPGA architecture [2] and compared our algorithm (namely MLC) with the UCLA TLC implementations obtained from the authors of [2]. Based on the timing extraction done in [2], we used the same parameters: $M_1 = 10$, 0.61ns, $w(v) = 1$ for each node $v$. We evaluated our algorithm on this 2-level hierarchy mainly because this FPGA architecture is the latest FPGA model for the multi-level circuit clustering problem.
Experiments were performed on MCNC benchmark circuits which were also used by UCLA TLC [2]. The benchmarks were pre-processed and mapped into 4-input LUT networks by the UCB SIS and UCLA RASP systems. Each benchmark circuit was clustered into a two-level clustering by the two TLC implementations (No node duplication and Full node duplication) and by our algorithm. The first TLC implementation did not allow any node duplication among different second-level clusters, while the second one did. However, neither TLC implementation allowed node duplication within a second-level cluster. The implementations returned a new circuit (if node duplication happened) which is functionally equivalent to the original circuit, and also returned the clustering information.
In order to carry out a fair and objective comparison, our implementation strictly followed the same problem formulation as the TLC implementations in [2]. First, in our implementation, each PI node or PO node formed a second-level cluster by itself and was excluded from any cluster rooted at any other node that is neither a PI node nor a PO node. As a result, the edge delay from any PI node to any of its immediate successors was always $D_3$, and the same delay $D_3$ was always associated with the edge from any node to a PO node. Secondly, our problem formulation and the problem formulation in [2] seemed to differ in the area calculation of second-level clusters. We limited the total number of first-level clusters within a second-level cluster, while [2] limited the total number of nodes in a second-level cluster. So, if some first-level clusters contain fewer nodes, more first-level clusters can be included in a second-level cluster in the problem formulation of [2], while in our problem formulation, the maximum number of first-level clusters within a second-level cluster was always fixed. However, our problem formulation was more appropriate to the APEX FPGA architecture, where one MegaLAB (= second-level cluster) can hold no more than 16 logic array blocks, LABs (= first-level clusters), even when some of the LABs are not full. In fact, we found that the TLC implementations we obtained from the authors of [2] followed our definition of the second-level cluster area bound, which ensured a fair comparison.
In our implementation, we also imposed a constraint on the maximum number of inputs of each first-level cluster. In the specification of Altera APEX FPGA devices, a first-level cluster (LAB) cannot have more than 22 inputs. Hence, we controlled the number of inputs to a first-level cluster by adding a condition to the IF statement in line 5 of "Labeling", such that we stopped adding new nodes to a first-level cluster when the number of inputs to the cluster would exceed 22. This constraint was also included in the UCLA TLC implementations.
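The input constraint can be checked by counting the distinct nodes outside a cluster that feed nodes inside it. The following is only a sketch of such a check; the function names and the representation of a cluster as a set of nodes are assumptions, and it is not the exact condition used in the implementation.

```python
def cluster_input_count(cluster, preds):
    """Number of distinct nodes outside the cluster driving a node inside it."""
    return len({a for b in cluster for a in preds[b] if a not in cluster})


def can_add(cluster, node, preds, max_inputs=22):
    """True if adding node keeps the cluster within the input limit."""
    return cluster_input_count(cluster | {node}, preds) <= max_inputs
```

Note that adding a node can lower the input count as well as raise it, since a former input becomes internal once it joins the cluster.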
The experimental results are shown in Table I. Columns 2-4 show the results of TLC [2] with no node duplication. The results demonstrate that our algorithm achieves, on average, 12% more delay reduction than the TLC (Full ND) implementation. Moreover, our results are consistently better than or equal to those of TLC for all benchmarks. Although our algorithm runs comparatively slower, the total run time for all 16 circuits is still less than one minute.
Due to more node duplication, our resultant area is greater than that of the TLC implementations before the postprocessing techniques are applied. However, our postprocessing techniques effectively reduce the number of second-level clusters, on average, by 70% (from 3306 to 986). The techniques are effective because most second-level clusters are not fully occupied. In fact, 60% of the second-level clusters are less than half full in our results before postprocessing.
Our work aims at minimizing the circuit delay, and we do successfully push the delay close to the minimum. This can be seen in columns 14-15 of Table I. The columns show the delay achieved by our algorithm with $n = 1$ (one-level clustering only) and only $D_1$, $D_2$ (without $D_3$) used for edge delays, together with the percentage difference ("%diff") compared with the delays achieved by our algorithm with $n = 2$ (two-level clustering). For the $n = 1$ case, the single-level graph clustering algorithm is performed only once and no graph contraction is performed. The optimality of single-level graph clustering ensures that the result is an optimal 1-level circuit clustering.
In fact, we can take these 1-level clustering results as a "loose" lower bound for any two-level clustering. The last column shows that our results have only 3.5% more delay than the optimal 1-level results. In fact, out of 16 benchmarks, we obtain optimal 2-level clustering solutions for at least 4 circuits (whose "%diff" values equal 0.0).
