Abstract-This paper considers the area-constrained clustering of combinational circuits for delay minimization under a more general delay model, which practically takes variable interconnect delay into account. Our delay model is particularly applicable when allowing the back-annotation of actual delay information to drive the clustering process. We present a vertex grouping technique and integrate it with the algorithm (Rajaraman and Wong, 1995) such that our algorithm can be proved to solve the problem optimally in polynomial time.
but the main arguments hold and hierarchical DRC programs have been replacing flat DRC programs in the market [4] .
For hierarchical timing analysis, one set of simplifying assumptions might be that the design contains only one clock, no transparent latches, and no purely combinatorial paths through cells. Starting with the leaf cells, analyze all paths between flip-flops and report any errors, then generate abstracts. There are (at least) two possible strategies for abstracts: deriving a required time and/or an arrival time for each pin, or including in the abstract all logic between each pin and the first clocked element. In either case, the algorithm then works its way up the hierarchy, substituting abstracts for the lower level cells, doing the timing analysis, and then computing the abstracts for use by the next higher level. For either abstraction strategy, the size of the abstract will be proportional to the number of pins. From Rent's rule [6], the number of pins on a block of $n$ primitives is proportional to $n^a$, where $a$ is Rent's exponent and ranges from about 0.5 to 0.7. Therefore, to get linear time analysis overall, timing analysis and abstract generation must complete in time $n^b$, where $b < 1/a$ (roughly 1.43 if $a = 0.7$). Timing analysis, which is nominally $O(N)$, easily satisfies this constraint.
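One way to read the bound $b < 1/a$ is sketched below. It assumes (our reading, not a statement from the text) that the work done on a child's abstract should grow more slowly than the child itself, so that the total over the hierarchy stays linear.

```latex
% Assumption: a block of n primitives has an abstract of size proportional to n^a,
% and processing an input of size m takes time proportional to m^b.
\begin{align*}
  T_{\text{abstract}}(n) &\propto \left(n^{a}\right)^{b} = n^{ab}, \\
  n^{ab} = o(n) &\iff ab < 1 \iff b < \frac{1}{a}, \\
  a = 0.7 &\implies \frac{1}{a} \approx 1.43, \qquad b = 1 < 1.43 .
\end{align*}
```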
Under real world conditions, however, hierarchical timing analysis is not uniformly advantageous. Multiple clocks, or combinatorial paths through blocks, can cause the timing abstract to grow more quickly than the number of pins, and, in fact, can result in a > 1. The presence of transparent latches can make the abstract larger and harder to compute. It can be very difficult and time consuming to generate a correct abstract for a cell with false and multicycle paths, especially if these paths cross hierarchical boundaries. Therefore, although hierarchical timing analysis is certainly used, it is mainly because of the other advantages mentioned in the introduction, not because of the efficiency arguments of this paper.
VI. CONCLUSION
Under conditions that are sometimes achievable in practice, hierarchical checking has performance O(N ) in the size of the expanded hierarchy, which is the best order possible.
APPENDIX
Why is hierarchical DRC NP-complete? Here's a sketch of the proof from [2] . The integer knapsack problem is known to be NP-complete.
Given a set $I$ of integers $\{I_0, I_1, \ldots, I_{N-1}\}$, is there any subset of these that adds up to $S$? Hierarchical DRC is a more complex problem, but surely must be capable of answering the question, "Do layer 1 and layer 2 overlap anywhere in the design?" since this is an operation required in the verification of almost all IC processes.
Convert any integer knapsack problem to a hierarchical DRC problem as follows. First build a cell $C_0$ that contains two unit squares on layer 1, one at $(0, 0)$ and one at $(1, I_0)$. Then build a sequence of cells $C_1, C_2, \ldots, C_{N-1}$ such that cell $C_j$ has two copies of cell $C_{j-1}$, one at $(0, 0)$ and one at $(2^j, I_j)$. Finally, build one cell $C_N$ that has one copy of $C_{N-1}$ at $(0, 0)$ and a rectangle on layer 2 from $(0, S)$ to $(2^N, S + 1)$. The DRC problem now contains overlap of layer 1 and layer 2 if, and only if, there is some subset of $I$ that sums to $S$. The construction works since there is a unit square of layer 1 with a Y coordinate corresponding to each possible subset of $I$.
This construction is demonstrated in Fig. 4, with $I = \{3, 4, 7\}$ and $S = 9$. Diagram (a) shows the construction of $C_0$, (b) $C_1$, and so on. The final diagram (e) shows the expanded hierarchy. Since there is no overlap of layers 1 and 2, there is no subset of $I$ that sums to 9.
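The construction can be checked on small instances with a short sketch that expands the hierarchy explicitly. The function names are hypothetical, and the overlap test assumes that squares merely touching the rectangle's edge do not count as overlapping. Expanding the hierarchy takes exponential space, which is exactly what a hierarchical checker tries to avoid.

```python
def layer1_squares(ints):
    """Lower-left corners of all layer-1 unit squares in the expanded hierarchy."""
    # C_0: two unit squares, at (0, 0) and (1, I_0)
    squares = [(0, 0), (1, ints[0])]
    # C_j: two copies of C_{j-1}, one at (0, 0) and one at (2**j, I_j)
    for j in range(1, len(ints)):
        dx, dy = 2 ** j, ints[j]
        squares += [(x + dx, y + dy) for (x, y) in squares]
    return squares


def layers_overlap(ints, s):
    """True if some layer-1 square overlaps the layer-2 rectangle
    from (0, s) to (2**N, s + 1)."""
    width = 2 ** len(ints)
    # With integer coordinates, a unit square overlaps the rectangle with
    # positive area only if its lower edge sits exactly at y == s.
    return any(0 <= x < width and y == s for (x, y) in layer1_squares(ints))


if __name__ == "__main__":
    print(layers_overlap([3, 4, 7], 9))   # the instance used in Fig. 4
    print(layers_overlap([3, 4, 7], 10))  # 3 + 7 = 10
```

Each square's Y coordinate is the sum of one subset of $I$, so the overlap query answers the knapsack question.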
ACKNOWLEDGMENT
The author wishes to thank the reviewers for their extremely helpful comments.
constraints [1]-[4]. In this way, the circuit clusters are smaller compared to the original circuit and manipulation of these clusters is easier. Circuit clustering algorithms usually aim at minimizing either the circuit delay or the intercluster connections.
In this paper, we focus on the problem of combinational circuit clustering for delay minimization subject to area constraints. This problem was first studied in [4]. The authors formulate the problem in the unit delay model, in which no delay is associated with any gate or with any connection within a cluster and a unit delay is assigned to each intercluster connection. They propose a polynomial time algorithm to solve the problem optimally. More recently, most research has adopted the general delay model [3], in which each gate is associated with a delay value, connections within the same cluster have no delay, and each intercluster connection has a constant delay. An algorithm for the circuit clustering problem under the general delay model is proposed in [1]. It is proved that the algorithm can optimally solve the problem in polynomial time.
However, when wire delay starts to dominate the total delay in a circuit, the general delay model can no longer handle the practical problems that arise in current technology. Hence, the delay model must be more sophisticated, so that a delay value is also associated with each connection within a cluster. We therefore propose a new delay model in this paper, which is practical for delay back-annotation techniques, in which the actual delay information of the circuit after place and route is fed back to drive the clustering process. In fact, our delay model is more general because the general delay model adopted in [1] is a special case of our model (with the delays of all connections within the same cluster set to zero).
We demonstrate several trivial extensions of the algorithm in [1] and show that they cannot optimally solve the circuit clustering problem under our proposed delay model. Details are discussed in Section III.
Besides, we present a vertex grouping technique, and integrate it with the algorithm in [1] such that our algorithm can be proved to optimally solve the area-constrained combinational circuit clustering problem for delay minimization under our delay model in polynomial time.
This paper is organized as follows. The next section gives the problem definition. In Section III, we present a brief review of the algorithm in [1] and several trivial extensions. Section IV describes our vertex grouping technique while the overall algorithm is discussed in Section V. Analysis of the algorithm and conclusion are included in the last two sections.
II. PROBLEM DEFINITION
A combinational circuit can be represented as a directed acyclic graph $G = (V, E)$. $V$ is the set of vertices, which represent the functional blocks (e.g., gates) of the circuit, and $E$ is the set of edges, which stand for the connections among the blocks. In the graph, primary input (PI) vertices are those with outgoing edges only; conversely, primary output (PO) vertices have incoming edges only. A vertex $u$ is a predecessor (successor) of a vertex $v$ if there exists a path from $u$ to $v$ (from $v$ to $u$). A vertex $u$ is an immediate predecessor (immediate successor) of a vertex $v$ if there exists an edge from $u$ to $v$ (from $v$ to $u$).
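These graph definitions can be made concrete with a minimal sketch. The helper names and the example edge list are hypothetical; the vertex names echo those used in the figures, but the topology here is only an assumption for illustration.

```python
def build_graph(edges):
    """Return (vertices, succ, pred) for a DAG given as a list of (u, v) edges."""
    vertices, succ, pred = set(), {}, {}
    for u, v in edges:
        vertices.update((u, v))
        succ.setdefault(u, set()).add(v)
        pred.setdefault(v, set()).add(u)
    return vertices, succ, pred


def primary_inputs(vertices, pred):
    """PI vertices: no incoming edges."""
    return {v for v in vertices if not pred.get(v)}


def primary_outputs(vertices, succ):
    """PO vertices: no outgoing edges."""
    return {v for v in vertices if not succ.get(v)}


def all_predecessors(v, pred):
    """Every vertex u with a path from u to v (not only immediate predecessors)."""
    found, stack = set(), [v]
    while stack:
        for u in pred.get(stack.pop(), ()):
            if u not in found:
                found.add(u)
                stack.append(u)
    return found


# Hypothetical example circuit: two chains that merge at g.
EDGES = [("a", "c"), ("c", "e"), ("e", "g"), ("b", "d"), ("d", "f"), ("f", "g")]
V, SUCC, PRED = build_graph(EDGES)
```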
For each vertex $v \in V$, let $w(v)$ represent its area. A cluster $C \subseteq V$ is a set of vertices $\{v_1, v_2, \ldots, v_k\}$ which satisfies the area bound $M$, where $M$ is a given constant. For each cluster $C$, its area $w(C)$ is defined as the sum of the areas of all vertices in it and must be no more than $M$. That is, $w(C) = \sum_{v \in C} w(v) \le M$. Each vertex $v$ is associated with a delay $\delta(v)$, which is its intrinsic delay. Each edge $(u, v) \in E$ is associated with a delay $\delta(u, v)$. (Note that $\delta(u, v) = 0$ in [1].) For the graph in Fig. 1(a), the numbers beside the vertices and edges indicate the delay values associated with them. For example, the vertex delay of $e$ is 1 ($\delta(e) = 1$), and the edge delay of $(c, e)$ is 4 ($\delta(c, e) = 4$).
A clustering $S$ on the graph $G$ is defined as a set of clusters, $S = \{C_1, C_2, \ldots, C_m\}$, such that all the clusters in $S$ satisfy the following condition: $\forall i \in \{1, \ldots, m\}$, $C_i \subseteq V$, s.t.
Note that node duplication (i.e., a node appearing in more than one cluster) may happen in $S$. Let $G'$ be the clustered graph induced by a clustering $S$ on the graph $G$. The delay associated with $G'$ is evaluated as follows. Each edge $(u, v)$ within the same cluster keeps its original delay $\delta(u, v)$. However, for each edge connecting two vertices in different clusters, the edge delay is replaced by a fixed value $D$. For the clustered graph $G'$ shown in Fig. 1(b), the set of boxes indicates a clustering (which contains three clusters $C_1$, $C_2$, $C_3$) on the graph in Fig. 1(a). This clustering contains node duplication: node $a$ appears in both $C_1$ and $C_2$. Since the vertices $a$, $c$, $e$ are in the same cluster $C_1$, the edge delays are $\delta(a, c) = 6$ and $\delta(c, e) = 4$. (But for the delay model adopted in [1], there is no delay associated with the edge $(a, c)$ or $(c, e)$ in this example.) However, the edge $(e, g)$ crosses the two clusters $C_1$ and $C_3$, so $\delta(e, g)$ becomes the predefined value $D$. Practically, we assume
To calculate the delay of a path from vertex $u$ to vertex $v$, we always include all vertex delays and edge delays along the path. The path delay at a vertex $v$ is defined as the maximum delay of all paths from all PI vertices to $v$.
The delay of a clustered circuit $G'$ is defined as the maximum delay of all paths from PIs to POs, which is equal to the maximum path delay at all PO vertices. For example, in Fig. 1(b), the delay of $G'$ is $15 + D$ along the path $a \to c \to e \to g$. Based on these definitions, the circuit clustering problem considered in this paper is presented in the following.
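The delay evaluation can be sketched as a longest-path computation in topological order. This is a simplified illustration, not the paper's procedure: it assumes a clustering without node duplication (each vertex belongs to exactly one cluster), and the function name and most numbers are hypothetical. Only $\delta(a, c) = 6$, $\delta(c, e) = 4$, and $\delta(e) = 1$ come from the text; the remaining delays are chosen so that the path $a \to c \to e \to g$ costs $15 + D$.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def circuit_delay(vertex_delay, edge_delay, cluster_of, D):
    """Maximum path delay over all PO vertices of the clustered circuit.

    Assumes no node duplication: cluster_of maps each vertex to one cluster id.
    """
    pred = {v: [] for v in vertex_delay}
    for (u, v) in edge_delay:
        pred[v].append(u)

    arrival = {}  # path delay at each vertex
    for v in TopologicalSorter(pred).static_order():
        worst_in = 0  # PI vertices have no incoming paths
        for u in pred[v]:
            # Intracluster edges keep their own delay; intercluster edges cost D.
            wire = edge_delay[(u, v)] if cluster_of[u] == cluster_of[v] else D
            worst_in = max(worst_in, arrival[u] + wire)
        arrival[v] = worst_in + vertex_delay[v]

    has_fanout = {u for (u, _) in edge_delay}
    return max(arrival[v] for v in vertex_delay if v not in has_fanout)


# Hypothetical single-path example (delta(a), delta(c), delta(g), delta(e, g) are assumed).
VERTEX_DELAY = {"a": 1, "c": 2, "e": 1, "g": 1}
EDGE_DELAY = {("a", "c"): 6, ("c", "e"): 4, ("e", "g"): 3}
CLUSTER_OF = {"a": 1, "c": 1, "e": 1, "g": 3}

if __name__ == "__main__":
    print(circuit_delay(VERTEX_DELAY, EDGE_DELAY, CLUSTER_OF, D=7))
```

With these assumed values and $D = 7$, the intercluster edge $(e, g)$ contributes $D$ in place of its own delay, giving $15 + 7$.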
Circuit Clustering With Variable Interconnect Delay:
Given a graph G, find a clustering S (a set of clusters), such that the delay of the clustered circuit is minimized.
III. PREVIOUS WORK AND PITFALLS IN TRIVIAL EXTENSIONS

A. Previous Work
In this section, we discuss the algorithm in [1], which solves the circuit clustering problem optimally under the general delay model. In the algorithm, each cluster has one and only one root vertex $r$, and we denote the cluster rooted at $r$ (which is found by the algorithm) as cluster($r$).$^1$ Let $G_r$ be the set of all predecessors of $r$ together with the vertex $r$. Note that in the algorithm, cluster($r$) is a subset of $G_r$. When the context is not ambiguous, we also denote the subgraph induced by $G_r$ by the vertex set $G_r$.
The algorithm consists of two phases: the labeling and clustering phases. For each vertex $u \in V$, the label of $u$, $l(u)$, is defined as the minimum path delay at $u$ among all possible clusterings on the graph $G_u$. In the labeling phase, for each vertex $r$ in a topological order, the algorithm finds cluster($r$) from $G_r$ such that the path delay at $r$ is the minimum among all possible clusterings on $G_r$, and at the same time, the algorithm obtains $l(r)$.
In order to get $l(r)$, the algorithm first calculates $l'(u)$ for each predecessor $u$ of $r$. $l'(u)$ is defined as the sum of $l(u)$ and the maximum delay of the paths from the output of $u$ to the output of $r$ in the input graph.$^2$ Then, a vertex $u$ with the highest $l'$ value is repeatedly found and included into the cluster rooted at $r$ until the cluster area violates the area constraint.
After finding cluster($r$), the algorithm calculates $l_1$, the maximum $l'$ value of the PI vertices which are inside the cluster, and $l_2$, the maximum $l' + D$ value of the vertices outside the cluster.
After that, the label of $r$, $l(r)$, is the greater of $l_1$ and $l_2$.
The clustering phase constructs clusters from PO vertices to PI vertices according to the cluster information generated in the labeling phase. First, for each PO vertex v, the corresponding cluster(v) is included into the clustering S. Then, for each vertex u outside S, which is an immediate predecessor of any vertex inside S, cluster(u) is also included into the clustering S. The procedure is repeated until all of the vertices in G are included in S.
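The clustering phase can be sketched as a backward sweep from the PO vertices. This is a literal reading of the description above, not the exact procedure of [1]; the function name is hypothetical, and the `cluster` map is assumed to come from a labeling phase that is not shown here. The example uses the three clusters reported later for Fig. 3(a) on an assumed two-chain topology.

```python
def clustering_phase(vertices, pred, primary_outputs, cluster):
    """Select clusters from POs back toward PIs.

    cluster maps a root vertex r to the vertex set cluster(r), with r in cluster(r).
    Returns the chosen clusters keyed by their root.
    """
    chosen = {}
    covered = set()
    for v in sorted(primary_outputs):
        chosen[v] = cluster[v]
        covered |= cluster[v]

    while covered != set(vertices):
        # Vertices outside S that feed some vertex already inside S.
        frontier = sorted(
            {u for v in covered for u in pred.get(v, ()) if u not in covered}
        )
        if not frontier:
            break  # remaining vertices do not reach any PO
        for u in frontier:
            if u in covered:  # may have been absorbed by a cluster added just now
                continue
            chosen[u] = cluster[u]
            covered |= cluster[u]
    return chosen


# Hypothetical topology; cluster contents follow the example in Section V.
PRED = {"c": {"a"}, "e": {"c"}, "g": {"e", "f"}, "d": {"b"}, "f": {"d"}}
VERTICES = {"a", "b", "c", "d", "e", "f", "g"}
CLUSTER = {"g": {"g", "f", "e"}, "c": {"a", "c"}, "d": {"b", "d"}}
RESULT = clustering_phase(VERTICES, PRED, {"g"}, CLUSTER)
```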
B. Pitfalls in Trivial Extensions
This section shows that the algorithm in [1] cannot be "trivially" extended to deal with the circuit clustering problem with variable interconnect delay.
The labeling phase in [1] is based on the fact that in the general delay model, for each vertex $u \in G_r - \{r\}$, $l'(u) + D$ effectively represents the path delay at $r$ due to $u$ when $u$ is not included in cluster($r$). Therefore, to make the resultant path delay at $r$ as small as possible, adding vertices into cluster($r$) is done in the nonincreasing order of the $l'$

1 All vertices in the cluster are predecessors of the root vertex.
2 When calculating the delay of a path from the output of $u$ to the output of $r$, the vertex delay of $u$ is not included.

the path delay at $g$ is 22 which is shown in Fig. 2(b). However, as in Fig. 2(c), if we form the clustering $\{\{g, f, e\}, \{c, a\}, \{d, b\}\}$ instead, the path delay at $g$ becomes 21. In fact, $\{g, f, e\}$ is an optimal choice for cluster($g$) while $l(g) = 21$. The problem of using the value of $l^{*}$ to select vertices is that the vertex set of the resultant cluster may not be connected and this may increase the circuit delay. This example shows that such a trivial extension of the algorithm in [1] is also unable to produce optimal solutions.
IV. VERTEX GROUPING
In this section, we first define the function $l''(x, y)$ for each edge $(x, y)$ and then present our vertex grouping algorithm.
Definition 1: Given a root vertex $r$ and its associated $G_r$, the $l''$ value of an edge $(x, y)$, which represents the path delay at $r$ due to $x$ if $x$ is not included in the cluster of $r$ and $y$ is included, is defined as $l''(x, y) = l(x) + D + \Delta(y, r)$, where $\Delta(y, r)$ is defined as the maximum delay (including gate and edge delays) along any path from $y$ to $r$ in the input graph (assuming all gates are in the same cluster). By calculating $l''(x, y)$, we obtain the delay due to $x$ in the situation where $x$ is excluded from the cluster and $y$ is included in the cluster rooted at $r$. The reason why $l''$ is calculated for each connection, but not for each vertex, is that the delay on each fanout edge of a vertex can differ. If $M = 3$ and $D = 7$, the $l''$ values for the graph in Fig. 3(a) (for root vertex $g$) are shown in Fig. 3(b). In the figure, it is clear that when $c$ is included in the cluster while $a$ is excluded, the path delay at $g$ is greater than in the situation where we also add $a$ into cluster($g$), since $l''(a, c) = 23 > l''(c, e) = 21$. In other words, if we cannot include $a$ and $c$ at the same time, we should not choose to include $c$. By this observation, vertices should be added to a cluster on a "group" basis. Hence, we propose the following grouping strategy.
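A minimal sketch of how $\Delta(y, r)$ and $l''(x, y)$ might be computed follows. The function names are hypothetical, and the sketch assumes that $\Delta(y, r)$ includes the gate delays of both $y$ and $r$, which is our reading of "including gate and edge delays". The delays and labels below are illustrative values, not those of Fig. 3, so the resulting $l''$ values differ from the ones quoted above.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def max_delay_to_root(vertex_delay, edge_delay, r):
    """Delta(y, r) for every y that reaches r, with all gates in one cluster.

    Assumption: the gate delays of both endpoints y and r are included.
    """
    pred = {v: [] for v in vertex_delay}
    for (u, v) in edge_delay:
        pred[v].append(u)
    order = list(TopologicalSorter(pred).static_order())

    delta = {r: vertex_delay[r]}
    # Reverse topological order: delta[v] is final before v's predecessors use it.
    for v in reversed(order):
        if v not in delta:
            continue  # v has no path to r
        for u in pred[v]:
            cand = vertex_delay[u] + edge_delay[(u, v)] + delta[v]
            if cand > delta.get(u, float("-inf")):
                delta[u] = cand
    return delta


def l_double_prime(label, delta, x, y, D):
    """l''(x, y) = l(x) + D + Delta(y, r): x outside the cluster, y inside."""
    return label[x] + D + delta[y]


# Hypothetical delays and labels.
VERTEX_DELAY = {"a": 1, "c": 2, "e": 1, "g": 1}
EDGE_DELAY = {("a", "c"): 6, ("c", "e"): 4, ("e", "g"): 3}
LABEL = {"a": 1, "c": 9, "e": 14}
DELTA = max_delay_to_root(VERTEX_DELAY, EDGE_DELAY, "g")
```

Replacing the constant `D` with a per-edge lookup would give the variable intercluster delay variant mentioned in the Remarks.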
Definition 2 (Vertex Grouping): Given a root vertex $r$, its associated $G_r$, and an edge $(a_0, x)$ in $G_r$, group($a_0, x$) is a subset of $G_r$ such that a predecessor $a_i$ of $a_0$ must be assigned into group($a_0, x$) if there exists a path from $a_i$ to $a_0$, denoted $a_i \to a_{i-1} \to \cdots \to a_1 \to a_0$, such that $l''(a_k, a_{k-1}) > l''(a_0, x)$, $\forall k = 1, 2, \ldots, i$.
Also, $a_0$ is always assigned into group($a_0, x$).
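Definition 2 can be evaluated directly with a backward search that only crosses edges whose $l''$ value exceeds that of the leader edge. The sketch below implements the definition itself, not the Grouping() procedure discussed next; the function name is hypothetical, and `lpp` is assumed to hold the $l''$ value of every edge in $G_r$. The values for $(a, c)$ and $(c, e)$ are those quoted in the text; the value for $(e, g)$ is an assumption.

```python
def group(pred, lpp, a0, x):
    """Vertex set group(a0, x) according to Definition 2.

    pred maps a vertex to its immediate predecessors; lpp maps an edge to l''.
    """
    threshold = lpp[(a0, x)]
    members = {a0}  # a0 always belongs to its own group
    stack = [a0]
    while stack:
        v = stack.pop()
        for u in pred.get(v, ()):
            # u joins only via an edge whose l'' exceeds the leader edge's l'',
            # so every edge on the path from u to a0 satisfies the condition.
            if u not in members and lpp[(u, v)] > threshold:
                members.add(u)
                stack.append(u)
    return members


PRED = {"c": ["a"], "e": ["c"], "g": ["e"]}
LPP = {("a", "c"): 23, ("c", "e"): 21, ("e", "g"): 25}  # (e, g) value is assumed

if __name__ == "__main__":
    print(group(PRED, LPP, "c", "e"))
```

For the leader edge $(c, e)$, vertex $a$ joins the group because $l''(a, c) = 23 > 21$, matching the observation that $a$ and $c$ should enter the cluster together.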
Note that in Definition 2, all vertices on the path from $a_i$ to $x$ are predecessors of $r$, i.e., $x, a_0, \ldots, a_i \in G_r$. Based on this grouping definition, $a$ is assigned into group($c, e$), shown in Fig. 3(c), because there exists a path $a \to c$ such that $l''(a, c) > l''(c, e)$. In fact, it indicates that $c$ and $a$ should be assigned to the cluster rooted at $g$ at the same time. Definition 2 describes the condition under which a vertex $a_i$ in $G_a$ is in group($a_0, x$). In order to get group($a_0, x$), we can check all vertices in $G_a$. However, given the subgraph $G_r$ rooted at $r$, it is not necessary to get group($a, b$) for every edge $(a, b)$ in $G_r$. Therefore, we propose the following vertex-grouping algorithm, which assigns vertices into group($a, b$) for some edges $(a, b)$ only. The algorithm is shown below, followed by a detailed discussion. In Grouping(), each edge $(u, r)$ is first stored in the "leader_set." When all vertices in group($u, r$) are found, the edge $(u, r)$ is removed from "leader_set" and put into a list $P_r$. We denote all edges in both "leader_set" and $P_r$ as "leader edges." Group vertex($x, y, a, b$) shows the process of checking whether the vertex $x$ should be added into group($a, b$). If $x \in G_a$ does not comply with the definition of group($a, b$), the edge $(x, y)$ is added into the "leader_set." The algorithm terminates when "leader_set" is empty.
In this algorithm, not all edges in $G_r$ are necessarily added to $P_r$, and we only obtain group($a, b$) for each "leader edge" $(a, b) \in P_r$. After Grouping(), all predecessors of $r$ are divided into groups and the "leader edge" of each group is stored in the list $P_r$.
For the circuit in Fig. 3(a), the results after Grouping() are shown in Fig. 3(c). In the figure, the thick lines are edges in $P_r$ (leader edges), and five groups are formed in total. With the grouping algorithm, we obtain an important property: it is possible to add a group (rather than a single vertex) into a cluster without increasing the path delay at the root vertex $r$.
V. ALGORITHM
In this section, we present our algorithm (as follows) based on the grouping strategy, which is a "nontrivial" extension of the algorithm in [1], for the circuit clustering problem under our new delay model. As in [1], our algorithm consists of two parts: the labeling phase (lines 3-12) and the clustering phase (lines 13-19). The clustering phase works in the same way as in [1].
ALGORITHM
In the labeling phase, we get a cluster cluster($r$) for each vertex $r$ in a topological order. For each non-PI vertex $r$, we first compute the value $l''(u, w)$ for each edge $(u, w) \in E$ which connects vertices in $G_r$. Based on the $l''$ values, we apply our vertex grouping algorithm to divide the vertices in $G_r$ into groups. After that, each group of vertices is considered for adding to cluster($r$) based on the $l''$ value of its leader edge. Thus, the leader edges are sorted in advance according to their $l''$ values.
For example, when applying our algorithm to the circuit in Fig. 3(a), the resultant clustering contains three clusters: cluster($g$) = $\{g, f, e\}$, cluster($c$) = $\{a, c\}$, cluster($d$) = $\{b, d\}$, and $l(g) = 21$ along the
In fact, the clustering is the same as the optimal one shown in Fig. 2(c).
The main difference between our algorithm and the algorithm in [1] is that we employ the grouping strategy and add vertices to a cluster on a group basis, which guarantees a minimum circuit delay under our delay model. Besides, in order to apply the grouping strategy, we calculate $l''$ for every edge, whereas the algorithm in [1] calculates $l'$ for every vertex and selects vertices accordingly, an approach that has been shown not to be applicable to our delay model.
For our algorithm, we have the following lemma describing some important properties of the sorted list $P'_r$ that is generated in line 11 of Circuit clustering().
P3 If a vertex $z$ is the predecessor of both $w$ and $u$ such that $z \in$ group($w, u$), there must exist a path from $z$ to $w$ such that all edges $(p, q)$ along the path satisfy $p \in$ group($w, u$) and $q \in$ group($w, u$) and they must also satisfy the following inequality: $l(p) + \Delta(q, r) + D > l(w) + \Delta(u, r) + D$.
P4 If $x$ is the predecessor of both $w$ and $u$ such that $y \in$ group($w, u$) and $x \notin$ group($w, u$), there must exist a path from $y$ to $w$ such that all edges $(p, q)$ along the path satisfy $p \in$ group($w, u$) and $q \in$ group($w, u$) and they must also satisfy the following inequality
VI. OPTIMALITY AND COMPLEXITY OF THE ALGORITHM
The authors in [1] state that in any clustering $S$ on a graph $G$, the path delay at any vertex $v$ is greater than or equal to the path delay at $v$ in an optimal clustering $S_{op}$ on $G_v$ [1, Lemma 1]. It can be easily proved that this lemma is also applicable to our new delay model. To prove the optimality of our algorithm, we first prove that: 1) the label of each vertex, $l(v)$ (which is calculated in the labeling phase), is a lower bound of the path delay at $v$ in any optimal clustering and 2) the clustering phase (lines 13-19) is able to construct a clustering such that the path delay at $v$ equals $l(v)$. From 1) and 2), together with [1, Lemma 1], it can then be proved that the clustering $S$ generated by our algorithm is an optimal clustering.
Before proving 1) and 2), we first explore an important lemma for vertices within a cluster.
Before proving (1) and (2), we first explore an important lemma for vertices within a cluster.
Lemma 2: For any cluster cluster(v) fvg generated in Labeling(), for each edge (x; y) such that x; y 2 G v , • Case 1
which is the maximum delay among the paths from all PI vertices in cluster($v$) to $v$, and by assumption, $l_1(v)$ is greater than or equal to the delay value of all other paths involved in calculating $l_2(v)$. Hence, the path delay at $v$ in any optimal clustering of the subgraph $G_v$ cannot be smaller than $l_1(v)$.
• Case 2 See the following condition. This case implies cluster(v) Gv . Here, we prove by contradiction Fig. 4(b) . Based on the induction hypothesis, 
This is depicted in Fig. 4(c), which contradicts the assumption delay($v$) $< l(v)$. As a result, the statement is also true for vertex $v$.
Lemma 4: In our algorithm, for any vertex $v$ in the clustering $S$ generated by the clustering phase (lines 13-19), the path delay at $v$ is less than or equal to $l(v)$.
Proof: Our delay model is different from that in [1] , but the clustering phase in our algorithm is the same as that of [1] , so the proof is the same. Details can be found in [1] .
Based on Lemma 3 and Lemma 4, we can easily derive the following theorem.
Theorem 1:
The clustering S generated in our algorithm is an optimal clustering for any instance of the problem described in Section II.
Proof: In Lemma 3, it is shown that for each vertex v, the label l(v) in our algorithm is less than or equal to the path delay at vertex v in any optimal clustering; Lemma 4 states that our algorithm is able to generate a clustering with the path delay at v less than or equal to l(v) which is the lower bound of the path delay at vertex v in any optimal clustering. Together with Lemma 1 in [1] , the clustering S generated by our algorithm is an optimal clustering.
We now analyze the complexity of our algorithm. In Grouping(), Group vertex() runs at most $|V|$ times, so the time complexity of the WHILE loop is $O(|V||E|)$. In Circuit clustering(), finding the maximum delay matrix $\Delta$ takes $O(|V|(|V| + |E|))$ time, finding a topological order in line 4 takes $O(|V| + |E|)$ time, the sorting in line 11 takes $O(|E| \lg |E|)$ time, and Labeling() takes only $O(|E|)$ time. So, the first WHILE loop of Circuit clustering() takes $O(|V|(|E| \lg |E| + |V||E|))$ time. The clustering phase (lines 13-19) takes $O(|V| + |E|)$ time. So the overall time complexity is $O(|V|(|E| \lg |E| + |V||E|)) = O(|V|^2 |E|)$.
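The final simplification can be spelled out as follows, assuming the circuit graph has no parallel edges, so that $|E| \le |V|^2$.

```latex
% Assumption: G is a simple directed graph, hence |E| <= |V|^2.
\begin{align*}
  \lg |E| &\le \lg |V|^{2} = 2 \lg |V| \le 2\,|V|, \\
  |E| \lg |E| &\le 2\,|V|\,|E| = O(|V|\,|E|), \\
  O\bigl(|V|\,(|E| \lg |E| + |V|\,|E|)\bigr) &= O\bigl(|V| \cdot |V|\,|E|\bigr) = O\bigl(|V|^{2}\,|E|\bigr).
\end{align*}
```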
Remarks: In fact, our algorithm can also handle the case where the intercluster delay $D$ is a variable value (say $D(x, y)$, $\forall (x, y) \in E$).
This is because the calculation of $l''(x, y) = l(x) + D + \Delta(y, r)$ includes the value of $D$, so if $D$ becomes a variable $D(x, y)$, the calculation becomes $l''(x, y) = l(x) + D(x, y) + \Delta(y, r)$, which still correctly represents the situation in which $(x, y)$ becomes an intercluster edge. Besides, the optimality of the algorithm still holds because all the theoretical results remain true and can be proved similarly.
VII. CONCLUSION
In this paper, we have introduced a new delay model which is more general and practical than the general delay model [3] . Under our new delay model, a circuit clustering algorithm based on a novel vertex grouping technique is proposed and is proved to optimally solve the area-constrained combinational circuit clustering problem for delay minimization in polynomial time.
