Abstract: Clustering (or partitioning) is a crucial step between logic synthesis and physical design in the layout of a large-scale design. A design verified at the logic synthesis level may have timing closure problems at post-layout stages due to the emergence of multiple-clock-period interconnects. Consequently, a tradeoff between clock frequency and throughput may be needed to meet the design requirements. In this paper, we find that the processing rate, defined as the product of frequency and throughput, of a sequential system is upper bounded by the reciprocal of its maximum cycle ratio, which is only dependent on the clustering. We formulate the problem of processing rate optimization as seeking an optimal clustering with the minimal maximum-cycle-ratio in a general graph, and present an iterative algorithm to solve it. Experimental results validate the efficiency of our algorithm.
I. INTRODUCTION

CIRCUIT clustering (or partitioning) is often employed between logic synthesis and physical design to decompose a large circuit into parts. Each part will be implemented as a separate cluster that satisfies certain design constraints, such as the size of a cluster. Clustering helps to provide the first-order information about interconnect delays as it classifies interconnects into two categories: intra-cluster ones are local interconnects due to their spatial proximity while inter-cluster ones may become global interconnects after floorplan/placement and routing (also known as circuit layout).
Due to aggressive technology scaling and increasing operating frequencies, interconnect delay has become the main performance limiting factor in large scale designs. Industry data shows that even with interconnect optimization techniques such as buffer insertion, the delay of a global interconnect may still be longer than one clock period, and multiple clock periods are generally required to communicate such a global signal. Since global interconnects are not visible at logic synthesis when the functionality of the implementation is the major concern, a design that is correct at the logic synthesis level may have timing closure problems after layout due to the emergence of multiple-clock-period interconnects.
This gap has motivated recent research to tackle the problem from different angles. Some approaches resort to retiming [15], a traditional sequential optimization technique that moves flip-flops within a circuit without destroying its functionality. It was used in [5], [16], [17], [21], and [22] to pipeline global interconnects so as to reduce the clock period. Although retiming helps to relieve the criticality of global interconnects, there is a lower bound on the clock period that can be achieved because retiming cannot change the latency of either a (topological) cycle or an input-to-output path in the circuit. In case the lower bound does not meet the frequency requirement, redesign and resynthesis may have to be carried out. One way to avoid redesign is to insert extra wire-pipelining units, such as flip-flops, to pipeline long interconnects, as done within Intel [6] and IBM [14]. It can be shown that if the period lower bound is determined by an input-to-output path, pipelining can reduce the lower bound without affecting the functionality. However, if the period lower bound is given by a cycle, inserting extra flip-flops in it will change its functionality.
The c-slow transformation [15] is a technique that slows down the input issue rate (see footnote 1) of the circuit to accommodate higher frequencies. It was thus used in [19] to retain the functionality when extra flip-flops were inserted in cycles. The slowdown is dictated by the slowest cycle in the circuit, where the ratio between the extra flip-flops and the original flip-flops is the maximum. Extra flip-flops are inserted in other cycles to match the slowdown. As a result, throughput is sacrificed (becomes 1/c) to meet the frequency requirement.
Instead of slowing down the throughput uniformly over the whole circuit, latency insensitive design (LID) [1] , [2] , on the other hand, employs a protocol that slows down the throughput of a part of the circuit only when it is needed. As a result, LID can guarantee minimal throughput reduction while satisfying the frequency requirement.
We show in Section II that the aforementioned three approaches (retiming, pipelining with c-slow, and pipelining with LID) can be unified under the same objective function of maximizing the processing rate, defined as the product of frequency and throughput, as illustrated in Fig. 1. In addition, the processing rate of a sequential system is upper bounded by the reciprocal of the maximum cycle ratio of the system, which is only dependent on the clustering. Therefore, we propose an optimal algorithm that finds a clustering with the minimal maximum-cycle-ratio under the single-value inter-cluster delay model.

Footnote 1: The issue rate is defined as the number of clock periods between successive input changes. An issue rate of 1 indicates that the inputs can change every clock period.

1063-8210/$20.00 © 2006 IEEE
The rest of this paper is organized as follows. Section II presents the problem formulation. Two previous works are reviewed in Section III. Section IV defines the notations and constraints used in this paper. Following an overview in Section V, our algorithm is elaborated in Sections VI and VII. Section VIII presents the speed-up techniques. Section IX reviews the techniques in [20] for cluster and replication reduction. We present some experimental results in Section X. Conclusions are given in Section XI.
II. PROBLEM FORMULATION
We consider clustering subject to a size limit for clusters. More specifically, each gate has a specified size, as well as each interconnect. We require that the size of each cluster, defined as the sum of the sizes of the gates and the interconnects in the cluster, should be no larger than a given constant . Replication of gates is allowed, i.e., a gate may be assigned to more than one cluster in the layout. When a gate is replicated, its incident interconnects are also replicated so that the clustered circuit is logically equivalent to the original circuit. Fig. 2 (taken from [20] ) shows an example of gate replication in a clustering.
Given a particular clustering $\Gamma$, we treat the replicas of gates and the original ones distinctly and denote them all as $V_\Gamma$. We use $E_\Gamma$ to denote the set of interconnects among $V_\Gamma$. The clustered circuit is represented as $G_\Gamma = (V_\Gamma, E_\Gamma)$. In order for the circuit to operate at a specified clock period $\tau$, additional wire-pipelining flip-flops are inserted. For every cycle $c$ in $G_\Gamma$, let $d(c)$ denote the cycle delay, and let $w(c)$ and $w'(c)$ denote the number of flip-flops in $c$ before and after the additional pipelining flip-flops are inserted, respectively. Assuming $w(c) > 0$, the cycle ratio of $c$ is defined as $\phi(c) = d(c)/w(c)$. Note that $\phi(c)$ is defined using $w(c)$, not $w'(c)$. The maximum cycle ratio over all the cycles in $G_\Gamma$ is denoted as $\phi_\Gamma$. We define processing rate as follows.

Definition 1: For a sequential system, processing rate is defined as the length of processed input sequence per unit time. In particular, it is the product of frequency and throughput in a synchronous system.
The larger the processing rate, the better the sequential system. Given the previous definition, the approach of retiming actually maximizes the processing rate by minimizing the period while keeping the throughput. It is interesting to notice that the approach of pipelining with c-slow transformation also maximizes the processing rate for a specified period by computing the least slowdown of the issue rate, which is transformed into throughput reduction. As an alternative, LID helps the clustered circuit reach the maximum throughput for a specified period. Therefore, all three approaches can be unified under the same objective function of maximizing the processing rate.
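It was shown in [3] and [4] that the maximum throughput $\theta$ of a LID for a specified period $\tau$ can be computed as

$$\theta = \min_{\text{cycle } c} \frac{w(c)}{w'(c)},$$

where $w(c)$ and $w'(c)$ are the numbers of flip-flops in cycle $c$ before and after the additional flip-flops are inserted. On the other hand, the fact that the circuit can operate at the specified period after the insertion of additional flip-flops implies that $d(c)/w'(c) \le \tau$, i.e., $w'(c) \ge d(c)/\tau$, for every cycle $c$, where $d(c)$ is the delay of $c$. Substitute this into the formula of $\theta$ to get

$$\theta \le \min_{\text{cycle } c} \frac{\tau\, w(c)}{d(c)} = \frac{\tau}{\max_{c} d(c)/w(c)} = \frac{\tau}{\phi_\Gamma},$$

where $\phi_\Gamma$ is the maximum cycle ratio of the clustered circuit. It follows that the maximum processing rate of a LID is upper bounded by $1/\phi_\Gamma$ since

$$\text{max processing rate} = \frac{\theta}{\tau} \le \frac{1}{\phi_\Gamma}.$$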
It is also an upper bound of the maximum processing rate obtained by the approach of retiming, as shown in [22] . In other words, all three approaches share the same upper bound of their common objective.
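The bound can be checked numerically on a toy instance. The following is a minimal sketch, assuming the LID throughput is the minimum over cycles of the ratio of original to total flip-flops, and assuming each cycle receives just enough extra flip-flops to meet the period. The function names and the two example cycles are hypothetical and do not come from the paper.

```python
from fractions import Fraction
from math import ceil

def max_cycle_ratio(cycles):
    """cycles: list of (delay, flip_flops) pairs, flip_flops > 0."""
    return max(Fraction(d, w) for d, w in cycles)

def lid_processing_rate(cycles, period):
    """Processing rate (throughput / period) after minimal pipelining.

    Assumption: each cycle gets the fewest flip-flops w_after such that
    delay / w_after <= period, and throughput = min over cycles of
    w / w_after.
    """
    period = Fraction(period)
    throughput = Fraction(1)
    for d, w in cycles:
        w_after = max(w, ceil(Fraction(d) / period))
        throughput = min(throughput, Fraction(w, w_after))
    return throughput / period

# Two hypothetical cycles with cycle ratios 5 and 2.
cycles = [(10, 2), (6, 3)]
bound = 1 / max_cycle_ratio(cycles)   # reciprocal of the maximum cycle ratio
for period in range(1, 11):
    assert lid_processing_rate(cycles, period) <= bound
```

In this example the bound is met with equality when the period equals the maximum cycle ratio, while an intermediate period such as 3 falls short of it because the flip-flop count must be rounded up to an integer.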
To maximize the processing rate, one can either maximize the upper bound or try to achieve the upper bound. They are equally important. However, since achieving the upper bound requires further knowledge on physical design, such as buffer and flip-flop allowable regions [9] , [22] , while the upper bound itself is only dependent on the maximum cycle ratio of the clustered circuit, we will consider how to optimally cluster the circuit such that the upper bound is maximized, or equivalently, the maximum cycle ratio is minimized.
In order to compute the maximum cycle ratio, we need to know how to compute the delay of a cycle during clustering. Although local interconnect delays can be obtained using some delay models at synthesis, the delays of global interconnects are not available until layout. Therefore, during clustering, we assume that each global interconnect induces an extra constant delay , as proposed in [20] . More specifically, if an interconnect with delay is assigned to be inter-cluster, then its delay becomes . The single-value inter-cluster delay model is the best approximation to distinguish potential global interconnects from local ones. In floorplanning, critical global interconnects are made short by placing the relevant modules closer. On the other hand, the values of cluster size and inter-cluster delay can be chosen deliberately to make this model practical. The intracluster interconnects will be very long if large is selected. On the other hand, shall not be too small otherwise there will be too many clusters and the inter-cluster delays will be similar to the intra-cluster delays after floorplanning. By carefully choosing , the single-value model can fulfill the need of integrating inter-cluster delay information in clustering.
Since we want to minimize the maximum cycle ratio, the path delays from primary inputs (PIs) to primary outputs (POs) can be ignored since they can be mitigated by pipelining. This motivates us to formulate the problem in a strongly connected graph.
Problem 1 (Optimal Clustering Problem): Given a directed, strongly connected graph , where each vertex has a delay and a specified size, and each edge has a delay , a specified size, and a weight (representing the number of flip-flops on it), find a clustering of vertices with possible vertex replication such that: 1) the size of each cluster is no larger than a given constant ; 2) each global interconnect induces an extra constant delay ; and 3) the maximum cycle ratio of the clustered circuit is minimized.
We assume that all delays are integral and, thus, all cycle ratios are rational. In addition, we assume that each gate has unit size and the size of each interconnect is zero. Our proposed algorithm can be easily extended to handle various size scenarios.
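To make the objective of Problem 1 concrete, the sketch below evaluates the maximum cycle ratio of a small clustered graph by brute-force enumeration of simple cycles. It is an illustration only: it ignores vertex replication, it is exponential in the worst case, and the names `simple_cycles`, `max_cycle_ratio`, `cluster_of`, and `extra_delay` are hypothetical. Each edge is a tuple `(u, v, delay, weight)`, and an edge whose endpoints lie in different clusters is charged the extra inter-cluster delay.

```python
from fractions import Fraction

def simple_cycles(n, edges):
    """Yield every simple cycle once, as a list of edge indices."""
    out = {u: [] for u in range(n)}
    for i, (u, _v, _d, _w) in enumerate(edges):
        out[u].append(i)

    def dfs(start, u, path, seen):
        for i in out[u]:
            v = edges[i][1]
            if v == start:
                yield path + [i]
            elif v > start and v not in seen:
                # Only visit vertices larger than the start so that each
                # cycle is reported from its smallest vertex exactly once.
                yield from dfs(start, v, path + [i], seen | {v})

    for s in range(n):
        yield from dfs(s, s, [], {s})

def max_cycle_ratio(vertex_delay, edges, cluster_of, extra_delay):
    """Largest delay/flip-flop ratio over all simple cycles.

    Assumes every cycle carries at least one flip-flop.
    """
    best = None
    for cyc in simple_cycles(len(vertex_delay), edges):
        delay = weight = 0
        for i in cyc:
            u, v, d, w = edges[i]
            delay += d + vertex_delay[v]      # each vertex counted once
            if cluster_of[u] != cluster_of[v]:
                delay += extra_delay          # inter-cluster edge
            weight += w
        ratio = Fraction(delay, weight)
        if best is None or ratio > best:
            best = ratio
    return best

# Hypothetical 3-gate example: unit gate delays, zero wire delays.
vertex_delay = [1, 1, 1]
edges = [(0, 1, 0, 0), (1, 2, 0, 0), (2, 0, 0, 1), (1, 0, 0, 1)]
singletons = max_cycle_ratio(vertex_delay, edges, [0, 1, 2], 2)
paired = max_cycle_ratio(vertex_delay, edges, [0, 0, 1], 2)
merged = max_cycle_ratio(vertex_delay, edges, [0, 0, 0], 2)
```

With an extra delay of 2, putting every gate in its own cluster gives a ratio of 9, grouping gates 0 and 1 gives 7, and a single cluster gives 3, which illustrates how the clustering alone determines the quantity being minimized.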
III. PREVIOUS WORK
Pan et al. [20] proposed to optimally cluster a sequential circuit such that the lower bound of the period of the clustered circuit was minimized with retiming. However, the period lower bound may not come from a cycle ratio. In addition, their algorithm needs to start from PIs, thus, cannot be used to solve our problem in a strongly connected graph. Applying their algorithm with arbitrary PI selection may lead to a suboptimal solution. Consider an example circuit consisting of two gates and , connected in a ring with , , ,
, and zero gate and interconnect delays. If we pick as PI and as PO, their algorithm will return , but the optimal solution is . In this sense, they solved a different problem, even though it looks similar to ours.
Their problem was solved by binary search, using a test for feasibility as a subroutine. For each target period, they used a procedure called labeling computation to check the feasibility. The procedure starts with label assignment for PIs and for the other vertices, and repeatedly increases the label values until they all converge or the label value of some PO exceeds the target period, for which the target period is considered infeasible. For each vertex, the amount of increase in its label is computed using another binary search that basically selects the minimum from a candidate set. Because of the nested binary searches, their algorithm is relatively slow. In addition, the algorithm requires space to store a precomputed all-pair longest-path matrix, which is impractical for large designs. Cong et al. [10] improved the algorithm by tightening the candidate set to speed up the labeling computation, and by reducing the space complexity to linear dependency. But the improved algorithm still needs the nested binary searches.
Besides the difference in problem formulation, our algorithm differs from theirs in two algorithmic aspects. First, our algorithm focuses on cycles and thus can work on any general graph. Second, no binary search is employed in our algorithm. As a result, our algorithm is efficient and essentially incremental. Like [10], our algorithm does not need precomputed information on paths either.
Except for these differences, [20] revealed some important results on clustering, which we summarize in the following to simplify our notations.
• Each cluster has only one output, which is called the root of the cluster. If there is a cluster with more than one output, we can replicate the cluster enough times so that each copy of the cluster has only one output.
• For each vertex in , there is exactly one cluster rooted at it and no cluster rooted at its replicas. Its arrival time (defined in Section IV) is no larger than the arrival times of its replicas.
• If is an input of the cluster rooted at , then the cluster rooted at must not contain a replica of .
IV. NOTATIONS AND CONSTRAINTS
First of all, we define notations with respect to and (the clustered circuit of a particular clustering ), respectively.
We use to denote a path in . Let be the number of flip-flops on , which is the sum of the weights of 's constituent edges. Let be the delay along , which is the sum of the delays of 's constituent edges and vertices, except for . When a path actually forms a cycle , includes the weight of each edge in the cycle only once. Similarly, includes the delay of each edge and vertex in the cycle only once. We assume, in this paper, that for all cycle . Each of the previous notations is appended with a subscript when it is referred to with respect to . More specifically, a path in is denoted as with flip-flops and delays. Note that the delay of an inter-cluster edge is . A cycle in is denoted as with flip-flops and delays. We have for all cycle . We use to denote the cycle ratio of , and to denote the maximum cycle ratio of .
Since we only need to consider clusters rooted at the vertices in , exactly one for each vertex, we use to refer to the set of vertices that are included in the cluster rooted at . Let be the set of inputs of . In the remainder of this paper, when we say , we mean that the cluster rooted at contains a replica of . For example, Fig. 3 (a) shows a circuit before clustering. There are five vertices (a-e) and seven edges. Fig. 3(b) illustrates a clustering of the circuit with size limit , where dashed circles represent clusters. For each cluster, the vertex whose index is outside the cluster indicates the root. For example, contains replicas of vertices and with the input set . Note that, is the output of , even though has no outgoing edges. We use a label to denote the arrival time of the vertex. To ease the presentation, we will extend the domain of to to represent the arrival times of the replicas of the vertices. Based on this, a clustering that satisfies the cluster size requirement and has a maximum cycle ratio no larger than a given rational value can be characterized as follows:
where (1)- (3) guarantee that the arrival times are all achievable, and (4) We must note that for a legal clustering, its maximum cycle ratio is feasible. In fact, any value larger than the maximum cycle ratio of a legal clustering is also feasible.
Given a feasible clustering under , consider an edge . It is either in with , or there is an edge such that is a replica of . In either case, the following inequality is true because of (2), (3), and the fact that from [20] 
The following lemma provides a lower bound for .

V. OVERVIEW
The optimal clustering problem asks for a legal clustering with the minimal maximum-cycle-ratio. Since , the clustering with each vertex being a cluster is certainly legal. Starting from it, we will iteratively improve the clustering by reducing its maximum cycle ratio until the optimality is certified.
First of all, the maximum cycle ratio of a legal clustering is feasible and can be efficiently computed using Howard's algorithm [7] , [12] . Given a feasible , we show that, unless is already the optimal solution, a particular legal clustering can be constructed whose maximum cycle ratio is smaller than . The smaller can be obtained by applying Howard's algorithm on the constructed clustering. Therefore, we alternate between applying Howard's algorithm and constructing a better clustering until the minimal is reached.
VI. CLUSTERING UNDER A GIVEN
Given , if is feasible, we show, in this section, how to construct a feasible clustering under , i.e., a clustering satisfying (1)-(5) under , whose maximum cycle ratio is no larger than .
A. Choosing (1) and (5) as Invariant
We choose to first satisfy (1) and (5) because they are independent of the clustering, and iteratively update and to satisfy (2)-(4) while keeping (1) and (5).
Let denote the arrival time vector, i.e.,
A partial order can be defined between two arrival time vectors T and as follows:
According to the lattice theory [13] , if we treat assignment as the bottom element and assignment as the top element , then the arrival time vector space becomes a complete partially ordered set, that is, for all , . To satisfy (1), we set . Then we apply a modified Bellman-Ford's algorithm, defined as , on to satisfy (5) under . The modified Bellman-Ford's algorithm is the same as Bellman-Ford's algorithm [11] except that it takes two inputs: a given arrival time vector and a value of . The value of is used to specify (5) so that we can perform relaxation starting from the given arrival time vector . The relaxation is guaranteed to converge when . Given under , the resulting arrival time vector of is denoted as
In fact, is the least vector satisfying (1) and (5) under , as stated in the following lemma.
Lemma 2: , for all satisfying (1) and (5) under .
Proof: Suppose we have a satisfying (1) and (5) 
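The relaxation step described above can be sketched as follows. This is a hedged illustration and not the paper's exact procedure: the form of constraint (5) is not recoverable from the text here, so the code assumes it reads t(v) >= t(u) + d(u,v) + d(v) - phi * w(u,v) for every edge (u,v). The function name `relax_from` and the example ring are hypothetical.

```python
from fractions import Fraction

def relax_from(t, phi, vertex_delay, edges):
    """Bellman-Ford style relaxation started from a given vector t.

    Returns the least vector >= t satisfying the assumed edge constraint
    t[v] >= t[u] + d + vertex_delay[v] - phi * w for all edges
    (u, v, d, w), or None if the relaxation does not converge (some
    cycle has a ratio larger than phi).
    """
    t = list(t)
    n = len(t)
    for _ in range(n):                 # n passes suffice when convergent
        changed = False
        for u, v, d, w in edges:
            cand = t[u] + d + vertex_delay[v] - phi * w
            if cand > t[v]:
                t[v] = cand
                changed = True
        if not changed:
            return t
    return None

# Hypothetical two-gate ring with one flip-flop: its cycle ratio is 2.
vertex_delay = [1, 1]
edges = [(0, 1, 0, 0), (1, 0, 0, 1)]
ok = relax_from([0, 0], Fraction(2), vertex_delay, edges)
bad = relax_from([0, 0], Fraction(1), vertex_delay, edges)
```

Starting from the all-zero vector, the relaxation converges for phi = 2 and diverges for phi = 1, in line with the statement that convergence is guaranteed only when phi is large enough.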
B. Transformation to Satisfy (2)-(4) Under
In order to satisfy (2)-(4) while keeping (1) and (5) is the largest among all paths from to in . Therefore, , which concludes our proof.
Lemma 3 implies that when a vertex is put in , its preceding vertices that are already in can be ignored for the computation of .
C. Implementation of
Our implementation of is similar to the label computation in [10] . To characterize critical inputs, we introduce another label . Before the construction of , we assign with for all in while with . At each time, the vertex with the largest is identified. If , the construction is completed. Otherwise, we put it in and update with . This procedure will iterate until either or the last vertex identified has . To validate the previous procedure, we need to show that it is equivalent to , or equivalently, to show that it can always identify the critical input of . This is fulfilled by the next lemma and corollary.
Lemma 4: Given that (1) and (5) . Since is the largest among all paths from to , we have . It follows that , which contradicts that is the input with the largest . Therefore, the lemma is true.
The pseudocode for computing is given in Fig. 4 . It employs a heap for bookkeeping the vertices whose . At each iteration, it puts in the input with the largest and updates for each fanin of that becomes an input in . In our implementation, we choose Fibonacci heap [11] for .
The complexity of in Fig. 4 is given in the following lemma.
Lemma 5: The procedure in Fig. 4 terminates in time.

Proof: At each iteration, the complexity of extracting the vertex with the maximum is by Fibonacci heap [11] . For each , updating takes time. On the other hand, since a vertex cannot be put in more than once, the total number of edges processed by the inner for-loop is . Therefore, the complexity of the procedure is , or .
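The heap-driven construction can be sketched as below. This is an illustrative approximation rather than the pseudocode of Fig. 4: the exact label definition and stopping test are not recoverable here, so the sketch assumes that the label of a candidate input u is t(u) plus the longest (delay minus phi times weight) path from u to the root inside the cluster, and it stops only when the size limit is reached or no inputs remain. Python's binary heap (`heapq`) stands in for the Fibonacci heap, and all names are hypothetical.

```python
import heapq

def build_cluster(root, t, phi, vertex_delay, fanin, size_limit):
    """Greedily grow the cluster rooted at `root`.

    fanin[x] is a list of (u, edge_delay, edge_weight) for edges u -> x.
    Returns (cluster, inputs) where inputs maps each remaining input
    vertex to its label.
    """
    cluster = {root}
    best = {}      # vertex -> (label, path value to root)
    heap = []      # max-heap emulated with negated labels

    def push_fanins(x, dist_x):
        for u, d, w in fanin[x]:
            if u in cluster:
                continue
            dist_u = d + vertex_delay[x] - phi * w + dist_x
            label = t[u] + dist_u
            if u not in best or label > best[u][0]:
                best[u] = (label, dist_u)
                heapq.heappush(heap, (-label, u))

    push_fanins(root, 0)
    while heap and len(cluster) < size_limit:
        neg_label, u = heapq.heappop(heap)
        if u in cluster or best[u][0] != -neg_label:
            continue                   # stale heap entry
        cluster.add(u)                 # absorb the most critical input
        push_fanins(u, best[u][1])
    inputs = {u: lab for u, (lab, _) in best.items() if u not in cluster}
    return cluster, inputs

# Hypothetical 4-vertex example with unit gate delays.
t = [0, 5, 6, 7]
vertex_delay = [1, 1, 1, 1]
fanin = {3: [(2, 0, 0)], 2: [(1, 0, 0), (0, 0, 0)], 1: [], 0: [(3, 0, 1)]}
cluster, inputs = build_cluster(3, t, 4, vertex_delay, fanin, 3)
```

Each vertex enters the cluster at most once and every edge is examined at most once per endpoint, which mirrors the counting argument used in the proof of Lemma 5.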
D. Clustering Under as a Fixpoint Computation
We define as the arrival time vector when all the 's , are applied once, followed by the modified Bellman-Ford's algorithm to ensure (5), expressed as
The following lemma shows that is an order-preserving transformation.
Lemma 6: For any and satisfying (1) and (5) , otherwise, , which contradicts the assumption that . In addition, since satisfies (1) and (5) Fig. 5(c) . By the same argument, we can show that . In either case, we have , i.e., , which is a contradiction. Therefore, the assumption is wrong and is true . It is easy to verify that after applying the modified Bellman-Ford's algorithm.
We say that is a fixpoint of under if and only if . The following theorem bridges the existence of a fixpoint and the feasibility of .
Theorem 1: is feasible if and only if has a fixpoint under .
Proof: ($\Rightarrow$) If is feasible, then, by definition, there exists a legal clustering whose arrival time vector satisfies (1)-(3) and (5) under . We claim that . Otherwise, for some , and has a critical input , as shown in Fig. 5(a). We can conduct a similar case study as in Fig. 5(b) and (c) to show that , which is a contradiction. On the other hand, by the procedure of . Therefore, . Given that satisfies (5), applying the modified Bellman-Ford's algorithm gives , i.e., is a fixpoint of under .

($\Leftarrow$) If has a fixpoint under , then, by the definition of , the constructed clustering is legal and the arrival time vector satisfies (1)-(3) and (5) under . Therefore, is feasible.

In fact, according to the lattice theory [13], if , defined on a complete partially ordered set, has a fixpoint under , then it has a least fixpoint , defined as
We use to denote the clustering constructed by . In fact, if , then there is a critical path from to with . This is made precise in the following lemma. Fig. 6 , we will choose to be . Now consider any incoming edge of (from a vertex outside of to a vertex in ), it must be noncritical, otherwise, we can trace back from this edge and find another critical cycle that is not in , which is a contradiction. Since the arrival times of the vertices in are all greater than zero, we can decrease them simultaneously while keeping the arrival times of other vertices unchanged until some incoming edge of becomes critical or the arrival time of some vertex in is reduced to zero. For either case, we obtain a fixpoint less than , which is a contradiction. Therefore, the lemma is true.
To reach a fixpoint, iterative method can be used on . It starts with as the initial vector, iteratively computes new vectors from previous ones until it finds a such that . The following lemma states that applying iterative method on will converge to its least fixpoint in a finite number of iterations.
Lemma 8: If is feasible, applying iterative method on will converge to in a finite number of iterations. Proof: Since we start with , Lemma 6 ensures that at each iteration. Therefore, if converges, the fixpoint has to be the least fixpoint. What remains is to show that is finitely convergent.
By Lemmas 3 and 7, if , then can be written as where and . Given that each vertex in has exactly one cluster rooted at it, we know that , thus, , where . On the other hand, since is a rational number, it can be expressed as , where and are integers and . If is increased during the iteration, the amount of increase will be at least . Therefore, if does not converge after iterations, then there exists a vertex whose , which contradicts , which concludes our proof.

The next result is a corollary of Lemmas 6-8.
Lemma 9: implies that is infeasible.
Proof: Suppose, otherwise, that is feasible. Then, by Lemmas 6 and 8, when is reached, we have , which contradicts Lemma 7. Therefore, is infeasible.
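The iterative method of this section follows the usual pattern for computing the least fixpoint of an order-preserving transformation. The sketch below shows only that generic pattern, under stated assumptions: `f` is any monotone function on vectors, `bottom` is the starting vector, and `bound` is a hypothetical stand-in for the threshold in Lemma 9 beyond which the target value is declared infeasible.

```python
def least_fixpoint(f, bottom, bound, max_iters=10_000):
    """Iterate t <- f(t) from `bottom` until f(t) == t.

    Returns the fixpoint, or None if some entry exceeds `bound`
    (taken here as evidence of infeasibility) or if `max_iters`
    is exhausted.
    """
    t = bottom
    for _ in range(max_iters):
        nxt = f(t)
        if nxt == t:
            return t
        if any(x > bound for x in nxt):
            return None
        t = nxt
    return None

# Toy monotone transformation: each entry is pushed up by its predecessor.
def push(t):
    return (max(t[0], 0), max(t[1], t[0] + 1), max(t[2], t[1] + 1))

fix = least_fixpoint(push, (0, 0, 0), bound=100)

# A transformation with no fixpoint: the entry grows past the bound.
none = least_fixpoint(lambda t: (t[0] + 1,), (0,), bound=100)
```

Because the iteration starts from the bottom element and the transformation is order-preserving, any fixpoint it reaches is the least one, which is the argument used in the proof of Lemma 8.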
VII. OPTIMAL CLUSTERING ALGORITHM
Given a legal clustering , its maximum cycle ratio is feasible. If is not optimal, then we can find a feasible , which is specified in the following lemma.
Lemma 10: Given that is the maximum cycle ratio of a legal clustering , if is not optimal, then is also feasible, where is the total number of flip-flops in .

Proof: Let denote the cycle with the maximum cycle ratio, that is, . If is not optimal, it means that there exists another legal clustering whose maximum cycle ratio is smaller than . Let be the cycle with . The difference between and can be written as since all delays are integers. In addition, since each vertex in has exactly one cluster rooted at it, both and can pass at most clusters. Thus, neither nor will be larger than . Therefore, . In other words, is also feasible.

It implies that we can certify the optimality of by checking the feasibility of .

The algorithm for finding the optimal is presented in Fig. 7. It first computes a feasible by treating each vertex as a cluster, and computes a lower bound of by Lemma 1. After that, it checks the feasibility of by iterative method on . If converges, it means that we find a better clustering whose maximum cycle ratio is at most and can be computed by Howard's algorithm. The evidence of or the fact that is reduced below immediately certifies the optimality of the current feasible .
We prove the correctness of the algorithm by showing that it returns the optimal when it terminates.
Theorem 2: The algorithm in Fig. 7 will return a clustering with the optimal when it terminates.
Proof: When the algorithm terminates, we have either , or under . For the first case, Lemma 10 ensures that is optimal, otherwise, is feasible, which contradicts Lemma 1. For the second case, is infeasible by Lemma 9, which, by Lemma 10, implies that is optimal. The optimal and the corresponding clustering are recorded in and . We finally present the worst case complexity of the algorithm in the next theorem.
Theorem 3: The algorithm in Fig. 7 terminates in time, where and is the total number of flip-flops in .
Proof: First of all, is reduced during the execution of the outer while-loop in Fig. 7 . Since the amount of decrease in is at least after each loop, the algorithm will terminate in loops, where is an upper bound of . Since , the number of loops can be bounded by . At each loop, it takes to compute by the modified Bellman-Ford's algorithm. The complexity of maximum-cycle-ratio computation can be bounded by [12] , or since consists of clusters and the size of each cluster is no larger than .
We next analyze the complexity of checking the feasibility of . According to the proof of Lemma 8, if is feasible, iterative method will converge in iterations, where and is an integer such that the product of and is integral. Since , we have . On the other hand, since is the maximum cycle ratio of a legal clustering minus , it is true that . As a result, the number of iterations can be bounded by , where . The complexity of each iteration can be computed as by Lemma 5. Therefore, the computational complexity of each loop is and the theorem is true.

Remark 1: The significance of Theorem 3 is not the actual formula of the bound, but showing that the optimal clustering problem has a pseudopolynomial time complexity. Furthermore, this bound should be interpreted with caution. The worst case complexity is based on a series of assumptions that are very unlikely to hold simultaneously in practice. For example, although we check the feasibility of a value that is only slightly smaller than a given feasible , the improvement at each loop is not small. This is because the new clustering will have a different structure whose maximum cycle ratio is usually much smaller than the given . This is confirmed by our experiments in Section X. Therefore, we believe that the worst case bound we obtained is a loose upper bound of the actual running time. A tighter bound may exist, but we have not been able to derive it so far. The efficiency of our algorithm is confirmed by the experiments.
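The alternation between evaluating a clustering and constructing a better one can be summarized by the following skeleton. It is a sketch under assumptions, not the algorithm of Fig. 7 itself: the three callables and the `step` argument are hypothetical placeholders. `cycle_ratio` plays the role of Howard's algorithm, `cluster_under` plays the role of the fixpoint-based construction and returns None when its target is infeasible, and `step` stands for the decrement given in Lemma 10, whose exact value is not reproduced here.

```python
from fractions import Fraction
from math import floor

def optimize(initial, cycle_ratio, cluster_under, lower_bound, step):
    """Repeatedly try to beat the current maximum cycle ratio by `step`.

    Assumes cluster_under(target) returns a clustering whose maximum
    cycle ratio is at most `target`, or None if no such clustering
    exists.
    """
    best = initial
    phi = cycle_ratio(best)
    while phi - step >= lower_bound:
        candidate = cluster_under(phi - step)
        if candidate is None:          # infeasible: current phi is optimal
            break
        best = candidate
        phi = cycle_ratio(best)        # usually far below phi - step
    return best, phi

# Toy stand-ins: a "clustering" is an integer equal to its own ratio,
# and nothing below 3 is achievable.
toy_ratio = lambda c: Fraction(c)
toy_cluster = lambda target: floor(target) if target >= 3 else None
result = optimize(10, toy_ratio, toy_cluster, 0, Fraction(1, 4))
```

The loop ends either because the construction reports infeasibility or because the next target would fall below the lower bound, matching the two termination cases distinguished in the proof of Theorem 2.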
VIII. SPEED-UP TECHNIQUES
A. Variations of
In Section VI, is defined as applying all the 's , once followed by the modified Bellman-Ford's algorithm. In our implementation, the 's are not all computed at the same time. Intuitively, if previously computed 's can be taken into account in later computations of others, the convergence rate may be accelerated.
This motivates our study on a variation of , in which later computations of 's are based on previously computed ones, and each computation of is followed by the modified Bellman-Ford's algorithm. Let denote the vector after is updated with , that is Define where . It can be shown that different evaluation orders of give different 's. However, they all satisfy the following relation.
Lemma 11: For any satisfying (1) and (5) . To this aim, we observe that , provided that . Based on this, we can show by induction that . Therefore, the lemma is true.
As a corollary, the next result ensures that we can apply iterative method on to reach .

Corollary 11.1: If is feasible, applying iterative method on will converge to in a finite number of iterations, independent of the evaluation order of .
Proof: Since is feasible, is finitely convergent by Lemma 8. Let be the number of iterations such that . The corollary can be proven if we can show that , or equivalently, . The former part can be derived from Lemma 11 because . The latter part is also a consequence of Lemma 11 since .
B. Reduced Clustering Representation
It was shown in [12] that Howard's algorithm was by far the fastest algorithm for maximum-cycle-ratio computation. Given a clustered circuit with edge delays and weights specified, Howard's algorithm finds the maximum cycle ratio in time, where is the product of the out-degrees of all the vertices in . Since vertex replication is allowed, and could be and , respectively, where is the product of the out-degrees of the vertices in .
To reduce the complexity, we propose a reduced clustering representation for maximum-cycle-ratio computation. For each cluster, we use edges from its inputs to its output (root) to represent the paths between them such that the delay and weight of an edge correspond to the delay and weight of an acyclic input-to-output path. Fig. 8 shows the reduced representation of the clustered circuit in Fig. 3(b) .
Let denote the maximum cycle ratio of the reduced representation for clustering . The following lemma formulates the relation among , , and the lower bound defined in Lemma 1.
Lemma 12: For any clustering , .

Proof: All the cycles in can be classified into two groups according to whether they contain an inter-cluster edge or not. If a cycle contains only intra-cluster edges, its cycle ratio is upper bounded by . If a cycle contains inter-cluster edges, it is present in the reduced representation and, thus, its cycle ratio is upper bounded by .
One benefit of the reduced clustering representation is that we can now represent the clustered circuit without explicit vertex replication, that is, using instead of . Let denote the reduced representation for clustering . We call an edge in redundant if its removal will not affect the maximum cycle ratio of . The following lemma provides a criterion to prune the redundant edges so that Howard's algorithm can find the maximum cycle ratio of more efficiently. In practice, the number of flip-flops on an input-to-output path in a cluster is much smaller than , which enables the efficiency of the reduced clustering representation.
In our implementation, we employ another two parameters and to record the path delays and weights from the inputs of a cluster to its output, respectively. More specifically, we set , before is about to be carried out for some . After that, whenever a vertex is put in , we compute the and values of its preceding vertices based on and , followed by pruning.
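The pruning step can be illustrated with a small sketch. The criterion used here is an assumption and may differ from the one stated in the paper's lemma: among parallel edges from the same cluster input to the same root, an edge with label (delay, weight) is treated as redundant when another edge has delay at least as large and weight no larger, because substituting the latter into any cycle cannot lower that cycle's ratio. The names `PathLabel` and `prune_dominated` are hypothetical.

```python
from typing import List, Tuple

# Hypothetical label of one input-to-root path: (delay, weight).
PathLabel = Tuple[float, int]


def prune_dominated(labels: List[PathLabel]) -> List[PathLabel]:
    """Keep only the labels that no other label dominates.

    Assumed dominance rule: (d1, w1) dominates (d2, w2)
    when d1 >= d2 and w1 <= w2.
    """
    kept: List[PathLabel] = []
    best_delay = float("-inf")
    # Visit labels by increasing weight and, within a weight, decreasing delay.
    # A label survives only if its delay beats every label of lower or equal
    # weight seen so far.
    for d, w in sorted(set(labels), key=lambda lw: (lw[1], -lw[0])):
        if d > best_delay:
            kept.append((d, w))
            best_delay = d
    return kept


example_labels = [(5, 1), (4, 1), (7, 2), (6, 3), (5, 0)]
example_kept = prune_dominated(example_labels)
```

In this example only the labels `(5, 0)` and `(7, 2)` survive: the others have no more delay than a surviving label that crosses no more flip-flops.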
IX. CLUSTER AND REPLICATION REDUCTION
In this section, we briefly review the techniques that were used in [18] and [20] to reduce the number of clusters and vertex replication.
In Section III, we assume that each cluster has one output. If this assumption is relaxed, a post-processing step can be added to reduce the number of clusters. For example, if the arrival time of a vertex in its own cluster is equal to the arrival time of a copy of in another cluster, then the entire cluster at can be removed, and replaced by the copy.
TABLE I. SEQUENTIAL CIRCUITS FROM ISCAS-89
Replicated vertices can also be reduced as follows. If the arrival times of two copies of a vertex differ by an amount greater than or equal to the inter-cluster delay , then the output of the copy with the smaller arrival time can replace the copy with the larger arrival time. In addition, there are slacks available for vertices on noncritical paths and their arrival times need not be the least fixpoint. This property can also be used to further remove replicated vertices. Once this reduction of replicated vertices is carried out, there may be clusters that are not completely filled. We can merge some of the clusters, provided that the area bound is not exceeded. Using these techniques, the area overhead can be reduced to 14%, as shown in [10].
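The arrival-time rule for removing replicated vertices can be sketched as follows. This is a simplified illustration, not the procedure of [18] or [20]: it compares every copy only against the earliest copy of the same vertex, and the dictionary layout and the names `replaceable_copies` and `inter_cluster_delay` are assumptions made for the example.

```python
from typing import Dict, List


def replaceable_copies(arrival: Dict[str, float],
                       inter_cluster_delay: float) -> List[str]:
    """Return the clusters whose copy of a vertex can be dropped.

    `arrival` maps a cluster name to the arrival time of the vertex's copy
    in that cluster. A copy is replaceable when its arrival time exceeds
    that of the earliest copy by at least the inter-cluster delay, so the
    earliest copy's output can be routed to it without hurting timing.
    """
    earliest = min(arrival, key=arrival.get)
    t0 = arrival[earliest]
    return sorted(cluster for cluster, t in arrival.items()
                  if cluster != earliest and t - t0 >= inter_cluster_delay)


# Copies of one vertex in three clusters, with an inter-cluster delay of 2.
example_drop = replaceable_copies({"A": 3.0, "B": 5.0, "C": 4.0}, 2.0)
```

Here only the copy in cluster `B` qualifies, since its arrival time trails the earliest copy by exactly the inter-cluster delay, while the copy in `C` trails by less.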
It is worth pointing out that it is both unnecessary and prohibitive in terms of memory to keep all clusters during clustering under a given . The cluster rooted at each vertex is dynamically built during the execution of procedure and released when the procedure finishes. The existence of such a cluster is implied by the updated arrival time of . The whole clustered circuit needs to be built only when we want to compute its maximum cycle ratio, or when is found. In the former case, the reduced clustering representation helps to manage the storage requirement. In the latter case, a reasonable overhead can be obtained using the aforementioned techniques.
X. EXPERIMENTAL RESULTS
We implemented the algorithm on a PC with a 2.4-GHz Xeon CPU, 512-kB second-level cache memory, and 1-GB RAM. To compare with the algorithm in [20], we used the same test files, which were generated from the ISCAS-89 benchmark suite. For each test case, we introduced a flip-flop with directed edges from each PO to it and from it to each PI so that every PI-to-PO path became a cycle. As in [20], the size and delay of each gate were set to , intra-cluster delays were , and inter-cluster interconnects had delays . The circuits used are summarized in Table I. We also list the maximum cycle ratio of the circuit before clustering in column , which provides a lower bound of the solution by Lemma 1.
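The benchmark preparation step, in which one flip-flop closes every PI-to-PO path into a cycle, can be sketched as a simple graph transformation. The edge-list representation, the function name `close_with_feedback_ff`, and the vertex name `FF_fb` are hypothetical; in particular, modeling the flip-flop as an extra vertex rather than as an edge weight is an assumption made for brevity.

```python
from typing import List, Sequence, Tuple

Wire = Tuple[str, str]  # directed edge (driver, sink)


def close_with_feedback_ff(edges: Sequence[Wire],
                           primary_inputs: Sequence[str],
                           primary_outputs: Sequence[str],
                           ff_name: str = "FF_fb") -> List[Wire]:
    """Return a new edge list with one feedback flip-flop added.

    Every primary output drives the flip-flop and the flip-flop drives
    every primary input, so each PI-to-PO path becomes part of a cycle.
    """
    closed = list(edges)
    closed += [(po, ff_name) for po in primary_outputs]
    closed += [(ff_name, pi) for pi in primary_inputs]
    return closed


# Tiny combinational example: inputs a and b feed gate g, which drives output z.
example_closed = close_with_feedback_ff(
    [("a", "g"), ("b", "g"), ("g", "z")], ["a", "b"], ["z"])
```

The original edge list is left untouched, so the same benchmark can be reused with and without the feedback flip-flop.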
Although in theory the algorithm in Fig. 7 reaches the exact solution without being given a precision, in practice we must account for the floating-point error of finite-precision arithmetic, which arises from the divisions involved in the maximum-cycle-ratio computation. In our experiments, we set the error to be 0.001. Since is generally smaller than 0.001, we set the precision of to be 0.01. For each circuit, we tested three size bounds: is 5%, 10%, and 20% of the number of gates. The optimal maximum-cycle-ratio for each circuit is shown in Table II.
To illustrate the advantage of our incremental algorithm over binary search, we ran binary search to find the optimal maximum-cycle-ratio using the proposed iterative method as a subroutine for feasibility checking. The lower bound of the binary search was and the upper bound was the maximum-cycle-ratio of the clustering where each vertex itself is a cluster. The binary search precision was also set to be 0.01. We report the results in Table III, where column "#step" lists the number of search steps and column "time(s)" lists the running time in seconds. "BS" refers to the binary search-based algorithm and "INC" refers to our incremental algorithm. Row "arith" ("geo") gives the arithmetic (geometric) mean of the running times. The results show that the incremental algorithm is more efficient.
To compare with the running time in [20], where the optimal clock period is integral, we set the precision of to be and ran the algorithm again for , , , respectively. The obtained matches the result in [20] for all the scenarios of . The only running time information given in [20] is the largest running time per step among the three scenarios, which we list in Table IV under column "[20]." We report ours in column "ours." Note that the running time from [20] was based on an UltraSPARC 2 workstation.
We observe that, for most of the circuits, our algorithm finds the optimal solution in just a few steps, generally fewer than the iterations a binary search would need; the iteration counts are not given in [20].
TABLE III. IMPROVEMENT OVER BINARY SEARCH IN RUNNING TIME
TABLE IV. RUNNING TIME COMPARISON WITH [20] (IN SECONDS)
XI. CONCLUSION
Processing rate, defined as the product of frequency and throughput, is identified as an important metric for sequential circuits. We show that the processing rate of a sequential circuit is upper bounded by the reciprocal of its maximum cycle ratio, which depends only on the clustering of the circuit. The problem of processing rate optimization is formulated as seeking an optimal clustering with minimal maximum-cycle-ratio in a general graph. An iterative algorithm is proposed that finds the minimal maximum-cycle-ratio under the single-value inter-cluster delay model. Since our algorithm avoids binary search and is essentially incremental, it has the potential to be combined with other optimization techniques, such as gate sizing and budgeting, and thus can be used in incremental design methodologies [8]. In addition, since maximum cycle ratio is a fundamental metric, the proposed algorithm can be adapted to suit other traditional designs.
