Abstract-We propose an efficient performance-driven two-way partitioning algorithm that takes into account clock period and latency with retiming. We model the problem with a Quadratic Programming formulation to minimize the crossing edge count under nonlinear timing constraints. By using a Lagrangian Approach on Modular Partitioning (LAMP), we merge the nonlinear constraints into the objective function. We then decompose the problem into primal and dual subprograms. The primal program is solved by a heuristic Quadratic Boolean Programming approach and the dual program is solved by a subgradient method using a cycle mean method. Experimental results on seven industrial circuits demonstrate that our algorithm achieves an average of 23.25% clock period and latency reductions compared to the best results produced by 20 runs on each test case using a Fiduccia-Mattheyses algorithm. In terms of the average number of crossing edges, our results are only 1.85% more than those of the Fiduccia-Mattheyses algorithm without timing constraints. Compared with the previous network-flow-based approach, our algorithm reduces the average crossing edge count by 14.59%. Furthermore, an average of 7.70% clock period and latency reductions are achieved.
I. INTRODUCTION

DUE TO PHYSICAL GEOMETRIC distance and interface technology limitations, intermodule delay contributes a dominant portion of signal propagation delay. Consequently, instead of minimizing the number of crossing edges [4], [8], [11], [14], [23], [27] as the only objective during partitioning, we should take into account the intermodule delay for performance-driven partitioning.
Clock period is a major measure of circuit performance. For a partitioning problem with timing and size constraints, Shih et al. [25] first proposed a multiway Kernighan-Lin (KL) [14] based algorithm. Later, Shih and Kuh [26] formulated the partitioning problem with timing and size constraints as a Quadratic Boolean Programming (QBP) problem with linear constraints. Their approaches can generate a partition such that the delay between registers satisfies the timing constraint. In terms of the crossing edge count, their algorithms consistently produce results comparable with Kernighan-Lin and Fiduccia-Mattheyses (FM) [8] based algorithms without timing constraints.
In [20] , Liu et al. proposed a flow-based partitioning algorithm using a retiming technique [18] , [19] to explore the ultimate clock period of the circuit. The algorithm ensures that any critical path is cut at most once. They achieve clock period reduction at the expense of crossing edge count. The partitioning results in 16.44% more crossing edges [20] than the FM algorithm [8] without timing constraints.
In addition to clock period, the primary input to primary output latency is another important measurement for circuit performance. In this paper, our performance-driven partitioning method extends the problem of [20] to incorporate system latency. In other words, we minimize the crossing edge count of the partition subject to the performance constraints that the clock period and latency achieved by using retiming are within the given limits.
Given the size and performance (timing) constraints, the partitioning problem using retiming is formulated as QBP with linear constraints on the size limit and nonlinear constraints on the clock period and latency limits. By using a Lagrangian relaxation, the nonlinear constraints are merged into the objective function as a Lagrangian problem. The Lagrangian problem is then decomposed into primal and dual subprograms.
The performance-driven partitioning problem is solved through the primal and dual iterations on the Lagrangian problem. The primal program is a traditional partitioning problem without timing constraints and is solved by a heuristic Quadratic Boolean Programming approach [2] , [26] . The dual program is solved by a subgradient method [24] using the cycle mean method [13] . According to the experimental results, the proposed partitioning algorithm can simultaneously enhance the clock period and system latency considerably. Moreover, the crossing edge counts of our results are comparable with those of the FM algorithm without timing constraints.
The remainder of this paper is organized as follows. Section II describes the concepts of retiming and latency and formulates the performance-driven partitioning problem. Section III presents a quadratic programming formulation of the problem. Section IV presents the algorithm. Experimental results and concluding remarks are given in Sections V and VI, respectively.
II. PROBLEM FORMULATION
In this section, we first give some basic definitions. Then, the performance-driven partitioning problem is formally defined.
1057-7122/97$10.00 © 1997 IEEE

Fig. 1. An example of a data flow graph, where A and E are primary inputs and F is a primary output.
A. Data Flow Graph
A synchronous digital circuit may consist of combinational elements and globally clocked registers. Each combinational element has positive delay and size associated with it, while all registers have delay and size values equal to zero.
A directed data flow graph , where , is adopted to represent the synchronous digital circuit, where and are sets of register nodes and combinational block nodes, respectively, and is a set of directed edges which correspond to signal flow in the system. Each node has an associated size and delay . As in [20] , we assume that the combinational blocks are fine-grained. A node is called fine-grained, if it can be split into several smaller nodes. On the other hand, if a node can not be split, it is called coarse-grained. An attribute denoting the number of interconnections between nodes and is associated with each edge . Furthermore, each edge is of zero delay. Fig. 1 is an example of such a graph. In Fig. 1 , registers are represented by rectangles and combinational blocks are represented by circles.
and are primary inputs and is a primary output. A path of length from a node to a node in a graph is a sequence of nodes such that , and for . A path is simple, if all nodes on the path are distinct. A path forms a loop if and the path contains at least one edge. The loop is simple if, in addition, are distinct. In the remainder of the paper, the terms path and loop are used to refer to simple paths and simple loops respectively.
A two-way partition maps into two modules, such that and . The upper size limits of these modules are denoted by and , respectively. An edge is a crossing edge of if node and node are in different subsets and . In this paper, the term "cut" stands for the crossing edge count. Each crossing edge is associated with an intermodule delay determined by the technology. By contrast, noncrossing edges incur zero delay. Furthermore, the intermodule delay on crossing edges is inherently coarse-grained and cannot be split.
B. System Latency and Retiming
In this subsection, we first introduce system latency and retiming. Then, a theorem formalizing the conditions under which the required clock period and system latency can be achieved for a given circuit via retiming is presented.
Given a path , we use to denote the number of registers on the path. Let denote the minimum number of registers among all possible paths from to , i.e., where is the set of all paths from node to . We define a path from to as a critical path if equals is also called an IO-critical path if node and are the primary input and output nodes, respectively. Let and be the sum of combinational block delays and the sum of intermodule delays on path , respectively. Let (1) where is the set of all critical paths from node to and denotes the maximum delay over all critical paths from node to .
System Latency: Given a path from primary input to primary output , we define the IO latency between and to be the minimum number of clock periods that pass between the signal arrival at node and its first effect on the output signal of node , as , where denotes the clock period. Note that the primary inputs and outputs are register nodes. In other words, the signal at primary input takes clock periods to first change the output signal of the primary output node . However, if there is no path between and , we set its latency to zero. Thus, we define system latency N of the whole system as the maximum IO latency among all possible IO pairs [17] , [9] , [22] , i.e., (2) where and are the sets of all primary inputs and outputs, respectively.
We use the example in Fig. 1 to illustrate the essence of latency. It takes at least three registers, , , and to walk from to . Thus, the IO latency between and is three clock periods. There are two simple paths from to . One has three registers and the other has five. The IO latency between and is equal to two clock periods. The system latency of the circuit is dominated by the IO latency between and , the value of which is three clock periods.
Retiming: Given a data flow graph where , let and be the set of primary inputs and outputs, respectively. A retiming of a data flow graph is an integer labeling of combinational, primary input, and primary output nodes:
. The retiming specifies a transformation of the original graph in which registers, except primary inputs and primary outputs, are moved across combinational blocks (and IO registers in our case) so as to change the graph into a new graph where . Let denote the number of registers on path from to after retiming . According to [19] , we have (3). for any two nodes and such that . Statement i) expresses that the register count between nodes and cannot be negative. Statement ii) shows that at least one register must be kept on the critical path from to if it has a delay larger than . Once statement ii) holds, [19] states that every path from to with delay larger than will have at least one register.
If i) and ii) hold, the retiming achieves clock period . By (3), the register count of each path from to is augmented by an equal amount through retiming. Therefore, if a path from to is critical before retiming, is still critical after retiming.
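The register-count bookkeeping behind (3) can be illustrated with a minimal sketch. It assumes the edge-weighted view of [19], in which each edge carries a register count and a retiming label r changes the count on an edge from u to v by r[v] - r[u]. The toy graph, the labels, and the function names are illustrative assumptions and are not taken from Fig. 1.

```python
def retime(edges, r):
    """Apply retiming labels r: an edge (u, v) with w registers gets w + r[v] - r[u]."""
    return {(u, v): w + r[v] - r[u] for (u, v), w in edges.items()}


def is_legal(weights):
    """Condition i): no edge may end up with a negative register count."""
    return all(w >= 0 for w in weights.values())


def path_registers(weights, path):
    """Total register count along a path given as a list of nodes."""
    return sum(weights[(a, b)] for a, b in zip(path, path[1:]))


# Hypothetical toy graph: edge -> number of registers on that edge.
edges = {("a", "b"): 1, ("b", "c"): 0, ("c", "a"): 1, ("a", "c"): 2}
r = {"a": 0, "b": 0, "c": 1}  # assumed integer retiming labels
retimed = retime(edges, r)
```

In this toy graph both paths from a to c gain exactly r[c] - r[a] = 1 register, so the path with fewer registers remains the one with fewer registers after retiming, while the register count around the loop a, b, c is unchanged.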
In order to achieve a system latency of clock periods, we incorporate the following condition:
iii) for any and any . Statement iii) enforces that each IO-critical path has no more than registers. Thus, can achieve a system latency of clock periods. Therefore, if satisfies the conditions i), ii), and iii), is a legal retiming of , achieving a clock period of and a system latency of clock periods.

Time Complexity: We first present an algorithm to determine if a circuit achieves a clock period of and a system latency of clock periods by using retiming. From the statements in the preceding paragraph, the problem of finding a retiming which can achieve a required clock period and system latency is equivalent to finding a function from to integers which satisfies all constraints i), ii), and iii). In other words, we need to determine the feasible values for all the unknowns under a set of inequality constraints of the form , where is a constant. Constraint systems of this form arise in shortest-paths problems. We can use the Bellman-Ford algorithm (see [5, p. 532]) to determine if there exists a which satisfies all the constraints. Given a graph with node set V and edge set E, the time complexity of the Bellman-Ford algorithm is O(|V||E|). Given a data flow graph where , let , and and denote the maximum and minimum delay value among all combinational block delays, respectively. We show that finding a retiming which can achieve a required clock period and system latency (or determining that no such retiming exists) can be done in time. Before enumerating all constraints of i), ii), and iii), the values of all and for each pair of nodes and in must be computed. The Floyd-Warshall all-pairs shortest-paths algorithm (see [5, p. 558]) can be adopted to compute the values of all and . Given a graph, the Floyd-Warshall algorithm takes O(|V|^3) time, where V denotes the set of nodes of the given graph. The reader can see Algorithm WD in [19, Section 4] for greater detail.
Since the combinational blocks are fine-grained, a combinational block node with delay is split into combinational block nodes each having unit delay. Given a data flow graph , a new graph is constructed, where each combinational block in is of unit delay and denotes the extended set of edges from . Therefore, is given by . Given the graph , it takes time to calculate the values of all . After all have been calculated, we can enumerate all constraints of i), ii), and iii). Then, the Bellman-Ford algorithm is applied to determine if retiming can achieve the target clock period and system latency. Since the Bellman-Ford algorithm takes time, the time complexity of the algorithm for finding a feasible retiming is dominated by the time complexity of calculating the values of all , which is . Note that the retiming algorithm in [19] satisfies only conditions i) and ii). Therefore, the time complexity of the algorithm in [19] is different from ours.
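The feasibility test described above reduces to solving a system of difference constraints with the Bellman-Ford algorithm. The following is a minimal sketch of that standard technique, not the implementation used in the experiments. The function name and the constraint encoding, a triple (u, v, c) meaning x[v] - x[u] <= c, are assumptions made for illustration.

```python
def solve_difference_constraints(variables, constraints):
    """Return integer-valued x satisfying every x[v] - x[u] <= c, or None.

    Each constraint is a triple (u, v, c). Starting every distance at zero is
    equivalent to adding a virtual source with zero-weight edges to all nodes.
    """
    dist = {v: 0 for v in variables}
    for _ in range(len(variables)):
        changed = False
        for u, v, c in constraints:
            if dist[u] + c < dist[v]:  # relax the edge u -> v of weight c
                dist[v] = dist[u] + c
                changed = True
        if not changed:
            return dist  # all constraints hold: a feasible labeling
    return None  # still relaxing after |V| passes: negative cycle, infeasible


# Hypothetical constraint sets on three unknowns.
feasible = [("a", "b", 1), ("b", "c", -1), ("c", "a", 0)]
infeasible = [("a", "b", 1), ("b", "a", -2)]
labels = solve_difference_constraints(["a", "b", "c"], feasible)
```

Because all constants are integers, any solution returned is integral, which matches the requirement that a retiming be an integer labeling. A return value of None corresponds to the case in which no retiming meets the required clock period and system latency.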
C. Iteration Bound and Latency Bound
Iteration Bound: While retiming can reduce the clock period of a circuit, there is a limit to the amount of improvement possible imposed by feedback loops in the circuit [21] . Given a loop , let , and be the sum of combinational block delays, the sum of intermodule delays, and the number of registers in loop , respectively. The delay-to-register ratio of a loop is equal to . The iteration bound of a circuit is defined as the maximum delay-to-register ratio, i.e., (4) where is the set of all loops in the circuit. Note that the iteration bound of a given circuit yields a lower bound on the achieved clock period by retiming.
Minimum Cycle Mean Problem: For the special case where each edge contributes exactly one register, computing the iteration bound becomes a minimum cycle mean problem [13]. In our test cases, each combinational block node connects two register nodes. Thus, we can adopt the minimum cycle mean algorithm proposed by Karp [13] to calculate the iteration bound and the edges that contribute to the bound. Given a graph, the time complexity of Karp's algorithm is O(nm), where n and m denote the number of nodes and edges, respectively.
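As a rough illustration, Karp's algorithm [13] can be sketched as below. The node numbering, the edge-list format, and the function name are assumptions. The sketch returns only the minimum cycle mean and does not recover the edges that attain it. Since the iteration bound is a maximum ratio, one way to obtain it from this routine is to negate all edge weights and negate the result.

```python
import math


def min_cycle_mean(n, edges):
    """Karp's algorithm on nodes 0..n-1 with edges given as (u, v, weight).

    Returns the minimum mean weight over all cycles, or None if acyclic.
    """
    inf = math.inf
    # best[k][v]: minimum weight of a walk with exactly k edges ending at v,
    # starting anywhere (equivalent to a virtual zero-weight source).
    best = [[inf] * n for _ in range(n + 1)]
    best[0] = [0.0] * n
    for k in range(1, n + 1):
        for u, v, w in edges:
            if best[k - 1][u] + w < best[k][v]:
                best[k][v] = best[k - 1][u] + w
    result = None
    for v in range(n):
        if best[n][v] == inf:
            continue  # no n-edge walk ends here
        worst = max((best[n][v] - best[k][v]) / (n - k) for k in range(n))
        if result is None or worst < result:
            result = worst
    return result


# Hypothetical graph with two loops: 0 <-> 1 (mean 2) and 1 <-> 2 (mean 1).
edges = [(0, 1, 1), (1, 0, 3), (1, 2, 1), (2, 1, 1)]
smallest = min_cycle_mean(3, edges)
largest = -min_cycle_mean(3, [(u, v, -w) for u, v, w in edges])
```

The two nested loops over path lengths and edges give the O(nm) running time quoted above.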
Latency Bound: Let denote the IO-critical path with maximum path delay among all IO-critical paths from to . Since the number of registers on path is equal to , the IO latency (i.e., ( ) between and is not less than , where denotes the clock period, and and are the sum of combinational block delays and the sum of intermodule delays on path , respectively. Thus, we define latency bound M as follows: (5) where is the set of all IO-critical paths. The latency bound also imposes a lower bound on the system latency achieved by using retiming. An all-pairs shortest-paths algorithm can be used to calculate the latency bound.
Timing Constraint: The timing constraints adopted by our algorithm are based on the limits imposed by the iteration and latency bounds. In other words, given two numbers and , our performance-driven partitioning algorithm will generate a partition with iteration and latency bounds not greater than and , respectively. We have two reasons for using the iteration and latency bounds, instead of actually performing the retiming during the process of the partitioning. i) It is faster to calculate these bounds. ii) The iteration and latency bounds give lower bounds on the clock period and system latency achievable by applying retiming. Furthermore, from our experience these bounds are very good predictors of system performance after retiming is applied, particularly in the fine-grained case. Therefore, we want to generate a partition with small iteration and latency bounds.
D. The Performance-Driven Partitioning Problem
Since the performance of a circuit is measured by both clock period and system latency, we want to generate a partition which achieves a good clock period and system latency by adopting retiming. In practice, we use the iteration and latency bound limits as the timing constraints. Retiming is then performed after partitioning to derive the exact clock period and system latency.
Given a partition , let denote the iteration bound and denote the latency bound with respect to . Now we state the performance-driven partitioning problem as follows.
Given a data flow graph , where , and each node has size , two numbers and , size constraints and , and intermodule delay , find a partition with the minimum number of crossing edges, subject to , and . If and are large enough, the above problem becomes the traditional two-way partitioning problem of minimizing the number of crossing edges while satisfying size constraints. Since our problem is in NP and the traditional partitioning problem is NP-hard, the performance-driven partitioning problem is also NP-hard.

Examples in Figs. 2 and 3 illustrate the essence of the performance-driven partitioning problem. In Figs. 2 and 3, shaded octagons denote crossing edges. In these examples, we assume combinational block delays are one unit and intermodule delays are two units. Register and register denote primary input and primary output, respectively. Given the circuit in Fig. 2(a), the clock period is dominated by the longest combinational node delay between registers, which is from to with a delay of three units. There are two simple paths from nodes to . One path has nine registers, the other has 10 registers. Therefore, the system latency of this circuit is eight clock periods, i.e., 24 units. However, using retiming, we can move register to a new location as indicated by the dashed line. The resulting longest paths are from to and from to . Both paths have an improved delay of two units. Furthermore, the system latency becomes 16 units, substantially less than the original 24 units. Because the iteration bound, which is determined by the left loop, is two units, we cannot obtain a smaller clock period.
Suppose the circuit is partitioned into two modules [Fig. 2(b)]. The clock period is five units before retiming because of the delay on the longest path from to . After retiming shifts to its new location as indicated by the dashed line, the delay is four units and the system latency increases to 32 units. On the other hand, in Fig. 3(a), before retiming, the clock period is three units; hence, compared to Fig. 2(b), a better partition can by itself improve performance. If we perform retiming as shown by the dashed lines, the delay in Fig. 3(a) is reduced to two units. Retiming adds one extra register between the combinational node and register . This results in a system latency of nine clock periods or 18 units. However, in Fig. 3(b), we also achieve a clock period of two units while achieving a system latency of only eight clock periods. Hence, Fig. 3(b) is preferred.
III. QUADRATIC PROGRAMMING FORMULATION OF PERFORMANCE-DRIVEN PARTITIONING PROBLEM
In this section, we introduce a vector of Boolean variables to represent a partition. Given the timing and size constraints, the performance-driven partitioning problem can thus be represented by a Quadratic Boolean Programming formulation with linear constraints capturing the size limits and nonlinear constraints for the iteration and latency bounds. The nonlinear constraints are then absorbed into the objective function as a Lagrangian problem. Finally, the Lagrangian problem is decomposed into primal and dual subproblems.
A. Quadratic Boolean Programming Formulation
Let denote a Boolean variable, where is 1 if node is assigned to module , and is otherwise 0. Therefore, given a data flow graph , a two-way partition can be described by a vector of Boolean variables , where . Given a partition represented by a vector of Boolean variables, we express the delay-to-register ratio of a loop and the total delay of a path in terms of the given Boolean variables. If is a crossing edge, the value of the term is equal to 1; otherwise, equals 0. Given a loop , let denote the delay-to-register ratio of .
can be written as (6) where , , and represent the sum of combinational block delays, the number of registers and the sum of intermodule delays in loop , respectively. Similarly, the total delay of path in terms of the given Boolean variables is the following:
where and denote the sum of combinational block delays and the sum of intermodule delays on path , respectively.
Let us formulate the performance-driven partitioning problem by adopting a vector of Boolean variables to represent a partition. The objective function is to minimize the total costs of crossing edges (8) Subject to the following constraints:
: (Size Constraints) modules (9) : (Generalized Upper Bound Constraints) nodes (10) : (Iteration Bound Constraints) loops (11) : (Latency Bound Constraints)
IO-critical path .
Recall that if is a crossing edge, equals 1. Therefore, is contributed to the objective function. Constraint expresses that the total size of the nodes assigned to module cannot exceed the size limit . states that each node should be placed into one and only one module. Constraint enforces that the delay-to-register ratio of each loop is not greater than . Thus, the iteration bound of the partitioned circuit is not greater than . Similarly, limits the total delay of each IO-critical path.
We don't have to specify all loops in constraint . Since every loop is composed of simple loops, we have the following lemma:
Lemma 1: Given a number , if is not greater than for all simple loops , then is not greater than for all loops .
Lemma 1 implies that only the simple loops need be considered in constraint .
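To make the formulation concrete, the sketch below evaluates the objective and two of the constraint types for a given two-way assignment. It is an illustration under stated assumptions: the partition is stored as a node-to-module map rather than as Boolean variables, each crossing edge on a loop adds one intermodule delay `delta`, and all names and the three-node loop are hypothetical.

```python
def cut_cost(edges, module):
    """Objective: total interconnection count over crossing edges."""
    return sum(a for (u, v), a in edges.items() if module[u] != module[v])


def sizes_ok(size, module, limits):
    """Size constraints: total node size in each module within its limit."""
    totals = [0] * len(limits)
    for node, m in module.items():
        totals[m] += size[node]
    return all(t <= lim for t, lim in zip(totals, limits))


def loop_ratio(loop, delay, registers, module, delta):
    """Delay-to-register ratio of a loop given as nodes in cyclic order."""
    hops = list(zip(loop, loop[1:] + loop[:1]))
    crossings = sum(1 for u, v in hops if module[u] != module[v])
    block_delay = sum(delay[v] for v in loop)
    reg_count = sum(1 for v in loop if v in registers)
    return (block_delay + delta * crossings) / reg_count


# Hypothetical loop: one register r1 and two unit-delay blocks c1, c2.
edges = {("r1", "c1"): 1, ("c1", "c2"): 1, ("c2", "r1"): 1}
size = {"r1": 0, "c1": 1, "c2": 1}
delay = {"r1": 0, "c1": 1, "c2": 1}
registers = {"r1"}
loop = ["r1", "c1", "c2"]
together = {"r1": 0, "c1": 0, "c2": 0}
split = {"r1": 0, "c1": 0, "c2": 1}
```

With an intermodule delay of two units, keeping the loop in one module gives a ratio of 2, while cutting it between c1 and c2 crosses the boundary twice and raises the ratio to 6. An iteration bound limit below 6 would therefore reject the second assignment even though it balances the module sizes.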
B. Lagrangian Problem
A vector of Lagrange multipliers is used to merge the nonlinear constraints of the performance-driven partitioning problem into the objective function. Let and represent the number of simple loops and IO-critical paths, respectively, and denote the vector of Lagrange multipliers . Using Lagrangian relaxation [24], we absorb the constraints (11) and (12) into the objective function (8) as a Lagrangian problem as follows: (13) subject to constraints and , where -
The Lagrangian problem is decomposed into two subproblems.
Dual Problem: Given a vector , we can represent (14) as a function of a vector , i.e.,
. Thus, the dual problem can be written as (15) Primal Problem: Let and denote the set of the simple loops and IO-critical paths passing edge . We rewrite (14) in terms of edges. Let us define and as follows: (16) - (17) Given a vector , we can represent (14) as a function of vector , i.e., . Thus, the primal problem can be rewritten as (18) subject to constraints and , where represents the constant contributed by , the number of registers , the node delay of loop , etc.
IV. LAMP
We adopt a Lagrangian Approach on Modular Partitioning (LAMP) which solves the performance-driven partitioning problem through primal and dual iterations on the Lagrangian problem. A heuristic Quadratic Boolean Programming algorithm [2], [26] is used to solve the primal problem and generate a solution . For the dual problem based on , the values for the next primal-dual iteration are updated using the subgradient approach. The iteration proceeds until the bounds of all loops and paths are within the given limits.
A. QBP Method for Primal Problem
Let each edge be associated with a cost . From (18) , the primal problem becomes a traditional partitioning problem of minimizing the total costs of crossing edges while satisfying the size constraint. Note that the primal problem does not incorporate any timing constraint. Any traditional partitioning algorithm [4] , [8] , [11] , [14] , [23] , [27] can be adopted to solve the primal problem. We adopt a heuristic Quadratic Boolean Programming method (QBP) [2] , [26] to optimize the primal problem, since QBP is easily extended to handle multiway partitioning.
B. Subgradient Method for Dual Problem Using Cycle Mean Approach
Once a solution of the dual problem (15) is generated, formula (16) is applied to update the edge costs. Given an edge , we need all of the simple loops passing and all values of the IO-critical paths passing to calculate as in (16) . The number of such simple loops and IO-critical paths may be exponential in the number of the nodes of the given data flow graph.
However, we do not enumerate all loops and paths when calculating edge costs by (16). Instead, only some particular loops and paths are considered. For an optimal solution to problem (15)

Updating Edge Costs: In practice, only dominant prices of each edge are determined and the edge cost is updated by the subgradient approach [24]. Equation (16) is not used to calculate the edge costs, since it requires generating all the loops and paths. We utilize the minimum cycle mean algorithm [13] to calculate the dominant-loop price of each edge. An all-pairs shortest-paths algorithm can be adopted to compute the dominant-path price of each edge.
At the th iteration, let and denote the derived vector and crossing edge count of the primal problem, respectively. Let be the cost of edge at the th iteration. We adopt the subgradient method to generate the new edge cost as (19) where is defined as (20) and and Cut* are two given positive numbers. By (19), we increase the costs of active edges and decrease those of inactive edges, using the subgradient approach. This captures the actual direction of the edge cost changes of active and inactive edges with respect to (16).
C. Algorithm
We perform primal-dual iterations to solve the performance-driven partitioning problem. Let and denote the iteration and latency bound limits, respectively, and Cut* and represent the estimated crossing edge count and a positive value, respectively. The LAMP algorithm is described as follows:
LAMP Algorithm: The heuristic QBP algorithm [2], [26] is called to solve the primal problem and generate a solution at Step 3. For the dual problem based on , we call a minimum cycle mean algorithm [13] and an all-pairs shortest-paths algorithm to obtain the dominant prices of each edge at Step 4. If the bounds of all loops and paths satisfy the constraints, the algorithm stops; otherwise, we calculate the subgradient on the dominant prices and update the values for the next primal-dual iteration at Steps 5 and 6. By following the suggestion of [24], the parameter is set to 1.3 in our experiments.
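The control flow of Steps 3 to 6 can be sketched as a loop over pluggable routines. This is only a skeleton under stated assumptions: the four callables stand in for the QBP solver, the dominant-price computation, the bound check, and the subgradient update, and their names and signatures are hypothetical placeholders rather than the interfaces used in the experiments. The cap of 20 iterations follows the complexity discussion in Section IV-D.

```python
def lamp_sketch(costs, solve_primal, dominant_prices, bounds_ok,
                update_costs, max_iters=20):
    """Primal-dual iteration skeleton; every callable is a placeholder.

    Returns (partition, iterations) on success, or (None, max_iters) if the
    timing bounds are never met within the iteration cap.
    """
    for k in range(1, max_iters + 1):
        partition = solve_primal(costs)              # Step 3: primal problem
        prices = dominant_prices(partition)          # Step 4: dominant prices
        if bounds_ok(partition, prices):             # stop when bounds are met
            return partition, k
        costs = update_costs(costs, prices, k)       # Steps 5-6: new edge costs
    return None, max_iters
```

Passing the routines in as arguments keeps the sketch independent of any particular partitioner, which reflects the remark in Section IV-A that any traditional partitioning algorithm can solve the primal problem.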
D. Algorithm Complexity
Given a data flow graph , let and . Thus, the time complexity of the all-pairs shortest-paths algorithm is . The minimum cycle mean algorithm takes time. The complexity of QBP is . The time complexity of each iteration of LAMP is dominated by the all-pairs shortest-paths algorithm. For simplicity, Steps 3-6 of the algorithm are iterated at most 20 times. Therefore, the time complexity of our algorithm is .
V. EXPERIMENTAL RESULTS
In our experiments we used the same seven industrial circuits from [25], [26] as our test cases. The test cases range in size from 3026 to 12 172 combinational blocks and from 342 to 607 registers. All combinational blocks are of unit size with the exception of some in test case sys1 which have size 2. Five of these circuits contain feedback loops.
External-Loop Constraint: Even though a system may have no feedback from its primary outputs to its primary inputs, it can interact with external systems. Hence, macroscopically, there possibly exist external feedback loops from primary outputs to primary inputs. We call this assumption the external-loop constraint. According to the external-loop constraint, we have to take into account the path delay. Given a path from primary input to primary output, let , and be the sum of combinational block delays, the sum of intermodule delays, and the number of registers on path . The path delay bound of a circuit is defined by (21) where is the set of all paths from the primary input to the primary output.

Table I shows the characteristics of these test cases. The second and third columns list the numbers of registers and combinational blocks, respectively. The fourth and fifth columns give the iteration bound J and the path delay bound B defined in (21), respectively. Since and don't have feedback loops, their iteration bounds are equal to zero.
For performance-driven partitioning, one parameter to be decided is the value of intermodule delay . As indicated in [6], the intermodule delay can increase to nearly 100% of the clock cycle period. Therefore, we set to be 60% of , which is calculated before partitioning. For example, the intermodule delay associated with is equal to 2137, since the value of equals 3563.

We compared our algorithm LAMP to the Fiduccia-Mattheyses (FM) algorithm [8], Topological Timing Cut (TTC) [20], Flow Timing Cut (FTC) [20], and the Shih-Kuh (SK) algorithm [26]. The SK algorithm was devised to generate a partition such that the delay between registers of the resulting partition is within a given timing limit. However, the SK algorithm does not take retiming into consideration. The TTC and FTC algorithms are the first algorithms to take retiming into consideration. The objective of the FM algorithm is to minimize the crossing edge count without any timing constraints. All algorithms were executed on a single-processor SUN SPARC 10 workstation under the C/UNIX environment. The results of FM are chosen from the best of 20 runs each. The size limit of each module was set to 60% of the circuit size. In the following, we present the experimental results both with and without the external-loop constraint.

Table II shows the timing constraints used in the experiments. In the second column, represents the iteration bound limit (i.e., the iteration bound with no intermodule delays). For most test cases, the value of is equal to the iteration bound before partitioning shown in Table I. For , and is equal to the intermodule delay , since their iteration bounds are less than and the clock period is dominated by the intermodule delay. In the third column, stands for the latency bound limit. The LAMP algorithm has another parameter Cut*, which denotes the estimated crossing edge count. We set and from the results of TTC.
Computational experiments show that the determination of Cut* did not have a great influence on the performance of the algorithm. Therefore, if the result of TTC is not available, the reader can use any well-known bipartitioning package to estimate the crossing edge count. The fourth column shows the number of iterations LAMP takes to stop. The last column reports the average execution time of each iteration of LAMP in seconds. Generally speaking, the execution time of LAMP is comparable to that of the FM algorithm. However, TTC and FTC are much faster than LAMP.

Clock Period: Table IV gives detailed information for our experiments on clock period. In the first row, associated with FM is the maximum delay between registers before retiming. In Table IV, with the exception of the first and second columns, each column contains two subcolumns. The data in the first subcolumn represents the value derived from (4) after partitioning. The value in the second subcolumn is the clock cycle period of the partitioned circuit achieved by retiming. Note that we adopt the retiming algorithm of [19]. The iteration bound, calculated after partitioning, will dominate the optimal clock cycle period during retiming. However, if the iteration bound is less than the intermodule delay, then the intermodule delay will dominate the clock cycle period because it is not decomposable.
A. Experiments without External-Loop Constraint
Because all of the loops in our test cases are quite small and strongly connected, no loops are cut by TTC, FTC, and LAMP for all test cases. However, FM cuts the loops in and . Therefore, LAMP, TTC, and FTC obtain the same iteration bound and clock period for all test cases. Compared with FM, LAMP achieves 38.41% and 38.48% clock period reductions for and , respectively. By using the clock period generated by LAMP to be the timing constraints, we made comparisons with Shih and Kuh's (SK) performance-driven partitioning algorithm [26] . For all test cases, SK does not produce feasible solutions. Note that SK generates a partition where the delay between registers is within the timing constraint. Therefore, we have demonstrated that retiming can achieve better clock period.
System Latency: Table V gives the detailed information regarding our experiments on system latency. In Table V , with the exception of the first column, each column contains two subcolumns. The data in the first subcolumn is derived from (5) after partitioning.
in the second subcolumn is the system latency of the partitioned circuit obtained through retiming. 
B. Experiments with External-Loop Constraint
According to the external-loop constraint, we have to take into account the path delay in the calculation of the iteration bound. In this case the clock period of a given partitioned circuit is dominated by . Since there possibly exist external feedback loops from primary outputs to primary inputs, the register count of each path from primary input to primary output cannot be changed after retiming; otherwise, the functionality of the circuit will be modified. Thus, according to (2), the system latency after retiming is proportional to the clock period after retiming. We imposed only the iteration bound limit in our experiment with the external-loop constraint. Therefore, LAMP does not call an all-pairs shortest-paths algorithm to compute the latency bound.

Table VI shows the timing constraint. In the second column, represents the iteration bound limit. The third and fourth columns stand for the number of iterations that LAMP takes

Using the clock period generated by LAMP as the timing constraint, as shown in the second column of Table IX, Table IX shows the experimental results of the SK algorithm. For most test cases, SK does not produce feasible solutions.
System Latency: Because the system latency after retiming is proportional to the clock period after retiming with the external-loop constraint, LAMP achieves 23.25%, 6.01%, and 7.70% reductions, respectively, compared with FM, TTC and FTC.
VI. CONCLUDING REMARKS
We have proposed an efficient performance-driven two-way partitioning algorithm which considers clock period and latency simultaneously based on the limits of the iteration and latency bounds. Experimental results on seven industrial circuits demonstrate that our algorithm achieves an average of 23.25% clock cycle period and 23.25% latency reduction compared to the best results produced by 20 runs on each test case using a Fiduccia-Mattheyses algorithm. In terms of the number of crossing edges, our result is only 1.85% more than the result of the Fiduccia-Mattheyses algorithm without timing constraints.
Furthermore, we can easily extend our algorithm to multiway partitioning, since QBP can handle multiway partitioning.
