Abstract: The problem of retiming over a netlist of macroblocks to achieve minimal clock period, where block internal structures may not be changed and flip-flops may not be inserted on some wire segments, is called the optimal wire retiming problem. This paper presents a new algorithm that solves the optimal wire retiming problem with polynomial-time worst-case complexity. Since the new algorithm avoids binary search and is essentially incremental, it has the potential to be combined with other optimization techniques. Experimental results show that the new algorithm is very efficient in practice.
function, a flip-flop can be used to fulfill communication buffering requirements-been explored.
Since dominant wire delays can only happen on global wires, it is more meaningful to formulate the problem at the chip level as in [13] and [15]; that is, the design we deal with is a netlist of macroblocks. The wires within a block are much shorter and thus do not need multiple clock periods for propagation. In SOC designs, many of the macroblocks are intellectual property (IP) cores. Some of these blocks may be combinational circuits, and others sequential. Because of the existence of predesigned blocks such as IP cores or regular-structured blocks such as memories, (combinational) buffers or flip-flops may not be inserted everywhere [17].
In this application, the approaches in [14] cannot be used because they considered only gates and cannot be extended to handle complex blocks. Our previous work [13], on the other hand, solved the problem with complex blocks by proposing timing macromodels to model the timing behavior of the blocks, based on which a set of integer difference inequalities was shown to be both necessary and sufficient, thus characterizing feasible solutions. Although it gave a polynomial-time algorithm for feasibility checking, the complexity was high, making it prohibitive even for checking circuits with about 1000 vertices. Furthermore, it only gave a fully polynomial-time approximation scheme (FPTAS) for clock period minimization; that is, the overall complexity depended on a given precision. The same was true of our improved work [15] and of [14].
In this paper, a polynomial-time algorithm is proposed to compute the minimal clock period without using binary search. To the best of our knowledge, it is the first work that shows the optimal wire retiming problem can be solved incrementally in polynomial time.
II. PROBLEM FORMULATION
We consider wire retiming on an SOC design with a given block placement (also known as a floorplan) and a global routing of the global wires. We can also handle pipelining by allowing flip-flops to be inserted at primary inputs/outputs and retimed into the circuit. This problem may come from different design methodologies and different design stages. For example, it may come from an interconnect planning stage where the floorplan and global routing are done for estimation [18], [19], or it may come from a physical design stage where the floorplan and global routing are given. The wire retiming problem in these two situations is the same except that we may allow flip-flop insertions in soft blocks during interconnect planning but only allow such insertions in preallocated buffer regions during physical design. A detailed problem formulation is given in [13]. We outline it here for completeness. As an illustration, an SOC design with a floorplan and a global routing is depicted in Fig. 1. Each wire has an arrow to indicate the signal direction and a weight to specify how many flip-flops are on the wire. Weights of zero are omitted. Some segments of a wire may not accommodate any buffer or flip-flop because they run over macroblocks that do not allow transistors to be added.
In order to take the delays within each block into consideration, timing models are used for specifying the timing behavior of each block. Due to the increasing popularity of SOC designs and IP-core-based designs, there has recently been increasing research activity on timing models for macroblocks [20]-[22].
Traditional retiming is applied to logic level netlists that are composed of simple gates. In our application, the netlist is composed of macroblocks. Since a combinational block can be viewed as a complex gate, moving a flip-flop over it is simply justified. The following lemma [13] shows that retiming can be generalized to sequential blocks.
Lemma 1: In an SOC design composed of macroblocks, a flip-flop can be moved from every input to every output of a block or vice versa without changing the function of the design.
Proof: The function of the design only depends on the synchronization of the data flows. When a flip-flop is moved from every input to every output of a sequential block or vice versa, the movement only affects the specific clock cycle during which the correct results are generated. In other words, the data flows are always synchronized under such a movement of flip-flops. Therefore, the function of the design is always preserved.
When the block is a combinational circuit, as shown in Fig. 2(a), we can use edges between inputs and outputs to represent the path delays between them, as shown in Fig. 2(b). In order to avoid flip-flops being placed on the edges we introduced in timing models, we require that the retiming tags of the input and the output connected by such an edge be the same. In addition, since we only care about the set-up conditions of flip-flops, if minimum and maximum delay pairs are given for a combinational block, only the maximum delays are taken.
The case for sequential blocks is trickier. We use the sequential block shown in Fig. 3(a) as an example. To simplify 1 . This implies that the arrival time at $a$ should not be larger than $T - d_1$, where $T$ is the clock period. To enforce this set-up condition, we introduce a virtual flip-flop in the timing model, as shown in Fig. 3(b), and add an edge with delay $d_1$ from $a$ to this virtual flip-flop. Similarly, an edge with delay $d_3$ is added from $b$ to another virtual flip-flop. The concept of virtual flip-flops is also used to specify the arrival times at the block outputs that are dependent on interior flip-flops. In Fig. 3(b), all the virtual flip-flops are combined into one flip-flop. Later, the virtual flip-flop is further modeled as an edge with delay zero and weight one.
Since it is assumed that the (global) routing of nets is given, each net is represented as a Steiner tree. For example, Fig. 4(a) shows the route of a net with one source and three sinks. The shaded regions represent the areas where no buffer or flip-flop can be inserted. The timing model used for this net is shown in Fig. 4(b) . Besides the sources and sinks of the net, vertices are created at points where wires are getting into or out of buffer-forbidden areas and at Steiner points not within buffer-forbidden areas. Similar to combinational paths within a block, a delay edge will be used to represent the wire delay from an entering point to an exiting point through a buffer-forbidden area.
For edges outside buffer-forbidden areas, buffers are generally assigned on appropriate positions to control the delay. According to [23] , a buffer-allowable wire can be optimally buffered such that its delay becomes linear in terms of its length. Thus, in our model, the delays on buffer-allowable edges are assumed to be linear. A buffer-forbidden edge may represent a combinational path within a block or a wire over a buffer-forbidden region. In the former case, its delay is given by the pin-to-pin path delay; in the latter case, its delay can be computed as the Elmore delay of the wire under the assumption that it is buffered at region boundaries. Based on the above models, the problem we want to solve can be formulated as follows.
Problem (Optimal Wire Retiming): Given a directed graph $G = (V, E)$ with two types of edges, buffer-forbidden edges $E_1$ and buffer-allowable edges $E_2$ ($E = E_1 \cup E_2$), where each edge $e \in E$ has a delay $d(e)$ and a weight $w(e)$ (representing the number of flip-flops on it), find a retiming (i.e., a repositioning of flip-flops in the graph) such that: 1) there is no flip-flop change on any edge $e \in E_1$; 2) the delay between two flip-flops on an edge $e \in E_2$ is linear in terms of their distance; and 3) the clock period (i.e., the maximum delay between any two consecutive flip-flops, treating primary inputs (PIs) and primary outputs (POs) as flip-flops) is minimized.
III. NOTATIONS AND CONSTRAINTS
From the formulation of the problem, we already have a delay label $d : E \to \mathbb{R}^+$ and a weight label $w : E \to \mathbb{Z}^+$. We follow the convention of using a label $r : V \to \mathbb{Z}$ to denote the number of flip-flops moved over a vertex and a label $t : V \to \mathbb{R}$ to denote the arrival time of the vertex. For example, in Fig. 5, $t(u)$ is the maximum of the delays of the bold paths incident to $u$.
For any path $p \in G$, we use $d(p)$ to denote the delay along $p$, which is the sum of the delays of $p$'s constituent edges. Similarly, $w(p)$ denotes the number of flip-flops on $p$ before retiming, which is the sum of the weights of $p$'s constituent edges. To ease the representation, we use $w_r(u, v)$ to denote the number of flip-flops on $(u, v) \in E$ after retiming, i.e.,
$$w_r(u, v) = w(u, v) + r(v) - r(u).$$
We also use $w_r(p)$ to denote the number of flip-flops on path $p = u \to v$ after retiming:
$$w_r(p) = w(p) + r(v) - r(u).$$
A valid retiming must satisfy the constraints
$$r(u) = r(v), \quad \forall (u, v) \in E_1 \tag{1}$$
$$w(u, v) + r(v) - r(u) \ge 0, \quad \forall (u, v) \in E_2. \tag{2}$$
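Let $T$ denote a given clock period. Timing constraints refer to
$$t(v) \ge t(u) + d(u, v) - w_r(u, v)\,T, \quad \forall (u, v) \in E \tag{3}$$
$$0 \le t(v) \le T, \quad \forall v \in V. \tag{4}$$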
Note that the arrival times are computed based on the retimed graph.
Likewise, critical paths and critical cycles can be similarly defined.
A solution (r, t, T ) that satisfies (1)- (4) is called a feasible solution; T is called a feasible clock period. Our task is to find a feasible solution with minimal T .
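As a concrete reading of the definition above, the following minimal Python sketch checks whether a triple (r, t, T) is feasible. The `Edge` record and the function names are hypothetical, and the forms used for (3) and (4), namely t(v) >= t(u) + d(u, v) - w_r(u, v)T and 0 <= t(v) <= T, are assumptions inferred from how the constraints are used later in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    u: str
    v: str
    delay: float
    weight: int
    forbidden: bool  # True for E1 (buffer-forbidden), False for E2


def retimed_weight(e, r):
    """w_r(u, v) = w(u, v) + r(v) - r(u)."""
    return e.weight + r[e.v] - r[e.u]


def is_feasible(edges, r, t, T, eps=1e-9):
    """Check (1)-(4) for a candidate solution (r, t, T)."""
    for e in edges:
        wr = retimed_weight(e, r)
        if e.forbidden and r[e.u] != r[e.v]:          # (1)
            return False
        if not e.forbidden and wr < 0:                # (2)
            return False
        if t[e.v] + eps < t[e.u] + e.delay - wr * T:  # (3), assumed form
            return False
    # (4), assumed form: arrival times lie in [0, T]
    return all(-eps <= tv <= T + eps for tv in t.values())
```

For a single buffer-allowable edge with delay 4 and one flip-flop, this check accepts T = 2 with t(v) = 2 and rejects T = 1.5 with t(v) = 1.5.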
Similar to [13] , a vertex M is introduced into G, along with directed forbidden edges from each PO to it with zero delays and weights, and from it to each PI with delay zero but weight one. These forbidden edges ensure that if flip-flops are moved outside the circuit through PIs (POs), they will be moved back into the circuit through POs (PIs). As a result, if r(v) ∀v ∈ V is a valid retiming, then r(v) + K ∀v ∈ V is also a valid retiming for any K ∈ Z. On the other hand, due to the weight of one assigned on the forbidden edges to PIs, all PI→PO paths are transformed into cycles with positive weights, and thus will not contradict the assumption on positive cycle weights. Moreover, given that the arrival time t(M ) of vertex M can also be quantified by (3) and (4), we shall not differentiate M and the forbidden edges incident to it from other vertices and edges in the remainder of this paper.
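The augmentation with vertex M can be sketched as a small graph transformation. This is only an illustration: the tuple layout `(u, v, delay, weight, forbidden)` and the function name `add_host_vertex` are assumptions made for the example, not part of the original formulation.

```python
def add_host_vertex(edges, primary_inputs, primary_outputs, host="M"):
    """Return a new edge list with the extra vertex M attached.

    edges: list of (u, v, delay, weight, forbidden) tuples.
    Every PO gets a forbidden edge to M with zero delay and zero weight;
    M gets a forbidden edge to every PI with zero delay and weight one.
    """
    augmented = list(edges)
    for po in primary_outputs:
        augmented.append((po, host, 0.0, 0, True))
    for pi in primary_inputs:
        augmented.append((host, pi, 0.0, 1, True))
    return augmented
```

With this construction, every PI-to-PO path closes into a cycle through M whose weight is at least one.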
We shall also clarify that the introduction of the vertex M is not necessarily required in our algorithm. In cases where flip-flops cannot be moved through PIs/POs due to restrictions on the initial state [6] , [24] , [25] , M will not be introduced. Our proposed algorithm is guaranteed to work as long as an optimal retiming can be reached from a given flip-flop configuration (which may or may not have been modified from the original circuit) by moving flip-flops in only one direction toward either the POs or the PIs.
IV. OVERVIEW
The optimal wire retiming problem asks for a feasible solution with minimal T . To this aim, our algorithm starts with a feasible solution and iteratively reduces T to the optimal while keeping (1)-(4) satisfied.
First of all, we notice that (1) and (2) are trivially satisfied with r(v) = 0 ∀v ∈ V . Based on this, we can easily find a feasible solution that also satisfies (3) and (4) if T is chosen large enough.
Since $T$ is only involved in (3) and (4), one intuitive way to reduce it is to tighten (3) and (4) under the same $r$, for which we use Burns' algorithm [26]. We shall always try this first to reduce $T$. On the other hand, there must be a lower bound of $T$ that is determined by the particular $r$. Once $T$ is reduced to this lower bound by tightening (3) and (4), we compare the current solution with a hypothetical optimal solution and adjust $r$ to approach it. These two ways of reducing $T$ are iteratively applied until we reach the optimal $T$.
V. ALGORITHM

A. Initialization
We want an initialization that returns a feasible solution. It is apparent that the original flip-flop configuration without any retiming must satisfy (1) and (2) . A natural thought is to initialize it to satisfy the timing constraints (3) and (4) as well, which is easy to achieve when T is large.
To satisfy (3) and (4), we first compute $t$ as if all flip-flops on each edge $(u, v) \in E$ are lined up immediately before $v$, so that (4) is satisfied with $T = \max_{v \in V} t(v)$, while (3) may be violated.
To fix the violation of (3), we propose to locally distribute the flip-flops on each edge and increase $T$ accordingly. The idea is illustrated in Fig. 6(b), where one flip-flop is moved to the middle of the path from $x$ to $z$ and (3) is satisfied by setting $T = 2$. In general, if (3) is violated by some edge $(u, v)$ with $t(v) < t(u) + d(u, v) - w(u, v)\,T$, we increase $T$ to $(t(u) + d(u, v) - t(v))/w(u, v)$ so that (3) is just satisfied on $(u, v)$. Since $T$ is always increased while $t$ remains unchanged, (4) will be kept. When completed, we have a feasible clock period $T$ and the corresponding $t$ values that satisfy (3) and (4).
In addition, we keep a label $f : V \to V$ such that $f(v)$ is the starting vertex of a critical path terminating at $v$ with $t(f(v)) = 0$, $\forall v \in V$. The initialization procedure is summarized in Fig. 7.
B. Base Algorithm
Starting with the initialization, there are two possible ways by which T can be reduced. Since T is only involved in (3) and (4), one intuitive way to reduce it, referred to as "t-ADJUST," is to tighten (3) and (4) under a given r, i.e., without changing the number of flip-flops on each edge. The other that allows such changes is referred to as "r-ADJUST." If the current r happens to be an optimal retiming, then we can reach the optimal T by "t-ADJUST" only. Otherwise, "r-ADJUST" is needed to adjust r toward an optimal retiming and the following lemma provides a direction of the adjustment.
Lemma 2: Provided that (1)-(3) are satisfied, if $t(v) \ge T$ for some $v \in V$, then any retiming satisfying (1)-(4) but with a smaller clock period must have more than $w_r(p)$ flip-flops on $p$, where $p$ is any critical path from $f(v)$ to $v$.
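Proof: Consider any critical path $p$ from $f(v)$ to $v$. Since $p$ is critical and $t(f(v)) = 0$, we have $d(p) = w_r(p)\,T + t(v)$. In addition, since $p$ is critical, the delay between any two consecutive flip-flops on $p$ is $T$.
For the sake of contradiction, we assume that there exists another retiming $(\hat{r}, \hat{t}, \hat{T})$ satisfying (1)-(4) with $\hat{T} < T$ and $w_{\hat{r}}(p) \le w_r(p)$. Applying (3) along $p$ and using $\hat{t}(f(v)) \ge 0$ gives
$$\hat{t}(v) \ge d(p) - w_{\hat{r}}(p)\,\hat{T}.$$
Together with the facts that $d(p) = w_r(p)\,T + t(v)$, $t(v) \ge T$, $\hat{T} < T$, and $w_{\hat{r}}(p) \le w_r(p)$, the above inequality can be reduced to $\hat{t}(v) > T$.
Given that $(\hat{r}, \hat{t}, \hat{T})$ satisfies (4), we have $\hat{T} \ge \hat{t}(v) > T$, which contradicts $\hat{T} < T$. Therefore, $w_{\hat{r}}(p) > w_r(p)$ must be true.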
Since the initialization returns a feasible solution satisfying (1)-(4), we can check if $(\exists v \in V : t(v) = T)$ is true. If it is true, then, by Lemma 2, we know that $T$ is optimal under the current $r$. In order to reach a possibly better solution, "r-ADJUST" needs to be carried out. Otherwise, $(\forall v \in V : t(v) < T)$ is true and "t-ADJUST" is employed to tighten (3) and (4). We now describe "t-ADJUST" and "r-ADJUST" in Sections V-B1 and V-B2, respectively.
1) Clock Period Reduction With r Unchanged:
Since r is kept unchanged, (1) and (2) will not be violated. Our objective is to find a smaller T such that (3) and (4) are satisfied under the same r.
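First of all, we identify the set of critical edges $E_c$, that is, the edges on which (3) holds with equality: $E_c = \{(u, v) \in E : t(v) = t(u) + d(u, v) - w_r(u, v)\,T\}$.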
If E c contains a critical cycle, then the current T coincides with the cycle ratio of the critical cycle, which, by the following lemma [13] , implies that the current T is the solution to the optimal wire retiming problem.
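Lemma 3: A feasible clock period $T$ must satisfy
$$T \ge \rho^* = \max_{\text{cycle } c \in G} \frac{d(c)}{w(c)}.$$
Proof: Since $T$ is feasible, the delay between any two consecutive flip-flops must be no greater than $T$ in order to satisfy the set-up conditions. In other words, $d(c) \le w(c)\,T$ for every cycle $c$ in $G$, since the number of flip-flops on a cycle is not changed by retiming.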
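To make the lower bound of Lemma 3 concrete, the sketch below computes the maximum cycle ratio d(c)/w(c) by brute-force enumeration of simple cycles. It is exponential and intended only for tiny examples; the function name and the `(u, v, delay, weight)` tuple layout are hypothetical, and cycles of zero weight are skipped under the paper's assumption that all cycles have positive weight.

```python
def max_cycle_ratio(edges):
    """Brute-force max of d(c)/w(c) over simple cycles (tiny graphs only).

    edges: list of (u, v, delay, weight) with comparable vertex names.
    """
    adj = {}
    for u, v, d, w in edges:
        adj.setdefault(u, []).append((v, d, w))
    nodes = sorted({x for u, v, _, _ in edges for x in (u, v)})
    best = 0.0

    def dfs(start, node, d_sum, w_sum, visited):
        nonlocal best
        for nxt, d, w in adj.get(node, []):
            if nxt == start:
                if w_sum + w > 0:  # positive cycle weight assumed
                    best = max(best, (d_sum + d) / (w_sum + w))
            elif nxt not in visited and nxt > start:
                # Only visit vertices larger than start so that each
                # cycle is enumerated once, from its smallest vertex.
                dfs(start, nxt, d_sum + d, w_sum + w, visited | {nxt})

    for s in nodes:
        dfs(s, s, 0.0, 0, {s})
    return best
```

For two cycles with ratios 4/1 and 4/2, the function returns the larger ratio, which is the bound any feasible T must respect.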
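Otherwise, $G_c = (V, E_c)$ forms a directed acyclic graph. Suppose there exists a better solution $(r, \hat{t}, \hat{T})$ under the same $r$ satisfying (1)-(4) with $\hat{T} < T$. Let $\theta$ denote the difference, i.e., $\theta = T - \hat{T}$. Consider a critical edge $(u, v)$. If $\hat{t}(u) = t(u)$, then satisfying (3) under $\hat{T}$ requires $\hat{t}(v) \ge t(v) + w_r(u, v)\,\theta$; otherwise, the difference between $\hat{t}(u)$ and $t(u)$ also needs to be counted into $\hat{t}(v)$. To characterize $\hat{t}$, we define $\Delta : V \to \mathbb{Z}$ as
$$\Delta(v) = \max\Bigl(0,\ \max_{(u, v) \in E_c} \bigl(\Delta(u) + w_r(u, v)\bigr)\Bigr), \quad \forall v \in V.$$
More specifically, $\Delta(v)$ is the maximum number of flip-flops on critical paths from roots in $G_c$ to $v$, $\forall v \in V$. Thus, for a small $\theta$, the following is true, i.e.,
$$\hat{t}(v) = t(v) + \Delta(v)\,\theta, \quad \forall v \in V.$$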
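The above relation breaks down only when $\theta$ exceeds a threshold such that some of (3) and (4) are violated. The threshold is characterized below. Note that (3) is violated only if the following inequality is true on some edge $(u, v) \in E$, i.e.,
$$\hat{t}(v) < \hat{t}(u) + d(u, v) - w_r(u, v)\,\hat{T}.$$
Since $\hat{t}(v) = t(v) + \Delta(v)\theta$, $\hat{t}(u) = t(u) + \Delta(u)\theta$, and $\hat{T} = T - \theta$, the above inequality can be rewritten as
$$t(v) - t(u) - d(u, v) + w_r(u, v)\,T < \bigl(\Delta(u) + w_r(u, v) - \Delta(v)\bigr)\,\theta.$$
Since $(r, t, T)$ satisfies (3), the left-hand side of the above inequality is no less than 0. It follows that a violation can occur only on edges where $\Delta(u) + w_r(u, v) - \Delta(v) > 0$ holds. We then have an upper bound for $\theta$:
$$\theta \le \min_{\substack{(u, v) \in E \\ \Delta(u) + w_r(u, v) > \Delta(v)}} \frac{t(v) - t(u) - d(u, v) + w_r(u, v)\,T}{\Delta(u) + w_r(u, v) - \Delta(v)}.$$
This process of characterization for keeping (3) is exactly the same as for Burns' algorithm [26].
To maintain (4), we need to make sure that for all $v \in V$, $t(v) + \Delta(v)\,\theta \le T - \theta$, i.e.,
$$\theta \le \min_{v \in V} \frac{T - t(v)}{\Delta(v) + 1}.$$
It gives another upper bound. The threshold is then the smaller of the two bounds. Given that the first bound is evaluated on noncritical edges and that $(\forall v \in V : t(v) < T)$, the value of the threshold is always positive. By reducing $T$ by the threshold and adjusting $t$ accordingly, we obtain a better solution satisfying (1)-(4). The above process can be iteratively applied as long as no critical cycle is formed and $(\forall v \in V : t(v) < T)$ is true. The pseudocode is given in Fig. 8.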
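One reduction step of this kind can be sketched in Python as follows. This is not the pseudocode of Fig. 8: it is a minimal illustration that assumes (3) reads t(v) >= t(u) + d(u, v) - w_r(u, v)T and (4) reads t(v) <= T, that arrival times shift by Δ(v)θ when T is reduced by θ, and that there is at most one edge per vertex pair. The function name `t_adjust_step` and the tuple layout are hypothetical.

```python
def t_adjust_step(edges, r, t, T, eps=1e-9):
    """One clock-period reduction with r fixed.

    edges: list of (u, v, delay, weight), at most one edge per (u, v).
    Returns (theta, new_t, new_T).
    """
    wr = {(u, v): w + r[v] - r[u] for u, v, _, w in edges}
    # Slack of (3) on each edge; zero slack marks a critical edge.
    slack = {(u, v): t[v] - t[u] - d + wr[(u, v)] * T for u, v, d, _ in edges}
    critical = [e for e, s in slack.items() if abs(s) <= eps]

    # Delta(v): maximum flip-flop count on critical paths ending at v,
    # computed by repeated relaxation over the acyclic critical graph.
    delta = {x: 0 for x in t}
    for _ in range(len(t)):
        changed = False
        for u, v in critical:
            if delta[u] + wr[(u, v)] > delta[v]:
                delta[v] = delta[u] + wr[(u, v)]
                changed = True
        if not changed:
            break
    else:
        raise ValueError("critical cycle: T is already optimal")

    # Bound that keeps (4): t(v) + Delta(v)*theta <= T - theta.
    bounds = [(T - t[v]) / (delta[v] + 1) for v in t]
    # Bound that keeps (3): only edges with a positive coefficient matter.
    for (u, v), s in slack.items():
        coeff = delta[u] + wr[(u, v)] - delta[v]
        if coeff > 0:
            bounds.append(s / coeff)
    theta = min(bounds)
    new_t = {v: t[v] + delta[v] * theta for v in t}
    return theta, new_t, T - theta
```

On a single edge with delay 4 and one flip-flop, starting from T = 4 and zero arrival times, one step yields θ = 2 and T = 2, which places the flip-flop in the middle of the edge.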
Due to the similarity with Burns' algorithm [26], procedure "t-ADJUST" has a provable complexity of $O(|V|^2 |E|)$ before either a critical cycle appears or $(\exists v \in V : t(v) = T)$ becomes true; the former certifies the optimality of $T$, while the latter necessitates "r-ADJUST."
2) Clock Period Reduction by Changing r: Suppose that the current $T$ is not optimal. Let $(r^*, t^*, T^*)$ denote an optimal solution. Given that $t(v) = T$ for some $v \in V$, we know by Lemma 2 that the optimal solution requires more flip-flops on the critical path $p$ from $f(v)$ to $v$, that is, $w_{r^*}(p) > w_r(p)$, or equivalently,
$$r^*(v) - r^*(f(v)) > r(v) - r(f(v)). \tag{5}$$
We can add more flip-flops on the critical path by either increasing $r(v)$ or decreasing $r(f(v))$. No matter which one is chosen, the amount of change should only be 1 since we do not want to overadjust $r$. Without loss of generality, we choose to increase $r(v)$ by 1 in our implementation, that is, $r'(v) = r(v) + 1$.
However, the increase of $r(v)$ might violate some of (1)-(3); an example is shown in Fig. 9(a).
Our idea to restore (1) and (2) is illustrated in Fig. 9(b). We first examine each edge incident to $v$. If (1) is violated on $(v, x) \in E_1$ or $(x, v) \in E_1$, it is actually $r'(v) = r(x) + 1$ due to the previous increase of $r(v)$. We then increase $r(x)$ by 1 to restore (1), i.e., $r'(x) = r'(v)$ with $r'(x) = r(x) + 1$. For the example in Fig. 9(b), both $r(x_1)$ and $r(x_2)$ will be increased. If (2) is violated on $(v, x) \in E_2$, it is actually $w(v, x) + r(x) - r'(v) = -1$ due to the previous increase of $r(v)$. We then increase $r(x)$ by 1 to restore (2), i.e., $w(v, x) + r'(x) - r'(v) = 0$ with $r'(x) = r(x) + 1$. For the example in Fig. 9(b), $r(x_3)$ will be increased. We do not need to consider the $(x, v) \in E_2$ case because the previous increase of $r(v)$ ensures $w(x, v) + r'(v) - r(x) \ge 1$, which will not violate (2). Vertex $x_4$ in Fig. 9(b) is such a case. Likewise, we need to check the impact of the increase of $r(x)$, if any, on the edges incident to $x$, and so on. This checking process continues until (1) and (2) are restored.
In our implementation, we employ a first-in-first-out (FIFO) queue rQ for the bookkeeping of the vertices whose $r$ values are increased during the above process of restoring (1) and (2). For the example in Fig. 9(b), $v$, $x_1$, $x_2$, and $x_3$ will be queued in rQ. We claim that no vertex will be queued more than once in rQ. Otherwise, let $y$ be the first vertex that is queued twice ($r'(y) = r(y) + 2$) in order to restore (1) and (2) on some edge incident to $y$. If it is a violation of (1) on $(x, y) \in E_1$ or $(y, x) \in E_1$, then the violation will be fixed after the increase of $r(y)$, that is, $r'(x) = r'(y) = r(y) + 2$. Given that $(r, t, T)$ satisfies (1), we have $r(x) = r(y)$; hence, $r'(x) = r(x) + 2$, which contradicts the assumption that $y$ is the first vertex whose $r$ value is increased by 2 during the process. A contradiction can be similarly derived for the case of a violation of (2). Therefore, the claim is true and we have $r'(u) \le r(u) + 1$, $\forall u \in V$, after (1) and (2) are restored.
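The queue-based restoration described above can be sketched as follows. This is an illustrative reading, not the pseudocode of Fig. 10: the function name, the `(a, b, weight, forbidden)` tuple layout, and the `f_of_v` argument are hypothetical, and the input `r` is assumed to satisfy (1) and (2) before `r[v]` is increased. The returned dictionary `m` records, for every increased vertex, the vertex that triggered the increase.

```python
from collections import deque


def increase_and_restore(edges, r, v, f_of_v=None):
    """Increase r[v] by 1, then restore (1) and (2) with a FIFO queue.

    edges: list of (a, b, weight, forbidden); forbidden=True means E1.
    Returns (new_r, m).
    """
    r = dict(r)
    r[v] += 1
    m = {v: f_of_v}
    rq = deque([v])
    while rq:
        y = rq.popleft()
        for a, b, w, forbidden in edges:
            if forbidden and y in (a, b):
                x = b if a == y else a
                broken = r[x] < r[y]          # (1): r(x) lags behind r(y)
            elif not forbidden and a == y:
                x = b
                broken = w + r[x] - r[y] < 0  # (2) on outgoing edge (y, x)
            else:
                continue  # incoming E2 edges cannot be violated
            if broken:
                r[x] += 1
                m[x] = y
                rq.append(x)
    return r, m
```

On a configuration shaped like the Fig. 9(b) example (two forbidden edges and one weight-zero outgoing edge at `v`, plus one incoming buffer-allowable edge), the three affected neighbors are increased once each and the incoming neighbor is left unchanged.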
A key observation on Fig. 9 is that if $r(f(v))$ is increased before the next increase of $r(v)$, then the increase of $r(v)$ is necessitated because the increase of $r(f(v))$ cancels the previous increase of $r(v)$ and makes (5) true again. The relation between $r(f(v))$ and $r(v)$ is similar to that between $t(f(v))$ and $t(v)$, where any increase of $t(f(v))$ will be propagated to $t(v)$ unless the path between them ceases to be critical. The same relation exists between $r(v)$ and $r(x_1)$, $r(v)$ and $r(x_2)$, $r(v)$ and $r(x_3)$, and $r(v)$ and $r(x_5)$.
Therefore, we introduce another label $m : V \to V \cup \{\emptyset\}$, where $\emptyset$ is the default assignment, and define it as follows. For all $u \in V$, if $r(u)$ is increased due to (5), then $m(u)$ is set to $f(u)$; if $r(u)$ is increased due to the violation of (1) and (2) on some edge between $u$ and $x$, then $m(u)$ is set to $x$. For the example in Fig. 9(b), we will have $m(v) = f(v)$ and $m(x_1) = m(x_2) = m(x_3) = v$.
Based on the definition of the m labeling, we can formulate the relation between r(m(u)) and r(u) in the following lemma.
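Lemma 4: It is true before we reach an optimal retiming $r^*$ that
$$r^*(m(u)) - r(m(u)) \le r^*(u) - r(u), \quad \forall u \in V \text{ with } m(u) \ne \emptyset.$$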
Proof: By the definition of the $m$ labeling, $m(u) \in V$ only if $r(u)$ has ever been increased since the initialization, $\forall u \in V$. For a particular vertex $u$, we will show that the inequality is kept after the first increase of $r(u)$ and continues to hold before an optimal $r^*$ is reached. Consider the first time that $r(u)$ is increased. Vertex $u$ is queued in rQ either because (5) is satisfied on $u$ or for the purpose of restoring (1) and (2). For the former case, $m(u)$ will be set to $f(u)$ and (5) becomes $r^*(m(u)) - r(m(u)) \le r^*(u) - r(u)$ after the increase of $r(u)$.
For the latter case, the increase of $r(u)$ is due to the violation of (1) and (2) on the edge between $u$ and $m(u)$. If it is a violation of (1), then $r(u) = r(m(u))$ after (1) is restored. It follows that $r^*(m(u)) - r(m(u)) = r^*(u) - r(u)$ since $r^*(u) = r^*(m(u))$. If it is a violation of (2), then $w(m(u), u) + r(u) - r(m(u)) = 0$ after (2) is restored. Since $w(m(u), u) + r^*(u) - r^*(m(u)) \ge 0$, it follows that $r^*(m(u)) - r(m(u)) \le r^*(u) - r(u)$.
In any case, the inequality is true after the first increase of $r(u)$. Even if $r(m(u))$ may be increased thereafter, the inequality will remain true until the next increase of $r(u)$, when $m(u)$ will be assigned a new value that may or may not differ from the original assignment. By the same case study as above, we can show that the inequality will continue to hold. By induction, the lemma is true.
In addition, if there exists a sequence of vertices $x_i$, $i = 0, 1, \ldots, k - 1$, such that $x_i = m(x_{i+1})$ and $x_k = x_0$, we refer to it as an $m$ cycle. For the example in Fig. 9
Proof: Suppose (1) and (2) are satisfied (if not, we shall wait until the process of restoring (1) and (2) is completed and $r(v) > N_{ff}$ is also true at that time). For the sake of contradiction, we assume that there is no $m$ cycle. We will show that this assumption contradicts Lemma 5 by conducting a case study. Let $V'$ be the set of vertices that can reach $v$ in $G$ regardless of edge directions, including $v$. Suppose $u \in V'$ is such a vertex whose $r(u) = \min_{i \in V'} r(i)$. There are two cases depending on whether a direct path exists between $u$ and $v$ or not.
If there is no such path, we can find another vertex $x \in V'$ distinct from $u$ and $v$ such that all acyclic paths between $v$ and $x$ are in one direction, either from $v$ to $x$ or from $x$ to $v$. If they are all from $v$ to $x$, let $p$ denote one such path. We therefore have $w_r(p) = w(p) + r(x) - r(v)$. Since all paths between $v$ and $x$ are from $v$ to $x$, the flip-flops that were moved into $p$ through $x$ are different from those that were originally in $p$. Thus, $w(p) + r(x) \le N_{ff}$. It leads to $w_r(p) < 0$, which is impossible when (2) is satisfied.
Therefore, there must exist a direct acyclic path $p$ between $u$ and $v$. It follows that $r(u) > 0$; otherwise, more than $N_{ff}$ flip-flops have to be moved into or out of $p$, which is impossible by retiming. In addition, $V'$ cannot be $V$; otherwise, it contradicts Lemma 5. However, since the vertices in $V - V'$ have no connection with those in $V'$ at all, the vertices in $V'$ and the edges connecting them actually constitute an independent subcircuit $G' \subset G$. The flip-flop configuration of a subcircuit is also independent of those of others. Thus, Lemma 5 can be applied to $G'$ and $G - G'$ separately. Since we assume no $m$ cycle, Lemma 5 implies that $(\exists \gamma \in V' : r(\gamma) = 0)$ is true, which contradicts the fact that $u$ is a vertex in $V'$ with minimal $r$ value. Therefore, we must have an $m$ cycle.
The reason why an m cycle is important is because its appearance certifies the optimality of the current T .
Theorem 1: If an m cycle appears, then the current T is optimal.
Proof: Suppose the last $m$ assignment is $m(x_0) = x_{k-1}$, by setting which the $m$ labeling forms a cycle, that is, a sequence of vertices $x_0, x_1, \ldots, x_{k-1}, x_k = x_0$ with $x_i = m(x_{i+1})$ for $i = 0, 1, \ldots, k - 1$.
For the sake of contradiction, we assume that the current $T$ is not optimal. Let $r$ denote the retiming before the current call of "r-ADJUST" and $r'$ denote the retiming when the $m$ cycle appears. Recall that "r-ADJUST" is necessitated only when $T$ is reduced to the optimal under $r$ by "t-ADJUST." Since $T$ is not optimal, it follows that $r'$ is not an optimal retiming. Then, by Lemma 4 and $m(x_0) = x_{k-1}$, we have $r^*(x_{k-1}) - r'(x_{k-1}) < r^*(x_0) - r(x_0)$ before the increase of $r'(x_0) = r(x_0) + 1$. On the other hand, Lemma 4 guarantees that $r^*(x_0) - r(x_0) \le r^*(x_1) - r'(x_1) \le \cdots \le r^*(x_{k-1}) - r'(x_{k-1})$, which is a contradiction. Therefore, the current $T$ is optimal.
Once (1) and (2) are restored, restoring (3) is straightforward. First of all, we reset $t(v) = 0$, $\forall v \in V$. After that, for edges $(y, z) \in E$ with $w_r(y, z) = 0$, we update $t(z)$ with $\max(t(z), t(y) + d(y, z))$. In our implementation, another FIFO queue tQ is employed to facilitate this process. When completed, it produces $t$ values satisfying (3).
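The arrival-time update just described can be sketched with a FIFO queue as follows. The sketch follows the text literally, propagating only along edges whose retimed weight is zero, and assumes those edges form no cycle (consistent with the positive-cycle-weight assumption). The function name and the `(y, z, delay, weight)` tuple layout are hypothetical.

```python
from collections import deque


def recompute_arrival_times(vertices, edges, r):
    """Reset t to 0 and propagate delays along edges with w_r(y, z) = 0.

    edges: list of (y, z, delay, weight).
    """
    t = {v: 0.0 for v in vertices}
    zero_out = {v: [] for v in vertices}
    for y, z, d, w in edges:
        if w + r[z] - r[y] == 0:
            zero_out[y].append((z, d))

    tq = deque(vertices)       # FIFO queue of vertices to process
    in_queue = set(vertices)
    while tq:
        y = tq.popleft()
        in_queue.discard(y)
        for z, d in zero_out[y]:
            if t[y] + d > t[z]:
                t[z] = t[y] + d
                if z not in in_queue:  # re-examine z's successors later
                    tq.append(z)
                    in_queue.add(z)
    return t
```

After the update, any vertex with t(u) >= T signals that another round of increasing r is needed, as the following paragraph explains.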
However, if the resulting $t$ satisfies $(\exists u \in V : t(u) \ge T)$, then, by Lemma 2, we have $r^*(u) - r^*(f(u)) > r(u) - r(f(u))$.
All the above operations [increasing $r(u)$ and then restoring (1)-(3)] need to be carried out again on that particular $u$. On the other hand, if $(\forall v \in V : t(v) < T)$, (4) is restored and we obtain a better retiming. We then switch to "t-ADJUST."
The pseudocode for adjusting r is given in Fig. 10 . It turns out that Zhou's algorithm [10] for traditional retiming without binary search is a special case of Fig. 10 with d(u, v) 
Having provided the pseudocode of both "t-ADJUST" and "r-ADJUST" and the criterion for switching between them, we present our algorithm in Fig. 11. The structure of the algorithm is clear: it alternates between "t-ADJUST," which reduces $T$ to the minimum under $r$, and "r-ADJUST," which adjusts $r$ to approach an optimal retiming, and terminates when either a critical cycle or an $m$ cycle is found.
This is an elegant result that confirms our intuition. We know that if all edges can accommodate flip-flops, that is, $E_1 = \emptyset$, then (1) is dropped and we can compute a valid retiming by simply separating every two consecutive flip-flops by a delay value equal to the maximum cycle ratio $\rho^*$. The corresponding arrival times will satisfy (3) and (4) under $T = \rho^*$. In other words, the optimal period is equal to $\rho^*$. When the cycle with ratio $\rho^*$ becomes critical under $T$, we have $T = \rho^*$ and $T$ is optimal. However, in the presence of forbidden edges, separating every two consecutive flip-flops with delay $\rho^*$ may result in some flip-flop being placed on a forbidden edge, which is prohibited. Consequently, the optimal period $T^*$ may be larger than $\rho^*$. When $T$ is reduced to $T^*$, adding more flip-flops on critical paths will not help to further reduce $T$. In fact, the requirement for more flip-flops on one critical path will be propagated through forbidden edges to other critical paths, which is exactly captured by the $m$ labeling. When an $m$ cycle is formed, the total number of flip-flops in the $m$ cycle is a constant, independent of retiming. In other words, the requirement of more flip-flops in an $m$ cycle cannot be fulfilled by retiming. Therefore, the current $T$ cannot be further reduced; it is optimal.
C. Correctness Proof and Computational Complexity
We prove the correctness of the algorithm by showing that it returns an optimal solution when it terminates.
Theorem 2: The algorithm in Fig. 11 returns a retiming with minimal clock period when it terminates.
Proof: The algorithm terminates only if a critical cycle is found in procedure "t-ADJUST" or an m cycle is found in "r-ADJUST." By Lemma 3 and Theorem 1, T is optimal, and the algorithm will return it along with a corresponding optimal retiming.
We finally present the computational complexity of the algorithm to show that it terminates in polynomial time.
Theorem 3: The algorithm in Fig. 11 terminates in $O(|V|^3 |E| N_{ff})$ time, where $N_{ff}$ is the total number of flip-flops in the original circuit.
Proof: First of all, due to the similarity with Burns' algorithm [26], procedure "t-ADJUST" has a provable complexity of $O(|V|^2 |E|)$ before either a critical cycle appears or $(\exists v \in V : t(v) = T)$ becomes true. The latter actually triggers procedure "r-ADJUST." Since no vertex is queued more than once in rQ, restoring (1) and (2)
Remark 1: The significance of Theorem 3 is not the actual formula of the bound, but the demonstration that the optimal wire retiming problem can indeed be solved in polynomial time without using binary search. Furthermore, this bound should be interpreted with caution. First, a program usually has large running time variations on different problem instances. The worst-case running time may only happen on a few rare instances, and thus it may not be a good indication of the efficiency on most other instances. For example, the worst case assumes that the algorithm terminates when $r(v)$ has been increased to $N_{ff}$ for all $v \in V$, which is hardly the case in real applications. Second, even if the worst case does occur on most problem instances, a bound may be loose due to the difficulty of carrying out an accurate analysis. For example, reaching the minimal period under a given $r$ typically takes $O(|E|)$ time on real circuits. These assumptions lead to an ideal worst case that is very unlikely to be attainable in reality. In other words, the bound in Theorem 3 is very loose. Since only necessary operations are conducted in the algorithm, it should be efficient on most instances. This is confirmed by our experimental results in Section VI.
D. Speed-Up Technique (Better Initialization)
Let $T_0$ and $T_0^*$ denote the clock period returned by the initialization and the minimal clock period under $r(v) = 0$, $\forall v \in V$, respectively. Intuitively, the closer $T_0$ is to $T_0^*$, the fewer iterations the algorithm has to take to reach $T_0^*$. Recall that in the initialization procedure we proposed in Fig. 7, flip-flops are not allowed to be distributed until the arrival times of all vertices are obtained. This may result in too conservative a value of $T_0$. For instance, a circuit with only one edge $(u, v)$ and $w(u, v) = 1$ will be initialized with $T_0 = d(u, v)$. However, if we distribute flip-flops simultaneously during the computation of $t$, which in this case places the flip-flop in the middle of $(u, v)$, then the circuit ends up being initialized with $T_0 = d(u, v)/2$, which is exactly the optimal.
More specifically, the idea to compute $t$ and distribute flip-flops simultaneously is as follows. For each edge $(u, v) \in E$, its flip-flops are first distributed with delay $T$ in between, where $T$ is the current maximum of $t$. Then, we check if the resulting When completed, it gives a much better $T_0$. However, (3) might not be satisfied because the flip-flops that are distributed earlier may have less than $T_0$ in between. As a remedy, the operations used in Fig. 10 to restore (3) are employed. The resulting $t$ will not violate (4). By doing so, we obtain a better $T_0$ within the same complexity as for Fig. 7.
VI. EXPERIMENTAL RESULTS
We implemented the algorithm on a PC with a 2.4-GHz Intel Xeon CPU, 512-KB cache, and 1-GB RAM. To allow a comparison with the results of our previous works [13], [15], we use the same test files, which are modified from the ISCAS-89 suite with random delay assignment: 1.0 and 2.0 units to gates (treated as macroblocks) and 0.2-5.0 to wires. To reflect the impact of global interconnects on an SOC design, the delay range is intentionally chosen so that the wire delays are commensurate with, or even many times larger than, the block delays.
Generally speaking, different blocks can be classified into two categories: complete bipartite and noncomplete bipartite. A block is complete bipartite only if each connected component is a complete bipartite graph in its timing model. For example, the block in Fig. 2 is complete bipartite. It becomes noncomplete bipartite if we add an additional path from c to x.
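The classification can be sketched as a small graph check. The representation below (a block given as lists of input and output pins plus a list of input-to-output paths) and the example block are assumptions made for illustration; the example is not meant to reproduce the block of Fig. 2.

```python
from collections import defaultdict


def is_complete_bipartite_block(inputs, outputs, paths):
    """Return True if every connected component of the block's
    input/output timing graph is a complete bipartite graph.

    paths: iterable of (input_pin, output_pin) pairs, one per
    combinational path (hypothetical representation).
    """
    inputs, outputs = set(inputs), set(outputs)
    adj = defaultdict(set)
    for i, o in paths:
        adj[i].add(o)
        adj[o].add(i)

    seen = set()
    for start in inputs | outputs:
        if start in seen:
            continue
        # Collect the connected component containing `start` by DFS.
        comp, stack = set(), [start]
        while stack:
            v = stack.pop()
            if v in comp:
                continue
            comp.add(v)
            stack.extend(adj[v] - comp)
        seen |= comp
        comp_in, comp_out = comp & inputs, comp & outputs
        # Complete bipartite: every input reaches every output of the component.
        edge_count = sum(len(adj[i]) for i in comp_in)
        if edge_count != len(comp_in) * len(comp_out):
            return False
    return True


# Hypothetical block: a and b drive x, c drives y (two complete components).
base_paths = [("a", "x"), ("b", "x"), ("c", "y")]
print(is_complete_bipartite_block("abc", "xy", base_paths))
# Adding a path from c to x merges the components; a-y and b-y are missing.
print(is_complete_bipartite_block("abc", "xy", base_paths + [("c", "x")]))
```

In this toy block the first call reports a complete bipartite block and the second a noncomplete bipartite one, mirroring the effect of adding one extra path.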
Even though gates are treated as blocks, they can only be complete bipartite as each gate has only one output. To further test the cases with noncomplete bipartite (denoted as "non-CB" in Tables I and II) blocks, we apply hMETIS [27] to partition a circuit into groups. All edges inside a group are then treated as forbidden edges. For simplicity, we did not further apply our timing model to partitions when generating the results. The number of partitions of a circuit, which is denoted as "No.Part" in Table I , plays a key role in determining the percentage of noncomplete bipartite blocks. To better reflect the influence of noncomplete bipartite blocks on the optimal clock period, we intentionally choose these numbers in order for the resulting differences to be significant.
In Table I, we report the computed minimal clock period for each circuit. They match the results in both [13] and [15]. The lower bound $\rho^*$ defined in Lemma 3 is also reported as a comparison. Since we approach the optimal clock period by gradual reduction in the algorithm, we also report the total number of occurrences that $T$ is successfully reduced during "t-ADJUST" in the column of "No. Step." Note that, although we did not change the topology of the circuit after partitioning, we have forced the type of the edges within a group to $E_1$. The consequent configuration of edges has little to do with that before partitioning. We report them in one table only to share the basic circuit information, such as $|V|$, $|E|$, $N_{ff}$, and $\rho^*$, and to be consistent with the report format in [13] and [15] for clear comparisons.
In Table II, we highlight the differences in running time among the algorithms in [13] and [15] (both with precision 0.1), the base algorithm in Fig. 11, and the further improved one with the proposed speed-up technique, denoted as $t_{bs1}$, $t_{bs2}$, $t_{base}$, and $t_{new}$, respectively. For a particular algorithm, we compute the ratio of its running time to $t_{new}$ for each test case; the arithmetic (geometric) mean of these ratios is given in row "arith" ("geo"). The results clearly show that our new algorithm with the speed-up technique achieves an improvement of multiple orders of magnitude over [13] and [15], and that the speed-up technique contributes almost a 2× speed-up on average. They also confirm that the bound in Theorem 3 is loose.
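As a reading aid for the "arith" and "geo" rows, the following sketch computes both means of per-test-case running-time ratios. The helper name `mean_ratios` and the timing values are made up for illustration and are not taken from Table II; the sketch also assumes all running times are strictly positive, which would need special handling for entries reported as 0.

```python
import math


def mean_ratios(times, baseline):
    """Arithmetic and geometric means of the per-case ratios times[i] / baseline[i].

    Assumes all running times are strictly positive (illustrative helper).
    """
    ratios = [t / b for t, b in zip(times, baseline)]
    arith = sum(ratios) / len(ratios)
    # Geometric mean via logarithms to avoid overflow on long lists.
    geo = math.exp(sum(math.log(r) for r in ratios) / len(ratios))
    return arith, geo


# Made-up running times (seconds) for two test cases.
t_other = [2.0, 8.0]
t_new = [1.0, 1.0]
arith, geo = mean_ratios(t_other, t_new)
print(f"arith = {arith:.2f}, geo = {geo:.2f}")
```

The geometric mean is less sensitive to a single test case with a very large ratio, which is one common reason for reporting both.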
Remark 2: A few observations help in interpreting the data presented in Table II. First of all, though faster on average, $t_{new}$ is a little slower than $t_{bs2}$ for some cases "without non-CB blocks." This can be explained by the nature of binary search. Two factors affect the runtime of a binary-search-based algorithm. First, the finer the chosen precision, the longer the runtime. Second, the runtime is highly dependent on the binary search range. Checking an infeasible period usually takes more time than checking a feasible one. Consequently, if the upper bound is chosen to be slightly larger than the optimal, then during most of the binary search iterations the algorithm will be checking infeasible periods, which corresponds to the worst case runtime scenario. On the other hand, the best case scenario arises when the lower bound is slightly smaller than the optimal, so that most binary search iterations are on feasible periods. Our experiments on test cases "without non-CB blocks" happened to fall into the best case scenario. That explains why $t_{bs2}$ is more efficient in some cases. Another phenomenon that needs to be explained is that the running times of the new algorithm actually decrease (to almost 0) when non-CB blocks are included, which runs counter to the trend shown by both $t_{bs1}$ and $t_{bs2}$. The increasing trend of $t_{bs1}$ and $t_{bs2}$ can be explained by the higher binary search upper bounds caused by non-CB blocks and by the fact that, for a significant number of iterations, the binary-search-based algorithms are now checking infeasible periods. The proposed new algorithm, however, always works on feasible periods, so its running time per iteration is much smaller.
If the initialization happens to generate a feasible solution that is close to the optimal, then very few iterations are needed to reach the optimal, as is the case in our experiments shown in Table I. Besides, the design methodology behind the new algorithm is to make it as efficient as possible. The three FIFO queues employed in the implementation serve this purpose.
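The dependence of binary search on its range can be illustrated with a toy model. The sketch below assumes that a period is feasible exactly when it is at least some optimum, and it only counts feasible and infeasible checks; it does not model the cost of each check. The function name and all numbers are hypothetical.

```python
def binary_search_checks(lower, upper, optimum, precision):
    """Count (feasible, infeasible) checks made by a plain binary search.

    Toy feasibility model (assumption): a period T is feasible iff T >= optimum.
    """
    feasible = infeasible = 0
    while upper - lower > precision:
        mid = (lower + upper) / 2
        if mid >= optimum:
            feasible += 1
            upper = mid   # feasible: try smaller periods
        else:
            infeasible += 1
            lower = mid   # infeasible: try larger periods
    return feasible, infeasible


# Optimum just below the upper bound: almost every check is infeasible.
print(binary_search_checks(0.0, 16.0, 15.9, 0.1))  # (1, 7)
# Optimum just above the lower bound: almost every check is feasible.
print(binary_search_checks(0.0, 16.0, 0.1, 0.1))   # (7, 1)
```

Both runs make the same number of checks, so under this model any runtime gap between them comes entirely from infeasible checks being more expensive than feasible ones.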
Interestingly, Chu et al. [14] also considered the retiming problem with interconnect delays. However, they formulated their problem at block level, that is, only gates exist in the circuits. Therefore, they handled one issue (interconnect delay) in our work but not the other (retiming over blocks and buffer-forbidden regions). Furthermore, the approximation approach of first ignoring gate boundaries and then moving flip-flops out of gates does not work for noncomplete bipartite blocks, that is, their near-optimal algorithm is not applicable to our second set of experiments (with non-CB blocks). In this sense, they solved a different problem, even though it looks similar to our problem.
For better comparison with their results within the first set of experiments (without non-CB blocks), we obtained their source code of the near-optimal algorithm and the test files used in [14], and ran our new algorithm on their test files. We report the running time differences in Table III, where $t_{chu}$ and $t_{new}$ represent the running time for their near-optimal algorithm and our new algorithm with the speed-up technique, respectively. Likewise, row "arith" ("geo") denotes the arithmetic (geometric) mean of running time improvement. Column "No. Step" denotes the total number of occurrences that $T$ is successfully reduced during the execution of our new algorithm. The near-optimal algorithm assumed a 1% error bound, with an average of 9.6 binary search iterations over all benchmark suites.
Note that, although all test files come from the same benchmark suite, they have different delay assignments. In their test files, wire delays were obtained by first implementing the circuits in a 0.25-µm process, laying out the circuits with Silicon Ensemble, and then extracting the parasitics from the layout. Besides, since they formulated the problem at block level, each gate was treated as a vertex. Thus, for a given benchmark circuit, the generated problem size in their formulation is approximately three times smaller than ours, where each input/output of a gate is treated as a vertex instead. For example, "s5378" in Table I and "s5378 " in Table III come from the same circuit, but the numbers of vertices and edges are 7205 and 8603 for "s5378" and 2781 and 4261 for "s5378 ", respectively. Even with problem sizes that are three times larger, Table III reveals that our new algorithm is generally more efficient than their near-optimal algorithm. For those cases where $t_{new} > t_{chu}$, the long $t_{new}$ is mainly due to the large "No. Step," e.g., 3424 for "s6669 " and 5292 for "s38584 ." In addition, since their near-optimal algorithm uses binary search, comments similar to Remark 2 can be used to explain the results in Table III.
VII. POST-RETIMING PLACEMENT
As stated in Section II, the wire retiming problem is formulated at an abstract enough level that it may be used at different design stages such as interconnect planning and physical design. When flip-flops are repositioned in the retiming process, there is an issue of how they will be accommodated in the given placement. A simple solution may allow the floorplan to be modified after the wire retiming stage if there is not enough space for flip-flops. Interconnect planning renders more flexibility to this approach, but we should in general avoid the iterations between floorplanning and retiming. A better approach will estimate and allocate buffer regions during floorplanning [28], and then place buffers only in the buffer regions. Furthermore, if we assume that the buffers (not the flip-flops) on the long wires are already placed, then when a flip-flop is moved to a wire, we can replace the closest buffer with the flip-flop. This will alleviate the impact of repositioned flip-flops on the area.
A few sentences may be deemed necessary for the linear delay model of buffer-allowable edges and for our above suggestion to substitute the closest buffer by the flip-flop. By assuming a continuous number of buffers and buffer sizes, Otten and Brayton [23] showed that the delay of a wire could be made linear in terms of its length. However, even when the buffer number is an integer and the size is fixed, the delay of a wire can still be bounded by a linear function of its length. The difference between the model and the "real" delay is at most the delay of one buffer. Since only very long global interconnects need to be wire pipelined, the difference of at most one buffer delay is negligible.
VIII. CONCLUSION
A polynomial-time algorithm for optimal wire retiming is presented in this paper. Contrary to all previous algorithms that used binary search to check the feasibility of a range of clock periods, the new algorithm directly checks the optimality of the current feasible clock period and can thus either push down the period or certify the optimality.
The underlying idea looks into the nature of the binary search approach. At each step, the binary search gives the answer to the question of whether the current clock period is feasible. The optimality of a feasible clock period can only be established indirectly, that is, through the infeasibility of the next smaller clock period. However, in our algorithm, the question being answered at each step is whether a feasible clock period smaller than the current one exists. Since it gives the existence answer, optimality is established directly once we reach such a step that gives the answer: "No."
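The control flow described above can be sketched as a simple descent loop. This is a toy stand-in under stated assumptions, not the procedures "t-ADJUST" and "r-ADJUST": the oracle `toy_improve` and the finite list of feasible periods are hypothetical and only serve to show how a "No" answer certifies optimality directly.

```python
from typing import Callable, Optional, Tuple


def minimize_by_descent(
    initial: float, improve: Callable[[float], Optional[float]]
) -> Tuple[float, int]:
    """Repeatedly ask whether a smaller feasible period exists.

    `improve(T)` is a hypothetical oracle returning a feasible period
    strictly smaller than T, or None if no such period exists.
    """
    period, steps = initial, 0
    while True:
        smaller = improve(period)
        if smaller is None:
            # The answer "No" certifies that `period` is optimal.
            return period, steps
        period = smaller
        steps += 1


# Toy oracle (assumption): the feasible periods form a known finite list.
FEASIBLE = [7.5, 9.0, 12.0]


def toy_improve(period: float) -> Optional[float]:
    candidates = [p for p in FEASIBLE if p < period]
    return max(candidates) if candidates else None


print(minimize_by_descent(12.0, toy_improve))  # (7.5, 2)
```

Here the step counter plays the role of a count of successful period reductions, and no precision or search bounds appear anywhere in the loop.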
Besides the difference in methodology, our algorithm has many other advantages over the binary search approach. First of all, it is polynomial-time bounded, and no precision is required. Second, the implementation is simpler: no upper and lower bounds are needed, and the algorithm itself determines how far each necessary step can proceed. Third, the algorithm is very efficient in practice, which is confirmed by the experimental results. Last but not least, without using binary search, our algorithm is essentially incremental and has the potential of being combined with other optimization techniques, such as gate sizing, budgeting, etc., and thus can be used in incremental design methodologies [29].
