In this paper, we study the problem of retiming of sequential circuits with both interconnect and gate delay.
Introduction
Retiming [1] is a useful and popular technique for performance optimization of sequential circuits. It relocates registers to reduce the cycle time while preserving the functionality of the circuit. Much effort has been made to apply this technique. Some work has extended its applicability to large practical circuits efficiently [11-18]. However, most retiming algorithms have assumed ideal conditions for the non-logical portions of the data paths, specifically ignoring the interconnect delay. As process technology moves into the deep sub-micron regime, interconnect delay becomes a major factor of path delay. Without including this delay component, existing retiming algorithms are not sufficiently accurate to be used in practical high performance circuits.
The choice of an accurate interconnect delay model and of an appropriate retiming algorithm is important. In some previous works [19, 20], interconnect delay was incorporated into the retiming process, but simplifying assumptions were made such that the interconnect delay between adjacent registers on the same wire was neglected. Another approach, which integrates retiming into detailed placement, was presented in [21]. After an initial placement and routing, heuristics were used to estimate interconnect delay. Retiming and post-retiming placement were then performed to optimize the circuit performance.
ICCAD'03, November 11-13, 2003, San Jose, California, USA. Copyright 2003 ACM 1-58113-762-1/03/0011 ... $5.00.
A recent paper [22] by Tabbara et al. applied retiming in the DSM domain with interconnect delay taken into account. This was done by imposing a lower bound on the number of registers on each wire $e_{uv}$, while the delays at nodes were irrelevant. Registers could be retimed into a node that represented a component, which affected the total area of the component. Retiming was performed to satisfy the constraint on the number of registers on each wire while minimizing the total area of the components. Another paper [13] by Deokar et al. used a combination of clock skew and retiming to find a retiming solution whose clock period was guaranteed to be at most one gate delay larger than the optimal clock period. In their work, a clock skew solution corresponding to an optimal clock period was converted into a retiming solution. However, their approach to this conversion considered only gate delays.
In this paper, we study the problem of retiming with both interconnect and gate delay. In our modeling, the delay of a wire is assumed to be directly proportional to its length. When a wire is short, the quadratic component of the wire delay is significantly smaller than its linear component. For a long wire, buffer insertion can be performed to break the wire into short segments. A simple experiment is conducted to illustrate the validity of this assumption, and the result is shown in Figure 1. In this experiment, the Elmore delay model is used and the parameters are based on the 0.07 μm technology. The graph shows the relationship between wire delay (y-axis) and wire length (x-axis). If the wire is shorter than 1.46 mm, the error of using a linear approximation is at most 5.48%. If the wire is longer than 1.46 mm, the delay can be reduced by inserting a buffer, and the resulting error is even smaller.
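To see why a linear model is plausible for short wires, consider the Elmore delay of a single wire segment driven by a gate. The symbols $R_d$ (driver resistance), $C_L$ (load capacitance), and $r$, $c$ (wire resistance and capacitance per unit length) are assumptions introduced here for illustration; they are not the parameter values used in the experiment.

```latex
t(L) = R_d\,(cL + C_L) + rL\left(\tfrac{1}{2}\,cL + C_L\right)
     = R_d C_L + (R_d c + r C_L)\,L + \tfrac{1}{2}\,r c\,L^2
```

Under these assumptions the quadratic term is small relative to the linear term whenever $L \ll 2(R_d c + r C_L)/(rc)$, which is the regime that buffer insertion is meant to maintain.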
We present two approaches in this paper, both of which have polynomial time complexity. The first one is extended from the MILP approach in [1] and solves the problem optimally, i.e., it relocates the registers to give the smallest possible clock period. The second one transforms the problem into a single-source longest paths problem and then applies a technique to reduce the size of the graph for longest path computation. It is an improvement over the first one in terms of practical applicability. It gives solutions very close to the optimal (0.13% more than the optimal on average) in a much shorter runtime. Experimental results show that a circuit with more than 22K gates and 32K wires can be retimed in 83.56 seconds on a PC with a 1.8 GHz Intel Xeon processor. These retiming techniques will also find applications in flip-flop dropping in placement, by estimating the best possible register positions to optimize the circuit performance. The original placement solution will be modified to relocate the registers according to the retiming solution. However, the effect will be minor if the original solution is not very densely placed. This is a reasonable assumption today, as area is not a major concern while routability and congestion are the important factors for circuit performance. Register relocations can then be done by making use of the empty space or by shifting the placed cells slightly.
The remainder of this paper is organized as follows. We present the problem statement in Section 2. The optimal approach and the fast approach are presented in Section 3 and Section 4, respectively. Experimental results are shown and discussed in Section 5. A conclusion follows in Section 6.
A retiming solution can be viewed as a labeling of the nodes $r : V \to Z$, where $Z$ is the set of integers [1]. The retiming label $r(v)$ for a node $v$ represents the number of registers moved from its outputs toward its inputs. After retiming, the number of registers on an edge $e_{uv}$ is given by $\hat{w}_{uv} = r(v) + w_{uv} - r(u)$.
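As a small illustration of the labeling, the following sketch computes the register count on every edge after retiming and checks that none is negative. The dictionary-based graph representation and the three-gate ring are assumptions made for this example only.

```python
def retimed_weights(w, r):
    """Return the register count on every edge after retiming.

    w maps an edge (u, v) to its original register count w_uv;
    r maps a node to its retiming label r(v).
    """
    return {(u, v): r[v] + w_uv - r[u] for (u, v), w_uv in w.items()}


def is_feasible(w, r):
    """A retiming is feasible when no edge ends up with a negative count."""
    return all(count >= 0 for count in retimed_weights(w, r).values())


# Hypothetical three-gate ring a -> b -> c -> a holding two registers.
w = {("a", "b"): 1, ("b", "c"): 0, ("c", "a"): 1}
r = {"a": 0, "b": 1, "c": 1}
print(retimed_weights(w, r))
print(is_feasible(w, r))
```

Note that the total number of registers around the ring is unchanged by the labeling, since the $r$ terms cancel along any cycle.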
As interconnect delay dominates in VDSM technology, the exact position of each register affects the clock period. A retiming solution should specify both the retiming label $r(v)$ for each node $v$ and the exact positions of the $\hat{w}_{uv}$ registers on each edge $e_{uv}$. Retiming should be formulated as the problem of determining a feasible retiming solution, i.e., a solution in which the number of registers $\hat{w}_{uv}$ on each edge $e_{uv}$ is non-negative, such that the clock period of the retimed circuit is minimized. In the following, we show how to check whether a particular clock period $T$ can be achieved by a feasible retiming solution. The minimum achievable clock period $T_{opt}$ can then be found by binary search.
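The binary search over $T$ can be sketched as below. The function name `min_clock_period`, the relative stopping rule, and the toy oracle are assumptions for illustration; the real feasibility check is the one developed in the following sections.

```python
def min_clock_period(is_achievable, lo, hi, rel_err=0.01):
    """Binary search for the smallest T for which is_achievable(T) holds.

    Assumes feasibility is monotone (if T is achievable, so is any larger T),
    that hi is achievable, and that the optimum is larger than lo.
    The search stops once the bracket is within rel_err of hi.
    """
    while hi - lo > rel_err * hi:
        mid = (lo + hi) / 2.0
        if is_achievable(mid):
            hi = mid   # mid works, so try a smaller period
        else:
            lo = mid   # mid fails, so the optimum is larger
    return hi


# Toy oracle standing in for the real check: periods of at least 3.7 work.
t = min_clock_period(lambda T: T >= 3.7, lo=0.0, hi=100.0, rel_err=0.0001)
print(t)
```

The returned value is always an achievable period, so the error bound is one-sided: the result is never below the true optimum.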
An Optimal Approach
This approach is extended from the mixed integer linear programming (MILP) approach in [1]. In the original formulation, only gate delay is considered, so there is no difference between having one register and having more than one register on a wire. Their technique can be extended to solve the problem with both gate and interconnect delay optimally by modifying some of the constraints. In order to formulate the problem as an MILP, for each gate $v$, we need to define a term $a(v)$ that represents the maximum arrival time at the output of gate $v$.
An example to illustrate this definition is shown in Figure 2. We can then formulate the problem as an MILP.
Problem Formulation
A sequential circuit can be represented by a directed graph $G(V, E)$, where each node corresponds to a combinational gate, and each directed edge $e_{uv}$ represents a connection from the output of gate $u$ to the input of gate $v$, through zero or more registers. Without loss of generality, we assume that $G$ is strongly connected. If not, we can add a source node $s$ and connect it to all primary inputs, add a target node $t$ and connect all primary outputs to it, and connect $t$ to $s$. The resulting graph is then strongly connected. If we set the delay of $s$, $t$ and all the added edges to zero, and set the number of registers on $e_{ts}$ to one and that on the other added edges to zero, a retiming solution $S$ of the modified graph will also be a valid retiming solution of the original graph as long as $e_{ts}$ still has one register in $S$. Let $w_{uv}$ be the number of registers on edge $e_{uv}$. Let $d_{uv}$ be the interconnect delay of edge $e_{uv}$ if all the registers are removed. Note that the delay of an interconnect segment is assumed to be proportional to the length of the segment. Let $d_v$ be the gate delay of node $v$.
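The augmentation with a source and a target node can be sketched as follows. The edge map layout, the function names, and the default node names `"s"` and `"t"` are assumptions for this example; they must not clash with existing gate names.

```python
def make_strongly_connected(edges, primary_inputs, primary_outputs,
                            source="s", target="t"):
    """Add zero-delay source/target nodes as described in the text.

    edges maps (u, v) to a pair (w_uv, d_uv): register count and wire delay.
    Returns a new edge map; the input map is left unchanged.
    """
    augmented = dict(edges)
    for pi in primary_inputs:
        augmented[(source, pi)] = (0, 0.0)   # s -> primary input
    for po in primary_outputs:
        augmented[(po, target)] = (0, 0.0)   # primary output -> t
    augmented[(target, source)] = (1, 0.0)   # the single register on e_ts
    return augmented


def keeps_original_registers(w, r, source="s", target="t"):
    """True when e_ts still carries exactly one register after retiming."""
    return r[source] + w[(target, source)] - r[target] == 1


edges = {("g1", "g2"): (0, 1.5)}
aug = make_strongly_connected(edges, ["g1"], ["g2"])
print(sorted(aug))
```

The second helper expresses the validity condition from the text: a solution of the modified graph carries over to the original graph only if $e_{ts}$ keeps its one register, which is equivalent to $r(s) = r(t)$.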
Traditionally, interconnect delay is ignored during retiming.
In the MILP constraints (1)-(4), $T$ is the clock period whose achievability we want to check. Since $a(v)$ is the longest delay to the output of gate $v$ from a register connected directly to an input of $v$, this delay must be at least the delay of gate $v$, so $a(v) \ge d_v$ as stated in (1). Besides, this delay cannot exceed the clock period $T$, as required in (2). Constraint (3) is needed for a feasible retiming solution. Constraint (4) ensures that enough registers are on each edge $e_{uv}$ to achieve a clock period $T$. As the largest possible delay between two adjacent registers is $T$, the right-hand side of constraint (4) is reduced by $T$ for each register on edge $e_{uv}$. Note that this constraint also captures the scenario in which there are no registers on edge $e_{uv}$. In that case, the arrival time at node $u$ contributes directly to the arrival time at node $v$.
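The constraint list itself does not appear in this text. One reconstruction that matches the description of (1)-(4) above, offered as an assumption rather than as the original typesetting, is:

```latex
\begin{align}
a(v) &\ge d_v && \forall v \in V \tag{1}\\
a(v) &\le T && \forall v \in V \tag{2}\\
r(v) + w_{uv} - r(u) &\ge 0 && \forall e_{uv} \in E \tag{3}\\
a(v) &\ge a(u) + d_{uv} + d_v - T\,\bigl(r(v) + w_{uv} - r(u)\bigr) && \forall e_{uv} \in E \tag{4}
\end{align}
```

In this form, an edge with no registers after retiming gives $a(v) \ge a(u) + d_{uv} + d_v$, and each register on the edge lowers the right-hand side by $T$, as the text describes.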
By introducing a variable $R(v)$ at each node $v$, defined as $R(v) = a(v)/T + r(v)$, the above set of constraints (1)-(4) can be rewritten as a set of difference constraints (5)-(8).
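The rewritten constraints are not reproduced in this text. As a sketch, assume (1)-(4) take the form $a(v) \ge d_v$, $a(v) \le T$, $r(v) + w_{uv} - r(u) \ge 0$ and $a(v) \ge a(u) + d_{uv} + d_v - T(r(v) + w_{uv} - r(u))$. Substituting $a(v) = T\,(R(v) - r(v))$ and dividing by $T$ then gives:

```latex
\begin{align}
R(v) - r(v) &\ge d_v / T && \forall v \in V \tag{5}\\
R(v) - r(v) &\le 1 && \forall v \in V \tag{6}\\
r(u) - r(v) &\le w_{uv} && \forall e_{uv} \in E \tag{7}\\
R(v) - R(u) &\ge d_{uv}/T + d_v/T - w_{uv} && \forall e_{uv} \in E \tag{8}
\end{align}
```

The numbering is chosen to agree with the later references in the text: with $d_v = 0$, constraint (8) reduces to $R(v) \ge R(u) + d_{uv}/T - w_{uv}$, which is the edge length used in the longest paths formulation.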
A Fast Near-Optimal Approach
In this approach, we first replace each gate by a wire of the same delay and then solve the problem with only interconnect delay optimally and efficiently. Those registers retimed "into" a gate are moved either to the input or the output wires of the gate. The exact positions of the registers on the wires are then determined by a linear program to minimize the clock period.
The solution obtained by this approach is very close to the optimal on average, as shown by the experimental results. In the following, we first show how the retiming problem with interconnect delay only can be solved optimally. Then we describe in detail how gate delay can be handled simultaneously.
Retiming with Interconnect Delay Only
In this subsection, we assume $d_v = 0$ for all $v \in V$. We first show that the clock period feasibility problem can be reduced to a single-source longest paths problem. We then present a fast algorithm to solve the longest paths problem.
Reduction to Single-Source Longest Paths Problem
We solve the set of constraints (5)-(8) with the help of the following lemma.
Lemma 1. Given $R(v)$ for all $v \in V$ satisfying constraint (8), we can obtain a solution to constraints (5)-(8) by setting
Therefore, (5) and (6) ... (5) is not satisfied. In other words, this idea cannot be applied to the retiming problem with both interconnect and gate delay discussed in Section 3.
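The rounding formula of the lemma is not legible in this text. The sketch below assumes one choice that works when all gate delays are zero, namely $r(v) = \lfloor R(v) \rfloor$, and assumes that (5)-(7) read $R(v) - r(v) \ge d_v/T$, $R(v) - r(v) \le 1$ and $r(u) - r(v) \le w_{uv}$. Both the rounding rule and the example ring are assumptions, not the paper's statement.

```python
import math


def labels_from_R(R):
    """Round each R(v) down to an integer retiming label r(v) (assumed rule)."""
    return {v: math.floor(x) for v, x in R.items()}


def check_constraints(R, r, w, d_wire, T, eps=1e-9):
    """Check the assumed forms of (5)-(8) with all gate delays set to zero."""
    ok_5_6 = all(-eps <= R[v] - r[v] <= 1 + eps for v in R)
    ok_7 = all(r[u] - r[v] <= w[(u, v)] for (u, v) in w)
    ok_8 = all(R[v] - R[u] >= d_wire[(u, v)] / T - w[(u, v)] - eps
               for (u, v) in w)
    return ok_5_6 and ok_7 and ok_8


# Hypothetical ring a -> b -> c -> a, checked at T = 4.
w = {("a", "b"): 1, ("b", "c"): 0, ("c", "a"): 1}
d_wire = {("a", "b"): 3.0, ("b", "c"): 2.0, ("c", "a"): 3.0}
R = {"a": 0.0, "b": -0.25, "c": 0.25}   # longest-path values from node a
r = labels_from_R(R)
print(r, check_constraints(R, r, w, d_wire, T=4.0))
```

With a positive gate delay, $R(v) - r(v)$ can fall below $d_v/T$ after rounding, which is why the same idea does not carry over to the general problem.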
The problem of finding $R(v)$ for all $v \in V$ to satisfy constraint (8) can be viewed as a single-source longest paths problem on $G$ with length $l_{uv} = d_{uv}/T - w_{uv}$ for each $e_{uv} \in E$.
As $G$ is strongly connected, we can pick an arbitrary node as the source node $s$.¹ Note that edge lengths can be positive. If $G$ has a positive cycle, the set of constraints has no solution.
It means that the clock period T is infeasible. The solution to this problem is presented in the following subsection.
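A minimal sketch of this feasibility check with the Bellman-Ford algorithm follows. The function names, the dictionary layout, and the small numerical tolerance are assumptions for this example.

```python
def longest_paths(nodes, length, source):
    """Bellman-Ford for longest paths; returns None if a positive cycle exists.

    length maps an edge (u, v) to l_uv. Assumes every node is reachable from
    source, which holds when the graph is strongly connected.
    """
    neg_inf = float("-inf")
    tol = 1e-12                      # guards against floating-point noise
    R = {v: neg_inf for v in nodes}
    R[source] = 0.0
    for _ in range(len(nodes) - 1):
        changed = False
        for (u, v), l_uv in length.items():
            if R[u] != neg_inf and R[u] + l_uv > R[v] + tol:
                R[v] = R[u] + l_uv
                changed = True
        if not changed:
            break
    # One more pass: any further improvement means a positive cycle.
    for (u, v), l_uv in length.items():
        if R[u] != neg_inf and R[u] + l_uv > R[v] + tol:
            return None
    return R


def period_is_feasible(nodes, w, d_wire, T, source):
    """Clock period T is feasible iff l_uv = d_uv / T - w_uv has no positive cycle."""
    length = {e: d_wire[e] / T - w[e] for e in w}
    return longest_paths(nodes, length, source) is not None


nodes = ["a", "b", "c"]
w = {("a", "b"): 1, ("b", "c"): 0, ("c", "a"): 1}
d_wire = {("a", "b"): 3.0, ("b", "c"): 2.0, ("c", "a"): 3.0}
print(period_is_feasible(nodes, w, d_wire, 4.0, "a"))
print(period_is_feasible(nodes, w, d_wire, 3.9, "a"))
```

In the hypothetical ring the total wire delay is 8 with two registers, so the cycle length is $8/T - 2$, which is positive exactly when $T < 4$.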
Fast Single-Source Longest Paths Algorithm
The single-source longest paths problem in Section 4.1.1 can be solved by the Bellman-Ford algorithm [24]. The time complexity is $O(|V||E|)$, which is at least a factor of $O(\lg |V|)$ faster than the optimal algorithm in Section 3. In practice, it is a factor of $O(\lg^2 |V|)$ faster as $|E| = O(|V|)$. However, this algorithm may still be slow in practice. In this section, we present a single-source longest paths algorithm which is faster in practice. The basic idea is to reduce the size of $G$ by compacting some paths into edges before the Bellman-Ford algorithm is applied. The details are given below.
We first transform the graph $G(V, E)$ into a directed acyclic graph (DAG) $G'(V', E')$ by performing a depth-first traversal [24] starting from the source node $s$. The depth-first traversal defines a tree in $G$. Those non-tree edges running from a node $u$ to an ancestor $v$ of $u$ are called back edges. If we point all incoming back edges of a node $v$ to an extra node $v'$, the resulting graph will be a DAG because every cycle in $G$ involves at least one back edge. Formally, we use $E_b$ to denote the set of back edges and $V_b$ to denote the set of nodes with an incoming back edge. For each node $v$ in $V_b$, we introduce an extra node $v'$. Each back edge $e_{uv}$ is removed from the graph and the edge $e_{uv'}$ is added. The resulting DAG is $G'(V', E')$, where $V' = V \cup \{v' \mid v \in V_b\}$ and $E' = (E - E_b) \cup \{e_{uv'} \mid e_{uv} \in E_b\}$. We set the length $l_{uv'}$ of the edge $e_{uv'}$ to $l_{uv}$. To illustrate the transformation, consider the graph $G$ in Figure 3(a) with source node $A$. Suppose the depth-first traversal visits the nodes in the order ACDEFB. Then $E_b = \{e_{DA}, e_{CA}, e_{FC}, e_{FA}\}$ and $V_b = \{A, C\}$. We introduce two extra nodes $A'$ and $C'$, and replace the four edges $e_{CA}$, $e_{DA}$, $e_{FA}$ and $e_{FC}$ with the edges $e_{CA'}$, $e_{DA'}$, $e_{FA'}$ and $e_{FC'}$, respectively. The resulting DAG is shown in Figure 3(b).
¹ If the original circuit is not strongly connected, a source node $s$ has already been added.
... $w_{uv}$. Hence, constraint (7) is also satisfied.
... initially, and $T_{opt}$ is the optimal clock period.
We then construct a reduced graph $H$ with node set $V_b$. The edge set $E_H$ contains an edge $e_{uv}$ for $u, v \in V_b$ if there exists a path in $G$ from $u$ to $v$ with either no back edge or one back edge at the end. The length $l^H_{uv}$ of the edge $e_{uv}$ is the longest path distance among those paths. Note that the longest path distance in $G$ with no back edge (respectively, with one back edge at the end of the path) from $u$ to $v$ equals the longest path distance in $G'$ from $u$ to $v$ (respectively, from $u$ to $v'$). Hence $l^H_{uv}$ for all $u, v \in V_b$ can be computed by solving $|V_b|$ single-source longest paths problems in $G'$ for different source nodes in $V_b$. As $G'$ is a DAG, each single-source longest paths problem can be solved in linear time by visiting the nodes in topological order. The time complexity to construct $H$ is therefore $O(|V_b||E|)$.
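The construction of $H$ can be sketched as follows, together with a positive-cycle test that can be run on either graph. The function names, the dictionary layout, and the four-node example are assumptions for illustration; the extra nodes $v'$ are represented implicitly by recording a distance whenever a back edge is traversed.

```python
from collections import defaultdict


def reduce_graph(nodes, length, source):
    """Return (V_b, l_H): back-edge targets and the edge lengths of H."""
    adj = defaultdict(list)
    for (u, v) in length:
        adj[u].append(v)
    # Iterative DFS: an edge into a node still on the stack is a back edge.
    state = {v: 0 for v in nodes}          # 0 unvisited, 1 on stack, 2 done
    back, finish = set(), []
    state[source] = 1
    stack = [(source, iter(adj[source]))]
    while stack:
        u, it = stack[-1]
        for v in it:
            if state[v] == 0:
                state[v] = 1
                stack.append((v, iter(adj[v])))
                break
            if state[v] == 1:
                back.add((u, v))
        else:                              # all edges of u have been scanned
            state[u] = 2
            finish.append(u)
            stack.pop()
    topo = finish[::-1]                    # topological order without back edges
    V_b = {v for (_, v) in back}
    neg_inf = float("-inf")
    l_H = {}
    for src in V_b:
        dist = {src: 0.0}                  # longest distances in the DAG
        for u in topo:
            if u not in dist:
                continue
            for v in adj[u]:
                cand = dist[u] + length[(u, v)]
                if (u, v) in back:         # path ends at the extra node v'
                    l_H[(src, v)] = max(cand, l_H.get((src, v), neg_inf))
                elif cand > dist.get(v, neg_inf):
                    dist[v] = cand
        for v in V_b:                      # paths that use no back edge
            if v != src and v in dist:
                l_H[(src, v)] = max(dist[v], l_H.get((src, v), neg_inf))
    return V_b, l_H


def has_positive_cycle(V, length):
    """Bellman-Ford style test with every node started at distance zero."""
    dist = {v: 0.0 for v in V}
    for _ in range(len(V)):
        changed = False
        for (u, v), l in length.items():
            if dist[u] + l > dist[v] + 1e-12:
                dist[v] = dist[u] + l
                changed = True
        if not changed:
            return False
    return True


nodes = ["A", "B", "C", "D"]
length = {("A", "B"): -0.5, ("B", "C"): 0.25, ("C", "A"): -0.25,
          ("C", "D"): 0.5, ("D", "B"): -1.0}
V_b, l_H = reduce_graph(nodes, length, "A")
print(V_b, l_H)
print(has_positive_cycle(nodes, length), has_positive_cycle(V_b, l_H))
```

In this example the test on $H$ touches two nodes and four edges instead of four nodes and five edges; the benefit in practice depends on how small $V_b$ and $E_H$ are relative to $V$ and $E$.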
It is obvious that every path in $H$ corresponds to at least one path in $G$ of the same length. Therefore, if $H$ contains a positive cycle, $G$ will also contain a positive cycle. On the other hand, if $G$ contains a positive cycle, the cycle can be broken up into a set of paths $p_1, p_2, \ldots, p_k$ such that both endpoints of each path $p_i$ are in $V_b$.
In this section, we discuss how to consider interconnect and gate delay simultaneously, based on the above algorithm for interconnect delay only. To consider gate delay, we first represent a gate $v$ with delay $d_v$ by a wire of delay $d_v$.
This transformation for the circuit in Figure 3(a) is shown in Figure 4(b). We can then obtain an optimal retiming on this transformed circuit using the algorithm in Section 4.1. However, the retiming solution obtained on the transformed circuit may not be feasible for the original circuit $G$, because some registers may be retimed into a wire that represents a gate. Therefore, we need to perform a post-processing step to get back a feasible retiming solution for $G$ from the optimal retiming solution for the transformed circuit. This is done as follows.
First of all, we move the registers in a gate either backward to the input wires or forward to the output wires of the gate, depending on which direction has a shorter distance. An example showing the relocation of registers is given in Figure 5. After this relocation step, the number of registers $\hat{w}_{uv}$ on each edge $e_{uv}$ is fixed. A linear program is used to determine the exact positions of the registers on the wires so as to minimize the clock period.
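The gate-to-wire transformation that this approach starts from can be sketched as below. Splitting each gate $v$ into an input node `(v, "in")` and an output node `(v, "out")` is a naming convention assumed for this example; the relocation step and the linear program are not shown.

```python
def gates_to_wires(w, d_wire, d_gate):
    """Replace every gate v by a register-free wire of delay d_v.

    w and d_wire map an edge (u, v) to its register count and wire delay;
    d_gate maps a gate to its delay. Returns the register counts and delays
    of the transformed circuit, whose nodes all have zero delay.
    """
    new_w, new_d = {}, {}
    for (u, v), count in w.items():
        e = ((u, "out"), (v, "in"))      # original wire, unchanged
        new_w[e] = count
        new_d[e] = d_wire[(u, v)]
    for v, dv in d_gate.items():
        e = ((v, "in"), (v, "out"))      # wire standing in for gate v
        new_w[e] = 0
        new_d[e] = dv
    return new_w, new_d


w = {("u", "v"): 1}
d_wire = {("u", "v"): 2.0}
d_gate = {"u": 0.5, "v": 0.7}
new_w, new_d = gates_to_wires(w, d_wire, d_gate)
print(new_w)
print(new_d)
```

Any register that the interconnect-only algorithm places on an edge of the form `((v, "in"), (v, "out"))` lies inside a gate and is the kind of register that the relocation step has to move.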
The complete retiming algorithm is summarized as the procedure I-Retiming(). The time-consuming steps are Step 7 and Step 8, which lie inside the binary search loop.
Step 7 can be done in $O(|V_b||E|)$ time as discussed above.
Step 8 can be done in $O(|V_b||E_H|)$ time by the Bellman-Ford algorithm. As $V_b$ contains far fewer nodes than $V$, and $E_H$ usually contains comparable or fewer edges than $E$, this technique is usually much more efficient than applying the Bellman-Ford algorithm to $G$ directly.
Experimental Results
We implemented the two approaches on a 1.8 GHz Intel Xeon PC with 512 KB cache and 512 MB RAM. We tested them with circuits from the ISCAS89 benchmark suite. In our experiments, we implement the circuits in a 0.25 μm process. We lay out the circuits with Silicon Ensemble. Wire delays are then extracted from the layout. In our current implementation, the lower and upper bounds of the binary search are set to 0 and 100 ns, respectively. In the near-optimal approach, we perform the procedure I-Retiming() with an error bound of 1%.
After assigning the registers retimed into a gate to the appropriate wires, a linear program is set up to relocate the registers on the wires to get the smallest possible clock period $T^*$. In the optimal approach, binary search is performed until an error bound of 0.01% is obtained. We call the resulting clock period $T_{opt}$. Notice that we do not need to obtain a very accurate result from I-Retiming() because the solution is optimized by the linear program afterwards. On average, the number of binary search iterations is 9.6 for the near-optimal approach and 16.5 for the optimal approach.
The results are shown in Table 1. The second and third columns give the number of nodes and the number of edges in the graph $G$, respectively. Notice that none of the circuits is strongly connected. The numbers of nodes and edges listed are those after the addition of the source node, the target node, and the associated edges. The fourth and fifth columns show the number of nodes and the number of edges in the reduced graph $H$, respectively. These two values depend on the node chosen as the root in the depth-first traversal. In our current implementation, we always pick the additional node $s$ as the root. We notice that using other nodes as the root does not change the result significantly. The speedup of the Bellman-Ford algorithm by the graph reduction approach in Section 4.1.2 is $(|V||E|)/(|V_b||E_H|)$, which is given in the sixth column. The graph reduction approach is faster in all circuits except s38584. On average, it is faster by 30.61 times. However, the speedup is smaller (and may even be less than one) for larger circuits. The reason is that $|E_H|$ is roughly quadratic in $|V_b|$. For the circuits in Table 1, the ratio of $|E_H|$ to $|V_b|^2$ ranges from 0.11 to 0.86 with an average of 0.41. Therefore, the graph reduction approach may not be useful for large circuits. We can avoid a slowdown of the Bellman-Ford algorithm by determining whether to use $G$ or $H$ based on the ratio $(|V||E|)/(|V_b||E_H|)$. $|V_b|$ and $|E_H|$ can be found in $O(|V_b||E|)$ time. Moreover, we only need to perform this check once for each circuit. Hence, the runtime overhead is insignificant compared with the total runtime.
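The choice between $G$ and $H$ amounts to comparing the two Bellman-Ford cost estimates. A minimal sketch, with a hypothetical function name and made-up sizes rather than values from Table 1:

```python
def use_reduced_graph(n_V, n_E, n_Vb, n_EH):
    """Return True when Bellman-Ford on H is expected to be cheaper than on G.

    Compares the cost estimates |V||E| and |V_b||E_H|, i.e. tests whether the
    ratio (|V||E|) / (|V_b||E_H|) exceeds one.
    """
    return n_V * n_E > n_Vb * n_EH


# Hypothetical sizes: a small V_b makes H attractive, a large one does not.
print(use_reduced_graph(1000, 2000, 50, 1500))
print(use_reduced_graph(100, 200, 90, 4000))
```

Because the sizes are fixed once $H$ has been built, this test needs to run only once per circuit, outside the binary search loop.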
The seventh, eighth, and ninth columns show the runtime of the I-Retiming() procedure, the time taken to solve the linear program, and the total runtime, respectively. The tenth column shows the runtime for the optimal approach. We can see that the near-optimal approach is much more efficient than the optimal approach (especially for large circuits). The eleventh and twelfth columns show the clock periods $T^*$ and $T_{opt}$ obtained by the near-optimal approach and the optimal approach, respectively. The last column is the percentage increase of $T^*$ over $T_{opt}$. The clock period produced by the near-optimal approach is only 0.13% more than that by the optimal approach on average. The optimal clock period is found in seven out of thirteen circuits.
Conclusion
We have presented two elegant approaches to perform retiming on sequential circuits with both interconnect and gate delay. This is a pioneering work in solving this problem as far as we know. Most traditional retiming algorithms have neglected interconnect delay. Our first approach is extended from the MILP approach in [1] and can solve the problem optimally. Our second approach is an improvement over the first one in terms of practical applicability. The main idea is to transform the problem into a single-source longest paths problem in a reduced graph. We have implemented both algorithms and compared their performance on ISCAS89 benchmark circuits. Experimental results show that the second approach gives solutions that are only 0.13% larger than the optimal on average, in a much shorter runtime.
