Abstract
Introduction
Retiming, originally proposed by Leiserson and Saxe [1] [2], is a powerful and well-known technique for improving the performance of synchronous circuits. It is based on relocating registers in a netlist while preserving the functionality of the circuit. Many improvements and extensions to the original ideas have been developed, such as concepts for integrating retiming into logic synthesis [4], algorithms for retiming level-clocked circuits [5] [6], algorithms taking register setup and hold times into account [7] [8], algorithms for retiming registers with enable inputs [9], and algorithms that can improve testability [10].
The original FEAS-algorithm [2] developed by Leiserson and Saxe finds a retiming for a circuit such that a given cycle time is met if such a retiming exists. It is based on a simple circuit model assuming gate delays to be load independent.
In [11] - [13] , more sophisticated timing models are used. For each edge in the retiming graph multiple delay values are calculated covering the two cases that this edge can contain none or at least one register. Branches of multi sink nets are considered independently from each other.
The problem of modeling fanout trees accurately has been addressed in [14]. The problem is illustrated in Figure 1. In real circuits, retiming gate B and gate C changes the load (drawn in bold lines) seen by gate A and therefore also the delay of gate A. This affects the arrival time not only at gates B and C but also at gate D.
In practice, retiming of registers into fanout trees may change the topology of the affected nets dramatically and can change arrival times even on paths where no registers have been moved.
To be able to take dependencies between branches of a fanout system into account, delay tables are assigned to each edge in the retiming graph. Each table entry represents a delay value being valid for one particular register arrangement in the fanout system. Prior to each arrival time calculation step the delay values being valid for the actual register arrangement are selected from the delay tables. This timing model allows a very realistic modeling of circuits in the retiming graph.
For each of the timing models described above, powerful retiming algorithms of polynomial time complexity have been developed. However, for large circuits with thousands of registers, runtimes may still be unacceptably large, impeding the use of these algorithms in practice.
For the original FEAS algorithm this problem has been addressed in [3]. The authors presented a very efficient technique for detecting unreachable target cycle times. Experiments showed that integrating this technique into FEAS can dramatically reduce execution time and generally leads to near-linear time complexity when retiming typical VLSI circuits. Unfortunately, there is no straightforward way to integrate this technique into retiming algorithms using more complex timing models. The reason is that the technique assumes that the sum of the delays of the paths forming a loop in the retiming graph does not depend on the actual positions of the registers. This assumption, however, does not hold for more accurate timing models as described in [11] and [14].
To make retiming more practical, we propose an acceleration technique for the FEAS_CTM retiming algorithm presented in [14] . Our experimental results show a dramatic acceleration for most of the test cases and demonstrate that accurate timing model based retiming is possible even for very large circuits.
The remainder of this paper starts with a brief description of the FEAS_CTM timing model and retiming algorithm. For a more detailed description we refer the reader to [14] . Afterwards we describe our new acceleration technique in detail. The paper closes with experimental results and a short conclusion.
The FEAS_CTM algorithm

Circuit Model
A circuit is mapped onto a weighted directed retiming graph G = (V, E). Each logic gate is mapped onto a vertex v, being assigned a delay t d (v) and a retiming value r(v) which initially is 0 and can be incremented during retiming. A net with n sinks, n ≥ 1, is modeled as a bundle of edges ("branches"
Each edge e = (u, v) ∈ E is assigned a weight w(e) denoting its initial number of registers. The number of registers on e during or after retiming is denoted by w r (e) = w(e) + r(v) -r(u). 
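The register count formula can be illustrated with a minimal Python sketch. The three-vertex loop, the dictionary layout and the function name `retimed_weight` are assumptions made for illustration only; they are not part of the FEAS_CTM implementation.

```python
def retimed_weight(w, r, u, v):
    """Registers on edge (u, v) after retiming: w_r(e) = w(e) + r(v) - r(u)."""
    return w[(u, v)] + r[v] - r[u]


# Hypothetical loop a -> b -> c -> a with its initial register counts w(e).
w = {("a", "b"): 1, ("b", "c"): 0, ("c", "a"): 1}
# Hypothetical retiming values r(v); vertices b and c were incremented once.
r = {"a": 0, "b": 1, "c": 1}

w_r = {(u, v): retimed_weight(w, r, u, v) for (u, v) in w}
print(w_r)  # {('a', 'b'): 2, ('b', 'c'): 0, ('c', 'a'): 0}
```

Along the loop the r(v) terms cancel, so the total number of registers stays at 2 in this example even though individual edge weights change.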
. This models the fact that under the assumption described above two gates modeled by two vertices v i and v j with w r (b i ) = w r (b j ) are driven by the same register. Because t o is only defined for w(b i ) > 0 there are 2 n-1 different cases to be distinguished.
Algorithm
Like FEAS, FEAS_CTM performs an alternating sequence of calculating data arrival times and retiming critical vertices. These steps are repeated until either a solution is found or it is detected that no solution can be reached.
Arrival Time Calculation.
In a preliminary step, the delay value(s) valid for the actual register arrangement are selected for each edge from the corresponding delay tables, in accordance with the weight of the edge.
The arrival time t ar (v) of a vertex v is calculated recursively from the delay values of the incoming edges e i = (u i ,v) and the arrival time of the predecessor vertices u i of v as follows: , e j = (u, v j ) using the t i values for the case that w r (e j ) > 0. For the case that there is a register on e, t ear denotes the data arrival time at the input of that register. Otherwise, if no register is present, t ear denotes the latest arrival time at an assumed (not necessarily yet present) register on an outgoing edge of v. 
Retiming Critical
FEAS_CTM(ttarget) {
    for (i = 0; i < |E|; i = i + 1) {
        calculate_arrivaltimes();
        if (checkcycletime(ttarget) == true) return true;
        analyze_nets(ttarget);
        for each (vertex v)
            if (v is marked) r(v) = r(v) + 1;
    }
    return false;
}
Accelerating Retiming
Retiming-based cycle time minimization in general performs a binary search for the minimum reachable cycle time t optimal . During this process the retiming algorithm is called several times, trying to find retimed circuits for various target cycle times t target . It has been observed that in the cases where FEAS_CTM is able to find a feasible retiming, it requires only a very small number (<<|E|) of iterations of its inner loop. The same observation has been made for the original FEAS algorithm in [3]. Consequently, speeding up retiming has to concentrate on those cases where no feasible solution can be found. In those cases FEAS_CTM requires |E| iterations of its inner loop.
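The binary search around the retiming algorithm can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `minimize_cycle_time`, the `tolerance` parameter and the stand-in feasibility oracle are assumptions, and the search presumes that feasibility is monotone in the target cycle time and that the initial upper limit is feasible.

```python
def minimize_cycle_time(is_feasible, t_low, t_high, tolerance):
    """Binary search for the smallest feasible target cycle time.

    is_feasible(t_target) stands in for one call of the retiming algorithm.
    Assumes t_high is feasible and feasibility is monotone in t_target.
    """
    while t_high - t_low > tolerance:
        t_mid = (t_low + t_high) / 2
        if is_feasible(t_mid):
            t_high = t_mid   # a retiming exists: try a tighter target
        else:
            t_low = t_mid    # no retiming found: this is the expensive case
    return t_high


# Stand-in oracle: pretend every target of at least 7.3 time units is reachable.
calls = []

def oracle(t_target):
    calls.append(t_target)
    return t_target >= 7.3

best = minimize_cycle_time(oracle, 0.0, 16.0, 0.01)
failed = sum(1 for t in calls if t < 7.3)
print(best, len(calls), failed)
```

Every call counted in `failed` corresponds to a run in which the unaccelerated algorithm would exhaust its full iteration budget, which is why early detection of unreachable targets pays off.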
In the following we propose an efficient and effective technique for testing whether a particular cycle time t target cannot be reached. This check is performed during each iteration of FEAS_CTM. As soon as this test succeeds for the first time, we can abort the retiming process for t target .
At first we give an informal outline. Afterwards our approach is explained in more detail.
A common strategy for computing bounds in timing analysis is to consider circuit loops [15]. Our acceleration technique is based on the following fact: if the retiming graph contains at least one loop for which no feasible retiming exists, there is no feasible retiming for the whole circuit. Consequently, we can abort the retiming process as soon as any such loop is found. This has already been observed in [3] and forms the basis for our acceleration technique. During each iteration of the inner loop of FEAS_CTM we determine a set of loops that we analyze in detail.
In [3], this analysis is very simple because the sum of the delays of the paths forming a loop does not depend on the actual positions of the registers. For the more sophisticated timing models considered here, however, this assumption does not hold, so a more complex analysis is required.
For each selected loop we perform up to three tests. Test 1 determines a lower bound t lb for the cycle time that may be reached in the loop under consideration. If t lb > t target holds, we can abort the whole retiming process. Test 2 determines an upper bound t ub for the minimum cycle time that may be reached in this loop. If t ub < t target holds, there exists a timing feasible register placement for this loop and we can proceed with the next loop. If we find that t lb ≤ t target ≤ t ub holds, we perform test 3. This test tries to determine a register placement for the loop using a dedicated register placement technique. If we cannot determine a timing feasible register placement for the loop, we can abort the overall retiming process. If no loop violating the timing constraint is found, we proceed with the retiming process. Figure 5 shows the retiming algorithm FEAS_CTM with our acceleration technique integrated.
FEAS_CTM(ttarget) {
    for (i = 0; i < |E|; i = i + 1) {
        calculate_arrivaltimes();
        /* test whether ttarget is not reachable */
        determine set of critical loops C;
        for each (loop L in C)
            if (no feasible register placement for L found) return false;
        /* retime critical vertices */
        if (checkcycletime(ttarget) == true) return true;
        analyze_nets(ttarget);
        for each (vertex v)
            if (v is marked) r(v) = r(v) + 1;
    }
    return false;
}
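The order of the three tests per loop can be sketched in Python as follows. The function names and the representation of a loop as a tuple (t_lb, t_ub, placement callback) are assumptions chosen for illustration; the bounds and the placement test themselves are described in the following sections.

```python
def loop_is_feasible(t_lb, t_ub, t_target, try_placement):
    """Apply tests 1 to 3 to one loop; True means the loop can meet t_target."""
    if t_lb > t_target:      # test 1: even the lower bound exceeds the target
        return False
    if t_ub < t_target:      # test 2: even the worst placement meets the target
        return True
    return try_placement()   # test 3: only evaluated when the bounds are inconclusive


def target_unreachable(loops, t_target):
    """True if at least one loop proves that t_target cannot be reached."""
    return any(not loop_is_feasible(lb, ub, t_target, place)
               for lb, ub, place in loops)


# Hypothetical loops: (t_lb, t_ub, placement test used only if needed).
loops = [(5, 9, lambda: True), (4, 6, lambda: False)]
print(target_unreachable(loops, 8))    # False
print(target_unreachable(loops, 4.5))  # True
```

Passing the placement test as a callback keeps the expensive step lazy, so it runs only for loops whose bounds do not already decide the question.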
Building the loop set
As an addition to the FEAS_CTM circuit model, each graph vertex v is supplemented with an additional pointer crit_edge. During the arrival time calculation step this pointer is set to the incoming edge of v that is part of the longest path leading through v. Let E* denote the set of those edges e ∈ E that some crit_edge pointer points to. Then E* and the vertices v ∈ V form a subgraph G* = (V, E*) of G in which every vertex has exactly one predecessor. The loops in G* form the set of loops we will investigate further. Because there are no reconverging paths in G*, all loops in G* can be determined with a simple graph traversal in linear time.
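The traversal of G* can be sketched as follows. This is a minimal illustration under stated assumptions: the dictionary `crit_pred` (mapping each vertex to the source vertex of its crit_edge, or `None` for a vertex without incoming edge) and the function name `find_loops` are hypothetical, and every predecessor is assumed to appear as a key.

```python
def find_loops(crit_pred):
    """Return all loops of G*, each as a list of vertices in predecessor order."""
    UNSEEN, ON_PATH, DONE = 0, 1, 2
    state = {v: UNSEEN for v in crit_pred}
    loops = []
    for start in crit_pred:
        if state[start] != UNSEEN:
            continue
        path = []
        v = start
        # Follow crit_edge pointers backwards until something known is reached.
        while v is not None and state[v] == UNSEEN:
            state[v] = ON_PATH
            path.append(v)
            v = crit_pred[v]
        if v is not None and state[v] == ON_PATH:
            # The walk ran into its own path: everything from v onwards is a loop.
            loops.append(path[path.index(v):])
        for p in path:
            state[p] = DONE
    return loops


# Hypothetical G*: loop a <- c <- b <- a, a tail d, a chain e <- f, loop g <- h <- g.
crit_pred = {"a": "c", "b": "a", "c": "b", "d": "c",
             "e": None, "f": "e", "g": "h", "h": "g"}
loops = find_loops(crit_pred)
print(loops)  # [['a', 'c', 'b'], ['g', 'h']]
```

Each vertex is visited once and belongs to exactly one walk, so the running time is linear in the number of vertices.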
Determining a lower bound
In this section we describe how to determine the bound t lb for a loop L.
As already explained, the FEAS_CTM circuit model assigns tables of delay values to each edge in the retiming graph to reflect the fact that retiming an end vertex of a fanout tree influences the delay of paths leading through neighboring branches. Consequently, the sum of all path delays in L, denoted by t loop (L) in the following, also depends on the weights of edges that are not part of the loop. During the unreachability detection step, however, we consider each loop in isolation. To be able to determine bounds for the cycle time reachable by retiming, we analyze the delay tables and determine for each edge e ∈ G* the minimum values of t i , t o and t w . The minimum value found for t w is denoted by t wmin in the following. The sum of the minimum t i and the minimum t o is denoted by t iomin .
In other words, t wmin denotes the minimum delay that edge e contributes to t loop (L) for the case that e does not carry a register. Similarly, t iomin denotes the minimum delay that e contributes to t loop (L) for the case that e carries at least one register.
Further, we determine the total number of registers in L, denoted by r, by adding the weights of the edges forming L. Note that retiming does not change r for any loop. However, the number of edges in L with at least one register may change, because an edge can carry more than one register. The maximum number of edges carrying at least one register for any particular register placement in L is r* = min(|L|, r).
Consequently, any possible register arrangement in L will consist of n edges with w > 0 and (|L| -n) edges with w = 0, 1 ≤ n ≤ r*.
Selecting n t iomin values and (|L| - n) t wmin values, 1 ≤ n ≤ r*, in such a way that the resulting sum is minimized, and adding them to the sum of the delays of the vertices in L, results in t loop n (L). t loop n (L) is a lower bound for t loop (L) for the case that n edges in L carry at least one register.
Minimizing t loop n (L)/n over 1 ≤ n ≤ r* gives a lower bound t lb for the cycle time that may be reached by retiming if we consider L in isolation.
If we sort the t iomin and the t wmin values of the edges of L into two separate lists in ascending order, t lb can be determined very efficiently in |L| steps.
If t lb > t target holds for L we can abort the retiming process because it will not be possible to find a retimed circuit for t target .
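The lower bound calculation can be sketched in Python as shown below. This sketch follows one reading of the two-sorted-lists description, in which the n smallest t iomin values and the (|L| - n) smallest t wmin values are selected independently of each other; the function name, the argument layout and the example numbers are assumptions.

```python
from itertools import accumulate


def lower_bound(t_iomin, t_wmin, vertex_delay_sum, r):
    """Lower bound t_lb for the cycle time reachable in an isolated loop L.

    t_iomin[k] and t_wmin[k] belong to edge k of L, vertex_delay_sum is the
    sum of the vertex delays in L, and r is the number of registers in L.
    """
    size = len(t_iomin)
    r_star = min(size, r)
    if r_star < 1:
        raise ValueError("a loop needs at least one register")
    # Prefix sums over the sorted lists: prefix[k] = sum of the k smallest values.
    io_prefix = [0] + list(accumulate(sorted(t_iomin)))
    w_prefix = [0] + list(accumulate(sorted(t_wmin)))
    return min(
        (vertex_delay_sum + io_prefix[n] + w_prefix[size - n]) / n
        for n in range(1, r_star + 1)
    )


# Hypothetical loop with four edges and two registers.
t_lb = lower_bound([2, 3, 4, 5], [1, 1, 2, 2], vertex_delay_sum=10, r=2)
print(t_lb)  # 8.5
```

In the example, n = 1 gives (10 + 2 + 4)/1 = 16 and n = 2 gives (10 + 5 + 2)/2 = 8.5, so any target cycle time below 8.5 would be rejected for this loop.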
Determining an upper bound
Analogously to the lower bound determination, if we combine n t iomin values and (|L| - n) t wmin values with n = 1 in such a way that t loop n (L)/n is maximized, we obtain an upper bound t ub for the cycle time that can be reached if we consider L in isolation. In other words, the upper bound calculation considers the worst case situation, where all registers are located on a single edge. If t ub < t target holds, there exists a register placement in L, reachable by retiming, that satisfies the timing constraint. In this case we can skip the further investigation of L and proceed with the next loop.
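A literal reading of the upper bound calculation can be sketched as follows. As an assumption of this sketch, the largest t iomin value and the (|L| - 1) largest t wmin values are selected independently of each other; the function name and the example numbers are hypothetical.

```python
def upper_bound(t_iomin, t_wmin, vertex_delay_sum):
    """Upper bound t_ub for an isolated loop: all registers on a single edge (n = 1)."""
    size = len(t_iomin)
    # One edge carries the registers and contributes the largest t_iomin value.
    worst_io = max(t_iomin)
    # The remaining (size - 1) edges contribute the largest t_wmin values.
    worst_w = sum(sorted(t_wmin, reverse=True)[:size - 1])
    return vertex_delay_sum + worst_io + worst_w


# Same hypothetical four-edge loop as in a lower bound example: 10 + 5 + (2 + 2 + 1).
t_ub = upper_bound([2, 3, 4, 5], [1, 1, 2, 2], vertex_delay_sum=10)
print(t_ub)  # 20
```

With n = 1 the division by n has no effect, so the bound is simply the largest loop delay obtainable from the selected values.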
Retiming a loop
The analysis described above determines bounds for the cycle time that can be reached in L. These tests can be performed very efficiently and in many cases already decide whether a cycle time t target is reachable in L or not. In practice, this is especially the case when t target is not too close to t optimal . However, if we find that t lb ≤ t target ≤ t ub holds, we investigate L further. In this case we look for a register placement in L that satisfies the timing constraint in L. If no such register placement is found, we can abort the whole retiming process because there will be no retiming satisfying the timing constraint for the whole circuit.
For general graphs determining a register placement would require the use of a general retiming algorithm. However since we know that L represents a loop and we are considering L isolated we can use a dedicated technique.
A straightforward approach would be to place the first register on an arbitrarily chosen edge. For the remaining registers, positions would be determined by stepping forward through L and placing each register as far away as possible from its predecessor register. If the register placement determined in this way is not timing feasible, as shown in Figure 6 on the left side, this does not necessarily mean that no feasible solution exists. A feasible solution might be found by repeating the process with the initial register placed on a different edge, as shown in Figure 6 on the right side. The three numbers assigned to each edge denote the t imin , t omin and t wmin delays of the edge. However, we will see that we can do better.
The first register is positioned arbitrarily as explained above. As a preliminary step we determine for each edge the data arrival time at the input of an assumed register on this edge. These data arrival times are stored in an auxiliary data structure in sorted order. This can be done in |L|*ln(|L|) steps.
Using this data structure we do not have to step through the loop to determine the positions of the (r-1) subsequent registers as described above, but can determine each position in ln(|L|) steps. This leads to a time complexity of O(|L| * ln(|L|)) for analyzing L.
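The idea of locating each subsequent register by a logarithmic search can be sketched with prefix sums and a binary search. This is a strongly simplified illustration and not the FEAS_CTM placement routine: it assumes that only vertex delays matter (no delay tables, no t i / t o / t w values), that edge k enters vertex k of the loop, and all names are hypothetical.

```python
from bisect import bisect_right
from itertools import accumulate


def greedy_placement(vertex_delays, r, t_target, start=0):
    """Place r registers on a loop, the first one on edge `start`.

    Returns the list of edge indices carrying the registers, or None if this
    starting edge does not lead to a timing feasible placement.
    """
    m = len(vertex_delays)
    rotated = vertex_delays[start:] + vertex_delays[:start]
    # prefix[k] = delay of the first k vertices behind the first register.
    prefix = [0] + list(accumulate(rotated))
    offsets = [0]
    for _ in range(r - 1):
        a = offsets[-1]
        # Farthest offset b with prefix[b] - prefix[a] <= t_target, found in O(log m).
        b = bisect_right(prefix, prefix[a] + t_target) - 1
        offsets.append(b)
    # The segment from the last register back to the first one must also fit.
    if prefix[m] - prefix[offsets[-1]] > t_target:
        return None
    return [(start + b) % m for b in offsets]


def loop_placement(vertex_delays, r, t_target):
    """Try every starting edge until one yields a feasible placement."""
    for start in range(len(vertex_delays)):
        placement = greedy_placement(vertex_delays, r, t_target, start)
        if placement is not None:
            return placement
    return None


delays = [5, 2, 2, 5]
print(greedy_placement(delays, 2, 7, start=1))  # None
print(greedy_placement(delays, 2, 7, start=0))  # [0, 2]
```

The example mirrors the situation described for the straightforward approach: the placement started on edge 1 fails, while the one started on edge 0 succeeds. Trying every starting edge, as `loop_placement` does, is a design choice of this sketch and costs more than the O(|L| * ln(|L|)) analysis described in the text.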
It should be noted that our analysis may not always succeed in detecting an unreachable cycle time. The reason is that we consider each loop independently. Our analysis might find a timing feasible register placement for each loop. However, these register placements are not necessarily compatible with each other, so it remains possible that no timing feasible register placement for the whole circuit exists. Nevertheless, our experimental results show that, despite this deficiency, the proposed approach can still considerably speed up the retiming process.
The acceleration technique described here can also be used in combination with the timing model used in [11] - [13] . Since, in this model, we have no tables containing multiple delay values, the delay values of each edge itself are used as t wmin and t iomin .
Experimental Results
To evaluate the benefit of our acceleration technique we mapped the circuits from the ISCAS-89 benchmark set onto a 0.18 µm standard cell library. For each benchmark we performed numerous optimization runs using wire length data derived from different placements. For each trial that did not succeed (i.e., no feasible retiming was found) we recorded the following:
• the target cycle time t target being used
• the cycle time t optimal that finally has been reached using a binary search based optimization approach
• whether or not our approach detected that t target could not be reached; if this was the case we additionally recorded during which iteration of the inner loop of FEAS_CTM this happened.
In our experiments we observed that, in those cases where our technique detected that a particular cycle time was not reachable, only a very small number of iterations (for the benchmarks examined, in no case more than 5) was necessary. This enables a considerable reduction of the CPU runtimes in those cases. The maximum number of iterations of the inner loop of FEAS_CTM has been limited to |E| 0.5 as proposed in [14]. We already mentioned that our analysis may not always succeed in detecting an unreachable cycle time. However, this happened only in those cases where t target had been chosen very close to t optimal . In the majority of the test cases our approach succeeded. The results of our experiments are shown in Table 1. On average our approach has been able to detect all unreachable target cycle times that are up to 86% of the cycle time that finally has been reached.
From a theoretical point of view, it is important to mention that if our method identifies an unreachable target cycle time then this cycle time is indeed unreachable by retiming with FEAS_CTM. This is an improvement over the purely heuristic abortion criterion (aborting after |E| 0.5 iterations) of the original approach.
Conclusion
In this paper we have presented an acceleration technique for the FEAS_CTM retiming algorithm. This technique is based on detecting, during a very early phase of the retiming process, that a particular target cycle time is unreachable. Experiments showed that in those cases where the technique detected that a target cycle time is not reachable, the number of iterations necessary to detect this was dramatically reduced from the number of edges in the retiming graph to a number which in no case was larger than 5. On average this has been the case for all target cycle times up to 86% of the optimal cycle time.
