Introduction
The importance of optimizing the timing behavior of VLSI circuits probably needs no introduction to any reader of this paper, and a great deal of effort has been invested in research in this field. This paper considers the method of retiming [1], which proceeds by relocating flip-flops within a network to achieve faster clocking speeds. A novel approach to retiming that utilizes the solution of the clock skew optimization problem [2] forms the backbone of this work.
The introduction of clock skew at a flip-flop has an effect that is similar to the movement of the flip-flop across combinational logic module boundaries. This was observed but not proved in [2], which stated that clock skew and retiming are continuous and discrete optimizations with the same effect. Although the designer can choose between the two transformations, these methods can, in general, complement each other. The equivalence between retiming and skew has been observed and used in earlier work, for example, in [3-5]. The contribution of this work is that it exploits this equivalence and presents a method that finds an optimal retiming efficiently, with a clock period that is guaranteed to be at most one gate delay larger than the optimal clock period.
The method views the circuit hierarchically, first solving the clock skew problem at one level above the gate level, and then using local transformations at the gate level to perform retiming for the optimal clock period. The algorithm has been named ASTRA (A Skew-To-Retiming Algorithm).¹ The clock skew problem is first solved using efficient graph-theoretic techniques in polynomial time. The idea of using graph algorithms is to take advantage of the structure of the problem to arrive at an efficient solution. Like [2], our technique is illustrated on single-phase clocked circuits containing edge-triggered flip-flops. The advantage of using this graph algorithm is that it not only minimizes the clock period but, unlike a simplex-based linear programming approach, also ensures that the difference between the maximum and minimum skews is minimized at the optimal clock period.
The complexity of solving the retiming problem at the gate level using the technique in [1] is $O(|G|^2 \log |G|)$, where $|G|$ is the number of gates in the circuit; this could be phenomenally large, and a "verbatim" implementation, such as that in SIS [6], is incapable of handling large circuits.
However, a clever implementation, such as that proposed very recently in [7], can render the problem of finding a retiming of a circuit at the minimum clock period tractable. Another method proposed in [8] provides an efficient solution for the specific problem of circuits with unit-delay gates.
To place our method in perspective with that in [7], it must be pointed out that the two research efforts were carried out independently and in parallel. While [7] performs the important task of debunking the myth that the Leiserson-Saxe algorithm cannot be applied to retime large gate-level circuits, our approach shows an alternative view of retiming, by way of clock skew optimization. Apart from taking different approaches to solving the retiming problem, the two also differ in that the work presented here directly provides a solution to the problem of jointly finding an optimal retiming and optimal clock skews for a minimum period circuit. The work in [7] also performs minimum area retiming; however, we have currently left that as a topic for further research. The run-times of our approach and that in [7] are essentially the same for the minimum period retiming problem.
In this paper, the circuit is assumed to be composed of gates with constant delays. We assume the presence of an FF at each primary input and each primary output. The solution is divided into two phases. In Phase A, the clock skew optimization problem is solved with the objective of minimizing the clock period, while ensuring that the difference between the maximum and the minimum skew is minimized. This difference provides a measure of how many gates have to be traversed in the next phase, and therefore, it is important that it be a small number. In Phase B, the skew solution is translated to retiming and some flip-flops are relocated across gates in an attempt to set the values of all skews to be as close to zero as possible. The designer may choose to achieve the optimal clock period by using a combination of clock skew and retiming; alternatively, any skews that could not be set exactly to zero could now be forced to zero. This could cause the clock period to increase; however, it is shown that this increase will be no greater than one gate delay.

¹ "astra": Sanskrit, a sophisticated weapon.
The paper is organized as follows: the equivalence between retiming and clock skew is first shown in Sections 2 and 3. The algorithm for the clock skew optimization phase and a few related theoretical results are described in Section 4. Next, in Section 5, the process of transforming the solution of Phase A to a retiming solution is described, followed by a description of the properties of this solution in Section 6. Finally, we present experimental results in Section 7 and conclude the paper in Section 8.
Clock Skew Optimization and Retiming
In a sequential VLSI circuit, due to differences in interconnect delays on the clock distribution network, clock signals do not arrive at all of the flip-flops at the same time. Thus, there is a skew between the clock arrival times at different flip-flops. In a single-phase clocked circuit, in the case where there is no clock skew, the designer must ensure that each input-output path of a combinational circuit block has a delay that is less than the clock period. In the presence of skew, however, the relation grows more complex as one must compensate for this effect in designing the combinational circuit blocks.
One approach that has been followed by several researchers [9-13] is to design the clock distribution network so as to ensure zero clock skew. An alternative approach views clock skews as a manageable resource rather than a liability. It manipulates clock skews to advantage by intentionally introducing skews to improve the performance of the circuit. To illustrate how this may be done, consider the following example. In Figure 1, assume the delays of the inverters to be 1.0 unit each. The combinational circuit blocks $CC_1$ and $CC_2$ have delays of 3.0 and 1.0 units, respectively, and therefore, the fastest allowable clock has a period of 3.0 units. However, if a skew of +1.0 unit is applied to the clock line to flip-flop B, the circuit can run with a clock period of 2.0 units. This approach was formalized in the work by Fishburn [2]. The clock skew optimization problem was formulated as a linear program that may be solved to find the optimal clock period.
A second approach that is exploited here to improve the performance of the circuit is the procedure of retiming [1]. Retiming involves the relocation of flip-flops across logic gates to allow the circuit to be operated under a faster clock, without changing its functionality.
To illustrate how this may be done, consider the example in Figure 1, where it was seen that the clock period can be minimized to 2.0 units by introducing a skew of +1.0 unit at the flip-flop B. In an alternative approach, the period can still be minimized to 2.0 units by moving the flip-flop B to the left across the inverter I3. This results in the combinational circuit blocks $CC_1$ and $CC_2$ having delays of 2.0 units each, as seen in Figure 2. This approach is formalized in [1].
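As an illustrative check of this example, the following minimal sketch computes the smallest clock period for which every flip-flop-to-flip-flop path satisfies the long path requirement (launch skew plus path delay plus setup time no later than capture skew plus one period). It is evaluated for the circuit of Figure 1 with zero skews, with a skew of +1.0 at flip-flop B, and for the retimed circuit of Figure 2. The flip-flop names A and C, the function name `min_period`, and the choice of a zero setup time are assumptions made for illustration; only flip-flop B and the block delays come from the text.

```python
def min_period(paths, skews, t_setup=0.0):
    """Smallest P with skews[i] + d + t_setup <= skews[j] + P on every path."""
    return max(skews[i] + d + t_setup - skews[j] for i, j, d in paths)


# Figure 1 as described in the text: CC1 has delay 3.0 and CC2 has delay 1.0.
# A and C are hypothetical names for the flip-flops before CC1 and after CC2.
original = [("A", "B", 3.0), ("B", "C", 1.0)]
zero_skew = {"A": 0.0, "B": 0.0, "C": 0.0}
skew_at_b = {"A": 0.0, "B": 1.0, "C": 0.0}

# Figure 2: flip-flop B moved left across inverter I3, so both blocks take 2.0.
retimed = [("A", "B", 2.0), ("B", "C", 2.0)]

print(min_period(original, zero_skew))   # 3.0
print(min_period(original, skew_at_b))   # 2.0
print(min_period(retimed, zero_skew))    # 2.0
```

Under these assumptions, the skewed circuit and the retimed circuit reach the same period of 2.0 units.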
If one were to imagine the circuit as being drawn with its inputs to the left and outputs to the right, then the conversion of a negative (positive) skew to zero skew would involve the relocation of flip-flops to the right (left). In this paper, we will use the terms "right" and "left" to denote the direction of signal propagation and the direction opposite to that of signal propagation, respectively.
3 Equivalence between clock skew and retiming

A more formal presentation of the equivalence between clock skew and retiming is presented here.
Consider a flip-flop j in a circuit, as shown in Figure 3(a). For every combinational path from a flip-flop i to j with delay $d(i,j)$, the following constraints must hold to prevent zero-clocking and double-clocking, respectively.
$$x_i + d(i,j) + T_{setup} \le x_j + P$$
$$x_i + d(i,j) \ge x_j + T_{hold} \qquad (1)$$

where $x_i$ and $x_j$ are the skews at flip-flops i and j, respectively. Similar constraints can be written for every combinational path from flip-flop j to k with delay $d(j,k)$.

Proof:
We will now prove statement (a) with the help of Figure 3(b); the proof of (b) is analogous, and can be seen via Figure 3(c). Consider the case where flip-flop j is moved to the left across gate $G_l$ that has delay $d_1$. For a combinational path from a flip-flop i to j, if $x_j$ and $x'_j$ are, respectively, the skews at flip-flop j before and after the relocation operation, then the following relationships must be satisfied: □

Therefore, if one were to calculate the optimal clock skews, one could retime the circuit by moving flip-flops with positive (negative) skews to the left (right) until the skews at the flip-flops are nearly equal to zero. It must be noted that since gate delays take on discrete values, it is not possible to guarantee that the skew at a flip-flop can always be reduced to zero through retiming operations.
An alternative view of the same procedure is as follows. Retiming may be thought of as a sequence of movements of flip-flops across gates (Theorem 9.1 in [14]). We may start from the final retimed circuit, where all of the skews are zero, and the zero-clocking constraints are met, and perform the sequence of movements in reverse order. This procedure can be used to move all flip-flops back to their initial locations, using Theorem 1 to keep track of the changed clock skews at each flip-flop. Therefore, the optimal retiming is equivalent to applying skews at the inputs of flip-flops.
Note that the optimal clock period provided by the clock skew optimization procedure must, by definition, be no greater than the clock period for the set of clock skews thus obtained. Any differences arise due to the fact that clock skew optimization is a continuous optimization, while retiming is a discrete optimization. The following corollary follows:
Corollary 2: The clock period obtained by an optimal retiming can be achieved via clock skew optimization. The clock period provided by the clock skew optimization procedure is less than or equal to that provided by the method of retiming.
Phase A: Optimizing clock skews
Given a combinational circuit segment that lies between two flip-flops i and j, as shown in Figure 4, if $x_i$ and $x_j$ are the skews at the two flip-flops, then the following inequalities must be satisfied:

$$x_i + \underline{d}(i,j) \ge x_j + T_{hold}$$
$$x_i + \overline{d}(i,j) + T_{setup} \le x_j + P \qquad (2)$$

where $\underline{d}(i,j)$ ($\overline{d}(i,j)$) is the minimum (maximum) delay of any combinational path between flip-flops i and j. In [2], the clock skew problem for minimizing the clock period is solved by solving the following linear program:

minimize $P$
subject to
$$x_i - x_j \ge T_{hold} - \underline{d}(i,j)$$
$$x_j - x_i + P \ge T_{setup} + \overline{d}(i,j) \qquad (3)$$

for every pair $(i,j)$ of flip-flops such that there is at least one combinational path from flip-flop i to flip-flop j.
In this work, we will consider single-sided constraints only, and will ignore the short path constraints. We thus obtain the clock skews that correspond to the minimum clock period. Our strategy is to transform the skew solution to a retiming solution to achieve the minimum clock period. The rationale behind this approach is that, as will be shown in the next section, the clock period obtained thus is no larger than that obtained with the inclusion of double-sided constraints.
This minimum clock period can be preserved while reconciling short path and logic signal separation constraint violations [15] by using an algorithm for minimum padding, such as the one in [16].
Solution to the clock period optimization problem

4.1.1 Formation of the constraint graph
The linear program (3) without the short path constraints is rewritten as

minimize $P$
subject to
$$x_j - x_i + P \ge T_{setup} + \overline{d}(i,j) \qquad (4)$$
Notice that for a constant value of P, the constraint matrix reduces to a system of difference constraints which can be represented by a constraint graph [17]. A feasible solution to the linear program exists if the corresponding constraint graph $G_1(P)$ contains no positive cycles, and the minimum clock period corresponds to the smallest value of P at which no positive cycle exists.
The skews at all primary inputs and primary outputs are assumed to be zero; this is represented by a host node in the constraint graph, similar in principle to the notion in [1].
Observe that the constraint set of the linear program (4) is a subset of the constraint set of the linear program (3). Therefore, the optimal period for the LP above must be less than or equal to that for the LP that handles double-sided constraints.
If $\overline{d}(i,j)$ is finite, then a directed edge between $x_i$ and $x_j$ is constructed in $G_1(P)$ in accordance with the long path delay constraint.
Calculating the worst-case flip-flop-to-flip-flop delays
For any input i, the procedure [2] for computing $d(i,j)$ for all j involves setting the arrival time at input i to zero, and that at all other inputs to $-\infty$. The resulting signal arrival time at each output j, found using PERT [18], is the value of $d(i,j)$. However, if this procedure were to be performed directly, it would lead to large computation times.
It was observed during the symbolic propagation of constraints in [19] that in most cases, a flip-flop at the input to a combinational block exercises only a small fraction of all of the paths between the inputs of the combinational block and the outputs. Based on this observation, we develop an efficient procedure for calculating the values of $d(i,j)$. It was found that the use of this procedure gave run-time improvements of several orders of magnitude over the direct multiple applications of PERT described in the previous paragraph.
Since we will be dealing with combinational blocks only here, in this subsection only, we will refer to the flip-flops at the inputs (outputs) of a combinational block as primary inputs (outputs).
The level number, $level(k)$, of each gate k in the circuit is first computed by a single PERT run; the level number is defined as the largest number of gates from a primary input to the gate, inclusive of the gate. In other words, the level number of a gate is found by a topological ordering algorithm. To find $d(i,o)$, the largest delay from primary input i to all primary outputs o, we conduct an event-driven PERT-like exercise starting at flip-flop i, as described below. At each step, an element from the lowest unprocessed level is plucked from its level queue, and the worst-case delay from flip-flop i to its output is computed. All of its fanouts are then placed on their corresponding level queues, unless they have already been placed on these queues.
Note that by construction, no gate is processed until the delays to all of its inputs that are affected by flip-flop i have been computed, since such inputs must necessarily have a lower level number.
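The level-queue procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this work: the netlist representation (a dictionary mapping each gate to its delay and fanin names), the function name `delays_from_input`, and the recursive computation of level numbers are all hypothetical choices.

```python
from collections import defaultdict


def delays_from_input(gates, source):
    """gates: {gate: (delay, [fanin names])}; fanins are gates or primary inputs.

    Returns {gate: worst-case delay from `source` to that gate's output},
    visiting only the gates in the fanout cone of `source`."""
    level = {}

    def get_level(g):
        # Level = largest number of gates from any primary input, inclusive.
        if g not in gates:
            return 0                      # primary inputs sit at level 0
        if g not in level:
            level[g] = 1 + max((get_level(f) for f in gates[g][1]), default=0)
        return level[g]

    fanouts = defaultdict(list)
    for g, (_, fanins) in gates.items():
        get_level(g)
        for f in fanins:
            fanouts[f].append(g)

    queues = defaultdict(list)            # one queue per level
    queued = set()

    def enqueue(g):
        if g not in queued:               # each gate is queued at most once
            queued.add(g)
            queues[level[g]].append(g)

    for g in fanouts[source]:
        enqueue(g)

    arrival = {source: 0.0}
    for lvl in range(1, max(level.values(), default=0) + 1):
        for g in queues[lvl]:
            delay, fanins = gates[g]
            # Only fanins reached from `source` contribute; all of them have
            # a lower level and have therefore already been processed.
            arrival[g] = delay + max(arrival[f] for f in fanins if f in arrival)
            for h in fanouts[g]:
                enqueue(h)
    del arrival[source]
    return arrival


# Hypothetical block: inputs a and b, gates g1..g4.
gates = {
    "g1": (1.0, ["a"]),
    "g2": (2.0, ["b"]),
    "g3": (1.0, ["g1", "g2"]),
    "g4": (1.0, ["g2"]),
}
print(delays_from_input(gates, "a"))
print(delays_from_input(gates, "b"))
```

In this small example, the run from input a never touches g2 or g4, which is the source of the savings over a full PERT pass per input.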
Theoretical results on the optimal clock period
The following theoretical results are proved for the optimal clock period:
Lemma 3: If $P = P_1$ does not permit a feasible solution to the linear program (4), then nor does $P = P_2 < P_1$.
Proof: If $P = P_1$ is such that it does not permit a feasible solution to the linear program, then the constraint graph $G_1(P_1)$ has a positive cycle, C.

where $q = (i,j)$ is an edge in $G_1(P)$, and $d_{loop}$ is defined as the largest weight of any edge in $G_1(P)$ from any node (including the host node) to itself, if such a loop exists, and 0 otherwise.
Proof: The weight of any edge that starts and ends at the same node, $d(i,i) + T_{setup}$, is clearly a lower bound on the clock period. If no such edge exists, then the lower bound is calculated as follows.
Let C be the critical cycle at the optimal clock period, $P_{opt}$, i.e., a cycle with weight zero. Clearly one such cycle must exist, or else $P_{opt}$ could be reduced further. Let k be the number of edges in C. The nonnegativity of the right hand side is trivial. □
Theorem 6: A finite solution to the clock period minimization problem exists.
Proof: From Lemma 5, the solution P to the clock period minimization problem is bounded by $0 \le P_{low} \le P \le P_{high} < \infty$. □
The clock skew optimization procedure
The skeletal pseudocode describing the algorithm for finding the optimal clock period proceeds as follows, using a binary search on the value of the clock period.
    Construct the constraint graph;
    P_max = P_high;    ... Lemma 5(a)
    P_min = P_low;     ... Lemma 5(b)
    while (P_max - P_min > ε) {
        P = (P_max + P_min)/2;
        if (G_1(P) has a positive cycle)
            P_min = P;
        else
            P_max = P;
    }
In the above algorithm, the presence of a positive cycle in $G_1(P)$ may be tested using the Bellman-Ford algorithm [17]. If the skews are initialized to 0, the Bellman-Ford solution achieves the objective of minimizing $|x_{i,\max} - x_{i,\min}|$ [17]. On a graph with V vertices and E edges, the computational complexity of this algorithm is $O(VE)$. The number of iterations of the binary search is $\log_2[(P_{high} - P_{low})/\epsilon]$.
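The binary search and the positive-cycle test can be sketched together as below. This is a minimal illustration, not the implementation evaluated in this paper. The function names, the tolerance `tol`, and the search bounds are assumptions: in particular, 0 and the largest path delay plus the setup time are used as simple stand-ins for the bounds $P_{low}$ and $P_{high}$ of Lemma 5, and the host node is omitted for brevity.

```python
def feasible_skews(num_ffs, paths, period, t_setup=0.0, tol=1e-12):
    """Longest-path Bellman-Ford on the constraint graph for a fixed period.

    Each path (i, j, d) contributes the difference constraint
    x_j - x_i >= d + t_setup - period. Returns a list of skews, or None
    if the graph has a positive cycle (no feasible skews exist)."""
    x = [0.0] * num_ffs                       # all skews initialized to zero
    edges = [(i, j, d + t_setup - period) for i, j, d in paths]
    for _ in range(num_ffs):
        changed = False
        for i, j, w in edges:
            if x[i] + w > x[j] + tol:         # relax a violated constraint
                x[j] = x[i] + w
                changed = True
        if not changed:
            return x
    return None                               # still relaxing: positive cycle


def minimum_period(num_ffs, paths, t_setup=0.0, eps=1e-6):
    """Binary search for the smallest feasible period, to within eps."""
    p_min = 0.0
    # Assumed upper bound: with zero skews, the largest path delay is feasible.
    p_max = max(d for _, _, d in paths) + t_setup
    while p_max - p_min > eps:
        p = (p_max + p_min) / 2
        if feasible_skews(num_ffs, paths, p, t_setup) is None:
            p_min = p
        else:
            p_max = p
    return p_max, feasible_skews(num_ffs, paths, p_max, t_setup)


# Hypothetical ring of three flip-flops with path delays 3, 1 and 2.
ring = [(0, 1, 3.0), (1, 2, 1.0), (2, 0, 2.0)]
period, skews = minimum_period(3, ring)
print(round(period, 3), [round(s, 3) for s in skews])
```

For this ring the total delay around the cycle is 6 over three edges, so the search converges to a period close to 2, the average delay per edge on the critical cycle.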
The time required to form the constraint graph may be as large as $|f| \cdot |G|$, where $|f|$ is the maximum number of inputs to any combinational stage, though in practice, it is seen that this upper bound is seldom achieved. Therefore, the iterative procedure above, when carried to convergence, provides the solution to the linear program (4) with a worst-case time complexity of

$$O\left(F \cdot E \cdot \log_2\left[\frac{P_{high} - P_{low}}{\epsilon}\right] + |f| \cdot |G|\right) \qquad (11)$$

where F is the number of flip-flops in the circuit, E is the number of pairs of flip-flops connected by a combinational path, $P_{high}$ and $P_{low}$ are as defined in Lemma 5, and $\epsilon$, defined in the pseudocode above, corresponds to the degree of accuracy required.
We point out that for real circuits, $E = O(F)$.
We caution the reader that the complexity shown above is not a genuine indication of the complexity if the implementation is cleverly carried out, using back-pointers during the Bellman-Ford process [20] and the procedure in Section 4.2 for $d(i,j)$ calculations.
In the solution found above, all skews must necessarily be nonnegative, since the weight of each node in the Bellman-Ford algorithm was initialized to zero. Also, in general, the skew at the host node corresponding to primary inputs and outputs could be nonzero. Our objective is to ensure a zero skew at the primary input and output nodes since we do not have the flexibility of retiming these, and hence we modify this solution. Note that if $x = (x_1, \ldots, x_n)$ is a solution to a system of difference constraints in x, then so is $x' = (x_1 + k, x_2 + k, \ldots, x_n + k)$. Therefore, by selecting k to be the negative of the skew at the host node, a solution $x'$ with a zero skew at the host is found.
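The shift described above is a one-line operation; the sketch below illustrates it on made-up numbers. The dictionary layout, the node name "host", and the skew values are assumptions chosen for illustration only.

```python
def zero_host_skew(skews, host="host"):
    """Shift every skew by k = -skews[host]; differences x_j - x_i are unchanged."""
    k = -skews[host]
    return {ff: x + k for ff, x in skews.items()}


raw = {"host": 0.5, "A": 1.5, "B": 0.0}   # hypothetical Phase A output
shifted = zero_host_skew(raw)
print(shifted)  # {'host': 0.0, 'A': 1.0, 'B': -0.5}
```

Since every constraint involves only a difference of two skews, the shifted solution satisfies exactly the same constraints; note that it may now contain negative skews, which is why Phase B treats both signs.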
5 Phase B: Skew minimization by retiming
Introduction
In Phase B, the magnitudes of the clock skews obtained from Phase A are reduced toward zero by applying retiming transformations. This employs relocation of the flip-flops across logic gates while maintaining the optimal clock period previously found. After the skew magnitudes have been reduced by as much as possible, the retimed circuit may be implemented by applying the requisite skews at a flip-flop to get the minimum achievable clock period or by setting all skews to zero to get a clock period that is, as will be shown in Section 6, no more than one gate delay above the optimum.
Since any flip-flop to be moved must have a nonzero skew, we divide the relocations into one of the two following categories:
- Flip-flops with negative skews
- Flip-flops with positive skews

Before we proceed, we will state the following result, which is something of a generalization of Theorem 1. The theorem pertains to Figure 5, where the flip-flops at the input are retimed across a combinational block to its outputs.
Theorem 7:
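(a) Retiming transformations may be used to move flip-flops from all of the inputs of any combinational block to all of its outputs. The equivalent skew of the relocated flip-flop at output j, considering long path constraints only, is given by

$$x_j = \max_{1 \le i \le n} \left( x_i + d(i,j) \right) \qquad (12)$$

where the $x_i$, $1 \le i \le n$, are the skews at the input flip-flops, $x_j$ is the equivalent skew at output j, and $d(i,j)$ is the worst-case delay of any path from i to j.

(b) Similarly, retiming transformations may be used to move flip-flops from all of the outputs of any combinational block to all of its inputs. The equivalent skew of the relocated flip-flop at input k, considering long path constraints only, is given by

$$x_k = \min_{1 \le j \le m} \left( x_j - d(k,j) \right) \qquad (13)$$

where the $x_j$, $1 \le j \le m$, are the skews at the output flip-flops, $x_k$ is the equivalent skew at input k, and $d(k,j)$ is the worst-case delay of any path from k to j.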
The proof is omitted and proceeds along the lines of that for Theorem 1. We point out here that in general, it is not possible to come up with an equivalent skew value that satisfies both long and short path constraints. For example, when we consider short path constraints, moving flip-flops from the input to the output requires that the new skew be $\min_{1 \le i \le n} (x_i + d(i,j))$, using the same notation as Theorem 7(a), which is incompatible with the requirement stated above except in the special case where all paths from i to j have the same delay. However, this is not a serious problem for us since we are only considering the long path constraints here, as stated in Section 4.
While the preceding paragraph may superficially seem to contradict Theorem 1, which is valid for both long path and short path constraints, we reassure the reader that this is not so. Theorem 1 can be considered to be a special case where the combinational subcircuit consists of only one gate, where clearly the longest and shortest path delays of the combinational subcircuit are equal.
Case 1: Negative skew reduction
Consider the case of a flip-flop j shown in Figure 6(a) that has a negative skew at the conclusion of Phase A.² If we consider the gate p to which j fans out, we may find its transitive fanins and identify the flip-flops at the input of the combinational subcircuit to which p belongs. Through retiming operations, it is possible to transform the circuit in Figure 6(a) to the one in Figure 6(b); the equivalent skews at each flip-flop in Figure 6(b) are calculated. At this point, it need only be noted that the equivalent skews for these flip-flops are found without physically moving them to the gate inputs.
To see why the transformation from Figure 6(a) to Figure 6(b) is always possible, note that the combinational block shown in the figure can be replaced by a set of "input-output" delay constraints. We may then apply the result of Theorem 7(a) to obtain the equivalent skews; the complete procedure will be described later.
There now exists a set of n "virtual" flip-flops³ at the input to gate p, as shown in Figure 6(b). Let $x_1, x_2, \ldots, x_n$ be the skews of these flip-flops. They must satisfy the constraints:

$$x_k + \mathit{delay} \le x_i + P \quad \forall\ 1 \le k \le n \qquad (14)$$

where P is the clock period, $x_i$ the skew at an output flip-flop, $FF_i$ (not shown), of the combinational block to which $FF_1 \ldots FF_n$ are input flip-flops, and delay is the largest combinational delay from the input of gate p to $FF_i$.

² We assume here that each flip-flop fans out to exactly one gate. If the fanout of a flip-flop is larger than one, then it is replicated at each fanout branch. The replicated flip-flops have exactly one fanout gate, and each such flip-flop is considered in turn.

³ We refer to these flip-flops as "virtual" flip-flops because we do not physically move them to the input of gate p at this point.
Obviously, from the above constraints:

$$\max_{1 \le k \le n} (x_k) + \mathit{delay} \le x_i + P \qquad (15)$$

Now, there can exist two scenarios:
(1) All of the n flip-flops at the inputs have negative skews. In this case, the maximum skew over all of these flip-flops is negative and hence the set of flip-flops may be moved across the gate p, as shown in Figure 6(c). If the sign of the skew were to change after the relocation, the relocation would not be carried out unless it reduced the magnitude of the skew, and if not, it would be left in its current location with a skew of $\max_{1 \le i \le n} x_i$. For example, a flip-flop with a skew of -0.75, to be moved across a gate with unit delay, would have a new skew value of 0.25; such a relocation would be desirable since it reduces the magnitude of the clock skew (this idea is better understood in the context of Lemma 8 and Theorem 9, to be stated and proved later in this paper). However, it would be undesirable to move a flip-flop with skew -0.25 across a unit delay gate. Therefore, in either case, this effects a reduction in the magnitudes of the skew values, as is desirable.
(2) One or more of the "virtual" flip-flops has a positive effective skew. In this case, the skew at the flip-flop j under consideration may be set to zero without violating any timing constraints, since the maximum skew at the input to gate p would be unchanged by this operation. One example where this may occur is when one of the inputs to the combinational block is a primary input of the sequential circuit; the equivalent skew of the corresponding virtual FF at the gate input would be positive. Note that due to this, it is never necessary for a primary input to be relocated, which is in conformance with our assumption that these are immovable.
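The two scenarios can be summarized in a small decision function. This is a hedged sketch of the rule as described in the text, not the authors' code: the function name `case1_decision` and its return convention are hypothetical, and the magnitude comparison is made against the largest virtual skew, as in scenario (1).

```python
def case1_decision(virtual_skews, gate_delay):
    """Decide whether the virtual flip-flops at the inputs of gate p move across it.

    Returns (move, skew): `move` says whether the relocation is carried out,
    and `skew` is the resulting skew (after the move, or in place)."""
    worst = max(virtual_skews)
    if worst > 0:
        # Scenario (2): some virtual skew is positive, so the skew at the
        # flip-flop under consideration may simply be set to zero.
        return False, 0.0
    moved = worst + gate_delay        # crossing the gate to the right adds its delay
    if abs(moved) < abs(worst):
        return True, moved            # magnitude reduced, possibly with a sign change
    return False, worst               # moving would not help; leave in place


print(case1_decision([-0.75], 1.0))        # (True, 0.25)
print(case1_decision([-0.25], 1.0))        # (False, -0.25)
print(case1_decision([-0.75, 0.5], 1.0))   # (False, 0.0)
```

The first two calls reproduce the -0.75 and -0.25 examples from scenario (1); the third corresponds to scenario (2).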
Case 2: Positive skew reduction
In the case of a flip-flop j that has a positive skew at the end of Phase A, as shown in Figure 7(a),⁴ the procedure parallels that described above. Through retiming operations, it is possible to transform the circuit in Figure 7(a) to the one in Figure 7(b); the equivalent skews at each flip-flop in Figure 7(b) are calculated using Theorem 7(b); the precise procedure will be described later.
Therefore, at the output of the gate p, there now exists a set of n "virtual" flip-flops as shown in Figure 7(b), with effective skews $x_1, x_2, \ldots, x_n$. The skews at these flip-flops need to satisfy the constraints:

$$x_i + \mathit{delay} \le x_k + P \quad \forall\ 1 \le k \le n \qquad (16)$$

where $x_i$ is the skew at an input flip-flop, $FF_i$ (not shown), of the combinational block to which $FF_1 \ldots FF_n$ are output flip-flops, and delay is the largest combinational delay from $FF_i$ to the output of gate p.

Figure 7(c). If the sign of the skew were to change after the relocation, the relocation would not be carried out unless it reduced the magnitude of the skew. Therefore, in either case, this effects a reduction in the positive skew values, as is desirable.
One may also take advantage of slacks in the combinational paths to reduce the skews at flip-flops. If input r to gate p has a slack, $slack(r)$ (i.e., the worst-case delay at input r could have been increased by $slack(r)$ without changing the worst-case delay to the output of gate p), then the skew may be further reduced by $slack(r)$. If $slack(r) \ge \min(x_i) - d$, the skew is set to zero; if not, it is set to $\min(x_i) - d - slack(r)$.
(2) If one or more of the "virtual" flip-flops has a negative skew value, then the skew at the flip-flop j under consideration is set to zero. This violates no timing constraints, since it leaves the minimum skew at an output of gate p unchanged.
Minimization procedure for case 1
The steps involved in minimizing the skew magnitudes for flip-flops with negative skews are outlined below:
Step 1: All flip-flops in the circuit with negative skews are placed on a queue, Q. For each flip-flop, we consider one fanout at a time; in other words, the situation shown in Figure 8(a) is transformed to the form in Figure 8(b).
Step 2: Let j be the flip-flop that is currently at the head of the queue, and p the gate that it fans out to. We will assume for now that p is a gate; the case where it is the input of a flip-flop will be dealt with separately. The equivalent skew at every other fanin node of p is found.
We do not physically move flip-flops to the inputs of gate p at this time, and hence we imagine that the inputs to gate p are a set of "virtual" flip-flops.
The equivalent skews are found as follows. Consider a flip-flop j with negative skew, as shown in Figure 9. The gate p to which it fans out is added to the tail of a queue R.⁵ A reverse PERT is employed to backtrace along the fanin cone of gate p. When a gate is encountered, it is added to the queue. In Figure 9, gate z is first added to R, and in the next step, gate y is added. The backtrace continues until a flip-flop is encountered. In the example in Figure 9, the backtrace terminates when flip-flop x is encountered. During this process, we keep track of the worst-case delay, d, to gate p. As a consequence of Theorem 3, if the skew calculated in Phase A at flip-flop x is t units, then its equivalent skew at the input to gate p is t + d units.

⁵ Note that the queue R is distinct from the queue Q.
Step 3: If any equivalent skew at a "virtual" flip-flop is positive, then the skew at j is set to zero and it is not relocated; if not, the skew after retiming is found using the criteria described earlier. If the magnitude of this skew is smaller than the current skew at j, then j and all of the "virtual" flip-flops at the input to p are retimed across p. Notice that if the skew changes sign after retiming, then the magnitude of the retimed skew could become larger. Only those sign-changing moves that reduce the skew magnitude are permitted. Note that the motion of the "virtual" flip-flops to their new location may entail replicating these flip-flops, as shown in Figure 8. For example, if flip-flop j were to be moved across gate p, a new flip-flop would have to be created at $y_2$ with an equivalent skew corresponding to the skew of flip-flop x, retimed to position $y_2$. The new skews are found as explained in the previous section. Any such flip-flops that have a negative skew (as will happen most of the time, unless relocation changed the sign of the skew) are now placed at the tail of the queue, Q, and are processed later.
Step 4: If the retimed flip-flop has a negative skew, it is placed at the tail of Q.
Step 5: If Q is not empty, go to Step 1; if not, the magnitudes of all negative skew values have been minimized.
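The equivalent-skew computation of Step 2 can be illustrated with a short recursive backtrace. This is a sketch under stated assumptions rather than the reverse PERT implementation itself: the netlist layout, the function names, the unit gate delays, and the skew values are hypothetical, and where several flip-flops feed the same gate the largest equivalent skew is taken, in line with the long path view used in this section.

```python
def virtual_skew(node, gates, ff_skew):
    """Equivalent skew of a virtual flip-flop placed at the output of `node`.

    Backtraces through the fanin cone until flip-flops are reached; a
    flip-flop with skew t at worst-case delay d upstream contributes t + d."""
    if node in ff_skew:
        return ff_skew[node]
    delay, fanins = gates[node]
    return delay + max(virtual_skew(f, gates, ff_skew) for f in fanins)


def virtual_skews_at_inputs(p, gates, ff_skew):
    """Equivalent skews of the virtual flip-flops at each input of gate p."""
    return {f: virtual_skew(f, gates, ff_skew) for f in gates[p][1]}


# Loosely modelled on the description of Figure 9: x -> y -> z -> p, and j -> p.
gates = {"y": (1.0, ["x"]), "z": (1.0, ["y"]), "p": (1.0, ["z", "j"])}
ff_skew = {"x": -3.0, "j": -0.5}          # hypothetical Phase A skews
print(virtual_skews_at_inputs("p", gates, ff_skew))
```

Here flip-flop x lies two unit-delay gates upstream of p, so its equivalent skew at the input of p is -3.0 + 2.0 = -1.0, while j sits directly at the other input with its own skew of -0.5.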
In Step 2 above, if the flip-flop j fans out to another flip-flop, which we shall call k, then since there is no combinational delay between the flip-flops, and retiming preserves the zero-clocking constraints, it must be true that

$$x_j + T_{setup} \le x_k + P$$

i.e.,

$$x_j \le x_k + P - T_{setup}$$

If the right hand side is positive, then $x_j$ can be set to zero without violating any constraints. If not, then $x_k \le T_{setup} - P < 0$, which implies that k is a flip-flop that will eventually move to the right, thereby allowing flip-flop j the leeway to move as well. Therefore, if this is the case, the skew of flip-flop j is set to

$$x_j = x_k + P - T_{setup} \le 0 \qquad (18)$$

and the flip-flop j is reprocessed after flip-flop k has been processed (i.e., its skew is set to the value calculated above, and it is placed at the tail of Q). It will be shown in Lemma 8 that all such flip-flops will eventually be processed, and their skews set to nearly zero. It is interesting to note that the latter case was almost never seen to happen in the examples presented in this paper.
Note that in spite of Theorem 7, it is still necessary to go through the reverse PERT procedure to ensure that flip-flops are created at fanout points like $y_2$.
Minimization procedure for case 2
The steps involved in minimizing the skews at flip-flops with positive skews are analogous to those described in Section 5.2, and are outlined below:
Step 1: All flip-flops in the circuit with positive skews are placed on a queue, Q. Note that each flip-flop has precisely one fanin.
Step 2: Let j be the flip-flop that is currently at the head of the queue, and p the gate that fans into it. As in case 1, we will postpone the discussion of the case where j fans out to a flip-flop.
The equivalent skew of the "virtual" flip-flops at every other fanout node of p is found.
The procedure for finding the equivalent skews is as follows. Consider a flip-flop j with positive skew, as shown in Figure 9. The gate p that fans into it is added to the tail of a queue R.⁶ Analogously to the procedure for Case 1, a forward PERT is employed to trace the fanout cone of gate p. When a gate is encountered, it is added to the queue. The trace continues until a flip-flop is encountered. During the process, we keep track of the worst-case delay, d, from gate p. As a consequence of Theorem 1, if the optimal skew at a flip-flop is t units, then its equivalent skew at the output of gate p is t - d units.
Step 3 If any equivalent s k ew at a virtual" ip-op is negative, then the skew at j is set to zero and it is not relocated; if not, the skew after retiming is found using the criteria described earlier. If the magnitude of this skew is smaller than the current s k ew at j, then j and all of the virtual" ip-ops at the input to p are retimed across p. The new skews are found as explained in the previous section. Any such ip-ops that have a positive s k ew as will happen most of the time, unless relocation changed the sign of the skew are now placed at the tail of the queue, Q, and are processed later.
Step 4 If the retimed ip-op has a positive s k ew, it is placed at the tail of Q.
Step 5 If Q is not empty, go to Step 1.
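As an illustration of the forward PERT trace used in Step 2, the following Python sketch computes the equivalent skews at the output of gate p. It is only a sketch under stated assumptions, not the ASTRA implementation (which is a C program): the function name `equivalent_skews`, the dictionaries `fanout`, `delay` and `skew`, and the convention that d sums the delays of the gates lying between the output of p and the flip-flop are hypothetical choices made for this example.

```python
from collections import deque


def equivalent_skews(p, fanout, delay, skew):
    """Trace the fanout cone of gate p and return {flip-flop: t - d}.

    fanout: node -> list of successor nodes (hypothetical netlist format)
    delay:  gate -> gate delay
    skew:   flip-flop -> skew t; any node present in `skew` is a flip-flop
    Assumes the combinational logic is acyclic.
    """
    worst = {p: 0.0}          # worst-case delay d measured from the output of p
    queue = deque([p])        # plays the role of the queue R in the text
    while queue:
        u = queue.popleft()
        for v in fanout.get(u, ()):
            is_ff = v in skew
            cand = worst[u] if is_ff else worst[u] + delay[v]
            if cand > worst.get(v, float("-inf")):
                worst[v] = cand
                if not is_ff:  # the trace stops when a flip-flop is reached
                    queue.append(v)
    return {ff: skew[ff] - worst[ff] for ff in worst if ff in skew}


# Small invented example: p drives flip-flop j directly and flip-flop a
# through the gates g2 and g1.
fanout = {"p": ["g1", "j", "g2"], "g2": ["g1"], "g1": ["a"]}
delay = {"g1": 1.0, "g2": 2.0}
skew = {"j": 2.0, "a": 3.0}
result = equivalent_skews("p", fanout, delay, skew)
```

In this invented example the longest path from p to flip-flop a has delay 3, so the equivalent skew of a at the output of p is 3 - 3 = 0, while j keeps its skew of 2.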
In Step 2 above, if the flip-flop j fans into another flip-flop, which we shall call k, then since there is no combinational delay between the flip-flops, and retiming preserves the zero-clocking constraints, it must be true that $x_k + T_{setup} \le x_j + P$, i.e.,
$$x_j \ge x_k - P + T_{setup}$$
If the right hand side is negative, then $x_j$ can be set to zero without violating any constraints. If not, then we have $x_k \ge P - T_{setup} \ge 0$, which implies that k is a flip-flop that will eventually move to the left, thereby allowing flip-flop j the leeway to move as well. Therefore, if this is the case, the skew of flip-flop j is set to
$$x_j = x_k - P + T_{setup} \ge 0 \qquad (19)$$
and flip-flop j is reprocessed after flip-flop k has been processed. It will be shown in Lemma 8 that all such flip-flops will eventually be processed. As in the analogous situation in case 1 described in the previous section, it was observed that the latter was almost never seen to happen in the ISCAS89 examples.
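The two rules for directly connected flip-flops, Equation (18) for case 1 and Equation (19) for case 2, can be summarized in a short Python sketch. The function names are hypothetical and the numerical values are invented for illustration; the relation signs are the ones implied by the surrounding discussion.

```python
def chained_skew_case1(x_k: float, period: float, t_setup: float) -> float:
    """Case 1 (negative skews): bound x_j <= x_k + P - T_setup, Equation (18)."""
    rhs = x_k + period - t_setup
    # A positive bound lets x_j be set to zero; otherwise x_j takes the bound.
    return 0.0 if rhs > 0 else rhs


def chained_skew_case2(x_k: float, period: float, t_setup: float) -> float:
    """Case 2 (positive skews): bound x_j >= x_k - P + T_setup, Equation (19)."""
    rhs = x_k - period + t_setup
    # A negative bound lets x_j be set to zero; otherwise x_j takes the bound.
    return 0.0 if rhs < 0 else rhs


# Invented values: P = 3, T_setup = 1.
print(chained_skew_case1(-5.0, 3.0, 1.0))  # -3.0, smaller in magnitude than x_k
print(chained_skew_case2(5.0, 3.0, 1.0))   # 3.0, smaller in magnitude than x_k
```

In both cases the new skew of j is either zero or smaller in magnitude than the skew of k by $P - T_{setup}$, which is the property used in the proof of Lemma 8.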
6 Properties of the Retiming Procedure
Lemma 8 At the end of the retiming procedure in Phase B, the skew at each flip-flop is no more than half a gate delay.
Proof:
Assume, for purposes of contradiction, that the minimum skew after Phase B at a flip-flop is larger than half a gate delay. Assume also that each flip-flop has a single fanout; in case of multiple fanouts, the flip-flop can be replicated as shown in Figure 8. Then one of two possibilities exists:
(a) If the flip-flop fans out (fans in) to a gate G when the skew is negative (positive), then using the procedure described above, the skew magnitude can be reduced, either to zero, keeping the flip-flop in its current location, or by moving the flip-flop across gate G. This contradicts the fact that Phase B is complete.
(b) If the flip-flop, i, fans out to another flip-flop, j (a primary output is also considered to be a flip-flop), then the skew of flip-flop i is either set to zero (if possible) or to a magnitude that is smaller than that of $x_j$. This can be seen from Equations (18), (19) and from Lemma 5(b), which affirms that $P_{opt} \ge T_{setup}$.
Now, consider the case of negative skew flip-flops; the situation for positive skew flip-flops is analogous. Assume that we have a situation where flip-flops $j_1, j_2, \ldots, j_n$ are connected directly to each other in a chain and their skews remain negative at the end of Phase B. The flip-flops must necessarily be connected in a cycle; if not, there would be a first (leftmost) flip-flop in the chain that is connected either to a gate or to a flip-flop with zero skew, which implies that their skew magnitudes can be reduced, and contradicts the assumption that Phase B is complete.
Now when each flip-flop in the cycle is processed, the magnitude of its skew is reduced, as shown above. Hence, if the procedure described in Section 5.2 is applied, then each skew magnitude must eventually be reduced to zero, which contradicts the assumption that Phase B is over. Therefore we cannot have a negative skew flip-flop connected directly to a negative skew flip-flop at the end of Phase B.
□
Theorem 9 If, at the end of the retiming procedure, all skews are set to zero, then the optimal clock period for this circuit is no more than $P_{opt} + d_{max}$, where $P_{opt}$ is the optimal clock period found in Phase A, and $d_{max}$ is the maximum delay of any gate in the circuit.
Proof: By Lemma 8, the skew at each flip-flop after Phase B is no more than half a gate delay.
Hence, to ensure that the double-clocking constraint is satisfied, it is easy to see that one may modify the period to
$$\max\,(P + x_j - x_i) \qquad (20)$$
where $x_j, x_i$ are the skews of any pair of flip-flops that are connected by a combinational path. Clearly, since $|x_k| \le d_{max}/2 \;\; \forall k$, the clock period required to ensure that all double-clocking constraints are satisfied is no more than $P + d_{max}$. Note also that since any period achievable by retiming is also achievable using skews, the final period can be no better than P, which is the optimal period using skew optimization.
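The bound in Theorem 9 can be checked numerically with a small Python sketch that evaluates expression (20). The function name, the representation of connected flip-flop pairs as `(i, j)` tuples, and all numerical values are assumptions made for this illustration; they are not taken from Table 1.

```python
def period_after_zeroing(period, skews, pairs):
    """Evaluate max(P + x_j - x_i) over the connected flip-flop pairs (i, j)."""
    return max(period + skews[j] - skews[i] for i, j in pairs)


# Invented example: unit gate delays (d_max = 1), so |x_k| <= 0.5 after Phase B.
d_max = 1.0
period = 63.0
skews = {"a": -0.5, "b": 0.5, "c": 0.0}
pairs = [("a", "b"), ("b", "c"), ("c", "a")]

new_period = period_after_zeroing(period, skews, pairs)
print(new_period)  # 64.0
```

Since each skew is at most $d_{max}/2$ in magnitude, the difference $x_j - x_i$ never exceeds $d_{max}$, so the value printed here cannot exceed `period + d_max`.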
7 Experimental Results
The algorithm was implemented as a C program, ASTRA. Experimental results from running ASTRA on all circuits in the ISCAS89 benchmark suite (including the Addendum93 circuits) are presented in Table 1. For each circuit, the table provides data that describes its size in terms of the number of combinational gates, $|G|$, and flip-flops, $|F|_{init}$. All gates are assumed to have unit delays (although the algorithm is certainly not restricted to unit delay circuits), and the setup and hold times are arbitrarily set to 0. $P_{max}$ is the upper bound on the clock period provided by Lemma 5(a).
Note that $P_{max}$ corresponds to the clock period in the original circuit.
The next column shows the optimal value, $P_{opt}$, calculated at the end of Phase A of ASTRA, using clock skew optimization. At the end of retiming in Phase B of ASTRA, any skews that could not be set exactly to zero are now forced to zero and the new period $P_{ret}$ is shown, along with the corresponding percentage improvement over the initial period, $P_{max}$. This period corresponds to the maximum delay of any combinational segment in the retimed circuit. As expected, it is seen that in each case, $P_{ret}$ is within one gate delay of $P_{opt}$.
The CPU times for running ASTRA on an HP 735 workstation for each of the two phases, and the total CPU time, are shown for all these circuits. Note that the time for Phase A includes the time for calculating the values of $d(i, j)$, the maximum delay from flip-flop i to flip-flop j, as well as the time for the Bellman-Ford iterations. The last column shows the number of flip-flops in the retimed circuit.
For example, for s15850.1, a circuit with 9772 gates and 534 flip-flops, the value of $P_{max}$ is found to be 82.0 units. At the end of Phase A (skew optimization), ASTRA calculates the value of $P_{opt}$ to be 63.0 units. The value of $P_{ret}$ at the end of Phase B of ASTRA is 63.0 units. The improvement in the clock period after the two phases is calculated as
$$\%\ \text{change} = \frac{P_{max} - P_{ret}}{P_{ret}} \times 100$$
and is found to be 30.2% for s15850.1. The number of flip-flops was increased in the process from 534 to 572.
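The arithmetic for s15850.1 can be reproduced with a few lines of Python. The function name is hypothetical; the two period values are the ones quoted in the text.

```python
def percent_improvement(p_max: float, p_ret: float) -> float:
    """Percentage change of the clock period, normalized by the retimed period."""
    return (p_max - p_ret) / p_ret * 100.0


# s15850.1: P_max = 82.0 units, P_ret = 63.0 units
improvement = round(percent_improvement(82.0, 63.0), 1)
print(improvement)  # 30.2
```

Note that the change is normalized by $P_{ret}$ rather than by $P_{max}$; dividing by $P_{max}$ instead would give about 23.2 for the same circuit, which does not match the reported figure.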
It can be seen from the Table that in 36 out of 44 circuits, ASTRA caused the clock period to improve, with the improvement being as much as 220.7% in the case of s6669. Although it is theoretically possible for retiming to reduce the number of flip-flops in the circuit, this was never seen to happen. It should be stressed here that use of the techniques presented here performs the relocation of flip-flops across the minimal number of levels of gates. However, it does not minimize the number of flip-flops, and this is a topic for further research.
It is worth noting that the CPU times for ASTRA are rather small; even the largest circuit could be retimed in just over a minute.
The runtime of the ASTRA algorithm versus the circuit size, $|G|$, is shown in Figure 10.
It is very difficult to infer a general relationship between the circuit size and the execution time.
However, it may be inferred from these graphs and from Table 1 that for large circuits where large improvements in the clock period are possible, the algorithm is likely to have relatively larger runtimes. A second point that can be inferred is that very often, when the clock period is reduced tremendously, the runtime increases because the amount of work required of Phase B increases.
Conclusion
An approach that takes advantage of the equivalence between retiming and clock skew is presented, and is used for gate-level retiming. Results have been presented on all of the circuits in the ISCAS89 benchmark suite, all of which can easily be handled by this algorithm.
The chief reason for the improvement is that ASTRA takes a global view of retiming by first solving the clock skew problem in a smaller number of variables. In the second phase, local transformations are used to perform the retiming. The logic behind this approach is that a circuit would have to be very poorly designed indeed to require enormous computation time for the local transformations, and hence in most practical cases, the latter phase takes only a small amount of computation; this is borne out by our experimental results.
It must be pointed out that the algorithm performs retiming only for timing optimization and does not take into account the fact that retiming may cause initial states to change. Therefore, in its current avatar, it is more applicable to the timing optimization of pipelined circuits, rather than for optimization of control unit circuitry, unless such circuits are designed using the techniques in [21].
The ASTRA algorithm may easily be adapted to perform retiming to satisfy a given clock period, $P_{spec}$. Phase A consists of a single pass through the graph $G_1(P_{spec})$. If $P_{spec}$ is infeasible, this will be reported immediately, and if not, the skew solution at the end of Phase A may be translated to a retiming solution using the methods of Phase B.
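A feasibility check of this kind can be sketched in Python as a Bellman-Ford pass over difference constraints. This is a minimal sketch under loudly stated assumptions, not the ASTRA code: it assumes that the constraint graph for a given period can be written as triples `(u, v, w)` meaning $x_v - x_u \le w$, and it models only setup constraints of the form $x_i + d(i,j) + T_{setup} \le x_j + P_{spec}$; hold constraints would be added as further triples. The function names and the two-flip-flop example are hypothetical.

```python
def setup_constraints(max_delay, p_spec, t_setup=0.0):
    """Turn x_i + d(i,j) + T_setup <= x_j + P into triples (j, i, P - d - T_setup)."""
    return [(j, i, p_spec - d - t_setup) for (i, j), d in max_delay.items()]


def feasible_skews(nodes, constraints):
    """Bellman-Ford on constraints x_v - x_u <= w.

    Returns a dict of skews if the system is feasible, or None if a
    negative cycle shows that the requested period is infeasible.
    """
    dist = {n: 0.0 for n in nodes}  # implicit source at distance 0 from every node
    for _ in range(len(nodes)):
        changed = False
        for u, v, w in constraints:
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
                changed = True
        if not changed:
            return dist
    return None  # still relaxing after |V| passes: negative cycle


# Invented example: a -> b with delay 4 and b -> a with delay 2, so the
# smallest period achievable with skews is (4 + 2) / 2 = 3.
max_delay = {("a", "b"): 4.0, ("b", "a"): 2.0}
ok = feasible_skews(["a", "b"], setup_constraints(max_delay, 3.0))
bad = feasible_skews(["a", "b"], setup_constraints(max_delay, 2.5))
```

In this example `ok` is a valid skew assignment for a period of 3, while `bad` is `None` because a period of 2.5 is below the average delay around the loop.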
The algorithm directly provides a result to the combined clock skew optimization and retiming problem. The use of this algorithm to minimize the number of flip-flops using retiming only is a topic for further research. Another possible extension is the use of deliberate skews with retiming to minimize the number of flip-flops.
