Introduction
Over the years, combinational timing optimization has been intensively studied and a significant level of maturity has been attained [3, 11, 12]. In comparison, sequential timing optimization has lagged behind because of the additional complexity of handling registers.
Traditionally, sequential circuits are viewed as a special case of combinational circuits and only the logic between registers is optimized. This approach not only results in stringent resynthesis constraints, but also does not permit interaction between logic separated by registers. Retiming [7] is a popular technique for sequential optimization.
Although it is very useful, the effectiveness of retiming is limited since it does not change the logic.
Efforts have been made to improve upon the direct application of combinational techniques to sequential circuits. Techniques were proposed that take into consideration the existence of post-resynthesis retiming by generating a set of relaxed resynthesis constraints [5, 2]. Several techniques were proposed to exploit signal dependencies across register boundaries. The approach proposed in [9] first retimes the circuit with the objective of exposing signal dependencies across register boundaries. Then, it carries out resynthesis on the logic between registers in the retimed circuit. These techniques can be improved by repeating the retiming/resynthesis loop [6]. Attempts have also been made to generalize combinational logic synthesis techniques to sequential circuits [1, 4, 8]. Although these approaches have achieved some degree of success, it is evident that a true integration of retiming and resynthesis is still lacking, since in most cases retiming and resynthesis are carried out separately.
Even when retiming and resynthesis are tightly integrated, no effective criteria have been provided to guide the application of these techniques. As a result, local logic transformations are not strongly tied to the performance target.
*This work was done while the author was with Clarkson University, Potsdam, NY.
DAC 99, New Orleans, Louisiana. ©1999 ACM 1-58113-092-9/99/0006...$5.00
In this paper, we propose a new approach to integrate retiming and resynthesis for performance optimization.
The approach is based on two recent concepts, expanded circuits and l-values, which were originally proposed in the context of FPGA technology mapping [10]. Our approach produces provably good results under a very general assumption.
The rest of this paper is organized as follows. Preliminaries are presented in Section 2. In Section 3, we introduce our approach. Section 4 deals with the experimental results obtained and Section 5 concludes the paper.
Preliminaries
A sequential circuit is represented as a directed graph. Each node denotes either a primary input (PI), a primary output (PO) or a gate, and each edge u → v represents a connection from node u to node v. Each edge e is weighted by the number of registers, w(e), on it. Each node has an area and a delay associated with it. The cycle time of a sequential circuit is the maximum delay on the combinational paths.
Retiming [7] is a transformation that repositions the registers in a circuit without altering its functionality.
Retiming a node by a value i is the operation of removing i registers from each fanout edge and adding i registers to each fanin edge of the node. In general, all nodes can be retimed collectively to arrive at a retiming of the circuit.
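The retiming operation just described can be sketched in a few lines. This is only an illustration: the function name `retime`, the dictionary encoding of edges, and the three-gate ring are assumptions made for the example, not part of the paper. The rule applied is the one stated above: an edge u → v gains the retiming value of v (it is a fanin edge of v) and loses the retiming value of u (it is a fanout edge of u).

```python
def retime(edges, r):
    """Apply a retiming to every edge of a circuit graph.

    edges: dict mapping (u, v) -> registers w(e) on the edge u -> v
    r:     dict mapping node -> retiming value (nodes not listed use 0)
    """
    retimed = {}
    for (u, v), w in edges.items():
        # The edge is a fanin edge of v (gains r(v) registers) and a
        # fanout edge of u (loses r(u) registers).
        new_w = w + r.get(v, 0) - r.get(u, 0)
        if new_w < 0:
            raise ValueError(f"edge {u}->{v} would carry {new_w} registers")
        retimed[(u, v)] = new_w
    return retimed


# Hypothetical three-gate ring with a single register on c -> a.
ring = {("a", "b"): 0, ("b", "c"): 0, ("c", "a"): 1}

# Retiming a by -1 moves the register from a's fanin edge to its fanout edge.
moved = retime(ring, {"a": -1})
print(moved)
```

The number of registers around the ring is unchanged by the move, which is the usual sanity check for a legal retiming; a retiming that would leave a negative register count on some edge is rejected.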
To exploit the flexibility of dynamic register positions due to retiming, we make use of expanded circuits [10]. The expanded circuit at a node is formed by unrolling the circuit over all time frames, starting from the node. It is essentially the combinational logic for the node in the circuit. For example, for the circuit in Fig. 1(1) [...] Because of this result, given an output cone of the expanded circuit at a node, the registers within the cone can be pushed out of the cone to form a combinational subcircuit for the node. Thus, the concept of expanded circuits allows us to extract logic across register boundaries. For example, from the cone indicated in the expanded circuit in Fig. 1(2) we have the combinational subcircuit for g in Fig. 1(3). We will refer to the combinational subcircuits derived from expanded circuits as resynthesis cones, since the proposed approach carries out resynthesis on them.
We will employ the concept of l-values introduced in [10] to guide resynthesis. The l-values are defined for a given target cycle time φ. The l-value of a node v in a circuit is defined as the maximum weight of the paths from the PIs to v according to a set of new edge weights defined as follows.
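As a rough illustration of how such longest-path labels can be computed, the sketch below relaxes edges repeatedly in the style of Bellman-Ford. The edge weight used here is an assumption: the worked example in Section 3 evaluates terms of the form l(u) − φ·w(e) + d(v), so the code takes the weight of an edge u → v to be d(v) − φ·w(e), where d(v) is the delay of v and φ (`phi`) is the target cycle time. The function name, the data layout and the small ring circuit are hypothetical.

```python
import math


def l_values(nodes, edges, delay, pis, phi):
    """Longest-path labels from the PIs under the ASSUMED edge weights
    d(v) - phi * w(e) for an edge u -> v.

    nodes: iterable of node names
    edges: dict (u, v) -> register count w(e)
    delay: dict node -> delay d(v) for every non-PI node
    Returns a dict of labels, or None if the labels do not converge
    (some cycle has positive total weight).
    """
    nodes = list(nodes)
    label = {v: (0 if v in pis else -math.inf) for v in nodes}
    # Without a positive-weight cycle, longest paths settle within
    # len(nodes) - 1 passes; one more pass confirms that nothing changes.
    for _ in range(len(nodes)):
        changed = False
        for (u, v), w in edges.items():
            if v in pis:
                continue
            candidate = label[u] - phi * w + delay[v]
            if candidate > label[v]:
                label[v] = candidate
                changed = True
        if not changed:
            return label
    return None


# Hypothetical circuit: PI x feeds a ring a -> b -> c -> a with one register.
nodes = ["x", "a", "b", "c"]
edges = {("x", "a"): 0, ("a", "b"): 0, ("b", "c"): 0, ("c", "a"): 1}
delay = {"a": 1, "b": 1, "c": 1}

print(l_values(nodes, edges, delay, pis={"x"}, phi=3))
print(l_values(nodes, edges, delay, pis={"x"}, phi=2))
```

With φ = 3 the ring has total weight zero and the labels settle; with φ = 2 the ring has positive weight under the assumed weights, the labels grow without bound, and the sketch reports this by returning `None`.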
The proposed approach
The basic idea in our approach is to extract resynthesis cones and resynthesize them using a combinational timing optimizer. Because of Theorem 2, instead of considering cycle time directly, we resynthesize a circuit so that the l-values of all POs are less than or equal to φ. The following is a precise definition of the problem:
Problem 1 Given a combinational timing optimizer T and a set of cones C(v) for each node v, resynthesize one cone in C(v) for each v such that the l-values of all POs in the equivalent circuit formed by the resynthesized cones are less than or equal to φ.
We now present an algorithm for this problem. The algorithm has two phases: a synthesis phase and an assembly phase. In the synthesis phase, we compute a label and an associated resynthesized cone for each node. Based on the label we know whether there is a solution to the problem. If the answer is affirmative, we connect the resynthesized cones in the assembly phase to form a solution to the problem.
In the synthesis phase, we want to find the minimum l-value that can be obtained for each node with resynthesis. We determine the minimum l-values by iterative improvement. For each node v, we maintain a label l(v), which is a lower bound on the minimum l-value at v, and then successively approximate the minimum l-value by updating it. We begin by initializing the labels of the non-PI nodes to −∞. The labels of the PIs are set to zero, assuming all input signals arrive at the same clock edge. As the process of resynthesis continues, the labels are gradually increased. If the label of any PO ever exceeds φ, the synthesis procedure simply stops and returns FAILURE. (As will be shown later, in this case there is no solution to Problem 1, based on the resynthesis cones and the combinational timing optimizer.) Fig. 2 shows the outline of the synthesis procedure, where update is the procedure that updates the label at each node by calling the combinational timing optimizer. We now discuss how to update the label at each node. To maintain l(v) as a lower bound on the minimum l-value at v, we resynthesize the cones in C(v) in such a way that the updated label is minimized.
Suppose we resynthesize c ∈ C(v) and let c' be the resulting cone. If we use c' as the logic for generating (the output signal of) v, then by the definition of l-values, the l-value at v is at least as follows: [...] Consequently, we constrain the resynthesis of each cone in C(v) by assigning appropriate arrival times at the inputs of the cone as indicated in Fig. 3. Then we resynthesize the cone to minimize the arrival time at its output. In this way, we translate the problem of determining the new lower bound into that of resynthesizing each cone to minimize the arrival time at the output. Among all cones in C(v), we pick the one that has the minimum arrival time after resynthesis in order to minimize the new label at v. Let T(c) denote the arrival time at the output v in c after resynthesis using T, with the arrival times at the inputs as shown in Fig. 3. Then, update(v) = min_{c ∈ C(v)} T(c).
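A minimal sketch of this update step follows. Several details are assumptions made for illustration: a cone is encoded as a dictionary with a list of (signal, registers) inputs, the arrival time assigned to an input fed by signal u through k registers is taken to be l(u) − k·φ (the form that appears in the worked example below), and `fixed_depth_optimizer` is a stand-in that merely adds a fixed depth to each input arrival time rather than a real timing optimizer.

```python
def update(cones, label, phi, optimizer):
    """Return (new_label, chosen_cone) for one node.

    cones:     list of candidate cones for the node
    label:     dict signal -> current label
    optimizer: callable(cone, arrivals) -> output arrival time after
               resynthesis under the given input arrival times
    """
    best_time, best_cone = None, None
    for cone in cones:
        # ASSUMED constraint: an input driven by signal u through k
        # registers is given the arrival time l(u) - k * phi.
        arrivals = [label[u] - k * phi for (u, k) in cone["inputs"]]
        t = optimizer(cone, arrivals)
        if best_time is None or t < best_time:
            best_time, best_cone = t, cone
    return best_time, best_cone


def fixed_depth_optimizer(cone, arrivals):
    # Stand-in optimizer: the "resynthesized" cone has a fixed depth
    # from each input to the output, whatever the arrival times are.
    return max(a + d for a, d in zip(arrivals, cone["depths"]))


# Hypothetical node with two candidate cones over signals a and b.
label = {"a": 3, "b": 1}
cones = [
    {"name": "A", "inputs": [("a", 0), ("b", 0)], "depths": [1, 2]},
    {"name": "B", "inputs": [("a", 1), ("b", 0)], "depths": [2, 2]},
]
new_label, chosen = update(cones, label, phi=2, optimizer=fixed_depth_optimizer)
print(new_label, chosen["name"])
```

Returning the chosen cone together with the label mirrors the requirement that each label be accompanied by a resynthesized cone that realizes it, which the assembly phase needs later.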
We use an example to illustrate the synthesis procedure. Consider the circuit in Fig. 4(1), which has a cycle time of three units, assuming that each gate has one unit of delay.
Figure 4: A circuit and selected resynthesis cones
We want to resynthesize the circuit with a target cycle time φ = 2. Suppose that each node has the trivial cone formed just by itself, as shown in Fig. 4(2). (The presence of the trivial cone means that we may choose not to resynthesize at the node.) Suppose that node g4 has the two additional resynthesis cones shown in Fig. 5(1) and (2). To simplify our discussion, we assume that the combinational timing optimizer can only produce the resynthesized cones shown in Fig. 5(3) and (4) for the cones in Fig. 5(1) and (2), respectively, regardless of the arrival times at their inputs. At the beginning, l(i1) = l(i2) = l(i3) = 0, and l(v) = −∞ for all other nodes. Suppose we visit the nodes g1, g2, g3, g4 in this order in the synthesis procedure. In the first iteration, since g1 only has the trivial cone, we have l(g1) = max{l(i1) + 1, l(g4) + 1} = max{0 + 1, −∞ + 1} = 1. Similarly, l(g2) = 1 and l(g3) = 1. For node g4, the arrival time from the trivial cone is max{l(g3) + 1, l(g2) − φ + 1} = 2; from the cone in Fig. 5(3) the arrival time is max{l(i1) − φ + 3, l(i2) − φ + 3, l(g4) − φ + 2, l(i3) + 1} = 1; from the cone in Fig. 5(4) the arrival time is max{l(g1) − φ + 2, l(g2) − φ + 2, l(i3) + 1} = 1. Thus, l(g4) = 1. For the output node, l(o1) = l(g2) = 1 from the trivial cone at o1.
In the second iteration, since l(g4) = 1, we have l(g1) = 2, l(g2) = 2, and l(g3) = 1. For node g4, the cone in Fig. 5(3) gives the smallest arrival time, 1, so l(g4) = 1. For the output node, l(o1) = l(g2) = 2. Since labels have changed, the procedure goes to the third iteration. However, no more changes in the labels occur, and the procedure stops by returning SUCCESS.
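The iterations above can be replayed with a short script. It is a sketch under stated assumptions rather than the procedure of Fig. 2 itself: each cone is reduced to a list of (input signal, registers, delay to the output) triples taken from the max expressions in the text; the fanins of g2 and g3 are not spelled out in the text and are assumed here (g2 reads i2 and g4, g3 reads i3 directly and g1 through one register), chosen because they reproduce the stated labels; and the iteration cap is a safety measure added for the example.

```python
import math

PHI = 2
PIS = ["i1", "i2", "i3"]
ORDER = ["g1", "g2", "g3", "g4", "o1"]  # visiting order; o1 is the only PO
POS = ["o1"]

# cone = list of (input signal, registers on the path, delay to the output)
CONES = {
    "g1": {"trivial": [("i1", 0, 1), ("g4", 0, 1)]},
    "g2": {"trivial": [("i2", 0, 1), ("g4", 0, 1)]},  # assumed fanin
    "g3": {"trivial": [("i3", 0, 1), ("g1", 1, 1)]},  # assumed fanin
    "g4": {
        "trivial": [("g3", 0, 1), ("g2", 1, 1)],
        "Fig. 5(3)": [("i1", 1, 3), ("i2", 1, 3), ("g4", 1, 2), ("i3", 0, 1)],
        "Fig. 5(4)": [("g1", 1, 2), ("g2", 1, 2), ("i3", 0, 1)],
    },
    "o1": {"trivial": [("g2", 0, 0)]},
}


def arrival(cone, label):
    # Output arrival time of a fixed cone under the current labels.
    return max(label[u] - PHI * regs + d for (u, regs, d) in cone)


def synthesize(max_iterations=10):
    label = {v: 0 for v in PIS}
    label.update({v: -math.inf for v in ORDER})
    chosen = {}
    for _ in range(max_iterations):
        changed = False
        for v in ORDER:
            # update(v): the cone with the smallest output arrival time.
            name, t = min(
                ((n, arrival(c, label)) for n, c in CONES[v].items()),
                key=lambda pair: pair[1],
            )
            chosen[v] = name
            if t > label[v]:  # labels only ever increase
                label[v] = t
                changed = True
            if v in POS and label[v] > PHI:
                return "FAILURE", label, chosen
        if not changed:
            return "SUCCESS", label, chosen
    return "FAILURE", label, chosen  # cap reached (assumption of this sketch)


status, label, chosen = synthesize()
print(status, label, chosen["g4"])
```

Under these assumptions the script stops in the third pass with the labels given in the text, and it records the cone of Fig. 5(3) as the one that realizes the label at g4.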
After the procedure outlined in Fig. 2 terminates with SUCCESS, we proceed to generate the resynthesized circuit. Recall that for each node in the initial circuit, we not only have a label, but also an associated resynthesized cone that realizes the label. In the assembly phase, we simply connect together the resynthesized cones that realize the labels. This is followed by a cleanup step to remove nodes that do not drive, directly or indirectly, the POs of the circuit. For the example in Fig. 4, the resynthesized cone in Fig. 5(3) realizes the label at g4. For each of the other nodes, its label is realized by its trivial cone. After cleaning up, we obtain the resynthesized circuit in Fig. 6(1).
Figure 6: Resynthesized circuits
Given a single-output combinational circuit, a set of arrival times at its inputs is pairwise smaller than another set if the arrival time at each input in the set is less than or equal to the corresponding arrival time in the other set. A combinational timing optimizer is order-respecting if a pairwise smaller set of arrival times at the inputs results in a resynthesized circuit with a smaller or the same arrival time at the output.
Theorem 3 Assuming that T is order-respecting, there is a solution to Problem 1 iff the procedure returns SUCCESS.
To obtain a circuit with the target cycle time φ, we can simply retime the circuit generated in the assembly phase to minimize its cycle time. The resulting circuit is guaranteed to have a cycle time less than φ plus the largest gate delay. For the resynthesized circuit in Fig. 6(1), after retiming, we obtain the circuit in Fig. 6(2), which has the desired cycle time of two units.
We introduce a factor called the depth to limit the size of the resynthesis cones selected from the expanded circuit for each node. This factor determines the size of the logic of the node that will be passed on to the combinational logic optimizer. A large value results in a large cone, which in turn, presents better resynthesis potential.
On the other hand, area overhead and computation time could be large too. A good choice should balance these factors.
Ideally, one would like to consider many cones for each node. In practice, however, resynthesizing several cones at each node may greatly increase the computation time. Our strategy to overcome this problem is to dynamically select one cone to resynthesize at each node in each iteration during the synthesis phase.
In practice, only a small set of nodes constrain the performance of a circuit. Moreover, resynthesizing nodes that do not contribute to achieving the target cycle time may result in unnecessary area overhead and computation time. This suggests the need for an effective technique for selecting a few "strategic" nodes for resynthesis. Resynthesis is directed towards this set of nodes first. We designed an effective technique for node selection. The details of the technique are omitted here due to space limitations.
Experimental results
This section describes our experimental results on the sequential benchmark circuits in the ISCAS89 suite. Our program, referred to as SeqRe, is integrated with the logic synthesis tool SIS. Combinational resynthesis is performed by speed-up in SIS. A min-cost based node selection technique was also implemented in SeqRe.
SeqRe performs repeated calls to ReRe, outlined in Fig. 2, to target a cycle time that is less than the current one until the cycle time cannot be reduced. Our program introduces a parameter, step, to control the difference between the current and target cycle times. We set step to one initially.
If we fail to meet the target cycle time, we increase step to two and attempt once more. The other parameter, depth, which controls the size of the resynthesis cones, is set to 4 in our experiments. Throughout the experiments, the unit-delay model is used. We report, in Table 1, the results of SeqRe on all the benchmark circuits. For each benchmark we list the number of gates, the number of registers and the cycle time of the initial, optimally retimed and SeqRe optimized circuits. It is evident from the table that SeqRe achieves a significant improvement in cycle time over the initial circuits, with over 50% reduction overall. From Table 1, we can also see that SeqRe achieves much better results than retiming alone. On the whole, SeqRe reduces the cycle time by an additional 35% (with a 13% increase in the number of registers and a 1% increase in the number of gates) over the optimally retimed circuits. We point out that our current program does not try to minimize the number of registers. We expect that the number of registers can be reduced considerably if a register minimization step is added. Comparison with existing retiming and resynthesis methods has been done, although it is not reported here due to space limitations.
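The outer loop described at the start of this section (repeated calls to ReRe with a target one step below the current cycle time, and one retry with step set to two after a failure) might be sketched as follows. This is one possible reading, not the actual SeqRe code: `rere` is a hypothetical callback that returns the achieved cycle time for a target or `None` on failure, cycle times are integers under the unit-delay model, and resetting step to one after a success is an assumption.

```python
def seqre(cycle_time, rere):
    """Lower the cycle time until ReRe can no longer meet a smaller target.

    cycle_time: cycle time of the starting circuit (integer, unit delay)
    rere:       hypothetical callback; rere(target) returns the cycle
                time achieved for that target, or None on failure
    """
    step = 1
    while cycle_time - step >= 1:
        achieved = rere(cycle_time - step)
        if achieved is not None and achieved < cycle_time:
            # Success: continue from the improved circuit.
            cycle_time, step = achieved, 1  # reset of step is an assumption
        elif step == 1:
            step = 2  # one more attempt with a larger step
        else:
            break  # both attempts failed: the cycle time cannot be reduced
    return cycle_time


# Stand-in for ReRe that meets every target down to 4 and fails below that.
print(seqre(8, lambda target: target if target >= 4 else None))
```

In a real flow the callback would also carry the circuit state from one call to the next; the sketch hides that inside `rere` to keep the control flow visible.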
Conclusions
We have developed a novel timing optimization approach that integrates retiming with resynthesis to form a powerful combined technique. This approach has several important features. First, it extracts combinational logic out of a circuit for resynthesis instead of carrying out resynthesis directly on the circuit (or its retimed version) to expose signal dependencies. By doing so, it becomes truly oblivious of register boundaries. Second, it tightly constrains logic resynthesis so that the resynthesized circuit is guaranteed to meet the performance target. Third, it is independent of the combinational timing optimizer.
Experimental results show that the proposed approach can improve the performance of a sequential circuit significantly.
