Abstract - With growing functionality and scaling chip size, possible performance improvements should be considered in the early IC design stages, which gives more freedom to later optimization. This work considers potential slack as an effective metric of possible performance improvement and, as far as we know, is the first to maximize potential slack by retiming for synchronous sequential circuits. A simultaneous slack budgeting and incremental retiming algorithm is proposed for maximizing potential slack. The overall slack budget is optimized by relocating the FFs iteratively with MIS-based slack estimation. Compared with the potential slack of a well-known min-period retiming, our algorithm improves the potential slack by 19.6% on average without degrading circuit performance, in reasonable runtime. Furthermore, at the expense of a small amount of timing performance (0.52% and 2.08%), the potential slack is increased on average by 19.89% and 28.16% respectively, which gives a hint of the tradeoff between timing performance and slack budget.
I. Introduction
With the emergence of SoC (system on a chip) and SiP (system in a package), more and more devices tend to be placed in a small silicon area while, at the same time, the clock frequency is pushed even higher. Thus timing, area, and power dissipation, the three major design objectives, become even more challenging in modern circuit design. Typically, the goal of performance optimization is either to minimize the clock period within a given area and/or power budget [1] [2] [3], or to minimize area and/or power dissipation under timing constraints [4] [5] [6] [7] [8] [9] [10]. In particular, timing budgeting is often performed in order to slow down as many components as possible without violating the system's timing constraints. The slowed-down components can be further optimized to improve the system's area, power dissipation, or other design quality metrics.
For timing-constrained gate-level synthesis, time slack is an effective metric of a circuit's potential performance improvement. The slack budgeting problem on a graph has been studied in theory and practice for many different applications, such as timing-driven placement [16] [17] and floorplanning [11], gate/wire sizing, and power optimization [12] [13] [14] [15]. Most previous slack budgeting approaches are suboptimal heuristics such as the Zero-Slack Algorithm (ZSA) [18]. In [12], [19], and [20], the slack budgeting problem in combinational circuits is formulated as a maximum-independent-set (MIS) problem on the sensitive transitive closure graph. Recently, [21] presented an LP-based slack budgeting and maximized the potential slack by clock skew optimization, [22] performed budgeting through LP relaxation, and [23] proposed combinatorial methods based on a net flow approach to handle the slack budgeting problem. However, the existing slack budgeting algorithms are either used for combinational circuits or limited to fixed FF locations. At the early design stages, there is flexibility to schedule the pipeline or the timing distribution to obtain more timing slack.
As one of the most powerful sequential optimization techniques, retiming was first proposed by Leiserson and Saxe in 1983 [24]; it relocates the flip-flops (FFs) in a circuit while preserving its functionality. The past twenty years have shown retiming's effectiveness in improving circuits' timing, area, and power characteristics. Recent publications [1] and [3] proposed a very efficient retiming algorithm for minimal period by algorithm derivation. [2] and [5] later presented efficient incremental algorithms for min-period retiming under setup and hold constraints, and for min-area retiming under a given clock period. [8], [25], and [26] proposed ILP/MILP-based retiming algorithms for power reduction. Besides, retiming can also increase the potential slack, which translates to potential performance. As illustrated in Fig. 1, the period of the circuit is minimized, and the delay and slack are labeled beside each gate. There is no potential slack in this circuit. However, if retiming is applied, i.e. the FF is moved from edge ED to DC, the potential slack increases from 0 to 4 while the period stays at its minimum. Although there are many retiming algorithms, none of them takes the slack budget into consideration directly. [28] proposed an LP formulation for optimizing the slack budgeting on interconnects between gates, in order to guide timing-driven placement. However, such an interconnect-based slack budgeting formulation is not easy to transform into a gate-based one, since one gate can belong to different interconnects. Furthermore, solving an LP formulation with a large problem size is not efficient. In this paper, we propose an incremental retiming approach that optimizes the slack distribution on gates efficiently without any performance degradation. As far as we know, this paper is the first work that maximizes the potential slack by retiming for synchronous sequential circuits.
A simultaneous slack budgeting and incremental retiming algorithm is proposed. The formulation of MIS on the sensitive transitive closure graph is extended from combinational circuits to synchronous sequential circuits for potential slack estimation. The overall slack budget can be optimized by relocating the FFs iteratively with the MIS-based slack estimation.
978-1-4244-5767-0/10/$26.00 2010 IEEE
1C-1
Experimental results show that, compared with the potential slack of a well-known min-period retiming [3], our retiming approach increases the potential slack by 19.6% on average without degrading timing performance. In addition, when retiming and slack budgeting are performed with the timing constraint relaxed by 2% and 5%, the potential slack increases by 19.89% and 28.16% respectively, which gives designers a hint about the tradeoff between performance and slack budget or other metrics.
The rest of this paper is organized as follows. Section II introduces the problem formulation. Section III gives three motivating examples of retiming for maximizing potential slack. Sensitive transitive closure graph and slack budgeting for estimating potential slack of synchronous circuit are described in Section IV. Section V presents the simultaneous slack budgeting and retiming algorithm. Experimental results are given in Section VI, and our paper concludes in Section VII.
II. Preliminaries
As in [24], we model a synchronous sequential circuit as a directed graph $G = (V, E, d, w)$. Each vertex $v \in V$ represents a combinational gate and each edge $(u, v) \in E$ represents a signal passing from gate $u$ to gate $v$. Non-negative gate delays are given as vertex weights $d: V \to R^*$, and the non-negative integer edge weight $w: E \to Z^*$ represents the number of FFs on the signal path. A special host vertex H, the edges from the host to the primary inputs, and the edges from the primary outputs to the host are introduced into the graph to represent interfaces with the external environment. Without loss of generality, we assume that G is strongly connected. Conventionally, three non-negative labels $V \to R^*$ on each gate $v_i$ represent its latest arrival time $a_i$, its required time, and its slack $s_i$. The arrival time and the required time can be calculated recursively by formulas (1) and (2),
where the edge weight is the number of FFs on a directed edge $(v_m, v_n)$, $FI(v_i)$ and $FO(v_i)$ represent the incoming and outgoing gates of gate $v_i$ respectively, and init is the initialization condition. The slack $s_i$ is then calculated by
$s_i = (\text{required time of } v_i) - a_i \quad (3)$
To represent the relocation of FFs in retiming, an integer label $r: V \to Z^*$ is introduced to represent how many FFs are moved from the outgoing edges to the incoming edges of each node. A negative value of $r$ means that FFs are moved from the incoming edges to the outgoing edges. Thus the number of FFs on edge $(v_i, v_j)$ under the current label $r$ is given by formula (4).
$w_{i,j} + r_j - r_i \quad (4)$
Our problem is formulated as follows.
Problem of simultaneous slack budget and retiming (SSBR):
Given a directed graph G = (V, E, d,
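The timing labels and the retimed edge weight of this section can be sketched in Python as follows. Since formulas (1) and (2) are not reproduced above, the recursion here is an assumption based on standard static timing analysis: the arrival time of a gate is its own delay plus the latest arrival time over its FF-free incoming edges, and the required time is the clock period T for a gate whose outputs all feed FFs, or otherwise the smallest value of a successor's required time minus that successor's delay over FF-free outgoing edges. The names `retimed_weight` and `timing_labels` are hypothetical, and the host vertex is omitted for brevity.

```python
from collections import defaultdict, deque


def retimed_weight(w, r, u, v):
    """Number of FFs on edge (u, v) under retiming label r, as in formula (4)."""
    return w[(u, v)] + r[v] - r[u]


def timing_labels(d, w, r, T):
    """Return (arrival, required, slack) dicts for clock period T.

    d: gate -> delay, w: (u, v) -> FF count, r: gate -> retiming label.
    Assumed recursion (see the lead-in), not the paper's exact formulas.
    """
    succ, pred = defaultdict(list), defaultdict(list)
    indeg = {v: 0 for v in d}
    for (u, v) in w:
        wr = retimed_weight(w, r, u, v)
        if wr < 0:
            raise ValueError(f"edge {(u, v)} has a negative FF count")
        if wr == 0:  # FF-free edge: part of a combinational path
            succ[u].append(v)
            pred[v].append(u)
            indeg[v] += 1

    # Topological order of the FF-free subgraph (Kahn's algorithm).
    order, queue = [], deque(v for v in d if indeg[v] == 0)
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in succ[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    if len(order) != len(d):
        raise ValueError("combinational cycle: some cycle carries no FF")

    arrival = {}
    for v in order:
        arrival[v] = d[v] + max((arrival[u] for u in pred[v]), default=0)
    required = {}
    for v in reversed(order):
        required[v] = min((required[x] - d[x] for x in succ[v]), default=T)
    slack = {v: required[v] - arrival[v] for v in d}
    return arrival, required, slack


# Hypothetical three-gate loop A -> B -> C -> A with one FF on edge (C, A).
d = {"A": 2, "B": 3, "C": 4}
w = {("A", "B"): 0, ("B", "C"): 0, ("C", "A"): 1}
r0 = {"A": 0, "B": 0, "C": 0}
arrival, required, slack = timing_labels(d, w, r0, T=10)
print(arrival, required, slack)
```

In this example the single combinational path A, B, C has total delay 9, so every gate has a slack of 1 under T = 10. Setting the label of C to 1 moves the FF from edge (C, A) to edge (B, C), which can be checked with `retimed_weight`.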
III. Motivating Examples
The potential slack heavily depends on the topology of the graph. In Fig. 2, the delay and slack are labeled beside each gate, with clock period T=15. When the delay of gate A is increased, the slack of gate B is reduced as well. Intuitively, if we assign slack to gate A, only 3 timing units can be used in total. But gate B and gate D are located on different paths, so we can increase the delays of gates B and D simultaneously, and the total potential slack can be as much as 6 timing units. Thus, slack needs to be budgeted in order to maximize the potential slack, with which the designer can predict the potential area/power reduction without having to go through actual low-level area/power optimization [19]. As one of the most powerful sequential optimization techniques, retiming has been used to minimize the clock period, area, and power. Besides those benefits, retiming can be used as an effective optimization method to improve the potential slack while preserving the circuit's functionality and timing performance. By studying the topological structure of circuits, we find three different scenarios in which the potential slack can be improved significantly by relocating the FFs.
Scenario I: make use of "slack" on an output. As shown in Fig. 3(a), no gate has any slack. On closer analysis, however, the signal arrival time on the output edge $(v_1, v_3)$ of gate $v_1$ is 2, while this output signal is only required to arrive at the input of the FF on edge $(v_1, v_3)$ before 10. Thus the timing "slack" on this output is 8. Since the signal on the other output $(v_1, v_2)$ is required to arrive before the required time of $v_2$ minus $d_2$, which equals 2, formulas (2) and (3) give a required time of 2 and a slack of $s_1 = 0$ for gate $v_1$. However, by moving the FF from $(v_1, v_3)$ to $(v_3, v_4)$, the timing "slack" on the output can be used by gates $v_3$ and $v_4$, as Fig. 3(b) shows. The total potential slack thus increases from 0 to 8.
Scenario II: redistribute the slack to the sub-graph with more branches. The sub-graph here refers to one of the sub-graphs into which the FFs divide graph G. The branches are in fact an independent set, whose vertices are not slack sensitive to each other (explained in Section IV). As Fig. 3(c) shows, $\{v_6, v_2\}$ is an independent set in the sub-graph $\{v_1, v_2, v_3, v_5, v_6\}$, while the size of the independent set in the other part is 1. Thus, when the FF is moved from edge $(v_4, v_7)$ to $(v_3, v_4)$ as in Fig. 3(d), the slack increases from 2 to $s_2 + s_6 = 3$.
Scenario III: move the FF to the gate side with more edges. Such a movement clearly increases the slack by at least $2d_1$, as shown in Fig. 3(e) and (f). However, this potential slack is gained at the expense of the additional chip area and power consumed by more FFs. Thus, a combined cost is used to estimate the real potential performance:
$(\text{combined potential slack}) = (\text{potential slack}) - (\text{FF cost constant}) \times n \quad (5)$
where the combined potential slack accounts for the cost of the added FFs, $n$ is the number of added FFs, and the constant reflects the impact of adding one more FF to the circuit. For the purpose of maximizing the potential slack, the constant can be set to the delay of an FF. If the chip area is very critical, the constant can be a very large number, so that Scenario III will not be considered.
It is worth mentioning that max-potential-slack retiming may become complicated when a negative-weight edge appears after relocating FFs. For such cases, we propose a more general method in Section V. Thus, solving the SSBR problem requires not only slack budgeting for the various circuit topologies produced by different FF locations, but also iterative retiming to find optimized FF locations that maximize the potential slack.
IV. Time Slack Budgeting for Estimating Potential Slack
To estimate the potential slack, a sensitive transitive closure graph is built for a synchronous circuit, and an MIS-based method is presented for budgeting slack.
A. Sensitive transitive closure graph
The essential difference between combinational circuits and synchronous sequential ones is that the latter have FFs on signal paths. An edge with FFs cannot be slack sensitive, because the signals on the two sides of an FF are in different stages of the sequential circuit. Thus, the concept of slack sensitivity can be borrowed from combinational circuits with some modifications. With the help of the arrival time, required time, and slack labels and the retiming label $r$ on each vertex of graph $G = (V, E, d, w)$
w ij + r j -r i = 0 and (a j -
where $w_{ij} + r_j - r_i = 0$ means that there is no FF on edge $(v_i, v_j)$. From formula (1), an edge with $a_j - a_i = d_j$ is a sensitive edge, on which the arrival time $a_j = a_i + d_j$ of gate $v_j$ actually depends. In the same way, from formula (2), an edge on which the required times of $v_j$ and $v_i$ differ by exactly $d_j$ is also sensitive, since the required time of gate $v_i$ depends on it. According to the definition, for the example in Fig. 3(d), edge $(v_1, v_2)$ is not slack sensitive ($r_1 = r_2 = 0$, $a_2$
B. Estimating Potential Slack (EPS)
After the arrival time, required time, and slack are labeled, all vertices with positive slack are collected into set Q. The sensitive transitive closure graph is then restricted to its induced graph on Q, and a maximum independent set of this induced graph is denoted SI. Then, each vertex in set Q can separately increase its delay by the minimum slack $\min\{s_i\}$ taken over these vertices without breaking the clock period constraint. We then iteratively find the set SI and increase the delay until Q is empty. Finally, the potential slack is the sum of the delay increases over all iterations. The potential slack budgeting algorithm is given in Fig. 5, and its optimality was proved in [12]. A well-known polynomial-time MIS solver by Dharwadker [27] is employed in step 4 of Fig. 5, which solves the MIS problem through its duality with minimum vertex cover.
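The budgeting loop described above can be sketched in Python on a single FF-free block. This is a simplified illustration under several assumptions, not the algorithm of Fig. 5: delays are integers, the gates in `d` are listed in topological order, the MIS is found by brute force instead of the solver of [27], and each iteration raises the delays in SI by one time unit, a deliberately conservative step chosen here in place of the minimum-slack step. All function names are hypothetical, and the timing recursion is the assumed one from standard static timing analysis.

```python
from itertools import combinations


def timing(d, edges, T):
    """Arrival, required, and slack labels on an FF-free DAG.

    Assumes the keys of d are listed in topological order.
    """
    pred = {v: [] for v in d}
    succ = {v: [] for v in d}
    for u, v in edges:
        succ[u].append(v)
        pred[v].append(u)
    a, q = {}, {}
    for v in d:
        a[v] = d[v] + max((a[u] for u in pred[v]), default=0)
    for v in reversed(list(d)):
        q[v] = min((q[x] - d[x] for x in succ[v]), default=T)
    return a, q, {v: q[v] - a[v] for v in d}


def sensitive_reach(d, edges, a, q):
    """For each gate, the gates reachable through slack-sensitive edges."""
    adj = {v: set() for v in d}
    for u, v in edges:
        # Sensitive if the edge is tight for arrival or for required time.
        if a[v] - a[u] == d[v] or q[v] - q[u] == d[v]:
            adj[u].add(v)
    reach = {}
    for v in reversed(list(d)):  # successors are processed first
        reach[v] = set(adj[v])
        for x in adj[v]:
            reach[v] |= reach[x]
    return reach


def max_independent_set(nodes, reach):
    """Brute-force maximum independent set; only for tiny examples."""
    nodes = list(nodes)
    for k in range(len(nodes), 0, -1):
        for cand in combinations(nodes, k):
            if all(v not in reach[u] and u not in reach[v]
                   for u, v in combinations(cand, 2)):
                return set(cand)
    return set()


def estimate_potential_slack(d, edges, T):
    """Return (potential slack, budgeted delays) using unit steps."""
    d = dict(d)
    total = 0
    while True:
        a, q, s = timing(d, edges, T)
        Q = [v for v in d if s[v] > 0]
        if not Q:
            return total, d
        SI = max_independent_set(Q, sensitive_reach(d, edges, a, q))
        for v in SI:
            d[v] += 1  # conservative one-unit budget per iteration
        total += len(SI)


# Hypothetical block: gate A drives gates B and D, clock period T = 6.
print(estimate_potential_slack({"A": 1, "B": 2, "D": 2},
                               [("A", "B"), ("A", "D")], T=6))
```

In this hypothetical block every gate starts with a slack of 3. Budgeting gate A alone would yield 3 time units, whereas the independent set {B, D} absorbs 3 units on each branch, for a total of 6.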
We still take the example in Fig. 4 
V. Simultaneous Retiming and Slack Budgeting

A. General form for SSBR
In Section III, we enumerated three scenarios that improve the potential slack by relocating FFs. However, for the case shown in Fig. 3(a), if vertex $v_1$ is retimed with label $r_1 = 1$ without considering the clock period (Fig. 6(a)), the new weight of edge $(v_1, v_2)$ is $w_{12} + r_2 - r_1 = -1$ according to formula (4). This means that edge $(v_1, v_2)$ needs to borrow an FF from the outgoing edges of vertex $v_2$ in order to legalize the retiming. Thus, with $r_2 = 1$, the retiming result is as Fig. 6(b) shows. This procedure is called retiming legalization. Retiming legalization usually relocates FFs in other parts of the circuit, which may in turn affect the slack distribution. In this section, therefore, a more general method is proposed to increase the potential slack with retiming under a specific period constraint. Simultaneous slack budgeting and retiming can be expressed in the following form.
Max $P_s \quad (7)$
subject to constraints (8) and (9). Formula (8) requires that the number of FFs on each edge is not negative. Formula (9) requires that the delay of the critical path satisfies the timing constraint T.
B. Incremental retiming with possible moves
Therefore, we can relocate the FFs iteratively without violating the timing constraint and estimate the potential slack for each relocation to find an optimal one.
1) If the movement $m_i$ is positive, the delay $d_i$ of $v_i$ will be added to the timing path starting with gate $v_j \in FO(v_i)$. Thus, in order to keep condition (9) satisfied, the slack $s_j$ of gate $v_j$ must be no smaller than the delay $d_i$. Hence, condition (10) must be satisfied.
2) In the same way, if the movement $m_i$ is negative, the delay $d_i$ of $v_i$ will be added to the timing path ending with gate $v_j \in FI(v_i)$. To keep condition (9) satisfied, the arrival time $a_j$ of gate $v_j$ plus the delay $d_i$ must not exceed the period T. That is, condition (11) must be satisfied.
Although conditions (10) and (11) can greatly reduce the search space, they are insufficient for condition (8), so some movements have to be discarded after retiming legalization and arrival time calculation, which are discussed later.
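The two screening conditions can be sketched as a small Python filter. Conditions (10) and (11) are not reproduced above, so the code follows only their verbal description: a positive move of gate i needs a slack of at least d_i on its outgoing gates, and a negative move needs the arrival time plus d_i to stay within T on its incoming gates. Applying each test to every fanout or fanin gate is an assumption, and the function name `possible_moves` and the input dictionaries are hypothetical.

```python
def possible_moves(d, arrival, slack, fanin, fanout, T):
    """Collect candidate unit movements as (gate, +1) or (gate, -1).

    Positive move: every outgoing gate j must satisfy slack[j] >= d[i].
    Negative move: every incoming gate j must satisfy arrival[j] + d[i] <= T.
    Gates without fanout (or fanin) are skipped for that direction.
    """
    moves = []
    for i in d:
        if fanout[i] and all(slack[j] >= d[i] for j in fanout[i]):
            moves.append((i, +1))
        if fanin[i] and all(arrival[j] + d[i] <= T for j in fanin[i]):
            moves.append((i, -1))
    return moves


# Hypothetical two-gate fragment u -> v with precomputed labels.
d = {"u": 2, "v": 3}
fanout = {"u": ["v"], "v": []}
fanin = {"u": [], "v": ["u"]}
arrival = {"u": 2, "v": 8}
slack = {"u": 0, "v": 2}
print(possible_moves(d, arrival, slack, fanin, fanout, T=10))
```

These tests only prune the candidate set. As the text notes, a surviving move must still pass retiming legalization and a fresh arrival time check before it is accepted.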
When FFs are relocated by a positive movement $m_i \in M$, some outgoing edges may acquire negative weights. As described in subsection A, retiming legalization may borrow FFs from the outgoing edges of the succeeding gates on the directed path. It is applied recursively by invoking formula (12) until all edge weights are non-negative.
In the same way, when FFs are relocated by a negative movement $m_i$ and negative-weight edges occur, we legalize the retiming by borrowing FFs from the incoming edges of the preceding gates. This retiming legalization applies formula (13) recursively.
$r_j = r_i + w_{ji}, \quad v_j \in FI(v_i), \quad \text{if } w_{ji} + r_i - r_j < 0 \quad (13)$
After the retiming legalization, the arrival time of each gate is calculated by formula (1). If condition (9) is not satisfied, the movement $m_i$ is not considered. If it is satisfied, an incremental retiming is performed based on the movement $m_i$.
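Retiming legalization can be sketched in Python as a worklist propagation. The backward rule follows formula (13). Formula (12) is not reproduced above, so the forward rule $r_j = r_i - w_{ij}$ is an assumption obtained by mirroring (13), and it agrees with the Fig. 6 example in which $r_1 = 1$ forces $r_2 = 1$. The function name `legalize`, the `max_steps` guard, and the four-gate example are hypothetical. The propagation terminates when every cycle carries at least one FF.

```python
from collections import deque


def legalize(w, r, moved, direction, max_steps=10_000):
    """Repair negative edge weights after changing r[moved].

    w: (u, v) -> FF count, r: gate -> retiming label.
    direction > 0: push labels forward (assumed mirror of formula (13)).
    direction < 0: push labels backward, as in formula (13).
    Returns a new label dict; the input r is left unchanged.
    """
    r = dict(r)
    succ, pred = {}, {}
    for (u, v) in w:
        succ.setdefault(u, []).append(v)
        pred.setdefault(v, []).append(u)

    queue = deque([moved])
    steps = 0
    while queue:
        steps += 1
        if steps > max_steps:
            raise RuntimeError("legalization did not converge")
        i = queue.popleft()
        if direction > 0:
            for j in succ.get(i, []):
                if w[(i, j)] + r[j] - r[i] < 0:
                    r[j] = r[i] - w[(i, j)]  # edge (i, j) now holds 0 FFs
                    queue.append(j)
        else:
            for j in pred.get(i, []):
                if w[(j, i)] + r[i] - r[j] < 0:
                    r[j] = r[i] + w[(j, i)]  # edge (j, i) now holds 0 FFs
                    queue.append(j)
    return r


# Hypothetical circuit: v1 feeds v2 and v3, both feed v4, v4 feeds back to v1.
w = {("v1", "v2"): 0, ("v1", "v3"): 1, ("v2", "v4"): 1,
     ("v3", "v4"): 0, ("v4", "v1"): 1}
r = {"v1": 1, "v2": 0, "v3": 0, "v4": 0}  # positive move on v1
legal = legalize(w, r, "v1", +1)
print(legal)
print({e: w[e] + legal[e[1]] - legal[e[0]] for e in w})
```

After legalization the second printed dictionary lists the FF count of every edge, none of which is negative. The total FF count around each cycle is unchanged, as expected for a retiming.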
[Figure: algorithm flow. Collect possible movements M by (10) and (11); retiming legalization by (12) and (13).]
C. Iterative incremental retiming flow
The EPS subroutine is called to estimate the potential slack after each incremental retiming with the considered movements in M. Only a movement that improves the potential slack is accepted, after which the retiming label r is updated. The algorithm iteratively collects the possible FF movements M and takes a movement that increases the potential slack. It terminates when the potential slack can no longer be increased; its flow is illustrated in Fig. 7. An example in Fig. 8 is used to clarify our idea of incremental retiming for maximizing potential slack. Detailed execution results are listed step by step in the following table, where M is the set of movements represented by r, SI is the maximum independent set of the induced graph on the positive-slack vertices, and $P_s$ is the potential slack. In the end, the potential slack increases from 3 to 8. (Table note: condition (10) is unsatisfied, and isImproved is false.)
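The accept-if-improved loop described above can be sketched generically in Python. The three callables stand in for the steps of the flow: collecting possible movements, applying a movement with legalization and the timing check, and the EPS estimate. Their names and signatures are hypothetical. Accepting the first improving movement, rather than the best one, is an assumption, and the toy objective at the end only exercises the loop and is not a circuit model.

```python
def iterative_retiming(r0, candidate_moves, apply_move, estimate):
    """Greedy search over retiming labels.

    candidate_moves(r) -> iterable of moves worth trying from label r
    apply_move(r, m)   -> new label dict, or None if the move is rejected
                          (for example, the period constraint is violated)
    estimate(r)        -> potential slack of label r (the EPS role)
    Returns the final label and its estimated potential slack.
    """
    r = dict(r0)
    best = estimate(r)
    is_improved = True
    while is_improved:
        is_improved = False
        for move in candidate_moves(r):
            trial = apply_move(r, move)
            if trial is None:
                continue  # illegal or timing-violating movement
            p = estimate(trial)
            if p > best:  # accept only strict improvements
                r, best, is_improved = trial, p, True
                break  # re-collect the movements from the new label
    return r, best


# Toy stand-ins: one label x, unit moves, a concave objective peaking at x = 3.
def toy_moves(r):
    return [+1, -1]


def toy_apply(r, m):
    return {"x": r["x"] + m}


def toy_estimate(r):
    return 8 - (r["x"] - 3) ** 2


print(iterative_retiming({"x": 0}, toy_moves, toy_apply, toy_estimate))
```

Because only strict improvements are accepted and the estimate is bounded in any finite circuit, a loop of this shape terminates once no candidate raises the estimate.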
VI. Experimental Results
The algorithm is implemented in C++ and tested on a Linux server with eight 3.0-GHz cores and 6 GB of memory. Nineteen cases from the ISCAS89 benchmarks are tested; the name, number of gates, number of signal passes, maximum number of gate outputs/inputs, and minimum period of each case are given in Table I.
In the experiments, a min-period retiming algorithm [3] is employed to obtain the minimum period of the circuits, and the potential slack is estimated with the resulting FF locations fixed; these are listed in the 2nd and 3rd columns of Table II. Simultaneous slack budgeting and retiming is first performed on the test cases under the min-period timing constraint. The maximized potential slack, the increase ratio of potential slack, and the runtime are given for each case. The average increase ratio of potential slack is 19.6% under the timing constraint of the minimum period $T_{min}$, in reasonable runtime. We then relax the timing constraint and run the algorithm under timing constraints of $(1+2\%)\,T_{min}$ and $(1+5\%)\,T_{min}$. Since the clock period may not reach the bound of the timing constraint, the actual clock period after retiming, the potential slack, the slack increase ratio compared with $P_0$, and the runtime are given in Table II for both experiments with timing constraint relaxation. The average increase ratios are 19.89% and 28.16% with a small sacrifice of timing performance (the average period increases by 0.52% and 2.08%), which can give designers some hints about the tradeoff between fast and low-power circuits. Fig. 9 gives a direct comparison between $P_0$ and the maximized potential slack under the three timing constraints, normalized by $P_0$.
Since the room for slack optimization by retiming depends greatly on the circuit topology, the improvement varies from 0 to 214.29%. For s27.test, the circuit is small, so the increase ratio appears relatively high. Besides s27.test, some cases also achieve considerable improvement, such as 58.76% and 20.88%, while others show no improvement at all. Furthermore, relaxing the timing constraint may not influence the slack budget at all. On average, we obtain about 9% extra slack budget (19.6% vs. 28.16%) with only 2% performance degradation (80.95 vs. 82.63).
VII. Conclusions
In this work, we extend the slack sensitive closure graph from combinational circuits to synchronous sequential circuits, on which an MIS-based slack budgeting is performed. A simultaneous incremental retiming and slack budgeting algorithm, i.e. max-potential-slack retiming, is presented, which is heuristic yet effective. Compared with the potential slack obtained by a min-period retiming [3], the experimental results show that our algorithm improves the potential slack by 19.6% on average without degrading timing performance, in reasonable runtime. We also perform max-potential-slack retiming with timing constraint relaxation of 2% and 5%, and the potential slack increases by 19.89% and 28.16% respectively, which gives a view of the tradeoff between a fast and a low-power design. Since potential slack is a good predictor of potential performance in gate-level synthesis [19], retiming for maximal power reduction is our future work to put its effectiveness into practice.
