Clock skew scheduling is a technique that intentionally introduces skews to memory elements to improve the performance of a sequential circuit. It was shown in [21] that the full optimization potential of clock skew scheduling can be reliably implemented using a few skew domains. In this paper we present an optimal skew scheduling algorithm for sequential circuits with flip-flops. Given a finite set of prescribed skew domains, the algorithm finds a domain assignment for each flip-flop such that the clock period is minimized with possible delay padding. Experimental results validate the efficiency of our algorithm and show 17% improvement on average in clock period.
Introduction
In a sequential circuit, due to the differences of interconnect delays in the clock distribution network, clock signals do not arrive at all flip-flops at the same time. The consequent differences in clock arrival times are also known as clock skews. Since the setup and hold constraints of a sequential circuit are complicated by clock skews, an approach that has been followed by [12, 13, 25, 19, 20] is to deliberately design the clock distribution network so as to ensure zero clock skew.
Clock skew scheduling [10] , on the other hand, views clock skew as a manageable resource rather than a liability. It intentionally introduces skews to flip-flops to improve the circuit performance. The designated skews are then implemented by specific layout of the clock distribution network. However, in practice, a skew schedule with a large set of arbitrary values cannot be realized in a reliable manner. This is because the implementation of dedicated delays using additional buffers and interconnects is highly susceptible to intra-die variations of process parameters.
Instead of tuning clock skews of flip-flops, retiming [15] physically relocates flip-flops to balance the delays without changing the functionality of the circuit. It was observed in [10] that retiming and clock skew scheduling are discrete and continuous optimizations with the same effect. The equivalence between retiming and skew has been used in prior research [17, 16, 6, 22]. Although retiming is a powerful sequential optimization technique, its practical usage is limited due to the impact on the verification methodology, i.e., equivalence checking and functional simulation. Furthermore, the use of retiming for maximum performance may cause a steep increase in the number of flip-flops [9], requiring a larger effort for clock distribution and resulting in higher power consumption.
* This work was done at Northwestern University and supported by the NSF under CCR-0238484.
Recently, multi-domain clock skew scheduling was proposed in [21] . Multiple clocking domains are routinely applied in designs to realize several clocking frequencies and also to address specific timing requirements. For example, a special clocking domain that delivers a phase-shifted clock signal to the flip-flops close to the chip inputs and outputs is regularly used to achieve timing closure for ports with extreme constraints on their arrival and required times. The motivation behind the multi-domain skew scheduling is based on the fact that large phase shifts between clocking domains can be implemented reliably by using dedicated, possibly expensive circuit components such as "structured clock buffers" [5] , adjustments to the PLL circuitry, or simply by deriving the set of phase-shifted domains from a higher frequency clock using different tapping points of a shift flip-flop. In [21] , Ravindran et al. showed that a clock skew schedule using a few domains combined with a small within-domain latency can reliably implement the full optimization potential of clock skew scheduling. They proposed an algorithm based on a branch-and-bound search to assign flip-flops to clock domains and used a modified Burns's algorithm [4] to compute the skews.
For a user-given number of domains, the algorithm in [21] computed the optimal skew for each domain. However, the user had no control on the distribution of the domains. Furthermore, the algorithm did not consider delay padding [23] , which is a technique that fixes hold violations by inserting extra delays on short paths without increasing the delay on any long path. In other words, the clock period obtained by the algorithm in [21] may be sub-optimal if delay padding is allowed, as demonstrated in [14, 8, 24] .
In this paper we formulate the clock skew scheduling problem on a user-given finite set of prescribed clock domains. For example, one can require the skews of flip-flops to be either zero or half the clock period. One can also choose clock domains based on the results of the algorithm in [21] . We propose a polynomial-time algorithm that finds an optimal domain assignment for each flip-flop such that the clock period is minimized with possible delay padding. The obtained skew schedule respects the user requirement. We then consider how to insert extra paddings such that both setup and hold constraints are satisfied under the minimal clock period. We show that the existence of such a padding solution is guaranteed, and present an approach to find a padding by network flow technique. Experimental results confirm the efficiency of our algorithm.
The rest of the paper is organized as follows. Section 2 presents a motivating example and the problem formulation. Notations and constraints are explained in Section 3. Our algorithm is elaborated in Section 4. Experimental results are presented in Section 5, followed by conclusions in Section 6.
Motivation and problem formulation
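[Figure 1 diagram: flip-flops i and j; the combinational block from i to j has delay range [2, 3] and contains point A; the block from j to i has delay range [4, 7].]
Figure 1: Effect of clock skew scheduling and delay padding on circuit performance.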
We use Figure 1 as an example to illustrate the effect of clock skew scheduling and delay padding on circuit performance. In this example, we have two flip-flops i and j, both triggered at the falling edge of a clock. The ellipses between the flip-flops represent the combinational blocks with their minimum and maximum delays. Suppose that the setup and hold times are all zero, and that the skew to each flip-flop can be either zero or half the period. In order for the circuit to operate under a given period T , the following conditions must be satisfied, where the first two are from the setup constraint and the other two are from the hold constraint.

$$
\begin{aligned}
\mathrm{skew}(i) + 3 &\le \mathrm{skew}(j) + T,\\
\mathrm{skew}(j) + 7 &\le \mathrm{skew}(i) + T,\\
\mathrm{skew}(i) + 2 &\ge \mathrm{skew}(j),\\
\mathrm{skew}(j) + 4 &\ge \mathrm{skew}(i).
\end{aligned}
$$
Depending on the skew assignment, we have three cases. Firstly, skew(i) = skew(j) leads to T ≥ 7. Secondly, for skew(i) = T /2 and skew(j) = 0, we have 6 ≤ T ≤ 8. Thirdly, for skew(i) = 0 and skew(j) = T /2, the setup constraint requires T ≥ 14 while the hold constraint needs T ≤ 4. In other words, there is no T satisfying both constraints. However, if we insert an extra delay of 5 at point A, then the minimum and maximum delays from i to j become 7 and 8 respectively, which results in a feasible period of T = 14.
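The three cases can be checked mechanically. The following is a minimal sketch, assuming zero setup and hold times as in the example and expressing each skew as a fraction of T; the function name `period_range` and its argument layout are illustrative choices, not part of the paper.

```python
from fractions import Fraction
import math

def period_range(a_i, a_j, delays_ij, delays_ji):
    """Return the feasible period interval (lo, hi), or None if it is empty.

    a_i, a_j: skew of each flip-flop as a fraction of T (0 or 1/2 here).
    delays_ij, delays_ji: (min, max) combinational delay in each direction.
    Setup and hold times are assumed to be zero, as in the example.
    """
    lo, hi = Fraction(0), math.inf
    for a_src, a_dst, (d_min, d_max) in ((a_i, a_j, delays_ij),
                                         (a_j, a_i, delays_ji)):
        shift = a_dst - a_src
        # Setup: a_src*T + d_max <= a_dst*T + T  =>  T >= d_max / (1 + shift)
        lo = max(lo, Fraction(d_max) / (1 + shift))
        # Hold: a_src*T + d_min >= a_dst*T; it only binds when shift > 0
        if shift > 0:
            hi = min(hi, Fraction(d_min) / shift)
    return (lo, hi) if lo <= hi else None

half = Fraction(1, 2)
for name, args in [("equal skews", (0, 0, (2, 3), (4, 7))),
                   ("skew(i) = T/2", (half, 0, (2, 3), (4, 7))),
                   ("skew(j) = T/2", (0, half, (2, 3), (4, 7))),
                   ("skew(j) = T/2, padded", (0, half, (7, 8), (4, 7)))]:
    print(name, period_range(*args))
```

The last call models the extra delay of 5 at point A by raising the i-to-j delay range from [2, 3] to [7, 8].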
The above example reveals two things. Firstly, assigning skews to flip-flops may help to reduce the period of a circuit. On the other hand, caution should be taken when choosing the skews since the circuit may end up having no feasible period at all. Secondly, delay padding can be used to remedy a skew assignment so that it permits feasible periods after the insertion of extra delays. In some cases, delay padding is required to reach a smaller period. For the above example, if the minimum delay from j to i is not 4 but 2, then a delay of 1 needs to be inserted on the minimum delay path to obtain the optimal period 6. This motivates us to solve a problem formulated as follows.
Problem 1 (Optimal Skew Scheduling)
Given a sequential circuit and a finite set of prescribed skew domains, find a domain assignment for each flip-flop such that the circuit satisfies both setup and hold constraints with possible delay padding under the minimal clock period.
For simplicity, we assume that flip-flops are triggered at the same clock edge of a single phase clock. However, the proposed approach can be extended to multiple clock phases.
Notations and constraints
Suppose that we are given N skew domains. Without loss of generality, we assume that the skew values of the N domains are s0T, s1T, ..., sN−1T with respect to the period T , where s0, s1, ..., sN−1 are constants between 0 and 1 in the increasing order, i.e., 0 = s0 < s1 < ... < sN−1 < 1. In particular, the domains are
A directed graph G = (V, E) is used to represent a sequential circuit, where V is the set of gates and flip-flops, and E is the set of interconnects. Each gate v ∈ V has a maximum delay D(v) and a minimum delay d(v). Each interconnect e ∈ E has a delay w(e). Delay padding increases w(e) by inserting extra delays on e. For any combinational path p = u ⇝ v, we use D(p) to represent the maximum delay along p without padding, which is the sum of the constituent interconnect delays and maximum gate delays, except for D(u). The minimum delay along p without padding is denoted by d(p). With extra paddings on p, the maximum and minimum delays become ∆(p) and δ(p) respectively. Note that
We also construct a timing graph Gt = (Vt, Et) of G as follows. Let Vt ⊂ V be the set of flip-flops in the circuit. An edge (i, j) is introduced in Et if there is a combinational path p in G from flip-flop i to flip-flop j. We define

$$D(i,j) = \max_{p:\, i \leadsto j} D(p), \qquad d(i,j) = \min_{p:\, i \leadsto j} d(p).$$
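In other words, D(i, j) and d(i, j) are the maximum and minimum combinational delays from i to j without padding, respectively. They become ∆(i, j) and δ(i, j) with padding. We say that p is a long path from i to j if D(p) = D(i, j), and a short path if d(p) = d(i, j). We use X(i) and H(i) to denote the setup and hold times at flip-flop i respectively. A label l : Vt → {0, ..., N − 1} is introduced to represent the index of the domain that a flip-flop is assigned to. Using these notations, the setup and hold constraints under a given period T can be formulated as follows.

$$0 \le l(i) \le N-1, \quad \forall i \in V_t \tag{1}$$

$$s_{l(i)}\,T + \Delta(i,j) + X(j) \le s_{l(j)}\,T + T, \quad \forall (i,j) \in E_t \tag{2}$$

$$s_{l(i)}\,T + \delta(i,j) \ge s_{l(j)}\,T + H(j), \quad \forall (i,j) \in E_t \tag{3}$$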
A partial order (≤) can be defined between two labels l and l′ as follows: l ≤ l′ if and only if l(i) ≤ l′(i), ∀i ∈ Vt.
We say that T is a feasible period if and only if we can find an l and a delay padding such that (1)- (3) are satisfied under T . The optimal skew scheduling problem asks for the minimal feasible T , with possible padding insertion.
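To make the feasibility test concrete, the sketch below checks a given period, domain assignment and set of padded delays. It assumes the setup constraint has the form s_l(i) T + ∆(i, j) + X(j) ≤ s_l(j) T + T, which is the form implied by the rewritten setup bound in Section 4, and a symmetric hold constraint s_l(i) T + δ(i, j) ≥ s_l(j) T + H(j). The function name and the data layout are hypothetical.

```python
from fractions import Fraction

def is_feasible(T, s, l, edges, X, H):
    """Check a period T against the label range, setup and hold constraints.

    s:     list of domain fractions, so domain n has skew s[n] * T
    l:     dict flip-flop -> domain index
    edges: dict (i, j) -> (delta_min, delta_max), the padded delays
    X, H:  dicts flip-flop -> setup time / hold time
    """
    # Label range: every flip-flop must map to one of the N domains.
    if any(not 0 <= l[v] < len(s) for v in l):
        return False
    for (i, j), (d_min, d_max) in edges.items():
        launch, capture = s[l[i]] * T, s[l[j]] * T
        if launch + d_max + X[j] > capture + T:   # setup violated
            return False
        if launch + d_min < capture + H[j]:       # hold violated
            return False
    return True

# Figure 1 example: skew(i) = T/2, skew(j) = 0, zero setup and hold times.
s = [Fraction(0), Fraction(1, 2)]
l = {"i": 1, "j": 0}
edges = {("i", "j"): (2, 3), ("j", "i"): (4, 7)}
zero = {"i": 0, "j": 0}
print(is_feasible(6, s, l, edges, zero, zero))
print(is_feasible(9, s, l, edges, zero, zero))
```

With this assignment the example in Section 2 gives 6 ≤ T ≤ 8, so the first call accepts T = 6 and the second rejects T = 9.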
Algorithm
In order to find the minimum feasible T , we first compute a lower bound T_lb for it in Section 4.1. Then we observe that for any l satisfying (1), since s_n < 1, ∀n ∈ [0, N − 1], we have 1 + s_{l(j)} − s_{l(i)} > 0, ∀(i, j) ∈ Et. Thus, the setup constraint (2) can be written as

$$T \ge \frac{\Delta(i,j) + X(j)}{1 + s_{l(j)} - s_{l(i)}}.$$

Together with ∆(i, j) ≥ D(i, j), it characterizes a minimal period T_S under setup constraint only. We propose an algorithm in Section 4.2 to compute T* = max(T_lb, T_S) and the corresponding l*. We then show in Section 4.3 that there always exists a padding solution such that both setup and hold constraints are satisfied under T* and l*. Therefore, T* and l* are the solution to the optimal skew scheduling problem.
Lower bound for feasible period
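Consider any combinational path p from i ∈ Vt to j ∈ Vt. Since ∆(i, j) ≥ ∆(p) and δ(i, j) ≤ δ(p), from the definition of ∆(i, j) and δ(i, j), (2)-(3) imply that

$$s_{l(i)}\,T + \Delta(p) + X(j) \le s_{l(j)}\,T + T, \qquad s_{l(i)}\,T + \delta(p) \ge s_{l(j)}\,T + H(j).$$

Subtracting the second one from the first yields

$$\Delta(p) - \delta(p) \le T - X(j) - H(j).$$

Since padding adds the same extra delay to both the maximum and the minimum delay of p, we have ∆(p) − δ(p) = D(p) − d(p), which gives a lower bound for T in the next lemma.
Lemma 1 A feasible clock period T must satisfy

$$T \ge T_{lb} = \max_{j \in V_t}\ \max_{p \leadsto j} \big( D(p) - d(p) + X(j) + H(j) \big).$$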
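To compute T_lb, let θ(v) be the difference between the maximum and the minimum delays at gate v, defined as

$$\theta(v) = D(v) - d(v).$$

For flip-flop j, we define

$$\theta(j) = X(j) + H(j).$$

Let Θ(v) be the length of the longest combinational path terminating at v with respect to θ, i.e.,

$$\Theta(v) = \max_{(u,v) \in E} \Theta(u) + \theta(v). \tag{5}$$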
Then, finding T lb is equivalent to computing the maximum Θ(j), ∀j ∈ Vt, which can be done by longest path computation in O(|E| + |V | log |V |) time [7] .
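The longest-path computation can be sketched as a single pass in topological order. The code below assumes that θ of a gate is D(v) − d(v), that θ of a flip-flop input is X(j) + H(j) (an assumption consistent with the value T lb = 4 reported in the experiments), and that path sources have Θ = 0. The function name and the node naming are hypothetical, and `graphlib` requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter

def lower_bound(preds, theta, sinks):
    """Longest path with respect to theta over a combinational network.

    preds: dict node -> list of fan-in nodes (sources need no entry)
    theta: dict node -> weight of every non-source node
    sinks: flip-flop input nodes at which the bound is evaluated
    """
    big_theta = {}
    # static_order() yields every node after all of its predecessors.
    for v in TopologicalSorter(preds).static_order():
        fanin = preds.get(v, ())
        if fanin:
            big_theta[v] = theta[v] + max(big_theta[u] for u in fanin)
        else:
            big_theta[v] = 0  # path source (flip-flop output)
    return max(big_theta[j] for j in sinks)

# Figure 1 example: block i->j has delays [2, 3], block j->i has [4, 7],
# and setup and hold times are zero.
preds = {"g_ij": ["i_out"], "j_in": ["g_ij"],
         "g_ji": ["j_out"], "i_in": ["g_ji"]}
theta = {"g_ij": 3 - 2, "g_ji": 7 - 4, "j_in": 0, "i_in": 0}
print(lower_bound(preds, theta, ["i_in", "j_in"]))
```

Each flip-flop is split into an output node (a source) and an input node (a sink) so that the network stays acyclic.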
Minimum period under setup constraint
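We use T_S to denote the minimal period under which the setup constraint is satisfied without padding, i.e., the smallest T for which some l satisfying (1) also satisfies

$$\big(1 + s_{l(j)} - s_{l(i)}\big)\,T \ge D(i,j) + X(j), \quad \forall (i,j) \in E_t. \tag{6}$$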
Let T * = max(T lb , TS). We use l * to denote a domain assignment satisfying (1) and (6) under T * . For example, the corresponding domain assignment under TS is such an l * .
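To find an l*, we start with l(i) = 0, ∀i ∈ Vt, and obtain the minimal period T that satisfies the setup constraint without padding under the current l, i.e.,

$$T = \max_{(i,j) \in E_t} \frac{D(i,j) + X(j)}{1 + s_{l(j)} - s_{l(i)}}.$$

If T ≤ T_lb, then T* = T_lb, and thus the current l is an l*. Otherwise, let (x, y) ∈ Et be the edge attaining the maximum, so that

$$\big(1 + s_{l(y)} - s_{l(x)}\big)\,T = D(x,y) + X(y).$$

On the other hand, T* and l* should satisfy the setup constraint on (x, y), i.e., (1 + s_{l*(y)} − s_{l*(x)}) T* ≥ D(x, y) + X(y). As a result, if T > T*, we have

$$s_{l^*(y)} - s_{l^*(x)} > s_{l(y)} - s_{l(x)}. \tag{7}$$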
We can move l closer to l * by increasing l(y). The amount of increase should only be 1 since we do not want to overadjust l. This process is iterated until the optimality of T is certified. We present the pseudocode in Figure 2 .
[Figure 2: Pseudocode of "MinPeriod". Recoverable step: Extract (x, y) from Et with max […].]

The next lemma states an invariant that is kept throughout "MinPeriod".
Lemma 2 l ≤ l * is kept during the execution of "MinPeriod" in Figure 2 before we reach an l * .
Proof: First of all, l ≤ l * before we enter the while loop since we initialize l(i) = 0, ∀i ∈ Vt. What remains is to show that l ≤ l * is preserved after l(y) is increased by 1 for some y ∈ Vt until T * is reached. Assume that T > T * , thus (7) is true. Since l(x) ≤ l * (x) and l(y) ≤ l * (y) due to l ≤ l * , we have l(y) < l * (y), otherwise l(y) = l * (y), which leads to l(x) > l * (x), which is a contradiction. Therefore, l ≤ l * is kept after the increase of l(y). The lemma is true.
The correctness and complexity of "MinPeriod" are given in the following theorem.
Theorem 1
The procedure "MinPeriod" in Figure 2 will terminate in O(N |Vt|Bt log |Et|) time, where Bt is the maximum incoming and outgoing degrees of the vertices in Vt. Upon termination, it gives T * = max(T lb , TS), and an l * satisfying (1) and (6) under T * .
Proof:
The outer while loop cannot be executed more than (N − 1)|Vt| times since each traversal results in an increase in l(y) for some y ∈ Vt. The complexity of extracting the edge (x, y) in Et with the maximum (D(x, y) + X(y)) / (1 + s_{l(y)} − s_{l(x)}) is O(log |Et|) if we choose Fibonacci heap [7]. After l(y) is increased by 1, we need to adjust the values of (D(i, y) + X(y)) / (1 + s_{l(y)} − s_{l(i)}) for all incoming edges (i, y) ∈ Et to y, and the values of (D(y, j) + X(j)) / (1 + s_{l(j)} − s_{l(y)}) for all outgoing edges (y, j) ∈ Et from y, which takes O(Bt log |Et|) time. Therefore, the overall complexity is O(N |Vt| Bt log |Et|).
When it terminates, we have either T* = T_lb or l(y) = N > l*(y) for some y ∈ Vt. In the first case, we have T_S ≤ T_lb, thus T* = max(T_lb, T_S) is true. By Lemma 2, the second case implies that we have already reached an l* and gone beyond it. Therefore, the obtained T* is T_S. Since T* > T_lb, T* = max(T_lb, T_S) is also true. In both cases, the obtained l* satisfies (1) and (6) under T*. Since padding increases D(i, j), we have ∆(i, j) ≥ D(i, j), ∀(i, j) ∈ Et. Therefore, T* is a lower bound for any feasible period with padding.
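The iteration analyzed in Theorem 1 can be sketched as follows. This is not the pseudocode of Figure 2, which did not survive in this copy: it is a minimal reading of the surrounding text that scans all edges with `max` instead of using a heap, and that remembers the smallest period seen so far in order to return it once some label would have to leave the range {0, ..., N − 1}. That bookkeeping, the function name and the data layout are all assumptions.

```python
from fractions import Fraction

def min_period(setup_delay, s, t_lb, flip_flops):
    """Return (T, l) following the label-raising iteration.

    setup_delay: dict (i, j) -> D(i, j) + X(j) for every timing-graph edge
    s:           increasing list of domain fractions, with s[0] == 0
    t_lb:        lower bound on any feasible period
    flip_flops:  iterable of flip-flop names
    """
    l = {v: 0 for v in flip_flops}          # start with every label at 0
    best_T, best_l = None, None

    def ratio(edge):
        i, j = edge
        return Fraction(setup_delay[edge]) / (1 + s[l[j]] - s[l[i]])

    while True:
        x, y = max(setup_delay, key=ratio)  # critical edge under current l
        T = ratio((x, y))
        if best_T is None or T < best_T:    # assumed bookkeeping, see lead-in
            best_T, best_l = T, dict(l)
        if T <= t_lb:                       # the lower bound is attained
            return t_lb, dict(l)
        if l[y] == len(s) - 1:              # l(y) cannot be raised further
            return best_T, best_l
        l[y] += 1                           # move l one step closer to l*

# Figure 1 example: D(i, j) = 3, D(j, i) = 7, zero setup times, T_lb = 3.
T, l = min_period({("i", "j"): 3, ("j", "i"): 7},
                  [Fraction(0), Fraction(1, 2)], 3, ["i", "j"])
print(T, l)
```

On the Figure 1 example the sketch raises l(i) first, reaches period 6 with skew(i) = T/2 and skew(j) = 0, and stops when l(i) would have to exceed the last domain. Replacing the linear scan with a priority queue keyed on the ratio would be needed to approach the bound stated in Theorem 1.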
Padding for hold constraint
Given T* and l* from "MinPeriod", we will check the hold constraint (3) without padding. If we have a hold violation at some j ∈ Vt, it means that there is a short path p from i ∈ Vt to j such that s_{l*(i)} T* + d(p) < s_{l*(j)} T* + H(j).
Intuitively, if p has an interconnect that does not lie on any long path, then we can insert extra delays on it to fix the hold violation. The following lemma [23] provides the condition under which the existence of such an interconnect is guaranteed.

6A-1
Lemma 3 Let p be a short path to j ∈ Vt, where a hold violation occurs under T* and l*. There exists an edge e on p such that e does not lie on any long path if D(p) − d(p) ≤ T* − X(j) − H(j).
The next result is a corollary of the above lemma.
Corollary 3.1 There exists a delay padding solution under T * and l * satisfying both setup and hold constraints.
Proof: Since T* ≥ T_lb, the definition of T_lb in Lemma 1 implies that D(p) − d(p) ≤ T* − X(j) − H(j) for every path p to j, ∀j ∈ Vt. Consequently, if a hold violation occurs at j under T* and l*, Lemma 3 ensures that we can successively identify interconnects for padding insertion without affecting any long path until the hold violation is fixed.
Based on Corollary 3.1 and the fact that T * is a lower bound for any feasible period with padding, we know that T * and l * are the solution to the optimal skew scheduling problem.
To find a padding solution, we can treat flip-flop outputs as primary inputs (PIs) and flip-flop inputs as primary outputs (POs), and find a padding for each individual combinational component. To ease the presentation, we will focus on padding for a combinational component Gc = (Vc, Ec) ⊆ G under T * and l * . For each gate v ∈ Gc, we use A(v) and a(v) to denote the latest and the earliest arrival times at the output of v, which are the longest and the shortest combinational delays from PIs to v, respectively. Let p(u, v) be the padding on (u, v) ∈ Gc. The problem (denoted as MP ) of finding a minimum delay padding under T * and l * can be formulated as follows [23] .
MP :
Minimize
The conditions (8)- (9) characterize the earliest arrival times at gate outputs. (10) is the hold constraint at POs. The conditions (11)- (12) characterize the latest arrival times. (13) is the setup constraint. Inequality (14) enforces nonnegative padding.
We say that p is a feasible padding if and only if there exist a and A such that (8)- (14) are satisfied under T * and l * . The feasible region of MP contains all the feasible paddings. Since both the objective and the constraints are linear, MP can be solved by any linear programming solver.
We next describe an approach to find a "reasonably good" padding using network flow technique, which is more efficient than linear programming.
First of all, we observe that subtracting (9) from (12) yields
. Given a feasible padding, we can insert extra delays on each edge such that (12) becomes an equality. When (12) is an equality, we have
As a result, there exists a combinational path p to v such that
Since Θ(v) can be obtained by longest path computation with respect to θ, we can replace A(v) by Θ(v) + a(v) in the conditions (11)- (13) to simplify the problem.
Lemma 4
The minimum padding problem MP with (12) being an equality is equivalent to the following problem (EMP):
Proof: Since both MP and EMP have the same objective, what remains is to show that they have the same feasible region when (12) is an equality. Suppose that p is a feasible padding to EMP. Since Θ(v) ≥ Θ(u) + θ(v) by (5), we know that (15) implies (9). For i ∈ PI, since Θ(i) = 0 and (8), we have (15) is an equality form of (12). (16) is the same as (13) since A(j) = Θ(j) + a(j). Therefore, p is also a feasible padding to MP when (12) is an equality. Similarly, if p is feasible to MP with (12) being an equality, p is also feasible to EMP, which concludes our proof.
Note that EMP is the dual of a min-cost flow problem [3]. Let p̄ be an optimal solution to EMP. Since p̄ satisfies the setup constraint, any p with p(u, v) ≤ p̄(u, v), ∀(u, v) ∈ Gc, should also satisfy the setup constraint. Therefore, we use p̄ as an upper bound for p and solve the minimum padding problem with the hold constraint only. This is formulated as the following problem (BMP).
BMP:
Note that BMP is the dual of a convex-cost flow problem [2]. Both EMP and BMP can be solved in polynomial time bounded by O(|Vc||Ec| log(|Vc|²/|Ec|) log(|Vc|T*)). The next theorem provides the condition under which an optimal solution to BMP is also an optimal solution to MP.
Theorem 2
If MP has an optimal solution p* such that p*(u, v) ≤ p̄(u, v) for all (u, v) ∈ Gc, then an optimal solution to BMP is one such p*.
Proof:
Since p* ≤ p̄, p* is feasible to BMP. Given that the feasible region of MP contains the feasible region of BMP, p* is an optimal solution to BMP. Therefore, any optimal solution to BMP has the same amount of padding as p*, and hence is an optimal solution to MP.
When the condition in Theorem 2 does not hold, solving BMP only gives a feasible padding. However, our experiments show that the feasible padding we obtained is close to the minimum padding.
Our algorithm for optimal skew scheduling is presented in Figure 3. It first applies "MinPeriod" to compute T* and l*. Then, it finds a padding solution for each combinational component under T* and l*. The overall complexity is O(N |Vt| Bt log |Et| + |V||E| log(|V|²/|E|) log(|V|T*)) by Theorem 1 and the complexity for solving EMP and BMP.
Input : A circuit G = (V, E) and N skew domains.
Output: Optimal period T* under domain assignment l* and delay padding p.
Construct timing graph Gt = (Vt, Et) of G;
Compute T* and l* by "MinPeriod";
For each combinational component Gc of G:
    Find upper bound p̄ by solving EMP for Gc;
    Find padding p by solving BMP for Gc;
Return T*, l* and p;
Figure 3: Pseudocode of optimal skew scheduling algorithm.
Experimental results
We implemented the algorithm on a PC with a 2.4 GHz Xeon CPU, 512 KB 2nd level cache memory and 1GB RAM. We used the linear cost-scaling algorithm by Goldberg [11] to solve EMP. We also adapted it to the convex cost case to solve BMP. The detailed adaptation procedure is shown in [2]. Our test files were generated from the ISCAS-89 benchmark suite using ASTRA [22]. Each gate was assigned a maximum delay equal to the number of fanouts or an upper bound 100, whichever is smaller. The minimum delay was equal to the maximum delay. To ease the presentation, interconnect delays were set to zero. Note that we did not ignore interconnect delays. They were handled uniformly in our algorithms. The circuits used are summarized in Table 1.

Without loss of generality, we used four evenly distributed skew domains, i.e., s_n T = nT/4, 0 ≤ n ≤ 3. Setup and hold times of each flip-flop were set to 2. Since the minimum and maximum gate delays coincide, T_lb = 4. The results are reported in Table 2. Column "|sn|", 0 ≤ n ≤ 3, lists the number of flip-flops that are assigned to domain sn in the obtained optimal skew schedule. The sum of the setup time and the maximum combinational delay of the original circuit is listed in column "T_ub". It is an upper bound for T*. The computed minimal period is listed in column "T*". The running time of "MinPeriod" for finding T* is reported in column "t(sec)" in seconds. The improvement ratio (T_ub − T*)/T_ub for each circuit is listed in column "impr%". We obtain the arithmetic (geometric) mean of all the ratios in row "arith" ("geo").

Once T* and l* are obtained, we solve EMP and BMP to get a padding solution under T* and l*. We compare the solution with the minimum padding computed by the MOSEK solver [1]. The amount of padding and running time are listed in columns "padding" and "time(sec)" respectively.
We can see from Table 2 that, except for "s5378", all circuits have smaller periods after skew scheduling with possible padding. The improvement could be significant, e.g., 42.9% in "s15850". The average improvement is 16.9%. In addition, "MinPeriod" is efficient. It takes only 0.59 seconds for the largest circuit "s38584". Although the padding solution to BMP is about 1.5X the minimum padding, the actual area overhead will be reasonably small since the delay padding is amortized over the whole circuit. For running time comparison, solving BMP by network flow technique is much more efficient than solving MP by MOSEK. The average speed-up is more than one order of magnitude.
To see how skew scheduling helps to improve a retiming solution, we used the algorithm in [18] to compute a minimum period retiming under the setup and hold constraints, and then applied our algorithm on the retimed circuit. The results are reported in Table 3 . We use "s838 " to denote the optimal retiming of "s838", and so on for other optimal retimings. Column " [18] " lists the minimal period computed by the retiming algorithm in [18] , where we use "NO" to indicate that there is no feasible retiming. For circuits without feasible retiming, we obtained their min-period retimings under setup constraint only. We highlight the cases where the periods are further improved by our skew scheduling algorithm. Comparing Table 3 with Table 2 , we observe the following results.
Firstly, half of the circuits have their minimal periods further reduced after skew scheduling with possible padding insertion. The improvement is 7.3% on average and up to 27.8%. This is significant considering that our algorithm is applied after a minimum period retiming. Two circuits "s13207.1" and "s38584" do not even have a feasible retiming due to the discrete nature of retiming. It happens when there exist reconvergent paths where the retiming requirements from different paths contradict each other. However, by skew scheduling and delay padding, we are able to find the minimal periods for them.
Secondly, it appears that the optimal skew schedule for the retimed circuit uses fewer domains. The number of flip-flops that are assigned to domains other than s0 is also reduced. In other words, applying skew scheduling after retiming improves the implementability of the obtained optimal skew schedule.
Thirdly, in all test cases, except for "s838", applying skew scheduling on the retimed circuit results in less delay padding than on the original circuit. In addition, the solution to BMP is closer to (about 1.1X) the minimum padding of the retimed circuit.
Conclusion
We present a polynomial time algorithm that finds an optimal skew schedule over a finite set of prescribed skew domains such that the period is minimized with possible delay padding. We show that the existence of a padding solution under the minimal period is guaranteed and propose an approach to find a padding solution by network flow technique. Experimental results validate the efficiency of our algorithm. 
