Abstract: Multiple Supply Voltage (MSV) assignment has emerged as an appealing technique in low power IC design, due to its flexibility in balancing power and performance. However, clock skew scheduling, which has great impact on the criticality of combinational paths in sequential circuits, has not been explored in the context of MSV assignment. In this paper, we propose a discrete voltage assignment algorithm for sequential circuits under clock skew scheduling. The sequential MSV assignment problem is first formulated as a convex cost dual network flow problem, which can be optimally solved in polynomial time assuming the delay of each gate can be chosen in a continuous domain. Then a mincut-based heuristic is designed to convert the infeasible continuous solution into a feasible discrete solution while largely preserving the global optimality. In addition, we revisit the hardness of the general discrete voltage assignment problem and point out some misunderstandings about the approximability of this problem in previous related work. Benchmark tests of our algorithm show a 9.2% reduction in power consumption on average, compared with combinational MSV assignment. Taking the continuous solution obtained from network flow as the lower bound, the gap between our solution and the lower bound is only 1.77%.
I. INTRODUCTION
Power consumption has become one of the major concerns in modern IC design, especially for portable devices. To control the tradeoff between power and performance, low power design techniques such as Multiple Supply Voltage (MSV) [1] and Multiple Threshold Voltage (MTV) [2] assignment have drawn much attention from the EDA community. In MSV design, a high supply voltage is assigned to timing-critical cells to guarantee performance, while a low supply voltage is assigned to other cells to save power. In MTV design, a low threshold voltage is assigned to timing-critical cells in a similar way. Since the MSV and MTV assignment problems share a very similar structure, an algorithm that applies to one can usually be applied to the other as well. Therefore, both MSV and MTV assignment can be classified as the general voltage assignment problem, which is to minimize total power consumption by assigning different voltages to the circuit elements without violating the given timing constraints. This problem has been intensively studied in the literature [3], [4], [5], [6], [7], [2]. In [3], the authors proposed a heuristic that simultaneously minimizes power consumption and simplifies the power network under timing constraints. In [4], the authors tackled the voltage assignment problem by formulating it as an Integer Linear Program (ILP) with inequalities representing timing requirements. Dynamic programming is used in [5] to solve this problem. However, all of these approaches suffer from either non-optimality or long run-time.
In [6], Ma et al. formulated the MSV assignment problem as a convex cost integer dual network flow problem [8] after relaxing the discreteness requirement on supply voltages. The convex cost dual problem can be solved by a cost-scaling algorithm, which is computationally much more efficient than solving the general ILP problem. Straightforward rounding is then used to obtain the discrete solution, which has no optimality guarantee even though the continuous solution is optimal. To resolve the infeasibility of the continuous solution, a branch and bound based algorithm is proposed in [7], which uses the method in [6] for tight bound estimation. While that method is able to obtain the optimal discrete solution, the computational complexity of branch and bound is not guaranteed. Recently, Feng et al. [2] claimed an $\epsilon$-approximation to the threshold voltage assignment problem under timing constraints. However, previous work [9], [10] has ruled out the existence of any $\epsilon$-approximation to the discrete voltage assignment problem unless P = NP. After careful examination, we found flaws in their approach. We discuss these issues in Section III.
Despite the efforts of the aforementioned work, voltage assignment is still restricted to combinational circuits. It is well known that the criticality of combinational paths can be significantly changed by sequential adjustments such as retiming and clock skew scheduling. Both retiming and clock skew scheduling have been utilized in various optimization problems such as gate sizing [11], threshold voltage assignment [12] and slack budgeting [13]. These works have motivated us to incorporate appropriate sequential optimization techniques when handling the voltage assignment problem.
In this paper, we propose an algorithm for the discrete voltage assignment problem in sequential circuits under clock skew scheduling. The main contributions of this paper include:
• The discrete voltage assignment problem under clock skew scheduling is, for the first time, formulated as a convex cost integer dual network flow problem, which has an optimal continuous solution;
• A novel mincut-based heuristic is devised to obtain a feasible solution from the optimal infeasible solution. Its effectiveness and efficiency are justified by theoretical analysis and verified by experimental results;
• Serious mistakes are identified in an $\epsilon$-approximation approach to the MTV assignment problem [2], and the impossibility of any $\epsilon$-approximation to this problem is pointed out;
• Our approach is highly practical. According to experimental results on all ISCAS89 benchmarks, our method achieves 9.2% additional power saving in comparison with combinational MSV assignment [6]. Our solution also turns out to be nearly optimal. Taking the optimal infeasible solution obtained from the network flow problem as the theoretical lower bound of the optimal feasible solution, the gap between our final solution and the lower bound is only 1.77%.
The rest of this paper is organized as follows. The definition and formulation of the sequential voltage assignment problem are given in Section II. The hardness and approximability of the general voltage assignment problem are discussed in Section III, together with the flaws of [2]. Section IV describes the reduction to the convex cost flow problem and the corresponding cost-scaling algorithm. The mincut-based heuristic is described in Section V. Section VI presents the experimental results. Finally, Section VII concludes the paper.
II. PROBLEM FORMULATION
The input to the voltage assignment problem is a sequential circuit consisting of a set of properly connected combinational gates and flip-flops (FFs). Each gate has several supply/threshold voltage candidates, each with a fixed (delay, power) pair that can be computed offline. All (delay, power) pairs of a gate compose a delay-power curve (DP curve), which is typically a convex function representing the delay-power tradeoff. The objective of the voltage assignment problem is to minimize the total power consumption of the circuit subject to various timing constraints, which are explained as follows.
978-1-4244-7516-2/11/$26.00 ©2011 IEEE 6B-3
Given a sequential circuit, a directed graph $G = (V, E)$ can be constructed as follows. The node set $V$ consists of two subsets $V_i$, $V_o$ and two dummy nodes $PI$, $PO$. $PI$ and $PO$ are connected to all primary inputs and outputs, respectively. Each gate or FF is represented by an input vertex $v_i \in V_i$ and an output vertex $v_o \in V_o$, and the input vertex is connected to the output vertex by a directed edge. The edge set $E$ is composed of three subsets $E_{gate}$, $E_{net}$ and $E_{FF}$. $E_{gate}$ and $E_{FF}$ are the sets of all edges between the input vertex and output vertex of a gate and of an FF, respectively. A directed edge $(i,j) \in E_{net}$ denotes an interconnect between the gates and FFs corresponding to vertices $i$ and $j$. An example of $G$ is shown in Figure 1. For each $(i,j) \in E$, we use $d_{ij}$ to denote the gate delay, net delay or FF delay on edge $(i,j)$. Moreover, we use $P_{ij}$ to represent the function that maps delay to power on $(i,j)$:
$$P_{ij}(d_{ij}) = p^k_{ij} \quad \text{if } d_{ij} = d^k_{ij}, \quad k = 1, \ldots, k_{ij},$$
where $k_{ij}$ is the number of possible voltage options of each gate, and $P_{ij} = 0$ for $(i,j) \in E_{net} \cup E_{FF}$.
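The graph construction above can be sketched in a few lines of Python. This is only an illustrative sketch: the class name `TimingGraph`, the `.in`/`.out` vertex naming, and the tiny example netlist are assumptions made here, not part of the original formulation.

```python
from dataclasses import dataclass, field


@dataclass
class TimingGraph:
    """Directed graph G = (V, E) with edges typed as gate, net or ff."""
    nodes: set = field(default_factory=lambda: {"PI", "PO"})
    edges: dict = field(default_factory=dict)  # (u, v) -> "gate" | "net" | "ff"

    def add_cell(self, name, is_ff=False):
        # Each gate or FF contributes an input vertex, an output vertex,
        # and one directed edge between them (in E_gate or E_FF).
        vin, vout = f"{name}.in", f"{name}.out"
        self.nodes.update((vin, vout))
        self.edges[(vin, vout)] = "ff" if is_ff else "gate"

    def add_net(self, driver, sink):
        # driver and sink are cell names, or the dummy nodes "PI" / "PO".
        u = driver if driver == "PI" else f"{driver}.out"
        v = sink if sink == "PO" else f"{sink}.in"
        self.edges[(u, v)] = "net"


def build_example():
    # Hypothetical netlist: PI -> g1 -> ff1 -> g2 -> PO
    g = TimingGraph()
    g.add_cell("g1")
    g.add_cell("g2")
    g.add_cell("ff1", is_ff=True)
    g.add_net("PI", "g1")
    g.add_net("g1", "ff1")
    g.add_net("ff1", "g2")
    g.add_net("g2", "PO")
    return g
```

Keeping the edge type next to each edge makes it easy to attach a power function only to gate edges, since net and FF edges carry zero power in the formulation.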
On combinational paths, the arrival time $t_i$ of a node $i$ satisfies $t_i + d_{ij} \le t_j$, $\forall (i,j) \in E_{net} \cup E_{gate}$. In addition, for $(i,j) \in E_{FF}$, the clock skew of each FF in the sequential circuit needs to be bounded in $[0, s_{max}]$, where $s_{max}$ is the maximum acceptable clock skew. The influence of clock skew can be represented by the following inequalities:
where $s_{ij}$ stands for the clock skew of the FF embedded in $(i,j) \in E_{FF}$ and $T$ is the clock period. To facilitate the problem formulation, we transform the above inequalities into an equivalent form that eliminates the variable $s_{ij}$:
After the transformation, the timing constraint on each edge $(i,j) \in E$ can be uniformly stated as $t_i \le t_j - d_{ij}$, where
Based on the above derivation, the Discrete Voltage Assignment under Clock Skew Scheduling problem can be formulated as the following mathematical program:
In Problem 1, the first two inequalities assert the boundary constraints of the whole circuit. The third one is the general arrival time constraint on each timing edge. The fourth and fifth inequalities are the clock skew constraints for flip-flops. Finally, the last constraint states the discreteness of the valid voltage candidates.
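As a minimal sketch, the arrival-time constraint $t_i + d_{ij} \le t_j$ and the discreteness constraint can be checked for a candidate assignment as follows. The function names, the dictionary-based data layout and the tolerance `tol` are assumptions for illustration; the boundary and clock skew constraints of Problem 1 are not covered here.

```python
def violated_edges(edges, t, tol=1e-9):
    """Return the edges (i, j) for which t[i] + d_ij <= t[j] fails.

    edges: dict mapping (i, j) -> delay d_ij
    t:     dict mapping node -> arrival time
    """
    return [(i, j) for (i, j), d in edges.items() if t[i] + d > t[j] + tol]


def is_discrete(delays, options, tol=1e-9):
    """True if every gate delay equals one of its allowed delay options.

    delays:  dict mapping gate edge -> chosen delay
    options: dict mapping gate edge -> list of allowed delays
    """
    return all(
        any(abs(d - o) <= tol for o in options[e]) for e, d in delays.items()
    )
```

For example, with delays `{("PI","a"): 0.0, ("a","b"): 2.0, ("b","PO"): 1.0}` and arrival times `{"PI": 0, "a": 0, "b": 2, "PO": 3}`, no edge is violated; lowering the arrival time of `b` to 1.5 makes edge `("a","b")` infeasible.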
III. HARDNESS OF DISCRETE VOLTAGE ASSIGNMENT
Discrete voltage assignment is well known to be an NP-hard problem. Feng et al. studied an $\epsilon$-approximation to the MTV assignment problem in [2]. Unfortunately, our analysis shows that their solution is flawed. In fact, earlier research in related areas has shown that no $\epsilon$-approximation to this problem exists unless P = NP. Discrete voltage assignment is a specific application of the general Discrete Time-Cost Tradeoff (DTCT) problem, which has been widely studied in operations research and management science. De et al. proved in [10] that the DTCT problem is strongly NP-hard and cannot be solved in pseudo-polynomial time unless P = NP. It is further shown in [9] that there does not exist a polynomial $(1 + \alpha, \frac{5}{4} - \beta)$-approximation of DTCT with some $\alpha, \beta > 0$ unless P = NP. So far, the best approximation algorithm found for the DTCT problem [14] can only guarantee a performance of $O(\log l)$, where $l$ is the ratio of the maximum duration ("delay") of any activity ("gate") to the minimum nonzero duration of any activity. No constant-factor approximation to the DTCT problem has been found except for some special cases.
Next, we would like to point out the misunderstandings in [2] . In the first step, the authors transform the original problem into a convex min-max resource-sharing LP, which is Formula (7) in [2] :
In Problem 2, $x^k_{ij}$ indicates the voltage choice of gate $(i,j)$: $x^k_{ij}$ equals 1 when the $k$th voltage is assigned and 0 otherwise. $d^k_{ij}$ and $p^k_{ij}$ are the gate delay and power consumption corresponding to the $k$th voltage. $P$ is the estimated value of the optimal total power consumption. Notice that the introduction of $x$, $P$ and $\lambda$ does not change the nature of this problem. At a high level, their algorithm strives to get an $\epsilon$-approximation of $P$ by iteratively solving the $\epsilon$-approximation to Problem 2 [15]. The key point of their argument is that the $\epsilon$-approximation to Problem 2 is also an $\epsilon$-approximation to the binary integer programming version of the same problem, with all $x^k_{ij}$ constrained to binary values. We demonstrate that this claim is incorrect with a simple example. Consider a small min-max instance in $\lambda$ with two variables $x_1$ and $x_2$. It is easy to find that $\lambda^* = 1/2$, attained at $x_1 = x_2 = 1/2$. If we constrain $x_1$ and $x_2$ to binary values, the optimal solution sets either $x_1$ or $x_2$ to 1. Obviously, no $\epsilon$-approximation to the continuous problem is also $\epsilon$-approximate to the discrete problem.
Another flaw in [2] is that the authors treat all $t_i$ as constants when solving Problem 2 as an LP in $x$. They claim that it is easy to obtain $t_i$ once $x$ is determined. This inverts cause and effect, because fixing $t_i$ at any constants is equivalent to constraining the slack and the range of $x^k_{ij}$.
In this paper, we do not seek an approximation to the discrete voltage assignment problem. Instead, we first find the optimal solution of its continuous form and use it as a good starting point for finding a feasible solution. Based on the continuous solution, we further propose an efficient heuristic to obtain the discrete solution while striving to maintain the global optimality. The two stages of the proposed approach are described in the following sections.
IV. CONTINUOUS RELAXATION
With the last constraint relaxed to the continuous domain, Problem (1) is a Convex Cost Dual Network Flow problem [8], which can be optimally solved in polynomial time. In this section, we describe the details of the continuous algorithm based on convex cost flow.
First, the neighboring points on the DP curve are connected with line segments, resulting in a piecewise-linear convex function of power versus delay. Next, in order to eliminate the lower and upper bounds on the delay, we generate a new power-delay function as follows.
where $M$ is a sufficiently large number that can be viewed as a penalty factor for violating the bound constraints.
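The following sketch illustrates one way to build such a penalized piecewise-linear power function from the discrete (delay, power) pairs of a gate. The exact form used in the paper is not reproduced here; applying the penalty slope `M` on both sides of the delay range, as well as the function name `make_power_function`, are assumptions made for illustration.

```python
def make_power_function(points, M=1e6):
    """Build a penalized piecewise-linear power-versus-delay function.

    points: list of (delay, power) pairs of one gate
    M:      penalty slope applied outside the delay range (assumed form)
    """
    pts = sorted(points)

    def P(d):
        d_min, p_at_min = pts[0]
        d_max, p_at_max = pts[-1]
        if d < d_min:
            # Delay below the smallest option: penalize the violation.
            return p_at_min + M * (d_min - d)
        if d > d_max:
            # Delay above the largest option: penalize the violation.
            return p_at_max + M * (d - d_max)
        # Inside the range: linear interpolation between neighboring points.
        for (d0, p0), (d1, p1) in zip(pts, pts[1:]):
            if d0 <= d <= d1:
                return p0 + (p1 - p0) * (d - d0) / (d1 - d0)
        return p_at_min  # curve with a single point, d == d_min == d_max

    return P
```

With a convex DP curve and `M` larger than the steepest segment slope, the resulting function stays convex, which is what the convex cost flow formulation relies on.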
Similarly, in order to eliminate the bound constraints on the arrival time $t_i$, we define $B_i(t_i)$ as follows and add it to the objective function.
where $l_i$ and $u_i$ are the lower and upper bounds of $t_i$. Without the bound constraints and discretization constraints, Problem (1) can be transformed into the following form:
It is well known that the dual of Problem 3 [8] is a convex cost flow problem.
In Problem 4, we add an extra ground node 0 and its connections to nodes in $V$ to simplify the formulation. For each node $i \in V$ with a lower bound on arrival time $t_i$ ($PI$ and the outputs of all flip-flops), an edge $(0, i)$ is introduced. Moreover, for each node $i \in V$ with an upper bound on arrival time $t_i$ ($PO$ and the inputs of all flip-flops), an edge $(i, 0)$ is introduced. We denote the set of newly added edges by $E_0$ and the new graph by $G^* = (V^*, E^*)$.
The cost function of each type of edge can be derived from the power functions and arrival time constraints using Lagrangian relaxation [8], [6].
For each $(i,j) \in E_{net}$:
For each $(i,j) \in E_{FF}$:
And finally for each $(i,j) \in E_0$:
The convex-cost scaling algorithm in [8] can be applied to obtain the optimal solution to Problem 4. The general flow of the algorithm is shown in Algorithm 1.
Algorithm 1
for each admissible edge $(i,j) \in E(x)$ do
Examine the node for excess
8: while there is a node $i$ with excess flow do
9: if there is an admissible edge $(i,j)$ then
A flow $x$ is called $\epsilon$-optimal if, for some node potential $\pi$, the reduced cost satisfies $c^\pi_{ij} \ge -\epsilon$ for every $(i,j) \in G^*_r(x)$. Initially, $\epsilon$ is large enough to make any feasible flow $\epsilon$-optimal. Then the algorithm gradually transforms the $\epsilon$-optimal solution into an $\epsilon/2$-optimal solution by pushing as much flow as possible through admissible edges, which satisfy the condition $-\epsilon \le c^\pi_{ij} < 0$. The algorithm terminates when $\epsilon < 1/|V|$. Though the convex-cost scaling algorithm is based on integer costs, the optimal flow $x^*$ and node potential $\pi^*$ returned by Algorithm 1 may be non-integer. But since all edge costs are integers, there must exist some integral optimal node potential $\pi$. Such a $\pi$ can be obtained by computing the shortest distance $sp(i)$ from node 0 to every other node $i \in V^*$ in the residual graph $G^*_r(x^*)$ and setting $\pi(i) = -sp(i)$. The optimal solution to Problem 3 can then be obtained from $\pi$. However, since the resulting delays may not coincide with any of the discrete delay options, this solution may be infeasible and therefore cannot be directly implemented in the circuit. The continuous solution requires further discretization, which is discussed in the next section.
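The shortest-path step that recovers integral node potentials can be sketched with a plain Bellman-Ford pass over the residual graph. This is a hedged illustration only: the edge-list representation and the function name `integral_potentials` are assumptions, and the residual graph is taken as given rather than produced by the cost-scaling algorithm.

```python
def integral_potentials(nodes, residual_edges, source=0):
    """Compute pi(i) = -sp(i), with sp the shortest distance from `source`.

    nodes:          iterable of node ids
    residual_edges: list of (i, j, cost) tuples with integer costs
    """
    INF = float("inf")
    nodes = list(nodes)
    sp = {v: INF for v in nodes}
    sp[source] = 0
    # Bellman-Ford: at most |V| - 1 rounds of edge relaxation.
    for _ in range(len(nodes) - 1):
        changed = False
        for i, j, c in residual_edges:
            if sp[i] + c < sp[j]:
                sp[j] = sp[i] + c
                changed = True
        if not changed:
            break
    # A further improvement means a negative cycle, i.e. a non-optimal flow.
    for i, j, c in residual_edges:
        if sp[i] + c < sp[j]:
            raise ValueError("negative cycle in residual graph")
    return {v: -sp[v] for v in nodes}
```

Because all edge costs are integers, every finite shortest distance is an integer, so the returned potentials are integral for all nodes reachable from the source.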
V. DISCRETE ALGORITHMS
Most of the previous LP-based or network flow-based approaches to the discrete voltage assignment problem [8], [16], [17] simply floor the optimal continuous delay $d_{ij}$ to the largest delay option that does not exceed it in order to obtain a feasible solution. However, it is not guaranteed that the optimal continuous solutions will automatically fall on the expected discrete levels. In some cases, naive rounding may produce extra slack that can be reused to further reduce power. In this section, we present a mincut-based heuristic that can fully utilize the extra slack while maintaining low power consumption.
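The naive rounding used by previous approaches can be sketched as follows. The helper name `floor_to_option` and the fallback to the smallest option when the continuous delay lies below all options are assumptions for illustration.

```python
from bisect import bisect_right


def floor_to_option(d_cont, options):
    """Return the largest delay option that does not exceed d_cont.

    Falls back to the smallest option if d_cont is below all options.
    """
    opts = sorted(options)
    k = bisect_right(opts, d_cont)  # number of options <= d_cont
    return opts[max(k - 1, 0)]


def unused_slack(d_cont, options):
    """Slack left on the gate after flooring its continuous delay."""
    return d_cont - floor_to_option(d_cont, options)
```

For a gate with delay options 1.0, 1.5 and 2.0 and a continuous delay of 1.7, flooring selects 1.5 and leaves about 0.2 of slack unused, which is the kind of slack the mincut-based heuristic tries to reuse.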
Assuming that the optimal continuous solution is already available, Problem 1 can be further reduced by removing the flip-flop edges from the graph. The new voltage assignment problem can be stated as follows.
The new graph $G' = (V, E')$ is slightly different from the original graph $G$. All edges $(i,j) \in E_{FF}$ are removed, and for each FF two new edges $(PI, j)$ and $(i, PO)$ are introduced with $d(PI, j) = s^*_{ij}$, where $s^*_{ij}$ is the optimal clock skew of the flip-flop corresponding to edge $(i,j)$. Additionally, we introduce one edge from $PO$ to $PI$ with $d(PO, PI) = -T$ to eliminate the arrival time constraints on single nodes. We denote the newly added edges by $E_{nFF}$, so that $E' = E_{net} \cup E_{gate} \cup E_{nFF}$. Notice that the new graph $G'$ is a DAG once $(PO, PI)$ is excluded; removing the cycles from the graph allows us to better exploit the special structure of this problem.
Let us first look back at the relaxed version of Problem 5 from a different perspective. According to [18] , the Karush-Kuhn-Tucker (KKT) optimality condition for relaxed Problem 5 is:
The above KKT condition actually describes a network flow satisfying certain properties. Initially, if we set all $x_{ij} = 0$, $d_{ij} = d^{k_{ij}}_{ij}$, $t_{PI} = 0$ and every other $t_i$ to the maximum delay from $PI$ to $i$, the only violated constraint is the timing constraint (6a) on edge $(PO, PI)$. Since the KKT condition is close to being satisfied at the beginning, it is natural to develop a way of decreasing delay while maintaining the flow constraints. Constraints (6b), (6c) and (6d) can be easily maintained by setting the upper/lower capacity of each edge $(i,j)$ to $-P^-_{ij}(d_{ij})$/$-P^+_{ij}(d_{ij})$ and pushing flow along cycles from $PI$ to $PO$ and back to $PI$. The dual constraint (6e) specifies that if there is positive flow on an edge $(i,j)$, then the equation $t_j = t_i + d_{ij}$ must hold, which means $(i,j)$ must be on the longest path from $PI$ to $PO$. In the following text, we refer to the maximum delay from $PI$ to $PO$ as $MaxDelay(PI, PO)$ and to the set of all edges on the longest paths as the critical network. In order to satisfy constraint (6e), we have to push flow in the critical network. Once the flow $x_{ij}$ reaches its upper/lower bound $-P^-_{ij}(d_{ij})$/$-P^+_{ij}(d_{ij})$, we are able to adjust $d_{ij}$ without violating constraint (6b). However, the edges with positive flow may no longer be in the critical network after the change. In order to maintain the dual constraint (6e), we expect the critical network to satisfy an important property called Monotonic Grow. Monotonic Grow ensures that once an edge is in the critical network, it will always remain in the critical network. Our approach is to change the delay of the edges that lie on the mincut of the residual critical network. All details are shown in Algorithm 2.
Algorithm 2
Identify critical network $G_0$
Solve maximum flow in the residual graph of $G_0$ using $s$ as source and $t$ as sink
6: Select a mincut $M$ and maximum possible $\delta d$
7: for each $(i,j) \in M$ do
8: if $(i,j)$ is a forward edge then
16: end while
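The step of identifying the critical network can be illustrated with a longest-path computation on the DAG obtained by excluding the edge $(PO, PI)$. The sketch below is an assumption-laden illustration, not the authors' implementation: the function name `critical_network`, the dictionary-based edge representation and the tolerance `tol` are hypothetical.

```python
from collections import defaultdict


def critical_network(edges, source="PI", sink="PO", tol=1e-9):
    """Return (max_delay, critical_edges) for a DAG.

    edges: dict mapping (i, j) -> delay d_ij, without the (PO, PI) edge.
    An edge is critical if some longest source-to-sink path uses it.
    """
    succ, indeg, nodes = defaultdict(list), defaultdict(int), set()
    for (i, j) in edges:
        succ[i].append(j)
        indeg[j] += 1
        nodes.update((i, j))
    # Topological order (Kahn's algorithm).
    order, stack = [], [v for v in nodes if indeg[v] == 0]
    while stack:
        u = stack.pop()
        order.append(u)
        for v in succ[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                stack.append(v)
    neg_inf = float("-inf")
    # arr[v]: longest delay from source to v.
    arr = {v: neg_inf for v in nodes}
    arr[source] = 0.0
    for u in order:
        for v in succ[u]:
            arr[v] = max(arr[v], arr[u] + edges[(u, v)])
    # rem[v]: longest delay from v to sink.
    rem = {v: neg_inf for v in nodes}
    rem[sink] = 0.0
    for u in reversed(order):
        for v in succ[u]:
            rem[u] = max(rem[u], edges[(u, v)] + rem[v])
    max_delay = arr[sink]
    critical = {
        (i, j)
        for (i, j), d in edges.items()
        if abs(arr[i] + d + rem[j] - max_delay) <= tol
    }
    return max_delay, critical
```

An edge $(i,j)$ is reported as critical exactly when the longest path through it equals $MaxDelay(PI, PO)$, which matches the tightness condition $t_j = t_i + d_{ij}$ on longest paths.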
The basic idea behind our mincut-based continuous algorithm is similar to that of [19], but [19] only explained the concept of mincut from a price-oriented perspective and did not further explore the deeper relation between the flow and the delay. It is also worth pointing out that, computationally, Algorithm 2 may not be as efficient as Algorithm 1 for the power/delay tradeoff problem with a specific $T$. However, Algorithm 2 provides us with a theoretical foundation for developing an advanced discrete algorithm. After several modifications, we obtain the mincut-based heuristic shown in Algorithm 3.
There are four main changes in our mincut discrete heuristics:
1) $d_{ij}$ can only switch from one discrete level to another rather than changing continuously. If an edge $(i,j)$ is on the mincut and its current delay is $d^{q_{ij}}_{ij}$, we directly reduce the delay to the next smaller discrete level $d^{q_{ij}-1}_{ij}$.
2) $d_{ij}$ is initialized to the smallest delay option $d^{m_{ij}}_{ij}$ that is not less than the optimal continuous delay, instead of the largest delay option $d^{k_{ij}}_{ij}$. Such an initialization not only takes full advantage of the global optimality of the continuous solution, but also significantly shortens the run-time of the algorithm.
Algorithm 3 Mincut-based Discrete Heuristics
Identify approximate critical network $G_0$
5: Solve maximum flow in the residual graph of $G_0$ using $s$ as source and $t$ as sink
6: Select a mincut $M$
3) The backward edges on the mincut are not taken into account. Based on our observation and experiments, backward mincut edges rarely appear in the circuit networks of our problem. In fact, omitting the backward edges has very little impact on our results.
4) The range of the critical network is expanded. In our experiments, the approximate critical network is defined as the set of all paths with delay larger than 95% of the maximum delay.
The runtime of Algorithm 3 is dominated by the time spent solving maximum flow. The upper bound of the runtime is $O(knM)$, where $k$, $n$ and $M$ are the number of voltage options, the number of gates and the average time of solving maximum flow, respectively. In each iteration of the while loop, at least one gate reduces its delay. Since the delay of each gate can be reduced at most $k$ times, there are at most $kn$ iterations. However, since we start from the rounding solution and the mincut usually contains more than one gate, the number of iterations is much smaller than $kn$ in practice. Experimental results show that our discrete heuristic runs in roughly the same order of time as the convex-cost scaling algorithm of the first stage.
VI. EXPERIMENTAL RESULTS
Our Voltage Assignment Clock Skew Scheduling algorithm SeqVA is implemented in C++. All of the experiments are performed on a Linux workstation with a 3.0GHz CPU and 2.0GB memory. The proposed algorithm is tested on ISCAS89 benchmarks. The delay-power curve of each cell is simulated by HSPICE U-2003.03-SP1 with the supply voltage set to 0.8V, 1.0V, 1.2V and 1.4V, respectively. After HSPICE simulation, we find that the assumption of convexity of the delay-power curve holds for all of the standard cells. An example of the delay-power curve of standard cell buf 1x is shown in Figure 2. Previous works on low power voltage assignment all focus on combinational circuits. To examine the performance of our algorithm, the combinational circuit voltage assignment algorithm in [6] is implemented for comparison, denoted as CombVA. The results are listed in Table I. Some statistics of each test case are listed, including the number of gates and flip-flops, and the minimum possible period when all voltages are set to the largest level, 1.4V. In our experiment, the clock period $T$ is set to 1.1 times the minimum period and the maximum acceptable clock skew $s_{max}$ is set to $T$. The total power consumption and running time of both CombVA and SeqVA are listed in the columns titled "Power" and "CPUTime". The power consumption of the continuous solution is also presented as a reference for the lower bound of all possible optimal results. Column "Gap" shows the difference between our result and the lower bound.
As Table I shows, SeqVA achieves 9.2% additional power reduction compared to CombVA. Besides that, since redundant arrival time constraints are eliminated on most of the combinational gates, the complexity of the graph construction is reduced dramatically, leading to less running time for the continuous solution and less total running time. Our solution also turns out to be almost optimal: the achieved power consumption is only 1.77% larger than the lower bound on average.
As mentioned in Section II, the VACSS problem is equivalent to the combinational voltage assignment problem when $s_{max}$ is set to 0. In Figure 3 we show how the power ratio between SeqVA and CombVA changes as $s_{max}$ changes from 0 to $T$. As the maximum acceptable clock skew increases, more power can be saved through clock skew scheduling. Similar properties can be observed on all test cases. We also observe that the impact of clock skew scheduling is more significant under tight timing constraints. As shown in Figure 4, the power consumption of SeqVA is very close to the lower limit even when the timing constraints are very tight. If SeqVA sets $T$ to $1.0T_{min}$ and $1.1T_{min}$, then CombVA needs to set $T$ to $1.4T_{min}$ and $1.7T_{min}$, respectively, to achieve similar power consumption.
VII. CONCLUSIONS
In this paper, we proposed a low power discrete voltage assignment algorithm under clock skew scheduling. The problem is formulated in a continuous manner as a convex-cost network flow and optimally solved. Unlike previous work, where the optimal continuous delay is directly floored to the nearest discrete level, an elaborate mincut-based heuristic is proposed to obtain a feasible solution from the continuous solution while preserving the global optimality well. During the discussion, we also point out several misunderstandings in previous related works. Experimental results demonstrate the effectiveness and efficiency of our approach. Compared to combinational supply voltage assignment [6], our method achieves more total power saving and faster running time. The gap between our final feasible solution and the lower bound of the optimal solution is only 1.77%.
