The integration of retiming and simultaneous supply/threshold voltage scaling has the potential to enable more aggressive total power reduction. However, such integration is a highly complex task due to its enormous solution space. This paper presents the first algorithm that performs retiming and simultaneous supply/threshold voltage scaling. In our three-step approach, low power retiming is first performed to reduce the clock period while taking the FF delay/power into consideration. Next, the subsequent voltage scaling makes the best possible supply/threshold voltage assignment under the clock period constraint set by the retiming. Finally, a post-process further refines the voltage scaling solution by exploiting the remaining timing slack in the circuit. Experiments show that our min-FF retiming plus simultaneous Vdd/Vth scaling approach reduces the total power consumption by 34% on average compared to the existing max-FF retiming plus Vdd scaling approach.
INTRODUCTION
Over the last decade, IC power management has moved from a third-order to a first-order concern for chip designers, especially those designing ASICs and SOCs for portable-system applications. The low power research community has proposed a large volume of solutions during this period. Among the most successful ones at the circuit level are supply voltage (Vdd) scaling, threshold voltage (Vth) scaling, gate-oxide (Tox) scaling, gate sizing, retiming, and any combination of these methods. A majority of the existing works can be categorized into (i) Vdd scaling [27, 6, 5, 7], (ii) Vth scaling [28], (iii) simultaneous Vdd/Vth scaling [21, 19, 8, 24], (iv) Vth scaling and sizing [23, 13, 20, 18], (v) simultaneous Vdd/Vth scaling and gate sizing [10, 25, 2, 1, 11], (vi) simultaneous Vth/Tox scaling and state assignment [26], (vii) retiming [15, 17], and (viii) Vdd scaling and retiming [22, 3, 4]. In addition, various level converter designs and their usage have been studied to support the low-Vdd to high-Vdd conversion in the Vdd scaling method [14, 12].
* This research has been supported by the National Science Foundation under contract CNS-0411149 and MARCO/C2S2.
We present the first work that performs retiming and simultaneous supply/threshold voltage scaling for total power reduction. Retiming [16] improves not only the clock period but also the dynamic power when FFs are moved to high-switching interconnects [17]. In addition, FFs can be used to enable the low-to-high supply voltage transition, thereby reducing the need for separate level converters. However, the integration of retiming and voltage scaling is a complex task due to its enormous solution space. The state-of-the-art in combining retiming and voltage scaling (Vdd-only) is by Chabini and Wolf [4], who proposed a two-step approach that performs retiming and Vdd scaling sequentially (footnote 1). We improve this work in the following ways:
• The authors of [4] performed max-area retiming to increase the number of gates off the timing-critical paths, which are ideal candidates for voltage scaling. We show that this approach in fact increases the FF count and thus the total power consumed by the FFs. Thus, we suggest min-area retiming as a better choice.
• The authors of [4] formulated the supply voltage scaling as an integer linear program (ILP). Our simultaneous supply and threshold voltage scaling is also formulated as an ILP, but we employ various LP-relaxation techniques to reduce the overall runtime by a few orders of magnitude.
• We show that min-area retiming, while it reduces the critical path delay as well as total power consumed by the FFs, may reduce the total slack in the circuit and thus limit the subsequent voltage scaling. However, we show that the impact of the min-area retiming on timing slack is minimal.
• Our related experiments indicate that our min-FF retiming plus simultaneous Vdd/Vth scaling reduces the total power consumption by 34% on average compared to the existing max-FF plus Vdd scaling approach [4].
Footnote 1: An ILP-based simultaneous retiming and supply voltage scaling has been attempted [3], where retiming as well as Vdd scaling are formulated as a single ILP, but the runtime was prohibitive. Thus the follow-up work employed a two-step approach [4].
We employ a three-step approach: retiming, voltage scaling, and post refinement. Our low power retiming is first performed to reduce the clock period while taking the FF delay/power into consideration. Next, the subsequent voltage scaling makes the best possible supply/threshold voltage assignment while satisfying the timing constraints set by the prior retiming step. We formulate the voltage scaling as an ILP, relax it to an LP, solve the LP in an iterative fashion, and apply various heuristics to convert the continuous LP solutions to integer solutions. Finally, a post refinement step further refines the voltage scaling solution by exploiting the remaining timing slack in the circuit. Related experiments show that our LP-based algorithm, named RVS (retiming-based voltage scaling), obtains results that are very close to those of the original ILP formulation but at a fraction of the runtime.
METHODOLOGY

Low Power Retiming
The synchronous sequential circuit is modeled with a directed graph G = (V, E, d, w), where V is the set of gates and E is the set of directed edges connecting gates. Edge e_{i,j} represents a connection from gate i to gate j, d(i) is the delay of gate i, and w(i, j) = w(e_{i,j}) is the number of FFs on edge e_{i,j}. Let P(i, j) denote a directed path from gate i to gate j, and let $w(P(i,j)) = \sum_{e \in P(i,j)} w(e)$ denote the total weight of the edges along P(i, j). Let d(P(i, j)) denote the total delay of the nodes along P(i, j). The original retiming paper [16] introduces the following two matrices: (i) $W(u,v) = \min\{w(P(u,v))\}$, which is the minimum weight value among all paths that connect u and v, and (ii) $D(u,v) = \max\{d(P(u,v)) \mid w(P(u,v)) = W(u,v)\}$, which is the maximum delay value among all paths with total weight of W(u, v).
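The two matrices can be computed with an all-pairs shortest-path pass over (weight, negated delay) pairs, as in the original retiming formulation [16]. The following is a minimal sketch of that idea, not the implementation used in this paper; the function name `wd_matrices`, the dictionary-based graph encoding, and the four-gate example circuit are assumptions made for illustration.

```python
from itertools import product

INF = float("inf")

def wd_matrices(nodes, delay, edges):
    """Return (W, D) for every reachable pair, using Floyd-Warshall on (w, -d) pairs.

    nodes: list of gate ids; delay: gate -> d(gate);
    edges: (u, v) -> number of FFs on the edge.
    Assumes every cycle carries at least one FF.
    """
    # best[u][v] = (min FF count, -(sum of gate delays on the path, excluding v))
    best = {u: {v: (INF, 0.0) for v in nodes} for u in nodes}
    for u in nodes:
        best[u][u] = (0, 0.0)
    for (u, v), w in edges.items():
        best[u][v] = min(best[u][v], (w, -delay[u]))
    # product(..., repeat=3) varies the first index slowest, so k is the outer loop.
    for k, u, v in product(nodes, repeat=3):
        if best[u][k][0] < INF and best[k][v][0] < INF:
            cand = (best[u][k][0] + best[k][v][0], best[u][k][1] + best[k][v][1])
            if cand < best[u][v]:  # lexicographic: fewer FFs first, then larger delay
                best[u][v] = cand
    W, D = {}, {}
    for u, v in product(nodes, repeat=2):
        w, neg_delay = best[u][v]
        if w < INF:
            W[u, v] = w
            D[u, v] = delay[v] - neg_delay  # add the delay of the end gate
    return W, D

# Hypothetical circuit: a feeds b and c, both feed d, and d feeds a through one FF.
nodes = ["a", "b", "c", "d"]
delay = {"a": 1, "b": 2, "c": 5, "d": 1}
edges = {("a", "b"): 0, ("b", "d"): 0, ("a", "c"): 0, ("c", "d"): 0, ("d", "a"): 1}
W, D = wd_matrices(nodes, delay, edges)
```

In this example both a-to-d paths carry zero FFs, so W(a, d) = 0 and D(a, d) picks the slower path through c.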
Let T be a target clock period. Let r(u) represent the number of FFs moved from all fan-out edges of node u to all fan-in edges of u. Retiming assigns an integer r(u) to each node u ∈ V such that the following constraints are met:
iii) the clock period after the retiming is equal to or less than T. Let FI(v) and FO(v) be the number of fan-ins and fan-outs of node v. Our ILP-based low power retiming is formulated as follows:
Subject to:
The objective of the mathematical formulation is to minimize the total number of FFs under the clock period constraint. This is done by minimizing the total edge weight of the graph after retiming. Constraint (2) states that the number of FFs on each edge after retiming cannot be negative. Constraint (3) states that there exists at least one FF on any path with delay more than T.
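To make the effect of a retiming vector concrete, the sketch below applies a given r to a small graph, checks the non-negativity condition on edge weights, and evaluates the FF count and the resulting clock period. It assumes the standard retiming relation w_r(u, v) = w(u, v) + r(v) - r(u), which follows from the definition of r(u) above, and the standard identity that the FF count changes by the sum of (FI(v) - FO(v)) r(v). The helper names and the example circuit are hypothetical, and FF delay is ignored here.

```python
from collections import Counter

def retime(edges, r):
    """Retimed edge weights: r(v) FFs are added on fan-in edges of v, r(u) removed from fan-out edges of u."""
    return {(u, v): w + r[v] - r[u] for (u, v), w in edges.items()}

def is_legal(weights):
    """No edge may carry a negative number of FFs."""
    return all(w >= 0 for w in weights.values())

def ff_count(weights):
    return sum(weights.values())

def ff_delta(edges, r):
    """Change in total FF count: sum over v of (FI(v) - FO(v)) * r(v)."""
    fan_in = Counter(v for _, v in edges)
    fan_out = Counter(u for u, _ in edges)
    return sum((fan_in[v] - fan_out[v]) * r[v] for v in r)

def clock_period(delay, weights):
    """Longest gate-delay path over edges without FFs (assumes no FF-free cycle)."""
    succ = {}
    for (u, v), w in weights.items():
        if w == 0:
            succ.setdefault(u, []).append(v)
    memo = {}
    def longest(u):
        if u not in memo:
            memo[u] = delay[u] + max((longest(v) for v in succ.get(u, [])), default=0)
        return memo[u]
    return max(longest(u) for u in delay)

# Hypothetical circuit: a feeds b and c, both feed d, and d feeds a through two FFs.
delay = {"a": 2, "b": 2, "c": 2, "d": 2}
edges = {("a", "b"): 0, ("a", "c"): 0, ("b", "d"): 0, ("c", "d"): 0, ("d", "a"): 2}
r = {"a": 0, "b": 0, "c": 0, "d": 1}  # move one FF from d's fan-out to d's fan-ins
retimed = retime(edges, r)
```

Here the retiming shortens the clock period from 6 to 4 at the cost of one extra FF, which is the kind of trade-off the min-FF objective is meant to control.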
ILP-based Voltage Scaling
The second step of our approach is to perform dual supply and threshold voltage assignment so that the total power (= dynamic plus leakage) consumed by the gates and level converters (LC) is minimized. One issue with Vdd scaling is that the low-to-high Vdd conversion needs a special method to guarantee reliable computation. There exist two ways to support this conversion [14, 12]. The first is to use a FF that can handle the conversion as well, which is named the level conversion FF (LCFF). The second is to use a separate level converter (LC), which can be inserted anywhere in the circuit to raise the low-Vdd input voltage back to the high-Vdd level. In this paper we use both LCFFs and LCs so that LCs are used only on zero-weight edges (= edges with no FFs). Since both LCFF and LC cause additional delay and power, voltage scaling has to be done carefully to suppress the related delay/power overhead.
Our initial formulation is integer linear programming-based since the voltage assignment variable for each node in the retimed graph takes one out of the following four possible states:
• state 1: high-Vdd plus low-Vth (maximum performance, maximum total power)
• state 2: low-Vdd plus low-Vth (medium performance, low dynamic power)
• state 3: high-Vdd plus high-Vth (medium performance, low leakage power)
• state 4: low-Vdd plus high-Vth (minimum performance, minimum total power).
In addition, the LC assignment variable for each edge either takes 0 (no LC) or 1 (with LC).
The following variables are used in our ILP-based voltage scaling formulation:
• x_{v,k}: voltage assignment variable for node v into state k (k = 1 corresponds to high-Vdd+low-Vth, etc).
• m(e): level converter assignment on edge e, where m(e) = 1 means LC is used on e and w(e) = 0.
• z_{v,k}: supply voltage level of v given that v is assigned to voltage state k.
• p_{v,k}: total power consumption of v given that v is assigned to voltage state k.
• d_{v,k}: delay of v given that v is assigned to voltage state k.
• s(v): arrival time of node v.
• p_lc, d_lc: total power consumption and delay of a level converter.
• T : clock period constraint.
• D: difference between high Vdd and low Vdd.
Our ILP-based dual supply/threshold voltage assignment for total power reduction under timing constraint is formulated as follows:
Timing constraints:
Level converter (LC) constraints:
Integer constraints:
The objective of the ILP is to minimize the total power consumption of all gates and level converters used. Constraints (5) and (10) state that each gate can be assigned to only one voltage state. Constraint (6) guarantees that the arrival time of each node combined with its delay is always less than the target clock period. Constraint (7) states that the arrival time of node v has to be greater than the summation of the arrival time of node u, the delay of node u, and the delay of the level converter inserted on e_{u,v}. Constraint (9) states that if a low-Vdd gate u drives a high-Vdd gate v, a level converter is inserted onto e_{u,v}.
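For very small circuits, the assignment problem described above can be checked by exhaustive enumeration, which is a useful sanity check for an ILP or LP implementation. The sketch below is an illustration under stated assumptions, not the formulation of this paper: the per-state delay and power numbers, the LC cost values, and the function names are hypothetical; only zero-weight edges are modeled (edges with FFs are assumed to use LCFFs); and every node without a zero-weight fan-in is given an arrival time of 0.

```python
from itertools import product

# Hypothetical per-state data: state -> (is_high_vdd, delay, power).
# State numbering follows the text: 1 = high-Vdd/low-Vth, 2 = low-Vdd/low-Vth,
# 3 = high-Vdd/high-Vth, 4 = low-Vdd/high-Vth.
STATES = {1: (True, 1.0, 10.0), 2: (False, 2.0, 5.0),
          3: (True, 1.5, 7.0), 4: (False, 3.0, 2.0)}
P_LC, D_LC = 1.0, 0.5  # hypothetical level converter power and delay

def evaluate(order, edges, assign, T):
    """Return (total power, LC edges), or None if the clock period T is violated.

    order: nodes in topological order; edges: list of zero-weight edges (u, v).
    """
    # An LC is required wherever a low-Vdd node drives a high-Vdd node.
    lcs = [(u, v) for (u, v) in edges
           if not STATES[assign[u]][0] and STATES[assign[v]][0]]
    arrival = {v: 0.0 for v in order}
    for v in order:
        for (u, w) in edges:
            if w == v:
                lc_delay = D_LC if (u, v) in lcs else 0.0
                arrival[v] = max(arrival[v],
                                 arrival[u] + STATES[assign[u]][1] + lc_delay)
        if arrival[v] + STATES[assign[v]][1] > T + 1e-9:
            return None
    power = sum(STATES[assign[v]][2] for v in order) + P_LC * len(lcs)
    return power, lcs

def brute_force(order, edges, T):
    """Try all 4^n state assignments and keep the first minimum-power feasible one."""
    best = None
    for combo in product(STATES, repeat=len(order)):
        assign = dict(zip(order, combo))
        result = evaluate(order, edges, assign, T)
        if result and (best is None or result[0] < best[0]):
            best = (result[0], assign, result[1])
    return best

# Hypothetical three-gate chain a -> b -> c with clock period 6.5.
order = ["a", "b", "c"]
edges = [("a", "b"), ("b", "c")]
best_power, best_assign, best_lcs = brute_force(order, edges, T=6.5)
```

In this toy instance the optimum places the high-Vdd gate first in the chain, so no level converter is needed; putting it anywhere else would require an LC whose delay breaks the timing constraint.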
Linear Programming Relaxation
Our related experiment shown in Section 3 indicates that the computational effort to solve the ILP-based voltage scaling quickly becomes prohibitive as the size of the circuit increases. In this section, we propose a method to relax the ILP formulation into an LP to overcome this limitation. We first solve the LP-relaxed version of the original ILP problem, which requires a runtime that is a few orders of magnitude smaller. Next, we convert the non-integral LP solution into an integral ILP solution while satisfying the level conversion and clock period constraints. The objective of our LP remains the same: minimization of total power consumed by the gates and level converters. One of the biggest challenges is the continuous (LP) to integral (ILP) conversion of the voltage assignment (= x_{v,k}) and level converter assignment (= m(e)) variables. Our basic approach is to iteratively search for the best possible m(e) assignment while using the x_{v,k} conversion algorithm to guide the search process.
Our LP formulation uses the same objective and constraints as the original ILP formulation, i.e., we minimize Equation (4) under the constraints (5) to (9). Instead of (10) and (11), however, we use the following non-integral constraints: $0 \le x_{v,k} \le 1$ and $0 \le m(e) \le 1$. In each iteration, the continuous m(e) values from the LP solution are converted into binary values using a threshold value m_th (line 7-9). We then solve the LP problem based on these binary m(e) values and see if the timing constraints are met (line 10-11). If so, we use a heuristic algorithm named voltage mapping, discussed in the next section, to map the continuous x_{v,k} values to binary (line 12-14). We perform a gain-based gradient search to obtain a new m_th value (line 6) and repeat the whole process to see if the total power is further reduced under the new LC assignment. This search continues until the gain is not significant or the number of iterations has exceeded a certain limit (line 5).
We obtain the baseline solution by setting m(e) = 0, solving the LP, and performing the voltage mapping (line 1-3). Note that fixing m(e) = 0 for all edges means we do not allow any LC to be inserted after the voltage scaling. In other words, the voltage scaling is severely restricted such that there should be no edge e_{u,v} that connects a low-Vdd node u to a high-Vdd node v unless w(e) > 0, i.e., a FF exists on e. Nonetheless, it is still possible to reduce the total power under this restriction, and the final result becomes our baseline solution. We perform gradient search to obtain a new target threshold value m_th, where the total power reduction during the last two iterations is used to compute a new target. Note that the power gain is not linearly dependent on m_th. It is possible to obtain more power reduction with a higher and/or lower m_th value. In case of a high m_th value, the number of LCs added is small, thereby reducing the power consumed by LCs. However, this limits the voltage scaling opportunity. In case of a low m_th value, the larger number of LCs added increases the power consumed by LCs but allows more aggressive voltage scaling.
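The thresholding step and the outer search over m_th can be sketched as follows. This is only an illustration: the text does not give the exact update rule of the gain-based gradient search, so the rule below (keep the step direction while the power improves, reverse and halve it otherwise) is a stand-in assumption. The function names, the `evaluate` callback (which would wrap the LP solve, timing check, and voltage mapping), and the toy cost model are all hypothetical. An edge is given an LC when its fractional m(e) is at least m_th, which matches the observation that a high m_th yields few LCs.

```python
def threshold_lcs(m_frac, m_th):
    """Edges whose fractional LP value m(e) reaches the threshold receive an LC."""
    return {e for e, m in m_frac.items() if m >= m_th}

def search_threshold(m_frac, evaluate, baseline_power,
                     m_th=0.5, step=0.2, max_iter=20, min_gain=1e-3):
    """Search for an m_th whose LC set gives the lowest total power.

    evaluate(lcs) returns the total power for a binary LC set,
    or None when the timing constraints cannot be met.
    """
    best_power, best_lcs = baseline_power, set()  # baseline: no LC anywhere
    prev_power = baseline_power
    for _ in range(max_iter):
        lcs = threshold_lcs(m_frac, m_th)
        power = evaluate(lcs)
        if power is None:
            # Too many LCs for the timing budget: raise the threshold and retry.
            m_th = min(1.0, m_th + abs(step))
            continue
        if power < best_power:
            best_power, best_lcs = power, lcs
        gain = prev_power - power
        if abs(gain) < min_gain:
            break                      # no significant change: stop
        if gain < 0:
            step = -step / 2           # got worse: reverse and shrink the step
        prev_power = power
        m_th = min(1.0, max(0.0, m_th + step))
    return best_power, best_lcs

# Toy stand-in for the LP + voltage mapping: LCs on e1/e2 pay off, an LC on e3 does not.
m_frac = {"e1": 0.9, "e2": 0.55, "e3": 0.2}
toy = lambda lcs: 10 - 3 * ("e1" in lcs) - ("e2" in lcs) + 2 * ("e3" in lcs)
result = search_threshold(m_frac, toy, baseline_power=10)
```

Starting from the no-LC baseline, the search keeps the best LC set seen so far, so a poor threshold late in the search cannot degrade the reported solution.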
Voltage Mapping Algorithm
input: LP-based voltage scaling with LC inserted
output: ILP-based voltage scaling with reduced LC set
1. T = topological ordering of gates;
2. assign low-Vdd+high-Vth to all PIs;
3. while (T is not empty)
   ...
7.    v ← Vdd-L+Vth-H;
   ...
      v ← Vth-L;
   ...
Figure 2: Voltage mapping algorithm.
Voltage Mapping
The main objective of our voltage mapping stage is to map the continuous voltage assignment variables x_{v,k} resulting from our LP formulation to binary values. There exist two major constraints during this mapping: LC (level converter) and timing constraints. Since we have performed LC insertion before calling the voltage mapping step, the supply voltage assignment has to honor the existing LCs, i.e., there should always be a low-Vdd to high-Vdd transition on each edge e with an LC, as expressed in Equations (9) and (11). In addition, the voltage mapping should be done in such a way that no node violates the clock period and arrival time constraints expressed in Equations (6), (7), and (8). Since the voltage mapping step picks only one of the four continuous assignment variables (x_{v,1}, x_{v,2}, x_{v,3}, x_{v,4}) and makes it 1 while fixing the others to 0 for each node v, Equations (5) and (10) are also satisfied. Figure 2 shows our voltage mapping algorithm. Since the goal is to reduce the total power under LC and timing constraints, more low-Vdd and high-Vth nodes means more power reduction as long as these constraints are not violated. Note that a simple maximum function may not guarantee the LC and timing constraints. For example, if x_{v,1} = 0.2, x_{v,2} = 0.2, x_{v,3} = 0.4, and x_{v,4} = 0.2, then this "maximum" scheme assigns high-Vdd plus high-Vth (k = 3) to v. In our algorithm, we visit each node in a topological order so that the voltage mapping for all fan-in nodes is done when visiting a new node (line 1). The PIs are initialized to low-Vdd+high-Vth (line 2). For each node in a topological order, we first compute the given node v based on the four possible scenarios shown in Figure 3. We start with the minimum total power configuration for each node, i.e., low-Vdd+high-Vth (line 7). We then decide whether we must raise the Vdd (line 8-11) or lower the Vth (line 14-17) based on the LC and timing constraints.
During the Vdd mapping step, we first see for a given node v if there is any fan-in node u with low Vdd assigned such that e_{u,v} contains an LC. If so, high-Vdd has to be assigned to v to satisfy the LC constraint (line 8-9). Next, if vdd(v) > 0, the previous linear programming partially assigned high-Vdd to v, and raising v to high-Vdd will never violate timing constraints (line 10-11). At this point, it is important to note that some LCs become unnecessary during the PI-to-PO Vdd mapping process, such as cases 5, 7, and 8 in Figure 3. Thus, our LC removal step (line 12-13) deletes these unnecessary LCs if (i) a high-Vdd node drives a low- or high-Vdd node while using an LC (cases 5 and 7), or (ii) a low-Vdd node drives another low-Vdd node while using an LC (case 8). Since LC removal never increases the overall delay, the timing constraint is never violated. During the subsequent Vth mapping, our goal is to see if the initial high-Vth has to be adjusted due to timing constraints: if dly(v) lies in between the delay of a high-Vth gate and a low-Vth gate, low-Vth assignment is guaranteed to satisfy the timing constraint at the expense of a slight leakage power increase.
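The Vdd mapping, LC removal, and Vth mapping steps described above can be sketched as a single topological pass. This is one reading of the description, not the paper's code: the data layout, the function names, and the per-state delays are hypothetical; vdd(v) is taken to be the LP fraction assigned to the two high-Vdd states; and dly(v) is taken to be the LP-weighted delay of v, so that low-Vth is chosen whenever that delay is smaller than the high-Vth delay at the chosen Vdd.

```python
HIGH_VDD_STATES = (1, 3)  # 1 = high-Vdd/low-Vth, 3 = high-Vdd/high-Vth

def state_of(high_vdd, high_vth):
    """Map a (Vdd, Vth) choice to the state numbering used in the text."""
    return {(True, False): 1, (False, False): 2,
            (True, True): 3, (False, True): 4}[(high_vdd, high_vth)]

def voltage_mapping(order, fanins, primary_inputs, x, delay, lcs):
    """Map fractional LP values to one state per node and drop unneeded LCs.

    order: nodes in topological order; fanins: node -> list of fan-in nodes;
    x: node -> {state: LP fraction}; delay: node -> {state: delay};
    lcs: set of edges (u, v) that carry an LC after thresholding.
    """
    high_vdd, state, kept = {}, {}, set(lcs)
    for v in order:
        if v in primary_inputs:
            high_vdd[v], state[v] = False, 4      # PIs start at low-Vdd/high-Vth
            continue
        # Vdd mapping: an LC fed by a low-Vdd fan-in forces high-Vdd; otherwise
        # follow the LP when it put any weight on a high-Vdd state.
        forced = any(not high_vdd[u] and (u, v) in kept for u in fanins[v])
        lp_vdd = sum(x[v][k] for k in HIGH_VDD_STATES)
        high_vdd[v] = forced or lp_vdd > 0
        # LC removal: only a low-Vdd to high-Vdd edge still needs its LC.
        for u in fanins[v]:
            if (u, v) in kept and (high_vdd[u] or not high_vdd[v]):
                kept.discard((u, v))
        # Vth mapping: keep high-Vth only if the LP delay budget allows it.
        lp_delay = sum(x[v][k] * delay[v][k] for k in x[v])
        high_vth = lp_delay >= delay[v][state_of(high_vdd[v], True)] - 1e-9
        state[v] = state_of(high_vdd[v], high_vth)
    return state, kept

# Hypothetical example: primary input pi drives g1, which drives g2.
d = {1: 1.0, 2: 2.0, 3: 1.5, 4: 3.0}
x = {"g1": {1: 0.0, 2: 0.0, 3: 0.5, 4: 0.5},
     "g2": {1: 0.0, 2: 0.3, 3: 0.0, 4: 0.7}}
state, kept = voltage_mapping(
    order=["pi", "g1", "g2"], fanins={"g1": ["pi"], "g2": ["g1"]},
    primary_inputs={"pi"}, x=x, delay={"g1": d, "g2": d},
    lcs={("pi", "g1"), ("g1", "g2")})
```

In the example, g1 is forced to high-Vdd by the LC on its input, while the LC between g1 and g2 is removed because a high-Vdd node drives a low-Vdd node.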
Post Refinement
The last step of our algorithm is the post refinement, where we perform additional voltage scaling to the ILP solution. The primary concern during voltage mapping discussed in Section 2.4 is to satisfy the LC and timing constraints. Thus, our focus is to accept voltage mapping that will never violate the timing constraint for each node, which results in a delay reduction for each node in most cases. This delay change of a node affects the delay of all of its downstream nodes in a directed graph and may allow additional power reduction among them. Thus, a positive timing slack resulting from our conservative voltage mapping needs to be propagated downwards to correctly reflect the slack change globally. However, our voltage mapping does not perform static timing analysis (= timing slack re-computation) upon the voltage mapping of each node due to its prohibitive runtime, which may hide some power reduction opportunities. Thus, the voltage mapping based on the initial timing slack is a primary source of non-optimality. In addition, our LP formulation discussed in Section 2.3 may assign m(e) values that are not close to 0 or 1 for potentially many edges. Thus, relying on a single threshold value to decide which edge gets an LC or not for all edges is another source of non-optimality.

   ...
6.  while (there is power reduction)
7.     for (each node v ∈ C)
8.        power_gain(v, slk(v), C);
9.     z = max power gain node;
10.    commit voltage change for z;
11.    update slack for downstream nodes of z;
Figure 4: Post refinement algorithm.

Figure 4 shows our post refinement algorithm. The basic idea is to identify the nodes with positive timing slack and try to reduce their total power consumption by additional voltage scaling under timing and LC constraints. This time, however, we examine the impact of the proposed voltage scaling of each node on all affected nodes. We first perform clustering based on timing slack, where each cluster contains a set of reachable nodes with positive slack (line 1-4). In this case, we visit the largest cluster first (line 4) since our exploration is limited to the nodes inside each cluster and thus the more (and earlier) the nodes examined to see the impact of voltage scaling the better. We visit each cluster (line 5) and compute the total power gain for each node in the cluster (line 7-8). During the power gain computation of each node v, we compute the power reduction for v as well as all of its predecessors inside the cluster using our recursive algorithm power_gain shown in Figure 5 (to be discussed later). We then select the node that results in the maximum power reduction and commit the voltage change (line 9-10). Lastly, we update the timing slack for all downstream nodes of the max-gain node (line 11). We continue to target the same cluster until there is no further power gain (line 6). Figure 5 shows our recursive algorithm that computes the total power gain of a given node and all its predecessors inside the given cluster. The voltage scaling and thus the delay increase of a node v reduces the timing slack of many of its downstream nodes. Thus, it is unlikely that there exists any power saving opportunity via voltage scaling among the downstream nodes. The upstream nodes of v, however, are not affected by the delay change of v. Thus, we limit our exploration to v and its predecessors to examine the impact of voltage scaling v.
In addition, the reason we limit our search to the nodes inside the given cluster is that it is not possible for the zero-slack nodes outside the cluster to accommodate the delay increase without timing violation. For a given node v, we compute the power saving and delay increase for each candidate power state (line 3-7). Among the feasible power states, we pick the one with the maximum total power reduction (line 8-10). We visit the predecessors and keep track of the power gain from them (line 11-14). Finally, we return the total gain (line 15) as the final output.
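The slack-based clustering used by the post refinement can be sketched as below. The text only says that each cluster contains a set of reachable nodes with positive slack; this sketch assumes that means connected groups of positive-slack nodes (ignoring edge direction), ordered from largest to smallest. The function name and the example graph are hypothetical.

```python
def slack_clusters(edges, slack):
    """Group positive-slack nodes into connected clusters, largest cluster first."""
    positive = {v for v, s in slack.items() if s > 0}
    adjacent = {v: set() for v in positive}
    for u, v in edges:
        if u in positive and v in positive:
            adjacent[u].add(v)
            adjacent[v].add(u)
    clusters, seen = [], set()
    for start in sorted(positive):          # sorted only to make the order repeatable
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:                        # depth-first walk over positive-slack nodes
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacent[node] - component)
        seen |= component
        clusters.append(component)
    clusters.sort(key=len, reverse=True)    # visit the largest cluster first
    return clusters

# Hypothetical example: node c has zero slack and separates {a, b} from {d}.
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("e", "f")]
slack = {"a": 1.0, "b": 2.0, "c": 0.0, "d": 3.0, "e": 1.0, "f": 1.0}
clusters = slack_clusters(edges, slack)
```

Zero-slack nodes act as cluster boundaries here, which is consistent with the argument that they cannot absorb any delay increase.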
Note that the computation of ∆p (line 4) and ∆d (line 5) considers the impact of additional LC insertion. For instance, if the Vdd is currently set to high for a given node u and one of its fanout v is set to high-Vdd while m(e_{u,v}) = 0, the ∆p for high-to-low Vdd adjustment for u should not only include the power saving from Vdd scaling but also the power increase from the LC that has to be inserted. In addition, the ∆d should include the delay increase from high-to-low Vdd adjustment as well as the LC insertion. Moreover, this delay change from Vdd adjustment further affects the subsequent Vth scaling and its corresponding leakage power. A similar argument applies when we examine the impact of related LC removal from a low-to-high Vdd adjustment on dynamic/leakage/delay tradeoff.

power_gain(v, dly, C)
input: a node v ∈ C and timing slack dly
   ...
7.        mark i feasible;
8.  y = feasible voltage state with max ∆p;
9.  dly = dly + ∆d based on x → y;
10. tot_gain = ∆p based on x → y;
    // recursive call
11. for (each non-visited fan-in u ∈ C)
12.    if (slk(u) > dly)
13.       gain = power_gain(u, dly, C);
14.       tot_gain = tot_gain + gain;
15. return tot_gain;
Figure 5: Recursive power gain computation.
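The recursive gain computation can be sketched in Python as follows. This is an interpretation of the surviving lines of Figure 5, not a transcription: here `dly` is treated as the delay increase accumulated so far (starting from 0), a candidate state is feasible when the accumulated increase stays within slk(v), and the candidate list for each node, given as (∆p, ∆d) pairs that already include any LC insertion or removal effects, is a hypothetical input.

```python
def power_gain(v, dly, cluster, fanins, slack, candidates, visited=None):
    """Total power saving from scaling v and, recursively, its fan-ins in the cluster.

    dly: delay increase already committed downstream of v on this walk.
    candidates: node -> list of (power saving, delay increase) per candidate state.
    """
    visited = set() if visited is None else visited
    visited.add(v)
    # Keep the candidate states that save power and still fit in v's slack.
    feasible = [(dp, dd) for dp, dd in candidates.get(v, [])
                if dp > 0 and dly + dd <= slack[v]]
    tot_gain = 0.0
    if feasible:
        dp, dd = max(feasible, key=lambda c: c[0])  # state with the largest saving
        dly += dd
        tot_gain = dp
    # Recursive call: only fan-ins inside the cluster with slack left to spend.
    for u in fanins.get(v, []):
        if u in cluster and u not in visited and slack[u] > dly:
            tot_gain += power_gain(u, dly, cluster, fanins, slack, candidates, visited)
    return tot_gain

# Hypothetical two-node cluster: u drives v, both with 2.0 units of slack.
cluster = {"u", "v"}
fanins = {"v": ["u"]}
slack = {"u": 2.0, "v": 2.0}
candidates = {"v": [(3.0, 1.0), (5.0, 2.5)], "u": [(2.0, 0.5), (4.0, 1.5)]}
gain = power_gain("v", 0.0, cluster, fanins, slack, candidates)
```

In the example, v takes the state that saves 3.0 at a delay cost of 1.0, which leaves u enough slack only for its cheaper option, so the larger savings at both nodes are rejected as infeasible.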
EXPERIMENTAL RESULTS
Our algorithm, named Retiming-based Voltage Scaling (RVS), is implemented in C++/STL, compiled with gcc v3.2.2, and run on a Pentium IV 2.4 GHz machine. The solutions to the ILPs/LPs were found using version 4.5 of the GNU Linear Programming Kit [9]. Our benchmark set consists of nine sequential circuits from the ISCAS89 benchmark suite. We use the following technology parameters for a 130nm process: Vdd high/low is set to 1.2V/0.6V, and Vth high/low is set to 0.23V/0.12V. The delay, dynamic power, and leakage power of each gate shown in Table 1 are computed according to [27, 4, 6]. We assume 20% average switching activity for the gates.
In Table 2 , we show the total timing slack among all nodes before and after our min-FF retiming. The purpose is to investigate the impact of retiming on the subsequent voltage scaling. The nodes with larger timing slack, i.e., the nodes off timing critical paths, are the prime target for voltage scaling. We observe that the impact of our min-FF retiming on the timing slack is minimal, suggesting that min-FF retiming is not interfering with the subsequent voltage scaling. Moreover, our min-FF retiming helps reduce the total power by minimizing the power consumed by FFs as shown in Table 3 (to be discussed).
In Table 3, we investigate the impact of the retiming objective (max-FF vs min-FF) on voltage scaling (Vdd-only vs Vdd/Vth simultaneously). Here we compare four algorithms: max-FF+Vdd, max-FF+Vdd/Vth, min-FF+Vdd, and min-FF+Vdd/Vth. We perform our LP-based voltage scaling (= LP-relaxation, voltage mapping, and post-refinement) for all four algorithms. Note that the max-FF+Vdd algorithm is similar to [4] except that we use our own LP-based Vdd scaling instead of the original ILP approach. For each algorithm, we report the total power consumed (= dynamic and leakage) by the gates/LC (= GL) as well as by the gates/LC/FF (= GLF). From the comparison between GL and GLF, we note that GLF is 10% to 28% higher on average in all four algorithms. This indicates that the FF power must be considered during the computation and optimization of total power consumption.
Next, we observe from the GL reduction trend (= g-ratio) that the min-FF vs max-FF retiming objective has little impact on the GL power consumption. However, the simultaneous Vdd/Vth scaling reduces the GL power by 30% on average compared to Vdd scaling-only. The GLF reduction trend (= t-ratio) indicates that the min-FF retiming obtains 7% (Vdd only) and 10% (both Vdd/Vth) better solutions on average compared to max-FF retiming. In addition, the simultaneous Vdd/Vth scaling obtains 24% (max-FF retiming) and 27% (min-FF retiming) average GLF improvement over Vdd scaling-only. In summary, our min-FF retiming plus simultaneous Vdd/Vth scaling reduces the total power consumption by 34% on average compared to max-FF plus Vdd scaling [4]. The average runtime for each algorithm was approximately 40 seconds for each circuit.

Table 4 shows the total number of nodes under each Vdd/Vth configuration. We also report the number of LCs used. We observe that the majority of nodes are assigned high-Vdd/high-Vth. These nodes are often used to reduce the leakage power while meeting the timing constraints. The low-Vdd/low-Vth nodes also provide the same kind of effect as high-Vdd/high-Vth nodes. However, Vdd scaling has more impact on the delay increase than Vth scaling, so low-Vdd/low-Vth nodes, with their larger delay (2.53ps vs 1.24ps), are not used as often as high-Vdd/high-Vth nodes. However, a straightforward extension of our formulation can boost the usage of low-Vdd/low-Vth in case we prefer dynamic power reduction over leakage power. Next, the usage of high-Vdd/low-Vth (maximum power, minimum delay) is inevitable for timing-critical nodes due to the timing constraints. However, our voltage scaling was successful in keeping this portion low. The minimum power configuration (= low-Vdd/high-Vth) is used heavily for almost all circuits to reduce the total power consumption.
It is interesting to note that the circuits with heavy low-Vdd/high-Vth usage tend to utilize more LCs.

Table 5 shows the breakdown of total power into leakage (for all gates), dynamic (for all gates), LC power (dynamic+leakage), and FF power (dynamic+leakage). We note that the power consumed by FFs is a dominant factor for several circuits (s838 and s1238, for example). In addition, the dynamic power is still higher than leakage for the 130nm technology. The power consumed by LCs is relatively small.

Table 6 shows a comparison among four voltage scaling methods in terms of total power consumption. The first one, named INIT, is when all nodes are assigned to high supply and low threshold voltage. This serves as an upper bound on total power consumption. Under CVS we report the well-known Clustered Voltage Scaling results [27]. We report our retiming-based voltage scaling results under the RVS column. In addition, we solve the ILP problem discussed in Section 2.2 without the LP relaxation and report the results under the ILP column. Due to its prohibitive runtime, we give the ILP solver one day for each circuit and report the best solution found within that limit. We perform our min-FF retiming and the post refinement for all four algorithms. We observe that our RVS outperforms all other algorithms. In particular, RVS obtains results that are very close to the ILP but at a fraction of the runtime. RVS also outperforms CVS by 28% on average.
CONCLUSIONS
This paper presented the first work that combines retiming and dual supply/threshold voltage scaling for total power reduction. Our solution consists of three steps: low power retiming, ILP-based voltage scaling, and post refinement. We relax the ILP formulation into an LP, solve the LP in an iterative manner, and apply several heuristics to convert the LP solutions back to integer solutions. The related experiments show that we obtain solutions that are very close to the pure ILP approach within a fraction of the runtime while outperforming several well-known methods.
