This work first presents an analytical repeater insertion method that optimizes power under a delay constraint for a single net. The method finds the optimal repeater insertion lengths, repeater sizes, and $V_{dd}$ and $V_{th}$ levels for a net with a delay target, and it reduces power by more than 50% compared with a previous work that does not consider $V_{dd}$ and $V_{th}$ optimization. This work further presents the power saving obtained when multiple $V_{dd}$ and $V_{th}$ levels are used in repeater insertion at the full-chip level. Compared to the case with the single $V_{dd}$ and $V_{th}$ suggested by ITRS, optimized dual $V_{dd}$ and dual $V_{th}$ reduce overall global interconnect power by 47%, 28% and 13% for the 130nm, 90nm and 65nm technology nodes, respectively, but extra $V_{dd}$ or $V_{th}$ levels give only marginal improvement. We also show that an optimized single $V_{th}$ reduces interconnect power almost as effectively as dual $V_{th}$ does, in contrast to the need for dual $V_{th}$ in logic circuits.
INTRODUCTION
Repeater insertion causes an increasingly severe power consumption problem due to the ever increasing number of repeaters [1]. The traditional approach to repeater insertion optimizes the interconnect in terms of delay, but several works in the literature [2, 3, 4] have made use of the extra tolerable delay (i.e., slack) in nets for significant savings in interconnect power. [2, 3] provide analytical methods to compute unit length power optimal repeater insertion solutions. [4] defines a new figure of merit which allows a trade-off between power and delay using repeater insertion lengths, repeater sizes and wire widths as design knobs. None of the above works considers the supply voltage $V_{dd}$ and the threshold voltage $V_{th}$ as design freedoms. [5] performs dual $V_{dd}$ and dual $V_{th}$ assignment on logic circuits to reduce power consumption, and shows that 20% of the power can be saved by going from single $V_{th}$ to dual $V_{th}$ under a dual $V_{dd}$ power supply.
This paper studies the opportunity for power saving by computing power optimal repeater sizes, repeater insertion lengths, and $V_{dd}$ and $V_{th}$ levels for both individual nets and full chips. This paper is organized as follows. Section 2 discusses the delay and the power models. Section 3.1 presents single-net power optimization with $V_{dd}$ and $V_{th}$ tuning. Section 4 studies the full-chip power optimization using multiple $V_{dd}$ and $V_{th}$ levels. We conclude this paper in Section 5.
PRELIMINARIES
This section discusses the delay and power models used in this paper. Both models are based on those in [2], which assume fixed $V_{dd}$ and $V_{th}$. We extend the models to reflect the effects of $V_{dd}$ and $V_{th}$ scaling.
Delay Model
Consider an interconnect of unit length resistance $r$, unit length capacitance $c$, and total length $L$. Suppose the interconnect is divided into $L/l$ segments and identical repeaters of unit driving resistance $r_s$, unit input capacitance $c_o$, unit output capacitance $c_p$ and size $s$ are inserted at the beginning of every segment. The delay of a segment consisting of a repeater driving an interconnect segment of length $l$ terminated with a repeater of the same size is given by
$$\tau = r_s(c_o + c_p) + \frac{r_s}{s}\,c\,l + r\,l\,s\,c_o + \frac{1}{2}\,r\,c\,l^2 \qquad (1)$$
and the unit length delay is
$$\frac{\tau}{l} = \frac{1}{l}\,r_s(c_o + c_p) + \frac{r_s}{s}\,c + r\,s\,c_o + \frac{1}{2}\,r\,c\,l \qquad (2)$$
The total delay of the entire interconnect is $(\tau/l)\,L$, assuming continuous numbers of buffers and segments. The driving resistance of the repeater depends on the operating $V_{dd}$ and $V_{th}$ levels and is approximated in [3] by
$$r_s = K_1\,\frac{V_{dd}}{I_{dsat}} \qquad (3)$$
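where $K_1$ is a fitting parameter and $I_{dsat}$ is the saturated drain current of a minimum-sized NMOS or PMOS transistor with both $V_{gs}$ and $V_{ds}$ equal to $V_{dd}$. According to the alpha-power law model [6], $I_{dsat}$ is modeled as
$$I_{dsat} = K_2\,(V_{dd} - V_{th})^{\alpha} \qquad (4)$$
where $K_2$ is a device parameter and $\alpha$ is about 1.25 for recent technology generations. By plugging Equation (4) into Equation (3), we obtain $r_s$ as a function of $V_{dd}$ and $V_{th}$, which is given by
$$r_s = K_3\,\frac{V_{dd}}{(V_{dd} - V_{th})^{\alpha}} \qquad (5)$$
where $K_3 = K_1/K_2$. For a given $V_{dd}$ and $V_{th}$, we obtain the optimal unit length delay by setting
$$l_{opt} = \sqrt{\frac{2\,r_s(c_o + c_p)}{r\,c}}, \qquad s_{opt} = \sqrt{\frac{r_s\,c}{r\,c_o}} \qquad (6)$$
and the optimum unit length delay is given by
$$\left(\frac{\tau}{l}\right)_{opt} = 2\sqrt{r_s\,c_o\,r\,c}\,\left(1 + \sqrt{\frac{1}{2}\left(1 + \frac{c_p}{c_o}\right)}\right) \qquad (7)$$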
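The closed-form optimum can be checked numerically. The following is a minimal sketch, assuming the standard four-term RC segment delay (driver self-loading, driver charging the wire, wire charging the next repeater, and distributed wire delay); the function names and all parameter values are hypothetical and are not taken from this paper.

```python
import math

def unit_delay(l, s, rs, r, c, co, cp):
    """Unit length delay tau/l of a repeated line (assumed four-term RC model)."""
    return rs * (co + cp) / l + rs * c / s + r * s * co + 0.5 * r * c * l

def delay_optimal(rs, r, c, co, cp):
    """Closed-form delay-optimal segment length, repeater size and unit delay."""
    l_opt = math.sqrt(2.0 * rs * (co + cp) / (r * c))
    s_opt = math.sqrt(rs * c / (r * co))
    d_opt = 2.0 * math.sqrt(rs * co * r * c) * (
        1.0 + math.sqrt(0.5 * (1.0 + cp / co)))
    return l_opt, s_opt, d_opt

# Hypothetical parameters in SI units (ohm, farad, metre); illustration only.
PARAMS = dict(rs=1.0e4, r=1.0e5, c=2.0e-10, co=1.0e-15, cp=2.0e-15)
l_opt, s_opt, d_opt = delay_optimal(**PARAMS)
```

Evaluating `unit_delay` at `(l_opt, s_opt)` reproduces `d_opt`, and moving either variable away from its optimum increases the unit length delay.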
Suppose we are given a target delay per length, expressed as $f$% more than $(\tau/l)_{opt}$. We can then find a family of solutions $\{V_{dd}, V_{th}, l, s\}$ that satisfy the target delay. Within this solution set, there exists a solution that achieves the minimum power. The methodology for finding such a solution is presented in Section 3.1.
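To illustrate the family of solutions, the sketch below fixes the driver resistance (that is, fixes the supply and threshold voltages) and, for each repeater size, solves the delay constraint for the segment length. It assumes the four-term RC unit delay model, under which the constraint is a quadratic in the segment length; the function names and parameter values are hypothetical.

```python
import math

def unit_delay(l, s, rs, r, c, co, cp):
    """Unit length delay tau/l of a repeated line (assumed four-term RC model)."""
    return rs * (co + cp) / l + rs * c / s + r * s * co + 0.5 * r * c * l

def lengths_meeting_target(s, target, rs, r, c, co, cp):
    """Both segment lengths whose unit delay equals `target`, or None.

    Multiplying tau/l = target by l gives
    0.5*r*c*l^2 - m*l + rs*(co + cp) = 0 with m = target - rs*c/s - r*s*co.
    """
    m = target - rs * c / s - r * s * co
    disc = m * m - 2.0 * r * c * rs * (co + cp)
    if m <= 0.0 or disc < 0.0:
        return None  # this repeater size cannot meet the target delay
    root = math.sqrt(disc)
    return (m - root) / (r * c), (m + root) / (r * c)

# Hypothetical parameters in SI units; illustration only.
PARAMS = dict(rs=1.0e4, r=1.0e5, c=2.0e-10, co=1.0e-15, cp=2.0e-15)
d_opt = 2.0 * math.sqrt(PARAMS["rs"] * PARAMS["co"] * PARAMS["r"] * PARAMS["c"]) * (
    1.0 + math.sqrt(0.5 * (1.0 + PARAMS["cp"] / PARAMS["co"])))
target = 1.2 * d_opt  # f = 20% above the optimum unit length delay
family = {s: lengths_meeting_target(s, target, **PARAMS)
          for s in (80.0, 120.0, 160.0, 200.0)}
```

Each feasible size yields two lengths that meet the target; the longer one uses fewer repeaters per unit length and is therefore the natural candidate when minimizing power.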
Power Model
For an interconnect of length $L$, the total power dissipated by the repeaters is $(P_{tot}/l)\,L$. The power consumption of a repeater comprises three parts: dynamic, leakage, and short circuit. We use the same formulae to compute power as in [2], except that $V_{dd}$ and $V_{th}$ are treated as variables in the expressions. The power models are summarized below.
Dynamic power is dissipated when repeaters charge and discharge their loading capacitances. It is given by
$$P_{dynamic} = a\,\big(s(c_o + c_p) + l\,c\big)\,V_{dd}^2\,f_{clk}$$
where $a$ is the switching activity of a repeater, which is assumed to be 0.15, and $f_{clk}$ is the clock frequency.
We consider only the subthreshold leakage, as in [2]. The subthreshold leakage current of a minimum-sized NMOS transistor is computed from the reference subthreshold leakage current and the reference threshold voltage of a particular technology node, together with the subthreshold swing $S_w$, which we assume to be 100 mV/decade at a temperature of 100 °C. The model assumes that the transistor is in the OFF state with $V_{gs} = 0$ and $V_{ds} = V_{dd}$.
The average leakage power of a repeater is the product of $V_{dd}$ and its average leakage current.
The short circuit power dissipation depends on the transition time at the input and the output of an inverter. Assuming symmetric high-to-low and low-to-high transitions at the input and the output of the repeater, the short circuit power is expressed in terms of $a$, the same switching factor as in the dynamic power expression, the short circuit current $I_{short-circuit}$, which is approximately 65 μA/μm, and the transition time $t_r = \tau \ln 3$.
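The power per length is therefore given by the sum of $P_{dynamic}$, $P_{leakage}$ and $P_{short-circuit}$ divided by the segment length, i.e.,
$$\frac{P_{tot}}{l} = \frac{P_{dynamic} + P_{leakage} + P_{short-circuit}}{l}$$
We specify the target delay by using $(1 + f)\left(\frac{\tau}{l}\right)_{opt}$.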
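The three power components can be sketched in code as follows. This is an illustration under stated assumptions rather than the exact expressions of [2]: the leakage current is assumed to change by one decade per subthreshold swing of threshold voltage, the short circuit power is assumed proportional to the transition time, and `w_min`, `i_off_ref`, `vth_ref` and every numeric value are hypothetical.

```python
import math

def leakage_current(vth, i_off_ref, vth_ref, swing):
    """Assumed form: one decade of leakage per `swing` volts of threshold shift."""
    return i_off_ref * 10.0 ** ((vth_ref - vth) / swing)

def repeater_power(l, s, vdd, vth, tau, p):
    """Return (dynamic, leakage, short_circuit) power of one repeater in watts."""
    dynamic = p["a"] * (s * (p["co"] + p["cp"]) + l * p["c"]) * vdd ** 2 * p["f_clk"]
    i_off = leakage_current(vth, p["i_off_ref"], p["vth_ref"], p["swing"])
    leakage = vdd * s * p["w_min"] * i_off
    t_r = tau * math.log(3.0)  # transition time t_r = tau * ln 3
    # Assumed form of the short circuit term (current flows during transitions).
    short_circuit = p["a"] * t_r * vdd * s * p["w_min"] * p["i_sc"] * p["f_clk"]
    return dynamic, leakage, short_circuit

def power_per_length(l, s, vdd, vth, tau, p):
    """Total repeater power of one segment divided by the segment length."""
    return sum(repeater_power(l, s, vdd, vth, tau, p)) / l

# Hypothetical values: widths in um, currents in A/um, other quantities in SI.
P = dict(a=0.15, f_clk=2.0e9, c=2.0e-10, co=1.0e-15, cp=2.0e-15,
         w_min=0.2, i_off_ref=1.0e-7, vth_ref=0.3, swing=0.1, i_sc=65.0e-6)
dyn, leak, sc = repeater_power(1.7e-3, 140.0, 1.2, 0.3, 1.0e-10, P)
```

In this form the dynamic term scales with the square of the supply voltage while the leakage term grows exponentially as the threshold voltage is lowered, which is the trade-off the optimization exploits.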
SINGLE NET POWER OPTIMIZATION

Analytical Solution
Based on the delay and power models discussed previously, we express the problem formulation as
For given $V_{dd}$, $V_{th}$ and a delay target, the optimal $l$ and $s$ that give the minimum $P_{tot}/l$ can be obtained by solving the following set of nonlinear equations in [2], i.e.,
The insertion length $l$ is a function of the repeater size $s$ under the equality delay constraint in Equation (10). In this problem, both the objective function and the constraint are posynomial functions, which are known to be convex under a variable transformation. Therefore, such an optimization problem has a unique minimum, which can be found in polynomial time [7]. When $V_{dd}$ and $V_{th}$ are treated as variables, it is not obvious whether the problem is still convex. To visualize this, we can find the power-optimal solution for every point in the $V_{dd}$-$V_{th}$ space using Equation (10), which solves for power-optimal repeater insertion under fixed $V_{dd}$ and $V_{th}$. Figure 1 shows the resulting iso-power plot under a delay target of $(1 + 5\%)(\tau/l)_{opt}$. The plot suggests that power minimization through $V_{dd}$ and $V_{th}$ optimization can be solved analytically. Our future research will attempt to prove analytically that this problem possesses a unique optimum.
Based on the observation that an optimal point exists, we develop an analytical method to solve this problem. Following the equality delay constraint, one of the variables must be a function of the other three variables. In our derivation, $V_{th}$ is chosen to be the dependent variable, because it is the only variable that can be easily expressed in closed form in terms of the other three variables. From Equation (5), $V_{th}$ can be expressed in terms of $V_{dd}$ and $r_s$ as
$$V_{th} = V_{dd} - \left(\frac{K_3\,V_{dd}}{r_s}\right)^{1/\alpha}$$
By re-arranging Equation (2), $r_s$ can be expressed as a function of $l$ and $s$:
$$r_s = \frac{\dfrac{\tau}{l} - r\,s\,c_o - \dfrac{1}{2}\,r\,c\,l}{\dfrac{c_o + c_p}{l} + \dfrac{c}{s}}$$
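The elimination of the threshold voltage can be sketched as two small inversions. The code assumes the alpha-power form of the driver resistance and the four-term RC unit delay model; `k3` and the other numeric values are hypothetical and serve only to exercise the round trip.

```python
def driver_resistance(vdd, vth, k3, alpha=1.25):
    """Driver resistance from the alpha-power law (assumed form)."""
    return k3 * vdd / (vdd - vth) ** alpha

def vth_from_resistance(vdd, rs, k3, alpha=1.25):
    """Invert driver_resistance for the threshold voltage."""
    return vdd - (k3 * vdd / rs) ** (1.0 / alpha)

def resistance_for_target(l, s, target, r, c, co, cp):
    """Driver resistance that makes the unit length delay equal to `target`.

    A non-positive result means (l, s) cannot meet the target at any voltage.
    """
    numerator = target - r * s * co - 0.5 * r * c * l
    denominator = (co + cp) / l + c / s
    return numerator / denominator

# Hypothetical values in SI units; illustration only.
K3 = 7300.0
WIRE = dict(r=1.0e5, c=2.0e-10, co=1.0e-15, cp=2.0e-15)
TARGET = 9.0e-8
rs_needed = resistance_for_target(1.5e-3, 150.0, TARGET, **WIRE)
vth_needed = vth_from_resistance(1.0, rs_needed, K3)
```

Given a supply voltage, a segment length and a repeater size, `vth_needed` is then the threshold voltage that exactly meets the delay target, so the remaining search space has three free variables.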
Therefore, when deriving the gradients of the objective function, $V_{th}$ is treated as a function of $V_{dd}$, $l$ and $s$. The following equations set the gradients of the objective function with respect to $V_{dd}$, $s$ and $l$ to zero.
These equations can be solved with an iterative numerical solver. The optimal solution from the analytical method is verified against exhaustive search, and the two match closely.
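An exhaustive search of the kind used for verification can be sketched as below. This is a simplified stand-in, not the procedure used in this paper: it keeps only dynamic and leakage power, assumes the alpha-power driver resistance and the four-term RC delay model, and uses hypothetical parameter values and grids.

```python
import math

# Hypothetical parameters in SI units; none are taken from the paper.
P = dict(r=1.0e5, c=2.0e-10, co=1.0e-15, cp=2.0e-15, k3=7300.0, alpha=1.25,
         a=0.15, f_clk=2.0e9, i_off_ref=1.0e-7, vth_ref=0.3, swing=0.1)

def driver_resistance(vdd, vth, p):
    return p["k3"] * vdd / (vdd - vth) ** p["alpha"]

def unit_delay(l, s, rs, p):
    return (rs * (p["co"] + p["cp"]) / l + rs * p["c"] / s
            + p["r"] * s * p["co"] + 0.5 * p["r"] * p["c"] * l)

def optimal_unit_delay(rs, p):
    return 2.0 * math.sqrt(rs * p["co"] * p["r"] * p["c"]) * (
        1.0 + math.sqrt(0.5 * (1.0 + p["cp"] / p["co"])))

def longest_segment(s, target, rs, p):
    """Largest l meeting the delay target for size s, or None if infeasible."""
    m = target - rs * p["c"] / s - p["r"] * s * p["co"]
    disc = m * m - 2.0 * p["r"] * p["c"] * rs * (p["co"] + p["cp"])
    if m <= 0.0 or disc < 0.0:
        return None
    return (m + math.sqrt(disc)) / (p["r"] * p["c"])

def power_per_length(l, s, vdd, vth, p):
    """Dynamic plus leakage power per unit length (short circuit omitted)."""
    dyn = p["a"] * (s * (p["co"] + p["cp"]) + l * p["c"]) * vdd ** 2 * p["f_clk"]
    leak = vdd * s * p["i_off_ref"] * 10.0 ** ((p["vth_ref"] - vth) / p["swing"])
    return (dyn + leak) / l

def best_sizing(vdd, vth, target, p, sizes):
    """Best (power, l, s) over the size grid at fixed voltages, or None."""
    rs = driver_resistance(vdd, vth, p)
    best = None
    for s in sizes:
        l = longest_segment(s, target, rs, p)
        if l is not None:
            cand = (power_per_length(l, s, vdd, vth, p), l, s)
            if best is None or cand < best:
                best = cand
    return best

def exhaustive_search(target, p, vdds, vths, sizes):
    """Best (power, vdd, vth, l, s) over the whole grid, or None."""
    best = None
    for vdd in vdds:
        for vth in vths:
            sizing = best_sizing(vdd, vth, target, p, sizes) if vth < vdd else None
            if sizing is not None:
                cand = (sizing[0], vdd, vth, sizing[1], sizing[2])
                if best is None or cand < best:
                    best = cand
    return best

VDDS = [i / 20 for i in range(12, 25)]   # 0.60 V to 1.20 V
VTHS = [j / 40 for j in range(4, 17)]    # 0.10 V to 0.40 V
SIZES = [float(s) for s in range(20, 401, 5)]
target = 2.0 * optimal_unit_delay(driver_resistance(1.2, 0.3, P), P)  # f = 100%
reference = best_sizing(1.2, 0.3, target, P, SIZES)
best = exhaustive_search(target, P, VDDS, VTHS, SIZES)
```

Because the reference voltages lie on the grid, `best` can never be worse than `reference`; comparing the two gives the saving attributable to voltage tuning in this toy model.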
Experimental Results
Equation (11) is used to optimize unit length power for a single net. The parameters for the power and delay models across various technology nodes are taken from [1]. Table 1 compares the results with and without $V_{dd}$ and $V_{th}$ tuning across different technology nodes for the target delay $\tau/l = (1 + f)(\tau/l)_{opt}$ as in [2]. As shown in Table 1, the amount of power saving that can be achieved from $V_{dd}$ and $V_{th}$ optimization depends on the target delay. When $f$ = 20%, the power saving is up to 28% across all technology nodes. When $f$ = 100%, the power saving is more than 50% for all generations. The power saving is mainly achieved by lowering the supply voltage. As we can see, the optimal $V_{dd}$ levels are generally lower than the reference values. When $f$ increases, $V_{dd}$ decreases significantly, showing that $V_{dd}$ provides a good trade-off for power by utilizing $f$. The optimal $V_{th}$ values slowly decrease with increasing $f$ to compensate for the loss of performance from $V_{dd}$ reduction. The reduction in $V_{th}$ causes a moderate increase in leakage power, but is rewarded by a large decrease in the dynamic power from lowering $V_{dd}$. The performance loss due to $V_{dd}$ reduction is also compensated by an increase of the repeater size $s$ and a slight decrease of the insertion length $l$ compared to the reference values.
FULL-CHIP INTERCONNECT POWER
Power Calculation
In this section, we propose a methodology to evaluate full-chip interconnect power. In [8], a closed-form analytical expression of the wire-length distribution for on-chip random logic networks is developed based on Rent's rule. We estimate the full-chip power by integrating the unit length power over the wire-length distribution, from the smallest wire length with non-negligible power to the longest global interconnect assumed by the wire-length distribution model. We use the delay optimal segment length $l_{opt}$ given by Equation (6) to define the shortest interconnect which requires at least one repeater to be inserted. Nets shorter than $l_{opt}$ are not considered, as they do not need repeaters. The delay of each net is bounded by 90% of the clock period $T_{clk}$ as in [9]. For an interconnect of length $L$ operating at $V_{dd}$ and $V_{th}$, the optimal delay is
$$D_{opt} = \left(\frac{\tau}{l}\right)_{opt} L$$
where $(\tau/l)_{opt}$ is given by Equations (5) and (7). The difference between $D_{opt}$ and $0.9 \cdot T_{clk}$ is the slack that we can use to optimize its power. We define $L_{max}$ to be the longest interconnect length which satisfies the target delay with delay optimal repeater insertion, i.e.,
$$L_{max} = \frac{0.9 \cdot T_{clk}}{\left(\frac{\tau}{l}\right)_{opt}}$$
We pipeline the interconnects of lengths larger than $L_{max}$ so that the length of each segment is smaller than $L_{max}$. We assume that the delay overhead of the pipelining flip-flops is amortized in $0.1 \cdot T_{clk}$. Therefore, the power for the full chip is given by
The length in terms of gate pitches is obtained by
where AF is the gate area factor, which is 320 across all technology nodes [1], and $T$ is the technology node in terms of the minimum local metal half-pitch dimension. The number of pipelining stages $\beta$ and the wire length per stage $l_\beta$ are given by
The optimal power per length $(P/l)_{opt}$ is a function of the target delay, and is obtained using Equation (10) when $V_{dd}$ and $V_{th}$ are fixed and Equation (11) when $V_{dd}$ and $V_{th}$ are design variables, both discussed in Section 3.1. The target delay of an interconnect of length $l_\beta$ is again specified through $f$.
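The bookkeeping for long nets can be sketched as follows. The stage count is an assumption made for illustration: the smallest number of equal stages that keeps each stage no longer than the maximum unpipelined length. The slack `f` of a stage is derived from the 90% clock period bound; all function names and numeric values are hypothetical.

```python
import math

def max_unpipelined_length(t_clk, unit_delay_opt):
    """Longest net that meets 0.9 * T_clk with delay-optimal repeaters."""
    return 0.9 * t_clk / unit_delay_opt

def pipeline(length, l_max):
    """Assumed rule: fewest equal stages with each stage no longer than l_max."""
    stages = max(1, math.ceil(length / l_max))
    return stages, length / stages

def slack_fraction(stage_length, t_clk, unit_delay_opt):
    """Fraction f by which the target unit delay exceeds the optimum."""
    return 0.9 * t_clk / (unit_delay_opt * stage_length) - 1.0

# Hypothetical numbers: 2 GHz clock, 63 ps/mm optimum unit delay, 20 mm net.
T_CLK = 0.5e-9
D_OPT = 6.3e-8
l_max = max_unpipelined_length(T_CLK, D_OPT)
stages, stage_length = pipeline(0.02, l_max)
f = slack_fraction(stage_length, T_CLK, D_OPT)
```

A net shorter than `l_max` is left as a single stage, and its slack grows as the net gets shorter, which is where voltage scaling has the most room to save power.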
Vdd and Vth Optimization
To optimize the full-chip interconnect power, we consider various cases of $V_{dd}$ and $V_{th}$ assignment for nets. A practical assignment has a limited number of $V_{dd}$ and $V_{th}$ levels throughout the chip. Multiple $V_{dd}$ levels are provided either by having multiple power distribution networks or by inserting pass transistors to create lower $V_{dd}$ supplies than the system $V_{dd}$. Multiple $V_{th}$ levels can be achieved either through selective transistor doping or through substrate biasing. The $V_{dd}$ and $V_{th}$ pair for a net can be formed from any one of the available $V_{dd}$ and $V_{th}$ levels. Therefore, increasing the number of $V_{dd}$ and $V_{th}$ levels improves the achievable power saving, because it gives more fine-grained control over the $V_{dd}$ and $V_{th}$ of each net. We are interested in maximizing the power saving that can be achieved with the minimum number of $V_{dd}$ and $V_{th}$ levels available at the full-chip level, since extra $V_{dd}$ and $V_{th}$ levels increase area and manufacturing costs. We compare the optimal full-chip global interconnect power of each combination ($N_{dd}$, $N_{th}$), where $N_{dd}$ is the number of $V_{dd}$ levels and $N_{th}$ is the number of $V_{th}$ levels. The theoretical optimum power occurs at $N_{dd} \to \infty$ and $N_{th} \to \infty$, i.e., the $V_{dd}$ and $V_{th}$ of each net can be tailored. Such a comparison gives us an idea of the potential power saving from increasing $N_{dd}$ and $N_{th}$.
Table 3 shows our search algorithm for the power optimal $V_{dd}$ and $V_{th}$ levels at the full-chip level. Given $N_{dd}$ and $N_{th}$, the algorithm first generates all possible combinations of $V_{dd}$ and $V_{th}$ for the full chip at line 3. Interconnects longer than $L_{max}$ have to be broken down into segments by means of pipelining as discussed, which is implemented by looping on the number of pipeline stages at line 10 and by folding the integration bounds in lines 11-12. $\nu$ is simply the length in terms of gate pitches, and the conversion between $\nu$ and the length in absolute dimensions is done using Equation (13).
Also note that the optimal power per length function $(P/l)_{opt}(f, V_{dd}, V_{th})$ in line 13 refers to the power optimal repeater insertion with fixed $V_{dd}$ and $V_{th}$ using Equation (10). The ideal case in which $N_{dd} \to \infty$ and $N_{th} \to \infty$ can be computed by the same algorithm with some modification. Even though some smart pruning has been applied to the search space as shown in Table 3, the algorithm fundamentally performs an exhaustive search, in which the number of combinations for ($V_{dd}$, $V_{th}$) grows exponentially as $N_{dd}$ and $N_{th}$ increase. We have found that $N_{dd}$ and $N_{th}$ beyond 3 are impractical from the runtime perspective. Therefore, instead of using large $N_{dd}$ and $N_{th}$, the power per length function is changed to our analytical repeater insertion solution considering both $V_{dd}$ and $V_{th}$ optimization in Equation (11), and $N_{dd} = N_{th} = 1$ is set. This is equivalent to finding the optimum repeater insertion with numerically computed optimum $V_{dd}$ and $V_{th}$ for each net.
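The enumeration over level combinations can be sketched in a few lines. This is a toy stand-in for the search described above, not the algorithm of Table 3: it has no pruning, it replaces the wire-length distribution with a short list of hypothetical nets, and it uses simplified delay and power functions whose constants are invented for illustration.

```python
import itertools
import math

def toy_unit_delay_opt(vdd, vth, k=6.0e-8, alpha=1.25):
    """Toy optimum unit delay: proportional to sqrt(driver resistance)."""
    return k * math.sqrt(vdd / (vdd - vth) ** alpha)

def toy_power_per_length(vdd, vth):
    """Toy power per length: dynamic term plus exponential leakage term."""
    dynamic = 1.0e-1 * vdd ** 2
    leakage = 2.0e-3 * vdd * 10.0 ** ((0.3 - vth) / 0.1)
    return dynamic + leakage

def chip_power(nets, vdd_levels, vth_levels):
    """Total power when each net takes its cheapest feasible (vdd, vth) pair."""
    total = 0.0
    for target_delay, length in nets:
        feasible = [toy_power_per_length(vdd, vth)
                    for vdd in vdd_levels for vth in vth_levels
                    if vth < vdd and toy_unit_delay_opt(vdd, vth) <= target_delay]
        if not feasible:
            return None  # some net cannot meet timing with these levels
        total += length * min(feasible)
    return total

def search_levels(nets, vdd_candidates, vth_candidates, n_dd, n_th):
    """Exhaustively try every choice of n_dd supply and n_th threshold levels."""
    best = None
    for vdds in itertools.combinations(vdd_candidates, n_dd):
        for vths in itertools.combinations(vth_candidates, n_th):
            power = chip_power(nets, vdds, vths)
            if power is not None and (best is None or power < best[0]):
                best = (power, vdds, vths)
    return best

# Hypothetical nets: (target unit delay in s/m, total wire length in mm).
NETS = [(7.5e-8, 50.0), (9.0e-8, 30.0), (1.5e-7, 20.0)]
VDD_CANDIDATES = [0.6, 0.8, 1.0, 1.2]
VTH_CANDIDATES = [0.2, 0.3, 0.4]
single = search_levels(NETS, VDD_CANDIDATES, VTH_CANDIDATES, 1, 1)
dual = search_levels(NETS, VDD_CANDIDATES, VTH_CANDIDATES, 2, 2)
```

The number of combinations grows combinatorially with the number of levels and candidates, which illustrates why enumerating more than a few levels becomes expensive.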
Experimental Results
The methodology discussed above is used to optimize the full-chip power of the chip sizes reported in [1] for various technology generations. $N_{dd}$ and $N_{th}$ are enumerated only up to three for the sake of runtime. The $V_{dd}$ and $V_{th}$ search ranges are minimized without compromising the power optimality. The combination (1, 1) refers to the optimal full-chip power with fixed reference $V_{dd}$ and $V_{th}$ for all nets. The "ideal" combination refers to the continuous $V_{dd}$ and $V_{th}$ assignment, i.e., $N_{dd}, N_{th} \to \infty$. Power reduces by 47%, 28% and 13% for the 130nm, 90nm and 65nm technology nodes respectively by going from the single $V_{dd}$, single $V_{th}$ configuration to the dual $V_{dd}$, dual $V_{th}$ configuration. Using dual $V_{th}$ instead of single $V_{th}$ under dual $V_{dd}$ gives only ∼3% power reduction, as opposed to the reduction of more than 20% reported for logic circuits in [5]. This suggests that optimizing the single reference $V_{th}$ may perform just as well as the dual $V_{th}$ configuration in terms of interconnect power consumption. The total power of the dual $V_{dd}$, dual $V_{th}$ configuration is within 17%, 12% and 5% of the theoretical power optimum configuration, which allows infinite $V_{dd}$ and $V_{th}$ levels. Moreover, we observe no significant power reduction by moving to combinations with more $V_{dd}$ and $V_{th}$ levels in any technology generation.
The power breakdown of the optimized full-chip interconnect for each ($N_{dd}$, $N_{th}$) configuration is shown in each bar in Figure 3. Multiple $V_{dd}$ configurations (i.e., $N_{dd} > 1$) in the 130nm and 90nm technology nodes achieve significant dynamic power saving by aggressively reducing the second $V_{dd}$ level, as shown in Table 4. The threshold voltage of the second $V_{th}$ level slightly decreases to compensate for the loss of performance due to $V_{dd}$ reduction, at the expense of a slight increase in the leakage power. On the other hand, the leakage power in the 65nm technology node is comparatively much larger in the (1, 1) configuration. From Table 4, the second $V_{th}$ = 0.2V rises above the reference level of 0.175V to limit the growth of leakage power. This can be seen in Figure 3, where the leakage block of the 65nm bars shrinks slightly from the single $V_{dd}$, single $V_{th}$ combination to the other multi-$V_{dd}$/$V_{th}$ configurations. From this, we see that in order to get the right balance between dynamic power and leakage power for total power reduction in interconnect, we must consider both $V_{dd}$ and $V_{th}$ optimization.
Figure 4 shows the breakdown of the total wire length assigned to each ($V_{dd}$, $V_{th}$) pair, marked on each region of the figure, for the dual $V_{dd}$, dual $V_{th}$ case. The regions are ordered from bottom to top by increasing power (decreasing delay) of the ($V_{dd}$, $V_{th}$) combinations. A large portion of the nets is assigned to combinations whose $V_{th}/V_{dd}$ ratio is well above the default 0.25, particularly for the 65nm technology. This implies that the $V_{th}/V_{dd}$ ratio has to be increased in order to attain power optimality. This is in line with the conclusion of other works in the literature [10], which suggest that the $V_{th}/V_{dd}$ ratio should be made larger than the one current designs use for power efficiency.
CONCLUSIONS
This paper studies the opportunity for power saving by computing power optimal repeater sizes, repeater insertion lengths, and, for the first time, $V_{dd}$ and $V_{th}$ levels for both single nets and a full chip. We have derived a set of analytical formulae which find the optimal interconnect power given the amount of timing slack on a single net. Compared to [2], which does not consider $V_{dd}$ and $V_{th}$ as design variables, our method, which customizes $V_{dd}$ and $V_{th}$ for each net, can reduce power by more than 50% for both single nets and at the chip level. We have also studied the power saving of using multiple $V_{dd}$ and $V_{th}$ levels for buffering interconnects. Power reduces by 47%, 28% and 13% for the 130nm, 90nm and 65nm technology nodes respectively by going from the single $V_{dd}$, single $V_{th}$ configuration to the dual $V_{dd}$, dual $V_{th}$ configuration. The fact that the majority of the nets favor a $V_{th}/V_{dd}$ ratio of more than 0.35 across all generations suggests that the default ratio of 0.25 is too low for power optimality, in line with other works in the literature. We show that the dual $V_{dd}$ and dual $V_{th}$ configuration is within 17%, 12% and 5% of the theoretical optimal power computed from our analytical method for the 130nm, 90nm and 65nm technology nodes, and that extra $V_{dd}$ or $V_{th}$ levels beyond dual $V_{dd}$ and dual $V_{th}$ give only marginal improvement. Our experiment also shows that multiple $V_{th}$ levels do not improve the power of interconnect as much as that of logic circuits.
