Abstract
Introduction
Starting from 0.18 μm technologies, static power consumption, caused by leaky "off" transistors, has become a non-negligible source of power dissipation even in running mode. Thus, the total power consumption (i.e. dynamic plus static power) has to be optimized, instead of simply reducing the dynamic power, which is due to the charging and discharging of switched capacitance.
Many research efforts aim at reducing the static power consumption at the device level, using for instance MTCMOS, VTCMOS, Gated-Vdd, or DTCMOS [1]. Conversely, very few articles have considered joint static-dynamic power optimization at a higher level, namely at the system and architectural levels [2] [3] [4].
For a given architecture, reducing the supply voltage Vdd reduces the dynamic power consumption, but it also decreases performance (speed). To compensate for this effect, the threshold voltage Vth should be reduced too. Unfortunately, lowering Vth exponentially increases the static power consumption. At a certain point, this increase in static power becomes larger than the gain in dynamic power, and the total power consumption increases.
Therefore, among all the Vdd/Vth combinations guaranteeing the desired speed, only one pair results in the lowest power consumption (Figure 1). From now on, these working conditions will be called the optimal working point or ideal working point. The location of this optimal working point and its associated total power consumption are tightly related to architectural and technology parameters. Figure 1 illustrates the fact that reducing the activity reduces Ptot, whereas it tends to increase the optimal Vdd and Vth. As architectural modifications change several factors simultaneously (not just the activity), it is necessary to develop a methodology to evaluate the influence of such transformations on Ptot.
One assumption made throughout this contribution is that Vdd and Vth can be freely (and precisely) modified. Whereas the supply voltage is in general easily controllable, it is harder to modify the threshold voltage, as body back-biasing becomes less and less efficient in smaller technologies. On the other hand, it is possible to select a technology that matches the Vdd and Vth requirements as closely as possible. In any case, the contributions of this paper make it possible to understand the architectural implications on the total power consumption.
The originality of this paper comes firstly from the approximate closed-form equation for the total power consumption at its optimal working point, expressed in terms of architectural and technology parameters. This closed-form approximation is shown to match the full numerical calculation closely. Secondly, this equation is used to understand the impact of architecture on the minimal achievable power consumption, under the assumption of freely controllable supply voltage (Vdd) and threshold voltage (Vth). Finally, the presented power formula can also be used to select the technology flavor that is best suited for ultra-low-power design.
Basic equations
The closed-form approximation of the total power consumption in optimal conditions, developed in Section 3, is based on the following fundamental equations. Total power consumption is expressed as: (1) with N the number of cells; a the average cell activity (i.e. the number of switching cells in a clock cycle over the total number of cells); C the equivalent cell capacitance; f the operating frequency; Io the average off-current per cell for Vgs = Vth; n the slope in weak inversion; Ut = kT/q the thermal voltage. Parameters a, C and Io are defined as average values per cell, calculated over the full circuit. Hence architectures with different cell distributions could present slightly different parameters even for the same technology.
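To give a feel for the weak-inversion parameters n and Ut, the following minimal sketch assumes that the sub-threshold term of (1) scales as exp(-Vth/(n·Ut)), the usual weak-inversion dependence. The value n = 1.3, the temperature of 300 K and all function names are illustrative assumptions, not values taken from the paper.

```python
import math

K_BOLTZMANN = 1.380649e-23      # Boltzmann constant [J/K]
Q_ELECTRON = 1.602176634e-19    # elementary charge [C]


def thermal_voltage(temperature_k=300.0):
    """Ut = kT/q in volts."""
    return K_BOLTZMANN * temperature_k / Q_ELECTRON


def subthreshold_swing(n=1.3, temperature_k=300.0):
    """Vth change (in volts) that scales the assumed leakage term by 10x."""
    return n * thermal_voltage(temperature_k) * math.log(10.0)


def leakage_ratio(delta_vth, n=1.3, temperature_k=300.0):
    """Factor by which exp(-Vth/(n*Ut)) grows when Vth is lowered by delta_vth."""
    return math.exp(delta_vth / (n * thermal_voltage(temperature_k)))


swing = subthreshold_swing()
ratio_100mv = leakage_ratio(0.100)   # effect of lowering Vth by 100 mV
```

Under these assumptions, lowering Vth by one swing (a few tens of millivolts) multiplies the static term by ten, which is why the static contribution quickly overtakes the dynamic savings once Vth is pushed too low.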
In Eq. (1) the short-circuit power contribution is lumped into the equivalent capacitance C. The static power is represented here by the sub-threshold contribution, which is the dominant part in present technologies. Neglected leakage sources include: gate tunneling, which depends exponentially on the oxide thickness (fortunately, it can be kept reasonably low even in future technologies by using a high-dielectric-constant insulator); p-n reverse-bias current, coming from reverse diode conduction between drain/source and body; and punchthrough, coming from the drain and source depletion regions "touching" deep in the substrate.
The transistor on-current model comes from a modified version of the well-known alpha power law [4]: with η the DIBL coefficient. Then the delay can be formulated as:
with ζ (measured in farads) a fitting parameter, which also includes the switched gate capacitance.
Approximated optimal total power consumption
The delay on the critical path, set by the logical depth (LD), must match the clock period in order to operate at the optimal power condition. A positive slack would allow Vdd to be reduced further, yielding additional power savings, whereas a negative slack would correspond to a non-working device. This condition can be expressed as: (6) The optimal Vdd and Vth are now tied together by means of (5). Figure 2 shows that, in a reasonable range of Vdd, the expression $V_{dd}^{1/\alpha}$ can be linearized:
$$V_{dd}^{1/\alpha} \approx A\,V_{dd} + B$$
where A and B are two fitting variables that depend on α and on the fitting range. Eq. (5) 
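The fitting step can be sketched as an ordinary least-squares line through Vdd^(1/α) over the chosen voltage range. The value α = 1.5, the range 0.3 V to 1.0 V and the function name are assumptions chosen for illustration; the paper does not state which fitting criterion or range was used at this point.

```python
import numpy as np


def fit_linearization(alpha, vdd_min, vdd_max, points=200):
    """Least-squares fit of Vdd**(1/alpha) ~ A*Vdd + B on [vdd_min, vdd_max].

    Returns A, B and the largest absolute deviation over the sampled range.
    """
    vdd = np.linspace(vdd_min, vdd_max, points)
    target = vdd ** (1.0 / alpha)
    # polyfit with degree 1 returns (slope, intercept)
    A, B = np.polyfit(vdd, target, 1)
    max_err = np.max(np.abs(A * vdd + B - target))
    return A, B, max_err


# Illustrative values only (assumed, not from the paper)
A, B, max_err = fit_linearization(alpha=1.5, vdd_min=0.3, vdd_max=1.0)
```

Because Vdd^(1/α) is concave for α > 1, the fitted line overshoots at both ends of the range and undershoots in the middle, so narrowing the range around the expected optimal Vdd reduces the error.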
To find the optimal Vdd, (1) is differentiated with respect to Vdd and set to zero. Using the approximation that Vdd is much larger than nUt, combined with the previous equations, the two following relationships are obtained: The optimal total power is defined using (1) and (9):
And for Vdd >> nUt/(1-χA), the same equation becomes:
Finally, the optimal Vdd (10) is introduced into (12), resulting in (13).
Equation 13 is a very important formula because it allows the optimal total power to be estimated analytically, directly from architectural parameters such as activity (a), number of cells (N), frequency (f) and logical depth (LD, included in χ), and from technology parameters such as average off-current (Io), weak inversion slope (n), alpha power law coefficient (α, included in A and B) and delay coefficient (ζ, included in χ). Starting from this formula, it is thus possible to understand the impact of common architectural transformations, and to compare the performance of different technologies for a given architecture.
Note that (13) no longer depends on η (the DIBL coefficient), although this parameter was introduced during the calculation. This can be explained by the fact that the threshold voltage is no longer present in Eq. 13, which hides the DIBL effect as well.
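As a sketch of how these steps fit together, assume that (1) takes the form $P_{tot} = N a C f V_{dd}^2 + N V_{dd} I_o e^{-V_{th}/(nU_t)}$, and that the timing condition combined with the on-current model reduces to $V_{dd} - V_{th} = \chi V_{dd}^{1/\alpha}$, where $V_{th}$ is the effective threshold voltage (DIBL included) and $\chi$ lumps together ζ, LD, f and Io. Both forms are assumptions made for this illustration, chosen to agree with the parameter definitions and with the structure of (13) described in the text.

```latex
% Linearized timing constraint: effective threshold as a function of Vdd
V_{th} \approx (1-\chi A)\,V_{dd} - \chi B

% Stationarity of the total power with respect to Vdd
\frac{\partial P_{tot}}{\partial V_{dd}}
  = 2NaCf\,V_{dd}
  + N I_o\, e^{-V_{th}/(nU_t)}\left[1 - \frac{(1-\chi A)\,V_{dd}}{nU_t}\right] = 0

% For V_{dd} \gg nU_t/(1-\chi A) the bracket is dominated by its second term
I_o\, e^{-V_{th}/(nU_t)} \approx \frac{2aCf\,nU_t}{1-\chi A}
\quad\Rightarrow\quad
V_{th}^{opt} \approx nU_t \ln\!\frac{I_o(1-\chi A)}{2aCf\,nU_t},
\qquad
V_{dd}^{opt} \approx \frac{\chi B + V_{th}^{opt}}{1-\chi A}

% Optimal total power, completing the square under the same approximation
P_{tot}^{opt} \approx NaCf\,V_{dd}^{opt}\left(V_{dd}^{opt} + \frac{2nU_t}{1-\chi A}\right)
  \approx NaCf\left(V_{dd}^{opt} + \frac{nU_t}{1-\chi A}\right)^{2}
  = \frac{NaCf}{(1-\chi A)^{2}}
    \left[\chi B + nU_t\left(\ln\frac{I_o(1-\chi A)}{2aCf\,nU_t} + 1\right)\right]^{2}
```

Only the effective threshold voltage enters this chain, so η never appears in the result, and the final expression has the prefactor and the squared (1-χA) denominator discussed in the text.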
Application to architecture selection
Architectural transformations will influence many parameters in (13), e.g. a, N, LD (contained in χ). Knowing the effect of transforming an architecture (e.g. pipelining or parallelization), it is directly possible to see if it will result in a higher or lower total power using (13).
For this discussion, a set of thirteen 16-bit multipliers (described in detail in [5] [6]) was designed in VHDL and synthesized using Synopsys Design Compiler (V2003.06). The library used for the synthesis is the 0.13 μm CMOS09GPLL from ST Microelectronics.
1. RCA: [...] achieving an even shorter LD, at the price of more glitches due to an increased spread of path delays. Finally, both parallelized versions (by 2 or by 4) are obtained by replicating the basic multiplier and multiplexing data across them. This way, each multiplier has additional clock cycles at its disposal, relaxing timing constraints.
2. Wallace Tree (basic, 2 and 4 parallelization): the Wallace Tree structure adds the partial products using Carry Save Adders in parallel. Path delays are better balanced than in RCA, resulting in an overall faster architecture. Parallelized versions use circuit replication and multiplexing, similarly to the parallel RCA structure.
3. Sequential (basic, parallel and "4_16 Wallace"): the basic implementation computes the multiplication with a sequence of "add and shift" operations, resulting in a very compact circuit. The intermediate result is shifted, added to the next partial product, and stored in a register. This type of structure needs as many clock cycles as the operand width to complete, but only one 16-bit adder is necessary. Note that this corresponds to an internal clock running 16 times faster than the 31.25 MHz data clock that defines the throughput. The architecture called 4_16 Wallace reduces the number of clock cycles per multiplication from 16 to 4 by using a 4x16 Wallace tree multiplier, i.e. by adding 4 partial products in parallel. The parallelized version is a simple replication and multiplexing of the basic version.
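The "add and shift" scheme of the sequential family can be illustrated with a small behavioural model. This is a software sketch for unsigned operands, not the VHDL used in the paper; the function names and the `bits_per_cycle` parameter are hypothetical, with `bits_per_cycle=4` standing in for the 4_16 Wallace variant that consumes four partial products per cycle.

```python
def shift_add_multiply(x, y, width=16, bits_per_cycle=1):
    """Behavioural model of a sequential unsigned multiplier.

    Each loop iteration models one internal clock cycle in which
    `bits_per_cycle` partial products are accumulated.
    Returns (product, number_of_cycles).
    """
    if width % bits_per_cycle != 0:
        raise ValueError("width must be a multiple of bits_per_cycle")
    mask = (1 << width) - 1
    x &= mask
    y &= mask
    group_mask = (1 << bits_per_cycle) - 1
    accumulator = 0
    cycles = 0
    for shift in range(0, width, bits_per_cycle):
        group = (y >> shift) & group_mask
        # x * group stands in for the sum of `bits_per_cycle` partial products
        accumulator += (x * group) << shift
        cycles += 1
    return accumulator, cycles


def internal_clock_mhz(cycles, data_clock_mhz=31.25):
    """Internal clock needed to sustain one result per data-clock period."""
    return cycles * data_clock_mhz


product, cycles = shift_add_multiply(1234, 5678)
fast_product, fast_cycles = shift_add_multiply(1234, 5678, bits_per_cycle=4)
```

With one bit per cycle the model needs 16 iterations per result, which is the 16x internal clock mentioned above; processing four bits per cycle brings this down to 4 iterations.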
Starting from the values of static and dynamic power at the nominal supply voltage (Vdd = 1.2 V), with activity annotated through timing-annotated simulations of the netlist in ModelSim (Mentor Graphics), the optimal total power was calculated twice: first numerically from Eqs. (1)-(6), by evaluating the total power for all reasonable Vdd/Vth pairs, and then using Eq. (13). Results are shown in Table 1.
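The two-way calculation can be sketched as follows. The model is an assumption made for illustration: the total power is taken as N·a·C·f·Vdd² + N·Vdd·Io·exp(-Vth/(n·Ut)), the timing constraint as Vdd - Vth = χ·Vdd^(1/α), and the closed form is the one that follows from these two assumptions after linearization. All numeric values (N, a, C, Io, n, α, χ) are hypothetical and do not come from Table 1.

```python
import numpy as np

# Hypothetical parameters, chosen only to exercise the formulas
N = 1000               # number of cells
a = 0.1                # average cell activity
C = 10e-15             # equivalent cell capacitance [F]
f = 31.25e6            # operating frequency [Hz]
Io = 5e-6              # off-current per cell at Vgs = Vth [A]
n_Ut = 1.3 * 0.0259    # weak-inversion slope times thermal voltage [V]
alpha = 1.5            # alpha-power-law exponent
chi = 0.6              # lumped delay factor (assumed value)


def total_power(vdd, vth):
    """Assumed total power: dynamic plus sub-threshold static."""
    dynamic = N * a * C * f * vdd ** 2
    static = N * vdd * Io * np.exp(-vth / n_Ut)
    return dynamic + static


def numerical_optimum(vdd_grid):
    """Sweep Vdd, derive Vth from the assumed timing constraint, keep the minimum."""
    vth = vdd_grid - chi * vdd_grid ** (1.0 / alpha)
    power = total_power(vdd_grid, vth)
    best = np.argmin(power)
    return vdd_grid[best], vth[best], power[best]


def closed_form_optimum(A, B):
    """Closed-form estimate under the same assumptions (cf. Eq. 13)."""
    k = 1.0 - chi * A
    log_term = np.log(Io * k / (2.0 * a * C * f * n_Ut))
    return N * a * C * f / k ** 2 * (chi * B + n_Ut * (log_term + 1.0)) ** 2


vdd_grid = np.linspace(0.3, 1.0, 7001)
A, B = np.polyfit(vdd_grid, vdd_grid ** (1.0 / alpha), 1)  # linearization fit
vdd_opt, vth_opt, p_numeric = numerical_optimum(vdd_grid)
p_closed = closed_form_optimum(A, B)
relative_error = (p_closed - p_numeric) / p_numeric
```

Because the timing constraint fixes Vth for each Vdd, the sweep is one-dimensional; `relative_error` then plays the role of the error column comparing Eq. (13) with the numerical solution.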
The values of A and B used in Eq. (13) were obtained by minimizing the approximation error of (7). The first remark that can be made on this table is that the approximation of the optimal total power based on (13) presents an error lower than ±3% compared to a numerical solution based on the non-approximated equations.
Moreover, looking at the influence of architecture on the optimal power consumption, several things can be observed in Table 1 and explained thanks to (13).
It is clear that sequential multipliers are not suited for low-power design, unless the circuit has to work at a very low data frequency. This is due to two cumulative factors. Firstly, the activity (defined with respect to the throughput frequency and not the internal clock frequency) can be very high, even greater than 1 in some cases. This leads to a high dynamic consumption at nominal conditions, but also at the optimal working point, as shown by the first fraction of (13). Secondly, such an architecture is very slow, resulting in a large χ, which penalizes the total power consumption by increasing χB and reducing 1-χA (which appears squared in the denominator of the prefactor in Eq. 13). The effect of a slow architecture can also be observed on the optimal Vdd and Vth: to meet the desired working frequency, sequential designs require a high Vdd (i.e. high dynamic power) and a low threshold voltage (i.e. high static power).
The RCA architecture is based on a very regular structure that permits many variations to be implemented. Both parallelization and pipelining transformations shorten the effective logical depth (which is reflected in a reduction of χ, although not in a linear manner). In this case, the relaxed timing constraints make it possible to reduce Vdd further and increase Vth, thereby reducing the optimal total power consumption.
The diagonal pipeline versions present a shorter logical depth but a higher activity (due to more glitches) than the horizontal pipeline versions, so the latter are preferable for low-power pipelining. In fact, diagonally pipelining the basic RCA reduces the critical paths more than a horizontal pipeline does, but it reduces the shortest paths even more. This greater spread of path delays results in more glitches and hence in a higher activity. This example illustrates how simple architectural transformations can modify parameters like a and LD in a complex and difficult-to-predict manner.
Finally, the Wallace family presents the fastest circuits of our set. By applying a parallelization to the basic version, we observe that, as for the RCA family, the logical depth is reduced and hence χ is also reduced. Once again, this results in a lower Vdd and a higher Vth, meaning a slight power saving. However, the optimal total power of the further parallelized structure (Wallace par4) becomes higher than that of the previous structures even though, as expected, Vdd is further reduced and Vth increased. The explanation comes from the overhead introduced by parallelization. Since the Wallace parallel architecture is already a fast circuit (compared to the desired working frequency), the reduction of χ is only marginal, and its benefit is cancelled by the power overhead introduced by data multiplexing.
Application to technology selection
While Eq. (13) was discussed in Section 4 considering variations of the architectural parameters, the optimal total power is also highly dependent on the technology parameters. Because current technologies often offer a choice of a few flavors, or because it is sometimes possible to select one technology among several available ones, we discuss here the influence of those parameters on the total power consumption.
The CMOS09 0.13 μm ST Microelectronics technology exists in three different flavors, namely High Speed (HS), Low Leakage (LL) and Ultra Low Leakage (ULL). The technology parameters for these cases were obtained with ELDO simulations, by fitting delays on inverter-chain ring oscillators.
The optimal total power was calculated for the 16-bit multipliers introduced in Section 4. Due to space limitations, only the results for the Wallace family are presented here. The results for the LL type were already reported in Table 1, while the values for the remaining two types are reported in Table 3 and Table 4.
The optimal total power of the parallel version of the Wallace multiplier is higher than that of the basic one when using the HS process (Table 4), whereas the opposite holds for the ULL and LL processes (Table 3, Table 1). This can be explained by the fact that parallelization (where the number of cells is more than doubled) is penalized more in technologies presenting a very high leakage. Moreover, the speed gain resulting from a logical depth reduction of an already fast structure in "fast" technologies is often extremely limited.
Similarly, the optimal total power for ULL is always larger than for LL in corresponding architectures. This can be explained by the low Io and high ζ of ULL, which both lead to slower architectures, as can be observed in (4). This corresponds to a higher optimal Vdd (higher dynamic power) and a lower Vth (higher static power).
On the other hand, the HS technology is characterized by a low α (reflected in a high A) and an increased capacitance C. Both effects tend to increase the optimal total power, as predicted by Eq. (13) and confirmed by Table 4.
From these examples, it appears that under such conditions (a Wallace architecture working at 31.25 MHz) the technology presenting the lowest optimal power consumption is LL, showing that extreme technology flavors (ULL and HS) are penalized.
Starting from these observations, we can understand that, at its optimal working point and for the same performance, a smaller technology node with ultra-high speed and large leakage might consume more than a larger technology with better balanced α, Io, ζ, etc.
Conclusions
In this paper, three important subjects have been discussed around the theme of total power optimization for adjustable values of supply voltage (Vdd) and threshold voltage (Vth). In the first part, an approximate analytical formula was derived for the total power consumption (static plus dynamic) at the optimal working point, where the minimum power is obtained while still meeting the speed requirements. Practical results show an error lower than 3% compared to full numerical computations.
Starting from this equation, a discussion of the architecture influence on the total power was presented.
The first observation was that sequential circuits are highly penalized due to the high activity and large effective logical depth.
Then, parallelization was beneficial as long as the architecture did not already present a short LD. Otherwise the multiplexing overhead completely cancelled the benefit brought by relaxed timing constraints. This was for instance the case for Wallace structures.
For pipeline transformations, it was interesting to observe that a diagonal pipeline, presenting a shorter logical depth than the horizontal one, was penalized due to the increased number of glitches (reflected by the increase in activity).
In the last part of the article, Eq. 13 was used to discuss the impact of the technology on the optimal total power. Through simple examples, it was shown how extreme technology flavors (here in the case of an ST Microelectronics 0.13 μm technology) like Ultra Low Leakage and High Speed were less suited for low power than Low Leakage. In fact, slow or highly leaky technologies perform worse than a moderate trade-off between these characteristics when working at the optimal point.
