Abstract: The ever-growing leakage current of MOSFETs in nanometre technologies is a major concern for high-performance and power-efficient designs. Dynamic power management via power gating is effective in reducing leakage power, but it introduces power-up current that affects circuit reliability. The authors present an in-depth study on high-level modelling of power-up current and leakage current in the context of a full custom design environment. They propose a methodology to estimate the circuit area, maximum power-up current, and minimum and maximum leakage current for any given logic function. Novel estimation metrics are built based on logic synthesis and gate-level analysis using only a small number of typical circuits, but no further logic synthesis or gate-level analysis is needed during the high-level estimation. Compared to time-consuming logic synthesis and gate-level analysis, the average errors for circuits from a leading industrial design project are 23.59% for area, 21.44% for maximum power-up current, 15.65% for maximum leakage current and 6.21% for minimum leakage current. In contrast, estimation based on quick synthesis leads to an 11× area difference in gate count for an 8-bit adder.
Introduction
Power has become one of the primary design constraints for both high-performance and portable system designs. As VLSI technology continues to scale down, leakage power becomes an ever-growing power component because of (i) the increase in device leakage current due to the reduction in threshold voltage, channel length and gate oxide thickness [1, 2], and (ii) the increasing number of idle modules in a highly integrated system. For current high-performance design methodologies, the contribution of leakage power increases at each technology generation [3]. The Intel Pentium IV processors running at 3 GHz already have an almost equal amount of leakage and dynamic power [4]. Dynamic power management (DPM) [5] via power gating at system and circuit levels is effective in reducing both leakage and dynamic power. Figure 1a shows a system with a multichannel voltage regulation module (VRM). The VRM channels can be configured to supply power independently to individual modules. Therefore, modules can be turned on or off at appropriate times for power reduction while still maintaining the desired functionality and performance. Power gating at circuit level is also called MTCMOS in [6] (see Fig. 1b). A PMOS sleep transistor with a high threshold voltage connects the power supply to the virtual V_dd. The sleep transistor is turned on when the function block is needed, and is turned off otherwise. (Instead of the PMOS sleep transistor, an NMOS sleep transistor can be inserted between the ground and virtual ground, in which case I_p, to be presented later, becomes the discharging current. For simplicity of presentation, we assume PMOS sleep transistors in this paper.)
With the growing leakage power, power gating at either system level or circuit level is a viable alternative to clock gating, as clock gating only reduces dynamic power. We use MTCMOS to study power gating in this paper, and the idea can be extended to VRM design and DPM at system level. Key questions in applying power gating include: (i) How to estimate the leakage reduction by power gating and how to decide the area overhead of power gating? The answer determines whether power gating is worthwhile for a given design. Leakage current and power estimation has been studied [7-9] at transistor, gate and system level. Most estimation methods consider the input-vector-dependent property of leakage current. The MTCMOS technique for leakage reduction further introduces power-up transient current when a circuit module is turned on or off. As shown in [10], all nodes in a power-gated module are at logic '0' state. They must be brought to valid logic states by power-up current (I_p) before useful computation can begin. Similar to leakage current, power-up current I_p also depends on the input vector. Its maximum value must be known to design reliable sleep transistors and evaluate their area overhead. (ii) How to answer the above question and make a decision at an early design stage without performing time-consuming logic synthesis and gate-level analysis? Early decision making is needed to deal with time-to-market pressure. High-level estimation for circuit area [11, 12], dynamic power [13, 14], and transient switching current [15] has been studied. However, no previous work has studied high-level modelling and estimation of power-up current.
In this paper we propose a method to estimate the gate count for a given logic function without performing logic synthesis. We show that quick synthesis leads to an 11× difference for a simple adder, and further validate and improve an area estimation technique that was originally developed for a library with a limited number of cells [14]. The improved estimation method has an average error of 23.59%. This paper then presents an in-depth study of a unified high-level modelling for power-up current and leakage current by using commercial synthesis tools such as Design Compiler in the pre-characterisation stage. We propose a high-level metric to estimate the maximum I_p without performing logic synthesis and gate-level I_p analysis. We verify this metric by a newly developed gate-level analysis for accurate I_p. We then further extend this high-level estimation methodology to leakage power estimation. We use the design environment of a leading industrial high-performance CPU design project. There are hundreds of cells with various sizes (1× to 65×) in the library. All experiments are carried out on a number of typical circuits. The circuits are specified in Verilog and synthesised by Design Compiler to verify our high-level estimations. Owing to the need for IP protection, we report normalised values for currents in this paper.
2
Area estimation
Table 1 shows that quick synthesis using Verilog specified at a higher abstraction level does not necessarily lead to a good estimation. Instead of using quick synthesis, we apply and improve the high-level area estimation in [14]. We summarise the estimation flow from [14] in Fig. 2. It contains a one-time pre-characterisation, where gate count A is pre-characterised as a function F of the linear measure L and output entropy H. Then, a multi-output function (MOF) is transformed into a single-output function (SOF) by adding an m-to-1 MUX, where m is the number of outputs in the original MOF. L and H are calculated for the SOF to look up the pre-characterised table and obtain the gate count. Removing the MUX from this gate count leads to A for the original MOF.
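The MOF-to-SOF transformation can be illustrated with a minimal Python sketch. It assumes that each output is given as a callable over the input vector, that the MUX select lines are added as extra binary inputs (most significant bit first), and that unused select codes evaluate to logic 0. The names `mof_to_sof` and `half_adder` and these conventions are illustrative assumptions and are not taken from [14].

```python
from math import ceil, log2

def mof_to_sof(outputs):
    """Wrap m output functions into one single-output function via an m-to-1 MUX."""
    m = len(outputs)
    n_sel = max(1, ceil(log2(m)))  # number of added select inputs

    def sof(x, sel):
        # Decode the select bits (MSB first) into an output index.
        idx = 0
        for bit in sel:
            idx = (idx << 1) | bit
        # Assumption: select codes beyond m-1 are mapped to logic 0.
        return outputs[idx](x) if idx < m else 0

    return sof, n_sel

# Hypothetical 2-output MOF: a half adder with outputs (sum, carry).
half_adder = [lambda x: x[0] ^ x[1], lambda x: x[0] & x[1]]
sof, n_sel = mof_to_sof(half_adder)
print(n_sel, sof((1, 0), (0,)), sof((1, 1), (1,)))
```

The resulting `sof` has the original inputs plus `n_sel` select inputs, which is the function for which L and the output statistics would then be computed.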
Overview
We improve the original estimation method in two ways. First, it is claimed in [14] that SOFs with the same output entropy H and same linear measure L have the same A. However, we find that it may not be true for VLSI functions implemented with a rich cell library. Functions with smaller output probability of logic '1' have a lower gate count under the same linear measure. Therefore, we have pre-characterised A as a function F(L, P), where P is the output probability. Since complementary probabilities lead to the same entropy, our pre-characterisation is more detailed compared to that in [14] . Further, we have developed an output-clustering algorithm to partition the original MOF into sub-functions (called sub-MOFs) with minimum support set overlap, and have improved the efficiency and accuracy of the high-level estimation. We summarise our estimation flow with the difference highlighted in Fig. 2 , and describe each step and our implementation details in the following Sections.
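The reason why P carries more information than H can be made explicit. Assuming H denotes the binary entropy of the output probability P (the usual definition for a single-output function), we have:

```latex
H(P) = -P \log_2 P - (1-P)\log_2(1-P) = H(1-P)
```

For instance, P = 0.1 and P = 0.9 give the same H, whereas the observation above states that their gate counts can differ under the same L.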
Linear measure
Linear measure L is determined by the on- and off-sets of an SOF f:
$$ L(f) = L_1(f) + L_0(f) $$
where L_1 and L_0 are the linear measures for the on-set and off-set, respectively. L_1 is further defined as
$$ L_1(f) = \sum_{i=1}^{N} c_i p_i $$
where N is the number of different sizes of all the prime implicants in a minimal cover of function f. The size of a prime implicant is the number of literals in it. c_i is one distinct prime implicant size. p_i is the weight of the prime implicants with size c_i and can be computed in the following way. Suppose all the input vectors to the logic function occur with the same probability. Let c_1, c_2, ..., c_N be sorted in decreasing order, and let weight p_i be the probability that a random input vector matches the prime implicants with size c_i but not the prime implicants with sizes from c_1 to c_{i-1}, 1 < i ≤ N. For i = 1, p_1 is just the probability that a random input vector matches the prime implicants with size c_1. Here 'matching' means that the intersection operation [16] between the vector and the prime implicant is consistent. Note that p_i satisfies the equation $\sum_{i=1}^{N} p_i = P(f)$, where P(f) is the probability to satisfy function f.
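For a small function, the weights p_i and the on-set measure can be computed by enumerating all input vectors. The sketch below assumes that a cover is given as cube strings over '0', '1' and '-' (don't care), and that each vector contributes the largest literal count among the cubes it matches, which follows from the decreasing ordering of the c_i. The function names and the example cover for f = ab + c are illustrative.

```python
from itertools import product

def matches(cube, vec):
    """True if the input vector lies inside the cube."""
    return all(c == '-' or c == v for c, v in zip(cube, vec))

def literal_count(cube):
    """Size of an implicant: the number of literals in it."""
    return sum(ch != '-' for ch in cube)

def linear_measure_exact(cover, n_inputs):
    """Sum of c_i * p_i over a cover, by exhaustive enumeration."""
    total = 0
    for vec in product('01', repeat=n_inputs):
        sizes = [literal_count(c) for c in cover if matches(c, vec)]
        # A vector is credited to the largest implicant size it matches;
        # vectors outside the cover contribute 0.
        total += max(sizes, default=0)
    return total / 2 ** n_inputs

# f = a.b + c: cubes "11-" (size 2) and "--1" (size 1)
print(linear_measure_exact(["11-", "--1"], 3))
```

Here p_1 = 2/8 for size 2 and p_2 = 3/8 for size 1, so the result is 2(2/8) + 1(3/8) = 0.875. The same routine applied to a cover of the off-set gives the off-set measure. Enumeration is exponential in the number of inputs, so it is only practical for small examples.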
The minimum cover of an SOF can be obtained by two-level logic minimisation [17]. To compute the weight p_i, a straightforward approach is to make the minimum cover disjoint and compute the probability exactly. However, in practice, this exact approach turns out to be very expensive. In our experiments, when the number of inputs is larger than 10, the program using the exact approach does not finish within reasonable time. But with p_i defined as a probability, L_1(f) can be associated with a random variable L'_1(f) with a certain probability distribution. For each random input vector, the variable L'_1(f) takes a certain value 'randomly'. With probability 1 − P(f), L'_1(f) takes the value '0'. Then L_1(f) becomes the mean of the random variable L'_1(f). By assuming that the variable L'_1(f) follows a Gaussian distribution, we use the Monte Carlo simulation technique to estimate the mean value efficiently.
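A minimal sketch of such a Monte Carlo estimate is given below. It assumes cube-string covers, a stopping rule based on the normal approximation of the sample mean (confidence level and relative error as parameters), and a fixed random seed. The stopping rule, the sample limits and all names are assumptions made for illustration; the exact criterion used in the experiments is not specified here.

```python
import random
from statistics import NormalDist

def matches(cube, vec):
    return all(c == '-' or c == v for c, v in zip(cube, vec))

def sample_value(cover, vec):
    """Value of the random variable for one vector: largest matched implicant size, or 0."""
    return max((sum(ch != '-' for ch in c) for c in cover if matches(c, vec)), default=0)

def monte_carlo_l1(cover, n_inputs, confidence=0.96, rel_error=0.03,
                   min_samples=200, max_samples=200_000, seed=0):
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # two-sided quantile
    n = hits = 0
    total = total_sq = 0.0
    while n < max_samples:
        vec = [rng.choice('01') for _ in range(n_inputs)]
        x = sample_value(cover, vec)
        n += 1
        total += x
        total_sq += x * x
        hits += x > 0  # vectors in the on-set give the output probability
        if n >= min_samples:
            mean = total / n
            var = max(total_sq / n - mean ** 2, 0.0)
            # Stop once the confidence half-width is within the relative error.
            if mean > 0 and z * (var / n) ** 0.5 <= rel_error * mean:
                break
    return total / n, hits / n, n

l1, p_out, n_used = monte_carlo_l1(["11-", "--1"], 3)
print(l1, p_out, n_used)
```

For the small example f = ab + c the estimate should land close to the exact values 0.875 and 5/8; the fraction of sampled vectors in the on-set is obtained at no extra cost.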
Output probability and gate-count recovery
The output probability can be obtained as a byproduct of the Monte Carlo simulation. Since weight p_i satisfies $\sum_{i=1}^{N} p_i = P(f)$, we can keep a record of all the p_i during the Monte Carlo simulation. When the simulation process satisfies the stopping criteria, the output probability can be obtained easily. To recover the gate count of the original MOF, the estimated gate count for the transformed SOF is reduced by αA_mux. A_mux is the gate count of the complete multiplexer we have inserted, and α is the coefficient to obtain the reduced multiplexer gate count due to the logic optimisation.
Output clustering
As the number of primary outputs increases, the time to calculate the minimum cover of a function increases nonlinearly. To make the two-level optimisation more efficient, one may partition the original MOF into sub-MOFs by output clustering, and then estimate each sub-MOF individually. The gate count of the original MOF is the sum of the gate counts for all the sub-MOFs. However, estimation errors may be introduced due to the overlap of the support sets of the sub-MOFs. We propose to partition the outputs with minimum support set overlap (see Fig. 3). A PO-graph is constructed with vertices representing the primary outputs (POs). If two POs have support set overlap, there is an edge connecting the two corresponding vertices. The edge weight is the size of the common support set. The vertex weight is the sum of the weights of all edges connected to this vertex. There are two loops in the algorithm. In each iteration of the inner loop, the vertex with the minimum weight is deleted and the weights are updated for the edges and vertices connected to the deleted vertex. This continues until the number of remaining vertices is less than or equal to the pre-specified cluster size. The PO-graph is then re-constructed with all the POs that have not been clustered. The algorithm continues until all the outputs are clustered and the PO-graph becomes empty.
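The two-loop procedure can be sketched in Python as follows. The sketch assumes the support sets are given as a dictionary from output name to a set of input names, recomputes vertex weights on the fly instead of updating them incrementally, and breaks ties by output name; these choices and the function name are assumptions for illustration, not details of Fig. 3.

```python
def cluster_outputs(support, cluster_size):
    """Partition primary outputs into clusters of at most cluster_size outputs."""
    remaining = set(support)
    clusters = []
    while remaining:                      # outer loop: rebuild the PO-graph
        active = set(remaining)

        def weight(v):
            # Vertex weight: total support overlap with the other active vertices.
            return sum(len(support[v] & support[u]) for u in active if u != v)

        while len(active) > cluster_size:  # inner loop: drop the weakest vertex
            victim = min(sorted(active), key=weight)
            active.remove(victim)
        clusters.append(active)
        remaining -= active
    return clusters

support = {
    "o1": {"a", "b", "c"},
    "o2": {"b", "c", "d"},
    "o3": {"x", "y"},
    "o4": {"y", "z"},
}
print(cluster_outputs(support, 2))
```

In this hypothetical example, o1 and o2 share two inputs and end up in one cluster, while o3 and o4 form the second cluster, so no support overlap remains between clusters.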
Experimental results
We compare area estimation methods in Fig. 4, where the x-axis is the circuit ID number and the y-axis is the gate count. During the Monte Carlo simulation to calculate the linear measure, we choose the parameters of confidence and error as 96% and 3%, respectively. The actual gate count is obtained by synthesis using Design Compiler. The method with random output clustering has an average absolute error of 39.36%. By applying our output clustering algorithm to minimise support set overlap, we reduce the average absolute error to 23.59%. Such estimation errors are much smaller than the 11× gate-count difference in Table 1. Note that different descriptions of a given logic function do not change L and P, and therefore do not affect the estimation results of our approach. High-level estimation takes over 100× less runtime than logic synthesis.
Given the Boolean function f of a combinational logic block and the target cell library, our high-level estimation finds the maximum power-up current I_p(f) when the logic block is implemented with the given cell library for power gating. A Boolean function can be implemented under different constraints, but we assume the min-area implementation in this paper.
We propose the following high-level metric M_p for I_p(f):
$$ M_p(f) = I_{avg} \cdot A $$
where A is the gate count estimated using the method in Section 2, and I_avg is the weighted average I_p to be discussed in Section 3.2. Because an accurate gate-level estimator is required for the calculation of I_avg and verification of M_p(f), we introduce our gate-level estimation in the next section.
Gate-level estimation

Background knowledge:
The following observation has been shown in [10] :
Observation 1: All the internal nodes in a circuit with PMOS sleep transistors are at logic '0' after the circuit stays in the power-off state for a long enough time.
Power-up current (I_p) occurs when the power supply is turned on for a circuit module and it is different from the normal switching current (I_s). I_s depends on two successive circuit states S_1 and S_2, which are determined by two successive input vectors V_1 and V_2 for combinational circuits.
As discussed in Section 1, I_p can be viewed as a special case of I_s where the state S_1 before power-up is logic '0' for all the nodes. Because no input vector leads to a circuit state with all nodes at logic '0' for nontrivial circuits, the maximum I_p is, in general, different from the maximum I_s. Moreover, the I_p of a circuit is solely decided by the circuit state S_2, and therefore decided by a single input vector when the circuit is powered up. To illustrate that I_p depends on the input vectors, we present the I_p obtained by SPICE simulation for an 8-bit adder under two different input vectors in Table 2. The difference of the maximum I_p is about 24%. I_p is greatly affected by the input vector when the circuit is powered up. We define an I_p element to be the power-up current generated by an individual gate, and give the following observation related to timing:
Observation 2: If a set of gates is controlled by one single sleep transistor, all these gates are powered up simultaneously, i.e. all the I_p elements for these gates have the same starting time.
Further, we study the effects of the turn-on time (i.e. the time to turn on the sleep transistor in an MTCMOS circuit) by simulating a five-stage inverter chain and an eight-bit adder. We use random SPICE simulation with a sufficiently large number of vectors for different turn-on times from 0.1 ns to 10 ns (see Table 3). Based on the results, we conclude:
Observation 3: I_p is very sensitive to turn-on time; I_p reduces when turn-on time increases.
Even though a large turn-on time can help reduce the power-up current, a small turn-on time is preferred for high-performance designs. A careful study is needed to achieve the best trade-off between performance and reliability/cost related to a small turn-on time.
ATPG-based algorithms have been proposed in [10]. It is assumed that the power-up current is proportional to the total charge in the circuit after power-up, and the charge for one single gate with output value '1' is proportional to its fanout number. Therefore, the gate fanout number is used as the figure of merit of the power-up current (I_p) for a gate with output value '1'. The ATPG algorithm is performed to find the logic vector that maximises the figure of merit. However, this algorithm does not take the current waveform in the time domain into account. The vector obtained by the ATPG algorithm has to be further used in SPICE simulation to obtain the I_p value.
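The fanout-based figure of merit can be illustrated with a small sketch. It assumes a netlist stored in topological order as a dictionary from gate name to a pair of (logic function, input names), and it uses exhaustive search over a two-input toy circuit as a stand-in for the ATPG search of [10]. The netlist, fanout numbers and function names are hypothetical.

```python
from itertools import product

def figure_of_merit(netlist, fanout, vector):
    """Sum of fanout numbers over all gates whose output is '1' after power-up."""
    values = dict(vector)
    for gate, (op, inputs) in netlist.items():  # assumes topological order
        values[gate] = op(*(values[i] for i in inputs))
    return sum(fanout[g] for g in netlist if values[g] == 1)

# Hypothetical two-gate circuit: g1 = NAND(a, b), g2 = NOT(g1)
netlist = {
    "g1": (lambda a, b: 1 - (a & b), ("a", "b")),
    "g2": (lambda x: 1 - x, ("g1",)),
}
fanout = {"g1": 1, "g2": 2}

# Exhaustive search replaces ATPG here; it is only feasible for tiny circuits.
best = max(product((0, 1), repeat=2),
           key=lambda v: figure_of_merit(netlist, fanout, {"a": v[0], "b": v[1]}))
print(best, figure_of_merit(netlist, fanout, {"a": best[0], "b": best[1]}))
```

The figure of merit ranks vectors by total charge only, which is why the selected vector still has to be simulated to obtain a current value.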
To achieve a more accurate estimation and obtain the I_p value directly, we need a current model that can capture the current waveform. We apply the piecewise linear (PWL) function to model the I_p element. SPICE simulation is used to get the power-up current waveform, and the waveform is linearised in different regions to build the PWL model for each cell in the library (see Fig. 5). Our PWL model considers the following dimensions: gate type, input pin number, gate size, fanout number, turn-on time, and post-power-up output logic value. Note that a much simplified PWL model, the right-triangle current model, has been successfully used in [18] for maximum switching current estimation.
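Given PWL models for the individual gates, the circuit-level I_p for one input vector can be obtained by superposition. The sketch below assumes each I_p element is a list of (time, current) breakpoints with strictly increasing times, that all elements share the same starting time (Observation 2), and that the current is zero outside the listed breakpoints. Since a sum of PWL functions is itself PWL, its peak lies at one of the breakpoints. The waveforms and names are hypothetical.

```python
def pwl_value(points, t):
    """Linear interpolation of one I_p element at time t (0 outside its range)."""
    for (t0, i0), (t1, i1) in zip(points, points[1:]):
        if t0 <= t <= t1:
            return i0 + (i1 - i0) * (t - t0) / (t1 - t0)
    return 0.0

def peak_power_up_current(elements):
    """Return (peak current, time of peak) for the sum of all I_p elements."""
    times = sorted({t for pts in elements for t, _ in pts})
    return max((sum(pwl_value(p, t) for p in elements), t) for t in times)

# Two hypothetical gates, both starting at t = 0 (arbitrary normalised units)
gate1 = [(0, 0.0), (1, 2.0), (3, 0.0)]
gate2 = [(0, 0.0), (2, 4.0), (4, 0.0)]
peak, t_peak = peak_power_up_current([gate1, gate2])
print(peak, t_peak)
```

In this example the individual peaks (2.0 and 4.0) do not coincide in time, so the circuit peak is 5.0 rather than 6.0, which is the kind of effect a purely charge-based figure of merit cannot capture.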
Genetic algorithm:
Since exhaustive search for the input vector that generates the maximum I_p is infeasible, we apply a genetic algorithm (GA) in our gate-level estimation. We encode the solution (i.e. the input vector) into a string so that the length of the string is equal to the number of primary inputs. Tournament selection [19] is used in our selection process. From the current generation, we randomly pick two strings and select the one with the higher fitness value. After that, the two strings are removed from the current generation. We repeat this procedure until the current generation becomes empty. By doing this, we divide the original strings into inferior and superior groups. We keep a record of the strings in the superior group and put these two groups together to carry out tournament selection again. The two superior groups generated in the two tournaments are combined to go through crossover and mutation, and produce the new generation. The string with the highest fitness will be selected twice so that the best solution so far will stay in the next generation. Since strings with lower fitness have a higher probability of being dropped, the average fitness tends to increase with each generation.
The crossover scheme we use is the one-point crossover algorithm. One bit position is randomly chosen for two parent-strings and they are crossed at that position to get the two child-strings. After crossover, we further use a simple mutation scheme that flips each bit in the string with equal probability. The new generation is produced after crossover and mutation, and is ready to go through a new iteration of natural selection. The algorithm stops after the number of generations exceeds a pre-defined number. We summarise the algorithm in Fig. 6 .
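The selection, crossover and mutation steps can be sketched as follows. This is a simplified illustration under stated assumptions: the fitness function is a toy stand-in (in the flow described here it would be the gate-level I_p estimate of an input vector), the population size is even, the best string is preserved by copying it into the next generation, and the mutation rate, population size and function names are arbitrary choices rather than the settings of Fig. 6.

```python
import random

def tournament(pop, fitness, rng):
    """Pair strings at random and keep the fitter one of each pair."""
    pool = pop[:]
    rng.shuffle(pool)
    return [max(pool[i], pool[i + 1], key=fitness)
            for i in range(0, len(pool) - 1, 2)]

def one_point_crossover(a, b, rng):
    cut = rng.randrange(1, len(a))          # random crossing position
    return a[:cut] + b[cut:], b[:cut] + a[cut:]

def mutate(s, rate, rng):
    """Flip each bit independently with the same probability."""
    return tuple(bit ^ 1 if rng.random() < rate else bit for bit in s)

def genetic_search(fitness, n_bits, pop_size=20, generations=50,
                   rate=0.02, seed=0):
    rng = random.Random(seed)
    pop = [tuple(rng.randint(0, 1) for _ in range(n_bits))
           for _ in range(pop_size)]
    for _ in range(generations):
        best = max(pop, key=fitness)
        # Two tournaments over the same generation give two superior groups.
        parents = tournament(pop, fitness, rng) + tournament(pop, fitness, rng)
        children = []
        for i in range(0, len(parents) - 1, 2):
            c1, c2 = one_point_crossover(parents[i], parents[i + 1], rng)
            children += [mutate(c1, rate, rng), mutate(c2, rate, rng)]
        children[0] = best                  # keep the best solution so far
        pop = children
    return max(pop, key=fitness)

# Toy fitness: number of inputs at logic '1' (hypothetical stand-in).
toy_fitness = lambda s: sum(s)
best_vector = genetic_search(toy_fitness, n_bits=8)
print(best_vector, toy_fitness(best_vector))
```

Because the best string is carried over unchanged, the best fitness never decreases from one generation to the next, which matches the intent of selecting the fittest string twice.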
We carry out experiments and compare the results of the genetic algorithm to those of simulations with 5000 random vectors (called 'random 5000') in Table 4. Under the same PWL current model, GA achieves up to 27% estimation improvement in approaching the upper bound of power-up current. The average improvement for all the circuits is 6%.
Moreover, we compare the I_p obtained by SPICE simulation of the entire benchmark circuit with the best vector from the genetic algorithm to the I_p calculated by our PWL model in Fig. 7(a). Even though the I_p by the PWL model is different from that given by SPICE simulation, there is a close correlation between these two currents. Therefore, our PWL model has a high fidelity versus SPICE simulation and gives a conservative estimation. Owing to the high fidelity, we propose to scale the I_p from the PWL model by a constant K, and compare the I_p values by the new scaled PWL model and SPICE simulation. As shown in Fig. 7(b), the difference between the I_p values is greatly reduced. The derivation of the scaling constant K will be discussed in Section 3.2.
Calculation of I avg and experimental results
I_avg is not simply the average I_p element for all cells in a library. The frequency of cells used in logic synthesis should be taken into account. We assume that the logic synthesis results for a few typical circuits (or random logic functions) are available. We calculate I_avg in a regression-based way as follows. We compute the average maximum I_p per gate for n typical circuits by applying the gate-level estimation. We then increase n until the resulting value becomes a 'constant'. We treat this constant value as I_avg. In Fig. 8a, we plot I_avg with respect to the number of circuits used to calculate I_avg. The figure shows that the change of the I_avg value is relatively large when the number of circuits is small (less than 10 in the figure). After the number of circuits increases to 20, the value of I_avg becomes very stable and can be used in our high-level metric M_p. To validate our regression-based I_avg, we use the computed value of I_avg under the PWL model and the accurate gate count to obtain the high-level metric M_p. We compare the gate-level estimation I_p(ckt) by the genetic algorithm to the metric M_p in Fig. 8b. The average absolute error between I_p(ckt) and M_p is 12.02%. Note that the circuits in Fig. 8b are different from those used to compute I_avg, for the purpose of verifying the metric M_p.
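The regression-based calculation can be sketched as a running average. The sketch assumes that 'average maximum I_p per gate' means the mean over circuits of (maximum I_p / gate count), and that stability is declared when several successive relative changes fall below a tolerance; both are interpretations, and the tolerance, window and input data are hypothetical.

```python
def running_iavg(circuits):
    """circuits: list of (max_ip, gate_count); returns I_avg after 1..N circuits."""
    values, total = [], 0.0
    for n, (max_ip, gates) in enumerate(circuits, start=1):
        total += max_ip / gates           # per-gate maximum I_p of this circuit
        values.append(total / n)
    return values

def first_stable(values, rel_tol=0.02, window=3):
    """Smallest n (1-based) at which `window` successive changes stay below rel_tol."""
    run = 0
    for i in range(1, len(values)):
        change = abs(values[i] - values[i - 1]) / abs(values[i - 1])
        run = run + 1 if change < rel_tol else 0
        if run >= window:
            return i + 1
    return None

def power_up_metric(i_avg, gate_count):
    """High-level metric: M_p(f) = I_avg * A."""
    return i_avg * gate_count

vals = running_iavg([(10, 5), (30, 10), (40, 20)])   # normalised, made-up data
print(vals, power_up_metric(vals[1], 100))
```

Once a stable I_avg is available, evaluating the metric for a new function needs only its estimated gate count.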
Our high-level estimation methodology is also directly applicable to the SPICE current model. We run SPICE simulation using the best vector obtained by the genetic algorithm to calculate I_avg(SPICE). As shown in Fig. 9, a stable I_avg(SPICE) is also reached quickly. In addition, we can apply I_avg to calibrate our gate-level estimation.
Let I_p(PWL) and I_p(SPWL) be the I_p values based on the PWL and scaled PWL (SPWL) current models, respectively. Then we have I_p(SPWL) = I_p(PWL)/K, where K, the scaling constant introduced in Section 3.1.2 and used in Fig. 7b, can be calculated as I_avg(PWL)/I_avg(SPICE).
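As a worked illustration with made-up normalised values (not measurements from this study), suppose I_avg(PWL) = 1.5 and I_avg(SPICE) = 1.2. The calibration then reads:

```latex
K = \frac{I_{avg}(\mathrm{PWL})}{I_{avg}(\mathrm{SPICE})} = \frac{1.5}{1.2} = 1.25,
\qquad
I_p(\mathrm{SPWL}) = \frac{I_p(\mathrm{PWL})}{K} = \frac{10}{1.25} = 8
```

so a PWL estimate of 10 for some circuit would be scaled down to 8. A value K > 1 corresponds to the PWL model being conservative relative to SPICE.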
Furthermore, we compare the maximum I_p using estimated I_avg and estimated A to the maximum I_p obtained via logic synthesis followed by gate-level analysis. Table 5 shows that the average estimation error is 21.44%. We measure gate-level analysis runtime as the time for logic synthesis and the genetic algorithm, and measure the high-level estimation runtime as the time for area estimation and application of the formula M_p(f) = I_avg · A (pre-characterisation has only a one-time cost and is ignored in the runtime comparison). Our high-level estimation achieves more than 200× run-time speedup for large test circuits.
Leakage current estimation
High-level estimation of leakage current is necessary for evaluating the feasibility of various leakage reduction techniques at a very early design stage. The fact that leakage current also depends on one single input vector means that leakage current shares similar properties with the power-up current that we have studied. Therefore, we believe that a similar high-level metric I_avg^lkg can also be applied to high-level leakage current estimation. We calculate the high-level metric I_avg^lkg for leakage current in a similar way. We apply the genetic algorithm in the gate-level estimation for a few typical circuits, and obtain the metric I_avg^lkg for both maximum and minimum leakage current. Our gate-level estimation uses an input-pattern-dependent leakage current model built by SPICE simulation.
In Fig. 10, we show the metric I_avg^lkg with respect to the number of circuits used to calculate I_avg^lkg. Characterising only a few typical circuits (fewer than 20) is enough to obtain a stable value of I_avg^lkg. This justifies the application of our high-level metric to leakage current estimation. With the metric I_avg^lkg and area estimation, we apply high-level leakage current estimation to our test circuits using the formula:
$$ I_{lkg}(f) = I_{avg}^{lkg} \cdot A $$
The estimation results are presented in Table 6 . The average estimation error is 15.65% for the maximum leakage current and 6.21% for the minimum leakage current. Again, the circuits used in Table 6 are different from those in Fig. 10 for the purpose of verification.
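The estimate and its error metric can be sketched as follows, assuming the leakage estimate is the product of the per-gate metric and the gate count (by analogy with M_p(f) = I_avg · A) and that the reported average error is the mean absolute relative error against gate-level results. All numbers are made up for illustration.

```python
def leakage_estimate(i_avg_lkg, gate_count):
    """High-level leakage estimate: per-gate metric times estimated gate count."""
    return i_avg_lkg * gate_count

def average_absolute_error(estimates, references):
    """Mean absolute relative error, in percent."""
    errors = [abs(e - r) / r for e, r in zip(estimates, references)]
    return 100.0 * sum(errors) / len(errors)

# Hypothetical normalised data: metric 0.5 per gate, two circuits
estimates = [leakage_estimate(0.5, a) for a in (100, 200)]
references = [40.0, 125.0]               # stand-ins for gate-level results
print(estimates, average_absolute_error(estimates, references))
```

The same helper applies to the maximum and the minimum leakage current, each with its own per-gate metric.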
A simple leakage power model at the architectural level has been proposed in [20]. It models leakage power by the equation P_static = V_cc · N · k_design · Î_lkg, where V_cc is the supply voltage, N is the number of transistors, k_design is an empirically determined parameter representing the average characteristics of library cells, and Î_lkg is a technology-dependent parameter representing the per-device subthreshold leakage. However, it is unclear how these parameters are determined for different technologies and cell libraries. Our high-level metric I_avg^lkg with a well-defined calculation mechanism can be viewed as a combination of the parameters k_design and Î_lkg. In addition to its simplicity, our calculation of I_avg^lkg can take into account how frequently the library cells are used by the synthesis tool and the fact that leakage current is input-pattern dependent.
Temperature and V dd scaling
Considering that leakage power also depends on supply voltage V_dd and temperature, we further characterise the temperature and voltage scaling of I_avg^lkg based on the following SPICE BSIM4 model. We distinguish subthreshold leakage and gate leakage due to their different temperature scaling trends. The BSIM4 subthreshold leakage current model [3] is as follows:
where V_GS, V_DS and V_SB are the gate-source, drain-source and source-bulk voltages, respectively; V_T is the zero-bias threshold voltage, V_TH is the thermal voltage kT/q, γ' is the linearised body-effect coefficient, η is the drain-induced barrier lowering (DIBL) coefficient, μ_0 is the carrier mobility, C_ox is the gate capacitance per area, W is the width and L_eff is the effective gate length.
Fig. 10 Verification of high-level metric for maximum and minimum leakage current
From (3) we can see that the temperature scaling for subthreshold leakage current is T^2 e^{1/T}, where T is the temperature, and the voltage scaling for leakage current is e^{V_dd}. Based on these observations, we propose the following formula for the subthreshold leakage metric I_avg^sub considering temperature and voltage scaling:
where I_s^sub is a constant current at the reference temperature T_0 and voltage V_0. a_s1 and b_s1 in (5) are empirical constants decided by circuit designs.
In the BSIM4 model, gate leakage current is modelled as gate direct tunnelling current, including the tunnelling current between gate and substrate (I_gb) and the current between gate and channel (I_gc). The formulas for both I_gb and I_gc are:
where (2) , but take into account the different scaling feature for subthreshold leakage and gate leakage. The total leakage for a logic circuit can be modelled:
where P_lkg is the total leakage power for a logic circuit, I_avg is the total leakage current per gate, I_s is the I_avg at the given temperature T_0 and supply voltage V_0, and f_avg(T, V_dd) is the scaling function that characterises temperature and V_dd scaling considering both subthreshold and gate leakage. It can be expressed as follows:
where A, B, α, β, γ, and δ are empirical constants for different circuit types, technologies and designs. We obtain the constants in (5) and (14) empirically by determining the power consumption for different circuit types at multiple temperatures using SPICE simulations and then applying curve fitting. Table 7 compares I_avg^lkg by SPICE simulation at different temperatures and V_dd to the I_avg^lkg calculated by our temperature and V_dd scaling formulae. We use the average leakage current for data-path circuits: adder (4-bit, 16-bit and 32-bit), shifter (8-bit, 16-bit and 32-bit), and multiplier (4-bit, 5-bit and 6-bit). The estimation error for I_avg^lkg at scaled temperature and V_dd is less than 1%. This high-level leakage model considering temperature and V_dd scaling can be used in architectural and system-level simulations. We have applied our high-level leakage power model in a coupled thermal and power microarchitecture simulator, PTscalar [21], which studies the interdependence between leakage and temperature and its impact on processor performance [22].
5
Conclusions and discussions
Using design examples and the design environment of a leading industrial CPU project, we have presented an improved high-level area estimation method. The estimation has an average error of 23.59% for designs using a rich cell library. We have also proposed a high-level metric to estimate the maximum power-up current due to power gating for leakage reduction. Compared to time-consuming logic synthesis followed by gate-level analysis, our high-level estimation has an average error of 21.44% for power-up current. We further extend our high-level estimation methodology to leakage current, and the average estimation error is 15.65% and 6.21% for maximum and minimum leakage current, respectively. We also develop the high-level metric for leakage current considering temperature and supply voltage (V_dd) scaling. The estimation error for the metric is less than 1% at different temperatures and supply voltages. Our high-level estimation method can be readily applied to estimate the area overhead due to sleep transistor insertion in power gating. There are two primary constraints for the sleep transistor. One constraint is the IR voltage drop that introduces a performance penalty. Appropriate sizing of the sleep transistor can be performed to satisfy this constraint based on the maximum switching current. The high-level estimation of maximum switching current has been studied in [15]. The other is a reliability constraint for sleep transistors (i.e. avoiding damage to the sleep transistor from a large transient current). We can obtain the maximum transient current as the larger of the maximum power-up current and the maximum switching current, and size the sleep transistor to satisfy the reliability constraint. In addition, our high-level leakage model considering temperature and V_dd scaling can also be applied in architectural and system-level simulations.
As one application, our high-level leakage model has been successfully used in the coupled thermal and power microarchitectural simulator PTscalar [21, 22].
