As device sizes continue to shrink and circuit complexity continues to grow, power has become the limiting factor in today's processor design. Since the power dissipation is a function of many variables with uncertainty, the most accurate representation of chip power or macro power is a statistical distribution subject to process and workload variation, instead of a single number for the average or worst-case power. Unlike statistical timing models that can be represented as a linear canonical form of Gaussian process parameter distributions, the exponential dependency of leakage power on process variables, as well as the complex relationship between switching power and workload fluctuations, present unique challenges in statistical power analysis. This paper presents a comprehensive study on the statistical distribution of dynamic switching power and static leakage power to demonstrate the characterization and correlation methods for macro-level and chip-level power analysis.
INTRODUCTION
The advent of continued device scaling and increasing process variability has contributed to the growing popularity of statistical timing analysis, which not only replaces the traditional process-corner-based approach used in static timing analysis, but also revolutionizes the way that chips are designed and verified today. In statistical timing analysis, all timing quantities such as gate delays, wire delays, arrival times, slews (rise/fall times) and slacks are represented by a canonical first-order delay model:
$$a_0 + \sum_{i=1}^{n} a_i \,\Delta X_i + a_{n+1} \,\Delta R_a$$

where $a_0$ is the mean or nominal value, $\Delta X_i$ represents the variation of the $i$th global source of variation $X_i$ from its nominal value, $\Delta R_a$ is the variation of an independent random variable $R_a$, and $a_i$ is the sensitivity of the timing quantity to each of the sources of variation. By scaling the sensitivity coefficients, the random variables $\Delta X_i$ and $\Delta R_a$ can be assumed to have a normalized unit Gaussian distribution $N(0, 1)$. The capabilities of parameterized block-based statistical timing analysis in Ref. [1] have since been extended by Ref. [2] to handle non-Gaussian parameters and nonlinear delay functions.

Like static timing methodology, most traditional power analysis methodologies are deterministic and corner based, where only selected cases such as the nominal case, best ($-3\sigma$) case, and worst ($+3\sigma$) case are analyzed. However, when the worst-case assumption is made for each random variable, the corner-based approach is inherently pessimistic. In order to avoid parametric yield prediction based on fully correlated and overly pessimistic corner points, statistical methods have been developed to model leakage power due to process variability. For example, an empirically-fit exponential quadratic equation is proposed in Ref. [3] to represent the subthreshold current as a function of channel length and estimate its probability density function. The mean and variance of the leakage current for the entire circuit can then be obtained by adding the lognormal distribution of leakage current from individual gates. A full-chip analysis of both the subthreshold leakage and gate tunneling leakage is described in Ref. [4] by considering spatial correlation due to intra-chip variations.
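As a rough illustration of the canonical first-order form described above, the following Python sketch adds two timing quantities that share the same global sources of variation. The class name, the field names, and all numeric values are hypothetical and chosen only for illustration; this is not the implementation of Ref. [1].

```python
import math
from dataclasses import dataclass


@dataclass
class Canonical:
    """First-order canonical form: a0 + sum(a_i * dX_i) + a_r * dR."""
    a0: float    # mean (nominal) value
    a: tuple     # sensitivities to the shared global sources dX_i
    a_r: float   # sensitivity to the independent random term dR

    def sigma(self) -> float:
        # All dX_i and dR are unit Gaussians, so variances add in quadrature.
        return math.sqrt(sum(c * c for c in self.a) + self.a_r ** 2)


def add(p: Canonical, q: Canonical) -> Canonical:
    # Sensitivities to shared global sources add term by term;
    # the independent random terms combine as a root sum of squares.
    return Canonical(
        a0=p.a0 + q.a0,
        a=tuple(x + y for x, y in zip(p.a, q.a)),
        a_r=math.hypot(p.a_r, q.a_r),
    )


gate = Canonical(a0=10.0, a=(0.6, 0.0), a_r=0.8)  # hypothetical gate delay (ps)
wire = Canonical(a0=5.0, a=(0.3, 0.4), a_r=0.0)   # hypothetical wire delay (ps)
path = add(gate, wire)
print(path.a0, path.a, round(path.sigma(), 4))
```

Because the global sensitivities are kept separate rather than collapsed into one variance, correlation between quantities that depend on the same source is preserved when they are combined.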
Although simplified lognormal models have been developed to estimate the leakage power distribution, the industry-standard BSIM device models [5] generally cannot be easily adapted to the analysis of process variability without loss of accuracy. Furthermore, the statistical characterization of power should include not only the static leakage power, but also the dynamic switching power. When the macro power is characterized deterministically by an average or worst-case number, designers often have little information about the true distribution of power that could potentially lead to thermal or yield problems. For example, a macro that consumes 0.3 W of power on average could operate with an idle power of 0.1 W in any given cycle or state, or a peak power of 0.5 W under a wide range of switching factors (Fig. 1). Therefore, it is important to look beyond one deterministic number that has traditionally characterized the power, and provide circuit designers an insight into the statistical distribution of power due to both process and workload variations. Such power analysis is useful for yield prediction, maximization of battery life, prediction of on-chip thermal gradients, power distribution design, decoupling capacitance allocation, electromigration analysis, etc.
In addition, it has been shown that in leakage dominated technologies, the leakage power can cause the yield window to shrink by imposing a two-sided constraint on the window [6]. The correlation between power and performance due to their dependence on common process variables could have a significant impact on yield, especially in high-frequency bins [7]. By integrating the methodologies to analyze the statistical distributions of static leakage power and dynamic switching power, we can accurately estimate and optimize the parametric yield based on the joint probability density function of both power and delay.
In this paper, we will present our study on the statistical distribution of leakage power in Section 2, the statistical distribution of switching power in Section 3, and the experimental results on the statistical distribution of total chip power consumption in Section 4.
STATISTICAL DISTRIBUTION OF LEAKAGE POWER
The total standby current $I_{DDQ}$ of a CMOS transistor is comprised of two major components: the device subthreshold current $I_{OFF}$ and the gate tunneling current $I_{GATE}$. Therefore the analysis of total leakage power must include both the channel leakage due to device subthreshold current and the gate leakage due to tunneling current. The leakage model described in this paper is a hardware-based model that has been extracted from various experiments in a pre-determined process window. After a model is fitted to the hardware measurements, it will accurately characterize the leakage current of the corresponding device such as an NFET, a PFET, or an SRAM cell. For our 45-nm SOI technology, the device subthreshold current $I_{OFF}$ is an exponential function of the channel length $L_P$, the supply voltage $V_{DD}$, and the temperature $T$. The exponential function $I_{OFF}(L_P, V_{DD}, T)$ not only captures the charge-sharing and drain-induced barrier lowering (DIBL) effects, but also considers $I_{OFF}$ variation due to threshold voltage $V_T$ scattering and narrow channel effects. Figure 2 shows the probability density function of subthreshold current due to channel length variation in a typical device. Although the variation of channel length $L_P$ could be modeled as a Gaussian distribution, the corresponding subthreshold current variation is not Gaussian. For example, the long tail of Figure 2 illustrates that $I_{OFF}$ could increase by a factor of 4 due to channel length variation.
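To make the non-Gaussian behavior concrete, the sketch below pushes a Gaussian channel length through a simple exponential current model and inspects the resulting skewed distribution. The model form $\exp(-(L_P - L_{nom})/\lambda)$ and every constant are assumptions for illustration only; they are not the hardware-fitted model described in this section.

```python
import numpy as np

rng = np.random.default_rng(0)

L_NOM = 40.0    # nominal channel length in nm (hypothetical)
SIGMA_L = 1.5   # standard deviation of channel length in nm (hypothetical)
LAMBDA = 3.0    # exponential roll-off length in nm (hypothetical fit constant)


def i_off_normalized(l_p):
    """Subthreshold current relative to its value at nominal channel length."""
    return np.exp(-(l_p - L_NOM) / LAMBDA)


l_p = rng.normal(L_NOM, SIGMA_L, size=200_000)  # Gaussian channel length
i_off = i_off_normalized(l_p)                   # skewed (lognormal) current

mean, median = i_off.mean(), np.median(i_off)
tail = np.quantile(i_off, 0.999)                # far upper tail of the current
print(f"mean={mean:.3f}  median={median:.3f}  99.9th percentile={tail:.2f}")
```

With these assumed constants the mean sits above the median and the upper tail reaches several times the nominal current, which is the qualitative shape the text attributes to Figure 2.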
In addition to subthreshold current $I_{OFF}$, the gate tunneling current $I_{GATE}$ for thin-oxide devices could also be a significant contributor to the total leakage current on the chip. As depicted in Figure 3, the gate tunneling current includes the current between gate and source/drain diffusion through the channel region ($I_{gcs}$ and $I_{gcd}$), the current between gate and source/drain diffusion through the overlap region ($I_{gos}$ and $I_{god}$), and the current between gate and body ($I_{gb}$). For our 45-nm SOI technology, the gate tunneling current $I_{GATE}$ is a linear function of the channel length $L_P$, but an exponential function of the gate oxide thickness $T_{OX}$, the supply voltage $V_{DD}$, and the temperature $T$. Figure 4 illustrates the probability density function of the normalized gate tunneling current due to channel length and oxide thickness variations in an NFET. Although the red curve in Figure 4 shows that the gate tunneling current due to channel length ($L_P$) variation alone is a Gaussian distribution, the closely matched blue curve ($T_{OX}$ variation) and green curve ($T_{OX} + L_P$ variation) clearly demonstrate that the statistical distribution of gate tunneling current is dominated by the variation of oxide thickness. As $T_{OX}$ decreases, $I_{GATE}$ increases exponentially. Since gate leakage occurs when the gate is on and channel leakage occurs when the channel is off, Figure 5 shows the statistical distribution of normalized $I_{DDQ}$ for one transistor when the gate is turned on 25% of the time and off 75% of the time. Based on the time-averaged channel and gate leakage state for each device reported by the circuit simulator, the probability of each leakage state not only reflects a device's average time in the on or off state, but also considers circuit topology effects such as device stacking.
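The dominance of oxide thickness and the time-weighted combination of the two leakage states can be sketched as follows. The functional forms follow the qualitative description above (gate current linear in channel length and exponential in oxide thickness), but all constants are hypothetical, and both currents are normalized to the same nominal magnitude, which a real device model would not assume.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Hypothetical constants, not fitted to any hardware data.
SIGMA_L_REL = 0.03   # relative sigma of channel length
SIGMA_T = 0.02       # sigma of oxide thickness (nm)
T_SCALE = 0.05       # exponential scale of gate tunneling vs. T_OX (nm)
L_SCALE_REL = 0.08   # exponential scale of I_OFF vs. relative channel length

d_l = rng.normal(0.0, SIGMA_L_REL, n)  # relative channel length deviation
d_t = rng.normal(0.0, SIGMA_T, n)      # oxide thickness deviation (nm)


def i_gate(d_l, d_t):
    # Linear in channel length, exponential in oxide thickness.
    return (1.0 + d_l) * np.exp(-d_t / T_SCALE)


def i_off(d_l):
    # Exponential in channel length.
    return np.exp(-d_l / L_SCALE_REL)


zeros = np.zeros(n)
std_l_only = i_gate(d_l, zeros).std()  # channel length variation alone
std_t_only = i_gate(zeros, d_t).std()  # oxide thickness variation alone
std_both = i_gate(d_l, d_t).std()      # both sources together

P_ON = 0.25  # fraction of time the gate is on (gate leakage state)
i_ddq = P_ON * i_gate(d_l, d_t) + (1.0 - P_ON) * i_off(d_l)
print(std_l_only, std_t_only, std_both, i_ddq.mean())
```

Under these assumptions the spread from oxide thickness alone nearly coincides with the spread from both sources, mirroring the closely matched curves described for Figure 4.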
STATISTICAL DISTRIBUTION OF SWITCHING POWER
An accurate analysis of chip power is imperative for high-performance processor design to understand system requirements and ensure system reliability. Therefore, in addition to the estimation of leakage power under process variation, the distribution of switching power due to workload variation should also be included in a statistical power analysis. The common power analysis methodology [8] that we have developed to estimate switching power starts with transistor-level simulation of each circuit macro. As a building block of the chip design hierarchy, the macro can be as simple as an I/O buffer, or as complex as a cache or multiplier. After generating the net list for each macro, adding input vectors and output loads, and capturing the current waveforms from circuit simulation, a comprehensive power analysis can be performed to extract device current models and specific circuit power characteristics.
Our circuit simulation is based on an event-driven circuit simulator [9], which allows current waveform integration on the fly and greatly reduces the output file size. Technology-dependent data such as device models, temperature and voltage parameters, and clock cycle time are used during circuit simulation. In addition, over 100 parameters are typically specified in a project file, which contains information such as signal timing, logic exclusivity of signals, and capacitive loads.
The process of creating a circuit net list can be run from both the schematic and layout views. The physical layout extraction not only identifies all the transistors and their parasitic capacitances, but also inserts current meters at the junction contacts between the transistors and power nets to collect detailed current distribution data under various operating conditions.
After the raw net list is extracted, appropriate input stimuli and output loads are added to generate the final net list. Voltage sources are applied at the primary inputs to represent the correct input vectors that satisfy all circuit and logic constraints and provide sufficient coverage for power, noise, and reliability analysis. The input vectors can be categorized into different states such as ramp-up, clock-gated, hold (idle), functional (average), and peak power. For example, the ramp-up cycles serve to flush random data through the circuit to initialize its state. After the circuit is initialized, all inputs are held constant for several cycles, except for the global clock signals, to allow the circuit to reach its inactive state where the idle or hold power can be determined. The hold power is a measure of the clock-related power, which is often regarded as the power of the most common state. Finally, workload-based random input vectors are applied to the data input nodes to measure the average functional power. A range of switching factors is applied to provide coverage for not only power calculation, but also noise and electromigration analysis.
The generation of the final net list is further controlled by a configuration file, which includes global circuit information such as voltage, temperature, clock cycle, signal timing, and output loads. The correct arrival time of input signals is extracted from the timing file and data-type constraints are assigned to the input nodes. Once all the macros have been simulated, the current data can be collected during the hold cycle, average-current cycle, and peak-current cycle for macro-level and chip-level power analysis.

Figure 6 shows the scatter plot of 141 data points for the switching current of a macro with 79,082 NFETs and 79,921 PFETs. The relationship between switching current and input switching factor can be approximated by a linear, quadratic, or higher-order polynomial regression. In Figure 6, about 80% of the data points are scattered between the upper and lower regression lines, which represent the 90th and 10th percentiles of switching current, respectively. Figure 7 shows the probability density function of macro power when there is a 10% probability that the macro operates with a switching factor (SF) of about 0.5, a 10% probability that the macro operates with an SF of about 0.1, a 30% probability that the macro is idle (SF = 0 with clock running), and a 50% probability that the macro is clock-gated (with leakage only). For the two clusters of data points where $SF \neq 0$, their input switching factors are assumed to follow two relatively narrow Gaussian distributions $N(\mu_1, \sigma_1)$ and $N(\mu_2, \sigma_2)$, where the means $\mu_1$ and $\mu_2$ and the standard deviations $\sigma_1$ and $\sigma_2$ are determined by the data in Figure 6. Similarly, the variations of channel length and gate oxide thickness are assumed to have Gaussian distributions during leakage calculation.
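The four-state mixture described for Figure 7 can be sampled with a short Monte Carlo sketch. The state probabilities and the approximate switching factors come from the text; the linear power model, the power constants, and the Gaussian widths are hypothetical, and leakage is held constant here even though the paper treats it as a distribution.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

# State probabilities from the text; every power number is hypothetical.
states = np.array(["sf_high", "sf_low", "idle", "clock_gated"])
probs = np.array([0.10, 0.10, 0.30, 0.50])

P_LEAK = 0.02    # leakage power in W (hypothetical, held constant)
P_CLOCK = 0.08   # clock (hold) power when the clock runs, in W (hypothetical)
SLOPE = 0.80     # linear regression slope, W per unit switching factor

state = rng.choice(len(states), size=n, p=probs)

# Narrow Gaussian switching factors for the two active clusters.
sf = np.zeros(n)
high, low = state == 0, state == 1
sf[high] = rng.normal(0.5, 0.03, high.sum())
sf[low] = rng.normal(0.1, 0.01, low.sum())
sf = np.clip(sf, 0.0, 1.0)

clock_on = state != 3  # the clock runs in every state except clock-gated
power = P_LEAK + clock_on * P_CLOCK + SLOPE * sf

# Histogram as a discrete estimate of the multi-peak power PDF.
pdf, edges = np.histogram(power, bins=60, density=True)
print(power.mean(), power.min(), power.max())
```

The resulting histogram has one peak per operating state, which is why a single average or worst-case number describes such a macro poorly.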
EXPERIMENTAL RESULTS
A statistical power analysis has been performed on selected benchmark macros in our 45-nm SOI technology. For gate tunneling leakage calculation, the oxide thickness $T_{OX}$ is assumed to have a Gaussian distribution $N(\mu_T, \sigma_T)$, where $\mu_T$ is the mean and $\sigma_T$ is the standard deviation of gate oxide thickness. To take into account the inter-chip and intra-chip variation, the oxide thickness is modeled as

$$T_{OX} = \mu_T + \sigma_T \left( \alpha \, T_{global} + \beta \, T_{local} \right)$$

where $T_{global}$ is a normalized Gaussian distribution due to global chip-to-chip variation, $T_{local}$ is a normalized Gaussian distribution due to local intra-chip variation, and $\alpha^2 + \beta^2 = 1$. In our case study below, $\alpha$ is set to 0.8 and $\beta$ is set to 0.6. Similarly, process variations such as gate lithography, etch bias, and lateral source/drain diffusion result in channel length variation. For subthreshold leakage calculation, the channel length $L_P$ is assumed to have a Gaussian distribution $N(\mu_L, \sigma_L)$, where $\mu_L$ is the mean and $\sigma_L$ is the standard deviation of physical channel length. To take into account the inter-chip and intra-chip variation, the total channel length variation is modeled as

$$L_P = \mu_L + \sigma_L \left( \gamma \, L_{CHIP} + \delta \, L_{ACLV} \right)$$

where $L_{CHIP}$ is a normalized Gaussian distribution due to chip mean variation, $L_{ACLV}$ is a normalized Gaussian distribution due to across-chip line-width variation, and $\gamma^2 + \delta^2 = 1$. Figure 8 shows the probability density functions of gate leakage, subthreshold leakage, and total leakage power for a macro with 1,725 NFETs and 1,586 PFETs. The global and local variations of oxide thickness and channel length are considered at the transistor level and included in the respective leakage power distribution.

Chen et al.
Statistical Power Analysis for High-Performance Processors
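The split of oxide thickness variation into a chip-to-chip part and an intra-chip part, with weights 0.8 and 0.6 as in the case study, can be checked numerically. In this sketch the mean and sigma of the oxide thickness are hypothetical; only the two weights are taken from the text.

```python
import numpy as np

rng = np.random.default_rng(3)

MU_T, SIGMA_T = 1.2, 0.03     # mean and sigma of T_OX in nm (hypothetical)
W_GLOBAL, W_LOCAL = 0.8, 0.6  # case-study weights; squares sum to 1

n_chips, n_devices = 20_000, 2

t_global = rng.standard_normal((n_chips, 1))         # one draw per chip
t_local = rng.standard_normal((n_chips, n_devices))  # one draw per device
t_ox = MU_T + SIGMA_T * (W_GLOBAL * t_global + W_LOCAL * t_local)

total_sigma = t_ox.std()
# Correlation between two devices on the same chip.
corr = np.corrcoef(t_ox[:, 0], t_ox[:, 1])[0, 1]
print(total_sigma, corr)
```

Because the squared weights sum to one, the total sigma is preserved, and two devices on the same chip end up correlated by the square of the global weight, about 0.64 for these values.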
To model the correlated and independent randomness of switching power, the statistical power distribution of each macro must first be characterized by its probability density function (PDF) and cumulative distribution function (CDF). The power of macro $i$ can then be determined by the inverse function of $CDF_i$, where $P_i = CDF_i^{-1}(X_i)$ and $0 \le X_i \le 1$. Depending on how the switching activities of macro $i$ are associated with workload $j$, the random variable $X_i$ will be assigned a CDF value based on the equation:

$$X_i = \sum_{j=1}^{n} a_{ij} W_j + b \, M_i$$

where $W_j$ is the CDF value that corresponds to the power fluctuation of workload $j$, $a_{ij}$ is the sensitivity of the dynamic power of macro $i$ to workload $j$, $M_i$ is the CDF value that corresponds to the non-workload-dependent power variation of macro $i$, and $b$ is the sensitivity of the dynamic power of macro $i$ to its own random variation, subject to the constraints that $b \ge 0$ and $\sum_{j=1}^{n} a_{ij} + b = 1$. If two macros have similar-valued $a_{ij}$'s, then their powers are well correlated; otherwise they are not. Since CDF values are assigned to $W_j$ and $M_i$, both variables assume a standard uniform distribution $U(0, 1)$.
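A minimal sketch of this CDF-based sampling, assuming a single shared workload so that each macro has one workload sensitivity `a` and an own-variation weight `b = 1 - a`, is shown below. The macro library is synthetic (a hypothetical two-mode mixture per macro), and the empirical inverse CDF via `np.quantile` is a stand-in for whatever characterized CDF a circuit library would supply.

```python
import numpy as np

rng = np.random.default_rng(4)
n_samples, n_macros = 10_000, 10


def characterize(rng, size=5_000):
    """Hypothetical macro power samples (W): idle and active clusters."""
    active = rng.random(size) < 0.3
    return np.where(active,
                    rng.normal(0.5, 0.03, size),
                    rng.normal(0.1, 0.01, size))


library = [characterize(rng) for _ in range(n_macros)]


def inverse_cdf(samples, x):
    """Empirical inverse CDF: power at cumulative probability x in [0, 1]."""
    return np.quantile(samples, x)


def chip_power(a, rng):
    """Total power with X_i = a * W + b * M_i, one workload, b = 1 - a."""
    b = 1.0 - a
    w = rng.random((n_samples, 1))         # shared workload CDF value W
    m = rng.random((n_samples, n_macros))  # per-macro CDF values M_i
    x = a * w + b * m                      # stays within [0, 1]
    p = np.column_stack(
        [inverse_cdf(library[i], x[:, i]) for i in range(n_macros)]
    )
    return p.sum(axis=1)


var_indep = chip_power(0.0, rng).var()    # fully independent macros
var_nominal = chip_power(0.6, rng).var()  # 60% tied to a common workload
var_full = chip_power(1.0, rng).var()     # perfectly correlated macros
print(var_indep, var_nominal, var_full)
```

The variance of the summed power grows with the workload sensitivity, since correlated macros peak together. Note that for intermediate values of `a` the weighted sum of two uniform variables is no longer exactly uniform, so this sketch follows the equation as stated rather than preserving each macro's marginal distribution exactly.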
In order to provide maximum generality and flexibility to model macros with different patterns of switching activities, Monte Carlo simulation with a sample size of 10,000 is used for statistical power analysis. After building the individual macro power models, it takes about 10 CPU hours to simulate a large chip with 43 million transistors. Figure 9 depicts the cumulative distribution function of macro power for 10 macros with a total of 159,003 MOSFETs. To illustrate how the correlation of switching activities affects the overall power distribution, Figure 9 shows two extreme cases where the switching activities of different macros are either completely independent or perfectly correlated, and one nominal case where the switching activities of different macros are 60% correlated to a common workload. The uncorrelated case, not surprisingly, gives us the smallest variance.

Figure 10 shows the statistical power distribution of a high-performance processor chip with 8 cores and over 600 million transistors. If the switching activities of different macros are fully correlated within each core, but uncorrelated among different cores, the chip power will have a statistical distribution that ranges from about 100 W to 200 W (blue curve). On the other hand, if the switching activities of all the macros are fully correlated at the chip level, the chip power will have a much wider variation (red curve) that corresponds to the chip's common modes of operation such as power-gated, clock-gated, idle, and peak power.

Statistical power analysis can be further combined with statistical timing analysis to make better yield predictions. Figure 11 shows the joint probability density function of both power and delay for a benchmark macro. By integrating the statistical power distribution with the statistical timing distribution, this three-dimensional yield versus power and performance plot provides a more comprehensive means for designers to define the corners, improve the yield, determine bin splits, and optimize other design variables.

Fig. 11. Joint probability density function of power and delay.
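The idea of estimating yield from the joint distribution of power and delay can be sketched with a single shared process variable. The delay and power models, their coefficients, and the two limits below are all hypothetical; the only property borrowed from the discussion is that a shorter channel is faster but leakier, so the two constraints cut the distribution from opposite sides.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200_000

d_l = rng.standard_normal(n)  # channel length deviation in units of sigma

# Hypothetical first-order models: a shorter channel is faster but leakier.
delay = 1.0 + 0.05 * d_l                # normalized delay
power = 0.6 + 0.4 * np.exp(-0.5 * d_l)  # normalized switching + leakage power

D_MAX, P_MAX = 1.05, 1.25  # hypothetical delay and power limits

timing_ok = delay <= D_MAX
power_ok = power <= P_MAX
yield_timing = timing_ok.mean()
yield_power = power_ok.mean()
yield_joint = (timing_ok & power_ok).mean()

# 2-D histogram as a discrete estimate of the joint PDF of delay and power.
joint_pdf, d_edges, p_edges = np.histogram2d(delay, power, bins=40, density=True)
print(yield_timing, yield_power, yield_joint)
```

The joint yield is lower than either single-constraint yield because slow parts fail timing while fast parts fail the power limit, which is the two-sided yield window described in the Introduction.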
Fig. 10. Full-chip power distribution.

CONCLUSION
Although the statistical distribution of leakage power due to process variation has been extensively studied in the literature, the statistical analysis of switching power due to workload variation remains a difficult challenge. This paper presents a first study on the combined analysis of leakage power and switching power that takes both global correlation and local randomness into account. Leakage power due to process variables such as oxide thickness and channel length is modeled and correlated at the transistor or gate level, while switching power due to workload-related activities is modeled and correlated at the macro or block level. In order to provide a general framework to handle non-Gaussian and multiple-peak distributions, a CDF-based Monte Carlo simulation is performed to analyze the statistical distribution of macro and chip power. Based on these benchmark results, we not only demonstrate the feasibility of a general statistical analysis for both leakage and switching power, but also develop a design methodology where the statistical power distribution of each macro is characterized by its PDF and CDF in the circuit library.
