Improving the Efficiency of Monte Carlo Power Estimation
Chih-Shun Ding, Cheng-Ta Hsieh, and Massoud Pedram

Abstract-In this paper, we propose two efficient statistical sampling techniques for estimating the total power consumption of large hierarchical circuits. We first show that, due to the characteristics of sampling efficiency in Monte Carlo simulation, the granularity of samples is an important issue in achieving high overall efficiency. The proposed techniques perform sampling both temporally (across different clock cycles) and spatially (across different modules) so that smaller sample granularity can be achieved while maintaining the normality of samples. The first proposed technique, referred to as the module-based approach, samples each module independently when forming a power sample. The second technique, referred to as the cluster-based approach, lumps the modules of a hierarchical circuit into a number of clusters on which sampling is then performed. Both techniques employ stratification to further improve efficiency. Experimental results show that these techniques provide a factor of 2-3 reduction in simulation run time compared to existing Monte Carlo simulation techniques.
Index Terms-Monte Carlo simulation, oversampling, very large scale integration (VLSI) power estimation.
I. INTRODUCTION

Power dissipation has become one of the important design metrics in VLSI design, along with area and speed. As a result, researchers have examined the problem of estimating the power consumption of a circuit in various design phases.
In the early design phases of a circuit, the objective of power estimation [1]-[3] is to provide a relative comparison metric so that designers can efficiently explore various design alternatives. Absolute accuracy is compromised during these phases, as information such as routing capacitances and gate/transistor-level implementations is not yet finalized. As the design progresses, more design decisions are made, and the information required for highly accurate power estimation becomes available. Although low-level optimizations such as gate/transistor sizing can still be done in the final phases of design, the objective of power estimation shifts more toward power verification rather than design exploration. Examples of this type of technique are [4]-[13].

The purpose of this paper is to propose highly accurate yet efficient sampling techniques for use in Monte Carlo simulation loops [7] in the final phases of the design process. We assume that a hierarchical netlist of the circuit, with each module described at the register-transfer (RT) level or below, is available. For each module, a power simulation technique at the same abstraction level as the netlist is adopted. Surveys of existing techniques at the RT level and below are given in [14], [1], and [15]. Our estimation target is the average total power consumption of a hierarchical circuit under a user-specified application input stream. In addition, the estimation itself is performed in a single Monte Carlo simulation so that the power estimate at the top circuit level satisfies the user-specified confidence and error levels. This is different from the strategy of estimating each module in a separate Monte Carlo simulation and summing up the estimates to produce the total average power. The drawback of the latter strategy is that it is difficult to determine a priori what confidence and error levels should be assigned to each module so that the sum of the individual power estimates satisfies the confidence and error levels specified at the top circuit level.
Existing statistical sampling techniques are classified as either parametric [7], [9], [8], [11], [12], in which a basic simulation unit, called a (power) sample, is assumed to be near-normal, or nonparametric [10], in which no assumption about the distribution of samples is made. In the former, the mean power consumption of a predetermined and fixed number of randomly selected clock cycles constitutes a sample in order to ensure near-normality of samples. In nonparametric techniques, the power dissipation values of the units in a sample are ordered, resulting in the order statistics of the sample. Different convergence (or stopping) criteria are applied in these techniques.
Sampling techniques are differentiated from one another by how the samples are selected and, subsequently, by how many samples are required for convergence. Therefore, "efficiency" should refer to the product of the simulation effort spent on each sample (the sample granularity) and the number of required samples, rather than strictly the number of required samples. After investigating the efficiency and oversampling characteristics of Monte Carlo simulation, we conclude that the Monte Carlo simulation strategy favors smaller sample granularity. In other words, if sample granularity is a choice, smaller sample granularity achieves higher overall efficiency. Consequently, our objectives in designing sampling strategies are 1) reducing the variance of samples and 2) using smaller sample granularity while maintaining the near-normality of samples. We achieve the first objective by using stratification and the second by sampling in both the temporal (time) and spatial (module) domains, in contrast with existing sampling techniques, which sample in the temporal domain only. In other words, the latter techniques, which are also referred to as clock-based techniques in this paper, treat the entire circuit as a single entity during sampling, and no circuit partitioning information is utilized.
The first proposed technique uses a module-based sampling strategy in which each module is sampled independently. By doing so, the normality of samples is significantly improved, which allows us to use smaller granularity in a sample. However, this technique still samples each module for at least one clock cycle; that is, there is a limit on how small the granularity of samples can be. Note that this is an issue only when the number of modules in a circuit is very high. For this case, we propose a second technique that allows us to use an even smaller granularity in a sample. It uses a cluster-based sampling strategy in which the modules are grouped into a small number of clusters on which stratified random sampling is then performed. Experimental results show that these techniques reduce the simulation time by a factor of four.
We do not consider nonparametric techniques in this paper, as the results reported in [10] indicate that their oversampling problem is even worse than that of parametric techniques.
The rest of this paper is organized as follows. In Section II, we introduce notation. In Section III, we consider issues associated with power estimation for large hierarchical circuits. A module-based and a cluster-based technique are proposed in Sections IV and V, respectively. Practical issues are examined in Section VI. Experimental results are presented in Section VII, followed by concluding remarks in Section VIII.
II. BACKGROUND
A. Power Dissipation as a Random Variable
A circuit is defined as a collection of modules in which all module inputs are driven either by circuit primary inputs or by the outputs of flip-flops in other modules. This partitioning scheme allows us to clearly define the signal arrival times at each module input so that we can accurately simulate each module independently for power. If the partitioning scheme used by the designer is different from the one proposed here, then for each module, the combinational cones between the module inputs and the outputs of flip-flops in other modules are duplicated and included in the module to aid simulation; their contributions to power consumption are ignored.
The netlist for each module is given at the RT level or lower, and different modules can be at different levels. For each module, a power simulation technique at its own abstraction level can be adopted as long as it provides cycle-by-cycle power values.
Next, we need to define the population used for sampling. Depending on how the population is defined, sampling techniques are either 1) survey-sampling based [9], [13] or 2) signal-statistic based [7], [8], [10]-[12]. In the former, the actual population is obtained from an application vector stream at the circuit inputs combined with the traversal of internal circuit states under the applied stream; the estimation target is the power consumption under the applied vector stream. The vector traces on sequential elements in the circuit need to be known in advance. While this is a computational overhead, it can be significantly reduced using functional or cycle-based simulation [13]. In the latter, the population is described using signal statistics, such as signal probabilities and correlations, and the input vector stream is generated on the fly during sampling. While this class of techniques has been demonstrated on both combinational and sequential circuits, one of its shortcomings is that the complex correlations in a realistic input vector stream cannot be effectively recreated during the vector generation process. In particular, realistic circuits require specific vector sequences to enter a particular operation mode, after which the signal statistics may change. In addition, it is not clear in these techniques how the input signal statistics are obtained in the first place.
We adopt the survey sampling approach and further assume that the vector traces of all sequential elements are given as the result of functional simulation of the circuit.
Let $M$ and $N$ denote the total number of modules in the circuit and the number of clock cycles in the input stream, respectively. The power dissipation of the $i$th module at the $j$th clock cycle is denoted as $p_{ij}$ and is represented by the $(j, i)$ entry of a power log matrix, as shown in Fig. 1. Note that this matrix is just for conceptual convenience in defining the population; the actual power values are not known before estimation. Only the input vectors that will be used to simulate each entry are known from the vector traces of sequential elements.
The random variable representing the power consumption of module $i$ in each clock cycle is denoted as $X_i$ for $i = 1, \ldots, M$. $\mu_i$ and $\sigma_i^2$ denote the mean and variance of $X_i$, respectively. The random variable representing the circuit power in each clock cycle is denoted as $X$ and can be written as
$$X = \sum_{i=1}^{M} X_i. \qquad (1)$$
One of the objectives in our techniques is to reduce sample granularity. To facilitate efficiency comparison against other sampling techniques, we define the notion of (simulation)
workload. The workload for a sample is calculated as the product of the number of transistors being simulated times the number of simulated cycles. Therefore, if a circuit consisting of modules A and B (each having 10 k transistors) is simulated for 30 clock cycles and the average power over the 30 cycles is used to produce one sample value, then the workload for the sample is 20 k $\times$ 30 = 600 k transistor-cycles. On the other hand, if module A is simulated for 15 cycles and module B is simulated for 15 cycles, and the 30 simulation results are averaged and multiplied by a factor of two to produce the sample value, then the workload for the sample is 10 k $\times$ 15 + 10 k $\times$ 15 = 300 k transistor-cycles.
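As a concrete illustration of this bookkeeping, here is a minimal sketch (the helper is ours, not part of the paper) that reproduces the two workload figures above:

```python
# Minimal sketch of the workload bookkeeping described above (helper is ours).
# A sampling plan is a list of (transistors, simulated_cycles) pairs, one per module.

def workload(plan):
    """Workload in transistor-cycles: sum over modules of size x cycles."""
    return sum(transistors * cycles for transistors, cycles in plan)

# Clock-based sample: both 10k-transistor modules simulated for the same 30 cycles.
print(workload([(10_000, 30), (10_000, 30)]))  # 600000 transistor-cycles
# Module-based sample: each module simulated for 15 independently chosen cycles.
print(workload([(10_000, 15), (10_000, 15)]))  # 300000 transistor-cycles
```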
III. ISSUES OF MONTE CARLO SIMULATION ON LARGE CIRCUITS
A. Oversampling in Monte Carlo Simulation
Monte Carlo (MC) simulation has its own preferred region of operation in terms of sampling efficiency. Here, sampling efficiency is strictly in terms of the number of samples required by Monte Carlo simulation, compared to the number obtained from analytical analysis. To investigate the sampling efficiency of MC simulation, we run a series of MC simulations with the population generated from a random number generator with normal distribution, a user-specified variance, and a mean value of one. Let $x$ be the random variable representing a random draw from this population. Furthermore, we denote this normal distribution as $N(1, \sigma^2)$. Before performing the MC simulation, since the population mean and variance are known in advance, we can compute the ideal number of required samples $N^*$ to achieve an error level $\epsilon$ and a confidence level $(1-\alpha)$ as follows [16]:
$$N^* = \left(\frac{z_{\alpha/2}}{\epsilon}\right)^2 \lambda \qquad (2)$$
where $z_{\alpha/2}$ is defined such that the area to its right under a standard normal distribution is equal to $\alpha/2$ and $\lambda = \sigma^2/\mu^2$ denotes the relative variance of $x$.
Next, on the same population (i.e., with the same population variance), we perform a Monte Carlo simulation for the same values of $\epsilon$ and $(1-\alpha)$. The simulation terminates when the number of samples $N$ satisfies the following convergence criterion:
$$\frac{t_{\alpha/2, N-1}\, s_N}{\bar{x}_N \sqrt{N}} \le \epsilon \qquad (3)$$
where $t_{\alpha/2, N-1}$ is defined such that the area to its right under the $t$-distribution with $N-1$ degrees of freedom is equal to $\alpha/2$. $\bar{x}_N$ and $s_N$ are computed as follows (the values of the $N$ samples are denoted as $x_1, \ldots, x_N$):
$$\bar{x}_N = \frac{1}{N}\sum_{i=1}^{N} x_i, \qquad s_N^2 = \frac{1}{N-1}\sum_{i=1}^{N} (x_i - \bar{x}_N)^2.$$
Since the value of $N$ changes from one Monte Carlo simulation to the next, a fair comparison to the analytical result is to compare the average value of $N$ over a large number of Monte Carlo runs against $N^*$. We denote the average number of samples over 100 000 Monte Carlo runs as $\overline{N}_{mc}$. Each variance setting gives us a pair of $N^*$ and $\overline{N}_{mc}$. By varying the variance, we can get a plot of $\overline{N}_{mc}$ versus $N^*$. In this experiment, there is no need to adjust the mean value, as a random variable with normal distribution $N(\mu, \sigma^2)$ can be linearly transformed into another variable with normal distribution $N(c\mu, c^2\sigma^2)$ using $x' = cx$. The multiplicative factor does not change the result in (2); a similar argument applies to the computation of (3), and hence to the $\overline{N}_{mc}$ value. By varying the variance of our random number generator from zero upward, we have virtually covered all normal distributions with positive mean values.
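This experiment can be sketched in a few lines; the following simplified version (far fewer runs than the paper's 100 000, and all names are ours) uses the standard normal and $t$ quantiles from SciPy:

```python
import numpy as np
from scipy import stats

def ideal_samples(rel_var, eps=0.05, alpha=0.01):
    """N* from (2): (z_{alpha/2} / eps)^2 * lambda, with lambda = sigma^2 / mu^2."""
    z = stats.norm.ppf(1 - alpha / 2)
    return (z / eps) ** 2 * rel_var

def mc_samples(sigma, eps=0.05, alpha=0.01, rng=None):
    """Samples drawn before the stopping criterion (3) is met (population mean 1)."""
    rng = rng or np.random.default_rng()
    xs = list(rng.normal(1.0, sigma, size=2))          # need >= 2 values for s_N
    while True:
        n = len(xs)
        t = stats.t.ppf(1 - alpha / 2, df=n - 1)
        xbar, s = np.mean(xs), np.std(xs, ddof=1)
        if t * s / (xbar * np.sqrt(n)) <= eps:
            return n
        xs.append(rng.normal(1.0, sigma))

sigma = 0.1                                            # mean 1, so lambda = sigma^2
n_star = ideal_samples(sigma ** 2)
n_mc = np.mean([mc_samples(sigma) for _ in range(1000)])
print(n_star, n_mc, n_star / n_mc)                     # last ratio: efficiency coefficient
```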
The relation between $\overline{N}_{mc}$ and $N^*$ depends on the confidence level. While we cannot prove that the relation is independent of the error level, we have found that the plots for 5% and 1% error levels are almost identical. In Fig. 2, we plot $\overline{N}_{mc}$ (solid line) versus $N^*$, with $N^*$ ranging from one to 50, under the parameters of 0.99 confidence level and 5% error level. In this figure, we also plot the line $\overline{N}_{mc} = N^*$ for comparison purposes.
In Fig. 2, we can use this line as a base to define the efficiency coefficient $\eta = N^*/\overline{N}_{mc}$. The mc-efficient region depends on the user-specified threshold on the efficiency coefficient. In Fig. 2, we plot $\eta$ versus $N^*$ as the heavy dashed line. The smaller the $N^*$, the lower the $\eta$. For instance, at the low end of the plotted range, $\eta$ drops to roughly 29%, which translates into oversampling by a factor of 3.4. On the other hand, at the high end, $\eta$ is about 80%, a mere 20% loss in efficiency. In a very large circuit (e.g., a circuit with 100 K gates), the power simulation of a single clock cycle needs significant computation time. Therefore, although $N^*$ is very small in the mc-inefficient region, the impact of an inefficient estimation technique is still enormous.
While we have demonstrated that Monte Carlo simulation can oversample, the question remains: will we run into the mc-inefficient region when using a clock-based Monte Carlo technique for power estimation on a large circuit? In other words, what $N^*$ value do we expect to encounter during this type of power estimation? It is difficult to give a quantitative answer. Our approach is to derive the relation between the relative variance at the module level and that at the top circuit level. Based on results reported at the module level [7], [9], we can then infer the possible range of $N^*$ values at the top circuit level.
B. Relative Variance Analysis
We first look at the case where there is no correlation between $X_i$ and $X_j$, where $i \ne j$. The following theorem gives a bound on the relative variance $\lambda_X$ of $X$; $\lambda_{MAX}$ denotes the maximum relative variance of a module in the circuit.

Theorem 3.1: If the $X_i$'s are uncorrelated, then
$$\lambda_X = \frac{\sum_{i=1}^{M}\sigma_i^2}{\left(\sum_{i=1}^{M}\mu_i\right)^2} \le \lambda_{MAX}\,\frac{\sum_{i=1}^{M}\mu_i^2}{\left(\sum_{i=1}^{M}\mu_i\right)^2} \le \lambda_{MAX}.$$
In particular, for $M$ modules with comparable means, $\lambda_X$ is on the order of $\lambda_{MAX}/M$. In the general case,
$$\mathrm{Var}(X) = \sum_{i=1}^{M}\sigma_i^2 + 2\sum_{i<j}\mathrm{Cov}(X_i, X_j)$$
so positive covariances raise $\lambda_X$ and negative covariances lower it.
From the above analysis, we observe the following.
1) When the covariances are positive, $\lambda_X$ can be comparable to $\lambda_{MAX}$. 2) When the covariances are negative, $\lambda_X$ is even smaller than $\lambda_{MAX}$. 3) For some positive and some negative covariances between different modules, one would expect that $\lambda_X$ will still be much smaller than $\lambda_{MAX}$.
The results reported in [7] and [9] indicate that about ten samples are required to achieve a 0.99 confidence level and 5% error level for the ISCAS85 benchmarks using a sample size of 30 clock cycles. If the circuit under estimation consists of ten modules drawn from those benchmarks and there are no correlations between any two modules, the number of required samples can in theory be as low as one with a sample size of 30 clock cycles. Therefore, Monte Carlo simulation is very likely to oversample if we adopt the clock-based approach at the top circuit level.
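A quick numerical illustration of this inference, with toy per-module statistics of our own choosing and the no-correlation assumption:

```python
# Toy illustration: M identical, uncorrelated modules (numbers are ours).
M = 10
mu_i, var_i = 1.0, 0.36                    # per-module mean and variance

lam_module = var_i / mu_i**2               # module-level relative variance (0.36)
lam_circuit = (M * var_i) / (M * mu_i)**2  # = lam_module / M under no correlation

print(lam_module, lam_circuit)             # 0.36 vs 0.036
# Since N* in (2) is proportional to the relative variance, the required
# number of samples at the top level drops by a factor of M, i.e., ten here.
```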
C. How to Reduce Oversampling
The disadvantage of adopting clock-based approaches at the circuit level is that there is no good method to ensure normality of samples other than taking the average power of $n$ randomly selected clock cycles as a sample. Empirically, the smallest $n$ that satisfies near-normality is 30 [16]. Since the relative variance of the clock-by-clock circuit power can already be very low, this method of ensuring near-normality of samples will force Monte Carlo simulation into the mc-inefficient region.
In the next two sections, we will propose techniques to address this problem. To give an intuitive rationale for our approach, we need to revisit (2) .
From (2), the total number of required simulated clock cycles is given by the number of required samples times the number of clock cycles $n$ simulated in a sample, and this product is independent of $n$ if 1) a sample of $n$ cycles itself already satisfies normality and 2) we allow a noninteger number of samples. With this result and the efficiency characteristic of Monte Carlo simulation, we should use a small workload per sample (equivalently, a small $n$) whenever possible. We achieve this goal by sampling in both the temporal and spatial domains of the circuit.
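For completeness, the one-step derivation behind this independence claim, assuming a sample is the mean of $n$ independently drawn clock cycles (so its relative variance is $\lambda_X/n$):

```latex
\[
  N^* \cdot n
  \;=\; \left(\frac{z_{\alpha/2}}{\epsilon}\right)^{\!2} \frac{\lambda_X}{n} \cdot n
  \;=\; \left(\frac{z_{\alpha/2}}{\epsilon}\right)^{\!2} \lambda_X ,
\]
% which contains no n: the total number of simulated cycles is independent of
% the sample granularity, provided the two conditions above hold.
```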
IV. MODULE-BASED POWER ESTIMATION
The module-based power simulation can be explained in terms of the power log matrix as follows. Instead of randomly selecting a row sum as a unit to be included in a sample, an entry is randomly selected from each column, independent of the selections made in the other columns, as shown in Fig. 3. Let $Y_i$ denote a random variable representing the power dissipation value of module $i$ in the module-based approach. The new random variable representing the total circuit power estimated in this approach is denoted by $Y$. We can write
$$Y = \sum_{i=1}^{M} Y_i. \qquad (10)$$
We use $Y$ and $Y_i$ to accentuate the fact that the $Y_i$'s are now independent of each other, whereas the $X_i$'s in (1) may be correlated.
Theorem 4.1: Let $X$ and $Y$ be defined as in (1) and (10), respectively; then $E[Y] = E[X]$.

Proof: For $X$, each row of the power log matrix is a unit in the population, and all units in the population are equally likely to be selected. Therefore, $E[X] = \frac{1}{N}\sum_{j=1}^{N}\sum_{i=1}^{M} p_{ij}$. On the other hand, for $Y_i$, each entry $p_{ij}$ of column $i$, for $j = 1, \ldots, N$, is equally likely to be selected. Therefore, $E[Y_i] = \frac{1}{N}\sum_{j=1}^{N} p_{ij}$. Consequently, $E[Y] = \sum_{i=1}^{M} E[Y_i] = E[X]$.

Since the $Y_i$'s are now independent, the module-based technique will have better normality when compared with clock-based techniques using the same workload. This is because the empirical requirement for applying the central limit theorem to ensure near-normality of samples is that a minimum number of independent random variables must be included in a sample and that some "general condition" is met. The general condition is informally stated as follows: "no single random variable makes a large contribution to the sum" [16]. One should not interpret this statement as being about the mean values of those random variables. Instead, it should be interpreted as a statement about the "fluctuation" of each random variable around its mean (i.e., its variance).
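Theorem 4.1 is easy to check numerically on a random power log matrix; the following toy simulation (our own construction) draws both kinds of units and compares their means:

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 1000, 5                            # cycles x modules power log matrix
P = rng.gamma(2.0, 1.0, size=(N, M))      # toy nonnegative per-cycle module powers

# Clock-based unit X: the sum of a randomly selected row.
x_draws = P[rng.integers(0, N, size=200_000)].sum(axis=1)

# Module-based unit Y: one entry drawn independently from each column.
y_draws = sum(P[rng.integers(0, N, size=200_000), i] for i in range(M))

print(P.sum(axis=1).mean(), x_draws.mean(), y_draws.mean())   # all three agree
```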
For clock-based techniques such as [7], each randomly selected clock cycle is identically distributed, and the number of independent random variables included in a sample is the number of simulated cycles $n$. In contrast, in the module-based approach, the power dissipation of each module in a clock cycle counts as a random variable, and these random variables may have very different variances. For now, assume that each module has a similar variance. If we have $M$ modules, we only need to simulate each module for $30/M$ clock cycles to satisfy the requirement of 30 independent random variables. That means the simulation load can be reduced by a factor of $M$. However, this is the ideal situation. A difficult case is when one module dominates the sample variance; in this case, no reduction of workload is achieved. To alleviate this problem, we adopt stratified random sampling, as outlined next. The goal is to allocate more randomly drawn units to the modules with higher variance.
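Before stratification is added, a minimal sketch of forming one module-based power sample in the ideal equal-variance case; the helper name and the rounding-up choice are ours:

```python
import numpy as np

def module_based_sample(P, n_rvs=30, rng=None):
    """One power sample from the N x M power log matrix P: each module is
    simulated for ceil(n_rvs / M) independently drawn cycles, so the sample
    contains at least n_rvs independent random variables."""
    rng = rng or np.random.default_rng()
    N, M = P.shape
    per_module = -(-n_rvs // M)           # ceil division
    # Sum over modules of each module's mean over its own random cycles.
    return sum(P[rng.integers(0, N, size=per_module), i].mean() for i in range(M))

# Workload: M * ceil(30 / M) module-cycles, versus 30 * M module-cycles for a
# clock-based sample of 30 full-circuit cycles: roughly a factor-of-M reduction.
```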
In stratified sampling, the population is divided into a set of disjoint groups (i.e., strata) such that units in the same stratum have similar power values. Next, a predetermined and fixed number of units is randomly selected and simulated from each stratum and collected to form a sample. In our approach, the same number of units is drawn from each stratum. We use a predictor function to stratify the population. Several predictor functions have been proposed [9], [17], ranging from the zero-delay power estimate (the most computationally expensive) to the average switching activities on the module inputs, outputs, and internal sequential elements, weighted by the estimated capacitance (or transistor count) of each module. Since the predictor function is not directly used in computing the power consumption, reasonable deviation from absolute accuracy is acceptable. One should note that a more computationally expensive power simulator can afford a more elaborate predictor function.
When allocating the number of strata to each module, since we cannot predict in advance how effective stratification will be, we completely ignore the future impact of stratification. We adopt the minimum variance allocation scheme proposed in [18]. This scheme was originally proposed as a solution to the sample size allocation problem: the number of randomly drawn units for the $h$th stratum is allocated proportional to $W_h \sigma_h$, where $W_h$ and $\sigma_h^2$ are the stratum weight (stratum size divided by population size) and the variance of the $h$th stratum, respectively. In our application at this stage, each module is treated as a stratum, and therefore the weights are all the same. The module variances are estimated using predictor values. In the case where a single module contributes significantly higher variance, our scheme ensures that more strata, and hence more drawn units, will be allocated to that module. After the number of strata is determined for each module, the units in each module are first sorted based on the predictor values and put into equal-size bins, with each bin representing a stratum.
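The allocation and binning steps can be sketched as follows, assuming per-unit predictor values are already available for every module (function names are ours, and rounding may make the stratum total deviate slightly from the target):

```python
import numpy as np

def allocate_strata(module_predictors, total_strata):
    """Strata per module, proportional to the predictor-based standard
    deviation (equal stratum weights, as in the scheme described above)."""
    sigmas = np.array([np.std(p) for p in module_predictors])
    shares = sigmas / sigmas.sum()
    return np.maximum(1, np.round(shares * total_strata).astype(int))

def stratify(predictor_values, n_strata):
    """Sort a module's units by predictor value and cut them into
    equal-size bins; each bin of unit indices is one stratum."""
    order = np.argsort(predictor_values)
    return np.array_split(order, n_strata)
```

One unit is then drawn at random from each bin, exactly as in standard stratified sampling.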
To summarize, the module-based technique is more efficient than clock-based techniques, as follows.
1) When power values of modules are strongly positively correlated, $\lambda_X$ will be high, and there may not be significant oversampling in clock-based techniques. By sampling each module independently, the correlations between modules are completely eliminated in the module-based technique, and consequently, smaller sample variances are achieved. 2) When power values of modules are strongly negatively correlated or only weakly correlated, the required number of samples in clock-based techniques becomes so low that oversampling becomes an issue. By using smaller sample granularities in the module-based technique, the amount of oversampling is reduced.

The main shortcoming of the stratified sampling is that the objective of reducing the workload conflicts with that of stratification. Using a higher number of strata gives better efficiency improvement; however, at least one clock cycle must be simulated for each stratum, so the minimum workload per sample is raised. This is fine for midsize circuits, but for large-scale circuits we could still enter the mc-inefficient region.
Our solution to this problem is to "lump" all modules of similar types into a super module and then perform stratification on each super module as described in the next section.
V. CLUSTER-BASED POWER ESTIMATION
In this section, we propose a technique that significantly reduces the workload of a power sample while maintaining the normality.
A. Homogeneous Circuits
To give a rationale for this approach, consider for now a fairly homogeneous circuit A (for instance, a circuit consisting of only adders). Almost all practical circuits are heterogeneous, and they will be examined later in this section. There are $N$ clock cycles and $M$ modules in circuit A. The average power of this circuit is calculated as
$$P_A = \frac{1}{N}\sum_{j=1}^{N}\sum_{i=1}^{M} p_{ij}. \qquad (11)$$
Next, consider circuit B, which consists of a single adder but has $MN$ clock cycles. Furthermore, the power log matrix of circuit B is obtained by concatenating all columns of the power log matrix of circuit A into a single column, as shown in Fig. 4(a). The average power of this circuit is
$$P_B = \frac{1}{MN}\sum_{j=1}^{N}\sum_{i=1}^{M} p_{ij} = \frac{P_A}{M}. \qquad (12)$$
Obviously, (12) is simply (11) scaled down by $M$. Therefore, all the entries in the single-column matrix should be multiplied by $M$ if we want to estimate the average power of circuit A by sampling on circuit B.
The overall strategy can be called a mix-and-stratify strategy, as explained next. Consider a homogeneous circuit such as circuit A above. First, the predictor function is computed for all clock cycles and all modules, and the units of all modules in the circuit are mixed together. The mixture is stratified into equal-size strata. One unit is randomly sampled from each stratum. We can determine to which module the unit originally belonged and can perform gate/transistor-level simulation on that module under the corresponding vector pair. The observed power from the simulation is multiplied by $M$ and treated as a unit drawn in standard stratified sampling. When calculating the sample value, we can exactly follow the steps of standard stratified sampling. This approach can be easily extended to nonequal-size strata and/or nonequal sample sizes.
The advantage of this approach is that the minimum workload in a sample needed to maintain normality of samples can be less than that of simulating the entire circuit for one clock cycle. To give an example, consider a circuit with 60 adder modules. These 60 modules are lumped into one cluster (super module). Let the number of strata be 30. One unit is randomly drawn from each stratum, and the drawn units collectively become a sample. The workload of this sample is the same as simulating only one-half of the circuit for one clock cycle.
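A minimal sketch of the mix-and-stratify step for one homogeneous cluster; the scaling by $M$ implements the correction derived from (11) and (12), and all names are ours:

```python
import numpy as np

def cluster_sample(P, predictor, n_strata=30, rng=None):
    """One power sample for a cluster of M similar modules: mix all
    (cycle, module) units, stratify by predictor value, draw one unit
    per stratum, and scale each draw by M."""
    rng = rng or np.random.default_rng()
    N, M = P.shape
    units = P.flatten(order="F")          # concatenate columns: single-column matrix
    pred = predictor.flatten(order="F")   # predictor value of every unit
    strata = np.array_split(np.argsort(pred), n_strata)
    draws = [M * units[rng.choice(s)] for s in strata]
    return np.mean(draws)                 # estimates the cluster's per-cycle power

# Workload per sample: n_strata module-cycles, e.g., 30 module-cycles for the
# 60-adder example above -- half the cost of one full-circuit clock cycle.
```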
Next, we will address the issues of heterogeneous circuits.
B. Heterogeneous Circuits
By heterogeneous circuits, we mean circuits that contain more than one type of module. The classification scheme is based on the glitch activity in each module. For instance, multipliers should not be classified as the same type as adders, as the former have excessive glitch activity. On the other hand, random logic can be classified as the same type as adders. All modules of the same type are put into one cluster. As for the power log matrix, all the columns corresponding to modules in the same cluster are concatenated into a single-column matrix. Let the number of clusters be $C$ and the number of modules in the $k$th cluster be $M_k$, where $k = 1, \ldots, C$. After the column concatenation, all the entries in the $k$th single-column matrix are multiplied by $M_k$, as shown in the previous section. The estimation is then reduced to that of estimating the power of a new circuit consisting of $C$ modules, with each module representing a single-column matrix, as shown in Fig. 4(b). The simulation strategy is similar to that described in Section V-A.
C. Stratified Sampling
In the module-based technique, stratified sampling is used to improve efficiency. In cluster-based sampling, however, stratified sampling is mandatory. This is because merging modules into a cluster can act as a reverse process of stratification when each original module is already very homogeneous and yet the modules have very different mean values. In other words, the clustering process alone can produce samples with multimodal distributions. This issue, however, is easily prevented by stratifying after clustering.
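A tiny demonstration of the multimodality hazard, with toy numbers of our own: two internally homogeneous module populations with very different means produce a clearly bimodal mixture once merged.

```python
import numpy as np

rng = np.random.default_rng(2)
adder_like = rng.normal(1.0, 0.05, 10_000)   # homogeneous, low-mean units
mult_like = rng.normal(8.0, 0.05, 10_000)    # homogeneous, high-mean units

mixture = np.concatenate([adder_like, mult_like])   # merged cluster population
hist, _ = np.histogram(mixture, bins=20)
print(hist)   # two isolated spikes: the mixture is bimodal, far from normal
# Stratifying by a predictor separates the two modes into different strata,
# so stratified sample means are again near-normal.
```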
VI. PRACTICAL CONSIDERATIONS
A. Storage Overhead
One may be concerned about the high overhead associated with storing the vector traces of sequential elements at each clock cycle. This overhead can be significantly reduced by using a two-stage sampling technique [9]. In the first stage, simple random sampling is performed during functional simulation to select a subpopulation of clock cycles. The subpopulation size is determined before functional simulation. When the functional simulation starts, the randomly selected clock cycles are marked. As the functional simulator reaches each marked cycle, the vector pairs of all sequential elements at that cycle are sampled and stored. The collection of all sampled vector pairs becomes the population for the second-stage sampling, where the techniques proposed in this paper are applied.
A higher subpopulation size increases stratification overhead, whereas a lower subpopulation size increases the inaccuracy of estimates. Reference [9] gives a detailed analysis of this issue and concludes that once the subpopulation size reaches a certain value, further increases yield negligible improvement in accuracy. Our rule of thumb is that the subpopulation size should be at least 25 times the largest average required number of simulated clock cycles used in the second-stage sampling techniques. The rationale behind the rule is that the error of a two-stage sampling technique is the root mean square of the errors from both stages; our rule ensures that the error from the first stage is at least five times smaller than that from the second stage.
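The first stage is a plain simple random sample over cycle indices; a sketch follows (the storage hook is hypothetical, standing in for whatever interface the functional simulator provides):

```python
import numpy as np

def pick_marked_cycles(total_cycles, subpop_size, rng=None):
    """Stage 1: simple random sample (without replacement) of the clock
    cycles whose sequential-element vector pairs will be recorded."""
    rng = rng or np.random.default_rng()
    return set(rng.choice(total_cycles, size=subpop_size, replace=False).tolist())

marked = pick_marked_cycles(100_000, 1_000)   # sizes used in Section VII

# During functional simulation (hypothetical hook):
#   if cycle in marked:
#       store(vector_pairs_of_sequential_elements(cycle))
# The stored pairs form the population for the second-stage sampling.
```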
B. Simulation Partitioning
The proposed techniques require the circuit to be partitioned into a number of modules such that each module can be simulated independently. When refined delay models are used in simulation, the ideal partitioning is such that input signals of modules are always provided either by primary inputs or by sequential elements at module outputs. Issues such as data chaining and resource sharing, which are mainly associated with datapath subcircuits, should be addressed appropriately to facilitate simulation.
One should note, however, that the simulation partitioning does not need to follow design partitioning, which is often hierarchical, because our objective is to estimate the total power of the entire circuit, instead of those of each individual module. In other words, one can use the top-level design partitioning as the initial one and, if necessary, refine it so that sequential elements are positioned as close to module outputs as possible. After this step, if some sequential elements are not exactly at module outputs, we use the technique of logic duplication, as described in Section II. The amount of logic duplication should be kept small to minimize the simulation overhead.
VII. EXPERIMENTAL RESULTS
We compare the sampling efficiency of four techniques: clock-based simple random sampling (SRS), clock-based stratified random sampling (STS), module-based stratified sampling (MODU), and cluster-based stratified sampling (CLUS).
The circuits are selected from popular high-level synthesis benchmarks: a Chebyshev filter (CHEB), a differential equation solver (DS), an infinite impulse response filter (IIR), and a discrete cosine transform circuit (DCT). The number of modules of each type in each circuit is summarized in Table I. All modules are 16 bits wide. The first input sequence is obtained from a music CD and will be referred to as the music stream. This stream contains strong low-frequency components (due to the drums in the music) and hence is highly correlated. The second input stream is generated randomly with 0.5 signal and transition probabilities. Both streams are 100 000 clock cycles long. For stratification (STS, MODU, CLUS), we use a bit-parallel algorithm to compute the zero-delay power estimate as the predictor. The target simulator is a general-delay logic simulator. The power value of each cell is obtained from a lookup table indexed by output loading and input signal slew. The ratio of the computational speed of the predictor function to that of the target simulator is about 60 to 1. The total circuit power of each circuit is first obtained by simulating the circuit over the entire sequence.
We set the error and confidence levels to 5% and 0.99, respectively, and perform 100 000 simulation runs for each sampling method. For stratified sampling, in order to reduce the overhead of stratification (including predictor calculation and sorting), we use the two-stage sampling method described in Section VI. The subpopulation size is set to 1000; that is, a different subpopulation of size 1000 is first randomly selected in each run, and stratification is performed on the subpopulation only. The sample size is set such that there are at least 30 independent random variables in a sample. Again, we have assumed here that 30 independent random variables are enough to ensure near-normality. For SRS and STS, the sample size is simply the number of clock cycles simulated in each sample. In MODU, the number of strata is the same for all modules; the sample size is equal to the number of strata in each module, and one unit is drawn from each stratum to form a sample. For CLUS, we use two clusters: all adders and subtracters are put into one cluster, and all multipliers into the other. Each cluster is stratified into the same number of equal-size strata as the sample size, and one unit is sampled from each stratum.
The results for the music stream and the random stream are summarized in Tables II and III, respectively. These tables show that STS does not improve much over SRS in the Monte Carlo simulation results (with the exception of DS), although the variance reduction (v.r.) indicates that the improvement should have been a factor of two or so. In other words, the advantages of variance reduction techniques such as stratification may not be fully realized in Monte Carlo simulation if $N^*$ is very low. Between STS and MODU, if both had used the same workload per sample, their $\overline{N}_{mc}$ values would be very close (with the exception of DS). For instance, on CHEB with the music stream, if STS had used a sample size of six, its $\overline{N}_{mc}$ would be comparable to the 5.17 of MODU. However, due to concerns about normality, one cannot simply use a smaller sample size in SRS and STS without some analysis of the circuit power. To demonstrate that our proposed techniques achieve better normality using the same workload per sample, Fig. 5 shows the normality plots of all four techniques for DCT (music stream) under a workload of two clock cycles. Both MODU and CLUS are about a factor of two better than SRS and STS.
The overhead of stratification is approximately linear in the subpopulation size. We varied the subpopulation size from 500 to 4000 and observed no obvious change in $\overline{N}_{mc}$ or v.r. The subpopulation size chosen in this experiment is close to 25 times the average number of required simulated clock cycles of both MODU and CLUS. Again, we should emphasize that the choice of predictor function should depend on the execution speed of the target simulator, so as to minimize the significance of the stratification overhead.
MODU and CLUS are comparable except on the larger benchmarks DS and DCT, where CLUS is approximately 15-37% better than MODU on average. This follows our expectation that CLUS is advantageous when the number of sizable modules is large.
In this experiment, the run-time improvement of MODU and CLUS over SRS and STS is about a factor of 2-3, including the overhead of stratification.
VIII. CONCLUSION
In this paper, we propose efficient statistical sampling techniques for hierarchical circuit power estimation. We first show that Monte Carlo based statistical power estimation techniques oversample when the relative variances of samples are small. Applying existing clock-based sampling techniques to estimate total circuit power exacerbates the oversampling issue, as the workload per sample is too high. To address this issue, we need to reduce the workload per sample while maintaining the normality of samples. We propose a module-based and a cluster-based Monte Carlo simulation technique to achieve this goal and demonstrate that the proposed techniques provide a factor of 2-3 reduction in simulation run time compared to existing Monte Carlo simulation techniques.
Chih-Shun Ding received the B.S. degree in electrical engineering from National Taiwan University, Taiwan, in 1985, and the M.S. and Ph.D. degrees in electrical engineering from the University of Southern California, Los Angeles, in 1989 and 1998, respectively.
He is currently a senior staff engineer at Conexant Systems, Newport Beach, CA. His research interests include timing analysis and CAD for low-power design.
Cheng-Ta Hsieh received the B.S. degree in electrical engineering from the National Taiwan University, Taiwan, in 1990, and the M.S. and Ph.D. degrees in electrical engineering from the University of Southern California, Los Angeles, in 1993 and 1999, respectively.
He is a senior software engineer at Verplex Systems. His research interests include formal verification, post-layout optimization, power estimation for VLSI, and low-power design methodologies.
Massoud Pedram received the B.S. degree in electrical engineering from the California Institute of Technology, Pasadena, in 1986, and the Ph.D. degree in electrical engineering and computer sciences from the University of California, Berkeley in 1991.
He is an Associate Professor of Electrical Engineering Systems at the University of Southern California, Los Angeles. He was a recipient of the National Science Foundation (NSF) Young Investigator Award in 1994 and the Presidential Faculty Fellows (PECASE) Award in 1996. His current work focuses on developing computer-aided design methodologies and techniques for low-power design and on coupling physical design to logic synthesis.
Dr. Pedram has served on the technical program committees of a number of conferences and workshops, including ASP-DAC, DATE, and ICCAD. He also served as the Technical Co-chair and General Co-chair of the International Symposium on Low-Power Electronics and Design in 1996 and 1997, respectively. He is a member of the IEEE Circuits and Systems Society and ACM-SIGDA, and an associate editor of ACM TODAES and the IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems.
