Abstract: In this paper, we propose a robust register-transfer level (RTL) power modeling methodology for functional units. Our models are consistently accurate over a wide range of input statistics, are automatically constructed, and can provide pattern-by-pattern power estimates. An additional desirable feature of our modeling methodology is the capability of accounting for the impact of technology variations, library changes and synthesis tools. Our methodology is based on the concept of node sampling, as opposed to more traditional approaches based on input sampling.
I. INTRODUCTION
LARGE digital circuits are nowadays often described and designed at the register-transfer level (RTL), possibly reusing general-purpose macros independently developed by third-party intellectual property (IP) providers. One of the key issues for the diffusion of an effective reuse paradigm is the capability of providing information on critical cost metrics. In power-constrained design, RTL power models are required to predict the power dissipation of each macro instance. The fundamental requirements for practical RTL power models are the following. First, power estimation using RTL models should be much more efficient than gate-level (transistor-level) power estimation. Second, the models should not require disclosure of the details of the internal structure of the IP component. Third, the models should be general and robust: it should be possible for the IP provider to generate them with an automated procedure, and for the IP user to employ them with a wide variety of input pattern distributions without compromising accuracy. Additional desirable features are weak technology dependence (i.e., the property of rapidly obtaining a new set of models for an entire library of components when the technology library used to map them is changed) and tunable accuracy (i.e., the capability of improving the accuracy of the power estimation at the RTL by increasing model complexity).
We propose an approach to RTL power modeling that satisfies all fundamental requirements and provides to some degree the desirable features outlined above. Our approach is based on the key observation that power dissipation is strongly dependent on the internal structure of a unit, and that partial knowledge of the internal structure is sufficient to provide accurate power estimations.
II. RTL POWER MODELING
Several approaches to RTL power modeling have been proposed in the recent past addressing the fundamental requirements listed in the introduction [1]-[6]. The most challenging fundamental requirement is robustness. To increase robustness, previous approaches rely on some form of pattern-based characterization. During model construction, a gate-level (transistor-level) description of the component is simulated with a set of input patterns (i.e., a pattern sample) to collect power information. Model parameters are then tuned to fit the results of low-level simulation.
We argue that the characterization process is intrinsically weak when power estimation is the focus. Power dissipation is in general a highly nonsmooth function of the input patterns. The knowledge of the power consumption of a unit for a given input transition provides little information about the power consumption of the same unit for different transitions. Characterization-based power models are highly accurate only if evaluated in the same operating conditions used for characterization.
This typical situation is illustrated in Fig. 1, where the relative error on power estimates is plotted as a function of the average input activity. The two dashed curves refer to the constant model [1] and to the linear regression model [3], [6] for MCNC91 [15] benchmark cm85. Both models were characterized for random input patterns with 0.5 average transition probability. Concurrent gate-level and RTL simulations were repeatedly performed to evaluate the accuracy of the models. Each point in Fig. 1 represents the result obtained for an input sequence of 10 000 vectors with the average transition probability reported on the abscissa. The relative error is around 2% for input activities similar to those used for characterization, but it rapidly increases as the statistics of the input patterns change, becoming much larger than 100% for low input activities.
To summarize both robustness and accuracy in a single figure, we define the root-mean-square of the relative errors (rmsRE) provided by a power model under a wide range of operating conditions. The rmsRE is defined as
$$\mathrm{rmsRE} = \sqrt{\frac{1}{M} \sum_{k=1}^{M} \varepsilon_k^2}$$
where $\varepsilon_k$ is the relative error under the $k$th input stream and $M$ is the number of input streams used. For the constant (linear) model of Fig. 1 the rmsRE is 1600% (281%).
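As an illustration of the metric, the following minimal Python sketch computes the rmsRE from a list of RTL estimates and the corresponding gate-level reference values, one pair per input stream. The function names and the numbers are hypothetical and serve only to show the arithmetic.

```python
import math

def relative_error(estimated, reference):
    """Relative error of one RTL power estimate against the gate-level value."""
    return (estimated - reference) / reference

def rms_relative_error(estimates, references):
    """Root-mean-square of the relative errors over all input streams."""
    errors = [relative_error(e, r) for e, r in zip(estimates, references)]
    return math.sqrt(sum(err * err for err in errors) / len(errors))

# Hypothetical data: one pair (RTL estimate, gate-level power) per input stream.
rtl_estimates = [10.0, 10.0, 4.0]
gate_level = [10.0, 5.0, 1.0]
print(rms_relative_error(rtl_estimates, gate_level))
```

Because the errors are squared before averaging, a single input stream with a very large relative error dominates the result, which is why the metric penalizes models that are accurate only near their characterization conditions.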
To improve model robustness (i.e., to obtain lower rmsRE), we start from the observation that the overall power consumption of a digital unit is the sum of the power consumption of its internal components. Hence, we adopt a new viewpoint: we do not sample on input patterns, but on internal nodes. Our approach is not mutually exclusive with pattern-based characterization; hybrid methodologies can be defined and will be analyzed.
III. A NEW APPROACH
Consider a combinational CMOS unit $U$ with $n$ inputs for which an input-output behavioral model and the gate-level implementation are provided. We denote by $E(x^i, x^f)$ the supply energy drawn by the gate-level implementation due to an input transition from pattern $x^i$ to pattern $x^f$.
The task of modeling power consumption at the RT level consists of finding a simple but accurate model for $E(x^i, x^f)$ (the corresponding power consumption being $P = E/T$, where $T$ is the clock period) [footnote 1]. If we consider the internal structure of the unit, $E$ can be expressed as
$$E(x^i, x^f) = \sum_{j=1}^{N} e_j(x^i, x^f)$$
where $e_j$ is the energy drawn by gate $j$ as a function of the input pattern transition, and $N$ is the number of gates in the unit. Consider now the energy $E_S$ drawn by a subset $S$ of the gates in $U$. Clearly, $E_S \le E$, and $E_S \to E$ as $S \to U$. For a given $S$, we call $K_S$ the ratio between $E$ and $E_S$. Node sampling is based on the experimental observation that the functional dependency of $K_S$ on $x^i$ and $x^f$ is very weak. In symbols, $K_S(x^i, x^f) \approx K_S$ even if $|S| \ll N$. Moreover, we observed that a good estimate for the value of $K_S$ is provided by the ratio of the total load capacitance $C_U$ of all gates in $U$ and the total load capacitance $C_S$ of gates in $S$. The main consequence of this fact is that it is possible to accurately predict the power consumption of a large unit by observing only a small number of internal nodes:
$$E(x^i, x^f) \approx \frac{C_U}{C_S}\, E_S(x^i, x^f) \qquad (1)$$
This is the rationale behind node sampling. The validity of (1) for different values of $|S|$ is shown in Fig. 2 for MCNC91 [15] benchmark cm85. The plot reports the rmsRE as a function of the sampling factor, that is, the percentage of the unit's nodes in the sample: $|S|/N$. The rmsRE decreases rapidly with the increase of the sampling factor. This result is typical of a large set of benchmark circuits, as discussed in Section V.
Footnote 1: ... the constant $V_{dd}^2$ disappears from the computations.
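The scaling step of node sampling can be sketched in a few lines of Python. The sketch assumes that per-gate energies for one input transition and per-gate load capacitances are available as plain lists; the function name, the data layout and the numbers are illustrative assumptions, not part of the methodology itself.

```python
def node_sampling_estimate(gate_energy, gate_cap, sample):
    """Estimate the total energy of a unit from the gates listed in `sample`.

    gate_energy[j]: energy drawn by gate j for one input transition
    gate_cap[j]:    load capacitance of gate j
    sample:         indices of the sampled gates
    The sample energy is scaled by the capacitance ratio C_U / C_S.
    """
    c_total = sum(gate_cap)
    c_sample = sum(gate_cap[j] for j in sample)
    e_sample = sum(gate_energy[j] for j in sample)
    return (c_total / c_sample) * e_sample

# Hypothetical six-gate unit: each gate draws energy equal to its load
# capacitance when its output toggles, and nothing otherwise.
gate_cap = [1.0, 2.0, 0.5, 1.5, 3.0, 2.0]
toggles = [1, 0, 1, 1, 0, 1]
gate_energy = [c * t for c, t in zip(gate_cap, toggles)]

print(sum(gate_energy))                                        # full gate-level energy
print(node_sampling_estimate(gate_energy, gate_cap, [0, 3, 4]))  # estimate from 3 gates
```

With a sample this small the estimate for a single transition can be visibly off; the claim in the text concerns estimates accumulated over long input streams and realistic unit sizes.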
As shown in Section II, characterization-based approaches have rmsRE much larger than 100% since their accuracy strongly depends on input statistics. In contrast, node sampling provides power estimates whose accuracy does not depend on the input statistics. For circuit cm85, this is shown by the two solid lines of Fig. 1, representing the relative errors obtained with sampling factors of 10% and 20%.
The previous analysis indicates that a small node sample provides sufficient information to estimate the power dissipation of a unit under a wide variety of input statistics. The modeling task thus reduces to that of finding an RTL model for the energy drawn by gates in the sample, $E_S$.
We express $E_S$ as the sum of two contributions: zero-delay dynamic energy, $E_S^{0}$, and second-order delay-sensitive phenomena such as short-circuit currents and glitches, $E_S^{\Delta}$:
$$E_S = E_S^{0} + E_S^{\Delta} \qquad (2)$$
The distinction between $E_S^{0}$ and $E_S^{\Delta}$ leads to a useful partitioning of the modeling task. We first model $E_S^{0}$, which usually represents the largest contribution to $E_S$. Highly accurate, pattern-dependent RTL models can be analytically constructed for $E_S^{0}$ based only on structural and functional information. No simulation-based characterization is required. Once a characterization-free RTL model has been constructed for $E_S^{0}$, we use characterization to model $E_S^{\Delta}$ as the difference between the simulated values of $E_S$ and the estimated values of $E_S^{0}$.
In practice, when modeling $E_S^{0}$ we target robustness, whereas when modeling $E_S^{\Delta}$ we target absolute accuracy. We propose a three-step modeling procedure consisting of: i) node sampling, ii) analytical modeling of $E_S^{0}$, and iii) simulation-based characterization of $E_S^{\Delta}$.
A. Node Sampling
We have implemented several criteria to automatically select a significant subset of internal nodes. They range from uniform random sampling to nonuniform or deterministic selection based either on topological information (such as the distance from primary inputs) or on load capacitance values. Interestingly, the best results were obtained by uniform random sampling. This is because uniform random sampling is an unbiased procedure that provides unbiased power estimators. Advanced sampling criteria, in contrast, are based on arbitrary assumptions about the relation between the power consumption at a given node and the power consumption of the entire unit.
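Uniform random sampling of internal nodes is straightforward to sketch. The Python fragment below assumes that gates are identified by integer indices and that the sampling factor is given as a fraction; the function name, the seeding policy and the rounding rule are assumptions made for illustration.

```python
import random

def uniform_node_sample(num_gates, sampling_factor, seed=None):
    """Pick a uniform random subset of gate indices, without replacement.

    sampling_factor is the fraction of gates to keep (e.g., 0.1 for 10%).
    At least one gate is always selected (an assumption of this sketch).
    """
    rng = random.Random(seed)
    k = max(1, round(sampling_factor * num_gates))
    return sorted(rng.sample(range(num_gates), k))

# Example: a 10% sample of a hypothetical 200-gate unit, reproducible via the seed.
sample = uniform_node_sample(200, 0.1, seed=1)
print(len(sample), sample[:5])
```

Every gate has the same probability of being selected, so no assumption about which nodes matter most for power is built into the sample.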
B. Modeling 0-Delay Dynamic Energy
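The zero-delay dynamic energy drawn by a gate in the sample (say, $g_j$) for a given input transition from $x^i$ to $x^f$ can be expressed as
$$e_j^{0}(x^i, x^f) = C_j \left[ f_j(x^i) \oplus f_j(x^f) \right]$$
where $C_j$ is the load capacitance, $f_j$ is the function realized at gate $g_j$ and $\oplus$ is the Boolean exclusive-or. In the general case, the load capacitance is multiplied by a factor proportional to $V_{dd}^2$. The sample energy is the sum of the zero-delay dynamic consumption of all sampled gates:
$$E_S^{0}(x^i, x^f) = \sum_{j \in S} C_j \left[ f_j(x^i) \oplus f_j(x^f) \right] \qquad (3)$$
Equation (3) is the pattern-dependent RTL model for $E_S^{0}$.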
Only functional information about nodes in the sample is exploited. IP protection is ensured because information on the structure of the gate-level implementation is completely lost. The model is fully analytic and does not require any form of characterization and fitting on simulation results.
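A minimal Python sketch of the zero-delay model follows: each sampled node is reduced to a load capacitance and a Boolean function of the primary inputs, and the sample energy for a pattern pair is the capacitance-weighted count of nodes whose value changes. The three-input unit, its node functions and the capacitance values are hypothetical.

```python
from typing import Callable, Sequence, Tuple

Pattern = Tuple[int, ...]
# A sampled node: (load capacitance, Boolean function of the input pattern).
SampledNode = Tuple[float, Callable[[Pattern], int]]

def zero_delay_energy(sample: Sequence[SampledNode],
                      x_i: Pattern, x_f: Pattern) -> float:
    """Weighted sum of the sampled nodes that toggle between x_i and x_f."""
    return sum(cap * (f(x_i) ^ f(x_f)) for cap, f in sample)

# Hypothetical sample of three internal nodes of a unit with inputs (a, b, c).
sample = [
    (2.0, lambda x: x[0] & x[1]),  # AND(a, b)
    (1.5, lambda x: x[1] | x[2]),  # OR(b, c)
    (3.0, lambda x: x[0] ^ x[2]),  # XOR(a, c)
]

print(zero_delay_energy(sample, (0, 0, 0), (1, 1, 0)))
print(zero_delay_energy(sample, (1, 1, 0), (1, 1, 1)))
```

Only the node functions and capacitances appear in the model, so the connectivity of the gate-level netlist is not exposed to the user of the model.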
Given a pair of input patterns, evaluating the value of $E_S^{0}$ reduces to evaluating the Boolean functions $f_j$ of the sampled nodes and computing the weighted sum of (3). The computational cost of estimating $E_S^{0}$ depends on how efficient the computation of $f_j$ is for nodes in the sample. This in turn depends on how Boolean functions are represented. Finding optimal representations for Boolean functions is one of the fundamental issues in the EDA area and a detailed treatment is out of the scope of this paper. We only mention three representation styles that have been successfully used in academic and commercial RTL simulators: compiled-code models [9]-[11], which are based on the observation that any Boolean expression can be compiled for fast evaluation; binary decision diagrams (BDD's) [8], [12], which enable the evaluation of Boolean functions in a time proportional to the number of their variables; and algebraic decision diagrams (ADD's) [13], which directly represent real-valued functions of Boolean variables.
C. Delay-Sensitive Second-Order Contributions
According to (2), we start from the observation that second-order phenomena can be seen as an additive contribution (called $E_S^{\Delta}$) to the zero-delay energy $E_S^{0}$. The RTL model we propose for $E_S^{\Delta}$ is a regression equation that needs to be characterized on the results of gate-level simulations. In practice, we propose a hybrid modeling approach for the energy drawn by the sample, where simulation-based characterization of $E_S^{\Delta}$ improves the absolute accuracy of the fully analytic, characterization-free model of $E_S^{0}$. Experimentally, we observed that $E_S^{\Delta}$ correlates well with input switching activity; hence, we adopt the following model:
$$E_S^{\Delta}(x^i, x^f) = \alpha\, H(x^i, x^f)$$
where $H(x^i, x^f)$ is the Hamming distance between $x^i$ and $x^f$ (i.e., the number of switching inputs) and $\alpha$ is a fitting coefficient that is computed with a simple characterization experiment. The unit is simulated with a sample of random patterns. Full-delay simulation is employed. For each pattern pair, an estimate of $E_S^{\Delta}$ is computed by subtracting the zero-delay energy (provided by the zero-delay model) from the total energy computed by the simulator. An estimate for $\alpha$ is obtained by computing the ratio $E_S^{\Delta}/H$ for each pattern pair in the sample, then averaging the ratios.
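The characterization experiment for the fitting coefficient can be sketched as follows. The Python code assumes that, for every pattern pair, the full-delay simulated energy and the zero-delay estimate are already available as numbers. Skipping pattern pairs with no switching inputs is an assumption of this sketch (the ratio is undefined there); all names and values are hypothetical.

```python
def hamming_distance(x_i, x_f):
    """Number of input bits that differ between the two patterns."""
    return sum(a != b for a, b in zip(x_i, x_f))

def fit_coefficient(pattern_pairs, simulated_energy, zero_delay_estimate):
    """Average, over the pattern pairs, of (simulated - zero-delay) / Hamming distance."""
    ratios = []
    for (x_i, x_f), e_sim, e_zd in zip(pattern_pairs, simulated_energy,
                                       zero_delay_estimate):
        h = hamming_distance(x_i, x_f)
        if h == 0:
            continue  # assumption: pairs with no switching inputs are ignored
        ratios.append((e_sim - e_zd) / h)
    return sum(ratios) / len(ratios)

# Hypothetical characterization data for a two-input unit.
pairs = [((0, 0), (1, 1)), ((0, 1), (1, 1)), ((1, 1), (1, 1))]
simulated = [5.0, 3.0, 0.0]    # full-delay energy from the gate-level simulator
zero_delay = [4.0, 2.0, 0.0]   # estimates from the zero-delay model
print(fit_coefficient(pairs, simulated, zero_delay))
```

Averaging per-pair ratios, as the text describes, weights every pattern pair equally regardless of how many inputs switch; a least-squares fit would be a different estimator and is not what is described here.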
IV. TECHNOLOGY TUNING
One of the assumptions made in the model construction process described in the previous section is that a golden model (i.e., a reference gate-level implementation of the unit) is available during characterization. The RTL power model is a good predictor of the power dissipation of the golden model. This may be an undesirable feature if IP components are designed and sold as soft macros, i.e., abstract RTL descriptions that can be synthesized by the user using a proprietary technology and gate library.
In the case of soft macros, there is no guarantee that the golden model employed for model construction is the same that will be instantiated by the user when he/she synthesizes the macro. The main sources of mismatch are technology, gate-level library and synthesis tools. As a result, the power estimates provided by the RTL model may be inaccurate if no corrective measure is taken.
Fortunately, accuracy may be recovered thanks to a technology tuning method that requires only limited effort on the IP user side. Our approach is based on the observation that the mismatches in final implementations produced by technology, library and synthesis tools tend to have limited variance, although their absolute value can be significant. Technology tuning is performed using a reference benchmark $B$, a macro that contains no intellectual property and that can be released at no charge. A set of input patterns and the corresponding average power dissipation $P_B^{prov}$ will be provided with $B$. The power dissipation $P_B^{prov}$ is computed by the IP provider performing gate-level simulation of an implementation of $B$. The implementation of $B$ is obtained using the same technology, library and synthesis tools that are employed for generating the golden models of the IP units. The IP user synthesizes $B$ with his/her technology, library and synthesis tools, then simulates the implementation with the patterns provided and measures the average power dissipation $P_B^{user}$.
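The technology scaling parameter $\tau$ is defined as the ratio between the average power dissipation of the reference benchmark measured by the IP user, $P_B^{user}$, and the value supplied by the IP provider, $P_B^{prov}$:
$$\tau = \frac{P_B^{user}}{P_B^{prov}}$$
Technology tuning is performed once and for all: $\tau$ can be used for an entire library of IP macros. After $\tau$ has been computed, the user can estimate the power consumption of his/her own implementation of the IP soft macros with the following formula:
$$P^{user} = \tau \, P^{model}$$
where $P^{model}$ is the power dissipation estimate of the model provided by the IP provider with the unit.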
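The tuning procedure amounts to one division and one multiplication, as the following Python sketch shows. The function names and the power values are hypothetical; units are arbitrary as long as they are used consistently.

```python
def technology_scaling(p_benchmark_user, p_benchmark_provider):
    """Scaling parameter: user-measured over provider-supplied benchmark power."""
    return p_benchmark_user / p_benchmark_provider

def tuned_estimate(scaling, p_model):
    """Rescale the RTL model estimate to the user's technology."""
    return scaling * p_model

# Hypothetical values: the reference benchmark dissipates 2.0 units on the
# provider's golden implementation and 6.0 units after the user's synthesis.
scaling = technology_scaling(6.0, 2.0)

# The same parameter is reused for every macro in the library.
for name, p_model in [("macro_a", 1.5), ("macro_b", 0.4)]:
    print(name, tuned_estimate(scaling, p_model))
```

The user runs gate-level simulation only once, on the reference benchmark; all IP macros are then estimated at the RTL.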
V. EXPERIMENTAL RESULTS
The modeling approach has been tested on MCNC91 [15] benchmark circuits. Each benchmark was first mapped on a reference technology library to obtain a golden model. Zero-delay power models were constructed using uniform random sampling (with 0.1 sampling factor) and compiled-code representations. Constant and linear estimators were also characterized for comparison. Characterization was based on power data provided by gate-level simulations with random input sequences with 0.5 average activity.
Results are reported in Table I. The first two columns contain the name of the circuit and the number of gates in its golden model. Columns three to five report the rmsRE of the constant model (Const), linear regression (Lin) and node sampling (Ns). The rmsRE was evaluated by running concurrent gate-level and RTL power simulations for input sequences with average transition probability ranging from 0.01 to 0.99. The improved robustness of node sampling is evident: the typical rmsRE is well below 10%, while the rmsRE of characterization-based models is always much greater than 100%. For benchmark circuits c7552 and c6288, similar results (namely, an rmsRE around 5%) were also obtained with a sampling factor of 1%.
To test the effectiveness of technology scaling, all benchmarks were remapped on a different technology library, containing only two-input cells with large input/output capacitances. The effect of remapping on power consumption was always in excess of 100%. Circuit cmb was chosen as reference benchmark for technology tuning. Its golden model and its remapped version were simulated using the same input sequence (namely, a sequence of 10 000 input patterns with 0.5 transition probability) to estimate the provider-side and user-side average power dissipation of the reference benchmark ($P_B^{prov}$ and $P_B^{user}$), respectively. The technology-tuning parameter was then computed and used to evaluate the power consumption of all remapped circuits. Results reported in the last column of Table I show that technology tuning does not impair accuracy and robustness significantly.
A second set of experiments was performed to test the quality of node sampling in full-delay power estimation. For each benchmark, full-delay gate-level simulation was performed (after node sampling) to characterize the fitting coefficient $\alpha$ of the Hamming-distance model of the delay-sensitive energy $E_S^{\Delta}$. The same simulation was used to characterize constant and linear models. The accuracy of the power estimates is reported in Table II. The rmsRE of node sampling with the additional fitting coefficient is around 10%, which is more than one order of magnitude below the rmsRE of characterization-based approaches. The last column reports the accuracy of technology-tuned full-delay models.
VI. CONCLUSIONS
In this work, we have presented a methodology for power modeling of RTL units. Our models are automatically built without designer intervention and do not reveal details on the internal structure of the units (i.e., they protect intellectual property). The most remarkable characteristic of the models is their robustness with respect to the input statistics. This is in sharp contrast with modeling methodologies based on characterization, whose accuracy deteriorates when the unit is operated with input statistics dissimilar to those assumed during characterization. We also presented a significant extension to the model that makes it capable of adapting to changing technologies, gate libraries and synthesis tools.
