Abstract-Various approaches for micro-architectural power/energy estimation have been introduced, mainly driven by the need to obtain fast power/energy estimates during early phases of complex SOC designs. In contrast to previous approaches, we study power/energy estimation for highly optimized synthesizable descriptions of microprocessor cores. Under this real-world design scenario, we found, unlike related previous research, that power can hardly be estimated closer than around 15% using an instruction-level model. However, we can estimate the energy as close as 5%. Our research has resulted in the SEA framework that estimates energy/power consumed by a software program, taking specific micro-architectural features of the underlying programmable hardware core into consideration. With this high accuracy in energy estimation we achieve around 5 orders of magnitude faster estimations compared to state-of-the-art high-level (RTL) commercial energy/power estimation tool suites. Thus, our framework is capable of reliably estimating the energy/power consumption of future complex SOCs.
I. INTRODUCTION
IP-based design methodologies combined with the paradigm of platforms for specific application areas have enabled designers to design new multi-million gate designs in shorter times and at an overall smaller man-month count compared to traditional design methods that do not extensively re-use existing IP. Examples of design platforms stem from domains like multimedia processing, wireless communications, real-time control, etc. The task of a designer has changed to integrating and estimating various scenarios of a future complex SOC by means of re-using existing IP and design platforms.
A high-level power estimation method should preferably have the following features:
1) The model should be able to estimate on a per-cycle basis to allow sufficient accuracy.
2) The model should be independent of access and switching activities as they can hardly be predicted from an instruction-level abstraction point of view. This holds especially for highly optimized processor designs that are coded in a mixed behavioral-RTL and structural-RTL fashion. In this paper we provide an approach according to these constraints and under the assumption that the processor design might not allow for functional clock gating, i.e., clock gating that allows RTL blocks at almost any size to be gated, which is true for many real-world processor designs which rigorously mix functional and structural RTL. Block-based power estimation approaches might not be applicable in such cases.
Our model is based on the observations made by studying the optimized synthesizable RTL code of the MicroSparcIIep [25]. Our approach estimates energy within an accuracy of 5% and per-cycle power within 15% or less on average.
This work is supported in part by NSF under grant numbers MIP-9701416 and CCR02-08992.
A review of existing models is given in Sec. II. Our power model is introduced in Sec. III and IV. The SEA framework is discussed in Sec. V. Experiments and results are shown in Sec. VI with our conclusions in Sec. VII.
II. REVIEW OF EXISTING POWER MODELS
A large number of approaches have been proposed for power estimation in recent years. We classify them as follows:
Module-based approaches view the power consumption of a core as a sum of the power consumption of the structural modules present in the core. Modeling the power behavior of these individual units provides the energy consumed every cycle by the core. [2, 4] present such power estimators which estimate the activity factors, area, capacitance, etc. for modules present in the core. [7] highlights the differences between these estimators and the difficulty in estimating accurate activity factors.
[23] presents SimplePower, another such estimator based on the SimpleScalar tool suite [3]. Another estimator for SH3 is presented in [21]. [10] attempts a slightly broader classification of the core as datapath, control logic, etc. Such models require an in-depth knowledge of the architecture and the implementation, which, however, is not often provided by the IP providers. Further, modules in highly optimized IP cores can have complex operation inter-dependencies which make module partitioning difficult.
Instruction-based approaches abstract away the low-level details needed by the module-based models above. The energy consumption of the core is captured by assigning power/energy values to each instruction of the instruction set. The first of such approaches has been presented in [22]. The power behavior of each instruction is captured by executing the instruction in a loop and measuring the current drawn by the chip. Using a measurement-based approach helps account for (physical) packaging issues, but it is difficult to back-annotate the system-level loading to the activities on the pins of the chip. Furthermore, measuring the average current accurately is a relatively complex and error-prone process [12]. More importantly, a SoC designer generally needs to obtain energy/power data before the chip is taped out.
In [8], each instruction is propagated along a gate-level netlist for accurate energy estimation, which is relatively time-consuming. Many models are proposed by studying the various aspects of instructions that might affect the core, such as data, operands, etc. [18] attempts to capture data-related effects through activity indices. Another data-related study is presented in [5]. Energy-sensitive factors are studied in [6] and a regression-based analysis is presented in [11]. Another model is presented in [20] which is based on classifying the execution cycles into different types. Most of these models
Function-based approaches provide a yet higher abstraction for power estimation. Works like [1, 16, 17] abstract the processor as a set of functions or stages. The behavior of various physical modules is captured by these abstract quantities. A technique for capturing these activities through performance counters is presented in [9]. Some processors might not offer such functionality and, in some cases, it might not be possible to monitor certain power events accurately. A yet higher level of abstraction is capturing the power behavior for library routines [14]. The primary difficulty with such an approach is to capture the statistical run-time behavior of these libraries accurately (the cache misses, etc.). At the most abstract level, the processor can be treated as a black box. [19] presents a cycle-level estimator where an ARM processor is assigned two states, active and nop-waiting. The power values for both states are taken from datasheets. Another such model is presented in [15]. Such a coarse treatment may not be applicable to an IP core and might lead to imprecise estimates.
In summary, some previous approaches have either made simplified assumptions and do not model the micro-architecture in sufficient detail; or, the models are very detailed but their assumptions about the micro-architecture are rather unrealistic, since a highly optimized micro-architecture cannot be modeled as a set of RTL blocks that are either active or non-active. In fact, RTL blocks might dissipate energy during a certain time frame even though a functional simulation does not suggest so. Hence, the idealistic assumption of module-based gated-clock designs is not in compliance with optimized micro-architectures that actually exhibit a mixture of structural and behavioral RTL.
The approach we take is to incorporate as many details as can be extracted from an optimized synthesizable RTL description. Such a description not only inhibits the decomposition into blocks, but may also lead to a partly unpredictable energy/power behavior. Overall, however, we can guarantee an upper and lower bound for the estimated power that is typically in the range of 15% and around 5% accuracy for energy estimates.
III. DERIVING AND REFINING AN INSTRUCTION-BASED ENERGY/POWER MODEL
To derive the power/energy models for IP cores, we use a publicly available, synthesizable model of the MicroSparcIIep core [25] as an example. Employing a commercial core allows us to study many architectural features that typically do not appear in simple processor models constructed for research purposes only. The block diagram of a MicroSparcIIep is shown in Fig. 1 [13]. It is a RISC architecture that integrates a SPARC processor with a floating-point unit (FPU), memory management unit (MMU), separate instruction and data caches, and a PCI bus controller (PCIC) onto a single device.
The synthesizable RTL of the core is highly optimized such that behavioral and structural RTL are inter-mingled extensively. Further, functional clock-gating had not been implemented. Both these factors prevented the adoption of a block-based power model as used by some estimation techniques. Gate-level power estimation is a very time-consuming process and is not advisable for estimating software energy. However, for our initial investigation, we synthesized the core and conducted gate-level simulation in order to verify some conclusions presented in similar previous work.
Based on extensive detailed experiments, we observe that the power/energy variations due to architectural characteristics are quite complex. For example, we note that modules that are not directly involved in the execution of an instruction can still contribute significantly in terms of power variation. The data in Tab. I clearly show this point. Here, the first column lists pairs of instructions, and the second column summarizes the corresponding power differences of various modules. The acronyms used are Core (all modules except caches), IU (integer unit), Ex (execution unit), Rf (register file), CC (cache controller), MMU (memory management unit), Memif (memory interface unit). We use x.y to denote that y is a sub-module of x. From Tab. I, one can readily see that units such as MMU and Memif account for a large portion of the total power difference (e.g., more than 30% for Add vs. And), even though the execution of both instructions is not supposed to involve the two units. Such behaviors are also observed in the FPU (especially the floating-point register file) during the execution of integer instructions, which is due to the partial decoding of all instructions in the FPU. Instructions are later discarded if they are not FP instructions.
The intricate dependencies of instruction executions on functional units make it impossible to model the power consumption of the core by means of a simple module decomposition. Moreover, even the most complex model would not accurately estimate the power consumption since many effects are simply due to the structural RTL coding style of the core. To overcome these difficulties, we adopt an instruction-level model, capture the various effects through simulation and store them in a multidimensional database that is accordingly accessed by our estimation framework for estimating the power/energy consumption of a C program.
Let us first review a basic instruction-based power model presented in [22] by Tiwari, et al. We then consider the complications of applying it to the MicroSparc core. Through this exercise, we identify the model's weakness and propose a modified model for an efficient power estimation tool. The power model from [22] is given in (1).
3) Inter-instruction effects are difficult to model due to the large variations within the instruction execution context. Nonetheless, such effects are overshadowed by the effect of data variations.
Considering the above observations, we propose a refined cycle-based, instruction-level energy model. It assumes that the energy consumed in a certain cycle is induced by the instruction that resides in the execution unit at the (clock cycle) time of interest. The execution of an instruction i can be broken down into two parts: n_{a_i} active cycles and n_{s_i} stall cycles.
The stall cycles are assigned the same stall power, no matter which instruction causes the stall. Multi-cycle instructions such as "multiply" will have multiple active cycles. The average power consumed by instruction i in a cycle, Pavgi, reflects the average power consumed by the micro-architecture over all cycles when that instruction is residing in the execution stage of the pipeline. Our model is shown in Eqn. (2).
Here, Eprog is the total energy consumed by a software program, n is the number of instructions of the instruction trace, Pavgi is the average power consumed by instruction i, Pstall is the average stall cycle power, and T is the period of the clock cycle. Our power database (see Section IV) not only includes the Pavgi for each instruction, but also the lower and upper bounds so as to provide various options to the designer.
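Written out from the symbol definitions above, one form of the model that is consistent with them is sketched below. The cycle counts n_{a_i} and n_{s_i} denote the active and stall cycles of instruction i; the exact typesetting is an assumption, not a copy of the original equation.

```latex
E_{prog} = \sum_{i=1}^{n} \left( P_{avg_i} \cdot n_{a_i} + P_{stall} \cdot n_{s_i} \right) \cdot T
```

Each term is a power multiplied by a number of cycles and by the clock period, so every summand is an energy, and replacing P_{avg_i} by a lower or upper bound yields the corresponding bound on E_{prog}.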
Use of the model in (2) requires various energy model data. Since measurement-based techniques as in [22] cannot be employed for IP cores, we employed simulation to obtain the energy model data. Each instruction is exposed to various test cases and a power value is assigned to it. In order to reduce the number of test cases, the instruction set is partitioned into classes, and each class is assigned a power value. We classified the instruction set into memory-based and non-memory-based as suggested by our initial gate-level simulations. Then, we further classified the instructions by whether or not special-purpose hardware is invoked by the respective instructions and whether integer or floating-point register files were affected.
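The classification criteria just described can be sketched in a few lines of Python. The mnemonic sets below are hypothetical and chosen only for illustration (they are SPARC-style mnemonics, but their assignment to classes is an assumption); an actual database would derive the classes from gate-level simulation results.

```python
from dataclasses import dataclass

# Hypothetical mnemonic sets, for illustration only. The real class
# membership would come from simulation of the core, not from this list.
MEMORY_OPS = {"ld", "st", "ldd", "std", "ldf", "stf"}
SPECIAL_HW_OPS = {"smul", "umul", "sdiv", "udiv"}
FP_OPS = {"fadds", "fsubs", "fmuls", "fdivs", "ldf", "stf"}


@dataclass(frozen=True)
class InstrClass:
    memory: bool       # memory-based (load/store) vs. non-memory-based
    special_hw: bool   # invokes special-purpose hardware (e.g. multiplier)
    fp_regfile: bool   # affects the floating-point register file


def classify(mnemonic: str) -> InstrClass:
    """Map a mnemonic to its power class using the three criteria."""
    m = mnemonic.lower()
    return InstrClass(
        memory=m in MEMORY_OPS,
        special_hw=m in SPECIAL_HW_OPS,
        fp_regfile=m in FP_OPS,
    )
```

Because `InstrClass` is frozen and hashable, it can serve directly as a dictionary key, so one power value per class can be stored instead of one per instruction.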
In contrast to the existing instruction-level models that employ simulation for obtaining power/energy data, our model captures the details that can actually be observed at the instruction level and abstracts away architectural features whose power/energy implications are not visible at the instruction level. Such an approach helps to retain both efficiency and accuracy of an instruction-level model when dealing with a highly optimized model of an IP core. We would like to emphasize the absence of functional clock-gating in many real-world, highly optimized processor cores, due to which certain architectural details required by previous modeling techniques are not exposed to the power model. Our discussions in the next section will further justify the model that we proposed.
IV. BUILDING THE ARCHITECTURAL POWER/ENERGY DATABASE
The model proposed in (2) requires a power/energy database for extracting various power data required by the model (e.g., Pavgi and Pstall). The consideration of architectural characteristics in conjunction with the optimized design representation of the MicroSPARCIIep processor core is key for constructing such a database and hence designing an accurate energy/power estimation tool. This section discusses the prominent issues to be resolved for applying the model.
A. Stall energy estimation
Pipeline stalls due to factors such as cache misses and data dependencies are unavoidable in modern microprocessors. Using measurements to capture stall energy as in (1) cannot be easily done. Detailed knowledge of the architecture and the functionalities of different modules is needed to develop good test cases for capturing these effects. This knowledge is, unfortunately, not readily available for many IP cores.
In [2], idle energy of modules of a CPU is estimated as 10% of the corresponding active energy. However, the highly optimized synthesizable RTL of the MicroSparcIIep led to a different conclusion. The data presented in Tab. II show that idle power can contribute to 50% (or even higher) of the average power consumption of the respective modules.
Such observations suggest that stalls have to be modeled accurately and have to be treated in a way similar to instructions. This is especially true in control-dominated and reactive applications that tend to have a higher stall rate. The stall energy/power of MicroSparcIIep, however, has been observed to be nearly constant, independent of the kind of stalls. Therefore, we decompose the total execution cycles of an instruction into two parts, active cycles and stall cycles, and account for these parts separately in terms of energy/power consumption. Separating the stall cycles from active cycles decouples various power/energy-related effects and thus facilitates a precise and reliable power/energy estimation.
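The decomposition into active and stall cycles can be illustrated with a short Python sketch. The per-cycle trace format used here, a tuple of (sequence id, mnemonic, stalled flag) for the instruction in the execute stage, is a hypothetical one chosen for the example; the trace formats of actual simulators will differ.

```python
from typing import Iterable, List, Tuple

# Hypothetical per-cycle trace entry: (sequence id of the instruction in the
# execute stage, its mnemonic, whether the pipeline is stalled this cycle).
CycleEntry = Tuple[int, str, bool]


def split_cycles(trace: Iterable[CycleEntry]) -> List[Tuple[str, int, int]]:
    """Collapse a per-cycle trace into (mnemonic, active_cycles, stall_cycles)."""
    rows: List[list] = []  # each row: [seq_id, mnemonic, active, stall]
    for seq_id, mnemonic, stalled in trace:
        # A change of sequence id means a new instruction entered execute.
        if not rows or rows[-1][0] != seq_id:
            rows.append([seq_id, mnemonic, 0, 0])
        if stalled:
            rows[-1][3] += 1
        else:
            rows[-1][2] += 1
    return [(m, active, stall) for _, m, active, stall in rows]


example_trace = [
    (0, "add", False),
    (1, "ld", False), (1, "ld", True), (1, "ld", True),  # two stall cycles
    (2, "smul", False), (2, "smul", False),              # multi-cycle instruction
]
print(split_cycles(example_trace))
```

Keying on a sequence id rather than on the mnemonic keeps two back-to-back instances of the same instruction from being merged into one.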
B. Variations through data dependencies
The energy/power consumed in various hardware units of the processor core depends on, among others, the data that are processed by a certain instruction. Fig. 2 shows an excerpt: the z-axis shows the power consumption (in mW), the y-axis shows the break-down to the most prominent sub-units (RTL modules) of the IU, while the x-axis shows diverse instructions with altered data (add-no: an add instruction being executed with no alteration of the operands; add-incr: the add instruction incrementing an operand by 1; and add-max: the maximum observed power consumption of the add instruction by inducing maximum switching activity of the register bits holding the operands (e.g., a 0-1-0 to 1-0-1 switch of a 3-bit wide register)). The data show that some sub-blocks vary significantly in power consumption whereas others remain more or less constant when data is being altered. Overall, the variations are quite significant and cannot be ignored.
The variations in the power consumption within an instruction are important for certain investigations. For example, the maximum power values are useful for estimating the peak power, which is a key parameter in studying battery utilization efficiency. To facilitate such estimations, we maintain the maximum, the minimum, and the average power for each instruction in the power database. These data can then be used in (2) to estimate the maximum, minimum and average energy of the whole core.
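As a minimal sketch of how the minimum, average and maximum entries could be combined with the model in (2), the following Python fragment accumulates three energy totals over an instruction list. All numeric values, the database layout and the function name are assumptions made for the example and are not data from the experiments; a single stall power is used for all three bounds, following the observation that stall power is nearly constant.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical database: mnemonic -> (min, avg, max) power in mW.
POWER_DB: Dict[str, Tuple[float, float, float]] = {
    "add": (90.0, 100.0, 120.0),
    "ld": (110.0, 130.0, 150.0),
}
P_STALL_MW = 60.0  # one stall power for all instructions (assumed value)
T_NS = 10.0        # clock period in ns (assumed value)


def program_energy(
    instrs: Iterable[Tuple[str, int, int]],
    db: Dict[str, Tuple[float, float, float]] = POWER_DB,
    p_stall: float = P_STALL_MW,
    t: float = T_NS,
) -> Tuple[float, float, float]:
    """Return (min, avg, max) program energy in pJ (mW * ns = pJ).

    Each element of instrs is (mnemonic, active_cycles, stall_cycles).
    """
    totals = [0.0, 0.0, 0.0]
    for mnemonic, n_active, n_stall in instrs:
        for k, power in enumerate(db[mnemonic]):
            totals[k] += (power * n_active + p_stall * n_stall) * t
    return (totals[0], totals[1], totals[2])


# One single-cycle add, then a load with one active and two stall cycles.
print(program_energy([("add", 1, 0), ("ld", 1, 2)]))  # (3200.0, 3500.0, 3900.0)
```

Per-cycle peak power, as opposed to total energy, would be read directly from the maximum entry of the instruction occupying the execute stage in that cycle.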
C. Inter-instruction effects (IIEs)
The energy/power consumed by an instruction may vary depending on the context in which an instruction is executed, i.e., the instructions immediately before and after the instruction. The model in (1) captures these effects by assigning a single energy value to each pair of instructions and adds such energy values. However, our detailed simulations show that the inter-instruction effects (IIEs) are quite complex and the simple additive model in (1) is not appropriate.
Tab. III presents the power values for different instruction sequences. The first two columns are for instruction sequences containing only a single instruction, the next two columns for sequences containing two instructions, and so forth. Careful analysis of the data will show that using the formula in (1) would produce wrong results. Consider, as an example, a sequence of and-or-sll. The total energy for this sequence
2) Accurate estimation of the energy consumed (minimum, maximum, actual), considering the effects of input stimuli, data dependencies, instruction dependencies and architectural characteristics, and
3) Various statistical analyses through a graphical user interface.
The input to SEA (Fig. 3) is a binary of an application program that, for example, has been written in C. The "Analyzer" is further fed with instruction traces that contain timing information. For that purpose, either an Instruction Set Simulator (ISS) or an HDL simulator can be used. The Analyzer accesses the power models capturing instruction sequences, data dependencies and pipeline effects. These models are input from the database. The database is directly generated (manually supported in a one-time effort; indicated by the dashed arrows) from the synthesizable RTL core in conjunction with an energy/power estimation tool (we used Sente). A graphical user interface displays various power/energy-related plots along with statistics. The reference flow is Sente's PeakNatt Watcher that is fed by the ModelSim simulator with stimuli generated by the execution of an application written in C.
POWER VARIATIONS OF INSTRUCTION SEQUENCES

VI. EXPERIMENTS AND RESULTS
To verify our power/energy estimation tool suite, we have constructed the power database for the MicroSparcIIep core. We then applied our SEA framework to diverse applications that are written in C. SEA can operate in various modes and thereby aid the system designer in different design stages. The modes are described in the following.
Power/Energy Min/Max Mode:
In this mode, SEA computes the minimum and maximum energy consumption during every cycle. It calculates the minimum and maximum bounds, assuming that the same program might run on different data. Therefore, the system designer gets a reliable energy estimate even when the data the application will run on is not available yet. A typical plot for this mode can be seen in Fig. 4. It shows the energy bounds "min" and "max", the predicted energy "pred", and "actual", which is the reference obtained from a commercial tool (Sente [24]). As can be seen, the actual energy values always lie within the min/max range. Moreover, even the predicted graph is very close to the "actual" derived from Sente. As we will discuss later, our tool is several orders of magnitude faster than the Sente tool.
The results are summarized in Tab. V. The most interesting results are shown in the last three columns of the table, where the computation times are compared absolutely and in terms of relative difference. Simulating a whole processor core with the stimuli data for an application turned out to be very computation-intensive for the Sente tool: around 2k-3k simulated cycles (col. 8) of an application running on the synthesizable RTL of the MicroSparcIIep took between 7 and 15 hours for the Sente tool. Our SEA framework estimated the power using exactly the same traces in less than a second, resulting in speed-ups of more than five orders of magnitude. However, we need to put this performance improvement in the context of the accuracy we achieved: the energy data estimated for various applications using our SEA framework are shown in columns 2, 3 and 4 (minimum, maximum and average). The column named "Actual" is the reference energy consumption obtained from Sente. The column "PE" shows the error in prediction. It shows that our SEA framework is in all cases within a 5% accuracy compared to Sente. However, the per-cycle power accuracy, "PCE", is within 12% to 17%.
This is the toll we have to pay for our fast database-oriented estimation model. Fig. 6 summarizes the results in terms of how many times faster our SEA framework estimates energy/power compared to the Sente tool. The comparison of our SEA framework to the Sente tool has to be set in relation, though: the aim of Sente is to estimate power of any RTL design whereas SEA needs a separate database for every new processor core. On the other hand, SEA's computation time is independent of the (RTL) complexity of the processor since we estimate power/energy consumption at instruction level. The higher abstraction level is eventually the reason for the high speed-up. The level of accuracy we achieve, especially for average power/energy consumption, makes SEA a reliable tool for a system designer.
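The relation between the two accuracy figures and the speed-up can be made concrete with a small Python sketch. The definition of the per-cycle error used below (mean absolute relative error over all cycles) is an assumption, since the text does not spell out how "PCE" is computed; the sample numbers and the 0.2 s estimation time are likewise hypothetical.

```python
import math
from typing import Sequence


def energy_error_pct(predicted: float, actual: float) -> float:
    """Relative error of the total energy in percent (a 'PE'-style metric)."""
    return abs(predicted - actual) / actual * 100.0


def per_cycle_error_pct(pred: Sequence[float], actual: Sequence[float]) -> float:
    """Mean absolute per-cycle relative error in percent.

    Assumption: one plausible reading of 'PCE', not taken from the source.
    """
    errors = [abs(p - a) / a for p, a in zip(pred, actual)]
    return 100.0 * sum(errors) / len(errors)


def speedup_orders(reference_seconds: float, estimate_seconds: float) -> float:
    """Speed-up expressed in orders of magnitude (base-10 logarithm)."""
    return math.log10(reference_seconds / estimate_seconds)


# Hypothetical per-cycle power values (mW): the errors have opposite signs.
pred_mw = [110.0, 90.0, 105.0, 95.0]
actual_mw = [100.0, 100.0, 100.0, 100.0]

print(energy_error_pct(sum(pred_mw), sum(actual_mw)))  # total error cancels out
print(per_cycle_error_pct(pred_mw, actual_mw))         # per-cycle error does not
print(speedup_orders(7 * 3600, 0.2))                   # 7 h reference, 0.2 s assumed
```

Per-cycle deviations of opposite sign cancel when summed over a program, which is one reason why an energy estimate can be considerably more accurate than the per-cycle power estimate it is built from.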
VII. CONCLUSIONS
In this paper, we have introduced a cycle-based, instruction-level power/energy model and the SEA framework for fast micro-architectural energy/power estimation. As opposed to previous work in the field, our estimation is for a highly optimized synthesizable RTL core, which prohibits the usage of module-based estimation approaches. Instead, our models employ the notion of minimum/maximum energy/power consumption and estimate depending on how much information and certainty is available within a certain time window. Various databases have been generated to accomplish this task. As a result, our technique estimates around 5 orders of magnitude faster compared to the Sente tool suite with an accuracy that is within 5% for energy estimates and within 15% for power estimates. Though the experiments shown here are all based on the MicroSparcIIep processor core, the general techniques we used are applicable to other cores as well.
