Abstract-The need for low-power embedded systems has become increasingly significant in the microelectronics arena in recent years. A power-driven methodology is mandatory during embedded system design to meet system-level requirements while fulfilling time-to-market constraints. The aim of this paper is to introduce accurate and efficient power metrics, included in a hardware/software (HW/SW) codesign environment, to guide the system-level partitioning. Power evaluation metrics have been defined to widely explore the architectural design space at a high abstraction level. This is one of the first approaches that globally considers the HW and SW contributions to power in a system-level design flow for control dominated embedded systems.
I. INTRODUCTION

Embedded systems are those computing and control systems designed for dedicated applications [1], where ad hoc software routines are provided to respond to specific requirements. The diffusion on the semiconductor market of standard processors characterized by high performance and reasonable prices has contributed to increase the importance of embedded systems. The typical embedded system architecture is constituted by one or more dedicated hardware units, such as application specific integrated circuits (ASIC's), implementing the hardware part, and by a set of software routines running on a dedicated processor or application specific instruction processor (ASIP) for the software part. Exploiting the advantages offered by submicron complementary metal-oxide-semiconductor (CMOS) technologies, the entire embedded system can be implemented on a single ASIC, including the processor core, the on-chip memory, the input/output (I/O) interface and the custom hardware part.
Innovative codesign techniques have emerged as a new computer-aided design (CAD) discipline in the recent past, to cope with the complexity of a comprehensive exploration of the design alternatives in the hardware/software design space. Codesign aims at meeting the system-level requirements by using a concurrent design and validation methodology, thus exploiting the synergism of the hardware and software parts. Several design tasks are covered during the codesign process, mainly system-level modeling, capture of the functional cospecification, analysis and validation of the cospecification, system-level partitioning, exploration and evaluation of several architectures with respect to given design metrics, cosynthesis, and cosimulation. The availability of a codesign methodology covering all these design phases is mandatory during embedded system design to meet the system-level requirements.
The overall system costs and performance are greatly impacted by the effects of the partitioning task, which targets the assignment of operations to the hardware (HW) or software (SW) parts. To guide the partitioning process, design metrics should be defined to compare alternative partitionings and to evaluate their conformance with the system requirements, typically defined in terms of performance, area, power, cost, etc. Moreover, the design of embedded systems is often over-constrained, so that a solution satisfying all the constraints is difficult to identify in a reasonable design time. As a result, usually only a partial exploration of the architectural design space can be carried out to reach an acceptable solution, possibly far from the optimal one.
The importance of power constraints during the design of embedded systems has continuously increased in the past years, due to technological trends toward higher levels of integration and increasing operating frequencies, combined with the growing demand for portable systems. Despite the increasing importance of power consumption in most embedded applications, only a few codesign approaches take such a goal into account at the higher levels of abstraction.
While several power estimation techniques have been proposed in the literature at the gate, circuit, and layout levels [2], until recently only a few papers have been published addressing the power estimation problem at higher levels [3], [4], despite the increasing interest in the system and behavioral levels. According to [3], high-level power estimation techniques can be classified depending on their abstraction level.
The average power is strongly related to the switching activity of the circuit nodes, hence power estimation can be considered a pattern-dependent process. In particular, the input pattern-dependency of power estimation approaches can be classified as strong or weak [4]. The main advantages of strongly pattern-dependent approaches, based on extensive simulations, derive from their accuracy and wide applicability. However, to obtain a complete and accurate power estimation, the designer should provide a comprehensive set of input patterns to be simulated, making this approach very time consuming and computationally costly. To avoid the need for a large amount of input patterns, weakly pattern-dependent approaches require input probabilities, reflecting the typical input behavior; the estimated results, however, depend on the user-supplied input probabilities.
High-level power estimation is a key issue in the early determination of the power budget of embedded systems. However, high-level power estimation methods [5] have not yet achieved the maturity necessary to enable their use within current industrial CAD environments. Our work is an attempt to fill such a gap, by providing a set of metrics, based on a high-level power model, covering the different parts composing the basic architecture of embedded systems. The goal is to widely explore the architectural design space during the system-level partitioning and to retarget architectural design choices early. Accuracy and efficiency should be the driving forces to meet the power requirements while avoiding redesign processes. In general, relative accuracy in high-level power estimation is much more important than absolute accuracy, the main objective being the comparison of different design alternatives [3].
The aim of this paper is to define a power evaluation methodology for codesign. The method is part of a more general HW/SW codesign approach for control dominated embedded systems. The related CAD environment, called TOSCA (TOols for System Codesign Automation) [6], provides, among other design quality estimation techniques, accurate and efficient power metrics to guide the system-level partitioning. Metrics suitable for power evaluation of both the hardware and software parts are defined.
The availability of a high-level power analysis is of paramount importance to obtain early estimation results, while maintaining acceptable accuracy and a competitive global design time. Based on these results, tradeoff considerations can be carried out in a reasonable time, avoiding the need to follow the entire design flow to obtain power comparison results. Our approach can be considered one of the first attempts to cover power estimation issues from a comprehensive HW/SW perspective, mainly focusing on the hardware part and considering a general architecture adopted by most industrial synthesis systems.
The paper is organized as follows. Foundations and notations constituting the background of our analysis are presented in Section II. Power metrics to guide the system-level partitioning are derived in Section III, while the proposed power models for the HW and SW parts are addressed in Sections IV and V, respectively. Simulation results are provided in Section VI, to demonstrate the advantages offered by the proposed methodology during the development of control dominated embedded systems. Finally, concluding remarks are drawn in Section VII.
II. BACKGROUND OF THE ANALYSIS
Let us introduce the general formalism to express power dissipation, the TOSCA codesign framework and the target system architecture.
A. Power Dissipation in CMOS Circuits
Power dissipation in CMOS devices is composed of both a static and a dynamic component. However, the dominant contribution [7] is the dynamic one, expressed by the switching activity power

$P_{avg} = C_{eff} \cdot V_{dd}^2 \cdot f_{clk}$

where $V_{dd}$ is the supply voltage, $f_{clk}$ is the system clock frequency, and $C_{eff}$ is the effective switched capacitance (that is, the product of the physical capacitance of each node in the circuit and the switching activity factor of that node, summed over all the nodes in the circuit).
The switching activity of each signal is fully characterized by a static and a dynamic component. The static component can be expressed in terms of the static signal probability $P_i$ of each node $x_i$, that is, the probability of the node to be at one (therefore, $\mathrm{Prob}(x_i = 1) = P_i$ and $\mathrm{Prob}(x_i = 0) = 1 - P_i$). A signal is called equiprobable when $P_i = 0.5$.
The transition probability $P_{0 \to 1}(x_i)$ is the probability of a zero-to-one transition at node $x_i$. Under the spatial and temporal independence assumption [4], $P_{0 \to 1}(x_i)$ is given by the probability that the current state is zero times the probability that the next state is one, that is, $P_{0 \to 1}(x_i) = (1 - P_i) \cdot P_i$. Under the same assumption, the switching activity of a node is $\alpha_i = 2 \cdot P_i \cdot (1 - P_i)$, while the toggle rate is $T_i = \alpha_i \cdot f_{clk}$.
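As an illustration, the short Python sketch below computes these quantities for a node under the independence assumption above; the function names and the 100 MHz clock are illustrative assumptions, not part of the original formulation.

```python
def transition_probability(p_one):
    """Probability of a 0->1 transition, assuming spatial and
    temporal independence of the node value."""
    return (1.0 - p_one) * p_one

def switching_activity(p_one):
    """Probability of any output transition (0->1 or 1->0)."""
    return 2.0 * p_one * (1.0 - p_one)

def toggle_rate(p_one, f_clk_hz):
    """Expected number of transitions per second at clock f_clk_hz."""
    return switching_activity(p_one) * f_clk_hz

# Example: an equiprobable node (P = 0.5) toggles, on average,
# every other clock cycle.
p = 0.5
print(transition_probability(p))   # 0.25
print(switching_activity(p))       # 0.5
print(toggle_rate(p, 100e6))       # 5.0e7 transitions/s at 100 MHz
```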
B. The TOSCA Codesign Flow
The design flow of the TOSCA codesign environment, where the present work is being integrated, is shown in Fig. 1. The main goal is to reduce the impact of the system integration and design constraint verification bottlenecks on the global design time, thus allowing a cost-effective evaluation of alternative designs.
The design capture is performed via a mixed textual/graphical editor based on an OCCAM2 customization [6], improving user friendliness and gathering in the same design database timing constraints, design requirements, design goals and, possibly, an initial HW versus SW allocation of the modules composing the system. If the latter information is left unspecified by the user, an initial allocation is decided based on the results of a heuristic that statically inspects the properties of the system description. The main part of the codesign flow is represented by the design space exploration, i.e., a "what if" analysis of alternative architectural solutions to discover an acceptable final system modularization and HW versus SW allocation fulfilling the initial requirements and goals. This is obtained by evaluating system properties through a set of metrics and by applying system-level transformations, producing new modularizations of the system specification semantically equivalent to the original one.

When an acceptable partitioning is found, synthesis of the HW and SW parts can be performed. The SW synthesis passes through an intermediate uncommitted format, called virtual instruction set (VIS) [8], allowing the designer to consider the timing performance when different CPU cores are employed and making possible a flexible simulation of the cooperating HW and SW based on the same VHDL simulator engine. HW-bound modules and interfaces are automatically converted into suitable VHDL templates. Finally, simulation of the HW/SW system is performed, considering the side-effects due to the HW/SW bused communication and the different performance of the HW and SW technologies.

The task of system-level partitioning should provide alternative solutions in terms of the cost/performance ratio. To carry out the partitioning process with respect to the design constraints, it is necessary to define a cost function based on a set of metrics. Thus, a preliminary and iterated phase is a metric-based analysis of the system-level description. Design metrics, considering the contribution of both the HW and SW parts, can be conceived to evaluate the quality of a partitioning solution in terms of fulfillment of several design optimization criteria [6], such as performance, cost, resource exploitation, communication and power consumption. The current version of TOSCA evaluates a set of static and dynamic metrics, based on the analysis of the object-oriented representation of the specification, high-level simulation and profiling. Metrics to evaluate area and performance are described in [6], while metrics for power analysis are the subject of this paper.
C. The Target System Architecture
The system-level architecture of the embedded system is implemented within a single ASIC, including both the HW and SW parts. The target architecture is presented in Fig. 2.
The single ASIC architecture is defined at the RT-level and it is composed of the following parts.
1) Data Path-including storage units, functional units, and multiplexers. A two-level multiplexer structure is considered for the interconnection among registers and functional units, and the typical operations imply a register-to-register transfer.
2) Main Memory-to be accessed through input/output registers.
3) Control Unit-implemented as a set of finite state machines (FSM's).
4) Embedded Core Processor-such as a general-purpose standard processor, a microcontroller, a DSP, etc., with its memory (even if part of the memory can be external), implementing the SW part.
5) Clock Distribution Logic-including the buffers of the clock distribution network.
6) Crossbar Network-to interface the architectural units by using a communication protocol at the system level.
7) Primary I/O's-to interface with the external environment.
III. HIGH-LEVEL POWER ESTIMATION METRICS
Our goal is to define power metrics to be applied at the system level to measure and compare the power consumption of several design alternatives. In general, it is quite difficult to define a single metric suitable for accurate and efficient power assessment of all embedded system applications. Thus, we first classify embedded systems depending on their constraints and computational modes, and then we propose a set of metrics for each class of systems.
We can divide embedded systems into timing-constrained systems, if speed is the most important design constraint, and area-constrained systems, if area is the most important constraint. Several computational modes characterize the timing-constrained systems, depending on the system throughput, defined as the number of operations performed in a given time [7]. For microprocessor-based embedded systems, we can define three main modes of computation: fixed throughput mode, maximum throughput mode, and burst throughput mode; the latter is characterized by a fraction of time performing useful computations, during which the maximum throughput is required, while during the rest of the time the system is in an idle state, such as between user requests. Since the power budget strictly depends on the computational mode for which the embedded system is dedicated, a specific power metric can be defined for each of the above operating modes [7].
For fixed throughput systems, a suitable metric is represented by the power/throughput ratio or, equivalently, the energy per operation. Since the throughput is fixed, if a partitioning solution leads to a reduction of such a metric with respect to an initial partitioning, the corresponding power dissipation is reduced.
For maximum throughput systems, the most appropriate metric should account for both the low-power and the high-performance needs. A suitable metric is thus the energy to throughput ratio (ETR), defined as in [7]

$\mathrm{ETR} = \frac{E_{op}}{T_{max}}$

where $E_{op}$ is the energy per operation (or, equivalently, the power per throughput) and $T_{max}$ is the maximum throughput. Hence, the ETR metric can also be expressed as $\mathrm{ETR} = \mathrm{Power}/\mathrm{Throughput}^2$. The ETR metric expresses the concept of joint optimization of throughput and power dissipation. A partitioning corresponding to a lower value of ETR represents a solution with lower energy per operation for equal throughput, as well as a solution with greater throughput for the same energy per operation.
For systems operating in the burst throughput mode, the power metric should provide power reduction during both idle and computing time, and throughput optimization when computing. For those systems applying power shut-down techniques during idle cycles, an efficient metric is just the ETR, since the power dissipation is completely eliminated when idling. For those systems not supporting power saving modes, a more effective metric is [7]

$\mathrm{ETR}_{burst} = \frac{E_{tot}/N_{op}}{T_{max}}$     (1)

where $E_{tot}/N_{op}$ is the total energy dissipated when computing and idling per total number of operations and $T_{max}$ is the maximum throughput.
For those area-constrained systems for which the target area is fixed, a valid metric is represented by the power by area product $P \cdot A$ (or, equivalently, the power/area ratio). Since the area is fixed, a reduction in the value of $P \cdot A$ corresponds to a minimization of the power consumption.
In general, for those area-constrained systems aiming at both power and area reduction, a good metric is given by the energy per operation by area product (EAP)

$\mathrm{EAP} = E_{op} \cdot A$

where $E_{op}$ is the energy per operation (or, equivalently, the power per throughput) and $A$ is the area. Hence, the EAP metric can also be expressed as $\mathrm{EAP} = (\mathrm{Power}/\mathrm{Throughput}) \cdot \mathrm{Area}$. The EAP metric expresses the concept of joint optimization of area and power dissipation. A partitioning with a lower value of EAP represents a solution with lower energy per operation for the same area, as well as a solution with lower area for the same energy per operation.
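To make the comparison of alternative partitionings concrete, the following Python sketch evaluates the metrics defined above for a candidate solution; the numeric values and helper names are illustrative assumptions only.

```python
def energy_per_operation(power_w, throughput_ops_s):
    """Energy per operation E_op = Power / Throughput (J/op)."""
    return power_w / throughput_ops_s

def etr(power_w, throughput_ops_s):
    """Energy-to-throughput ratio: ETR = E_op / T = Power / Throughput^2."""
    return energy_per_operation(power_w, throughput_ops_s) / throughput_ops_s

def etr_burst(total_energy_j, total_ops, max_throughput_ops_s):
    """Burst-mode metric (1): total energy (computing + idling) per
    operation, divided by the maximum throughput."""
    return (total_energy_j / total_ops) / max_throughput_ops_s

def eap(power_w, throughput_ops_s, area_mm2):
    """Energy-per-operation by area product."""
    return energy_per_operation(power_w, throughput_ops_s) * area_mm2

# Comparing two hypothetical partitionings of the same specification:
# the one with the lower ETR (maximum throughput mode) or the lower
# EAP (area-constrained case) is preferred.
print(etr(power_w=0.5, throughput_ops_s=20e6))
print(eap(power_w=0.5, throughput_ops_s=20e6, area_mm2=12.0))
```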
The models used to estimate the power terms (for both the SW and HW parts) contained in the above equations are detailed in the next sections. The power assessment of the SW side is based on the system-level specification described at the VIS level, while the analysis of the HW side is related to the VHDL description of the ASIC model at the behavioral/RT level. The methodology proposed in [6] can be used to evaluate the area and throughput terms in the above metrics.
IV. POWER ESTIMATION FOR THE HW PART
The power model for the HW-bound part is based on the VHDL description of the ASIC at the behavioral/RT levels and the probabilistic estimation of the internal switching activity. The proposed approach is based on the following general assumptions:
1) the supply and ground voltage levels in the ASIC are fixed, although it is worth noting the impact of supply voltage reduction on power;
2) the design style is based on synchronous sequential circuits;
3) the data transfer occurs at the register-to-register level;
4) a zero delay model (ZDM) has been adopted, thus ignoring the contribution of glitches and hazards to power.

The inputs for the estimation are as follows:

1) the ASIC specification-consisting of a hierarchical VHDL description of the target system architecture;
2) the allocation library-composed of the available components implementing the macro-modules (such as adders, multipliers, etc.) and the basic modules (such as registers, multiplexers, logic gates, I/O pads, etc.);
3) the technological parameters-such as frequency, power supply, derating factors, etc.;
4) the switching activity-of the ASIC primary I/O's.

The power model is an analytical model, where the average power of the VHDL descriptions is related to the physical capacitance and the switching activity of the nets. The estimation approach is hierarchical: at the highest hierarchical level, ad hoc analytical power models for each part of the target system architecture are proposed; these models are in turn based on a macro-module library at the lowest hierarchical levels. Furthermore, to avoid simulating a large amount of input patterns, our approach is weakly pattern-dependent. User-supplied input probabilities are required, reflecting the typical input behavior and derived from the system-level specification.
In the proposed single-ASIC architecture, the total average power dissipated is given by

$P_{ASIC} = P_{IO} + P_{core}$     (2)

where $P_{IO}$ and $P_{core}$ are the average power dissipated by the I/O nets and by the core internal nets, respectively. The power model of the core logic is based on the models of the different components of the target system architecture, therefore the term $P_{core}$ can in turn be expressed as

$P_{core} = P_{DP} + P_{MEM} + P_{CU} + P_{CPU}$     (3)

where the single terms represent the average power dissipated by the data-path, the memory, the control logic, and the embedded core processor. The power models related to the single terms in the above equations will be detailed in the following subsections, except for the $P_{CPU}$ term, which is considered part of the power dissipated by the SW-bound part, detailed in Section V.
A. $P_{IO}$ Estimation
Although a presynthesis analysis is performed, we assume the knowledge of the ASIC interface in terms of primary I/O pad characteristics and of the related switching activity from the system-level specifications. The set of input, output, and bidirectional nets of the ASIC can be partitioned into $K$ sets $\{S_1, \ldots, S_K\}$, where the $k$th set $S_k$ is composed of I/O pads of the same type. Considering, for example, a set $S_k$ of output pads, the average power of the set can be estimated as

$P_{S_k} = \sum_{j=1}^{N_k} T_j \cdot P^{MHz}_j(C_{load})$     (4)

where $N_k$ is the number of output pads in the set $S_k$, $T_j$ is the toggle rate of the $j$th output pad, derived from the system-level specifications, and $P^{MHz}_j(C_{load})$ is the average power consumption per MHz of the $j$th output pad in $S_k$, as a function of the output load $C_{load}$, at a given reference frequency.
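A minimal sketch of this pad-level estimate follows, assuming the library provides, for each pad type, a power-per-MHz figure as a function of the output load; all names and numbers are illustrative.

```python
def pad_set_power(toggle_rates_mhz, power_per_mhz_mw):
    """Average power of a set of same-type output pads, following (4):
    sum over pads of toggle rate (MHz) times power-per-MHz (mW/MHz)."""
    return sum(t * p for t, p in zip(toggle_rates_mhz, power_per_mhz_mw))

# Hypothetical set of four output pads driving the same load:
toggle_rates = [10.0, 25.0, 5.0, 40.0]        # MHz, from system-level specs
power_per_mhz = [0.06, 0.06, 0.06, 0.06]      # mW/MHz at the given load
print(pad_set_power(toggle_rates, power_per_mhz))  # estimated mW for the set
```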
B. $P_{DP}$ Estimation
The average power dissipated by the data-path can be expressed as

$P_{DP} = P_{REG} + P_{MUX} + P_{FU}$     (5)

where the single terms represent the average power dissipated by the registers, the multiplexers, and the functional units.
Concerning the $P_{REG}$ term, live variable analysis has been applied to the behavioral-level VHDL code to estimate the number of required registers and the maximum switching activity of each register. The preliminary step is the estimation of the number of required registers and, consequently, of the toggle rate of each of them. Depending on the abstraction level, such data are either directly available from the RT-level description, or the live variable analysis can be applied to the behavioral-level specifications.
The algorithm [9] examines the life of a variable over a set of VHDL code statements to derive information concerning the register switching activity; it can be summarized as follows (a short illustrative sketch is given after the register power expressions below).
1) Compute the lifetimes of all the variables in the given VHDL code, composed of $N$ statements. A variable $v$ is said to live over a set of sequential code statements when the variable is written in statement $s_i$ and last accessed in statement $s_j$. When a variable is written in a statement of the set, but last used in the same statement of the next iteration, it is assumed to live over the entire set.
2) Represent the lifetime of each variable as a vertical line from statement $s_i$ through statement $s_j$ in the column reserved for the corresponding variable $v$.
3) Determine the maximum number of overlapping lifetimes, computing the maximum number of vertical lines intersecting any horizontal cut-line.
4) Estimate the minimum number $R_w$ of sets of registers necessary to implement the code by using register sharing, which can be applied whenever a group of variables with the same bit-width $w$ can be mapped to the same register. The total number of registers $R$ is given by the sum of all the $R_w$.
5) Select a possible mapping of variables into registers by using register sharing.
6) Compute the number of writes $W_k$ to the variables mapped to the same set of registers.
7) Estimate the switching activity $\alpha_k$ of each set of registers by dividing $W_k$ by $N$; hence, $\alpha_k = W_k / N$, with corresponding toggle rate $T_k = \alpha_k \cdot f_{clk}$.

The value of $P_{REG}$ considers that the power of latches and flip-flops is consumed not only during output transitions, but also during all clock edges by the internal clock buffers, even when the data stored in the register do not change. Thus, our analytical model of registers takes into account both the switching and the nonswitching power. Let the set of registers be composed of $H$ sets $\{R_1, \ldots, R_H\}$, where the $k$th set $R_k$ is composed of registers of the same type; the average register power can then be estimated as

$P_{REG} = \sum_{k=1}^{H} \left( P_{sw,k} + P_{nsw,k} \right)$     (6)

where $P_{sw,k}$ is the average switching power of each set and $P_{nsw,k}$ is the corresponding average nonswitching power, that is, the average power dissipated by the internal clock buffers when there are no output transitions. The estimated value of $P_{sw,k}$ accounts for the toggle rate $T_k$, while the estimated value of $P_{nsw,k}$ should consider a toggle rate of $(f_{clk} - T_k)$.
The estimated values of $P_{sw,k}$ and $P_{nsw,k}$ for the $k$th set (constituted by an estimated number $R_k$ of registers) are respectively given by

$P_{sw,k} = R_k \cdot T_k \cdot P^{MHz}_k(C_{load}), \qquad P_{nsw,k} = R_k \cdot (f_{clk} - T_k) \cdot P^{MHz}_{nsw,k}$     (7)

where $P^{MHz}_k(C_{load})$ is the average power consumption per MHz of the $k$th register type as a function of the output load, and $P^{MHz}_{nsw,k}$ is the nonswitching power consumption per MHz of a single register of type $k$, which is load-independent.
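The following Python sketch illustrates steps 1)-4) of the live variable analysis (maximum number of overlapping lifetimes) together with the per-set register power of (6) and (7); the variable lifetimes, library figures, and names are illustrative assumptions.

```python
def max_overlapping_lifetimes(lifetimes):
    """lifetimes: list of (first_write, last_access) statement indices.
    Returns the maximum number of lifetimes crossing any statement,
    i.e., the estimated number of registers needed for that bit-width."""
    if not lifetimes:
        return 0
    last = max(end for _, end in lifetimes)
    return max(sum(1 for s, e in lifetimes if s <= cut <= e)
               for cut in range(last + 1))

def register_set_power(num_regs, toggle_rate_mhz, f_clk_mhz,
                       p_sw_mw_per_mhz, p_nsw_mw_per_mhz):
    """Switching + nonswitching power of a set of identical registers,
    following (6)-(7): switching part weighted by the toggle rate,
    nonswitching part weighted by the remaining clock edges."""
    p_sw = num_regs * toggle_rate_mhz * p_sw_mw_per_mhz
    p_nsw = num_regs * (f_clk_mhz - toggle_rate_mhz) * p_nsw_mw_per_mhz
    return p_sw + p_nsw

# Hypothetical lifetimes of five 8-bit variables over 10 statements:
lifetimes_8bit = [(0, 4), (2, 7), (3, 9), (5, 9), (8, 9)]
n_regs = max_overlapping_lifetimes(lifetimes_8bit)   # 3 shared registers
print(n_regs)
print(register_set_power(n_regs, toggle_rate_mhz=12.5, f_clk_mhz=50.0,
                         p_sw_mw_per_mhz=0.02, p_nsw_mw_per_mhz=0.004))
```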
Let us now consider the estimation of the power related to the multiplexers. First, to estimate the size and number of multiplexers from the VHDL code, it is necessary to determine the number of paths in the data-path. Then, the approach is based on the definition of the power model of a two-input noninverting multiplexer, based on both the static signal probability of the selection net and the switching activities of the input nets. Given the pass-gate model of the two-input noninverting multiplexer, a simplified model for the maximum switching activity of its output is

$\alpha_U = P_S \cdot \alpha_A + (1 - P_S) \cdot \alpha_B$     (8)

where $\alpha_A$ and $\alpha_B$ are the switching activities of inputs $A$ and $B$, respectively, while $P_S$ is the static signal probability of the selection net $S$. Globally, the average power dissipated by the multiplexers, $P_{MUX}$, can be estimated as the sum of the average power contributions of the single multiplexers.
For the estimation of the average power of the functional units, $P_{FU}$, we use complexity-based analytical models [3], where the complexity of each functional unit is described, in a library of macromodules, in terms of equivalent gates. The estimated power dissipated by the functional units can then be expressed as the sum of the contributions of the average power consumption $P_{FU_i}$ of the $i$th macromodule, given by

$P_{FU_i} = K_{tech} \cdot G_i \cdot T_i$     (9)

where $K_{tech}$ is a technological parameter expressed in W/(gate $\cdot$ MHz), $G_i$ is the estimated number of logic gates in the $i$th macrofunction, and $T_i$ is the toggle rate of the output net of the $i$th macromodule.
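A sketch of the complexity-based estimate of (9), with the multiplexer output activity of (8) included for completeness, is shown below; the technology constant, gate counts, and probabilities are illustrative assumptions.

```python
def mux2_output_activity(alpha_a, alpha_b, p_sel):
    """Simplified output switching activity of a two-input mux (8)."""
    return p_sel * alpha_a + (1.0 - p_sel) * alpha_b

def macromodule_power(k_tech_w_per_gate_mhz, num_gates, toggle_rate_mhz):
    """Complexity-based average power of one macromodule (9)."""
    return k_tech_w_per_gate_mhz * num_gates * toggle_rate_mhz

# Hypothetical 16-bit adder macromodule from the allocation library:
k_tech = 1.0e-6          # W/(gate*MHz), technology-dependent parameter
gates = 320              # equivalent gates of the macromodule
toggle = 8.0             # MHz, toggle rate of the output net
print(macromodule_power(k_tech, gates, toggle))       # estimated watts

print(mux2_output_activity(alpha_a=0.3, alpha_b=0.1, p_sel=0.7))
```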
C. $P_{MEM}$ Estimation
Considering a fully CMOS single-port static RAM, at a high level of abstraction we assume to have in the target library the information related to the power consumption of a single memory cell and of a single memory output buffer. The average power dissipated during a read access to a single row of the array, composed of $n$ rows and $m$ columns, is proportional to the inverse of the read access time and to the sum of the average power dissipated by the following blocks: the row decoder, the $m$ memory cells composing the accessed row, and the output buffers. In particular, the power dissipated by the row decoder can be estimated with a complexity-based model, where the number of equivalent gates grows with the size of the array and the load capacitance is the word-line capacitance.
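A rough sketch of this read-access model follows, interpreting the per-block contributions as energies per access so that dividing by the read access time yields power; this interpretation and all numeric values are assumptions, since the paper only states that the decoder, accessed-row, and output-buffer contributions are summed and scaled by the inverse of the read access time.

```python
def sram_read_power(e_decoder_j, e_cell_j, columns, e_buffer_j, n_buffers,
                    read_access_time_s):
    """Average power of a read access to one row of an n x m SRAM:
    per-access energy of decoder + m cells + output buffers, divided
    by the read access time (i.e., proportional to its inverse)."""
    e_row = e_decoder_j + columns * e_cell_j + n_buffers * e_buffer_j
    return e_row / read_access_time_s

# Hypothetical 256 x 16 single-port SRAM:
print(sram_read_power(e_decoder_j=4e-12, e_cell_j=0.3e-12, columns=16,
                      e_buffer_j=1.5e-12, n_buffers=16,
                      read_access_time_s=10e-9))   # watts during the access
```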
D. $P_{CU}$ Estimation
This section describes the contribution to the power consumption due to the control part of the target system architecture, described as a set of finite-state machines (FSM's) represented by state transition graphs (STG's). The proposed FSM power model is a probabilistic model, where we approximate the average switching activities of the FSM nodes by using the switching probabilities (or transition probabilities) derived by modeling the FSM as a Markov chain. Given a typical implementation of a FSM, composed of a combinational circuit and a set of state registers, we consider the different contributions to the global average power

$P_{CU} = P_{IN} + P_{SR} + P_{COMB} + P_{OUT}$     (10)

where $P_{IN}$ is the average power dissipated by the primary inputs, $P_{SR}$ is the average power dissipated by the state registers, $P_{COMB}$ is the average power dissipated by the combinational logic, and, finally, $P_{OUT}$ is the average power dissipated by the primary outputs.
The input static signal probabilities and the input switching activity factors are obtained from the system-level specifications, being derived either by simulating the FSM at a high abstraction level or by direct knowledge of the typical input behavior. Furthermore, we assume a ZDM for the logic gates and synchronous primary inputs. Under these assumptions, we can ignore the effects of glitches and hazards on the state bit lines, therefore the switching activities of the present and next state bit lines are equal.
Let the FSM, composed of $M$ states, be described by an STG composed of $M$ vertices, corresponding to the states in the set $S = \{s_1, \ldots, s_M\}$, and of the related directed edges. The edges are labeled with the set of input configurations that cause a transition from the source state to the destination state. Considering a transition from state $s_i$ to state $s_j$, we can compute the factor $p_{ij}$, called conditional state transition probability, that represents the conditional probability of the transition from state $s_i$ to state $s_j$ given that the FSM was in state $s_i$, i.e., $p_{ij} = \mathrm{Prob}(\mathrm{Next} = s_j \mid \mathrm{Present} = s_i)$. The computation of the $p_{ij}$'s can be carried out as in [10], assuming totally independent primary inputs, $P_k$ being the static signal probability of input $x_k$. The steady-state probability $q_i$ of a state $s_i$ is defined as the probability of being in state $s_i$ in an arbitrarily long random sequence [11]. Computing the $q_i$'s implies solving the system composed of the Chapman-Kolmogorov equations and of the equation representing the normality condition:

$\mathbf{q} \cdot \mathbf{P} = \mathbf{q}, \qquad \sum_{i=1}^{M} q_i = 1$     (11)

where $\mathbf{q}$ is the row vector of the steady-state probabilities $q_i$ and $\mathbf{P}$ is the matrix of the conditional state transition probabilities $p_{ij}$. Note that the above system has $M + 1$ equations and $M$ unknowns, thus one of the Chapman-Kolmogorov equations can be dropped [10]. Given the state probabilities $q_i$ and the conditional state transition probabilities $p_{ij}$, the total state transition probability between the two states $s_i$ and $s_j$ can be expressed as $P_{ij} = q_i \cdot p_{ij}$. Given a state encoding, the next steps are the estimation of the switching activity of the state bit lines and of the primary outputs. The switching activity of the state bit lines depends on both the state encoding and the total state transition probabilities between each pair of states in the STG. Let us generalize the concept of state transition probability to transitions occurring between two distinct subsets of disjoint states, $A$ and $B$, contained in the set of states $S$, as defined in [11]:

$P_{A \to B} = \sum_{s_i \in A} \sum_{s_j \in B} q_i \cdot p_{ij}.$
Let $c_i^{(b)}$ be the $b$th bit of the state code of state $s_i$ (called state bit) and $N_B$ the number of state bits; we consider the two sets of states $S_1^{(b)}$ and $S_0^{(b)}$ in which the $b$th state bit assumes the value one and zero, respectively. The switching activity of the $b$th state bit line is then given by [11]

$\alpha_{SB}^{(b)} = P_{S_1^{(b)} \to S_0^{(b)}} + P_{S_0^{(b)} \to S_1^{(b)}}.$
In a Moore-type FSM, the total state transition probability between the two states $s_i$ and $s_j$ is equal to the total transition probability between the corresponding outputs $o_i$ and $o_j$, where the output row vector is composed of the primary outputs. Let us define the transition probability of the transitions occurring between two distinct subsets of disjoint outputs, $O_A$ and $O_B$, contained in the set of the outputs $O$, as

$P_{O_A \to O_B} = \sum_{o_i \in O_A} \sum_{o_j \in O_B} q_i \cdot p_{ij}.$
Let $o_i^{(b)}$ be the $b$th output bit and $N_O$ the number of primary outputs; we consider the two sets of outputs $O_1^{(b)}$ and $O_0^{(b)}$ in which the $b$th output bit assumes the value one and zero, respectively. The switching activity of the $b$th primary output is then given by

$\alpha_{OUT}^{(b)} = P_{O_1^{(b)} \to O_0^{(b)}} + P_{O_0^{(b)} \to O_1^{(b)}}.$
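The sketch below solves (11) for the steady-state probabilities and derives the state bit line switching activities for a given encoding; it is a minimal illustration using NumPy, with a hypothetical three-state FSM as input.

```python
import numpy as np

def steady_state(P):
    """Solve q P = q with sum(q) = 1: drop one Chapman-Kolmogorov
    equation and add the normality condition (11)."""
    n = P.shape[0]
    A = P.T - np.eye(n)             # (P^T - I) q^T = 0
    A[-1, :] = 1.0                  # replace one equation by sum(q) = 1
    b = np.zeros(n)
    b[-1] = 1.0
    return np.linalg.solve(A, b)

def state_bit_activity(P, q, codes, bit):
    """Switching activity of one state bit line: total probability of
    transitions between states whose encodings differ in that bit."""
    n = P.shape[0]
    return sum(q[i] * P[i, j]
               for i in range(n) for j in range(n)
               if codes[i][bit] != codes[j][bit])

# Hypothetical 3-state FSM: conditional transition probability matrix
# (rows sum to one) and a binary state encoding.
P = np.array([[0.2, 0.5, 0.3],
              [0.6, 0.1, 0.3],
              [0.4, 0.4, 0.2]])
codes = ["00", "01", "10"]

q = steady_state(P)
print(q)                                   # steady-state probabilities
for b in range(2):
    print(b, state_bit_activity(P, q, codes, b))
```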
At this point of the analysis, we can detail the different power terms contained in the expression of $P_{CU}$. The average power dissipated by the $k$th primary input, belonging to the set of $N_I$ primary inputs, depends on the switching activity factor $\alpha_k$ and on the input load capacitance $C_k$, the latter being proportional to the number of literals that the $k$th primary input drives in the combinational part and to the estimated capacitance due to each literal [11]. Therefore, the average power $P_{IN}$ can be estimated as

$P_{IN} = \sum_{k=1}^{N_I} T_k \cdot P^{MHz}_k(C_k)$     (14)

where $T_k = \alpha_k \cdot f_{clk}$ and $P^{MHz}_k(C_k)$ is the average power consumption per MHz of the cell driving the $k$th input.
The average power dissipated by the state registers can be derived by using the switching activity $\alpha_{SB}^{(b)}$ of the $b$th state bit line, where $b = 1, \ldots, N_B$, and the corresponding toggle rate $T_{SB}^{(b)} = \alpha_{SB}^{(b)} \cdot f_{clk}$.
The term $P_{SR}$ accounts for the switching and nonswitching power of the state registers:

$P_{SR} = \sum_{b=1}^{N_{SR}} \left( P_{sw}^{(b)} + P_{nsw}^{(b)} \right)$     (15)

where $N_{SR}$ is the number of state registers and $P_{sw}^{(b)}$ and $P_{nsw}^{(b)}$ are the average switching and nonswitching power dissipated by each state register. The $P_{sw}^{(b)}$ terms should account for a toggle rate given by $T_{SB}^{(b)}$, while the $P_{nsw}^{(b)}$ terms should consider a toggle rate of $(f_{clk} - T_{SB}^{(b)})$.

The average power dissipated by the combinational logic, $P_{COMB}$, has been estimated by considering a two-level logic implementation, before the minimization step. The $b$th state bit line (where $b = 1, \ldots, N_B$) can be expressed in canonical form as a sum of minterms, where $L$ is the number of literals and $2^L$ is the maximum number of minterms. Similarly, the $b$th output bit can be expressed in canonical form as a sum of minterms. Let us assume that a single AND gate is used to represent the generic minterm; hence, the maximum number of AND gates in the AND-plane is $2^L$, while in general the actual number $N_{AND}$ satisfies $N_{AND} \le 2^L$. Given the probabilistic model of the switching activity of the generic $L$-input AND gate, we can derive an upper bound for the estimated power of the AND-plane:

$P_{AND} = \sum_{g=1}^{N_{AND}} T_g \cdot P^{MHz}_g(C_g)$     (16)

where $P^{MHz}_g(C_g)$ is the average power consumption per MHz of the $g$th $L$-input AND gate, $C_g$ is the capacitance driven by the $g$th $L$-input AND gate, and $T_g$ is the toggle rate of the $g$th $L$-input AND gate (derived by using the switching activity model of the $L$-input AND gate). $P_{OR}$ is the average power dissipated by the OR-plane, which is composed of OR gates (one per state bit line) driving the input capacitance of the state registers, and of OR gates (one per primary output) driving the output load capacitances.
Therefore, the upper bound for the power of the OR-plane is composed of two terms; the first term is proportional to the switching activity factors of the state bit lines, while the second term is proportional to the switching activity factors of the primary outputs:

$P_{OR} = \sum_{b=1}^{N_B} T_{SB}^{(b)} \cdot P^{MHz}_{OR,SB_b}(C_{SR}) + \sum_{b=1}^{N_O} T_{OUT}^{(b)} \cdot P^{MHz}_{OR,OUT_b}(C_{OUT}^{(b)})$

where $P^{MHz}_{OR,SB_b}(C_{SR})$ is the average power consumption per MHz of the OR gate driving the $b$th state bit line, $C_{SR}$ is the input capacitance of each state register, $T_{SB}^{(b)}$ is the toggle rate of the $b$th state bit line, $P^{MHz}_{OR,OUT_b}(C_{OUT}^{(b)})$ is the average power consumption per MHz of the OR gate driving the $b$th primary output, $C_{OUT}^{(b)}$ is the output load capacitance of the $b$th primary output, and, finally, $T_{OUT}^{(b)}$ is the toggle rate of the $b$th primary output.
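A minimal sketch of the AND-plane upper bound of (16) follows, using the independence-based activity model of Section II for an L-input AND gate; the minterm literal probabilities and library figures are illustrative assumptions.

```python
def and_gate_activity(input_probs):
    """Output switching activity of an L-input AND gate, assuming
    independent inputs: P(out=1) is the product of the input
    probabilities, and alpha = 2 * P * (1 - P) as in Section II."""
    p_out = 1.0
    for p in input_probs:
        p_out *= p
    return 2.0 * p_out * (1.0 - p_out)

def and_plane_power_upper_bound(minterm_input_probs, f_clk_mhz,
                                p_mw_per_mhz):
    """Upper bound of (16): one AND gate per minterm, each weighted by
    its toggle rate and its library power-per-MHz figure."""
    return sum(and_gate_activity(probs) * f_clk_mhz * p_mw_per_mhz
               for probs in minterm_input_probs)

# Two hypothetical 3-literal minterms with given literal probabilities:
minterms = [[0.5, 0.5, 0.5], [0.9, 0.2, 0.5]]
print(and_plane_power_upper_bound(minterms, f_clk_mhz=50.0,
                                  p_mw_per_mhz=0.003))   # estimated mW
```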
V. POWER ESTIMATION FOR THE SW PART
The software power assessment in TOSCA is performed following a bottom-up approach. Each software-bound part of the OCCAM2 specification is considered in terms of basic blocks and is compiled into the VIS. Hence, the power analysis is performed at the VIS level, by considering the average power consumption of each VIS instruction during the execution of a given program. The choice to work at the VIS level is motivated by the goal of making our analysis processor-independent.
In general, the average power dissipated by a processor while running a program is $P = I_{avg} \cdot V_{dd}$, where $I_{avg}$ is the average current and $V_{dd}$ is the supply voltage. The associated energy is given by $E = P \cdot t_{exe}$, where $t_{exe}$ is the execution time of the software program, which can be expressed as $t_{exe} = N_{cycles} \cdot \tau_{clk}$, $N_{cycles}$ being the number of clock cycles needed to execute the program and $\tau_{clk}$ the clock period. To compute the average current drawn during the execution of each instruction, it is necessary either to perform measurements of the energy cost of each instruction, such as those proposed in [12], [13], or to have detailed power information provided by the processor supplier, in terms of the energy dissipated by each type of instruction in the instruction set. This latter power information can be derived by the processor supplier by simulating the execution of instruction sequences on a lower level (circuit or layout) or gate-level model of the processor, to obtain an estimate of the current drawn. Based on this information, a power table can be derived for each processor, reporting the energy consumption for each instruction in the instruction set and for all the possible addressing modes associated with each instruction type.

Additional contributions to the global energy derive from interinstruction effects, which are not considered when computing the base cost of each instruction. The possible interinstruction effects are mainly related to the previous state of the processor, to the limited number of resources leading to pipeline and write buffer stalls, and to the rate of cache misses [12], [13]. The condition of the processor in the previous clock cycle may cause an energy overhead due to the different switching activities on the data and address busses and to the different internal behavior of the processor. In general, the previous state of the circuit changes during program execution, since there is a switching from one instruction to another, as opposed to the execution of the program used for the measurements of the base energy, where the same instruction was executed many times. The circuit state overhead has been measured in [12] by considering all the possible instruction pairs, and it turns out to be approximately less than 5% of the base energy per instruction. This overhead has been considered in [12] as an average constant value to be added to the base cost, without a significant loss of precision. The effects of resource constraints and cache misses on the power budget have also been measured in [12]. However, these effects can usually be neglected in embedded software based on both simple microcontrollers (e.g., M68000, Intel 8051, Z80, etc.), where such advanced features can be absent, and advanced processors, achieving cache hit-rates over 98% and providing a full exploitation of the pipeline stages.
Once the power analysis is completed for all the basic VIS-level instructions, the analysis is extended to upper-level software modules, by weighting the power consumption of each basic block according to its execution frequency.
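The following Python sketch mirrors this bottom-up accumulation: per-instruction base energies (plus a constant interinstruction overhead) are summed over each basic block, and blocks are weighted by their execution counts obtained from profiling. The energy table, overhead factor, and block contents are illustrative assumptions, not values from the paper.

```python
# Hypothetical per-instruction base energy table (nJ) for a VIS-like
# instruction set; real tables come from supplier data or measurements.
BASE_ENERGY_NJ = {"load": 3.2, "store": 3.5, "add": 2.1, "mul": 4.0, "br": 2.4}
INTER_INSTR_OVERHEAD = 0.05   # ~5% constant circuit-state overhead [12]

def block_energy_nj(instructions):
    """Energy of one basic block: sum of base costs plus a constant
    interinstruction overhead applied to each instruction."""
    return sum(BASE_ENERGY_NJ[i] * (1.0 + INTER_INSTR_OVERHEAD)
               for i in instructions)

def program_energy_nj(blocks, exec_counts):
    """Weight each basic block's energy by its execution frequency."""
    return sum(block_energy_nj(b) * exec_counts[name]
               for name, b in blocks.items())

blocks = {"B0": ["load", "add", "store"],
          "B1": ["load", "mul", "add", "br"]}
exec_counts = {"B0": 1000, "B1": 250}     # from high-level profiling
print(program_energy_nj(blocks, exec_counts))   # total energy in nJ
```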
VI. SIMULATION RESULTS
Since we are focusing on control dominated embedded systems, we report some results derived from the application of the proposed power model to a set of 35 FSM's derived from the MCNC-91 benchmark suite. The measures have been derived by using the HCMOS6 technology, featuring a 0.35 µm minimum channel length and a 3.3 V supply voltage. The estimates obtained by applying the proposed methodology at the presynthesis level have been compared with the results derived by using the Synopsys Design Power tool, based on the synthesized gate-level netlist. Note that both methods are based on a ZDM. Fig. 3 summarizes the results. Considering the sequential power, the proposed model shows an average percentage error of 9.52% (ranging from 0.01% to 25.8%) with respect to the Design Power estimates. Concerning the combinational and total power, the average percentage errors are equal to 9.21% and 8.17%, respectively. Globally, the relative accuracy of our results compared with the Design Power results is considered satisfactory at this level of abstraction.
VII. CONCLUSIONS
The proposed analysis addresses the problem of power estimation for control oriented embedded systems implemented on a single ASIC. The main goal has been to offer a power-oriented codesign methodology, with particular emphasis on power metrics, to compare different design solutions described at high abstraction levels. Power models for both the HW and SW parts have been presented. The paper covers the HW part in more detail, since it is usually the more complicated part to estimate with acceptable precision, due to its heterogeneous nature. As has been shown, the proposed approach is quite general, since it considers both implementation domains as well as all the subparts that typically constitute the HW side of an embedded system. The added value has been the introduction of a third dimension, power, into the speed versus area space where the architectural design exploration is usually carried out. Finally, experimental results on benchmark circuits have shown a sufficient relative accuracy with respect to gate-level power estimates.
The approach is limited by the fact that, at present, the proposed power model is tailored to the target system architecture shown in Fig. 2 and that only the average power consumption is considered. However, the inclusion of the peak power could be obtained by considering maximum switching activity values at the input/output nodes. Moreover, work is in progress to define a power model suitable for the HW/SW communication part.
