Power dissipation has become one of the main constraints in the design of embedded systems and VLSI circuits in recent years, due to the continuous increase in integration level and operating frequency. The aim of this paper is to present an innovative conceptual framework for accurate and efficient estimation of power dissipation in embedded systems described in VHDL at the behavioral and Register-Transfer levels. The goal is to give the designer the capability of analyzing and comparing different solutions in the architectural design space before synthesis. The analytical power model is hierarchical and considers the different parts of the target system architecture, mainly the data-path, the memory, the control logic and the embedded core processor. Experimental results have been obtained by applying the proposed power model to benchmark circuits.
Introduction
An increasing number of applications in fields such as automotive, telecommunications and consumer electronics are being implemented with embedded systems. These systems have become widely used in recent years, owing to the broad market availability of standard processors offering high performance at reasonable prices. We refer to embedded systems as dedicated computing and control systems designed for specific target applications [28, 6], where dedicated software routines are provided with the system to respond to specific requirements. In general, the functionality of an embedded system consists of a fixed number of operating modes and is determined by the interaction between the system and its environment. Depending on the application class to which the system is dedicated, embedded systems can be classified as data-dominated or control-dominated systems [28]. In both cases, the target system architecture is composed of a hardware part and a software part. The software part typically consists of a set of application-specific software routines running on a dedicated processor or ASIP (Application Specific Instruction Processor), while the hardware part usually consists of one or more ASICs (Application Specific ICs). Because of the heterogeneous nature of the hardware and software parts of the embedded system, innovative co-design techniques have been proposed in the recent past, with the goal of meeting the system-level requirements through a concurrent design and validation approach that exploits the synergy between the hardware and the software parts [6].
Several design aspects are involved in the co-design process at the system level, including system modeling, the capture of the functional specifications in a high-level language (co-specification), the analysis and validation of the specifications, the exploration and evaluation of different architectures with respect to given design metrics, system-level partitioning, co-synthesis and co-simulation. A co-design methodology covering all these design phases is essential for meeting the system-level requirements of embedded systems. These requirements are typically expressed as design constraints such as performance, area, power dissipation, cost, reliability, testability and development time. In particular, the technological trends toward smaller geometries and higher performance lead to high integration levels, high clock frequencies and high power dissipation. These aspects, combined with the growing demand for battery-powered portable systems, have increased the importance of power issues in the design of embedded systems. Therefore, co-design techniques for low power dissipation and EDA tools for accurate power estimation have become critical for embedded system and IC designers, who must satisfy the power constraints without significantly reducing global performance. Design techniques targeting low power dissipation and power estimation methodologies should be provided at several abstraction levels. Circuit-level and logic-level power estimation techniques are no longer sufficient, because of the high complexity and integration levels of embedded systems. Accurate low-level estimation techniques are limited by the need to cope with circuit complexity within an acceptable design time.
Moreover, low-level estimation techniques can be applied only during the last design phases, when a circuit-level or logic-level description is already available. However, a re-design process at these levels can be very expensive and time consuming. Hence, high-level power estimation is a key issue in the early determination of the power budget for embedded systems, since it is unfeasible to synthesize every design solution down to the gate, circuit and layout levels in a reasonable time. The goal is to respect the design turnaround time while exploring the architectural design space widely, and to re-target the architectural design choices early. An accurate and efficient high-level analysis should help to meet the power requirements while avoiding a costly re-design process. In general, relative accuracy in high-level power estimation is considered much more important than absolute accuracy, the main objective being the comparison of different design alternatives [10]. High-level power estimation tools are usually based on high-level descriptions. Up to now, most embedded system descriptions are specified in a hardware description language, such as Verilog or VHDL, along with other graphical formalisms suitable for describing the functional behavior at the system level, such as temporal diagrams, State Transition Graphs for Finite State Machines, Statecharts, etc. [9, 18]. In particular, VHDL has become the de facto standard in the European design community for hardware description and for most commercial design entry, synthesis and simulation tools. The main advantage of VHDL is the possibility of specifying the system behavior with a mixed description [18] at different abstraction levels: behavioral, Register-Transfer and structural. Therefore, VHDL provides high flexibility during both the design description and the simulation phases.
Furthermore, VHDL supports a hierarchical design approach, where the descriptions of the elements composing the hierarchy, properly connected, implement the global functionality. The hierarchical approach also makes it possible to use a mixed description composed of behavioral, Register-Transfer and structural parts at the different hierarchical levels. Other advantages of VHDL are the ease of specifying both the data-path and the control-path of the system and the support for a modular design approach. Hence, VHDL allows the designer to re-use existing components. In fact, VHDL supports the definition of functions and procedures to decompose a complex description into smaller and simpler functional units. These functional units can be organized as independent files that can be compiled and verified separately, thus supporting the definition of a library of reusable cells and macro-cells. Finally, VHDL also provides complete independence from the technology used and, through the configuration approach, the mapping between a given entity and different architectures. The aim of this paper is to provide a conceptual analysis framework for accurate and efficient estimation of power dissipation in embedded systems and VLSI circuits at the architectural and RT levels. The availability of a power analysis tool at these levels of abstraction is of paramount importance for obtaining estimation results early while maintaining an acceptable accuracy. In fact, the architectural and RT-level descriptions, based on VHDL, are the design entry point for the majority of embedded systems and IC designs. In the proposed approach, the analysis is based on a probabilistic estimation of the switching activity. The proposed model accounts for both the switching activity and the physical capacitance [8] of all the parts composing the embedded system architecture. The paper is organized as follows.
The discussion starts by presenting the most significant research works related to high-level power estimation in Section 2. The target system architecture we focus on is introduced in Section 3, while Section 4 contains the foundations and notations that constitute the basis of the proposed analysis. The proposed power estimation model is then detailed in Sections 5-10, while some experimental results obtained from benchmark circuits are reported in Section 11, which also outlines the future developments of our investigations.
Previous Work on High-Level Power Estimation
General surveys of power estimation techniques at different abstraction levels can be found in [7, 20, 5, 24]. While several power estimation techniques have been proposed in the literature at the gate, circuit and layout levels, few papers had addressed the power estimation problem at the high level until recently [7, 10], despite the increasing interest in system-level and behavioral-level design. A state-of-the-art survey of high-level power estimation has been presented in [10]. According to this survey, high-level power estimation techniques can be classified by abstraction level. At the architectural or RT levels, there are two classes of techniques: analytical and empirical. The analytical methods aim at relating the power consumption to the capacitances and the switching activities of the design nets. These techniques comprise complexity-based models and activity-based models. The former consider the design complexity of each part of the design, in terms of equivalent gates, as a measure of the capacitance, while the latter use the concept of entropy, derived from information theory, as a measure of the average transition activity in a circuit. More specifically, in the complexity-based models, the number n of equivalent gates contained in each design function is first specified in a macro-module library; the power estimates are then obtained by multiplying n by the average power consumed by each equivalent gate. In the activity-based models, the average power is estimated as the product of the area, taken as a measure of the average node capacitance, and the entropy, taken as a measure of the activity. The empirical methods are based on power measurements of existing implementations; a macro-modeling approach is then used to derive models from these measurements. The empirical methods can be sub-divided into fixed-activity models and activity-sensitive models.
The former models disregard the influence of data activity on power, while the latter consider the effects on power of statistics related to data and instruction activity. Moving up in the abstraction levels, the behavioral methods are based on static and dynamic activity prediction. The goal of static activity prediction is to estimate the access frequency of the different hardware resources by statically analyzing the behavioral description of the functions to be implemented. Dynamic activity prediction is based on dynamic profiling to determine the activation frequencies of the various resources and the memory accesses. Finally, power exploration tools at the instruction and system levels can be used to identify power metrics to guide the system-level partitioning. A common characteristic of power estimation at the different abstraction levels is that the average power is strongly related to the switching activity of the circuit nodes. This is expressed in [20] by stating that power estimation is a pattern-dependent process. In particular, the input pattern dependency of power estimation approaches can be classified as strong or weak [20]. The typical methods for power estimation based on extensive circuit simulation have been indicated in [20] as strongly pattern-dependent. The main advantages of these simulation techniques are their accuracy and wide applicability. However, to obtain a complete and accurate power estimation, the designer should provide a comprehensive set of input patterns to be simulated, which makes this approach very time consuming and computationally costly. Therefore, the simulation approach is almost impossible to apply to most designs, due to their increasing complexity. To avoid the need for a large number of input patterns, the weakly pattern-dependent approaches [20] require input probabilities.
In this case, the estimation results depend on the probabilities supplied by the designer, which reflect the typical behavior of the input signals. Both probabilistic and statistical techniques are presented in [20]. The probabilistic techniques illustrated there are suitable for combinational circuits and require user-supplied input probabilities to solve the pattern dependency problem. Statistical techniques simulate the circuit repeatedly with randomly generated input patterns and use statistical mean estimation techniques to stop the simulation according to a criterion on the closeness to the average power. Analytical and stochastic power estimation techniques at the behavioral level have been proposed in [16], targeting real-time DSP applications on ASIC architectures. The power dissipated by some ASIC components, such as data-path components, has been analytically estimated from the Control Data Flow Graph (CDFG) representing the design. For other ASIC components, such as interconnects and controllers, for which the power information available at the behavioral level is not sufficient, statistical models were built to estimate power based on a stochastic study of several ASICs. However, the proposed models do not account for the power consumed by multiplexers and memories. The estimation techniques have been included in an exploration tool that, given a CDFG description of an algorithm and a library of hardware modules, explores the space of the available solutions for different values of clock period and supply voltage. The results have been compared with an architectural-level power estimator, called Stochastical Power Analysis (SPA) and proposed in [11], on 23 different chips, showing an average error of approximately 20%. Other power estimation techniques based on high-level descriptions have been proposed in [11, 12, 13].
The techniques described in [11], targeting data-path architectures, derive stochastic models of busses and internal modules from the statistical behavior of the inputs. In [12], a power estimation model for data-path architectures operating at the RT level is described. The model accounts for the switching activity by using the Dual Bit Type (DBT) method, which considers two input bit types rather than one: the random activity of the least significant bits (LSBs) and the correlated activity of the most significant bits (MSBs). An architectural model for the power consumption of the control paths, called the Activity-Based Control (ABC) model, has been presented in [13], using three implementation styles: a ROM-based controller, a PLA-based controller and a random logic controller. Nevertheless, the methods proposed for high-level power estimation have not yet achieved the maturity necessary to enable their use within current industrial CAD environments. Our work is an attempt to fill this gap, aiming at providing a high-level power model, based on VHDL descriptions, that covers the different parts composing the basic architecture of embedded systems.
The Target System Architecture
The target system architecture of the embedded system is implemented in a single ASIC, including both the software-bound and the hardware-bound parts, described at the behavioral/RT levels. The target system architecture is depicted in Figure 1 and is quite similar to those proposed in [16, 17]. The ASIC architecture is defined at a pre-synthesis RT level and consists of the following components:
• the data-path, composed of storage units, functional units and multiplexers. The storage units consist of registers and register files, while functional units can include a wide set of units such as adders, multipliers, and so on. A two-level multiplexer structure is considered for the interconnection among storage and functional units. The typical operation along the data-path implies a register-to-register transfer, consisting of the operands read from the input registers, an operation performed on the operands and the results stored in the output registers [17];
• the main memory, based on a memory hierarchy, that can be constituted by single or multi-port memories, cache memories, TLBs, FIFOs, LIFOs, etc. We assume that all read/write accesses to the memory will be performed through input/output registers;
• the control unit, implemented as a set of Finite State Machines and generating the control lines for the data-path components and the memory;
• the embedded core processor, such as a standard processor, a microcontroller, a DSP, etc., with its memory (even if part of the memory can be external) implementing the SW bound part;
• the clock distribution logic, including the buffers of the distribution network, organized for example as a balanced clock tree;
• the crossbar network, to interface the architectural units by using a communication protocol at the system-level. The interconnection power of the crossbar network is included in the power dissipated by the outputs of data-path, memory and control logic;
• the primary I/O pads.
Foundation of the Power Estimation
Power dissipation in CMOS circuits can be considered as composed of a static and a dynamic component. Static power dissipation is mainly due to the leakage current of the reverse-biased diodes and the sub-threshold transistor conduction. However, in CMOS devices, static power dissipation can be considered insignificant in most designs [4]. The dominant part of the power dissipation in CMOS circuits is thus the dynamic component, which is composed of two terms [7]. The first term, indicated as the switching activity power, is due to the charge and discharge of the circuit node capacitances at the output of each logic gate. The second term, indicated as short-circuit power, represents the short-circuit current from the supply voltage to ground during the output transitions. The switching activity power $P_{SW}$ can be expressed as in [4]:
$$P_{SW} = \frac{1}{2} V_{DD}^2 \, f_{CLK} \, C_{EFF}$$
where $V_{DD}$ is the supply voltage, $f_{CLK}$ is the system clock frequency and $C_{EFF}$ is the effective switched capacitance, that is, the product of the total physical capacitance $C_{Li}$ of each node in the circuit and the switching activity factor $\alpha_i$ of each node (defined below), summed over all the $N$ nodes in the circuit:
$$C_{EFF} = \sum_{i=1}^{N} \alpha_i \, C_{Li}$$
The short-circuit power is due to the fact that, during a transition of a CMOS gate, both the p-channel and n-channel devices may conduct simultaneously, briefly establishing a flow of current from the supply to ground. The short-circuit power $P_{SH}$ can be expressed as in [7]:
$$P_{SH} = Q_{SC} \, V_{DD} \, f_{CLK} \, \alpha$$
where $Q_{SC}$ represents the quantity of charge carried by the short-circuit current per transition and $\alpha$ is the global switching activity factor, i.e. the number of gate output transitions per clock cycle.
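As a numerical illustration of the two dynamic terms, the following minimal sketch evaluates the effective switched capacitance, the switching activity power and the short-circuit power. It assumes the forms $P_{SW} = \frac{1}{2} V_{DD}^2 f_{CLK} C_{EFF}$ and $P_{SH} = Q_{SC} V_{DD} f_{CLK} \alpha$; the function names and all numerical values are hypothetical and are not taken from the paper or from any library.

```python
def effective_capacitance(nodes):
    """Sum of alpha_i * C_Li over all nodes.

    nodes: iterable of (alpha_i, c_li) pairs, capacitance in farads.
    """
    return sum(alpha * c_load for alpha, c_load in nodes)


def switching_power(v_dd, f_clk, c_eff):
    """Assumed form: P_SW = 1/2 * V_DD^2 * f_CLK * C_EFF (watts)."""
    return 0.5 * v_dd ** 2 * f_clk * c_eff


def short_circuit_power(q_sc, v_dd, f_clk, alpha):
    """Assumed form: P_SH = Q_SC * V_DD * f_CLK * alpha (watts)."""
    return q_sc * v_dd * f_clk * alpha


# Hypothetical three-node circuit: (switching activity, capacitance in F).
nodes = [(0.50, 10e-15), (0.25, 20e-15), (0.10, 50e-15)]

c_eff = effective_capacitance(nodes)
p_sw = switching_power(v_dd=3.3, f_clk=50e6, c_eff=c_eff)
p_sh = short_circuit_power(q_sc=2e-15, v_dd=3.3, f_clk=50e6, alpha=0.3)

print(f"C_EFF = {c_eff:.3e} F, P_SW = {p_sw:.3e} W, P_SH = {p_sh:.3e} W")
```

Nodes with a low switching activity contribute little to $C_{EFF}$ even when their physical capacitance is large, which is why the model tracks activity and capacitance together.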
However, in CMOS devices the short-circuit current typically dissipates a small fraction of the dynamic power (in the order of 5-10%, as reported in [4]), so it can be ignored. In properly designed circuits, switching activity power usually accounts for over 90% of the total power dissipation, as reported in [7]. The switching activity of each signal (whether a primary I/O or an internal signal) is fully characterized by the following two components [5]:
• a static component, taking into account the static probability of a signal;
• a dynamic component, taking into account the timing behavior of the circuit.
The static component can be expressed in terms of the static signal probability of each node $n$, that is, the probability of the node being at one:
$$p_n^1 = \lim_{N \to \infty} \frac{1}{N} \sum_{k=1}^{N} i_n(k)$$
where $i_n(k)$ is the value of $i_n$ at clock cycle $k$ (i.e. 0 or 1) and $N$ is the number of clock cycles. From the above definition it follows that $p_n^1 \le 1$ and $p_n^0 = 1 - p_n^1$. A signal is called equiprobable when it has a uniform distribution of high and low levels: $p_n^1 = p_n^0 = 0.5$. The transition probability $p_n^{01}$ is the probability of a zero-to-one transition at node $n$:
$$p_n^{01} = \lim_{N \to \infty} \frac{1}{N} \sum_{k=1}^{N} \bigl(1 - i_n(k)\bigr) \, i_n(k+1)$$
The other transition probabilities ($p_n^{10}$, $p_n^{00}$, $p_n^{11}$) are defined similarly, and the following equations hold: $p_n^0 = p_n^{00} + p_n^{01}$ and $p_n^1 = p_n^{10} + p_n^{11}$. Under the spatial and temporal independence assumption [20], the transition probability $p_n^{01}$ is given by the probability that the current state is zero times the probability that the next state is one: $p_n^{01} = p_n^0 \, p_n^1 = (1 - p_n^1) \, p_n^1$. Under the same assumption, the switching activity factor of a node $n$, indicated as $\alpha_n$ (or $E_{sw}(n)$), is: $\alpha_n = p_n^{01} + p_n^{10} = 2 \, p_n^1 (1 - p_n^1)$. Given the switching activity factor $\alpha_n$ of a node $n$, the corresponding toggle rate can be defined as $TR_n = \alpha_n f_{CLK}$, where $f_{CLK}$ is the system clock frequency. Considering an N-bit bus, the bus switching activity factor, indicated as $\alpha_{BUS}$, is defined as
where $\alpha_n$ is the switching activity factor of the n-th bit of the bus. The corresponding bus toggle rate can be defined as: $TR_{BUS} = \alpha_{BUS} f_{CLK}$.
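The per-node quantities defined in this section can be sketched as follows. The example estimates the static signal probability from a finite bit trace (a finite-sample approximation, not the exact definition), applies $\alpha_n = 2 p_n^1 (1 - p_n^1)$ under the independence assumption, and derives the toggle rate. The function names and the sample trace are hypothetical.

```python
def static_probability(trace):
    """Fraction of clock cycles in which the node is at one (estimate of p_n^1)."""
    return sum(trace) / len(trace)


def switching_activity(p1):
    """alpha_n = 2 * p1 * (1 - p1), valid under spatial/temporal independence."""
    return 2.0 * p1 * (1.0 - p1)


def measured_activity(trace):
    """Observed transitions per clock cycle, for comparison with the estimate."""
    toggles = sum(1 for a, b in zip(trace, trace[1:]) if a != b)
    return toggles / (len(trace) - 1)


def toggle_rate(alpha, f_clk):
    """TR_n = alpha_n * f_CLK."""
    return alpha * f_clk


# Hypothetical 8-cycle trace of one node.
trace = [0, 1, 1, 0, 1, 0, 0, 1]

p1 = static_probability(trace)
alpha = switching_activity(p1)
tr = toggle_rate(alpha, f_clk=50e6)

print(f"p1 = {p1}, alpha = {alpha}, TR = {tr:.3e} Hz")
print(f"measured activity = {measured_activity(trace):.3f}")
```

The measured activity of a short trace generally differs from the probabilistic estimate, since the estimate ignores temporal correlation between consecutive cycles.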
The Power Estimation Model
The proposed estimation approach is based on the VHDL description of the ASIC model at the behavioral/RT levels. The entire analysis is based on the probabilistic estimation of the nodes switching activity. The inputs for the estimation are:
• the ASIC specification, consisting of a hierarchical VHDL description implementing the target system architecture depicted in Figure 1;
• the allocation library, composed of the available components implementing the macro-modules (such as adders, multipliers, etc.) and the basic modules (such as registers, multiplexers, logic gates, I/O pads, etc.). Every component model includes the description of the logic behavior, the input capacitance, the area and the power characteristics;
• the technological parameters such as frequency, power supply, derating factors (accounting for the variations in process, voltage and temperature), etc.;
• the switching behavior of the ASIC primary I/Os.
The proposed model is based on the following assumptions:
− the supply and ground voltage levels in the ASIC are fixed, although it is worth noting the impact of supply voltage reduction on power;
− the design style is based on synchronous sequential circuits;
− the data transfer occurs at the register-to-register level;
− a Zero Delay Model (ZDM) has been used, thus ignoring the contribution of glitches and hazards to power.
The power model is an analytical model that attempts to relate the average power dissipation of the VHDL descriptions to the physical capacitance and the switching activity of the design nets. The estimation approach is hierarchical: we propose an ad-hoc analytical power model for each part of the target system architecture at the highest hierarchical level, and these models are based on a macro-module library at the lowest hierarchical levels. Furthermore, to avoid the need for a huge number of input patterns, our approach is weakly pattern-dependent, requiring user-supplied input probabilities that reflect the typical input behavior and are derived from the system-level specification. In the proposed single ASIC architecture, the total average power dissipated, $P_{AVE}$, is given by:
$$P_{AVE} = P_{IO} + P_{CORE}$$
where $P_{IO}$ and $P_{CORE}$ are the average power dissipated by the I/O nets and the core internal nets, respectively. The value of $P_{AVE}$ can be multiplied by the derating factor, $\delta$, taking into account the effects of the variations of the fabrication process and the operating conditions (voltage and temperature) on the power values contained in the target library. The power model of the core logic is based on the models of the different components of the target system architecture, therefore the $P_{CORE}$ term can be detailed as:
$$P_{CORE} = P_{DP} + P_{MEM} + P_{CNTR} + P_{PROC}$$
where the single terms represent the average power dissipated by the data-path, the memory, the control logic and the embedded core processor. The power models related to the single terms in the above equations will be detailed in the following sections.
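The hierarchical composition of the model can be sketched in a few lines. The example below simply adds the four core terms, adds the I/O term and applies the derating factor $\delta$; the class name, the function name and the numerical values (in mW) are hypothetical placeholders, not results from the paper.

```python
from dataclasses import dataclass


@dataclass
class CorePower:
    """Average power of the core components (hypothetical values, in mW)."""
    data_path: float   # P_DP
    memory: float      # P_MEM
    control: float     # P_CNTR
    processor: float   # P_PROC

    def total(self):
        """P_CORE = P_DP + P_MEM + P_CNTR + P_PROC."""
        return self.data_path + self.memory + self.control + self.processor


def average_power(p_io, core, derating=1.0):
    """P_AVE = P_IO + P_CORE, optionally scaled by the derating factor delta."""
    return derating * (p_io + core.total())


core = CorePower(data_path=10.0, memory=20.0, control=5.0, processor=40.0)
p_ave_nominal = average_power(p_io=15.0, core=core)
p_ave_derated = average_power(p_io=15.0, core=core, derating=1.2)

print(f"P_CORE = {core.total()} mW")
print(f"P_AVE = {p_ave_nominal} mW (nominal), {p_ave_derated} mW (derated)")
```

Keeping each term separate makes it straightforward to compare architectural alternatives by changing one component estimate at a time.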
$P_{IO}$ Estimation
Although a pre-synthesis analysis is performed, we assume that the ASIC interface is known from the system-level specifications in terms of primary I/O pad characteristics and related switching activity. The set $S$ of input, output and bi-directional nets of the ASIC can be partitioned into $N$ sets: $S = \{s_1, s_2, ..., s_k, ..., s_N\}$, where the k-th set $s_k$ is composed of I/O pads of the same type $t_k$. Considering for example a set of output pads, the average power of the set $s_k$ can be estimated as:
where: $n_k$ is the number of output pads in the set $s_k$; $P_i(C_i)$ is the average power consumption per MHz of the i-th output pad in $s_k$. The value of $P_i$ is computed as a function of the output load $C_i$ at a given reference frequency $f_0$. This value is tabulated in the selected library (such as in [14]). Similarly, the average power of the input pads can be computed, depending on the estimated internal standard loads and input ramptime.
$P_{DP}$ Estimation
The average power dissipated by the data-path can be expressed as:
$$P_{DP} = P_{REG} + P_{MUX} + P_{FU}$$
where the single terms represent the average power dissipated by the registers, the multiplexers and the functional units.
$P_{REG}$ Estimation
The live variable analysis [17] has been applied to the behavioral-level VHDL code to estimate the number of required registers and the maximum switching activity of each register. The preliminary step is the estimation of the number of required registers and, consequently, of the toggle rate $TR_i$ of each of them. Depending on the abstraction level, such data are either directly available from the RT-level description or obtained by applying the live variable analysis to the behavioral-level specifications. The algorithm examines the life of a variable over a set of VHDL code statements and is similar to the one proposed in [17] for computing the lifetime of a variable in terms of its definition and use over a selected set of VHDL code statements. New passes have been added to the algorithm proposed in [17] to obtain information on the register switching activity. The proposed algorithm [8] can be summarized as follows:
1. compute the lifetimes of all the variables in the given VHDL code, composed of $S$ statements. A variable $v_j$ is said to live over a set of sequential code statements {i, i+1, i+2, ..., i+n} when the variable is written in statement i and is last accessed in statement (i+n). When a variable is written in a statement (i+k) in the set, but last used in the same statement (i+k) of the next iteration, it is assumed to live over the entire set;
2. represent the lifetime of each variable as a vertical line from statement i through statement (i+n) in the column j reserved for the corresponding variable $v_j$;
3. determine the maximum number $N$ of overlapping lifetimes, computing the maximum number of vertical lines intersecting any horizontal cut-line;
4. estimate the minimum number $N$ of sets of registers necessary to implement the code by using register sharing. Register sharing has to be applied whenever a group of variables with the same bit-width $b_i$ can be mapped to the same register. The total number of
5. select a possible mapping of variables into registers by using register sharing;
6. compute the number $w_i$ of writes to the variables mapped to the same set of registers;
7. estimate $\alpha_i$ of each set of registers by dividing $w_i$ by the number of statements $S$:
$$\alpha_i = \frac{w_i}{S}$$
Figure 2 shows an application example of this algorithm, representing the differential equation example reported in [17]. The bold dotted line at statement 7 represents the horizontal cut-line with the maximum number ($N = 9$) of vertical lines reaching or crossing it. Thus, using register sharing, the VHDL statements can be implemented with a minimum of 9 registers. A possible mapping of variables into registers is shown in
Regarding $P_{REG}$, it is worth noting that the power of latches and flip-flops is consumed not only during output transitions, but also during all clock edges by the internal clock buffers, depicted in Figure 3, even when the data stored in the register does not change. Thus, our analytical model of registers takes into account both the switching and the non-switching power, the latter due to the internal clock buffers. The non-switching power dissipated by the internal clock buffers accounts for approximately 30% of the average power of the registers, as for example in [14] for a cell-based CMOS 3.3V technology. Note that, as depicted in Figure 3, the internal clock buffers are independent of the output load, thus the non-switching power of latches and flip-flops is load-independent, but dependent on the clock input ramptime.
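The main passes of the live variable analysis can be sketched as follows. This is a simplified illustration, not the algorithm of [8] or [17]: it assumes that every variable has the same bit-width, that each lifetime is a single inclusive interval of statement numbers, and that the mapping is chosen with a left-edge style greedy pass. The variable names, lifetimes and write counts are hypothetical.

```python
def max_overlap(lifetimes, num_statements):
    """Maximum number of lifetimes crossed by any horizontal cut-line (step 3)."""
    best = 0
    for stmt in range(1, num_statements + 1):
        live = sum(1 for first, last in lifetimes.values() if first <= stmt <= last)
        best = max(best, live)
    return best


def share_registers(lifetimes):
    """Greedy mapping of variables to registers (step 5, left-edge style).

    Variables are visited by increasing first statement; each one reuses the
    first register whose previous variable is no longer live.
    """
    registers = []  # each entry: {"free_after": last statement, "vars": [...]}
    for name, (first, last) in sorted(lifetimes.items(), key=lambda kv: kv[1]):
        for reg in registers:
            if reg["free_after"] < first:
                reg["vars"].append(name)
                reg["free_after"] = last
                break
        else:
            registers.append({"free_after": last, "vars": [name]})
    return [reg["vars"] for reg in registers]


def register_activity(mapping, writes, num_statements):
    """alpha_i = w_i / S for each register (steps 6 and 7)."""
    return [sum(writes[v] for v in variables) / num_statements
            for variables in mapping]


# Hypothetical code fragment with S = 6 statements.
S = 6
lifetimes = {"a": (1, 3), "b": (2, 4), "c": (4, 6), "d": (5, 6)}
writes = {"a": 1, "b": 1, "c": 2, "d": 1}

n_regs = max_overlap(lifetimes, S)
mapping = share_registers(lifetimes)
alphas = register_activity(mapping, writes, S)

print(n_regs, mapping, alphas)
```

For lifetimes that are plain intervals, the greedy pass uses as many registers as the maximum overlap, which matches the minimum stated in step 4 under the equal bit-width assumption.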
The average power dissipated by the registers can then be expressed as a sum over the $N$ sets of registers:
$$P_{REG} = \sum_{k=1}^{N} \left( P_{s_k} + P_{NS_k} \right)$$
where $P_{s_k}$ is the average power of each set $s_k$ and $P_{NS_k}$ is the average non-switching power dissipated by the internal clock buffers of the registers in the set $s_k$, that is, the average power dissipated by the internal clock buffers when there are no output transitions. Note that the measured average power $P_{s_k}$, tabulated in the target library, also includes the power dissipated by the internal clock buffers during the clock edges corresponding to output transitions. Hence the estimated value of $P_{s_k}$ should account for a toggle rate given by $TR_{s_k}$, while the estimated value of $P_{NS_k}$ should consider a toggle rate of $(f_{CLK} - TR_{s_k})$.
The estimated values of $P_{s_k}$ and $P_{NS_k}$, for the k-th set $s_k$, are respectively given by:
$$P_{s_k} = \sum_{i=1}^{n_k} P_i(C_i) \, TR_i \qquad P_{NS_k} = \sum_{i=1}^{n_k} P_{0k} \, (f_{CLK} - TR_i)$$
where: $n_k$ is the estimated number of registers in the set $s_k$; $P_i(C_i)$ is the average power consumption per MHz of the i-th register in $s_k$. The value of $P_i$ has been computed by running SPICE simulations, at a given reference frequency $f_0$, for different output standard loads (representing both load cells and interconnections) and clock input ramptimes. Thus the value of $P_i$ is given as a function of the output load $C_i$ and the input ramptime, and it is tabulated in our allocation library in [µW/MHz] as a function of $C_i$, expressed in equivalent standard loads, and of the input ramptime, expressed in [nsec]; $P_{0k}$ is the non-switching power consumption per MHz of a single register of type $t_k$. The value of $P_{0k}$, expressed in [µW/MHz], has been computed by running SPICE simulations, at a given reference frequency $f_0$, as a function of the clock input ramptime; $C_i$ is the estimated output load of the i-th register in the set $s_k$, expressed in equivalent standard loads; $TR_i$ is the estimated toggle rate of the i-th register in the set $s_k$, obtained by using the live variable analysis.
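The split between switching and non-switching register power can be illustrated numerically. The sketch below is one possible reading of the model, under two assumptions that should be checked against the full paper: each register contributes $P_i(C_i) \cdot TR_i$ to the switching term, and $P_{0k} \cdot (f_{CLK} - TR_i)$ to the non-switching term. The function name and all values are hypothetical; the library lookup of $P_i(C_i)$ is assumed to have been done already.

```python
def register_set_power(registers, p0_k, f_clk_mhz):
    """Return (switching, non_switching) power in microwatts for one set s_k.

    registers: list of (p_i, tr_i) pairs, where p_i is the tabulated power
               per MHz [uW/MHz] at the register's output load and tr_i is
               its toggle rate [MHz].
    p0_k:      non-switching power per MHz [uW/MHz] for this register type.
    """
    # Assumed form: P_sk = sum_i P_i(C_i) * TR_i
    p_switching = sum(p_i * tr_i for p_i, tr_i in registers)
    # Assumed form: P_NSk = sum_i P_0k * (f_CLK - TR_i)
    p_non_switching = sum(p0_k * (f_clk_mhz - tr_i) for _, tr_i in registers)
    return p_switching, p_non_switching


# Hypothetical set of two registers clocked at 50 MHz.
registers = [(2.0, 10.0), (2.5, 20.0)]
p_sw_regs, p_ns_regs = register_set_power(registers, p0_k=0.5, f_clk_mhz=50.0)
p_set = p_sw_regs + p_ns_regs

print(f"switching = {p_sw_regs} uW, non-switching = {p_ns_regs} uW, total = {p_set} uW")
```

With these numbers the non-switching term is one third of the total, which shows why ignoring the internal clock buffers would noticeably underestimate the power of rarely written registers.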
$P_{MUX}$ Estimation
To estimate the size and number of multiplexers from the VHDL code, it is necessary to determine the number of paths in the data-path. The approach is also based on the definition of the power model of a 2-input non-inverting multiplexer, based on both static signal probability of the selection net and the switching activities of the input nets.
The analysis of the design paths and the related notations are similar to those in [17]; however, in the proposed approach we also consider the paths from primary inputs to internal registers and from internal registers to primary outputs. A path from the source component S to the target component T is represented as T < S. Note that all memory accesses require the use of intermediate registers.
Given the target architecture represented in Figure 1, the possible paths can be classified in the following categories:
1. primary input to register (R < I);
2. register to primary output (O < R);
3. register to register (R < R);
4. register to functional unit (U < R);
5. functional unit to register (R < U);
6. register to memory (M < R);
7. memory to register (R < M).
The algorithm used to determine the possible paths in the data-path can be easily derived from the algorithm described in [17], but considering also the possible paths of categories 1 and 2. Once the size and number of multiplexers have been computed, we derive the switching activity of the output node of each multiplexer, given the model of the two-input non-inverting multiplexer depicted in Figure 4. A simplified model for the maximum switching activity of the output Z of a 2-input non-inverting multiplexer is:
where: $\alpha_A$ is the switching activity of input A; $\alpha_B$ is the switching activity of input B; $p_s^1$ is the static signal probability of the selection net S. Globally, the average power dissipated by the multiplexers can be estimated as:
$P_{MUX} = \sum_{i=1}^{N} P_i$
where $N$ is the estimated number of multiplexers and $P_i$ is the average power of each multiplexer. The value of $P_i$ for the i-th multiplexer is given by: $P_i = P_{ti}(C_i) \, TR_i$, where $P_{ti}$ is the average power consumption per MHz of a 2-input non-inverting multiplexer and $TR_i$ is the toggle rate of the output of the i-th multiplexer.
P FU Estimation
For the estimation of the average power of the functional units, we use complexity-based analytical models [10], where the complexity of each functional unit is described in terms of equivalent gates. To estimate the number of equivalent gates necessary to implement a given function of the data-path, we use a library of macro-modules such as adders, multipliers, etc. The library should include the estimated number of logic gates for each macro-module, depending on the number of operands and the bit-width of each operand. Once the number of equivalent gates for each macro-function has been evaluated, the estimated power dissipated by the functional units can be expressed as:
$P_{FU} = \sum_{i=1}^{N} P_i$
where $N$ is the number of macro-modules, and $P_i$ is the power of the i-th macro-module, given by: $P_i = n_i \, P_{TECH} \, TR_i$, where $P_{TECH}$ is a technological parameter expressed in [µW/(gate MHz)]; $n_i$ is the estimated number of logic gates in the i-th macro-function; $TR_i$ is the toggle rate of the output net of the i-th macro-module.
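The complexity-based model above reduces to a sum of products, which the following minimal sketch illustrates. The function name, the gate counts, the toggle rates and the value of the technological parameter are all hypothetical and chosen only to make the arithmetic easy to follow; real values would come from the macro-module library and the target technology.

```python
def fu_power_uw(modules, p_tech_uw_per_gate_mhz):
    """Return P_FU in microwatts as the sum of n_i * P_TECH * TR_i.

    modules: list of (n_i, TR_i) pairs, with n_i in equivalent gates
             and TR_i in MHz (hypothetical inputs for illustration).
    """
    return sum(n_gates * p_tech_uw_per_gate_mhz * tr_mhz
               for n_gates, tr_mhz in modules)


# Hypothetical data-path with two macro-modules.
example_modules = [(400, 25.0), (1500, 10.0)]
example_p_fu = fu_power_uw(example_modules, 0.5)
print(example_p_fu)
```

Since $P_{TECH}$ is expressed in [µW/(gate MHz)], multiplying by a gate count and a toggle rate in MHz yields microwatts directly.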
P MEM Estimation
A power dissipation model for a memory cell, at a low level of abstraction, has been proposed in [25], being:
where k is the number of cells in a row, $c_{int}$ is the wire capacitance per unit length, $l_{column}$ is the memory column length, $2^{n-k}$ is the number of cells in a column, $C_{tr}$ is the minimum size drain capacitance, and $V_{swing}$ is the bitline voltage swing. Considering a fully CMOS single-port static RAM, at a high level of abstraction, we assume that the target library provides the information related to the power consumption of a single memory cell, $P_{cell}$, and of a single memory output buffer. The average power dissipation during a read access to a single row of the array, composed of n rows and m columns, is proportional to the inverse of the read access time $t_a$ and to the sum of the average power dissipated by the following blocks: the row decoder, the m memory cells composing the i-th row and the output buffers. In particular, the power dissipated by the row decoder can be estimated with a complexity-based model, where the number of equivalent gates is proportional to the product $(n \times \log_2 n)$ and the load capacitance is the word line capacitance.
P CNTR Estimation
The proposed model for the power dissipation of a Finite State Machine (FSM) is a probabilistic model, where we approximate the average switching activities of the FSM nodes by using the switching probabilities (or transition probabilities) derived by modeling the FSM as a Markov chain [22]. Given a typical implementation of an FSM, composed of a combinational circuit and a set of state registers, as depicted in Figure 5, we consider the different contributions to the global average power:
$P_{CNTR} = P_{IN} + P_{STATE\_REG} + P_{COMB} + P_{OUT}$
where: $P_{IN}$ is the average power dissipated by the primary inputs PI; $P_{STATE\_REG}$ is the average power dissipated by the state registers; $P_{COMB}$ is the average power dissipated by the combinational logic; $P_{OUT}$ is the average power dissipated by the primary outputs. The power estimation models for each term of the above equation are described in the following, along with some concepts and notations related to FSMs.
As basic assumptions, we assume that the FSM description is available in the form of a State Transition Graph (STG), where each state is represented symbolically and nothing is known about the structure of the combinational logic implementing the next state and output functions. The input static signal probabilities and the input switching activity factors are assumed to be given by the system-level specifications, being derived by simulating the FSM at a high abstraction level or from direct knowledge of the typical input behavior. Furthermore, we assume a Zero Delay Model for the logic gates and synchronous primary inputs. Under these assumptions, we can ignore the effects of glitches and hazards on the state bit lines; therefore the switching activities of the present and next state bit lines are equal.
Estimation of Total State Transition Probabilities
Given the FSM description and the input probabilities, the first step of our estimation consists of computing the total state transition probabilities for each edge in the graph, by modeling the FSM as a Markov chain and following the same method shown in [3, 15, 22]. Let the FSM, composed of $n_s$ states, be described by an STG composed of $n_s$ vertices, corresponding to the states in the set $S = \{s_1, s_2, ..., s_{n_s}\}$. Even if the conditional state transition probabilities can be considered as an approximation of the total state transition probabilities, the steady-state probabilities should be taken into account.
The steady-state probability $P_i$ of a state $s_i$ is defined as the probability of being in the state $s_i$ in an arbitrarily long random sequence [27]. Computing the $P_i$'s implies solving the system composed of the Chapman-Kolmogorov equations [22] and the equation representing the normality condition:
$P^T p = P^T, \qquad \sum_{i=1}^{n_s} P_i = 1$
where $P^T = (P_1, ..., P_k, ..., P_{n_s})$ is the row vector of the steady-state probabilities and $p$ is the matrix of the conditional state transition probabilities $p_{ij}$. Note that the above system has $(n_s + 1)$ equations and $n_s$ unknowns, thus one of the Chapman-Kolmogorov equations can be dropped [15]. Given the state probabilities $P_i$'s and the conditional state transition probabilities $p_{ij}$'s, the total state transition probabilities $P_{ij}$ between the two states $s_i$ and $s_j$ can be expressed as [3]:
$P_{ij} = p_{ij} P_i$.
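As an illustration of this first step, the sketch below computes the steady-state probabilities of a small chain and then the total state transition probabilities. It is only one possible way to solve the system: it drops the last Chapman-Kolmogorov equation, keeps the normality condition, and applies Gauss-Jordan elimination with exact rational arithmetic. The function names and the two-state example matrix are hypothetical, and the chain is assumed to be irreducible so that the solution is unique.

```python
from fractions import Fraction


def steady_state(p):
    """Solve P^T p = P^T together with sum(P_i) = 1.

    p is the matrix of conditional state transition probabilities
    (each row sums to 1). Assumes an irreducible chain.
    """
    n = len(p)
    # Balance equations for columns 0..n-2 (the last one is dropped):
    # sum_i P_i * (p[i][j] - delta_ij) = 0
    a = [[Fraction(p[i][j]) - (1 if i == j else 0) for i in range(n)]
         + [Fraction(0)] for j in range(n - 1)]
    # Normality condition: sum_i P_i = 1
    a.append([Fraction(1)] * n + [Fraction(1)])
    for col in range(n):
        pivot = next(r for r in range(col, n) if a[r][col] != 0)
        a[col], a[pivot] = a[pivot], a[col]
        a[col] = [v / a[col][col] for v in a[col]]
        for r in range(n):
            if r != col and a[r][col] != 0:
                factor = a[r][col]
                a[r] = [vr - factor * vc for vr, vc in zip(a[r], a[col])]
    return [a[i][n] for i in range(n)]


def total_transition_probabilities(p, steady):
    """Return the matrix of P_ij = p_ij * P_i."""
    return [[Fraction(p_ij) * steady[i] for p_ij in row]
            for i, row in enumerate(p)]


# Hypothetical two-state chain.
example_p = [[Fraction(1, 2), Fraction(1, 2)],
             [Fraction(1, 4), Fraction(3, 4)]]
example_steady = steady_state(example_p)
example_total = total_transition_probabilities(example_p, example_steady)
print(example_steady, example_total)
```

The total state transition probabilities sum to one over all edges, which gives a quick sanity check on the computation.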
State Encoding Algorithms
The second step consists of finding a state assignment that minimizes the power dissipation. Given the STG, the problem can be formulated as determining a state encoding so as to minimize a given cost function, C, that takes into account the number of state variable transitions between two consecutive clock cycles. The main goal is to minimize the switching activity associated with the state registers, which, if combined with an appropriate combinational logic implementation, can lead to a global power minimization. Several solutions to the state encoding problem have been presented in the literature [15]. Specific solutions can be applied to particular classes of STGs, such as the Gray encoding for STGs representing structures such as counters. For an STG of generic structure, the One-Hot encoding guarantees exactly two state bit transitions for each clock cycle; however, it requires a number of state variables exactly equal to the number of states ($n_{var} = n_s$), while in general $\lceil \log_2 n_s \rceil \le n_{var} \le n_s$. Other coding techniques [27, 3, 15] can lead to a single state bit transition for each clock cycle. In general, the cost function C should take into account the Hamming distance $H(s_i, s_j)$ between the binary codes of the states $s_i$ and $s_j$ between which a state transition can occur [15]:
However, a more accurate cost function should consider weight factors taking into account the probability of the state transitions:
where $W_{ij}$ is the weight assigned to the edge from state $s_i$ to state $s_j$ in the STG. The Syclop method, proposed in [23], considers the conditional state transition probabilities $p_{ij}$ as the $W_{ij}$ coefficients in the cost function C and uses the minimum possible number of state bits ($\lceil \log_2 n_s \rceil$).
The state assignment algorithm POW3, proposed in [3], considers the total state transition probabilities $P_{ij}$ to be included in the cost function C as weight coefficients.
Other state encoding algorithms are the Galops algorithm proposed in [21], which uses the same cost function as POW3, and the LPSA algorithm, proposed in [27], which addresses the state assignment problem for both the two-level and the multi-level implementations of the next state and output logic, accounting for the loading factors and the switching activities of the present state inputs. A related power cost model to guide the state assignment has also been proposed in [27].
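To make the role of the weighted cost function concrete, the following sketch evaluates a candidate encoding under the assumption that C is the sum, over the STG edges, of the weight $W_{ij}$ times the Hamming distance $H(s_i, s_j)$. This form is a common one and is assumed here for illustration; the function names and the four-state counter example are hypothetical. It is not an implementation of Syclop, POW3, Galops or LPSA, which search for an encoding rather than merely scoring one.

```python
def hamming(code_a, code_b):
    """Number of bit positions in which two integer state codes differ."""
    return bin(code_a ^ code_b).count("1")


def encoding_cost(weights, codes):
    """Assumed cost: sum over edges (i, j) of W_ij * H(s_i, s_j).

    weights: dict mapping an edge (i, j) to its weight W_ij
    codes:   dict mapping a state index to its binary code (as an int)
    """
    return sum(w * hamming(codes[i], codes[j])
               for (i, j), w in weights.items())


# Hypothetical modulo-4 counter: each edge has total probability 0.25.
counter_edges = {(0, 1): 0.25, (1, 2): 0.25, (2, 3): 0.25, (3, 0): 0.25}
binary_codes = {0: 0b00, 1: 0b01, 2: 0b10, 3: 0b11}
gray_codes = {0: 0b00, 1: 0b01, 2: 0b11, 3: 0b10}

cost_binary = encoding_cost(counter_edges, binary_codes)
cost_gray = encoding_cost(counter_edges, gray_codes)
print(cost_binary, cost_gray)
```

For this counter the Gray encoding scores lower than the natural binary encoding, in line with the remark that Gray codes suit counter-like STGs.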
Estimation of the Switching Activity of the State Bit Lines
The switching activity of the state bit lines depends on both the state encoding and the total state transition probabilities between each pair of states in the STG [27]. Let us generalize the concept of state transition probability to transitions occurring between two disjoint subsets of states, $S_i$ and $S_j$, contained in the set of states $S = \{s_1, s_2, ..., s_{n_s}\}$, as defined in [27]:
Let $b_i$ be the i-th bit ($1 \le i \le n_{var}$) of the state code (called state bit) and $n_{var}$ the number of state bits ($\lceil \log_2 n_s \rceil \le n_{var} \le n_s$). We consider the two subsets of states in which the i-th state bit assumes the value one and zero respectively. The switching activity $\alpha_{b_i}$ of the state bit line $b_i$ is given by [27]:
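A minimal sketch of this computation follows. It assumes that the switching activity of a state bit is the total probability of all transitions that cross between the subset of states where the bit is zero and the subset where it is one, in either direction; this is equivalent to summing $P_{kl}$ over every edge whose endpoint codes differ in that bit. The function name and the counter example are hypothetical.

```python
def state_bit_activity(total_prob, codes, n_var):
    """Return [alpha_b0, alpha_b1, ...] for the n_var state bit lines.

    total_prob: dict mapping an edge (k, l) to its total state
                transition probability P_kl
    codes:      dict mapping a state index to its binary code (as an int)
    Assumed model: alpha_b is the sum of P_kl over all edges whose
    source and destination codes differ in bit b.
    """
    alpha = [0.0] * n_var
    for (k, l), p_kl in total_prob.items():
        diff = codes[k] ^ codes[l]
        for b in range(n_var):
            if (diff >> b) & 1:
                alpha[b] += p_kl
    return alpha


# Hypothetical modulo-4 counter with uniform total transition probabilities.
edges = {(0, 1): 0.25, (1, 2): 0.25, (2, 3): 0.25, (3, 0): 0.25}
alpha_binary = state_bit_activity(edges, {0: 0b00, 1: 0b01, 2: 0b10, 3: 0b11}, 2)
alpha_gray = state_bit_activity(edges, {0: 0b00, 1: 0b01, 2: 0b11, 3: 0b10}, 2)
print(alpha_binary, alpha_gray)
```

Under this model, the sum of the per-bit activities equals the probability-weighted Hamming distance of the encoding, which is why minimizing that cost reduces state register switching.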
Estimation of the Switching Activity of the Primary Outputs
Considering a Moore-type FSM, the switching activity of the primary outputs can be defined similarly to the switching activity of the state bit lines, depending on both the given output encoding and the total state transition probabilities. In fact, in a Moore-type FSM, the total state transition probabilities P ij between the two states s i and s 
Let $y_m$ be the m-th output bit ($1 \le m \le n_O$) and $n_O$ the number of primary outputs. We consider the two sets of outputs in which the m-th output bit assumes the value one and zero respectively. The switching activity $\alpha_{y_m}$ of the primary output $y_m$ is given by:
P IN Estimation
As mentioned before, let us assume that the input static signal probabilities and the input switching activity factors are given by the system-level specifications. The average power dissipated by the k-th primary input belonging to the set $PI = \{x_1, x_2, ..., x_k, ..., x_{n_I}\}$ depends on the switching activity factor $\alpha_{x_k}$ and on the input load capacitance $C_{x_k}$, the latter being proportional to the number of literals, $n_{lit,x_k}$, that the k-th primary input is driving in the combinational part, and to the estimated capacitance $C_{lit}$ due to each literal [27]. Therefore, the average power $P_{IN}$ can be estimated as:
$P_{IN} = \sum_{k=1}^{n_I} P_{x_k}(C_{x_k}) \, TR_{x_k}$
where: $C_{x_k} = n_{lit,x_k} \, C_{lit}$; $TR_{x_k} = \alpha_{x_k} f_{CLK}$ and $P_{x_k}(C_{x_k})$ is the average power consumption per MHz of the cell driving the k-th input.
P STATE_REG Estimation
The average power dissipated by the state registers, $P_{STATE\_REG}$, can be derived by using the switching activity $\alpha_{b_i}$ of the i-th state bit line $b_i$, where $1 \le i \le n_{var}$ and the corresponding toggle rate is $TR_{b_i} = \alpha_{b_i} f_{CLK}$. The term $P_{STATE\_REG}$ accounts for the switching and non-switching power of the state registers:
$P_{STATE\_REG} = \sum_{i=1}^{n_{var}} (P_i + P_{NSi})$
where $n_{var}$ is the number of state registers and $P_i$ and $P_{NSi}$ are the average switching and non-switching power dissipated by each state register. As before, the switching power $P_i$ also includes the power dissipated by the internal clock buffers during clock edges corresponding to output transitions. Hence the terms $P_i$ should account for a toggle rate given by $TR_{b_i}$, while the terms $P_{NSi}$ should consider a toggle rate of $(f_{CLK} - TR_{b_i})$.
The estimated values of $P_i$ and $P_{NSi}$ are respectively given by: $P_i = P_{ti}(C_i) \, TR_{b_i}$ and $P_{NSi} = P_{0i} \, (f_{CLK} - TR_{b_i})$, where: $P_{ti}$ is the average power consumption per MHz of the i-th register of type $t_i$ as a function of the load capacitance $C_i$ and the input ramp time; $P_{0i}$ is the non-switching power consumption per MHz of a single register of type $t_i$; $C_i = n_{lit,b_i} \, C_{lit}$ is proportional to the number of literals, $n_{lit,b_i}$, that the i-th state bit line is driving in the combinational part, and to the estimated capacitance $C_{lit}$ due to each literal, expressed in equivalent standard loads.
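The split between switching and non-switching contributions can be sketched as follows, using the two per-register expressions given above and assuming that the total is their sum over all state registers. The function name and all numeric values are hypothetical; in practice the per-MHz figures would be read from the characterized library for the estimated load and ramp time.

```python
def state_reg_power_uw(alphas, p_sw_uw_per_mhz, p_ns_uw_per_mhz, f_clk_mhz):
    """Estimate P_STATE_REG in microwatts.

    alphas:           switching activity alpha_bi of each state bit line
    p_sw_uw_per_mhz:  P_ti(C_i), switching power per MHz of each register
    p_ns_uw_per_mhz:  P_0i, non-switching power per MHz of each register
    f_clk_mhz:        clock frequency in MHz
    Assumes the total is the sum over registers of P_i + P_NSi.
    """
    total = 0.0
    for alpha, p_sw, p_ns in zip(alphas, p_sw_uw_per_mhz, p_ns_uw_per_mhz):
        tr = alpha * f_clk_mhz                  # TR_bi = alpha_bi * f_CLK
        total += p_sw * tr                      # P_i   = P_ti(C_i) * TR_bi
        total += p_ns * (f_clk_mhz - tr)        # P_NSi = P_0i * (f_CLK - TR_bi)
    return total


# Hypothetical two-bit state register clocked at 100 MHz.
example_p_state_reg = state_reg_power_uw([0.5, 0.5], [2.0, 2.0], [0.5, 0.5], 100.0)
print(example_p_state_reg)
```

Note that a register whose output never toggles still contributes $P_{0i} f_{CLK}$, since its internal clock buffers switch at every clock edge.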
P COMB Estimation
The average power dissipated by the combinational logic, $P_{COMB}$, has been estimated by considering a two-level logic implementation, before the minimization step. The i-th state bit line $b_i$ (where $1 \le i \le n_{var}$) can be expressed in canonical form as the sum of $N_{b_i}$ minterms ($N_{b_i} \le 2^{n_{lit}}$, where $n_{lit}$ is the number of literals and $2^{n_{lit}}$ is the maximum number of minterms). Similarly, the m-th output bit $y_m$ ($1 \le m \le n_O$) can be expressed in canonical form as the sum of $N_{y_m}$ minterms ($N_{y_m} \le 2^{n_{lit}}$). Let us assume that a single AND gate represents the generic minterm; hence the maximum number of AND gates in the AND-plane is $2^{n_{lit}}$, while in general $n_{AND} \le 2^{n_{lit}}$. Given the probabilistic model of the switching activity of the generic $n_{lit}$-input AND gate, we can derive an upper bound for the estimated power of the AND-plane:
where: $P_i(C_i)$ is the average power consumption per MHz of the i-th $n_{lit}$-input AND gate; $C_i$ is the capacitance driven by the i-th $n_{lit}$-input AND gate; $TR_i = \alpha_i f_{CLK}$ is the toggle rate of the i-th $n_{lit}$-input AND gate (derived by using the switching activity model of the $n_{lit}$-input AND gate).
P OUT Estimation
$P_{OUT}$ is the average power dissipated by the OR-plane, which is composed of $n_{var}$ OR gates with $N_{b_i}$ inputs, corresponding to the state bit lines and driving the input capacitance of the state registers, and of $n_O$ OR gates with $N_{y_m}$ inputs, corresponding to the primary outputs and driving the output load capacitances. Therefore, the upper bound for the power of the OR-plane is composed of two terms: the first is proportional to the switching activity factors $\alpha_{b_i}$ of the state bit lines $b_i$, while the second is proportional to the switching activity factors $\alpha_{y_i}$ of the primary outputs:
P PROC Estimation
A methodology to measure the power cost of embedded software, at the instruction level, has been proposed in [26]. The current drawn by each processor instruction has been measured during the execution of instruction sequences composed of the same instruction. The power contributions due to the inter-instruction effects have also been considered, along with the effects of resource constraints leading to stalls, such as pipeline and write buffer stalls, as well as the effects of cache misses, which cause power penalties. Considering the embedded core processor in our target system architecture, the proposed estimation is carried out at the instruction level, by considering the average power consumption during the execution of a given program. We assume that detailed power information is provided by the embedded core supplier, in terms of the power dissipated by each type of instruction in the instruction set. Based on this information, a power table should be derived for a dedicated processor, reporting the power consumption for each instruction in the instruction set and for all the possible addressing modes of a given instruction type.
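A rough sketch of how such a power table could be used is given below. It assumes that the average program power is the cycle-weighted mean of the per-instruction power values over an executed instruction trace, and it deliberately ignores inter-instruction effects, stalls and cache misses. The table layout (keyed by instruction type and addressing mode), the cycle counts and all numeric values are hypothetical.

```python
def average_program_power_mw(trace, power_table_mw, cycles_table):
    """Cycle-weighted average power over an executed instruction trace.

    trace:          sequence of (instruction, addressing_mode) keys
    power_table_mw: average power of each key, in mW (hypothetical data)
    cycles_table:   execution cycles of each key (hypothetical data)
    Simplification: no inter-instruction, stall or cache-miss terms.
    """
    total_cycles = 0
    weighted_power = 0.0
    for key in trace:
        cycles = cycles_table[key]
        weighted_power += power_table_mw[key] * cycles
        total_cycles += cycles
    return weighted_power / total_cycles


# Hypothetical power table with two entries.
power_table = {("add", "register"): 100.0, ("load", "indirect"): 150.0}
cycles = {("add", "register"): 1, ("load", "indirect"): 3}
program_trace = [("add", "register"), ("add", "register"), ("load", "indirect")]

example_avg_power = average_program_power_mw(program_trace, power_table, cycles)
print(example_avg_power)
```

Weighting by cycles rather than by instruction count matters here because a multi-cycle instruction occupies the processor for longer and therefore contributes more to the time-averaged power.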
Experimental results and concluding remarks
The proposed power estimation method has been implemented and applied to both data-path and FSM circuits. The measurements have been derived by using the HCMOS6 technology (0.35 µm, 3.3 V), supplied by SGS-Thomson Microelectronics, at the target operating frequency of 100 MHz. The architecture of the data-path ASIC, reported in Figure 6, contains registers, a 64-bit adder, I/O pads, a set of 64 multiplexers and a clock distribution tree. The VHDL model of the ASIC, reported in Figures 7 and 8, has been synthesized by using the Synopsys Design Compiler tool with the HCMOS6 technology. Experimental results are reported in Table 2, in terms of the average power dissipated by the different parts of the ASIC, for several input switching activities: 0.75, 0.5, 0.25 and 0.1. The results obtained by the proposed methodology have been compared to those obtained through the Synopsys Design Power tool, based on the gate-level netlist. Note that both estimation methods are based on a Zero Delay Model. As far as the global power is concerned, the proposed method provides a good approximation: the percentage error lies in the range 1.12%-1.47% with respect to the Synopsys estimates. However, since the switched capacitance of the I/O nodes is usually larger than that of the internal nodes, by up to three orders of magnitude, the major contribution to the global power comes from the I/O power. In the benchmark, the I/O power represents 94.83%, on average, of the total power, due to the reduced size of the core logic. Furthermore, the I/O power estimates are very close to the gate-level estimates (1.38% error on average), due to the simple model used. Thus, a more realistic measure of the accuracy of the proposed model is the comparison of the core power: the model provides an estimation error below 4.54%.
In particular, the results show an average percentage error of 2.07% for the registers, 30.49% for the multiplexers and -0.54% for the adder. To show the validity of the proposed FSM power model, we consider the same Moore-type FSM used in [3]. The State Transition Table of the four-state, two-input FSM is reported in Table 3, with an arbitrary output encoding. Several state encodings have been applied to the FSM (see Table 4), to evaluate the effects of the state encoding on the power estimates. In particular, ENC_A is the state encoding we derived to minimize power, ENC_B is the state encoding proposed in [3], ENC_C has been derived by using NOVA to minimize the area, ENC_D and ENC_E are randomly generated encodings and ENC_F is an example of the One-Hot encoding. As before, the results obtained by the proposed model have been compared to the results obtained by the Design Power tool on the gate-level netlist (see Table 5). Considering the effects of the different encoding algorithms on the global power estimates, a similar trend can be observed for both Design Power and the proposed model. As expected, the Design Power results show a growing power dissipation from ENC_A to ENC_F. Our results, as reported in Table 5, show a rather similar behavior. Considering the global power, the proposed model shows an average percentage error of -2.31% (ranging from -8.68% to 3.0%) with respect to the Design Power estimates. However, the power of the state registers is over-estimated by 27.9%, on average. Globally, considering the data-path and FSM benchmarks, the relative accuracy of our approach compared with the Design Power gate-level tool is considered satisfactory at this level of abstraction. Traditional post-synthesis gate-level methods suffer from a main drawback with respect to our approach: the need to perform time-consuming tasks such as synthesis.
On the contrary, our approach, by avoiding the descent to the gate-level description, represents an innovative methodology that meets the requirement of achieving accurate power estimation in a reasonable design time.
In conclusion, the proposed analysis addresses the problem of power estimation for embedded systems implemented in a single ASIC described in VHDL at the behavioral/RT levels, before performing the synthesis and avoiding time-consuming gate-level simulations. The main goal has been to offer a conceptual model and some power metrics to compare different design solutions described at high abstraction levels. Experimental results have shown sufficient relative accuracy with respect to gate-level power estimates. In addition, the proposed estimation procedure is not time-consuming. Finally, work is in progress aiming at integrating the proposed conceptual framework and the related power metrics to guide the partitioning task of a more general hardware-software co-design environment for embedded systems [2].
