Elevating power estimation to the architectural and behavioral levels is essential for design exploration beyond the logic level. In contrast with purely statistical approaches, an analytical model is presented to estimate the power consumption of the datapath and controller for a given RT level design. Experimental results show that an order of magnitude speed-up over low level tools, as well as satisfactory accuracy, can be achieved. This work can also serve as the basis for a behavioral level estimation tool.
Introduction
With the increasing demand for low power applications, there is a growing interest in power estimation techniques. Power estimation is essential for power optimization tools: it provides the evaluation of the cost function, and it helps to identify the "hot-spots", the candidates for further optimization.
Power estimation tools can operate on different levels of abstraction. A lot of interesting work has been done at the circuit and gate levels [Na94]. While these tools can often achieve very high accuracy, they are prohibitively expensive for architecture exploration, which is believed to offer most of the power reduction. It is thus desirable to have an estimator operating at the RT level in order to provide fast evaluation of the power metric without sacrificing too much accuracy. Related work at this level includes [La94] [Me94]. In contrast with those purely statistical approaches, we present in this paper a power analysis technique which is analytical in nature.
The rest of the paper is organized in a bottom-up fashion. In Section 2, the power model of datapath components as well as interconnections is discussed. Then we present the power analysis techniques at the RT level in Section 3. We conclude the paper with some experimental results.
Component Level Power Analysis
In this section, we try to identify the sources of power consumption for the components in the datapath library as well as the interconnections such as buses and clock trees.
Power Model of Static CMOS Gates
The three main sources of power dissipation in a static CMOS circuit are dynamic switching, leakage current, and short circuit current. The dominant factor is the first one, which is due to the charging and discharging of circuit capacitances.
Power Model of Datapath Components
This work is partially supported by Toshiba Inc.
Having identified the sources of power dissipation for the gates, we need to investigate the power consumption model at the component level. In other words, we need to know the capacitance switched during each access of the functional units, registers, and bus drivers.
Ideally, the energy consumed for each access of a component should be a function of (1) its bitwidth, (2) its previous data, which determine the previous states of all the internal circuit nodes, and (3) its current data, which determine the current states of all the circuit nodes and in turn their switching activities. This is not practical since the data are not available until run time. However, statistical measures such as the mean, variance, and correlation of the input data are relatively easy to obtain through functional simulation. It is reasonable to expect that the energy of the component is a function of the statistics of the data and the bitwidth. Based on this idea, component characterization techniques such as the Dual Bit Type (DBT) model have been proposed to model the power consumption of datapath components [La94].
An alternative is to assume uniform white noise inputs for each component. Based on this assumption, the power consumption of a component depends solely on its size. Statistical methods can be applied to obtain an average value for each component in the library. We adopt this approach because of its simplicity.
In the discussions that follow, we assume each component c in the datapath library is associated with a capacitance C, the value of which is defined as the average capacitance switched for each access of the component.
Power Model of Interconnections
Strictly speaking, the power models for the bus and clock tree belong to the next section because they depend on the RT level netlist. However, we present them here for ease of discussion. There are two factors that contribute to the capacitance of the bus: the wire capacitance and the component load.
Wire capacitance: the wire capacitance, indicated by $C_w$ in Figure 1, is determined by the length of the wire, which in turn is a result of routing. The wire length can be estimated in one of the following ways:
1. performing detailed placement and routing;
2. performing rough floorplanning, and then using the square root of the resultant chip area as an approximation of the wire length;
3. summing up the areas of all the components as an approximation of the chip area (assuming the floorplanner is perfect), and then using the square root of the chip area.
While 1 is too expensive to be practical and 2 needs an additional floorplanner, 3 is adopted for its simplicity.
Component load: the component load refers to the capacitances contributed by the units attached to the bus. There are two types of buses:
1. Multiplexed Bus: As shown in Figure 1, bus drivers are used for a multiplexed bus. For every data transfer bound to the bus, the capacitances introduced are: (1) the output capacitance of the bus driver ($C_o(Drv)$), and (2) the input capacitance of the functional units for input buses (like $C_i(+)$, $C_i(*)$), or the input capacitance of the register for output buses ($C_i(Reg)$).
2. Direct Connection:
Figure 2: Capacitances of Direct Connection Bus
As shown in Figure 2, for a direct connection bus, there is no need for bus drivers. For every data transfer bound to the bus, the capacitances introduced are: (1) the output capacitance of the source functional unit or register (like $C_o(+)$ or $C_o(Reg)$), and (2) the input capacitance of the sink functional unit or register ($C_i(+)$ and $C_i(Reg)$).
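The bus model above can be illustrated with a minimal sketch. It assumes wire-length estimate 3 (square root of the summed component areas) and a hypothetical capacitance per unit wire length; the function names, units, and numeric values are illustrative assumptions and do not come from the component library used in this work.

```python
import math


def wire_capacitance(component_areas, cap_per_unit_length):
    """Estimate C_w using option 3: wire length ~ sqrt(sum of component areas).

    cap_per_unit_length is a hypothetical technology parameter (assumption).
    """
    chip_area = sum(component_areas)      # perfect-floorplanner approximation
    wire_length = math.sqrt(chip_area)    # square root of the chip area
    return wire_length * cap_per_unit_length


def bus_transfer_capacitance(c_wire, c_source_out, c_sink_in):
    """Capacitance seen by one data transfer bound to a bus.

    Multiplexed bus:   c_source_out is C_o(Drv).
    Direct connection: c_source_out is C_o of the source unit or register.
    c_sink_in is C_i of the sink functional unit or register.
    """
    return c_wire + c_source_out + c_sink_in


# Illustrative numbers only (areas in um^2, capacitances in fF).
areas = [40000.0, 90000.0, 30000.0]
c_w = wire_capacitance(areas, cap_per_unit_length=0.2)
c_bus = bus_transfer_capacitance(c_w, c_source_out=5.0, c_sink_in=3.0)
print(c_w, c_bus)
```

The same helper covers both bus types because only the source-side capacitance differs between a multiplexed bus and a direct connection.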
Similarly, the capacitance of the clock tree is the wire capacitance plus the capacitance of the clock input $C_{clk}(Reg)$ of each register. The same technique can be applied:
$$C(Clock) = C_w + \sum_{r \in Reg} C_{clk}(r)$$
where $Reg$ is the set of registers in the design.

This section addresses the problem of estimating power at the RT level, which implies that the following is known:
1. RT Level Description:
A register transfer level design can be conveniently specified by a state action table (SAT), each row of which indicates that at a particular state, under a particular condition, the system will evolve to another given state, and the datapath will perform some given computation. A formal definition of the state action table will be given in Section 3.2.
2. Branching Probability:
Given a state action table, the execution sequence of the system is still not known because the values of the conditions are unavailable. We assume some profiling techniques are applied prior to the power analysis so that for each pair of rows $(i, j)$ in the state action table, a branching probability $Prob(i, j)$ is obtained. A more detailed treatment will be presented in Section 3.4.
3. Component Capacitance:
Based on the discussion in Section 2, for every component in the datapath library, we assume that the average capacitance switched for each access is known. In other words, the average capacitance of each bus driver can be written as $C(Drv)$, that of each register as $C(Reg)$, and that of each functional unit $FU_i$ as $C(FU_i)$. We also assume the input and output capacitances of each component are known.
For interconnections such as the bus and clock tree, although accurate information is not known until the layout stage, we assume the area estimation techniques discussed in Section 2 are applied such that for each bus $Bus_i$, we know the average capacitance switched for each access, denoted as $C(Bus_i)$. Similarly, the capacitance of the clock tree is denoted as $C(Clock)$.
With the above information given, we need to estimate the power consumption of the hardware, which is defined as
$$Power = \frac{Energy}{Cycles \times ClockPeriod} \tag{1}$$
where $Cycles$ is the total number of clock cycles.
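As a small worked instance of Equation (1), the sketch below divides a total energy by the execution time, taken as the number of cycles times the clock period. The function name and the numeric values are hypothetical and chosen only for illustration.

```python
def average_power(energy_joules, cycles, clock_period_s):
    """Equation (1): Power = Energy / (Cycles * ClockPeriod)."""
    if cycles <= 0 or clock_period_s <= 0:
        raise ValueError("cycles and clock period must be positive")
    return energy_joules / (cycles * clock_period_s)


# Illustrative values: 2 uJ consumed over 1000 cycles of a 20 ns clock.
power_watts = average_power(2e-6, cycles=1000, clock_period_s=20e-9)
print(power_watts)
```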
Architectural Model
In general, digital hardware can be modeled as an FSMD (Finite State Machine with a Datapath), where the datapath is responsible for the computation, and the controller determines when and what computation will be performed [Ga92].
Datapath
A typical datapath is shown in Figure 3. The datapath consists of functional units, registers, and buses (interconnections). A bus may or may not have a bus driver attached, depending on whether it has different sources. We omit the case of multiplexers since they can be treated as bus drivers. Because the applications concerned in this work are often power critical, we assume a design style called dynamic power management, which is frequently adopted by designers (Figure 4): each functional unit has an enable input in order to shut down the unit during its inactivity. The enable circuitry can be implemented simply as a switch which separates the bus and the functional unit; the enable signal turns the switch on and off. Note that in order for this technique to take effect, design care has to be exercised to ensure that the enable signal is asserted before the register output changes.
Controller
While a multi-level logic implementation is very difficult to predict, the analysis of the rest is similar and relatively simple. We take (c) as a representative of (a), (b), (c) and an approximation of (d) in this paper. A typical 2-level logic controller implementation is shown in Figure 5. It is composed of four parts, namely, the state register, the decoder, the next state logic, and the output logic.
At a particular state, the state of the hardware can be characterized by a set of activity vectors, namely, the current state vector $\vec{S}$, the status vector $\vec{C}$, the next state vector $\vec{NS}$, the functional unit vector $\vec{FU}$, the register vector $\vec{Reg}$, the bus vector $\vec{Bus}$, and the bus driver vector $\vec{Drv}$. While $\vec{S}$, $\vec{C}$, and $\vec{NS}$ indicate the values of the state register, status signals, and next state signals, the values of $\vec{FU}$, $\vec{Reg}$, $\vec{Bus}$, and $\vec{Drv}$ indicate the activeness of the corresponding datapath components.
The cardinality of the vector $\vec{V}$ is defined as the total number of 1's in the vector:
$$|\vec{V}| = \sum_{k} V_k$$
where $V_k$ is the $k$-th bit of $\vec{V}$.

Note that the state action table defines the behavior of the hardware, whereas the state trace defines an actual execution scenario of the hardware. In the next two sections, we first discuss the computation of power consumption for an execution sequence in Section 3.3, based on which we derive estimation techniques for power consumption directly from the state action table in Section 3.4.
Power Estimation from State Trace
where the factor 2 accounts for the switches of both the falling and rising edges of the clock.
Datapath
The activity of the datapath at state $t_i$ can be characterized by the corresponding activity vectors $\vec{FU}_i$, $\vec{Reg}_i$, $\vec{Bus}_i$, and $\vec{Drv}_i$. We denote their concatenation as $\vec{DP}_i$:
$$\vec{DP}_i = \vec{FU}_i \# \vec{Reg}_i \# \vec{Bus}_i \# \vec{Drv}_i$$
The capacitances of all the functional units in the datapath form a capacitance vector $\vec{C}_{FU} = (C(FU_1), C(FU_2), \ldots)$. Similarly, we can define the capacitance vectors for registers, buses, and bus drivers as $\vec{C}_{Reg}$, $\vec{C}_{Bus}$, and $\vec{C}_{Drv}$ respectively. We denote their concatenation as $\vec{C}_{DP}$:
$$\vec{C}_{DP} = \vec{C}_{FU} \# \vec{C}_{Reg} \# \vec{C}_{Bus} \# \vec{C}_{Drv}$$
General Model
Figure 5 shows the controller implementation. The controller falls naturally into four parts, namely, the state register, the decoder, the next state logic, and the output logic. The decoder is essentially a set of AND-gates, the inputs of which are connected to the outputs of the state register and the status signals. Note that each input is indicated as a dot in Figure 5 and introduces a capacitance load ($C_{And}$) for the state register output. The next state logic and the output logic are essentially sets of OR-gates, the inputs of which are outputs of the decoder. Again, each input is indicated as a dot in Figure 5 and introduces a capacitance load ($C_{Or}$) for the AND-gates of the decoder.
The dots in the next state logic and output logic form two matrices: the next state matrix and the output logic matrix. The rows of the matrices correspond to the decoder outputs, each of which in turn corresponds to a state tuple in the state action table.

The activity of the controller at state $t_i$ can be characterized by a set of activity vectors, namely the current state vector $\vec{S}_i$, the next state vector $\vec{NS}_i$, the decoder vector $\vec{D}_i$, and the output vector $\vec{O}_i$. These activity vectors correspond to the outputs of the state register, status signals, next state logic, decoder, and output logic respectively. Figure 6 shows the values of these vectors at each state for the example shown in Figure 5. It is obvious that
$$\vec{O} = \vec{FU} \# \vec{Reg} \# \vec{Drv}$$
Each bit of an activity vector $\vec{V}$ (which could be any of $\vec{S}$, $\vec{NS}$, $\vec{D}$, $\vec{O}$) is associated with a capacitance. The capacitances for all the bits also form a capacitance vector, denoted as $\vec{C}_L = (C_{L0}, C_{L1}, \ldots)$. The energy consumed at state $i$ can then be measured as
$$(\vec{V}_i \oplus \vec{V}_{i+1}) \cdot \vec{C}_L \times V_{DD}^2$$
The total energy consumed for the entire state trace on this vector can be computed as
$$\sum_{t_i, t_{i+1} \in ST} (\vec{V}_i \oplus \vec{V}_{i+1}) \cdot \vec{C}_L \times V_{DD}^2$$
where the sum runs over consecutive states $t_i, t_{i+1}$ of the state trace $ST$.
Based on this model, we will identify the capacitance vector as well as the activity vector for each part of the controller.
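The general model can be sketched in a few lines. The example below assumes that activity vectors are given as lists of 0/1 bits and follows the per-state expression as it appears in the text (bitwise XOR of consecutive vectors, dot product with the capacitance vector, times the square of the supply voltage); the function names and the sample trace are hypothetical.

```python
def switched_bits(v_now, v_next):
    """Bitwise XOR of two activity vectors: 1 marks a bit that toggles."""
    return [a ^ b for a, b in zip(v_now, v_next)]


def trace_energy(trace, cap_vector, vdd):
    """Sum (V_i xor V_{i+1}) . C_L over consecutive states, times VDD^2."""
    switched_cap = 0.0
    for v_now, v_next in zip(trace, trace[1:]):
        toggles = switched_bits(v_now, v_next)
        switched_cap += sum(t * c for t, c in zip(toggles, cap_vector))
    return switched_cap * vdd ** 2


# Hypothetical 2-bit activity vector over a four-state trace.
trace = [[0, 0], [0, 1], [1, 1], [0, 0]]
caps = [1.0, 2.0]   # illustrative per-bit capacitances
energy = trace_energy(trace, caps, vdd=2.0)
print(energy)
```

Each part of the controller then only needs to supply its own activity vectors and capacitance vector to the same routine.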
State Register and Next State Logic
Since for $t_i, t_{i+1} \in ST$ we always have $\vec{NS}_i = \vec{S}_{i+1}$, the switching activities of the state vector and next state vector are the same, so we treat them together.
The capacitance of each bit of the state register consists of (1) its internal capacitances and (2) the output loads due to its fanout to the state decoder. The capacitance of each bit of the next state logic is the input capacitance of the state register. The capacitances mentioned above are the same for each bit, so we denote their sum as $C_{Reg}$, and the corresponding capacitance vector becomes $C_{Reg} \times \vec{I}$, where $\vec{I} = (1, 1, \ldots, 1)$ is the unit vector. The total energy consumption of the state register and next state logic can then be computed as
$$\sum_{t_i, t_{i+1} \in ST} |\vec{S}_i \oplus \vec{S}_{i+1}| \times C_{Reg} \times V_{DD}^2$$
Decoder
The switching activity of the decoder is simple to analyze. At every state $t_i \in ST$, only the output of the corresponding AND-gate is 1. In other words, at every state, exactly two AND-gates will switch: the gate corresponding to the previous state switches from 1 to 0, the gate corresponding to the current state switches from 0 to 1, and the rest of the gates remain unchanged.
The capacitance of each AND-gate in the state decoder is determined by its fanout, that is, the number of dots along its row in Figure 5. If we assume each input of the OR-gates introduces the same capacitance $C_{Or}$, the $i$th bit of the capacitance vector $\vec{C}_L$ can be computed as $\#dots(row_i) \times C_{Or}$, where $\#dots(row_i)$ can be computed as $|\vec{NS}_i \# \vec{O}_i|$.
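The dot-counting rule can be sketched as follows. The code assumes the next state matrix and output logic matrix are given as lists of 0/1 rows, one row per decoder output; the second function applies the general XOR model to a one-hot decoder vector, so that only the previous and current rows contribute on each transition. This is an illustration under those assumptions, and the matrices and names are hypothetical.

```python
def decoder_capacitances(next_state_matrix, output_matrix, c_or):
    """Per-row load: #dots(row_i) * C_Or, with #dots = |NS_i # O_i|."""
    return [
        (sum(ns_row) + sum(out_row)) * c_or
        for ns_row, out_row in zip(next_state_matrix, output_matrix)
    ]


def decoder_switched_capacitance(row_trace, caps):
    """One-hot decoder: on each change of active row, the previous gate
    falls and the current gate rises, so both loads are switched."""
    total = 0.0
    for prev_row, cur_row in zip(row_trace, row_trace[1:]):
        if prev_row != cur_row:
            total += caps[prev_row] + caps[cur_row]
    return total


# Hypothetical controller with three state tuples.
ns_matrix = [[0, 1], [1, 0], [1, 1]]
out_matrix = [[1, 0, 1], [0, 0, 1], [1, 1, 1]]
caps = decoder_capacitances(ns_matrix, out_matrix, c_or=2.0)
switched = decoder_switched_capacitance([0, 1, 2, 2, 0], caps)
print(caps, switched)
```

Multiplying the switched capacitance by the square of the supply voltage gives the decoder energy under the same general model.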
Due to the "one-hot" property of the activity vector $\vec{D}$, the energy consumed on the decoder can then be computed by counting the number of dots along the rows.
Output Logic
The activity vector of the output logic is $\vec{O}_i = \vec{FU}_i \# \vec{Reg}_i \# \vec{Drv}_i$. If we denote the capacitance vector as $\vec{C}_O$, then the energy consumed on the output logic is:
$$\sum_{t_i, t_{i+1} \in ST} (\vec{O}_i \oplus \vec{O}_{i+1}) \cdot \vec{C}_O \times V_{DD}^2$$
RT Level Power Estimation
Branching Probability and Execution Frequency
In the previous section, we developed a set of formulas for power estimation from a state trace. However, the state trace information is not available in general. We resort to profiling techniques to obtain the branching probability function $Prob(i, j)$, defined for every pair of tuples $(t_i, t_j)$ in the state action table.
The execution frequency of a state tuple in the SAT is defined as the expected number of times the state tuple will be executed. The execution frequency can be obtained either from the profiling tool or directly from the branching probability function. Given the branching probability function, the determination of the execution frequency of each state tuple can be formulated as solving a set of linear equations.
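One way such a linear system could be set up is sketched below. It assumes a flow-balance form, Freq(t_j) = entry_j + sum over i of Freq(t_i) * Prob(i, j), where entry_j is 1 for a single start tuple and 0 otherwise, and it assumes the table has an exit (a row whose probabilities sum to less than 1) so that the system is non-singular. This form, the function names, and the sample probabilities are assumptions for illustration, not the equations of this work.

```python
def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: no exit from the state table?")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                for k in range(col, n + 1):
                    m[r][k] -= factor * m[col][k]
    return [m[i][n] / m[i][i] for i in range(n)]


def execution_frequencies(prob, start=0):
    """Expected execution count of each tuple, given prob[i][j] = Prob(i, j)."""
    n = len(prob)
    # Row j encodes: Freq(j) - sum_i Prob(i, j) * Freq(i) = entry_j
    a = [[(1.0 if i == j else 0.0) - prob[i][j] for i in range(n)]
         for j in range(n)]
    b = [1.0 if j == start else 0.0 for j in range(n)]
    return solve_linear(a, b)


# Hypothetical table: t0 -> t1; t1 loops with probability 0.75 or
# goes to t2 with probability 0.25; t2 exits.
prob = [
    [0.0, 1.0, 0.0],
    [0.0, 0.75, 0.25],
    [0.0, 0.0, 0.0],
]
freq = execution_frequencies(prob)
print(freq)
```

In this example the loop tuple is expected to execute four times per run, since it repeats with probability 0.75.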
Formula
The formulas developed in the previous section can then be rewritten by inspecting the state tuples in the SAT one by one. In other words, the power metrics can be measured as the sum of the corresponding metrics of all the state tuples weighted by their execution frequencies.
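The frequency-weighted sum can be illustrated for the datapath. The sketch assumes that the metric of a state tuple is the capacitance switched by its active components, that is, the dot product of its datapath activity vector with the datapath capacitance vector; this per-tuple form, the function name, and all numbers are assumptions made for illustration.

```python
def expected_datapath_capacitance(activity, cap_vector, freq):
    """Sum over tuples of Freq(t_i) * (DP_i . C_DP)."""
    total = 0.0
    for dp_i, f_i in zip(activity, freq):
        per_execution = sum(bit * c for bit, c in zip(dp_i, cap_vector))
        total += f_i * per_execution
    return total


# Hypothetical table with three tuples and three datapath components.
activity = [[1, 0, 1], [0, 1, 1], [1, 1, 0]]   # DP_i for each tuple
cap_dp = [10.0, 4.0, 2.0]                      # illustrative C_DP
freq = [1.0, 4.0, 1.0]                         # execution frequencies
switched = expected_datapath_capacitance(activity, cap_dp, freq)
print(switched)
```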
In order to evaluate the estimation tool, we applied it to a set of well known benchmarks [HW92]. Figure 7 shows the block diagram of the experiment.
The component library was built by feeding a functional VHDL description of each component to the COMPASS ASIC Synthesizer. The synthesized components were then fed into the component power profiler [Ag95] to obtain an average power for each component, which was stored in the library. The RT level design of each benchmark was manually synthesized from its behavioral VHDL description. Assuming the architectural model in Section 3.1, the power estimate of each benchmark was obtained by applying equations 8-13 in Section 3.4. We are able to obtain the average power of each benchmark in a couple of seconds on a Sparc 5 station.
The RT level VHDL description of each benchmark instantiating the components in the same library was also fed into the COMPASS chip compiler to obtain the layout. Netlists annotated with node capacitances were then extracted from the layout. Logic simulation assuming random input was invoked to obtain the total switched capacitances.
We compare the estimated results of the datapath and controller with the measured results obtained from the layout in Table 1 and Table 2 respectively. The columns of the tables show the estimated switched capacitance for different classes of components (such as the functional units (FU), registers (Reg), buses (Bus), bus drivers (Drv), clock (Clk), state register (SR), next state logic (NS), decoder (Dec), and output logic (Output)), the total estimated switched capacitance, the measured switched capacitance, and the error computed as $|measured - estimated| / measured$. The rows of the tables correspond to different benchmarks.
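The error metric used in the tables can be computed as in the short sketch below. The capacitance values are hypothetical and are not taken from Table 1 or Table 2.

```python
def relative_error(measured, estimated):
    """Error metric: |measured - estimated| / measured."""
    if measured <= 0:
        raise ValueError("measured switched capacitance must be positive")
    return abs(measured - estimated) / measured


# Illustrative values only.
err = relative_error(measured=200.0, estimated=190.0)
print(f"{err:.1%}")
```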
Conclusions
The described power estimation technique, which is statistical in nature at the component level and analytical at the RT level, offers fast feedback for high level exploration tools. Experiments on standard benchmarks show that the average error is 5% for the datapath and 7% for the controller. Our future work will extend this technique to the behavioral level.

