A new static power analysis method for CMOS combinational circuits is presented. This approach integrates the simulation-based method and the probabilistic method, and can establish the relationships between the primary inputs and the internal nodes in the circuit. Based on these relationships, our approach can also indicate which internal node or input sequence consumes the most power. It is thus suitable for performing power estimation in the synthesis environment for power optimisation. To the best of our knowledge, this is the first attempt to develop a systematic way to symbolically represent the relationships between the primary inputs and the power consumption at every internal node of a circuit. Furthermore, by using the existing piecewise linear delay model together with the proposed algorithm, this novel method is also very accurate and efficient. For a set of benchmark circuits, the experimental results show that the power estimated by our technique is within 5% of that obtained by exact SPICE simulation, while the execution is more than four orders of magnitude faster.
Introduction
Power dissipation has emerged as an important design parameter in the design of microelectronic circuits, especially in portable computing and personal communication applications. More generally, as the density and the size of chips and systems continue to increase, power consumption becomes a critical concern in VLSI design [1]. Low power design techniques are becoming increasingly important in today's integrated circuit designs [2]. Therefore, a fast and accurate power estimator is necessary for a low power circuit/system designer.
During synthesis for low power at higher levels of abstraction, such as the register transfer level, it is almost impossible to obtain a very accurate estimate of the power consumption of each module. However, if the estimation model is able to correctly sort the alternative designs according to the estimated power, we can still select the right designs for low power, because there are relatively few alternative designs and each design has a very different trade-off among power, area and performance. When performing synthesis for low power at gate level or below, however, power estimation with high accuracy becomes a must, because there could be many alternative designs and it is difficult to obtain an estimate with relatively low accuracy but good fidelity [3]. With a less accurate estimation of power, we may therefore optimise the circuits in the wrong spots, so that we cannot lower the power consumption with minimum overheads in terms of area and performance. Quite a few approaches have been proposed [4-6] to reduce the inefficiency of SPICE while maintaining acceptable estimation errors. The PowerMill approach [4] is a transistor-level power simulator, which uses an event-driven simulation algorithm to increase the speed by two to three orders of magnitude over SPICE. Although PowerMill is relatively accurate, it is still not suitable for power-driven synthesis. This is because, when power optimisation is performed, the circuits will be modified frequently, and power estimation should be done incrementally to speed up the process. The simulation-based approach cannot be used in this situation. Switch-level simulation techniques are, in general, much faster than circuit-level simulation techniques, but are not as accurate or versatile. Standard simulators, such as IRSIM [6], can be easily modified to report the switched capacitance (and thus the dynamic power dissipation) during a simulation run.
However, the approaches mentioned above suffer from three major problems as power simulation tools. First, they must simulate the 'chosen patterns' with many iterations to determine the average power consumption of each node, which slows down the simulation. Also, the average results strongly depend on the 'chosen patterns', and we may get biased simulation results. Secondly, if the PIs are not fully independent, the choice of patterns needs much more attention and the number of patterns needed is large. Thirdly, due to the nature of simulation-based simulators, they cannot provide enough power information, such as the percentage of glitch power in each node's power consumption, the difference between generation of glitches and passing of glitches, the source node of a glitch, or the cause of the glitch. Thus, these approaches can leave the user or synthesiser without clear guidance on how to improve or resynthesise a circuit.
In this paper, we will develop a new method which retains the advantages of both the simulation-based methods and the probabilistic methods. This power analyser can estimate the total power consumption due to the circuit itself as well as the power consumption due to the transition, and the spike at every internal node in the circuit efficiently and accurately. Furthermore, this new approach can not only establish the relationships between the primary inputs and the internal nodes in the circuit but also increase the efficiency of the simulation-based method. Ghosh proposed a symbolic simulator based on the binary decision diagram (BDD) [7] . The major difference between our method and the method in [7] is that we use the idea of cube representation to simplify the process of simulation. By using the idea of a cube, we can get much more information from the simulation than the method in [7] . To the best of our knowledge, our approach is the first attempt to develop a systematic way to symbolically represent the relationships between the primary inputs and the power consumption at every internal node of a circuit. The simulation results show that our new method provides much more information on power consumption than other methods. The approach can thus be integrated into a synthesis environment to determine where it can be improved or resynthesised for low power. Thus, this method is very suitable for performing power estimation in the synthesis environment for power optimisation. 
Power dissipation model
It is well known that the dynamic power estimation formula is $P = \frac{1}{2} a C V^2 f$, where P is the average power, a is the switching activity, V is the supply voltage, f is the frequency, and C is the load capacitance of the gate [8, 9]. We use an ideal gate (e.g. AND, OR, etc.) and equivalent input and output capacitances to model a real gate. Using [1], we can estimate C.
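As a quick numerical illustration of the formula above, the following Python sketch evaluates $P = \frac{1}{2} a C V^2 f$ for a single node. The function name and the values (50 fF load, 5 V supply, 20 MHz clock, switching activity 0.5) are assumptions chosen only for illustration; they are not taken from the experiments reported in this paper.

```python
def dynamic_power(activity, capacitance, voltage, frequency):
    """Average dynamic power in watts: P = 1/2 * a * C * V^2 * f."""
    return 0.5 * activity * capacitance * voltage ** 2 * frequency


# Illustrative (assumed) values for one node.
p_node = dynamic_power(
    activity=0.5,        # switching activity a
    capacitance=50e-15,  # load capacitance C in farads
    voltage=5.0,         # supply voltage V in volts
    frequency=20e6,      # frequency f in hertz
)
print(f"{p_node * 1e6:.2f} uW")
```

Because the power is linear in a and C, halving either the switching activity or the load capacitance of a node halves its contribution to the total.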
Power dissipation model and definitions
We also established a database of multiple-SPICE-simulation-based thresholds [10] to approximate the piecewise linear delay model, which further improves the precision and efficiency of the simulation. The simulation in [10] was very accurate and efficient: the error is less than 5% as compared with the HSPICE simulation, while the execution is more than three orders of magnitude faster. For some circuits, the speedup is even more than four orders of magnitude. However, a large number of different input sequences is required for [10], and this simulation-based method was still very time consuming. Furthermore, the information on the average power consumption is not enough for performing power optimisation in the synthesis environment. In other words, with only the average power values derived from the simulation-based method, we cannot efficiently figure out where the most power is consumed and why. Therefore, it is necessary to combine a simulation-based method with a probabilistic method. In the following discussions, we will develop a new method based on the static analysis approach. This method can not only construct the relationships between the primary inputs and the internal nodes in the circuit but also increase the efficiency of the simulation-based method.
Node probabilities
The signal probability p(X) of a node X is defined as the probability that node X has a value of logic 1. Let us now define three special probabilities $P_1$, $P_0$, and $P_S$.
Assume that node X is the output of a gate g. Then the switching probability of X, $P_S(X)$, is equal to $2 \times p(X) \times (1 - p(X))$ and is defined as the probability that node X will switch from low to high or high to low if any input(s) of gate g changes. The holding-one probability of X, $P_1(X)$, is equal to $p(X)^2$ and is defined as the probability that node X will hold in high (one) if any input(s) of gate g changes. The holding-zero probability of X, $P_0(X)$, is equal to $(1 - p(X))^2$ and is defined as the probability that node X will hold in low (zero) if any input(s) of gate g changes.
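The three definitions above can be checked numerically with a few lines of Python. This is only a minimal sketch: the function name `node_probabilities` is hypothetical, and the code simply evaluates the three formulas for a given signal probability p(X). The three values always sum to one, since $p^2 + 2p(1-p) + (1-p)^2 = (p + (1-p))^2 = 1$.

```python
def node_probabilities(p):
    """Return (P1, P0, PS) for a node with signal probability p.

    P1 = p^2         holding-one probability
    P0 = (1 - p)^2   holding-zero probability
    PS = 2 p (1 - p) switching probability
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("signal probability must lie in [0, 1]")
    p_one = p * p
    p_zero = (1.0 - p) ** 2
    p_switch = 2.0 * p * (1.0 - p)
    return p_one, p_zero, p_switch


p1, p0, ps = node_probabilities(0.5)
print(p1, p0, ps)
```

For p(X) = 0.5 the switching probability reaches its maximum of 0.5, which is why balanced signal probabilities are the worst case for switching activity under these definitions.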
We define a cube with n elements as:
, and many internal nodes. The probability of this cube is $\prod_{i=1}^{n} P_{K(PI_i)}(PI_i)$, where $K(PI_i)$ can be 1, 0, or S. Here we assume that the primary inputs are uncorrelated for the sake of easy explanation. Given the cube $C_1(d) = \langle P_1(a), P_1(b), 1 \rangle$ for internal node d, the cube $C_1(d)$ means that to have internal node d in the holding-one state, the primary input node a and node b must both be in the holding-one state and the node c can be in any of the three states. The probability
Definitions of symbols and cubes
Since each element of the cube represents the corresponding input, we do not have to write the X explicitly. Therefore, we introduce simple notations to be used in the cube as follows:
(i) 1: the probability of any particular PI in the holding-one state.
(ii) 0: the probability of any particular PI in the holding-zero state.
(iii) s or b: the probability of any particular PI in the switching state. All the PIs with the same switching direction (i.e. low to high (high to low)) are represented as s (b), while all other PIs with the opposite switching direction are represented as b (s).
(iv) S(i) or B(i): the probability of any particular PI in the switching state (i is just a sequence index). However, the effect of this switching is blocked by some gates between the PIs and the node under consideration, and hence no transition is generated by the switching of the PIs at this node. Thus, this switching state is less restrictive than s or b. We call this a don't-care switching state.
(v) -: this input is not related to the cube. We call this a don't-care state. The probability is one.
For the determination of the effect of a spike, we extend the cube by adding two fields, the beginning time and the lasting time, to describe the timing information. The beginning time represents the starting time of the action and the lasting time represents how long such an action will last. We will simply call the holding-one cube the 1-cube (represented as $C_1(X)$), the holding-zero cube the 0-cube (represented as $C_0(X)$), the switching cube the S-cube (represented as $C_S(X)$), and the spike cube the G-cube (represented as $C_G(X)$). The 1-cube and the 0-cube specify the conditions and probabilities for any particular node to hold in the 1 state and the 0 state respectively. Thus, their beginning times are always set to 0.0 and their lasting times are always set to $\infty$. For the S-cube, the beginning time is set to the time when the switching starts and the lasting time is set to $\infty$. The G-cube's beginning time is the beginning time of the first calculated switching, and its lasting time is the spike duration time.
For example, assume that there is a node y whose $C_S(y) = \langle 1.35, \infty, s, b, 1, S(1), -, B(1), S(2) \rangle$. The circuit has seven PIs ($PI_1 \ldots PI_7$). The cube $C_S(y)$ means that one of the cases that make node y switch is to have $PI_1$ in the switching state, $PI_2$ in the switching state with a switching direction opposite to that of $PI_1$, $PI_3$ in the holding-one state, $PI_4$ in the don't-care switching state, $PI_5$ in the don't-care state, $PI_6$ in the don't-care switching state with a switching direction opposite to that of $PI_4$, and $PI_7$ also in the don't-care switching state, not related to any other PI. The switching will start at time = 1.35 units.
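One possible in-memory representation of such a timed cube is sketched below in Python. The class and function names are hypothetical. The probability routine multiplies one factor per PI, as in the product form of the cube probability for uncorrelated PIs, and makes the simplifying assumption that each of the symbols s, b, S(i), and B(i) contributes the switching probability $P_S$, while "-" contributes one. Any extra factor implied by the direction constraints between linked PIs is not modelled here.

```python
import math
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Cube:
    begin: float              # beginning time
    last: float               # lasting time (math.inf = lasts forever)
    pi_part: Tuple[str, ...]  # one symbol per primary input


def symbol_kind(symbol: str) -> str:
    """Reduce a cube symbol to the PI state it requires: '1', '0', 'S' or '-'."""
    if symbol in ("1", "0", "-"):
        return symbol
    if symbol in ("s", "b") or symbol[:2] in ("S(", "B("):
        return "S"  # any kind of switching state
    raise ValueError(f"unknown cube symbol: {symbol!r}")


def cube_probability(cube: Cube, p: Sequence[float]) -> float:
    """Product of per-PI state probabilities, PIs assumed uncorrelated."""
    prob = 1.0
    for symbol, p_i in zip(cube.pi_part, p):
        kind = symbol_kind(symbol)
        if kind == "1":
            prob *= p_i ** 2                 # holding-one
        elif kind == "0":
            prob *= (1.0 - p_i) ** 2         # holding-zero
        elif kind == "S":
            prob *= 2.0 * p_i * (1.0 - p_i)  # switching
        # '-' is a don't-care state with probability one
    return prob


# The example S-cube of node y from the text, with seven PIs.
cs_y = Cube(1.35, math.inf, ("s", "b", "1", "S(1)", "-", "B(1)", "S(2)"))
prob = cube_probability(cs_y, [0.5] * 7)
print(prob)
```

With all signal probabilities equal to 0.5, the five switching symbols each contribute 0.5, the holding-one symbol contributes 0.25, and the don't-care symbol contributes 1.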
The symbols 0, 1, and s are similar to $P_i^{00}$, $P_i^{11}$, $P_i^{10}$, and $P_i^{01}$ proposed in [7]. However, Ghosh in [7] used the BDD representation to simulate the circuit. Instead of BDDs, we use cube-based operations in our simulator. This not only makes the circuit easy to simulate but also provides much more information for synthesising low-power circuits, including which internal node or which input sequence consumes the most power. Therefore, our proposed method is more suitable for use in the synthesis environment.
Definition of cube sets
We call the union of the same type of cubes a 'cube set'. Every node in a circuit has four cube sets (i.e. the 1, 0, S, and G cube sets). For a node X, the four cube sets are represented as $\{C_0(X)\}$, $\{C_1(X)\}$, $\{C_S(X)\}$, and $\{C_G(X)\}$.
Operators and algorithm
We will, in this Section, define operations at the cube level, the cube set level, and the logic node level, respectively. All higher level operations are built upon the lower level operations.
Cube level operators
The proposed operators for cubes are intersection ($\cap$, e.g. $A \cap B$), bar ($\bar{\ }$, e.g. $\bar{A}$), and don't-care (dc, e.g. dc(A)).
Given an OR gate g with a gate delay of 1.3 time units and an inertial delay of 0.3 time units, its fanout is node z and its fanins are node x and node y. The nodes x, y, and z are all internal nodes. Assume a 1-cube in $\{C_1(x)\}$ is $\langle 0.0, \infty, 1, -, S(1), B(1), s \rangle$ and an S-cube in $\{C_S(y)\}$ is $\langle 1.4, \infty, 1, S(2), S(2), s, - \rangle$. The main function of the $\cap$ operator is to find the PI state which is compatible with the PI-part of both cubes. The procedure is explained as follows and is illustrated in Fig. 2.
(ii) $PI_4$: $B(1) \cap s \rightarrow s$. Since s is more restricted than B(1), the result is s.
Since the B(1) in the $PI_4$ position of the first cube is changed to s, the S(1) in the $PI_3$ position of the first cube has to be changed to b, which is opposite to s, and the S(2) in the $PI_3$ position of the second cube is also changed to b.
(v) $PI_1$: $1 \cap 1 \rightarrow 1$.
We can see that the $\cap$ operation is not a straightforward element-wise AND operation on cubes. Because the result of intersection is in the 1-cube set of node z, 
Spike determination
A spike is generated by two different signals going through the same gate with different arrival times and opposite transition directions. In Fig. 3, we show a spike generated at node f. Here we assume the parameters of Fig. 3 are the same as those of Fig. 1.
bar(d, i):
Compare the lasting time of a cube with i. If the lasting time is less than i, the result of this operator is an empty cube. Otherwise, add a delay of d units to the beginning time of the cube, keep the original lasting time in the cube, and change the PI-part based on the rules shown in Table 1.
Intersect two cubes and obtain a new cube. In the new cube, the beginning time = d + Max(beginning times of the two cubes), the lasting time = $\infty$, and the PI-part is the result of applying the $\cap$ operator on the PI-parts of the two cubes.
$\cap(d, i, 1)$:
If both lasting times are less than i, the result is an empty cube; otherwise, the new cube is derived as follows:
timing fields: beginning time = d + the larger beginning time of the two cubes; lasting time = the smaller lasting time of the two cubes.
PI-part: intersection of the PI-parts of the two cubes.
$\cap(d, i, 2)$:
If the difference between the two cubes' beginning times is less than i, the result is an empty cube; otherwise, the new cube is derived as follows:
timing fields: beginning time = d + the smaller beginning time of the two cubes; lasting time = difference of the two cubes' beginning times.
PI-part: intersection of the PI-parts of the two cubes.
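The timing-field rules of the two timed intersection operators can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the names `TimedCube`, `intersect_mode1`, and `intersect_mode2` are hypothetical, the empty cube is modelled as `None`, and the PI-part intersection is passed in as a function `pi_intersect`, because its full rule table is not reproduced here. The demonstration uses a placeholder that simply returns the first PI-part, which is not the real $\cap$ rule.

```python
import math
from typing import Callable, Optional, Tuple

PIPart = Tuple[str, ...]
TimedCube = Tuple[float, float, PIPart]  # (beginning time, lasting time, PI-part)
PIIntersect = Callable[[PIPart, PIPart], Optional[PIPart]]


def intersect_mode1(a: TimedCube, b: TimedCube, d: float, i: float,
                    pi_intersect: PIIntersect) -> Optional[TimedCube]:
    """Timing rule of the (d, i, 1) intersection: empty if both lasting times < i."""
    if a[1] < i and b[1] < i:
        return None
    pi = pi_intersect(a[2], b[2])
    if pi is None:  # incompatible PI-parts
        return None
    return (d + max(a[0], b[0]), min(a[1], b[1]), pi)


def intersect_mode2(a: TimedCube, b: TimedCube, d: float, i: float,
                    pi_intersect: PIIntersect) -> Optional[TimedCube]:
    """Timing rule of the (d, i, 2) intersection: empty if the arrival gap < i."""
    gap = abs(a[0] - b[0])
    if gap < i:
        return None
    pi = pi_intersect(a[2], b[2])
    if pi is None:
        return None
    return (d + min(a[0], b[0]), gap, pi)


def keep_first(p: PIPart, q: PIPart) -> Optional[PIPart]:
    """Placeholder only; NOT the real PI-part intersection rule."""
    return p


# Gate delay 1.3 and inertial delay 0.3, as in the OR-gate example.
x = (0.0, math.inf, ("1", "-"))
y = (1.4, math.inf, ("1", "s"))
r1 = intersect_mode1(x, y, d=1.3, i=0.3, pi_intersect=keep_first)

u = (1.0, math.inf, ("s", "-"))
v = (1.5, math.inf, ("b", "-"))
r2 = intersect_mode2(u, v, d=1.3, i=0.3, pi_intersect=keep_first)
print(r1, r2)
```

In the second call the two beginning times differ by 0.5 time units, which exceeds the inertial delay of 0.3, so a cube lasting 0.5 units is produced; a gap below 0.3 would give the empty cube.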
Formal definitions of cube-set level operators
Let Cu, Cv and Cw be three cube sets. Union is the ordinary union operator on sets. For the cube set (1, 0, S, G), there are six operators.
Operators at the node level
We will develop the operators only for the two-input AND, OR, and NOT gates. The operators for all other complex gates can be built on this foundation.
operator NOT:
Given an inverter with input a and output c and with delay d and inertial delay i, the corresponding cube sets of output c are calculated as follows:
operator AND:
Given an AND gate with inputs a and b and output c and with delay d and inertial delay i, the corresponding cube sets of output c are calculated as follows:
Partitioning and cube reduction algorithm
Storing every cube and cube set completely may exceed the available memory. If we can reduce the number of PIs, we can reduce the size of each individual cube and cube set. Also, if we can reduce the number of cubes, we can surely reduce the memory requirement. We solve the problem in two steps: partitioning the circuit into groups, then reducing the number of cubes dynamically.
Partitioning into groups:
We first determine the distances between different POs, then divide the POs into several groups by using the distance information. Then the grouped POs, together with the PIs and the internal nodes that have fanouts to these POs, are put into the same partition. Note that we do not lose any information at this step. We just separate the unrelated PIs and internal nodes into different groups to avoid unnecessary calculations.
The distance is defined as: if $\exists$ gate g, such that x (or y) is g's fanin, and the other is g's fanout.
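The second half of the partitioning step, collecting the PIs and internal nodes that have fanouts to a group of POs, can be sketched as a backward traversal of the netlist. In the Python sketch below the netlist format (a dictionary from each gate output to its fanins), the node names, and the function name `fanin_cone` are all hypothetical. The grouping of POs by distance is not implemented; the groups are simply given as input.

```python
from typing import Dict, List, Set


def fanin_cone(netlist: Dict[str, List[str]], outputs: List[str]) -> Set[str]:
    """Return every node with a path to one of `outputs`, outputs included."""
    cone: Set[str] = set()
    stack = list(outputs)
    while stack:
        node = stack.pop()
        if node in cone:
            continue
        cone.add(node)
        # Primary inputs have no entry in the netlist, hence no fanins.
        stack.extend(netlist.get(node, []))
    return cone


# Hypothetical netlist: gate output -> list of fanins.
netlist = {
    "n1": ["a", "b"],
    "n2": ["b", "c"],
    "o1": ["n1", "n2"],
    "o2": ["d", "e"],
}
po_groups = [["o1"], ["o2"]]  # assumed result of the distance-based grouping
partitions = [fanin_cone(netlist, group) for group in po_groups]
print(partitions)
```

Here the two partitions share no PIs, so the cubes of the first partition need only three PI positions and those of the second only two, instead of five each.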
Cube reduction algorithm:
When a predefined memory usage limit is reached, the dynamic cube reduction algorithm is executed to average the $C_1$ and $C_0$ cube sets of some frontier nodes. These nodes are then treated as new PIs, and a new simulation starts from them. Since the most important cube sets are $C_S$ and $C_G$, we focus on how to reduce these cube sets.
Algorithm
Given a netlist, we first sort the gates in the netlist topologically. We then apply the corresponding operators defined previously to each gate in topological order. The complexity of the traversal is proportional to the number of gates. The overall algorithm is shown below.
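The traversal described above can be sketched in Python as follows. This is a minimal illustration, not the authors' listing. To keep it short and runnable, the per-gate step propagates plain signal probabilities under an independence assumption for the fanins (a standard simplification that ignores reconvergent fanout), used here only as a stand-in for the cube-set operators. The netlist format and all names are hypothetical.

```python
from graphlib import TopologicalSorter  # Python 3.9+
from typing import Dict, List, Tuple

Gate = Tuple[str, List[str]]  # (gate type, fanins)


def propagate(gates: Dict[str, Gate], pi_prob: Dict[str, float]) -> Dict[str, float]:
    """Visit gates in topological order and compute p(X) for every node."""
    graph = {out: fanins for out, (_, fanins) in gates.items()}
    p = dict(pi_prob)
    for node in TopologicalSorter(graph).static_order():
        if node in p:  # primary input: probability already known
            continue
        kind, fanins = gates[node]
        if kind == "NOT":
            p[node] = 1.0 - p[fanins[0]]
        elif kind == "AND":
            p[node] = p[fanins[0]] * p[fanins[1]]
        elif kind == "OR":
            p[node] = 1.0 - (1.0 - p[fanins[0]]) * (1.0 - p[fanins[1]])
        else:
            raise ValueError(f"unsupported gate type: {kind}")
    return p


# Hypothetical three-gate circuit: z = (a AND b) OR (NOT c).
gates = {
    "n1": ("AND", ["a", "b"]),
    "n2": ("NOT", ["c"]),
    "z": ("OR", ["n1", "n2"]),
}
probs = propagate(gates, {"a": 0.5, "b": 0.5, "c": 0.5})
switching_z = 2.0 * probs["z"] * (1.0 - probs["z"])
print(probs["z"], switching_z)
```

Each gate is visited exactly once after all of its fanins, so the cost of the traversal itself grows linearly with the number of gates, as stated above.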
Experimental results
The proposed analyser, based on the above power estimation model and delay model, has been implemented in C on a SUN SPARCstation 10 with 7000 lines of code. The transistor models used are those of the TSMC 0.8 µm SPDM CMOS technology. Table 2 shows the full timing data of the basic gates derived from the SPICE simulation by applying the method proposed in [10]. We ran SPICE and the proposed analyser on the circuit shown in Fig. 5 to evaluate the quality and efficiency of the method. Table 3 shows the detailed simulation results of the power consumption.
The simulation results of all the nodes by the SPICE and the proposed method are shown in Table 3 . The estimation on the total power consumption by the proposed method has an error of about 4% compared with that by the SPICE simulation. Table 4 shows the simulation results on several circuits when exhaustive simulation is performed using both the SPICE and our analyser. These test benchmarks include decoders, full adders, and a multiplexer.
The 'cs*' benchmarks (cs27 .. cs1196) are the ISCAS-89 sequential circuits from which the feedback loops and FFs have been removed. The 'c*' benchmarks (c17 .. c2670) come from the ISCAS-85 benchmark circuits. All the primary inputs are assumed to be temporally and spatially independent, with a signal probability of 0.5 in the experiments. Our analyser can be four orders of magnitude faster than SPICE with less than 5% error. Better accuracy could be achieved by fine tuning the G-cube calculation algorithm. From cs208 to c2670, we use 50 groups of 100 random patterns for the SPICE simulation and then average the power of the 50 groups for comparison with our simulation. Due to the partitioning, the results of our simulator for these benchmarks are three orders of magnitude faster than SPICE. However, we must remember that the SPICE result is the average of 50 groups of 100 random patterns, which means that the proposed method is more efficient than the SPICE method. From c432 to c2670, it takes an enormous time to obtain the SPICE simulation results: the simulation time of a single input pattern is more than three days. Therefore, the SPICE simulation results for these circuits are omitted.
Since our analyser can recalculate the results by using the existing cube sets, whenever the transition density of any node is changed, our algorithm has the incremental capability which the SPICE does not have. Therefore, our estimator is particularly useful in the synthesis environment for power optimisation.
Conclusions and future work
We have proposed a novel static power analyser for CMOS combinational circuits. The analyser can estimate the power consumption of a circuit very fast (four orders of magnitude faster than SPICE) and very accurately (with a 5% error compared with SPICE). The analyser is also applicable to different delay models with or without inertial delays. Furthermore, it can distinguish functional transitions from spikes and has incremental capability. Last but not least, the analyser can identify the input transitions which cause large power consumption so that power optimisation can be appropriately applied to improve the circuit. Therefore, this analyser is very useful in the synthesis environment for low power.
One of the main pieces of future work is to handle the spatial and temporal dependency of the primary inputs. By giving different weights to different cubes of the S-cube set and the G-cube set, which means the patterns of some cubes may appear more frequently in the input vector than others, we can calculate the temporal dependency approximately. By using a table lookup method instead of the simple multiplication to calculate the E(sw), which means different primary inputs are not independent of each other, we can approximate the spatial dependency among different primary inputs. Another direction of future work is to expand this approach to cover other abstraction levels, such as the circuit level and the functional level, by extending the symbols and the definitions of cubes and cube operations.
