Estimating switching activity is a crucial step in optimizing circuits for low power. This paper presents a fast gate-level switching activity estimator for combinational circuits. Combining event-driven and bitparallel simulation yields the high accuracy of the former's real delay model while maintaining the speedup of the latter. This is demonstrated by detailed experimental results.
Introduction
Today, power optimization, and thus power estimation, has become a major objective in digital circuit development. Charging and discharging of circuit nodes is the main source of power dissipation in digital CMOS circuits [1]. The average power dissipation is given by:
$$P_{avg} = \frac{1}{2} V_{DD}^2 \, f \sum_i C_{Li} \, p_{Si} \qquad (1)$$
where $V_{DD}$ denotes the supply voltage, $C_{Li}$ the sum of all parasitic capacitances attached to node i, $f$ the clock frequency the circuit is operating at, and $p_{Si}$ the average number of output transitions per clock cycle of the gate driving node i [2].
While all other parameters are given by the technology or the circuit layout [2], $p_{Si}$ depends not only on the circuit structure but also on the statistical properties of the input signals applied to the circuit. Thus, the major objective of this paper is to efficiently estimate the switching activity $p_{Si}$.
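As a small worked illustration of equation 1, the following sketch evaluates the average power for a handful of nodes. It assumes the commonly used form P = 1/2 · V_DD² · f · Σ C_Li · p_Si; the function name and the numeric values are hypothetical and not taken from the paper.

```python
def average_power(v_dd, f_clk, node_caps, node_activities):
    """Average dynamic power in watts.

    Assumes P = 0.5 * V_DD^2 * f * sum_i(C_Li * p_Si), where p_Si is the
    average number of transitions per clock cycle at node i.
    """
    if len(node_caps) != len(node_activities):
        raise ValueError("one activity value is required per node capacitance")
    # Effective switched capacitance per clock cycle
    switched_cap = sum(c * p for c, p in zip(node_caps, node_activities))
    return 0.5 * v_dd ** 2 * f_clk * switched_cap


# Hypothetical example: two nodes, 2 V supply, 1 MHz clock
power = average_power(2.0, 1e6, [1e-12, 2e-12], [0.5, 0.25])
```

Only the activity values change with the input statistics; all other arguments are fixed by technology and layout, which is why the estimation effort concentrates on them.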
Today, two main approaches exist to estimate switching activity on the logic level [3]: pattern simulation and probabilistic methods. The former relies on typical input patterns, which are either known from high-level simulation or randomly generated using the statistics of the primary inputs. The latter directly uses those statistics and propagates them through the circuit by symbolic simulation.
In pattern simulation, the main problems are accuracy and runtime. Since accurate estimation requires a high number of input patterns, runtime increases with accuracy. Thus, most work in this field has focused on limiting the number of input patterns, e.g. through Monte Carlo methods [4], and on improving runtime behavior [3, 5, 6].
The difficulties in symbolic simulation are related to accurately handling correlations introduced by the primary inputs or by reconvergent fanout. Most methods for handling correlation are based on BDDs, which, however, can become very large for large circuits. Advanced techniques can be found in [7] and [8].
This paper deals with speeding up pattern-based simulation using the bitparallel approach first published in [3]. The rest of the paper is organized as follows: chapter 2 gives the most important definitions and reviews previous work on bitparallel simulation. In chapter 3, a new simulation method is presented. It has been implemented in the computer program TESA (Time parallel Estimation of Switching Activity). Chapter 4 summarizes the results obtained by comparing TESA to bitwise simulation.
Preliminaries

Previous Work
Recently, two approaches to bitparallel simulation for switching activity estimation have been proposed. The first one relies on exploiting the whole width of the processor word of the machine the simulator is running on [3]. Instead of simulating each clock cycle separately, 32 or 64 cycles are simulated at a time (fig. 1). Speedups between 2 and 74 compared to single bit simulation have been reported in [3] on a 64 bit machine.
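The word approach can be sketched as follows under the zero delay model. This is only an illustration, not the implementation of [3]: the helper names `pack` and `toggles` are hypothetical, and bit k of a word is assumed to hold the signal value in clock cycle k.

```python
def pack(bits):
    """Pack per-cycle signal values (cycle 0 first) into one integer word."""
    word = 0
    for k, b in enumerate(bits):
        word |= (b & 1) << k
    return word


def toggles(word, n_cycles):
    """Count value changes between consecutive clock cycles in the word."""
    # Bit k of the XOR is set if cycle k and cycle k+1 differ;
    # only the first n_cycles - 1 positions are meaningful.
    diff = (word ^ (word >> 1)) & ((1 << (n_cycles - 1)) - 1)
    return bin(diff).count("1")


a = pack([0, 1, 1, 0, 1, 0, 0, 1])
b = pack([1, 1, 0, 0, 1, 1, 0, 1])
y = a & b  # one bitwise AND evaluates the gate for all eight cycles at once
```

A single machine instruction thus replaces one gate evaluation per clock cycle, which is where the speedup over single bit simulation comes from.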
A second method has been proposed in [6] , where the signals are represented as sets. A set is a sequence of ones. Only the number of the first and the last clock cycle of a sequence are stored. Thus all logic operations become set operations.
This method is only efficient if there exist many signals with low activity, like the flip flop outputs in sequential circuits.
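A minimal sketch of the set idea, assuming each set is stored as an inclusive (first cycle, last cycle) pair and the sets of a signal are kept sorted. The function names are hypothetical and the code is not the implementation of [6]; it only shows how an AND gate turns into an intersection of sets.

```python
def to_sets(bits):
    """Convert per-cycle values into (first, last) pairs, one per run of ones."""
    sets, start = [], None
    for k, b in enumerate(bits):
        if b and start is None:
            start = k                      # a run of ones begins
        elif not b and start is not None:
            sets.append((start, k - 1))    # the run ended in the previous cycle
            start = None
    if start is not None:
        sets.append((start, len(bits) - 1))
    return sets


def and_sets(xs, ys):
    """AND of two signals as the intersection of their sorted set lists."""
    out, i, j = [], 0, 0
    while i < len(xs) and j < len(ys):
        lo = max(xs[i][0], ys[j][0])
        hi = min(xs[i][1], ys[j][1])
        if lo <= hi:
            out.append((lo, hi))
        # Advance the list whose current set ends first
        if xs[i][1] < ys[j][1]:
            i += 1
        else:
            j += 1
    return out


a = to_sets([0, 1, 1, 1, 0, 0, 1, 1])
b = to_sets([1, 1, 0, 1, 1, 0, 0, 1])
y = and_sets(a, b)
```

The cost of each operation grows with the number of sets rather than with the number of clock cycles, which is consistent with the remark that the method pays off only for low-activity signals.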
In the original publications both approaches have been restricted to the zero delay model (ZDM) which can result in important inaccuracies as will be shown in chapter 4. In [9] a first approach has been presented to extend bitparallel simulation by a real delay model (RDM). However, parallelism has only been applied to the logic operations. Transition, glitch and hazard detection is still performed sequentially, thus resulting in only a limited speedup compared to bitwise simulation. In the sequel a new method will be presented that exploits parallelism for all operations.
Signal Representation
Bitparallel signal representation is clock cycle oriented. It is assumed that a signal takes only one value per clock cycle. However, in real applications with non-zero delays, any internal signal of a circuit may change its value several times during the clock cycle due to different arrival times of the input signals. Figure 3 shows such a signal. Note that ∆t is the simulation time, i.e. the time that has passed since the beginning of the current clock cycle.
Clearly, the signal representations of figures 1 and 2 cannot take into account the additional signal changes during the clock cycles. Hence, each signal waveform, traditionally represented as a vector, has been extended to an array [9]. In this array, each line depicts the signal values at a specific time during the clock cycles (fig. 4). In the sequel, this array will be called the schedule of a signal and its lines are referred to as vectors. E.g. denotes row 1 in figure 4.
Row 0 with ∆t<0 denotes the signal value at the end of the last clock cycle, the steady state. Each line can now be represented using either word or set representation and all logic operations can be performed in parallel on complete lines.
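One possible way to hold such a schedule in memory is sketched below, assuming the word representation for each row. The class and method names are hypothetical; the paper does not describe TESA's data structures at this level.

```python
from dataclasses import dataclass, field


@dataclass
class Schedule:
    """Signal schedule: one word per point in time, one bit per clock cycle."""
    initial: int                               # row 0: steady state (dt < 0)
    rows: list = field(default_factory=list)   # (dt, word) pairs, sorted by dt

    def add(self, dt, word):
        """Insert the signal values valid from time dt onwards."""
        self.rows.append((dt, word))
        self.rows.sort(key=lambda row: row[0])

    def value_at(self, dt):
        """Return the word holding the signal values of all cycles at time dt."""
        word = self.initial
        for t, w in self.rows:
            if t > dt:
                break
            word = w
        return word


s = Schedule(initial=0b0101)
s.add(3, 0b0001)
s.add(1, 0b0111)
```

Because every row is an ordinary word (or set list), a gate evaluation at a given ∆t is still a single parallel operation on complete rows.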
Definitions
The use of the terms transition, hazard and glitch is 
The Simulation Algorithm
Taking the signal representation of chapter 2.2, the simulation algorithm can be divided into four steps:
For all gates:
1. perform logic operation
2. delay determination and scheduling
3. glitch removal
4. count transitions and hazards
In order to explain the algorithm in detail, the AND gate of figure 5 will be used as an example. Figure 6 shows the first two clock cycles of the schedules of its input waveforms.
Logic Operation
First of all, the initial values of the output y are computed by performing the logic operation of the AND gate on its initial input values. The result is entered into the first line of the output schedule as event 0 (∆t<0) in figure 7:
(2)
Delay Determination and Scheduling
Then ∆t proceeds to the first event on the inputs which occurs at ∆t=1. Again, the logic operation is performed:
The result, however, cannot be entered directly into the output schedule, since the different delays for falling and rising edges, $t_{down}$ and $t_{up}$ respectively, must be taken into account. Hence, the directions of the transitions must be determined using the following equations:
falling event:
rising event:
Now, the correct values for the delays can be applied. For the following processing, it is convenient to temporarily enter the events into the schedule instead of the logic values $y_i$. Thus, the schedule contains only events, except for row 0, which holds the initial values of the clock cycles. The events are marked with an asterisk.
In the example, only the rising event yields a non-zero result. It is entered into the output schedule at time ∆t = 1 + $t_{up}$. Since this is the first event on signal y, no glitch detection is necessary.
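The direction test and the scheduling step can be sketched with bitwise operations on words. The sketch assumes the usual formulation (rising where the old value is 0 and the new value is 1, falling in the opposite case), which is not necessarily the exact form of equations 5 and 6; all names are hypothetical.

```python
def edges(prev, new, n_cycles):
    """Return (rising, falling) event words for all clock cycles in parallel."""
    mask = (1 << n_cycles) - 1
    rising = ~prev & new & mask     # 0 -> 1 in this column
    falling = prev & ~new & mask    # 1 -> 0 in this column
    return rising, falling


def schedule_events(prev, new, dt, t_up, t_down, n_cycles):
    """Attach the direction-dependent delay to each non-empty event word."""
    rising, falling = edges(prev, new, n_cycles)
    events = []
    if rising:
        events.append((dt + t_up, rising))
    if falling:
        events.append((dt + t_down, falling))
    return events


# Hypothetical values: four clock cycles, t_up = 2, t_down = 3, input event at dt = 1
events = schedule_events(0b0011, 0b0101, dt=1, t_up=2, t_down=3, n_cycles=4)
```

Both event words are obtained from a few word-wide operations, so the direction of every transition in all packed clock cycles is determined at once.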
Glitch Removal
Now simulation time ∆t proceeds to the next input event which occurs at ∆t=3 on input b. Again, the logic operation is performed, resulting in:
Equations 5 and 6 indicate no rising edge but two falling ones. The resulting falling event will be scheduled at ∆t = 3 + $t_{down}$. Since glitches are neglected, they must be filtered out before this new event can be entered into the output schedule. Two conditions must be fulfilled for the current event and a previously scheduled event $e_j$ to cause one or more glitches:
1. $e_j$ has not taken place yet, i.e. its scheduled time lies after the current simulation time ∆t.
2. The current event shows at least one event in the same clock cycle (same column) as $e_j$.
In that case, some columns in $e_j$ and the current event compensate each other. Those are filtered out using the following equations:
That means $e_1$ can be removed, and the current event becomes:
(11)
Figure 9 shows the resulting schedule. Note that the events in the schedule have no direction attribute. It should be emphasized that all the logic operations in equations 5-9 can be performed using either of the parallel methods outlined in chapter 2.1, thus maintaining the runtime advantages of both methods.
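The filtering step might look as follows for the word representation. This is a sketch under stated assumptions, not a transcription of the paper's equations: a pending event is taken to be one whose scheduled time lies after the current simulation time, and columns in which both the pending and the new event are set are assumed to cancel. The function name is hypothetical.

```python
def filter_glitches(pending, new_event, now):
    """Cancel columns that occur in both a pending event and the new event.

    pending   -- list of (scheduled_time, event_word) already in the schedule
    new_event -- event word about to be scheduled
    now       -- current simulation time within the clock cycle
    Returns the surviving pending events and the reduced new event word.
    """
    kept = []
    for t, ev in pending:
        if t > now:                   # the event has not taken place yet
            common = ev & new_event   # same column set in both events
            ev &= ~common             # remove the cancelled columns ...
            new_event &= ~common      # ... from both events
        if ev:                        # drop events that became empty
            kept.append((t, ev))
    return kept, new_event


# Hypothetical values: an event scheduled for dt = 4 is still pending at dt = 3
kept, remaining = filter_glitches([(4, 0b0110)], 0b0111, now=3)
```

An event that ends up empty disappears from the schedule entirely, which mirrors the removal of a complete row in the example.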
Counting Transitions and Hazards
After all input events have been processed, the numbers of transitions $tr_y$ and hazards $tr_h$ of the output y can be counted. Here, the set and the word approach must be treated separately. In the former, the schedule with n+1 rows (n events plus the initial values) is scanned using equation 12:
where $E_i$ and $S_i$ denote the index of the end and the start of set i, respectively.
For the word approach a method relying on a lookup table (LUT) has been proposed in [3] . It is also applied here with some minor modifications.
The LUT maps 16-bit values to their Hamming weight. Each row in the schedule is split into 16-bit words. Using the LUT, the number of events in each word, corresponding to signal toggles, can be determined. Thus, employing equation 13, the total number of transitions $tr_y$ of the output y can easily be summed up.
(13)
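The lookup-table counting can be sketched as follows. The table construction and the function names are illustrative assumptions; the modifications mentioned above relative to [3] are not reproduced.

```python
# Hamming weight of every 16-bit value, computed once at start-up
LUT16 = [bin(v).count("1") for v in range(1 << 16)]


def count_events(word):
    """Number of set bits in a non-negative word, 16 bits at a time."""
    total = 0
    while word:
        total += LUT16[word & 0xFFFF]
        word >>= 16
    return total


def count_transitions(event_rows):
    """Sum the events of all rows of a schedule that holds event words."""
    return sum(count_events(row) for row in event_rows)


tr_y = count_transitions([0b1011, 0x10000, 0])
```

A 16-bit table needs only 65,536 entries, so a 32-bit row is counted with two table accesses instead of a loop over 32 individual bits.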
During the transition counting phase the events are replaced by the corresponding logic values. In order to compute the number of hazards tr h , useful transitions u are computed first for both approaches:
where $y_0$ and $y_n$ denote the logic values of the steady states at the beginning and at the end of the clock cycles, respectively. The number of useful transitions $tr_u$ is computed using equation 12 or 13. The number of hazards $tr_h$ then results in:
$$tr_h = tr_y - tr_u \qquad (15)$$
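For the word approach, the hazard count could be derived as sketched below. The sketch assumes that a clock cycle contains a useful transition exactly when its two steady states differ (an XOR of $y_0$ and $y_n$) and that every remaining transition is a hazard; the function name is hypothetical.

```python
def count_hazards(y0, yn, tr_y):
    """Hazards = all counted transitions minus the useful ones.

    y0, yn -- words with the steady-state values at the beginning and at the
              end of each clock cycle (one bit per cycle)
    tr_y   -- total number of transitions counted on the output
    """
    useful = y0 ^ yn                    # cycles whose steady state changes
    tr_u = bin(useful).count("1")
    return tr_y - tr_u


# Hypothetical values: four clock cycles, six transitions counted in total
tr_h = count_hazards(0b0101, 0b0110, tr_y=6)
```

In this example two of the six transitions change the steady state, so the remaining four are counted as hazards.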
Experimental Results
The algorithms of chapter 3 have been implemented in the computer program TESA, which has been extensively tested on several benchmark circuits from the ISCAS85 and ISCAS89 benchmark sets. For the ISCAS89 benchmarks, only the combinational part has been simulated: the flip-flops have been cut out and their outputs have become additional primary inputs. For the sake of realism, the activity of these inputs has been determined using RTL simulation. All simulation runs were performed on a SUN Sparc Ultra 2, 300 MHz, using a 32-bit word width. For each circuit, 10,000 randomly generated input patterns have been simulated. The average switching activity of the primary inputs was set to 0.5.
Figure 10 summarizes the results. CPU/s word is the absolute runtime of the word approach in seconds. The columns speedup word and set denote the speedup of wordwise and set simulation, respectively, compared to single bit simulation. The latter was performed with TESA in single bit mode in order to avoid the influence of implementation differences, different time models etc. of other simulators. The theoretical 32-fold speedup for wordwise simulation has not been reached on average due to some overhead during scheduling and glitch detection, but an average speedup of more than 20 is still a good result. For some smaller circuits, the speedup is even higher than 32. The reason is not entirely clear yet, but it may be caused by unaligned memory accesses in the single bit version. The set approach generally performs worse than wordwise simulation.
The speedup varies noticeably among circuits (e.g. c3540). It evidently does not depend on circuit size; however, for c3540 the schedules were observed to become very large compared to other circuits, resulting in a performance penalty.
Finally, h/% denotes the percentage of hazards among all transitions. Given their average rate of 32% and their high standard deviation, hazards are not negligible.
Conclusion
Accurate switching activity estimation is a crucial step in the design of low power circuits. On gate level, TESA offers a very fast estimation option by combining event-based simulation with bitparallel approaches. On several benchmark circuits, it has achieved an average speedup of more than 20 compared to bitwise simulation at the same accuracy. With only slight modifications, TESA could be used as an ordinary logic simulator as well. The current implementation is limited to combinational circuits and does not take the gate loads into account for delay modeling. An extension to sequential circuits has been presented in [11] for a ZDM simulator. Future work will include the integration of this extension into TESA and the extension of the time models.
