Abstract-In recent years, designers of embedded computer systems have faced a tremendous growth in the complexity of their systems. Together with rising system clock frequencies and the growing amount of real time required to see features start up and work correctly in an embedded system, this has caused the simulation times of event-based simulation engines to skyrocket. Performing these simulations at register transfer level (RTL), however, is crucial for the functional verification of embedded computer systems. Accelerating such event-based simulations is therefore the aim of the work presented in this paper. To this end, a methodology called clock suppression is presented and thoroughly discussed. To underpin the feasibility and performance of this approach, evaluation results of simulation experiments for several designs are shown.
I. INTRODUCTION
With the shrinking feature sizes of semiconductor process technologies, today's embedded computer systems are often made up of several microprocessors, digital signal processors, memories, and application-specific circuitry, which are typically integrated on a single chip, forming a system on a chip (SoC). An increasingly large number of those embedded systems is distributed in space and interconnected via a local area network, e.g. in factory automation systems, building automation, or sensor networks.
Not only is the design of such complex systems a challenge, but their functional verification is of utmost importance as well. Verification is time consuming, and the cost of undetected design errors increases tremendously the later they are found. This is further aggravated by the fact that system clock frequencies of embedded computer systems keep increasing steadily, which requires event-based simulation engines to simulate many more events to cover the same amount of real time when verifying the system's functions. Even worse, the new features of e.g. multimedia applications or higher-level communication protocols require simulations to span longer and longer intervals of real time.
This is why methodologies like hardware-software co-verification or virtual prototypes are in frequent use to speed up and enhance the overall quality of the verification process [1], [2], [3]. But as always, there is a trade-off between simulation speed and the detail in which the models represent the actual physical entities of the system. Often, especially for real-time systems, it is crucial to have reliable timing information available. This is the case e.g. when using an instruction set simulator for a certain processor to execute software that has to be verified not only against functional but also against timing constraints.
As soon as the refinement process reaches the register transfer level (RTL), where synthesis tools automatically compile the hardware description language code to a certain integrated circuit technology, simulation run times increase dramatically, even if only parts of the embedded computer system are simulated in that detail.
A gain in simulation performance is possible if portions of the hardware design that are coded in a hardware description language (typically VHDL or Verilog) at RTL and are run by an event-driven simulation engine can be turned off during the time they are not needed. This is exactly what the proposed clock suppression (CS) technique, which has been used for the simulation runs under consideration in this work, does.
Generally speaking, the simulation time $t_{sim}$ depends on the number of events $N_e$ to be simulated during a run, the number of activities $N_a$ whose evaluation is initiated by those events, and the rate $R$ at which the simulation engine can evaluate these activities [4] (Equation 1).
Increasing $R$ is accomplished by increasing the computing power of the workstations and by optimizing the simulation engines. Keeping $N_e$ and $N_a$ low can be achieved by higher abstraction levels for modeling, which is not always possible, as motivated earlier. Consequently, the remainder of this paper is devoted to the discussion of how to minimize Equation 1 by shrinking $N_e$ and is structured as follows. Section II deals with related scientific work and Section III presents the proposed clock suppression methodology in depth. Section IV shows how clock suppression works using a simple design example and Section V discusses prerequisites of this methodology. Section VI then presents results of simulation experiments with real-life design examples and Section VII concludes the paper.
II. RELATED WORK
Gravenstein [4] presents several techniques to speed up Verilog HDL simulation, among them an approach similar to the one presented herein, called emulation of linear function. This solution, however, is not generic and is restricted to the optimization of counters with no input condition and a constant increment of one. The approach described in the work at hand can be considered a generalization and improvement of the emulation of linear function technique.
Ulrich [5] first introduced the term clock suppression for a technique in which sequential clock-driven circuits like transmission gates, flip-flops and registers are temporarily disconnected from their clock source and reconnected as soon as such circuits receive new data input. However, its application is in the field of fault simulation, where low-level circuit netlists are simulated rather than register transfer level (RTL) models. Furthermore, this approach was not able to suppress data-dependent periodic signals, as opposed to the clock suppression technique presented in this paper.
A state-based prediction of oscillating signals (a third signal state represents an oscillating signal) is introduced in [6], where a clock distribution tree is traversed before simulation begins and the clock inputs along the tree are disconnected as required. This optimization strategy has been integrated in the Creator simulator [7].
A similar approach is shown in [8], where a static analysis of the circuit is done before simulation and the results are used to augment a modified simulator engine. An average performance increase by a factor of 5 is reported.
In [9] Takamine et al. introduced clock suppression for gate-level simulations using special-purpose simulation hardware.
Optimization strategies on a higher abstraction level are shown in [10] and [11]. In [10] Gezel, a language for synchronous hardware description, is presented, which supports co-simulation including a cycle-skip detection mechanism to optimize the co-simulation interface.
In the field of low-power design for synthesis, Babighian et al. [11] introduced automatic clock gating insertion at RTL to eliminate redundant computations performed by temporarily unobservable blocks.
In [12] a method is described to suppress clock events and even aperiodic events without large modifications of the simulator. It is claimed that VHDL programs can be automatically optimized by elimination of insensitive events. However, based on the classifications stated in Park's work [12] an expert VHDL programmer can optimize VHDL models by hand.
None of the aforementioned papers, however, optimizes the linear behavior of adders and counters, as opposed to the method presented in this work. They therefore all have a limited optimization potential and typically do not shorten simulation run times by more than a factor of 10.
III. CLOCK SUPPRESSION METHODOLOGY
Simulations at RTL or even gate level spend a considerable amount of time generating the clock signal and re-evaluating clocked signals, even if these do not need to be processed because no events will happen on them. At gate level about 70 percent of the overall simulation time is spent processing clock events [9].
For example, a one-second simulation run of a single periodic signal with a frequency of 100 MHz, without debugging or tracing support enabled, takes 40 seconds to complete on a workstation with a computing performance of more than 1000 MIPS. The resulting maximum event rate of 5 Mega Events/s (two events per clock period, i.e. $2 \times 10^8$ events in 40 s) clearly depends strongly on the underlying simulator kernel, but it is still a limiting factor, especially for the simulation of the high clock rates seen in today's embedded computer system designs.
Simple counter structures as shown in Listing 1 are very frequent in RTL designs. Especially in control-centric designs, lots of counters and adders are instantiated to generate the required timing for interfaces, keep track of external events, support the internal control flow, and the like.
Listing 1 Simple counter in VHDL
To further motivate the strategy of clock suppression in embedded computer system simulations, the results of a survey of recently developed designs regarding the utilization of adder structures like the one in Listing 1 are summarized in Table I. It is important to note that the figures for the adder count include multiple instances of the same adder in the source code only once. The adder-to-size ratio in the last column is an indicator for the type of the design, i.e. control-dominated vs. data-flow-dominated. For example, the design D1 with its relatively high adder count has a greater potential for optimization through clock suppression than the design D6.
A. Introduction of Clock Suppression
For a more formal description of the clock suppression methodology, we define a synchronous vectored signal s whose value is of type signed or unsigned integer. s depends on a set of synchronous input signals I.
Note that s itself may be part of I. This property is essential for the following optimization strategy, since it focuses on the linear behavior of counter and accumulator structures as exemplified in line 7 of Listing 1. We further define a complete set of output conditions OC for s,
$$OC(s) = \{oc_1(s), oc_2(s), \dots, oc_n(s)\},$$
which are used as input logic for subsequent signals. An output condition is any signal or condition that depends on the synchronous vectored signal s, possibly together with other signals. OC(s) has to be complete in the sense that every subsequent condition depending on s must be listed in it. Events on the clock signal can be suppressed if and only if none of the output conditions is fulfilled ($oc_1(s) = oc_2(s) = \dots = oc_n(s) = \text{false}$). In that case a change of (i.e. an event on) signal s does not cause further events in circuit paths to which s is an input: events generated on s do not propagate.
After a period in which the clock signal has been suppressed, the reactivation value of s must be calculated. To this end, detailed knowledge of the functional behavior of s is necessary. This is why counters and, as a generalization, adders are well suited for this approach: their behavior is predictable, which enables the following optimization strategy.
The derivation of the time interval $\Delta t$ during which CS is applied starts from the definition of the differential for computing the value of s at a future reactivation,
with $s'_t$ being the gradient of s at time t, assumed to be constant within the simulation time interval $[t, t + \Delta t]$, and $s_{t+\Delta t}$ denoting the value of signal s at simulation time $t + \Delta t$.
Rewriting Equation 4 results in an equation for the timeout $\Delta t$ for continuous systems,
which has to be adapted for discrete systems with a granularity of $t_{clk}$.
Here the operator div stands for integer division without remainder, and $s'_t$ is replaced by the increment $\Delta s_t$. This timeout is the nearest point in the future at which the clock signal has to be activated, depending on the value $s_{t+\Delta t}$ that s must have at reactivation. The special case $\Delta s_t = 0$ leads to a division by zero and must be handled by setting $\Delta t = \infty$, i.e. when one summand of an adder function is zero, no clock activation is necessary.
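The displayed equations referred to in this derivation are not reproduced in the present text. A reconstruction that is consistent with the surrounding definitions could read as follows. It is offered as an assumption, not as the original typesetting: the prime notation for the gradient and the exact form of the discrete timeout are inferred from the prose.

```latex
\begin{align*}
  s_{t+\Delta t} &= s_t + s'_t \, \Delta t \tag{4} \\
  \Delta t &= \frac{s_{t+\Delta t} - s_t}{s'_t}
    && \text{(continuous systems)} \\
  \Delta t &= \bigl( (s_{t+\Delta t} - s_t) \;\mathrm{div}\; \Delta s_t \bigr) \, t_{clk},
    \quad \Delta t = \infty \text{ for } \Delta s_t = 0
    && \text{(discrete systems)}
\end{align*}
```

As a worked instance under these assumptions, a counter at $s_t = 0$ with increment $\Delta s_t = 1$, a clock period $t_{clk} = 10$ ns and an output condition at $s_{t+\Delta t} = 100$ gives $\Delta t = (100 \;\mathrm{div}\; 1) \cdot 10\,\text{ns} = 1\,\mu\text{s}$.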
The value of s after a period of clock suppression is calculated as
where $s_c$ is the value of s for the condition for which the timeout has been calculated. If an input event in I occurs before the timeout has expired, the new value of s is calculated as
The clock signal must be reactivated if any of the following events occur:
• a signal from the set of input signals I changes
• an output condition OC(s) changes
• the calculated timeout expires
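The timeout and the reactivation value can be illustrated with a short Python sketch. It assumes integer time units, a discrete timeout of the form ((target - s) div increment) * t_clk, and a linear reactivation value. The function names `timeout` and `value_after` are hypothetical and are not taken from the listings of this paper.

```python
import math


def timeout(s_now, s_target, delta_s, t_clk):
    """Time until s reaches s_target if it changes by delta_s every t_clk.

    Assumed discrete form: ((s_target - s_now) div delta_s) * t_clk.
    A zero increment means s never changes, so the timeout is infinite.
    """
    if delta_s == 0:
        return math.inf
    return ((s_target - s_now) // delta_s) * t_clk


def value_after(s_start, delta_s, elapsed, t_clk):
    """Value of s after `elapsed` time units of suppressed clock.

    Assumes s advanced linearly by delta_s once per full clock period.
    """
    return s_start + delta_s * (elapsed // t_clk)


# Counter at 0, increment 1, clock period 10 time units, condition at s = 100.
dt = timeout(0, 100, 1, 10)
s_at_timeout = value_after(0, 1, dt, 10)

# An input event after 370 time units ends the suppression early.
s_at_input_event = value_after(0, 1, 370, 10)
```

In this sketch the same linear formula serves both reactivation cases: an expired timeout and an earlier input event differ only in the elapsed time that is passed in.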
B. Clock Suppression Algorithm
The pseudocode of the clock suppression algorithm is shown in Listing 2. Though it looks like C syntax, there is no restriction on the language used, except that it must support event-driven simulation features like sequential statements or waiting on a list of events or conditions (waitOn, waitUntil), as Verilog, SystemC, or SystemVerilog do.
By default the clock is turned on and the simulation remains unaffected by the clock suppression algorithm. If no output condition is satisfied and there exist scheduled timeouts (line 5), a phase of deactivated clock is entered after the calculation of the future signal value ($s_c$) (by calling the function nextEvent) and of the timeout value $\Delta t$, which is infinity if the increment of the adder is zero. In this case the wait statement in line 14 terminates only on changes in the input signal list or in the output conditions. After the period of clock suppression the new value of s is forced, depending on the presence of a timeout (line 15).
In the activated phase (starting from line 23), an activation signal for the clock is asserted until the occurrence of a deactevent, which indicates that no more state changes besides those of the optimized signal happen. As checking this property can be very expensive in terms of computing performance, an optimized solution would be to integrate the state change observation into the simulator kernel, where it comes almost for free. Alternatively, the deactivation could take place after a certain timeout. While this solution needs some knowledge about the design under test (the behavior of the operands of the adder), it could be applied efficiently, especially for huge designs.
Every time the operand of an adder changes, the new value must also be assigned to the local increment variable (assignIncrement in line 30).
As the presented algorithm can be implemented in a single process within the testbench but outside the design hierarchy, no modification of the original design code is necessary. Listing 3 shows the modified clock generation process. Lines 2-5 are inserted to prohibit the generation of clock events if the activation signal clkActivate is not asserted.
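The effect of the algorithm can be sketched with a cycle-level Python model that compares a reference run with a run that skips suppressed clock cycles. This is a simplified illustration under stated assumptions, not the paper's Listing 2: it counts clock cycles instead of using an event-driven kernel with waitOn and waitUntil, it has a single output condition `s == compare`, and all names (`run_reference`, `run_suppressed`, `inc_changes`) are hypothetical.

```python
import math


def run_reference(n_cycles, inc_changes, compare):
    """Evaluate every clock cycle. Returns (final s, hit cycles, clock events)."""
    s, inc, hits = 0, 0, []
    for cycle in range(n_cycles):
        inc = inc_changes.get(cycle, inc)  # input event on the adder operand
        s += inc
        if s == compare:                   # output condition oc(s)
            hits.append(cycle)
    return s, hits, n_cycles


def run_suppressed(n_cycles, inc_changes, compare):
    """Same behavior, but cycles without relevant events are skipped."""
    s, inc, hits, clock_events = 0, 0, [], 0
    cycle = 0
    while cycle < n_cycles:
        # Clock active: evaluate this cycle exactly like the reference.
        inc = inc_changes.get(cycle, inc)
        s += inc
        clock_events += 1
        if s == compare:
            hits.append(cycle)
        cycle += 1
        # Timeout (in cycles) until the output condition could become true;
        # wake up one cycle earlier so that the hit is evaluated actively.
        k = (compare - s) // inc if inc != 0 else 0
        to_hit = k - 1 if k >= 1 else math.inf
        # Cycles until the next input event (or the end of the run).
        next_input = min((c for c in inc_changes if c >= cycle), default=n_cycles)
        skip = min(to_hit, next_input - cycle, n_cycles - cycle)
        # Clock suppressed: force s to its reactivation value and jump in time.
        s += inc * skip
        cycle += skip
    return s, hits, clock_events


# Increment 1 from cycle 0, 0 from cycle 500, 2 from cycle 800; condition s == 300.
changes = {0: 1, 500: 0, 800: 2}
ref = run_reference(1000, changes, 300)
opt = run_suppressed(1000, changes, 300)
```

Both runs produce the same final value and the same hit cycle, while the suppressed run evaluates only the cycles at which an input changes or the output condition is about to become true.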
IV. A SIMPLE EXAMPLE
To give further insight into the methodology and to evaluate the theoretical work, a simple example consisting of two sequential elements with adders is discussed in this section (see Figure 1). The first adder C1 with its input logic en and IN C1 is optimized through an external entity CS, instantiated in the testbench. The second adder C2 serves as output logic, which is activated on a set of output conditions for C1. While any synchronous logic could have been used in place of C2, choosing a parameterisable adder made it easy to adjust the ratio between the turn-on and turn-off time of the clock signal by modifying the set of conditions (cd1 and cd2 in Figure 2).
A big advantage of the proposed methodology is that the design code is not affected by the optimization, as it takes place only in the testbench. Even entity declarations remain unmodified as indicated in Figure 1 .
To verify the correct behavior of the proposed clock suppression algorithm, two identical instances of the test example were simulated, one of them enhanced with clock suppression. Example signal traces are shown in Figure 3. Note that the optimized version of C1 differs from the reference design while the clock is turned off, but is set to the correct value (marked by an x) before the clock is turned on. The signal C2 always has correct values.
The durations of a 20 ms simulation run of the reference and the optimized design are compared in Figure 4. While the simulation performance of the reference design remains nearly unaffected by the amount of activity on C2, the speedups of the optimized version are, as expected, proportional to the turn-off time (less activity of C2). The break-even point is above 50 percent, i.e. if the turn-on time becomes equal to the turn-off time, the overhead of the clock suppression algorithm eliminates any speedup. The break-even point shifts to the right if the frequency of the turn-on/turn-off cycles increases. In the next section a more detailed discussion of the prerequisites for the application of clock suppression is given.
V. PREREQUISITES
Though the presented approach has great potential for optimization, some restrictions limit its application in verification environments. As mentioned before, this optimization methodology makes use of specific properties of signals, e.g. the partly constant behavior of the operands of accumulators. For example, traditional accumulators known from CPUs normally do not have a partly constant summand and are therefore not suitable for this approach.
The following two conditions must be satisfied to accelerate a simulation:
• [...] be low compared to the clock frequency.
• The average event rate of activations due to satisfied output conditions has to be low compared to the clock frequency.
To suppress every clock event within a given time period, thus allowing jumps in time, all logic driven by that clock must be covered by clock suppression. Hence, for hierarchical designs a mechanism for synchronizing two or more clock suppression modules within a clock generation process is necessary. Figure 5 shows two methods: either an explicit logic OR function or the implicit resolution function of multi-driver signals (std_logic in VHDL) can be used for this purpose. The latter has the advantage of easy extensibility and eliminates the need for a feedback of the activation signal, which is otherwise required because the activation signal must be readable by all other clock suppression modules in order to set the correct state of a deactivated module if another module needs clock activation.
To spare the designer from modifying the source code, forcing and reading of signals across component hierarchies has been used in the previous section. Both Verilog and SystemC have built-in support for setting and reading signals of foreign entities. The current VHDL standard forbids this type of action, but simulators can usually deal with it.
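The two ways of combining the activation requests of several clock suppression modules can be sketched in Python. This is only an illustration: the function names are hypothetical, and the second function assumes that each module drives either '1' (clock needed) or 'Z' (released) onto a shared resolved signal, which is one possible way of using a std_logic resolution function and is not confirmed by the text.

```python
def clock_activate_or(requests):
    """Explicit OR: the clock runs if at least one module requests it."""
    return any(requests)


def clock_activate_resolved(drivers):
    """Resolved multi-driver signal, restricted to the values '1' and 'Z'.

    A single '1' overrides any number of 'Z' drivers, so adding a further
    module only adds a driver and needs no change to the combining logic.
    """
    return '1' if '1' in drivers else 'Z'


one_active = clock_activate_or([False, True, False])
all_idle = clock_activate_or([False, False, False])
resolved_active = clock_activate_resolved(['Z', '1', 'Z'])
resolved_idle = clock_activate_resolved(['Z', 'Z'])
```

In both variants the clock generation process only has to observe one combined activation signal, regardless of how many modules exist in the hierarchy.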
VI. RESULTS
To underpin the feasibility and performance of the presented clock suppression technique, the results of three different simulation runs of a distributed embedded computer system consisting of a 4-port Ethernet switch and four computers with network interface cards are summarized. Results for a 300-second simulation run on a Pentium 4 CPU with 3.4 GHz and 1 GB main memory running a 32-bit Linux operating system are listed in Table II. Depending on the actual configuration and the detail of modeling, a speedup of up to four orders of magnitude has been reached. In this case the simulation of the RTL design of an Ethernet switch together with four connected nodes, including RTL designs of the network interface cards and the interconnecting bus system, has been brought down from several days to minutes without giving up any detail of the simulated models. The evaluation of other designs, like the ones mentioned in Table I, is subject to ongoing research.
VII. CONCLUSION AND FUTURE WORK
This paper revisited the clock suppression technique, which is used to significantly speed up the simulation runs that are mandatory to verify the functionality of today's embedded computer systems. After formally introducing the problem, results of applying clock suppression to real-life simulations have been presented. They give a convincing impression of the potential of the presented clock suppression methodology. It is important to note that all this comes without the need to modify either the event simulation engine itself or the design, using only existing means that state-of-the-art hardware description and modeling languages and simulators offer.
To squeeze the last bit of performance out of this technique, it is planned for the near future to add the presented clock suppression algorithm to the simulation kernel of an event-based simulation engine.
