A recent paper by Bailey contains a theorem stating that the idealized execution times of unit-delay, synchronous and conservative asynchronous simulations are equal under the conditions that an unlimited number of processors is available and the evaluation time of each logic element is equal. It further shows that these conditions result in a lower bound on the execution times of both synchronous and conservative asynchronous simulations. Bailey's conclusions are derived under the strict assumption that the inputs to a circuit remain fixed during the entire simulation. We remove this limitation and, by extending the analyses to multi-input, multi-output circuits with an arbitrary number of input events, show that conservative asynchronous simulation in general extracts more parallelism and executes faster than synchronous simulation. Our conclusions are supported by a comparison of the idealized execution times of synchronous and conservative asynchronous algorithms on ISCAS combinational and sequential benchmark circuits.
INTRODUCTION
Reliable design of digital VLSI systems requires extensive logic simulations consuming enormous amounts of CPU time. Parallel processing offers a viable way to improve upon this time. Two main classes of algorithms exist for parallel logic simulation, known as the synchronous and asynchronous algorithms. In synchronous simulation (sometimes referred to as centralized-time simulation), a centralized clock for the simulation time is maintained. All logic elements experiencing input events at the current simulation time are processed and then the clock is advanced by one time unit to the next simulation time. In contrast, asynchronous simulation (also called distributed simulation) does not require any centralized clock to coordinate its execution. Instead, all events carry the simulation time information (timestamp) themselves. In conservative asynchronous simulation, a logic element is ready for evaluation as soon as all of its inputs have received a token (a logical value and its timestamp). When a logic element evaluates, it produces an output based on the logical value of the input tokens and consumes the input token(s) with the lowest timestamp. The output has a timestamp equal to the timestamp of the consumed input token(s) plus the delay of the logic element itself. In the "conservative" form of asynchronous simulation, the time order of tokens is always guaranteed and only "safe" evaluations are allowed, i.e., evaluations guaranteed to produce a correct result.
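The evaluation rule described above can be sketched in a few lines of Python. This is only an illustrative model, not the implementation used in this work: the names `Element`, `queues` and `current` are hypothetical, and the sketch assumes that an input whose front token lies in the future holds its last consumed value until that token is consumed.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque, List, Optional, Tuple

Token = Tuple[int, int]  # (logical value, timestamp)


@dataclass
class Element:
    func: Callable[[List[int]], int]  # logic function of the element
    delay: int                        # element delay in simulation time units
    queues: List[Deque[Token]]        # one FIFO of pending tokens per input
    current: List[int]                # last consumed value on each input (assumption)

    def ready(self) -> bool:
        # Conservative rule: every input must hold at least one token.
        return all(len(q) > 0 for q in self.queues)

    def evaluate(self) -> Optional[Token]:
        if not self.ready():
            return None
        t_min = min(q[0][1] for q in self.queues)
        # Consume only the token(s) carrying the lowest timestamp.
        for i, q in enumerate(self.queues):
            if q[0][1] == t_min:
                self.current[i] = q.popleft()[0]
        # Output timestamp = consumed timestamp + element delay.
        return (self.func(self.current), t_min + self.delay)


and_gate = Element(
    func=lambda v: int(all(v)),
    delay=1,
    queues=[deque([(1, 0), (0, 5)]), deque([(1, 2)])],
    current=[0, 0],
)
first = and_gate.evaluate()   # consumes the token stamped 0 on input 0
second = and_gate.evaluate()  # consumes the token stamped 2 on input 1
third = and_gate.evaluate()   # input 1 is now empty, so the gate is not ready
```

After the second evaluation the gate still holds a token on its first input but cannot evaluate, which is the situation that leads to the deadlocks discussed next.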
In implementing the event driven principle (i.e., sending an output token to the fanout elements only if there is a change in its logical value), the conservative asynchronous simulation can deadlock. A deadlock is a situation where no element can evaluate because at least one of its inputs is missing a token. This occurs frequently in the simulation of circuits with feedback because if the output that is feeding back did not change, no token will be sent to that input, causing a deadlock.
There are two ways to handle deadlocks, both proposed by Chandy and Misra [2] [3]: one is deadlock avoidance by the use of NULL (redundant) messages, the other is deadlock detection and recovery. Bailey [1] develops the execution time of asynchronous simulation without considering the overhead due to the handling of deadlocks. We do consider this overhead in the execution times of the ISCAS-85 [4] and ISCAS-89 [5] benchmark circuits.
In the development of execution times of synchronous and conservative asynchronous simulation, Bailey first describes the circuit to be simulated in terms of a simulation dependency graph, which is a directed graph of events with each vertex representing an event in the circuit. The vertices in the graph are labeled with events and the edges are labeled with delays in the circuit. Both the events and the delays have positive integer values. If a parent event causes a child event, then there is an edge in the graph from the parent event vertex to the child event vertex, labeled with the delay of the logic element corresponding to the child event. The execution times of synchronous and conservative asynchronous simulation are developed in terms of this graph. In Bailey's analyses, a fixed execution sequence is assumed, the evaluation time of each vertex in the graph is equal, an unlimited number of processors is available, and the inputs to a circuit remain fixed during a simulation. Under these assumptions, it is then proved that unit-delay simulation is a lower bound on the execution times of both synchronous and conservative asynchronous simulations and that these execution times are equal.
We continue a similar development here but relax the assumption that the inputs to a circuit are to remain fixed during a simulation. Let n be the number of external inputs and e be the number of input events on an input i. Then the best-case execution time for the unit-delay synchronous simulation is given by (2). It occurs when all input events on a line are separated by one time unit to extract maximum pipelining, and different inputs receive events at the same simulation time to achieve maximum concurrency.
$$\max_{i=1}^{n} \; E \cdot \big((\mathrm{depth}(i) + 1) + e - 1\big) \qquad (2)$$
The worst-case execution time is given by (3) and occurs when all input events are separated in simulation time by an interval greater than or equal to the depth of the simulation dependency graph, such that there is no pipelining or concurrency (between different external input events).
$$\sum_{i=1}^{n} E \cdot (\mathrm{depth}(i) + 1) \cdot e \qquad (3)$$

We illustrate the best- and worst-case execution times using an example. An exclusive-OR circuit is shown as an interconnection graph in Figure 1.

For a general circuit with n inputs and e events on an input i, the best-case execution time of conservative asynchronous simulation is given by (5).
It occurs when there is maximum pipelining and concurrency available in simulation. Note that unlike synchronous simulation, the separation in terms of simulation time is not a factor for exploiting either pipelining or concurrency in asynchronous simulation.
The worst-case execution time for asynchronous simulation is caused by reduced parallelism due to the way it processes events. In asynchronous simulation, each logic element has to sequence the input events in terms of their timestamps to guarantee correct behavior. During evaluation, a logic element consumes the input token with the lowest timestamp and produces an output with a timestamp equal to the timestamp of the consumed token plus the delay of the element itself. Thus even if the events appearing on different inputs of a logic element were generated in parallel, a number of output events equal to the sum of all input events have to be generated sequentially in the worst case, thereby reducing the concurrency in simulation. An example of this is shown in Figure 5, where the two inverters concurrently process events belonging to different simulation times, but when passing through the AND gate, the generation of events is serialized on its output because of the differences in the input timestamps. In Figure 5, the execution time for the generation of each event is denoted as a subscript to the event and it is assumed that E = 1. The execution time for the output of a logic element equals one more than the maximum execution time on the front of its inputs. This is because, in conservative asynchronous simulation, a logic element is not ready for evaluation until it has received all of its inputs. In Figure 5, @.. indicates additional events on an input, thus allowing the consumption of all events in the example.

FIGURE 5 An Example Showing Serialization of Generation of Output Events (format: Event @ Simulation time, with the execution time as a subscript)
The example shown in Figure 5 demonstrates that multi-input logic elements may reduce the concurrency in asynchronous simulation by serializing the generation of events if they receive events that are separated in simulation time on their different inputs. Taking this effect into account, the worst-case execution time for conservative asynchronous simulation is given by (6) .
where e_k denotes the number of events at the input of a logic element at level k in a given input-to-output path. Before applying (6), the number of events at each output of a logic element is computed by accumulating the number of events on the fanin lines of that element. Equation (6)
COMPARISON OF SYNCHRONOUS AND CONSERVATIVE ASYNCHRONOUS SIMULATION
The best- and worst-case execution times for synchronous and conservative asynchronous simulation are given by Equations (2)-(3) and (5)-(6), respectively.
In comparing the best cases, it can be seen that Equation (2) for synchronous simulation is identical to Equation (5) for conservative asynchronous simulation. However, the requirements for achieving this minimum time differ. The best case for synchronous simulation occurs when the events on an input are separated in simulation time by only one time unit to exploit maximum pipelining, and events on different inputs occur at the same simulation time to get maximum concurrency. Conservative asynchronous simulation does not have this requirement and is capable of exploiting both pipelining and concurrency for widely separated events. For instance, the asynchronous simulation of a chain of inverters executes in the minimum time given by Equation (5) regardless of the separation time of input events. In contrast, synchronous simulation requires input events to be separated by only one time unit to achieve the best execution time. Note also that in most practical simulations, the input data to a circuit are held stable for at least the delay through the circuit. Thus asynchronous simulation may achieve the minimum time, but synchronous simulation cannot, as the input events are almost always separated by more than one time unit in practical simulations.
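As a numerical illustration of the inverter-chain remark, the two bounds can be evaluated directly. The sketch below reads Equation (2) (and hence Equation (5)) as the maximum over inputs of E((depth(i) + 1) + e - 1) and Equation (3) as the sum over inputs of E(depth(i) + 1)e. The function names are hypothetical, E is taken to be the evaluation time of one logic element, and the number of events is allowed to differ per input.

```python
from typing import Sequence


def best_case_time(depths: Sequence[int], events: Sequence[int], E: int = 1) -> int:
    """Best case, Equation (2)/(5): full pipelining and concurrency."""
    return max(E * ((d + 1) + e - 1) for d, e in zip(depths, events))


def sync_worst_case_time(depths: Sequence[int], events: Sequence[int], E: int = 1) -> int:
    """Synchronous worst case, Equation (3): no pipelining or concurrency."""
    return sum(E * (d + 1) * e for d, e in zip(depths, events))


# One external input with depth(i) = 4 receiving 10 input events.
chain_best = best_case_time([4], [10])         # (4 + 1) + 10 - 1 = 14
chain_worst = sync_worst_case_time([4], [10])  # (4 + 1) * 10 = 50
```

Under these assumptions, asynchronous simulation of the chain can reach 14 time units whatever the spacing of the input events, while synchronous simulation approaches 50 time units once the events are separated by at least the circuit delay.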
In order to achieve the lowest possible execution time when there are multi-input logic elements involved, the conservative asynchronous simulation does require that the events on different inputs of a logic element have the same timestamps. This condition allows for consumption of multiple input events thus minimizing the effect of serialization in the generation of output events. Hence this condition ultimately requires a fixed simulation time difference in the external input events (depending upon the delay of the path of each input of a logic element to the external input) to achieve the best execution time. This is rather a stringent requirement as can be seen from an example. If the first input of an AND gate receives events through a chain of two inverters connected to an external input and the other input is an external input, then the external input events have to be separated by 2 simulation time units to result in minimum execution time in asynchronous simulation.
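The AND-gate example can be checked with a small counting sketch. It assumes unit-delay inverters, so that events on the first external input reach the gate two time units later, and it assumes (as with the additional events marked in Figure 5) that enough trailing tokens exist for every event to be consumed. The function name `count_evaluations` is hypothetical.

```python
from collections import deque
from typing import List, Sequence


def count_evaluations(input_timestamps: Sequence[Sequence[int]]) -> int:
    """Count evaluations of one element when each evaluation consumes
    only the token(s) with the lowest timestamp."""
    queues: List[deque] = [deque(sorted(ts)) for ts in input_timestamps]
    evaluations = 0
    while any(queues):
        t_min = min(q[0] for q in queues if q)
        for q in queues:
            if q and q[0] == t_min:
                q.popleft()  # tokens with equal timestamps are consumed together
        evaluations += 1
    return evaluations


# External events at 0, 10, 20 arrive at the gate at 2, 12, 22 through two inverters.
via_inverters = [2, 12, 22]
aligned = count_evaluations([via_inverters, [2, 12, 22]])    # second input offset by 2
staggered = count_evaluations([via_inverters, [0, 10, 20]])  # second input not offset
```

With the second external input offset by two time units the gate evaluates three times; without the offset it evaluates six times, i.e. once per input event.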
The minimum time given by Equation (5) would not be obtainable for most circuits because of the conflicting timing requirements from multiple paths through the circuit. Figure 6 illustrates this point using the data from Figure 3. The minimum execution time given by Equation (5)

In short, the requirements on both synchronous and asynchronous simulation to achieve the best execution time as given by Equations (2) and (5) are quite strict, and the best execution time may not be observed for either type of simulation. The requirements for Equation (2) would never be met in practical simulations, in which the input data are typically held constant for at least the delay through the entire circuit. Likewise, the requirements for Equation (5) would not be achievable by most circuits having recombination of paths with different delays, although this is mitigated by the fact that, contrary to the assumption made in the development above, not every input event produces an output event.
The worst-case execution times for synchronous and asynchronous simulation are given by Equations (3) and (6) Figure 4 ). The execution time for the conservative asynchronous simulation can also be verified by applying Equation (6) to the exclusive-OR circuit in Figure 7

Conservative asynchronous simulation on a combinational circuit composed of multi-input AND- and OR-type gates can generally improve its execution time by about 50% by employing lookahead. This can be seen by assuming that the output of a gate is 0 with probability 0.5, i.e., the output is 0 half the time and 1 the rest of the time. The number of gate evaluations using lookahead is then reduced by half, because half the time at least one of the inputs has a controlling value. For sequential circuits, conservative asynchronous simulation based on the deadlock avoidance scheme can obtain a much higher performance gain from lookahead. This is because, in addition to reducing gate evaluations, lookahead greatly reduces the number of NULL messages needed to avoid deadlocks in feedback loops. Some results on benchmark circuits that demonstrate the effectiveness of lookahead are presented in the next section.
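The role of a controlling value can be sketched as follows. This is a simplified illustration and not the algorithm of the appendices: it looks only at logical values and leaves out the timestamp bookkeeping that a full implementation needs to establish how long the controlling value remains valid. The names `CONTROLLING` and `lookahead_output` are hypothetical.

```python
from typing import Optional, Sequence

# Value that fixes the gate output on its own: 0 for AND, 1 for OR.
CONTROLLING = {"AND": 0, "OR": 1}


def lookahead_output(gate: str, fronts: Sequence[Optional[int]]) -> Optional[int]:
    """Return the gate output if it is already determined, else None.

    `fronts` holds the value at the front of each input, or None for an
    input that has no token yet.
    """
    c = CONTROLLING[gate]
    if any(v == c for v in fronts):
        return c          # a controlling value decides the output immediately
    if all(v is not None for v in fronts):
        return 1 - c      # all inputs present and non-controlling
    return None           # must wait for the missing input(s)
```

For example, `lookahead_output("AND", [0, None])` yields 0 although the second input has not yet received a token, whereas `lookahead_output("AND", [1, None])` has to wait.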
EVALUATION ON BENCHMARK CIRCUITS
We measured the execution times of the combinational ISCAS-85 [4] and sequential ISCAS-89 [5] benchmark circuits on both synchronous and conservative asynchronous simulation algorithms. All circuits were simulated under unit delay, as unit-delay simulation has been shown to be the lower bound on the execution time of either the synchronous or the conservative asynchronous algorithm [1]. In the implementation of the synchronous algorithm, a timing wheel is used whose time slots contain events that can be executed in parallel. Thus for a given data set (with an unlimited number of processors and one time unit for the evaluation of an element), the execution time of synchronous simulation is equal to the number of non-empty time slots.
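The slot-counting measure can be illustrated with a minimal unit-delay, event-driven sketch. It is not the simulator used for the measurements: the circuit representation, the function name `sync_execution_time` and the signal names are hypothetical, and a slot is counted as non-empty only if it holds at least one actual value change.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Sequence, Tuple

Gate = Tuple[Callable[[List[int]], int], List[str]]  # (function, input signals)


def sync_execution_time(gates: Dict[str, Gate],
                        input_events: Sequence[Tuple[int, str, int]],
                        initial: Dict[str, int]) -> int:
    """Count non-empty time slots of a unit-delay synchronous simulation."""
    fanout = defaultdict(list)
    for out, (_, ins) in gates.items():
        for sig in ins:
            fanout[sig].append(out)
    values = dict(initial)
    wheel = defaultdict(dict)  # time slot -> {signal: new value}
    for t, sig, val in input_events:
        wheel[t][sig] = val
    non_empty_slots = 0
    while wheel:
        t = min(wheel)  # advance the central clock to the next slot
        slot = wheel.pop(t)
        changed = [s for s, v in slot.items() if values[s] != v]
        if not changed:
            continue
        non_empty_slots += 1
        for s in changed:
            values[s] = slot[s]
        # All elements driven by a changed signal are evaluated in this slot.
        for g in {g for s in changed for g in fanout[s]}:
            func, ins = gates[g]
            new = func([values[i] for i in ins])
            if new != values[g]:
                wheel[t + 1][g] = new  # unit delay
    return non_empty_slots


inv = lambda v: 1 - v[0]
chain = {"b": (inv, ["a"]), "c": (inv, ["b"])}  # two inverters in series
start = {"a": 0, "b": 1, "c": 0}
packed = sync_execution_time(chain, [(0, "a", 1), (1, "a", 0), (2, "a", 1)], start)
spread = sync_execution_time(chain, [(0, "a", 1), (10, "a", 0), (20, "a", 1)], start)
```

For this two-inverter chain, three input events one time unit apart occupy 5 slots, while the same events ten time units apart occupy 9 slots, which illustrates how widely spaced inputs remove pipelining from synchronous simulation.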
For conservative asynchronous simulation, we first implemented the algorithm presented in [6, 9], which uses an avoidance scheme to handle deadlocks. This algorithm was then further improved by incorporating lookahead. Our lookahead implementation used lookahead on multiple-input gates as well as flip flops. The pseudocode for the conservative asynchronous algorithm and for the improved form incorporating lookahead is given in Appendices A and B, respectively. In this algorithm, NULL messages are generated only if there is a possibility of a deadlock. This is detected when one of the inputs of a logic element becomes empty as a result of an evaluation.
In this case, the output is sent to its fanout elements regardless of whether it has changed from its previous value. Note that this is an optimization over Chandy and Misra's always-send-NULL-message strategy in [2] [3].
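The sending rule just described can be written as a small decision function. This is only a sketch of the rule as stated in the text, not the pseudocode of Appendix A; the function name and the `"EVENT"`/`"NULL"` tags are hypothetical.

```python
from typing import Optional, Sequence, Tuple


def message_after_evaluation(old_value: int,
                             new_value: int,
                             timestamp: int,
                             remaining_tokens: Sequence[int]
                             ) -> Optional[Tuple[str, int, int]]:
    """Decide what an element sends to its fanout after one evaluation.

    `remaining_tokens` gives the number of tokens left on each input
    after the evaluation has consumed its token(s).
    """
    if new_value != old_value:
        return ("EVENT", new_value, timestamp)  # event-driven principle
    if any(n == 0 for n in remaining_tokens):
        # An input became empty: a deadlock is possible, so the unchanged
        # output is still sent as a NULL (redundant) message.
        return ("NULL", new_value, timestamp)
    return None  # unchanged output and no empty input: nothing is sent
```

An always-send strategy would return a message in the last case as well; suppressing it is what reduces the NULL message traffic.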
In our implementation, we have an input queue of size 16 for all inputs to a logic element. For an asynchronous algorithm using the avoidance scheme, the simulation execution time generally improves as the input queue size is increased and usually saturates for a queue size of about 5. In our execution time measurements, an unlimited number of processors is assumed with one unit evaluation time for a logic element and zero communication time for distributing tokens to the fanout of a gate. This is consistent with and chosen so that the parallelism in an algorithm can be determined independent of the communication overhead. However, as communication time increases, the synchronous and asynchronous algorithms would perform relatively the same. The total time units to complete the asynchronous simulation were measured using the same data set as used for the synchronous simulation. Table I shows the characteristics of the benchmark circuits and the data set. Data for the ISCAS-85 combinational circuits (c prefix) consisted of 30 random sets. The length of a set for a particular circuit was adjusted so that the circuit would reach a stable state before the next data was entered i.e., the length of a data set corresponds to the maximum depth of the circuit. Data for the ISCAS-89 sequential circuits consisted of 40 random sets. Data was preceded by several clock cycles to reset the flip flops in the circuit. Data was changed only during the middle of the positive clock pulse, and remained constant for a single clock period. Clock cycle times were adjusted for different circuits so that the circuit would reach a stable state before the next clock cycle.
The results of the execution times of the two algorithms on combinational and sequential benchmark circuits are shown in Table II . It can be seen that the execution time of asynchronous simulation with lookahead is much lower than the synchronous simulation for all circuits. On the average, the conservative asynchronous simulation is almost three and a half times faster than synchronous simulation for combinational circuits, and two times faster for sequential circuits. The redundant or NULL messages used in the asynchronous algorithm cause the overall execution time of conservative asynchronous simulation to increase because extra evaluations may take place at the element receiving these messages. The sequential circuit simulations generate a large number of NULL messages to avoid a large number of deadlocks (see Table IV ). The execution time data in Table II includes this effect and despite the overhead of NULL messages, the asynchronous simulation still outperforms synchronous simulation for combinational as well as sequential circuits when lookahead is employed.
TABLE I Characteristics of the Benchmark Circuits and the Data Set
Circuit
7 160 1200 30
41 32 546 1200 30
33 25 880 1200 30
233 140 1193 1200 30
50 22 1669 1500 30
207 108 3512 1650 30
10 104 8 2171 40
3 6 158 21 2565 40
35 24 379 19 2505 40
17 5 657 74 2505 40
35 49 2779 179 2515 40
36 39 5597 211 2521 40

We carried out a similar comparison between the synchronous algorithm and an asynchronous algorithm based on the deadlock detection and recovery scheme. In this scheme, the circuit is allowed to deadlock, a condition in which no logic element can evaluate because at least one of its inputs is missing a token. After a deadlock has been detected, the circuit recovers by computing a global minimum time "gmt" (which is the smallest time of an unconsumed event in the circuit) and updating token timestamps which are less than gmt to gmt [7]. Table III shows a comparison on benchmark circuits between the synchronous algorithm and the asynchronous algorithm based on the deadlock detection and recovery scheme (DDR). In Table III, it is assumed that the circuit recovers from a deadlock in zero time. Even with this unrealistic assumption, the conservative asynchronous simulation based on the deadlock detection and recovery scheme performs worse than the synchronous simulation. This is because the deadlock detection and recovery scheme loses much of the pipelining when the circuit deadlocks.
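The recovery step can be sketched as below. The sketch rests on an interpretation that the text does not spell out: the "token timestamps" that are raised to gmt are taken to be the time up to which each element input is known, and the data layout and the name `recover_from_deadlock` are hypothetical.

```python
from typing import Dict, List, Optional


def recover_from_deadlock(pending: Dict[str, List[int]],
                          input_time: Dict[str, int]) -> Optional[int]:
    """Advance all input times that lag behind the global minimum time.

    pending    -- timestamps of unconsumed events, per element input
    input_time -- time up to which each element input is known (assumption)
    """
    unconsumed = [t for queue in pending.values() for t in queue]
    if not unconsumed:
        return None  # nothing left to simulate
    gmt = min(unconsumed)  # smallest time of an unconsumed event
    for name, t in input_time.items():
        if t < gmt:
            input_time[name] = gmt  # timestamps below gmt are raised to gmt
    return gmt


pending = {"g1.a": [7, 9], "g2.b": [12]}
input_time = {"g1.b": 3, "g2.a": 5, "g2.b": 12}
gmt = recover_from_deadlock(pending, input_time)
```

In this example gmt is 7, so the inputs known only up to times 3 and 5 are advanced to 7, while the input already at time 12 is left unchanged.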
Table IV compares the NULL message overhead in different conservative asynchronous schemes based on deadlock avoidance. It can be seen that the conservative asynchronous scheme with lookahead has the least overhead in terms of NULL messages relative to actual events in the circuit. Even though, for sequential circuits, the number of NULL messages is two to three times the number of events in the lookahead-based avoidance scheme, the execution time is still better than that of synchronous simulation because of the increased pipelining and concurrency in event processing.
All ISCAS benchmark circuits were tested in this work. However, to keep the paper to a reasonable length, we report the results on only a few of these circuits. More results on other circuits can be found in [8]; the results on the remaining circuits are similar to the ones presented in this paper. Further, in an implementation on a hardware accelerator based on a data flow architecture with a limited number of processors [9], the relative performance of the synchronous and the optimized conservative asynchronous algorithms is similar to what we report in this paper.
Overall, the ability of the conservative asynchronous algorithm to concurrently evaluate logic elements whose inputs carry timestamps differing from those of other elements' inputs, together with its ability to exploit better pipelining along with lookahead, allows it to execute faster than the synchronous simulation. The conservative asynchronous algorithm implementing the deadlock avoidance scheme maintains better pipelining of events on the input(s) of a logic element and thus executes faster than the deadlock detection and recovery scheme, in which the pipelining effect is lost when the circuit deadlocks. Even with the overhead of NULL messages, the conservative asynchronous simulation using the optimized deadlock avoidance scheme exploits better pipelining and concurrency, and thus executes faster than both the synchronous simulation and the conservative asynchronous simulation based on the deadlock detection and recovery scheme, which loses all its pipelining when deadlocks occur. Thus our work presents important conclusions that differ from those previously proved in [1], and shows the effectiveness of conservative asynchronous simulation over synchronous simulation in terms of parallelism and execution time when a lookahead scheme is employed. Although the overhead associated with asynchronous simulation (maintaining input queues in each logic element etc.) is higher than synchronous simulation which makes it unattractive for software
