We investigated the scalability of CTW using three different queuing models and different service-time distributions and showed that the algorithm acts to limit the explosion of rollbacks exhibited by Time Warp. Furthermore, we showed that the memory requirements of CTW are three times smaller than those of Time Warp for one model and half as large for the two other models.
INTRODUCTION
A great deal of effort has gone into parallel logic simulation because reducing the running time of uniprocessor simulators can have a significant impact on the design of VLSI systems. The simulation of these systems has, in fact, become a bottleneck in the overall design process. [3] is an excellent survey of the work done in parallel logic simulation. Recently, research in the area has turned from synchronous algorithms (such as the oblivious strategy, in which all of the gates in a circuit are evaluated at each time step of the simulation) to the use of both conservative and optimistic asynchronous algorithms.
Conservative algorithms [4] are known to have low memory usage. On the other hand, avoiding or detecting and breaking deadlocks can greatly reduce the performance of these algorithms. This is especially true when large models with small computational granularity, such as those found in the domain of logic simulation, are considered. In general, conservative algorithms depend a great deal on lookahead to achieve good performance [8]. Given the large number of cycles in a circuit [2], this might present a serious drawback.
Optimistic algorithms [11] are very attractive for logic simulation since they can extract a great deal of parallelism and they are deadlock-free. Nevertheless, Time Warp studies have often pointed out the problems encountered due to the large amount of memory a simulation might require. Furthermore, it is unclear whether Time Warp remains efficient as the size of the simulation model grows.
The ideal algorithm would be one that would have the memory needs of conservative algorithms and the potential of optimistic algorithms to extract a great deal of parallelism.
Digital circuits are constructed by interconnecting functional units, which are themselves composed of different blocks. At the lowest level a block can be modeled as some combinatorial logic connected to a series of clocked registers or latches.
Figure illustrates the hardware model of logic circuits [19] . We distinguish three phases:
1. An initialization vector is applied to the input latches and, once the signal is stable, clock G0 is activated.
2. The propagation vector travels throughout the combinatorial logic and reaches the output latches.
3. The output vector is then sampled when clock 1 is activated.
This suggests that the signal activity within the blocks is rather chaotic whereas the activity between the blocks tends to be more regular. The key idea would then be to use a conservative approach to synchronize all the gates of one block, and to use an optimistic approach to synchronize these blocks.
In the following pages, we present a new hybrid algorithm for the asynchronous parallel simulation of digital circuits (the algorithm can of course be applied to other types of simulations). The algorithm makes use of Time Warp between clusters of LPs running on different processors and uses a sequential algorithm within the clusters. We also demonstrate experimentally that the algorithm scales well to the simulation of large models with low computational granularity, while Time Warp does not. We christen the algorithm Clustered Time Warp [1].
The remainder of the paper is organized along the following lines. Section 2 contains a description of other hybrid algorithms. Section 3 describes the Clustered Time Warp algorithm along with an illustrative example. Section 4 contains experimental results in which Clustered Time Warp is compared to Time Warp. Section 5 describes our work on the scalability of CTW. Section 6 concludes the paper.
RELATED WORK
A number of attempts have been made to combine the optimistic and conservative approaches. [20] allows a process to proceed optimistically but avoids sending potentially erroneous messages to other LPs. [13] employs a window protocol to prevent LPs from getting too far apart in simulated time.
In [17]

Figure 2 shows the structure of a cluster.

Inputs: e is the event to be sent.
begin
(1) send event e to the destination cluster
(2) create the antimessage of e
(3) coc ,--coo +
end.
FIGURE 6 An LP passes to the CE event e to be sent.
3.8. Example

3.8.1. Receiving Messages

Figure 7a shows the space-time graph at a cluster composed of three logical processes. The x-axis represents the virtual time and the y-axis represents the location of the three LPs. Figure 7b shows the arrival of message m1, whose receive time is 7 and whose destination process is LP1.

FIGURE 11 (a) m5 is annihilated by its antimessage, the cluster rolls back, and (b) m3 is reprocessed.

The reason for this is that it is possible to roll back to a point prior to the GVT because not every event is checkpointed. Similarly, the events prior to the GVT in the LP input queue cannot all be removed. As it is possible for the LP to roll back to a state prior to the GVT, events with a timestamp smaller than the GVT might have to be reprocessed while the LP coasts forward. Once an estimate of the GVT has been calculated, each LP can discard all of its states prior to the GVT except one, preferably the one whose timestamp is closest to the GVT. Then, all the events whose receive time is smaller than the timestamp of the oldest remaining state can also be discarded. Figure 12 shows the pseudocode executed by each logical process when a new GVT estimate has been calculated.
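The fossil collection rule described above can be sketched as follows. This is only an illustration, not the pseudocode of Figure 12: the function name `fossil_collect`, the list-of-tuples representation of the state and input queues, and the strict comparison `timestamp < gvt` for "prior to the GVT" are all assumptions made for the example.

```python
def fossil_collect(states, events, gvt):
    """Discard states and events that can no longer be needed.

    states: list of (timestamp, saved_state), sorted by timestamp (assumed layout)
    events: list of (receive_time, payload), sorted by receive_time (assumed layout)
    gvt:    the new GVT estimate
    """
    # Indices of the saved states that lie before the GVT.
    older = [i for i, (ts, _) in enumerate(states) if ts < gvt]
    if older:
        # Keep exactly one of them: the one closest to the GVT.
        states = states[older[-1]:]
    if states:
        oldest_kept = states[0][0]
        # Events received before the oldest kept state can never be replayed.
        events = [e for e in events if e[0] >= oldest_kept]
    return states, events


states = [(2, "s2"), (5, "s5"), (9, "s9"), (12, "s12")]
events = [(1, "e1"), (4, "e4"), (8, "e8"), (10, "e10"), (11, "e11")]
kept_states, kept_events = fossil_collect(states, events, gvt=10)
print(kept_states)  # [(9, 's9'), (12, 's12')]
print(kept_events)  # [(10, 'e10'), (11, 'e11')]
```

Note that the state with timestamp 9 survives although it precedes the GVT: a rollback to a time between 9 and 10 would need it as the starting point for coasting forward.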
In the current implementation of Clustered Time Warp, a token-ring passing algorithm [14] is used since the architecture used to develop the system (the BBN Butterfly) does not contain a large number of nodes (maximum of 32 nodes).

A program was written to read the netlist of the ISCAS benchmark circuits and to partition them into clusters. We used a string partitioning algorithm, because of its simplicity and especially because results have shown that it favors concurrency over cone partitioning; see for example [6]. The algorithm is similar to an in-order tree walk [7]. A gate connected to a primary input is first selected and assigned to a cluster. Its output is then followed and the same procedure is applied for each succeeding gate. When the cluster contains the desired number of gates, a new cluster is created and the algorithm resumes. Figure 13 shows a potential string assignment for circuit s27 for a cluster size of 4.
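A minimal sketch of the string partitioning procedure described above is given below. It is not the program used in the experiments: the function name `string_partition`, the fanout dictionary, and the small six-gate circuit are hypothetical (the circuit is not s27), and the order in which several outputs of a gate are followed is an assumption.

```python
def string_partition(fanout, input_gates, cluster_size):
    """Assign gates to clusters by following strings of connected gates.

    fanout:       dict mapping a gate to the list of gates it drives (assumed format)
    input_gates:  gates connected to primary inputs, in the order they are visited
    cluster_size: desired number of gates per cluster
    """
    clusters = [[]]
    assigned = set()
    for start in input_gates:
        stack = [start]  # explicit stack, so deep circuits do not hit the recursion limit
        while stack:
            gate = stack.pop()
            if gate in assigned:
                continue
            assigned.add(gate)
            if len(clusters[-1]) == cluster_size:
                clusters.append([])  # current cluster is full, open a new one
            clusters[-1].append(gate)
            # Follow the outputs of this gate; the first listed output is visited first.
            stack.extend(reversed(fanout.get(gate, [])))
    return [c for c in clusters if c]


# Hypothetical circuit used only for illustration.
fanout = {"g1": ["g3"], "g2": ["g3"], "g3": ["g4", "g5"], "g5": ["g6"]}
print(string_partition(fanout, ["g1", "g2"], cluster_size=4))
# [['g1', 'g3', 'g4', 'g5'], ['g6', 'g2']]
```

Gates that lie on the same signal path tend to end up in the same cluster, which is the property that distinguishes string partitioning from an arbitrary assignment.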
A simulation run can be decomposed into three phases. First, each processor starts up by loading the gates assigned to it and by creating their corresponding LPs. Then, each gate which has an initialized state sends an event to the gates connected to it. Some of these gates will be triggered and will propagate their changes throughout the circuit. After a while the system becomes stable, and events stop being generated. During the third phase, input vectors (previously randomly generated) are read and the simulation is

The difference in the peak memory consumption between the two circuits is due to the fact that circuit s38584 has a relative asynchronous parallelism nearly half that of circuit s35932 (see Table I).
This characteristic of circuit s38584 has two consequences. First, because fewer events are being processed in parallel, the Clustered Time Warp approach has a smaller chance to take advantage of its sparse checkpointing techniques.
Take for example an LP that receives only one event between two GVT computations. In such a case it does not really matter what the checkpoint interval is, since the LP will have to perform at least one checkpoint anyway. Thus, if we consider a simulation in which LPs process very few events, the overall memory usage of any checkpointing technique will not be very important.
In addition, when a circuit with little parallelism is simulated, the event population in the system is likely to be relatively small too, hence reducing the number of process states that have to be saved. Because fewer objects are being manipulated by the system, the estimated GVT tends to be closer to the actual GVT, and the fossil collection mechanism is therefore able to remove most of the useless states and events. As a direct consequence, the memory usage reduction that can be achieved by Clustered Time Warp is attenuated. However, if the parallelism gets small, the event population becomes small too, and fewer fossil objects have to be collected. Therefore, the reduction of the garbage collection overhead is less significant.

FIGURE 15 Memory vs. Number of clusters per processor (circuit s38584).
Simulation Time

Summary
Based on these results, we chose the cluster size for each algorithm which gave the best performance in order to use them in our second set of experiments.
For LRCC and LRLC, we chose one cluster per processor. In the case of CRCC, we chose 32 and 128 clusters per processor for circuits s35932 and s38584 respectively.
Varying the Number of Processors
In the second set of experiments we observed the behavior of the algorithms, varying the number of processors from 8 to 24. In addition, we also show the performance of a Periodic State Saving mechanism (PSS), which is a modified version of pure Time Warp in which the checkpoint interval is constant and larger than one. In our study, we chose a checkpoint interval of 3 as it proved to be an optimal value for a large range of types of simulations [15].

The phenomenon we described previously can now be observed. For circuit s35932, when compared to Time Warp, the CRCC checkpoint protocol, which saved half as many states as LRLC (see Fig. 19), actually performs much better than LRLC when all the memory usage is considered (see Fig. 21). Similarly, when compared to Time Warp, the periodic state saving technique with a checkpoint interval of 3 (PSS) saves only between 9 and 16% of the memory usage whereas it saved between 30 and 35% of the states.
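The periodic state saving scheme mentioned above can be illustrated with a small sketch. It is a toy model under stated assumptions, not the PSS implementation used in the experiments: the class name `PeriodicLP`, the running-sum state, and the non-negative integer timestamps are all hypothetical. It shows why processed events must be kept alongside the sparse checkpoints: after a rollback the LP restores the latest earlier checkpoint and coasts forward by re-executing events.

```python
class PeriodicLP:
    """Toy logical process that checkpoints every `interval` events (assumed design)."""

    def __init__(self, interval=3):
        self.interval = interval
        self.state = 0                     # toy state: running sum of event values
        self.processed = []                # (timestamp, value), in timestamp order
        self.checkpoints = [(-1, 0, 0)]    # (timestamp, state, events processed so far)

    def process(self, ts, value):
        self.state += value
        self.processed.append((ts, value))
        if len(self.processed) % self.interval == 0:
            self.checkpoints.append((ts, self.state, len(self.processed)))

    def rollback(self, ts):
        """Undo every event with timestamp >= ts; return how many events were replayed."""
        # Checkpoints taken at or after the rollback time are no longer valid.
        while self.checkpoints[-1][0] >= ts:
            self.checkpoints.pop()
        _, state, count = self.checkpoints[-1]
        kept = [e for e in self.processed if e[0] < ts]
        self.state = state
        replay = kept[count:]              # coast forward from the checkpoint
        for _, value in replay:
            self.state += value
        self.processed = kept
        return len(replay)


lp = PeriodicLP(interval=3)
for t in range(1, 8):
    lp.process(t, t)
print(lp.rollback(5), lp.state)  # 1 10
```

In this run only two states are saved for seven events, but the events with timestamps 1 to 4 must still be held in memory so that the event at time 4 can be replayed, which is one way to see why saving fewer states does not reduce total memory by the same proportion.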
These results show the importance of taking events into consideration for the design of checkpointing techniques for optimistic algorithms.
The same phenomenon is observed for circuit s38584 (Fig. 22). We note that this difference becomes less significant as the number of processors increases (since the memory is itself more distributed among the processors).
Speedup
In order to measure the speedup obtained with the parallel simulation system, we have developed a sequential simulator. In this case, since the simulation is performed on a single processor, there is no need for synchronization; therefore no checkpointing is performed and events are deleted as soon as they are processed. As a consequence, no GVT algorithm is needed and the fossil collection mechanism is simply switched off. The scheduling of the processes is performed with a single heap, and a minimum-message-timestamp-first policy is used. The sequential simulation for circuits s35932 and s38584 took 283 and 291 seconds respectively.
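The structure of such a sequential simulator can be sketched in a few lines. This is an illustrative outline only, not the simulator used for the measurements: the function `simulate`, the `(timestamp, lp, value)` event tuples, and the handler dictionary are assumptions introduced for the example.

```python
import heapq


def simulate(initial_events, handlers, end_time):
    """Process events in timestamp order from a single heap.

    initial_events: iterable of (timestamp, lp_name, value) tuples (assumed format)
    handlers:       dict mapping lp_name to a function (timestamp, value) -> new events
    Returns the number of events processed.
    """
    heap = list(initial_events)
    heapq.heapify(heap)
    processed = 0
    while heap and heap[0][0] <= end_time:
        # The smallest timestamp is always served first; the popped event is
        # simply dropped afterwards, since no rollback can ever require it.
        ts, lp, value = heapq.heappop(heap)
        for new_event in handlers[lp](ts, value):
            heapq.heappush(heap, new_event)
        processed += 1
    return processed


# Hypothetical two-gate example: gate "a" inverts its input and drives gate "b"
# after a delay of 2 time units.
trace = []

def gate_a(ts, value):
    trace.append((ts, "a", value))
    return [(ts + 2, "b", 1 - value)]

def gate_b(ts, value):
    trace.append((ts, "b", value))
    return []

print(simulate([(5, "a", 1), (1, "a", 0)], {"a": gate_a, "b": gate_b}, end_time=100))  # 4
print(trace)  # [(1, 'a', 0), (3, 'b', 1), (5, 'a', 1), (7, 'b', 0)]
```

No state is saved and no event is retained, which is why the sequential run serves as the baseline against which the overhead of the parallel algorithms is measured.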
Results are shown in Figures 25 and 26. As we have seen in Table I, the parallelism available in circuit s35932 is much higher than that available in circuit s38584 (the relative parallelism is twice as high). As a consequence, the speedup obtained from the parallel simulation of circuit s35932 is higher than that obtained for circuit s38584. When the number of processors is relatively small, the overhead of the synchronization algorithm becomes more significant, and we observe that the speedup is actually better for the circuit with less concurrency. This clearly shows that the performance of asynchronous algorithms depends highly on the intrinsic parallelism available in the simulated circuits, but also on the ability of these algorithms to keep their overhead relatively small. The results also point out a stable behavior of the algorithms with respect to the number of clusters employed. With this range of choices among checkpointing algorithms, it is possible to choose an algorithm depending upon the memory requirements of the simulation.
SCALABILITY
In addition to large memory consumption, it is also possible for the number of rollbacks in Time Warp to increase without bound. Phenomena such as cascading rollbacks, echoing and dog-chasing-its-tail are examples of this problem [13].
In this section we briefly summarize some of our results on the scalability of CTW (with CRCC checkpointing) as compared to that of Time Warp. As space limitations preclude a detailed discussion, the interested reader is directed to [1] .
We define the scalability of a Time Warp based system to be the rate at which the proportion of rolled back events to committed events increases relative to the size of the simulated model. We say that a Time Warp based system is unstable if the number of rolled back events during a simulation run is not bounded, making it impossible for the simulation to terminate in a finite amount of time.
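The definition above can be made concrete with a short sketch. The function names and the run tuples are hypothetical, the numbers in the usage lines are made up purely to exercise the code and are not measurements, and approximating the "rate" by the slope between consecutive model sizes is an assumption.

```python
def rollback_ratio(rolled_back, committed):
    """Proportion of rolled back events to committed events for one run."""
    return rolled_back / committed


def scalability_slopes(runs):
    """Rate of change of the rollback ratio between consecutive model sizes.

    runs: list of (model_size, rolled_back_events, committed_events) tuples
          (assumed format, one tuple per simulation run)
    """
    points = sorted((size, rollback_ratio(rb, c)) for size, rb, c in runs)
    return [(r2 - r1) / (n2 - n1)
            for (n1, r1), (n2, r2) in zip(points, points[1:])]


# Made-up counts for three model sizes, for illustration only.
runs = [(100, 10, 100), (200, 40, 100), (300, 90, 100)]
print([round(s, 4) for s in scalability_slopes(runs)])  # [0.003, 0.005]
```

Slopes that keep growing with the model size, as in this artificial input, correspond to the poorly scaling behavior the definition is meant to capture; slopes that stay flat or shrink indicate that the proportion of rollbacks remains under control.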
The small number of large digital circuits publicly available makes it difficult to examine the scalability of our algorithms in the context of a logic simulation. Consequently, we employed queuing networks in our experiments. This choice enabled us to relate the size and topology of the network to the performance of the algorithms. We used three different network models in our experiments [9]: a pipeline model, a hierarchical model and a distributed model. Each node in all of the models represents an n × n cluster of logical processes. In order to evaluate the scalability of Time Warp and CTW, we simply varied the cluster sizes. In our experiments the number of processes ranged from 10,000 to 60,000. Links were bidirectional and routing was random. Three metrics were used to characterize the behavior of the simulation: throughput, the proportion of rolled back events, and the maximum memory usage. The throughput is the number of committed events per second. It provides a measure of how fast the simulation advances in real time. We employed the deterministic, uniform

Finally, and most importantly, the performance of CTW should be evaluated in realistic simulations, for example register-level VLSI simulations of circuits with 250-500,000 gates. Each of these questions is the focus of ongoing research efforts.
We remain optimistic.
