We investigate conservative parallel discrete event simulations for logical circuits on shared-memory multiprocessors. For a first estimation of the possible speedup, we extend the critical path analysis technique by partitioning strategies. To incorporate overhead due to the management of data structures, we use a simulation on an ideal parallel machine (PRAM). This simulation can be directly executed on the SB-PRAM prototype, yielding both an implementation and a basis for data structure optimizations. One of the major tools to achieve these optimizations is the SB-PRAM's hardware support for parallel prefix operations. Our reimplementation of the PTHOR program on the SB-PRAM yields substantially higher speedups than before.
INTRODUCTION
Large-scale shared-memory multiprocessors are likely to play an important role in parallel computing in the future, because they offer a much simpler programming model than traditional distributed-memory machines. Many of today's shared-memory machines are cache-based machines which show good performance for regular applications with appropriate locality but which fail to get good speedups for irregular applications with a lot of non-local memory accesses. Typical examples of such applications are particle-based simulations like MP3D [24], routing algorithms like LocusRoute [24], and discrete-event simulations like PTHOR [26]. In this article, we consider the execution of discrete-event simulations for logical circuits on shared-memory machines. We try to answer the question of which performance we can hope to get on an ideal machine on which the locality of memory accesses can be neglected but for which the overhead for the management of data structures still takes effect. As execution platform, we use the SB-PRAM, which has a uniform memory-access time and behaves like a PRAM as it is used in theoretical computer science for the analysis of the complexity of algorithms.
We consider the PTHOR algorithm for the parallel simulation of logical circuits, which uses a conservative approach. The PTHOR simulator is based on the sequential THOR simulator and has first been considered for a parallel implementation on the Stanford Dash by Soulé [26]. Soulé investigates the performance of the PTHOR simulator for three platforms: an ideal multiprocessor simulator called Tango [24], an Encore Multimax with 16 processors, and the Stanford Dash with 16 processors.
For a systematic analysis of the attainable speedup, we start with a critical path analysis of PTHOR on the benchmark circuits, which also takes into consideration the partitioning of the LPs among the processors. We extend the partitioning strategies investigated by Lin in [21] from static partitioning strategies to dynamic strategies and stealing strategies. Although this technique yields an upper bound on the speedup for the different benchmark circuits, it does not take into account the overhead for data structures. This can be done by running PTHOR on the SB-PRAM. As the complete SB-PRAM is under construction, we use a simulator that performs a cycle-by-cycle simulation of the actual machine.
Thus, the simulator delivers the exact runtime of the real hardware. The accuracy of the simulated runtimes is confirmed by comparisons with measured program runtimes on the available prototype.
Starting with the existing PTHOR implementation from the SPLASH1 benchmark suite [24], we show how the maximum attainable speedup can be increased by several changes in the data structures, including the data structures for the LPs and the memory management. When there are more LPs than processors, the work must be properly partitioned among the processors. We compare a dynamic partitioning scheme using a centralized FIFO queue with a stealing scheme that uses a local queue for each processor. We also show that the use of NULL-messages can result in a large increase of the speedup, depending on the benchmark circuit. The result is an implementation of the PTHOR simulator on the SB-PRAM for which the overhead for the management of data structures is considerably smaller than in the original implementation. Depending on the input circuit, the obtained speedup values even come close to the bound from critical path analysis.
The rest of the paper is organized as follows. Section 2 briefly introduces parallel discrete event simulation. Section 3 sketches the execution platform used. Section 4 presents the critical path analysis. Section 5 investigates the performance characteristics of the original PTHOR simulator. Section 6 presents the improvements we added and discusses their effects. Section 7 summarizes the results.
PARALLEL DISCRETE EVENT SIMULATION
A model for discrete event simulation assumes that the system being simulated only changes state at discrete points in time. For the simulation, the system is modeled as a collection of logical processes (LPs) that communicate via timestamped messages. For circuit simulations, typical LPs at varying levels of abstraction are transistors, NAND gates, flip-flops, multipliers, etc., and their interconnections [5]. The state of the simulated model changes upon the occurrence of events, such as the change in output value of an individual gate. An event e may be scheduled by a certain number of other events, if these determine the occurrence of e. The approaches to a parallel execution of discrete-event simulations (PDES) can be divided into centralized-time algorithms and distributed-time algorithms. In centralized-time algorithms, a global clock is used and the simulation is executed synchronously. In distributed-time algorithms, each processor has its own clock and the simulation is executed asynchronously. Distributed-time algorithms can be further divided into conservative and optimistic approaches. The approaches differ in the way they deal with causality errors caused by the distributed simulation; see [13] for a good overview.
The conservative method [10, 12] forces an LP to block until it is safe to simulate an event, i.e., the events are simulated in strict timestamp order. This may lead to deadlocks that have to be recognized and resolved. In the optimistic approaches [3, 16], there is no such restriction, i.e., an LP can execute events in the order in which they arrive. If this leads to a simulation that is not in timestamp order, a rollback to a safe state has to be performed and the effect of messages which should not have been sent must be eliminated by appropriate anti-messages.
The limiting factor for a centralized-time algorithm is that the simulation steps proceed in lockstep fashion, waiting for the slowest event to finish [5]. This can greatly slow down the simulation if there are widely varying event times.
Bailey shows in [4] by a theoretical analysis that the execution time of the conservative asynchronous strategy is a lower bound on the execution time of the synchronous strategy and that with unit-delay timing, the execution times of the synchronous and asynchronous strategies are equal. The analysis is performed under the assumption that an unlimited number of processors are available and that the inputs to a circuit remain fixed during the simulation. These assumptions are relaxed by Baker in [7] by allowing an arbitrary number of external inputs for each circuit, with each input experiencing different numbers of events at different simulation times. Under these conditions, a relative comparison of the synchronous and conservative asynchronous simulation execution times shows that the conservative asynchronous simulation may execute faster. In particular, the best-case execution times are the same for the synchronous and conservative asynchronous simulation, but the requirements for achieving the minimum time are quite strict. The worst-case execution time of the conservative asynchronous simulation will usually be less than that of the synchronous simulation.
In [26], a parallel, centralized-time logic simulator is discussed. In this practical work, neither of the two algorithms achieves the best results for all benchmark circuits.
Supported by the theoretical results above, we decided to research conservative asynchronous simulation and neglect synchronous schemes. The algorithm used in PTHOR offers various possibilities for optimization, with the hope of preserving the benefits of asynchronous simulation.
EXECUTION PLATFORM
Most of today's shared-memory machines are cache-based machines, i.e., they still use a physically distributed memory but each processor is equipped with a one-level cache or a two-level cache-hierarchy.
The cache coherence is provided by the hardware. The memory access time of these machines is not uniform but depends on the physical location of the data being accessed. For this reason, they are called nonuniform memory access time (NUMA) machines. These machines rely on the locality of most applications and try to hide the memory latency by caching. Examples of NUMA machines are the KSR1/2 [2] from Kendall Square Research, the Stanford Dash [20], and the SPP1000 from Convex [28].
Besides cache-based shared-memory machines, uniform memory access time (UMA) machines have been developed for which the memory access time is independent of the physical location of the data.
Examples of such machines are bus-based shared-memory machines like the Multimax [2] from Encore Computer Corp., the C90, J90, and T90 series from Cray Research [2], and the SGI Challenge from Silicon Graphics. The disadvantage of bus-based systems is that they usually can only provide a small number of processors.
The SB-PRAM, which is currently under construction at the University of Saarbrücken, is a UMA machine that provides a shared address space with a fast memory access time [1]. The latency of the network between the processors and the memory modules is hidden by pipelining of processors, i.e., each physical processor simulates a number of virtual processors. Thus, a write operation to the global memory by a virtual processor takes the same time as an arithmetic operation, independently of the memory location that is addressed. A read operation is also as fast as an arithmetic operation, but the result is available in the next but one instruction. Concurrent accesses to a single memory cell are allowed and combined, making the SB-PRAM behave like the CRCW (concurrent read, concurrent write) PRAM model known from theoretical computer science.
Besides the usual load and store operations to access memory cells, the SB-PRAM also offers multiprefix instructions which enable several processors to perform prefix operations on a memory cell in parallel. As an example, we sketch the execution of a multiprefix addition MPADD. Let $p_1, \ldots, p_n$ be the executing processors where each processor $p_i$ contributes a local value $o_i$. Let $s$ be a shared memory cell with value $o$. If $p_1, \ldots, p_n$ execute the MPADD operation synchronously, i.e., each processor $p_i$ executes MPADD $s, o_i$, then after the operation, processor $p_j$ holds the $j$th prefix sum $o + \sum_{i=1}^{j-1} o_i$, and $s$ contains the sum $o + \sum_{i=1}^{n} o_i$. The multiprefix operations MPMAX, MPOR, and MPAND work similarly.
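The MPADD semantics described above can be emulated sequentially. The following Python sketch is only an illustration of the result each processor observes; the function name `mpadd` is hypothetical and nothing here reflects the SB-PRAM instruction encoding or its constant-time hardware execution.

```python
from itertools import accumulate

def mpadd(cell_value, contributions):
    """Sequentially emulate one synchronous MPADD step.

    contributions[j] is the local value of the j-th participating processor.
    Returns (prefixes, new_cell_value), where prefixes[j] is the old cell
    value plus the contributions of all processors before processor j.
    """
    running = list(accumulate(contributions, initial=cell_value))
    return running[:-1], running[-1]

# Three processors contribute 1, 2 and 3 to a cell that holds 10.
prefixes, cell = mpadd(10, [1, 2, 3])
print(prefixes, cell)
```

Each processor receives the value the cell would have held just before its own contribution, while the cell itself ends up with the full sum.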
A multiprefix operation is as fast as a read operation, independently of the number of participating processors. It is even possible that different groups of processors perform separate multiprefix operations in parallel. The multiprefix operations can be used for an efficient implementation of synchronization mechanisms (such as barriers without serialization [14]) and for the implementation of various parallel data structures for task management like priority queues or FIFO queues [23]. Because of its memory structure, the SB-PRAM is an ideal machine for the execution of irregular applications. In addition to running an application on the SB-PRAM, the machine can also be used to study the properties of a parallel program under ideal conditions, yielding a prediction of the maximum speedup that can be attained on other machines.
The current prototype provides the user with 128 PRAM processors; the complete prototype will provide 4096 processors. Program runs were executed on a cycle-by-cycle simulator, and the accuracy was confirmed by comparisons with runs on the actual prototype.
CRITICAL PATH ANALYSIS
Not all events occurring while simulating a circuit can be executed in parallel. The result of an event e can only be computed correctly if
1. all events preceding e on the same LP are executed,
2. the results of all events scheduling e are known to e.
Event Precedence Graphs
Consider the set of the events that occur during the simulation of a fixed experiment on a fixed model.
From the above constraints, we can derive a partial order on this set, called "causality". The representation of this order as a directed graph $G = (V, E)$ is called "event precedence graph" (EPG), introduced independently by Berry and Jefferson [8] and Livny [22]. $V$ is the set of events, and $(e_1, e_2)$ is an edge iff $e_1$ schedules $e_2$ or $e_1$ is the last event before $e_2$ on the same LP. The weight function $\tau: V \to \mathbb{R}_0^+$ assigns to each event the runtime to execute it. This definition can be made independent of the underlying machine by defining $\tau(e)$ as a function of the indegree of $e$. We call an event $e_2$ dependent on $e_1$ iff there exists a path in $G$ from $e_1$ to $e_2$.
Only events that are independent from each other can be executed in parallel. Hence, the EPG serves to compute a lower bound on the simulation's runtime. We assume that every LP is simulated on its own processor. Then, because of constraint 1, it can never happen that more than one event e is ready for execution on one processor. This unique event e can be executed as soon as constraint 2 is satisfied.
Obviously, events e with indegree 0 can be executed immediately after the simulation starts.
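If START(e) and END(e) denote the times when the execution of event $e$ ideally starts and finishes, i.e.,
$$\mathrm{START}(e) = \max_{(e', e) \in E} \mathrm{END}(e') \quad (\text{and } 0 \text{ if } e \text{ has indegree } 0), \qquad \mathrm{END}(e) = \mathrm{START}(e) + \tau(e),$$
where $\tau(e)$ is the runtime of event $e$, then
$$T_{crit} = \max_{e \in V} \mathrm{END}(e) \qquad (1)$$
is the runtime of an ideal simulation on a parallel machine with an arbitrary number of processors.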
$T_{crit}$ is a lower bound on the parallel runtime of every conservative simulation strategy [17]. It is even a lower bound on optimistic strategies with aggressive cancellation [15].
The path defining the maximum in (1) is called critical path. Note that there may be several critical paths in an EPG.
The EPG also serves to compute a lower bound on the sequential runtime by
$$T_{seq} = \sum_{e \in V} \tau(e),$$
where $\tau(e)$ is the runtime of event $e$.
So far, the computed runtimes ignore any computational overhead in addition to causality. If we assume that the overhead in a parallel simulation is greater than in a sequential simulation, then the quotient $S_{crit} = T_{seq}/T_{crit}$ defines an upper bound on the possible speedup for a particular experiment.
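As an illustration of how these bounds can be evaluated, the following Python sketch walks a small EPG in topological order. It assumes that an event may start as soon as all of its predecessors in the EPG have finished and that events without predecessors start at time 0; the function name, the event names, and the weights are hypothetical.

```python
from graphlib import TopologicalSorter

def critical_path_bounds(weights, edges):
    """Return (T_crit, T_seq, S_crit) for an event precedence graph.

    weights: {event: runtime of the event}
    edges:   iterable of (e1, e2) pairs, meaning e1 must finish before e2.
    """
    preds = {e: set() for e in weights}
    for e1, e2 in edges:
        preds[e2].add(e1)

    end = {}
    # static_order() yields every event after all of its predecessors.
    for e in TopologicalSorter(preds).static_order():
        start = max((end[q] for q in preds[e]), default=0)
        end[e] = start + weights[e]

    t_crit = max(end.values())      # length of a critical path
    t_seq = sum(weights.values())   # every event executed one after another
    return t_crit, t_seq, t_seq / t_crit

# Hypothetical EPG: a and b both precede c, and c precedes d.
weights = {"a": 2, "b": 3, "c": 1, "d": 2}
edges = [("a", "c"), ("b", "c"), ("c", "d")]
print(critical_path_bounds(weights, edges))
```

In this toy graph the critical path is b, c, d, so no number of processors can push the runtime below the sum of those three weights.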
This overhead assumption is supported by the observation that normally all data structures from the sequential program are needed in the parallel version as well. The parallel program might need additional data structures to support information exchange between LPs.
Partitioning Strategies
For large circuits, real parallel machines do not have enough processors to assign each LP to a different processor. Hence, the LPs must be partitioned among the available processors.
On distributed memory multicomputers, a commonly used partitioning scheme is static partitioning.
Every processor is assigned a fixed set of LPs, and the sets are disjoint. Examples of static partitioning are cyclic distribution (LP $i$ is executed on processor $i \bmod p$), blockwise distribution (processor $i$ executes $LP_{in/p+1}$ to $LP_{(i+1)n/p}$), and random distribution (each processor is assigned $n/p$ LPs in a random fashion). If the numbering of LPs in the input data file is arbitrary, then any distribution resembles random partitioning.
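The three static distributions can be written down in a few lines. The Python sketch below uses 0-based LP and processor indices and hypothetical function names; the random variant shuffles a cyclic assignment so that every processor still receives about n/p LPs, which is one possible reading of "in a random fashion".

```python
import random

def cyclic(n, p):
    """LP i is executed on processor i mod p."""
    return [i % p for i in range(n)]

def blockwise(n, p):
    """Each processor executes one contiguous block of about n/p LPs."""
    return [i * p // n for i in range(n)]

def random_partition(n, p, seed=0):
    """Each processor gets about n/p LPs, chosen at random (seeded)."""
    owners = cyclic(n, p)
    random.Random(seed).shuffle(owners)
    return owners

# owner[i] is the processor that simulates LP i.
print(cyclic(6, 3))
print(blockwise(6, 3))
print(random_partition(6, 3))
```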
There are a number of heuristic approaches to find better static partitionings [9, 18, 19, 27]. However, we did not consider those approaches. They mostly try to optimize communication costs, which is not necessary as we use shared-memory machines.
On a shared memory multiprocessor, all processors have access to the data of every LP. Hence, an obvious strategy would be to have a central FIFO queue for LPs that are ready for execution. An idle processor simply picks the first queue element. We call this strategy dynamic. The standard method to find out when an LP becomes ready for execution is presented in Subsect. This overhead can be eliminated by a serialization-free parallel data structure on the SB-PRAM (see Subsect. 6.5).
Often, however, shared memory multiprocessors need some locality in data referencing to exploit their caches and hence to obtain appropriate memory bandwidth. To achieve locality, the PTHOR program of the SPLASH1 benchmark suite [24] uses a so-called stealing strategy: basically, this is a static strategy with local task queues for LPs that are ready for execution. In cases where the load is not balanced, an idle processor can "steal" an LP that is ready for execution but is assigned to another processor.
The stealing strategy exploits locality as long as processors are busy and requires remote access only for load balancing when the processor is idle anyway.
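A minimal Python sketch of the stealing idea follows. It models only the queue discipline, not PTHOR's actual data structures: the function name is hypothetical, and scanning the other processors in index order when looking for a victim is an assumption made for illustration.

```python
from collections import deque

def next_lp(proc, local_queues):
    """Pick the next ready LP for processor `proc`, or None if all are empty.

    local_queues[i] holds the ready LPs assigned to processor i in FIFO order.
    """
    if local_queues[proc]:
        # Normal case: stay local, take the LP that has been ready longest.
        return local_queues[proc].popleft()
    # Idle case: steal from the first non-empty queue of another processor
    # (ASSUMPTION: victims are scanned in index order).
    for victim, queue in enumerate(local_queues):
        if victim != proc and queue:
            return queue.popleft()
    return None

queues = [deque(), deque([5, 7])]
print(next_lp(0, queues))  # processor 0 is idle and steals from processor 1
print(next_lp(1, queues))  # processor 1 works on its own remaining LP
```

Remote queues are touched only on the idle path, which is where the strategy gets its locality.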
In all these strategies, it may happen that a processor must choose between several LPs that are ready for execution. This can happen because either more than one LP assigned to a processor is ready, or because more than p LPs are ready in the central FIFO queue. In PTHOR, the processor chooses the LP that has been ready for execution for the longest time. This is easy to implement. Another popular method is to choose the LP with the smallest timestamp. This method leads to additional overhead because it requires that LPs that are ready to run are kept sorted according to their timestamps.
To get realistic runtime predictions $T_{crit}(p)$ depending on the number of processors $p$, it is necessary to model the partitioning strategy used in the critical path analysis. Note that these runtimes cannot be shorter than $T_{crit}$. All delays due to causality apply for both $T_{crit}$ and $T_{crit}(p)$, and partitioning could introduce additional delays. The inclusion of partitioning strategies in critical path analysis was first mentioned by Lin [21], but he only uses a static strategy.
To include one of the above partitioning strategies in critical path analysis, we assume that the number of available processors $p$ is fixed. We maintain a timer $c(i)$ for each processor $i$, which specifies the computation time performed by $i$. If this processor executes an event $e$, the timer is increased by $\tau(e)$. As before, we evaluate the function END on the nodes of the EPG in topological order. For an event $e$ executed on processor $i$, let $c_{old}(i)$ denote the value of $c(i)$ before the execution of $e$. Then
$$\mathrm{END}(e) = \mathrm{START}'(e) + \tau(e), \qquad \mathrm{START}'(e) = \max\bigl(c_{old}(i), \mathrm{START}(e)\bigr).$$
START(e) is defined as above. The execution time consumed by simulating e is taken into account by
The different partitioning strategies lead to different assignments of LPs (and their events) to processors and hence to different results for $T_{crit}(p)$.
Note that the topological sort does not give a unique total order on the vertices, e.g., all vertices with indegree 0 could serve as the first node. Therefore we maintain a priority queue of all events that are ready for execution. The priority is the time when the events became ready. Removing the event with the smallest ready time ensures correct modeling.
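The procedure can be sketched in Python for a fixed (static) assignment of LPs to processors. This is an illustration under stated assumptions rather than the analysis tool used for the experiments: the function and parameter names are hypothetical, an event becomes ready when its last predecessor finishes, and the processor timer is assumed to be set to END(e) after executing e.

```python
import heapq

def t_crit_partitioned(weights, edges, lp_of, proc_of):
    """Estimate T_crit(p) for a fixed assignment of LPs to processors.

    weights: {event: runtime}     edges: (e1, e2) pairs of the EPG
    lp_of:   {event: LP}          proc_of: {LP: processor}
    """
    preds = {e: [] for e in weights}
    succs = {e: [] for e in weights}
    for e1, e2 in edges:
        preds[e2].append(e1)
        succs[e1].append(e2)
    missing = {e: len(preds[e]) for e in weights}

    clock = {}  # c(i): time at which processor i becomes free
    end = {}    # END(e)
    # Priority queue of ready events, ordered by the time they became ready.
    ready = [(0, e) for e in sorted(weights) if missing[e] == 0]
    heapq.heapify(ready)

    while ready:
        start, e = heapq.heappop(ready)        # START(e)
        i = proc_of[lp_of[e]]
        start_p = max(clock.get(i, 0), start)  # START'(e)
        end[e] = start_p + weights[e]          # END(e)
        clock[i] = end[e]                      # ASSUMPTION: c(i) := END(e)
        for s in succs[e]:
            missing[s] -= 1
            if missing[s] == 0:
                ready_time = max(end[q] for q in preds[s])
                heapq.heappush(ready, (ready_time, s))
    return max(end.values())

# Hypothetical EPG with three LPs; c and d are consecutive events of LP 2.
weights = {"a": 2, "b": 3, "c": 1, "d": 2}
edges = [("a", "c"), ("b", "c"), ("c", "d")]
lp_of = {"a": 0, "b": 1, "c": 2, "d": 2}
for p in (1, 2, 3):
    proc_of = {lp: lp % p for lp in range(3)}  # cyclic distribution
    print(p, t_crit_partitioned(weights, edges, lp_of, proc_of))
```

With one processor the result equals the sequential runtime, and with one processor per LP it equals the critical path length, which matches the remark that partitioning can only add delays.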
Experiments
We computed the EPGs for three circuits delivered with the PTHOR simulator from the SPLASH1 benchmark suite [24]. DASH models the cache coherency controller of the DASH multiprocessor [20] and represents 74,000 gate equivalents organized in 24,000 LPs.
H-FRISC is a small RISC processor generated by a synthesis tool. It represents 7,000 gate equivalents organized in 5,000 LPs.
Multiplier implements a multiplier of two 16-bit numbers. It also represents 7,000 gate equivalents organized in 5,000 LPs.
We use the input vectors that are delivered with the PTHOR program. We use the unit delay model, i.e., each gate and each register has a delay of 1. We simulate 5000 time units. We computed the speedup bound $S_{crit}$ and bounds $S_{crit}(p) = T_{seq}/T_{crit}(p)$, where $p = 2^i$, $i = 0, \ldots, 12$, for the three partitioning strategies. For the static and stealing strategies, we use a cyclic distribution. The curves are shown in Fig. 1. The speedup bounds $S_{crit}(p)$ with partitioning reach the maximum speedup $S_{crit}$ already for small numbers of processors. The dynamic partitioning strategy outperforms the other two in theory. For small processor numbers ($p \le 16$), the stealing strategy behaves like the static strategy; for larger processor numbers it approaches the dynamic strategy. As the static strategy performs worst, we do not consider it in the sequel.
Second, note that causality restricts the available parallelism severely. This might result from the form of the LPs. The DASH circuit has LPs with up to 94 inputs. In contrast, the H-FRISC and the Multiplier circuits have LPs with up to 17 and 5 inputs, respectively. The more inputs an LP has, the more it can depend on events occurring on other LPs. The events that schedule an event on an LP with many inputs might finish at vastly different computation times. As a conservative simulation must wait for the last of these events to finish, the delays due to causality can be large. So, it might be wise to split large LPs into smaller units with fewer inputs.
In contrast to this, Soulé [26] proposes to combine LPs to larger units called "globbed elements" to get a larger granularity of the single tasks and so to increase the speedup. As this increases the number of inputs per LP, the benefits due to larger granularity get lost by parallelism degradation. Our results strongly discourage this proposal.
We also investigated the granularity of the LP execution times as a possible source of speedup degradation. On the SB-PRAM the execution time of an LP is proportional to the number of instructions.
PTHOR
A widely used algorithm for circuit simulations on parallel machines is the Chandy-Misra-Bryant algorithm (CMB) [10, 12]. This algorithm is a conservative approach. We will first review the PTHOR program [26], which is an implementation of CMB on the Stanford Dash machine and distributed as part of the SPLASH1 benchmark suite [24].
Granularity has a strong influence on centralized-time algorithms. The runtime of each round is bounded by the longest task. The asynchronous CMB algorithm is potentially able to simulate events of other simulation timesteps in parallel while a lengthy event runs on one processor. Our granularity measurements show that lengthy tasks exist in the simulation of our benchmark circuits.
Finally, the overhead of synchronization for each simulation-time step in synchronous simulation is inevitable. Every element in our benchmark circuits has a non-zero delay and no events are cancelled, so at most one deadlock per simulation-timestep can occur in CMB. With our optimizations discussed later, deadlock resolution runs only slightly slower on the SB-PRAM than a synchronization. So, every timestep without deadlock can help to avoid overhead that must occur in centralized-time simulation.
Description
PTHOR partitions the LPs of the simulated circuit with the stealing strategy sketched in Subsect. 4.2.
It uses a cyclic distribution of LPs to processors. There is a message channel between LP i and LP j if an input of component j in the simulated circuit is connected to an output of component i. If LP i computes a change of the output signal that occurs at simulated time t, then this output is put into a message with timestamp t. All LPs connected with LP i get a copy of this message in their appropriate input buffers.
Each processor maintains an activation list that contains all of its LPs for which new messages have arrived. If LP i sends a message to another LP j, it generates an entry for LP j in the activation list of the processor to which LP j is assigned.
An event e can only be simulated if all necessary inputs are present in the input buffers. An idle processor j tries to get an LP from its activation list. If its own list is empty, then it tries to steal an LP from another activation list. If the chosen LP has all necessary inputs, j can simulate one or several events from that LP correctly. In either case, this LP is removed from the activation list. It will be entered again if some new input message arrives.
It can thus happen that all activation lists become empty although some events could be simulated.
Such a situation is called deadlock. The CMB algorithm tolerates deadlocks, because it is able to detect and to resolve all of them. Deadlock detection can be implemented on a shared memory multiprocessor by maintaining a shared counter which is initially set to zero. A processor whose activation list becomes empty (and does not succeed in stealing) increases the counter. It decrements the counter again if it finds a new event to simulate. A deadlock has occurred if the counter equals the number of available processors.
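The counter protocol can be sketched as follows. This Python fragment only illustrates the counting logic with a lock-protected counter; the class and method names are hypothetical and the code is not taken from PTHOR.

```python
import threading

class DeadlockDetector:
    """Shared idle counter; a deadlock is flagged when every processor is idle."""

    def __init__(self, num_processors):
        self.num_processors = num_processors
        self.idle = 0                 # shared counter, initially zero
        self.lock = threading.Lock()  # protects the counter

    def became_idle(self):
        """Call when the own activation list is empty and stealing failed.

        Returns True if this call detected a deadlock.
        """
        with self.lock:
            self.idle += 1
            return self.idle == self.num_processors

    def found_work(self):
        """Call when a previously idle processor finds a new event to simulate."""
        with self.lock:
            self.idle -= 1

detector = DeadlockDetector(num_processors=2)
print(detector.became_idle())  # one of two processors is idle
detector.found_work()          # it finds work again
print(detector.became_idle())
print(detector.became_idle())  # now both processors are idle
```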
To resolve the deadlock, one has to find at least one event that can be simulated. To do this, we search for a message m with the minimum timestamp t. Chandy and Misra prove that all events that occur at time t (and hence have m as input) can be simulated [12].
Performance
Figure 3 shows the speedups for the benchmark circuits on three machines, with processor numbers ranging from 2 to 128. Only on the SB-PRAM do we obtain a speedup larger than 1. The diagrams show absolute speedups: the sequential runtime is not the runtime of the parallel program with one processor. Instead, it is the runtime of the fastest sequential implementation we were able to develop.
For the circuits, the same models and the same implementations were used in the sequential and the parallel case. Only the parts for administering messages, scheduling LPs, and memory management were replaced for the different sequential and parallel measurements. These parts of our sequential simulator had to be optimized: in a sequential simulator, the events must be executed in increasing timestamp order. Thus, in contrast to parallel asynchronous schemes, the sequential queue not only schedules the LPs but also has to restore the timestamp order. To perform this task, all messages are held in a priority queue. For the SB-PRAM, we implemented several different data structures such as binary heaps, Fibonacci heaps, and calendar queues. We found that splay trees [25] give the best runtime results for our application. Besides many small optimizations, an efficient memory management was realized.
Note that the parallel program on one processor is much slower than the sequential program on one processor of the same machine. The quotient between these two runtimes is called slowdown factor. Table 1 shows the slowdown factors for the three benchmark circuits on the SB-PRAM and the Dash machine. The latter are taken from [26]. For Dash and Multimax, we used relative speedups from [26] and the above slowdown factors to compute absolute speedups.
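The conversion from relative to absolute speedups follows directly from these definitions. Writing $T_{seq}$ for the runtime of the fastest sequential program and $T_{par}(p)$ for the runtime of the parallel program on $p$ processors (symbols introduced here only for illustration), the relation is:

```latex
\[
S_{abs}(p) \;=\; \frac{T_{seq}}{T_{par}(p)}
          \;=\; \frac{T_{par}(1)}{T_{par}(p)} \cdot \frac{T_{seq}}{T_{par}(1)}
          \;=\; \frac{S_{rel}(p)}{\mathit{slowdown}},
\qquad
\mathit{slowdown} \;=\; \frac{T_{par}(1)}{T_{seq}} .
\]
```

A relative speedup therefore has to exceed the slowdown factor before the absolute speedup becomes larger than 1.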
The source code of the centralized-time simulator is not delivered with the SPLASH1 benchmarks.
So, runtime results for comparison on the SB-PRAM are not available.
The performance of PTHOR suffers from serialization. Serialization occurs during concurrent access to the shared counter for deadlock detection.
The access to the counter is protected by a lock. Figure 4 shows the total number of accesses to the shared counter and the fraction of accesses that were not directly granted. The time to access a lock is one instruction on both the Dash and the SB-PRAM, as both machines provide hardware support for read-modify-write operations.
Serialization is also caused by the computation of the minimum timestamp during deadlock resolution.
This computation needs a loop over all processors and barrier synchronizations before and after the loop.
The barriers are also implemented by locks. The upper curves of Fig. 5 show the average number of instructions needed to resolve a deadlock in PTHOR on the SB-PRAM. The lower curves show the corresponding numbers for the reimplementation (see next section).
REIMPLEMENTATION
Our reimplementation avoids the serializations mentioned above. We also improved the memory management and the realization of channels between LPs. As mentioned in Sect. 1, the multiprefix operation serves to compute global sums and global minima in a small constant number of instructions. Figure 5 shows the average number of instructions needed for deadlock resolution on the SB-PRAM using multiprefix.
Memory Management
During the simulation, one has to manage tens of thousands of small list elements for message queues, activation lists etc. PTHOR never recycles elements, it even keeps those elements that are not in use anymore. This is a waste of memory resources and leads to unnecessary shared memory allocations.
Furthermore, extracting list elements from the allocated memory leads to serialization because locks are used.
In the reimplementation, each processor maintains a so-called freelist. After a processor has executed an event, some of the involved list elements might not be needed anymore. Then, the processor adds these to its own freelist. If a processor wants to allocate a list element, it first tries to obtain one from its freelist. If its freelist is empty, then it obtains a list element from an allocated shared memory block.
If a block containing l list elements is allocated, a shared counter c is initialized to l. A so-called R-pointer is set to the beginning of the memory block. To obtain a list element from that block, a processor decreases the counter c with the help of multiprefix. This allows concurrent access by multiple processors without serialization. The result r of the prefix operation gives the number of remaining list elements. If $r \le 0$, the memory block is exhausted. The processor that obtains the value 0 then allocates a new memory block, and all processors that received values less than or equal to zero then repeat the allocation with the new block.
If a processor receives $r \ge 1$, it can cut off a list element from the memory block. To do this, it increases the R-pointer of this block by the size of a list element with the help of multiprefix. The value the processor obtains then determines the position of the list element. Figure 6 shows five processors that try to allocate a list element. Processor 0 finds an element in its freelist; the other four processors must allocate from a shared memory block with c = 2. After the multiprefix operation, c = -2, and processor i receives value 3 - i. Thus, processors 1 and 2 get list elements from the current memory block. Processor 3 receives the value 0 and allocates a new block, from which processors 3 and 4 allocate their list elements.
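The allocation round of this example can be replayed with a sequential emulation of the multiprefix addition. The Python sketch below is illustrative only: the helper and function names are hypothetical, addresses are plain integers, and the parallel constant-time execution of the hardware is not modeled.

```python
from itertools import accumulate

def mp_add(cell, contributions):
    """Emulate a multiprefix addition: per-processor prefixes and new cell value."""
    running = list(accumulate(contributions, initial=cell))
    return running[:-1], running[-1]

def allocate_round(counter, r_pointer, requesters, elem_size=1):
    """One synchronous allocation attempt on a shared memory block.

    Returns (received, granted, refiller, retry, counter, r_pointer).
    """
    # Every requester decrements the shared counter c by one.
    values, counter = mp_add(counter, [-1] * len(requesters))
    received = dict(zip(requesters, values))
    # Requesters with r >= 1 advance the R-pointer and obtain an element address.
    winners = [p for p in requesters if received[p] >= 1]
    addresses, r_pointer = mp_add(r_pointer, [elem_size] * len(winners))
    granted = dict(zip(winners, addresses))
    # The requester that saw r == 0 allocates the next block; r <= 0 must retry.
    refiller = next((p for p in requesters if received[p] == 0), None)
    retry = [p for p in requesters if received[p] <= 0]
    return received, granted, refiller, retry, counter, r_pointer

# Processors 1 to 4 request an element from a block with c = 2.
print(allocate_round(counter=2, r_pointer=0, requesters=[1, 2, 3, 4]))
```

No requester ever waits for another one: the prefix values alone decide who gets an element, who refills, and who retries.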
Channel Queues
The realization of a channel is performed with a FIFO queue where one LP writes a message and all LPs connected to this channel read the message. As it is not clear when all LPs have read a message, PTHOR keeps all messages in these queues. We attach a shared counter to each message in the queue.
The counter is initialized to the number of LPs connected to this channel. Each LP reading a message decreases its counter with the help of multiprefix. If the counter has reached zero, the processor accessing the message removes it from the queue and puts it into its freelist. We call this queue organization single-in multiple-out queue (SIMO). It needs no locks. Figure 7 shows a SIMO queue where LP 0 writes and LPs 1 to 4 read. The uppermost two messages have not yet been read by any LP and hence have counters with value 4. The next two messages have been read by LP 1 and by LPs 1 and 4, respectively, and thus have counters with values 3 and 2. LP 2 has just read the lowermost message and thus decreased the message's counter to zero. The message is now removed from the queue.
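A sequential Python sketch of the SIMO idea is given below. It illustrates the per-message reader counter only; the class layout, the per-reader cursor, and the plain integer decrement (standing in for the multiprefix operation) are assumptions made for this example.

```python
from collections import deque

class SimoQueue:
    """Single-in multiple-out FIFO: one writer, every reader sees every message."""

    def __init__(self, num_readers):
        self.num_readers = num_readers
        self.messages = deque()          # entries are [payload, unread_count]
        self.base = 0                    # absolute index of messages[0]
        self.cursor = [0] * num_readers  # absolute index each reader reads next
        self.freelist = []               # recycled message elements

    def write(self, payload):
        # The counter starts at the number of LPs connected to the channel.
        self.messages.append([payload, self.num_readers])

    def read(self, reader):
        pos = self.cursor[reader] - self.base
        if pos >= len(self.messages):
            return None                  # nothing new for this reader
        entry = self.messages[pos]
        self.cursor[reader] += 1
        entry[1] -= 1                    # a multiprefix decrement on the SB-PRAM
        if entry[1] == 0:
            # The last reader is done: recycle the element. Because all readers
            # consume in FIFO order, this is always the head of the queue.
            self.messages.popleft()
            self.base += 1
            self.freelist.append(entry)
        return entry[0]

q = SimoQueue(num_readers=2)
q.write("m1")
q.write("m2")
print(q.read(0), q.read(1), len(q.messages))
```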
LNE Lists
To resolve deadlocks, one has to inspect all LPs that satisfy the following two conditions: (1) at the beginning of the deadlock, the LP still has messages in its input buffers, and (2) the LP has processed at least one message.
To speed up deadlock resolution, we maintain a data structure containing only the LPs satisfying the above conditions. After the last test of LP i's input buffers, the computed LNE time is added to the LNE list. This means that either a reference to LP i containing the LNE time is added to the list, or the LNE time of LP i is updated if a reference to LP i is already present.
If the buffers of an LP become empty, its reference is removed from the LNE list.
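The three operations on the LNE list (insert, update, remove) can be sketched in Python as follows. This is an illustration, not the paper's data structure: a dictionary replaces the linked list of references, the class and method names are hypothetical, and `min_time` merely shows how the smallest recorded LNE time can be read off without visiting every LP.

```python
import math


class PartialLneList:
    """LNE times of the LPs that still have messages in their input buffers."""

    def __init__(self):
        self.lne_time = {}  # LP id -> LNE time

    def record(self, lp, time):
        # Adds a reference to the LP, or updates its LNE time if present.
        self.lne_time[lp] = time

    def buffers_empty(self, lp):
        # The LP no longer qualifies, so its reference is dropped.
        self.lne_time.pop(lp, None)

    def min_time(self):
        # Smallest LNE time in this list; infinity if the list is empty.
        return min(self.lne_time.values(), default=math.inf)


# Example with made-up LNE times, except for the value 20 of LP 33.
lne = PartialLneList()
lne.record(11, 35)
lne.record(33, 20)
lne.buffers_empty(11)
```

A dictionary makes the update case trivial; a list-based version would first have to locate an existing reference to the LP.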
If we employ a static or stealing partitioning strategy, each processor j maintains a partial LNE list containing references to the LPs that are assigned to j. If we employ a stealing strategy, several processors might write into one partial list, so the partial lists must be protected by locks. However, as stealing happens rarely, the number of collisions will be low. If we employ a dynamic strategy, each processor maintains the partial LNE list of all LPs that would have been assigned to it under a static partitioning strategy. These lists are also protected by locks.
To find the minimum timestamp t, each processor first runs over its own partial LNE list sequentially. Then t is computed by a global minimum over all processors. If the load is balanced, each processor spends a similar amount of time computing its local minimum. The global minimum is computed with a multiprefix operation in constant time.
Figure 8 shows the partial LNE list of processor 1 when a stealing strategy is employed. Processor 2 has stolen LP 33 from processor 1, has just computed the LNE time of LP 33 as 20, and has inserted a reference to LP 33 at the beginning of the list. Processor 3 has stolen LP 11 from processor 1. The buffers of LP 11 have become empty; therefore processor 3 removes the reference to LP 11 from the list.
Performance
Figure 9 shows the absolute speedups of PTHOR and the reimplementation on the SB-PRAM. The speedups of the reimplementation are much better than the PTHOR speedups. For the DASH benchmark, the speedup reaches the critical path bound. For H-FRISC and Multiplier there is still a gap between the bound from critical path analysis and the actual speedup. Experiments that try to narrow this gap are discussed in Subsect. 6.5.
The runtime of the reimplementation can be split into four phases:
On distributed memory machines, the flood of NULL-messages can cause more overhead than the deadlock avoidance method. Therefore, one only sends part of the NULL-messages to avoid part of the deadlocks [13]. On shared memory machines, messages need not be sent explicitly. Every event can access each channel data structure in global memory. Therefore, instead of sending a message, one can update every channel clock directly. This removes most of the overhead of message passing (queue organization etc.) and makes NULL-messages a useful tool. To avoid deadlocks completely, every update of a channel clock must be followed by the activation of all LPs connected to this channel. Figure 11 shows the speedup curves with and without NULL-messages for the Multiplier circuit. The use of NULL-messages almost doubles the speedup.
The situation is different for the H-FRISC circuit. Here, the use of NULL-messages results in an increase of activations by a factor of 6. The speedup drops by a factor of 5 to 6, depending on the number of processors. The reason lies in the different structures of the circuits. While Multiplier is purely combinatorial, H-FRISC contains cycles between registers. In these cycles, several NULL-messages are often sent (and hence activations happen) before an event can be simulated.
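The shared-memory form of a NULL-message, a direct clock update followed by the activation of all connected LPs, might look like the following Python sketch. The names `Channel` and `update_clock` are hypothetical, and the rule that a clock only moves forward is an assumption of this sketch rather than something stated in the text.

```python
class Channel:
    """Channel data structure kept in global memory (hypothetical layout)."""

    def __init__(self, readers):
        self.clock = 0
        self.readers = list(readers)  # LPs connected to this channel


def update_clock(channel, time, activation_list):
    """Advance the channel clock in place of sending a NULL-message."""
    if time <= channel.clock:
        return False  # assumption: clocks only move forward
    channel.clock = time
    # Every clock update is followed by activating all connected LPs.
    activation_list.extend(channel.readers)
    return True


activations = []
ch = Channel(readers=[3, 5])
update_clock(ch, 10, activations)
```

Each successful update adds one activation per connected LP, which is the cost that grows in circuits with cycles such as H-FRISC.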
Baker and Mahmoody [6] also present an algorithm that optimizes the use of NULL-messages. They report an increase by a factor of three in combinatorial circuits taken from the ISCAS suite. However, the performance of their algorithm on sequential circuits is unknown to us.
Second, we tried to use the dynamic partitioning strategy as an alternative to stealing. To do this, one needs a shared FIFO queue as a global activation list. This list is accessed by all processors and hence must not lead to serialization. With the help of multiprefix, one can implement a FIFO queue that processes insertions or deletions by an arbitrary number of processors in a small constant number of instructions [23]. Figure 12 shows the speedups on H-FRISC for both strategies. The curves for the Multiplier circuit look similar. In contrast to theory, the dynamic strategy is not superior to stealing. A reason for this is that more than 90 % of all activations are satisfied from the processors' local activation lists, even for large numbers of processors. However, the dynamic strategy leads to simpler program code. Note that the difference between the two curves even increases. This results from a constant runtime overhead in accessing the central FIFO queue.
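One way to picture a multiprefix-based FIFO queue is a ticket scheme in which every processor obtains its own slot index from a shared head or tail counter. The Python sketch below is an assumption made for illustration and does not reproduce the algorithm of [23]; `multiprefix_add` emulates the multiprefix operation sequentially, and `PrefixFifo` with its underflow handling is hypothetical.

```python
def multiprefix_add(cell, key, increments):
    """Sequentially emulate a multiprefix add; each participant receives
    the value of cell[key] before its own increment is applied."""
    results = []
    for inc in increments:
        results.append(cell[key])
        cell[key] += inc
    return results


class PrefixFifo:
    """FIFO queue in which a whole group of processors inserts or deletes at once."""

    def __init__(self):
        self.idx = {"head": 0, "tail": 0}
        self.slots = {}  # slot index -> item

    def insert_all(self, items):
        # One multiprefix step gives every inserting processor its own slot.
        tickets = multiprefix_add(self.idx, "tail", [1] * len(items))
        for ticket, item in zip(tickets, items):
            self.slots[ticket] = item

    def delete_all(self, n_procs):
        # One multiprefix step gives every deleting processor its own slot.
        tickets = multiprefix_add(self.idx, "head", [1] * n_procs)
        tail = self.idx["tail"]
        out = [self.slots.pop(t) if t < tail else None for t in tickets]
        # Simplification: processors that found the queue empty give
        # their tickets back by resetting head to tail.
        self.idx["head"] = min(self.idx["head"], tail)
        return out


fifo = PrefixFifo()
fifo.insert_all(["a", "b", "c"])
first = fifo.delete_all(2)
```

Here no processor waits for another: the slot indices come out of a single multiprefix step, which is why such a queue does not serialize its accesses.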
CONCLUSIONS
Our results show that critical path analysis permits good speedup predictions if partitioning strategies are included. For the benchmark circuits, the SB-PRAM comes close to the maximum speedup, allowing more accurate predictions. As a consequence of using a single framework, the tool for critical path analysis also yields an efficient implementation.
For the prediction, we consider absolute speedup values. This is important to evaluate the use of parallel machines in practice as relative speedups are up to 10 times higher than the absolute ones.
To make parallel simulators competitive, it might be worth investigating whether the slowdown factors from sequential to parallel can be made smaller.
Experiments with the benchmark circuits reveal that the maximum speedup depends strongly on the circuit's structure. Of particular importance are the length of the cycles and the number of inputs per LP. Our results strongly suggest keeping the number of inputs per LP low, if necessary by decomposing one LP into several smaller ones.
We presented several new serialization-free parallel data structures, which seem to have a large impact on the program's performance. The efficiency of these data structures is based on the use of parallel prefix operations.

