A Parallelism Analyzer for Conservative Parallel Simulation
Yung-Chang Wong, Shu-Yuen Hwang, and Jason Yi-Bing Lin
Abstract-Most small-scale simulation applications are implemented by sequential simulation techniques. As the problem size increases, however, sequential techniques may be unable to manage the time complexity of the simulation applications adequately. It is natural to consider re-implementing the corresponding large-scale simulations using parallel techniques, which have been reported to be successful in reducing the time complexity for several examples. However, parallel simulation may not be effective for every application. Since the implementation of parallel simulation for an application is usually very expensive, it is necessary to investigate the performance of parallel simulation for a particular application before re-implementing the simulation. The Chandy-Misra parallel, discrete-event simulation paradigm has been utilized in many large-scale simulation experiments, and several significant extensions have been based on it. Hence the Chandy-Misra protocol is adopted here as a basic model of parallel simulation to which our performance prediction techniques are applied. For an existing sequential simulation program based on the process interaction model, this paper proposes a technique for evaluating Chandy-Misra parallel simulation without actually implementing the parallel program. The idea is to insert parallelism analysis code into the sequential simulation program. When the modified sequential program is executed, the time complexity of the parallel simulation based on the Chandy-Misra protocol is computed.
Our technique has been used to determine whether a giant Signaling System 7 simulation (sequential implementation) should be re-implemented using the parallel simulation approach.
Index Terms-Chandy-Misra protocol, critical path analysis, discrete event simulation, parallelism, parallel simulation.
IEEE Log Number 9412400.
I. INTRODUCTION
In a parallel simulation, the simulated system is partitioned into several subsystems, each of which consists of a nonoverlapping subset of state variables. These subsystems are concurrently simulated by a set of processes that communicate by exchanging timestamped messages. The events scheduled for a process can modify only the state variables of the corresponding subsystem. The processes execute concurrently to complete a simulation run. To produce the correct simulation results, the executions of the processes must follow a set of synchronization rules [1]. The performance of parallel simulation depends on two factors: the parallelism existing in the system to be simulated and the overhead of the parallel simulation protocol running on a particular computer architecture.
The inherent parallelism of a simulation application was first studied by Berry and Jefferson [2] and Livny [3]. Algorithms have been proposed to study the inherent application parallelism when every process is executed by a separate processor. Lin [4] proposed an inherent-parallelism analysis algorithm for the case where more than one process may be mapped to a processor under different process scheduling policies. This paper extends previous results by considering both the inherent parallelism and the parallel simulation protocol overhead. The paper proposes a parallelism analysis algorithm for the Chandy-Misra protocol [5], in which more than one process may be mapped to a processor. The parallelism analysis algorithm is integrated with the sequential simulation program. When this modified sequential simulation is executed, the time complexity of the parallel simulation based on the Chandy-Misra protocol is also computed. Our technique is a powerful tool for determining the performance of the Chandy-Misra parallel simulation for an existing sequential simulation program.
This paper is organized as follows. Section II introduces the concept of the event precedence graph. Section III describes the Chandy-Misra protocol. Section IV proposes a parallelism analysis algorithm for Chandy-Misra parallel simulation.
II. THE EVENT PRECEDENCE GRAPH
The execution of a discrete event simulation follows causality constraints, and the relationships between the events can be described by an event precedence graph [2], [6], [7], [4]. The concept of the event precedence graph is illustrated by the following example. Consider the simple network in Fig. 1(a). Fig. 1(b) shows the event precedence graph for a simulation scenario of the network.
In this figure, the timestamp of event $e_i$ is $i$.
In the event precedence graph, a vertex represents the occurrence of an event. A dashed arrow from event $e_i$ to event $e_j$ means that both $e_i$ and $e_j$ are scheduled for the same process, and $e_i$ occurs earlier than $e_j$ does (cf. events $e_4$ and $e_6$ in Fig. 1(b)). A solid arrow from $e_i$ to $e_j$ means that the scheduling of $e_j$ is due to the occurrence of $e_i$ (cf. events $e_1$ and $e_2$ in Fig. 1(b)). To correctly simulate the behavior of the network, event $e_i$ must be processed before $e_j$ if there is an arrow (either dashed or solid) from $e_i$ to $e_j$ in the event precedence graph. In a sequential simulation implementation, all events are processed in nondecreasing timestamp order. This sequential execution engine guarantees that the relationship in the event precedence graph is not violated. Suppose that the time to process event $e_1$, $e_3$, $e_5$, or $e_7$ is 1 unit, and the time to process any one of the other events is 3 units. The execution order and the elapsed time after an event is executed in a sequential simulation are given in Fig. 1(c).
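The sequential schedule described above can be tabulated with a few lines of Python. This is a minimal sketch, assuming the scenario contains the twelve events $e_1$ to $e_{12}$ and that event $e_i$ has timestamp $i$; the function name `sequential_schedule` is hypothetical and is not part of the original program.

```python
# Execution times from the text: e1, e3, e5, e7 take 1 unit; all others take 3.
# Assumption: the scenario has 12 events e1..e12 and event e_i has timestamp i.
FAST_EVENTS = {1, 3, 5, 7}

def sequential_schedule(num_events=12):
    """Return [(event index, elapsed time after the event)] in timestamp order."""
    elapsed = 0
    schedule = []
    for i in range(1, num_events + 1):      # nondecreasing timestamp order
        elapsed += 1 if i in FAST_EVENTS else 3
        schedule.append((i, elapsed))
    return schedule

schedule = sequential_schedule()
for index, elapsed in schedule:
    print(f"e{index}: elapsed time {elapsed}")
```

Under these assumptions the last entry gives the total sequential execution time, against which any parallel estimate can be compared.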
In Fig. 1(b), an event execution time is associated with each vertex (i.e., event). A communication delay is associated with each solid arrow (the cost for the dashed arrow is 0). Since the graph is acyclic, a maximal weighted path can be found. This path is called the critical path and its cost is the minimal time required to finish the execution of the parallel simulation. The critical path does not consider the overhead for parallel simulation protocols. In other words, the cost for the critical path is a lower bound for the execution time of any parallel simulation approach. To evaluate the time complexity for a particular parallel simulation protocol, new techniques (such as the Chandy-Misra parallelism analyzer developed in this paper) are required.
[Fig. 1(c): The sequential execution.]
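The critical path cost can be computed by a longest-path search over the acyclic graph. The following is a minimal sketch, not the authors' implementation: the function name `critical_path_cost` and the four-event example graph are hypothetical (they are not the graph of Fig. 1(b)), with vertex weights standing for execution times and arrow weights for communication delays.

```python
from functools import lru_cache

def critical_path_cost(exec_time, edges):
    """Cost of the maximal weighted path in an acyclic event precedence graph.

    exec_time: {event: execution time}
    edges:     {(u, v): delay}; dashed arrows are given delay 0.
    """
    successors = {}
    for (u, v), delay in edges.items():
        successors.setdefault(u, []).append((v, delay))

    @lru_cache(maxsize=None)
    def longest_from(u):
        # Heaviest continuation after u, or 0 if u has no outgoing arrow.
        tail = max((d + longest_from(v) for v, d in successors.get(u, [])),
                   default=0)
        return exec_time[u] + tail

    return max(longest_from(u) for u in exec_time)

# Hypothetical graph: a -> b -> d and a -> c -> d, with a delay of 2 on (c, d).
exec_time = {"a": 1, "b": 3, "c": 3, "d": 3}
edges = {("a", "b"): 0, ("a", "c"): 0, ("b", "d"): 0, ("c", "d"): 2}
cost = critical_path_cost(exec_time, edges)
print(cost)
```

The recursion depth grows with the path length, so an iterative topological-order pass would be preferable for large graphs.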
III. THE CHANDY-MISRA PROTOCOL
From the definition of the event precedence graph, a parallel simulation protocol is correct if all events occurring at a logical process are executed in nondecreasing timestamp order. The Chandy-Misra protocol follows two waiting rules to satisfy the causality constraint. We first describe the assumptions of a Chandy-Misra simulation.
The FIFO message sending assumption: Communication between two processors preserves the first-in-first-out (FIFO) property (i.e., the messages are received in the order they are sent).
The original Chandy-Misra protocol [5] further assumes that the buffer capacity of a process to store the incoming messages is limited. The purpose of this restriction was to limit the memory usage of a Chandy-Misra simulation. However, Lin and Preiss [8] and Jefferson [9] showed that in general limiting the input buffer capacities of processes does not limit the total memory usage for a Chandy-Misra simulation. We assume that the input buffer capacity of a process is unlimited. For simplicity, we make three assumptions about the simulated network. It is easy to see that our results can be generalized.
There are three types of processes in the simulation. A source process does not receive any messages from other processes. A server process may send and receive messages. A sink process does not send any messages to other processes. In Fig. 1(a) ...
Two waiting rules ensure the correctness of the Chandy-Misra protocol.
The input waiting rule: Before process $p_i$ executes an event $e$, $p_i$ must receive from each of its input channels an event (including $e$), and $e$ must have the smallest timestamp among the events in the input channels.
The output waiting rule: Consider an event $e'$ created at process $p_i$, which is scheduled for process $p_j$. Event $e'$ is sent to $p_j$ after $p_i$ has started executing event $e$, where $e'.ts \le e.ts + \epsilon(e)$. If several events satisfy this inequality, then they are sent to $p_j$ in nondecreasing timestamp order.
The output waiting rule and the FIFO message sending assumption ensure that a process always receives messages from an input channel in nondecreasing timestamp order. This property, together with the input waiting rule, guarantees that all events occurring at a process are executed in nondecreasing timestamp order. Note that the output waiting rule is not required for a source process because of the FIFO event generation assumption. We define the lookahead of every event $e$ executed at a source process as $\epsilon(e) = \infty$. (Note that $\epsilon(e) = \infty$ does not imply that the execution of $e$ at a source process will schedule an event with timestamp $\infty$. The infinite lookahead value is used to bypass the output waiting rule for the source process.)
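The two waiting rules can be read as simple predicates. The sketch below is an illustration under stated assumptions, not code from the paper: input channels are modeled as FIFO queues of timestamps, and the names `input_rule_pick` and `output_rule_allows` are hypothetical.

```python
import math
from collections import deque

def input_rule_pick(channels):
    """Input waiting rule: return the index of the channel whose head event may
    be executed, or None if some input channel has not delivered an event yet.

    channels: list of deques of timestamps, each in nondecreasing (FIFO) order.
    """
    if any(len(channel) == 0 for channel in channels):
        return None
    return min(range(len(channels)), key=lambda k: channels[k][0])

def output_rule_allows(new_ts, executing_ts, lookahead):
    """Output waiting rule: e' may be sent once the sender has started executing
    an event e with e'.ts <= e.ts + lookahead(e)."""
    return new_ts <= executing_ts + lookahead

# One channel is still empty, so the process must wait.
print(input_rule_pick([deque([4, 6]), deque()]))
# Both channels hold an event; the head with the smaller timestamp is chosen.
print(input_rule_pick([deque([9]), deque([10])]))
# A source process has infinite lookahead, so the output rule never blocks it.
print(output_rule_allows(5, 1, math.inf))
```

An event with timestamp 10 is held back while the sender executes an event with timestamp 6 and lookahead 3, and is released once the sender starts an event with timestamp 8 and the same lookahead.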
Two types of control messages are introduced in the Chandy-Misra protocol: end-of-simulation (eos) messages and null messages. The eos messages are used to terminate the parallel simulation. After a source process has generated the last event, it sends an eos message to each of its output channels and enters the termination state. An eos message has timestamp $\infty$ and lookahead value $\infty$. When a server process $p_i$ executes an eos message, it generates and sends an eos message to each of its output channels. Then $p_i$ enters the termination state. When a sink process executes an eos message, it simply enters the termination state. All processes eventually enter the termination state and the parallel simulation terminates. Note that after a process enters the termination state, it never becomes active again. Fig. 2 illustrates the execution of the Chandy-Misra simulation for the event precedence graph in Fig. 1(b).
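The termination procedure can be sketched as a propagation of eos messages through the network. The code below is a simplified illustration with two loudly stated assumptions: the topology $p_1 \to \{p_2, p_3\} \to p_4$ is hypothetical, and a process is taken to execute eos only after an eos message has arrived on every one of its input channels (a simplified reading of the input waiting rule for messages with timestamp $\infty$). The function name `eos_termination_order` is hypothetical, and the sketch covers acyclic networks only.

```python
from collections import deque

def eos_termination_order(outputs, sources):
    """Return the order in which processes enter the termination state.

    outputs: {process: [downstream processes]} for every process in the network.
    sources: source processes, which start the eos propagation.
    """
    inputs_left = {p: 0 for p in outputs}
    for outs in outputs.values():
        for q in outs:
            inputs_left[q] += 1

    order, ready = [], deque(sources)
    while ready:
        p = ready.popleft()
        order.append(p)                  # p enters the termination state
        for q in outputs[p]:             # p sends eos on each output channel
            inputs_left[q] -= 1
            if inputs_left[q] == 0:      # eos has arrived on every input of q
                ready.append(q)
    return order

# Hypothetical network: one source, two servers, one sink.
network = {"p1": ["p2", "p3"], "p2": ["p4"], "p3": ["p4"], "p4": []}
order = eos_termination_order(network, ["p1"])
print(order)
```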
We assume that the message sending delays are 0 in the Chandy-Misra simulation. We further assume that the lookahead values for $e_2$, $e_4$, $e_6$, $e_8$, $e_9$, $e_{10}$, $e_{11}$, and $e_{12}$ are 3. Note that the lookahead values for $e_1$, $e_3$, $e_5$, and $e_7$ are $\infty$ because they are executed by the source process $p_1$. After $p_1$ has executed an event, the newly scheduled events are sent to the destinations immediately (cf. $e_2$ and $e_3$). Process $p_3$ ($p_2$) has only one input channel. According to the input waiting rule, an arrival event is executed immediately if $p_3$ is idle (cf. $e_4$) or is executed after $p_3$ has executed the previous event (cf. $e_6$). According to the output waiting rule, $e_{10}$ is not sent to $p_4$ until time 8, i.e., when $p_3$ starts executing $e_8$. Note that $e_6.ts + \epsilon(e_6) = 6 + 3 < e_{10}.ts = 10 < e_8.ts + \epsilon(e_8) = 11$. Event $e_9$ arrives at $p_4$ at time 4. However, its execution is delayed until $e_{10}$ arrives (due to the input waiting rule). The eos messages $z_{1,2}$ and $z_{1,3}$ are sent from $p_1$ to $p_2$ and $p_3$, respectively. These messages arrive at their destinations at time 4 (the time when the execution of $e_7$ is completed) because there is no message sending delay. After $p_3$ has processed $z_{1,3}$ (the execution time for an eos message is 0), a new eos message $z_{3,4}$ is sent to $p_4$.
A null message provides only timing information. For example, after $p_i$ has executed an event $e$, it may send a null message $e'$, where $e'.ts = e.ts + \epsilon(e)$, to the output channel connected to $p_j$. When $p_j$ receives $e'$, it knows that it will never receive any message with timestamp less than $e'.ts$ from $p_i$. This information is used to reduce the overhead of the input waiting rule as well as to avoid deadlock [5]. In a Chandy-Misra simulation, deadlock may occur in a feedback loop. Consider the feedback network in Fig. 3. The initial events are generated by the source process $p_0$. At the beginning, $p_0$ sends an event message $e$ to process $p_1$. According to the input waiting rule, $p_1$ cannot handle $e$ before it receives a message from process $p_4$. Unfortunately, $p_4$ will not produce any output message before $p_1$ produces the first output message. Thus, processes $p_1$, $p_2$, $p_3$, and $p_4$ fall into a deadlock situation.
Two approaches have been proposed to resolve the deadlock situation. Deadlock avoidance [14] uses null messages to avoid deadlock. In Fig. 3, suppose that process $p_i$ (for $1 \le i \le 4$) has a constant lookahead value $\epsilon_i$ and its local clock $clk_i = 0$ initially. At the beginning of execution, $p_i$ sends a null message with timestamp $clk_i + \epsilon_i$ to the output channel. When the destination $p_j$ receives the null message, $p_j$ is essentially promised by $p_i$ that it will not send a message to $p_j$ carrying a timestamp smaller than $\epsilon_i$, and $clk_j$ is incremented from 0 to $\epsilon_i$. Then $p_j$ sends a null message with timestamp $clk_j + \epsilon_j = \epsilon_i + \epsilon_j$ to its output channel. After the null messages have circulated in the loop several times, $p_1$ eventually receives a null message with timestamp larger than $e.ts$. According to the input waiting rule, $p_1$ executes $e$ and the deadlock is avoided.
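The cost of deadlock avoidance in a feedback loop can be estimated by counting how many null messages circulate before the blocked event becomes executable. The following sketch is a simplification made for illustration: it follows only the chain of null messages started by $p_1$ (in the protocol every process sends an initial null message), and the function name `null_messages_until_unblocked` is hypothetical.

```python
def null_messages_until_unblocked(lookahead, event_ts):
    """Count null messages in the loop p1 -> p2 -> ... -> pn -> p1.

    lookahead: constant lookahead values [eps_1, ..., eps_n], one per process.
    event_ts:  timestamp of the event e waiting at p1.
    Returns (count, ts): the number of null messages sent before p1 receives
    one whose timestamp ts is larger than event_ts.
    """
    if sum(lookahead) <= 0:
        raise ValueError("the loop needs positive total lookahead to progress")
    n = len(lookahead)
    clock = 0       # local clock of the current sender; p1 starts at 0
    sender = 0      # index 0 stands for p1
    count = 0
    while True:
        ts = clock + lookahead[sender]   # timestamp carried by the null message
        count += 1
        sender = (sender + 1) % n        # the message arrives at the next process
        clock = ts                       # the receiver advances its local clock
        if sender == 0 and ts > event_ts:
            return count, ts

print(null_messages_until_unblocked([1, 1, 1, 1], 10))
print(null_messages_until_unblocked([3, 3, 3, 3], 10))
```

Small lookahead values make the null messages circulate many times, which is the protocol overhead that a critical path analysis alone does not capture.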
In deadlock recovery, no null messages are sent. A separate mechanism is used to detect when the simulation is deadlocked, and another mechanism is used to break the deadlock. Deadlock detection mechanisms are described in [15], [16], [17]. In the deadlock recovery mechanism, all processes cooperate to find the events with the smallest timestamp in the system. These events can be safely executed, and the deadlock situation is thus recovered.
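The recovery step amounts to a global minimum search over all unprocessed events. A minimal sketch follows, assuming the pending events of each process are available as a list of timestamps; the function name `safe_events` and the example data are hypothetical, and the detection mechanism itself is not modeled.

```python
def safe_events(pending):
    """Find the events that can be executed safely to break a deadlock.

    pending: {process: [timestamps of its unprocessed events]}
    Returns (min_ts, [(process, ts), ...]) for the smallest timestamp in the
    system, or (None, []) if no event is pending.
    """
    all_events = [(ts, p) for p, stamps in pending.items() for ts in stamps]
    if not all_events:
        return None, []
    min_ts = min(ts for ts, _ in all_events)
    return min_ts, [(p, ts) for ts, p in all_events if ts == min_ts]

pending = {"p1": [7, 9], "p2": [], "p3": [5], "p4": [5, 8]}
print(safe_events(pending))
```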
IV. A PARALLELISM ANALYZER
We first consider the parallelism analyzer for deadlock avoidance simulation. Then we extend the algorithm for deadlock recovery simulation.
Consider an existing sequential simulation program. We investigate the performance of the corresponding Chandy-Misra parallel simulation without actually implementing the parallel program. The idea is to insert some instructions (to be described) into the sequential simulation program. The inserted code computes the elapsed time of the corresponding Chandy-Misra parallel simulation along with the execution of the sequential simulation.
We assume that the sequential program follows the process interaction model [18], in which the simulated system is modeled by a set of objects. These objects can be directly mapped to the logical processes in parallel simulation. We assume that every process is executed by a dedicated processor (we thus use the terms "process" and "processor" interchangeably). This restriction is relaxed later in this section. The modified sequential simulation that performs parallelism analysis for the corresponding Chandy-Misra simulation is referred to as the parallelism analyzer. The process-to-processor mapping affects the performance of the parallel simulation. To study the process assignment problem, one may execute the parallelism analyzer with different mappings.
In the parallelism analyzer the eos messages and the null messages are also included to simulate the Chandy-Misra protocol. The parallelism analyzer generates an eos event for every downstream process of a source process $p_i$ after it processes the last event scheduled for $p_i$. When the parallelism analyzer processes an eos event for a server process $p_j$, it generates new eos events for the downstream processes of $p_j$. Suppose that an event $e$ occurs at process $p_i$, and its occurrence results in the scheduling of another event $e'$ for process $p_j$. When the parallelism analyzer processes $e$, it generally schedules a null message with timestamp $e.ts + \epsilon(e)$ for every downstream process of $p_i$. (In some implementations, no null message is sent to $p_j$. In other implementations, null messages may be sent by demand [17]. Our parallelism analyzer can easily be tailored to study implementations with different null message sending policies.)
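The null message policy described above can be sketched as a small hook called while the sequential simulation processes an event. This is an illustration under assumptions, not the authors' code: the event list is modeled as a binary heap ordered by timestamp, events are plain dictionaries, and the name `schedule_null_messages` and the field names are hypothetical. Other null message policies would change only the set of destinations passed in.

```python
import heapq
import itertools

_sequence = itertools.count()   # tie-breaker so heap entries never compare dicts

def schedule_null_messages(event_list, e_ts, lookahead, sender, downstream):
    """Schedule one null message with timestamp e.ts + lookahead(e) for every
    downstream process of the sender (the default policy described in the text)."""
    for destination in downstream:
        null_event = {"kind": "null", "ts": e_ts + lookahead,
                      "src": sender, "dst": destination}
        heapq.heappush(event_list, (null_event["ts"], next(_sequence), null_event))

# Hypothetical use: an event with timestamp 4 and lookahead 3 executed at p2.
event_list = []
schedule_null_messages(event_list, 4, 3, "p2", ["p4", "p5"])
print([(ts, ev["dst"]) for ts, _, ev in sorted(event_list)])
```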
Several data structures are used in the parallelism analyzer.
Every event $e$ is associated with a real number $e.\alpha$ which represents the (real) time when $e$ arrives at its destination (i.e., the process that executes $e$) in the corresponding Chandy-Misra simulation. Initially, $e.\alpha = 0$ (if $e$ is an event pre-scheduled at the beginning of the simulation) or $e.\alpha = \infty$ (otherwise; in this case, the arrival time of $e$ will be computed and assigned to $e.\alpha$ later).
For a channel directed from $p_i$ to $p_j$, a set $Q_{i,j}$ is used in the parallelism analyzer to hold all "floating" events sent from $p_i$ to $p_j$. The time when a "floating" event is to be executed in the Chandy-Misra simulation has not yet been determined. When an event is processed in the sequential simulation, the parallelism analyzer inserts the event into the corresponding $Q_{i,j}$. This event is removed from $Q_{i,j}$ after its execution time in the Chandy-Misra simulation has been determined.
At Line 2, $e.E$ represents the set of events scheduled due to the execution of $e$. Every event $e' \in e.E$ is inserted in $L$ in the timestamp order. The time when $e'$ arrives at its destination in the Chandy-Misra simulation is not determined, and $e'.\alpha$ is assigned the value $\infty$. At Line 3, $e$ is inserted in $Q_{j,i}$.
In other words, $e$ is sent from $p_j$ to $p_i$ in the Chandy-Misra simulation (however, its arrival time may not be determined at this moment). In the loop Lines 4-10, the parallelism analyzer tests whether the time when an event $e \in Q_{m,n}$ (for some $m$, $n$) becomes available for execution in the Chandy-Misra simulation (i.e., the time when $e$ satisfies the input waiting rule) is known. If so, the time when the execution of $e$ is completed in the Chandy-Misra simulation is computed and assigned to $t_n$. The parallelism analyzer also tests whether any event $e' \in O_n$ satisfies the output waiting rule. If so, the time when $e'$ arrives at its destination is computed and assigned to $e'.\alpha$.
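At Line 4, the event $\theta_n$ is defined as follows: Let $I_n$ be the set of processes that may schedule events (i.e., send messages) to $p_n$, and let $e_{m,n}$ be the event with the smallest timestamp in $Q_{m,n}$. Suppose that the following two conditions are satisfied in the parallelism analyzer:
a. For every $p_m \in I_n$, $Q_{m,n} \ne \emptyset$.
b. For every $p_m \in I_n$, $e_{m,n}.\alpha < \infty$ (i.e., the arrival time of $e_{m,n}$ in the Chandy-Misra simulation is determined).
Then $\theta_n$ is the event with the smallest timestamp among $e_{m,n}$, for all $p_m \in I_n$.
[Fig. 5: The parallelism analyzer (more than one process may be mapped to a processor).]
The execution of the parallelism analyzer in Fig. 4 for the event precedence graph in Fig. 1(b) is described in Appendix B.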
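The selection of $\theta_n$ can be sketched directly from its definition in the notation summary (Appendix C): every set $Q_{m,n}$ with $p_m \in I_n$ must be nonempty, the smallest-timestamp event $e_{m,n}$ of each set must have a finite arrival time, and $\theta_n$ is then the smallest-timestamp event among these heads. The code below is a minimal sketch; the `Event` class, the dictionary layout of `I` and `Q`, and returning `None` when $\theta_n$ is undefined are assumptions made for illustration.

```python
import math
from dataclasses import dataclass

@dataclass
class Event:
    name: str
    ts: float                  # timestamp e.ts
    alpha: float = math.inf    # arrival time e.alpha; inf means "not yet determined"

def theta(n, I, Q):
    """Return theta_n, or None when condition (a) or (b) does not hold.

    I[n]:      processes that may send messages to p_n.
    Q[(m, n)]: floating events sent from p_m to p_n.
    """
    heads = []
    for m in I[n]:
        floating = Q.get((m, n), [])
        if not floating:                           # condition (a) fails
            return None
        head = min(floating, key=lambda e: e.ts)   # e_{m,n}
        if head.alpha == math.inf:                 # condition (b) fails
            return None
        heads.append(head)
    return min(heads, key=lambda e: e.ts) if heads else None

# Hypothetical state: p4 receives from p2 and p3.
I = {4: [2, 3]}
e9, e10 = Event("e9", 9, 4.0), Event("e10", 10)
print(theta(4, I, {(2, 4): [e9], (3, 4): [e10]}))   # arrival of e10 still unknown
e10.alpha = 8.0
print(theta(4, I, {(2, 4): [e9], (3, 4): [e10]}).name)
```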
The parallelism analyzer in Fig. 4 can easily be generalized to the case when more than one process is executed by a processor. Suppose there are $K$ processors available for the parallel simulation, where $K \le N$. Let $P_k$ be the set of processes mapped to processor $k$. The modified parallelism analyzer is shown in Fig. 5. In this algorithm, variable $t_i$ represents the progress of the processor $i$ (where $1 \le i \le K$) and in Lines 5-8, $t_n$ (in Fig. 4) is replaced by $t_k$, where $p_n \in P_k$. Fig. 5 is exactly the same as Fig. 4 except that the elapsed times considered in the Chandy-Misra simulation are for processors.
Our algorithm can also be extended to study the Chandy-Misra deadlock recovery protocol. To compute the execution time of a deadlock recovery simulation, the algorithm in Fig. 5 is modified as follows:
No null messages are created in the parallelism analyzer.
In ... where $\Delta$ is the elapsed time to detect and break the deadlock (i.e., to find the event $e$). If $e$ is sent from process $i$ to process $j$, and process $j$ is mapped to processor $k$, then $Q_{i,j} \leftarrow Q_{i,j} - \{e\}$, and $t_k = t_k + \eta(e)$.
$1 \le i \le N$
From the above discussion, the number of deadlocks occurring in the simulation can also be computed.
V. SUMMARY
This paper proposed a technique to evaluate the time complexity of the Chandy-Misra parallel simulation protocol for an existing sequential simulation program. The idea is to insert parallelism analysis code into the sequential simulation program. When the modified sequential program is executed, the time complexity of the corresponding Chandy-Misra parallel simulation is also computed.
We described the parallelism analysis algorithm, and proved that the algorithm is correct. The algorithm assumes that the sequential simulation follows the process interaction model. This assumption does not restrict the applicability of our technique, because most modern simulation programs are implemented by object-oriented languages or environments that follow the process interaction model.
Our technique has been proven to be useful in several large-scale industrial applications. For example, a Signaling System 7 (SS7) network simulation was implemented by a sequential simulation technique that is adequate for small-scale networks. Because the number of new customers and services grows rapidly in a telephone network, it is important to study the performance of a scaled-up SS7 network. Since the SS7 network is very complex, it is difficult to execute a sequential simulation within a reasonable elapsed time when the network size is increased. A natural alternative for speeding up the simulation process is to use parallel simulation techniques. However, re-implementation of an SS7 simulation on a parallel platform is very expensive. It is necessary to investigate the performance of parallel simulation to determine whether it is cost effective. With our technique, the performance of Chandy-Misra parallel simulation can be determined without actually implementing the parallel program.
ACKNOWLEDGMENT
The authors would like to thank D. DeGroot and the three reviewers for their comments.
APPENDIX

A. Proof of the Parallelism Analysis Algorithm
The correctness of the parallelism analyzer is proved by means of three lemmas and one theorem.
For the $i$th event $e_{m,n}(i)$ sent from process $p_m$ to $p_n$, the parallelism analyzer computes a value $t_{m,n}(i)$.
Lemma 1: ... $(i).ts < e_{m,n}(j).ts$.
Lemma 1 is used in Lemma 2. Lemma 2 consists of three parts. Part (a) proves that the value $e.\alpha$ computed in the parallelism analyzer is the time when event $e$ arrives at ...
... $\theta_n(i).ts + \epsilon(\theta_n(i))\}$
Thus hypothesis (a) holds.
B. An Example of Parallelism Analysis
This appendix illustrates the execution of the parallelism analyzer in Fig. 4 for the event precedence graph in Fig. 1(b). We note that $\eta(e_1) = \eta(e_3) = \eta(e_5) = \eta(e_7) = 1$, the execution times and the lookahead values of the other events are 3 (see Sections II and III), and the message sending delay times are 0. Event $e_1$ is executed in the first iteration. Since $e_1$ is a prescheduled event, $e_1.\alpha = 0$. At the end of Line 3, $Q_{1,1} = \{e_1\}$ and $L = \{e_2, e_3\}$. At this point, we know the arrival time of $e_1$ in the Chandy-Misra simulation, as illustrated in Fig. 6(a).
In this figure, a pair $(t_i, ts_i)$ is associated with every process $p_i$, where $t_i$ represents the (real) time of $p_i$ in the Chandy-Misra simulation and $ts_i$ represents the local clock of $p_i$ (i.e., the timestamp of the event being executed by $p_i$) at time $t_i$. In the figures, a link directed from $p_i$ to $p_j$ is labeled by an event $e$ if and only if $e.\alpha < \infty$ and $e \in Q_{i,j}$. This implies that the parallelism analyzer "simulates" the arrival of $e$ at $p_j$ in the Chandy-Misra simulation. At Line 4, $\theta_1 = e_1 \ne \phi$. Since $\epsilon(e_1) = \infty$, $\eta(e_1) = 1$, $e_1.E = \{e_2, e_3\}$, and $\delta(e_2) = \delta(e_3) = 0$, at the end of Line 10, we have $t_1 = 1$, $ts_1 = 1$, $e_2.\alpha = e_3.\alpha = 1$, and $Q_{1,1} = \emptyset$ (cf. Fig. 6(b)). Table I illustrates the execution of events $e_2$, $e_3$, $e_4$, and $e_5$. The column for Line 4 of iteration $e_2$ illustrates the related variable values after Line 4 is executed at $e_2$'s iteration. The last item "Figure" points to the figure illustrating the execution. Note that in Fig. 6(d), the notation ($e_9$) under $p_2$ means that $e_9$ is an event scheduled by $p_2$, which has not been sent to the destination in the Chandy-Misra simulation. Table II illustrates the execution of $e_6$, $e_7$, $e_8$, and $e_{12}$. The presentations for iterations $e_9$, $e_{10}$, and $e_{11}$ are similar and hence are omitted. In iteration $z_{1,2}$, the while loop in Lines 4-10 is executed twice. In the first iteration, $\theta_2 = z_{1,2}$, and in the second iteration, $\theta_4 = e_9$. Line 10 of iteration $z_{1,2}$ in Table III shows the variable values at the end of the second iteration.
In iteration $z_{2,4}$, the while loop in Lines 4-10 is executed three times. The first iteration handles $e_{10}$, and at the end of Line 10, $(t_4, ts_4) = (14, 10)$, as shown in Fig. 7(a). Line 10 of iteration $z_{2,4}$ in Table III shows the results at the end of the third iteration.
At 
C. Notation
This section summarizes the notation used in this paper. Entries whose symbol could not be recovered are listed by description only.
$\delta(e)$: The message transmission delay of event $e$ in the Chandy-Misra simulation.
$\epsilon(e)$: The lookahead value of $e$.
$e$: An event.
$e_{j,i}$: The event with the smallest timestamp in $Q_{j,i}$.
$e_{m,n}(i)$: The $i$th message sent from $p_m$ to $p_n$.
$e.\alpha$: The time when $e$ arrives at its destination process in the Chandy-Misra simulation (if the value is finite).
$e.ts$: The timestamp of event $e$.
$e.E$: The set of events scheduled due to the execution of $e$.
$\eta(e)$: The execution time of event $e$.
$\theta_n$: The event with the smallest timestamp among $e_{m,n}$, for all $p_m \in I_n$, such that the following two conditions are satisfied: a. For every $p_m \in I_n$, $Q_{m,n} \ne \emptyset$. b. For every $p_m \in I_n$, $e_{m,n}.\alpha < \infty$ (i.e., the arrival time of $e_{m,n}$ in the Chandy-Misra simulation is determined).
$\theta_n(i)$: The value of $\theta_n$ at the $i$th time when Loop Lines 4-10 in Fig. 4 is executed.
$g_e$: An event such that its execution schedules event $e$ (i.e., $e \in g_e.E$).
$I_i$: The set of processes that may schedule events to $p_i$.
$K$: The number of processors available in the parallel simulation ($K \le N$).
$L$: The event list for the sequential simulation.
$N$: The number of processes.
$O_i$: (The set of) the events generated at $p_i$ such that their departure times in the Chandy-Misra simulation have not been decided in the parallelism analyzer.
$O_n(i)$: The value of $O_n$ for the $i$th time when Line 10 in Fig. 4 is executed.
$p_i$: A process.
$P_k$: The set of processes mapped to processor $k$.
$Q_{j,i}$: (The set of) the events sent from $p_j$ to $p_i$ such that the start execution times of these events in the Chandy-Misra simulation have not been determined by the parallelism analyzer.
The time when event $\theta_n$ is available for execution in the Chandy-Misra simulation.
$t_i$: The progress (i.e., real time) of process $p_i$.
$t_{m,n}(i)$: The value of $t_n$ when $\theta_n = e_{m,n}(i)$ at Line 4 in Fig. 4.
The value of $t_n$ at the $i$th time after Line 4 (in Fig. 4) but before Line 5 is executed.
The value of $t_n$ at the $i$th time when Line 6 in Fig. 4 is executed.
The value of $t_n$ at the $i$th time when Line 7 in Fig. 4 is executed.
$ts_i$: The local clock of $p_i$ at (real) time $t_i$ in the Chandy-Misra simulation.
The execution time of the Chandy-Misra simulation.
$z_{j,i}$: The end-of-simulation (eos) message sent from $p_j$ to $p_i$ in the Chandy-Misra simulation.
