Discrete event simulation may be used to model many physical systems, such as digital hardware, queueing networks, telephone networks, simulated warfare, and banking transactions. Where the entities of a physical system execute independently and interact asynchronously, an asynchronous distributed event-driven simulation algorithm may enable the simulation of the system to execute on a parallel processor. This has the potential to significantly reduce the total simulation time. YADDES is the first algorithm that is characterized by (i) acceptable performance, (ii) freedom from deadlock, and (iii) provable correctness, for circuits where the interactions between the entities constitute a cyclic dependence. Given their complex, asynchronous nature, an important issue associated with all asynchronous distributed algorithms is their correctness, i.e., the generation of accurate output for a given input stimulus under all possible conditions. This paper presents a mathematical proof of correctness and reports the performance of YADDES on the Armstrong parallel processor at Brown University.
Introduction
The concept of simulation implies mimicking a physical system and its constituent entities with the intent of verifying correctness, identifying errors, and generating performance estimates prior to fabricating a prototype of the physical system. Examples of physical systems that are usually simulated extensively include digital hardware designs, industrial control circuits, and aircraft. Where the activities of the entities constituting the physical system are concentrated at regular intervals of time and the interval is decided upon or known a priori, time-based simulation techniques may render the simulation process efficient. Where the activities of the physical system are distributed irregularly in time, such as in digital hardware, queueing networks, and banking transactions, discrete-event simulation techniques apply. In discrete-event simulation, the simulation models representing the entities of the physical system remain idle except when excited by an external stimulus. In addition, only changes in a model's response are propagated to other models that are connected to its output.
For many physical systems, a directed graph corresponding to the interactions among the entities of the system may assume the form of a cyclic graph. Examples of physical systems with cyclic graphs include all digital hardware designs with feedback, industrial control systems with negative feedback, oscillators, sets of queueing networks interconnected in a closed loop, and the Bariloche model [13] of the world. A well-known natural system with cyclic dependence is the food chain. The possibility of achieving faster simulation through the concurrent execution of the simulation models on parallel processors provides the motivation for studying asynchronous distributed techniques in discrete event simulation.
The literature reports three principal techniques for distributed discrete event simulation, namely synchronous, rollback, and asynchronous. Fujimoto [16] presents a state-of-the-art survey on the execution of simulation models on parallel processors. In the synchronous approach [1], a processor is designated as a centralized controller that is responsible for allocating all other entities to the processors of the parallel processor system and initiating their executions. In addition, the controller resynchronizes all processors at the end of every activity. The approach is limited in that the processors must resynchronize at the end of every activity even in the absence of data dependency, and in that an uncertainty is associated with the completion of message communication at the end of an activity. In the rollback mechanism [2], the state of the entire system is saved periodically so that the simulation system may roll back to a previous state in the event of an error caused by processing messages out of order. In the absence of information regarding a signal at an input port, a model assumes that the signal value at that input port has remained unchanged, and the results of execution based on this assumption are propagated to subsequent models. If a message that contradicts the previous assumption is subsequently received by the component, new results in the form of anti-messages are propagated to subsequent models. The limitations of the rollback mechanism include the significant storage required for periodically saving the state of the entire simulation and the lack of certainty that an erroneous message will be canceled by the appropriate anti-message. The asynchronous discrete-event simulation mechanism [3,4,14] permits every simulation model to execute independently in the absence of explicit data dependency and has the potential to utilize maximum parallelism. Unlike the deadlock recovery approach [5], YADDES, detailed in [14], is free from deadlock.
Additionally, YADDES does not share the inefficiency inherent in the approach in [3] for systems with complex cyclic dependence graphs.
In the remainder of the paper, section 2 reviews the YADDES algorithm while section 3 presents a proof of correctness. Section 4 details the performance of and discusses the limitations of the YADDES algorithm. Finally, section 5 concludes the paper.
Yet Another Approach to Asynchronous Distributed Discrete-Event Simulation
While the YADDES approach is presented in detail in [14], a brief description is presented here to facilitate the understanding of the proof of correctness. For a given circuit containing feedback loops, first a feedback arc set [9] $S = \{E_1, E_2, \ldots, E_n\}$ of the directed graph corresponding to the digital design is identified such that the graph is rendered acyclic following the removal of all of the edges $E_1$ through $E_n$. The correctness of the approach is not contingent on the identification of the minimal feedback arc set, which is difficult and time consuming. However, the minimal feedback arc set may imply improved performance.
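Since correctness does not require a minimal feedback arc set, any set of edges whose removal breaks every cycle will serve. The following Python sketch is one simple way to obtain such a set, by collecting the back edges of a depth-first search. It is offered only as an illustration: the function name `feedback_arc_set` and the adjacency-list representation are assumptions made here, and this is not claimed to be the extraction procedure of [9] or [14].

```python
from typing import Dict, Hashable, List, Set, Tuple

Edge = Tuple[Hashable, Hashable]


def feedback_arc_set(graph: Dict[Hashable, List[Hashable]]) -> Set[Edge]:
    """Return the back edges found by a depth-first search.

    Removing these edges leaves the graph acyclic. The set is valid but
    not necessarily minimal. Recursion is used, so this sketch is meant
    for small illustrative graphs only.
    """
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the DFS stack, finished
    colour = {node: WHITE for node in graph}
    for successors in graph.values():
        for succ in successors:
            colour.setdefault(succ, WHITE)
    back_edges: Set[Edge] = set()

    def visit(node: Hashable) -> None:
        colour[node] = GREY
        for succ in graph.get(node, []):
            if colour[succ] == GREY:
                # succ is an ancestor on the current path: (node, succ) closes a cycle
                back_edges.add((node, succ))
            elif colour[succ] == WHITE:
                visit(succ)
        colour[node] = BLACK

    for node in list(colour):
        if colour[node] == WHITE:
            visit(node)
    return back_edges


# Two-component loop: A drives B and B feeds back into A.
loop = {"A": ["B"], "B": ["A"]}
print(feedback_arc_set(loop))
```

For the two-component loop the search started at A reports the edge from B back to A, which corresponds to choosing the feedback edge of an AND-inverter loop as the single member of S.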
For each $E_i$, $\forall\, i \in \{1, 2, \ldots, n\}$, in the original directed graph, a new acyclic directed graph is constructed by replacing $E_i$ with two unconnected edges $E_i^{in}$ and $E_i^{out}$, as illustrated in Figure 1. Figure 1a depicts a cyclic circuit consisting of a two-input AND gate A whose output is connected through edge $E_2$ to the input of the inverter B. The output of the inverter B is connected through edge $E_1$ to an input of A. The other input port of A is edge $E_3$. Assume that the feedback arc set for the circuit is given by $S = \{E_1\}$. The graph is rendered acyclic in Figure 1b by removing $E_1$ and replacing it with $E_1^{in}$ and $E_1^{out}$, associated with the input of A and the output of B respectively. Next, a data-flow network is synthesized by connecting two identical copies of the acyclic circuit through a crossbar switch. The two acyclic circuits to the left and right of the crossbar switch are referred to as primed and unprimed respectively. The entities in the data-flow network corresponding to the primed and unprimed circuits are referred to as pseudo primed ($X'$) and unprimed ($X$) components respectively, where X refers to the corresponding simulation model. Every input port of a pseudo component $X'$ that has a label of the form $E_i^{in}$ is permanently held at a very large number represented by the symbol $\infty$. For the circuit in Figure 1a, the corresponding data-flow network is shown in Figure 2. [...] are connected to the external path $E_3$. Associated with $E_3$ are the externally applied signal transitions. Conceptually, these transitions may be included in the event list of model A, i.e., the list of outstanding input transitions of A, and $E_3$ may be held at $\infty$. Furthermore, conceptually the output port $E_i^{out}$ of B is unconnected. However, in the current implementation of YADDES, the output of B is connected to a special entity "P1" signifying that it is the rightmost boundary of the data-flow network.
As with any distributed simulator, the signal transitions received at the inputs of a simulation model are stored in an event list for that model. The head of the list refers to the transition with the smallest value of simulation time. This is also referred to as the event of the model, and the value of its simulation time is represented by $U_X$. In the YADDES approach, every event of a model may be accessed by the corresponding primed and unprimed pseudo components. The YADDES simulation environment consists of two major elements: the simulation circuit and the data-flow network. The simulation circuit consists of executable models corresponding to each component of the circuit, and the flow of signals between the models is represented through messages over communication protocols. Signal transitions received at the input ports are executed by the models, and any output transitions generated as a consequence of execution are propagated to other models connected to the output port. The decision regarding the precise execution of an event, however, is generated by the constituents of the data-flow network. The primed and unprimed pseudo components execute concurrently and asynchronously with respect to one another and the simulation models. However, in the current implementation of YADDES, they are executed round-robin by a processor. The execution of a pseudo component is initiated either by the corresponding model or by the propagation of a new W (or $W'$) value at an input port by other pseudo components.
The quantities $U_X$, $W'_X$, and $W_X$ are formally defined as follows.
Definition of U: Associated with every simulation model X is a collection of events, i.e., transitions that have been received at its input ports, propagated from other models as messages. The events are ordered in increasing order of their simulation times in an event list, and they may be ultimately executed by the model X. At any instant, $U_X$ is equal to the simulation time of the event at the head of the list, i.e., the event with the smallest value of simulation time. Where the list is empty, the value of $U_X$ is considered equal to $\infty$. Initially, every $U_X$ $\forall\, X$ in the simulation circuit is set to $\infty$.

Definition of W: A mathematical quantity $W_X$ is associated with the output port of every unprimed pseudo component X in the data-flow network. Formally, $W_X$ is computed through the function $W_X = \min(U_X + d,\ W_1 + d,\ \ldots,\ W_n + d)$, where $W_1, \ldots, W_n$ refer to the W (or $W'$) values at the input ports $1, \ldots, n$ of X and $d$ refers to the propagation delay of model X. In some cases, a $W'$ value may be involved in the computation of a W value. $W_X$ represents an accurate measure of the simulation time at which the next event is expected to assert at the output of model X. To preserve the correctness of simulation, i.e., the proper order of execution of events, no message with a simulation time $t < W_X$ may be sent by model X at its output port following the possible propagation of a message with simulation time $t = W_X$. Initially, every $W_X$ $\forall\, X$ is set to 0, implying that it is not yet influenced by any event. Simulation is considered complete when $W_X$ and $U_X$ $\forall\, X$ are identical to $\infty$.
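The two definitions may be made concrete with a small Python sketch. The class `EventList` and the function `w_value` are names introduced here for illustration only; they are not taken from the YADDES implementation, and the infinite value is represented by `math.inf`.

```python
import heapq
import math
from typing import Iterable, List


class EventList:
    """Event list of one simulation model, ordered by simulation time."""

    def __init__(self) -> None:
        self._heap: List[float] = []

    def add(self, time: float) -> None:
        heapq.heappush(self._heap, time)

    def pop_head(self) -> float:
        return heapq.heappop(self._heap)

    @property
    def u(self) -> float:
        # U is the time of the head event, or infinity when the list is empty.
        return self._heap[0] if self._heap else math.inf


def w_value(u: float, input_ws: Iterable[float], delay: float) -> float:
    """W_X = min(U_X + d, W_1 + d, ..., W_n + d) for propagation delay d."""
    return min([u + delay] + [w + delay for w in input_ws])


events = EventList()
events.add(1000)
events.add(0)
print(events.u)                            # head of the list: the earliest event
print(w_value(events.u, [math.inf], 5))    # one input port held at infinity
```

With transitions at 0 and 1000 and a delay of 5, the sketch yields U = 0 and W = 5; once the list is empty and every input W is infinite, W is infinite as well, which matches the completion condition stated above.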
A description of the YADDES approach to asynchronous distributed discrete-event simulation is presented as follows. Given that YADDES is a distributed approach, the sub algorithms describing the operations of a simulation model, a primed pseudo component, and an unprimed component apply equally to all other respective entities in the system. Assume that a signal transition is asserted at an input port of simulation model X either by another simulation model or from the external world.
When this event is incorporated in the event queue of X, it may either alter $U_X$ or leave it unchanged. Where $U_X$ is altered, the simulation model X initiates the pseudo components X and $X'$ for execution [...]. Conceptually, pseudo components X and $X'$ may be initiated concurrently by the simulation model X. Also, multiple simulation models may be executed simultaneously due to signal transitions at their input ports. Consequently, the computations of the $W'$ and W values initiated by multiple pseudo components may overlap. Consistency and correctness are guaranteed because the computations involve a minimum operator and because the W value can never decrease. The issue of correctness is further addressed in a later section of this paper.
When both components X and X 0 have completed execution or where the signal transition asserted at an input port of model X does not alter its U value, the simulation model X sends an acknowledgement to the model that propagated the signal transition. If the signal transition was asserted externally, the acknowledgement implies that the transition is being processed and requires the external world to make available at that primary input port the subsequent signal transition.
Next, the simulation model X accesses the W or $W'$ values associated with each of the input ports of the corresponding unprimed pseudo component and computes their minimum, $K_X$. Where $U_X \neq \infty$, i.e., an event exists at X, and $K_X$ exceeds $U_X$, the model may execute the event corresponding to $U_X$. Where no new transitions are generated at the output port of model X following its execution, the event corresponding to $U_X$ is deleted from the event queue and a new $U_X$ reflects the time of the new event at the head of the event queue. Where the event list of model X is empty, $U_X$ is set to $\infty$. If a transition is generated at an output port as a consequence of execution of model X, it is propagated by X to other models that are connected to the output of X. Further execution of model X is suspended until it receives acknowledgements from each of the recipients. Then, the event corresponding to $U_X$ is removed from the event queue and a new $U_X$ is associated with the event at the head of the queue. The value of U is set to $\infty$ when the number of outstanding transitions at X is nil. The simulation model X then again initiates the pseudo components X and $X'$ for execution and suspends further activity until the pseudo components have completed execution. The process continues until all usable external signal transitions at the primary input ports are utilized to generate output transitions.
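The execution condition described above reduces to a single comparison. The following Python fragment is a minimal sketch of that test; the function name `can_execute` is hypothetical, and the treatment of a model with no input ports (K taken as infinity) is an assumption made here rather than something stated in the text.

```python
import math
from typing import Iterable


def can_execute(u_x: float, input_ws: Iterable[float]) -> bool:
    """Return True when the head event of model X may be executed.

    K_X is the minimum of the W (or W') values at the input ports.
    The event at time U_X may fire only if an event exists
    (U_X is finite) and K_X strictly exceeds U_X.
    """
    k_x = min(input_ws, default=math.inf)  # assumption: no inputs means K = infinity
    return u_x != math.inf and k_x > u_x


print(can_execute(0, [5, math.inf]))   # an event at t = 0 with K = 5
print(can_execute(10, [5]))            # K = 5 does not exceed U = 10
print(can_execute(math.inf, [7]))      # no outstanding event
```

The strict inequality matters: when K equals U, a message with the same simulation time may still arrive, so the event must wait.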
The precise functionalities of a representative simulation model and a corresponding pseudo component are expressed in Figures 3 and 4. The description in Figure 11 applies equally to both primed and unprimed components.

Simulation model X:
    read in events at input ports, from external ports or other components
    update event queue and order events according to time
    if (new event alters U value) {
        initiate pseudo components X and X'
        wait till done signal received from X and X'
        send acknowledgement to the sender of the event
    }
    else if (new event does not alter the U value) {
        send acknowledgement to the sender of the event
    }
    read W values at every input port of the simulation model X and compute the minimum K
    if ( [...]

Consider the example design in Figure 5, which consists of a NAND gate connected to an inverter through a feedback loop. The output of the NAND gate A is connected to the input of the inverter B, and the output of B is connected to the second input port of A. The other input port of A is primary, and a high to low transition is asserted at t = 0 ns, followed by a low to high transition at t = 1000 ns. The propagation delays of both A and B are 5 ns, and the initial values at the outputs of A and B are assumed to be 0 and 1 respectively. For the given signal transition at the primary input of gate A, the outputs of both A and B change and remain stable thereafter. When simulated by the conventional asynchronous distributed discrete event simulation algorithm [1], the gates A and B will deadlock as the signal values at the output ports do not change. [...] Consequently, the event 0↓ of A is simulated. The model A executes the transition and generates a low to high transition at t = 5 ns at its output. In Figure 6d [...] value of 1000, with the consequence that the event 1000↑ may be simulated. Although the entire design has stabilized and no new output values are generated, the data-flow network computes updated values of W and $W'$ that force the outstanding event, the external signal transition at t = 1000 ns, to be simulated.
Proof of Correctness of the Algorithm
The proof of correctness of the YADDES algorithm proposed in this paper requires the correct execution of the simulation models, execution of events in the correct order, absence of deadlock, and the termination of simulation in finite time. The execution of a simulation model implies the execution of the model description, and the issue of its correctness is limited by the accuracy of description of the simulation model.
Execution of Events in the Correct Order
Consider two events $E_1$ and $E_2$ with assertion times given by $t = t_1$ and $t = t_2$ ($t_2 > t_1$) respectively, associated with a simulation model X. $E_1$ and $E_2$ may be the consequence of either messages sent by sender models or signal transitions asserted at a primary input port of X by the external world. It is assumed that where a sender sends two consecutive messages on the same communication path at times $t = t_3$ and $t = t_4$ ($t_4 > t_3$) respectively to the same receiver, the latter is guaranteed to intercept the message at $t = t_3$ prior to that corresponding to $t = t_4$. The correct order of execution of events requires that X first execute $E_1$ and then $E_2$, and that following execution of any event $E_Z$ at $t = t_Z$, no other event $E_Y$ at $t = t_Y$ ($t_Y < t_Z$) be either received at or executed by X. In the YADDES approach, an event $E_m$ at $t = t_m$ is fired only when the minimum ($K_X$) of the W values at the input ports of X exceeds $t_m$. Consequently, the K value $K_X^1$ corresponding to the firing of $E_1$ must exceed $t_1$, and the K value $K_X^2$ corresponding to the firing of $E_2$ must exceed $t_2$.

$K_X^1 > t_1$ (1)

$K_X^2 > t_2$ (2)

By definition, where the W value at the output of a simulation model N is given by $W_N$ and the output port of N is connected to an input of X, N may send a future message at time $t = T$ along that interconnection path to X such that T must never be less than $W_N$.

$T \geq W_N$ (3)
It is subsequently shown that the W value associated with the output port of an unprimed pseudo component may only increase monotonically. Consider the example circuit in Figure 7a and the corresponding data-flow network in Figure 7b. [...] $T_R$ at a primary input port of a model is simulated, i.e., the corresponding model is executed, the $T_R$ value must increase given that the old transition is consumed and that a new transition with a greater assertion time is asserted at the appropriate model.
Where all transitions have been consumed, the T value at that primary input port is set to $\infty$. Consequently, between any two successive values of $W'_N$, none of the T values will decrease, and they may not cause a decrease in the value of $W'_N$. Next, consider that an event $U_J$ at $t = t_J$ of a simulation model J is executed. As a result of the execution, the output value of J may either remain unchanged or a transition may be asserted at the output port that will subsequently cause a new event at model (J+1) at $t = U_J + d_J$. For the first case, where the output signal value remains unchanged, the event $U_J$ at $t = t_J$ is removed from the event queue of model J and the queue is updated. Consequently, the new $U_J$ value will either refer to an event of J with greater assertion time or be set to $\infty$ when the event queue is empty. In either case, the contribution of the $U_J$ value in equation (4) will never be to decrease the value of $W'_N$. Corresponding to the second case, an output transition is generated at the output port as a result of the execution of model J, and it is expressed as a new event associated with the model (J+1) at $t = U_J + d_J$.
The YADDES algorithm requires that first the newly generated event at $t = U_J + d_J$ be asserted at model (J+1). The U value of model J is still held at $U_J$. The assertion time of the new event, $U_J + d_J$, of model (J+1) may be greater than, equal to, or less than that of the earlier event $U_{J+1}$ of model (J+1). Where $U_J + d_J$ is greater than or equal to $U_{J+1}$, the U value of the model (J+1) is unchanged and, therefore, the $W'_N$ value in equation (4) [...] (7). The second term in equation (6) is the contribution from model (J+1) and is observed to be identical to the first term in equation (5). Consequently, the $W'_N$ value in equation (6) is no smaller than that in (5). In addition, in equation (7), the first term is the contribution of model J and is larger than the first term in equation (5) [...] values at the inputs of the simulation models may only monotonically increase, the K values of the models must also monotonically increase. Following the execution of an event $E_1$ at $t = t_1$ where $K_X^1 > t_1$, a new event $E_2$ at $t = t_2$ ($t_2 < t_1$) may not arrive at that model. To see this, assume that event $E_2$ at $t = t_2$ ($t_2 < t_1$) does arrive at that model. $K_X^1$ is the minimum of all input W or $W'$ values. The cause of $E_2$ is a message at an input port that must obey the condition $t_2 \geq K_X^1$. Consequently, $t_1 > t_2 \geq K_X^1 > t_1$, which is a contradiction. As a result, events may not arrive at a model so as to cause out-of-order execution.
Since the K values associated with a model may only monotonically increase and because it was shown that an event may arrive at a model only in correct order and given that events in the event queue of a model are ordered according to increasing values of their assertion times, the correct order of execution of events is assured.
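The step from monotonically increasing input values to a monotonically increasing K value rests on an elementary property of the minimum operator. The following short derivation is a sketch in notation introduced here for illustration only: $w_i$ and $\hat{w}_i$ denote an earlier and a later W (or $W'$) value at input port $i$ of model X, and $K_X$ and $\hat{K}_X$ denote the corresponding minima.

```latex
\begin{align*}
&\text{Assume } w_i \le \hat{w}_i \text{ for every input port } i = 1, \ldots, n. \\
&K_X = \min_{1 \le i \le n} w_i \;\le\; w_j \;\le\; \hat{w}_j
  \quad \text{for every } j = 1, \ldots, n, \\
&\text{hence } K_X \;\le\; \min_{1 \le j \le n} \hat{w}_j = \hat{K}_X .
\end{align*}
```

The same argument applies to any quantity formed as the minimum of terms that never decrease, with a constant delay added to each term.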
Freedom from Deadlock
The principal characteristics of the algorithm that ensure freedom from deadlock may be expressed as follows. First, a simulation model initiates the execution of its primed and unprimed pseudo components in the data-flow network whenever its U value changes. The U value may change either when the model receives a new event at an input port or when the model updates its U value following the completion of execution of the most recent event. For the second scenario, assume that a model X of a simulation system deadlocks and has an outstanding unprocessed event $E_X$. $U_X$ is the assertion time of $E_X$. Assume also the absence of any other outstanding events in the system. Since $E_X$ cannot be executed, the K value of the model X must not exceed $U_X$, i.e.,

$K_X \leq U_X$ (8)

Given that there are no outstanding events in the system, the U value of every pseudo component must be $\infty$ except those of X and $X'$ [...] (9). Equations (8) and (9) contradict each other and, consequently, the assumption that deadlock occurs is false. A proof may be constructed similarly for the scenario where the system contains multiple outstanding events. First, choose the event with the smallest value of assertion time and apply a proof similar to the one above to show that deadlock may not occur for this event. Then, choose the event with the next larger assertion time value to show that it may not deadlock, and continue the process for the other events successively. Consequently, deadlock does not occur in the YADDES approach.
Termination of Simulation
A simulation system terminates when the number of outstanding events is nil and, consequently, every $W'$, W, and U value is $\infty$.
Assume that a system with a finite number of externally applied transitions never terminates. Since it was proved that a system may not deadlock, it must therefore execute continuously. The execution of a model involves receiving messages from or transmitting messages to other models, which requires finite time; execution of the model description, which must terminate in finite time; and the computation of the W and $W'$ values in the data-flow network, which must also terminate in finite time because the computations are unidirectional and limited by the finite number of pseudo components in the data-flow network. Consequently, models must continuously receive incoming messages as events. Since the K and W values increase monotonically, subsequent events must be associated with increasing assertion times. In the expression for any W or K, the external events appear along with events that may have been generated by the simulation models. Since the minimum operator is involved in the computation of W or K, the assertion times of externally asserted transitions must eventually approach $\infty$, which is contradictory. Consequently, the simulation system must terminate in finite time.
Performance of YADDES
While a typical digital system may consist of a few subcircuits that are purely combinational and the remaining subcircuits sequential, the use of YADDES is necessary only for the sequential subcircuits. Combinational subcircuits may be simulated utilizing the algorithm detailed in [3]. In this investigation, a generalized simulator is synthesized from YADDES and the approach in [3] for simulating any digital system. This generalized simulator is implemented on the ARMSTRONG I [5] parallel processor system at LEMS, Brown University. ARMSTRONG I is a loosely-coupled parallel processor consisting of 48 MC 68010 processors that may be reconfigured by the user. For the purpose of this investigation, ARMSTRONG I is configured as a 5-dimensional hypercube. Each processor is approximately 5.5 times slower than a SUN 3/60 workstation, and the average time required for a guaranteed node-to-node message communication of any size between 4 bytes and 1 Kbyte is approximately 10 milliseconds.
As detailed in the implementation section of [14], each processor of the Armstrong system accepts a unique input file that represents information on the simulation models and pseudo components, where applicable, and their interconnections for the corresponding partition. In the most recent implementation, the input files are generated by a preprocessor, "ifigenia" [15], that accepts a description of the circuit in the hardware description language ESL [12] together with the user-specified partitions and feedback arc set. Efficient algorithms exist in the literature to extract the feedback arc set. The ifigenia preprocessor is a standard uniprocessor C program that is approximately 4500 lines in length.
The current ARMSTRONG configuration is limited to a maximum of 30 open connections, including protocols and file pointers. Additionally, each processor has access to only 370 Kbytes of main memory. Thus, the most distributed instance of simulation in this investigation is limited to 16 partitions of the SN74181 arithmetic logic unit that executes on 18 processors. The executable code for the simulator is compiled on a SUN 3/60 workstation under the "O4" optimization directive, and the execution time is determined through the use of timers associated with each ARMSTRONG node. A total of five example circuits are simulated: three are purely combinational while the remaining two are sequential circuits. For each circuit, performance statistics are collected from several simulation runs by varying the number of input vectors, the number of processors, and the "weight" factor. In general, asynchronous distributed algorithms are most efficient where the simulation models require extensive computing and relatively less frequent communication between themselves. Although this generalized simulator is tested with gate-level circuits in this investigation, the principal goal is to simulate complex and large models on multiple processors to study the algorithm's efficiency. While the number of processors is varied between 1 and 16, the number of vectors chosen varies from 50 to 150. The "weight" factor is a deliberately introduced delay in the execution process and is designed to mimic the longer execution times that would be associated with complex simulation models. It is expected that behavior models may require between 1 ms and 10 ms to execute on an ARMSTRONG processor. The results of every simulation, i.e., the output values, are verified through comparison against similar results obtained from executing the Bell Labs simulator ESIM [17]. Speedup factors are computed from the performance statistics collected from the simulation runs.
For a simulation run executed on N processors, the speedup factor is defined as the ratio of the CPU time required for executing the simulation on a uniprocessor to that on N processors. For the first four example circuits, a simulation run corresponding to a single partition is performed on a single ARMSTRONG processor. For the SN74181 circuit, a single partition is too large for a single ARMSTRONG processor. As a result, for the purpose of comparison, the single partition case for the SN74181 circuit is executed by ESIM on a SUN 3/60 workstation. This demonstrates an additional advantage of distributed simulation in that while large circuits can neither be loaded into a uniprocessor nor simulated, they may be simulated by the generalized distributed simulator through appropriate partitioning on a parallel processor. It may be further observed that, in the discipline of simulation, there are no standard benchmark sequential circuits that may be utilized to measure the performance of the algorithm.
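The speedup definition above is a simple ratio, shown here as a Python sketch for concreteness. The function name `speedup_factor` and the timing values in the usage line are hypothetical and are not measurements from this investigation.

```python
def speedup_factor(uniprocessor_time: float, n_processor_time: float) -> float:
    """Ratio of uniprocessor CPU time to CPU time on N processors."""
    if n_processor_time <= 0:
        raise ValueError("execution time on N processors must be positive")
    return uniprocessor_time / n_processor_time


# Hypothetical timings in seconds, chosen only to illustrate the ratio.
print(speedup_factor(26.0, 10.0))
```

Both times must be measured in the same units and on comparable processors for the ratio to be meaningful, which is why the uniprocessor baseline for the SN74181 circuit is treated separately in the text.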
The first example circuit, shown in Figure 8a, consists of eight components organized through three dependent feedback loops. A speedup factor of 2.6 is observed for three processors. Thereafter, an increase in the number of processors does not yield a commensurate increase in speedup, thereby implying that the strong dependency of the three loops limits the inherent parallelism. The second circuit is constructed by adding deliberate feedback links to a two-bit full adder circuit (SN7482). A maximum of four processors yields a speedup factor of 3.2. The third example circuit is a purely combinational 8-1 multiplexer (SN74251) that generates a speedup factor of 6.1 with a maximum of eight processors. The fourth example circuit is a two-bit adder (SN7482) that yields a maximum speedup factor of 5.1 with a maximum of eight processors. Finally, the ALU circuit is simulated with up to a maximum of 16 processors to generate a high speedup of 12.5. Given the hardware and software limitations of ARMSTRONG I, this investigation is unable to execute simulations of larger circuits with a greater number of partitions at this time. The reader may note that the principal idea underlying these performance measurements is to prove the generalized simulator as an asynchronous, distributed, circuit-partitioning based algorithm and not to produce an industrial grade simulator. Also, the current implementation is not optimized for performance. Thus, the results presented in this paper are open to further improvement. The performance graphs for the five example circuits are shown in Figures 9 through 13.
Limitations of YADDES
For an intuitive understanding of why and where YADDES delivers superior performance, consider a circuit that executes into a deadlock during a straightforward distributed simulation. In the Chandy-Misra approach [5], upon executing into a deadlock, first a distributed deadlock detection algorithm is executed to detect and recover from deadlock, and then simulation continues until it hits a subsequent deadlock. In the approach [1,3] that utilizes timestamps or pseudo-messages, pseudo-messages are generated whenever an entity receives new input signals but fails to generate a new output signal. The pseudo-messages are propagated to (i) inform entities which do not receive a new input signal that there has been no change associated with the signal, (ii) inform them of the time up to which no new input is expected, and (iii) keep the entities from simulating into a deadlock. In [3], for a given sequential circuit with a single loop, where the input signal from the external world changes very slowly relative to the cumulative propagation delay, $\sum_i d_i$, of the entities constituting the cyclic loop, a large number of pseudo-messages are generated, at an interval given by $\sum_i d_i$. The assumption is that the circuit is not an oscillator. Also, for a circuit with multiple feedback loops, the actual interval is determined by the smallest value of $\sum_i d_i$ across all loops. If the average time period of the externally asserted signal transitions, T, is such that $T / \sum_i d_i$ is large, say 1000, then the efficiency of simulation in approach [3] is very poor. On the contrary, YADDES would enable the simulation to proceed between the successive transitions of the external signal, implying extremely high efficiency. In YADDES, there is the overhead of the execution of the data-flow network, i.e., the pseudo-components. However, the data-flow network execution is event-driven, and the additional computational need is easily satisfied by the high performance nodes of today's parallel processors.
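The ratio of the external period T to the cumulative loop delay gives a rough count of how many rounds of pseudo-messages the approach in [3] would circulate between two external transitions. The following Python sketch computes that estimate; the function name `pseudo_message_rounds` is hypothetical, and the calculation is only a back-of-the-envelope reading of the argument above, not a model of the implementation in [3].

```python
from typing import Sequence


def pseudo_message_rounds(period: float, loops: Sequence[Sequence[float]]) -> int:
    """Estimate the rounds of pseudo-messages between two external transitions.

    Each inner sequence lists the propagation delays d_i of the entities in
    one feedback loop. The interval between rounds is the smallest
    cumulative delay over all loops.
    """
    interval = min(sum(delays) for delays in loops)
    return int(period // interval)


# One loop of two gates with 5 ns delay each; external transitions 1000 ns apart.
print(pseudo_message_rounds(1000, [[5, 5]]))
```

For a loop of two gates with 5 ns delays and external transitions 1000 ns apart, the cumulative delay is 10 ns and the estimate is 100 rounds, none of which advance the simulation of any real event.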
Conclusion
The issue of asynchronous distributed discrete event simulation of cyclic circuits is crucial to the field of computer simulation and has the potential of addressing problems in the domains of digital hardware design, queueing networks, banking transactions, and communication networks. The YADDES algorithm offers the characteristics of freedom from deadlock and acceptable performance. This paper has presented a proof of correctness of YADDES and also reported its performance on a few example digital circuits.
