A model can be decomposed into components, then mapped onto a network of simulators for execution. Given different levels of decomposition, we are interested in finding the optimal model decomposition that can be implemented on a particular structure of distributed simulators. This is done by comparing performance measures such as the minimum response time or the maximum throughput per unit of hardware complexity. We claim that the need for a model capable of performance evaluation is an important concern in designing distributed simulators. Thus a methodology for performance evaluation and simulation modelling of hierarchically specified distributed simulators is required. The empirical results show that simulating a given distributed simulator model is a suitable way to find its optimal performance; that is, each model has an optimal decomposition level in terms of performance. The influential performance measures are the average of the simulation run times and the product of the average run time and the hardware complexity of the given simulator model.
INTRODUCTION
Distributed simulation increases the speed of simulation by processing discrete event simulators in parallel.
In the implementation of discrete event simulations, since a single processor takes maximum processing time, the number of processors can be traded off against the time required for the distributed simulation [3]. Commonly, performance is not seriously considered until the system is built, and many systems have unacceptable performance when completed. If performance is to be considered in the design of such distributed simulation systems, performance modelling must be employed to see that the benefits of distributed processing are achieved [2].
Since the structure of distributed simulators is not unique, we need tools to evaluate the performance of such simulators.
Although analytic and simulation models have been recognized as useful tools, each tool has its distinct costs and advantages. Dubois [7] tried to combine such tools in complementary ways to exploit the advantages of the techniques. Nicol and Reynolds [11] built a network simulation model to observe parallelism and communication dependency within the components of distributed systems. The observations lead to a partitioning of the model that reduces the total cost of event-list maintenance and ensures that the simulation work is distributed [6]. Frankline [8] proposed a performance measure derived from analyzing the simulation cost that includes factors relating to the simulation machine speed, machine cost, waiting-time cost, and simulation quality. Recently Livny [9] has studied the relationship between the inherent parallelism of a concurrent simulation and the number of processors employed.
He used as a performance measure the parallelism factor of the computation, which is the ratio between the total processing time and its execution time. Nicol and Reynolds [10] claimed that "... excessive communication can lead to degraded performance, but minimizing communication need not optimize performance ...". To integrate these various approaches and results, a coherent methodology for performance evaluation of distributed simulators is needed.
In this paper, we describe a framework for performance modelling and simulation, and propose a performance simulator architecture called the hierarchical multiport simulator to evaluate the performance of distributed simulation systems.
We also design a simulation system based on the proposed methodology and the architecture.
Using the simulation system, empirical results are collected and analyzed to find the optimal decomposition levels of several distributed simulator models.
Finally, conclusions concerning the application of the performance evaluation methodology are given.
A more detailed discussion can be found in Baik [1] .
FRAMEWORK FOR PERFORMANCE MODELLING AND SIMULATION
Modelling and simulation is a set of activities which relates to constructing models of real world systems and simulating them on computer systems [14] .
Based on this definition, general system modelling and simulation is concerned with three major entities (real systems, models, and simulators) and two relationships among them: the modelling relation (between real systems and models) and the simulation relation (between models and simulators). The real system is a source of input-output data, which we obtain by input-output observation. That is, we are concerned with the input-output relation of the real system. The model, a representation of the real system, is also a source of such data. The input-output relation of the model is obtained by experimentation with the simulator.
Applying these entities and relationships to distributed systems, we can build a corresponding entities-relationships structure for distributed system modelling and simulation. Since our interest is in performance evaluation of distributed simulator systems, we propose the entities-relationships for distributed simulator performance modelling and simulation depicted in Figure 1. The procedure from a given DSM to a desired CTR employs two steps. The first step is the construction of a base CTR by node assignments and representation of the hierarchically specified DSM. The second step is the transformation of the base CTR to another structure of CTR by any of the following operations:
1) AGGREGATION - a many-to-one composition mapping from a base configuration satisfying the sufficient conditions to a transformed configuration constructed by block composition.
2) FLATTENING - a reconfigurable composition mapping, in which an interior node is removed by connecting its children nodes directly to its parent node.
3) DEEPENING - a reconfigurable composition mapping, in which the number of branches of an interior node is reduced by combining at least two branches, but not all, and adding one new interior node for the combined branches.
Therefore, aggregation can be applied to any deepened or flattened composition in order to obtain further transformed configurations. Mapping a CTR onto an HMS is a one-to-one matching; that is, each interior node of the CTR is matched to a coordinator processor, and each leaf node is matched to a simulator processor. The hierarchy provides a formal way to manipulate models by using essential concepts such as association and morphisms [15]. These concepts are required to transform a system specification from one form to another and to prove the preservation of structural features.
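The flattening and deepening operations, and the matching of tree nodes to processors, can be illustrated with a minimal Python sketch. The `Node` class and the function names are hypothetical and chosen only for illustration; they are not taken from the performance simulator. Aggregation is omitted because its sufficient conditions are not spelled out in this section.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(eq=False)  # eq=False: nodes are compared by identity, not by content
class Node:
    """One node of a composition tree; a node without children is a leaf."""
    name: str
    children: List["Node"] = field(default_factory=list)

    def is_leaf(self) -> bool:
        return not self.children


def flatten(parent: Node, child: Node) -> None:
    """FLATTENING: remove interior node `child` and attach its children to `parent`."""
    if child.is_leaf():
        raise ValueError("only an interior node can be removed by flattening")
    i = parent.children.index(child)
    parent.children[i:i + 1] = child.children


def deepen(node: Node, branches: List[Node], new_name: str) -> Node:
    """DEEPENING: combine at least two, but not all, branches under a new interior node."""
    if not 2 <= len(branches) < len(node.children):
        raise ValueError("combine at least two branches, but not all of them")
    if any(all(b is not c for c in node.children) for b in branches):
        raise ValueError("every combined branch must be a child of `node`")
    first = min(node.children.index(b) for b in branches)
    new_node = Node(new_name, list(branches))
    node.children = [c for c in node.children if all(c is not b for b in branches)]
    node.children.insert(first, new_node)
    return new_node


def assign_processors(root: Node) -> Dict[str, str]:
    """One-to-one matching: interior nodes -> coordinators, leaves -> simulators."""
    roles: Dict[str, str] = {}
    stack = [root]
    while stack:
        n = stack.pop()
        roles[n.name] = "simulator" if n.is_leaf() else "coordinator"
        stack.extend(n.children)
    return roles


# Usage: deepen a 3-branch root, then undo the change by flattening.
root = Node("root", [Node("a"), Node("b"), Node("c")])
group = deepen(root, root.children[:2], "ab")   # root -> [ab -> [a, b], c]
roles_after_deepening = assign_processors(root)
flatten(root, group)                            # root -> [a, b, c]
```

In this sketch deepening followed by flattening of the new node restores the original tree, which reflects the reconfigurable nature of the two mappings.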
HIERARCHICAL MULTIPORT SIMULATORS
The hierarchical multiport simulator contains coordinators, which synchronize the component simulators and handle tasks, and simulators, which simulate the corresponding components. As shown in Figure 5, COOR has three input ports and three output ports. One input and output port pair, (p1, p2), is for communicating with either its parent or the outside environment, and two pairs, (p3, p4) and (p5, p6), are for communicating with SIMU1 and SIMU2. An input port (p7 or p9) of SIMUi is connected to an output port (p4 or p6) of COOR, while an output port (p8 or p10) of SIMUi is connected to an input port (p3 or p5) of COOR. It should be noted that parallelism is achieved in the receipt of (x,D,t) and (*,t) tasks: an (x,D,t) task achieves parallelism when there is more than one destination (i.e., |D| > 1), and a (*,t) task achieves parallelism through the number of influencees of the influencer. A simulator aggregated with enclosed processors is called an aggregated node simulator and is assumed to be mapped onto a sequential uniprocessor. The execution of an (x,D,t) task in the aggregated node simulator differs from that in a comparable non-aggregated node simulator: when (x,D,t) arrives at the aggregated node simulator, it is sent to each enclosed simulator one at a time, in sequential mode. Therefore an aggregated node simulator exhibits no parallelism but has lower hardware complexity than a comparable non-aggregated node simulator.
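The port connections described above can be written down as a small lookup table. The following Python sketch is only an illustration: the dictionary `PORT_LINKS` and the helper names are hypothetical, and the pairing of p4/p7/p8/p3 with SIMU1 and p6/p9/p10/p5 with SIMU2 is an assumption read off the order in which the ports are listed.

```python
from typing import Dict, List, Optional, Set, Tuple

Port = Tuple[str, str]  # (component name, port name)

# Directed links: output port -> input port (assumed respective pairing).
PORT_LINKS: Dict[Port, Port] = {
    ("COOR", "p4"): ("SIMU1", "p7"),
    ("COOR", "p6"): ("SIMU2", "p9"),
    ("SIMU1", "p8"): ("COOR", "p3"),
    ("SIMU2", "p10"): ("COOR", "p5"),
}

# Output port of COOR that leads to each child simulator.
COOR_OUT: Dict[str, str] = {"SIMU1": "p4", "SIMU2": "p6"}


def route(component: str, out_port: str) -> Optional[Port]:
    """Return the input port reached from an output port, or None if the
    port leads outside this block (e.g. COOR's p2 towards its parent)."""
    return PORT_LINKS.get((component, out_port))


def dispatch(destinations: Set[str]) -> List[Port]:
    """Input ports that receive an (x, D, t) task for destination set D.
    With |D| > 1 the children can be served in parallel."""
    return [PORT_LINKS[("COOR", COOR_OUT[d])] for d in sorted(destinations)]
```

For example, `dispatch({"SIMU1", "SIMU2"})` reaches both p7 and p9, which is the case |D| > 1 in which an (x,D,t) task can be processed in parallel.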
DEFINITIONS AND ASSUMPTIONS
We assume that a performance simulation model contains identical processors that communicate by message passing. Furthermore, we assume that the delay for communication, coordination, or computation is constant. Let
- the complexity of a coordinator be 1,
- the complexity of a non-aggregated node simulator be 1,
- the complexity of an aggregated node simulator be the number of enclosed processors,
- the complexity of a given model be the sum of the complexities of all processors plus the number of links among the processors.
Thus, the hardware complexity of a particular simulator model can be computed from these definitions. For example, the complexity of the fully decomposed 2-level/3-fold simulator is 25, and the complexity of the one level decomposed 2-level/3-fold simulator is 16. Here, the 2-level/3-fold simulator is a balanced tree with a maximum level of 2 and 3 branches at every interior node.
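The two complexity values quoted above can be reproduced with a short calculation. The Python sketch below is an illustration under stated assumptions: the function names are hypothetical, the tree is balanced, and an aggregated node simulator is assumed to enclose every processor (coordinators and simulators) of the subtree it replaces. That reading gives 16 for the one level decomposed 2-level/3-fold simulator; extending it to other tree shapes is an extrapolation.

```python
def subtree_size(k: int, height: int) -> int:
    """Number of nodes in a balanced k-fold tree of the given height."""
    return sum(k ** i for i in range(height + 1))


def hardware_complexity(k: int, levels: int, decomposed: int) -> int:
    """Complexity of a `levels`-level/`k`-fold simulator decomposed
    `decomposed` levels deep (decomposed == levels means fully decomposed)."""
    if not 1 <= decomposed <= levels:
        raise ValueError("decomposed must be between 1 and levels")
    # Nodes above the decomposition frontier are coordinators (complexity 1 each).
    coordinators = sum(k ** i for i in range(decomposed))
    # Frontier nodes are plain simulators (complexity 1) when fully decomposed,
    # otherwise aggregated nodes whose complexity is their enclosed processors.
    frontier_nodes = k ** decomposed
    per_frontier_node = subtree_size(k, levels - decomposed)
    # One link joins every non-root processor to its parent.
    links = sum(k ** i for i in range(1, decomposed + 1))
    return coordinators + frontier_nodes * per_frontier_node + links


fully = hardware_complexity(k=3, levels=2, decomposed=2)      # 4 + 9 + 12
one_level = hardware_complexity(k=3, levels=2, decomposed=1)  # 1 + 3*4 + 3
```

With these assumptions `fully` evaluates to 25 and `one_level` to 16, matching the figures in the text.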
If the complexity measurement for hardware units is given, we can compute the throughput per hardware unit of a particular simulator model.
By comparing the performance simulation outputs for each simulator model mapped from a given distributed model, an optimal model decomposition level can be found. Each model has an optimal decomposition level in terms of performance, such as the minimum average of run.times or the maximum throughput per unit of hardware complexity. Here, the run.time is defined as the flow time of a task, that is, the interval between the time the task enters a simulator model and the time it leaves the model.
The throughput per unit of hardware complexity is defined as one over the product of the average run.time and the hardware complexity of the given model. Thus, to find the maximum throughput is to find the minimum product of the average run.time and the hardware complexity.
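The definition above can be turned into a small selection routine. In the Python sketch below the function names are hypothetical, the complexities 25 and 16 are the values given earlier for the 2-level/3-fold simulator, and the average run.times are made-up placeholders used only to exercise the code; they are not measurements from the experiments.

```python
from typing import Dict, Tuple


def throughput_per_complexity(avg_run_time: float, complexity: int) -> float:
    """One over the product of average run.time and hardware complexity."""
    return 1.0 / (avg_run_time * complexity)


def best_levels(results: Dict[str, Tuple[float, int]]) -> Tuple[str, str]:
    """Return (level with minimum average run.time,
               level with maximum throughput per unit of complexity)."""
    fastest = min(results, key=lambda name: results[name][0])
    most_efficient = max(
        results, key=lambda name: throughput_per_complexity(*results[name])
    )
    return fastest, most_efficient


# Placeholder run.times (NOT measured values); complexities from the text.
candidates = {
    "fully decomposed": (20.0, 25),      # product = 500
    "one level decomposed": (25.0, 16),  # product = 400
}
fastest, most_efficient = best_levels(candidates)
```

With these placeholder inputs the two criteria select different levels, which illustrates why maximizing throughput per unit of complexity is the same as minimizing the product of average run.time and complexity, and why that optimum need not coincide with the fastest configuration.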
DESIGN OF PERFORMANCE SIMULATOR
Based on the methodology for performance modelling and simulation of hierarchical simulators described in the previous sections, we design a simulation system called the performance simulator to determine the performance of given distributed simulator models; the performance simulation program is written in SIMSCRIPT II.5 [1]. The performance simulator comprises the simulation model and the experimental frame, as shown in Figure 6. The simulation model contains a set of coordinator-processes and a set of simulator-processes coupled to each other, and two types of globally accessible data structures. These tables are the coupling table and the routing set, which are used by all processors, coordinators and simulators, for routing and searching purposes:
1) COUPLING
In the DEVS formalism, the structural representation of an experimental frame is defined as a coupling of a generator, an acceptor and a transducer. The experimental frame specifies three systems connected to the model as shown in Figure 6. The generator is an input system that generates the input segments SI. The transducer is an output system which observes model input/output segment pairs and performs the statistical processing specified by the summary mappings SM. The acceptor is a run control system which observes a run control variable segment and indicates acceptance or rejection of an experiment according to whether or not the segment belongs to the admissible class SC. For performance evaluation, the generator creates tasks such as input events to be sent to the simulation model, the acceptor controls the simulation run by checking the current simulation time against the given observation interval, and the transducer performs statistical computations such as the average run-time per task.
Figure 6. Performance Simulator
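The division of labour among the three frame components can be sketched in a few lines of Python. This is a hedged illustration, not the SIMSCRIPT implementation: the class and function names are hypothetical, and `stand_in_model` is an assumed placeholder that delays every task by a constant, whereas the actual simulation model is a network of coordinator and simulator processes.

```python
from typing import Iterator


def generator(interarrival: float) -> Iterator[float]:
    """Generator: yields the times at which tasks enter the model."""
    t = 0.0
    while True:
        yield t
        t += interarrival


class Acceptor:
    """Run control: accepts the run while time is inside the observation interval."""

    def __init__(self, observation_interval: float) -> None:
        self.observation_interval = observation_interval

    def accepts(self, now: float) -> bool:
        return now <= self.observation_interval


class Transducer:
    """Collects input/output time pairs and summarizes them."""

    def __init__(self) -> None:
        self.total = 0.0
        self.count = 0

    def observe(self, t_in: float, t_out: float) -> None:
        self.total += t_out - t_in
        self.count += 1

    def average_run_time(self) -> float:
        return self.total / self.count if self.count else 0.0


def stand_in_model(t_in: float, delay: float) -> float:
    """Assumed placeholder model: every task leaves `delay` time units later."""
    return t_in + delay


def run(interarrival: float, observation_interval: float, delay: float) -> float:
    acceptor, transducer = Acceptor(observation_interval), Transducer()
    for t_in in generator(interarrival):
        if not acceptor.accepts(t_in):  # stop generating tasks past the interval
            break
        transducer.observe(t_in, stand_in_model(t_in, delay))
    return transducer.average_run_time()


average = run(interarrival=10.0, observation_interval=100.0, delay=8.0)
```

Because the placeholder model applies a constant delay, the reported average simply equals that delay; replacing `stand_in_model` with a real model is what would make the transducer's output informative.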
Process-interaction-oriented simulation languages such as SIMSCRIPT [13] can be used to implement the performance simulator. Each component of a distributed simulator model is implemented by a process, and messages are exchanged for synchronization and communication.
EXPERIMENTS AND RESULTS
Once the performance simulator is established, we can examine several configurations of distributed simulator models with constraints such as the number of processors or the number of links among the processors. By using the performance simulator, we can evaluate the performances of different configurations for a given distributed simulator model in order to find an optimal one.
We therefore can decide at what level to terminate the recursion in the hierarchical model specification so that the coupling of systems associated with that level satisfies the constraints of the simulator model. To get empirical results by using the performance simulator, we set up the input parameters as follows:
COMPUTATION.TIME = 8.0 time units
COORDINATION.TIME = 0.5 time units per branch
COMMUNICATION.TIME = 1.5 time units
Figure 7 shows the performance simulation results of the 2-level/k-fold (k = 2, 3, 4, 5) hierarchical multiport simulator models. While the fully decomposed level of a 2-level/k-fold simulator model is optimal in terms of the minimum average run.time, the one level decomposed simulator model is optimal in terms of the maximum throughput per unit of hardware complexity.
As shown in Figure 8, while the two level decomposed model of a 3-level/2-fold simulator is optimal in terms of the minimum average run.time, the one level decomposed model is optimal in terms of the maximum throughput per unit of hardware complexity. However, in the 4-level/2-fold simulator, the three level decomposed model is optimal in both cases. For instance, in our experiments, the coupling is taken to be tight-outer when |D| is greater than half of the number of simulator processors, and loose-outer when |D| is not greater than half. However, determining whether the simulator model is tight-inner or loose-inner depends on the number of influencees, |I|, of each leaf node.
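The outer-coupling rule stated above amounts to a single comparison, sketched here in Python. The function name is hypothetical. No corresponding function is given for the inner coupling, because the text says only that it depends on |I| and does not state a threshold.

```python
def outer_coupling(num_destinations: int, num_simulators: int) -> str:
    """Classify the outer coupling from |D| and the number of simulator processors."""
    if num_destinations > num_simulators / 2:
        return "tight-outer"
    return "loose-outer"
```

For a model with 9 simulator processors, |D| = 5 is classified as tight-outer and |D| = 4 as loose-outer; with 8 simulator processors, |D| = 4 is exactly half and therefore loose-outer.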
In the experiments shown in Figures 7 and 8, only the first combination, tight-inner coupling with tight-outer coupling, is used by default. In the 2-level/3-fold simulator model, the experimental results show two optimal performance levels: the fully decomposed level in terms of the minimum average run.time, and the one level decomposition in terms of the maximum throughput per unit of hardware complexity. See Figure 9.
