This paper describes the creation of an intelligent software tool that performs automatic parallelization of dynamic system simulations for execution on arbitrary message-passing multicomputing configurations. The tool greatly decreases the amount of time needed to obtain a viable multiple processor implementation of large-scale simulations and can be expanded to include the latest methodologies that exploit functional and data parallelism present within the simulation model and algorithm. Such a unifying tool is designed to increase the likelihood that a parallel realization of a simulation problem with an acceptable level of performance can be obtained. To illustrate the feasibility of developing such a tool, this paper describes the prototype implementation of the Automated Partitioning and Mapping Engine (APME) and demonstrates its effectiveness when applied to a large-scale simulation executed on a number of multicomputer systems and topological configurations.
Keywords: Dynamic system simulation, parallel processing, multicomputers
I. Introduction
The execution time for large-scale dynamic simulations on conventional uniprocessor systems can be several orders of magnitude longer than is acceptable if the results are to be timely and useful. Parallel processing can often be employed in these cases, since the code for most dynamic simulations contains a high degree of functional/control and data concurrency. The major problem with exploiting this parallelism is determining how to do so automatically and in a manner that minimizes the encoding/debugging effort while allowing the easy integration and incorporation of the latest algorithmic techniques. This paper focuses on the creation of a software tool, the Automated Partitioning and Mapping Engine (APME), which addresses this problem by allowing the selection of a number of advanced algorithmic techniques that automatically match the decomposed structure of the existing applications to the hardware topology and routing characteristics of the targeted multicomputer architecture.
Practical experience has shown that the cost (in terms of encoding and debugging time) of not automating the dynamic simulation process is very high and appears to increase in a greater than linear manner as the number of processors is increased. In fact, a major motivation for the creation of such an automated tool came during a research project which required that multiple parallel simulation methodologies be applied to a large-scale simulation problem assuming several topological configurations.
Manually encoding a single methodology/topology combination required an average of four days and was extremely error-prone. Applying the same methodology/topology combination to the automated tool described in this paper results in an accurate parallel implementation being created in a few seconds of computation time. It is estimated that if the automated tool had been available for this project, the implementation time would have been shortened from several months to approximately one week.
The basic idea of automatically transforming and executing dynamic simulations on MIMD architectures is not new [1, 2], but much of the previous research has focused upon the development of software tools that utilize a single paradigm to translate dynamic simulations to traditional shared memory single bus architectures, which are limited in terms of scalability. The software tool being proposed in this paper is unique in that it is designed for a pure message-passing multicomputing environment, it supports a wide variety of parallelization paradigms, and it automatically produces all source-level files necessary for parallel execution to occur on an arbitrarily-connected targeted hardware platform.
II. Dynamic System Simulation
The mathematical model for a dynamic system is generally described using a set of multi-order nonlinear differential equations that represent the internal and external responses generated by a system to the desired level of accuracy. This set of equations is usually decomposed into an equivalent set of first-order equations which are solved during the simulation by the process of integration. Since approximate methods, which derive values for the current time step from those of past simulation steps, are most often used, the state of the entire simulation is always a function of the variables which store the results of integration. The output variable for each integration equation represents a system State Variable. Other equations are often part of the model, allowing the model to be represented in a manner that is more descriptive of the system being simulated. The variables computed by these equations are called Declared Variables, and are defined in terms of the system State Variables (and/or the global Input Variables). The Input Variables are much like the system State Variables with the exception that they are updated by some process which is external to the simulation step (in a real-time environment these variables might be periodically updated by polling hardware sensors, and in a purely functional simulation these variables could be updated by periodically executing sections of code which drive the model in some pre-described manner). The Output Variables are those which are needed by some external process during the simulation (in a real-time environment these values might drive digital-to-analog actuators, or in a functional environment produce output data to be viewed by the user).
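The variable classes described above can be illustrated with a small sketch. It is not taken from any APME model file: the mass-spring-damper system, the constant values, the names `spring_force`, `damper_force`, and `force_in`, and the use of an explicit Euler update are all assumptions chosen for brevity.

```python
# Hypothetical mass-spring-damper: m*x'' + c*x' + k*x = u(t),
# rewritten as two first-order equations. All names and values are
# illustrative assumptions, not part of the APME software.

M, C, K = 1.0, 0.5, 4.0      # model constants (assumed values)
DT = 0.001                   # integration step size (assumed)


def derivatives(state, force_in):
    """Evaluate the modeling equations once for the current step."""
    x, v = state                          # State Variables
    spring_force = K * x                  # Declared Variable
    damper_force = C * v                  # Declared Variable
    dt_x = v                              # derivative of State Variable x
    dt_v = (force_in - spring_force - damper_force) / M
    return dt_x, dt_v


def step(state, force_in):
    """Advance the State Variables by one explicit Euler step."""
    dt_x, dt_v = derivatives(state, force_in)
    x, v = state
    return (x + DT * dt_x, v + DT * dt_v)


def simulate(n_steps, force_in=1.0):
    """Run n_steps with a constant Input Variable; return the Output Variable."""
    state = (0.0, 0.0)                    # initial conditions
    for _ in range(n_steps):
        state = step(state, force_in)
    position_out = state[0]               # Output Variable
    return position_out
```

With a constant input the position settles toward `force_in / K`, which gives a quick sanity check on the equations.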
The majority of computational effort expended within a typical dynamic simulation occurs when processing the aforementioned set of modeling equations. These equations are acyclic in nature and must either be presorted before compilation or be scheduled dynamically during the simulation in an order which assures that all data dependencies between equations are met. During this process the equations involving the integration operation are placed in a form in which the derivatives of the State Variables are calculated. These derivatives are used by the chosen integration method at the end of the simulation step to update the State Variables. There are literally hundreds of numeric integration methods that have been developed, many of which require multiple passes through the modeling equations for each simulation step. Although the general stability and accuracy bounds of these integration methods can be calculated analytically, the actual accuracy of an individual algorithm, when applied to a given simulation problem, is not easy to determine without applying it to the problem. Figure 1 illustrates some of the ways to apply parallel processing techniques to the simulation of large-scale dynamic systems. Many of these methodologies are being incorporated into the proposed APME software tool.
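Presorting acyclic equations so that every data dependency is met amounts to a topological sort. The following minimal sketch uses Python's standard `graphlib` module; the equation names are hypothetical, and State and Input Variables are left out of the dependency map because their values are already known at the start of a step.

```python
from graphlib import TopologicalSorter


def sort_equations(depends_on):
    """Return an evaluation order that satisfies every data dependency.

    depends_on maps each computed variable to the computed variables that
    appear on the right-hand side of its equation. A cyclic dependency
    raises graphlib.CycleError.
    """
    return list(TopologicalSorter(depends_on).static_order())


# Hypothetical equation set; the names are illustrative only.
equations = {
    "flow": {"pressure_drop"},
    "pressure_drop": set(),              # depends only on State Variables
    "torque": {"flow", "pressure_drop"},
    "dt_speed": {"torque"},
}
order = sort_equations(equations)
```

For this chain of dependencies only one valid ordering exists; in general several orderings are valid and any of them may be returned.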
III. Parallel Simulation Methodologies

A. Parallelization of the Simulation Algorithm
As discussed previously, the underlying algorithm associated with dynamic system simulation is integration. Several parallel approaches to integration have been proposed in the literature. A major advantage to these proposed approaches is that they are often relatively easy to implement for many topologies, the processing load is automatically balanced, and the form of the implementation is more predictable and model independent.
Parallel symbolic integration algorithms avoid some of the errors associated with numeric processing by allowing a certain number of the higher-order derivatives of the system model to be obtained in a symbolic manner. An example of such a technique is the parallel Taylor Series expansion of the system modeling equations used within the EMPRESS project of ETH in Zurich, Switzerland [3]. This approach uses symbolic analytic recursion techniques to calculate in an exact manner the higher derivatives of the system. The algorithm has been reported to be well suited for parallel processing of a number of engineering-type applications. Of course, the effectiveness of symbolic algorithms depends in large part upon how well the simulation model, system software, and computer hardware support symbolic processing.
Figure 1. Parallel Simulation of Dynamic Systems
Most parallel numerical integration algorithms employ a form of data parallelism where each processing element executes a local copy of the system model. For example, various block implicit methods [4, 5] obtain their parallelism by allowing concurrent calculation to occur on local copies of the system model, where each system model represents a distinct step in time. Some expanded predictor-corrector schemes [6] obtain parallelism by allowing each simulation step's predictor and corrector routine to be executed concurrently at the same time that multiple time steps are being calculated.
A common attribute associated with many parallel numerical algorithms is the style and type of communication that is required. Many of these algorithms require that data be sent periodically in a one-to-all (where one processor must send its State Variable data and/or derivative values to all the other processors) and/or an all-to-all (where all processors must share their State Variable and/or derivative information with one another) manner. Therefore, a topology and routing heuristic that readily support these types of communication patterns should be chosen. If there is a significant mismatch in either case, then severe performance degradation will occur.
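To make the all-to-all pattern concrete, the sketch below simulates a dimension-exchange scheme on a hypercube, one topology that supports this pattern well. It is an illustration under stated assumptions, not the routing heuristic used by APME: the node count is assumed to be a power of two, and every exchange within a step is assumed to happen simultaneously.

```python
def all_to_all_hypercube(local_data):
    """Simulate an all-to-all exchange on a hypercube of p = 2**d nodes.

    local_data[i] is the collection of values initially held by node i.
    In step k every node merges what it holds with the neighbor whose
    index differs in bit k, so after d steps each node holds all values.
    Returns the final holdings and the number of steps d.
    """
    p = len(local_data)
    d = p.bit_length() - 1
    if p == 0 or 2 ** d != p:
        raise ValueError("node count must be a power of two")
    held = [set(values) for values in local_data]
    for k in range(d):
        # Build the next state from the current one so that all exchanges
        # within a step are treated as simultaneous.
        held = [held[i] | held[i ^ (2 ** k)] for i in range(p)]
    return held, d


# Four nodes, each owning one hypothetical State Variable.
final, steps = all_to_all_hypercube([{"x0"}, {"x1"}, {"x2"}, {"x3"}])
```

The number of steps grows with the logarithm of the node count, which is why a mismatch between this pattern and the physical topology can be costly.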
It should also be noted that parallel integration can result in the parallel representation being very large, since most algorithms require separate copies of the system model to reside at each processing element. Such large system representations may not be able to fit into the instruction caches of many processing elements. Furthermore, in applying these algorithms, the effects of error and stability must be carefully considered. There is often a point where using expanded versions of a parallel algorithm may cause a decrease in performance due to instability or the increased amount of error. Additionally, because of the relatively long time that it takes to process the entire simulation model and the time skew that occurs when more than one time step is calculated concurrently, most parallel integration algorithms are considered not to be easily applied to hard real-time problems.
B. Parallelization of the Simulation Model
Depending upon the particular integration method employed, the modeling equations are often processed many times for each step of the simulation. This is necessary to allow the derivatives to be calculated, which are needed by the integration algorithm to obtain new State Variable values. For most large-scale simulations the processing of the system model is by far the most computationally intensive portion of the simulation. Therefore, faster execution of the simulation model through efficient parallel restructuring can be expected to have a large effect on overall simulation performance [7]. Also, since the underlying simulation algorithm is not altered, the accuracy of the simulation remains unaffected. Parallelizing the simulation model involves partitioning the modeling equations into a set of concurrent tasks for parallel execution by a potentially large number of processing elements.
Coarse-grained approaches result in the modeling equations being partitioned into a set of relatively complex tasks, which are few in number. Often this partitioning can be accomplished in ways that allow each task to directly correspond to a subsystem of the system model. This allows a parallel representation to be made that preserves much of the overall structure of the physical system, but limits the amount of parallelism obtainable to less than that possible in some of the finer-grained partitionings. Most coarse-grained partitionings require a large number of variables to be used for communication between tasks and for interprocessor communication. If the tasks are to be non-preemptive (that is, uninterruptible, self-contained units of execution), it may be necessary to redundantly incorporate some of the modeling equations into more than one task, allowing tasks to communicate with one another via system State Variables.
Finer-grained partitionings allow the simulation model to be represented at a lower level of abstraction, thereby increasing the amount of parallelism obtainable, but also increasing the communication overhead. One such approach allows tasks to be created under the criterion that each task contain the modeling equations necessary to adequately represent a system State Variable (i.e. each task represents a first-order differential equation). This approach requires that all Declared Variable equations be absorbed within the tasks, with the equations that are needed by two or more tasks being replicated. Another approach results from allowing each modeling equation, both State and Declared Variable, to represent a task. Both of these partitioning methods result in a set of tasks that contain an arbitrary number of inputs but just one output.
A fine-grain partitioning can be based upon the criterion that each task represents a basic mathematical operation such as a floating point operation. A very high degree of parallelism is possible, but a large total amount of communication is required between tasks.
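The partitioning in which each task represents one State Variable, with shared Declared Variable equations replicated, can be sketched as a reachability computation over the equation dependencies. The function and variable names below are hypothetical, and the sketch assumes the dependency map lists only Declared Variables (State and Input Variables are available to every task).

```python
def tasks_per_state_variable(depends_on, derivative_names):
    """Build one task per State Variable derivative equation.

    depends_on maps each equation's output to the Declared Variables it
    reads. Each task absorbs every Declared Variable equation reachable
    from its derivative equation.
    """
    tasks = {}
    for name in derivative_names:
        needed, stack = set(), [name]
        while stack:
            eq = stack.pop()
            if eq not in needed:
                needed.add(eq)
                stack.extend(depends_on.get(eq, ()))
        tasks[name] = needed
    return tasks


def replicated_equations(tasks):
    """Return the equations that appear in two or more tasks."""
    seen, shared = set(), set()
    for equations in tasks.values():
        shared |= seen & equations
        seen |= equations
    return shared


# Hypothetical model: two derivative equations sharing Declared Variable a.
deps = {"dt_x": {"a"}, "dt_y": {"a", "b"}, "a": {"c"}, "b": set(), "c": set()}
tasks = tasks_per_state_variable(deps, ["dt_x", "dt_y"])
shared = replicated_equations(tasks)
```

Here the equations for `a` and `c` are needed by both tasks and would therefore be replicated, trading extra computation for the removal of intra-step communication.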
In an environment where the hardware timing parameters are known before the simulation is executed, one approach is to start with the finest grain representation possible and combine fine grain tasks together to create coarser grained tasks by employing grain-packing [8] techniques. Such techniques make use of the targeted topology and selected task allocation methodology to determine the granularity of each task. This allows much of the work to be automated, but the resulting representation loses much of the recognizable structure of the system.
Another approach to obtain a set of tasks is to group modeling equations together in ways that match equation dynamics. With this method, equations with closely corresponding natural frequencies or time constants are grouped together to form separate tasks. The resulting parallel representation supports the possibility of using multiple integration algorithms which execute concurrently on multiple processing elements, allowing special integration methods to be used to process the very fast or "stiff" equations [9].
Before a set of tasks can be effectively assigned and scheduled for execution on a group of processing elements, a weighting scheme must be incorporated to reflect the relative complexity of each task. Allocation strategies may require that any combination of minimum, maximum, and nominal weightings be assigned to each task. These weightings are often based upon both the number and type of floating point operations to be performed, with the weightings of each type of floating point operation being proportional to the expected execution time on the targeted parallel architecture. More advanced weighting schemes attempt to take into account such hardware dependent factors as main memory access time, cache access time, and the effects of processor pipelining. Task weightings can also be obtained by profiling an executing simulation on the targeted hardware architecture and recording in a statistical manner the execution times associated with each task. In any case, the task weightings are only approximations, since the actual execution time is dependent upon a large number of factors, many of which cannot be predicted before the simulation is performed.
In addition to the task weightings, certain communication penalties are often assigned to the variables used to communicate between tasks. These are used by some allocation schemes to reflect the performance degradation that occurs when tasks executing on separate processing elements attempt to communicate with one another. These penalties approximate the latency, channel bandwidth and contention problems that can occur when data passes through the links of a message-passing interprocessor communication network.
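Both weighting approaches described above reduce to simple bookkeeping. In the sketch below the per-operation costs in `OP_COST` are invented placeholder values, and taking the mean of the measured times as the nominal weighting is an assumption; neither is taken from APME.

```python
# Relative cost per operation type. These are assumed placeholder values;
# on a real target they would come from timing measurements.
OP_COST = {"add": 1.0, "mul": 1.5, "div": 8.0, "sqrt": 12.0}


def task_weight(op_counts, op_cost=OP_COST):
    """Estimate a task's weight as the cost-weighted sum of its operations."""
    return sum(op_cost[op] * count for op, count in op_counts.items())


def profiled_weights(samples):
    """Summarize measured execution times as (min, nominal, max) per task.

    The nominal weighting is taken to be the mean of the samples, which
    is one possible choice among several.
    """
    return {
        task: (min(times), sum(times) / len(times), max(times))
        for task, times in samples.items()
    }
```

For example, `task_weight({"add": 2, "mul": 2, "div": 1})` evaluates to 13.0 with the placeholder costs above.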
Task allocation can be defined as the process by which a set of tasks is partitioned (task assignment) into clusters, where each cluster is mapped to execute on an individual processing element. Often included in this procedure is the scheduling of the execution order of the tasks of each cluster and the communication of data between tasks which reside on different clusters in a manner which minimizes processor idle time and maintains all precedence relationships (data dependencies) which exist between the tasks. The allocation problem (task assignment, mapping, and scheduling), even when approached statically in a pre-simulation environment, is very complex (i.e. NP-hard), with heuristically-based algorithms most often being required to provide task assignments in a timely manner.
With most allocations the majority of the communication between processors occurs during the second stage of dynamic simulation, where the State Variables (and/or their derivatives) which are needed elsewhere on the system are communicated across the interconnection network. The time spent during this phase depends upon the distribution of the State Variables, and the sequence chosen for the communication between processors. The distribution of the State Variables determines the need for communication and, to a lesser extent, the communication efficiency which is possible, and the sequence of communications determines the amount of concurrent uses of the data paths. As with the process of allocation, there are several strategies which can be implemented to improve the efficiency associated with communicating State Variable information.
Parallel dynamic system simulation is often amenable to static task allocation methodologies. Classical allocation strategies, such as Critical Path Scheduling [10], the Depth-First/Implicit Heuristic (DF/IHS) method [11], and the Heavy Node First (HNF) algorithm [12], attempt to balance the amount of computation that occurs on each processing element while ignoring the effect of communication between processors. Other allocation methods are based solely upon reducing the number of interactions between tasks executed on different processors [13]. A more complete approach is based upon striking a compromise between balanced processing load and reduced interprocessor communication. Three approaches of this type include the Earliest Ready Task (ERT) algorithm [14], the Mapping Heuristic (MH) [15], and the Synchronous Non-Buffered Communication Heuristic (SNBC) [16].
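As an illustration of the first class of strategies (load balancing with communication ignored), the following is a generic list-scheduling sketch that ranks ready tasks by their level, the longest weighted path to the end of the task graph. It is written in the spirit of such heuristics and is not a reproduction of any of the cited algorithms; all names are hypothetical.

```python
def highest_level_first(weights, successors, n_procs):
    """Schedule a task DAG on n_procs processors, ignoring communication.

    weights maps task -> execution time; successors maps task -> tasks
    that depend on it. Returns (schedule, makespan), where schedule is a
    list of (task, processor, start_time) in the order tasks were placed.
    """
    level = {}

    def get_level(t):
        # Longest weighted path from t to the end of the graph.
        if t not in level:
            level[t] = weights[t] + max(
                (get_level(s) for s in successors.get(t, ())), default=0.0)
        return level[t]

    preds = {t: set() for t in weights}
    for t, succs in successors.items():
        for s in succs:
            preds[s].add(t)

    finish = {}                       # task -> finish time
    proc_free = [0.0] * n_procs       # time at which each processor is idle
    schedule = []
    remaining = set(weights)
    while remaining:
        # A task is ready once all of its predecessors have been placed.
        ready = [t for t in remaining if all(p in finish for p in preds[t])]
        task = max(ready, key=lambda t: (get_level(t), t))
        data_ready = max((finish[p] for p in preds[task]), default=0.0)
        proc = min(range(n_procs),
                   key=lambda p: max(proc_free[p], data_ready))
        start = max(proc_free[proc], data_ready)
        finish[task] = start + weights[task]
        proc_free[proc] = finish[task]
        schedule.append((task, proc, start))
        remaining.remove(task)
    return schedule, max(finish.values())
```

Because communication cost is ignored, the makespan reported by such a scheduler is optimistic for a message-passing machine; heuristics of the compromise type add a communication term when choosing the processor.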
C. Parallelization of the Simulation Algorithm and Model
In many cases it may be advantageous to restructure both the simulation model and the integration algorithm to form a combined parallel processing environment. This may allow the simulation to be further dispersed among the processing elements. In such an environment, the parallel algorithm is superimposed upon the parallel execution of the system model. With this arrangement there are $\alpha$ partitionings of the simulation model (where $\alpha$ is an integer such that $\alpha \ge 1$) created by the chosen allocation methodology, and the system model is duplicated $\beta$ times (where $\beta$ is an integer such that $\beta \ge 1$) as dictated by the parallel integration method. This results in an overall $\alpha \times \beta$ processor representation that has the same error characteristics as the $\beta$ processor parallel integration method. It should be noted that the other two parallelization techniques discussed in the previous subsections are special cases of this technique. The automation of this process would allow for a large amount of flexibility in determining the configuration which provides the best performance for a particular application.
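Writing alpha for the number of model partitions and beta for the number of model copies required by the parallel integration method, the combined configuration can be enumerated as a simple grid. The sketch below is only an indexing illustration with hypothetical names.

```python
def processor_grid(alpha, beta):
    """Enumerate the processors of a combined alpha x beta configuration.

    Each processor is identified by (copy, partition): 'copy' selects one
    of the beta model instances required by the parallel integration
    method, and 'partition' selects one of the alpha clusters of modeling
    equations within that instance.
    """
    if alpha < 1 or beta < 1:
        raise ValueError("alpha and beta must be integers >= 1")
    return [(copy, part) for copy in range(beta) for part in range(alpha)]
```

Setting beta to 1 gives pure model parallelization, and setting alpha to 1 gives pure parallel integration, which is how the two earlier techniques arise as special cases.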
IV. Automated Partitioning and Mapping Engine
The focus of much of this research has been to create a tool which will automatically partition and map large-scale dynamic system simulations to multicomputing hardware architectures. Such a tool should 1) automatically create all source level files needed to implement the parallel representation, 2) support the selection of a wide number of serial and parallel integration algorithms, 3) allow for the easy incorporation of multiple model decomposition, task allocation, and communication scheduling techniques, 4) present a generic environment for easy expandability (e.g. the tool itself must be easily expandable, allowing new algorithms to be incorporated which will interface well with existing software), 5) support multiple paradigms for parallel simulation (the manner in which state variable communication is accomplished is an area where at least two overall methodologies should be supported), 6) create output file formats that can be easily altered to be used on many types of computer platforms (including an autoprofiling option which allows system timing parameters to be easily determined), 7) allow for, where appropriate, the off-line symbolic simplification of modeling equations, 8) support full mixed mode (discrete event and dynamic system) parallel simulation, and 9) utilize advanced and detailed debugging and visualization to allow users to find both performance bottlenecks and solution accuracy "bugs".
The prototype implementation of the Automated Partitioning and Mapping Engine (APME) specifically addresses the first six desired features discussed above. The remaining features are either addressed by existing system software or will remain the focus for future research.
V. APME Prototype Implementation
Figure 2 shows the prototype configuration for the proposed Automated Partitioning and Mapping Engine (APME). The APME utilizes two application-dependent input files and creates two types of source-level output files. The input files are the Hardware Configuration File and the Dynamic System Description File. The output files include a set of System Configuration Files and a number of High-Level-Language Files. Depending upon the level of portability desired, there are a number of internal files which are created and used by the APME software.
The Hardware Configuration File is used to describe the structure of the multicomputer system; it fully characterizes the topological configuration of the inter-processor communication network and the relative speeds of the processing elements and [...]
When parallel integration is used in conjunction with the allocation of the system model, the Hardware Configuration File must be written in a manner that allows for the existence of two distinct levels of hierarchy. The first or lower level of hierarchy describes the portion of the overall topology (i.e. the model processing cluster) associated with the execution of each instance of the simulation model. For example, in the case where a $\beta$ processor integration algorithm is employed, the Hardware Configuration File must have $\beta$ distinct topology sections with the same topology but with possibly differing processor names and port link numbers. This is because each model processing cluster must be capable of supporting the same general allocation of the system model. The second or upper level of the hierarchy describes the connections required for the state variable and derivative data to be communicated between the various model processing clusters as dictated by the parallel integration algorithm. The current prototype utilizes a communication scheduling heuristic that requires that there be enough links between the clusters to support the embedding of $\alpha$ $\beta$-node nCube topologies. This will readily support the one-to-all and all-to-all communications requirements present within many of the parallel integration methodologies.
The Model Description File contains the generic description of the system to be simulated and therefore has the same format regardless of the targeted hardware platform. It is coded using standard ANSI C syntax and is comprised of five sections which correspond to the CSSL format [17]. These sections are separated from one another by the key words $initial, $terminal, $dynamic, $output, and $derivative.
The $initial and $terminal sections contain the sequential code which executes only at the beginning and end of the simulation, respectively. This code is placed in a single high-level-language module to be executed on a single processor. Code contained within the $dynamic section executes at regular intervals throughout a simulation, but generally less often than the code in the $derivative section (this section can be used to update the global Input Variables). The $output section contains the code necessary to update the system Output Variables. The $derivative section is the primary section of code which represents the acyclic modeling equations; it is executed at least once during each step of the simulation depending upon the particular integration method employed. This section is processed in the manner described by the integration routine that is selected before the post-allocation translator is executed.
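A file organized by section key words can be separated with a few lines of code. The sketch below assumes that each key word ($initial, $terminal, $dynamic, $output, $derivative) appears alone on its own line; the exact file layout is not specified here, so this is an assumption, and `split_sections` is a hypothetical helper rather than part of APME.

```python
SECTION_KEYWORDS = ("$initial", "$terminal", "$dynamic", "$output", "$derivative")


def split_sections(text):
    """Split a model description into its keyword-delimited sections.

    Assumes each section key word stands alone on a line (an assumption
    about the layout). Returns a dict mapping key word -> section body.
    """
    sections = {}
    current = None
    for line in text.splitlines():
        word = line.strip()
        if word in SECTION_KEYWORDS:
            current = word
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return {name: "\n".join(body).strip() for name, body in sections.items()}


# Hypothetical, heavily abbreviated model description.
sample = "$initial\nx = 0.0;\n$derivative\ndt_x = -x;\n$terminal\n"
sections = split_sections(sample)
```

Only the body of the $derivative section would then be handed to task-graph construction; the other sections are sequential code for a single processor.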
The Task Graph Creator is responsible for determining the granularity and type of tasks that will be used to represent the system model. It can be executed in a manner that preserves the representation of tasks specified by the user in the Model Description File or it can be instructed to automatically construct tasks via a number of automatic methodologies such as the grain packing methods discussed earlier [8]. In some cases, the task graph creator may actually execute sequential simulations in order to obtain task weighting data, or it may be instructed to use predetermined weightings on computational constructs to compute the aggregate weightings for each task. The Task Graph Creator generates a Restructured Model File, where the code that makes up each processing task, constructed previously, is uniquely defined in its modified $derivative section. The Task Graph Creator also creates for use by the allocator a Task Graph File which fully describes the precedence relationships and data dependencies that exist between the tasks.
The Task Allocation File contains the assignment, mapping, and in many cases the scheduling information for all the tasks that are defined in the Restructured Model Description File. It also often reflects the schedule of all Inter-derivative Section communications which occur between the processing elements within a given simulation step. This file can be entered directly by a user, but will most likely be created by an external allocation routine. Several allocation methodologies have been incorporated in the APME software including the aforementioned HLFET, HLFNET, SCFET, SCFNET, SNBC/SA and MH methodologies [11, 15, 16].
The Post-Allocation Translator (PAT) portion of the APME software allows the restructured system modeling equations to be allocated in the manner described by the allocation methodology under the direction of the chosen integration algorithm. There are two state variable communication paradigms supported by the APME. The first one views the process of dynamic simulation as occurring in two distinct steps: the execution of the acyclic modeling equations, followed by the updating via numeric integration of the State Variables. The first step allows standard directed acyclic graph schedulers and allocators such as those previously discussed to be used to parse, sort and partition the pseudo-code contained in the $derivative section of the Restructured Model File in the manner described in the Task Allocation File and to place the acyclic tasks/communications into separately coded high-level language (HLL) modules.
The second step of the first state variable communication paradigm is to then separately schedule the state variable communications which occur between the HLL modules during each step in a simulation, using an independent communication scheduling heuristic which is part of the Post-Allocation Translator. Several communication heuristics exist which attempt to achieve an optimal schedule of state variable communications in a manner that allows efficient routing of messages between non-nearest neighbor processors, while avoiding wasteful communications, and making good use of parallel data paths.
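One simple way to make use of parallel data paths is to pack independent transfers into the same communication round. The greedy sketch below is not the heuristic used in the Post-Allocation Translator; it assumes blocking, unbuffered links, so that a processor takes part in at most one transfer per round, and it considers only direct (nearest-neighbor) messages, ignoring multi-hop routing.

```python
def schedule_rounds(messages):
    """Greedily pack point-to-point messages into communication rounds.

    messages is a list of (sender, receiver, variable) tuples. Within one
    round no processor takes part in more than one transfer. Returns a
    list of rounds, each a list of messages. The result is valid but not
    necessarily the minimum number of rounds.
    """
    rounds = []          # rounds[i]: messages sent in round i
    busy = []            # busy[i]: processors already engaged in round i
    for msg in messages:
        sender, receiver, _ = msg
        for i, used in enumerate(busy):
            if sender not in used and receiver not in used:
                rounds[i].append(msg)
                used.update((sender, receiver))
                break
        else:
            # No existing round can take this message; open a new one.
            rounds.append([msg])
            busy.append({sender, receiver})
    return rounds


# Four processors in a ring, each passing one hypothetical State Variable on.
ring = [(0, 1, "x0"), (2, 3, "x2"), (1, 2, "x1"), (3, 0, "x3")]
rounds = schedule_rounds(ring)
```

For the ring above, the four transfers fit into two rounds, since transfers on opposite sides of the ring do not share a processor.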
The second state variable communication paradigm supported by the APME is to choose an allocation methodology that accounts for state variable communications. In this case, the APME is required to echo the state variable communication pattern described by the allocation methodology directly onto the High Level Language Module Files. When the number of State Variables is large compared to the total number of system variables then improved performance is likely, since the allocation methodology can include the State Variable communication costs as part of its overall performance measuring objective function.
The output files include the various System Configuration Files and a number of simulation High-Level Language Files whose derived object files will execute on the individual processing nodes. The High-Level-Language Files encode the parallel representation complete with the appropriate task mapping in the form of independent processes or threads that communicate with the code contained in other High-Level Language files via system supported message-passing constructs such as blocking and/or nonblocking reads and writes. These files are created by the APME software using one of the aforementioned paradigms in a manner in which they can be compiled, linked, and configured for parallel execution. To aid in the easy compilation and execution of the parallel representation, one of the configuration files that is produced is a script file which, when invoked, will execute all system software required to automatically bind the individual source-level files together for parallel execution.
VI. Example Application
As a non-trivial example, a representation of the U.S. Space Shuttle Main Rocket Engine (SSME) has been applied to the APME software, resulting in the creation of multiple parallel implementations that execute on several parallel and distributed multicomputing platforms. Since the Shuttle Main Engine Simulation is too large to be effectively visualized, a subsection of the model, the High and Low Pressure Fuel Turbo Pumps, has been extracted to serve as an example which illustrates the finer details of the prototype APME software. This section highlights the results obtained when the above example and the entire SSME simulation are applied to a reconfigurable network of SGS-THOMSON Transputers, after which the effects of applying the SSME to other multicomputer platforms are discussed.
Although there are many viable parallel configurations supported by APME, the example assumes a hardware configuration that has the following attributes:
1) a message-passing type topology with an arbitrary structure, 2) processors cannot perform computation and communication concurrently, 3) no single processor is allowed to simultaneously send and receive data to or from more than one other processor, 4) communication between processors is synchronous with no buffering of messages being allowed (blocking reads and blocking writes are the only communication mechanisms employed), 5) only simple store-and-forward routing of messages is supported, 6) task execution time and communication requirements can be determined statically before parallel execution by profiling sequential runs, 7) processing tasks are non-preemptive in nature, and 8) task decomposition is accomplished in the manner where each equation in the $derivative section of the Model Description File represents an executable task (no grain-packing is employed).
Arguably the most important of the various CSSL sections of code contained in the Model Description File is the $derivative section. This section is shown in Figure 3 for the SSME Fuel Turbo Pump Example, where it can be seen that the equations which represent the system model are all written in standard ANSI C. The SSME Fuel Turbo Pump example has a total of thirty equations, four of which compute the values of the derivatives of the State Variables; these are distinguished by the "dt_" prefix (note that the initial conditions for these State Variables are specified explicitly in the $initial section). Each equation represents to the allocation routine a deterministic unpartitionable basic block code segment.
The targeted parallel platform employed for this example is a multi-node SGS-THOMSON Transputer system whose interconnection network can be reconfigured to represent a wide range of topologies. This is accomplished by superimposing the crossbar-connected links on the static links of the underlying linear array topology present within the system; each T805 Transputer has four 20 Mb/s ports, two of which are used for the static connections and two of which are connected to the crossbar switch. The crossbar switch introduces an increased propagation delay of 20% over that of the static links.
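The effect of the slower crossbar links on a store-and-forward route can be sketched as follows. This is a simplified cost model written for illustration only: the `LinkKind` type, the function names, and the decision to apply the 20% penalty to the whole per-hop transfer time (rather than to the propagation delay alone) are assumptions, not the cost function used by the APME.

```c
#include <stddef.h>

/* Hypothetical link classification for the reconfigurable system. */
typedef enum { LINK_STATIC, LINK_CROSSBAR } LinkKind;

/* Assumed penalty: a crossbar hop costs 20% more than a static hop. */
#define CROSSBAR_PENALTY 1.20

/* Time to move one message across a single link. */
double hop_time(LinkKind kind, double static_hop_time)
{
    if (kind == LINK_CROSSBAR)
        return static_hop_time * CROSSBAR_PENALTY;
    return static_hop_time;
}

/* Store-and-forward routing: the message is fully received at each
 * intermediate processor before it is sent on, so hop times add up. */
double route_time(const LinkKind *route, size_t hops, double static_hop_time)
{
    double total = 0.0;
    size_t i;
    for (i = 0; i < hops; ++i)
        total += hop_time(route[i], static_hop_time);
    return total;
}
```

Under this model a two-hop route with one static and one crossbar link costs 2.2 times the static hop time, which is why an allocation routine would prefer to place heavily communicating tasks on neighbors joined by static links.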
The parallel Transputer-based system used in this research utilizes Helios [18] as the distributed operating system. It supports UNIX-like system calls and directly facilitates the loading of program code on individual Transputer elements. The high-level interprocessor communication calls supported by Helios are very powerful, but they incur too much start-up latency for the granularity of the tasks in this example. Therefore, low-level system macros (which directly call the base machine language instructions to perform blocking reads and writes) are used to provide low-latency communication between nearest-neighbor Transputers. To function properly, Helios must have at least one link per Transputer reserved to allow the program code to be loaded individually to the Transputers, and for general input and output. This means that not all the Transputers in the configuration can be used for computation; some must be reserved to maintain the Helios links. In this example, the base topology contains only the Transputers explicitly involved in the computation and communication operations that occur during execution of the modeling equations. Although the crossbar switch is dynamically reconfigurable, this example assumes that it is configured once and remains in that configuration for the duration of the simulation. This may be the most efficient use of the crossbar switch in most situations, since the reconfiguration time required is very large compared to its propagation delay. Figure 4 shows the Task Allocation File that was produced by the SNBC heuristic when the crossbar switch was set so that the base topology was a four-processor nCube (or ring). Note that one communication link must pass through the crossbar switch, so it is approximately 20% slower than the other three links. To facilitate hand-optimization by the user, the Allocation File and all other files are textual so that they can be directly read and modified.
The SNBC allocation routine was employed to provide a complete assignment, mapping, and scheduling of the equations and communications to the individual Transputers. This can be seen in Figure 5, where the complete task graph of the Fuel Turbo Pump is shown along with the task computational weightings, which were obtained by runtime profiling of the serial execution of the SSME Fuel Turbo Pump example.
The APME produces three configuration files and one UNIX-style "makefile". One configuration file, the Resource Mapping File, is used to indicate the module assignment between the C Source-Level Files and the appropriate Transputer. Another configuration file, the Component Description File, defines the Helios I/O links into and out of each Transputer. The third configuration file, the Wiring File, describes the configuration of the C004 crossbar switch, which in turn defines the chosen topology. The crossbar was configured in the example to form a four-processor nCube topology with two added processors to form the I/O chain required by Helios. The "makefile" is an execution script that performs all the system-level compilation, linking, and module assignment necessary to seamlessly execute the application.
In addition to the configuration files, the APME produces four high-level language module files which contain ANSI C with the appropriate message-passing communication primitives. The result of applying the simple Euler algorithm to the above allocation is shown in Figure 6. It is apparent that there is a one-to-one correspondence between the tasks and communications in the Allocation File, and the equations and "link_data" statements in each High-Level-Language File. The integration of the State Variables is performed in a regular manner using loops and pointers instead of updating the variables using individual statements. This increases the probability that the entire loop will fit in the cache and makes it easier for the APME to systematically transfer State Variables and derivatives between processors when parallel integration is employed. As shown in the Allocation File, the projected speedup for this example was 3.05. This compares well to the actual speedup of 2.97 obtained when it was executed on the Transputer configuration.
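The loop-and-pointer style of integration described above can be illustrated with a minimal sketch of one forward Euler step. This is not the code generated by the APME (that code appears in Figure 6); the function name `euler_step` and its parameter layout are assumptions chosen for illustration.

```c
#include <stddef.h>

/* Advance n State Variables by one forward Euler step of size h:
 *     x[i] <- x[i] + h * dx[i]
 * state and deriv are parallel arrays, so the same loop serves any
 * number of State Variables assigned to a processor. */
void euler_step(double *state, const double *deriv, size_t n, double h)
{
    double *x = state;
    const double *dx = deriv;
    const double *end = state + n;

    while (x < end) {
        *x += h * *dx;   /* update one State Variable in place */
        ++x;
        ++dx;
    }
}
```

Because the State Variables and their derivatives sit in contiguous arrays, a block of them can also be handed to a communication routine as a single pointer and length, which is the property that simplifies transfers between processors.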
The entire Shuttle Main Engine model was also processed by the APME software. For this experiment, it was decomposed into a set of 131 non-preemptive deterministic tasks where the task weights were estimated using timing data obtained through profiling runs. There are 41 State Variables which make up the system, and there is no regular structure associated with any portion of the model. The number of possible combinations of integration method, allocation algorithm, topology, and state variable communication paradigm is too large to report exhaustively. In the interest of brevity, only the result of varying the topology is shown in Table 1, for the case where the Euler integration method is employed and the SNBC algorithm is used to perform each allocation. It is apparent that the measured speedups were significant and compared well to the projected speedups reported by the allocation methodology.
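The comparison between projected and measured speedup can be made explicit with the standard definitions of speedup and parallel efficiency. The helper functions below are a hypothetical sketch for illustration; in particular, expressing the projection error relative to the measured value is an assumption, since the paper does not define an error metric.

```c
#include <assert.h>

/* Speedup: serial execution time divided by parallel execution time. */
double speedup(double serial_time, double parallel_time)
{
    assert(parallel_time > 0.0);
    return serial_time / parallel_time;
}

/* Efficiency: fraction of ideal linear speedup actually achieved. */
double efficiency(double speedup_value, unsigned processors)
{
    assert(processors > 0u);
    return speedup_value / (double)processors;
}

/* Relative amount by which the projection overestimates (positive)
 * or underestimates (negative) the measured speedup. */
double projection_error(double projected, double measured)
{
    assert(measured > 0.0);
    return (projected - measured) / measured;
}
```

For the Fuel Turbo Pump figures quoted earlier (projected 3.05, measured 2.97, four computing processors), `efficiency(2.97, 4)` is about 0.74 and `projection_error(3.05, 2.97)` is about 0.027, i.e. the projection was optimistic by under 3%.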
It should be noted that the execution times reported for this application using the T805 Transputer are not very impressive by today's standards. This is primarily due to the fact that the next generation of Transputers (the 9000 series) had a delayed entry into the market. The Transputer system, however, did perform very well as a testbed for the creation of this generic tool. It has many features that are present in other multicomputer architectures which have more powerful processors, while at the same time being a low-cost alternative to other commercially available parallel systems. The T805 Transputer's ratio of computation to communication rates is comparable to that of more modern multicomputer systems, which supports the general scalability of the hardware configuration and the software methodologies currently being employed in the APME prototype. To evaluate the ease with which the APME software can be modified to create code for other parallel and distributed systems, the software was altered to allow the creation of executable representations of the SSME simulation on an nCube-2 [19] parallel processing system, and on a network of SUN workstations using PVM [20]. The SSME application was used in these cases to verify implementation correctness only, since the SSME model is much too small to effectively overcome the large start-up communication latency associated with these configurations (the SSME application is a relatively small-grained, fixed-workload type problem that demands low start-up communication latency to support its frequent interprocessor transfers of small amounts of data). For much larger applications (where individual communications can be bundled together, thereby minimizing the number of communications), the use of this tool may result in meaningful speedups even on these high-latency architectures.
VII. Conclusions and Future Directions
As parallel systems become larger, the demands for increased automation and higher performance software tools become more pronounced. The major driving force behind this research has been the need for a software tool which would in large part automate the task of applying dynamic system simulation across a larger domain of problems and parallel computer architectures. The APME tool seems to accomplish this goal. As an added benefit, the APME tends to minimize the potential corrupting influences which occur when the parallel simulation is coded by hand (this is especially true when more than one programmer is needed to code multiple allocations). It also makes a highly parallel simulation easier to modify, allowing simulation of large dynamic systems to occur in a matter of minutes instead of days.
Future areas for research include the addition of symbolic preprocessing capability to allow representations to be simplified analytically before they are processed numerically, and studying the feasibility of adding mixed-mode (i.e., combined dynamic and discrete event simulation) support to the APME software. Research is being undertaken to determine the extent to which the tool can be expanded and augmented to allow for fully parallel mixed-mode simulation of complex systems. Work is also progressing to establish a workable graphical user interface, allowing the tool to be used in a seamless manner in a typical windowing environment.
The feasibility of developing tools such as the APME has been illustrated by this and other research. The current form of the APME is being used primarily as a research tool in determining the effectiveness of a number of task decomposition, allocation, and communication scheduling methodologies when applied to a set of large-scale systems. A more robust commercial quality tool could be developed which would be generic enough to adapt well to the next generation of computing hardware, be powerful enough to make use of the many forms of parallelism present in large system models, and be expandable enough to allow new methodologies to be added as they are developed. It is the hope of this researcher that industry will explore the possibility of creating such a software tool.
VIII. Acknowledgement
The author would like to thank David McGhee of Adtran Corporation, Huntsville, AL for his work in developing the initial code for the Post-Allocation Translator used in the APME software. The author would also like to express his appreciation to Kenneth Ricks and John Weir of NASA Marshall Space Flight Center, Huntsville, AL for their work in implementing the APME-produced representations of the Shuttle Main Engine model on one of NASA's Transputer networks.
