0018-9251/88/1100-0719 $1.00 © 1988 IEEE
The simulation of continuous dynamic systems requires the solution of ordinary differential equations with initial value conditions. Such systems occur in aerospace, mechanical, biological, electrical, and chemical systems. In the past, there have been four distinct approaches to computational machines for continuous system simulation: analog computers, hybrid computers, digital differential analyzers, and digital computers. A recent approach is to design a parallel digital structure which unites ideas from all four of these approaches. Such a system takes full advantage of recent developments in very large scale integration (VLSI) technology. This fifth approach is special purpose in the same sense that an analog computer is special purpose. The resulting computer is intended to solve those problems described by ordinary differential equations at an accuracy level comparable with reliable physical measurements.
II. PARALLEL COMPUTER ARCHITECTURES
A medium size analog computer is capable of performing the integration of 40 state variables and the associated function generation. Using fourth-order Runge-Kutta integration, a digital computer requires 20,000 integration time steps per second to achieve the equivalent accuracy and speed. A typical state equation requires 250 operations per state variable per integration step. Thus a digital computer requires a sustained throughput in excess of 200 MFLOPS (millions of floating point operations per second) to match the performance of a typical analog computer [1].
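The throughput requirement follows directly from the three figures quoted above. A minimal arithmetic check, in which the variable names are illustrative and not taken from the paper:

```python
# Sustained throughput a digital computer needs to match the analog computer,
# using only the figures quoted in the text.
state_variables = 40           # state variables integrated by the analog computer
steps_per_second = 20_000      # fourth-order Runge-Kutta time steps per second
ops_per_state_per_step = 250   # operations per state variable per integration step

flops = state_variables * steps_per_second * ops_per_state_per_step
mflops = flops / 1e6
print(f"{mflops:.0f} MFLOPS")
```

The product is 200 million operations per second, which is the 200 MFLOPS level cited in the text.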
Only the most recent models of multimillion dollar general purpose vector supercomputers, such as the CRAY X-MP, begin to approach this level of sustained performance on real simulation problems [2]. Thus economics provides the major impetus for a special purpose machine.
A number of commercially available parallel computers have appeared recently. Many of these general purpose computers are being used for simulation applications [3-11]. Special purpose commercial simulation machines are also available [12]. The floating point performance of these machines is not yet in the supercomputer class [2]. Utilizing VLSI components, many of these parallel computers offer price/performance advantages over vector supercomputers. The commercially available general purpose parallel computers typically have a small number of processors built using a general purpose microprocessor with a math coprocessor and a limited bandwidth interconnection network. The performance of these parallel machines cannot be improved by orders of magnitude without major modifications to their current architecture or technology.
IEEE TRANSACTIONS ON AEROSPACE AND ELECTRONIC SYSTEMS VOL. 24, NO. 6 NOVEMBER 1988
The classical solution method for ordinary differential equations with initial value conditions is shown in Fig. 1. For each integration time step it is necessary to perform function or derivative evaluations and numerical integration sequentially for all of the state variables [13-20]. The major differences are in the numerical integration methods used. Multiple function evaluations per time step are required by many integration methods. Many software packages allow the user to choose one of several common integration routines to obtain the best performance.
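The serial procedure described above can be sketched as follows. This is an illustrative example only: it assumes classical fourth-order Runge-Kutta integration (which needs four function evaluations per time step) and a hypothetical test system, the harmonic oscillator x'' = -x, chosen because its exact solution is known. None of the function names come from the paper.

```python
import math

def derivatives(t, x):
    """Function evaluation for x'' = -x written as two first-order equations."""
    return [x[1], -x[0]]

def rk4_step(f, t, x, h):
    """One classical fourth-order Runge-Kutta step (four function evaluations)."""
    k1 = f(t, x)
    k2 = f(t + h / 2, [xi + h / 2 * ki for xi, ki in zip(x, k1)])
    k3 = f(t + h / 2, [xi + h / 2 * ki for xi, ki in zip(x, k2)])
    k4 = f(t + h, [xi + h * ki for xi, ki in zip(x, k3)])
    return [xi + h / 6 * (a + 2 * b + 2 * c + d)
            for xi, a, b, c, d in zip(x, k1, k2, k3, k4)]

def simulate(f, x0, t_end, h):
    """Serial loop: evaluate and integrate all state variables at every step."""
    t, x = 0.0, list(x0)
    for _ in range(round(t_end / h)):
        x = rk4_step(f, t, x, h)
        t += h
    return x

x_final = simulate(derivatives, [1.0, 0.0], t_end=1.0, h=0.01)
print(x_final[0], math.cos(1.0))  # numerical and exact position at t = 1
```

Every function evaluation and every integration update here runs on one processor in sequence, which is the cost the parallel method aims to divide among processors.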
The solution of ordinary differential equations with initial value conditions is ideally suited for parallel processing. This class of problems exhibits an extremely high degree of parallelism. Many computations can be performed in a processor before it is necessary to exchange global data. The amount of global data or state variables that must be shared among processors is small in relation to local data and program size. These are the characteristics that must be present for efficient parallel processing utilizing a large number of processors.
An equivalent parallel digital solution method, analogous to the operation of an analog computer or a digital differential analyzer, is shown in Fig. 2. [...] parallelism can be found in the numerical integration methods; however, this parallelism, a factor of two to four, is small compared with n [22, 27, 28]. New parallel solution methods have also been suggested [29-33]. Using classical methods, coupling in the system of equations means that data must be exchanged among processors every time a function evaluation is performed. This requires that an interprocessor connection network be provided. The time required for the transfer of data will reduce the speedup to a value less than n. For maximum speedup the processors must be connected by a high bandwidth interconnection network. The functional dependence of the differential equations determines the connection paths required. This implies that the interconnection network must be reconfigurable for high performance.
IV. SYSTEM ARCHITECTURE
Based on an analysis of these numerical considerations, the architecture shown in Fig. [...]
To demonstrate the utility of this architecture, an experimental prototype capable of supporting 32 processors was designed and built. The experimental prototype is intended to be used as a research tool. The prototype was used to obtain accurate performance data and to gain additional insight into implementation problems and limitations.
Each processor performs function evaluation and numerical integration on a subset, typically one or two, of the system state variables. Programs and data are maintained in the local processor memory. Every time a function evaluation is performed the new values of the state variables are transferred on the interconnection network. This is the minimum amount of information that must be exchanged among processors. Decomposition of the problem in this manner maintains fast interprocessor communication times. Further decomposition increases the parallelism at the expense of increased communication with a resulting decrease in performance. Processors with high speed analog-to-digital (A/D) and digital-to-analog (D/A) channels are used for analog inputs and outputs.
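The decomposition described above can be illustrated with a small serial emulation. This is a sketch under stated assumptions, not the prototype's software: each hypothetical "processor" owns one state variable, forward Euler integration is substituted for brevity (one function evaluation per step), and the network exchange is modeled as a simple copy of the new state values. All names are invented for illustration.

```python
def make_processor(index, derivative):
    """A hypothetical processor that owns a single state variable."""
    return {"index": index, "f": derivative}

def parallel_euler_step(processors, shared_state, h):
    # Phase 1: each processor evaluates its own derivative from the shared
    # state and integrates its own variable; no communication is needed.
    new_values = {}
    for p in processors:
        i = p["index"]
        new_values[i] = shared_state[i] + h * p["f"](shared_state)
    # Phase 2: network exchange; only the new state values are transferred.
    next_state = list(shared_state)
    for i, value in new_values.items():
        next_state[i] = value
    return next_state

# Harmonic oscillator x'' = -x split across two processors.
processors = [
    make_processor(0, lambda x: x[1]),    # dx0/dt = x1
    make_processor(1, lambda x: -x[0]),   # dx1/dt = -x0
]

state = [1.0, 0.0]
h = 0.001
for _ in range(1000):                     # integrate to t = 1
    state = parallel_euler_step(processors, state, h)
print(state)
```

Only the two state values cross the emulated network at each step; the derivative functions and any intermediate data stay local to each processor, which is the property the text identifies as keeping communication time small.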
V. NETWORK ARCHITECTURE
The interconnection network must be capable of parallel high speed data transfers among arbitrary processors. Clearly, a network which allows all processors to communicate directly to any other processors in parallel is desirable if it is economically viable. Networks, such as hypercubes, which require processor forwarding of data to support an arbitrary transfer are too slow to meet performance goals. Crossbar and Banyan networks are possible candidates for the interconnection network. These networks grow in complexity as O(n^2) when switch points, interconnection wiring, and control circuitry are taken into account [34]. Crossbar switching networks are nonblocking, require only one level of switching, and have a higher degree of symmetry, making fabrication less difficult. A crossbar switching network was selected for use in the experimental prototype.
is prohibitive. Tri-state bidirectional data busses are used to reduce the size of the network by a factor of four. Thus a combination of space and time switching is required to transfer data. High speed microcode memory is used to enable and control the direction of each switch point in the network. This allows multiple destinations for a single packet of input data and use of different routing strategies. Typically, simulation problems require four or more time slots on the network for a complete data transfer.
experimental prototype. Each switch point uses two 74LS245 octal bus transceivers. A four by four switch matrix is implemented on a circuit board. Sixteen boards are required for the thirty-two processor prototype. Daisy chained ribbon cables run horizontally and vertically through the network to provide the large number of interconnections required.
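The component counts implied by these figures can be checked with a few lines of arithmetic. The board, matrix, and transceiver numbers come from the text; the comparison against a full 32 by 32 crossbar is an assumption made here to show that the totals are consistent with the stated factor-of-four reduction.

```python
processors = 32
boards = 16
switch_points_per_board = 4 * 4      # four by four switch matrix per board
transceivers_per_switch_point = 2    # two 74LS245 octal bus transceivers

switch_points = boards * switch_points_per_board
transceivers = switch_points * transceivers_per_switch_point

# Assumption: an unreduced crossbar needs one switch point per
# (source, destination) processor pair.
full_crossbar_points = processors * processors
reduction = full_crossbar_points / switch_points

print(switch_points, transceivers, reduction)
```

Under that assumption the prototype network has 256 switch points built from 512 transceivers, one quarter of the 1024 switch points in a full crossbar.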
The communication controller contains a high speed microprogrammable state machine as shown in Fig. 4. Each microinstruction controls a time slot on the network. Fields in the microcode specify the processors requiring input, the processors providing output data, and the switch configurations required in the network. The processor microcode fields control maskable comparators that signal when the selected processors are ready to transfer data. The controller hardware tests and sets four handshake lines on all processors in parallel. Pipelining is used to configure the switch points prior to the transfer of data. The data path through the network contains a single gate delay of 8 ns. The major communication path delay is a 100 ns signal rise time in the ribbon cable connections. The rise time results from the capacitance between signal and ground wires in the data cable. Custom VLSI circuits could be used to reduce the physical size of the network and increase the performance.
Processor programs must output and input data in an ordered sequence that coincides with the communication controller microprogram. The arrival of data via the network is used to synchronize the processors. The number and order of input and output variables is problem dependent and will vary from processor to processor. The network interface is buffered to allow processors to perform other operations while transfers are occurring.
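The time-slot behavior of the controller can be sketched in software. This is a hypothetical model, not the prototype's microcode format: the field names, the bit-mask encoding, and the three-processor example are all assumptions made for illustration. It shows one microinstruction per time slot, a masked readiness test standing in for the maskable comparators, and a single source feeding several destinations.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MicroInstruction:
    """One time slot on the network (field names are hypothetical)."""
    output_mask: int   # bit i set: processor i provides output data
    input_mask: int    # bit i set: processor i requires input data
    routes: tuple      # (source, destination) switch-point settings

def slot_ready(instr, out_ready, in_ready):
    """Masked comparison: only the processors selected by the masks must be ready."""
    return ((out_ready & instr.output_mask) == instr.output_mask and
            (in_ready & instr.input_mask) == instr.input_mask)

def run_slot(instr, out_ready, in_ready, out_ports, in_ports):
    """Transfer data for one time slot; return False if the slot must wait."""
    if not slot_ready(instr, out_ready, in_ready):
        return False
    for src, dst in instr.routes:
        in_ports[dst].append(out_ports[src])  # one source may feed many destinations
    return True

# Two time slots for a three-processor example.
microprogram = [
    MicroInstruction(0b001, 0b110, ((0, 1), (0, 2))),  # P0 sends to P1 and P2
    MicroInstruction(0b010, 0b001, ((1, 0),)),         # P1 sends to P0
]
out_ports = {0: 1.5, 1: -2.0, 2: 0.0}
in_ports = {0: [], 1: [], 2: []}
results = [run_slot(m, 0b111, 0b111, out_ports, in_ports) for m in microprogram]
print(results, in_ports)
```

Because the slot order is fixed by the microprogram, each processor program must read and write its port in the same order, which is the sequencing constraint the text describes.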
VI. PROCESSOR ARCHITECTURE
The system architecture is capable of supporting many types of processor modules. All that is required is a compatible interface to the switching network and an IEEE Standard 796 card format [35, 36]. The standard interface to the network is a 16-bit parallel transistor-transistor logic (TTL) bidirectional port with four handshake lines. Five processor modules have been developed for the prototype computer. They include a general purpose microprocessor with a numeric coprocessor, a processor with high speed A/D and D/A converters, a high speed microprogrammed fixed point processor, and two high speed microprogrammed floating point processors.
The microprocessor based design uses an Intel SBC 86/12 processor with an 8087 numeric coprocessor. When this processor was selected it was anticipated that the next generation of VLSI floating point units would approach the performance goals for the machine. These processors served to provide floating point capability in the interim.
processor architecture is shown in Fig. 5. With current VLSI technology it is necessary to use several speedup techniques to attain the data rates required. These include microcoding of programs, separate program and data memories, and pipelining of both instructions and data.
The second processor design uses the AMD 29325 floating point ALU and the AMD 29334 register file. After evaluating a prototype of both units, the AMD 29325 processor design has been selected for incorporation in the next version of the machine. The peak performance of a single AMD 29325 based processor is 10 MFLOPS [38]. New processor designs containing four to eight T800 floating point transputers [40] or the Bipolar Integrated Technology 2110/2120 floating point ALUs [38] are being investigated.
VII. BENCHMARKS
As part of the experimental program several simple continuous system simulation benchmarks have been implemented on the prototype machine described in this paper. Results obtained using the prototype were compared with traditional serial results to verify correct operation and to validate the parallel solution method.
To program these benchmarks on the prototype a number of software tools were developed. In the prototype all processor and control memories can be downloaded by a general purpose host computer. The host can also start, stop, reset, single step, and examine memory contents in all processors and the communication controller. These features are useful in multirun simulations. A compiler was written to generate the microcode for the communication sequencer. The input to this compiler is a simple language which describes the data transfers required between processors.
Additionally, the program in local memory of each processor must be developed. For the microprocessor based processor a compiler was used. High speed processor benchmark programs were written in microcode. Ultimately, a compiler for a continuous system simulation language could be developed for the machine which would generate all of the required code modules [23, 41, 42].
The benchmarks selected were a second-order linear system, the pilot ejection problem, PHYSBE, and a linear single axis autopilot [16, 23]. Speedups demonstrated on the prototype using the 8086/8087 microprocessor are shown in Fig. 6. On any parallel computer, linear speedup cannot be obtained unless processor communication time is zero. Based on program execution times, a more realistic model for the machine assumes a ten percent overhead from processor communication. This speedup, 0.9N, can be obtained on systems of ODEs that produce equal processor computational loads. Performance below this level is due to unequal processor computational loads. Unbalanced processor execution times will arise in nonlinear ODE systems because of different function or derivative evaluation times.
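The speedup model above can be written out as a small calculation. The ten percent communication overhead is from the text; the treatment of unequal loads, in which the slowest processor sets the time per step, is an assumption added here to illustrate why unbalanced loads fall below 0.9N. The function name and the example load values are hypothetical.

```python
def speedup(loads, comm_overhead=0.10):
    """Estimated speedup for per-processor computational loads (time units).

    comm_overhead is the fraction of time lost to processor communication
    (ten percent in the text). Assumes the slowest processor sets the pace.
    """
    serial_time = sum(loads)      # one processor does all the work
    parallel_time = max(loads)    # all processors wait for the slowest one
    return (1.0 - comm_overhead) * serial_time / parallel_time

balanced = speedup([1.0] * 8)              # eight equal loads
unbalanced = speedup([1.0] * 7 + [2.0])    # one processor has double the load
print(balanced, unbalanced)
```

With eight equal loads the model gives 0.9 x 8 = 7.2. When one of the eight processors carries twice the load, the estimate drops to about 4.05, even though the communication overhead is unchanged.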
High speed microprogrammed processors were also evaluated by finding the highest frequency sine wave which can be integrated in real time with an accuracy of 0.1 percent. Using this benchmark, the performance limit of the prototype machine is in excess of 2000 Hz.
VIII. CONCLUSIONS
Using current technology, the prototype machine is capable of solving 64th-order ordinary differential equations at a solution bandwidth in excess of 1000 Hz. A special purpose machine built using parallel VLSI circuits offers the potential of mainframe performance levels at a hardware cost reduction of an order of magnitude or more. The architecture presented is capable of solving ordinary differential equations at speeds comparable with modern analog computers. Such a machine can serve as a replacement for hybrid systems and supercomputers in large real-time simulations.
Additional work is needed in the development of VLSI chips designed to support parallel architectures, the development of parallel compilers for continuous system simulation languages, and new integration methods designed for parallel computers.
