A typical array processing problem is run on parallel computers. A critical issue in using a parallel computer is maintaining the speedup as the number of processors increases. A VHDL simulation model was developed to predict the speedup of the parallel computer system. It is important to characterize the computation and communication workloads in the partition procedure. Prediction results for Ethernet are compared with measured run data, and the accuracy of the model is satisfactory. Prediction of the speedup of a frequency domain beamformer is in progress.
INTRODUCTION:
In many defense related applications, very complex hardware and software systems exist. They are characterized by difficult computation and real time requirements. These systems typically have an assorted collection of heterogeneous analog or digital processors. The program that runs the embedded computer system is typically on the order of hundreds of thousands of lines of source code. The system is generally very complex, hard to design, and hard to maintain. In the era of cold war military system development, the emphasis was on performance breakthroughs. In this era of shrinking defense expenditures there is a new emphasis: to achieve the same system capability at a lower overall cost. Because of recent investment and payoff in High Performance Computation (HPC), it is desirable to explore the possibility of implementing complex military systems on these parallel processors. This kind of Massively Parallel Processing (MPP) system usually has thousands of processors on a network structure.
If the future MPP market develops successfully, the hardware (H/W) cost of commercial MPP architectures will be lower than that of a custom MPP architecture. One objective of the Massively Parallel System Design (MPSD) research is detailed-level MPP software mapping and portability.
RETARGETABLE MPSD METHODOLOGY:
The MPSD design methodology was reported previously [1]. The approach here concentrates on using commercially available tools whenever possible. A typical Navy DSP application can be represented as a task level signal flow graph [2]. Underlying each graphics node is the task software itself. It may be written in any procedural language, so that the function can be simulated on a host processor before it is mapped onto a target MPP system.
The approach here follows the message passing methodology, which involves explicit control of the parallel environment. The programming is usually done in an environment such as PVM or EXPRESS, with utilities to handle parallel message passing. Achieving good performance in programming an MPP still relies on slow and tedious manual mapping procedures. This project concentrates on the methodology for MIMD machines.
SALIENT FEATURES:
(1) Task level function module parallelization (coarse grain).
(2) Use of a high level procedural language and message passing.
After graphic entry and functional simulation, a host mapping procedure is required before a program can be run on the MPP system. The mapping procedure consists of partition and allocation. For MIMD machines with distributed memory and message passing, scheduling may be done statically at compile time.
CALIBRATED MAPPING PERFORMANCE PREDICTION (CMPP):
For the retargetable MPSD methodology the critical step is the MPP mapping procedure. Here function modules are partitioned and allocated to a large MPP system. The user has to couple the modules with the message passing operations. If the user is forced to do this manually for all the pieces (1000's) and then run the MPP execution to decide whether the partition is suitable, it will take a long time. Figure 1 shows this mapping procedure in detail. The cycle will be repeated many times for each performance metric.
Figure 1. Detailed mapping procedure.
In order to expedite the mapping design cycle, a Calibrated Mapping Performance Prediction (CMPP) paradigm is proposed here. Instead of functional execution on the MPP to collect the dynamic performance benchmark, the benchmark is collected from performance model simulation. In a performance model only two kinds of modules are required. One is the execution module (EXEC), and the other is the communication module (COMM). Attention at this time is on the benchmark of execution time, which is directly related to speedup in MPP systems. Each module is characterized by its EXEC load, EXEC bandwidth, COMM load, and COMM bandwidth. When a specific partition is decided, the host program can be analyzed to estimate the EXEC and COMM loads for all the pieces. This is a static benchmark from a semi-automatic procedure. The collected data are used to annotate the performance model as model parameters before simulation. Presently this process of load extraction and estimation is repeated for each new partition. Eventually an automatic procedure will be desirable.
To build up the performance model it is necessary to construct a token network structure with the EXEC modules and COMM modules. The token networks can handle multiple transmitters in a real network situation. Presently the model can only handle Ethernet simulation. Construction of the performance model is done in graphics mode. Replication of identical modules in the 1000's can be done reasonably well with the generate construct of VHDL. In this CMPP paradigm many details of message passing are hidden so that attention can be given to the partition and allocation problems.
A commercial VHDL environment is used to demonstrate the Calibrated Mapping Performance Prediction (CMPP) paradigm. One advantage of using the VHDL environment is that the generate construct of the VHDL language can be used to do the replication in Figure 2. There are also generic facilities in VHDL that may be used to annotate model parameters before simulation.
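As an illustration of how the four quantities named above (EXEC load, EXEC bandwidth, COMM load, COMM bandwidth) might combine into a static estimate for one partition, the following Python sketch may help. It is not the VHDL performance model. The class `Piece`, the function `estimate_partition_time`, and the rule that pieces execute concurrently while all messages are serialized on one shared network are assumptions made only for this example.

```python
from dataclasses import dataclass


@dataclass
class Piece:
    """One piece of a partition (hypothetical structure for illustration)."""
    exec_load: float       # bytes of data processed by this piece
    exec_bandwidth: float  # bytes per second the processor can handle
    comm_load: float       # bytes this piece sends over the network
    comm_bandwidth: float  # bytes per second available on the network


def estimate_partition_time(pieces):
    """Crude static estimate of execution time for one partition.

    Assumption: every piece runs on its own processor, so execution
    overlaps, while all messages share a single network and are sent
    one after another.
    """
    exec_times = [p.exec_load / p.exec_bandwidth for p in pieces]
    comm_times = [p.comm_load / p.comm_bandwidth for p in pieces]
    return max(exec_times) + sum(comm_times)


def speedup(single_time, parallel_time):
    """Speedup relative to the single-processor execution time."""
    return single_time / parallel_time
```

A static estimate of this kind ignores contention and ordering effects, which is the reason the dynamic benchmark is taken from simulation of the performance model.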
MODEL DEVELOPMENT:
One main feature of this MPSD methodology is that the partition and allocation results are useful for retargeting an application to different types of MPPs. Because the bandwidth and topology of the networks and the throughput rates of the processors differ among MPPs, it is recognized that some remapping is necessary. The idea is to keep this retargeting effort as small as possible.
Workstations networked with Ethernet were used in our development. A message passing development environment called EXPRESS was used in our experiment. This is an inexpensive approach that helps us develop the retargetable MPSD methodology and the CMPP paradigm.
One part of the performance model is the EXEC module, which characterizes a piece of execution occurring in an architecture element of the MPP. It can be a source that generates a load token, a feedthrough that accepts input tokens and produces output tokens, or a sink that consumes a load token. The EXEC modules are characterized by VHDL generics, which are the model parameters described below. A VHDL structure of modules is shown in Figure 3. The leftmost one is a source EXEC module, and the rightmost one is a sink EXEC module. The INST generic gives the module a unique name in the model. The execution load is characterized by size * unit, where unit is the basic data size, such as byte or Kbyte. The throughput characterizes the speed of the EXEC module. Latency allows a more accurate account of delay. Duty cycle is relevant if the EXEC module is a source that generates periodic loads.
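The role of these generics can be sketched in Python as follows. This is a minimal sketch and not the VHDL module itself. The formula delay = latency + (size * unit) / throughput, the reading of duty cycle as the period between generated loads, and all names in the class `ExecModule` are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class ExecModule:
    """Hypothetical stand-in for an EXEC module and its generics."""
    inst: str                # unique name of the module in the model
    size: float              # number of units in one load
    unit: int                # basic data size in bytes (1 = byte, 1024 = Kbyte)
    throughput: float        # bytes processed per second
    latency: float           # fixed delay in seconds
    duty_cycle: float = 0.0  # assumed: seconds between loads of a source

    def load_bytes(self):
        """Execution load expressed as size * unit."""
        return self.size * self.unit

    def service_time(self):
        """Assumed delay for one load token: fixed latency plus transfer time."""
        return self.latency + self.load_bytes() / self.throughput

    def emission_times(self, count):
        """Times at which a source module would emit `count` periodic loads."""
        return [i * self.duty_cycle for i in range(count)]
```

For example, a source with size 4, unit 1024, throughput 2048 bytes per second, and latency 0.5 s would take 2.5 s per load under these assumptions.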
Shown in Figure 4 are two COMM modules called EBIU. The COMM module can receive from or transmit to the local port, and data transfer on the global port is also bidirectional. The signals in our study are all of data token type.
A special VHDL resolution function is implemented to model the Ethernet.
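In VHDL, a resolution function takes the values of all drivers of a signal and returns the single value seen on that signal. The paper does not give the details of its Ethernet resolution function, so the Python sketch below only illustrates the general idea for a shared medium. The constants `IDLE` and `COLLISION`, the function `resolve_bus`, and the collision rule are assumptions and not the authors' implementation.

```python
IDLE = "idle"            # assumed marker for a driver that is not transmitting
COLLISION = "collision"  # assumed marker for overlapping transmissions


def resolve_bus(drivers):
    """Return the value seen on a shared bus given every driver's output.

    Assumed rule: no active driver leaves the bus idle, exactly one
    active driver places its token on the bus, and two or more active
    drivers produce a collision.
    """
    active = [d for d in drivers if d != IDLE]
    if not active:
        return IDLE
    if len(active) == 1:
        return active[0]
    return COLLISION
```

A real Ethernet model would also need carrier sensing, backoff, and timeouts, which this sketch leaves out.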
SIMULATION RESULTS:
One important feature of the CMPP paradigm is the calibration process involving the EXEC/COMM model parameters. These parameters are extracted and estimated from a host functional program. Using those parameters, the performance model can be simulated to yield dynamic benchmarks. Adjusting the model parameters by comparing a dynamic benchmark from the CMPP prediction in Figure 2 with that from the actual execution in Figure 1 is referred to as the calibration process. This ensures that the model prediction is credible given a proper selection of model parameters.
Our partition experiment was done on workstations with Ethernet. The crucial step is to model and characterize the Ethernet correctly. The calibration process described above is used to tune the COMM modules (EBIU). Figure 4 shows both the actual message delays and the model predictions. The prediction is fairly satisfactory. The model parameters that yield this prediction are as follows:
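One way to automate part of such a calibration is to fit model parameters to measured delays. The Python sketch below fits a simple linear delay model, delay = latency + size / bandwidth, by ordinary least squares. Both the linear model and the least-squares fit are assumptions swapped in for illustration; the paper describes adjusting parameters by comparing predictions with measurements but does not state the fitting method. The function name `calibrate` is hypothetical.

```python
def calibrate(sizes, delays):
    """Fit delay = latency + size / bandwidth to measured data.

    sizes  : message sizes in bytes
    delays : measured message delays in seconds
    Returns (latency in seconds, bandwidth in bytes per second).
    """
    n = len(sizes)
    mean_x = sum(sizes) / n
    mean_y = sum(delays) / n
    # Ordinary least-squares slope and intercept of delay against size.
    sxx = sum((x - mean_x) ** 2 for x in sizes)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, delays))
    slope = sxy / sxx
    latency = mean_y - slope * mean_x
    return latency, 1.0 / slope
```

The fitted values would then be used as COMM module parameters and checked again against measured delays, repeating until the prediction is acceptable.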
bw-unit-per-sec => byte
bandwith-info => 48,000
Tx-latency-info => 41.280 ms
Rx-latency-info => 10.000 ms
Ack-time-info => 41.280 ms
bus-timeout-info => 10 sec
A frequency domain beamformer was used as an application to demonstrate the advantage of using MPP systems. A host program in FORTRAN 77 was implemented and checked against test data to verify its functionality. Then the mapping procedures in Figure 2 were followed to partition the application program and execute the pieces under the parallel EXPRESS environment. The benchmark is the execution time. This mapping procedure was repeated for 1, 2, 4, 6, and 8 architecture elements on the network of workstations. The results are shown in Figure 5 where increasing elements helps to decrease the 
SUMMARY:
This paper presented detailed modeling and simulation data for a frequency domain beamformer implemented on several Sun workstations networked with Ethernet. This test demonstrated the utility of the Calibrated Mapping Performance Prediction (CMPP) paradigm. The information presented included an explanation of the detailed VHDL models, which were fine tuned to reflect the actual operation of the network.
