Introduction
The present work is part of a joint effort by Virginia Tech and the Research Triangle Institute, located in Research Triangle Park, NC. Figure 1 shows an overview of the system being developed [4]. The main objective in developing this system is to study architectural trade-offs. The trade-off study follows these steps [4]:
Extract the processor characterization data from the PML processor library using the Processor Characterization Extraction Tool (PCET).
Determine the number of processors required to run a given application from the Spreadsheet Analysis.
Build a VHDL structural model of the architecture.
Automatically create the Analysis Spreadsheet using the Architecture Characteristic Extraction Tool (ACET).
Use the Spreadsheet to obtain the partitioning of the software application algorithm.
Given the partitioned software application specifications, the software is converted to PML (Performance Model Library) code and mapped onto the specific processors in the hardware architecture model.
Evaluate the performance of the model. If the results are not satisfactory, change the partitioning of the software or the hardware model and repeat the entire procedure until the results are acceptable.
This paper concentrates on the creation of the hardware architecture model, the mapping of the software tasks onto the different processors, and the evaluation of the performance models.
Reuse Of Existing Components
The Performance Model Library (PML) is a library of hardware components that was developed by Honeywell, Inc. These components were used as primitives to build the present performance models. They were modified and customized to meet the current needs. Reuse of the existing models and the models developed is ensured in the following ways:
Use Of Generics: Important device parameters such as the throughput of the component, bus width, etc. are made generics so that their values can be easily changed.
Modular Design: The complete design is composed of independent but linked blocks. These blocks do not depend on global signals or variables. Hence, individual blocks can be salvaged for use in other applications.
Documentation and Coding Style: Good documentation and coding styles can enhance the reusability of the model and the individual blocks.
Ease Of Extendibility: The models are designed so that they can be customized to meet a variety of needs. For example, the basic token structure (which is a fundamental signal) has a few redundant fields, so that an extra field can be added to the token in the future without modifying all the components of the library.
All the above features make upgrades an easier task.
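The two reuse mechanisms above, generics and spare token fields, can be illustrated with a minimal Python sketch. It is not PML code: the class names, the field names and the number of spare fields are assumptions made for illustration, and the default transfer time of 0.007 microseconds per word is simply the value used in the tests reported later in this paper.

```python
from dataclasses import dataclass, field


@dataclass
class Token:
    """Hypothetical token record; field names are illustrative, not PML's."""
    token_id: int
    source: str
    destination: str
    size_words: int
    # Spare (redundant) fields reserved for future use, so that new
    # information can be carried without changing every component.
    spare: list = field(default_factory=lambda: [0, 0])


@dataclass
class BusInterfaceUnit:
    """Hypothetical component; its attributes play the role of generics."""
    transfer_time_us_per_word: float = 0.007
    latency_us: float = 0.0

    def transfer_delay_us(self, token: Token) -> float:
        # Delay = fixed latency + per-word transfer time * token size.
        return self.latency_us + token.size_words * self.transfer_time_us_per_word


tok = Token(token_id=1, source="radar", destination="B16-mem", size_words=256)
biu = BusInterfaceUnit()
delay_us = biu.transfer_delay_us(tok)
tok.spare[0] = 3  # a later extension could store, say, a priority here
```

Changing a parameter then amounts to constructing the component with a different value, for example `BusInterfaceUnit(latency_us=0.1)`, without touching the code that uses it.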
An approach for reusing performance models for the development and design of the architecture is described below. Figure 2 shows a flowchart for model construction.
This work was supported in part by WAF contract F33615-93-C-1310 via subcontract SUBC-1-814-5722 from Research Triangle Institute.
A cursory study of all the components in the GEN and PROC library has to be made to get an overall view of the VHDL architecture library. It is important to know the main function of each of the devices. The next step is to analyze the task and translate it into hardware and software specifications. This step gives an insight into the hardware components needed to run a particular software algorithm. A complete study of all the components needed to construct the hardware model should be done, and one has to have a good understanding of all the generics that go with the hardware components.
Given the software specifications, the software has to be partitioned so that it can be mapped onto the processors. The software has to be studied to check whether there is any inherent parallelism within the algorithm. The next step is to make a reasonable choice of the architecture schematic and the device parameters, or generics. Then the required hardware components are selected from the library. New components should be developed if the need arises. A structural interconnection of the hardware components is made to obtain the required architecture. Given the partitioned software algorithm, the different tasks (partitions) of the algorithm have to be mapped onto different processors.
Run simulations and analyze the results. Set different generics (device parameters) and observe the behavior of the system. A detailed analysis of the results has to be made to check for validity. If the results are not satisfactory, make changes to the architecture and/or device parameters and repeat the steps from the translation of the task into hardware and software specifications. Once the results are found to be sound, optimization should be done for better throughput and utilization of the system components. Freeze the hardware architecture, the device parameters and the software partitioning, and move on to the next level of the design process, which involves a greater degree of detail.
The Memory: The memory element is basically a programmable delay with the ability to return tokens to the sender. This component supports DMA (direct memory access), in which an external device can write to the memory directly without the intervention of a processor.
3.4 The Queue: A component must be capable of scheduling outputs based on latency and priority. This output queue model is intended to interface to the outputs of all standard components. The queue handles ordering and arbitration on external signals to ensure that the token is successfully transferred.
3.6 Bus Interface Unit (BIU): The bus interface unit is a programmable delay and a token filter. BIUs are used when a bus is driven by multiple drivers.
3.7 Crossbar Interconnection Device: When a number of components are connected through a common bus, the bus becomes the communication bottleneck and results in underutilization of the resources. Hence a crossbar was developed to reduce the traffic on a common bus. This device was developed at Virginia Tech.
3.8 The Processor: The processor is the most complex device of the whole model library. The processor model is divided into three major parts. The first is the software model, which is the actual application running on the processor. The second is the scheduler or thread manager, which consists of a scheduler and a collection of control threads. The scheduler uses the threads to run the software tasks. It passes software requests for services that require intervention from the hardware to the CPU for execution. The third is the processor hardware model, which receives input control via requests to perform hardware operations from the scheduler and via interrupts from its peripherals.
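The scheduling behavior of the output queue described earlier in this section (outputs ordered by latency and priority) can be sketched in Python as follows. This is only an illustration under stated assumptions: the class and method names are hypothetical, a lower priority number is assumed to mean "more urgent", and latency is modeled as a time before which a token may not leave the queue. The VHDL queue model itself may arbitrate differently.

```python
import itertools


class OutputQueue:
    """Hypothetical output queue: releases tokens by priority once ready."""

    def __init__(self):
        self._entries = []
        self._order = itertools.count()  # tie-breaker: arrival order

    def push(self, token, priority, ready_time_us):
        # Assumption: a smaller priority value is served first.
        self._entries.append((priority, ready_time_us, next(self._order), token))

    def pop(self, now_us):
        """Return the most urgent token whose latency has elapsed, else None."""
        ready = [e for e in self._entries if e[1] <= now_us]
        if not ready:
            return None
        best = min(ready)  # lowest priority value, then earliest ready time
        self._entries.remove(best)
        return best[3]


q = OutputQueue()
q.push("a", priority=2, ready_time_us=0.0)
q.push("b", priority=1, ready_time_us=5.0)
q.push("c", priority=1, ready_time_us=1.0)
first = q.pop(now_us=0.5)   # only "a" is ready at this time
second = q.pop(now_us=2.0)  # "c" is ready and more urgent than anything left
```

Token "b" stays in the queue until the simulated time reaches 5.0 microseconds, which is how the sketch represents a token still waiting out its latency.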
The Seventeen Processor Raceway Architecture
The RACE architecture was introduced in May 1993 by Mercury Computer Systems [5]. It has been deployed in systems scaling from four to more than 700 processors. As technology advances (for example, in processor design, chip packaging and software tools), it becomes evident that an architecture must be able to evolve rapidly to maintain parity with the available technology. The RACE architecture for multiprocessors provides a high-bandwidth, low-latency system for solving real-time applications [5].
In the present study, a seventeen processor Mercury Raceway architecture has been developed. Two algorithms, the Synthetic Aperture Radar (SAR) Range-Processing Algorithm and the SAR Multiswath Processing Algorithm, are mapped onto this raceway architecture. The schematic of the architecture model is shown in Figure 3. As can be seen from Figure 3, the model contains an input device which models the radar. This is the stimulus to the system under study. The architecture has a provision for DMA (direct memory access), in which the radar writes directly to the memory without the intervention of the CPU. The output device, as shown in Figure 3, is a device to sink the tokens from the model. The results of an algorithm are written into the output device.
The Model Library
There are seventeen processors (CPUs) in total in the architecture. In Figure 3, these processors are represented as Bi-cpui (i = 1 to 17). Each processor has its own local memory module connected to it. The memory is shown as Bi-memi (i = 1 to 17).
There are six crossbar modules which are used to connect the seventeen processors together. They are shown by hexagonal blocks in Figure 3. These crossbars help prevent congestion on the global bus and facilitate faster inter-device communication.
There are fifteen biu-star clusters (bus interface units) in the raceway architecture. The biu-star component is used to connect a processor, its local memory and one of the ports of the crossbar to a common global bus. These biu-star components are represented by triangle blocks in Figure 3 (with "3" written inside the block).
There are two biu-four components (bus interface units) in the seventeen processor architecture. In Figure 3, these components are represented as diamond blocks (with "4" written inside the block). The first biu-four is employed to connect the radar, B16-cpu16, B16-mem and crossbar5 to a common bus. The second biu-four is used at the output to connect the output device, B17-cpu17, B17-mem and crossbar6 to a common bus.
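The component counts given above can be cross-checked with a small Python sketch that enumerates the architecture. Several details are assumptions made for illustration: that the fifteen biu-star units serve processors 1 to 15 (the text only states that the two biu-four units serve processors 16 and 17), that the crossbars are named crossbar1 to crossbar6 (only crossbar5 and crossbar6 are named in the text), and the `Bi-memi` spelling of the memory names. Which crossbar each biu-star attaches to is not modeled.

```python
def build_inventory():
    """Return (nodes, crossbars) for the seventeen processor architecture."""
    nodes = []
    for i in range(1, 18):
        node = {"cpu": f"B{i}-cpu{i}", "mem": f"B{i}-mem{i}"}
        if i == 16:
            node.update(biu="biu-four", attached=["radar", "crossbar5"])
        elif i == 17:
            node.update(biu="biu-four", attached=["output device", "crossbar6"])
        else:
            # Assumption: processors 1..15 are the ones behind biu-star units.
            node.update(biu="biu-star", attached=["crossbar port"])
        nodes.append(node)
    crossbars = [f"crossbar{k}" for k in range(1, 7)]
    return nodes, crossbars


def count_by_biu(nodes):
    """Count how many nodes use each kind of bus interface unit."""
    counts = {}
    for node in nodes:
        counts[node["biu"]] = counts.get(node["biu"], 0) + 1
    return counts


nodes, crossbars = build_inventory()
biu_counts = count_by_biu(nodes)
```

Under these assumptions the enumeration yields fifteen biu-star and two biu-four units, one bus interface unit per processor, which is consistent with the counts stated in the text.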
The SAR Range-Processing Algorithm
The Range-Processing Algorithm is a portion of the Synthetic Aperture Radar (SAR) algorithm. This software is mapped onto the seventeen processor raceway architecture described in section 4.
The Task Mapping
The software specification for this algorithm was produced by a mapping tool developed at the Research Triangle Institute, located in Research Triangle Park, NC. At present, these specifications must be converted into PML (Performance Model Library) code manually. We are working on an automated translation tool.
The simulation begins with the input device, the radar, sending a pulse (termed pole). The radar writes the data directly into the memory of the input processor (B16-cpu16) by direct memory access. This memory, B16-mem, then sends an interrupt to B16-cpu16, which begins its processing. B16-cpu16 unpacks the data received from each radar pulse and sends the processed data to the four range processors (B1-cpu1, B2-cpu2, B3-cpu3, B4-cpu4) in rotation. These range processors process the data in parallel. Once the range processing is done, the processed data is sent to the corner turn processor (B11-cpu11). B11-cpu11 waits for 512 processed radar pulses and carries out the bin processing. The result is then sent to the image processor (B15-cpu15), which assembles the image and in turn sends the processed data to the output processor (B17-cpu17), which stores the image in the output device.
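The distribution of pulses to the four range processors in rotation can be sketched in Python as a simple round-robin assignment. The function name is hypothetical and pulses are numbered from 1 for illustration; the sketch shows only the assignment order, not the PML code that performs it.

```python
from itertools import cycle

RANGE_PROCESSORS = ["B1-cpu1", "B2-cpu2", "B3-cpu3", "B4-cpu4"]


def assign_pulses(num_pulses):
    """Pair each pulse number (1-based) with a range processor in rotation."""
    rotation = cycle(RANGE_PROCESSORS)
    return [(pulse, next(rotation)) for pulse in range(1, num_pulses + 1)]


assignment = assign_pulses(8)
```

With eight pulses, each range processor receives two of them, and the fifth pulse returns to the first range processor.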
Results
One cycle of simulation is defined as the time between the arrival of the first pulse and the time at which the result is written into the output device. A time difference of 700 microseconds was set between the arrival of two radar pulses so that the first pulse has finished processing before the next one arrives. When this time was decreased, the Bus Interface Units at the input of the architecture timed out while waiting for the bus to become idle, and the radar tokens were lost. If this time is increased, the system is under-utilized. Hence this time is determined by the communication bottlenecks and the processing speed.
In the test described below, all the data within the algorithm is reduced by a factor of 64 for clarity of results. All the processors run at 40 MHz (25 ns clock period). The word size is set to 4 bytes. The memory access time is set to 0.025 microseconds per word, and the transfer time for all the transfer devices (crossbar, biu-four, biu-star) is set to 0.007 microseconds per word.
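The relation between these parameters can be made concrete with a short Python sketch. The clock frequency, word size and per-word times are the values stated above; the helper names, the 4096-byte message size and the idea of summing per-word costs over a number of transfer devices ("hops") are assumptions for illustration, and the sketch ignores contention and latency.

```python
CLOCK_HZ = 40e6                  # processor clock (from the text)
WORD_BYTES = 4                   # word size (from the text)
MEM_ACCESS_US_PER_WORD = 0.025   # memory access time (from the text)
TRANSFER_US_PER_WORD = 0.007     # crossbar / biu transfer time (from the text)


def clock_period_ns(clock_hz=CLOCK_HZ):
    """Clock period in nanoseconds."""
    return 1e9 / clock_hz


def words_for(num_bytes):
    """Number of words needed to hold num_bytes (rounded up)."""
    return -(-num_bytes // WORD_BYTES)


def memory_time_us(num_bytes):
    """Time to access num_bytes in memory, one word at a time."""
    return words_for(num_bytes) * MEM_ACCESS_US_PER_WORD


def transfer_time_us(num_bytes, hops=1):
    """Time to move num_bytes through `hops` transfer devices in sequence."""
    return hops * words_for(num_bytes) * TRANSFER_US_PER_WORD


period = clock_period_ns()             # 25 ns for a 40 MHz clock
mem_us = memory_time_us(4096)          # hypothetical 4096-byte message
xfer_us = transfer_time_us(4096, hops=3)
```

Note that the memory access time of 0.025 microseconds per word corresponds to exactly one 25 ns clock period.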
The activity plot for this simulation is shown in Figure 4. As can be seen from Figure 4, the total time for processing the whole algorithm is 11910 microseconds, i.e. 11.9 ms. The utilization of the range processors is 41% and that of the input processor is 36%. Figure 4 also shows that 8 pulses are sent by the radar and that the input processor sends the processed data to the four range processors in rotation. The parallelism among the range processors can be clearly seen.
The utilization of the corner turn processor (B11-cpu11) is 16%, that of the image processor (B15-cpu15) is 9%, and that of the output processor (B17-cpu17) is 1%. In Figure 4, a large region of activity can be seen after the range processing. This is when the corner turn processor begins the bin processing and writes the results to the image processor.
From the above result, the outcome of running the complete range processing algorithm, with the correct number of pulses and the right number of loops within the algorithm, can be estimated. The full processing would take approximately 64 * 11.9 ms, which is 761.6 ms. The percentage utilization of the processors would be more or less the same as in this test. The final activity plot would be a replication of the one obtained above.
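The scaling estimate above is simple arithmetic and can be reproduced with the following Python sketch. The function name is hypothetical; the sketch assumes, as the text does, that the run time scales linearly with the factor of 64 by which the data was reduced.

```python
def estimate_full_run_ms(scaled_time_us, scale_factor):
    """Linear extrapolation of a scaled-down run to the full problem size."""
    return scaled_time_us * scale_factor / 1000.0


rounded_estimate_ms = 64 * 11.9                       # as quoted in the text
unrounded_estimate_ms = estimate_full_run_ms(11910, 64)
```

Using the unrounded simulation time of 11910 microseconds gives 762.24 ms, slightly above the 761.6 ms obtained from the rounded value of 11.9 ms; the difference is a rounding effect only.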
The Multiswath Processing Algorithm
The Multiswath Processing Algorithm is a part of the Synthetic Aperture Radar (SAR) algorithm. The difference from the Range Processing Algorithm is not in the number of processors but in the more complex software that the processors are executing. This algorithm is mapped onto the seventeen processor architecture.
The Task Mapping
The simulation begins with the radar sending pulses and writing the data to B16-mem by DMA. This memory then interrupts the input processor, B16-cpu16. B16-cpu16 then begins its processing and transmits the data from each pulse to each of the range processors in succession. The data is then processed by the range processors and sent to the corner turn processor, B11-cpu11. This processor waits for a whole frame of 512 pulses and does the corner turn processing. The range processors interrupt the input processor in the middle of the frame, at the 256th pulse of a frame of 512 pulses. The input processor processes the data and sends the auxiliary data to the corner turn processor and the image processor, B15-cpu15.
When B15-cpu15 receives the auxiliary data, it is sent as a separate message to the output processor, B17-cpu17 (in reality, it is concatenated on the front of the output image), and B17-cpu17 writes this message to the output device.
When B11-cpu11 has received both the auxiliary data from B16-cpu16 and the data from the corner turn processing, it begins the range bin processing by broadcasting the subswath kernel to all nine azimuth processors: B5-cpu5, B6-cpu6, B7-cpu7, B8-cpu8, B9-cpu9, B10-cpu10, B12-cpu12, B13-cpu13 and B14-cpu14. Once the subswath processing is done by the azimuth processors, the result is sent to the image processor, B15-cpu15, where the image is assembled. B15-cpu15 sends this line of the image to the output processor, which then stores it in the output device.
Results
In this test too, the algorithm is scaled down for clarity of the results.
The activity plot for this simulation is shown in Figure 5. As can be seen in Figure 5, the first region of activity is the operation of the four range processors in parallel. The next region of activity is the corner turn processing and the broadcasting of the kernel to the azimuth processors by the corner turn processor. It can further be observed that all nine azimuth processors operate in parallel. The last region of activity is the assembling of the image by the image processor and the writing of it to the output processor, which stores the image in the output device. From Figure 5, it can be observed that the percentage utilization of the input processor is U%, that of the range processors is 19%, that of the azimuth processors is 27%, that of the corner turn processor is 15%, that of the image processor is 20% and that of the output processor is 2%. The total time for processing is 25292 microseconds. As can be seen from the results, the utilization of the processors is very low. This shows that the architecture is overdesigned for running the above software application. As explained in the methodology in section 2, at this point one should go back and change either the architecture or the partitioning of the software algorithm to obtain an optimum architecture for running these algorithms. It should be noted, however, that the goal of the present paper is to demonstrate the methodology developed for the construction of multiprocessor architectures and not to optimize the architecture.
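To give a feel for what these percentages mean in absolute terms, the following Python sketch converts five of the reported utilization figures into busy times. It rests on an assumption not stated in the text, namely that utilization is the fraction of the total processing time during which a processor is busy; the dictionary keys and the function name are illustrative.

```python
TOTAL_TIME_US = 25292  # total processing time reported for this test

# Utilization in percent, as reported for the multiswath test.
UTILIZATION_PERCENT = {
    "range processors": 19,
    "azimuth processors": 27,
    "corner turn processor": 15,
    "image processor": 20,
    "output processor": 2,
}


def busy_time_us(utilization_percent, total_time_us=TOTAL_TIME_US):
    """Busy time implied by a utilization figure (assumed busy/total)."""
    return utilization_percent * total_time_us / 100.0


busy = {name: busy_time_us(u) for name, u in UTILIZATION_PERCENT.items()}
idle_fraction = {name: 1.0 - u / 100.0 for name, u in UTILIZATION_PERCENT.items()}
```

Under this assumption, even the busiest group listed, the azimuth processors, would be idle for roughly three quarters of the run, which is the sense in which the architecture is overdesigned for this application.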
Model Validation

Methodology
It is very important to verify that the model actually describes the real system. Checking the correctness of the results is a very difficult task, especially when the software algorithm running on the hardware model is very big and complicated.
Two major techniques were adopted to validate the results of the range processing algorithm and the multiswath processing algorithm: decomposition of the software algorithm and scaling of the algorithm. The following methodology was developed to verify the correctness of the simulation results. Figure 6 shows the validation methodology flowchart.
1. If the software algorithm is very big and complicated, it is very difficult to verify the correctness of the results. Hence, the algorithm is partitioned into small portions.
2A. Each of these portions is simulated separately, and the results are analyzed and checked for validity.
2B. The following methods were adopted to check for validity.
The delay between the arrival of two pulses from the radar was set to a very large value in order to ensure that there is no traffic congestion or loss of tokens due to bus conflicts. In this way the parallelism among the processors is removed and the tokens can be traced more easily. In other words, the generics are set so that there is no parallelism in the system.
Every token that is born is given a unique token ID at run time. The run-time statistics are written into an output trace file called STD-OUTPUT. With the help of this unique ID, the path followed by every token can be traced in this file, from its birth until it reaches its destination. One can also track the latency and busy time of all the transfer devices that a token passes through on its journey from its source to its destination. This can be compared against what is actually specified in the application software algorithm to conclude whether the token has reached the correct destination in the expected amount of time.
3B. The critical tokens were traced because, without these tokens reaching their destination, no further processing would have taken place. "ASSERT" statements were added wherever possible to display at run time the number of times a loop is executed. The utilization plots were compared with the plots from the individual simulations.
In case a few processors are doing the same kind of processing, then it is possible to use the tests as described in step 2 for one of the processors and check the utilization(activity plot) to see if all these processors show similar utilization. One can zoom on the activity plot for better comparison. This step was followed to check a portion of the Range-Processing Algorithm where all four Range Processors had similar processing steps.
3C. If the results are not satisfactory, review the source code and the requirements, and identify the problem area. Debug and repeat steps 3A and 3B.
4. Finally, the whole algorithm should be run on the whole architecture and the results checked as described in step 3.
5. When the whole algorithm is verified to be running correctly, optimization for better throughput and better utilization of the components is done by varying the device parameters.
With the above approach, the SAR Range Processing Algorithm and the SAR Multiswath Processing Algorithm were validated.
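The token-tracing check used in this methodology can be illustrated with a Python sketch. The actual layout of the trace file is not described in this paper, so the record format used here, a tuple of (time in microseconds, token ID, component name), is purely an assumption, as are the function names and the sample values.

```python
from collections import defaultdict


def trace_paths(records):
    """Group trace records by token ID, ordered by time.

    Each record is a hypothetical (time_us, token_id, component) tuple.
    """
    paths = defaultdict(list)
    for time_us, token_id, component in sorted(records):
        paths[token_id].append((time_us, component))
    return dict(paths)


def end_to_end_latency_us(path):
    """Time between a token's first and last appearance in the trace."""
    return path[-1][0] - path[0][0]


def reached(path, expected_destination, deadline_us):
    """True if the token ended at the expected component within the deadline."""
    on_target = path[-1][1] == expected_destination
    return on_target and end_to_end_latency_us(path) <= deadline_us


# Hypothetical trace: token 7 travels radar -> biu-four -> B16-mem,
# while token 8 never leaves the radar.
records = [
    (27.4, 7, "B16-mem"),
    (0.0, 7, "radar"),
    (1.8, 7, "biu-four"),
    (700.0, 8, "radar"),
]
paths = trace_paths(records)
ok_7 = reached(paths[7], "B16-mem", deadline_us=30.0)
ok_8 = reached(paths[8], "B16-mem", deadline_us=30.0)
```

A token such as the hypothetical token 8, which never reaches its expected destination, is the kind of case that the comparison against the software specification is meant to expose.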
Range Processing Algorithm
Test 1: This test was conducted to validate the range processing part of the algorithm. In this test, the radar generates pulses with a delay of 25000 microseconds between two pulses. This large value was chosen so that there is enough time for the first pulse to be processed before the arrival of the next pulse, and hence there is no traffic congestion or loss of radar tokens due to bus contention.
This also makes the check easier. Just four radar pulses were sent. The simulation result is shown in Figure 7.
From Figure 7, it can be observed that the input processor (B16-cpu16) sits idle after unpacking the first pulse from the radar and writing the result to the first range processor. The input processor receives the next pulse only after 25000 microseconds. By that time, the first range processor has already finished its processing and is idle. When the next pulse is written to the second range processor, it is the only active range processor. Hence there is no overlap among the range processors, which makes the token paths easier to trace.
The model as configured for this test therefore exhibits no parallelism. With the parallelism removed, the number of tokens, the token paths, the transmission times of the tokens, and the busy times and latencies of the components were more easily validated. After the individual parts of the algorithm have been verified, parallelism will be restored to verify that bus bandwidths are not exceeded.
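The reasoning behind the large pulse spacing can be illustrated with a simplified Python sketch: if each pulse keeps the processing chain busy for less time than the spacing between pulses, the busy intervals cannot overlap. The per-pulse busy time of 1500 microseconds is a hypothetical value chosen for illustration, not a result from the simulations, and the sketch treats the whole chain as a single busy interval per pulse.

```python
def busy_intervals(num_pulses, pulse_spacing_us, busy_time_us):
    """Busy interval (start, end) triggered by each pulse, in microseconds."""
    return [
        (k * pulse_spacing_us, k * pulse_spacing_us + busy_time_us)
        for k in range(num_pulses)
    ]


def any_overlap(intervals):
    """True if any interval starts before the previous one has ended."""
    ordered = sorted(intervals)
    return any(prev[1] > cur[0] for prev, cur in zip(ordered, ordered[1:]))


HYPOTHETICAL_BUSY_US = 1500  # assumed per-pulse busy time, for illustration

serial = any_overlap(busy_intervals(4, 25000, HYPOTHETICAL_BUSY_US))
overlapped = any_overlap(busy_intervals(4, 700, HYPOTHETICAL_BUSY_US))
```

With the 25000 microsecond spacing of Test 1 the intervals stay disjoint, while under the same assumed busy time a 700 microsecond spacing would make consecutive pulses overlap.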
Multiswath Processing Algorithm
Test 2: The most complicated portion of the algorithm is the azimuth processing. In this test, only the azimuth processing part of the algorithm is validated. The corner turn processor receives processed data from the range processors and auxiliary data from the input processor. It then broadcasts the subswath kernel pointer to all the azimuth processors. The azimuth processors process the data and send the range bins to the image processor for assembling of the image.
In this test, all the processors in the architecture run at 40 MHz. The memory access time for all the memory modules was set to 0.025 microseconds per word (4 bytes). The latencies through the transfer devices were set to zero. The transfer time for these devices was set to 0.007 microseconds per word (4 bytes). The delay between the arrival of two pulses from the radar was set to 725 microseconds. The activity plot is shown in Figure 8.
As can be clearly observed from Figure 8, the first region of activity is that of the input processor unpacking data from the radar and sending an interrupt to the corner turn processor. Once the corner turn processor (B11-cpu11) receives the interrupt from the input processor, there is a patch of activity around this processor as it executes the corner turn processing, broadcasts the subswath kernel to the azimuth processors and sends range bins to the azimuth processors. The nine long regions of activity are those of the azimuth processors. It can further be observed that these nine azimuth processors run in parallel.
As described earlier, all tokens were traced for one of the azimuth processors but only the critical tokens were traced for the others since all azimuth processors have similar processing steps. By observing that the utilization plots of all azimuth processors were similar to the one that was traced exhaustively, that all critical tokens reached their final destination, and that the remaining part of the algorithm started on schedule, we conclude that this part of the algorithm is validated.
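The comparison of utilization plots across processors that perform the same kind of processing can be expressed as a simple consistency check. In the Python sketch below, the utilization values, the tolerance of two percentage points and the function name are all hypothetical; in this work the comparison was made on the activity plots rather than by a script.

```python
def utilization_outliers(utilization_percent, reference, tolerance_points=2.0):
    """Names of processors whose utilization differs from the reference
    processor by more than `tolerance_points` percentage points."""
    ref_value = utilization_percent[reference]
    return sorted(
        name
        for name, value in utilization_percent.items()
        if abs(value - ref_value) > tolerance_points
    )


# Hypothetical utilization figures for four of the azimuth processors,
# with B5-cpu5 taken as the exhaustively traced reference.
azimuth_utilization = {
    "B5-cpu5": 27.0,
    "B6-cpu6": 26.5,
    "B7-cpu7": 27.4,
    "B8-cpu8": 12.0,  # deliberately inconsistent, to show a flagged case
}
suspects = utilization_outliers(azimuth_utilization, reference="B5-cpu5")
```

A processor flagged in this way would be a candidate for the exhaustive token trace that was otherwise applied to only one processor of the group.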
Results
This paper has presented a methodology for constructing VHDL performance models which will help significantly reduce the time from an initial conception to a working design. To further reduce development time, existing structural primitives were reused. In addition, an efficient methodology has been developed to validate performance models of complex multiprocessor architectures. This method makes validation faster and less difficult. Moreover, this work has resulted in the development of a high-level VHDL library of hardware models and software algorithms. These models can be reused as primitives for the development of other new models with little or no modification.
