It has long been recognized that there exists an upper bound on the computing speed attainable with uniprocessor architectures. The state of the industry has been such that the regular performance improvements in silicon technology have kept the need for widespread use of tightly-coupled parallel architectures to a minimum. However, with the increasing computational demands placed upon uniprocessor architectures, the time is rapidly approaching when tightly-coupled parallel architectures will be required.
In order to demonstrate the effectiveness of using tightly-coupled parallel architectures for avionics applications, a representative avionics algorithm was parallelized and implemented on three parallel topologies constructed from Texas Instruments' Parallel Digital Signal Processor TMS320C40. The algorithm's execution time was measured and estimated on different parallel topologies, including ring, four nearest neighbor mesh, and eight nearest neighbor mesh. It was found that due to the low communication-to-computation ratio of the algorithm, near-linear speedups were obtained for the parallel implementations of the algorithm.
Description of Testing Equipment and Topologies
The C40 is especially suited for parallel processing since it has six on-chip eight-bit communication ports which can interface with other C40s with no external logic. The six ports allow system designers to configure large numbers of C40s into any of a number of different topologies without having to design interface logic [1].
The testing equipment used for this study was a Parallel Processing Development System (PPDS) available from Texas Instruments, which has four TMS320C40 (C40) chips on board [2]. Since the PPDS has four C40s configured as a fully-connected mesh (FCM), several topologies become immediately apparent for implementing AS4. Among these topologies are a fully-connected mesh, a four nearest neighbor mesh (4NNM), and a ring. These topologies are shown in Figure 1, which also shows the layout on the PPDS.
In general, for computation-bound algorithms (i.e., algorithms that require more computation time than communication time), the higher the communication-to-computation ratio, the worse the algorithm's performance. Thus, for these types of algorithms, it becomes imperative to reduce the amount of communication as a whole and to optimize the essential communication in order to ensure the highest performance. Investigations into the behavior of AS4 have shown it to be a computation-bound algorithm: the communication-to-computation ratio is very low (on the order of 0.017:1). Thus, for this particular algorithm, no extensive effort is required to reduce the amount of communication performed. However, since the C40 must route data using the store-and-forward (SAF) routing method (unless additional software support for routing is used), it is beneficial to minimize both the amount of non-nearest neighbor communication and the time required for nearest neighbor communication.
Therefore, all mappings were made with the goal of reducing either the number of hops required to send data from one processor to another, the amount of time required to send a message to the nearest neighbor, or both. For the purposes of this study, it will be assumed that all images processed by AS4 are square with an edge size of N pixels, for a total of N^2 pixels per image.
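A rough worked estimate can illustrate why a low communication-to-computation ratio suggests near-linear speedup. This is a sketch under stated assumptions, not the model used in the study: computation is assumed to divide evenly over p processors, communication is assumed not to overlap with computation, and the ratio r is assumed to apply to each processor in the parallel run. The symbols T_1, T_comp, T_comm, r, and S(p) are introduced here for illustration only.

```latex
% Assumed model: T_1 = p * T_comp (even division), r = T_comm / T_comp
\[
  S(p) \;=\; \frac{T_1}{T_{\mathrm{comp}} + T_{\mathrm{comm}}}
       \;=\; \frac{p\,T_{\mathrm{comp}}}{T_{\mathrm{comp}}\,(1 + r)}
       \;=\; \frac{p}{1 + r}
\]
% With the ratio r = 0.017 quoted above and p = 4 processors:
\[
  S(4) \;=\; \frac{4}{1.017} \;\approx\; 3.93
\]
```

Under these assumptions the loss relative to ideal speedup is under two percent, which is consistent with the qualitative claim that little effort is needed to reduce communication for this algorithm.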
Four Nearest Neighbor Mesh
If the 4NNM follows a square pattern (i.e., the same number of processors on all edges), then for an array of processors with edge size n the mapping reduces to dividing the image into n^2 pieces of N^2/n^2 pixels each, as shown in Figure 2, which indicates the region of the image covered by each processor. In this topology, all horizontal and vertical data transfers can be done in one hop, and all diagonal data transfers will take two hops.
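The block partitioning described above can be sketched as follows. This is a minimal illustration, assuming that N is an exact multiple of n and that processors are addressed by a (row, col) mesh position; the function names `block_bounds` and `block_pixels` are hypothetical and do not come from the study.

```python
def block_bounds(N, n, row, col):
    """Return (row_start, row_stop, col_start, col_stop) for the image block
    assigned to the processor at mesh position (row, col).

    Assumes N is an exact multiple of n. Stop indices are exclusive.
    """
    if N % n != 0:
        raise ValueError("this sketch assumes N is divisible by n")
    if not (0 <= row < n and 0 <= col < n):
        raise ValueError("mesh position lies outside the n x n array")
    side = N // n  # edge length of one block, in pixels
    return (row * side, (row + 1) * side, col * side, (col + 1) * side)


def block_pixels(N, n):
    """Number of pixels handled by each processor: N^2 / n^2."""
    if N % n != 0:
        raise ValueError("this sketch assumes N is divisible by n")
    return (N // n) ** 2
```

For a 64 x 64 image on a 2 x 2 array, each processor receives a 32 x 32 block.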
Eight Nearest Neighbor Mesh
In cases where n is relatively large (i.e., n ≥ 9), if the FCM connection pattern shown in Figure 1(a) is extrapolated (i.e., two horizontal links, two vertical links, and four diagonal links per processor), it becomes an eight nearest neighbor mesh (8NNM).
In the same manner as the 4NNM, mapping AS4 to an 8NNM will entail dividing the image into n^2 pieces of N^2/n^2 pixels, as shown in Figure 2. In this topology, all horizontal, vertical, and diagonal transfers can be done in one hop. Since the C40 has only six communication ports, an 8NNM cannot be formed with more than four processors, unless a custom hardware design is implemented to allow greater connectivity in the network.
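The hop counts stated for the two mesh topologies can be expressed as distance functions. This is a sketch assuming minimal-path routing on a mesh without wraparound links; the names `hops_4nnm`, `hops_8nnm`, and `neighbor_exchange_hops` are hypothetical.

```python
def hops_4nnm(src, dst):
    """Minimum hops between (row, col) positions on a 4NNM.

    Only horizontal and vertical links exist, so this is the Manhattan
    distance (assumes no wraparound links).
    """
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


def hops_8nnm(src, dst):
    """Minimum hops between (row, col) positions on an 8NNM.

    Diagonal links allow a row and a column step in one hop, so this is
    the Chebyshev distance.
    """
    return max(abs(src[0] - dst[0]), abs(src[1] - dst[1]))


def neighbor_exchange_hops(hop_fn):
    """Total hops for one interior processor to reach all eight neighbors."""
    offsets = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)
               if (dr, dc) != (0, 0)]
    return sum(hop_fn((0, 0), off) for off in offsets)
```

With store-and-forward routing, each extra hop adds a full message transfer, which is why the diagonal transfers are the ones that differ between the two meshes.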
Ring Network
Mapping AS4 onto a ring network is a straightforward operation: one divides the image into stripes (rectangular sections of N x N/n pixels), as shown in Figure 3 .
One particular feature of this approach is that the number of pixels that must be sent from processor to processor remains constant regardless of the number of processors used. The advantage of this feature is that predicting the scalability of this approach becomes a trivial problem, since the amount of communication required by each processor remains constant and the amount of computation can be estimated by scaling. However, this can also be a disadvantage, since the communication-to-computation ratio increases as the number of nodes increases (i.e., the number of pixels processed per node decreases as the number of nodes increases, but the amount of communication per node remains constant independent of the number of nodes in the topology). This, therefore, impacts the maximum attainable speedup gained from parallelization.
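The stripe partitioning and its fixed per-node communication can be sketched in terms of pixel counts. This is an illustration only: it assumes N is an exact multiple of n, and it assumes each node exchanges up to three border rows with each of its two neighbors, a figure taken from the ring-based tests described later in this paper. The function names are hypothetical, and the counts are pixels, not measured times.

```python
def stripe_rows(N, n, k):
    """Return (start, stop) image rows for node k of an n-node ring.

    Assumes N is an exact multiple of n. The stop index is exclusive.
    """
    if N % n != 0:
        raise ValueError("this sketch assumes N is divisible by n")
    if not 0 <= k < n:
        raise ValueError("node index lies outside the ring")
    height = N // n
    return (k * height, (k + 1) * height)


def pixels_sent_per_node(N, border_rows=3):
    """Upper bound on pixels a node sends: border rows to two neighbors.

    Independent of the ring size n, as the text notes.
    """
    return 2 * border_rows * N


def pixels_computed_per_node(N, n):
    """Pixels in one N x N/n stripe; shrinks as n grows."""
    if N % n != 0:
        raise ValueError("this sketch assumes N is divisible by n")
    return N * (N // n)
```

Doubling the ring size halves `pixels_computed_per_node` while `pixels_sent_per_node` is unchanged, which is the scaling behavior described above.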
3. Performance Measurements of AS4 on the PPDS
AS4 was partitioned in two different ways for placement on the PPDS.
[Figure: Partitioning Schemes Used on PPDS]
4NNM and 8NNM Implementations of AS4
For these tests, a 64 x 64 pixel image size was used, and multiple frames of identical data were used for images, due to the memory limitations of the PPDS. (The identical nature of the data will not affect the results, since the measurements are based upon the number of operations performed, not the "correctness" of the answer obtained.)
The numbers reported are per-frame averages over the multiple frames that were processed.
Table 1 shows the communication times, computation times, total times, and speedups measured for AS4 on a 4NNM network, and Table 2 shows the corresponding measurements for an 8NNM network.
Overall, speedups for the multiprocessor versions, relative to their respective uniprocessor versions, averaged around

This average will almost certainly never be obtained; even though some processors finish before others, due to AS4's pipelined structure all processors must complete the current operation before the next operation can begin. Therefore, the expected speedup is limited by the slowest processor, and thus the expected speedup is 3.925 for the 4NNM and 3.974 for the 8NNM.
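The distinction between the average speedup and the speedup limited by the slowest processor can be sketched as follows. This is an illustration under the assumption that each processor's speedup is the uniprocessor time divided by that processor's total time; the function names and any timing values passed in are hypothetical, not measurements from the study.

```python
def average_speedup(t_uni, per_processor_times):
    """Mean of the individual per-processor speedups."""
    speedups = [t_uni / t for t in per_processor_times]
    return sum(speedups) / len(speedups)


def pipelined_speedup(t_uni, per_processor_times):
    """Speedup when every processor must wait for the slowest one.

    Each pipeline step ends only when the slowest processor finishes,
    so the effective parallel time is the maximum per-processor time.
    """
    return t_uni / max(per_processor_times)


# Hypothetical per-frame times in arbitrary units (not from the paper).
example_times = [25.0, 25.2, 25.5, 26.0]
limited = pipelined_speedup(100.0, example_times)
average = average_speedup(100.0, example_times)
```

Whenever the per-processor times are not all equal, the slowest-processor figure is strictly below the average, which is why the average is not expected to be obtained in practice.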
Projections were not made for larger arrays of 4NNM and 8NNM topologies due to an inherent shortcoming of the partitioning method. Specifically, with nine or more processors, at least one processor in the array will be forced to communicate with all eight of its neighbors (see Figure 2), but several of the processors (those on the edges of the array) will only communicate with at most five neighbors. Tests will be made in the future to determine the impact the extra communication will have on the performance of AS4 in networks of more than nine processors, but the data presented here cannot be relied upon to make any such projections.
Ring-Based Implementations of AS4
Just as in the 4NNM and 8NNM tests, a 64 x 64 pixel image size was used, and multiple frames of identical data were used for images.
The ring implementation of AS4 was run on the PPDS system for one- and four-node networks. For the ring implementation, each node had an identical program, and thus the interprocessor communication remained constant regardless of the size of the ring. For these reasons, larger ring networks (8-node, 16-node, etc.) could be simulated accurately even though the PPDS only has four processors. For example, an eight-node ring can be simulated by placing one eighth of the image on each of the four processors (each node gets an 8 x 64 pixel stripe). Each node still needs to communicate up to three border rows with each of its two near neighbors. For this example, the four nodes combined only process a 32 x 64 pixel array, but each processor does the same amount of work as a processor in an eight-node system processing a full 64 x 64 pixel image.
Eight- and 16-node rings were simulated, along with 1- and 4-node implementations. Table 3 shows the communication, computation, and total times, as well as speedups for each processor in the different array sizes.
There are two reasons why the ideal and actual speedups differ. The first is simple overhead: partitioning an algorithm for parallelism requires interprocessor communication, and communication cannot be performed instantaneously, even between near neighbors. The other reason is that the communication-to-computation ratio for the ring-based implementation increases as the size of the ring increases. Since the amount of time spent communicating is essentially independent of the size of the ring (see column two of Table 3) and the amount of time spent computing is inversely proportional to the size of the ring, the ratio grows as the ring grows.
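The workload arithmetic behind simulating a larger ring on the four-processor PPDS can be sketched as follows. This is a minimal illustration assuming the image height N is an exact multiple of the simulated ring size; the function name `simulated_ring_workload` and its parameters are hypothetical.

```python
def simulated_ring_workload(N, ring_size, physical_nodes=4):
    """Return ((stripe_rows, stripe_cols), (combined_rows, combined_cols)).

    Each physical node processes the stripe that one node of a
    `ring_size`-node ring would receive from an N x N image. The
    combined shape is what all physical nodes process together.
    """
    if N % ring_size != 0:
        raise ValueError("this sketch assumes N is divisible by ring_size")
    rows = N // ring_size
    stripe = (rows, N)
    combined = (rows * physical_nodes, N)
    return stripe, combined
```

For N = 64 and an eight-node ring this gives an 8 x 64 stripe per node and a combined 32 x 64 array, matching the example in the text.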
Processing Speed Considerations
In order for an IRMW algorithm to be effective, it must be able to detect missiles (in this case point targets with particular spectral ratios) in an image that may contain large amounts of clutter and may be subject to unpredictable frame-to-frame registration (i.e., the data may not be registered due to aircraft motion). One way of reducing the effects of aircraft motion on the algorithm is to process incoming frames at a frame rate much higher than the standard video frame rate of 30 frames per second.
The ability of a uniprocessor architecture to process frames at high frame rates is limited, and thus in order to be capable of processing images at an arbitrary frame rate, a parallel architecture will be required. Studies have shown that it is possible to achieve frame rates much higher than 30 frames per second for parallel versions of AS4.
The ring-based implementation suffers from two problems: the first is the increasing communication-to-computation ratio as the size of the ring increases, and the second is inherent in the partitioning scheme. Since images are partitioned based upon the number of rows in the image, and due to the windowing operations of AS4 (see [5]), an upper limit on the number of processors that can be used is quickly reached.
Therefore, for very high frame rates, a ring-based topology will not be suitable.
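The effect of a constant per-node communication time on ring scalability can be written as a simple bound. This is a sketch under assumptions not stated in the source: the uniprocessor run is taken to involve computation only, computation is taken to divide evenly over n nodes, and communication is not overlapped with computation. The symbols T_c (uniprocessor computation time), T_m (per-node communication time), and S(n) are introduced here for illustration.

```latex
% Assumed model: per-node time = T_c / n + T_m, with T_m independent of n
\[
  S(n) \;=\; \frac{T_c}{\dfrac{T_c}{n} + T_m}
       \;=\; \frac{n}{1 + n\,\dfrac{T_m}{T_c}}
\]
% As the ring grows, the speedup saturates:
\[
  \lim_{n \to \infty} S(n) \;=\; \frac{T_c}{T_m}
\]
```

Under this model the attainable speedup, and hence the attainable frame rate, is capped by the ratio of computation time to communication time, independent of how many nodes are added.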
The 8NNM topology faces a different problem: since the C40 has only six communication ports, extra interface logic must be used in order to construct an array of more than four processors. Although the performance of AS4 was better on an 8NNM than on a 4NNM, as shown in Table 2, the performance was not so much better that it would warrant the extra expenditure required to construct an 8NNM.
Since AS4 operates on two-dimensional images, and the 4NNM topology follows a two-dimensional layout, it is expected that a 4NNM will provide the best price-to-performance ratio of the three topologies discussed here. It requires no additional logic circuitry, unlike the 8NNM, and it is capable of scaling to a much larger extent than the ring-based topology.
5. Summary
For several parallel implementations of a selected avionics algorithm, the speedups gained by parallelizing were close to linear, as expected. This demonstrates that the performance of certain avionics algorithms can be improved by moving to a parallel topology. For the particular algorithm studied here, it was determined that a four nearest neighbor mesh topology would be best able to support the processing requirements of the algorithm, both in terms of partitioning and in terms of number of frames of data processed per second.
