KAI HWANG and ZHIWEI XU
In this article, we assess the state-of-the-art technology in massively parallel processors (MPPs) and their variations in different architectural platforms. Architectural and programming issues are identified in using MPPs for time-critical applications such as adaptive radar signal processing.
First, we review the enabling technologies. These include high-performance CPU chips and system interconnects, distributed memory architectures, and various latency hiding mechanisms. We characterize the concept of scalability in three areas: resources, applications, and technology. Scalable performance attributes are analytically defined. Then we compare MPPs with symmetric multiprocessors (SMPs) and clusters of workstations (COWs). The purpose is to reveal their capabilities, limits, and effectiveness in signal processing.
In particular, we evaluate the IBM SP2 at MHPCC [33], the Intel Paragon at SDSC [38], the Cray T3D at Cray Eagan Center [1], and the Cray T3E and ASCI TeraFLOP system recently proposed by Intel [32]. On the software and programming side, we evaluate existing parallel programming environments, including the models, languages, compilers, software tools, and operating systems. Some guidelines for program parallelization are provided. We examine data-parallel, shared-variable, message-passing, and implicit programming models. Communication functions and their performance overhead are discussed. Available software tools and communication libraries are introduced.
Our experiences in porting the MIT Lincoln Laboratory STAP (space-time adaptive processing) benchmark programs onto the SP2, T3D, and Paragon are reported. Benchmark performance results are presented along with some scalability analysis on machine and problem sizes. Finally, we comment on using these scalable computers for signal processing in the future.
Scalable Parallel Computers
A computer system, including hardware, system software, and applications software, is called scalable if it can scale up to accommodate ever-increasing user demand, or scale down to improve cost-effectiveness. We are most interested in scaling up: improving hardware and software resources in the expectation of a proportional increase in performance. Scalability is a multidimensional concept, ranging from resource and application to technology [12, 27, 37].
Resource scalability refers to gaining higher performance or functionality by increasing the machine size (i.e., the number of processors), investing in more storage (cache, main memory, disks), and improving the software. Commercial MPPs have limited resource scalability. For instance, the normal configuration of the IBM SP2 only allows for up to 128 processors. The largest SP2 system installed to date is the 512-node system at Cornell Theory Center [14], requiring a special configuration.
Technology scalability refers to a scalable system which can adapt to changes in technology. It should be generation scalable: When part of the system is upgraded to the next generation, the rest of the system should still work. For instance, the most rapidly changing component is the processor. When the processor is upgraded, the system should be able to provide increased performance, using existing components (memory, disk, network, OS, and application software, etc.) in the remaining system. A scalable system should enable integration of hardware and software components from different sources or vendors. This will reduce the cost and expand the system's usability. This heterogeneity scalability concept is called portability when used for software. It calls for using components with an open, standard architecture and interface. An ideal scalable system should also allow space scalability. It should allow scaling up from a desktop machine to a multi-rack machine to provide higher performance, or scaling down to a board or even a chip to be fit in an embedded signal processing system.
To fully exploit the power of scalable parallel computers, the application programs must also be scalable. Scalability over machine size measures how well the performance will improve with additional processors. Scalability over problem size indicates how well the system can handle large problems with large data size and workload. Most real parallel applications have limited scalability in both machine size and problem size. For instance, a coarse-grain parallel radar signal processing program may use at most 256 processors to handle at most 100 radar channels. These limitations cannot be removed by simply increasing machine resources. The program has to be significantly modified to handle more processors or more radar channels.
Large-scale computer systems are generally classified into six architectural categories [25]: the single-instruction-multiple-data (SIMD) machines, the parallel vector processors (PVPs), the symmetric multiprocessors (SMPs), the massively parallel processors (MPPs), the clusters of workstations (COWs), and the distributed shared memory multiprocessors (DSMs). SIMD computers are mostly for special-purpose applications, which are beyond the scope of this paper. The remaining categories are all MIMD (multiple-instruction-multiple-data) machines.
Important common features in these parallel computer architectures are characterized below:
Commodity Components: Most systems use commercial off-the-shelf commodity components such as microprocessors, memory chips, disks, and key software.
MIMD: Parallel machines are moving towards the MIMD architecture for general-purpose applications. A parallel program running on such a machine consists of multiple processes, each executing a possibly different code on a processor autonomously.
Asynchrony: Each process executes at its own pace, independent of the speed of other processes. The processes can be forced to wait for one another through special synchronization operations, such as semaphores, barriers, and blocking-mode communications.
Parallel Vector Processors
The structure of a typical PVP is shown in Fig. 1a. Examples of PVPs include the Cray C-90 and T-90. Such a system contains a small number of powerful custom-designed vector processors (VPs), each capable of at least 1 Gflop/s performance. A custom-designed, high-bandwidth crossbar switch connects these vector processors to a number of shared memory (SM) modules. For instance, in the T-90, the shared memory can supply data to a processor at 14 GB/s. Such machines normally do not use caches, but they use a large number of vector registers and an instruction buffer.
Symmetric Multiprocessors
The SMP architecture is shown in Fig. 1b. Examples include the Cray CS6400, the IBM R30, the SGI Power Challenge, and the DEC Alphaserver 8000. Unlike a PVP, an SMP system uses commodity microprocessors with on-chip and off-chip caches. These processors are connected to a shared memory through a high-speed bus. On some SMPs, a crossbar switch is also used in addition to the bus. SMP systems are heavily used in commercial applications, such as database systems, on-line transaction systems, and data warehouses. It is important for the system to be symmetric, in that every processor has equal access to the shared memory, the I/O devices, and the operating system. This way, a higher degree of parallelism can be exploited, which is not possible in an asymmetric (or master-slave) multiprocessor system.
Massively Parallel Processors
To take advantage of the higher parallelism available in applications ... Fig. 1c is more restricted, representing machines such as the Intel Paragon. Such a machine consists of a number of processing nodes, each containing one or more microprocessors interconnected by a high-speed memory bus to a local memory and a network interface circuitry (NIC). The nodes are interconnected by a high-speed, proprietary communication network.
Distributed Shared Memory Systems
DSM machines are modeled in Fig. 1d, based on the Stanford DASH architecture. A cache directory (DIR) is used to support distributed coherent caches [30]. The Cray T3D is also a DSM machine, but it does not use the DIR to implement coherent caches. Instead, the T3D relies on special hardware and software extensions to achieve the DSM at arbitrary block-size level, ranging from words to large pages of shared data. The main difference between DSM machines and SMPs is that the memory is physically distributed among different nodes. However, the system hardware and software create the illusion of a single address space for application users.
MPP Architectural Evaluation

Clusters of Workstations
Architectural features of five MPPs are summarized in [36] .
MPP Architectures
Among the three existing MPPs, the SP2 has the most powerful processors for floating-point operations. Each POWER2 processor has a peak speed of 267 Mflop/s, almost two to three times higher than each Alpha processor in the T3D and each i860 processor in the Paragon, respectively. The Pentium Pro processor in the ASCI TFLOPS machine has the potential to compete with the POWER2 processor in the future. The latency is the time to send an empty message. The bandwidth refers to the asymptotic bandwidth for sending large messages. While the bandwidth is mainly limited by the communication hardware, the latency is mainly limited by the software overhead. The distributed shared memory design of the T3D allows it to achieve the lowest latency of only 2 µs.
Message passing is supported as a native programming model in all three MPPs. The T3D is the most flexible machine in terms of programmability. Its native MPP programming language (called Cray Craft) supports three models: the data parallel Fortran 90, shared-variable extensions, and message-passing PVM [18]. All MPPs also support the standard Message-Passing Interface (MPI) library [20]. We have used MPI to code the parallel STAP benchmark programs. This approach makes them portable among all three MPPs.
Our MPI-based STAP benchmarks are readily portable to the next generation of MPPs, namely the T3E, the ASCI, and the successor to SP2. In 1996 and beyond, this implies that the portable STAP benchmark suite can be used to evaluate these new MPPs. Our experience with the STAP radar benchmarks can also be extended to convert SAR (synthetic aperture radar) and ATR (Automatic target recognition) programs for parallel execution on future MPPs.
Hot CPU Chips
Most current systems use commodity microprocessors. With the widespread use of microprocessors, the chip companies can afford to invest huge resources into research and development on microprocessor-based hardware, software, and applications. Consequently, the low-cost commodity microprocessors are approaching the performance of custom-designed processors used in Cray supercomputers. The speed of commodity microprocessors has been increasing steadily, almost doubling every 18 months during the past decade.
From Table 3 ...
Scalable Growth Trends
Table 4 and Fig. 2 illustrate the evolution trends of the Cray supercomputer family and of the Intel MPP family. Commodity microprocessors have been improving at a much faster rate than custom-designed processors. The peak speed of Cray processors has improved 12.5 times in 16 years, half of which comes from faster clock rates. In 10 years, the peak speed of the Intel microprocessors has increased 5000 times, of which only a factor of 25 comes from faster clock rates; the remaining factor of 200 comes from advances in the processor architecture. Over the same period, the one-way, point-to-point communication bandwidth of the Intel MPPs has increased 740 times, and the latency has improved by 86.2 times. Cray supercomputers use fast SRAMs as the main memory. The custom-designed crossbar provides high bandwidth and low communication latency. As a consequence, applications running on Cray supercomputers often have higher utilizations (15% to 45%) than those (1% to 30%) on MPPs.
Performance Metrics for Parallel Applications
We define below performance metrics used on scalable parallel computers. The terminology is consistent with that proposed by the Parkbench group [25] , which is consistent with the conventions used in other scientific fields, such as physics. These metrics are summarized in Table 5 .
Performance Metrics
The parallel computational steps in a typical scientific or signal processing application are illustrated in Fig. 3. The algorithm consists of a sequence of k steps. Semantically, all operations in a step should finish before the next step can begin.
Step i has a computational workload of $W_i$ million floating-point operations (Mflop) and takes $T_1(i)$ seconds to execute on one processor. It has a degree of parallelism of $DOP_i$. In other words, when executing on n processors with $1 \le n \le DOP_i$, the parallel execution time for step i becomes $T_n(i) = T_1(i)/n$. The execution time cannot be further reduced by using more processors. We assume all interactions (communication and synchronization operations) happen between consecutive steps. We denote the total interaction overhead as $T_o$. Traditionally, four metrics have been used to measure the performance of a parallel program: the parallel execution time, the speed (or sustained speed), the speedup, and the efficiency, as shown in Table 5. We have found that several additional metrics are also very useful in performance analysis.
A shortcoming of the speedup and efficiency metrics is that they tend to favor slow programs. In other words, a slower parallel program can have higher speedup and efficiency than a faster one. The utilization metric does not have this problem. It is defined as the ratio of the measured n-processor speed of a program to the peak speed of an n-processor system. In Table 5, $P_{peak}$ is the peak speed of a single processor. The critical path and the average parallelism are two extreme-value metrics, providing a lower bound for execution time and an upper bound for speedup, respectively.
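The metrics described above can be sketched in a few lines of Python. Table 5 is not reproduced in this text, so the formulas below are the conventional definitions and are assumed to match it: parallel time as the sum of per-step times plus overhead, speed as workload over time, speedup as one-processor time over n-processor time, efficiency as speedup over n, and utilization as speed over n times the peak speed. The three-step workload and the 266 Mflop/s peak are illustrative values only.

```python
from dataclasses import dataclass

@dataclass
class Step:
    workload_mflop: float  # W_i, workload of the step in Mflop
    t1_seconds: float      # T_1(i), time of the step on one processor
    dop: int               # DOP_i, degree of parallelism of the step

def parallel_time(steps, n, overhead=0.0):
    """T_n: each step runs on min(n, DOP_i) processors, plus overhead T_o."""
    return sum(s.t1_seconds / min(n, s.dop) for s in steps) + overhead

def metrics(steps, n, p_peak_mflops, overhead=0.0):
    """Return time, speed, speedup, efficiency, and utilization on n processors."""
    workload = sum(s.workload_mflop for s in steps)
    t1 = sum(s.t1_seconds for s in steps)
    tn = parallel_time(steps, n, overhead)
    speed = workload / tn            # sustained Mflop/s
    speedup = t1 / tn
    return {
        "time": tn,
        "speed": speed,
        "speedup": speedup,
        "efficiency": speedup / n,
        "utilization": speed / (n * p_peak_mflops),
    }

# Hypothetical three-step program (values chosen only for illustration).
fast = [Step(100.0, 2.0, 1), Step(800.0, 16.0, 64), Step(100.0, 2.0, 8)]
# Same workload, but every step computes twice as slowly.
slow = [Step(s.workload_mflop, 2 * s.t1_seconds, s.dop) for s in fast]

m_fast = metrics(fast, n=16, p_peak_mflops=266.0, overhead=0.25)
m_slow = metrics(slow, n=16, p_peak_mflops=266.0, overhead=0.25)
```

With the same overhead, the slower program reports the higher speedup and efficiency while its speed and utilization are lower, which is the bias noted in the text.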
Communication Overhead
Xu and Hwang [43] have shown that the time of a communication operation can be estimated by a general timing model:
$$t(m, n) = t_0(n) + \frac{m}{r_\infty(n)} \qquad (1)$$
where m is the message length in bytes, and the latency $t_0(n)$ and the asymptotic bandwidth $r_\infty(n)$ can be linear or nonlinear functions of n. For instance, timing expressions are obtained for some MPL message-passing operations on the SP2, as shown in Table 6. Details on how to derive these and other expressions are treated in [43], where the MPI performance on the SP2 is also compared to the native IBM MPL operations. The total overhead $T_o$ is the sum of the times of all interaction operations occurring in a parallel program.
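As a minimal sketch of how such a timing model is used, the Python below assumes the usual latency-plus-transfer form, t = t0(n) + m / r_inf(n), and sums it over a list of messages to estimate the total overhead. The coefficient functions `t0` and `r_inf` are hypothetical placeholders, not the SP2 expressions of Table 6.

```python
import math

def comm_time(m_bytes, n, latency, bandwidth):
    """Estimated time of one operation: latency(n) + m / bandwidth(n), in seconds."""
    return latency(n) + m_bytes / bandwidth(n)

# Hypothetical coefficient functions (assumptions for illustration only).
def t0(n):
    """Latency in seconds, growing logarithmically with machine size n."""
    return 50e-6 * math.log2(n) + 40e-6

def r_inf(n):
    """Asymptotic bandwidth in bytes/s, taken as independent of n here."""
    return 35e6

def total_overhead(message_sizes, n):
    """T_o: sum of the estimated times of all interaction operations."""
    return sum(comm_time(m, n, t0, r_inf) for m in message_sizes)

# An empty message costs exactly the latency; a long one is bandwidth-bound.
empty = comm_time(0, 16, t0, r_inf)
long_msg = comm_time(35e6, 2, t0, r_inf)
overhead = total_overhead([0, 1024, 1 << 20], n=16)
```

For short messages the latency term dominates, while for long messages the transfer term does, which is why the text attributes latency to software overhead and bandwidth to hardware.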
Parallel Programming Models
Four models for parallel programming are widely used on parallel computers: implicit, data parallel, message-passing, and shared variable. 
Interaction issues address how to allocate workload, how to distribute data to different processors, and how to synchronize and communicate among the processors.
Semantic issues consider termination, determinacy, and correctness properties. Parallel programs are much more complex than sequential codes. In addition to infinite looping, parallel programs can deadlock or livelock. They can also be indeterminate: the same input could produce different results. Parallel programs are also more difficult to test, to debug, or to prove for correctness.
Programmability issues refer to whether a programming model facilitates the development of portable and efficient application codes.
The Implicit Model
With this approach, programmers write codes using a familiar sequential programming language (e.g., C or Fortran). The compiler and its run-time support system are responsible for resolving all the programming issues in Table 7. Examples of such compilers include KAP from Kuck and Associates [29] and FORGE from Advanced Parallel Research [7]. These are platform-independent tools, which automatically convert a standard sequential Fortran program into a parallel code.
The Data Parallel Model
The data parallel programming model is used in standard languages such as 
Explicit Interactions:
The programmer must resolve all the interaction issues, including data mapping, communication and synchronization. The workload allocation is usually done through the owner-compute rule, i.e., the process which owns a piece of data performs the computations associated with it. Both shared-variable and message-passing approaches can achieve high performance. However, they require greater efforts from the user in program development. The implicit and the data parallel models shift many burdens to the compiler, thus reducing the labor cost and the program development time. This tradeoff should be based on each specific application. For signal processing, we often require the highest performance. Furthermore, a parallel signal processing application, once developed, is likely to be used for a long time. This suggests the use of message-passing model for its high efficiency and better portability.
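The owner-compute rule mentioned above can be illustrated with a small Python sketch. It assumes a simple block distribution of a one-dimensional array; the function names are hypothetical, and the loop over ranks merely stands in for processes that would run concurrently in a real message-passing program.

```python
def block_size(n_items, n_procs):
    """Items per process under a block distribution (ceiling division)."""
    return -(-n_items // n_procs)

def block_owner(index, n_items, n_procs):
    """Rank of the process that owns item `index`."""
    return index // block_size(n_items, n_procs)

def owned_range(rank, n_items, n_procs):
    """Indices owned by `rank`; may be empty for trailing ranks."""
    block = block_size(n_items, n_procs)
    start = min(rank * block, n_items)
    return range(start, min(start + block, n_items))

def owner_compute(data, n_procs, kernel):
    """Apply `kernel` to every item, each item handled only by its owner."""
    result = [None] * len(data)
    for rank in range(n_procs):          # stands in for n_procs parallel processes
        for i in owned_range(rank, len(data), n_procs):
            result[i] = kernel(data[i])  # the owner performs the computation
    return result

squares = owner_compute(list(range(10)), 4, lambda x: x * x)
```

Because data mapping fixes the workload allocation, a poor distribution translates directly into load imbalance, which is one reason the explicit models demand more effort from the programmer.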
The Message Passing Model
The message passing programming model is the native model for MPPs and COWs. The portability of message-passing programs is enhanced greatly by the wide adoption of the public-domain MPI and PVM libraries. This model has the following characteristics:
Multiple threads (SPMD or MPMD) of control in different nodes.
Asynchronous operations at different nodes.
Realization Approaches
The parallel programming models just described are realized in real systems by extending Fortran or C in three approaches: library subroutines, new language constructs, and compiler directives. More than one of them can be used in realizing a parallel programming model. We show in Fig. 4 an example HPF code for target detection in radar applications, to illustrate the three realization methods (the algorithm in this code is credited to Michael Kumbera of the Maui High-Performance Computing Center).
New Constructs:
The programming language is extended with some new constructs to support parallelism and interaction. An example is the forall construct in Fig. 4. ... in Fig. 4 are examples of compiler directives. This approach is a trade-off between the previous two approaches.
In Fig. 4, we want to find the ten closest targets in an array of ... one-dimensional FFT computations are performed. All end with target detection. The APT performs a Householder transform to generate a triangular learning matrix, which is used in a beamforming step to null the jammers and the clutter, whereas in the HO-PD program, the two adaptive beamforming steps are combined into one step. The GEN program consists of four component algorithms to perform sorting, FFT, vector multiply, and linear algebra. These are the kernel routines often used in signal processing applications. The EL-Stag and the BM-Stag programs are similar to HO-PD, but use a staggered interference training algorithm.
Parallelization of STAP Programs
We have used three MPPs (IBM SP2, Intel Paragon, and Cray 
STAP Benchmark Performance
To demonstrate the performance of MPPs for signal processing, we chose to port the space-time adaptive processing (STAP) benchmark programs, originally developed by MIT Lincoln Laboratory for real-time radar signal processing on UNIX workstations in sequential C code [34]. We have to ... These benchmarks were written to test the STAP algorithms for adaptive radar signal processing. These programs start with Doppler processing (DP), in which a large number ...
Measured Benchmark Results
Figure 6 shows the measured parallel execution time, speed, and utilization as a function of machine size. Only the HO-PD performance is shown here. The SP2 demonstrates the best overall performance among the three MPPs. With 256 nodes, we achieved a total execution time of 0.56 seconds on the IBM SP2, corresponding to a speed of 23 Gflop/s. This is partly due to the SP2's fast processor, with a peak of 266 Mflop/s compared to the Paragon's 100 Mflop/s and the T3D's 150 Mflop/s (Table 2). The degradation of Paragon performance when the number of nodes is less than 16 is due to the small local memory (16 MB/node in the SDSC Paragon, of which only 8 MB is available to user applications). This results in excessive paging when only a few nodes are used. The SP2's high performance is further explained by Fig. 6c, which shows the utilization of the three machines. The SP2 has the highest utilization. In particular, the sequential performance is very good, with a utilization of 36%. The relatively high utilization is due to a good compiler, a large
Execution Timing Analysis
In Table 8 
Scalability over Machine Size
In an MPP, the total memory capacity increases with the number of nodes available. Assume every node has the same memory capacity of M bytes. On an n-node MPP, the total memory capacity is nM. Assume an application uses all the memory capacity M on one node and executes in W seconds (i.e., W is the sequential workload). This total workload has a sequential fraction, $\alpha$, and a parallelizable fraction, $1 - \alpha$. That is, $W = \alpha W + (1 - \alpha)W$. Three approaches have been used to get better performance as the machine size increases, which are formulated as three scalable performance laws.
Sun and Ni's Law
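When n nodes are used, a larger problem can be solved due to the increased memory capacity nM. Let us assume that the parallel portion of the workload can be scaled up G(n) times. That is, the scaled workload is
$$W^* = \alpha W + (1 - \alpha)\, G(n)\, W.$$
Sun and Ni [41] defined the memory-bound speedup as follows:
$$S_n^* = \frac{\text{sequential time for scaled workload}}{\text{parallel time for scaled workload}} = \frac{\alpha W + (1 - \alpha)\, G(n)\, W}{\alpha W + (1 - \alpha)\, G(n)\, W / n} = \frac{\alpha + (1 - \alpha)\, G(n)}{\alpha + (1 - \alpha)\, G(n)/n} \qquad (2)$$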
Amdahl's Law
When G(n) = 1, the problem size is fixed. Then Eq. 2 is called Amdahl's law [4] (for fixed-workload speedup) and has the following form:
$$S_n = \frac{1}{\alpha + (1 - \alpha)/n} = \frac{n}{1 + (n - 1)\alpha} \qquad (3)$$
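The dependence of the speedup on the scaling function G(n) can be explored numerically. The sketch below assumes the overhead-free memory-bound form, (α + (1 − α)G(n)) / (α + (1 − α)G(n)/n), and evaluates it for G(n) = 1 (fixed workload), G(n) = n (fixed time), and a hypothetical G(n) = 1.5n (memory-bound, growing faster than n). The sequential fraction 0.00278 is the value the article reports for the parallel APT program; since communication overhead is ignored here, the results are upper bounds and will be higher than measured speedups.

```python
def scaled_speedup(alpha, n, g):
    """Speedup when the parallel part of the workload is scaled by g(n)."""
    scale = g(n)
    return (alpha + (1 - alpha) * scale) / (alpha + (1 - alpha) * scale / n)

def fixed_workload(alpha, n):
    """Amdahl's law: G(n) = 1."""
    return scaled_speedup(alpha, n, lambda _n: 1.0)

def fixed_time(alpha, n):
    """Gustafson's law: G(n) = n."""
    return scaled_speedup(alpha, n, lambda k: float(k))

def memory_bound(alpha, n, g=lambda k: 1.5 * k):
    """Sun and Ni's law with a hypothetical G(n) = 1.5 n."""
    return scaled_speedup(alpha, n, g)

alpha, nodes = 0.00278, 256
s_amdahl = fixed_workload(alpha, nodes)
s_gustafson = fixed_time(alpha, nodes)
s_sun_ni = memory_bound(alpha, nodes)
```

Even without overhead, a sequential fraction below 0.3% caps the fixed-workload speedup at about 150 on 256 nodes, while the two scaled models stay close to the machine size.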
When G(n) > n, the computational workload increases faster than the memory requirement. Thus, the memory-bound model (Eq. 2) gives a higher speedup than the fixed-time speedup (Gustafson's law) and the fixed-workload speedup (Amdahl's law). These three speedup models are comparatively analyzed in [26].
These speedup models are plotted in Fig. 7 for the parallel APT program running on the IBM SP2. We have calculated that G(n) = 1.4n + 0.37 > n; thus the fixed-memory speedup is better than the fixed-time and the fixed-workload speedups. The parallel APT program with the nominal data set has a sequential fraction $\alpha$ = 0.00278. This seemingly small sequential bottleneck, together with the communication overhead, limits the potential speedup to only 100 on a 256-node SP2 (the fixed-load curve). However, by increasing the problem size and thus the workload, the speedup can increase to 206 using the fixed-time model, or 252 using the memory-bound model. This example demonstrates that increasing the problem size can amortize the sequential bottleneck and communication overhead, thus improving performance. However, the problem size should not exceed the memory bound. Otherwise, excessive paging will drastically degrade the performance, as illustrated in Fig. 6. Furthermore, increasing the problem size is profitable only when the workload increases at a faster rate than the communication overhead.
Scalability Over Problem Size
We are interested in determining how well the parallel STAP programs scale over different problem sizes. The STAP benchmark is designed to cover a wide range of radar configurations. We show the metrics for the minimal, maximal, and nominal data sets in Table 9. The input data size and the workload are given by the STAP benchmark specification [8, 9, 10, 13]. The maximum parallelism is computed by finding the largest degree of parallelism (DOP) of the individual steps. The critical path (or more precisely, the length of the critical path) is the execution time when a potentially infinite number of nodes is used, excluding all communication overhead. For simplicity, we assume that every flop takes the same amount of time to execute. Each step's contribution to the critical path is its workload divided by its DOP.
Average Parallelism
The average parallelism is defined as the ratio of the total workload to the critical path. The average parallelism sets a hard upper bound on the achievable speedup. For instance, suppose we want to speed up the sequential APT program by a factor of 100. This is impossible to achieve using a minimal data set with an average parallelism of 10, but it is possible using the nominal or larger problem sizes.
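The critical path and average parallelism definitions translate directly into code. The following Python sketch uses a hypothetical three-step program given as (workload in Mflop, DOP) pairs; the values are not taken from Table 9. As in the text, every flop is assumed to take the same time, so the critical path is expressed in Mflop.

```python
def critical_path(steps):
    """Sum over steps of workload / DOP (communication overhead excluded)."""
    return sum(workload / dop for workload, dop in steps)

def average_parallelism(steps):
    """Total workload divided by the critical path: an upper bound on speedup."""
    total = sum(workload for workload, _ in steps)
    return total / critical_path(steps)

def maximum_parallelism(steps):
    """Largest DOP among the individual steps."""
    return max(dop for _, dop in steps)

# Hypothetical program: two highly parallel steps and one sequential step.
steps = [(64.0, 64), (512.0, 256), (7.0, 1)]

path = critical_path(steps)         # 1 + 2 + 7 = 10 Mflop
avg = average_parallelism(steps)    # 583 / 10 = 58.3
peak = maximum_parallelism(steps)   # 256
```

In this example a single sequential step of only 7 Mflop holds the average parallelism to 58.3 even though one step could use 256 processors, which shows why the maximum parallelism alone is a poor guide to the useful machine size.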
When the data set increases, the available parallelism also increases. But how many nodes can be used profitably in the parallel STAP programs? A heuristic is to choose the number of nodes to be higher than the average parallelism. When the number of nodes is more than twice the average parallelism, at least 50% of the time the nodes will be idle. Using this heuristic, the parallel STAP programs with a large data set can take advantage of thousands of nodes in current and future generations of MPPs.
STAP Memory Requirements
For sequential programs, the memory required is twice the data set size. But for parallel programs, the memory required is six times that of the input data set, or three times the sequential memory required. The additional memory is needed for communication buffers. We have seen (Fig. 6) that the lack of large local memory in the Paragon could significantly degrade the MPP performance. Table 9 implies that for large data sets, the STAP programs must use multiple nodes, as no current MPP has large enough memory (3 to 34 GB) on a single node. It further tells us that existing MPPs have enough memory to handle parallel STAP programs with the maximal data sets. For instance, from Table 9, an n-processor MPP should have a memory capacity of 102/n GB per processor, excluding that used by the OS and other system software. Note that the corresponding average parallelism is 4332, larger than the maximal machine sizes of 512 for the SP2 and of 2048 for the Paragon and T3D. On a 512-node SP2, the per-processor memory requirement is 102 GB/512 = 200 MB, and each SP2 node can have up to 2 GB of memory. On a 2048-node T3D, the per-processor memory requirement is 102 GB/2048 = 50 MB, and each T3D processor can have up to 64 MB of memory.
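The memory arithmetic in this section can be checked with a short Python sketch. It takes the 102 GB total and the node capacities from the text, and assumes 1 GB = 1000 MB, which reproduces the article's rounded figures of about 200 MB and 50 MB. The function names are hypothetical.

```python
def sequential_memory(input_size):
    """Memory needed by the sequential program: twice the input data set."""
    return 2 * input_size

def parallel_memory(input_size):
    """Memory needed by the parallel program: six times the input data set."""
    return 6 * input_size

def per_node_memory_mb(total_gb, n_nodes):
    """Per-node requirement in MB, assuming 1 GB = 1000 MB."""
    return total_gb * 1000.0 / n_nodes

def fits(total_gb, n_nodes, node_capacity_mb):
    """True if the per-node requirement does not exceed the node capacity."""
    return per_node_memory_mb(total_gb, n_nodes) <= node_capacity_mb

sp2_need = per_node_memory_mb(102, 512)    # about 199 MB per SP2 node
t3d_need = per_node_memory_mb(102, 2048)   # about 50 MB per T3D processor
sp2_ok = fits(102, 512, 2000)              # SP2 node: up to 2 GB
t3d_ok = fits(102, 2048, 64)               # T3D processor: up to 64 MB
```

The T3D case leaves little headroom (roughly 50 MB needed against 64 MB available before OS usage), so a smaller T3D configuration would not hold the maximal data set.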
Signal processing applications often have a response time requirement. For instance, we may want to compute an APT in one second. From Table 9, this is possible for the nominal data set on current MPPs, as there are only about 8 Mflop on the critical path. All three MPPs can sustain 8 Mflop/s per processor for APT. To execute HO-PD in a second, we need each MPP node to sustain 50 Mflop/s. On the other hand, it is impossible to compute APT or HO-PD in one second for the maximal data sets, no matter how many processors are used. The reason is that it would require a processor to sustain 500 Mflop/s to 12 Gflop/s, which is impossible in any current or next-generation MPP.
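The response-time argument reduces to one division: even with unlimited nodes, the critical path must fit within the deadline, so each processor must sustain at least the critical-path workload divided by the deadline. A minimal Python sketch follows; the critical-path sizes are those quoted in the text, while the 60 Mflop/s sustained rate is a hypothetical value used only to exercise the check.

```python
def required_speed_mflops(critical_path_mflop, deadline_s):
    """Minimum sustained per-processor speed to finish within the deadline."""
    return critical_path_mflop / deadline_s

def meets_deadline(critical_path_mflop, deadline_s, sustained_mflops):
    """True if a processor sustaining `sustained_mflops` can meet the deadline."""
    return required_speed_mflops(critical_path_mflop, deadline_s) <= sustained_mflops

apt_nominal = required_speed_mflops(8.0, 1.0)        # 8 Mflop/s
hopd_nominal = required_speed_mflops(50.0, 1.0)      # 50 Mflop/s
worst_maximal = required_speed_mflops(12000.0, 1.0)  # 12 Gflop/s

sustained = 60.0  # hypothetical sustained per-node speed in Mflop/s
feasible = [meets_deadline(cp, 1.0, sustained) for cp in (8.0, 50.0, 500.0, 12000.0)]
```

Adding nodes cannot help once the critical path alone exceeds what one processor can deliver in the allotted time, which is the point made for the maximal data sets.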
Lessons Learned and Conclusions
We summarize below important lessons learned from our MPP/STAP benchmark experiments. Then we make a number of suggestions towards general-purpose signal processing on scalable parallel computer platforms including MPPs, DSMs, and COWs. None of these systems is supported by a real-time operating system. A main problem is that due to interferences from the OS, the execution time of a program could vary by an order of magnitude under the same testing condition, even in dedicated mode. The Cray T3D has the best communication performance, small execution time variance, and little warm-up effect, which are desirable properties for real-time signal processing applications.
We feel that the reported timing results could be even better if these MPPs were exclusively used for dedicated, real-time signal processing. We expect the system utilization to increase beyond 40% if a real-time execution environment could be fully developed on these MPPs.
Developing an MPP application is a time-consuming task. Therefore, performance, portability, and scalability must be considered during program development. An application, once developed, should be able to execute efficiently on different machine sizes over different platforms, with little modification. Our experiences suggest four general guidelines to achieve these goals:
Coarse Granularity: Large-scale signal processing applications should exploit coarse-grain parallelism. As shown in Fig. 2 ...
... Table 3, we feel that the Alpha 21164A (or the future 21264), UltraSPARC II, and MIPS R10000 have the highest potential to deliver a floating-point speed exceeding 500 Mflop/s in the next few years. With a clock rate approaching 500 MHz and continuing advances in compiler technology, a superscalar microprocessor with multiple floating-point units has the potential to achieve 1 Gflop/s speed by the turn of the century. Exceeding 1000 SPECint92 integer speed is also possible by then, based on the projections made by Digital, Sun Microsystems, and SGI/MIPS.
Future MPP Architecture
In Fig. 8 ... [40], where a custom-designed shell circuitry interfaces a commodity microprocessor to the rest of the node. In Cray terminology [1], the overall structure of a computer system as shown in Fig. 8 is called the macro-architecture, while the shell and the processor are called the micro-architecture. A main advantage of this shell architecture is that when the processor is upgraded to the next generation or changed to a different architecture, only the shell (the micro-architecture) needs to be changed. There is always a local memory module and a network interface circuitry (NIC) in each node. There is always cache memory available in each node. However, the cache is normally organized as a hierarchy. The level-1 cache, being the fastest and smallest, is on-chip with the microprocessor. A slower but much larger level-2 cache can be on-chip or off-chip, as seen in Table 3.
Unlike some existing MPPs, each node in Fig. 8 has its own local disk and a complete multi-tasking Unix operating system, instead of just a microkernel. Having local disks facilitates local swapping, parallel I/O, and checkpointing. Using a full-fledged workstation Unix on each node allows multiple OS services to be performed simultaneously at local nodes. On some current MPPs, functions involving access to disks or the OS are routed to a server node or the host to be performed sequentially.
The native programming model for this architecture is Fortran or C plus message passing using MPI. This will yield high performance, portability and scalability. It is also desirable to provide some VLSI accelerators into the future MPPs for specific signal/image processing applications. For example, one can mix a programmable MPP with an embedded accelerator board for speeding up the computation of the adaptive weights in STAP radar signal processing.
The Low-Cost Network
Up to three communication networks are used in a scalable parallel computer. An inexpensive commodity network, such as Ethernet, can be quickly installed, using existing, well-debugged TCP/IP communication protocols. This low-cost network, although only supporting low-speed communications, has several important benefits:
The High-Bandwidth Network
The high-bandwidth network is the backbone of a scalable computer, where most user communications take place. Examples include the 2-D mesh network of Paragon, the 3-D torus network of Cray T3D, the multi-stage High-Performance Switch (HPS) network of IBM SP2, and the fat-tree data network of CM-5. It is important for this network to have a high bandwidth, as well as short latency.
The Low-Latency Network
Some systems provide a third network with even lower latency to speed up communications of short messages. The control network of the Thinking Machines CM-5 and the barrier/eureka hardware of the Cray T3D are examples of low-latency networks. Many operations important to signal processing applications need a small delay but not a lot of bandwidth, because the messages being transmitted are short. Three such operations are listed below:
Barrier: This operation forces a process to wait until all processes reach a certain execution point. It may be needed in a parallel algorithm for radar target detection, where the processes must first detect all targets at a range gate before proceeding to the next farther range gate. The message length for such an operation is essentially zero.
Reduction: This operation aggregates a value (e.g., a floating-point word) from each process and generates a global sum, maximum, etc. This is useful, e.g., in a parallel Gauss elimination or Householder transform program with pivoting, where one needs to find the maximal element of a matrix row or column. The message length could vary, but is normally one or two words.
Broadcasting of a short message: Again, in a parallel Householder transform program, once the pivot element is found, it needs to be broadcast to all processes. The message length is the size of the pivot element, one or two words.
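The three short-message operations above can be imitated on a single machine to show how they fit together in a pivot search. The Python sketch below is only a stand-in: it uses threads and `threading.Barrier` from the standard library in place of MPP processes and a low-latency network, and the function name `find_and_share_pivot` is hypothetical. In an MPI program the same pattern would be expressed with the library's barrier, reduction, and broadcast collectives.

```python
import threading

def find_and_share_pivot(local_values):
    """Each 'process' contributes one value; all end up holding the global maximum magnitude."""
    n = len(local_values)
    barrier = threading.Barrier(n)
    contributions = [None] * n   # one slot per process
    shared = {"pivot": None}     # value published by the root
    received = [None] * n        # what each process ends up with

    def worker(rank):
        contributions[rank] = abs(local_values[rank])
        barrier.wait()                          # barrier: wait until all have contributed
        if rank == 0:
            shared["pivot"] = max(contributions)  # reduction: global maximum at the root
        barrier.wait()                          # wait until the root has published the pivot
        received[rank] = shared["pivot"]        # broadcast: every process reads the pivot

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return received

pivots = find_and_share_pivot([3.0, -7.5, 2.0, 5.0])
```

Each of the three operations moves at most one word per process, so their cost is dominated by latency, which is the motivation for a dedicated low-latency network.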
Comparison of MPPs and COWs
We feel that future MPPs and COWs are converging, once commodity Gigabit/s networks and distributed memory support become widely used. In Table 10, we provide a comparison of these two categories of scalable computers, based on today's technology. By 1996, the largest MPP will have 9000 processors approaching 1 Tflop/s performance, while any experimental COW system is still limited to fewer than 200 nodes with a potential 10 Gflop/s speed collectively. The MPPs are pushing for finer-grain computations, while COWs are used to satisfy large-grain interactive or multitasking user applications. The COWs demand special security protection, since they are often exposed to public communication networks, while the MPPs use non-standard, proprietary communication networks with implicit security.
The MPPs emphasize high throughput and higher I/O and memory bandwidth. The COW offers higher availability with easy access to large-scale database systems. So far, some signal processing software libraries have been ported to most MPPs, while remaining untested on COWs. Finally, we point out that MPPs are more expensive and lack sound OS support for real-time signal processing, while most COWs cannot support DSM or lack a single system image. This will limit the programmability and make it difficult to achieve global efficiency in cluster resource utilization.
Extended Signal Processing Applications
So far, our MPP signal processing has been concentrated on STAP sensor data. The work can be extended to process SAR (synthetic aperture radar) sensor data. The same set of software tools, programming and runtime environments, and real-time OS kernel can be used for either STAP or SAR signal processing on the MPPs. The ultimate goal is to achieve automatic target recognition (ATR) or scene analysis in real time. To summarize, we list below the processing requirements for STAP/SAR/ATR applications on MPPs:
The STAP/SAR/ATR source codes must be parallelized and made portable on commercial MPPs with a higher degree of interoperability.
Parallel programming tools for efficient STAP/SAR program partitioning, communication optimization, and performance tuning need to be improved using visualization packages.
Light-weight OS kernels for real-time applications on the target MPPs, DSMs, and COWs must be fully developed.
Run-time software support for load balancing and for insulating applications from OS interference is needed.
Portable STAP/SAR/ATR benchmarks need to be developed for speedy multi-dimensional convolution, fast Fourier transforms, discrete cosine transform, wavelet transform, matrix-vector product, and matrix inversion operations.
