Complex signal processing problems are naturally described by compositions of program modules that process streams of data. In this paper we discuss how such compositions may be analyzed and mapped onto multiprocessor computers to effectively exploit the massive parallelism of these applications. The methods are illustrated with an example of signal processing for an optical surveillance problem.
Introduction
An important goal toward making parallel computers more usable for practical computations is to provide compiling technology that is able to convert algorithms expressed directly and simply in a high level language into efficient machine code. The parallelism implicit in the expression of the algorithm must be identified and exploited by the compiler. Complex signal processing problems are naturally described by compositions of program modules that process streams of data. We use an example to illustrate how such compositions may be analyzed and mapped onto multiprocessor computers using extensions of methods used in the Paradigm compiler designed and implemented by the author [Dennis 1989].
We begin by discussing how the mapping of compositions of stream-processing modules differs from the application of data parallel principles in mapping scientific computations onto massively parallel computers. In the domain of high performance signal and image processing, applications can exploit massively parallel computation, but the form of parallelism present is not the data parallel form encountered in scientific computations: (1) program modules often work together in producer/consumer relationships, allowing concurrent operation; and (2) all modules of the program are continuously active processing streams of data. The use of stream data types plays a central role in expressing such computation in a high level form that permits automatic analysis. We illustrate an approach to mapping such computation by analyzing a typical processing computation for image data arriving from a sensor array. We indicate how the computation (when expressed in the Sisal functional programming language) may be analyzed and its structure represented in a program description tree, and used to guide the construction of code for a target multiprocessor. We discuss the problem of finding an optimal mapping, and discuss the structure and performance of constructed code for two choices of multiprocessor architecture.
Static Mapping
The problem of implementing programs written in high level languages on parallel computers may be approached in two fundamental ways according to the philosophy of managing processing and memory resources. One may adopt a very general model of parallel computing and implement it by a suitable combination of architectural features and runtime services so that all scheduling and memory allocation decisions are performed during program execution. This general approach is exemplified by the Monsoon multiprocessor [Papadopoulos/Culler 1990], but the mechanisms have not evolved to the level of efficiency required to attract practical usage. The second approach is based on making most memory management decisions at compile time. This can yield very efficient exploitation of multiprocessors built of conventional processors for computations having a suitable regular structure. This second approach has been the basis for the development of the data parallel model and its implementation in such work as the Thinking Machines Fortran compiler [Sabot 1992], the definition of High Performance Fortran, and advanced work in Prof. Kennedy's group at Rice University.
The data parallel approach can also be followed for programs expressed in functional programming languages with significant advantages. It is simpler to identify the program blocks that are suitable for data parallel implementation, and the global program analysis needed to determine optimum alignment and mapping for the arrays of a program is more readily accomplished. This is because functional language programs do not make use of side effects, and each use of any data definition is readily identified. This has been done in the Paradigm compiler [Dennis 1989] designed and built by the author for the Sisal language and targeted for the CM-2 Connection Machine.
Compiler Structure
The Paradigm compiler was designed to identify the principal data structures constructed by a program through global compile-time analysis, and to map these structures onto the processing elements of the target machine. The structure of the compiler is shown in Figure 1. It consists of a conventional Front End that parses and checks source language modules, an Analyze module that identifies code blocks in the program, and a Code Constructor that implements each code block on the basis of mapping specifications derived with optional advice from the user [Dennis 1988, Dennis 1989].
Our goal requires some departure from the typical structure of programming language support systems. Efficient machine code programs for large-scale parallel computers can be generated only if the compiler is able to consider the entire collection of program modules involved in a job in making decisions regarding how the computation should be mapped onto the target machine. This implies that the linking of program modules should be accomplished prior to the compiler's analysis and optimization decisions. A second change is more fundamental: instead of carrying out optimization as a sequence of independent steps, each of which supposedly leads to an "improvement" of the code, we perform an analysis of the given code, determine the best mapping strategy, then synthesize machine code according to the specified mapping. The Sisal functional programming language [McGraw 1985] is particularly attractive for implementing this approach. The absence of global variables and the clear differentiation of arguments and results of function modules in the Sisal language make it easy for a compiler to analyze source programs and identify the parts of the code that define the major data structures. We call these parts of the source language program code blocks.
The data structures appropriate for scientific computation are large multi-dimensional arrays of numerical data. Each code block defines an array value and represents a computation that may be spread over the processing elements of the machine according to a chosen assignment (or mapping) of array elements to processing elements. This is the essence of data parallel computation. The parallel iteration expression of the Sisal language provides a convenient high level notation for writing data parallel algorithms.
Application to High Performance Signal/Image Processing
Another area that can exploit massively parallel computation is high performance signal and image processing. In these applications large amounts of parallelism exist, but it takes different forms: (1) producer/consumer concurrency: the possibility of executing two program modules concurrently when one (the consumer) processes a stream of data generated by the other (the producer); and (2) simultaneous application of several instances of functions. The use of stream data types plays a central role in expressing such computations in a high level form that permits automatic analysis. The rest of the paper is devoted to describing this process, illustrating its application to a practical signal processing problem, and studying the performance achievable for two multiprocessor architectures.
An Example: Optical Surveillance
The computation we have chosen to illustrate the proposed mapping strategy is derived from a collection of procedures for processing information from a sky-scanning optical surveillance device and detecting objects in its field of view. The application has similarities to radar signal processing. There are many sensors, several for each line of the scanned image. These signals are conditioned, smoothed and downsampled before a two-dimensional filter is used to suppress unimportant detail. A peak detection algorithm identifies points in the image that should be analyzed further as potential objects to be reported. A block diagram of the computation is shown in Figure 2.
Each module in the diagram may be characterized as a function that transforms a stream of input data into a stream of output data. Hence it is natural to specify them using a language (Sisal) that includes streams as standard data types and supports analytic and constructive operations on streams.
In Sisal a stream is a sequence of values which may be infinite (unending). A stream of integers is a natural representation for a signal that has been converted into digital form. Interconnecting modules that process streams of data is a powerful means for combining program parts to build larger modules and is well matched to the needs of signal processing tasks. Thus the combination of processing modules shown in Fig This use of function composition for signal processing has been discussed in [Dennis 1995], where we showed how to transform tail-recursive functions on streams into non-recursive dataflow graphs that may be executed efficiently by suitable fine-grain parallel computers [Dennis/Gao 1994, Dennis 1991]. The use of dataflow graphs as a natural means for specifying signal processing applications has also been studied in [Ho/Lee/Messerschmitt 1988], and the idea of compiling signal processing programs from block diagrams was described as early as .
From this example we see that complete signal processing tasks may take the form of a set of processing modules, each generating a stream of values that is passed to other modules for further processing. Thus the overall computation may be described by a directed acyclic graph in which the nodes are stream processing modules such as those we have presented, and each link indicates a producer/consumer relationship between a pair of modules. It is well-known that such interconnections of modules may lead to deadlock if the graph contains (undirected) cycles, and the temporary storage for stream elements in each link is bounded in capacity. Given the structure of stream processing programs expressible in Sisal, a compiler can detect these situations and warn the user of the deadlock possibility. 
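The deadlock condition above depends on cycles that appear when link directions are ignored. A minimal Python sketch of such a check, assuming modules are numbered from 0 and each link is a (producer, consumer) pair; the function name and the two example graphs are illustrative, not taken from the Paradigm compiler:

```python
def has_undirected_cycle(num_modules, links):
    """Return True if the module graph contains a cycle when the
    direction of each producer/consumer link is ignored."""
    parent = list(range(num_modules))  # union-find forest over modules

    def find(x):
        # Follow parent pointers to the set representative,
        # halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for producer, consumer in links:
        a, b = find(producer), find(consumer)
        if a == b:
            # Both ends are already connected, so this link closes a cycle.
            return True
        parent[a] = b
    return False


# A simple chain of four modules has no undirected cycle.
chain = [(0, 1), (1, 2), (2, 3)]
# A fork followed by a join is acyclic as a directed graph,
# yet it contains an undirected cycle.
diamond = [(0, 1), (0, 2), (1, 3), (2, 3)]

chain_risk = has_undirected_cycle(4, chain)
diamond_risk = has_undirected_cycle(4, diamond)
```

A True result only flags the possibility of deadlock; whether deadlock actually occurs also depends on the bounded capacity of each link.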
Program Analysis
In this paper we consider only programs having an overall structure that supports the continuous processing of streams of data. In these programs each module operates on data streams, produces a data stream, and runs continuously during program execution. The overall structure of such programs is an acyclic interconnection of such modules. This is in contrast to data parallel scientific codes for which the original Paradigm compiler was designed. There the top level program structure is a main loop in which the loop body is an acyclic combination of code blocks that define array values, as in the following program segment
This parallel expression in Sisal defines an array value Z, each (internal) element of which is the average of the two adjacent elements of a given vector X. The conditional expression provides special treatment of the end elements of Z. All instances of the body expression may be evaluated concurrently. In general, data parallel code blocks may be nested "for" expressions that define multidimensional arrays, and may include reduction operations that apply associative operators over specified dimensions of the defined array. In the Paradigm Compiler [Dennis 1989], programs having this structure are analyzed and transformed into data parallel programs for the CM-2 Connection Machine.
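The averaging computation described for Z can be sketched in Python rather than Sisal. The treatment of the end elements is an assumption: the text says only that they are handled specially, and this sketch copies them unchanged from X. The function name is illustrative.

```python
def neighbor_average(x):
    """Build Z from X: each interior element is the average of the two
    adjacent elements of X. End elements are copied from X, which is an
    assumption made for this sketch."""
    n = len(x)

    def element(i):
        # Each call depends only on X, never on other elements of Z,
        # so all instances could be evaluated concurrently.
        if i == 0 or i == n - 1:
            return x[i]
        return (x[i - 1] + x[i + 1]) / 2

    return [element(i) for i in range(n)]


z = neighbor_average([1.0, 2.0, 4.0, 8.0])
```

Because every element of Z is defined independently, the work may be divided among processing elements by any assignment of index ranges.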
Stream-Processing Programs
The present study explores the prospects for static analysis and mapping of continuous stream-processing computations such as the optical surveillance problem. Thus we envision a new version of the Paradigm Compiler that will transform and analyze such programs and generate machine code for multiprocessor computers. Given a program that is amenable to static resource management, the Analyze module of the rebuilt Paradigm Compiler will provide program descriptions that may be used to plan the mapping of the program onto a parallel computer and to construct code in the target machine language.
The job of program analysis has several parts:
1. Identify the program modules (code blocks).
2. Check the conditions that permit static mapping to be used.
3. Extract parameters for each program module for use in performance estimation.
4. Determine the relative computation rate for each module.
5. Construct a program description tree containing the results of analysis.
Program Transformation
The identification step includes examining recursive function definitions to determine whether they are tail-recursions and have equivalent iterative dataflow graphs. A method for doing this has been given in [Dennis 1995].
In the absence of conditional expressions in their bodies, the tail-recursive function definitions express functions that process input streams into output streams where the number of output elements emitted is related to the number of input elements absorbed by a ratio of integers, that is, a rational number. If the bodies of these function definitions contain conditional expressions, it may be that the module does not have a fixed ratio of output elements emitted to input elements absorbed, and only a range of values for the ratio can be determined through static analysis. Such situations appear to be rare in practical signal processing computations, for their existence would imply a non-uniform sampling rate. In our example, the BaseRemove function contains a conditional, yet has a fixed input/output ratio of unity. Given these ratios (or bounds on relative rates) a rate (or range of rates) can be calculated for every module. We assume that the usual architecture-independent optimization steps (constant folding, common subexpression elimination, etc.) have been performed. Also, and this is especially important in signal processing applications, shift operations are used to implement multiplication by known constants whenever this is more efficient (although this depends on the detailed design of the arithmetic units of the processor).
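The rate calculation for a chain of modules with fixed ratios can be sketched with exact rational arithmetic. Only the unity ratio of BaseRemove comes from the text; the ratios given here for SpikeAdapt and NyquistFilter are hypothetical placeholders, as is the function name.

```python
from fractions import Fraction


def relative_rates(pipeline, input_rate=Fraction(1)):
    """Given (module, output/input ratio) pairs in pipeline order,
    return each module's output rate relative to the input stream."""
    rates = {}
    rate = input_rate
    for name, ratio in pipeline:
        rate = rate * ratio  # elements emitted per input-stream element
        rates[name] = rate
    return rates


pipeline = [
    ("BaseRemove", Fraction(1)),        # unity ratio, as stated in the text
    ("SpikeAdapt", Fraction(1, 8)),     # hypothetical downsampling ratio
    ("NyquistFilter", Fraction(1, 4)),  # hypothetical downsampling ratio
]
rates = relative_rates(pipeline)
```

When a module has only a bounded range for its ratio, the same propagation could be carried out twice, once with the lower bounds and once with the upper bounds.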
The Program Description Tree
The objective of program transformation and analysis is to provide information on which the choice of a good mapping of program modules and data structures onto a target multiprocessor may be made. We represent the results of analysis as a program description tree or PDT. Each node of the PDT corresponds to a syntactic element of the program an- 
Analysis of the Example
The optical surveillance program satisfies the conditions for static mapping. Each recursive function definition is tail-recursive and may be transformed into an iterative dataflow graph with a fixed memory requirement for each top-level invocation. Moreover, each of the resulting transformed modules maps one or more input streams into an output data stream, and the program in entirety is an acyclic composition of these modules, as specified by the top-level function Process. The program description tree for the optical surveillance example is shown in Figure 5.
Computation Rate and Load Estimate
The program description tree contains sufficient information to determine the relative computation rate for each program module, and the approximate fraction of the total computation load each module is responsible for. These data are calculated for the optical surveillance problem in Table 1. We (arbitrarily) take the processing of one array of data by TwoDimFilter (or by PeakDetect) as the basic compute cycle of the computation. For each program module, the table shows the number of operations performed in each execution of a module, the width of the data stream processed by the module, and the number of instances of execution of the module for one compute cycle. These yield the operation count per cycle and the load fraction for each module. These data are used in Section 6 to estimate the computation rate and latency for selected mappings of the example.
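One plausible reading of how the three table columns combine is that the operation count per cycle is their product. The sketch below assumes that reading; the module names and numbers are hypothetical and are not the entries of Table 1.

```python
def load_table(modules):
    """modules maps a name to (operations per execution, stream width,
    executions per compute cycle). Returns the operation count per cycle
    for each module, the total, and each module's load fraction."""
    per_cycle = {
        name: ops * width * instances
        for name, (ops, width, instances) in modules.items()
    }
    total = sum(per_cycle.values())
    fractions = {name: count / total for name, count in per_cycle.items()}
    return per_cycle, total, fractions


# Hypothetical modules: one runs 32 times per cycle on each of 256
# streams, the other runs once per cycle on each of 256 streams.
modules = {
    "ModuleA": (5, 256, 32),
    "ModuleB": (100, 256, 1),
}
per_cycle, total, fractions = load_table(modules)
```

The load fractions are what a mapping plan would use to decide how many processing elements a module deserves.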
Mapping Plans and Strategies
In this section we discuss reasonable choices for mapping continuous processing applications to multiprocessor computers, and discuss the problem of finding the best mapping plan.
Mapping Plans
In contrast to data parallel scientific computing, the strategy of letting one program code block at a time utilize the whole machine appears to be a poor choice for continuous processing applications. Rather, in many cases the best approach is to structure the machine code so all program modules are executing concurrently at a rate that meets exactly the computation requirement (see footnote 1).
Footnote 1: In current practice, a single processor is often multiplexed among program modules for different stages of processing, but because coarse-grain processing must be used to attain economical performance with conventional processors, large buffers for intermediate data must be used and high latency of results occurs. With multiprocessing, assigning different modules to distinct processors will usually yield better resource utilization.
Given an estimate of the load to be handled by each code module, we must decide how many processing elements should be actively executing each module. For each program module, two reasonable mapping possibilities are apparent:
1. Allocate: Assign to the program module the exact number of processing elements needed to achieve the overall computation rate the module requires.
2. Distribute: Spread the computation load of the module uniformly over all processing elements
The choice between the allocate and distribute strategies may be made independently for each module, but those processing elements dedicated to program modules for which the allocate strategy is chosen are not in the set over which the work of the remaining modules may be spread. Which choices lead to better performance depends on the relative amounts of inter-module communication and intramodule communication, and on how well the loads match up with processing element capacities.
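The interaction between the two strategies can be sketched as a sizing calculation: allocated modules take whole processing elements out of the pool, and the distributed load is spread over what remains. All names and load figures below are hypothetical; only the machine size of 20 processing elements at 50 MIPS matches the example considered later.

```python
import math


def size_plan(loads_mips, choices, total_pes, pe_mips):
    """loads_mips: module name -> required instruction rate in MIPS.
    choices: module name -> "allocate" or "distribute".
    Returns (dedicated PEs per allocated module, number of shared PEs,
    load in MIPS placed on each shared PE)."""
    dedicated = {
        name: math.ceil(loads_mips[name] / pe_mips)
        for name, choice in choices.items()
        if choice == "allocate"
    }
    # Dedicated PEs are removed from the set that shares distributed work.
    shared = total_pes - sum(dedicated.values())
    spread = sum(
        loads_mips[name] for name, choice in choices.items()
        if choice == "distribute"
    )
    if shared < 0 or (spread > 0 and shared == 0):
        raise ValueError("not enough processing elements for this plan")
    per_pe = spread / shared if shared else 0.0
    return dedicated, shared, per_pe


# Hypothetical loads for two groups of modules.
loads = {"front_end": 700.0, "back_end": 90.0}
choices = {"front_end": "distribute", "back_end": "allocate"}
dedicated, shared, per_pe = size_plan(loads, choices, total_pes=20, pe_mips=50.0)
```

A plan is feasible on compute capacity alone only if the per-element load stays below the capacity of one processing element; communication costs are not modeled here.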
Note that if the computation rate demanded by some program module requires the performance of several processing elements, then the program module must offer sufficient opportunities for concurrency that the processing elements can be fully utilized. Otherwise the computation is not feasible on the target multiprocessor.
Where the input of a module is an array of streams, a plan in which the modules producing the individual streams are executed by the same processing element as that assigned to the corresponding part of the array-processing module is likely to perform better by avoiding some communication cost.
An advantage of assigning a limited number of processors to chosen modules is that it is then not necessary to load all program modules into (or make them accessible from) every processor.
In data parallel scientific computation, a major issue is aligning the distribution of various data arrays so as to minimize communication. In continuous processing computations, this issue has less impact. On the other hand, it is beneficial to distribute the work of the TwoDimFilter and PeakDetect modules over processing elements in aligned fashion.
On the basis of the above considerations we propose the following class of mapping plans for continuous processing programs: A mapping plan is a specification for each node of the program description tree as to whether execution of the program section described by the subtree is to be evenly distributed over all processing elements (distribute), or is to be executed by a group of dedicated processing elements (allocate) sized to accommodate the estimated load of that program section. Table 2 gives three reasonable proposals for mapping the optical surveillance computation. Under Plan A, the work of each of the five modules is distributed over all processing elements. This choice is attractive because it eliminates all inter-processor communication except that due to boundary exchange in algorithms TwoDimFilter and PeakDetect. In Plan B, a group of processors is dedicated to the work of algorithms TwoDimFilter and PeakDetect, but these are both distributed across the group to avoid communication costs for passing data from TwoDimFilter to PeakDetect. The relative merits of these plans are discussed in the following section.
Finding the Optimal Mapping Plan
Given a mapping plan and characteristics of the target multiprocessor including the number of processing elements, it is straightforward to estimate performance parameters for a mapping plan. Given the total operation count, the number of processors, and their speed, the rate of computation may be estimated. The costs of process synchronization may be approximated from characteristics of the target architecture and program structure information in the description tree. The mapping plan induces a communication load from which it can be estimated whether the computation is compute bound or communication bound. Thus the following approach should help find the optimum mapping plan:
1. Determine computation rates and load parameters for each node of the graph.
2. Generate several plausible mapping plans based on the given program description tree and estimates of performance parameters.
3. Evaluate each proposed mapping plan by constructing target machine code and determining accurate processor and communication loads.
4. Select the best mapping plan for the user's objective.
5. Construct the final machine code.
Multiprocessor Performance
Given the three mapping plans proposed above, let us consider their use in code construction for multiprocessor computers. First we introduce the two processor architectures we use as contrasting targets for parallel computing. Then we discuss the machine code structures appropriate to implementing continuous processing applications such as the optical surveillance computation. We discuss the performance differences among the three plans and point out the tradeoffs possible among throughput, memory, and latency of output data.
Architectures
We consider two contrasting multiprocessor architectures. One uses processing elements of conventional architecture with features intended to support efficient multiprocessor computation. We designate this the CVA (ConVentional Architecture) machine. A commercial example of such a processing element is the Texas Instruments TMS320C3x digital signal processor. This machine has high single-thread performance, but makes only modest concessions to supporting fine-grain synchronization and communication for efficient parallel computing. In this architecture, a thread is created by a fork command or a parallel function call interpreted by run-time software, and may terminate at a join command or by execution of a quit command. A thread may be suspended to wait for some event to occur, and it may be preempted to allow the processor to handle interrupt events or to schedule a thread having a higher priority.
The second architecture uses a hypothetical processing element having an interleaved multithreading architecture, as proposed in [Dennis/Gao 1994, Dennis 1989]. We designate this one the MTA (MultiThreaded Architecture) machine. In this processor there may be four active threads that share resources (functional units, registers, and access to local memory). Threads are non-preemptible, so execution of a ready thread is delayed until one of the four pipeline slots is released by termination of an active thread. A thread becomes ready for execution when it is signaled from other threads, or when a message arrives from another processor. A thread uses a small number of registers to pass results from one instruction to a later instruction of the thread; registers are undefined when a thread becomes active and are not saved at thread termination. A typical thread will either send signals to activate other threads, or send an interprocessor message just before terminating by executing a quit instruction. In the MTA a thread is a sequence of instructions fixed at compile time and is short enough that other threads may be executed soon enough to meet performance requirements.
Table 2: Three mapping plans for the optical surveillance problem.
Other multithreaded architectures have supported eight active threads to allow tolerance of long memory accesses. For the present discussion we assume that multiplexing four threads in the computation pipeline is sufficient to tolerate the latency of accesses to local memory and to fill pipeline gaps due to intra-thread dependences.
To compare the two architectures for our example, we will assume that both are able to achieve the same total instruction processing rate. This means that one thread on the MTA will run one fourth as fast as a single thread on the CVA. (This is unfair to the MTA because the CVA will be slowed more by pipeline hazards.)
With respect to implementing the mapping plans for the optical surveillance problem, the differences that affect the code structure needed to get best performance are the following:
1. A CVA processor can be fully utilized by a single thread. For the proposed MTA machine, four threads are needed to fully utilize the processor.
2. Switching between threads is more expensive for the CVA, so long threads are favored. The fast switching of the MTA processor allows short threads to be used, permitting more parallelism to be exploited.
3. Sending and receiving overhead is very low in the MTA machine, so very short messages may be handled efficiently.
Machine Code Structure
In the MTA the low cost of threads allows the machine-level program structure to reflect the concurrency structure of the algorithm being implemented. In the case of the CVA, it will be advantageous for high throughput to unroll loops to obtain long threads, and to block data into long messages to amortize message-passing overhead.
For both architectures, it is better to run long threads because starting and terminating threads has a nonzero cost in both processors. Since we are assuming that local memory accesses do not cause pipeline gaps, thread switching is beneficial only for synchronizing with data arriving in messages from other processors, and for providing sufficient multiplexing of module operation to meet the latency and throughput requirements of the application.
The Example
First we determine the number of processors needed to perform the computation at the desired rate. To do this we estimate the number of instructions needed to perform one compute cycle, multiply by the desired rate and divide by the performance of the processing element.
From Table 1 we see that 139,138 operations per compute cycle are needed. Allowing an equal number of data movement and miscellaneous instructions, and allowing an additional 25 percent for overhead of scheduling and communication, we find that the total instructions per cycle will be about
$I_{cycle} = 139{,}138 \times 2 \times 1.25 = 347{,}845$
Using the desired rate of 2.5 kHz, we find that the total instruction rate must be at least
$R_{instr} = 869$ MIPS
This rate could be met by 18 processors at 50 MIPS each, so to be generous, let us assume a machine with 20 processors.
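The sizing arithmetic above can be reproduced directly. The figures are those given in the text; the variable names are illustrative.

```python
import math

OPS_PER_CYCLE = 139_138        # operations per compute cycle (Table 1 total)
CYCLE_RATE_HZ = 2_500          # desired compute cycle rate, 2.5 kHz
PE_RATE_IPS = 50_000_000       # one processing element, 50 MIPS

# Double the count for data movement and miscellaneous instructions,
# then add 25 percent for scheduling and communication overhead.
# Integer arithmetic keeps the result exact.
instr_per_cycle = OPS_PER_CYCLE * 2 * 125 // 100

# Total instruction rate required, in instructions per second.
instr_rate = instr_per_cycle * CYCLE_RATE_HZ

# Smallest whole number of processing elements that meets the rate.
min_processors = math.ceil(instr_rate / PE_RATE_IPS)
```

The exact rate is 869.6 MIPS, so 18 processing elements leave about 3 percent of spare capacity and 20 leave about 13 percent.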
Of the proposed mapping plans, Plan C has the greatest communication load because communication is needed to pass the entire data stream between each of two pairs of program modules (as well as a small amount of intra-module communication). The communication rate required will be 2w = 512 words per compute cycle, or 1.28 million words per second. This is less than 0.2 percent of the required instruction rate and is far below the capacity of typical interconnection networks. This is indeed an embarrassingly parallel computation, and is compute bound for all three mapping plans.
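The communication estimate for Plan C follows from the stream width w = 256 implied by 2w = 512. A short check of the figures, with illustrative variable names:

```python
STREAM_WIDTH = 256             # w, the number of data channels
CYCLE_RATE_HZ = 2_500          # compute cycles per second
INSTR_RATE = 869_612_500       # required instructions per second

# The full stream crosses processor boundaries twice under Plan C.
words_per_cycle = 2 * STREAM_WIDTH
words_per_second = words_per_cycle * CYCLE_RATE_HZ

# Communicated words per executed instruction.
comm_ratio = words_per_second / INSTR_RATE
```

The ratio comes to roughly 0.15 percent, consistent with the bound of 0.2 percent stated above.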
There is one more issue to discuss before we consider the mapping plans separately. A major challenge for this computation is handling the large number of high volume input data streams. In each case we assume that input data from the sensors is made available to the multiprocessor in blocks of eight values for each "channel" of data processing. This means processing one interrupt by the CVA or one synchronization event for the MTA for each channel on every minor cycle of operation. Similarly, the results of processing are delivered to the user as blocks of 16 16-bit words containing the (Boolean) peak data for one cycle of processing. Handling the output stream is a very minor problem, but the rate of input data stream events is $R_{inp} = 4w \times 2.5\ \mathrm{kHz} = 2.56\ \mathrm{MHz}$, or 128,000 input events per second for each processing element. If each input event is handled in the CVA machine by processing an interrupt and scheduling a thread, the overhead cost will be high. In the MTA machine, the corresponding cost is that of synchronizing thread initiation with an input event, which amounts to just a few processor cycles.
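The input event rate can be turned into an instruction budget per event, which shows why per-event overhead matters. The rate figures are those of the text; the budget is derived here from the assumed 50 MIPS processing element, and the variable names are illustrative.

```python
STREAM_WIDTH = 256             # w, the number of data channels
BLOCKS_PER_CYCLE = 4           # input blocks per channel per compute cycle
CYCLE_RATE_HZ = 2_500          # compute cycles per second
NUM_PES = 20                   # processing elements in the assumed machine
PE_RATE_IPS = 50_000_000       # one processing element, 50 MIPS

# Input events per second for the whole machine and for one element.
events_per_second = BLOCKS_PER_CYCLE * STREAM_WIDTH * CYCLE_RATE_HZ
events_per_pe = events_per_second // NUM_PES

# Instructions a processing element can execute between input events,
# covering both useful work and event handling.
instr_budget_per_event = PE_RATE_IPS / events_per_pe
```

With fewer than 400 instructions available per event, an interrupt and thread-scheduling path costing on the order of a hundred instructions would consume a large share of the machine, whereas a synchronization costing a few cycles would not.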
Plan A:
Under Plan A, computation by each of the five program modules is distributed over all 20 processing elements. If each processor performs the work associated with 256/20 ≈ 13 channels of data, the only interprocessor communication will be to support the boundary references in the TwoDimFilter and PeakDetect modules. As we have already noted, this communication load is very small.
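Since 256 is not a multiple of 20, the channels cannot be divided exactly evenly. A small sketch of one balanced assignment, assuming channels are dealt out so that counts differ by at most one; the function name is illustrative:

```python
def channels_per_processor(channels, processors):
    """Split channels as evenly as possible: the first `extra`
    processors receive one more channel than the rest."""
    base, extra = divmod(channels, processors)
    return [base + 1 if p < extra else base for p in range(processors)]


counts = channels_per_processor(256, 20)
busiest = max(counts)
```

Under this assignment 16 processors handle 13 channels and 4 handle 12, so the figure of 13 channels describes the most heavily loaded processors.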
In the CVA machine, a single high-priority thread may perform the 32 executions of BaseRemove, four executions of SpikeAdapt, and one execution of NyquistFilter for each data stream in each compute cycle. There will be a substantial cost associated with synchronizing the start of this thread with the arrival of 4 × 13 = 52 blocks of sensor data for each compute cycle. Separate lower priority threads may be used to perform the TwoDimFilter and PeakDetect computations when signaled by arrival of messages containing boundary data.
In the MTA machine, many threads may be employed without thread switching overhead becoming significant. One attractive structure is to use a separate thread to perform the work of the three front-end modules for each data channel. Each of these threads would contain 351 operations, which is sufficiently short that responsiveness of processing will not be affected. One thread apiece will serve to perform the TwoDimFilter and PeakDetect computations after synchronizing with interprocessor messages.
The latency of processing is the time interval between arrival of input data and the availability of output data that depends on it. Some of the processing steps of the optical surveillance example have a built-in delay of one to three operation cycles. Additional latency is introduced in the machine program by overhead costs and because once operations are performed additional work is done before the consumer of results is scheduled or signaled to begin operation. In this respect, the MTA machine has the advantage because its finer granularity of processing allows successor threads to be signaled sooner than is feasible to schedule them in the CVA machine. This is partially compensated by the property that threads execute four times faster in the CVA.
Plan B:
In Plan B, two processors would perform all computation for TwoDimFilter and PeakDetect. Because there would be only two sections of the data stream, message traffic for intra-module communication would be smaller. Instead, the entire data stream passing from NyquistFilter to TwoDimFilter would have to be carried in interprocessor messages. Handling this data stream on a word by word basis would involve a large overhead for the CVA machine (more cycles than needed to execute the TwoDimFilter algorithm), but would be a relatively minor amount for the MTA (ten percent or less). Although the higher communication load for this plan would not overload a typical network, there is no compensating saving because the intra-module communication need is so low, and the plan has the disadvantage of introducing unbalanced use of parts of the network. Under this mapping plan, there would be good opportunity to improve performance of the CVA by passing data in large blocks between stages of the computation. However, this would increase the latency of results and require large data buffers.
Plan C:
Plan C takes the further step of executing the TwoDimFilter and PeakDetect algorithms on separate groups of processors, further increasing the communication load without compensating benefits.
Discussion
The principal difference between the two architectures is in the cost of synchronization, which also reflects a difference in the handling of global memory access. (One may regard the communication performed to implement access to boundary values of the data array in TwoDimFilter and PeakDetect as instances of a general mechanism for global memory access.) The effect is greater in computations that can benefit from short threads.
The impact of this cost on performance of the CVA may be mitigated by a standard technique: breaking the data stream up into blocks of sufficient length that the start-up cost for sending and receiving messages is acceptably small. The penalty is longer latency of results and increased amounts of memory needed to buffer blocks of data between processing stages.
In the calculation of performance it is also necessary to check that the performance is actually achievable, that is, that there is sufficient parallelism that no processing element is ever starved for work. This may be done using Petri nets to represent the dynamic behavior of the scheduling of threads, but is beyond the scope of this paper.
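The blocking tradeoff can be made concrete with a small calculation. The following is a minimal sketch, not a model of either machine: the function names and every numeric parameter (start-up cycles, per-word cycles, cycles between words) are hypothetical and chosen only to illustrate how start-up cost is amortized over a block while latency grows with block length.

```python
import math


def per_word_cost(startup_cycles, per_word_cycles, block_len):
    """Average communication cycles per word when words travel in blocks.

    The start-up cost is paid once per message and shared by the whole block.
    """
    if block_len < 1:
        raise ValueError("block_len must be at least 1")
    return startup_cycles / block_len + per_word_cycles


def min_block_len(startup_cycles, per_word_cycles, target_cycles_per_word):
    """Smallest block length whose amortized per-word cost meets the target."""
    if target_cycles_per_word <= per_word_cycles:
        raise ValueError("target cannot be met at any block length")
    return math.ceil(startup_cycles / (target_cycles_per_word - per_word_cycles))


def added_latency_cycles(block_len, cycles_between_words):
    """Extra delay seen by the first word of a block.

    It must wait until the remaining block_len - 1 words have been produced.
    """
    return (block_len - 1) * cycles_between_words


# Hypothetical figures: 200-cycle message start-up, 1 cycle per word sent.
block = min_block_len(200, 1, 3)          # block length for <= 3 cycles/word
cost = per_word_cost(200, 1, block)       # amortized cost at that length
delay = added_latency_cycles(block, 10)   # words arrive every 10 cycles
```

Under these assumed figures, longer blocks reduce the per-word cost but increase both the delay before a consumer can start and the buffer space needed to hold a block, which is the penalty described above.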
Conclusion
We have discussed how signal/image processing programs written in the Sisal functional programming language can be transformed and mapped onto multiprocessor computers. Our approach to program analysis and mapping involves the following steps:
1. Transform the program into an acyclic graph of stream-processing program modules.
2. Determine relative computation rates and load parameters for each program module.
3. Choose plausible mapping plans.
4. Determine performance characteristics for each mapping plan and select the best for the user's objective.
5. Construct the machine code.
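Step 4 of this procedure can be sketched as a simple selection over candidate plans. This is an illustrative sketch only: the `Plan` record, its fields, the cycle budget, and the two objectives are assumptions introduced here, not part of the Paradigm compiler, and any figures supplied to it would have to come from the analysis in steps 2 and 3.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    """Hypothetical summary of one mapping plan (all fields are assumptions)."""
    name: str
    compute_cycles: int   # compute cycles per operation cycle on the busiest PE
    comm_cycles: int      # communication overhead cycles on the busiest PE
    latency_cycles: int   # estimated end-to-end latency of results


def busiest_load(plan):
    """Total cycles the most heavily loaded processing element must spend."""
    return plan.compute_cycles + plan.comm_cycles


def select_plan(plans, objective, cycle_budget):
    """Pick the best feasible plan for the user's objective.

    A plan is feasible if its busiest processing element fits in the cycle
    budget. The objective is "throughput" (minimize load) or "latency".
    """
    if objective not in ("throughput", "latency"):
        raise ValueError("objective must be 'throughput' or 'latency'")
    feasible = [p for p in plans if busiest_load(p) <= cycle_budget]
    if not feasible:
        raise ValueError("no plan fits within the cycle budget")
    if objective == "throughput":
        return min(feasible, key=busiest_load)
    return min(feasible, key=lambda p: p.latency_cycles)
```

Separating feasibility (does the busiest element keep up with the input rate?) from the objective reflects the tradeoff between throughput and latency noted in the discussion.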
We have discussed application of the method to an optical surveillance problem, and discussed program mapping plans suitable for two target multiprocessor architectures: a multiprocessor built of conventional processing elements and a hypothetical multiprocessor built of multithreaded processing elements. We suggest that, by offering lower scheduling and synchronization costs, the multithreaded architecture has the ability to support efficient fine-grain computation, leading to lower end-to-end latency and decreased memory requirement for intermediate data for the studied application. Another architectural variant that offers an intermediate choice for multiprocessing between conventional processors and the MTA machine discussed here is the threaded abstract machine [Culler 1991].
In the computation studied in this paper, there is plenty of parallelism to be exploited. Hence there would be no benefit to increasing processing element cost by adding features designed only to increase single-thread performance.
Writing a program as a collection of stream-processing functions permits easy characterization of the modules and exploration of a variety of choices for mapping the modules onto a parallel processing computer. Other work relating to static mapping of programs for multiprocessor execution includes many published results on resource management for real-time computation. A summary of work in that area appears in [Chaudhary/Aggarwal 1993]. Others have also noted the tradeoff between throughput and latency. The work presented here is distinctive in relating the mapping problem to program structure characteristic within a functional programming framework, and in dealing with multi-rate signal processing problems. The work closest in spirit to ours is the Ptolemy Project of Prof. Lee at Berkeley.
Our work indicates that the combination of functional programming with multithreaded processing elements can lead to significantly easier programming of applications in the domains of signal/image processing. Similar results are anticipated for other application areas that can benefit from use of stream data types, such as real-time embedded systems and certain industrial process control problems. For applications that require dynamic management of resources during program execution, further development of methods of scheduling and load balancing is needed, together with architectural features that permit their efficient implementation. We look forward to further developments in this exciting area.
[...] to practical signal processing algorithms. The work is an extension of work done for the Paradigm compiler. The initial version of the Paradigm compiler was developed by the author during his appointment as Visiting Scientist at RIACS from May 1988 through April 1989.
The optical surveillance example is based on simplified algorithms taken from a large-scale defense application studied by the Boeing Company. The complete original algorithms were expressed in a variant of the Val language [Ackerman/Dennis 1979]. As presented in the text, the overall computation of this illustration is structured as the composition of functions given in Figure 6. The overall computation processes signals from a collection of sensors that are swept over the region under surveillance. These signals are conditioned, averaged, and filtered before a peak detection criterion is applied.
A.1 Baseline Removal
The first step is a procedure designed to ignore a slowly-varying base component of the signal from each sensor. This is defined by the program BaseRemove shown in Figure 7.
A.2 Spike Adaptive Averaging
The second module (Figure 8) combines signals from groups of n sensors, rejecting data that exceeds a threshold.
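The general shape of such a module can be sketched as follows. This is an assumption-laden illustration, not the algorithm of Figure 8: the function name, the grouping of adjacent channels, the use of absolute value for the rejection test, the fixed (non-adaptive) threshold, and the choice to emit 0.0 when every sample in a group is rejected are all hypothetical.

```python
def spike_reject_average(channels, n, threshold):
    """Average each group of n adjacent channels, sample by sample.

    channels: list of equal-length sample lists, one per sensor.
    Samples whose magnitude exceeds `threshold` are treated as spikes and
    left out of the average (assumed rejection rule).
    Returns one averaged stream per group.
    """
    if n <= 0 or len(channels) % n != 0:
        raise ValueError("number of channels must be a positive multiple of n")
    averaged = []
    for start in range(0, len(channels), n):
        group = channels[start:start + n]
        stream = []
        # zip(*group) yields, for each time step, the n samples of the group.
        for samples in zip(*group):
            kept = [x for x in samples if abs(x) <= threshold]
            # Assumed convention: output 0.0 if every sample was rejected.
            stream.append(sum(kept) / len(kept) if kept else 0.0)
        averaged.append(stream)
    return averaged
```

Note that the module reduces the number of data channels by a factor of n, which lowers the load on every later stage.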
A.3 Nyquist Filter
This module (Figure 9) reduces the sampling rate of the data stream by combining groups of four samples using weights designed to provide a good approximation to the input.
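A rate reduction of this kind can be sketched with a generator that stands in for a stream. The sketch assumes non-overlapping groups of four samples and uses equal weights purely as a placeholder; the actual weights and grouping are those of Figure 9 and are not reproduced here.

```python
import itertools


def nyquist_decimate(samples, weights=(0.25, 0.25, 0.25, 0.25)):
    """Yield one weighted sum for each consecutive group of four samples.

    samples: any iterable of numbers (models an input stream).
    weights: four placeholder coefficients (assumed, not from the paper).
    A trailing group of fewer than four samples produces no output.
    """
    if len(weights) != 4:
        raise ValueError("exactly four weights are required")
    stream = iter(samples)
    while True:
        group = list(itertools.islice(stream, 4))
        if len(group) < 4:
            return
        yield sum(w * x for w, x in zip(weights, group))
```

Because one output is produced for every four inputs, modules downstream of this stage run at one quarter of the input rate, which is why the relative computation rates of the modules must be determined before mapping.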
A.4 Two-Dimension Filter
The function TwoDimFilter shown in Figure 10 represents a two-dimension filter by a single Sisal function. The filter is defined by the three coefficients, a, b, and c, which are the center, side and corner elements of a three-by-three array. The filter is applied at each position in the image data for which an output value is desired. The input is an array of streams indexed from 1 to w. The output is an array of streams indexed from 2 to w - 1. (Boundary positions are omitted from the result data to avoid applying the filter function to non-existing array positions.)
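The filter computation can be sketched in Python, using a list of equal-length lists in place of the array of streams. This is an illustration under stated assumptions rather than a transcription of Figure 10: the function name, the finite-list representation, and the reading of "side" as the four edge-adjacent neighbors and "corner" as the four diagonal neighbors are assumptions.

```python
def two_dim_filter(channels, a, b, c):
    """Apply a symmetric three-by-three filter at every interior position.

    channels[i][t] is sample t of channel i (list index 0 models index 1).
    a, b, c are the center, side, and corner coefficients.
    The first and last channel, and the first and last sample of each
    channel, are omitted from the result because the filter window would
    reach outside the data there.
    """
    w = len(channels)
    n = len(channels[0]) if w else 0
    result = []
    for i in range(1, w - 1):
        row = []
        for t in range(1, n - 1):
            center = channels[i][t]
            sides = (channels[i - 1][t] + channels[i + 1][t]
                     + channels[i][t - 1] + channels[i][t + 1])
            corners = (channels[i - 1][t - 1] + channels[i - 1][t + 1]
                       + channels[i + 1][t - 1] + channels[i + 1][t + 1])
            row.append(a * center + b * sides + c * corners)
        result.append(row)
    return result
```

Each output channel depends only on its own input channel and the two adjacent ones, so when the channels are divided among processors only the boundary channels of each section need to be exchanged.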
A.5 A Peak Detector Algorithm
Figure 11 shows a PeakDetect function that identifies all elements of the (image) data that have a value that is at least equal to the values of all immediate neighbors and exceeds their average by a given threshold Th. The two conditions are tested separately and combined to determine the result. The input is an array of integer streams indexed from 2 to w - 1. The output is an array of boolean streams indexed from 3 to w - 2. The peak detection function is similar in structure to the filter function; each element of the result is true if and only if the data surrounding the corresponding input pixel satisfies the specified conditions.
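The two conditions can be sketched in Python in the same style, with a list of equal-length integer lists in place of the array of streams. The sketch is not a transcription of Figure 11: the function name is hypothetical, the "immediate neighbors" are assumed to be the eight surrounding positions, and "exceeds their average by Th" is assumed to mean that the value minus the neighbor average is at least Th.

```python
def peak_detect(channels, th):
    """Mark interior positions that are local peaks.

    channels[i][t] is integer sample t of channel i.
    A position is a peak when (1) its value is at least the value of each
    of its eight neighbors and (2) its value exceeds the neighbor average
    by at least `th` (assumed reading of the threshold test).
    Boundary channels and boundary samples produce no output.
    """
    w = len(channels)
    n = len(channels[0]) if w else 0
    result = []
    for i in range(1, w - 1):
        row = []
        for t in range(1, n - 1):
            v = channels[i][t]
            neighbors = [channels[i + di][t + dt]
                         for di in (-1, 0, 1)
                         for dt in (-1, 0, 1)
                         if (di, dt) != (0, 0)]
            is_local_max = all(v >= x for x in neighbors)
            # Integer-only form of: v - sum(neighbors) / 8 >= th
            exceeds_average = 8 * v - sum(neighbors) >= 8 * th
            row.append(is_local_max and exceeds_average)
        result.append(row)
    return result
```

As with the filter, each result depends only on a three-by-three neighborhood, so the two functions have the same boundary communication pattern when the data array is split across processors.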
